Efficient Modulo $2^k + 1$ Squarers

DCIS 2006
H. T. Vergos
Computer Engineering and Informatics Dept.
University of Patras, 26500, Greece
E-mail : [email protected]
C. Efstathiou
Informatics Dept., Technological
Educational Institute of Athens, 12210, Greece
E-mail : [email protected]
Abstract— A new modulo $2^k + 1$ squarer architecture is proposed for operands in the normal representation. The novel architecture is derived by showing that all required correction factors can be merged into a single constant one and by treating this constant partly as a partial product and partly in the final parallel adder. The proposed architecture utilizes a total of $\lceil k/2 \rceil + 1$ partial products, each $k$ bits wide, and is built using an inverted end-around-carry, carry-save adder tree and a final parallel adder.
Index Terms— Computer arithmetic, residue / modulo arithmetic, residue number system, modulo squarers.
The research presented in this work was conducted within the framework of the Educational and Initial Vocational Training program “Archimedes” which is co-funded by the E.U. (75%) and by the Greek Government (25%).
I. INTRODUCTION
The squaring operation is met in several applications of high-performance digital signal processors (DSPs). Such applications include signal filtering, image processing, Euclidean branch metrics calculation and modulation for communication components. The squaring operation can also be used effectively in square-and-multiply cryptoalgorithm implementations, for avoiding time-consuming exponentiations.
For achieving today’s performance goals, many DSPs that target the above applications rely on the use of a residue number system (RNS) [1]–[5] instead of a positional one. The use of an RNS can significantly speed up arithmetic operations such as addition, multiplication and squaring, provided that arithmetic components that perform the modulo operations are available and operate efficiently.
In an RNS that is composed of the set $S$ of the $n$ pairwise prime integers $m_1, m_2, \ldots, m_n$, an integer is represented by its residues over the numbers of $S$. Then, for the result $Z$ of the operation $X \diamond Y$ it holds that when $X = \{X_1, X_2, \ldots, X_n\}$ and $Y = \{Y_1, Y_2, \ldots, Y_n\}$, then $Z = \{Z_1, Z_2, \ldots, Z_n\}$, where $Z_i = |X_i \diamond Y_i|_{m_i}$. That is, the result is derived from the parallel computation of modulo $m_i$, $1 \le i \le n$, operations, each performed on the residues of $X$ and $Y$ over $m_i$. Since carry propagation is not required among the parallel computation units and each such unit operates on narrow residues instead of wide operands, an operation in RNS can be computed in less time.
Among the several proposals that have been examined, the three-moduli set $\{m_1, m_2, m_3\} = \{2^k, 2^k - 1, 2^k + 1\}$ has attracted significantly more attention, mainly because it enables the derivation of very efficient modulo $m_i$ arithmetic components, for example [6]–[11]. Concentrating on the modulo $2^k + 1$ squarer case, a designer currently has the following alternatives:
• To use a full or partial look-up table, implemented in ROM. This solution is good as long as $k$ is very small, because the required ROM size grows exponentially with $k$. Considering that cryptoalgorithms require operations of the form $|X^H|_M$, which may be computed by squarings and multiplications and where $M$ may be longer than 500 bits, look-up tables cannot be considered a viable solution.
• To use the existing modulo multiplier to also perform the squaring, when needed. Since the multiplier is also used for the multiplication operation, this solution is area efficient. It is, however, very slow when compared against solutions that have a dedicated squarer circuit. It is therefore a viable alternative only when squaring is rarely required, which is not the case in several DSP algorithms and applications.
• To follow the general design methodology for modulo A squarer design [10]. The squarers that result from this methodology are unnecessarily complex. Apart from partial product generation and reduction circuitry, the squarers proposed in [10] require a correction step and a result translation procedure. Both are implemented by consecutive parallel adders and multiplexers that are both area- and time-consuming.
• To follow the solution that fully exploits the properties of modulo $2^k + 1$ arithmetic [11]. The resulting squarers consist only of a partial product generation, a partial product reduction and a final addition stage. Although this solution is the most efficient known, it is only applicable to input operands that follow the diminished-1 representation [12]. This representation implies that a zero operand is treated distinctly and all other operands are represented decreased by one. In this way only $k$ bits are required for representing the integers $0, 1, \ldots, 2^k$. Since this representation requires time- and hardware-consuming input / output translators and discrete handling of zero operands, it is attractive only when a long series of computations takes place on the diminished-1 representations before conversion back to weighted representation is required.
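The diminished-1 representation mentioned in the last alternative can be sketched in a few lines of Python. This is only an illustration: the function names and the use of an explicit zero flag are assumptions made here (the text only states that zero is treated distinctly), and the multiplication inside `square_via_diminished1` merely stands in for a diminished-1 squarer core.

```python
def to_diminished1(x, k):
    """Encode x in [0, 2**k] as (zero_flag, d), where d = x - 1 fits in k bits."""
    if not 0 <= x <= 2**k:
        raise ValueError("operand out of range for modulo 2**k + 1")
    if x == 0:
        return (1, 0)          # zero is flagged and handled separately
    return (0, x - 1)          # every other operand is stored decreased by one


def from_diminished1(zero_flag, d):
    """Decode (zero_flag, d) back to the weighted (normal) value."""
    return 0 if zero_flag else d + 1


def square_via_diminished1(x, k):
    """Square modulo 2**k + 1 with explicit input and output translation."""
    zero_flag, d = to_diminished1(x, k)                   # input translator
    value = from_diminished1(zero_flag, d)
    result = (value * value) % (2**k + 1)                 # stand-in for the squarer core
    return from_diminished1(*to_diminished1(result, k))   # output translator
```

The two translation steps around the core are exactly the overhead that a squarer accepting weighted operands avoids.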
In this manuscript, we propose a new architecture for modulo $2^k + 1$ squaring. The squarers designed by following the proposed architecture offer the same number of stages as the architecture of [11] and are therefore both area and time efficient. However, they accept their operand in weighted form and therefore require neither time- and hardware-consuming input / output translators nor any special circuit for handling zero operands. For the derivation of the proposed architecture we merge all required correction factors that are treated separately in [10] into a single one and we prove that this is a constant, that is, independent of the size of the squarer. Finally, treatment of this constant is partly assigned to the final stage adder while the rest is handled as an extra partial product. This allows us to use a very fast, inverted end-around carry (EAC) $k$-bit adder for the final addition without any further requirement for translation of the result.
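To make the residue arithmetic that motivates this work concrete, the following minimal Python sketch squares a number channel by channel in the three-moduli set $\{2^k, 2^k - 1, 2^k + 1\}$ and converts the result back with the Chinese remainder theorem. The helper names are illustrative, and `pow(x, -1, m)` assumes Python 3.8 or later.

```python
from math import prod


def moduli(k):
    """The three-moduli set {2^k, 2^k - 1, 2^k + 1}."""
    return (2**k, 2**k - 1, 2**k + 1)


def to_rns(x, ms):
    """Residues of x over every modulus."""
    return tuple(x % m for m in ms)


def rns_square(xs, ms):
    """Square each residue independently; no carries cross between channels."""
    return tuple((r * r) % m for r, m in zip(xs, ms))


def from_rns(rs, ms):
    """Chinese remainder reconstruction (valid for pairwise prime moduli)."""
    M = prod(ms)
    total = 0
    for r, m in zip(rs, ms):
        Mi = M // m
        total += r * Mi * pow(Mi, -1, m)   # modular inverse of Mi modulo m
    return total % M
```

With `k = 4` the dynamic range is 16 · 15 · 17 = 4080, so any square below that value is recovered exactly; the third channel is the modulo $2^k + 1$ squaring addressed in this work.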
II. PROPOSED ARCHITECTURE
Let $A = a_k a_{k-1} \ldots a_1 a_0$ denote a $(k+1)$-bit number, with $A \in [0, 2^k]$. Then $R$, the square of $A$ modulo $2^k + 1$, can be computed by:
\[
R = \left| A^2 \right|_{2^k+1}
  = \left| \sum_{i=0}^{k} \sum_{j=0}^{k} a_i a_j 2^{i+j} \right|_{2^k+1}
  = \left| a_k 2^{2k} + 2 \sum_{i=0}^{k-1} a_k a_i 2^{k+i} + \sum_{i=0}^{k-1} \sum_{j=0}^{k-1} a_i a_j 2^{i+j} \right|_{2^k+1} .
\]
Since for $0 \le i \le k-1$, $a_k a_i = 0$, we get that:
\[
R = \left| a_k 2^{2k} + \sum_{i=0}^{k-1} \sum_{j=0}^{k-1} a_i a_j 2^{i+j} \right|_{2^k+1}
  = \left| a_k \left| 2^{2k} \right|_{2^k+1} + \sum_{i=0}^{k-1} \sum_{j=0}^{k-1} a_i a_j \left| 2^{i+j} \right|_{2^k+1} \right|_{2^k+1} .
\]
Taking into account that $|2^{2k}|_{2^k+1} = 1$ and that for $i + j \le 2k - 2$, $|2^{i+j}|_{2^k+1} = |(-1)^s 2^{|i+j|_k}|_{2^k+1}$, where:
\[
s = \begin{cases} 0, & \text{if } i + j < k \\ 1, & \text{if } i + j \ge k \end{cases}
\]
we derive that:
\[
R = \left| a_k + \sum_{i=0}^{k-1} \sum_{j=0}^{k-1} (-1)^s a_i a_j 2^{|i+j|_k} \right|_{2^k+1} . \tag{1}
\]
We can unify the positive and negative terms of (1) by using $|-z|_{2^k+1} = |2^k + 1 - z|_{2^k+1} = |2^k + \overline{z}|_{2^k+1}$, where $\overline{z}$ denotes the complement of the bit $z$. Then (1) becomes:
\[
R = \left| a_k + \sum_{i=0}^{k-1} \sum_{j=0}^{k-1} x_{i,j} 2^{|i+j|_k} \right|_{2^k+1} , \tag{2}
\]
with
\[
x_{i,j} = \begin{cases} a_i a_j, & \text{if } i + j < k \\ \left| 2^k + \overline{a_i a_j} \right|_{2^k+1}, & \text{if } i + j \ge k. \end{cases}
\]
Relation (2) indicates that the product bits $a_i a_j$ with $i + j \ge k$ can be inverted and placed at bit position $|i+j|_k$ if a correction factor equal to $2^k 2^{|i+j|_k}$ is also taken into account. Therefore, we can rewrite relation (2) as:
\[
R = \left| a_k + \sum_{i=0}^{k-1} \left( PP_i + CF_i \right) \right|_{2^k+1} , \tag{3}
\]
where $PP_i$ denotes the partial product:
\[
PP_i = \sum_{j=0}^{k-1-i} a_i a_j 2^{|i+j|_k} + \sum_{j=k-i}^{k-1} \overline{a_i a_j}\, 2^{|i+j|_k}
\]
and $CF_i$ the corresponding correction introduced by the $i$ complemented terms of $PP_i$. The partial product $PP_0$ does not contain any complemented terms and therefore $CF_0 = 0$. The required correction $CF_i$ for the complemented terms of $PP_i$ is:
\[
CF_i = \left| \sum_{j=k-i}^{k-1} 2^k 2^{|i+j|_k} \right|_{2^k+1} = \left| 2^k (2^i - 1) \right|_{2^k+1} .
\]
Summing all correction factors for complemented terms gives us their total correction factor, say $CF_{CMPL}$:
\[
CF_{CMPL} = \left| \sum_{i=1}^{k-1} 2^k (2^i - 1) \right|_{2^k+1} = \left| 2^k (2^k - 1 - k) \right|_{2^k+1} .
\]
Therefore, $R$ is given by:
\[
R = \left| CF_{CMPL} + \sum_{i=0}^{k-1} PP_i \right|_{2^k+1} ,
\]
where the general form of the $k$ partial products is presented in Table I. Since $a_k a_0 = 0$, the addition of $a_k$ indicated in (3) has been performed by ORing $a_k$ with $a_0$ in $PP_0$ in Table I (the entry $a_0 + a_k$ denotes this logical OR).
TABLE I
PARTIAL PRODUCT MATRIX
\[
\begin{array}{l|cccccc}
 & 2^{k-1} & 2^{k-2} & 2^{k-3} & \cdots & 2^{1} & 2^{0} \\ \hline
PP_0 = & a_{k-1} a_0 & a_{k-2} a_0 & a_{k-3} a_0 & \cdots & a_1 a_0 & a_0 + a_k \\
PP_1 = & a_{k-2} a_1 & a_{k-3} a_1 & a_{k-4} a_1 & \cdots & a_0 a_1 & \overline{a_{k-1} a_1} \\
PP_2 = & a_{k-3} a_2 & a_{k-4} a_2 & a_{k-5} a_2 & \cdots & \overline{a_{k-1} a_2} & \overline{a_{k-2} a_2} \\
\cdots & \cdots & \cdots & \cdots & \cdots & \cdots & \cdots \\
PP_{k-2} = & a_1 a_{k-2} & a_0 a_{k-2} & \overline{a_{k-1} a_{k-2}} & \cdots & \overline{a_3 a_{k-2}} & \overline{a_2 a_{k-2}} \\
PP_{k-1} = & a_0 a_{k-1} & \overline{a_{k-1}} & \overline{a_{k-2} a_{k-1}} & \cdots & \overline{a_2 a_{k-1}} & \overline{a_1 a_{k-1}}
\end{array}
\]
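The derivation up to Table I can be checked numerically. The sketch below is a behavioural model only, not a gate-level description: it builds the bits of the partial product matrix as defined above (complementing every $a_i a_j$ with $i + j \ge k$ and ORing $a_k$ into the $2^0$ position of $PP_0$), adds $CF_{CMPL}$, and compares the result against $A^2 \bmod (2^k + 1)$. Function names are illustrative.

```python
def partial_product_sum(A, k):
    """Arithmetic sum of all bits of the k x k partial product matrix of Table I."""
    a = [(A >> i) & 1 for i in range(k + 1)]
    total = 0
    for i in range(k):
        for j in range(k):
            bit = a[i] & a[j]
            if i + j >= k:
                bit ^= 1                 # complemented term, placed at position |i+j|_k
            if i == 0 and j == 0:
                bit = a[0] | a[k]        # a_k ORed with a_0 in PP_0
            total += bit << ((i + j) % k)
    return total


def square_mod(A, k):
    """|A^2| modulo 2^k + 1 from the partial products and CF_CMPL."""
    M = 2**k + 1
    cf_cmpl = 2**k * (2**k - 1 - k)      # total correction for complemented terms
    return (cf_cmpl + partial_product_sum(A, k)) % M
```

For example, with `k = 5` and `A = 17` the matrix sums to a value congruent to 18 and $CF_{CMPL} \equiv 7$, giving 25 = $|289|_{33}$.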
In every column of the partial product matrix, most terms appear twice, once as $a_i a_j$ ($\overline{a_i a_j}$) and once as $a_j a_i$ ($\overline{a_j a_i}$). This enables us to reduce the number of rows of the array to approximately one half. Adding such a pair of terms of column $2^{i+j}$ would yield a result of $2 \times a_i a_j$ ($2 \times \overline{a_i a_j}$). Instead of adding them, we can therefore equivalently move one of these terms to the column with weight $2^{i+j+1}$. The pairs of equal terms in the leftmost column can be replaced by the complement of one of the terms in the rightmost column, introducing a correction factor of $2^k$ per pair. Since there are $\lfloor k/2 \rfloor$ pairs of equal terms in the leftmost column, the total correction factor required, $CF_{MOV}$, is given by:
\[
CF_{MOV} = 2^k \left\lfloor \frac{k}{2} \right\rfloor .
\]
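The pair relocation can be modelled in the same behavioural style. In the sketch below every pair of equal terms is replaced by a single term one column to the left, pairs of the leftmost column re-enter complemented in the rightmost column, and $CF_{MOV}$ is added on top of $CF_{CMPL}$. The names are illustrative, and the column counters simply hold how many 1-bits end up in each column.

```python
def relocated_matrix_value(A, k):
    """Arithmetic value of the partial product matrix after pair relocation."""
    a = [(A >> i) & 1 for i in range(k + 1)]
    column_count = [0] * k                       # number of 1-bits in each column
    for i in range(k):
        for j in range(i, k):                    # visit each unordered pair {i, j} once
            bit = a[i] & a[j]
            if i + j >= k:
                bit ^= 1                         # complemented term of Table I
            if i == 0 and j == 0:
                bit = a[0] | a[k]                # a_k is ORed with a_0 in PP_0
            col = (i + j) % k
            if i == j:
                column_count[col] += bit         # single term stays in place
            elif col == k - 1:
                column_count[0] += bit ^ 1       # leftmost pair: complement to rightmost column
            else:
                column_count[col + 1] += bit     # pair replaced by one term, one column left
    return sum(count << col for col, count in enumerate(column_count))


def square_after_relocation(A, k):
    """|A^2| modulo 2^k + 1 using the reduced matrix, CF_CMPL and CF_MOV."""
    M = 2**k + 1
    cf_cmpl = 2**k * (2**k - 1 - k)
    cf_mov = 2**k * (k // 2)
    return (relocated_matrix_value(A, k) + cf_cmpl + cf_mov) % M
```

The leftmost-column rule follows from $x\,2^k \equiv -x \equiv 2^k + \overline{x} \pmod{2^k+1}$ for a single bit $x$.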
The resulting partial products and the total correction factor must be added modulo $2^k + 1$, until two final summands are produced. This can be performed by using either a carry-save adder (CSA) array or, even faster, a CSA tree (Dadda tree [13]). Taking into account that the carry outputs of the $k$-th column have a weight of $2^k$ and since in modulo $2^k + 1$ arithmetic:
\[
\left| c_i 2^k \right|_{2^k+1} = \left| -c_i \right|_{2^k+1} = \left| 2^k + \overline{c_i} \right|_{2^k+1} ,
\]
the carries out of the most significant bit position can be complemented and added to the least significant bit position, thereby forming an inverted EAC CSA tree. A correction factor of $2^k$ must be taken into account for each such carry recirculation.
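One level of such an inverted EAC carry-save adder can be sketched as follows. This is a word-level model with an illustrative function name: it compresses three $k$-bit operands into a sum vector and a carry vector, feeding the complemented carry of the most significant position back into the least significant position.

```python
def inverted_eac_csa(x, y, z, k):
    """One carry-save level on k-bit operands with inverted end-around carry."""
    mask = (1 << k) - 1
    s = x ^ y ^ z                                  # bitwise sum outputs
    carries = (x & y) | (x & z) | (y & z)          # bitwise carry outputs
    c_out = (carries >> (k - 1)) & 1               # carry of weight 2^k
    c = ((carries << 1) & mask) | (c_out ^ 1)      # complemented and re-entered at bit 0
    return s, c
```

The outputs satisfy $x + y + z \equiv s + c + 2^k \pmod{2^k+1}$, which is the per-recirculation correction factor stated above.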
We can now combine all the above-mentioned correction factors into a single one and treat this as an extra partial product term. Since however the form of the partial product matrix depends on $k$, we distinguish the following two cases:
• $k$ is odd. After pair identification and repositioning, the partial product matrix has $\frac{k+1}{2}$ rows. The reduction of the $\frac{k+1}{2} + 1$ products into two summands will yield $\frac{k+1}{2} - 1$ carries of weight $2^k$, which should be complemented and added to the least significant position. This imposes a correction factor of $CF_{RECIRC,odd} = 2^k \left( \frac{k+1}{2} - 1 \right)$. Therefore, the total correction required in this case is:
\[
CF = CF_{CMPL} + CF_{MOV} + CF_{RECIRC,odd}
   = \left| 2^k \left[ (2^k - 1 - k) + \left\lfloor \frac{k}{2} \right\rfloor + \frac{k+1}{2} - 1 \right] \right|_{2^k+1} = 3 .
\]
• $k$ is even. After pair identification and repositioning, the resulting partial product matrix is not completely regular. The columns with weight $2^{i+j}$, with $i + j$ even, have $\frac{k}{2} + 2$ terms, while the remaining columns have only $\frac{k}{2} - 1$ terms. A full adder is used in each column with $\frac{k}{2} + 2$ terms. Since this combines three terms from this column and provides its sum term to the same column and its carry term to the column with double weight, the matrix is transformed into a completely regular one with $\frac{k}{2}$ rows. During the addition of the $\frac{k}{2} + 1$ products (the $+1$ stands for the total correction factor), $\frac{k}{2} - 1$ carries of weight $2^k$ will be generated, which should be complemented and added to the least significant position. This imposes a correction factor of $CF_{RECIRC,even} = 2^k \left( \frac{k}{2} - 1 \right)$. Therefore, the total correction required in this case is also:
\[
CF = CF_{CMPL} + CF_{MOV} + CF_{RECIRC,even}
   = \left| 2^k \left[ (2^k - 1 - k) + \frac{k}{2} + \frac{k}{2} - 1 \right] \right|_{2^k+1} = 3 .
\]
According to the above, we conclude that the total correction factor required is constant and independent of the squarer size. The square is given by:
\[
R = \left| 3 + \sum_{i=0}^{\lceil k/2 \rceil} PP_i^* \right|_{2^k+1} , \tag{4}
\]
where the $PP_i^*$s are derived from the $PP_i$s of Table I, using the above described operations. In the following we examine the addition of the constant total correction factor.
A straightforward implementation would be to use $CF$ as an extra partial product, along with a fast modulo $2^k + 1$ adder (for example [7]) which accepts the two summands produced by the reduction scheme (array / tree) and produces the result. An alternative, more efficient solution is however proposed in the following, based on the use of an inverted EAC parallel adder (equivalently, a diminished-1 modulo $2^k + 1$ adder). If the architecture proposed in [14] is followed, this adder can provide its result faster than the fastest modulo $2^k + 1$ adder available [7], with smaller area requirements. Since however such an adder can provide only $k$ bits of the result, the last bit must be derived in a separate manner.
Relation (4) can equivalently be written as:
\[
R = \left| \left| 2 + \sum_{i=0}^{\lceil k/2 \rceil} PP_i^* \right|_{2^k+1} + 1 \right|_{2^k+1} = \left| C + S + 1 \right|_{2^k+1} ,
\]
where $C = c_{k-2} c_{k-3} \ldots c_0 \overline{c_{k-1}}$ and $S = s_{k-1} s_{k-2} \ldots s_1 s_0$ are respectively the $k$-bit carry and sum vectors produced by the multi-operand addition $\left| 2 + \sum_{i=0}^{\lceil k/2 \rceil} PP_i^* \right|_{2^k+1}$. The most significant bit of $R$ is 1 only when $R = 2^k$ or equivalently when $|S + C + 1|_{2^k+1} = 2^k$. Taking into account that $S$ and $C$ are $k$-bit vectors we then get that $R = 2^k \Leftrightarrow S + C + 1 = 2^k$, or equivalently that $S + C = 2^k - 1$. That is, the most significant bit of the result is 1 only when $S$ and $C$ are complementary vectors. As explained at the end of this section, this observation enables us to compute the most significant bit distinctly from the rest. In the following we focus on the $k$ least significant bits of $R$. Let $R_k$ denote the $k$-bit vector of the least significant bits of $R$. We then have that:
\[
R_k = \left| \left| A^2 \right|_{2^k+1} \right|_{2^k} = \left| \left| S + C + 1 \right|_{2^k+1} \right|_{2^k}
\]
\[
 = \begin{cases} \left| S + C + 1 - (2^k + 1) \right|_{2^k}, & \text{if } S + C + 1 \ge 2^k + 1 \\ \left| S + C + 1 \right|_{2^k}, & \text{otherwise} \end{cases}
\]
\[
 = \begin{cases} \left| S + C - 2^k \right|_{2^k}, & \text{if } S + C \ge 2^k \\ \left| S + C + 1 \right|_{2^k}, & \text{otherwise} \end{cases}
\]
\[
 = \begin{cases} \left| S + C \right|_{2^k}, & \text{if } S + C \ge 2^k \\ \left| S + C + 1 \right|_{2^k}, & \text{otherwise.} \end{cases}
\]
The latter relation reveals that the $k$ least significant bits of the result can be handled by a $k$-bit adder that increases the binary sum of its inputs by one when the carry output is 0 and leaves it unchanged in the case of a carry output. This
is exactly the function performed by an inverted EAC parallel adder. We therefore conclude that, if a total correction factor of 2 is used as an extra partial product, an inverted EAC parallel adder used as the final adder will accept $S$ and $C$ at its inputs and will provide $R_k$.
Very fast inverted EAC adders based on parallel prefix carry computation units have appeared in [14], [15]. For integer adders, a parallel prefix carry computation unit is derived as follows. Let $A = a_{k-1} a_{k-2} \ldots a_1 a_0$ and $B = b_{k-1} b_{k-2} \ldots b_1 b_0$ denote the two $k$-bit addition operands and let the terms $g_i = a_i b_i$ and $p_i = a_i \oplus b_i$ denote the carry generate and the carry propagate terms at bit position $i$ respectively. By defining $\circ$ as an operator that associates generate and propagate pairs and produces a new group pair according to the equation:
\[
(g_x, p_x) \circ (g_y, p_y) = (g_x + p_x g_y,\; p_x p_y) ,
\]
the computation of a carry $c_i$ of the integer addition of $A$ and $B$ is equivalent to the problem of computing the group generate $G_i$ given by the prefix equation:
\[
(G_i, P_i) = (g_i, p_i) \circ (g_{i-1}, p_{i-1}) \circ \cdots \circ (g_1, p_1) \circ (g_0, p_0) .
\]
Once the carries have been computed, the sum bits $s_i$ are computed by $s_i = p_i \oplus c_{i-1}$.
For attaining an inverted EAC adder, simple solutions, such as the connection of the carry output of an integer adder back to the carry input via an inverter, are not well-suited, since they suffer from oscillations. In [15], it was proposed that the output carry is driven back via an inverter to a late carry increment stage composed of nodes implementing a prefix operator. Therefore, no oscillations occur and, if the carry computation unit is designed according to the fast algorithms presented in [16], [17], the derived inverted EAC adders feature an operating speed close to that of the corresponding integer adders.
The need for an extra prefix stage that handles the re-entering carry has been canceled in the parallel-prefix inverted EAC adders proposed in [14]. This was achieved by performing carry re-circulation at each existing prefix level. As a result, parallel-prefix adder architectures with $\log_2 k$ prefix levels have been derived, that is, inverted EAC adders that can achieve the same operating speed as the corresponding integer adders.
Since $S$ and $C$ are the inputs of the final adder and the group propagate over all $k$ bits is $P_{k-1} = p_{k-1} p_{k-2} \cdots p_0$, we conclude that the most significant bit of the result, which should be 1 only in the case that $S$ and $C$ are complementary vectors, is equal to the group propagate signal out of the $k$ bits of the final inverted EAC adder.
A. An example of the proposed architecture
We apply the architecture derived in the previous section to the design of a modulo $2^5 + 1$ squarer. According to Table I, the initial partial products indicated in Table II are derived. We then identify pairs of equal terms in each column and move one of the terms to its left column. Terms from the leftmost column are complemented and driven to the rightmost one. We also add 00010 ($2_{10}$), for our total correction factor. After these operations, our partial products matrix takes the form of Table III. The resulting implementation is presented in Figure 1.
TABLE II
INITIAL PARTIAL PRODUCT MATRIX
\[
\begin{array}{l|ccccc}
 & 2^4 & 2^3 & 2^2 & 2^1 & 2^0 \\ \hline
PP_0 = & a_4 a_0 & a_3 a_0 & a_2 a_0 & a_1 a_0 & a_0 + a_5 \\
PP_1 = & a_3 a_1 & a_2 a_1 & a_1 & a_0 a_1 & \overline{a_4 a_1} \\
PP_2 = & a_2 & a_1 a_2 & a_0 a_2 & \overline{a_4 a_2} & \overline{a_3 a_2} \\
PP_3 = & a_1 a_3 & a_0 a_3 & \overline{a_4 a_3} & \overline{a_3} & \overline{a_2 a_3} \\
PP_4 = & a_0 a_4 & \overline{a_4} & \overline{a_3 a_4} & \overline{a_2 a_4} & \overline{a_1 a_4}
\end{array}
\]
TABLE III
FINAL PARTIAL PRODUCT MATRIX
\[
\begin{array}{l|ccccc}
 & 2^4 & 2^3 & 2^2 & 2^1 & 2^0 \\ \hline
PP_0^* = & a_3 a_0 & \overline{a_4 a_3} & \overline{a_4 a_2} & \overline{a_4 a_1} & \overline{a_4 a_0} \\
PP_1^* = & a_2 a_1 & \overline{a_4} & a_1 a_0 & \overline{a_3 a_2} & \overline{a_3 a_1} \\
PP_2^* = & a_2 & a_2 a_0 & a_1 & \overline{a_3} & a_0 + a_5 \\
PP_3^* = & 0 & 0 & 0 & 1 & 0
\end{array}
\]
Fig. 1. Proposed modulo $2^5 + 1$ squarer. (Block diagram: an inverted EAC CSA tree of HA, FA and FA+ cells reduces the partial product bits to the vectors $s_4 \ldots s_0$ and $c_4 \ldots c_0$, which feed a modulo $2^5 + 1$ diminished-1 adder with parallel prefix carry computation unit; its group propagate signal gives $r_5$ and its sum outputs give $r_4 \ldots r_0$.)
In Figure 1, the AND / NAND / inverter gates required for forming the partial product bits are not shown. The blocks used are half-adders (HA), full-adders (FA), simplified FA blocks (FA+), that is, FAs with one of their inputs set at 1, and the final adder. The output carries at the most significant bit of each stage are complemented and driven to the least significant bit of the subsequent stage. The two final derived summands ($S$ and $C$ vectors) are added in the final inverted EAC adder. Note that further simplifications are possible in several blocks, since their operands are functions of the same input bits. For example, the top leftmost HA that accepts $a_2$ and $a_2 a_1$ as input operands can be further simplified.
We must note here that the partial product bits of equal weight should not be driven randomly to the FAs of the corresponding column. For achieving the least delay, the partial product bits derived earlier should be driven to the FAs at the top of the CSA tree, whereas late arriving signals to
FAs of subsequent tree levels. Moreover, the addition of 0 cannot be avoided since doing so alters the number of inverted EACs in the CSA tree and invalidates all the previous analysis. However, the FAs accepting the bits from the constant partial product can be simplified to HAs or FA+.
TABLE IV
FA STAGES IN AN f-OPERAND DADDA TREE
\[
\begin{array}{c|c}
f & D(f) \text{ in FA stages} \\ \hline
4 & 2 \\
5 \le f \le 6 & 3 \\
7 \le f \le 9 & 4 \\
10 \le f \le 13 & 5 \\
14 \le f \le 19 & 6 \\
20 \le f \le 28 & 7 \\
29 \le f \le 42 & 8 \\
43 \le f \le 63 & 9 \\
64 \le f \le 94 & 10
\end{array}
\]
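As a cross-check of the modulo $2^5 + 1$ example of Section II-A, the following Python sketch models the whole datapath at the bit-vector level. It is a behavioural approximation under stated assumptions, not the circuit of Figure 1: the three relocated rows are written out by applying the pair-relocation rule to Table II (so the $2^0$ column receives the complements of $a_4 a_0$ and $a_3 a_1$ from the leftmost column), two inverted EAC carry-save levels are applied in a fixed order, and the final adder uses a serial evaluation of the $\circ$ operator with a late increment by the inverted carry output rather than a true parallel-prefix network. All function names are illustrative.

```python
K = 5
MASK = (1 << K) - 1


def final_rows(A):
    """Rows of the final partial product matrix for k = 5, as 5-bit integers."""
    a = [(A >> i) & 1 for i in range(K + 1)]
    inv = lambda bit: bit ^ 1
    rows = [  # each row lists the bits of weight 2^0, 2^1, ..., 2^4
        [inv(a[4] & a[0]), inv(a[4] & a[1]), inv(a[4] & a[2]), inv(a[4] & a[3]), a[3] & a[0]],
        [inv(a[3] & a[1]), inv(a[3] & a[2]), a[1] & a[0], inv(a[4]), a[2] & a[1]],
        [a[0] | a[5], inv(a[3]), a[1], a[2] & a[0], a[2]],
        [0, 1, 0, 0, 0],                       # constant partial product 00010
    ]
    return [sum(bit << w for w, bit in enumerate(row)) for row in rows]


def inverted_eac_csa(x, y, z):
    """Carry-save level whose top carry re-enters complemented at bit 0."""
    s = x ^ y ^ z
    carries = (x & y) | (x & z) | (y & z)
    top = carries >> (K - 1)
    return s, ((carries << 1) & MASK) | (top ^ 1)


def prefix(hi, lo):
    """(g_x, p_x) o (g_y, p_y) = (g_x + p_x g_y, p_x p_y)."""
    return (hi[0] | (hi[1] & lo[0]), hi[1] & lo[1])


def final_adder(s, c):
    """Inverted EAC addition; the extra top bit is the group propagate signal."""
    g = [(s >> i) & (c >> i) & 1 for i in range(K)]
    p = [((s >> i) ^ (c >> i)) & 1 for i in range(K)]
    groups = [(g[0], p[0])]
    for i in range(1, K):                      # serial prefix, for clarity only
        groups.append(prefix((g[i], p[i]), groups[-1]))
    carry_out, group_propagate = groups[-1]
    carry_in = carry_out ^ 1                   # inverted end-around carry
    result, carry = 0, carry_in
    for i in range(K):
        result |= (p[i] ^ carry) << i          # s_i = p_i xor c_(i-1)
        carry = groups[i][0] | (groups[i][1] & carry_in)
    return (group_propagate << K) | result


def squarer(A):
    """Model of the modulo 2^5 + 1 squarer for A in [0, 32]."""
    r0, r1, r2, r3 = final_rows(A)
    s, c = inverted_eac_csa(r0, r1, r2)
    s, c = inverted_eac_csa(s, c, r3)
    return final_adder(s, c)
```

For instance, `squarer(17)` reduces the rows 14, 3, 3 and 2 to the vectors $S = 11$ and $C = 13$, and the final adder returns 25 = $|17^2|_{33}$.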
III. AREA AND DELAY ESTIMATIONS
In this section we examine the area and delay complexities
of the proposed squarers. For our calculations, we adopt
the approximations of the unit-gate model [18], that is, we
consider that all 2-input monotonic gates count as one gate
equivalent for both area and delay, while a 2-input XOR
or XNOR gate counts as 2 equivalents for both area and
delay. Inverters are not taken into account. According to these
approximations the area and delay of a FA are modelled as 7
equivalent gates and 4 time units respectively, whereas a HA
or a FA+ requires an area of 3 equivalent gates and offers a
delay of 2 time units.
In the proposed squarers, the required partial product bits can be derived in parallel by the use of $\frac{k(k-1)}{2}$ AND / NAND gates and 1 OR gate, in 1 time unit.
We consider that these partial products are then reduced to two summands by the use of a Dadda tree. The depth in FA stages of a Dadda tree is a function, say $D(f)$, of its number of operands and is listed in Table IV for all practical values. When $k$ is odd, each column of the Dadda tree has $\frac{k+1}{2} + 1$ operand bits that require $\frac{k+1}{2} - 2$ FAs and one HA or FA+ for their reduction. When $k$ is even, half of the columns require $\frac{k}{2} - 2$ FAs and one HA or FA+ while the rest require one FA less. The maximum number of FA stages required is $D\left(\frac{k}{2} + 2\right)$ in this case.
Finally, the area and delay of a $k$-bit parallel diminished-1 modulo $2^k + 1$ adder that follows the architecture proposed in [14] are $\frac{9}{2} k \log_2 k + \frac{1}{2} k + 6$ equivalent gates and $2 \log_2 k + 3$ time units, respectively.
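Summing all the above, we conclude that the area and delay requirements of the proposed modulo $2^k + 1$ squarers are:
\[
\frac{k(k-1)}{2} + 7k \left\lfloor \frac{k+1}{2} \right\rfloor + \frac{9}{2} k \log_2 k + \frac{1}{2} k + 7 \ \text{equivalent gates}
\]
and
\[
1 + 4 D\!\left( \left\lfloor \frac{k}{2} \right\rfloor \right) + 2 \log_2 k + 3 \ \text{time units}
\]
respectively.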
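The two closed-form estimates are easy to tabulate. The sketch below is a direct transcription of the expressions as printed above, together with the Dadda depth function of Table IV; the function names are illustrative, and the entries for fewer than four operands (one stage for three operands, none for two or fewer) are standard values added here as an assumption, since Table IV starts at $f = 4$.

```python
from math import log2

# Largest operand count reducible to two rows in 0, 1, 2, ... FA stages (Dadda sequence).
DADDA_LIMITS = [2, 3, 4, 6, 9, 13, 19, 28, 42, 63, 94]


def dadda_stages(f):
    """D(f): FA stages needed to reduce f operands to two (Table IV for f >= 4)."""
    for stages, limit in enumerate(DADDA_LIMITS):
        if f <= limit:
            return stages
    raise ValueError("operand count beyond the tabulated range")


def squarer_area(k):
    """Area in equivalent gates, as given by the unit-gate expression above."""
    return k * (k - 1) / 2 + 7 * k * ((k + 1) // 2) + 4.5 * k * log2(k) + 0.5 * k + 7


def squarer_delay(k):
    """Delay in time units, as given by the unit-gate expression above."""
    return 1 + 4 * dadda_stages(k // 2) + 2 * log2(k) + 3
```

For `k = 8` these evaluate to 371 equivalent gates and 18 time units.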
Comparing the estimations derived above against those of the squarers presented in [11], we conclude that both architectures lead to equally fast squarers with approximately the same area requirements. We recall that the squarers of [11] were shown to outperform those designed according to the general procedure of [10] and to provide significantly shorter squaring times when compared against modulo $2^k + 1$ multipliers. Therefore, the same merits are also valid for the proposed squarers. The proposed squarers accept their operand in weighted form, in contrast to those of [11], which utilize a diminished-1 representation. This means that the proposed squarers do not require any translator between the weighted and the diminished-1 representation nor any special extra circuit for handling zero operands. Therefore they can be used more efficiently than the squarers proposed in [11].
IV. CONCLUSIONS
Efficient modulo $2^k + 1$ squarers are useful design components in RNS and cryptography applications. A new architecture for modulo $2^k + 1$ squarers has been proposed in this manuscript. The new architecture was derived by merging all required correction factors into a single one, which was shown to be a constant. Therefore, no circuit is required for its formation. Further performance enhancement is achieved by treating part of this constant as a partial product, whereas the rest is handled by the final parallel adder. This enables the use of an inverted EAC adder as the final adder, leading to shorter execution times and smaller area requirements.
Our estimations indicate that the proposed squarers attain the same execution frequency as those of [11] with approximately the same area requirements. Considering however that the proposed squarers accept their operand in normal form while the solution presented in [11] accepts operands in diminished-1 representation, it is clear that the proposed squarers can be used more efficiently since they do not require time- and hardware-consuming input / output translators and handling of zero operands.
REFERENCES
[1] T. Kwan and T. Martin, “Adaptive Detection and Enhancement of Multiple Sinusoids using a Cascade IIR Filter,” IEEE Trans. Circuits Syst., vol. 36, pp. 937–945, 1989.
[2] J. Ramirez et al., “High Performance, Reduced Complexity Programmable RNS–FPL Merged FIR Filters,” Electronics Letters, vol. 38, no. 4, pp. 199–200, 2002.
[3] J. Ramirez et al., “RNS-enabled Digital Signal Processor Design,” Electronics Letters, vol. 38, no. 6, pp. 266–268, 2002.
[4] T. Keller et al., “Adaptive Redundant Residue Number System Coded Multicarrier Modulation,” IEEE Journal on Selected Areas in Communications, vol. 18, no. 11, pp. 2292–2301, 2000.
[5] J. Ramirez et al., “Fast RNS FPL-based Communications Receiver Design and Implementation,” in Proc. of the 12th International Conference on Field Programmable Logic, Lecture Notes in Computer Science Vol. 2438, Springer-Verlag, 2002, pp. 472–481.
[6] L. Kalampoukas et al., “High-Speed Parallel-Prefix Modulo $2^n - 1$ Adders,” IEEE Trans. Comput., vol. 49, no. 7, pp. 673–680, 2000.
[7] C. Efstathiou, H. T. Vergos, and D. Nikolos, “Fast Parallel-Prefix Modulo $2^n + 1$ Adders,” IEEE Trans. Comput., vol. 53, pp. 1211–1216, 2004.
[8] C. Efstathiou, H. T. Vergos, and D. Nikolos, “Modified Booth Modulo $2^n - 1$ Multipliers,” IEEE Trans. Comput., vol. 53, pp. 370–374, 2004.
[9] A. Wrzyszcz and D. Milford, “A New Modulo $2^a + 1$ Multiplier,” in Proc. of the International Conference on Computer Design (ICCD’93), 1993, pp. 614–617.
[10] S. J. Piestrak, “Design of Squarers Modulo A with Low-Level Pipelining,” IEEE Trans. Circuits Syst. II, vol. 49, no. 1, pp. 31–41, 2002.
[11] H. T. Vergos and C. Efstathiou, “Diminished-1 Modulo $2^n + 1$ Squarer Design,” IEE Proceedings - Computers and Digital Techniques, vol. 152, pp. 561–566, 2005.
[12] L. M. Leibowitz, “A Simplified Binary Arithmetic for the Fermat Number Transform,” IEEE Trans. Acoust., Speech, Signal Processing, vol. 24, pp. 356–359, 1976.
[13] L. Dadda, “On Parallel Digital Multipliers,” Alta Frequenza, vol. 45, pp. 574–580, 1976.
[14] H. T. Vergos, C. Efstathiou, and D. Nikolos, “Diminished-One Modulo $2^n + 1$ Adder Design,” IEEE Trans. Comput., vol. 51, pp. 1389–1399, 2002.
[15] R. Zimmermann, “Efficient VLSI Implementation of Modulo ($2^n \pm 1$) Addition and Multiplication,” in Proc. of the 14th IEEE Symposium on Computer Arithmetic, April 1999, pp. 158–167.
[16] P. M. Kogge and H. S. Stone, “A Parallel Algorithm for the Efficient Solution of a General Class of Recurrence Equations,” IEEE Trans. Comput., vol. C-22, pp. 786–792, 1973.
[17] R. E. Ladner and M. J. Fischer, “Parallel Prefix Computation,” Journal of the ACM, vol. 27, no. 4, pp. 831–838, 1980.
[18] A. Tyagi, “A Reduced-Area Scheme for Carry-Select Adders,” IEEE Trans. Comput., vol. 42, no. 10, pp. 1163–1170, 1993.
