DCIS 2006

Efficient Modulo $2^k + 1$ Squarers

H. T. Vergos
Computer Engineering and Informatics Dept., University of Patras, 26500, Greece
E-mail: [email protected]

C. Efstathiou
Informatics Dept., Technological Educational Institute of Athens, 12210, Greece
E-mail: [email protected]

Abstract: A new modulo $2^k + 1$ squarer architecture is proposed for operands in the normal representation. The novel architecture is derived by showing that all required correction factors can be merged into a single constant one and by handling this constant partly as a partial product and partly in the final parallel adder. The proposed architecture utilizes a total of $\lceil k/2 \rceil + 1$ partial products, each $k$ bits wide, and is built using an inverted end-around-carry, carry-save adder tree and a final parallel adder.

Index Terms: Computer arithmetic, residue / modulo arithmetic, residue number system, modulo squarers.

The research presented in this work was conducted within the framework of the Educational and Initial Vocational Training program "Archimedes", which is co-funded by the E.U. (75%) and by the Greek Government (25%).

I. INTRODUCTION

The squaring operation is encountered in several applications of high-performance digital signal processors (DSPs). Such applications include signal filtering, image processing, Euclidean branch metrics calculation and modulation for communication components. The squaring operation can also be used effectively in square-and-multiply cryptoalgorithm implementations, for avoiding time-consuming exponentiations.

To achieve today's performance goals, many DSPs that target the above applications rely on the use of a residue number system (RNS) [1]–[5] instead of a positional one. The use of an RNS can significantly speed up arithmetic operations such as addition, multiplication and squaring, provided that arithmetic components that perform the modulo operations are available and operate efficiently.

In an RNS that is composed of the set $S$ of the $n$ pairwise relatively prime integers $m_1, m_2, \ldots, m_n$, an integer is represented by its residues over the numbers of $S$. Then, for the result $Z$ of the $X \diamond Y$ operation it holds that when $X = \{X_1, X_2, \ldots, X_n\}$ and $Y = \{Y_1, Y_2, \ldots, Y_n\}$, then $Z = \{Z_1, Z_2, \ldots, Z_n\}$, where $Z_i = |X_i \diamond Y_i|_{m_i}$. That is, the result is derived from the parallel computation of modulo $m_i$, $1 \le i \le n$, operations, each performed on the residues of $X$ and $Y$ over $m_i$. Since carry propagation is not required among the parallel computation units and each such unit operates on narrow residues instead of wide operands, an operation in RNS can be computed in less time.

Among the several proposals that have been examined, the three-moduli set $\{m_1, m_2, m_3\} = \{2^k, 2^k - 1, 2^k + 1\}$ has attracted significantly more attention, mainly because it enables the derivation of very efficient modulo $m_i$ arithmetic components, for example [6]–[11]. Concentrating on the modulo $2^k + 1$ squarer case, a designer currently has the following alternatives:

• To use a full/partial look-up table, implemented in ROM. This solution is good as long as $k$ is very small, because the required ROM size grows exponentially with $k$. Considering that cryptoalgorithms require operations of the form $|X^H|_M$, which may be computed by squarings and multiplications and where $M$ may be longer than 500 bits, the use of look-up tables cannot be considered a viable solution.

• To use the existing modulo multiplier to also perform the squaring, when needed. Since the multiplier is also used for the multiplication operation, this solution is area efficient. It is, however, very slow when compared against solutions that have a dedicated squarer circuit. Therefore, it is a viable alternative only when squaring is rarely required. This, however, is not the case in several DSP algorithms and applications.

• To follow the general design methodology for modulo $A$ squarer design [10]. The squarers that result from this methodology are unnecessarily complex. Apart from the partial product generation and reduction circuitry, the squarers proposed in [10] require a correction step and a result translation procedure. Both are implemented by consecutive parallel adders and multiplexers, which are area- and time-consuming.

• To follow the solution that fully exploits the properties of modulo $2^k + 1$ arithmetic [11]. The squarers that result consist only of a partial product generation, a partial product reduction and a final addition stage. Although this solution is the most efficient known, it is only applicable to input operands that follow the diminished-1 representation [12]. This representation implies that a zero operand is treated distinctly and all other operands are represented decreased by one. In this way only $k$ bits are required for representing the integers $0, 1, \ldots, 2^k$. Since this representation requires time- and hardware-consuming input / output translators and discrete handling of zero operands, it is attractive only when a long series of computations takes place on the diminished-1 representations before conversion back to the weighted representation is required.

In this manuscript, we propose a new architecture for modulo $2^k + 1$ squaring. The squarers designed by following the proposed architecture have the same number of stages as the architecture of [11] and are therefore both area and time efficient. However, they accept their operand in weighted form and therefore do not require any time- and hardware-consuming input / output translators nor any special circuit for handling zero operands. For the derivation of the proposed architecture, we merge all required correction factors that are treated separately in [10] into a single one and we prove that this is a constant, that is, independent of the size of the squarer.
Finally, the treatment of this constant is partly assigned to the final-stage adder, while the rest is handled as an extra partial product. This allows us to use a very fast, inverted end-around-carry (EAC) $k$-bit adder for the final addition without any further requirement for translation of the result.

II. PROPOSED ARCHITECTURE

Let $A = a_k a_{k-1} \ldots a_1 a_0$ denote a $(k+1)$-bit number, with $A \in [0, 2^k]$. Then $R$, the square of $A$ modulo $2^k + 1$, can be computed by:

$$R = |A^2|_{2^k+1} = \left| \sum_{i=0}^{k} \sum_{j=0}^{k} a_i a_j 2^{i+j} \right|_{2^k+1} = \left| a_k 2^{2k} + 2 \sum_{i=0}^{k-1} a_k a_i 2^{k+i} + \sum_{i=0}^{k-1} \sum_{j=0}^{k-1} a_i a_j 2^{i+j} \right|_{2^k+1}$$

Since $A \le 2^k$, the bit $a_k$ can be 1 only when all other bits are 0. Hence, for $0 \le i \le k-1$, $a_k a_i = 0$, and we get that:

$$R = \left| a_k 2^{2k} + \sum_{i=0}^{k-1} \sum_{j=0}^{k-1} a_i a_j 2^{i+j} \right|_{2^k+1} = \left| a_k \left|2^{2k}\right|_{2^k+1} + \sum_{i=0}^{k-1} \sum_{j=0}^{k-1} a_i a_j \left|2^{i+j}\right|_{2^k+1} \right|_{2^k+1}$$

Taking into account that $|2^{2k}|_{2^k+1} = 1$ and that, for $i + j \le 2k - 2$, $|2^{i+j}|_{2^k+1} = (-1)^s 2^{|i+j|_k}$, where

$$s = \begin{cases} 0, & \text{if } i + j < k \\ 1, & \text{if } i + j \ge k, \end{cases}$$

we derive that:

$$R = \left| a_k + \sum_{i=0}^{k-1} \sum_{j=0}^{k-1} (-1)^s a_i a_j 2^{|i+j|_k} \right|_{2^k+1} \qquad (1)$$

We can unify the positive and negative terms of (1) by using $|-z|_{2^k+1} = |2^k + 1 - z|_{2^k+1} = |2^k + \overline{z}|_{2^k+1}$, where $\overline{z}$ denotes the complement of the bit $z$. Then (1) becomes:

$$R = \left| a_k + \sum_{i=0}^{k-1} \sum_{j=0}^{k-1} x_{i,j} 2^{|i+j|_k} \right|_{2^k+1} \qquad (2)$$

with

$$x_{i,j} = \begin{cases} a_i a_j, & \text{if } i + j < k \\ \left| 2^k + \overline{a_i a_j} \right|_{2^k+1}, & \text{if } i + j \ge k. \end{cases}$$

Relation (2) indicates that the product bits $a_i a_j$ with $i + j \ge k$ can be inverted and placed at bit position $|i+j|_k$, if a correction factor equal to $2^k 2^{|i+j|_k}$ is also taken into account. Therefore, we can rewrite relation (2) as:

$$R = \left| a_k + \sum_{i=0}^{k-1} (PP_i + CF_i) \right|_{2^k+1}, \qquad (3)$$

where $PP_i$ denotes the partial product

$$PP_i = \sum_{j=0}^{k-1-i} a_i a_j 2^{|i+j|_k} + \sum_{j=k-i}^{k-1} \overline{a_i a_j}\, 2^{|i+j|_k}$$

and $CF_i$ the corresponding correction introduced by the $i$ complemented terms of $PP_i$. The partial product $PP_0$ does not contain any complemented terms and therefore $CF_0 = 0$. The required correction $CF_i$ for the complemented terms of $PP_i$ is:

$$CF_i = \left| \sum_{j=k-i}^{k-1} 2^k 2^{|i+j|_k} \right|_{2^k+1} = 2^k (2^i - 1).$$

Summing all correction factors for complemented terms gives their total correction factor, $CF_{CMPL}$:

$$CF_{CMPL} = \sum_{i=1}^{k-1} 2^k (2^i - 1) = 2^k (2^k - 1 - k).$$

Therefore, $R$ is given by:

$$R = \left| CF_{CMPL} + \sum_{i=0}^{k-1} PP_i \right|_{2^k+1},$$

where the general form of the $k$ partial products is presented in Table I. Since $a_k a_0 = 0$, the addition of $a_k$ indicated in (3) has been performed by ORing $a_k$ with $a_0$ in $PP_0$ in Table I.

TABLE I
PARTIAL PRODUCT MATRIX

$$\begin{array}{l|cccccc}
 & 2^{k-1} & 2^{k-2} & 2^{k-3} & \cdots & 2^{1} & 2^{0} \\ \hline
PP_0 = & a_{k-1}a_0 & a_{k-2}a_0 & a_{k-3}a_0 & \cdots & a_1 a_0 & a_0 + a_k \\
PP_1 = & a_{k-2}a_1 & a_{k-3}a_1 & a_{k-4}a_1 & \cdots & a_0 a_1 & \overline{a_{k-1}a_1} \\
PP_2 = & a_{k-3}a_2 & a_{k-4}a_2 & a_{k-5}a_2 & \cdots & \overline{a_{k-1}a_2} & \overline{a_{k-2}a_2} \\
\cdots & \cdots & \cdots & \cdots & \cdots & \cdots & \cdots \\
PP_{k-2} = & a_1 a_{k-2} & a_0 a_{k-2} & \overline{a_{k-1}a_{k-2}} & \cdots & \overline{a_3 a_{k-2}} & \overline{a_2 a_{k-2}} \\
PP_{k-1} = & a_0 a_{k-1} & \overline{a_{k-1}} & \overline{a_{k-2}a_{k-1}} & \cdots & \overline{a_2 a_{k-1}} & \overline{a_1 a_{k-1}}
\end{array}$$

In every column of the partial product matrix, most terms appear twice, once as $a_i a_j$ ($\overline{a_i a_j}$) and once as $a_j a_i$ ($\overline{a_j a_i}$). This enables us to reduce the length of the array to approximately half. Adding such a pair of terms of the column with weight $2^{i+j}$ would yield a result of $2 \times a_i a_j$ ($2 \times \overline{a_i a_j}$). Instead of adding them, we can therefore equivalently move one of these terms to the column with weight $2^{i+j+1}$. The pairs of equal terms in the leftmost column can be replaced by the complement of one of the terms in the rightmost column, each introducing a correction factor of $2^k$. Since there are $\lfloor k/2 \rfloor$ pairs of equal terms in the leftmost column, the total correction factor required, $CF_{MOV}$, is given by:

$$CF_{MOV} = 2^k \left\lfloor \frac{k}{2} \right\rfloor.$$

The resulting partial products and the total correction factor must be added modulo $2^k + 1$, until two final summands are produced. This can be performed by using either a carry-save adder (CSA) array or, even faster, a CSA tree (Dadda tree [13]).
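The construction up to this point can be checked numerically. The following is a minimal sketch, not part of the original design flow: the function names `bits`, `square_from_table1` and `square_after_pair_moving` are hypothetical, and the code simply evaluates the complemented partial products of relation (3) together with $CF_{CMPL}$, and then the matrix obtained after pair moving together with $CF_{CMPL} + CF_{MOV}$, comparing both against a direct computation of $|A^2|_{2^k+1}$.

```python
def bits(a, k):
    """Return [a_0, ..., a_k], the k+1 bits of the operand a."""
    return [(a >> i) & 1 for i in range(k + 1)]


def square_from_table1(a, k):
    """|a^2| mod (2^k + 1) from the Table I terms plus CF_CMPL (illustrative)."""
    m = (1 << k) + 1
    b = bits(a, k)
    total = b[k]                      # a_k term (ORed with a_0 in hardware)
    for i in range(k):
        for j in range(k):
            term = b[i] & b[j]
            if i + j >= k:
                term ^= 1             # product bit enters complemented
            total += term << ((i + j) % k)
    cf_cmpl = (1 << k) * ((1 << k) - 1 - k)
    return (total + cf_cmpl) % m


def square_after_pair_moving(a, k):
    """Same value after merging equal pairs, using CF_CMPL + CF_MOV."""
    m = (1 << k) + 1
    b = bits(a, k)
    total = b[k]
    for i in range(k):
        for j in range(i, k):
            term = b[i] & b[j]
            if i + j >= k:
                term ^= 1
            col = (i + j) % k
            if i == j:
                total += term << col          # diagonal term appears once
            elif col < k - 1:
                total += term << (col + 1)    # pair becomes one term, one column left
            else:
                total += term ^ 1             # leftmost pair re-enters at 2^0, complemented
    cf_cmpl = (1 << k) * ((1 << k) - 1 - k)
    cf_mov = (1 << k) * (k // 2)              # one 2^k per pair of the leftmost column
    return (total + cf_cmpl + cf_mov) % m


if __name__ == "__main__":
    for k in range(1, 9):
        for a in range((1 << k) + 1):
            expected = (a * a) % ((1 << k) + 1)
            assert square_from_table1(a, k) == expected
            assert square_after_pair_moving(a, k) == expected
    print("checked k = 1..8")
```

The exhaustive loop covers every legal operand $A \in [0, 2^k]$ for small $k$, including $A = 2^k$, where all product bits are 0 and the complemented terms cancel against $CF_{CMPL}$.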
Taking into account that the carry outputs of the most significant column have a weight of $2^k$ and since in modulo $2^k + 1$ arithmetic

$$|c_i 2^k|_{2^k+1} = |-c_i|_{2^k+1} = |2^k + \overline{c_i}|_{2^k+1},$$

the carries out of the most significant bit position can be complemented and added to the least significant bit position, thereby forming an inverted EAC CSA tree. A correction factor of $2^k$ must be taken into account for each such carry recirculation.

We can now combine all the above-mentioned correction factors into a single one and treat it as an extra partial product term. Since, however, the form of the partial product matrix depends on $k$, we distinguish the following two cases:

• $k$ is odd. After pair identification and repositioning, the partial product matrix has $\frac{k+1}{2}$ rows. The reduction of the $\frac{k+1}{2} + 1$ products into two summands will yield $\frac{k+1}{2} - 1$ carries of weight $2^k$, which should be complemented and added to the least significant position. This imposes a correction factor of $CF_{RECIRC,odd} = 2^k \left( \frac{k+1}{2} - 1 \right)$. Therefore, the total correction required in this case is:

$$CF = CF_{CMPL} + CF_{MOV} + CF_{RECIRC,odd} = \left| 2^k \left[ (2^k - 1 - k) + \left\lfloor \frac{k}{2} \right\rfloor + \frac{k+1}{2} - 1 \right] \right|_{2^k+1} = 3.$$

• $k$ is even. After pair identification and repositioning, the resulting partial product matrix is not completely regular. The columns with weight $2^{i+j}$, with $i + j$ even, have $\frac{k}{2} + 2$ terms, while the remaining columns have only $\frac{k}{2} - 1$ terms. A full adder is used in each column with $\frac{k}{2} + 2$ terms. Since this combines three terms from this column and provides its sum term to the same column and its carry term to the column with double weight, the matrix is transformed into a completely regular one with $\frac{k}{2}$ rows. During the addition of the $\frac{k}{2} + 1$ products (the $+1$ stands for the total correction factor), $\frac{k}{2} - 1$ carries of weight $2^k$ will be generated, which should be complemented and added to the least significant position. This imposes a correction factor of $CF_{RECIRC,even} = 2^k \left( \frac{k}{2} - 1 \right)$. Therefore, the total correction required in this case is also:

$$CF = CF_{CMPL} + CF_{MOV} + CF_{RECIRC,even} = \left| 2^k \left[ (2^k - 1 - k) + \frac{k}{2} + \frac{k}{2} - 1 \right] \right|_{2^k+1} = 3.$$

In both cases the bracketed sum equals $2^k - 2$, and since $|2^k|_{2^k+1} = |-1|_{2^k+1}$, we get $|2^k (2^k - 2)|_{2^k+1} = |(-1)(-3)|_{2^k+1} = 3$.

According to the above, we conclude that the total correction factor required is constant and independent of the squarer size. The square is given by:

$$R = \left| 3 + \sum_{i=0}^{\lceil k/2 \rceil} PP_i^* \right|_{2^k+1} \qquad (4)$$

where the $PP_i^*$ are derived from the $PP_i$ of Table I, using the operations described above.

In the following we examine the addition of the constant total correction factor. A straightforward implementation would be to use $CF$ as an extra partial product, along with a fast modulo $2^k + 1$ adder (for example [7]) which accepts the two summands produced by the reduction scheme (array / tree) and produces the result. A more efficient alternative is proposed in the following, based on the use of an inverted EAC parallel adder (equivalently, a diminished-1 modulo $2^k + 1$ adder). If the architecture proposed in [14] is followed, this adder can provide its result faster than the fastest modulo $2^k + 1$ adder available [7], with smaller area requirements. Since, however, such an adder can provide only $k$ bits of the result, the last bit must be derived separately.

Relation (4) can equivalently be written as:

$$R = \left| \left| 2 + \sum_{i=0}^{\lceil k/2 \rceil} PP_i^* \right|_{2^k+1} + 1 \right|_{2^k+1} = |C + S + 1|_{2^k+1}$$

where $C = c_{k-2} c_{k-3} \ldots c_0 \overline{c_{k-1}}$ and $S = s_{k-1} s_{k-2} \ldots s_1 s_0$ are respectively the $k$-bit carry and sum vectors produced by the multi-operand addition $\left| 2 + \sum_{i=0}^{\lceil k/2 \rceil} PP_i^* \right|_{2^k+1}$. The most significant bit of $R$ is 1 only when $R = 2^k$, or equivalently when $|S + C + 1|_{2^k+1} = 2^k$. Taking into account that $S$ and $C$ are $k$-bit vectors, we then get that $R = 2^k \Leftrightarrow S + C + 1 = 2^k$, or equivalently that $S + C = 2^k - 1$. That is, the most significant bit of the squaring result is 1 only when $S$ and $C$ are complementary vectors. As explained at the end of this section, this observation enables us to compute the most significant bit distinctly from the rest.

In the following we focus on the $k$ least significant bits of $R$. Let $R_k$ denote the $k$-bit vector of the least significant bits of $R$. We then have that:

$$R_k = \left| |A^2|_{2^k+1} \right|_{2^k} = \left| |S + C + 1|_{2^k+1} \right|_{2^k} = \begin{cases} |S + C + 1 - (2^k + 1)|_{2^k}, & \text{if } S + C + 1 \ge 2^k + 1 \\ |S + C + 1|_{2^k}, & \text{otherwise} \end{cases}$$

$$= \begin{cases} |S + C - 2^k|_{2^k}, & \text{if } S + C \ge 2^k \\ |S + C + 1|_{2^k}, & \text{otherwise} \end{cases} = \begin{cases} |S + C|_{2^k}, & \text{if } S + C \ge 2^k \\ |S + C + 1|_{2^k}, & \text{otherwise.} \end{cases}$$

The latter relation reveals that the $k$ least significant bits of the result can be handled by a $k$-bit adder that increases the binary sum of its inputs by one when the carry output is 0 and leaves it unchanged in the case of a carry output. This is exactly the function performed by an inverted EAC parallel adder. We therefore conclude that, if a total correction factor of 2 is used as an extra partial product, an inverted EAC parallel adder used as the final adder will accept $S$ and $C$ at its inputs and will provide $R_k$.

Very fast inverted EAC adders based on parallel-prefix carry computation units have appeared in [14], [15]. For integer adders, a parallel-prefix carry computation unit is derived as follows. Let $A = a_{k-1} a_{k-2} \ldots a_1 a_0$ and $B = b_{k-1} b_{k-2} \ldots b_1 b_0$ denote the two $k$-bit addition operands and let the terms $g_i = a_i b_i$ and $p_i = a_i \oplus b_i$ denote the carry generate and the carry propagate terms at bit position $i$, respectively. By defining $\circ$ as an operator that associates generate and propagate pairs and produces a new group pair according to the equation

$$(g_x, p_x) \circ (g_y, p_y) = (g_x + p_x g_y,\ p_x p_y),$$

the computation of a carry $c_i$ of the integer addition of $A$ and $B$ is equivalent to the problem of computing the group generate $G_i$ given by the prefix equation:

$$(G_i, P_i) = (g_i, p_i) \circ (g_{i-1}, p_{i-1}) \circ \cdots \circ (g_1, p_1) \circ (g_0, p_0).$$

Once the carries have been computed, the sum bits $s_i$ are computed by $s_i = p_i \oplus c_{i-1}$.

For attaining an inverted EAC adder, simple solutions, such as connecting the carry output of an integer adder back to the carry input via an inverter, are not well suited, since they suffer from oscillations. In [15], it was proposed that the output carry is driven back via an inverter to a late carry increment stage composed of nodes implementing a prefix operator. Therefore, no oscillations occur, and if the carry computation unit is designed according to the fast algorithms presented in [16], [17], the derived inverted EAC adders feature an operating speed close to that of the corresponding integer adders. The need for an extra prefix stage that handles the re-entering carry has been removed in the parallel-prefix inverted EAC adders proposed in [14]. This was achieved by performing carry recirculation at each existing prefix level. As a result, parallel-prefix adder architectures with $\log_2 k$ prefix levels have been derived, that is, inverted EAC adders that can achieve the same operating speed as the corresponding integer adders.

Since $S$ and $C$ are the inputs of the final adder and the group propagate over all $k$ bits is $P_{k-1} = p_{k-1} p_{k-2} \cdots p_0$, we conclude that the most significant bit of the result, which should be 1 only when $S$ and $C$ are complementary vectors, is equal to the group propagate signal out of the $k$ bits of the final inverted EAC adder.

A. An example of the proposed architecture

We apply the architecture derived in the previous section to the design of a modulo $2^5 + 1$ squarer. According to Table I, the initial partial products indicated in Table II are derived. We then identify pairs of equal terms in each column and move one of the terms to the column on its left. Terms from the leftmost column are complemented and driven to the rightmost one. We also add $00010_2$ ($2_{10}$) for our total correction factor. After these operations, our partial product matrix takes the form of Table III.

TABLE II
INITIAL PARTIAL PRODUCT MATRIX

$$\begin{array}{l|ccccc}
 & 2^4 & 2^3 & 2^2 & 2^1 & 2^0 \\ \hline
PP_0 = & a_4 a_0 & a_3 a_0 & a_2 a_0 & a_1 a_0 & a_0 + a_5 \\
PP_1 = & a_3 a_1 & a_2 a_1 & a_1 & a_0 a_1 & \overline{a_4 a_1} \\
PP_2 = & a_2 & a_1 a_2 & a_0 a_2 & \overline{a_4 a_2} & \overline{a_3 a_2} \\
PP_3 = & a_1 a_3 & a_0 a_3 & \overline{a_4 a_3} & \overline{a_3} & \overline{a_2 a_3} \\
PP_4 = & a_0 a_4 & \overline{a_4} & \overline{a_3 a_4} & \overline{a_2 a_4} & \overline{a_1 a_4}
\end{array}$$

TABLE III
FINAL PARTIAL PRODUCT MATRIX

$$\begin{array}{l|ccccc}
 & 2^4 & 2^3 & 2^2 & 2^1 & 2^0 \\ \hline
PP_0^* = & a_3 a_0 & \overline{a_4 a_3} & \overline{a_4 a_2} & \overline{a_4 a_1} & \overline{a_4 a_0} \\
PP_1^* = & a_2 a_1 & \overline{a_4} & a_1 a_0 & \overline{a_3 a_2} & \overline{a_3 a_1} \\
PP_2^* = & a_2 & a_2 a_0 & a_1 & \overline{a_3} & a_0 + a_5 \\
PP_3^* = & 0 & 0 & 0 & 1 & 0
\end{array}$$

Fig. 1. Proposed modulo $2^5 + 1$ squarer. (Figure not reproduced: HA, FA and FA+ reduction stages produce the vectors $s_4 \ldots s_0$ and $c_4 \ldots c_0$, which feed a modulo $2^5 + 1$ diminished-1 adder with a parallel-prefix carry computation unit; the group propagate signal of the adder provides $r_5$ and its sum outputs provide $r_4 \ldots r_0$.)

The resulting implementation is presented in Figure 1, in which the AND / NAND / inverter gates required for forming the partial product bits are not shown. The blocks used are half-adders (HA), full-adders (FA), simplified FA blocks (FA+), that is, FAs with one of their inputs set at 1, and the final adder. The output carries at the most significant bit of each stage are complemented and driven to the least significant bit of the subsequent stage. The two final derived summands (the $S$ and $C$ vectors) are added in the final inverted EAC adder. Note that further simplifications are possible in several blocks, since their operands are functions of the same input bits.
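The data path of the example can also be modelled at the bit-vector level. The sketch below is a behavioural model only, written under stated assumptions: the helper names (`table3_rows`, `csa_inverted_eac`, `inverted_eac_add`, `squarer_mod33`) are hypothetical, the placement of complemented bits follows the pair-moving rules described above, and the two carry-save levels are grouped in one convenient order rather than in the exact HA / FA / FA+ arrangement of Figure 1.

```python
K = 5
MASK = (1 << K) - 1


def inv(bit):
    """Complement of a single bit."""
    return bit ^ 1


def word(b4, b3, b2, b1, b0):
    """Pack five bits (weights 2^4 .. 2^0) into an integer."""
    return (b4 << 4) | (b3 << 3) | (b2 << 2) | (b1 << 1) | b0


def table3_rows(a):
    """Rows of the final partial product matrix for k = 5, constant row included."""
    a0, a1, a2, a3, a4, a5 = ((a >> i) & 1 for i in range(6))
    pp0 = word(a3 & a0, inv(a4 & a3), inv(a4 & a2), inv(a4 & a1), inv(a4 & a0))
    pp1 = word(a2 & a1, inv(a4), a1 & a0, inv(a3 & a2), inv(a3 & a1))
    pp2 = word(a2, a2 & a0, a1, inv(a3), a0 | a5)
    return [pp0, pp1, pp2, 0b00010]


def csa_inverted_eac(x, y, z):
    """One carry-save level: the carry out of bit k-1 re-enters bit 0 inverted."""
    s = x ^ y ^ z
    c = (x & y) | (x & z) | (y & z)
    carry_out = (c >> (K - 1)) & 1
    c_vec = ((c << 1) & MASK) | inv(carry_out)
    return s, c_vec


def inverted_eac_add(s, c):
    """Final adder: add 1 when there is no carry out; MSB is the group propagate."""
    total = s + c
    carry_out = total >> K
    low = (total + inv(carry_out)) & MASK
    msb = int((s ^ c) == MASK)        # 1 only when s and c are complementary
    return (msb << K) | low


def squarer_mod33(a):
    """Behavioural model of the modulo 2^5 + 1 squarer for a in [0, 32]."""
    p0, p1, p2, const = table3_rows(a)
    s1, c1 = csa_inverted_eac(p0, p1, p2)
    s, c = csa_inverted_eac(s1, c1, const)
    return inverted_eac_add(s, c)


if __name__ == "__main__":
    assert all(squarer_mod33(a) == (a * a) % 33 for a in range(33))
    print("modulo 33 squarer model matches a*a % 33 for all operands")
```

Each carry-save level contributes one inverted carry recirculation, so the four rows need two levels. That matches the $\frac{k+1}{2} - 1 = 2$ recirculations assumed when the constant 3 was derived.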
One such case is the top leftmost HA of Figure 1: it accepts $a_2$ and $a_2 a_1$ as input operands and can therefore be simplified further.

We must note here that the partial product bits of equal weight should not be driven randomly to the FAs of the corresponding column. To achieve the least delay, the partial product bits derived earlier should be driven to the FAs at the top of the CSA tree, whereas late-arriving signals should be driven to FAs of subsequent tree levels. Moreover, the addition of the 0s of the constant partial product cannot be avoided, since doing so alters the number of inverted EACs in the CSA tree and invalidates all the previous analysis. However, the FAs accepting the bits of the constant partial product can be simplified to HAs or FA+ blocks.

III. AREA AND DELAY ESTIMATIONS

In this section we examine the area and delay complexities of the proposed squarers. For our calculations, we adopt the approximations of the unit-gate model [18], that is, we consider that all 2-input monotonic gates count as one gate equivalent for both area and delay, while a 2-input XOR or XNOR gate counts as 2 equivalents for both area and delay. Inverters are not taken into account. According to these approximations, the area and delay of an FA are modelled as 7 equivalent gates and 4 time units respectively, whereas an HA or an FA+ requires an area of 3 equivalent gates and has a delay of 2 time units.

In the proposed squarers, the required partial product bits can be derived in parallel by the use of $\frac{k(k-1)}{2}$ AND / NAND gates and 1 OR gate, in 1 time unit. We consider that these partial products are then reduced to two summands by the use of a Dadda tree. The depth in FA stages of a Dadda tree is a function, $D(f)$, of its number of operands $f$ and is listed in Table IV for all practical values.

TABLE IV
FA STAGES IN AN $f$-OPERAND DADDA TREE

$$\begin{array}{c|c}
f & D(f) \text{ in FA stages} \\ \hline
4 & 2 \\
5 \le f \le 6 & 3 \\
7 \le f \le 9 & 4 \\
10 \le f \le 13 & 5 \\
14 \le f \le 19 & 6 \\
20 \le f \le 28 & 7 \\
29 \le f \le 42 & 8 \\
43 \le f \le 63 & 9 \\
64 \le f \le 94 & 10
\end{array}$$
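The entries of Table IV follow the well-known Dadda height sequence 2, 3, 4, 6, 9, 13, ..., in which each height is the floor of 1.5 times the previous one. As a small illustration (the function name `dadda_stages` is hypothetical and not taken from the original text), the depth can be computed by counting how many heights of that sequence are smaller than the number of operands:

```python
def dadda_stages(f):
    """Number of FA stages needed to reduce f operands to two summands."""
    stages, height = 0, 2
    while height < f:
        stages += 1
        height = height * 3 // 2      # next Dadda height: floor(1.5 * height)
    return stages


if __name__ == "__main__":
    # (lowest f, highest f, stages) for each row of Table IV
    table_iv = [(4, 4, 2), (5, 6, 3), (7, 9, 4), (10, 13, 5), (14, 19, 6),
                (20, 28, 7), (29, 42, 8), (43, 63, 9), (64, 94, 10)]
    for low, high, stages in table_iv:
        assert all(dadda_stages(f) == stages for f in range(low, high + 1))
    print("Table IV reproduced")
```

For the modulo $2^5 + 1$ example, four operands are reduced, so `dadda_stages(4)` gives the two FA stages of the CSA tree.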
When $k$ is odd, each column of the Dadda tree has $\frac{k+1}{2} + 1$ operand bits that require $\frac{k+1}{2} - 2$ FAs and one HA or FA+ for their reduction. When $k$ is even, half of the columns require $\frac{k}{2} - 2$ FAs and one HA or FA+, while the rest require one FA less. The maximum number of FA stages required is $D\left(\frac{k}{2} + 2\right)$ in this case. Finally, the area and delay of a $k$-bit parallel diminished-1 modulo $2^k + 1$ adder that follows the architecture proposed in [14] are $\frac{9}{2} k \log_2 k + \frac{1}{2} k + 6$ equivalent gates and $2 \log_2 k + 3$ time units. Summing all the above, we conclude that the area and delay requirements of the proposed modulo $2^k + 1$ squarers are:

$$\frac{k(k-1)}{2} + 7k \left\lfloor \frac{k+1}{2} \right\rfloor + \frac{9}{2} k \log_2 k + \frac{1}{2} k + 7 \ \text{equivalent gates}$$

and

$$1 + 4D\left( \left\lfloor \frac{k}{2} \right\rfloor + 2 \right) + 2 \log_2 k + 3 \ \text{time units}$$

respectively.

Comparing the estimations derived above against those of the squarers presented in [11], we conclude that both architectures lead to equally fast squarers with approximately the same area requirements. Recall that the squarers of [11] were shown to outperform those designed according to the general procedure of [10] and to provide significantly shorter squaring times than modulo $2^k + 1$ multipliers. Therefore, the same merits are also valid for the proposed squarers.

The proposed squarers accept their operand in weighted form, in contrast to those of [11], which utilize a diminished-1 representation. This means that the proposed squarers do not require any translator between the weighted and the diminished-1 representation, nor any special extra circuit for handling zero operands. Therefore, they can be used more efficiently than the squarers proposed in [11].

IV. CONCLUSIONS

Efficient modulo $2^k + 1$ squarers are useful design components in RNS and cryptography applications. A new architecture for modulo $2^k + 1$ squarers has been proposed in this manuscript. The new architecture was derived by merging all required correction factors into a single one, which was shown to be a constant.
Therefore, no circuit is required for its formation. Further performance enhancement is achieved by treating part of this constant as a partial product, whereas the rest is handled by the final parallel adder. This enables the use of an inverted EAC adder as the final adder, leading to shorter execution times and smaller area requirements.

Our estimations indicate that the proposed squarers attain the same operating frequency as those of [11], with approximately the same area requirements. Considering, however, that the proposed squarers accept their operand in normal form while the solution presented in [11] requires operands in diminished-1 representation, it is clear that the proposed squarers can be used more efficiently, since they do not require time- and hardware-consuming input / output translators and handling of zero operands.

REFERENCES

[1] T. Kwan and T. Martin, "Adaptive Detection and Enhancement of Multiple Sinusoids using a Cascade IIR Filter," IEEE Trans. Circuits Syst., vol. 36, pp. 937–945, 1989.
[2] J. Ramirez et al., "High Performance, Reduced Complexity Programmable RNS–FPL Merged FIR Filters," Electronics Letters, vol. 38, no. 4, pp. 199–200, 2002.
[3] J. Ramirez et al., "RNS-enabled Digital Signal Processor Design," Electronics Letters, vol. 38, no. 6, pp. 266–268, 2002.
[4] T. Keller et al., "Adaptive Redundant Residue Number System Coded Multicarrier Modulation," IEEE Journal on Selected Areas in Communication, vol. 18, no. 11, pp. 2292–2301, 2000.
[5] J. Ramirez et al., "Fast RNS FPL-based Communications Receiver Design and Implementation," in Proc. of the 12th International Conference on Field Programmable Logic, Lecture Notes in Computer Science Vol. 2438, Springer-Verlag, 2002, pp. 472–481.
[6] L. Kalampoukas et al., "High-Speed Parallel-Prefix Modulo $2^n - 1$ Adders," IEEE Trans. Comput., vol. 49, no. 7, pp. 673–680, 2000.
[7] C. Efstathiou, H. T. Vergos, and D. Nikolos, "Fast Parallel-Prefix Modulo $2^n + 1$ Adders," IEEE Trans. Comput., vol. 53, pp. 1211–1216, 2004.
[8] C. Efstathiou, H. T. Vergos, and D. Nikolos, "Modified Booth Modulo $2^n - 1$ Multipliers," IEEE Trans. Comput., vol. 53, pp. 370–374, 2004.
[9] A. Wrzyszcz and D. Milford, "A New Modulo $2^a + 1$ Multiplier," in Proc. of the International Conference on Computer Design (ICCD'93), 1995, pp. 614–617.
[10] S. J. Piestrak, "Design of Squarers Modulo A with Low-Level Pipelining," IEEE Trans. Circuits Syst. II, vol. 49, no. 1, pp. 31–41, 2002.
[11] H. T. Vergos and C. Efstathiou, "Diminished-1 Modulo $2^n + 1$ Squarer Design," IEE Proceedings - Computers and Digital Techniques, vol. 152, pp. 561–566, 2005.
[12] L. M. Leibowitz, "A Simplified Binary Arithmetic for the Fermat Number Transform," IEEE Trans. Acoust., Speech, Signal Processing, vol. 24, pp. 356–359, 1976.
[13] L. Dadda, "On Parallel Digital Multipliers," Alta Frequenza, vol. 45, pp. 574–580, 1976.
[14] H. T. Vergos, C. Efstathiou, and D. Nikolos, "Diminished-One Modulo $2^n + 1$ Adder Design," IEEE Trans. Comput., vol. 51, pp. 1389–1399, 2002.
[15] R. Zimmermann, "Efficient VLSI Implementation of Modulo $(2^n \pm 1)$ Addition and Multiplication," in Proc. of the 14th IEEE Symposium on Computer Arithmetic, April 1999, pp. 158–167.
[16] P. M. Kogge and H. S. Stone, "A Parallel Algorithm for the Efficient Solution of a General Class of Recurrence Equations," IEEE Trans. Comput., vol. C-22, pp. 786–792, 1973.
[17] R. E. Ladner and M. J. Fischer, "Parallel Prefix Computation," Journal of the ACM, vol. 27, no. 4, pp. 831–838, 1980.
[18] A. Tyagi, "A Reduced-Area Scheme for Carry-Select Adders," IEEE Trans. Comput., vol. 42, no. 10, pp. 1163–1170, 1993.