LOGIC SYNTHESIS OF BINARY, CARRY-SAVE
AND MIXED-RADIX ARITHMETIC FOR
DIGITAL SIGNAL PROCESSING
Stefan J. Bitterlich, H. Meyr
RWTH Aachen, Integrated Systems for Signal Processing
ISS - 611810, Templergraben 55, D-52056 Aachen, Germany
Tel: +49-241-807885, email: [email protected]
Abstract - All commercially available logic-synthesis tools currently use
only (non-redundant) binary and two's complement number representations for
the results of arithmetic operators. In this paper we analyze the silicon
area and throughput of word-parallel arithmetic circuits (add-and-shift type
arithmetic) based on various redundant number representations and compare
these results with the automatically optimized two's complement
implementations. The literature on redundant number representations
typically recommends radix-4 arithmetic for a full-custom or a traditional
semi-custom design style. We show that the radix-4 implementation is often
not optimal for a logic-synthesis based semi-custom design style. Instead, a
high-radix or a mixed-radix implementation (which we derive in this paper)
should be considered.
INTRODUCTION
Driven by the importance of a short time-to-market, a logic-synthesis based
design methodology for digital signal processing chips is now well established
and in common use. However, all commercially available logic synthesis systems
currently use only non-redundant number representations (unsigned binary or
two's complement) for the I/O signals of the automatically synthesized
arithmetic operators. Because of the well-known advantages of redundant number
representations (short critical path, lower complexity than look-ahead circuit
structures) [1,2,3,4,5,6,7,8,9], the instantiated components for more complex
operators such as multipliers can use a redundant number representation
internally. But because all output signals are represented in unsigned binary
or two's complement form, the internally used redundant number representation
is always converted into the externally used non-redundant form. This leads
to a significant conversion overhead, which is incurred even if the following
operator could easily be modified to handle the initially generated redundant
number representation. Direct processing of the redundant representation,
which could save the instantiation of the corresponding conversion units (e.g.
vector merging adders), is currently not supported.
In this paper we quantitatively analyze the savings that can be gained by
giving up the constraint of using non-redundant number representations only.
We restrict ourselves to word-parallel architectures because they fit smoothly
into the word-parallel synthesis environment of the Synopsys Design Compiler.
Digit-serial or digit-pipelined architectures are certainly an interesting
alternative [3,4,10]. We do not include them here since the direct processing
of skewed data is not possible with DesignWare components, so the overhead
for conversion at the interface would be too high. Alternatively, completely
giving up the use of these library elements would lead to an unacceptable
increase in design effort.
BINARY AND TWO'S COMPLEMENT ADDERS
Binary adders are part of the synthesis tool's DesignWare library [11]. The
automatic constraint-based selection of ripple-carry (lower-speed region),
carry look-ahead (medium-speed region) and fast carry look-ahead adders [12]
(high-speed region) is supported. This makes it possible to synthesize a
semi-custom circuit that almost exactly meets the designer's speed
requirements for critical paths in the 10 to 50 ns region (cf. binary case in
Fig. 1). We reuse this expert knowledge built into DesignWare by instantiating
these adders for the higher-radix and mixed-radix architectures to be
discussed later on.
CARRY-SAVE ADDERS / SUBTRACTORS
A (bin + CS) → CS adder/subtractor is essentially a row of full adders
(generalized full adders for a subtractor [8]). For a fully pipelined circuit
with pipeline registers after each 2-operand adder it has a critical path of
just one full-adder delay, which is unnecessarily fast for most semi-custom
designs. For accumulators or accumulator-like circuits such as numerically
controlled oscillators (NCOs), CS + CS → CS adders are required if the input
is already a CS number. A CS + CS → CS operator can be implemented as two rows
of full adders, which leads to a critical path of two full-adder delays.
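The two adder types described above can be illustrated with a small bit-level model. This is only a behavioural sketch, not the synthesized circuit: the function names are hypothetical, a carry-save number is modelled as a (sum word, carry word) pair, and all arithmetic is taken modulo 2**width, as in an accumulator.

```python
def full_adder_row(a, b, c, width):
    """One row of full adders: compress three words into a sum and a carry word."""
    mask = (1 << width) - 1
    s = (a ^ b ^ c) & mask                                   # sum output of each full adder
    carry = (((a & b) | (a & c) | (b & c)) << 1) & mask      # carry output, one weight higher
    return s, carry


def add_bin_cs(x, cs, width):
    """(bin + CS) -> CS: a single row of full adders."""
    s, c = cs
    return full_adder_row(x, s, c, width)


def add_cs_cs(cs_a, cs_b, width):
    """CS + CS -> CS: two rows of full adders, one after the other."""
    s1, c1 = full_adder_row(cs_a[0], cs_a[1], cs_b[0], width)
    return full_adder_row(s1, c1, cs_b[1], width)


def cs_value(cs, width):
    """Convert a carry-save pair to binary (the job of a vector merging adder)."""
    return (cs[0] + cs[1]) & ((1 << width) - 1)
```

No carry ever ripples inside `full_adder_row`, which is why the critical path is one full-adder delay per row; the carry propagation is deferred to `cs_value`.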
A carry-save number requires $2w - 1$ (1-bit) registers for storing a signal
value of range $0 \ldots 2^w - 1$, i.e. a signal that can be represented with
$w$ bits in non-redundant form. Full-custom carry-save implementations can be
area-efficient because the dynamic registers usually used in full-custom
designs are a cheap resource [5]. But semi-custom based logic synthesis uses
edge-triggered registers that are much more expensive (2.5 times in our
target technology [13]).
Typically the high speed of the carry-save design (critical path of one or
two full-adder delays) cannot be matched by the other components of the
semi-custom system. The high speed of the carry-save adder is therefore
wasted here, and it makes sense to look for architectures that allow speed
to be traded for area.
[Figure 1 plot: Area/FA (0 to 100) versus Time/ns (0 to 30) for parallel
binary adders (width w_bin), carry-save adders (width w_CS) and parallel SD
adders (number of digits Dt); curves are labelled w = 7, 11, 15, 19, 23, 31
and 3, 5, 7, 9, 11, 15 Dt.]
Figure 1: area vs. length of critical path for carry-save, binary and radix-4
adders. Since some high-radix and mixed-radix adders are built from binary
adders of lower wordlength, this figure may be used in those cases, too.
RADIX-4 ADDERS AND SUBTRACTORS
The well-known arithmetic operators based on radix-4 redundant number
representations can be used to reduce the number of stored bits for a given
resolution compared to the carry-save representation. The radix-4 maximum
redundant number system uses a digit set ;3 ;2 + 3. Therefore a digit
can be represented by 3 bits. A digit data word can be used to represent a
range1 ;4 +1 +4 ; 1] (containing 2 4 ; 1 = 22 +1 ; 1 dierent values).
Thus the resolution is roughly equivalent to a log2 (22 +1 ; 1) (2 + 1)-bit
two's complement number (e.g. a 2 digit radix-4 implementation should be
compared to a 5 bit 2's complement implementation).
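The range and the count of representable values can be checked by brute force for small word lengths. The following sketch assumes the maximally redundant digit set -3 ... +3 named above; the function name is hypothetical.

```python
from itertools import product


def radix4_values(n):
    """Set of all values representable by n radix-4 digits taken from -3 ... +3."""
    digits = range(-3, 4)
    return {
        sum(d * 4**i for i, d in enumerate(word))   # digit i has weight 4**i
        for word in product(digits, repeat=n)
    }


values = radix4_values(2)
# A 2-digit word covers the same number of values as 5 bits minus one code.
assert len(values) == 2**5 - 1
assert (min(values), max(values)) == (-(4**2 - 1), 4**2 - 1)
```

For n = 2 there are 7**2 = 49 digit strings but only 31 distinct values, which makes the redundancy of the representation directly visible.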
We synthesized radix-4 adder and subtractor architectures for various
synthesis constraints and compared the results to non-redundant binary and
carry-save implementations. Fig. 1 shows that the standard literature
recommendation for full-custom or standard semi-custom implementations,
according to which radix-4 implementations often have the optimum
implementation efficiency, does not hold in the framework of logic synthesis
(for an equivalent wordlength w of 32 bit) (footnote 2). Despite the
conceptually short critical path of 4 full-adder unit delays, the radix-4
word-parallel architecture for addition and subtraction is inferior: for the
low-speed region, the area of the binary implementations generated by logic
synthesis using the DesignWare library is smaller. For the high-speed region,
the carry-save implementations are superior to the radix-4 implementations
(footnote 3): for ES2's ECPD10 standard cell library, carry-save has a higher
speed at about the same area (Fig. 1).

Footnote 1: By evaluation of the geometric series, the maximum value of the
n-digit radix-4 number is
$3 \sum_{i=0}^{n-1} 4^i = 3 \, \frac{4^n - 1}{4 - 1} = 4^n - 1$.
Footnote 2: For cryptographic applications, typically long-wordlength
implementations, e.g. w = 256 bits, are needed. In other areas typically much
less than 32 bits are required (e.g. 3-10 bits in digital receiver design).
Thus, at first sight it would seem that the radix-4 adder/subtractor cells are
not needed. But since radix-4 digit-pipelined multipliers can have area/speed
advantages over their binary/carry-save cousins there is always a need for
processing these results directly. Converting the results to non-redundant
word-parallel form and then using binary or carry-save adders/subtractors
would typically be much more expensive.
[Figure 2 diagram: two adjacent digit slices, digit (n-1) and digit (n-2),
each with inputs x2 y2, x1 y1, x0 y0 and transfer inputs, built from
generalized full-adders (see Koren [8]) and producing the outputs z2, z1, z0
and transfer outputs.]
Figure 2: radix 4 adder
architecture                       area-optimized          time-optimized
                                   area [FA]  time [ns]    area [FA]  time [ns]
word-parallel (area per digit)
  SD adder (combin. only)          8.0        8.4          9.6        7.0
  SD subtr (combin. only)          8.4        7.9          10.1       6.9
Table 1: results of synthesis for radix-4 adders/subtractors.
HIGH-RADIX ARITHMETIC
Other variants of redundant operators with less than 2 bits of storage per
equivalent bit of resolution are known. Interesting candidates are the
well-known high-radix architectures [2,14]. For a high-radix operator (of
radix $R$) the digits may be represented by unsigned binary or two's
complement numbers of $\log_2 R + 1$ bits (in one possible implementation).
Therefore, compared to the binary non-redundant implementation, just one
additional bit of storage per digit is needed. Thus the "degree of
redundancy" may be scaled. A higher radix means fewer redundant bits per bit
of effective resolution and thus fewer registers, but also a longer critical
path, i.e. speed may be traded for area (i.e. number of storage registers for
a given resolution). Fig. 3 (left) shows the basic principle of construction
(instead of a real high-radix, e.g. r = 256, a radix-8 adder with unsigned
digits is shown for simplicity).

Footnote 3: Note that this result may be technology-dependent.
MIXED-RADIX ARITHMETIC
We found that a generalization of the high-radix scheme can be advantageous:
in these cases the overhead for re-converting the redundant representation to
binary form, which must be done in almost any application as a final step,
can be significantly reduced. To achieve this, the position of the "stored
carry" must be carefully selected. Since for the higher-radix scheme (for a
radix r that is a power of two) the position of the "stored carries" is
predetermined and "tied to" the radix, it can no longer be chosen freely. A
free selection of the position of the stored carry naturally leads to a
mixed-radix number representation, i.e. the number representation is
specified by not a single but multiple radices (cf. Fig. 3; for a more formal
definition please refer to the appendix of this paper).
[Figure 3 diagram: bit numbers and bit weights of the output stages. Left:
two radix-8 digits d0 and d1 with a stored carry of weight 8. Right: a
radix-4 digit d0 and a radix-16 digit d1 with a stored carry of weight 4.]
Figure 3: output stages of high-radix adder (radix 8, left) and mixed-radix
adder (radices 4 and 16, right)
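The construction principle of Fig. 3 can be sketched as a behavioural model in which every digit is an independent binary adder whose carry output is stored instead of being propagated. The names and the list-based data layout below are assumptions made for illustration; `widths = [3, 3]` corresponds to the radix-8 case and `widths = [2, 4]` to the mixed radices 4 and 16.

```python
def offsets(widths):
    """Bit position of the LSB of each digit, least significant digit first."""
    pos, out = 0, []
    for k in widths:
        out.append(pos)
        pos += k
    return out


def value(digits, carries, widths):
    """Numerical value; stored carry i has the weight of the LSB of digit i+1."""
    offs = offsets(widths)
    total = sum(d << o for d, o in zip(digits, offs))
    total += sum(c << (o + k) for c, o, k in zip(carries, offs, widths))
    return total


def add_binary(x, digits, carries, widths):
    """Add the binary word x; carries never travel further than one digit."""
    new_digits, new_carries = [], []
    for i, (k, o) in enumerate(zip(widths, offsets(widths))):
        x_digit = (x >> o) & ((1 << k) - 1)
        c_in = carries[i - 1] if i > 0 else 0    # stored carry of the digit below
        t = x_digit + digits[i] + c_in           # one k-bit binary adder
        new_digits.append(t & ((1 << k) - 1))
        new_carries.append(t >> k)               # always 0 or 1
    return new_digits, new_carries
```

In this model the stored carry of the most significant digit is not consumed by the next addition, so the arithmetic is modulo 2**sum(widths), as in an accumulator. Changing `widths` moves the stored carries without touching the adder code, which is the additional freedom the mixed-radix scheme provides.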
For example, consider the numerically controlled oscillator (NCO) circuit
of Fig. 4. An NCO is basically an accumulator which uses a high precision
internally but outputs only the higher-weighted bits. Its internal operation
is controlled by the increment value $i$. This high precision is needed only
internally, i.e. for the recursive update process of the oscillator's current
phase. The output value of the oscillator processed by other components of
the system is needed less accurately (footnote 4). Since in this case the
output after the addition of the phase offset is needed in binary form (it is
used for a table-select operation), a vector merging adder (VMA) is required.
A standard (i.e. non mixed-radix) implementation therefore leads to the
circuit of Fig. 4. Note that the VMA must be wide enough to include the
rightmost stored carry bit because a carry propagation from this bit level
might influence higher bits. However, bits lower than this need not be
included because here just a single bit for each weight exists and thus no
carry propagation is needed.
By using a mixed-radix number representation (Fig. 5, radices $2^9$ and
$2^{11}$) the additional carry output of the lower-weighted adder can be
processed by the carry input of the following adder (if the next component is
not an adder, subtractor or multiplier, an incrementer will have to be used).
Compared to a carry-save or radix-4 implementation the overhead for
conversion can be reduced significantly. Instead of using a complete VMA,
just one half adder of the following adder stage has to be replaced by a full
adder. Since a full adder is equivalent to two half adders, this means that a
complete vector merging adder can be replaced by just a single half adder in
this case. Even if the result should contain a few stored carries (using
lower radices) instead of just one as in Fig. 4, the re-conversion to binary
form may be significantly simpler than for a carry-save number because a
modified incrementer instead of the full vector merging adder is instantiated
by the logic synthesis tool.

Footnote 4: An additional increase in accuracy would significantly increase
the cost but only lead to a negligible additional "performance" in
communications terms.

[Figure 4 diagram: the phase increment (20 bit) feeds a high-radix adder
(20 bit equivalent resolution) with a register in the feedback loop; a 9 bit
VMA, which includes the rightmost stored carry, is followed by the addition
of the binary phase offset (half adder, LSB only) to produce the phase
output.]
Figure 4: standard implementation of NCO

[Figure 5 diagram: the increment bits i19 ... i11 feed a 9 bit adder and the
bits i10 ... i0 feed an 11 bit adder, both with a register in the feedback
loop; the binary phase offset is added to produce the phase output, with the
half adder replaced by a full-adder.]
Figure 5: mixed-radix implementation of NCO
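The behaviour of the mixed-radix NCO can be compared with a plain binary NCO in a short simulation. This is a sketch under stated assumptions: the 11/9 bit split and the 9 bit output are taken from Fig. 5, while the function names, the zero initial state and the cycle-level timing are illustrative choices.

```python
LOW_BITS, HIGH_BITS = 11, 9            # digit widths as in Fig. 5
LOW_MASK = (1 << LOW_BITS) - 1
HIGH_MASK = (1 << HIGH_BITS) - 1


def binary_nco(increment, phase_offset, steps):
    """Reference: 20 bit binary accumulator, output = upper 9 bits + offset."""
    acc, out = 0, []
    for _ in range(steps):
        acc = (acc + increment) & ((1 << (LOW_BITS + HIGH_BITS)) - 1)
        out.append(((acc >> LOW_BITS) + phase_offset) & HIGH_MASK)
    return out


def mixed_radix_nco(increment, phase_offset, steps):
    """Two independent adders; the carry of the low adder is stored, not propagated."""
    inc_lo = increment & LOW_MASK
    inc_hi = (increment >> LOW_BITS) & HIGH_MASK
    lo = hi = carry = 0
    out = []
    for _ in range(steps):
        t_lo = lo + inc_lo                  # 11 bit adder
        t_hi = hi + inc_hi + carry          # 9 bit adder, uses the previous stored carry
        lo, carry = t_lo & LOW_MASK, t_lo >> LOW_BITS
        hi = t_hi & HIGH_MASK
        # Offset adder: the stored carry enters through its carry input,
        # so no vector merging adder is needed.
        out.append((hi + phase_offset + carry) & HIGH_MASK)
    return out
```

Both functions produce the same output sequence because the phase is always lo + (hi + carry) * 2**11 modulo 2**20, and the offset adder absorbs the single stored carry.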
It might seem that the circuit is now completely unbalanced, having elements
with a short critical path in the higher-weighted region and a longer
critical path in the lower-weighted bits. This would in fact be true if the
implementation were just the chains of full adders as indicated graphically
in Figure 4. Here logic synthesis (together with the DesignWare library)
makes the difference to a traditional full-custom or semi-custom
implementation! Since the circuit is modelled as an "abstract" 9-bit adder
and an 11-bit adder working in parallel, logic synthesis selects possibly
different architectures for the two adders. The selection process is
controlled by the constraints specified for the critical path of the system.
This leads to using a faster carry-skip or carry look-ahead based
architecture for the wider adder and a ripple-carry adder for the smaller
one. Thus the synthesis tool plays the "speed vs. area tradeoff game" here:
it selects the right architectures to balance the critical paths of both
adders and thus re-balances the circuit automatically. Moreover, by this
approach we are re-using the know-how on binary non-redundant adders already
built into the tool. Fig. 1 shows that area can be traded for speed quite
smoothly for binary adders, which are our constituents here. Therefore the
match of the critical paths is good (mismatch typically < 10% for the
relevant combinations).
APPLICATION TO FRACTIONAL ARITHMETIC
The described optimization by using a mixed-radix number representation
can be applied if the following component does not require the full
precision. Multiplication of two fractional $n$-bit numbers results in a
$(2n - 1)$-bit number, in principle. But this result is usually truncated to
$n$ bits again since then the output has the same resolution as each input
(an "unlimited" increase of the wordlength over several processing stages is
not desired, of course). Therefore, for the important case of an addition
following fractional multiplications, which is used very frequently in DSP
(FIR, IIR, etc. filters, inner products, vector-matrix and matrix-matrix
products, transformations like DCT and FFT, etc.), the mixed-radix trick can
be applied as well.
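The wordlength argument can be made concrete with a small sketch. It assumes n-bit two's complement fractions, i.e. an integer a stands for a / 2**(n-1); the function names are hypothetical. The single product (-1) * (-1) is the well-known exception that needs one bit more than 2n - 1.

```python
def signed_bits(v):
    """Minimal two's complement width needed to hold the integer v."""
    return (v.bit_length() if v >= 0 else (-v - 1).bit_length()) + 1


def frac_mul_truncated(a, b, n):
    """Multiply two n-bit fractions and truncate the product back to n bits."""
    full = a * b                # full-precision product with 2n - 2 fraction bits
    return full >> (n - 1)      # drop the n - 1 low-order bits (arithmetic shift)


n = 8
half, quarter = 1 << (n - 2), 1 << (n - 3)                    # 0.5 and 0.25
assert frac_mul_truncated(half, quarter, n) == 1 << (n - 4)   # 0.125
```

Because the low-order bits are discarded anyway, only the upper part of the product has to reach the following adder in a form that it can process directly.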
CASE STUDY: PHASE/FREQUENCY SYNC UNIT
We compared the use of the different number systems within the setting of
a real-world example, the phase and frequency synchronization unit of a
digital receiver. The algorithm is not a toy example invented for showing
the applicability of redundant number systems: the non-redundant binary
implementation of the algorithm was recently fabricated (a 1 µm CMOS chip)
and is currently in commercial use in a modem for digital TV transmission
[15]. We used exactly the same wordlengths (resolution) for all the operators
in the algorithm.
Since the mixed-radix or high-radix implementations make use of DesignWare
adders, playing the area/speed tradeoff game is possible by:
- changing the time constraints in the synthesis process,
- changing the number representation for each operator,
- changing the position of the pipelining cuts [16].
We compared three implementations: a standard binary implementation, a
partially binary and carry-save implementation, and a partially binary and
mixed-radix/high-radix implementation. Synthesis was then performed for
different timing constraints. The results are shown in Fig. 6. For high
operating frequencies (> 20 MHz) the mixed-radix/high-radix implementations
are always the smallest. For operating frequencies of less than 20 MHz the
binary implementation is the best.
Surprisingly, the carry-save architecture appears to be slower than the
mixed-radix architecture. There are two reasons for this. First, the
increased area of the carry-save architecture leads to the use of a slower
delay model for the longer interconnections. Second, re-conversion to
non-redundant binary form forces complete vector merging adders (VMAs) for
carry-save. In the mixed-radix case, however, many inputs of the VMA are
constant zero (because the "degree of redundancy" is lower, for some weights
there is just a single bit). These constant terms are used by logic synthesis
to optimize the converter's circuit, which results in a faster and smaller
converter. Because of the given latency budget (for reasons of numerical
stability and time-of-convergence of the synchronization algorithm),
additional pipelining of the VMA has not been performed. If the application
circuit is not latency-bound, carry-save should always result in faster but
also larger implementations.
CONCLUSIONS
In this paper we analyze different word-parallel redundant arithmetic schemes
for add-and-shift type arithmetic operators. The implementation area and the
critical path are compared to binary adders generated automatically by the
synthesis tool. Quantitative data based on more than 100 synthesis runs for
different synthesis constraints using a CMOS target technology is presented.
A significant problem for the use of redundant arithmetic is the high
conversion overhead needed to convert the redundant number representation
back into non-redundant form. To solve this problem we derive a novel
mixed-radix number representation as a generalization of the well-known
high-radix scheme. The mixed-radix scheme makes it possible to choose the
position of the redundant bits ("stored carries"). We show that this
additional freedom may be exploited to save a significant amount of data
conversion functionality (vector merging adders) that would be required
otherwise. Since logic synthesis is used, the critical paths of digits of
different wordlength (inherent to mixed-radix) are balanced automatically.
For a real-world example, the frequency and phase synchronization unit
of a digital TV receiver, the use of mixed-radix arithmetic led to 30% higher
speed and simultaneously to area savings of 20%. This means that the 1/AT
implementation efficiency has been increased by over 50%.
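The efficiency figure follows from the two percentages under one assumption about how they are read: "30% higher speed" is taken as a critical-path time $T' = T / 1.3$ and "area savings of 20%" as an area $A' = 0.8\,A$ (the primed symbols are introduced here only for this calculation).

```latex
\frac{1}{A'T'} \;=\; \frac{1}{0.8\,A \cdot T/1.3}
             \;=\; \frac{1.3}{0.8} \cdot \frac{1}{AT}
             \;=\; 1.625 \cdot \frac{1}{AT}
```

Under this reading the 1/AT efficiency rises by about 62%, which is consistent with the stated "over 50%".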
ACKNOWLEDGEMENTS
We wish to thank our students J. Striegel and P. Micus for performing the
more than 300 logic synthesis runs required.
This research was supported by German Research Association (DFG) under grant no. Me-651/13.
[Figure 6 plot: Area/FA (260 to 460) versus Time/ns (10 to 70) for the
carry-save, mixed/high-radix redundant and two's complement non-redundant
implementations.]
Figure 6: case study: results
APPENDIX: MATHEMATICAL DESCRIPTION OF
MIXED-RADIX NUMBER SYSTEMS
A number $X$ in a (generalized) radix-$r$ positional number representation is
represented by the digit string $(d_{n-1}, d_{n-2}, \ldots, d_0)$ [17]:

$$X = \sum_{i=0}^{n-1} d_i \, r^i, \qquad d_i \in \{-\alpha, -\alpha + 1, \ldots, \beta - 1, \beta\}$$

By a straightforward generalization, the correspondence between a mixed-radix
number representation $(d_{n-1}, d_{n-2}, \ldots, d_0)$ of the number $X$ and
its numerical value may be established by the following equation:

$$X = \sum_{i=0}^{n-1} d_i \Big( \prod_{j=0}^{i-1} r_j \Big), \qquad d_i \in \{-\alpha_i, -\alpha_i + 1, \ldots, \beta_i - 1, \beta_i\}$$

given a set of radices $\{ r_i \mid i \in \{0, 1, \ldots, n-1\} \}$. The
representation is a redundant number representation if
$\alpha_i + \beta_i + 1 > r_i$ for at least one $i$.
For an efficient implementation we usually restrict ourselves to the case
$r_i = 2^k$, $\alpha_i = 2^p$, $\beta_i = 2^q$, $k, p, q \in \{0, 1, 2, \ldots\}$.
In this case the circuits may be constructed using standard two's complement
adders for the digit-level computations.
A tool supporting the different redundant number representations discussed
in this paper needs to handle only the mixed-radix case. All other
representations, including the non-redundant binary representation, may be
treated as special cases in this mathematical framework, of course.
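How a tool might evaluate such a representation can be sketched in a few lines. The sketch reads the definition of this appendix as: digit i has the weight given by the product of all lower radices, and digit i ranges from -alpha_i to +beta_i. The function names and the list-based interface are hypothetical.

```python
from math import prod


def mixed_radix_value(digits, radices):
    """Value of a mixed-radix digit string, least significant digit first."""
    # Digit i is weighted by the product of the radices below it (1 for i = 0).
    return sum(d * prod(radices[:i]) for i, d in enumerate(digits))


def is_redundant(radices, alphas, betas):
    """True if at least one digit set is larger than its radix."""
    return any(a + b + 1 > r for r, a, b in zip(radices, alphas, betas))


# Non-redundant binary as a special case: all radices 2, digits 0 or 1.
assert mixed_radix_value([1, 0, 1], [2, 2, 2]) == 5
assert not is_redundant([2, 2, 2], [0, 0, 0], [1, 1, 1])
# Radix-4 with the digit set -3 ... +3 is redundant.
assert is_redundant([4], [3], [3])
```

With equal radices the function reduces to the ordinary radix-r value, so the fixed-radix and binary cases need no separate code path.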
References
[1] A. Avizienis, "Signed-digit number representations for fast parallel
arithmetic," IRE Trans. EC, vol. EC-10, pp. 389-400, 1961.
[2] D. E. Atkins, "Design of the arithmetic units of Illiac III: Use of
redundancy and higher radix methods," IEEE Transactions on Computers,
vol. C-19, no. 8, pp. 720-733, Aug. 1970.
[3] T. M.D.Ercegovac, "On-line arithmetic: An overview," 1984.
[4] M. J. Irwin and M. Owens, "Digit-pipelined arithmetic as illustrated by
the paste-up system: A tutorial," IEEE Computer, no. 4, pp. 61-73, 1987.
[5] T. Noll, "Carry-Save Architectures for High-Speed Digital Signal
Processing," Journal of VLSI Signal Processing, vol. 3, pp. 121-140,
June 1991.
[6] N. R. Scott, Computer number systems and arithmetic. Englewood Cliffs:
Prentice Hall, 1988.
[7] K. Hwang, Computer Arithmetic Algorithms: Principles, Architecture and
Design. John Wiley & Sons, 1979.
[8] I. Koren, Computer Arithmetic Algorithms. Englewood Cliffs, New Jersey:
Prentice-Hall, 1993.
[9] A. R. Omondi, Computer Arithmetic Systems. Campus 400, Maylands Avenue,
Hertfordshire: Prentice-Hall International (UK) Limited, 1994.
[10] J. M. D.W. Trainor, R.F. Woods, "Architectural synthesis of an image
processing algorithm using IRIS," in VLSI Signal Processing VIII,
pp. 167-176, 1995.
[11] Synopsys Inc., DesignWare Components Databook Version 3.1a, Document
Order Number 1uS01-10061. March 1994.
[12] R. Brent and H. Kung, "A regular layout for parallel adders," IEEE Trans.
Computers, vol. C-31, pp. 260-264, 1982.
[13] European Silicon Structures, ECPD10 Library Databook. 1992.
[14] E. M. Schwartz and M. J. Flynn, "Cost-efficient high-radix division,"
Journal of VLSI Signal Processing, vol. 3, no. 4, Oct. 1991.
[15] M. Vaupel and H. Meyr, "Applying a Seamless Design Flow to Fast
Development of a Carrier Synchronizer for MPSK," in VLSI Signal Processing
VIII (T. Nishitani and K. Parhi, eds.), pp. 126-134, IEEE, 1995.
[16] C. Leiserson and J. Saxe, "Optimizing synchronous systems," Journal of
VLSI and Computer Systems, vol. 1, pp. 41-67, 1983.
[17] B. Parhami, "Generalized signed-digit number systems: A unifying
framework for redundant number representations," IEEE Transactions on
Computers, vol. 39, no. 1, pp. 89-98, 1990.
[18] T. Noll, "Carry-save arithmetic for high-speed digital signal
processing," in IEEE ISCAS'90, vol. 2, pp. 982-86, 1990.