Description and Analysis of Multipliers using Lava

Master's thesis in computing science

By: Emil Axelsson, 780226-4874, CTH
Examiner: Mary Sheeran

Department of Computing Science
Chalmers University of Technology
412 96 Göteborg

2003-04-24
Abstract
Lava is a hardware description language embedded in the powerful functional programming language Haskell. This report deals with the question: "Is Lava a suitable language for the construction of binary multipliers?"

The most common multiplication methods are described systematically, and the Lava descriptions are developed in parallel. All methods can be split into two basic steps, so we define a general interface where any method for the first step can be combined with any method for the second step. In this way, a variety of multiplier circuits, all with different properties, can be expressed as higher-level combinations of the two basic steps.

The aim is both to explain the methods theoretically and to obtain workable Lava descriptions. It is shown that Lava is a very powerful tool for both constructing and verifying generic binary multipliers. The functional nature of Lava allows complex generic networks to be described by small, abstract and easy-to-understand expressions. It is also shown how the Lava environment can be slightly modified to obtain self-optimising hardware and simple performance estimation methods. Finally, the described methods are compared for speed and size, using the developed estimation methods.
Contents

Abstract
Contents
Introduction
    Background
    Circuit speed and size
1  Binary multiplication
   1.1  The basic algorithm
   1.2  Implementation alternatives
        1.2.1  Dividing the algorithm in two steps
        1.2.2  General representation of partial products
2  The multiplier circuit
   2.1  Hiding functions
   2.2  Partial product generation (PPG)
        2.2.1  Bit multiplier
        2.2.2  Simple PPG (1-bit selection)
   2.3  Summation networks
        2.3.1  Full and half adders; bit adders and bit counters
        2.3.2  Carry-propagate adder
        2.3.3  Linear array summation
        2.3.4  Adder tree summation
   2.4  Combining PPG and summation network
        2.4.1  Simulation
        2.4.2  Verification
3  Booth's algorithm
   3.1  Partial product selection
        3.1.1  Two selection methods
        3.1.2  The selection circuit
   3.2  Booth encoding (s-bit selection)
   3.3  Improved Booth's algorithm
        3.3.1  A new selection method, 2-bit
        3.3.2  Negative partial products
        3.3.3  Generalization to s-bit
4  Improved summation methods
   4.1  Faster adders
        4.1.1  The carry-save adder
        4.1.2  Logarithmic adder
   4.2  Carry-save array
   4.3  Wallace tree
5  Non-standard interpretation
   5.1  Non-standard gates
        5.1.1  Self-reducing gates
        5.1.2  Time estimation
        5.1.3  Size estimation
   5.2  New environment for building circuits with non-standard gates
        5.2.1  New types
        5.2.2  Constants
        5.2.3  Type conversions
        5.2.4  Manipulating the information
        5.2.5  New gates
   5.3  Non-standard circuits
6  Dadda summation
   6.1  Wallace disadvantages
   6.2  Improved Wallace; the Dadda summation network
7  Results
   7.1  Verification
   7.2  Performance comparison
        7.2.1  Partial product generators
        7.2.2  Summation networks
        7.2.3  Regular multipliers
        7.2.4  Summary
8  Conclusions
Appendix A – Lava code
Appendix B – Lava code NSI
Future work
Related work
    Lava related
    Multipliers, verification
Reference list
Introduction
Background
The reason for doing this work is to explore the use of the hardware description language
Lava for the description and analysis of binary multiplier circuits. At the same time, the report
explains all the described circuits, so it can also be used as a source for studying multipliers. I
have chosen to examine the most common methods for binary multiplication, and restricted
the work to multiplication of positive numbers only.
The work is interesting, first because it demonstrates the abilities of Lava, and second because circuits described in Lava can easily be verified for correctness. Multiplier verification is known to be a difficult task.
In the text, it is assumed that the reader has some knowledge about the basics of digital
circuits and digital construction. It is also assumed that the reader is familiar with Lava,
although other readers should be able to assimilate the text too. For more information, see the
Lava tutorial [1].
In the text, bits are written with lower-case letters, and numbers with upper-case letters. Normally, the number N has n bits and is written as:

    N = n_{n-1} ... n_1 n_0
In Lava, numbers are normally represented as lists of bits (type [Signal Bool]), where
the least significant bit (LSB) is the first element. Single bits may have any name, and lists of
bits end with the letter s (denoting plural), for example:
as = [a0,a1,a2,a3]
Here is a simple Lava example, to help unfamiliar readers to follow the text. The full
adder circuit fullAdd takes three bits, a, b and ci, as inputs and returns two bits, s and co. It
can be defined like this:
fullAdd (ci,(a,b)) = (s,co)
  where
    g  = and2 (a,b)
    p  = xor2 (a,b)
    s  = xor2 (p,ci)
    x  = and2 (p,ci)
    co = or2 (g,x)
This creates a network of two AND gates, two XOR gates and one OR gate, connected by
wires with the names ci, a, b, s, co, g, p and x. This is shown in the following figure:
[Figure: the full adder as a box FA with inputs a, b and ci and outputs s and co, together with its gate-level realization: two AND gates producing g and x, two XOR gates producing p and s, and an OR gate producing co.]
Simulation of the circuit is done with the function simulate, by giving the inputs the
values low (zero) or high (one):
Main> simulate fullAdd (high,(low,high))
(low,high)
Main> simulate fullAdd (low,(low,high))
(high,low)
Main> simulate fullAdd (high,(high,high))
(high,high)
Most of the Lava code is described in the text, and all code is listed in the appendices. Appendix A contains the code for the circuit descriptions in chapters 2–4, and Appendix B contains the code for the non-standard interpretation introduced in chapter 5.
Circuit speed and size
In this text, when the performance of circuits is compared, it is always done in terms of circuit speed and size. A good estimate of a circuit's size is the total number of gates used. The actual chip size of a circuit also depends on how the gates are placed on the chip – the circuit's layout. Since this text does not deal with layout, the only thing we can say about this is that regular circuits are usually smaller than non-regular ones (for the same number of gates), because regularity allows a more compact layout.
The physical delay of circuits originates from the small delays in single gates, and from
the wiring between them. The delay of a wire depends on how long it is. Therefore, it is
difficult to model the wiring delay; it requires knowledge about the circuit’s layout on the
chip. The gate delay, however, can easily be modelled by saying that the output is delayed a
constant amount of time from the latest input. What we can say about the wiring delay is that
larger circuits have longer wires, and hence more wiring delay. It follows that a circuit with a
regular layout usually has shorter wires and hence less wiring delay than a non-regular circuit.
Therefore, if circuit delay is estimated as the total gate delay, one should also have in mind
the circuit’s size and amount of regularity, when comparing it to other circuits.
"Delay" usually refers to the "worst-case delay". That is, if the delay of the output is dependent on the inputs given, it is always the largest possible output delay that sets the speed. Furthermore, if different bits in the output have different worst-case delays, it is always the slowest bit that sets the delay for the whole output. The slowest path between any input bit and any output bit is called the "critical path". If a circuit is to be sped up, it is always the critical path that should be attacked first.
1 Binary multiplication
This chapter describes the theory of binary
multiplication through illustrative examples.
The algorithm can be split into two steps, and
each step can be implemented in different ways.
A general interface between the steps is
presented, which allows multiplier circuits to be
defined as abstract combinations of different
implementations of the two steps.
1.1 The basic algorithm
The basic algorithm for multiplication of two binary numbers, M (multiplier) and N (multiplicand), makes use of the distributivity property of multiplication. That is, if M can be written as a sum of smaller numbers, M = M_0 + M_1 + ... + M_{m-1}, then the multiplication M·N can be written as

    M·N = (M_0 + M_1 + ... + M_{m-1})·N
        = M_0·N + M_1·N + ... + M_{m-1}·N        (1)
The terms on the right hand side of this equation are called partial products – smaller
products, each one representing only a part of the total product. A multiplication algorithm
finds a simple way to decompose M into the sum M 0 + M 1 + ... + M m −1 , where the terms are
sufficiently small to allow a simple calculation of the partial products. Then the total product
is computed by summation of all partial products.
The value of a binary number M = m_{m-1}...m_1m_0 can be written as

    M = ∑_{i=0}^{m-1} m_i·2^i        (2)

where the m_i are the different bits of M, and m is the total number of bits. For example, the binary number 00110110_2 used in this equation gives:

    00110110_2 = 0 + 1·2 + 1·4 + 0 + 1·16 + 1·32 + 0 + 0 = 54_10
If we write this sum with binary numbers instead (recalling that multiplying a binary number by 2^i is equivalent to shifting it left i steps), we get the sum represented by Figure 1.1 (a). We see that each term consists of one significant bit only (printed in bold style); the rest are just zeroes from the shifting. This way we have rewritten the binary number as a sum of all its bits, shifted to the right positions. If we group the terms two by two and add them, we get a shorter sum – with the same result – shown in Figure 1.1 (b). This is a sum of shifted 2-bit groups, where the bit groups are obtained by grouping the bits of the number two by two.
(a) 00110110 as a sum of its shifted bits:

               0
              10
             100
            0000
           10000
          100000
         0000000
    +   00000000
    ------------
        00110110

(b) 00110110, grouped as 00 11 01 10, as a sum of shifted 2-bit groups:

              10
            0100
          110000
    +   00000000
    ------------
        00110110

Figure 1.1 – Decomposing a binary number into a sum of (a) shifted bits; (b) shifted bit groups.
This shows that any binary number can be rewritten as a sum of shifted bit groups,
consisting of one or more bits. Since the bit groups and the amount of shifting are easily
extracted from the binary number, this means that we have found a simple way of
decomposing the multiplier into a sum of shorter numbers. This can be used for computing
binary multiplication as a sum of partial products. If the ith bit group Si is shifted x steps, we
can write the ith term in the multiplier decomposition as
    M_i = S_i · 2^x

Then the ith partial product P_i is given by

    P_i = M_i · N = (S_i · 2^x) · N = (S_i · N) · 2^x        (3)
So, each partial product is obtained by a multiplication (S_i · N) that can be thought of as a selection of the partial product value, and some shifting that is independent of the selected value (multiplication by 2^x). Therefore, the bit group S_i is called a selection group, and this method for computing the partial products is called s-bit selection, where s is the number of bits in S_i. The following statement defines the method:
Statement 1
In s-bit selection, the multiplier bits are grouped into groups of s bits,
called selection groups. The ith partial product Pi is obtained by
multiplying the ith selection group Si with the multiplicand N, and
shifting it to the same position x as the selection group.
By choosing the group length s in this method, we can make trade-offs between the
number of partial products, and the complexity of computing them. If we use 1-bit selection,
we get m partial products, where m is the number of bits in M, and a very simple way of
computing them (1-bit multiplication). If we use s-bit selection instead, we get m / s partial
products and a computation complexity increasing with s.
Example 1.1 shows a simple example of multiplication with 1-bit selection, and Example
1.2 shows the same example with 2-bit selection.
Example 1.1
Calculate the product of the two binary numbers M = 0110 (6)
and N = 1001 (9), using 1-bit selection.
Solution:

    Multiplicand (N):        1001  (9)
    Multiplier (M):        * 0110  (6)
                          -------
                             0000  \
                            1001    |  partial
                           1001     |  products
                        + 0000     /
                        ---------
                         00110110  (54)
To obtain the partial products, each bit of the multiplier,
starting from the least significant bit (LSB), is multiplied with
the multiplicand. This means that each partial product is
chosen to be either 0 – if the multiplier bit is zero, or N – if the
multiplier bit is one. The partial products are then shifted to
get the same magnitude as their corresponding multiplier bits,
and finally they are all summed up to generate the final
product.
Example 1.2
Calculate the product of the two binary numbers M = 0110 (6)
and N = 1001 (9), using 2-bit selection.
Solution:

    Multiplicand (N):        1001  (9)
    Multiplier (M):        * 0110  (6)
                          -------
                           010010      (10 · 1001 = 2·9)
                        + 001001       (01 · 1001 = 1·9, shifted ×4)
                        ---------
                         00110110  (54)
First, the bits of the multiplier are grouped two by two, to form the selection groups. Then the partial products are computed by multiplying each group with N and shifting it to the same position as the selection group. This means that the value of each partial product is selected from the set {0, N, 2N, 3N}. The result is given by summing all partial products.
The method in Example 1.2 is known as Booth’s algorithm [3]. More generally, we say
that a binary multiplier with s-bit selection, where s > 1 , uses s-bit Booth encoding. An
improved Booth algorithm is described in section 3.3.
So, how many bits are needed to represent all possible results of multiplication? The largest value of an m-bit binary number is M_max = 111...1_2 (all digits = 1). Using equation (2), this gives us a geometric sum with the value

    M_max = ∑_{k=0}^{m-1} 2^k = 2^m − 1        (4)

If we solve this equation for m, we get the number of bits needed to represent the number M_max:

    m = log_2(M_max + 1)        (5)
Note that this equation is only valid for values that can be written as M_max = 2^m − 1, where m is an integer. If we have a number X such that 2^{m−1} − 1 < X ≤ 2^m − 1, and want to know how many bits n are required to represent it, we can reason that we need more than m − 1 bits, but not more than m bits, and hence we need m bits. If we use X in equation (5), we get a result in the interval (m − 1, m], which says the same thing. Therefore, we can easily modify equation (5) so that it always gives the result m for the value X:

    n = ⌈log_2(X + 1)⌉        (6)
This equation is valid for all values of X. So if, for example, we want to know how many bits the number 234 requires, equation (6) gives:

    ⌈log_2(234 + 1)⌉ = ⌈7.87⌉ = 8
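Equation (6) is easy to transcribe into plain Haskell for experimentation. Here is a minimal sketch of my own (the name bitsNeeded is not from the report's code); it is adequate for small values, where floating-point log2 is precise enough:

-- Number of bits needed to represent a non-negative integer x;
-- a direct transcription of equation (6): n = ceiling(log2(x + 1)).
bitsNeeded :: Integer -> Int
bitsNeeded x = ceiling (logBase 2 (fromIntegral x + 1))

-- bitsNeeded 234 == 8, matching the worked example above.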
Now we can write an expression for the number of bits p needed for multiplication of two numbers of length m and n:

    p = ⌈log_2((2^m − 1)·(2^n − 1) + 1)⌉
      = ⌈log_2(2^{m+n} − 2^m − 2^n + 2)⌉
      = ⌈log_2(2^m · 2^n · (1 − 2^{−m} − 2^{−n} + 2^{1−m−n}))⌉
      = m + n + ⌈log_2(1 − 2^{−m} − 2^{−n} + 2^{1−m−n})⌉        (7)
If we analyse the argument of the logarithm, A = 1 − 2^{−m} − 2^{−n} + 2^{1−m−n}, we see that it is a strictly monotonic (increasing) function of both m and n. Its minimum value is attained for

    m = 1 or n = 1   ⇒   A = 1/2

and it approaches its supremum as

    (m, n) → ∞   ⇒   A → 1

This means that log_2 A lies in the interval [−1, 0), and if we use this in (7), we get:

    p = m,        for n = 1
    p = n,        for m = 1
    p = m + n,    for m > 1 and n > 1
From this we make the following statement:
Statement 2
Multiplication of two numbers of length m and n needs m + n bits for the result, except when one of the numbers has a single bit, in which case the number of bits needed equals the length of the other number.
1.2 Implementation alternatives
1.2.1 Dividing the algorithm in two steps
As we have seen before, all multiplication methods can be split into two separate steps:
1) Generation of partial products
2) Summation of the partial products
In modern microprocessors, where multiplication uses a significant part of the
computation time, the aim is to make the multiplier as fast as possible. Referring to the two
steps of all multiplication methods, there are two ways to make multiplication faster:
a) Reduction of the number of partial products
b) Reduction of the time required for summing the partial products
The following chapters describe different methods for the two basic steps. Different
methods for the first step can be combined with different methods for the second step in order
to form the best-suited multiplier for any application.
1.2.2 General representation of partial products
Our goal is to be able to combine any partial product generator (PPG) with any
summation network. For this we need a general and abstract interface between these two
blocks, a representation of the partial products that lets us concentrate on behaviour instead of
bit-level details.
In Example 1.1, each partial product consisted of four bits only, and these were manually placed at the right position in the summation table. Before and after each partial product were empty gaps without any bits to be added. For such circuits it would be difficult to make a general summation network that works with all possible partial product generators. This is because different PPGs produce partial products of different lengths, and also place them differently. The summation network would have to know, for any PPG, where the empty gaps are, in order to perform an efficient summation.

It turns out that the most abstract way to make a general summation network is to make PPGs that place all partial products at the same position. Instead, shifting is done by filling the empty gaps with zeroes. Then there will be a lot of unnecessary zeroes added, which require unnecessary hardware. But we gain compatibility in the interface between the PPGs and the summation networks. And if a more hardware-efficient implementation is needed, we can use summation networks that recognize constant zeroes as empty gaps and do not add them (see chapter 5). This also allows empty gaps inside the partial products. The partial products in Example 1.1 would then look like Figure 1.2.
          0000
         10010
        100100
    +  0000000
    ----------
      00110110

Figure 1.2 – Shifting partial products with zero padding.
Different summation networks may have different ways to add the partial products. For
example, four partial products, P0, P1, P2 and P3, can be summed as:
    S = ((P_0 + P_1) + P_2) + P_3        (linear array)

or as

    S = (P_0 + P_1) + (P_2 + P_3)        (adder tree)
In Figure 1.3 we see how this is done at bit-level. The different “steps” of the summation
are separated by horizontal lines. When P0 and P1 are added together in the first step of the
linear array, their result is placed at the top of the next step, and so on. The adder tree makes
two additions in the first step, so it needs fewer steps to compute the sum.
To get the right number of bits in the partial sums and the final sum, we sometimes need
to include the carry-out bits after addition. In the linear array, this is needed for all additions,
but for the adder tree, it is only needed in the first step (see Figure 1.3). However, we cannot
concentrate on such details when making abstract summation networks; we need some way to
overcome this inconvenience.
[Figure: bit-level view of the two summations of P_0 ... P_3; the steps are separated by horizontal lines, and the legend distinguishes ordinary bits, carry-out bits and padded zeroes.]

Figure 1.3 – (a) Linear array summation. (b) Adder tree summation.
The simplest way to overcome this would be to always use adders that don't include the carry-out bits in the result. Instead, their result should have the same number of bits as the longest input number. But how do we know how many bits are needed for all partial sums? We must be sure that we don't miss any significant bits.
In the summation networks, three types of additions can occur:
1) Two partial products are added together
2) One partial product is added to a partial sum of partial products
3) Two partial sums are added together
If the no-carry adders discussed above are used in the summation networks, it is always
the longest partial product (alone, or inside a partial sum) that sets the size of the result.
Consider any addition where the ith partial product Pi sets the size. The largest possible result
of this addition is given if Pi is added to the sum of all shorter partial products, which in fact
is the same as the result of a multiplication where Pi is the last partial product. So now, we
can state that:
Statement 3
In a list of partial products, the ith partial product Pi should have the
same number of bits as the result of a multiplication where Pi is the
last partial product.
If this rule is followed and the no-carry adders are used, we will automatically get the
right number of bits for all partial sums, and also for the final sum! And it is a general rule
that works for Booth encoding as well.
Let us now see how many bits this rule implies for P_i in the general case. Consider an m × n-bit multiplication of two numbers M and N. If we use 1-bit selection, we get P_i as N multiplied with m_i (the ith bit of M) and shifted to the same position x as m_i, according to Statement 1. In Figure 1.4 (a), we see that if P_i were the last partial product, m_i would be the last bit, and we would have a multiplier with x + 1 bits. The multiplicand still has n bits, so the result would require n + x + 1 bits according to Statement 2. We also see that for P_i to have n + x + 1 bits, we need to add a zero-bit (grey dot) after the most significant bit (MSB).

If s-bit selection is used instead, where s > 1, we get P_i by multiplying N with S_i (the ith selection group in M) and shifting it to the same position x as S_i. Figure 1.4 (b) shows that if P_i were the last partial product, S_i would be the last selection group, and we would have a multiplier with x + s bits instead. Then the result requires n + x + s bits. But because s > 1, the multiplication S_i · N always gives us n + s bits (Statement 2), so here we don't need to add any extra bits after the MSB, as in the one-bit case.
[Figure: bit layouts of M and P_i. In (a), the bit m_i sits at position x, so a multiplier ending there has x + 1 bits and the product m_i · N needs n + x + 1 bits; a zero-bit (grey dot) is added after its MSB. In (b), the selection group S_i sits at position x, so the multiplier has x + s bits and the product S_i · N, which already has n + s bits, needs n + x + s bits in total.]

Figure 1.4 – The length of the ith partial product for (a) 1-bit selection. (b) s-bit selection.
Now we can also see that the one-bit expression is only a special case of the s-bit
expression. If we set the selection group length to s = 1, we get the same expression. This
gives us the following statement:
Statement 4
When using s-bit selection, the ith partial product Pi requires
n + x + s bits to hold all possible results after additions, where n is
the number of bits in the multiplicand, s is the number of bits in the
selection groups, and x is the position of the ith selection group Si.
If the multiplication S_i · N is done so that the result has n + s bits, P_i will automatically get the right number of bits. We know from Statement 2 that this is the case for all multiplications except when s = 1. This is the reason why we must add an extra zero-bit after the MSB when doing 1-bit selection, and it is also the reason why, in Figure 1.3, we needed to include the carry-out bits after additions with single partial products.
If we go back to the partial products in Example 1.1, which were computed by one-bit
selection, we now have to add zeroes after MSB in each partial product to give them proper
lengths. This is seen in Figure 1.5.
         00000
        010010
       0100100
    + 00000000
    ----------
      00110110

Figure 1.5 – Zero padding for controlling the size of additions.
If the partial products have this shape, and we use no-carry adders for additions, the linear
array and the adder tree (and any other network of adders) can sum without worrying about
the carries, as shown in Figure 1.6. All partial sums will have the right number of bits.
[Figure: bit-level view of the two summations of P_0 ... P_3 with the zero-padded partial products; no carry-out bits are needed. The legend distinguishes ordinary bits and padded zeroes.]

Figure 1.6 – (a) Linear array summation. (b) Adder tree summation.
Here is a summary of what we expect from the PPG and the summation network in order
to get an abstract and general way of combining them:
PPG:
• Creates a list of partial products, all having the same magnitude (LSB in the same
position). Shifting is done by padding with constant zero bits.
• All partial products Pi have the same length as the result of a multiplication where
Pi was the last partial product.
Summation network:
• Sums so that all partial sums have the same length as the longest input term (carry-out bits ignored).
• If a hardware-efficient implementation is needed, bits with constant zero value should be recognized as empty gaps, so that no extra hardware is used to add them.
2 The multiplier circuit
This chapter shows how the multiplier circuit is
described in Lava, together with the helper
circuits it uses. Only the simplest methods are
dealt with. The constructed multipliers are
simulated and verified, and it is shown how the
interface between partial product generation
and summation network works in Lava.
2.1 Hiding functions
In this chapter, some of the built-in Lava functions will be redefined. Therefore, we must
hide the names of these functions. This is done when the packages are imported (se the full
code in Appendix A):
import Lava hiding (mux)
import Arithmetic hiding (halfAdd, fullAdd, bitMulti)
This will hide the names of the listed functions, allowing other functions to be defined
with their names. Should one of the hidden functions be needed, it is reached by writing the
package name and a dot before the function name, for example:
Main> simulate Lava.mux (low,([low],[low]))
[low]
2.2 Partial product generation (PPG)
2.2.1 Bit multiplier
A 1-bit selection PPG uses a bit multiplier to generate the values of the partial products. A
bit multiplier has a bit a, and a number N as inputs. Its output P is given by the table in Figure
2.1 (a), and it is realized by a simple row of AND gates as in Figure 2.1 (b).
The wiring network to the left of the AND gates in the figure can be generalized to the
connection pattern distributeLeft, which creates a list of pairs from a and the bits of N:
distributeLeft (a,[]) = []
distributeLeft (a, b:bs) = (a,b):(distributeLeft (a,bs))
Then the bit multiplier is given by mapping the list of pairs onto as many AND gates as
are needed. This is done with the Haskell function map:
bitMulti = distributeLeft ->- map and2
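As a quick check, bitMulti should return either all zeroes or a copy of N, depending on the bit a. One would expect a session along these lines (my own example, not from the report):

Main> simulate bitMulti (high,[high,low,low,high])
[high,low,low,high]
Main> simulate bitMulti (low,[high,low,low,high])
[low,low,low,low]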
    a | P = a·N
    --+--------
    0 |   0
    1 |   N

[Figure: each bit n_0 ... n_n of N feeds an AND gate together with a, producing the product bits p_0 ... p_n.]

Figure 2.1 – (a) The bit multiplier's function table. (b) A bit multiplier as a column of AND gates.
2.2.2 Simple PPG (1-bit selection)
Simple PPG is the method used in Example 1.1. In Lava, simple PPG looks like this:
ppgSimple (as,bs) = ppgSimpleHelp 0 (as,bs)
  where
    ppgSimpleHelp i ([],bs)   = []
    ppgSimpleHelp i (a:as,bs) = p:ps
      where
        p  = (zeroList i) ++ (bitMulti (a,bs)) ++ [low]
        ps = ppgSimpleHelp (i+1) (as,bs)
This function (like all PPGs in the text) returns a list of numbers. Here, as is the multiplier and bs is the multiplicand. The helper function has an argument i that tells which partial product is currently being computed. It is set to zero in the first call, and then increased by one in each recursion step. The value of p (the ith partial product) is selected by multiplying bs with a (the ith bit of as), shifting it i steps by padding with zeroes, and finally adding a zero-bit after the MSB (Statement 4).
If we test the function with the same inputs as in Example 1.1, we get (LSB is always the
first element in the lists):
Main> simulate ppgSimple ([low,high,high,low],[high,low,low,high])
[
[low,low,low,low,low],
[low,high,low,low,high,low],
[low,low,high,low,low,high,low],
[low,low,low,low,low,low,low,low]
]
2.3 Summation networks
2.3.1 Full and half adders; bit adders and bit counters
As we have seen, all multiplication algorithms compute their result as a sum of partial
products, and for this we need to construct binary adders. Example 2.1 is a repetition of how
binary addition works.
Example 2.1
Calculate the sum of the two binary numbers A = 1011 (11)
and B = 1001 (9).
Solution:

    Carry bits:   1 0 1 1
    A:              1011  (11)
    B:            + 1001  (9)
                 -------
    Sum:           10100  (20)
The bits are added one by one, so that the LSBs from both numbers are added first, then the next bits, and so on. An addition whose result has more than one bit produces a two-bit binary number s_1s_0. The bit s_1 is passed as a carry bit (the bits at the top) to the next higher bit addition, and s_0 goes to the sum. If an addition stage has a carry bit, it is added together with the other bits.
The addition in the example has five addition stages, where each stage represents one bit
addition. One stage adds two number bits a and b and one carry-in bit ci (from the previous
stage), and results in one sum bit s and one carry-out bit co (to the next stage). The exception
is the first stage that has no carry-in, and the last stage that has no number bits. In general,
there are three input bits to a stage, all shifted the same amount. So, the result should be a
number between 0 and 3, also shifted the same amount.
The so-called stages can be realized by digital circuits called full adders (FA) and half adders (HA). Their truth tables are shown in Figure 2.2. A full adder takes three bits a, b and ci as input, and returns their sum as the two-bit number c_o s, where s is the least significant bit. A half adder is simply a full adder without the carry-in. The reader can verify that these functions give us proper bit adders. In some arrangements it is easier to think of them as bit counters rather than bit adders. In the FA truth table, we can see that the output number is always equal to the number of ones present at the inputs, independent of the input ordering. Therefore, we can also say that this function is able to count ones. This is why a HA is sometimes called a 2-2 counter, and a FA a 3-2 counter.
(a) Half adder              (b) Full adder

    a b | s co                  ci a b | s co
    ----+-----                  -------+-----
    0 0 | 0 0                   0  0 0 | 0 0
    0 1 | 1 0                   0  0 1 | 1 0
    1 0 | 1 0                   0  1 0 | 1 0
    1 1 | 0 1                   0  1 1 | 0 1
                                1  0 0 | 1 0
                                1  0 1 | 0 1
                                1  1 0 | 0 1
                                1  1 1 | 1 1

Figure 2.2 – Truth tables for (a) the half adder and (b) the full adder.
The half adder is a simple circuit, with only two standard gates. Its Boolean functions are

    s  = a ⊕ b
    co = a · b        (8)

For defining the full adder, we will use two helper signals – g (generate) and p (propagate) – that tell how the carry-out behaves. If g = 1, a carry-out is generated in the full adder, which means that co = 1. And if p = 1, the carry-in bit is propagated through the full adder, which means that co = ci. The expressions for g and p are

    g = a · b
    p = a ⊕ b        (9)

and with these definitions, s and co are computed as

    s  = p ⊕ ci
    co = g + p · ci        (10)
Although the g and p signals are not necessary for making the circuit, it is meaningful to
define them here, because they play a central role when making the logarithmic adder in
section 4.1.2.
Here is the Lava code for the half adder and the full adder:
halfAdd (a,b) = (s,co)
  where
    s  = xor2 (a,b)
    co = and2 (a,b)

fullAdd (ci,(a,b)) = (s,co)
  where
    g  = and2 (a,b)
    p  = xor2 (a,b)
    s  = xor2 (p,ci)
    co = or2 (g, and2 (p,ci))
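A quick simulation check against the truth tables in Figure 2.2 (my own session; the expected outputs follow directly from the tables):

Main> simulate halfAdd (high,high)
(low,high)
Main> simulate fullAdd (low,(high,high))
(low,high)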
In all the multipliers of this text, summation of the partial products is done through bit
counters. They are arranged either as explicit adders (for example linear array summation,
section 2.3.3), or in some other, more sophisticated, way (for example Dadda summation,
section 5.3).
2.3.2 Carry-propagate adder
Now we can connect several full adders in a row, to get the bit addition stages discussed in section 2.3.1. In Figure 2.3, we see a four-bit adder that realizes S = A + B for the binary numbers A = a_3a_2a_1a_0, B = b_3b_2b_1b_0 and S = (c_{o,3})s_3s_2s_1s_0. It can easily be generalized to an arbitrary-length adder, using the same pattern. The last carry-out c_{o,3} can be included in the sum if needed (as MSB). However, usually we don't have any carry-in to the first stage, so we can remove the c_{i,0} signal in the figure. Then the leftmost full adder can be reduced to a half adder instead. In real circuits, there is always a small delay between the inputs and the outputs. Since no bit addition can be performed before the previous carry bit has been computed, we will get a situation where one carry-out bit is waiting for the previous one. This results in a propagation of the carries from the left to the right in the circuit, and therefore this is called a carry-propagate adder (CPA). The carry chain is the critical path of the adder, and the number of counters in the chain is equal to the number of bits n in the two numbers being added. So the speed of the adder is O(n), which is relatively slow (for adding large numbers). But because of the simple structure, this is the smallest possible adder.
[Figure: four full adders in a row; stage i takes a_i, b_i and the carry from the previous stage, and produces s_i and the carry to the next stage. The first stage takes c_{i,0}, and the last carry-out is c_{o,3}.]

Figure 2.3 – The carry-propagate adder.
A CPA with carry-in/out is easily written in Lava using the connection pattern row (see
[1]):
cpaCarry = row fullAdd
The inputs and outputs have the following form:
cpaCarry (ci, abs) = (ss, co)
The input abs should be a list of bit-pairs from the two numbers that are to be added. So if
we want to add two number as and bs, we have to zip them first using:
abs = zipp (as,bs)
But the discussion in section 1.2.2 stated that we need adders that work for numbers of
different lengths, and also that the carry-out bit should be excluded. The problem with
different lengths of the inputs arises when we try to zip them, which results in an error. So let
us first make a new zip function called zipp2 that makes use of the following functions:
head2 [] = low
head2 as = head as
tail2 [] = []
tail2 as = tail as
They work like the usual head and tail functions, but when picking the head of an empty
list, we get a low in return. And picking the tail of an empty list gives back an empty list,
instead of an error message. Now we can write zipp2 as:
zipp2 ([],[]) = []
zipp2 (as,bs) = (head2 as, head2 bs):(zipp2 (tail2 as, tail2 bs))
It picks the heads of as and bs in each step, and combines them as a pair. But if one of the numbers finishes before the other, the head2 function pads this number with zeroes for the remaining pairs. When both numbers are finished, zipp2 stops. Here is a simple example of how it works (var "a1" gives a symbolic value as input – a variable with the name a1):
Main> zipp2 ([var"a1", var"a2", var"a3", var"a4"],[var"b1", var"b2"])
[(a1,b1),(a2,b2),(a3,low),(a4,low)]
If we use zipp2 and exclude the carry-in/out in our cpaCarry, we can write a new one as:
cpa (as,bs) = ss
where (ss,_) = cpaCarry (low, zipp2 (as,bs))
This is the one that will be used in our summation networks. Note that the zeroes, padded
by zipp2, are constant, so this can be seen as adding empty gaps after MSB of a partial
product. If we have adders that deal with empty gaps, the hardware for adding two numbers of
different lengths can be reduced (see chapter 5).
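To illustrate both the addition and the padding performed by zipp2, one would expect a session like the following (my own example): adding 3 = [high,high,low,low] and 5 = [high,low,high] gives 8, and the result has as many bits as the longer input:

Main> simulate cpa ([high,high,low,low],[high,low,high])
[low,low,low,high]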
2.3.3 Linear array summation
Linear array summation uses carry-propagate adders (CPAs) to add the partial products
one by one to a partial sum. Its mathematical function is
    linArray(P_0, P_1, P_2, ..., P_{m−1}) = (((P_0 + P_1) + P_2) + ...) + P_{m−1}        (11)

where {P_0, P_1, P_2, ..., P_{m−1}} are the partial products. This expression computes the sum in m − 1 steps, and hence takes O(m) time. Figure 2.4 shows how the four partial products from
Example 1.1 are added together in a linear array.
[Figure: three CPAs in a column; the zero-padded partial products 00000, 010010, 0100100 and 00000000 are added one at a time to a running partial sum, producing 00110110.]

Figure 2.4 – Linear array summation for the partial products in Example 1.1.
The organisation of the CPAs in Figure 2.4 can be generalized to a connection pattern in
Lava. The pattern, called reduceLin looks like this:
reduceLin circ c []     = c
reduceLin circ c (l:ls) = reduceLin circ (circ (c,l)) ls
Here the input list is reduced by one element, l, in each recursion step, and c is the partial result (representing the data flowing vertically in Figure 2.4, the rightmost arrows). In each step, circ (c,l) is passed as c to the next step, so circ is the circuit used for the reduction (the CPAs in Figure 2.4).
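For readers who think in standard Haskell terms, reduceLin is just foldl specialised to circuits that take their arguments as a pair. This equivalence is my own observation, not part of the thesis code:

-- reduceLin circ c [l0,l1,l2] = circ (circ (circ (c,l0), l1), l2)
-- which is the same as:
reduceLin' circ c ls = foldl (curry circ) c ls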
Linear array summation in Lava, using the just defined connection pattern, looks like this:
linArray (p:ps) = reduceLin cpa p ps
This creates a column of CPAs, and the first partial product is fed as the parameter c to reduceLin, just like in Figure 2.4. Since a CPA is a row of full adders, the linear array becomes a column of rows of full adders – a matrix of full adders. Figure 2.5 shows this matrix for 4 × 4-bit multiplication with simple PPG. The grey bits and arrows represent constant zeroes. If we have gates that treat constant bits specially, the matrix can be reduced (see chapter 5).
[Figure: the partial-product bits P_{0,0} ... P_{3,7} feed a matrix of full adders, one row of full adders (a CPA) per partial product; the sum bits S_0 ... S_7 emerge at the bottom. Grey bits and arrows are constant zeroes.]

Figure 2.5 – Linear array; a matrix of full adders.
2.3.4 Adder tree summation
If the adders of the linear array are organized as a binary tree instead, the summation will
be much faster when the number of partial products is large. This is because an adder tree
does several additions in parallel. An adder tree for twelve partial products is seen in Figure
2.6. A full binary tree with m leaves has log_2 m levels, so this tree computes the sum in O(log m) time, where m is the number of partial products.
[Figure: the twelve partial products P_0 ... P_11 enter at the top and are reduced pairwise through four levels of CPAs; an odd term at one level is passed unchanged to the next level (dashed arrows).]

Figure 2.6 – Summation of twelve terms with an adder tree.
This organisation of the adders can also be generalized to a connection pattern in Lava:
binTree circ [p] = p
binTree circ ps =
(halveList ->- (binTree circ -|- binTree circ) ->- circ) ps
This definition uses the serial (->-) and parallel (-|-) composition operators in Lava, see [1]. Here ps is the list of partial products, and circ is the circuit used for reducing them. The base case delivers the result for the whole tree, and for all partial trees inside. When a partial tree has an odd number of terms at one level, the base case passes the odd term to the next level (the dashed arrows in Figure 2.6).
Using this connection pattern, the adder tree summation is written as:
addTree = binTree cpa
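To see the shape this produces, it helps to unfold the definition by hand for four terms (assuming halveList splits a list into its front and back halves):

-- binTree circ [p0,p1,p2,p3]
--   = circ (binTree circ [p0,p1], binTree circ [p2,p3])
--   = circ (circ (p0,p1), circ (p2,p3))
--
-- so addTree [p0,p1,p2,p3] computes (P0 + P1) + (P2 + P3),
-- the adder tree expression from section 1.2.2.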
2.4 Combining PPG and summation network
2.4.1 Simulation
Now we are ready to make our first complete multiplier. Since the simple PPG and the
summation networks have followed the PPG-sum interface described in section 1.2.2, we
should be able to combine them right away. This is done with the serial composition operator:
mult1 = ppgSimple ->- linArray
mult2 = ppgSimple ->- addTree
And we can try to simulate them:
Main> simulate mult1 ([low,high,high,low],[high,low,low,high])
[low,high,high,low,high,high,low,low]
Main> simulate mult2 ([low,high,high,low],[high,low,low,high])
[low,high,high,low,high,high,low,low]
These results are recognized from Example 1.1 (remember that LSB is always the first
element in the list). For simulating large numbers, we can define an integer multiplier that
converts the multiplier input/output to integer signals instead. This can be done by the
following function:
intMult n mult = (int2bin n -|- int2bin n) ->- mult ->- bin2int
The parameter n is the number of bits the input integers are converted to, and mult is the
multiplier circuit used. Simulation gives us, for example:
Main> simulate (intMult 16 mult1) (12,5)
60
Main> simulate (intMult 16 mult1) (365,11088)
4047120
Main> simulate (intMult 16 mult2) (365,11088)
4047120
2.4.2 Verification
For verifying the circuits, we define two helper functions that allow us to test equivalence
between two circuits (see [1] for more information on equivalence checking):
prop_Equivalent circ1 circ2 a = ok
  where
    out1 = circ1 a
    out2 = circ2 a
    ok   = out1 <==> out2

circCorrectForSizes m n circ1 circ2 =
  forAll (list m) $ \a ->
    forAll (list n) $ \b ->
      prop_Equivalent circ1 circ2 (a,b)
And after that, we can use the formal verification tool VIS and define a simple verification function:

verif mult n = vis (circCorrectForSizes n n multi mult)

This function verifies the correctness of the multiplier mult for all inputs of length n. Verification is done as equivalence checking between the constructed multiplier and the built-in Lava multiplier multi. We can now try to verify one of our constructed circuits:
Main> verif mult1 4
Vis: ... (t=0.6) Valid.
Main> verif mult1 8
Vis: ... (t=2.0) Valid.
It follows that mult1 is correct (has the same functionality as multi) for all 4-bit and 8-bit
numbers. More circuits are verified in section 7.1.
3 Booth’s algorithm
This chapter describes how the previously
proposed s-bit selection method for generating
partial products can be constructed in Lava.
The first section shows how the values of the
partial products can be selected when having s
selection bits. Then comes the s-bit Booth PPG,
and finally, an improvement of the basic Booth
algorithm.
3.1 Partial product selection
3.1.1 Two selection methods
Selecting the values of the partial products with s-bit selection (Statement 1) can be done
in two equivalent ways:
1) An explicit s-bit multiplier circuit for each partial product.
2) A table with all multiples {0, N, 2N, ..., (2^s − 1)N}, where N is the multiplicand, and a selection circuit for each partial product that selects the proper multiple.
For the first method, we can use the circuit mult1 described in section 2.4.1. If we use mult1 (ss,bs), so that ss is the s-bit selection group and bs is the multiplicand N, the result of mult1 is computed through an array of s − 1 CPAs. This means that this selection method has both the delay and the size of s − 1 CPAs in series (if we neglect the bit multiplication inside mult1, which is both small and fast). But we will need one separate multiplier circuit for each partial product, which might seem a little excessive. If, for example, we have a multiplier that uses 3-bit selection, the selection result can only be one of the multiples {0, N, ..., 7N}. Since the values of the partial products are selected from a relatively small set, we might consider sharing multiples between different partial products. This is done in the second method. There, we start by computing the table of all multiples of N, from 0 to (2^s − 1)N (the highest multiple that s bits can select):

    MultTab = {0, N, 2N, ..., (2^s − 1)N}

If MultTab_i is the ith element in the list (indexing starts from zero), we have the following relation:

    MultTab_i = i · N        (12)
So, the s-bit multiplication can be exchanged for a list selection instead, and the selection
table is shared between all partial products, which suggests that this method requires less
hardware.
When the multiples are computed one by one, it is easier to control how they are generated. For example, the 2N multiple does not need s − 1 CPAs to be computed (as in the first method); a simple shift of N is enough.
We need a generic algorithm that computes the list MultTab in the most efficient way. If
the multiples are computed one by one, starting from the 0th, higher multiples can use the
already generated values of the lower multiples, in order to minimize the hardware used.
The ith multiple is computed by one of the following cases:

1) i = 0:             MultTab_0 = 0 · N = 0
2) i = 1:             MultTab_1 = 1 · N = N
3) i > 1 and even:    MultTab_i = i · N = 2 · (i/2) · N = 2 · MultTab_{i/2}
4) i > 1 and odd:     MultTab_i = i · N = (2^p + (i − 2^p)) · N
                                = 2^p · N + (i − 2^p) · N
                                = MultTab_{2^p} + MultTab_{i − 2^p},
   where p is an integer and 2^p < i < 2^{p+1}.
Cases 3) and 4) use equation (12) for converting between multiplication and table index. Case 3) expresses an even multiple as two times a lower multiple, so it is generated by simple shifting of a previously computed value. Therefore, we can note that cases 1) to 3), which generate more than 50% of all multiples, don't use any adders at all. Case 4) expresses an odd multiple as one addition of two lower terms. In total, the whole selection table is generated by 2^s/2 − 1 = 2^{s−1} − 1 adders. However, the delay of a single multiple can be more than one adder delay, since the multiples in the additions may themselves be the results of other additions.
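As an illustration (my own, not from the report), here is how the four cases play out for s = 3:

-- MultTab = [0, N, 2N, 3N, 4N, 5N, 6N, 7N]
-- 0 and N        : cases 1) and 2), no hardware
-- 2N, 4N, 6N     : case 3), shifts of N, 2N and 3N
-- 3N = 2N + N    : case 4), one adder
-- 5N = 4N + N    : case 4), one adder
-- 7N = 4N + 3N   : case 4), one adder, but two adder delays,
--                  since 3N is itself the result of an addition
-- Total: 2^(3-1) - 1 = 3 adders; worst-case delay s - 1 = 2 adders.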
So, odd multiples are expressed as MultTab_{2^p} + MultTab_{i − 2^p}, where 2^p is the next lower power of two from i. The reason for this is that MultTab_{2^p} is a completely shifted multiple (the multiplicand shifted p steps), and MultTab_{i − 2^p} is an odd multiple between N and (2^p − 1)N. The highest multiple that can be expressed by their sum is

    MultTab_max = 2^p · N + (2^p − 1) · N = (2^{p+1} − 1) · N        (13)

This shows that we can express multiples up to (2^{p+1} − 1)N with only one addition (or less) of multiples between 0 and 2^p N. So if we look at the whole selection table, which contains the multiples between 0 and (2^s − 1)N, equation (13) says that all multiples can be obtained by one addition, or less, of multiples between 0 and 2^{s−1} N. These lower multiples, in turn, are composed of multiples between 0 and 2^{s−2} N, and so on. We can use this for determining the delay of the selection table by noting that:
1) Multiples from 0 to 2^1 N are obtained without additions.
2) Multiples from 0 to 2^2 N are obtained through one addition, or less.
3) Multiples from 0 to 2^3 N are obtained through one addition (or less) of multiples from step 2), so the result has to go through 2 adders.
4) Multiples from 0 to 2^4 N go through 3 adders, or less.
5) And so on...

Using this pattern, we state that multiples from 0 to 2^s N go through s − 1 adders, or less, and thus the selection table will have a worst-case delay of s − 1 adders, just like the first selection method. But most multiples will have a lower delay.
Here is the selection table in Lava:
sBitSelTab s bs = multiples 0 []
  where
    nb = length bs
    --
    shift as = skipLast ([low] ++ as)
    --
    multiples i ms
      | (i == 2^s)     = ms
      | (i == 0)       = multiples (i+1) [zeroList (s+nb)]
      | (i == 1)       = multiples (i+1) (ms ++ [bs ++ zeroList s])
      | (mod i 2 == 0) = multiples (i+1) (ms ++ [shift (ms!!(div i 2))])
      | otherwise      = multiples (i+1) (ms ++ [m])
      where
        m  = cpa (ms!!i1, ms!!i2)
        i1 = 2^(floor ((log (fromInt i))/(log 2)))
        i2 = i - i1
The function has the selection length s and the multiplicand bs as inputs, and returns a call to the helper function multiples. This is a recursive function with two parameters – the index of the current multiple, i, and the list of all previous multiples, ms. In the first call, the current index is set to zero, and the list of multiples to the empty list. Then, in each step, i is increased by one, and one new multiple is added at the end of ms. When the index reaches 2^s, ms is returned. The multiple for i = 0 is added immediately as a list of zero bits. For i = 1, the multiplicand bs is added. These multiples are adjusted to get n + s bits, where n is the number of bits in the multiplicand, according to Statement 4. When i > 1, we have two other cases, one for even multiples (mod i 2 == 0) and one for odd multiples (otherwise). These correspond to the cases discussed above. The index i1 = 2^p is computed by the line

i1 = 2^(floor ((log (fromInt i))/(log 2)))
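Simulating the table for s = 2 and N = 101 (5) should give the multiples 0, 5, 10 and 15, each with n + s = 5 bits. This is my own session (output reformatted over several lines for readability; LSB first as usual):

Main> simulate (sBitSelTab 2) [high,low,high]
[
 [low,low,low,low,low],
 [high,low,high,low,low],
 [low,high,low,high,low],
 [high,high,high,high,low]
]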
3.1.2 The selection circuit
For the second method in section 3.1.1, we need a selection circuit that works as a look-up table. Its input should be a list of 2^s numbers (all of the same length) and an s-bit binary number S, whose value selects the Sth number in the list. This number is returned. First, we define a multiplexer circuit that has two numbers A and B, and a selection bit s, as inputs. Its function is seen in Figure 3.1 (a), and in Lava it looks like this:
mux (s,(as,bs)) = muxHelp s (inv s) (as,bs)
  where
    muxHelp s sInv ([],[])      = []
    muxHelp s sInv (a:as, b:bs) = x:(muxHelp s sInv (as,bs))
      where x = or2 (and2 (sInv, a), and2 (s,b))
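A quick simulation check (my own example): with s = low the first number is returned, and with s = high the second:

Main> simulate mux (low,([low,high],[high,high]))
[low,high]
Main> simulate mux (high,([low,high],[high,high]))
[high,high]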
It works only for numbers of the same length. Figure 3.1 (b) shows how a tree of multiplexers can be used to realize the selection circuit for eight numbers. This tree selects one element from the list {A_0, A_1, ..., A_7}. Which element is selected is determined by the binary number s_2s_1s_0.
[Figure: (a) a multiplexer built from AND and OR gates, with the select bit s (and its inverse) choosing between the bits of A and B; (b) seven multiplexers arranged as a binary tree, selecting one of the numbers A_0 ... A_7 according to the bits s_0, s_1, s_2.]

Figure 3.1 – (a) The multiplexer circuit. (b) Selection circuit as a tree of multiplexers.
Here is the Lava code for a multiplexer tree:
muxTree ([],[a]) = a
muxTree (s:ss,as) = xs
  where
    (a1,a2) = halveList as
    xs      = mux (s, (muxTree (ss,a1), muxTree (ss,a2)))
The list as must have 2^s numbers, where s is the number of bits in ss (it is not limited to three bits as in the figure). Here it is the first bit in ss that goes to the last multiplexer (the rightmost in Figure 3.1 (b)) and selects between the lower and upper part of the list. But for the tree to function as a look-up table for the list as, we would like to have the last bit in ss (the MSB) doing this selection. Therefore, we define a new function called select that realizes the tree in Figure 3.1 (b):
select ss as = muxTree (reverse ss, as)
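Now a selection group, read as a binary number with its LSB first, indexes directly into the list. For example, ss = [low,high] represents the index 10_2 = 2 and should pick out the third element. One would expect a session along these lines (my own example; it assumes simulate accepts the paired input shape):

Main> simulate (uncurry select) ([low,high],[[low],[low],[high],[low]])
[high]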
3.2 Booth encoding (s-bit selection)
In section 1.1, we mentioned Booth's algorithm as a PPG method that uses s-bit selection, where s > 1. And section 3.1 presented two ways of implementing s-bit selection – either as explicit multiplication, or as selection from a selection table.
First comes the basic Booth s PPG with selection through multiplication. It looks like this
in Lava:
boothBasic s (as,bs) = boothBasicHelp 0 (as,bs)
  where
    boothBasicHelp i ([],bs) = []
    boothBasicHelp i (as,bs) | (length as < s) =
        [zeroList (s*i) ++ mult1 (as,bs)]
    boothBasicHelp i (as,bs) | (length as >= s) = p:ps
      where
        (ss,as2) = splitAt s as
        p        = (zeroList (s*i)) ++ (mult1 (ss,bs))
        ps       = boothBasicHelp (i+1) (as2,bs)
This is very similar to ppgSimple in section 2.2.2, but instead of bitMulti there is an s-bit multiplication, mult1 (ss,bs), and the shifting is now s*i instead of i. There is also an extra base case for the helper function, which is reached only if the number of bits in the multiplier is not a multiple of s. The multiplier circuit mult1 is the one described in section 2.4.1. Note that if s = 1, boothBasic is exactly equivalent to ppgSimple. The number of partial products returned is equal to m/s, where m is the number of bits in the multiplier. So, for s > 1, we get a significant reduction of the number of partial products compared to s = 1, and thus we expect it to result in a faster multiplier. On the other hand, s-bit multiplication with mult1 is computed through s − 1 CPA circuits, so the selection is significantly slower than in the 1-bit case. Therefore, we cannot be too sure about any speed gain.
Testing this function with the inputs in Example 1.2 gives us:
Main> simulate (boothBasic 2)([low,high,high,low],[high,low,low,high])
[
[low,high,low,low,high,low],
[low,low,high,low,low,high,low,low]
]
Now we can also write the basic Booth s with the second selection method from section
3.1. The function, called boothBasicMux, looks like this in Lava:
boothBasicMux s (as,bs) = boothBasicHelp 0 ms (as,bs)
  where
    ms = sBitSelTab s bs

    boothBasicHelp i ms ([],bs) = []
    boothBasicHelp i ms (as,bs)
      | (n < s)  = [zeroList (s*i) ++ select as ms2]
      | (n >= s) = p:ps
      where
        n        = length as
        ms2      = trimMatrix (2^n) (length bs + n) ms
        (ss,as2) = splitAt s as
        p        = zeroList (s*i) ++ select ss ms
        ps       = boothBasicHelp (i+1) ms (as2,bs)
The difference from boothBasic is that the s-bit multiplication has been exchanged for a selection from the list ms, and the same list is used for all partial products. If the number of bits in the multiplier is not a multiple of s, we reach the case where (n < s). Then a selection from ms2 is returned, where ms2 is a trimmed version of ms. The function trimMatrix (see the full code in Appendix A) takes two sizes m and n, and a list of numbers as input (a list of numbers is a matrix of bits). The list is trimmed so that there will not be more than m numbers in the list, and the numbers will not have more than n bits. The number of multiples and the number of bits in the multiples in ms2 are trimmed so that a selection is equivalent to a multiplication with a number of fewer than s bits. A comparison between the two methods in terms of size and speed is found in section 7.2.
3.3 Improved Booth’s algorithm
3.3.1 A new selection method, 2-bit
The basic Booth PPG is rarely (or never) used, because there is a simple way to rearrange the selection method to get rid of half of the odd multiples, and hence get a faster PPG (one adder delay less).
We start by examining the new method for 2-bit selection, and then we generalize it to s bits. In 2-bit selection, the values of the partial products are chosen from the set {0, N, 2N, 3N}, where N is the multiplicand. We can get rid of the 3N multiple by hiding it inside two adjacent partial products as 4N − N. That is, the first partial product is −N, and the next one is N plus its own value. The factor four is dropped because there is a factor four in magnitude difference between adjacent partial products in 2-bit selection, coming from the shifting. In order to get a workable method, we will also let 2N be hidden inside two partial products.
If the first selection group has the value 0 or 1 (MSB = 0), everything works as normal, so either 0 or N is selected. But if the value is 2 or 3 (MSB = 1), −2N or −N should be selected instead, and then we rely on the next selection to add N to its own value. Therefore, the next selection must look at the MSB of the previous group, and adjust its value accordingly.
Multiplier bits (example): 0010011010001101110, grouped from MSB to LSB into three-bit groups, with steps of two bits between the groups.

Group      Selection
 000        +0
 001        +N
 010        +N
 011        +2N
 100        −2N
 101        −N
 110        −N
 111        −0
Figure 3.2 – Improved 2-bit selection
(a) Grouping of the multiplier bits. (b) New selection table.
This selection method is achieved if the bits of the multiplier are grouped as in Figure 3.2 (a) – the bits are put together two by two as before, but before the first bit of each group, the last bit of the previous group is included. So, there will be three bits in each group, but only steps of two bits between the groups. We also have to add a zero before the LSB of the multiplier, and two zeroes after the MSB, since these don’t have any previous/following groups. Because of these extra bits, this method gives one more selection group than the basic Booth 2, so the number of partial products will be m/2 + 1, where m is the number of bits in the multiplier. But the nice thing here is that all multiples in the new selection table can be generated without any adders at all!
If we have a selection group $S = s_2 s_1 s_0$, where $s_0$ is the last bit of the previous group, the elements of the selection table can be written formally as

$$\mathrm{Booth2Tab}_S = (s_0 + s_1 - 2 s_2) \cdot N \qquad (14)$$

Inversion of a bit can be written as a difference:

$$\bar{a} = 1 - a \qquad (15)$$

If we invert the bits in S and use relation (15) in (14), we get

$$\mathrm{Booth2Tab}_{\bar{S}} = (\bar{s}_0 + \bar{s}_1 - 2 \bar{s}_2) \cdot N
 = (1 - s_0 + 1 - s_1 - 2(1 - s_2)) \cdot N
 = (-s_0 - s_1 + 2 s_2) \cdot N
 = -\mathrm{Booth2Tab}_S \qquad (16)$$

This relation will be used in the Lava implementation of the selection table.
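Relation (16) can also be checked exhaustively with a few lines of plain Haskell (a throwaway test, not circuit code):

-- Coefficient (s0 + s1 - 2*s2) from formula (14), for one group.
booth2Coeff :: (Int,Int,Int) -> Int
booth2Coeff (s2,s1,s0) = s0 + s1 - 2*s2

-- Relation (16): inverting all bits of the group negates the value.
check16 :: Bool
check16 = and [ booth2Coeff (1-s2, 1-s1, 1-s0)
                  == negate (booth2Coeff (s2,s1,s0))
              | s2 <- [0,1], s1 <- [0,1], s0 <- [0,1] ]  -- True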
3.3.2 Negative partial products
It can be seen from Figure 3.2 (b) that the MSB of each group determines whether the partial product should be positive or negative. The best way to represent the negative multiples is in 2's-complement form, since that makes it possible to add them directly together with the positive numbers.
Negating a number in 2's-complement form is easily done by the following two steps (sketched in code below):
1) Invert all bits in the number.
2) Add one to the result.
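As a sketch in plain Haskell, on LSB-first bit lists matching the number representation used in this report (negate2c is a reasoning model, not a circuit):

-- Two's-complement negation of a fixed-width, LSB-first bit list:
-- invert every bit, then add one, with the carry rippling from LSB.
negate2c :: [Bool] -> [Bool]
negate2c = addOne . map not
  where
    addOne []     = []                 -- carry out of MSB is dropped
    addOne (b:bs)
      | b         = False : addOne bs  -- 1 + 1 = 0, carry continues
      | otherwise = True  : bs         -- 0 + 1 = 1, carry absorbed

-- Example: negate2c [True,False,True,False]   (5 in four bits)
--          == [True,True,False,True]          (11 = 2^4 - 5)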
Instead of letting the selection table compute the negative multiples, we can have a table with the absolute values of the multiples; the negation is done after the selection. Assume that the ith partial product is negative. Its value is then computed by the following steps:
1) Select the absolute value from the table in Figure 3.2 (b).
2) Invert all its bits.
3) Add one to the result.
4) Shift it 2i steps to the left.
When inverting all bits, we have to deal with the fact that the empty gaps to the left of the partial products are actually zeroes. These zeroes, although not present in step 2), should also be inverted. So, negative partial products must be padded with ones to the left of the MSB. This is called sign extension. Note that step 3) can be done after step 4), if the one-bit is also shifted 2i steps. This means that the addition can be pushed to the next partial product instead. Since the next partial product is shifted 2i + 2 steps, there will always be a zero at position 2i, and this zero is replaced by the sign bit from the previous partial product (see Figure 3.3 (b)). When the addition used to negate a number is hidden inside the partial product summation like this, negation takes practically no extra time.
Figure 3.3 (a) shows the partial product matrix for 8×8-bit multiplication with the improved Booth 2 algorithm, if all partial products happen to be positive. All terms except the last one have nine bits (plus the shifting zeroes). This is because the highest multiple in the selection table is 2N, and N has eight bits. If all partial products happen to be negative, we get the matrix in Figure 3.3 (b).
Figure 3.3 – Partial products for 8×8-bits improved Booth 2 multiplication.
(a) All partial products positive.
(b) All partial products negative.
(c) Both positive and negative partial products with padded ones.
(d) Both positive and negative partial products without padded ones.
We want to be able to switch between the positive and negative form for single partial products. One way to do this is shown in Figure 3.3 (c). The one-bits below the partial products in Figure 3.3 (b) are replaced by the partial products' sign bits s (s = 0 for positive, and s = 1 for negative). And the inverse of the sign bit, $\bar{s}$, is added to the string of padded ones to the left of the partial products. Adding one to a string of ones results in a string of zeroes (if the carry-out bit is neglected), see Figure 3.4.
    1 1 ... 1 1 1 1 1
  +                 1
  -------------------
  1 0 0 ... 0 0 0 0 0
Figure 3.4 – Adding one to a string of ones.
So, for a positive partial product, where $\bar{s} = 1$, the string of ones gets converted to a string of zeroes. This way we can easily handle both positive and negative partial products. If the padded ones and the sign bits to the left of the partial products are summed manually, we get the equivalent, but smaller, matrix of Figure 3.3 (d). To achieve this matrix, the partial products should be generated by the following scheme (which generates the ith partial product $P_i$ from the selection group $S_i = s_2 s_1 s_0$):
Improved Booth 2 selection scheme (pad strings written MSB first):
1) Select the absolute value from the table in Figure 3.2 (b).
2) If the sign is negative ($s_2 = 1$), invert all its bits.
3) $i = 0$ ⇒ pad with the string $\bar{s}_2\, s_2\, s_2$ after the MSB.
   $i \geq 1$ ⇒ pad with the string $1\, \bar{s}_2$ after the MSB, and the string $0\, s_p$ before the LSB ($s_p$ is the previous sign bit).
   Shift it $2i - 2$ steps to the left.
4) Trim so that the number of bits is not more than m + n, where m and n are the number of bits in the multiplier and the multiplicand respectively.
We must make sure that all partial sums have the right lengths for additions with partial products from this algorithm. Statement 3 in section 1.2.2 said that the ith partial product $P_i$ should have the same number of bits as in a multiplication where $P_i$ is the last partial product. This rule is hard to apply here, since the last selection group is padded with one or two zeroes so that the last partial product is always positive (see Figure 3.2). So, in a multiplication where $P_i$ is the last partial product, $P_i$ would look different from when it is not the last one. Therefore, we will use another approach to find out how many bits we need.
Just like in section 1.2.2, we want to find the largest possible result of any addition where $P_i$ sets the size. It is given when $P_i$ is added to the sum of all shorter partial products, $SUM_{i-1}$. We said before that if a one is added to a string of ones, we get a string of zeroes and a carry-out. In Figure 3.3 (d), we can see that when any two partial products are added together, it is always possible that the shorter one ends with a one, and the longer one ends with a string of ones. So, any addition of single partial products can result in a carry-out. Therefore, we need to pad the terms with at least one zero after the MSB. The next question is then: Is it enough with just one zero, or do we need to pad with more? If $SUM_{i-1}$ is shorter than $P_i$, any carry-out will be eaten up by the padded zero in $P_i$, so the only case where we need to take care of another carry-out is when $P_i$ and $SUM_{i-1}$ have the same lengths. (If $SUM_{i-1}$ should be longer than $P_i$, it is no longer $P_i$ that sets the size, so we don't need to bother about that case.) In Figure 3.3 (d), each partial product has two more bits than the previous one, except for the special term in the beginning and the trimmed terms at the end. If we generate $SUM_{i-1}$ as

$$SUM_{i-1} = (((P_0 + P_1) + P_2) + \dots) + P_{i-1},$$
we can reason like this: Since $P_1$ is longer than $P_0$, the carry from the addition $(P_0 + P_1)$ is eaten up by the padded zero in $P_1$. Then $P_2$ is longer than the result of $(P_0 + P_1)$, which has the same length as $P_1$, so this carry is also eaten up, and so on. Therefore, $P_i$ is always longer than $SUM_{i-1}$, so there will never be any extra carry to deal with. Now we can add one more step to the improved Booth 2 scheme:
5) Add one constant zero bit after the MSB.
Here is an example of the improved Booth algorithm:

Example 3.1
Calculate the product of the two binary numbers M = 110110 (54) and N = 1001 (9), using improved Booth 2 selection.

Solution:
First, the bits of the multiplier are grouped – with the extra zero added before the LSB and two zeroes after the MSB – into the following groups (MSB first):

001  110  011  100

Then, the partial products are generated by the above-described scheme, and finally they are summed up (written out here to the full ten bits; the carry out of the top position is discarded):

Multiplicand (N):        1001  (9)
Multiplier (M):      * 110110  (54)

    0001101101
    0111001001
    0101100000
  + 1001010000
  ------------
    0111100110  (486)
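The recoding in this example can be checked with an arithmetic model in plain Haskell. The model below rewrites the multiplier as digits in {−2, −1, 0, 1, 2}, one per group, each weighted by 4^i; it verifies the selection table but not the bit-level padding (booth2Digits and booth2Value are illustrative helpers, not part of the circuit code):

-- Improved Booth 2 recoding of a multiplier given LSB first:
-- pad a zero before the LSB and zeroes after the MSB, then take
-- overlapping three-bit groups with steps of two bits.
booth2Digits :: [Int] -> [Int]
booth2Digits ms = go (0 : ms ++ [0,0])
  where
    go (s0:rest@(s1:s2:_)) = (s0 + s1 - 2*s2) : go (drop 1 rest)
    go _                   = []

-- The product as the digit-weighted sum of multiples of n.
booth2Value :: [Int] -> Int -> Int
booth2Value ms n = sum [ d * 4^i * n | (i,d) <- zip [0..] (booth2Digits ms) ]

-- booth2Value [0,1,1,0,1,1] 9 == 486   (M = 110110 = 54, N = 9)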
3.3.3 Generalization to s-bit
The improved Booth algorithm for 2-bit selection can be generalized to s-bit selection as well, following the same pattern. The multiplier grouping for improved s-bit selection is shown in Figure 3.5. Each selection group consists of s bits plus the MSB of the previous group. And as before, there are zeroes padded before and after the first and last groups. The general rule, which holds for any group length and any multiplier length, is: there should always be one zero added before the LSB of the multiplier. After the MSB, at least one zero should be added, and then as many more as are needed to fill the last group.

The values of the improved table, for the selection group $S = s_s s_{s-1} \dots s_2 s_1 s_0$, are given as

$$\mathrm{BoothTab}_S = (s_0 + s_1 + 2 s_2 + \dots + 2^{s-2} s_{s-1} - 2^{s-1} s_s) \cdot N, \qquad (17)$$

where N is the multiplicand. Just like in the 2-bit table, it is the MSB of the group, $s_s$, that determines whether the selected value is positive or negative.
Figure 3.5 – Grouping of the multiplier bits in improved s-bit selection.
This table also has the same property as the 2-bit table when the bits in S are inverted (equation (16)):

$$\mathrm{BoothTab}_{\bar{S}} = -\mathrm{BoothTab}_S \qquad (18)$$

Since we only want to select absolute values from the table (negation is done after the selection), we can skip the second half of the list (where the negative elements are); instead, when a value from the second half is needed, we just invert the selection bits and select from the first half. If we don't use the second half of the table, we don't need to include the MSB of S in the selection, so we get a smaller selection group $S' = s_{s-1} \dots s_2 s_1 s_0$. Then we can write a modified selection table with the following function:

$$\mathrm{BoothTab}'_{S'} = (s_0 + s_1 + 2 s_2 + \dots + 2^{s-2} s_{s-1}) \cdot N
 = (s_0 + s_1 + 2 \cdot [s_{s-1} \dots s_2]) \cdot N
 = (s_0 + s_1) \cdot N + 2 \cdot [s_{s-1} \dots s_2] \cdot N,$$

where the second step is obtained by noting that the sum $2 s_2 + \dots + 2^{s-2} s_{s-1}$ is equal to twice the value of the binary number $[s_{s-1} \dots s_2]$. So, the values of the BoothTab′ table are given as the addition of a special 2-bit selection $(s_0 + s_1) \cdot N$ (the same as in Figure 3.2 (b)) and an ordinary (s − 2)-bit selection $2 \cdot [s_{s-1} \dots s_2] \cdot N$ (the factor 2 is just a one-step shift of the selected value). The table is shown in Figure 3.6.
Now we can define the generalized improved Booth scheme, which generates the ith partial product $P_i$ from the selection group $S_i = s_s s_{s-1} \dots s_1 s_0$:

Improved Booth s selection scheme (pad strings written MSB first):
1) If $s_s = 0$ (positive sign), let $S' = s_{s-1} \dots s_1 s_0$.
   If $s_s = 1$ (negative sign), let $S' = \bar{s}_{s-1} \dots \bar{s}_1 \bar{s}_0$.
2) Let S′ select the proper value from the table in Figure 3.6.
3) If $s_s = 1$, invert all bits in the selected value.
4) $i = 0$ ⇒ pad with the string $\bar{s}_s s_s \dots s_s$ (s + 1 bits in total) after the MSB.
   $i \geq 1$ ⇒ pad with the string $1 \dots 1\, \bar{s}_s$ (s bits in total) after the MSB, and the string $0 \dots 0\, s_p$ (s bits in total) before the LSB ($s_p$ is the previous sign bit).
   Shift it $s \cdot i - s$ steps to the left.
5) Trim so that the number of bits is not more than m + n, where m and n are the number of bits in the multiplier and the multiplicand respectively.
6) Add one constant zero bit after the MSB.
As for the basic Booth, there are two ways to implement the s-bit selection table:
1) An improved 2-bit selection, added to an (s − 2)-bit multiplication.
2) A selection from a list of the multiples in Figure 3.6.
Here is the Lava code that implements the special 2-bit selection $(s_0 + s_1) \cdot N$:

improved_2bitSel ss bs = select ss [b0,b1,b1,b2]
  where
    b0 = zeroList ((length bs)+1)
    b1 = bs ++ [low]
    b2 = [low] ++ bs
The highest multiple is 2N, so all numbers have n + 1 bits. And all multiples are obtained by simple shifting, so no adder is used.
Multiplier bits    Selection
 0..000             0
 0..001             N
 0..010             N
 0..011             2N
 0..100             2N
 .....              .....
 1..101             (2^{s-1} − 1)·N
 1..110             (2^{s-1} − 1)·N
 1..111             2^{s-1}·N
Figure 3.6 – The improved s-bit selection table.
Here is the code for the first implementation of the selection table; an addition of a 2-bit selection and a multiplication:

improved_sel ss bs
  | (length ss == 2) = x
  | (length ss >= 2) = cpa (x,y)
  where
    (a1,a2) = splitAt 2 ss
    x       = improved_2bitSel a1 bs
    y       = [low] ++ (mult1 (a2,bs))
The multiplication needs s − 3 adders, so the whole selection is done through s − 2
adders, which is one less than the basic Booth s algorithm.
Here is the code for the second implementation:
improved_selTab s bs = modify ms
  where
    ms = trimMatrix (2^(s-1)+1) (length bs + s - 1) (sBitSelTab s bs)

    modify []         = []
    modify [m]        = [m]
    modify [m1,m2]    = [m1,m2]
    modify (m1:m2:ms) = m1:m2:(modify (m2:ms))
The multiples are generated by the function sBitSelTab from section 3.1.1, which returns the list $\{0, N, 2N, \dots, (2^s - 1)N\}$. However, in the table in Figure 3.6, the highest multiple is $2^{s-1} \cdot N$, so the list is trimmed by the trimMatrix function to contain $2^{s-1} + 1$ multiples. Since the higher multiples are cut off, the remaining multiples need one bit less. This is also done by trimMatrix. We can also see in Figure 3.6 that all multiples except the first and last one are repeated twice. This is done by the helper function modify.
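For example, if modify is lifted to the top level and applied to placeholder strings standing in for the multiples of the s = 3 table, it produces the eight rows of Figure 3.6:

Main> modify ["0","N","2N","3N","4N"]
["0","N","N","2N","2N","3N","3N","4N"]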
In the selection scheme, we need to be able to conditionally invert the bits of numbers. The Boolean expression for an XOR gate with inputs a and b, and output x, is

$$x = a \oplus b = \bar{a} b + a \bar{b} \qquad (19)$$

We see that if $a = 0$, then $x = b$, and if $a = 1$, we have $x = \bar{b}$, so the XOR gate is actually a conditional bit inverter. A conditional number inverter is then obtained as an array of xor2 gates with the distributeLeft pattern from section 2.2.1:
invCond c as = (distributeLeft ->- map xor2) (c,as)
This circuit inverts the bits of as if c is high, otherwise as is passed without being
inverted.
The selection scheme also requires that the selected values are padded with sign bits and
shifting zeroes. The pad strings are given by the following function:
boothBits s x sp sign
  | (x < s)   = (start1,end1)
  | otherwise = (start2,end2)
  where
    start1 = zeroList x
    start2 = (zeroList (x-s)) ++ [sp] ++ (zeroList (s-1))
    end1   = (replicate s sign) ++ [inv sign]
    end2   = [inv sign] ++ (replicate (s-1) high)
The parameter s is the selection group length, x is the number of shifting steps (it is supposed to have the value s·i), and sp and sign are the signs of the previous and the current partial product respectively. The first case (x < s) happens for the first partial product only.
Now we can write the improved Booth s algorithm with the first selection method:
booth s (as,bs) = boothHelp 0 low (as,bs)
  where
    len = (length as) + (length bs)

    boothHelp i sp (as,bs) | (length as < s) = [p]
      where
        ss        = [sp] ++ as ++ zeroList (s-(length as)-1)
        x         = improved_sel ss bs
        (start,_) = boothBits s (s*i) sp low
        p         = trim len (start ++ x)
    boothHelp i sp (as,bs) | (length as >= s) = p:ps
      where
        (ss, sign:as2) = splitAt s ([sp] ++ as)
        signInv        = invCond sign
        x              = (improved_sel (signInv ss) ->- signInv) bs
        (start, end)   = boothBits s (s*i) sp sign
        p              = trim len (start ++ x ++ end ++ [low])
        ps             = boothHelp (i+1) sign (as2,bs)
This function has the same basic structure as boothBasic. The helper function has two cases. The first one is the base case that returns the last partial product, and the second one adds one partial product to the list ps in each recursive call. The parameter sp is the sign bit of the previous selection group (the MSB of that group). The current selection group ss is obtained by picking s bits from the list [sp] ++ as. One more bit, sign, is picked, which is the current group's sign bit. Then ss selects the partial product value. The value of ss and the selected value are inverted depending on the value of sign. The selected value is padded with the bits from boothBits, and finally trimmed to the right length. The shifting argument is set to s*i. In the multiplier grouping, illustrated by Figure 3.5, the first and the last partial products should be padded with zeroes. The zero for the first partial product is given by the low argument in the first call of the helper function. The zeroes at the end are added in the base case, where ss is padded with enough zeroes to fill the last group.
Here is the improved Booth s with the second selection method:
boothMux s (as,bs) = boothHelp 0 low (as,bs)
  where
    ms  = improved_selTab s bs
    len = (length as) + (length bs)

    boothHelp i sp (as,bs) | (length as < s) = [p]
      where
        ss        = [sp] ++ as ++ zeroList (s-(length as)-1)
        x         = select ss ms
        (start,_) = boothBits s (s*i) sp low
        p         = trim len (start ++ x)
    boothHelp i sp (as,bs) | (length as >= s) = p:ps
      where
        (ss, sign:as2) = splitAt s ([sp] ++ as)
        signInv        = invCond sign
        x              = (select (signInv ss) ->- signInv) ms
        (start, end)   = boothBits s (s*i) sp sign
        p              = trim len (start ++ x ++ end ++ [low])
        ps             = boothHelp (i+1) sign (as2,bs)
4 Improved summation methods
This chapter describes how the basic summation
methods for summing partial products can be
improved using other adder structures than the
previously used carry-propagate adders. And
just like the carry-propagate adders, these can
be arranged either as linear arrays, or as trees.
4.1 Faster adders
4.1.1 The carry-save adder
We said before that a CPA is a relatively slow adder, because of the propagating carry. A much faster adder is achieved if the full adders in the CPA are connected slightly differently. In a CPA, the different carry-out bits can be thought of as a separate binary number, which is added together with the inputs. So, if all carry-out signals were disconnected from their carry-ins, this number could be saved and added at a later stage. Then there would also be an extra input available through the carry-in bits. Such an adder is called a carry-save adder (CSA). It is not a usual adder that returns one sum; instead it takes three numbers A, B, C as inputs, and returns two, X and Y, such that A + B + C = X + Y (see Figure 4.1). The second output Y is indexed from one, to show that it should be shifted one step to the left. This is because the carry-out bit is one position more significant than the sum bit of the full adder. The CSA is useful when we have more than two terms to add, such as when summing partial products. Both the CPA and the CSA are lines of full adders, and both reduce the number of input numbers by one. But since the carry chain is cut off, the CSA does it in constant time! The transformation between CPA and CSA is shown in Figure 4.1.
Figure 4.1 – Transforming a CPA to a CSA.
For writing the CSA in Lava, we need a function zipp3 that is like the zipp2 (section
2.3.2), but for three numbers instead. It is included in the full code in Appendix A.
The CSA looks like this:
csa ((as,bs),cs) =
(zipp3 ->- map (convert ->- fullAdd) ->- unzipp ->- shift) (as,bs,cs)
where
convert (a,b,c) = (c,(a,b))
shift (xs,ys) = (xs, [low] ++ skipLast ys)
The inputs are zipped, mapped on full adders, unzipped, and finally the second output number is shifted one step to the left. Just like for the CPA, we want the output numbers to have the same number of bits as the longest input number. This is why shift uses the skipLast function (which removes the MSB) to keep the number of bits constant when shifting.
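The defining property A + B + C = X + Y is easy to check against a small arithmetic model of the full-adder columns (plain Haskell over LSB-first bit lists of equal length; only a sketch for reasoning about the circuit):

-- One full adder per column: sum bits and carry bits. The carries
-- form the second output, shifted one step left, as in csa.
csaModel :: [Int] -> [Int] -> [Int] -> ([Int],[Int])
csaModel as bs cs = (ss, 0:ys)
  where
    ss = zipWith3 (\a b c -> (a+b+c) `mod` 2) as bs cs
    ys = zipWith3 (\a b c -> (a+b+c) `div` 2) as bs cs

-- LSB-first bit list to integer.
val :: [Int] -> Int
val = foldr (\b n -> b + 2*n) 0

-- For any equal-length inputs:
--   val as + val bs + val cs
--     == let (xs,ys) = csaModel as bs cs in val xs + val ys
-- (here the top carry is kept; the circuit's skipLast drops it
-- to keep the output widths constant)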
4.1.2 Logarithmic adder
When fast addition is needed for large numbers, we cannot use the CPA, because of its O(n) behaviour. And the CSA cannot help us when we want to compute a single sum out of two terms. To get a faster adder we must somehow attack the long carry chain of the CPA.
The full adder circuit in section 2.3.1 made use of generate and propagate signals for computing the sum and carry-out bits. The expressions for the kth generate and propagate bits are:

$$g_k = a_k \cdot b_k \qquad (20)$$
$$p_k = a_k \oplus b_k \qquad (21)$$

Since $a_k$ and $b_k$ are available immediately, $g_k$ and $p_k$ are computed in constant time. These can be used to compute $c_{o,k}$ without waiting for a propagating carry. Such an adder is called a carry-lookahead adder. For simplicity, we denote the kth carry-out bit by $c_k$, so that

$$c_k = c_{o,k} = c_{i,k+1}$$
Then the expressions for the kth sum and carry-out bits, $s_k$ and $c_k$, are:

$$s_k = p_k \oplus c_{k-1} \qquad (22)$$
$$c_k = g_k + p_k \cdot c_{k-1} \qquad (23)$$

If we expand the expression for $c_k$, we get:

$$c_k = g_k + p_k \cdot (g_{k-1} + p_{k-1} \cdot c_{k-2})
     = g_k + p_k \cdot (g_{k-1} + p_{k-1} \cdot (\dots + p_1 \cdot (g_0 + p_0 \cdot c_{i,0}))) \qquad (24)$$

Since we don't have any carry-in to the whole adder, the innermost term, $p_0 \cdot c_{i,0}$, disappears. The number of levels needed to compute $c_k$ from the above expression is proportional to k, so an n-bit addition would still take O(n) time. There is, however, a way to arrange the $c_k$ calculations in a tree, in order to get an adder with O(log n) time. Previously we defined generate and propagate signals for single-bit additions, but they can be defined for groups of bits as well. The group from bit i to bit j in an addition is denoted group i:j. This group is marked by dashed boxes in Figure 4.2. Group generation occurs when a carry is generated somewhere inside the group, and then propagated through the rest of the group bits (Figure 4.2 (a)). And group propagation occurs when the carry-in to the group is propagated through the whole group (Figure 4.2 (b)).
Figure 4.2 – (a) Group generation. (b) Group propagation.
We define group generation so that $g_{i:j} = 1$ if the group i:j generates a carry, and $g_{i:j} = 0$ otherwise. We also define group propagation such that $p_{i:j} = 1$ if the incoming carry is propagated through the whole group, and $p_{i:j} = 0$ otherwise.
What we really are searching for is the different carry bits $c_k$. They are given as

$$c_k = g_{i:k} + p_{i:k} \cdot c_{i-1}$$

This expression says: either a carry is generated inside the group, or there is a carry-in that is propagated through the group. If we set the start of the group to i = 0, and use the fact that $c_{-1} = 0$ (no carry-in to the adder), we get

$$c_k = g_{0:k}$$

Our goal is to find an efficient way to calculate the different $g_{0:k}$. For this we need to be able to combine smaller groups into larger ones. First we define a carry pair $cp_{i:j}$, containing the generate and propagate bits of the group i:j:

$$cp_{i:j} = (g_{i:j},\, p_{i:j})$$
Then we define a dot operator • with the following properties:

$$(g', p') \bullet (g, p) = (g + p g',\; p p')$$

If the operands are carry pairs of adjacent groups $cp_{i:j}$ and $cp_{k:l}$, where $k = j + 1$, the result of the operation is a carry pair for the combination of the two groups:

$$cp_{i:j} \bullet cp_{k:l} = cp_{i:l}$$

Note that the operands must be in the above order, that is, the dot operation is not commutative. Now we take the $g_k$ and $p_k$ bits (which can also be written as $g_{k:k}$ and $p_{k:k}$), obtained in constant time, and combine them in a tree of dot operators to get the different $g_{0:k}$ bits. Then all carry-out bits are known, and the sum bits can be computed as

$$s_k = p_k \oplus c_{k-1}$$

This last operation is also done in constant time, so it is the carry generation that sets the speed limit of the adder.
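Before looking at the Lava circuit, the group algebra can be made concrete with a plain-Haskell model (a sketch only): carry pairs are (generate, propagate) Booleans, dot combines a lower group with the adjacent higher group, and folding dot over the bitwise pairs yields the carry out of the whole group.

type CP = (Bool,Bool)   -- (group generate, group propagate)

-- First operand is the less significant group, as in the text.
dot :: CP -> CP -> CP
dot (g',p') (g,p) = (g || (p && g'), p && p')

-- Carry out of group 0:k from the bitwise pairs (LSB first);
-- with no carry-in, c_k is the generate bit of the combined group.
carryOut :: [CP] -> Bool
carryOut = fst . foldl1 dot

-- carryOut [(True,False),(False,True)] == True
--   (bit 0 generates, bit 1 propagates)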
We have the list $\{cp_{0:0}, \dots, cp_{(n-1):(n-1)}\}$, and want a circuit that computes the list $\{cp_{0:0}, \dots, cp_{0:(n-1)}\}$ in O(log n) time. This could be done by a more general circuit with the following function,

$$\mathrm{carryGen}\,\{cp_{j:j}, \dots, cp_{k:k}\} = \{cp_{j:j}, \dots, cp_{j:k}\},$$

if we set j = 0 and k = n − 1.
Here the algorithm is demonstrated for the input list $\{cp_{0:0}, \dots, cp_{7:7}\}$: the list is split into two halves, $\{cp_{0:0}, \dots, cp_{3:3}\}$ and $\{cp_{4:4}, \dots, cp_{7:7}\}$. Then the halves are sent in parallel through the same circuit (recursively) to get the lists $\{cp_{0:0}, \dots, cp_{0:3}\}$ and $\{cp_{4:4}, \dots, cp_{4:7}\}$. Then $cp_{0:3}$ is combined (with the dot operator) with each element of the second list to form the list $\{cp_{0:4}, \dots, cp_{0:7}\}$. And finally, the list $\{cp_{0:0}, \dots, cp_{0:3}, cp_{0:4}, \dots, cp_{0:7}\}$ is returned. The recursive calls of the circuit follow the same procedure, until there is only one element left in the list. Then this element is returned. Figure 4.3 shows the tree built by this algorithm. The recursive calls are represented by dashed boxes. A generic algorithm follows the same pattern. The dot tree can always be fitted in a square as in the figure, which allows an efficient and regular layout of the tree (see [5]). And since the list is halved in each recursive call, the tree will always have $\log_2 n$ levels, and thus the carry generation is done in O(log n) time. However, because of this carry-lookahead tree, this adder is much larger than a usual CPA.
Here is the Lava code for the generic circuit:
carryGen [cp] = [cp]
carryGen cps  = cps1 ++ map (dotOp (last cps1)) cps2
  where
    (cps1,cps2) = (halveList ->- (carryGen -|- carryGen)) cps

dotOp (g',p') (g,p) = (or2 (g, and2 (p,g')), and2 (p,p'))

The input cps is the list $\{cp_{0:0}, \dots, cp_{(n-1):(n-1)}\}$.
Figure 4.3 – Carry generation tree for 8-bit addition.
Using the carryGen circuit, we can now write the whole logarithmic adder as:
logAdd (as,bs) = ss
  where
    gs  = (zipp2 ->- map and2) (as,bs)
    ps  = (zipp2 ->- map xor2) (as,bs)
    cps = zipp (gs,ps)
    cs  = (unzipp ->- first ->- skipLast) (carryGen cps)
    ss  = (zipp ->- map xor2) (ps, [low]++cs)

    first (g,p) = g

    carryGen [cp] = [cp]
    carryGen cps  = cps1 ++ map (dotOp (last cps1)) cps2
      where
        (cps1,cps2) = (halveList ->- (carryGen -|- carryGen)) cps

    dotOp (g',p') (g,p) = (or2 (g, and2 (p,g')), and2 (p,p'))
The lists gs $= \{g_0, \dots, g_{n-1}\}$ and ps $= \{p_0, \dots, p_{n-1}\}$ are obtained according to equations (20) and (21). Since zipp2 is used, this circuit works for numbers of different lengths. Then gs and ps are zipped into the list cps $= \{cp_{0:0}, \dots, cp_{(n-1):(n-1)}\}$. The list cs $= \{c_0, \dots, c_{n-2}\}$ is computed by the carryGen function. The expression (unzipp ->- first) extracts the list $\{g_{0:0}, \dots, g_{0:(n-1)}\}$ from $\{cp_{0:0}, \dots, cp_{0:(n-1)}\}$. As always, the last carry bit is neglected (skipLast). Finally, the list ss $= \{s_0, \dots, s_{n-1}\}$, which is the output number, is computed according to equation (22). The carry list is shifted one step, so that the first carry-out goes to the second sum-bit computation, and so on.
4.2 Carry-save array
The previous summation networks, linArray and addTree in section 2.3, used CPAs to reduce the partial products. CPAs are slow adders, because of the carry that ripples through them. Of course, we could use the faster logarithmic adder from section 4.1.2, but unless all CPAs are exchanged, there wouldn't be any large gain in using faster adders. Exchanging all CPAs for log adders would result in a huge multiplier, because of the larger area of the latter.
A better way to speed up the summation is to use an array of CSAs instead. As we saw in section 4.1.1, a CSA takes three numbers A, B, C as inputs and returns two, X and Y, such that A + B + C = X + Y, and this reduction is done in constant time. This suggests that an array of CSAs could be used for summing the partial products until there are only two terms left. These last two terms can then be added by a CPA to get the result. The nice thing is that if faster addition is needed, there is only one CPA to exchange, so the size of the multiplier will not explode. The reduceLin connection pattern, defined in section 2.3.3, can be used for the carry-save array too:
carrySave [p] = p
carrySave (p0:p1:ps) = (reduceLin csa (p0,p1) ->- cpa) ps
This results in an array of CSAs, with pairs of numbers flowing through it (the parameter c in the definition of reduceLin). The array is ended with one CPA that produces the result. Note that both the linear array (with CPAs) and the carry-save array result in a matrix of full adders. Figure 4.4 shows both matrices for 4×4-bit multiplication.
Figure 4.4 – Transforming a linear array to a carry-save array.
The two shaded full adders to the left in the carry-save array come from zeroes padded by the zipp3 circuit in the csa. They have both constant inputs and constant outputs, so they can be removed. Then we see that the two arrays have exactly the same number of full adders, and for each full adder, the number of constant inputs is also the same. The only difference is the wiring network.
The two arrays do not differ much in speed, but there is one significant difference. The
linear array has several critical paths with approximately the same delay, all going through the
carry chains of the CPAs. The carry-save array, however, has only one critical path, going
through the CPA at the bottom. So, if we exchange the CPA for a logarithmic adder, we
would expect the carry-save array to become much faster than the linear array:
carrySave [p] = p
carrySave (p0:p1:ps) = (reduceLin csa (p0,p1) ->- logAdd) ps
4.3 Wallace tree
Just as the CPAs of the linear array can be organized as a tree to allow parallel additions (section 2.3.4), the CSAs of the carry-save array can be arranged in a tree structure. This is called a Wallace tree [6]. Reduction of twelve partial products in a Wallace tree is shown in Figure 4.5.
CPA
Figure 4.5 – Wallace tree for twelve partial products
The Wallace tree in Lava uses a special CSA that looks like this:
csaWallace (a,[],[]) = (a,[])
csaWallace (a,b,[]) = (a,b)
csaWallace (a,b,c) = csa ((a,b), c)
This is an ordinary CSA as long as it is given three inputs. But if the last input, or the two last inputs, are empty lists (= non-existing), the existing inputs are just passed on to the next level (the dashed arrows in Figure 4.5).
The Lava code for the Wallace tree looks like this:
wallace ps
  | ((length ps) <= 2) = carrySave ps
  | otherwise          = wallace s
  where s = (group3 ->- map csaWallace ->- unpairWallace) ps
The partial products ps are grouped three by three, and each group is mapped to a csaWallace. The results from all CSAs are merged into a new list, called s. Then a recursive call wallace s is made. When there are only two terms left, they are added by the carry-save array, which means that the Wallace tree is ended with a log adder.
Now we want to find out how many steps are needed for computing the sum of m terms with a Wallace tree. If the number of terms $m_i$ in one step is a multiple of three ($m_i = 3q$, where q is an integer), the number of terms $m_{i+1}$ in the next step will be

$$m_{i+1} = \tfrac{2}{3} m_i = 2q \qquad (25)$$

If we had the same reduction factor (2/3) for all levels, and started with m terms, we would get the relations

$$m_0 = m,\quad m_1 = \tfrac{2}{3} m,\quad m_2 = \tfrac{2}{3} m_1 = \left(\tfrac{2}{3}\right)^2 m,\quad \dots,\quad m_x = \left(\tfrac{2}{3}\right)^x m \qquad (26)$$

where x is the number of steps needed until we have $m_x$ terms left. So, if we want to know how many steps are needed to sum m terms into two, we set $m_x = 2$ and solve equation (26) for x:

$$2 = \left(\tfrac{2}{3}\right)^x m
\;\Rightarrow\; \log(m/2) = x \log(3/2)
\;\Rightarrow\; x = \frac{\log(m/2)}{\log(3/2)} \in O(\log m)$$
When two terms are left, they are summed by the final adder (whose delay is independent of m). The conclusion is that, under the assumption that we have the same reduction factor (2/3) at all levels of the tree, Wallace can compute the sum in O(log m) steps. But since we don't have full groups at all levels, the above assumption is only approximately true. However, it can be shown that the O(log m) behaviour is not affected by that.
So, the Wallace tree sums m partial products in O(log m) time, just like the adder tree in section 2.3.4. But Wallace has no carry chains, so we expect it to be the fastest method so far.
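The exact number of CSA levels, including the steps with non-full groups, can be computed with a small Haskell function (an aside for checking the estimate above; wallaceSteps is an illustrative helper, not circuit code):

-- Number of CSA levels needed to reduce m terms to two: each level
-- turns every complete group of three terms into two and passes
-- the one or two leftover terms through unchanged.
wallaceSteps :: Int -> Int
wallaceSteps m
  | m <= 2    = 0
  | otherwise = 1 + wallaceSteps (2 * (m `div` 3) + m `mod` 3)

-- map wallaceSteps [3,4,6,9,13,19] == [1,2,3,4,5,6]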
5 Non-standard interpretation
This chapter describes how the standard Lava gates can be exchanged for non-standard ones, with additional features. These gates work differently depending on whether the input bits are constant or not, which means that computation on constant bits is done with less hardware than computation on variable bits. They also allow timing and size information to flow in parallel with the bits in the circuit. This can be used to estimate the speed and size of constructed circuits.
5.1 Non-standard gates
5.1.1 Self-reducing gates
In the circuits in the previous chapters, we made use of constant bits for shifting partial products, controlling the size of summations of partial products, handling addition of numbers of different lengths, and so on. The reason for this was to get more abstract circuit descriptions. Since these constant bits were treated just like the variable ones, our circuit descriptions used lots of unnecessary hardware. For example, if one input to an and2 gate is constant zero, the output will also be constant zero, and we don't need any gate to compute that. Moreover, a full adder with one constant zero input actually functions as a half adder. If it has two constant zeroes, it functions as a simple wire between the variable input and the s output. The linear array circuit in Figure 2.5 has several full adders with one or more constant inputs (grey arrows). Therefore, the array can be significantly reduced. This is shown in Figure 5.1. On the other hand, when our circuit descriptions are given to a tool for implementing real hardware on a chip, this tool will automatically do this reduction for us. However, it would be nice to be able to make descriptions that are already usable, and do not have to rely on another tool for optimisations.
We define a constant bit in Lava to be one with the value low or high. So, we can have our circuits compare their inputs with low and high, and give different functions depending on whether the inputs are constant or not. Of course, an expression such as inv(low) also has a constant value, but that will not be recognized by our circuits. For example, a special non-standard inverter that inverts variable bits and passes constant bits through can be written as

invNS a
  | (a == low || a == high) = a
  | otherwise               = inv a
If we test the circuit, we get:
Main> invNS low
low
Main> invNS high
high
Main> invNS (var"a")
inv[a]
Figure 5.1 – Reducing the hardware in the linear array.
The problem with comparing the inputs is that there are many cases. For a full adder, we have one case where all inputs are variable, three cases with one constant zero input, three cases with two constant zero inputs, one case with three constant zero inputs, plus all the cases where one or more inputs are constant one. It is very inconvenient to test all these cases, and for larger circuits, like a selection circuit or a multiplier, it is impossible. But since all circuits are built up from the standard gates inv, and2, nand2, or2, nor2, xor2 and xnor2, it is actually enough to implement the reduction in these. The reduction of the larger circuits will then follow automatically. For example, a non-standard and2 gate, and2NS, would look like this:
and2NS (a,b)
  | (a == low || b == low) = low
  | (a == high)            = b
  | (b == high)            = a
  | otherwise              = and2 (a,b)
And if we test it:
Main> and2NS (low,high)
low
Main> and2NS (high,high)
high
Main> and2NS (high,var"a")
a
Main> and2NS (var"a",low)
low
Main> and2NS (var"a",var"b")
andl[a,b]
In the first, second and fourth tests, the outputs are constant, and no gate is needed. In the third test, the gate is reduced to a wire. It is only in the last test that a gate is actually created.
5.1.2 Time estimation
We can also add more features to the non-standard gates. In real circuits, all gates have a
small delay between the inputs and outputs. The delay of a bit can be modelled by combining
the bit with a time integer, which tells after how many time units the bit’s value is stable.
Then the delay of a gate can be modelled as an additive constant. That is, the time of the gate
output is given as the time of the latest input plus a constant. Here is the and2NS with a delay
of two time units (the input and output bits have the type (Integer, Signal Bool)):
and2NS ((aTime,a), (bTime,b))
  | (a == low || b == low) = (0,low)
  | (a == high)            = (bTime,b)
  | (b == high)            = (aTime,a)
  | otherwise              = (outTime, and2 (a,b))
  where outTime = maximum [aTime,bTime] + 2
Testing the new and2NS gives us:
Main> and2NS ((3,var"a"), (0,low))
(0,low)
Main> and2NS ((0,high), (12,var"b"))
(12,b)
Main> and2NS ((3,var"a"), (4,var"b"))
(6,andl[a,b])
In the first test, one of the inputs is constant zero; so, the output is reduced to a constant zero with no delay (constant bits with delay do not exist). In the second test, one of the inputs is a constant one; then the gate is reduced to a simple wire, and the output gets the same delay as the variable input. In the third test, a gate is created, and the delay of the output is given as two plus four time units. If gates such as and2NS are used for building larger circuits, the time information will flow together with the bits through the circuit, and it follows automatically that the delay of the circuit's output bits will be an estimation of the total circuit delay.
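The additive timing model generalizes to any gate: the output time is the maximum of the input times plus the gate delay. A tiny stand-alone sketch of this idea (not the Lava code; gate2 and andChain are hypothetical names):

-- A bit tagged with the time at which its value is stable.
type Timed a = (Integer, a)

-- Lift a two-input function to timed inputs, adding a gate delay d.
gate2 :: Integer -> (a -> a -> a) -> Timed a -> Timed a -> Timed a
gate2 d f (ta,a) (tb,b) = (max ta tb + d, f a b)

-- Two AND gates in series, each with delay 2: the result is stable
-- at (latest input time) + 4, as the additive model predicts.
andChain :: Timed Bool -> Timed Bool -> Timed Bool -> Timed Bool
andChain a b c = gate2 2 (&&) (gate2 2 (&&) a b) c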
5.1.3 Size estimation
Now that we are able to estimate circuit delay with simple models, we should be able to estimate circuit size in a similar manner. We let each bit be combined with both time and size information, where the size tells how many area units have been used to create the bit's value. Then the size of a gate output is equal to the sum of the inputs' sizes plus a constant. Here is the and2NS gate with size modelling (the input and output bits have the type ((Integer, Integer), Signal Bool)):

and2NS (((aSize,aTime),a), ((bSize,bTime),b))
  | (a == low || b == low) = ((0,0),low)
  | (a == high)            = ((bSize,bTime),b)
  | (b == high)            = ((aSize,aTime),a)
  | otherwise              = ((outSize,outTime), and2 (a,b))
  where
    outTime = maximum [aTime,bTime] + 2
    outSize = aSize + bSize + 2
The total size of a circuit built with such gates should then be given as the sum of all output bits' sizes. However, this is only true for circuits where no gates share the same signal. If a signal with some size ≠ 0 is shared between two or more inputs, its size will be counted multiple times. Therefore, we must make sure that whenever a signal is shared between several inputs, the size information is reset for all sharing inputs except one, and that this non-reset input is represented in the result. Bits that are not represented in the result occur, for example, when the higher bits of a number are trimmed.
5.2 New environment for building circuits with non-standard gates
In this section, we will define a new environment in Lava, where our previously described
circuits can be transformed – with a minimum of changes – into circuits with non-standard
gates. These circuits will have the same function as the previous ones, but they will
automatically reduce the hardware used for counting on constant bits, and they will also allow
efficient time and size estimation. The code for the circuits in the new environment is found in
Appendix B. We will redefine some more of the built-in Lava gates and functions. Therefore,
we have to hide some more function in the non-standard code:
import Lava hiding
(high, low, inv, and2, nand2, or2, nor2, xor2, xnor2, mux, zeroList)
import Arithmetic hiding (halfAdd, fullAdd, bitMulti)
5.2.1 New types
In section 5.1, we had bits combined with time and size information. For this we define
new types:
type Normnumber = [Signal Bool]
type Bittime    = Integer
type Bitsize    = (Integer, (Integer,Integer))
type Info       = (Bittime, Bitsize)
type Infobit    = (Info, Signal Bool)
type Infonumber = [Infobit]

Normnumber is the normal representation of a number – a list of bits where LSB is the first element and MSB is the last.
Infobit is a normal bit (Signal Bool) combined with information (Bittime, Bitsize).
Bittime is an integer, which tells after how many time units the bit's value is stable.
Bitsize has the type (Integer,(Integer,Integer)), where the first integer counts the number of gates, and the second and third integers count the number of half and full adders respectively. In section 5.1.3, we had only the gates parameter.
Infonumber is a list of info bits where, as for normal numbers, LSB is the first element.
5.2.2 Constants
Now we can define constants with the new types:
zeroSize :: Bitsize
zeroSize = (0, (0,0))
zeroInfo :: Info
zeroInfo = (0, zeroSize)
low :: Infobit
low = (zeroInfo, Lava.low)
high = (zeroInfo, Lava.high)
lowVar :: Infobit
lowVar = (zeroInfo, Lava.inv Lava.high)
highVar = (zeroInfo, Lava.inv Lava.low)
zeroList :: Int -> Infonumber
zeroList n = replicate n low
zeroListVar :: Int -> Infonumber
zeroListVar n = replicate n lowVar
The constants lowVar and highVar are signals that the self-reducing gates treat as
variable bits. They can be used when simulating the circuit, to make sure that the result is
computed through actual gates.
5.2.3 Type conversions
We also need to define operations for the new types. Here is a list of type conversions
with short comments afterwards:
norm2info :: Normnumber -> Infonumber
norm2info [] = []
norm2info (a:as) = (zeroInfo,a):(norm2info as)
n2iPair = (norm2info -|- norm2info)
makeVar :: Infonumber -> Infonumber
makeVar [] = []
makeVar ((aInfo,a):as) =
(aInfo, (Lava.inv ->- Lava.inv) a):(makeVar as)
makeVarPair = (makeVar -|- makeVar)
These are used to convert from normal numbers to info numbers. The makeVar function
makes sure that all bits in the info number are treated as variable.
valueB :: Infobit -> Signal Bool
valueB (info,a) = a
value :: Infonumber -> Normnumber
value = map valueB
timeB :: Infobit -> Bittime
timeB ((time,size),a) = time
time :: Infonumber -> Integer
time = (map timeB) ->- maximum
infoB (info,a) = info
gateSizeB :: Infobit -> Integer
gateSizeB ((_,(gates,_)),a) = gates
halfB :: Infobit -> Integer
halfB ((_,(_,(h,f))),a) = h
fullB :: Infobit -> Integer
fullB ((_,(_,(h,f))),a) = f
sizeB :: Infobit -> Bitsize
sizeB ((time,size),a) = size
size :: Infonumber -> Bitsize
size as = (sum gateSizes, (sum halfs, sum fulls))
  where
    gateSizes = map gateSizeB as
    halfs     = map halfB as
    fulls     = map fullB as
These extract information from info bits and numbers. Functions whose names end in 'B' work on bits. The time function returns the time when a number is stable. This is the same as the time when all bits in it are stable, and it is given by the expression (map timeB) ->- maximum (the maximum time over all bits). The size of a number is the number of gates or counters needed to create the bits in it; so, the size function returns the sum of all bit sizes.
5.2.4 Manipulating the information
We need to have an abstract way to manipulate the information of info bits and numbers.
All manipulation should be done through the following functions, listed with short comments
afterwards:
incTime :: Integer -> Infobit -> Infobit
incTime t ((time,size),a) = ((time+t, size),a)
incGateSize :: Integer -> Infobit -> Infobit
incGateSize g ((time,(gates,counts)),a) = ((time,(gates+g,counts)),a)
incHalf :: Integer -> Infobit -> Infobit
incHalf h ((time,(gates,(half,f))),a) = ((time,(gates,(half+h,f))),a)
incFull :: Integer -> Infobit -> Infobit
incFull f ((time,(gates,(h,full))),a) = ((time,(gates,(h,full+f))),a)
These are used to increase any info parameter by an integer.
resetTimeB :: Infobit -> Infobit
resetTimeB ((time,size), a) = ((0,size), a)
resetTime :: Infonumber -> Infonumber
resetTime = map resetTimeB
resetSizeB :: Infobit -> Infobit
resetSizeB ((time,size), a) = ((time,zeroSize), a)
resB = resetSizeB
resetSize :: Infonumber -> Infonumber
resetSize = map resetSizeB
res = resetSize
resSizePair = (resetSizeB -|- resetSizeB)
These functions are used when we want to measure a smaller part of a circuit. For
example, if we want to measure the summation network only, all numbers should be reset
immediately after the PPG. The resetSize functions are also used when signals are shared
between two or more inputs (this problem was discussed in section 5.1.3).
mergeInfos :: (Info,Info) -> Info
mergeInfos ((atime,asize), (btime,bsize)) = (maximum [atime,btime],
addSizes (asize,bsize))
This function merges two infos into one. This is needed when the output of a gate depends on two input bits. The time when the input is stable is equal to the maximum of the input times, and the "size" of the input is equal to the sum of the input sizes.
increaseCounts (a,b,c) d = case (countVari [a,b,c]) of
  2         -> incHalf 1 d
  3         -> incFull 1 d
  otherwise -> d
This function increases the half adder or the full adder parameter of the bit d, depending on how many of the bits (a,b,c) are variable. If three bits are variable, the full adder parameter is increased by one. If two bits are variable, the half adder parameter is increased, and if fewer than two are variable, no parameter is increased. This is used to increase the proper parameter in a full adder, since a full adder is reduced if one or more of its inputs are constant. The helper function countVari returns the number of variable bits in a list; it is included in the full code in Appendix B.
5.2.5 New gates
Now we are ready to redefine the standard Lava gates. Here is the and2 gate for info bits,
with all the non-standard features from section 5.1:
and2 (a,b)
  | (a == low || b == low) = low
  | (a == high)            = b
  | (b == high)            = a
  | otherwise = (incTime andDelay ->- incGateSize andSize)
                  (info, Lava.and2 (valueB a, valueB b))
  where info = mergeInfos (infoB a, infoB b)
The info of the output is then given by merging the infos of the inputs, and then increasing
the time and gate parameter by the constants andDelay and andSize. The other gates, inv,
nand2, or2, nor2, xor2 and xnor2 are written in a similar way (see the full code in Appendix
B).
Here are the delay and size constants for the different gates:
invDelay = 1
invSize = 1
andDelay = 3
andSize = 3
nandDelay = 2
nandSize = 2
orDelay = 3
orSize = 3
norDelay = 2
norSize = 2
xorDelay = 3
xorSize = 3
xnorDelay = 3
xnorSize = 3
The actual values of these constants depend very much on how the hardware is implemented. Although no motivation is given for how these values were chosen, we expect them to at least give us a hint about the speed and size of our circuits – enough to be able to compare different circuits.
Here are the fullAdd and halfAdd circuits in the new environment:
halfAdd (a,b) = (incHalf 1 s,co)
where
s = xor2 (a,b)
co = (resSizePair ->- and2) (a,b)
fullAdd (ci,(a,b)) = (s,co)
where
g = (resSizePair ->- and2) (a,b)
p = xor2 (a,b)
s = increaseCounts (a,b,ci) (xor2 (p,ci))
co = or2 (g, (resSizePair ->- and2) (p,ci))
They are defined just like in section 2.3.1, but here the s output gets its half adder or full adder parameter increased. Also, for some signals, the size is reset, to prevent the size from being counted multiple times.
5.3 Non-standard circuits
Now we can use the same circuit descriptions as before in the new environment, and get
circuits built upon non-standard gates. The only thing we need to think about is the resetting
of sizes when signals are shared between more than one input. In the full code in Appendix B,
we see how this is done.
Now all our multipliers work for info numbers as inputs and outputs. Multipliers for
normal numbers are written as, for example:
normMult = n2iPair ->- ppgSimple ->- linArray ->- value
Since we have redefined the constants low and high as info bits, we don’t have to make
the n2iPair conversion when simulating the circuits. However, we need to make them
variable so that our simulation is done through actual circuits. Therefore, we define a new
simulation function (only for circuits with type (Infonumber, Infonumber) ->
Infonumber):
sim circ = (simulate (makeVarPair ->- circ ->- value))
An example of a simulation with this function is:
Main> sim mult1 ([low,high,low,high],[high,low,low,high])
[low,high,low,high,high,low,high,low]
Note that the result is a normal number – low and high in a result always refer to the
constants Lava.low and Lava.high respectively.
For measuring the delay and size of the circuits, we have the following functions:
how_fast circ n = (n2iPair ->- circ ->- time)
((replicate n (var "a")),(replicate n (var "a")))
how_fast2 ppg sum n = how_fast (ppg ->- sum) n
how_big circ n = (n2iPair ->- circ ->- size)
((replicate n (var "a")),(replicate n (var "a")))
how_big2 ppg sum n = how_big (ppg ->- sum) n
measure circ n = (how_fast circ n, how_big circ n)
measure2 ppg sum n = measure (ppg ->- sum) n
For example:
Main> measure2 (boothMux 3) wallace 20
(204,(11183,(33,112)))
This tells us that the multiplier circuit with (boothMux 3) PPG and wallace summation, for 20 bits, computes the result in 204 time units, using 11183 area units, 33 half adders and 112 full adders.
6 Dadda summation
In this chapter, disadvantages of the Wallace
summation network are found. An improved
method called Dadda summation, which is both
slightly faster and smaller than Wallace, is
described.
6.1 Wallace disadvantages
The Wallace summation was described in section 4.3 as a faster organization of the CSAs
in the carry-save array. If we measure the two methods for 12×12-bit multiplication with simple PPG, we get:
Main> measure2 ppgSimple carrySave 12
(123,(2484,(11,109)))
Main> measure2 ppgSimple wallace 12
(78,(2634,(34,102)))
This shows that Wallace is indeed faster, but at the same time larger, with 34 half adders instead of 11 for the carry-save array. To understand where these extra half adders come from, we need to look at the Wallace summation at bit level. We remember that Wallace is a tree structure where, in each step, the terms are grouped three by three and mapped onto CSAs. Each CSA has two numbers as output, which are the summation terms of the next step. Figure 6.1 shows how a CSA works at bit level. The bits of each column go to one bit adder, and the outputs of all these form two new numbers. Bits that are outputs from the same full adder are held together with a line, and half adder outputs are specially marked by a small crossing line.
Figure 6.1 – Carry-save adder at bit level.
Figure 6.2 shows how Wallace summation works at bit level for 12×12-bit multiplication with simple PPG (the shifting zeroes are left out). We see that half adders occur at the right and left ends of the summation matrices, but not in the middle, and that the same bits go through several half adders. Although half adders are sometimes needed, they do not actually reduce the number of bits in the matrices (two inputs, two outputs). Therefore, it seems to be a waste of hardware to have the same bits going through several half adders.
Figure 6.2 – 12×12-bit Wallace summation at bit level.
6.2 Improved Wallace; the Dadda summation network
Dadda [7] proposed a scheme for summation in a Wallace-like fashion – but more efficiently. The key is to look at the partial product matrices as columns instead of rows. Since all bits in a column have the same numerical magnitude, they can be summed in any order at any summation step. In one step, any two or three bits from the same column can be added into a 2-bit sum, using half or full adders as in Figure 6.1. This is called "column compression", and we can choose how many bits we want to compress in each step. What Wallace does is to compress all columns to the maximum in each step. In fact, this would have been optimal for matrices without empty gaps, but because of the empty gaps to the right and to the left of the matrices, we have several columns in each step that are shorter than the others. These shorter columns need not be compressed to the maximum; some bits can be kept until a later stage. So, we need to find the optimal amount of compression for each column. That is, we want to keep all columns as long as possible in each step, without requiring more summation steps than Wallace does. We can reason like this:
The largest number of terms $n_1$ that can be compressed into $n_2$ terms is

$$n_1 = 3 n_2 / 2,$$

if $n_2$ is even, and

$$n_1 = 3(n_2 - 1)/2 + 1,$$

if $n_2$ is odd. The even expression is obtained by noting that $3 n_2 / 2$ terms can be added by $n_2 / 2$ CSAs with three numbers each as inputs. In the odd expression, $3(n_2 - 1)/2$ terms are added by $(n_2 - 1)/2$ CSAs, and the odd term is passed on without addition. In fact, both expressions are covered by the formula:

$$n_1 = \lfloor 3 n_2 / 2 \rfloor \qquad (27)$$
The result of the Wallace tree is two numbers – or a list of columns with two bits in each –
that are to be added by the final adder. So, if we start with two terms, and repeatedly use (27)
to compute the next higher number of terms, we get the series:
2, 3, 4, 6, 9, 13, 19, 28, 42, 63, ...
The ith number in this series (starting from zero) gives the largest number of terms that
can be reduced to two terms in i summation steps. For example, four terms need two steps,
but five and six terms need three steps. Seven, eight and nine terms need four steps, and so on.
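As a quick sanity check, the series can be generated directly from (27). A minimal Haskell
sketch (daddaSeries is not part of the thesis code):

daddaSeries :: [Int]
daddaSeries = iterate (\n2 -> div (3*n2) 2) 2
-- take 10 daddaSeries  ==  [2,3,4,6,9,13,19,28,42,63]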
We said before that we wanted to have as long columns as possible in each step, without
getting extra summation steps. This is achieved if we follow the numbers in the series, when
compressing. For example, if there are twelve terms in one step, they could be reduced into
eight in the next step (as Wallace would do). But the total number of summation steps, would
be unchanged if they were reduced into nine terms instead. This is the trick that Dadda
summation uses – not for entire terms, but for single columns.
The Dadda scheme is illustrated in Figure 6.3 for 12 × 12-bit multiplication with simple
PPG. First, all columns are lifted up to the top line, and at the same time, all empty gaps are
removed. This operation is valid, because all bits in the same column have the same numerical
magnitude, that is, they can be added in any order. Then the leftmost column becomes empty,
but we need to keep it anyway, since this column sets the size of the result. Then, in each step,
we start by finding the length n1 of the longest column, and the next lower number n2 from
the above series. All columns are compressed into n2 bits. The compression should start from
the least significant column, because it results in carry-out bits that belong to the next higher
column. So, the next column has to be compressed even more. Columns shorter than n2 bits
(including carry bits from the previous column) need not be compressed in this step.
We can see in Figure 6.3 that the maximum number of column bits in each step follows
the above series, and that it uses the same number of steps as Wallace in Figure 6.2 – that is,
the O(log m) behaviour is inherited from Wallace. We also see that we have got rid of all the
extra half adders that Wallace used.
[Figure: Dadda summation matrices at bit level, starting with the lifting step; legend: bits, padded zeroes, full adder outputs (s, c), half adder outputs (s, c), empty column.]
Figure 6.3 – 12×12-bit Dadda summation at bit level.
We start by writing the Lava code for the function that computes n2 from n1:

nextLength len = nextLengthHelp 2
  where
    nextLengthHelp k
      | (k < len) && (len <= nextk) = k
      | otherwise                   = nextLengthHelp nextk
      where nextk = div (3*k) 2
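For example, nextLength 12 evaluates to 9: twelve terms are compressed into nine, just as in
the example above.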
Then comes the function that transforms a list of rows to a list of columns:
transposePPs ps
  | allFinished ps = []
  | otherwise      = c : transposePPs cs
  where
    c  = (map head2) ps
    cs = (map tail2) ps

    allFinished []      = True
    allFinished ([]:as) = allFinished as
    allFinished (a:as)  = False

This operation can be seen as transposing the matrix ps. It uses head2 and tail2 from
section 2.3.2, so that all empty gaps after the MSB of the partial products are filled with constant
zeroes. This means that we get a square matrix – a list of equal-length columns. The first
elements of the columns come from the topmost row in ps. The lifting is done by the following
function, which, in fact, only removes the constant zeroes from the columns:
compressGap [] = []
compressGap (a:as)
  | (a == low) = compressGap as
  | otherwise  = a : compressGap as
Here is the circuit that compresses a column c into the length len:

compressCount len c | (length c <= len)     = (c, [])
compressCount len c | (length c == (len+1)) = (x:cs, [y])
  where
    c0:c1:cs = c
    (x,y)    = halfAdd (c0,c1)
compressCount len c = (x:xs, y:ys)
  where
    c0:c1:c2:cs = c
    (x,y)       = fullAdd (c0, (c1,c2))
    (xs,ys)     = compressCount (len-1) cs
It returns a pair with the compressed column and a list of all carry-out bits from the
compression. If the length of c is less than or equal to len, the column is returned as it is and
the list of carry bits is empty. If c has length len plus one, it is compressed by a half adder,
and one carry bit is returned. If c has more bits, it is recursively compressed by as many full
adders as needed. Each full adder compresses three bits into one sum bit (plus the carry-out),
so to compensate for this, the len parameter is decreased by one in each recursive call.
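As an illustration (the bits a–e below are hypothetical), here is how compressCount reduces a
five-bit column to the length 3:

-- compressCount 3 [a,b,c,d,e]  ==  ([s,d,e], [co])
--   where (s,co) = fullAdd (a,(b,c))
-- One full adder fires on the first three bits; the remaining bits d and e
-- already fit within len-1 = 2, so the recursion stops.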
Here is the function that performs one step of the Dadda summation:

compressDaddaCols len prev []     = []
compressDaddaCols len prev (c:cs) = (x ++ prev) : ys
  where
    (x,next) = compressCount (len - (length prev)) c
    ys       = compressDaddaCols len next cs
It goes recursively through the columns cs, and compresses them into the length len. The
parameter prev is the list of carry bits from the reduction of the previous column. Since prev
will be appended to the current column, it has to be compressed to (len - (length prev))
bits. Note that prev is appended after the compression. It is very important that prev is not
compressed in this step (although it would have been possible), since that would mean that
this column would have to wait for the compression of the previous column. That would give
a carry-propagating effect to the whole summation step, which would be devastating to the
speed of the summation network. But when prev is not included in the compression, all
columns in one summation step are compressed in parallel. The carry bits from the current
compression are given as prev to the next column.
With all helper functions in place, we can write the Dadda summation as:
dadda ps
  | (n == 0)  = zeroList (length cs)
  | otherwise = daddaHelp cs
  where
    cs = (transposePPs ->- (map compressGap)) ps
    n  = ((map length) ->- sort ->- last) cs

    daddaHelp cs
      | (n1 <= 2) = (transposePPs ->- carrySave) cs
      | otherwise = daddaHelp (compressDaddaCols n2 [] cs)
      where
        n1 = ((map length) ->- sort ->- last) cs
        n2 = nextLength n1
The list of columns cs is given by transposing and lifting the list of partial products ps. If
the longest column n in cs has no bits (happens if all bits in ps are constant zeroes), a zero list
is returned. Otherwise, a call is made to the helper function daddaHelp. In the helper
function, n1 is the longest column in cs, and n2 is the length that cs is compressed into. In
each recursive step, cs is compressed to the length n2, and when all columns have two or
fewer bits, these are transposed back into two numbers and added by the carry-save array.
If we measure the new summation method, we get:
Main> measure2 ppgSimple dadda 12
(75,(2592,(11,99)))
We now have a summation network that is both smaller and faster than Wallace.
7 Results
In this chapter, the different partial product
generators and summation networks are
combined into multiplier circuits and verified
for correctness. The different circuits are also
compared in terms of estimated speed and size.
7.1 Verification
In the full code in Appendix B, several multipliers are defined:
mult1 = ppgSimple ->- linArray
mult2 = ppgSimple ->- addTree
mult3 = boothBasic 2 ->- linArray
mult4 = boothBasicMux 3 ->- addTree
mult5 = ppgSimple ->- carrySave
mult6 = ppgSimple ->- wallace
mult7 = booth 2 ->- carrySave
mult8 = booth 3 ->- wallace
mult9 = ppgSimple ->- dadda
mult10 = (booth 2) ->- dadda
mults = [mult1,mult2,mult3,mult4,mult5,mult6,mult7,mult8,mult9,mult10]
These are made to be a representative variety of all our described circuits. If we can verify
that all these are correct for some sizes, we assume that the descriptions are correct. For this,
we redefine the verif function from section 2.4.2 to work for info-bit multipliers. We also
define two new functions:
verif mult n = vis
  (circCorrectForSizes n n multi (n2iPair ->- mult ->- value))

verif2 ppg sum n = vis
  (circCorrectForSizes n n multi (n2iPair ->- ppg ->- sum ->- value))

verif3 ppg sum m n = vis
  (circCorrectForSizes m n multi (n2iPair ->- ppg ->- sum ->- value))
These are used to verify the correctness of the multiplier mult or (ppg ->- sum), for all
inputs of length n, or m and n. Verification of a list of circuits for the length n, can be done by
the following function:
verif_circs circs n = mapM vis
  [ circCorrectForSizes n n multi (n2iPair ->- mult ->- value)
  | mult <- circs ]
And if we test it on the list mults, we get:
Main> verif_circs mults 5
Vis: ... (t=0.3) Valid.
Vis: ... (t=0.3) Valid.
Vis: ... (t=0.3) Valid.
Vis: ... (t=0.4) Valid.
Vis: ... (t=0.3) Valid.
Vis: ... (t=0.3) Valid.
Vis: ... (t=0.4) Valid.
Vis: ... (t=0.4) Valid.
Vis: ... (t=0.3) Valid.
Vis: ... (t=0.4) Valid.
Main> verif_circs mults 10
Vis: ... (t=8:56.6) Valid.
Vis: ... (t=6:45.7) Valid.
Vis: ... (t=4:52.0) Valid.
Vis: ... (t=5:36.5) Valid.
Vis: ... (t=3:43.4) Valid.
Vis: ... (t=7:11.8) Valid.
Vis: ... (t=8:41.5) Valid.
Vis: ... (t=4:32.2) Valid.
Vis: ... (t=6:24.1) Valid.
Vis: ... (t=9:11.6) Valid.
7.2 Performance comparison
7.2.1 Partial product generators
Now we will measure different combinations of PPGs and summation networks, and see
which ones have the best performance. The PPGs are measured for 24-bit multiplication;
other lengths give similar results. We start by examining the Booth PPGs together with linear
array summation as a function of the selection group length. This is shown in Figure 7.1.
Remember that boothBasic 1 is exactly equivalent to ppgSimple.
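This equivalence can be checked with the verification machinery from section 7.1. A sketch,
assuming the Appendix B definitions are in scope (verif_booth1 is a hypothetical helper, not
part of the thesis code):

verif_booth1 n = vis (circCorrectForSizes n n
  (n2iPair ->- ppgSimple ->- linArray ->- value)
  (n2iPair ->- boothBasic 1 ->- linArray ->- value))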
We see that all tested Booth PPGs are faster than ppgSimple, although at the same time
larger. The fastest PPG is boothBasic 6, with 306 time units of delay, and not much more
area than ppgSimple. Beyond 6-bit selection, the delay of boothBasic only increases. For
Booth with multiplexer selection, the size increases drastically with the selection group
length. The main contribution to this increase comes from the multiplexer circuits. They are
both larger and slower than the versions without multiplexers, so there seems to be no gain in
letting the partial products share the same selection table. Apart from that, there is not much
variation in area at all between the methods. The boothBasic PPGs have just below 10 000
area units, and the booth PPGs have just above 11 000 units.
[Figure: two plots of time units (300–420) and area units (8 000–26 000) versus selection group length (1-bit to 7-bit) for boothBasic, boothBasicMux, booth, boothMux and ppgSimple.]
Figure 7.1 – Measurements on PPGs for 24-bit multiplication with linear array summation.
It may seem surprising that the improved Booth PPGs are slower than the basic ones. The
explanation lies in the structure of linArray. All critical paths of the linear array start from the
least significant bit of the first partial product. The least significant bits from the improved
Booth are actually slower (because of the more complex selection network) than from the
basic Booth. On the other hand, the most significant bits are slower in the basic Booth, but
this delay is hidden by the CPAs of the linear array, so we do not see it.
We said that the result of the previous measurement was largely affected by the
summation network. Therefore, in Figure 7.2 we have measured boothBasic and booth
again, with Dadda summation instead.
[Figure: two plots of time units (0–250) and area units (9 500–13 000) versus selection group length (1-bit to 4-bit) for boothBasic, booth and ppgSimple.]
Figure 7.2 – Measurements on PPGs for 24-bit multiplication with Dadda summation.
Now we start seeing the real nature of the PPGs. We can note that all PPGs except
ppgSimple and booth 2 make use of CPAs, and all of those have delays of more than 200
time units. This is because the delay of the most significant bits is no longer hidden by the
summation network when Dadda is used. The conclusion is that when we have summation
networks without CPAs, the only usable PPGs are ppgSimple and booth 2.
We also see that ppgSimple gives both the fastest and the smallest multiplier of all, which
might be surprising. This is because of the O(log m) behaviour of the Dadda summation,
which means that doubling the number of terms only adds a constant to the summation delay.
The ppgSimple gives approximately twice as many partial products as booth 2, but the extra
delay for summing these is apparently smaller than the constant selection delay in booth 2.
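As a concrete instance of this logarithmic behaviour, the series from chapter 6 shows that 24
partial products need seven compression steps (28 is the first entry of at least 24), while 48
need only nine (entry 63) – two extra CSA delays for twice the number of terms.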
7.2.2 Summation networks
In Figure 7.3 we have measurements of all summation networks for 12, 24, 36 and 48-bit
multiplication with simple PPG:
[Figure: two plots of time units (0–900) and area units (0–50 000) versus input length (12, 24, 36 and 48 bits) for linArray, addTree, carrySave, Wallace and Dadda.]
Figure 7.3 – Measurements on summation networks with simple PPG.
The first thing we can notice is that all methods have approximately the same size
(carrySave, wallace and dadda are slightly larger because of the logarithmic adders at the
ends). Regarding the delays, we see that the linArray, addTree and carrySave are most
sensitive to the number of bits, with a delay approximately proportional to the number of bits.
The carry-save array is the least sensitive of those. The Wallace and Dadda summations are
very close in the measurements. Their most important feature is the very low sensitivity to the
number of bits. A doubling of the number of bits results in a delay increase of approximately
20 time units only.
The result of the measurements is that the fastest multiplier circuit of all is a simple PPG
combined with a Dadda summation. However, we should bear in mind that the actual delay of
a circuit also depends on the amount of regularity in the layout. In fact, linArray, addTree
and carrySave, which had the most critical delay properties, are also the ones that can most
easily be given a regular layout. The wallace and dadda networks, on the other hand, are
irregular (see [8] for a regular Wallace-like multiplier).
7.2.3 Regular multipliers
We want to find the combination with a regular summation network that gives the highest-performing multiplier. In Figure 7.4 and Figure 7.5, addTree and carrySave are measured
with different PPGs (linArray is excluded, because we expect carrySave to be faster in all
possible combinations).
[Figure: two plots of time units (250–290) and area units (8 000–12 500) versus selection group length (1-bit to 4-bit) for boothBasic, booth and ppgSimple.]
Figure 7.4 – Measurements on the adder tree summation network with different PPGs.
[Figure: two plots of time units (0–300) and area units (8 000–14 000) versus selection group length (1-bit to 5-bit) for boothBasic, booth and ppgSimple.]
Figure 7.5 – Measurements on the carry-save array summation network with different PPGs.
Here we see the strength of the improved Booth 2 method. Together with the carry-save
array, it is remarkably faster than all other combinations. With the adder tree, however, the
improved Booth is even slower than the basic Booth. This has the same explanation as in the
measurements with the linear array in section 7.2.1 – the critical path of the adder tree starts
from the least significant bit of the first partial product, and this bit has a larger delay in the
improved Booth methods than in the other ones.
7.2.4 Summary
Apart from multipliers with multiplexer selection, there are only small differences in size
between the methods. The main difference is that the improved Booth methods are somewhat
larger than the basic ones.
There are much greater differences when it comes to speed. The fastest multiplier of all is
a simple PPG together with Dadda summation:
multTOP = ppgSimple ->- dadda
The fastest among multipliers with regular layout is an improved Booth 2 together with a
carry-save array:
multTOP_regular = booth 2 ->- carrySave
8 Conclusions
The description of the multiplier circuits turned out well; especially the summation
networks could be captured in very small descriptions. All multipliers and helper circuits were
easily made generic with respect to the number of bits in the inputs. The Booth algorithms
also had a generic selection group length, although this was somewhat more complicated to
achieve. Because of the neat descriptions, the circuits are easy to understand, and many errors
can be avoided at design time. In a conventional hardware description language like VHDL, it
would have been extremely difficult to make descriptions with the same generality, and
impossible to achieve the same readability.
The circuits could be verified for sizes of up to ten bits within reasonable time, and all
multipliers turned out to be correct.
When comparing the performance of the different methods, we found that the fastest
multiplier was
multTOP = ppgSimple ->- dadda
The disadvantage of this combination is the irregularity of the Dadda network. When
implemented on chip, this will probably lead to larger wiring delay, reducing the speed
compared to our estimation.
For multipliers with regular summation network (linArray, carrySave, and addTree),
the fastest multiplier was
multTOP_regular = booth 2 ->- carrySave
All methods had essentially similar sizes, except for the Booth methods with multiplexer
selection, whose sizes increased drastically with the selection group length. Multiplexer
selection seemed to be slower in the tests too, so we found that we had no use at all of these
methods.
Appendix A – Lava code
-- **********************************************
-- Description and verification of multipliers
-- by Emil Axelsson 2003
-- **********************************************

import Lava hiding (mux)
import Arithmetic hiding (halfAdd, fullAdd, bitMulti)
import Patterns
import List
-- **********************************************
-- Basic circuits:
-- **********************************************
halfAdd (a,b) = (s,co)
  where
    s  = xor2 (a,b)
    co = and2 (a,b)

fullAdd (ci,(a,b)) = (s,co)
  where
    g  = and2 (a,b)
    p  = xor2 (a,b)
    s  = xor2 (p,ci)
    co = or2 (g, and2 (p,ci))
distributeLeft (a,[]) = []
distributeLeft (a, b:bs) = (a,b):(distributeLeft (a,bs))
bitMulti = distributeLeft ->- map and2
mux (s,(as,bs)) = muxHelp s (inv s) (as,bs)
  where
    muxHelp s sInv ([],[]) = []
    muxHelp s sInv (a:as, b:bs) = x : muxHelp s sInv (as,bs)
      where x = or2 (and2 (sInv, a), and2 (s,b))
-- **********************************************
-- Helper functions:
-- **********************************************
skipLast = reverse ->- tail ->- reverse
-- Removes the last element in a list
head2 [] = low
head2 as = head as
tail2 [] = []
tail2 as = tail as
zipp2 ([],[]) = []
zipp2 (as,bs) = (head2 as, head2 bs):(zipp2 (tail2 as, tail2 bs))
zipp3 ([],[],[]) = []
zipp3 (as,bs,cs) = (head2 as, head2 bs, head2 cs):(zipp3 (tail2 as, tail2 bs, tail2 cs))
cpaCarry = row fullAdd
-- A carry-propagate adder (CPA) with carry in/out
cpa (as,bs) = ss
  where (ss,_) = cpaCarry (low, zipp2 (as,bs))
-- A cpa without carry in/out

csa ((as,bs),cs) = (zipp3 ->- map (convert ->- fullAdd) ->- unzipp ->- shift) (as,bs,cs)
  where
    convert (a,b,c) = (c,(a,b))
    shift (xs,ys)   = (xs, [low] ++ skipLast ys)  -- Shift ys one step to the left
logAdd (as,bs) = ss
  where
    gs  = (zipp2 ->- map and2) (as,bs)
    ps  = (zipp2 ->- map xor2) (as,bs)
    cps = zipp (gs,ps)
    cs  = (unzipp ->- first ->- skipLast) (carryGen cps)
    ss  = (zipp ->- map xor2) (ps, [low]++cs)

    first (g,p) = g

    carryGen [cp] = [cp]
    carryGen cps  = cps1 ++ map (dotOp (last cps1)) cps2
      where
        (cps1,cps2) = (halveList ->- (carryGen -|- carryGen)) cps

    dotOp (g',p') (g,p) = (or2 (g, and2 (p,g')), and2 (p,p'))
-- A logarithmic adder
trim n as = xs
  where (xs,_) = splitAt n as

trimMatrix m n as = ys
  where
    (xs,_) = splitAt m as
    (ys,_) = (map (splitAt n) ->- unzipp) xs
-- The matrix as (a list of lists) is trimmed so that the inner lists (rows) have n elements,
-- and the outer list has m elements.
reduceLin circ c [] = c
reduceLin circ c (l:ls) = reduceLin circ (circ (c,l)) ls
binTree circ [p] = p
binTree circ ps =
  (halveList ->- (binTree circ -|- binTree circ) ->- circ) ps
-- A binary tree structure with the circuit circ, and a list of numbers as input
group3 []         = []
group3 [a]        = [(a,[],[])]
group3 [a,b]      = [(a,b,[])]
group3 (a:b:c:ds) = [(a,b,c)] ++ (group3 ds)
csaWallace (a,[],[]) = (a,[])
csaWallace (a,b,[]) = (a,b)
csaWallace (a,b,c) = csa ((a,b),c)
-- A special csa, where the inputs go directly to the outputs if they are less than three
unpairWallace [] = []
unpairWallace ((a,[]):as) = [a] ++ (unpairWallace as)
unpairWallace ((a0,a1):as) = [a0,a1] ++ (unpairWallace as)
-- A special unpair, where empty numbers are excluded
muxTree ([],[a]) = a
muxTree (s:ss,as) = xs
  where
    (a1,a2) = halveList as
    xs      = mux (s, (muxTree (ss,a1), muxTree (ss,a2)))

select ss as = muxTree (reverse ss, as)
-- Chooses the element with position ss in the list as
-- The number of elements in as must be 2^n, where n is the number of bits in ss,
-- and all numbers in as must have the same length
invCond c as = (distributeLeft ->- map xor2) (c,as)
boothBits s x sp sign
  | (x < s)   = (start1,end1)
  | otherwise = (start2,end2)
  where
    start1 = zeroList x
    start2 = (zeroList (x-s)) ++ [sp] ++ (zeroList (s-1))
    end1   = (replicate s sign) ++ [inv sign]
    end2   = [inv sign] ++ (replicate (s-1) high)
-- Returns the extra bits needed for Booth with negative partial products
-- s is the order of the Booth encoding, x is the number of shifting steps of the current partial product
-- sp and sign are the signs of the previous and current partial products respectively
-- (low = positive; high = negative)
-- Returns (start, end), which are the bits to put before and after the PP
-- **********************************************
-- Selection tables:
-- **********************************************
sBitSelTab s bs = multiples 0 []
  where
    nb = length bs

    shift as = skipLast ([low] ++ as)

    multiples i ms
      | (i == 2^s)     = ms
      | (i == 0)       = multiples (i+1) [(zeroList (s+nb))]
      | (i == 1)       = multiples (i+1) (ms ++ [bs ++ zeroList s])
      | (mod i 2 == 0) = multiples (i+1) (ms ++ [shift (ms!!(div i 2))])
      | otherwise      = multiples (i+1) (ms ++ [m])
      where
        m  = cpa (ms!!i1, ms!!i2)
        i1 = (2^(floor ((log (fromInt i))/(log 2))))
        i2 = i - i1
-- Returns a list with the multiples [0, bs, 2*bs, ..., (2^s-1)*bs]
-- computed in the most efficient way using cpa adders
-- All multiples have length (length bs + s)
improved_2bitSel ss bs = select ss [b0,b1,b1,b2]
  where
    b0 = zeroList ((length bs)+1)
    b1 = bs ++ [low]
    b2 = [low] ++ bs
-- Returns the list [0, bs, bs, 2*bs]
-- All multiples have length (s+1)
improved_sel ss bs
  | (length ss == 2) = x
  | (length ss >= 2) = cpa (x,y)
  where
    (a1,a2) = splitAt 2 ss
    x       = improved_2bitSel a1 bs
    y       = [low] ++ (mult1 (a2,bs))
-- Selects multiples for the generic booth s algorithm (only one multiple returned),
-- using improved_2bitSel and mult1
-- All multiples have length (s+1)
improved_selTab s bs = modify ms
  where
    ms = trimMatrix (2^(s-1)+1) (length bs + s - 1) (sBitSelTab s bs)

    modify []         = []
    modify [m]        = [m]
    modify [m1,m2]    = [m1,m2]
    modify (m1:m2:ms) = m1:m2:(modify (m2:ms))
-- Returns the list [0, bs, bs, 2*bs, 2*bs, ...]
-- All multiples have length (length bs + s - 1)
-- **********************************************
-- Partial product generators:
-- **********************************************
ppgSimple (as,bs) = ppgSimpleHelp 0 (as,bs)
  where
    ppgSimpleHelp i ([],bs)   = []
    ppgSimpleHelp i (a:as,bs) = p:ps
      where
        p  = (zeroList i) ++ (bitMulti (a,bs)) ++ [low]
        ps = ppgSimpleHelp (i+1) (as,bs)
-- Simple 1-bit partial product generation
boothBasic s (as,bs) = boothBasicHelp 0 (as,bs)
  where
    boothBasicHelp i ([],bs) = []
    boothBasicHelp i (as,bs) | (length as < s)  = [zeroList (s*i) ++ mult1 (as,bs)]
    boothBasicHelp i (as,bs) | (length as >= s) = p:ps
      where
        (ss,as2) = splitAt s as
        p        = (zeroList (s*i)) ++ (mult1 (ss,bs))
        ps       = boothBasicHelp (i+1) (as2,bs)
-- The basic Booth algorithm with selection through multiplication
boothBasicMux s (as,bs) = boothBasicHelp 0 ms (as,bs)
  where
    ms = sBitSelTab s bs

    boothBasicHelp i ms ([],bs) = []
    boothBasicHelp i ms (as,bs)
      | (n < s)  = [zeroList (s*i) ++ select as ms2]
      | (n >= s) = p:ps
      where
        n        = length as
        ms2      = trimMatrix (2^n) (length bs + n) ms
        (ss,as2) = splitAt s as
        p        = zeroList (s*i) ++ select ss ms
        ps       = boothBasicHelp (i+1) ms (as2,bs)
-- The basic Booth algorithm with selection table
booth s (as,bs) = boothHelp 0 low (as,bs)
  where
    len = (length as) + (length bs)

    boothHelp i sp (as,bs) | (length as < s) = [p]
      where
        ss        = [sp] ++ as ++ zeroList (s-(length as)-1)
        x         = improved_sel ss bs
        (start,_) = boothBits s (s*i) sp low
        p         = trim len (start ++ x)
    boothHelp i sp (as,bs) | (length as >= s) = p:ps
      where
        (ss, sign:as2) = splitAt s ([sp] ++ as)
        signInv        = invCond sign
        x              = (improved_sel (signInv ss) ->- signInv) bs
        (start, end)   = boothBits s (s*i) sp sign
        p              = trim len (start ++ x ++ end ++ [low])
        ps             = boothHelp (i+1) sign (as2,bs)
-- The improved Booth PPG, selection through multiplication if s>2
boothMux s (as,bs) = boothHelp 0 low ms (as,bs)
  where
    ms  = improved_selTab s bs
    len = (length as) + (length bs)

    boothHelp i sp ms (as,bs) | (length as < s) = [p]
      where
        ss        = [sp] ++ as ++ zeroList (s-(length as)-1)
        x         = select ss ms
        (start,_) = boothBits s (s*i) sp low
        p         = trim len (start ++ x)
    boothHelp i sp ms (as,bs) | (length as >= s) = p:ps
      where
        (ss, sign:as2) = splitAt s ([sp] ++ as)
        signInv        = invCond sign
        x              = (select (signInv ss) ->- signInv) ms
        (start, end)   = boothBits s (s*i) sp sign
        p              = trim len (start ++ x ++ end ++ [low])
        ps             = boothHelp (i+1) sign ms (as2,bs)
-- The improved Booth PPG with selection table
-- **********************************************
-- Summation networks:
-- **********************************************
linArray [p]    = p
linArray (p:ps) = reduceLin cpa p ps

addTree = binTree cpa

carrySave [p]        = p
carrySave (p0:p1:ps) = (reduceLin csa (p0,p1) ->- logAdd) ps

wallace ps
  | ((length ps) <= 2) = carrySave ps
  | otherwise          = wallace s
  where s = (group3 ->- map csaWallace ->- unpairWallace) ps
-- **********************************************
-- Multipliers:
-- **********************************************
mult1 = ppgSimple ->- linArray
mult2 = ppgSimple ->- addTree
mult3 = boothBasic 2 ->- linArray
mult4 = boothBasicMux 3 ->- addTree
mult5 = ppgSimple ->- carrySave
mult6 = ppgSimple ->- wallace
mult7 = booth 2 ->- carrySave
mult8 = booth 3 ->- wallace
mults = [mult1,mult2,mult3,mult4,mult5,mult6,mult7,mult8]
-- For integers:
intMult n mult = (int2bin n -|- int2bin n) ->- mult ->- bin2int
-- **********************************************
-- Simulation:
-- **********************************************
len mult n = length (mult (zeroList n, zeroList n))
len2 ppg sum n = length ((ppg ->- sum) (zeroList n, zeroList n))
-- **********************************************
-- Verification:
-- **********************************************
prop_Equivalent circ1 circ2 a = ok
  where
    out1 = circ1 a
    out2 = circ2 a
    ok   = out1 <==> out2

circCorrectForSizes m n circ1 circ2 =
  forAll (list m) $ \a ->
  forAll (list n) $ \b ->
  prop_Equivalent circ1 circ2 (a,b)
verif mult n = vis (circCorrectForSizes n n multi mult)
-- multi is the built-in lava multiplier, serves as the GOLDEN MODEL
verif2 ppg sum n = vis (circCorrectForSizes n n multi (ppg ->- sum))
verif3 ppg sum m n = vis (circCorrectForSizes m n multi (ppg ->- sum))
verif_sizes mult m = mapM vis [(circCorrectForSizes n n multi mult) | n <- [1..m]]
verif_circs circs n = mapM vis [(circCorrectForSizes n n multi mult) | mult <- circs]
Appendix B – Lava code NSI
-- **********************************************
-- Description and verification of multipliers
-- Non-standard interpretation
-- by Emil Axelsson 2003
-- **********************************************

-- * Bits are represented in two ways:
--   1) Normal form a :: Signal Bool
--   2) Info form   a :: Infobit = (Info, Signal Bool)
--      Info contains measures of how many gates have been required to create the bit,
--      and how long it takes until the bit is stable
--
-- * Numbers are represented in two ways:
--   1) Normal form as :: Normnumber = [Signal Bool]
--      A list of bits, where the LSB is the FIRST element in the list
--   2) Info number as :: Infonumber = [Infobit]

import Lava hiding (high, low, inv, and2, nand2, or2, nor2, xor2, xnor2, mux, zeroList)
import Arithmetic hiding (halfAdd, fullAdd, bitMulti)
import Patterns
import List
-- **********************************************
-- Types:
-- **********************************************
type Normnumber = [Signal Bool]
type Bittime = Integer
type Bitsize = (Integer, (Integer,Integer))
type Info = (Bittime, Bitsize)
type Infobit = (Info, Signal Bool)
type Infonumber = [Infobit]
-- **********************************************
-- Type operations:
-- **********************************************
zeroSize :: Bitsize
zeroSize = (0, (0,0))
zeroInfo :: Info
zeroInfo = (0, zeroSize)
low :: Infobit
low = (zeroInfo, Lava.low)
high = (zeroInfo, Lava.high)
lowVar :: Infobit
lowVar = (zeroInfo, Lava.inv Lava.high)
highVar = (zeroInfo, Lava.inv Lava.low)
-- Non-constant infobits
zeroList :: Int -> Infonumber
zeroList n = replicate n low
zeroListVar :: Int -> Infonumber
zeroListVar n = replicate n lowVar
norm2info :: Normnumber -> Infonumber
norm2info [] = []
norm2info (a:as) = (zeroInfo,a):(norm2info as)
n2iPair = (norm2info -|- norm2info)
makeVar :: Infonumber -> Infonumber
makeVar [] = []
makeVar ((aInfo,a):as) = (aInfo, (Lava.inv ->- Lava.inv) a) : makeVar as

makeVarPair = (makeVar -|- makeVar)
valueB :: Infobit -> Signal Bool
valueB (info,a) = a
value :: Infonumber -> Normnumber
value = map valueB
timeB :: Infobit -> Bittime
timeB ((time,size),a) = time
time :: Infonumber -> Integer
time = (map timeB) ->- maximum
infoB :: Infobit -> Info
infoB (info,a) = info
gateSizeB :: Infobit -> Integer
gateSizeB ((_,(gates,_)),a) = gates
halfB :: Infobit -> Integer
halfB ((_,(_,(h,f))),a) = h
fullB :: Infobit -> Integer
fullB ((_,(_,(h,f))),a) = f
sizeB :: Infobit -> Bitsize
sizeB ((time,size),a) = size
size :: Infonumber -> Bitsize
size as = (sum gateSizes, (sum halfs, sum fulls))
  where
    gateSizes = map gateSizeB as
    halfs     = map halfB as
    fulls     = map fullB as
incTime :: Integer -> Infobit -> Infobit
incTime t ((time,size),a) = ((time+t, size),a)
newTimeB :: Integer -> Infobit -> Infobit
newTimeB t ((time,size),a)
  | (((time,size),a) == low) = low
  | otherwise                = ((t,size),a)
-- low bits should always have zero info
newTime :: [Integer] -> Infonumber -> Infonumber
newTime [] [] = []
newTime (t:ts) (a:as) = (newTimeB t a):(newTime ts as)
incGateSize :: Integer -> Infobit -> Infobit
incGateSize g ((time,(gates,counts)),a) = ((time,(gates+g,counts)),a)
incHalf :: Integer -> Infobit -> Infobit
incHalf h ((time,(gates,(half,f))),a) = ((time,(gates,(half+h,f))),a)
incFull :: Integer -> Infobit -> Infobit
incFull f ((time,(gates,(h,full))),a) = ((time,(gates,(h,full+f))),a)
resetTimeB :: Infobit -> Infobit
resetTimeB ((time,size), a) = ((0,size), a)
-- Resets the time information for the bit a
resetTime :: Infonumber -> Infonumber
resetTime = map resetTimeB
-- Resets the time information for the number as
resetSizeB :: Infobit -> Infobit
resetSizeB ((time,size), a) = ((time,zeroSize), a)
resB = resetSizeB
-- Resets the size information for the bit a
resetSize :: Infonumber -> Infonumber
resetSize = map resetSizeB
res = resetSize
-- Resets the size information for the number as
resSizePair = (resetSizeB -|- resetSizeB)
addSizesB :: (Bitsize,Bitsize) -> Bitsize
addSizesB ((gates1,(h1,f1)), (gates2,(h2,f2))) = (gates1+gates2, (h1+h2, f1+f2))
mergeInfos :: (Info,Info) -> Info
mergeInfos ((atime,asize), (btime,bsize)) = (maximum [atime,btime], addSizesB (asize,bsize))
countVari as = countHelp 0 as
  where
    countHelp n []     = n
    countHelp n (a:as)
      | (a == high || a == low) = countHelp n as
      | otherwise               = countHelp (n+1) as
-- Counts the number of variable bits in the list as
increaseCounts (a,b,c) d = case (countVari [a,b,c]) of
  2         -> incHalf 1 d
  3         -> incFull 1 d
  otherwise -> d
-- Increases the half and full adder parameter in d
-- depending on how many of the bits (a,b,c) are variable
-- 3 vari => full adder; 2 vari => half adder
-- **********************************************
-- Basic circuits for Infobits:
-- **********************************************
invDelay = 1
invSize = 1
andDelay = 3
andSize = 3
nandDelay = 2
nandSize = 2
orDelay = 3
orSize = 3
norDelay = 2
norSize = 2
xorDelay = 3
xorSize = 3
xnorDelay = 3
xnorSize = 3
inv a
  | (a == low)  = high
  | (a == high) = low
  | otherwise   = (incTime invDelay ->- incGateSize invSize) (infoB a, Lava.inv (valueB a))

and2 (a,b)
  | (a == low || b == low) = low
  | (a == high)            = b
  | (b == high)            = a
  | otherwise = (incTime andDelay ->- incGateSize andSize) (info, Lava.and2 (valueB a, valueB b))
  where info = mergeInfos (infoB a, infoB b)

nand2 (a,b)
  | (a == low || b == low) = high
  | (a == high)            = inv b
  | (b == high)            = inv a
  | otherwise = (incTime nandDelay ->- incGateSize nandSize) (info, Lava.nand2 (valueB a, valueB b))
  where info = mergeInfos (infoB a, infoB b)

or2 (a,b)
  | (a == low)               = b
  | (b == low)               = a
  | (a == high || b == high) = high
  | otherwise = (incTime orDelay ->- incGateSize orSize) (info, Lava.or2 (valueB a, valueB b))
  where info = mergeInfos (infoB a, infoB b)

nor2 (a,b)
  | (a == low)               = inv b
  | (b == low)               = inv a
  | (a == high || b == high) = low
  | otherwise = (incTime norDelay ->- incGateSize norSize) (info, Lava.nor2 (valueB a, valueB b))
  where info = mergeInfos (infoB a, infoB b)

xor2 (a,b)
  | (a == low)  = b
  | (b == low)  = a
  | (a == high) = inv b
  | (b == high) = inv a
  | otherwise = (incTime xorDelay ->- incGateSize xorSize) (info, Lava.xor2 (valueB a, valueB b))
  where info = mergeInfos (infoB a, infoB b)

xnor2 (a,b)
  | (a == low)  = inv b
  | (b == low)  = inv a
  | (a == high) = b
  | (b == high) = a
  | otherwise = (incTime xnorDelay ->- incGateSize xnorSize) (info, Lava.xnor2 (valueB a, valueB b))
  where info = mergeInfos (infoB a, infoB b)
halfAdd (a,b) = (incHalf 1 s, co)
  where
    s  = xor2 (a,b)
    co = (resSizePair ->- and2) (a,b)

fullAdd (ci,(a,b)) = (s,co)
  where
    g  = (resSizePair ->- and2) (a,b)
    p  = xor2 (a,b)
    s  = increaseCounts (a,b,ci) (xor2 (p,ci))
    co = or2 (g, (resSizePair ->- and2) (p,ci))
distributeLeft :: (Infobit, Infonumber) -> [(Infobit, Infobit)]
distributeLeft (a,[]) = []
distributeLeft (a, b:bs) = (a,b):(distributeLeft (resB a, bs))
bitMulti = distributeLeft ->- map and2
mux (s,(as,bs)) = muxHelp s sInv (as,bs)
  where
    sInv = inv (resB s)

    muxHelp s sInv ([],[]) = []
    muxHelp s sInv (a:as, b:bs) = x : muxHelp (resB s) (resB sInv) (as,bs)
      where x = or2 (and2 (sInv, a), and2 (s,b))
-- **********************************************
-- Helper functions:
-- **********************************************
skipLast = reverse ->- tail ->- reverse
-- Removes the last element in a list
head2 [] = low
head2 as = head as
tail2 [] = []
tail2 as = tail as
zipp2 ([],[]) = []
zipp2 (as,bs) = (head2 as, head2 bs):(zipp2 (tail2 as, tail2 bs))
zipp3 ([],[],[]) = []
zipp3 (as,bs,cs) = (head2 as, head2 bs, head2 cs):(zipp3 (tail2 as, tail2 bs, tail2 cs))
cpaCarry = row fullAdd
-- A carry-propagate adder (CPA) with carry in/out
cpa (as,bs) = ss
  where (ss,_) = cpaCarry (low, zipp2 (as,bs))
-- A cpa without carry in/out

csa ((as,bs),cs) = (zipp3 ->- map (convert ->- fullAdd) ->- unzipp ->- shift) (as,bs,cs)
  where
    convert (a,b,c) = (c,(a,b))
    shift (xs,ys)   = (xs, [low] ++ skipLast ys)  -- Shift ys one step to the left
logAdd (as,bs) = ss
  where
    gs  = (zipp2 ->- map (resSizePair ->- and2)) (as,bs)
    ps  = (zipp2 ->- map xor2) (as,bs)
    cps = zipp (gs, res ps)
    cs  = (unzipp ->- first ->- skipLast) (carryGen cps)
    ss  = (zipp ->- map xor2) (ps, [low]++cs)

    first (g,p) = g

    carryGen [cp] = [cp]
    carryGen cps  = cps1 ++ map (dotOp (resSizePair (last cps1))) cps2
      where
        (cps1,cps2) = (halveList ->- (carryGen -|- carryGen)) cps

    dotOp (g',p') (g,p) = (or2 (g, and2 (p,g')), and2 (resB p,p'))
-- A logarithmic adder
trim n as = xs
  where (xs,_) = splitAt n as

trimMatrix m n as = ys
  where
    (xs,_) = splitAt m as
    (ys,_) = (map (splitAt n) ->- unzipp) xs
-- The matrix as (a list of lists) is trimmed so that the inner lists (rows) have n elements,
-- and the outer list has m elements.
reduceLin circ c [] = c
reduceLin circ c (l:ls) = reduceLin circ (circ (c,l)) ls
binTree circ [p] = p
binTree circ ps =
  (halveList ->- (binTree circ -|- binTree circ) ->- circ) ps
-- A binary tree structure with the circuit circ, and a list of numbers as input
group3 []         = []
group3 [a]        = [(a,[],[])]
group3 [a,b]      = [(a,b,[])]
group3 (a:b:c:ds) = [(a,b,c)] ++ (group3 ds)
csaWallace (a,[],[]) = (a,[])
csaWallace (a,b,[]) = (a,b)
csaWallace (a,b,c) = csa ((a,b),c)
-- A special csa, where the inputs go directly to the outputs if they are less than three
unpairWallace [] = []
unpairWallace ((a,[]):as) = [a] ++ (unpairWallace as)
unpairWallace ((a0,a1):as) = [a0,a1] ++ (unpairWallace as)
-- A special unpair, where empty numbers are excluded
transposePPs ps
  | allFinished ps = []
  | otherwise      = c : transposePPs cs
  where
    c  = (map head2) ps
    cs = (map tail2) ps

    allFinished []      = True
    allFinished ([]:as) = allFinished as
    allFinished (a:as)  = False
compressGap [] = []
compressGap (a:as)
  | (a == low) = compressGap as
  | otherwise  = a : compressGap as
-- Removes all constant low bits from the number as

compressPPs ps = (transposePPs ->- map compressGap ->- transposePPs) ps
-- Compresses a partial product matrix with empty gaps inside
nextLength len = nextLengthHelp 2
  where
    nextLengthHelp k
      | (k < len) && (len <= nextk) = k
      | otherwise                   = nextLengthHelp nextk
      where nextk = div (3*k) 2
-- Computes the maximum column length in the next step of the Dadda summation
compressCount len c | (length c <= len)     = (c,[])
compressCount len c | (length c == (len+1)) = (x:cs, [y])
  where
    c0:c1:cs = c
    (x,y)    = halfAdd (c0,c1)
compressCount len c = (x:xs, y:ys)
  where
    c0:c1:c2:cs = c
    (x,y)       = fullAdd (c0, (c1,c2))
    (xs,ys)     = compressCount (len-1) cs
-- The column c is compressed using full or half adders, and two columns (x,y) are returned,
-- where length x <= len; length y <= length x
compressDaddaCols len prev []     = []
compressDaddaCols len prev (c:cs) = (x ++ prev) : ys
  where
    (x,next) = compressCount (len - (length prev)) c
    ys       = compressDaddaCols len next cs
-- Performs one step of a Dadda summation
muxTree ([],[a]) = a
muxTree (s:ss,as) = xs
  where
    (a1,a2) = halveList as
    xs      = mux (s, (muxTree (ss,a1), muxTree (res ss, a2)))

select ss as = muxTree (reverse ss, as)
-- Chooses the element with position ss in the list as
-- The number of elements in as must be 2^n, where n is the number of bits in ss,
-- and all numbers in as must have the same length
invCond c as = (distributeLeft ->- map xor2) (c,as)
boothBits s x sp sign
  | (x < s)   = (start1,end1)
  | otherwise = (start2,end2)
  where
    start1 = zeroList x
    start2 = (zeroList (x-s)) ++ [sp] ++ (zeroList (s-1))
    end1   = (replicate s (resB sign)) ++ [inv sign]
    end2   = [inv sign] ++ (replicate (s-1) high)
-- Returns the extra bits needed for Booth with negative partial products
-- s is the order of the Booth encoding, x is the number of shifting steps of the current partial product
-- sp and sign are the signs of the previous and current partial products respectively
-- (low = positive; high = negative)
-- Returns (start, end), which are the bits to put before and after the PP
-- **********************************************
-- Selection tables:
-- **********************************************
sBitSelTab s bs = multiples 0 []
  where
    nb = length bs

    shift as = skipLast ([low] ++ as)

    multiples i ms
      | (i == 2^s)     = ms
      | (i == 0)       = multiples (i+1) [(zeroList (s+nb))]
      | (i == 1)       = multiples (i+1) (ms ++ [bs ++ zeroList s])
      | (mod i 2 == 0) = multiples (i+1) (ms ++ [shift (res (ms!!(div i 2)))])
      | otherwise      = multiples (i+1) (ms ++ [m])
      where
        m  = cpa (res (ms!!i1), res (ms!!i2))
        i1 = (2^(floor ((log (fromInt i))/(log 2))))
        i2 = i - i1
-- Returns a list with the multiples [0, bs, 2*bs, ..., (2^s-1)*bs]
-- computed in the most efficient way using cpa adders
-- All multiples have length (length bs + s)
improved_2bitSel ss bs = select ss [b0, res b1, b1, b2]
  where
    b0 = zeroList ((length bs)+1)
    b1 = bs ++ [low]
    b2 = [low] ++ (res bs)
-- Returns the list [0, bs, bs, 2*bs]
-- All multiples have length (s+1)
improved_sel ss bs
  | (length ss == 2) = x
  | (length ss >= 2) = cpa (x,y)
  where
    (a1,a2) = splitAt 2 ss
    x       = improved_2bitSel a1 bs
    y       = [low] ++ (mult1 (a2, res bs))
-- Selects multiples for the generic booth s algorithm (only one multiple returned),
-- using improved_2bitSel and mult1
-- All multiples have length (s+1)
improved_selTab s bs = modify ms
  where
    ms = trimMatrix (2^(s-1)+1) (length bs + s - 1) (sBitSelTab s bs)

    modify []         = []
    modify [m]        = [m]
    modify [m1,m2]    = [m1,m2]
    modify (m1:m2:ms) = m1:m2:mms
      where mms = modify ((res m2):ms)
-- Returns the list [0, bs, bs, 2*bs, 2*bs, ...]
-- All multiples have length (length bs + s - 1)
-- **********************************************
-- Partial product generators:
-- **********************************************
ppgSimple (as,bs) = ppgSimpleHelp 0 (as,bs)
  where
    ppgSimpleHelp i ([],bs)   = []
    ppgSimpleHelp i (a:as,bs) = p:ps
      where
        p  = (zeroList i) ++ (bitMulti (a,bs)) ++ [low]
        ps = ppgSimpleHelp (i+1) (as, res bs)  -- only the first PP gets non-reset bs
-- Simple 1-bit partial product generation
boothBasic s (as,bs) = boothBasicHelp 0 (as,bs)
  where
    boothBasicHelp i ([],bs) = []
    boothBasicHelp i (as,bs) | (length as < s)  = [zeroList (s*i) ++ mult1 (as,bs)]
    boothBasicHelp i (as,bs) | (length as >= s) = p:ps
      where
        (ss,as2) = splitAt s as
        p        = (zeroList (s*i)) ++ (mult1 (ss,bs))
        ps       = boothBasicHelp (i+1) (as2, res bs)
-- The basic Booth algorithm with selection through multiplication
boothBasicMux s (as,bs) = boothBasicHelp 0 ms (as, res bs)
  where
    ms = sBitSelTab s bs

    boothBasicHelp i ms ([],bs) = []
    boothBasicHelp i ms (as,bs)
      | (n < s)  = [zeroList (s*i) ++ select as ms2]
      | (n >= s) = p:ps
      where
        n        = length as
        ms2      = trimMatrix (2^n) (length bs + n) ms
        (ss,as2) = splitAt s as
        p        = zeroList (s*i) ++ select ss ms
        ps       = boothBasicHelp (i+1) (map res ms) (as2,bs)
-- The basic Booth algorithm with selection table
booth s (as,bs) = boothHelp 0 low (as,bs)
  where
    len = (length as) + (length bs)

    boothHelp i sp (as,bs) | (length as < s) = [p]
      where
        ss        = [resB sp] ++ as ++ zeroList (s-(length as)-1)
        x         = improved_sel ss bs
        (start,_) = boothBits s (s*i) sp low
        p         = trim len (start ++ x)
    boothHelp i sp (as,bs) | (length as >= s) = p:ps
      where
        (ss, sign:as2) = splitAt s ([resB sp] ++ as)
        signInv        = invCond (resB sign)
        x              = (improved_sel (signInv ss) ->- signInv) bs
        (start, end)   = boothBits s (s*i) sp (resB sign)
        p              = trim len (start ++ x ++ end ++ [low])
        ps             = boothHelp (i+1) sign (as2,bs)
-- The improved Booth PPG, selection through multiplication if s>2
boothMux s (as,bs) = boothHelp 0 low ms (as, res bs)
  where
    ms  = improved_selTab s bs
    len = (length as) + (length bs)

    boothHelp i sp ms (as,bs) | (length as < s) = [p]
      where
        ss        = [resB sp] ++ as ++ zeroList (s-(length as)-1)
        x         = select ss ms
        (start,_) = boothBits s (s*i) sp low
        p         = trim len (start ++ x)
    boothHelp i sp ms (as,bs) | (length as >= s) = p:ps
      where
        (ss, sign:as2) = splitAt s ([resB sp] ++ as)
        signInv        = invCond (resB sign)
        x              = (select (signInv ss) ->- signInv) ms
        (start, end)   = boothBits s (s*i) sp (resB sign)
        p              = trim len (start ++ x ++ end ++ [low])
        ps             = boothHelp (i+1) sign (map res ms) (as2,bs)
-- The improved Booth PPG with selection table
-- **********************************************
-- Summation networks:
-- **********************************************
linArray [p]    = p
linArray (p:ps) = reduceLin cpa p ps

addTree = binTree cpa

carrySave [p]        = p
carrySave (p0:p1:ps) = (reduceLin csa (p0,p1) ->- logAdd) ps

wallace ps
  | ((length ps) <= 2) = carrySave ps
  | otherwise          = wallace s
  where s = (group3 ->- map csaWallace ->- unpairWallace) ps
dadda ps
  | (n == 0)  = zeroList (length cs)  -- if all columns were totally compressed
  | otherwise = daddaHelp cs
  where
    cs = (transposePPs ->- (map compressGap)) ps
    n  = ((map length) ->- sort ->- last) cs  -- The longest column in cs

    daddaHelp cs
      | (n1 <= 2) = (transposePPs ->- carrySave) cs
      | otherwise = daddaHelp (compressDaddaCols n2 [] cs)
      where
        n1 = ((map length) ->- sort ->- last) cs
        n2 = nextLength n1
-- **********************************************
-- Multipliers:
-- **********************************************
mult1 = ppgSimple ->- linArray
mult2 = ppgSimple ->- addTree
mult3 = boothBasic 2 ->- linArray
mult4 = boothBasicMux 3 ->- addTree
mult5 = ppgSimple ->- carrySave
mult6 = ppgSimple ->- wallace
mult7 = booth 2 ->- carrySave
mult8 = booth 3 ->- wallace
mult9 = ppgSimple ->- dadda
mult10 = booth 2 ->- dadda
mults = [mult1,mult2,mult3,mult4,mult5,mult6,mult7,mult8,mult9,mult10]
-- For integers:
intMult n mult = (int2bin n -|- int2bin n) ->- (n2iPair ->- mult ->- value) ->- bin2int
-- **********************************************
-- Simulation:
-- **********************************************
sim circ = (simulate (makeVarPair ->- circ ->- value))
-- Works like simulate, but gives variable inputs to the multiplier
-- Only for 2-input 1-output circuits
ppg_out ppg = simulate (makeVarPair ->- ppg ->- map value)
ppg_time ppg n = (n2iPair ->- ppg ->- map (map timeB)) ((replicate n (var "a")),(replicate n (var "a")))
ppg_size ppg n = (n2iPair ->- ppg ->- map (map sizeB)) ((replicate n (var "a")),(replicate n (var "a")))
sum_time ppg sum n = (n2iPair ->- ppg ->- map resetTime ->- sum ->- time)
  ((replicate n (var "a")), (replicate n (var "a")))

len mult n = length ((n2iPair ->- mult) ((replicate n (var "a")), (replicate n (var "a"))))
len2 ppg sum n = length ((n2iPair ->- ppg ->- sum) ((replicate n (var "a")), (replicate n (var "a"))))
len3 ppg sum m n = length ((n2iPair ->- ppg ->- sum) ((replicate m (var "a")), (replicate n (var "a"))))

height ppg m n = (n2iPair ->- ppg ->- transpose ->- map countVari ->- sort ->- last)
  ((replicate m (var "a")), (replicate n (var "a")))

how_fast circ n = (n2iPair ->- circ ->- time) ((replicate n (var "a")), (replicate n (var "a")))
how_fast2 ppg sum n = (n2iPair ->- ppg ->- sum ->- time) ((replicate n (var "a")), (replicate n (var "a")))
how_fast3 ppg sum m n = (n2iPair ->- ppg ->- sum ->- time)
  ((replicate m (var "a")), (replicate n (var "a")))

how_big circ n = (n2iPair ->- circ ->- size) ((replicate n (var "a")), (replicate n (var "a")))
how_big2 ppg sum n = (n2iPair ->- ppg ->- sum ->- size) ((replicate n (var "a")), (replicate n (var "a")))
how_big3 ppg sum m n = (n2iPair ->- ppg ->- sum ->- size)
  ((replicate m (var "a")), (replicate n (var "a")))

measure circ n = (how_fast circ n, how_big circ n)
measure2 ppg sum n = measure (ppg ->- sum) n
measure3 ppg sum m n = (how_fast3 ppg sum m n, how_big3 ppg sum m n)
-- **********************************************
-- Verification:
-- **********************************************
prop_Equivalent circ1 circ2 a = ok
  where
    out1 = circ1 a
    out2 = circ2 a
    ok   = out1 <==> out2

circCorrectForSizes m n circ1 circ2 =
  forAll (list m) $ \a ->
  forAll (list n) $ \b ->
  prop_Equivalent circ1 circ2 (a,b)
verif mult n = vis (circCorrectForSizes n n multi (n2iPair ->- mult ->- value))
-- multi is the built-in lava multiplier, serves as the GOLDEN MODEL
verif2 ppg sum n = vis (circCorrectForSizes n n multi (n2iPair ->- ppg ->- sum ->- value))
verif3 ppg sum m n = vis (circCorrectForSizes m n multi (n2iPair ->- ppg ->- sum ->- value))
verif_sizes mult m = mapM vis [(circCorrectForSizes n n multi (n2iPair ->- mult ->- value)) | n <- [1..m]]
verif_circs circs n = mapM vis
  [(circCorrectForSizes n n multi (n2iPair ->- mult ->- value)) | mult <- circs]
Future work
The next step after this work would be to refine the developed models for time and size
estimation. Especially the time estimation contains many simplifications. For example, the
delay also depends on:

• The lengths of the connecting wires, as mentioned before.
• The “fan-out” of gates, that is, the number of gate inputs that are connected to the same output.

The main factor of these is the wiring delay, which, in fact, tends to dominate even over
the gate delay in modern chip technologies. Modelling wires requires information about the
circuit’s layout, so the next step is to introduce layout information in the circuit descriptions.
There is a version of Lava at Xilinx Inc., which is used to construct circuits on FPGAs
(Field-Programmable Gate Arrays). This version allows layout information together with the
circuit descriptions. The key is to make use of connection patterns as much as possible, and
have the layout inside these. Information about the Xilinx Lava is found at
http://www.xilinx.com/labs/lava/.
Related work
Lava related:
1. John O’Donnell, http://www.dcs.gla.ac.uk/~jtod/research/, developed a hardware
description language similar to Lava, called Hydra. It is used for teaching Computer
Architecture at the University of Glasgow.
2. Satnam Singh, http://www.xilinx.com/labs/satnam/, developed the Xilinx version of
Lava. At http://www.xilinx.com/labs/lava/kcm/kcm.htm, he has constructed a constant
coefficient multiplier (KCM) for an FPGA using Lava. A KCM multiplies any number
with a constant factor. This can also be done with the non-standard circuits from this
text (if we give one constant input); however, Satnam’s version is specially made to
demonstrate construction on FPGAs.
Multipliers, verification:
1. Gary W. Bewick. Fast Multiplication: Algorithms and Implementation. Technical
Report CSL-TR-94-617, Stanford University, April 1994.
A thesis that describes some of the common multipliers, with emphasis on
Booth encoding. He also describes a further improvement of Booth’s algorithm, called
redundant Booth.
2. A. D. Booth. A Signed Binary Multiplication Technique. Quarterly Journal of
Mechanics and Applied Mathematics, 4(2):236 – 240, June 1951.
This is the classic paper about Booth’s algorithm.
3. C. S. Wallace. A Suggestion for a Fast Multiplier. IEEE Transactions on Electronic
Computers, EC-13:14 – 17, February 1964.
This is the classic paper about Wallace summation.
4. L. Dadda. Some Schemes for Parallel Multipliers. Alta Frequenza, 34:349 – 356, May
1965.
This is the classic paper about Dadda summation.
5. T. Stanion. Implicit verification of structurally dissimilar arithmetic circuits. In Proc.
of IEEE ICCD ’99, pages 46–50. IEEE, 1999.
An article about a general algorithm for verifying large multipliers. With this
method, he has been able to verify the equivalence between the carry-save and
Wallace multiplier for 32-bit numbers.
Reference list
[1] M. Sheeran and K. Claessen. A tutorial on Lava: A hardware description and verification system. Available from http://www.cs.chalmers.se/~koen/Lava, 2000.
[2] Gary W. Bewick. Fast Multiplication: Algorithms and Implementation. Technical Report CSL-TR-94-617, Stanford University, April 1994.
[3] J. M. Rabaey. Digital Integrated Circuits – A Design Perspective. Prentice Hall Electronics and VLSI Series, 1996.
[4] A. D. Booth. A Signed Binary Multiplication Technique. Quarterly Journal of Mechanics and Applied Mathematics, 4(2):236–240, June 1951.
[5] R. P. Brent and H. T. Kung. A Regular Layout for Parallel Adders. IEEE Transactions on Computers, C-31(3):260–264, March 1982.
[6] C. S. Wallace. A Suggestion for a Fast Multiplier. IEEE Transactions on Electronic Computers, EC-13:14–17, February 1964.
[7] L. Dadda. Some Schemes for Parallel Multipliers. Alta Frequenza, 34:349–356, May 1965.
[8] Gensuke Goto, Tomio Sato, Masao Nakajima, and Takao Sukemura. A 54×54-b Regularly Structured Tree Multiplier. IEEE Journal of Solid-State Circuits, 27(9):1229–1236, September 1992.