ASM Design Example: Binary Multiplier

An Example of ASM Design: A Binary Multiplier
Dr. D. Capson
Electrical and Computer Engineering
McMaster University
Introduction
An algorithmic state machine (ASM) is a Finite State Machine that uses a sequential circuit (the
Controller) to coordinate a series of operations among other functional units such as counters, registers,
adders etc. (the Datapath). The series of operations implements an algorithm. The Controller passes
“control” signals, which can be Moore or Mealy outputs from the Controller, to the Datapath. The
Datapath returns information to the Controller in the form of “status” information that can then be used
to determine the sequence of states in the Controller. Both the Controller and the Datapath may
have external inputs and outputs and are clocked simultaneously as shown in the following figure:
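[Figure: block diagram of an ASM. The Controller and the Datapath each have external Inputs and Outputs. The Controller sends Control signals to the Datapath, the Datapath returns Status signals to the Controller, and both share a common clock.]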
Think about this: A microprocessor may be considered as a (large !) ASM with many inputs, states and
outputs. A program (any software) is really just a method for specification of its initial state …
The two basic strategies for the design of a controller are:
1. hardwired control which includes techniques such as “one-hot-state” (also known as "one flipflop
per state") and decoded sequence registers.
2. microprogrammed control which uses a memory device to produce a sequence of control words
to a datapath.
Since hardwired control is, generally speaking, fast compared with microprogramming strategies, most
modern microprocessors incorporate hardwired control to help achieve their high performance (or in
some cases, a combination of hardwired and microprogrammed control). The early generations of
microprocessors used microprogramming almost exclusively. We will discuss some basic concepts in
microprogramming later in the course – for now we concentrate on a design example of hardwired
control. The ASM we will design is an n-bit unsigned binary multiplier.
Binary Multiplication
The design of binary multiplication strategies has a long history. Multiplication is such a fundamental
and frequently used operation in digital signal processing that most modern DSP chips have dedicated
multiplication hardware to maximize performance. Typical applications include filtering, coding and
compression for telecommunications and control, as well as many others. Multiplier units must be fast!
The first example that we considered (in class) that used a repeated addition strategy is not always fast.
In fact, the time required to multiply two numbers is variable and dependent on the value of the
multiplier itself. For example, the calculation of 5 x 9 as 5 + 5 + 5 + 5 + 5 + 5 + 5 + 5 + 5 requires more
clock pulses than the calculation of 5 x 3 = 5 + 5 + 5. The larger the multiplier, the more iterations that
are required. This is not practical. Think about this: How many iterations are required for multiplying
say, two 16-bit numbers, in the worst case ?
Another approach to achieve fast multiplication is the look-up table (LUT). The multiplier and
multiplicand are used to form an address in memory in which the corresponding, pre-computed value of
the product is stored. For an n-bit multiplier (that is, multiplying an n-bit number by an n-bit number),
a (2^(n+n) x 2n)-bit memory is required to hold all possible products. For example, a 4-bit x 4-bit multiplier
requires 2^8 x 8 = 2048 bits. For an 8-bit x 8-bit multiplier, a 2^(8+8) x 16 = 1 Mbit memory is required.
This approach is conceptually simple and has a fixed multiply time equal to the access time of the
memory device, regardless of the data being multiplied. But it is also impractical for larger values of n.
Think about this: What memory capacity is required for multiplying two 16-bit numbers ? Two 32-bit
numbers ?
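The memory requirement quoted above follows from storing one 2n-bit word for every (multiplier, multiplicand) pair. A minimal Python sketch of that arithmetic, where the function name lut_size_bits is hypothetical and chosen only for illustration:

```python
def lut_size_bits(n: int) -> int:
    """Bits of memory needed by a look-up-table multiplier for two n-bit inputs."""
    words = 2 ** (n + n)   # one address per (multiplier, multiplicand) pair
    word_width = 2 * n     # each stored product is 2n bits wide
    return words * word_width


# The two cases worked out in the text
for n in (4, 8):
    print(n, lut_size_bits(n))
```

The address width doubles with n while the word width grows only linearly, so the exponential term dominates very quickly.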
Most multiplication hardware units use iterative algorithms implemented as an ASM for which the
worst-case multiplication time can be guaranteed. The algorithm we present here is similar to the
“pencil-and-paper” technique that we naturally use for multiplying in base 10. Consider the following
example:
      123    (the multiplicand)
    x 432    (the multiplier)
    -----
      246    (1st partial product)
     369     (2nd partial product)
    492      (3rd partial product)
    -----
    53136    (the product)
Each digit of the multiplier is multiplied by the multiplicand to form a partial product. Each partial
product is shifted left (that is, multiplied by the base) by an amount equal to the power of the
corresponding multiplier digit. In the example above, 246 is actually 246 x 10^0, 369 is 369 x 10^1 = 3690 and
492 is actually 492 x 10^2 = 49200. There are as many partial products as there are digits in the
multiplier.
Binary multiplication can be done in exactly the same way:
        1100    (the multiplicand)
      x 1011    (the multiplier)
        ----
        1100    (1st partial product)
       1100     (2nd partial product)
      0000      (3rd partial product)
     1100       (4th partial product)
    --------
    10000100    (the product)
However, with binary digits we can make some important observations:
- Since we multiply by only 1 or 0, each partial product is either a copy of the multiplicand shifted
  by the appropriate number of places, or it is 0.
- The number of partial products is the same as the number of bits in the multiplier.
- The number of bits in the product is twice the number of bits in the multiplicand. Multiplying
  two n-bit numbers produces a 2n-bit product.
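These observations can be checked with a short Python sketch that forms the shifted partial products for the 1100 x 1011 example and sums them. The function name partial_products is hypothetical. This is the pencil-and-paper method, not the reduced-hardware algorithm developed next:

```python
def partial_products(multiplicand: int, multiplier: int, n: int) -> list:
    """Return the n shifted partial products, starting with the multiplier's lsb."""
    products = []
    for i in range(n):
        bit = (multiplier >> i) & 1
        # Each partial product is the multiplicand shifted left i places, or 0
        products.append((multiplicand << i) if bit else 0)
    return products


pp = partial_products(0b1100, 0b1011, 4)
for p in pp:
    print(format(p, "08b"))
total = sum(pp)
print(format(total, "08b"))  # fits in 2n = 8 bits
```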
We could then design datapath hardware using a 2n-bit adder plus some other components (as in the
example of Figure 10.17 of Brown and Vranesic) that emulates this manual procedure. However, the
hardware requirement can be reduced by considering the multiplication in a different light. Our
algorithm may be informally described as follows.
Consider each bit of the multiplier from right to left. When a bit is 1, the multiplicand is added to the
running total that is then shifted right. When the multiplier bit is 0, no add is necessary since the partial
product is 0 and then only the shift takes place. After n cycles of this strategy (once for each bit in the
multiplier) the final answer is produced. Consider the previous example again:
     1100        (the multiplicand)
   x 1011        (the multiplier)
     ----
     0000        (initial partial product, start with 0000)
     1100        (1st multiplier bit is 1, so add the multiplicand)
     ----
     1100        (sum)
     ----
     01100       (shift sum one position to the right)
     1100        (2nd multiplier bit is 1, so add multiplicand again)
     ----
    100100       (sum, with a carry generated on the left)
     ----
     100100      (shift sum once to the right, including carry)
     0100100     (3rd multiplier bit is 0, so skip add, shift once)
     1100        (4th multiplier bit is 1, so add multiplicand again)
     ----
    10000100     (sum, with a carry generated on the left)
     10000100    (shift sum once to the right, including carry)
     ↑↑↑↑
Notice that all the adds take place in these 4 bit positions – we need only a 4-bit adder ! We also
need shifting capability to capture the bits moving to the right as well as a way to store the
carries resulting from the additions. The final answer (the product) consists of the accumulated
sum and the bits shifted out to the right. A hardware design that can implement this algorithm is
described in the next section.
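The add-and-shift-right procedure can be sketched in Python as follows. The names shift_add_multiply, c, a and q are illustrative only: a is the n-bit running sum, c the carry out of the n-bit add, and q collects the bits shifted out to the right while also supplying the multiplier bits. Clearing c after each shift is an assumption of this sketch.

```python
def shift_add_multiply(multiplicand: int, multiplier: int, n: int) -> int:
    """Multiply two unsigned n-bit numbers using only an n-bit add and right shifts."""
    mask = (1 << n) - 1
    c, a, q = 0, 0, multiplier & mask
    for _ in range(n):                   # one pass per multiplier bit
        if q & 1:                        # current multiplier bit is 1: add
            total = a + (multiplicand & mask)
            a = total & mask             # n-bit sum
            c = total >> n               # carry generated on the left
        else:                            # current multiplier bit is 0: skip the add
            c = 0
        # Shift c|a|q one place to the right, including the carry
        q = (q >> 1) | ((a & 1) << (n - 1))
        a = (a >> 1) | (c << (n - 1))
        c = 0                            # assumption: carry is cleared after the shift
    return (a << n) | q                  # product = accumulated sum, then shifted-out bits


print(format(shift_add_multiply(0b1100, 0b1011, 4), "08b"))
```

The loop always runs n times, so the running time does not depend on the value of the multiplier.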
Design of the Binary Multiplier Datapath
The multiplication as described above can be implemented with the components as shown in the figure
on the next page (note that for simplicity, the clock inputs are not shown). It is the role of the controller
to provide a sequence of the inputs to each component to cause the datapath hardware to perform the
desired operations. Registers A and Q are controlled with synchronous inputs Load (parallel load), Shift
(shift one position to the right with left serial input) and Clear (force the contents to 0). The D flipflop
has an asynchronous Clear input and the counter has an asynchronous input Init (force the contents to
11..1).
The log2(n)-bit counter (Counter P) is used to keep track of the number of iterations (n). Counter P is
loaded with the value n-1 and counts down to zero - thus n operations are ensured. Each operation is
either (a) add then shift or (b) just shift as described in the multiply algorithm above. Zero detection on
the counter produces an output Z that is HI when the counter hits zero and this is used to tell the
controller that the sequence is complete. The Counter P is initialized to n-1 with input Init = 1.
The multiplicand is applied to one n-bit input of the adder. The sum output from the adder is stored as a
parallel load into Register A. Register A can also shift to the right, accepting a 1-bit serial input from
the left. This is provided from the output of a D flip flop which stores the value of the carry out from the
adder in the previous addition. Register Q receives its left serial input when shifting from the right-most
bit (lsb) of Register A. Register A and Q are identical in operation (but controlled differently) and
together with the carry flipflop, they form a (1 + n + n)-bit shift register. That is, Registers C, A and Q
are connected such that the carry value stored in the flipflop enters Register A from the left and the bit
shifted out from the right of Register A enters Register Q from its left.
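As a rough illustration of this (1 + n + n)-bit shift, here is a Python sketch with the hypothetical helper shr_caq. What enters the carry position during a shift is not stated in the text. A 0 is assumed here, which is harmless because the carry is always cleared or reloaded before the next shift.

```python
def shr_caq(c: int, a: int, q: int, n: int):
    """Shift the combined register C|A|Q one position to the right."""
    new_q = (q >> 1) | ((a & 1) << (n - 1))  # lsb of A enters Q from the left
    new_a = (a >> 1) | (c << (n - 1))        # carry C enters A from the left
    new_c = 0                                # assumption: 0 shifts into C
    return new_c, new_a, new_q


# Carry 1, A = 0010, Q = 0010  ->  carry 0, A = 1001, Q = 0001
print(shr_caq(1, 0b0010, 0b0010, 4))
```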
At the end of the process, registers A and Q will hold the 2n-bit product (the n msb’s are in Register A).
The multiplier is initially stored in Register Q via its parallel load capability. The reason for this is that
it provides a convenient way to access each bit of the multiplier in succession at the lsb position (Q0) of
Register Q. In the multiply algorithm, each bit of the multiplier is used to “decide” if there should be an
(a) add with shift or (b) shift only. So, Q0 is used to tell the controller which of these operations to
perform on each iteration. After each shift, one bit of the multiplier is lost to the right and the Product
shifts into Register Q from the left. After n shifts, Register Q holds the n lsb’s of the product and the
Multiplier is totally lost.
Putting the datapath circuit for the binary multiplier into a box, we see it has:
Data inputs:     Multiplicand (n bits)
                 Multiplier (n bits)
Data outputs:    Product (2n bits)
Control inputs:  Clear (for the carry flipflop)
                 Load, Shift and Clear (for each shift register)
                 Init (for the counter)
Status outputs:  Z (zero detect) and Q0 (each bit of the Multiplier, in succession)
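[Figure: Datapath for Binary Multiplier. An n-bit Parallel Adder (inputs A, B and Cin; outputs SUM and Cout) adds the Multiplicand to the contents of Register A. Cout is stored in the D flipflop C (Clear) and SUM is parallel-loaded into Register A. Registers A and Q are n-bit shift registers (Load, Shift, Clear, left serial input): C feeds the left serial input of Register A, and the lsb of Register A feeds the left serial input of Register Q, which is parallel-loaded with the Multiplier. The log2(n)-bit binary down Counter P (Init to n-1) drives the zero-detect output Z. Register A supplies the Product msb's, Register Q supplies the Product lsb's, and Q0 is the lsb of Register Q.]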
Design of the Binary Multiplier Controller
An ASM chart that implements the binary multiply algorithm is given below. Note that << indicates an
assignment, for example, C<<0 means “set C to 0”.
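[Figure: ASM chart for the binary multiplier, summarized here in text form]
State IDLE:
  G = 0: remain in IDLE
  G = 1: C << 0, A << 0, P << n-1, Q << multiplier; go to MUL0
State MUL0:
  Q0 = 1: A << A + multiplicand, C << Cout
  Q0 = 0: C << 0
  go to MUL1
State MUL1:
  C|A|Q << shr (C|A|Q), P << P-1
  Z = 0: go to MUL0
  Z = 1: go to IDLE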
The process is achieved with 3 states (IDLE, MUL0 and MUL1). Each state will provide control signals
to the Datapath to perform the multiplication sequence. The process is started with an input G. As long
as G remains LO, the ASM remains in state IDLE. When G=1, the multiplication process is started. As
the ASM moves to state MUL0, the carry flip flop is cleared (C<<0), Reg A is cleared (A<<0), the
Counter is preset to n-1 (P << n-1) and Register Q is loaded with the Multiplier.
In state MUL0, the value of each bit of the multiplier (available on Q0) determines if the multiplicand is
added (Q0 = 1) or not (Q0=0). For the case Q0=0, the Carry flipflop is cleared ; for the case Q0=1, the
Cout from the adder is stored in the carry flipflop. The next state is always MUL1.
In MUL1, the Carry flipflop, Reg A and Reg Q are treated as a (1 + n + n)-bit register and shifted one
position to the right, together. This is indicated with the notation C|A|Q << shr (C|A|Q) in the ASM
chart. The counter is also decremented (P << P – 1). The value of Z then determines whether to:
return to state MUL0 (Z=0) to continue iteration OR
return to state IDLE (Z=1) thus completing the process. Remember that Z=1 means that the
counter has counted down from n-1 to 0 and therefore n iterations have been completed.
State IDLE=0 therefore indicates that the Multiplier is “currently multiplying” and when the ASM
returns to state IDLE (IDLE=1), it indicates that multiplication is completed.
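The three-state behaviour described above can be sketched as a small Python simulation that performs one state's register transfers per clock and counts the clocks. The function name run_asm and the variable names are hypothetical. Go is assumed to be asserted for the first clock only, and n is assumed to be a power of two so that the counter wraps modulo n.

```python
def run_asm(multiplicand: int, multiplier: int, n: int):
    """Step the IDLE/MUL0/MUL1 machine; return (product, clocks used)."""
    mask = (1 << n) - 1
    state, g = "IDLE", 1                   # assumption: G = 1 for the first clock only
    c = a = q = p = 0
    clocks = 0
    while True:
        if state == "IDLE":
            if g:                          # start: C << 0, A << 0, P << n-1, Q << multiplier
                c, a, p, q = 0, 0, n - 1, multiplier & mask
                state = "MUL0"
            g = 0
        elif state == "MUL0":
            if q & 1:                      # Q0 = 1: A << A + multiplicand, C << Cout
                total = a + (multiplicand & mask)
                a, c = total & mask, total >> n
            else:                          # Q0 = 0: C << 0
                c = 0
            state = "MUL1"
        else:                              # MUL1
            z = int(p == 0)                # Z reflects the counter before it decrements
            q = (q >> 1) | ((a & 1) << (n - 1))
            a = (a >> 1) | (c << (n - 1))
            c = 0
            p = (p - 1) % n                # assumes n is a power of two (counter wraps)
            state = "IDLE" if z else "MUL0"
        clocks += 1
        if state == "IDLE":                # back in IDLE: multiplication complete
            return (a << n) | q, clocks


print(run_asm(12, 5, 4))
```

For n = 4 the sketch uses one clock to leave IDLE and two clocks (MUL0, MUL1) per multiplier bit.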
At this point in the design process, the control signals must be identified and their names chosen. This is
done by inspection of the ASM chart and the datapath circuit. On the transition from IDLE to MUL0
(G = 1), the operations P << n – 1, A << 0 and Q << multiplier are all independent of one another in the
datapath, so they can be done simultaneously and can share a common control signal (Initialize).
However, the operation C << 0 must have its own control signal (Clear_C) since it is required both in
state IDLE and in state MUL0. Operations C << Cout and A << A + multiplicand, required in state
MUL0, can share a control signal (Load) since they are also independent functions in the datapath.
Similarly, the shifting of registers C|A|Q and the decrementing of counter P can share a common control
signal (Shift_dec) since they are independent operations in the datapath and are both required in state
MUL1. The names of the control signals are, of course, a matter of design choice.
We can summarize all the operations that must take place on each component in the datapath and
indicate the corresponding control signal names that should be passed to the datapath in the following
table:
Datapath component   Operation                     Control signal name
------------------   ---------------------------   -------------------
Carry flipflop       C << 0                        Clear_C
                     C << Cout (from the adder)    Load
Counter P            P << n - 1                    Initialize
                     P << P – 1                    Shift_dec
Register A           A << 0                        Initialize
                     A << A + multiplicand         Load
                     C|A|Q << shr (C|A|Q)          Shift_dec
Register Q           Q << multiplier               Initialize
                     C|A|Q << shr (C|A|Q)          Shift_dec
The state transition diagram for the controller for this ASM is shown below. Note that only the inputs
are shown; the outputs are not indicated:
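[Figure: state transition diagram. IDLE loops to itself while G=0 and moves to MUL0 when G=1. MUL0 always moves to MUL1. MUL1 returns to MUL0 when Z=0 and to IDLE when Z=1.]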
From inspection of the state transition diagram, the input equations for the D flipflops (using one
flipflop per state) are easily formed:
D_IDLE = G’ • IDLE + MUL1 • Z
D_MUL0 = IDLE • G + MUL1 • Z’
D_MUL1 = MUL0
From the ASM chart and the table above, the equations for the control signals outputs from the
controller are formed:
Initialize = G • IDLE
Clear_C = G • IDLE + MUL0 • Q0’
Load = MUL0 • Q0
Shift_dec = MUL1
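The flipflop input equations and the control output equations can be exercised directly in Python. The sketch below, with the hypothetical function controller_step, treats every signal as 0 or 1 and returns the next one-hot state along with the control outputs for the present state:

```python
def controller_step(idle: int, mul0: int, mul1: int, g: int, z: int, q0: int):
    """Evaluate the one-hot controller equations for one clock period."""
    def inv(x: int) -> int:
        return 1 - x                      # logical complement of a 0/1 value

    # Control outputs, functions of the present state and the inputs
    outputs = {
        "Initialize": g & idle,
        "Clear_C": (g & idle) | (mul0 & inv(q0)),
        "Load": mul0 & q0,
        "Shift_dec": mul1,
    }
    # D inputs of the three state flipflops, i.e. the next state
    d_idle = (inv(g) & idle) | (mul1 & z)
    d_mul0 = (idle & g) | (mul1 & inv(z))
    d_mul1 = mul0
    return (d_idle, d_mul0, d_mul1), outputs


# In IDLE with G = 1: next state is MUL0, with Initialize and Clear_C asserted
print(controller_step(1, 0, 0, g=1, z=0, q0=0))
```

Starting from any valid one-hot state, exactly one of the three D inputs is 1, so the machine stays one-hot.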
Finally, to provide a mechanism to force the state machine to state IDLE (such as at power-up), an
asynchronous input Reset_to_IDLE is connected to the asynchronous inputs of the flipflops. The
circuit for the controller is then simply, an implementation of all of these equations as follows:
[Figure: Controller for Binary Multiplier. Three D flipflops (IDLE, MUL0, MUL1), one per state, with inputs Go, Z, Q0, Clock and Reset_to_IDLE, and outputs IDLE, Initialize, Clear_C, Load and Shift_dec.]
Our binary multiplier ASM has the form:
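[Figure: the binary multiplier ASM. The Controller (inputs Go and Reset to IDLE) sends Initialize, Clear_C, Load and Shift_dec to the Datapath and receives Z and Q0 from it. The Datapath has n-bit inputs Multiplicand and Multiplier and a 2n-bit Product output. Both blocks share the clock.]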
Combining the controller and the datapath to form the top level of our design, the binary multiplier may
be viewed as:
[Figure: top-level view of the Binary Multiplier block. Inputs: Multiplier (n bits), Multiplicand (n bits), Go, Reset to IDLE and Clock. Outputs: Product (2n bits) and IDLE.]
Note that the IDLE state variable has been brought to the top level since it can be used to indicate when
the Binary Multiplier is busy. The Go and IDLE lines are called “handshaking” lines and are used to
coordinate the operation of the multiplier with the external world. If IDLE =1, a multiply can be started
by putting the numbers to be multiplied on the Multiplier and Multiplicand inputs and setting Go=1 at
which time the state machine jumps to state MUL0 (and therefore, simultaneously, IDLE changes to 0)
to start the process. When IDLE returns to 1, the answer is available on the Product output and another
multiplication could be started. No multiplication should be attempted while IDLE is 0.
Conclusion
This design of a Binary Multiplier is valid for any value of n. For example, for n=16, the multiplication
of two 16-bit numbers, the datapath components would simply be extended to accommodate 16 bits in
Registers A and Q and the counter would require log2(16) = 4 bits. The adder would also be required to
be 16-bits in width. However, the same controller implementation can be used since its design is
independent of n. The multiplication time for n=16 would be 2(16) + 1 = 33 clocks. The product would
contain 32 bits.
Further refinements can be made to enhance the speed and capability of the ASM. For example, in our
algorithm, each 0 in the multiplier input data causes a shift without an add, each taking a clock pulse. If
the multiplier input contains “runs” of consecutive 0’s, a barrel shifter could be used to implement all of
the required shifts (equal to the length of the run of 0’s) in a single clock.
Think about this:
What modifications to our design would be required in order to handle signed numbers?
Example: Multiply 12 x 5 = 60 (with n = 4)
Assuming a 4-bit multiplier, in binary, this is 1100 x 0101 = 00111100. The following table
summarizes all the values in the ASM for each step in this multiplication. The left column represents
each clock pulse applied to the multiplier. The multiplication time for this ASM is always 2n+1 clocks
(confirm this with the state transition diagram). Since n=4, there are 9 clocks required to complete a
multiplication. Multiplication time is not data dependent as in our first example that used repeated
addition !
The first row of the table is the initial state (state IDLE) at which every multiply begins. Then, for each
clock pulse applied, we move down one row in the table. Counter P in this example has 2 bits (to count 4
iterations) and the zero detect Z can be seen to be Z=1 only when Counter P counts down to 00.
The values of registers C, A and Q are shown for each clock pulse in the process. Note that the
multiplier is initially stored in Q, then shifted out to the right giving access to each bit in the multiplier at
the Q0 (lsb) position. At the same time, the product shifts in from the left. The product is formed in
registers A and Q with the addition on each iteration occurring in Register A if the contents of the lsb of
Register Q is HI (i.e. Q0 = 1). Notice that registers C, A and Q are shifted on every iteration and that the
final answer 00111100 is contained in Registers A and Q on the final clock pulse. At this point, we have
returned to state IDLE indicating that multiplication is complete.
The current state of the ASM is indicated with a 1 in the appropriate States column. Note that since we
are using one flipflop per state, only one of the 3 columns can contain a 1; the others are of course, 0.
In the Control Signals columns, the values for each control signal are provided for each clock pulse.
Note that Initialize, Clear_C and Load are Mealy-type outputs since they are a function of both current
state and inputs. Shift_dec is a Moore-type output since it depends only on current state (MUL1) and is
not a function of any input. In fact, Shift_dec = MUL1.
Work through this example line by line to verify its operation.
Example: 12 x 5

Clock  Counter                       States            Control Signals
pulse  P        Z  C  Reg A  Reg Q   IDLE MUL0 MUL1    Initialize Clear_C Load Shift_dec
-----  -------  -  -  -----  -----   ---- ---- ----    ---------- ------- ---- ---------
-      11       0  x  xxxx   xxxx    1    0    0       1          1       0    0
1      11       0  0  0000   0101    0    1    0       0          0       1    0
2      11       0  0  1100   0101    0    0    1       0          0       0    1
3      10       0  0  0110   0010    0    1    0       0          1       0    0
4      10       0  0  0110   0010    0    0    1       0          0       0    1
5      01       0  0  0011   0001    0    1    0       0          0       1    0
6      01       0  0  1111   0001    0    0    1       0          0       0    1
7      00       1  0  0111   1000    0    1    0       0          1       0    0
8      00       1  0  0111   1000    0    0    1       0          0       0    1
9      11       0  0  0011   1100    1    0    0       0          0       0    0
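To help with working through the example line by line, the sketch below generates the register contents after each clock pulse for 12 x 5 with n = 4. The function trace_multiply and its row layout are hypothetical. The initial row, with unknown register contents, is omitted. Go is assumed to be 1 for the first clock only, and n is assumed to be a power of two.

```python
def trace_multiply(multiplicand: int, multiplier: int, n: int) -> list:
    """Return one row per clock pulse: (clock, P, Z, C, A, Q, state after the pulse)."""
    mask = (1 << n) - 1
    state, g = "IDLE", 1                  # assumption: Go asserted for the first clock only
    c = a = q = 0
    p = n - 1
    rows, clock = [], 0
    while True:
        z, q0 = int(p == 0), q & 1        # status signals seen by the controller
        # Control signals for the present state
        initialize = int(state == "IDLE" and g)
        clear_c = int(initialize or (state == "MUL0" and not q0))
        load = int(state == "MUL0" and q0)
        shift_dec = int(state == "MUL1")
        # Next state, from the state transition diagram
        if state == "IDLE":
            nxt = "MUL0" if g else "IDLE"
        elif state == "MUL0":
            nxt = "MUL1"
        else:
            nxt = "IDLE" if z else "MUL0"
        # Datapath operations selected by the control signals
        if initialize:
            a, q, p = 0, multiplier & mask, n - 1
        if clear_c:
            c = 0
        if load:
            total = a + (multiplicand & mask)
            a, c = total & mask, total >> n
        if shift_dec:
            q = (q >> 1) | ((a & 1) << (n - 1))
            a = (a >> 1) | (c << (n - 1))
            c = 0
            p = (p - 1) % n               # assumes n is a power of two (counter wraps)
        state, g, clock = nxt, 0, clock + 1
        rows.append((clock, p, int(p == 0), c, a, q, state))
        if state == "IDLE":
            return rows


rows = trace_multiply(0b1100, 0b0101, 4)
for clock, p, z, c, a, q, state in rows:
    print(clock, format(p, "02b"), z, c, format(a, "04b"), format(q, "04b"), state)
```

Each printed line can be compared against the Counter P, Z, C, Reg A, Reg Q and States columns of the corresponding row of the table.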