Concurrent Error Detection in ALU's by Recomputing with Shifted Operands

JANAK H. PATEL, MEMBER, IEEE, AND LEONA Y. FUNG, STUDENT MEMBER, IEEE

Abstract-A new method of concurrent error detection in the Arithmetic and Logic Units (ALU's) is proposed. This method, called "Recomputing with Shifted Operands" (RESO), can detect errors in both the arithmetic and logic operations. RESO uses the principle of time redundancy in detecting the errors and achieves its error detection capability through the use of the already existing replicated hardware in the form of identical bit slices. It is shown that for most practical ALU implementations, including the carry-lookahead adders, the RESO technique will detect all errors caused by faults in a bit-slice or a specific subcircuit of the bit slice. The fault model used is more general than the commonly assumed stuck-at fault model. Our fault model assumes that the faults are confined to a small area of the circuit and that the precise nature of the faults is not known. This model is very appropriate for the VLSI circuits.

Index Terms-ALU, bit-sliced ALU, concurrent error detection, fault detection, time redundancy, VLSI circuits, VLSI faults.

I. INTRODUCTION

It has been known for some time that no low-cost and efficient techniques that can check both arithmetic and logic operations have been available. The AN code, Residue code, and Inverse Residue code [1]-[5] are the error detecting codes that were developed earlier for checking arithmetic operations. However, the methods mentioned above are unable to detect some single errors in group carry-lookahead structures. Furthermore, these methods cannot be used for checking logical operations.

Utilizing a fully duplicated logic unit has been recognized as the most effective method for checking logical operations. In fact, most machines that have been built with an error detection scheme used duplication to check logical operations. For example, the fault-tolerant STAR computer used inverse residue codes to check the arithmetic unit but duplication for the logic unit [2]. Several other machines, such as EDVAC and IBM System/3, have duplicated the entire ALU. Nevertheless, a few designs, such as the residue-checked ALU [6] and a partially self-checking ALU [4], [7], were introduced for checking the entire ALU besides the full duplication scheme. However, both schemes require a large increase in the complexity of the circuitry, and therefore the area on the chip. Furthermore, they are based on the traditional stuck-at fault model, which is not appropriate for the VLSI technology. A more appropriate assumption is that a failure in a VLSI circuit affects a small area of the chip and that the nature of the failure is not precisely known. For example, an imperfection in the semiconductor material may be confined to a small area of the chip and thus affect several neighboring gates. Similarly, an incidence of a high energy particle may also affect a small area of the chip. The underlying failure mechanisms are not well understood as yet. Therefore, it is unwise to assume that under failure the output of a specific gate is stuck at some logical value. It is true, however, that at some higher functional level the effect of failures will be felt as changes in logical values. This functional level, for example, can be at the level of a bit-slice of an ALU. This is the level we choose for our functional fault model in this paper. Throughout this paper, a failure in a circuit means some physical malfunction, and an error means an incorrect value of the function under consideration.

In this paper we present an error detection scheme based on time redundancy. In the next section, some systematic approaches of devising error detection by time redundancy are described. Later we describe the proposed error detection method, called Recomputing with Shifted Operands (RESO). The error detection capabilities of RESO are described in Sections IV and V. It is shown that in a typical ALU, RESO detects all functional errors resulting from failures confined to a certain area of the chip, for example, a bit-slice. In Section VI some extensions of RESO, as related to error correction in logic operations and error detection in multiply-divide circuits, are discussed.

Manuscript received October 6, 1981; revised January 12, 1982. This work was supported by the U.S. Navy under VHSIC Contract N00039-80-C-0556.
J. H. Patel is with the Coordinated Science Laboratory and the Department of Electrical Engineering, University of Illinois, Urbana, IL 61801.
L. Y. Fung was with the Coordinated Science Laboratory, University of Illinois, Urbana, IL 61801. She is now with STC Computer Research Corporation, Santa Clara, CA 95051.

II. ERROR DETECTION USING TIME REDUNDANCY

In this section we introduce a systematic way of exploiting time redundancy for error detection. Let x be the input to a computation unit f and let f(x) be the desired output. A space redundancy technique, like Fig. 1, with two identical function units f, will detect any error in one of the two computation units. Now consider computing f(x) twice in time, on the same hardware box f, as illustrated in Fig. 2. The result of the first computation step is stored in a register and then compared with the result of the second computation step. An intermittent error occurring during either of the computation steps, but not both, will be detected; however, no permanent error can be detected. Thus, the error detection capability of Fig. 2 is much worse than that of Fig. 1.
Fig. 1. Error detection with space redundancy.
Fig. 2. Intermittent error detection with time redundancy.
Fig. 3. Error detection with time redundancy.
Consider modifying the second step of Fig. 2. Suppose that we have two functions c and d such that d(f(c(x))) = f(x) for all x. Now we compute f(x) in the first step, and then during the second step we compute d(f(c(x))), as shown in Fig. 3. The functions c and d may be called the coding and decoding functions, respectively. If c and d are properly chosen, then a failure in the unit f will affect f(x) and f(c(x)) differently. Therefore, the outputs of the first step and the second step will not match, producing an error signal. Quite often the functions c and d are such that they are inverses of each other, that is, d(c(x)) = x for all x. In this case we write c^{-1}(f(c(x))) = f(x). Fig. 4(a) shows this organization. A somewhat equivalent organization of Fig. 4(b) results from the fact that c^{-1}(f(c(x))) = f(x) implies that f(c(x)) = c(f(x)) for all x. Figs. 3 and 4 are the basis of most error detection methods using time redundancy.
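To make the organization of Fig. 3 concrete, the short Python sketch below (ours, not part of the original paper) uses small hypothetical functions f, c, and d chosen so that d(f(c(x))) = f(x); any disagreement between the two steps raises the error signal.

# A minimal sketch of the scheme of Fig. 3: f is the function unit under check, c and
# d are hypothetical coding/decoding functions chosen so that d(f(c(x))) = f(x) when
# the hardware is fault-free; a mismatch between the two steps raises the error signal.

def f(x):                       # the function unit (here simply x -> 3*x + 1)
    return 3 * x + 1

def c(x):                       # coding function applied to the operand
    return x + 5

def d(y):                       # decoding function: f(x + 5) = f(x) + 15, so subtract 15
    return y - 15

def checked_f(x):
    first = f(x)                # step 1: ordinary computation
    second = d(f(c(x)))         # step 2: recomputation on the coded operand, then decode
    if first != second:
        raise RuntimeError("error signal")
    return first

print(checked_f(7))             # 22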
As an example of Fig. 4, consider Boolean functions f(x) which are self-duals. Let x = (x1, x2, ..., xn) be an n-bit vector; f(x) is a self-dual if for all x, f(x1', x2', ..., xn') = (f(x1, x2, ..., xn))', where the prime denotes complementation. Let c be a function which complements each bit of a vector, e.g., c(x1, ..., xn) = (x1', ..., xn'). It is clear that c^{-1} = c and c(f(c(x))) = f(x), or alternately c(f(x)) = f(c(x)). This property of self-duals was the basis of one of the first error detection methods devised using the time redundancy technique. It is called alternating logic design and was introduced by Reynolds and Metze [8]. Time redundancy has also been used for error correction [9]. Here, we address ourselves specifically to error detection in ALU's with organizations like Fig. 4.
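The following sketch (ours, with the 3-input majority function chosen only as a convenient self-dual example) illustrates the alternating logic idea of [8]: the second step recomputes on complemented inputs and complements the output back before comparison.

# A sketch of alternating logic [8] on a self-dual function: the 3-input majority
# satisfies maj(x1', x2', x3') = (maj(x1, x2, x3))', so recomputing on complemented
# inputs and complementing the output back must reproduce the first result.

def maj(x1, x2, x3):
    return (x1 & x2) | (x1 & x3) | (x2 & x3)

def checked_maj(x1, x2, x3):
    first = maj(x1, x2, x3)                     # step 1: normal inputs
    second = 1 ^ maj(1 ^ x1, 1 ^ x2, 1 ^ x3)    # step 2: complemented inputs, output complemented
    if first != second:
        raise RuntimeError("error signal")
    return first

for v in range(8):                              # exercise all input combinations
    checked_maj((v >> 2) & 1, (v >> 1) & 1, v & 1)
print("the two steps agree on every input of the fault-free circuit")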
There are several problems to consider in the design of an error detection method using time redundancy. The first problem is to find a function c for the given function f such that c^{-1}(f(c(x))) = f(x). The second problem is that the mere existence of the function c does not provide the desired error detection capability. Error detection capability varies widely with different c's; and, in addition, the functional properties of f and c are not sufficient by themselves to determine the error detection capability. Different circuit implementations of the same function f will have different capabilities. Even the organizations of Fig. 4(a) and (b) may have slightly different error coverage. And one final problem is the complexity of the hardware implementation of the coding and decoding functions c and c^{-1}. If the hardware required to implement c is comparable to that of the function unit f, then the space redundancy approach of Fig. 1 is clearly more cost-effective. In short, for a cost-effective design, the function c must be such that it provides a very good error coverage and is far less complex than the function f. We present here a cost-effective method of error detection called Recomputing with Shifted Operands (RESO). An overview of this method follows.

Fig. 4. Two approaches to error detection using time redundancy.
III. DESCRIPTION OF RESO
The RESO method is based on Fig. 4. Let the function unit f be an ALU, the coding function c be a left shift operation, and the decoding function c^{-1} be a right shift operation. Thus, c(x) = left shift of x and c^{-1}(x) = right shift of x. With a more precise description of c, e.g., logical or arithmetic shift, by how many bits, what gets shifted in and so forth, it can be shown that for most typical operations of an ALU, c^{-1}(f(c(x))) = f(x). The details about shifts are discussed later in this section, but for now we shall only give a less precise but intuitively clear picture of the RESO method. A schematic of the implementation appears in Fig. 5. Depending on whether we follow the principle of Fig. 4(a) or (b), the operation of the error detection scheme in the ALU of Fig. 5 can be described as below.
Following the principle of Fig. 4(a), during the initial computation step the operands are passed through the shifter unshifted and then they are input to the ALU. The result is then stored in a register unshifted. During the recomputation step, the operands are shifted left and then input to the ALU. The result is right shifted and then compared with the contents of the register. A mismatch indicates an error in computation.

If we follow the principle of Fig. 4(b), then the operation changes slightly. Here we input the unshifted operands during the first step as before, but we left shift the result and then store it in the register. In the second step, we input the left shifted operands to the ALU and then compare the output directly with the register. An inequality signals an error.

Fig. 5. Concurrent error detection in an ALU using RESO.

When an n-bit operand is shifted left by k bits, its leftmost k bits move out. To preserve these bits during the recomputation step, an (n + k)-bit ALU is needed. For certain logical operations, the operands can be shifted left circularly, in which case only an n-bit ALU is required. Use of rotations rather than shifts is discussed later in Section VI. For now, we shall just assume an (n + k)-bit ALU for error detection in n-bit operations. The leftmost k bits of the (n + k)-bit operands for the first computation step and the rightmost k bits for the recomputation step are determined depending on the operations and the number system. For all bit-wise logical operations, it does not matter what they are. For arithmetic operations, the leftmost k bits of the unshifted operands have to be zeros for the unsigned binary integers and extensions of the sign-bit for the one's and two's complement number representations. Consider the left-shifting by k bits as a multiplication process, that is, the original operands are multiplied by 2^k for the recomputation step. In order to be consistent, the carry-in has to be multiplied by 2^k. It can be done by shifting the carry-in to the right of one of the operands k times. Now, (2^k - 1) × (carry-in) has been added to one of the operands and one more carry-in will be added to the sum during the recomputation step. For the other operand, k zeros should be shifted into its right. Under error-free conditions, the rightmost k bits of the result from the recomputation step should be all zeros. So these k bits are essential to indicate errors resulting from faults in the rightmost k bit-slices. To preserve these bits for equality checking, we follow the principle of Fig. 4(b). Thus, k zeros are shifted into the right of the result from the first computation, and this shifted result is compared with the unshifted result from the recomputation step.

Let RESO-k be the name of the error detection scheme achieved by recomputing with k-bit shifted operands. The determination of k depends on three factors: the implementation of f, the space redundancy allowed, and the set of errors defined. The hardware that the RESO-k needs to detect errors in an ALU for n-bit operations includes two shifters, a register, an (n + k)-bit ALU, and an equality checker. A totally self-checking equality checker can be implemented based on 1-out-of-2 code checkers [10]. The errors in the shifter can be detected using any suitable parity codes. For the smallest circuitry, k has to be one. If k is one, then the faults covered are confined to a bit-slice or a subnetwork of a bit-slice, depending on the implementation of f. The next section will give the details of RESO-1. It will be shown later that RESO-2 guarantees the detection of all functional errors resulting from failures confined to a bit-slice independent of the implementation. The error detection capabilities of RESO-k will be presented in a later section.

At first glance, it appears that the above method of error detection reduces the execution rate by half. However, if one views the ALU in the global context involving the entire computer, then the time penalty is only a small portion of the entire instruction cycle. If the ALU is pipelined, then the recomputation step can be started immediately after one segment time unit or one clock cycle of the pipeline. Therefore, both the computation and recomputation are overlapped in the ALU, and the resulting penalty in performance is exactly one segment delay per ALU operation.
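As an informal illustration of the recomputation scheme described above (our sketch, not the circuit of Fig. 5), the Python fragment below follows the Fig. 4(b) organization for addition: the step-1 result is shifted left by k, and the step-2 operands are shifted left by k with the carry-in shifted into the right of one operand k times. The shift amount and operand values are arbitrary assumptions.

# A sketch (ours, not the circuit of Fig. 5) of RESO-k applied to addition, following
# the Fig. 4(b) organization: the step-1 result is shifted left by k and compared
# directly with the step-2 result computed from operands shifted left by k. Python
# integers are unbounded, so the carry-out of an (n + k)-bit ALU is kept implicitly;
# a hardware checker would compare it as well.

K = 2                                             # recompute with a 2-bit shift (RESO-2)

def alu_add(x, y, cin):                           # stand-in for the (n + k)-bit adder
    return x + y + cin

def reso_k_add(x, y, cin=0):
    # step 1: unshifted operands; k zeros are then shifted into the right of the result
    first = alu_add(x, y, cin) << K
    # step 2: operands shifted left by k; the carry-in is shifted into the right of one
    # operand k times, so that it too is effectively multiplied by 2^k
    second = alu_add((x << K) | (cin * ((1 << K) - 1)), y << K, cin)
    if first != second:
        raise RuntimeError("error signal")
    return first >> K                             # the checked sum

print(reso_k_add(0b1011_0110, 0b0101_1100, cin=1))    # 182 + 92 + 1 = 275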
IV. ERROR DETECTION CAPABILITY OF RESO-1

In this section we describe the error detection capability of RESO-1, where RESO-1 refers to recomputing with operands shifted by one bit. First, we consider logical operations and then arithmetic operations. All theorem statements and proofs for arithmetic operations only refer to add operations and can be trivially extended to include subtract, increment, decrement, and negate operations.

Theorem 1: RESO-1 detects all errors in an ALU for all bitwise operations AND, OR, NOT, and their derivations (e.g., XOR, NOR, etc.) when the failure is confined to a single bit-slice.

Proof: Let the bit-slice i be faulty. If the fault produces an error, then the bit i of the result during the first computation step is incorrect. During the second computation step, the bit i of the result is computed by the nonfaulty slice i + 1. Therefore, the bit i of the recomputation step is the correct bit. Thus, if the failure produces an error affecting the bit i of the first result or the bit i - 1 of the second result or both, then the two results will not match. The error, therefore, is detected.
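A toy fault-injection check of Theorem 1 (our sketch; the faulty slice position and the always-wrong fault behavior are assumptions for illustration) shows why RESO-1 catches a single-slice fault in a bitwise operation: the faulty slice contributes to different result bits in the two steps.

# A toy fault-injection check of Theorem 1: a fault confined to one bit-slice of a
# bitwise operation is detected by RESO-1.

FAULTY_SLICE = 3                                   # hypothetical faulty bit position

def faulty_or(x, y):
    good = x | y
    return good ^ (1 << FAULTY_SLICE)              # slice 3 always produces the wrong bit

def reso1_detects(op, x, y):
    first = op(x, y)                               # step 1: unshifted operands
    second = op(x << 1, y << 1) >> 1               # step 2: operands shifted left by one
    return first != second                         # True means the error signal is raised

# the fault flips bit 3 of the first result and bit 2 of the (right-shifted) second
# result, so the two results can never agree
assert all(reso1_detects(faulty_or, x, y) for x in range(16) for y in range(16))
print("all single-slice faults on OR detected")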
Theorem 2: In a bit-sliced ALU whose sum and carry functions of the adders are represented by two disjoint networks like Fig. 6, any errors will be detected by RESO-1 if the failure is confined to either the sum network or the carry network, but not both.

Fig. 6. Full adder with disjoint sum and carry networks.

Proof:
Case 1-Only the Sum Circuit is Affected: Suppose the failure is confined to the sum circuit of bit-slice i. Then the bit i of the result is either correct or off by ±2^i during the first computation step. During the recomputation step, the bit-slice i operates on bit (i - 1) of the original operands. Thus, the failure will cause the result to be either correct or off by ±2^{i-1}. The equality checker will find the two results identical if and only if both results are correct, since in either computation no two errors are equal. Therefore, all errors are detected.
Case 2-Only the Carry Circuit is Affected: Suppose that the carry circuit of bit-slice i is faulty. Then during the first computation step, the result is either correct or off by ±2^{i+1}. During the recomputation step, the result is either correct or off by ±2^i. Thus, the errors in the two results cannot be equal, and therefore any errors will be detected.

In certain implementations, the carry and sum circuits are not disjoint and they share a common subnetwork. If this sharing is done in a particular manner, then it is possible to detect errors. The specific nature of this sharing is described below.

A function f(x1, ..., xn) is said to be strongly dependent on the variable xi if for every pair of input vectors which differ only in xi, the values of f for these vectors differ.

Theorem 3: In an adder whose sum and carry functions are represented by two networks which share a common subnetwork, and whose sum function is strongly dependent on the function that the shared subnetwork represents, any errors caused by failures confined only to one of the three subnetworks will be detected by RESO-1.

Fig. 7. Full adder with sum and carry networks sharing a subnetwork.

Proof: Let the adder implementation be as in Fig. 7.
Case 1-Only the Circuit f is Affected: This is the same as Case 1 of Theorem 2.
Case 2-Only the Circuit g is Affected: This is the same as Case 2 of Theorem 2.
Case 3-Only the Circuit k is Affected: Let the circuit k of bit-slice i be faulty. During the first computation step, let only the output of ki be wrong; then the sum bit i must also be wrong, since the sum function strongly depends on the function k, and the carry c_{i+1} can be correct or incorrect. Hence, the result of the first computation step will be off from the expected correct result by ±2^i if only the sum bit is incorrect, or by ±2^i ± 2^{i+1} if the carry is also incorrect. If the failure in unit k does not produce an error, then the result is correct. Therefore, due to a failure in unit k, the result can be off by one of {0, ±2^i, ±3 × 2^i}. During the recomputation step, again since the sum function strongly depends on the function k, if the output of ki is wrong, then the sum bit i - 1 must also be wrong and the carry bit i can be wrong or correct. Of course, if the output of circuit k is correct, then no error has occurred during the second step. Hence, the result of the recomputation step will be off by one of the values {0, ±2^{i-1}, ±3 × 2^{i-1}}. Since the erroneous results from both of the computation steps will never have the same value, the errors can always be detected.

Theorem 4: Any Boolean algebraic function f(x1, x2, ..., xn) which is strongly dependent on x1 must be of the form x1 ⊕ g(x2, ..., xn) for some function g of x2, ..., xn.

Proof: By definition, f(x1, x2, x3, ...) is strongly dependent on x1 if for every pair of input vectors which differ only in x1, the values of f for these vectors differ. That is,

f(1, x2, x3, ...) = (f(0, x2, x3, ...))'.     (1)

By Shannon's Expansion Theorem,

f(x1, x2, x3, ...) = x1·f(1, x2, x3, ...) + x1'·f(0, x2, x3, ...).

Substituting (1) into it,

f(x1, x2, x3, ...) = x1·(f(0, x2, x3, ...))' + x1'·f(0, x2, x3, ...).

Let g(x2, x3, ...) = f(0, x2, x3, ...); then

f(x1, x2, x3, ...) = x1·(g(x2, x3, ...))' + x1'·g(x2, x3, ...),

that is,

f(x1, x2, x3, ...) = x1 ⊕ g(x2, x3, ...).

We are now ready to look at some specific implementations of adders whose errors can be detected by RESO-1. The most commonly used implementation of a full adder is shown in Fig. 8. This is a typical network that satisfies the structure of Fig. 7. By Theorem 3, RESO-1 can detect all errors resulting from any faults confined to one of the subnetworks f, g, or k. Any faults on the lines which are not in the dotted boxes are also included in the subnetwork from which the line originates.

Fig. 8. A typical full adder circuit satisfying the structure of Fig. 7.
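Theorem 4 can also be confirmed by exhaustive enumeration for small n; the sketch below (ours) checks every 3-variable Boolean function.

# A brute-force check of Theorem 4 for n = 3: a Boolean function that is strongly
# dependent on x1 always equals x1 XOR g(x2, ..., xn) with g = f(0, x2, ..., xn).

from itertools import product

def strongly_dependent_on_x1(f, n):
    return all(f((1,) + rest) != f((0,) + rest) for rest in product((0, 1), repeat=n - 1))

def xor_decomposition_holds(f, n):
    g = lambda rest: f((0,) + rest)                 # g(x2, ..., xn) = f(0, x2, ..., xn)
    return all(f((x1,) + rest) == x1 ^ g(rest)
               for x1 in (0, 1) for rest in product((0, 1), repeat=n - 1))

n = 3
for table in product((0, 1), repeat=2 ** n):        # enumerate all 3-variable functions
    f = lambda v, t=table: t[v[0] * 4 + v[1] * 2 + v[2]]
    if strongly_dependent_on_x1(f, n):
        assert xor_decomposition_holds(f, n)
print("Theorem 4 verified for all 256 functions of 3 variables")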
TABLE I
EFFECT OF FAULTY VALUES OF Gi AND Pi ON THE SUM

Correct values    Faulty values    Change in the sum
Gi  Pi            Gi  Pi
0   0             0   1            0, +2^{i+1}
0   0             1   d            +2^{i+1}
0   1             0   0            0, -2^{i+1}
0   1             1   d            0, +2^{i+1}
1   d             0   0            -2^{i+1}
1   d             0   1            0, -2^{i+1}

(d denotes a don't-care value)

Fig. 9. Typical implementation of a carry-lookahead adder.
V. ERROR DETECTION CAPABILITIES OF RESO-k
In the previous sections we presented the error detection capabilities of RESO-1, that is, recomputing with operands shifted by 1 bit. We assumed in that case that failures were confined to a well-defined cluster of logic elements. These clusters include a small number of elements. With the ever-increasing density of components in the fast-changing VLSI technology, one may consider the assumptions of the previous section somewhat restrictive. For this reason, in this section we present the generalized RESO error detection method for a less restrictive fault model.

The fault model we assume here is also a functional fault model. It is the same as that of the previous section except that we allow more area of the chip or more components to be included in the affected cluster. For example, one can assume that failures affect a complete bit-slice or several adjacent bit-slices. To understand why RESO-1 may not be adequate for error detection when a large area of the chip is affected, consider the following example, which also leads us to the next theorem.
Consider a bit-sliced ripple-carry adder. Suppose that bit-slice i is faulty. Then the sum and the carry bits can have any logical values at any time. The functional nature of the error is to change the arithmetic value of the final result. During the first computation step, bit-slice i computes a sum bit with weight 2^i and the carry-out bit with a weight 2^{i+1}. Thus, it is possible that the result of the first computation step is off by ±2^i or ±2^{i+1} or ±2^i ± 2^{i+1} or 0. In other words, the result is off by one of {0, ±2^i, ±2^{i+1}, ±3 × 2^i}. During the second step, the operands are shifted left by one bit, and therefore the bit-slice i computes the sum bit with weight 2^{i-1} and the carry-out bit with weight 2^i. Again, not knowing the exact nature of the fault, the sum and carry-out bits can take any logical values independent of the input. Reasoning as before, the result is off by one of {0, ±2^{i-1}, ±2^i, ±3 × 2^{i-1}}. From this, it is clear that the results of the two steps can be identical not only when there is no error, but also when the errors happen to be +2^i or -2^i in both steps. This suggests that the second computation step should be changed so that no two errors are the same. This leads us to the following theorem.
Theorem 5: In a bit-sliced ripple-carry adder, RESO-2
detects all errors resulting from failures confined to any one bit-slice.

Proof: Let the bit-slice i be faulty. During the first computation step, the sum and the carry outputs of slice i have weights 2^i and 2^{i+1}, respectively. Therefore, the result from the adder can be off by one of the values {0, ±2^i, ±2^{i+1}, ±3 × 2^i}.

Now consider the recomputation after the operands have been shifted left by two bits. The sum and carry-out of bit-slice i now have weights 2^{i-2} and 2^{i-1}, respectively. Therefore, the result of the recomputation step can be off by one of the values {0, ±2^{i-2}, ±2^{i-1}, ±3 × 2^{i-2}}.

No single error (nonzero value) appears in both sets. Therefore, any error in either step will cause a mismatch of the results of the two computation steps, and the error is detected.
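The two error sets can also be compared numerically; the fragment below (ours, with an arbitrary slice position i = 5) shows the overlap for a 1-bit shift and the empty intersection for a 2-bit shift.

# A short numeric check of the sets used in the example and in Theorem 5: the error
# values of the first step and of the recomputation overlap for a 1-bit shift but are
# disjoint for a 2-bit shift.

def error_set(j):                      # possible nonzero offsets when the faulty slice
    return {s * v for s in (1, -1)     # contributes sum weight 2**j and carry weight 2**(j+1)
            for v in (2**j, 2**(j+1), 3 * 2**j)}

i = 5
first = error_set(i)                   # first computation step
print(first & error_set(i - 1))        # RESO-1 recomputation: {32, -32} -> may go undetected
print(first & error_set(i - 2))        # RESO-2 recomputation: set()    -> always detected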
For carry-lookahead adders, the RESO-2 is also effective, as proven in the following theorem. Since a bit-slice is not well defined in a carry-lookahead (CLA) adder, we first describe the exact implementation of a CLA adder and then define what we mean by a "bit-slice." Fig. 9 describes a typical implementation of a CLA adder. The function unit fi computes the sum bit si, the carry propagate signal Pi, and the carry generate signal Gi. The function unit gi computes the carry-in to the stage i; all function units fi are identical. However, the function units gi get more complex as i grows. It is only for the sake of convenience that we define a "bit-slice" i to consist of units fi and gi.
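For readers who prefer code to figures, the following sketch (ours; Fig. 9 itself is not reproduced here) models the slice functions: fi produces the sum, propagate, and generate signals, and gi forms the carry into stage i from the lower-order P's and G's.

# A sketch of the carry-lookahead structure of Fig. 9: the slice function f_i produces
# the sum bit s_i, propagate P_i and generate G_i, while g_i forms the carry into
# stage i from all lower-order P's, G's and the external carry-in c0.

def f_i(x_i, y_i, c_i):
    p = x_i ^ y_i                      # carry propagate
    g = x_i & y_i                      # carry generate
    s = p ^ c_i                        # sum bit
    return s, p, g

def g_i(i, P, G, c0):
    # c_i = G_{i-1} + P_{i-1} G_{i-2} + ... + P_{i-1}...P_1 G_0 + P_{i-1}...P_0 c0
    c = c0
    for j in range(i):
        c = G[j] | (P[j] & c)
    return c

def cla_add(x_bits, y_bits, c0=0):
    n = len(x_bits)
    P, G, s = [0] * n, [0] * n, [0] * n
    for i in range(n):
        _, P[i], G[i] = f_i(x_bits[i], y_bits[i], 0)   # P and G do not depend on the carry
    for i in range(n):
        s[i], _, _ = f_i(x_bits[i], y_bits[i], g_i(i, P, G, c0))
    return s, g_i(n, P, G, c0)                         # sum bits (LSB first) and carry-out

bits = lambda v, n: [(v >> j) & 1 for j in range(n)]
s, cout = cla_add(bits(0b1011, 4), bits(0b0110, 4), 1)
print(s, cout)                         # 11 + 6 + 1 = 18 -> [0, 1, 0, 0] and carry-out 1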
Theorem 6: In a bit-sliced carry-lookahead adder, RESO-2
detects all errors resulting from failures confined to any one
bit-slice.
Proof: Let the bit-slice i, which includes function units fi and gi, be faulty (see Fig. 9). Then by assumption only the sum bit si, carry generate Gi, and carry propagate Pi are affected. Since the consequence of faults in gi is to affect only the sum bit si, this case is already included in the above assumption. Sum bit si has an arithmetic weight of 2^i, and Gi has a weight of 2^{i+1}. When Gi = 1, the propagate signal Pi has no effect on sum bits i + 1 and higher. Thus, when Gi = 1, Pi has a weight of 0. Depending on the implementation, the sum bit si may or may not be a function of Pi. Since we have already considered si to be erroneous, the effect of Pi on si can be ignored. When Gi = 0, Pi has a weight of 2^{i+1}. With this information, we can
enumerate all possible changes in the sum contributed by the errors in Gi and Pi, as shown in Table I.

Combining these values with the possible errors contributed by the sum bit si, we conclude that due to the failures in slice i, the result can be off by one of the values {0, ±2^i, ±2^{i+1}, ±3 × 2^i}. After the operands are shifted left by two bits, the result of the recomputation step can be off by one of the values {0, ±2^{i-2}, ±2^{i-1}, ±3 × 2^{i-2}}. Thus, no single error value appears in both computation steps and any error can be detected.
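The entries of Table I can be reproduced by a short enumeration over all correct/faulty (Gi, Pi) pairs and both values of the unknown carry into stage i; the sketch below (ours) confirms that their contribution to the sum is always 0 or ±2^{i+1}.

# A short enumeration reproducing the content of Table I: with the carry-in c unknown,
# a faulty (G_i, P_i) pair can change the carry out of stage i, and hence the sum,
# only by 0 or +/- 2^(i+1).

i = 4

def carry_out(G, P, c):                 # carry out of stage i, weight 2^(i+1)
    return G | (P & c)

changes = set()
for G, P in [(0, 0), (0, 1), (1, 0), (1, 1)]:            # correct values
    for Gf, Pf in [(0, 0), (0, 1), (1, 0), (1, 1)]:      # faulty values
        for c in (0, 1):                                  # unknown carry into stage i
            delta = (carry_out(Gf, Pf, c) - carry_out(G, P, c)) * 2 ** (i + 1)
            changes.add(delta)
print(sorted(changes))                  # [-32, 0, 32], i.e. {0, -2^(i+1), +2^(i+1)}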
The above results can be generalized to include the failure
of more than one bit-slice of the ALU. The generalized result
is stated below:
Theorem 7: RESO-k has the following error detection capabilities in an ALU:
1) detects all errors in all bit-wise logical operations when the failures are confined to k adjacent bit-slices,
2) detects all errors in arithmetic operations in a ripple-carry adder when the failures are confined to (k - 1) adjacent bit-slices for k > 1 (for k = 1 see RESO-1 in the last section),
3) detects all errors in arithmetic operations in a full carry-lookahead adder when the failures are confined to (k - 1) adjacent bit-slices, where a bit-slice i consists of functional units fi and gi (see Fig. 9),
4) detects all errors in arithmetic operations in a group carry-lookahead adder, each group i consisting of a (k - 1)-bit adder, and circuits for group-carry generate Gi, group-carry propagate Pi, and the group carry-in Ci (similar to Fig. 9), when the failures are confined to a single group.
Proof:
1) Let the k adjacent bit-slices i, i + 1, ..., i + k - 1 be faulty. During the first computation step, bits i, i + 1, ..., i + k - 1 of the result are affected by faults. The recomputation step is performed after k-bit left-shifts of the input operands. Therefore, the bit-slice i affects the bit i - k of the result, slice i + 1 affects bit i - k + 1 of the result, and so on. Thus, the result bits that are affected by the faults are bits i - k, i - k + 1, ..., i - 1. Therefore, no single bit of the result is affected in both computation steps. Hence, an error guarantees a mismatch between the results of the two computation steps.
2) Let us assume that the (k - 1) adjacent bit-slices i, i + 1, ..., i + k - 2 are faulty. During the first computation step, the faults can cause errors in the sum bits i, i + 1, ..., i + k - 2 and the carry-out of bits i, i + 1, ..., i + k - 2. The smallest nonzero magnitude of the change in the result due to faults occurs when the sum bit i is in error. The magnitude of this error is 2^i. The largest magnitude of the errors occurs when the sum bits i, i + 1, ..., i + k - 2 and the carry-out of bit i + k - 2 are all wrong in one direction. (That is, they all change from 0 to 1 or all from 1 to 0.) The magnitude of this error is 2^i + 2^{i+1} + ... + 2^{i+k-2} + 2^{i+k-1}, which is the same as 2^i(2^k - 1).
After the input operands are shifted left by k bits, the recomputation step is performed. The bits affected are the sum bits i - k, i - k + 1, ..., i - 2 and the carry-out of bit i - 2. The smallest and the largest magnitudes of the nonzero errors are 2^{i-k} and 2^{i-k}(2^k - 1), respectively.
The largest error of the recomputation step is less than the smallest error of the first step, since 2^{i-k}(2^k - 1) < 2^i. Therefore, no single error value can occur in both of the computation steps, and thus an error always causes a mismatch of the two results and, hence, it is detected.
3) Again assume that the (k - 1) adjacent bit-slices i, i + 1, ..., i + k - 2 are faulty. Using a reasoning similar to that in the proof of Theorem 6 and in the proof of part 2) above, it can be shown that the smallest nonzero magnitude of the error during the first step of computation is 2^i, and the largest magnitude of the error during the recomputation step is 2^{i-k}(2^k - 1), which is less than 2^i. Thus, no single error value occurs in both steps.
4) The group of (k - 1) bit-slices is the same as the (k - 1) adjacent bit-slices of 3).
VI. EXTENSIONS OF RESO
The method of error detection presented in the previous
section can be modified and/or extended in several different
directions. Among these are: use of rotations rather than shifts
in certain applications, error correction using RESO, and
extending RESO to more complex arithmetic functions, such
as multiply, divide, and floating-point operations. We discuss
them briefly in this section.
Recomputing with Rotated Operands: Since the rotation
is the same as a circular shift, some of our results derived for
RESO are also valid under rotation. For bit-wise logical operations in an ALU, no two bits of the same operand interact,
and therefore the positioning of a bit with respect to other bits
does not affect the outcome of the result as long as the bits of
the second operand are similarly positioned. Thus, it is seen
that rotations can be substituted for logical shifts in RESO,
and the same error detection capability is achieved for logical operations. It is clear that rotations have an advantage over
shifts because no additional bit-slices are needed. It is not clear,
however, that rotations can be used in a straightforward
manner when the arithmetic operations are involved. With
additional hardware, it is possible to ensure a correct add operation after a rotation, so that the carry-in is applied to an
appropriate bit-slice and the carry-out is extracted from the
proper bit-slices. Thus, there is a tradeoff between the cost of
additional bit-slices needed for shifts and the cost of additional
control hardware needed for the rotations. Furthermore, we
must also consider the effects of faults on the additional
hardware which is different from a bit-slice. Rotations in a
carry-lookahead adder require even more complex control
since the carry-lookahead unit cannot be divided into identical
bit-slices.
Error Correction Using RESO: First, let us discuss the
bitwise logical operations in an ALU. Suppose that the bit-slice
i is faulty. Then the bit i of the result may or may not be correct. For the first recomputation step, the operands are shifted
left by one bit. Now the bit i - 1 of the result is computed by
the faulty bit-slice. If the bit i of the first step and bit i - 1 of
the second step are incorrect, then the two results disagree in
two bit positions. From this the conclusion is obvious that the
slice i produced an incorrect output during both steps. Hence,
the result can be corrected by complementing the bit i of the
first result or bit i - 1 of the second result. However, if the
incorrect output were produced during only one of the two
steps, then the disagreement between the two steps occurs in
only one bit position. But it cannot be determined which of the
two results is correct. For this reason, we need more information, and it can be obtained by doing a third computation step
after the operands have been shifted left one more bit, that is,
a total of two bits off from the original operands. Now each bit
of the result is computed by at least two nonfaulty bit-slices.
Therefore, 2 out of 3 majority votes on each bit will decide the
correct value. This is very similar to Triple Modular Redundancy (TMR), the difference being that TMR is redundant in space, while RESO uses redundancy in time.
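A small sketch of this 2-out-of-3 voting correction for bitwise operations (ours; the faulty slice and the fault behavior are assumed for illustration) follows.

# A sketch of the correction scheme described above for bitwise operations: three
# computations with operands shifted by 0, 1, and 2 bits give three copies of every
# result bit, at most one of which comes from the faulty slice, so a per-bit
# 2-out-of-3 vote recovers the correct value.

def vote_correct(op, x, y):
    r0 = op(x, y)                       # step 1: unshifted
    r1 = op(x << 1, y << 1) >> 1        # step 2: shifted left by one
    r2 = op(x << 2, y << 2) >> 2        # step 3: shifted left by two
    return (r0 & r1) | (r0 & r2) | (r1 & r2)     # bitwise majority

FAULTY_SLICE = 2
def faulty_xor(x, y):
    return (x ^ y) ^ (1 << FAULTY_SLICE)          # slice 2 always outputs the wrong bit

assert all(vote_correct(faulty_xor, x, y) == (x ^ y)
           for x in range(16) for y in range(16))
print("all results corrected by 2-out-of-3 voting")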
Correcting errors in arithmetic operations with RESO is not generally straightforward or even possible, although under very restrictive fault models (such as a single stuck-at fault) one may be able to correct errors in arithmetic operations with additional hardware.
RESO for Complex Arithmetic Functions: We have so far
described the error detection capabilities of RESO as applied
to logical and simple arithmetic operations (add, subtract).
For arithmetic operations such as integer multiply and divide,
one can determine the error detection capabilities of RESO
for specific hardware implementations. There are also different
ways for applying RESO to multiplication or division. For
example, if the multiplication is performed using the add-and-shift method on an ALU, then we can apply RESO to individual add and shift operations and thus check each step of the multiplication algorithm. If the multiplication is done using an array multiplier, then one can use the shifted operands for the recomputation step. Thus, the first step computes, say, x * y and the recomputation step computes 2x * 2y, which is then compared with x * y shifted left by two bits. Since there are
many different array multipliers, we shall not give here the
error detection capabilities of RESO-k for any particular
multiplier. However, the methods described in the previous
section can be used in determining the error detection capabilities of RESO-k for an assumed fault model. RESO is
especially suitable for array multipliers and dividers because
most such array structures are very regular so that they can
be divided into identical bit-slices.
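A minimal sketch of this check for a multiplier (ours; Python's * operator simply stands in for an assumed array multiplier) is shown below.

# A sketch of the RESO check for a multiplier described above: the first step computes
# x * y, the recomputation computes (2x) * (2y), and the two agree exactly when the
# first product shifted left by two bits matches the second.

def checked_multiply(mul, x, y):
    first = mul(x, y)
    second = mul(x << 1, y << 1)        # recomputation with both operands shifted
    if second != (first << 2):
        raise RuntimeError("error signal")
    return first

print(checked_multiply(lambda a, b: a * b, 27, 13))   # 351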
For floating-point operations, one can apply the already
established RESO techniques for integers. Thus, exponent and
mantissa can be handled separately as integers, each with its
own error detection mechanism.
VII. CONCLUDING REMARKS
We have presented a time redundancy technique for concurrent error detection in arithmetic and logic units. The
method, Recomputing with Shifted Operands (RESO), exploits the bit-slice structure of the ALU's. The fault model used
is more general than the commonly assumed stuck-at-fault
models. Our model assumes that the physical failures are
confined to a small area of the chip or equivalently to a cluster
of components, and the precise nature of the resulting faults
is unknown. This model is very appropriate for the VLSI
technology, since the failure modes in VLSI circuits are not
well understood. Furthermore, we have shown that RESO has
the capability of detecting errors not only in the logical operations, but also in the arithmetic operations in ripple carry
adders, full carry-lookahead adders, and group carry-lookahead adders. The universality of error detection and a low cost
of implementation make RESO more attractive than other methods of error detection in ALU's.
REFERENCES
[1] A. Avizienis, "Arithmetic codes: Cost and effectiveness studies for application in digital systems design," IEEE Trans. Comput., vol. C-20,
pp. 1322-1331, Nov. 1971.
[2] A. Avizienis, G. C. Gilley, F. P. Mathur, D. A. Rennels, J. A. Rohr, and
D. K. Rubin, "The STAR computer: An investigation of the theory and
practice of fault-tolerant computer design," IEEE Trans. Comput., vol.
C-20, pp. 1312-1322, Nov. 1971.
[3] H. L. Garner, "Error codes for arithmetic operations," IEEE Trans. Electron. Comput., vol. EC-15, pp. 763-770, May 1966.
[4] J. F. Wakerly, Error Detecting Codes, Self-Checking Circuits and Applications. New York: North-Holland, 1978.
[5] F. F. Sellers, M. Y. Hsiao, and L. W. Bearnson, Error Detecting Logic
for Digital Computers. New York: McGraw-Hill, 1968.
[6] T. R. N. Rao and P. M. Monteiro, "A residue checker for arithmetic
and logical operations," in Proc. 2nd Int. Symp. on Fault-Tolerant
Comput., June 1971.
[7] J. F. Wakerly, "Partially self-checking circuits and their use in performing logical operations," IEEE Trans. Comput., vol. C-23, pp.
658-666, Dec. 1974.
[8] D. Reynolds and G. Metze, "Fault detection capabilities of alternating
logic," IEEE Trans. Comput., vol. C-27, pp. 1093-1098, Dec. 1978.
[9] J. J. Shedletsky, "Error correction by alternate-data retry," IEEE Trans.
Comput., vol. C-27, pp. 106-112, Feb. 1978.
[10] D. A. Anderson, "Design of self-checking digital networks using coding
techniques," Coord. Sci. Lab., Univ. of Illinois, Urbana, Tech. Rep.
R-527, Sept. 1971.
Janak H. Patel (S'73-M'76), for a photograph and biography, see page 304
of the April 1982 issue of this TRANSACTIONS.
Leona Y. Fung (S'80) was born in Hong Kong on July 24, 1958. She received the B.S. degree in
computer science and the M.S. degree in electrical
engineering from the University of Illinois at Urbana-Champaign.
In 1981 she was a Research Assistant in the
Coordinated Science Laboratory at the University
of Illinois. Her research interests are fault-tolerant
computing, computer architecture, and VLSI systems.