IEEE TRANSACTIONS ON COMPUTERS, VOL. C-31, NO. 7, JULY 1982, p. 589

Concurrent Error Detection in ALU's by Recomputing with Shifted Operands

JANAK H. PATEL, MEMBER, IEEE, AND LEONA Y. FUNG, STUDENT MEMBER, IEEE

Abstract: A new method of concurrent error detection in Arithmetic and Logic Units (ALU's) is proposed. This method, called "Recomputing with Shifted Operands" (RESO), can detect errors in both the arithmetic and logic operations. RESO uses the principle of time redundancy in detecting the errors and achieves its error detection capability through the use of the already existing replicated hardware in the form of identical bit slices. It is shown that for most practical ALU implementations, including carry-lookahead adders, the RESO technique will detect all errors caused by faults in a bit-slice or a specific subcircuit of the bit slice. The fault model used is more general than the commonly assumed stuck-at fault model. Our fault model assumes that the faults are confined to a small area of the circuit and that the precise nature of the faults is not known. This model is very appropriate for VLSI circuits.

Index Terms: ALU, bit-sliced ALU, concurrent error detection, fault detection, time redundancy, VLSI circuits, VLSI faults.

Manuscript received October 6, 1981; revised January 12, 1982. This work was supported by the U.S. Navy under VHSIC Contract N00039-80-C-0556.
J. H. Patel is with the Coordinated Science Laboratory, and the Department of Electrical Engineering, University of Illinois, Urbana, IL 61801.
L. Y. Fung was with the Coordinated Science Laboratory, University of Illinois, Urbana, IL 61801. She is now with STC Computer Research Corporation, Santa Clara, CA 95051.
0018-9340/82/0700-0589$00.75 © 1982 IEEE

I. INTRODUCTION

It has been known for some time that no low-cost, efficient technique has been available for checking both arithmetic and logic operations. The AN code, Residue code, and Inverse Residue code [1]-[5] are the error detecting codes that were developed earlier for checking arithmetic operations. However, these methods are unable to detect some single errors in group carry-lookahead structures. Furthermore, these methods cannot be used for checking logical operations.

Utilizing a fully duplicated logic unit has been recognized as the most effective method for checking logical operations. In fact, most machines that have been built with an error detection scheme used duplication to check logical operations. For example, the fault-tolerant STAR computer used inverse residue codes to check the arithmetic unit but duplication for the logic unit [2]. Several other machines, such as EDVAC and IBM System/3, have duplicated the entire ALU. Nevertheless, a few designs, such as the residue-checked ALU [6] and a partially self-checking ALU [4], [7], were introduced for checking the entire ALU as alternatives to full duplication. However, both schemes require a large increase in the complexity of the circuitry, and therefore in the area on the chip. Furthermore, they are based on the traditional stuck-at fault model, which is not appropriate for VLSI technology.

A more appropriate assumption is that the failure in a VLSI circuit affects a small area of the chip and the nature of the failure is not precisely known. For example, an imperfection in the semiconductor material may be confined to a small area of the chip and thus affect several neighboring gates. Similarly, an incidence of a high energy particle may also affect a small area of the chip. The underlying failure mechanisms are not well understood as yet. Therefore, it is unwise to assume that under failure the output of a specific gate is stuck at some logical value. It is true, however, that at some higher functional level the effect of failures will be felt as changes in logical values. This functional level, for example, can be at the level of a bit-slice of an ALU. This is the level we choose for our functional fault model in this paper. Throughout this paper, a failure in a circuit means some physical malfunction, and an error means an incorrect value of the function under consideration.

In this paper we present an error detection scheme based on time redundancy. In the next section, some systematic approaches to devising error detection by time redundancy are described. Later we describe the proposed error detection method, called Recomputing with Shifted Operands (RESO). The error detection capabilities of RESO are described in Sections IV and V. It is shown that in a typical ALU, RESO detects all functional errors resulting from failures confined to a certain area of the chip, for example, a bit-slice. In Section VI some extensions of RESO, as related to error correction in logic operations and error detection in multiply-divide circuits, are discussed.

II. ERROR DETECTION USING TIME REDUNDANCY

In this section we introduce a systematic way of exploiting time redundancy for error detection. Let x be the input to a computation unit f and let f(x) be the desired output. A space redundancy technique, like Fig. 1, with two identical function units f, will detect any error in one of the two computation units.

Now consider computing f(x) twice in time, on the same hardware box f, as illustrated in Fig. 2. The result of the first computation step is stored in a register and then compared with the result of the second computation step. An intermittent error occurring during either of the computation steps, but not both, will be detected; however, no permanent error can be detected. Thus, the error detection capability of Fig. 2 is much worse than that of Fig. 1.

Fig. 1. Error detection with space redundancy.

Fig. 2. Intermittent error detection with time redundancy.

Fig. 3. Error detection with time redundancy.

Consider modifying the second step of Fig. 2. Suppose that we have two functions c and d such that $d(f(c(x))) = f(x)$ for all x. Now we compute f(x) in the first step, and then during the second step we compute $d(f(c(x)))$ as shown in Fig. 3. The functions c and d may be called the coding and decoding functions, respectively. If c and d are properly chosen, then a failure in the unit f will affect $f(x)$ and $f(c(x))$ differently. Therefore, the outputs of the first step and the second step will not match, producing an error signal.

Quite often the functions c and d are inverses of each other, that is, $d(c(x)) = x$ for all x. In this case we write $c^{-1}(f(c(x))) = f(x)$. Fig. 4(a) shows this organization. A somewhat equivalent organization, Fig. 4(b), results from the fact that $c^{-1}(f(c(x))) = f(x)$ implies that $f(c(x)) = c(f(x))$ for all x. Figs. 3 and 4 are the basis of most error detection methods using time redundancy.

As an example of Fig. 4, consider Boolean functions f(x) which are self duals. Let $x = (x_1, x_2, \ldots, x_n)$ be an n-bit vector; f(x) is a self dual if for all x, $f(\bar{x}_1, \bar{x}_2, \ldots, \bar{x}_n) = \overline{f(x_1, x_2, \ldots, x_n)}$. Let c be a function which complements each bit of a vector, e.g., $c(x_1, \ldots, x_n) = (\bar{x}_1, \ldots, \bar{x}_n)$. It is clear that $c^{-1} = c$ and $c(f(c(x))) = f(x)$, or alternately $c(f(x)) = f(c(x))$. This property of self duals was the basis of one of the first error detection methods devised using the time redundancy technique. It is called alternating logic design and was introduced by Reynolds and Metze [8]. Time redundancy has also been used for error correction [9].
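The self-dual example can be tried on a small model. The sketch below is illustrative only: it assumes a 3-input majority function as the self-dual f, uses bitwise complement for c, and injects a hypothetical fault (output stuck at 1). The names `majority`, `alternating_check`, and `stuck_at_one` are invented for this example and do not come from the paper or from [8].

```python
from itertools import product

def majority(bits):
    """3-input majority, a self-dual Boolean function."""
    a, b, c = bits
    return (a & b) | (b & c) | (a & c)

def complement(bits):
    """Coding function c: complement every bit of the input vector."""
    return tuple(1 - b for b in bits)

def alternating_check(unit, bits):
    """Two-step check in the style of Fig. 4(a) with c = c^{-1} = complement.

    Returns True when both steps agree, i.e. no error signal is raised.
    """
    first = unit(bits)                     # step 1: f(x)
    second = 1 - unit(complement(bits))    # step 2: c(f(c(x)))
    return first == second

def stuck_at_one(bits):
    """Hypothetical faulty unit whose output is permanently 1."""
    return 1

inputs = list(product((0, 1), repeat=3))
fault_free_ok = all(alternating_check(majority, x) for x in inputs)
fault_detected = all(not alternating_check(stuck_at_one, x) for x in inputs)
```

With the fault-free unit the two steps agree on every input. With the stuck output, the first step yields 1 and the decoded second step yields 0, so the mismatch appears on every input even though the fault is permanent.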
Here, we address ourselves specifically to error detection in ALU's with organizations like Fig. 4.

There are several problems to consider in the design of an error detection method using time redundancy. The first problem is to find a function c for the given function f such that $c^{-1}(f(c(x))) = f(x)$. The second problem is that the mere existence of the function c does not provide the desired error detection capability. Error detection capability varies widely with different c's; and, in addition, the functional properties of f and c are not sufficient by themselves to determine the error detection capability. Different circuit implementations of the same function f will have different capabilities. Even the organizations of Fig. 4(a) and (b) may have slightly different error coverage. And one final problem is the complexity of the hardware implementation of the coding and decoding functions c and $c^{-1}$. If the hardware required to implement c is comparable to that of the function unit f, then the space redundancy approach of Fig. 1 is clearly more cost-effective. In short, for a cost-effective design, the function c must be such that it provides very good error coverage and is far less complex than the function f. We present here a cost-effective method of error detection called Recomputing with Shifted Operands (RESO). An overview of this method follows.

Fig. 4. Two approaches to error detection using time redundancy.

III. DESCRIPTION OF RESO

The RESO method is based on Fig. 4. Let the function unit f be an ALU, the coding function c be a left shift operation, and the decoding function $c^{-1}$ be a right shift operation. Thus, c(x) = left shift of x and $c^{-1}(x)$ = right shift of x. With a more precise description of c, e.g., logical or arithmetic shift, by how many bits, what gets shifted in and so forth, it can be shown that for most typical operations of an ALU, $c^{-1}(f(c(x))) = f(x)$.
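A minimal sketch of the shift-based identity, under stated assumptions: operands are unsigned, Python's unbounded integers stand in for an ALU wide enough that no bits are lost when shifting left, and the fault is a hypothetical stuck-at-0 output on one slice of a bitwise-AND unit. The helper names (`make_and_alu`, `reso_matches`) are invented for illustration.

```python
N = 4   # operand width in bits (assumed for the example)
K = 1   # shift distance

def make_and_alu(stuck_slice=None):
    """Bitwise-AND unit; optionally force the output of one slice to 0."""
    def alu(a, b):
        result = a & b
        if stuck_slice is not None:
            result &= ~(1 << stuck_slice)   # hypothetical stuck-at-0 slice output
        return result
    return alu

def reso_matches(alu, a, b, k=K):
    """Fig. 4(a) organization: compare f(x) with c^{-1}(f(c(x)))."""
    first = alu(a, b)                    # step 1: unshifted operands
    second = alu(a << k, b << k) >> k    # step 2: shift left, compute, shift right
    return first == second

operands = [(a, b) for a in range(2 ** N) for b in range(2 ** N)]

good = make_and_alu()
bad = make_and_alu(stuck_slice=2)

# Fault-free unit: the two steps always agree.
identity_holds = all(reso_matches(good, a, b) for a, b in operands)

# The same identity for unsigned addition with zero carry-in.
add_identity_holds = all(((a << K) + (b << K)) >> K == a + b for a, b in operands)

# Faulty unit: cases where the first result is wrong yet the check passes.
undetected = [(a, b) for a, b in operands
              if bad(a, b) != (a & b) and reso_matches(bad, a, b)]
```

In this model the faulty slice corrupts bit 2 of the first result but bit 1 of the decoded second result, so every wrong first result produces a mismatch and `undetected` stays empty.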
The details about shifts are discussed later in this section, but for now we shall only give a less precise but intuitively clear picture of the RESO method. A schematic of the implementation appears in Fig. 5. Depending on whether we follow the principle of Fig. 4(a) or (b), the operation of the error detection scheme in the ALU of Fig. 5 can be described as below.

Following the principle of Fig. 4(a), during the initial computation step the operands are passed through the shifter unshifted and then they are input to the ALU. The result is then stored in a register unshifted. During the recomputation step, the operands are shifted left and then input to the ALU. The result is right shifted and then compared with the contents of the register. A mismatch indicates an error in computation.

If we follow the principle of Fig. 4(b), then the operation changes slightly. Here we input the unshifted operands during the first step as before, but we left shift the result and then store it in the register. In the second step, we input the left shifted operands to the ALU and then compare the output directly with the register. An inequality signals an error.

Fig. 5. Concurrent error detection in an ALU using RESO.

When an n-bit operand is shifted left by k bits, its leftmost k bits move out. To preserve these bits during the recomputation step, an (n + k)-bit ALU is needed. For certain logical operations, the operands can be shifted left circularly, in which case only an n-bit ALU is required. Use of rotations rather than shifts is discussed later in Section VI. For now, we shall just assume an (n + k)-bit ALU for error detection in n-bit operations.

The leftmost k bits of the (n + k)-bit operands for the first computation step and the rightmost k bits for the recomputation step are determined depending on the operations and the number system. For all bit-wise logical operations, it does not matter what they are. For arithmetic operations, the leftmost k bits of the unshifted operands have to be zeros for unsigned binary integers and extensions of the sign bit for the one's and two's complement number representations. Consider the left shift by k bits as a multiplication process, that is, the original operands are multiplied by $2^k$ for the recomputation step. In order to be consistent, the carry-in has to be multiplied by $2^k$. This can be done by shifting the carry-in into the right of one of the operands k times. Now, $(2^k - 1) \times (\text{carry-in})$ has been added to one of the operands, and one more carry-in will be added to the sum during the recomputation step. For the other operand, k zeros should be shifted into its right. Under error-free conditions, the rightmost k bits of the result from the recomputation step should be all zeros. So these k bits are essential to indicate errors resulting from faults in the rightmost k bit-slices. To preserve these bits for equality checking, we follow the principle of Fig. 4(b). Thus, k zeros are shifted into the right of the result from the first computation, and this shifted result is compared with the unshifted result from the recomputation step.

Let RESO-k be the name of the error detection scheme achieved by recomputing with k-bit shifted operands. The determination of k depends on three factors: the implementation of f, the space redundancy allowed, and the set of errors defined. The hardware that RESO-k needs to detect errors in an ALU for n-bit operations includes two shifters, a register, an (n + k)-bit ALU, and an equality checker. A totally self-checking equality checker can be implemented based on 1-out-of-2 code checkers [10]. The errors in the shifter can be detected using any suitable parity codes. For the smallest circuitry, k has to be one. If k is one, then the faults covered are confined to a bit-slice or a subnetwork of a bit-slice depending on the implementation of f. The next section gives the details of RESO-1. It will be shown later that RESO-2 guarantees the detection of all functional errors resulting from failures confined to a bit-slice independent of the implementation. The error detection capabilities of RESO-k will be presented in a later section.

At first glance, it appears that the above method of error detection halves the execution rate. However, if one views the ALU in the global context involving the entire computer, then the time penalty is only a small portion of the entire instruction cycle. If the ALU is pipelined, then the recomputation step can be started immediately after one segment time unit or one clock cycle of the pipeline. Therefore, both the computation and recomputation are overlapped in the ALU, and the resulting penalty in performance is exactly one segment delay per ALU operation.

IV. ERROR DETECTION CAPABILITY OF RESO-1

In this section we describe the error detection capability of RESO-1, where RESO-1 refers to recomputing with operands shifted by one bit. First, we consider logical operations and then arithmetic operations. All theorem statements and proofs for the arithmetic operations only refer to add operations and can be trivially extended to include subtract, increment, decrement, and negate operations.

Theorem 1: RESO-1 detects all errors in an ALU for all bitwise operations AND, OR, NOT, and their derivations (e.g., XOR, NOR, etc.) when the failure is confined to a single bit-slice.

Proof: Let the bit-slice i be faulty. If the fault produces an error, then the bit i of the result during the first computation step is incorrect. During the second computation step, the bit i of the result is computed by the nonfaulty slice i + 1. Therefore, the bit i of the recomputation step is the correct bit. Thus, if the failure produces an error affecting the bit i of the first result or the bit i - 1 of the second result or both, then the two results will not match. The error, therefore, is detected.

Theorem 2: In a bit-sliced ALU whose sum and carry functions of the adders are represented by two disjoint networks like Fig. 6, any errors will be detected by RESO-1 if the failure is confined to either the sum network or the carry network, but not both.

Proof:

Case 1 (Only the Sum Circuit is Affected): Suppose the failure is confined to the sum circuit of bit-slice i. Then the bit i of the result is either correct or off by $\pm 2^i$ during the first computation step. During the recomputation step, the bit-slice i operates on bit (i - 1) of the original operands. Thus, the failure will cause the result to be either correct or off by $\pm 2^{i-1}$. The equality checker will find two results identical if and only if both results are correct, since no error value of one computation equals an error value of the other. Therefore, all errors are detected.

Case 2 (Only the Carry Circuit is Affected): Suppose that the carry circuit of bit-slice i is faulty. Then during the first computation step, the result is either correct or off by $\pm 2^{i+1}$. During the recomputation step, the result is either correct or off by $\pm 2^i$. Thus, the errors in two results cannot be equal, and therefore any errors will be detected.

Fig. 6. Full adder with disjoint sum and carry networks.

Fig. 7. Full adder with sum and carry networks sharing a subnetwork.

Fig. 8. A typical full adder circuit satisfying the structure of Fig. 7.

In certain implementations, the carry and sum circuits are not disjoint and they share a common subnetwork. If this sharing is done in a particular manner, then it is possible to detect errors. The specific nature of this sharing is described below.

A function $f(x_1, \ldots, x_n)$ is said to be strongly dependent on the variable $x_i$ if for every pair of input vectors which differ only in $x_i$, the values of f for these vectors differ.

Theorem 3: In an adder whose sum and carry functions are represented by two networks which share a common subnetwork, and whose sum function is strongly dependent on the function that the shared subnetwork represents, any errors caused by failures confined only to one of the three subnetworks will be detected by RESO-1.

Proof: Let the adder implementation be as in Fig. 7.

Case 1 (Only the Circuit f is Affected): This is the same as Case 1 of Theorem 2.

Case 2 (Only the Circuit g is Affected): This is the same as Case 2 of Theorem 2.

Case 3 (Only the Circuit k is Affected): Let the circuit k of bit-slice i be faulty. During the first computation step, if the output of $k_i$ is wrong, then the sum bit i must also be wrong, since the sum function strongly depends on the function k, and the carry $c_{i+1}$ can be correct or incorrect. Hence, the result of the first computation step will be off the expected correct result by $\pm 2^i$ if only the sum bit is incorrect, or by $\pm 2^i \pm 2^{i+1}$ if the carry is also incorrect. If the failure in unit k does not produce an error then the result is correct. Therefore, due to a failure in unit k, the result can be off by one of $\{0, \pm 2^i, \pm 3 \times 2^i\}$. During the recomputation step, again since the sum function strongly depends on the function k, if the output of $k_i$ is wrong, then the sum bit i - 1 must also be wrong and the carry bit i can be wrong or correct. Of course, if the output of circuit k is correct, then no error has occurred during the second step. Hence, the result of the recomputation step will be off by one of the values $\{0, \pm 2^{i-1}, \pm 3 \times 2^{i-1}\}$. Since the erroneous results from the two computation steps will never have the same value, the errors can always be detected. □

Theorem 4: Any Boolean algebraic function $f(x_1, x_2, \ldots, x_n)$ which is strongly dependent on $x_1$ must be of the form $x_1 \oplus g(x_2, \ldots, x_n)$ for some function g of $x_2, \ldots, x_n$.

Proof: By definition, $f(x_1, x_2, x_3, \ldots)$ is strongly dependent on $x_1$ if for every pair of input vectors which differ only in $x_1$, the values of f for these vectors differ. That is,

$$f(1, x_2, x_3, \ldots) = \overline{f(0, x_2, x_3, \ldots)}. \tag{1}$$

By Shannon's Expansion Theorem,

$$f(x_1, x_2, x_3, \ldots) = x_1 \cdot f(1, x_2, x_3, \ldots) + \bar{x}_1 \cdot f(0, x_2, x_3, \ldots).$$

Substituting (1) into it,

$$f(x_1, x_2, x_3, \ldots) = x_1 \cdot \overline{f(0, x_2, x_3, \ldots)} + \bar{x}_1 \cdot f(0, x_2, x_3, \ldots).$$

Let $g(x_2, x_3, \ldots) = f(0, x_2, x_3, \ldots)$; then

$$f(x_1, x_2, x_3, \ldots) = x_1 \cdot \overline{g(x_2, x_3, \ldots)} + \bar{x}_1 \cdot g(x_2, x_3, \ldots),$$

that is,

$$f(x_1, x_2, x_3, \ldots) = x_1 \oplus g(x_2, x_3, \ldots).$$

We are now ready to look at some specific implementations of adders whose errors can be detected by RESO-1. The most commonly used implementation of a full adder is shown in Fig. 8. This is a typical network that satisfies the structure of Fig. 7. By Theorem 3, RESO-1 can detect all errors resulting from any faults confined to one of the subnetworks f, g, or k. Any faults on the lines which are not in the dotted boxes are also included in the subnetwork from which the line originates.

TABLE I. EFFECT OF FAULTY VALUES OF $G_i$ AND $P_i$ ON THE SUM

Fig. 9. Typical implementation of a carry-lookahead adder.

V. ERROR DETECTION CAPABILITIES OF RESO-k

In the previous sections we presented the error detection capabilities of RESO-1, that is, recomputing with operands shifted by 1 bit. We assumed in that case that failures were confined to a well-defined cluster of logic elements. These clusters include a small number of elements. With the ever-increasing density of components in the fast-changing VLSI technology, one may consider the assumptions of the previous section somewhat restrictive. For this reason, in this section we present the generalized RESO error detection method for a less restrictive fault model. The fault model we assume here is also a functional fault model.
It is the same as that of the previous section except that we allow more area of the chip or more components to be included in the affected cluster. For example, one can assume that failures affect a complete bit-slice or several adjacent bit-slices.

To understand why RESO-1 may not be adequate for error detection when a large area of the chip is affected, consider the following example, which also leads us to the next theorem. Consider a bit-sliced ripple-carry adder. Suppose that bit-slice i is faulty. Then the sum and the carry bits can have any logical values at any time. The functional nature of the error is to change the arithmetic value of the final result. During the first computation step, bit-slice i computes a sum bit with weight $2^i$ and the carry-out bit with a weight $2^{i+1}$. Thus, it is possible that the result of the first computation step is off by $\pm 2^i$ or $\pm 2^{i+1}$ or $\pm 2^i \pm 2^{i+1}$ or 0. In other words, the result is off by one of $\{0, \pm 2^i, \pm 2^{i+1}, \pm 3 \times 2^i\}$. During the second step, the operands are shifted left by one bit, and therefore the bit-slice i computes the sum bit with weight $2^{i-1}$ and the carry-out bit with weight $2^i$. Again, not knowing the exact nature of the fault, the sum and carry-out bits can take any logical values independent of the input. Reasoning as before, the result is off by one of $\{0, \pm 2^{i-1}, \pm 2^i, \pm 3 \times 2^{i-1}\}$. From this, it is clear that results of the two steps can be identical not only when there is no error, but also when the errors happen to be $+2^i$ or $-2^i$ in both steps. This suggests that the second computation step should be changed so that no two errors are the same. This leads us to the following theorem.

Theorem 5: In a bit-sliced ripple-carry adder, RESO-2 detects all errors resulting from failures confined to any one bit-slice.

Proof: Let the bit-slice i be faulty. During the first computation step, the sum and the carry outputs of slice i have weights $2^i$ and $2^{i+1}$, respectively. Therefore, the result from the adder can be off by one of the values $\{0, \pm 2^i, \pm 2^{i+1}, \pm 3 \times 2^i\}$. Now consider the recomputation after the operands have been shifted left by two bits. The sum and carry-out of bit-slice i now have weights $2^{i-2}$ and $2^{i-1}$, respectively. Therefore, the result of the recomputation step can be off by one of the values $\{0, \pm 2^{i-2}, \pm 2^{i-1}, \pm 3 \times 2^{i-2}\}$. No single error (nonzero value) appears in both sets. Therefore, any error in either step will cause a mismatch of the results of the two computation steps, and the error is detected.

For carry-lookahead adders, RESO-2 is also effective, as proven in the following theorem. Since a bit-slice is not well defined in a carry-lookahead (CLA) adder, we first describe the exact implementation of a CLA adder and then define what we mean by a "bit-slice." Fig. 9 describes a typical implementation of a CLA adder. The function unit $f_i$ computes the sum bit $s_i$, the carry propagate signal $P_i$, and the carry generate signal $G_i$. The function unit $g_i$ computes the carry-in to the stage i; all function units $f_i$ are identical. However, the function units $g_i$ get more complex as i grows. It is only for the sake of convenience that we define a "bit-slice" i to consist of units $f_i$ and $g_i$.

TABLE I (d denotes a don't-care value)

Correct values    Faulty values    Change in
$G_i$  $P_i$      $G_i$  $P_i$     the sum
  0      0          0      1       $0,\ +2^i$
  0      0          1      d       $+2^{i+1}$
  0      1          0      0       $0,\ -2^i$
  0      1          1      d       $0,\ +2^{i+1}$
  1      d          0      0       $-2^{i+1}$
  1      d          0      1       $0,\ -2^{i+1}$

Theorem 6: In a bit-sliced carry-lookahead adder, RESO-2 detects all errors resulting from failures confined to any one bit-slice.

Proof: Let the bit-slice i, which includes function units $f_i$ and $g_i$, be faulty (see Fig. 9). Then by assumption only the sum bit $s_i$, carry generate $G_i$, and carry propagate $P_i$ are affected. Since the consequence of faults in $g_i$ is to affect only the sum bit $s_i$, this case is already included in the above assumption. Sum bit $s_i$ has an arithmetic weight of $2^i$, and $G_i$ has a weight of $2^{i+1}$. When $G_i = 1$, the propagate signal $P_i$ has no effect on sum bits i + 1 and higher.
Thus, when $G_i = 1$, $P_i$ has a weight of 0. Depending on the implementation, the sum bit $s_i$ may or may not be a function of $P_i$. Since we have already considered $s_i$ to be erroneous, the effect of $P_i$ on $s_i$ can be ignored. When $G_i = 0$, $P_i$ has a weight of $2^i$. With this information, we can enumerate all possible changes in the sum contributed by the errors in $G_i$ and $P_i$, as shown in Table I. Combining these values with the possible errors contributed by the sum bit $s_i$, we conclude that due to the failures in slice i, the result can be off by one of the values $\{0, \pm 2^i, \pm 2^{i+1}, \pm 3 \times 2^i\}$. After the operands are shifted left by two bits, the result of the recomputation step can be off by one of the values $\{0, \pm 2^{i-2}, \pm 2^{i-1}, \pm 3 \times 2^{i-2}\}$. Thus, no single error value appears in both computation steps and any error can be detected. □

The above results can be generalized to include the failure of more than one bit-slice of the ALU. The generalized result is stated below.

Theorem 7: RESO-k has the following error detection capabilities in an ALU:
1) detects all errors in all bit-wise logical operations when the failures are confined to k adjacent bit-slices,
2) detects all errors in arithmetic operations in a ripple-carry adder when the failures are confined to (k - 1) adjacent bit-slices for k > 1 (for k = 1 see RESO-1 in the last section),
3) detects all errors in arithmetic operations in a full carry-lookahead adder when the failures are confined to (k - 1) adjacent bit-slices, where a bit-slice i consists of functional units $f_i$ and $g_i$ (see Fig. 9),
4) detects all errors in arithmetic operations in a group carry-lookahead adder, each group i consisting of a (k - 1)-bit adder and circuits for group-carry generate $G_i$, group-carry propagate $P_i$, and the group carry-in $C_i$ (similar to Fig. 9), when the failures are confined to a single group.

Proof:
1) Let the k adjacent bit-slices i, i + 1, ..., i + k - 1 be faulty.
During the first computation step, bits i, i + 1, ..., i + k - 1 of the result are affected by faults. The recomputation step is performed after k-bit left shifts of the input operands. Therefore, the bit-slice i affects the bit i - k of the result, slice i + 1 affects bit i - k + 1 of the result, and so on. Thus, the result bits that are affected by the faults are bits i - k, i - k + 1, ..., i - 1. Therefore, no single bit of the result is affected in both computation steps. Hence, an error guarantees a mismatch between the results of the two computation steps.

2) Let us assume that the (k - 1) adjacent bit-slices i, i + 1, ..., i + k - 2 are faulty. During the first computation step, the faults can cause errors in the sum bits i, i + 1, ..., i + k - 2 and the carry-outs of bits i, i + 1, ..., i + k - 2. The smallest nonzero magnitude of the change in the result due to faults occurs when the sum bit i is in error. The magnitude of this error is $2^i$. The largest magnitude of the errors occurs when the sum bits i, i + 1, ..., i + k - 2 and the carry-out of bit i + k - 2 are all wrong in one direction. (That is, they all change from 0 to 1 or all from 1 to 0.) The magnitude of this error is $2^i + 2^{i+1} + \cdots + 2^{i+k-2} + 2^{i+k-1}$, which is the same as $2^i(2^k - 1)$. After the input operands are shifted left by k bits, the recomputation step is performed. The bits affected are the sum bits i - k, i - k + 1, ..., i - 2 and the carry-out of bit i - 2. The smallest and the largest magnitudes of the nonzero errors are $2^{i-k}$ and $2^{i-k}(2^k - 1)$, respectively. The largest error of the recomputation step is less than the smallest error of the first step, since $2^{i-k}(2^k - 1) < 2^i$. Therefore, no single error value can occur in both of the computation steps and thus an error always causes a mismatch of the two results and, hence, it is detected.

3) Again assume that the (k - 1) adjacent bit-slices i, i + 1, ..., i + k - 2 are faulty.
Using reasoning similar to that in the proof of Theorem 6 and in the proof of part 2) above, it can be shown that the smallest nonzero magnitude of the error during the first step of computation is $2^i$, and the largest magnitude of the error during the recomputation step is $2^{i-k}(2^k - 1)$, which is less than $2^i$. Thus, no single error value occurs in both steps.

4) The group of (k - 1) bit-slices is the same as the (k - 1) adjacent bit-slices of 3). □

VI. EXTENSIONS OF RESO

The method of error detection presented in the previous section can be modified and/or extended in several different directions. Among these are: use of rotations rather than shifts in certain applications, error correction using RESO, and extending RESO to more complex arithmetic functions, such as multiply, divide, and floating-point operations. We discuss them briefly in this section.

Recomputing with Rotated Operands: Since a rotation is the same as a circular shift, some of our results derived for RESO are also valid under rotation. For bit-wise logical operations in an ALU, no two bits of the same operand interact, and therefore the positioning of a bit with respect to other bits does not affect the outcome of the result as long as the bits of the second operand are similarly positioned. Thus, rotations can be substituted for logical shifts in RESO, and the same error detection capability is achieved for logical operations. It is clear that rotations have an advantage over shifts because no additional bit-slices are needed. It is not clear, however, that rotations can be used in a straightforward manner when arithmetic operations are involved. With additional hardware, it is possible to ensure a correct add operation after a rotation, so that the carry-in is applied to an appropriate bit-slice and the carry-out is extracted from the proper bit-slices.
Thus, there is a tradeoff between the cost of additional bit-slices needed for shifts and the cost of additional control hardware needed for the rotations. Furthermore, we must also consider the effects of faults on the additional hardware, which is different from a bit-slice. Rotations in a carry-lookahead adder require even more complex control since the carry-lookahead unit cannot be divided into identical bit-slices.

Error Correction Using RESO: First, let us discuss the bitwise logical operations in an ALU. Suppose that the bit-slice i is faulty. Then the bit i of the result may or may not be correct. For the first recomputation step, the operands are shifted left by one bit. Now the bit i - 1 of the result is computed by the faulty bit-slice. If the bit i of the first step and bit i - 1 of the second step are incorrect, then the two results disagree in two bit positions. From this it follows that the slice i produced an incorrect output during both steps. Hence, the result can be corrected by complementing the bit i of the first result or bit i - 1 of the second result. However, if the incorrect output were produced during only one of the two steps, then the disagreement between the two steps occurs in only one bit position, and it cannot be determined which of the two results is correct. For this reason, we need more information, and it can be obtained by doing a third computation step after the operands have been shifted left one more bit, that is, a total of two bits off from the original operands. Now each bit of the result is computed by at least two nonfaulty bit-slices. Therefore, a 2-out-of-3 majority vote on each bit will decide the correct value. This is very similar to Triple Modular Redundancy (TMR), the difference being that TMR is redundant in space, while RESO uses redundancy in time.
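The three-step vote for bitwise operations can be sketched as follows. This is an illustrative model only: it assumes a bitwise-OR unit, unsigned 4-bit operands, Python's unbounded integers in place of an (n + 2)-bit ALU, and a hypothetical fault that forces the output of one slice to a fixed value. The function names are invented for the example.

```python
N = 4   # operand width in bits (assumed for the example)

def faulty_or(a, b, bad_slice, stuck_value):
    """Bitwise OR whose output at position bad_slice is forced to stuck_value."""
    result = a | b
    mask = 1 << bad_slice
    return (result | mask) if stuck_value else (result & ~mask)

def corrected_or(a, b, bad_slice, stuck_value):
    """Compute with operands shifted by 0, 1 and 2 bits, then vote per bit."""
    r0 = faulty_or(a, b, bad_slice, stuck_value)
    r1 = faulty_or(a << 1, b << 1, bad_slice, stuck_value) >> 1
    r2 = faulty_or(a << 2, b << 2, bad_slice, stuck_value) >> 2
    # Bitwise 2-out-of-3 majority: the faulty slice touches a different
    # result bit in each step, so every bit has at least two correct copies.
    return (r0 & r1) | (r1 & r2) | (r0 & r2)

all_corrected = all(
    corrected_or(a, b, s, v) == (a | b)
    for a in range(2 ** N)
    for b in range(2 ** N)
    for s in range(N + 2)        # any one slice of the widened unit
    for v in (0, 1)
)
```

For instance, with slice 1 stuck at 0, `faulty_or(0b0101, 0b0011, 1, 0)` returns `0b0101` instead of `0b0111`, while `corrected_or` with the same arguments recovers `0b0111`.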
Correcting errors in arithmetic operations with RESO is not generally straightforward or even possible, although under very restrictive fault models (such as the single stuck-at fault) one may be able to correct errors in arithmetic operations with additional hardware.

RESO for Complex Arithmetic Functions: We have so far described the error detection capabilities of RESO as applied to logical and simple arithmetic operations (add, subtract). For arithmetic operations such as integer multiply and divide, one can determine the error detection capabilities of RESO for specific hardware implementations. There are also different ways of applying RESO to multiplication or division. For example, if the multiplication is performed using the add-and-shift method on an ALU, then we can apply RESO to the individual add and shift operations and thus check each step of the multiplication algorithm. If the multiplication is done using an array multiplier, then one can use the shifted operands for the recomputation step. Thus, the first step computes, say, x * y, and the recomputation step computes 2x * 2y, which is then compared with x * y shifted left by two bits. Since there are many different array multipliers, we shall not give here the error detection capabilities of RESO-k for any particular multiplier. However, the methods described in the previous section can be used to determine the error detection capabilities of RESO-k for an assumed fault model. RESO is especially suitable for array multipliers and dividers because most such array structures are very regular, so that they can be divided into identical bit-slices. For floating-point operations, one can apply the already established RESO techniques for integers. Thus, the exponent and the mantissa can be handled separately as integers, each with its own error detection mechanism.

VII. CONCLUDING REMARKS

We have presented a time redundancy technique for concurrent error detection in arithmetic and logic units.
The method, Recomputing with Shifted Operands (RESO), exploits the bit-slice structure of ALU's. The fault model used is more general than the commonly assumed stuck-at fault models. Our model assumes that the physical failures are confined to a small area of the chip, or equivalently to a cluster of components, and that the precise nature of the resulting faults is unknown. This model is very appropriate for VLSI technology, since the failure modes in VLSI circuits are not well understood. Furthermore, we have shown that RESO is capable of detecting errors not only in logical operations but also in arithmetic operations in ripple-carry adders, full carry-lookahead adders, and group carry-lookahead adders. The universality of error detection and the low cost of implementation make RESO more attractive than other methods of error detection in ALU's.

REFERENCES

[1] A. Avizienis, "Arithmetic codes: Cost and effectiveness studies for application in digital systems design," IEEE Trans. Comput., vol. C-20, pp. 1322-1331, Nov. 1971.
[2] A. Avizienis, G. C. Gilley, F. P. Mathur, D. A. Rennels, J. A. Rohr, and D. K. Rubin, "The STAR computer: An investigation of the theory and practice of fault-tolerant computer design," IEEE Trans. Comput., vol. C-20, pp. 1312-1322, Nov. 1971.
[3] H. L. Garner, "Error codes for arithmetic operations," IEEE Trans. Electron. Comput., vol. EC-15, pp. 763-770, May 1966.
[4] J. F. Wakerly, Error Detecting Code, Self-Checking Circuits and Applications. New York: North-Holland, 1978.
[5] F. F. Sellers, M. Y. Hsiao, and L. W. Bearnson, Error Detecting Logic for Digital Computers. New York: McGraw-Hill, 1968.
[6] T. R. N. Rao and P. M. Monteiro, "A residue checker for arithmetic and logical operations," in Proc. 2nd Int. Symp. on Fault-Tolerant Comput., June 1971.
[7] J. F. Wakerly, "Partially self-checking circuits and their use in performing logical operations," IEEE Trans. Comput., vol. C-23, pp. 658-666, Dec. 1974.
[8] D. Reynolds and G. Metze, "Fault detection capabilities of alternating logic," IEEE Trans. Comput., vol. C-27, pp. 1093-1098, Dec. 1978.
[9] J. J. Shedletsky, "Error correction by alternate-data retry," IEEE Trans. Comput., vol. C-27, pp. 106-112, Feb. 1978.
[10] D. A. Anderson, "Design of self-checking digital networks using coding techniques," Coord. Sci. Lab., Univ. of Illinois, Urbana, Tech. Rep. R-527, Sept. 1971.

Janak H. Patel (S'73-M'76), for a photograph and biography, see page 304 of the April 1982 issue of this TRANSACTIONS.

Leona Y. Fung (S'80) was born in Hong Kong on July 24, 1958. She received the B.S. degree in computer science and the M.S. degree in electrical engineering from the University of Illinois at Urbana-Champaign. In 1981 she was a Research Assistant in the Coordinated Science Laboratory at the University of Illinois. Her research interests are fault-tolerant computing, computer architecture, and VLSI systems.