IEEE TRANSACTIONS ON COMPUTERS, NOVEMBER 1971

A Fault-Tolerant Information Processing Concept for Space Vehicles

ALBERT L. HOPKINS, JR., MEMBER, IEEE

Abstract-A distributed fault-tolerant information processing system is proposed, comprising a central multiprocessor, dedicated local processors, and multiplexed input-output buses connecting them together. The processors in the multiprocessor are duplicated for error detection, which is felt to be less expensive than using coded redundancy of comparable effectiveness. Error recovery is made possible by a triplicated scratchpad memory in each processor. The main multiprocessor memory uses replicated memory for error detection and correction. Local processors use any of three conventional redundancy techniques: voting, duplex pairs with backup, and duplex pairs in independent subsystems.

Index Terms-Aerospace computers, distributed control computing systems, error detection, error recovery, fault tolerance, multiprocessors, on-board computers, spaceborne computers.

Manuscript received March 1, 1971; revised June 16, 1971. This work was sponsored by the Manned Spacecraft Center (MSC) of the National Aeronautics and Space Administration through Contract NAS 9-4065 with the Massachusetts Institute of Technology, Cambridge, Mass.
The author is with the Department of Aeronautics and Astronautics and the C. S. Draper Laboratory of the Massachusetts Institute of Technology, Cambridge, Mass.

INTRODUCTION

Aerospace technology is developing vehicles for which system survival is a paramount issue, yet which require sophisticated on-board information processing. Examples are the giant commercial aircraft, reusable space shuttles, space stations, deep planetary mission vehicles, and nuclear rocket stages. In aircraft and spacecraft mechanical structures, the use of redundancy is a well established art. In electronics and information processing, however, rational approaches to redundancy are not so well developed, probably because avionics has been historically less intimately related to crew or passenger safety than mechanics.

A number of current potential applications exist for fault-tolerant computers, but the development of such computers lags far behind the need. Fault-tolerant computer development has been compared to surgery, where new techniques may be practiced only upon "otherwise hopeless cases" [1]. This comparison may be somewhat strong in terms of computer technology today, but without question the use of computer redundancy has been virtually limited to cases where infallible digital processing has been indispensable or irresistibly rewarding. The SAGE air defense computer system [2], the ESS electronic switching telephone exchange [3], and the Saturn V guidance and control computer [4] are examples of relatively early approaches to fault tolerance; all were wrought at what appeared to be a very large cost, i.e., replicated equipment. But in reality these were reasonable approaches, both because the cost of additional equipment was tolerable and because there was no expeditious alternative. A number of more recent computers have been designed to be fault tolerant to some degree, but the majority of these have been content with detection and correction of most of the possible faults of a restricted class.

Some of the current ambitions in the aerospace field are providing us with some "otherwise hopeless cases," where reliability beyond anything available commercially is required. What is required is a computer system that will maintain the integrity of the control environment to a very high degree of certainty in the face of any faults that are reasonably likely to occur. We can quantify the degree of certainty by comparison with airline experience, where catastrophic system failure occurs in fewer than 1 in 10^6 missions. The likelihood of occurrence of faults is describable in terms of measured failure rate of semiconductors, where the best experience has been of the order of 0.001 percent per thousand hours.

A number of fault-tolerant computer design projects have suggested several approaches to meeting high reliability requirements [5]-[9]. These approaches differ considerably for a variety of reasons, including the absence of a common figure of merit and differences in their projected production dates. This note attempts to describe in summary the salient features of a comprehensive design proposal for a highly reliable computer system for manned space vehicles in the 1975-1985 timeframe.

SOME PRECEPTS OF THE DESIGN PHILOSOPHY

For space flight in the 1975-1985 time period, and particularly for manned missions, an information processing system is required that will have functional longevity and a very high probability of performing correctly during mission critical phases. To meet these requirements, the components of the system will have to exhibit very low failure rates, probably lower than any so far observed in components of the complexity required. This is an oft-spoken comment, but it cannot be overemphasized. The reliability engineering of the components is a major cost and schedule item and cannot reasonably be left until after the design is complete. Overall reliability can be achieved only by good design, very high component reliability, and redundancy. Redundancy alone is insufficient, as may be observed in the following example.

Let us consider the power law relationship that characterizes the failure probability of a redundant unit. Typically, if f is the probability of a single unit failure, then the probability of all of n redundant units failing is F = f^n. It follows that if f is less than one and if n is increased indefinitely, then F can be made arbitrarily small. In practice, this is not so. The model is imprecise in that it omits the effect of the additional circuitry required for error management. Consider, nevertheless, the inverse relationship f = F^(1/n). Suppose we are given the overall failure probability F as a design goal and seek to determine reasonable values of f and n as parameters of the design. One of the reasons for employing one or more redundant elements is to try to reduce the stringency of the unit failure-rate requirements for meeting the given overall failure probability requirement. If F is fixed, we do, in principle, increase the allowable unit failure probability f when we add one (extra) redundant element. To express the fractional increase in this probability, we define f_n = F^(1/n) as the allowable unit failure probability if n redundant units are used and also define f_(n+1) = F^(1/(n+1)) as the allowable unit failure probability if one more redundant unit is used. The fractional increase is the ratio of f_(n+1) to f_n, which is F^(-1/(n(n+1))). For example, if F, the overall failure probability, is 10^-2, then increasing the number of redundant units from three to four increases the allowable unit failure probability by 47 percent. As gratifying
as it may seem, it is a relatively insignificant increase in
terms of its impact on component-part procurement and
reliability engineering. It is especially insignificant in comparison to the factor-of-10 increase in going from a single
unit to two redundant units. It is even small compared to
the increase by more than a factor of 2 in going from two to
three redundant units. Generally, as n becomes larger, additional replication yields ever-diminishing returns. As a
practical matter, it is improbable that designing a system to
tolerate more than three or four faults will be justifiable in
the near future.
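These figures follow directly from the ratio formula above; the following short sketch (in modern notation, added for illustration and not part of the original note) reproduces them:

```python
# Allowable unit failure probability f_n = F**(1/n) for an overall goal F,
# and the fractional gain from adding one more redundant unit.
F = 1e-2  # overall failure probability goal, as in the example above

def allowable_unit_failure(n):
    """f_n = F**(1/n): the unit failure probability allowed with n units."""
    return F ** (1.0 / n)

for n in range(1, 5):
    ratio = allowable_unit_failure(n + 1) / allowable_unit_failure(n)
    # Equivalently, ratio = F ** (-1.0 / (n * (n + 1))).
    print(f"{n} -> {n + 1} units: f grows by {100 * (ratio - 1):.0f} percent")
# Prints 900, 115, 47, and 26 percent: sharply diminishing returns.
```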
A second fundamental design precept is that the system be
highly modular for the sake of producibility, flexible expansion, fault isolation, and the broad applicability of a small
set of module types. This can also result in greater hardware
efficiency than is obtained by n-fold replication of a single
computer system. The philosophy is manifested in the use of
processors of modest size in a central multiprocessor and in
local processing applications.
The third precept of the design is that in cases where fault
detection, but not correction, is required, duplication and
comparison of units of significant size (e.g., arithmetic logic
units) will be used in preference to coded means of fault
detection. The reasons for this crucial choice are as follows.
1) Coded redundancy presupposes the occurrence of
faults of a certain type, such as stuck-at-zero or stuck-at-one.
Other types of faults are possible, however, including multiple faults due to shorts and opens in interconnecting paths,
and degradation or timing faults, where signals may, in
effect, be randomly faulty or faulty in a complex relationship
to a number of other signals. To extend the coverage of
coded detection to all of these possibilities requires exhaustive failure mode analysis, which is rarely apt to be accomplished successfully. Duplication with comparison, on the
other hand, provides powerful detection for nearly every
class of fault.
2) The implementation of coded detection becomes expensive when it is designed to be able to exercise all possible
fault modes in the field to verify that it does indeed respond
to every different type of stimulus for which it is intended.
A digital comparator, however, can periodically be verified
relatively easily.
3) The use of coded redundancy and the logic required
to verify it in a well checked system require a large amount of
random logic where the interconnections are irregular and
the component density is two to four times lower than in
regular circuitry, such as memories and arithmetic units.
For this reason, the cost of duplication as opposed to coding
is expected to be comparable or even lower because the cost
depends far more upon the number of integrated-circuit
packages and interconnections than on the number of gates.
4) Duplication has the advantage of conceptual simplicity
compared to coded methods. With diminishing hardware
cost, this approach becomes increasingly effective from a
system point of view even if the piece-part costs of duplicated
circuits were to exceed those of circuits with coded redundancy.
Systematic faults can occur in either method, but are of
special concern in the duplication method. A systematic
fault may be defined as a fault that can be induced simultaneously in more than one copy of a replicated circuit.
Clearly, these must be eliminated before a design is viable.
The fourth basic precept is that data continuity and recovery will be achieved in processor scratchpad memories
by triplication and voting, but on a short-term basis only.
A degraded triple scratchpad memory circuit will be retired,
along with associated units, and their functions assumed
by other similar units. The justification here is essentially
the same as for using duplication to detect faults. Although
it may appear to be highly costly from a circuit-minimization
point of view, this cost applies to a small fraction of the
system and to an even smaller fraction of the overall costs.
On the other hand, it accomplishes the very important goal
of providing a basis for automatic information continuity
in the face of faults, independent of the program being executed. This is done, as discussed in a later section, by routing
all external communication via these scratchpad memories.
This triplication also permits programs to be restarted on
healthy processors when faults occur in a processor or
scratchpad memory. This is also presented in a subsequent
section.
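A minimal sketch of the two-out-of-three vote this precept relies on is given below; the note specifies the technique, not an implementation, and the bitwise form shown is one conventional realization:

```python
def majority3(a: int, b: int, c: int) -> int:
    """Bitwise 2-of-3 vote: each output bit agrees with at least two inputs."""
    return (a & b) | (a & c) | (b & c)

# A fault corrupts one scratchpad copy; the voted word is still correct.
good, bad = 0b10110010, 0b11111111
assert majority3(good, good, bad) == good
```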
The fifth precept is that reconfiguration will be accomplished by a mixture of graceful degradation and hybrid two-out-of-three voting with replacement. This permits the
"depth" of fault tolerance, which is the minimum number of
separate faults that can bring down the system, to be varied
in the most flexible way.
Finally, every effort will be made to keep the system conceptually simple, even at the possible expenditure of hardware. The importance of conceptual simplicity should not
be underestimated. It yields advantages in design, test, production, and marketing, i.e., the customer's willingness to
accept the development and deployment risk.
GENERAL DESCRIPTION OF THE SPACE FLIGHT
INFORMATION PROCESSING SYSTEM [10]
A large number of attributes are desirable for the system,
all of which are familiar to aerospace computer designers,
including low cost, small size, expandability, adaptability,
ease of programming, ease of debugging, ease of integration into the system, minimum connections, and so on [11].
Probably the first decision one would like to make is whether
the systems should be centralized, dedicated, or distributed;
this last term embodies elements of central and dedicated
computers. Since this decision must be made in context with
the problem, it is first useful to summarize some advantages
and disadvantages of central and dedicated computers.
1) Advantages of a central computer: A central computer
allows the pooling of resources for the sake of repair (reconfiguration) and allocation to meet instantaneous demands. The more centralized a system is, the more efficient it will be in cost and size, due to such overhead considerations as packaging, power conversion, and regulation. The
multiprocessor-type of structure, in particular, affords an
excellent way to configure the elements of a central computer
to realize these advantages.
2) Disadvantages of a central computer: In general applications, a central computer is made feasible only by the
incorporation of digital logic elsewhere in the system, e.g.,
peripherals, modems. This is so primarily for two reasons:
one is the inefficiency of dedicated interfaces for far-flung
communications; the other is the awkwardness of incorporating all sequencing and coding functions in a central
computer. In this case, if the central computer is a fault-tolerant multiprocessor, the incorporation of a large dedicated set of interfaces would be so expensive to accomplish
in redundant and infallible fashion as to mask the fundamental economy of the multiprocessor.
3) Advantages of dedicated computers: If one were to adopt
a free interpretation of what a computer is, the remote logic
used in central computer systems might be called dedicated
computers. Indeed, some of the peripheral controllers made
today are so complex that they will soon be realized using
small computers. In this light, therefore, one could argue
that dedicated computers exist whether or not they are
wanted. The advantage of using them to the exclusion of
central computers is that they can be integrated into subsystems early in the design and production cycles and can
be programmed without possible interference from other
programs in other computers except via the interfaces. If
desired, they can also be tailored to their specific applications.
4) Disadvantages of dedicated computers: These are more
or less opposite to the advantages of a central computer with
respect to resource pooling. Another disadvantage is the
amount of intercomputer communication that has to take
place in a complex system.
In most applications there is almost necessarily some combination of central and dedicated elements, even if one of the
dedicated computers serves a coordinating role. Therefore,
the first decision is not so much on a central versus a dedicated system, but rather how a distributed system should be
partitioned between central and dedicated elements.
An important characteristic of this concept is that it is
not restricted to the synthesis of a fault-tolerant computer.
Rather, the entire information processing system is considered. The problem of handling hundreds of real-time inputs and outputs in a single computer is difficult and all of
the elements of the distributed system must be designed in
coordination.
The requirement for fault tolerance in a central computer
can be met by a carefully constructed multiprocessor that
can also possess some of the other desirable system attributes
named earlier. The principal limitations of this design technique lie in its input-output bandwidth and reaction speed.
Fig. 1. A distributed fault-tolerant digital system.
To interface a multiprocessor with a set of subsystems requires some form of local digital processing with dedicated
interfaces to subsystem electronics. Here, a different approach is needed to implement the required degree of fault
tolerance, which will depend on the nature of the subsystem
itself and on what technique is used to prevent its faults from
adversely affecting the mission.
To be somewhat forward looking, it is not unreasonable
to assume that each significant subsystem will contain one
or more digital computers acting as sequencers, format converters, test elements, and data filters. As such, they constitute an excellent functional complement to the central multiprocessor. They have the high input-output bandwidth and
reaction speed needed to communicate with subsystem electronics and the capability to predigest these data and communicate with the central multiprocessor at substantially
lower bandwidth and with less reaction speed required.
Communication between central and dedicated (local)
computers is via a multiplexed bus. For most applications,
a bandwidth capability on the order of a million pulses per
second is sufficient. The bus is redundant for the sake of fault
tolerance and is interconnected in such a way that no fault
can render all of the buses useless as would happen, for
example, if a defective processor transmitted "garbage"
on all copies of the bus.
Fig. 1 shows the hierarchy of the system. The multiprocessor, the local processor, the bus, and the multiplexers that
protect the bus are described in the following sections.
Local processors resemble the processors in the central
computer. Either a voting arrangement of three or more local
processors, a switched group of paired local processors, or
independent processors in redundant subsystems may be
used. The tradeoff between the schemes is basically the cost
of voting circuits in the voting arrangement, the cost of extra
processors and the switching and recovery mechanism in the
switched group arrangement, and the expendability of the
independent subsystems. All arrangements may be used within a single system.
The central computer is capable of communicating with
similar central computers in other areas of a vehicle, other
stages, ground test equipment, etc. The communication
medium is a redundant serial time-shared bus identical to
that used for communication between a central computer
and the local processors. The bandwidth requirements in
this case are somewhat lower, however.
The system concept described here is intended to serve all
of the information processing needs of a complex spacecraft. All system information is made available in digital
form where it can be accessed, if desired, by the central
computer for the implementation of any desired functional
relationship between sensors and actuators. Standardization
of interface hardware and formats is embodied in this
system to further this purpose.
A FAULT-TOLERANT MULTIPROCESSOR
The multiprocessor concept described in this section has
as its goal the ability to reconfigure without loss of data,
to be flexible and expandable, and to be easy to program. To
do this requires error detection, infallible data recovery,
redundant processors, memories and data paths, and some
means of avoiding memory conflicts. Error detection, moreover, must be accomplished soon enough to enable correction to be made before errors spread.
Multiprocessors have the advantage of allowing the use
of graceful degradation as a reconfiguration mechanism. A
single multiprogrammed computer would require total
replication for the same kind of reconfigurability that a
multiprocessor can obtain by replication of a fraction of its
total, i.e., one processing unit and one memory unit. The
limitation on this advantage is that the multiprocessor requires the partitioning of the overall program into segments
that can be accomplished on a single processor. Each processor must have adequate resources to be capable of running
every program segment, so that in practice the processors
cannot be made arbitrarily small or slow.
For the space vehicle applications involved, a typical number of separate processors in a multiprocessor is expected to
be between six and nine. This assumes that the failure of three
processors would be fully tolerated and that from three to
six processors would be required to execute the nominal
central computer functions.
Many contemporary multiprocessors are designed as
compromises between availability and throughput. To some
extent, these aims are incompatible because many of the
hardware configurations that support high throughput are
difficult to make fault tolerant. In this concept, the emphasis
is on fault tolerance first and only secondarily on throughput.
To meet the goals for this multiprocessor, there must be
comprehensive error detection in all elements. Each processor must be able to detect sufficiently quickly the occurrence of an error and retire gracefully. Each memory unit
must be paralleled by at least one other and be able to detect
its own error and retire or to acquire a new parallel unit when
necessary. There must be, moreover, a means of error correction and/or recovery, and a substantial benefit accrues if these procedures are transparent to, i.e., independent of, applications programs.

Fig. 2. Block diagram of a collaborative fault-tolerant multiprocessor.
An important design consideration is the origin and nature
of faults that may occur. At worst, faults may occur in a
correlated way, such as by battle damage in combat aircraft,
where multiple faults are induced virtually simultaneously.
In the more benign environment of spacecraft, however,
fault occurrences are more nearly independent of one another, provided there are no systematic failure modes built
into the equipment. If one can assume randomness of faults,
then the concept of redundancy is valid and advantage can
be taken of the high probability that faults will not occur
close together in time. For example, if reconfiguration and
recovery from one fault can be effected in a hundredth of a
second, then two faults in a week-long mission have an
a priori probability of only about 10^-8 of occurring too close
together to permit valid recovery.
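The 10^-8 figure can be checked with a one-line model: given that two faults occur, each uniformly distributed over the mission, the chance that the second lands within the recovery window of the first is about 2*tau/T. This back-of-envelope sketch is an editorial illustration, not from the note:

```python
tau = 0.01                   # recovery time after a fault, seconds
T = 7 * 24 * 3600            # week-long mission, seconds
p_too_close = 2 * tau / T    # approximation, valid for tau << T
print(f"{p_too_close:.1e}")  # 3.3e-08, i.e., on the order of 10^-8
```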
The design of this multiprocessor, then, is an exercise involving the design of error-detecting processors, an error-detecting and correcting memory system, and the additional
mechanisms and procedures to guarantee smooth recovery.
Fig. 2 illustrates the conceptual structure of the multiprocessor, showing the interconnection of several independent processing units and a number of separate memory
units combined into a single logical complex. All of the
interconnections are redundant buses. The connection of all
of these buses to the processing units via the triplicated
scratchpad memories accomplishes information continuity
in the event of faults up to the total number tolerated, provided they are sufficiently separated in time to allow recovery after each fault. The following sections describe the
various parts of the multiprocessor.
The Processing Unit
The processing unit incorporates duplicated processors
with comparison for error detection and triplicated scratchpad memories with voting for information continuity and
recovery. The recovery mechanism is called single instruction
restart, which divides every instruction into two phases such
that none of the information used as input by a phase is
destroyed within that phase.

Fig. 3. One processing unit of a collaborative fault-tolerant multiprocessor.

Generally, there is a compute
phase and a store phase, both of which leave their results in
the scratchpad memory, where they are recoverable for
rerun in another processing unit if an error is detected in one
of the processors or scratchpad memories. It is necessary
in this scheme that processor errors be detected within the
phase in which they occur, which is accomplished by the
duplicated and triplicated elements of the processing unit.
When a processor or scratchpad memory error is detected,
the scratchpad memories autonomously transfer their entire contents to a reserved area in the main memory. The
next processing unit that enters the executive program subsequently acquires these data and commences processing at
the instruction and phase where the faulty processing unit
committed its error. This automatic recovery mechanism,
made possible by triplicated scratchpad memories, relieves
the programmer of responsibility for ad hoc programming
of rerun points.
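A rough sketch of this recovery path follows. The note defines the behavior only; every name here (the reserved area, the method names) is illustrative rather than taken from the design:

```python
RESERVED_AREA = "recovery_block"  # hypothetical name for the reserved region

def on_error_detected(scratchpad, main_memory):
    # Autonomous hardware action: dump the voted scratchpad image so that
    # no software running in the failing unit is needed for recovery.
    image = scratchpad.read_all_voted()
    main_memory.write(RESERVED_AREA, image)

def executive_entry(unit, main_memory):
    # The next unit entering the executive picks up the image and resumes
    # at the recorded instruction and phase of the interrupted program.
    image = main_memory.take(RESERVED_AREA)
    if image is not None:
        unit.scratchpad.load(image)
        unit.resume(image["instruction"], image["phase"])
```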
No effort is made to salvage and reassign those portions
of the processing unit that are not faulty. To do so would
require more additional interconnections and circuitry than
is felt to be justifiable. If the fault is a transient one, the
processing unit will be reinstated when it successfully completes a self-check program. If the fault is permanent, the
processing unit is shut off after several failures to run the
self-check program.
The autonomous mechanism that allows the scratchpad
memories to store their contents in main memory has an
important function in helping to cope with one of the major
problems of multiprocessing, which is the problem of updating correlated pieces of data in such a way that other
processing units cannot access them part way through the
update. In this concept, correlated pieces of data are first
assembled in the scratchpad memory and then stored in their
respective destinations during a single transfer move from
scratchpad to main memory. During this move, no other
processing unit has access to the main memory and the
successful completion of the move is guaranteed in the face
of a single fault by the triplication of the scratchpad
memories.
Fig. 3 illustrates a processing unit comprising two processors, two dual comparators (that can use common
circuitry), three scratchpads, and voting units necessary for
fault isolation from quadruply redundant serial buses. This
unit is capable of: 1) operation as a collaborative processor;
2) detection of any error resulting from a single fault; 3)
completion of any data transfers in progress with main
memory, with the system or with other systems; 4) transfer of
all necessary information to main memory to effect resumption of processing and reconfiguration; and 5) entrance into
a self-check mode to attempt to qualify for returning to
active status. Successful recovery is dependent upon its
being completed before two of the scratchpads experience
faults. The projected duration is on the order of 10 ms. The
self-check mode must be exhaustive in its exposure of faults
to avoid a situation where a faulty processing unit goes back
on line, perhaps to suffer a second fault from which recovery
is impossible.
The multiprocessor can be configured to tolerate the failure of n processing units if n extra units are incorporated
over and above the number necessary to furnish sufficient
processing capability for the mission.
All communication with external units is via serial redundant buses, with the possible exception of the processor-memory bus, which may be made byte-serial or parallel for
the sake of high bandwidth. The bus system shown is two-fault tolerant, expandable to n faults by using n+2 buses and an appropriate configuration of voters.

Fig. 4. Memory organization with preassigned modules. (Voter: 1 out if any two or more 1's in.)

The Main Memory Unit

Error detection and correction in memory units are essentially different from error detection and correction in processors in that where logic is cheap, memory is not. More important, however, is that where a processor is time-shared, memory systems are not. That is, neglecting swapping in hierarchical storage systems, all words of memory must be retained and backed up actively at all times, unlike processors, where no active continuity is required. Therefore, if duplication is used for error detection, then a third memory is required for correction of the first error, and either a fourth memory or else a reassigned third is needed for correction of the second error, and so on.

Coded error-detection methods are far more viable in the case of memories than for processors for two reasons: they are more comprehensive and they occupy a far smaller fraction of memory than they do of processors. This is not to say that a trivial check like single parity is adequate; far more comprehensive checks are needed, especially if the memory has sophisticated internal sequencing such as test-and-set operations on flag bits, important in global-variable write protection, or page table management. But until a coded redundancy design is actually in hand and verified to be comprehensive for all single faults, the only acceptable assumption to make is that error detection will require two memory units as set out in the preceding paragraph.

The next item to be discussed is the method of interconnecting the memory units with processing units. Three such methods are distinguished: crossbar, redundant time-shared bus, and multiport units. With the crossbar, a processing unit makes a temporary (uninterruptible) hookup with a memory unit for the duration of a job whose instructions and data are located in that memory. The processing unit must wait if some other processing unit has access. With a time-shared bus, no particular affiliation pattern emerges, as each processing unit in turn has access to the entire set of memory units through a single bottleneck. With multiport memories, processing units can request access to any one memory unit at any time; this will be granted in order of priority among those seeking access to that unit. In general, instructions are read very few at a time to avoid the inefficiencies that would otherwise result from branching early in an instruction block. Data, on the other hand, may very well be accessed in blocks; for this reason it may be stored in units separate from instructions.

The crossbar and time-shared bus techniques easily lend themselves to allowing a processing unit to "hold on" to a memory unit to write correlated data, but only the bus technique allows a processing unit to have momentary access to all memory units at the same time to write correlated data into scattered locations. As mentioned in the preceding section, the bus technique is the one adopted in this concept.

Whichever memory-processor interconnection technique may be used, care must be taken in its design to ensure that no combination of n or fewer faults can bring down the system. As examples, two main-memory design approaches will be described, both of which use the time-shared serial bus shown in Figs. 2 and 3. The first approach, illustrated in Fig. 4, has n+2 redundant memory units for n-fault tolerance. Again, four buses are shown for the case of two-fault tolerance. Each memory unit comprises enough modules (m in number) to store all data, and may be a hierarchical memory system with mass memory. This is conceptually the simpler, more effective, but possibly more costly of the two approaches shown. Each unit has a control unit, similar to a processor, which formats information for transfer and executes access sequences such as read, write, and test-and-set. At least two of these four control units must survive for system survival.

Fig. 5. Memory organization using assignable modules. (Voter: 1 out if any two or more 1's in.)
The second approach is shown in Fig. 5. It requires fewer
memory modules than the first approach, but more control
modules, each of which is more complicated. In the second
approach, n-1 spare memory modules are available for
assignment to replace failed modules in two-out-of-three
voting groups. This requires the ability to load replacement
modules to match the survivors of the group, the ability to
identify a disagreeing module in a group, and the ability
to ensure that a module cannot falsely respond to a request
for access to data it does not contain. The total number of
memory modules used is 3m+n-1, as compared to m(n+2)
in the first approach. For example, to achieve three-fault
tolerance (n = 3) where the number of different words stored
is four times the capacity of one memory module (m = 4),
then the first approach uses 20 memory modules and 4 control units, where the second uses 14 memory modules and 14
control units.
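The two module-count formulas are easy to tabulate; this sketch simply evaluates them (the note gives only the n = 3, m = 4 case):

```python
def preassigned(m, n):   # first approach: n+2 complete memory units
    return m * (n + 2)

def assignable(m, n):    # second approach: three voted copies plus n-1 spares
    return 3 * m + n - 1

m, n = 4, 3
print(preassigned(m, n), assignable(m, n))  # 20 vs. 14 memory modules
```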
The choice between these approaches is not immediately
obvious, and, moreover, several other approaches may be
competitive with these two. This area is as yet far from adequately understood.
Input-Output and Executive
Access from the multiprocessor to the input-output redundant bus is made by way of the scratchpad memories in
order to achieve data continuity at the time of a processor
failure.
Operationally, a time-shared input-output channel requires a certain amount of data management. A promising
approach is the quasi-dedication of one of the processing
units of the multiprocessor to management of the input and
output through its scratchpad memory. In this scheme, the
processing unit that handles the input-output continues to
do so uninterrupted until it commits an error, at which
time the job is transferred to any free processing unit. This
is like having a distinct input-output processor, except that
it is taken from the processing unit pool, and needs no special
backup. The reason for quasi-dedication of this processing
unit rather than letting any free unit manage the input and
output on a sampled basis is to provide full use of the available bandwidth.
The input-output bus is managed by the central multiprocessor, and the remote units (dedicated processors) respond
on a speak-when-spoken-to basis. In this way there are no
program interrupts in the usual sense. Job requests are sent
to the central multiprocessor in response to polling inquiries,
issued at a rate of roughly one every few milliseconds. This
is approximately the rate at which jobs are dispatched, and
is a fair measure of the reaction time of the system. It would
be possible to implement conventional program interrupts,
but it would be awkward and unnecessary. In fact, with no
more program restrictions than are observed in conventional
machines, it is possible to time share a multiprocessor without interrupt as easily as, or more easily than, to time share
a multiprogrammed machine, provided that the number of
processing units is greater than the number of jobs that are
allowed unrestricted running time, such as the input-output
job.
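The interrupt-free discipline can be pictured as a simple polling loop in the quasi-dedicated input-output unit. This sketch, and all of its names (the poll period, bus methods, and executive-file call), are assumptions for illustration:

```python
import time

POLL_PERIOD = 0.005  # roughly one inquiry every few milliseconds, per the text

def io_processing_unit(stations, bus, executive_file):
    # The quasi-dedicated input-output unit polls each remote station in
    # turn; remote units never transmit unsolicited, so no interrupts occur.
    while True:
        for station in stations:
            bus.send(("POLL", station))
            reply = bus.receive(timeout=POLL_PERIOD)
            if reply is not None:
                executive_file.insert_job_request(reply)
            time.sleep(POLL_PERIOD)
```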
The executive program, which is responsible for allocating resources to jobs that are called by automatic or
manual inputs, is of a floating form. Any processing unit
may insert a job request in the executive input file during the
execution of a job segment. Each processing unit, upon
completing a job segment (which it must do within a specified
time or be dumped), first checks to see if any pathological
situation exists, in which case it takes care of it. If not, the
processing unit attempts to examine the executive file in
memory. If the file is busy, the processing unit waits. If not,
it updates the file and takes the highest priority job to do.
There is virtually nothing else to be done at high speed.
At low speed, another segment of the executive program
handles the allocation of memory resources. There is little
executive overhead, and a moderate amount of input-output
overhead.
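In outline, the floating executive amounts to the following per-unit routine, a sketch under assumed names; the note describes the flow, not code:

```python
def end_of_job_segment(unit, executive_file):
    # Run by any processing unit on completing a job segment; there is no
    # dedicated executive processor, hence a "floating" executive.
    if unit.pathological_situation_exists():
        unit.handle_pathological_situation()
        return None
    while not executive_file.try_acquire():  # file busy: another unit has it
        pass                                 # wait until it is free
    try:
        executive_file.merge_pending_requests()
        return executive_file.pop_highest_priority_job()
    finally:
        executive_file.release()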
LOCAL PROCESSORS IN A FAULT-TOLERANT SYSTEM [7]
The use of small dedicated processors in subsystems is a
new extension of an old idea. The extension comes about
because of the new, inexpensive, and small realizations of
processors that are being developed now. It is not unreasonable to expect to have simple processors with 10-μs add
times and 4K words of memory that occupy only about 10
in³ of volume, available in the time span with which this
note is concerned.
No one redundancy configuration of local processors
appears to be basically superior to another, because of the
wide variety of applications. In rocket propulsion, redundancy is sometimes achieved by using multiengine stages, where the life expectancy of the engine is less than that of a computer. It would appear to be sensible to use a single
processor pair per engine, which would shut the engine
down if an engine failure or computer error were detected.
In a power distribution system, on the other hand, the system
itself is redundant, and a single redundant computer complex seems appropriate for control of the whole system. This
may be done in either of two basic ways. One way is a set of
paired processors in which one is active (at bat), one is in
standby (on deck), one inactive (in the hole), and so forth.
Input-output is handled by the active processor, whose
interface circuits are enabled by power switching. The
second way is to use a set of single processors three at
a time in synchronism, with voter circuits on each output
signal. This uses fewer processors for a given number of
tolerable faults, but requires a more expensive input-output
complex.
For either type of redundant local processor configuration, redundant inputs from the subsystem(s) are brought
into all of the processors to be compared or voted upon.
Analog inputs in particular do not lend themselves to hardware comparison or voting, and, in general, inputs will not
coincide well enough in time to permit hardware voting.
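One conventional software remedy, offered here as an editorial illustration rather than as the note's prescription, is mid-value selection on the redundant analog readings:

```python
def mid_value_select(a: float, b: float, c: float) -> float:
    """Return the median of three redundant analog readings."""
    return sorted((a, b, c))[1]

# One channel reads absurdly high; the mid-value masks it.
print(mid_value_select(3.02, 2.98, 9.99))  # -> 3.02
```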
Local processors will use conventional approaches to
multijob real-time control ranging from programmed
scanning of inputs to multiprogramming. This is of little
concern to the fault-tolerant aspect of the system except as it
emphasizes the need for multiphased input circuitry to
avoid pulse hazards (slivering) that might force the redundant processors out of synchronism, and for restart
mechanisms which can resynchronize them if they do slip
out.
A FAULT-TOLERANT REDUNDANT INPUT-OUTPUT BUS
A parallel multiplexed bus can be made fault tolerant by
coded redundancy, provided the multiplexers for each bit
line are independent and capable of determining for themselves when to allow access. In this way a multiplexer failure
affects only one line, whose erroneous behavior is filtered
out by all of the error-correcting decoders. A serial multiplexed bus can be made fault tolerant by replication with
error detection by coded or comparative means. Voting,
however, best assures information continuity in the serial
case, using n+2 buses for n-fault tolerance. Parity may be
used to signal ultimate breakdown of the bus system, but it
is inadequate to detect all possible erroneous behavior in a
single line.
It may be expedient to route separate bus lines through
separate conduits to avoid the possibility of system breakdown through physical damage. Alternatively, all lines might
be run through a single well armored cable that may be
assumed to be indestructible.
Electrically, the bus presents an interesting challenge,
because there may be as many as 100 stations. One solution
is to arrange the stations in a daisy chain where information
is transmitted from one station to the next from which it is
transmitted to a third, and so on in order through all stations. The originating station alone refrains from transmitting the information arriving from its neighbor. When
this station's message ends, another station may open its tie
from its receiver and begin to transmit its information. A
fault in any station can make the whole loop faulty. A
second solution is to connect all transmitters and receivers
in parallel. Only one station transmits at a time, the other
transmitter connections being passive. An unpowered station
would not affect bus operation, so that faulty units could be
retired from the system by being shut down.
Each bus line interfaces with each central or local processor through a separate multiplexer. Each multiplexer
contains a transmitter, a receiver, and logic. Functionally,
the multiplexer needs to be able to recognize when it is eligible to transmit and when it is not. For local processor
multiplexers, this requires logic to ascertain that its name is
called by the multiprocessor, and to place a limit on the
length of the message that its local processor tries to send.
For central processor multiplexers, the quasi-dedicated
input-output processing unit is the only one which transmits, and it is enabled at the end of each message from a
local processor. This end-of-message is signified by a synchronizing pulse issued by the local processor's multiplexers.
Multiplexers have the additional function of turning power
on and off in local processors for purposes of mission sequencing and of reconfiguration following detection of
errors.
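Functionally, a local-processor multiplexer reduces to a small gatekeeper like the following sketch; the class name, method names, and word limit are hypothetical:

```python
MAX_MESSAGE_WORDS = 32  # illustrative message-length limit

class LocalMultiplexer:
    def __init__(self, name):
        self.name = name
        self.eligible = False
        self.words_sent = 0

    def on_poll(self, called_name):
        # Transmit only while the multiprocessor is calling this station.
        self.eligible = (called_name == self.name)
        self.words_sent = 0

    def gate(self, word, transmit):
        # Pass words from the local processor, but cut off runaway messages.
        if self.eligible and self.words_sent < MAX_MESSAGE_WORDS:
            transmit(word)
            self.words_sent += 1
```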
The multiplexer is a key element in the system. It has many
responsibilities and is used in large number. To make the
system feasible, it will be necessary to implement these
devices using customized integrated circuits. Because of the
irregular nature of the logic required, it will not be packaged
as efficiently as memories and arithmetic units, and it may
require from two to six chips for its implementation, depending upon a number of factors not yet known.
A serial implementation is chosen for the input-output
bus for spacecraft applications, assuming a requirement for
tolerance to two or three faults, and a bandwidth requirement of the order of a megabit per second. Voting is done in
each processor on identical and simultaneous information
received over identical bus lines. Of the possible voting
strategies, the hybrid two-out-of-three, with replacement,
has one interesting shortcoming. It fails to isolate multiplexer faults in different stations (i.e., processors and local
processors). A pure voting strategy would be indifferent to
which lines were involved in multiplexer faults in any
station; but the hybrid scheme requires at least two bus lines
and all of their associated multiplexers to be fault-free. The
choice of strategy should be deferred until specific failure
rates and reliability goals are given.
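The hybrid strategy can be sketched as follows: three bus lines are voted, and a disagreeing line is retired in favor of a spare. This is an illustrative model only; the line indices and the single-bit vote are editorial simplifications:

```python
def hybrid_vote(lines, active, spares):
    # Vote the three active lines; retire a disagreeing line and switch in
    # a spare, so the voted value stays available after successive faults.
    values = [lines[i] for i in active]
    majority = max(set(values), key=values.count)
    for i in list(active):
        if lines[i] != majority and spares:
            active.remove(i)
            active.append(spares.pop(0))
    return majority

lines = {0: 1, 1: 0, 2: 1, 3: 1}   # line 1 delivers a wrong bit this time
active, spares = [0, 1, 2], [3]
print(hybrid_vote(lines, active, spares), active)  # -> 1 [0, 2, 3]
```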
ADDITIONAL DESIGN CONSIDERATIONS
A number of design problems and features of the system
are worthy of brief mention here. The first of these is the
problem of clocking a highly synchronized redundant system
in a fault-tolerant manner. The approach taken has been to
design a single conceptually complex redundant clock
system that tolerates faults of a restricted type without
generating substandard clock pulses on more than one output line. A good deal is left to be desired here and the area is
still being explored.
A second consideration, mentioned earlier, is the ability
to test all parts of the system at any time to determine the
degree of degradation, if any. Though the information
needed is available in the individual voters and comparators,
provision has to be made to make it all available to the system without compromising the fault-isolating nature of these
devices. This is accomplished only at extra hardware cost,
and with some difficulty. A better means is still sought.
Third, the creation of a fault-tolerant computer is a
waste of effort unless its programs are error-free. Lacking
a concept for fault-tolerant programs, we rely on program
verification to yield error-free programs. Toward this end,
it is proposed to incorporate hardware facilities which will
be cooperative with external program verification facilities.
These include facilities to aid memory dumps, tracing, and
starting and stopping of the various units of the system.
These features might be incorporated in those models used
solely for ground simulation, or they might be incorporated
into all models.
Fourth, there are several types of difficulties that can be
caused by separate programs that reside in a common computer. In a ground time-sharing environment, it is possible
to maintain a certain degree of order by the use of privileged
instructions and lockout provisions. In a spaceborne environment, the solutions used on the ground are rather inadequate, because the suppression of one program to permit
another to run may be catastrophic. Again, program verification is required, but it is difficult to prove that a program
assembly is free of conflict. It is proposed here that main
memory accesses be explicit, rather than indexed or indirect,
and that data bases which are common to more than one job
be accessed with flag-bit, rather than lock-bit, protection.
The purpose is to allow the assembly program to have
visibility into possible conflicts, and to have it enforce the
conventions of data usage between different programs and
jobs before run time. The fault-detection and recovery
mechanisms will then be depended upon to prevent improper execution of a properly assembled program.
Finally, the advantages of the modularity of the system
should be exploited as much as possible. The physical structure of the various units should be planned to permit the
greatest possible flexibility in sizing the system. It should be
possible to add resources, such as additional memory, into
the system as late as possible in the development cycle of the
spacecraft.
CONCLUSION

This note has so far tried to convey in brief the important elements of a proposed spaceborne fault-tolerant information processing system that is felt to represent a propitious solution to a many-faceted problem. There are several notable features in this concept. One is a new assessment of the capabilities and limitations of integrated circuits. Circuits with irregular interconnection patterns are costly to design, difficult to qualify, and wasteful of interconnections. The only such circuits contemplated are for multiplexers, which seem unavoidable. Nearly all of the rest of the system circuitry consists of memories and arithmetic units.

Local control functions involve simple data processing and high information rates, whereas global control functions involve sophisticated calculation ("number crunching") and low information rates. Fault tolerance concepts are extended to all members of the system and the communication medium between them.

A central computer handles the global computing functions. It is configured as a collaborative multiprocessor, designed to incorporate an invulnerable floating executive and input-output control. Each processor is duplicated for error detection, and has a triplicated scratchpad memory for error detection and salvage of data critical to automatic error recovery and data continuity. In the event of a processor or scratchpad fault, the scratchpad memory complex sends its contents to the data memory, where it will be acquired by the next free processor, which will continue processing from the beginning of the interrupted instruction phase. No attempt is made to reconfigure the failed processor. If the fault was transient, as may be most likely, the processor will be reinstated after it successfully executes a self-check program. Otherwise, it is totally expendable.

Additional features, briefly stated, are as follows. The use of time-shared input-output buses saves much weight and cost over the type of harnessing used previously. The system is highly modular, and offers the capability of adding or deleting information processing resources. Those resources that do exist are allocated flexibly by the proposed floating executive. The detection and recovery mechanisms are transparent to the applications programmer, i.e., he does not need to do special programming to support detection or recovery.

The system is capable of being made tolerant to any desired number of faults by additional replication and the use of voters of corresponding complexity. It can, moreover, be configured in such a way that some parts may be fault tolerant to a greater degree than others. The assumption made is that two faults may not occur closer together than a specified time, which will probably be on the order of 10 ms.

The concept is within the state of the art using large-scale integrated and hybrid circuits, both bipolar and MOS. An effort is currently under way to reduce a number of the novel aspects of the system to practice.

ACKNOWLEDGMENT

The author would like to acknowledge the invaluable work of A. I. Green, R. J. Filene, and W. W. Weinstein; and the encouragement and support of R. R. Ragan and E. C. Hall, of the M.I.T. Charles Stark Draper Laboratory. He also wishes to thank C. W. Frasier and T. V. Chambers of NASA/MSC for their important early contributions.

REFERENCES

[1] W. H. Pierce, Failure-Tolerant Computer Design. New York: Academic, 1965, p. 1.
[2] P. R. Vance, L. G. Dooley, and C. E. Diss, "Operation of the SAGE duplex computers," in Proc. 1957 IRE Eastern Joint Comput. Conf., 1957, pp. 160-163.
[3] R. W. Downing, J. S. Nowak, and L. S. Tuomenoska, "No. 1
ESS maintenance plan," Bell Syst. Tech. J., vol. 43, no. 5, pt. 1, pp.
1961-2019, Sept. 1964.
[4] M. M. Dickinson, J. B. Jackson, and G. C. Randa, "Saturn V
launch vehicle digital computer and data adapter," in 1964 Fall
Joint Comput. Conf., AFIPS Conf. Proc., vol. 26. Washington,
D. C.: Spartan, 1964, pp. 501-516.
[5] G. Y. Wang, "System design of a multiprocessor organization,"
NASA Electron. Res. Cen., Cambridge, Mass., Internal Tech.
Memo. RC-T-079, 1969.
[6] A. Avizienis, G. C. Gilley, F. P. Mathur, D. A. Rennels, J. A. Rohr, and D. K. Rubin, "The STAR (self-testing and repairing) computer: An investigation of the theory and practice of fault-tolerant computer design," this issue, pp. 1312-1321.
[7] A. L. Hopkins et al., "A fault-tolerant information processing
system for advanced control, guidance, and navigation," M.I.T.
C. S. Draper Lab., Cambridge, Mass., Rep. R-659, May 1970.
[8] W. C. Carter, D. C. Jessep, W. G. Bouricius, A. B. Wadia, C. E.
McCarthy, and F. G. Milligan, "Design techniques for modular
architecture for reliable computer systems," IBM Res. Rep. RA12,
Mar. 1970.
[9] L. J. Koczela and G. J. Burnett, "Advanced space missions and
computer systems," IEEE Trans. Aerosp. Electron. Syst., vol.
AES-4, pp. 456-467, May 1968.
[10] A. I. Green et al., "STS data management system design," M.I.T.
C. S. Draper Lab., Cambridge, Mass., Rep. E-2529, June 1970.
[11] M. D. Anderson and V. J. Marek, "Evaluation of aerospace
computer architecture," presented at the AIAA Guidance, Control, and Flight Dynamics Conf., Pasadena, Calif., Aug. 12-14,
1968, Paper no. 68-836.
Analysis of Parallel Systems
THOMAS H. BREDT, MEMBER, IEEE
Abstract-A formal analysis procedure for hardware and software
computer systems is described. A system is described by a flow table
model. The concept of an output hazard is introduced to account for
effects of unbounded line delays. Necessary and sufficient conditions for
the absence of output hazards are given. A system that contains no
output hazards is said to operate correctly if the system state graph
that describes all system states and state transitions is free from forbidden states and forbidden state sequences. A flow table solution for
the two-component mutual exclusion problem is analyzed and shown to
be correct.
Index Terms-Determinacy, flow tables, hazards, interlocks, models,
mutual exclusion, operating system, program correctness, systems analysis.
INTRODUCTION
A major concern for computer designers is the development of formal procedures for the analysis of hardware and
software systems. A model has been developed in which fundamental-mode flow tables are used to describe the operation
of each system component [1], [2]. Procedures for synthesizing and analyzing sequential circuits are well known [13].
Analogous procedures have been developed for a class of
sequential programs [3]. Thus the flow table model provides
a common basis for the analysis of the interactions of program and circuit components in a computer system. This model is finite-state, that is, it can describe only programs and circuits that are equivalent to finite-state machines. When considering the interactions of program and circuit components, a finite-state model may not be overly restrictive. In this short note, a formal analysis procedure for systems described using the flow table model is presented. To illustrate the application of the procedure, a solution to the two-component mutual exclusion problem is analyzed.

Manuscript received March 1, 1971; revised June 16, 1971. This work was supported in part by the Joint Services Electronics Program (U. S. Army, U. S. Navy, and U. S. Air Force) under Contract N00014-67-A-0112-0044 and by the National Aeronautics and Space Administration under Grant 05-020-337.
The author is with the Digital Systems Laboratory, Stanford University, Stanford, Calif.
The Mutual Exclusion Problem
In this problem, the system has two or more components
that are operated concurrently and contain critical sections.
Such components must be controlled so that the following
two restrictions are always satisfied: 1) at most one component is in its critical section at any instant; and 2) if a component wants to enter its critical section, it is eventually
allowed to do so. This definition of the mutual exclusion
problem is the one used by Knuth [10]. Dijkstra [5], [6] has
studied a slightly different version of the problem in which
it is possible for a component to be blocked from its critical
section indefinitely. The components in the mutual exclusion
problem usually represent the process of executing a sequential program. In this short note, hardware processes are
included as well. The nature of the critical sections is not
important to the development of a solution. Typically, a
critical section contains an access to a common memory
location, file, or system table.
PARALLEL SYSTEMS
In the flow table model, a parallel system is defined as
follows.
Definition 1: A parallel system is a finite collection of N
components C_1, ..., C_N and M lines l_1, ..., l_M. Each component C_i has a set of distinct input variables called the component input set I_i = {x_i1, x_i2, ..., x_in}, 1 ≤ i_j ≤ M, j = 1, ..., n, and a set of distinct output variables called the component output set O_i = {X_i1, X_i2, ..., X_im}, 1 ≤ i_j ≤ M, j = 1, ..., m. Each line l_j = (X_j, x_j) connects a component output variable X_j with a component input variable x_j.
Lines carry binary level values and value changes propagate
from component output to input. Each output variable
must be connected by a line to exactly one input variable
and each input variable must be connected by a line to exactly one output variable. The operation of each component
is described by a completely specified flow table with a designated initial state. The initial value of each line is the value
specified for the output variable associated with the line.
The values of component input and output variables define
the component input state and output state, respectively. The
assumptions about physical delays in a parallel system are
the following: 1) the time for a value change to propagate
from a component output to a component input (line delay)
is finite and unbounded; and 2) within a component, delays
are finite and bounded.
The parallel system solution for the two-component mutual exclusion problem is shown in Fig. 1. The initial internal state for each component is internal state 1. Components