FAULT-TOLERANT COMMUNICATION IN EMBEDDED SUPERCOMPUTING
INTRODUCTION
Table 1. Fault tolerance requirements in embedded supercomputing applications.
Level | Detection | Isolation | Recovery and reconfiguration | Associated mechanisms
Thread | Watchdogs | Terminating a faulty thread; remapping threads | Reinitializing a thread; restoring saved state in application | Exception handling; checkpointing; atomic actions
Memory | Memory access checking | Freeing memory resources | Restoring memory | Stable memory
Node | I’m alive | Disconnecting a faulty node; reconfiguring process graph/communication topology | Reinitializing or quickly rebooting a node; integration of a recovered node into the system | -
System | Monitoring system parameters | - | Fast reboot; fast restart | -
Message | Communication time-out; message order checking | Discarding message | Retransmission; reordering | Reliable communication
Link | Checking communication channel/partner status before communication | Terminating a faulty communication channel | Reinitializing communication channel | -
FT FRAMEWORK ARCHITECTURE
The proposed FT framework, aimed at general embedded applications
running on parallel systems, operates at three layers. At the lowest layer, it consists of
error detection tools (D-tools) and error recovery tools (R-tools). These parameterizable
functions start dynamically during application execution. When a D-tool detects an error,
it uses a standardized interface to pass specific information to the next higher layer, the
detection-isolation-recovery network (DIRnet), a distributed control network. The DIRnet
starts the R-tools, which recover the application after an error occurs. These tools can
work either in combination with the higher layers or as standalone tools.
At the middle layer, the DIRnet coordinates the D-tools and R-tools. This
hierarchical network serves as a backbone for passing information among the
applications’ FT elements, and it enables distributed action.
Figure 1. The DIRnet architecture. (Diagram labels: central manager, normal agent, backup agent, application or FT thread, <I’m alive> thread, node, FT buffer, virtual link, shared memory.)
A central manager with a global view of the system coordinates the
DIRnet. The manager’s view includes each node’s status, the type and location of
the D-tools, the type and location of errors, and the status of R-tools being
executed. The manager can also connect to an operator module, thereby
establishing a bi-directional interface between the operator and the DIRnet to
perform manual recovery actions.
Several agents located in different nodes assist the DIRnet central
manager. These agents interact with D/R-tools in their field, take local recovery
actions, and, upon the DIRnet manager’s request, perform multiple coordinated
actions along the network. In this way, static schemes dictated by the user are
implemented. The agents start up and initialize the D-tools, which then warn their
respective agents when an error occurs. The agents forward this information to the
DIRnet manager. Agents are not interconnected but nevertheless communicate
through the DIRnet manager, which is responsible for their cohesion.
At the highest layer, the D/R-tools and the DIRnet are combined
into mechanisms that apply fault tolerance to processing or communication
modules, and a custom language specifies the user’s recovery strategy.
Figure 2. Fault tolerance framework architecture. (Diagram labels: application, R-tool, D-tool, DIRnet, adaptation layer, operating system.)
Figure 2 shows a view of the FT framework architecture. The
adaptation layer allows a generic definition of the FT library interface to both the
underlying operating system and the target hardware. The FT library has been
implemented on a Parsytec CC system, a distributed-memory multiple-instruction,
multiple-data (MIMD) supercomputer consisting of powerful processing
nodes based on 133-MHz PowerPC 604 microprocessors, dedicated high-speed
links, I/O modules, and routers.
The CC system’s main characteristic is the adoption of the thread-processing
model and the message-passing communication model.
Communicating threads exchange messages through a proprietary
message-passing operating system called EPX. EPX adopts the concept of “virtual links”
to build point-to-point connections between arbitrary threads among the
processors. It adopts the concept of “local links” to create similar connections
between threads running on the same node. The only noticeable difference
between virtual and local links is performance. Once a connection between any
two threads has been set up, the connected threads refer to it by means of a link and use
the link to send and receive messages along the same connection.
In addition to the EPX operating system, the FT library has been
implemented on top of the Tex and Win/NT platforms. This ensures portability to
different architectures where these platforms are available. The FT library uses a
subset of the Tex and Win/NT platform services that is common among
commercial real-time kernels or that is easily reimplemented using similar
features. This increases the overall portability of the FT framework.
FAULT-TOLERANT SYNCHRONOUS COMMUNICATION
In synchronous message passing, communication problems arise when
communication links or communicating threads are in an erroneous state (broken links,
threads in infinite loops, and so on). Because communication cannot be initiated or
completed, the communicating threads remain blocked. There are two ways to avoid these
situations:
The status of both communication link and communication partner is
explicitly tested before messages are passed.
Communication is established normally, but time-out mechanisms are
initiated to escape from problematic situations.
Naturally, both approaches can be used in combination.
Message delivery time-out:
This mechanism detects whether a message is delivered before a certain
deadline; it can be implemented as a simple acknowledgment protocol. The sender sends
a message and waits for the acknowledgment from the receiver. If the acknowledgment is
not received by the deadline, a time-out error is returned. From the receiver’s point of
view, the receiver waits for the message from the sender. If the message arrives within a
certain amount of time, the receiver sends an acknowledgment to the sender; otherwise it
returns a time-out error.
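The acknowledgment protocol above can be sketched as two small decision functions, one per side. This is only an illustration under stated assumptions: the names `exchange_status`, `sender_status`, and `receiver_status` are hypothetical, arrival times are passed in as plain numbers instead of being measured on a real link, and a negative arrival time stands for "never arrived".

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical outcome codes for one message exchange. */
typedef enum { EXCHANGE_OK, EXCHANGE_TIMEOUT } exchange_status;

/* Sender side: the acknowledgment must arrive no later than the deadline.
 * A negative ack_arrival_ms means the acknowledgment never arrived. */
exchange_status sender_status(long ack_arrival_ms, long deadline_ms)
{
    assert(deadline_ms >= 0);
    if (ack_arrival_ms < 0 || ack_arrival_ms > deadline_ms)
        return EXCHANGE_TIMEOUT;
    return EXCHANGE_OK;
}

/* Receiver side: the message must arrive within the receiver's waiting
 * period; only then is an acknowledgment sent back (*send_ack set to 1). */
exchange_status receiver_status(long msg_arrival_ms, long wait_limit_ms,
                                int *send_ack)
{
    assert(wait_limit_ms >= 0 && send_ack != NULL);
    *send_ack = (msg_arrival_ms >= 0 && msg_arrival_ms <= wait_limit_ms);
    return *send_ack ? EXCHANGE_OK : EXCHANGE_TIMEOUT;
}
```

Keeping the two sides separate mirrors the text: each partner applies its own deadline and returns its own time-out error.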
Figure 3. The Channel Control Thread cooperates with DIRnet. (Diagram labels: sender, receiver, CCT with timer, two DIR agents, virtual link; signals: Message, Ack, Ready.)
Channel control thread: Whenever two threads need to establish a
communication channel, the initializing thread orders the creation of a
special thread that will control the fault-tolerant communication. Using this
separate thread makes it possible to return to a safe state from a
blocked communication. The channel control thread (CCT), shown in
Figure 3, handles time-outs and triggers recovery actions in cooperation
with the DIRnet. The CCT is also responsible for handling isolation
actions and recovery actions. The CCT and its actions are transparent to
the application and are initiated only if a communication channel is
defined to be fault tolerant.
Dual CCTs: Message-passing environments may treat a
communication channel as one object or as a symmetric pair of
communication activities. For the latter case, the implementation
shown in Figure 4 is adopted, with one CCT for each communication
partner. It is an extension of the single-CCT implementation.
Figure 4. Fault-tolerant synchronous communication with dual CCTs. (Diagram labels: sender, receiver, two CCTs each with a timer, two DIR agents, node 1, node 2.)
With the addition of an extended protocol, an algorithm
similar to that used for the single CCT implements the
message delivery time-out in the dual-CCT scheme. Figure 5 shows
the protocol between an application thread and the responsible CCT,
as well as between a sender and a receiver CCT.
Whenever an application thread wants to communicate
with its partner, it sends a relevant control signal to its associated CCT.
At the same time, a time-out mechanism is initiated at that
CCT. This time-out value is related to the average time an application
thread waits for its partner to be available for communication. A Sync
control signal, sent by the sending side to the receiving side within the
time-out period, synchronizes the two CCTs. After synchronization,
the CCTs send a Ready signal to the application threads, so that they
are ready to exchange data through the CCTs. This avoids the problem
of blocking one thread because its partner is not responding. If the
partner is not ready to communicate within the time-out period, synchronization
between the CCTs fails, and an error message is sent to the DIR agent
associated with the application thread that requested the
communication. If the communication cannot be achieved even after
the DIRnet-initiated recovery phase, the CCT returns an error control
signal to the application thread that requested the communication.
Figure 5. Protocol used for synchronous communication on fault-tolerant links. (Diagram: signal sequence among sender, sender CCT, receiver CCT, and receiver: Send/Recv, Sync, Ready or time-out 1, Data, Ack, Sync, Ready or time-out 2.)
After the application threads exchange data, a second
synchronization occurs between the sender and receiver CCTs just as
before, but with a different time-out value. This time-out is related to
the maximum time needed for data transmission on the link. If this
synchronization fails, and if the DIRnet’s recovery attempt also fails,
the system sends an error control signal to both application threads to
inform them that the transmission failed. Figure 6 shows the
algorithms executed by the sender and by the sending CCT.
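(a)
send(CCT, SEND);                  /* ask the CCT to set up the exchange */
recv(CCT, ctrl_msg);              /* result of the first synchronization */
if (ctrl_msg == READY) {
    send(receiver, message);      /* transfer the application data */
    send(CCT, ACK);               /* tell the CCT that the data has been sent */
    recv(CCT, ctrl_msg);          /* result of the second synchronization */
    if (ctrl_msg == READY)
        return (OK);
    else
        return (ERROR….);
} else
    return (ERROR….);
(b)
/* "| time-out" means the call completes with ctrl_msg or with a time-out. */
while (1) {
    recv(sender, SEND);                           /* wait for a request from the sender */
    send(CCT_receiver, ctrl_msg | time-out);      /* first synchronization (time-out 1) */
    if (ctrl_msg == SYNC) {
        send(sender, READY);
        recv(sender, ACK);                        /* sender reports that the data was sent */
        recv(CCT_receiver, ctrl_msg | time-out);  /* second synchronization (time-out 2) */
        if (ctrl_msg == SYNC)
            send(sender, READY);
        else if (time-out)
            send(sender, TIMEOUT);
        else
            send(sender, ERROR….);
    } else if (time-out)
        send(sender, TIMEOUT);
    else
        send(sender, ERROR….);
}
Figure 6. In the dual-CCT case, the sender executes (a) and the sending CCT
executes (b).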
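The two-phase structure of the dual-CCT protocol (one time-out for partner availability, a second for the data transfer) can be sketched as a small simulation. This is not the EPX implementation. The function names are hypothetical, delays are passed in as numbers, and the DIRnet recovery is reduced to a single flag. Treating a successful recovery as permission to continue is an assumption.

```c
#include <assert.h>

typedef enum { CCT_READY, CCT_TIMEOUT } cct_signal;
typedef enum { COMM_OK, COMM_ERROR } comm_result;

/* One synchronization attempt between the two CCTs: it succeeds when the
 * partner CCT answers within the time-out. delay_ms < 0 models "no answer". */
static cct_signal cct_sync(long delay_ms, long timeout_ms)
{
    return (delay_ms >= 0 && delay_ms <= timeout_ms) ? CCT_READY : CCT_TIMEOUT;
}

/* Two-phase exchange as seen by the requesting application thread.
 * partner_delay_ms:  time until the partner is available (time-out 1)
 * transfer_delay_ms: time needed to move the data (time-out 2)
 * recovery_succeeds: stand-in for the outcome of a DIRnet-initiated recovery */
comm_result ft_exchange(long partner_delay_ms, long timeout1_ms,
                        long transfer_delay_ms, long timeout2_ms,
                        int recovery_succeeds)
{
    assert(timeout1_ms >= 0 && timeout2_ms >= 0);
    /* First synchronization: is the partner ready to communicate? */
    if (cct_sync(partner_delay_ms, timeout1_ms) != CCT_READY && !recovery_succeeds)
        return COMM_ERROR;
    /* Data and acknowledgment are exchanged at this point. */
    /* Second synchronization: did the transfer complete in time? */
    if (cct_sync(transfer_delay_ms, timeout2_ms) != CCT_READY && !recovery_succeeds)
        return COMM_ERROR;
    return COMM_OK;
}
```

The two time-outs are separate parameters because the text ties the first to the average partner wait time and the second to the maximum transmission time on the link.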
CCTs only for control messages: The two implementations just
described can be made faster if the actual messages are sent directly
from sender to receiver and not through the CCTs. When this is done,
the CCTs don’t need knowledge of the protocol used by the original
channels. Thus the CCTs become pure control instances of the
application’s sending and receiving actions and, as a result, have a
smaller load.
Other synchronous communication issues:
Before an application message is sent, the communication channel’s status
must be examined. Most message-passing environments expect a channel to be in one of
the two states: usable or stopped. In this implementation, communicating threads detect
the channel status by simply trying to use the channel.
An extension to the checking of communication channel status enables the
receiving threads to be checked as well. There are two possible implementations of this
extension:
as an extension to the select mechanism, which allows the definition of
sending options along with the time-out option; and
as an extra feature of the CCT implementation described in the
“Message delivery time-out” section. The receiver sends a Ready signal to the CCT just before
blocking for communication. The CCT can issue a ConditionalSelect for this signal to
check the receiver’s status.
The first implementation has the advantage of a more consistent mechanism.
The mechanism in the second implementation requires an interface to return information
about the receiver’s status. The first mechanism can be completely incorporated into the
kernel, the second into the FT library.
The current implementations of content checking in message-passing
environments are rather basic, but sufficient for most purposes. Introducing more
redundancy would decrease communication performance in a way that could be
unacceptable for some applications. An application’s developer can select the chosen
link’s reliability by using one or another transmission function.
Message-ordering mechanisms determine whether the communication
system between a particular sender and a particular receiver maintains the local message
order. Message-passing environments supporting synchronous communication ensure
correct message ordering.
Recovery tools:
When the message delivery time-out mechanism detects an error, the CCT
will try to initiate recovery actions, or it will report the error to the DIRnet and thus pass
total control of the situation to it. Recovery actions deal with both application and system
errors. More specifically,
for a Send-Send fault, one of the messages is stored in a local buffer
and communication switches temporarily to asynchronous mode.
for a Recv-Recv fault, dummy data is sent to one partner and
communication continues.
for a Send/Receive-Stop fault, the active partner continues without
sending/receiving the message, and the idle partner’s status is checked.
If the idle partner is found in a faulty situation, the developer-dictated
recovery scenario takes appropriate action.
for a Send/Receive-Stop fault, the active communicating partner tries
again to communicate, with no time-out.
for channel errors, the communication link is restarted.
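The fault-to-action mapping in this list can be written as a dispatch function. This is a minimal sketch: the enum and function names are hypothetical, the two alternatives listed for the Send/Receive-Stop fault are selected by a flag, and the fallback of reporting to the DIRnet for an unrecognized fault is an assumption based on the CCT behavior described above.

```c
#include <assert.h>

typedef enum {
    FAULT_SEND_SEND,        /* both partners try to send */
    FAULT_RECV_RECV,        /* both partners try to receive */
    FAULT_PARTNER_STOPPED,  /* Send/Receive-Stop: one partner is idle */
    FAULT_CHANNEL           /* the link itself is in error */
} comm_fault;

typedef enum {
    ACT_BUFFER_AND_GO_ASYNC,     /* store one message, switch to async mode */
    ACT_SEND_DUMMY_DATA,         /* send dummy data to one partner */
    ACT_SKIP_AND_CHECK_PARTNER,  /* continue, then check the idle partner */
    ACT_RETRY_WITHOUT_TIMEOUT,   /* try again with no time-out */
    ACT_RESTART_LINK,            /* restart the communication link */
    ACT_REPORT_TO_DIRNET         /* assumed fallback: hand control to DIRnet */
} recovery_action;

/* retry_instead selects the second Send/Receive-Stop alternative. */
recovery_action select_recovery(comm_fault fault, int retry_instead)
{
    assert(retry_instead == 0 || retry_instead == 1);
    switch (fault) {
    case FAULT_SEND_SEND:
        return ACT_BUFFER_AND_GO_ASYNC;
    case FAULT_RECV_RECV:
        return ACT_SEND_DUMMY_DATA;
    case FAULT_PARTNER_STOPPED:
        return retry_instead ? ACT_RETRY_WITHOUT_TIMEOUT
                             : ACT_SKIP_AND_CHECK_PARTNER;
    case FAULT_CHANNEL:
        return ACT_RESTART_LINK;
    }
    return ACT_REPORT_TO_DIRNET;
}
```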
FAULT-TOLERANT ASYNCHRONOUS COMMUNICATION
Asynchronous communication is based on the mailbox concept, whereby
the sending thread no longer blocks after sending its message. The message is stored in a
buffer or mailbox; when the receiving thread is available, it retrieves the message from the
mailbox. This way the sender is free to continue its tasks after sending its message.
Message delivery time-out:
Ensuring mail delivery to recipients is also of fundamental importance for
asynchronous communication. A fault is detected when a message cannot be delivered to
the receiver’s mailbox within a time-out period or when the receiver does not retrieve the
message from the mailbox within a time-out period. After the message delivery time-out
detects a fault, it triggers the recovery phase. If recovery fails, the sender thread is
interrupted.
The sender should be able to get information about the sent mail’s status
and be interrupted when a time-out occurs. For this reason we propose two modes for
the sender. In the first mode (suspending mode), the sender waits until either it gets an
acknowledgment and continues its job, or a time-out occurs and the sender is
interrupted. In the second mode, the sender is interrupted only if a time-out has occurred;
otherwise no signal is sent back to the sender. However, the sender can, at any time, get
information about the mail status or enter the suspending mode.
Figure 7. Fault-tolerant asynchronous communication with monitoring task. (Diagram labels: receiver, mail, monitoring task.)
In the case of mailboxes, the message delivery time-out mechanism is
implemented as follows: Upon sending its message, the sender invokes a task or creates a
thread; the job of this task or thread is to report the status of the sender’s message to the
DIRnet and/or the sender. As shown in Figure 7, when the receiver retrieves the mail
from the mailbox, it lets the monitoring task know whether the mail was correctly
received. The monitoring task waits for such a message from the receiver or for a time-out.
If the message arrives before the time-out, and the sender is in suspending mode, the
monitoring task sends an appropriate message to the sender. If the time-out occurs, the
monitoring task concludes that something is wrong with the receiver and informs the
DIRnet, which can then take recovery actions. In the case of a stand-alone mechanism, the
recovery phase is built in and automatically initiated. The success or failure of these
actions is then propagated via the monitoring task to the sender. This mechanism
corresponds to the single CCT for control messages that was
devised for the synchronous communication case.
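The monitoring task's decision logic can be sketched as a pure function. The type and function names are hypothetical, the receiver's confirmation time is supplied as a number (negative meaning "never confirmed"), and the stand-alone built-in recovery path is left out.

```c
#include <assert.h>

typedef enum { MODE_SUSPENDING, MODE_NON_SUSPENDING } sender_mode;

typedef struct {
    int notify_sender;   /* send an acknowledgment to a suspended sender */
    int inform_dirnet;   /* report the fault so that recovery can start */
} monitor_decision;

/* confirm_ms: time at which the receiver confirmed retrieval (< 0: never).
 * timeout_ms: delivery deadline watched by the monitoring task. */
monitor_decision monitor_mail(long confirm_ms, long timeout_ms, sender_mode mode)
{
    monitor_decision d = { 0, 0 };
    assert(timeout_ms >= 0);
    if (confirm_ms >= 0 && confirm_ms <= timeout_ms) {
        /* Delivered in time: only a suspended sender is waiting for news. */
        d.notify_sender = (mode == MODE_SUSPENDING);
    } else {
        /* Time-out: something is wrong with the receiver. */
        d.inform_dirnet = 1;
    }
    return d;
}
```

In the non-suspending mode a timely delivery produces no signal at all, which matches the second sender mode described earlier.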
Other asynchronous communication issues:
A thread can be specified for the sole purpose of monitoring whether or
not the mailbox is empty. Errors are reported to the monitoring tasks that try to send mail
to a faulty (full) mailbox. Monitoring tasks can then trigger actions by issuing a
ConditionalSelect for the error signal.
Since asynchronous communication operates by means of mailboxes, this
FIFO implementation guarantees that messages are transferred in the order in which they
are queued. Therefore, no further check is necessary.
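A fixed-size ring buffer is one common way to obtain the FIFO behavior described here, along with a detectable "mailbox full" condition. The sketch below is illustrative only. The capacity, the integer message type, and all names are assumptions, not the EPX mailbox interface.

```c
#include <assert.h>
#include <stddef.h>

#define MAILBOX_SLOTS 4   /* illustrative capacity */

typedef struct {
    int slots[MAILBOX_SLOTS];
    int head;    /* index of the oldest message */
    int count;   /* number of queued messages */
} mailbox;

void mailbox_init(mailbox *mb)
{
    assert(mb != NULL);
    mb->head = 0;
    mb->count = 0;
}

/* Returns 0 on success, -1 when the mailbox is full (the fault case). */
int mailbox_post(mailbox *mb, int msg)
{
    if (mb->count == MAILBOX_SLOTS)
        return -1;
    mb->slots[(mb->head + mb->count) % MAILBOX_SLOTS] = msg;
    mb->count++;
    return 0;
}

/* Returns 0 and stores the oldest message in *msg, or -1 when empty. */
int mailbox_fetch(mailbox *mb, int *msg)
{
    if (mb->count == 0)
        return -1;
    *msg = mb->slots[mb->head];
    mb->head = (mb->head + 1) % MAILBOX_SLOTS;
    mb->count--;
    return 0;
}
```

Because messages leave in the order they were queued, no separate ordering check is needed. The -1 return from `mailbox_post` is the kind of error a monitoring task could react to.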
Recovery tools:
These tools try to recover the application from a faulty state by
executing specific actions, such as
Leaving the mail in the mailbox (no recovery),
Deleting the mail when the mailbox is full,
Resetting the receiver via an interrupt signal,
Resetting both the sender and receiver, or
Performing an application-dependent recovery action.
In the actions that involve resetting, the entire mailbox is cleared. The user’s recovery
strategy, as specified in the custom language, determines which action is executed.
PERFORMANCE STUDY
Table 2 shows the time overhead for the synchronous communication FT
library. We obtained the measurements by running a generic application on top of the
EPX operating system with and without the communication FT library. Specifically, the
table lists the time taken to create a communication link, transfer a message through a
link, and break a link. The
create and break times are of minor importance, since these functions are performed only
once. The transfer time becomes more important as the application becomes more
communication intensive. For example, in applications in which communications
consume much more time than CPU calculations and, additionally, communications are
perfectly synchronized, using the library roughly quadruples the total runtime, because
the transfer time rises from 0.5 ms to 2.0 ms.
Applications in which communication time is comparable to CPU time are ones that can
benefit from the FT library.
Table 2. Time overhead of the synchronous communication fault tolerance
library, in milliseconds.
 | Create | Transfer | Break
Fault-tolerant communication | 0.5 | 2.0 | 2.0
EPX operating system | 0.5 | 0.5 | 0.0
Communication time includes the wait time caused by imperfect
synchronization. Except for special cases, this wait time is greater than transfer time. For
example, in the applications for which this library was designed, the average wait time is on
the order of 100 ms. For such applications, the total overhead is comparatively small.
We use $T_p$ to denote the parallel application’s processing time, $T_w$ the
total wait time for synchronous communication owing to imperfect synchronization, and
$T_x$ the communication transfer time. Communication time is then the sum of $T_w$ and $T_x$.
The total application runtime when no fault occurs is
$$T_t = T_p + T_w + T_x. \tag{1}$$
We assume that faults occur at random times at a rate of one fault within
time window $T_f$. Each fault causes the system to reboot, taking time $T_b$. We assume that
after rebooting, the system returns to the last point reached before the failure and does not go
back to reexecute already completed jobs. The expected total application runtime when
faults occur is
$$T_{tB} = T_t + (T_b / T_f)\,T_t. \tag{2}$$
The library incurs an overhead resulting from the detection tools’ protocol.
This overhead equals $h_x T_x$ and is proportional to the communication transfer time $T_x$. Each
fault activates a recovery tool that takes time $T_r$. We assume that the library detects all
faults. The expected total application runtime is
$$T_{tL} = T_t + h_x T_x + (T_r / T_f)\,T_t. \tag{3}$$
We have assumed that incorporating the library does not alter the processing and wait
times. This assumption is valid in applications in which $T_x / T_t$ is small, since in this case
the extra time resulting from the library’s detection and recovery tools does not
significantly alter the profile of the processes’ states in relation to time. Hence synchronization
between processes is not significantly affected.
For common practical applications, the relation $T_r \ll T_b \ll T_f$ is valid.
The first inequality is clearly valid because $T_r$ is about a millisecond (the time taken to run a
recovery routine), regardless of whether $T_b$ is on the order of a second or even on the
order of a minute (if manual reset is necessary). The second inequality is valid in
practical embedded applications, assuming that reasonable debugging has reduced faults
to rare occurrences.
The total overhead caused by rebooting the system when the fault-tolerant library is not
used is
$$h_{tB} = (T_{tB} - T_t) / T_t = T_b / T_f. \tag{4}$$
The total overhead resulting from the library’s detection and recovery tools is
$$h_{tL} = (T_{tL} - T_t) / T_t = (T_x / T_t)\,h_x + T_r / T_f. \tag{5}$$
For our measurement under EPX in the present implementation, $h_x = 3$. As an example, if
the communication and processing times are about the same, and if $T_x / T_w$ is about 1%,
then the overhead when no recovery occurs is about 1.5%.
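The 1.5% figure can be checked by hand. The worked arithmetic below is a sketch that reads "about the same" as $T_p = T_w + T_x$ and "about 1%" as $T_x = 0.01\,T_w$ (both readings are assumptions). It uses the library overhead expression with the recovery term $T_r / T_f$ dropped, since no recovery occurs.

```latex
\begin{align*}
T_p &= T_w + T_x, \qquad T_x = 0.01\,T_w \\
T_t &= T_p + T_w + T_x = 2\,(T_w + T_x) = 2.02\,T_w \\
\frac{T_x}{T_t} &= \frac{0.01\,T_w}{2.02\,T_w} \approx 0.00495 \\
h_x\,\frac{T_x}{T_t} &\approx 3 \times 0.00495 \approx 0.0149 \approx 1.5\%
\end{align*}
```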
Use of the FT library is recommended when
$$h_{tL} < h_{tB}. \tag{6}$$
From Equations 4, 5, and 6 it follows that
$$(T_x / T_t)\,h_x + T_r / T_f < T_b / T_f. \tag{7}$$
Solving Equation 7 for $T_x / T_t$, we obtain
$$T_x / T_t < (T_b - T_r) / (h_x T_f). \tag{8}$$
Equation 8 shows that the library cannot offer a competitive advantage when the
ratio of communication transfer time to total application time is greater than the upper
bound designated by the right-hand side of Equation 8.
As an example, let $T_r$ be much smaller than $T_b$ and $T_b / T_f = 3\%$. With $h_x = 3$, Equation 8
yields $T_x / T_t < 1\%$. Using Equation 1, from the last inequality it follows that
$T_p / T_x + T_w / T_x > 99$. This means that if the processing time is not large enough compared with
the communication transfer time, then the ratio of the total wait time to communication
transfer time must be large enough for the library to be effective when in use.
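The break-even condition can be evaluated numerically. The sketch below uses hypothetical function names and assumes all times are given in the same unit. It computes the upper bound on the transfer-time ratio and compares the library overhead with the reboot overhead.

```c
#include <assert.h>

/* Upper bound on Tx/Tt: (Tb - Tr) / (hx * Tf). */
double max_transfer_ratio(double hx, double tb, double tr, double tf)
{
    assert(hx > 0.0 && tf > 0.0);
    return (tb - tr) / (hx * tf);
}

/* Non-zero when the library overhead, (Tx/Tt)*hx + Tr/Tf, is below the
 * reboot overhead, Tb/Tf. */
int library_recommended(double tx, double tt, double hx,
                        double tb, double tr, double tf)
{
    assert(tt > 0.0 && tf > 0.0);
    double h_library = (tx / tt) * hx + tr / tf;
    double h_reboot  = tb / tf;
    return h_library < h_reboot;
}
```

With `hx = 3`, `tb / tf = 0.03`, and a negligible `tr`, `max_transfer_ratio` gives a bound of about 0.01, which matches the 1% figure in the example above.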
In the EFTOS project framework, FT mechanisms are also used to detect and
recover from other fault categories, apart from communication faults that might occur
while the application is running. For instance, the FT library includes detection and
recovery tools that deal with processing faults, memory failures, I/O errors, and so on.
Assume $k$ categories of FT tools, each with an overhead equal to $h_j T_j$ resulting from the
protocol of its detection tools. The expected total runtime is augmented either by
$h_j T_j + (T_{rj} / T_{fj})\,T_t$ for every category $j$ of tools that is employed, or by
$(T_b / T_{fj})\,T_t$ when system
reboot performs better than the corresponding tool set for the specific application at hand.
The expected total application runtime thus becomes
$$T_{tL} = T_t + \sum_{j=1}^{k} \left\{ u_j \left( h_j T_j + (T_{rj} / T_{fj})\,T_t \right) + (1 - u_j)\,(T_b / T_{fj})\,T_t \right\}. \tag{9}$$
This assumes that faults of category $j$ occur at random times at a rate of one fault within
time window $T_{fj}$ and that each fault activates a recovery tool that takes time $T_{rj}$.
The set $\{u_1, \ldots, u_k\}$ consists of binary parameters that make it possible to use only those
FT mechanisms whose corresponding overhead is less than the overhead caused by
system reboot:
$$u_j = \begin{cases} 1, & h_j T_j + (T_{rj} / T_{fj})\,T_t < (T_b / T_{fj})\,T_t \\ 0, & h_j T_j + (T_{rj} / T_{fj})\,T_t > (T_b / T_{fj})\,T_t \end{cases} \tag{10}$$
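The per-category choice between FT tools and system reboot can be sketched as follows. The struct layout and function names are hypothetical, all times are assumed to share one unit, and the case of exactly equal costs (which the selection rule leaves open) is treated here as "reboot".

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    double h;    /* detection overhead factor for this category */
    double t;    /* time that the detection overhead is proportional to */
    double tr;   /* recovery time per fault */
    double tf;   /* time window containing one fault on average */
} ft_category;

/* Selection rule: use the tool set only if it costs less than rebooting. */
int use_tools(const ft_category *c, double tt, double tb)
{
    double with_tools  = c->h * c->t + (c->tr / c->tf) * tt;
    double with_reboot = (tb / c->tf) * tt;
    return with_tools < with_reboot;
}

/* Expected runtime over k categories, choosing the cheaper option for each. */
double expected_runtime(const ft_category *cats, size_t k, double tt, double tb)
{
    double total = tt;
    assert(cats != NULL || k == 0);
    for (size_t j = 0; j < k; j++) {
        const ft_category *c = &cats[j];
        if (use_tools(c, tt, tb))
            total += c->h * c->t + (c->tr / c->tf) * tt;
        else
            total += (tb / c->tf) * tt;
    }
    return total;
}
```

For instance, with a fault-free runtime of 100 and a reboot time of 3, a category with `h = 3`, `t = 0.5` keeps its tools (cost about 1.5 against 3.0), while one with `t = 2.0` falls back to reboot (cost about 6.0 against 3.0).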
For development under EPX, the FT library adds approximately 50 Kbytes of overhead to
the application code. For a typical EPX application of about 1MB, the overhead is 5%.
CONCLUSION
The communication FT framework we have described has been
successfully integrated into two real-time embedded high-performance computing
applications: an image-processing module in an automatic mail-processing
system developed by Siemens ElektroCom and a remotely controlled automation system
for electric high-voltage substations operated by ENEL (the Italian electricity provider).
Both systems proved more dependable when faults occurred, and overall system
performance improved. System downtime decreased significantly, and the mean time
between system reboots increased.
We also plan to port the FT framework to additional
platforms and operating systems, providing and integrating standard mechanisms for
node-to-node interoperability. Furthermore, researchers will consider FT middleware
implementation using emerging standards, technologies, and industrial initiatives (such as
CORBA) to guarantee the required level of dependability in object-oriented open
distributed systems.