Reliability
Threads for Fault Tolerance
Multiprocessors:
Transient fault detection
Transient Faults
Faults that persist for a “short” duration
Cause: cosmic rays, energetic particles originating from outer space
Effect: knock off electrons, discharge capacitors
Solution
No practical absorbent (shielding) for cosmic rays
Estimated fault rate: 1 fault per 1000 computers per year
Future is worse
smaller feature size, higher transistor count, reduced noise margin
Background
Fault-tolerant systems use redundancy to improve reliability:
Time redundancy: separate executions
Space redundancy: separate physical copies of resources
DMR/TMR
Data redundancy
ECC: Automatic repeat request (ARQ), Forward error correction (FEC)
Parity: odd/even
Examples:
IBM: duplicated pipelines, spare processors, ECC in memories...
HP: DMR/TMR processors, Parity/ECC in buses, memories...
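
The space and data redundancy ideas above can be illustrated with a minimal sketch: a TMR majority voter and an even-parity check. The function names (tmr_vote, even_parity_bit, parity_ok) and the sample word are hypothetical and chosen only for illustration; real systems implement these in hardware.

```python
from collections import Counter

def tmr_vote(a, b, c):
    """Return the value produced by at least two of three redundant copies."""
    value, count = Counter([a, b, c]).most_common(1)[0]
    if count < 2:
        raise RuntimeError("no majority: all three copies disagree")
    return value

def even_parity_bit(word: int) -> int:
    """Parity bit that makes the total number of 1 bits even."""
    return bin(word).count("1") % 2

def parity_ok(word: int, parity: int) -> bool:
    """True if the stored parity bit still matches the word."""
    return even_parity_bit(word) == parity

correct = 0b1011_0010
flipped = correct ^ (1 << 3)          # model a single-bit transient fault

voted = tmr_vote(correct, flipped, correct)   # TMR masks the faulty copy
stored_parity = even_parity_bit(correct)
detected = not parity_ok(flipped, stored_parity)  # parity detects the flip
```

Note the difference in capability: the TMR voter masks the fault and still returns the correct value, while parity only detects a single-bit flip and cannot correct it.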
Multiprocessors: Fault Detection
Chip-level Redundantly Threaded processor
Replicates register values but not memory values
The leading thread commits stores only after they have been checked
Memory is guaranteed to be correct
Other instructions commit without checking
The leading thread sends committed values to the trailing thread for:
branch outcomes
load/store values
store addresses
Sphere of Replication (SoR)
Logical boundary of redundant execution within a system
Components inside are protected via redundant execution
Components outside must be protected via other means
Its size matters:
Error detection latency
Stored-state size
Example Spheres of Replication
Compaq Himalaya
ORH-Dual: On-Chip Replicated Hardware
(similar to IBM G5)
Fault Detection in Compaq Himalaya System
Replicated Microprocessors + Cycle-by-Cycle Lockstep
Fault Detection via Simultaneous Multithreading (SMT)
Replicated Microprocessors + Cycle-by-Cycle Lockstep
Concept
SMT improves the performance of a processor by:
allowing independent threads to execute simultaneously
doing so in different functional units
Redundant Multithreading (RMT):
leverages SMT’s properties to allow fault detection for microprocessors
runs two copies of the same program as independent threads
compares their outputs and initiates recovery in case of mismatch
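
A software analogy of the RMT idea, as a rough sketch only: the same computation is run twice, the two results are compared, and a mismatch triggers recovery instead of commit. The names run_rmt and checksum, and the inject_fault_bit parameter used to model a transient fault, are hypothetical; real RMT does this in hardware at the instruction level.

```python
def checksum(data):
    """Toy 'program': a deterministic 16-bit rolling checksum."""
    total = 0
    for x in data:
        total = (total * 31 + x) & 0xFFFF
    return total

def run_rmt(program, inputs, inject_fault_bit=None):
    """Run two copies of `program` and compare their outputs.

    inject_fault_bit (assumption for illustration) flips one bit of the
    trailing copy's result to model a transient fault.
    """
    leading = program(inputs)
    trailing = program(inputs)
    if inject_fault_bit is not None:
        trailing ^= 1 << inject_fault_bit
    if leading != trailing:
        return "recover", None      # mismatch: do not release the output
    return "commit", leading        # outputs agree: safe to release

clean = run_rmt(checksum, [1, 2, 3])
faulty = run_rmt(checksum, [1, 2, 3], inject_fault_bit=4)
```

In this sketch "recover" is just a status; what recovery actually does (for example, re-execution from a checkpoint) is outside what the slides specify.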
Input Replication
Load Value Queue (LVQ)
Keep threads on same path despite I/O or multiprocessor (MP) writes
Out-of-order load issue possible
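
The input replication idea can be sketched as a FIFO: the leading thread reads memory and records each (address, value) pair, and the trailing thread takes its load values from the queue instead of memory. This is a simplified model that assumes the trailing thread consumes entries in program order; the class and method names are hypothetical.

```python
from collections import deque

class LoadValueQueue:
    """Simplified LVQ: replicates leading-thread load values for the trailing thread."""

    def __init__(self):
        self._fifo = deque()

    def leading_load(self, memory, addr):
        # Only the leading thread actually accesses memory.
        value = memory[addr]
        self._fifo.append((addr, value))
        return value

    def trailing_load(self, addr):
        # The trailing thread never touches memory; it reuses the recorded value.
        lead_addr, value = self._fifo.popleft()
        if lead_addr != addr:
            raise RuntimeError("load address mismatch: possible fault")
        return value

memory = {0x100: 7}
lvq = LoadValueQueue()

lead_value = lvq.leading_load(memory, 0x100)
memory[0x100] = 99                    # write by I/O or another processor
trail_value = lvq.trailing_load(0x100)
```

Even though memory changed between the two loads, both threads observe the value 7 and stay on the same execution path. The address check is an extra assumption of this sketch that also lets the queue flag a diverging trailing thread.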
Output Comparison
Compare & validate output before sending it outside the SoR
Store Queue Comparator (STQ)
Compares outputs sent to the data cache
Catches faults before they propagate to the rest of the system
Store Queue Comparator (cont’d)
Extends residence time of leading-thread stores
Size constrained by cycle time goal
Base CPU statically partitions single queue among threads
Potential solution: per-thread store queues
Deadlock if the matching trailing-thread store cannot commit
Several small but crucial changes avoid this
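
A minimal sketch of the output comparison step, assuming stores from both threads arrive in program order: a leading-thread store waits in the queue until the matching trailing-thread store arrives, and only a matching pair is released to the data cache. The StoreComparator class and its fields are hypothetical names for illustration.

```python
from collections import deque

class StoreComparator:
    """Holds leading-thread stores until the trailing thread confirms them."""

    def __init__(self):
        self.pending = deque()   # leading stores awaiting verification
        self.cache = {}          # stands in for the data cache (outside the SoR)

    def leading_store(self, addr, data):
        # The store is buffered, not yet visible outside the sphere of replication.
        self.pending.append((addr, data))

    def trailing_store(self, addr, data):
        if not self.pending:
            raise RuntimeError("trailing store has no matching leading store")
        lead = self.pending.popleft()
        if lead != (addr, data):
            raise RuntimeError(f"fault detected: leading {lead} != trailing {(addr, data)}")
        self.cache[addr] = data  # verified: release to the cache

stq = StoreComparator()
stq.leading_store(0x200, 42)
visible_before_check = 0x200 in stq.cache
stq.trailing_store(0x200, 42)
visible_after_check = stq.cache.get(0x200)
```

The sketch also shows why residence time grows: every leading store occupies a queue entry until its trailing counterpart commits, which is the source of the sizing and deadlock concerns listed above.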
Branch Outcome Queue (BOQ)
Forward leading-thread branch targets to trailing fetch
100% prediction accuracy in absence of faults
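
The branch outcome queue can be sketched the same way, as a FIFO of resolved branches: the leading thread pushes each committed (branch PC, target) pair, and the trailing thread's fetch stage pops them to use as predictions. The class name, method names, and the sample addresses below are hypothetical.

```python
from collections import deque

class BranchOutcomeQueue:
    """Forwards resolved leading-thread branch targets to trailing-thread fetch."""

    def __init__(self):
        self._fifo = deque()

    def push_committed(self, pc, target):
        self._fifo.append((pc, target))

    def predict(self, pc):
        lead_pc, target = self._fifo.popleft()
        if lead_pc != pc:
            raise RuntimeError("branch PC mismatch: threads have diverged")
        return target

# A loop branch at 0x40: taken three times to 0x20, then falls through to 0x44.
outcomes = [(0x40, 0x20), (0x40, 0x20), (0x40, 0x20), (0x40, 0x44)]

boq = BranchOutcomeQueue()
for pc, target in outcomes:
    boq.push_committed(pc, target)

# The fault-free trailing thread resolves the same outcomes, so every
# prediction taken from the queue is correct.
mispredictions = sum(1 for pc, actual in outcomes if boq.predict(pc) != actual)
```

With no fault the trailing thread sees zero mispredictions, including on the loop exit that a conventional predictor would likely miss. A misprediction in the trailing thread therefore signals a possible fault.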
Simultaneous & Redundantly Threaded Processor (SRT)
SRT = SMT + Fault Detection
Less hardware compared to replicated microprocessors
SMT needs ~5% more hardware over uniprocessor
SRT adds very little hardware overhead to existing SMT
Better performance than complete replication
better use of resources
Lower cost
Issues
Cycle-by-cycle output comparison and input replication:
Equivalent instructions from different threads may execute in different cycles
Equivalent instructions from different threads might execute in a different order
Precise scheduling of the threads is crucial for optimal performance
Branch misprediction
Cache miss