Dynamic Processor Sparing

IBM Academy of Research study on
Proactive Problem Prediction,
Avoidance and Diagnosis
04-28-2003
Redundancy
• Redundancy always provides a theoretical
mechanism for avoiding failures
• Requires precise fault detection and
containment, and operation retry capability
– Non-retry mechanisms possible, but require more
redundancy
• In a multi-processor environment, each
processor is inherently redundant
Processor Recovery
• Several different recovery techniques used
throughout the industry
• Most focused on array errors
– ECC, miss/purge/re-fetch, …
• Few focused on logic errors
– zSeries uses Instruction Checkpoint Retry
Instruction Checkpoint Retry
• Processor micro-architectural checkpoint
maintained on instruction boundaries
– Protected with robust error detection
• When error detected, processor micro-architected
state restored from checkpoint
• Operation resumed from last known-good
checkpoint
• Checkpoint may be restored on the same
processor, or an alternate (redundant) processor
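
The retry flow on this slide can be illustrated with a small software model. This is only a sketch under stated assumptions, not the hardware mechanism: the `Processor` class, the `faults_remaining` counter, the `DetectedError` exception, and the `max_retries` limit are all hypothetical names introduced for illustration. State is committed to the checkpoint after each successful instruction; on a detected error the state is restored from the checkpoint, first on the same processor and then on an alternate one.

```python
import copy
from dataclasses import dataclass, field


class DetectedError(Exception):
    """Stands in for the error-detection logic (hypothetical)."""


@dataclass
class Processor:
    name: str
    faults_remaining: int = 0  # hypothetical: 1 = transient fault, large = hard defect
    state: dict = field(default_factory=lambda: {"pc": 0, "acc": 0})

    def execute(self, operand: int) -> None:
        if self.faults_remaining > 0:
            self.faults_remaining -= 1
            self.state["acc"] = -1          # working state is now corrupt
            raise DetectedError(self.name)
        self.state["acc"] += operand
        self.state["pc"] += 1


def run(program, primary, spare, max_retries=1):
    """Run `program`, retrying from the last known-good checkpoint on errors."""
    cpu = primary
    checkpoint = copy.deepcopy(cpu.state)   # last known-good state
    retries = 0
    while cpu.state["pc"] < len(program):
        try:
            cpu.execute(program[cpu.state["pc"]])
            checkpoint = copy.deepcopy(cpu.state)  # commit on instruction boundary
            retries = 0
        except DetectedError:
            if retries < max_retries:
                retries += 1                # retry on the same processor
            elif cpu is not spare:
                cpu, retries = spare, 0     # restore on the alternate processor
            else:
                raise RuntimeError("no healthy processor left")
            cpu.state = copy.deepcopy(checkpoint)
    return cpu.name, cpu.state["acc"]
```

In this model, `run([1, 2, 3], Processor("cp0", faults_remaining=1), Processor("spare"))` completes on `cp0` after one retry, while a processor that keeps faulting hands the checkpoint to `spare`, which resumes from the same instruction boundary.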
Dynamic Processor Sparing
• Dynamically move workload from a defective
processor to a healthy, redundant processor
• Wide range of implementations
– Defective processor still able to make forward
progress
• Gracefully de-allocate workload from defective processor and
dispatch to alternate processor
– Defective processor unable to make forward progress
• Defective processor abruptly shut down
• Checkpoint extracted from defective processor and
“transplanted” into alternate processor
– Redundant processor could be active or dormant
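
The two ends of this range can be contrasted in a short sketch. It is a simplified model built on assumptions, not a description of any particular machine: the `Cpu` class, its `run_queue` and `checkpoint` fields, and the `spare_processor` function are hypothetical. When the defective processor can still make forward progress, its queued work is redispatched gracefully; when it cannot, it is stopped and its checkpoint is transplanted into the alternate processor, which may have been dormant.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Cpu:
    name: str
    active: bool = True                      # False models a dormant spare
    run_queue: list = field(default_factory=list)
    checkpoint: Optional[dict] = None        # in-flight architected state


def spare_processor(defective: Cpu, alternate: Cpu, can_make_progress: bool) -> str:
    """Move work off `defective`; returns which path was taken."""
    if can_make_progress:
        # Graceful path: de-allocate workload and dispatch it elsewhere.
        alternate.run_queue.extend(defective.run_queue)
        defective.run_queue.clear()
        mode = "graceful"
    else:
        # Abrupt path: the alternate resumes from the extracted checkpoint.
        # Only the in-flight checkpoint is modeled here.
        alternate.checkpoint = defective.checkpoint
        defective.checkpoint = None
        mode = "transplant"
    defective.active = False                 # removed from the configuration
    alternate.active = True                  # wakes a dormant spare if needed
    return mode
```

The same function covers both an active and a dormant alternate: an active one simply gains work, while a dormant one is activated as part of the move.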
Proactive Implementation of
Dynamic Processor Sparing
• Implement thresholds for recoverable errors
– Take action to remove defective processor from configuration
• Guarantee spare (dormant) processors in every machine
configuration
– Use of spare processors does not reduce overall capacity
– Potentially transparent to operating system
– No impact to customer, no parts need to be replaced
– Align number of available spares with concurrent repair
capabilities
• Still support sparing to active processors
– Effective for continuous reliable operation
– More difficult to “hide” from operating system
– Usually does not avoid the need to replace parts
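
The threshold idea can be sketched as a simple per-processor counter. This is a minimal illustration under assumptions: the `ErrorThreshold` class, the `limit` value, and the processor names are hypothetical, and a real implementation would likely also consider the time window in which errors occur. Once a processor's recoverable-error count reaches the limit, it is removed from the configuration and a dormant spare takes its place, provided one is still available.

```python
from collections import Counter


class ErrorThreshold:
    """Counts recoverable errors and triggers sparing at a fixed limit."""

    def __init__(self, limit, spares):
        self.limit = limit               # hypothetical threshold value
        self.spares = list(spares)       # dormant spare processors
        self.counts = Counter()          # recoverable errors per processor
        self.removed = {}                # defective processor -> replacing spare

    def record_recoverable_error(self, cpu):
        """Return the spare that replaces `cpu`, or None if no action is taken."""
        if cpu in self.removed:
            return None                  # already out of the configuration
        self.counts[cpu] += 1
        if self.counts[cpu] < self.limit:
            return None                  # still below the threshold
        if not self.spares:
            return None                  # threshold reached, but no spare left
        spare = self.spares.pop(0)
        self.removed[cpu] = spare
        return spare
```

For example, with `ErrorThreshold(limit=3, spares=["sp0"])`, the third recoverable error reported for one processor returns `"sp0"`; a second processor reaching the limit afterwards gets `None` because the only spare is already in use, which is why the number of spares is aligned with repair capabilities.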