Dynamic Processor Sparing
IBM Academy of Research study on Proactive Problem Prediction, Avoidance and Diagnosis
04-28-2003

Redundancy
• Redundancy always provides a theoretical mechanism for avoiding failures
• Requires precise fault detection and containment, and operation retry capability
  – Non-retry mechanisms possible, but require more redundancy
• In a multi-processor environment, each processor is inherently redundant

Processor Recovery
• Several different recovery techniques used throughout the industry
• Most focused on array errors
  – ECC, miss/purge/re-fetch, …
• Few focused on logic errors
  – Z-series uses Instruction Checkpoint Retry

Instruction Checkpoint Retry
• Processor micro-architectural checkpoint maintained on instruction boundaries
  – Protected with robust error detection
• When error detected, processor micro-architected state restored from checkpoint
• Operation resumed from last known-good checkpoint
• Checkpoint may be restored on the same processor, or an alternate (redundant) processor

Dynamic Processor Sparing
• Dynamically move workload from a defective processor to a healthy, redundant processor
• Wide range of implementations
  – Defective processor still able to make forward progress
    • Gracefully de-allocate workload from defective processor and dispatch to alternate processor
  – Defective processor unable to make forward progress
    • Defective processor abruptly shut-down
    • Checkpoint extracted from defective processor and “transplanted” into alternate processor
  – Redundant processor could be active or dormant

Proactive Implementation of Dynamic Processor Sparing
• Implement thresholds for recoverable errors
  – Take action to remove defective processor from configuration
• Guarantee spare (dormant) processors in every machine configuration
  – Use of spare processors does not reduce overall capacity
  – Potentially transparent to operating system
  – No impact to customer, no parts need to be replaced
  – Align number of available spares with concurrent repair capabilities
• Still support sparing to active processors
  – Effective for continuous reliable operation
  – More difficult to “hide” from operating system
  – Usually does not avoid parts needing replaced
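
Illustrative sketch (not from the original study)
The checkpoint retry and threshold-based sparing described above can be modelled in a few lines of software. This is only a toy simulation under stated assumptions: the real mechanism is implemented in hardware and firmware, and the names `Processor`, `fail_at`, `run_with_sparing` and `threshold` are hypothetical, chosen for illustration. The "workload" is a running sum, the checkpoint is taken after every instruction, and a fault is injected at a fixed instruction index on the defective processor.

```python
from dataclasses import dataclass


@dataclass
class Processor:
    name: str
    # Instruction indices at which the simulated checker detects an error
    # (hypothetical fault-injection model, not a real hardware interface).
    fail_at: frozenset = frozenset()
    errors: int = 0  # count of recoverable errors seen on this processor

    def execute(self, index, total, value):
        """Run one instruction; raise if an error is detected."""
        if index in self.fail_at:
            self.errors += 1
            raise RuntimeError(f"{self.name}: error at instruction {index}")
        return total + value


def run_with_sparing(values, active, spares, threshold):
    """Sum `values`, retrying from the checkpoint and sparing at `threshold` errors."""
    spares = list(spares)
    # Checkpoint = (next instruction index, running total): last known-good state.
    checkpoint = (0, 0)
    events = []
    while checkpoint[0] < len(values):
        index, total = checkpoint
        try:
            total = active.execute(index, total, values[index])
        except RuntimeError:
            if active.errors >= threshold:
                if not spares:
                    raise  # no spare left: the error is no longer recoverable
                # "Transplant" the checkpoint onto a dormant spare processor.
                retired, active = active, spares.pop(0)
                events.append(f"spare {retired.name} -> {active.name}")
            else:
                events.append(f"retry on {active.name}")
            continue  # resume from the unchanged checkpoint
        # Instruction completed: advance the checkpoint to the new boundary.
        checkpoint = (index + 1, total)
    return checkpoint[1], active.name, events


total, final_cpu, events = run_with_sparing(
    [1, 2, 3, 4],
    Processor("cp0", fail_at=frozenset({2})),
    [Processor("spare0")],
    threshold=2,
)
print(total, final_cpu, events)
# 10 spare0 ['retry on cp0', 'spare cp0 -> spare0']
```

In this run the first error is retried on the same processor; the second error reaches the threshold, so the checkpoint moves to the spare and the workload completes there with the correct result.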