An Evaluation of Spare Node Method for Fault Tolerance in Exascale Systems Kazumi Yoshinaga RIKEN AICS Background • In the Exa‐flops era, faults happen much more frequently • Application should survive from node failures in the future • A job can survive from node failures with the mechanisms so far proposed • OK, a job can survive. But there is no discussion how a job should survive from node failures Guideline for Methodology of Survival from Node Failures • Programmability – As easy as possible – Minimize the code size to handle node failures – Survival code must be the same even if multiple and/or consequent node failures happen • Efficiency – Loads should be balanced always – No or less communication performance (CP) degradation Problem • What is the best way to survive from a node failure ? – Assuming a job can survive from a node failure by using an existing fault mitigation software – Not to propose a new fault mitigation mechanism Assumptions • A System and a parallel job can somehow survive from permanent node failures – User‐Level Fault Mitigation (ULFM) and Checkpoint/Restart • The system has a mesh/torus topology network(e.g. K, BG/Q,…) • Stencil computation (Widely used by real‐world HPC programs) • Spare nodes are allocated adjacently – Not to interfere with the other jobs Job Restart vs. Fault Mitigation(1) • Assuming ideal scalability A) Job execution time w/o failure (A node failure happens at _ ) B) Simple job‐restart (with a new node set) _ • : Time to re‐submit the job, can be large when the system is heavily loaded Job Restart vs. Fault Mitigation(2) C) Survive by avoiding the failed node _ D) Survive by replacing the failed node with a spare node _ : Slowdown factors due to the failure node avoidance or replacement : Number of spare nodes Avoiding Failure Node • Stencil computation characteristics – Communication pattern is fixed – Load can be balanced • When a recovery happens, above stencil computation characteristics must be preserved • However, in C) x1.5 computation – Hard to balance loads – Impossible to preserve communication pattern – Every time a new failure happens, communication pattern can differ • Hard to program !!! Failure New comm. pattern Job Restart vs. Using Spare Nodes • D) is more efficient than B), if _ _ Fault Mitigation Method Using Spare Nodes • Loads are balanced (continues with the same # of procs.) • Logical communication pattern can be preserved – by creating a new MPI communicator • However, physical communication pattern is not the same, and communication performance can be degraded. – Larger hop counts (latency), and – Possible message collisions • Assumptions for simplicity – 2D 5‐point stencil (non‐periodic) – 2D mesh network (X‐Y routing) – Nodes on the topmost row work as spare nodes • Up to 5 possible collisions after 1 node failure – Independent from the # of nodes 5 possible collisions CP Degradation of Spare Node Method CP Evaluation on K – Tofu (K’s network) has 4 DMA engines – 4 messages can be sent in a time of sending one msg. plus alpha (1.7) 5/1.7 ≒ 3 Ratio (Smaller is better) • Maximum slowdown ratio in every possible combinations of a failure node and a spare node is measured 4 • Almost independent from the # of nodes 3 • Why not “5” ? 2 1 10 100 1000 # of Nodes 10000 Impact of Replacing Patterns • Clarify the impact of replacing patterns on CP by simulation • A job runs with 12x12 nodes, spare nodes are allocated on the topmost row and the rightmost column • Simulated with following 3 replacing method I. Simple Replace Method II. 1D Slide Method III. 2D Slide Method Impact of Replacing Patterns I. Simple Replace Method • Failed rank is continued on an alternative node • Alternative node is selected by following priority: – Top node of the same column – Rightmost node of the same row – Top node of the neighbor column – Rightmost node of the neighbor row Impact of Replacing Patterns II. 1D Slide Method • Processes between the failure node and the spare node are shifted to the top • If top spare node is already used, shifted to right Impact of Replacing Patterns III. 2D Slide Method • Whole processes between the failure node's row(column) and the spare node's row(column) are shifted to the top(right) Max # of Possible Collisions 10 9 8 7 6 5 4 3 2 1 1 Impact of Replacing Patterns Simulation results 10 9 8 7 6 5 4 3 2 1 10 9 8 7 6 5 4 3 2 1 1 Simple Replace 2 3 4 5 # of Failed Node Worst Case Best Case (w/o edge procs.) Best Case Some of failure patterns can not be handled 1 2 3 4 5 1D Slide 10 ・No collisions 9 8 ・2D Slide method 7 6 can only handle 2 5 or less nodes failure 4 3 2 1 1 2D Slide 2 3 4 5 Summary • Various techniques to survive from a node failure is discussed – Replacing the failed node with a spare node is better than avoiding the failed node • Degradation of CP in a stencil computation with a spare node – In theory, the degradation is constant over the number of nodes – Evaluation on the K computer reveals that the latency is about 3 times larger – Clarify the impact of replacing patterns on CP by simulation
© Copyright 2026 Paperzz