
An Evaluation of Spare Node Method for Fault Tolerance in Exascale Systems
Kazumi Yoshinaga
RIKEN AICS
Background
• In the exaflops era, faults will occur much more frequently
• Applications will need to survive node failures
• A job can already survive node failures with the mechanisms proposed so far
• However, how a job should survive node failures has not been discussed
Guideline for Methodology of Survival from Node Failures
• Programmability
– As easy as possible
– Minimize the code size to handle node failures
– Survival code must be the same even if multiple and/or consecutive node failures happen
• Efficiency
– Loads should always be balanced
– Little or no communication performance (CP) degradation
Problem
• What is the best way to survive a node failure?
– Assume a job can survive a node failure by using existing fault mitigation software
– The goal is not to propose a new fault mitigation mechanism
Assumptions
• A system and a parallel job can somehow survive permanent node failures
– User‐Level Failure Mitigation (ULFM) and Checkpoint/Restart
• The system has a mesh/torus topology network (e.g. K, BG/Q, …)
• Stencil computation (Widely used by real‐world HPC programs)
• Spare nodes are allocated adjacently
– Not to interfere with the other jobs
Job Restart vs. Fault Mitigation (1)
• Assuming ideal scalability
A) Job execution time w/o failure
(A node failure happens at _)
B) Simple job‐restart (with a new node set)
_
• _: Time to re‐submit the job; this can be large when the system is heavily loaded
Job Restart vs. Fault Mitigation (2)
C) Survive by avoiding the failed node
_
D) Survive by replacing the failed node with a spare node
_
• _: Slowdown factors due to failed‐node avoidance or replacement
• _: Number of spare nodes
Avoiding Failure Node
• Stencil computation characteristics
– Communication pattern is fixed
– Load can be balanced
• When a recovery happens, the stencil computation characteristics above must be preserved
• However, in C)
– Hard to balance loads
– Impossible to preserve the communication pattern
– Every time a new failure happens, the communication pattern can differ
• Hard to program!
[Figure labels: Failure; New comm. pattern; x1.5 computation]
Job Restart vs. Using Spare Nodes
• D) is more efficient than B), if
_
_
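The comparison between B) and D) can be explored numerically. The sketch below uses an assumed cost model, not necessarily the one on the slides: `t0` is the failure-free time on all allocated nodes, B) restarts from the beginning after waiting `t_queue`, and D) runs from the start on `n_nodes - n_spare` nodes and is slowed by a factor `slowdown` after the failure. All names and numbers are hypothetical.

```python
def restart_time(t0, t_fail, t_queue):
    """B) Restart from the beginning on a new node set (assumes no checkpoint)."""
    return t_fail + t_queue + t0


def spare_time(t0, t_fail, n_nodes, n_spare, slowdown):
    """D) Keep n_spare nodes idle; after the failure, run `slowdown` times slower."""
    # Ideal scalability: the same work on fewer compute nodes takes proportionally longer.
    base = t0 * n_nodes / (n_nodes - n_spare)
    return t_fail + slowdown * (base - t_fail)


# Hypothetical numbers, chosen only for illustration.
t_b = restart_time(t0=100.0, t_fail=40.0, t_queue=30.0)
t_d = spare_time(t0=100.0, t_fail=40.0, n_nodes=100, n_spare=10, slowdown=1.2)
print(f"B) restart: {t_b:.1f}   D) spare node: {t_d:.1f}")
```

Under this model, D) wins whenever the re‐submit time outweighs the cost of idle spare nodes plus the post-failure slowdown; a long queue wait favors D), while a large spare pool or a large slowdown favors B).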
Fault Mitigation Method Using Spare Nodes
• Loads are balanced (continues with the same # of procs.)
• Logical communication pattern can be preserved
– by creating a new MPI communicator
• However, the physical communication pattern changes, so communication performance can degrade
– Larger hop counts (latency), and
– Possible message collisions
• Assumptions for simplicity
– 2D 5‐point stencil (non‐periodic)
– 2D mesh network (X‐Y routing)
– Nodes on the topmost row work as spare nodes
• Up to 5 possible collisions after 1 node failure
– Independent of the # of nodes
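The effect of a replacement on link sharing can be sketched with a small simulation of these assumptions. The code below is an illustration, not the evaluation used in the slides: it treats the number of messages sharing one directed link as a proxy for "possible collisions", and it assumes (a simplification) that the router of the failed node still forwards traffic. The function names and the `placement` mapping are hypothetical.

```python
from collections import Counter


def xy_route(src, dst):
    """Directed links used when routing along X first, then along Y."""
    (x, y), (dx, dy) = src, dst
    links = []
    while x != dx:
        nx = x + (1 if dx > x else -1)
        links.append(((x, y), (nx, y)))
        x = nx
    while y != dy:
        ny = y + (1 if dy > y else -1)
        links.append(((x, y), (x, ny)))
        y = ny
    return links


def max_link_load(width, height, placement=None):
    """Largest number of stencil messages sharing one directed link.

    `placement` maps a logical coordinate to the physical node that hosts it;
    coordinates missing from the mapping stay on their original node.
    """
    placement = placement or {}
    load = Counter()
    for x in range(width):
        for y in range(height):
            for ox, oy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                nx, ny = x + ox, y + oy
                if 0 <= nx < width and 0 <= ny < height:  # non-periodic stencil
                    src = placement.get((x, y), (x, y))
                    dst = placement.get((nx, ny), (nx, ny))
                    load.update(xy_route(src, dst))
    return max(load.values())


baseline = max_link_load(6, 6)  # 1: every neighbor message has its own directed link
# Rank (2, 2) fails and continues on the spare node at the top of its column, (2, 6).
after = max_link_load(6, 6, {(2, 2): (2, 6)})
print(baseline, after)
```

Sweeping the failed node over all positions and taking the maximum gives a worst-case figure for this simplified model; it need not match the count on the slide, which depends on how collisions are defined.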
CP Degradation of Spare Node Method
CP Evaluation on K
• The maximum slowdown ratio over every possible combination of a failed node and a spare node is measured
• Almost independent of the # of nodes
• Why not “5”?
– Tofu (K’s network) has 4 DMA engines
– 4 messages can be sent in the time of sending one msg. plus alpha (1.7)
– 5/1.7 ≒ 3
[Figure: slowdown ratio (smaller is better, axis 1 to 4) vs. # of nodes (10 to 10000)]
Impact of Replacing Patterns
• Clarify the impact of replacing patterns on CP by simulation
• A job runs on 12x12 nodes; spare nodes are allocated on the topmost row and the rightmost column
• Simulated with the following three replacing methods
I. Simple Replace Method
II. 1D Slide Method
III. 2D Slide Method
Impact of Replacing Patterns
I. Simple Replace Method
• The failed rank continues on an alternative node
• The alternative node is selected in the following priority order:
– Top node of the same column
– Rightmost node of the same row
– Top node of the neighbor column
– Rightmost node of the neighbor row
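The priority order above can be sketched as a small selection function. This is a minimal illustration under assumed conventions: compute nodes have coordinates `(x, y)` with `0 <= x < width` and `0 <= y < height`, top spares sit at `(x, height)`, right spares at `(width, y)`, and the lower-index neighbor is tried before the higher-index one (the slide does not specify that order). The name `pick_spare` is hypothetical.

```python
def pick_spare(failed, width, height, used):
    """Return the first free spare node for `failed` = (x, y), or None."""
    x, y = failed
    candidates = [(x, height), (width, y)]  # same column top, then same row right
    candidates += [(nx, height) for nx in (x - 1, x + 1) if 0 <= nx < width]
    candidates += [(width, ny) for ny in (y - 1, y + 1) if 0 <= ny < height]
    for node in candidates:
        if node not in used:
            return node
    return None  # no spare available under this priority list


used = set()
first = pick_spare((3, 5), 12, 12, used)   # top of column 3
used.add(first)
second = pick_spare((3, 8), 12, 12, used)  # column 3 spare is taken: right of row 8
used.add(second)
third = pick_spare((3, 8), 12, 12, used)   # both taken: top of a neighbor column
print(first, second, third)
```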
Impact of Replacing Patterns
II. 1D Slide Method
• The processes between the failed node and the spare node are shifted toward the top
• If the top spare node is already in use, they are shifted to the right instead
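The upward shift can be sketched on a single column. This is an illustrative model, not the simulator behind the slides: a column is a list indexed from the bottom, its last slot is the top spare (`None` while unused), and the `FAILED` marker and function name are hypothetical. The fallback shift to the right is not covered here.

```python
FAILED = "X"  # hypothetical marker for a node that can no longer host a process


def slide_column_up(column, failed_index):
    """Shift the failed node's process and everything above it up by one node."""
    if column[-1] is not None:
        raise ValueError("top spare node is already in use")
    # Processes below the failure stay put; the rest move one slot toward the spare.
    return column[:failed_index] + [FAILED] + column[failed_index:-1]


column = [0, 1, 2, 3, None]          # ranks 0..3 plus one unused spare on top
print(slide_column_up(column, 1))    # [0, 'X', 1, 2, 3]
```

Ranks that were vertical neighbors above the failure stay physically adjacent; only the pair straddling the failed node ends up two hops apart.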
Impact of Replacing Patterns
III. 2D Slide Method
• All processes between the failed node's row (column) and the spare node's row (column) are shifted toward the top (right)
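The row-wise version of the shift can be sketched in the same way. This is a minimal model under assumed conventions: the grid is a list of rows indexed from the bottom, the last row holds the unused top spares (`None`), and the whole row containing the failed node is vacated. The function name is hypothetical, and the symmetric shift to the right is omitted.

```python
def slide_rows_up(grid, failed_row):
    """Shift the failed node's row and every row above it up by one row."""
    if any(rank is not None for rank in grid[-1]):
        raise ValueError("top spare row is already in use")
    vacated = [None] * len(grid[0])  # the failed node's row is left empty
    return grid[:failed_row] + [vacated] + grid[failed_row:-1]


grid = [
    [0, 1],        # bottom row
    [2, 3],
    [None, None],  # spare row
]
print(slide_rows_up(grid, 0))  # [[None, None], [0, 1], [2, 3]]
```

Because entire rows move together, left and right neighbors remain physically adjacent, and each shift consumes a whole spare row, which limits how many failures this scheme can absorb.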
Impact of Replacing Patterns
Simulation results
[Figure: three panels (Simple Replace, 1D Slide, 2D Slide); x-axis: # of Failed Node (1 to 5); y-axis: Max # of Possible Collisions (1 to 10); series: Worst Case, Best Case (w/o edge procs.), Best Case]
• Some failure patterns cannot be handled
• No collisions
• The 2D Slide method can handle only 2 or fewer node failures
Summary
• Various techniques to survive a node failure are discussed
– Replacing the failed node with a spare node is better than avoiding the failed node
• Degradation of CP in a stencil computation with a spare node
– In theory, the degradation is constant regardless of the number of nodes
– Evaluation on the K computer shows that the latency is about 3 times larger
– The impact of replacing patterns on CP is clarified by simulation