10th-Joint-workshop-UIUC-sdi

Optimization of Multi-level
Checkpoint Model for Large
Scale HPC Applications
Sheng Di, Mohamed Slim Bouguerra,
Leonardo Bautista-Gomez, Franck Cappello
INRIA and ANL
2013
1/20
Outline
 Background of Multi-level Checkpoint Model
 Problem Formulation
 Optimization of Multi-level Checkpoint Model
 Optimizing Checkpoint Intervals for each level
 Optimizing the Selection of Levels
 Performance Evaluation
 Conclusion and Future Work
2/20
Background of Multi-level Ckpt Model
 The traditional Ckpt/Restart model always stores
checkpoint files on the Parallel File System (PFS)
 The PFS is centrally controlled, which creates a
bottleneck for large-scale applications.
 For example, our experiments show that the
checkpoint overhead on the PFS increases quickly
with problem size and execution scale:
# cores      128       256        512        1024
Ckpt cost    7.4 sec   10.8 sec   16.8 sec   43.1 sec
3/20
Background of Multi-level Ckpt Model
 Existing Multi-level checkpoint toolkits
 Scalable Checkpoint/Restart Library (SCR) – SC’10
 RAM disk / local disk
 Partner-copy / XOR encoding
 Parallel File System (PFS), e.g., NFS
 Fault Tolerance Interface (FTI) - SC’11
 Local disk: storing ckpt files into local disk
 Partner-copy: storing ckpt files in local disk & partner disk
 Reed-Solomon encoding (RS-encoding)
 Parallel File System (PFS): such as NFS
4/20
Problem Formulation
 Different Types of Failures
 CPL1: There are no hardware failures, only
software errors.
 CPL2: There are non-adjacent hardware failures
 CPL3: There are a few adjacent hardware failures
 CPL4: There are a lot of hardware failures
[Figure: timeline of Node1 to Node4 showing a software failure (Soft-F) and example failures of types CPL1, CPL2, CPL3, and CPL4]
5/20
Problem Formulation
 The process of running an HPC application
with failures over multi-level checkpoint model
[Figure: parallel application execution over four checkpoint levels. Level 1: local FS; Level 2: partner copy; Level 3: RS encoding; Level 4: PFS. Interval labels: Te/x1, Te/x2, Te/x3. Legend: normal run, one checkpoint, soft failure, hard failure (one node crash), hard failure (adjacent node crash), roll-back loss]
6/20
Problem Formulation
 Our Objective - Minimize the expected wall-clock length for each HPC application with:
 optimized selection of levels
 optimized checkpoint intervals on each level
[Equation not recovered. Annotated terms: # of levels, productive time, # of ckpt intervals at level i]
 Mathematical Expectation of Wall-clock Length:
[Equation not recovered. Annotated terms: # of failures at level i, probability, ckpt overhead, rollback loss, restart cost]
7/20
Optimization of Multi-level Checkpoint Model
 E(Tw) is convex, because
 xi denotes the # of ckpt intervals at level i
 We obtain the optimal solution by solving
the simultaneous equations,
 optimal xi* :
where i = 1, 2, 3, …, L
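The slide's own equations are not reproduced here. As a hedged illustration of why a first-order condition can give the optimum, consider a simplified model in which the expected number of failures at level i is treated as a constant. The symbol \mu_i and the exact form of the model are assumptions for illustration, not the authors' formulation; T_e is the productive time, and C_i and R_i are the checkpoint and restart overheads at level i.

```latex
\begin{align*}
E(T_w) &= T_e + \sum_{i=1}^{L} C_i x_i
          + \sum_{i=1}^{L} \mu_i \left( \frac{T_e}{2 x_i} + R_i \right) \\
\frac{\partial E(T_w)}{\partial x_i} &= C_i - \frac{\mu_i T_e}{2 x_i^2},
\qquad
\frac{\partial^2 E(T_w)}{\partial x_i^2} = \frac{\mu_i T_e}{x_i^3} > 0 \\
x_i^* &= \sqrt{\frac{\mu_i T_e}{2 C_i}}, \qquad i = 1, 2, \ldots, L
\end{align*}
```

In this assumed model the cross derivatives are zero and every second derivative is positive, so the function is convex in (x_1, ..., x_L). Setting \mu_i = T_e / MTBF_i gives an interval length T_e / x_i^* = \sqrt{2 C_i \, MTBF_i}, which is Young's formula.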
8/20
Optimization of Multi-level Checkpoint Model
 Optimizing Checkpoint Intervals
 Simplified equations:
 We use an iterative algorithm to solve it:
[Equation not recovered: update of iterate k+1 from iterate k]
 k=0: err=0.2
 k=1: err=0.08
 k=2: err=0.005
 k=3: err=0.0001
 ……
 We use Young’s formula to initialize xi(0)
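The iteration can be sketched as follows. This is a minimal sketch on an assumed cost model, not the slide's equations: E(Tw) = Te + sum(C_i * x_i) + sum(mu_i * (Te / (2 * x_i) + R_i)), with mu_i = E(Tw) / MTBF_i. The function names, the MTBF parameters, and the numeric inputs are hypothetical. Young's formula (interval = sqrt(2 * C * MTBF)) provides the starting point.

```python
import math


def young_intervals(te, ckpt_cost, mtbf):
    """Interval count implied by Young's formula: interval = sqrt(2 * C * MTBF)."""
    return te / math.sqrt(2.0 * ckpt_cost * mtbf)


def optimize_intervals(te, ckpt_costs, restart_costs, mtbfs, tol=1e-6, max_iter=100):
    """Fixed-point iteration on the ASSUMED model described in the lead-in.

    Returns (x, tw, iterations): interval counts per level, the expected
    wall-clock length, and the number of iterations used.
    """
    # k = 0: Young's formula initializes x_i(0); wall-clock estimate starts at Te.
    x = [young_intervals(te, c, m) for c, m in zip(ckpt_costs, mtbfs)]
    tw = te
    for k in range(1, max_iter + 1):
        # Expected failures per level from the current wall-clock estimate.
        mu = [tw / m for m in mtbfs]
        # Assumed model: productive time + ckpt overhead + rollback loss + restart cost.
        tw_new = te + sum(c * xi for c, xi in zip(ckpt_costs, x)) + sum(
            mu_i * (te / (2.0 * xi) + r)
            for mu_i, xi, r in zip(mu, x, restart_costs)
        )
        # First-order condition of the assumed model with updated failure counts.
        x = [math.sqrt((tw_new / m) * te / (2.0 * c))
             for c, m in zip(ckpt_costs, mtbfs)]
        err = abs(tw_new - tw) / tw
        tw = tw_new
        if err < tol:
            return x, tw, k
    raise RuntimeError("no convergence within max_iter iterations")


# Illustrative two-level input (all values are made up).
x, tw, iters = optimize_intervals(3600.0, [10.0, 30.0], [10.0, 30.0], [1000.0, 5000.0])
print(x, tw, iters)
```

Because the expected wall-clock length exceeds the productive time, the converged interval counts in this sketch are never smaller than the Young's-formula starting values.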
9/20
Optimization of Multi-level Checkpoint Model
 Optimizing Checkpoint Intervals
 How fast is our iterative optimal algorithm?
 If we set the error threshold to 10^-6, the algorithm
converges in only about 20-30 iterations.
 What is the performance gain of our method
compared to the traditional Young’s formula?
 Suppose there are 8 levels and the application execution
length is 1000 ~ 9000 seconds.
 The checkpoint overheads on the 8 levels are 10, 30,
45, 50, 55, 60, 65, 240 seconds per checkpoint.
 Numerical simulation shows that our method is better
than Young’s formula by 4.2% - 17.8%.
10/20
Optimization of Multi-level Checkpoint Model
 Optimizing Selection of Checkpoint Levels
 For a particular combination of levels, the
computation cost is only about 30 iterations.
 It is feasible to traverse all combinations of
levels to find the optimal selection of levels.
 Suppose there are 8 levels, so there are 2^8 - 1 = 255
different combinations of levels, and the total
computation cost is 255 × 30 = 7650 iterations, which is
very small!
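The exhaustive traversal can be sketched as below. The helper names and the `cost_of` callback are hypothetical; in practice `cost_of` would run the interval optimization for one selection of levels and return its expected wall-clock length.

```python
from itertools import combinations


def level_subsets(num_levels):
    """Yield every non-empty combination of levels 1..num_levels."""
    levels = range(1, num_levels + 1)
    for size in range(1, num_levels + 1):
        yield from combinations(levels, size)


def best_selection(num_levels, cost_of):
    """Return the combination with the lowest cost (cost_of is caller-supplied)."""
    return min(level_subsets(num_levels), key=cost_of)


subsets = list(level_subsets(8))
print(len(subsets))       # 255
print(len(subsets) * 30)  # 7650
```

The count grows as 2^n - 1, so this brute-force search is only practical because the number of levels is small.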
11/20
Optimization of Multi-level Checkpoint Model
 Analysis of A Practical Case – FTI
 There are 4 levels: local disk, partner-copy,
RS-encoding, and PFS
 Use Clf, Cpc, Crs, Cpf to denote ckpt overheads
 Use Rlf, Rpc, Rrs, Rpf to denote restart overheads
12/20
Optimization of Multi-level Checkpoint Model
 Analysis of A Practical Case – FTI
 The target simultaneous equations derived from
convex optimization (first-order derivatives) are:
 The solution to the above equations must be optimal
 We can use an iterative method to get it very quickly.
13/20
Performance Evaluation
 Experimental Setting
 Evaluation Type A: Numerical Simulation
 To evaluate a large number of various cases with
different parameters, including different ckpt overheads,
restart cost, application length, etc.
 Evaluation Type B: Real Experiment
 To validate the feasibility of using our optimal checkpoint
model in a real use case – FTI scenario.
 MPI program used in our
experiment: Heat distribution
14/20
Performance Evaluation
 Checkpoint Overhead of FTI on FUSION cluster
[Figure: two panels, 26 MB per process and 57 MB per process]
 Key Indicator:
 Workload Processing Ratio (WPR)
= productive time / wall-clock length
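A minimal sketch of the WPR computation, with a hypothetical function name and made-up input values:

```python
def workload_processing_ratio(productive_time, wallclock_length):
    """WPR = productive time / wall-clock length (both in the same time unit)."""
    if wallclock_length <= 0:
        raise ValueError("wall-clock length must be positive")
    return productive_time / wallclock_length


# Illustrative values: 3600 s of productive work in a 4500 s run.
print(workload_processing_ratio(3600.0, 4500.0))  # 0.8
```

A WPR closer to 1 means less time was lost to checkpointing, rollback, and restart.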
15/20
Performance Evaluation
 Different Selections of Checkpoint Levels
 Simulation Settings
16/20
Performance Evaluation
 Different Selections of Checkpoint Levels
 Simulation Results
Improvement: 10-20%
17/20
Performance Evaluation
 Experimental Results on FUSION cluster
18/20
Conclusion
 Optimal Multi-level Checkpoint/Restart Model
 Key Theoretical Conclusions:
 Ckpt intervals on each level can be optimized by fast
iterative methods (converged within only 30 iterations)
 The ckpt intervals are optimal based on convex optimization theory
 Key Simulation/Experimental Results:
 For FTI, the iterative optimal method with the best selection of
levels is better than other solutions by up to 20%.
 For other cases, such as 8 levels, optimized selection of
levels can improve performance by 50% in some cases.
19/20
Future Work
 In the future, we plan to:
 evaluate our optimal ckpt/restart model using
more complex MPI programs, such as CESM, on real
clusters at larger scales.
 improve robustness and stability by
taking into account the possible prediction errors
on checkpoint overheads and execution length.
 optimize the execution scale (# of processes)
based on checkpoint overheads for
applications with a specific productive time.
20/20
Thanks!!
Contact me at:
[email protected]
21/20