Tuning Stationary Iterative Solvers for Fault Resilience

ScalA15
November 16, 2015, Austin, TX, USA
Hartwig Anzt, Jack Dongarra, Enrique S. Quintana-Orti
Hardware Resilience at Exascale?
• Titan Supercomputer:
• O(10^5) processor cores
• O(10^7) CUDA cores
• Maybe O(10^9) cores in an Exascale machine?
• Hardware resilience is in the end a resource problem*:
• Build a system of this order at a reasonable budget
(cost-per-component has to be low)
• Power and energy constraints motivate relaxed resilience
(a ~20-40 MW limit motivates near-threshold computing)
• Looks like we will have to deal with an increased number of bitflips…
• Looks like we will have to deal with resilience on the algorithmic side…
* See Christian Engelmann’s keynote at the Resilience workshop 2015:
http://www.christian-engelmann.info/publications/engelmann15toward.ppt.pdf
Resilience of Iterative Linear System solvers
• Checkpoint/restart strategies can improve fault-tolerance of iterative solvers.
• Research has shown efficiency for Krylov subspace solvers*.
• The iterations affected by a bit-flip are disregarded.
• What if a bit-flip occurs in every iteration?
* Agullo et al.: Towards resilient parallel linear Krylov solvers with recover-restart strategies, 2013.
• We look into Relaxation methods based on matrix splitting.
• Think of Jacobi:
$x^{(k)} = D^{-1} (b - (A - D) x^{(k-1)})$
$x^{(k)} = y + M x^{(k-1)}$
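• Contraction property: $\exists\, 0 \le \theta < 1$:
$\|x^{(k)} - x^{(k-1)}\| \le \theta \|x^{(k-1)} - x^{(k-2)}\| \le \theta^2 \|x^{(k-2)} - x^{(k-3)}\| \dots$
• Also on a component level:
$|x_i^{(k)} - x_i^{(k-1)}| \le \theta_i |x_i^{(k-1)} - x_i^{(k-2)}| \le \theta_i^2 |x_i^{(k-2)} - x_i^{(k-3)}| \dots$
or, with $z_i^{(k)} = |x_i^{(k)} - x_i^{(k-1)}|$:
$z_i^{(k)} \le \theta_i z_i^{(k-1)} \le \theta_i^2 z_i^{(k-2)} \dots$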
• Use this to construct fine-grained bit-flip check!
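The component-level contraction can be observed numerically. The following is a minimal Python sketch under assumptions that are not from the slides: a 1D Laplace matrix of size 8 as a stand-in test problem, a zero initial guess, and the illustrative helper names `jacobi_step`, `laplace_1d`, and `update_history`. It records the update sizes z_i^(k) = |x_i^(k) - x_i^(k-1)| and prints the range of the ratios z_i^(k) / z_i^(k-1), which should settle below 1 for a convergent Jacobi iteration.

```python
def jacobi_step(A, b, x):
    """One Jacobi sweep: x_new = D^{-1} (b - (A - D) x)."""
    n = len(b)
    x_new = [0.0] * n
    for i in range(n):
        off_diag = sum(A[i][j] * x[j] for j in range(n) if j != i)
        x_new[i] = (b[i] - off_diag) / A[i][i]
    return x_new


def laplace_1d(n):
    """Tridiagonal (-1, 2, -1) matrix; an assumed stand-in test problem."""
    return [[2.0 if i == j else -1.0 if abs(i - j) == 1 else 0.0
             for j in range(n)] for i in range(n)]


def update_history(A, b, iterations):
    """Return (z, x) where z[k-1][i] = |x_i^(k) - x_i^(k-1)|."""
    x = [0.0] * len(b)
    z = []
    for _ in range(iterations):
        x_new = jacobi_step(A, b, x)
        z.append([abs(p - q) for p, q in zip(x_new, x)])
        x = x_new
    return z, x


if __name__ == "__main__":
    n = 8
    z, x = update_history(laplace_1d(n), [1.0] * n, 60)
    for k in (20, 40, 59):
        ratios = [z[k][i] / z[k - 1][i] for i in range(n)]
        print(k, min(ratios), max(ratios))
```

For this particular matrix the ratios approach the spectral radius of the Jacobi iteration matrix, so one ratio per component, measured once, may already serve as a reference for later updates.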
Bit-Flip Protection for Jacobi
• Compute ratios $c_i = z_i^{(k)} / z_i^{(k-1)}$, $i = 1 \dots n$, for some $k > 2$.
• Check every update against $|c_i - z_i^{(k)} / z_i^{(k-1)}| \le c_i \cdot \delta$ for some threshold $\delta$*.
• Accept and reject updates on component-level.
• Fallback strategy if component not updated multiple times in a row to account for numerical effects*.
*See paper for details.
[Figure: FT-Jacobi. Approximation difference for component, | xnew - x |, versus iterations (0 to 1200); legend entry: T-COND.]
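A component-level check of this kind might look as follows in Python. This is only a sketch under stated assumptions, not the authors' implementation: the acceptance test |c_i - z_i^(k) / z_i^(k-1)| <= c_i * delta is one reading of the slide, the reference ratios are frozen at a hypothetical iteration `k_ref`, a rejected update is simply discarded, and an update is force-accepted after `max_rejects` consecutive rejections as a stand-in for the fallback strategy. The `corrupt` hook is a hypothetical way to inject faults for testing.

```python
def jacobi_candidates(A, b, x):
    """Plain Jacobi sweep: returns D^{-1} (b - (A - D) x)."""
    n = len(b)
    return [(b[i] - sum(A[i][j] * x[j] for j in range(n) if j != i)) / A[i][i]
            for i in range(n)]


def ft_jacobi(A, b, iterations, delta=0.9, k_ref=3, max_rejects=2, corrupt=None):
    """Jacobi with a component-wise plausibility check on every update.

    Assumed details (not from the slides): k_ref, max_rejects, and the
    corrupt(k, i, value) fault-injection hook.
    """
    n = len(b)
    x = [0.0] * n
    z_prev = [None] * n   # size of the last accepted update per component
    c = [None] * n        # reference contraction ratios c_i
    rejects = [0] * n     # consecutive rejections per component
    flagged = 0           # number of updates that failed the check
    for k in range(1, iterations + 1):
        cand = jacobi_candidates(A, b, x)
        x_new = list(x)
        for i in range(n):
            value = corrupt(k, i, cand[i]) if corrupt else cand[i]
            z = abs(value - x[i])
            accept = True
            if k == k_ref and z_prev[i]:
                c[i] = z / z_prev[i]          # freeze the reference ratio
            elif k > k_ref and c[i] is not None and z_prev[i]:
                ratio = z / z_prev[i]
                suspicious = abs(c[i] - ratio) > c[i] * delta
                if suspicious:
                    flagged += 1
                # Fallback: force-accept after repeated rejections.
                accept = (not suspicious) or rejects[i] >= max_rejects
            if accept:
                x_new[i], z_prev[i], rejects[i] = value, z, 0
            else:
                rejects[i] += 1               # keep the old x_i
        x = x_new
    return x, flagged


if __name__ == "__main__":
    n = 8
    A = [[2.0 if i == j else -1.0 if abs(i - j) == 1 else 0.0
          for j in range(n)] for i in range(n)]
    b = [1.0] * n
    # Hypothetical fault: a huge error in component 2 at iteration 10.
    hit = lambda k, i, v: v + 1e12 if (k, i) == (10, 2) else v
    x, flagged = ft_jacobi(A, b, 200, corrupt=hit)
    print(flagged, x)
```

Note that a rejected component makes its neighbours' next updates look unusual, so a single fault can trigger follow-up rejections; this sketch does not try to suppress them, and the forced acceptance could in principle let a fault through if it hits a component that was just rejected repeatedly.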
FT-Jacobi convergence
3D Laplace, 16x16x16 FD; 40 bitflips/iteration; 𝛿 = 0.9
[Figure, panel 1: relative residual norm versus iterations (0 to 1200) for Jacobi without bitflips, FT-Jacobi without bitflips, and FT-Jacobi with bitflips.]
[Figure, panel 2: relative convergence delay (1 to 1.18) versus relative residual threshold (0.01 down to 1e-12).]
• No convergence delay of FT-Jacobi without bitflips.
• Rejecting updates affects convergence.
• Relative convergence delay increases linearly with iteration count.
Threshold optimization
3D Laplace, 16x16x16 FD; 40 bitflips/iteration
[Figure: two panels, each titled "Absolute values for iteration 400", showing counts of detected bitflips, missed bitflips, and false positives versus threshold 𝛿 (0.7 to 1).]
• A strict threshold (𝛿 close to 1) catches more bitflips but increases false positives.
Threshold optimization
3D Laplace, 16x16x16 FD
[Figure: relative convergence delay (1 to 1.35) versus threshold 𝛿 (0.7 to 1) for 1, 20, 40, 60, and 80 bitflips.]
• Relative convergence delay is good for 𝛿 ≈ 0.95.
• Moving 𝛿 closer to 1 increases the chance of failure, so use 𝛿 = 0.9.
Bit-flip in sign bit
[Figure: average counts per iteration of detected and missed bitflips, and relative residual norm, each versus iterations (0 to 1500), for Jacobi and FT-Jacobi with and without bitflips.]
• Non-protected Jacobi stagnates.
• Bitflips are detected from the first iterations onward.
Bit-flip in exponent
[Figure: average counts per iteration of detected and missed bitflips, and relative residual norm, each versus iterations (0 to 1500), for Jacobi and FT-Jacobi with and without bitflips.]
• Non-protected Jacobi diverges.
• Bitflips are detected from the first iterations onward.
Bit-flip in significant mantissa
[Figure: average counts per iteration of detected and missed bitflips, and relative residual norm, each versus iterations (0 to 1500), for Jacobi and FT-Jacobi with and without bitflips.]
• Non-protected Jacobi stagnates at residual level.
• Bitflips are detected once these bits become relevant for accuracy.
Bit-flip in less significant mantissa
[Figure: average counts per iteration of detected and missed bitflips, and relative residual norm, each versus iterations (0 to 1500), for Jacobi and FT-Jacobi with and without bitflips.]
• Non-protected Jacobi stagnates at low residual level.
• Bitflips are detected once these bits become relevant for accuracy.
Bit-flip in less significant mantissa
[Figure: average counts per iteration of detected and missed bitflips, and relative residual norm, each versus iterations (0 to 1500), for Jacobi and FT-Jacobi with and without bitflips.]
• Non-protected Jacobi stagnates at low residual level.
• Only the convergence-relevant bit-flips are detected.
• Idea: Recursively extending the mantissa length for Jacobi:
Anzt et al.: Adaptive Precision Solvers for Sparse Linear Systems, E2SC15.
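To see why the position of a flipped bit matters, the short Python sketch below flips a single bit of an IEEE 754 double. The helper name `flip_bit` is illustrative and not from the slides; the bit layout (bit 63 sign, bits 62 to 52 exponent, bits 51 to 0 mantissa) is that of the binary64 format.

```python
import struct


def flip_bit(value, bit):
    """Flip one bit of an IEEE 754 binary64 number.

    bit 63 is the sign, bits 62..52 the exponent, bits 51..0 the mantissa.
    """
    (as_int,) = struct.unpack("<Q", struct.pack("<d", value))
    (flipped,) = struct.unpack("<d", struct.pack("<Q", as_int ^ (1 << bit)))
    return flipped


if __name__ == "__main__":
    for name, bit in [("sign", 63), ("exponent (high)", 62),
                      ("exponent (low)", 52), ("mantissa (high)", 51),
                      ("mantissa (low)", 0)]:
        print(name, flip_bit(1.0, bit))
```

Applied to 1.0, a flip in the sign or exponent changes the value by a large factor, while a flip in the lowest mantissa bit changes it by roughly one part in 10^16, which is consistent with such flips being both harder to detect and less harmful to convergence.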
Sparse triangular solves
These occur when applying approximate (incomplete) factorization preconditioners:
• Low solution accuracy is required, as 𝐴 ≈ 𝐿𝑈 is only a rough approximation.
• Replace forward/backward substitutions with iterative method.
• Better scalability of iterative methods.
• Analyze relative convergence delay 𝜇,
Detected/Missed Bit-Flips (MDBF/MBF), False-Positives (FP)
University of Florida Matrix Collection: https://www.cise.ufl.edu/research/sparse/matrices/
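Replacing a forward substitution with Jacobi sweeps can be sketched in a few lines of Python. The function names and the small dense test matrix are assumptions made for illustration (a real preconditioner would use a sparse format). For a lower triangular matrix the Jacobi iteration matrix is strictly lower triangular, so in exact arithmetic the iteration reproduces the direct solve after at most n sweeps, and fewer sweeps give an approximate solve.

```python
def forward_substitution(L, b):
    """Direct solve of L x = b for lower triangular L (sequential in i)."""
    x = [0.0] * len(b)
    for i in range(len(b)):
        x[i] = (b[i] - sum(L[i][j] * x[j] for j in range(i))) / L[i][i]
    return x


def jacobi_lower_solve(L, b, sweeps):
    """Approximate solve of L x = b with Jacobi sweeps from x = 0.

    Every component of a sweep depends only on the previous iterate,
    so all components can be updated in parallel.
    """
    n = len(b)
    x = [0.0] * n
    for _ in range(sweeps):
        x = [(b[i] - sum(L[i][j] * x[j] for j in range(i))) / L[i][i]
             for i in range(n)]
    return x


if __name__ == "__main__":
    # Assumed example: a small, diagonally dominant lower triangular factor.
    L = [[4.0, 0.0, 0.0, 0.0],
         [1.0, 4.0, 0.0, 0.0],
         [0.5, 1.0, 4.0, 0.0],
         [0.0, 0.5, 1.0, 4.0]]
    b = [1.0, 2.0, 3.0, 4.0]
    print(forward_substitution(L, b))
    for sweeps in (1, 2, 4):
        print(sweeps, jacobi_lower_solve(L, b, sweeps))
```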
Summary and Future work
• The error resilience of stationary methods can be enhanced efficiently by
using lightweight checking techniques at the component level.
• Bit-flip location in IEEE number representation matters:
• Bit-flips in the less significant bits are easier to miss.
• These have a milder effect on convergence.
• The component-level checking technique does not work for bit-flips
that permanently affect the same components.
• Future work:
• Approximate triangular solves for ILU based on FT-Jacobi.
• Large-scale evaluation of FT-Jacobi.
This research is based on a cooperation with Enrique Quintana-Orti from the University Jaume I,
and supported by the U.S. Department of Energy and the Ministerio de Economia y Competitividad.