Efficient Reproducible Floating-Point Reduction Operations on Large Scale Systems
James Demmel, Hong Diep Nguyen
ParLab - EECS - UC Berkeley
SIAM AN13 Jul 8-12, 2013
Plan
Introduction
Algorithms
Experimental results
Conclusions and Future work
Floating-point arithmetic defines a discrete subset of real values and suffers from rounding errors.
→ Floating-point operations (+, ×) are commutative but not associative:
(−1 + 1) + 2^−53 ≠ −1 + (1 + 2^−53).
Consequence: the results of floating-point computations depend on the order of computation.
Reproducibility: the ability to obtain bit-wise identical results from run to run on the same input data, with different resources.
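As a quick check of the non-associativity example above, the following minimal C program evaluates both groupings in double precision (printed with the exact hexadecimal %a format):

```c
#include <stdio.h>

int main(void) {
    double eps   = 0x1p-53;               /* 2^-53 */
    double left  = (-1.0 + 1.0) + eps;    /* 0 + 2^-53 = 2^-53              */
    double right = -1.0 + (1.0 + eps);    /* 1.0 + 2^-53 rounds to 1.0,     */
                                          /* so the result is 0             */
    printf("left  = %a\nright = %a\nbitwise equal: %d\n",
           left, right, left == right);
    return 0;
}
```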
Motivations
Demands for reproducible floating-point computations:
▸ Debugging: stepping through the code may require rerunning it multiple times on the same input data.
▸ Understanding the reliability of the output. Example:¹ in a Power State Estimation problem (SpMV + dot product), after the 5th step the Euclidean norm of the residual vector differs by up to 20% from one run to another.
▸ Contractual reasons (road type approval, drug design),
▸ ...

¹ Villa et al., Effects of Floating-Point Non-Associativity on Numerical Computations on Massively Multithreaded Systems, CUG 2009 Proceedings.
Sources of non-reproducibility
A performance-optimized floating-point library is prone to non-reproducibility for various reasons:
▸ Changing Data Layouts:
  ▸ Data alignment,
  ▸ Data partitioning,
  ▸ Data ordering,
▸ Changing Hardware Resources:
  ▸ Fused Multiply-Add (FMA) support,
  ▸ Intermediate precision (64 bits, 80 bits, 128 bits, etc.),
  ▸ Data path (SSE, AVX, GPU warp, etc.),
  ▸ Cache line size,
  ▸ Number of processors,
  ▸ Network topology,
  ▸ ...
Reproducibility at Large Scale
Large scale: improve performance by increasing the number of processors.
▸ Highly dynamic scheduling,
▸ Network heterogeneity: the reduction tree shape can vary,
▸ Drastically increased communication time:
  Cost = Arithmetic (FLOPs) + Communication (#words moved + #messages)
▸ Communication-Avoiding algorithms change the order of computation on purpose, for ex. 2.5D Matmult, 2.5D LU, etc.,
▸ A little extra arithmetic cost is allowed so long as the communication cost is controlled.
Communication cost
[Figure: DDOT normalized timing breakdown (n = 10^6): computation vs. communication time for 1 to 1024 processors, normalized by the ddot time.]
State of the art
Source of floating-point non-reproducibility: rounding errors make the computed result depend on the order of computations.
To obtain reproducibility:
▸ Fix the order of computations:
  ▸ sequential mode: intolerably costly on large-scale systems,
  ▸ fixed reduction tree: substantial communication overhead;
▸ Eliminate/reduce the rounding errors:
  ▸ exact arithmetic (rounded at the end): much more expensive in communication and requires very wide multi-word arithmetic,
  ▸ fixed-point arithmetic: limited range of values,
  ▸ higher precision: reproducible with high probability (not certain).
Our proposed solution: deterministic errors.
Plan
Introduction
Algorithms
Experimental results
Conclusions and Future work
A proposed solution for global sum
Objectives:
▸ bit-wise identical results from run to run regardless of hardware heterogeneity, # processors, reduction tree shape, ...
▸ independent of data ordering,
▸ only 1 reduction per sum,
▸ no severe loss of accuracy.
Idea: pre-rounding input values.
Pre-rounding technique
[Figure: inputs x1 ... x6 spread over the exponent range from EMIN to EMAX.]
Rounding occurs at each addition. The computation's error depends on the intermediate results, which depend on the order of computation.
Pre-rounding technique
[Figure: a Boundary is placed inside the exponent range; the bits of x1 ... x6 below the Boundary are discarded in advance.]
No rounding error at each addition. The computation's error depends on the Boundary, which depends on max|x_i|, not on the ordering.
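A minimal sketch of this pre-rounding step, assuming a splitter-based extraction (the helper name preround, the splitter constant, and the sample data are illustrative, not the authors' code). Each input is rounded in advance to an integer multiple of 2^boundary_exp, where the Boundary is derived from max|x_i| and a width of W = 40 bits as in the slides; after that, the additions themselves are exact:

```c
#include <math.h>
#include <stdio.h>

/* Illustrative helper (an assumption, not the authors' code): round x to the
 * nearest integer multiple of 2^boundary_exp by adding and subtracting a
 * large power-of-two splitter, discarding the bits below the Boundary.     */
static double preround(double x, int boundary_exp) {
    double splitter = ldexp(1.5, boundary_exp + 52);   /* 1.5 * 2^(e+52) */
    volatile double t = splitter + x;   /* volatile: stop the compiler from
                                           simplifying (splitter+x)-splitter */
    return t - splitter;
}

int main(void) {
    double x[] = { 1.0, 0x1p-30, -0.25, 3.5, 0x1p-60 };
    int n = sizeof x / sizeof x[0];

    /* Boundary derived from max|x_i| (= 3.5, exponent 2), keeping W = 40
     * leading bits; 0x1p-60 lies below the Boundary and is rounded away.   */
    int boundary_exp = 2 - 40;

    double sum = 0.0;
    for (int i = 0; i < n; i++)
        sum += preround(x[i], boundary_exp);   /* each addition is exact: all
                                                  terms share one grid and the
                                                  sum fits in 53 bits        */
    printf("pre-rounded sum = %.17g\n", sum);
    return 0;
}
```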
Pre-rounding technique
[Figure: the inputs are distributed over proc 1, proc 2, proc 3, ...; on every processor the bits below the common Boundary are discarded in advance.]
No rounding error at each addition. The computation's error depends on the Boundary, which depends on max|x_i|, not on the ordering
⇒ extra communication among processors (to agree on the Boundary).
1-Reduction technique
[Figure: the exponent range from EMIN to EMAX is divided into fixed W-bit bins; the inputs on proc 1, proc 2, proc 3, ... are pre-rounded against these precomputed bin boundaries.]
Boundaries are precomputed. Special reduction operator:
(MAX of the boundaries combined with SUM of the corresponding partial sums).
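A single-bin toy version of this combined (MAX, SUM) operator, as a hedged sketch: the struct layout, helper names, and the three hand-made "processor" contributions below are assumptions for illustration; the actual operator carries k bins and additional bookkeeping. Each contribution is a pair (boundary exponent, partial sum pre-rounded to that boundary); combining two pairs takes the larger boundary and re-rounds the other partial sum to it before the addition:

```c
#include <math.h>
#include <stdio.h>

/* Hypothetical reduction payload: a boundary exponent plus the partial sum
 * already pre-rounded to that boundary.                                     */
typedef struct { int boundary_exp; double sum; } partial_t;

/* Round x to the nearest integer multiple of 2^e (splitter trick as before). */
static double round_to_grid(double x, int e) {
    double splitter = ldexp(1.5, e + 52);
    volatile double t = splitter + x;
    return t - splitter;
}

/* MAX of the boundaries combined with SUM of the partial sums: the coarser
 * boundary wins, and the other partial sum is re-rounded to it first.       */
static partial_t combine(partial_t a, partial_t b) {
    int e = a.boundary_exp > b.boundary_exp ? a.boundary_exp : b.boundary_exp;
    partial_t r = { e, round_to_grid(a.sum, e) + round_to_grid(b.sum, e) };
    return r;
}

int main(void) {
    /* Pretend three processors produced these partial results. */
    partial_t p1 = { -40, 1.0 + 0x1p-32 };   /* a multiple of 2^-40 */
    partial_t p2 = { -38, 0.5 };
    partial_t p3 = { -36, 2.25 };

    /* Two different reduction-tree shapes give the same bits here. */
    partial_t left  = combine(combine(p1, p2), p3);
    partial_t right = combine(p1, combine(p2, p3));
    printf("left  = %a\nright = %a\n", left.sum, right.sum);
    return 0;
}
```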
k-fold Algorithm
[Figure: the same W-bit binning, now with several consecutive bins kept per processor.]
Increase the number of bins to increase the accuracy.
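Extending the same toy sketch to k bins (again an illustration under the assumptions above, with hypothetical names): each input is split into k = 3 chunks at W-bit-spaced boundaries and each chunk is accumulated exactly into its own bin, so the bins do not depend on the order in which the inputs arrive:

```c
#include <math.h>
#include <stdio.h>

#define K 3    /* number of bins (k-fold) */
#define W 40   /* bits per bin            */

/* Round x to the nearest integer multiple of 2^e (splitter trick as before). */
static double round_to_grid(double x, int e) {
    double splitter = ldexp(1.5, e + 52);
    volatile double t = splitter + x;
    return t - splitter;
}

/* Split each input into K chunks at W-bit-spaced boundaries below top_exp and
 * add every chunk to its own bin. The extractions and, for modest n, the bin
 * updates are exact, so the bins are independent of the input ordering.     */
static void kfold_accumulate(const double *x, int n, int top_exp, double bin[K]) {
    for (int j = 0; j < K; j++) bin[j] = 0.0;
    for (int i = 0; i < n; i++) {
        double rest = x[i];
        for (int j = 0; j < K; j++) {
            double chunk = round_to_grid(rest, top_exp - (j + 1) * W);
            bin[j] += chunk;   /* exact while each bin holds few enough terms */
            rest   -= chunk;   /* exact: the difference is representable      */
        }
    }
}

int main(void) {
    double x[] = { 1.0, 0x1p-30, -0.25, 3.5, 0x1p-60 };
    double bin[K];
    kfold_accumulate(x, 5, 2 /* exponent of max|x_i| */, bin);

    /* Final result: add the bins from least to most significant. */
    double sum = 0.0;
    for (int j = K - 1; j >= 0; j--) sum += bin[j];
    printf("k-fold sum = %.17g\n", sum);
    return 0;
}
```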
k-fold Algorithm: Accuracy
The k-fold algorithm has the error bound
  absolute error ≤ N · Boundary_k < N · 2^−(k−1)·W · max|x_i|.
In practice, k = 3 and W = 40. With ε = 2^−53 the double-precision unit roundoff (so 2^−80 = 2^−27 · ε), this gives
  absolute error < N · 2^−80 · max|x_i| = 2^−27 · N · ε · max|x_i|,
whereas the standard sum's error bound is ≤ (N − 1) · ε · ∑|x_i|.
Plan
Introduction
Algorithms
Experimental results
Conclusions and Future work
Experimental results: Accuracy
Summation of n = 10^6 floating-point numbers. The computed results of both the reproducible summation and the standard summation (with different orderings: ascending value, descending value, ascending magnitude, descending magnitude) are compared with the result computed using quad-double precision.

Generator x_i              | reproducible | standard (range over orderings)
drand48()                  | 0            | −8.5e−15 ÷ 1.0e−14
drand48() − 0.5            | 1.5e−16      | −1.7e−13 ÷ 1.8e−13
sin(2.0 · π · i/n)         | 1.5e−15      | −1.0 ÷ 1.0
sin(2.0 · π · i/n) · 2^−5  | 1.0          | −1.0 ÷ 1.0
Experimental results: Performance
[Figure: DDOT normalized timing breakdown (n = 10^6): ddot vs. prddot, computation and communication time for 1 to 1024 processors, normalized by the ddot time; the prddot bar labels range from about 1.2 to 5.5.]
Experimental results: Performance
[Figure: DASUM normalized timing breakdown (n = 10^6): dasum vs. prdasum, computation and communication time for 1 to 1024 processors, normalized by the dasum time; the prdasum bar labels range from about 1.3 to 6.0.]
Experimental results: Performance
[Figure: DNRM2 normalized timing breakdown (n = 10^6): dnrm2 vs. prdnrm2, computation and communication time for 1 to 1024 processors, normalized by the dnrm2 time; the prdnrm2 bar labels stay around 0.3 to 0.4, i.e. below the dnrm2 time.]
Conclusions
The proposed 1-Reduction pre-rounding technique
▸ provides bit-wise identical reproducibility, regardless of
  ▸ data permutation, data assignment,
  ▸ processor count, reduction tree shape,
  ▸ hardware heterogeneity, etc.,
▸ obtains a better error bound than the standard sum's,
▸ can be done in on-the-fly mode,
▸ requires only ONE reduction for the global parallel summation,
▸ is suitable for very large scale systems (ExaScale),
▸ can be applied to Cloud computing environments,
▸ can be applied to other operations that use summation as the reduction operator.
Future work
In progress:
▸ Reproducible BLAS level 1,
▸ Parallel prefix sum,
▸ Matrix-vector / matrix-matrix multiplication.
TODO:
▸ Higher-level driver routines: trsm, factorizations like LU, ...
▸ n.5D algorithms (2.5D Matmult, 2.5D LU),
▸ SpMV,
▸ Other associative operations,
▸ Real-world applications?
Experimental results: Performance (single precision)
[Figure: SDOT normalized timing breakdown (n = 10^6): sdot vs. prsdot, computation and communication time for 1 to 1024 processors, normalized by the sdot time; the prsdot bar labels range from about 1.3 to 4.5.]
Experimental results: Performance (single precision)
[Figure: SASUM normalized timing breakdown (n = 10^6): sasum vs. prsasum, computation and communication time for 1 to 1024 processors, normalized by the sasum time; the prsasum bar labels range from about 1.2 to 6.8.]
Experimental results: Performance (single precision)
[Figure: SNRM2 normalized timing breakdown (n = 10^6): snrm2 vs. prsnrm2, computation and communication time for 1 to 1024 processors, normalized by the snrm2 time; the prsnrm2 bar labels range from about 1.2 to 2.5.]