Efficient Reproducible Floating-Point Reduction Operations on Large Scale Systems
James Demmel, Hong Diep Nguyen
ParLab, EECS, UC Berkeley
SIAM AN13, July 8-12, 2013

Plan
▸ Introduction
▸ Algorithms
▸ Experimental results
▸ Conclusions and Future work

Introduction
Floating-point arithmetic defines a discrete subset of the real numbers and suffers from rounding errors.
→ Floating-point operations (+, ×) are commutative but not associative:
  (−1 + 1) + 2^−53 ≠ −1 + (1 + 2^−53).
Consequence: the result of a floating-point computation depends on the order of computation.
Reproducibility: the ability to obtain bit-wise identical results from run to run on the same input data, even with different resources.

Motivations
Demands for reproducible floating-point computations:
▸ Debugging: stepping through the code may require rerunning it several times on the same input data.
▸ Understanding the reliability of output. Example [1]: in a power state estimation problem (SpMV + dot product), after the 5th step the Euclidean norm of the residual vector differs by up to 20% from one run to another.
▸ Contractual reasons (road type approval, drug design),
▸ ...

[1] Villa et al., Effects of Floating-Point Non-Associativity on Numerical Computations on Massively Multithreaded Systems, CUG 2009 Proceedings.

Sources of non-reproducibility
A performance-optimized floating-point library is prone to non-reproducibility for various reasons:
▸ Changing data layouts:
  ▸ Data alignment,
  ▸ Data partitioning,
  ▸ Data ordering.
▸ Changing hardware resources:
  ▸ Fused multiply-add support,
  ▸ Intermediate precision (64 bits, 80 bits, 128 bits, etc.),
  ▸ Data path (SSE, AVX, GPU warp, etc.),
  ▸ Cache line size,
  ▸ Number of processors,
  ▸ Network topology,
  ▸ ...

Reproducibility at Large Scale
Large scale: improve performance by increasing the number of processors.
▸ Highly dynamic scheduling,
▸ Network heterogeneity: the reduction tree shape can vary,
▸ Drastically increased communication time:
  Cost = Arithmetic (#FLOPs) + Communication (#words moved, #messages).
▸ Communication-avoiding algorithms change the order of computation on purpose, e.g. 2.5D Matmul, 2.5D LU, etc.
▸ A little extra arithmetic cost is allowed as long as the communication cost is controlled.

Communication cost
[Figure: DDOT timing breakdown (n = 10^6), normalized by the ddot time, for 1 to 1024 processors; each bar is split into computation and communication.]
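To make the order-dependence described in the introduction concrete before surveying existing approaches, here is a minimal demonstration (my own illustration, not from the talk, assuming standard IEEE-754 binary64 Python floats) of the introduction's non-associativity example:

```python
# Evaluate the introduction's example in the two possible orders:
# (-1 + 1) + 2^-53  vs.  -1 + (1 + 2^-53).
x, y, z = -1.0, 1.0, 2.0 ** -53

left_to_right = (x + y) + z   # (-1 + 1) = 0 exactly, then 0 + 2^-53 -> 2^-53
right_to_left = x + (y + z)   # 1 + 2^-53 rounds back to 1 (ties-to-even), then -1 + 1 -> 0

print(left_to_right)                    # 1.1102230246251565e-16
print(right_to_left)                    # 0.0
print(left_to_right == right_to_left)   # False: the result depends on the order
```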
State of the art
Source of floating-point non-reproducibility: rounding errors make the computed result depend on the order of computations.
To obtain reproducibility:
▸ Fix the order of computations:
  ▸ sequential mode: intolerably costly on large-scale systems,
  ▸ fixed reduction tree: substantial communication overhead.
▸ Eliminate/reduce the rounding errors:
  ▸ exact arithmetic (rounded only at the end): much more expensive, both in communication and in the very wide multi-word arithmetic it requires,
  ▸ fixed-point arithmetic: limited range of values,
  ▸ higher precision: reproducible with high probability, but not with certainty.
Our proposed solution: make the rounding errors deterministic.

A proposed solution for global sum
Objectives:
▸ bit-wise identical results from run to run, regardless of hardware heterogeneity, processor count, reduction tree shape, ...
▸ independence of data ordering,
▸ only 1 reduction per sum,
▸ no severe loss of accuracy.
Idea: pre-round the input values.

Pre-rounding technique
Without pre-rounding, rounding occurs at each addition: the computation's error depends on the intermediate results, which in turn depend on the order of computation.
With pre-rounding, the bits of each x_i below a common Boundary are discarded in advance. No rounding error then occurs at any addition, so the computation's error depends only on the Boundary, which depends on max |x_i|, not on the ordering.
In a distributed setting the Boundary depends on the global max |x_i| ⇒ extra communication among processors.

1-Reduction technique
The exponent range [EMIN, EMAX] is divided into W-bit-wide bins whose boundaries are precomputed, so they no longer depend on a global max |x_i|. Each processor pre-rounds and sums its local values against its own boundary, and a single reduction with a special operator combines the results:
MAX of the boundaries, combined with SUM of the corresponding partial sums.

k-fold Algorithm
Increase the number of bins kept per partial sum (k of them) to increase the accuracy.

k-fold Algorithm: Accuracy
The k-fold algorithm has the error bound
  absolute error ≤ N · Boundary_k < N · 2^(−(k−1)·W) · max |x_i|.
In practice, k = 3 and W = 40, so (with ε = 2^−53)
  absolute error < N · 2^−80 · max |x_i| = 2^−27 · N · ε · max |x_i|,
whereas the standard sum's error bound is ≤ (N − 1) · ε · Σ |x_i|.
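Before the experiments, here is a minimal single-bin sketch of the pre-rounding idea above. It is my own illustration, not the authors' ReproBLAS code: the function name reproducible_sum_1bin and the particular choice of extraction constant are assumptions. Every x_i is rounded in advance to a multiple of a common boundary derived only from max |x_i| and n, so every subsequent addition is exact and the result does not depend on the summation order; the k-fold, W = 40 version on the slides repeats the extraction on the discarded remainders to recover accuracy.

```python
import math

def reproducible_sum_1bin(x):
    """Order-independent sum by pre-rounding (single bin, illustrative sketch).

    Each x[i] is rounded in advance to a multiple of a common boundary that
    depends only on max|x[i]| and len(x); the surviving high-order parts then
    add up without any rounding error, so the result does not depend on the
    order in which they are added."""
    n = len(x)
    m = max(abs(v) for v in x)
    if m == 0.0:
        return 0.0
    # Extraction constant C = 1.5 * 2^e, with e chosen large enough that n
    # pre-rounded values (each a multiple of ulp(C) = 2^(e-52)) sum exactly
    # in a single double.
    e = math.frexp(m)[1] + math.ceil(math.log2(n)) + 1
    C = 1.5 * 2.0 ** e
    s = 0.0
    for v in x:
        high = (C + v) - C   # v rounded to a multiple of 2^(e-52): "bits discarded in advance"
        s += high            # exact: s stays a multiple of 2^(e-52), bounded well below 2^(e+1)
    return s
```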
Experimental results: Accuracy
Summation of n = 10^6 floating-point numbers. The results of the reproducible summation and of the standard summation (with different orderings: ascending value, descending value, ascending magnitude, descending magnitude) are compared with the result computed in quad-double precision.

Generator x_i               reproducible    standard
drand48()                   0               -8.5e-15 to 1.0e-14
drand48() - 0.5             1.5e-16         -1.7e-13 to 1.8e-13
sin(2.0*π*i/n)              1.5e-15         -1.0 to 1.0
sin(2.0*π*i/n) * 2^-5       1.0             -1.0 to 1.0

Experimental results: Performance
[Figure: DDOT vs. prddot normalized timing breakdown (n = 10^6), 1 to 1024 processors, split into computation and communication; the prddot overhead falls from roughly 5.5× the ddot time on 1 processor to roughly 1.2× on 1024 processors.]
[Figure: DASUM vs. prdasum normalized timing breakdown (n = 10^6), 1 to 1024 processors; the prdasum overhead falls from roughly 6.0× to roughly 1.3×.]
[Figure: DNRM2 vs. prdnrm2 normalized timing breakdown (n = 10^6), 1 to 1024 processors; prdnrm2 runs at roughly 0.3-0.4× the dnrm2 time at most processor counts.]

Conclusions
The proposed 1-Reduction pre-rounding technique
▸ provides bit-wise identical reproducibility, regardless of data permutation, data assignment, processor count, reduction tree shape, hardware heterogeneity, etc.,
▸ obtains a better error bound than the standard sum's,
▸ can be done on the fly,
▸ requires only ONE reduction for the global parallel summation,
▸ is suitable for very large scale systems (exascale),
▸ can be applied to cloud computing environments,
▸ can be applied to other operations that use summation as the reduction operator.

Future work
In progress:
▸ Reproducible BLAS level 1,
▸ Parallel prefix sum,
▸ Matrix-vector / matrix-matrix multiplication.
TODO:
▸ Higher-level driver routines: trsm, factorizations like LU, ...
▸ n.5D algorithms (2.5D Matmul, 2.5D LU),
▸ SpMV,
▸ Other associative operations,
▸ Real-world applications?

Experimental results: Performance (single precision)
[Figure: SDOT vs. prsdot normalized timing breakdown (n = 10^6), 1 to 1024 processors; the prsdot overhead falls from roughly 4.5× to roughly 1.3×.]
[Figure: SASUM vs. prsasum normalized timing breakdown (n = 10^6), 1 to 1024 processors; the prsasum overhead falls from roughly 6.8× to roughly 1.2×.]
[Figure: SNRM2 vs. prsnrm2 normalized timing breakdown (n = 10^6), 1 to 1024 processors; the prsnrm2 overhead falls from roughly 2.5× to roughly 1.2×.]
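As a closing illustration of the accuracy experiment above, here is a hypothetical test harness (my own, not the authors' benchmark code) that exercises the single-bin sketch from earlier (reproducible_sum_1bin, assumed to be in scope) on n = 10^6 inputs in the style of the drand48() - 0.5 generator, under the same reorderings used in the table. It checks that the pre-rounded sums are bit-wise identical across orderings while the naive left-to-right sums typically are not; math.fsum stands in for the quad-double reference, and with a single bin the error is larger than the k = 3, W = 40 results reported above.

```python
import math
import random

n = 10 ** 6
x = [random.random() - 0.5 for _ in range(n)]   # analogue of the drand48() - 0.5 generator

# The orderings used in the accuracy experiment on the slides.
orderings = {
    "original":             list(x),
    "ascending value":      sorted(x),
    "descending value":     sorted(x, reverse=True),
    "ascending magnitude":  sorted(x, key=abs),
    "descending magnitude": sorted(x, key=abs, reverse=True),
}

exact = math.fsum(x)                                               # correctly rounded reference
naive = {name: sum(v) for name, v in orderings.items()}            # plain left-to-right sums
repro = {name: reproducible_sum_1bin(v) for name, v in orderings.items()}

print("naive sums identical across orderings:      ", len(set(naive.values())) == 1)
print("pre-rounded sums identical across orderings:", len(set(repro.values())) == 1)  # True
print("pre-rounded absolute error vs. reference:   ", abs(repro["original"] - exact))
```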