BeBOP: Automatic Performance Tuning of Sparse Matrix Kernels
Richard Vuduc, Eun-Jin Im, James Demmel, Katherine Yelick
with Attila Gyulassy, Chris Hsu, Shoaib Kamil, Benjamin Lee, Hyun-Jin Moon, and Rajesh Nishtala
Berkeley Benchmarking and OPtimization Group (bebop.cs.berkeley.edu)

Problem Context

Sparse kernel performance depends on both the matrix and the hardware platform.

Challenges in tuning sparse code
• Typical uniprocessor performance is < 10% of peak
  o Indirect, irregular memory accesses
  o High bandwidth requirements, poor instruction mix
• Hardware complexity is increasing
  o Microprocessor performance is difficult to model
  o Widening processor-memory gap; deep memory hierarchies

Goal: automatic tuning of sparse kernels
• Choose the best data structure and implementation for a given kernel, sparse matrix, and machine

[Figure] Sparse matrix example—A 6x6 blocked storage format appears to be the most natural choice for this matrix for a sparse matrix-vector multiply (SpMxV) implementation… Spy plot (zoom-in) of 03-olafu.rua (1188 non-zeros shown).

[Figure] Architecture dependence—Sixteen r x c blocked compressed sparse row implementations of SpMxV, each color-coded by performance (Mflop/s) and labeled by speedup over the unblocked (1x1) code, for the sparse matrix above on two platforms: 333 MHz Ultra 2i (left; best block 6x6, 2.49x speedup) and 800 MHz Itanium (right; best block 3x1, 1.53x). The best block size is not always 6x6!

Approach to Automatic Tuning

For each kernel, identify and generate a space of implementations, and search for the best one.

Implementation space
• Conceptually, the set of “interesting” implementations
• Depends on kernel and input
• May vary: instruction mix and order, memory access patterns, data structures and precisions, mathematical formulation, …

Search
• Uses models and experiments
• Either off-line, on-line, or a combination

Successful examples
• Dense linear algebra: ATLAS/PHiPAC
• Signal processing: FFTW; SPIRAL
• MPI collective operations (Vadhiyar & Dongarra, 2001)

Sparse Matrix-Vector Multiply

The SPARSITY system (Im & Yelick, 1999) applies this methodology to y = Ax. For SpMxV, the implementation space is the set of r x c register block sizes, with zeros filled in explicitly so that every stored block is dense. A sketch of the resulting kernel follows the figure caption below.

[Figure] 3x3 register blocking example—True non-zeros (•) and explicitly filled-in zeros (•): (688 true non-zeros) + (383 explicit zeros) = 1071 stored entries, for a fill ratio of about 1.56.
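To make the register-blocked data structure concrete, here is a minimal C sketch of SpMxV for a fixed 2x2 block size in block compressed sparse row (BCSR) format. The function and array names are illustrative assumptions, not SPARSITY's actual generated code; the real system emits one fully unrolled routine per candidate r x c.

#include <stddef.h>

/* y += A*x for A in 2x2 BCSR: every stored block is a dense 2x2 tile
 * (explicit zeros filled in), stored row-major, 4 values per block. */
void bcsr_matvec_2x2(size_t n_brows,          /* number of block rows */
                     const size_t *brow_ptr,  /* block-row pointers   */
                     const size_t *bcol_ind,  /* block column indices */
                     const double *bval,      /* 4 doubles per block  */
                     const double *x, double *y)
{
    for (size_t I = 0; I < n_brows; I++) {
        /* The two destination values stay in registers across the block row. */
        double y0 = y[2*I], y1 = y[2*I + 1];
        for (size_t k = brow_ptr[I]; k < brow_ptr[I+1]; k++) {
            const double *b  = &bval[4*k];
            const double *xp = &x[2*bcol_ind[k]];
            y0 += b[0]*xp[0] + b[1]*xp[1];
            y1 += b[2]*xp[0] + b[3]*xp[1];
        }
        y[2*I]     = y0;
        y[2*I + 1] = y1;
    }
}

Relative to 1x1 compressed sparse row, this keeps destination values in registers, unrolls the inner r x c multiply, and stores one column index per block instead of one per non-zero; the price is the extra flops and storage for the filled-in zeros, which is exactly what the fill ratio measures.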
Example: Choosing a Block Size

The search proceeds in two phases; a code sketch of the run-time heuristic follows the figure caption below.
• Off-line benchmarking (once per architecture)
  o Measure Dense Performance(r,c), in Mflop/s, of a dense matrix stored in sparse r x c format
• Run-time estimation (the matrix is, in general, known only at run time)
  o Estimate Fill Ratio(r,c) = (# stored non-zeros) / (# true non-zeros)
  o Choose r x c to maximize
      Estimated Performance(r,c) = Dense Performance(r,c) / Fill Ratio(r,c)
• Finally, evaluate the generated code against architecture-specific upper bounds

[Figure] Off-line benchmarking—Register blocking performance (Mflop/s) for a dense matrix (n=1000; n=2000 on the Power3) in sparse r x c format on four architectures (clockwise from upper left): Ultra 2i-333, Pentium III-500, Power3-375, Itanium-800. Performance is a strong function of the hardware platform; more generally, it depends on the architecture, the kernel, and the matrix.
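Below is a C sketch of one way to implement this selection, under stated assumptions: the off-line profile is available as a 12x12 table dense_mflops (a hypothetical name), the matrix is in CSR, and the fill ratio is computed exactly by counting the r x c blocks its non-zeros touch. SPARSITY instead estimates the fill from a sample of the rows, since exact counts over all 144 candidates cost many passes over the matrix.

#include <stdlib.h>
#include <string.h>

/* Count the r x c blocks touched by the non-zeros of an n x n CSR matrix. */
static size_t count_blocks(size_t n, const size_t *row_ptr,
                           const size_t *col_ind, int r, int c)
{
    size_t nblocks = 0, n_bcols = (n + c - 1) / c;
    char *seen = calloc(n_bcols, 1);  /* block columns hit in this block row */
    for (size_t I = 0; I < n; I += r) {
        size_t hi = (I + r < n) ? I + r : n;
        for (size_t i = I; i < hi; i++)
            for (size_t k = row_ptr[i]; k < row_ptr[i+1]; k++) {
                size_t J = col_ind[k] / c;
                if (!seen[J]) { seen[J] = 1; nblocks++; }
            }
        memset(seen, 0, n_bcols);     /* reset for the next block row */
    }
    free(seen);
    return nblocks;
}

/* Pick (r,c) in 1..12 x 1..12 maximizing DensePerf(r,c) / FillRatio(r,c). */
void choose_block_size(size_t n, const size_t *row_ptr, const size_t *col_ind,
                       const double dense_mflops[12][12],
                       int *best_r, int *best_c)
{
    double best = 0.0, nnz = (double)row_ptr[n];
    *best_r = *best_c = 1;
    for (int r = 1; r <= 12; r++)
        for (int c = 1; c <= 12; c++) {
            double fill = count_blocks(n, row_ptr, col_ind, r, c) * r * c / nnz;
            double est  = dense_mflops[r-1][c-1] / fill;  /* Mflop/s estimate */
            if (est > best) { best = est; *best_r = r; *best_c = c; }
        }
}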
Experimental Results

[Figure] Performance summaries on four platforms [ultra-solaris, pentium3-linux-icc, power3-aix, itanium-linux-ecc]—Performance (Mflop/s) on a set of 44 benchmark matrices from a variety of applications (dense, FEM, FEM with variable block structure, mixed, and LP), comparing the reference code, Sparsity with exhaustive search, the Sparsity heuristic, and analytic and PAPI-measured upper bounds. Speedups of up to 2.5x are possible. [SC’02]

Exploiting Matrix Structure

Additional techniques for y = Ax, sparse triangular solve, and A^T Ax:
• Register-level blocking (up to 2.5x speedup)
• Symmetry (up to 2x)
• Diagonals, bands (up to 2.2x)
• Splitting for variable block structure (1.3x–1.7x)
• Reordering to create dense blocks + splitting (up to 2x)
• Cache blocking (1.5x–5x)
• Multiple vectors (2x–7x)
• And combinations…

[Figure] Splitting and reordering [Matrix Splitting: A = A_{r×c} + A_{1×1}; ultra-solaris]—(Left) Speedup after splitting a matrix (possibly after reordering) into a blocked part and an unblocked part to avoid fill. (Middle, right) Dense blocks can be created by a judicious reordering—on matrix 17 (17-rim.rua, shown before and after over the (1,1)–(100,100) submatrix), we used a traveling salesman problem formulation due to Pinar (1997).

[Figure] Cache blocking [Pentium 4-1.5 GHz, Intel C v6.0]—Performance on a Pentium 4 for information retrieval and linear programming matrices (LSI, LP-Italian Railways, LP-osa60, and an experiment-design BIBD matrix), with up to 5x speedups over the reference.

[Figure] Multiple vectors [itanium-ecc]—Significant speedups are possible when multiplying by several vectors (800 MHz Itanium; DGEMM and DGEMV shown for comparison, n=k=2000, m=32); series include the best r×c blocked code, Sparsity with the new heuristic, Sparsity 1×1, and a naive (NIST) implementation.

Higher-Level Sparse Kernels

• Sparse triangular solve: hybrid sparse/dense data structure (1.2x–1.8x)
• AA^T x, A^T Ax (1.2x–4.2x)
• RAR^T, A^k x, …

Sparse triangular solve—Triangular factors from sparse LU often have a large dense trailing triangle (20% of the non-zeros in the example shown). Partition the factor into sparse (L_{11}, L_{21}) and dense (L_{\mathrm{dense}}) parts:

\begin{pmatrix} L_{11} & \\ L_{21} & L_{\mathrm{dense}} \end{pmatrix}
\begin{pmatrix} x_1 \\ x_2 \end{pmatrix}
= \begin{pmatrix} b_1 \\ b_2 \end{pmatrix}

The solve then proceeds in three steps (a code sketch follows the figure caption below):

L_{11} x_1 = b_1                        (sparse solve)
\hat{b}_2 = b_2 - L_{21} x_1            (sparse matvec)
L_{\mathrm{dense}} x_2 = \hat{b}_2      (dense solve)

[Figure] Sparse Triangular Solve: Performance Summary [ultra-solaris]—Improvements on the matrices dense, memplus, wang4, ex11, raefsky4, goodwin, and lhr10 from register blocking the sparse part (RB), switching to dense and calling a tuned vendor BLAS routine (TRSM) for the dense solve step (S2D), and their combination (RB + S2D), shown against analytic upper and lower bounds and a PAPI upper bound. [ICS/POHLL ’02]
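A minimal C sketch of the three-step hybrid solve, under stated assumptions: the sparse part (L11 and L21) is held in CSR with each row's diagonal entry stored last, the dense trailing triangle is a row-major array Ldense, s is the split point, and all names are illustrative. The dense step goes through the standard CBLAS triangular solve (dtrsv here; dtrsm, as on the poster, when there are multiple right-hand sides).

#include <stddef.h>
#include <cblas.h>

void hybrid_lower_solve(size_t n, size_t s,   /* s = start of dense part */
                        const size_t *row_ptr, const size_t *col_ind,
                        const double *val,    /* CSR of L11 and L21      */
                        const double *Ldense, /* (n-s)x(n-s), row-major  */
                        const double *b, double *x)
{
    /* Step 1: sparse solve L11*x1 = b1 over rows 0..s-1.
     * Assumes the diagonal entry is stored last in each CSR row. */
    for (size_t i = 0; i < s; i++) {
        double t = b[i];
        size_t k = row_ptr[i];
        for (; k + 1 < row_ptr[i+1]; k++)
            t -= val[k] * x[col_ind[k]];
        x[i] = t / val[k];                /* divide by the diagonal */
    }
    /* Step 2: sparse matvec b2_hat = b2 - L21*x1 over rows s..n-1,
     * whose stored entries all lie in columns 0..s-1. */
    for (size_t i = s; i < n; i++) {
        double t = b[i];
        for (size_t k = row_ptr[i]; k < row_ptr[i+1]; k++)
            t -= val[k] * x[col_ind[k]];
        x[i] = t;
    }
    /* Step 3: dense solve Ldense*x2 = b2_hat via the tuned vendor BLAS. */
    cblas_dtrsv(CblasRowMajor, CblasLower, CblasNoTrans, CblasNonUnit,
                (int)(n - s), Ldense, (int)(n - s), &x[s], 1);
}

Register blocking applies to the two sparse loops exactly as in SpMxV, which is how the RB and S2D optimizations combine.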
Sparse AA^T x and A^T Ax

A can be brought through the memory hierarchy only once by rewriting the product:

y = A A^T x
  = (a_1 \cdots a_m)
    \begin{pmatrix} a_1^T \\ \vdots \\ a_m^T \end{pmatrix} x
  = \sum_{k=1}^{m} a_k (a_k^T x)

[Figure] Sparse AA^T x, A^T Ax—(Left) A can be brought through the memory hierarchy only once: for each column a_k of A, compute a dot product followed by a vector scale (“axpy”). (Right) This cache-optimized implementation can be naturally combined with register blocking.

[Figure] A^T Ax Performance [ultra-solaris]—The cache-optimized, register-blocked implementations, labeled with the best block sizes and speedups: 4x1 (2.49), 3x1 (1.99), 2x1 (1.50).
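The identity above turns two passes over A into one. The C sketch below shows the unblocked version, assuming the columns of A are available one at a time (CSC storage; names are illustrative) and that y is zeroed on entry. Each column a_k is loaded once and used twice while still in cache: a dot product t = a_k^T x, then the axpy y += t*a_k.

#include <stddef.h>

/* y = A * A^T * x for an n x m matrix A stored in CSC; y[0..n-1] must
 * be zero on entry. One pass over A instead of two. */
void aat_matvec(size_t m,                    /* number of columns of A */
                const size_t *col_ptr,       /* CSC column pointers    */
                const size_t *row_ind, const double *val,
                const double *x, double *y)
{
    for (size_t k = 0; k < m; k++) {
        double t = 0.0;
        for (size_t p = col_ptr[k]; p < col_ptr[k+1]; p++)  /* dot  */
            t += val[p] * x[row_ind[p]];
        for (size_t p = col_ptr[k]; p < col_ptr[k+1]; p++)  /* axpy */
            y[row_ind[p]] += t * val[p];
    }
}

The same structure handles A^T Ax with the roles of rows and columns exchanged (i.e., using CSR), and each loop pair register-blocks in the same way as SpMxV, which is how it combines with the tuning above.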
Application Matrices

[Figure] Performance Summary: Web Subgraph (1M x 1M, 3.1M nz) [Itanium 2-900, Intel C v7.0]—Speeding up SpMxV for a web connectivity matrix (left) using register blocking and reordering (right) on a 900 MHz Itanium 2; WebBase-1M under natural, RCM, and MMD orderings, with series for the reference, register-only, cache-only, and cache + register (best and heuristic) implementations against measured and PAPI upper bounds.

BeBOP: Current and Future Work

Understanding the impact on higher-level kernels, algorithms, and applications:
• Design and implementation of a library based on the Sparse BLAS; new heuristics for efficiently choosing optimizations
• Study of performance implications for higher-level algorithms (e.g., block Lanczos)
• New sparse kernels (e.g., matrix powers A^k, triple product RAR^T)
• Integrating with applications (e.g., DOE SciDAC codes)
• Further automation: generating implementation generators
• Using bounds to evaluate current and future architectures