Performance Optimizations and Bounds for Sparse Matrix-Vector Multiply

Richard Vuduc, James Demmel, Katherine Yelick
Shoaib Kamil, Rajesh Nishtala, Benjamin Lee

Wednesday, November 20, 2002
Berkeley Benchmarking and OPtimization (BeBOP) Project
www.cs.berkeley.edu/~richie/bebop
Computer Science Division, U.C. Berkeley, Berkeley, California, USA
SC 2002: Session on Sparse Linear Algebra


Context: Performance Tuning in the Sparse Case

- Application performance is dominated by a few computational kernels.
- Performance tuning today:
  - vendor-tuned libraries (e.g., BLAS) or hand-tuning by the user
  - automatic tuning (e.g., PHiPAC/ATLAS, FFTW/SPIRAL/UHFFT)
- Tuning sparse linear algebra kernels is hard. Sparse code has:
  - high bandwidth requirements (extra storage)
  - poor locality (indirect, irregular memory access)
  - poor instruction mix (data structure manipulation)
- Sparse matrix-vector multiply (SpM×V) performance: less than 10% of machine peak.
- Performance depends on the kernel, the architecture, and the matrix.


Example: Matrix olafu

[Spy plot of matrix 03-olafu.rua: N = 16146, nnz ≈ 1.0 million (1,015,156 non-zeros); kernel = SpM×V.]

[Zoomed-in spy plot (a 60x60 window containing 1188 non-zeros) showing small dense blocks.]

- A natural choice: blocked compressed sparse row (BCSR) format.
- Experiment: measure SpM×V performance for all block sizes on this matrix.


Speedups on Itanium: The Need for Search

Blocking performance for 03-olafu.rua on Itanium (itanium-linux-ecc). Each entry is the speedup of r x c BCSR over the unblocked (1x1) code; the best block size (r=3, c=1) runs at 208 Mflop/s and the worst (r=3, c=6) at 93 Mflop/s (peak machine speed: 3.2 Gflop/s).

            c=1    c=2    c=3    c=6
    r=6    1.20   0.81   0.75   0.95
    r=3    1.44   0.99   1.07   0.64
    r=2    1.21   1.27   0.97   0.78
    r=1    1.00   0.86   0.99   0.93

Note that the best choice (3x1) is not obvious from the matrix's apparent block structure, and several plausible block sizes actually slow the code down; hence the need for search.
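For reference, the "unblocked" 1x1 CSR kernel that these speedups are measured against can be sketched as follows. This is an illustrative sketch, not the SPARSITY-generated code; the array names (val, col_ind, row_ptr) are the usual CSR conventions and are assumptions here.

```c
/* Sketch of the reference (unblocked) CSR sparse matrix-vector multiply,
 * y = y + A*x.  Illustrative only; array names follow common CSR
 * conventions rather than any code from the talk.
 *   val[k]     : non-zero values, stored row by row
 *   col_ind[k] : column index of val[k]
 *   row_ptr[i] : start of row i in val/col_ind (row_ptr[m] = nnz)
 */
void spmv_csr(int m, const int *row_ptr, const int *col_ind,
              const double *val, const double *x, double *y)
{
    for (int i = 0; i < m; i++) {
        double yi = y[i];
        for (int k = row_ptr[i]; k < row_ptr[i+1]; k++)
            yi += val[k] * x[col_ind[k]];   /* indirect, irregular access to x */
        y[i] = yi;
    }
}
```

Each non-zero costs one value load, one index load, and one indirect load of x, which is why the kernel is memory-bound and the reference runs well below peak.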
Key Questions and Conclusions

- How do we choose the best data structure automatically?
  - New heuristic for choosing optimal (or near-optimal) block sizes
- What are the limits on the performance of blocked SpM×V?
  - Derive performance upper bounds for blocking
  - Often within 20% of the upper bound, placing limits on the improvement available from more "low-level" tuning
  - Performance is memory-bound: reducing data structure size is critical
- Where are the new opportunities (kernels, techniques) for achieving higher performance?
  - Identify cases in which blocking does and does not work
  - Identify new kernels and opportunities for reducing memory traffic


Related Work

- Automatic tuning systems and code generation
  - PHiPAC [BACD97], ATLAS [WPD01], SPARSITY [Im00]
  - FFTW [FJ98], SPIRAL [PSVM01], UHFFT [MMJ00]
  - MPI collective ops (Vadhiyar, et al. [VFD01])
  - Sparse compilers (Bik [BW99], Bernoulli [Sto97])
  - Generic programming (Blitz++ [Vel98], MTL [SL98], GMCL [Neu98], ...)
  - FLAME [GGHvdG01]
- Sparse performance modeling and tuning
  - Temam and Jalby [TJ92]
  - Toledo [Tol97], White and Sadayappan [WS97], Pinar [PH99]
  - Navarro [NGLPJ96], Heras [HPDR99], Fraguela [FDZ99]
  - Gropp, et al. [GKKS99], Geus [GR99]
- Sparse kernel interfaces
  - Sparse BLAS Standard [BCD+01]
  - NIST SparseBLAS [RP96], SPARSKIT [Saa94], PSBLAS [FC00]
  - PETSc, hypre, ...


Approach to Automatic Tuning

For each kernel:
- identify and generate a space of implementations, and
- search that space to find the fastest one (using models and experiments).

The SPARSITY system for SpM×V [Im & Yelick '99]:
- Interface
  - Input: your sparse matrix (in CSR format)
  - Output: a data structure + routine tuned to your matrix and machine
- Implementation space: register-level blocking (a kernel of this kind is sketched below), cache blocking, multiple vectors, ...
- Search
  - Off-line: benchmarking (once per architecture)
  - Run-time: estimate matrix properties ("search") and predict the best data structure parameters
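As an illustration of one point in this implementation space, here is a register-blocked (BCSR) kernel for a fixed 2x2 block size. SPARSITY generates an unrolled routine of roughly this shape for each (r,c); the code below is our own sketch with assumed array names, not the generated code.

```c
/* Sketch of y = y + A*x for A stored in 2x2 blocked CSR (BCSR).
 * Illustrative only (not SPARSITY's generated code).
 *   b_val[]     : blocks stored contiguously, 4 values per 2x2 block (row-major)
 *   b_col_ind[] : column index of the first column of each block
 *   b_row_ptr[] : start of block row I in the block arrays
 * Explicit zeros filled in during conversion are simply multiplied.
 */
void spmv_bcsr_2x2(int n_block_rows, const int *b_row_ptr,
                   const int *b_col_ind, const double *b_val,
                   const double *x, double *y)
{
    for (int I = 0; I < n_block_rows; I++) {
        double y0 = y[2*I], y1 = y[2*I + 1];        /* destination kept in registers */
        for (int K = b_row_ptr[I]; K < b_row_ptr[I+1]; K++) {
            const double *a = &b_val[4*K];
            int j = b_col_ind[K];
            double x0 = x[j], x1 = x[j+1];          /* one index load per 4 stored values */
            y0 += a[0]*x0 + a[1]*x1;
            y1 += a[2]*x0 + a[3]*x1;
        }
        y[2*I]     = y0;
        y[2*I + 1] = y1;
    }
}
```

Blocking amortizes the index storage and index loads (one column index per r x c block instead of one per non-zero) and exposes register reuse of x and y, at the cost of multiplying by any filled-in explicit zeros.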
Register-Level Blocking (SPARSITY): 3x3 Example

[Spy plots of a 50x50 submatrix with 688 true non-zeros, before and after imposing a uniform, aligned 3x3 grid: (688 true non-zeros) + (383 explicit zeros) = 1071 stored values.]

- BCSR stores the matrix on a uniform, aligned grid of r x c blocks.
- Fill-in zeros: trade extra flops for better efficiency.
- This example: 50% fill led to a 1.5x speedup on the Pentium III.


Search: Choosing the Block Size

- Off-line benchmarking (once per architecture):
  - Measure Dense Performance(r,c): the performance (Mflop/s) of a dense matrix stored in sparse r x c blocked format.
- At run time, when the matrix is known:
  - Estimate Fill Ratio(r,c), where Fill Ratio(r,c) = (number of stored values) / (number of true non-zeros).
  - Choose the (r,c) that maximizes
        Estimated Mflop/s(r,c) = Dense Performance(r,c) / Fill Ratio(r,c).
- (Replaces the previous SPARSITY heuristic.)

A sketch of this selection step follows the benchmarking profiles below.


Off-line Benchmarking [Intel Pentium III]

[Heat map: register blocking performance (Mflop/s) of a dense n=1000 matrix stored in sparse r x c format, for r, c = 1..12, on pentium3-linux-icc. Rates range from 43 to 107 Mflop/s; the top 10 codes are labeled by their speedup over the unblocked code; the maximum speedup is 2.54.]


Off-line Benchmarking: Four Platforms

[Heat maps of the same dense register-blocking profile on four machines; the number in parentheses is the maximum speedup over the unblocked code: 333 MHz Sun Ultra 2i (2.03, roughly 36-72 Mflop/s), 500 MHz Intel Pentium III (2.54, roughly 43-107 Mflop/s), 375 MHz IBM Power3 (1.22, dense n=2000), 800 MHz Intel Itanium (1.55).]
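A minimal sketch of the selection step just described: given the off-line dense profile Dense(r,c) and a run-time estimate of the fill ratio, pick the (r,c) that maximizes Dense(r,c) / Fill(r,c). The array layout and the maximum block size below are assumptions for illustration, and the fill-ratio estimation (e.g., by scanning a sample of the rows) is only hinted at.

```c
/* Sketch: choose the register block size (r,c) maximizing
 *   estimated_Mflops(r,c) = dense_mflops[r][c] / fill_ratio[r][c].
 * dense_mflops is measured once per machine (off-line); fill_ratio is
 * estimated at run time. RMAX/CMAX and the 1-based array layout are
 * illustrative assumptions, not details from the talk.
 */
#define RMAX 12
#define CMAX 12

void choose_block_size(const double dense_mflops[RMAX+1][CMAX+1],
                       const double fill_ratio[RMAX+1][CMAX+1],
                       int *r_best, int *c_best)
{
    double best = 0.0;
    *r_best = *c_best = 1;
    for (int r = 1; r <= RMAX; r++) {
        for (int c = 1; c <= CMAX; c++) {
            /* fill_ratio >= 1 by definition, so the division is safe */
            double est = dense_mflops[r][c] / fill_ratio[r][c];
            if (est > best) { best = est; *r_best = r; *c_best = c; }
        }
    }
}
```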
Performance Bounds for Register Blocking

How close are we to the speed limit of blocking?

- Upper bound on performance: (flops) / (time)
  - Flops = 2 x (number of true non-zeros)
  - Lower bound on time, under two key assumptions:
    - consider only memory operations
    - count only compulsory misses (i.e., ignore conflict misses)
- Model of execution time (see the paper): charge the full latency alpha_i for hits at each cache level i; e.g., with two cache levels,
      T = (Loads - Misses_L1) * alpha_L1 + (Misses_L1 - Misses_L2) * alpha_L2 + Misses_L2 * alpha_mem
- Other features of the model:
  - accounts for the cache line size
  - the bound is a function of both the matrix and the machine

(A small numerical sketch of evaluating this bound appears below.)


Overview of Performance Results

Experimental setup:
- Four machines: Pentium III, Ultra 2i, Power3, Itanium
- 44 matrices: dense, finite element (FEM), mixed, linear programming (LP)
- Misses and cycles measured using PAPI [BDG+00]
- Reference implementation: CSR (i.e., 1x1, or "unblocked")

Main observations:
- SPARSITY vs. reference: up to 2.5x faster, especially on FEM matrices
- Block size selection: the heuristic chooses within 10% of the best
- SPARSITY performance is typically within 20% of the upper bound
- SPARSITY is least effective on the Power3


Performance Results: Intel Pentium III

[Performance summary chart (pentium3-linux-icc): for each matrix, grouped as dense (D), FEM, FEM with variable block structure, mixed, and LP, the bars show the analytic upper bound, the upper bound recomputed with PAPI-measured misses, SPARSITY with exhaustive search, SPARSITY with the heuristic, and the CSR reference. For comparison, dense DGEMV (n=1000) runs at 96 Mflop/s.]


Performance Results: Sun Ultra 2i

[Performance summary chart (ultra-solaris), same format as above, for a subset of the matrices. Dense DGEMV (n=1000): 58 Mflop/s.]
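To make the "analytic upper bound" bars concrete, the following sketch evaluates a latency-based bound P_upper = 2*nnz / T_lower from the execution-time model above, for a machine with two cache levels. All latencies, load counts, and miss counts here are made-up placeholders for illustration, not measured values from the talk.

```c
#include <stdio.h>

/* Sketch: evaluate the performance upper bound
 *   P_upper = (2 * nnz) / T_lower
 * with T_lower = H1*alpha1 + H2*alpha2 + M2*alpha_mem,
 * where H1 = Loads - M1 and H2 = M1 - M2 (hits charged full latency,
 * M1/M2 are compulsory-miss lower bounds).  Numbers are placeholders.
 */
int main(void)
{
    double nnz   = 1.0e6;     /* true non-zeros (e.g., olafu has ~1.0M) */
    double loads = 2.3e6;     /* modeled loads for some (r,c): assumed  */
    double M1    = 4.0e5;     /* lower-bound L1 misses: assumed         */
    double M2    = 1.5e5;     /* lower-bound L2 misses: assumed         */
    double a1 = 2, a2 = 10, amem = 60;   /* latencies in cycles: assumed */
    double clock_mhz = 500;              /* clock rate: assumed          */

    double H1 = loads - M1, H2 = M1 - M2;
    double T  = H1*a1 + H2*a2 + M2*amem;          /* cycles (lower bound)      */
    double mflops = 2.0*nnz / T * clock_mhz;      /* flops/cycle * Mcycles/s   */
    printf("upper bound: %.1f Mflop/s\n", mflops);
    return 0;
}
```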
Performance Results: Intel Itanium

[Performance summary chart (itanium-linux-ecc), same format as above, for a subset of the matrices. Dense DGEMV (n=1000): 310 Mflop/s.]


Performance Results: IBM Power3

[Performance summary chart (power3-aix), same format as above, for a subset of the matrices. Dense DGEMV (n=2000): 260 Mflop/s.]


Conclusions

- Tuning can be difficult, even when the matrix structure is known: performance is a complicated function of the architecture and the matrix.
- The new heuristic for choosing the block size selects the optimal implementation, or a near-optimal one (performance within 5-10% of the best).
- The limits of low-level tuning for blocking are near: performance is often within 20% of the upper bound, particularly for FEM matrices.
- Unresolved: closing the gap on the Power3.


BeBOP: Current and Future Work (1/2)

- Further performance improvements:
  - symmetry (1.5-2x speedups)
  - diagonals, block diagonals, and bands (1.2-2x)
  - splitting for variable block structure (1.3-1.7x)
  - reordering to create dense structure (1.7x)
  - cache blocking (1.5-4x)
  - multiple vectors (2-7x; a CSR sketch of this kernel appears below), and combinations of the above
- How to choose the optimizations and tuning parameters?
- Sparse triangular solve (ICS'02 POHLL Workshop paper): hybrid sparse/dense structure (1.2-1.8x)
- Higher-level kernels that permit reuse (future work), e.g., A^T*A*x (1.5-3x), applying A and A^T simultaneously, A^k*x, ...


BeBOP: Current and Future Work (2/2)

- An automatically tuned sparse matrix library
  - Code generation via sparse compilers (Bernoulli; Bik)
  - Plan to extend the new Sparse BLAS standard by one routine to support tuning
- Architectural issues
  - Improvements for the Power3?
  - Latency vs. bandwidth (see the paper)
  - Using the models to explore the architectural design space


EXTRA SLIDES


Example: No Big Surprises on Sun Ultra 2i

Blocking performance for 03-olafu.rua on the Ultra 2i (ultra-solaris). Each entry is the speedup of r x c BCSR over the unblocked (1x1) code, which runs at about 36 Mflop/s; the best block size (6x6) reaches about 54 Mflop/s.

            c=1    c=2    c=3    c=6
    r=6    1.40   1.39   1.42   1.53
    r=3    1.31   1.34   1.41   1.39
    r=2    1.17   1.25   1.30   1.30
    r=1    1.00   1.09   1.19   1.21


Where in Memory is the Time Spent?

[Stacked-bar chart: fraction of cycles spent at each level of the memory hierarchy (L1, L2, L3, memory), according to the execution-time model with modeled hits and misses, for the exhaustive-best implementation, averaged over the matrices, on the Ultra 2i (L1/L2), Pentium III (L1/L2), Power3 (L1/L2), and Itanium (L1/L2/L3).]
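One of the future-work optimizations listed above, multiplying by multiple vectors at once, is easy to sketch: each matrix value and index is loaded once and reused across all vectors. This is an illustrative CSR version with an assumed column-major layout for the vectors; it is not the tuned code behind the 2-7x figure.

```c
/* Sketch of Y = Y + A*X for NV right-hand sides, CSR storage.
 * X and Y are stored column-major with leading dimensions ldx/ldy
 * (this layout is an assumption for the sketch).
 * Each val[k]/col_ind[k] is loaded once and reused NV times.
 */
#define NV 4

void spmv_csr_multivec(int m, const int *row_ptr, const int *col_ind,
                       const double *val,
                       const double *X, int ldx,
                       double *Y, int ldy)
{
    for (int i = 0; i < m; i++) {
        double acc[NV] = {0.0};
        for (int k = row_ptr[i]; k < row_ptr[i+1]; k++) {
            double a = val[k];
            int    j = col_ind[k];
            for (int v = 0; v < NV; v++)
                acc[v] += a * X[v*ldx + j];   /* reuse a and j across vectors */
        }
        for (int v = 0; v < NV; v++)
            Y[v*ldy + i] += acc[v];
    }
}
```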
Cache Miss Bound Verification

[Plots comparing, matrix by matrix, the modeled upper and lower bounds on cache misses (in millions) against PAPI-measured misses: L1 and L2 misses on the Ultra 2i, Pentium III, and Power3, and L2 and L3 misses on the Itanium.]


Latency vs. Bandwidth

[Chart: fraction of the peak (STREAM-style sustainable) main-memory bandwidth achieved on the Ultra 2i, Pentium III, Power3, and Itanium by a series of streaming kernels:
  a[i] = b[i];  a[i] = alpha*b[i];  a[i] = b[i] + c[i];  a[i] = b[i] + alpha*c[i];
  s += a[i];  s += a[i]*b[i];  s += a[i], k += ind[i];
  s += a[i]*x[ind[i]] with x in on-chip cache;  s += a[i]*x[ind[i]] with x in external cache;
  DGEMV;  SpM×V (dense matrix, 1x1);  SpM×V (dense matrix, best block size);
together with the prediction of the full-latency model.]
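The kernels compared in the bandwidth chart above are simple streaming loops. A sketch of two of them, the dot-product-style sum and the indexed sum s += a[i]*x[ind[i]] that mimics SpM×V's access pattern, is shown below; these are our own micro-kernel sketches, not the STREAM benchmark code or the harness used for the measurements.

```c
/* Sketches of two of the streaming kernels from the bandwidth comparison.
 * Illustrative micro-kernels only; timing and array setup are omitted. */

/* s += a[i]*b[i] : unit-stride, bandwidth-limited */
double sum_dot(int n, const double *a, const double *b)
{
    double s = 0.0;
    for (int i = 0; i < n; i++)
        s += a[i] * b[i];
    return s;
}

/* s += a[i]*x[ind[i]] : adds an index stream and an indirect access to x,
 * mimicking the memory behavior of SpM×V */
double sum_indexed(int n, const double *a, const int *ind, const double *x)
{
    double s = 0.0;
    for (int i = 0; i < n; i++)
        s += a[i] * x[ind[i]];
    return s;
}
```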
Related Work (2/2)

- Compilers (analysis and models); run-time selection
  - CROPS (UCSD: Carter, Ferrante, et al.)
  - TUNE (Chatterjee, et al.)
  - Iterative compilation (O'Boyle, et al., 1998)
  - Broadway (Guyer and Lin, '99)
  - Brewer ('95); ADAPT (Voss, 2000)
- Interfaces: Sparse BLAS; PSBLAS; PETSc
- Sparse triangular solve
  - SuperLU, MUMPS, SPOOLES, UMFPACK, PSPASES, ...
  - Approximation: Alvarado ('93); Raghavan ('98)
  - Scalability: Rothberg ('92, '95); Gupta ('95); Li, Coleman ('88)


What is the Cost of Search?

[Chart: block size selection overhead on the Itanium (itanium-linux-ecc), measured in the number of reference SpM×V operations it costs, split into running the heuristic and rebuilding the data structure. The total is roughly 15-45 reference SpM×Vs per matrix.]


Where in Memory is the Time Spent? (PAPI Data)

[Same stacked-bar chart as before, but using PAPI-measured hits and misses in the execution-time model instead of the modeled counts.]


Fill: Some Surprises!

Sometimes it is faster to fill in many explicit zeros:

    Matrix   Platform     Dense     Fill     (Size BCSR)/   (Perf BCSR)/
                          speedup   ratio    (Size CSR)     (Perf CSR)
    11       Itanium      1.09      1.70     1.26           1.55
    13       Pentium 3    1.50      1.52     1.07           2.30
    17       Itanium      1.04      1.59     1.23           1.54
    27       Itanium      1.16      1.94     1.47           1.54
    27       Ultra 2i     1.10      1.53     1.25           1.23
    29       Pentium 3    1.00      1.98     1.44           1.89


References

[BACD97] J. Bilmes, K. Asanović, C.W. Chin, and J. Demmel. Optimizing matrix multiply using PHiPAC: a portable, high-performance, ANSI C coding methodology. In Proceedings of the International Conference on Supercomputing, Vienna, Austria, July 1997. ACM SIGARCH. See http://www.icsi.berkeley.edu/~bilmes/phipac.

[BCD+01] S. Blackford, G. Corliss, J. Demmel, J. Dongarra, I. Duff, S. Hammarling, G. Henry, M. Heroux, C. Hu, W. Kahan, L. Kaufman, B. Kearfott, F. Krogh, X. Li, Z. Maany, A. Petitet, R. Pozo, K. Remington, W. Walster, C. Whaley, and J. Wolff von Gudenberg. Document for the Basic Linear Algebra Subprograms (BLAS) standard. BLAS Technical Forum, 2001. www.netlib.org/blast.

[BDG+00] S. Browne, J. Dongarra, N. Garner, K. London, and P. Mucci. A scalable cross-platform infrastructure for application performance tuning using hardware counters. In Proceedings of Supercomputing, November 2000.

[BW99] Aart J. C. Bik and Harry A. G. Wijshoff. Automatic nonzero structure analysis. SIAM Journal on Computing, 28(5):1576-1587, 1999.

[FC00] Salvatore Filippone and Michele Colajanni. PSBLAS: A library for parallel linear algebra computation on sparse matrices. ACM Transactions on Mathematical Software, 26(4):527-550, December 2000.

[FDZ99] Basilio B. Fraguela, Ramón Doallo, and Emilio L. Zapata. Memory hierarchy performance prediction for sparse blocked algorithms. Parallel Processing Letters, 9(3), March 1999.

[FJ98] Matteo Frigo and Stephen Johnson. FFTW: An adaptive software architecture for the FFT. In Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, Seattle, Washington, May 1998.

[GGHvdG01] John A. Gunnels, Fred G. Gustavson, Greg M. Henry, and Robert A. van de Geijn. FLAME: Formal Linear Algebra Methods Environment. ACM Transactions on Mathematical Software, 27(4), December 2001.

[GKKS99] William D. Gropp, D. K. Kaushik, David E. Keyes, and Barry F. Smith.
Towards realistic bounds for implicit CFD codes. In Proceedings of Parallel Computational Fluid Dynamics, pages 241-248, 1999.

[GR99] Roman Geus and S. Röllin. Towards a fast parallel sparse matrix-vector multiplication. In E. H. D'Hollander, J. R. Joubert, F. J. Peters, and H. Sips, editors, Proceedings of the International Conference on Parallel Computing (ParCo), pages 308-315. Imperial College Press, 1999.

[HPDR99] Dora Blanco Heras, Vicente Blanco Perez, Jose Carlos Cabaleiro Dominguez, and Francisco F. Rivera. Modeling and improving locality for irregular problems: sparse matrix-vector product on cache memories as a case study. In HPCN Europe, pages 201-210, 1999.

[Im00] Eun-Jin Im. Optimizing the performance of sparse matrix-vector multiplication. PhD thesis, University of California, Berkeley, May 2000.

[MMJ00] Dragan Mirkovic, Rishad Mahasoom, and Lennart Johnsson. An adaptive software library for fast Fourier transforms. In Proceedings of the International Conference on Supercomputing, pages 215-224, Santa Fe, NM, May 2000.

[Neu98] T. Neubert. Anwendung von generativen Programmiertechniken am Beispiel der Matrixalgebra. Master's thesis, Technische Universität Chemnitz, 1998.

[NGLPJ96] J. J. Navarro, E. García, J. L. Larriba-Pey, and T. Juan. Algorithms for sparse matrix computations on high-performance workstations. In Proceedings of the 10th ACM International Conference on Supercomputing, pages 301-308, Philadelphia, PA, USA, May 1996.

[PH99] Ali Pinar and Michael Heath. Improving performance of sparse matrix-vector multiplication. In Proceedings of Supercomputing, 1999.

[PSVM01] Markus Püschel, Bryan Singer, Manuela Veloso, and José M. F. Moura. Fast automatic generation of DSP algorithms. In Proceedings of the International Conference on Computational Science, volume 2073 of LNCS, pages 97-106, San Francisco, CA, May 2001. Springer.

[RP96] K. Remington and R. Pozo. NIST Sparse BLAS: User's Guide. Technical report, NIST, 1996. gams.nist.gov/spblas.

[Saa94] Yousef Saad. SPARSKIT: A basic toolkit for sparse matrix computations, 1994. www.cs.umn.edu/Research/arpa/SPARSKIT/sparskit.htm.

[SL98] Jeremy G. Siek and Andrew Lumsdaine. A rational approach to portable high performance: the Basic Linear Algebra Instruction Set (BLAIS) and the Fixed Algorithm Size Template (FAST) library. In Proceedings of ECOOP, 1998.

[Sto97] Paul Stodghill. A Relational Approach to the Automatic Generation of Sequential Sparse Matrix Codes. PhD thesis, Cornell University, August 1997.

[TJ92] O. Temam and W. Jalby. Characterizing the behavior of sparse algorithms on caches. In Proceedings of Supercomputing '92, 1992.

[Tol97] Sivan Toledo. Improving memory-system performance of sparse matrix-vector multiplication. In Proceedings of the 8th SIAM Conference on Parallel Processing for Scientific Computing, March 1997.

[Vel98] Todd Veldhuizen. Arrays in Blitz++. In Proceedings of ISCOPE, volume 1505 of LNCS. Springer-Verlag, 1998.

[VFD01] Sathish S. Vadhiyar, Graham E. Fagg, and Jack J. Dongarra. Towards an accurate model for collective communications. In Proceedings of the International Conference on Computational Science, volume 2073 of LNCS, pages 41-50, San Francisco, CA, May 2001. Springer.

[WPD01] R. Clint Whaley, Antoine Petitet, and Jack Dongarra. Automated empirical optimizations of software and the ATLAS project. Parallel Computing, 27(1):3-25, 2001.

[WS97] James B. White and P. Sadayappan. On improving the performance of sparse matrix-vector multiplication.
In Proceedings of the International Conference on High-Performance Computing, 1997.