Performance Optimizations and Bounds
for Sparse Matrix-Vector Multiply

Richard Vuduc, James Demmel, Katherine Yelick
Shoaib Kamil, Rajesh Nishtala, Benjamin Lee

Wednesday, November 20, 2002

Berkeley Benchmarking and Optimization (BeBOP) Project
www.cs.berkeley.edu/~richie/bebop
Computer Science Division, U.C. Berkeley
Berkeley, California, USA

SC 2002: Session on Sparse Linear Algebra
Context: Performance Tuning in the Sparse Case
Application performance dominated by a few
computational kernels
Performance tuning today
Vendor-tuned libraries (e.g., BLAS) or user hand-tunes
Automatic tuning (e.g., PHiPAC/ATLAS, FFTW/SPIRAL/UHFFT)
Tuning sparse linear algebra kernels is hard
Sparse code has . . .
high bandwidth requirements (extra storage)
poor locality (indirect, irregular memory access)
poor instruction mix (data structure manipulation)
Sparse matrix-vector multiply (SpM×V) performance:
less than 10% of machine peak
Performance depends on the kernel, the architecture, and the matrix
(A reference CSR SpM×V kernel is sketched below.)
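To make the indirection concrete, here is a minimal reference CSR SpM×V kernel in C (an illustrative sketch; the array names are assumptions, not the code used in this study). Note the indirect, irregular access x[col_idx[j]] and the extra index load per stored non-zero:

```c
/* Reference CSR sparse matrix-vector multiply: y <- y + A*x.
 * Illustrative sketch only.
 *   m       : number of rows
 *   row_ptr : row pointers, length m+1
 *   col_idx : column index of each stored non-zero
 *   val     : value of each stored non-zero
 */
void csr_spmv(int m, const int *row_ptr, const int *col_idx,
              const double *val, const double *x, double *y)
{
    for (int i = 0; i < m; i++) {
        double yi = y[i];
        for (int j = row_ptr[i]; j < row_ptr[i + 1]; j++)
            yi += val[j] * x[col_idx[j]];   /* indirect access to x */
        y[i] = yi;
    }
}
```

Every multiply-add loads both val[j] and col_idx[j], which is the extra storage, poor locality, and poor instruction mix described above.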
Example: Matrix olafu
[Figure: spy plot of 03-olafu.rua. N = 16146, nnz = 1.0M (1,015,156 non-zeros); kernel = SpM×V.]
Example: Matrix olafu
[Figure: zoom-in on the spy plot of 03-olafu.rua (a small window of about 60x60 containing 1188 non-zeros). N = 16146, nnz = 1.0M, kernel = SpM×V.]
A natural choice: blocked compressed sparse row (BCSR).
Experiment: measure the performance of all block sizes r x c.
Speedups on Itanium: The Need for Search
[Figure: register blocking performance (Mflop/s) for 03-olafu.rua on the Itanium (itanium-linux-ecc); cell values range from 93 to 208 Mflop/s. Peak machine speed: 3.2 Gflop/s.]
Speedup over the unblocked (r=1, c=1) code, by row block size r and column block size c:

           c=1    c=2    c=3    c=6
   r=6    1.20   0.81   0.75   0.95
   r=3    1.44   0.99   1.07   0.64
   r=2    1.21   1.27   0.97   0.78
   r=1    1.00   0.86   0.99   0.93
Key Questions and Conclusions
How do we choose the best data structure automatically?
New heuristic for choosing optimal (or near-optimal) block sizes
What are the limits on performance of blocked SpM×V?
Derive performance upper bounds for blocking
Often within 20% of upper bound, placing limits on improvement from more
“low-level” tuning
Performance is memory-bound: reducing data structure size is critical
Where are the new opportunities (kernels, techniques) for
achieving higher performance?
Identify cases in which blocking does and does not work
Identify new kernels and opportunities for reducing memory traffic
Related Work
Automatic tuning systems and code generation
PHiPAC [BACD97], ATLAS [WPD01], SPARSITY [Im00]
FFTW [FJ98], SPIRAL [PSVM01], UHFFT [MMJ00]
MPI collective ops (Vadhiyar, et al. [VFD01])
Sparse compilers (Bik [BW99], Bernoulli [Sto97])
Generic programming (Blitz++ [Vel98], MTL [SL98], GMCL [Neu98], . . . )
FLAME [GGHvdG01]
Sparse performance modeling and tuning
Temam and Jalby [TJ92]
Toledo [Tol97], White and Sadayappan [WS97], Pinar [PH99]
Navarro [NGLPJ96], Heras [HPDR99], Fraguela [FDZ99]
Gropp, et al. [GKKS99], Geus [GR99]
Sparse kernel interfaces
Sparse BLAS Standard [BCD+01]
NIST SparseBLAS [RP96], SPARSKIT [Saa94], PSBLAS [FC00]
PETSc, hypre, . . .
Approach to Automatic Tuning
For each kernel,
Identify and generate a space of implementations
Search to find the fastest (using models, experiments)
The SPARSITY system for SpM×V [Im & Yelick '99]
Interface
Input: your sparse matrix (in CSR)
Output: data structure + routine tuned to your matrix & machine
Implementation space
register-level blocking, cache blocking, multiple vectors, . . .
Search
Off-line: benchmarking (once per architecture)
Run-time: estimate matrix properties ("search") and predict the best data structure parameters
Register-Level Blocking (SPARSITY): 3x3 Example
[Figure: a small example matrix under 3x3 register blocking with a uniform, aligned BCSR grid: (688 true non-zeros) + (383 explicit zeros) = 1071 stored values.]
Fill-in zeros: trade off extra flops for better efficiency
This example: 50% fill led to a 1.5x speedup on a Pentium III
(A minimal BCSR kernel is sketched below.)
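To make the register-blocking idea concrete, here is a minimal 2x2 BCSR SpM×V sketch in C (illustrative only, not the SPARSITY-generated code; 2x2 is used just for brevity). The destination values for a block row and the source values for a block stay in registers while the block is processed:

```c
/* 2x2 BCSR sparse matrix-vector multiply: y <- y + A*x.
 * Illustrative sketch only.
 *   mb      : number of block rows (the matrix has 2*mb rows)
 *   row_ptr : block-row pointers, length mb+1
 *   col_idx : first column index of each 2x2 block
 *   val     : block entries, 4 doubles per block, stored row-major
 */
void bcsr_2x2_spmv(int mb, const int *row_ptr, const int *col_idx,
                   const double *val, const double *x, double *y)
{
    for (int I = 0; I < mb; I++) {              /* block row I covers rows 2I, 2I+1 */
        double y0 = y[2*I], y1 = y[2*I + 1];    /* partial sums kept in registers   */
        for (int jj = row_ptr[I]; jj < row_ptr[I + 1]; jj++) {
            const double *b = &val[4*jj];       /* the 2x2 block                    */
            double x0 = x[col_idx[jj]], x1 = x[col_idx[jj] + 1];
            y0 += b[0]*x0 + b[1]*x1;
            y1 += b[2]*x0 + b[3]*x1;
        }
        y[2*I] = y0;
        y[2*I + 1] = y1;
    }
}
```

Compared with the CSR kernel, only one column index is loaded per 2x2 block (one per four stored values); the reduced index overhead and memory traffic is what the explicitly stored zeros buy.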
Search: Choosing the Block Size
Off-line benchmarking (once per architecture)
Measure Dense Performance(r,c) = performance (Mflop/s) of a dense matrix stored in sparse r x c blocked format
At run-time, when the matrix is known:
Estimate Fill Ratio(r,c) = (number of stored values) / (number of true non-zeros)
Choose the r x c that maximizes
Estimated Performance(r,c) = Dense Performance(r,c) / Fill Ratio(r,c)
(Replaces the previous SPARSITY heuristic. A sketch of the selection loop follows.)
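A sketch of what this selection might look like in C (illustrative only: dense_mflops, estimate_fill_ratio, and the 12x12 search range are assumptions, not SPARSITY's actual interface):

```c
/* Choose the register block size r x c that maximizes
 * Dense Performance(r,c) / Fill Ratio(r,c).  Illustrative sketch only. */
#define RMAX 12
#define CMAX 12

/* Off-line profile: Mflop/s of a dense matrix in sparse r x c format. */
extern double dense_mflops[RMAX + 1][CMAX + 1];
/* Run-time estimate of (stored values after r x c fill) / (true non-zeros),
 * e.g., obtained by scanning a sample of the matrix. */
extern double estimate_fill_ratio(const void *A, int r, int c);

void choose_block_size(const void *A, int *r_best, int *c_best)
{
    double best = -1.0;
    *r_best = 1;
    *c_best = 1;
    for (int r = 1; r <= RMAX; r++) {
        for (int c = 1; c <= CMAX; c++) {
            double est = dense_mflops[r][c] / estimate_fill_ratio(A, r, c);
            if (est > best) { best = est; *r_best = r; *c_best = c; }
        }
    }
}
```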
Off-line Benchmarking [Intel Pentium III]
[Figure: register blocking performance (Mflop/s) of a dense n=1000 matrix stored in sparse r x c format, for r, c = 1 to 12 (pentium3-linux-icc); values range from 43 to 107 Mflop/s. The top 10 codes are labeled by their speedup over the unblocked code; maximum speedup = 2.54.]
[Figure: the same off-line register-profile benchmark on all four platforms, each labeled with its maximum speedup over the unblocked code: 333 MHz Sun Ultra 2i (2.03), 500 MHz Intel Pentium III (2.54), 375 MHz IBM Power3 (1.22, dense n=2000), 800 MHz Intel Itanium (1.55).]
Performance Bounds for Register Blocking
How close are we to the speed limit of blocking?
Upper bound on performance: (flops) / (time)
Flops = 2 x (number of true non-zeros)
Lower bound on time: two key assumptions
Consider only memory operations
Count only compulsory misses (i.e., ignore conflicts)
Model of execution time (see paper)
Charge the full latency alpha_i for hits at each cache level i, e.g., with two cache levels, T = H1*alpha1 + H2*alpha2 + M2*alpha_mem
Other features of the model
Account for the cache line size
Bound is a function of the matrix and the machine
(The time model is written out in full below.)
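Written out under those assumptions (an illustrative sketch consistent with the slide, where alpha_i is the level-i access latency, H_i the hits and M_i the misses at level i, and kappa the number of cache levels), the charging model would be

```latex
T \;=\; \sum_{i=1}^{\kappa} H_i\,\alpha_i \;+\; M_\kappa\,\alpha_{\mathrm{mem}},
\qquad H_1 = \mathrm{Loads} - M_1,\quad H_i = M_{i-1} - M_i \;\;(i > 1)
```

Substituting lower bounds (compulsory counts) for the M_i gives a lower bound on T, and hence the upper bound Performance <= 2 x (true non-zeros) / T.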
Overview of Performance Results
Experimental setup
Four machines: Pentium III, Ultra 2i, Power3, Itanium
44 matrices: dense, finite element, mixed, linear programming
Reference: CSR (i.e., 1x1, or "unblocked")
Measured misses and cycles using PAPI [BDG+00]
Main observations
SPARSITY vs. reference: up to 2.5x faster, especially on FEM matrices
Block size selection: chooses within 10% of the best
SPARSITY performance typically within 20% of the upper bound
SPARSITY least effective on the Power3
Performance Results: Intel Pentium III
[Figure: performance summary (pentium3-linux-icc) over the matrix suite (Dense, FEM, FEM with variable block size, Mixed, LP), comparing the analytic upper bound, the PAPI-based upper bound, SPARSITY (exhaustive), SPARSITY (heuristic), and the reference implementation; performance ranges from 0 to 140 Mflop/s. DGEMV (n=1000): 96 Mflop/s.]
Performance Results: Sun Ultra 2i
[Figure: performance summary (ultra-solaris) over a subset of the matrix suite (Dense, FEM, FEM with variable block size, Mixed, LP), comparing the analytic upper bound, the PAPI-based upper bound, SPARSITY (exhaustive), SPARSITY (heuristic), and the reference implementation; performance ranges from 0 to 80 Mflop/s. DGEMV (n=1000): 58 Mflop/s.]
Performance Results: Intel Itanium
[Figure: performance summary (itanium-linux-ecc) over a subset of the matrix suite (Dense, FEM, FEM with variable block size, Mixed, LP), comparing the analytic upper bound, the PAPI-based upper bound, SPARSITY (exhaustive), SPARSITY (heuristic), and the reference implementation; performance ranges from 0 to 340 Mflop/s. DGEMV (n=1000): 310 Mflop/s.]
Performance Results: IBM Power3
[Figure: performance summary (power3-aix) over a smaller subset of matrices (Dense, FEM, FEM with variable block size, LP), comparing the analytic upper bound, the PAPI-based upper bound, SPARSITY (exhaustive), SPARSITY (heuristic), and the reference implementation; performance ranges from 0 to 280 Mflop/s. DGEMV (n=2000): 260 Mflop/s.]
Conclusions
Tuning can be difficult, even when matrix structure is known
Performance is a complicated function of architecture and
matrix
New heuristic for choosing block size selects optimal
implementation, or near-optimal (performance within 5–10%)
Limits of low-level tuning for blocking are near
Performance is often within 20% of upper-bound, particularly
with FEM matrices
Unresolved: closing the gap on the Power3
BeBOP: Current and Future Work (1/2)
Further performance improvements
symmetry (1.5–2x speedups)
diagonals, block diagonals, and bands (1.2–2x),
splitting for variable block structure (1.3–1.7x),
reordering to create dense structure (1.7x),
cache blocking (1.5–4x)
multiple vectors (2–7x)
and combinations . . .
How to choose optimizations & tuning parameters?
Sparse triangular solve (ICS'02: POHLL Workshop paper)
hybrid sparse/dense structure (1.2–1.8x)
Higher-level kernels that permit reuse (sketched below)
A^T*A*x (1.5–3x)
A^T*x and A*x simultaneously
A^k*x, . . . (future work)
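As an illustration of the reuse such kernels enable, here is a CSR sketch of y = A^T*(A*x) that reads each row of A only once (an illustrative sketch of the idea, not BeBOP code): row a_i is used for the dot product t = a_i . x and immediately reused for the update y += t * a_i^T.

```c
/* y <- A^T * (A * x) in one pass over a CSR matrix A with m rows, n columns.
 * Illustrative sketch of the reuse idea only; array names are assumed.
 */
void csr_ata_x(int m, int n, const int *row_ptr, const int *col_idx,
               const double *val, const double *x, double *y)
{
    for (int j = 0; j < n; j++)
        y[j] = 0.0;
    for (int i = 0; i < m; i++) {
        double t = 0.0;                        /* t = (row i of A) . x    */
        for (int k = row_ptr[i]; k < row_ptr[i + 1]; k++)
            t += val[k] * x[col_idx[k]];
        for (int k = row_ptr[i]; k < row_ptr[i + 1]; k++)
            y[col_idx[k]] += val[k] * t;       /* y += t * (row i of A)^T */
    }
}
```

A separate SpM×V followed by a transpose SpM×V would stream the non-zeros of A from memory twice; the fused loop touches them once, which is why such kernels can beat two separate SpM×V calls.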
BeBOP: Current and Future Work (2/2)
An automatically tuned sparse matrix library
Code generation via sparse compilers (Bernoulli; Bik)
Plan to extend new Sparse BLAS standard by one routine to support tuning
Architectural issues
Improvements for Power3?
Latency vs. bandwidth (see paper)
Using models to explore architectural design space
EXTRA SLIDES
Example: No Big Surprises on Sun Ultra 2i
[Figure: register blocking performance (Mflop/s) for 03-olafu.rua on the Ultra 2i (ultra-solaris); cell values range from 36 to 54 Mflop/s.]
Speedup over the unblocked (r=1, c=1) code, by row block size r and column block size c:

           c=1    c=2    c=3    c=6
   r=6    1.40   1.39   1.42   1.53
   r=3    1.31   1.34   1.41   1.39
   r=2    1.17   1.25   1.30   1.30
   r=1    1.00   1.09   1.19   1.21
Where in Memory is the Time Spent?
[Figure: fraction of cycles attributed to each level of the memory hierarchy (L1, L2, L3, memory) by the execution time model using modeled hits/misses, for the exhaustive-best implementation averaged over matrices, on the Ultra 2i (L1/L2), Pentium III (L1/L2), Power3 (L1/L2), and Itanium (L1/L2/L3).]
Cache Miss Bound Verification: Sun Ultra 2i (L1)
[Figure: modeled upper bound, PAPI measurement, and modeled lower bound on L1 misses (millions) for each matrix in the suite (ultra-solaris).]
Cache Miss Bound Verification: Sun Ultra 2i (L2)
[Figure: modeled upper bound, PAPI measurement, and modeled lower bound on L2 misses (millions) for each matrix in the suite (ultra-solaris).]
Cache Miss Bound Verification: Intel Pentium III (L1)
[Figure: modeled upper bound, PAPI measurement, and modeled lower bound on L1 misses (millions) for each matrix in the suite (pentium3-linux-icc).]
Cache Miss Bound Verification: Intel Pentium III (L2)
[Figure: modeled upper bound, PAPI measurement, and modeled lower bound on L2 misses (millions) for each matrix in the suite (pentium3-linux-icc).]
Cache Miss Bound Verification: IBM Power3 (L1)
[Figure: modeled upper bound, PAPI measurement, and modeled lower bound on L1 misses (millions) for each matrix in the subset (power3-aix).]
Cache Miss Bound Verification: IBM Power3 (L2)
[Figure: modeled upper bound, PAPI measurement, and modeled lower bound on L2 misses (millions) for each matrix in the subset (power3-aix).]
Cache Miss Bound Verification: Intel Itanium (L2)
[Figure: modeled upper bound, PAPI measurement, and modeled lower bound on L2 misses (millions) for each matrix in the suite (itanium-linux-ecc).]
Cache Miss Bound Verification: Intel Itanium (L3)
[Figure: modeled upper bound, PAPI measurement, and modeled lower bound on L3 misses (millions) for each matrix in the suite (itanium-linux-ecc).]
Fraction of Peak Main Memory Bandwidth
[Figure: latency vs. bandwidth — fraction of sustainable (STREAM-style) memory bandwidth achieved on each platform (Ultra 2i, Pentium III, Power3, Itanium) by a series of streaming kernels: a[i]=b[i]; a[i]=α*b[i]; a[i]=b[i]+c[i]; a[i]=b[i]+α*c[i]; s+=a[i]; s+=a[i]*b[i]; s+=a[i], k+=ind[i]; s+=a[i]*x[ind[i]] with x in on-chip cache; s+=a[i]*x[ind[i]] with x in external cache; DGEMV; SpM×V (dense, 1x1); SpM×V (dense, best); and the full-latency model.]
Related Work (2/2)
Compilers (analysis and models); run-time selection
CROPS (UCSD/Carter, Ferrante, et al.)
TUNE (Chatterjee, et al.)
Iterative compilation (O’Boyle, et al., 1998)
Broadway (Guyer and Lin, ’99)
Brewer (’95); ADAPT (Voss, 2000)
Interfaces: Sparse BLAS; PSBLAS; PETSc
Sparse triangular solve
SuperLU/MUMPS/SPOOLES/UMFPACK/PSPASES. . .
Approximation: Alvarado (’93); Raghavan (’98)
Scalability: Rothberg (’92;’95); Gupta (’95); Li, Coleman (’88)
What is the Cost of Search?
[Figure: block-size selection overhead (itanium-linux-ecc): for each matrix, the cost of running the heuristic plus rebuilding the data structure, measured in numbers of reference SpM×Vs (y-axis 0 to 45); bars are annotated with values between 0.71 and 0.86.]
(A back-of-envelope amortization estimate follows.)
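A back-of-envelope way to read such costs (an illustrative estimate, not from the slides): if selection plus data-structure conversion costs the equivalent of C reference SpM×Vs and the tuned code runs s times faster than the reference, tuning pays off after roughly

```latex
n \;\ge\; \frac{C}{1 - 1/s} \;=\; \frac{C\,s}{s - 1},
\qquad \text{e.g. } C = 40,\ s = 1.5 \;\Rightarrow\; n \ge 120
```

tuned multiplies. Iterative solvers that perform hundreds or thousands of SpM×Vs on the same matrix can therefore absorb this one-time cost.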
Where in Memory is the Time Spent? (PAPI Data)
[Figure: the same breakdown using PAPI-measured hits/misses: fraction of cycles attributed to L1, L2, L3, and memory, for the exhaustive-best implementation averaged over matrices, on the Ultra 2i (L1/L2), Pentium III (L1/L2), Power3 (L1/L2), and Itanium (L1/L2/L3).]
Fill: Some Surprises!
Sometimes it is faster to fill in many zeros:

   Matrix   Dense     Fill    (Size BCSR)/   (Perf BCSR)/   Platform
            speedup   ratio   (Size CSR)     (Perf CSR)
   11       1.09      1.70    1.26           1.55           Itanium
   13       1.50      1.52    1.07           2.30           Pentium 3
   17       1.04      1.59    1.23           1.54           Itanium
   27       1.16      1.94    1.47           1.54           Itanium
   27       1.10      1.53    1.25           1.23           Ultra 2i
   29       1.00      1.98    1.44           1.89           Pentium 3

(A back-of-envelope estimate of the size ratio follows.)
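One way to see why a large fill ratio can still leave the blocked data structure compact (an illustrative estimate assuming 8-byte values and 4-byte integer indices, not from the slides): CSR stores about 12 bytes per true non-zero, while r x c BCSR stores about f*(8 + 4/(rc)) bytes per true non-zero for fill ratio f, so

```latex
\frac{\text{Size BCSR}}{\text{Size CSR}} \;\approx\; \frac{f\,\bigl(8 + 4/(rc)\bigr)}{12},
\qquad \text{e.g. } f = 1.5,\ r = c = 3 \;\Rightarrow\; \approx 1.06
```

This is consistent with rows of the table in which the size ratio is much smaller than the fill ratio, and it is why modest fill can still reduce memory traffic.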
References
[BACD97]
J. Bilmes, K. Asanović, C.W. Chin, and J. Demmel. Optimizing matrix multiply using PHiPAC: a portable, high-performance, ANSI C coding methodology. In Proceedings of the International Conference on
Supercomputing, Vienna, Austria, July 1997. ACM SIGARC.
see
http://www.icsi.berkeley.edu/~bilmes/phipac.
[BCD+01]
S. Blackford, G. Corliss, J. Demmel, J. Dongarra, I. Duff, S. Hammarling, G. Henry, M. Heroux, C. Hu, W. Kahan, L. Kaufman, B. Kearfott,
F. Krogh, X. Li, Z. Maany, A. Petitet, R. Pozo, K. Remington, W. Walster,
C. Whaley, and J. Wolff von Gudenberg. Document for the Basic Linear Algebra Subprograms (BLAS) standard: BLAS Technical Forum, 2001.
www.netlib.org/blast.
[BDG+00]
S. Browne, J. Dongarra, N. Garner, K. London, and P. Mucci. A scalable
cross-platform infrastructure for application performance tuning using hardware counters. In Proceedings of Supercomputing, November 2000.
[BW99]
Aart J. C. Bik and Harry A. G. Wijshoff. Automatic nonzero structure analysis. SIAM Journal on Computing, 28(5):1576–1587, 1999.
[FC00]
Salvatore Filippone and Michele Colajanni. PSBLAS: A library for parallel linear algebra computation on sparse matrices. ACM Transactions on
Mathematical Software, 26(4):527–550, December 2000.
[FDZ99]
Basilio B. Fraguela, Ramón Doallo, and Emilio L. Zapata. Memory hierarchy performance prediction for sparse blocked algorithms. Parallel Processing Letters, 9(3), March 1999.
[FJ98]
Matteo Frigo and Stephen Johnson. FFTW: An adaptive software architecture for the FFT. In Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, Seattle, Washington, May 1998.
[GGHvdG01] John A. Gunnels, Fred G. Gustavson, Greg M. Henry, and Robert A. van de
Geijn. FLAME: Formal Linear Algebra Methods Environment. ACM Transactions on Mathematical Software, 27(4), December 2001.
[GKKS99]
William D. Gropp, D. K. Kasushik, David E. Keyes, and Barry F. Smith.
Towards realistic bounds for implicit CFD codes. In Proceedings of Parallel
Computational Fluid Dynamics, pages 241–248, 1999.
[GR99]
Roman Geus and S. Röllin. Towards a fast parallel sparse matrix-vector
multiplication. In E. H. D’Hollander, J. R. Joubert, F. J. Peters, and H. Sips,
editors, Proceedings of the International Conference on Parallel Computing
(ParCo), pages 308–315. Imperial College Press, 1999.
[HPDR99]
Dora Blanco Heras, Vicente Blanco Perez, Jose Carlos Cabaleiro
Dominguez, and Francisco F. Rivera. Modeling and improving locality for
irregular problems: sparse matrix-vector product on cache memories as a
case study. In HPCN Europe, pages 201–210, 1999.
[Im00]
Eun-Jin Im. Optimizing the performance of sparse matrix-vector multiplication. PhD thesis, University of California, Berkeley, May 2000.
[MMJ00]
Dragan Mirkovic, Rishad Mahasoom, and Lennart Johnsson. An adaptive
software library for fast Fourier transforms. In Proceedings of the International Conference on Supercomputing, pages 215–224, Santa Fe, NM, May
2000.
[Neu98]
T. Neubert. Anwendung von generativen Programmiertechniken am Beispiel der Matrixalgebra [Application of generative programming techniques, illustrated with matrix algebra]. Master's thesis, Technische Universität Chemnitz, 1998.
[NGLPJ96]
J. J. Navarro, E. García, J. L. Larriba-Pey, and T. Juan. Algorithms for
sparse matrix computations on high-performance workstations. In Proceedings of the 10th ACM International Conference on Supercomputing,
pages 301–308, Philadelphia, PA, USA, May 1996.
[PH99]
Ali Pinar and Michael Heath. Improving performance of sparse matrix-vector multiplication. In Proceedings of Supercomputing, 1999.
[PSVM01]
Markus Püschel, Bryan Singer, Manuela Veloso, and José M. F. Moura.
Fast automatic generation of DSP algorithms. In Proceedings of the International Conference on Computational Science, volume 2073 of LNCS,
pages 97–106, San Francisco, CA, May 2001. Springer.
[RP96]
K. Remington and R. Pozo. NIST Sparse BLAS: User’s Guide. Technical
report, NIST, 1996. gams.nist.gov/spblas.
[Saa94]
Yousef Saad. SPARSKIT: A basic toolkit for sparse matrix computations,
1994. www.cs.umn.edu/Research/arpa/SPARSKIT/sparskit.htm
[SL98]
Jeremy G. Siek and Andrew Lumsdaine. A rational approach to portable
high performance: the Basic Linear Algebra Instruction Set (BLAIS) and
the Fixed Algorithm Size Template (fast) library. In Proceedings of ECOOP,
1998.
[Sto97]
Paul Stodghill. A Relational Approach to the Automatic Generation of Sequential Sparse Matrix Codes. PhD thesis, Cornell University, August 1997.
[TJ92]
O. Temam and W. Jalby. Characterizing the behavior of sparse algorithms
on caches. In Proceedings of Supercomputing ’92, 1992.
[Tol97]
Sivan Toledo. Improving memory-system performance of sparse matrix-vector multiplication. In Proceedings of the 8th SIAM Conference on Parallel Processing for Scientific Computing, March 1997.
[Vel98]
Todd Veldhuizen. Arrays in Blitz++. In Proceedings of ISCOPE, volume
1505 of LNCS. Springer-Verlag, 1998.
[VFD01]
Sathish S. Vadhiyar, Graham E. Fagg, and Jack J. Dongarra. Towards an
accurate model for collective communications. In Proceedings of the International Conference on Computational Science, volume 2073 of LNCS,
pages 41–50, San Francisco, CA, May 2001. Springer.
[WPD01]
R. Clint Whaley, Antoine Petitet, and Jack Dongarra. Automated empirical optimizations of software and the ATLAS project. Parallel Computing,
27(1):3–25, 2001.
[WS97]
James B. White and P. Sadayappan. On improving the performance of
sparse matrix-vector multiplication. In Proceedings of the International
Conference on High-Performance Computing, 1997.