BeBOP
Automatic Performance Tuning of Sparse Matrix Kernels
ƒ Consider sparse matrix-vector multiply (SpMxV)
ƒ Implementation space
• Set of r x c block sizes
• Fill in explicit zeros
ƒ Search
• Run-time estimation (when matrix is known)
• Evaluate code against architecture-specific upper bounds

[Figure: 3 x 3 Register Blocking Example — (688 true non−zeros) + (383 explicit zeros) = 1071 nz.]

[Figure: SpMxV performance (Mflop/s) vs. row block size (r) and column block size (c), r, c ∈ {1, 2, 3, 6}, each cell labeled by speedup over the 1 x 1 code.]
ƒ Implementation space
• Conceptually, the set of “interesting” implementations
• Depends on kernel and input
• May vary: instruction mix and order, memory access patterns, data structures and precisions, mathematical formulation, …
ƒ Successful examples
• Dense linear algebra: ATLAS/PHiPAC
• Signal processing: FFTW; SPIRAL
• MPI collective operations (Vadhiyar & Dongarra, 2001)
[Figure: Performance Summary [power3−aix] — Mflop/s on 44 benchmark matrices (classes: D, FEM, FEM (var. blk.), Mixed, LP).]
[Figure: Matrix Splitting: A = Ar×c + A1×1 [ultra−solaris] — Mflop/s on FEM (var. blk.), Mixed, and LP matrices.]

[Figure: Reference vs. cache blocked performance (Mflop/s).]
Sparse triangular solve—partition the factor into sparse and dense parts:

⎛ L11         ⎞ ⎛ x1 ⎞   ⎛ b1 ⎞
⎜             ⎟ ⎜    ⎟ = ⎜    ⎟
⎝ L21  Ldense ⎠ ⎝ x2 ⎠   ⎝ b2 ⎠

1. Sparse solve:   L11 x1 = b1
2. Sparse matvec:  b̂2 = b2 − L21 x1
3. Dense solve:    Ldense x2 = b̂2

Sparse triangular solve—(Top-left) Triangular factors from sparse LU often have a large dense trailing triangle. (Top-right) The matrix can be partitioned into sparse (L11, L21) and dense (Ldense) parts. (Bottom) Performance improvements from register blocking the sparse part, and calling a tuned vendor BLAS routine (TRSM) for the dense solve step. [ICS/POHLL ’02]

[Figure annotation: in the example shown, the dense trailing triangle holds about 20% of the non-zeros.]

[Figure: Sparse Triangular Solve: Performance Summary [ultra−solaris] — Reference, Reg. Blocking (RB), Switch−to−Dense (S2D), RB + S2D, with analytic upper/lower bounds and a PAPI upper bound, on matrices dense, memplus, wang4, ex11, raefsky4, goodwin, lhr10.]

Cache blocking—Performance on a Pentium 4 for information retrieval and linear programming matrices (LSI Matrix, LP−Italian Railways, LP−osa60, Exp. Design (BIBD)), with up to 5x speedups.

[Figure: Post−TSP Reordering: 17−rim.rua (1, 1) −− (100, 100).]
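The three-step partitioned solve can be sketched in a few lines. This is a minimal illustration, not the tuned implementation: the CSR layout, the helper names, and the plain-Python `dense_forward_solve` (standing in for a vendor BLAS TRSM call) are all choices made here for clarity.

```python
def csr_forward_solve(vals, cols, rowptr, b):
    """Solve L x = b for lower-triangular L in CSR form.
    Assumes the diagonal entry is the last stored entry in each row."""
    n = len(b)
    x = [0.0] * n
    for i in range(n):
        s = b[i]
        for j in range(rowptr[i], rowptr[i + 1] - 1):
            s -= vals[j] * x[cols[j]]
        x[i] = s / vals[rowptr[i + 1] - 1]
    return x

def csr_matvec(vals, cols, rowptr, x, n_rows):
    """y = A x for a CSR matrix A with n_rows rows."""
    y = [0.0] * n_rows
    for i in range(n_rows):
        for j in range(rowptr[i], rowptr[i + 1]):
            y[i] += vals[j] * x[cols[j]]
    return y

def dense_forward_solve(L, b):
    """Dense lower-triangular solve; a tuned TRSM would be called here."""
    n = len(b)
    x = [0.0] * n
    for i in range(n):
        x[i] = (b[i] - sum(L[i][j] * x[j] for j in range(i))) / L[i][i]
    return x

def split_trisolve(L11, L21, Ldense, b1, b2):
    """Three-step solve for the partitioned system above.
    L11, L21 are (vals, cols, rowptr) CSR triples; Ldense is a dense list-of-rows."""
    x1 = csr_forward_solve(*L11, b1)                 # 1. sparse solve
    t = csr_matvec(*L21, x1, len(b2))                # 2. sparse matvec
    b2_hat = [b2[i] - t[i] for i in range(len(b2))]
    x2 = dense_forward_solve(Ldense, b2_hat)         # 3. dense solve
    return x1 + x2
```

In the real kernel, step 1 would use the register-blocked sparse code and step 3 a vendor TRSM; the control flow is exactly the three steps shown.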
Experimental results—Performance (Mflop/s) on a set of 44 benchmark matrices from a variety of applications. Speedups of 2.5x are possible. [SC’02]

[Figure: 17−rim.rua (1, 1) −− (100, 100) — Reference, Reg. Blocking, Splitting.]
Approach to Automatic Tuning

For each kernel, identify and generate a space of implementations, and search for the best one.

ƒ Search using models and experiments
• Either off-line, on-line, or a combination

ƒ Sparse matrix-vector multiply
• Register-level blocking (up to 2.5x speedups)
• Symmetry (up to 2x speedup)
• Diagonals, bands (up to 2.2x)
• Splitting for variable block structure (1.3x—1.7x)
• Reordering to create dense blocks + splitting (up to 2x)
• Cache blocking (1.5x—5x)
• Multiple vectors (2x—7x)
• And combinations…
ƒ Sparse triangular solve
• Hybrid sparse/dense data structure (1.2x—1.8x)
ƒ Higher-level sparse kernels
• AATx, ATAx (1.2x—4.2x)
• RART, Akx, …

Multiple vectors—Significant speedups are possible when multiplying by several vectors (800 MHz Itanium; DGEMM n=k=2000, m=32).

[Figure: Multiple Vector Performance [itanium−ecc] — SpMxV with multiple vectors vs. DGEMV and DGEMM.]

[Figure: Cache Blocking Performance [Pentium 4−1.5 GHz, Intel C v6.0].]

[Figure: Performance Summary [itanium−linux−ecc] — Analytic upper bound, Upper bound (PAPI), Sparsity (exhaustive), Sparsity (heuristic), Reference; Best r×c, Sparsity, Sparsity + new heur., Sparsity 1×1, Naive (NIST).]

Splitting and reordering—(Left) Speedup after splitting a matrix (possibly after reordering) into a blocked part and an unblocked part to avoid fill. (Middle, right) Dense blocks can be created by a judicious reordering—on matrix 17, we used a traveling salesman problem formulation due to Pinar (1997).

Architecture dependence—Sixteen r x c blocked compressed sparse row implementations of SpMxV, each color coded by performance (Mflop/s) and labeled by speedup over the unblocked (1 x 1) code for the sparse matrix above on two platforms: 333 MHz Ultra 2i (left) and 800 MHz Itanium (right). The best block size is not always 6x6!
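As an illustration of register-level blocking, here is a minimal sketch of SpMxV over an r x c block compressed sparse row (BCSR) matrix. The layout shown (flat block values, row-major within each block) is one common convention, not necessarily the exact data structure used in SPARSITY; the point is that the inner r x c loop nest has fixed bounds the compiler can unroll into registers.

```python
def bcsr_spmv(r, c, bvals, bcols, browptr, x):
    """y = A x for an r x c block CSR matrix.
    bvals:   flat list of r*c-entry dense blocks, row-major within each block
    bcols:   column index (in units of c columns) of each block
    browptr: start of each block row's blocks in bcols/bvals"""
    n_brows = len(browptr) - 1
    y = [0.0] * (n_brows * r)
    for bi in range(n_brows):
        for k in range(browptr[bi], browptr[bi + 1]):
            base = k * r * c          # start of this block's values
            x0 = bcols[k] * c         # start of the matching x segment
            for ii in range(r):       # fixed-size r x c kernel: unrollable
                s = 0.0
                for jj in range(c):
                    s += bvals[base + ii * c + jj] * x[x0 + jj]
                y[bi * r + ii] += s
    return y
```

Filling in explicit zeros to complete partial blocks (the fill ratio discussed elsewhere on this poster) is what makes this format applicable to general sparsity patterns.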
Sparse matrix example—A 6x6 blocked storage format appears to be the most natural choice for this matrix for a sparse matrix-vector multiply (SpMxV) implementation…

Filling in zeros—True non-zeros (•) and explicit zeros (•); fill ratio = 1.5.

[Figure: Blocking Performance (Mflop/s) [03−olafu.rua; itanium−linux−ecc] — best block size on Itanium: 3x1.]

[Figure: Register Blocking Performance (Mflop/s) [ Dense (n=1000); itanium−linux−ecc].]

[Figure: ATAx Performance [ultra−solaris].]
BeBOP: Current and Future Work
Understanding the impact on higher-level
kernels, algorithms, and applications.
y = AATx = (a1 ⋯ am) ⎛ a1T ⎞
                     ⎜  ⋮  ⎟ x = ∑_{k=1}^{m} ak (akT x)
                     ⎝ amT ⎠

[Figure legend: register-blocked ATAx — 4x1 (2.49), 3x1 (1.99), 2x1 (1.50).]
ƒ Design and implementation of a library based on the Sparse
BLAS; new heuristics for efficiently choosing optimizations.
ƒ Study of performance implications for higher-level
algorithms (e.g., block Lanczos)
ƒ New sparse kernels (e.g., powers Ak, triple product RART)
ƒ Integrating with applications (e.g., DOE SciDAC codes)
ƒ Further automation: generating implementation generators
ƒ Using bounds to evaluate current and future architectures
[Figure: Performance Summary: Web Subgraph (1M x 1M, 3.1M nz) [Itanium 2−900, Intel C v7.0] — Reference vs. register blocked.]

[Figure: Blocking Performance (Mflop/s) [03−olafu.rua; ultra−solaris] — Upper bound, Upper bound (PAPI), Cache + Reg (best), Cache + Reg (heuristic), Reg only, Cache only, Reference.]

Additional techniques for y=Ax, sparse triangular solve, and ATAx.
Exploiting Matrix Structure

[Figure: Spy Plot (zoom−in): 03−olafu.rua — best block size on Ultra 2i: 6x6.]

Off-line benchmarking—Performance (Mflop/s) for a dense matrix in sparse format on four architectures (clockwise from upper-left): Ultra 2i-333, Pentium III-500, Power3-375, Itanium-800. Performance is a strong function of the hardware platform.
Example: Choosing a Block Size

ƒ Goal: Automatic tuning of sparse kernels
• Choose best data structure and implementation for given kernel, sparse matrix, and machine
• Performance depends on architecture, kernel, and matrix

ƒ Search
• Off-line benchmarking (once per architecture)
o Measure Dense Performance (r,c), in Mflop/s, of a dense matrix stored in sparse r x c format
• Run-time estimation (the matrix is known only at run-time, in general)
o Estimate Fill Ratio (r,c): (# stored non-zeros) / (# true non-zeros)
o Choose r x c to maximize

    Estimated Performance (r,c) = Dense Performance (r,c) / Fill Ratio (r,c)

[Figure: Register Blocking Performance (Mflop/s) [dense−regprof−2k; power3−aix]; example matrix with 1188 non−zeros.]

Observations
• Performance depends strongly on both the matrix and hardware platform.
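The Dense Performance / Fill Ratio heuristic is simple enough to sketch directly. The exhaustive fill computation below is a simplification for illustration (in practice the fill ratio is estimated cheaply, e.g. from a sample of rows rather than every non-zero), and the function names are ours, not SPARSITY's API.

```python
def estimate_fill_ratio(coords, r, c):
    """Fill Ratio(r,c) = (# stored non-zeros) / (# true non-zeros):
    every occupied r x c block stores all r*c entries explicitly.
    coords is the list of (row, col) positions of true non-zeros."""
    occupied_blocks = {(i // r, j // c) for (i, j) in coords}
    return len(occupied_blocks) * r * c / len(coords)

def choose_block_size(coords, dense_mflops):
    """Pick (r,c) maximizing Dense Performance(r,c) / Fill Ratio(r,c).
    dense_mflops maps (r,c) -> Mflop/s measured once, off-line, for a
    dense matrix stored in sparse r x c format on this machine."""
    best, best_perf = (1, 1), 0.0
    for (r, c), mflops in dense_mflops.items():
        est = mflops / estimate_fill_ratio(coords, r, c)
        if est > best_perf:
            best, best_perf = (r, c), est
    return best, best_perf
```

For a perfectly 2x2-blocked pattern the fill ratio is 1.0 and the faster 2x2 profile wins; for a diagonal pattern the 2x2 fill ratio of 2.0 cancels the dense-profile advantage and 1x1 is chosen, which is exactly the trade-off the heuristic is meant to capture.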
Application matrices: Web connectivity matrix—Speeding up SpMxV for a web subgraph (left) using register blocking and reordering (right) on a 900 MHz Itanium 2; orderings compared: WebBase−1M (natural), WebBase−1M (RCM), WebBase−1M (MMD).
Problem Context

ƒ Challenges in tuning sparse code
• Typical uniprocessor performance < 10% peak
o Indirect, irregular memory accesses
o High bandwidth requirements, poor instruction mix
• Hardware complexity is increasing
o Microprocessor performance difficult to model
o Widening processor-memory gap; deep memory hierarchies

Sparse kernel performance depends on both the matrix and hardware platform.

The SPARSITY system (Im & Yelick, 1999) applies the methodology to y=Ax.

[Figure: Register Blocking Performance (Mflop/s) [ Dense (n=1000); pentium3−linux−icc] and [ Dense (n=1000); ultra−solaris].]

[Figure: Performance Summary [pentium3−linux−icc] and [ultra−solaris] — Analytic upper bound, Upper bound (PAPI), Sparsity (exhaustive), Sparsity (heuristic), Reference.]

Berkeley Benchmarking and OPtimization Group
bebop.cs.berkeley.edu

Richard Vuduc, Eun-Jin Im, James Demmel, Katherine Yelick,
Attila Gyulassy, Chris Hsu, Shoaib Kamil, Benjamin Lee, Hyun-Jin Moon, Rajesh Nishtala
Sparse AATx, ATAx—(Left) A can be brought through the memory hierarchy only once: for each column ak of A, compute a dot product followed by a vector scale (“axpy”). (Right) This cache-optimized implementation can be naturally combined with register blocking.
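The dot-then-axpy structure of the single-pass AATx kernel can be sketched as follows. This is an unblocked illustration under an assumed column-wise (CSC-style) storage of A; the tuned version would add register blocking to the inner loops as described above.

```python
def aat_x(col_vals, col_rows, colptr, x, n_rows):
    """y = A (A^T x), streaming each column a_k of A through memory once:
    t = a_k . x (dot product), then y += t * a_k (axpy).
    A is stored by columns: col_vals/col_rows hold the non-zeros of
    column k in positions colptr[k]:colptr[k+1]."""
    y = [0.0] * n_rows
    m = len(colptr) - 1
    for k in range(m):
        lo, hi = colptr[k], colptr[k + 1]
        t = 0.0
        for p in range(lo, hi):            # dot product: t = a_k^T x
            t += col_vals[p] * x[col_rows[p]]
        for p in range(lo, hi):            # axpy: y += t * a_k
            y[col_rows[p]] += t * col_vals[p]
    return y
```

Because both loops over column k touch the same non-zeros while they are still in cache, A is read from main memory only once, instead of twice for separate A^T x and A·(·) passes.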