
Communication-Avoiding Parallel Strassen:
Implementation and Performance
Grey Ballard, James Demmel, Benjamin Lipshitz and Oded Schwartz
Sandia National Labs
UC Berkeley
Simons Institute Workshop
October 22, 2013
Research supported by Microsoft (Award #024263) and Intel (Award #024894) funding and by matching funding by U.C. Discovery (Award #DIG07-10227). Additional support comes from Par Lab affiliates National Instruments, NEC, Nokia, NVIDIA, and Samsung. Research is also supported by DOE grants DE-SC0003959, DE-SC0004938, and DE-AC02-05CH11231 and by the National Science Foundation under agreement DMS-0635607.
The Plan
I'll present a new parallel algorithm based on Strassen's matrix multiplication, called Communication-Avoiding Parallel Strassen (CAPS).
The new Strassen-based parallel algorithm CAPS
- is communication optimal: it matches the lower bounds [B., Demmel, Holtz, Schwartz '11]
- is faster, in theory and in practice
I'll also show performance results and talk about practical considerations for using Strassen and CAPS.
Strassen's algorithm is not just a theoretical idea: it can be practical in parallel and deserves further exploration.
Outline
1. Motivation
2. Lower Bounds
3. Algorithms
4. Performance
5. Practical Considerations
Motivation: Strassen's fast matrix multiplication (1969)
Strassen's original algorithm uses 7 multiplies and 18 adds for n = 2. Most importantly, it can be applied recursively.

Q1 = (A11 + A22) · (B11 + B22)
Q2 = (A21 + A22) · B11
Q3 = A11 · (B12 − B22)
Q4 = A22 · (B21 − B11)
Q5 = (A11 + A12) · B22
Q6 = (A21 − A11) · (B11 + B12)
Q7 = (A12 − A22) · (B21 + B22)

C11 = Q1 + Q4 − Q5 + Q7
C12 = Q3 + Q5
C21 = Q2 + Q4
C22 = Q1 − Q2 + Q3 + Q6

F(n) = 7 · F(n/2) + O(n²)  ⇒  F(n) = Θ(n^(log2 7)),  log2 7 ≈ 2.81
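The recursion above drops straight into code. Here is a minimal sequential sketch in Python/NumPy (not the paper's implementation), assuming square matrices whose dimension is even at every level above a hypothetical cutoff, where it falls back to the classical product:

```python
import numpy as np

def strassen(A, B, cutoff=64):
    """Recursive Strassen multiply; falls back to NumPy below the cutoff."""
    n = A.shape[0]
    if n <= cutoff or n % 2 != 0:
        return A @ B
    h = n // 2
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
    Q1 = strassen(A11 + A22, B11 + B22, cutoff)
    Q2 = strassen(A21 + A22, B11, cutoff)
    Q3 = strassen(A11, B12 - B22, cutoff)
    Q4 = strassen(A22, B21 - B11, cutoff)
    Q5 = strassen(A11 + A12, B22, cutoff)
    Q6 = strassen(A21 - A11, B11 + B12, cutoff)
    Q7 = strassen(A12 - A22, B21 + B22, cutoff)
    C = np.empty((n, n))
    C[:h, :h] = Q1 + Q4 - Q5 + Q7
    C[:h, h:] = Q3 + Q5
    C[h:, :h] = Q2 + Q4
    C[h:, h:] = Q1 - Q2 + Q3 + Q6
    return C

A, B = np.random.rand(256, 256), np.random.rand(256, 256)
print(np.allclose(strassen(A, B), A @ B))  # True up to roundoff
```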
Motivation: communication costs
Two kinds of costs:
- Arithmetic (FLOPs)
- Communication: moving data
  - between levels of a memory hierarchy (sequential case)
  - over a network connecting processors (parallel case)
Communication will only get more expensive relative to arithmetic.
Motivation: communication costs
γ = time per FLOP      F  = #FLOPs
β = time per word      BW = #words
α = time per message   L  = #messages

Running time = γ · F + β · BW + α · L
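The α-β-γ model is easy to turn into a throwaway calculator. A minimal sketch; the machine parameters and the classical-2D cost expressions below are illustrative placeholders, not measurements:

```python
def model_time(flops, words, messages,
               gamma=1e-11, beta=1e-9, alpha=1e-6):
    """Alpha-beta-gamma cost model: time = gamma*F + beta*BW + alpha*L.
    The default machine parameters are made-up placeholders."""
    return gamma * flops + beta * words + alpha * messages

# Example: classical 2D multiply of n x n matrices on P processors
n, P = 65856, 16384
flops_per_proc = 2 * n**3 / P        # arithmetic per processor
words_per_proc = 2 * n**2 / P**0.5   # O(n^2 / sqrt(P)) words moved
msgs_per_proc = 2 * P**0.5           # O(sqrt(P)) messages (e.g., Cannon)
print(model_time(flops_per_proc, words_per_proc, msgs_per_proc))
```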
Communication lower bounds for matrix multiplication

Classical (cubic):
- [Hong & Kung 81]: combinatorial proof, sequential only:
    BW = Ω( (n/√M)^(log2 8) · M )
- [Irony, Toledo, Tiskin 04]: geometric proof, sequential and parallel:
    BW = Ω( (n/√M)^(log2 8) · M )  sequentially,  Ω( (n/√M)^(log2 8) · M/P )  in parallel

Strassen and Strassen-like [B., Demmel, Holtz, Schwartz 11]: graph expansion proof, sequential and parallel:
- Strassen:      BW = Ω( (n/√M)^(log2 7) · M )  and  Ω( (n/√M)^(log2 7) · M/P )
- Strassen-like: BW = Ω( (n/√M)^(ω0) · M )      and  Ω( (n/√M)^(ω0) · M/P )

Memory-independent bounds [B., Demmel, Holtz, Lipshitz, Schwartz 12]:
- Strassen:  BW = Ω( n² / P^(2/log2 7) )
- Classical: BW = Ω( n² / P^(2/log2 8) )

Algorithms attaining these bounds?
- Strassen bounds: [B., Demmel, Holtz, Lipshitz, Schwartz 12] (CAPS)
- Classical bounds: [McColl & Tiskin 99]

n = matrix dimension, M = fast/local memory size, P = number of processors
Lessons from lower bounds
1. Don't use a classical algorithm for the communication: Strassen can communicate less than classical.
     Strassen:  Ω( (n/√M)^(log2 7) · M/P )
     Classical: Ω( (n/√M)^(log2 8) · M/P )
2. Use all available memory: the communication bound decreases with increased memory, and up to a factor of O(P^(1 − 2/log2 7)) extra memory is useful.
     Strassen: Ω( max{ (n/√M)^(log2 7) · M/P ,  n² / P^(2/log2 7) } )
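A quick way to see lesson 2 is to evaluate the Strassen bound as memory grows. A sketch with all constants dropped; the memory multiples mirror the units (n²/P per processor) used in the tuning figure later:

```python
import math

w = math.log2(7)  # exponent of Strassen's flop count

def strassen_bw_bound(n, P, M):
    """Parallel Strassen bandwidth lower bound, constants dropped:
    the max of the memory-dependent and memory-independent terms."""
    return max((n / math.sqrt(M))**w * M / P, n**2 / P**(2 / w))

n, P = 65856, 16807
for c in (3, 10, 30, 100):           # memory in multiples of n^2/P words
    M = c * n * n // P
    print(c, strassen_bw_bound(n, P, M))
# The bound falls as M grows, until the memory-independent term takes over.
```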
Simple “2D” Classical Algorithm
Here's the basic communication pattern for the classical “2D” algorithm:
[Figure: the blocks of A, B, and C distributed across a square grid of processors]
- 2D: think Cannon or SUMMA [Cannon 69; van de Geijn & Watts 97]
- 2.5D: think reduced communication by using more memory [Solomonik & Demmel 11]
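For intuition about the 2D pattern, here is a serial NumPy toy that mimics SUMMA's broadcast structure, one rank-b update per block step. It is an illustration of the communication pattern, not a distributed implementation:

```python
import numpy as np

def summa(A, B, p):
    """Serial simulation of SUMMA on a p x p processor grid.
    At step k, the k-th block column of A is 'broadcast' along processor
    rows and the k-th block row of B along processor columns; every
    processor then accumulates one local rank-b update."""
    n = A.shape[0]
    b = n // p                       # block size (assumes p divides n)
    C = np.zeros((n, n))
    for k in range(p):               # one broadcast round per block step
        Acol = A[:, k*b:(k+1)*b]     # block column of A (row broadcast)
        Brow = B[k*b:(k+1)*b, :]     # block row of B (column broadcast)
        C += Acol @ Brow             # each processor's local update
    return C

A, B = np.random.rand(512, 512), np.random.rand(512, 512)
print(np.allclose(summa(A, B, 8), A @ B))
```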
Previous parallel Strassen-based algorithms

2D-Strassen [Luo & Drake 95]:
- Run classical 2D inter-processor: same communication costs as classical 2D.
- Run Strassen locally: can't use Strassen on the full matrix size.

Strassen-2D [Luo & Drake 95; Grayson, Shah, van de Geijn 95]:
- Run Strassen inter-processor: this part can be done without communication.
- Then run classical 2D: communication costs grow exponentially with the number of Strassen steps.

Neither is communication optimal, even if you use 2.5D.
Main idea of CAPS algorithm
At each level of the recursion tree, choose either breadth-first or depth-first traversal of the 7 subproblems.

Breadth-First-Search (BFS):
- runs all 7 multiplies in parallel, each on P/7 processors
- requires 7/4 as much extra memory
- requires communication, but all-BFS minimizes communication if the memory is available

Depth-First-Search (DFS):
- runs all 7 multiplies sequentially, each on all P processors
- requires 1/4 as much extra memory
- no immediate communication, but increases bandwidth by a factor of 7/4 and latency by a factor of 7 per step
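One way to make the trade-off concrete is a greedy schedule that takes a BFS step whenever the 7/4 memory blow-up still fits. This is a simplified sketch, not CAPS's actual interleaving choice (which the tuning figure on the next slide explores), and the memory accounting ignores temporaries:

```python
def schedule(steps, n, P, M):
    """Greedy BFS/DFS interleaving (a simplified sketch, not CAPS's
    optimizer). Takes a BFS step whenever the 7/4 per-processor memory
    growth fits in M words; otherwise takes a DFS step."""
    plan = []
    mem = 3 * n * n / P              # words/processor to hold A, B, and C
    for _ in range(steps):
        if P % 7 == 0 and mem * 7 / 4 <= M:
            plan.append("BFS")       # 7 subproblems at once, P/7 procs each
            P //= 7
            mem *= 7 / 4
        else:
            plan.append("DFS")       # 7 subproblems in sequence on all P
            mem /= 4                 # half-size operands on the same P procs
    return plan

print(schedule(steps=10, n=351232, P=7**5, M=2**26))
```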
Tuning the choices of BFS and DFS steps
[Figure: words sent per processor (units of n²) vs. memory usage per processor (units of n²/P) for the Simple, Optimal Mixed, and Other Mixed interleavings]
The memory and communication costs of all (10 choose 5) = 252 possible interleavings of BFS and DFS steps for multiplying matrices of size n = 351,232 on P = 7^5 = 16,807 processors using 10 Strassen steps.
Asymptotic costs analysis
(flops and bandwidth cost per processor, up to constant factors)

Strassen-based:
  Lower bound:           flops n^(log2 7)/P;                   BW max{ n^(log2 7) / (P · M^((log2 7)/2 − 1)),  n² / P^(2/log2 7) }
  2D-Strassen:           flops n^(log2 7)/P^((log2 7 − 1)/2);  BW n² / P^(1/2)
  Strassen-2D (ℓ steps): flops (7/8)^ℓ · n³/P;                 BW (7/4)^ℓ · n² / P^(1/2)
  CAPS:                  flops n^(log2 7)/P;                   BW max{ n^(log2 7) / (P · M^((log2 7)/2 − 1)),  n² / P^(2/log2 7) }  (matches the lower bound)

Classical:
  Lower bound:  flops n³/P;  BW max{ n³ / (P · M^(1/2)),  n² / P^(2/3) }
  2D:           flops n³/P;  BW n² / P^(1/2)
  2.5D:         flops n³/P;  BW max{ n³ / (P · M^(1/2)),  n² / P^(2/3) }  (matches the lower bound)
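These leading-order terms are easy to compare numerically. A sketch with all constants dropped; the sample n, P, M, and ℓ are arbitrary:

```python
import math

w = math.log2(7)

def bw(alg, n, P, M, ell=1):
    """Leading-order bandwidth terms from the table above, constants dropped."""
    if alg == "CAPS":         # matches the Strassen lower bound
        return max(n**w / (P * M**(w/2 - 1)), n**2 / P**(2/w))
    if alg == "2D-Strassen":
        return n**2 / P**0.5
    if alg == "Strassen-2D":  # ell = number of Strassen steps
        return (7/4)**ell * n**2 / P**0.5
    if alg == "2.5D":         # matches the classical lower bound
        return max(n**3 / (P * M**0.5), n**2 / P**(2/3))
    if alg == "2D":
        return n**2 / P**0.5

n, P, M = 65856, 16384, 2**25
for alg in ("CAPS", "2D-Strassen", "Strassen-2D", "2.5D", "2D"):
    print(alg, bw(alg, n, P, M, ell=5))
```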
Performance of CAPS on large problems
Strong-scaling on Intrepid (IBM BG/P), n = 65,856.
[Figure: effective performance as a fraction of peak vs. number of cores (5e2 to 5e4) over the strong-scaling range, for CAPS, 2.5D-Strassen, 2D-Strassen, Strassen-2D, 2.5D, and 2D; CAPS exceeds the classical peak, with the actual and Strassen-Winograd peak lines marked for reference]
Performance: Model vs Actual
[Figure: effective performance (fraction of peak) vs. number of cores for CAPS, 2.5D, and 2D, each plotted against its performance model]
Comparison of the parallel models with the algorithms in strong scaling of matrix dimension n = 65,856 on Intrepid. The models assume no contention.
Performance of CAPS on large problems
Strong-scaling on Hopper (Cray XE6), n = 131,712.
[Figure: effective performance (fraction of peak) vs. number of cores (5e2 to 1e5) over the strong-scaling range, for CAPS, 2.5D-Strassen, 2D-Strassen, Strassen-2D, 2.5D, and 2D]
Performance of CAPS on small (comm-bound) problems
Strong-scaling on Intrepid (left) and Hopper (right), n = 4704.
[Figure: execution time (seconds) vs. number of cores for CAPS, 2.5D-Strassen, 2D-Strassen, Strassen-2D, 2.5D, and 2D on both machines]
Practical Considerations for Strassen
1. Harder to reach actual peak performance: the computation-to-communication ratio is smaller than classical.
2. Additions and multiplications are no longer balanced.
3. Architectures are based on powers of 2, not 7: CAPS prefers P = m · 7^k, while Intrepid requires allocating a power-of-two number of nodes.
4. Stability bounds are not as strong as for classical.
Stability: why you shouldn't worry
- CAPS has the same stability properties as any other Strassen (Strassen-Winograd) algorithm.
- Weaker stability guarantee than classical, but still norm-wise stable.
- This can be improved with techniques like diagonal scaling.
- Taking fewer Strassen steps improves the bound.
- Theoretical bounds are pessimistic in the typical case.
[Figure: max-norm error ‖C − A·B‖ / (‖A‖‖B‖) vs. number of Strassen steps (0 = classical, through 12): the theoretical bound grows with each Strassen step, while the actual error stays far below it]
Summary
The CAPS matrix multiplication algorithm
1. is communication optimal,
2. is faster, in theory and in practice,
3. can be practical, and should be used and improved.
Communication-Avoiding Parallel Strassen:
Implementation and Performance
Grey Ballard, James Demmel, Benjamin Lipshitz, and Oded Schwartz
Thank You!
www.eecs.berkeley.edu/~ballard
http://bebop.cs.berkeley.edu
Extra slides
1. Performance: Model vs Actual
2. Time breakdown
3. DFS vs BFS
4. BFS on 7 Processors
5. Sequential Performance
6. Data Layout
7. Strassen-Winograd Algorithm
8. Actual vs Effective Performance
9. Small problem on Franklin
10. Big problem on Franklin
11. Diagonal Scaling
12. Open Problems
Effective vs Actual Performance
[Figure: bar chart of effective and actual performance (fraction of peak) for CAPS, 2D-Strassen, and Strassen-2D at 0 through 6 Strassen steps]
Efficiency at various numbers of Strassen steps, n = 21952, on 49 nodes (196 cores) of Intrepid.
Communication-Free DFS
Possible if each processor owns corresponding entries of the four submatrices of A, B, and C. [Luo & Drake 95; Grayson, Shah, van de Geijn 95]
- Additions of submatrices of A to form the Ti (no communication)
- Additions of submatrices of B to form the Si (no communication)
- Recursive calls Qi = Ti · Si (communication deeper in the recursion tree)
- Additions of the Qi to form submatrices of C (no communication)
[Figure: local additions form T0, T1, … from the submatrices of A]
Communication Pattern of BFS
- Additions of submatrices of A, B to form the Ti, Si (no communication)
- Redistribution of the Ti, Si (communication)
- Recursive calls Qi = Ti · Si (communication deeper in the recursion tree)
- Redistribution of the Qi (communication)
- Additions of the Qi to form submatrices of C (no communication)
Redistributions are disjoint 7-way all-to-all communications.
[Figure: local additions form T0, T1, … from A, followed by communication to redistribute them]
BFS on 7 Processors
Requires 3 all-to-all communications, one for each of A, B, and C.
[Figure: all 7 processors compute local additions to form pieces of the Ti and Si; an all-to-all gathers Ti and Si onto processor i; each processor i computes Qi = Ti · Si locally; a final all-to-all redistributes the Qi for the local additions that form C]
Sequential Performance
[Figure, left: effective performance (fraction of peak) vs. matrix dimension (up to 5000) for the classical and Strassen models and data. Right: time breakdown (DGEMM, extra additions, other) for model vs. data at 0 through 3 Strassen steps.]
Left: comparison of the sequential model to the actual performance of classical and Strassen matrix multiplication on four cores (one node) of Intrepid. Right: time breakdown comparison between the sequential model and the data for n = 4097; both model and data times are normalized to the modeled classical algorithm time.
Data Layout
[Figure: the layout of the matrices across processors used by CAPS]
Strassen-Winograd Algorithm

C = A · B, with C = [C11 C12; C21 C22], A = [A11 A12; A21 A22], B = [B11 B12; B21 B22]

S0 = A11         T0 = B11
S1 = A12         T1 = B21
S2 = A21 + A22   T2 = B12 − B11
S3 = S2 − A11    T3 = B22 − T2
S4 = A11 − A21   T4 = B22 − B12
S5 = A12 − S3    T5 = B22
S6 = A22         T6 = T3 − B21

Qi = Si · Ti,  i = 0, …, 6

U1 = Q0 + Q3
U2 = U1 + Q4
U3 = U1 + Q2

C11 = Q0 + Q1
C12 = U3 + Q5
C21 = U2 − Q6
C22 = U2 + Q2

(7 multiplies, 15 additions)
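A one-step NumPy check of these formulas (recursion omitted; the block partitioning assumes even dimension):

```python
import numpy as np

def winograd_step(A, B):
    """One step of Strassen-Winograd (7 multiplies, 15 additions),
    following the S/T/Q/U formulas above."""
    h = A.shape[0] // 2
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
    S2 = A21 + A22; S3 = S2 - A11; S5 = A12 - S3
    T2 = B12 - B11; T3 = B22 - T2; T6 = T3 - B21
    Q0 = A11 @ B11; Q1 = A12 @ B21; Q2 = S2 @ T2
    Q3 = S3 @ T3;   Q4 = (A11 - A21) @ (B22 - B12)
    Q5 = S5 @ B22;  Q6 = A22 @ T6
    U1 = Q0 + Q3; U2 = U1 + Q4; U3 = U1 + Q2
    return np.block([[Q0 + Q1, U3 + Q5], [U2 - Q6, U2 + Q2]])

A, B = np.random.rand(128, 128), np.random.rand(128, 128)
print(np.allclose(winograd_step(A, B), A @ B))  # True up to roundoff
```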
Performance Breakdown: Model vs Actual
[Figure: time breakdown (DGEMM, extra additions, communication, reordering, other) for model vs. data at (P=49, n=4116), (P=49, n=16464), (P=2401, n=16464), and (P=2401, n=65856)]
Time breakdown comparison between the parallel model and data on Intrepid. In each case the entire modeled execution time is normalized to 1.
Performance on Franklin for small problem
[Figure: execution time (seconds) vs. number of cores (1e1 to 1e4) for n = 3136 on Franklin]
Performance of CAPS on large problem
Strong-scaling on Franklin (Cray XT4), n = 94,080.
[Figure: effective performance (fraction of peak) vs. number of cores (2e2 to 2e4) over the strong-scaling range, for CAPS, 2.5D-Strassen, 2D-Strassen, Strassen-2D, 2.5D, and 2D]
Sequential recursive Strassen is communication optimal
Run the Strassen algorithm recursively. When blocks are small enough to fit in local memory, work there, incurring no further bandwidth cost:

W(n, M) = 7 · W(n/2, M) + O(n²)   if 3n² > M
W(n, M) = O(n²)                    otherwise

The solution is W(n, M) = O( n^(ω0) / M^(ω0/2 − 1) ).
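The recurrence can be evaluated directly and compared against the closed form. A sketch with the O(n²) constants set to 1 and arbitrary sample sizes:

```python
import math

def W(n, M):
    """Bandwidth-cost recurrence for sequential recursive Strassen
    (constants in the O(n^2) terms set to 1 for illustration)."""
    if 3 * n * n <= M:
        return n * n             # block fits in fast memory: read it once
    return 7 * W(n // 2, M) + n * n

n, M = 2**12, 2**16
w0 = math.log2(7)
print(W(n, M), n**w0 / M**(w0/2 - 1))  # same order of magnitude
```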
Diagonal Scaling

Outside scaling: scale so each row of A and each column of B has unit norm. Explicitly:
- Let D^A_ii = ‖A(i,:)‖⁻¹ and D^B_jj = ‖B(:,j)‖⁻¹.
- Scale A′ = D^A · A and B′ = B · D^B.
- Use Strassen for the product C′ = A′ · B′.
- Unscale C = (D^A)⁻¹ · C′ · (D^B)⁻¹.

Inside scaling: scale so each column of A has the same norm as the corresponding row of B. Explicitly:
- Let D_ii = ( ‖A(:,i)‖ / ‖B(i,:)‖ )^(−1/2).
- Scale A′ = A · D and B′ = D⁻¹ · B.
- Use Strassen for the product C = A′ · B′ (no unscaling needed).
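Both scalings are a few lines of NumPy. A sketch in which the `multiply` argument stands in for a Strassen routine (np.matmul here, so the check is exact):

```python
import numpy as np

def outside_scale(A, B, multiply=np.matmul):
    """Outside diagonal scaling: unit-norm rows of A and columns of B."""
    dA = np.linalg.norm(A, axis=1)           # row norms of A
    dB = np.linalg.norm(B, axis=0)           # column norms of B
    Cs = multiply(A / dA[:, None], B / dB[None, :])
    return Cs * dA[:, None] * dB[None, :]    # unscale the product

def inside_scale(A, B, multiply=np.matmul):
    """Inside diagonal scaling: column i of A gets the same norm as row i of B."""
    d = np.sqrt(np.linalg.norm(B, axis=1) / np.linalg.norm(A, axis=0))
    return multiply(A * d[None, :], B / d[:, None])  # no unscaling needed

A, B = np.random.rand(100, 100), np.random.rand(100, 100)
print(np.allclose(outside_scale(A, B), A @ B),
      np.allclose(inside_scale(A, B), A @ B))
```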
Stability: easy case
[Figure: max_ij |Ĉij − Cij| / (|A|·|B|)ij vs. number of Strassen steps (0 to 10) for a product of matrices with all-positive entries, comparing no scaling, outer, inner, outer-inner, and inner-outer scaling]
Stability: more interesting case
[Figure: the same error measure vs. number of Strassen steps (0 to 10) for a product whose factors have mixed-sign entries, comparing no scaling, outer, inner, outer-inner, and inner-outer scaling]
Stability: problems scaling can't fix
[Figure: the same error measure vs. number of Strassen steps (0 to 10) for a mixed-sign product on which none of the scaling variants reduces the error]
Discussion / open problems
- Our parallelization approach extends to other matrix multiplication algorithms:
  - classical matrix multiplication (matching the 2.5D algorithm)
  - other fast matrix multiplication algorithms
- And to other algorithms with recursive formulations?
- Make use of CAPS within other linear algebra algorithms.
Performance of CAPS on large problems
Strong-scaling on Intrepid (IBM BG/P), n = 65,856.
[Figure: effective performance (fraction of peak, up to 4.5) vs. number of cores over the strong-scaling range for CAPS, 2.5D-Strassen, 2D-Strassen, Strassen-2D, 2.5D, and 2D, with the actual peak, classical peak, and Strassen-Winograd peak lines marked]
Performance: Model vs Actual
[Figure: effective performance (fraction of peak) vs. number of cores for CAPS, 2.5D, and 2D against the CAPS, 2.5D, and 2D models, plus a no-contention CAPS model]
Comparison of the parallel models with the algorithms in strong scaling of matrix dimension n = 65,856 on Intrepid.