Communication-Avoiding Parallel Strassen: Implementation and Performance

Grey Ballard, James Demmel, Benjamin Lipshitz, and Oded Schwartz
Sandia National Labs / UC Berkeley

Simons Institute Workshop, October 22, 2013

Research supported by Microsoft (Award #024263) and Intel (Award #024894) funding and by matching funding by U.C. Discovery (Award #DIG07-10227). Additional support comes from Par Lab affiliates National Instruments, NEC, Nokia, NVIDIA, and Samsung. Research is also supported by DOE grants DE-SC0003959, DE-SC0004938, and DE-AC02-05CH11231 and by the National Science Foundation under agreement DMS-0635607.

The Plan

- I'll present a new parallel algorithm based on Strassen's matrix multiplication, called Communication-Avoiding Parallel Strassen (CAPS).
- CAPS is communication optimal: it matches the lower bounds of [B., Demmel, Holtz, Schwartz '11].
- CAPS is faster, both in theory and in practice.
- I'll also show performance results and discuss practical considerations for using Strassen and CAPS.
- The takeaway: Strassen's algorithm is not just a theoretical idea. It can be practical in parallel, and it deserves further exploration.

Outline

1. Motivation
2. Lower Bounds
3. Algorithms
4. Performance
5. Practical Considerations

Motivation: Strassen's fast matrix multiplication (1969)

Strassen's original algorithm multiplies 2x2 matrices with 7 multiplies and 18 adds. Most importantly, it can be applied recursively to the blocks:

Q1 = (A11 + A22) · (B11 + B22)
Q2 = (A21 + A22) · B11
Q3 = A11 · (B12 − B22)
Q4 = A22 · (B21 − B11)
Q5 = (A11 + A12) · B22
Q6 = (A21 − A11) · (B11 + B12)
Q7 = (A12 − A22) · (B21 + B22)

C11 = Q1 + Q4 − Q5 + Q7
C12 = Q3 + Q5
C21 = Q2 + Q4
C22 = Q1 − Q2 + Q3 + Q6

The flop count satisfies F(n) = 7·F(n/2) + O(n^2), so F(n) = Θ(n^{log2 7}), where log2 7 ≈ 2.81.
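To make the recursion concrete, here is a minimal sketch of recursive Strassen in Python/NumPy (mine, not from the talk). It assumes the dimension is a power of two and falls back to the classical product below a cutoff, where a real implementation would call a tuned BLAS dgemm:

```python
import numpy as np

def strassen(A, B, cutoff=64):
    """Recursive Strassen (1969 variant): assumes square matrices with
    power-of-two dimension; below the cutoff, use the classical product."""
    n = A.shape[0]
    if n <= cutoff:
        return A @ B
    m = n // 2
    A11, A12, A21, A22 = A[:m, :m], A[:m, m:], A[m:, :m], A[m:, m:]
    B11, B12, B21, B22 = B[:m, :m], B[:m, m:], B[m:, :m], B[m:, m:]
    # The seven recursive products from the slide above
    Q1 = strassen(A11 + A22, B11 + B22, cutoff)
    Q2 = strassen(A21 + A22, B11, cutoff)
    Q3 = strassen(A11, B12 - B22, cutoff)
    Q4 = strassen(A22, B21 - B11, cutoff)
    Q5 = strassen(A11 + A12, B22, cutoff)
    Q6 = strassen(A21 - A11, B11 + B12, cutoff)
    Q7 = strassen(A12 - A22, B21 + B22, cutoff)
    C = np.empty_like(A)
    C[:m, :m] = Q1 + Q4 - Q5 + Q7
    C[:m, m:] = Q3 + Q5
    C[m:, :m] = Q2 + Q4
    C[m:, m:] = Q1 - Q2 + Q3 + Q6
    return C
```

One recursion level trades one block multiplication for 18 block additions, which is exactly where the F(n) = 7·F(n/2) + O(n^2) recurrence comes from.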
Motivation: communication costs

Two kinds of costs:
- Arithmetic (flops)
- Communication: moving data
  - between levels of a memory hierarchy (sequential case)
  - over a network connecting processors (parallel case)

Communication will only get more expensive relative to arithmetic.

Motivation: communication costs

γ = time per flop        F  = #flops
β = time per word        BW = #words moved
α = time per message     L  = #messages

Running time = γ·F + β·BW + α·L
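To see what the model implies, here is a small script (illustrative only; the γ, β, α values and the workload are made-up numbers of my choosing, not measured machine parameters) that evaluates the three terms for one processor's share of a classical 2D multiply:

```python
def model_time(F, BW, L, gamma=1e-11, beta=2e-9, alpha=1e-6):
    """Running time under the alpha-beta-gamma model above.
    F = flops per processor, BW = words moved, L = messages.
    The default gamma/beta/alpha values are illustrative only."""
    return gamma * F + beta * BW + alpha * L

# Example: one processor's share of a classical 2D multiply
# (n and P chosen for illustration).
n, P = 65856, 16384
F  = 2 * n**3 / P          # flops per processor
BW = 2 * n**2 / P**0.5     # words sent/received (2D algorithm)
L  = 2 * P**0.5            # messages (one per SUMMA-style round)
print(f"flop term   : {1e-11 * F:.3f} s")
print(f"word term   : {2e-9 * BW:.3f} s")
print(f"message term: {1e-6 * L:.6f} s")
print(f"total       : {model_time(F, BW, L):.3f} s")
```

With these illustrative parameters the word term is already within a small factor of the flop term, which is exactly the regime where communication-avoiding algorithms pay off.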
Communication lower bounds for matrix multiplication

Throughout: n = matrix dimension, M = fast/local memory size, P = number of processors.

Classical (cubic):
- [Hong & Kung '81], combinatorial proof, sequential only: BW = Ω( (n/√M)^{log2 8} · M )
- [Irony, Toledo, Tiskin '04], geometric proof, sequential and parallel: BW = Ω( (n/√M)^{log2 8} · M/P )

[B., Demmel, Holtz, Schwartz '11], graph expansion proof, sequential and parallel:
- Strassen: BW = Ω( (n/√M)^{log2 7} · M ) sequentially, and Ω( (n/√M)^{log2 7} · M/P ) in parallel
- Strassen-like (recursive algorithms with exponent ω0): BW = Ω( (n/√M)^{ω0} · M ) and Ω( (n/√M)^{ω0} · M/P )

Memory-independent bounds [B., Demmel, Holtz, Lipshitz, Schwartz '12]:
- Strassen: BW = Ω( n^2 / P^{2/log2 7} )
- Classical: BW = Ω( n^2 / P^{2/log2 8} )

Algorithms attaining these bounds? For the classical bounds, yes (see the 2.5D algorithm below). For the Strassen bounds: [McColl & Tiskin '99] and [B., Demmel, Holtz, Lipshitz, Schwartz '12].

Lessons from lower bounds

1. Don't use a classical algorithm for the communication. Strassen can communicate less than classical:
   Strassen: Ω( (n/√M)^{log2 7} · M/P )    Classical: Ω( (n/√M)^{log2 8} · M/P )

2. Use all available memory. The communication bound decreases with increased memory, and up to a factor of O( P^{1 − 2/log2 7} ) extra memory is useful:
   Strassen: Ω( max( (n/√M)^{log2 7} · M/P, n^2 / P^{2/log2 7} ) )

Simple "2D" classical algorithm

[Figure: the basic communication pattern of the classical "2D" algorithm across the A, B, and C matrices.]

- 2D: think Cannon or SUMMA [Cannon '69; van de Geijn & Watts '97]
- 2.5D: think reduced communication by using more memory [Solomonik & Demmel '11]

Previous parallel Strassen-based algorithms

2D-Strassen [Luo & Drake '95]:
- Run classical 2D inter-processors; same communication costs as classical 2D.
- Run Strassen locally; can't use Strassen on the full matrix size.

Strassen-2D [Luo & Drake '95; Grayson, Shah, van de Geijn '95]:
- Run Strassen inter-processors; this part can be done without communication.
- Then run classical 2D; communication costs grow exponentially with the number of Strassen steps.

Neither is communication optimal, even if you use 2.5D in place of 2D.

Main idea of the CAPS algorithm

At each level of the recursion tree, choose either a breadth-first (BFS) or a depth-first (DFS) traversal (see the cost sketch after these lists).

Breadth-First-Search (BFS):
- Runs all 7 multiplies in parallel, each using P/7 processors.
- Requires 7/4 as much extra memory.
- Requires immediate communication, but taking all-BFS steps minimizes communication when memory permits.

Depth-First-Search (DFS):
- Runs all 7 multiplies sequentially, each using all P processors.
- Requires only 1/4 as much extra memory.
- No immediate communication, but each DFS step increases the bandwidth cost by a factor of 7/4 and the latency cost by a factor of 7.
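The tradeoff can be made quantitative with a toy cost model (entirely my own simplification, built from the 7/4 and 1/4 factors quoted above; it is not the model used in the paper): a BFS step redistributes the current operands and grows the per-processor footprint by 7/4, while a DFS step is communication-free but runs its seven quarter-sized subproblems one after another, so everything below it happens seven times.

```python
def schedule_costs(n, P, schedule):
    """Rough per-processor bandwidth and memory for a BFS/DFS
    interleaving. Simplified model (mine): a BFS step communicates
    the current shares and splits the processors by 7; a DFS step
    communicates nothing now but serializes 7 subproblems."""
    words = 0.0
    share = 3.0 * n * n / P   # words for this level's A, B, C shares
    mem = share
    mults = 1.0               # how many times this subtree executes
    for step in schedule:
        if step == "BFS":
            words += mults * share   # redistribute the T_i, S_i, Q_i
            share *= 7.0 / 4.0       # 3(n/2)^2 / (P/7)
            P /= 7
        else:                        # DFS
            mults *= 7.0             # 7 sequential subproblems
            share *= 1.0 / 4.0       # 3(n/2)^2 / P
        n /= 2
        mem += share                 # temporaries at the next level
    return words, mem

# Two of the 252 valid 10-step schedules from the next slide
# (each needs exactly 5 BFS steps to reach P = 1 from P = 7^5):
for name, sched in [("BFS first", ["BFS"] * 5 + ["DFS"] * 5),
                    ("DFS first", ["DFS"] * 5 + ["BFS"] * 5)]:
    w, m = schedule_costs(351232, 7**5, sched)
    print(f"{name}: {w:.3e} words moved, {m:.3e} words of memory")
```

Running this shows the tension the next slide tunes: doing the DFS steps first cuts the memory footprint but inflates the words moved by roughly (7/4)^5, since every deferred BFS redistribution is repeated by the serialized subproblems.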
Tuning the choices of BFS and DFS steps

[Plot: words sent per processor (units of n^2) vs. memory usage per processor (units of n^2/P), with curves for the simple, optimal mixed, and other mixed interleavings.] The memory and communication costs of all C(10,5) = 252 possible interleavings of 5 BFS and 5 DFS steps for multiplying matrices of size n = 351,232 on P = 7^5 = 16,807 processors using 10 Strassen steps.

Asymptotic costs analysis

Strassen-based algorithms (ℓ = number of Strassen steps taken):

Algorithm    | Flops                           | Bandwidth cost
-------------|---------------------------------|---------------------------------------------------------------
Lower bound  | n^{log2 7} / P                  | max( n^{log2 7} / (P·M^{(log2 7)/2 − 1}), n^2 / P^{2/log2 7} )
2D-Strassen  | n^{log2 7} / P^{(log2 7 − 1)/2} | n^2 / P^{1/2}
Strassen-2D  | (7/8)^ℓ · n^3 / P               | (7/4)^ℓ · n^2 / P^{1/2}
CAPS         | n^{log2 7} / P                  | max( n^{log2 7} / (P·M^{(log2 7)/2 − 1}), n^2 / P^{2/log2 7} )

Classical algorithms:

Algorithm    | Flops   | Bandwidth cost
-------------|---------|----------------------------------------
Lower bound  | n^3 / P | max( n^3 / (P·M^{1/2}), n^2 / P^{2/3} )
2D           | n^3 / P | n^2 / P^{1/2}
2.5D         | n^3 / P | max( n^3 / (P·M^{1/2}), n^2 / P^{2/3} )

Performance of CAPS on large problems

Strong scaling on Intrepid (IBM BG/P), n = 65,856. [Plot: effective performance as a fraction of peak vs. number of cores for CAPS, 2.5D-Strassen, 2D-Strassen, Strassen-2D, 2.5D, and 2D; the strong-scaling range is marked. CAPS exceeds the classical machine peak, approaching the Strassen-Winograd effective peak.]

Performance: model vs. actual

[Plot: the parallel models (CAPS, 2.5D, 2D) compared with the measured algorithms, strong scaling at matrix dimension n = 65,856 on Intrepid; the models assume no network contention.]

Performance of CAPS on large problems

Strong scaling on Hopper (Cray XE6), n = 131,712. [Plot: effective performance vs. number of cores for the same six algorithms.]

Performance of CAPS on small (communication-bound) problems

Strong scaling on Intrepid (left) and Hopper (right), n = 4704. [Plots: execution time in seconds vs. number of cores for CAPS, 2.5D-Strassen, 2D-Strassen, Strassen-2D, 2.5D, and 2D.]

Practical considerations for Strassen

1. It is harder to reach actual peak performance: the computation-to-communication ratio is smaller than for classical.
2. Additions and multiplications are no longer balanced.
3. Architectures are based on powers of 2, not 7: CAPS prefers P = m·7^k, while Intrepid requires allocating a power-of-two number of nodes.
4. Stability bounds are not as strong as for classical.

Stability: why you shouldn't worry

- CAPS has the same stability properties as any other Strassen (Strassen-Winograd) algorithm.
- The guarantee is weaker than for classical, but it is still norm-wise stable.
- It can be improved with techniques like diagonal scaling.
- Taking fewer Strassen steps improves the bound.
- Theoretical bounds are pessimistic in the typical case. [Plot: max-norm error ‖C − A·B‖ / (‖A‖·‖B‖) vs. number of Strassen steps; the theoretical bound grows with each step while the measured error stays far below it, and diagonal scaling improves it further.]

Summary

The CAPS matrix multiplication algorithm:
1. is communication optimal,
2. is faster, in theory and in practice,
3. can be practical, and should be used and improved.

Communication-Avoiding Parallel Strassen: Implementation and Performance
Grey Ballard, James Demmel, Benjamin Lipshitz, and Oded Schwartz

Thank You!
www.eecs.berkeley.edu/~ballard
http://bebop.cs.berkeley.edu

Extra slides

1. Performance: model vs. actual
2. Time breakdown
3. DFS vs. BFS
4. BFS on 7 processors
5. Sequential performance
6. Data layout
7. Strassen-Winograd algorithm
8. Actual vs. effective performance
9. Small problem on Franklin
10. Big problem on Franklin
11. Diagonal scaling
12. Open problems

Effective vs. actual performance

[Plot: performance as a fraction of peak for CAPS, 2D-Strassen, and Strassen-2D at 0 through 6 Strassen steps, showing both effective and actual performance.] Efficiency at various numbers of Strassen steps, n = 21,952, on 49 nodes (196 cores) of Intrepid.

Communication-free DFS

Possible if each processor owns corresponding entries of the four submatrices of each of A, B, and C [Luo & Drake '95; Grayson, Shah, van de Geijn '95]:
- Additions of submatrices of A to form the Ti (no communication)
- Additions of submatrices of B to form the Si (no communication)
- Recursive calls Qi = Ti · Si (communication deeper in the recursion tree)
- Additions of the Qi to form submatrices of C (no communication)

[Diagram: local additions forming T0, T1, ... from the locally owned pieces of A.]

Communication pattern of BFS

- Additions of submatrices of A and B to form the Ti and Si (no communication)
- Redistribution of the Ti and Si (communication)
- Recursive calls Qi = Ti · Si (communication deeper in the recursion tree)
- Redistribution of the Qi (communication)
- Additions of the Qi to form submatrices of C (no communication)

The redistributions are disjoint 7-way all-to-all communications.

[Diagram: local additions forming T0, T1, ..., followed by their redistribution across the processor groups.]

BFS on 7 processors

Requires 3 all-to-all communications, one for each of A, B, and C. [Diagram: all 7 processors perform local additions to form the Ti and Si; an all-to-all redistributes each Ti and Si pair to its own processor; local multiplications produce the Qi; a final all-to-all and local additions assemble C.]

Sequential performance

[Plots: (left) effective performance as a fraction of peak vs. matrix dimension, comparing the classical and Strassen models with measured data; (right) time breakdown (DGEMM, extra additions, other) at 0 through 3 Strassen steps, model vs. data.] Left: comparison of the sequential model to the actual performance of classical and Strassen matrix multiplication on four cores (one node) of Intrepid. Right: time breakdown comparison between the sequential model and the data for n = 4097. Both model and data times are normalized to the modeled classical algorithm time.

Data layout

[Diagram only.]

Strassen-Winograd algorithm

The implementation uses the Strassen-Winograd variant, which needs 7 multiplies but only 15 additions per step. Writing C = A·B with all three matrices in 2x2 block form:

T0 = A11            S0 = B11
T1 = A12            S1 = B21
T2 = A21 + A22      S2 = B12 − B11
T3 = T2 − A11       S3 = B22 − S2
T4 = A11 − A21      S4 = B22 − B12
T5 = A12 − T3       S5 = B22
T6 = A22            S6 = S3 − B21

Qi = Ti · Si for i = 0, ..., 6

U1 = Q0 + Q3        C11 = Q0 + Q1
U2 = U1 + Q4        C12 = U3 + Q5
U3 = U1 + Q2        C21 = U2 − Q6
                    C22 = U2 + Q2
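For reference, here is one Strassen-Winograd step transcribed into Python/NumPy (my sketch; it performs a single level of recursion with `@` standing in for the recursive calls, and assumes an even dimension):

```python
import numpy as np

def strassen_winograd_step(A, B):
    """One Strassen-Winograd step: 7 multiplies, 15 additions.
    The seven products would recurse in a full implementation."""
    m = A.shape[0] // 2
    A11, A12, A21, A22 = A[:m, :m], A[:m, m:], A[m:, :m], A[m:, m:]
    B11, B12, B21, B22 = B[:m, :m], B[:m, m:], B[m:, :m], B[m:, m:]
    T2 = A21 + A22;  S2 = B12 - B11
    T3 = T2 - A11;   S3 = B22 - S2
    T4 = A11 - A21;  S4 = B22 - B12
    T5 = A12 - T3;   S6 = S3 - B21
    Q0 = A11 @ B11   # T0 = A11, S0 = B11
    Q1 = A12 @ B21   # T1 = A12, S1 = B21
    Q2 = T2 @ S2
    Q3 = T3 @ S3
    Q4 = T4 @ S4
    Q5 = T5 @ B22    # S5 = B22
    Q6 = A22 @ S6    # T6 = A22
    U1 = Q0 + Q3
    U2 = U1 + Q4
    U3 = U1 + Q2
    C = np.empty((2 * m, 2 * m), dtype=A.dtype)
    C[:m, :m] = Q0 + Q1
    C[:m, m:] = U3 + Q5
    C[m:, :m] = U2 - Q6
    C[m:, m:] = U2 + Q2
    return C
```

Counting the T's, S's, U's, and the four output quadrants gives exactly the 15 additions that distinguish this variant from the 18 of Strassen's original formulation.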
Performance breakdown: model vs. actual

[Plot: time normalized to the model time, broken into DGEMM, extra additions, communication, reordering, and other, for (P, n) = (49; 4116), (49; 16,464), (2401; 16,464), and (2401; 65,856).] Time breakdown comparison between the parallel model and data on Intrepid. In each case the entire modeled execution time is normalized to 1.

Performance on Franklin for a small problem

[Plot: execution time in seconds vs. number of cores, n = 3136 on Franklin.]

Performance of CAPS on a large problem

Strong scaling on Franklin (Cray XT4), n = 94,080. [Plot: effective performance as a fraction of peak vs. number of cores for CAPS, 2.5D-Strassen, 2D-Strassen, Strassen-2D, 2.5D, and 2D; the strong-scaling range is marked.]

Sequential recursive Strassen is communication optimal

Run the Strassen algorithm recursively. Once blocks are small enough to fit in fast memory, all further work happens there, so no further bandwidth cost is incurred:

W(n, M) = 7·W(n/2, M) + O(n^2)   if 3n^2 > M
W(n, M) = O(n^2)                 otherwise

Unfolding the recursion, the per-level cost grows by a factor of 7/4 until the base case at n ≈ √(M/3), so the last level dominates and the solution is W(n, M) = O( n^{ω0} / M^{ω0/2 − 1} ) with ω0 = log2 7, matching the lower bound.

Diagonal scaling

Outside scaling: scale so each row of A and each column of B has unit norm. Explicitly:
- Let D^A_ii = ‖A(i,:)‖^{−1} and D^B_jj = ‖B(:,j)‖^{−1}.
- Scale A′ = D^A·A and B′ = B·D^B.
- Use Strassen for the product C′ = A′·B′.
- Unscale: C = (D^A)^{−1} · C′ · (D^B)^{−1}.

Inside scaling: scale so each column of A has the same norm as the corresponding row of B. Explicitly:
- Let D_ii = ( ‖A(:,i)‖ / ‖B(i,:)‖ )^{−1/2}.
- Scale A′ = A·D and B′ = D^{−1}·B.
- Use Strassen for the product C = A′·B′ (no unscaling needed).
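A minimal NumPy sketch of both scalings (mine; it uses the 2-norm, which the slide leaves unspecified, and assumes no zero rows or columns):

```python
import numpy as np

def outside_scaled_multiply(A, B, multiply=np.matmul):
    """Outside scaling: unit-norm rows of A and columns of B,
    multiply (e.g. with Strassen), then unscale the result."""
    dA = 1.0 / np.linalg.norm(A, axis=1)     # diagonal of D^A
    dB = 1.0 / np.linalg.norm(B, axis=0)     # diagonal of D^B
    Cp = multiply(dA[:, None] * A, B * dB[None, :])
    return Cp / dA[:, None] / dB[None, :]    # C = (D^A)^-1 C' (D^B)^-1

def inside_scaled_multiply(A, B, multiply=np.matmul):
    """Inside scaling: match the norm of A's i-th column to the norm
    of B's i-th row; no unscaling is required afterwards."""
    d = (np.linalg.norm(A, axis=0) / np.linalg.norm(B, axis=1)) ** -0.5
    return multiply(A * d[None, :], (1.0 / d)[:, None] * B)
```

Either wrapper can be handed any multiply routine, for example `multiply=strassen` from the earlier sketch, since the scaling is independent of how the scaled product is computed.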
Stability: easy case

[Plot: max_ij |Ĉij − Cij| / (|A|·|B|)ij vs. number of Strassen steps (0 to 10) for no scaling, outer, inner, outer-inner, and inner-outer scaling, on the benign 2x2-structured test product shown on the slide; all variants remain near machine precision.]

Stability: more interesting case

[Plot: the same error measure on a badly scaled test product; without scaling the error grows with the number of Strassen steps, and the diagonal-scaling variants recover much of the accuracy.]

Stability: problems scaling can't fix

[Plot: the same error measure on a test product whose difficulty is not row- or column-structured; here diagonal scaling does not help.]

Discussion / open problems

- Our parallelization approach extends to other matrix multiplication algorithms: classical matrix multiplication (matching the 2.5D algorithm), and other fast matrix multiplication algorithms.
- Does it extend to other algorithms with recursive formulations?
- Make use of CAPS within other linear algebra algorithms.

Performance of CAPS on large problems (actual vs. effective)

Strong scaling on Intrepid (IBM BG/P), n = 65,856. [Plot: performance as a fraction of classical peak vs. number of cores for CAPS, 2.5D-Strassen, 2D-Strassen, Strassen-2D, 2.5D, and 2D; the classical peak, the measured (actual) performance, and the Strassen-Winograd effective peak are all marked.]

Performance: model vs. actual (no contention)

[Plot: the parallel models (CAPS, 2.5D, 2D) compared with the measured algorithms, strong scaling at n = 65,856 on Intrepid, including a CAPS model with no network contention.]