Efficient Implementation of BLAS Level-3 solvers for Sylvester-type Matrix Equations Martin Köhler February 14, 2017 7th Workshop on Matrix Equations and Tensor Techniques Introduction Generalized Sylvester Equation (GSYLV) AXB ± CXD = Y Martin Köhler, [email protected] BLAS-3 Sylvester 2/24 Introduction Standard Sylvester Equation (SYLV) Standard Lyapunov Equation (LYAP) AX ± XB = Y AX + AX T = Y Standard Sylvester Equation 2 (SYLV2) Standard Stein Equation (STEIN) AXB ± X = Y AXAT − X = Y Generalized Sylvester Equation (GSYLV) AXB ± CXD = Y Martin Köhler, [email protected] BLAS-3 Sylvester 2/24 Introduction Standard Sylvester Equation (SYLV) Standard Lyapunov Equation (LYAP) AX ± XB = Y AX + AX T = Y Standard Sylvester Equation 2 (SYLV2) Standard Stein Equation (STEIN) AXB ± X = Y AXAT − X = Y Generalized Sylvester Equation (GSYLV) AXB ± CXD = Y Generalized Stein Equation (GSTEIN) AXAT − BXB T = Y Generalized Lyapunov Equation (GLYAP) AXB T + BXAT = Y Martin Köhler, [email protected] BLAS-3 Sylvester 2/24 Introduction Standard Sylvester Equation (SYLV) Standard Lyapunov Equation (LYAP) AX ± XB = Y AX + AX T = Y Standard Sylvester Equation 2 (SYLV2) Standard Stein Equation (STEIN) AXB ± X = Y AXAT − X = Y Generalized Sylvester Equation (GSYLV) AXB ± CXD = Y Generalized Stein Equation (GSTEIN) Coupled Sylvester Equation (CSYLV) AXAT − BXB T = Y AR ± LB = E CR ± LD = F Generalized Lyapunov Equation (GLYAP) AXB T + BXAT = Y Martin Köhler, [email protected] BLAS-3 Sylvester 2/24 Introduction Implemented . . . . . . . . . . . . . . . . . .Direct . . . . . . . . Solvers . . . . . . . . . . on . . . . .Shared . . . . . . . . .Memory . . . . . . . . . . . .Architectures: ........................................... Equation SYLV SYLV2 LYAP STEIN GSYLV CSYLV GLYAP GSTEIN RECSY1 RECSYCT RECSYDT RECLYCT RECLYDT RECGSYL RECGCSY RECGLYCT RECGLYDT 1 http://www8.cs.umu.se/ 2 http://www.slicot.org/ Software Packages SLICOT2 LAPACK Alg. 4323 SB04PD xTRSYL AXPXB SB04PD SB03TD ATXPXA SB03UD SB04OD xTGSYL SG03AD SG03AD - Alg. 7054 SYLG SYLGC SYLGD ~isak/recsy/ 3 http://www.netlib.org/toms/432.gz 4 http://www.netlib.org/toms/705.gz Martin Köhler, [email protected] BLAS-3 Sylvester 3/24 Introduction Implemented . . . . . . . . . . . . . . . . . .Direct . . . . . . . . Solvers . . . . . . . . . . on . . . . .Shared . . . . . . . . .Memory . . . . . . . . . . . .Architectures: ........................................... Remarks: → SLICOT implementations only usePackages Level-2 BLAS. Software 1 2 recursive and Level-3 BLAS, → RECSY implementations but 7054 Equation RECSY SLICOTare LAPACK Alg. 4323 Alg. optimized for 15 years old architectures, SYLV RECSYCT SB04PD xTRSYL bad license. AXPXB 432 – SB04PD original code by Bartels and Stewart, no SYLV2 → Algorithm RECSYDT BLAS at all, FORTRAN IV. LYAP RECLYCT SB03TD ATXPXA 705 – SB03UD LINPACK style code- by Gardiner- et. al., STEIN → Algorithm RECLYDT Level-1 BLAS calls. GSYLV few RECGSYL SYLG – Level-3 BLAS block implementation -only for CSYLV→ GLYAP-3 RECGCSY SB04OD xTGSYL GLYAP GLYAP/GSTEIN. RECGLYCT SG03AD SYLGC GSTEIN RECGLYDT SG03AD SYLGD 1 http://www8.cs.umu.se/ 2 http://www.slicot.org/ ~isak/recsy/ 3 http://www.netlib.org/toms/432.gz 4 http://www.netlib.org/toms/705.gz Martin Köhler, [email protected] BLAS-3 Sylvester 3/24 Introduction Implemented . . . . . . . . . . . . . . . . . .Direct . . . . . . . . Solvers . . . . . . . . . . on . . . . .Shared . . . . . . . . .Memory . . . . . . . . . . . .Architectures: ........................................... Remarks: → SLICOT implementations only usePackages Level-2 BLAS. Software 1 2 recursive and Level-3 BLAS, → RECSY implementations but 7054 Equation RECSY SLICOTare LAPACK Alg. 4323 Alg. optimized for 15 years old architectures, SYLV RECSYCT SB04PD xTRSYL bad license. AXPXB 432 – SB04PD original code by Bartels and Stewart, no SYLV2 → Algorithm RECSYDT BLAS at all, FORTRAN IV. LYAP RECLYCT SB03TD ATXPXA 705 – SB03UD LINPACK style code- by Gardiner- et. al., STEIN → Algorithm RECLYDT Level-1 BLAS calls. GSYLV few RECGSYL SYLG – Level-3 BLAS block implementation -only for CSYLV→ GLYAP-3 RECGCSY SB04OD xTGSYL GLYAP GLYAP/GSTEIN. RECGLYCT SG03AD SYLGC GSTEIN RECGLYDT SG03AD SYLGD All packages are not feature complete and mostly old. 1 http://www8.cs.umu.se/ 2 http://www.slicot.org/ ~isak/recsy/ 3 http://www.netlib.org/toms/432.gz 4 http://www.netlib.org/toms/705.gz Martin Köhler, [email protected] BLAS-3 Sylvester 3/24 Introduction General . . . . . . . . . . .Workflow . . . . . . . . . . . . .in . . .Direct . . . . . . . . .Solvers: ..................................................................... Compute real Schur forms: A = QAs Q T and B = ZBs Z T . Compute generalized real Schur forms: (A, C ) = (Q1 As Z1T , Q1 Cs Z1T ) (B, D) = (Q2 Bs Z2T , Q2 Ds Z2T ). Transform the right hand side: Ỹ = Q T YZ . Transform the right hand side: Ỹ = Q1T YZ2 . Solve the (quasi-) triangular equation: As X̃ ± X̃ Bs = Ỹ . Solve the (quasi-) triangular equation As X̃ Bs ± Cs X̃ Ds = Ỹ . Transform the solution: X = Q X̃ Z T . Transform the solution: X = Z1 X̃ Q2T . Martin Köhler, [email protected] BLAS-3 Sylvester 4/24 Introduction Our focus: Fast solution of (quasi-) triangular Sylvester-type MaGeneral . . . . . . . . . . .Workflow . . . . . . . . . . . . .in . . .Direct . . . . . . . . .Solvers: ..................................................................... trix Equations: Compute generalized real Schur forms: Compute real Schur @ forms: @ @ @ ± . 1T , Q1 Cs Z1T ) X̃ X̃ C ) = (Q=1 AỸs Z (A, T @ @ @ @ @ @ A = QAs Q T and@ B = ZBs Z@ . (B, D) = (Q2 Bs Z2T , Q2 Ds Z2T ). Transform the right hand side: Ỹ = Q T YZ . Transform the right hand side: Ỹ = Q1T YZ2 . Solve the (quasi-) triangular equation: As X̃ ± X̃ Bs = Ỹ . Solve the (quasi-) triangular equation As X̃ Bs ± Cs X̃ Ds = Ỹ . Transform the solution: X = Q X̃ Z T . Transform the solution: X = Z1 X̃ Q2T . Martin Köhler, [email protected] BLAS-3 Sylvester 4/24 Introduction Our focus: Fast solution of (quasi-) triangular Sylvester-type MaGeneral . . . . . . . . . . .Workflow . . . . . . . . . . . . .in . . .Direct . . . . . . . . .Solvers: ..................................................................... trix Equations: Compute generalized real Schur forms: Compute real Schur @ forms: @ @ @ ± . 1T , Q1 Cs Z1T ) X̃ X̃ C ) = (Q=1 AỸs Z (A, T @ @ @ @ @ @ A = QAs Q T and@ B = ZBs Z@ . (B, D) = (Q2 Bs Z2T , Q2 Ds Z2T ). We use the Generalized Sylvester Equation as most complex variant for the optimizations. Transform the right hand side: Transform the right hand side: T Ỹ = Q1T YZ2 . Ỹ = Q YZ . Solve the (quasi-) triangular equation: As X̃ ± X̃ Bs = Ỹ . Solve the (quasi-) triangular equation As X̃ Bs ± Cs X̃ Ds = Ỹ . Transform the solution: X = Q X̃ Z T . Transform the solution: X = Z1 X̃ Q2T . Martin Köhler, [email protected] BLAS-3 Sylvester 4/24 Block-Algorithm The real Generalized Schur Decomposition of A11 · · · A1p C11 · · · . . .. . . As = . . . , Cs = 0 App 0 B11 · · · B1q D11 · · · . . .. . . Bs = . . . , Ds = 0 Bqq 0 (A, C ) and (B, D) yields: C1p X̃11 · · · .. , X̃ = .. .. . . . Cpp C1q .. , . Dqq X̃1q .. , . X̃p1 · · · X̃pq Ỹ11 · · · Ỹ1q .. , .. Ỹ = ... . . Ỹp1 · · · Ỹqq where (Aii , Cii ) and (Bii , Dii ) are 1 × 1 or 2 × 2 according to the eigenvalues of (A, C ), or (B, D) respectively. Martin Köhler, [email protected] BLAS-3 Sylvester 5/24 Block-Algorithm The real Generalized Schur Decomposition of A11 · · · A1p C11 · · · . . .. . . As = . . . , Cs = 0 App 0 B11 · · · B1q D11 · · · . . .. . . Bs = . . . , Ds = 0 Bqq 0 (A, C ) and (B, D) yields: C1p X̃11 · · · .. , X̃ = .. .. . . . Cpp C1q .. , . Dqq X̃1q .. , . X̃p1 · · · X̃pq Ỹ11 · · · Ỹ1q .. , .. Ỹ = ... . . Ỹp1 · · · Ỹqq where (Aii , Cii ) and (Bii , Dii ) are 1 × 1 or 2 × 2 according to the eigenvalues of (A, C ), or (B, D) respectively. We need to solve p · q small Sylvester equations: X Akk X̃kl Bll + Ckk X̃k` Dll = Ỹkl − Aki X̃ij Bjl ± Cki X̃ij Djl . i=k,...,p j=1,...,l (i,j)6=(k,l) Martin Köhler, [email protected] BLAS-3 Sylvester 5/24 Block-Algorithm Our Question: Are the matrices Akk , All , . . . restricted to be 1 × 1 or 2 × 2 matrices? Martin Köhler, [email protected] BLAS-3 Sylvester 6/24 Block-Algorithm Our Question: Are the matrices Akk , All , . . . restricted to be 1 × 1 or 2 × 2 matrices? Answer – No! As long as no complex eigenvalue-pairs in (A, C ) and (B, D) is split by the blocking and X̃ and Ỹ are partitioned accordingly everything works for larger blocks. Martin Köhler, [email protected] BLAS-3 Sylvester 6/24 Block-Algorithm Our Question: Are the matrices Akk , All , . . . restricted to be 1 × 1 or 2 × 2 matrices? Answer – No! As long as no complex eigenvalue-pairs in (A, C ) and (B, D) is split by the blocking and X̃ and Ỹ are partitioned accordingly everything works for larger blocks. Arbitrary block sizes mb for (A, C ) and nb for (B, D) are possible but adjustments by ±1 to fit the eigenvalue structure are necessary. Martin Köhler, [email protected] BLAS-3 Sylvester 6/24 Block-Algorithm Our Question: Are the matrices Akk , All , . . . restricted to be 1 × 1 or 2 × 2 matrices? Answer – No! As long as no complex eigenvalue-pairs in (A, C ) and (B, D) is split by the blocking and X̃ and Ỹ are partitioned accordingly everything works for larger blocks. Arbitrary block sizes mb for (A, C ) and nb for (B, D) are possible but adjustments by ±1 to fit the eigenvalue structure are necessary. Special cases: (As , Cs ∈ Rm×m , Bs , Ds ∈ Rn×n and X̃ , Ỹ ∈ Rm×n ) mb = 1 and nb = 1 – Level-2 Bartels-Stewart implementation in SLICOT. Martin Köhler, [email protected] BLAS-3 Sylvester 6/24 Block-Algorithm Our Question: Are the matrices Akk , All , . . . restricted to be 1 × 1 or 2 × 2 matrices? Answer – No! As long as no complex eigenvalue-pairs in (A, C ) and (B, D) is split by the blocking and X̃ and Ỹ are partitioned accordingly everything works for larger blocks. Arbitrary block sizes mb for (A, C ) and nb for (B, D) are possible but adjustments by ±1 to fit the eigenvalue structure are necessary. (As , Cs ∈ Rm×m , Bs , Ds ∈ Rn×n and X̃ , Ỹ ∈ Rm×n ) Special cases: mb = 1 and nb = 1 – Level-2 Bartels-Stewart implementation in SLICOT. mb = m 2 and nb = n 2 Martin Köhler, [email protected] – Recursive Blocking (RECSY) BLAS-3 Sylvester 6/24 Block-Algorithm Our Question: Are the matrices Akk , All , . . . restricted to be 1 × 1 or 2 × 2 matrices? Answer – No! As long as no complex eigenvalue-pairs in (A, C ) and (B, D) is split by the blocking and X̃ and Ỹ are partitioned accordingly everything works for larger blocks. Arbitrary block sizes mb for (A, C ) and nb for (B, D) are possible but adjustments by ±1 to fit the eigenvalue structure are necessary. (As , Cs ∈ Rm×m , Bs , Ds ∈ Rn×n and X̃ , Ỹ ∈ Rm×n ) Special cases: mb = 1 and nb = 1 – Level-2 Bartels-Stewart implementation in SLICOT. mb = m 2 and nb = n 2 – Recursive Blocking (RECSY) mb = m and nb = 1 – Algorithm 705 by Gardiner et. al. Martin Köhler, [email protected] BLAS-3 Sylvester 6/24 Block-Algorithm Our Question: Are the matrices Akk , All , . . . restricted to be 1 × 1 or 2 × 2 matrices? Answer – No! As long as no complex eigenvalue-pairs in (A, C ) and (B, D) is split by the blocking and X̃ and Ỹ are partitioned accordingly everything works for larger blocks. Arbitrary block sizes mb for (A, C ) and nb for (B, D) are possible but adjustments by ±1 to fit the eigenvalue structure are necessary. (As , Cs ∈ Rm×m , Bs , Ds ∈ Rn×n and X̃ , Ỹ ∈ Rm×n ) Special cases: mb = 1 and nb = 1 – Level-2 Bartels-Stewart implementation in SLICOT. mb = m 2 and nb = n 2 – Recursive Blocking (RECSY) mb = m and nb = 1 – Algorithm 705 by Gardiner et. al. → Why are mb and nb not chosen such that Akk , Bll , Ckk , Dll , X̃kl and Ỹkl fit into the CPU’s cache(s)? Martin Köhler, [email protected] BLAS-3 Sylvester 6/24 Efficient Implementation General . . . . . . . . . . .Aspects: .............................................................................................. The updates on the right hand sides Ỹkl X Ỹkl − Aki X̃ij Bjl ± Cki X̃ij Djl i=k,...,p j=1,...,l (i,j)6=(k,l) can be rewritten and unified into few GEMM/TRMM operations. → Assumed to be as efficient as possible for sufficiently large matrices. Martin Köhler, [email protected] BLAS-3 Sylvester 7/24 Efficient Implementation General . . . . . . . . . . .Aspects: .............................................................................................. The updates on the right hand sides Ỹkl X Ỹkl − Aki X̃ij Bjl ± Cki X̃ij Djl i=k,...,p j=1,...,l (i,j)6=(k,l) can be rewritten and unified into few GEMM/TRMM operations. → Assumed to be as efficient as possible for sufficiently large matrices. Block sizes mb and nb are a freely selectable parameter. Martin Köhler, [email protected] BLAS-3 Sylvester 7/24 Efficient Implementation General . . . . . . . . . . .Aspects: .............................................................................................. The updates on the right hand sides Ỹkl X Ỹkl − Aki X̃ij Bjl ± Cki X̃ij Djl i=k,...,p j=1,...,l (i,j)6=(k,l) can be rewritten and unified into few GEMM/TRMM operations. → Assumed to be as efficient as possible for sufficiently large matrices. Block sizes mb and nb are a freely selectable parameter. The solution X̃ overwrites the right hand side Ỹ . Martin Köhler, [email protected] BLAS-3 Sylvester 7/24 Efficient Implementation General . . . . . . . . . . .Aspects: .............................................................................................. The updates on the right hand sides Ỹkl X Ỹkl − Aki X̃ij Bjl ± Cki X̃ij Djl i=k,...,p j=1,...,l (i,j)6=(k,l) can be rewritten and unified into few GEMM/TRMM operations. → Assumed to be as efficient as possible for sufficiently large matrices. Block sizes mb and nb are a freely selectable parameter. The solution X̃ overwrites the right hand side Ỹ . All matrices are stored in the Fortran Column-Major-Scheme. Martin Köhler, [email protected] BLAS-3 Sylvester 7/24 Efficient Implementation Algorithm 1 Block Bartels-Stewart Algorithm for Generalized Sylvester Equations Input: As , Cs ∈ Rm×m , Bs , Ds ∈ Rn×n and Ỹ ∈ Rm×n , mb , nb ∈ N Output: X̃ ∈ Rm×n overwriting Ỹ 1: if m ≤ mb and n ≤ nb then 2: Solve As X̃ Bs ± Cs X̃ Ds = Ỹ . 3: else 4: for k = m, . . . , 1 step by mb do 5: for l = 1, . . . , n step by nb do 6: Solve Akk X̃kl Bll ± Ckk X̃kl Dll = Ỹkl . 7: Ỹk,l+1:n = Ỹk,l+1:n − Akk X̃kl Bkl ∓ Ckk X̃kl Dkl 8: end for 9: Ỹ1:k−1,1:n = Ỹ1:k−1,1:n − A1:k−1,k:k+mb −1 X̃k:k+mb −1,1:l Bs ∓ C1:k−1,k:k+mb −1 X̃k:k+mb −1,1:l Ds 10: end for 11: end if Martin Köhler, [email protected] BLAS-3 Sylvester 8/24 Efficient Implementation Algorithm 1 Block Bartels-Stewart Algorithm for Generalized Sylvester Equations Input: As , Cs ∈ Rm×m , Bs , Ds ∈ Rn×n and Ỹ ∈ Rm×n , mb , nb ∈ N Output: X̃ ∈ Rm×n overwriting Ỹ 1: if m ≤ mb and n ≤ nb then 2: Solve As X̃ Bs ± Cs X̃ Ds = Ỹ . 3: else Critical operation for a fast 4: for k = m, . . . , 1 step by mb do solution scheme. 5: for l = 1, . . . , n step by nb do 6: Solve Akk X̃kl Bll ± Ckk X̃kl Dll = Ỹkl . 7: Ỹk,l+1:n = Ỹk,l+1:n − Akk X̃kl Bkl ∓ Ckk X̃kl Dkl 8: end for 9: Ỹ1:k−1,1:n = Ỹ1:k−1,1:n − A1:k−1,k:k+mb −1 X̃k:k+mb −1,1:l Bs ∓ C1:k−1,k:k+mb −1 X̃k:k+mb −1,1:l Ds 10: end for 11: end if Martin Köhler, [email protected] BLAS-3 Sylvester 8/24 Efficient Implementation Algorithm 1 Block Bartels-Stewart Algorithm for Generalized Sylvester Equations Input: As , Cs ∈ Rm×m , Bs , Ds ∈ Rn×n and Ỹ ∈ Rm×n , mb , nb ∈ N Output: X̃ ∈ Rm×n overwriting Ỹ 1: if m ≤ mb and n ≤ nb then 2: Solve As X̃ Bs ± Cs X̃ Ds = Ỹ . 3: else Critical operation for a fast 4: for k = m, . . . , 1 step by mb do solution scheme. 5: for l = 1, . . . , n step by nb do 6: Solve Akk X̃kl Bll ± Ckk X̃kl Dll = Ỹkl . 7: Ỹk,l+1:n = Ỹk,l+1:n − Akk X̃kl Bkl ∓ Ckk X̃kl Dkl 8: end for 9: Ỹ1:k−1,1:n = Ỹ1:k−1,1:n − A1:k−1,k:k+mb −1 X̃k:k+mb −1,1:l Bs ∓ C1:k−1,k:k+mb −1 X̃k:k+mb −1,1:l Ds 10: end for 11: end if Goal: Develop an efficient solver for small Sylvester Equations. Martin Köhler, [email protected] BLAS-3 Sylvester 8/24 Efficient Inner Solvers Hardware . . . . . . . . . . . . . and . . . . . .Software . . . . . . . . . . . .Setup .......................................................................... We use a test driven development scheme for the kernels on random matrices. Hardware Intel Xeon E5-2640v3 (Haswell), 2x8 Cores, 16x256kB L2 Cache, 2x20MB L3 Cache, AVX vector unit 64GB DDR3 RAM Software CentOS 7.3 – x86 64 Intel Parallel Studio XE 2017 C/Fortran Compiler + Intel MKL 2017.0.1 Compiler-Flags: -O3 -xHost -qopenmp Computational routines are written in Fortran 90/95. Reference results with Algorithm 705 and RECSY, double precision. Residual and Forward error of all results are comparable. Martin Köhler, [email protected] BLAS-3 Sylvester 9/24 Efficient Inner Solvers Naive . . . . . . . . Approach ................................................................................................. Sylvester equations with m ≤ 2 and n ≤ 2 are trivial to solve via their Kronecker representation: AXB ± CXD = Y ⇐⇒ (B T ⊗ A ± D T ⊗ C ) vec X = vec Y Larger problems are solved using Algorithm 1 with mb = 1 and nb = 1. Martin Köhler, [email protected] BLAS-3 Sylvester 10/24 Efficient Inner Solvers Naive . . . . . . . . Approach ................................................................................................. Sylvester equations with m ≤ 2 and n ≤ 2 are trivial to solve via their Kronecker representation: AXB ± CXD = Y ⇐⇒ (B T ⊗ A ± D T ⊗ C ) vec X = vec Y Larger problems are solved using Algorithm 1 with mb = 1 and nb = 1. Usual LAPACK-like blocking scheme. Martin Köhler, [email protected] BLAS-3 Sylvester 10/24 Efficient Inner Solvers Naive . . . . . . . . Approach ................................................................................................. Sylvester equations with m ≤ 2 and n ≤ 2 are trivial to solve via their Kronecker representation: AXB ± CXD = Y ⇐⇒ (B T ⊗ A ± D T ⊗ C ) vec X = vec Y Larger problems are solved using Algorithm 1 with mb = 1 and nb = 1. Usual LAPACK-like blocking scheme. Test Procedure Algorithm 1 with random matrices A, B, C , D ∈ Rm×m , m = 1, . . . , 1030. Eigenvalues sorted such that A64k+1,64k = 0 and B64k+1,64k = 0, ∀k ∈ N. Only inner solver exchanged/optimized. 16 threads for multi-threaded BLAS calls. Martin Köhler, [email protected] BLAS-3 Sylvester 10/24 Efficient Inner Solvers Naive . . . . . . . . Approach ................................................................................................. 0.6 Time to solution in [s] RECSY Algorithm 705 Naive 0.4 0.2 0 0 100 200 300 Dimension of X̃ Martin Köhler, [email protected] 400 500 m=n 600 700 800 900 1,000 block size mb = nb = 64 BLAS-3 Sylvester 11/24 Efficient Inner Solvers Naive . . . . . . . . Approach ................................................................................................. 0.6 Time to solution in [s] RECSY Algorithm 705 Naive Reproducible peaks at m = 512 and m = 1024. 0.4 0.2 0 0 100 200 300 Dimension of X̃ Martin Köhler, [email protected] 400 500 m=n 600 700 800 900 1,000 block size mb = nb = 64 BLAS-3 Sylvester 11/24 Efficient Inner Solvers Level-2 . . . . . . . . . . BLAS . . . . . . . . .Calls ...................................................................................... Replace Level-3 BLAS calls with the corresponding Level-2 operations: xGEMM → xGEMV, xGER, and xAXPY, xTRMM → xTRMV. Martin Köhler, [email protected] BLAS-3 Sylvester 12/24 Efficient Inner Solvers Level-2 . . . . . . . . . . BLAS . . . . . . . . .Calls ...................................................................................... Replace Level-3 BLAS calls with the corresponding Level-2 operations: xGEMM → xGEMV, xGER, and xAXPY, xTRMM → xTRMV. GEMM operations up to 2 × 2 are directly computed. Martin Köhler, [email protected] BLAS-3 Sylvester 12/24 Efficient Inner Solvers Level-2 . . . . . . . . . . BLAS . . . . . . . . .Calls ...................................................................................... Replace Level-3 BLAS calls with the corresponding Level-2 operations: xGEMM → xGEMV, xGER, and xAXPY, xTRMM → xTRMV. GEMM operations up to 2 × 2 are directly computed. Appearing xAXPY operations, caused by the GEMM replacement, are performed by Fortran intrinsics: CALL CALL CALL CALL DAXPY(N-LH, DAXPY(N-LH, DAXPY(N-LH, DAXPY(N-LH, -MAT(1,1), -MAT(3,1), -MAT(2,1), -MAT(4,1), B(L,LH+1), LDB, X(K,LH+1), LDX) B(LH,LH+1), LDB,X(K,LH+1), LDX) B(L,LH+1), LDB, X(KH,LH+1), LDX) B(LH,LH+1), LDB,X(KH,LH+1), LDX) is transformed into X(K,LH+1:N) = X(K,LH+1:N) - MAT(1,1) * B(L,LH+1:N) MAT(3,1) * B(LH,LH+1:N) - SGN * MAT(1,2) * D(L,LH+1:N) - SGN * MAT(3,2) * D(LH, LH+1:N) Martin Köhler, [email protected] BLAS-3 Sylvester 12/24 Efficient Inner Solvers Level-2 . . . . . . . . . . BLAS . . . . . . . . .Calls ...................................................................................... Time to solution in [s] 0.6 RECSY Algorithm 705 Naive BLAS-2 0.4 0.2 0 0 100 200 300 Dimension of X̃ Martin Köhler, [email protected] 400 500 m=n 600 700 800 900 1,000 block size mb = nb = 64 BLAS-3 Sylvester 13/24 Efficient Inner Solvers Level-2 . . . . . . . . . . BLAS . . . . . . . . .Calls ...................................................................................... Time to solution in [s] 0.6 RECSY Algorithm 705 Naive BLAS-2 0.4 Still reproducible peaks at m = 512 and m = 1024. 0.2 0 0 100 200 300 Dimension of X̃ Martin Köhler, [email protected] 400 500 m=n 600 700 800 900 1,000 block size mb = nb = 64 BLAS-3 Sylvester 13/24 Efficient Inner Solvers Reorder . . . . . . . . . . .the . . . . .data . . . . . . .access .................................................................................. The peaks at m = 512 and m = 1024 (and the next ones at 1536 and 2048) are caused by cache-misses/unused prefetching. Reason: The leading dimension of the matrix (times 8 Byte per value) is a multiple of the pagesize (4096 Bytes). The access to X(K,LH+1:N), B(L,LH+1:N), and D(L,LH+1:N) is not well suited for the matrix storage. (column-major-storage vs. row-wise access). Martin Köhler, [email protected] BLAS-3 Sylvester 14/24 Efficient Inner Solvers Reorder . . . . . . . . . . .the . . . . .data . . . . . . .access .................................................................................. The peaks at m = 512 and m = 1024 (and the next ones at 1536 and 2048) are caused by cache-misses/unused prefetching. Reason: The leading dimension of the matrix (times 8 Byte per value) is a multiple of the pagesize (4096 Bytes). The access to X(K,LH+1:N), B(L,LH+1:N), and D(L,LH+1:N) is not well suited for the matrix storage. (column-major-storage vs. row-wise access). Same optimizations as before but. . . Martin Köhler, [email protected] BLAS-3 Sylvester 14/24 Efficient Inner Solvers Reorder . . . . . . . . . . .the . . . . .data . . . . . . .access .................................................................................. The peaks at m = 512 and m = 1024 (and the next ones at 1536 and 2048) are caused by cache-misses/unused prefetching. Reason: The leading dimension of the matrix (times 8 Byte per value) is a multiple of the pagesize (4096 Bytes). The access to X(K,LH+1:N), B(L,LH+1:N), and D(L,LH+1:N) is not well suited for the matrix storage. (column-major-storage vs. row-wise access). Same optimizations as before but. . . solve the small equation column-wise instead of row-wise. The crucial operations change to: X(1:K-1,L) = X(1:K-1,L) - MAT(1,1) * A(1:K-1,K) - MAT(2,1) * A(1:K-1,KH) - SGN * MAT(1,2) * C(1:K-1,K) - SGN * MAT(2,2) * C(1:K-1,KH) → Access on Xkl , Akk and Ckk fits the storage scheme. Martin Köhler, [email protected] BLAS-3 Sylvester 14/24 Efficient Inner Solvers Reorder . . . . . . . . . . .the . . . . .data . . . . . . .access .................................................................................. Time to solution in [s] 0.6 RECSY Algorithm 705 Naive BLAS-2 Reordered 0.4 0.2 0 0 100 200 300 Dimension of X̃ Martin Köhler, [email protected] 400 500 m=n 600 700 800 900 1,000 block size mb = nb = 64 BLAS-3 Sylvester 15/24 Efficient Inner Solvers Reorder . . . . . . . . . . .the . . . . .data . . . . . . .access .................................................................................. Time to solution in [s] 0.6 RECSY Algorithm 705 Naive BLAS-2 Reordered 0.4 One small peak at m = 1024 left. 0.2 0 0 100 200 300 Dimension of X̃ Martin Köhler, [email protected] 400 500 m=n 600 700 800 900 1,000 block size mb = nb = 64 BLAS-3 Sylvester 15/24 Efficient Inner Solvers Use . . . . . .local . . . . . . .copies . . . . . . . . .of . . .A . .kk . . .,. .B. .ll.,. .C . .kk . . ,. . D . . ll. .,. .and . . . . .X . . kl ................................................. The leading dimension of As , Bs , Cs and Ds still yields cache misses and unnecessary prefetching. Keep all previous optimizations and Martin Köhler, [email protected] BLAS-3 Sylvester 16/24 Efficient Inner Solvers Use . . . . . .local . . . . . . .copies . . . . . . . . .of . . .A . .kk . . .,. .B. .ll.,. .C . .kk . . ,. . D . . ll. .,. .and . . . . .X . . kl ................................................. The leading dimension of As , Bs , Cs and Ds still yields cache misses and unnecessary prefetching. Keep all previous optimizations and create a local copy of Akk and Ckk with leading dimension mb , Martin Köhler, [email protected] BLAS-3 Sylvester 16/24 Efficient Inner Solvers Use . . . . . .local . . . . . . .copies . . . . . . . . .of . . .A . .kk . . .,. .B. .ll.,. .C . .kk . . ,. . D . . ll. .,. .and . . . . .X . . kl ................................................. The leading dimension of As , Bs , Cs and Ds still yields cache misses and unnecessary prefetching. Keep all previous optimizations and create a local copy of Akk and Ckk with leading dimension mb , create a local copy of Bll and Cll with leading dimension nb , Martin Köhler, [email protected] BLAS-3 Sylvester 16/24 Efficient Inner Solvers Use . . . . . .local . . . . . . .copies . . . . . . . . .of . . .A . .kk . . .,. .B. .ll.,. .C . .kk . . ,. . D . . ll. .,. .and . . . . .X . . kl ................................................. The leading dimension of As , Bs , Cs and Ds still yields cache misses and unnecessary prefetching. Keep all previous optimizations and create a local copy of Akk and Ckk with leading dimension mb , create a local copy of Bll and Cll with leading dimension nb , create a local copy of Xkl with leading dimension mb and copy it to the original location afterwards. Martin Köhler, [email protected] BLAS-3 Sylvester 16/24 Efficient Inner Solvers Use . . . . . .local . . . . . . .copies . . . . . . . . .of . . .A . .kk . . .,. .B. .ll.,. .C . .kk . . ,. . D . . ll. .,. .and . . . . .X . . kl ................................................. The leading dimension of As , Bs , Cs and Ds still yields cache misses and unnecessary prefetching. Keep all previous optimizations and create a local copy of Akk and Ckk with leading dimension mb , create a local copy of Bll and Cll with leading dimension nb , create a local copy of Xkl with leading dimension mb and copy it to the original location afterwards. All data can be copied to the (L2) cache before the solution of the inner Sylvester equation starts. Martin Köhler, [email protected] BLAS-3 Sylvester 16/24 Efficient Inner Solvers Use . . . . . .local . . . . . . .copies . . . . . . . . .of . . .A . .kk . . .,. .B. .ll.,. .C . .kk . . ,. . D . . ll. .,. .and . . . . .X . . kl ................................................. Time to solution in [s] 0.6 RECSY Algorithm 705 Naive BLAS-2 Reordered Local Copy 0.4 0.2 0 0 100 200 300 Dimension of X̃ Martin Köhler, [email protected] 400 500 m=n 600 700 800 900 1,000 block size mb = nb = 64 BLAS-3 Sylvester 17/24 Efficient Inner Solvers Alignment . . . . . . . . . . . . . . of . . . .the . . . . .local . . . . . . .copies ........................................................................... Vector units (AVX, VSX,. . . ) of modern CPUs need a special data alignment to work really fast. → Without additional help compilers cannot produce efficient code for them. Keep all previous optimizations, especially the local copies. Martin Köhler, [email protected] BLAS-3 Sylvester 18/24 Efficient Inner Solvers Alignment . . . . . . . . . . . . . . of . . . .the . . . . .local . . . . . . .copies ........................................................................... Vector units (AVX, VSX,. . . ) of modern CPUs need a special data alignment to work really fast. → Without additional help compilers cannot produce efficient code for them. Keep all previous optimizations, especially the local copies. Annotate the declaration of the local data such that they are 64-byte aligned: !dir$ attributes align: 64:: AL, BL, CL, DL, XL or !IBM* ALIGN(64,AL,BL,CL,DL,XL) depending on the Fortran compiler. Martin Köhler, [email protected] BLAS-3 Sylvester 18/24 Efficient Inner Solvers Alignment . . . . . . . . . . . . . . of . . . .the . . . . .local . . . . . . .copies ........................................................................... Vector units (AVX, VSX,. . . ) of modern CPUs need a special data alignment to work really fast. → Without additional help compilers cannot produce efficient code for them. Keep all previous optimizations, especially the local copies. Annotate the declaration of the local data such that they are 64-byte aligned: !dir$ attributes align: 64:: AL, BL, CL, DL, XL or !IBM* ALIGN(64,AL,BL,CL,DL,XL) depending on the Fortran compiler. Replace all BLAS calls except of TRMV with Fortran vector intrinsics. Martin Köhler, [email protected] BLAS-3 Sylvester 18/24 Efficient Inner Solvers Alignment . . . . . . . . . . . . . . of . . . .the . . . . .local . . . . . . .copies ........................................................................... Vector units (AVX, VSX,. . . ) of modern CPUs need a special data alignment to work really fast. → Without additional help compilers cannot produce efficient code for them. Keep all previous optimizations, especially the local copies. Annotate the declaration of the local data such that they are 64-byte aligned: !dir$ attributes align: 64:: AL, BL, CL, DL, XL or !IBM* ALIGN(64,AL,BL,CL,DL,XL) depending on the Fortran compiler. Replace all BLAS calls except of TRMV with Fortran vector intrinsics. → The compiler should be able to optimize our code. Martin Köhler, [email protected] BLAS-3 Sylvester 18/24 Efficient Inner Solvers Alignment . . . . . . . . . . . . . . of . . . .the . . . . .local . . . . . . .copies ........................................................................... Time to solution in [s] 0.6 RECSY Algorithm 705 Naive BLAS-2 Reordered Local Copy Aligned Local Copy 0.4 0.2 0 0 100 200 300 Dimension of X̃ Martin Köhler, [email protected] 400 500 m=n 600 700 800 900 1,000 block size mb = nb = 64 BLAS-3 Sylvester 19/24 Efficient Inner Solvers Large . . . . . . . . Scale . . . . . . . .Test ......................................................................................... Time to solution in [s] 30 20 RECSY Naive BLAS-2 Reordered Local Copy Aligned Local Copy 10 0 1,024 1,536 2,048 2,560 Dimension of X̃ Martin Köhler, [email protected] 3,072 3,584 m=n 4,096 4,608 5,120 5,632 6,144 optimal block size chosen BLAS-3 Sylvester 20/24 Efficient Inner Solvers Large . . . . . . . . Scale . . . . . . . .Test ......................................................................................... Time to solution in [s] 30 20 10 0 1,024 Speed up: RECSY Dimension m=n Naive 1 024 BLAS-2 1 536 Reordered 2 048 Local Copy 2 560 Aligned Local Copy 3 072 3 584 4 096 4 608 5 120 5 632 6 144 1,536 2,048 2,560 Dimension of X̃ Martin Köhler, [email protected] 3,072 Speed up vs. RECSY Naive 2.83 4.76 2.61 4.16 2.36 4.15 2.11 3.83 2.12 3.78 2.01 3.68 2.11 3.88 2.01 3.56 1.78 3.47 1.86 3.39 1.86 3.45 3,584 m=n 4,096 4,608 5,120 5,632 6,144 optimal block size chosen BLAS-3 Sylvester 20/24 Further Optimization Optimal Block Sizes mb and nb The optimal block sizes mb and nb must be chosen such that: small enough that the inner solvers working nearly inside the (L2) CPU cache and large enough that the matrix-matrix products in Algorithm 1 are making use of the multi-threading capabilities of the BLAS library. Martin Köhler, [email protected] BLAS-3 Sylvester 21/24 Further Optimization Optimal Block Sizes mb and nb The optimal block sizes mb and nb must be chosen such that: small enough that the inner solvers working nearly inside the (L2) CPU cache and large enough that the matrix-matrix products in Algorithm 1 are making use of the multi-threading capabilities of the BLAS library. For large m and n and increasing number of CPU cores it is no longer possible to satisfy both conditions properly. Martin Köhler, [email protected] BLAS-3 Sylvester 21/24 Further Optimization Optimal Block Sizes mb and nb The optimal block sizes mb and nb must be chosen such that: small enough that the inner solvers working nearly inside the (L2) CPU cache and large enough that the matrix-matrix products in Algorithm 1 are making use of the multi-threading capabilities of the BLAS library. For large m and n and increasing number of CPU cores it is no longer possible to satisfy both conditions properly. Idea Move the parallelization from the BLAS library to the Algorithm. Use the data dependencies between the p · q blocks of the solution matrix X̃ : X̃p1 depends on nothing. X̃pj , j = 2 . . . q depend on X̃p,j−1 . X̃i1 , i = p − 1 . . . 1 depend on X̃i+1,1 . X̃ij , i = p − 1 . . . 1, j = 2 . . . q depend on X̃i,j−1 and X̃i+1,j . Martin Köhler, [email protected] BLAS-3 Sylvester 21/24 Further Optimization Direct-Acyclic-Graph . . . . . . . . . . . . . . . . . . . . . . . . . . . .(DAG) . . . . . . . . . Scheduling .................................................................... The data dependencies in the solution lead to the following DAG: .. . ↑ X̃ p−2,1 ↑ X̃ p−1,1 ↑ X̃p,1 ··· → → → .. . ↑ X̃p−2,2 ↑ X̃p−1,2 ↑ X̃p,2 ··· → → → .. . ↑ X̃p−2,3 ↑ X̃p−1,3 ↑ X̃p,3 · · · . . . . . . ... The blocks X̃ij on the same anti-diagonal can be solved independently from each other (in parallel). Martin Köhler, [email protected] BLAS-3 Sylvester 22/24 Further Optimization Direct-Acyclic-Graph . . . . . . . . . . . . . . . . . . . . . . . . . . . .(DAG) . . . . . . . . . Scheduling .................................................................... The data dependencies in the solution lead to the following DAG: .. . ↑ X̃ p−2,1 ↑ X̃ p−1,1 ↑ X̃p,1 ··· → → → .. . ↑ X̃p−2,2 ↑ X̃p−1,2 ↑ X̃p,2 ··· → → → .. . ↑ X̃p−2,3 ↑ X̃p−1,3 ↑ X̃p,3 · · · . . . . . . ... The blocks X̃ij on the same anti-diagonal can be solved independently from each other (in parallel). OpenMP 4.0 – task depend Since OpenMP 4.0 such data dependencies can be attached to the omp task directive and the OpenMP runtime system does the scheduling with respect to the DAG. Martin Köhler, [email protected] BLAS-3 Sylvester 22/24 Further Optimization Direct-Acyclic-Graph . . . . . . . . . . . . . . . . . . . . . . . . . . . .(DAG) . . . . . . . . . Scheduling .................................................................... The data dependencies in the solution lead to the following DAG: Sketch of the Implementation: .. .. .. · · ·is done . using · · · annotations . · · ·in the source OpenMP parallelization . ↑ ↑ ↑ code: X̃p−2,1 → X̃p−2,2 → X̃p−2,3 . . . ↑ ↑ ↑ X̃ p−1,1 → X̃p−1,2 → X̃p−1,3 . . . ↑ ↑ ↑ X̃p,1 → X̃p,2 → X̃p,3 ... CALL TGSYLV SOLVE BLOCK(TRANSA, TRANSB, M,N,K,KH,KB,L,LH,LB, & A,LDA,B,LDB,C,LDC,D,LDD,X,LDX,SGN,SCAL,WORK,INFO1,ARCH ) The blocks& X̃ ij on the same anti-diagonal can be solved independently from each other (in parallel). OpenMP 4.0 – task depend Since OpenMP 4.0 such data dependencies can be attached to the omp task directive and the OpenMP runtime system does the scheduling with respect to the DAG. Martin Köhler, [email protected] BLAS-3 Sylvester 22/24 Further Optimization Direct-Acyclic-Graph . . . . . . . . . . . . . . . . . . . . . . . . . . . .(DAG) . . . . . . . . . Scheduling .................................................................... The data dependencies in the solution lead to the following DAG: Sketch of the Implementation: .. .. .. · · ·is done . using · · · annotations . · · ·in the source OpenMP parallelization . ↑ ↑ ↑ code: X̃p−2,1 → X̃p−2,2 → X̃p−2,3 . . . ↑ ↑ ↑ !$omp task firstprivate(K,KH,KB,L,LH,LB,SCAL,INFO1) X̃ → X̃ → X̃ . . . X(K,L)) p−1,1 p−1,2 p−1,3 !$omp& depend(in: X(KOLD,L), X(K,LOLD)) depend(out: ↑ ↑ ↑ !$omp& default(shared) X̃p,1 → X̃p,2 → X̃p,3 ... CALL TGSYLV SOLVE BLOCK(TRANSA, TRANSB, M,N,K,KH,KB,L,LH,LB, & A,LDA,B,LDB,C,LDC,D,LDD,X,LDX,SGN,SCAL,WORK,INFO1,ARCH ) The blocks& X̃ ij on the same anti-diagonal can be solved independently from each omp end task other (in!$parallel). OpenMP 4.0 – task depend Since OpenMP 4.0 such data dependencies can be attached to the omp task directive and the OpenMP runtime system does the scheduling with respect to the DAG. Martin Köhler, [email protected] BLAS-3 Sylvester 22/24 Further Optimization Direct-Acyclic-Graph . . . . . . . . . . . . . . . . . . . . . . . . . . . .(DAG) . . . . . . . . . Scheduling .................................................................... The data dependencies in the solution lead to the following DAG: Sketch of the Implementation: .. .. .. · · ·is done . using · · · annotations . · · ·in the source OpenMP parallelization . ↑ ↑ ↑ code: X̃p−2,1 → X̃p−2,2 → X̃p−2,3 . . . ↑ ↑ ↑ !$omp task firstprivate(K,KH,KB,L,LH,LB,SCAL,INFO1) X̃ → X̃ → X̃ . . . X(K,L)) p−1,1 p−1,2 p−1,3 !$omp& depend(in: X(KOLD,L), X(K,LOLD)) depend(out: ↑ ↑ ↑ !$omp& default(shared) X̃p,1 → X̃p,2 → X̃p,3 ... CALL TGSYLV SOLVE BLOCK(TRANSA, TRANSB, M,N,K,KH,KB,L,LH,LB, & A,LDA,B,LDB,C,LDC,D,LDD,X,LDX,SGN,SCAL,WORK,INFO1,ARCH ) The blocks& X̃ ij on the same anti-diagonal can be solved independently from each omp end task other (in!$parallel). → Annotations are ignored by non OpenMP compatible compilers. OpenMP 4.0 – task depend Since OpenMP 4.0 such data dependencies can be attached to the omp task directive and the OpenMP runtime system does the scheduling with respect to the DAG. Martin Köhler, [email protected] BLAS-3 Sylvester 22/24 Further Optimization Direct-Acyclic-Graph . . . . . . . . . . . . . . . . . . . . . . . . . . . .(DAG) . . . . . . . . . Scheduling .................................................................... Time to solution in [s] 15 RECSY RECSY Parallel Aligned Local Copy Aligned Local Copy + DAG 10 5 0 1,024 1,536 2,048 2,560 Dimension of X̃ Martin Köhler, [email protected] 3,072 3,584 m=n 4,096 4,608 5,120 5,632 6,144 optimal block size chosen BLAS-3 Sylvester 23/24 Further Optimization Direct-Acyclic-Graph . . . . . . . . . . . . . . . . . . . . . . . . . . . .(DAG) . . . . . . . . . Scheduling .................................................................... Speed up: Time to solution in [s] 15 10 5 0 1,024 RECSY Dimension RECSY Parallel m=n Aligned Local Aligned Local Copy 1 024 Aligned Local Copy + DAG 1 536 2 048 2 560 3 072 3 584 4 096 4 608 5 120 5 632 6 144 1,536 2,048 2,560 Dimension of X̃ Martin Köhler, [email protected] Speed up vs. Copy RECSY 2.20 6.24 2.27 5.92 2.29 5.40 2.16 4.56 2.04 4.34 1.96 3.94 1.84 3.88 1.77 3.57 1.76 3.13 1.57 2.92 1.65 3.06 3,072 3,584 m=n 4,096 RECSY Parallel 4.39 4.46 4.37 3.95 3.91 3.74 3.79 3.66 3.51 3.28 3.54 4,608 5,120 5,632 6,144 optimal block size chosen BLAS-3 Sylvester 23/24 Conclusions Block Bartels-Stewart algorithms can beat the RECSY approach by a factor of 2.8 or 6.2. Performance gain of a factor 4.7 if the code is written in a way such that the compiler can optimize the code properly. DAG-Scheduling in OpenMP 4 allows easy high level parallelization on top of the data dependencies. (GCC since 4.9.1, Intel since 15.0, LLVM since 3.9) Same ideas work for LYAP, STEIN, SYLV, SYLV2, GLYAP, GSTEIN and CSYLV. Martin Köhler, [email protected] BLAS-3 Sylvester 24/24 Conclusions Block Bartels-Stewart algorithms can beat the RECSY approach by a factor of 2.8 or 6.2. Performance gain of a factor 4.7 if the code is written in a way such that the compiler can optimize the code properly. DAG-Scheduling in OpenMP 4 allows easy high level parallelization on top of the data dependencies. (GCC since 4.9.1, Intel since 15.0, LLVM since 3.9) Same ideas work for LYAP, STEIN, SYLV, SYLV2, GLYAP, GSTEIN and CSYLV. Outlook GPU/Accelerator enabled implementations. Compile-time tuning by benchmarks of xGEMM and xTRMM. DAG-scheduling like algorithms if OpenMP 4 features are not available. Martin Köhler, [email protected] BLAS-3 Sylvester 24/24 Conclusions Block Bartels-Stewart algorithms can beat the RECSY approach by a factor of 2.8 or 6.2. Performance gain of a factor 4.7 if the code is written in a way such that the compiler can optimize the code properly. DAG-Scheduling in OpenMP 4 allows easy high level parallelization on top of the data dependencies. (GCC since 4.9.1, Intel since 15.0, LLVM since 3.9) Same ideas work for LYAP, STEIN, GLYAP, GSTEIN and Thank you for SYLV, yourSYLV2, attention. CSYLV. Outlook GPU/Accelerator enabled implementations. Compile-time tuning by benchmarks of xGEMM and xTRMM. DAG-scheduling like algorithms if OpenMP 4 features are not available. Martin Köhler, [email protected] BLAS-3 Sylvester 24/24
© Copyright 2026 Paperzz