Efficient Implementation of BLAS Level-3 solvers for Sylvester

Efficient Implementation of BLAS Level-3
solvers for Sylvester-type Matrix Equations
Martin Köhler
February 14, 2017
7th Workshop on Matrix Equations and Tensor Techniques
Introduction
Generalized Sylvester Equation (GSYLV)
AXB ± CXD = Y
Martin Köhler, [email protected]
BLAS-3 Sylvester
2/24
Introduction
Standard Sylvester Equation (SYLV)
Standard Lyapunov Equation (LYAP)
AX ± XB = Y
AX + AX T = Y
Standard Sylvester Equation 2 (SYLV2)
Standard Stein Equation (STEIN)
AXB ± X = Y
AXAT − X = Y
Generalized Sylvester Equation (GSYLV)
AXB ± CXD = Y
Martin Köhler, [email protected]
BLAS-3 Sylvester
2/24
Introduction
Standard Sylvester Equation (SYLV)
Standard Lyapunov Equation (LYAP)
AX ± XB = Y
AX + AX T = Y
Standard Sylvester Equation 2 (SYLV2)
Standard Stein Equation (STEIN)
AXB ± X = Y
AXAT − X = Y
Generalized Sylvester Equation (GSYLV)
AXB ± CXD = Y
Generalized Stein Equation (GSTEIN)
AXAT − BXB T = Y
Generalized Lyapunov Equation (GLYAP)
AXB T + BXAT = Y
Martin Köhler, [email protected]
BLAS-3 Sylvester
2/24
Introduction
Standard Sylvester Equation (SYLV)
Standard Lyapunov Equation (LYAP)
AX ± XB = Y
AX + AX T = Y
Standard Sylvester Equation 2 (SYLV2)
Standard Stein Equation (STEIN)
AXB ± X = Y
AXAT − X = Y
Generalized Sylvester Equation (GSYLV)
AXB ± CXD = Y
Generalized Stein Equation (GSTEIN)
Coupled Sylvester Equation (CSYLV)
AXAT − BXB T = Y
AR ± LB = E
CR ± LD = F
Generalized Lyapunov Equation (GLYAP)
AXB T + BXAT = Y
Martin Köhler, [email protected]
BLAS-3 Sylvester
2/24
Introduction
Implemented
. . . . . . . . . . . . . . . . . .Direct
. . . . . . . . Solvers
. . . . . . . . . . on
. . . . .Shared
. . . . . . . . .Memory
. . . . . . . . . . . .Architectures:
...........................................
Equation
SYLV
SYLV2
LYAP
STEIN
GSYLV
CSYLV
GLYAP
GSTEIN
RECSY1
RECSYCT
RECSYDT
RECLYCT
RECLYDT
RECGSYL
RECGCSY
RECGLYCT
RECGLYDT
1 http://www8.cs.umu.se/
2 http://www.slicot.org/
Software Packages
SLICOT2 LAPACK Alg. 4323
SB04PD
xTRSYL
AXPXB
SB04PD
SB03TD
ATXPXA
SB03UD
SB04OD
xTGSYL
SG03AD
SG03AD
-
Alg. 7054
SYLG
SYLGC
SYLGD
~isak/recsy/
3 http://www.netlib.org/toms/432.gz
4 http://www.netlib.org/toms/705.gz
Martin Köhler, [email protected]
BLAS-3 Sylvester
3/24
Introduction
Implemented
. . . . . . . . . . . . . . . . . .Direct
. . . . . . . . Solvers
. . . . . . . . . . on
. . . . .Shared
. . . . . . . . .Memory
. . . . . . . . . . . .Architectures:
...........................................
Remarks:
→ SLICOT implementations
only usePackages
Level-2 BLAS.
Software
1
2 recursive and Level-3 BLAS,
→ RECSY
implementations
but 7054
Equation
RECSY
SLICOTare
LAPACK Alg. 4323 Alg.
optimized
for 15 years
old architectures,
SYLV
RECSYCT
SB04PD
xTRSYL bad license.
AXPXB
432 – SB04PD
original code by Bartels
and Stewart,
no
SYLV2 → Algorithm
RECSYDT
BLAS
at all, FORTRAN
IV. LYAP
RECLYCT
SB03TD
ATXPXA
705 – SB03UD
LINPACK style code- by Gardiner- et. al.,
STEIN → Algorithm
RECLYDT
Level-1 BLAS calls. GSYLV few RECGSYL
SYLG
– Level-3
BLAS block
implementation -only for
CSYLV→ GLYAP-3
RECGCSY
SB04OD
xTGSYL
GLYAP GLYAP/GSTEIN.
RECGLYCT
SG03AD
SYLGC
GSTEIN
RECGLYDT
SG03AD
SYLGD
1 http://www8.cs.umu.se/
2 http://www.slicot.org/
~isak/recsy/
3 http://www.netlib.org/toms/432.gz
4 http://www.netlib.org/toms/705.gz
Martin Köhler, [email protected]
BLAS-3 Sylvester
3/24
Introduction
Implemented
. . . . . . . . . . . . . . . . . .Direct
. . . . . . . . Solvers
. . . . . . . . . . on
. . . . .Shared
. . . . . . . . .Memory
. . . . . . . . . . . .Architectures:
...........................................
Remarks:
→ SLICOT implementations
only usePackages
Level-2 BLAS.
Software
1
2 recursive and Level-3 BLAS,
→ RECSY
implementations
but 7054
Equation
RECSY
SLICOTare
LAPACK Alg. 4323 Alg.
optimized
for 15 years
old architectures,
SYLV
RECSYCT
SB04PD
xTRSYL bad license.
AXPXB
432 – SB04PD
original code by Bartels
and Stewart,
no
SYLV2 → Algorithm
RECSYDT
BLAS
at all, FORTRAN
IV. LYAP
RECLYCT
SB03TD
ATXPXA
705 – SB03UD
LINPACK style code- by Gardiner- et. al.,
STEIN → Algorithm
RECLYDT
Level-1 BLAS calls. GSYLV few RECGSYL
SYLG
– Level-3
BLAS block
implementation -only for
CSYLV→ GLYAP-3
RECGCSY
SB04OD
xTGSYL
GLYAP GLYAP/GSTEIN.
RECGLYCT
SG03AD
SYLGC
GSTEIN
RECGLYDT
SG03AD
SYLGD
All packages are not feature complete and mostly old.
1 http://www8.cs.umu.se/
2 http://www.slicot.org/
~isak/recsy/
3 http://www.netlib.org/toms/432.gz
4 http://www.netlib.org/toms/705.gz
Martin Köhler, [email protected]
BLAS-3 Sylvester
3/24
Introduction
General
. . . . . . . . . . .Workflow
. . . . . . . . . . . . .in
. . .Direct
. . . . . . . . .Solvers:
.....................................................................
Compute real Schur forms:
A = QAs Q T and B = ZBs Z T .
Compute generalized real Schur forms:
(A, C ) = (Q1 As Z1T , Q1 Cs Z1T )
(B, D) = (Q2 Bs Z2T , Q2 Ds Z2T ).
Transform the right hand side:
Ỹ = Q T YZ .
Transform the right hand side:
Ỹ = Q1T YZ2 .
Solve the (quasi-) triangular equation:
As X̃ ± X̃ Bs = Ỹ .
Solve the (quasi-) triangular equation
As X̃ Bs ± Cs X̃ Ds = Ỹ .
Transform the solution:
X = Q X̃ Z T .
Transform the solution:
X = Z1 X̃ Q2T .
Martin Köhler, [email protected]
BLAS-3 Sylvester
4/24
Introduction
Our focus:
Fast solution
of (quasi-) triangular Sylvester-type MaGeneral
. . . . . . . . . . .Workflow
. . . . . . . . . . . . .in
. . .Direct
. . . . . . . . .Solvers:
.....................................................................
trix Equations:
Compute
generalized
real Schur forms:
Compute real Schur
@ forms:
@
@
@
±
. 1T , Q1 Cs Z1T )
X̃
X̃ C ) = (Q=1 AỸs Z
(A,
T
@
@
@
@
@
@
A = QAs Q T and@
B = ZBs Z@
.
(B, D) = (Q2 Bs Z2T , Q2 Ds Z2T ).
Transform the right hand side:
Ỹ = Q T YZ .
Transform the right hand side:
Ỹ = Q1T YZ2 .
Solve the (quasi-) triangular equation:
As X̃ ± X̃ Bs = Ỹ .
Solve the (quasi-) triangular equation
As X̃ Bs ± Cs X̃ Ds = Ỹ .
Transform the solution:
X = Q X̃ Z T .
Transform the solution:
X = Z1 X̃ Q2T .
Martin Köhler, [email protected]
BLAS-3 Sylvester
4/24
Introduction
Our focus:
Fast solution
of (quasi-) triangular Sylvester-type MaGeneral
. . . . . . . . . . .Workflow
. . . . . . . . . . . . .in
. . .Direct
. . . . . . . . .Solvers:
.....................................................................
trix Equations:
Compute
generalized
real Schur forms:
Compute real Schur
@ forms:
@
@
@
±
. 1T , Q1 Cs Z1T )
X̃
X̃ C ) = (Q=1 AỸs Z
(A,
T
@
@
@
@
@
@
A = QAs Q T and@
B = ZBs Z@
.
(B, D) = (Q2 Bs Z2T , Q2 Ds Z2T ).
We use the Generalized Sylvester Equation as
most complex variant for the optimizations.
Transform the right hand side:
Transform the right hand side:
T
Ỹ = Q1T YZ2 .
Ỹ = Q YZ .
Solve the (quasi-) triangular equation:
As X̃ ± X̃ Bs = Ỹ .
Solve the (quasi-) triangular equation
As X̃ Bs ± Cs X̃ Ds = Ỹ .
Transform the solution:
X = Q X̃ Z T .
Transform the solution:
X = Z1 X̃ Q2T .
Martin Köhler, [email protected]
BLAS-3 Sylvester
4/24
Block-Algorithm
The real Generalized Schur Decomposition of



A11 · · · A1p
C11 · · ·



.
.
..
.
.
As = 
.
.
.  , Cs = 
0
App
0



B11 · · · B1q
D11 · · ·



.
.
..
.
.
Bs = 
.
.
.  , Ds = 
0
Bqq
0
(A, C ) and (B, D) yields:


C1p
X̃11 · · ·
..  , X̃ =  ..
..
 .
.
. 
Cpp

C1q
..  ,
. 
Dqq

X̃1q
..  ,
. 
X̃p1 · · · X̃pq


Ỹ11 · · · Ỹ1q

..  ,
..
Ỹ =  ...
.
. 
Ỹp1 · · · Ỹqq
where (Aii , Cii ) and (Bii , Dii ) are 1 × 1 or 2 × 2 according to the eigenvalues of
(A, C ), or (B, D) respectively.
Martin Köhler, [email protected]
BLAS-3 Sylvester
5/24
Block-Algorithm
The real Generalized Schur Decomposition of



A11 · · · A1p
C11 · · ·



.
.
..
.
.
As = 
.
.
.  , Cs = 
0
App
0



B11 · · · B1q
D11 · · ·



.
.
..
.
.
Bs = 
.
.
.  , Ds = 
0
Bqq
0
(A, C ) and (B, D) yields:


C1p
X̃11 · · ·
..  , X̃ =  ..
..
 .
.
. 
Cpp

C1q
..  ,
. 
Dqq

X̃1q
..  ,
. 
X̃p1 · · · X̃pq


Ỹ11 · · · Ỹ1q

..  ,
..
Ỹ =  ...
.
. 
Ỹp1 · · · Ỹqq
where (Aii , Cii ) and (Bii , Dii ) are 1 × 1 or 2 × 2 according to the eigenvalues of
(A, C ), or (B, D) respectively.
We need to solve p · q small Sylvester equations:
X Akk X̃kl Bll + Ckk X̃k` Dll = Ỹkl −
Aki X̃ij Bjl ± Cki X̃ij Djl .
i=k,...,p
j=1,...,l
(i,j)6=(k,l)
Martin Köhler, [email protected]
BLAS-3 Sylvester
5/24
Block-Algorithm
Our Question:
Are the matrices Akk , All , . . . restricted to be 1 × 1 or 2 × 2 matrices?
Martin Köhler, [email protected]
BLAS-3 Sylvester
6/24
Block-Algorithm
Our Question:
Are the matrices Akk , All , . . . restricted to be 1 × 1 or 2 × 2 matrices?
Answer – No!
As long as no complex eigenvalue-pairs in (A, C ) and (B, D) is split by the
blocking and X̃ and Ỹ are partitioned accordingly everything works for larger
blocks.
Martin Köhler, [email protected]
BLAS-3 Sylvester
6/24
Block-Algorithm
Our Question:
Are the matrices Akk , All , . . . restricted to be 1 × 1 or 2 × 2 matrices?
Answer – No!
As long as no complex eigenvalue-pairs in (A, C ) and (B, D) is split by the
blocking and X̃ and Ỹ are partitioned accordingly everything works for larger
blocks.
Arbitrary block sizes mb for (A, C ) and nb for (B, D) are possible but
adjustments by ±1 to fit the eigenvalue structure are necessary.
Martin Köhler, [email protected]
BLAS-3 Sylvester
6/24
Block-Algorithm
Our Question:
Are the matrices Akk , All , . . . restricted to be 1 × 1 or 2 × 2 matrices?
Answer – No!
As long as no complex eigenvalue-pairs in (A, C ) and (B, D) is split by the
blocking and X̃ and Ỹ are partitioned accordingly everything works for larger
blocks.
Arbitrary block sizes mb for (A, C ) and nb for (B, D) are possible but
adjustments by ±1 to fit the eigenvalue structure are necessary.
Special cases:
(As , Cs ∈ Rm×m , Bs , Ds ∈ Rn×n and X̃ , Ỹ ∈ Rm×n )
mb = 1 and nb = 1 – Level-2 Bartels-Stewart implementation in SLICOT.
Martin Köhler, [email protected]
BLAS-3 Sylvester
6/24
Block-Algorithm
Our Question:
Are the matrices Akk , All , . . . restricted to be 1 × 1 or 2 × 2 matrices?
Answer – No!
As long as no complex eigenvalue-pairs in (A, C ) and (B, D) is split by the
blocking and X̃ and Ỹ are partitioned accordingly everything works for larger
blocks.
Arbitrary block sizes mb for (A, C ) and nb for (B, D) are possible but
adjustments by ±1 to fit the eigenvalue structure are necessary.
(As , Cs ∈ Rm×m , Bs , Ds ∈ Rn×n and X̃ , Ỹ ∈ Rm×n )
Special cases:
mb = 1 and nb = 1 – Level-2 Bartels-Stewart implementation in SLICOT.
mb =
m
2
and nb =
n
2
Martin Köhler, [email protected]
– Recursive Blocking (RECSY)
BLAS-3 Sylvester
6/24
Block-Algorithm
Our Question:
Are the matrices Akk , All , . . . restricted to be 1 × 1 or 2 × 2 matrices?
Answer – No!
As long as no complex eigenvalue-pairs in (A, C ) and (B, D) is split by the
blocking and X̃ and Ỹ are partitioned accordingly everything works for larger
blocks.
Arbitrary block sizes mb for (A, C ) and nb for (B, D) are possible but
adjustments by ±1 to fit the eigenvalue structure are necessary.
(As , Cs ∈ Rm×m , Bs , Ds ∈ Rn×n and X̃ , Ỹ ∈ Rm×n )
Special cases:
mb = 1 and nb = 1 – Level-2 Bartels-Stewart implementation in SLICOT.
mb =
m
2
and nb =
n
2
– Recursive Blocking (RECSY)
mb = m and nb = 1 – Algorithm 705 by Gardiner et. al.
Martin Köhler, [email protected]
BLAS-3 Sylvester
6/24
Block-Algorithm
Our Question:
Are the matrices Akk , All , . . . restricted to be 1 × 1 or 2 × 2 matrices?
Answer – No!
As long as no complex eigenvalue-pairs in (A, C ) and (B, D) is split by the
blocking and X̃ and Ỹ are partitioned accordingly everything works for larger
blocks.
Arbitrary block sizes mb for (A, C ) and nb for (B, D) are possible but
adjustments by ±1 to fit the eigenvalue structure are necessary.
(As , Cs ∈ Rm×m , Bs , Ds ∈ Rn×n and X̃ , Ỹ ∈ Rm×n )
Special cases:
mb = 1 and nb = 1 – Level-2 Bartels-Stewart implementation in SLICOT.
mb =
m
2
and nb =
n
2
– Recursive Blocking (RECSY)
mb = m and nb = 1 – Algorithm 705 by Gardiner et. al.
→ Why are mb and nb not chosen such that Akk , Bll , Ckk , Dll , X̃kl and Ỹkl fit
into the CPU’s cache(s)?
Martin Köhler, [email protected]
BLAS-3 Sylvester
6/24
Efficient Implementation
General
. . . . . . . . . . .Aspects:
..............................................................................................
The updates on the right hand sides Ỹkl
X Ỹkl −
Aki X̃ij Bjl ± Cki X̃ij Djl
i=k,...,p
j=1,...,l
(i,j)6=(k,l)
can be rewritten and unified into few GEMM/TRMM operations.
→ Assumed to be as efficient as possible for sufficiently large matrices.
Martin Köhler, [email protected]
BLAS-3 Sylvester
7/24
Efficient Implementation
General
. . . . . . . . . . .Aspects:
..............................................................................................
The updates on the right hand sides Ỹkl
X Ỹkl −
Aki X̃ij Bjl ± Cki X̃ij Djl
i=k,...,p
j=1,...,l
(i,j)6=(k,l)
can be rewritten and unified into few GEMM/TRMM operations.
→ Assumed to be as efficient as possible for sufficiently large matrices.
Block sizes mb and nb are a freely selectable parameter.
Martin Köhler, [email protected]
BLAS-3 Sylvester
7/24
Efficient Implementation
General
. . . . . . . . . . .Aspects:
..............................................................................................
The updates on the right hand sides Ỹkl
X Ỹkl −
Aki X̃ij Bjl ± Cki X̃ij Djl
i=k,...,p
j=1,...,l
(i,j)6=(k,l)
can be rewritten and unified into few GEMM/TRMM operations.
→ Assumed to be as efficient as possible for sufficiently large matrices.
Block sizes mb and nb are a freely selectable parameter.
The solution X̃ overwrites the right hand side Ỹ .
Martin Köhler, [email protected]
BLAS-3 Sylvester
7/24
Efficient Implementation
General
. . . . . . . . . . .Aspects:
..............................................................................................
The updates on the right hand sides Ỹkl
X Ỹkl −
Aki X̃ij Bjl ± Cki X̃ij Djl
i=k,...,p
j=1,...,l
(i,j)6=(k,l)
can be rewritten and unified into few GEMM/TRMM operations.
→ Assumed to be as efficient as possible for sufficiently large matrices.
Block sizes mb and nb are a freely selectable parameter.
The solution X̃ overwrites the right hand side Ỹ .
All matrices are stored in the Fortran Column-Major-Scheme.
Martin Köhler, [email protected]
BLAS-3 Sylvester
7/24
Efficient Implementation
Algorithm 1 Block Bartels-Stewart Algorithm for Generalized Sylvester Equations
Input: As , Cs ∈ Rm×m , Bs , Ds ∈ Rn×n and Ỹ ∈ Rm×n , mb , nb ∈ N
Output: X̃ ∈ Rm×n overwriting Ỹ
1: if m ≤ mb and n ≤ nb then
2:
Solve As X̃ Bs ± Cs X̃ Ds = Ỹ .
3: else
4:
for k = m, . . . , 1 step by mb do
5:
for l = 1, . . . , n step by nb do
6:
Solve Akk X̃kl Bll ± Ckk X̃kl Dll = Ỹkl .
7:
Ỹk,l+1:n = Ỹk,l+1:n − Akk X̃kl Bkl ∓ Ckk X̃kl Dkl
8:
end for
9:
Ỹ1:k−1,1:n = Ỹ1:k−1,1:n − A1:k−1,k:k+mb −1 X̃k:k+mb −1,1:l Bs
∓ C1:k−1,k:k+mb −1 X̃k:k+mb −1,1:l Ds
10:
end for
11: end if
Martin Köhler, [email protected]
BLAS-3 Sylvester
8/24
Efficient Implementation
Algorithm 1 Block Bartels-Stewart Algorithm for Generalized Sylvester Equations
Input: As , Cs ∈ Rm×m , Bs , Ds ∈ Rn×n and Ỹ ∈ Rm×n , mb , nb ∈ N
Output: X̃ ∈ Rm×n overwriting Ỹ
1: if m ≤ mb and n ≤ nb then
2:
Solve As X̃ Bs ± Cs X̃ Ds = Ỹ .
3: else
Critical operation for a fast
4:
for k = m, . . . , 1 step by mb do
solution scheme.
5:
for l = 1, . . . , n step by nb do
6:
Solve Akk X̃kl Bll ± Ckk X̃kl Dll = Ỹkl .
7:
Ỹk,l+1:n = Ỹk,l+1:n − Akk X̃kl Bkl ∓ Ckk X̃kl Dkl
8:
end for
9:
Ỹ1:k−1,1:n = Ỹ1:k−1,1:n − A1:k−1,k:k+mb −1 X̃k:k+mb −1,1:l Bs
∓ C1:k−1,k:k+mb −1 X̃k:k+mb −1,1:l Ds
10:
end for
11: end if
Martin Köhler, [email protected]
BLAS-3 Sylvester
8/24
Efficient Implementation
Algorithm 1 Block Bartels-Stewart Algorithm for Generalized Sylvester Equations
Input: As , Cs ∈ Rm×m , Bs , Ds ∈ Rn×n and Ỹ ∈ Rm×n , mb , nb ∈ N
Output: X̃ ∈ Rm×n overwriting Ỹ
1: if m ≤ mb and n ≤ nb then
2:
Solve As X̃ Bs ± Cs X̃ Ds = Ỹ .
3: else
Critical operation for a fast
4:
for k = m, . . . , 1 step by mb do
solution scheme.
5:
for l = 1, . . . , n step by nb do
6:
Solve Akk X̃kl Bll ± Ckk X̃kl Dll = Ỹkl .
7:
Ỹk,l+1:n = Ỹk,l+1:n − Akk X̃kl Bkl ∓ Ckk X̃kl Dkl
8:
end for
9:
Ỹ1:k−1,1:n = Ỹ1:k−1,1:n − A1:k−1,k:k+mb −1 X̃k:k+mb −1,1:l Bs
∓ C1:k−1,k:k+mb −1 X̃k:k+mb −1,1:l Ds
10:
end for
11: end if
Goal: Develop an efficient solver for small Sylvester Equations.
Martin Köhler, [email protected]
BLAS-3 Sylvester
8/24
Efficient Inner Solvers
Hardware
. . . . . . . . . . . . . and
. . . . . .Software
. . . . . . . . . . . .Setup
..........................................................................
We use a test driven development scheme for the kernels on random matrices.
Hardware
Intel Xeon E5-2640v3 (Haswell), 2x8 Cores, 16x256kB L2 Cache,
2x20MB L3 Cache, AVX vector unit
64GB DDR3 RAM
Software
CentOS 7.3 – x86 64
Intel Parallel Studio XE 2017 C/Fortran Compiler + Intel MKL 2017.0.1
Compiler-Flags: -O3 -xHost -qopenmp
Computational routines are written in Fortran 90/95.
Reference results with Algorithm 705 and RECSY, double precision.
Residual and Forward error of all results are comparable.
Martin Köhler, [email protected]
BLAS-3 Sylvester
9/24
Efficient Inner Solvers
Naive
. . . . . . . . Approach
.................................................................................................
Sylvester equations with m ≤ 2 and n ≤ 2 are trivial to solve via their
Kronecker representation:
AXB ± CXD = Y
⇐⇒
(B T ⊗ A ± D T ⊗ C ) vec X = vec Y
Larger problems are solved using Algorithm 1 with mb = 1 and nb = 1.
Martin Köhler, [email protected]
BLAS-3 Sylvester
10/24
Efficient Inner Solvers
Naive
. . . . . . . . Approach
.................................................................................................
Sylvester equations with m ≤ 2 and n ≤ 2 are trivial to solve via their
Kronecker representation:
AXB ± CXD = Y
⇐⇒
(B T ⊗ A ± D T ⊗ C ) vec X = vec Y
Larger problems are solved using Algorithm 1 with mb = 1 and nb = 1.
Usual LAPACK-like blocking scheme.
Martin Köhler, [email protected]
BLAS-3 Sylvester
10/24
Efficient Inner Solvers
Naive
. . . . . . . . Approach
.................................................................................................
Sylvester equations with m ≤ 2 and n ≤ 2 are trivial to solve via their
Kronecker representation:
AXB ± CXD = Y
⇐⇒
(B T ⊗ A ± D T ⊗ C ) vec X = vec Y
Larger problems are solved using Algorithm 1 with mb = 1 and nb = 1.
Usual LAPACK-like blocking scheme.
Test Procedure
Algorithm 1 with random matrices A, B, C , D ∈ Rm×m , m = 1, . . . , 1030.
Eigenvalues sorted such that A64k+1,64k = 0 and B64k+1,64k = 0, ∀k ∈ N.
Only inner solver exchanged/optimized.
16 threads for multi-threaded BLAS calls.
Martin Köhler, [email protected]
BLAS-3 Sylvester
10/24
Efficient Inner Solvers
Naive
. . . . . . . . Approach
.................................................................................................
0.6
Time to solution in [s]
RECSY
Algorithm 705
Naive
0.4
0.2
0
0
100
200
300
Dimension of X̃
Martin Köhler, [email protected]
400
500
m=n
600
700
800
900
1,000
block size mb = nb = 64
BLAS-3 Sylvester
11/24
Efficient Inner Solvers
Naive
. . . . . . . . Approach
.................................................................................................
0.6
Time to solution in [s]
RECSY
Algorithm 705
Naive
Reproducible peaks at m =
512 and m = 1024.
0.4
0.2
0
0
100
200
300
Dimension of X̃
Martin Köhler, [email protected]
400
500
m=n
600
700
800
900
1,000
block size mb = nb = 64
BLAS-3 Sylvester
11/24
Efficient Inner Solvers
Level-2
. . . . . . . . . . BLAS
. . . . . . . . .Calls
......................................................................................
Replace Level-3 BLAS calls with the corresponding Level-2 operations:
xGEMM → xGEMV, xGER, and xAXPY,
xTRMM → xTRMV.
Martin Köhler, [email protected]
BLAS-3 Sylvester
12/24
Efficient Inner Solvers
Level-2
. . . . . . . . . . BLAS
. . . . . . . . .Calls
......................................................................................
Replace Level-3 BLAS calls with the corresponding Level-2 operations:
xGEMM → xGEMV, xGER, and xAXPY,
xTRMM → xTRMV.
GEMM operations up to 2 × 2 are directly computed.
Martin Köhler, [email protected]
BLAS-3 Sylvester
12/24
Efficient Inner Solvers
Level-2
. . . . . . . . . . BLAS
. . . . . . . . .Calls
......................................................................................
Replace Level-3 BLAS calls with the corresponding Level-2 operations:
xGEMM → xGEMV, xGER, and xAXPY,
xTRMM → xTRMV.
GEMM operations up to 2 × 2 are directly computed.
Appearing xAXPY operations, caused by the GEMM replacement, are
performed by Fortran intrinsics:
CALL
CALL
CALL
CALL
DAXPY(N-LH,
DAXPY(N-LH,
DAXPY(N-LH,
DAXPY(N-LH,
-MAT(1,1),
-MAT(3,1),
-MAT(2,1),
-MAT(4,1),
B(L,LH+1), LDB, X(K,LH+1), LDX)
B(LH,LH+1), LDB,X(K,LH+1), LDX)
B(L,LH+1), LDB, X(KH,LH+1), LDX)
B(LH,LH+1), LDB,X(KH,LH+1), LDX)
is transformed into
X(K,LH+1:N) = X(K,LH+1:N) - MAT(1,1) * B(L,LH+1:N) MAT(3,1) * B(LH,LH+1:N) - SGN * MAT(1,2) * D(L,LH+1:N)
- SGN * MAT(3,2) * D(LH, LH+1:N)
Martin Köhler, [email protected]
BLAS-3 Sylvester
12/24
Efficient Inner Solvers
Level-2
. . . . . . . . . . BLAS
. . . . . . . . .Calls
......................................................................................
Time to solution in [s]
0.6
RECSY
Algorithm 705
Naive
BLAS-2
0.4
0.2
0
0
100
200
300
Dimension of X̃
Martin Köhler, [email protected]
400
500
m=n
600
700
800
900
1,000
block size mb = nb = 64
BLAS-3 Sylvester
13/24
Efficient Inner Solvers
Level-2
. . . . . . . . . . BLAS
. . . . . . . . .Calls
......................................................................................
Time to solution in [s]
0.6
RECSY
Algorithm 705
Naive
BLAS-2
0.4
Still reproducible peaks at
m = 512 and m = 1024.
0.2
0
0
100
200
300
Dimension of X̃
Martin Köhler, [email protected]
400
500
m=n
600
700
800
900
1,000
block size mb = nb = 64
BLAS-3 Sylvester
13/24
Efficient Inner Solvers
Reorder
. . . . . . . . . . .the
. . . . .data
. . . . . . .access
..................................................................................
The peaks at m = 512 and m = 1024 (and the next ones at 1536 and 2048) are
caused by cache-misses/unused prefetching.
Reason: The leading dimension of the matrix (times 8 Byte per value) is a
multiple of the pagesize (4096 Bytes).
The access to X(K,LH+1:N), B(L,LH+1:N), and D(L,LH+1:N) is not well suited
for the matrix storage. (column-major-storage vs. row-wise access).
Martin Köhler, [email protected]
BLAS-3 Sylvester
14/24
Efficient Inner Solvers
Reorder
. . . . . . . . . . .the
. . . . .data
. . . . . . .access
..................................................................................
The peaks at m = 512 and m = 1024 (and the next ones at 1536 and 2048) are
caused by cache-misses/unused prefetching.
Reason: The leading dimension of the matrix (times 8 Byte per value) is a
multiple of the pagesize (4096 Bytes).
The access to X(K,LH+1:N), B(L,LH+1:N), and D(L,LH+1:N) is not well suited
for the matrix storage. (column-major-storage vs. row-wise access).
Same optimizations as before but. . .
Martin Köhler, [email protected]
BLAS-3 Sylvester
14/24
Efficient Inner Solvers
Reorder
. . . . . . . . . . .the
. . . . .data
. . . . . . .access
..................................................................................
The peaks at m = 512 and m = 1024 (and the next ones at 1536 and 2048) are
caused by cache-misses/unused prefetching.
Reason: The leading dimension of the matrix (times 8 Byte per value) is a
multiple of the pagesize (4096 Bytes).
The access to X(K,LH+1:N), B(L,LH+1:N), and D(L,LH+1:N) is not well suited
for the matrix storage. (column-major-storage vs. row-wise access).
Same optimizations as before but. . .
solve the small equation column-wise instead of row-wise.
The crucial operations change to:
X(1:K-1,L) = X(1:K-1,L) - MAT(1,1) * A(1:K-1,K)
- MAT(2,1) * A(1:K-1,KH) - SGN * MAT(1,2) * C(1:K-1,K)
- SGN * MAT(2,2) * C(1:K-1,KH)
→ Access on Xkl , Akk and Ckk fits the storage scheme.
Martin Köhler, [email protected]
BLAS-3 Sylvester
14/24
Efficient Inner Solvers
Reorder
. . . . . . . . . . .the
. . . . .data
. . . . . . .access
..................................................................................
Time to solution in [s]
0.6
RECSY
Algorithm 705
Naive
BLAS-2
Reordered
0.4
0.2
0
0
100
200
300
Dimension of X̃
Martin Köhler, [email protected]
400
500
m=n
600
700
800
900
1,000
block size mb = nb = 64
BLAS-3 Sylvester
15/24
Efficient Inner Solvers
Reorder
. . . . . . . . . . .the
. . . . .data
. . . . . . .access
..................................................................................
Time to solution in [s]
0.6
RECSY
Algorithm 705
Naive
BLAS-2
Reordered
0.4
One small peak at m = 1024
left.
0.2
0
0
100
200
300
Dimension of X̃
Martin Köhler, [email protected]
400
500
m=n
600
700
800
900
1,000
block size mb = nb = 64
BLAS-3 Sylvester
15/24
Efficient Inner Solvers
Use
. . . . . .local
. . . . . . .copies
. . . . . . . . .of
. . .A
. .kk
. . .,. .B. .ll.,. .C
. .kk
. . ,. . D
. . ll. .,. .and
. . . . .X
. . kl
.................................................
The leading dimension of As , Bs , Cs and Ds still yields cache misses and
unnecessary prefetching.
Keep all previous optimizations and
Martin Köhler, [email protected]
BLAS-3 Sylvester
16/24
Efficient Inner Solvers
Use
. . . . . .local
. . . . . . .copies
. . . . . . . . .of
. . .A
. .kk
. . .,. .B. .ll.,. .C
. .kk
. . ,. . D
. . ll. .,. .and
. . . . .X
. . kl
.................................................
The leading dimension of As , Bs , Cs and Ds still yields cache misses and
unnecessary prefetching.
Keep all previous optimizations and
create a local copy of Akk and Ckk with leading dimension mb ,
Martin Köhler, [email protected]
BLAS-3 Sylvester
16/24
Efficient Inner Solvers
Use
. . . . . .local
. . . . . . .copies
. . . . . . . . .of
. . .A
. .kk
. . .,. .B. .ll.,. .C
. .kk
. . ,. . D
. . ll. .,. .and
. . . . .X
. . kl
.................................................
The leading dimension of As , Bs , Cs and Ds still yields cache misses and
unnecessary prefetching.
Keep all previous optimizations and
create a local copy of Akk and Ckk with leading dimension mb ,
create a local copy of Bll and Cll with leading dimension nb ,
Martin Köhler, [email protected]
BLAS-3 Sylvester
16/24
Efficient Inner Solvers
Use
. . . . . .local
. . . . . . .copies
. . . . . . . . .of
. . .A
. .kk
. . .,. .B. .ll.,. .C
. .kk
. . ,. . D
. . ll. .,. .and
. . . . .X
. . kl
.................................................
The leading dimension of As , Bs , Cs and Ds still yields cache misses and
unnecessary prefetching.
Keep all previous optimizations and
create a local copy of Akk and Ckk with leading dimension mb ,
create a local copy of Bll and Cll with leading dimension nb ,
create a local copy of Xkl with leading dimension mb and copy it to the
original location afterwards.
Martin Köhler, [email protected]
BLAS-3 Sylvester
16/24
Efficient Inner Solvers
Use
. . . . . .local
. . . . . . .copies
. . . . . . . . .of
. . .A
. .kk
. . .,. .B. .ll.,. .C
. .kk
. . ,. . D
. . ll. .,. .and
. . . . .X
. . kl
.................................................
The leading dimension of As , Bs , Cs and Ds still yields cache misses and
unnecessary prefetching.
Keep all previous optimizations and
create a local copy of Akk and Ckk with leading dimension mb ,
create a local copy of Bll and Cll with leading dimension nb ,
create a local copy of Xkl with leading dimension mb and copy it to the
original location afterwards.
All data can be copied to the (L2) cache before the solution of the inner
Sylvester equation starts.
Martin Köhler, [email protected]
BLAS-3 Sylvester
16/24
Efficient Inner Solvers
Use
. . . . . .local
. . . . . . .copies
. . . . . . . . .of
. . .A
. .kk
. . .,. .B. .ll.,. .C
. .kk
. . ,. . D
. . ll. .,. .and
. . . . .X
. . kl
.................................................
Time to solution in [s]
0.6
RECSY
Algorithm 705
Naive
BLAS-2
Reordered
Local Copy
0.4
0.2
0
0
100
200
300
Dimension of X̃
Martin Köhler, [email protected]
400
500
m=n
600
700
800
900
1,000
block size mb = nb = 64
BLAS-3 Sylvester
17/24
Efficient Inner Solvers
Alignment
. . . . . . . . . . . . . . of
. . . .the
. . . . .local
. . . . . . .copies
...........................................................................
Vector units (AVX, VSX,. . . ) of modern CPUs need a special data alignment to
work really fast.
→ Without additional help compilers cannot produce efficient code for them.
Keep all previous optimizations, especially the local copies.
Martin Köhler, [email protected]
BLAS-3 Sylvester
18/24
Efficient Inner Solvers
Alignment
. . . . . . . . . . . . . . of
. . . .the
. . . . .local
. . . . . . .copies
...........................................................................
Vector units (AVX, VSX,. . . ) of modern CPUs need a special data alignment to
work really fast.
→ Without additional help compilers cannot produce efficient code for them.
Keep all previous optimizations, especially the local copies.
Annotate the declaration of the local data such that they are 64-byte aligned:
!dir$ attributes align: 64:: AL, BL, CL, DL, XL
or
!IBM* ALIGN(64,AL,BL,CL,DL,XL)
depending on the Fortran compiler.
Martin Köhler, [email protected]
BLAS-3 Sylvester
18/24
Efficient Inner Solvers
Alignment
. . . . . . . . . . . . . . of
. . . .the
. . . . .local
. . . . . . .copies
...........................................................................
Vector units (AVX, VSX,. . . ) of modern CPUs need a special data alignment to
work really fast.
→ Without additional help compilers cannot produce efficient code for them.
Keep all previous optimizations, especially the local copies.
Annotate the declaration of the local data such that they are 64-byte aligned:
!dir$ attributes align: 64:: AL, BL, CL, DL, XL
or
!IBM* ALIGN(64,AL,BL,CL,DL,XL)
depending on the Fortran compiler.
Replace all BLAS calls except of TRMV with Fortran vector intrinsics.
Martin Köhler, [email protected]
BLAS-3 Sylvester
18/24
Efficient Inner Solvers
Alignment
. . . . . . . . . . . . . . of
. . . .the
. . . . .local
. . . . . . .copies
...........................................................................
Vector units (AVX, VSX,. . . ) of modern CPUs need a special data alignment to
work really fast.
→ Without additional help compilers cannot produce efficient code for them.
Keep all previous optimizations, especially the local copies.
Annotate the declaration of the local data such that they are 64-byte aligned:
!dir$ attributes align: 64:: AL, BL, CL, DL, XL
or
!IBM* ALIGN(64,AL,BL,CL,DL,XL)
depending on the Fortran compiler.
Replace all BLAS calls except of TRMV with Fortran vector intrinsics.
→ The compiler should be able to optimize our code.
Martin Köhler, [email protected]
BLAS-3 Sylvester
18/24
Efficient Inner Solvers
Alignment
. . . . . . . . . . . . . . of
. . . .the
. . . . .local
. . . . . . .copies
...........................................................................
Time to solution in [s]
0.6
RECSY
Algorithm 705
Naive
BLAS-2
Reordered
Local Copy
Aligned Local Copy
0.4
0.2
0
0
100
200
300
Dimension of X̃
Martin Köhler, [email protected]
400
500
m=n
600
700
800
900
1,000
block size mb = nb = 64
BLAS-3 Sylvester
19/24
Efficient Inner Solvers
Large
. . . . . . . . Scale
. . . . . . . .Test
.........................................................................................
Time to solution in [s]
30
20
RECSY
Naive
BLAS-2
Reordered
Local Copy
Aligned Local Copy
10
0
1,024
1,536
2,048
2,560
Dimension of X̃
Martin Köhler, [email protected]
3,072
3,584
m=n
4,096
4,608
5,120
5,632
6,144
optimal block size chosen
BLAS-3 Sylvester
20/24
Efficient Inner Solvers
Large
. . . . . . . . Scale
. . . . . . . .Test
.........................................................................................
Time to solution in [s]
30
20
10
0
1,024
Speed up:
RECSY
Dimension
m=n
Naive
1 024
BLAS-2
1 536
Reordered
2 048
Local Copy
2 560
Aligned Local Copy
3 072
3 584
4 096
4 608
5 120
5 632
6 144
1,536
2,048
2,560
Dimension of X̃
Martin Köhler, [email protected]
3,072
Speed up vs.
RECSY
Naive
2.83
4.76
2.61
4.16
2.36
4.15
2.11
3.83
2.12
3.78
2.01
3.68
2.11
3.88
2.01
3.56
1.78
3.47
1.86
3.39
1.86
3.45
3,584
m=n
4,096
4,608
5,120
5,632
6,144
optimal block size chosen
BLAS-3 Sylvester
20/24
Further Optimization
Optimal Block Sizes mb and nb
The optimal block sizes mb and nb must be chosen such that:
small enough that the inner solvers working nearly inside the (L2) CPU cache
and large enough that the matrix-matrix products in Algorithm 1 are making
use of the multi-threading capabilities of the BLAS library.
Martin Köhler, [email protected]
BLAS-3 Sylvester
21/24
Further Optimization
Optimal Block Sizes mb and nb
The optimal block sizes mb and nb must be chosen such that:
small enough that the inner solvers working nearly inside the (L2) CPU cache
and large enough that the matrix-matrix products in Algorithm 1 are making
use of the multi-threading capabilities of the BLAS library.
For large m and n and increasing number of CPU cores it is no longer
possible to satisfy both conditions properly.
Martin Köhler, [email protected]
BLAS-3 Sylvester
21/24
Further Optimization
Optimal Block Sizes mb and nb
The optimal block sizes mb and nb must be chosen such that:
small enough that the inner solvers working nearly inside the (L2) CPU cache
and large enough that the matrix-matrix products in Algorithm 1 are making
use of the multi-threading capabilities of the BLAS library.
For large m and n and increasing number of CPU cores it is no longer
possible to satisfy both conditions properly.
Idea
Move the parallelization from the BLAS library to the Algorithm.
Use the data dependencies between the p · q blocks of the solution matrix X̃ :
X̃p1 depends on nothing.
X̃pj , j = 2 . . . q depend on X̃p,j−1 .
X̃i1 , i = p − 1 . . . 1 depend on X̃i+1,1 .
X̃ij , i = p − 1 . . . 1, j = 2 . . . q depend on X̃i,j−1 and X̃i+1,j .
Martin Köhler, [email protected]
BLAS-3 Sylvester
21/24
Further Optimization
Direct-Acyclic-Graph
. . . . . . . . . . . . . . . . . . . . . . . . . . . .(DAG)
. . . . . . . . . Scheduling
....................................................................
The data dependencies in the solution lead to the following DAG:

..
.
↑



X̃
 p−2,1
 ↑

X̃
 p−1,1
 ↑
X̃p,1
···
→
→
→
..
.
↑
X̃p−2,2
↑
X̃p−1,2
↑
X̃p,2
···
→
→
→
..
.
↑
X̃p−2,3
↑
X̃p−1,3
↑
X̃p,3

· · ·


. . .



. . .


...
The blocks X̃ij on the same anti-diagonal can be solved independently from each
other (in parallel).
Martin Köhler, [email protected]
BLAS-3 Sylvester
22/24
Further Optimization
Direct-Acyclic-Graph
. . . . . . . . . . . . . . . . . . . . . . . . . . . .(DAG)
. . . . . . . . . Scheduling
....................................................................
The data dependencies in the solution lead to the following DAG:

..
.
↑



X̃
 p−2,1
 ↑

X̃
 p−1,1
 ↑
X̃p,1
···
→
→
→
..
.
↑
X̃p−2,2
↑
X̃p−1,2
↑
X̃p,2
···
→
→
→
..
.
↑
X̃p−2,3
↑
X̃p−1,3
↑
X̃p,3

· · ·


. . .



. . .


...
The blocks X̃ij on the same anti-diagonal can be solved independently from each
other (in parallel).
OpenMP 4.0 – task depend
Since OpenMP 4.0 such data dependencies can be attached to the omp task
directive and the OpenMP runtime system does the scheduling with respect to the
DAG.
Martin Köhler, [email protected]
BLAS-3 Sylvester
22/24
Further Optimization
Direct-Acyclic-Graph
. . . . . . . . . . . . . . . . . . . . . . . . . . . .(DAG)
. . . . . . . . . Scheduling
....................................................................
The data dependencies in the solution lead to the following DAG:
Sketch of the
 Implementation:

..
..
..
· · ·is done
. using
· · · annotations
.
· · ·in the source
OpenMP parallelization
 .
 ↑

↑
↑


code:


X̃p−2,1 → X̃p−2,2 → X̃p−2,3 . . .
 ↑

↑
↑



X̃
 p−1,1 → X̃p−1,2 → X̃p−1,3 . . .
 ↑

↑
↑
X̃p,1
→
X̃p,2
→
X̃p,3
...
CALL TGSYLV SOLVE BLOCK(TRANSA, TRANSB, M,N,K,KH,KB,L,LH,LB, &
A,LDA,B,LDB,C,LDC,D,LDD,X,LDX,SGN,SCAL,WORK,INFO1,ARCH
)
The blocks& X̃
ij on the same anti-diagonal can be solved independently from each
other (in parallel).
OpenMP 4.0 – task depend
Since OpenMP 4.0 such data dependencies can be attached to the omp task
directive and the OpenMP runtime system does the scheduling with respect to the
DAG.
Martin Köhler, [email protected]
BLAS-3 Sylvester
22/24
Further Optimization
Direct-Acyclic-Graph
. . . . . . . . . . . . . . . . . . . . . . . . . . . .(DAG)
. . . . . . . . . Scheduling
....................................................................
The data dependencies in the solution lead to the following DAG:
Sketch of the
 Implementation:

..
..
..
· · ·is done
. using
· · · annotations
.
· · ·in the source
OpenMP parallelization
 .
 ↑

↑
↑


code:


X̃p−2,1 → X̃p−2,2 → X̃p−2,3 . . .
 ↑

↑
↑
!$omp task firstprivate(K,KH,KB,L,LH,LB,SCAL,INFO1)



X̃
→
X̃
→
X̃
.
.
.

 X(K,L))
p−1,1
p−1,2
p−1,3
!$omp& depend(in:
X(KOLD,L),
X(K,LOLD))
depend(out:
 ↑

↑
↑
!$omp& default(shared)
X̃p,1
→
X̃p,2
→
X̃p,3
...
CALL TGSYLV SOLVE BLOCK(TRANSA, TRANSB, M,N,K,KH,KB,L,LH,LB, &
A,LDA,B,LDB,C,LDC,D,LDD,X,LDX,SGN,SCAL,WORK,INFO1,ARCH
)
The blocks& X̃
ij on the same anti-diagonal can be solved independently from each
omp end task
other (in!$parallel).
OpenMP 4.0 – task depend
Since OpenMP 4.0 such data dependencies can be attached to the omp task
directive and the OpenMP runtime system does the scheduling with respect to the
DAG.
Martin Köhler, [email protected]
BLAS-3 Sylvester
22/24
Further Optimization
Direct-Acyclic-Graph
. . . . . . . . . . . . . . . . . . . . . . . . . . . .(DAG)
. . . . . . . . . Scheduling
....................................................................
The data dependencies in the solution lead to the following DAG:
Sketch of the
 Implementation:

..
..
..
· · ·is done
. using
· · · annotations
.
· · ·in the source
OpenMP parallelization
 .
 ↑

↑
↑


code:


X̃p−2,1 → X̃p−2,2 → X̃p−2,3 . . .
 ↑

↑
↑
!$omp task firstprivate(K,KH,KB,L,LH,LB,SCAL,INFO1)



X̃
→
X̃
→
X̃
.
.
.

 X(K,L))
p−1,1
p−1,2
p−1,3
!$omp& depend(in:
X(KOLD,L),
X(K,LOLD))
depend(out:
 ↑

↑
↑
!$omp& default(shared)
X̃p,1
→
X̃p,2
→
X̃p,3
...
CALL TGSYLV SOLVE BLOCK(TRANSA, TRANSB, M,N,K,KH,KB,L,LH,LB, &
A,LDA,B,LDB,C,LDC,D,LDD,X,LDX,SGN,SCAL,WORK,INFO1,ARCH
)
The blocks& X̃
ij on the same anti-diagonal can be solved independently from each
omp end task
other (in!$parallel).
→ Annotations are ignored by non OpenMP compatible compilers.
OpenMP 4.0 – task depend
Since OpenMP 4.0 such data dependencies can be attached to the omp task
directive and the OpenMP runtime system does the scheduling with respect to the
DAG.
Martin Köhler, [email protected]
BLAS-3 Sylvester
22/24
Further Optimization
Direct-Acyclic-Graph
. . . . . . . . . . . . . . . . . . . . . . . . . . . .(DAG)
. . . . . . . . . Scheduling
....................................................................
Time to solution in [s]
15
RECSY
RECSY Parallel
Aligned Local Copy
Aligned Local Copy + DAG
10
5
0
1,024
1,536
2,048
2,560
Dimension of X̃
Martin Köhler, [email protected]
3,072
3,584
m=n
4,096
4,608
5,120
5,632
6,144
optimal block size chosen
BLAS-3 Sylvester
23/24
Further Optimization
Direct-Acyclic-Graph
. . . . . . . . . . . . . . . . . . . . . . . . . . . .(DAG)
. . . . . . . . . Scheduling
....................................................................
Speed up:
Time to solution in [s]
15
10
5
0
1,024
RECSY
Dimension
RECSY Parallel
m=n
Aligned Local
Aligned Local Copy
1 024
Aligned
Local Copy + DAG
1 536
2 048
2 560
3 072
3 584
4 096
4 608
5 120
5 632
6 144
1,536
2,048
2,560
Dimension of X̃
Martin Köhler, [email protected]
Speed up vs.
Copy
RECSY
2.20
6.24
2.27
5.92
2.29
5.40
2.16
4.56
2.04
4.34
1.96
3.94
1.84
3.88
1.77
3.57
1.76
3.13
1.57
2.92
1.65
3.06
3,072
3,584
m=n
4,096
RECSY Parallel
4.39
4.46
4.37
3.95
3.91
3.74
3.79
3.66
3.51
3.28
3.54
4,608
5,120
5,632
6,144
optimal block size chosen
BLAS-3 Sylvester
23/24
Conclusions
Block Bartels-Stewart algorithms can beat the RECSY approach by a factor
of 2.8 or 6.2.
Performance gain of a factor 4.7 if the code is written in a way such that the
compiler can optimize the code properly.
DAG-Scheduling in OpenMP 4 allows easy high level parallelization on top of
the data dependencies. (GCC since 4.9.1, Intel since 15.0, LLVM since 3.9)
Same ideas work for LYAP, STEIN, SYLV, SYLV2, GLYAP, GSTEIN and
CSYLV.
Martin Köhler, [email protected]
BLAS-3 Sylvester
24/24
Conclusions
Block Bartels-Stewart algorithms can beat the RECSY approach by a factor
of 2.8 or 6.2.
Performance gain of a factor 4.7 if the code is written in a way such that the
compiler can optimize the code properly.
DAG-Scheduling in OpenMP 4 allows easy high level parallelization on top of
the data dependencies. (GCC since 4.9.1, Intel since 15.0, LLVM since 3.9)
Same ideas work for LYAP, STEIN, SYLV, SYLV2, GLYAP, GSTEIN and
CSYLV.
Outlook
GPU/Accelerator enabled implementations.
Compile-time tuning by benchmarks of xGEMM and xTRMM.
DAG-scheduling like algorithms if OpenMP 4 features are not available.
Martin Köhler, [email protected]
BLAS-3 Sylvester
24/24
Conclusions
Block Bartels-Stewart algorithms can beat the RECSY approach by a factor
of 2.8 or 6.2.
Performance gain of a factor 4.7 if the code is written in a way such that the
compiler can optimize the code properly.
DAG-Scheduling in OpenMP 4 allows easy high level parallelization on top of
the data dependencies. (GCC since 4.9.1, Intel since 15.0, LLVM since 3.9)
Same ideas work
for LYAP,
STEIN,
GLYAP, GSTEIN and
Thank
you
for SYLV,
yourSYLV2,
attention.
CSYLV.
Outlook
GPU/Accelerator enabled implementations.
Compile-time tuning by benchmarks of xGEMM and xTRMM.
DAG-scheduling like algorithms if OpenMP 4 features are not available.
Martin Köhler, [email protected]
BLAS-3 Sylvester
24/24