Multilevel Hierarchical Matrix Multiplication on Clusters

Sascha Hunold, Thomas Rauber, Gudula Rünger
Outline
• Background
• Introduction to the algorithms
• Multilevel combinations
• Experimental results
• Conclusion
Background
C_{ij} = \sum_{k=1}^{n} A_{ik} B_{kj},   (i, j = 1, ..., n)

The standard algorithm requires O(n^3) arithmetic operations.
• One of the core computational kernels in scientific computing and numerical analysis.
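As a concrete reference point, the definition above corresponds to the classic triple loop (a plain sketch, independent of any library):

```python
# Naive triple-loop matrix multiplication implementing
# C[i][j] = sum over k of A[i][k] * B[k][j].
# The three nested loops make the O(n^3) operation count explicit.
def matmul(A, B):
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            s = 0.0
            for k in range(n):
                s += A[i][k] * B[k][j]
            C[i][j] = s
    return C

A = [[1.0, 2.0], [3.0, 4.0]]
B = [[5.0, 6.0], [7.0, 8.0]]
print(matmul(A, B))  # [[19.0, 22.0], [43.0, 50.0]]
```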
Background
• Many efficient realizations have been
invented over the years, such as
Strassen on distributed system and
BLAS on single processor.
• In this paper, different combinations of existing algorithms applied at multiple hierarchy levels are investigated and compared with the same algorithms used in isolation.
Introduction to the algorithms
• Strassen's algorithm
• Task-parallel matrix multiplication (tpMM)
• PDGEMM
Strassen's algorithm
n n
R
• Matrices A and B are of dimension
with an even n, the matrix product can
be expressed as:
where
 C11

 C21
C12   A11

C22   A21
A12   B11

A22   B21
B12 

B22 
Q1  ( A11  A22 )( B11  B22 )
C11  Q1  Q4  Q5  Q7
C12  Q3  Q5
C21  Q2  Q4
C22  Q1  Q3  Q2  Q6
Total 7 multiplications and 18 additions
Q2  ( A21  A22 ) B11
Q3  A11 ( B12  B22 )
Q4  A22 ( B21  B11 )
Q5  ( A11  A12 ) B22
Q6  ( A21  A11 )( B11  B12 )
Q7  ( A12  A22 )( B21  B22 )
Strassen's algorithm
Each of the four result blocks is assigned to one task; every task computes two of the products locally and exchanges the rest (Q1 is computed redundantly by tasks C11 and C22 to save communication):

Task C11: compute Q1, compute Q7; receive Q5, receive Q4
Task C12: compute Q3, compute Q5; send Q5, send Q3
Task C21: compute Q2, compute Q4; send Q2, send Q4
Task C22: compute Q1, compute Q6; receive Q2, receive Q3

Time complexity O(n^{log2 7}) ≈ O(n^{2.81}) when Strassen is called recursively on the distributed sub-blocks.
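A sequential sketch of the recursion (assuming NumPy and power-of-two matrix sizes; the `cutoff` parameter and the fallback to a BLAS-backed product are illustrative choices, not the paper's parallel implementation, which distributes the sub-blocks over task groups):

```python
import numpy as np

def strassen(A, B, cutoff=64):
    # Recursive Strassen multiplication for square matrices whose
    # size is a power of two; below `cutoff`, fall back to NumPy's
    # ordinary (BLAS-backed) product.
    n = A.shape[0]
    if n <= cutoff:
        return A @ B
    h = n // 2
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
    Q1 = strassen(A11 + A22, B11 + B22, cutoff)
    Q2 = strassen(A21 + A22, B11, cutoff)
    Q3 = strassen(A11, B12 - B22, cutoff)
    Q4 = strassen(A22, B21 - B11, cutoff)
    Q5 = strassen(A11 + A12, B22, cutoff)
    Q6 = strassen(A21 - A11, B11 + B12, cutoff)
    Q7 = strassen(A12 - A22, B21 + B22, cutoff)
    C = np.empty((n, n))
    C[:h, :h] = Q1 + Q4 - Q5 + Q7
    C[:h, h:] = Q3 + Q5
    C[h:, :h] = Q2 + Q4
    C[h:, h:] = Q1 - Q2 + Q3 + Q6
    return C

rng = np.random.default_rng(0)
A = rng.random((128, 128)); B = rng.random((128, 128))
assert np.allclose(strassen(A, B, cutoff=32), A @ B)
```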
Task-parallel matrix multiplication (tpMM)
• tpMM is designed for p = 2^i processors, which are grouped into clusters.
• The input matrices A (m × n) and B (n × k) must satisfy m mod p = 0 and k mod p = 0.
• The initial data distribution is a row block-wise distribution for matrix A and a column block-wise distribution for matrix B.
tpMM and the Ring Method
(Figure: the ring method)
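The ring exchange can be simulated sequentially (a sketch under the slide's data distribution; the function name `tpmm_ring` and the block bookkeeping are illustrative assumptions, not the authors' code):

```python
import numpy as np

def tpmm_ring(A, B, p):
    # Sequential simulation of a ring-based multiplication step:
    # processor q owns row block q of A and, initially, column block q
    # of B. In each of p steps every processor multiplies its row block
    # with the B block it currently holds, then forwards that B block
    # to its ring neighbour. Requires p | m and p | k.
    m, k = A.shape[0], B.shape[1]
    Arows = np.split(A, p, axis=0)
    Bcols = np.split(B, p, axis=1)
    held = list(range(p))          # held[q]: index of the B block at processor q
    C = np.zeros((m, k))
    rb, cb = m // p, k // p
    for _ in range(p):
        for q in range(p):
            j = held[q]
            C[q*rb:(q+1)*rb, j*cb:(j+1)*cb] = Arows[q] @ Bcols[j]
        held = [held[(q + 1) % p] for q in range(p)]  # shift blocks around the ring
    return C

rng = np.random.default_rng(1)
A = rng.random((8, 6)); B = rng.random((6, 8))
assert np.allclose(tpmm_ring(A, B, p=4), A @ B)
```

After p steps every processor has seen every column block of B exactly once, so the full product is assembled without any processor ever holding more than one block of B.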
PDGEMM
• Function declared in the Parallel Basic Linear Algebra Subprograms (PBLAS), part of the ScaLAPACK project.
• Numerous implementations exist, from vendor-specific libraries to free realizations such as the one in ScaLAPACK.
• The algorithm behind this function interface differs between libraries.
Combining the algorithms
Experimental results on a Dual Xeon cluster (3 GHz)
(Figure: MFLOPS per processor for 16 and 32 processors; MFLOPS = millions of full word-size floating-point multiply operations per second)
Conclusion
• Combining existing algorithms across multiple levels can yield significant performance gains.
• An important point in the construction of multilevel algorithms is the order of the algorithms and the choice of block sizes.
• Experiments show that combining Strassen's method at the top level with special communication-optimized algorithms on the intermediate level performs best.
Questions
• Q&A