Multilevel Hierarchical Matrix Multiplication on Clusters
Sascha Hunold, Thomas Rauber, Gudula Rünger

Outline
• Background
• Introduction algorithm
• Multilevel combination
• Experiment result
• Conclusion

Background
\[
C_{ij} = \sum_{k=1}^{n} A_{ik} B_{kj} \qquad (i, j = 1, \dots, n)
\]
Cost: $O(n^3)$
• One of the core computational algorithms in scientific computing and numerical analysis.

Background
• Many efficient realizations have been invented over the years, such as Strassen's algorithm on distributed systems and BLAS on a single processor.
• In this paper, different combinations of existing algorithms applied at multiple levels are investigated and compared with the individual algorithms in isolation.

Introduction algorithm
• Strassen algorithm
• Task parallel matrix multiplication (tpMM)
• PDGEMM

Strassen algorithm
• For matrices $A, B \in \mathbb{R}^{n \times n}$ with even $n$, the matrix product can be expressed as:
\[
\begin{pmatrix} C_{11} & C_{12} \\ C_{21} & C_{22} \end{pmatrix}
=
\begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix}
\begin{pmatrix} B_{11} & B_{12} \\ B_{21} & B_{22} \end{pmatrix}
\]
where
\[
\begin{aligned}
Q_1 &= (A_{11} + A_{22})(B_{11} + B_{22}) \\
Q_2 &= (A_{21} + A_{22})\, B_{11} \\
Q_3 &= A_{11}\, (B_{12} - B_{22}) \\
Q_4 &= A_{22}\, (B_{21} - B_{11}) \\
Q_5 &= (A_{11} + A_{12})\, B_{22} \\
Q_6 &= (A_{21} - A_{11})(B_{11} + B_{12}) \\
Q_7 &= (A_{12} - A_{22})(B_{21} + B_{22})
\end{aligned}
\]
\[
\begin{aligned}
C_{11} &= Q_1 + Q_4 - Q_5 + Q_7 \\
C_{12} &= Q_3 + Q_5 \\
C_{21} &= Q_2 + Q_4 \\
C_{22} &= Q_1 + Q_3 - Q_2 + Q_6
\end{aligned}
\]
Total: 7 multiplications and 18 additions

Strassen algorithm
Task C11      Task C12      Task C21      Task C22
compute Q1    compute Q3    compute Q2    compute Q1
compute Q7    compute Q5    compute Q4    compute Q6
receive Q5    send Q5       send Q2       receive Q2
receive Q4    send Q3       send Q4       receive Q3
Time complexity is $O(n^{2.8})$ when Strassen is called recursively on the distributed sub-blocks.

Task parallel matrix multiplication (tpMM)
• tpMM is designed to work with $p = 2^i$ processors, which are grouped into clusters.
• The input matrices are $A_{m \times n}$ and $B_{n \times k}$, with p % m = 0 and p % k = 0.
• The initial data distribution is a row block-wise distribution for matrix A and a column block-wise distribution for matrix B.

tpMM and Ring method
[Figure]

The Ring Method
[Figure]

PDGEMM
• Function declaration from the Parallel Basic Linear Algebra Subprograms (PBLAS), part of the ScaLAPACK project.
• Numerous implementations exist, both vendor-specific and free realizations such as the one in ScaLAPACK.
• The algorithm behind this function interface differs between most libraries.

Combination algorithm
[Figure]

Experiment result on a dual Xeon cluster (3 GHz)
[Figure: MFLOPS per processor for 16 and 32 processors. MFLOPS: millions of full word-size floating-point multiply operations performed per second.]

Conclusion
• Combined multilevel algorithms can achieve significant performance.
• An important point in constructing multilevel algorithms is the order and the block size.
• Experiments show that a combination of Strassen's method at the top level with special communication-optimized algorithms on the intermediate level

Questions
• Q&A
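
The seven products Q1 to Q7 and the block combinations from the Strassen slides can be sketched sequentially as follows. This is only a minimal single-process illustration in plain Python, not the distributed implementation discussed in the talk; the helper names (`add`, `sub`, `naive_mul`, `split`) and the `cutoff` parameter are assumptions introduced here for illustration.

```python
def add(X, Y):
    """Element-wise sum of two equally sized matrices (lists of rows)."""
    return [[a + b for a, b in zip(r, s)] for r, s in zip(X, Y)]


def sub(X, Y):
    """Element-wise difference of two equally sized matrices."""
    return [[a - b for a, b in zip(r, s)] for r, s in zip(X, Y)]


def naive_mul(A, B):
    """Standard O(n^3) product C_ij = sum_k A_ik * B_kj."""
    inner, cols = len(B), len(B[0])
    return [[sum(row[t] * B[t][j] for t in range(inner)) for j in range(cols)]
            for row in A]


def split(M):
    """Split a square matrix of even size into blocks M11, M12, M21, M22."""
    h = len(M) // 2
    return ([r[:h] for r in M[:h]], [r[h:] for r in M[:h]],
            [r[:h] for r in M[h:]], [r[h:] for r in M[h:]])


def strassen(A, B, cutoff=1):
    """Multiply square matrices A and B with Strassen's seven products.

    Falls back to naive_mul when the size is at most `cutoff` or odd
    (hypothetical parameter, chosen for this sketch).
    """
    n = len(A)
    if n <= cutoff or n % 2 != 0:
        return naive_mul(A, B)

    A11, A12, A21, A22 = split(A)
    B11, B12, B21, B22 = split(B)

    # Seven recursive block multiplications.
    Q1 = strassen(add(A11, A22), add(B11, B22), cutoff)
    Q2 = strassen(add(A21, A22), B11, cutoff)
    Q3 = strassen(A11, sub(B12, B22), cutoff)
    Q4 = strassen(A22, sub(B21, B11), cutoff)
    Q5 = strassen(add(A11, A12), B22, cutoff)
    Q6 = strassen(sub(A21, A11), add(B11, B12), cutoff)
    Q7 = strassen(sub(A12, A22), add(B21, B22), cutoff)

    # Combine the products into the four result blocks.
    C11 = add(sub(add(Q1, Q4), Q5), Q7)
    C12 = add(Q3, Q5)
    C21 = add(Q2, Q4)
    C22 = add(add(sub(Q1, Q2), Q3), Q6)

    # Stitch the blocks back into one matrix.
    top = [left + right for left, right in zip(C11, C12)]
    bottom = [left + right for left, right in zip(C21, C22)]
    return top + bottom
```

For example, `strassen([[1, 2], [3, 4]], [[5, 6], [7, 8]])` returns `[[19, 22], [43, 50]]`, the same as the naive product. In the task-parallel scheme from the slides, each of the four result blocks would be owned by a separate task that computes two of the products and exchanges the others, rather than one process computing all seven.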