Parallel Algorithm Design Case Study: Tridiagonal Solvers
Xian-He Sun
Department of Computer Science
Illinois Institute of Technology
[email protected]
X. Sun (IIT) CS546 © Sun

Outline
• Problem Description
• Parallel Algorithms
– The Partition Method
– The PPT Algorithm
– The PDD Algorithm
– The LU Pipelining Algorithm
– The PTH Method and PPD Algorithm
• Implementations

Problem Description
• Tridiagonal linear system Ax = d, where

A = ⎡ b₀  c₀                      ⎤
    ⎢ a₁  b₁  c₁                  ⎥
    ⎢      ⋱   ⋱   ⋱              ⎥
    ⎢        a_{n−2}  b_{n−2}  c_{n−2} ⎥
    ⎣                 a_{n−1}  b_{n−1} ⎦

Sequential Solver
• Problem: a_k x_{k−1} + b_k x_k + c_k x_{k+1} = d_k  (k = 2, …, N)
• Forward step:
β₁ = b₁,  γ₁ = d₁ / β₁
β_k = b_k − a_k c_{k−1} / β_{k−1}
γ_k = (d_k − a_k γ_{k−1}) / β_k  (k = 2, …, N)
• Backward step:
x_N = γ_N
x_k = γ_k − c_k x_{k+1} / β_k  (k = N−1, …, 1)

Partition
[Figure: the four steps of creating a parallel program — Decomposition, Assignment, Orchestration, and Mapping — taking a sequential computation through tasks and processes onto processors P0–P3]

The Matrix Modification Formula
A = Ã + VEᵀ
x = A⁻¹d = (Ã + VEᵀ)⁻¹d = Ã⁻¹d − Ã⁻¹V (I + EᵀÃ⁻¹V)⁻¹ EᵀÃ⁻¹d

The Partition of Tridiagonal Systems
• Ã = diag(A₀, A₁, …, A_{p−1}); the elements of A removed by the partition are c_{m−1}, a_m, c_{2m−1}, a_{2m}, …, a_{(p−1)m}

A − Ã = VEᵀ, where
V = [a_m e_m, c_{m−1} e_{m−1}, a_{2m} e_{2m}, c_{2m−1} e_{2m−1}, …, c_{(p−1)m−1} e_{(p−1)m−1}]
Eᵀ = [e_{m−1}, e_m, e_{2m−1}, e_{2m}, …, e_{(p−1)m−1}, e_{(p−1)m}]ᵀ
• e_i is the column vector whose ith element is one and all other entries are zero.

The Solving Process
1. Solve the subsystems in parallel
2. Solve the reduced system
3. Modification
• Since Ã⁻¹ = diag(A₀⁻¹, A₁⁻¹, …, A_{p−1}⁻¹), the system Ãx = d splits into p independent subsystems A_i x^(i) = d^(i).

The Solving Procedure
Ã x̃ = d
Ã Y = V
Z = I + EᵀY
h = Eᵀ x̃
Z y = h
Δx = Y y
x = x̃ − Δx

The Reduced System (Zy = h)
• Z has order 2(p−1), with identity-plus-coupling blocks along the diagonal
• Needs global communication

The Parallel Partition LU (PPT) Algorithm
Step 1. Allocate A_i, d^(i), and the elements a_{im}, c_{(i+1)m−1} to the ith node, where 0 ≤ i ≤ p−1.
Step 2.
Use the LU decomposition method to solve
A_i [x̃^(i), v^(i), w^(i)] = [d^(i), a_{im} e₀, c_{(i+1)m−1} e_{m−1}]
Step 3. Send x̃₀^(i), x̃_{m−1}^(i), v₀^(i), v_{m−1}^(i), w₀^(i), w_{m−1}^(i) from the ith node to the other nodes, 0 ≤ i ≤ p−1.
Step 4. Use the LU method to solve Zy = h on all nodes.
Step 5. Compute in parallel on p processors:
Δx^(i) = [v^(i), w^(i)] (y_{2i−1}, y_{2i})ᵀ
x^(i) = x̃^(i) − Δx^(i)

Orchestration
• Orchestration is implied in the PPT algorithm
• Intuitively, the reduced system should be solved on one node:
– a tree-reduction communication to gather the data
– solve
– a reversed tree-reduction communication to scatter the results
– 2 log(p) communication steps, one solve
• In the PPT algorithm (Step 3):
– one total data exchange
– all nodes solve the reduced system concurrently
– log(p) communication steps, one solve

Tree Reduction (data gathering/scattering)
[Figure]

All-to-All Total Data Exchange
[Figure]

Mapping
• Try to reduce the communication
– reduce time
– reduce message size
– reduce cost: distance, contention, congestion, etc.
• In a total data exchange
– try to make every communication a direct communication
– can be achieved on a hypercube architecture

The PPT Algorithm
• Advantage
– Perfectly parallel
• Disadvantage
– Increased computation (vs. the sequential algorithm)
– Global communication
– Sequential bottleneck

Problem Description
• Parallel codes have been developed during the last decade
• The performance of many codes suffers in a scalable computing environment
• Need to identify and overcome the scalability bottlenecks
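All of the partition algorithms above build on the sequential forward/backward sweeps. The following is a minimal Python sketch of those recurrences (the names β, γ follow the Sequential Solver slide; this is an illustration, not the course's code):

```python
def thomas_solve(a, b, c, d):
    """Sequential tridiagonal solve via the forward/backward sweeps.
    a, b, c are the sub-, main-, and super-diagonals (a[0] and c[-1]
    are unused); d is the right-hand side."""
    n = len(b)
    beta = [0.0] * n
    gamma = [0.0] * n
    # Forward step: beta_1 = b_1, gamma_1 = d_1 / beta_1
    beta[0] = b[0]
    gamma[0] = d[0] / beta[0]
    for k in range(1, n):
        beta[k] = b[k] - a[k] * c[k - 1] / beta[k - 1]
        gamma[k] = (d[k] - a[k] * gamma[k - 1]) / beta[k]
    # Backward step: x_N = gamma_N, x_k = gamma_k - c_k x_{k+1} / beta_k
    x = [0.0] * n
    x[-1] = gamma[-1]
    for k in range(n - 2, -1, -1):
        x[k] = gamma[k] - c[k] * x[k + 1] / beta[k]
    return x
```

The operation count of these two sweeps is the 8n − 7 term that appears in the evaluation formulas later in the slides.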
Diagonal Dominant Systems
[Example: a diagonally dominant tridiagonal system Ax = d — each diagonal entry exceeds the sum of the magnitudes of its two off-diagonal neighbors]

The Reduced System of Diagonal Dominant Systems
• Decay bound for inverses of band matrices:
|A⁻¹(i, j)| ≤ C·q^{|i−j|/m},  q(r) = (√r − 1)/(√r + 1),  0 < q < 1,  r = λ_max/λ_min

The Reduced Communication
• Solving Zy = h generally needs global communication
• For diagonally dominant systems the entries of Z decay away from the diagonal, so the far-off-diagonal elements can be dropped: Z ≈ Z̃, where Z̃ consists of independent 2×2 blocks along the diagonal

The Parallel Diagonal Dominant (PDD) Algorithm
Step 1. Allocate A_i, d^(i), and the elements a_{im}, c_{(i+1)m−1} to the ith node, where 0 ≤ i ≤ p−1.
Step 2. Use the LU decomposition method to solve
A_i [x̃^(i), v^(i), w^(i)] = [d^(i), a_{im} e₀, c_{(i+1)m−1} e_{m−1}]
Step 3. Send x̃₀^(i), v₀^(i) to the (i−1)th node.
Step 4. Solve
y_{2i} + w_{m−1}^(i) y_{2i+1} = x̃_{m−1}^(i)
v₀^(i+1) y_{2i} + y_{2i+1} = x̃₀^(i+1)
in parallel, and send y_{2i+1} to the (i+1)th node.
Step 5. Compute in parallel on p processors:
Δx^(i) = [v^(i), w^(i)] (y_{2i−1}, y_{2i})ᵀ
x^(i) = x̃^(i) − Δx^(i)

Computing/Communication of PDD
[Figure: non-periodic and periodic cases]

Orchestration
• Orchestration is implied in the algorithm design
• Only two one-to-one neighboring communications

Mapping
• Communication has been reduced
– exploits the special mathematical property
– formal analysis can be performed based on the mathematical partition formula
• Two neighboring communications
– can be achieved on an array communication network
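The PDD steps can be sketched end to end. The Python below is a serial simulation of the algorithm (the p "nodes" become a loop; the helper names are mine): each subsystem is solved with the three right-hand sides of Step 2, then each 2×2 interface system of Step 4 is solved, then the local correction of Step 5 is applied. It assumes strong diagonal dominance, so the dropped reduced-system entries are negligible.

```python
def tri_solve(a, b, c, d):
    # Thomas algorithm on one block (a sub-, b main-, c super-diagonal;
    # a[0] and c[-1] are ignored).
    n = len(b)
    beta = b[:]
    g = d[:]
    for k in range(1, n):
        f = a[k] / beta[k - 1]
        beta[k] = b[k] - f * c[k - 1]
        g[k] = g[k] - f * g[k - 1]
    x = [0.0] * n
    x[-1] = g[-1] / beta[-1]
    for k in range(n - 2, -1, -1):
        x[k] = (g[k] - c[k] * x[k + 1]) / beta[k]
    return x

def pdd_solve(a, b, c, d, p):
    """Serial simulation of PDD with p 'nodes' and block size m = n/p."""
    n = len(b)
    m = n // p
    blocks = []
    for i in range(p):
        lo, hi = i * m, (i + 1) * m
        ab, bb, cb = a[lo:hi], b[lo:hi], c[lo:hi]
        # Step 2: three right-hand sides d^(i), a_{im} e_0, c_{(i+1)m-1} e_{m-1}
        xt = tri_solve(ab, bb, cb, d[lo:hi])
        r_left = [0.0] * m
        r_right = [0.0] * m
        if i > 0:
            r_left[0] = a[lo]        # coupling cut at the left block edge
        if i < p - 1:
            r_right[-1] = c[hi - 1]  # coupling cut at the right block edge
        v = tri_solve(ab, bb, cb, r_left)
        w = tri_solve(ab, bb, cb, r_right)
        blocks.append((xt, v, w))
    # Steps 3-4: one neighbor exchange, then each 2x2 interface system:
    #   y_left           + w_{m-1}^(i) y_right = xt_{m-1}^(i)
    #   v_0^(i+1) y_left + y_right             = xt_0^(i+1)
    yl = [0.0] * p   # boundary value multiplying block i's v correction
    yr = [0.0] * p   # boundary value multiplying block i's w correction
    for i in range(p - 1):
        xt_i, _, w_i = blocks[i]
        xt_j, v_j, _ = blocks[i + 1]
        det = 1.0 - w_i[-1] * v_j[0]
        yl[i + 1] = (xt_i[-1] - w_i[-1] * xt_j[0]) / det
        yr[i] = (xt_j[0] - v_j[0] * xt_i[-1]) / det
    # Step 5: local correction x^(i) = xt^(i) - v^(i) yl[i] - w^(i) yr[i]
    x = [0.0] * n
    for i in range(p):
        xt, v, w = blocks[i]
        for j in range(m):
            x[i * m + j] = xt[j] - v[j] * yl[i] - w[j] * yr[i]
    return x
```

For p = 2 there is a single interface and nothing is dropped, so the result is exact; for p > 2 the error is bounded by the decay of v and w across a block, which is why the slides require diagonal dominance and reasonably large subsystems.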
The PDD Algorithm
• Advantage
– Perfectly parallel
– Constant, minimum communication
• Disadvantage
– Increased computation (vs. the sequential algorithm)
– Applicability:
• diagonal dominance required
• subsystems must be reasonably large

[Figure: scaled speedup of the PDD algorithm on Paragon — 1024 systems of order 1600, periodic & non-periodic]
[Figure: scaled speedup of the reduced PDD algorithm on SP2 — 1024 systems of order 1600, periodic & non-periodic]

Problem Description
• For tridiagonal systems with multiple right-hand sides we may need new algorithms
• Tridiagonal linear systems AX = D, where A is the n×n tridiagonal matrix above, and X = (x_{ij}) and D = (d_{ij}) collect the multiple solution and right-hand-side vectors

The Pipelined Method
• Exploit the temporal parallelism of multiple systems
• Pass the results from solving a subset to the next processor before continuing
• Communication is high: 3p
• Pipelining delay: p
• Optimal computing
[Figure: task order vs. time]

The Parallel Two-Level Hybrid Method
• PDD is scalable but has limited applicability
• The pipelined method is mathematically efficient but not scalable
• Combine the two: outer PDD, inner pipelining
• Can be combined with other algorithms too

The Parallel Two-Level Hybrid Method
• Use an accurate parallel tridiagonal solver to solve the m super-subsystems concurrently, each with k processors
• Modify the PDD algorithm and consider communication only between the m super-subsystems

The Partition Pipelined diagonal Dominant (PPD) Algorithm
[Figure]

• Evaluation of Algorithms
(Multiple systems; n = order of each system, n₁ = number of systems, p = processors, k = processors per super-subsystem, α = communication start-up time, β = per-byte transmission time.)

Algorithm       | Computation                                | Communication
Best Sequential | 8n − 7 (per system)                        | 0
Pipelining      | (n₁ − 1 + p)(8n − 7)/p                     | 3(n₁ − 1 + p)(α + 4β)
PDD             | (17n/p + 14)·n₁                            | 2α + 12·n₁·β
PPD             | (n₁ − 1 + k)(13n/p) + 4n₁(n/p − 1)         | 3(n₁ − 1 + k)(2α + 12β) + [2 + log(k)](α + 12·n₁·β)

Practical Motivation
• NLOM (NRL Layered Ocean Model) is a widely used naval parallel ocean simulation code (see http://www7320.nrlssc.navy.mil/global_nlom/index.html)
• Fine-tuned with the best algorithms available at the time
• Efficiency goes down when the number of processors increases
• The Poisson solver is the identified scalability bottleneck

Project Objectives
• Incorporate the best scalable solver, the PDD algorithm, into NLOM
• Increase the scalability of NLOM
• Accumulate experience toward a general toolkit solution for other naval simulation codes

Experimental Testing
• Fast Poisson solver FACR (Hockney, 1965), one of the most successful rapid elliptic solvers
• Structure: FFT → diagonally dominant tridiagonal systems → inverse FFT

NLOM Implementation
• NLOM has a special data structure and partition
– large number of systems; each node has a piece of each system
• Pipelined method, highly optimized
• Burn At Both Ends (BABE): pipelining from both sides, trading computation for communication (p, 2p)

[Figure: tridiagonal solver runtime — Pipelining (square) and PDD (delta)]
[Figure: accuracy — BABE (circle), PDD (square), PPD (diamond)]

NLOM Application
• The pipelined method is not scalable
• PDD is scalable but loses accuracy, because the subsystems are very small
• Need the two-level combined method

[Figure: tridiagonal solver time — Pipelining (square), PDD (delta), PPD (circle)]
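The pipelined method's delay quoted earlier (the p term, which produces the (n₁ − 1 + p) factor in the runtime formulas) can be checked with a toy discrete-step model; the function name and step-counting granularity are mine, with each stage's cost normalized to one step:

```python
def pipeline_steps(n_systems, p_stages):
    """Discrete-step model of a linear pipeline: each system passes
    through p stages; stage s can process system j only after stage s-1
    has finished system j and stage s has finished system j-1."""
    finish = [[0] * p_stages for _ in range(n_systems)]
    for j in range(n_systems):
        for s in range(p_stages):
            prev_stage = finish[j][s - 1] if s > 0 else 0
            prev_sys = finish[j - 1][s] if j > 0 else 0
            finish[j][s] = max(prev_stage, prev_sys) + 1
    return finish[-1][-1]
```

The simulation completes in exactly n₁ − 1 + p steps: p steps to fill the pipeline with the first system, then one step per remaining system.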
[Figure: total runtime — Pipelining (square), PDD (delta), PPD (circle)]

Parallel Two-Level Hybrid (PTH) Method
• Use an accurate parallel tridiagonal solver to solve the m super-subsystems concurrently, each with k processors (p = m·k), solving the three unknowns as given in Step 2 of the PDD algorithm
• Modify the solutions of Step 1 with Steps 3–5 of the PDD algorithm, or of the PPT algorithm if PPT is chosen as the outer solver

The PTH Method and Related Algorithms
Abbreviation | Full Name | Explanation
PPT | Parallel ParTition LU Algorithm | A parallel solver based on rank-one modification
PDD | Parallel Diagonal Dominant Algorithm | A variant of PPT for diagonally dominant systems
PTH | Parallel Two-level Hybrid Method | A novel two-level approach
PPD | Partition Pipelined diagonal Dominant Algorithm | A PTH using PDD and pipelining as the outer/inner solvers

Performance Evaluation
[Table: comparison of computation and communication counts (non-periodic) for the Best Sequential, PPT, PDD, and Reduced PDD algorithms on single and multiple systems; e.g., best sequential computation is 8n − 7 with zero communication]

Algorithm Analysis
1. LU-Pipelining:
T = (n₁ − 1 + p)[(8n − 7)/p · τ_comp + 3(α + 4β)]
2. The PDD Algorithm:
T = (17n/p + 14)·n₁·τ_comp + (2α + 12·n₁·β)
3. The PPD Algorithm:
T = (n₁ − 1 + p₁)[(13n/p)·τ_comp + 3(2α + 12β)] + 4n₁(n/p − 1)·τ_comp + [2 + log(p₁)](α + 12·n₁·β)

where
n — the order of each system
n₁ — the number of systems
p — the number of processors
p₁ — the number of processors used for LU-pipelining
τ_comp — the computing speed (time per operation)
α — the communication start-up time
β — the transmission time (per byte)

Parameters on IBM Blue Horizon at SDSC
τ_comp = 0.01696 μs, α = 31.5 μs, β = 9.52 × 10⁻³ μs/byte
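Plugging the Blue Horizon parameters into the runtime formulas gives quick model predictions. The sketch below encodes the formulas as they appear on the (partly garbled) analysis slide, so the exact constants, the grouping of the communication terms, and the log base are assumptions; the units are taken to be microseconds:

```python
import math

# Machine parameters from the Blue Horizon slide (assumed microseconds).
TAU = 0.01696    # compute time per operation
ALPHA = 31.5     # communication start-up time
BETA = 9.52e-3   # transmission time per byte

def t_pipeline(n, n1, p):
    # LU-pipelining: (n1 - 1 + p)[(8n - 7)/p * tau + 3(alpha + 4 beta)]
    return (n1 - 1 + p) * ((8 * n - 7) / p * TAU + 3 * (ALPHA + 4 * BETA))

def t_pdd(n, n1, p):
    # PDD: (17 n/p + 14) n1 tau + (2 alpha + 12 n1 beta)
    return (17 * n / p + 14) * n1 * TAU + (2 * ALPHA + 12 * n1 * BETA)

def t_ppd(n, n1, p, p1):
    # PPD: inner pipelining over p1 processors plus outer PDD terms
    return ((n1 - 1 + p1) * (13 * n / p * TAU + 3 * (2 * ALPHA + 12 * BETA))
            + 4 * n1 * (n / p - 1) * TAU
            + (2 + math.log2(p1)) * (ALPHA + 12 * n1 * BETA))
```

With the slide's experiment scale (1024 systems of order 1600), the model shows the qualitative behavior the plots report: the pipelining time is dominated by the (n₁ − 1 + p) fill-and-drain factor, while PDD's communication term stays constant in p.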
Computation and Communication Counts
[Table: computation and communication counts with multiple right-hand sides for the Pipelined Algorithm, PPT (non-pivoting), PDD, PPD, and the hybrids PDD/Pipeline, PPT/Pipeline, and PDD/PPT]

[Figure: PPD — predicted (line) and measured (square) runtime]
[Figure: Pipelining — predicted (line) and measured (square) runtime]

Significance
• Advances in massive parallelism, grid computing, and hierarchical data access make performance sensitive to system and problem size
• Scalability is becoming increasingly important
• The Poisson solver is a kernel used in many naval applications
• The PPD algorithm provides a scalable Poisson solver
• We have also proposed the general PTH method

Reference
– X.-H. Sun, H. Zhang, and L. Ni, "Efficient Tridiagonal Solvers on Multicomputers," IEEE Trans. on Computers, Vol. 41, No. 3, pp. 286-296, March 1992.
– X.-H. Sun, "Application and Accuracy of the Parallel Diagonal Dominant Algorithm," Parallel Computing, August 1995.
– X.-H. Sun and W. Zhang, "A Parallel Two-Level Hybrid Method for Tridiagonal Systems, and Its Application to Fast Poisson Solvers," IEEE Trans. on Parallel and Distributed Systems, Vol. 15, No. 2, pp. 97-106, 2004.
– X.-H. Sun and S. Moitra, "Performance Comparison of a Set of Periodic and Non-Periodic Tridiagonal Solvers on SP2 and Paragon Parallel Computers," Concurrency: Practice and Experience, Vol. 8, No. 10, pp. 1-21, 1997.
– X.-H. Sun and D. Joslin, "A Simple Parallel Prefix Algorithm for Almost Toeplitz Tridiagonal Systems," High Speed Computing, Vol. 7, No. 4, pp. 547-576, Dec. 1995.
– Y. Zhuang and X.-H.
Sun, "A High Order Fast Direct Solver for Singular Poisson Equations," Journal of Computational Physics, Vol. 171, pp. 79-94, 2001.
– Y. Zhuang and X.-H. Sun, "A High Order ADI Method for Separable Generalized Helmholtz Equations," International Journal on Advances in Engineering Software, Vol. 31, pp. 585-592, August 2000.