The Parallel Diagonal Dominant (PDD) Algorithm

Parallel Algorithm Design
Case Study: Tridiagonal Solvers
Xian-He Sun
Department of Computer Science
Illinois Institute of Technology
[email protected]
Outline
• Problem Description
• Parallel Algorithms
– The Partition Method
– The PPT Algorithm
– The PDD Algorithm
– The LU Pipelining Algorithm
– The PTH Method and PPD Algorithm
• Implementations
Problem Description
• Tridiagonal linear system
$$Ax = d$$

$$A = \begin{pmatrix}
b_0 & c_0 & & & \\
a_1 & b_1 & c_1 & & \\
 & \ddots & \ddots & \ddots & \\
 & & a_{n-2} & b_{n-2} & c_{n-2} \\
 & & & a_{n-1} & b_{n-1}
\end{pmatrix}$$
Sequential Solver
Problem

$$a_k x_{k-1} + b_k x_k + c_k x_{k+1} = d_k, \qquad k = 2, \dots, N$$

Forward step

$$\gamma_1 = b_1, \qquad \delta_1 = d_1/\gamma_1$$
$$\gamma_k = b_k - a_k c_{k-1}/\gamma_{k-1}, \qquad \delta_k = (d_k - a_k \delta_{k-1})/\gamma_k, \qquad k = 2, \dots, N$$

Backward step

$$x_N = \delta_N, \qquad x_k = \delta_k - c_k x_{k+1}/\gamma_k, \qquad k = N-1, \dots, 1$$
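As a concrete reference, here is a minimal Python sketch of this forward/backward sweep (the function name `thomas` and the random test problem are my own illustration, not from the slides):

```python
import numpy as np

def thomas(a, b, c, d):
    """Thomas algorithm for a tridiagonal system.

    a: sub-diagonal (a[0] unused), b: diagonal, c: super-diagonal
    (c[-1] unused), d: right-hand side. O(8n) flops, fully sequential.
    """
    n = len(b)
    gamma = np.empty(n); delta = np.empty(n)
    gamma[0] = b[0]
    delta[0] = d[0] / gamma[0]
    for k in range(1, n):                      # forward elimination
        gamma[k] = b[k] - a[k] * c[k - 1] / gamma[k - 1]
        delta[k] = (d[k] - a[k] * delta[k - 1]) / gamma[k]
    x = np.empty(n)
    x[-1] = delta[-1]
    for k in range(n - 2, -1, -1):             # back substitution
        x[k] = delta[k] - c[k] * x[k + 1] / gamma[k]
    return x

# quick check against a dense solve
n = 8
a = np.r_[0.0, np.random.rand(n - 1)]
c = np.r_[np.random.rand(n - 1), 0.0]
b = 2.0 + a + c                                # diagonally dominant
d = np.random.rand(n)
A = np.diag(b) + np.diag(a[1:], -1) + np.diag(c[:-1], 1)
assert np.allclose(thomas(a, b, c, d), np.linalg.solve(A, d))
```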
Partition
[Figure: the standard parallelization steps (Decomposition, Assignment, Orchestration, Mapping) taking a sequential computation through tasks, processes, and a parallel program onto processors P0-P3.]
The Matrix Modification Formula

$$x = A^{-1}d = (\tilde{A} + \Delta A)^{-1}d = (\tilde{A} + VE^T)^{-1}d$$
$$x = \tilde{A}^{-1}d - \tilde{A}^{-1}V\,(I + E^T\tilde{A}^{-1}V)^{-1}E^T\tilde{A}^{-1}d$$

where

$$\tilde{A} = \begin{pmatrix} A_0 & & & \\ & A_1 & & \\ & & \ddots & \\ & & & A_{p-1} \end{pmatrix}$$

and $\Delta A$ contains only the $2(p-1)$ off-diagonal elements of $A$ that couple adjacent blocks: $c_{im-1}$ and $a_{im}$ for $i = 1, \dots, p-1$.

The Partition of Tridiagonal Systems
$$A = \tilde{A} + VE^T$$

$$\Delta A = \left[\,a_m e_m,\; c_{m-1}e_{m-1},\; a_{2m}e_{2m},\; c_{2m-1}e_{2m-1},\; \dots,\; c_{(p-1)m-1}e_{(p-1)m-1}\,\right]
\begin{pmatrix} e_{m-1}^T \\ e_m^T \\ \vdots \\ e_{(p-1)m-1}^T \\ e_{(p-1)m}^T \end{pmatrix} = VE^T$$

Here $e_i$ is the column vector whose $i$th element is one and whose other entries are zero.
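The splitting is easy to check numerically. A small numpy sketch (the helper name `partition`, block sizes, and test matrix are my own illustration):

```python
import numpy as np

def partition(A, p):
    """Split tridiagonal A into block-diagonal A_tilde plus V @ E.T.

    Assumes n = A.shape[0] is divisible by p (m = n // p).
    Returns (A_tilde, V, E) with V, E of shape (n, 2(p-1)).
    """
    n = A.shape[0]; m = n // p
    A_tilde = A.copy()
    V = np.zeros((n, 2 * (p - 1)))
    E = np.zeros((n, 2 * (p - 1)))
    for i in range(1, p):
        k = i * m
        # a_{im} couples block i to block i-1; c_{im-1} couples i-1 to i
        A_tilde[k, k - 1] = 0.0
        A_tilde[k - 1, k] = 0.0
        V[k, 2 * (i - 1)] = A[k, k - 1]          # a_{im} e_{im}
        E[k - 1, 2 * (i - 1)] = 1.0              # paired with e_{im-1}^T
        V[k - 1, 2 * (i - 1) + 1] = A[k - 1, k]  # c_{im-1} e_{im-1}
        E[k, 2 * (i - 1) + 1] = 1.0              # paired with e_{im}^T
    return A_tilde, V, E

n, p = 12, 3
rng = np.random.default_rng(0)
A = np.diag(4 + rng.random(n)) + np.diag(rng.random(n - 1), 1) \
    + np.diag(rng.random(n - 1), -1)
A_tilde, V, E = partition(A, p)
assert np.allclose(A, A_tilde + V @ E.T)         # A = A~ + V E^T
```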
The Solving Process

1. Solve the subsystems in parallel
2. Solve the reduced system
3. Modification

Multiplying by the block-diagonal inverse,

$$\tilde{A}^{-1}Ax = \tilde{A}^{-1}d,$$

where $\tilde{A}^{-1} = \mathrm{diag}(A_0^{-1}, A_1^{-1}, \dots, A_{p-1}^{-1})$. The matrix $\tilde{A}^{-1}A = I + \tilde{A}^{-1}\Delta A$ is the identity plus $2(p-1)$ "spike" columns, and $\tilde{A}^{-1}d$ is computed blockwise from $d = (d_0, d_1, \dots, d_{n-1})^T$.
The Solving Procedure
$$\tilde{A}\tilde{x} = d$$
$$\tilde{A}Y = V$$
$$h = E^T\tilde{x}$$
$$Z = I + E^TY$$
$$Zy = h$$
$$\Delta x = Yy$$
$$x = \tilde{x} - \Delta x$$
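Putting the pieces together, a serial numpy sketch of this procedure, using dense solves in place of the per-block LU and verifying against a direct solve (all names and sizes are my own choices):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 12, 3
m = n // p
A = np.diag(4 + rng.random(n)) + np.diag(rng.random(n - 1), 1) \
    + np.diag(rng.random(n - 1), -1)
d = rng.random(n)

# Block-diagonal A~ plus the rank-2(p-1) correction V E^T, exactly as
# on the previous slides.
A_t = A.copy()
V = np.zeros((n, 2 * (p - 1)))
E = np.zeros((n, 2 * (p - 1)))
for i in range(1, p):
    k = i * m
    A_t[k, k - 1] = A_t[k - 1, k] = 0.0
    V[k, 2 * i - 2] = A[k, k - 1]; E[k - 1, 2 * i - 2] = 1.0
    V[k - 1, 2 * i - 1] = A[k - 1, k]; E[k, 2 * i - 1] = 1.0

# The solving procedure; every block row of these solves is
# independent, so on p nodes they would run concurrently.
x_t = np.linalg.solve(A_t, d)        # A~ x~ = d
Y = np.linalg.solve(A_t, V)          # A~ Y = V
h = E.T @ x_t                        # h = E^T x~
Z = np.eye(2 * (p - 1)) + E.T @ Y    # Z = I + E^T Y
y = np.linalg.solve(Z, h)            # reduced system Z y = h
x = x_t - Y @ y                      # x = x~ - Y y

assert np.allclose(x, np.linalg.solve(A, d))
```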
The Reduced System
$$(Zy = h)$$

[Figure: structure of the reduced system Z, of order 2(p-1): identity blocks coupled by the spike entries contributed by neighboring partitions.]

Needs global communication
The Parallel Partition LU (PPT) Algorithm
Step 1. Allocate $A_i$, $d^{(i)}$ and the elements $a_{im}$, $c_{(i+1)m-1}$ to the $i$th node, where $0 \le i \le p-1$.

Step 2. Use the LU decomposition method to solve
$$A_i\,[\tilde{x}^{(i)}, v^{(i)}, w^{(i)}] = [d^{(i)},\; a_{im}e_0,\; c_{(i+1)m-1}e_{m-1}].$$

Step 3. Send $\tilde{x}_0^{(i)}, \tilde{x}_{m-1}^{(i)}, v_0^{(i)}, v_{m-1}^{(i)}, w_0^{(i)}, w_{m-1}^{(i)}$ from the $i$th node to all other nodes, $0 \le i \le p-1$.

Step 4. Use the LU method to solve $Zy = h$ on all nodes.

Step 5. Compute in parallel on $p$ processors:
$$\Delta x^{(i)} = [v^{(i)}, w^{(i)}]\begin{pmatrix} y_{2i-1} \\ y_{2i} \end{pmatrix}, \qquad x^{(i)} = \tilde{x}^{(i)} - \Delta x^{(i)}.$$
Orchestration
[Figure: the parallelization roadmap (Decomposition, Assignment, Orchestration, Mapping), with the Orchestration stage highlighted.]
Orchestration
Orchestration is implied in the PPT algorithm
• Intuitively, the reduced system should be solved on one node
  – a tree-reduction communication to gather the data
  – solve
  – a reversed tree-reduction communication to scatter the results
  – 2 log(p) communications, one solve
• In the PPT algorithm (Step 3)
  – one total data exchange
  – all nodes solve the reduced system concurrently
  – log(p) communications, one solve
Tree Reduction (Data gathering/scattering)
All-to-All Total Data Exchange
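The total exchange behind PPT's Step 3 can be realized with recursive doubling in log2(p) rounds; a toy Python sketch (the node count and list-based "messages" are my own illustration, not the deck's implementation):

```python
# Recursive-doubling all-gather: after log2(p) exchange rounds every
# "node" holds all p contributions. On a hypercube each round is a
# direct neighbor exchange, which is why the total exchange costs
# log(p) communications. Lists stand in for messages in this model.
p = 8                                   # must be a power of two here
data = [[i] for i in range(p)]          # node i starts with its own piece
step = 1
while step < p:
    new = [list(d) for d in data]
    for node in range(p):
        partner = node ^ step           # hypercube neighbor for this round
        new[node] += data[partner]      # receive partner's current data
    data = new
    step *= 2
assert all(sorted(d) == list(range(p)) for d in data)
```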
Mapping
[Figure: the parallelization roadmap (Decomposition, Assignment, Orchestration, Mapping), with the Mapping stage highlighted.]
Mapping
• Try to reduce the communication
  – reduce time
  – reduce message size
  – reduce cost: distance, contention, congestion, etc.
• In the total data exchange
  – try to make every communication a direct communication
  – can be achieved on a hypercube architecture
The PPT Algorithm
• Advantage
  – Perfectly parallel
• Disadvantage
  – Increased computation (vs. the sequential algorithm)
  – Global communication
  – Sequential bottleneck
Problem Description
• Parallel codes have been developed over the last decade
• The performance of many codes suffers in a scalable computing environment
• We need to identify and overcome the scalability bottlenecks
Diagonal Dominant Systems

[Example: a diagonally dominant tridiagonal system $Ax = d$, in which every diagonal entry dominates its off-diagonal neighbors, $|b_i| \ge |a_i| + |c_i|$; the slide's example uses diagonal entries such as 7, 9, 5, 6, and 8 against off-diagonal entries of 1, 2, and 3.]
• The Reduced System of Diagonal Dominant Systems

[Figure: the reduced system keeps its identity blocks; the coupling entries decay away from the diagonal.]

• Decay Bound for Inverses of Band Matrices

$$|A^{-1}(i,j)| \le C\,q^{|i-j|}, \qquad 0 < q < 1,$$
$$q = q(r) = \frac{\sqrt{r}-1}{\sqrt{r}+1}, \qquad r = \frac{\lambda_{\max}}{\lambda_{\min}}$$
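The bound is easy to see numerically; a tiny sketch (the matrix size and entries are arbitrary choices of mine):

```python
import numpy as np

# Build a diagonally dominant tridiagonal matrix and measure how fast
# the entries of its inverse decay away from the diagonal.
n = 40
a = np.full(n - 1, 1.0)                 # sub/super-diagonals
b = np.full(n, 4.0)                     # dominant diagonal
A = np.diag(b) + np.diag(a, 1) + np.diag(a, -1)
Ainv = np.linalg.inv(A)

i = n // 2
for j in range(i, min(i + 6, n)):
    print(f"|A^-1({i},{j})| = {abs(Ainv[i, j]):.2e}")
# Each step away from the diagonal shrinks the entry by roughly a
# constant factor q < 1, which is what lets the PDD algorithm drop
# the far-away coupling terms of the reduced system.
```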
The Reduced Communication

[Figure: the reduced system $Zy = h$ written out; the off-block coupling entries decay toward zero for diagonally dominant systems.]

Solving it generally needs global communication, but the entries decay for diagonally dominant systems.
[Figure: for a diagonally dominant system, the off-block entries (*) of the reduced matrix $Z$ are negligible. Dropping them yields the approximation $\tilde{Z} \approx Z$, a block-diagonal matrix of independent 2x2 blocks, each coupling only two neighboring partitions.]
The Parallel Diagonal Dominant (PDD) Algorithm
Step 1. Allocate $A_i$, $d^{(i)}$ and the elements $a_{im}$, $c_{(i+1)m-1}$ to the $i$th node, where $0 \le i \le p-1$.

Step 2. Use the LU decomposition method to solve
$$A_i\,[\tilde{x}^{(i)}, v^{(i)}, w^{(i)}] = [d^{(i)},\; a_{im}e_0,\; c_{(i+1)m-1}e_{m-1}].$$

Step 3. Send $\tilde{x}_0^{(i)}, v_0^{(i)}$ to the $(i-1)$th node.

Step 4. Solve
$$\begin{pmatrix} 1 & w_{m-1}^{(i)} \\ v_0^{(i+1)} & 1 \end{pmatrix}\begin{pmatrix} y_{2i} \\ y_{2i+1} \end{pmatrix} = \begin{pmatrix} \tilde{x}_{m-1}^{(i)} \\ \tilde{x}_0^{(i+1)} \end{pmatrix}$$
in parallel and send $y_{2i+1}$ to the $(i+1)$th node.

Step 5. Compute in parallel on $p$ processors:
$$\Delta x^{(i)} = [v^{(i)}, w^{(i)}]\begin{pmatrix} y_{2i-1} \\ y_{2i} \end{pmatrix}, \qquad x^{(i)} = \tilde{x}^{(i)} - \Delta x^{(i)}.$$
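A serial numpy simulation of these five steps, with loop iterations standing in for nodes and dense solves for the block LU (a sketch under my own naming, not the production code):

```python
import numpy as np

# Toy PDD: works because the system is diagonally dominant, so the
# dropped far-end spike entries of Z are negligibly small.
rng = np.random.default_rng(2)
n, p = 64, 4
m = n // p
a = rng.random(n); c = rng.random(n); b = 8.0 + a + c   # dominant
A = np.diag(b) + np.diag(a[1:], -1) + np.diag(c[:-1], 1)
d = rng.random(n)

xt = np.empty((p, m)); v = np.zeros((p, m)); w = np.zeros((p, m))
e0 = np.zeros(m); e0[0] = 1.0
em = np.zeros(m); em[-1] = 1.0
for i in range(p):                       # Steps 1-2: local block solves
    Ai = A[i * m:(i + 1) * m, i * m:(i + 1) * m]
    xt[i] = np.linalg.solve(Ai, d[i * m:(i + 1) * m])
    if i > 0:
        v[i] = np.linalg.solve(Ai, A[i * m, i * m - 1] * e0)
    if i < p - 1:
        w[i] = np.linalg.solve(Ai, A[(i + 1) * m - 1, (i + 1) * m] * em)

y = np.zeros(2 * p)                      # Steps 3-4: 2x2 reduced solves,
for i in range(p - 1):                   # using only neighbor data
    Z2 = np.array([[1.0, w[i, -1]], [v[i + 1, 0], 1.0]])
    y[2 * i + 1], y[2 * i + 2] = np.linalg.solve(
        Z2, [xt[i, -1], xt[i + 1, 0]])

x = np.empty(n)                          # Step 5: local correction
for i in range(p):
    x[i * m:(i + 1) * m] = xt[i] - v[i] * y[2 * i] - w[i] * y[2 * i + 1]

print("max error:", np.max(np.abs(x - np.linalg.solve(A, d))))
```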
Orchestration
[Figure: the parallelization roadmap (Decomposition, Assignment, Orchestration, Mapping), with the Orchestration stage highlighted.]
Computing/Communication of PDD
[Figure: computation and communication pattern of the PDD algorithm, non-periodic and periodic cases.]
Orchestration
• Orchestration is implied in the algorithm design
• Only two one-to-one neighboring communications
Mapping
• Communication has been reduced
  – by exploiting the special mathematical property of the system
  – a formal analysis can be performed from the matrix partition formula
• Two neighboring communications
  – can be achieved on an array communication network
The PDD Algorithm
• Advantage
  – Perfectly parallel
  – Constant, minimum communication
• Disadvantage
  – Increased computation (vs. the sequential algorithm)
  – Applicability
    • requires diagonal dominance
    • subsystems must be reasonably large
[Figure: scaled speedup of the PDD algorithm on the Paragon, and of the Reduced PDD algorithm on the SP2; 1024 systems of order 1600, periodic and non-periodic.]
Problem Description
• For tridiagonal systems with many right-hand sides we may need new algorithms
Problem Description
• Tridiagonal linear systems with multiple right-hand sides

$$AX = D$$

$$A = \begin{pmatrix}
b_0 & c_0 & & & \\
a_1 & b_1 & c_1 & & \\
 & \ddots & \ddots & \ddots & \\
 & & a_{n-2} & b_{n-2} & c_{n-2} \\
 & & & a_{n-1} & b_{n-1}
\end{pmatrix}$$

$$X = \begin{pmatrix}
x_{0,0} & x_{0,1} & \cdots & x_{0,n-1} \\
x_{1,0} & x_{1,1} & \cdots & x_{1,n-1} \\
\vdots & & & \vdots \\
x_{n-1,0} & x_{n-1,1} & \cdots & x_{n-1,n-1}
\end{pmatrix}, \qquad
D = \begin{pmatrix}
d_{0,0} & d_{0,1} & \cdots & d_{0,n-1} \\
d_{1,0} & d_{1,1} & \cdots & d_{1,n-1} \\
\vdots & & & \vdots \\
d_{n-1,0} & d_{n-1,1} & \cdots & d_{n-1,n-1}
\end{pmatrix}$$
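The payoff of multiple right-hand sides is that the factorization is computed once and reused. A small sketch (function names are mine), roughly matching the deck's count of about 5n operations per additional right-hand side:

```python
import numpy as np

def thomas_factor(a, b, c):
    """One-time O(n) factorization of tridiag(a, b, c)."""
    n = len(b)
    gamma = np.empty(n)
    gamma[0] = b[0]
    for k in range(1, n):
        gamma[k] = b[k] - a[k] * c[k - 1] / gamma[k - 1]
    return gamma

def thomas_apply(a, c, gamma, D):
    """Forward/backward substitution only, reused for every column of D.

    The 8n-7 cost of a first solve drops to roughly 5n per
    additional right-hand side once gamma is known.
    """
    n, n1 = D.shape
    delta = np.empty((n, n1)); X = np.empty((n, n1))
    delta[0] = D[0] / gamma[0]
    for k in range(1, n):
        delta[k] = (D[k] - a[k] * delta[k - 1]) / gamma[k]
    X[-1] = delta[-1]
    for k in range(n - 2, -1, -1):
        X[k] = delta[k] - c[k] * X[k + 1] / gamma[k]
    return X

n, n1 = 16, 5
rng = np.random.default_rng(3)
a = np.r_[0.0, rng.random(n - 1)]; c = np.r_[rng.random(n - 1), 0.0]
b = 2.0 + a + c
D = rng.random((n, n1))
A = np.diag(b) + np.diag(a[1:], -1) + np.diag(c[:-1], 1)
X = thomas_apply(a, c, thomas_factor(a, b, c), D)
assert np.allclose(A @ X, D)
```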
The Pipelined Method
• Exploits temporal parallelism across the multiple systems
• Passes the results from solving one subset on to the next processor before continuing
• Communication is high, 3p
• Pipelining delay, p
• Optimal computation count

[Figure: pipelined task schedule, task order versus time.]
The Parallel Two-Level Hybrid Method
• PDD is scalable but has limited applicability
• The pipelined method is mathematically efficient but not scalable
• Combine the two: PDD as the outer solver, pipelining as the inner solver
• Other algorithms can be combined in the same way
The Parallel Two-Level Hybrid Method
• Use an accurate parallel tridiagonal solver to solve the m super-subsystems concurrently, each with k processors
• Modify the PDD algorithm to consider communications only between the m super-subsystems
The Partition Pipelined Diagonal Dominant (PPD) Algorithm

[Figure: the PPD partition; super-subsystems are solved by inner pipelining, with PDD-style coupling entries (*) between them.]
• Evaluation of Algorithms (multiple systems)

Algorithm        Computation                      Communication
Best Sequential  (8n-7)·n1                        0
Pipelining       (n1-1+p)(8n-7)/p                 3(n1-1+p)(α+4β)
PDD              (17n/p + 14)·n1                  2α + 12·n1·β
PPD              (n1-1+k)·13n/p + 4n1(n/p - 1)    3(2α+12β) + [2+log(k)](α+12·n1·β)
Practical Motivation
• NLOM (NRL Layered Ocean Model) is a widely used naval parallel ocean simulation code (see http://www7320.nrlssc.navy.mil/global_nlom/index.html)
• Fine-tuned with the best algorithms available at the time
• Efficiency drops as the number of processors increases
• The Poisson solver is the identified scalability bottleneck
Project Objectives
• Incorporate the best scalable solver, the PDD algorithm, into NLOM
• Increase the scalability of NLOM
• Accumulate experience toward a general toolkit solution for other naval simulation codes
Experimental Testing
• Fast Poisson solvers (FACR) (Hockney, 1965)
• One of the most successful rapid elliptic solvers

[Diagram: an FFT takes f to its Fourier components f_k^q; each component gives an independent, diagonally dominant tridiagonal system; an inverse FFT recovers the solution from the component solutions x_k^q.]

• Large number of systems; each node holds a piece of each system
• NLOM implementation: highly optimized pipelining
• Burn At Both Ends (BABE): pipelining from both ends, trading computation for communication (p, 2p)
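A toy numpy illustration of the FACR idea on the model Poisson problem, using an explicit sine-transform matrix instead of an FFT for clarity (grid size, test function, and all names are my own choices, not NLOM's):

```python
import numpy as np

# FACR-style solve of -u_xx - u_yy = f on a square grid with zero
# Dirichlet boundaries: a sine transform in x decouples the problem
# into independent, diagonally dominant tridiagonal systems in y, one
# per wavenumber (solved densely here; PDD or pipelining would solve
# them in parallel).
n = 31
h = 1.0 / (n + 1)
j = np.arange(1, n + 1)
S = np.sqrt(2.0 / (n + 1)) * np.sin(np.outer(j, j) * np.pi / (n + 1))
lam = (4.0 / h**2) * np.sin(j * np.pi / (2 * (n + 1)))**2  # eigenvalues

xx, yy = np.meshgrid(j * h, j * h, indexing="ij")
u_exact = np.sin(np.pi * xx) * np.sin(2 * np.pi * yy)
f = (np.pi**2 + 4 * np.pi**2) * u_exact            # -laplacian of u

f_hat = S @ f                                      # sine transform in x
T = (np.diag(np.full(n, 2.0)) - np.diag(np.ones(n - 1), 1)
     - np.diag(np.ones(n - 1), -1)) / h**2         # 1D -d2/dy2 stencil
u_hat = np.empty_like(f_hat)
for k in range(n):                                 # one tridiagonal
    u_hat[k] = np.linalg.solve(T + lam[k] * np.eye(n), f_hat[k])
u = S @ u_hat                                      # inverse transform

print("max error:", np.max(np.abs(u - u_exact)))   # O(h^2) accuracy
```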
NLOM Implementation
• NLOM has a special data structure and partition
  – large number of systems; each node holds a piece of each system
• Pipelined method, highly optimized
• Burn At Both Ends (BABE): pipelining from both ends, trading computation for communication (p, 2p)
[Figure: tridiagonal solver runtime, Pipelining (squares) versus PDD (deltas).]
[Figure: accuracy comparison, BABE (circles), PDD (squares), and PPD (diamonds).]
NLOM Application
• The pipelined method is not scalable
• PDD is scalable but loses accuracy because the subsystems are very small
• The two-level combined method is needed
[Figure: tridiagonal solver time, Pipelining (squares), PDD (deltas), and PPD (circles).]
[Figure: total runtime, Pipelining (squares), PDD (deltas), and PPD (circles).]
Parallel Two-Level Hybrid (PTH) Method
• Use an accurate parallel tridiagonal solver to solve the m super-subsystems concurrently, each with k processors (p = m·k), solving the three unknowns as given in Step 2 of the PDD algorithm.
• Modify the solutions of Step 1 with Steps 3-5 of the PDD algorithm, or of the PPT algorithm if PPT is chosen as the outer solver.
The PTH method and related algorithms
Abbreviation  Full Name                                   Explanation
PPT           Parallel ParTition LU Algorithm             A parallel solver based on rank-one modification
PDD           Parallel Diagonal Dominant Algorithm        A variant of PPT for diagonally dominant systems
PTH           Parallel Two-level Hybrid Method            A novel two-level approach
PPD           Partition Pipelined Diagonal Dominant Alg.  A PTH that uses PDD and pipelining as the outer/inner solvers
Performance Evaluation
• Evaluation of Algorithms
Comparison of computation and communication (non-periodic)

System            Algorithm        Computation             Communication
Single system     Best Sequential  8n-7                    0
                  PPT              17n/p + 16p - 23        (2α+8pβ)(p-1)
                  PDD              17n/p + 4               2α + 12β
                  Reduced PDD      6n/p + 4j               2α + 12β
Multiple systems  Best Sequential  (5n-3)·n1               0
                  PPT              (9n/p + 10p - 11)·n1    (2α+8p·n1·β)(p-1)
                  PDD              (9n/p + 1)·n1           2α + 8·n1·β
                  Reduced PDD      (5n/p + 4j - 1)·n1      2α + 8·n1·β
Algorithm Analysis:

1. LU-Pipelining
$$T = (n_1 - 1 + p)\left[\frac{8n-7}{p}\,\tau_{comp} + 3(\alpha + 4\beta)\right]$$

2. The PDD Algorithm
$$T = \left(17\frac{n}{p} + 14\right) n_1\,\tau_{comp} + (2\alpha + 12 n_1 \beta)$$

3. The PPD Algorithm
$$T = (n_1 - 1 + p_1)\left[\frac{13n}{p}\,\tau_{comp} + 3(2\alpha + 12\beta)\right] + 4n_1\left(\frac{n}{p} - 1\right)\tau_{comp} + \left[2 + \log(p_1)\right](\alpha + 12 n_1 \beta)$$
where
  n       - the order of each system
  n1      - the number of systems
  p       - the number of processors
  p1      - the number of processors used for LU-pipelining
  τ_comp  - the unit computation time (computing speed)
  α       - the communication startup time
  β       - the transmission time per byte
Parameters on IBM Blue Horizon at SDSC

τ_comp = 0.01696 μs
α = 31.5 μs
β = 9.52 × 10⁻³ μs/byte
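Plugging these machine parameters into the cost models above gives a quick scalability comparison; a small sketch (the problem sizes n and n1 are arbitrary illustrative choices):

```python
# Compare LU-pipelining with PDD on the Blue Horizon parameters as p
# grows (all times in microseconds).
tau = 0.01696          # us per operation
alpha = 31.5           # us communication startup
beta = 9.52e-3         # us per byte

def t_pipeline(n, n1, p):
    return (n1 - 1 + p) * ((8 * n - 7) / p * tau + 3 * (alpha + 4 * beta))

def t_pdd(n, n1, p):
    return (17 * n / p + 14) * n1 * tau + (2 * alpha + 12 * n1 * beta)

n, n1 = 1600, 1024
for p in (4, 16, 64, 256):
    print(f"p={p:4d}  pipeline={t_pipeline(n, n1, p):12.1f} us"
          f"  PDD={t_pdd(n, n1, p):12.1f} us")
# PDD's communication term is independent of p, so its time keeps
# falling, while the pipeline's (n1 - 1 + p) startup term grows with p.
```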
Computation and Communication Count (multiple right sides)

Algorithm            Computation                                Communication
Pipelined Algorithm  (m-1+p)(8n-7)·m1/p                         3(m-1+p)(α + 12·m1·β)
PPT (non-pivoting)   (17n/p + 16p - 45)·n1                      log(p)·(α + 16(p-1)·n1·β)
PDD                  (17n/p + 14)·n1                            2α + 12·n1·β
PPD (PDD/Pipeline)   (m-1+k)·13n·m1/p + (4n/p - 7)·n1           3(m-1+k)(α + 24·m1·β) + [2+log(k)]α + [8·log(k)+12]·n1·β
PPT/Pipeline         (m-1+k)·13n·m1/p + (16n/p + 16k - 23)·n1   3(m-1+k)(α + 24·m1·β) + log(p)·α + [16(k-1) + 8·log(k)]·n1·β
PDD/PPT              (30n/p + 21k - 41)·n1                      [2 + 2·log(k)]α + [16(k-1) + 8·log(k) + 20]·n1·β
[Figure: PPD, the predicted (line) and numerical (square) runtime.]
[Figure: Pipelining, the predicted (line) and numerical (square) runtime.]
Significance
• Advances in massive parallelism, grid computing, and hierarchical data access make performance sensitive to system and problem size
• Scalability is becoming increasingly important
• The Poisson solver is a kernel used in many naval applications
• The PPD algorithm provides a scalable solution for the Poisson solver
• We have also proposed the general PTH method
References

– X.-H. Sun, H. Zhang, and L. Ni, "Efficient Tridiagonal Solvers on Multicomputers," IEEE Trans. on Computers, Vol. 41, No. 3, pp. 286-296, March 1992.
– X.-H. Sun, "Application and Accuracy of the Parallel Diagonal Dominant Algorithm," Parallel Computing, August 1995.
– X.-H. Sun and W. Zhang, "A Parallel Two-Level Hybrid Method for Tridiagonal Systems, and its Application to Fast Poisson Solvers," IEEE Trans. on Parallel and Distributed Systems, Vol. 15, No. 2, pp. 97-106, 2004.
– X.-H. Sun and S. Moitra, "Performance Comparison of a Set of Periodic and Non-Periodic Tridiagonal Solvers on SP2 and Paragon Parallel Computers," Concurrency: Practice and Experience, Vol. 8, No. 10, pp. 1-21, 1997.
– X.-H. Sun and D. Joslin, "A Simple Parallel Prefix Algorithm for Almost Toeplitz Tridiagonal Systems," High Speed Computing, Vol. 7, No. 4, pp. 547-576, Dec. 1995.
– Y. Zhuang and X.-H. Sun, "A High Order Fast Direct Solver for Singular Poisson Equations," Journal of Computational Physics, Vol. 171, pp. 79-94, 2001.
– Y. Zhuang and X.-H. Sun, "A High Order ADI Method for Separable Generalized Helmholtz Equations," International Journal on Advances in Engineering Software, Vol. 31, pp. 585-592, August 2000.