
A Scalable Parallel Block Algorithm for
Band Cholesky Factorization
Ramesh Agarwal
Fred Gustavson
Mahesh Joshi
Mohammad Zubair
Abstract
In this paper, we present an algorithm for computing the Cholesky factorization of
large banded matrices on IBM distributed memory parallel machines. The algorithm
aims at optimizing the single-node performance and minimizing the communication
overheads. An important result of our paper is that the proposed algorithm is strongly
scalable: the number of processors that can be efficiently utilized grows quadratically
with the bandwidth of the matrix.
1 Introduction
Many of the matrices arising from large scientific applications have a banded structure.
Banded solvers, as opposed to dense solvers, are difficult to parallelize because of their
lower computation to communication ratios. Therefore, algorithms for such problems
need to employ special techniques to reduce communication overheads. Several researchers
have investigated parallel algorithms for solving band systems [2, 3, 4, 5, 6, 7, 8]. Most
of these algorithms are either developed for special purpose architectures, or they have
been suggested for hypothetical machines. In [3], Dongarra and Johnsson report some
performance results on two bus-based shared-memory architectures, namely, the
Alliant FX/8 and the Sequent Balance 21000. However, we are not aware of any published
performance results for band system solvers on distributed memory parallel machines.
In this paper, we present an algorithm for computing the Cholesky factorization of
large banded matrices on the IBM distributed memory parallel machines. The algorithm
aims at optimizing the single node performance, minimizing the communication overheads,
and optimally scheduling the processing at various nodes such that all processors are fully
utilized all the time. An important result of our paper is that the proposed algorithm is
strongly scalable: the number of processors that can be efficiently utilized grows
quadratically with the bandwidth of the matrix.
2 Block Cholesky Formulation
The Cholesky factorization of a symmetric positive definite band matrix A computes the
lower triangular band matrix L such that A = LL^T. The Cholesky factor L replaces the original
matrix A in storage. It is assumed that only the lower part of the symmetric matrix A is
stored in memory.
IBM T.J. Watson Research Center, P.O. Box 218, Yorktown Hts., NY 10598. Mahesh Joshi is on assignment from TISL, India.
In this section, we will develop a block formulation of the Cholesky factorization. We
partition the lower band matrix A into r by r blocks. Within a block column, the diagonal
block is a lower triangular matrix, the bottom block is an upper triangular matrix, and the
rest of the blocks are square matrices. The half bandwidth of the matrix is m = r*mb, where
mb-1 is the number of square blocks in a block column; this corresponds to a block band
matrix of block half bandwidth mb. Here we assume that m is an integer multiple of the
block size.
The j-th block column consists of the diagonal block A(j,j) and the square blocks
A(i+j,j), i = 1,2,..,mb-1, and the bottom upper triangular block A(j+mb,j). Now we will
describe the Cholesky factorization in terms of block operations. To simplify discussion,
here we will ignore the slight differences in factorizing the first and last mb block columns.
The computing described below applies to the remaining middle block columns.
1 Factorizing a Diagonal Block A(j,j). It consists of the following (mb+1) computational
steps performed on A(j,j).
1.1 Do a symmetric rank-r update using the upper triangular block L(j,j-mb). This
requires multiplying a triangular matrix with its transpose. Note that here we
use the notation L to indicate blocks that have already been factored.
1.2 Do mb-1 symmetric rank-r updates (DSYRK computations) using square blocks
L(j,j-k), k = mb-1, mb-2,...,1. These constitute (mb-1) steps.
1.3 After having done the above updates, do the Cholesky factorization of the
diagonal block. This completes the transformation of A(j,j) to L(j,j).
2 Factorizing Square Blocks A(i+j,j), i = 1,2,...,mb-1. It consists of the following (mb+1-i)
computational steps performed on A(i+j,j).
2.1 Do a rank-r update (DTRMM type computation) using the upper triangular
block L(i+j,i+j-mb) and the square block L(j,i+j-mb). This requires multiplying
a triangular matrix with a square matrix.
2.2 Do (mb-i-1) rank-r updates (DGEMM computations) using square blocks
L(i+j,i+j-k) and L(j,i+j-k), k = mb-1, mb-2,...,i+1.
2.3 After having done the above updates, complete the factorization of the block
using L(j,j). This is a DTRSM computation. This completes the transformation
of A(i+j,j) to L(i+j,j).
3 Factorizing the Bottom Triangular Block A(j+mb,j). This consists of only one step.
3.1 Factor the triangular block using L(j,j). This transforms A(j+mb,j) to L(j+mb,j).
For large values of mb, most of the computation in the above steps is carried out in step
2.2 as DGEMM computations on square matrices of size r, each requiring 2r^3 flops. The
computation in each of the other steps is less than this. Thus, as a first approximation (for
load balancing purposes), we can assume that each of the above steps is a DGEMM
computation of size r. For large values of mb, the resulting inefficiency is small. We denote
one DGEMM computation as "one computing step". Using this terminology, we can say that
transforming A(i+j,j) to L(i+j,j) requires (mb+1-i) computing steps (i = 0,1,...,mb).
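As a concrete, purely sequential illustration of these block operations, the following Python/NumPy sketch performs the factorization left-looking, block column by block column. It is not the paper's implementation: for clarity it keeps A in full (unpacked) storage, treats every block, including the edge blocks, as a full r-by-r block, and uses SciPy routines in place of the BLAS kernels (DSYRK, DGEMM, DTRSM and the diagonal-block Cholesky) named above; the helper names are ours.

    import numpy as np
    from scipy.linalg import cholesky, solve_triangular

    def block_band_cholesky(A, r, mb):
        """Left-looking block Cholesky of a symmetric positive definite band
        matrix A with half bandwidth m = r*mb, following steps 1-3 above.
        Sketch only: A is in full storage, n is assumed to be a multiple of r,
        and the packed band layout / triangular edge blocks are not modeled.
        The lower triangle of A is overwritten with L."""
        n = A.shape[0]
        nb = n // r
        blk = lambda i, j: A[i*r:(i+1)*r, j*r:(j+1)*r]   # writable view of block (i,j)

        for j in range(nb):
            Ajj = blk(j, j)
            # Steps 1.1-1.2: symmetric rank-r updates (DSYRK) from factored columns.
            for k in range(max(0, j - mb), j):
                Ljk = blk(j, k)
                Ajj -= Ljk @ Ljk.T
            # Step 1.3: Cholesky factorization of the diagonal block.
            Ajj[:] = cholesky(Ajj, lower=True)

            # Steps 2-3: update (DGEMM) and triangular solve (DTRSM) below A(j,j).
            for i in range(j + 1, min(j + mb + 1, nb)):
                Aij = blk(i, j)
                for k in range(max(0, i - mb), j):
                    Aij -= blk(i, k) @ blk(j, k).T
                # L(i,j) = A(i,j) * L(j,j)^{-T}
                Aij[:] = solve_triangular(Ajj, Aij.T, lower=True).T

        return np.tril(A)

Checking np.allclose(L @ L.T, A0) against a saved copy A0 of the symmetric input verifies the factorization; Section 3 describes how these same block operations are distributed over the processors.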
3 Parallel Block Cholesky
We first discuss issues in designing a parallel block Cholesky factorization algorithm on a
distributed memory machine. In a distributed memory parallel formulation, each block
of A resides at one of the processors (the owning processor). This processor receives all
the blocks of L needed in its processing from other processors. Once the block is fully
processed, it becomes a block of the L matrix. We call the owning processor the producer
of this block of L. This block will be needed by other processors in their computation. Those
processors are called consumers of this block of L. Observe that in the block formulation
outlined above, several computational steps can be concurrently executed. For example,
once a block is factored, all blocks to its right can be updated by it in parallel. This type
of parallelism assumes that the factored block is immediately available to all processors.
This model holds in a shared memory environment. However, in a distributed memory
environment, because of the finite bandwidth of the underlying communication system, it
takes time for a block to move from its producer to its consumers. We also assume
that the communication system does not have a bus-based broadcast capability. In the
absence of a broadcast facility, the block arrives at different consumers at different
times. Thus, data movement has to be carefully scheduled so that a block reaches each of
its consumers at (or before) the time it is needed, and no consumer sits idle waiting for
data to arrive. At the same time, communication should be scheduled so that it does not
create contention at any processor (communication node). In other words, the
communication also should be uniformly distributed across all nodes.
We now discuss our parallel algorithm, which achieves all of the above objectives by
means of a highly structured and synchronized computing and communication scheme.
This requires the notion of a time step. During one time step, all processors do essentially
identical amounts of computing and communication (block send/receive). It is convenient
to describe the parallel processing in terms of tasks. Task(i+j,j) represents the
computational and communication steps (described above) required to transform A(i+j,j)
to L(i+j,j). Each task is assigned to one of the processors according to a fixed mapping
and is active for only (mb+1-i) time steps. After this, the processor
is assigned to another task. At any given time a large number of tasks are active. The
number of processors required is the maximum number of active tasks over the duration
of the algorithm. This is a quadratic function of mb. The tasks are assigned to available
processors on a static basis so that it is known ahead of time which processor will do a
particular task. The SPMD programming model is used to implement the algorithm. After
completion of the current task, a processor automatically assigns itself to its next task.
Initial blocks of A are distributed such that the processor assigned to carry out task(i+j,j)
holds the block A(i+j,j). After transforming A(i+j,j) to L(i+j,j), task(i+j,j) sends the
factored block L(i+j,j) to another task and then terminates.
In carrying out its computations, task(i+j,j) requires previously factored blocks of L
which are sent to it by other tasks. The nature of Cholesky factorization imposes strict
ordering constraints on how different tasks schedule their computational steps. We now
describe our parallel algorithm, in which all computing and communication takes place in
a highly structured and synchronized manner. Each of the computational steps outlined
above (1.1 to 3.1) requires up to two blocks of the L matrix. Our algorithm makes sure that
all the tasks receive all their required blocks at just the right time, neither early nor late.
Generally, in one time step, task(i+j,j) does the following computation and communication.
a. Receives a block of L from task(i+j,j-1) (the task to its left in the matrix notation) and
from task(i+j-1,j) (the task above it). This represents receiving two blocks of size r by r.

b. Does a computation utilizing the blocks just received. This represents one of the
computational steps described as steps 1.1 to 3.1.

c. Sends a block of L to task(i+j,j+1) (the task to its right) and to task(i+j+1,j) (the task
below it). This represents sending two blocks of size r by r. Unless this is the last time
step for this task, the block received from task(i+j,j-1) is forwarded to task(i+j,j+1)
and the block received from task(i+j-1,j) is forwarded to task(i+j+1,j). If this is the
last time step for this task, the block L(i+j,j) computed by it is sent to one of the
tasks and the L block from the task above is forwarded to the task below; the task
terminates after this.
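To make the structure of items a-c concrete, the sketch below (the function name is ours, purely illustrative) models one generic middle time step of task(i+j,j). It deliberately ignores the actual message passing, the triangular special cases, and the task's final DTRSM/factorization step; the received blocks are simply passed in and returned as arrays.

    import numpy as np

    def task_time_step(local_block, from_left, from_above):
        """One generic time step of task(i+j,j), following items a-c above.

        local_block : r-by-r block A(i+j,j) owned by this task (updated in place)
        from_left   : L(i+j,k) received from task(i+j,j-1), the task to the left
        from_above  : L(j,k)   received from task(i+j-1,j), the task above

        Returns (to_right, to_below), the blocks forwarded to task(i+j,j+1)
        and task(i+j+1,j).
        """
        # b. DGEMM-type update: A(i+j,j) -= L(i+j,k) * L(j,k)^T
        local_block -= from_left @ from_above.T
        # c. forward the received operands along the block row and the block column
        return from_left, from_above

In the task's last time step, item c instead sends the freshly factored block L(i+j,j) onward, as described above.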
During those time steps where triangular matrices are involved, the computing and
communication cost is about half that of a regular time step. For large values of mb, the
load-balance inefficiency introduced by this is small, and it is possible to eliminate these
minor inefficiencies in the algorithm; the details, together with a detailed analysis of the
algorithm, can be found in [1]. The algorithm factors a block column every three time steps
using approximately mb^2/6 processors (each block column requires roughly mb^2/2
computing steps in total, so completing one column every three time steps keeps about
mb^2/6 processors busy). We now give experimental results on the IBM SP2 machines.
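The following helper (an illustrative name of ours) restates this leading-order processor count; the exact number, including the handling of the first and last mb block columns, follows from the detailed schedule analyzed in [1].

    def approx_processors(m, r):
        """Leading-order processor estimate p ~ mb^2 / 6, where mb = m / r is the
        number of blocks per block column; for a fixed block size r, the usable
        processor count grows quadratically with the half bandwidth m."""
        mb = m // r
        return mb * mb / 6.0

    # Doubling the half bandwidth at fixed r quadruples the estimate.
    print(approx_processors(1200, 200), approx_processors(2400, 200))   # 6.0 24.0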
4 Experimental Results
4.1 The IBM SP2 Parallel System
The SP2 is the second offering in IBM's Scalable POWER parallel family of parallel systems
based on IBM's RS/6000 processor technology. The SP2 is a distributed-memory system
consisting of up to 128 processor nodes connected by a High-Performance Switch. Three
different processor nodes are available, based on the RS/6000 Model 370 (peak of 125 Mflops),
Model 390 (peak of 266 Mflops), and Model 590 (peak of 266 Mflops) CPU planars. (These
nodes are also known as Thin 62, Thin 66, and Wide nodes, respectively.) The Model 370
processor is based on the original POWER architecture, while the 390 and 590 processors are
based on the POWER2 architecture. Each processor has at least 64 MB of local memory (wide
nodes can have up to 2 GB of local memory per node). Each node has a locally attached disk.
The High-Performance Switch is a multi-stage packet switch providing a peak point-to-point
bandwidth of 40 MB/s in each direction between any two nodes in the system.
For the wide-node system, the sustained application-buffer to application-buffer transfer
rate is approximately 35 MB/s for a uni-directional transfer, measured as one half of the
time necessary for a round-trip "ping" operation between two compute nodes. The latency
(i.e., the time for a zero-byte message) measured in the same manner is approximately 40
microseconds on the SP2. In the case where a compute processor simultaneously sends and
receives different messages, the aggregate (incoming plus outgoing) bandwidth at this node
is approximately 48 MB/s on the wide-node system. This is the transfer rate observed when
two nodes exchange long messages, a common communication operation in many parallel
algorithms.
4.2 Results
We implemented the proposed algorithm on two configurations of the IBM SP2: (i) Model
370 nodes, and (ii) Model 590 wide nodes. We summarize our results for the 370 nodes in
Tables 1-4, and for the 590 nodes in Table 5. The following notation is used for tabulating
our results.
NP: Number of processors.
r: Block size.
m: Half bandwidth (related to mb by m = mb*r).
N: Problem size (order of the matrix).
Time: Total elapsed time in seconds.
TotalMF: Total Mflops, given by (N*m^2 - (2/3)*m^3)/Time.
MF/p: Mflops per processor, given by TotalMF/NP.
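For completeness, the two performance measures can be computed directly from these definitions; this is a small helper with names of our choosing, and Time is assumed to be in seconds.

    def total_mflops(N, m, time_s):
        """TotalMF: (N*m^2 - (2/3)*m^3) flops divided by the elapsed time, in Mflops."""
        return (N * m**2 - (2.0 / 3.0) * m**3) / time_s / 1.0e6

    def mflops_per_processor(N, m, time_s, NP):
        """MF/p: TotalMF divided by the number of processors."""
        return total_mflops(N, m, time_s) / NP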
The amount of memory required per processor is proportional to N*r^2/m. In our
experiments, when we moved to a larger number of processors we scaled the problem size
so that the memory requirement per processor remained the same. There are three ways
to do this: (a) constant N/m (with r fixed), (b) constant N*r^2 (with m fixed), and (c)
constant r^2/m (with N fixed). The first three tables correspond to these three cases. The
fourth table gives results with r and N held constant.
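As a small illustration of why these scaling rules keep the per-processor memory constant (the helper name is ours, and the constant factor from the block layout is omitted):

    def memory_per_processor(N, r, m):
        """Per-processor storage in matrix entries, up to a constant factor:
        the band holds about N*m entries spread over roughly (m/r)^2/6 processors,
        giving a requirement proportional to N*r^2/m."""
        return N * r**2 / m

    # First and last rows of Table 1 (r fixed, N/m constant): the quantity stays flat.
    print(memory_per_processor(28800, 200, 1200),
          memory_per_processor(62400, 200, 2600))   # both 960000.0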
On the 370 nodes, DGEMM typically performs at 90-95 Mflops. The Mflops per processor
achieved is somewhat lower because of the time spent on communication. Table 5 gives
results for the IBM SP2 with 590 nodes. At present we only have partial results for this
configuration. A noteworthy result from the tables is that in almost all cases, the Mflops
per processor actually increases with an increasing number of processors, indicating very
good scalability of the algorithm.
5 Conclusion
In this paper we have presented a strongly scalable parallel algorithm for band Cholesky
factorization on distributed memory parallel machines. The proposed algorithm has
a highly structured computing and communication requirement. We implemented this
algorithm on the IBM SP2 machines and obtained a high level of performance.
References
[1] R. Agarwal, F. Gustavson, M. Joshi, and M. Zubair, A Scalable Parallel Block Algorithm for
Band Cholesky Factorization, IBM Research Report, in preparation.
[2] I. Bar-On, A Practical Parallel Algorithm for Solving Band Symmetric Positive Definite Systems
of Linear Equations, ACM Trans. Math. Softw., 13 (1987), pp. 323-332.
[3] J. Dongarra, and L. Johnsson, Solving Banded Systems on a Parallel Processor, Parallel
Computing, 5, (1987), pp. 219-246.
[4] D. Lawrie, and A. Sameh, The Computation and Communication Complexity of a Parallel
Banded System Solver, ACM Trans. Math. Softw., 10, (1984), pp. 185-195.
[5] J. Navarro, J. Llaberia, and M. Valero, Partitioning: An Essential Step in Mapping Algorithms
Into Systolic Array Processors, IEEE Computer, (1987), pp. 77-89.
[6] Y. Robert, Block LU Decomposition of a Band Matrix on a Systolic Array, Int. J. Comput.
Math., 17, (1985), pp. 295-316.
[7] Y. Saad, and M. Schultz, Parallel Direct Methods for Solving Banded Linear Systems, Lin. Alg.
& Appl., 88, (1987), pp. 623-650.
[8] R. Schreiber, On Systolic Array Methods for Band Matrix Factorizations, BIT, 26, (1986),
pp. 303-316.
NP    r     m       N    Time(s)  TotalMF  MF/p
10   200   1200   28800    88.9      466   46.6
15   200   1600   38400   119.6      821   54.8
22   200   2000   48000   150.2     1278   58.1
25   200   2200   52800   165.9     1540   61.6
30   200   2400   57600   181.9     1824   60.8
35   200   2600   62400   197.5     2135   61.0

Table 1: r fixed (the aspect ratio N/m is kept approximately constant)
NP    r     m       N    Time(s)  TotalMF  MF/p
10   200   1200   15200    46.2      473   47.3
15   150   1200   26700    51.4      748   49.9
22   120   1200   41760    58.2     1033   46.9
30   100   1200   60000    64.7     1336   44.5

Table 2: m fixed (N*r^2 is kept approximately constant)
NP    r     m       N    Time(s)  TotalMF  MF/p
10   100    600   57600    61.7      336   33.6
13   120    840   57600    80.1      507   39.0
18   150   1350   57600   112.9      929   51.6
25   180   1980   57600   152.1     1484   59.4
30   200   2400   57600   181.9     1824   60.8

Table 3: N fixed (r^2/m is kept approximately constant)
NP    r     m       N    Time(s)  TotalMF  MF/p
 9   200   1200   34000   105.7      463   51.5
12   200   1400   34000   105.7      630   52.5
24   200   2200   34000   106.1     1550   64.6
54   200   3400   34000   114.4     3434   63.6

Table 4: r and N fixed
NP    r     m       N    Time(s)  TotalMF  MF/p
16   250   2000   60000   134.8     1740  108.8
16   300   2400   60000   175.1     1921  120.1
16   350   2800   70000   271.2     1969  123.1

Table 5: Results on the SP2 with 590 nodes