A Scalable Parallel Block Algorithm for Band Cholesky Factorization

Ramesh Agarwal, Fred Gustavson, Mahesh Joshi (on assignment from TISL, India), and Mohammad Zubair
IBM T.J. Watson Research Center, P.O. Box 218, Yorktown Heights, NY 10598

Abstract

In this paper, we present an algorithm for computing the Cholesky factorization of large banded matrices on the IBM distributed memory parallel machines. The algorithm aims at optimizing the single node performance and minimizing the communication overheads. An important result of our paper is that the proposed algorithm is strongly scalable: the number of processors that can be efficiently utilized grows quadratically with the bandwidth of the matrix.

1 Introduction

Many of the matrices arising from large scientific applications have a banded structure. Banded solvers, as opposed to dense solvers, are difficult to parallelize because of their lower computation to communication ratios. Therefore, algorithms for such problems need to employ special techniques to reduce communication overheads. Several researchers have investigated parallel algorithms for solving band systems [2, 3, 4, 5, 6, 7, 8]. Most of these algorithms are either developed for special purpose architectures, or they have been suggested for hypothetical machines. In [3], Dongarra and Johnsson report some performance results on two bus-based shared memory architectures, namely, the Alliant FX/8 and the Sequent Balance 21000. However, we are not aware of any published performance results for band system solvers on distributed memory parallel machines.

In this paper, we present an algorithm for computing the Cholesky factorization of large banded matrices on the IBM distributed memory parallel machines. The algorithm aims at optimizing the single node performance, minimizing the communication overheads, and optimally scheduling the processing at the various nodes so that all processors are fully utilized all the time. An important result of our paper is that the proposed algorithm is strongly scalable: the number of processors that can be efficiently utilized grows quadratically with the bandwidth of the matrix.

2 Block Cholesky Formulation

The Cholesky factorization of a symmetric positive definite band matrix A computes the lower triangular band matrix L where A = LL^T. The Cholesky factor L replaces the original matrix A in storage. It is assumed that only the lower part of the symmetric matrix A is stored in memory.

In this section, we will develop a block formulation of the Cholesky factorization. We partition the lower band matrix A into r by r blocks. Within each block column, the diagonal block is a lower triangular matrix, the bottom block is an upper triangular matrix, and the rest of the blocks are square matrices. The half bandwidth of the matrix is m = r*mb, where mb-1 is the number of square blocks in a block column; here we assume that m is an integer multiple of the block size r. This gives a block band matrix of block half bandwidth mb. The j-th block column consists of the diagonal block A(j,j), the square blocks A(i+j,j), i = 1,2,...,mb-1, and the bottom upper triangular block A(j+mb,j). Now we will describe the Cholesky factorization in terms of block operations.
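To make this partitioning concrete, the following small sketch extracts the blocks of one block column. It is a minimal NumPy illustration only: it keeps the matrix in ordinary dense storage rather than the packed lower band storage assumed above, and the helper name extract_block_column is ours, not part of the implementation described in this paper.

import numpy as np

def extract_block_column(A, j, r, mb):
    # Blocks of block column j of the lower band matrix A partitioned
    # into r-by-r blocks: the lower triangular diagonal block A(j,j),
    # the mb-1 square blocks A(j+1,j),...,A(j+mb-1,j), and the upper
    # triangular bottom block A(j+mb,j).
    def blk(bi, bj):
        return A[bi*r:(bi+1)*r, bj*r:(bj+1)*r]
    diag = np.tril(blk(j, j))
    squares = [blk(j + i, j) for i in range(1, mb)]
    bottom = np.triu(blk(j + mb, j))
    return diag, squares, bottom

# Illustrative data: a symmetric positive definite matrix with half
# bandwidth m = r*mb, built as A = B B^T from a banded lower
# triangular B (dense storage, for illustration only).
r, mb, nblocks = 3, 2, 8
n, m = r * nblocks, r * mb
B = np.tril(np.random.rand(n, n)) + np.eye(n)
B[np.tril_indices(n, -m - 1)] = 0.0      # keep half bandwidth m
A = B @ B.T

diag, squares, bottom = extract_block_column(A, j=2, r=r, mb=mb)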
To simplify the discussion, we will ignore the slight differences in factorizing the first and last mb block columns. The computation described below applies to the remaining middle block columns.

1 Factorizing a diagonal block A(j,j). This consists of the following (mb+1) computational steps performed on A(j,j).

1.1 Do a symmetric rank-r update using the upper triangular block L(j,j-mb). This requires multiplying a triangular matrix with its transpose. Note that we use the notation L to indicate that these blocks have already been factored.

1.2 Do mb-1 symmetric rank-r updates (DSYRK computations) using the square blocks L(j,j-k), k = mb-1, mb-2,...,1. These constitute (mb-1) steps.

1.3 After having done the above updates, do the Cholesky factorization of the diagonal block. This completes the transformation of A(j,j) to L(j,j).

2 Factorizing the square blocks A(i+j,j), i = 1,2,...,mb-1. This consists of the following (mb+1-i) computational steps performed on A(i+j,j).

2.1 Do a rank-r update (a DTRMM type computation) using the upper triangular block L(i+j,i+j-mb) and the square block L(j,i+j-mb). This requires multiplying a triangular matrix with a square matrix.

2.2 Do (mb-i-1) rank-r updates (DGEMM computations) using the square blocks L(i+j,i+j-k) and L(j,i+j-k), k = mb-1, mb-2,...,i+1.

2.3 After having done the above updates, complete the factorization of the block using L(j,j). This is a DTRSM computation. This completes the transformation of A(i+j,j) to L(i+j,j).

3 Factorizing the bottom triangular block A(j+mb,j). This consists of only one step.

3.1 Factor the triangular block using L(j,j). This transforms A(j+mb,j) to L(j+mb,j).

For large values of mb, most of the computation in the above steps is carried out in step 2.2 as DGEMM computations on square matrices of size r, each requiring 2r^3 flops. The computation in every other step is less than this. Thus, as a first approximation (for load balancing purposes), we can assume that each of the above steps is a DGEMM computation of size r. For large values of mb, the resulting inefficiency is small. We denote one DGEMM computation as "one computing step". Using this terminology, transforming A(i+j,j) to L(i+j,j) requires (mb+1-i) computing steps (i = 0,1,...,mb).

3 Parallel Block Cholesky

We first discuss the issues in designing a parallel block Cholesky factorization algorithm on a distributed memory machine. In a distributed memory formulation, each block of A resides at one of the processors (the owning processor). This processor receives all the blocks of L needed in its processing from other processors. Once the block is fully processed, it becomes a block of the L matrix. We call the owning processor the producer of this block of L. This block will be needed by other processors in their computations; those processors are called the consumers of this block of L.

Observe that in the block formulation outlined above, several computational steps can be executed concurrently. For example, once a block is factored, all blocks to its right can be updated by it in parallel. This type of parallelism assumes that the factored block is immediately available to all processors. This model holds in a shared memory environment. However, in a distributed memory environment, because of the finite bandwidth of the underlying communication system, it takes time for a block to move from its producer to its consumers. We also assume that the communication system does not have a bus-based broadcast capability.
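To make the task structure in the discussion below concrete, the per-block-column computation of steps 1.1 to 3.1 can be written serially as follows. This is a minimal sketch only: it assumes the band matrix is kept in dense storage with the factored L blocks of the previous block columns already in place, it uses NumPy/SciPy calls in place of the tuned DSYRK/DGEMM/DTRSM kernels, and the helper name factor_block_column is ours.

import numpy as np
from scipy.linalg import cholesky, solve_triangular

def factor_block_column(A, j, r, mb):
    # Steps 1.1-3.1 for a middle block column j.  A holds the lower
    # band part in dense storage; block columns 0,...,j-1 are assumed
    # to already contain their factored L blocks.
    def blk(bi, bj):
        return A[bi*r:(bi+1)*r, bj*r:(bj+1)*r]

    # Step 1: diagonal block A(j,j) (only its lower triangle is used).
    Ajj = blk(j, j)
    Ajj -= blk(j, j-mb) @ blk(j, j-mb).T              # 1.1 triangular rank-r update
    for k in range(mb-1, 0, -1):                      # 1.2 (mb-1) DSYRK-type updates
        Ajj -= blk(j, j-k) @ blk(j, j-k).T
    Ajj[:] = cholesky(Ajj, lower=True)                # 1.3 A(j,j) -> L(j,j)
    Ljj = Ajj                                         # now holds L(j,j)

    # Step 2: square blocks A(i+j,j), i = 1,...,mb-1.
    for i in range(1, mb):
        Aij = blk(i+j, j)
        Aij -= blk(i+j, i+j-mb) @ blk(j, i+j-mb).T    # 2.1 DTRMM-type update
        for k in range(mb-1, i, -1):                  # 2.2 DGEMM updates
            Aij -= blk(i+j, i+j-k) @ blk(j, i+j-k).T
        Aij[:] = solve_triangular(Ljj, Aij.T, lower=True).T   # 2.3 DTRSM with L(j,j)

    # Step 3: bottom triangular block A(j+mb,j).
    Abot = blk(j+mb, j)
    Abot[:] = solve_triangular(Ljj, Abot.T, lower=True).T     # 3.1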
In the absence of a broadcast facility, the block will arrive at different consumers at different times. Thus, data movement has to be carefully scheduled so that a block reaches each of its consumers at the right time (or before it is needed), so that no consumer sits idle waiting for data to arrive. At the same time, communication should be scheduled so that it does not create contention at any processor (communication node). In other words, the communication should also be uniformly distributed over all nodes.

We now discuss our parallel algorithm, which achieves all of the above objectives. They are achieved by implementing a highly structured and synchronized computing and communication scheme. This requires the notion of a time step. During one time step, all processors do an essentially identical amount of computing and communication (block sends/receives). It is convenient to describe the parallel processing in terms of tasks. Task(i+j,j) represents the computational and communication steps required (described above) to transform A(i+j,j) to L(i+j,j). Each task is assigned to one of the processors according to a fixed mapping and is active for only (mb+1-i) time steps. After this, the processor is assigned to another task. At any given time a large number of tasks are active. The number of processors required is the maximum number of active tasks over the duration of the algorithm; this is a quadratic function of mb. The tasks are assigned to the available processors on a static basis, so that it is known ahead of time which processor will do a particular task. The SPMD programming model is used to implement the algorithm. After completion of its current task, a processor automatically assigns itself to its next task. The initial blocks of A are distributed such that the processor assigned to carry out task(i+j,j) holds the block A(i+j,j). After transforming A(i+j,j) to L(i+j,j), task(i+j,j) sends the factored block L(i+j,j) to another task and then terminates. In carrying out its computations, task(i+j,j) requires previously factored blocks of L, which are sent to it by other tasks. The nature of the Cholesky factorization imposes strict ordering constraints on how different tasks schedule their computational steps.

We now describe our parallel algorithm, in which all computing and communication take place in a highly structured and synchronized manner. Each of the computational steps outlined above (1.1 to 3.1) requires up to two blocks of the L matrix. Our algorithm makes sure that every task receives all of its required blocks at just the right time, neither early nor late. Generally, in one time step a task(i+j,j) does the following computation and communication.

a. Receives a block of L from task(i+j,j-1) (the task to its left in the matrix notation) and from task(i+j-1,j) (the task above it). This represents receiving two blocks of size r by r.

b. Does a computation utilizing the blocks just received. This represents one of the computational steps described as steps 1.1 to 3.1.

c. Sends a block of L to task(i+j,j+1) (the task to its right) and to task(i+j+1,j) (the task below it). This represents sending two blocks of size r by r. Unless this is the last time step for this task, the block received from task(i+j,j-1) is forwarded to task(i+j,j+1) and the block received from task(i+j-1,j) is forwarded to task(i+j+1,j). If this is the last time step for this task, the block L(i+j,j) computed by it is sent to one of the tasks and the L block from the task above is forwarded to the task below; the task terminates after this.
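The regular time step above can be sketched as follows. This is an illustrative stand-in only: the mailbox, send, and recv helpers are toy replacements for the machine's message-passing layer, the DGEMM-style update stands in for whichever of steps 1.1 to 3.1 applies, and the destination of the finished L block on the last step is shown as the right-hand neighbor purely for illustration (the actual destination follows the static task mapping of [1]).

import numpy as np

# Toy mailbox standing in for the message-passing layer (illustration
# only; the real code sends explicit messages between processors).
mailbox = {}                       # (destination task, label) -> r-by-r block

def send(dest, label, block):
    mailbox[(dest, label)] = block

def recv(me, label):
    return mailbox.pop((me, label))

def time_step(i, j, my_block, last_step=False):
    # One time step of task(i+j,j): receive two L blocks, do one
    # computational step, send two blocks onward.
    me = (i + j, j)
    right, below = (i + j, j + 1), (i + j + 1, j)

    # a. receive from the task to the left and the task above
    from_left = recv(me, "row")        # travels along block row i+j
    from_above = recv(me, "col")       # travels down block column j

    # b. one computational step on the owned block; a DGEMM-style
    #    update (as in step 2.2) is used here as a placeholder for
    #    whichever of steps 1.1-3.1 applies in this time step
    my_block = my_block - from_left @ from_above.T

    # c. send two blocks onward
    if not last_step:
        send(right, "row", from_left)      # forward the row block to the right
        send(below, "col", from_above)     # forward the column block downward
    else:
        send(right, "row", my_block)       # hand off the finished L(i+j,j)
                                           # (destination chosen for illustration)
        send(below, "col", from_above)     # still forward the block from above
    return my_block

# Minimal usage: preload what task(5,3) (i = 2, j = 3) would receive,
# then execute one regular time step.
r = 4
send((5, 3), "row", np.random.rand(r, r))
send((5, 3), "col", np.random.rand(r, r))
block = time_step(i=2, j=3, my_block=np.random.rand(r, r))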
During those time steps where triangular matrices are involved, the computing and communication cost is about half that of a regular time step. For large values of mb, the load balance inefficiency introduced by this is small. It is possible to eliminate these minor inefficiencies in the algorithm; the details, along with a detailed analysis of the algorithm, can be found in [1]. This algorithm factors a block column every three time steps using approximately mb^2/6 processors. We now give experimental results on the IBM SP2 machines.

4 Experimental Results

4.1 The IBM SP2 Parallel System

The SP2 is the second offering in IBM's Scalable POWERparallel family of parallel systems based on IBM's RS/6000 processor technology. The SP2 is a distributed-memory system consisting of up to 128 processor nodes connected by a High-Performance Switch. Three different processor nodes are available, based on the RS/6000 Model 370 (peak of 125 Mflops), 390 (peak of 266 Mflops), and 590 (peak of 266 Mflops) CPU planars. (These nodes are also known as Thin 62, Thin 66, and Wide nodes, respectively.) The Model 370 processor is based on the original POWER architecture, while the 390 and 590 processors are based on the POWER2 architecture. Each processor has at least 64 MB of local memory (wide nodes can have up to 2 GB of local memory per node). Each node has a locally attached disk.

The High-Performance Switch is a multi-stage packet switch providing a peak point-to-point bandwidth of 40 MB/s in each direction between any two nodes in the system. For the wide-node system, the sustained application buffer to application buffer transfer rate is approximately 35 MB/s for a uni-directional transfer, measured using one half of the time necessary for a round-trip "ping" operation between two compute nodes. The latency (i.e., the time for a zero-byte message) measured in the same manner is approximately 40 microseconds on the SP2. In the case where a compute processor simultaneously sends and receives different messages, the aggregate (incoming plus outgoing) bandwidth at this node is approximately 48 MB/s on the wide-node system. This is the transfer rate observed when two nodes exchange long messages, a common communication operation in many parallel algorithms.

4.2 Results

We implemented the proposed algorithm on two configurations of the IBM SP2: (i) Model 370 nodes, and (ii) Model 590 wide nodes. We summarize our results for the 370 nodes in Tables 1-4, and for the 590 nodes in Table 5. The following notation has been used for tabulating our results.

NP: number of processors.
r: block size.
m: half bandwidth (related to mb by m = mb*r).
N: problem size.
Time: total elapsed time in seconds.
TotalMF: total Mflops, given by (Nm^2 - (2/3)m^3)/Time.
MF/p: Mflops per processor, given by TotalMF/NP.

The amount of memory required per processor is proportional to Nr^2/m. In our experiments, when we moved to a larger number of processors we scaled the problem size such that the memory requirement per processor remained the same. There are three ways to do this: (a) constant N/m, (b) constant Nr^2, and (c) constant r^2/m. The first three tables correspond to these three cases. The fourth table has results for r and N held constant.
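As a quick consistency check, the quantity Nr^2/m can be evaluated directly from the table entries. The short sketch below (illustrative only, with the first and last rows of Tables 1-3 hard-coded) shows that each of the three scaling strategies keeps this per-processor memory measure roughly constant.

# Per-processor memory is proportional to N*r^2/m.  Evaluating it (in
# arbitrary units) for the first and last rows of Tables 1-3 shows
# that each scaling strategy keeps it roughly constant.
def mem_per_proc(N, r, m):
    return N * r * r / m

rows = {
    "Table 1 (r fixed, N/m constant)":   [(28800, 200, 1200), (62400, 200, 2600)],
    "Table 2 (m fixed, N*r^2 constant)": [(15200, 200, 1200), (60000, 100, 1200)],
    "Table 3 (N fixed, r^2/m constant)": [(57600, 100, 600), (57600, 200, 2400)],
}
for name, data in rows.items():
    print(name, [round(mem_per_proc(N, r, m)) for (N, r, m) in data])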
On the 370 nodes, DGEMM typically performs at 90-95 Mflops. The Mflops per processor achieved is somewhat lower because of the time spent on communication. Table 5 has results for the IBM SP2 with 590 nodes; at present we only have partial results for this configuration. A noteworthy result from the tables is that in almost all cases the Mflops per processor actually increases with an increasing number of processors, indicating very good scalability of the algorithm.

5 Conclusion

In this paper we have presented a strongly scalable parallel algorithm for band Cholesky factorization on distributed memory parallel machines. The proposed algorithm has a highly structured computing and communication requirement. We implemented this algorithm on the IBM SP2 machines and obtained a high level of performance.

References

[1] R. Agarwal, F. Gustavson, M. Joshi, and M. Zubair, A Scalable Parallel Block Algorithm for Band Cholesky Factorization, IBM Research Report, in preparation.
[2] I. Bar-On, A Practical Parallel Algorithm for Solving Band Symmetric Positive Definite Systems of Linear Equations, ACM Trans. Math. Softw., 13 (1987), pp. 323-332.
[3] J. Dongarra and L. Johnsson, Solving Banded Systems on a Parallel Processor, Parallel Computing, 5 (1987), pp. 219-246.
[4] D. Lawrie and A. Sameh, The Computation and Communication Complexity of a Parallel Banded System Solver, ACM Trans. Math. Softw., 10 (1984), pp. 185-195.
[5] J. Navarro, J. Llaberia, and M. Valero, Partitioning: An Essential Step in Mapping Algorithms Into Systolic Array Processors, IEEE Computer, (1987), pp. 77-89.
[6] Y. Robert, Block LU Decomposition of a Band Matrix on a Systolic Array, Int. J. Comput. Math., 17 (1985), pp. 295-316.
[7] Y. Saad and M. Schultz, Parallel Direct Methods for Solving Banded Linear Systems, Lin. Alg. & Appl., 88 (1987), pp. 623-650.
[8] R. Schreiber, On Systolic Array Methods for Band Matrix Factorizations, BIT, 26 (1986), pp. 303-316.

NP   r    m     N      Time(s)  TotalMF  MF/p
10   200  1200  28800   88.9     466     46.6
15   200  1600  38400  119.6     821     54.8
22   200  2000  48000  150.2    1278     58.1
25   200  2200  52800  165.9    1540     61.6
30   200  2400  57600  181.9    1824     60.8
35   200  2600  62400  197.5    2135     61.0
Table 1: r fixed (the aspect ratio N/m is kept approximately constant)

NP   r    m     N      Time(s)  TotalMF  MF/p
10   200  1200  15200   46.2     473     47.3
15   150  1200  26700   51.4     748     49.9
22   120  1200  41760   58.2    1033     46.9
30   100  1200  60000   64.7    1336     44.5
Table 2: m fixed (Nr^2 is kept approximately constant)

NP   r    m     N      Time(s)  TotalMF  MF/p
10   100   600  57600   61.7     336     33.6
13   120   840  57600   80.1     507     39.0
18   150  1350  57600  112.9     929     51.6
25   180  1980  57600  152.1    1484     59.4
30   200  2400  57600  181.9    1824     60.8
Table 3: N fixed (r^2/m is kept approximately constant)

NP   r    m     N      Time(s)  TotalMF  MF/p
 9   200  1200  34000  105.7     463     51.5
12   200  1400  34000  105.7     630     52.5
24   200  2200  34000  106.1    1550     64.6
54   200  3400  34000  114.4    3434     63.6
Table 4: r and N fixed

NP   r    m     N      Time(s)  TotalMF  MF/p
16   250  2000  60000  134.8    1740    108.8
16   300  2400  60000  175.1    1921    120.1
16   350  2800  70000  271.2    1969    123.1
Table 5: Results on the SP2 with 590 nodes