Nonzero Structure Analysis

Aart J.C. Bik and Harry A.G. Wijshoff
High Performance Computing Division
Department of Computer Science, Leiden University
P.O. Box 9512, 2300 RA Leiden, the Netherlands
{ajcbik, [email protected]}

Abstract

Because the efficiency of sparse codes depends strongly on the size and structure of the input data, peculiarities of the nonzero structures of sparse matrices must be accounted for in order to avoid unsatisfactory performance. Usually, this implies retargeting a sparse application to specific instances of the same problem. However, if characteristics of the input data are collected at compile-time and used in the data structure selection and code generation by a compiler that converts dense programs into sparse programs automatically, the complexity of sparse code development can be greatly reduced, and an efficient way to perform this retargeting results. Such a 'sparse compiler' requires an analysis engine, which is the topic of this paper.

Index Terms: Nonzero Structures, Restructuring Compilers, Sparse Matrices.

Support was provided by the Foundation for Computer Science (SION) of the Netherlands Organization for the Advancement of Pure Research (NWO) and the EC Esprit Agency DG XIII under Grant No. APPARC 6634 BRA III. This paper has been accepted for ICS '94.

1 Introduction

Automatic analysis of nonzero structures of sparse matrices is useful for a number of reasons. First, it can provide a programmer with useful insights about characteristics of a range of sparse matrices for which an application must be developed, so that these characteristics can be accounted for. This requires, of course, access to a representative set of sparse matrices. Second, an analysis engine is very helpful for a 'sparse compiler', proposed in [4, 6]. This compiler automatically transforms a program that operates on dense matrices into a program that exploits the sparsity of certain matrices. In this manner, the burden of sparse code generation is placed on the compiler, while the programmer only has to deal with the dense code, which is both qualitatively and quantitatively much simpler [8]. Usually more optimizations are enabled because, in general, information obtained by analysis of dense codes is much more accurate [3]. Clearly, the characteristics of the sparse matrices must drive such a compiler. An analysis engine is required to obtain nonzero structure information that can be used in the data structure selection and code generation. Since analysis time contributes to compile-time, it is important to have an efficient analyzer. Finally, once the techniques of the sparse compiler have been well developed, a more powerful approach can be taken. The compiler generates multiple versions of the code that have been optimized for several nonzero structures that are likely to occur, and inserts the analysis code in the resulting program. At run-time, the outcome of the analysis determines which version of the code is the most suited for a particular matrix. This approach has the major advantage that the nonzero structure of a matrix does not have to be known at compile-time. However, since analysis is performed at run-time, it is even more important to have an efficient analyzer: no gain is obtained if the savings in execution time of an optimized version are outweighed by the analysis time.
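To make the latter approach concrete, the following is a minimal sketch (in Python, with purely illustrative names; it is not output of the MT1 compiler discussed below) of run-time version selection, where a cheap analysis of the coordinate triples (i, j, value) decides which specialized version is executed:

  # Minimal sketch of run-time version selection: a cheap analysis of
  # the nonzero structure picks the code version best suited to the
  # matrix. All names are hypothetical; a sparse compiler would
  # generate such code automatically.

  def spmv_general(entries, x, y):
      # general version: one indirection per stored nonzero
      for (i, j, v) in entries:
          y[i] += v * x[j]

  def spmv_diagonal(entries, x, y):
      # specialized version for a matrix in diagonal form
      for (i, _, v) in entries:
          y[i] += v * x[i]

  def spmv(entries, x, y):
      # run-time analysis: are all nonzeros on the main diagonal?
      if all(i == j for (i, j, _) in entries):
          spmv_diagonal(entries, x, y)
      else:
          spmv_general(entries, x, y)

Here the analysis is a single pass over the stored nonzeros, so its cost is easily recovered whenever a specialized version can be selected.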
In the context of LU-factorization, a priori orderings, usually expressed in terms of a relabeling of the underlying graph [9, 14], are often applied to matrices in order to confine so-called fill-in to certain regions in the matrix. For example, Tarjan's algorithm [22] can be used to obtain a block triangular form [9], while the well-known Cuthill-McKee algorithm [7] is often used to reduce the bandwidth of a matrix. So-called local strategies, like application of the Markowitz criterion [9], try to meet the minimum fill-in objective by selecting at each stage a pivot for which certain fill-in related properties are satisfied. These methods, however, are not considered in this paper; we merely try to detect a certain structure in a (possibly reordered) matrix. A necessary extension of the sparse compiler, however, must deal with reorderings, for example by automatic insertion of such routines. This makes all sparsity related issues completely transparent to the programmer.

In this paper we present some efficient analysis techniques that have been implemented for the prototype compiler MT1 [5], which is currently used to implement the techniques of automatic sparse code generation. In section 2, important characteristics of nonzero structures are identified, followed by a discussion in section 3 of how some of these nonzero structures can be detected efficiently. In section 4 the implementation of these techniques is discussed, illustrated with some compile-time techniques that become possible if nonzero structure information is available. Finally, conclusions and topics for further research are stated in section 5.

2 Nonzero Structures

In this section, we explore some important characteristics of the nonzero structure of a sparse $n \times n$ matrix $A$, i.e. $\mathrm{Nonz}(A) = \{(i,j) \mid a_{ij} \neq 0\}$ (see e.g. [9, 18, 23, 24, 25]).

2.1 Band Forms

The band form of a matrix is defined by the lower and upper semi-bandwidth, which are the two integers $b_1 \geq 0$ and $b_2 \geq 0$ that are the minimum values for which the constraint $(a_{ij} \neq 0) \Rightarrow (-b_1 \leq j - i \leq b_2)$ is satisfied,¹ which means that all nonzeros are confined to a band. If $b_1 = b_2$ holds, $b_1 + b_2 + 1$ is referred to as the bandwidth of this band. Although the semi-bandwidths are defined for arbitrary matrices (e.g. $b_1 = b_2 = n - 1$ for a full matrix), matrices are usually only called band matrices if the semi-bandwidths are relatively small. If $(-b_1 \leq j - i \leq b_2) \Rightarrow (a_{ij} \neq 0)$ also holds, the matrix is a full band matrix [24], while zero elements might appear in the band otherwise. Some special kinds of band matrices can be distinguished. If $b_1 = b_2 = 0$, the matrix is in diagonal form, while for $b_1 = b_2 = 1$ the matrix is called tridiagonal. A form that is closely related to storage formats consists of the so-called variable band matrices [9, 13, 16, 17], where lower and upper semi-bandwidths are determined per diagonal element, which gives rise to a skyline as defined in section 3.1.

¹ Minimum values reveal the most information about a matrix, because the constraints are satisfied in any matrix for the trivial semi-bandwidths $b_1 = b_2 = n - 1$. Allowing for negative semi-bandwidths would enable the specification of arbitrary bands, where $b_1 = b_2 = -1$ corresponds to a zero matrix.

2.2 Triangular Forms

A matrix is in lower triangular form if $(a_{ij} \neq 0) \Rightarrow (j \leq i)$. For a unit lower triangular matrix, $a_{ii} = 1$ holds for all $1 \leq i \leq n$ as well. If $(a_{ij} \neq 0) \Rightarrow (j < i)$ holds, the matrix is called strictly lower triangular. A lower triangular matrix can be seen as a special band matrix with $b_2 = 0$ and relatively large $b_1 > 0$ (note that $b_1 = n - 1$ does not necessarily hold if the triangular part is not full). For relatively small $b_2 > 0$ the matrix is in so-called band lower triangular form. The corresponding upper triangular forms are defined similarly. A small sketch below illustrates how these forms follow directly from the semi-bandwidths.
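As an illustration of the definitions above, the following sketch (a hypothetical helper, not part of the compiler discussed later) classifies a matrix from its semi-bandwidths $b_1$ and $b_2$, which are computed from the skylines in section 3.1:

  # Classify a square matrix from its semi-bandwidths (b1, b2),
  # following the definitions of sections 2.1 and 2.2 (illustrative only).

  def classify(b1, b2):
      if b1 == 0 and b2 == 0:
          return 'diagonal'
      if b1 == 1 and b2 == 1:
          return 'tridiagonal'
      if b2 == 0:
          return 'lower triangular'    # band matrix with b2 = 0
      if b1 == 0:
          return 'upper triangular'    # band matrix with b1 = 0
      return 'band, b1=%d, b2=%d' % (b1, b2)

  print(classify(4, 3))   # the matrix of figure 1: 'band, b1=4, b2=3'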
2.3 Block Forms

Consider a partition of a square matrix $A$ into submatrices $A_{ij}$ for $1 \leq i \leq p$ and $1 \leq j \leq p$. Each $A_{ii}$ is an $n_i \times n_i$ matrix, and is referred to as a diagonal block. The sizes $n_i$ of these diagonal blocks completely determine the partition of the whole matrix: each $A_{ij}$ is an $n_i \times n_j$ submatrix. Submatrices $A_{ij}$ for $i \neq j$ are called off-diagonal blocks. If a block consists only of zero elements, this is denoted by $A_{ij} = 0$. Such blocks are referred to as zero blocks.

$$\begin{pmatrix} A_{11} & \cdots & A_{1p} \\ \vdots & & \vdots \\ A_{p1} & \cdots & A_{pp} \end{pmatrix}$$

Given this partition, a block banded form can be defined with semi-bandwidths $B_1 \geq 0$ and $B_2 \geq 0$, which are the minimum values for which $(A_{ij} \neq 0) \Rightarrow (-B_1 \leq j - i \leq B_2)$ is satisfied. Consequently, a block tridiagonal form results if $B_1 = B_2 = 1$ holds, and a block diagonal form results if $B_1 = B_2 = 0$. Similarly, a block lower triangular or block upper triangular form is defined if $(A_{ij} \neq 0) \Rightarrow (j \leq i)$ or $(A_{ij} \neq 0) \Rightarrow (i \leq j)$ holds respectively. The off-diagonal blocks $A_{pi}$ and $A_{ip}$ for $1 \leq i < p$ are referred to as the lower border and upper border respectively. If nonzero blocks occur in the borders, this gives rise to singly bordered block diagonal, doubly bordered block diagonal, and bordered block triangular forms.

Although, depending on which blocks are nonzero, the block form of a matrix is defined once a partition of that matrix is given, block forms corresponding to different partitions might differ in the accuracy with which they describe the nonzero structure. As an extreme example, a matrix is in any block form if the trivial partition $A = A_{11}$ is used. The most accurate description is obtained from a block form that is defined by a minimum partition into that particular block form, which means that there are no other partitions of the matrix into that block form for which the corresponding nonzero blocks have a smaller total area. For example, the areas of the nonzero blocks of a partition into block diagonal form (BDF) and block tridiagonal form (BTF) are determined by the following formulae:

$$\mathrm{BDF:}\ \sum_{i=1}^{p} n_i^2 \qquad\qquad \mathrm{BTF:}\ \sum_{i=1}^{p} n_i^2 + 2 \sum_{i=1}^{p-1} n_i\, n_{i+1}$$

These numbers are likely to exceed the number of nonzero elements, since even in a minimum partition the nonzero blocks are not necessarily full.

Proposition 1 A square matrix has a unique minimum partition into block diagonal form.

Proof Assume that a matrix has two different minimum partitions into block diagonal form. Since each diagonal element is contained in a diagonal block, at least two diagonal blocks of these partitions partially overlap, or one is properly contained in the other. This implies that there is a non-trivial partition into block diagonal form of at least one of these diagonal blocks, which for the corresponding partition gives rise to a partition into block diagonal form of the whole matrix with fewer elements, contradicting the minimality of both partitions. □

The following obvious property holds:

Proposition 2 A partition of a square matrix into block diagonal form is minimum if and only if no corresponding diagonal block has a non-trivial partition into block diagonal form.

Proof '⇒' A non-trivial partition of a diagonal block into block diagonal form would imply the existence of a partition into block diagonal form of the whole matrix with fewer elements.
'⇐' Since each diagonal block of the minimum partition into block diagonal form is properly contained in, or equal to, a diagonal block of an arbitrary partition into block diagonal form, this implies that for a partition into block diagonal form where no diagonal block has a non-trivial partition into block diagonal form, each diagonal block must be equal to a diagonal block of the minimum partition into block diagonal form, so that both partitions are equal. □

Similar propositions can be given for the minimum partitions into block lower or upper triangular form. In the context of permutations, a matrix is called reducible if there is a symmetric permutation $PAP^T$ with a non-trivial partition into block triangular form [14]. A permuted matrix is fully reduced if it has a partition into block triangular form for which all diagonal blocks are irreducible, so that this partition is necessarily minimal. In contrast, diagonal blocks of the minimum partition into block triangular form can still be reducible.

For the previous partitions it was tacitly assumed that all diagonal blocks $A_{ii}$ are non-empty (a $0 \times 0$ submatrix is called an empty block). However, in order to generalize partitions into bordered block forms and into the corresponding block forms without border, it is convenient to allow the last diagonal block $A_{pp}$ to be empty, so that all border blocks are also empty, as is illustrated below for a partition into bordered block upper triangular form, where $D$ and $O$ denote diagonal and off-diagonal blocks respectively, and $\square$ is used for an empty block:

$$\begin{pmatrix} D & O & O & \square \\ & D & O & \square \\ & & D & \square \\ & & & \square \end{pmatrix}$$

Some matrices have several minimum partitions into bordered block form, in which case we will select the partition that corresponds to the smallest border size. For partitions into block banded form, empty diagonal blocks will be used to obtain a block banded form from a more general form, as will be further explained in section 3.4.

3 Automatic Analysis

In this section, techniques to analyze nonzero structures are presented. It is assumed that all sparse matrices are available on file in so-called coordinate scheme storage (see e.g. [9, 10, 11, 12, 29]). In this scheme, a file consists of integers $n$ and $m$ that contain the size of the matrix, and an integer that indicates the number of stored elements, followed by unordered triples $(i, j, a_{ij})$ that indicate the row index, column index, and value of each entry. Because there is no advantage in storing particular zeros explicitly, it is assumed that all stored elements are nonzero.

3.1 Lower and Upper Skyline

The lower skyline of an $n \times n$ sparse matrix is defined as the sequence $\lambda_i = \max\{\, i - j \mid (a_{ij} \neq 0) \wedge (j < i) \,\}$ for $1 \leq i \leq n$, and similarly the upper skyline is defined as the sequence $\mu_j = \max\{\, j - i \mid (a_{ij} \neq 0) \wedge (i < j) \,\}$ for $1 \leq j \leq n$, where the maximum over an empty set is zero. This implies that the maximum horizontal and vertical distance from each entry to the main diagonal is stored per diagonal element (cf. variable band). For example, the skylines of the $15 \times 15$ sparse matrix in figure 1, where the positions of the 26 nonzeros are shown, are described as $\lambda_1 = 0, \lambda_2 = 0, \lambda_3 = 1, \ldots, \lambda_{15} = 3$ and $\mu_1 = 0, \mu_2 = 0, \mu_3 = 2, \ldots, \mu_{15} = 2$.
Skyline information requires $O(n)$ storage and can be obtained in $O(\tau)$ time, where $\tau$ denotes the number of stored nonzeros, in a single pass over the stored elements in the file by execution of the following fragment, limiting the increase in computational time and storage requirements of the compiler:

  read(n,m,nz);   /* assume m = n holds */
  allocate storage for lsky[1..n] and usky[1..n]
  for k := 1, nz do
    read(i,j,aij);
    lsky[i] := max(lsky[i], i-j);
    usky[j] := max(usky[j], j-i);
  endfor

The elements of the arrays lsky and usky, all initialized to zero, are used to store all $\lambda_i$ and $\mu_j$. The semi-bandwidths can then be computed as:

$$b_1 = \max_{1 \leq i \leq n} \lambda_i \qquad\qquad b_2 = \max_{1 \leq j \leq n} \mu_j$$

For example, $b_1 = 4$ and $b_2 = 3$ hold for the matrix in figure 1. The semi-bandwidths reflect the band form of a matrix and can be used to determine whether a matrix is diagonal, tridiagonal, or lower or upper triangular. However, skylines can also be used to obtain more complex characteristics of the nonzero structure in an efficient way, as is discussed in the following sections.

Figure 1: Example Matrix

3.2 Block Diagonal/Triangular Form

If, in the search for a partition of a matrix into block diagonal form, all diagonal blocks beyond row and column $B$ have already been identified, then the next diagonal block can be of size $k$ if for all $B - k < i \leq B$, the constraints $\lambda_i + (B - i) < k$ and $\mu_i + (B - i) < k$ are satisfied, as is illustrated in figure 2. This gives rise to the following algorithm to construct a partition into block diagonal form in $O(n)$ time. Starting with $k = 1$ and $B = n$, a scan $i = B, \ldots, B - k + 1$ is made. At each step $i$, the assignment $k := \max(k, \max(\lambda_i, \mu_i) + B - i + 1)$ is performed, which possibly affects the lower bound of the scan. However, if this lower bound is reached, the current $k \times k$ diagonal block is recorded, the base $B$ is decreased by $k$, and $k$ is reset to 1. This method is repeated until $i = 1$. Note that in any case, a diagonal block is recorded after the final step.

Figure 2: Valid Block in Block Diagonal Form

Proposition 3 Application of the previous algorithm to the lower and upper skyline of a matrix results in the minimum partition of that matrix into block diagonal form.

Proof By construction each entry is incorporated in a diagonal block, so it suffices to show that the minimum partition into block diagonal form results, for which proposition 2 is used. Assume that in the resulting partition, a $k \times k$ diagonal block, consisting of the elements $a_{ij}$ for $B - k < i, j \leq B$, for $1 \leq B \leq n$, has a non-trivial partition into block diagonal form. Consideration of the lower-rightmost diagonal block in this partition implies that there is a $k' < k$ such that (1) for all $B - k' < i \leq B$, $\lambda_i + (B - i) < k'$ and $\mu_i + (B - i) < k'$ hold. If $k_i$ denotes the value of $k$ after step $i$, then (2) $k' < k_{B-k'+1}$ holds because the scan proceeded after $i = B - k' + 1$. Furthermore, since $1 \leq k' < k$, the scan proceeded after step $i = B$, and thus $k_B > 1$. Consequently, $k$ has been adapted, which implies that (3) $k_i = \lambda_j + B - j + 1$ or $k_i = \mu_j + B - j + 1$ holds for some $B - k' < i \leq j \leq B$. Combining (2) and (3) yields either $k' \leq \lambda_j + B - j$ or $k' \leq \mu_j + B - j$ for certain $B - k' < j \leq B$, contradicting (1). □

If only the upper skyline $\mu_i$ or only the lower skyline $\lambda_i$ is used in the assignment to $k$, the minimum partition into respectively block lower or block upper triangular form is constructed, as can be proven similarly. Application of these algorithms to the matrix of figure 1 yields the block diagonal, block lower triangular and block upper triangular forms that are shown in figure 3; a sketch of this construction is given below.
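The scan described above translates almost directly into code. The following is a minimal sketch in Python (0-based lists lsky and usky hold the skylines; this is illustrative, not the MT1 implementation), which returns the diagonal block sizes from the lower right to the upper left corner:

  # O(n) construction of the minimum block diagonal partition of
  # section 3.2 from precomputed skylines (0-based lists lsky, usky).

  def block_diagonal_partition(lsky, usky):
      n = len(lsky)
      blocks = []
      B = n      # base: rows and columns beyond B are already partitioned
      k = 1      # candidate size of the next diagonal block
      i = B
      while i >= 1:
          # enlarge the block so that all entries of row/column i fit inside
          k = max(k, max(lsky[i - 1], usky[i - 1]) + B - i + 1)
          if i == B - k + 1:       # lower bound of the scan reached
              blocks.append(k)     # record the k x k diagonal block
              B -= k
              k = 1
              i = B
          else:
              i -= 1
      return blocks

Using only lsky or only usky in the assignment to k yields the corresponding triangular block forms, as noted above; for example, the skylines of a tridiagonal matrix (lsky = usky = [0, 1, ..., 1]) yield a single diagonal block of size n.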
The total number of elements contained in the nonzero blocks of these block forms (respectively 67, 140 and 138) can be used as a measure of the description accuracy, which implies that the block diagonal form reveals the most information about the nonzero structure of this matrix.

Figure 3: Block Forms

3.3 Bordered Block Form

Nonzero elements that occur in the borders of a matrix might be responsible for large values in the two skylines, so that the size of the resulting diagonal blocks increases. For the matrix in figure 4, for example, 176 elements appear in the nonzero blocks of the minimum partition into block upper triangular form, which is large in comparison with the total number of elements. Since only the trivial partition defines a block diagonal and block lower triangular form, the minimum partitions into these forms contain even more elements.

Figure 4: Block Upper Triangular Form

Therefore, it might be useful to have algorithms that also construct a minimum partition into bordered block forms. In a naive approach, such a partition is found by applying the algorithms of the previous section to the submatrix that remains for every border size, followed by selection of the partition with the fewest corresponding elements. However, even if a limited number of border sizes is considered, this approach would have unacceptable complexity [8], since a reasonable upper bound on the border size must be expressed in terms of $n$. Fortunately, it is also possible to obtain the best border size in $O(n)$ time.

Suppose that for a certain border size $b'$, the minimum partition of the remaining $(n - b') \times (n - b')$ submatrix into block upper triangular form is determined by application of the previous algorithm, i.e. each next diagonal block is computed by a scan $i = B, \ldots, B - k + 1$, where $k$ is increased if necessary. After each step $i$, it can also be decided to discard the partition found so far, and to restart the algorithm with $B = i - 1$ and $k = 1$ after a new border size $b = n - i + 1$ has been recorded. Selection of this border is only profitable if the number of extra elements that are included in the border (loss) is less than the number of elements that are gained by starting with a new partition (gain). If the new partition remains within the old diagonal block, the gain consists of elements in that block only, as illustrated in figure 5.

Figure 5: Gain and Loss within Diagonal Block

However, if a diagonal block of this new partition exceeds the old diagonal block, more elements are included in the gain. The loss decreases because $k$ would have been increased to $B_I - B_V + k_V$ in the old partition, where the subscripts $I$ and $V$ are used to denote the values of the variables in this particular run at initiation time ($i = n - b + 1$) and verification time (when $i$ becomes $B_I - \max(k_I, B_I - B + k) + 1$) respectively, as illustrated in figure 6.

Figure 6: Gain and Loss outside Diagonal Block

If $Z$ contains the number of elements in the zero blocks of the partition found so far, the loss and gain can be computed as $l(b, b') = Z_I + (B_V - k_V) \cdot (B_I - i_I + 1)$ and $g(b, b') = Z_V - (B_V - k_V) \cdot (i_I - B_V - 1)$ respectively, so that the improvement is determined as $I(b, b') = g(b, b') - l(b, b') = Z_V - Z_I - (B_V - k_V) \cdot (B_I - B_V)$. If the gain exceeds the loss, i.e. $I(b, b') > 0$, it is profitable to continue the algorithm with the current partition by recording this next block, while for $I(b, b') \leq 0$ the old partition (corresponding to border size $b'$) must be restored.
Moreover, the next block of size $\max(k_I, B_I - B + k)$ can then be recorded immediately (i.e. without any backtracking). Because the remaining $(i_V - 1) \times (i_V - 1)$ submatrix would be partitioned identically further on for both border sizes, $I(b, b')$ in fact indicates the difference between the number of elements in the nonzero blocks of the partitions that would result for border sizes $b$ and $b'$ respectively. Consequently, although the value of $I(b, b')$ is only constructed for certain $b' < b$, this new definition reflects the fact that properties like $I(b, b') = -I(b', b)$, $I(b, b) = 0$ and, more importantly, $I(b, b'') = I(b, b') + I(b', b'')$ hold. This latter property is exploited to consider all border sizes in a single pass over the skyline. If at a verification $I(b, b') > 0$ holds, pending verifications of $b'$ with respect to smaller border sizes $b''$ can be converted into verifications of $b$ with respect to $b''$, since $I(b, b'') > I(b', b'')$ holds, so that in any case it is more profitable to select $b$ than $b'$. If $I(b, b') \leq 0$ holds, this verification can be abandoned safely, since $I(b, b'') \leq I(b', b'')$ holds for all pending verifications of $b'$ with respect to $b''$. No improvement can be obtained from a border size $n - i + 1$ if at step $i$ a block is recorded. The following algorithm to construct a minimum partition into bordered block upper triangular form results, where the border size is initially zero, corresponding to an empty last diagonal block:

  Z:=0; B:=n; k:=1; b:=0; s:=0; p:=0;
  for i:=n, 1 step -1 do
    k:=max(k,lsky[i]+B-i+1);
    if (i = B-k+1) then                      /* Block Detected */
      while ((s > 0) cand (i = sB[s]-max(sk[s],sB[s]-B+k)+1)) do
        if ((Z-sZ[s]-(B-k)*(sB[s]-B)) > 0) then
          s:=s-1;                            /* improvement: keep current partition */
        else                                 /* restore old partition */
          k:=max(sk[s],sB[s]-B+k); Z:=sZ[s]; B:=sB[s];
          b:=sb[s]; p:=sp[s]; s:=s-1;
        endif
      enddo
      Z:=Z+k*(B-k); B:=B-k; k:=1;
      p:=p+1; part[p]:=i;
    else                                     /* No Block Detected */
      s:=s+1; sZ[s]:=Z; sB[s]:=B; sk[s]:=k; sb[s]:=b; sp[s]:=p;
      Z:=0; B:=i-1; k:=1; b:=n-i+1;
    endif
  endfor

Five auxiliary arrays are used in a stack-like manner with stack pointer s to save and restore states if necessary.
Array part with pointer p is used similarly to record the left upper corner of each diagonal block in the partition, so that an old partition can be discarded by restoring the corresponding pointer value. As soon as an improvement is obtained, s is simply decremented by 1, which effectively converts pending verifications into verifications of the current border size because of the formulation of the improvement function. Moreover, the old partition below row $n - b$ that still resides on the stack can be eliminated afterwards. Although a while-loop occurs inside the loop body, this algorithm runs in $O(n)$ time, because each border can only be moved to and from the auxiliary arrays once.

If the upper skyline or both skylines are considered, this algorithm determines the minimum partition into bordered block lower triangular form or doubly bordered block diagonal form respectively, because the same improvement function can be used (ignoring a factor of 2 in the latter case). For example, application of this algorithm to the matrix of figure 4 yields the bordered block forms shown in figure 7 with 113 (viz. 225 − 2·56), 157 (viz. 225 − 68) and 162 (viz. 176 − 14) elements, for $I(3,0) = 56$, $I(2,0) = 68$ and $I(3,0) = 14$ respectively.

Figure 7: Bordered Block Forms

The block upper triangular form, for example, is represented in array part after elimination of the partition below row 12 as shown below:

  12 11 10 8 6 5 4 2 1

Application of this algorithm to the matrix with 22 nonzero elements in figure 8 results in a minimum partition into bordered block upper triangular form with 166 elements in the nonzero blocks, because $I(1,0) = 21$. However, the nonzero blocks of the partitions corresponding to border sizes 2 and 3 also contain 166 elements, which illustrates that a matrix does not have a unique minimum partition into a particular bordered block form. Because a border is denied if the gain is equal to the loss, all ties (in this case $I(3,2) = 0$ and $I(2,1) = 0$, and thus $I(2,0) = 21$ and $I(3,0) = 21$) are resolved in favor of the smallest associated border size.

Figure 8: Different Minimum Partitions

3.4 Block Banded Form

Even if no entries occur within the borders, large diagonal blocks might result if the size of the next diagonal block is increased each time before the end of a scan has been reached. In figure 2, for example, $k$ is increased at step $i$, although $k = 3$ was valid before that step. This implies that many zeros are incorporated in the partition into block diagonal form. Therefore, construction of a partition into block banded form is also useful. In this section, we consider a form with identical semi-bandwidths.

Suppose that after step $i = U$, the next diagonal block would consist of a square with the left lower corner in row $L$. As long as during the scan $i = U, \ldots, L$ the inequality $L \leq i - \max(\lambda_i, \mu_i)$ holds, this next block remains valid. However, as soon as $i - \max(\lambda_i, \mu_i) < L$ holds at step $i$, unnecessary incorporation of zeros is avoided by recording a diagonal block that extends over rows $i+1, \ldots, U$ and has one off-diagonal block, while the new square is recorded for future computations. This process is repeated until the end of the first square is reached, thereby possibly further dividing the off-diagonal blocks, as is illustrated in figure 9.

Figure 9: Blocks in Block Banded Form

At the end of a square, the number of off-diagonal blocks for all diagonal blocks outside the next square is known, because these blocks cannot be divided any further. Subsequently, this next square is considered. If no next square has been recorded, the end of a real diagonal block has been reached, and the number of off-diagonal blocks for all current diagonal blocks is known. This process is repeated until the whole matrix has been partitioned, as expressed in the following algorithm:

  next:=n+1; num:=0; s:=0; sc:=1; p:=0; pv:=1;
  lc[0]:=n+1; rec[0]:=0;
  for i:=n, 1 step -1 do
    left:=i-max(lsky[i],usky[i]);
    if (left < next) then                    /* Initiate New Square */
      if (i+1 ≠ lc[sc-1]) then
        p:=p+1; part[p]:=i+1; num:=num+1;
      endif
      s:=s+1; rc[s]:=i; lc[s]:=left; rec[s]:=num;
      next:=left;
    endif
    if (i = lc[sc]) then                     /* End of Current Square */
      p:=p+1; part[p]:=i+1; num:=num+1;
      sc:=sc+1;
      row:=(sc <= s) ? rc[sc] : i-1;
      cnt:=num-rec[sc-1]-1;
      while (part[pv] > row) do
        sky[pv]:=cnt; pv:=pv+1; cnt:=cnt-1;
      enddo
    endif
  endfor

Arrays lc and rc are used to record the left and right corner of each square. Array part has the same function as in the algorithm of section 3.3. Parallel to this array is sky, which is used to record the number of off-diagonal blocks per diagonal block. Again, the nested while-loop does not affect the $O(n)$ time bound, since each diagonal block is only considered once. Finally, sc contains the index of the current square, s indicates the following squares, and array rec and variable num are used to determine the number of off-diagonal blocks at the end of each square. Application of this algorithm to the matrix of figure 1 and two other example matrices results in the partitions shown in figure 10.

Figure 10: Block Forms

Since the semi-bandwidths in the resulting block forms vary, a completion method is required to obtain a block banded form with one bandwidth. A possible completion method is based on the observation that in a block banded form, the semi-bandwidths of the left upper diagonal blocks form a sequence $0, 1, \ldots, B-1$, followed by an arbitrary number of semi-bandwidths $B$.
Within the partition this sequence can be 'simulated' by insertion of $B$ empty blocks, as is illustrated below for a block tridiagonal form where, although the semi-bandwidths are all 1, the semi-bandwidths appear to be 0, 1, 1, 0, 1:

$$\begin{pmatrix}
D & O &   &         &   &   \\
O & D & O &         &   &   \\
  & O & D &         &   &   \\
  &   &   & \square &   &   \\
  &   &   &         & D & O \\
  &   &   &         & O & D
\end{pmatrix}$$

Consequently, it is easy to construct a partition into block banded form, if the semi-bandwidths form such a feasible sequence, by insertion of empty blocks. For example, this method inserts a total of three empty blocks in the partition of the first matrix, while the partition of the second matrix remains unaffected. Block forms with respectively 65 and 117 elements result. For the third matrix, however, additional off-diagonal blocks must be inserted to obtain a feasible sequence (i.e. the semi-bandwidth sequence 0, 1, 1, 1, 2, 1, 2, 3, 2, 2, 2 is converted into 0, 1, 2, 3, 3, 3, 3, 3, 3, 3, 3). A partition into block banded form with 127 elements, shown in figure 11, results. If additional off-diagonal blocks are inserted, the resulting partition is not always minimal. For example, the matrix of figure 11 also has a partition into block tridiagonal form with only 99 elements. However, this approach provides an easy way to obtain a block banded form of a matrix.

Figure 11: Block Banded Form

3.5 Statistical Information

In this section, it is explained how some statistical information about the nonzero structure of a matrix can be determined efficiently. A more advanced tool kit for the statistical analysis of sparse matrices is described by Saad in [19]. Under the assumption that each nonzero element is stored exactly once in coordinate scheme, the average number of nonzero elements per row or column of an $m \times n$ sparse matrix $A$ with $\tau$ entries can be computed directly as $\tau/m$ and $\tau/n$ respectively, while the density of $A$ is determined as $\tau/(mn)$. However, more characteristics of the nonzero structure of a matrix can be determined if, during the pass over all nonzero elements, the total numbers of nonzero elements in each row, column and diagonal are accumulated in $O(n + m)$ storage in the arrays rowc[1..m], colc[1..n] and diagc[1-m..n-1], by incrementing rowc[i], colc[j] and diagc[j-i] for each entry $a_{ij}$. The resulting contents of diagc for the matrix of figure 1 are shown below for the indices -5 through 5:

  0 1 1 0 3 15 2 2 2 0 0

These arrays can be used to compute the density in a particular row $i$, column $j$, or diagonal $k$:

$$\frac{\mathrm{rowc}[i]}{n}, \qquad \frac{\mathrm{colc}[j]}{m}, \qquad \frac{\mathrm{diagc}[k]}{\min(m, n-k) - \max(1, 1-k) + 1}$$

Consequently, the number of diagonals in which elements occur can be determined by counting the number of elements of diagc that are nonzero, while the number of full diagonals is determined similarly by counting the number of diagonals with density 1. Note that if the definition of semi-bandwidths is extended to arbitrary matrices, this implies that diagc[k] = 0 for $k < -b_1$ and for $k > b_2$. A sketch of this accumulation is given below.
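The following sketch (Python, with dictionaries standing in for the arrays above; illustrative only) shows this one-pass accumulation and the resulting density computation per diagonal:

  # One-pass accumulation of row, column and diagonal counts for an
  # m x n matrix given as 1-based coordinate triples (i, j, value).
  from collections import Counter

  def accumulate_counts(entries):
      rowc, colc, diagc = Counter(), Counter(), Counter()
      for (i, j, _) in entries:
          rowc[i] += 1
          colc[j] += 1
          diagc[j - i] += 1      # diagonal index k = j - i
      return rowc, colc, diagc

  def diagonal_density(diagc, k, m, n):
      # number of positions (i, i+k) on diagonal k within the matrix
      length = min(m, n - k) - max(1, 1 - k) + 1
      return diagc[k] / length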
The percentage $p$ of nonzero elements that are confined to a band with given semi-bandwidths $B_1$ and $B_2$ is computed as follows:

$$p = \frac{1}{\tau} \sum_{k=-B_1}^{B_2} \mathrm{diagc}[k] \cdot 100\%$$

On the other hand, the smallest band with bandwidth $2B + 1$ in which $p\%$ of the nonzero elements is contained, is computed as follows:

$$B = \min\left\{\, b \geq 0 \;\middle|\; \sum_{k=-b}^{b} \mathrm{diagc}[k] \geq \left\lceil \frac{p \cdot \tau}{100} \right\rceil \,\right\}$$

For example, for the matrix of figure 1, $B = 3$ results for $p = 90\%$, so that the corresponding bandwidth is 7. Accumulating for $B_1 = 2$ and $B_2 = 2$ yields $p \approx 85\%$. Four diagonals are used in the matrix of figure 12, while three diagonals are full.

Figure 12: Statistical Example

3.6 Detection of Dense Submatrices

In [1, 26, 27, 28] the use of quad trees, well known from image processing and computer graphics [15, 21], for representing sparse matrices is proposed. A matrix of order $n$ is embedded in a $2^{\lceil \log_2 n \rceil} \times 2^{\lceil \log_2 n \rceil}$ matrix and padded appropriately. If the resulting matrix consists of only zeros or of a single scalar, it is represented by a nil pointer or by the value of that scalar respectively. Otherwise, it is represented by a tree with four subtrees (labeled 0, 1, 2 and 3) corresponding to the left upper, right upper, left lower and right lower quadrants. This data structure provides a uniform way of representing dense and sparse matrices, while it also simplifies the implementation of algorithms that are based on matrix partitioning. For example, the sum of two matrices is assembled recursively, which terminates if one of the operands is the nil pointer, resulting in the other operand, or if two scalars have to be added. An example of a quad tree representation of a sparse matrix is shown in figure 13.

Figure 13: Quad Tree Representation

A particular element is accessed by the tree walk defined by successively accessing the subtrees selected by the next pair of bits, starting with the most significant bits, in the binary representations of the row and column index, if these are numbered from 0 to $2^{\lceil \log_2 n \rceil} - 1$. For example, element $a_{21}$ is stored in row 01 and column 00. It is accessed along the boldfaced path in figure 13, using the labels 00 and 10.

The same idea can be used to detect occurrences of dense submatrices in an arbitrary sparse $m \times n$ matrix if another stop criterion is used. A submatrix for which the density exceeds a given threshold $\delta$ is considered to be a dense matrix, while a zero matrix is not considered any further. Otherwise, these criteria are applied recursively to the submatrices in its quadrants, as formulated in the following algorithm. The dense blocks of an $m \times n$ matrix $A$ are recorded by the call $P(\mathrm{Nonz}(A), 1, m, 1, n)$, where $\mathrm{Nonz}(A)$ indicates the index set of all nonzero elements:

  procedure P(I, il, ih, jl, jh)
  begin
    if (|I| > 0) then
      a := (ih - il + 1) * (jh - jl + 1);
      if (|I| / a < δ) then
        ic := ⌊(il + ih) / 2⌋;  jc := ⌊(jl + jh) / 2⌋;
        P({(i,j) ∈ I | (i ≤ ic) ∧ (j ≤ jc)}, il, ic, jl, jc);
        P({(i,j) ∈ I | (i ≤ ic) ∧ (jc < j)}, il, ic, jc+1, jh);
        P({(i,j) ∈ I | (ic < i) ∧ (j ≤ jc)}, ic+1, ih, jl, jc);
        P({(i,j) ∈ I | (ic < i) ∧ (jc < j)}, ic+1, ih, jc+1, jh);
      else
        record_block(il, ih, jl, jh);
      endif
    endif
  end

Application of this algorithm takes $O(n \log n)$ time for an $n \times n$ matrix. $O(\tau)$ storage suffices for the index set if at each call in-place sorting is applied to a global array that contains $I$, and pointers are passed as parameters instead. A transcription of this procedure into executable form is sketched below.
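The procedure P can be transcribed as follows (Python; record_block is replaced by appending to a result list, and the index set is copied rather than sorted in place, so this sketch trades the $O(\tau)$ storage bound for brevity):

  # Recursive detection of dense submatrices (section 3.6): split the
  # index set I over the four quadrants until a submatrix is empty or
  # its density reaches the threshold delta (assumes 0 < delta <= 1;
  # bounds are 1-based and inclusive).

  def dense_blocks(I, il, ih, jl, jh, delta, out):
      if not I:
          return                                   # zero submatrix
      area = (ih - il + 1) * (jh - jl + 1)
      if len(I) / area < delta:
          ic, jc = (il + ih) // 2, (jl + jh) // 2
          quadrants = ((il, ic, jl, jc), (il, ic, jc + 1, jh),
                       (ic + 1, ih, jl, jc), (ic + 1, ih, jc + 1, jh))
          for (a, b, c, d) in quadrants:
              sub = [(i, j) for (i, j) in I if a <= i <= b and c <= j <= d]
              dense_blocks(sub, a, b, c, d, delta, out)
      else:
          out.append((il, ih, jl, jh))             # record a dense block

  # usage: blocks = []; dense_blocks(list(nonz), 1, m, 1, n, 0.5, blocks)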
For example, respectively 54 and 235 dense submatrices result for a 32 x 32 matrix with 331 nonzero elements for $\delta = 0.5$ and $\delta = 1.0$, as illustrated in figure 14. This example clearly illustrates the shortcomings of this method. Although the submatrices reveal the nonzero structure of this matrix, too many submatrices might result because arbitrary divisions of the index set are made. Decreasing the threshold partially reduces this problem, but clusters of dense submatrices might still result in general.

Figure 14: Dense Submatrices ($\delta = 0.5$ and $\delta = 1.0$)

4 Implementation and Motivation

The algorithms presented in this paper have been implemented in the analysis engine of the prototype compiler MT1 [5]. First, the skylines of a sparse matrix are determined in a single pass over the file. In fact, only the index set is needed, since actual values are not used in the analysis. Subsequently, the semi-bandwidths are determined, and if $b_1 = 0$ and $b_2 = 0$, or $b_1 = 0$, or $b_2 = 0$ holds, a diagonal, upper triangular, or lower triangular form is recorded. Otherwise, the minimum partitions into bordered block forms and block banded form are constructed, and the partition with the fewest elements in the nonzero blocks is selected. If this number does not exceed a certain threshold, expressed as a fraction $f$ of the size of the matrix, the corresponding form is used as characterization of this matrix, while a general sparse matrix is assumed otherwise. Finally, some statistics are determined for the matrix and recorded for later use in the compiler. Consequently, analysis of an $n \times n$ matrix takes $O(\tau + n)$ time, or $O(\tau + n \log n)$ time if a search for dense submatrices is performed.

An overview of the results of the analysis is reported to the user. For example, for matrix impcol_e of the Harwell-Boeing Sparse Matrix Collection [10], of which the nonzero structure is illustrated in figure 15, the following output results:

  Size                : 225 x 225
  #Entries            : 1308
  Density             : 0.026
  Av.#Entries p. row  : 5.81
  Av.#Entries p. col. : 5.81
  Semi-Bandwidths     : 92 <-> 129
  (B90)               : 120
  (P22)               : 11.93
  #Used Diagonals     : 35
  #Full Diagonals     : 0
  Type                : Block Banded Form
  #Elements (D/L/U/B) : 50289 34552 49778 27931
  (bs/bs/bs/B)        : 56 0 142 13
  (#blocks)           : 2 15 12 49
  #Dense Blocks       : 556 (Density: 0.50)

Figure 16: Output for impcol_e

Entries B90 and P22 denote the bandwidth in which 90% of all nonzero elements is contained, and the percentage of nonzero elements within a band with bandwidth 5 ($B_1 = 2$ and $B_2 = 2$). The total number of elements in the computed block diagonal, block lower and upper triangular, and block banded forms (D/L/U/B) is given, together with the border size or semi-bandwidth respectively (bs/B), and the total number of diagonal blocks in the partitions, excluding the diagonal block in the border. The partition of the block form that is selected as characterization of the matrix is made available to the compiler in a data structure that is similar to the array part used in the construction algorithms.

Figure 15: impcol_e
Figure 17: jagmesh2

The results of the analysis of jagmesh2, illustrated in figure 17, are given in figure 18. The matrix is lower triangular, and no information about block forms is determined.
  Size                : 1009 x 1009
  #Entries            : 3937
  Density             : 0.004
  Av.#Entries p. row  : 3.90
  Av.#Entries p. col. : 3.90
  Semi-Bandwidths     : 991 <-> 0
  (B90)               : 101
  (P22)               : 50.39
  #Used Diagonals     : 244
  #Full Diagonals     : 1
  Type                : Lower Triangular Form
  #Dense Blocks       : 2035 (Density: 0.50)

Figure 18: Output for jagmesh2

Finally, we illustrate how nonzero structure information can be used by a sparse compiler, as described in [4, 6]. In case nonzero structures are static, i.e. fill-in does not occur, the only concern of the compiler is to select a data structure according to the nonzero structure characteristics. Since in some cases explicit storage of some zeros results in storage schemes with less overhead, a set $E_A$, for which $\mathrm{Nonz}(A) \subseteq E_A \subseteq \{1 \ldots m\} \times \{1 \ldots n\}$ holds, is used to denote the index set of explicitly stored elements, referred to as entries. However, for dynamic data structures the compiler must determine which properties are preserved during execution, so that only those properties are exploited.

The results of nonzero structure analysis can be given in terms of constraints of the form $a_{ij} \neq 0 \Rightarrow P(i,j)$, where $P(i,j)$ is some predicate. For example, for a lower triangular matrix this predicate has the form $j \leq i$. If a data structure is selected by the compiler such that $(i,j) \in E_A \Rightarrow P(i,j)$ still holds, the contraposition $\neg P(i,j) \Rightarrow (i,j) \notin E_A$ can be used to reduce the iteration set [6]. For example, since in the following dense fragment for matrix-vector multiplication, the condition $(i,j) \in E_A$ is associated with the assignment statement to indicate the instances that must be executed, the compiler can transform this fragment as shown below if $(i,j) \in E_A \Rightarrow (-4 \leq j - i \leq 5)$ holds:

  DO I = 1, M
    DO J = 1, N
      Y(I) = Y(I) + A(I,J) * X(J)
    ENDDO
  ENDDO

First, a unimodular transformation [2] is applied so that all elements $a_{ij}$ with $j - i = J$ are accessed in one iteration of the outermost loop:

  DO J = 1 - M, N - 1
    DO I = MAX(1,1-J), MIN(M,N-J)
      Y(I) = Y(I) + A(I,J+I) * X(J+I)
    ENDDO
  ENDDO

Subsequently, iteration space reduction is applied, based on the nonzero structure characteristics:

  DO J = -4, 5
    DO I = MAX(1,1-J), MIN(M,N-J)
      Y(I) = Y(I) + A(I,J+I) * X(J+I)
    ENDDO
  ENDDO

Statistical information about the density of the remaining diagonals that are accessed can be used subsequently to select sparse or dense storage for these diagonals, so that one of the following versions results (see [6] for further details of code generation):

sparse band:

  DO J = -4, 5
    DO AD = ALOW(J+5), AHIGH(J+5)
      I = AIND(AD)
      Y(I) = Y(I) + AVAL(AD) * X(J+I)
    ENDDO
  ENDDO

dense band:

  DO J = -4, 5
    DO I = MAX(1,1-J), MIN(M,N-J)
      Y(I) = Y(I) + ADNS(I-MAX(1,1-J)+1,6-J) * X(J+I)
    ENDDO
  ENDDO

In the second fragment, $E_A$ consists of the whole band, which is stored in a two-dimensional dense array ADNS, while in the first fragment, arrays ALOW and AHIGH are used to point to the start addresses of sparse vectors that are stored conforming to the diagonals in two parallel arrays AVAL and AIND. Since appropriate initialization code is generated by the compiler, the exact structure $E_A$ along the band will only be constructed at run-time.² Clearly, this code is more suited to this particular matrix than the code that would result for e.g. sparse row-wise storage of a general sparse matrix, shown below. The overhead in that code is clearly more intrusive, because many short sparse vectors are traversed:

  DO I = 1, M
    DO AD = ALOW(I), AHIGH(I)
      J = AIND(AD)
      Y(I) = Y(I) + AVAL(AD) * X(J)
    ENDDO
  ENDDO

Similarly, block forms, stored in arrays like part, can be used to transform fragments into the corresponding block algorithms, while statistical information is used to determine whether sparse or dense storage is used for the nonzero blocks of the partition. If the nonzero structure changes during execution, i.e. fill-in occurs, a dynamic data structure must be used to account for these changes.
In such cases, the compiler must determine which properties are preserved during program execution, which requires an advanced form of program analysis.

5 Conclusions

In this paper we have presented techniques to analyze the nonzero structures of sparse matrices in an efficient way. More research is necessary into techniques that use this information in the data structure selection and code generation phases of a sparse compiler, and into program analysis techniques that determine which properties are preserved in case fill-in occurs. The results of this research can also be used for the automatic generation of analysis code in sparse codes that select a particular routine optimized for a specific nonzero structure. Other issues, such as reordering of matrices in order to reduce the fill-in, must also be accounted for.

Acknowledgements

The authors would like to acknowledge helpful discussions with Arnold Niessen.

² This is a strong argument for user annotations of nonzero structures in case only certain properties of the nonzero structure are known in advance, but the exact nonzero structure is not available.

References

[1] S. Kamal Abdali and David S. Wise. Experiments with quadtree representation of matrices. In G. Goos and J. Hartmanis, editors, Lecture Notes in Computer Science 358, pages 96-108. Springer-Verlag, 1988.
[2] U. Banerjee. Loop Transformations for Restructuring Compilers: The Foundations. Kluwer Academic Publishers, Boston, 1993.
[3] Aart J.C. Bik and Harry A.G. Wijshoff. Advanced compiler optimizations for sparse computations. In Proceedings of Supercomputing 93, pages 430-439, 1993.
[4] Aart J.C. Bik and Harry A.G. Wijshoff. Compilation techniques for sparse matrix computations. In Proceedings of the International Conference on Supercomputing, pages 416-424, 1993.
[5] Aart J.C. Bik and Harry A.G. Wijshoff. MT1: A prototype restructuring compiler. Technical Report no. 93-32, Dept. of Computer Science, Leiden University, 1993.
[6] Aart J.C. Bik and Harry A.G. Wijshoff. On automatic data structure selection and code generation for sparse computations. In Proceedings of the Sixth International Workshop on Languages and Compilers for Parallel Computing, pages 57-75, 1993. Lecture Notes in Computer Science, No. 768.
[7] Elizabeth Cuthill. Several strategies for reducing the bandwidth of matrices. In Donald J. Rose and Ralph A. Willoughby, editors, Sparse Matrices and Their Applications, pages 157-166. Plenum Press, New York, 1972.
[8] I.S. Duff. A sparse future. In I.S. Duff, editor, Sparse Matrices and their Uses, pages 1-29. Academic Press, London, 1981.
[9] I.S. Duff, A.M. Erisman, and J.K. Reid. Direct Methods for Sparse Matrices. Oxford Science Publications, 1990.
[10] I.S. Duff, Roger G. Grimes, and John G. Lewis. Sparse matrix test problems. ACM Transactions on Mathematical Software, 15:1-14, 1989.
[11] I.S. Duff and J.K. Reid. Some design features of a sparse matrix code. ACM Transactions on Mathematical Software, pages 18-35, 1979.
[12] Alan George and Joseph W. Liu. The design of a user interface for a sparse matrix package. ACM Transactions on Mathematical Software, 5:139-162, 1979.
[13] Alan George and Joseph W. Liu. Computer Solution of Large Sparse Positive Definite Systems. Prentice-Hall Inc., 1981.
[14] Frank Harary. Sparse matrices and graph theory. In J.K. Reid, editor, Large Sparse Sets of Linear Equations, pages 139-150. Academic Press, 1971.
[15] Donald Hearn and M. Pauline Baker. Computer Graphics. Prentice-Hall International, 1986.
[16] A. Jennings. A compact storage scheme for the solution of symmetric linear simultaneous equations. The Computer Journal, 9:281-285, 1966.
[17] A. Jennings and A.D. Tuff. A direct method for the solution of large sparse symmetric simultaneous equations. In J.K. Reid, editor, Large Sparse Sets of Linear Equations, pages 97-104. Academic Press, 1971.
[18] William H. Press, Brian P. Flannery, Saul A. Teukolsky, and William T. Vetterling. Numerical Recipes. Cambridge University Press, Cambridge, 1986.
[19] Youcef Saad. SPARSKIT: a basic tool kit for sparse matrix computations. CSRD/RIACS, 1990.
[20] Youcef Saad and Harry A.G. Wijshoff. Spark: A benchmark package for sparse computations. In Proceedings of the 1990 International Conference on Supercomputing, pages 239-253, 1990.
[21] Hanan Samet. Connected component labeling using quadtrees. Journal of the ACM, 28:487-501, 1981.
[22] R. Tarjan. Depth-first search and linear graph algorithms. SIAM J. Computing, pages 146-160, 1972.
[23] Reginald P. Tewarson. Sorting and ordering sparse linear systems. In J.K. Reid, editor, Large Sparse Sets of Linear Equations, pages 151-167. Academic Press, 1971.
[24] Reginald P. Tewarson. Sparse Matrices. Academic Press, New York, 1973.
[25] M. Veldhorst. An Analysis of Sparse Matrix Storage Schemes. PhD thesis, Mathematisch Centrum, Amsterdam, 1982.
[26] David S. Wise. Representing matrices as quadtrees for parallel processing. Information Processing Letters, 20:195-199, 1985.
[27] David S. Wise. Parallel decomposition of matrix inversion using quadtrees. In Proceedings of the 1986 International Conference on Parallel Processing, pages 92-99, 1986.
[28] David S. Wise and John Franco. Costs of quadtree representation of nondense matrices. Journal of Parallel and Distributed Computing, 9:282-296, 1990.
[29] Zahari Zlatev. Computational Methods for General Sparse Matrices. Kluwer Academic Publishers, 1991.