BLOCK TRIANGULAR PRECONDITIONERS FOR M-MATRICES AND MARKOV CHAINS

MICHELE BENZI† AND BORA UÇAR‡

Abstract. We consider preconditioned Krylov subspace methods for solving large sparse linear systems under the assumption that the coefficient matrix is a (possibly singular) M-matrix. The matrices are partitioned into 2 × 2 block form using graph partitioning. Approximations to the Schur complement are used to produce various preconditioners of block triangular and block diagonal type. A few properties of the preconditioners are established, and extensive numerical experiments are used to illustrate the performance of the various preconditioners on singular linear systems arising from Markov modeling.

Key words. M-matrices, preconditioning, discrete Markov chains, iterative methods, graph partitioning

AMS subject classifications. 05C50, 60J10, 60J22, 65F10, 65F35, 65F50

1. Introduction. We consider the solution of (consistent) linear systems Ax = b, where A ∈ R^{N×N} is a large, sparse M-matrix. Our main interest is in singular, homogeneous (b = 0) systems arising from Markov chain modeling; our methods and results, however, are also applicable in the nonsingular case. The aim of this paper is to develop and investigate efficient preconditioners for Krylov subspace methods like GMRES [35] or BiCGStab [39]. Our main focus is on block preconditioners that are based on Schur complement approximations. Preconditioners of this type have proven very successful in the context of (generalized) saddle point problems arising in a variety of applications (see [5]); here we investigate such techniques for M-matrices and, in particular, for Markov chain problems. We assume that A is partitioned into the 2 × 2 block structure

(1.1)    A = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix},

where A_{11} and A_{22} are square matrices of size, respectively, n × n and m × m with N = n + m.
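For concreteness, the partitioning (1.1) and the associated Schur complement can be sketched numerically. The following is a minimal illustration on a hypothetical 5 × 5 nonsingular M-matrix (the data are made up for this sketch, not taken from the paper); for an M-matrix, the Schur complement of A_{11} is again an M-matrix, so its inverse is entrywise nonnegative.

```python
import numpy as np

# Hypothetical 5x5 nonsingular M-matrix, partitioned with n = 3, m = 2.
A = np.array([[ 3., -1.,  0., -1.,  0.],
              [-1.,  3., -1.,  0.,  0.],
              [ 0., -1.,  3.,  0., -1.],
              [-1.,  0.,  0.,  3., -1.],
              [ 0.,  0., -1., -1.,  3.]])
n = 3
A11, A12 = A[:n, :n], A[:n, n:]
A21, A22 = A[n:, :n], A[n:, n:]

# Schur complement of A11 in A; it inherits the M-matrix property,
# hence its inverse is (entrywise) nonnegative.
S = A22 - A21 @ np.linalg.solve(A11, A12)
print(np.all(np.linalg.inv(S) >= 0))
```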
Building on the given structure, we consider block upper triangular preconditioners of the form

(1.2)    M_{BP} = \begin{pmatrix} M_{11} & A_{12} \\ O & M_{22} \end{pmatrix},

where M_{11} is equal to, or an approximation of, A_{11}, and M_{22} is an approximation of the Schur complement of A_{11} in A, i.e., M_{22} ≈ A_{22} − A_{21}A_{11}^{-1}A_{12}. Obviously, block lower triangular preconditioners of the same type can also be constructed, as well as block diagonal ones. In any case, the main issue is the choice of the approximations M_{11} ≈ A_{11} and M_{22} ≈ A_{22} − A_{21}A_{11}^{-1}A_{12}.

In the context of saddle-point problems, the 2 × 2 block structure is mandated by the application, and the approximation of the Schur complement is usually defined and interpreted in terms of the application. In contrast, the block preconditioners investigated in this paper use well-known graph partitioning techniques to define the block structure, and make use of M-matrix properties to define approximations of the Schur complement. For certain approximations of S = A_{22} − A_{21}A_{11}^{-1}A_{12}, the block triangular preconditioners proposed in this work are akin to the product splitting preconditioners proposed in [10]. In fact, we show experimentally that in such cases these two types of preconditioners perform almost the same in terms of convergence rates for a large number of Markov chain problems from the MARCA collection [37]. However, the block triangular ones are cheaper to construct and apply than the product splitting ones. The paper is organized as follows.

† Department of Mathematics and Computer Science, Emory University, Atlanta, GA 30322, USA ([email protected]). The work of this author was supported in part by the National Science Foundation grant DMS-0511336.
‡ Department of Mathematics and Computer Science, Emory University, Atlanta, GA 30322, USA ([email protected]). The work of this author was partially supported by The Scientific and Technological Research Council of Turkey (TÜBİTAK) and by the University Research Committee of Emory University.
We briefly review background material on M-matrices, discrete Markov chains, graph partitioning, matrix splittings and product splitting preconditioners in Section 2. We introduce the block triangular preconditioners in Section 3. Section 4 contains material on partitioning the matrices into the 2 × 2 block structure (1.1) with an eye to future parallel implementations of the proposed preconditioners. In Section 5 we investigate the effect of the block triangular preconditioner under various partitionings, the properties of the 2 × 2 block structure imposed by the graph partitioning, and the performance of the proposed block triangular preconditioners relative to that of some other well-known preconditioners. We present our conclusions in Section 6.

2. Background. Here we borrow some material mainly from [9, 10, 11, 40] to provide the reader with a short summary of the concepts and results that are used in building the proposed preconditioners. We also give a brief description of graph partitioning by vertex separator, which can be used to obtain the 2 × 2 block structure (1.1).

2.1. Nonnegative matrices and M-matrices. A matrix A ∈ R^{N×N} is nonnegative (A ≥ O) if all of its entries are nonnegative, i.e., a_{ij} ≥ 0 for all 1 ≤ i, j ≤ N. Any matrix A with nonnegative diagonal entries and nonpositive off-diagonal entries can be written in the form

(2.1)    A = sI − B,    s > 0,    B ≥ O.

A matrix A of the form (2.1) with s ≥ ρ(B) is called an M-matrix; here, ρ(B) denotes the spectral radius of B. If s = ρ(B) then A is singular, otherwise it is nonsingular. If A is a nonsingular M-matrix, then A^{-1} ≥ O. If A is a singular, irreducible M-matrix, then rank(A) = N − 1 and each k × k principal square submatrix of A, where 1 ≤ k < N, is a nonsingular M-matrix. If, furthermore, A is the generator of an ergodic Markov chain (see below), then the m × m Schur complement S = A_{22} − A_{21}A_{11}^{-1}A_{12} of A_{11} in the partitioning (1.1) is a singular, irreducible M-matrix with rank m − 1 [9, 27].

2.2.
Stationary distribution of ergodic Markov chains. Discrete Markov chains with large state spaces arise in many applications, including for instance reliability modeling, queuing network analysis, web-based information retrieval, and computer system performance evaluation [38]. As is well known, the long-run behavior of an ergodic (irreducible) Markov chain is described by the stationary distribution vector of the corresponding matrix of transition probabilities. Recall that the stationary probability distribution vector of a finite, ergodic Markov chain with N × N transition probability matrix P is the unique 1 × N vector π which satisfies

(2.2)    π = πP,    π_i > 0 for i = 1, …, N,    \sum_{i=1}^{N} π_i = 1.

Here P is nonnegative (p_{ij} ≥ 0 for 1 ≤ i, j ≤ N), row-stochastic (\sum_{j=1}^{N} p_{ij} = 1 for 1 ≤ i ≤ N), and, due to the ergodicity assumption, irreducible. The matrix A = I − P^T, where I is the N × N identity matrix, is called the generator of the Markov process. The matrix A is a singular, irreducible M-matrix of rank N − 1. Letting x = π^T, and hence x^T = x^T P, the computation of the stationary vector reduces to finding a nontrivial solution to the homogeneous linear system

(2.3)    Ax = 0,

where x ∈ R^N, x_i > 0 for i = 1, …, N, and \sum_{i=1}^{N} x_i = 1. Perron–Frobenius theory [11] implies that such a vector exists and is unique. Due to the very large number N of states typical of many real-world applications, there has been increasing interest in recent years in developing parallel algorithms for Markov chain computations; see [4, 6, 12, 22, 25, 28]. Most of the attention so far has focused on (linear) stationary iterative methods, including block versions of Jacobi and Gauss–Seidel [12, 25, 28], and on (nonlinear) iterative aggregation/disaggregation schemes specifically tailored to stochastic matrices [12, 22]. In contrast, comparatively little work has been done with parallel preconditioned Krylov subspace methods.
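As a small worked example of (2.2)–(2.3), the stationary vector of a hypothetical 3-state ergodic chain (the transition matrix below is made up for illustration) can be computed by appending the normalization \sum_i x_i = 1 to the singular system Ax = 0 and solving in the least-squares sense:

```python
import numpy as np

# Hypothetical 3-state ergodic chain (not from the paper): row-stochastic P.
P = np.array([[0.5, 0.3, 0.2],
              [0.2, 0.6, 0.2],
              [0.1, 0.4, 0.5]])
N = P.shape[0]
A = np.eye(N) - P.T          # generator: singular, irreducible M-matrix

# rank(A) = N - 1, so append the normalization sum(x) = 1 and solve the
# (consistent) stacked system [A; 1^T] x = [0; 1] by least squares.
M = np.vstack([A, np.ones(N)])
rhs = np.concatenate([np.zeros(N), [1.0]])
x, *_ = np.linalg.lstsq(M, rhs, rcond=None)

print(np.allclose(A @ x, 0), np.isclose(x.sum(), 1.0), np.all(x > 0))
```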
The suitability of preconditioned Krylov subspace methods for solving Markov models has been demonstrated, e.g., in [30, 33], although no discussion of parallelization aspects was given there. Parallel computing aspects can be found in [6], where a symmetrizable stationary iteration (Cimmino's method) was accelerated using the Conjugate Gradient method on a Cray T3D, and in [25], where an out-of-core, parallel implementation of Conjugate Gradient Squared (with no preconditioning) was used to solve very large Markov models with up to 50 million states. In [9], parallel preconditioners based on sparse approximate pseudoinverses were used to speed up the convergence of BiCGStab, and favorable parallelization results were reported. In our recent work [10], we proposed product splitting preconditioners and discussed parallelization aspects of those preconditioners.

2.3. Stationary iterations and matrix splittings. Consider again the solution of a linear system of the form Ax = b. The representation A = B − C is called a splitting if B is nonsingular. A splitting gives rise to the stationary iterative method

(2.4)    x_{k+1} = T x_k + c,    k = 0, 1, …,

where T = B^{-1}C is called the iteration matrix, c = B^{-1}b, and x_0 ∈ R^N is a given initial guess. The splitting A = B − C is called regular if B^{-1} ≥ O and C ≥ O [40], weak regular if B^{-1} ≥ O and T ≥ O [11], an M-splitting if B is an M-matrix and C ≥ O [36], and weak nonnegative of the second kind if B^{-1} ≥ O and I − AB^{-1} ≥ O [41]. If A is a nonsingular M-matrix, any of the above conditions on the splitting A = B − C is sufficient to ensure the convergence of the stationary iteration (2.4) to the unique solution of Ax = b, for any choice of the initial guess x_0.

A related approach is defined by the alternating iterations

(2.5)    x_{k+1/2} = M_1^{-1} N_1 x_k + M_1^{-1} b,
         x_{k+1}   = M_2^{-1} N_2 x_{k+1/2} + M_2^{-1} b,    k = 0, 1, …,

where A = M_1 − N_1 = M_2 − N_2 are splittings of A, and x_0 is the initial guess.
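The stationary iteration (2.4) can be sketched with the Jacobi M-splitting B = diag(A) on a hypothetical 3 × 3 nonsingular M-matrix (data made up for this sketch). Since the splitting is regular, ρ(T) < 1 and the iterates converge to the solution:

```python
import numpy as np

# Hypothetical nonsingular M-matrix (strictly diagonally dominant).
A = np.array([[ 3., -1., -1.],
              [-1.,  3., -1.],
              [-1., -1.,  3.]])
b = np.array([1., 1., 1.])

# Jacobi (M-)splitting A = B - C with B = diag(A) and C = B - A >= O.
B = np.diag(np.diag(A))
C = B - A
T = np.linalg.solve(B, C)        # iteration matrix T = B^{-1} C
c = np.linalg.solve(B, b)

assert max(abs(np.linalg.eigvals(T))) < 1   # rho(T) < 1: convergent splitting

x = np.zeros(3)
for _ in range(200):             # iterate x_{k+1} = T x_k + c, Eq. (2.4)
    x = T @ x + c

print(np.allclose(A @ x, b))
```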
The convergence of alternating iterations was analyzed by Benzi and Szyld [7] under various assumptions on A, including the singular M-matrix case. They constructed a (possibly non-unique) splitting A = B − C associated with the alternating iterations (Eq. (10) in [7]) with

(2.6)    B^{-1} = M_2^{-1} (M_1 + M_2 − A) M_1^{-1}.

Clearly, the matrix M_1 + M_2 − A must be nonsingular for (2.6) to be well-defined.

2.4. Product splitting preconditioners. The product splitting preconditioners [10] are based on the alternating iterations (2.5) defined by two simple preconditioners. The first preconditioner can be taken to be the well-known block Jacobi preconditioner, i.e.,

(2.7)    M_{BJ} = \begin{pmatrix} A_{11} & O \\ O & A_{22} \end{pmatrix}.

Note that A_{11} and A_{22} are nonsingular M-matrices, and A = M_{BJ} − (M_{BJ} − A) is a regular splitting (in fact, an M-splitting). The second preconditioner is given by

(2.8)    M_{SC} = \begin{pmatrix} D_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix},

where D_{11} ≠ A_{11} stands for an approximation of A_{11}. In [10], we take D_{11} to be the diagonal matrix formed with the diagonal entries of A_{11}, i.e., D_{11} = diag(A_{11}). More generally, D_{11} is a matrix obtained from A_{11} by setting some off-diagonal entries to zero. Thus, D_{11} is a nonsingular M-matrix [40, Theorem 3.12]. The Schur complement matrix A_{22} − A_{21}D_{11}^{-1}A_{12} is therefore well-defined. As shown in [1], if A is a nonsingular M-matrix, so is A_{22} − A_{21}D_{11}^{-1}A_{12}. When A is a singular irreducible M-matrix, the same property holds under the (very mild) structural conditions given in [9, Theorem 3]. Therefore M_{SC} is a nonsingular M-matrix and A = M_{SC} − (M_{SC} − A) is an M-splitting (hence, a regular splitting). Since both M_{BJ} and M_{SC} define regular splittings, the product preconditioner M_{PS} given by

(2.9)    M_{PS}^{-1} = M_{SC}^{-1} (M_{BJ} + M_{SC} − A) M_{BJ}^{-1}

(see (2.6)) implicitly defines a weak regular splitting [7, Theorem 3.4]. Note that since the matrix

(2.10)    M_{BJ} + M_{SC} − A = \begin{pmatrix} D_{11} & O \\ O & A_{22} \end{pmatrix}

is invertible, M_{PS}^{-1} is well-defined, and so is the corresponding splitting of A.

2.5. Graph partitioning.
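The construction (2.7)–(2.9) can be checked numerically on a small example. Below is a sketch with a hypothetical 4 × 4 nonsingular M-matrix (not from the paper); since the product preconditioner defines a weak regular splitting of a nonsingular M-matrix, the induced iteration is convergent, i.e., ρ(I − M_{PS}^{-1}A) < 1:

```python
import numpy as np

# Hypothetical 4x4 nonsingular M-matrix, partitioned with n = m = 2.
A = np.array([[ 4., -1., -1., -1.],
              [-1.,  4., -1.,  0.],
              [-1., -1.,  4., -1.],
              [-1.,  0., -1.,  4.]])
n = 2
A11, A12 = A[:n, :n], A[:n, n:]
A21, A22 = A[n:, :n], A[n:, n:]
D11 = np.diag(np.diag(A11))

Z = np.zeros((n, n))
M_BJ = np.block([[A11, Z], [Z, A22]])        # block Jacobi, Eq. (2.7)
M_SC = np.block([[D11, A12], [A21, A22]])    # Eq. (2.8)

# Product splitting preconditioner, Eq. (2.9).
M_PS_inv = np.linalg.inv(M_SC) @ (M_BJ + M_SC - A) @ np.linalg.inv(M_BJ)

# Weak regular splitting of a nonsingular M-matrix: convergent.
rho = max(abs(np.linalg.eigvals(np.eye(4) - M_PS_inv @ A)))
print(rho < 1)
```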
Given an undirected graph G = (V, E), the problem of K-way graph partitioning by vertex separator (GPVS) asks for a set of vertices V_S of minimum size whose removal decomposes the graph into K disconnected subgraphs with balanced sizes. The problem is NP-hard [13]. Formally, Π = {V_1, …, V_K; V_S} is a K-way vertex partition by vertex separator V_S if the following conditions hold: V_k ⊂ V and V_k ≠ ∅ for 1 ≤ k ≤ K; V_k ∩ V_ℓ = ∅ for 1 ≤ k < ℓ ≤ K and V_k ∩ V_S = ∅ for 1 ≤ k ≤ K; ∪_k V_k ∪ V_S = V; there is no edge between vertices lying in two different parts V_k and V_ℓ for 1 ≤ k < ℓ ≤ K; W_max/W_avg ≤ ε, where W_max is the maximum part size (defined as max_k |V_k|), W_avg is the average part size (defined as (|V| − |V_S|)/K), and ε is a given maximum allowable imbalance ratio. See the works [2, 14, 19, 20, 23] for applications of GPVS and heuristics for GPVS. In the weighted GPVS problem, the vertices of the given undirected graph have weights. The weight of the separator or of a part is defined as the sum of the weights of the vertices it contains. The objective of the weighted GPVS problem is to minimize the weight of the separator while maintaining a balance criterion on the part weights.

3. Block triangular preconditioners. Let A ∈ R^{N×N} be an M-matrix, initially assumed to be nonsingular. Consider a 2 × 2 block structure as in (1.1). Recall from Section 2 that A_{11}, A_{22} and the Schur complement of A_{11} in A, i.e., S = A_{22} − A_{21}A_{11}^{-1}A_{12}, are all M-matrices. Note that if A_{11} is irreducible, then A_{11}^{-1} is dense and so is S; if A_{11} is reducible (e.g., block diagonal), then S will still be fairly dense unless the block sizes are very small. Consider the ideal preconditioner [21, 29]

(3.1)    P_0 = \begin{pmatrix} A_{11} & A_{12} \\ O & S \end{pmatrix}.

It follows from the block LU factorization

A = \begin{pmatrix} I & O \\ A_{21}A_{11}^{-1} & I \end{pmatrix} \begin{pmatrix} A_{11} & A_{12} \\ O & S \end{pmatrix} = L P_0

that AP_0^{-1} (and therefore P_0^{-1}A) has the eigenvalue λ = 1 of multiplicity N as the only point in the spectrum.
Also, (AP_0^{-1} − I)^2 = (L − I)^2 = O, which shows that the minimum polynomial of the preconditioned matrix has degree 2; hence, a minimal residual method (like GMRES) is guaranteed to converge in at most two steps. Unfortunately, preconditioning with P_0 is impractical. Here, similar to the situation for saddle point problems, we consider preconditioners obtained by different approximations of A_{11} and of S.

We start with the case where A_{11} is solved "exactly", while S is replaced by an approximation Ŝ = A_{22} − A_{21}M_{11}^{-1}A_{12}, where M_{11}^{-1} is an approximate inverse of A_{11}. If we impose the condition that

(3.2)    O ≤ M_{11}^{-1} ≤ A_{11}^{-1}    (entrywise),

then Ŝ is guaranteed to be an M-matrix, and invertible if A is invertible; see [1, pp. 263–265]. The condition (3.2) is satisfied (for example) when A_{11} = M_{11} − (M_{11} − A_{11}) is an M-splitting. In particular, if M_{11} = diag(A_{11}) (or any other approximation obtained by setting to zero some off-diagonal entries of A_{11}), then the condition (3.2) is satisfied.

Assume that linear systems with Ŝ are solved exactly. Then, the preconditioner is

(3.3)    P_1 = \begin{pmatrix} A_{11} & A_{12} \\ O & Ŝ \end{pmatrix}.

A simple calculation shows that

AP_1^{-1} = \begin{pmatrix} I & O \\ A_{21}A_{11}^{-1} & SŜ^{-1} \end{pmatrix}.

Hence, AP_1^{-1} (or P_1^{-1}A) has the eigenvalue λ = 1 of multiplicity at least n, while the remaining m eigenvalues are those of SŜ^{-1}. These m eigenvalues lie in the open disk centered at (1, 0) of radius 1. To show this, consider the following matrix splitting:

A = P_1 − (P_1 − A) = \begin{pmatrix} A_{11} & A_{12} \\ O & Ŝ \end{pmatrix} − \begin{pmatrix} O & O \\ −A_{21} & −A_{21}M_{11}^{-1}A_{12} \end{pmatrix}.

Note that this splitting is not a regular splitting, since −A_{21}M_{11}^{-1}A_{12} ≤ O. However, we have the following result.

Theorem 3.1. The splitting A = P_1 − (P_1 − A) is a weak nonnegative splitting of the second kind.

Proof. Due to the condition (3.2), P_1 is an M-matrix, and hence P_1^{-1} ≥ O. We only need to check the nonnegativity of

I − AP_1^{-1} = \begin{pmatrix} O & O \\ −A_{21}A_{11}^{-1} & I − SŜ^{-1} \end{pmatrix}.

Clearly −A_{21}A_{11}^{-1} ≥ O (since A_{21} ≤ O and A_{11}^{-1} ≥ O).
We need to show I − SŜ^{-1} ≥ O. We have

S = A_{22} − A_{21}A_{11}^{-1}A_{12}
  = A_{22} − A_{21}(A_{11}^{-1} − M_{11}^{-1} + M_{11}^{-1})A_{12}
  = A_{22} − A_{21}M_{11}^{-1}A_{12} − A_{21}(A_{11}^{-1} − M_{11}^{-1})A_{12}
  = Ŝ − R,

where R = A_{21}(A_{11}^{-1} − M_{11}^{-1})A_{12} is nonnegative, since A_{11}^{-1} ≥ M_{11}^{-1} by (3.2). Recall that Ŝ is an M-matrix; therefore I − SŜ^{-1} = (Ŝ − S)Ŝ^{-1} = RŜ^{-1} ≥ O.

Since A is an invertible M-matrix and A = P_1 − (P_1 − A) is a weak nonnegative splitting of the second kind, it is a convergent splitting: ρ(I − AP_1^{-1}) < 1 (see [41]). In other words, all the eigenvalues of AP_1^{-1} (or P_1^{-1}A) satisfy |λ − 1| < 1. For the special case M_{11}^{-1} = A_{11}^{-1}, we have P_1 = P_0 and σ(AP_1^{-1}) = {1}.

In practice, exact solves with A_{11} or Ŝ may be impractical or not advisable on the grounds of efficiency or numerical stability. This leads to the following three variants. The first one is

(3.4)    P_2 = \begin{pmatrix} Â_{11} & A_{12} \\ O & Ŝ \end{pmatrix},    Â_{11} ≈ A_{11},

where we have inexact solves with A_{11} and exact solves with Ŝ. We can assume that A_{11} = Â_{11} − (Â_{11} − A_{11}) is a regular splitting, or even an M-splitting. Another variant is

(3.5)    P_3 = \begin{pmatrix} A_{11} & A_{12} \\ O & S̃ \end{pmatrix},    S̃ ≈ Ŝ,

where we have exact solves with A_{11} and inexact solves with Ŝ. Here we can assume that Ŝ = S̃ − (S̃ − Ŝ) is a regular splitting; P_3 can be analyzed exactly like P_1. Finally, we consider

(3.6)    P_4 = \begin{pmatrix} Â_{11} & A_{12} \\ O & S̃ \end{pmatrix},    Â_{11} ≈ A_{11},    S̃ ≈ Ŝ,

where we can assume that both Â_{11} and S̃ induce regular splittings of A_{11} and Ŝ, respectively. In practice, we use P_4 with Â_{11} and S̃ coming from incomplete LU factorizations (ILU) of A_{11} and Ŝ. Note that, unlike position-based ILUs, threshold-based ILUs do not always lead to regular splittings (see Section 10.4.2 in [34]). Nevertheless, threshold-based ILUs often perform better than position-based ones, and the clustering of the eigenvalues of the preconditioned matrices around unity can easily be controlled by the drop tolerance.
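A construction in the spirit of P_4 can be sketched as follows on a hypothetical sparse M-matrix (a 1D Laplacian-like stencil, not one of the paper's test problems), with scipy's spilu standing in for the threshold-based ILU and M_{11} = diag(A_{11}) defining Ŝ; the preconditioner is applied via block back-substitution inside a simple preconditioned Richardson iteration (the paper accelerates with GMRES instead):

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

# Hypothetical sparse nonsingular M-matrix, partitioned with n = 80, m = 20.
N, n = 100, 80
A = sp.diags([-np.ones(N - 1), 2.1 * np.ones(N), -np.ones(N - 1)],
             [-1, 0, 1], format="csc")
A11, A12 = A[:n, :n].tocsc(), A[:n, n:].tocsc()
A21, A22 = A[n:, :n].tocsc(), A[n:, n:].tocsc()

# Schur complement approximation with M11 = diag(A11), then ILU factors.
M11_inv = sp.diags(1.0 / A11.diagonal())
S_hat = (A22 - A21 @ M11_inv @ A12).tocsc()
ilu_A11 = spla.spilu(A11, drop_tol=1e-3)   # plays the role of \hat{A}_{11}
ilu_S = spla.spilu(S_hat, drop_tol=1e-3)   # plays the role of \tilde{S}

def apply_P4_inv(r):
    """Block back-substitution with the upper triangular P4."""
    y2 = ilu_S.solve(r[n:])
    y1 = ilu_A11.solve(r[:n] - A12 @ y2)
    return np.concatenate([y1, y2])

b = np.ones(N)
x = np.zeros(N)
for _ in range(100):                       # preconditioned Richardson iteration
    x = x + apply_P4_inv(b - A @ x)
print(np.linalg.norm(A @ x - b) < 1e-8)
```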
When A is a singular, irreducible M-matrix, the splittings A = P_i − (P_i − A) satisfy ρ(I − AP_i^{-1}) = 1, and the spectrum of AP_i^{-1} will also include the eigenvalue λ = 0 (for i = 1, …, 4). We also note that λ = 0 is a simple eigenvalue, and the splitting A = P_i − (P_i − A) is semiconvergent (see [11]), if Ŝ − S contains no zero rows and Ŝ is irreducible. The zero eigenvalue, in any event, does not negatively affect the convergence of Krylov methods like GMRES; in practice, its only effect is to introduce a set of (Lebesgue) measure zero of "bad" initial guesses. These can easily be avoided, for instance by using a random initial guess.

4. Building the block triangular preconditioner.

4.1. Defining the blocks. The first requirement to be met in permuting the matrix A into the 2 × 2 block structure (1.1) is that the permutation should be symmetric. If A is nonsingular or obtained from the transition probability matrix of an irreducible Markov chain, then a symmetric permutation of the rows and columns of A guarantees that A_{11} and A_{22} are invertible M-matrices. The second requirement, as already discussed in Section 3, is to keep the order n of A_{11} as large as possible, to maximize the number of (near) unit eigenvalues of the preconditioned matrix AP_i^{-1} (for i = 1, 2, 3, 4). This requirement is important in order to have fast convergence of the Krylov subspace method, since m, the order of A_{22}, is an upper bound on the number of non-unit eigenvalues. Strictly speaking, this is true only if we assume exact solves in the application of the preconditioner, e.g., for AP_1^{-1}. In practice we will use inexact solves, and rather than having n eigenvalues (or more) exactly equal to 1, there will be a cluster of at least n eigenvalues near the point (1, 0). Still, we want this cluster to contain as many eigenvalues as possible. The second requirement is also desirable from the point of view of parallel implementation.
A possible parallelization approach would be to construct and solve the linear systems with the approximate Schur complement on a single processor, and to solve the linear systems with A_{11} in parallel. This approach has been taken previously in parallelizing applications of approximate inverse preconditioners in Markov chain analysis [9]. Another possible approach would be to parallelize the solution of the approximate Schur complement systems, either by allowing redundancies in the computations (each processor can form the whole system or a part of it) or by running a parallel solver on those systems. In both cases, the solution with the approximate Schur complement system constitutes a serial bottleneck and requires additional storage space.

The third requirement, not necessary for the convergence analysis but crucial for an efficient implementation, is that A_{11} should be block diagonal with subblocks of approximately equal size and density. Given K subblocks in the (1,1) block A_{11}, the K linear systems stemming from A_{11} can be solved independently. Meeting this requirement in a serial implementation enables the solution of very large systems, since the subblocks can be handled one at a time. In any admissible parallelization, each of these subblocks would most likely be assigned to a single processor. Therefore, maintaining balance on the sizes and the densities of the subblocks amounts to maintaining balance on the computational loads of the processors. Furthermore, it is desirable that the sizes of these subblocks be larger than the order m of A_{22}, if possible, for the reasons given for the second requirement.

Meeting all three of the above requirements simultaneously is a very challenging task. Therefore, as a pragmatic approach, we apply well-established heuristics addressing these requirements. As is common, we adopt the standard undirected graph model to represent a square matrix A ∈ R^{N×N}.
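As an illustration of this graph model and of the separator-based reordering, consider a toy symmetric sparsity pattern (made up for this sketch) with a hand-picked vertex partition standing in for the output of a GPVS tool such as Metis; ordering the separator vertices last produces the 2 × 2 structure (1.1), with zero coupling between distinct parts:

```python
import numpy as np

# Toy symmetric pattern on 7 vertices: V1 = {0,1,2}, V2 = {3,4,5}, VS = {6}.
edges = [(0, 1), (1, 2), (3, 4), (4, 5), (2, 6), (5, 6)]
N = 7
A = 4.0 * np.eye(N)
for i, j in edges:               # off-diagonal entries of an M-matrix are <= 0
    A[i, j] = A[j, i] = -1.0

V1, V2, VS = [0, 1, 2], [3, 4, 5], [6]
perm = V1 + V2 + VS              # separator vertices ordered last
B = A[np.ix_(perm, perm)]        # symmetric permutation into the form (1.1)

# No edges between distinct parts: the V1-V2 coupling blocks are zero.
print(np.all(B[:3, 3:6] == 0) and np.all(B[3:6, :3] == 0))
```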
The vertices of the graph G(A) = (V, E) correspond to the rows and columns of A, and the edges correspond to the nonzeros of A. The vertex v_i ∈ V represents the ith row and the ith column of A, and there exists an edge (v_i, v_j) ∈ E if a_{ij} and a_{ji} are nonzero. Consider a partitioning Π = {V_1, …, V_K; V_S} of G(A) with vertex separator V_S. The matrix A can be permuted into the 2 × 2 block structure (1.1) by permuting the rows and columns associated with the vertices in ∪_k V_k before the rows and columns associated with the vertices in V_S. That is, V_S defines the rows and columns of the (2,2) block A_{22}. Notice that the resulting permutation is symmetric, and hence the first requirement is met. Furthermore, since GPVS tries to minimize the size of the separator set V_S, it tries to minimize the order of the block A_{22}. Therefore, the permutation induced by Π meets the second requirement as well.

Consider the A_{11} block defined by the vertices in ∪_k V_k. The rows and columns associated with the vertices in V_k can be permuted before the rows and columns associated with the vertices in V_ℓ for 1 ≤ k < ℓ ≤ K. Such a permutation of A_{11} gives rise to diagonal subblocks. Since we have already constructed A_{22} using V_S, we end up with the following structure:

A = \begin{pmatrix}
A_1 &     &        &     & B_1 \\
    & A_2 &        &     & B_2 \\
    &     & \ddots &     & \vdots \\
    &     &        & A_K & B_K \\
C_1 & C_2 & \cdots & C_K & A_S
\end{pmatrix}.

The diagonal blocks A_1, …, A_K correspond to the vertex parts V_1, …, V_K, and therefore have approximately the same order. The off-diagonal blocks B_i, C_i represent the connections between the subgraphs and the separator, and the diagonal block A_S represents the connections between vertices in the separator set. Note that if A is nonsingular or obtained from the transition probability matrix of an irreducible Markov chain, then each block A_i is a nonsingular M-matrix. Thus, graph partitioning induces a reordering and block partitioning of the matrix A in the form (1.1) where A_{11} = diag(A_1, A_2, …
, A_K), A_{22} = A_S, and

A_{12} = [B_1^T B_2^T \cdots B_K^T]^T,    A_{21} = [C_1 C_2 \cdots C_K].

Therefore, the permutation induced by the GPVS partially addresses the third requirement. Note that the GPVS formulation ignores the requirement of balancing the densities of the diagonal subblocks of A_{11}. In fact, obtaining balance on the densities of the diagonal blocks is a complex partitioning requirement that cannot be met before a partitioning takes place (see [31] for a possible solution), even with a weighted GPVS formulation. If the matrix is structurally nonsymmetric, which is common for matrices arising from Markov chains, then A cannot be modeled with undirected graphs. In this case, a 2 × 2 block structure can be obtained by partitioning the graph of A + A^T.

4.2. Approximating the Schur complement. Recall from Section 3 that we are interested in approximations of the Schur complement of the form Ŝ = A_{22} − A_{21}M_{11}^{-1}A_{12}, where M_{11}^{-1} is an approximate inverse of A_{11} satisfying (3.2). We found the rather crude approximation M_{11} = diag(A_{11}) to be quite satisfactory. It is easy to invert and apply, and it also maintains a great deal of sparsity in Ŝ. Apart from this approximation, we also tried M_{11} = Â_{11} with Â_{11} = L̄_{11}Ū_{11} (an incomplete factorization of A_{11}), the same approximation used in the (1,1) blocks of P_2 and P_4. This works well in terms of reducing the number of iterations, and in all experiments it delivered the smallest number of iterations; however, this good rate of convergence came at the price of a prohibitive preconditioner construction overhead, and in a few cases it failed due to lack of space for Ŝ. In an attempt to find a middle ground, we tried to construct a block diagonal (up to a symmetric permutation) M_{11} with 1 × 1 and 2 × 2 blocks. We create a list of all pairs ⟨i, j⟩ with i < j such that a_{ij} or a_{ji} (or both) are nonzero. Each of these pairs is a candidate for a 2 × 2 block of the form \begin{pmatrix} a_{ii} & a_{ij} \\ a_{ji} & a_{jj} \end{pmatrix}.
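The greedy pairing heuristic just outlined can be sketched as follows. This is a hypothetical dense-matrix implementation for illustration only (a real code would work on the sparse structure of A_{11}); candidates are visited in descending order of the 2 × 2 determinants, and a pair is accepted only if neither index is already used:

```python
import numpy as np

def block_diagonal_pairs(A):
    """Greedily select disjoint 2x2 blocks: candidate pairs <i,j> (a_ij or a_ji
    nonzero) are visited in descending order of det = a_ii*a_jj - a_ij*a_ji."""
    n = A.shape[0]
    cands = [(i, j) for i in range(n) for j in range(i + 1, n)
             if A[i, j] != 0 or A[j, i] != 0]
    cands.sort(key=lambda p: A[p[0], p[0]] * A[p[1], p[1]]
                             - A[p[0], p[1]] * A[p[1], p[0]], reverse=True)
    used, blocks = set(), []
    for i, j in cands:
        if i not in used and j not in used:   # both indices still free
            used.update((i, j))
            blocks.append((i, j))
    singletons = [i for i in range(n) if i not in used]  # 1x1 blocks a_ii
    return blocks, singletons

# Hypothetical 4x4 example.
A = np.array([[ 4., -2.,  0., -1.],
              [-2.,  4., -1.,  0.],
              [ 0., -1.,  3., -1.],
              [-1.,  0., -1.,  3.]])
blocks, singles = block_diagonal_pairs(A)
print(blocks, singles)
```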
Then we visit the candidates in descending order of the determinants of the corresponding 2 × 2 blocks, where the determinant of the pair ⟨i, j⟩ is computed as a_{ii}a_{jj} − a_{ij}a_{ji}. At candidate ⟨i, j⟩, if neither i nor j is already included in a 2 × 2 block, we form the corresponding 2 × 2 block; otherwise, we proceed to the next candidate. Any row i (and hence column i) that is not included in a 2 × 2 block defines a 1 × 1 block a_{ii} in M_{11}. Note that this greedy algorithm tries to maximize the minimum of the determinants of the 2 × 2 blocks in M_{11}, and hence tries to yield a well-conditioned matrix M_{11}, without any attempt to maximize the number of such blocks. Similar algorithms were used in [17, 18] in preconditioning indefinite systems and were found to be useful. In our case, however, this choice of M_{11} did not improve upon the simple diagonal one.

5. Numerical experiments. In this section, we report on experimental results obtained with a Matlab 7.1.0 implementation on a 2.2 GHz dual core AMD Opteron Processor 875 with 4GB main memory. The main goal was to test the proposed block preconditioners and to compare them with a few other techniques, including the product splitting preconditioner [10]. The Krylov method used was GMRES [35]. For completeness we performed experiments with the stationary iterations corresponding to the various splittings (without GMRES acceleration), but they were found to converge too slowly to be competitive with preconditioned GMRES; therefore, we do not show these results. The various methods were tested on the generator matrices of some Markov chain models provided in the MARCA (MARkov Chain Analyzer) collection [37]. The

Table 5.1
Properties of the generator matrices.
Matrix    |    N     |   nnz    | nnz/row avg | row min/max | col min/max
mutex09   |   65535  |  1114079 |    17.0     |   16 / 17   |   16 / 17
mutex12   |  263950  |  4031310 |    15.3     |    9 / 21   |    9 / 21
ncd07     |   62196  |   420036 |     6.8     |    2 /  7   |    2 /  7
ncd10     |  176851  |  1207051 |     6.8     |    2 /  7   |    2 /  7
qnatm06   |   79220  |   533120 |     6.7     |    3 /  9   |    4 /  7
qnatm07   |  130068  |   875896 |     6.7     |    3 /  9   |    4 /  7
tcomm16   |   13671  |    67381 |     4.9     |    2 /  5   |    2 /  5
tcomm20   |   17081  |    84211 |     4.9     |    2 /  5   |    2 /  5
twod08    |   66177  |   263425 |     4.0     |    2 /  4   |    2 /  4
twod10    |  263169  |  1050625 |     4.0     |    2 /  4   |    2 /  4

models are discussed in [16, 30, 32] and have been used to compare different solution methods in [9, 10, 15] and elsewhere. These matrices are infinitesimal generators of continuous-time Markov chains, but can easily be converted (as we did) to the form A = I − P^T, with P row-stochastic, so that A corresponds to a discrete-time Markov chain, known as the embedded Markov chain; see [38, Chapter 1.4.3]. The preconditioning techniques described in this paper can be applied to either form of the generator matrix. We performed a large number of tests on numerous matrices; here we present a selection of results for a few test matrices, chosen to be representative of our overall findings. Table 5.1 displays the properties of the chosen test matrices. Each matrix is named by its family followed by its index within the family; for example, mutex09 refers to the 9th matrix in the mutex family. The matrices from the mutex and ncd families are structurally symmetric, the matrices from the qnatm and twod families are structurally nonsymmetric, and the matrices from the tcomm family are very close to being structurally symmetric: the nonzero patterns of tcomm20 and tcomm16 differ from the nonzero patterns of their transposes in only 60 locations. We compared the block triangular preconditioner (BT) with the block Jacobi (BJ), block Gauss–Seidel (BGS), block successive overrelaxation (BSOR), and product splitting (PS) preconditioners.
The block Gauss–Seidel (BGS) and the block successive overrelaxation (BSOR) preconditioners are given, respectively, by

M_{BGS} = \begin{pmatrix} A_{11} & A_{12} \\ O & A_{22} \end{pmatrix}  or  M_{BGS} = \begin{pmatrix} A_{11} & O \\ A_{21} & A_{22} \end{pmatrix},

M_{BSOR} = \begin{pmatrix} \frac{1}{ω}A_{11} & A_{12} \\ O & \frac{1}{ω}A_{22} \end{pmatrix}  or  M_{BSOR} = \begin{pmatrix} \frac{1}{ω}A_{11} & O \\ A_{21} & \frac{1}{ω}A_{22} \end{pmatrix}.

In agreement with previously reported results [15] on the MARCA collection, we observed that ω = 1.0 (which reduces BSOR to BGS), or a value very close to 1.0, is nearly always the best choice of the relaxation parameter for BSOR. We also observed that for most MARCA problems, the block lower triangular versions of the BGS and BSOR preconditioners are indistinguishable from the block upper triangular versions under either the storage or the performance criteria. Therefore, we report only the experiments with the upper triangular BGS preconditioner.

As discussed for instance in [21, 29], the ideal block triangular preconditioner P_0 has a natural block diagonal (BD) counterpart, namely

M_{BD} = \begin{pmatrix} A_{11} & O \\ O & S \end{pmatrix}.

As with P_0, exact solves with M_{BD} are not feasible. Therefore, we replace A_{11} and the Schur complement S with approximations, as in P_4, and compare the BT preconditioners with the corresponding inexact BD preconditioner as well.

Table 5.2
Properties of the partitions and the induced block structures, averaged over all matrices excluding the mutex matrices. The column "sep" refers to the number of rows in the second row block of A normalized by the number of rows in A, i.e., m/N; the column "part" refers to the average part size normalized by the number of rows in A, i.e., (n/K)/N; the columns A_{ij} for i, j = 1, 2 refer to the number of nonzeros in the (i,j) block normalized by the number of nonzeros in A, i.e., nnz(A_{ij})/nnz(A).

     |   Partition   |              Blocks
K    |  sep  | part  |  A11  |  A12  |  A21  |  A22
2    | 0.008 | 0.496 | 0.986 | 0.006 | 0.006 | 0.002
4    | 0.016 | 0.232 | 0.852 | 0.069 | 0.069 | 0.011
8    | 0.028 | 0.113 | 0.808 | 0.088 | 0.088 | 0.016
16   | 0.045 | 0.055 | 0.770 | 0.105 | 0.105 | 0.020
32   | 0.068 | 0.027 | 0.731 | 0.122 | 0.122 | 0.025

5.1. Properties of the preconditioners.
We partitioned the matrices into the 2 × 2 block structure (1.1) using Metis [24]. In all cases, the partitioning time is negligible compared to the solve time. For the structurally symmetric mutex and ncd matrices we used the graph of A, and for the other matrices we used the graph of A + A^T, as mentioned in Section 4. As discussed in Section 4, we maintain balance on the sizes, rather than the densities, of the subblocks of A_{11}. We conducted experiments with K = 2, 4, 8, 16, and 32 subblocks in the (1,1) block. For each value of K, the K-way partitioning of a test matrix constitutes a partitioning instance. Since Metis incorporates randomized algorithms, it was run 10 times starting from different random seeds for each partitioning instance, with a maximum allowable imbalance ratio of 25%. In all partitioning instances except those involving the mutex matrices, the imbalance ratios among the parts were within the specified limit. The following tables give the averages of these 10 different runs for each partitioning instance. Only the mutex matrices have a large number of rows in the second row block, i.e., a large separator, among all partitioning instances. For these matrices, the average part size is larger than the size of the separator set only in the K = 2-way partitioning. For the other matrices, the average part size is larger than the size of the separator in all partitioning instances with K = 2, 4, 8, and 16, except in the K = 16-way partitionings of ncd07, ncd10, and qnatm06, giving the average figures in Table 5.2.

We conducted experiments with the block triangular preconditioners P_1 through P_4. A somewhat surprising finding is that the variants requiring exact solves with A_{11}, i.e., P_1 and P_3, besides being rather expensive (as expected), are prone to numerical instabilities. By this we mean that at least one block A_k in A_{11} was found to have an upper triangular factor U with a huge condition number, causing the convergence of GMRES to deteriorate.
(The unit lower triangular factor is always well-conditioned for the problems considered here.) We encountered this difficulty with all test matrices and for all values of K, except for the qnatm matrices. This phenomenon can be explained by noting that the diagonal blocks Ak, while guaranteed to be nonsingular, are often close to singular, in particular when A is nearly reducible (see [27]). Hence, the corresponding upper triangular factor must have an exceedingly small pivot, and consequently its inverse must have a huge norm. This problem disappears when the complete factorizations of the diagonal blocks Ak are replaced by incomplete ones. This is not surprising: it has been shown in [26] that in an incomplete factorization of an M-matrix, the pivots (i.e., the diagonal entries of the upper triangular factor) cannot be smaller than the corresponding pivots of the complete factorization, and in practice they always increase. As a result, the condition number of the incomplete upper triangular factor is several orders of magnitude smaller than that of the complete factor, and no instabilities arise. Therefore, we have the somewhat unexpected conclusion that in practice inexact solves result in greater robustness and faster convergence than exact ones. For these reasons we do not present results with P1 and P3. In addition, P2 has a large construction overhead due to the exact factorization of Ŝ and did not perform better than P4 in reducing the number of iterations. Therefore, in the following we present results only with P4. We tried all three approximations of the Schur complement discussed in Section 4.2 with the block triangular preconditioner P4. The one with block diagonal M11 with 1 × 1 and 2 × 2 blocks did not improve upon the one with simple diagonal M11. Therefore, we omit the results with this choice of M11. The one with M11 = Â11 has very large memory requirements; for example, it could not be run with the mutex matrices. Therefore, it is not recommended as a general purpose solution.
However, it is worth presenting because it gives the smallest number of iterations and behaves fairly robustly with respect to increasing K (see Table 5.3). In the following discussion, BT thus refers to the block triangular preconditioner P4:

    P4 = [ Â11  A12 ]      Â11 ≈ A11,    S̃ ≈ Ŝ = A22 − A21 M11^(−1) A12,
         [  O    S̃ ],

where M11 = diag(A11). Each subblock Ak, for k = 1, ..., K, of A11 and the (2,2) block A22 in the BJ and BGS preconditioners were ordered using symmetric reverse Cuthill–McKee (for better numerical properties; see [8]) and factored using the incomplete LU factorization ILUTH with threshold parameter τ = 0.01 for the qnatm matrices and τ = 0.001 for the other matrices. The threshold of 0.001 was too small for the qnatm matrices: the resulting preconditioners had 8 times more nonzeros than the generator matrices. We observed that reordering the approximate Schur complement matrix causes ILUTH to take too much time (presumably due to the need for pivoting), while reducing the number of nonzeros in the factors only by a small amount. Therefore, the approximate Schur complement matrices were factored using ILUTH without any prior ordering. The densities of the preconditioners, i.e., the number of nonzeros in the matrices appearing in the preconditioner solve phase divided by the number of nonzeros in the corresponding generator matrices, are given in Table 5.4. As seen from the table, the memory requirements of the BT preconditioners are less than those of the PS preconditioners and comparable to those of the three, relatively simple, 'classical' preconditioners.

Table 5.3
Data pertaining to the P4 preconditioner with S̃ ≈ Ŝ = A22 − A21 (L̄11 Ū11)^(−1) A12. The column "its" refers to the average number of iterations, "dens" refers to the total number of nonzeros in the P4 matrices normalized by the number of nonzeros in the generator matrix, "prec" refers to the time spent in constructing the preconditioners, and "solve" refers to the time spent during the GMRES iterations.

    Matrix    K     its   dens     prec    solve
    ncd07     2     11    1.11    29.19     1.11
              4     12    1.23    14.86     1.16
              8     12    1.20    16.87     1.25
              16    13    1.18    20.26     1.35
              32    13    1.18    20.50     1.46
    ncd10     2     11    1.10   187.40     3.68
              4     11    1.13   120.63     4.29
              8     12    1.12    99.12     4.14
              16    13    1.14    81.39     4.31
              32    13    1.12    95.72     4.51
    qnatm06   2     35    2.97    45.86     6.07
              4     35    2.65    43.09     6.21
              8     37    2.61    37.19     6.57
              16    38    2.49    31.51     6.85
              32    40    2.48    31.73     7.66
    qnatm07   2     39    2.99    86.45    11.66
              4     40    2.58    96.43    12.50
              8     42    2.59    84.09    12.88
              16    42    2.52    69.27    13.05
              32    45    2.49    73.82    14.32
    tcomm16   2     14    3.03     0.38     0.24
              4     16    2.47     0.60     0.24
              8     19    2.13     0.84     0.29
              16    26    2.01     1.27     0.40
              32    25    1.93     1.64     0.40
    tcomm20   2     16    2.93     0.48     0.34
              4     19    2.24     0.73     0.35
              8     22    2.17     1.02     0.42
              16    30    2.09     1.51     0.61
              32    28    1.99     2.23     0.59
    twod08    2     10    2.24     7.20     0.91
              4     12    2.30    12.18     1.03
              8     17    3.05    17.17     1.56
              16    18    2.77    18.68     1.63
              32    18    2.87    26.38     1.68
    twod10    2     17    4.64   156.29     8.71
              4     18    4.63   163.76     8.60
              8     20    5.11   190.06     9.51
              16    21    5.01   199.50    10.30
              32    23    4.37   229.49    10.50

5.2. Performance comparisons. The underlying Krylov subspace method was GMRES, restarted every 50 iterations (if needed). Right preconditioning was used in all the tests. The stopping criterion was

    ‖r_k‖_2 / ‖r_0‖_2 < 10^(−10),

where r_k is the residual at the kth iteration and r_0 is the initial residual. In all cases, the initial solution is set to the one-vector normalized to have an ℓ1-norm of 1.0. We allowed at most 250 iterations, i.e., 5 restarts, for the GMRES iteration. Therefore, the number 250 in the following tables marks the cases in which GMRES failed to deliver solutions with the prescribed accuracy within 250 iterations. Iteration counts for GMRES(50) on the various test matrices with no permutation or preconditioning (GMRES) and averages of the iteration counts with the preconditioners BJ, BD, BGS, PS, and BT are given in Table 5.5.
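As a concrete illustration of the construction just described, the following sketch (not the authors' Matlab code; the split point and all names are illustrative) builds S̃ = A22 − A21 diag(A11)^(−1) A12, factors the blocks by ILU, and wraps the block back-substitution as a preconditioner operator for GMRES. For brevity a single ILU of A11 stands in for the K per-block ILUTH factorizations used in the paper:

```python
# Sketch of the BT (P4) preconditioner: block upper triangular with an
# ILU of A11 and an ILU of the approximate Schur complement S~.
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

def bt_preconditioner(A, n):
    """LinearOperator applying M_BT^{-1} for the 2x2 split of A at index n."""
    A11, A12 = A[:n, :n].tocsc(), A[:n, n:].tocsc()
    A21, A22 = A[n:, :n].tocsc(), A[n:, n:].tocsc()
    Dinv = sp.diags(1.0 / A11.diagonal())          # M11 = diag(A11)
    S = (A22 - A21 @ Dinv @ A12).tocsc()           # S~ = A22 - A21 M11^{-1} A12
    ilu11, iluS = spla.spilu(A11), spla.spilu(S)   # inexact (ILU) block solves

    def solve(r):
        y2 = iluS.solve(r[n:])                     # y2 = S~^{-1} r2
        y1 = ilu11.solve(r[:n] - A12 @ y2)         # y1 = A11^{-1} (r1 - A12 y2)
        return np.concatenate([y1, y2])

    return spla.LinearOperator(A.shape, matvec=solve)
```

On a nonsingular, diagonally dominant test matrix this operator can be passed directly as the `M` argument of `scipy.sparse.linalg.gmres`; for the singular Markov chain systems the same operator would be used with b = 0 and a normalized nonzero initial guess.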
Note that without preconditioning, GMRES(50) converges only for the mutex matrices. In all instances, preconditioned GMRES(50) with the PS and the proposed BT preconditioners converged under the stopping criterion given above.

Table 5.4
The densities of the preconditioners, i.e., the total number of nonzeros in the blocks appearing in the preconditioning matrices divided by the number of nonzeros in the corresponding generator matrices.

    Matrix    K     BJ     BD     BGS    PS     BT
    mutex09   2    1.30   1.52   1.48   2.00   1.70
              4    0.78   1.24   1.06   1.99   1.52
              8    0.51   1.18   0.86   2.07   1.52
              16   0.36   1.14   0.73   2.11   1.51
              32   0.30   1.16   0.69   2.17   1.55
    mutex12   2    1.87   2.06   2.06   2.53   2.25
              4    0.83   1.21   1.14   1.92   1.51
              8    0.45   0.96   0.82   1.82   1.33
              16   0.29   0.90   0.70   1.83   1.31
              32   0.23   0.89   0.66   1.85   1.32
    ncd07     2    1.09   1.10   1.11   1.28   1.11
              4    1.18   1.19   1.21   1.40   1.21
              8    1.11   1.13   1.16   1.38   1.17
              16   1.05   1.07   1.11   1.37   1.13
              32   0.99   1.03   1.08   1.38   1.12
    ncd10     2    1.08   1.08   1.09   1.26   1.09
              4    1.09   1.10   1.11   1.30   1.12
              8    1.05   1.06   1.09   1.29   1.10
              16   1.04   1.06   1.09   1.33   1.11
              32   0.99   1.01   1.06   1.32   1.08
    qnatm06   2    2.94   2.95   2.95   3.13   2.96
              4    2.58   2.61   2.60   2.81   2.63
              8    2.46   2.52   2.50   2.76   2.55
              16   2.25   2.33   2.30   2.63   2.39
              32   2.10   2.23   2.18   2.59   2.31
    qnatm07   2    2.97   2.98   2.98   3.15   2.98
              4    2.52   2.54   2.53   2.73   2.56
              8    2.48   2.52   2.51   2.75   2.55
              16   2.34   2.40   2.38   2.67   2.44
              32   2.20   2.30   2.27   2.62   2.36
    tcomm16   2    3.03   3.03   3.03   3.24   3.03
              4    2.45   2.45   2.46   2.67   2.46
              8    2.09   2.09   2.10   2.33   2.10
              16   1.91   1.93   1.94   2.20   1.95
              32   1.76   1.78   1.80   2.10   1.82
    tcomm20   2    2.93   2.93   2.93   3.14   2.93
              4    2.22   2.23   2.23   2.44   2.23
              8    2.13   2.14   2.14   2.37   2.15
              16   2.01   2.02   2.03   2.28   2.04
              32   1.83   1.84   1.86   2.16   1.88
    twod08    2    2.24   2.24   2.24   2.49   2.24
              4    2.26   2.26   2.27   2.53   2.27
              8    2.98   2.98   2.99   3.26   2.99
              16   2.65   2.66   2.66   2.95   2.67
              32   2.70   2.71   2.72   3.03   2.73
    twod10    2    4.64   4.64   4.64   4.89   4.64
              4    4.60   4.60   4.61   4.86   4.61
              8    5.06   5.06   5.06   5.33   5.07
              16   4.92   4.92   4.92   5.20   4.93
              32   4.24   4.25   4.25   4.53   4.26
Preconditioned GMRES(50) with BJ, BD, and BGS did not converge in, respectively, 103, 141, and 8 of the 500 instances. The largest of the ℓ1-norms of the residuals corresponding to the approximate solutions returned by preconditioned GMRES(50) were less than 1.1e-11 for the converged instances with the BGS, PS, and BT preconditioners; with the BJ and BD preconditioners, the largest such ℓ1-norm was 4.6e-10. As seen from Table 5.5, the PS and BT preconditioners perform consistently better than the BJ, BD, and BGS preconditioners. The proposed BT preconditioner performs almost as well as the PS preconditioner, and it outperforms the BGS one by a factor of two, on average, in terms of iteration counts, at the expense of a slight increase in memory requirements (see Table 5.4). We close this section by discussing running times for GMRES. The running times are measured using Matlab's cputime command, and these measurements are given in Tables 5.6 and 5.7 in units of seconds. In Table 5.6, the total time for GMRES without preconditioning is the total time spent in performing the GMRES iterations. In both tables, the total time for preconditioned GMRES is split into the preconditioner construction and solve phases. We first discuss the case of the mutex matrices, since all the preconditioners lead to convergence for these matrices. In all partitioning instances of the mutex matrices, the solve phase time with the proposed BT preconditioner is less than those with the other preconditioners. On the other hand, the total running time of the BGS preconditioner is always the minimum, except in the K = 32-way partitionings of the two mutex matrices, in which case BJ gives the minimum total running time.
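The inexact preconditioners compared above differ mainly in which matrix occupies the (2,2) position of the block solve. A minimal dense sketch of the respective solve phases may clarify this; the callables `solve11`, `solve22`, and `solveS` are illustrative stand-ins for the ILU-based block solves used in the experiments:

```python
# Solve phases of the BD, BGS, and BT preconditioners on a 2x2 split at
# index n.  solve11/solve22/solveS approximate the inverses of A11, A22,
# and the Schur complement approximation S~, respectively.
import numpy as np

def apply_bd(r, n, solve11, solveS):
    # Block diagonal: two independent block solves.
    return np.concatenate([solve11(r[:n]), solveS(r[n:])])

def apply_bgs(r, n, solve11, solve22, A12):
    # Block Gauss-Seidel (upper triangular): A22 itself in the (2,2) block.
    y2 = solve22(r[n:])
    return np.concatenate([solve11(r[:n] - A12 @ y2), y2])

def apply_bt(r, n, solve11, solveS, A12):
    # Block triangular (P4): the same sweep, but with S~ in the (2,2) block.
    y2 = solveS(r[n:])
    return np.concatenate([solve11(r[:n] - A12 @ y2), y2])
```

With exact block solves, `apply_bt` inverts the block upper triangular matrix [A11, A12; O, S̃] exactly; the inexact variants replace both solves by ILUTH.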
We observe that the mutex matrices are the worst case for the construction of the BD, PS, and BT preconditioners, since the size of the separator set is very large already for K = 2, so that forming the Schur complement is very time-consuming. Note that PS and BT also suffer from the cost of copying the large A12 blocks (and A21 in PS). Table 5.7 contains the running times of preconditioned GMRES with the BGS, PS, and BT preconditioners for the larger matrices in each matrix family. As seen from the table, for these matrices (whose partitions have small separators) the preconditioner construction phase times are always smaller than the solve phase times with all preconditioners. Although the construction phase times with the BGS preconditioner are always smaller than those with the BT preconditioner, the BT preconditioner is faster than BGS in all instances. Note that the ith iteration of GMRES after a restart requires i inner product computations with vectors of length N [3]. Therefore, the performance gains in the solve phase with the BT preconditioners are due not only to the savings in preconditioner solves and matrix-vector multiplies, but also to the savings in the inner product computations. Recall from Table 5.5 that the numbers of iterations with the BT and PS preconditioners are almost the same. However, the BT preconditioner is always faster than the PS preconditioner. 6. Conclusions. We have investigated block triangular preconditioning strategies for M-matrices, with an emphasis on the singular systems arising from Markov chain modeling. These preconditioners require a 2 × 2 block partitioning and suitable Schur complement approximations. The resulting preconditioner can be applied exactly or inexactly, leading to several possible variants, including block diagonal ones. Numerical experiments with preconditioned GMRES using test matrices from MARCA allowed us to identify the most efficient variant and to rule out some of the others.
Block triangular preconditioning with inexact solves obtained by means of (local) ILUTH factorizations was found to be superior to several other approaches, including block diagonal, block Gauss–Seidel, and product splitting preconditioning. Furthermore, the numerical experiments indicate that, for most problems, the number of iterations grows slowly with the number of parts (subdomains). This suggests that the inexact block triangular preconditioner should perform very well in a parallel implementation.

Table 5.5
Average number of iterations to reduce the ℓ2-norm of the initial residual by ten orders of magnitude using GMRES(50) with at most 5 restarts. The number 250 means that the method did not converge in 250 iterations.

    Matrix   GMRES    K     BJ    BD    BGS   PS    BT
    mutex09    97     2     24    24    13     9     9
                      4     27    27    14     9     9
                      8     28    28    14     8     9
                      16    29    30    15     9     9
                      32    29    30    15     9     9
    mutex12    91     2     26    27    14     8    10
                      4     28    29    15     8     9
                      8     28    29    15     8     9
                      16    29    31    16     7     9
                      32    30    31    16     7     8
    ncd07     250     2     40    68    23    12    12
                      4    212   250    87    13    13
                      8    222   250   101    15    15
                      16   243   250   129    16    16
                      32   250   250   192    18    18
    ncd10     250     2    122   168    57    14    15
                      4    160   250    47    15    15
                      8    244   250   105    15    15
                      16   250   250   145    18    17
                      32   250   250   215    19    19
    qnatm06   250     2     60    60    38    36    36
                      4     78    83    41    36    36
                      8    104   119    46    39    39
                      16   177   181    55    43    43
                      32   223   250    83    46    47
    qnatm07   250     2     51    56    40    39    39
                      4     83    91    44    41    41
                      8    110   138    51    44    44
                      16   163   186    74    47    47
                      32   232   246    92    55    56
    tcomm16   250     2     30    30    18    16    16
                      4     48    51    27    21    21
                      8    112   131    39    29    30
                      16   250   250    85    42    42
                      32   250   250   105    46    46
    tcomm20   250     2     33    34    20    18    18
                      4     55    61    29    23    23
                      8    132   157    43    32    32
                      16   247   250    94    44    44
                      32   250   250   221    96    98
    twod08    250     2     28    27    15    10    10
                      4     41    40    22    13    13
                      8     49    48    25    18    18
                      16    59    61    29    24    24
                      32    92    93    36    29    30
    twod10    250     2     38    38    21    18    18
                      4     47    47    26    21    21
                      8     58    58    30    24    25
                      16    77    78    35    28    29
                      32    94    94    39    32    32

Table 5.6
Running times (in seconds) for GMRES(50) without preconditioning (GMRES column) and with BJ, BD, BGS, PS, and BT preconditioning for the mutex matrices.

              GMRES          Preconditioner construction          Solve
    Matrix    solve    K      BJ     BD     BGS    PS     BT       BJ     BD     BGS    PS     BT
    mutex09    7.0     2     3.55   4.28   3.60   4.53   4.30     3.55   3.94   1.82   1.96   1.42
                       4     1.39   2.84   1.52   3.33   2.95     3.45   4.23   1.82   2.37   1.51
                       8     0.78   2.92   1.04   3.67   3.16     3.19   4.45   1.99   2.86   1.72
                       16    0.51   3.15   1.06   4.44   3.62     3.22   4.80   2.60   3.94   2.07
                       32    0.48   3.28   1.45   5.47   4.16     2.95   4.83   3.63   5.54   2.82
    mutex12   27.2     2    17.75  20.61  17.95  21.75  20.72    17.69  19.42   9.18   8.07   7.23
                       4     5.06  10.84   5.49  12.63  11.10    14.76  18.59   8.05   8.29   6.21
                       8     2.44  11.49   3.37  14.58  12.36    12.97  17.87   7.79   9.98   6.46
                       16    1.74  11.05   3.42  16.11  12.69    11.70  18.89   9.47  12.13   7.47
                       32    1.59  11.11   4.62  19.81  13.97    11.42  18.83  13.35  17.92   9.13

Table 5.7
Running times (in seconds) for GMRES(50) with the BGS, PS, and BT preconditioners for the larger matrices in each family.

                     Preconditioner construction       Solve
    Matrix     K      BGS    PS     BT                  BGS     PS      BT
    ncd10      2     1.18   1.93   1.68               19.40    5.39    4.13
               4     1.04   1.99   1.59               15.85    6.16    4.01
               8     1.01   2.31   1.58               33.95    7.74    3.85
               16    1.01   3.04   1.67               45.84   12.34    4.30
               32    1.09   4.55   2.00               71.01   20.82    5.03
    qnatm07    2     3.68   4.18   4.04               11.43   13.62   11.20
               4     2.34   2.98   2.71               11.69   15.12   11.02
               8     1.66   2.53   2.04               13.23   18.73   11.23
               16    1.34   2.75   1.84               18.07   26.62   12.13
               32    1.30   3.72   1.96               23.85   46.35   14.73
    tcomm20    2     0.10   0.13   0.11                0.43    0.50    0.40
               4     0.10   0.10   0.10                0.63    0.74    0.50
               8     0.10   0.18   0.10                0.93    1.28    0.65
               16    0.10   0.20   0.10                1.98    2.47    0.97
               32    0.10   0.30   0.10                5.01    8.30    2.21
    twod10     2     7.04   7.92   7.64               10.46   10.90    8.89
               4     6.65   7.55   7.15               12.85   14.32   10.43
               8     4.95   6.47   5.64               14.41   18.83   11.17
               16    4.09   6.43   4.89               16.78   30.60   13.00
               32    3.02   7.17   4.18               18.27   53.05   13.82

Acknowledgment. We thank Billy Stewart for making the MARCA software available.

REFERENCES

[1] O. Axelsson, Iterative Solution Methods, Cambridge University Press, New York, NY, USA, 1994.
[2] C. Aykanat, A. Pınar, and Ü. V.
Çatalyürek, Permuting sparse rectangular matrices into block-diagonal form, SIAM J. Sci. Comput., 25 (2004), pp. 1860–1879.
[3] R. Barrett, M. Berry, T. F. Chan, J. Demmel, J. Donato, J. Dongarra, V. Eijkhout, R. Pozo, C. Romine, and H. A. van der Vorst, Templates for the Solution of Linear Systems: Building Blocks for Iterative Methods, SIAM, Philadelphia, PA, 1994.
[4] M. Benzi and T. Dayar, The arithmetic mean method for finding the stationary vector of Markov chains, Parallel Algorithms Appl., 6 (1995), pp. 25–37.
[5] M. Benzi, G. H. Golub, and J. Liesen, Numerical solution of saddle point problems, Acta Numerica, 14 (2005), pp. 1–137.
[6] M. Benzi, F. Sgallari, and G. Spaletta, A parallel block projection method of the Cimmino type for finite Markov chains, in Computations with Markov Chains, W. J. Stewart, ed., Kluwer Academic Publishers, Boston/London/Dordrecht, 1995, pp. 65–80.
[7] M. Benzi and D. B. Szyld, Existence and uniqueness of splittings for stationary iterative methods with applications to alternating methods, Numer. Math., 76 (1997), pp. 309–321.
[8] M. Benzi, D. B. Szyld, and A. van Duin, Orderings for incomplete factorization preconditioning of nonsymmetric problems, SIAM J. Sci. Comput., 20 (1999), pp. 1652–1670.
[9] M. Benzi and M. Tůma, A parallel solver for large-scale Markov chains, Appl. Numer. Math., 41 (2002), pp. 135–153.
[10] M. Benzi and B. Uçar, Product preconditioning for Markov chain problems, in Proceedings of the 2006 Markov Anniversary Meeting (Charleston, SC), A. N. Langville and W. J. Stewart, eds., Boson Books, Raleigh, NC, 2006, pp. 239–256.
[11] A. Berman and R. J. Plemmons, Nonnegative Matrices in the Mathematical Sciences, Academic Press, 1979. Reprinted by SIAM, Philadelphia, PA, 1994.
[12] P. Buchholz, M. Fischer, and P. Kemper, Distributed steady state analysis using Kronecker algebra, in Numerical Solutions of Markov Chains (NSMC'99), B. Plateau, W. J. Stewart, and M.
Silva, eds., Prensas Universitarias de Zaragoza, Zaragoza, Spain, 1999, pp. 76–95.
[13] T. N. Bui and C. Jones, Finding good approximate vertex and edge partitions is NP-hard, Inform. Process. Lett., 42 (1992), pp. 153–159.
[14] T. N. Bui and C. Jones, A heuristic for reducing fill in sparse matrix factorization, in Proc. 6th SIAM Conf. Parallel Processing for Scientific Computing, SIAM, Philadelphia, PA, 1993, pp. 445–452.
[15] T. Dayar and W. J. Stewart, Comparison of partitioning techniques for two-level iterative solvers on large, sparse Markov chains, SIAM J. Sci. Comput., 21 (2000), pp. 1691–1705.
[16] P. Fernandes, B. Plateau, and W. J. Stewart, Efficient descriptor-vector multiplications in stochastic automata networks, J. ACM, 45 (1998), pp. 381–414.
[17] R. W. Freund and F. Jarre, A QMR-based interior-point algorithm for solving linear programs, Math. Program., 76 (1996), pp. 183–210.
[18] M. Hagemann and O. Schenk, Weighted matchings for preconditioning symmetric indefinite linear systems, SIAM J. Sci. Comput., 28 (2006), pp. 403–420.
[19] B. Hendrickson and R. Leland, A multilevel algorithm for partitioning graphs, in Supercomputing '95: Proceedings of the 1995 ACM/IEEE Conference on Supercomputing (CDROM), ACM Press, New York, NY, USA, 1995, p. 28.
[20] B. Hendrickson and E. Rothberg, Improving the run time and quality of nested dissection ordering, SIAM J. Sci. Comput., 20 (1998), pp. 468–489.
[21] I. C. F. Ipsen, A note on preconditioning nonsymmetric matrices, SIAM J. Sci. Comput., 23 (2001), pp. 1050–1051.
[22] M. Jarraya and D. El Baz, Asynchronous iterations for the solution of Markov systems, in Numerical Solutions of Markov Chains (NSMC'99), B. Plateau, W. J. Stewart, and M. Silva, eds., Prensas Universitarias de Zaragoza, Zaragoza, Spain, 1999, pp. 335–338.
[23] G. Karypis and V. Kumar, A fast and high quality multilevel scheme for partitioning irregular graphs, SIAM J. Sci. Comput., 20 (1998), pp. 359–392.
[24] G. Karypis and V.
Kumar, MeTiS: A software package for partitioning unstructured graphs, partitioning meshes, and computing fill-reducing orderings of sparse matrices, version 4.0, University of Minnesota, Department of Computer Science/Army HPC Research Center, Minneapolis, MN 55455, September 1998.
[25] W. J. Knottenbelt and P. G. Harrison, Distributed disk-based solution techniques for large Markov models, in Numerical Solutions of Markov Chains (NSMC'99), B. Plateau, W. J. Stewart, and M. Silva, eds., Prensas Universitarias de Zaragoza, Zaragoza, Spain, 1999, pp. 58–75.
[26] J. A. Meijerink and H. A. van der Vorst, An iterative solution method for linear systems of which the coefficient matrix is a symmetric M-matrix, Math. Comp., 31 (1977), pp. 148–162.
[27] C. D. Meyer, Stochastic complementation, uncoupling Markov chains, and the theory of nearly reducible systems, SIAM Rev., 31 (1989), pp. 240–272.
[28] V. Migallón, J. Penadés, and D. B. Szyld, Experimental studies of parallel iterative solutions of Markov chains with block partitions, in Numerical Solutions of Markov Chains (NSMC'99), B. Plateau, W. J. Stewart, and M. Silva, eds., Prensas Universitarias de Zaragoza, Zaragoza, Spain, 1999, pp. 96–110.
[29] M. F. Murphy, G. H. Golub, and A. J. Wathen, A note on preconditioning for indefinite linear systems, SIAM J. Sci. Comput., 21 (2000), pp. 1969–1972.
[30] B. Philippe, Y. Saad, and W. J. Stewart, Numerical methods in Markov chain modeling, Oper. Res., 40 (1992), pp. 1156–1179.
[31] A. Pınar and B. Hendrickson, Partitioning for complex objectives, in Proceedings of the 15th International Parallel and Distributed Processing Symposium (CDROM), IEEE Computer Society, Washington, DC, USA, 2001, p. 121.
[32] P. K. Pollet and S. E. Stewart, An efficient procedure for computing quasi-stationary distributions of Markov chains with sparse transition structure, Adv. in Appl. Probab., 26 (1994), pp. 68–79.
[33] Y.
Saad, Preconditioned Krylov subspace methods for the numerical solution of Markov chains, in Computations with Markov Chains, W. J. Stewart, ed., Kluwer Academic Publishers, Boston/London/Dordrecht, 1995, pp. 65–80.
[34] Y. Saad, Iterative Methods for Sparse Linear Systems, 2nd ed., SIAM, Philadelphia, PA, 2003.
[35] Y. Saad and M. H. Schultz, GMRES: A generalized minimal residual algorithm for solving nonsymmetric linear systems, SIAM J. Sci. Stat. Comput., 7 (1986), pp. 856–869.
[36] H. Schneider, Theorems on M-splittings of a singular M-matrix which depend on graph structure, Linear Algebra Appl., 58 (1984), pp. 407–424.
[37] W. J. Stewart, MARCA Models: A collection of Markov chain models. URL http://www.csc.ncsu.edu/faculty/stewart/MARCA Models/MARCA Models.html.
[38] W. J. Stewart, Introduction to the Numerical Solution of Markov Chains, Princeton University Press, Princeton, NJ, 1994.
[39] H. A. van der Vorst, Bi-CGSTAB: A fast and smoothly converging variant of Bi-CG for the solution of non-symmetric linear systems, SIAM J. Sci. Stat. Comput., 13 (1992), pp. 631–644.
[40] R. S. Varga, Matrix Iterative Analysis, Prentice-Hall, Englewood Cliffs, New Jersey, 1962. Second edition, revised and expanded, Springer, Berlin/Heidelberg/New York, 2000.
[41] Z. I. Woźnicki, Nonnegative splitting theory, Japan. J. Indust. Appl. Math., 11 (1994), pp. 289–342.