BLOCK TRIANGULAR PRECONDITIONERS FOR M -MATRICES
AND MARKOV CHAINS
MICHELE BENZI† AND BORA UÇAR‡
Abstract. We consider preconditioned Krylov subspace methods for solving large sparse linear
systems under the assumption that the coefficient matrix is a (possibly singular) M -matrix. The
matrices are partitioned into 2 × 2 block form using graph partitioning. Approximations to the Schur
complement are used to produce various preconditioners of block triangular and block diagonal type.
A few properties of the preconditioners are established, and extensive numerical experiments are
used to illustrate the performance of the various preconditioners on singular linear systems arising
from Markov modeling.
Key words. M -matrices, preconditioning, discrete Markov chains, iterative methods, graph
partitioning
AMS subject classifications. 05C50, 60J10, 60J22, 65F10, 65F35, 65F50
1. Introduction. We consider the solution of (consistent) linear systems Ax = b, where A ∈ R^{N×N} is a large, sparse M -matrix. Our main interest is in singular,
homogeneous (b = 0) systems arising from Markov chain modeling; our methods and
results, however, are also applicable in the nonsingular case. The aim of this paper
is to develop and investigate efficient preconditioners for Krylov subspace methods
like GMRES [35] or BiCGStab [39]. Our main focus is on block preconditioners that
are based on Schur complement approximations. Preconditioners of this type have
proven very successful in the context of (generalized) saddle point problems arising in
a variety of applications (see [5]); here we investigate such techniques for M -matrices
and, in particular, for Markov chain problems.
We assume that A is partitioned into the 2 × 2 block structure

(1.1)    A = [ A11  A12 ; A21  A22 ],
where A11 and A22 are square matrices of size, respectively, n × n and m × m with
N = n + m. Building on the given structure, we consider block upper triangular preconditioners of the form

(1.2)    MBP = [ M11  A12 ; O  M22 ],

where M11 is equal to, or an approximation of, A11, and M22 is an approximation of the Schur complement of A11 in A, i.e., M22 ≈ A22 − A21 A11^{-1} A12. Obviously, block lower triangular preconditioners of the same type can also be constructed, as well as block diagonal ones. In any case, the main issue is the choice of the approximations M11 ≈ A11 and M22 ≈ A22 − A21 A11^{-1} A12.
† Department of Mathematics and Computer Science, Emory University, Atlanta, GA 30322, USA
([email protected]). The work of this author was supported in part by the National Science
Foundation grant DMS-0511336.
‡ Department of Mathematics and Computer Science, Emory University, Atlanta, GA 30322, USA
([email protected]). The work of this author was partially supported by The Scientific and
Technological Research Council of Turkey (TÜBITAK) and by the University Research Committee
of Emory University.
In the context of saddle-point problems, the 2 × 2 block structure is mandated by
the application, and the approximation of the Schur complement is usually defined
and interpreted in terms of the application. In contrast, the block preconditioners
investigated in this paper use well-known graph partitioning techniques to define the
block structure, and make use of M -matrix properties to define approximations of the
Schur complement.
For certain approximations of S = A22 − A21 A11^{-1} A12, the block triangular preconditioners proposed in this work are akin to the product splitting preconditioners
proposed in [10]. In fact, we experimentally show that in such cases these two types
of preconditioners perform almost the same in terms of convergence rates for a large
number of Markov chain problems from the MARCA collection [37]. However, the
block triangular ones are cheaper to construct and apply than the product splitting
ones.
The paper is organized as follows. We briefly review background material on M -matrices, discrete Markov chains, graph partitioning, matrix splittings, and product splitting preconditioners in Section 2. We introduce the block triangular preconditioners in Section 3. Section 4 contains material on partitioning the matrices into
the 2 × 2 block structure (1.1) with an eye to future parallel implementations of the
proposed preconditioners. In Section 5 we investigate the effect of the block triangular
preconditioner under various partitionings, the properties of the 2 × 2 block structure
imposed by the graph partitioning, and the performance of the proposed block triangular preconditioners relative to that of some other well-known preconditioners. We
present our conclusions in Section 6.
2. Background. Here we borrow some material mainly from [9, 10, 11, 40]
to provide the reader with a short summary of the concepts and results that are
used in building the proposed preconditioners. We also give a brief description of
graph partitioning by vertex separator, which can be used to obtain the 2 × 2 block
structure (1.1).
2.1. Nonnegative matrices and M -matrices. A matrix A ∈ R^{N×N} is nonnegative if all of its entries are nonnegative; we write A ≥ O if aij ≥ 0 for all 1 ≤ i, j ≤ N.
Any matrix A with nonnegative diagonal entries and nonpositive off-diagonal
entries can be written in the form
(2.1)    A = sI − B,    s > 0,    B ≥ O.
A matrix A of the form (2.1) with s ≥ ρ(B) is called an M -matrix; here, ρ(B) denotes the spectral radius of B. If s = ρ(B), then A is singular; otherwise, A is nonsingular. If A is a nonsingular M -matrix, then A^{-1} ≥ O.
If A is a singular, irreducible M -matrix, then rank(A) = N − 1 and each k × k principal submatrix of A, where 1 ≤ k < N, is a nonsingular M -matrix. If, furthermore, A is the generator of an ergodic Markov chain (see below), then the Schur complement S = A22 − A21 A11^{-1} A12 of A11 in A (Eq. (1.1)), which is of order m, is a singular, irreducible M -matrix with rank m − 1 [9, 27].
2.2. Stationary distribution of ergodic Markov chains. Discrete Markov chains with large state spaces arise in many applications, including for instance reliability modeling, queuing network analysis, web-based information retrieval, and computer system performance evaluation [38]. As is well known, the long-run behavior
of an ergodic (irreducible) Markov chain is described by the stationary distribution
vector of the corresponding matrix of transition probabilities. Recall that the stationary probability distribution vector of a finite, ergodic Markov chain with N × N transition probability matrix P is the unique 1 × N vector π which satisfies

(2.2)    π = πP,    πi > 0 for i = 1, . . . , N,    Σ_{i=1}^{N} πi = 1.

Here P is nonnegative (pij ≥ 0 for 1 ≤ i, j ≤ N), row-stochastic (Σ_{j=1}^{N} pij = 1 for 1 ≤ i ≤ N), and, due to the ergodicity assumption, irreducible.
The matrix A = I − P^T, where I is the N × N identity matrix, is called the generator of the Markov process. The matrix A is a singular, irreducible M -matrix of rank N − 1. Letting x = π^T, so that x = P^T x, the computation of the stationary vector reduces to finding a nontrivial solution to the homogeneous linear system

(2.3)    Ax = 0,

where x ∈ R^N, xi > 0 for i = 1, . . . , N, and Σ_{i=1}^{N} xi = 1. Perron–Frobenius theory [11] implies that such a vector exists and is unique.
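As a small illustration (a dense NumPy sketch with a hypothetical 3-state chain; the matrices considered in this paper are large and sparse), the stationary vector can be obtained as a null vector of A = I − P^T:

```python
import numpy as np

# A hypothetical row-stochastic transition matrix P of a 3-state ergodic chain.
P = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.6, 0.3],
              [0.2, 0.2, 0.6]])

# Generator A = I - P^T; the stationary vector x solves the singular system Ax = 0.
A = np.eye(3) - P.T

# Extract the null vector: the eigenvector of P^T for the eigenvalue 1
# (unique up to scaling, by Perron-Frobenius theory).
w, V = np.linalg.eig(P.T)
x = np.real(V[:, np.argmin(np.abs(w - 1.0))])
x = x / x.sum()   # normalize so that the entries sum to 1
```

For the large singular systems considered in this paper one would instead run a preconditioned Krylov method on Ax = 0; the eigendecomposition only keeps the sketch short.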
Due to the very large number N of states typical of many real-world applications,
there has been increasing interest in recent years in developing parallel algorithms for
Markov chain computations; see [4, 6, 12, 22, 25, 28]. Most of the attention so far has
focused on (linear) stationary iterative methods, including block versions of Jacobi
and Gauss–Seidel [12, 25, 28], and on (nonlinear) iterative aggregation/disaggregation
schemes specifically tailored to stochastic matrices [12, 22]. In contrast, comparatively
little work has been done with parallel preconditioned Krylov subspace methods. The
suitability of preconditioned Krylov subspace methods for solving Markov models has
been demonstrated, e.g., in [30, 33], although no discussion of parallelization aspects
was given there. Parallel computing aspects can be found in [6], where a symmetrizable stationary iteration (Cimmino’s method) was accelerated using the Conjugate
Gradient method on a Cray T3D, and in [25], where an out-of-core, parallel implementation of Conjugate Gradient Squared (with no preconditioning) was used to solve
very large Markov models with up to 50 million states. In [9], parallel preconditioners
based on sparse approximate pseudoinverses were used to speed up the convergence
of BiCGStab, and favorable parallelization results have been reported. In our recent
work [10], we proposed product splitting preconditioners and discussed parallelization
aspects of the proposed preconditioners.
2.3. Stationary iterations and matrix splittings. Consider again the solution of a linear system of the form Ax = b. The representation A = B − C is called a
splitting if B is nonsingular. A splitting gives rise to the stationary iterative method
(2.4)    xk+1 = T xk + c,    k = 0, 1, . . . ,
where T = B^{-1} C is called the iteration matrix, c = B^{-1} b, and x0 ∈ R^N is a given initial guess. The splitting A = B − C is called regular if B^{-1} ≥ O and C ≥ O [40], weak regular if B^{-1} ≥ O and T ≥ O [11], an M -splitting if B is an M -matrix and C ≥ O [36], and weak nonnegative of the second kind if B^{-1} ≥ O and I − AB^{-1} ≥ O [41]. If
A is a nonsingular M -matrix, any of the above conditions on the splitting A = B − C
is sufficient to ensure the convergence of the stationary iteration (2.4) to the unique
solution of Ax = b, for any choice of the initial guess x0 .
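For instance, a Jacobi-type M -splitting of a small hypothetical nonsingular M -matrix yields a convergent stationary iteration (2.4) (a NumPy sketch; B = diag(A) is just one admissible choice):

```python
import numpy as np

# A hypothetical nonsingular M-matrix (diagonally dominant, nonpositive off-diagonals).
A = np.array([[ 4., -1., -1.],
              [-2.,  5., -1.],
              [-1., -1.,  3.]])
b = np.array([1., 2., 3.])

# Jacobi-type M-splitting A = B - C with B = diag(A): B^{-1} >= O and C >= O,
# so the splitting is regular and the iteration (2.4) converges.
B = np.diag(np.diag(A))
C = B - A
T = np.linalg.solve(B, C)     # iteration matrix T = B^{-1} C
c = np.linalg.solve(B, b)

x = np.zeros(3)
for _ in range(200):          # x_{k+1} = T x_k + c
    x = T @ x + c
```

Since ρ(T) < 1 for this splitting, the iterates converge to the unique solution of Ax = b from any initial guess.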
A related approach is defined by the alternating iterations

(2.5)    x^{k+1/2} = M1^{-1} N1 x^k + M1^{-1} b,
         x^{k+1}   = M2^{-1} N2 x^{k+1/2} + M2^{-1} b,    k = 0, 1, . . . ,

where A = M1 − N1 = M2 − N2 are splittings of A, and x^0 is the initial guess. The convergence of alternating iterations was analyzed by Benzi and Szyld [7] under various assumptions on A, including the singular M -matrix case. They constructed a (possibly non-unique) splitting A = B − C associated with the alternating iterations (Eq. (10) in [7]) with

(2.6)    B^{-1} = M2^{-1} (M1 + M2 − A) M1^{-1}.
Clearly, the matrix M1 + M2 − A must be nonsingular for (2.6) to be well-defined.
2.4. Product splitting preconditioners. The product splitting preconditioners [10] are based on alternating iterations (2.5) defined by two simple preconditioners. The first preconditioner can be taken to be the well-known block Jacobi preconditioner, i.e.,

(2.7)    MBJ = [ A11  O ; O  A22 ].

Note that A11 and A22 are nonsingular M -matrices, and A = MBJ − (MBJ − A) is a regular splitting (in fact, an M -splitting).
The second preconditioner is given by

(2.8)    MSC = [ D11  A12 ; A21  A22 ],

where D11 ≠ A11 stands for an approximation of A11. In [10], we take D11 to be the diagonal matrix formed with the diagonal entries of A11, i.e., D11 = diag(A11). More generally, D11 is a matrix obtained from A11 by setting off-diagonal entries to zero. Thus, D11 is a nonsingular M -matrix [40, Theorem 3.12]. The Schur complement matrix A22 − A21 D11^{-1} A12 is therefore well-defined. As shown in [1], if A is a nonsingular M -matrix, so is A22 − A21 D11^{-1} A12. When A is a singular irreducible M -matrix, the same property holds under the (very mild) structural conditions given in [9, Theorem 3]. Therefore MSC is a nonsingular M -matrix and A = MSC − (MSC − A) is an M -splitting (hence, a regular splitting).
Since both MBJ and MSC define regular splittings, the product preconditioner MPS given by

(2.9)    MPS^{-1} = MSC^{-1} (MBJ + MSC − A) MBJ^{-1}

(see (2.6)), implicitly defines a weak regular splitting [7, Theorem 3.4]. Note that since the matrix

(2.10)    MBJ + MSC − A = [ D11  O ; O  A22 ]

is invertible, MPS^{-1} is well-defined, and so is the corresponding splitting of A.
2.5. Graph partitioning. Given an undirected graph G = (V, E), the problem of K-way graph partitioning by vertex separator (GPVS) asks for a set of vertices VS of minimum size whose removal decomposes the graph into K disconnected subgraphs of balanced sizes. The problem is NP-hard [13]. Formally, Π = {V1, . . . , VK ; VS} is a K-way vertex partition by vertex separator VS if the following conditions hold: Vk ⊂ V and Vk ≠ ∅ for 1 ≤ k ≤ K; Vk ∩ Vℓ = ∅ for 1 ≤ k < ℓ ≤ K and Vk ∩ VS = ∅ for 1 ≤ k ≤ K; ∪k Vk ∪ VS = V; there is no edge between vertices lying in two different parts Vk and Vℓ for 1 ≤ k < ℓ ≤ K; and Wmax/Wavg ≤ ε, where Wmax is the maximum part size (defined as maxk |Vk|), Wavg is the average part size (defined as (|V| − |VS|)/K), and ε is a given maximum allowable imbalance ratio. See the works [2, 14, 19, 20, 23] for applications of the GPVS and heuristics for GPVS.
In the weighted GPVS problem, the vertices of the given undirected graph have
weights. The weight of the separator or a part is defined as the sum of the weights
of the vertices that they contain. The objective of the weighted GPVS problem is
to minimize the weight of the separator while maintaining a balance criterion on the
part weights.
3. Block triangular preconditioners. Let A ∈ R^{N×N} be an M -matrix, initially assumed to be nonsingular. Consider a 2 × 2 block structure as in (1.1). Recall from Section 2 that A11, A22, and the Schur complement of A11 in A, i.e., S = A22 − A21 A11^{-1} A12, are all M -matrices. Note that if A11 is irreducible, then A11^{-1} is dense and so is S; if A11 is reducible (e.g., block diagonal), then S will still be fairly dense unless the block sizes are very small.
Consider the ideal preconditioner [21, 29]

(3.1)    P0 = [ A11  A12 ; O  S ].

It follows from the block LU factorization

A = [ I  O ; A21 A11^{-1}  I ] [ A11  A12 ; O  S ] = L P0
that A P0^{-1} (and therefore P0^{-1} A) has the eigenvalue λ = 1 of multiplicity N as the only point in the spectrum. Also, (A P0^{-1} − I)^2 = (L − I)^2 = O, which shows that the minimum polynomial of the preconditioned matrix has degree 2; hence, a minimal residual method (like GMRES) is guaranteed to converge in at most two steps. Unfortunately, preconditioning with P0 is impractical. Here, similarly to the situation for saddle point problems, we consider preconditioners obtained by different approximations of A11 and of S.
We start with the case where A11 is solved “exactly”, while S is replaced by an approximation Ŝ = A22 − A21 M11^{-1} A12, where M11^{-1} is an approximate inverse of A11. If we impose the condition that

(3.2)    O ≤ M11^{-1} ≤ A11^{-1}    (entrywise),

then Ŝ is guaranteed to be an M -matrix, and invertible if A is invertible; see [1, pp. 263–265]. The condition (3.2) is satisfied (for example) when A11 = M11 − (M11 − A11) is an M -splitting. In particular, if M11 = diag(A11) (or any other approximation obtained by setting to zero any off-diagonal entries of A11), then the condition (3.2) is satisfied.
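A sketch of this construction on a small hypothetical M -matrix: with M11 = diag(A11), condition (3.2) holds entrywise, and the resulting Ŝ is an M -matrix dominating the exact Schur complement S entrywise:

```python
import numpy as np

# A hypothetical 5x5 nonsingular M-matrix, partitioned with n = 3, m = 2.
A = np.array([[ 4., -1.,  0., -1.,  0.],
              [-1.,  4., -1.,  0., -1.],
              [ 0., -1.,  4., -1.,  0.],
              [-1.,  0., -1.,  4., -1.],
              [ 0., -1.,  0., -1.,  4.]])
n = 3
A11, A12 = A[:n, :n], A[:n, n:]
A21, A22 = A[n:, :n], A[n:, n:]

# Condition (3.2): with M11 = diag(A11), O <= M11^{-1} <= A11^{-1} entrywise.
M11inv = np.diag(1.0 / np.diag(A11))
A11inv = np.linalg.inv(A11)

# Approximate Schur complement S-hat = A22 - A21 M11^{-1} A12 (much sparser
# than the exact S when the blocks are large and sparse).
S_hat = A22 - A21 @ M11inv @ A12
S     = A22 - A21 @ A11inv @ A12    # exact Schur complement, for comparison
```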
Assume that linear systems with Ŝ are solved exactly. Then the preconditioner is

(3.3)    P1 = [ A11  A12 ; O  Ŝ ].

A simple calculation shows that

A P1^{-1} = [ I  O ; A21 A11^{-1}  S Ŝ^{-1} ].
Hence, A P1^{-1} (or P1^{-1} A) has the eigenvalue λ = 1 of multiplicity at least n, while the remaining m eigenvalues are those of S Ŝ^{-1}. These m eigenvalues lie in the open disk centered at (1, 0) of radius 1. To show this, consider the following matrix splitting:

A = P1 − (P1 − A) = [ A11  A12 ; O  Ŝ ] − [ O  O ; −A21  −A21 M11^{-1} A12 ].

Note that this splitting is not a regular splitting, since −A21 M11^{-1} A12 ≤ O. However, we have the following result.
Theorem 3.1. The splitting A = P1 − (P1 − A) is a weak nonnegative splitting of the second kind.
Proof. Due to the condition (3.2), P1 is an M -matrix, and hence P1^{-1} ≥ O. We only need to check the nonnegativity of

I − A P1^{-1} = [ O  O ; −A21 A11^{-1}  I − S Ŝ^{-1} ].

Clearly −A21 A11^{-1} ≥ O (since A21 ≤ O and A11^{-1} ≥ O). We need to show I − S Ŝ^{-1} ≥ O. We have

S = A22 − A21 A11^{-1} A12
  = A22 − A21 (A11^{-1} − M11^{-1} + M11^{-1}) A12
  = A22 − A21 M11^{-1} A12 − A21 (A11^{-1} − M11^{-1}) A12
  = Ŝ − R,

where R = A21 (A11^{-1} − M11^{-1}) A12 is nonnegative, since A11^{-1} ≥ M11^{-1} by (3.2). Recall that Ŝ is an M -matrix; therefore I − S Ŝ^{-1} = (Ŝ − S) Ŝ^{-1} = R Ŝ^{-1} ≥ O.
Since A is an invertible M -matrix and A = P1 − (P1 − A) is a weak nonnegative splitting of the second kind, it is a convergent splitting: ρ(I − A P1^{-1}) < 1 (see [41]). In other words, all the eigenvalues of A P1^{-1} (or P1^{-1} A) satisfy |λ − 1| < 1. For the special case M11^{-1} = A11^{-1}, we have P1 = P0 and σ(A P1^{-1}) = {1}.
In practice, exact solves with A11 or Ŝ may be impractical or inadvisable on the grounds of efficiency or numerical stability. This leads to the following three variants. The first one is

(3.4)    P2 = [ Â11  A12 ; O  Ŝ ],    Â11 ≈ A11,
where we have inexact solves with A11 and exact solves with Ŝ. We can assume that
A11 = Â11 −(Â11 −A11 ) is a regular splitting, or even an M -splitting. Another variant
is

(3.5)    P3 = [ A11  A12 ; O  S̃ ],    S̃ ≈ Ŝ,
where we have exact solves with A11 and inexact solves with Ŝ. Here we can assume
that Ŝ = S̃ − (S̃ − Ŝ) is a regular splitting. P3 can be analyzed exactly like P1 . Finally
we consider

(3.6)    P4 = [ Â11  A12 ; O  S̃ ],    Â11 ≈ A11,    S̃ ≈ Ŝ,

where we can assume that both Â11 and S̃ induce regular splittings of A11 and Ŝ, respectively. In practice, we use P4 with Â11 and S̃ coming from incomplete LU factorizations (ILU) of A11 and Ŝ. Note that, unlike position-based ILUs, threshold-based ILUs do not always lead to regular splittings (see Section 10.4.2 in [34]). Nevertheless, threshold-based ILUs often perform better than position-based ones, and the clustering of the eigenvalues of the preconditioned matrices around unity can be easily controlled by the drop tolerance.
When A is a singular, irreducible M -matrix, the splittings A = Pi − (Pi − A) satisfy ρ(I − A Pi^{-1}) = 1, and the spectrum of A Pi^{-1} will also include the eigenvalue λ = 0 (for i = 1, . . . , 4). We also note that λ = 0 is a simple eigenvalue, and the splitting A = Pi − (Pi − A) is semiconvergent (see [11]), if Ŝ − S contains no zero rows and Ŝ is irreducible. The zero eigenvalue, in any event, does not negatively affect the convergence of Krylov methods like GMRES; in practice, its only effect is to introduce a set of (Lebesgue) measure zero of “bad” initial guesses. These can be easily avoided, for instance by using a random initial guess.
4. Building the block triangular preconditioner.
4.1. Defining the blocks. The first requirement to be met in permuting the
matrix A into 2×2 block structure (1.1) is that the permutation should be symmetric.
If A is nonsingular or obtained from the transition probability matrix of an irreducible
Markov chain, then a symmetric permutation on the rows and columns of A guarantees
that A11 and A22 are invertible M -matrices.
The second requirement, as already discussed in Section 3, is to keep the order n of A11 as large as possible, to maximize the number of (near) unit eigenvalues of the preconditioned matrix (A Pi^{-1} for i = 1, 2, 3, 4). This requirement is important for fast convergence of the Krylov subspace method, since m, the size of A22, is an upper bound on the number of non-unit eigenvalues. Strictly speaking, this is true only if we assume exact solves in the application of the preconditioner, e.g., for A P1^{-1}. In practice we will use inexact solves, and rather than having n eigenvalues (or more) exactly equal to 1, there will be a cluster of at least n eigenvalues near the point (1, 0). Still, we want this cluster to contain as many eigenvalues as possible.
The second requirement is also desirable from the point of view of parallel implementation. A possible parallelization approach would be constructing and solving the
linear systems with approximate Schur complements on a single processor, and then
solving the linear systems with A11 in parallel. This approach has been taken previously in parallelizing applications of approximate inverse preconditioners in Markov
chain analysis [9]. Another possible approach would be parallelizing the solution of
the approximate Schur complement systems either by allowing redundancies in the
computations (each processor can form the whole system or a part of it) or by running
a parallel solver on those systems. In both cases, the solution with the approximate
Schur complement system constitutes a serial bottleneck and requires additional storage space.
The third requirement, not necessary for the convergence analysis but crucial for
an efficient implementation, is that A11 should be block diagonal with subblocks of
approximately equal size and density. Given K subblocks in the (1,1) block A11 ,
the K linear systems stemming from A11 can be solved independently. Meeting this
requirement for a serial implementation will enable solution of very large systems,
since the subblocks can be handled one at a time. In any admissible parallelization,
each of these subblocks would more likely be assigned to a single processor. Therefore,
maintaining balance on the sizes and the densities of the subblocks will relate to
maintaining balance on computational loads of the processors. Furthermore, it is
desirable that the sizes of these subblocks be larger than the order m of A22 , if
possible, for the reasons given for the second requirement.
Meeting all of the above three requirements is a very challenging task. Therefore, as a pragmatic approach, we apply well-established heuristics to address these requirements. As is common, we adopt the standard undirected graph model to represent a square matrix A of order N. The vertices of the graph G(A) = (V, E) correspond to the rows and columns of A, and the edges correspond to the nonzeros of A. The vertex vi ∈ V represents the ith row and the ith column of A, and there exists an edge (vi, vj) ∈ E if aij and aji are nonzero.
Consider a partitioning Π = {V1, . . . , VK ; VS} of G(A) with vertex separator VS. The matrix A can be permuted into the 2 × 2 block structure (1.1) by permuting the rows and columns associated with the vertices in ∪k Vk before the rows and columns associated with the vertices in VS. That is, VS defines the rows and columns of the (2,2) block A22. Notice that the resulting permutation is symmetric, and hence the first requirement is met. Furthermore, since GPVS tries to minimize the size of the separator set VS, it tries to minimize the order of the block A22. Therefore, the permutation induced by Π meets the second requirement as well.
Consider the A11 block defined by the vertices in ∪k Vk. The rows and columns that are associated with the vertices in Vk can be permuted before the rows and columns associated with the vertices in Vℓ for 1 ≤ k < ℓ ≤ K. Such a permutation of A11 gives rise to diagonal subblocks. Since we have already constructed A22 using VS, we end up with the following structure:
    A = [ A1                    B1
               A2               B2
                    ...         ..
                         AK     BK
          C1   C2   ...  CK     AS ] .
The diagonal blocks A1 , . . . , AK correspond to the vertex parts V1 , . . . , VK , and therefore have approximately the same order. The off-diagonal blocks Bi , Ci represent the
connections between the subgraphs, and the diagonal block AS represents the connections between nodes in the separator set. Note that if A is nonsingular or obtained
from the transition probability matrix of an irreducible Markov chain, then each
block Ai is a nonsingular M -matrix. Thus, graph partitioning induces a reordering
and block partitioning of the matrix A in the form (1.1) where
A11 = diag(A1 , A2 , . . . , AK ), A22 = AS
and

A12 = [ B1^T  B2^T  · · ·  BK^T ]^T,    A21 = [ C1  C2  · · ·  CK ].
Therefore, the permutation induced by the GPVS partially addresses the third requirement. Note that the GPVS formulation ignores the requirement of balancing
the densities of the diagonal subblocks of A11 . In fact, obtaining balance on the densities of the diagonal blocks is a complex partitioning requirement that cannot be met
before a partitioning takes place (see [31] for a possible solution) even with a weighted
GPVS formulation.
If the matrix is structurally nonsymmetric, which is common for matrices arising
from Markov chains, then A cannot be modeled with undirected graphs. In this case,
a 2 × 2 block structure can be obtained by partitioning the graph of A + AT .
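The induced permutation can be sketched as follows (NumPy, with a hypothetical 6-vertex graph whose parts are V1 = {0, 3}, V2 = {1, 4} and separator VS = {2, 5}): ordering the part vertices first and the separator last produces the bordered block form, with no coupling between the two parts.

```python
import numpy as np

# Hypothetical 6-vertex graph: parts V1 = {0, 3}, V2 = {1, 4}, separator VS = {2, 5}.
# There is no edge between V1 and V2; every V1-V2 path goes through the separator.
edges = [(0, 3), (1, 4), (0, 2), (3, 5), (1, 2), (4, 5), (2, 5)]

N = 6
A = np.zeros((N, N))
for i, j in edges:                     # symmetric pattern, nonpositive off-diagonals
    A[i, j] = A[j, i] = -1.0
np.fill_diagonal(A, -A.sum(axis=1) + 1.0)  # diagonal dominance: nonsingular M-matrix

# Symmetric permutation: part vertices first, separator vertices last.
perm = [0, 3] + [1, 4] + [2, 5]
B = A[np.ix_(perm, perm)]              # B has the bordered form (1.1)
```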
4.2. Approximating the Schur complement. Recall from Section 3 that we are interested in approximations of the Schur complement of the form Ŝ = A22 − A21 M11^{-1} A12, where M11^{-1} is an approximate inverse of A11 satisfying (3.2). We found the rather crude approximation M11 = diag(A11) to be quite satisfactory. It is easy to invert and apply, and it also maintains a great deal of sparsity in Ŝ. Apart from this approximation, we also tried M11 = Â11 with Â11 = L̄11 Ū11 (an incomplete factorization of A11), the same approximation used in the (1,1) blocks of P2 and P4. This works well in terms of reducing the number of iterations, and in all experiments it delivered the smallest iteration counts; however, this good rate of convergence came at the price of a prohibitive preconditioner construction overhead, and in a few cases it failed due to lack of space for Ŝ.
In an attempt to find a middle ground, we tried to construct a block diagonal (up to a symmetric permutation) M11 with 1 × 1 and 2 × 2 blocks. We create a list of all pairs ⟨i, j⟩ with i < j such that either aij or aji, or both, are nonzero. Each of these pairs is a candidate for a 2 × 2 block of the form [ aii  aij ; aji  ajj ]. Then we visit the candidates in descending order of the determinants of the corresponding 2 × 2 blocks, where the determinant of the pair ⟨i, j⟩ is computed as aii ajj − aij aji. At candidate ⟨i, j⟩, if neither i nor j is already included in a 2 × 2 block, we form the corresponding 2 × 2 block; otherwise, we proceed to the next candidate. Any row i (and hence column i) that is not included in a 2 × 2 block defines a 1 × 1 block aii in M11. Note that this greedy algorithm tries to maximize the minimum of the determinants of the 2 × 2 blocks in M11, and hence tries to yield a well-conditioned matrix M11, without any attempt to maximize the number of those blocks. Similar algorithms were used in [17, 18] for preconditioning indefinite systems and were found to be useful. In our case, however, this choice of M11 did not improve upon the simple diagonal one.
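The greedy selection can be sketched as follows (the function name and the small test matrix are illustrative, not from the paper):

```python
import numpy as np

def greedy_blocks(A):
    """Greedily pair indices into 2x2 pivot blocks, largest determinant first.

    Returns the list of chosen pairs (i, j); any unpaired index i contributes
    a 1x1 block a_ii to M11.
    """
    N = A.shape[0]
    cands = []
    for i in range(N):
        for j in range(i + 1, N):
            if A[i, j] != 0 or A[j, i] != 0:
                cands.append((A[i, i] * A[j, j] - A[i, j] * A[j, i], i, j))
    cands.sort(reverse=True)               # descending determinant order
    used, pairs = set(), []
    for _, i, j in cands:
        if i not in used and j not in used:
            pairs.append((i, j))
            used.update((i, j))
    return pairs

# The pair (2, 3) has the largest determinant (19) and is chosen first;
# (1, 2) and (0, 3) are then blocked, so (0, 1) completes the pairing.
pairs = greedy_blocks(np.array([[ 4., -1.,  0., -2.],
                                [-1.,  3., -1.,  0.],
                                [ 0., -1.,  5., -1.],
                                [-2.,  0., -1.,  4.]]))
```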
5. Numerical experiments. In this section, we report on experimental results
obtained with a Matlab 7.1.0 implementation on a 2.2 GHz dual core AMD Opteron
Processor 875 with 4GB main memory. The main goal was to test the proposed
block preconditioners and to compare them with a few other techniques, including
the product splitting preconditioner [10]. The Krylov method used was GMRES [35].
For completeness we performed experiments with the stationary iterations corresponding to the various splittings (without GMRES acceleration), but they were found to
converge too slowly to be competitive with preconditioned GMRES. Therefore, we do
not show these results.
The various methods were tested on the generator matrices of some Markov chain
models provided in the MARCA (MARkov Chain Analyzer) collection [37]. The
Table 5.1
Properties of the generator matrices: the order N, the total number of nonzeros, the average number of nonzeros per row/column, and the minimum and maximum nonzero counts per row and per column.

Matrix          N   nonzeros   avg/row  row min  row max  col min  col max
mutex09     65535    1114079      17.0       16       17       16       17
mutex12    263950    4031310      15.3        9       21        9       21
ncd07       62196     420036       6.8        2        7        2        7
ncd10      176851    1207051       6.8        2        7        2        7
qnatm06     79220     533120       6.7        3        9        4        7
qnatm07    130068     875896       6.7        3        9        4        7
tcomm16     13671      67381       4.9        2        5        2        5
tcomm20     17081      84211       4.9        2        5        2        5
twod08      66177     263425       4.0        2        4        2        4
twod10     263169    1050625       4.0        2        4        2        4
models are discussed in [16, 30, 32] and have been used to compare different solution methods in [9, 10, 15] and elsewhere. These matrices are infinitesimal generators of continuous-time Markov chains, but can easily be converted (as we did) to the form A = I − P^T, with P row-stochastic, so that A corresponds to a discrete-time Markov chain, known as the embedded Markov chain; see [38, Chapter 1.4.3]. The preconditioning techniques described in this paper can be applied to either form of the generator matrix.
We performed a large number of tests on numerous matrices; here we present a
selection of results for a few test matrices, chosen to be representative of our overall
findings. Table 5.1 displays the properties of the chosen test matrices. Each matrix
is named by its family followed by its index in the family. For example, mutex09
refers to the 9th matrix in the mutex family. The matrices from the mutex and ncd
families are structurally symmetric, the matrices from the qnatm and twod families
are structurally nonsymmetric, and the matrices from the tcomm family are very close
to being structurally symmetric—the nonzero patterns of tcomm20 and tcomm16 differ
from the nonzero patterns of their transposes in only 60 locations.
We compared the block triangular preconditioner (BT) with the block Jacobi (BJ), block Gauss–Seidel (BGS), block successive overrelaxation (BSOR), and product splitting (PS) preconditioners. The block Gauss–Seidel and block successive overrelaxation preconditioners are given, respectively, by

MBGS = [ A11  A12 ; O  A22 ]    or    MBGS = [ A11  O ; A21  A22 ],

MBSOR = [ (1/ω)A11  A12 ; O  (1/ω)A22 ]    or    MBSOR = [ (1/ω)A11  O ; A21  (1/ω)A22 ].
In agreement with previously reported results [15] on the MARCA collection, we observed that ω = 1.0 (which reduces BSOR to BGS), or a value very close to 1.0, is nearly always the best choice of the relaxation parameter for BSOR. We also observed that for most MARCA problems the block lower triangular versions of the BGS and BSOR preconditioners are indistinguishable from the block upper triangular versions in terms of both storage and performance. Therefore, we report only the experiments with the upper triangular BGS preconditioner.
As discussed for instance in [21, 29], the ideal block triangular preconditioner P0
has a natural block diagonal (BD) counterpart, namely

MBD = [ A11  O ; O  S ].

Table 5.2
Properties of the partitions and the induced block structures, averaged over all matrices excluding the mutex matrices. The column “sep” refers to the number of rows in the 2nd row block of A normalized by the number of rows in A, i.e., m/N; the column “part” refers to the average part size normalized by the number of rows in A, i.e., (n/K)/N; the columns Aij for i, j = 1, 2 refer to the number of nonzeros in the (i, j) block normalized by the number of nonzeros in A, i.e., nnz(Aij)/nnz(A).

        Partition              Blocks
 K      sep    part     A11     A12     A21     A22
 2    0.008   0.496   0.986   0.006   0.006   0.002
 4    0.016   0.232   0.852   0.069   0.069   0.011
 8    0.028   0.113   0.808   0.088   0.088   0.016
16    0.045   0.055   0.770   0.105   0.105   0.020
32    0.068   0.027   0.731   0.122   0.122   0.025
As with P0, exact solves with MBD are not feasible. Therefore, we replace A11 and the Schur complement S with approximations, as in P4, and compare the BT preconditioners with the corresponding inexact BD preconditioner as well.
5.1. Properties of the preconditioners. We partitioned the matrices into the 2 × 2 block structure (1.1) using Metis [24]. In all cases, the partitioning time is negligible compared to the solve time. For the structurally symmetric mutex and ncd matrices we used the graph of A, and for the other matrices we used the graph of A + A^T, as mentioned in Section 4. As also discussed there, we maintain balance on the sizes, rather than the densities, of the subblocks of A11. We conducted experiments with K = 2, 4, 8, 16, and 32 subblocks in the (1,1) block. For each K value, the K-way partitioning of a test matrix constitutes a partitioning instance. Since Metis incorporates randomized algorithms, it was run 10 times with different random seeds for each partitioning instance, with a maximum allowable imbalance ratio of 25%. In all partitioning instances except those of the mutex matrices, the imbalance ratios among the parts were within the specified limit. The following tables give the averages of these 10 runs for each partitioning instance.
Among all partitioning instances, only the mutex matrices have a large number of rows in the second row block, i.e., a large separator. For these matrices, the average part size exceeds the size of the separator set only in the K = 2-way partitioning. For the other matrices, the average part size exceeds the size of the separator in all partitioning instances with K = 2, 4, 8, and 16, except in the K = 16-way partitionings of ncd07, ncd10, and qnatm06; the resulting average figures are given in Table 5.2.
We have conducted experiments with block triangular preconditioners P1 through
P4 . A somewhat surprising finding is that the variants requiring exact solves with
A11 , e.g., P1 and P3 , besides being rather expensive (as expected), are prone to
numerical instabilities. By this we mean that at least one block Ak in A11 was found
to have an upper triangular factor U with a huge condition number, causing the
convergence of GMRES to deteriorate. (The unit lower triangular factor is always
well-conditioned for the problems considered here.) We encountered this difficulty
with all test matrices and for all values of K, except for the qnatm matrices. This
phenomenon can be explained by noting that the diagonal blocks Ak , while guaranteed
to be nonsingular, are often close to singular, in particular when A is nearly reducible
(see [27]). Hence, the corresponding upper triangular factor must have an exceedingly
small pivot, and consequently its inverse must have a huge norm. This problem
disappears when the complete factorizations of the diagonal blocks Ak are replaced
by incomplete ones. This is not surprising: it has been shown in [26] that in an
incomplete factorization of an M -matrix, the pivots (i.e., the diagonal entries of the
upper triangular factor) cannot become smaller than the corresponding pivots of the complete factorization, and in practice they always increase.
As a result, the condition number of the incomplete upper triangular factor is several
orders of magnitude smaller than that of the complete factor, and no instabilities
arise. Therefore, we have the somewhat unexpected conclusion that in practice inexact
solvers result in greater robustness and faster convergence than exact ones.
For these reasons we do not present results with P1 and P3 . In addition, P2 has
a large construction overhead due to exact factorization of Ŝ and did not perform
better than P4 in reducing the number of iterations. Therefore, in the following
we present results only with P4 . We tried all three approximations of the Schur
complement discussed in Section 4.2 with the block triangular preconditioner P4 .
The one with block diagonal M11 with 1 × 1 and 2 × 2 blocks did not improve upon
the simple diagonal one. Therefore, we omit the results with this choice of M11 . The
one with M11 = Â11 has very large memory requirements; for example, it was not
possible to run it with the mutex matrices. Therefore, it is not recommended as a
general purpose solution. However, it is worth presenting because it gives the smallest
number of iterations and has fairly robust behavior with respect to increasing K
(see Table 5.3). In the following discussion, BT thus refers to the block triangular
preconditioner P4 :
    P4 = [ Â11   A12 ]
         [  O     S̃  ],    Â11 ≈ A11 ,   S̃ ≈ Ŝ = A22 − A21 M11^{-1} A12 ,

where M11 = diag(A11).
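Since M11 = diag(A11) is trivially invertible, this Schur complement approximation is cheap to form explicitly; a scipy sketch (the function name `approx_schur` is ours):

```python
import scipy.sparse as sp

def approx_schur(A11, A12, A21, A22):
    """Form S_hat = A22 - A21 * diag(A11)^{-1} * A12.

    Taking M11 = diag(A11) avoids inverting A11 and keeps S_hat
    sparse: its fill is just that of the product A21 * A12.
    """
    d_inv = 1.0 / A11.diagonal()  # diagonal entries of the M-matrix block are positive
    return (A22 - A21 @ sp.diags(d_inv) @ A12).tocsr()
```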
Each subblock Ak , for k = 1, . . . , K, of A11 and the (2,2) block A22 in the BJ and
BGS preconditioners were ordered using symmetric reverse Cuthill-McKee (for better
numerical properties; see [8]) and factored using the incomplete LU factorization
(ILUTH) with threshold parameter τ = 0.01 for the qnatm matrices and τ = 0.001 for
the other matrices. The threshold of 0.001 was too small for the qnatm matrices: the
resulting preconditioners had 8 times more nonzeros than the generator matrices. We
observed that reordering the approximate Schur complement matrix causes ILUTH
to take too much time (presumably due to the need for pivoting), but reduces the
number of nonzeros in the factors only by a small amount. Therefore, the approximate
Schur complement matrices were factored using ILUTH without any prior ordering.
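The per-block factorization step can be sketched with scipy, where `spilu` (SuperLU's threshold-based incomplete LU) stands in for ILUTH, its `drop_tol` plays the role of τ, and `reverse_cuthill_mckee` provides the symmetric ordering; the function name and wrapper structure are our choices:

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.csgraph import reverse_cuthill_mckee
from scipy.sparse.linalg import spilu

def rcm_ilu_solver(Ak, drop_tol=1e-3):
    """Symmetrically reorder a diagonal block with reverse Cuthill-McKee,
    factor it incompletely with a drop tolerance, and return a solve
    callback that acts in the original ordering."""
    p = reverse_cuthill_mckee(sp.csr_matrix(Ak))
    Ak_p = sp.csc_matrix(sp.csr_matrix(Ak)[p][:, p])  # symmetric permutation
    ilu = spilu(Ak_p, drop_tol=drop_tol)
    def solve(b):
        y = ilu.solve(b[p])   # solve in the permuted ordering
        x = np.empty_like(y)
        x[p] = y              # undo the permutation
        return x
    return solve
```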
The densities of the preconditioners, i.e., the number of nonzeros in the matrices appearing in the preconditioner solve phase divided by the number of nonzeros
in the corresponding generator matrices, are given in Table 5.4. As seen from the
table, memory requirements of the BT preconditioners are less than those of the PS
preconditioners and comparable to those of the three, relatively simple, ‘classical’
preconditioners.
5.2. Performance comparisons. The underlying Krylov subspace method was
GMRES, restarted every 50 iterations (if needed). Right preconditioning was used in
all the tests. The stopping criterion was set as
    ‖r_k‖_2 / ‖r_0‖_2 < 10^{−10},
Table 5.3
Data pertaining to the P4 preconditioner with S̃ ≈ Ŝ = A22 − A21 (L̄11 Ū11)^{-1} A12. The column
"its" refers to the average number of iterations, "dens" refers to the total number of nonzeros in the
P4 matrices, "prec" refers to the time spent in constructing the preconditioners, and "solve" refers
to the time spent during GMRES iterations.

Matrix      K    its   dens     prec    solve
ncd07       2     11   1.11    29.19     1.11
            4     12   1.23    14.86     1.16
            8     12   1.20    16.87     1.25
           16     13   1.18    20.26     1.35
           32     13   1.18    20.50     1.46
ncd10       2     11   1.10   187.40     3.68
            4     11   1.13   120.63     4.29
            8     12   1.12    99.12     4.14
           16     13   1.14    81.39     4.31
           32     13   1.12    95.72     4.51
qnatm06     2     35   2.97    45.86     6.07
            4     35   2.65    43.09     6.21
            8     37   2.61    37.19     6.57
           16     38   2.49    31.51     6.85
           32     40   2.48    31.73     7.66
qnatm07     2     39   2.99    86.45    11.66
            4     40   2.58    96.43    12.50
            8     42   2.59    84.09    12.88
           16     42   2.52    69.27    13.05
           32     45   2.49    73.82    14.32
tcomm16     2     14   3.03     0.38     0.24
            4     16   2.47     0.60     0.24
            8     19   2.13     0.84     0.29
           16     26   2.01     1.27     0.40
           32     25   1.93     1.64     0.40
tcomm20     2     16   2.93     0.48     0.34
            4     19   2.24     0.73     0.35
            8     22   2.17     1.02     0.42
           16     30   2.09     1.51     0.61
           32     28   1.99     2.23     0.59
twod08      2     10   2.24     7.20     0.91
            4     12   2.30    12.18     1.03
            8     17   3.05    17.17     1.56
           16     18   2.77    18.68     1.63
           32     18   2.87    26.38     1.68
twod10      2     17   4.64   156.29     8.71
            4     18   4.63   163.76     8.60
            8     20   5.11   190.06     9.51
           16     21   5.01   199.50    10.30
           32     23   4.37   229.49    10.50
where r_k is the residual at the kth iteration and r_0 is the initial residual. In all
cases, the initial solution is set to the one-vector normalized to have an ℓ1-norm of
1.0. We allowed at most 250 iterations, i.e., 5 restarts, for the GMRES iteration.
Therefore, the number 250 in the following tables marks the cases in which GMRES
failed to deliver solutions with the prescribed accuracy within 250 iterations. Iteration
counts for GMRES(50) on the various test matrices with no permutation or preconditioning (GMRES) and averages of the iteration counts with the preconditioners BJ,
BD, BGS, PS and BT are given in Table 5.5. Note that without preconditioning,
Table 5.4
The densities of the preconditioners, i.e., the total number of nonzeros in the blocks appearing
in the preconditioning matrices divided by the number of nonzeros in the corresponding generator
matrices.

Matrix      K     BJ     BD    BGS     PS     BT
mutex09     2   1.30   1.52   1.48   2.00   1.70
            4   0.78   1.24   1.06   1.99   1.52
            8   0.51   1.18   0.86   2.07   1.52
           16   0.36   1.14   0.73   2.11   1.51
           32   0.30   1.16   0.69   2.17   1.55
mutex12     2   1.87   2.06   2.06   2.53   2.25
            4   0.83   1.21   1.14   1.92   1.51
            8   0.45   0.96   0.82   1.82   1.33
           16   0.29   0.90   0.70   1.83   1.31
           32   0.23   0.89   0.66   1.85   1.32
ncd07       2   1.09   1.10   1.11   1.28   1.11
            4   1.18   1.19   1.21   1.40   1.21
            8   1.11   1.13   1.16   1.38   1.17
           16   1.05   1.07   1.11   1.37   1.13
           32   0.99   1.03   1.08   1.38   1.12
ncd10       2   1.08   1.08   1.09   1.26   1.09
            4   1.09   1.10   1.11   1.30   1.12
            8   1.05   1.06   1.09   1.29   1.10
           16   1.04   1.06   1.09   1.33   1.11
           32   0.99   1.01   1.06   1.32   1.08
qnatm06     2   2.94   2.95   2.95   3.13   2.96
            4   2.58   2.61   2.60   2.81   2.63
            8   2.46   2.52   2.50   2.76   2.55
           16   2.25   2.33   2.30   2.63   2.39
           32   2.10   2.23   2.18   2.59   2.31
qnatm07     2   2.97   2.98   2.98   3.15   2.98
            4   2.52   2.54   2.53   2.73   2.56
            8   2.48   2.52   2.51   2.75   2.55
           16   2.34   2.40   2.38   2.67   2.44
           32   2.20   2.30   2.27   2.62   2.36
tcomm16     2   3.03   3.03   3.03   3.24   3.03
            4   2.45   2.45   2.46   2.67   2.46
            8   2.09   2.09   2.10   2.33   2.10
           16   1.91   1.93   1.94   2.20   1.95
           32   1.76   1.78   1.80   2.10   1.82
tcomm20     2   2.93   2.93   2.93   3.14   2.93
            4   2.22   2.23   2.23   2.44   2.23
            8   2.13   2.14   2.14   2.37   2.15
           16   2.01   2.02   2.03   2.28   2.04
           32   1.83   1.84   1.86   2.16   1.88
twod08      2   2.24   2.24   2.24   2.49   2.24
            4   2.26   2.26   2.27   2.53   2.27
            8   2.98   2.98   2.99   3.26   2.99
           16   2.65   2.66   2.66   2.95   2.67
           32   2.70   2.71   2.72   3.03   2.73
twod10      2   4.64   4.64   4.64   4.89   4.64
            4   4.60   4.60   4.61   4.86   4.61
            8   5.06   5.06   5.06   5.33   5.07
           16   4.92   4.92   4.92   5.20   4.93
           32   4.24   4.25   4.25   4.53   4.26
GMRES(50) converges only for the mutex matrices. In all instances, preconditioned
GMRES(50) with the PS and the proposed BT preconditioners converged under the
stopping criterion given above. Preconditioned GMRES(50) with BJ, BD, and BGS
did not converge in, respectively, 103, 141, and 8 of the 500 instances. For the
converged instances with the BGS, PS, and BT preconditioners, the largest of the
ℓ1-norms of the residuals of the approximate solutions returned by preconditioned
GMRES(50) was less than 1.1e-11. With the BJ and BD preconditioners, the largest
such ℓ1-norm was 4.6e-10.
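The setup above can be reproduced with a compact right-preconditioned restarted GMRES; the following sketch (our Python code, not the authors' Matlab implementation) uses the same restart length, iteration cap, initial guess, and stopping test:

```python
import numpy as np

def gmres_right(A, b, prec, x0, restart=50, max_restarts=5, rtol=1e-10):
    """Restarted GMRES with right preconditioning: the Krylov space is
    built for A * prec(.), so the true residual b - A x is monitored."""
    x = x0.copy()
    r0_norm = np.linalg.norm(b - A @ x0)
    for _ in range(max_restarts):
        r = b - A @ x
        beta = np.linalg.norm(r)
        if beta < rtol * r0_norm:       # ||r_k||_2 / ||r_0||_2 < rtol
            break
        m = restart
        V = np.zeros((len(b), m + 1))
        H = np.zeros((m + 1, m))
        V[:, 0] = r / beta
        for j in range(m):              # Arnoldi process on A * M^{-1}
            w = A @ prec(V[:, j])
            for i in range(j + 1):      # modified Gram-Schmidt
                H[i, j] = V[:, i] @ w
                w -= H[i, j] * V[:, i]
            H[j + 1, j] = np.linalg.norm(w)
            if H[j + 1, j] < 1e-14:     # lucky breakdown
                m = j + 1
                break
            V[:, j + 1] = w / H[j + 1, j]
        e1 = np.zeros(m + 1)
        e1[0] = beta
        y = np.linalg.lstsq(H[: m + 1, :m], e1, rcond=None)[0]
        x = x + prec(V[:, :m] @ y)      # map back through the preconditioner
    return x

def uniform_start(N):
    """Initial guess used in the Markov chain runs:
    the one-vector normalized to unit 1-norm."""
    x0 = np.ones(N)
    return x0 / np.linalg.norm(x0, 1)
```

For the singular systems Ax = 0 one takes b = 0 and this starting vector; the relative residual test then measures the reduction of ‖A x_k‖ against ‖A x_0‖.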
As seen from Table 5.5, the PS and BT preconditioners perform consistently
better than the BJ, BD, and BGS preconditioners. The proposed BT preconditioner
performs almost as well as the PS preconditioner, and it outperforms the BGS one
by a factor of two, on the average, in terms of iteration counts, at the expense of a
slight increase in memory requirements (see Table 5.4).
We close this section by discussing running times for GMRES. The running times
are measured using Matlab’s cputime command, and these measurements are given
in Tables 5.6 and 5.7 in units of seconds. In Table 5.6, the total time for GMRES
without preconditioning is the total time spent in performing the GMRES iterations.
In both of the tables, the total time for the preconditioned GMRES is dissected into
the preconditioner construction and the solve phases. We first discuss the case of
mutex matrices, since all the preconditioners lead to convergence for these matrices.
In all partitioning instances of the mutex matrices, the solve phase time with the
proposed BT preconditioner is less than those with the other preconditioners. On
the other hand, the total running time of the BGS preconditioner is always the minimum,
except for the K = 32-way partitionings of the two mutex matrices, in which case BJ gives
the minimum total running time. We observe that the mutex matrices are the worst
case for the construction of the BD, PS, and BT preconditioners: since the size of the
separator set is very large already for K = 2, forming the Schur complement is
very time-consuming. Note that PS and BT also suffer from the cost of copying the
large A12 blocks (and A21 in PS).
Table 5.7 contains the running times of the preconditioned GMRES with the BGS,
PS, and BT preconditioners for the larger matrices in each matrix family. As seen
from the table, for these matrices (whose partitions have small separators) the preconditioner construction phase times are always smaller than the solve phase times with
all preconditioners. Although the construction phase times with the BGS preconditioner are always smaller than those with the BT preconditioner, the BT preconditioner
is faster than BGS in all instances. Note that the ith iteration of GMRES after a
restart requires i inner product computations with vectors of length N [3]. Therefore,
the performance gains in the solve phase with the BT preconditioners are not only
due to the savings in preconditioner solves and matrix-vector multiplies, but also due
to the savings in the inner product computations. Recall from Table 5.5 that the
numbers of iterations with the BT and PS preconditioners are almost the same.
However, the BT preconditioner is always faster than the PS preconditioner.
6. Conclusions. We have investigated block triangular preconditioning strategies for M -matrices, with an emphasis on the singular systems arising from Markov
chain modeling. These preconditioners require a 2 × 2 block partitioning and suitable
Schur complement approximations. The resulting preconditioner can be applied exactly or inexactly, leading to several possible variants, including block diagonal ones.
Numerical experiments with preconditioned GMRES using test matrices from
MARCA allowed us to identify the most efficient variant and to rule out some of the
others. Block triangular preconditioning with inexact solves obtained by means of
(local) ILUTH factorizations was found to be superior to several other approaches,
Table 5.5
Average number of iterations to reduce the ℓ2-norm of the initial residual by ten orders of
magnitude using GMRES(50) with at most 5 restarts. The number 250 means that the method did
not converge in 250 iterations.

Matrix    GMRES    K     BJ    BD   BGS    PS    BT
mutex09      97    2     24    24    13     9     9
                   4     27    27    14     9     9
                   8     28    28    14     8     9
                  16     29    30    15     9     9
                  32     29    30    15     9     9
mutex12      91    2     26    27    14     8    10
                   4     28    29    15     8     9
                   8     28    29    15     8     9
                  16     29    31    16     7     9
                  32     30    31    16     7     8
ncd07       250    2     40    68    23    12    12
                   4    212   250    87    13    13
                   8    222   250   101    15    15
                  16    243   250   129    16    16
                  32    250   250   192    18    18
ncd10       250    2    122   168    57    14    15
                   4    160   250    47    15    15
                   8    244   250   105    15    15
                  16    250   250   145    18    17
                  32    250   250   215    19    19
qnatm06     250    2     60    60    38    36    36
                   4     78    83    41    36    36
                   8    104   119    46    39    39
                  16    177   181    55    43    43
                  32    223   250    83    46    47
qnatm07     250    2     51    56    40    39    39
                   4     83    91    44    41    41
                   8    110   138    51    44    44
                  16    163   186    74    47    47
                  32    232   246    92    55    56
tcomm16     250    2     30    30    18    16    16
                   4     48    51    27    21    21
                   8    112   131    39    29    30
                  16    250   250    85    42    42
                  32    250   250   105    46    46
tcomm20     250    2     33    34    20    18    18
                   4     55    61    29    23    23
                   8    132   157    43    32    32
                  16    247   250    94    44    44
                  32    250   250   221    96    98
twod08      250    2     28    27    15    10    10
                   4     41    40    22    13    13
                   8     49    48    25    18    18
                  16     59    61    29    24    24
                  32     92    93    36    29    30
twod10      250    2     38    38    21    18    18
                   4     47    47    26    21    21
                   8     58    58    30    24    25
                  16     77    78    35    28    29
                  32     94    94    39    32    32
including block diagonal, block Gauss–Seidel, and product splitting preconditioning.
Furthermore, the numerical experiments indicate that, for most problems, the number
of iterations grows slowly with the number of parts (subdomains). This suggests that
Table 5.6
Running times (in seconds) for GMRES(50) without preconditioning (GMRES column) and
with BJ, BD, BGS, PS, and BT preconditioning for the mutex matrices.

Matrix    GMRES    K       Preconditioner construction                    Solve
          solve          BJ     BD    BGS     PS     BT        BJ     BD    BGS     PS     BT
mutex09     7.0    2   3.55   4.28   3.60   4.53   4.30      3.55   3.94   1.82   1.96   1.42
                   4   1.39   2.84   1.52   3.33   2.95      3.45   4.23   1.82   2.37   1.51
                   8   0.78   2.92   1.04   3.67   3.16      3.19   4.45   1.99   2.86   1.72
                  16   0.51   3.15   1.06   4.44   3.62      3.22   4.80   2.60   3.94   2.07
                  32   0.48   3.28   1.45   5.47   4.16      2.95   4.83   3.63   5.54   2.82
mutex12    27.2    2  17.75  20.61  17.95  21.75  20.72     17.69  19.42   9.18   8.07   7.23
                   4   5.06  10.84   5.49  12.63  11.10     14.76  18.59   8.05   8.29   6.21
                   8   2.44  11.49   3.37  14.58  12.36     12.97  17.87   7.79   9.98   6.46
                  16   1.74  11.05   3.42  16.11  12.69     11.70  18.89   9.47  12.13   7.47
                  32   1.59  11.11   4.62  19.81  13.97     11.42  18.83  13.35  17.92   9.13
Table 5.7
Running times (in seconds) for GMRES(50) with BGS, PS, and BT preconditioners for the
larger matrices in each family.

Matrix      K   Preconditioner construction            Solve
                  BGS     PS     BT          BGS      PS     BT
ncd10       2    1.18   1.93   1.68        19.40    5.39   4.13
            4    1.04   1.99   1.59        15.85    6.16   4.01
            8    1.01   2.31   1.58        33.95    7.74   3.85
           16    1.01   3.04   1.67        45.84   12.34   4.30
           32    1.09   4.55   2.00        71.01   20.82   5.03
qnatm07     2    3.68   4.18   4.04        11.43   13.62  11.20
            4    2.34   2.98   2.71        11.69   15.12  11.02
            8    1.66   2.53   2.04        13.23   18.73  11.23
           16    1.34   2.75   1.84        18.07   26.62  12.13
           32    1.30   3.72   1.96        23.85   46.35  14.73
tcomm20     2    0.10   0.13   0.11         0.43    0.50   0.40
            4    0.10   0.10   0.10         0.63    0.74   0.50
            8    0.10   0.18   0.10         0.93    1.28   0.65
           16    0.10   0.20   0.10         1.98    2.47   0.97
           32    0.10   0.30   0.10         5.01    8.30   2.21
twod10      2    7.04   7.92   7.64        10.46   10.90   8.89
            4    6.65   7.55   7.15        12.85   14.32  10.43
            8    4.95   6.47   5.64        14.41   18.83  11.17
           16    4.09   6.43   4.89        16.78   30.60  13.00
           32    3.02   7.17   4.18        18.27   53.05  13.82
the inexact block triangular preconditioner should perform very well in a parallel
implementation.
Acknowledgment. We thank Billy Stewart for making the MARCA software
available.
REFERENCES
[1] O. Axelsson, Iterative Solution Methods, Cambridge University Press, New York, NY, USA,
1994.
[2] C. Aykanat, A. Pınar, and Ü. V. Çatalyürek, Permuting sparse rectangular matrices into
block-diagonal form, SIAM J. Sci. Comput., 25 (2004), pp. 1860–1879.
[3] R. Barrett, M. Berry, T. F. Chan, J. Demmel, J. Donato, J. Dongarra, V. Eijkhout,
R. Pozo, C. Romine, and H. A. van der Vorst, Templates for the Solution of Linear
Systems: Building Blocks for Iterative Methods, SIAM, Philadelphia, PA, 1994.
[4] M. Benzi and T. Dayar, The arithmetic mean method for finding the stationary vector of
Markov chains, Parallel Algorithms Appl., 6 (1995), pp. 25–37.
[5] M. Benzi, G. H. Golub, and J. Liesen, Numerical solution of saddle point problems, Acta
Numerica, 14 (2005), pp. 1–137.
[6] M. Benzi, F. Sgallari, and G. Spaletta, A parallel block projection method of the Cimmino
type for finite Markov chains, in Computations with Markov Chains, W. J. Stewart, ed.,
Kluwer Academic Publishers, Boston/London/Dordrecht, 1995, pp. 65–80.
[7] M. Benzi and D. B. Szyld, Existence and uniqueness of splittings for stationary iterative
methods with applications to alternating methods, Numer. Math., 76 (1997), pp. 309–321.
[8] M. Benzi, D. B. Szyld, and A. van Duin, Orderings for incomplete factorization preconditioning of nonsymmetric problems, SIAM J. Sci. Comput., 20 (1999), pp. 1652–1670.
[9] M. Benzi and M. Tůma, A parallel solver for large-scale Markov chains, Appl. Numer. Math.,
41 (2002), pp. 135–153.
[10] M. Benzi and B. Uçar, Product preconditioning for Markov chain problems, in Proceedings of
the 2006 Markov Anniversary Meeting, A. N. Langville and W. J. Stewart, eds., Raleigh,
NC, 2006, (Charleston, SC), Boson Books, pp. 239–256.
[11] A. Berman and R. J. Plemmons, Nonnegative Matrices in the Mathematical Sciences, Academic Press, 1979. Reprinted by SIAM, Philadelphia, PA, 1994.
[12] P. Buchholz, M. Fischer, and P. Kemper, Distributed steady state analysis using Kronecker
algebra, in Numerical Solutions of Markov Chains (NSMC’99), B. Plateau, W. J. Stewart,
and M. Silva, eds., Prensas Universitarias de Zaragoza, Zaragoza, Spain, 1999, pp. 76–95.
[13] T. N. Bui and C. Jones, Finding good approximate vertex and edge partitions is NP hard,
Inform. Process. Lett., 42 (1992), pp. 153–159.
[14] T. N. Bui and C. Jones, A heuristic for reducing fill in sparse matrix factorization, in Proc.
6th SIAM Conf. Parallel Processing for Scientific Computing, SIAM, Philadelphia, PA,
1993, pp. 445–452.
[15] T. Dayar and W. J. Stewart, Comparison of partitioning techniques for two-level iterative
solvers on large, sparse Markov chains, SIAM J. Sci. Comput., 21 (2000), pp. 1691–1705.
[16] P. Fernandes, B. Plateau, and W. J. Stewart, Efficient descriptor-vector multiplications
in stochastic automata networks, J. ACM, 45 (1998), pp. 381–414.
[17] R. W. Freund and F. Jarre, A QMR-based interior-point algorithm for solving linear programs, Math. Program., 76 (1996), pp. 183–210.
[18] M. Hagemann and O. Schenk, Weighted matchings for preconditioning symmetric indefinite
linear systems, SIAM J. Sci. Comput., 28 (2006), pp. 403–420.
[19] B. Hendrickson and R. Leland, A multilevel algorithm for partitioning graphs, in Supercomputing’95: Proceedings of the 1995 ACM/IEEE conference on Supercomputing (CDROM),
New York, NY, USA, 1995, ACM Press, p. 28.
[20] B. Hendrickson and E. Rothberg, Improving the run time and quality of nested dissection
ordering, SIAM J. Sci. Comput., 20 (1998), pp. 468–489.
[21] I. C. F. Ipsen, A note on preconditioning nonsymmetric matrices, SIAM J. Sci. Comput., 23
(2001), pp. 1050–1051.
[22] M. Jarraya and D. El Baz, Asynchronous iterations for the solution of Markov systems,
in Numerical Solutions of Markov Chains (NSMC’99), B. Plateau, W. J. Stewart, and
M. Silva, eds., Prensas Universitarias de Zaragoza, Zaragoza, Spain, 1999, pp. 335–338.
[23] G. Karypis and V. Kumar, A fast and high quality multilevel scheme for partitioning irregular
graphs, SIAM J. Sci. Comput., 20 (1998), pp. 359–392.
[24] G. Karypis and V. Kumar, MeTiS: A software package for partitioning unstructured graphs,
partitioning meshes, and computing fill-reducing orderings of sparse matrices version 4.0,
University of Minnesota, Department of Computer Science/Army HPC Research Center,
Minneapolis, MN 55455, September 1998.
[25] W. J. Knottenbelt and P. G. Harrison, Distributed disk-based solution techniques for large
Markov models, in Numerical Solutions of Markov Chains (NSMC’99), B. Plateau, W. J.
Stewart, and M. Silva, eds., Prensas Universitarias de Zaragoza, Zaragoza, Spain, 1999,
pp. 58–75.
[26] J. A. Meijerink and H. A. van der Vorst, An iterative solution method for linear systems of
which the coefficient matrix is a symmetric M-matrix, Math. Comp., 31 (1977), pp. 148–
162.
[27] C. D. Meyer, Stochastic complementation, uncoupling Markov chains, and the theory of nearly
reducible systems, SIAM Rev., 31 (1989), pp. 240–272.
[28] V. Migallón, J. Penadés, and D. B. Szyld, Experimental studies of parallel iterative solutions of Markov chains with block partitions, in Numerical Solutions of Markov Chains
(NSMC'99), B. Plateau, W. J. Stewart, and M. Silva, eds., Prensas Universitarias de
Zaragoza, Zaragoza, Spain, 1999, pp. 96–110.
[29] M. F. Murphy, G. H. Golub, and A. J. Wathen, A note on preconditioning for indefinite
linear systems, SIAM J. Sci. Comput., 21 (2000), pp. 1969–1972.
[30] B. Philippe, Y. Saad, and W. J. Stewart, Numerical methods in Markov chain modeling,
Oper. Res., 40 (1992), pp. 1156–1179.
[31] A. Pınar and B. Hendrickson, Partitioning for complex objectives, in Proceedings of the 15th
International Parallel and Distributed Processing Symposium (CDROM), Washington, DC,
USA, 2001, IEEE Computer Society, p. 121.
[32] P. K. Pollet and S. E. Stewart, An efficient procedure for computing quasi-stationary
distributions of Markov chains with sparse transition structure, Adv. in Appl. Probab., 26
(1994), pp. 68–79.
[33] Y. Saad, Preconditioned Krylov subspace methods for the numerical solution of Markov chains,
in Computations with Markov Chains, W. J. Stewart, ed., Kluwer Academic Publishers,
Boston/London/Dordrecht, 1995, pp. 65–80.
[34] Y. Saad, Iterative Methods for Sparse Linear Systems, 2nd ed., SIAM, Philadelphia, PA, 2003.
[35] Y. Saad and M. H. Schultz, GMRES: A generalized minimal residual algorithm for solving
nonsymmetric linear systems, SIAM J. Sci. Stat. Comput., 7 (1986), pp. 856–869.
[36] H. Schneider, Theorems on M -splittings of a singular M -matrix which depend on graph structure, Linear Algebra Appl., 58 (1984), pp. 407–424.
[37] W. J. Stewart, MARCA Models: A collection of Markov chain models. URL
http://www.csc.ncsu.edu/faculty/stewart/MARCA_Models/MARCA_Models.html.
[38] W. J. Stewart, Introduction to the Numerical Solution of Markov Chains, Princeton University Press, Princeton, NJ, 1994.
[39] H. A. van der Vorst, Bi-CGSTAB: A fast and smoothly converging variant of Bi-CG for the
solution of non-symmetric linear systems, SIAM J. Sci. Comput., 12 (1992), pp. 631–644.
[40] R. S. Varga, Matrix Iterative Analysis, Prentice-Hall, Englewood Cliffs, New Jersey, 1962.
Second edition, revised and expanded, Springer, Berlin, Heidelberg, New York, 2000.
[41] Z. I. Woźnicki, Nonnegative splitting theory, Japan. J. Indust. Appl. Math., 11 (1994), pp. 289–
342.