A tutorial on:
Iterative methods for Sparse Linear Systems
Yousef Saad
University of Minnesota
Computer Science and Engineering
CIMPA – Tlemcen – May 14-21

Outline
Part 1
• Sparse matrices and sparsity
• Basic iterative techniques
• Projection methods
• Krylov subspace methods
Part 2
• Preconditioned iterations
• Preconditioning techniques
• Multigrid methods
Part 3
• Parallel implementations
• Parallel Preconditioners
• Software
INTRODUCTION TO SPARSE MATRICES

Typical Problem:
Physical Model
↓
Nonlinear PDEs
↓
Discretization
↓
Linearization (Newton)
↓
Sequence of Sparse Linear Systems Ax = b
Introduction: Linear System Solvers

➤ Problem considered: linear systems
Ax = b
➤ Can view the problem from somewhat different angles:
• A discretized problem coming from a PDE
• An algebraic system of equations [ignore origin]
• Sometimes a system of equations where A is not explicitly available

[Diagram: solver landscape for Ax = b – general-purpose approaches (direct sparse solvers, iterative methods, preconditioned Krylov) versus specialized ones (fast Poisson solvers, multigrid methods), illustrated on −∆u = f + bc.]
➤ Much of recent work on solvers has focussed on:
(1) Parallel implementation – scalable performance
(2) Improving robustness, developing more general preconditioners

A few observations

➤ Problems are getting harder for sparse direct methods (more 3-D models, much bigger problems, ...)
➤ Problems are also getting difficult for iterative methods. Cause: more complex models – away from Poisson.
➤ Researchers in iterative methods are borrowing techniques from direct methods: → preconditioners
➤ The inverse is also happening: direct methods are being adapted for use as preconditioners.
What are sparse matrices?

Common definition: “... matrices that allow special techniques to take advantage of the large number of zero elements and the structure.”
A few applications of sparse matrices: structural engineering, reservoir simulation, electrical networks, optimization problems, ...
Goals: much less storage and work than dense computations.
Observation: A^{-1} is usually dense, but L and U in the LU factorization may be reasonably sparse (if a good technique is used).

Nonzero patterns of a few sparse matrices
[Figures of nonzero patterns:
ARC130: unsymmetric matrix from laser problem (A.R. Curtis, Oct. 1974);
SHERMAN5: fully implicit black-oil simulator, 16 × 23 × 3 grid, 3 unknowns;
PORES3: unsymmetric matrix from PORES;
BP_1000: unsymmetric basis from LP problem BP.]

➤ Two types of matrices: structured (e.g. SHERMAN5) and unstructured (e.g. BP_1000).
➤ Main goal of sparse matrix techniques: to perform standard matrix computations economically, i.e., without storing the zeros of the matrix.
➤ Example: to add two square dense matrices of size n requires O(n²) operations. To add two sparse matrices A and B requires O(nnz(A) + nnz(B)), where nnz(X) = number of nonzero elements of a matrix X (see the sketch below).
➤ For typical finite element / finite difference matrices, the number of nonzero elements is O(n).
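The O(nnz) point above can be checked directly with a general-purpose sparse library. The following is a minimal sketch of my own (not from the tutorial; matrix sizes and densities are arbitrary): two random matrices are stored in CSR format, and adding them touches only the stored nonzeros.

import scipy.sparse as sp

n = 100_000
# roughly 5 nonzeros per row, stored in Compressed Sparse Row (CSR) format
A = sp.random(n, n, density=5.0 / n, format="csr", random_state=0)
B = sp.random(n, n, density=5.0 / n, format="csr", random_state=1)

C = A + B                    # work and storage ~ O(nnz(A) + nnz(B)), not O(n^2)
print(A.nnz, B.nnz, C.nnz)   # numbers of stored nonzero elements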
Graph Representations of Sparse Matrices

➤ Graph theory is a fundamental tool in sparse matrix techniques. The graph G = (V, E) of an n × n matrix A is defined by
Vertices V = {1, 2, . . . , n};
Edges E = {(i, j) | aij ≠ 0}.
➤ The graph is undirected if the matrix has a symmetric structure: aij ≠ 0 iff aji ≠ 0.

Example: adjacency graph of a sparse matrix [figure: star pattern of A and the corresponding graph].

Example: for any matrix A, what is the graph of A²? [Interpret in terms of paths in the graph of A.]

Direct versus iterative methods

Background. Two types of methods:
➤ Direct methods: based on sparse Gaussian elimination, sparse Cholesky, ...
➤ Iterative methods: compute a sequence of iterates which converge to the solution – preconditioned Krylov methods, ...

Remark: these two classes of methods have always been in competition.
➤ 40 years ago solving a system with n = 10,000 was a challenge.
➤ Now you can solve this in < 1 sec. on a laptop.
➤ Sparse direct methods made huge gains in efficiency. As a result they are very competitive for 2-D problems.
➤ 3-D problems lead to more challenging systems [inherent to the underlying graph].
➤ Problems with many unknowns per grid point are similar to 3-D problems.

Remarks:
• No robust ‘black-box’ iterative solvers.
• Robustness often conflicts with efficiency.
• However, the situation has improved in the last ≈ decade.
• The line between direct and iterative solvers is blurring.

Direct Sparse Matrix Techniques

Principle of sparse matrix techniques: store only the nonzero elements of A. Try to minimize computations and (perhaps more importantly) storage.
➤ Difficulty in Gaussian elimination: fill-in.

Trivial example: an ‘arrow’ matrix with a full first row and first column (plus the diagonal) fills in completely during Gaussian elimination.
➤ Reorder the equations and unknowns in the order N, N − 1, ..., 1: the arrow now points the other way (full last row and column), and A stays sparse during Gaussian elimination – i.e., no fill-in. A small numerical illustration follows.
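The sketch below (my own illustration, under the assumption that SuperLU is run with its natural ordering so the effect of the row/column order is visible) reproduces the arrow-matrix remark: the same matrix is factored in the original and in the reversed ordering, and the fill in L and U is compared.

import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

n = 200
# Arrow pointing the "wrong" way: full first row and first column + diagonal
A = sp.lil_matrix((n, n))
A.setdiag(4.0)
A[0, :] = 1.0
A[:, 0] = 1.0
A = A.tocsr()

# Same matrix with equations/unknowns reordered n, n-1, ..., 1
p = np.arange(n)[::-1]
B = A[p][:, p]

for name, M in [("original", A.tocsc()), ("reversed", B.tocsc())]:
    lu = spla.splu(M, permc_spec="NATURAL")    # no fill-reducing reordering
    print(name, " nnz(A) =", M.nnz, " nnz(L)+nnz(U) =", lu.L.nnz + lu.U.nnz)
# The original ordering fills in almost completely; the reversed one keeps L, U sparse.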
Reorderings and graphs

➤ Let π = {i1, · · · , in} be a permutation.
➤ Aπ,∗ = {aπ(i),j}i,j=1,...,n = matrix A with its i-th row replaced by row number π(i).
➤ A∗,π = matrix A with its j-th column replaced by column π(j).
➤ Define Pπ = Iπ,∗ = “permutation matrix”. Then:
(1) Each row (column) of Pπ consists of zeros and exactly one “1”.
(2) Aπ,∗ = Pπ A
(3) Pπ Pπ^T = I
(4) A∗,π = A Pπ^T

➤ Finding the best ordering to minimize fill-in is NP-complete.
➤ A number of heuristics have been developed. Among the best known:
• Minimum degree ordering (Tinney Scheme 2)
• Nested Dissection ordering
• Approximate Minimum Degree, ...

Consider now: A′ = Aπ,π = Pπ A Pπ^T
➤ Entry (i, j) of A′ is exactly the entry in position (π(i), π(j)) of A, i.e., a′ij = aπ(i),π(j).
➤ General picture: (i, j) ∈ E_{A′} ⇐⇒ (π(i), π(j)) ∈ E_A.

Example: a 9 × 9 ‘arrow’ matrix and its adjacency graph [figure: node 1 at the center connected to nodes 2–9].
[Figure: graph and matrix after permuting the nodes in reverse order.]
Cuthill-McKee & reverse Cuthill-McKee

➤ A class of reordering techniques proceeds by levels in the graph.
➤ Related to Breadth First Search (BFS) traversal in graph theory.
➤ Idea of BFS is to visit the nodes by ‘levels’. Level 0 = level of the starting node.
➤ Start with a node, visit its neighbors, then the (unmarked) neighbors of its neighbors, etc.
[Figure: the reverse-ordered 9-node arrow graph of the previous example.]
Implementation using levels

Example: [figure: a small graph on the nodes A, B, C, D, E, F, G, H, I, K]
BFS from node A:
Level 0: A
Level 1: B, C
Level 2: E, D, H
Level 3: I, K, E, F, G, H

Algorithm BFS(G, v) – by level sets – (a Python transcription is sketched below)
• Initialize S = {v}, seen = 1; Mark v;
• While seen < n Do
  – Snew = ∅;
  – For each node v in S do
    ∗ For each unmarked w in adj(v) do
      · Add w to Snew;
      · Mark w;
      · seen++;
  – S := Snew
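A direct transcription of the level-set BFS above into Python. This is a sketch of my own; the adjacency list and node names in the usage example are hypothetical, loosely modeled on the slide's drawing.

def bfs_levels(adj, start):
    """Return the list of level sets produced by BFS from `start`."""
    marked = {start}
    levels = [[start]]
    S = [start]
    while len(marked) < len(adj):
        S_new = []
        for v in S:
            for w in adj[v]:
                if w not in marked:
                    marked.add(w)
                    S_new.append(w)
        if not S_new:          # disconnected graph: remaining nodes unreachable
            break
        levels.append(S_new)
        S = S_new
    return levels

# Small hypothetical example
adj = {
    "A": ["B", "C"], "B": ["A", "D"], "C": ["A", "E"],
    "D": ["B", "E"], "E": ["C", "D", "F"], "F": ["E"],
}
print(bfs_levels(adj, "A"))    # [['A'], ['B', 'C'], ['D', 'E'], ['F']]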
A few properties of Breadth-First-Search

➤ If G is a connected undirected graph, then each vertex will be visited once and each edge will be inspected at least once.
➤ Therefore, for a connected undirected graph, the cost of BFS is O(|V| + |E|).
➤ Distance = level number; for each node v we have:
min dist(s, v) = level number(v) = depth_T(v)
➤ Several reordering algorithms are based on variants of Breadth-First-Search.

Cuthill-McKee ordering

The algorithm proceeds by levels. Same as BFS except: in each level, nodes are ordered by increasing degree.

Example [figure: a small graph on the nodes A–G]:

Level  Nodes     Deg.     Order
0      A         2        A
1      B, C      4, 3     C, B
2      D, E, F   3, 4, 2  F, D, E
3      G         2        G
ALGORITHM 1: Cuthill-McKee ordering
0.  Find an initial node for the traversal
1.  Initialize S = {v}, seen = 1, π(seen) = v; Mark v;
2.  While seen < n Do
3.    Snew = ∅;
4.    For each node v in S, going from lowest to highest degree, Do:
5.      π(++seen) = v;
6.      For each unmarked w in adj(v) do
7.        Add w to Snew;
8.        Mark w;
9.      EndDo
10.   EndDo
11.   S := Snew
12. EndWhile

Reverse Cuthill-McKee ordering

➤ The Cuthill-McKee ordering has a tendency to create small arrow matrices (going the wrong way):
[Figure: spy plots of the original matrix and of the CM-reordered matrix, nz = 377.]
➤ Idea: take the reverse ordering.
➤ Reverse Cuthill-McKee ordering (RCM). (A usage sketch with a library RCM routine follows below.)
[Figure: spy plot of the matrix with the RCM ordering, nz = 377.]

Nested Dissection ordering

➤ The idea of divide and conquer – recursively divide the graph in two using a separator.
[Figure: a small graph (nodes 1–7) split in two by a separator.]
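The sketch below (my own; the banded-plus-random test pattern is arbitrary and not the slides' nz = 377 example) applies SciPy's built-in reverse Cuthill-McKee to a symmetric pattern and compares bandwidths before and after.

import numpy as np
import scipy.sparse as sp
from scipy.sparse.csgraph import reverse_cuthill_mckee

n = 200
A = sp.random(n, n, density=0.02, format="csr", random_state=0)
A = (A + A.T + sp.identity(n)).tocsr()        # symmetric pattern, nonzero diagonal

perm = reverse_cuthill_mckee(A, symmetric_mode=True)
A_rcm = A[perm][:, perm]

def bandwidth(M):
    coo = M.tocoo()
    return int(np.max(np.abs(coo.row - coo.col)))

print("bandwidth before:", bandwidth(A), " after RCM:", bandwidth(A_rcm))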
Nested dissection for a small mesh
[Figure: original grid, first dissection, second dissection, third dissection.]

Nested dissection: cost for a regular mesh
➤ In 2-D consider an n × n problem, N = n²
➤ In 3-D consider an n × n × n problem, N = n³

               2-D          3-D
space (fill)   O(N log N)   O(N^{4/3})
time (flops)   O(N^{3/2})   O(N²)

➤ Significant difference in complexity between 2-D and 3-D.

Ordering techniques for direct methods in practice
➤ In practice: nested dissection (+ variants) is preferred for parallel processing.
➤ Good implementations of the minimum degree algorithm work well in practice. Currently AMD and AMF are the best known implementations/variants.
➤ The best practical reordering algorithms usually combine nested dissection and minimum degree.
BASIC RELAXATION METHODS

Relaxation schemes are based on the decomposition A = D − E − F, where D = diag(A), −E = the strict lower part of A, and −F = its strict upper part.
[Figure: the D, −E, −F parts of A.]

Gauss-Seidel iteration for solving Ax = b:
(D − E) x^(k+1) = F x^(k) + b
→ idea: correct the j-th component of the current approximate solution, j = 1, 2, ..., n, to zero the j-th component of the residual.

Can also define a backward Gauss-Seidel iteration:
(D − F) x^(k+1) = E x^(k) + b
and a symmetric Gauss-Seidel iteration: a forward sweep followed by a backward sweep.

Over-relaxation is based on the decomposition:
ωA = (D − ωE) − (ωF + (1 − ω)D)
→ successive over-relaxation (SOR):
(D − ωE) x^(k+1) = [ωF + (1 − ω)D] x^(k) + ωb

Iteration matrices

Jacobi, Gauss-Seidel, SOR, and SSOR iterations are of the form x^(k+1) = M x^(k) + f, with (see the sketch of the first two below)
• M_Jac = D^{-1}(E + F) = I − D^{-1}A
• M_GS(A) = (D − E)^{-1}F = I − (D − E)^{-1}A
• M_SOR(A) = (D − ωE)^{-1}(ωF + (1 − ω)D) = I − (ω^{-1}D − E)^{-1}A
• M_SSOR(A) = I − (2ω^{-1} − 1)(ω^{-1}D − F)^{-1} D (ω^{-1}D − E)^{-1}A
            = I − ω(2 − ω)(D − ωF)^{-1} D (D − ωE)^{-1}A
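A small sketch of my own of the Jacobi and Gauss-Seidel sweeps above (assumptions: dense NumPy arrays and a matrix with nonzero diagonal; the diagonally dominant test system is arbitrary).

import numpy as np

def jacobi(A, b, x0, iters):
    D = np.diag(A)                       # diag(A)
    x = x0.copy()
    for _ in range(iters):
        x = x + (b - A @ x) / D          # x^{k+1} = x^k + D^{-1} (b - A x^k)
    return x

def gauss_seidel(A, b, x0, iters):
    x = x0.copy()
    for _ in range(iters):
        for j in range(len(b)):          # zero the j-th residual component in turn
            x[j] += (b[j] - A[j, :] @ x) / A[j, j]
    return x

# Usage on a small diagonally dominant system
rng = np.random.default_rng(0)
A = 4 * np.eye(50) + rng.uniform(-1, 1, (50, 50)) / 50
b = rng.standard_normal(50)
x = gauss_seidel(A, b, np.zeros(50), 100)
print(np.linalg.norm(b - A @ x))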
General convergence result

Consider the iteration: x^(k+1) = G x^(k) + f
(1) Assume that ρ(G) < 1. Then I − G is non-singular and the iteration has a fixed point; the iteration converges to it for any f and x^(0).
(2) If the iteration converges for any f and x^(0), then ρ(G) < 1.

Example: Richardson's iteration
x^(k+1) = x^(k) + α(b − A x^(k))
♦ Assume Λ(A) ⊂ R. When does the iteration converge?

➤ Jacobi and Gauss-Seidel converge for diagonally dominant A.
➤ SOR converges for 0 < ω < 2 for SPD matrices.

An observation. Introduction to Preconditioning

➤ The iteration x^(k+1) = M x^(k) + f is attempting to solve (I − M)x = f. Since M is of the form M = I − P^{-1}A, this system can be rewritten as
P^{-1} A x = P^{-1} b
where, for SSOR, we have
P_SSOR = (D − ωE) D^{-1} (D − ωF),
referred to as the SSOR ‘preconditioning’ matrix.
In other words:
Relaxation Scheme ⇐⇒ Preconditioned Fixed-Point Iteration
PROJECTION METHODS FOR LINEAR SYSTEMS

The Problem
We consider the linear system
Ax = b
where A is N × N and can be
• Real symmetric positive definite
• Real nonsymmetric
• Complex
➤ Focus: A is large and sparse, possibly with an irregular structure.

Projection Methods
Initial problem: b − Ax = 0
Given two subspaces K and L of R^N, define the approximate problem:
Find x̃ ∈ K such that b − Ax̃ ⊥ L
➤ Leads to a small linear system (‘projected problem’). This is a basic projection step. Typically, a sequence of such steps is applied.

➤ With a nonzero initial guess x0, the approximate problem is
Find x̃ ∈ x0 + K such that b − Ax̃ ⊥ L
Write x̃ = x0 + δ and r0 = b − Ax0. This leads to a system for δ:
Find δ ∈ K such that r0 − Aδ ⊥ L
Matrix representation:
Let
• V = [v1, . . . , vm] be a basis of K, and
• W = [w1, . . . , wm] be a basis of L.
Then, writing the approximate solution as x̃ = x0 + δ ≡ x0 + V y, where y is a vector of R^m, the Petrov-Galerkin condition yields
W^T (r0 − A V y) = 0
and therefore
x̃ = x0 + V [W^T A V]^{-1} W^T r0
Remark: in practice W^T A V is known from the algorithm and has a simple structure [tridiagonal, Hessenberg, ...].

Prototype Projection Method
Until Convergence Do:
1. Select a pair of subspaces K and L;
2. Choose bases V = [v1, . . . , vm] for K and W = [w1, . . . , wm] for L.
3. Compute
   r ← b − Ax,
   y ← (W^T A V)^{-1} W^T r,
   x ← x + V y.
Operator Form Representation

Let P be the orthogonal projector onto K, and Q the (oblique) projector onto K and orthogonally to L:
P x ∈ K, x − P x ⊥ K
Q x ∈ K, x − Q x ⊥ L
[Figure: the P and Q projectors.]
The approximate problem amounts to solving
Q(b − Ax) = 0, x ∈ K
or, in operator form,
Q(b − A P x) = 0

Question: let x∗ be the exact solution. What accuracy can one expect?
1) We cannot get better accuracy than ‖(I − P)x∗‖2, i.e.,
‖x̃ − x∗‖2 ≥ ‖(I − P)x∗‖2
2) The residual of the exact solution for the approximate problem satisfies:
‖b − Q A P x∗‖2 ≤ ‖Q A (I − P)‖2 ‖(I − P)x∗‖2
Two important particular cases:
1. L = AK. Then ‖b − Ax̃‖2 = min_{z∈K} ‖b − Az‖2
→ class of minimal residual methods: CR, GCR, ORTHOMIN, GMRES, CGNR, ...
2. L = K → class of Galerkin or orthogonal projection methods. When A is SPD, then
‖x∗ − x̃‖_A = min_{z∈K} ‖x∗ − z‖_A.

One-dimensional projection processes
K = span{d} and L = span{e}
Then x̃ ← x + αd, and the Petrov-Galerkin condition r − Aδ ⊥ e yields
α = (r, e)/(Ad, e)
Three popular choices:
(I) Steepest descent.
(II) Residual norm steepest descent.
(III) Minimal residual iteration.
(I) Steepest descent. A is SPD. Take at each step d = r and e = r.
Iteration:
r ← b − Ax,
α ← (r, r)/(Ar, r)
x ← x + αr
➤ Each step minimizes f(x) = ‖x − x∗‖²_A = (A(x − x∗), x − x∗) in the direction −∇f. Convergence is guaranteed if A is SPD. (A small sketch of this one-dimensional projection step follows below.)

(II) Residual norm steepest descent. A is arbitrary (nonsingular). Take at each step d = A^T r and e = Ad.
Iteration:
r ← b − Ax, d = A^T r
α ← ‖d‖²₂ / ‖Ad‖²₂
x ← x + αd
➤ Each step minimizes f(x) = ‖b − Ax‖²₂ in the direction −∇f.
➤ Important note: equivalent to the usual steepest descent applied to the normal equations A^T A x = A^T b.
➤ Converges under the condition that A is nonsingular.
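A sketch of my own of the one-dimensional projection step α = (r, e)/(Ad, e) and its steepest-descent specialization d = e = r (assumptions: dense NumPy arrays, SPD test matrix chosen arbitrarily).

import numpy as np

def projection_step(A, b, x, d, e):
    r = b - A @ x
    alpha = np.dot(r, e) / np.dot(A @ d, e)     # Petrov-Galerkin condition r - A*delta ⟂ e
    return x + alpha * d

def steepest_descent(A, b, x0, iters):
    x = x0.copy()
    for _ in range(iters):
        r = b - A @ x
        x = projection_step(A, b, x, d=r, e=r)  # minimizes ||x - x*||_A along r (A SPD)
    return x

rng = np.random.default_rng(0)
M = rng.standard_normal((40, 40))
A = M @ M.T + 40 * np.eye(40)                   # SPD test matrix
b = rng.standard_normal(40)
x = steepest_descent(A, b, np.zeros(40), 200)
print(np.linalg.norm(b - A @ x))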
(III) Minimal residual iteration. A is positive definite (A + A^T is SPD). Take at each step d = r and e = Ar.
Iteration:
r ← b − Ax,
α ← (Ar, r)/(Ar, Ar)
x ← x + αr
➤ Each step minimizes f(x) = ‖b − Ax‖²₂ in the direction r.
➤ Converges under the condition that A + A^T is SPD.

Krylov Subspace Methods

Principle: projection methods on Krylov subspaces:
Km(A, v1) = span{v1, Av1, · · · , A^{m−1}v1}
• probably the most important class of iterative methods.
• many variants exist, depending on the subspace L.

Simple properties of Km. Let µ = degree of the minimal polynomial of v.
• Km = {p(A)v | p = polynomial of degree ≤ m − 1}
• Km = Kµ for all m ≥ µ. Moreover, Kµ is invariant under A.
• dim(Km) = m iff µ ≥ m.
A little review: the Gram-Schmidt process
→ Goal: given X = [x1, . . . , xm], compute an orthonormal set Q = [q1, . . . , qm] which spans the same subspace.

ALGORITHM 2: Classical Gram-Schmidt
1. For j = 1, ..., m Do:
2.   Compute rij = (xj, qi) for i = 1, . . . , j − 1
3.   Compute q̂j = xj − Σ_{i=1}^{j−1} rij qi
4.   rjj = ‖q̂j‖2. If rjj == 0 exit
5.   qj = q̂j / rjj
6. EndDo

ALGORITHM 3: Modified Gram-Schmidt
1. For j = 1, ..., m Do:
2.   q̂j := xj
3.   For i = 1, . . . , j − 1 Do
4.     rij = (q̂j, qi)
5.     q̂j := q̂j − rij qi
6.   EndDo
7.   rjj = ‖q̂j‖2. If rjj == 0 exit
8.   qj := q̂j / rjj
9. EndDo

Let:
X = [x1, . . . , xm] (n × m matrix)
Q = [q1, . . . , qm] (n × m matrix)
R = {rij} (m × m upper triangular matrix)
Then xj = Σ_{i=1}^{j} rij qi. Result: X = QR.

Arnoldi's Algorithm

➤ Goal: to compute an orthogonal basis of Km.
➤ Input: initial vector v1, with ‖v1‖2 = 1, and m.
➤ At each step:
For j = 1, ..., m do
• Compute w := A vj
• For i = 1, . . . , j, do
    h_{i,j} := (w, vi)
    w := w − h_{i,j} vi
• h_{j+1,j} = ‖w‖2 and v_{j+1} = w / h_{j+1,j}
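A NumPy sketch of my own of the Arnoldi process above (modified Gram-Schmidt variant). It returns V with orthonormal columns and the (m+1) × m Hessenberg matrix H̄ so that A Vm = Vm+1 H̄m when there is no breakdown.

import numpy as np

def arnoldi(A, v1, m):
    n = v1.shape[0]
    V = np.zeros((n, m + 1))
    H = np.zeros((m + 1, m))
    V[:, 0] = v1 / np.linalg.norm(v1)
    for j in range(m):
        w = A @ V[:, j]
        for i in range(j + 1):                  # modified Gram-Schmidt against v_1..v_j
            H[i, j] = np.dot(w, V[:, i])
            w = w - H[i, j] * V[:, i]
        H[j + 1, j] = np.linalg.norm(w)
        if H[j + 1, j] == 0.0:                  # breakdown: Krylov subspace is invariant
            return V[:, : j + 1], H[: j + 1, : j + 1]
        V[:, j + 1] = w / H[j + 1, j]
    return V, H

# Quick check of the Arnoldi relation A V_m = V_{m+1} Hbar_m
rng = np.random.default_rng(0)
A = rng.standard_normal((60, 60))
V, H = arnoldi(A, rng.standard_normal(60), 10)
print(np.linalg.norm(A @ V[:, :-1] - V @ H))    # ~ machine precision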
Result of the orthogonalization process
1. Vm = [v1, v2, ..., vm] is an orthonormal basis of Km.
2. A Vm = Vm+1 H̄m
3. Vm^T A Vm = Hm ≡ H̄m minus its last row.
[Figure: Hm is an m × m upper Hessenberg matrix (zero below the first subdiagonal).]

Arnoldi's Method (Lm = Km)

➤ The Petrov-Galerkin condition, when Lm = Km, shows:
xm = x0 + Vm Hm^{-1} Vm^T r0
➤ Select v1 = r0/‖r0‖2 ≡ r0/β in Arnoldi's algorithm; then:
xm = x0 + β Vm Hm^{-1} e1
➤ Equivalent algorithms:
* FOM [YS, 1981] (above formulation)
* Young and Jea's ORTHORES [1982]
* Axelsson's projection method [1981]

Minimal residual methods (Lm = A Km)

➤ When Lm = A Km, we let Wm ≡ A Vm and obtain:
xm = x0 + Vm [Wm^T A Vm]^{-1} Wm^T r0
➤ Use again v1 := r0/(β := ‖r0‖2) and A Vm = Vm+1 H̄m:
xm = x0 + Vm [H̄m^T H̄m]^{-1} H̄m^T βe1 = x0 + Vm ym
where ym minimizes ‖βe1 − H̄m y‖2 over y ∈ R^m. Hence (Generalized Minimal Residual method (GMRES) [Saad-Schultz, 1983]):
xm = x0 + Vm ym, where ym : min_y ‖βe1 − H̄m y‖2
➤ Equivalent methods: • Axelsson's CGLS • Orthomin (1980) • Orthodir • GCR

Restarting and Truncating

Difficulty: as m increases, storage and work per step increase fast.
First remedy: restarting. Fix the dimension m of the subspace.

ALGORITHM 4: Restarted GMRES (resp. Arnoldi)
1. Start/Restart: Compute r0 = b − Ax0, and v1 = r0/(β := ‖r0‖2).
2. Arnoldi Process: generate H̄m and Vm.
3. Compute ym = Hm^{-1} βe1 (FOM), or ym = argmin_y ‖βe1 − H̄m y‖2 (GMRES)
4. xm = x0 + Vm ym
5. If ‖rm‖2 ≤ ε ‖r0‖2 stop, else set x0 := xm and go to 1.
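Restarted GMRES is available in standard libraries; the following usage sketch of my own runs SciPy's implementation with a restart dimension of 30 on an arbitrary tridiagonal test problem (the tolerance and other keyword defaults depend on the SciPy version, so they are left at their defaults here).

import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

n = 1000
A = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n), format="csr")
b = np.ones(n)

x, info = spla.gmres(A, b, restart=30, maxiter=200)   # restarted GMRES(30)
print("info =", info, "| residual =", np.linalg.norm(b - A @ x))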
Second remedy: truncate the orthogonalization

The formula for vj+1 is replaced by
h_{j+1,j} v_{j+1} = A vj − Σ_{i=j−k+1}^{j} hij vi
→ each vj is made orthogonal only to the previous k vi's.
→ xm is still computed as xm = x0 + Vm Hm^{-1} βe1.
→ It can be shown that this is again an oblique projection process.
➤ IOM (Incomplete Orthogonalization Method) = replace the orthogonalization in FOM by the above truncated (or ‘incomplete’) orthogonalization.

The direct version of IOM [DIOM]:

Writing the LU decomposition of Hm as Hm = Lm Um, we get
xm = x0 + Vm Um^{-1} Lm^{-1} βe1 ≡ x0 + Pm zm
➤ Structure of Lm, Um when k = 3: Lm is unit lower bidiagonal and Um is banded upper triangular with bandwidth k [figure].
➤ With Pm = Vm Um^{-1} = [p1, ..., pm] and zm = Lm^{-1} βe1 = [z_{m−1}; ζm]:
pm = u_{mm}^{-1} [vm − Σ_{i=m−k+1}^{m−1} u_{im} pi]
Result: can update xm at each step:
xm = x_{m−1} + ζm pm

Note: several existing pairs of methods have a similar link: they are based on the LU, or other, factorizations of the Hm matrix.
➤ CG-like formulation of IOM, called DIOM [Saad, 1982]
➤ ORTHORES(k) [Young & Jea '82] is equivalent to DIOM(k)
➤ SYMMLQ [Paige and Saunders, '77] uses the LQ factorization of Hm
➤ Can add partial pivoting to the LU factorization of Hm

The Symmetric Case: Observation

Observe: when A is real symmetric, then in Arnoldi's method
Hm = Vm^T A Vm
must be symmetric. Therefore:
THEOREM. When Arnoldi's algorithm is applied to a (real) symmetric matrix, the matrix Hm is symmetric tridiagonal.
In other words:
1) hij = 0 for |i − j| > 1
2) h_{j,j+1} = h_{j+1,j}, j = 1, . . . , m
The Lanczos algorithm

➤ We can write Hm as the symmetric tridiagonal matrix
Hm = tridiag(βi, αi, β_{i+1})        (1)
with diagonal entries α1, . . . , αm and off-diagonal entries β2, . . . , βm.
The vi's satisfy a three-term recurrence [Lanczos Algorithm]:
β_{j+1} v_{j+1} = A vj − αj vj − βj v_{j−1}
→ simplified version of Arnoldi's algorithm for symmetric systems.
Symmetric matrix + Arnoldi → Symmetric Lanczos

ALGORITHM 5: Lanczos
1. Choose an initial vector v1 of unit norm. Set β1 ≡ 0, v0 ≡ 0
2. For j = 1, 2, . . . , m Do:
3.   wj := A vj − βj v_{j−1}
4.   αj := (wj, vj)
5.   wj := wj − αj vj
6.   β_{j+1} := ‖wj‖2. If β_{j+1} = 0 then Stop
7.   v_{j+1} := wj / β_{j+1}
8. EndDo
Lanczos Method for Linear Systems

➤ Usual orthogonal projection method setting:
• Lm = Km = span{r0, Ar0, . . . , A^{m−1}r0}
• Basis Vm = [v1, . . . , vm] of Km generated by the Lanczos algorithm
➤ Three different possible implementations:
(1) Arnoldi-like; (2) exploit the tridiagonal nature of Hm (DIOM); (3) Conjugate Gradient.

ALGORITHM 6: Lanczos algorithm for linear systems
1. Compute r0 = b − Ax0, β := ‖r0‖2, and v1 := r0/β
2. For j = 1, 2, . . . , m Do:
3.   wj = A vj − βj v_{j−1} (if j = 1 set β1 v0 ≡ 0)
4.   αj = (wj, vj)
5.   wj := wj − αj vj
6.   β_{j+1} = ‖wj‖2. If β_{j+1} = 0 set m := j and go to 9
7.   v_{j+1} = wj / β_{j+1}
8. EndDo
9. Set Tm = tridiag(βi, αi, β_{i+1}), and Vm = [v1, . . . , vm].
10. Compute ym = Tm^{-1}(βe1) and xm = x0 + Vm ym
ALGORITHM 7: D-Lanczos
1. Compute r0 = b − Ax0, ζ1 := β := ‖r0‖2, v1 := r0/β
2. Set λ1 = β1 = 0, p0 = 0
3. For m = 1, 2, . . ., until convergence Do:
4.   Compute w := A vm − βm v_{m−1} and αm = (w, vm)
5.   If m > 1: compute λm = βm / η_{m−1} and ζm = −λm ζ_{m−1}
6.   ηm = αm − λm βm
7.   pm = ηm^{-1} (vm − βm p_{m−1})
8.   xm = x_{m−1} + ζm pm
9.   If xm has converged then Stop
10.  w := w − αm vm
11.  β_{m+1} = ‖w‖2, v_{m+1} = w / β_{m+1}
12. EndDo

The Conjugate Gradient Algorithm (A S.P.D.)

➤ Note: the pi's are A-orthogonal.
➤ The ri's are orthogonal.
➤ And we have xm = x_{m−1} + ξm pm.
So there must be an update of the form:
1. pm = r_{m−1} + βm p_{m−1}
2. xm = x_{m−1} + ξm pm
3. rm = r_{m−1} − ξm A pm

ALGORITHM 8: Conjugate Gradient (a Python sketch follows below)
Start: r0 := b − Ax0, p0 := r0.
Iterate: until convergence do,
  αj := (rj, rj)/(A pj, pj)
  x_{j+1} := xj + αj pj
  r_{j+1} := rj − αj A pj
  βj := (r_{j+1}, r_{j+1})/(rj, rj)
  p_{j+1} := r_{j+1} + βj pj
EndDo
➤ rj = scaling × v_{j+1}. The rj's are orthogonal.
➤ The pj's are A-conjugate, i.e., (A pi, pj) = 0 for i ≠ j.
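A direct NumPy transcription of my own of Algorithm 8 (Conjugate Gradient). It assumes A is SPD and supports the @ product (dense array or SciPy sparse matrix); the stopping tolerance and the 1-D Laplacian test case are arbitrary choices.

import numpy as np
import scipy.sparse as sp

def cg(A, b, x0=None, tol=1e-8, maxiter=1000):
    x = np.zeros_like(b) if x0 is None else x0.copy()
    r = b - A @ x
    p = r.copy()
    rs = np.dot(r, r)
    for _ in range(maxiter):
        Ap = A @ p
        alpha = rs / np.dot(Ap, p)          # (r_j, r_j) / (A p_j, p_j)
        x = x + alpha * p
        r = r - alpha * Ap
        rs_new = np.dot(r, r)
        if np.sqrt(rs_new) <= tol * np.linalg.norm(b):
            break
        beta = rs_new / rs                  # (r_{j+1}, r_{j+1}) / (r_j, r_j)
        p = r + beta * p
        rs = rs_new
    return x

n = 500
A = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n), format="csr")
b = np.ones(n)
x = cg(A, b)
print(np.linalg.norm(b - A @ x))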
METHODS BASED ON LANCZOS BIORTHOGONALIZATION

Lanczos Bi-Orthogonalization
➤ Extension of the symmetric Lanczos algorithm.
➤ Builds a pair of biorthogonal bases for the two subspaces Km(A, v1) and Km(A^T, w1).

ALGORITHM 9: Lanczos Bi-Orthogonalization
1. Choose two vectors v1, w1 such that (v1, w1) = 1.
2. Set β1 = δ1 ≡ 0, w0 = v0 ≡ 0
3. For j = 1, 2, . . . , m Do:
4.   αj = (A vj, wj)
5.   v̂_{j+1} = A vj − αj vj − βj v_{j−1}
6.   ŵ_{j+1} = A^T wj − αj wj − δj w_{j−1}
7.   δ_{j+1} = |(v̂_{j+1}, ŵ_{j+1})|^{1/2}. If δ_{j+1} = 0 Stop
8.   β_{j+1} = (v̂_{j+1}, ŵ_{j+1}) / δ_{j+1}
9.   w_{j+1} = ŵ_{j+1} / β_{j+1}
10.  v_{j+1} = v̂_{j+1} / δ_{j+1}
11. EndDo

➤ Different ways to choose δ_{j+1}, β_{j+1} in lines 7 and 8.
Let Tm = the tridiagonal matrix with diagonal α1, . . . , αm, subdiagonal δ2, . . . , δm, and superdiagonal β2, . . . , βm.
➤ vi ∈ Km(A, v1) and wj ∈ Km(A^T, w1).

If the algorithm does not break down before step m, then the vectors vi, i = 1, . . . , m, and wj, j = 1, . . . , m, are biorthogonal, i.e.,
(vj, wi) = δij, 1 ≤ i, j ≤ m.
Moreover, {vi}_{i=1,...,m} is a basis of Km(A, v1), {wi}_{i=1,...,m} is a basis of Km(A^T, w1), and
A Vm = Vm Tm + δ_{m+1} v_{m+1} e_m^T,
A^T Wm = Wm Tm^T + β_{m+1} w_{m+1} e_m^T,
Wm^T A Vm = Tm.

The Lanczos Algorithm for Linear Systems

ALGORITHM 10: Lanczos Alg. for Linear Systems
1. Compute r0 = b − Ax0 and β := ‖r0‖2
2. Run m steps of the nonsymmetric Lanczos Algorithm, i.e.,
3.   Start with v1 := r0/β, and any w1 such that (v1, w1) = 1
4.   Generate the pair of Lanczos vectors v1, . . . , vm and w1, . . . , wm,
5.   and the tridiagonal matrix Tm from Algorithm 9.
6. Compute ym = Tm^{-1}(βe1) and xm := x0 + Vm ym.

➤ BCG can be derived from the Lanczos Algorithm similarly to CG.
ALGORITHM 11: BiConjugate Gradient (BCG)
1. Compute r0 := b − Ax0.
2. Choose r0∗ such that (r0, r0∗) ≠ 0; set p0 := r0, p0∗ := r0∗
3. For j = 0, 1, . . ., until convergence Do:
4.   αj := (rj, rj∗)/(A pj, pj∗)
5.   x_{j+1} := xj + αj pj
6.   r_{j+1} := rj − αj A pj
7.   r∗_{j+1} := rj∗ − αj A^T pj∗
8.   βj := (r_{j+1}, r∗_{j+1})/(rj, rj∗)
9.   p_{j+1} := r_{j+1} + βj pj
10.  p∗_{j+1} := r∗_{j+1} + βj pj∗
11. EndDo

Quasi-Minimal Residual Algorithm

➤ Recall the relation from the Lanczos algorithm:
A Vm = Vm+1 T̄m, with T̄m = the (m + 1) × m tridiagonal matrix T̄m = [ Tm ; δ_{m+1} e_m^T ].
➤ Let v1 ≡ r0/β and x = x0 + Vm y. The residual norm ‖b − Ax‖2 equals
‖r0 − A Vm y‖2 = ‖βv1 − Vm+1 T̄m y‖2 = ‖Vm+1 (βe1 − T̄m y)‖2
➤ Column-vectors of Vm+1 are not orthogonal (≠ GMRES).
➤ But: it is a reasonable idea to minimize the function J(y) ≡ ‖βe1 − T̄m y‖2.
➤ Quasi-Minimal Residual Algorithm (Freund, 1990).
Transpose-Free Variants

➤ BCG and QMR require a matrix-by-vector product with A and with A^T at each step. The products with A^T do not contribute directly to xm; they only allow to determine the scalars (αj and βj in BCG).
➤ QUESTION: is it possible to bypass the use of A^T?
➤ Motivation: in nonlinear equations, A is often not available explicitly but only via the Frechet derivative:
J(uk)v = [F(uk + ǫv) − F(uk)] / ǫ.

Conjugate Gradient Squared

* Clever variant of BCG which avoids using A^T [Sonneveld, 1984].
In BCG: ri = ρi(A) r0, where ρi = polynomial of degree i.
In CGS: ri = ρi²(A) r0.

➤ Define:
rj = φj(A) r0,  pj = πj(A) r0,  rj∗ = φj(A^T) r0∗,  pj∗ = πj(A^T) r0∗
The scalar αj in BCG is given by
αj = (φj(A)r0, φj(A^T)r0∗) / (A πj(A)r0, πj(A^T)r0∗) = (φj²(A)r0, r0∗) / (A πj²(A)r0, r0∗)
➤ Possible to get a recursion for φj²(A)r0 and πj²(A)r0?

φ_{j+1}(t) = φj(t) − αj t πj(t),
π_{j+1}(t) = φ_{j+1}(t) + βj πj(t)
➤ Square these equalities:
φ²_{j+1}(t) = φj²(t) − 2αj t πj(t)φj(t) + αj² t² πj²(t),
π²_{j+1}(t) = φ²_{j+1}(t) + 2βj φ_{j+1}(t)πj(t) + βj² πj(t)².
➤ Problem: cross terms.

Solution: let φ_{j+1}(t)πj(t) be a third member of the recurrence. For πj(t)φj(t), note:
φj(t)πj(t) = φj(t)(φj(t) + β_{j−1}π_{j−1}(t)) = φj²(t) + β_{j−1}φj(t)π_{j−1}(t).
Result:
φ²_{j+1} = φj² − αj t (2φj² + 2β_{j−1}φjπ_{j−1} − αj t πj²)
φ_{j+1}πj = φj² + β_{j−1}φjπ_{j−1} − αj t πj²
π²_{j+1} = φ²_{j+1} + 2βj φ_{j+1}πj + βj² πj².
Define:
rj = φj²(A)r0,  pj = πj²(A)r0,  qj = φ_{j+1}(A)πj(A)r0
Recurrences become:
r_{j+1} = rj − αj A (2rj + 2β_{j−1}q_{j−1} − αj A pj),
qj = rj + β_{j−1}q_{j−1} − αj A pj,
p_{j+1} = r_{j+1} + 2βj qj + βj² pj.
Define the auxiliary vector dj = 2rj + 2β_{j−1}q_{j−1} − αj A pj.
➤ Sequence of operations to compute the approximate solution, starting with r0 := b − Ax0, p0 := r0, q0 := 0, β0 := 0:
1. αj = (rj, r0∗)/(A pj, r0∗)
2. dj = 2rj + 2β_{j−1}q_{j−1} − αj A pj
3. qj = rj + β_{j−1}q_{j−1} − αj A pj
4. x_{j+1} = xj + αj dj
5. r_{j+1} = rj − αj A dj
6. βj = (r_{j+1}, r0∗)/(rj, r0∗)
7. p_{j+1} = r_{j+1} + βj(2qj + βj pj).

➤ Introduce one more auxiliary vector, uj = rj + β_{j−1}q_{j−1}. Then
dj = uj + qj,
qj = uj − αj A pj,
p_{j+1} = u_{j+1} + βj(qj + βj pj),
➤ and the vector dj is no longer needed.
ALGORITHM 12: Conjugate Gradient Squared
1. Compute r0 := b − Ax0; r0∗ arbitrary.
2. Set p0 := u0 := r0.
3. For j = 0, 1, 2 . . . , until convergence Do:
4.   αj = (rj, r0∗)/(A pj, r0∗)
5.   qj = uj − αj A pj
6.   x_{j+1} = xj + αj(uj + qj)
7.   r_{j+1} = rj − αj A(uj + qj)
8.   βj = (r_{j+1}, r0∗)/(rj, r0∗)
9.   u_{j+1} = r_{j+1} + βj qj
10.  p_{j+1} = u_{j+1} + βj(qj + βj pj)
11. EndDo

➤ Note: no matrix-by-vector products with A^T, but two matrix-by-vector products with A at each step.

Vector ←→ polynomial in BCG:
qi ←→ r̄i(t) p̄_{i−1}(t)
ui ←→ p̄i²(t)
ri ←→ r̄i²(t)
where r̄i(t) = residual polynomial at step i for BCG, i.e., ri = r̄i(A)r0, and p̄i(t) = conjugate direction polynomial at step i, i.e., pi = p̄i(A)r0.
BCGSTAB (van der Vorst, 1992)

➤ In CGS: the residual polynomial of BCG is squared ➤ bad behavior in case of irregular convergence.
➤ Bi-Conjugate Gradient Stabilized (BCGSTAB) = a variation of CGS which avoids this difficulty. ➤ Derivation similar to CGS.
➤ Residuals in BCGSTAB are of the form
r′j = ψj(A) φj(A) r0
in which φj(t) = BCG residual polynomial, and ψj(t) = a new polynomial defined recursively as
ψ_{j+1}(t) = (1 − ωj t) ψj(t)
with ωj chosen to ‘smooth’ convergence [steepest descent step].

ALGORITHM 13: BCGSTAB
1. Compute r0 := b − Ax0; r0∗ arbitrary;
2. p0 := r0.
3. For j = 0, 1, . . . , until convergence Do:
4.   αj := (rj, r0∗)/(A pj, r0∗)
5.   sj := rj − αj A pj
6.   ωj := (A sj, sj)/(A sj, A sj)
7.   x_{j+1} := xj + αj pj + ωj sj
8.   r_{j+1} := sj − ωj A sj
9.   βj := (r_{j+1}, r0∗)/(rj, r0∗) × (αj / ωj)
10.  p_{j+1} := r_{j+1} + βj(pj − ωj A pj)
11. EndDo
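BCGSTAB is also available off the shelf; a usage sketch of my own (the slightly nonsymmetric tridiagonal test matrix is an arbitrary choice, and the solver's tolerance keywords are left at their SciPy defaults):

import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

n = 1000
A = sp.diags([-1.3, 2.0, -0.7], [-1, 0, 1], shape=(n, n), format="csr")  # nonsymmetric
b = np.ones(n)
x, info = spla.bicgstab(A, b, maxiter=500)
print("info =", info, "| residual =", np.linalg.norm(b - A @ x))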
PRECONDITIONING

Preconditioning – Basic principles
Basic idea: use the Krylov subspace method on a modified system, such as
M^{-1} A x = M^{-1} b.
• The matrix M^{-1}A need not be formed explicitly; one only needs to solve M w = v whenever needed.
• Consequence: the fundamental requirement is that it should be easy to compute M^{-1}v for an arbitrary vector v.

Left, Right, and Split preconditioning
Left preconditioning: M^{-1} A x = M^{-1} b
Right preconditioning: A M^{-1} u = b, with x = M^{-1} u
Split preconditioning: M_L^{-1} A M_R^{-1} u = M_L^{-1} b, with x = M_R^{-1} u
[Assume M is factored: M = M_L M_R.]

Preconditioned CG (PCG)
➤ Assume: A and M are both SPD.
➤ Applying CG directly to M^{-1}Ax = M^{-1}b or AM^{-1}u = b won't work, because the coefficient matrices are not symmetric.
➤ Alternative: when M = LL^T, use the split preconditioner option.
➤ Second alternative: observe that M^{-1}A is self-adjoint w.r.t. the M inner product:
(M^{-1}Ax, y)_M = (Ax, y) = (x, Ay) = (x, M^{-1}Ay)_M
Note: M^{-1}A is also self-adjoint with respect to (·, ·)_A:
(M^{-1}Ax, y)_A = (AM^{-1}Ax, y) = (x, AM^{-1}Ay) = (x, M^{-1}Ay)_A
➤ Can obtain a similar algorithm.

ALGORITHM 14: Preconditioned Conjugate Gradient (a Python sketch follows below)
1. Compute r0 := b − Ax0, z0 = M^{-1}r0, and p0 := z0
2. For j = 0, 1, . . ., until convergence Do:
3.   αj := (rj, zj)/(A pj, pj)
4.   x_{j+1} := xj + αj pj
5.   r_{j+1} := rj − αj A pj
6.   z_{j+1} := M^{-1} r_{j+1}
7.   βj := (r_{j+1}, z_{j+1})/(rj, zj)
8.   p_{j+1} := z_{j+1} + βj pj
9. EndDo
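A sketch of my own of Algorithm 14 (PCG). The preconditioner is passed as a function M_solve(v) returning M^{-1}v; the Jacobi (diagonal) preconditioner used in the usage example is just a placeholder choice.

import numpy as np
import scipy.sparse as sp

def pcg(A, b, M_solve, x0=None, tol=1e-8, maxiter=1000):
    x = np.zeros_like(b) if x0 is None else x0.copy()
    r = b - A @ x
    z = M_solve(r)
    p = z.copy()
    rz = np.dot(r, z)
    for _ in range(maxiter):
        Ap = A @ p
        alpha = rz / np.dot(Ap, p)
        x = x + alpha * p
        r = r - alpha * Ap
        if np.linalg.norm(r) <= tol * np.linalg.norm(b):
            break
        z = M_solve(r)                    # z_{j+1} = M^{-1} r_{j+1}
        rz_new = np.dot(r, z)
        beta = rz_new / rz
        p = z + beta * p
        rz = rz_new
    return x

n = 500
A = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n), format="csr")
b = np.ones(n)
d = A.diagonal()
x = pcg(A, b, M_solve=lambda v: v / d)    # Jacobi preconditioner M = diag(A)
print(np.linalg.norm(b - A @ x))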
➤ Assume that M is a Cholesky product: M = LL^T.
➤ Notation: Â = L^{-1}AL^{-T}. All quantities related to the preconditioned system are indicated by ˆ.
Then, another possibility: the split preconditioning option, which applies CG to the system
L^{-1} A L^{-T} u = L^{-1} b, with x = L^T u

ALGORITHM 15: CG with Split Preconditioner
1. Compute r0 := b − Ax0; r̂0 = L^{-1}r0; and p0 := L^{-T} r̂0.
2. For j = 0, 1, . . ., until convergence Do:
3.   αj := (r̂j, r̂j)/(A pj, pj)
4.   x_{j+1} := xj + αj pj
5.   r̂_{j+1} := r̂j − αj L^{-1} A pj
6.   βj := (r̂_{j+1}, r̂_{j+1})/(r̂j, r̂j)
7.   p_{j+1} := L^{-T} r̂_{j+1} + βj pj
8. EndDo
➤ The xj's produced by the above algorithm and by PCG are identical (if the same initial guess is used).

Flexible accelerators

Question: what can we do in case M is defined only approximately, i.e., if it can vary from one step to the other?
Applications:
➤ Iterative techniques as preconditioners: block-SOR, SSOR, multigrid, etc.
➤ Chaotic relaxation type preconditioners (e.g., in a parallel computing environment)
➤ Mixing preconditioners – mixing coarse mesh / fine mesh preconditioners.
ALGORITHM 16: GMRES – No preconditioning
1. Start: Choose x0 and a dimension m of the Krylov subspaces.
2. Arnoldi process:
   • Compute r0 = b − Ax0, β = ‖r0‖2 and v1 = r0/β.
   • For j = 1, ..., m do
     – Compute w := A vj
     – For i = 1, . . . , j, do: hi,j := (w, vi); w := w − hi,j vi
     – h_{j+1,j} = ‖w‖2; v_{j+1} = w / h_{j+1,j}
   • Define Vm := [v1, ...., vm] and H̄m = {hi,j}.
3. Form the approximate solution: compute xm = x0 + Vm ym, where ym = argmin_y ‖βe1 − H̄m y‖2 and e1 = [1, 0, . . . , 0]^T.
4. Restart: if satisfied stop, else set x0 ← xm and goto 2.

ALGORITHM 17: GMRES – (right) Preconditioning
1. Start: Choose x0 and a dimension m of the Krylov subspaces.
2. Arnoldi process:
   • Compute r0 = b − Ax0, β = ‖r0‖2 and v1 = r0/β.
   • For j = 1, ..., m do
     – Compute zj := M^{-1} vj
     – Compute w := A zj
     – For i = 1, . . . , j, do: hi,j := (w, vi); w := w − hi,j vi
     – h_{j+1,j} = ‖w‖2; v_{j+1} = w / h_{j+1,j}
   • Define Vm := [v1, ...., vm] and H̄m = {hi,j}.
3. Form the approximate solution: xm = x0 + M^{-1} Vm ym, where ym = argmin_y ‖βe1 − H̄m y‖2 and e1 = [1, 0, . . . , 0]^T.
4. Restart: if satisfied stop, else set x0 ← xm and goto 2.

ALGORITHM 18: GMRES – variable preconditioner
1. Start: Choose x0 and a dimension m of the Krylov subspaces.
2. Arnoldi process:
   • Compute r0 = b − Ax0, β = ‖r0‖2 and v1 = r0/β.
   • For j = 1, ..., m do
     – Compute zj := Mj^{-1} vj; compute w := A zj
     – For i = 1, . . . , j, do: hi,j := (w, vi); w := w − hi,j vi
     – h_{j+1,j} = ‖w‖2; v_{j+1} = w / h_{j+1,j}
   • Define Zm := [z1, ...., zm] and H̄m = {hi,j}.
3. Form the approximate solution: compute xm = x0 + Zm ym, where ym = argmin_y ‖βe1 − H̄m y‖2 and e1 = [1, 0, . . . , 0]^T.
4. Restart: if satisfied stop, else set x0 ← xm and goto 2.

Properties
• xm minimizes ‖b − Ax‖2 over x0 + Span{Zm}.
• If A zj = vj (i.e., if preconditioning is ‘exact’ at step j), then the approximation xj is exact.
• If Mj is constant, then the method is ≡ to right-preconditioned GMRES.
Additional costs:
• Arithmetic: none.
• Memory: must save the additional set of vectors {zj}_{j=1,...,m}
Advantage: flexibility
An observation. Introduction to Preconditioning

➤ Take a look back at the basic relaxation methods: Jacobi, Gauss-Seidel, SOR, SSOR, ...
➤ These are iterations of the form x^(k+1) = M x^(k) + f, where M is of the form M = I − P^{-1}A. For example, for SSOR,
P_SSOR = (D − ωE) D^{-1} (D − ωF)
➤ SSOR attempts to solve the equivalent system
P^{-1} A x = P^{-1} b
where P ≡ P_SSOR, by the fixed-point iteration
x^(k+1) = (I − P^{-1}A) x^(k) + P^{-1}b    instead of    x^(k+1) = (I − A) x^(k) + b
(the iteration matrix being M = I − P^{-1}A). In other words:
Relaxation Scheme ⇐⇒ Preconditioned Fixed-Point Iteration

Standard preconditioners
• Simplest preconditioner: M = Diag(A) ➤ poor convergence.
• Next to simplest: SSOR, M = (D − ωE) D^{-1} (D − ωF)
• Still simple but often more efficient: ILU(0).
• ILU(p) – ILU with level of fill p – more complex.
• Class of ILU preconditioners with threshold
• Class of approximate inverse preconditioners
• Class of multilevel ILU preconditioners: Multigrid, Algebraic Multigrid, M-level ILU, ...

The SOR/SSOR preconditioner
[Figure: splitting of A into D, −E (strict lower part), and −F (strict upper part).]
➤ SOR preconditioning: M_SOR = (D − ωE)
➤ SSOR preconditioning: M_SSOR = (D − ωE) D^{-1} (D − ωF)
➤ M_SSOR = LU, L = unit lower triangular, U = upper triangular. One solve with M_SSOR ≈ the same cost as a MAT-VEC.
➤ k-step SOR (resp. SSOR) preconditioning: k steps of SOR (resp. SSOR).
➤ Questions: best ω? For preconditioning one can take ω = 1:
M = (D − E) D^{-1} (D − F)
Observe: M = LU + R with R = E D^{-1} F.
➤ Best k? k = 1 is rarely the best. Substantial difference in performance.
[Figure: iteration times versus k for SOR(k)-preconditioned GMRES.]

ILU(0) and IC(0) preconditioners

➤ Notation: NZ(X) = {(i, j) | X_{i,j} ≠ 0}
➤ Formal definition of ILU(0):
A = LU + R
NZ(L) ∪ NZ(U) = NZ(A)
rij = 0 for (i, j) ∈ NZ(A)
➤ This does not define ILU(0) in a unique way.
Constructive definition: compute the LU factorization of A, but drop any fill-in in L and U outside of Struct(A).
➤ ILU factorizations are often based on the i, k, j version of GE.

What is the IKJ version of GE?
[Figure: the different computational patterns of Gaussian elimination – KJI, JKI, IKJ, IJK variants.]
ALGORITHM 19: Gaussian Elimination – IKJ Variant
1. For i = 2, . . . , n Do:
2.   For k = 1, . . . , i − 1 Do:
3.     aik := aik / akk
4.     For j = k + 1, . . . , n Do:
5.       aij := aij − aik ∗ akj
6.     EndDo
7.   EndDo
8. EndDo
[Figure (IKJ variant): at step i, row i is accessed and modified, rows 1, ..., i − 1 are accessed but not modified, and the remaining rows are not accessed.]

ILU(0) – zero-fill ILU

ALGORITHM 20: ILU(0)
For i = 1, . . . , N Do:
  For k = 1, . . . , i − 1 and if (i, k) ∈ NZ(A) Do:
    Compute aik := aik / akk
    For j = k + 1, . . . , N and if (i, j) ∈ NZ(A) Do:
      Compute aij := aij − aik akj
    EndFor
  EndFor
EndFor
➤ When A is SPD, the ILU(0) factorization = the Incomplete Cholesky factorization IC(0). Meijerink and van der Vorst [1977].
[Figure: typical eigenvalue distribution of a preconditioned matrix.]
[Figure: pattern of ILU(0) for a 5-point matrix.]
Stencils and ILU factorization
[Figure: stencils of A and of the L and U parts of A.]

Higher order ILU factorization

➤ Higher accuracy incomplete Cholesky: for regularly structured problems, IC(p) allows p additional diagonals in L.
➤ Can be generalized to irregular sparse matrices using the notion of level of fill-in [Watts III, 1979]
• Initially:
  Lev_ij = 0 for aij ≠ 0,  Lev_ij = ∞ for aij == 0
• At a given step i of Gaussian elimination:
  Lev_kj = min{Lev_kj ; Lev_ki + Lev_ij + 1}
➤ ILU(p) strategy = drop anything with level of fill-in exceeding p.
[Figure: ILU(1) pattern.]
* Increasing the level of fill-in usually results in a more accurate ILU and...
* ...typically in fewer steps and fewer arithmetic operations.

ALGORITHM 21: ILU(p)
For i = 2, . . . , N Do
  For each k = 1, . . . , i − 1 and if aik ≠ 0 Do
    Compute aik := aik / akk
    Compute ai,∗ := ai,∗ − aik ak,∗
    Update the levels of ai,∗
  EndFor
  Replace any element in row i with lev(aij) > p by zero.
EndFor
➤ The algorithm can be split into a symbolic and a numerical phase. The level-of-fill is computed in the symbolic phase.

ILU with threshold – generic algorithms

ILU(p) factorizations are based on structure only and not on numerical values ➤ potential problems for non-M-matrices.
➤ One remedy: ILU with threshold (generic name: ILUT).
Two broad approaches:
First approach [derived from direct solvers]: use any (direct) sparse solver and incorporate a dropping strategy. [Munksgaard (?), Osterby & Zlatev, Sameh & Zlatev [90], D. Young et al. (Boeing), etc.]
Second approach: [derived from the ‘iterative solvers’ viewpoint]
1. use a (row or column) version of the (i, k, j) version of GE;
2. apply a drop strategy for the element lik as it is computed;
3. perform the linear combinations to get ai∗. Use a full row expansion of ai∗;
4. apply a drop strategy to fill-ins.

ILU with threshold: ILUT(k, ǫ)
• Do the i, k, j version of Gaussian Elimination (GE).
• During each i-th step in GE, discard any pivot or fill-in whose value is below ǫ ‖row_i(A)‖.
• Once the i-th row of L + U (L-part + U-part) is computed, retain only the k largest elements in both parts.
➤ Advantages: controlled fill-in, smaller memory overhead.
➤ Easy to implement.
➤ Can be made quite inexpensive.
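A hedged usage sketch: SciPy's spilu (a SuperLU-based incomplete factorization with a drop tolerance and a fill limit) is used here as a stand-in for an ILUT-like preconditioner inside GMRES. It is not the ILUT algorithm of the slides, only a factorization with similar drop-tolerance/fill parameters; the test matrix is arbitrary.

import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

n = 2000
A = sp.diags([-1.0, 2.2, -1.0], [-1, 0, 1], shape=(n, n), format="csc")
b = np.ones(n)

ilu = spla.spilu(A, drop_tol=1e-4, fill_factor=10)     # ILUT-like: threshold + fill limit
M = spla.LinearOperator(A.shape, matvec=ilu.solve)     # applies (LU)^{-1} v

x, info = spla.gmres(A, b, M=M, restart=30, maxiter=200)
print("info =", info, "| residual =", np.linalg.norm(b - A @ x))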
Crout-based ILUT (ILUTC)

Terminology: Crout versions of LU compute the k-th row of U and the k-th column of L at the k-th step.
Main advantages:
1. Less expensive than ILUT (avoids sorting)
2. Allows better techniques for dropping

References:
[1] M. Jones and P. Plassman. An improved incomplete Choleski factorization. ACM Transactions on Mathematical Software, 21:5–17, 1995.
[2] S. C. Eisenstat, M. H. Schultz, and A. H. Sherman. Algorithms and data structures for sparse symmetric Gaussian elimination. SIAM Journal on Scientific Computing, 2:225–237, 1981.
[3] M. Bollhöfer. A robust ILU with pivoting based on monitoring the growth of the inverse factors. Linear Algebra and its Applications, 338(1–3):201–218, 2001.
[4] N. Li, Y. Saad, and E. Chow. Crout versions of ILU. MSI technical report, 2002.

Crout LU (dense case)

➤ Go back to the delayed update algorithm (IKJ alg.) and observe: we could do both a column and a row version.
➤ Left: U computed by rows. Right: L computed by columns.
[Figure: computational pattern – black = part computed at step k, blue = part accessed.]
Note: entries 1 : k − 1 in the k-th row of the figure need not be computed; they are available from the already computed columns of L. Similar observation for L (right).
ALGORITHM 22: Crout LU Factorization (dense case)
1. For k = 1 : n Do:
2.   For i = 1 : k − 1 and if aki ≠ 0 Do:
3.     ak,k:n = ak,k:n − aki ∗ ai,k:n
4.   EndDo
5.   For i = 1 : k − 1 and if aik ≠ 0 Do:
6.     ak+1:n,k = ak+1:n,k − aik ∗ ak+1:n,i
7.   EndDo
8.   aik = aik / akk for i = k + 1, ..., n
9. EndDo

Comparison with standard techniques
[Figure: preconditioning time (sec.) vs. Lfil for RAEFSKY3 – ILUC (solid), row-ILUT (circles), column-ILUT (triangles) and row-ILUT with binary search trees (stars).]
ILUM AND ARMS

Independent set orderings & ILUM (Background)

Independent set orderings permute a matrix into the form
[ B  F ]
[ E  C ]
where B is a diagonal matrix.
➤ Unknowns associated with the B block form an independent set (IS).
➤ An IS is maximal if it cannot be augmented by other nodes to form another IS.
➤ IS ordering can be viewed as a “simplification” of multicoloring.

Main observation: the reduced system obtained by eliminating the unknowns associated with the IS is still sparse, since its coefficient matrix is the Schur complement
S = C − E B^{-1} F
➤ Idea: apply IS reduction recursively.
➤ When the reduced system is small enough, solve it by any method.
➤ Can devise an ILU factorization based on this strategy.
➤ See work by [Botta-Wubbs '96, '97, YS '94, '96 (ILUM), Leuze '89, ...]

ALGORITHM 23: ILUM
For lev = 1, nlev Do
  a. Get an independent set for A.
  b. Form the reduced system associated with this set;
  c. Apply a dropping strategy to this system;
  d. Set A := current reduced matrix and go back to (a).
EndDo
Group Independent Sets / Aggregates

➤ Generalizes (common) Independent Sets.
Main goal: to improve robustness.
Main idea: use independent sets of “cliques”, or “aggregates”. There is no coupling between the aggregates.
➤ Reorder equations so that the nodes of the independent sets come first.
[Figure: original matrix A and reordered matrix A0 = P0^T A P0 (spy plots, nz = 3155).]

Algebraic Recursive Multilevel Solver (ARMS)

➤ Block factorization of Al:
[ Bl  Fl ; El  Cl ] ≈ [ Ll  0 ; El Ul^{-1}  I ] × [ Ul  Ll^{-1}Fl ; 0  A_{l+1} ]

➤ Basic step:
[ B  F ; E  C ] [ y ; z ] = [ f ; g ]   →   [ L  0 ; E U^{-1}  I ] × [ U  L^{-1}F ; 0  S ] [ y ; z ] = [ f ; g ]
where S = C − E B^{-1} F = Schur complement.
➤ Perform the block factorization recursively on S.
➤ L, U blocks: sparse.
➤ Diagonal blocks treated as sparse ➤ Block ILU.
➤ Problem: fill-in. Remedy: dropping strategy.
➤ Exploit recursivity.
➤ Next step: treat the Schur complement recursively.
[Figure: spy plots of the reordered matrix and of its factors (nz = 12205, nz = 4255).]
Factorization at level l:
Pl^T Al Pl = [ Bl  Fl ; El  Cl ] ≈ [ Ll  0 ; El Ul^{-1}  I ] × [ Ul  Ll^{-1}Fl ; 0  A_{l+1} ]

➤ L-solve ∼ restriction. U-solve ∼ prolongation.
➤ Solve the last-level system with, e.g., ILUT+GMRES.

ALGORITHM 24: ARMS(A_lev) factorization
1. If lev = last_lev then
2.   Compute A_lev ≈ L_lev U_lev
3. Else:
4.   Find an independent set permutation P_lev
5.   Apply the permutation A_lev := P_lev^T A_lev P_lev
6.   Compute the factorization
7.   Call ARMS(A_{lev+1})
8. EndIf

Group Independent Set reordering

Simple strategy used: do a Cuthill-McKee ordering until there are enough points to make a block. Reverse the ordering. Start a new block from a non-visited node. Continue until all points are visited. Add a criterion for rejecting “not sufficiently diagonally dominant rows.”
[Figures: original matrix (separator, first block) and group independent set reorderings with block sizes of 6 and 20; color scale from 0.10E-06 to 0.19E+07.]
MULTIGRID (VERY BRIEF)

Introduction
➤ Premise: we now work directly on a Partial Differential Equation, e.g.
−∆u = f, + B.C.
➤ Main idea of multigrid: exploit a hierarchy of grids to get good convergence from simple iterative schemes.
➤ Need a good grasp of the matrices and spectra of model problems.

Richardson's iteration
➤ Simple iterative scheme: Richardson's iteration for the 1-D case.
➤ Fixed parameter ω. Iteration:
u_{j+1} = uj + ω(b − Auj) = (I − ωA)uj + ωb.
➤ Iteration matrix: Mω = I − ωA.
Recall: convergence takes place for 0 < ω < 2/ρ(A).
➤ In practice an upper bound ρ(A) ≤ γ is often available.
➤ Take ω = 1/γ → converging iteration since 1/γ ≤ 1/ρ(A) < 2/ρ(A).
➤ Eigenvalues of the iteration matrix are 1 − ωλk, where
λk = 2(1 − cos θk) = 4 sin²(θk/2)
➤ Eigenvectors are the same as those of A.
➤ If u∗ is the exact solution, the error vector dj ≡ u∗ − uj obeys the relation
dj = Mω^j d0
➤ Expand the error vector d0 in the eigenbasis of Mω, as
d0 = Σ_{k=1}^{n} ξk wk.
➤ Then from dj = Mω^j d0 and Mω = I − ωA:
dj = Σ_{k=1}^{n} (1 − λk/γ)^j ξk wk.
➤ Each component is reduced by (1 − λk/γ)^j.
➤ The slowest converging component corresponds to λ1.
➤ Possibly very slow convergence rate when λ1/γ ≪ 1.

Basic observation: convergence is not similar for all components.
➤ Half of the error components see a very good decrease.
➤ This is the case for the high frequency components, that is, all those components corresponding to k > n/2 [referred to as the oscillatory part].
➤ The reduction factors for these components are
ηk = 1 − sin²(kπ/(2(n + 1))) = cos²(kπ/(2(n + 1))) ≤ 1/2.
➤ For the model problem – one-dimensional case – Gershgorin's theorem yields γ = 4, so the corresponding reduction coefficient for the smoothest mode is
1 − sin²(π/(2(n + 1))) ≈ 1 − (πh/2)² = 1 − O(h²).
Consequence: convergence can be quite slow for fine meshes, i.e., when h is small. (A small numerical check of this smoothing behavior follows below.)
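A numerical check of my own (not from the slides) of the smoothing claim above: Richardson's iteration with ω = 1/4 on the 1-D model problem damps an oscillatory error mode far faster than a smooth one. The mode indices chosen are arbitrary.

import numpy as np
import scipy.sparse as sp

n = 63
h = 1.0 / (n + 1)
A = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n), format="csr")
omega = 1.0 / 4.0                                  # 1/gamma with gamma = 4 (Gershgorin)

x = np.arange(1, n + 1) * h
mode = lambda k: np.sin(k * np.pi * x)             # eigenvectors w_k of A

for k in (1, n // 2 + 5):                          # one smooth, one oscillatory mode
    d = mode(k)
    for _ in range(10):                            # 10 Richardson sweeps on the error
        d = d - omega * (A @ d)                    # d_{j+1} = (I - omega*A) d_j
    red = np.linalg.norm(d) / np.linalg.norm(mode(k))
    print(f"k = {k:2d}: error reduction after 10 sweeps = {red:.3e}")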
[Figure: reduction coefficients for Richardson's method applied to the 1-D model problem – ηk plotted versus θk, from η1 ≈ 1 near θ1 down to below 1/2 beyond θ_{n/2+1}.]
Observations: oscillatory components undergo excellent reduction; moreover, this reduction factor is independent of the step-size h.

How can we reduce the other components?

➤ Introduce a coarse grid problem. Assume n is odd. Consider the problem issued from discretizing the original PDE on a mesh Ω2h with mesh-size 2h. Use superscripts h and 2h for the two meshes.
Observation: note that x^{2h}_i = x^h_{2i}, from which it follows that, for k ≤ n/2,
w^h_k(x^h_{2i}) = sin(kπ x^h_{2i}) = sin(kπ x^{2h}_i) = w^{2h}_k(x^{2h}_i).
So: taking a smooth mode on the fine grid (w^h_k with k ≤ n/2) and canonically injecting it into the coarse grid, i.e., defining its values on the coarse points to be the same as those on the fine points, yields the k-th mode on the coarse grid.
[Figure: the mode w2 on a fine grid (n = 7) and a coarse grid (n = 3).]
➤ Some of the modes which were smooth on the fine grid become oscillatory.
➤ The oscillatory modes on the fine mesh are no longer represented on the coarse mesh.

➤ At some point the iteration will fail to make progress on the fine grid: when the only components left are those associated with the smooth modes.
Multigrid strategy: do not attempt to eliminate these components on the fine grid. Move down to a coarser grid where smooth modes are translated into oscillatory ones. Then iterate.
➤ Practically, one needs to go back and forth between different grids.
Inter-grid operations: Prolongation

➤ A prolongation operation takes a vector from ΩH and defines the analogue vector in Ωh. Common notation in use:
I^h_H : ΩH −→ Ωh.
➤ Simplest approach: linear interpolation. Example in 1-D: given (v^{2h}_i)_{i=0,...,(n+1)/2}, define the vector v^h = I^h_{2h} v^{2h} ∈ Ωh by
v^h_{2j} = v^{2h}_j
v^h_{2j+1} = (v^{2h}_j + v^{2h}_{j+1}) / 2
for j = 0, . . . , (n+1)/2.

Restriction

➤ Given a function v^h on the fine mesh, a corresponding function in ΩH must be defined from v^h. Reverse of prolongation.
➤ Simplest example, canonical injection: v^{2h}_i = v^h_{2i}.
➤ Termed the injection operator. Obvious 2-D analogue: v^{2h}_{i,j} = v^h_{2i,2j}.
➤ More common restriction operator: full weighting (FW),
v^{2h}_j = (1/4)(v^h_{2j−1} + 2 v^h_{2j} + v^h_{2j+1}).
➤ Averages the neighboring values using the weights 0.25, 0.5, 0.25.
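A sketch of my own of the 1-D inter-grid operators above (linear-interpolation prolongation and full-weighting restriction), acting on vectors of interior values with zero Dirichlet boundary values assumed; n = 2m + 1 fine interior points for m coarse interior points.

import numpy as np

def prolong(v2h):                      # I_{2h}^{h}: m coarse interior pts -> 2m+1 fine pts
    m = v2h.shape[0]
    vh = np.zeros(2 * m + 1)
    vh[1::2] = v2h                             # v^h_{2j}   = v^{2h}_j  (coarse points)
    vh[2:-1:2] = 0.5 * (v2h[:-1] + v2h[1:])    # v^h_{2j+1} = (v^{2h}_j + v^{2h}_{j+1})/2
    vh[0] = 0.5 * v2h[0]                       # neighbors at the boundary are 0
    vh[-1] = 0.5 * v2h[-1]
    return vh

def restrict_fw(vh):                   # I_h^{2h}: full weighting, weights 1/4, 1/2, 1/4
    return 0.25 * vh[0:-2:2] + 0.5 * vh[1:-1:2] + 0.25 * vh[2::2]

v2h = np.array([1.0, 2.0, 3.0])
vh = prolong(v2h)
print(vh)                              # [0.5, 1., 1.5, 2., 2.5, 3., 1.5]
print(restrict_fw(vh))                 # back on the coarse grid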
Standard multigrid techniques

➤ Simplest idea: obtain an initial guess by interpolating a solution computed on a coarser grid. Repeat recursively. Interpolation from a coarser grid can be followed by a few steps of a smoothing iteration.
➤ Known as nested iteration.
➤ MG techniques use essentially two main ingredients: a hierarchy of grid problems (restrictions and prolongations) and a smoother.

Coarse problems and smoothers

At the highest level (finest grid): A^h u^h = f^h
➤ Can define this problem at the next level (mesh-size = H) on mesh ΩH.
➤ Also common (FEM) to define the system by Galerkin projection:
A^H = I^H_h A^h I^h_H,   f^H = I^H_h f^h.
What is a smoother? Ans: any scheme which has the smoothing property of damping quickly the high frequency components of the error.
Notation: u^h_ν = smoother^ν(A^h, u^h_0, f^h)
u^h_ν = result of ν smoothing steps for A^h u = f^h starting with u^h_0.
V-cycles

➤ V-cycle multigrid is a canonical application of recursivity – starting from the 2-grid cycle as an example. Notation: H = 2h and h0 is the coarsest mesh-size. (A two-grid sketch in code follows below.)

ALGORITHM 25: u^h = V-cycle(A^h, u^h_0, f^h)
1. Pre-smooth:   u^h := smoother^{ν1}(A^h, u^h_0, f^h)
2. Get residual: r^h = f^h − A^h u^h
3. Coarsen:      r^H = I^H_h r^h
4. If (H == h0)
5.   Solve:      A^H δ^H = r^H
6. Else
7.   Recursion:  δ^H = V-cycle(A^H, 0, r^H)
8. EndIf
9. Correct:      u^h := u^h + I^h_H δ^H
10. Post-smooth: u^h := smoother^{ν2}(A^h, u^h, f^h)
11. Return u^h

➤ Many other options available.
➤ For the Poisson equation MG is a ‘fast solver’ – cost of order N log N. In fact O(N) for FMG.

Algebraic Multigrid (AMG)

➤ Generalizes ‘geometric’ or ‘mesh-based’ MG to general problems.
➤ Idea: define coarsening by looking at ‘strong’ couplings.
➤ Define the coarse problem from a Galerkin approach, i.e., using the restriction: A^H = I^H_h A^h I^h_H.
➤ Generally speaking: limited success for problems with several unknowns per mesh point, or for non-PDE related problems.
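A minimal two-grid sketch of my own of the V-cycle above, for the 1-D model problem −u'' = f (a toy setting: Richardson smoothing with ω ≈ 1/ρ(A), an exact solve on the coarse grid where a full multigrid code would recurse, and arbitrary problem sizes).

import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

def laplacian_1d(n, h):
    return sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n), format="csc") / h**2

def smooth(A, u, f, nu, omega):
    for _ in range(nu):
        u = u + omega * (f - A @ u)            # Richardson sweeps as the smoother
    return u

def prolong(v2h):
    m = v2h.shape[0]
    vh = np.zeros(2 * m + 1)
    vh[1::2] = v2h
    vh[2:-1:2] = 0.5 * (v2h[:-1] + v2h[1:])
    vh[0], vh[-1] = 0.5 * v2h[0], 0.5 * v2h[-1]
    return vh

def restrict_fw(vh):
    return 0.25 * vh[0:-2:2] + 0.5 * vh[1:-1:2] + 0.25 * vh[2::2]

def two_grid(Ah, AH, u, f, h, nu1=2, nu2=2):
    omega = h**2 / 4.0                         # ~ 1/gamma for the model problem
    u = smooth(Ah, u, f, nu1, omega)           # 1. pre-smooth
    rH = restrict_fw(f - Ah @ u)               # 2.-3. residual and coarsening
    dH = spla.spsolve(AH, rH)                  # 5. coarse solve (recursion in full MG)
    u = u + prolong(dH)                        # 9. correct
    return smooth(Ah, u, f, nu2, omega)        # 10. post-smooth

n = 127
h = 1.0 / (n + 1)
Ah, AH = laplacian_1d(n, h), laplacian_1d((n - 1) // 2, 2 * h)
f, u = np.ones(n), np.zeros(n)
for k in range(10):
    u = two_grid(Ah, AH, u, f, h)
    print(k, np.linalg.norm(f - Ah @ u))       # residual norm per two-grid cycle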
PARALLEL IMPLEMENTATION

Introduction
➤ Thrust of parallel computing techniques in most application areas.
➤ Programming model: message-passing (MPI) seems to dominate.
➤ OpenMP and threads for small numbers of processors.
➤ Important new reality: parallel programming has penetrated the ‘applications’ areas [Sciences and Engineering + industry].
➤ Problem 1: algorithms lagging behind somewhat.
➤ Problem 2: message passing is painful for large applications. ‘Time to solution’ goes up.

Parallel preconditioners: a few approaches
“Parallel matrix computation” viewpoint:
• Local preconditioners: polynomial (in the 80s), sparse approximate inverses [M. Benzi-Tuma et al. '99, E. Chow '00]
• Distributed versions of ILU [Ma & YS '94, Hysom & Pothen '00]
• Use of multicoloring to unravel parallelism
Intrinsically parallel preconditioners
Domain Decomposition ideas:
• Schwarz-type preconditioners [e.g. Widlund, Bramble-Pasciak-Xu, X. Cai, D. Keyes, Smith, ...]
• Schur-complement techniques [Gropp & Smith, Farhat et al. (FETI), T.F. Chan et al., YS and Sosonkina '97, J. Zhang '00, ...]
Multigrid / AMG viewpoint:
• Multi-level multigrid-like preconditioners [e.g., Shadid-Tuminaro et al. (Aztec project), ...]
➤ In practice: variants of additive Schwarz are very common (simplicity).

Some alternatives
(1) Polynomial preconditioners;
(2) Approximate inverse preconditioners;
(3) Multi-coloring + independent set ordering;
(4) Domain decomposition approach.
POLYNOMIAL PRECONDITIONING

Principle: M^{-1} = s(A), where s is a (low) degree polynomial:
s(A) A x = s(A) b
Problem: how to obtain s? Note: s(A) ≈ A^{-1}
➤ Several approaches:
* Chebyshev polynomials
* Least-squares polynomials
* Others
➤ Polynomial preconditioners are seldom used in practice.

Domain Decomposition

Problem:
∆u = f in Ω
u = u_Γ on Γ = ∂Ω.
Domain: Ω = ∪_{i=1}^{s} Ωi
➤ Domain decomposition or substructuring methods attempt to solve a PDE problem (e.g.) on the entire domain from problem solutions on the subdomains Ωi.
DISTRIBUTED SPARSE MATRICES

[Figure: discretization of a domain and the corresponding coefficient matrix.]
Types of mappings: (a) vertex-based; (b) edge-based; and (c) element-based partitioning.

Generalization: Distributed Sparse Systems
➤ Can adapt the PDE viewpoint to general sparse matrices.
➤ Will use the graph representation and the ‘vertex-based’ viewpoint.
➤ The best idea is to use the adjacency graph of A:
Vertices = {1, 2, · · · , n};  Edges: i → j iff aij ≠ 0
➤ Simple illustration: block assignment. Assign equation i and unknown i to a given ‘process’.
➤ Naive partitioning – won't work well in practice.
Graph partitioning problem:
• Want a partition of the vertices of the graph so that
(1) partitions have ∼ the same sizes
(2) interfaces are small in size
General Partitioning of a sparse linear system

[Figure: a 5 × 5 mesh (vertices 1–25) partitioned into four vertex-based subdomains.]

S1 = {1, 2, 6, 7, 11, 12}: this means equations and unknowns 1, 2, 6, 7, 11, 12 are assigned to Domain 1.
S2 = {3, 4, 5, 8, 9, 10, 13}
S3 = {16, 17, 18, 21, 22, 23}
S4 = {14, 15, 19, 20, 24, 25}

Alternative: map elements / edges rather than vertices.

[Figure: the same 5 × 5 mesh partitioned by elements/edges.]

➤ Equations/unknowns 3, 8, 12 are shared by 2 domains. From the distributed sparse matrix viewpoint this is an overlap of one layer.
➤ Partitioners: Metis, Chaco, Scotch, ...
➤ More recent: Zoltan, hMETIS, PaToH
➤ Standard dual objective: “minimize” communication + “balance” partition sizes.
➤ Recent trend: use of hypergraphs [PaToH, hMETIS, ...].
➤ Hypergraphs are very general; the ideas are borrowed from VLSI work.
➤ Main motivation: to better represent communication volumes when partitioning a graph. Standard models face many limitations.
➤ Hypergraphs can better express complex graph partitioning problems and provide better solutions. Example: completely nonsymmetric patterns.
Distributed Sparse Matrices (continued)

➤ Once a good partitioning is found, the questions are:
1. How to represent this partitioning?
2. What is a good data structure for representing distributed sparse matrices?
3. How to set up the various “local objects” (matrices, vectors, ...)?
4. What can be done to prepare for the communication that will be required during execution?

Two views of a distributed sparse matrix

➤ Local interface variables are always ordered last.

Local view of distributed matrix:
[Figure: local part of the distributed matrix; internal points are followed by local interface points, the local matrix Ai (‘Aloc’) acts on the local data, and the coupling Xi (‘Bext’) acts on the external interface nodes.]

The local matrix:

The local matrix consists of 2 parts: a part (‘Aloc’) which acts on local data, and another (‘Bext’) which acts on remote (external) data.
➤ Once the partitioning is available these parts must be identified and built locally.
➤ In finite elements, assembly is a local process.
➤ How to perform a matrix-vector product? [needed by iterative schemes]

The local system:

    [ Bi  Fi ] [ ui ]     [        0         ]     [ fi ]
    [ Ei  Ci ] [ yi ]  +  [ Σ_{j∈Ni} Eij yj  ]  =  [ gi ]

where the 2 × 2 block matrix is Ai and the sum gathers the contributions of the external interface variables yext.

➤ ui : internal variables; yi : interface variables.
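One possible set of “local objects” is sketched below in C. The structure and field names (CSRMat, DistMat, Aloc, Bext, send/receive lists) are assumptions made for illustration and do not reproduce the layout of any particular library; the point is simply to group the two local matrix parts with the information needed for the boundary exchange.

/*-------------------- one possible layout of the local objects
   (illustrative names, not the data structure of a specific package) */
typedef struct {
    int     nrows;      /* number of rows                              */
    int    *ia, *ja;    /* CSR row pointers and column indices         */
    double *a;          /* CSR values                                  */
} CSRMat;

typedef struct {
    int     nloc;       /* local rows: interior + local interface      */
    int     nbnd;       /* local interface rows (ordered last)         */
    int     next;       /* number of external interface unknowns       */
    CSRMat  Aloc;       /* 'Aloc' : acts on local unknowns             */
    CSRMat  Bext;       /* 'Bext' : acts on external unknowns          */
    int     nneigh;     /* number of neighboring processes             */
    int    *ext_procs;  /* neighbor process ids          [nneigh]      */
    int    *recv_ptr;   /* blocks of xext per neighbor   [nneigh+1]    */
    int    *send_ptr;   /* blocks of the send list       [nneigh+1]    */
    int    *send_idx;   /* local indices of boundary values to send    */
} DistMat;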
Main Operations in (F)GMRES:

1. Saxpy’s – local operation – no communication
2. Dot products – global operation
3. Matrix-vector products – local operation – local communication
4. Preconditioning operations – locality varies

Distributed Sparse Matrix-Vector Product Kernel

Algorithm:
1. Communicate: exchange boundary data.
   Scatter xbound to neighbors – gather xext from neighbors.
2. Local matrix-vector product: y = Aloc xloc
3. External matrix-vector product: y = y + Bext xext

NOTE: 1 and 2 are independent and can be overlapped.
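The kernel above can be coded, for instance, with non-blocking MPI calls so that the local product overlaps the exchange. This is a minimal sketch under the hypothetical DistMat structure introduced earlier (MAXNEIGH is an assumed bound on the number of neighbors), not the kernel of any specific library.

/*-------------------- distributed mat-vec sketch: y = A x             */
#include <mpi.h>
enum { MAXNEIGH = 64 };

void dist_matvec(const DistMat *M, const double *xloc, double *y,
                 double *xext, double *sendbuf, MPI_Comm comm)
{
    MPI_Request req[2 * MAXNEIGH];
    int nreq = 0, p, i, k;

    /* 1. communicate: post receives for xext, send local boundary values */
    for (p = 0; p < M->nneigh; p++)
        MPI_Irecv(xext + M->recv_ptr[p], M->recv_ptr[p+1] - M->recv_ptr[p],
                  MPI_DOUBLE, M->ext_procs[p], 0, comm, &req[nreq++]);
    for (p = 0; p < M->nneigh; p++) {
        int len = M->send_ptr[p+1] - M->send_ptr[p];
        for (k = 0; k < len; k++)
            sendbuf[M->send_ptr[p]+k] = xloc[M->send_idx[M->send_ptr[p]+k]];
        MPI_Isend(sendbuf + M->send_ptr[p], len, MPI_DOUBLE,
                  M->ext_procs[p], 0, comm, &req[nreq++]);
    }

    /* 2. local product y = Aloc * xloc (overlapped with step 1)          */
    for (i = 0; i < M->nloc; i++) {
        double s = 0.0;
        for (k = M->Aloc.ia[i]; k < M->Aloc.ia[i+1]; k++)
            s += M->Aloc.a[k] * xloc[M->Aloc.ja[k]];
        y[i] = s;
    }

    /* 3. wait for xext, then y(interface rows) += Bext * xext            */
    MPI_Waitall(nreq, req, MPI_STATUSES_IGNORE);
    for (i = 0; i < M->nbnd; i++) {
        int row = M->nloc - M->nbnd + i;   /* interface rows are ordered last */
        for (k = M->Bext.ia[i]; k < M->Bext.ia[i+1]; k++)
            y[row] += M->Bext.a[k] * xext[M->Bext.ja[k]];
    }
}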
Distributed Dot Product
/*-------------------- call blas1 function    */
tloc = DDOT(n, x, incx, y, incy);
/*-------------------- call global reduction  */
MPI_Allreduce(&tloc, &ro, 1, MPI_DOUBLE, MPI_SUM, comm);

A remark: the global viewpoint
➤ With the interior variables u1, ..., up ordered first and the interface variables y1, ..., yp ordered last, the global system has the block form:

    [ B1                  F1               ] [ u1 ]   [ f1 ]
    [     B2                  F2           ] [ u2 ]   [ f2 ]
    [         ...                 ...      ] [ ..  ]  [ ..  ]
    [             Bp                  Fp   ] [ up ] = [ fp ]
    [ E1                  C1  E12 ... E1p  ] [ y1 ]   [ g1 ]
    [     E2              E21 C2  ... E2p  ] [ y2 ]   [ g2 ]
    [         ...         ..  ..  ... ..   ] [ ..  ]  [ ..  ]
    [             Ep      Ep1 Ep2 ... Cp   ] [ yp ]   [ gp ]
PARALLEL PRECONDITIONERS

Three approaches:
• Schwarz preconditioners
• Schur-complement based preconditioners
• Multi-level ILU-type preconditioners

➤ Observation: often, in practical applications, Schwarz preconditioners are used: SUB-OPTIMAL.
Domain-Decomposition-Type Preconditioners

[Figure: local view of the distributed matrix, with the local data Ai and the external data Xi.]

Block Jacobi Iteration (Additive Schwarz):
1. Obtain external data yi
2. Compute (update) local residual ri = (b − Ax)i = bi − Ai xi − Bi yi
3. Solve Ai δi = ri
4. Update solution xi = xi + δi
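Using the distributed mat-vec sketched earlier, one additive Schwarz (block Jacobi) step can be written as below. The callback local_solve is an assumption standing for whichever local solver is chosen (the options are discussed under “Local Solves” further down); all names are illustrative.

/*-------------------- one block Jacobi (additive Schwarz) step:
   obtain external data, form the local residual, solve with A_i,
   and update the local solution.                                    */
void block_jacobi_step(const DistMat *M, double *x, const double *b,
                       double *xext, double *sendbuf, double *r, double *d,
                       MPI_Comm comm,
                       void (*local_solve)(const DistMat *, const double *, double *))
{
    int i;
    /* steps 1-2: exchange boundary data and compute r = (b - A x)_i   */
    dist_matvec(M, x, r, xext, sendbuf, comm);
    for (i = 0; i < M->nloc; i++) r[i] = b[i] - r[i];

    /* step 3: solve A_i d = r with the chosen local solver            */
    local_solve(M, r, d);

    /* step 4: update the local part of the solution                   */
    for (i = 0; i < M->nloc; i++) x[i] += d[i];
}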
➤ Multiplicative Schwarz: requires a coloring of the subdomains.

[Figure: subdomains colored with four colors so that neighboring subdomains have different colors.]

Multicolor Block SOR Iteration (Multiplicative Schwarz):
1. Do col = 1, . . . , numcols
2.   If (col .eq. mycol) Then
3.     Obtain external data yi
4.     Update local residual ri = (b − Ax)i
5.     Solve Ai δi = ri
6.     Update solution xi = xi + δi
7.   EndIf
8. EndDo

Breaking the sequential color loop

➤ The “color” loop is sequential. It can be broken in several different ways:
(1) Have a few subdomains per processor.
(2) Separate interior nodes from interface nodes (2-level blocking).
(3) Use a block-GMRES algorithm, with block size = number of colors. The SOR step targets a different color on each column of the block ➤ no idle time.

[Figures: subdomain colorings illustrating (1) several subdomains per processor and (2) the separation of interior nodes from interface nodes, with colors 1–4.]

Local Solves

➤ Each local system Ai δi = ri can be solved in three ways:
1. By a (sparse) direct solver
2. Using a standard preconditioned Krylov solver
3. Doing a backward-forward solution associated with an accurate ILU (e.g. ILUT) preconditioner
➤ We only use (2) with a small number of inner steps (up to 10), or (3).
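A sketch of the multicolor sweep described above follows; the callback local_solve again stands for one of the three local solution options just listed. For simplicity every process takes part in the data exchange at every color by calling the full distributed mat-vec, whereas a real code would only exchange the interface values needed by the active color.

/*-------------------- multicolor block SOR (multiplicative Schwarz) sweep */
void multicolor_block_sor(const DistMat *M, int mycol, int numcols,
                          double *x, const double *b,
                          double *xext, double *sendbuf, double *r, double *d,
                          MPI_Comm comm,
                          void (*local_solve)(const DistMat *, const double *, double *))
{
    int col, i;
    for (col = 0; col < numcols; col++) {
        /* all processes exchange boundary data so that the active
           color sees the most recently updated interface values       */
        dist_matvec(M, x, r, xext, sendbuf, comm);
        if (col == mycol) {               /* only this color is active  */
            for (i = 0; i < M->nloc; i++) r[i] = b[i] - r[i];
            local_solve(M, r, d);
            for (i = 0; i < M->nloc; i++) x[i] += d[i];
        }
    }
}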
SCHUR COMPLEMENT-BASED PRECONDITIONERS

Schur complement system

The local system can be written as

    Ai xi + Xi yi,ext = bi.                                              (2)

xi = vector of local unknowns, yi,ext = external interface variables, and bi = local part of the RHS.

[Figure: local matrix Ai and the coupling Xi acting on the external data.]

➤ Local equations:

    [ Bi  Fi ] [ ui ]     [        0         ]     [ fi ]
    [ Ei  Ci ] [ yi ]  +  [ Σ_{j∈Ni} Eij yj  ]  =  [ gi ]                (3)

➤ Eliminate ui from the above system:

    Si yi + Σ_{j∈Ni} Eij yj = gi − Ei Bi⁻¹ fi ≡ gi′,                     (4)

where Si is the “local” Schur complement

    Si = Ci − Ei Bi⁻¹ Fi.

Structure of the Schur complement system
Global Schur complement system: S y = g′ with

        [ S1   E12  . . .  E1p ]        [ y1 ]         [ g1′ ]
    S = [ E21  S2   . . .  E2p ] ,  y = [ y2 ] ,  g′ = [ g2′ ]
        [ ..   ..   . . .  ..  ]        [ ..  ]        [ ..   ]
        [ Ep1  Ep2  . . .  Sp  ]        [ yp ]         [ gp′ ]

➤ The Eij’s are sparse; they are the same blocks as in the original matrix.
➤ Can solve the global Schur complement system iteratively, then back-substitute to recover the rest of the (internal) variables.
➤ Can use the procedure as a preconditioning to the global system.
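For illustration, the local Schur complement Si = Ci − Ei Bi⁻¹ Fi of eq. (4) can be formed explicitly on small dense blocks as sketched below. A naive LU without pivoting is used only to keep the example self-contained; the actual approach works with the sparse (I)LU factors of Ai, as discussed in the Schur-LU slides that follow.

/*-------------------- dense illustration of  S = C - E * B^{-1} * F.
   B is nB x nB, F is nB x nI, E is nI x nB, C and S are nI x nI,
   all stored row-major; B is overwritten by its LU factors.          */
#include <stdlib.h>

static void lu_nopivot(int n, double *B)              /* in-place LU of B */
{
    for (int k = 0; k < n; k++)
        for (int i = k + 1; i < n; i++) {
            B[i*n+k] /= B[k*n+k];
            for (int j = k + 1; j < n; j++)
                B[i*n+j] -= B[i*n+k] * B[k*n+j];
        }
}

static void lu_solve(int n, const double *LU, double *x)  /* solve B x = rhs */
{
    for (int i = 1; i < n; i++)                        /* forward sweep  */
        for (int j = 0; j < i; j++) x[i] -= LU[i*n+j] * x[j];
    for (int i = n - 1; i >= 0; i--) {                 /* backward sweep */
        for (int j = i + 1; j < n; j++) x[i] -= LU[i*n+j] * x[j];
        x[i] /= LU[i*n+i];
    }
}

void local_schur(int nB, int nI, double *B, const double *E,
                 const double *F, const double *C, double *S)
{
    double *z = malloc(nB * sizeof(double));
    lu_nopivot(nB, B);
    for (int j = 0; j < nI; j++) {                     /* one column of F at a time */
        for (int i = 0; i < nB; i++) z[i] = F[i*nI+j];
        lu_solve(nB, B, z);                            /* z = B^{-1} F(:,j)         */
        for (int i = 0; i < nI; i++) {
            double s = 0.0;
            for (int k = 0; k < nB; k++) s += E[i*nB+k] * z[k];
            S[i*nI+j] = C[i*nI+j] - s;
        }
    }
    free(z);
}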
Simplest idea: Schur Complement Iterations

[Figure: partition of the local unknowns into internal variables ui and interface variables yi.]

➤ Do a global primary iteration (e.g., block-Jacobi).
➤ Then accelerate only the y variables (with a Krylov method).
Still need to precondition...

Approximate Schur-LU

➤ Two-level method based on an induced preconditioner. The global system can also be viewed as

    [ B  F ] [ u ]   [ f ]
    [ E  C ] [ y ] = [ g ] ,

with

        [ B1               F1 ]
        [     B2           F2 ]
    A = [         ...      .. ]
        [             Bp   Fp ]
        [ E1  E2  · · · Ep  C ]

Block LU factorization of A:

    [ B  F ]   [ B  0 ] [ I  B⁻¹F ]
    [ E  C ] = [ E  S ] [ 0    I  ]
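The induced preconditioner M = [B 0; E MS][I B⁻¹F; 0 I] is applied to a residual (ru, ry) by one forward and one backward block sweep, as in this sketch. The solves with B and MS and the products with E and F are passed as assumed callbacks (all names are illustrative), and solve_B is assumed to tolerate aliased input/output.

/*-------------------- apply the induced block-LU preconditioner       */
void schur_lu_prec_apply(int nu, int ny,
                         const double *r_u, const double *r_y,
                         double *z_u, double *z_y,
                         double *wku, double *wky,   /* scratch: lengths nu, ny */
                         void (*solve_B)(const double *, double *),
                         void (*solve_MS)(const double *, double *),
                         void (*matvec_E)(const double *, double *),
                         void (*matvec_F)(const double *, double *))
{
    int i;
    /* forward solve with L = [B 0; E MS]:
       z_u = B^{-1} r_u ;  z_y = MS^{-1} (r_y - E z_u)                  */
    solve_B(r_u, z_u);
    matvec_E(z_u, wky);                          /* wky = E z_u          */
    for (i = 0; i < ny; i++) wky[i] = r_y[i] - wky[i];
    solve_MS(wky, z_y);

    /* backward solve with U = [I B^{-1}F; 0 I]:
       z_u = z_u - B^{-1} (F z_y)                                       */
    matvec_F(z_y, wku);                          /* wku = F z_y          */
    solve_B(wku, wku);                           /* in-place solve assumed OK */
    for (i = 0; i < nu; i++) z_u[i] -= wku[i];
}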
Preconditioning:

    L = [ B   0  ]            U = [ I  B⁻¹F ]
        [ E   MS ]    and         [ 0    I  ]

with MS = some approximation to S.

➤ A preconditioning to the global system can be induced from any preconditioning of the Schur complement.

Rewrite the local Schur system as

    yi + Si⁻¹ Σ_{j∈Ni} Eij yj = Si⁻¹ ( gi − Ei Bi⁻¹ fi ).

➤ This is equivalent to a Block-Jacobi preconditioner for the Schur complement.
➤ Solve with, e.g., a few steps (e.g., 5) of GMRES.

➤ Question: how to solve with Si?

➤ Can use the LU factorization of the local matrix Ai = [ Bi Fi ; Ei Ci ] and exploit the relation:

    Ai = [ LBi            0   ] [ UBi   LBi⁻¹ Fi ]
         [ Ei UBi⁻¹      LSi  ] [ 0         USi  ]      →    LSi USi = Si

➤ Need only the (I)LU factorization of the Ai [the rest is already available].
➤ Very easy implementation of (parallel) Schur complement techniques for vertex-based partitioned systems: YS-Sosonkina ’97; YS-Sosonkina-Zhang ’99.
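The relation LSi USi = Si means that, once Ai = [Bi Fi; Ei Ci] has been factored, a solve with Si is just a forward/backward sweep restricted to the trailing interface rows. The dense, no-pivoting sketch below illustrates this; the actual codes do the same thing with the sparse ILU factors of Ai, so only an approximation to Si⁻¹ is applied.

/*-------------------- solve S_i w = rhs using only the trailing blocks
   of the (no-pivot) LU factors of A_i, stored in-place in the dense
   row-major array LU of size (nB+nI) x (nB+nI); w has length nI.      */
void schur_solve_from_lu(int nB, int nI, const double *LU, double *w)
{
    int n = nB + nI;
    for (int i = 0; i < nI; i++)               /* forward sweep with L_Si  */
        for (int j = 0; j < i; j++)
            w[i] -= LU[(nB+i)*n + (nB+j)] * w[j];
    for (int i = nI - 1; i >= 0; i--) {        /* backward sweep with U_Si */
        for (int j = i + 1; j < nI; j++)
            w[i] -= LU[(nB+i)*n + (nB+j)] * w[j];
        w[i] /= LU[(nB+i)*n + (nB+i)];
    }
}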
An experiment

[Table: number of FGMRES(20) iterations for the AF23560 problem, for the BJ, SAPINV(S), and SLU preconditioners at several values of lfil.]

Times and iteration counts for solving a 360 × 360 discretized Laplacean problem with 3 different preconditioners using flexible GMRES(10):

[Figures: 360 × 360 mesh – CPU time (T3E seconds) and iteration counts versus number of processors; solid line: BJ, dash-dot line: SAPINV, dash-star line: SLU.]

➤ Solution times for a Laplacean problem with various local subproblem sizes using FGMRES(10) with 3 different preconditioners (BJ, SAPINV, SLU) and the Schur complement iteration (SI):

[Figures: 70 × 70 mesh, and 50 × 50 mesh in each PE – times (T3E seconds) versus number of processors; solid line: BJ, dash-dot line: SAPINV, dash-star line: SLU, dash-circle line: SI.]

PARALLEL ARMS
Parallel implementation of ARMS

Main ideas: (1) exploit recursivity; (2) distinguish two phases: elimination of interior points, then of interface points.

Three types of points: interior points (independent sets), local interfaces, and inter-domain (global) interfaces.

[Figure: block structure with blocks B, I1, I2, and IS, showing interior points, local interfaces, and inter-domain interfaces.]

Result: a 2-part Schur complement, one part corresponding to local interfaces and the other to inter-domain interfaces.
Three approaches

Method 1: Simple additive Schwarz using ILUT or ARMS locally.
Method 2: Schur complement approach. Solve the Schur complement system (both I1 and I2) with either a block Jacobi (M. Sosonkina and YS, ’99) or a multicolor ILU(0).
Method 3: Do independent set reduction across subdomains. Requires the construction of global group independent sets.
➤ Current status: Methods 1 and 2.

Construction of global group independent sets: a two-level strategy

[Figure: four subdomains (Proc 1–Proc 4) with their group independent sets.]

1. Color the subdomains
2. Find group independent sets locally
3. Color the groups consistently
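Step 1 (coloring the subdomains) can be done, for instance, with a simple greedy pass over the subdomain (quotient) graph, as in this sketch; the CSR-like adjacency arrays of the p subdomains are assumed inputs, and parallel coloring heuristics would be used in practice.

/*-------------------- greedy coloring of the subdomain graph:
   colors are returned in color[0..p-1]; each subdomain receives the
   smallest color not already used by one of its neighbors.           */
void greedy_color(int p, const int *ptr, const int *adj, int *color)
{
    for (int i = 0; i < p; i++) color[i] = -1;
    for (int i = 0; i < p; i++) {
        int c = 0, again = 1;
        while (again) {                 /* smallest color free of conflicts */
            again = 0;
            for (int k = ptr[i]; k < ptr[i+1]; k++)
                if (color[adj[k]] == c) { c++; again = 1; break; }
        }
        color[i] = c;
    }
}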
Methods implemented in pARMS:

add_x     Additive Schwarz with method x for the subdomains, with or without overlap; x = one of ILUT, ILUK, ARMS.
sch_x     Schur complement technique, with method x = factorization used for the local submatrix (same x as above). Equivalent to an additive Schwarz preconditioner on the Schur complement.
sch_sgs   Multicolor multiplicative Schwarz (block Gauss-Seidel) preconditioning is used instead of additive Schwarz for the Schur complement.
sch_gilu0 ILU(0) preconditioning to solve the global Schur complement system obtained from the ARMS reduction.

Algorithm: Multicolor Distributed ILU(0)
1. Eliminate local rows
2. Receive external interface rows from PEs s.t. color(PE) < MyColor
3. Process local interface rows
4. Send local interface rows to PEs s.t. color(PE) > MyColor

[Figure: subdomains colored with four colors; internal and external interface points.]
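The communication pattern of the Multicolor Distributed ILU(0) algorithm can be sketched as below. The elimination routines are assumed callbacks, and the buffer layout (one fixed-size block of rows per neighbor) is a simplification for illustration only; it is not the pARMS implementation.

/*-------------------- color-ordered exchange for multicolor ILU(0)     */
#include <mpi.h>

void mc_distributed_ilu0(const DistMat *M, int mycolor, const int *neigh_color,
                         double *ext_rows, int ext_row_len,
                         const double *bnd_rows, int bnd_row_len, MPI_Comm comm,
                         void (*eliminate_interior)(const DistMat *),
                         void (*eliminate_interface)(const DistMat *, const double *))
{
    int p;
    /* 1. eliminate purely local (interior) rows -- no communication    */
    eliminate_interior(M);

    /* 2. receive factored interface rows from lower-colored neighbors  */
    for (p = 0; p < M->nneigh; p++)
        if (neigh_color[p] < mycolor)
            MPI_Recv(ext_rows + p * ext_row_len, ext_row_len, MPI_DOUBLE,
                     M->ext_procs[p], 1, comm, MPI_STATUS_IGNORE);

    /* 3. process local interface rows using the received data          */
    eliminate_interface(M, ext_rows);

    /* 4. send local interface rows to higher-colored neighbors         */
    for (p = 0; p < M->nneigh; p++)
        if (neigh_color[p] > mycolor)
            MPI_Send((void *)bnd_rows, bnd_row_len, MPI_DOUBLE,
                     M->ext_procs[p], 1, comm);
}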
Test problem

1. Scalability experiment: sample finite difference problem

       −∆u + γ ( e^{xy} ∂u/∂x + e^{−xy} ∂u/∂y ) + α u = f ,

   with Dirichlet boundary conditions; γ = 100, α = −10; centered-difference discretization.
   ➤ Keep the size constant on each processor [100 × 100] ➤ global linear system with 10,000 ∗ nproc unknowns.
2. Comparison with a parallel direct solver – symmetric problems.
3. Large irregular matrix example arising from magnetohydrodynamics.

[Figure: 100 × 100 mesh per processor – wall-clock time (Origin 3800 seconds) versus number of processors; curves: add_arms, add_arms no its, add_arms ovp, add_arms ovp no its, add_ilut, add_ilut no its, add_ilut ovp, add_ilut ovp no its. Times for the 2D PDE problem with fixed subproblem size.]
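For reference, the centered-difference stencil used for this test problem at an interior grid point (i, j) with mesh size h is sketched below; boundary handling and the distributed assembly into Aloc/Bext are omitted, and the routine name is illustrative.

/*-------------------- 5-point centered-difference stencil for
   -Lap(u) + gamma*(exp(xy) u_x + exp(-xy) u_y) + alpha*u = f           */
#include <math.h>

void stencil(int i, int j, double h, double gamma, double alpha,
             double *c, double *w, double *e, double *s, double *n)
{
    double x = i * h, y = j * h;
    double cx = gamma * exp( x * y) / (2.0 * h);   /* convection in x */
    double cy = gamma * exp(-x * y) / (2.0 * h);   /* convection in y */
    double d  = 1.0 / (h * h);                     /* diffusion       */

    *c = 4.0 * d + alpha;        /* center           */
    *w = -d - cx;                /* west  (i-1, j)   */
    *e = -d + cx;                /* east  (i+1, j)   */
    *s = -d - cy;                /* south (i, j-1)   */
    *n = -d + cy;                /* north (i, j+1)   */
}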
[Figures: 100 × 100 mesh per processor – iteration counts and wall-clock times (Origin 3800 seconds) versus number of processors; curves: add_arms no its, add_arms ovp no its, sch_arms, sch_gilu0, sch_gilu0 no its, sch_sgs, sch_sgs no its. Times and iterations for the 2D PDE problem with fixed subproblem size.]

Software

Direct solvers:
➤ SUPERLU: http://crd.lbl.gov/~xiaoye/SuperLU/
➤ MUMPS [CERFACS]
➤ Univ. Minn. / IBM’s PSPASES [SPD matrices]: http://www-users.cs.umn.edu/~mjoshi/pspases/
➤ UMFPACK
Iterative solvers:
➤ PETSc: http://acts.nersc.gov/petsc/ and Trilinos (more recent): http://trilinos.sandia.gov/ are very comprehensive packages.
➤ PETSc includes only a few preconditioners...
➤ Hypre, ML, ..., all include interfaces to PETSc or Trilinos.
➤ pARMS: http://www.cs.umn.edu/~saad/software is a more modest effort.