ITERATIVE METHODS BASED ON KRYLOV SUBSPACES
LONG CHEN
We shall present iterative methods for solving the linear algebraic equation Au = b based
on Krylov subspaces. We derive the conjugate gradient (CG) method, developed by Hestenes
and Stiefel in the 1950s [1], for a symmetric and positive definite matrix A, and briefly mention
the GMRES method [3] for general non-symmetric systems.
1. CONJUGATE GRADIENT METHOD
In this section, we introduce the conjugate gradient (CG) method for solving
Au = b
with a symmetric and positive definite (SPD) operator A defined on a Hilbert space V with
dim V = N, together with its preconditioned version (PCG). We derive CG from the point of view
of projection in the A-inner product. We shall use (·, ·) for the standard inner product and (·, ·)A
for the inner product introduced by the SPD operator A. When we say 'orthogonal' vectors, we refer to the default (·, ·) inner product and emphasize the orthogonality in (·, ·)A
by 'A-orthogonal'.
1.1. Gradient Method. Recall that the simplest iterative method, the Richardson method, is
of the form

uk+1 = uk + α rk,

where α is a fixed constant. The correction α rk can be thought of as an approximate solution
of the residual equation Aek = rk. Now we solve the residual equation Aek = rk in the
one-dimensional subspace spanned by rk, i.e., we look for e∗k = αk rk such that

A(αk rk) = rk    in span{rk}.
Testing with rk, we get

(1)    αk = (rk, rk)/(Ark, rk) = (u − uk, rk)A/(rk, rk)A.
The iterative method

uk+1 = uk + αk rk

is called the gradient method. Since αk is a nonlinear function of uk, the method is
nonlinear. The second identity in (1) implies that the correction e∗k = αk rk is the projection
of the error ek = u − uk onto the space span{rk} in the A-inner product. From that, we
immediately get the A-orthogonality

(2)    (ek − e∗k, rk)A = 0,

which can be rewritten as

(3)    (u − uk+1, rk)A = (rk+1, rk) = 0.

Namely, the residuals of two consecutive steps are orthogonal.
Date: March 6, 2016.
Remark 1.1. The name gradient method comes from the minimization point of view. Let
E(u) = ‖u‖²A/2 − (b, u) and consider the optimization problem min_{u∈V} E(u). Given
a current approximation uk, the descent direction is the negative gradient −∇E(uk) =
b − Auk = rk. The optimal parameter αk is obtained by considering the one-dimensional
minimization problem min_α E(uk + αrk). The residual is simply the (negative) gradient
of the energy.
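To fix ideas, here is a minimal MATLAB sketch of the gradient method, written in the same style as the CG listing given later; the function name, the stopping rule, and the iteration cap are our own illustrative choices, not part of the original notes.

function u = gradientmethod(A,b,u,tol)
% Gradient (steepest descent) method: u_{k+1} = u_k + alpha_k r_k
% with the exact line search alpha_k = (r_k,r_k)/(A r_k,r_k) from (1).
tol = tol*norm(b);
r = b - A*u;
k = 1;
while norm(r) >= tol && k < 10*length(b)   % illustrative iteration cap
    Ar = A*r;
    alpha = (r'*r)/(r'*Ar);   % optimal step size
    u = u + alpha*r;
    r = r - alpha*Ar;         % recursive residual update
    k = k + 1;
end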
We now give a convergence analysis of the gradient method to show that it converges at the same rate as
the optimal Richardson method. Recall that in the Richardson method, the optimal choice
α∗ = 2/(λmin(A) + λmax(A)) requires information about the eigenvalues of A, while in the
gradient method αk is computed using the action of A only. The optimality is built in
through the optimization of the step size (the so-called exact line search).
Theorem 1.2. Let A be SPD and let uk be the kth iteration in the gradient method with
an initial guess u0. We have

(4)    ‖u − uk‖A ≤ ((κ(A) − 1)/(κ(A) + 1))^k ‖u − u0‖A,

where κ(A) = λmax(A)/λmin(A) is the condition number of A.
Proof. Since we solve the residual equation in the subspace span{rk} exactly, or in other
words, αk rk is the A-projection of the error u − uk onto span{rk}, we have

‖u − uk+1‖A = ‖ek − e∗k‖A = inf_{α∈R} ‖ek − αrk‖A = inf_{α∈R} ‖(I − αA)(u − uk)‖A.

Since I − αA is symmetric with respect to the A-inner product, ‖I − αA‖A = ρ(I − αA). Consequently

‖u − uk+1‖A ≤ inf_{α∈R} ρ(I − αA) ‖u − uk‖A = ((κ(A) − 1)/(κ(A) + 1)) ‖u − uk‖A,

where the infimum is attained at the optimal Richardson parameter α∗ = 2/(λmin(A) + λmax(A)).
The proof is completed by recursion on k. □
The level set of ‖x‖²A is an ellipsoid, and the condition number of A measures how elongated
the ellipsoid is (the ratio of the major and minor axes is √κ(A)). When κ(A) ≫ 1, the ellipsoid is very skinny and the
gradient method follows a long zig-zag path towards the solution.
1.2. Conjugate Gradient Method. By tracing back to the initial guess u0, the (k+1)-th
approximation of the gradient method can be written as
uk+1 = u0 + α0 r0 + α1 r1 + · · · + αk rk .
Let
Vk = span{r0 , r1 , · · · , rk }.
The correction ∑_{i=0}^{k} αi ri can be thought of as an approximate solution of the residual equation

Ae0 = r0

in Vk, obtained by computing the coefficients along each basis vector of {r0, r1, · · · , rk}. It is no longer the
best approximation of e0 = u − u0 in the subspace Vk since the basis {r0, r1, · · · , rk}
is not A-orthogonal. If we can find an A-orthogonal basis {p0, p1, · · · , pk} of Vk, i.e.,
(pi, pj)A = 0 if i ≠ j, then the best approximation, i.e., the A-projection Pk e0 ∈ Vk, can be
found by the formula

Pk e0 = ∑_{i=0}^{k} αi pi,    αi = (e0, pi)A/(pi, pi)A,    for i = 0, · · · , k.
Let uk = u0 + ∑_{i=0}^{k−1} αi pi for k ≥ 1; then uk+1 can be computed recursively by

uk+1 = uk + αk pk,    αk = (u − u0, pk)A/(pk, pk)A.
The residual
rk+1 = b − Auk+1 = b − Auk − αk Apk = rk − αk Apk
can also be updated recursively. The vector pk is called a conjugate direction, which is
better than the gradient direction since it is A-orthogonal to all previous directions, while
the gradient direction is only orthogonal (not even A-orthogonal) to the previous one.
The questions are: How do we find such an A-orthogonal basis? Which A-orthogonal basis
is preferable in terms of storage and computation?
The answer to the first question is easy: the Gram-Schmidt process. We begin with p0 =
r0. Update u1 = u0 + α0 p0 and compute the new residual r1 = b − Au1, or update
r1 = r0 − α0 Ap0. If r1 = 0, we obtain the exact solution and stop. Otherwise we have
two vectors {p0, r1} and apply the Gram-Schmidt process to obtain

p1 = r1 + β0 p0,    β0 = −(r1, p0)A/(p0, p0)A,

which is A-orthogonal to p0. We can then update u2 = u1 + α1 p1 and continue.
In general, suppose we have found an A-orthogonal basis {p0 , p1 , · · · , pk } for Vk . We
update the approximation uk+1 = uk + αk pk and update the residual rk+1 = rk − αk Apk .
Again we apply the Gram-Schmidt process to look for the next conjugate direction
pk+1 = rk+1 + βk pk + γk−1 pk−1 + · · · + γ0 p0 .
The coefficients can be computed by the A-orthogonality (pk+1 , pi )A = 0 for 0 ≤ i ≤ k
and the formula for γi is γi = −(rk+1, pi)A/(pi, pi)A. If we computed all these
coefficients, we would need to store all previous basis vectors, which is inefficient in both time
and space. The amazing fact is that all γi, 0 ≤ i ≤ k − 1, are zero. We can simply compute
the new conjugate direction as

(5)    pk+1 = rk+1 + βk pk,    βk = −(rk+1, pk)A/(pk, pk)A.
The formula (5) is justified by the following orthogonality.
Lemma 1.3. The residual rk+1 is (·, ·)-orthogonal to Vk and A-orthogonal to Vk−1 .
Proof. Since {p0, p1, · · · , pk} is an A-orthogonal basis of Vk, as we mentioned before,
the formula uk+1 − u0 = ∑_{i=0}^{k} αi pi actually computes the A-projection of the error, i.e.,
uk+1 − u0 = Pk(u − u0). We write u − uk+1 = u − u0 − (uk+1 − u0) = (I − Pk)(u − u0).
Therefore u − uk+1 ⊥A Vk, i.e., rk+1 = A(u − uk+1) ⊥ Vk. The first orthogonality is proved.
By the update formula for the residual ri = ri−1 − αi−1 Api−1, we get Api−1 ∈
span{ri−1, ri} ⊂ Vk for i ≤ k. Since we have proved rk+1 ⊥ Vk, we get (rk+1, pi)A =
(rk+1, Api) = 0 for i ≤ k − 1, i.e., rk+1 is A-orthogonal to Vk−1. □
This lemma shows the advantage of the conjugate gradient method over the gradient
method. The new residual is orthogonal to the whole subspace, not only to the residual vector
of the previous step. The additional orthogonality reduces the Gram-Schmidt process to
a three-term recursion.
In summary, starting from an initial guess u0 and p0 = r0, for k = 0, 1, 2, · · · , we use
three recursion formulae to compute

uk+1 = uk + αk pk,    αk = (u − u0, pk)A/(pk, pk)A,
rk+1 = rk − αk Apk,
pk+1 = rk+1 + βk pk,    βk = −(rk+1, pk)A/(pk, pk)A.

The method is called the conjugate gradient (CG) method since the conjugate direction is
obtained by a correction of the gradient (the residual) direction. The importance of the
gradient direction is the orthogonality given in Lemma 1.3.
We now derive more efficient formulae for αk and βk.

Proposition 1.4.

αk = (rk, rk)/(Apk, pk),    βk = (rk+1, rk+1)/(rk, rk).

Proof. First, since rk = r0 − ∑_{i=0}^{k−1} αi Api and (pk, Api) = 0 for 0 ≤ i ≤ k − 1, we have

(u − u0, pk)A = (r0, pk) = (rk, pk).

Then, since pk = rk + βk−1 pk−1 and (rk, pk−1) = 0, we have

(rk, pk) = (rk, rk).

Recall that αk is the coefficient of u − u0 corresponding to the basis vector pk, i.e.,

αk = (u − u0, pk)A/(pk, pk)A = (rk, rk)/(Apk, pk).

Second, to prove the formula for βk, we use the recursion formula for the residual and the
orthogonality (rk+1, rk) = 0 to get

(rk+1, Apk) = −(1/αk)(rk+1, rk+1).

Then, using the formula just proved for αk, we replace

(Apk, pk) = (1/αk)(rk, rk)

to get

βk = −(rk+1, pk)A/(pk, pk)A = −(rk+1, Apk)/(Apk, pk) = (rk+1, rk+1)/(rk, rk). □
We present the algorithm of the conjugate gradient method as follows.
function u = CG(A,b,u,tol)
% Conjugate gradient method for an SPD matrix A with initial guess u
tol = tol*norm(b);
k = 1;
r = b - A*u;
p = r;
r2 = r'*r;
while sqrt(r2) >= tol && k < length(b)
    Ap = A*p;
    alpha = r2/(p'*Ap);     % alpha_k = (r_k,r_k)/(A p_k,p_k)
    u = u + alpha*p;
    r = r - alpha*Ap;
    r2old = r2;
    r2 = r'*r;
    beta = r2/r2old;        % beta_k = (r_{k+1},r_{k+1})/(r_k,r_k)
    p = r + beta*p;
    k = k + 1;
end
Several remarks on the implementation are listed below:
• The most time-consuming part is the matrix-vector multiplication A*p.
• The error is measured by the relative residual ‖r‖ ≤ tol ‖b‖.
• A maximum number of iterations is imposed to avoid an excessively long loop.
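As a usage illustration, the listing can be called as follows; the model matrix and tolerance below are our own choices, not from the notes.

% Hypothetical driver for the CG listing above.
n = 1000;
A = gallery('tridiag', n, -1, 2, -1) + speye(n);  % well-conditioned SPD matrix
b = ones(n, 1);
u = CG(A, b, zeros(n,1), 1e-8);
fprintf('relative residual = %.2e\n', norm(b - A*u)/norm(b));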
1.3. Convergence Analysis of Conjugate Gradient Method. Recall that
Vk = span{r0 , r1 , · · · , rk } = span{p0 , p1 , · · · , pk }.
An important observation is that
Lemma 1.5.
Vk = span{r0, Ar0, · · · , A^k r0}.
Proof. We prove it by induction. For k = 0, it is trivial. Suppose the statement holds for
k = i. We prove it holds for k = i + 1 by noting that
Vi+1 = Vi + span{ri+1 } = Vi + span{ri , Api } = Vi + span{Api } = Vi + AVi .
The space span{r0, Ar0, · · · , A^k r0} is called a Krylov subspace. The CG method belongs to a large class of Krylov subspace iterative methods.
By the construction, we have
uk = u0 + Pk−1 (u − u0 ).
Writing u − uk = u − u0 − (uk − u0), we have

(6)    ‖u − uk‖A = inf_{v∈Vk−1} ‖u − u0 + v‖A.
Theorem 1.6. Let A be SPD and let uk be the kth iteration in the CG method with an
initial guess u0 . Then
(7)    ‖u − uk‖A = inf_{pk∈Pk, pk(0)=1} ‖pk(A)(u − u0)‖A,

(8)    ‖u − uk‖A ≤ inf_{pk∈Pk, pk(0)=1} sup_{λ∈σ(A)} |pk(λ)| ‖u − u0‖A.
Proof. For v ∈ Vk−1, it can be expanded as

v = ∑_{i=0}^{k−1} ci A^i r0 = ∑_{i=1}^{k} c_{i−1} A^i (u − u0).

Let pk(x) = 1 + ∑_{i=1}^{k} c_{i−1} x^i. Then

u − u0 + v = pk(A)(u − u0).

The identity (7) then follows from (6).
Since A^i is symmetric in the A-inner product, we have

‖pk(A)‖A = ρ(pk(A)) = sup_{λ∈σ(A)} |pk(λ)|,

which leads to the estimate (8). □
The polynomial pk ∈ Pk with the constraint pk(0) = 1 will be called a residual polynomial. Various convergence results for the CG method can be obtained by choosing specific
residual polynomials.
Corollary 1.7. Let A be SPD with eigenvectors {φi}_{i=1}^{N}. Let b ∈ span{φ_{i1}, φ_{i2}, · · · , φ_{ik}},
k ≤ N. Then the CG method with u0 = 0 will find the solution of Au = b in at most
k iterations. In particular, for a given b ∈ R^N, the CG algorithm will find the solution of
Au = b within N iterations. Namely, CG can be viewed as a direct method.
Proof. Choose the polynomial p(x) = ∏_{l=1}^{k} (1 − x/λ_{i_l}). Then p is a residual polynomial
of degree k with p(0) = 1 and p(λ_{i_l}) = 0. Let b = ∑_{l=1}^{k} b_l φ_{i_l}. Then u = ∑_{l=1}^{k} u_l φ_{i_l} with
u_l = b_l/λ_{i_l} and

p(A) ∑_{l=1}^{k} u_l φ_{i_l} = ∑_{l=1}^{k} u_l p(λ_{i_l}) φ_{i_l} = 0.

The assertion then follows from the identity (7). □
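A small numerical experiment illustrates the finite termination property; the construction below (random orthogonal eigenbasis, three distinct eigenvalues) is our own illustrative choice and calls the CG listing from Section 1.2.

% Hypothetical check of Corollary 1.7: SPD matrix with 3 distinct eigenvalues.
n = 120;
[Q, ~] = qr(randn(n));                        % random orthonormal eigenbasis
d = [ones(1,40), 10*ones(1,40), 100*ones(1,40)];
A = Q*diag(d)*Q';                             % sigma(A) = {1, 10, 100}
b = randn(n, 1);
u = CG(A, b, zeros(n,1), 1e-12);
fprintf('relative residual = %.2e\n', norm(b - A*u)/norm(b));
% In exact arithmetic the residual polynomial (1-x)(1-x/10)(1-x/100)
% vanishes on sigma(A), so CG terminates after at most 3 iterations.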
Remark 1.8. The CG method can also be applied to a symmetric positive semi-definite
matrix A. Let {φi}_{i=1}^{k} be the eigenvectors associated with λmin(A) = 0. Then from Corollary 1.7, if b ∈ range(A) = span{φk+1, φk+2, · · · , φN}, the CG method with
u0 ∈ range(A) will find a solution of Au = b within N − k iterations.
CG was invented as a direct method but it is more effective when used as an iterative
method. The rate of convergence depends crucially on the distribution of the eigenvalues of A,
and the iteration may converge to the solution within a given tolerance in k ≪ N steps.
Theorem 1.9. Let uk be the k-th iteration of the CG method with initial guess u0. Then

(9)    ‖u − uk‖A ≤ 2 ((√κ(A) − 1)/(√κ(A) + 1))^k ‖u − u0‖A.
Proof. Let b = λmax(A) and a = λmin(A). Choose

pk(x) = Tk((b + a − 2x)/(b − a)) / Tk((b + a)/(b − a)),

where Tk(x) is the Chebyshev polynomial of degree k given by

Tk(x) = cos(k arccos x) if |x| ≤ 1,    Tk(x) = cosh(k arccosh x) if |x| ≥ 1.
It is easy to see that pk(0) = 1, and if x ∈ [a, b], then

|(b + a − 2x)/(b − a)| ≤ 1.

Hence

inf_{pk∈Pk, pk(0)=1} sup_{λ∈σ(A)} |pk(λ)| ≤ [Tk((b + a)/(b − a))]^{−1}.
We set

(b + a)/(b − a) = cosh σ = (e^σ + e^{−σ})/2.

Solving this equation for e^σ, we have

e^σ = (√κ(A) + 1)/(√κ(A) − 1)

with κ(A) = b/a. We then obtain

Tk((b + a)/(b − a)) = cosh(kσ) = (e^{kσ} + e^{−kσ})/2 ≥ (1/2) e^{kσ} = (1/2) ((√κ(A) + 1)/(√κ(A) − 1))^k,

which completes the proof. □
The estimate (9) shows that CG is in general better than the gradient method. Furthermore, if the condition number of A is close to one, the CG iteration converges very fast.
Even if κ(A) is large, the iteration will perform well if the eigenvalues are clustered in a
few small intervals; see the estimate below.
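To make the comparison concrete, the following back-of-the-envelope computation (with an assumed condition number κ(A) = 10^4, our own illustrative value) contrasts the bounds (4) and (9).

% Illustrative comparison of the gradient and CG error bounds.
kappa  = 1e4;                                 % assumed condition number
rho_gd = (kappa - 1)/(kappa + 1);             % gradient factor, about 0.9998
rho_cg = (sqrt(kappa) - 1)/(sqrt(kappa) + 1); % CG factor, about 0.9802
% iterations needed to reduce the A-norm error bound by a factor 1e-6:
k_gd = ceil(log(1e-6)/log(rho_gd))            % roughly 6.9e4 iterations
k_cg = ceil(log(0.5e-6)/log(rho_cg))          % roughly 7.3e2 iterations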
Corollary 1.10. Assume that σ(A) = σ0(A) ∪ σ1(A) and l is the number of elements in
σ0(A). Then

(10)    ‖u − uk‖A ≤ 2M ((√(b/a) − 1)/(√(b/a) + 1))^{k−l} ‖u − u0‖A,

where

a = min_{λ∈σ1(A)} λ,    b = max_{λ∈σ1(A)} λ,    and M = max_{λ∈σ1(A)} ∏_{µ∈σ0(A)} |1 − λ/µ|.
1.4. Preconditioned Conjugate Gradient (PCG) method. Let B be an SPD matrix. We
shall apply the Gram-Schmidt process to the subspace
Vk = span{Br0 , Br1 , · · · , Brk },
to get another A-orthogonal basis. That is, starting from an initial guess u0 and p0 = Br0 ,
for k = 0, 1, 2, · · · , we use three recursion formulae to compute

uk+1 = uk + αk pk,    αk = (Brk, rk)/(Apk, pk),
rk+1 = rk − αk Apk,
pk+1 = Brk+1 + βk pk,    βk = (Brk+1, rk+1)/(Brk, rk).

We need to justify that the pk+1 computed above is A-orthogonal to Vk.

Lemma 1.11. Brk+1 is A-orthogonal to Vk−1 and pk+1 is A-orthogonal to Vk.
Exercise 1.12. Prove Lemma 1.11.
We present the algorithm of the preconditioned conjugate gradient method as follows.
function u = pcg(A,b,u,B,tol)
% Preconditioned conjugate gradient method with SPD preconditioner B
tol = tol*norm(b);
k = 1;
r = b - A*u;
rho = 1;                      % initial value so that the loop is entered
while sqrt(rho) >= tol && k < length(b)
    Br = B*r;                 % apply the preconditioner
    rho = r'*Br;
    if k == 1
        p = Br;
    else
        beta = rho/rho_old;
        p = Br + beta*p;
    end
    Ap = A*p;
    alpha = rho/(p'*Ap);
    u = u + alpha*p;
    r = r - alpha*Ap;
    rho_old = rho;
    k = k + 1;
end
Similarly, we collect several remarks on the implementation:
• If we think of A : V → V′, the residual lies in the dual space V′. B is the Riesz
representation from V′ to V with respect to some inner product. In the original CG, B is the
identity matrix. A general guiding principle for designing a good preconditioner is to find
the correct inner product such that the operator A is continuous and stable.
• For an effective PCG, the computation B*r is crucial. The matrix B does not have
to be formed explicitly; all we need is the matrix-vector multiplication, which can
be replaced by a subroutine, as sketched below.
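For instance, one way to realize this (a sketch with our own helper name, not part of the original listing) is to accept either an explicit matrix or a MATLAB function handle, and to replace the line Br = B*r in the pcg listing by Br = applyB(B,r):

function Br = applyB(B, r)
% Apply the preconditioner to a residual vector. B may be an explicit
% matrix or a function handle implementing the action r -> B*r, e.g.
% B = @(r) L'\(L\r) with L = ichol(A) (incomplete Cholesky factor).
if isa(B, 'function_handle')
    Br = B(r);
else
    Br = B*r;
end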
To perform the convergence analysis, we let r̃0 = Br0 . Then one can easily verify that
Vk = span{r̃0, BAr̃0, · · · , (BA)^k r̃0}.

Since the optimality

‖u − uk‖A = inf_{v∈Vk−1} ‖u − u0 + v‖A

still holds, and BA is symmetric in the A-inner product, we obtain the following convergence rate of PCG.
Theorem 1.13. Let A be SPD and let uk be the kth iteration in the PCG method with the
preconditioner B and an initial guess u0 . Then
(11)    ‖u − uk‖A = inf_{pk∈Pk, pk(0)=1} ‖pk(BA)(u − u0)‖A,

(12)    ‖u − uk‖A ≤ inf_{pk∈Pk, pk(0)=1} sup_{λ∈σ(BA)} |pk(λ)| ‖u − u0‖A,

(13)    ‖u − uk‖A ≤ 2 ((√κ(BA) − 1)/(√κ(BA) + 1))^k ‖u − u0‖A.
The SPD matrix B is called a preconditioner. A good preconditioner should have the
properties that the action of B is easy to compute and that κ(BA) is significantly smaller
than κ(A). The design of a good preconditioner requires knowledge of the problem,
and finding a good preconditioner is a central topic of scientific computing.
For a linear iterative method ΦB with a symmetric iterator B, the iterator can be used
as a preconditioner for A. The corresponding PCG method converges at a faster rate. Let
ρ denote the convergence rate of ΦB , i.e. ρ = ρ(I − BA). Then
δ = (√κ(BA) − 1)/(√κ(BA) + 1) ≤ (1 − √(1 − ρ²))/ρ < ρ.
A more interesting fact is that the scheme ΦB may not be convergent at all, whereas B can
always serve as a preconditioner; that is, even if ρ > 1, the PCG rate satisfies δ < 1. For example, the
Jacobi method is not convergent for every SPD system, but B = D^{−1} can always be used
as a preconditioner, which is often known as the diagonal preconditioner. Another popular
preconditioner is the incomplete Cholesky factorization.
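As a usage sketch (the badly scaled SPD model matrix and the tolerance are our own illustrative choices), the pcg listing above can be called with the diagonal preconditioner as follows; an incomplete Cholesky preconditioner would instead be applied through triangular solves, which requires passing B as a subroutine as in the applyB sketch above.

% Hypothetical driver for the pcg listing with a diagonal (Jacobi) preconditioner.
n = 1000;
e = ones(n, 1);
d = (1:n)'.^2 + 2;                         % widely varying diagonal entries
A = spdiags([-e, d, -e], -1:1, n, n);      % SPD, diagonally dominant
b = ones(n, 1);
B = spdiags(1./d, 0, n, n);                % B = D^{-1}
u = pcg(A, b, zeros(n,1), B, 1e-8);        % the listing above, not MATLAB's built-in pcg
fprintf('relative residual = %.2e\n', norm(b - A*u)/norm(b));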
2. GMRES
For a general matrix A, we can reformulate the equation Au = b as a least squares
problem

min_{u∈R^N} ‖b − Au‖,

and restrict the minimization to the affine subspace u0 + Vk, i.e.,

(14)    min_{v∈u0+Vk} ‖b − Av‖.
Let uk be the solution of (14) and rk = b − Auk. By an argument identical to the proof for CG, we
immediately get

(15)    ‖rk‖ = min_{p∈Pk, p(0)=1} ‖p(A)r0‖ ≤ min_{p∈Pk, p(0)=1} ‖p(A)‖ ‖r0‖.
Unlike the case when A is SPD, the norm ‖p(A)‖ is not easy to estimate, since the
spectrum of a general matrix A may contain complex eigenvalues and is much harder to characterize.
A convergence analysis for diagonalizable matrices can be found in [2].
At the algorithmic level, GMRES is also more expensive than CG. There is no
recursive formula to update the approximation. One has to
(1) compute and store an orthogonal basis of Vk;
(2) solve the least squares problem (14).
The k orthogonal basis vectors can be computed by the Gram-Schmidt process, requiring O(kN)
memory and operations. The least squares problem can be solved with O(k³) operations,
plus one matrix-vector product with O(sN) operations, where s is the sparsity of the
matrix (the average number of non-zeros per row). Therefore, for m GMRES iterations, the
total complexity is O(smN + m²N + m⁴).
As the iteration progresses, the term m²N + m⁴ increases quickly. In addition, it
requires mN memory to store the m orthogonal basis vectors, so GMRES will not be efficient for large
m. One simple fix is restarting: choose an upper bound for m, say 20, and after that discard
the previous basis and build a new one. The restart loses the optimality (15) and in
turn slows down the convergence, since the space in (14) is changed.
A good preconditioner is thus even more important for GMRES. Suppose we can choose a
preconditioner B such that ‖I − BA‖ = ρ < 1. Then we solve BAu = Bb by GMRES, and
the estimate (15) leads to the convergence

‖Brk‖ ≤ ρ^k ‖Br0‖.

One can also solve ABv = b by GMRES if ‖I − AB‖ = ρ < 1 and set u = Bv.
These two approaches are called left and right preconditioning, respectively. Again, a general guiding
principle for designing a good preconditioner is to find the correct inner product such that the operator A is continuous and stable, and to use the corresponding Riesz representation as the
preconditioner.
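As a practical illustration, a restarted, preconditioned GMRES iteration can be run with MATLAB's built-in gmres rather than the hand-written routines above; the model matrix, restart length, tolerance, and ILU preconditioner below are our own illustrative choices.

% Restarted, preconditioned GMRES on a nonsymmetric model problem.
n = 400;
e = ones(n, 1);
A = spdiags([-e, 2*e, -0.5*e], -1:1, n, n);   % nonsymmetric tridiagonal matrix
b = A*ones(n, 1);                             % exact solution is the vector of ones
[L, U] = ilu(A);                              % incomplete LU factors as preconditioner
restart = 20; tol = 1e-10; maxit = 50;
[u, flag, relres, iter] = gmres(A, b, restart, tol, maxit, L, U);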
REFERENCES
[1] M. Hestenes and E. Stiefel. Methods of conjugate gradients for solving linear systems. Journal of Research
of the National Bureau of Standards, 49(6), 1952.
[2] C. Kelley. Iterative Methods for Linear and Nonlinear Equations. Society for Industrial and Applied Mathematics, 1995.
[3] Y. Saad and M. H. Schultz. GMRES: A generalized minimal residual algorithm for solving nonsymmetric
linear systems. SIAM J. Sci. Statist. Comput., 7:856–869, 1986.