8 Norms and Inner Products

This section is devoted to vector spaces with an additional structure: an inner product. The underlying field here is either F = IR or F = C.
8.1 Definition (Norm, distance)
A norm on a vector space V is a real valued function k · k satisfying
1. kvk ≥ 0 for all v ∈ V and kvk = 0 if and only if v = 0.
2. kcvk = |c| kvk for all c ∈ F and v ∈ V .
3. ku + vk ≤ kuk + kvk for all u, v ∈ V (triangle inequality).
A vector space V = (V, k · k) together with a norm is called a normed space.
In normed spaces, we define the distance between two vectors u, v by d(u, v) = ku−vk.
Note that property 3 has a useful implication: |kuk − kvk| ≤ ku − vk for all u, v ∈ V .
8.2 Examples
(i) Several standard norms on IR^n and C^n:

kxk1 := Σ_{i=1}^n |xi|                          (1-norm)

kxk2 := ( Σ_{i=1}^n |xi|^2 )^{1/2}              (2-norm)

(note that this is the Euclidean norm),

kxkp := ( Σ_{i=1}^n |xi|^p )^{1/p}              (p-norm)

for any real p ∈ [1, ∞), and

kxk∞ := max_{1≤i≤n} |xi|                        (∞-norm)
(ii) Several standard norms in C[a, b] for any a < b:
kf k1 := ∫_a^b |f(x)| dx

kf k2 := ( ∫_a^b |f(x)|^2 dx )^{1/2}

kf k∞ := max_{a≤x≤b} |f(x)|
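For illustration only (this sketch is ours, not part of the notes), the vector norms of Example 8.2(i) can be computed in Python with NumPy; the helper names p_norm and inf_norm are ours, and numpy.linalg.norm is used only as a cross-check.

    import numpy as np

    def p_norm(x, p):
        """p-norm of a vector for 1 <= p < infinity, cf. 8.2(i)."""
        return (np.abs(x) ** p).sum() ** (1.0 / p)

    def inf_norm(x):
        """infinity-norm: the largest absolute entry."""
        return np.abs(x).max()

    x = np.array([3.0, -4.0, 12.0])
    print(p_norm(x, 1), np.linalg.norm(x, 1))       # 19.0
    print(p_norm(x, 2), np.linalg.norm(x, 2))       # 13.0 (Euclidean norm)
    print(inf_norm(x), np.linalg.norm(x, np.inf))   # 12.0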
8.3 Definition (Unit sphere, Unit ball)
Let k · k be a norm on V . The set
S1 = {v ∈ V : kvk = 1}
is called a unit sphere in V , and the set
B1 = {v ∈ V : kvk ≤ 1}
a unit ball (with respect to the norm k · k). The vectors v ∈ S1 are called unit vectors.
For any vector v ≠ 0 the vector u = v/kvk belongs to the unit sphere S1, i.e. any nonzero vector is a multiple of a unit vector.
The following statements, 8.4–8.7, are given for the sake of completeness; they will not be used in the rest of the course. Their proofs involve some advanced material from real analysis. Students who are not familiar with it may disregard the proofs.
8.4 Theorem
Any norm on IR^n or C^n is a continuous function. Precisely: for any ε > 0 there is a δ > 0 such that for any two vectors u = (x1, . . . , xn) and v = (y1, . . . , yn) satisfying max_i |xi − yi| < δ we have |kuk − kvk| < ε.
Proof. Let g = max{ke1 k, . . . , ken k}. By the triangle inequality, for any vector
v = (x1 , . . . , xn ) = x1 e1 + · · · + xn en
we have
kvk ≤ |x1 | ke1 k + · · · + |xn | ken k ≤ (|x1 | + · · · + |xn |) · g
Hence if max{|x1 |, . . . , |xn |} < δ, then kvk < ngδ.
Now let two vectors v = (x1, . . . , xn) and u = (y1, . . . , yn) be close, so that max_i |xi − yi| < δ. Then

|kuk − kvk| ≤ ku − vk < ngδ
It is then enough to set δ = ε/ng. This proves the continuity of the norm k · k as a
function of v. 2
8.5 Corollary
Let k · k be a norm on IR^n or C^n. Denote by

S1^e = {(x1, . . . , xn) : |x1|^2 + · · · + |xn|^2 = 1}

the Euclidean unit sphere in IR^n (or C^n). Then the function k · k is bounded above and below on S1^e:

0 < min_{v ∈ S1^e} kvk ≤ max_{v ∈ S1^e} kvk < ∞
Proof. Indeed, it is known in real analysis that S1e is a compact set. Note that k · k is a
continuous function on S1e . It is known in real analysis that a continuous function on a
compact set always takes its maximum and minimum values. In our case, the minimum
value of k · k on S1^e is strictly positive, because 0 ∉ S1^e. This proves the corollary. 2
8.6 Definition (Equivalent norms)
Two norms, k · ka and k · kb , on V are said to be equivalent if there are constants
0 < C1 < C2 such that C1 ≤ kuka /kukb ≤ C2 < ∞ for all u ≠ 0. This is an equivalence relation.
8.7 Theorem
In any finite dimensional space V , any two norms are equivalent.
Proof. Assume first that V = IR^n or V = C^n. It is enough to prove that any norm is equivalent to the 2-norm. Any nonzero vector v = (x1, . . . , xn) ∈ V is a multiple of a Euclidean unit vector u ∈ S1^e, so it is enough to check the equivalence for vectors u ∈ S1^e, which immediately follows from 8.5. So, the theorem is proved for V = IR^n and V = C^n. An arbitrary n-dimensional vector space over F = IR or F = C is isomorphic to IR^n or C^n, respectively. 2
8.8 Theorem + Definition (Matrix norm)
Let k · k be a norm on IR^n or C^n. Then

kAk := sup_{kxk=1} kAxk = sup_{x≠0} kAxk / kxk

defines a norm on the space of n × n matrices. It is called the matrix norm induced by k · k.
Proof is a direct inspection.
Note that the supremum can be replaced by a maximum in Theorem 8.8. Indeed, one can write kAk = sup_{kxk2=1} kAxk/kxk, then argue that the function kAxk is continuous on the Euclidean unit sphere S1^e (as a composition of two continuous functions, Ax and k · k), then argue that kAxk/kxk is a continuous function, as a ratio of two continuous functions of which kxk ≠ 0, so that kAxk/kxk takes its maximum value on S1^e.
Note that there are norms on IRn×n that are not induced by any norm on IRn , for example
kAk := maxi,j |aij | (Exercise, use 8.10(ii) below).
8.9 Theorem
(i) kAk1 = max_{1≤j≤n} Σ_{i=1}^n |aij|     (maximum column sum)

(ii) kAk∞ = max_{1≤i≤n} Σ_{j=1}^n |aij|     (maximum row sum)
Note: There is no explicit characterization of kAk2 in terms of the aij ’s.
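A quick numerical check of Theorem 8.9 (our sketch; numpy.linalg.norm with ord 1, inf, 2 computes the corresponding induced matrix norms):

    import numpy as np

    def induced_norm_1(A):
        """Maximum column sum, Theorem 8.9(i)."""
        return np.abs(A).sum(axis=0).max()

    def induced_norm_inf(A):
        """Maximum row sum, Theorem 8.9(ii)."""
        return np.abs(A).sum(axis=1).max()

    A = np.array([[1.0, -2.0], [3.0, 4.0]])
    print(induced_norm_1(A), np.linalg.norm(A, 1))         # 6.0
    print(induced_norm_inf(A), np.linalg.norm(A, np.inf))  # 7.0
    # kAk2 has no entrywise formula; it equals the largest singular value:
    print(np.linalg.norm(A, 2))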
8.10 Theorem
Let k · k be a norm on IR^n or C^n. Then
(i) kAxk ≤ kAk kxk for all vectors x and matrices A.
(ii) kABk ≤ kAk kBk for all matrices A, B.
8.11 Definition (Real Inner Product)
Let V be a real vector space. A real inner product on V is a real valued function on
V × V , denoted by h·, ·i, satisfying
1. hu, vi = hv, ui for all u, v ∈ V
2. hcu, vi = chu, vi for all c ∈ IR and u, v ∈ V
3. hu + v, wi = hu, wi + hv, wi for all u, v, w ∈ V
4. hu, ui ≥ 0 for all u ∈ V , and hu, ui = 0 iff u = 0
Comments: 1 says that the inner product is symmetric, 2 and 3 say that it is linear in
the first argument (the linearity in the second argument follows then from 1), and 4 says
that the inner product is non-negative and non-degenerate (just like a norm). Note that
h0, vi = hu, 0i = 0 for all u, v ∈ V .
A real vector space together with a real inner product is called a real inner product space,
or sometimes a Euclidean space.
8.12 Examples
(i) V = IR^n: hu, vi = Σ_{i=1}^n ui vi (standard inner product). Note: hu, vi = u^t v = v^t u.

(ii) V = C([a, b]) (real continuous functions): hf, gi = ∫_a^b f(x)g(x) dx
8.13 Definition (Complex Inner Product)
Let V be a complex vector space. A complex inner product on V is a complex valued
function on V × V , denoted by h·, ·i, satisfying
1. hu, vi = \overline{hv, ui} for all u, v ∈ V
2, 3, 4 as in 8.11.
A complex vector space together with a complex inner product is called a complex inner
product space, or sometimes a unitary space.
8.14 Simple properties
(i) hu, v + wi = hu, vi + hu, wi
(ii) hu, cvi = c̄hu, vi
The properties (i) and (ii) are called conjugate linearity in the second argument.
8.15 Examples
(i) V = C^n: hu, vi = Σ_{i=1}^n ui v̄i (standard inner product). Note: hu, vi = u^t v̄ = v̄^t u.

(ii) V = C([a, b]) (complex continuous functions): hf, gi = ∫_a^b f(x) \overline{g(x)} dx
Note: the term “inner product space” from now on refers to either real or complex inner
product space.
8.16 Theorem (Cauchy-Schwarz-Buniakowsky inequality)
Let V be an inner product space. Then
|hu, vi| ≤ hu, ui^{1/2} hv, vi^{1/2}
for all u, v ∈ V .
The equality holds if and only if {u, v} is linearly dependent.
Proof. We do it in the complex case. Assume that v ≠ 0. Consider the function

f(z) = hu − zv, u − zvi = hu, ui − z hv, ui − z̄ hu, vi + |z|^2 hv, vi

of a complex variable z. Let z = re^{iθ} and hu, vi = se^{iϕ} be the polar forms of the numbers z and hu, vi. Set θ = ϕ and let r vary from −∞ to ∞; then

0 ≤ f(z) = hu, ui − 2sr + r^2 hv, vi

Since this holds for all r ∈ IR (also for r < 0, because the coefficients are all nonnegative), the discriminant has to be ≤ 0, i.e. s^2 − hu, ui hv, vi ≤ 0. This completes the proof in the complex case. In the real case it goes even easier: just assume z ∈ IR. The equality case in the theorem corresponds to zero discriminant, hence the above polynomial assumes a zero value, and hence u = zv for some z ∈ C. (We left out the case v = 0; do it yourself as an exercise.) 2
8.17 Theorem + Definition (Induced norm)
If V is an inner product vector space, then kvk := hv, vi1/2 defines a norm on V . It
is called the induced norm.
To prove the triangle inequality, you will need 8.16.
8.18 Example
The inner products in Examples 8.12(i) and 8.15(i) induce the 2-norm on IR^n and C^n, respectively.
The inner products in Examples 8.12(ii) and 8.15(ii) induce the 2-norm on the spaces
C[a, b] of real and complex functions, respectively.
8.19 Theorem
Let V be a real vector space with norm k · k. The norm k · k is induced by an inner
product if and only if the function
hu, vi := (1/4) ( ku + vk^2 − ku − vk^2 )        (polarization identity)

satisfies the definition of an inner product. In this case k · k is induced by the above inner product.
Note: A similar but more complicated polarization identity holds in complex inner product spaces.
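As a small numerical illustration of 8.19 (our sketch): on IR^n with the Euclidean norm, the polarization identity recovers the standard inner product.

    import numpy as np

    def polarization_inner(u, v):
        """Recover <u, v> from the 2-norm via the polarization identity (8.19)."""
        return 0.25 * (np.linalg.norm(u + v) ** 2 - np.linalg.norm(u - v) ** 2)

    rng = np.random.default_rng(0)
    u, v = rng.standard_normal(5), rng.standard_normal(5)
    print(np.allclose(polarization_inner(u, v), u @ v))   # True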
9 Orthogonal Vectors
In this section, V is always an inner product space (real or complex).
9.1 Definition (Orthogonal vectors)
Two vectors u, v ∈ V are said to be orthogonal if hu, vi = 0.
9.2 Example
(i) The canonical basis vectors e1, . . . , en in IR^n or C^n with the standard inner product are mutually (i.e., pairwise) orthogonal.

(ii) Any vectors u = (u1, . . . , uk, 0, . . . , 0) and v = (0, . . . , 0, v_{k+1}, . . . , vn) are orthogonal in IR^n or C^n with the standard inner product.
(iii) The zero vector 0 is orthogonal to any vector.
9.3 Theorem (Pythagoras)
If hu, vi = 0, then ku + vk2 = kuk2 + kvk2 .
Inductively, it follows that if u1 , . . . , uk are mutually orthogonal, then
ku1 + · · · + uk k2 = ku1 k2 + · · · + kuk k2
9.4 Theorem
If nonzero vectors u1 , . . . , uk are mutually orthogonal, then they are linearly independent.
9.5 Definition (Orthogonal/Orthonormal basis)
A basis {ui } in V is said to be orthogonal, if all the basis vectors ui are mutually
orthogonal. If, in addition, all the basis vectors are unit (i.e., kui k = 1 for all i), then
the basis is said to be orthonormal, or an ONB.
9.6 Theorem (Fourier expansion)
If B = {u1 , . . . , un } is an ONB in a finite dimensional space V , then
v = Σ_{i=1}^n hv, ui i ui

for every v ∈ V , i.e. ci = hv, ui i are the coordinates of the vector v in the basis B. One can also write this as [v]_B^t = (hv, u1 i, . . . , hv, un i).
Note: the numbers hv, ui i are called the Fourier coefficients of v in the ONB {u1 , . . . , un }.
For example, in IR^n or C^n with the standard inner product, the coordinates of any vector v = (v1, . . . , vn) satisfy vi = hv, ei i.
9.7 Definition (Orthogonal projection)
Let u, v ∈ V , and v ≠ 0. The orthogonal projection of u onto v is

Pr_v u = ( hu, vi / kvk^2 ) v

Note that the vector w := u − Pr_v u is orthogonal to v. Therefore, u is the sum of two vectors: Pr_v u, parallel to v, and w, orthogonal to v (see the diagram below).
[Figure 1: Orthogonal projection of u onto v. The projection Pr_v u points along v/kvk and has length kuk cos θ, where θ is the angle between u and v.]
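A minimal numerical sketch of the projection formula (the function name proj is ours; np.vdot(v, u) computes hu, vi for the standard inner product):

    import numpy as np

    def proj(u, v):
        """Orthogonal projection of u onto a nonzero vector v, Definition 9.7."""
        return (np.vdot(v, u) / np.vdot(v, v)) * v    # (<u, v> / ||v||^2) v

    u = np.array([2.0, 1.0])
    v = np.array([3.0, 0.0])
    p = proj(u, v)
    print(p)                  # [2. 0.]
    print(np.vdot(v, u - p))  # 0.0 : the remainder w = u - proj is orthogonal to v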
9.8 Definition (Angle)
In the real case, for any nonzero vectors u, v ∈ V let

cos θ = hu, vi / ( kuk kvk )
By 8.16, we have cos θ ∈ [−1, 1]. Hence, there is a unique angle θ ∈ [0, π] with this value
of cosine. It is called the angle between u and v.
Note that cos θ = 0 if and only if u and v are orthogonal. Also, cos θ = ±1 if and only if
u, v are proportional, v = cu; in that case the sign of c coincides with the sign of cos θ.
9.9 Theorem
Let B = {u1 , . . . , un } be an ONB in a finite dimensional space V . Then
v = Σ_{i=1}^n Pr_{ui} v = kvk Σ_{i=1}^n ui cos θi

where θi is the angle between ui and v.
9.10 Theorem (Gram-Schmidt)
Let nonzero vectors w1 , . . . , wm be mutually orthogonal. For v ∈ V , set
w_{m+1} = v − Σ_{i=1}^m Pr_{wi} v

Then the vectors w1, . . . , w_{m+1} are mutually orthogonal, and

span{w1, . . . , wm, v} = span{w1, . . . , wm, w_{m+1}}

In particular, w_{m+1} = 0 if and only if v ∈ span{w1, . . . , wm}.
9.11 Algorithm (Gram-Schmidt orthogonalization)
Let {v1 , . . . , vn } be a basis in V . Define
w1 = v1
and then inductively, for m ≥ 1,
w_{m+1} = v_{m+1} − Σ_{i=1}^m Pr_{wi} v_{m+1} = v_{m+1} − Σ_{i=1}^m ( hv_{m+1}, wi i / kwi k^2 ) wi
This gives an orthogonal basis {w1 , . . . , wn }, which ‘agrees’ with the basis {v1 , . . . , vn } in
the following sense:
span{v1 , . . . , vm } = span{w1 , . . . , wm }
for all 1 ≤ m ≤ n.
The basis {w1 , . . . , wn } can be normalized by ui = wi /kwi k to give an ONB {u1 , . . . , un }.
Alternatively, an ONB {u1 , . . . , un } can be obtained directly by
w1 = v1
u1 = w1 /kw1 k
and inductively for m ≥ 1
w_{m+1} = v_{m+1} − Σ_{i=1}^m hv_{m+1}, ui i ui,      u_{m+1} = w_{m+1} / kw_{m+1}k
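The second (directly normalized) variant of the algorithm translates into a short NumPy sketch (our illustration, using the standard inner product; the columns of the input matrix are assumed linearly independent):

    import numpy as np

    def gram_schmidt(V):
        """Orthonormalize the columns of V by Algorithm 9.11 (normalized version)."""
        U = []
        for v in V.T.astype(complex):
            w = v.copy()
            for u in U:
                w -= np.vdot(u, w) * u        # subtract <w, u> u, the projection onto u
            U.append(w / np.linalg.norm(w))   # normalize
        return np.column_stack(U)

    A = np.array([[1.0, 1.0], [0.0, 1.0], [1.0, 0.0]])
    Q = gram_schmidt(A)
    print(np.allclose(Q.conj().T @ Q, np.eye(2)))   # True: the columns form an ONB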
9.12 Example
Let V = Pn(IR) with the inner product given by hf, gi = ∫_0^1 f(x)g(x) dx. Applying Gram-Schmidt orthogonalization to the basis {1, x, . . . , x^n} gives the first n + 1 of the so-called Legendre polynomials.
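For instance, the first three of these polynomials can be obtained symbolically with SymPy (our sketch, following Algorithm 9.11 without normalization):

    import sympy as sp

    x = sp.symbols('x')

    def inner(f, g):
        """<f, g> = integral of f(x) g(x) over [0, 1], as in Example 9.12."""
        return sp.integrate(f * g, (x, 0, 1))

    ortho = []
    for v in [1, x, x**2]:
        w = v - sum(inner(v, u) / inner(u, u) * u for u in ortho)
        ortho.append(sp.expand(w))

    print(ortho)   # [1, x - 1/2, x**2 - x + 1/6]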
9.13 Corollary
Let W ⊂ V be a finite dimensional subspace of an inner product space V , and
dim W = k. Then there is an ONB {u1 , . . . , uk } in W . If, in addition, V is finite dimensional, then the basis {u1 , . . . , uk } of W can be extended to an ONB {u1 , . . . , un } of V .
9.14 Definition (Orthogonal complement)
Let S ⊂ V be a subset (not necessarily a subspace). Then
S ⊥ := {v ∈ V : hv, wi = 0 for all w ∈ S}
is called the orthogonal complement to S.
9.15 Theorem
S ⊥ is a subspace of V . If W = span S, then W ⊥ = S ⊥ .
9.16 Example
If S = {(1, 0, 0)} in IR3 , then S ⊥ = span{(0, 1, 0), (0, 0, 1)}.
Let V = C[a, b] (real continuous functions) with the inner product from 8.12(ii). Let S = {f ≡ const} (the subspace of constant functions). Then S⊥ = {g : ∫_a^b g(x) dx = 0}.
Note that V = S ⊕ S ⊥ , see 1.34(b).
9.17 Theorem
If W is a finite dimensional subspace of V , then
V = W ⊕ W⊥
Proof. By 9.13, there is an ONB {u1 , . . . , uk } of W . For any v ∈ V the vector
v − Σ_{i=1}^k hv, ui i ui

belongs to W⊥. Hence, V = W + W⊥. The independence of W and W⊥ (i.e. W ∩ W⊥ = {0}) follows from 9.4. 2
Note: the finite dimensionality of W is essential. Let V = C[a, b] with the inner product from 8.12(ii) and let W ⊂ V be the set of real polynomials restricted to the interval [a, b]. Then W⊥ = {0}, and at the same time V ≠ W.
9.18 Theorem (Parseval's identity)
Let B = {u1 , . . . , un } be an ONB in V . Then
hv, wi = Σ_{i=1}^n hv, ui i \overline{hw, ui i} = [v]_B^t \overline{[w]_B} = \overline{[w]_B}^t [v]_B

for all v, w ∈ V . In particular,

kvk^2 = Σ_{i=1}^n |hv, ui i|^2 = [v]_B^t \overline{[v]_B}
Proof. Follows from 9.6. 2
9.19 Theorem (Bessel’s inequality)
Let {u1 , . . . , un } be an orthonormal subset of V . Then
kvk^2 ≥ Σ_{i=1}^n |hv, ui i|^2
for all v ∈ V .
Proof. For any v ∈ V the vector
w := v − Σ_{i=1}^n hv, ui i ui

belongs to {u1, . . . , un}⊥. Hence, by Pythagoras (9.3),

kvk^2 = kwk^2 + Σ_{i=1}^n |hv, ui i|^2 ≥ Σ_{i=1}^n |hv, ui i|^2. 2
9.20 Definition (Isometry)
Let V, W be two inner product spaces (both real or both complex). An isomorphism
T : V → W is called an isometry if it preserves the inner product, i.e.
hT v, T wi = hv, wi
for all v, w ∈ V . In this case V and W are said to be isometric.
Note: using the polarization identity, one can show that a linear map T : V → W preserves the inner product if and only if it preserves the induced norms, i.e. kT vk = kvk for all v ∈ V .
9.21 Theorem Let dim V < ∞. A linear transformation T : V → W is an isometry if
and only if whenever {u1 , . . . , un } is an ONB in V , then {T u1 , . . . , T un } is an ONB in W .
9.22 Corollary Finite dimensional inner product spaces V and W (over the same field)
are isometric if and only if dim V = dim W .
9.23 Example
Let V = IR^2 with the standard inner product. Then the maps defined by the matrices

A1 = [ 0 1 ; 1 0 ]   and   A2 = [ 1 0 ; 0 −1 ]

are isometries of IR^2.
10 Orthogonal/Unitary Matrices
10.1 Definition
If V is a real inner product space, then isometries V → V are called orthogonal operators. If V is a complex inner product space, then isometries V → V are called unitary operators.
Note: If U1, U2 : V → V are unitary operators, then so are U1U2 and U1^{−1}, U2^{−1}. Thus, unitary operators form a group, denoted by U(V) ⊂ GL(V). Similarly, orthogonal operators on a real inner product space V form a group O(V) ⊂ GL(V).
10.2 Definition.
A real n × n matrix Q is said to be orthogonal if QQ^t = I, i.e. Q is invertible and Q^{−1} = Q^t. A complex n × n matrix U is said to be unitary if U U^H = I, where U^H = (Ū)^t is the Hermitian transpose of U.
Note: If Q is orthogonal, then also Qt Q = I, and so Qt is an orthogonal matrix as well.
If U is unitary, then also U H U = I, and so U H is a unitary matrix, too. In the latter
case, we also have Ū U t = I and U t Ū = I.
10.3 Theorem
(i) The linear transformation in IRn defined by a matrix Q ∈ IRn×n preserves the
standard (Euclidean) inner product if and only if Q is orthogonal.
(ii) The linear transformation in C^n defined by a matrix U ∈ C^{n×n} preserves the standard (Hermitian) inner product if and only if U is unitary.
Proof. In the complex case: hU x, U yi = (U x)^t \overline{U y} = x^t U^t Ū ȳ = x^t ȳ = hx, yi. The real case is similar. 2
10.4 Theorem
(i) An operator T : V → V of a finite dimensional complex space V is unitary iff the
matrix [T ]B is unitary in any ONB B.
(ii) An operator T : V → V of a finite dimensional real space V is orthogonal iff the
matrix [T ]B is orthogonal in any ONB B.
10.5 Corollary
Unitary n × n matrices make a group, denoted by U (n). Orthogonal n × n matrices
make a group, denoted by O(n).
10.6 Theorem
(i) A matrix U ∈ C^{n×n} is unitary iff its columns (resp., rows) make an ONB of C^n.
(ii) A matrix Q ∈ IRn×n is orthogonal iff its columns (resp., rows) make an ONB of IRn .
10.7 Theorem
If Q is orthogonal, then det Q = ±1. If U is unitary, then |det U| = 1, i.e. det U = e^{iθ} for some θ ∈ [0, 2π].

Proof. In the complex case: 1 = det I = det(U^H U) = det(Ū^t) · det U = |det U|^2. The real case is similar. 2
Note: Orthogonal matrices have determinant 1 or −1. Orthogonal n × n matrices with
determinant 1 make a subgroup of O(n), denoted by SO(n).
10.8 Example
The orthogonal matrix Q = [ cos θ  −sin θ ; sin θ  cos θ ] represents the counterclockwise rotation of IR^2 through the angle θ.
10.9 Theorem
If λ is an eigenvalue of an orthogonal or unitary matrix, then |λ| = 1. In the real case
this means λ = ±1.
Proof. If U x = λx for some x ≠ 0, then hx, xi = hU x, U xi = hλx, λxi = |λ|^2 hx, xi, so that |λ|^2 = 1. 2
10.10 Theorem
Let T : V → V be an isometry of a finite dimensional space V . If a subspace W ⊂ V
is invariant under T , then so is its orthogonal complement W ⊥ .
10.11 Theorem
Any unitary operator T of a finite dimensional complex space is diagonalizable. Furthermore, there is an ONB consisting of eigenvectors of T . Any unitary matrix is diagonalizable.
Proof goes by induction on the dimension of the space, use 10.10. 2
10.12 Theorem
Let T : V → V be an orthogonal operator on a finite dimensional real space V . Then
V = V1 ⊕ · · · ⊕ Vm , where Vi are mutually orthogonal subspaces, each Vi is a T -invariant
one- or two-dimensional subspace of V .
Proof. If T has an eigenvalue, we can use 10.10 and reduce the dimension of V by
one. Assume now that T has no eigenvalues. The characteristic polynomial CT (x) has
real coefficients and no (real) roots, so its roots are pairs of conjugate complex numbers
(because CT (z) = 0 ⇔ CT (z̄) = 0). So, CT (x) is a product of quadratic polynomials
with no real roots: CT (x) = P1 (x) · · · Pk (x), where deg Pi (x) = 2. By Cayley-Hamilton,
CT (T ) = P1 (T ) · · · Pk (T ) = 0
Hence, at least one operator Pi (T ), 1 ≤ i ≤ k, is not invertible (otherwise CT (T ) would
be invertible). Let Pi(x) = x^2 + ax + b. Then the operator T^2 + aT + bI has a nontrivial kernel, i.e. T^2 v + aT v + bv = 0 for some vector v ≠ 0. Let w = T v (note that w ≠ 0, since the operator T is an isometry). We get T w + aw + bv = 0, so that

T v = w    and    T w = −aw − bv
Hence, the subspace span{v, w} is invariant under T . It is two-dimensional, since v and
w are linearly independent (otherwise T v = w = λv, and λ would be an eigenvalue of
T ). Now we can use 10.10 and reduce the dimension of V by two. 2
11 Adjoint and Self-adjoint Matrices
In this chapter, V denotes a finite dimensional inner product space (unless stated otherwise).
11.1 Theorem (Riesz representation)
Let f ∈ V ∗ , i.e. f is a linear functional on V . Then there is a unique vector w ∈ V
such that
f (v) = hv, wi
∀v ∈ V
Proof. Let B = {u1, . . . , un} be an ONB in V . Then for any v = Σ ci ui we have f(v) = Σ ci f(ui) by linearity. Also, for any w = Σ di ui we have hv, wi = Σ ci d̄i. Hence, the vector w = Σ \overline{f(ui)} ui will suffice. The uniqueness of w is obvious. 2
11.2 Corollary
The correspondence f ↔ w established in the previous theorem is “quasi-linear” in the following sense: f1 + f2 ↔ w1 + w2 and cf ↔ c̄w (recall the conjugate linearity of the complex inner product in the second argument). In the real case it is perfectly linear, and hence it is an isomorphism between V ∗ and V . This is the canonical isomorphism associated with the given (real) inner product h·, ·i.
Remark. If dim V = ∞, then Theorem 11.1 fails. Consider V = C[0, 1] (real functions) with the inner product hF, Gi = ∫_0^1 F(x)G(x) dx. Pick a point t ∈ [0, 1]. Let f ∈ V ∗ be the linear functional defined by f(F) = F(t). It does not correspond to any G ∈ V such that f(F) = hF, Gi. In fact, the lack of such functions G has led to the concept of generalized functions: a generalized function Gt(x) is “defined” by three requirements: Gt(x) ≡ 0 for all x ≠ t, Gt(t) = ∞, and ∫_0^1 F(x)Gt(x) dx = F(t) for every F ∈ C[0, 1]. Clearly, no regular function satisfies these requirements, so a “generalized function” is a purely abstract concept.
11.3 Lemma
Let T : V → V be a linear operator, B = {u1 , . . . , un } an ONB in V , and A = [T ]B .
Then
Aij = hT uj , ui i
11.4 Theorem and Definition (Adjoint Operator)
Let T : V → V be a linear operator. Then there is a unique linear operator T ∗ : V →
V such that
hT v, wi = hv, T ∗ wi
∀v, w ∈ V
T ∗ is called the adjoint of T .
Proof. Let w ∈ V . Then f (v) := hT v, wi defines a linear functional f ∈ V ∗ . By the
Riesz representation theorem, there is a unique w0 ∈ V such that f (v) = hv, w0 i. Then
we define T ∗ by setting T ∗ w = w0 . The linearity of T ∗ is a routine check. Note that in
the complex case the conjugating bar appears twice and thus cancels. The uniqueness of
T ∗ is obvious. 2
11.5 Corollary
Let T, S be linear operators on V . Then
(i) (T + S)∗ = T ∗ + S ∗
(ii) (cT )∗ = c̄ T ∗ (conjugate linearity)
(iii) (T S)∗ = S ∗ T ∗
(iv) (T ∗ )∗ = T .
11.6 Corollary
Let T : V → V be a linear operator, and B an ONB in V . Then
[T ∗]B = \overline{[T ]_B^t}

Notation. For a matrix A ∈ C^{n×n} we call

A∗ := \overline{A^t} = Ā^t

the adjoint matrix. One often uses A^H instead of A∗. In the real case, we simply have A∗ = A^t, the transposed matrix.
11.7 Theorem.
Let T : V → V be a linear operator. Then
Ker T ∗ = (Im T )⊥
In particular, V = Im T ⊕ Ker T ∗ .
Proof is a routine check.
11.8 Definition (Selfadjoint Operator)
A linear operator T : V → V is said to be selfadjoint if T ∗ = T . A matrix A is said to
be selfadjoint if A∗ = A. In the real case, this is equivalent to At = A, i.e. A is a symmetric matrix. In the complex case, selfadjoint matrices are often called Hermitean matrices.
Note: By 11.6, an operator T is selfadjoint if and only if the matrix [T]_B is selfadjoint in some (and then every) ONB B.
11.9 Theorem
Let T : V → V be selfadjoint. Then
(i) All eigenvalues of T are real
(ii) If λ1 6= λ2 are two distinct eigenvalues of T with respective eigenvectors v1 , v2 , then
v1 and v2 are orthogonal.
Proof. If T v = λv, then λhv, vi = hT v, vi = hv, T vi = λ̄hv, vi, hence λ = λ̄ (since v ≠ 0). To prove (ii), we have λ1 hv1, v2 i = hT v1, v2 i = hv1, T v2 i = λ̄2 hv1, v2 i, and in addition λ̄2 = λ2 ≠ λ1, hence hv1, v2 i = 0. 2
Remark. Note that in the real case Theorem 11.9 implies that the characteristic polynomial CT(x) has all real roots, i.e. CT(x) = Π_i (x − λi) where all λi's are real numbers. In particular, every real symmetric matrix has at least one (real) eigenvalue.
11.10 Corollary
Any selfadjoint operator T has at least one eigenvalue.
11.11 Lemma
Let T be a selfadjoint operator and a subspace W be T -invariant, i.e. T W ⊂ W .
Then W ⊥ is also T -invariant, i.e. T W ⊥ ⊂ W ⊥ .
Proof. If v ∈ W ⊥ , then for any w ∈ W we have hT v, wi = hv, T wi = 0, so T v ∈ W ⊥ .
2
11.12 Definition (Projection)
Let V be a vector space. A linear map P : V → V is called a projection if P^2 = P.
Note: By the homework problem 3 in assignment 2 of MA631, a projection P satisfies
Ker P = Im (I − P ) and V = Ker P ⊕ Im P . For any vector v ∈ V there is a unique
decomposition v = v1 + v2 with v1 ∈ Ker P and v2 ∈ Im P . Then P v = P (v1 + v2 ) = v2 .
We say that P is the projection on Im P along Ker P .
11.13 Corollary
Let V be a vector space and V = W1 ⊕ W2 . Then there is a unique projection P on
W2 along W1 .
11.14 Definition (Orthogonal Projection)
Let V be an inner product space and W ⊂ V a finite dimensional subspace. Then the projection on W along W⊥ is called the orthogonal projection on W, denoted by PW.
Remark. The assumption on W being finite dimensional is made to ensure that V =
W ⊕ W ⊥ , recall 9.17.
11.15 Theorem
Let V be a finite dimensional inner product space and W ⊂ V a subspace. Then
(W⊥)⊥ = W and

P_{W⊥} = I − PW
Remark. More generally, if V = W1 ⊕ W2 , then P1 + P2 = I, where P1 is a projection on
W1 along W2 and P2 is a projection on W2 along W1 .
11.16 Theorem
Let P be a projection. Then P is an orthogonal projection if and only if P is selfadjoint.
Proof. Let P be a projection on W2 along W1 , and V = W1 ⊕ W2 . For any vectors v, w ∈ V we have v = v1 + v2 and w = w1 + w2 with some vi , wi ∈ Wi , i = 1, 2.
Now, if P is orthogonal (i.e. W1 ⊥ W2), then hP v, wi = hv2, wi = hv2, w2 i = hv, w2 i = hv, P wi. If P is not orthogonal, then there are v1 ∈ W1, w2 ∈ W2 such that hv1, w2 i ≠ 0. Then hv1, P w2 i ≠ 0 = hP v1, w2 i. 2
11.17 Definition (Unitary/Orthogonal Equivalence)
Two complex matrices A, B ∈ C^{n×n} are said to be unitarily equivalent if B = P^{−1}AP for some unitary matrix P; in that case we also have B = P∗AP.
Two real matrices A, B ∈ IR^{n×n} are said to be orthogonally equivalent if B = P^{−1}AP for some orthogonal matrix P; in that case we also have B = P^t AP.
11.18 Remark
A coordinate change between two ONB's is represented by a unitary (resp., orthogonal) matrix, cf. 9.22. Therefore, for any linear operator T : V → V and ONB's B, B′ the matrices [T]_B and [T]_{B′} are unitarily (resp., orthogonally) equivalent. Conversely, two matrices A, B are unitarily (resp., orthogonally) equivalent iff they represent one linear operator in some two ONB's.
11.19 Remark
Any unitary matrix U is unitarily equivalent to a diagonal matrix D (which is also unitary), recall 10.11.
11.20 Theorem (Spectral Theorem)
Let T : V → V be a selfadjoint operator. Then there is an ONB consisting entirely
of eigenvectors of T .
Proof goes by induction on n = dim V , just as the proof of 10.11. You need to use
11.10 and 11.11.
11.21 Corollary
Any complex Hermitean matrix is unitarily equivalent to a diagonal matrix (with real diagonal entries). Any real symmetric matrix is orthogonally equivalent to a diagonal matrix.
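A numerical illustration of 11.20–11.21 (our sketch; numpy.linalg.eigh is the standard routine for Hermitean/symmetric matrices and returns an ONB of eigenvectors):

    import numpy as np

    A = np.array([[2.0, 1.0], [1.0, 2.0]])   # real symmetric
    lam, Q = np.linalg.eigh(A)               # eigenvalues and orthonormal eigenvectors

    print(lam)                                      # [1. 3.] -- real, as 11.9 predicts
    print(np.allclose(Q.T @ Q, np.eye(2)))          # True: Q is orthogonal
    print(np.allclose(Q.T @ A @ Q, np.diag(lam)))   # True: A is orthogonally
                                                    # equivalent to a diagonal matrix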
11.22 Lemma
If a complex (real) matrix A is unitarily (resp., orthogonally) equivalent to a diagonal matrix with real diagonal entries, then A is Hermitean (resp., symmetric).
Proof. If A = P −1 DP with P −1 = P ∗ , then A∗ = P ∗ D∗ (P −1 )∗ = A, because D∗ = D
(as a real diagonal matrix). 2
11.23 Corollary
If a linear operator T has an ONB consisting of eigenvectors and all its eigenvalues
are real, then T is selfadjoint.
11.24 Corollary
If an operator T is selfadjoint and invertible, then so is T −1 . If a matrix A is selfadjoint and nonsingular, then so is A−1 .
Proof. By the Spectral Theorem 11.20, there is an ONB B consisting of eigenvectors
of T . Now T −1 has the same eigenvectors, and its eigenvalues are the reciprocals of those
of T , hence they are real, too. Therefore, T −1 is selfadjoint. 2
12 Bilinear Forms, Positive Definite Matrices
12.1 Definition (Bilinear Form)
A bilinear form on a complex vector space V is a mapping f : V × V → C such that
f (u1 + u2 , v) = f (u1 , v) + f (u2 , v)
f (cu, v) = cf (u, v)
f (u, v1 + v2 ) = f (u, v1 ) + f (u, v2 )
f (u, cv) = c̄f (u, v)
for all vectors u, v, ui, vi ∈ V and scalars c ∈ C. In other words, f is linear in the first argument and conjugate linear in the second.
A bilinear form on a real vector space is a mapping f : V × V → IR that satisfies the
same properties, except c is a real scalar and so c̄ = c.
12.2 Example
If V is an inner product space and T : V → V a linear operator, then f (u, v) := hT u, vi
is a bilinear form.
12.3 Theorem
Let V be a finite dimensional inner product space. Then for every bilinear form f on V there is a unique linear operator T : V → V such that

f(u, v) = hT u, vi    ∀u, v ∈ V
Proof. For every v ∈ V the function g(u) = f (u, v) is linear in u, so by the Riesz
representation theorem 11.1 there is a vector w ∈ V such that f (u, v) = hu, wi. Define a
map S : V → V by Sv = w. It is then a routine check that S is linear. Setting T = S ∗
proves the existence. The uniqueness is obvious. 2
12.4 Example
All bilinear forms on C^n are of the type f(x, y) = hAx, yi with A ∈ C^{n×n}, i.e.

f(x, y) = Σ_{ij} A_{ji} xi ȳj
12.5 Definition (Hermitean/Symmetric Form)
A bilinear form f on a complex (real) vector space V is called Hermitean (resp.,
symmetric) if
f(u, v) = \overline{f(v, u)}    ∀u, v ∈ V
In the real case, the bar can be dropped.
For a Hermitean or symmetric form f, the function q : V → IR defined by q(u) := f(u, u) is called the quadratic form associated with f. Note that q(u) ∈ IR even in the complex case, because f(u, u) = \overline{f(u, u)}.
12.6 Theorem
A linear operator T : V → V is selfadjoint if and only if the bilinear form f (u, v) =
hT u, vi is Hermitean (symmetric, in the real case).
Proof. If T is selfadjoint, then f(u, v) = hT u, vi = hu, T vi = \overline{hT v, ui} = \overline{f(v, u)}. If f is Hermitean, then hu, T vi = \overline{hT v, ui} = \overline{f(v, u)} = f(u, v) = hT u, vi = hu, T ∗vi, hence T = T ∗. 2
12.7 Lemma
Let V be a complex inner product space and T, S : V → V linear operators. If
hT u, ui = hSu, ui for all u ∈ V , then T = S.
Note: This lemma holds only in complex spaces, it fails in real spaces.
Proof. Let B be an ONB. Then hT u, ui = [u]_B^t [T]_B^t \overline{[u]_B}, and the same holds for S. It is then enough to prove that if for some A, B ∈ C^{n×n} we have x^t A x̄ = x^t B x̄ for all x ∈ C^n, then A = B. Equivalently, if x^t A x̄ = 0 for all x ∈ C^n, then A = 0. This is done in the homework assignment. 2
12.8 Corollary
If f is a bilinear form in a complex vector space V such that f (u, u) ∈ IR for all
vectors u ∈ V , then f is Hermitean.
Proof. By 12.3, there is an operator T : V → V such that f(u, v) = hT u, vi. Then we have hT u, ui = f(u, u) = \overline{f(u, u)} = \overline{hT u, ui} = hu, T ui = hT ∗u, ui, hence T = T ∗ by 12.7. Then apply 12.6. 2
12.9 Definition (Positive Definite Form/Matrix)
A Hermitean (symmetric) bilinear form f on a vector space V is said to be positive
definite if f(u, u) > 0 for all u ≠ 0.
A selfadjoint operator T : V → V is said to be positive definite if hT u, ui > 0 for all u ≠ 0.
A selfadjoint matrix A is said to be positive definite if x^t A x̄ > 0 for all x ≠ 0.
By replacing “> 0” with “≥ 0”, one gets positive semi-definite forms/operators/matrices.
Note: in the complex cases the Hermitean requirement can be dropped, because f (u, u) >
0 implies f (u, u) ∈ IR, see 12.8.
12.10 Theorem
Let A be a complex or real n × n matrix. The following are equivalent:
(a) A is positive definite
(b) f(x, y) := x^t A ȳ defines an inner product in C^n (resp., IR^n)
(c) A is Hermitean (resp., symmetric) and all its eigenvalues are positive.
Proof. (a)⇔(b) is a routine check. To prove (a)⇔(c), find an ONB consisting of eigenvectors u1, . . . , un of A (one exists by 11.20). Then for any v = Σ ci ui we have hAv, vi = Σ λi |ci|^2. Hence, hAv, vi > 0 for all v ≠ 0 if and only if λi > 0 for all i. 2
Remark: Let A be a Hermitean (symmetric) matrix. Then it is positive semidefinite if
and only if its eigenvalues are nonnegative.
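A small numerical check of criterion (c) (our sketch; the tolerance is an arbitrary choice):

    import numpy as np

    def is_positive_definite(A, tol=1e-12):
        """Criterion 12.10(c): Hermitean with all eigenvalues positive."""
        if not np.allclose(A, A.conj().T):
            return False
        return np.linalg.eigvalsh(A).min() > tol

    print(is_positive_definite(np.array([[2.0, 1.0], [1.0, 2.0]])))  # True  (eigenvalues 1, 3)
    print(is_positive_definite(np.array([[1.0, 2.0], [2.0, 1.0]])))  # False (eigenvalues -1, 3)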
12.11 Corollary
If a matrix A is positive definite, then so is A−1 . If an operator T is positive definite,
then so is T −1 .
There is a very simple and efficient way to take square roots of positive definite matrices.
12.12 Definition (Square Root)
An n × n matrix B is called a square root of an n × n matrix A if A = B^2. Notation: B = A^{1/2}.
12.13 Lemma
If A is a positive definite matrix, then there is a square root A1/2 which is also positive
definite.
Proof. By 11.21 and 12.10, A = P^{−1}DP, where D is a diagonal matrix with positive diagonal entries and P a unitary (orthogonal) matrix. If D = diag(d1, . . . , dn), then denote D^{1/2} = diag(√d1, . . . , √dn). Now A = B^2 where B = P^{−1}D^{1/2}P, and B is selfadjoint by 11.22. 2
Remark. If A is positive semidefinite, then there is a square root A1/2 which is also
positive semidefinite.
12.14 Corollary
An n × n matrix A is positive definite if and only if there is a nonsingular matrix B
such that A = B ∗ B.
Proof. “⇒” follows from 12.13. Conversely, if A = B∗B, then A∗ = B∗(B∗)∗ = A and hAx, xi = hBx, Bxi > 0 for any x ≠ 0, because B is nonsingular. 2
Remark. An n × n matrix A is positive semi-definite if and only if there is a matrix B
such that A = B ∗ B.
12.15 Lemma (Rudin)
For any matrix A ∈ IRn×n there is a symmetric positive semi-definite matrix B such
that
(i) At A = B 2
(ii) kAxk = kBxk for all x ∈ IRn
(One can think of B as a ‘rectification’ or ‘symmetrization’ of A.)
Proof. The matrix At A is symmetric positive semidefinite by the previous remark, so
it has a symmetric square root B by 12.13. Lastly, hBx, Bxi = hB 2 x, xi = hAt Ax, xi =
hAx, Axi.
12.16 Remark
Let A ∈ IRn×n be symmetric. Then, by 11.20, there is an ONB {u1 , . . . , un } consisting
entirely of eigenvectors of A. Denote by λ1 , . . . , λn the corresponding eigenvalues of A.
Then

A = Σ_{i=1}^n λi ui ui^t
Proof. Just verify that Aui = λi ui for all 1 ≤ i ≤ n.
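A sketch (ours) checking the decomposition of 12.16 and using it to build the square root of 12.13:

    import numpy as np

    A = np.array([[2.0, 1.0], [1.0, 2.0]])   # symmetric positive definite
    lam, U = np.linalg.eigh(A)               # ONB of eigenvectors in the columns of U

    # 12.16:  A = sum_i lambda_i u_i u_i^t
    A_rebuilt = sum(l * np.outer(u, u) for l, u in zip(lam, U.T))
    print(np.allclose(A, A_rebuilt))         # True

    # 12.13:  a positive definite square root A^{1/2}
    B = U @ np.diag(np.sqrt(lam)) @ U.T
    print(np.allclose(B @ B, A))             # True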
13 Cholesky Factorization
This is the continuation of Section 4 (LU Decomposition). We explore other decompositions of square matrices.
13.1 Theorem (LDMt Decomposition)
Let A be an n × n matrix with nonsingular principal minors, i.e. det Ak ≠ 0 for
k = 1, . . . , n. Then there are unique matrices L, D, M such that L, M are unit lower
triangular and D is diagonal, and
A = LDM t
Proof. By the LU decomposition (Theorem 4.10) there are unit lower triangular
matrix L and upper triangular matrix U such that A = LU . Let u11 , . . . , unn be the
diagonal entries of U , and set D = diag(u11 , . . . , unn ). Then the matrix M t := D−1 U is
unit upper triangular, and A = LDM t .
To establish uniqueness, let A = LDM t = L1 D1 M1t . By the uniqueness of the LU
decomposition 4.10, we have L = L1 . Hence, (D1−1 D)M t = M1t . Since both M t and
M1t are unit upper triangular, the diagonal matrix D1−1 D must be the identity matrix.
Hence, D = D1 , and then M = M1 . 2
13.2 Corollary
If, in addition, A is symmetric, then there exist a unique unit lower triangular matrix L and a unique diagonal matrix D such that
A = LDLt
Proof. By the previous theorem A = LDM t . Then A = At = M DLt , and by the
uniqueness of the LDMt decomposition we have L = M . 2
13.3 Theorem (Sylvester’s Theorem)
Let A ∈ IRn×n be symmetric. Then A is positive definite if and only if det Ak > 0 for
all k = 1, . . . , n.
Proof. Let A be positive definite. By 12.14, det A = (det B)2 > 0. Any principal
minor Ak is also a symmetric positive definite matrix, therefore by the same argument
det Ak > 0. Conversely, let det Ak > 0. By Corollary 13.2 we have A = LDLt . Denote
by Lk and Dk the k-th principal minors of L and D, respectively. Then Ak = Lk Dk Ltk .
Note that det Dk = det Ak > 0 for all k = 1, . . . , n, therefore all the diagonal entries of D
are positive. Lastly, hAx, xi = hDL^t x, L^t xi = hDy, yi > 0 because y = L^t x ≠ 0 whenever x ≠ 0. 2
13.4 Corollary
Let A be a real symmetric positive definite matrix. Then Aii > 0 for all i = 1, . . . , n.
Furthermore, let 1 ≤ i1 < i2 < · · · < ik ≤ n, and let A′ be the k × k matrix formed by the intersections of the rows and columns of A with numbers i1, . . . , ik. Then det A′ > 0.

Proof. Just reorder the coordinates in IR^n so that A′ becomes a principal minor. 2
13.5 Theorem (Cholesky Factorization)
Let A ∈ IRn×n be symmetric and positive definite. Then there exists a unique lower
triangular matrix G with positive diagonal entries such that
A = GGt
Proof. By Corollary 13.2 we have A = LDL^t. Let D = diag(d1, . . . , dn). As was shown in the proof of 13.3, all di > 0. Let D^{1/2} = diag(√d1, . . . , √dn). Then D = D^{1/2}D^{1/2}, and setting G = LD^{1/2} gives A = GG^t. The diagonal entries of G are √d1, . . . , √dn, so they are positive. To establish uniqueness, let A = GG^t = G1G1^t. Then G1^{−1}G = G1^t(G^t)^{−1}. Since this is the equality of a lower triangular matrix and an upper triangular one, both matrices are diagonal: G1^{−1}G = G1^t(G^t)^{−1} = D. Hence, G1 = GD^{−1} and G1^t = DG^t. Therefore, D = D^{−1}, so the diagonal entries of D are all ±1. Note that the value −1 is not possible since the diagonal entries of both G and G1 are positive. Hence, D = I, and so G1 = G. 2
13.6 Algorithm (Cholesky Factorization).
Here we outline the algorithm of computing the matrix G = (gij ) from the matrix
A = (aij ). Note that G is lower triangular, so gij = 0 for i < j. Hence,
aij = Σ_{k=1}^{min{i,j}} gik gjk

Setting i = j = 1 gives a11 = g11^2, so

g11 = √a11

(remember, gii must be positive). Next, for 2 ≤ i ≤ n we have ai1 = gi1 g11, hence

gi1 = ai1 / g11,    i = 2, . . . , n

This gives the first column of G. Now, inductively, assume that we already have the first j − 1 columns of G. Then ajj = Σ_{k=1}^j gjk^2, hence

gjj = sqrt( ajj − Σ_{k=1}^{j−1} gjk^2 )

Next, for j + 1 ≤ i ≤ n we have aij = Σ_{k=1}^j gik gjk, hence

gij = ( aij − Σ_{k=1}^{j−1} gik gjk ) / gjj
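The recurrences above translate directly into code; a minimal sketch (ours), with no safeguards beyond the positivity check of 13.8:

    import numpy as np

    def cholesky(A):
        """Cholesky factorization A = G G^t of a symmetric positive definite matrix
        following Algorithm 13.6; returns the lower triangular G."""
        n = A.shape[0]
        G = np.zeros_like(A, dtype=float)
        for j in range(n):
            s = A[j, j] - np.dot(G[j, :j], G[j, :j])
            if s <= 0:
                raise ValueError("matrix is not positive definite")   # cf. 13.8
            G[j, j] = np.sqrt(s)
            for i in range(j + 1, n):
                G[i, j] = (A[i, j] - np.dot(G[i, :j], G[j, :j])) / G[j, j]
        return G

    A = np.array([[4.0, 2.0], [2.0, 3.0]])
    G = cholesky(A)
    print(np.allclose(G @ G.T, A))                  # True
    print(np.allclose(G, np.linalg.cholesky(A)))    # matches NumPy's routine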
13.7 Cost of computation
The cost of computation is measured in flops, where a flop is a multiplication or
division together with one addition or subtraction, recall 4.12. The above algorithm of
computation of gij takes j flops for each i = j, . . . , n, so the total is
Σ_{j=1}^n j(n − j) ≈ n · n^2/2 − n^3/3 = n^3/6
It also takes n square root extractions. Recall that the LU decomposition takes about
n3 /3 flops, so the Cholesky factorization is nearly twice as efficient. It is also more stable
than the LU decomposition.
13.8 Remark
The above algorithm can be used to verify that a given symmetric matrix A is positive definite. Whenever the square root extractions in 13.6 are all possible and nonzero, i.e. whenever

a11 > 0    and    ajj − Σ_{k=1}^{j−1} gjk^2 > 0  for all j ≥ 2,

the matrix A is positive definite.
14 Machine Arithmetic
This section is unusual: it discusses imprecise calculations. It may be difficult or frustrating for students who have not done much computer work, but it might be easy (or trivial) for people experienced with computers. This section is necessary, since it motivates the discussion in the following section.
14.1 Binary numbers
A bit is a binary digit: it can take only two values, 0 and 1. Any natural number N can be written, in the binary system, as a sequence of binary digits:

N = (dn · · · d1 d0)_2 = 2^n dn + · · · + 2 d1 + d0

For example, 5 = 101_2, 11 = 1011_2, 64 = 1000000_2, etc.
14.2 More binary numbers
To represent more numbers in the binary system (not just positive integers), it is convenient to use the numbers ±N × 2^{±M}, where M and N are positive integers. Clearly, with these numbers we can approximate any real number arbitrarily accurately. In other words, the set of numbers {±N × 2^{±M}} is dense on the real line. The number E = ±M is called the exponent and N the mantissa.
Note that N × 2^M = (2^k N) × 2^{M−k}, so the same real number can be represented in many ways as ±N × 2^{±M}.
14.3 Floating point representation
We return to our decimal system. Any decimal number (with finitely many digits)
can be written as

f = ±.d1 d2 . . . dt × 10^e

where di are decimal digits and e is an integer. For example, 18.2 = .182 × 10^2 = .0182 × 10^3, etc. This is called a floating point representation of decimal numbers. The part .d1 . . . dt is called the mantissa and e is the exponent. By changing the exponent e with a fixed mantissa .d1 . . . dt we can move ("float") the decimal point, for example .182 × 10^2 = 18.2 and .182 × 10^1 = 1.82.
14.4 Normalized floating point representation
To avoid unnecessary multiple representations of the same number (as 18.2 by .182 × 10^2 and .0182 × 10^3 above), we require that d1 ≠ 0. We say the floating point representation is normalized if d1 ≠ 0. Then .182 × 10^2 is the only normalized representation of the number 18.2.
For every positive real f > 0 there is a unique integer e ∈ ZZ such that g := 10^{−e} f ∈ [0.1, 1). Then f = g × 10^e is the normalized representation of f. So, the normalized representation is unique.
14.5 Other number systems
Now suppose we are working in a number system with base β ≥ 2. By analogy with
14.3, the floating point representation is
f = ±.d1 d2 . . . dt × β^e

where 0 ≤ di ≤ β − 1 are digits,

.d1 d2 . . . dt = d1 β^{−1} + d2 β^{−2} + · · · + dt β^{−t}

is the mantissa and e ∈ ZZ is the exponent. Again, we say that the above representation is normalized if d1 ≠ 0; this ensures uniqueness.
14.6 Machine floating point numbers
Any computer can only handle finitely many numbers. Hence, the number of digits
di ’s is necessarily bounded, and the possible values of the exponent e are limited to a
finite interval. Assume that the number of digits t is fixed (it characterizes the accuracy
of machine numbers) and the exponent is bounded by L ≤ e ≤ U . Then the parameters
β, t, L, U completely characterize the set of numbers that a particular machine system
can handle. The most important parameter is t, the number of significant digits, or the
length of the mantissa. (Note that the same computer can use many possible machine
systems, with different values of β, t, L, U , see 14.8.)
14.7 Remark
The maximal (in absolute value) number that a machine system can handle is M = β^U (1 − β^{−t}). The minimal positive number is m = β^{L−1}.
14.8 Examples
Most computers use the binary system, β = 2. Many modern computers (e.g., all
IBM compatible PC’s) conform to the IEEE floating-point standard (ANSI/IEEE Standard 754-1985). This standard provides two systems. One is called single precision, it is
characterized by t = 24, L = −125 and U = 128. The other is called double precision,
it is characterized by t = 53, L = −1021 and U = 1024. The PC’s equipped with the
so called numerical coprocessor also use an internal system called temporary format, it is
characterized by t = 65, L = −16381 and U = 16384.
14.9 Relative errors
Let x > 0 be a positive real number with the normalized floating point representation
with base β
x = .d1 d2 . . . × β^e
where the number of digits may be finite or infinite. We need to represent x in a machine
system with parameters β, t, L, U . If e > U , then x cannot be represented (an attempt to
store x in the computer memory or perform calculation that results in x should terminate
the computer program with error message OVERFLOW – a number too large). If e < L,
the system may either represent x by 0 (quite reasonably) or terminate the program with
error message UNDERFLOW – a number too small. If e ∈ [L, U ] is within the right
range, then the mantissa has to be reduced to t digits (if it is longer or infinite). There
are two standard ways to do that reduction:
(i) just take the first t digits of the mantissa of x, i.e. .d1 . . . dt , and chop off the rest;
(ii) round off to the nearest available, i.e. take the mantissa
.d1 . . . dt                   if d_{t+1} < β/2
.d1 . . . dt + .0 . . . 01     if d_{t+1} ≥ β/2
Denote the obtained number by xc (the computer representation of x). The relative error
in this representation can be estimated as
(xc − x)/x = ε,    or    xc = x(1 + ε),

where the maximal possible value of ε is

u = β^{1−t}           for chopped arithmetic
u = (1/2) β^{1−t}     for rounded arithmetic
The number u is called the unit round off or machine precision.
14.10 Examples
a) For the IEEE floating-point standard with chopped arithmetic in single precision we have u = 2^{−23} ≈ 1.2 × 10^{−7}. In other words, approximately 7 decimal digits are accurate.
b) For the IEEE floating-point standard with chopped arithmetic in double precision we have u = 2^{−52} ≈ 2.2 × 10^{−16}. In other words, approximately 16 decimal digits are accurate.
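These values can be inspected directly (our sketch; NumPy's finfo(...).eps is the gap between 1 and the next representable number, i.e. β^{1−t}, which coincides with the values of u quoted above):

    import numpy as np

    print(np.finfo(np.float32).eps)   # 1.1920929e-07          (t = 24, cf. a)
    print(np.finfo(np.float64).eps)   # 2.220446049250313e-16  (t = 53, cf. b)

    # beyond t significant binary digits an added term is simply lost:
    print(np.float64(1.0) + np.float64(1e-17) == np.float64(1.0))   # True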
14.11 Computational errors
Let x, y be two real numbers represented in a machine system by xc , yc . An arithmetic
operation x ∗ y, where ∗ is one of +, −, ×, ÷, is performed by a computer in the following
way. The computer finds xc ∗ yc (first, exactly) and then represents that number by the
machine system. The result is z := (xc ∗ yc )c . Note that, generally, z is different from
(x ∗ y)c , which is the representation of the exact result x ∗ y. Hence, z is not necessarily
the best representation for x ∗ y. In other words, the computer makes additional round
off errors during computations. Assuming that xc = x(1 + ε1 ) and yc = y(1 + ε2 ) we have
(xc ∗ yc )c = (xc ∗ yc )(1 + ε3 ) = [x(1 + ε1 )] ∗ [y(1 + ε2 )](1 + ε3 )
where |ε1 |, |ε2 |, |ε3 | ≤ u.
14.12 Multiplication and Division
For multiplication, we have
z = xy(1 + ε1 )(1 + ε2 )(1 + ε3 ) ≈ xy(1 + ε1 + ε2 + ε3 )
so the relative error is (approximately) bounded by 3u. A similar estimate can be made
in the case of division.
14.13 Addition and Subtraction
For addition, we have
z = (x + y + xε1 + yε2)(1 + ε3) = (x + y) ( 1 + (xε1 + yε2)/(x + y) ) (1 + ε3)

The relative error is now small if |x| and |y| are not much bigger than |x + y|. The error, however, can be arbitrarily large if |x + y| ≪ max{|x|, |y|}. This effect is known
as catastrophic cancellation. A similar estimate can be made in the case of subtraction
x − y: if |x − y| is not much smaller than |x| or |y|, then the relative error is small,
otherwise we may have a catastrophic cancellation.
14.14 Quantitative estimation
Assume that z = x + y is the exact sum of x and y. Let xc = x + ∆x and yc = y + ∆y
be the machine representations of x and y. Let z + ∆z = xc + yc . We want to estimate
the relative error |∆z|/|z| in terms of the relative errors |∆x|/|x| and |∆y|/|y|:
|∆z|/|z| ≤ ( |x| / |x + y| ) · ( |∆x|/|x| ) + ( |y| / |x + y| ) · ( |∆y|/|y| )

In particular, assume that |∆x|/|x| ≤ u and |∆y|/|y| ≤ u, so that xc and yc are the best possible machine representations of x and y. Then

|∆z|/|z| ≤ ( (|x| + |y|) / |x + y| ) u

so that the value q := (|x| + |y|)/|x + y| characterizes the accuracy of the machine addition of x and y. More precisely, if q ≪ u^{−1}, then the result is fairly accurate. On the contrary, if q ≈ u^{−1} or higher, then a catastrophic cancellation occurs.
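A tiny double-precision example of catastrophic cancellation (the numbers are our choice):

    import numpy as np

    x = np.float64(1.0 + 1e-13)
    y = np.float64(-1.0)
    z = x + y                        # exact answer is 1e-13, but q = (|x|+|y|)/|x+y| ~ 2e13

    print(z)                         # 9.992007221626409e-14
    print(abs(z - 1e-13) / 1e-13)    # relative error ~ 8e-4: most significant digits are lost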
14.15 Example
Consider the system of equations
[ 0.01  2 ; 1  3 ] [ x ; y ] = [ 2 ; 4 ]

The exact solution is

[ x ; y ] = [ 200/197 ; 196/197 ] = [ 1.015. . . ; 0.995. . . ]
One can solve the system with the chopped arithmetic with base β = 10 and t = 2
(i.e. working with a two digit mantissa) by the LU decomposition without pivoting. One
gets (xc, yc) = (0, 1). Increasing the accuracy to t = 3 gives (xc, yc) = (2, 0.994), not much of an improvement, since xc is still far off.
By applying partial pivoting one gets (xc, yc) = (1, 1) for t = 2 and (xc, yc) = (1.02, 0.994) for t = 3, almost perfect accuracy!
One can see that solving linear systems in machine arithmetic is different from solving them exactly. We ‘pretend’ that we are computers subject to the strict rules of
machine arithmetic, in particular we are limited to t significant digits. As we notice,
these limitations may lead to unexpectedly large errors in the end.
15 Sensitivity and Stability
One needs to avoid catastrophic cancellations. At the very least, one wants to have
means to determine whether catastrophic cancellations can occur. Here we develop such
means for the problem of solving systems of linear equations Ax = b.
15.1 Convention
Let k · k be a norm in IRn . If x ∈ IRn is a vector and x + ∆x ∈ IRn is a nearby vector
(say, a machine representation of x), we consider k∆xk/kxk as the relative error (in x).
If A ∈ IRn×n is a matrix and A + ∆A ∈ IRn×n a nearby matrix, we consider k∆Ak/kAk
as the relative error (in A). Here k · k is the matrix norm induced by the norm k · k in
IRn .
15.2 Definition (Condition Number)
For a nonsingular matrix A, the condition number with respect to the given matrix
norm k · k is
κ(A) = kA−1 k kAk
We denote by κ1 (A), κ2 (A), κ∞ (A) the condition numbers with respect to the 1-norm,
2-norm and ∞-norm, respectively.
15.3 Theorem
Suppose we have
Ax = b
(A + ∆A)(x + ∆x) = b + ∆b
with a nonsingular matrix A. Assume that k∆Ak is small so that k∆Ak kA−1 k < 1.
Then

k∆xk/kxk ≤ [ κ(A) / ( 1 − κ(A) k∆Ak/kAk ) ] · ( k∆Ak/kAk + k∆bk/kbk )
Proof. Expanding out the second equation, subtracting the first one and multiplying
by A−1 gives
∆x = −A−1 ∆A(x + ∆x) + A−1 ∆b
Taking norms and using the triangle inequality and 8.10 gives

k∆xk ≤ kA^{−1}k k∆Ak ( kxk + k∆xk ) + kA^{−1}k k∆bk

Using kbk ≤ kAk kxk, this rearranges to

( 1 − kA^{−1}k k∆Ak ) k∆xk ≤ ( kA^{−1}k k∆Ak + kA^{−1}k kAk k∆bk/kbk ) kxk

Recall that k∆Ak kA^{−1}k < 1, so the first factor above is positive. The theorem follows immediately. 2
Interpretation. Let Ax = b be a given system of linear equations to be solved numerically. A computer represents A by Ac = A + ∆A and b by bc = b + ∆b and then solves
the system Ac x = bc . Assume now (ideally) that the computer finds an exact solution
xc = x + ∆x, i.e., Ac xc = bc . Then the relative error k∆xk/kxk can be estimated by
15.3. The smaller the condition number κ(A), the tighter (better) estimate on k∆xk/kxk
we get. The value of κ(A) thus characterizes the sensitivity of the solution of the linear
system Ax = b to small errors in A and b.
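A short numerical illustration of the role of κ(A) (our sketch; numpy.linalg.cond computes the 2-norm condition number):

    import numpy as np

    A = np.array([[1.0, 1.0], [1.0, 1.0001]])   # nearly singular, hence ill-conditioned
    b = np.array([2.0, 2.0001])
    x = np.linalg.solve(A, b)                   # exact solution is (1, 1)
    print(np.linalg.cond(A))                    # about 4e4

    db = np.array([0.0, 1e-4])                  # tiny perturbation of b
    dx = np.linalg.solve(A, b + db) - x
    print(np.linalg.norm(dx) / np.linalg.norm(x))   # about 1: the relative error in b
                                                    # is magnified by roughly kappa(A)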
15.4 Corollary
Assume that in Theorem 15.3 we have k∆Ak ≤ ukAk and k∆bk ≤ ukbk, i.e. the
matrix A and the vector b are represented with the best possible machine accuracy.
Then

k∆xk/kxk ≤ 2uκ(A) / ( 1 − uκ(A) )

Interpretation. If uκ(A) ≪ 1, then the numerical solution of Ax = b is quite accurate
in the ideal case (when the computer solves the system Ac x = bc exactly). If, however,
uκ(A) ≈ 1 or > 1, then the numerical solution is completely unreliable, no matter how
accurately the computer works. Comparing this to 14.14, we can interpret κ(A) as a
quantitative indicator of the possibility of catastrophic cancellations.
Linear systems Ax = b with κ(A) ≪ u^{−1} are often called well-conditioned. Those with κ(A) of order u^{−1} or higher are called ill-conditioned. In practical applications,
when one runs (or may run) into an ill-conditioned linear system, one needs to reformulate the underlying problem rather than trying to use numerical tricks to deal with the
ill-conditioning. We will face this problem in Section 16.
15.5 Example.
Assume that u ≈ 10^{−l}, i.e. the machine system provides l accurate digits. Then if κ(A) ≈ 10^k with k < l, then k∆xk/kxk is of order 10^{k−l}, i.e. even an ideal numerical solution only provides l − k accurate digits.
15.6 Proposition
1. κ(λA) = κ(A) for λ ≠ 0

2. κ(A) = max_{kxk=1} kAxk / min_{kxk=1} kAxk
3. If aj denotes the j-th column of A, then κ(A) ≥ kaj k/kai k
4. κ2 (A) = κ2 (At )
5. κ(I) = 1
6. κ(A) ≥ 1
7. For any orthogonal matrix Q,
κ2 (QA) = κ2 (AQ) = κ2 (A)
8. If D = diag(d1 , . . . , dn ) then
κ2(D) = κ1(D) = κ∞(D) = max_{1≤i≤n} |di| / min_{1≤i≤n} |di|
9. If λM is the largest eigenvalue of At A and λm is its smallest eigenvalue, then
κ2 (At A) = κ2 (AAt ) = (κ2 (A))2 = λM /λm
10. κ2 (A) = 1 if and only if A is a multiple of an orthogonal matrix.
15.7 Example
Let A = [ 1 0 ; 0 ε ] with small ε > 0. Then κ2(A) = 1/ε by 15.6(8), and so the system Ax = b is ill-conditioned. It is, however, easy to ‘correct’ the system by multiplying the second equation by 1/ε. Then the new system A′x′ = b′ will be perfectly conditioned, since κ2(A′) = 1 by 15.6(5). The trick of multiplying rows (and columns) by scalars is called scaling of the matrix A. Sometimes scaling allows one to significantly reduce the condition number of A. Such scalable matrices are, however, not met very often in practice. Besides, no satisfactory method exists for detecting scalable matrices.
15.8 Remark
Another way to look at the condition number κ(A) is the following. Since in practice
the exact solution x of the system Ax = b is rarely known, one can try to verify whether
the so called residual vector r = b − Axc is small relative to b. Since Axc = b − r, Theorem 15.3 with ∆A = 0 implies that

kxc − xk/kxk ≤ κ(A) krk/kbk
If A is well conditioned, the smallness of krk/kbk ensures the smallness of the relative
error kxc − xk/kxk. If A is ill-conditioned, this does not work: one can find xc far from
x for which r is still small.
15.9 Remark
In 15.3–15.8, we assumed that xc was an exact solution of the system Ac x = bc . We
now get back to reality and consider the numerical solution x̂c of the system Ac x = bc
obtained by a computer. Because of computational errors (see 14.11–14.13) the vector
x̂c will not be an exact solution, i.e. we have Ac x̂c ≠ bc. Hence, we have

x̂c − x = (x̂c − xc) + (xc − x) = ∆x̂ + ∆x,    where ∆x̂ := x̂c − xc
[Diagram: the actual data A, b have the exact solution x; the machine data Ac, bc have the ‘ideal’ exact solution xc; the computer algorithm produces the computed x̂c, which is the exact solution of a ‘virtual’ system Âc x = bc.]
The error ∆x is solely due to inaccurate representation of the input data, A and b, in the
computer memory (by Ac and bc ). The error ∆x is estimated in Theorem 15.3 with the
help of the condition number κ(A). If κ(A) is too large, the error ∆x may be too big,
and one should not even attempt to solve the system numerically.
15.10 Round-off error analysis
Assume now that κ(A) is small enough, so that we need not worry about ∆x. Now we need to estimate ∆x̂ = x̂c − xc. This results from computational errors, see 14.11–14.13. Note that even small errors may accumulate to a large error in the end.
In order to estimate x̂c − xc, a typical approach is to find another matrix Âc = Ac + δA such that (Ac + δA)x̂c = bc. We call Ac + δA a virtual matrix, since it is neither given nor computed numerically. Moreover, it is only specified by the fact that it takes the vector x̂c to bc, hence it is far from being uniquely defined. One wants to find a virtual matrix as close to Ac as possible, to make δA small. Then one can use Theorem 15.3 to estimate ∆x̂:

kx̂c − xck / kxck ≤ [ κ(Ac) / ( 1 − κ(Ac) kδAk/kAck ) ] · kδAk/kAck
Clearly, the numerical solution x̂c , and then the virtual matrix Ac + δA, will depend
on the method (algorithm) used. For example, one can use the LU decomposition method
with or without pivoting. For a symmetric positive definite matrix A, one can use the
Cholesky factorization.
We will say that the numerical method is (algebraically) stable if one can find a virtual
matrix so that kδAk/kAk is small, and unstable otherwise. The rule of thumb is then that
for stable methods, computational errors do not accumulate too much and the numerical
solution x̂c is close to the ideal solution xc . For unstable methods, computational errors
may accumulate too much and force x̂c to be far from xc, thus making the numerical solution x̂c unreliable.
15.11 Wilkinson’s analysis for the LU decomposition
Here we provide, without proof, an explicit estimate on the matrix δA for the LU decomposition method described in Section 4. Denote by Lc = (l^c_ij) and Uc = (u^c_ij) the computed matrices L and U produced by the LU decomposition algorithm. Then

|δA_ij| ≤ nu ( 3|a_ij| + 5 Σ_{k=1}^n |l^c_ik| |u^c_kj| ) + O(u^2)

If the values of l^c_ik and u^c_kj are not too large compared to those of a_ij, this provides a good upper bound on kδAk/kAk: it is of order n^2 u. In this case the LU decomposition method is stable (assuming that n is not too big). But it may become unstable if Σ |l^c_ik| |u^c_kj| is large compared to |a_ij|.
One can improve the stability by using partial pivoting. In this case |l^c_ij| ≤ 1. So, the trouble can only occur if one gets large elements of Uc. Practically, however, it is observed that kUck∞ ≈ kAk∞, so that the partial pivoting algorithm is usually stable.
The LU decomposition with complete pivoting is always stable. In this case it can be proved that

kUck∞ / kAk∞ ≤ n^{1/2} ( 2^1 3^{1/2} 4^{1/3} · · · n^{1/(n−1)} )^{1/2}

This bound grows slowly with n and ensures the stability. (Exercise: show that this bound grows approximately like n^{0.25 ln n}.)
15.12 Remark
The Cholesky factorization A = GGt of a symmetric positive definite matrix A, see
13.5, is a particular form of the LU decomposition, so the above analysis applies. In this
case, however, we know that
aii = Σ_{j=1}^i gij^2
see 13.6. Hence, the elements of G cannot be large compared to the elements of A. This
proves that the Cholesky factorization is always stable.
15.13 Remark (Iterative Improvement)
In some practical applications, a special technique can be used to improve the numerical solution x̂c of a system Ax = b. First one solves the system Ac x = bc numerically
and then finds the residual vector r = bc − Ac x̂c . Then one solves the system Ac z = rc for
z and replaces x̂c by x̂c + zc . This is one iteration of the so called iterative improvement
or iterative refinement. Note that solving the system Ac z = rc is relatively cheap, since
the LU decomposition of A is obtained (and stored) already. Sometimes one uses the
2t-digit arithmetic to compute r, if the solution x̂c is computed in the t-digit arithmetic
(one must have it available, though, which is not always the case).
It is known that by repeating the iterations k times with machine precision u = 10^{−d} and κ∞ = 10^q one expects to obtain approximately min(d, k(d − q)) correct digits.
If one uses the LU decomposition with partial pivoting, then just one iteration of the
above improvement makes the solution algebraically stable.
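The following is a minimal sketch of one refinement step, assuming NumPy and SciPy are available; the matrix A and the right-hand side b are made up for illustration, and scipy.linalg.lu_factor/lu_solve stand in for the stored LU decomposition.

    import numpy as np
    from scipy.linalg import lu_factor, lu_solve

    A = np.array([[4.0, 1.0, 0.0],
                  [1.0, 3.0, 1.0],
                  [0.0, 1.0, 2.0]])
    b = np.array([1.0, 2.0, 3.0])

    lu, piv = lu_factor(A)          # LU decomposition, computed once and stored
    x = lu_solve((lu, piv), b)      # numerical solution x_c

    r = b - A @ x                   # residual (ideally computed in higher precision)
    z = lu_solve((lu, piv), r)      # cheap: reuses the stored LU factors
    x = x + z                       # one iteration of iterative improvement

In exact arithmetic z would restore the exact solution; in floating point each such step typically adds a fixed number of correct digits, as described above.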
16
Overdetermined Linear Systems
16.1 Definition (Overdetermined Linear System)
A system of linear equations Ax = b with A ∈ C^{m×n}, x ∈ C^n and b ∈ C^m is said to be overdetermined if m > n. Since there are more equations than unknowns, the system usually has no solutions. We will be concerned here exactly with that “no solution” situation.
Note that the matrix A defines a linear transformation T_A : C^n → C^m. Then Ax = b has a solution if and only if b ∈ Im T_A. In this case the solution is unique if and only if Ker T_A = {0}, i.e. the columns of A are linearly independent.
16.2 Definition (Least Squares Fit)
Let (xi , yi ), 1 ≤ i ≤ m, be given points on the real xy plane. For any straight line
y = a + bx one defines the “distance” of that line from the given points by
    E(a, b) = Σ_{i=1}^m (a + bx_i − y_i)²
This quantity is called the residual sum of squares (RSS). Suppose that the line y = â+ b̂x
minimizes the function E(a, b), i.e.
    E(â, b̂) = min_{a,b} E(a, b)
Then y = â + b̂x is called the least squares approximation to the given points (xi , yi ).
Finding the least squares approximation is called the least squares fit of a straight line
to the given points. It is widely used in statistics.
16.3 Example
Let {xi } be the heights of some m people and {yi } their weights. One can expect an
approximately linear dependence y = a + bx of the weight from the height. The least
squares fit of a line y = a + bx to the experimental data (xi , yi ), 1 ≤ i ≤ m, gives the
numerical estimates of the coefficients a and b in the formula y = a + bx.
Note: in many statistical applications the least squares estimates are the best possible (most accurate). There are, however, just as many exceptions; these questions are beyond the scope of this course.
16.4 Remark
For the least squares fit in 16.2, let A be the m × 2 matrix whose i-th row is (1, x_i), let x = (a, b)^t, and let b = (y_1, . . . , y_m)^t. Then

    E(a, b) = kb − Axk2²

Note that E(a, b) ≥ 0. Also, E(a, b) = 0 if and only if x = (a, b)^t is an exact solution of the system Ax = b. The system Ax = b is overdetermined whenever m > 2, so exact solutions rarely exist.
16.5 Definition (Least Squares Solution)
Let Ax = b be an overdetermined linear system. A vector x ∈ C^n that minimizes the function

    E(x) = kb − Axk2²

is called a least squares solution of Ax = b.
16.6 Remark
If A, b and x are real, then one can easily differentiate the function E(x) with respect to the components of the vector x and equate the derivatives to zero. Then one arrives at the linear system A^tAx = A^tb.
16.7 Definition (Normal Equations)
Let Ax = b be an overdetermined linear system. Then the linear system
A∗ Ax = A∗ b
is called the system of normal equations associated with the overdetermined system Ax =
b. In the real case, the system of normal equations is
At Ax = At b
Note: For a matrix A ∈ C^{m×n}, its adjoint is defined by A∗ = Ā^t (the conjugate transpose) ∈ C^{n×m}.
16.8 Lemma
The matrix A∗A is a square n × n Hermitean positive semidefinite matrix. If A has full rank, then A∗A is positive definite.
16.9 Theorem
Let Ax = b be an overdetermined linear system.
(a) A vector x minimizes E(x) = kb − Axk2 if and only if it is an exact solution of the
system Ax = b̂, where b̂ is the orthogonal projection of b onto Im TA .
(b) A vector x minimizing E(x) always exists. It is unique if and only if A has full rank,
i.e. the columns of A are linearly independent.
(c) A vector x minimizes E(x) if and only if it is a solution of the system of normal
equations A∗ Ax = A∗ b.
Proof. The matrix A∗ defines a linear transformation T_{A∗} : C^m → C^n. Note that (Im T_A)⊥ = Ker T_{A∗}, which generalizes 11.7; the proof is a routine check, just as that of 11.7. So we have an orthogonal decomposition C^m = Im T_A ⊕ Ker T_{A∗}. Then we can write b = b̂ + r uniquely, where b̂ ∈ Im T_A and r ∈ Ker T_{A∗}. Since hb̂, ri = 0, it follows from the Theorem of Pythagoras that

    kb − Axk² = kb̂ − Axk² + krk² ≥ krk²
Hence, min_x E(x) = krk² is attained whenever Ax = b̂. Note that b̂ ∈ Im T_A, so there is always an x ∈ C^n such that Ax = b̂. The vector x is unique whenever Ker T_A = {0}, i.e. dim Im T_A = n, which occurs precisely when the columns of A are linearly independent, i.e. A has full rank (rank A = n). This proves (a) and (b).
To prove (c), observe that if x minimizes E(x), then by (a) we have b − Ax = r ∈
Ker TA∗ , and thus A∗ (b − Ax) = 0. Conversely, if A∗ (b − Ax) = 0, then b − Ax ∈ Ker TA∗ ,
and by the uniqueness of the decomposition b = b̂ + r we have Ax = b̂. 2
16.10 Normal Equations, Pros and Cons
Let A ∈ IRm×n .
a) By Lemma 16.8, the matrix A^tA is symmetric positive definite, provided A has full rank. In this case the solution of the system of normal equations A^tAx = A^tb can be effectively found by Cholesky factorization. Furthermore, in many applications n ≪ m, so the system A^tAx = A^tb is much smaller than Ax = b.
b) It may happen, though, that the matrix A is somewhat ill-conditioned (i.e. almost causes catastrophic cancellations). In that case the condition of the matrix A^tA will be much worse than that of A, compare this to 15.6 (9). Solving the normal equations can be disastrous. For example, let

    A = [1 1; ε 0; 0 ε]        then        A^tA = [1 + ε²  1; 1  1 + ε²]

If ε is so small that ε² < u (for example, ε = 10^{−4} in single precision), then the matrix A^tA will be stored as a singular matrix, and the numerical solution of the normal equations will crash. Still, it is possible to find a numerical least squares solution of the original system Ax = b, if one uses more elaborate methods.
c) One can define a condition number of an m × n rectangular matrix A with m > n by

    κ(A) := max_{kxk=1} kAxk / min_{kxk=1} kAxk

Then, in the above example,

    κ2(A) = ε^{−1} √(2 + ε²)        and        κ2(A^tA) = ε^{−2} (2 + ε²)

Hence, κ2(A^tA) = [κ2(A)]².
d) Clearly, κ2 (A) = κ2 (QA) for any orthogonal m × m matrix Q. Hence, one can
safely (without increasing the 2-condition number of A) multiply A by orthogonal m × m
matrices. We will try to find an orthogonal matrix Q so that QA is upper triangular
(i.e., perform orthogonal triangularization).
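A small numerical illustration of points b) and c), assuming NumPy; the right-hand side is made up. It shows κ2(A^tA) ≈ [κ2(A)]² for the 3 × 2 example above, so the normal equations lose roughly twice as many digits as an orthogonalization-based solver.

    import numpy as np

    eps = 1e-7
    A = np.array([[1.0, 1.0],
                  [eps, 0.0],
                  [0.0, eps]])
    b = np.array([2.0, eps, eps])                    # made-up right-hand side

    print(np.linalg.cond(A))                         # about sqrt(2)/eps
    print(np.linalg.cond(A.T @ A))                   # about 2/eps^2, i.e. cond(A)^2

    x_qr = np.linalg.lstsq(A, b, rcond=None)[0]      # orthogonalization-based solver
    x_ne = np.linalg.solve(A.T @ A, A.T @ b)         # normal equations
    print(x_qr, x_ne)                                # the second answer is less reliable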
16.11 Definition (Hyperplane, Reflection)
Let V be a finite dimensional inner product space (real or complex). A subspace
W ⊂ V is called a hyperplane if dim W = dim V − 1. Note that in this case dim W ⊥ = 1.
Let W ⊂ V be a hyperplane. For any vector v ∈ V we have a unique decomposition
v = w + w0 , where w ∈ W and w0 ∈ W ⊥ . The linear operator P on V defined by
P v = w − w0 is called a reflection (or reflector) across the hyperplane W . It is identical
on W and negates vectors orthogonal to W .
16.12 Definition (Reflection Matrix)
Let x ≠ 0 be a vector in IR^n or C^n. The matrix

    P = I − 2 x x̄^t / (x^t x̄)

is called a reflection matrix (or a reflector matrix). Obviously, P is unchanged if x is replaced by cx for any c ≠ 0. Note also that x^t x̄ = kxk².
16.13 Theorem
Let P be a reflection matrix corresponding to a vector x 6= 0. Then
(a) P x = −x
(b) P y = y whenever hy, xi = 0
(c) P is Hermitean (in the real case it is symmetric)
(d) P is unitary (in the real case it is orthogonal)
(e) P is an involution, i.e. P² = I
Note that (a)+(b) mean that P is a reflection of IR^n or C^n across the hyperplane orthogonal to the vector x.
Proof. (a), (b) and (c) are proved by direct calculation. To prove (e), write

    P² = I − 4 x x̄^t / (x^t x̄) + 4 (x x̄^t)(x x̄^t) / (x^t x̄)²

Then (e) follows from the fact that x̄^t x = x^t x̄ = kxk² is a scalar quantity. Next, (d) follows from (c) and (e).
16.14 Theorem
Let y be a vector in IR^n or C^n. Choose a scalar σ so that |σ| = kyk and σ · he1, yi ∈ IR. Suppose that x = y + σe1 ≠ 0. Let P = I − 2xx̄^t/kxk² be the reflection matrix defined in 16.12. Then P y = −σe1.
Proof. Notice that hy − σe1 , y + σe1 i = kyk2 − σhe1 , yi + σ̄hy, e1 i − |σ|2 = 0. Hence,
P (y − σe1 ) = y − σe1 by 16.13 (b). Besides, P (y + σe1 ) = −y − σe1 by 16.13 (a). Adding
these two equations gives the theorem.
16.15 Remark
(a) To choose σ in 16.14, write a polar representation for he1 , yi = ȳ1 = reiθ and then set
σ = ±kyke−iθ .
(b) In the real case, we have y1 ∈ IR, and one can just set
σ = ±kyk
Note: It is geometrically obvious that for any two unit vectors x, y ∈ IR^n there is a reflection P that takes x to y. In the complex space C^n this is not true: for generic unit vectors x, y one can only find a reflection that takes x to cy with some scalar c ∈ C.
16.16 Corollary
For any vector y in IR^n or C^n there is a scalar σ (defined in 16.14) and a matrix P, which is either a reflection or the identity, such that P y = −σe1.
Proof. Apply Theorem 16.14 in the case y + σe1 6= 0 and set P = I otherwise. 2
16.17 Theorem (QR Decomposition)
Let A be an m × n complex or real matrix with m ≥ n. Then there is a unitary (resp.,
orthogonal) m × m matrix Q and an upper triangular m × n matrix R (i.e., Rij = 0 for
i > j) such that
A = QR
Furthermore, Q may be found as a product of at most n reflection matrices.
Proof. We use induction on n. Let n = 1, so that A is a column m-vector. By 16.16 there is a matrix P (a reflection or the identity) such that P A = −σe1 for a scalar σ. Hence, A = P R where R = −σe1 is upper triangular. Now let n ≥ 2 and let a1 be the first column of A. Again, by 16.16 there is a (reflection or identity) matrix P such that P a1 = −σe1. Hence,

    P A = [−σ  w^t; 0  B]

where w is an (n − 1)-vector and B an (m − 1) × (n − 1) matrix. By the inductive assumption, there is an (m − 1) × (m − 1) unitary matrix Q0 and an (m − 1) × (n − 1) upper triangular matrix R0 such that B = Q0 R0. Consider the unitary m × m matrix

    Q1 = [1  0; 0  Q0]

(by 10.6, Q1 is unitary whenever Q0 is). One can easily check that P A = Q1 R where

    R = [−σ  w^t; 0  R0]

is an upper triangular matrix. Hence, A = QR with Q = P Q1. 2
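A sketch of the above construction in NumPy (real case only): each step builds a reflector from the leading column of the remaining block, following the sign rule of 16.26(i) below. It is only an illustration of the proof, not a production implementation.

    import numpy as np

    def householder_qr(A):
        A = A.astype(float).copy()
        m, n = A.shape
        Q = np.eye(m)
        for j in range(n):
            y = A[j:, j]
            sigma = np.sign(y[0]) * np.linalg.norm(y) if y[0] != 0 else np.linalg.norm(y)
            x = y.copy()
            x[0] += sigma                          # x = y + sigma*e1
            if np.linalg.norm(x) == 0:
                continue                           # column is already reduced
            P = np.eye(m - j) - 2.0 * np.outer(x, x) / (x @ x)
            A[j:, j:] = P @ A[j:, j:]              # apply the reflector to the block
            Q[:, j:] = Q[:, j:] @ P                # accumulate Q as a product of reflectors
        return Q, A                                # A has been overwritten by R

    A = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
    Q, R = householder_qr(A)
    print(np.allclose(Q @ R, A), np.allclose(Q.T @ Q, np.eye(3)))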
16.18 Remark
In the above theorem, denote by a_j ∈ C^m, 1 ≤ j ≤ n, the columns of A, by q_j ∈ C^m, 1 ≤ j ≤ m, the columns of Q, and put R = (r_ij). Then A = QR may be written as

    a1 = r11 q1
    a2 = r12 q1 + r22 q2
    . . .
    an = r1n q1 + r2n q2 + · · · + rnn qn
16.19 Corollary
If, in addition, A has full rank (rank A = n), then
(a) span{a1 , . . . , ak } = span{q1 , . . . , qk } for every 1 ≤ k ≤ n.
(b) one can find a unitary Q so that the diagonal entries of R will be real and positive
(rii > 0 for 1 ≤ i ≤ n).
Proof. (a) follows from 16.18. To prove (b), let r_jj = |r_jj| e^{iθ_j} be the polar representation of r_jj. Then multiplying every column q_j, 1 ≤ j ≤ n, by e^{iθ_j} gives (b). The new matrix Q will still be unitary by 10.6. 2
The matrix A depends only on the first n columns of Q. This suggests a reduced QR
decomposition.
16.20 Theorem (Reduced or “Skinny” QR decomposition)
Let A be an m × n real or complex matrix with m ≥ n. Then there is an m × n
matrix Q̂ with orthonormal columns and an upper triangular n × n matrix R̂ such that
A = Q̂R̂
Proof. Apply Theorem 16.17. Let Q̂ be the left m × n rectangular block of Q (the
first n columns of Q). Let R̂ be the top n × n square block of R (note that the remainder
of R is zero). Then A = Q̂R̂. 2
16.21 Corollary
If, in addition, A has full rank (rank A = n), then
(a) The columns of Q̂ make an ONB in the column space of A (=Im TA )
(b) one can find Q̂ so that the diagonal entries of R̂ will be real and positive (rii > 0).
Such Q̂ and R̂ are unique.
Proof. This follows from 16.19, except the uniqueness. For simplicity, we prove it only
in the real case. Let A = Q̂R̂ = Q̂1 R̂1 . Then At A = R̂t R̂ = R̂1t R̂1 . Since At A is positive
definite, we can use the uniqueness of Cholesky factorization and obtain R̂ = R̂1 . Then
also Q̂1 = Q̂R̂R̂1−1 = Q̂. 2
In the remainder of this section, we only consider real matrices and vectors.
16.22 Algorithm
Let Ax = b be a real overdetermined system with matrix A of full rank. One finds the decomposition A = QR with reflectors, as shown in the proof of Theorem 16.17. Then Q^tQ = I and so Rx = Q^tb. Denote Q^tb = (c, d)^t, where c ∈ IR^n is the vector of the top n components of Q^tb and d ∈ IR^{m−n} its bottom part. Next, one finds x by using backward substitution to solve the system R̂x = c, where R̂ is the top n × n square block of R. Lastly, one finds the value of E(x) = kb − Axk² by computing kdk². Indeed,

    kb − Axk² = kQ^tb − Q^tAxk² = k(c, d)^t − Rxk² = kdk²
The above algorithm is the most suitable for computer implementation. It does not
worsen the condition of the given system Ax = b, see also numerical hints in 16.26.
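A sketch of this algorithm with NumPy/SciPy built-ins (a least squares fit of a straight line to made-up data points): the full QR decomposition gives both the solution and the residual norm kdk.

    import numpy as np
    from scipy.linalg import solve_triangular

    xs = np.array([0.0, 1.0, 2.0, 3.0])
    ys = np.array([1.1, 1.9, 3.2, 3.9])              # made-up data, as in 16.2
    A = np.column_stack([np.ones_like(xs), xs])
    b = ys

    Q, R = np.linalg.qr(A, mode='complete')          # full m x m Q, m x n R
    c_d = Q.T @ b
    x = solve_triangular(R[:2, :], c_d[:2])          # backward substitution: R-hat x = c
    residual_norm = np.linalg.norm(c_d[2:])          # ||d|| = ||b - Ax||
    print(x, residual_norm)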
On the other hand, for calculations ‘by hand’ (such as on tests!), the following reduced
version of this algorithm is more convenient.
16.23 Algorithm
Let Ax = b be a real overdetermined system with matrix A of full rank. One can
obtain a “skinny” QR decomposition A = Q̂R̂ by using 16.18 and employing a version
of the Gram-Schmidt orthogonalization method:
    r11 = ka1k  and  q1 = a1/r11
    r12 = ha2, q1i,  r22 = ka2 − r12 q1k  and  q2 = (a2 − r12 q1)/r22
    · · ·
    r_in = ha_n, q_ii (i < n),  r_nn = ka_n − Σ_i r_in q_ik  and  q_n = (a_n − Σ_i r_in q_i)/r_nn
Then one solves the triangular system R̂x = Q̂t b by backward substitution. This method
does not give the value of kb − Axk2 , one has to compute it directly, if necessary.
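The recipe above transcribes almost literally into code; here is a sketch in NumPy (classical Gram-Schmidt, full-rank A assumed). In finite precision the modified variant is preferred, but for hand-sized examples this version matches the formulas of 16.23.

    import numpy as np

    def skinny_qr(A):
        A = A.astype(float)
        m, n = A.shape
        Q = np.zeros((m, n))
        R = np.zeros((n, n))
        for j in range(n):
            v = A[:, j].copy()
            for i in range(j):
                R[i, j] = A[:, j] @ Q[:, i]          # r_ij = <a_j, q_i>
                v -= R[i, j] * Q[:, i]
            R[j, j] = np.linalg.norm(v)              # assumes full rank: r_jj > 0
            Q[:, j] = v / R[j, j]
        return Q, R

    A = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
    Q, R = skinny_qr(A)
    print(np.allclose(Q @ R, A))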
Below we outline an alternative approach to finding a QR decomposition.
16.24 Definition (Rotation Matrix)
Let 1 ≤ p < q ≤ m and θ ∈ [0, 2π). The matrix G = Gp,q,θ = (gij ) defined by
gpp = cos θ, gpq = sin θ, gqp = − sin θ, gqq = cos θ and gij = δij otherwise is called a
Givens rotation matrix. It defines a rotation through the angle θ of the xp xq coordinate
plane in IRm with all the other coordinates fixed. Obviously, G is orthogonal.
16.25 Algorithm
Let Ax = b be a real overdetermined system with matrix A of full rank. Let a_j be the leftmost column of A that contains a nonzero entry below the main diagonal, a_ij ≠ 0 with some i > j. Consider the matrix A′ = GA where G = G_{j,i,θ} is a Givens rotation matrix. One easily checks that
(a) the first j − 1 columns of A′ are zero below the main diagonal;
(b) in the j-th column, only the elements a′_jj and a′_ij will be different from the corresponding elements of A, and moreover

    a′_ij = −a_jj sin θ + a_ij cos θ

Now we find sin θ and cos θ so that a′_ij = 0. For example, let

    cos θ = a_jj / √(a_jj² + a_ij²)        and        sin θ = a_ij / √(a_jj² + a_ij²)
Note that one never actually evaluates the angle θ, since Gj,i,θ only contains cos θ and
sin θ.
In this way one zeroes out one nonzero element aij below the main diagonal. Working
from left to right, one can convert A into an upper triangular matrix G̃A = R where G̃
is a product of Givens rotation matrices. Each nonzero element of A below the main
diagonal requires one multiplication by a rotation matrix. Then we get A = QR with an
orthogonal matrix Q = G̃−1 . This algorithm is generally more expensive than the QR
decomposition with reflectors. But it works very efficiently if the matrix A is sparse, i.e.
contains just a few nonzero elements below the main diagonal.
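A sketch of a single Givens step (NumPy, real case): one rotation in the (j, i) coordinate plane zeroes the entry a_ij below the diagonal, exactly as in the formulas above.

    import numpy as np

    def apply_givens(A, i, j):
        # return G @ A, where G = G_{j,i,theta} is chosen to zero out A[i, j]
        A = A.astype(float).copy()
        r = np.hypot(A[j, j], A[i, j])
        c, s = A[j, j] / r, A[i, j] / r              # cos(theta), sin(theta)
        G = np.eye(A.shape[0])
        G[j, j] = c;  G[j, i] = s
        G[i, j] = -s; G[i, i] = c
        return G @ A

    A = np.array([[3.0, 1.0], [4.0, 2.0], [0.0, 5.0]])
    print(apply_givens(A, 1, 0))                     # the (2,1) entry becomes zero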
16.26 Remarks (Numerical Hints)
The most accurate numerical algorithm for solving a generic overdetermined linear
system Ax = b is based on the QR decomposition with reflectors, then it proceeds as in
16.22. In the process of computing the reflector matrices, two rules should be followed:
(i) The sign of σ in 16.15(b) must coincide with the sign of y1 = hy, e1i every time one uses Theorem 16.14. This helps to avoid catastrophic cancellations when calculating x = y + σe1 (note that the “skinny” algorithm in 16.23 does not provide this flexibility).
(ii) In the calculation of

    kyk = √(y1² + · · · + ym²)

in Theorem 16.14, there is a danger of overflow (if one of the y_i’s is too large) or underflow (if all y_i’s are too small, so that y_i² is a machine zero, then so will be kyk). To avoid this, find first

    y_max = max{|y1|, . . . , |ym|}

and then compute

    kyk = y_max · √(|y1/y_max|² + · · · + |ym/y_max|²)
The same trick must be used when computing the rotation matrices in 16.25.
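A sketch of the scaling trick in (ii), assuming NumPy; dividing by y_max first keeps all intermediate squares in range.

    import numpy as np

    def safe_norm(y):
        ymax = np.max(np.abs(y))
        if ymax == 0.0:
            return 0.0
        return ymax * np.sqrt(np.sum(np.abs(y / ymax) ** 2))

    print(safe_norm(np.array([3e200, 4e200])))   # a naive sum of squares would overflow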
16.27 Polynomial Least Squares Fit
Generalizing 16.2, one can fit a set of data points (x_i, y_i), 1 ≤ i ≤ m, by a polynomial y = p(x) = a0 + a1 x + · · · + a_n x^n with n + 1 ≤ m. The least squares fit is based on minimizing the function

    E(a0, . . . , an) = Σ_{i=1}^m (a0 + a1 x_i + · · · + a_n x_i^n − y_i)²

This leads to an overdetermined linear system generalizing 16.4:

    a0 + a1 x_i + · · · + a_n x_i^n = y_i        1 ≤ i ≤ m
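A sketch of the polynomial fit with NumPy (made-up data points): the matrix of the overdetermined system is of Vandermonde type, and np.linalg.lstsq returns its least squares solution.

    import numpy as np

    xs = np.array([0.0, 0.5, 1.0, 1.5, 2.0])
    ys = np.array([1.0, 1.3, 2.1, 3.2, 5.1])         # made-up data
    n = 2                                            # fit a quadratic, n + 1 <= m

    A = np.vander(xs, n + 1, increasing=True)        # columns 1, x, x^2
    coeffs, *_ = np.linalg.lstsq(A, ys, rcond=None)
    print(coeffs)                                    # a0, a1, a2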
16.28 Continuous Least Squares Fit
Instead of fitting a discrete data set (x_i, y_i) one can fit a continuous function y = f(x) on [0, 1] by a polynomial y = p(x) ∈ P_n(IR). The least squares fit is the one minimizing

    E(a0, . . . , an) = ∫_0^1 |f(x) − p(x)|² dx

The solution of this problem is the orthogonal projection of f(x) onto P_n(IR), recall the homework problem #1 in Assignment 1.
To find the solution, consider the basis {1, x, . . . , x^n} in P_n(IR). Then a0, . . . , an can be found from the system of normal equations

    Σ_{j=0}^n a_j hx^j, x^ii = hf, x^ii        0 ≤ i ≤ n

The matrix A of coefficients here is

    a_ij = hx^j, x^ii = ∫_0^1 x^{i+j} dx = 1/(1 + i + j)

This is the infamous Hilbert matrix; it is ill-conditioned even for moderately large n, since κ2(A) ≈ 10^{1.5n}. One can choose an orthonormal basis in P_n(IR) (Legendre polynomials, see 9.12) to improve the condition of the problem.
17
Other Decomposition Theorems
17.1 Theorem (Schur Decomposition)
If A ∈ C^{n×n}, then there exists a unitary matrix Q such that

    Q∗AQ = T
Q∗ AQ = T
where T is upper triangular, i.e. A is unitary equivalent to an upper triangular matrix.
Moreover, the matrix Q can be chosen so that the eigenvalues of A appear in any order
on the diagonal of T .
The columns of the matrix Q are called Schur vectors.
Proof. We use induction on n. The theorem is clearly true for n = 1. Assume that it
holds for matrices of order less than n. Let λ be an eigenvalue of A and let x be a unit
eigenvector for λ. Let R be a unitary matrix whose first column is x (note: such a matrix exists, because there is an ONB in C^n whose first vector is x, by 9.13, and then R can be constructed by using the vectors of that ONB as the columns of R). Note that Re1 = x, and hence R∗x = e1, since R^{−1} = R∗. Hence we have

    R∗ARe1 = R∗Ax = λR∗x = λe1
so e1 is an eigenvector of the matrix R∗AR for the eigenvalue λ. Thus,

    R∗AR = [λ  w^t; 0  B]

with some w ∈ C^{n−1} and B ∈ C^{(n−1)×(n−1)}. By the induction assumption, there is a unitary matrix Q1 ∈ C^{(n−1)×(n−1)} such that Q1∗BQ1 = T1, where T1 is upper triangular.
Let

    Q = R [1  0; 0  Q1]

Note: by 10.6, the second factor is a unitary matrix because so is Q1, and the product of two unitary matrices is again unitary. Next,
    Q∗AQ = [1  0; 0  Q1∗] [λ  w^t; 0  B] [1  0; 0  Q1] = [λ  w^tQ1; 0  Q1∗BQ1] = [λ  w^tQ1; 0  T1]

which is upper triangular, as required. 2

17.2 Definition (Normal matrix)
A matrix A ∈ C^{n×n} is said to be normal if AA∗ = A∗A.
Note: unitary and Hermitean matrices are normal.
17.3 Lemma
If A is normal and Q unitary, then B = Q∗ AQ is also normal (i.e., the class of normal
matrices is closed under unitary equivalence).
Proof. Note that B ∗ = Q∗ A∗ Q and BB ∗ = Q∗ AA∗ Q = Q∗ A∗ AQ = B ∗ B. 2
17.4 Lemma
If A is normal and upper triangular, then A is diagonal.
Proof. We use induction on n. For n = 1 the theorem is trivially true. Assume that
it holds for matrices of order less than n. Compute the top left element of the matrix
AA∗ = A∗A. On the one hand, it is

    Σ_{i=1}^n a_1i ā_1i = Σ_{i=1}^n |a_1i|²

On the other hand, it is just |a_11|². Hence, a_12 = · · · = a_1n = 0, and

    A = [a_11  0; 0  B]

One can easily check that AA∗ = A∗A implies BB∗ = B∗B. By the induction assumption, B is diagonal. 2
Note: Any diagonal matrix is normal.
17.5 Theorem
(i) A matrix A ∈ C^{n×n} is normal if and only if it is unitary equivalent to a diagonal matrix. In that case the Schur decomposition takes the form
Q∗ AQ = D
where D is a diagonal matrix.
(ii) If A is normal, the columns of Q (Schur vectors) are eigenvectors of A (specifically,
the jth column of Q is an eigenvector of A for the eigenvalue in the jth position on the
diagonal of D).
Proof. The first claim follows from 17.1–17.4. The second is verified by direct inspection. 2
Note: Theorem 17.5 applies to unitary and Hermitean matrices.
17.6 Remark
Three classes of complex matrices have the same property: their matrices are unitary equivalent to a diagonal matrix (i.e., admit an ONB consisting of eigenvectors). The difference between those classes lies in the restrictions on eigenvalues: unitary matrices have eigenvalues on the unit circle (|λ| = 1), Hermitean matrices have real eigenvalues (λ ∈ IR), and normal matrices may have arbitrary complex eigenvalues.
The Schur decomposition established in 17.1 is mainly of theoretical interest. As most matrices of practical interest have real entries, the following variation of the Schur decomposition is of some practical value.
17.7 Theorem (Real Schur Decomposition)
If A ∈ IRn×n then there exists an orthogonal matrix Q such that
    Q^tAQ = [R11 R12 · · · R1m; 0 R22 · · · R2m; · · · ; 0 0 · · · Rmm]
where each Rii is either a 1 × 1 matrix or a 2 × 2 matrix having complex conjugate
eigenvalues.
Proof. We use induction on n. If the matrix A has a real eigenvalue, then we can reduce the dimension and use induction just like in the proof of the Schur Theorem 17.1.
Assume that A has no real eigenvalues. Since the characteristic polynomial of A has real
coefficients, its roots come in conjugate pairs. Let λ1 = α + iβ and λ2 = α − iβ be a pair
of roots of CA (x), with β 6= 0. The root λ1 is a complex eigenvalue of A, considered as a
complex matrix, with a complex eigenvector x + iy, where x, y ∈ IRn . The equation
A(x + iy) = (α + iβ)(x + iy)
can be written as
Ax = αx − βy
Ay = βx + αy
or in the matrix form

    A [x y] = [x y] [α  β; −β  α]

where [x y] is the n × 2 matrix with columns x and y. Observe that x and y must be linearly independent, because if not, one can easily show that β = 0, contrary to the assumption. Let

    [x y] = P [R; 0]

be a QR decomposition of the matrix [x y], where R ∈ IR^{2×2} is an upper triangular matrix and P ∈ IR^{n×n} is orthogonal. Note that R is nonsingular, because [x y] has full rank. Denote

    P^tAP = [T11  T12; T21  T22]

where T11 is a 2 × 2 matrix, T12 is a 2 × (n − 2) matrix, T21 is an (n − 2) × 2 matrix, and
T22 is an (n − 2) × (n − 2) matrix. Now, combining the above equations gives

    [T11  T12; T21  T22] [R; 0] = [R; 0] [α  β; −β  α]

This implies

    T11 R = R [α  β; −β  α]        and        T21 R = 0

Since R is nonsingular, the latter equation implies T21 = 0, and the former equation shows that the matrix T11 is similar to [α β; −β α]. That last matrix has eigenvalues λ1,2 = α ± iβ, hence so does T11, by similarity; therefore it satisfies the requirements of the theorem. Now we can apply the induction assumption to the (n − 2) × (n − 2) matrix T22. 2
In the next theorem we consider a matrix A ∈ IRm×n . It defines a linear transformation
TA : IRn → IRm . Let B be an ONB in IRn and B 0 be an ONB in IRm . Then the
transformation TA is represented in the bases B and B 0 by the matrix U t AV , where U
and V are orthogonal matrices of size m × m and n × n, respectively. The following
theorem shows that one can always find bases B and B 0 so that the matrix U t AV will
be diagonal.
Note: D ∈ IRm×n is said to be diagonal if Dij = 0 for i 6= j. It has exactly p = min{m, n}
diagonal elements and can be denoted by D = diag(d1 , . . . , dp ).
17.8 Theorem (Singular Value Decomposition (SVD))
Let A ∈ IRm×n have rank r and let p = min{m, n}. Then there are orthogonal
matrices U ∈ IRm×m and V ∈ IRn×n and a diagonal matrix D = diag(σ1 , . . . , σp ) such
that

    U^tAV = D

In the diagonal of D, exactly r elements are nonzero. If additionally we require that
σ1 ≥ · · · ≥ σr > 0 and σr+1 = · · · = σp = 0, then the matrix D is unique.
Proof. The matrix At defines a linear transformation IRm → IRn . Recall that the
matrices At A ∈ IRn×n and AAt ∈ IRm×m are symmetric and positive semi-definite. Note
that
Ker A = Ker At A and Ker At = Ker AAt
(indeed, if At Ax = 0, then 0 = hAt Ax, xi = hAx, Axi, so Ax = 0; the second identity is
proved similarly). Hence,
rank At A = rank AAt = rank A = r
Indeed,
rank At A = n − dim Ker At A = n − dim Ker A = rank A = r
and similarly rank AAt = rank At = rank A = r. Next, by 11.21 and the remark
after 12.10, there is an orthogonal matrix V ∈ IRn×n such that At A = V SV t where
S = diag(s1 , . . . , sn ) ∈ IRn×n and si ≥ 0 are the eigenvalues of At A. Note that rank S =
rank At A = r, so exactly r diagonal elements of S are positive, and we can reorder them
(e.g., according to 17.1 and 17.5) so that s1 ≥ · · · ≥ sr > 0 and sr+1 = · · · = sn = 0. Let
v1 , . . . , vn be the columns of V , which are the eigenvectors of At A, according to 17.5, i.e.
A^tAv_i = s_i v_i. Define vectors u1, . . . , um as follows. Let

    u_i = (1/√s_i) Av_i        for 1 ≤ i ≤ r

For r + 1 ≤ i ≤ m, let {u_i} be an arbitrary ONB of the subspace Ker AA^t, whose dimension is m − rank AA^t = m − r. Note that

    AA^t u_i = (1/√s_i) A(A^tA)v_i = √s_i Av_i = s_i u_i
for i = 1, . . . , r, and AAt ui = 0 for i ≥ r + 1. Therefore, u1 , . . . , um are the eigenvectors
of AAt corresponding to the eigenvalues s1 , . . . , sr , 0, . . . , 0. Next, we claim that {ui }
make an ONB in IRm , i.e. hui , uj i = δij . This is true for i, j ≥ r + 1 by our choice of
ui . It is true for i ≤ r < j and j ≤ r < i because then ui , uj are eigenvectors of the
symmetric matrix AAt corresponding to distinct eigenvalues, so they are orthogonal by
11.9 (ii). Lastly, for i, j ≤ r
    hu_i, u_ji = (1/√(s_i s_j)) hAv_i, Av_ji = (1/√(s_i s_j)) hA^tAv_i, v_ji = (s_i/√(s_i s_j)) hv_i, v_ji = δ_ij
Now let U ∈ IR^{m×m} be the matrix whose columns are u1, . . . , um. Then U is orthogonal, and U^t(AA^t)U = S′ ∈ IR^{m×m}, where S′ = diag(s1, . . . , sr, 0, . . . , 0). Let σ_i = √s_i for 1 ≤ i ≤ p and observe that Av_i = σ_i u_i. Consequently, if D ∈ IR^{m×n} is defined by D = diag(σ1, . . . , σp), then AV = UD (the i-th column of the left matrix is Av_i, and that of the right matrix is σ_i u_i). Hence, U^tAV = D as required. It remains to prove the uniqueness of D. Let A = Û D̂ V̂^t be another SVD. Then A^tA = V̂ (D̂^tD̂) V̂^t, so the diagonal elements of D̂^tD̂ are the eigenvalues of A^tA, hence they are s1, . . . , sr, 0, . . . , 0. Since the diagonal elements of D̂ are nonnegative, they are √s1, . . . , √sr, 0, . . . , 0. 2
17.9 Definition (Singular Values and Vectors)
The positive numbers σ1 , . . . , σr are called the singular values of A. The columns
v1 , . . . , vn of the matrix V are called the right singular vectors for A, and the columns
u1 , . . . , um of the matrix U are called the left singular vectors for A.
17.10 Remark
For 1 ≤ i ≤ r we have

    Av_i = σ_i u_i        and        A^t u_i = σ_i v_i

We also have

    Ker A = span{v_{r+1}, . . . , v_n}        and        Ker A^t = span{u_{r+1}, . . . , u_m}
    Im A = span{u_1, . . . , u_r}             and        Im A^t = span{v_1, . . . , v_r}

Illustration to Remark 17.10: A maps v_i to σ_i u_i for i ≤ r and v_i to 0 for i > r, while A^t maps u_i to σ_i v_i for i ≤ r and u_i to 0 for i > r.
17.11 Remark
We have the following SVD expansion:

    A = Σ_{i=1}^r σ_i u_i v_i^t

Proof. It is enough to observe that for every v_j

    (Σ_{i=1}^r σ_i u_i v_i^t) v_j = σ_j u_j = Av_j

because v_i^t v_j = δ_ij. 2
17.12 Remark
The positive numbers σ12 , . . . , σr2 are all non-zero eigenvalues of both At A and AAt .
This can be used as a practical method to compute the singular values σ1 , . . . , σr . The
multiplicity of σi > 0 as a singular value of A equals the (algebraic and geometric) multiplicity of σi2 as an eigenvalue of both At A and AAt .
Much useful information about the matrix A is revealed by the SVD.
17.13 Remark
Assume that m = n, i.e. A is a square matrix. Then kAk2 = kA^tk2 = kDk2 = σ1. If A is nonsingular, then similarly we have kA^{−1}k2 = kD^{−1}k2 = 1/σ_n. Therefore, κ2(A) = κ2(D) = σ1/σ_n.
SVD is helpful in the study of the matrix rank.
17.14 Definition
A matrix A ∈ IRm×n is said to have full rank if its rank equals p = min{m, n}. Otherwise, A is said to be rank deficient.
Note: If A = U DV t is an SVD, then rank A = rank D is the number of positive singular
values. If A has full rank, then all its singular values are positive. Rank deficient matrices
have at least one zero singular value.
17.15 Definition
The 2-norm of a rectangular matrix A ∈ IRm×n is defined by
kAk2 = max kAxk2
kxk2 =1
where kxk2 is the 2-norm in IRn and kAxk2 is the 2-norm in IRm .
Note: kAk2 = kAt k2 = kDk2 = σ1 , generalizing Remark 17.13.
The linear space IRm×n with the distance between matrices given by kA − Bk2 is a metric
space. Then topological notions, like open sets, dense sets, etc., apply.
17.16 Theorem
Full rank matrices make an open and dense subset of IRm×n .
Openness means that for any full rank matrix A there is an ε > 0 such that A + E has
full rank whenever kEk2 ≤ ε. Denseness means that if A is rank deficient, then for any
ε > 0 there is E, kEk2 ≤ ε, such that A + E has full rank.
Proof. To prove denseness, let A be a rank deficient matrix and A = U DV t its SVD.
For ε > 0, put Dε = εI and E = U Dε V t . Then kEk2 = kDε k2 = ε and
rank(A + E) = rank(U (D + Dε )V t ) = rank(D + Dε ) = p
For the openness, see homework assignment (one can use the method of the proof of
17.18 below).
One can see that a slight perturbation of a rank deficient matrix (e.g., caused by round-off errors) can produce a full rank matrix. On the other hand, a full rank matrix may be perturbed by round-off errors and become rank deficient, a very undesirable event. To describe such situations, we introduce the following
17.17 Definition (Numerical Rank)
The numerical rank of A ∈ IR^{m×n} with tolerance ε is

    rank(A, ε) = min_{kEk2 ≤ ε} rank(A + E)

Note: rank(A, ε) ≤ rank A. The numerical rank gives the minimum rank of A under perturbations of norm ≤ ε. If A has full rank p = min{m, n} but rank(A, ε) < p, then A is ‘nearly rank deficient’, which indicates a dangerous situation in numerical calculations.
17.18 Theorem
The numerical rank of a matrix A, i.e. rank(A, ε), equals the number of singular
values of A (counted with multiplicity) that are greater than ε.
Proof. Let A = UDV^t be a SVD and σ1 ≥ · · · ≥ σ_p the singular values of A. Let k be defined so that σ_k > ε ≥ σ_{k+1}. We show that rank(A, ε) = k. Let kEk2 ≤ ε and E′ = U^tEV. Observe that kE′k2 = kEk2 ≤ ε and

    rank(A + E) = rank(U(D + E′)V^t) = rank(D + E′)

Hence,

    rank(A, ε) = min_{kE′k2 ≤ ε} rank(D + E′)

Now, for any x = Σ_{i=1}^k x_i e_i we have

    k(D + E′)xk2 ≥ kDxk2 − kE′xk2 ≥ (σ_k − ε)kxk2

Hence, (D + E′)x ≠ 0 for x ≠ 0, so rank(D + E′) ≥ k. On the other hand, setting E′ = diag(0, . . . , 0, −σ_{k+1}, . . . , −σ_p) implies that kE′k2 ≤ ε and rank(D + E′) = k. 2
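A sketch of the resulting recipe in NumPy: the numerical rank with tolerance ε is simply the number of singular values exceeding ε (the matrix below is made up to be nearly rank deficient).

    import numpy as np

    def numerical_rank(A, eps):
        s = np.linalg.svd(A, compute_uv=False)
        return int(np.sum(s > eps))

    A = np.array([[1.0, 1.0], [1.0, 1.0 + 1e-10]])
    print(numerical_rank(A, 1e-8))    # 1: 'nearly rank deficient'
    print(numerical_rank(A, 1e-12))   # 2: full rank at a tighter tolerance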
17.19 Corollary
For every ε > 0, rank(At , ε) = rank(A, ε).
Proof. A and At have the same singular values. 2
17.20 Theorem (Distance to the nearest singular matrix)
Let A ∈ IR^{n×n} be a nonsingular matrix. Then

    min { kA − A_sk2 / kAk2 : A_s is singular } = 1/κ2(A)
Proof. Follows from 17.13 and 17.18. 2
17.21 Example
Let A = (3, 4)^t, a 2 × 1 matrix. Then A^tA = (25) is a 1 × 1 matrix; it has the only eigenvalue λ = 25 with eigenvector v1 = e1 = (1). Hence, σ1 = √λ = 5. The matrix

    AA^t = [9  12; 12  16]

has eigenvalues λ1 = 25 and λ2 = 0 with corresponding eigenvectors u1 = (3/5, 4/5)^t and u2 = (4/5, −3/5)^t. The SVD takes the form U^tAV = D:

    [3/5  4/5; 4/5  −3/5] (3, 4)^t (1) = (5, 0)^t

Observe that Av1 = 5u1 and A^tu1 = 5v1 in accordance with the SVD theorem.
18
Eigenvalues and eigenvectors: sensitivity
18.1 Definition (Left Eigenvector)
Let A ∈ C^{n×n}. A nonzero vector x ∈ C^n is called a left eigenvector of A corresponding to an eigenvalue λ if

    x∗A = λx∗
Note that this is equivalent to A∗ x = λ̄x, i.e. x being an ordinary (right) eigenvector of
A∗ corresponding to the eigenvalue λ̄.
18.2 Lemma
A matrix A has a left eigenvector corresponding to λ if and only if λ is an eigenvalue
of A (a root of the characteristic polynomial of A).
Proof. x∗ A = λx∗ for an x 6= 0 is equivalent to (A∗ − λ̄I)x = 0, which means that
det(A∗ − λ̄I) = 0, or equivalently, det(A − λI) = 0, i.e. CA (λ) = 0. 2
This explains why we do not introduce a notion of a left eigenvalue: the eigenvalues for
left eigenvectors and those for ordinary (right) eigenvectors are the same.
18.3 Lemma
For any eigenvalue λ of A the dimension of the ordinary (right) eigenspace equals the
dimension of the left eigenspace (i.e., the geometric multiplicity of λ is the same).
Proof. dim Ker(A − λI) = n − rank(A − λI) = n − rank(A∗ − λ̄I) = dim Ker(A∗ − λ̄I). 2
18.4 Definition (Rayleigh Quotient)
Let A ∈ C^{n×n} and x ∈ C^n a nonzero vector. We call

    x∗Ax / x∗x = hAx, xi / hx, xi

the Rayleigh quotient for A. One can consider it as a function of x.
Note: The Rayleigh quotient takes the same value for all scalar multiples of a given vector
x, i.e. it is constant on the line spanned by x (with the zero vector removed). Recall that
any nonzero vector is a scalar multiple of a unit vector. Hence, Rayleigh quotient as a
function of x is completely defined by its values on the unit vectors (on the unit sphere).
For unit vectors x, the Rayleigh quotient can be simply defined by
x∗ Ax = hAx, xi
18.5 Theorem
Let A ∈ C^{n×n} and x a unit vector in C^n. Then

    kAx − (x∗Ax)xk2 = min_{µ∈C} kAx − µxk2

That is, the vector (x∗Ax)x is the orthogonal projection of the vector Ax onto the line spanned by x.
Proof. From the homework problem #1 in the first assignment we know that
    min_{µ∈C} kAx − µxk2

is attained when Ax − µx is orthogonal to x. This gives the value µ = x∗Ax. Alternatively, consider an overdetermined (n × 1) system xµ = Ax where µ is (the only)
unknown. Then the result follows from Theorem 16.9. 2
Note: If x is a unit eigenvector for an eigenvalue λ, then x∗ Ax = λ. Suppose that x is not
an eigenvector. If one wants to regard x as an eigenvector, then the Rayleigh quotient
for x is the best choice that one could make for the associated eigenvalue in the sense
that this value of µ comes closest (in the 2-norm) to achieving Ax − µx = 0. So, one
could regard x∗ Ax as a “quasi-eigenvalue” for x.
18.6 Theorem
Let A ∈ C^{n×n} and x a unit eigenvector of A corresponding to an eigenvalue λ. Let y be another unit vector and ρ = y∗Ay. Then

    |λ − ρ| ≤ 2 kAk2 kx − yk2

Moreover, if A is a Hermitean matrix, then there is a constant C = C(A) > 0 such that

    |λ − ρ| ≤ C kx − yk2²
Proof. To prove the first part, put
λ − ρ = x∗ A(x − y) + (x − y)∗ Ay
and then using the triangle inequality and Cauchy-Schwarz inequality gives the result.
Now, assume that A is Hermitean. Then there is an ONB of eigenvectors, and we can
assume that x is one of them. Denote that ONB by {x, x2 , . . . , xn } and the corresponding
eigenvalues by λ, λ2 , . . . , λn . Let y = cx + c2 x2 + · · · + cn xn . Then
    ky − xk² = |c − 1|² + Σ_{i=2}^n |c_i|² ≥ Σ_{i=2}^n |c_i|²

On the other hand, kyk = 1, so

    λ = λ|c|² + Σ_{i=2}^n λ|c_i|²

Now, Ay = cλx + Σ_{i=2}^n c_i λ_i x_i, so

    ρ = hAy, yi = λ|c|² + Σ_{i=2}^n λ_i |c_i|²

Therefore,

    λ − ρ = Σ_{i=2}^n (λ − λ_i)|c_i|²

The result now follows with

    C = max_{2≤i≤n} |λ − λ_i|
The theorem is proved. 2.
Note: The Rayleigh quotient function x∗ Ax is obviously continuous on the unit sphere,
because it is a polynomial of the coordinates of x. Since at every eigenvector x the
value of this function equals the eigenvalue λ for x, then for y close to x its values are
close to λ. Suppose that ky − xk = ε. The first part of the theorem gives a specific
estimate on how close y ∗ Ay to λ is, the difference between them is O(ε). The second
part says that for Hermitean matrices y ∗ Ay is very close to λ, the difference is now O(ε2 ).
Note: If A is Hermitean, then x∗Ax is a real number for any vector x ∈ C^n, i.e. the Rayleigh quotient is a real-valued function.
18.7 Theorem
Let A ∈ C^{n×n} be Hermitean with eigenvalues λ1 ≤ · · · ≤ λn. Then for any kxk = 1

    λ1 ≤ x∗Ax ≤ λn
Proof. Let {x1 , . . . , xn } be an ONB consisting of eigenvectors of A. Let x = c1 x1 +
· · · + cn xn . Then x∗ Ax = λ1 |c1 |2 + · · · + λn |cn |2 . The result now follows easily. 2
18.8 Lemma
Let L and G be subspaces of C^n with dim G > dim L. Then there is a nonzero vector in G orthogonal to L.
Proof. By way of contradiction, if G ∩ L⊥ = {0}, then G ⊕ L⊥ is a subspace of C^n with dimension dim G + n − dim L, which is > n, which is impossible. 2
18.9 Theorem (Courant-Fisher Minimax Theorem)
Let A ∈ C^{n×n} be Hermitean with eigenvalues λ1 ≤ · · · ≤ λn. Then for every i = 1, . . . , n

    λ_i = min_{dim L = i}  max_{x ∈ L\{0}}  x∗Ax / x∗x

where L stands for a subspace of C^n.
Proof. Let {u1, . . . , un} be an ONB of eigenvectors of A corresponding to the eigenvalues λ1, . . . , λn. If dim L = i, then by Lemma 18.8 there is a nonzero vector x ∈ L orthogonal to the space span{u1, . . . , u_{i−1}}. Hence, the first i − 1 coordinates of x are zero and x = Σ_{j=i}^n c_j u_j, so

    x∗Ax / x∗x = (Σ_{j=i}^n λ_j |c_j|²) / (Σ_{j=i}^n |c_j|²) ≥ λ_i

Therefore,

    max_{x ∈ L\{0}} x∗Ax / x∗x ≥ λ_i

Now, take the subspace L = span{u1, . . . , u_i}. Obviously, dim L = i and for every nonzero vector x ∈ L we have x = Σ_{j=1}^i c_j u_j, so

    x∗Ax / x∗x = (Σ_{j=1}^i λ_j |c_j|²) / (Σ_{j=1}^i |c_j|²) ≤ λ_i

The theorem is proved. 2
18.10 Theorem
Let A and ∆A be Hermitean matrices. Let α1 ≤ · · · ≤ αn be the eigenvalues of A,
δmin and δmax the smallest and the largest eigenvalues of ∆A. Denote the eigenvalues of
the matrix B = A + ∆A by β1 ≤ · · · ≤ βn . Then for each i = 1, . . . , n
αi + δmin ≤ βi ≤ αi + δmax
Proof. Let {u1 , . . . , un } be an ONB consisting of eigenvectors of A corresponding to
the eigenvalues α1 , . . . , αn . Let L = span{u1 , . . . , ui }. Then, by 18.9,
    β_i ≤ max_{x∈L\{0}} x∗Bx / x∗x
        ≤ max_{x∈L\{0}} x∗Ax / x∗x + max_{x∈L\{0}} x∗∆Ax / x∗x
        ≤ α_i + max_{x∈C^n\{0}} x∗∆Ax / x∗x
        = α_i + δ_max

which is the right inequality. Now apply this theorem to the matrices B, −∆A and A = B + (−∆A). Then its right inequality, just proved, will read α_i ≤ β_i − δ_min (note that the largest eigenvalue of −∆A is −δ_min), which is the left inequality. The theorem is completely proved. 2
Theorem 18.10 shows that if we perturb a Hermitean matrix A by a small Hermitean matrix ∆A, the eigenvalues of A will not change much. This is clearly related to numerical calculations of eigenvalues.
18.11 Remark.
A similar situation occurs when one knows an approximate eigenvalue λ and an
approximate eigenvector x of a matrix A. We can assume that kxk = 1. To estimate
the closeness of λ to the actual but unknown eigenvalue of A, one can compute the
residual r = Ax − λx. Assume that r is small and define the matrix ∆A = −rx∗ . Then
k∆Ak2 = krk2 (see the homework assignment) and
(A + ∆A)x = Ax − rx∗ x = λx
Therefore, (λ, x) are an exact eigenpair of a perturbed matrix A + ∆A, and the norm
k∆Ak is known. One could then apply Theorem 18.10 to estimate the closeness of λ
to the actual eigenvalue of A, if the matrices A and ∆A were Hermitean. Since this is
not always the case, we need to study how eigenvalues of a generic matrix change under
small perturbations of the matrix. This is the issue of eigenvalue sensitivity.
18.12 Theorem (Bauer-Fike)
Let A ∈ C^{n×n} and suppose that

    Q^{−1}AQ = D = diag(λ1, . . . , λn)

If µ is an eigenvalue of a perturbed matrix A + ∆A, then

    min_{1≤i≤n} |λ_i − µ| ≤ κ_p(Q) k∆Ak_p

where k · k_p stands for any p-norm.
Proof. If µ is an eigenvalue of A, the claim is trivial. If not, the matrix D − µI is
invertible. Observe that
Q−1 (A + ∆A − µI)Q = D + Q−1 ∆AQ − µI
= (D − µI)(I + (D − µI)−1 (Q−1 ∆AQ))
Since the matrix A + ∆A − µI is singular, so is the matrix I + (D − µI)−1 (Q−1 ∆AQ).
Then the Neumann lemma implies
1 ≤ k(D − µI)−1 (Q−1 ∆AQ)kp ≤ k(D − µI)−1 kp kQ−1 kp k∆Akp kQkp
Lastly, observe that (D − µI)^{−1} is diagonal, so

    k(D − µI)^{−1}k_p = max_{1≤i≤n} 1/|λ_i − µ| = 1 / min_{1≤i≤n} |λ_i − µ|

The theorem now follows. 2
This theorem answers the question raised in 18.11, it gives an estimate on the error in
the eigenvalue in terms of k∆Ak and κ(Q). However, this answer is not good enough –
it gives one estimate for all eigenvalues. In practice, some eigenvalues can be estimated
much better than others. It is important then to develop finer estimates for individual
eigenvalues.
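A small check of the Bauer-Fike bound, assuming NumPy; the matrix and the perturbation are made up, and Q is the eigenvector matrix returned by np.linalg.eig.

    import numpy as np

    rng = np.random.default_rng(0)
    A = np.array([[2.0, 1.0], [0.0, 5.0]])
    lam, Q = np.linalg.eig(A)                        # A = Q D Q^{-1}

    dA = 1e-3 * rng.standard_normal((2, 2))
    bound = np.linalg.cond(Q) * np.linalg.norm(dA, 2)

    for mu in np.linalg.eigvals(A + dA):
        # each perturbed eigenvalue stays within the bound of some lambda_i
        print(np.min(np.abs(lam - mu)) <= bound + 1e-15)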
18.13 Lemma
Let A ∈ C^{n×n}.
(i) If λ is an eigenvalue with a right eigenvector x, and µ ≠ λ is another eigenvalue with a left eigenvector y, then y∗x = 0.
(ii) If λ is a simple eigenvalue (of algebraic multiplicity one) with right and left eigenvectors x and y, respectively, then y∗x ≠ 0.
Proof. To prove (i), observe that hAx, yi = λhx, yi and, by a remark after 18.1,
hx, A∗ yi = hx, µ̄yi = µhx, yi. Hence, λhx, yi = µhx, yi, which proves (i), since λ 6= µ.
To prove (ii), assume that kxk = 1. By the Schur decomposition theorem, there is a unitary matrix R with first column x such that

    R∗AR = [λ  h∗; 0  C]

with some h ∈ C^{n−1} and C ∈ C^{(n−1)×(n−1)}. Note also that Re1 = x. Since λ is a simple eigenvalue of A, it is not an eigenvalue of C, so the matrix λI − C is invertible, and so is λ̄I − C∗. Let z = (λ̄I − C∗)^{−1}h. Then λ̄z − C∗z = h, hence

    h∗ + z∗C = λz∗
One can immediately verify that
(1, z ∗ )R∗ AR = λ(1, z ∗ )
Put w∗ = (1, z ∗ )R∗ . The above equation can be rewritten as
w∗ A = λw∗
Hence w is a left eigenvector of A. By the simplicity of λ, the vector w is a nonzero
multiple of y. However, observe that
w∗ x = (1, z ∗ )R∗ Re1 = 1
which proves the lemma. 2
18.14 Theorem
Let A ∈ C^{n×n} have a simple eigenvalue λ with right and left unit eigenvectors x and y, respectively. Let E ∈ C^{n×n} be such that kEk2 = 1. For small ε, denote by λ(ε) and x(ε), y(ε) the eigenvalue and unit eigenvectors of the matrix A + εE obtained from λ and x, y. Then

    |λ′(0)| ≤ 1/|y∗x|
Proof. Write the equation
(A + εE) x(ε) = λ(ε) x(ε)
and differentiate it in ε, set ε = 0, and get
Ax0 (0) + Ex = λ0 (0)x + λ x0 (0)
Then multiply this equation through on the left by the vector y ∗ , use the fact that
y ∗ A = λy ∗ and get
y ∗ Ex = λ0 (0) y ∗ x
Now the result follows since |y ∗ Ex| = |hy, Exi| ≤ kyk kExk ≤ 1. Note that y ∗ x 6= 0 by
18.13. 2
Note: The matrix A + εE is the perturbation of A in the direction of E. If the perturbation matrix E is known, one has exactly

    λ′(0) = y∗Ex / y∗x

and so

    λ(ε) = λ + (y∗Ex / y∗x) ε + O(ε²)

by Taylor expansion, a fairly precise estimate on λ(ε). In practice, however, the matrix E is absolutely unknown, so one has to use the bound in 18.14 to estimate the sensitivity of λ to small perturbations of A.
18.15 Definition (Condition Number of An Eigenvalue)
Let λ be a simple eigenvalue (of algebraic multiplicity one) of a matrix A ∈ C^{n×n} and x, y the corresponding right and left unit eigenvectors. The condition number of λ is

    K(λ) = 1/|y∗x|
The condition number K(λ) describes the sensitivity of a (simple) eigenvalue to small
perturbations of the matrix. Large K(λ) signifies an ill-conditioned eigenvalue.
18.16 Simple properties of K(λ)
(a) Obviously, K(λ) ≥ 1.
(b) If a matrix A is normal, then K(λ) = 1 for all its eigenvalues.
(c) Conversely, if a matrix A has all simple eigenvalues with K(λ) = 1, then it is normal.
Normal matrices are characterized by the fact that the Schur decomposition Q∗AQ = T results in a diagonal matrix T. One can expect that if the matrix T is nearly diagonal (its off-diagonal elements are small), then the eigenvalues of A are well-conditioned. On the contrary, if some off-diagonal elements of T are substantial, then at least some eigenvalues of A are ill-conditioned.
18.17 Remark
It remains to discuss the case of multiple eigenvalues (of algebraic multiplicity ≥ 2). If λ is a multiple eigenvalue, the left and right eigenvectors may be orthogonal even if the geometric multiplicity of λ equals one. Example: A = [0 1; 0 0], where the right and left eigenvectors are x = e1 and y = e2, respectively. Moreover, if the geometric multiplicity is ≥ 2, then for any right eigenvector x there is a left eigenvector y such that y∗x = 0. Hence, the definition 18.15 gives an infinite value of K(λ).
This does not necessarily mean that a multiple eigenvalue is always ill-conditioned. It does mean, however, that an ill-conditioned simple eigenvalue is ‘nearly multiple’. Precisely, if λ is a simple eigenvalue of A with K(λ) > 1, then there is a matrix E such that

    kEk2 / kAk2 ≤ 1/√(K(λ)² − 1)

and λ is a multiple eigenvalue of A + E. We leave out the proof. We will not further discuss the sensitivity of multiple eigenvalues.
18.18 Theorem (Gershgorin)
Let A ∈ C^{n×n} be ‘almost diagonal’. Precisely, let A = D + E, where D = diag(d1, . . . , dn) and E = (e_ij) is small. Then every eigenvalue of A lies in at least one of the circular disks

    D_i = {z : |z − d_i| ≤ Σ_{j=1}^n |e_ij|}

Note: the D_i are called Gershgorin disks.
Proof. Let λ be an eigenvalue of A with eigenvector x. Let |xr | = maxi {|x1 |, . . . , |xn |}
be the maximal (in absolute value) component of x. We can normalize x so that xr = 1
and |x_i| ≤ 1 for all i. On equating the r-th components in Ax = λx we obtain

    (Ax)_r = d_r x_r + Σ_{j=1}^n e_rj x_j = d_r + Σ_{j=1}^n e_rj x_j = λ x_r = λ

Hence

    |λ − d_r| ≤ Σ_{j=1}^n |e_rj| |x_j| ≤ Σ_{j=1}^n |e_rj|

The theorem is proved. 2
18.19 Theorem
If k of the Gershgorin disks Di form a connected region which is isolated from the
other disks, then there are precisely k eigenvalues of A (counting multiplicity) in this
connected region.
We call the connected components of the union ∪D_i clusters.
Proof. For brevity, denote

    h_i = Σ_{j=1}^n |e_ij|
Consider a family of matrices A(s) = D + sE for 0 ≤ s ≤ 1. Clearly, the eigenvalues
of A(s) depend continuously on s. The Gershgorin disks Di (s) for the matrix A(s) are
centered at di and have radii shi . As s increases, the disks Di (s) grow concentrically,
until they reach the size of the Gershgorin disks Di of Theorem 18.18 at s = 1. When
s = 0, each Gershgorin disk Di (0) is just a point, di , which is an eigenvalue of the matrix A(0) = D. So, if di is an eigenvalue of multiplicity m ≥ 1, then exactly m disks
Dj (0) coincide with the point di , which make a cluster of m disks containing m eigenvalues (more precisely, one eigenvalue of multiplicity m). As the disks grow with s, the
eigenvalues cannot jump from one cluster to another (by continuity), unless two cluster
overlap and then make one cluster. When two clusters overlap (merge) at some s, they
will be in one cluster for all larger values of s, including s = 1. This proves the theorem. 2
Note: If the Gershgorin disks are disjoint, then each contains exactly one eigenvalue of
A.
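A sketch of the Gershgorin statement in NumPy, for a made-up ‘almost diagonal’ matrix: every computed eigenvalue indeed lies within radius r_i of some diagonal entry d_i.

    import numpy as np

    A = np.array([[ 4.0, 0.1,  0.0],
                  [ 0.2, 1.0,  0.1],
                  [ 0.0, 0.1, -2.0]])
    d = np.diag(A)
    radii = np.sum(np.abs(A - np.diag(d)), axis=1)   # r_i = sum_j |e_ij|

    for lam in np.linalg.eigvals(A):
        print(lam, np.any(np.abs(lam - d) <= radii))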
19
Eigenvalues and eigenvectors: computation
19.1 Preface
Eigenvalues of a matrix A ∈ C^{n×n} are the roots of its characteristic polynomial CA(x). It is a consequence of the famous Galois group theory (Abel’s theorem) that
CA (x). It is a consequence of the famous Galois group theory (Abel’s theorem) that
there is no finite algorithm for calculation of the roots of a generic polynomial of degree
> 4. Therefore, all the methods of computing eigenvalues of matrices larger than 4 × 4
are necessarily iterative, they only provide successive approximations to the eigenvalues,
but never exact formulas.
For this reason, iterative methods may not look very nice as compared to other
exact methods covered in this course. Some of them may not even look reasonable or
logically consistent at first, but we need to understand that they have been obtained
by numerical analysts after years and years of experimentation. And these methods can
work surprisingly well, in fact even better than some finite and “exact” algorithms.
If an eigenvalue λ of a matrix A is known, an eigenvector x can be found by solving
the linear system (A − λI)x = 0, which can be done by a finite algorithm (say, LU decomposition). But since the eigenvalues can only be obtained approximately, by iterative
procedures, the same goes for eigenvectors. It is then reasonable to define a procedure
that gives approximations for both eigenvalues and eigenvectors, in parallel. Also, knowing an approximate eigenvector x one can approximate the corresponding eigenvalue λ
by the Rayleigh quotient x∗ Ax/x∗ x.
To simplify the matter, we always assume that the matrix A is diagonalizable, i.e.
it has a complete set of eigenvectors x1 , . . . , xn with eigenvalues λ1 , . . . , λn , which are
ordered in absolute value:
|λ1 | ≥ |λ2 | ≥ · · · ≥ |λn |
19.2 Definition (Dominant Eigenvalue/Eigenvector)
Assume that |λ1 | > |λ2 |, i.e. largest eigenvalue is simple. We call λ1 the dominant
eigenvalue and x1 a dominant eigenvector.
19.3 Power method: the idea
Let λ1 be the dominant eigenvalue of A and

    q = c1 x1 + · · · + cn xn

an arbitrary vector such that c1 ≠ 0. Then

    A^k q = c1 λ1^k x1 + · · · + cn λn^k xn
          = λ1^k [ c1 x1 + c2 (λ2/λ1)^k x2 + · · · + cn (λn/λ1)^k xn ]

Denote

    q^(k) = A^k q / λ1^k = c1 x1 + ∆_k,        where ∆_k = c2 (λ2/λ1)^k x2 + · · · + cn (λn/λ1)^k xn
19.4 Lemma
The vector q^(k) converges to c1 x1. Moreover,

    k∆_kk = kq^(k) − c1 x1k ≤ const · r^k

where r = |λ2/λ1| < 1.
Therefore, the vectors Ak q (obtained by the powers of A) will align in the direction of the
dominant eigenvector x1 as k → ∞. The number r characterizes the speed of alignment,
i.e. the speed of convergence k∆k k → 0. Note that if c2 6= 0, then
kq (k+1) − c1 x1 k/kq (k) − c1 x1 k → r
The number r is called the convergence ratio or the contraction number.
19.5 Definition (Linear/Quadratic Convergence)
We say that the convergence ak → a is linear if
|ak+1 − a| ≤ r|ak − a|
for some r < 1 and all sufficiently large k. If
|ak+1 − a| ≤ C|ak − a|2
with some C > 0, then the convergence is said to be quadratic. It is much faster than
linear. If
|ak+1 − a| ≤ C|ak − a|3
with some C > 0, then the convergence is said to be cubic. It is even faster than quadratic.
19.6 Remarks on Convergence
If the convergence is linear, then by induction we have |ak − a| ≤ Crk , where C =
|a0 − a|. The sequence Crk decreases to zero exponentially fast, which is very fast by
calculus standards. However, in numerical algebra standards are different.
The linear convergence practically means that each iteration adds a fixed number
of accurate digits to the result. For example, if r = 0.1, then each iteration adds one
correct decimal digit. This is not superb, since it will take 6–7 iterations to reach the
maximum accuracy in single precision arithmetic and 15–16 iterations in double precision.
If r = 0.5, then each iteration adds one binary digit (a bit), hence one needs 22–23
iterations in single precision and 52–53 in double precision. Now imagine how long it
might take when r = 0.9 or r = 0.99.
On the other hand, the quadratic convergence means that each iteration doubles (!)
the number of digits of accuracy. Starting with just one accurate binary digit, one needs
4–5 iterations in single precision arithmetic and only one more iteration in double precision. The cubic convergence means that each iteration triples (!!!) the number of digits
of accuracy. See Example 19.16 for an illustration.
19.7 Remark on Power Method
In practice, the vector q (k) = Ak q/λk1 is inaccessible because we do not know λ1 in
advance. On the other hand, it is impractical to work with Ak q, because kAk qk → ∞ if
|λ1 | > 1 and kAk qk → 0 if |λ1 | < 1. In order to avoid the danger of overflow or underflow,
we must somehow normalize, or scale, the vector Ak q.
19.8 Power Method: Algorithm
Pick an initial vector q0. For k ≥ 1, define

    q_k = Aq_{k−1}/σ_k

where σ_k is a properly chosen scaling factor. A common choice is σ_k = kAq_{k−1}k, so that kq_kk = 1. Then one can approximate the eigenvalue λ1 by the Rayleigh quotient

    λ1^(k) = q_k∗ A q_k

Note that

    q_k = A^k q0 / (σ1 · · · σ_k) = A^k q0 / kA^k q0k

To estimate how close the unit vector q_k is to the one-dimensional eigenspace span{x1}, denote by p_k the orthogonal projection of q_k onto span{x1} and by d_k = q_k − p_k the orthogonal component. Then kd_kk measures the distance from q_k to span{x1}.
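A sketch of this algorithm in NumPy (real symmetric example, made up): the iterate is rescaled to unit length at every step and the Rayleigh quotient supplies the eigenvalue estimate.

    import numpy as np

    def power_method(A, q0, iters=50):
        q = q0 / np.linalg.norm(q0)
        for _ in range(iters):
            z = A @ q
            q = z / np.linalg.norm(z)        # sigma_k = ||A q_{k-1}||
        return q @ A @ q, q                  # Rayleigh quotient, approximate eigenvector

    A = np.array([[3.0, 2.0], [2.0, 1.0]])
    lam, q = power_method(A, np.array([1.0, 1.0]))
    print(lam)                               # approx 2 + sqrt(5), the dominant eigenvalue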
19.9 Theorem (Convergence of Power Method)
Assume that λ1 is the dominant eigenvalue, and q0 = Σ c_i x_i is chosen so that c1 ≠ 0. Then the distance from q_k to the eigenspace span{x1} converges to zero and λ1^(k) converges to λ1. Furthermore,

    kd_kk ≤ const · r^k        and        |λ1^(k) − λ1| ≤ const · r^k
Note: The sequence of unit vectors qk need not have a limit, see examples.
Proof. It is a direct calculation, based on the representation Ak q0 = λk1 (c1 x1 + ∆k ) of
19.3 and Lemma 19.4. 2
19.10 Remark
Another popular choice for σ_k in 19.8 is the largest (in absolute value) component of the vector Aq_{k−1}. This ensures that kq_kk∞ = 1. Assume that the dominant eigenvector x1 has exactly one component with the largest absolute value. In that case Theorem 19.9 applies, and
moreover, σk → λ1 so that
|σk − λ1 | ≤ const · rk
Furthermore, the vector qk will now converge to a dominant eigenvector. The largest
component of that dominant eigenvector will be equal to one, and this condition specifies
it uniquely.
19.11 Examples
(a) Let A = [3 2; 1 1]. Pick q0 = (1, 1)^t and choose σ_k as in 19.10. Then σ1 = 5 and q1 = (1, 0.4)^t, σ2 = 3.8 and q2 = (1, 0.368)^t, σ3 = 3.736, etc. Here σ_k converges to the dominant eigenvalue λ1 = 2 + √3 = 3.732 and q_k converges to a dominant eigenvector (1, √3/2 − 1/2)^t = (1, 0.366)^t.
(b) Let A = [−1 0; 0 0]. Pick q0 = (1, 1)^t and choose σ_k as in 19.8. Then q_k = ((−1)^k, 0) does not have a limit; it oscillates. With σ_k chosen as in 19.10, we have q_k = (1, 0) and σ_k = −1 = λ1 for all k ≥ 1.
19.12 Remark
The choice of the initial vector q0 only has to fulfill the requirement c1 ≠ 0. Since the vectors with c1 = 0 form a hyperplane in C^n, one hopes that a vector q0 picked “at random” will not lie in that hyperplane “with probability one”. Furthermore, even if c1 = 0, round-off errors will most likely pull the numerical vectors q_k away from that hyperplane. If that does not seem to be enough, one can carry out the power method for n different initial vectors that make a basis, say e1, . . . , en. One of these vectors surely lies away from that hyperplane.
19.13 Inverse Power Method
Assume that A is invertible. Then λ1^{−1}, . . . , λn^{−1} are the eigenvalues of A^{−1}, with the same eigenvectors x1, . . . , xn. Note that |λ1^{−1}| ≤ · · · ≤ |λn^{−1}|.
Assume that |λn^{−1}| > |λ_{n−1}^{−1}|. Then λn^{−1} is the dominant eigenvalue of A^{−1} and xn a dominant eigenvector. One can apply the power method to A^{−1} and find λn^{−1} and xn. The rate of convergence of the iterations will be characterized by the ratio r = |λn/λ_{n−1}|. This is called the inverse power method.
Now we know how to compute the largest and the smallest eigenvalues. The following
trick allows us to compute any simple eigenvalue.
19.14 Inverse Power Method with Shift.
Recall that if λ is an eigenvalue of A with eigenvector x, then λ − ρ is an eigenvalue
of A − ρI with the same eigenvector x.
Assume that ρ is a good approximation to a simple eigenvalue λi of A, so that |λi −ρ| <
|λj − ρ| for all j 6= i. Then the matrix A − ρI will have the smallest eigenvalue λi − ρ
with the eigenvector xi .
The inverse power method can now be applied to A − ρI to find λi − ρ and xi . The
convergence of iterations will be linear with ratio
    r = |λ_i − ρ| / min_{j≠i} |λ_j − ρ|
Hence, the better ρ approximates λi , the faster convergence is guaranteed.
By subtracting ρ from all the eigenvalues of A we shift the entire spectrum of A by ρ.
The number ρ is called the shift. The above algorithm for computing λi and xi is called
the inverse power method with shift.
The inverse power method with shift allows us to compute all the simple eigenvalues
and eigenvectors of a matrix, but the convergence is slow (just linear).
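A sketch of the inverse power method with a fixed shift (NumPy); following the note in 19.15 below, the inverse is never formed explicitly, one solves a linear system at each step instead.

    import numpy as np

    def inverse_power_shift(A, rho, q0, iters=20):
        n = A.shape[0]
        q = q0 / np.linalg.norm(q0)
        for _ in range(iters):
            z = np.linalg.solve(A - rho * np.eye(n), q)   # (A - rho I) z = q
            q = z / np.linalg.norm(z)
        return q @ A @ q, q                               # Rayleigh quotient, eigenvector

    A = np.array([[2.0, 1.0, 1.0], [1.0, 3.0, 1.0], [1.0, 1.0, 4.0]])
    lam, q = inverse_power_shift(A, rho=5.0, q0=np.ones(3))
    print(lam)          # the eigenvalue of A closest to the shift 5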
19.15 Rayleigh Quotient Iterations with Shift
This is an improvement of the algorithm 19.14. Since at each iteration of the inverse
power method we obtain a better approximation to the eigenvalue λi , we can use it as
the shift ρ for the next iteration. So, the shift ρ will be updated at each iteration. This
will ensure a faster convergence. The algorithm goes as follows: one chooses an initial
vector q0 and an initial approximation ρ0 , and for k ≥ 1 computes
    q_k = (A − ρ_{k−1}I)^{−1} q_{k−1} / σ_k        and        ρ_k = q_k∗ A q_k / (q_k∗ q_k)

where σ_k is a convenient scaling factor, for example, σ_k = k(A − ρ_{k−1}I)^{−1} q_{k−1}k.
Note: in practice, there is no need to compute the inverse matrix (A − ρk−1 I)−1
explicitly. To find (A − ρk−1 I)−1 b for any given vector b, one can just solve the system
(A − ρk−1 I)x = b by the LU decomposition.
The convergence of the Rayleigh quotient iterations is, generally, quadratic (better
than linear). If the matrix A is Hermitean, the convergence is even faster – it is cubic.
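A sketch of the Rayleigh quotient iteration in NumPy (symmetric example, the same matrix, starting vector and initial shift as in 19.16 below). Near convergence A − ρI becomes nearly singular, which is expected and harmless here, cf. 19.17 b).

    import numpy as np

    def rayleigh_quotient_iteration(A, q0, rho0, iters=5):
        n = A.shape[0]
        q = q0 / np.linalg.norm(q0)
        rho = rho0
        for _ in range(iters):
            z = np.linalg.solve(A - rho * np.eye(n), q)
            q = z / np.linalg.norm(z)
            rho = q @ A @ q                  # updated Rayleigh quotient shift
        return rho, q

    A = np.array([[2.0, 1.0, 1.0], [1.0, 3.0, 1.0], [1.0, 1.0, 4.0]])
    rho, q = rayleigh_quotient_iteration(A, np.ones(3) / np.sqrt(3), rho0=5.0)
    print(rho)          # approx 5.214319743377, cf. Example 19.16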
19.16 Example
Consider the symmetric matrix

    A = [2 1 1; 1 3 1; 1 1 4]

and let q0 = (1, 1, 1)^t/√3 be the initial vector. When the Rayleigh quotient iteration is applied to A, the following values ρ_k are computed by the first three iterations:

    ρ0 = 5,        ρ1 = 5.2131,        ρ2 = 5.214319743184
The actual value is λ = 5.214319743377. After only three iterations, Rayleigh quotient
method produced 10 accurate digits. The next iteration would bring about 30 accurate
digits – more than enough in double precision. Two more iterations would increase this
figure to about 270 digits, if our machine precision were high enough.
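Using the rayleigh_quotient_iteration sketch from 19.15 (our hypothetical helper), this
example can be reproduced along the following lines; the printed shifts should approach
λ ≈ 5.2143.

    import numpy as np

    A = np.array([[2.0, 1.0, 1.0],
                  [1.0, 3.0, 1.0],
                  [1.0, 1.0, 4.0]])
    q0 = np.ones(3) / np.sqrt(3.0)
    rho0 = q0 @ A @ q0              # Rayleigh quotient of q0; equals 5 for this A

    rho, q, history = rayleigh_quotient_iteration(A, q0, rho0)
    for k, r in enumerate(history):
        print(k, r)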
19.17 Power Method: Pros and Cons
a) The power method and its variations described above are classic. They are very
simple and generally good.
b) An obvious concern is the numerical stability of the method. The matrices used in
the inverse power method with shift tend to be exceedingly ill-conditioned. As a result,
the numerical solution of the system (A − ρI)x = b, call it xc , will deviate significantly
from its exact solution x. However, for some peculiar reason (we do not elaborate) the
difference xc − x tends to align with the vector x. Therefore, the normalized vectors
xc /kxc k and x/kxk are close to each other. Hence, ill conditioning of the matrices does
not cause a serious problem here.
c) On the other hand, the power method is slow. Even its fastest version, the Rayleigh
quotient iterations with shift 19.15, is not so fast, because it requires solving a linear
system of equations (A − ρk−1 I)x = b with a different matrix at each iteration for each
eigenvalue (a quite expensive procedure).
d) Lastly, when the matrix A is real, then in order to compute its complex eigenvalues
one has to deal with complex matrices (A − ρk−1 I), which is inconvenient and expensive.
It would be nice to stick to real matrices as much as possible and obtain pairs of complex
conjugate eigenvalues only at final steps. Some algorithms provide such a luxury, see for
example 19.27 below.
In recent decades, the most widely used algorithm for calculating the complete set of
eigenvalues and eigenvectors of a matrix has been the QR algorithm.
19.18 The QR Algorithm
Let A ∈ C^{n×n} be nonsingular. The algorithm starts with A0 = A and generates a
sequence of matrices Ak defined as follows:

Ak−1 = Qk Rk
Rk Qk = Ak

That is, a QR factorization of Ak−1 is computed and then its factors are recombined in
the reverse order to produce Ak. One iteration of the QR algorithm is called a QR step.
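A bare-bones sketch of these QR steps, using numpy's built-in QR factorization (the
function name and the fixed number of steps are our own choices):

    import numpy as np

    def qr_algorithm(A, num_steps=100):
        # Plain (unshifted) QR steps: A_{k-1} = Q_k R_k, then A_k = R_k Q_k.
        Ak = A.astype(complex)
        for _ in range(num_steps):
            Q, R = np.linalg.qr(Ak)   # QR factorization of A_{k-1}
            Ak = R @ Q                # recombine the factors in reverse order
        return Ak                     # tends to upper triangular form (Theorem 19.20)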
19.19 Lemma
All matrices Ak in the QR algorithm are unitarily equivalent; in particular, they have
the same eigenvalues.
Proof. Indeed, Ak = Rk Qk = Qk∗ (Qk Rk) Qk = Qk∗ Ak−1 Qk. 2
Note: if A is a Hermitean matrix, then all Ak's are Hermitean matrices as well, since
by induction Ak∗ = Qk∗ (Ak−1)∗ Qk = Qk∗ Ak−1 Qk = Ak.
19.20 Theorem (Convergence of the QR Algorithm)
Let λ1, . . . , λn be the eigenvalues of A satisfying

|λ1| > |λ2| > · · · > |λn| > 0

Under one technical assumption (see below), the matrix Ak = (a^{(k)}_{ij}) will converge to an
upper triangular form, in the sense that
(a) a^{(k)}_{ij} → 0 as k → ∞ for all i > j;
(b) a^{(k)}_{ii} → λi as k → ∞ for all i.
The convergence is linear, with ratio max_i |λi+1/λi|.
This theorem is given without proof. The technical assumption here is that the matrix Y,
whose i-th row is a left eigenvector of A corresponding to λi for each i, must have
an LU decomposition (i.e. all its leading principal minors are nonzero).
19.21 Remarks
(a) If A is Hermitean, then Ak will converge to a diagonal matrix, which is sweet.
(b) All the matrices Ak (and Rk) involved in the QR algorithm have the same
2-condition number (by 15.6), thus the QR algorithm is numerically stable. This is also
true for all the variations of the QR method discussed below.
(c) Unfortunately, the theorem does not apply to real matrices with complex eigenvalues.
Indeed, complex eigenvalues of a real matrix come in conjugate pairs a ± ib,
which have equal absolute values |a + ib| = |a − ib|, violating the main assumption of the
theorem. Furthermore, if A is real, then all Qk, Rk and Ak are real matrices as well, and
thus we cannot even expect the real diagonal elements of Ak to converge to the complex
eigenvalues of A. In this case Ak simply does not (!) converge to an upper triangular
matrix.
The QR algorithm described above is very reliable but quite expensive – each iteration
takes too many flops and the convergence is slow (linear). These problems can be
resolved with the help of Hessenberg matrices.
19.22 Definition (Upper Hessenberg Matrix)
A is called an upper Hessenberg matrix if aij = 0 for all i > j + 1, i.e. A has the form
(illustrated here for n = 5)

    ( × × × × × )
    ( × × × × × )
    ( 0 × × × × )
    ( 0 0 × × × )
    ( 0 0 0 × × )

with nonzero entries only on or above the first subdiagonal.
19.23 Lemma (Arnoldi Iterations)
Every matrix A ∈ C^{n×n} is unitarily equivalent to an upper Hessenberg matrix, i.e.

A = Q∗ H Q

where H is an upper Hessenberg matrix and Q is a unitary matrix. There is an explicit
and inexpensive algorithm to compute such H and Q.
Note: if A is Hermitean, then H is a tridiagonal matrix (i.e. such that hij = 0 for all
|i − j| > 1).
Proof. We present an explicit algorithm for the computation of Q and H, known as
Arnoldi iteration. The matrix equation in the lemma can be rewritten as AQ∗ = Q∗H.
Denote by qi, 1 ≤ i ≤ n, the columns of the unitary matrix Q∗ and by hij the components
of H. Now our matrix equation takes the form

Aqi = h1i q1 + · · · + hii qi + hi+1,i qi+1

for all i = 1, . . . , n − 1 (recall that hji = 0 for j > i + 1), and

Aqn = h1n q1 + · · · + hnn qn

Now the Arnoldi iteration goes like this. We pick an arbitrary unit vector q1. Then, for
each i = 1, . . . , n − 1, we make four steps.
Step 1: vi = Aqi.
Step 2: for all j = 1, . . . , i compute hji = qj∗ vi.
Step 3: ui = vi − Σ_{j=1}^{i} hji qj (note: this vector is orthogonal to q1, . . . , qi).
Step 4: hi+1,i = ‖ui‖ and qi+1 = ui/hi+1,i, unless hi+1,i = 0, see below.
Finally, for i = n we execute Steps 1 and 2 only.
The exceptional case hi+1,i = 0 in Step 4 is referred to as a breakdown of the Arnoldi
iterations. This sounds bad, but it is not a bad thing at all: it means that the Hessenberg
matrix H will have a simpler structure allowing some reductions (not discussed
here). Technically, if hi+1,i = 0, then one simply picks an arbitrary unit vector qi+1
orthogonal to q1, . . . , qi and continues the Arnoldi iteration algorithm. 2
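A direct transcription of Steps 1–4 into code might look like the sketch below (our own
notation; numpy assumed). In the breakdown case it simply takes a random vector and
orthogonalizes it against q1, . . . , qi.

    import numpy as np

    def arnoldi(A, q1=None):
        # Reduces A (n x n) to upper Hessenberg form H with A @ Qstar = Qstar @ H,
        # where the columns of Qstar are the orthonormal vectors q_1, ..., q_n.
        n = A.shape[0]
        Qstar = np.zeros((n, n), dtype=complex)
        H = np.zeros((n, n), dtype=complex)
        if q1 is None:
            q1 = np.ones(n)
        Qstar[:, 0] = q1 / np.linalg.norm(q1)
        for i in range(n):
            v = A @ Qstar[:, i]                        # Step 1
            for j in range(i + 1):
                H[j, i] = Qstar[:, j].conj() @ v       # Step 2
            u = v - Qstar[:, :i + 1] @ H[:i + 1, i]    # Step 3
            if i == n - 1:                             # last column: Steps 1-2 only
                break
            h = np.linalg.norm(u)
            if h > 1e-14:                              # Step 4 (crude breakdown test)
                H[i + 1, i] = h
                Qstar[:, i + 1] = u / h
            else:                                      # breakdown: any unit vector
                w = np.random.randn(n).astype(complex) #   orthogonal to q_1,...,q_i
                w -= Qstar[:, :i + 1] @ (Qstar[:, :i + 1].conj().T @ w)
                Qstar[:, i + 1] = w / np.linalg.norm(w)
        return Qstar, H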
Lemma 19.23 shows that one can first transform A to an upper Hessenberg matrix
A0 = H, which has the same eigenvalues as A by similarity, and then start the QR
algorithm with A0 .
19.24 Lemma
If the matrix A0 is upper Hessenberg, then all the matrices Ak generated by the QR
algorithm are upper Hessenberg.
Proof. By induction, let Ak−1 be upper Hessenberg. Then Ak−1 = Qk Rk and so
Qk = Ak−1 (Rk)−1. Since the inverse of an upper triangular matrix is upper triangular,
Qk is a product of an upper Hessenberg matrix and an upper triangular matrix, and by
direct inspection such a product is upper Hessenberg. Then, similarly, Ak = Rk Qk is a
product of an upper triangular matrix and an upper Hessenberg matrix, so it is upper
Hessenberg. 2
For upper Hessenberg matrices, the QR algorithm can be implemented with the help
of rotation matrices at a relatively low cost, cf. 16.25. This trick provides an inexpensive
modification of the QR algorithm.
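For a real upper Hessenberg matrix, one QR step can be carried out with n − 1 Givens
rotations in O(n^2) operations. The following is a minimal real-arithmetic sketch of that
idea (the function name and layout are ours; cf. 16.25 for rotation matrices).

    import numpy as np

    def hessenberg_qr_step(H):
        # One QR step A_k = R_k Q_k for a real upper Hessenberg H = A_{k-1}.
        # H is reduced to upper triangular R by n-1 Givens rotations applied on
        # the left; applying the same rotations on the right then forms R Q.
        n = H.shape[0]
        R = H.astype(float)
        rotations = []
        for i in range(n - 1):                       # eliminate R[i+1, i]
            a, b = R[i, i], R[i + 1, i]
            r = np.hypot(a, b)
            c, s = (1.0, 0.0) if r == 0.0 else (a / r, b / r)
            rotations.append((c, s))
            rows = R[[i, i + 1], :]                  # fancy indexing gives a copy
            R[i, :]     =  c * rows[0] + s * rows[1]
            R[i + 1, :] = -s * rows[0] + c * rows[1]
        A_next = R                                   # now upper triangular
        for i, (c, s) in enumerate(rotations):       # multiply by Q on the right
            cols = A_next[:, [i, i + 1]]             # fancy indexing gives a copy
            A_next[:, i]     =  c * cols[:, 0] + s * cols[:, 1]
            A_next[:, i + 1] = -s * cols[:, 0] + c * cols[:, 1]
        return A_next                                # again upper Hessenberg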
There is a further improvement of the QR algorithm that accelerates the convergence
in Theorem 19.20.
19.25 Lemma
Assume that A0, and hence Ak for all k ≥ 1, are upper Hessenberg matrices. Then
the convergence a^{(k)}_{i+1,i} → 0 as k → ∞ in Theorem 19.20 is linear with ratio r = |λi+1/λi|.
This lemma is given without proof. Note that |λi+1/λi| < 1 for all i. Clearly, the
smaller this ratio, the faster the convergence. Moreover, the faster the subdiagonal entries
a^{(k)}_{i+1,i} converge to zero, the faster the diagonal entries a^{(k)}_{ii} converge to the eigenvalues of
A.
One can modify the matrix A to decrease the ratio |λn/λn−1|, and thus accelerate the
convergence a^{(k)}_{nn} → λn, with the help of shifting, as in 19.14. One applies the QR steps to
the matrix A − ρI, where ρ is an appropriate approximation to λn. Then the convergence
a^{(k)}_{n,n−1} → 0 will be linear with ratio r = |λn − ρ|/|λn−1 − ρ|. The better ρ approximates
λn, the faster the convergence.
19.26 The QR Algorithm with Shift
To improve the convergence further, one can update the shift ρ = ρk at each iteration.
The QR algorithm with shift then goes as follows:

Ak−1 − ρk−1 I = Qk Rk
Rk Qk + ρk−1 I = Ak

where the shift ρk is updated at each step k. A standard choice for the shift ρk is the
Rayleigh quotient of the vector en:

ρk = en∗ Ak en = a^{(k)}_{nn}

(regarding en as the approximate eigenvector corresponding to λn). This is called the
Rayleigh quotient shift. The convergence a^{(k)}_{n,n−1} → 0 and a^{(k)}_{nn} → λn is, generally,
quadratic.
However, the other subdiagonal entries a^{(k)}_{i+1,i}, 1 ≤ i ≤ n − 2, move to zero slowly
(linearly). To speed them up, one uses the following trick. Once a^{(k)}_{n,n−1} is practically
zero, the entry a^{(k)}_{nn} is practically λn. Then one can partition the matrix Ak as

    Ak = ( Âk  bk )
         ( 0   λn )

where Âk is an (n−1)×(n−1) upper Hessenberg matrix whose eigenvalues are (obviously)
λ1, . . . , λn−1. Then one can apply further steps of the QR algorithm with shift to the
matrix Âk instead of Ak. This quickly produces its smallest eigenvalue, λn−1, which can
be split off as above, etc. This procedure is called the deflation of the matrix A.
In practice, each eigenvalue of A requires just 3-5 iterations (QR steps) on average.
The algorithm is rather fast and very accurate. For Hermitean matrices, it takes
just 2-3 iterations per eigenvalue.
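Putting the pieces together, here is a compact sketch of QR steps with the Rayleigh
quotient shift and deflation (our own helper, written with numpy; it assumes the setting
of 19.20 and is not meant as production code).

    import numpy as np

    def qr_shifted_eigenvalues(A, tol=1e-12, max_steps=1000):
        # Eigenvalues via QR steps with Rayleigh quotient shift and deflation.
        # When the last subdiagonal entry of the active block is negligible, the
        # corner entry is accepted as an eigenvalue and the block is deflated.
        Ak = A.astype(complex)
        eigs = []
        m = Ak.shape[0]
        for _ in range(max_steps):
            if m == 1:
                eigs.append(Ak[0, 0])
                break
            if abs(Ak[m - 1, m - 2]) < tol * (abs(Ak[m - 2, m - 2]) + abs(Ak[m - 1, m - 1])):
                eigs.append(Ak[m - 1, m - 1])       # deflation: split off lambda
                Ak = Ak[:m - 1, :m - 1]
                m -= 1
                continue
            rho = Ak[m - 1, m - 1]                  # Rayleigh quotient shift
            Q, R = np.linalg.qr(Ak - rho * np.eye(m))
            Ak = R @ Q + rho * np.eye(m)
        return eigs

Note that numpy's general QR does not exploit the Hessenberg structure; the Givens-based
step sketched after 19.24 is the cheaper way to perform each iteration.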
19.27 An Improvement: Wilkinson Iteration
The Rayleigh quotient shift ρk = a^{(k)}_{nn} described above can be slightly improved. At
step k, consider the trailing 2-by-2 matrix at the bottom right of Ak:

    ( a^{(k)}_{n−1,n−1}  a^{(k)}_{n−1,n} )
    ( a^{(k)}_{n,n−1}    a^{(k)}_{n,n}   )

and set ρk to its eigenvalue that is closer to a^{(k)}_{n,n}. Note: the eigenvalues of a 2-by-2
matrix can be easily computed by the quadratic formula. This modification is called the
Wilkinson shift. It is slightly more accurate and stable than the Rayleigh quotient shift.
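For concreteness, here is a small helper that computes this shift from the trailing 2-by-2
block via the quadratic formula (the function name is ours):

    import numpy as np

    def wilkinson_shift(Ak):
        # Returns the eigenvalue of the trailing 2x2 block of Ak that is closer
        # to the corner entry Ak[-1, -1].
        a, b = Ak[-2, -2], Ak[-2, -1]
        c, d = Ak[-1, -2], Ak[-1, -1]
        # eigenvalues of [[a, b], [c, d]] are the roots of t^2 - (a+d) t + (ad - bc)
        tr, det = a + d, a * d - b * c
        disc = np.sqrt(complex(tr * tr - 4.0 * det))
        lam1, lam2 = (tr + disc) / 2.0, (tr - disc) / 2.0
        return lam1 if abs(lam1 - d) <= abs(lam2 - d) else lam2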
Furthermore, the Wilkinson shift allows us to resolve the problem of avoiding complex
matrices when the original matrix A is real. In that case the above 2-by-2 trailing block
of Ak is real and therefore has either real eigenvalues or a pair of complex conjugate
eigenvalues λ and λ̄. If λ ∉ IR, the matrix Ak − λI is complex, and a QR step with shift λ
will result in a complex matrix Ak+1. However, if this step is followed immediately by a
QR step with shift λ̄, then the resulting matrix Ak+2 is real again.
Actually, if λ ∉ IR, there is no need to compute the complex matrix Ak+1 at all. One can
combine the two QR steps and construct Ak+2 directly from Ak, bypassing Ak+1. This
can be carried out entirely in real arithmetic and thus is more convenient and economical.
The resulting all-real procedure is called the double-step QR algorithm.