33 Least Squares

Learning Goals: find the best solution (by one measure, anyway) of an inconsistent equation. Learn
to apply the algebra, geometry, and calculus of projections to this problem.
Example: You plant a seedling at noon on Monday. Each day afterward, at noon, you measure its
height. On Tuesday it is 3cm tall. Wednesday it is 5cm tall. Thursday it is 6cm tall. How tall
was it when you planted it?
Let’s assume that the seedling has a constant growth rate—obviously an incorrect
assumption, but what else can we assume? (Assuming an arithmetic growth rate gives a height
of zero on planting day, and negative growth from here on out; geometric growth would give it a
height of –1 cm on day zero.) If the growth rate is r and the initial height is h, we end up with
the following three equations in two unknowns: r + h = 3, 2r + h = 5, 3r + h = 6. As usual, when
there are more equations than unknowns we expect the system to be inconsistent, and reduction
quickly proves that point:
$$\begin{bmatrix} 1 & 1 \\ 2 & 1 \\ 3 & 1 \end{bmatrix}\begin{bmatrix} r \\ h \end{bmatrix} = \begin{bmatrix} 3 \\ 5 \\ 6 \end{bmatrix} \rightarrow \begin{bmatrix} 1 & 1 \\ 0 & -1 \\ 0 & -2 \end{bmatrix}\begin{bmatrix} r \\ h \end{bmatrix} = \begin{bmatrix} 3 \\ -1 \\ -3 \end{bmatrix} \rightarrow \begin{bmatrix} 1 & 1 \\ 0 & -1 \\ 0 & 0 \end{bmatrix}\begin{bmatrix} r \\ h \end{bmatrix} = \begin{bmatrix} 3 \\ -1 \\ -1 \end{bmatrix}.$$
Of course, the problem is that our right-hand side, b = (3, 5, 6), is not in the column space
of the matrix A, so there is no solution to Ax = b. If it were, we could solve the system with no
problems. So what should we do?
Let's ask how close we can come to solving the equation. The vector
$$\begin{bmatrix} 1 & 1 \\ 2 & 1 \\ 3 & 1 \end{bmatrix}\begin{bmatrix} r \\ h \end{bmatrix}$$
is guaranteed to be
in the column space of the matrix. So instead of using the real b, let’s find the thing in the
column space that is as close to b as possible, and solve for that instead! Let p be the projection
of b into the column space. Then the error vector e = b – p is as small as possible. Let’s call the
solution to this new problem x̂, so we are solving Ax̂ = p. The one thing we know about e is
that it is orthogonal to the column space, so it is in the left nullspace. That is, Aᵀe = 0. This
means that Aᵀ(b − Ax̂) = 0, or AᵀAx̂ = Aᵀb.
So instead of Ax = b, we solve the normal equations AᵀAx̂ = Aᵀb. (We will show later
that this always has a solution.) In this case, we multiply both sides by
$$A^T = \begin{bmatrix} 1 & 2 & 3 \\ 1 & 1 & 1 \end{bmatrix}$$
to obtain the system
$$\begin{bmatrix} 14 & 6 \\ 6 & 3 \end{bmatrix}\begin{bmatrix} \hat r \\ \hat h \end{bmatrix} = \begin{bmatrix} 31 \\ 14 \end{bmatrix}.$$
A little elimination shows that r̂ = 3/2 and ĥ = 5/3.
So we guess our little plant started out 5/3 cm tall and grew at a rate of 3/2 cm/day. This
is clearly wrong, since it would predict heights of 19/6, 14/3, and 37/6 instead of 3, 5, and 6, so
we're off by –1/6, 1/3, and –1/6 respectively. That is, e = (–1/6, 1/3, –1/6).
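A quick numerical check of this answer (a minimal NumPy sketch; the variable names are mine, not from the text):

    import numpy as np

    # Heights 3, 5, 6 cm measured 1, 2, 3 days after planting.
    A = np.array([[1., 1.],
                  [2., 1.],
                  [3., 1.]])
    b = np.array([3., 5., 6.])

    # Form and solve the normal equations AᵀAx̂ = Aᵀb.
    x_hat = np.linalg.solve(A.T @ A, A.T @ b)
    print(x_hat)          # ≈ [1.5, 1.6667], i.e. (r̂, ĥ) = (3/2, 5/3)

    # Error vector e = b − Ax̂.
    print(b - A @ x_hat)  # ≈ [-0.1667, 0.3333, -0.1667] = (−1/6, 1/3, −1/6)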
This is an example of a very general problem. Not just “fit a line through a bunch of data
which may not be collinear” but in general “find the best solution to Ax = b if b is not in the
column space of A.” And our solution method is completely general.
We know that we can’t solve Ax = b unless b is in the column space. But how close can
we get? Well, we must first determine how we are going to measure closeness. One thing to do
is to take the x̂ that we find and measure the error as e = b – Ax̂. How are we going to decide
how bad the error is? The easiest thing to do is to minimize the length of this error vector.
We can approach this several ways.
Geometry
Geometrically, to minimize the length of e, since Ax̂ is in the column space of A, we
want e to be orthogonal to this space, hence in the left nullspace. Then Aᵀe = 0. Writing this out
gives the normal equations AᵀAx̂ = Aᵀb. We will see shortly why this always has a solution.
Algebra
We know that b is not in the column space, so we break up b = e + p, where p is the
projection of b into the column space and e is the error vector. Then we solve Ax̂ = p. Well,
we know that the formula for projection onto the column space of A is p = A(AᵀA)⁻¹Aᵀb (assuming
the columns of A are independent—more later if they're not!). If Ax̂ = p, then x̂ = (AᵀA)⁻¹Aᵀb,
and we get the solution above.
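The same route in NumPy (a sketch; forming (AᵀA)⁻¹ explicitly is reasonable here only because the columns are independent and the matrix is tiny):

    import numpy as np

    A = np.array([[1., 1.], [2., 1.], [3., 1.]])
    b = np.array([3., 5., 6.])

    # p = A(AᵀA)⁻¹Aᵀb, the projection of b onto the column space of A.
    x_hat = np.linalg.inv(A.T @ A) @ A.T @ b
    p = A @ x_hat
    e = b - p

    print(p)        # ≈ [3.1667, 4.6667, 6.1667], i.e. (19/6, 14/3, 37/6)
    print(A.T @ e)  # ≈ [0, 0]: e is orthogonal to the column space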
Calculus
We could even use the techniques of calculus to minimize the error (or more rightly, the
square of the error). I'll spare you the details, but we get—unsurprisingly—the same normal
equations AᵀAx̂ = Aᵀb.
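For the curious, the spared details are short. The squared length of the error is
$$\|e\|^2 = \|b - A\hat x\|^2 = \hat x^T A^T A \hat x - 2\hat x^T A^T b + b^T b,$$
and setting its gradient with respect to x̂ equal to zero gives
$$2A^T A \hat x - 2A^T b = 0,$$
which is exactly AᵀAx̂ = Aᵀb.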
This method of finding a “best” solution is called the method of least squares. This is so
because we minimize the sum of the squares of the errors.
So the method is: if Ax = b is solvable, great. If not, multiply both sides by Aᵀ to obtain
the normal equations and solve AᵀAx̂ = Aᵀb.
We know that if the columns of A are independent then AᵀA is invertible, and there is a
unique solution to this system. What if they aren’t?
It turns out that we can still always solve the system, but now the solution won’t be
unique. We can add any null vector of A to a particular x̂ and obtain another solution. This will
be the point of the pseudoinverse. We will pick out the particular solution that is in the row
space of A. That way, any other solution will be this plus a null vector, which is orthogonal to
the row space. So any other solution will be longer than the row space solution. Thus not only
will the pseudoinverse make the error as small as possible, it will make the choice of x̂ as small
as possible, too!
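A tiny illustration of this (my own example, not from the text): below, the second column of A is twice the first, so AᵀA is singular, yet NumPy's pinv still returns the shortest least-squares solution, the one in the row space.

    import numpy as np

    # Dependent columns: the second column is twice the first.
    A = np.array([[1., 2.],
                  [1., 2.],
                  [0., 0.]])
    b = np.array([1., 2., 3.])

    x_plus = np.linalg.pinv(A) @ b   # minimum-length least-squares solution
    print(x_plus)                    # [0.3, 0.6], which lies in the row space

    # Adding a null vector of A, e.g. (2, -1), gives another solution with
    # the same error but a longer x:
    x_other = x_plus + np.array([2., -1.])
    print(np.linalg.norm(x_plus), np.linalg.norm(x_other))  # ≈ 0.671 < 2.335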
So why is AᵀAx̂ = Aᵀb always solvable? Well, we use our Fundamental Theorem of
Linear Algebra. The column space C(AᵀA) is the orthogonal complement of the left nullspace of
AᵀA. Well, this is easier in symbols:
$$C(A^TA) = N\big((A^TA)^T\big)^\perp = N(A^TA)^\perp = N(A)^\perp = C(A^T)$$
(we've seen that A and AᵀA have the same nullspace because if Ax = 0 then certainly AᵀAx = 0,
but if AᵀAx = 0, we multiply on the left by xᵀ and find that xᵀAᵀAx = ‖Ax‖² = 0, so Ax = 0).
But since the column spaces of AᵀA and Aᵀ are the same, and Aᵀb is in the column space of Aᵀ,
we can certainly always solve AᵀAx̂ = Aᵀb.
The method of least squares is widely applicable, not just to finding best-fit lines. Later on,
we may find best-fit functions—the theory of Fourier series. As another example, we will find a
best-fit parabola.
Example
Measure the acceleration due to gravity.
An object is dropped. The distance it falls is measured and the following data are
collected, with (t, d) given as (time in tenths of a second, distance in centimeters): (1, 5), (2, 19),
(3, 44), (4, 79), (5, 122). There may be some measurement error, the object may not have been at
exactly zero when it was released, and it might have had some initial speed (for instance, it might
have been hand-held, and it was hard to get the timing just right or to hold the hand perfectly
steady). So the formula for distance fallen ought to be gt²/2 + vt + d, where g is the acceleration
of gravity, v is its initial downward speed, and d is the distance below zero-level from which it
was actually dropped. Putting in the data gives five equations in three unknowns:
1g/2 + 1v + d = 5
4g/2 + 2v + d = 19
9g/2 + 3v + d = 44
16g/2 + 4v + d = 79
25g/2 + 5v + d = 122
Our system has the matrix form
$$\begin{bmatrix} 1/2 & 1 & 1 \\ 4/2 & 2 & 1 \\ 9/2 & 3 & 1 \\ 16/2 & 4 & 1 \\ 25/2 & 5 & 1 \end{bmatrix}\begin{bmatrix} g \\ v \\ d \end{bmatrix} = \begin{bmatrix} 5 \\ 19 \\ 44 \\ 79 \\ 122 \end{bmatrix}.$$
Note that even if the system is
consistent the least squares method will produce the correct answer, for the projection will just
be b and the error will be 0 (we’re projecting something in the column space into the column
space!). So let’s not bother to check whether this is consistent (it’s not) and pass to the normal
equations. Multiplying on both sides by the transpose of the coefficient matrix gives:
$$\begin{bmatrix} 979/4 & 225/2 & 55/2 \\ 225/2 & 55 & 15 \\ 55/2 & 15 & 5 \end{bmatrix}\begin{bmatrix} \hat g \\ \hat v \\ \hat d \end{bmatrix} = \begin{bmatrix} 4791/2 \\ 1101 \\ 269 \end{bmatrix}$$
I’ll spare you the details, but the solution to this system is ( ĝ, v̂, d̂) = (68/7, 9/35, –2/5). This tells
us that the object was (up to experimental error) dropped from 2/5 cm above the presumed
starting point, with an initial downward velocity of 9/35 cm/(0.1 s) = 90/35 cm/s (pretty fast!),
and subjected to a gravitational acceleration of 68/7 cm/(0.1 s)² = 68/7 m/s², or about 9.7 m/s².
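Checking with NumPy (a sketch; np.linalg.lstsq solves the least-squares problem directly, without our forming AᵀA by hand):

    import numpy as np

    t = np.array([1., 2., 3., 4., 5.])          # tenths of a second
    dist = np.array([5., 19., 44., 79., 122.])  # centimeters

    # Columns t²/2, t, 1 for the model  dist = g·t²/2 + v·t + d.
    A = np.column_stack([t**2 / 2, t, np.ones_like(t)])
    sol, res, rank, sv = np.linalg.lstsq(A, dist, rcond=None)
    print(sol)  # ≈ [9.7143, 0.2571, -0.4], i.e. (ĝ, v̂, d̂) = (68/7, 9/35, −2/5)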
Reading: 4.3
Problems: 4.3: 1, 2, 4, 5, 7, 9, 10, 12–16, 17, 22, 26, 27