13. Nonlinear least squares

L. Vandenberghe
EE133A (Spring 2017)
13. Nonlinear least squares
• definition and examples
• derivatives and optimality condition
• Gauss-Newton method
• Levenberg-Marquardt method
13-1
Nonlinear least squares
minimize   \(\displaystyle\sum_{i=1}^m f_i(x)^2 = \|f(x)\|^2\)
• f1(x), . . . , fm(x) are differentiable functions of a vector variable x
• f is a function from \(\mathbf{R}^n\) to \(\mathbf{R}^m\) with components \(f_i(x)\):
\[
f(x) = \begin{bmatrix} f_1(x) \\ f_2(x) \\ \vdots \\ f_m(x) \end{bmatrix}
\]
• problem reduces to (linear) least squares if f (x) = Ax − b
13-2
Location from range measurements
• vector x represents unknown location in 2-D or 3-D
• we estimate x by measuring distances to known points a1, . . . , am:
\(\rho_i = \|x - a_i\| + v_i, \qquad i = 1, \dots, m\)
• vi is measurement error
Nonlinear least squares estimate: compute estimate x̂ by minimizing
\(\displaystyle\sum_{i=1}^m \left(\|x - a_i\| - \rho_i\right)^2\)
this is a nonlinear least squares problem with \(f_i(x) = \|x - a_i\| - \rho_i\)
13-3
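As an illustration, here is a minimal numpy sketch of the residual vector and derivative matrix for this problem. The anchor points, true position, and noise level below are made up for the example and are not the data used in the figures that follow.

import numpy as np

# made-up anchor points a_1, ..., a_m (rows) and simulated range measurements rho_i
anchors = np.array([[0.5, 1.0], [2.0, 3.0], [3.0, 0.5], [1.5, 2.5], [3.5, 3.0]])
x_true = np.array([1.0, 1.0])
rho = np.linalg.norm(x_true - anchors, axis=1) + 0.05 * np.random.randn(len(anchors))

def f(x):
    """Residual vector with components f_i(x) = ||x - a_i|| - rho_i."""
    return np.linalg.norm(x - anchors, axis=1) - rho

def Df(x):
    """Derivative matrix: row i is (x - a_i)^T / ||x - a_i||."""
    diff = x - anchors
    return diff / np.linalg.norm(diff, axis=1, keepdims=True)

These two functions can be handed to the Gauss-Newton or Levenberg-Marquardt sketches later in this lecture, or to a library routine such as scipy.optimize.least_squares.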
Example
[Figure: left, graph of \(\|f(x)\|\); right, contour lines of \(\|f(x)\|^2\) in the \((x_1, x_2)\)-plane.]
• correct position is (1, 1); the five points ai are marked with blue dots
• red square marks (global) minimum at (1.18, 0.82)
13-4
Location from multiple camera views
[Figure: pinhole camera with camera center, principal axis, and image plane; an object at x maps to the image point x′.]
Camera model: described by parameters \(A \in \mathbf{R}^{2 \times 3}\), \(b \in \mathbf{R}^2\), \(c \in \mathbf{R}^3\), \(d \in \mathbf{R}\)
• object at location \(x \in \mathbf{R}^3\) creates image at location \(x' \in \mathbf{R}^2\) in the image plane:
\[
x' = \frac{1}{c^T x + d} (Ax + b)
\]
(\(c^T x + d > 0\) if object is in front of the camera)
• A, b, c, d characterize the camera, and its position and orientation
13-5
Location from multiple camera views
• an object at location x is viewed by l cameras (described by Ai, bi, ci, di)
• the image of the object in the image plane of camera i is at location
\[
y_i = \frac{1}{c_i^T x + d_i} (A_i x + b_i) + v_i
\]
• vi is measurement or quantization error
• goal is to determine 3-D location x from the l observations y1, . . . , yl
Nonlinear least squares estimate: compute estimate x̂ by minimizing
\(\displaystyle\sum_{i=1}^l \left\| \frac{1}{c_i^T x + d_i} (A_i x + b_i) - y_i \right\|^2\)
this is a nonlinear least squares problem with m = 2l,
\[
f_i(x) = \frac{(A_i x + b_i)_1}{c_i^T x + d_i} - (y_i)_1, \qquad
f_{l+i}(x) = \frac{(A_i x + b_i)_2}{c_i^T x + d_i} - (y_i)_2
\]
13-6
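A minimal numpy sketch of the 2l-vector of residuals for this formulation; the function and argument names are illustrative. The residuals are stacked per camera rather than in the \((f_1, \dots, f_l, f_{l+1}, \dots, f_{2l})\) order above, which leaves the sum of squares unchanged.

import numpy as np

def camera_residuals(x, cams, y):
    """Residuals for location from l camera views.

    cams is a list of tuples (A_i, b_i, c_i, d_i) describing the cameras,
    y a list of observed image locations y_i (2-vectors).
    """
    res = []
    for (A, b, c, d), yi in zip(cams, y):
        proj = (A @ x + b) / (c @ x + d)   # predicted image location of x
        res.append(proj - yi)              # two residual components per camera
    return np.concatenate(res)             # vector of length m = 2l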
Model fitting
minimize   \(\displaystyle\sum_{i=1}^N \left(\hat f(x_i, \theta) - y_i\right)^2\)
• model fˆ(x, θ) is parameterized by parameters θ1, . . . , θp
• (x1, y1), . . . , (xN , yN ) are data points
• the minimization is over the model parameters θ
• on page 9-9 we considered models that are linear in the parameters θ:
fˆ(x, θ) = θ1f1(x) + · · · + θpfp(x)
here we allow fˆ(x, θ) to be a nonlinear function of θ
13-7
Example
\(\hat f(x, \theta) = \theta_1 \exp(\theta_2 x) \cos(\theta_3 x + \theta_4)\)
[Figure: data points and the fitted model \(\hat f(x, \theta)\) plotted against x.]
a nonlinear least squares problem with four variables θ1, θ2, θ3, θ4:
minimize   \(\displaystyle\sum_{i=1}^N \left(\theta_1 e^{\theta_2 x_i} \cos(\theta_3 x_i + \theta_4) - y_i\right)^2\)
13-8
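A numpy sketch of the residual vector and derivative matrix for this four-parameter model, usable with the methods later in the lecture; the function names are illustrative and the Jacobian entries follow by direct differentiation of the model.

import numpy as np

def residuals(theta, x, y):
    """f_i(theta) = theta_1 * exp(theta_2 x_i) * cos(theta_3 x_i + theta_4) - y_i."""
    t1, t2, t3, t4 = theta
    return t1 * np.exp(t2 * x) * np.cos(t3 * x + t4) - y

def jacobian(theta, x, y):
    """Derivative matrix: one row per data point, one column per parameter."""
    t1, t2, t3, t4 = theta
    e, c, s = np.exp(t2 * x), np.cos(t3 * x + t4), np.sin(t3 * x + t4)
    return np.column_stack([e * c,              # d/d theta_1
                            t1 * x * e * c,     # d/d theta_2
                            -t1 * x * e * s,    # d/d theta_3
                            -t1 * e * s])       # d/d theta_4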
Orthogonal distance regression
minimize the mean square distance of data points to graph of fˆ(x, θ)
Example: orthogonal distance regression with cubic polynomial
\(\hat f(x, \theta) = \theta_1 + \theta_2 x + \theta_3 x^2 + \theta_4 x^3\)
[Figure: left, standard least squares fit; right, orthogonal distance fit of the cubic polynomial \(\hat f(x, \theta)\) to the same data.]
13-9
Nonlinear least squares formulation
minimize   \(\displaystyle\sum_{i=1}^N \left( \left(\hat f(u_i, \theta) - y_i\right)^2 + \|u_i - x_i\|^2 \right)\)
• optimization variables are model parameters θ and N points ui
• ith term is squared distance of data point (xi, yi) to point (ui, fˆ(ui, θ))
[Figure: distance \(d_i\) from the data point \((x_i, y_i)\) to the point \((u_i, \hat f(u_i, \theta))\) on the graph.]
\(d_i^2 = (\hat f(u_i, \theta) - y_i)^2 + \|u_i - x_i\|^2\)
• minimizing over ui gives squared distance of (xi, yi) to graph
• minimizing over u_1, . . . , u_N and θ minimizes the mean squared distance
13-10
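To make the formulation concrete, here is a numpy sketch that packs θ and the N points u_i into one variable vector and returns the stacked residuals whose squared norm is the cost above; it assumes scalar x_i and u_i, as in the cubic-polynomial example, and the names are illustrative.

import numpy as np

def odr_residuals(z, x, y, fhat):
    """Stacked residuals for orthogonal distance regression (scalar x_i, u_i).

    z packs the p model parameters theta followed by the N points u_i;
    the squared norm of the returned vector is the cost on page 13-10.
    """
    p = z.size - x.size
    theta, u = z[:p], z[p:]
    return np.concatenate([fhat(u, theta) - y, u - x])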
Binary classification
fˆ(x, θ) = sign (θ1f1(x) + θ2f2(x) + · · · + θpfp(x))
• in lecture 9 (p. 9-25) we computed θ by solving a linear least squares problem
• better results are obtained by solving a nonlinear least squares problem
minimize   \(\displaystyle\sum_{i=1}^N \left(\phi(\theta_1 f_1(x_i) + \cdots + \theta_p f_p(x_i)) - y_i\right)^2\)
• \((x_i, y_i)\) are data points, \(y_i \in \{-1, 1\}\)
• \(\phi(u)\) is the sigmoidal function
\[
\phi(u) = \frac{e^u - e^{-u}}{e^u + e^{-u}},
\]
a differentiable approximation of sign(u)
[Figure: graph of \(\phi(u)\) for u between −4 and 4.]
13-11
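Since \(\phi(u) = (e^u - e^{-u})/(e^u + e^{-u})\) is exactly tanh(u), the residual vector has a one-line numpy sketch; the matrix F below (with entries F[i, j] = f_j(x_i)) and the names are illustrative.

import numpy as np

def classifier_residuals(theta, F, y):
    """Residuals phi(F @ theta) - y with phi(u) = tanh(u).

    F is the N x p matrix of basis function values F[i, j] = f_j(x_i);
    y contains the labels, each -1 or +1.
    """
    return np.tanh(F @ theta) - y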
Outline
• definition and examples
• derivatives and optimality condition
• Gauss-Newton method
• Levenberg-Marquardt method
Gradient
Gradient of differentiable function \(g : \mathbf{R}^n \to \mathbf{R}\) at \(z \in \mathbf{R}^n\) is
\[
\nabla g(z) = \left( \frac{\partial g}{\partial x_1}(z), \frac{\partial g}{\partial x_2}(z), \dots, \frac{\partial g}{\partial x_n}(z) \right)
\]
Affine approximation (linearization) of g around z is
\[
\hat g(x) = g(z) + \frac{\partial g}{\partial x_1}(z)(x_1 - z_1) + \cdots + \frac{\partial g}{\partial x_n}(z)(x_n - z_n) = g(z) + \nabla g(z)^T (x - z)
\]
(see page 1-30)
13-12
Derivative matrix
Derivative matrix (Jacobian) of differentiable function \(f : \mathbf{R}^n \to \mathbf{R}^m\) at \(z \in \mathbf{R}^n\):
\[
Df(z) =
\begin{bmatrix}
\frac{\partial f_1}{\partial x_1}(z) & \frac{\partial f_1}{\partial x_2}(z) & \cdots & \frac{\partial f_1}{\partial x_n}(z) \\
\frac{\partial f_2}{\partial x_1}(z) & \frac{\partial f_2}{\partial x_2}(z) & \cdots & \frac{\partial f_2}{\partial x_n}(z) \\
\vdots & \vdots & & \vdots \\
\frac{\partial f_m}{\partial x_1}(z) & \frac{\partial f_m}{\partial x_2}(z) & \cdots & \frac{\partial f_m}{\partial x_n}(z)
\end{bmatrix}
=
\begin{bmatrix}
\nabla f_1(z)^T \\ \nabla f_2(z)^T \\ \vdots \\ \nabla f_m(z)^T
\end{bmatrix}
\]
Affine approximation (linearization) of f around z is
fˆ(x) = f (z) + Df (z)(x − z)
• see page 3-40
• we also use notation fˆ(x; z) to indicate the point z around which we linearize
13-13
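A numpy sketch that estimates Df(z) with forward differences and forms the affine approximation; it is only a numerical illustration (the helper names and step size are assumptions), and exact derivatives are preferable when they are available.

import numpy as np

def numerical_jacobian(f, z, h=1e-6):
    """Estimate the derivative matrix Df(z) column by column with forward differences."""
    z = np.asarray(z, dtype=float)
    fz = np.asarray(f(z), dtype=float)
    J = np.zeros((fz.size, z.size))
    for j in range(z.size):
        e = np.zeros(z.size)
        e[j] = h
        J[:, j] = (np.asarray(f(z + e), dtype=float) - fz) / h
    return J

def affine_approx(f, z, x):
    """Linearization fhat(x; z) = f(z) + Df(z)(x - z)."""
    z = np.asarray(z, dtype=float)
    return np.asarray(f(z), dtype=float) + numerical_jacobian(f, z) @ (np.asarray(x, dtype=float) - z)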
Gradient of nonlinear least squares cost
\(g(x) = \|f(x)\|^2 = \displaystyle\sum_{i=1}^m f_i(x)^2\)
• first derivative of g with respect to xj :
\[
\frac{\partial g}{\partial x_j}(z) = 2 \sum_{i=1}^m f_i(z) \frac{\partial f_i}{\partial x_j}(z)
\]
• gradient of g at z :




\[
\nabla g(z) =
\begin{bmatrix}
\frac{\partial g}{\partial x_1}(z) \\ \vdots \\ \frac{\partial g}{\partial x_n}(z)
\end{bmatrix}
= 2 \sum_{i=1}^m f_i(z) \nabla f_i(z) = 2\, Df(z)^T f(z)
\]
13-14
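A small numpy check of the identity \(\nabla g(z) = 2\, Df(z)^T f(z)\) against a finite-difference gradient; the two-dimensional function f below is made up purely for illustration.

import numpy as np

def nls_gradient(f, Df, x):
    """Gradient of g(x) = ||f(x)||^2, namely 2 Df(x)^T f(x)."""
    return 2 * Df(x).T @ f(x)

# toy example (made up): f(x) = (x1*x2 - 1, x1 + x2^2)
f = lambda x: np.array([x[0] * x[1] - 1, x[0] + x[1] ** 2])
Df = lambda x: np.array([[x[1], x[0]], [1.0, 2 * x[1]]])
g = lambda x: np.sum(f(x) ** 2)

x = np.array([0.7, -0.3])
h = 1e-6
fd = np.array([(g(x + h * np.eye(2)[j]) - g(x)) / h for j in range(2)])
print(nls_gradient(f, Df, x), fd)   # the two gradients should nearly agree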
Optimality condition
minimize   \(g(x) = \displaystyle\sum_{i=1}^m f_i(x)^2\)
• necessary condition for optimality: if x minimizes g(x) then it must satisfy
\(\nabla g(x) = 2\, Df(x)^T f(x) = 0\)
• this generalizes the normal equations: if \(f(x) = Ax - b\), then \(Df(x) = A\) and
\(\nabla g(x) = 2 A^T (Ax - b)\)
• for general f, the condition ∇g(x) = 0 is not sufficient for optimality: it also holds at other stationary points of g, such as saddle points and non-global local minima
13-15
Outline
• definition and examples
• derivatives and optimality condition
• Gauss-Newton method
• Levenberg-Marquardt method
Gauss-Newton method
minimize   \(g(x) = \|f(x)\|^2 = \displaystyle\sum_{i=1}^m f_i(x)^2\)
start at some initial guess x(1), and repeat for k = 1, 2, . . .:
• linearize f around x(k):
fˆ(x; x(k)) = f (x(k)) + Df (x(k))(x − x(k))
• substitute affine approximation fˆ(x; x(k)) for f in least squares problem:
minimize   \(\|\hat f(x; x^{(k)})\|^2\)
• define x(k+1) as the solution of this (linear) least squares problem
13-16
Gauss-Newton update
least squares problem solved in iteration k :
minimize   \(\|f(x^{(k)}) + Df(x^{(k)})(x - x^{(k)})\|^2\)
• if Df (x(k)) has linearly independent columns, solution is given by
\[
x^{(k+1)} = x^{(k)} - \left( Df(x^{(k)})^T Df(x^{(k)}) \right)^{-1} Df(x^{(k)})^T f(x^{(k)})
\]
• Gauss-Newton step ∆x(k) = x(k+1) − x(k) is
\[
\Delta x^{(k)} = -\left( Df(x^{(k)})^T Df(x^{(k)}) \right)^{-1} Df(x^{(k)})^T f(x^{(k)})
= -\frac{1}{2} \left( Df(x^{(k)})^T Df(x^{(k)}) \right)^{-1} \nabla g(x^{(k)})
\]
(using the expression for ∇g(x) on page 13-14)
13-17
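A minimal numpy sketch of the Gauss-Newton iteration; each step solves the linearized least squares problem above with np.linalg.lstsq. There is no step-size control here, so the iteration can diverge, which is the difficulty the Levenberg-Marquardt method addresses next.

import numpy as np

def gauss_newton(f, Df, x0, iters=20):
    """Gauss-Newton iteration for minimizing ||f(x)||^2.

    Each step solves the linearized least squares problem
    minimize ||f(x_k) + Df(x_k)(x - x_k)||^2 for the next iterate.
    """
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        J, r = Df(x), f(x)
        # least squares solution of J dx = -r (assumes J has independent columns)
        dx = np.linalg.lstsq(J, -r, rcond=None)[0]
        x = x + dx
    return x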
Predicted cost reduction in iteration k
• predicted cost function at x(k+1), based on approximation fˆ(x; x(k)):
\[
\begin{aligned}
\|\hat f(x^{(k+1)}; x^{(k)})\|^2
&= \|f(x^{(k)}) + Df(x^{(k)}) \Delta x^{(k)}\|^2 \\
&= \|f(x^{(k)})\|^2 + 2 f(x^{(k)})^T Df(x^{(k)}) \Delta x^{(k)} + \|Df(x^{(k)}) \Delta x^{(k)}\|^2 \\
&= \|f(x^{(k)})\|^2 - \|Df(x^{(k)}) \Delta x^{(k)}\|^2
\end{aligned}
\]
• if columns of \(Df(x^{(k)})\) are linearly independent and \(\Delta x^{(k)} \ne 0\),
\[
\|\hat f(x^{(k+1)}; x^{(k)})\|^2 < \|f(x^{(k)})\|^2
\]
• however, \(\hat f(x; x^{(k)})\) is only a local approximation of \(f(x)\), so it is possible that
\[
\|f(x^{(k+1)})\|^2 > \|f(x^{(k)})\|^2
\]
13-18
Outline
• definition and examples
• derivatives and optimality condition
• Gauss-Newton method
• Levenberg-Marquardt method
Levenberg-Marquardt method
addresses two difficulties in Gauss-Newton method:
• how to update x(k) when columns of Df (x(k)) are linearly dependent
• what to do when the Gauss-Newton update does not reduce \(\|f(x)\|^2\)
Levenberg-Marquardt method
compute x(k+1) by solving a regularized least squares problem
minimize   \(\|\hat f(x; x^{(k)})\|^2 + \lambda^{(k)} \|x - x^{(k)}\|^2\)
• as before, fˆ(x; x(k)) = f (x(k)) + Df (x(k))(x − x(k))
• with λ(k) > 0, the problem always has a unique solution (no condition on Df (x(k)))
• parameter λ(k) > 0 controls size of update
13-19
Levenberg-Marquardt update
regularized least squares problem solved in iteration k
minimize   \(\|f(x^{(k)}) + Df(x^{(k)})(x - x^{(k)})\|^2 + \lambda^{(k)} \|x - x^{(k)}\|^2\)
• solution is given by
\[
x^{(k+1)} = x^{(k)} - \left( Df(x^{(k)})^T Df(x^{(k)}) + \lambda^{(k)} I \right)^{-1} Df(x^{(k)})^T f(x^{(k)})
\]
• Levenberg-Marquardt step ∆x(k) = x(k+1) − x(k) is
\[
\Delta x^{(k)} = -\left( Df(x^{(k)})^T Df(x^{(k)}) + \lambda^{(k)} I \right)^{-1} Df(x^{(k)})^T f(x^{(k)})
= -\frac{1}{2} \left( Df(x^{(k)})^T Df(x^{(k)}) + \lambda^{(k)} I \right)^{-1} \nabla g(x^{(k)})
\]
• for λ(k) = 0 this is the Gauss-Newton step (if defined); for large λ(k),
\[
\Delta x^{(k)} \approx -\frac{1}{2 \lambda^{(k)}} \nabla g(x^{(k)})
\]
13-20
Cost reduction
• predicted cost function at x(k+1), based on local approximation fˆ(x; x(k)):
\[
\begin{aligned}
\|\hat f(x^{(k+1)}; x^{(k)})\|^2
&= \|f(x^{(k)}) + Df(x^{(k)}) \Delta x^{(k)}\|^2 \\
&= \|f(x^{(k)})\|^2 + 2 f(x^{(k)})^T Df(x^{(k)}) \Delta x^{(k)} + \|Df(x^{(k)}) \Delta x^{(k)}\|^2 \\
&= \|f(x^{(k)})\|^2 - 2 \lambda^{(k)} \|\Delta x^{(k)}\|^2 - \|Df(x^{(k)}) \Delta x^{(k)}\|^2
\end{aligned}
\]
• for large λ(k), we can use affine approximation to estimate actual cost:
\[
\begin{aligned}
\|f(x^{(k+1)})\|^2 = g(x^{(k+1)})
&\approx g(x^{(k)}) + \nabla g(x^{(k)})^T \Delta x^{(k)} \\
&\approx \|f(x^{(k)})\|^2 - \frac{1}{2 \lambda^{(k)}} \|\nabla g(x^{(k)})\|^2
\end{aligned}
\]
(using the expression on page 13-20)
13-21
Regularization parameter
several strategies for adapting λ(k) are possible; the simplest example:
• at iteration k , compute the solution x̂ of
minimize   \(\|\hat f(x; x^{(k)})\|^2 + \lambda^{(k)} \|x - x^{(k)}\|^2\)
• if \(\|f(\hat x)\|^2 < \|f(x^{(k)})\|^2\), take \(x^{(k+1)} = \hat x\) and decrease λ
• otherwise, do not update x (take x(k+1) = x(k)), but increase λ
Some variations
• compare actual cost reduction with predicted cost reduction
• solve a least squares problem with ‘trust region’
minimize   \(\|\hat f(x; x^{(k)})\|^2\)
subject to   \(\|x - x^{(k)}\|^2 \le \gamma\)
13-22
Summary: Levenberg-Marquardt method
choose x(1) and λ(1) and repeat for k = 1, 2, . . .:
1. evaluate f (x(k)) and A = Df (x(k))
2. compute solution of regularized least squares problem:
\[
\hat x = x^{(k)} - (A^T A + \lambda^{(k)} I)^{-1} A^T f(x^{(k)})
\]
3. define x(k+1) and λ(k+1) as follows:
\[
\begin{cases}
x^{(k+1)} = \hat x \text{ and } \lambda^{(k+1)} = \beta_1 \lambda^{(k)} & \text{if } \|f(\hat x)\|^2 < \|f(x^{(k)})\|^2 \\
x^{(k+1)} = x^{(k)} \text{ and } \lambda^{(k+1)} = \beta_2 \lambda^{(k)} & \text{otherwise}
\end{cases}
\]
• β1, β2 are constants with 0 < β1 < 1 < β2
• in step 2, x̂ can be computed using a QR factorization
• terminate if ∇g(x(k)) = 2AT f (x(k)) is sufficiently small
13-23
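A minimal numpy sketch of this summary, assuming f and Df return the residual vector and derivative matrix. The regularized least squares problem in step 2 is solved here as a stacked least squares problem with np.linalg.lstsq (a QR factorization, as noted above, could equally be used); the function name and default constants are illustrative.

import numpy as np

def levenberg_marquardt(f, Df, x0, lam=1.0, beta1=0.8, beta2=2.0,
                        tol=1e-8, max_iter=100):
    """Levenberg-Marquardt iteration of page 13-23 for minimizing ||f(x)||^2."""
    x = np.asarray(x0, dtype=float)
    n = x.size
    for _ in range(max_iter):
        r, A = f(x), Df(x)
        if np.linalg.norm(2 * A.T @ r) < tol:      # optimality condition 2 A^T f = 0
            break
        # step 2: regularized least squares, written as a stacked least squares
        # problem (equivalent to solving (A^T A + lam I) dx = -A^T r)
        A_stack = np.vstack([A, np.sqrt(lam) * np.eye(n)])
        b_stack = np.concatenate([-r, np.zeros(n)])
        x_hat = x + np.linalg.lstsq(A_stack, b_stack, rcond=None)[0]
        # step 3: accept or reject the tentative iterate and update lambda
        if np.sum(f(x_hat) ** 2) < np.sum(r ** 2):
            x, lam = x_hat, beta1 * lam
        else:
            lam = beta2 * lam
    return x

Combined with the residual and Jacobian sketches from the range-measurement example, a call such as levenberg_marquardt(f, Df, np.array([3.0, 1.5])) would reproduce the kind of experiment shown on the next page.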
Location from range measurements
[Figure: Levenberg-Marquardt iterates from three starting points in the \((x_1, x_2)\)-plane.]
• iterates from three starting points, with λ(1) = 0.1, β1 = 0.8, β2 = 2
• the algorithm started at (1.8, 3.5) or (3.0, 1.5) finds the minimum (1.18, 0.82)
• started at (2.2, 3.5), it converges to a non-optimal point
13-24
Cost function and regularization parameter
[Figure: left, cost \(\|f(x^{(k)})\|^2\) versus iteration k; right, regularization parameter \(\lambda^{(k)}\) versus k, for the first 10 iterations.]
cost function and λ(k) for the three starting points on previous page
13-25