Part 2: Linesearch methods
for unconstrained optimization
Nick Gould (RAL)
minimize_{x∈IRn} f(x)
Part C course on continuous optimization
UNCONSTRAINED MINIMIZATION
minimize_{x∈IRn} f(x)
where the objective function f : IRn −→ IR
assume that f ∈ C 1 (sometimes C 2) and Lipschitz
often in practice this assumption is violated, but it is not essential
ITERATIVE METHODS
in practice very rare to be able to provide explicit minimizer
iterative method: given starting “guess” x0, generate sequence
{xk }, k = 1, 2, . . .
AIM: ensure that (a subsequence) has some favourable limiting
properties:
satisfies first-order necessary conditions
satisfies second-order necessary conditions
Notation: fk = f (xk ), gk = g(xk ), Hk = H(xk ).
LINESEARCH METHODS
calculate a search direction pk from xk
ensure that this direction is a descent direction, i.e.,
gk^T pk < 0 if gk ≠ 0
so that, for small steps along pk , the objective function
will be reduced
calculate a suitable steplength αk > 0 so that
f (xk + αk pk ) < fk
computation of αk is the linesearch—may itself be an iteration
generic linesearch method:
xk+1 = xk + αk pk
STEPS MIGHT BE TOO LONG
The objective function f(x) = x² and the iterates xk+1 = xk + αk pk generated by the descent directions pk = (−1)^{k+1} and steps αk = 2 + 3/2^{k+1} from x0 = 2
STEPS MIGHT BE TOO SHORT
The objective function f(x) = x² and the iterates xk+1 = xk + αk pk generated by the descent directions pk = −1 and steps αk = 1/2^{k+1} from x0 = 2
PRACTICAL LINESEARCH METHODS
in early days, pick αk to minimize
f (xk + αpk )
exact linesearch—univariate minimization
rather expensive and certainly not cost effective
modern methods: inexact linesearch
ensure steps are neither too long nor too short
try to pick “useful” initial stepsize for fast convergence
best methods are either “backtracking-Armijo” or “Armijo-Goldstein” based
BACKTRACKING LINESEARCH
Procedure to find the stepsize αk :
Given αinit > 0 (e.g., αinit = 1)
let α(0) = αinit and l = 0
Until f (xk + α(l)pk )“<”fk
set α(l+1) = τ α(l), where τ ∈ (0, 1) (e.g., τ = 1/2)
and increase l by 1
Set αk = α(l)
this prevents the step from getting too small . . . but does not prevent
too large steps relative to decrease in f
need to tighten requirement
f (xk + α(l)pk )“<”fk
ARMIJO CONDITION
In order to prevent large steps relative to decrease in f , instead require
f (xk + αk pk ) ≤ f (xk ) + αk βgkT pk
for some β ∈ (0, 1) (e.g., β = 0.1 or even β = 0.0001)
[Figure: f(xk + αpk) as a function of α, together with the linear approximation f(xk) + α gk^T pk and the relaxed Armijo bound f(xk) + αβ gk^T pk.]
BACKTRACKING-ARMIJO LINESEARCH
Procedure to find the stepsize αk :
Given αinit > 0 (e.g., αinit = 1)
let α(0) = αinit and l = 0
Until f (xk + α(l)pk ) ≤ f (xk ) + α(l)βgkT pk
set α(l+1) = τ α(l), where τ ∈ (0, 1) (e.g., τ = 1/2)
and increase l by 1
Set αk = α(l)
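As a concrete illustration, here is a minimal Python sketch of this procedure; the function name, the numpy dependency and the default values of αinit, β and τ are illustrative choices, not part of the course notes.

```python
import numpy as np

def backtracking_armijo(f, xk, pk, gk, alpha_init=1.0, beta=1e-4, tau=0.5,
                        max_backtracks=60):
    """Backtracking-Armijo linesearch (sketch).

    Shrink alpha by the factor tau until the sufficient-decrease condition
    f(xk + alpha*pk) <= f(xk) + alpha*beta*gk^T pk holds.
    """
    fk = f(xk)
    slope = float(np.dot(gk, pk))   # gk^T pk, negative if pk is a descent direction
    alpha = alpha_init
    for _ in range(max_backtracks):
        if f(xk + alpha * pk) <= fk + alpha * beta * slope:
            break
        alpha *= tau                # step was too long: backtrack
    return alpha
```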
SATISFYING THE ARMIJO CONDITION
Theorem 2.1. Suppose that f ∈ C 1, that g(x) is Lipschitz continuous with Lipschitz constant γ(x), that β ∈ (0, 1) and that p is
a descent direction at x. Then the Armijo condition
f (x + αp) ≤ f (x) + αβg(x)T p
is satisfied for all α ∈ [0, αmax(x)], where
αmax(x) = 2(β − 1) g(x)^T p / ( γ(x) ‖p‖₂² )
PROOF OF THEOREM 2.1
Taylor’s theorem (Theorem 1.1) and α ≤ 2(β − 1) g(x)^T p / ( γ(x) ‖p‖₂² ) imply that
f(x + αp) ≤ f(x) + α g(x)^T p + ½ γ(x) α² ‖p‖₂²
         ≤ f(x) + α g(x)^T p + α(β − 1) g(x)^T p
         = f(x) + αβ g(x)^T p
THE ARMIJO LINESEARCH TERMINATES
Corollary 2.2. Suppose that f ∈ C 1, that g(x) is Lipschitz continuous with Lipschitz constant γk at xk , that β ∈ (0, 1) and that
pk is a descent direction at xk . Then the stepsize generated by the
backtracking-Armijo linesearch terminates with
αk ≥ min( αinit , 2τ(β − 1) gk^T pk / ( γk ‖pk‖₂² ) )
PROOF OF COROLLARY 2.2
Theorem 2.1 =⇒ linesearch will terminate as soon as α(l) ≤ αmax.
2 cases to consider:
1. It may be that αinit satisfies the Armijo condition, in which case αk = αinit.
2. Otherwise, there must be a last linesearch iteration (the l-th) for which α(l) > αmax, in which case
αk ≥ α(l+1) = τ α(l) > τ αmax.
Combining these two cases gives the required result.
GENERIC LINESEARCH METHOD
Given an initial guess x0, let k = 0
Until convergence:
Find a descent direction pk at xk
Compute a stepsize αk using a
backtracking-Armijo linesearch along pk
Set xk+1 = xk + αk pk , and increase k by 1
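A self-contained Python sketch of the generic method (illustrative names and tolerances; the Armijo backtracking from the previous slide is repeated inline, and the direction-finding rule is passed in as a callable, assumed to return a descent direction, so that steepest-descent, Newton and quasi-Newton variants differ only in that argument):

```python
import numpy as np

def generic_linesearch_method(f, grad, direction, x0, gtol=1e-6, max_iter=1000,
                              beta=1e-4, tau=0.5, alpha_init=1.0):
    """Generic linesearch method (sketch): x_{k+1} = x_k + alpha_k p_k."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) <= gtol:      # "until convergence"
            break
        p = direction(x, g)                # must be a descent direction: g^T p < 0
        alpha, fx, slope = alpha_init, f(x), float(g @ p)
        while f(x + alpha * p) > fx + alpha * beta * slope:
            alpha *= tau                   # backtracking-Armijo linesearch
        x = x + alpha * p
    return x
```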
GLOBAL CONVERGENCE THEOREM
Theorem 2.3. Suppose that f ∈ C 1 and that g is Lipschitz continuous on IRn. Then, for the iterates generated by the Generic
Linesearch Method,
either
gl = 0 for some l ≥ 0
or
lim_{k→∞} fk = −∞
or
lim_{k→∞} min( |pk^T gk| , |pk^T gk| / ‖pk‖₂ ) = 0.
PROOF OF THEOREM 2.3
Suppose that gk ≠ 0 for all k and that lim_{k→∞} fk > −∞. The Armijo condition gives
fk+1 − fk ≤ αk β pk^T gk
for all k, so summing over the first j iterations,
f_{j+1} − f0 ≤ Σ_{k=0}^{j} αk β pk^T gk.
The LHS is bounded below by assumption, so the RHS is bounded below. Since the sum is composed of negative terms,
lim_{k→∞} αk |pk^T gk| = 0.
Let
K1 := { k | αinit > 2τ(β − 1) gk^T pk / ( γ ‖pk‖₂² ) }  and  K2 := {1, 2, . . .} \ K1,
where γ is the assumed uniform Lipschitz constant.
For k ∈ K1,
αk ≥ 2τ(β − 1) gk^T pk / ( γ ‖pk‖₂² )
=⇒ αk pk^T gk ≤ 2τ(β − 1) (gk^T pk)² / ( γ ‖pk‖₂² ) < 0
=⇒ lim_{k∈K1, k→∞} |pk^T gk| / ‖pk‖₂ = 0.   (1)
For k ∈ K2, αk ≥ αinit, so
lim_{k∈K2, k→∞} |pk^T gk| = 0.   (2)
Combining (1) and (2) gives the required result.
METHOD OF STEEPEST DESCENT
The search direction
pk = −gk
gives the so-called steepest-descent direction.
pk is a descent direction
pk solves the problem
minimize_{p∈IRn} m_k^L(xk + p) := fk + gk^T p   subject to ‖p‖₂ = ‖gk‖₂
Any method that uses the steepest-descent direction is a
method of steepest descent.
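For instance, here is a sketch only, reusing the generic_linesearch_method sketch given earlier and the two-variable objective that appears in the contour figures in these notes: steepest descent simply supplies pk = −gk as the direction.

```python
import numpy as np

def f(v):                                   # f(x, y) = 10(y - x^2)^2 + (x - 1)^2
    x, y = v
    return 10.0 * (y - x**2)**2 + (x - 1.0)**2

def grad(v):
    x, y = v
    return np.array([-40.0 * x * (y - x**2) + 2.0 * (x - 1.0),
                     20.0 * (y - x**2)])

# steepest-descent direction p_k = -g_k
x_star = generic_linesearch_method(f, grad, lambda x, g: -g,
                                   x0=[-1.0, 1.0], max_iter=20000)
print(x_star)   # approaches the minimizer (1, 1), but typically very slowly
```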
GLOBAL CONVERGENCE FOR STEEPEST DESCENT
Theorem 2.4. Suppose that f ∈ C 1 and that g is Lipschitz continuous on IRn. Then, for the iterates generated by the Generic
Linesearch Method using the steepest-descent direction,
either
gl = 0 for some l ≥ 0
or
lim_{k→∞} fk = −∞
or
lim_{k→∞} gk = 0.
PROOF OF THEOREM 2.4
Follows immediately from Theorem 2.3, since
min( |pk^T gk| , |pk^T gk| / ‖pk‖₂ ) = ‖gk‖₂ min(1, ‖gk‖₂),
and thus
lim_{k→∞} min( |pk^T gk| , |pk^T gk| / ‖pk‖₂ ) = 0
implies that lim_{k→∞} gk = 0.
METHOD OF STEEPEST DESCENT (cont.)
archetypical globally convergent method
many other methods resort to steepest descent in bad cases
not scale invariant
convergence is usually very (very!) slow (linear)
numerically often not convergent at all
STEEPEST DESCENT EXAMPLE
Contours for the objective function f(x, y) = 10(y − x²)² + (x − 1)², and the iterates generated by the Generic Linesearch steepest-descent method
MORE GENERAL DESCENT METHODS
Let Bk be a symmetric, positive definite matrix, and define the
search direction pk so that
Bk pk = −gk
Then
pk is a descent direction
pk solves the problem
minimize_{p∈IRn} m_k^Q(xk + p) := fk + gk^T p + ½ p^T Bk p
if the Hessian Hk is positive definite, and Bk = Hk , this is
Newton’s method
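A sketch of how such a direction might be computed in practice (the helper name is hypothetical): solving Bk pk = −gk via a Cholesky factorization both produces pk and certifies that Bk is positive definite.

```python
import numpy as np

def descent_direction(B, g):
    """Solve B p = -g for symmetric positive definite B (sketch)."""
    L = np.linalg.cholesky(B)        # B = L L^T; raises LinAlgError if B is not SPD
    z = np.linalg.solve(L, -g)       # forward solve  L z = -g
    return np.linalg.solve(L.T, z)   # backward solve L^T p = z

# Newton's method corresponds to B = H(x_k), e.g. with a hypothetical Hessian callable hess:
#   direction = lambda x, g: descent_direction(hess(x), g)
```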
MORE GENERAL GLOBAL CONVERGENCE
Theorem 2.5. Suppose that f ∈ C 1 and that g is Lipschitz continuous on IRn. Then, for the iterates generated by the Generic
Linesearch Method using the more general descent direction, either
gl = 0 for some l ≥ 0
or
lim_{k→∞} fk = −∞
or
lim_{k→∞} gk = 0
provided that the eigenvalues of Bk are uniformly bounded and
bounded away from zero.
PROOF OF THEOREM 2.5
Let λmin(Bk) and λmax(Bk) be the smallest and largest eigenvalues of Bk. By assumption, there are bounds λmin > 0 and λmax such that
λmin ≤ λmin(Bk) ≤ s^T Bk s / ‖s‖₂² ≤ λmax(Bk) ≤ λmax
and thus that
1/λmax ≤ 1/λmax(Bk) = λmin(Bk^{-1}) ≤ s^T Bk^{-1} s / ‖s‖₂² ≤ λmax(Bk^{-1}) = 1/λmin(Bk) ≤ 1/λmin
for any nonzero vector s. Thus
|pk^T gk| = |gk^T Bk^{-1} gk| ≥ λmin(Bk^{-1}) ‖gk‖₂² ≥ λmax^{-1} ‖gk‖₂².
In addition,
‖pk‖₂² = gk^T Bk^{-2} gk ≤ λmax(Bk^{-2}) ‖gk‖₂² ≤ λmin^{-2} ‖gk‖₂²,
so that ‖pk‖₂ ≤ λmin^{-1} ‖gk‖₂, and hence
|pk^T gk| / ‖pk‖₂ ≥ (λmin/λmax) ‖gk‖₂.
Thus
min( |pk^T gk| , |pk^T gk| / ‖pk‖₂ ) ≥ (‖gk‖₂/λmax) min(λmin, ‖gk‖₂),
and so
lim_{k→∞} min( |pk^T gk| , |pk^T gk| / ‖pk‖₂ ) = 0  =⇒  lim_{k→∞} gk = 0.
MORE GENERAL DESCENT METHODS (cont.)
may be viewed as “scaled” steepest descent
convergence is often faster than steepest descent
can be made scale invariant for suitable Bk
CONVERGENCE OF NEWTON’S METHOD
Theorem 2.6. Suppose that f ∈ C2 and that H is Lipschitz continuous on IRn. Suppose furthermore that the iterates generated by the Generic Linesearch Method with αinit = 1 and β < 1/2, in which the search direction is chosen to be the Newton direction pk = −Hk^{-1} gk whenever possible, have a limit point x∗ for which H(x∗) is positive definite. Then
(i) αk = 1 for all sufficiently large k,
(ii) the entire sequence {xk} converges to x∗, and
(iii) the rate is Q-quadratic, i.e., there is a constant κ ≥ 0 such that
lim_{k→∞} ‖xk+1 − x∗‖₂ / ‖xk − x∗‖₂² ≤ κ.
PROOF OF THEOREM 2.6
Consider lim_{k∈K} xk = x∗. Continuity implies that Hk is positive definite for all k ∈ K sufficiently large, so there exists k0 ≥ 0 such that
pk^T Hk pk ≥ ½ λmin(H∗) ‖pk‖₂²
for all k0 ≤ k ∈ K, where λmin(H∗) is the smallest eigenvalue of H(x∗). Hence
|pk^T gk| = −pk^T gk = pk^T Hk pk ≥ ½ λmin(H∗) ‖pk‖₂²   (3)
for all k0 ≤ k ∈ K, and
lim_{k∈K, k→∞} pk = 0,
since Theorem 2.5 implies that at least one of the left-hand side of (3) and
|pk^T gk| / ‖pk‖₂ = −pk^T gk / ‖pk‖₂ ≥ ½ λmin(H∗) ‖pk‖₂
converges to zero for such k.
Taylor’s theorem implies that there exists zk between xk and xk + pk such that
f(xk + pk) = fk + pk^T gk + ½ pk^T H(zk) pk.
Lipschitz continuity of H and Hk pk + gk = 0 then give
f(xk + pk) − fk − ½ pk^T gk = ½ (pk^T gk + pk^T H(zk) pk)
  = ½ (pk^T gk + pk^T Hk pk) + ½ (pk^T (H(zk) − Hk) pk)
  ≤ ½ γ ‖zk − xk‖₂ ‖pk‖₂² ≤ ½ γ ‖pk‖₂³.   (4)
Now pick k sufficiently large so that
γ ‖pk‖₂ ≤ ½ λmin(H∗)(1 − 2β).
Then (3) and (4) give
f(xk + pk) − fk ≤ ½ pk^T gk + ¼ λmin(H∗)(1 − 2β) ‖pk‖₂²
  ≤ ½ pk^T gk − ½ (1 − 2β) pk^T gk = ½ (1 − (1 − 2β)) pk^T gk = β pk^T gk,
so the unit stepsize satisfies the Armijo condition for all sufficiently large k ∈ K.
Now note that ‖Hk^{-1}‖₂ ≤ 2/λmin(H∗) for all sufficiently large k ∈ K. The iteration gives
xk+1 − x∗ = xk − x∗ − Hk^{-1} gk = xk − x∗ − Hk^{-1} (gk − g(x∗))
  = Hk^{-1} ( g(x∗) − gk − Hk (x∗ − xk) ).
But Theorem 1.3 implies
‖g(x∗) − gk − Hk (x∗ − xk)‖₂ ≤ γ ‖x∗ − xk‖₂²,
so that
‖xk+1 − x∗‖₂ ≤ γ ‖Hk^{-1}‖₂ ‖x∗ − xk‖₂²,
which is (iii) with κ = 2γ/λmin(H∗) for k ∈ K.
Result (ii) follows since, once an iterate becomes sufficiently close to x∗, (iii) for k ∈ K sufficiently large implies k + 1 ∈ K, and hence K = IN. Thus (i) and (iii) are true for all k sufficiently large.
NEWTON METHOD EXAMPLE
Contours for the objective function f(x, y) = 10(y − x²)² + (x − 1)², and the iterates generated by the Generic Linesearch Newton method
MODIFIED NEWTON METHODS
If Hk is indefinite, it is usual to solve instead
(Hk + Mk )pk ≡ Bk pk = −gk
where
Mk chosen so that Bk = Hk + Mk is “sufficiently” positive definite
Mk = 0 when Hk is itself “sufficiently” positive definite
Possibilities:
If Hk has the spectral decomposition Hk = Qk Dk Qk^T then
Bk ≡ Hk + Mk = Qk max(εI, |Dk|) Qk^T for some small ε > 0
Mk = max(0, − λmin(Hk ))I
Modified Cholesky: Bk ≡ Hk + Mk = Lk LTk
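A sketch of the first (spectral) possibility above; the threshold eps, which controls how positive definite Bk is forced to be, is an illustrative choice.

```python
import numpy as np

def spectral_modification(H, eps=1e-8):
    """Return B = Q max(eps*I, |D|) Q^T from the spectral decomposition of H (sketch)."""
    d, Q = np.linalg.eigh(H)             # H = Q diag(d) Q^T with real eigenvalues d
    d_mod = np.maximum(eps, np.abs(d))   # flip negative eigenvalues, lift tiny ones
    return (Q * d_mod) @ Q.T             # rebuild Q diag(d_mod) Q^T
```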
QUASI-NEWTON METHODS
Various attempts to approximate Hk :
Finite-difference approximations:
Hk ei ≈ h^{-1} ( g(xk + h ei) − gk ) = Bk ei
for some “small” scalar h > 0
Secant approximations: try to ensure the secant condition
Bk+1sk = yk ≈ Hk+1sk , where sk = xk+1 − xk and yk = gk+1 − gk
Symmetric Rank-1 method (but may be indefinite or even fail):
Bk+1 = Bk + (yk − Bk sk)(yk − Bk sk)^T / ( (yk − Bk sk)^T sk )
BFGS method (symmetric and positive definite if yk^T sk > 0):
Bk+1 = Bk + yk yk^T / (yk^T sk) − Bk sk sk^T Bk / (sk^T Bk sk)
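A sketch of the BFGS update; the skip test for small yk^T sk is a common safeguard and an assumption here, not something specified on this slide.

```python
import numpy as np

def bfgs_update(B, s, y, tol=1e-10):
    """Apply B_{k+1} = B_k + y y^T/(y^T s) - B s s^T B/(s^T B s) (sketch)."""
    ys = float(y @ s)
    if ys <= tol * np.linalg.norm(y) * np.linalg.norm(s):
        return B                          # skip: update would not preserve positive definiteness
    Bs = B @ s
    return B + np.outer(y, y) / ys - np.outer(Bs, Bs) / float(s @ Bs)
```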
MINIMIZING A CONVEX QUADRATIC MODEL
For convex models (Bk positive definite)
pk = (approximate) arg min_{p∈IRn} fk + p^T gk + ½ p^T Bk p
Generic convex quadratic problem (B positive definite):
(approximately) minimize_{p∈IRn} q(p) = p^T g + ½ p^T B p
MINIMIZATION OVER A SUBSPACE
Given vectors {d^0, . . . , d^{i−1}}, let
D^i = (d^0 : · · · : d^{i−1})
and define the subspace 𝒟^i = { p | p = D^i p_d for some p_d ∈ IR^i }
p^i = arg min_{p ∈ 𝒟^i} q(p)
Result: (D^i)^T g^i = 0, where g^i = B p^i + g
Proof: we require p^i = D^i p_d^i, where p_d^i = arg min_{p_d ∈ IR^i} q(D^i p_d).
But q(D^i p_d) = p_d^T (D^i)^T g + ½ p_d^T (D^i)^T B D^i p_d =⇒
0 = (D^i)^T B D^i p_d^i + (D^i)^T g = (D^i)^T (B D^i p_d^i + g) = (D^i)^T (B p^i + g) = (D^i)^T g^i
Equivalently: (d^j)^T g^i = 0 for j = 0, . . . , i − 1
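A small numerical check of this result (a sketch using randomly generated data; none of the names come from the notes): minimizing q over the range of D^i reduces to a linear solve, after which (D^i)^T g^i vanishes to rounding error.

```python
import numpy as np

rng = np.random.default_rng(0)
n, i = 6, 3
A = rng.standard_normal((n, n))
B = A @ A.T + n * np.eye(n)                  # a random symmetric positive definite B
g = rng.standard_normal(n)
D = rng.standard_normal((n, i))              # columns d^0, ..., d^{i-1}

# p^i = D p_d, where p_d = arg min q(D p_d)  <=>  (D^T B D) p_d = -D^T g
p_d = np.linalg.solve(D.T @ B @ D, -D.T @ g)
p_i = D @ p_d
print(np.max(np.abs(D.T @ (B @ p_i + g))))   # ~1e-15, i.e. D^T g^i = 0
```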
MINIMIZATION OVER A SUBSPACE (cont.)
(d^j)^T g^i = 0 for j = 0, . . . , i − 1, where g^i = B p^i + g
Result: p^i = p^{i−1} − (d^{i−1})^T g^{i−1} D^i ((D^i)^T B D^i)^{-1} e_i
Proof: Clearly p^{i−1} ∈ 𝒟^{i−1} ⊂ 𝒟^i
=⇒ we require p^i = p^{i−1} + D^i p_d^i, where p_d^i = arg min_{p_d ∈ IR^i} q(p^{i−1} + D^i p_d).
But q(p^{i−1} + D^i p_d)
= q(p^{i−1}) + p_d^T (D^i)^T (g + B p^{i−1}) + ½ p_d^T (D^i)^T B D^i p_d
= q(p^{i−1}) + p_d^T (D^i)^T g^{i−1} + ½ p_d^T (D^i)^T B D^i p_d
= q(p^{i−1}) + p_d^T ((d^{i−1})^T g^{i−1}) e_i + ½ p_d^T (D^i)^T B D^i p_d,
where e_i is the i-th unit vector =⇒
p_d^i = −(d^{i−1})^T g^{i−1} ((D^i)^T B D^i)^{-1} e_i
MINIMIZATION OVER A B-CONJUGATE SUBSPACE
Minimizer over 𝒟^i: p^i = p^{i−1} − (d^{i−1})^T g^{i−1} D^i ((D^i)^T B D^i)^{-1} e_i
Suppose in addition that the members of 𝒟^i are B-conjugate:
B-conjugacy: (d^i)^T B d^j = 0 (i ≠ j)
Result: p^i = p^{i−1} + α^{i−1} d^{i−1}, where
α^{i−1} = − (d^{i−1})^T g^{i−1} / ((d^{i−1})^T B d^{i−1})
Proof: (D^i)^T B D^i is a diagonal matrix with entries (d^j)^T B d^j for j = 0, . . . , i − 1
=⇒ ((D^i)^T B D^i)^{-1} is a diagonal matrix with entries 1/(d^j)^T B d^j for j = 0, . . . , i − 1
=⇒ ((D^i)^T B D^i)^{-1} e_i = (1/(d^{i−1})^T B d^{i−1}) e_i
BUILDING A B-CONJUGATE SUBSPACE
(d^j)^T g^i = 0 for j = 0, . . . , i − 1
Since this implies g^i is independent of 𝒟^i, let
d^i = −g^i + Σ_{j=0}^{i−1} β^{ij} d^j
Aim: find the β^{ij} so that d^i is B-conjugate to 𝒟^i
Result (orthogonal gradients): (g^i)^T g^j = 0 for all i ≠ j
Proof: span{g^0, . . . , g^i} = span{d^0, . . . , d^i}
=⇒ g^j = Σ_{k=0}^{j} γ^{j,k} d^k for some γ^{j,k}
=⇒ (g^i)^T g^j = Σ_{k=0}^{j} γ^{j,k} (g^i)^T d^k = 0 when j < i
BUILDING A B-CONJUGATE SUBSPACE (cont.)
d^i = −g^i + Σ_{j=0}^{i−1} β^{ij} d^j and (d^j)^T g^i = 0 for j = 0, . . . , i − 1, where g^i = B p^i + g
Result: (g^i)^T d^i = −‖g^i‖₂²
Proof: (g^i)^T d^i = −(g^i)^T g^i + Σ_{j=0}^{i−1} β^{ij} (g^i)^T d^j = −‖g^i‖₂²
Corollary: α^i = ‖g^i‖₂² / ((d^i)^T B d^i) ≠ 0 ⟺ g^i ≠ 0
Proof: by definition
α^i = −(g^i)^T d^i / ((d^i)^T B d^i)
BUILDING A B-CONJUGATE SUBSPACE (cont.)
d^i = −g^i + Σ_{j=0}^{i−1} β^{ij} d^j and (g^i)^T g^j = 0 for all i ≠ j
Result: (g^i)^T B d^j = 0 if j < i − 1 and (g^i)^T B d^{i−1} = ‖g^i‖₂² / α^{i−1}
Proof: p^{j+1} = p^j + α^j d^j and g^{j+1} = B p^{j+1} + g
=⇒ g^{j+1} = g^j + α^j B d^j
=⇒ (g^i)^T g^{j+1} = (g^i)^T g^j + α^j (g^i)^T B d^j
=⇒ (g^i)^T B d^j = 0 if j < i − 1,
while (g^i)^T g^i = (g^i)^T g^{i−1} + α^{i−1} (g^i)^T B d^{i−1} if j = i − 1
=⇒ (g^i)^T B d^{i−1} = ‖g^i‖₂² / α^{i−1}
BUILDING A B-CONJUGATE SUBSPACE (cont.)
d^i = −g^i + Σ_{k=0}^{i−1} β^{ik} d^k, (d^k)^T B g^i = 0 if k < i − 1, (d^{i−1})^T B g^i = ‖g^i‖₂² / α^{i−1}, and α^{i−1} = ‖g^{i−1}‖₂² / ((d^{i−1})^T B d^{i−1})
Result: β^{ij} = 0 for j < i − 1 and β^{i,i−1} ≡ β^i = ‖g^i‖₂² / ‖g^{i−1}‖₂²
Proof: B-conjugacy =⇒
0 = (d^j)^T B d^i = −(d^j)^T B g^i + Σ_{k=0}^{i−1} β^{ik} (d^j)^T B d^k = −(d^j)^T B g^i + β^{ij} (d^j)^T B d^j
=⇒ β^{ij} = (d^j)^T B g^i / ((d^j)^T B d^j).
The result is immediate for j < i − 1. For j = i − 1,
β^{i,i−1} = (d^{i−1})^T B g^i / ((d^{i−1})^T B d^{i−1}) = ‖g^i‖₂² / (α^{i−1} (d^{i−1})^T B d^{i−1}) = ‖g^i‖₂² / ‖g^{i−1}‖₂²
CONJUGATE-GRADIENT METHOD
Given p^0 = 0, set g^0 = g, d^0 = −g and i = 0.
Until g^i is “small”, iterate:
α^i = −(g^i)^T d^i / ((d^i)^T B d^i)
p^{i+1} = p^i + α^i d^i
g^{i+1} = g^i + α^i B d^i
β^i = ‖g^{i+1}‖₂² / ‖g^i‖₂²
d^{i+1} = −g^{i+1} + β^i d^i
and increase i by 1
Important features:
(d^j)^T g^{i+1} = 0 = (g^j)^T g^{i+1} for all j = 0, . . . , i
g^T p^i < 0 for i = 1, . . . , n =⇒ descent direction for any pk = p^i
stop when ‖g^i‖ ≤ min(‖g‖^ω, η) ‖g‖ (0 < η, ω < 1) =⇒ fast convergence
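A direct Python transcription of the iteration, as a sketch; the defaults for η and ω in the stopping rule are illustrative choices.

```python
import numpy as np

def conjugate_gradient(B, g, eta=0.1, omega=0.5, max_iter=None):
    """Minimize q(p) = g^T p + 0.5 p^T B p for symmetric positive definite B (sketch)."""
    n = len(g)
    max_iter = n if max_iter is None else max_iter
    p, gi, d = np.zeros(n), g.astype(float).copy(), -g.astype(float)
    gnorm0 = np.linalg.norm(g)
    stop = min(gnorm0**omega, eta) * gnorm0        # ||g^i|| <= min(||g||^omega, eta) ||g||
    for _ in range(max_iter):
        if np.linalg.norm(gi) <= stop:
            break
        Bd = B @ d
        alpha = -float(gi @ d) / float(d @ Bd)
        p = p + alpha * d
        gi_new = gi + alpha * Bd
        beta = float(gi_new @ gi_new) / float(gi @ gi)
        d = -gi_new + beta * d
        gi = gi_new
    return p
```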
CONJUGATE GRADIENT METHOD GIVES DESCENT
(g^{i−1})^T d^{i−1} = (d^{i−1})^T (g + B p^{i−1}) = (d^{i−1})^T g + Σ_{j=0}^{i−2} α^j (d^{i−1})^T B d^j = (d^{i−1})^T g
p^i minimizes q(p) in 𝒟^i =⇒
p^i = p^{i−1} − ((g^{i−1})^T d^{i−1} / ((d^{i−1})^T B d^{i−1})) d^{i−1} = p^{i−1} − (g^T d^{i−1} / ((d^{i−1})^T B d^{i−1})) d^{i−1}
=⇒ g^T p^i = g^T p^{i−1} − (g^T d^{i−1})² / ((d^{i−1})^T B d^{i−1})
=⇒ g^T p^i < g^T p^{i−1} =⇒ (by induction) g^T p^i < 0,
since
g^T p^1 = −‖g‖₂⁴ / (g^T B g) < 0
=⇒ pk = p^i is a descent direction
CG METHODS FOR GENERAL QUADRATICS
Suppose f(x) is quadratic and x = x0 + p. Taylor's theorem =⇒
f(x) = f(x0 + p) = f(x0) + p^T g(x0) + ½ p^T H(x0) p
can minimize as a function of p using CG
if x^i = x0 + p^i =⇒ g^i = g(x0) + H(x0) p^i = g(x^i)
α^i = −g(x^i)^T d^i / ((d^i)^T H(x0) d^i) = arg min_α f(x^i + α d^i)
NONLINEAR CONJUGATE-GRADIENT METHODS
method for minimizing quadratic f(x):
Given x^0 and g(x^0), set d^0 = −g(x^0) and i = 0.
Until g(x^i) is “small”, iterate:
α^i = arg min_α f(x^i + α d^i)
x^{i+1} = x^i + α^i d^i
β^i = ‖g(x^{i+1})‖₂² / ‖g(x^i)‖₂²
d^{i+1} = −g(x^{i+1}) + β^i d^i
and increase i by 1
may also be used for nonlinear f(x) (Fletcher & Reeves)
replace the calculation of α^i by a suitable linesearch
other methods pick different β^i to ensure descent
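A sketch of the Fletcher-Reeves variant for general f, with the exact minimization of α replaced by a backtracking-Armijo linesearch and a steepest-descent restart whenever the direction fails to be a descent direction; the linesearch parameters and the restart safeguard are assumptions, not part of the slide.

```python
import numpy as np

def fletcher_reeves(f, grad, x0, gtol=1e-6, max_iter=5000, beta_ls=1e-4, tau=0.5):
    """Nonlinear conjugate-gradient (Fletcher-Reeves) method (sketch)."""
    x = np.asarray(x0, dtype=float)
    g = grad(x)
    d = -g
    for _ in range(max_iter):
        if np.linalg.norm(g) <= gtol:
            break
        slope = float(g @ d)
        if slope >= 0.0:                    # safeguard: restart with steepest descent
            d, slope = -g, -float(g @ g)
        alpha, fx = 1.0, f(x)               # inexact linesearch replaces arg min over alpha
        while f(x + alpha * d) > fx + alpha * beta_ls * slope:
            alpha *= tau
        x = x + alpha * d
        g_new = grad(x)
        beta = float(g_new @ g_new) / float(g @ g)   # Fletcher-Reeves beta^i
        d = -g_new + beta * d
        g = g_new
    return x
```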