
Multivariate Unconstrained Optimisation
• First we consider algorithms for functions for which derivatives are not available.
• Could try to extend a direct method such as Golden Section search.
• The number of function evaluations increases as e^n, where n is the number of dimensions.
[Figure: Golden Section bracketing of the interval [a, b] in one dimension, and the corresponding region in two dimensions.]
The Polytope Algorithm
• This is a direct search method.
• Also known as “simplex” method.
• In the n-dimensional case, at each stage we have n+1 points x1, x2, …, xn+1 such that:
  F(x1) ≤ F(x2) ≤ … ≤ F(xn+1)
• The algorithm seeks to replace the worst point, xn+1, with a better one.
• The xi lie at the vertices of an n-dimensional polytope.
The Polytope Algorithm 2
• The new point is formed by reflecting the worst point through the centroid of the best n vertices:
  c = (1/n) Σi=1..n xi
• Mathematically the new point can be written:
  xr = c + α(c − xn+1)
  where α > 0 is the reflection coefficient.
• In two dimensions polytope is a triangle; in three
dimensions it is a tetrahedron.
Polytope Example
• For n = 2 we have three points at each step.
[Figure: the worst point x3 is reflected through the centroid c of x1 and x2, giving xr = c + α(c − x3).]
Detailed Polytope Algorithm
1. Evaluate F(xr) ≡ Fr. If F1 ≤ Fr ≤ Fn, then xr replaces xn+1.
2. If Fr < F1 then xr is the new best point and we assume the direction of reflection is "good" and attempt to expand the polytope in that direction by defining the point
   xe = c + β(xr − c)
   where β > 1. If Fe < Fr then xe replaces xn+1; otherwise xr replaces xn+1.
Detailed Polytope Algorithm 2
3. If Fr > Fn then the polytope is too big and we attempt to contract it by defining:
   xc = c + γ(xn+1 − c) if Fr ≥ Fn+1
   xc = c + γ(xr − c) if Fr < Fn+1
   where 0 < γ < 1. If Fc < min(Fr, Fn+1) then xc replaces xn+1; otherwise a further contraction is done. A MATLAB sketch of these rules is given below.
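The three rules translate directly into a short MATLAB function. The sketch below is a minimal implementation using the reflection, expansion and contraction coefficients α (alpha), β (beta) and γ (gamma) defined above; treating the "further contraction" as a shrink of every vertex towards the best one, and stopping after a fixed number of iterations, are illustrative choices rather than part of the slides.

function [x, fval] = polytope(F, X, alpha, beta, gamma, maxit)
% Direct-search minimisation of F using the polytope rules above.
% F is a function handle taking a column vector; the columns of the
% n x (n+1) matrix X are the initial vertices.
[n, m] = size(X);                                 % m = n+1 vertices
fv = zeros(1, m);
for j = 1:m, fv(j) = F(X(:, j)); end
for k = 1:maxit
    [fv, order] = sort(fv);  X = X(:, order);     % F(x1) <= F(x2) <= ... <= F(xn+1)
    c  = mean(X(:, 1:n), 2);                      % centroid of the best n vertices
    xr = c + alpha*(c - X(:, end));  fr = F(xr);  % reflection
    if fr < fv(1)                                 % new best point: try to expand
        xe = c + beta*(xr - c);  fe = F(xe);
        if fe < fr, X(:, end) = xe; fv(end) = fe;
        else,       X(:, end) = xr; fv(end) = fr; end
    elseif fr <= fv(n)                            % neither best nor worst: accept xr
        X(:, end) = xr;  fv(end) = fr;
    else                                          % too big: contract
        if fr >= fv(end), xc = c + gamma*(X(:, end) - c);
        else,             xc = c + gamma*(xr - c); end
        fc = F(xc);
        if fc < min(fr, fv(end))
            X(:, end) = xc;  fv(end) = fc;
        else                                      % further contraction: shrink towards the best vertex
            for j = 2:m
                X(:, j) = X(:, 1) + gamma*(X(:, j) - X(:, 1));
                fv(j)   = F(X(:, j));
            end
        end
    end
end
[fval, best] = min(fv);  x = X(:, best);
end

For example, with the banana function of the next slide:
>> banana = @(x) 10*(x(2)-x(1)^2)^2 + (1-x(1))^2;
>> X0 = [-1.2 -1.1 -1.2; 1 1 1.1];    % columns are the vertices of the starting triangle
>> [x, fval] = polytope(banana, X0, 1, 1.5, 0.5, 200)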
MATLAB Example Polytope
>> banana = @(x)10*(x(2)-x(1)^2)^2+(1-x(1))^2;
>> [x,fval] = fminsearch(banana,[-1.2, 1],optimset('Display','iter'))
Polytope Example by Hand
• The test function F(x, y) for this example is a nonquadratic function of two variables built from the quadratic groups x² + y² + 0.5, 2x² + (y − 1)² and 3[(x − 2)² + (y − 1)² + 0.5]; its values at the trial points are quoted in the steps that follow.
Polytope Example
• Start with equilateral triangle:
x1 = (0,0) x2=(0,0.5) x3=(3,1)/4
• Take =1, =1.5, and =0.5
Polytope Example: Step 1
• Polytope is:
    i   xi                F(xi)
    1   (0, 0)            9.7918
    2   (0, 0.5)          7.3153
    3   (0.433, 0.25)     4.8601
• Worst point is x1, c = (x2 + x3)/2 = (0.2165, 0.375)
• Relabel points: x3 → x1, x1 → x3
• xr = c + α(c − x3) = (0.433, 0.75) and F(xr) = 3.6774
• F(xr) < F(x1) so xr is the best point, so try to expand.
• xe = c + β(xr − c) = (0.5413, 0.9375) and F(xe) = 3.1086
• F(xe) < F(xr) so accept the expansion.
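The point arithmetic in this step can be checked with a few MATLAB commands (the function values quoted above are simply taken from the slides, since only the geometry is being verified here):

>> x1 = [0; 0];  x2 = [0; 0.5];  x3 = [sqrt(3); 1]/4;
>> c  = (x2 + x3)/2          % centroid of the two best points: (0.2165, 0.375)
>> xr = c + 1.0*(c - x1)     % reflect the worst point, alpha = 1: (0.433, 0.75)
>> xe = c + 1.5*(xr - c)     % expand, beta = 1.5: (0.5413, 0.9375)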
After Step 1
Polytope Example: Step 2
• Polytope is:
    i   xi                  F(xi)
    1   (0.433, 0.25)       4.8601
    2   (0, 0.5)            7.3153
    3   (0.5413, 0.9375)    3.1086
• Worst point is x2, c = (x1 + x3)/2 = (0.4871, 0.5938)
• Relabel points: x3 → x1, x2 → x3, x1 → x2
• xr = c + α(c − x3) = (0.9743, 0.6875) and F(xr) = 2.0093
• F(xr) < F(x1) so xr is the best point, so try to expand.
• xe = c + β(xr − c) = (1.2179, 0.7344) and F(xe) = 2.2837
• F(xe) > F(xr) so reject the expansion; xr replaces the worst point.
After Step 2
Polytope Example: Step 3
• Polytope is:
    i   xi                  F(xi)
    1   (0.5413, 0.9375)    3.1086
    2   (0.433, 0.25)       4.8601
    3   (0.9743, 0.6875)    2.0093
• Worst point is x2, c = (x1 + x3)/2 = (0.7578, 0.8125)
• Relabel points: x3 → x1, x2 → x3, x1 → x2
• xr = c + α(c − x3) = (1.0826, 1.375) and F(xr) = 3.1199
• F(xr) > F(x2) so the polytope is too big; need to contract.
• Since F(xr) < F(x3), xc = c + γ(xr − c) = (0.9202, 1.0938) and F(xc) = 2.2476
• F(xc) < F(xr) so accept the contraction.
After Step 3
Polytope Example: Step 4
• Polytope is:
    i   xi                  F(xi)
    1   (0.9743, 0.6875)    2.0093
    2   (0.5413, 0.9375)    3.1086
    3   (0.9202, 1.0938)    2.2476
• Worst point is x2, c = (x1 + x3)/2 = (0.9472, 0.8906)
• Relabel points: x3 → x2, x2 → x3
• xr = c + α(c − x3) = (1.3532, 0.8438) and F(xr) = 2.7671
• F(xr) > F(x2) so the polytope is too big; need to contract.
• xc = c + γ(xr − c) = (1.1502, 0.8672) and F(xc) = 2.1391
• F(xc) < F(xr) so accept the contraction.
After Step 4
Polytope Example: Step 5
• Polytope is:
    i   xi                  F(xi)
    1   (0.9743, 0.6875)    2.0093
    2   (0.9202, 1.0938)    2.2476
    3   (1.1502, 0.8672)    2.1391
• Worst point is x2, c = (x1 + x3)/2 = (1.0622, 0.7773)
• Relabel points: x3 → x2, x2 → x3
• xr = c + α(c − x3) = (1.2043, 0.4609) and F(xr) = 2.6042
• F(xr) ≥ F(x3) so the polytope is too big; need to contract.
• xc = c + γ(x3 − c) = (0.9912, 0.9355) and F(xc) = 2.0143
• F(xc) < F(xr) so accept the contraction.
After Step 5
Polytope Example: Step 6
• Polytope is:
    i   xi                  F(xi)
    1   (0.9743, 0.6875)    2.0093
    2   (1.1502, 0.8672)    2.1391
    3   (0.9912, 0.9355)    2.0143
• Worst point is x2, c = (x1 + x3)/2 = (0.9827, 0.8117)
• Relabel points: x3 → x2, x2 → x3
• xr = c + α(c − x3) = (0.8153, 0.7559) and F(xr) = 2.1314
• F(xr) > F(x2) so the polytope is too big; need to contract.
• xc = c + γ(xr − c) = (0.8990, 0.7837) and F(xc) = 2.0012
• F(xc) < F(xr) so accept the contraction.
Polytope Example: Final Result
• So after 6 steps the best estimate of the
minimum is x = (0.8990,0.7837) for which
F(x)=2.0012.
Alternating Variables Method
• Start from the point x = (a1, a2, …, an).
• Take the first variable, x1, and minimise F(x1, a2, …, an) with respect to x1. This gives x1 = a1′.
• Take the second variable, x2, and minimise F(a1′, x2, a3, …, an) with respect to x2. This gives x2 = a2′.
• Continue with each variable in turn until the minimum is reached.
AVM in Two Dimensions
[Figure: path of the AVM iterates in two dimensions, starting from the point labelled "Start".]
• The method of minimisation over each variable can be any univariate method.
AVM Example in 2D
• Minimise F(x, y) = x² + y² + xy − 2x − 4y
• ∂F/∂x = 2x + y − 2 = 0  ⇒  x = (2 − y)/2
• ∂F/∂y = 2y + x − 4 = 0  ⇒  y = (4 − x)/2
• Start at (0, 0).
AVM Example in 2D
  x        y        F(x, y)     |error|
  0        0         0          4
  1        0        -1          3
  1        1.5      -3.25       0.75
  0.25     1.5      -3.8125     0.1875
  0.25     1.875    -3.953      0.047
  0.0625   1.875    -3.988      0.012
  0.0625   1.968    -3.9971     0.0029
  0.0156   1.968    -3.9992     0.0008
  0.0156   1.992    -3.9998     0.0002
  0.004    1.992    -3.99995    0.00005
  0.004    1.998    -3.99999    0.00001
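The table above can be reproduced with a short MATLAB loop. The univariate minimisations have the closed forms derived on the previous slide, x = (2 − y)/2 and y = (4 − x)/2, so no univariate search routine is needed; the five passes and the printing format are just enough to generate the rows shown.

F = @(x, y) x.^2 + y.^2 + x.*y - 2*x - 4*y;
x = 0;  y = 0;                     % starting point
show = @(x, y) fprintf('%8.4f %8.4f %10.5f %10.5f\n', x, y, F(x, y), abs(F(x, y) + 4));
show(x, y);                        % |error| = |F - Fmin|, with Fmin = -4 at (0, 2)
for k = 1:5
    x = (2 - y)/2;  show(x, y);    % minimise over x with y fixed
    y = (4 - x)/2;  show(x, y);    % minimise over y with x fixed
end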
Definition of Gradient Vector
• The gradient vector is the vector of first partial derivatives:
  g(x) = ( ∂F/∂x1, ∂F/∂x2, …, ∂F/∂xn )T
• The gradient vector is also written as ∇F(x).
Definition of Hessian Matrix
• The Hessian matrix is defined as the matrix of second partial derivatives:

          [ ∂²F(x)/∂x1²      ∂²F(x)/∂x1∂x2   …   ∂²F(x)/∂x1∂xn ]
  G(x) =  [ ∂²F(x)/∂x2∂x1    ∂²F(x)/∂x2²     …   ∂²F(x)/∂x2∂xn ]
          [       ⋮                ⋮          ⋱        ⋮        ]
          [ ∂²F(x)/∂xn∂x1    ∂²F(x)/∂xn∂x2   …   ∂²F(x)/∂xn²   ]

• The Hessian matrix is symmetric, and is also written as ∇²F(x).
Conditions for a Minimum of a
Multivariate Function
1. |g(x*)| = 0. That is, all partial derivatives
are zero.
2. G(x*) is positive definite. That is,
   xTG(x*)x > 0 for all vectors x ≠ 0.
• The second condition implies that the eigenvalues of G(x*) are strictly positive.
Stationary Points
• If g(x*) = 0 then x* is said to be a stationary point.
• There are 3 types of stationary point:
  1. Minimum, e.g., x² + y² at (0, 0)
  2. Maximum, e.g., 1 − x² − y² at (0, 0)
  3. Saddle point, e.g., x² − y² at (0, 0)
Definition: Level Surface
• F(x)=constant defines a “level surface”.
• For different values of the constant we generate
different level surfaces.
• For example, in 3-D suppose
  F(x, y, z) = x²/4 + y²/9 + z²/4
• F(x,y,z) = constant is an ellipsoid surface
centred on the origin.
• Thus, the level surfaces are a series of
concentric ellipsoidal surfaces.
• The gradient vector at point x is normal to the
level surface passing through x.
Definition: Tangent Hyperplane
• For a differentiable multivariate function, F,
the tangent hyperplane at the point xt on
the surface F(x)=constant is normal to the
gradient vector.
Definition: Quadratic Function
• If the Hessian matrix of F is constant then
F is said to be a quadratic function.
• In this case F can be expressed as:
  F(x) = (1/2)xTGx + cTx + α
  for a constant matrix G, vector c, and scalar α.
• ∇F(x) = Gx + c and ∇²F(x) = G.
Example Quadratic Function
• F(x, y) = x² + 2y² + xy − x + 2y
• In matrix form F(x) = (1/2)xTGx + cTx with G = [2 1; 1 4] and c = (-1, 2)T.
• The gradient vector is zero at a stationary point, so Gx + c = 0 at the stationary point.
• Need to solve Gx = −c to find the stationary point:
  x* = −G⁻¹c  ⇒  x* = (6/7, −5/7)T
Hessian Matrix Again
• We can predict the behaviour of a general
nonlinear function near a stationary point, x*, by
looking at the eigenvalues of the Hessian matrix.
• Let uj and λj denote the jth eigenvector and eigenvalue of G.
• If λj > 0 the function will increase as we move away from x* in direction uj.
• If λj < 0 the function will decrease as we move away from x* in direction uj.
• If λj = 0 the function will stay constant, to second order, as we move away from x* in direction uj.
Example Again
• For this example, G = [2 1; 1 4].
• λ1 = 1.5858 and λ2 = 4.4142, so F increases as we move away from the stationary point at (6/7, −5/7)T.
• So the stationary point is a minimum.
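Both the stationary point and the eigenvalues can be checked directly in MATLAB:

>> G = [2 1; 1 4];  c = [-1; 2];
>> xstar = G\(-c)        % stationary point (6/7, -5/7) = (0.8571, -0.7143)
>> lambda = eig(G)       % eigenvalues 1.5858 and 4.4142, both positive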
Example in 4D
• Find the stationary point of the quadratic function F(x) = (1/2)xTGx + cTx with

  G = [  2   1   0   2
         1   2  -1   3
         0  -1   4   1
         2   3   1  -2 ],     c = (-1, 2, 3, 4)T
• In MATLAB:
>> c = [-1 2 3 4]';
>> G = [2 1 0 2; 1 2 -1 3; 0 -1 4 1; 2 3 1 -2];
>> x = G\(-c)
>> [u, lambda] = eigs(G)
Descent Methods
• Seek a general algorithm for
unconstrained minimisation of a smooth
multivariate function.
• Require that F decreases at each iteration.
• A method that imposes this type of
condition is called a descent method.
A General Descent Algorithm
Let xk be the current iterate.
1. If converged then quit; xk is the estimate of the minimum.
2. Compute a nonzero vector pk giving the direction of search.
3. Compute a positive scalar step length, αk, for which F(xk + αk pk) < F(xk).
4. New estimate of the minimum is xk+1 = xk + αk pk. Increment k by 1, and go to step 1.
Method of Steepest Descent
• The direction in which F decreases most steeply is −∇F, so we use this as the search direction.
• The new iterate is xk+1 = xk − λk∇F(xk), where λk is a non-negative scalar chosen so that xk+1 is the minimum point along the line from xk in the direction −∇F(xk).
• Thus, λk minimises F(xk − λ∇F(xk)) with respect to λ.
Steepest Descent Algorithm
• Initialise: x0, k = 0
• Loop:
    u = ∇F(xk)
    if |u| = 0 then quit
    else minimise h(λ) = F(xk − λu) to get λk
    xk+1 = xk − λk u
    k = k + 1
    if (not finished) go to Loop
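A minimal MATLAB sketch of this loop, applied to the example function used on the following slides; the gradient tolerance, the search interval [0, 1] handed to fminbnd, and the iteration cap are illustrative choices.

F     = @(v) v(1)^3 + v(2)^3 - 2*v(1)^2 + 3*v(2)^2 - 8;
gradF = @(v) [3*v(1)^2 - 4*v(1); 3*v(2)^2 + 6*v(2)];
x = [1; -1];                              % starting point of the worked example
for k = 1:20
    u = gradF(x);
    if norm(u) < 1e-8, break; end         % |u| = 0 (to within tolerance): quit
    h = @(lambda) F(x - lambda*u);        % function of the step length only
    lambda = fminbnd(h, 0, 1);            % minimise h(lambda)
    x = x - lambda*u;
end
x                                          % approaches the minimum at (4/3, 0)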
Example
• F(x, y) = x³ + y³ − 2x² + 3y² − 8

  ∇F(x, y) = ( 3x² − 4x,  3y² + 6y )T,    G(x, y) = [ 6x − 4     0
                                                        0      6y + 6 ]

• ∇F(x, y) = 0 gives 3x² − 4x = 0, so x = 0 or 4/3; and 3y² + 6y = 0, so y = 0 or −2.
  (x, y)       G(x, y)              Type
  (0, 0)       Indefinite           Saddle point
  (0, -2)      Negative definite    Maximum
  (4/3, 0)     Positive definite    Minimum
  (4/3, -2)    Indefinite           Saddle point
Solve with Steepest Descent
• Take x0 = (1, −1)T; then ∇F(x0) = (−1, −3)T.
• h(λ) ≡ F(x0 − λ∇F(x0)) = F(1 + λ, −1 + 3λ)
       = (1 + λ)³ + (3λ − 1)³ − 2(1 + λ)² + 3(3λ − 1)² − 8
• Minimise h(λ) with respect to λ:
  dh/dλ = 3(1 + λ)² + 9(3λ − 1)² − 4(1 + λ) + 18(3λ − 1)
        = 84λ² + 2λ − 10 = 0
• So λ = 1/3 or −5/14. λ must be positive, so take λ = 1/3.
Solve with Steepest Descent
• x1 = x0 − λ∇F(x0) = (1, −1)T − (−1/3, −1)T = (4/3, 0)T.
• This is the exact minimum.
• We were lucky that the search direction at
x0 points directly towards (4/3 0)T.
• Usually we would need to do more than
one iteration to get a good solution.
Newton’s Method
• Approximate F locally by a quadratic function and
minimise this exactly.
• Taylor’s Theorem:
F(x)  F(xk)+(g(xk))T(x-xk)+
(1/2)(x-xk)TG(xk)(x-xk)
= F(xk)-(g(xk))Txk+ (1/2)xkTG(xk)xk+
(g(xk)-G(xk)xk)Tx+(1/2)xTG(xk)x
• RHS is minimum when
g(xk) – G(xk)xk+G(xk)xk+1=0
Search direction is • So
-1g(x ) and
[G(x
)]
k
k
xk+1= xk – [G(xk)]-1g(xk)
step length is 1.
Newton’s Method Example
• Rosenbrock’s function:
  F(x, y) = 10(y − x²)² + (1 − x)²

  g(x) = ( −40x(y − x²) − 2(1 − x),  20(y − x²) )T

  G(x) = [ 120x² − 40y + 2    −40x
               −40x             20  ]
• Use Newton’s Method starting at (-1.2 1)T.
MATLAB Solution
>> F = @(x,y) 10*(y-x^2)^2 + (1-x)^2
>> fgrad1 = @(x,y) -40*x*(y-x^2) - 2*(1-x)
>> fgrad2 = @(x,y) 20*(y-x^2)
>> G11 = @(x,y) 120*x^2 - 40*y + 2
>> x = [-1.2; 1]
>> x = x - inv([G11(x(1),x(2)) -40*x(1); -40*x(1) 20]) * [fgrad1(x(1),x(2)) fgrad2(x(1),x(2))]'   % repeat this line until converged
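The same calculation can be packaged as a loop over Newton steps, which generates the sequence of iterates tabulated on the next slide. Collecting the gradient and Hessian into handles of a single vector argument, and solving with backslash rather than forming inv(G), are stylistic choices in this sketch.

F = @(v) 10*(v(2) - v(1)^2)^2 + (1 - v(1))^2;
g = @(v) [-40*v(1)*(v(2) - v(1)^2) - 2*(1 - v(1)); 20*(v(2) - v(1)^2)];
G = @(v) [120*v(1)^2 - 40*v(2) + 2, -40*v(1); -40*v(1), 20];
x = [-1.2; 1];
for k = 1:7
    x = x - G(x)\g(x);                             % Newton step: solve G(x) p = g(x)
    fprintf('%9.4f  %9.4f  %12.4e\n', x(1), x(2), F(x));
end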
MATLAB Iterations
   x          y          F(x, y)
  -1.2        1          6.7760
  -0.9755     0.9012     3.9280
   0.0084    -0.9679    10.3533
   0.0571     0.0009     0.8892
   0.9573     0.1060     6.5695
   0.9598     0.9212     0.0016
   1.0000     0.9984     2.62e-05
   1.0000     1.0000     2.41e-14
Notes on Newton’s Method
• Newton’s Method converges quadratically
if the quadratic model is a good fit to the
objective function.
• Problems arise if the quadratic model is
not a good fit outside a small
neighbourhood of the current point.