
Qualifier Exam in HPC
February 10th, 2010
Quasi-Newton methods
Alexandru Cioaca
Quasi-Newton methods
(nonlinear systems)

Nonlinear systems:
F(x) = 0,   F : R^n -> R^n
F(x) = [ fi(x1, ..., xn) ]^T

Such systems appear in the simulation of processes
(physical, chemical, etc.)

Iterative algorithms are used to solve such systems

Newton's method (root finding for F(x) = 0) is not the same problem as
nonlinear least-squares (minimizing ||F(x)||)
Quasi-Newton methods
(nonlinear systems)
Standard assumptions:
1. F is continuously differentiable in an open convex set D
2. F' is Lipschitz continuous on D
3. There is x* in D such that F(x*) = 0 and F'(x*) is nonsingular

Newton's method:
Starting from an initial iterate x0, iterate
xk+1 = xk - F'(xk)^-1 * F(xk),    {xk} -> x*
until a termination criterion is satisfied
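
A minimal sketch of this iteration in Python/NumPy (the 2x2 test system, tolerance, and iteration cap are illustrative assumptions, not part of the slides):

import numpy as np

def newton(F, J, x0, tol=1e-10, max_iter=50):
    # Solve F(x) = 0 by Newton's method: xk+1 = xk - F'(xk)^-1 * F(xk)
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        Fx = F(x)
        if np.linalg.norm(Fx) < tol:          # termination criterion
            break
        s = np.linalg.solve(J(x), Fx)         # F'(xk) * sk = F(xk)
        x = x - s                             # xk+1 = xk - sk
    return x

# Illustrative system: x0^2 + x1^2 - 1 = 0, x0 - x1 = 0
F = lambda x: np.array([x[0]**2 + x[1]**2 - 1.0, x[0] - x[1]])
J = lambda x: np.array([[2*x[0], 2*x[1]], [1.0, -1.0]])
print(newton(F, J, [2.0, 0.5]))               # approaches [0.7071..., 0.7071...]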
Quasi-Newton methods
(nonlinear systems)

Linear model around xk:
Mk(x) = F(xk) + F'(xk)(x - xk)
Mk(x) = 0  =>  xk+1 = xk - F'(xk)^-1 * F(xk)

Iterates are computed as:
F'(xk) * sk = F(xk)
xk+1 = xk - sk
Quasi-Newton methods
(nonlinear systems)
Evaluate F'(xk):
- Symbolically
- Numerically with finite differences (see the sketch below)
- Automatic differentiation
Solve the linear system F'(xk) * sk = F(xk):
- Direct solve: LU, Cholesky
- Iterative methods: GMRES, CG
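
A sketch of the finite-difference option for the Jacobian (forward differences; the step size eps is a conventional choice, not from the slides):

import numpy as np

def fd_jacobian(F, x, eps=1e-7):
    # Approximate F'(x) column by column: J[:, j] ~ (F(x + eps*e_j) - F(x)) / eps
    x = np.asarray(x, dtype=float)
    Fx = F(x)
    J = np.empty((Fx.size, x.size))
    for j in range(x.size):
        e = np.zeros(x.size)
        e[j] = eps
        J[:, j] = (F(x + e) - Fx) / eps
    return J

This costs n extra evaluations of F per iteration, one per column, which is where the n^2 scalar-function count on the next slide comes from.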
Quasi-Newton methods
(nonlinear systems)
Computation per iteration:
F(xk)    - n scalar function evaluations
F'(xk)   - n^2 scalar function evaluations

LU       - O(2n^3/3)
Cholesky - O(n^3/3)
Krylov methods - cost depends on the condition number

Quasi-Newton methods
(nonlinear systems)






- LU and Cholesky are useful when we want to reuse the factorization (quasi-implicit)
- Difficult to parallelize and to balance the workload
- Cholesky is faster and more stable but needs SPD (!)
- For large n, factorization is very impractical (n ~ 10^6)
- Krylov methods contain easily parallelizable elements (vector updates, inner products, matrix-vector products) - see the sketch below
- CG is faster and more stable but needs SPD
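
A sketch of a matrix-free Krylov solve for the Newton system, using only matrix-vector products; SciPy's GMRES and the finite-difference Jacobian-vector product are assumptions made for illustration:

import numpy as np
from scipy.sparse.linalg import LinearOperator, gmres

def newton_krylov_step(F, x, eps=1e-7):
    # One Newton step where F'(x)*s = F(x) is solved by GMRES without forming F'(x)
    Fx = F(x)
    def jac_vec(v):
        # Jacobian-vector product approximated by a forward difference along v
        return (F(x + eps * v) - Fx) / eps
    J_op = LinearOperator((x.size, x.size), matvec=jac_vec)
    s, info = gmres(J_op, Fx)                 # only matrix-vector products with F'(x)
    return x - s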
Quasi-Newton methods
(nonlinear systems)
Advantages:
- Under the standard assumptions, Newton's method converges locally and quadratically
- There exists a domain of attraction S which contains the solution
- Once the iterates enter S, they stay in S and eventually converge to x*
- The algorithm is memoryless (self-corrective)
Quasi-Newton methods
(nonlinear systems)
Disadvantages:
- Convergence depends on the choice of x0
- F'(x) has to be evaluated at each iterate xk
- The computations can be expensive: F(xk), F'(xk), sk
Quasi-Newton methods
(nonlinear systems)

Implicit schemes for ODEs:  y' = f(t, y)

Forward Euler:   yn+1 = yn + h*f(tn, yn)        (explicit)
Backward Euler:  yn+1 = yn + h*f(tn+1, yn+1)    (implicit)

Implicit schemes need the solution of a nonlinear system at every step
(also Crank-Nicolson, Runge-Kutta, linear multistep formulas)
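
A sketch of one backward Euler step, where the implicit equation y = yn + h*f(tn+1, y) is solved with Newton's method (the test problem and tolerances are illustrative assumptions):

import numpy as np

def backward_euler_step(f, df_dy, t_n, y_n, h, tol=1e-12, max_iter=20):
    # Solve G(y) = y - y_n - h*f(t_{n+1}, y) = 0 for the next state y
    t_next = t_n + h
    y = np.asarray(y_n, dtype=float).copy()   # initial guess: the previous value
    I = np.eye(y.size)
    for _ in range(max_iter):
        G = y - y_n - h * f(t_next, y)        # nonlinear residual
        if np.linalg.norm(G) < tol:
            break
        Jg = I - h * df_dy(t_next, y)         # G'(y) = I - h * df/dy
        y = y - np.linalg.solve(Jg, G)
    return y

# Illustrative stiff scalar problem y' = -50*(y - cos(t))
f = lambda t, y: -50.0 * (y - np.cos(t))
df = lambda t, y: np.array([[-50.0]])
print(backward_euler_step(f, df, 0.0, np.array([1.0]), 0.1))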
Quasi-Newton methods
(nonlinear systems)



How to circumvent evaluating F'(xk)?

Broyden's method:
Bk+1 = Bk + (yk - Bk*sk)*sk^T / <sk, sk>
xk+1 = xk - Bk^-1 * F(xk)

Inverse update (Sherman-Morrison formula):
Hk+1 = Hk + (sk - Hk*yk)*sk^T*Hk / <sk, Hk*yk>
xk+1 = xk - Hk * F(xk)

(where sk = xk+1 - xk,  yk = F(xk+1) - F(xk))
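
A sketch of Broyden's method with the inverse (Sherman-Morrison) update, so neither the Jacobian nor a linear solve is needed; starting from H0 = I is a common default, not something the slides prescribe:

import numpy as np

def broyden(F, x0, tol=1e-10, max_iter=100):
    x = np.asarray(x0, dtype=float)
    H = np.eye(x.size)                        # H0 ~ F'(x0)^-1
    Fx = F(x)
    for _ in range(max_iter):
        if np.linalg.norm(Fx) < tol:
            break
        s = -H @ Fx                           # xk+1 = xk - Hk * F(xk)
        x_new = x + s
        F_new = F(x_new)
        y = F_new - Fx                        # yk = F(xk+1) - F(xk)
        Hy = H @ y
        # Hk+1 = Hk + (sk - Hk*yk) * sk^T * Hk / <sk, Hk*yk>
        H = H + np.outer(s - Hy, s @ H) / (s @ Hy)
        x, Fx = x_new, F_new
    return x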
Quasi-Newton methods
(nonlinear systems)
Advantages:
- No need to compute F'(xk)
- With the inverse update, no linear system to solve
Disadvantages:
- Convergence is only superlinear (not quadratic)
- No longer memoryless
Quasi-Newton methods
(unconstrained optimization)

Problem:
Find the global minimizer of a cost function
f : R^n -> R,   x* = arg min f(x)

If f is differentiable, the problem can be attacked by
looking for zeros of the gradient, ∇f(x) = 0
Quasi-Newton methods
(unconstrained optimization)

Descent methods:
xk+1 = xk - λk * Pk * ∇f(xk)
Pk = In            - steepest descent
Pk = ∇²f(xk)^-1    - Newton's method
Pk = Bk^-1         - quasi-Newton

The angle between Pk*∇f(xk) and ∇f(xk) must be less than 90 degrees (descent direction)
Bk has to mimic the behavior of the Hessian
Quasi-Newton methods
(unconstrained optimization)
Global convergence:
- Line search
  - Step length: backtracking, interpolation (see the sketch below)
  - Sufficient decrease: Wolfe conditions
- Trust regions
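
A sketch of backtracking with the sufficient-decrease (Armijo) part of the Wolfe conditions; the constants rho and c are conventional defaults, not from the slides:

import numpy as np

def backtracking_line_search(f, grad_f, x, p, lam0=1.0, rho=0.5, c=1e-4):
    # Shrink the step length until f(x + lam*p) <= f(x) + c*lam*<grad f(x), p>
    lam = lam0
    fx = f(x)
    slope = grad_f(x) @ p                     # p must be a descent direction (slope < 0)
    while f(x + lam * p) > fx + c * lam * slope and lam > 1e-12:
        lam *= rho                            # backtrack
    return lam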
Quasi-Newton methods
(unconstrained optimization)
For quasi-Newton, Bk has to resemble the Hessian ∇²f(xk)

Single-rank (SR1):
B_SR1 = B + (y - B*s)*(y - B*s)^T / <y - B*s, s>

Symmetry (PSB):
B_PSB = B + [(y - B*s)*s^T + s*(y - B*s)^T] / <s, s> - <y - B*s, s> * s*s^T / <s, s>^2

Positive definiteness (DFP):
B_DFP = B + [(y - B*s)*y^T + y*(y - B*s)^T] / <y, s> - <y - B*s, s> * y*y^T / <y, s>^2

Inverse update (BFGS):
H_BFGS = (I - s*y^T / <y, s>) * H * (I - y*s^T / <y, s>) + s*s^T / <y, s>
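
A sketch of the inverse BFGS update above (H0 and the surrounding quasi-Newton loop are up to the user):

import numpy as np

def bfgs_update(H, s, y):
    # H+ = (I - s*y^T/<y,s>) * H * (I - y*s^T/<y,s>) + s*s^T/<y,s>
    rho = 1.0 / (y @ s)
    I = np.eye(H.shape[0])
    V = I - rho * np.outer(s, y)
    return V @ H @ V.T + rho * np.outer(s, s)

# One quasi-Newton step would then look like (schematically):
#   p = -H @ grad_f(x); lam = backtracking_line_search(f, grad_f, x, p)
#   s = lam * p; y = grad_f(x + s) - grad_f(x); H = bfgs_update(H, s, y); x = x + s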
Quasi-Newton methods
(unconstrained optimization)
Computation:
- Matrix updates, inner products
- DFP, PSB: 3 matrix-vector products per update
- BFGS: 2 matrix-matrix products per update
Storage:
- Limited-memory versions (L-BFGS)
- Store {sk, yk} for the last m iterations and recompute the action of H from them
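
A sketch of the limited-memory idea: only the last m pairs {sk, yk} are stored, and the product H*grad is rebuilt with the standard two-loop recursion (the recursion itself is standard L-BFGS machinery, not spelled out on the slide):

import numpy as np

def lbfgs_direction(grad, s_list, y_list):
    # Compute -H*grad from the stored pairs without ever forming H
    q = grad.copy()
    rhos = [1.0 / (y @ s) for s, y in zip(s_list, y_list)]
    alphas = []
    for s, y, rho in reversed(list(zip(s_list, y_list, rhos))):   # newest to oldest
        alpha = rho * (s @ q)
        alphas.append(alpha)
        q = q - alpha * y
    if s_list:                                 # scaled initial approximation H0 = gamma*I
        gamma = (s_list[-1] @ y_list[-1]) / (y_list[-1] @ y_list[-1])
    else:
        gamma = 1.0
    r = gamma * q
    for (s, y, rho), alpha in zip(zip(s_list, y_list, rhos), reversed(alphas)):  # oldest to newest
        beta = rho * (y @ r)
        r = r + s * (alpha - beta)
    return -r                                  # quasi-Newton search direction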
Further improvements
Preconditioning the linear system
- For faster convergence one may solve the preconditioned system K*Bk*pk = K*F(xk)
- If B is SPD (and sparse) we can use sparse approximate inverses to generate the preconditioner K
- This preconditioner can be refined on a subspace of Bk using an algebraic multigrid technique
- This requires solving an eigenvalue problem
Further improvements
Model reduction
- Sometimes the dimension of the system is very large
- We want a smaller model that captures the essence of the original
- An approximation of the model variability can be retrieved from an ensemble of forward simulations
- The covariance matrix of the ensemble gives the reduced subspace
- We need to solve an eigenvalue problem (see the sketch below)
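
A sketch of extracting that subspace from an ensemble: eigenvectors of the sample covariance of the snapshots (the snapshot layout and the number of retained modes k are assumptions):

import numpy as np

def dominant_subspace(snapshots, k):
    # Columns of `snapshots` are ensemble members; return the k leading covariance eigenvectors
    X = snapshots - snapshots.mean(axis=1, keepdims=True)    # remove the ensemble mean
    C = (X @ X.T) / (X.shape[1] - 1)                          # sample covariance (n x n)
    w, V = np.linalg.eigh(C)                                   # symmetric eigenvalue problem
    order = np.argsort(w)[::-1]
    return V[:, order[:k]]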



QR/QL algorithms
for symmetric matrices



- Solves the eigenvalue problem
- Iterative algorithm
- Uses a QR/QL factorization at each step (A = Q*R, Q unitary, R upper triangular)

for k = 1, 2, ...
    Ak = Qk*Rk
    Ak+1 = Rk*Qk
end

The diagonal of Ak converges to the eigenvalues of A
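
A minimal sketch of this iteration with NumPy's QR factorization (fixed iteration count; the small symmetric test matrix is an illustrative assumption):

import numpy as np

def qr_iteration(A, num_iters=200):
    # Ak = Qk*Rk, Ak+1 = Rk*Qk; each step is a similarity transform Q^T*Ak*Q
    Ak = np.array(A, dtype=float)
    for _ in range(num_iters):
        Q, R = np.linalg.qr(Ak)
        Ak = R @ Q
    return np.diag(Ak)

A = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])
print(np.sort(qr_iteration(A)))        # should agree with np.linalg.eigvalsh(A)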
QR/QL algorithms
for symmetric matrices




- The matrix A is reduced to upper Hessenberg form before starting the iterations
- Householder reflections (U = I - 2*v*v^T / <v, v>)
- The reduction is performed column by column
- If A is symmetric, it is reduced to tridiagonal form
QR/QL algorithms
for symmetric matrices


- Convergence to triangular form can be slow
- Origin shifts are used to accelerate it (e.g., the Wilkinson shift)

for k = 1, 2, ...
    Ak - zk*I = Qk*Rk
    Ak+1 = Rk*Qk + zk*I
end

QR makes heavy use of matrix-matrix products
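
A sketch of the shifted iteration for a symmetric tridiagonal matrix (as produced by the reduction on the previous slide), with a Wilkinson shift and the deflation step used in practice; deflation is not on the slide, but without it only the trailing eigenvalue is resolved:

import numpy as np

def wilkinson_shift(A):
    # Eigenvalue of the trailing 2x2 block closest to the last diagonal entry
    a, b, c = A[-2, -2], A[-1, -2], A[-1, -1]
    d = (a - c) / 2.0
    denom = d + np.sign(d) * np.hypot(d, b) if d != 0 else np.hypot(d, b)
    return c - b * b / denom

def shifted_qr_eigenvalues(A, tol=1e-12):
    # Ak - zk*I = Qk*Rk, Ak+1 = Rk*Qk + zk*I, deflating converged eigenvalues
    Ak = np.array(A, dtype=float)
    n = Ak.shape[0]
    eigs = []
    while n > 1:
        if abs(Ak[n - 1, n - 2]) <= tol * (abs(Ak[n - 1, n - 1]) + abs(Ak[n - 2, n - 2])):
            eigs.append(Ak[n - 1, n - 1])     # last off-diagonal is negligible: deflate
            Ak = Ak[:n - 1, :n - 1]
            n -= 1
            continue
        z = wilkinson_shift(Ak)
        Q, R = np.linalg.qr(Ak - z * np.eye(n))
        Ak = R @ Q + z * np.eye(n)
    eigs.append(Ak[0, 0])
    return np.array(eigs)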
Alternatives to quasi-Newton
Inexact Newton methods
- Inner iteration: determine a search direction by solving the linear system to a certain tolerance
- Only Hessian-vector products are necessary
- Outer iteration: line search along the search direction
Nonlinear CG
- The residual of linear CG is replaced by the gradient of the cost function
- Line search
- Different flavors (e.g., Fletcher-Reeves, Polak-Ribiere); see the sketch below
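
A sketch of nonlinear CG in the Fletcher-Reeves flavor, reusing the backtracking line search from the earlier sketch (the restart safeguard is a common practical addition, not from the slides):

import numpy as np

def nonlinear_cg(f, grad_f, x0, tol=1e-8, max_iter=500):
    x = np.asarray(x0, dtype=float)
    g = grad_f(x)                              # gradient plays the role of the CG residual
    p = -g
    for _ in range(max_iter):
        if np.linalg.norm(g) < tol:
            break
        lam = backtracking_line_search(f, grad_f, x, p)
        x = x + lam * p
        g_new = grad_f(x)
        beta = (g_new @ g_new) / (g @ g)       # Fletcher-Reeves coefficient
        p = -g_new + beta * p
        g = g_new
        if g @ p >= 0:                         # safeguard: restart with steepest descent
            p = -g
    return x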
Alternatives to quasi-Newton
Direct search
- Does not involve derivatives of the cost function
- Uses a structure called a simplex to search for decrease in f
- Stops when further progress cannot be achieved
- Can get stuck in a local minimum
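
The simplex-based direct search described here is essentially the Nelder-Mead method; a usage sketch with SciPy's implementation (the Rosenbrock test function is a conventional example, not from the slides):

import numpy as np
from scipy.optimize import minimize

rosen = lambda x: (1 - x[0])**2 + 100 * (x[1] - x[0]**2)**2   # derivative-free target
res = minimize(rosen, x0=np.array([-1.2, 1.0]), method='Nelder-Mead')
print(res.x, res.fun)          # should approach [1, 1] and 0 within the default tolerances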
More alternatives
Monte Carlo
- A computational method relying on repeated random sampling
- Can be used for optimization (MDO) and for inverse problems via random walks
- With multiple correlated variables, the correlation (covariance) matrix is SPD, so we can factorize it with Cholesky and use the factor to generate correlated samples
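
A sketch of the last point: the Cholesky factor of an SPD covariance matrix turns independent standard normal draws into correlated samples (the 2x2 covariance and sample count are illustrative):

import numpy as np

C = np.array([[1.0, 0.8],
              [0.8, 1.0]])                 # target covariance (SPD)
L = np.linalg.cholesky(C)                  # C = L * L^T

rng = np.random.default_rng(0)
z = rng.standard_normal((2, 10000))        # independent standard normal samples
samples = L @ z                            # correlated samples with covariance ~ C
print(np.cov(samples))                     # should be close to C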
Conclusions

Newton's method is a very powerful method with many applications and uses
(solving nonlinear systems, finding minima of cost functions), and it can be
used together with many other numerical algorithms (factorizations, linear solvers).

Optimizing and parallelizing matrix-vector and matrix-matrix products,
decompositions, and the other numerical kernels can have a significant
impact on overall performance.
Thank you for your time!