Optimization principles for discrete choice theory
Fabian Bastin
[email protected]
University of Montréal – CIRRELT
Summer School 2015
Likelihood approach
Assume that we face a population of size I and K alternatives, and that we observe the individual choices as well as the attribute values of all the alternatives, in every choice situation.
The explanatory variables are assumed to be exogenous to the choice situation.
The classical approach aims to find the parameters β that
maximize the probability of the observed choices:
$$\max_\beta L(\beta) = \prod_{i=1}^{I} \prod_{k=1}^{K} P_{i,k}(\beta)^{Y_{i,k}},$$
where
$$Y_{i,k} = \begin{cases} 1 & \text{if } i \text{ has chosen } k; \\ 0 & \text{otherwise.} \end{cases}$$
Likelihood function
Equivalently, we can write
$$\max_\beta L(\beta) = \prod_{i=1}^{I} P_i(\beta),$$
where we have omitted the index of the chosen alternative in $P_i(\beta)$.
Note: the previous formulation assumes that the observations are independent.
Since $0 < P_i(\beta) < 1$ for all $i$, the product of many such terms is numerically unstable: it underflows as $I$ grows.
Solution: transform the product into a sum by using the logarithm operator:
$$\max_\beta LL(\beta) = \sum_{i=1}^{I} \ln P_i(\beta).$$
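To see the numerical issue concretely, here is a minimal Python sketch (the probabilities are synthetic, drawn at random purely for illustration): the raw product underflows to zero in double precision, while the log-likelihood stays well scaled.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic choice probabilities P_i(beta) for I = 10,000 individuals.
P = rng.uniform(0.1, 0.9, size=10_000)

likelihood = np.prod(P)             # underflows to exactly 0.0 in double precision
log_likelihood = np.sum(np.log(P))  # stays a well-scaled finite number

print(likelihood)       # 0.0
print(log_likelihood)   # a finite value (here around -8,000)
```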
Optimization problem
We face an unconstrained optimization problem:
$$\max_\beta LL(\beta) = \frac{1}{I} \sum_{i=1}^{I} \ln P_i(\beta),$$
where we have introduced the factor $1/I$ in order to control the behavior of the function as $I$ grows to infinity. Many authors omit this factor, but it keeps the objective value insensitive to the population size.
In many cases, the function is even concave, so that we face a convex optimization problem.
Optimization problem
More generally, we consider the problem
$$\min_{x \in \mathbb{R}^n} f(x),$$
where we assume that $f : \mathbb{R}^n \to \mathbb{R}$ is $C^2$ (i.e. $f$ is twice continuously differentiable), and we consider only minimization since maximization can be achieved by minimizing the opposite:
$$\max_{x \in \mathbb{R}^n} f(x) = -\min_{x \in \mathbb{R}^n} (-f(x)).$$
Maximum log-likelihood:
$$\max_\beta LL(\beta) = \frac{1}{I} \sum_{i=1}^{I} \log f(x_i \mid \beta).$$
Some notations...
Recall that the gradient vector and the Hessian are
$$\nabla_x f(x) = \begin{pmatrix} \dfrac{\partial f}{\partial x_1} & \dfrac{\partial f}{\partial x_2} & \cdots & \dfrac{\partial f}{\partial x_n} \end{pmatrix}^T,$$

$$\nabla^2_{xx} f(x) = \begin{pmatrix}
\dfrac{\partial^2 f}{\partial x_1^2} & \dfrac{\partial^2 f}{\partial x_1 \partial x_2} & \cdots & \dfrac{\partial^2 f}{\partial x_1 \partial x_n} \\
\dfrac{\partial^2 f}{\partial x_2 \partial x_1} & \dfrac{\partial^2 f}{\partial x_2^2} & \cdots & \dfrac{\partial^2 f}{\partial x_2 \partial x_n} \\
\vdots & \vdots & \ddots & \vdots \\
\dfrac{\partial^2 f}{\partial x_n \partial x_1} & \dfrac{\partial^2 f}{\partial x_n \partial x_2} & \cdots & \dfrac{\partial^2 f}{\partial x_n^2}
\end{pmatrix}.$$
Ideally, we would like to find a global minimizer.
Definition (Global minimizer)
Let $f : X \subseteq \mathbb{R}^n \to \mathbb{R}$. A point $x^*$ is a global minimizer of $f(\cdot)$ on $X$ if
$$f(x^*) \le f(x) \quad \forall x \in X.$$
Local minimization
In practice, we can only find local minimizers.
Definition (Local minimizer)
Let $f : X \to \mathbb{R}$. A point $x^* \in X$ is a local minimizer of $f(\cdot)$ on $X$ if there exists a neighborhood $B_\epsilon(x^*) = \{x \in \mathbb{R}^n : \|x - x^*\| < \epsilon\}$ such that
$$f(x^*) \le f(x) \quad \forall x \in B_\epsilon(x^*).$$
Lemma
Let $X \subset \mathbb{R}^n$, $f \in C^1$ on $X$, and $x \in X$. If $d \in \mathbb{R}^n$ is a feasible direction at $x$, i.e. there exists some scalar $\bar\lambda > 0$ such that
$$(x + \lambda d) \in X \quad \forall \lambda \in [0, \bar\lambda],$$
and $\nabla_x f(x)^T d < 0$, then there exists a scalar $\delta > 0$ such that for all $0 < \tau \le \delta$, $f(x + \tau d) < f(x)$.
In such a case, we say that $d$ is a descent direction at $x$.
Steepest descent
If $x$ is in the interior of $X$, we see that a particular descent direction is $-\nabla_x f(x)$, as long as $\nabla_x f(x) \neq 0$. Consider the linear expansion of $f$ around $x$:
$$f(x + d) \approx f(x) + \nabla_x f(x)^T d,$$
for "small" $d$, i.e. when $\|d\|$ is small. If the approximation is good, it seems reasonable to try to make $\nabla_x f(x)^T d$ as small as possible (i.e. as negative as possible). From the Cauchy–Schwarz inequality, we have
$$\nabla_x f(x)^T d \ge -\|\nabla_x f(x)\| \, \|d\|,$$
and equality is achieved for $d = -\alpha \nabla_x f(x)$, with $\alpha > 0$.
The direction $-\nabla_x f(x)$ is called the steepest descent direction of $f$ at $x$. If $\nabla_x f(x) = 0$, there is no descent direction!
First-order criticality
From the previous observations, we also have the following
result.
Theorem (Necessary condition)
Let $X \subset \mathbb{R}^n$ be an open set, and $f \in C^1$ on $X$. If $x$ is a local minimizer of $f$ on $X$, then $\nabla_x f(x) = 0$.
From the previous theorem, we introduce the definition of
first-order criticality.
Definition (First-order criticality)
A point $x^*$ is said to be first-order critical for $f : \mathbb{R}^n \to \mathbb{R}$ if $\nabla_x f(x^*) = 0$.
First-order criticality (2)
A first-order critical point is, however, not always a local minimizer. Consider for instance the function $f(x) = x^3$: then $f'(0) = 0$, but it is clear that 0 is not a local minimizer of $f(\cdot)$.
Special case: convex functions. If f is convex, a local minimizer
is also a global minimizer, and so a first-order critical point is a
global minimizer.
Algorithms
Definition (Descent algorithm)
An algorithm $A$ is a descent algorithm with respect to a continuous function $z : X \to \mathbb{R}$ if (where $X^*$ denotes the target solution set)
1. whenever $x \notin X^*$ and $y \in A(x)$, then $z(y) < z(x)$;
2. whenever $x \in X^*$ and $y \in A(x)$, then $z(y) \le z(x)$.
All the algorithms that we will consider are descent algorithms, at least locally, and are iterative. In other words, they construct a (possibly infinite) sequence of points $x_k$ ($k = 0, 1, \ldots$), and we hope that this sequence converges to a minimizer of the objective function.
The question is therefore: how do we construct such a sequence?
Steepest descent algorithm
The simplest algorithm is the steepest descent method.
Step 0. Given a starting point $x_0$, set $k = 0$.
Step 1. Set $d_k = -\nabla_x f(x_k)$. If $d_k = 0$, stop.
Step 2. Solve (possibly approximately) $\min_\alpha f(x_k + \alpha d_k)$ for the stepsize $\alpha_k$.
Step 3. Set $x_k \leftarrow x_k + \alpha_k d_k$, $k \leftarrow k + 1$. Go to Step 1.
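A minimal Python sketch of this algorithm; the backtracking rule used for Step 2 is one common way to solve the stepsize problem approximately, not the only one.

```python
import numpy as np

def steepest_descent(f, grad, x0, tol=1e-6, max_iter=10_000):
    """Steepest descent with a simple backtracking rule for Step 2."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        d = -grad(x)                      # Step 1: steepest descent direction
        if np.linalg.norm(d) < tol:       # stop when the gradient (almost) vanishes
            break
        alpha = 1.0                       # Step 2: backtrack until sufficient decrease
        while alpha > 1e-12 and f(x + alpha * d) > f(x) - 1e-4 * alpha * (d @ d):
            alpha *= 0.5
        x = x + alpha * d                 # Step 3: move to the next iterate
    return x
```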
Theorem (Convergence of the steepest descent algorithm)
Suppose that $f(\cdot) : \mathbb{R}^n \to \mathbb{R}$ is continuously differentiable on the set $S = \{x \in \mathbb{R}^n \mid f(x) \le f(x_0)\}$, and that $S$ is a closed and bounded set. Then every point $x^*$ that is a cluster point of the sequence $\{x_k\}$ satisfies $\nabla_x f(x^*) = 0$.
But proving convergence is not sufficient! Is it efficient?
The Rosenbrock function
Apply the steepest descent algorithm to the function
$$f(x_1, x_2) = (1 - x_1)^2 + 100 (x_2 - x_1^2)^2$$
The iterates exhibit a "zig-zag" behavior. In other words, the convergence is very slow! Is it possible to improve it?
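Running the steepest-descent sketch above on the Rosenbrock function illustrates the slow progress; the starting point and iteration budget below are illustrative choices.

```python
import numpy as np

f = lambda x: (1 - x[0])**2 + 100 * (x[1] - x[0]**2)**2
grad = lambda x: np.array([
    -2 * (1 - x[0]) - 400 * x[0] * (x[1] - x[0]**2),  # df/dx1
    200 * (x[1] - x[0]**2),                           # df/dx2
])

# Even after 1,000 iterations the iterates are typically still short of the
# unique minimizer (1, 1): they zig-zag along the curved valley floor.
x = steepest_descent(f, grad, x0=[-1.2, 1.0], max_iter=1_000)
print(x)
```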
Newton method: introduction
Can we converge faster? If $f$ is $C^2$, we can write the Taylor expansion of order 2:
$$f(x + d) \approx f(x) + \nabla_x f(x)^T d + \frac{1}{2} d^T \nabla^2_{xx} f(x) d.$$
Note: assume $\frac{1}{2} d^T \nabla^2_{xx} f(x^*) d > 0$ for $\|d\| < \epsilon$ (sufficient second-order criticality condition: local strong convexity).
If $\nabla_x f(x^*) = 0$, then $x^*$ is a local minimizer.
Otherwise, we can choose $d$ such that $\nabla_x f(x^*)^T d < 0$, and $x^*$ is not a local minimizer.
Newton method
At iteration $k$, define (around the current iterate $x_k$)
$$m_k(d) = f(x_k) + \nabla_x f(x_k)^T d + \frac{1}{2} d^T \nabla^2_{xx} f(x_k) d.$$
If $\nabla^2_{xx} f(x_k)$ is positive definite, $m_k(d)$ is convex in a neighborhood of $x_k$ and we can minimize $m_k(d)$ by computing $d$ such that $\nabla_d m_k(d) = 0$. In other words, we compute $d_k$ as
$$d_k = -[\nabla^2_{xx} f(x_k)]^{-1} \nabla_x f(x_k),$$
and we set
$$x_{k+1} = x_k + d_k.$$
If the starting point $x_0$ is close to the solution, the convergence is quadratic, but if the starting point is not good enough, the method can diverge!
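A bare-bones sketch of the Newton iteration; it has no safeguards, so it inherits the divergence risk just mentioned. In practice one solves the linear system rather than forming the inverse.

```python
import numpy as np

def newton(grad, hess, x0, tol=1e-8, max_iter=100):
    """Pure Newton iteration: solve a linear system instead of inverting."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        d = np.linalg.solve(hess(x), -g)  # d_k = -[Hessian]^{-1} gradient
        x = x + d                         # full Newton step, no safeguard
    return x
```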
Quasi-Newton method
It is often difficult to obtain an analytical expression for the
derivatives, and they may be costly to compute at each
iteration, so we prefer to turn to approximations.
First-order derivatives can be computed by finite differences:
$$\frac{\partial f}{\partial x_{[i]}} \approx \frac{f(x + \epsilon e_i) - f(x)}{\epsilon},$$
for small $\epsilon$, where $e_i = (0 \; 0 \; \ldots \; 0 \; 1 \; 0 \; \ldots \; 0)^T$ is the $i$-th canonical vector, or by central differences:
$$\frac{\partial f}{\partial x_{[i]}} \approx \frac{f(x + \epsilon e_i) - f(x - \epsilon e_i)}{2\epsilon}.$$
This however requires O(n) evaluations of the objective.
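A sketch of both schemes in Python; the step size eps = 1e-6 is a typical, not universal, choice.

```python
import numpy as np

def fd_gradient(f, x, eps=1e-6, central=False):
    """Finite-difference gradient of f at x.

    Forward differences need n extra evaluations of f; central
    differences need 2n but have O(eps^2) instead of O(eps) error.
    """
    x = np.asarray(x, dtype=float)
    g = np.empty_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = eps                        # i-th canonical vector, scaled by eps
        if central:
            g[i] = (f(x + e) - f(x - e)) / (2 * eps)
        else:
            g[i] = (f(x + e) - f(x)) / eps
    return g
```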
Hessian approximation: BFGS
We can approximate the Hessian with a similar approach, but the computational cost is then $O(n^2)$, which is usually too expensive.
An alternative is to construct an approximation of the Hessian that is improved at each iteration, using the information gained along the way. We then speak of a quasi-Newton method.
A popular approximation is BFGS (Broyden, Fletcher, Goldfarb and Shanno):
$$B_{k+1} = B_k + \frac{y_k y_k^T}{y_k^T d_k} - \frac{B_k d_k (B_k d_k)^T}{d_k^T B_k d_k},$$
where $d_k = x_{k+1} - x_k$ and $y_k = \nabla_x f(x_{k+1}) - \nabla_x f(x_k)$.
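The update translates directly into code. A minimal sketch; a robust implementation would also guard against $y_k^T d_k \le 0$, e.g. by skipping or damping the update.

```python
import numpy as np

def bfgs_update(B, d, y):
    """One BFGS update of the Hessian approximation B.

    d = x_{k+1} - x_k (the step), y = grad f(x_{k+1}) - grad f(x_k).
    """
    Bd = B @ d
    return B + np.outer(y, y) / (y @ d) - np.outer(Bd, Bd) / (d @ Bd)
```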
BHHH
Various alternatives to BFGS exist, such as the symmetric rank-1 (SR1) update, which is popular for nonconvex optimization with trust-region methods.
Particular case: maximum likelihood.
$$\max_\beta LL(\beta) = \sum_i \ln P_i(\beta).$$
Information identity: when $I \to \infty$, and under some conditions (basically, the model has to be correctly specified), we have
$$H(\beta^*) = -B(\beta^*),$$
where $B$ is the information matrix.
The BHHH method, proposed by Berndt, Hall, Hall, Hausman
in 1974, simply replaces the Hessian by the opposite of the
information matrix in the quasi-Newton method.
Information matrix
The information matrix is
$$B = E\left[\frac{\nabla P(\beta^*)}{P(\beta^*)} \, \frac{\nabla P(\beta^*)^T}{P(\beta^*)}\right].$$
In practice, we compute it as
$$B = \frac{1}{I} \sum_{i=1}^{I} \frac{\nabla P_i(\beta^*)}{P_i(\beta^*)} \, \frac{\nabla P_i(\beta^*)^T}{P_i(\beta^*)}.$$
Advantage: cheap to compute!
Drawback: only valid if the information identity holds, which is rarely the case in practice! Example: multinomial logit on panel data.
Suggestion: start with BHHH and then switch to BFGS or SR1.
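A sketch of the BHHH estimate, assuming the per-individual probabilities and their gradients are available as arrays (the function and argument names are illustrative):

```python
import numpy as np

def bhhh_matrix(probs, grads):
    """BHHH estimate of the information matrix.

    probs: shape (I,), the probabilities P_i(beta);
    grads: shape (I, n), the gradients of P_i(beta).
    """
    scores = grads / probs[:, None]       # row i holds grad P_i / P_i
    return scores.T @ scores / len(probs)
```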
Asymptotics
It can be shown that
$$\sqrt{I}\,(\beta^*_I - \beta^*) \xrightarrow{D} N(0, H^{-1} B H^{-1}),$$
so the information matrix will be needed when computing confidence intervals on the parameters.
Under the information identity, this can be simplified to
$$\sqrt{I}\,(\beta^*_I - \beta^*) \xrightarrow{D} N(0, B^{-1}),$$
but since the identity often does not hold, care must be exercised.
Note: a model can be statistically consistent (i.e. one obtains the right parameters) while not being correctly formulated!
Example: multinomial logit on panel data.
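Once $H$ and $B$ have been estimated at the optimum, the robust ("sandwich") covariance of the parameters follows directly; a minimal sketch:

```python
import numpy as np

def sandwich_covariance(H, B, I):
    """Robust covariance of the estimator: H^{-1} B H^{-1} / I."""
    Hinv = np.linalg.inv(H)
    return Hinv @ B @ Hinv / I
```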
Globalization of the Newton method
Global convergence: the algorithm must converge for any starting point. BUT it still converges to a local minimum, not a global minimum!
So how do we ensure global convergence? Two classical globalizations of the Newton method:
- linesearch methods;
- trust-region methods.
Linesearch methods generate the iterates by setting
$$x_{k+1} = x_k + \alpha_k d_k,$$
where $d_k$ is a search direction and the stepsize $\alpha_k$ is chosen so that $f(x_{k+1}) < f(x_k)$.
Linesearch methods are therefore descent methods, and a special case is the steepest descent method.
Linesearch methods
Most linesearch versions of the basic Newton method generate the direction $d_k$ by modifying the Hessian matrix $\nabla^2_{xx} f(x_k)$ to ensure that the quadratic model $m_k$ of the function has a unique minimizer.
The modified Cholesky decomposition approach adds positive quantities to the diagonal of $\nabla^2_{xx} f(x_k)$ during the Cholesky factorization. As a result, a diagonal matrix $E_k$ with nonnegative diagonal entries is generated such that
$$\nabla^2_{xx} f(x_k) + E_k$$
is positive definite.
Linesearch methods (2)
Given this decomposition, the search direction $d_k$ is obtained by solving
$$(\nabla^2_{xx} f(x_k) + E_k) \, d_k = -\nabla_x f(x_k).$$
After $d_k$ is found, a linesearch procedure is used to choose an $\alpha_k > 0$ that approximately minimizes $f$ along the ray $\{x_k + \alpha d_k \mid \alpha > 0\}$.
The algorithms for determining $\alpha_k$ usually rely on quadratic or cubic interpolation of the univariate function
$$\phi(\alpha) = f(x_k + \alpha d_k)$$
in their search for a suitable $\alpha_k$.
Determination of αk
An elegant and practical criterion for a suitable $\alpha_k$ is to require $\alpha_k$ to satisfy the sufficient decrease condition
$$f(x_k + \alpha_k d_k) \le f(x_k) + \mu \alpha_k d_k^T \nabla_x f(x_k)$$
and the curvature condition
$$|d_k^T \nabla_x f(x_k + \alpha_k d_k)| \le \eta \, |d_k^T \nabla_x f(x_k)|,$$
where $\mu$ and $\eta$ are two constants with $0 < \mu < \eta < 1$. The sufficient decrease condition guarantees, in particular, that $f(x_{k+1}) < f(x_k)$, while the curvature condition requires that $\alpha_k$ not be too far from a minimizer of $\phi$.
Requiring an accurate minimizer is generally wasteful of function and gradient evaluations, so codes typically use $\mu = 0.001$ and $\eta = 0.9$ in these conditions.
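Both conditions are cheap to test. A sketch of the acceptance check, using the constants quoted above as defaults:

```python
import numpy as np

def wolfe_conditions(f, grad, x, d, alpha, mu=0.001, eta=0.9):
    """Test the sufficient decrease and curvature conditions for alpha."""
    slope = d @ grad(x)                   # directional derivative, < 0 for descent
    decrease = f(x + alpha * d) <= f(x) + mu * alpha * slope
    curvature = abs(d @ grad(x + alpha * d)) <= eta * abs(slope)
    return decrease and curvature
```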
Trust region methods
Principle: at iteration $k$, approximately minimize a model $m_k$ of the objective inside a region $\mathcal{B}_k$. A typical choice for $m_k$ is
$$m_k(x_k + s) = f(x_k) + s^T \nabla f(x_k) + \frac{1}{2} s^T H_k s,$$
where $H_k$ is an approximation of $\nabla^2_{xx} f(x_k)$.
We therefore have to solve the subproblem
$$\min_s m_k(x_k + s), \quad \text{such that } x_k + s \in \mathcal{B}_k.$$
The solution is the candidate iterate, with candidate step $s_k$. We then compute the following ratio:
$$\rho_k = \frac{f(x_k + s_k) - f(x_k)}{m_k(x_k + s_k) - m_k(x_k)}.$$
Trust region methods (2)
Let $\eta_1$ and $\eta_2$ be constants such that $0 < \eta_1 \le \eta_2 < 1$ (for instance, $\eta_1 = 0.01$ and $\eta_2 = 0.75$).
- If $\rho_k \ge \eta_1$, accept the candidate.
- If $\rho_k \ge \eta_2$, enlarge $\mathcal{B}_k$; otherwise reduce it or keep it the same.
- If $\rho_k < \eta_1$, reject the candidate and reduce $\mathcal{B}_k$.
Stop when some criteria are met (e.g. the norm of the relative gradient is small enough).
The neighborhood where we consider the model to be valid is mathematically defined as a ball centered at $x_k$, with a radius $\Delta_k$:
$$\mathcal{B}_k = \{x \mid \|x - x_k\|_k \le \Delta_k\}.$$
Norms
The choice of norm may depend on the iteration $k$. We often use the 2-norm or the $\infty$-norm.
Let $x$ be a vector of $\mathbb{R}^n$. Then
$$\|x\|_2 = \sqrt{\sum_{i=1}^{n} x_{(i)}^2}, \qquad \|x\|_\infty = \max_{i=1,\ldots,n} |x_{(i)}|, \qquad \|x\|_1 = \sum_{i=1}^{n} |x_{(i)}|.$$
Trust region update
At the end of each iteration $k$, we update the trust-region radius as follows:
$$\Delta_{k+1} \in \begin{cases} [\Delta_k, \infty) & \text{if } \rho_k \ge \eta_2, \\ [\gamma_2 \Delta_k, \Delta_k] & \text{if } \rho_k \in [\eta_1, \eta_2), \\ [\gamma_1 \Delta_k, \gamma_2 \Delta_k] & \text{if } \rho_k < \eta_1, \end{cases}$$
with $0 < \eta_1 \le \eta_2 < 1$ and $0 < \gamma_1 \le \gamma_2 < 1$.
Remark: there exist variants, motivated by both theoretical and practical concerns.
The choice of $\Delta_0$ remains difficult, and various heuristics have been proposed. See for instance A. Sartenaer, "Automatic Determination of an Initial Trust Region in Nonlinear Programming", SIAM Journal on Scientific Computing 18(6), pp. 1788–1803, 1997.
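Combining the model, the ratio test and the radius update gives a compact trust-region loop. A minimal sketch; the Cauchy-point step used here is one simple way to approximately solve the subproblem, and the multiplicative factors are illustrative values inside the intervals above.

```python
import numpy as np

def trust_region(f, grad, hess, x0, delta0=1.0, tol=1e-6, max_iter=500,
                 eta1=0.01, eta2=0.75, gamma1=0.25, gamma2=0.5):
    x, delta = np.asarray(x0, dtype=float), delta0
    for _ in range(max_iter):
        g, H = grad(x), hess(x)
        if np.linalg.norm(g) < tol:
            break
        # Cauchy point: minimize the model along -g inside the ball.
        gHg = g @ H @ g
        tau = min(1.0, np.linalg.norm(g)**3 / (delta * gHg)) if gHg > 0 else 1.0
        s = -(tau * delta / np.linalg.norm(g)) * g
        pred = -(g @ s + 0.5 * s @ H @ s)     # model decrease, > 0 for g != 0
        rho = (f(x) - f(x + s)) / pred        # equivalent to rho_k (signs flipped)
        if rho >= eta1:
            x = x + s                         # accept the candidate
        if rho >= eta2:
            delta *= 2.0                      # very successful: enlarge the region
        elif rho < eta1:
            delta *= gamma1                   # failure: shrink sharply
        else:
            delta *= gamma2                   # moderate success: shrink mildly
    return x
```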
Trust region: example
$$\min_{x,y} \; -10x^2 + 10y^2 + 4\sin(xy) - 2x + x^4$$
[Figure: "Trust-region method" — the iterates plotted in the $(x, y)$ plane, with $x$ ranging over $[-3, 5]$ and $y$ over $[-3, 3]$.]
Trust region: example (2)

[Figure: a further illustration of the trust-region iterates on the same example.]