Optimization principles for discrete choice theory
Fabian Bastin, [email protected]
University of Montréal – CIRRELT Summer School 2015

Likelihood approach

Assume that we face a population of size I and K alternatives, and that we observe the individual choices as well as the attribute values of all the alternatives, in every choice situation. The explanatory variables are assumed to be exogenous to the choice situation. The classical approach aims to find the parameters β that maximize the probability of the observed choices:

max_β L(β) = Π_{i=1}^{I} Π_{k=1}^{K} P_{i,k}(β)^{Y_{i,k}},

where Y_{i,k} = 1 if i has chosen k, and Y_{i,k} = 0 otherwise.

Likelihood function

Equivalently, we can write

max_β L(β) = Π_{i=1}^{I} P_i(β),

where we have omitted the index of the chosen alternative in P_i(β). Note: the previous formulation assumes that the observations are independent, and that 0 < P_i(β) < 1 for all i. Numerically, the product is not stable. Solution: transform the product into a sum by using the logarithm operator:

max_β LL(β) = Σ_{i=1}^{I} ln P_i(β).

Optimization problem

We face an unconstrained optimization problem:

max_β LL(β) = (1/I) Σ_{i=1}^{I} ln P_i(β),

where we have introduced the factor 1/I in order to control the behavior of the function as I grows to infinity. Many authors do not introduce this factor, but it keeps the objective value insensitive to the population size. In many cases, the function is even concave, so that we face a convex optimization problem.

Optimization problem (2)

More generally, we consider the problem

min_{x ∈ R^n} f(x),

where we assume that f : R^n → R is of class C², i.e. f is twice continuously differentiable. We consider only minimization, since maximization can be achieved by minimizing the opposite:

max_{x ∈ R^n} f(x) = − min_{x ∈ R^n} (−f(x)).

Maximum log-likelihood:

max_β LL(β) = (1/I) Σ_{i=1}^{I} log f(x_i | β).
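To make the objective concrete, here is a minimal sketch of the average log-likelihood (1/I) Σ_i ln P_i(β) for a multinomial logit model with linear-in-parameters utilities. The data layout (an I × K × p attribute array X and a vector of observed choice indices) is a hypothetical illustration, not part of the original slides.

```python
import numpy as np

def average_log_likelihood(beta, X, choices):
    """Average log-likelihood (1/I) * sum_i ln P_i(beta) of a multinomial
    logit. X has shape (I, K, p): p attributes of K alternatives for I
    individuals; choices[i] is the index of the alternative chosen by i."""
    V = X @ beta                          # systematic utilities V_{i,k}, shape (I, K)
    V = V - V.max(axis=1, keepdims=True)  # shift utilities: stabilizes exp()
    P = np.exp(V)
    P = P / P.sum(axis=1, keepdims=True)  # logit choice probabilities P_{i,k}
    I = len(choices)
    return np.log(P[np.arange(I), choices]).mean()

# Quick check on hypothetical data: with beta = 0, every P_{i,k} = 1/K,
# so the average log-likelihood must equal ln(1/K).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3, 2))          # I = 100, K = 3, p = 2
choices = rng.integers(0, 3, size=100)
print(average_log_likelihood(np.zeros(2), X, choices))  # approx ln(1/3)
```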
Some notations

Recall that the gradient vector and the Hessian matrix are

∇_x f(x) = (∂f/∂x_1, ∂f/∂x_2, …, ∂f/∂x_n)^T,

∇²_{xx} f(x) = the n × n matrix with (i, j) entry ∂²f/∂x_i ∂x_j, for i, j = 1, …, n.

Ideally, we would like to find a global minimizer.

Definition (Global minimizer). Let f : X ⊆ R^n → R. A point x* is a global minimizer of f(·) on X if f(x*) ≤ f(x) for all x ∈ X.

Local minimization

In practice, we can only find local minimizers.

Definition (Local minimizer). Let f : X → R. A point x* ∈ X is a local minimizer of f(·) on X if there exists a neighborhood B_ε(x*) = {x ∈ R^n : ‖x − x*‖ < ε} such that f(x*) ≤ f(x) for all x ∈ B_ε(x*).

Lemma. Let X ⊂ R^n, f ∈ C¹ on X, and x ∈ X. If d ∈ R^n is a feasible direction at x, i.e. there exists a scalar λ̄ > 0 such that x + λd ∈ X for all λ ∈ [0, λ̄], and if ∇_x f(x)^T d < 0, then there exists a scalar δ > 0 such that f(x + τd) < f(x) for all 0 < τ ≤ δ. In such a case, we say that d is a descent direction at x.

Steepest descent

If x lies in the interior of X, we see that a particular descent direction is −∇_x f(x), as long as ∇_x f(x) ≠ 0. Consider the linear expansion of f around x:

f(x + d) ≈ f(x) + ∇_x f(x)^T d,

for "small" d, i.e. when ‖d‖ is small. If the approximation is good, it seems reasonable to try to make ∇_x f(x)^T d as small as possible (i.e. as negative as possible). From the Cauchy-Schwarz inequality, we have

∇_x f(x)^T d ≥ −‖∇_x f(x)‖ ‖d‖,

and equality is achieved for d = −α ∇_x f(x), with α > 0. The direction −∇_x f(x) is called the steepest descent direction of f at x. If ∇_x f(x) = 0, there is no descent direction!

First-order criticality

From the previous observations, we also have the following result.

Theorem (Necessary condition). Let X ⊂ R^n be an open set, and let f ∈ C¹ on X. If x is a local minimizer of f on X, then ∇_x f(x) = 0.

From this theorem, we introduce the definition of first-order criticality.

Definition (First-order criticality). A point x* is said to be first-order critical for f : R^n → R if ∇_x f(x*) = 0.

First-order criticality (2)

A first-order critical point is, however, not always a local minimizer. Consider for instance the function f(x) = x³. Then f′(0) = 0, but 0 is clearly not a local minimizer of f(·). Special case: convex functions. If f is convex, a local minimizer is also a global minimizer, and so a first-order critical point is a global minimizer.

Algorithms

Definition (Descent algorithm). An algorithm A is a descent algorithm with respect to a continuous function z : X → R if
1. x ∉ X* and y ∈ A(x) imply z(y) < z(x);
2. x ∈ X* and y ∈ A(x) imply z(y) ≤ z(x);
where X* denotes the target set of the algorithm (e.g. the set of critical points).

All the algorithms that we will consider are descent algorithms, at least locally, and are iterative. In other terms, they construct a (possibly infinite) sequence of points x_k, k = 0, 1, …, and we hope that this sequence converges to a minimizer of the objective function. The question is therefore how to construct such a sequence.

Steepest descent algorithm

The simplest algorithm is the steepest descent method.

Step 0. Given a starting point x_0, set k = 0.
Step 1. Set d_k = −∇_x f(x_k). If d_k = 0, stop.
Step 2. Solve (possibly approximately) min_α f(x_k + α d_k) for the stepsize α_k.
Step 3. Set x_{k+1} = x_k + α_k d_k, k ← k + 1. Go to Step 1.

Theorem (Convergence of the steepest descent algorithm). Suppose that f(·) : R^n → R is continuously differentiable on the set S = {x ∈ R^n | f(x) ≤ f(x_0)}, and that S is closed and bounded. Then every cluster point x* of the sequence {x_k} satisfies ∇_x f(x*) = 0.

But proving convergence is not sufficient! Is the method efficient?

The Rosenbrock function

Apply the steepest descent algorithm to the function

f(x_1, x_2) = (1 − x_1)² + 100 (x_2 − x_1²)².

The iterates exhibit a "zig-zag" behavior; in other terms, the convergence is very slow! Is it possible to improve it? (See the sketch below.)
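The following sketch runs steepest descent on the Rosenbrock function. It assumes an Armijo-type backtracking rule for Step 2 (a practical stand-in for the exact linesearch, not specified in the slides); starting from the classical point (−1.2, 1), it typically needs many thousands of iterations, which is the slow zig-zag behavior mentioned above.

```python
import numpy as np

def rosenbrock(x):
    return (1 - x[0])**2 + 100 * (x[1] - x[0]**2)**2

def rosenbrock_grad(x):
    return np.array([-2 * (1 - x[0]) - 400 * x[0] * (x[1] - x[0]**2),
                     200 * (x[1] - x[0]**2)])

def steepest_descent(f, grad, x0, tol=1e-6, max_iter=50000):
    x = np.asarray(x0, dtype=float)
    for k in range(max_iter):
        d = -grad(x)                      # Step 1: steepest descent direction
        if np.linalg.norm(d) < tol:       # stop when the gradient vanishes
            break
        alpha = 1.0                       # Step 2: Armijo backtracking
        while f(x + alpha * d) > f(x) - 1e-4 * alpha * (d @ d):
            alpha *= 0.5
        x = x + alpha * d                 # Step 3: take the step
    return x, k

x_star, iters = steepest_descent(rosenbrock, rosenbrock_grad, [-1.2, 1.0])
print(x_star, iters)  # approaches (1, 1), but only after many iterations
```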
Newton method: introduction

Can we converge faster? If f is C², we can write the Taylor expansion of order 2:

f(x + d) ≈ f(x) + ∇_x f(x)^T d + (1/2) d^T ∇²_{xx} f(x) d.

Note: assume that (1/2) d^T ∇²_{xx} f(x*) d > 0 for ‖d‖ < ε (sufficient second-order criticality condition: local strong convexity). If ∇_x f(x*) = 0, then x* is a local minimizer. Otherwise, we can choose d such that ∇_x f(x*)^T d < 0, and x* is not a local minimizer.

Newton method

At iteration k, define, around the current iterate x_k,

m_k(d) = f(x_k) + ∇_x f(x_k)^T d + (1/2) d^T ∇²_{xx} f(x_k) d.

If ∇²_{xx} f(x_k) is positive definite, m_k(d) is convex, and we can minimize m_k(d) by computing d such that ∇_d m_k(d) = 0. In other terms, we compute d_k as

d_k = −[∇²_{xx} f(x_k)]^{−1} ∇_x f(x_k),

and we set x_{k+1} = x_k + d_k. If the starting point x_0 is close to the solution, the convergence is quadratic; but if the starting point is not good enough, the method can diverge!

Quasi-Newton method

It is often difficult to obtain an analytical expression for the derivatives, and they may be costly to compute at each iteration, so we prefer to turn to approximations. First-order derivatives can be computed by finite differences:

∂f/∂x_i ≈ (f(x + ε e_i) − f(x)) / ε,

for small ε, where e_i = (0, …, 0, 1, 0, …, 0)^T is the i-th canonical vector, or by central differences:

∂f/∂x_i ≈ (f(x + ε e_i) − f(x − ε e_i)) / (2ε).

This, however, requires O(n) evaluations of the objective.

Hessian approximation: BFGS

We can approximate the Hessian with a similar approach, but the computational cost is then O(n²), which is usually too expensive. An alternative is to construct an approximation of the Hessian that we improve at each iteration, using the information gained along the way. We then speak of a quasi-Newton method. A popular approximation is BFGS (Broyden, Fletcher, Goldfarb and Shanno):

B_{k+1} = B_k + (y_k y_k^T) / (y_k^T d_k) − (B_k d_k)(B_k d_k)^T / (d_k^T B_k d_k),

where d_k = x_{k+1} − x_k and y_k = ∇_x f(x_{k+1}) − ∇_x f(x_k). (A sketch follows below.)
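As an illustration, here is a minimal quasi-Newton sketch built around the BFGS update above. The Armijo backtracking and the curvature safeguard y^T d > 0 (skipping the update when curvature is too small, so that B stays positive definite) are standard practical choices, not taken from the slides.

```python
import numpy as np

def bfgs_update(B, d, y):
    """BFGS update of the Hessian approximation B, given the step
    d = x_{k+1} - x_k and the gradient difference y = g_{k+1} - g_k."""
    Bd = B @ d
    return B + np.outer(y, y) / (y @ d) - np.outer(Bd, Bd) / (d @ Bd)

def quasi_newton(f, grad, x0, tol=1e-8, max_iter=500):
    x = np.asarray(x0, dtype=float)
    B = np.eye(len(x))                    # start from the identity matrix
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        d = np.linalg.solve(B, -g)        # quasi-Newton direction: B d = -g
        alpha = 1.0                       # Armijo backtracking on the step
        while f(x + alpha * d) > f(x) + 1e-4 * alpha * (g @ d):
            alpha *= 0.5
        step = alpha * d
        y = grad(x + step) - g
        if y @ step > 1e-12:              # curvature safeguard: keep B pos. def.
            B = bfgs_update(B, step, y)
        x = x + step
    return x
```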
BHHH

Various alternatives to BFGS exist, such as the Symmetric Rank 1 (SR1) update, which is popular for nonconvex optimization with trust regions. Particular case: maximum likelihood,

max_β LL(β) = Σ_i ln P_i(β).

Information identity: when I → ∞, and under some conditions (basically, the model has to be correctly specified), we have

H(β*) = −B(β*),

where B is the information matrix. The BHHH method, proposed by Berndt, Hall, Hall and Hausman in 1974, simply replaces the Hessian by the opposite of the information matrix in the quasi-Newton method.

Information matrix

The information matrix is

B = E[ (∇P(β*) / P(β*)) (∇P(β*) / P(β*))^T ].

In practice, we compute it as

B = (1/I) Σ_{i=1}^{I} (∇P_i(β*) / P_i(β*)) (∇P_i(β*) / P_i(β*))^T.

Advantage: cheap to compute! Drawback: only valid if the information identity holds, which is rarely the case in practice! Example: multinomial logit on panel data. Suggestion: start with BHHH and then switch to BFGS or SR1.

Asymptotics

It can be shown that

√I (β*_I − β*) →_D N(0, H^{−1} B H^{−1}),

so the information matrix is needed when computing confidence intervals on the parameters. Under the information identity, this simplifies to

√I (β*_I − β*) →_D N(0, B^{−1}),

but since the identity often does not hold, care must be exercised. Note: a model can be statistically consistent while not correctly formulated (i.e. one obtains the right parameters)! Example: multinomial logit on panel data.

Globalization of the Newton method

Global convergence: the algorithm must converge from any starting point. BUT it still converges to a local minimum, not a global minimum! So how do we ensure global convergence? Two classical globalizations of the Newton method: linesearch methods and trust-region methods. Line-search methods generate the iterates by setting

x_{k+1} = x_k + α_k d_k,

where d_k is a search direction and the stepsize α_k is chosen so that f(x_{k+1}) < f(x_k). Linesearch methods are therefore descent methods, and the steepest descent method is a special case.

Linesearch methods

Most line-search versions of the basic Newton method generate the direction d_k by modifying the Hessian matrix ∇²_{xx} f(x_k) to ensure that the quadratic model m_k of the function has a unique minimizer. The modified Cholesky decomposition approach adds positive quantities to the diagonal of ∇²_{xx} f(x_k) during the Cholesky factorization. As a result, a diagonal matrix E_k with nonnegative diagonal entries is generated such that ∇²_{xx} f(x_k) + E_k is positive definite.

Linesearch methods (2)

Given this decomposition, the search direction d_k is obtained by solving

(∇²_{xx} f(x_k) + E_k) d_k = −∇_x f(x_k).

After d_k is found, a line-search procedure is used to choose an α_k > 0 that approximately minimizes f along the ray {x_k + α d_k | α > 0}. The algorithms for determining α_k usually rely on quadratic or cubic interpolation of the univariate function φ(α) = f(x_k + α d_k) in their search for a suitable α_k.

Determination of α_k

An elegant and practical criterion for a suitable α_k is to require α_k to satisfy the sufficient decrease condition

f(x_k + α_k d_k) ≤ f(x_k) + μ α_k d_k^T ∇_x f(x_k),

and the curvature condition

|d_k^T ∇_x f(x_k + α_k d_k)| ≤ η |d_k^T ∇_x f(x_k)|,

where μ and η are two constants with 0 < μ < η < 1. The sufficient decrease condition guarantees, in particular, that f(x_{k+1}) < f(x_k), while the curvature condition requires that α_k not be too far from a minimizer of φ. Requiring an accurate minimizer is generally wasteful of function and gradient evaluations, so codes typically use μ = 0.001 and η = 0.9 in these conditions. (A minimal search procedure based on these two conditions is sketched below.)
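A crude sketch of a stepsize search enforcing the two conditions above, with the quoted values μ = 0.001 and η = 0.9 as defaults. The simple expand/bisect strategy is an assumption for brevity; as noted above, production codes use safeguarded quadratic or cubic interpolation instead.

```python
import numpy as np

def wolfe_linesearch(f, grad, x, d, mu=0.001, eta=0.9, max_trials=50):
    """Search for alpha satisfying the sufficient decrease and (strong)
    curvature conditions along the descent direction d."""
    g0 = grad(x) @ d                 # directional derivative at alpha = 0 (< 0)
    lo, hi, alpha = 0.0, np.inf, 1.0
    for _ in range(max_trials):
        if f(x + alpha * d) > f(x) + mu * alpha * g0:
            hi = alpha               # step too long: sufficient decrease fails
        elif abs(grad(x + alpha * d) @ d) > eta * abs(g0):
            lo = alpha               # step too short: curvature condition fails
        else:
            return alpha             # both conditions hold
        # expand while no upper bound is known, otherwise bisect the bracket
        alpha = 2 * lo if hi == np.inf else (lo + hi) / 2
    return alpha                     # fallback after max_trials attempts
```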
Trust region methods

Principle: at iteration k, approximately minimize a model m_k of the objective inside a region B_k. A typical choice for m_k is

m_k(x_k + s) = f(x_k) + s^T ∇_x f(x_k) + (1/2) s^T H_k s,

where H_k is an approximation of ∇²_{xx} f(x_k). We therefore have to solve the subproblem

min_s m_k(x_k + s), such that x_k + s ∈ B_k.

The solution is the candidate iterate, with candidate step s_k. We then compute the ratio

ρ_k = (f(x_k + s_k) − f(x_k)) / (m_k(x_k + s_k) − m_k(x_k)).

Trust region methods (2)

Let η_1 and η_2 be constants such that 0 < η_1 ≤ η_2 < 1 (for instance, η_1 = 0.01 and η_2 = 0.75).
If ρ_k ≥ η_1, accept the candidate; if moreover ρ_k ≥ η_2, enlarge B_k, otherwise reduce it or keep it the same.
If ρ_k < η_1, reject the candidate and reduce B_k.
Stop when some criteria are met (e.g. the norm of the relative gradient is small enough). The neighborhood where we consider the model to be valid is mathematically defined by a ball centered at x_k, with radius Δ_k:

B_k = {x | ‖x − x_k‖_k ≤ Δ_k}.

Norms

The norm choice may depend on the iteration k. We often use the 2-norm or the ∞-norm. For a vector x of R^n:

‖x‖_2 = ( Σ_{i=1}^{n} x_{(i)}² )^{1/2}, ‖x‖_∞ = max_{i=1,…,n} |x_{(i)}|, ‖x‖_1 = Σ_{i=1}^{n} |x_{(i)}|.

Trust region update

At the end of each iteration k, we update the trust-region radius as follows:

Δ_{k+1} ∈ [Δ_k, ∞) if ρ_k ≥ η_2,
Δ_{k+1} ∈ [γ_2 Δ_k, Δ_k] if ρ_k ∈ [η_1, η_2),
Δ_{k+1} ∈ [γ_1 Δ_k, γ_2 Δ_k] if ρ_k < η_1,

with 0 < η_1 ≤ η_2 < 1 and 0 < γ_1 ≤ γ_2 < 1. Remark: variants exist, for both theoretical and practical concerns. The choice of Δ_0 remains difficult, and various heuristics have been proposed. See for instance A. Sartenaer, "Automatic Determination of an Initial Trust Region in Nonlinear Programming", SIAM Journal on Scientific Computing 18(6), pp. 1788-1803, 1997.

Trust region: example

min_{x,y} −10x² + 10y² + 4 sin(xy) − 2x + x⁴

[Figure: iterates of the trust-region method on this function, plotted in the (x, y) plane for x ∈ [−3, 5] and y ∈ [−3, 3].]

Trust region: example (2)

[Figure: further illustration of the trust-region iterates on the same example.]
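To close, a simplified trust-region sketch on this example. For brevity it uses a first-order model (H_k = 0) with a Cauchy-type step to the region boundary rather than an approximate solve of the quadratic subproblem, and the shrink factors γ_1 = 0.25, γ_2 = 0.5 and the starting point are illustrative choices; η_1 = 0.01 and η_2 = 0.75 are the values quoted above.

```python
import numpy as np

def f(v):
    x, y = v
    return -10*x**2 + 10*y**2 + 4*np.sin(x*y) - 2*x + x**4

def grad(v):
    x, y = v
    return np.array([-20*x + 4*y*np.cos(x*y) - 2 + 4*x**3,
                     20*y + 4*x*np.cos(x*y)])

def trust_region(f, grad, x0, delta0=1.0, eta1=0.01, eta2=0.75,
                 gamma1=0.25, gamma2=0.5, tol=1e-6, max_iter=1000):
    x, delta = np.asarray(x0, dtype=float), delta0
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        s = -delta * g / np.linalg.norm(g)  # Cauchy-type step to the boundary
        pred = -(s @ g)                     # model decrease m_k(x) - m_k(x+s)
        rho = (f(x) - f(x + s)) / pred      # agreement ratio rho_k
        if rho >= eta1:
            x = x + s                       # accept the candidate
        if rho >= eta2:
            delta *= 2.0                    # very successful: enlarge B_k
        elif rho < eta1:
            delta *= gamma1                 # rejected: shrink B_k sharply
        else:
            delta *= gamma2                 # partial success: shrink a little
    return x

x_min = trust_region(f, grad, np.array([3.0, 2.0]))
print(x_min)  # a first-order critical point of the nonconvex objective
```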