Notes on Gradient Methods (CS 281A Recitation)

Peter Jin
[email protected]

September 14, 2016

Abstract

These notes are a sampling of topics related to gradient methods.

1 Gradient methods

1.1 Gradient descent

The optimization view of learning poses the learning problem as a loss minimization problem:

    θ* = arg min_θ L(x; θ)    (1)

where the loss function L(x; θ) is defined in terms of the data, x ∈ R^{n×m}, and the model parameters to be learned, θ ∈ R^d. The loss is typically a separable loss that is the sum of smaller loss functions defined on individual data samples:

    L(x; θ) = (1/n) Σ_{i=1}^{n} ℓ(x_i; θ).    (2)

If the loss is differentiable (it often is), then the method of gradient descent applies, in which the model parameters are updated iteratively in the negative direction of the gradient of the loss with respect to the parameters:

    θ_{k+1} ← θ_k − α_k ∇_{θ_k} L(x; θ_k).    (3)

For a convex loss and appropriate choices of the step sizes α_k, gradient descent will converge to the optimal parameter θ* that minimizes the loss, with rates of convergence related to the convexity of L and the continuity of ∇_θ L. The theory of convex optimization deals with the convergence of gradient methods and other algorithms on convex loss functions.

1.2 Accelerated gradient/momentum

The gradient method with momentum keeps track of the gradients of previous iterations through a moving average:

    v_k ← −α_k ∇_{θ_k} L(x; θ_k) + µ v_{k−1}    (4)
    θ_{k+1} ← θ_k + v_k.    (5)

Nesterov's accelerated gradient can also be formulated as a momentum method with the slightly modified update:

    v_k ← −α_k ∇_{θ_k} L(x; θ_k + µ v_{k−1}) + µ v_{k−1}    (6)
    θ_{k+1} ← θ_k + v_k.    (7)

Momentum typically does not slow down convergence and is fairly cheap to implement, so there is usually no reason not to use it (at least on convex-ish problems).

1.3 Newton and quasi-Newton methods

Whereas gradient (1st order) methods update the parameter in the direction of the gradient, Newton and quasi-Newton (2nd order) methods use second derivatives to incorporate curvature into the update direction. Newton's method proper makes use of the Hessian

    H_k = ∇²_{θ_k} L(x; θ_k)    (8)

with update rule

    θ_{k+1} ← θ_k − H_k^{−1} ∇_{θ_k} L(x; θ_k).    (9)

Rather than inverting H to calculate d = H^{−1} g, instead solve the linear system H d = g using, e.g., conjugate gradient. Quasi-Newton methods involve an approximate Hessian H̃, often using only first derivatives to estimate second derivatives. The Gauss-Newton and natural gradient methods use positive definite approximations, while L-BFGS maintains a low-rank approximation.

1.4 Stochastic gradient descent

Stochastic gradient descent (SGD) in its most common form: select an index i ∈ [1, . . . , n] and perform an update using the gradient calculated from a single sample:

    θ_{k+1} ← θ_k − α_k ∇_{θ_k} ℓ(x_i; θ_k).    (10)

SGD works in expectation:

    E_{i ∼ Uniform([1,...,n])} [∇_θ ℓ(x_i; θ)] = (1/n) Σ_{i=1}^{n} ∇_θ ℓ(x_i; θ)    (11)
                                               = ∇_θ L(x; θ).    (12)

In other words, the stochastic gradient is an unbiased estimator of the total gradient ∇_θ L.

In minibatch SGD, select a minibatch of indices I ⊂ [1, . . . , n] and perform an update using the average gradient of the minibatch samples:

    θ_{k+1} ← θ_k − (α_k / |I|) Σ_{i ∈ I} ∇_{θ_k} ℓ(x_i; θ_k).    (13)

While minibatching requires doing more work per update, the minibatch stochastic gradient is a lower-variance estimator of ∇_θ L, which allows a larger step size α_k. Additionally, on modern hardware platforms (e.g. GPUs, the distributed setting), a moderately sized minibatch stochastic gradient may be more efficient to compute than a single-sample stochastic gradient.
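To make these updates concrete, here is a minimal NumPy sketch of minibatch SGD with classical momentum, combining the momentum update (4)-(5) with the minibatch gradient of (13). The least-squares loss, the hyperparameter values, and all names below are illustrative assumptions rather than anything prescribed in these notes.

    import numpy as np

    def sgd_momentum(x, y, theta, alpha=0.01, mu=0.9, batch_size=32, iters=1000, seed=0):
        """Minibatch SGD with classical momentum, following updates (4)-(5) and (13).

        Uses a least-squares loss L(x; theta) = (1/n) * sum_i 0.5 * (x_i . theta - y_i)^2
        purely as a stand-in for a differentiable, separable loss.
        """
        rng = np.random.default_rng(seed)
        n = x.shape[0]
        v = np.zeros_like(theta)                                   # momentum buffer v_k
        for _ in range(iters):
            idx = rng.choice(n, size=batch_size, replace=False)    # minibatch I
            xb, yb = x[idx], y[idx]
            grad = xb.T @ (xb @ theta - yb) / batch_size           # (1/|I|) sum_i grad of l(x_i; theta)
            v = mu * v - alpha * grad                              # update (4)
            theta = theta + v                                      # update (5)
        return theta

    # Usage: recover a planted parameter vector from noisy linear measurements.
    rng = np.random.default_rng(1)
    theta_true = rng.normal(size=5)
    X = rng.normal(size=(1000, 5))
    Y = X @ theta_true + 0.01 * rng.normal(size=1000)
    theta_hat = sgd_momentum(X, Y, np.zeros(5))
    print(np.round(theta_hat - theta_true, 3))   # entries should be close to zero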
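The remark in Section 1.3 about solving H d = g rather than inverting H can likewise be sketched with a small conjugate gradient routine that touches the Hessian only through Hessian-vector products. The quadratic toy loss, the 10-dimensional problem size, and all names below are assumptions made for the sake of illustration.

    import numpy as np

    def conjugate_gradient(hvp, g, iters=50, tol=1e-10):
        """Solve H d = g given only Hessian-vector products hvp(v) = H v."""
        d = np.zeros_like(g)
        r = g - hvp(d)            # residual
        p = r.copy()
        rs = r @ r
        for _ in range(iters):
            Hp = hvp(p)
            step = rs / (p @ Hp)
            d += step * p
            r -= step * Hp
            rs_new = r @ r
            if rs_new < tol:
                break
            p = r + (rs_new / rs) * p
            rs = rs_new
        return d

    # Newton step on a toy quadratic loss L(theta) = 0.5 * theta^T A theta - b^T theta,
    # whose gradient is A theta - b and whose Hessian is A (symmetric positive definite).
    rng = np.random.default_rng(0)
    M = rng.normal(size=(10, 10))
    A = M @ M.T + 10 * np.eye(10)
    b = rng.normal(size=10)
    theta = np.zeros(10)
    grad = A @ theta - b
    direction = conjugate_gradient(lambda v: A @ v, grad)
    theta = theta - direction                # update (9): theta <- theta - H^{-1} grad
    print(np.linalg.norm(A @ theta - b))     # ~0 after one Newton step on a quadratic

Because the toy loss is quadratic, a single Newton step lands at the minimizer up to the solver tolerance, which the final print confirms.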
1.5 Neural networks, backprop, autodiff

Common to all gradient methods is the calculation of the gradient of a loss function with respect to a parameter θ ∈ R^d. Today, d can be very large: some commonly used modern deep neural networks for computer vision problems have d ≈ 10^7 to 10^8. Estimating the gradient using finite differences is unacceptable, as it would require O(d) function evaluations. Efficient calculation of the gradient is necessary for learning to work at all.

Fortunately, we have a neat trick we can use: the chain rule. Algorithms for computing the chain rule include backpropagation and automatic differentiation, or backprop and autodiff for short. (Backpropagation is also sometimes called reverse-mode autodiff in a particular context.) Because a neural network is essentially a chain of function compositions, it should be possible to start from the "outside" and compute gradients in the opposite order of the original function evaluations; this is exactly what backprop does.

For example, consider the following "neural network":

    L = ℓ(g(f(x; θ^f); θ^g), y)    (14)

where x is the input data, y is a training label for supervised learning, f and g are arbitrary functions each with their own parameters, and ℓ is a link function that calculates the loss. Note that θ^f and θ^g can be considered blocks of a single parameter:

    θ = (θ^f, θ^g)    (15)
    ∇_θ L = (∇_{θ^f} L, ∇_{θ^g} L).    (16)

The regular way to evaluate the neural network is to start from the inside moving out:

    x^f = f(x; θ^f)    (17)
    x^g = g(x^f; θ^g)    (18)
    L = ℓ(x^g, y)    (19)

where x^f and x^g are the intermediate values. This is the forward propagation.

To use a gradient method, we need ∇_θ L. To compute it, we start from the outside moving in and gradually calculate gradients of L with respect to the blocks of θ and the intermediate values:

    ∂L/∂x^g_j = ∂ℓ/∂x^g_j    (20)
    ∂L/∂θ^g_j = Σ_i (∂L/∂x^g_i)(∂x^g_i/∂θ^g_j) = Σ_i (∂L/∂x^g_i)(∂g_i(x^f; θ^g)/∂θ^g_j)    (21)
    ∂L/∂x^f_j = Σ_i (∂L/∂x^g_i)(∂x^g_i/∂x^f_j) = Σ_i (∂L/∂x^g_i)(∂g_i(x^f; θ^g)/∂x^f_j)    (22)
    ∂L/∂θ^f_j = Σ_i (∂L/∂x^f_i)(∂x^f_i/∂θ^f_j) = Σ_i (∂L/∂x^f_i)(∂f_i(x; θ^f)/∂θ^f_j).    (23)

This is the backward propagation. The key to doing the above efficiently in an algorithm is caching the values ∂L/∂x^{(·)}_j and ∂L/∂θ^{(·)}_j and reusing them when needed. It is assumed that gradients of f and g with respect to their respective parameters can be efficiently calculated, typically either analytically or using autodiff (see below). The previous algorithm works on arbitrary directed acyclic graphs of function compositions, and it can also be generalized to computing directional gradients, which can produce Jacobian-vector products and Hessian-vector products (recall the section on quasi-Newton methods).

The other algorithm family, autodiff, has a particular connotation: calculating derivatives based on the function composition of arithmetic operations and perhaps some special functions (e.g. transcendentals). The advantage of autodiff is that all computer programs are composed of arithmetic instructions, giving autodiff wide applicability. Whereas reverse-mode autodiff is essentially backprop for arithmetic operations, forward-mode autodiff operates a bit differently and is more suited for evaluating derivatives with respect to single coordinates rather than calculating entire gradients.
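Returning to the two-block network of equation (14), the following sketch makes the forward pass (17)-(19) and the backward pass (20)-(23) concrete. The concrete choices (an affine-plus-tanh f, an affine g, a squared-error ℓ) and the finite-difference check at the end are illustrative assumptions, not part of the notes.

    import numpy as np

    # A minimal sketch of forward/backward propagation for the composition
    # L = l(g(f(x; theta_f); theta_g), y) of equations (14) and (17)-(19).
    # The concrete choices (affine maps, tanh inside f, squared-error l) are
    # illustrative assumptions.

    rng = np.random.default_rng(0)
    x = rng.normal(size=3)            # input data
    y = rng.normal(size=2)            # training label
    Wf = rng.normal(size=(4, 3))      # theta_f
    Wg = rng.normal(size=(2, 4))      # theta_g

    # Forward propagation (inside moving out), caching intermediate values.
    xf = np.tanh(Wf @ x)              # x^f = f(x; theta_f), eq. (17)
    xg = Wg @ xf                      # x^g = g(x^f; theta_g), eq. (18)
    L = 0.5 * np.sum((xg - y) ** 2)   # L = l(x^g, y), eq. (19)

    # Backward propagation (outside moving in), caching dL/dx^(.) and reusing it.
    dL_dxg = xg - y                                  # eq. (20): dL/dx^g = dl/dx^g
    dL_dWg = np.outer(dL_dxg, xf)                    # eq. (21): dL/dtheta_g via dx^g/dtheta_g
    dL_dxf = Wg.T @ dL_dxg                           # eq. (22): dL/dx^f via dx^g/dx^f
    dL_dWf = np.outer((1 - xf ** 2) * dL_dxf, x)     # eq. (23): dL/dtheta_f via dx^f/dtheta_f

    # Finite-difference spot check on one coordinate of theta_f.
    eps = 1e-6
    Wf_pert = Wf.copy()
    Wf_pert[0, 0] += eps
    L_pert = 0.5 * np.sum((Wg @ np.tanh(Wf_pert @ x) - y) ** 2)
    print(dL_dWf[0, 0], (L_pert - L) / eps)          # the two numbers should agree closely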
The idea behind forward-mode autodiff is to map θ ↦ θ + e_j δ, where e_j has value 1 at the j-th coordinate and 0 at all others, and δ is a "dual number" which satisfies δ² = 0. Applying a function f to θ + e_j δ then yields the derivative as the coefficient of δ:

    f(θ + e_j δ) = f(θ) + (∂f(θ)/∂θ_j) δ.    (24)

The trick behind forward-mode autodiff is that f is composed of arithmetic operations for which the dual-number calculations are straightforward; for example:

    (x + x′δ) + (y + y′δ) = (x + y) + (x′ + y′)δ    (25)
    (x + x′δ)(y + y′δ) = xy + (xy′ + x′y)δ + x′y′δ²    (26)
                       = xy + (xy′ + x′y)δ.    (27)

2 Policy gradient method

2.1 Markov chains

Recall from the previous recitation the definition of a Markov chain. In a Markov chain, a random state variable evolves in time, producing a sequence of random variables, one for each time step k:

    y = (x_0, x_1, x_2, . . . , x_k, . . .)    (28)

where x_0 ∼ p(x_0) and x_k ∼ p(x_k | x_{k−1}) for k > 0. The probability density of sequences y can be written out:

    p(y) = p(x_0) p(x_1 | x_0) p(x_2 | x_1) · · · p(x_k | x_{k−1}) · · ·    (29)

2.2 Markov decision processes

An interesting variation on a Markov chain is the Markov decision process (MDP). Whereas a Markov chain can be thought of as modeling the time evolution of a single object, a Markov decision process models the interaction between a rational agent and its environment (e.g. a gambler at a slot machine, or a Roomba navigating a room). The agent observes the state s_k of its environment, and in turn the agent follows a policy for choosing an action a_k to perform in response to the observed state s_k. After the action a_k is performed, the agent collects a scalar reward r_k, and the next state s_{k+1} is revealed. The agent's goal is to maximize some function of its rewards r_k.

The MDP produces a sequence of observations and decisions, also called an episode or trajectory τ of experience:

    τ = (s_0, a_0, r_0, s_1, a_1, r_1, s_2, a_2, r_2, . . . , s_k, a_k, r_k, . . .)    (30)

where s_0 ∼ p(s_0), s_k ∼ p(s_k | s_{k−1}, a_{k−1}), a_k ∼ p(a_k | s_k), and r_k = R(s_k, a_k). Just like in the case of Markov chains, the probability density of episodes in an MDP can be written out:

    p(τ) = p(s_0) p(a_0 | s_0) p(s_1 | s_0, a_0) p(a_1 | s_1) p(s_2 | s_1, a_1) · · · p(a_k | s_k) p(s_{k+1} | s_k, a_k) · · ·    (31)

In general, the initial state distribution p(s_0) and the state transition distribution p(s_k | s_{k−1}, a_{k−1}) may be difficult if not impossible to estimate correctly; they represent the underlying behavior of the environment and may be hidden from easy measurement. However, the rewards r_k are assumed to be easy to measure, and the policy p(a_k | s_k) is within full control of the agent. The reinforcement learning task is then to learn the policy p(a_k | s_k) given empirical episodic data.

2.3 Derivation of the policy gradient method

The policy gradient method, also known as REINFORCE¹, sets the exact distribution p(a_k | s_k) to be a parametric stochastic policy π(a_k | s_k; θ), where θ are the policy parameters. The goal is to learn a parameterized policy π(a_k | s_k; θ) that maximizes the rewards r_k using the stochastic gradient.

First, we will need to define an objective function. For an episode τ of fixed length, or horizon, H, we define the utility η of an episode to be the sum of the empirical rewards:

    η(τ) = Σ_{k=0}^{H−1} r_k.    (32)

Other choices of η are possible as well, such as the averaged or the discounted rewards. There are a couple of things to note here:

• η is a random variable, as it is a function of τ;
• p(τ) now depends on the policy parameters θ through π(a_k | s_k; θ).
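Before turning the sum of rewards into an optimization objective, it may help to see this bookkeeping in code. The sketch below samples a single episode τ from a small made-up tabular environment under a softmax policy and computes η(τ) as in (32); the transition table, reward table, horizon, and policy parameterization are all illustrative assumptions. Episodes sampled this way are exactly what the Monte Carlo estimator derived at the end of this section consumes.

    import numpy as np

    # A toy sketch of generating an episode tau = (s_0, a_0, r_0, ..., s_{H-1}, a_{H-1}, r_{H-1})
    # and computing the utility eta(tau) of equation (32). The 3-state / 2-action
    # dynamics, reward table, and softmax policy below are illustrative assumptions.

    N_STATES, N_ACTIONS, H = 3, 2, 10
    rng = np.random.default_rng(0)
    P = rng.dirichlet(np.ones(N_STATES), size=(N_STATES, N_ACTIONS))   # p(s' | s, a)
    R = rng.normal(size=(N_STATES, N_ACTIONS))                         # r_k = R(s_k, a_k)

    def policy_probs(state, theta):
        """Softmax policy pi(a | s; theta) with one logit per (state, action) pair."""
        logits = theta[state]
        z = np.exp(logits - logits.max())
        return z / z.sum()

    def sample_episode(theta, horizon=H):
        """Roll out the policy for a fixed horizon; return the episode and eta(tau)."""
        states, actions, rewards = [], [], []
        s = rng.integers(N_STATES)                                # s_0 ~ p(s_0), here uniform
        for _ in range(horizon):
            a = rng.choice(N_ACTIONS, p=policy_probs(s, theta))   # a_k ~ pi(a_k | s_k; theta)
            states.append(s); actions.append(a); rewards.append(R[s, a])
            s = rng.choice(N_STATES, p=P[s, a])                   # s_{k+1} ~ p(s_{k+1} | s_k, a_k)
        return states, actions, rewards, sum(rewards)             # last entry is eta(tau), eq. (32)

    theta = np.zeros((N_STATES, N_ACTIONS))
    states, actions, rewards, eta = sample_episode(theta)
    print(eta)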
We set the optimization objective to be the expectation of η(τ) with respect to the distribution of episodes τ:

    θ* = arg max_θ E_{τ ∼ p(τ)} [η(τ)]    (33)

where p(τ) was given earlier in this section. The remaining thing we need is the gradient of the objective with respect to the policy parameters θ. In the rest of this section we will do a step-by-step derivation.

Write the gradient of the expectation as a (multi-dimensional!) integral and expand p(τ):

    ∇_θ E_{τ ∼ p(τ)} [η(τ)] = ∇_θ ∫ η(τ) p(τ) dτ    (34)
                            = ∇_θ ∫ η(τ) p(s_0) Π_{k=0}^{H−1} p(a_k | s_k) p(s_{k+1} | s_k, a_k) dτ    (35)
                            = ∇_θ ∫ η(τ) p(s_0) Π_{k=0}^{H−1} π(a_k | s_k; θ) p(s_{k+1} | s_k, a_k) dτ.    (36)

Move the gradient inside the integral and isolate the gradient to just the part that depends on θ, i.e. the policy probabilities:

    ∇_θ E_{τ ∼ p(τ)} [η(τ)] = ∫ ∇_θ [ η(τ) p(s_0) Π_{k=0}^{H−1} π(a_k | s_k; θ) p(s_{k+1} | s_k, a_k) ] dτ    (37)
                            = ∫ η(τ) ∇_θ [ Π_{k=0}^{H−1} π(a_k | s_k; θ) ] p(s_0) Π_{k=0}^{H−1} p(s_{k+1} | s_k, a_k) dτ.    (38)

Expand the gradient of the bracketed term using the product rule:

    ∇_θ E_{τ ∼ p(τ)} [η(τ)]    (39)
        = ∫ η(τ) [ Σ_{k=0}^{H−1} ∇_θ π(a_k | s_k; θ) / π(a_k | s_k; θ) ] [ Π_{k=0}^{H−1} π(a_k | s_k; θ) ] p(s_0) Π_{k=0}^{H−1} p(s_{k+1} | s_k, a_k) dτ.    (40)

In the equation above, the first bracketed factor is a sum of score functions, while the remaining factors together recover p(τ). For the score term we can apply the following trick:

    ∇_x f(x) / f(x) = ∇_x log(f(x)).    (41)

Finally, substituting for both terms, we derive the policy gradient:

    ∇_θ E_{τ ∼ p(τ)} [η(τ)] = ∫ η(τ) [ Σ_{k=0}^{H−1} ∇_θ log(π(a_k | s_k; θ)) ] p(τ) dτ.    (42)

In fact, the policy gradient essentially consists of the gradient of the log likelihood of the policy weighted by the utility η, making the policy gradient very similar to maximum likelihood supervised learning.

To estimate the policy gradient in practice, rather than compute the exact integral, which requires full knowledge of p(τ), we instead assume that we can sample episodes τ^(i) ∼ p(τ) by executing the current policy π(a | s; θ) in a real or simulated environment. Then we use a minibatch of size B to compute a Monte Carlo estimate of the integral, which yields a stochastic gradient:

    ∇_θ E_{τ ∼ p(τ)} [η(τ)] ≈ (1/B) Σ_{i=1}^{B} η(τ^(i)) [ Σ_{k=0}^{H−1} ∇_θ log(π(a^(i)_k | s^(i)_k; θ)) ].    (43)

¹ Williams. "Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning." Machine Learning, 1992.
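To make the estimator (43) concrete, the sketch below implements it for a tabular softmax policy on a small made-up environment (3 states, 2 actions, fixed horizon); the environment tables, the batch size B, and the single gradient-ascent step at the end are all illustrative assumptions rather than anything prescribed in these notes. For a tabular softmax policy, ∇_θ log π(a_k | s_k; θ) is simply the one-hot vector for a_k minus the action probabilities, placed in the row of θ corresponding to s_k.

    import numpy as np

    # A minimal sketch of the Monte Carlo policy gradient estimate of equation (43)
    # for a tabular softmax policy. The 3-state / 2-action environment, horizon, and
    # batch size are illustrative assumptions.

    N_S, N_A, H, B = 3, 2, 10, 64
    rng = np.random.default_rng(0)
    P = rng.dirichlet(np.ones(N_S), size=(N_S, N_A))   # p(s' | s, a)
    R = rng.normal(size=(N_S, N_A))                    # r_k = R(s_k, a_k)

    def softmax(v):
        z = np.exp(v - v.max())
        return z / z.sum()

    def policy_gradient_estimate(theta):
        """Average over B sampled episodes of eta(tau) * sum_k grad log pi(a_k | s_k; theta)."""
        grad = np.zeros_like(theta)
        for _ in range(B):
            s = rng.integers(N_S)                       # s_0 ~ p(s_0)
            score_sum = np.zeros_like(theta)            # sum_k grad_theta log pi(a_k | s_k; theta)
            eta = 0.0                                   # eta(tau) = sum_k r_k
            for _ in range(H):
                probs = softmax(theta[s])
                a = rng.choice(N_A, p=probs)            # a_k ~ pi(a_k | s_k; theta)
                score_sum[s] += np.eye(N_A)[a] - probs  # grad of log softmax for this (s_k, a_k)
                eta += R[s, a]
                s = rng.choice(N_S, p=P[s, a])          # s_{k+1} ~ p(s_{k+1} | s_k, a_k)
            grad += eta * score_sum
        return grad / B

    # One step of stochastic gradient ascent on the expected utility.
    theta = np.zeros((N_S, N_A))
    theta += 0.01 * policy_gradient_estimate(theta)
    print(np.round(theta, 3))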