
Notes on Gradient Methods
(CS 281A Recitation)
Peter Jin
[email protected]
September 14, 2016
Abstract
These notes are a sampling of topics related to gradient methods.
1 Gradient methods

1.1 Gradient descent
The optimization view of learning poses the learning problem as a loss minimization problem:
θ∗ = arg min_θ L(x; θ)    (1)
where the loss function L(x; θ) is defined in terms of the data, x ∈ R^{n×m}, and the model parameters to be learned, θ ∈ R^d. The loss is typically separable: a sum of smaller loss functions defined on individual data samples:

L(x; θ) = (1/n) ∑_{i=1}^{n} ℓ(x_i; θ).    (2)
If the loss is differentiable (it often is), then the method of gradient descent
applies, in which the model parameters are updated iteratively in the negative
direction of the gradient of the loss with respect to the parameters:
θ_{k+1} ← θ_k − α_k ∇_{θ_k} L(x; θ_k).    (3)
Gradient descent will converge to the optimal parameter θ∗ that minimizes the loss for a convex loss and appropriate choices of the step sizes α_k, with rates of convergence depending on the (strong) convexity of L and the Lipschitz continuity of ∇_θ L. The theory of convex optimization deals with the convergence of gradient methods and other algorithms on convex loss functions.
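As a concrete illustration (not part of the original notes), here is a minimal sketch of update (3) in Python/NumPy on a toy least-squares loss; the data, step size, and iteration count are made-up values for the example.

    import numpy as np

    # Toy least-squares problem: L(x; theta) = (1/2n) ||X theta - y||^2 (convex).
    rng = np.random.default_rng(0)
    n, d = 100, 5
    X = rng.normal(size=(n, d))
    theta_true = rng.normal(size=d)
    y = X @ theta_true + 0.1 * rng.normal(size=n)

    def grad_L(theta):
        # Gradient of the averaged squared error with respect to theta.
        return X.T @ (X @ theta - y) / n

    theta = np.zeros(d)
    alpha = 0.1                                # a constant choice of step size alpha_k
    for k in range(500):
        theta = theta - alpha * grad_L(theta)  # update (3)

    print(np.linalg.norm(theta - theta_true))  # should be small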
1.2 Accelerated gradient/momentum
The gradient method with momentum involves keeping track of the gradients
of previous iterations through a moving average:
v_k ← −α_k ∇_{θ_k} L(x; θ_k) + µ v_{k−1}    (4)
θ_{k+1} ← θ_k + v_k.    (5)
Nesterov’s accelerated gradient can also be formulated as a momentum method
with the slightly modified update:
v_k ← −α_k ∇_{θ_k} L(x; θ_k + µ v_{k−1}) + µ v_{k−1}    (6)
θ_{k+1} ← θ_k + v_k.    (7)
Momentum typically doesn’t slow down convergence, and is fairly cheap to
implement, so there’s usually no reason not to use it (at least on convex-ish
problems).
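A minimal sketch (again not from the notes) of updates (4)-(5) and (6)-(7) on a made-up quadratic loss; alpha and mu are illustrative values, and the only difference between the two loops is where the gradient is evaluated.

    import numpy as np

    A = np.diag([1.0, 10.0])       # made-up quadratic loss L(theta) = 0.5 theta^T A theta

    def grad_L(theta):
        return A @ theta

    alpha, mu = 0.05, 0.9          # illustrative step size and momentum coefficient

    # Classical momentum, updates (4)-(5).
    theta = np.array([1.0, 1.0])
    v = np.zeros_like(theta)
    for k in range(200):
        v = -alpha * grad_L(theta) + mu * v
        theta = theta + v
    print(theta)                   # close to the minimizer at the origin

    # Nesterov's accelerated gradient, updates (6)-(7): the gradient is
    # evaluated at the "looked-ahead" point theta + mu * v.
    theta = np.array([1.0, 1.0])
    v = np.zeros_like(theta)
    for k in range(200):
        v = -alpha * grad_L(theta + mu * v) + mu * v
        theta = theta + v
    print(theta)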
1.3 Newton and quasi-Newton methods
Whereas gradient (1st order) methods update the parameter in the direction
of the gradient, Newton and quasi-Newton (2nd order) methods use the second
derivatives to incorporate curvature into the update direction. Newton’s method
proper makes use of the Hessian:
H_k = ∇²_{θ_k} L(x; θ_k)    (8)

with update rule:

θ_{k+1} ← θ_k − H_k^{-1} ∇_{θ_k} L(x; θ_k).    (9)
Rather than inverting H to calculate d = H^{-1} g, solve the linear system H d = g, using e.g. conjugate gradient.
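Here is a sketch of that idea, assuming a small strictly convex quadratic loss with a known Hessian; the hand-rolled conjugate gradient below is a generic textbook implementation (not code from the notes) and only needs Hessian-vector products H v.

    import numpy as np

    def conjugate_gradient(hvp, g, iters=50, tol=1e-10):
        # Solve H d = g given only Hessian-vector products hvp(v) = H v.
        d = np.zeros_like(g)
        r = g.copy()                   # residual r = g - H d
        p = r.copy()
        rs = r @ r
        for _ in range(iters):
            Hp = hvp(p)
            step = rs / (p @ Hp)
            d = d + step * p
            r = r - step * Hp
            rs_new = r @ r
            if rs_new < tol:
                break
            p = r + (rs_new / rs) * p
            rs = rs_new
        return d

    # Made-up strictly convex quadratic loss 0.5 theta^T H theta with known Hessian.
    H = np.array([[3.0, 1.0], [1.0, 2.0]])
    theta = np.array([5.0, -3.0])
    g = H @ theta                                # gradient at theta
    d = conjugate_gradient(lambda v: H @ v, g)   # Newton direction d = H^{-1} g
    theta = theta - d                            # update (9); exact minimizer in one step here
    print(theta)                                 # approximately [0, 0]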
Quasi-Newton methods involve an approximate Hessian H̃, often using only first derivatives to estimate second derivatives. The Gauss-Newton and natural gradient methods use positive definite approximations, while L-BFGS maintains a low-rank approximation.
1.4 Stochastic gradient descent
Stochastic gradient descent (SGD), in its most common form, selects an index i ∈ [1, . . . , n] and performs an update using the gradient calculated from a single sample:
θ_{k+1} ← θ_k − α_k ∇_{θ_k} ℓ(x_i; θ_k).    (10)
SGD works in expectation:
E_{i∼Uniform([1,...,n])} [∇_θ ℓ(x_i; θ)] = (1/n) ∑_{i=1}^{n} ∇_θ ℓ(x_i; θ)    (11)
= ∇_θ L(x; θ).    (12)
In other words, the stochastic gradient is an unbiased estimator of the total
gradient ∇θ L.
In minibatch SGD, select a minibatch of indices I ⊂ [1, . . . , n] and perform
an update using the average gradient of the minibatch samples:
θ_{k+1} ← θ_k − (α_k/|I|) ∑_{i∈I} ∇_{θ_k} ℓ(x_i; θ_k).    (13)
While minibatching requires more computation per update, the minibatch stochastic gradient is a lower-variance estimator of ∇_θ L, which allows a larger step size α_k. Additionally, on modern hardware platforms (e.g. GPUs, or the distributed setting), a moderately sized minibatch stochastic gradient may be more efficient to compute than a single-sample stochastic gradient.
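A minimal sketch of minibatch SGD, update (13), on the same kind of toy least-squares loss used above; the batch size, step size, and data are assumptions for the example.

    import numpy as np

    rng = np.random.default_rng(0)
    n, d, batch_size = 1000, 5, 32
    X = rng.normal(size=(n, d))
    theta_true = rng.normal(size=d)
    y = X @ theta_true + 0.1 * rng.normal(size=n)

    theta = np.zeros(d)
    alpha = 0.1
    for k in range(2000):
        I = rng.choice(n, size=batch_size, replace=False)     # minibatch of indices
        # Average per-sample gradient over the minibatch, as in update (13).
        grad = X[I].T @ (X[I] @ theta - y[I]) / batch_size
        theta = theta - alpha * grad

    print(np.linalg.norm(theta - theta_true))    # small, up to SGD noise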
1.5 Neural networks, backprop, autodiff
Common to all gradient methods is the calculation of the gradient of a loss function with respect to a parameter θ ∈ R^d. Today, d can be very large: some commonly used modern deep neural networks for computer vision problems have d ≈ 10^7 to 10^8. Estimating the gradient using finite differences is unacceptable as it would require O(d) function evaluations. Efficient calculation of the gradient is necessary for learning to work at all.
Fortunately, we have a neat trick we can use: the chain rule. Algorithms for
computing the chain rule include backpropagation and automatic differentiation,
or backprop and autodiff for short. (Backpropagation is also sometimes called
reverse-mode autodiff in a particular context.) Since a neural network is essentially a chain of function compositions, it should be possible to start from the "outside" and compute gradients in the opposite order of the original function evaluations; this is exactly what backprop does.
For example, consider the following “neural network”:
L = ℓ(g(f(x; θ^f); θ^g), y)    (14)
where x is the input data, y is a training label for supervised learning, f and g are arbitrary functions each with their own parameters, and ℓ is a link function that calculates the loss. Note that θ^f and θ^g can be considered blocks of a single parameter:

θ = (θ^f, θ^g)    (15)
∇_θ L = (∇_{θ^f} L, ∇_{θ^g} L).    (16)
The regular way to evaluate the neural network is to start from the inside
moving out:
x^f = f(x; θ^f)    (17)
x^g = g(x^f; θ^g)    (18)
L = ℓ(x^g, y)    (19)

where x^f and x^g are the intermediate values. This is the forward propagation.
To use a gradient method, we need ∇θ L. To do so, we will start from the
outside moving in and gradually calculate gradients of L with respect to the
blocks of θ and the intermediate values:
∂L/∂x^g_j = ∂ℓ/∂x^g_j    (20)

∂L/∂θ^g_j = ∑_i (∂L/∂x^g_i)(∂x^g_i/∂θ^g_j) = ∑_i (∂L/∂x^g_i)(∂g_i(x^f; θ^g)/∂θ^g_j)    (21)

∂L/∂x^f_j = ∑_i (∂L/∂x^g_i)(∂x^g_i/∂x^f_j) = ∑_i (∂L/∂x^g_i)(∂g_i(x^f; θ^g)/∂x^f_j)    (22)

∂L/∂θ^f_j = ∑_i (∂L/∂x^f_i)(∂x^f_i/∂θ^f_j) = ∑_i (∂L/∂x^f_i)(∂f_i(x; θ^f)/∂θ^f_j).    (23)
This is the backward propagation. The key to doing the above efficiently in an algorithm is caching the values ∂L/∂x^{(·)}_j and ∂L/∂θ^{(·)}_j and reusing them when needed. It is assumed that gradients of f and g with respect to their respective parameters can be efficiently calculated, typically either analytically or using autodiff (see below). The previous algorithm works on arbitrary directed acyclic graphs of function compositions, and it can also be generalized to computing directional gradients, which can produce Jacobian-vector products and Hessian-vector products (recall the section on quasi-Newton methods).
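To make equations (17)-(23) concrete, here is a small sketch (not from the notes) in which f and g are taken to be linear maps and ℓ is the squared error; every array, shape, and value here is an assumption made for illustration. A finite difference on one coordinate is included only as a sanity check.

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(size=4)                 # input
    y = rng.normal(size=2)                 # label
    theta_f = rng.normal(size=(3, 4))      # parameters of f (a linear map, for illustration)
    theta_g = rng.normal(size=(2, 3))      # parameters of g (also linear)

    # Forward propagation, equations (17)-(19).
    x_f = theta_f @ x                      # x^f = f(x; theta^f)
    x_g = theta_g @ x_f                    # x^g = g(x^f; theta^g)
    L = 0.5 * np.sum((x_g - y) ** 2)       # L = loss(x^g, y), squared error here

    # Backward propagation, equations (20)-(23), caching dL/dx^g and dL/dx^f.
    dL_dxg = x_g - y                       # (20)
    dL_dtheta_g = np.outer(dL_dxg, x_f)    # (21): dx^g_i / dtheta^g[i, j] = x^f_j
    dL_dxf = theta_g.T @ dL_dxg            # (22): dx^g_i / dx^f_j = theta^g[i, j]
    dL_dtheta_f = np.outer(dL_dxf, x)      # (23): dx^f_i / dtheta^f[i, j] = x_j

    # Sanity check of one coordinate against a finite difference
    # (the O(d) approach we want to avoid doing for every coordinate).
    eps = 1e-6
    theta_f_pert = theta_f.copy()
    theta_f_pert[0, 0] += eps
    L_pert = 0.5 * np.sum((theta_g @ (theta_f_pert @ x) - y) ** 2)
    print(dL_dtheta_f[0, 0], (L_pert - L) / eps)   # the two numbers should agree closely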
The other algorithm family, autodiff, has a particular connotation: calculating derivatives based on the function composition of arithmetic operations and perhaps some special functions (e.g. transcendentals). The advantage of autodiff is that all computer programs are composed of arithmetic instructions, which gives autodiff wide applicability.
Whereas reverse-mode autodiff is essentially backprop for arithmetic operations, forward-mode autodiff operates a bit differently and is more suited for
evaluating derivatives with respect to single coordinates rather than calculating
entire gradients. The idea is to map θ 7→ θ+ej δ, where ej has value 1 at the j-th
coordinate and 0 at all others, and δ is a “dual number” which satisfies δ 2 = 0.
Then applying a function f to θ + ej δ yields the derivative as the coefficient of
δ:
f (θ + ej δ) = f (θ) +
4
∂f (θ)
δ.
∂θj
(24)
The trick behind forward-mode autodiff is that f is composed of arithmetic
operations for which the dual number calculations are straightforward; for example:
(x + x′δ) + (y + y′δ) = (x + y) + (x′ + y′)δ    (25)
(x + x′δ)(y + y′δ) = xy + (xy′ + x′y)δ + x′y′δ²    (26)
= xy + (xy′ + x′y)δ.    (27)
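A tiny sketch (not from the notes) of the dual-number arithmetic in (25)-(27); the Dual class and the polynomial it is applied to are made up for illustration, and only addition and multiplication are implemented.

    class Dual:
        """Dual number a + b*delta with delta**2 = 0; b carries the derivative."""

        def __init__(self, a, b=0.0):
            self.a, self.b = a, b

        def __add__(self, other):
            other = other if isinstance(other, Dual) else Dual(other)
            # (x + x'd) + (y + y'd) = (x + y) + (x' + y')d, as in (25)
            return Dual(self.a + other.a, self.b + other.b)

        __radd__ = __add__

        def __mul__(self, other):
            other = other if isinstance(other, Dual) else Dual(other)
            # (x + x'd)(y + y'd) = xy + (xy' + x'y)d, as in (26)-(27)
            return Dual(self.a * other.a, self.a * other.b + self.b * other.a)

        __rmul__ = __mul__

    def f(t):
        # Made-up polynomial: f(t) = 3 t^2 + 2 t + 1, so df/dt = 6 t + 2.
        return 3 * t * t + 2 * t + 1

    out = f(Dual(2.0, 1.0))   # seed the coordinate of interest with coefficient 1 on delta
    print(out.a, out.b)       # 17.0 and 14.0, i.e. f(2) and f'(2)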
2 Policy gradient method

2.1 Markov chains
Recall from the previous recitation the definition of a Markov chain. In a Markov chain, a random state variable evolves in time, producing a sequence of random variables, one for each time step k:
y = (x_0, x_1, x_2, . . . , x_k, . . .)    (28)
where x0 ∼ p(x0 ) and xk ∼ p(xk |xk−1 ) for k > 0. The probability density of
sequences y can be written out:
p(y) = p(x_0) p(x_1|x_0) p(x_2|x_1) · · · p(x_k|x_{k−1}) · · ·    (29)

2.2 Markov decision processes
An interesting variation on a Markov chain is the Markov decision process.
Whereas a Markov chain can be thought of as modeling the time evolution
of a single object, a Markov decision process models the interaction between
a rational agent and its environment (e.g. a gambler at a slot machine, or a
Roomba navigating a room). The agent observes the state sk of its environment,
and in turn the agent follows a policy for choosing an action ak to perform in
response to the observed state sk . After the action ak is performed, the agent
collects a scalar reward rk , and the next state sk+1 is revealed. The agent’s
goal is to maximize some function of its rewards rk . The MDP produces a
sequence of observations and decisions, also called an episode or trajectory τ of
experience:
τ = (s_0, a_0, r_0, s_1, a_1, r_1, s_2, a_2, r_2, . . . , s_k, a_k, r_k, . . .)    (30)
where s0 ∼ p(s0 ), sk ∼ p(sk |sk−1 , ak−1 ), ak ∼ p(ak |sk ), and rk = R(sk , ak ).
Just like in the case of Markov chains, the probability density of episodes in an
MDP can be written out:
p(τ) = p(s_0) p(a_0|s_0) p(s_1|s_0, a_0) p(a_1|s_1) p(s_2|s_1, a_1) · · · p(a_k|s_k) p(s_{k+1}|s_k, a_k) · · ·    (31)
In general, the initial state distribution p(s0 ) and the state transition distribution p(sk |sk−1 , ak−1 ) may be difficult if not impossible to estimate correctly;
they represent the underlying behavior of the environment and may be hidden
from easy measurement. However, the rewards rk are assumed to be easy to
measure, and the policy p(ak |sk ) is within full control of the agent. Then the reinforcement learning task is to learn the policy p(ak |sk ) given empirical episodic
data.
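As a sketch of the objects just described (not from the notes), here is a tiny finite MDP with made-up initial, transition, and reward functions, sampled under a placeholder uniform policy to produce one episode of (s_k, a_k, r_k) triples.

    import numpy as np

    rng = np.random.default_rng(0)
    n_states, n_actions, horizon = 3, 2, 5

    # Made-up environment: initial distribution p(s0), transitions p(s'|s, a),
    # and reward function R(s, a). These would normally be hidden from the agent.
    p_s0 = np.array([0.5, 0.3, 0.2])
    p_trans = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))   # [s, a, s']
    R = rng.normal(size=(n_states, n_actions))

    def policy(s):
        # Placeholder uniform policy p(a_k | s_k); learning would replace this.
        return rng.integers(n_actions)

    s = rng.choice(n_states, p=p_s0)
    episode = []
    for k in range(horizon):
        a = policy(s)
        r = R[s, a]
        episode.append((s, a, r))
        s = rng.choice(n_states, p=p_trans[s, a])   # s_{k+1} ~ p(. | s_k, a_k)

    print(episode)   # the trajectory tau as a list of (s_k, a_k, r_k) triples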
2.3 Derivation of the policy gradient method
The policy gradient method, also known as REINFORCE¹, sets the distribution p(ak|sk) to be a parametric stochastic policy π(ak|sk; θ), where θ are the policy parameters. The goal is to learn a parameterized policy π(ak|sk; θ) that maximizes the rewards rk via stochastic gradient ascent.
First, we will need to define an objective function. For an episode τ of fixed
length or horizon H, we define the utility η of an episode to be the sum of the
empirical rewards:
η(τ) = ∑_{k=0}^{H−1} r_k.    (32)
Other choices of η are possible as well, such as the averaged or the discounted rewards. There are a couple of things to note here:
• η is a random variable as it is a function of τ ;
• p(τ ) now depends on the policy parameters θ through π(ak |sk ; θ).
We set the optimization objective to be the expectation of η(τ ) with respect to
the distribution of episodes τ :
θ∗ = arg max_θ E_{τ∼p(τ)} [η(τ)]    (33)
where p(τ ) was given earlier in this section.
The remaining thing we need is the gradient of the objective with respect
to the policy parameters θ. In the rest of this section we will do a step-by-step
derivation.
Write the gradient of the expectation as a (multi-dimensional!) integral and
expand p(τ ):
∇_θ E_{τ∼p(τ)} [η(τ)] = ∇_θ ∫ η(τ) p(τ) dτ    (34)

= ∇_θ ∫ η(τ) p(s_0) ∏_{k=0}^{H−1} p(a_k|s_k) p(s_{k+1}|s_k, a_k) dτ    (35)

= ∇_θ ∫ η(τ) p(s_0) ∏_{k=0}^{H−1} π(a_k|s_k; θ) p(s_{k+1}|s_k, a_k) dτ.    (36)
¹Williams. "Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning." Machine Learning, 1992.
Move the gradient inside the integral and isolate the gradient to just the part that depends on θ, i.e. the policy probabilities:

∇_θ E_{τ∼p(τ)} [η(τ)] = ∫ ∇_θ [η(τ) p(s_0) ∏_{k=0}^{H−1} π(a_k|s_k; θ) p(s_{k+1}|s_k, a_k)] dτ    (37)

= ∫ η(τ) ∇_θ [∏_{k=0}^{H−1} π(a_k|s_k; θ)] p(s_0) ∏_{k=0}^{H−1} p(s_{k+1}|s_k, a_k) dτ.    (38)
Expand the gradient of the bracketed term using the product rule:
∇_θ E_{τ∼p(τ)} [η(τ)]    (39)

= ∫ η(τ) [∑_{k=0}^{H−1} ∇_θ π(a_k|s_k; θ) / π(a_k|s_k; θ)] [∏_{k=0}^{H−1} π(a_k|s_k; θ)] p(s_0) ∏_{k=0}^{H−1} p(s_{k+1}|s_k, a_k) dτ    (40)

(the first bracketed sum is the term marked "score"; the remaining factors together recover p(τ))
As indicated in the equation above, the terms on the right recover p(τ ). For
the term marked “score,” which is a sum of score functions, we can apply the
following trick:
∇_x f(x) / f(x) = ∇_x log(f(x)).    (41)

Finally, substituting for both terms, we derive the policy gradient:

∇_θ E_{τ∼p(τ)} [η(τ)] = ∫ η(τ) [∑_{k=0}^{H−1} ∇_θ log(π(a_k|s_k; θ))] p(τ) dτ.    (42)
In fact, the policy gradient essentially consists of the gradient of the log likelihood of the policy weighted by the utility η, making the policy gradient very
similar to maximum likelihood supervised learning.
To estimate the policy gradient in practice, rather than computing the exact integral, which requires full knowledge of p(τ), we assume that we can sample episodes τ^{(i)} ∼ p(τ) by executing the current policy π(a|s; θ) in a real or simulated environment. Then we use a minibatch of size B to compute a Monte Carlo estimate of the integral, which yields a stochastic gradient:

∇_θ E_{τ∼p(τ)} [η(τ)] ≈ (1/B) ∑_{i=1}^{B} η(τ^{(i)}) [∑_{k=0}^{H−1} ∇_θ log(π(a^{(i)}_k|s^{(i)}_k; θ))].    (43)
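A minimal sketch of estimator (43) with a tabular softmax policy on a made-up finite MDP; the environment, policy parameterization, and hyperparameters are all assumptions for illustration, and the parameters are updated by gradient ascent on the estimated objective.

    import numpy as np

    rng = np.random.default_rng(0)
    n_states, n_actions, horizon, B = 3, 2, 5, 16

    # Made-up environment (hidden from the learner in a real application).
    p_s0 = np.full(n_states, 1.0 / n_states)
    p_trans = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
    R = rng.normal(size=(n_states, n_actions))

    theta = np.zeros((n_states, n_actions))   # tabular softmax policy parameters

    def pi(s, theta):
        # pi(a | s; theta): softmax over the row of theta corresponding to state s.
        z = np.exp(theta[s] - theta[s].max())
        return z / z.sum()

    alpha = 0.05
    for step in range(200):
        grad_est = np.zeros_like(theta)
        for i in range(B):                         # sample B episodes tau^(i)
            s = rng.choice(n_states, p=p_s0)
            grad_logp = np.zeros_like(theta)       # sum_k grad_theta log pi(a_k | s_k; theta)
            eta = 0.0                              # eta(tau) = sum of rewards, as in (32)
            for k in range(horizon):
                probs = pi(s, theta)
                a = rng.choice(n_actions, p=probs)
                eta += R[s, a]
                # Gradient of log softmax: indicator of the taken action minus probs.
                grad_logp[s] -= probs
                grad_logp[s, a] += 1.0
                s = rng.choice(n_states, p=p_trans[s, a])
            grad_est += eta * grad_logp            # one term of the sum in (43)
        grad_est /= B
        theta += alpha * grad_est                  # ascent step, since we maximize the utility

    print(pi(0, theta), pi(1, theta), pi(2, theta))   # tends to favor higher-reward actions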