The Gradient Overflow

Suthee Chaidaroon
May 2016
1 Introduction
Gradient descent is a basic learning algorithm that everyone needs to understand, as it is typically introduced as the learning procedure for linear regression. On the surface, the algorithm is straightforward and intuitive: we iteratively update the model parameters to minimize the squared loss. However, the difficulty becomes more visible when we implement stochastic gradient descent (SGD), a more practical variant of gradient descent for large datasets. The choice of learning rate heavily affects the convergence of the algorithm. This note summarizes the problems that we typically encounter while implementing SGD and proposes a few solutions to address them.
2 A learning rate
The essential formula of the gradient descent algorithm is the update rule:

Θ ← Θ − α · ∇_Θ f(Θ)    (1)
The model parameter Θ is updated to a new value that has a smaller loss, where the loss is represented by the function f(Θ). The parameter α is the learning rate, or step size. A large α makes the algorithm converge faster but can overshoot the minimum; a smaller α is safer but may make the algorithm converge slowly.
It turns out that there is no universal default value for the learning rate. Some texts set α between 0.005 and 0.01, but a good learning rate depends on the dataset we are working with.
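As a minimal sketch of the update rule in Equation (1), the following toy example (the function name and the objective f(θ) = θ² are my own, not from this note) shows how repeated updates drive the parameter toward the minimum:

```python
def gd_step(theta, grad, alpha):
    """One gradient descent update: theta <- theta - alpha * grad."""
    return theta - alpha * grad

# minimize f(theta) = theta ** 2, whose gradient is 2 * theta
theta = 5.0
for _ in range(100):
    theta = gd_step(theta, 2 * theta, alpha=0.1)
# each update multiplies theta by (1 - 2 * alpha) = 0.8, so theta decays to 0
```

With α = 0.1 the parameter shrinks by a constant factor every step; the same loop with α above 1.0 would diverge, which previews the convergence issues discussed next.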
3 The effect of the learning rate on convergence
We demonstrate the effect of using different learning rates. We download a Math Scores and Drug Concentrations dataset that has only 7 data points.1 We want to train a linear model to find a straight line that fits this fabricated data.
1 http://www.stat.ufl.edu/~winner/datasets.html
Figure 1: A simulated dataset
(a) α = 0.1
(b) α = 0.01
Figure 2: The influence of the learning rate on the simple regression dataset
It is clear from Figure 1 that we can easily find a straight line that fits the dataset. Figure 2 shows the result after training the model for 100000 iterations: the panel on the left shows that a learning rate of 0.1 is too high, as the model never converges to the optimal solution. In contrast, lowering the learning rate to 0.01 makes the algorithm converge.
When we plot the cost function over time, we find that when the learning rate is too high, the cost function keeps oscillating around the optimal point. This simple experiment tells us that we need to pick a good learning rate, at this point by trial and error, and hope that we end up with good convergence.
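The oscillation and divergence behaviour can be reproduced on a one-dimensional quadratic, a toy stand-in for the regression loss (the function and the particular α values below are illustrative choices of mine):

```python
def run_gd(alpha, steps=50):
    """Minimize f(x) = x ** 2 and record the cost after each update."""
    x, costs = 5.0, []
    for _ in range(steps):
        x = x - alpha * 2 * x      # gradient of f(x) = x ** 2 is 2 * x
        costs.append(x ** 2)
    return costs

stable = run_gd(alpha=0.1)    # x shrinks by a factor of 0.8 per step
unstable = run_gd(alpha=1.1)  # x flips sign and grows by 1.2x per step
```

Plotting the two cost curves reproduces the qualitative picture in the figures: the stable run decays smoothly, while the unstable run bounces across the minimum with growing cost.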
4 Gradient overflow
(a) α = 0.1
(b) α = 0.01
Figure 3: The cost function over 100000 iterations of gradient descent.
(a) dataset and regression line
(b) Iterations vs Cost
Figure 4: Left: the best model after 500 iterations; Right: the cost function on a log plot. Due to an inappropriate learning rate, the gradient never converges.
In the previous sections, we normalized the gradient at each step to keep its numerical value from overflowing. This technique is very important, as numerical error is common in gradient descent implementations. We now demonstrate the numerical overflow that occurs when we do not normalize the gradient.
From Figure 4, we find that without normalizing the gradient, the choice of learning rate is even more crucial. We set the learning rate to 0.001 and end up with a gradient explosion: the plot on the right shows that the cost function keeps increasing exponentially.
The Python implementation of gradient descent is shown below. Without
normalization, it is more difficult to find the right learning rate.
def gradientDescent(x, y, theta, alpha, m, numIterations):
    xTrans = x.transpose()
    costs = []
    for i in range(0, numIterations):
        hypothesis = np.dot(x, theta)
        loss = hypothesis - y
        cost = np.sum(loss ** 2) / (2 * m)
        costs.append(cost)
        gradient = np.dot(xTrans, loss) / m
        # normalize the gradient
        gradient = gradient / np.linalg.norm(gradient)
        # update
        theta = theta - alpha * gradient
    return theta, costs

(a) SGD
(b) AdaGrad
Figure 5: Left: the cost function of SGD over 200000 iterations. Right: the cost function of AdaGrad SGD over 20000 iterations.
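To check that the routine behaves as described, it can be exercised on fabricated data (the function body is repeated here so the snippet runs standalone, and the data, a noiseless line y = 2 + 3x with a bias column in x, is made up for illustration):

```python
import numpy as np

def gradientDescent(x, y, theta, alpha, m, numIterations):
    # same routine as in the text, repeated so this snippet is self-contained
    xTrans = x.transpose()
    costs = []
    for i in range(0, numIterations):
        hypothesis = np.dot(x, theta)
        loss = hypothesis - y
        cost = np.sum(loss ** 2) / (2 * m)
        costs.append(cost)
        gradient = np.dot(xTrans, loss) / m
        gradient = gradient / np.linalg.norm(gradient)  # normalize
        theta = theta - alpha * gradient
    return theta, costs

# fabricated data: y = 2 + 3 * x, first column of x is the bias term
x = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([2.0, 5.0, 8.0, 11.0])
theta, costs = gradientDescent(x, y, np.zeros(2), alpha=0.01,
                               m=len(y), numIterations=20000)
```

Because the gradient is normalized to unit length, every update moves the parameters by exactly α, so the cost decreases steadily and then hovers within a small ball around the optimum (θ ≈ [2, 3]) instead of overflowing.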
5 SGD
Gradient descent is not a practical algorithm when we work on a larger dataset, because at each step we need to iterate through all training samples to compute the average gradient. A better approximation is to look at only one sample and update the model parameters right away. This is the essence of the stochastic gradient descent algorithm.
Shuffling the dataset and keeping the gradient from overflowing are both very important in SGD. Shuffling makes the order of the samples more random, so consecutive updates are less correlated. Data access also matters in SGD, since it can become a bottleneck while training the model.
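The two points above, shuffling every epoch and normalizing each per-sample gradient, can be sketched as follows (the function name, the seed, and the fabricated data y = 1 + 2x are my own choices, not from this note):

```python
import numpy as np

def sgd(x, y, theta, alpha, epochs, seed=0):
    """SGD sketch: shuffle every epoch, update on one sample at a time,
    and normalize the per-sample gradient to keep it from overflowing."""
    rng = np.random.default_rng(seed)
    m = len(y)
    for _ in range(epochs):
        for i in rng.permutation(m):         # shuffle the sample order
            error = np.dot(x[i], theta) - y[i]
            gradient = error * x[i]          # gradient from one sample
            norm = np.linalg.norm(gradient)
            if norm > 0:                     # avoid dividing by zero
                gradient = gradient / norm
            theta = theta - alpha * gradient
    return theta

# fabricated data: y = 1 + 2 * x, with a bias column in x
x = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])
theta = sgd(x, y, np.zeros(2), alpha=0.01, epochs=2000)
```

Each update only touches one row of x, which is what makes SGD cheap per step on a large dataset; the normalized steps keep the parameters in a small neighbourhood of the solution θ ≈ [1, 2].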
We also implement an extension of SGD, AdaGrad 2 , which uses past gradients to normalize the gradient at every step. It turns out that AdaGrad converges much faster than traditional SGD.
Figure 5 shows that the cost function of the model trained by AdaGrad converges much faster than that of the model trained by SGD. It takes 200000 iterations for SGD to converge to a good solution.
2 http://www.magicbroom.info/Papers/DuchiHaSi10.pdf
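One common way to realize the AdaGrad idea, and the one sketched below under my own naming and fabricated data, is to divide each coordinate of the gradient by the square root of its accumulated squared gradients (following Duchi et al.):

```python
import numpy as np

def adagrad(x, y, theta, alpha, epochs, eps=1e-8, seed=0):
    """AdaGrad-style SGD sketch: scale each parameter's step by the
    accumulated squared gradients seen so far."""
    rng = np.random.default_rng(seed)
    hist = np.zeros_like(theta)       # running sum of squared gradients
    for _ in range(epochs):
        for i in rng.permutation(len(y)):
            error = np.dot(x[i], theta) - y[i]
            gradient = error * x[i]
            hist += gradient ** 2
            # per-coordinate step: frequently-updated coordinates slow down
            theta = theta - alpha * gradient / (np.sqrt(hist) + eps)
    return theta

# fabricated data: y = 1 + 2 * x, with a bias column in x
x = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])
theta = adagrad(x, y, np.zeros(2), alpha=1.0, epochs=5000)
mse = np.mean((x.dot(theta) - y) ** 2)
```

Because the effective step size adapts per coordinate, a single global α works across parameters with very different gradient scales, which is one reason AdaGrad needs less learning rate tuning than plain SGD.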
6 Future Study
There are other SGD variants that work extremely well on large datasets, such as momentum and Adam. Stochastic optimization is an exciting field, as it is commonly used in deep learning.
7 Revision History
• 5/10/2016 - First Draft