The Gradient Overflow

Suthee Chaidaroon

May 2016

1 Introduction

Gradient descent is a basic learning algorithm that everyone needs to understand, as it is typically introduced as the learning algorithm for linear regression. On the surface, the algorithm is straightforward and intuitive: we iteratively update the model parameters to minimize the squared loss function. The difficulty becomes more visible when we implement stochastic gradient descent (SGD), a more practical variant of gradient descent for large datasets, because the choice of learning rate heavily affects the convergence of the algorithm. This note summarizes the problems we typically encounter while developing SGD and proposes a few solutions to address them.

2 The learning rate

The essential formula of the gradient descent algorithm is the update rule:

    Θ = Θ − α · ∇Θ f(Θ)    (1)

The model parameter Θ is updated to a new value that has a smaller loss, where the loss is represented as a function f(Θ) and the parameter α is the learning rate, or step size. A large α makes the algorithm converge faster but can overshoot the minimum, while a smaller α is safer but may make the algorithm converge slowly. It turns out that there is no universal default value for the learning rate: some texts set α between 0.005 and 0.01, but a good value depends on the dataset we are working on.

3 The effect of the learning rate on convergence

We demonstrate the effect of using different learning rates. We download a Math Scores and Drug Concentrations dataset, which has only 7 data points,¹ and want to train a linear model to find a straight line that fits this fabricated data.

¹ http://www.stat.ufl.edu/~winner/datasets.html

Figure 1: A simulated dataset

Figure 2: The influence of the learning rate on the simple regression dataset. (a) α = 0.1; (b) α = 0.01.

From Figure 1, it is clear that we can easily find a straight line that fits the dataset.
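Before looking at the regression results, the sensitivity to α can be seen on an even smaller problem. The following is a minimal sketch, not part of the experiment in this note: the one-dimensional quadratic loss f(θ) = θ², with gradient 2θ, is an assumption chosen purely for illustration. On this loss, gradient descent shrinks θ toward the minimum at 0 when α is small, and overshoots farther on every step when α exceeds 1.

```python
def gradient_descent_1d(alpha, theta0=5.0, steps=50):
    """Run gradient descent on the toy loss f(theta) = theta**2,
    whose gradient is 2 * theta."""
    theta = theta0
    for _ in range(steps):
        theta = theta - alpha * 2 * theta  # update rule: theta <- theta - alpha * grad
    return theta

# A small step size drives theta toward the minimum at 0;
# a step size above 1 makes every update overshoot, so |theta| grows.
print(abs(gradient_descent_1d(alpha=0.1)))  # tiny
print(abs(gradient_descent_1d(alpha=1.1)))  # huge
```

The same qualitative behavior is what the regression figures below show, just on a two-parameter model instead of a scalar.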
Here is the result after training the model for 100,000 iterations. Figure 2 shows that a learning rate of 0.1 is too high, as the model never converges to the optimal solution; in contrast, lowering the learning rate to 0.01 makes the algorithm converge. When we plot the cost function over time, we find that when the learning rate is too high, the cost function keeps oscillating around the optimal point. This simple experiment tells us that we need to pick a good learning rate, at this point by trial and error, and hope that we end up with good convergence.

4 Gradient overflow

In the previous section, we normalized the gradient at each step to keep its numerical value from overflowing. This technique is very important, as numerical error is common in gradient descent implementations. We now demonstrate the numerical overflow that happens when we do not normalize the gradient.

Figure 3: The cost function over 100,000 iterations of gradient descent. (a) α = 0.1; (b) α = 0.01.

Figure 4: Left: the best model after 500 iterations. Right: the cost function in a log plot. Due to an inappropriate learning rate, the gradient never converges. (a) dataset and regression line; (b) iterations vs. cost.

From Figure 4, we find that without normalizing the gradient, the choice of learning rate is even more crucial. We set the learning rate to 0.001 and end up with a gradient explosion: the plot on the right shows the cost function increasing exponentially.

The Python implementation of gradient descent is shown below. Without normalization, it is more difficult to find the right learning rate.

    import numpy as np

    def gradientDescent(x, y, theta, alpha, m, numIterations):
        xTrans = x.transpose()
        costs = []
        for i in range(0, numIterations):
            hypothesis = np.dot(x, theta)
            loss = hypothesis - y
            cost = np.sum(loss ** 2) / (2 * m)
            costs.append(cost)
            gradient = np.dot(xTrans, loss) / m
            # normalize the gradient to keep its magnitude bounded
            gradient = gradient / np.linalg.norm(gradient)
            # update
            theta = theta - alpha * gradient
        return theta, costs

5 SGD

Gradient descent is not a practical algorithm when we work on a larger dataset, because at each step we need to iterate through all training samples to compute the average gradient. A better approximation is to look at only one sample and update the model parameters right away. This is the essence of the stochastic gradient descent algorithm.

Shuffling the dataset and keeping the gradient from overflowing are both very important in SGD. Shuffling makes the order of the samples more random. SGD also puts less workload on data access, which can become a bottleneck while training the model.

We also implement an extension of SGD, AdaGrad,² which uses past gradients to normalize the gradient at every step. It turns out that AdaGrad converges much faster than traditional SGD. Figure 5 shows that the cost function of the model trained by AdaGrad converges much faster than that of the model trained by SGD; it takes 200,000 iterations for SGD to converge to a good solution.

Figure 5: Left: the cost function of SGD for 200,000 iterations. Right: the cost function of AdaGrad SGD for 20,000 iterations.

² http://www.magicbroom.info/Papers/DuchiHaSi10.pdf

6 Future Study

There are other SGD variants that work extremely well for large datasets, such as momentum and Adam. Stochastic optimization is an exciting field, as it is commonly used in deep learning.

7 Revision History

• 5/10/2016 - First Draft
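As a closing illustration, the AdaGrad update described in Section 5 can be sketched in a few lines. This is a minimal one-parameter version for illustration only; the toy dataset, the function names, and the constants alpha and eps are assumptions, not the implementation behind Figure 5. The idea is that each step is divided by the square root of the accumulated squared gradients, so parameters that have seen large gradients in the past take smaller steps.

```python
import math
import random

def adagrad_sgd(grad_fn, theta0, samples, alpha=0.5, eps=1e-8):
    """One-parameter SGD with AdaGrad scaling: the step is divided by the
    square root of the running sum of squared gradients seen so far."""
    theta = theta0
    accum = 0.0
    for s in samples:
        g = grad_fn(theta, s)
        accum += g * g                                  # accumulate past squared gradients
        theta -= alpha * g / (math.sqrt(accum) + eps)   # adaptive step size
    return theta

# Toy data: y = 2 * x exactly, so the per-sample squared loss
# (theta * x - y)**2 has gradient 2 * x * (theta * x - y)
# and its minimum at theta = 2.
random.seed(0)
data = [(x, 2.0 * x) for x in (random.uniform(0.5, 1.5) for _ in range(2000))]
grad = lambda theta, s: 2.0 * s[0] * (theta * s[0] - s[1])
print(adagrad_sgd(grad, 0.0, data))  # close to 2.0
```

Because the accumulated sum only grows, the effective step size shrinks over time, which is what lets AdaGrad use a relatively large initial α without the gradient explosion discussed in Section 4.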