Introduction to Optimization
Dominik Csiba, Knoyd Basecamp, Vienna, 20 January 2017

Slide 2 – MOTIVATION

Slide 3 – Why does optimization matter?
Recall the piggy example from the introduction to machine learning. What we would like to have is the best possible predictor; what we actually have is the output of an algorithm run on a finite sample. The gap between the two splits into:
◦ Approximation error – from restricting ourselves to a fixed model class
◦ Estimation error – from having only finitely many examples
◦ Optimization error – from not solving the training problem exactly
This session is about the last term.

Slide 4 – Solving the ERM
Recall the ERM (empirical risk minimization) problem
  $\min_w \; \frac{1}{n} \sum_{i=1}^{n} \ell(h_w(x_i), y_i)$,
which is an optimization problem. Linear regression has a closed-form solution, $w = (X^\top X)^{-1} X^\top y$ (when $X^\top X$ is invertible). What if there is no closed-form solution? Example: classification.

Slide 5 – Naïve classification
◦ Naïve model: predict with the sign of a linear function, $h_w(x) = \mathrm{sign}(w^\top x)$
◦ Naïve loss: the 0–1 loss, $\ell(h_w(x), y) = \mathbf{1}[h_w(x) \neq y]$
◦ Naïve ERM: minimize the number of misclassified training examples

Slide 6 – HIGH-LEVEL INTUITION

Slide 7 – How does optimization work?
Intuition: find the bottom of a valley in a dense fog using 7-league boots, where you can ask for the following information:
◦ 0th-order info: the altitude at any location (function values)
◦ 1st-order info: the slope at any location (gradients)
◦ 2nd-order info: the curvature at any location (Hessians)
◦ …
The most popular methods use 0th-, 1st- or 2nd-order information.

Slide 8 – How does optimization work formally?
Iterative methods: produce a sequence of iterates $w_0, w_1, w_2, \dots$
Goal: get "close" to a minimizer $w^*$.
"Close": for example $f(w_t) - f(w^*) \le \varepsilon$ for a small tolerance $\varepsilon > 0$.

Slide 9 – Going down the hill – Gradient Descent
By far the most popular iterative scheme:
  $w_{t+1} = w_t - \eta \, \nabla f(w_t)$
Intuition: we take a step down the hill.
◦ Stepsize $\eta$: in some cases given by theory, otherwise difficult to pick – too small and progress is very slow, too big and the iterates overshoot or diverge.

Slide 10 – BREAKDOWN OF THE PROBLEM

Slide 11 – When is optimization easy?
We minimize a function $f$ over a set $\mathcal{X}$. The problem is easy when:
Set $\mathcal{X}$:
◦ Continuous
◦ Convex
◦ Easy to project onto: $\Pi_{\mathcal{X}}(w) = \arg\min_{x \in \mathcal{X}} \|x - w\|$
Function $f$:
◦ Continuous
◦ Convex, ideally strongly convex
◦ Smooth
◦ Small n (examples) / small d (dimensions)
Example: ridge regression.

Slide 12 – Non-continuous sets
◦ Discrete: e.g. binary or integer-valued variables, allocation problems, graph problems
◦ Separated: e.g. a union of disjoint intervals

Slide 13 – Non-convex sets
Convex set: every arc (line segment) connecting two points in the set lies entirely in the set.
(Figure: examples of convex and non-convex sets.)

Slide 14 – Projection on sets
In general, projection has arbitrary complexity – it can be even more complex than the original optimization problem.

Slide 15 – Non-continuous objectives
Almost impossible to optimize.
◦ Example: the 0–1 loss from the naïve classification above (a step function).
If possible, always avoid this kind of optimization problem.
Question: what would happen if we just "connected" the non-continuous parts?

Slide 16 – Lipschitz objectives
The convergence depends on the Lipschitz property of the objective:
  $|f(x) - f(y)| \le L \, \|x - y\|$ for all $x, y$.
Dependence: the larger the constant $L$, the slower the convergence.

Slide 17 – Non-convex objectives
Convex functions – three equivalent views:
◦ Any arc connecting two points of the function lies above the function.
◦ At each point of the function we can draw a line passing through that point which lies below the function.
◦ The set of points above the function (the epigraph) is a convex set.
(Figure: two convex and two non-convex example functions.)
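To make the gradient descent scheme from slide 9 concrete before moving on to strongly convex objectives, here is a minimal Python sketch on a least-squares objective. The synthetic data, the function names and the fixed stepsize eta=0.1 are illustrative assumptions, not part of the original slides.

    import numpy as np

    def squared_loss(X, y, w):
        # average squared loss: (1/(2n)) * ||Xw - y||^2
        residual = X @ w - y
        return residual @ residual / (2 * len(y))

    def gradient(X, y, w):
        # gradient of the average squared loss: (1/n) * X^T (Xw - y)
        return X.T @ (X @ w - y) / len(y)

    def gradient_descent(X, y, eta, iterations=100):
        # slide 9: w_{t+1} = w_t - eta * grad f(w_t), starting from w_0 = 0
        w = np.zeros(X.shape[1])
        for _ in range(iterations):
            w = w - eta * gradient(X, y, w)
        return w

    # illustrative synthetic data, used only to make the sketch runnable
    rng = np.random.RandomState(0)
    X = rng.randn(50, 3)
    y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.randn(50)
    w_gd = gradient_descent(X, y, eta=0.1)
    print(squared_loss(X, y, w_gd))

On this small, well-conditioned problem a constant stepsize works; the slides' warning still applies: too small a stepsize makes progress slow, too large a stepsize makes the iterates diverge.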
Slide 18 – Strongly convex objectives
A function $f$ is $\mu$-strongly convex if $f(x) - \frac{\mu}{2}\|x\|^2$ is still convex.
◦ Example: ridge regression – the regularizer $\frac{\lambda}{2}\|w\|^2$ makes the objective strongly convex.
◦ Note: a convex function is a 0-strongly convex function.
Advantages:
◦ Stability: it can be shown that strongly convex objectives generalize better.
◦ Minimization: the methods converge much faster (rates specified later).
◦ Dependence: the larger $\mu$, the faster the convergence.

Slide 19 – Non-smooth objectives
A function is smooth if it does not have "kinks".
◦ Example: most of the standard functions.
◦ Example of a non-smooth function: the absolute value.
Complications:
◦ How to choose the stepsize?
◦ How to make a gradient step at a "kink"?
◦ Subgradient: the slope of any line passing through a "kink" which lies below the function (denoted $g \in \partial f(w)$).
Subgradient method: $w_{t+1} = w_t - \eta_t \, g_t$ with $g_t \in \partial f(w_t)$.

Slide 20 – Lipschitz continuous gradients
The convergence also depends on the Lipschitz property of the gradients:
  $\|\nabla f(x) - \nabla f(y)\| \le L \, \|x - y\|$ for all $x, y$.
Dependence: the larger this constant, the smaller the stepsize must be and the slower the convergence.

Slide 21 – Simple non-smoothness
Suppose we can write the objective as smooth + non-smooth,
  $f(w) = g(w) + h(w)$, with $g$ smooth and $h$ non-smooth.
◦ Example – Lasso: $h(w) = \lambda \|w\|_1$.
Non-smoothness is not an issue if we can easily solve the proximal subproblem
  $\min_u \; h(u) + \frac{1}{2\eta} \|u - z\|^2$.

Slide 23 – Large-scale data
Gradient descent step:
  $w_{t+1} = w_t - \eta \, \frac{1}{n} \sum_{i=1}^{n} \nabla f_i(w_t)$
One step requires all d partial derivatives of all n example functions, so the iteration cost depends on both the dimension and the number of examples!
◦ Gradient Descent: do the best we can using first-order information.
◦ Character of the step: a wrong step in a smart direction.
◦ Stepsize: constant.
(Figure: numbered gradient descent iterates on a contour plot.)

Slide 24 – Randomized Coordinate Descent
Update only one randomly chosen coordinate at a time:
  $w_{t+1} = w_t - \eta_j \, \nabla_j f(w_t) \, e_j$ for a randomly chosen coordinate $j$.
◦ Stepsize: a constant for each dimension.
◦ Character of the step: a smart step in a wrong direction.
◦ Iteration cost is independent of the dimension!
(Figure: coordinate descent iterates moving only along the axis directions.)

Slide 25 – Stochastic Gradient Descent
Update using only one random example at a time:
  $w_{t+1} = w_t - \eta_t \, \nabla f_i(w_t)$ for a randomly chosen example $i$.
◦ Character of the step: a wrong step in a wrong direction.
◦ Stepsize: decaying.
◦ Iteration cost is independent of the number of examples!
(Figure: numbered SGD iterates on a contour plot.)

Slide 26 – STOCHASTIC GRADIENT DESCENT

Slide 27 – The magic of SGD explained
◦ The direction is correct in expectation: $\mathbb{E}_i[\nabla f_i(w)] = \nabla f(w)$.
◦ The variance is not vanishing: the individual gradients $\nabla f_i(w^*)$ are in general non-zero even at the optimum.
◦ The stepsize therefore has to be decaying, otherwise SGD will not converge!
◦ For GD/RCD there is no such issue, because the full gradient vanishes at the optimum: $\nabla f(w^*) = 0$.

Slide 28 – Sample convergence theorem
(See Convex Optimization: Algorithms and Complexity, S. Bubeck, 2015.)
Without strong convexity: with a suitably decaying stepsize, the expected suboptimality of the averaged SGD iterate decreases at the rate $O(1/\sqrt{t})$.

Slide 29 – Randomization of SGD
Theoretical results:
◦ On each iteration, we pick an example uniformly at random.
◦ More advanced: importance sampling – weighted probabilities for the examples.
Standard practice:
◦ Each epoch is a random permutation of the examples.
◦ Performs better than uniform sampling.
◦ The theory is much more difficult.
Infinite data:
◦ SGD also converges in the case of streamed data (every example is seen only once).
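Before discussing stepsize schedules, here is a minimal Python sketch of the SGD update from slide 25 on the same least-squares objective as above. The synthetic data and the particular decay eta_t = eta0 / (t + 1) are illustrative assumptions; slide 30 discusses the conditions such a schedule should satisfy.

    import numpy as np

    def sgd(X, y, eta0, iterations=2000, seed=0):
        # slide 25: pick one example i at random and step along the negative
        # gradient of that single example's loss, with a decaying stepsize
        n, d = X.shape
        w = np.zeros(d)
        rng = np.random.RandomState(seed)
        for t in range(iterations):
            i = rng.randint(n)                    # one example, uniformly at random
            grad_i = (X[i] @ w - y[i]) * X[i]     # gradient of 0.5 * (x_i^T w - y_i)^2
            eta_t = eta0 / (t + 1)                # decaying stepsize (an assumed schedule)
            w = w - eta_t * grad_i
        return w

    # illustrative synthetic data, same shape as in the gradient descent sketch
    rng = np.random.RandomState(0)
    X = rng.randn(50, 3)
    y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.randn(50)
    print(sgd(X, y, eta0=0.5))

Each iteration touches a single example, so its cost does not grow with n; the price is the noisy, zig-zagging path described on the slide.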
Slide 30 – SGD stepsizes
Convergence holds for any stepsizes satisfying
  $\sum_t \eta_t = \infty$ and $\sum_t \eta_t^2 < \infty$.
Standard choice: a decaying schedule such as $\eta_t = \eta_0 / t$.
There is no gold standard for choosing $\eta_0$, which leads to a hyperparameter optimization problem.

Slide 31 – HYPERPARAMETER OPTIMIZATION

Slide 32 – Intuition
Hyperparameter optimization is a 0th-order optimization problem sitting on top of the training problem:
◦ Optimization tunes the model parameters.
◦ Hyperparameter optimization tunes the knobs of that optimization (e.g. the stepsize).
◦ Meta-hyperparameter optimization would tune the knobs of the hyperparameter search… one could go on and on.

Slide 33 – 0th-order optimization
◦ Goal: find the hyperparameter value with the best performance.
◦ Issue: we have no idea about the shape of the underlying function.
◦ Solution: try out a bunch of values and hope for the best.

Slide 34 – Grid search
The most basic algorithm for hyperparameter optimization:
◦ Try out a grid of values.
◦ Pick the best-performing one.
Performs reasonably well and is usually sufficient.
Notable competitor: random search.

Slide 35 – Labs

Slide 36 – Labs
Goal: train a 2nd-order polynomial predictor using both gradient descent and stochastic gradient descent. Optimize the stepsizes and compare against the scikit-learn implementation.
1. Fetch the regression data using sklearn.datasets.load_boston().
2. Create a function psi(x) which transforms the features into 2nd-order polynomial features (add each feature squared and each pair of distinct features multiplied together).
3. Create a transformed data matrix X, where each x is mapped to psi(x).
4. Create a function p2(x, w) which outputs the value of the polynomial at x for given parameters w.

Slide 37 – Labs
5. Create a function Loss(X, y, w) which computes the squared loss of predicting y from X by p2(x, w) using parameters w.
6. Code up the update rule for gradient descent (slide 9). It should take a point w and a stepsize eta as inputs. Hint: the gradient of the squared loss was already computed yesterday.
7. Choose an arbitrary starting point and stepsize. Run gradient descent for 100 iterations and compute the Loss after each iteration. How does the loss behave? Does it converge to something?
8. Can you find the eta for which the loss is smallest after 100 iterations?

Slide 38 – Labs
9. Code up the update rule for stochastic gradient descent (slide 25). It should depend on a point w and a stepsize eta.
10. Choose an arbitrary initial point and a stepsize parameter eta0. Use the stepsize rule defined on slide 30. How many iterations do you need to converge?
11. Find the best eta0.
12. How does SGD compare against GD? Which algorithm would you prefer?

Slide 39 – Labs
13. Open the documentation for sklearn.linear_model.SGDRegressor. Do you understand each of the options? Try out several of them to get a better intuition.
14. Run the sklearn implementation against yours for the same number of iterations. How do they compare?
15. Experiment with the setup!
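As a starting point for steps 7–8 and 11, here is a minimal grid-search sketch in the spirit of slide 34, using plain gradient descent as the black box being tuned. The synthetic data and the grid of stepsizes are illustrative assumptions; in the lab the data would be the transformed polynomial features.

    import numpy as np

    def loss_after_gd(X, y, eta, iterations=100):
        # run plain gradient descent on the average squared loss and return
        # the final loss value, used as a black-box score for the stepsize eta
        w = np.zeros(X.shape[1])
        for _ in range(iterations):
            w = w - eta * X.T @ (X @ w - y) / len(y)
        residual = X @ w - y
        return residual @ residual / (2 * len(y))

    # illustrative synthetic data standing in for the transformed features
    rng = np.random.RandomState(0)
    X = rng.randn(50, 3)
    y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.randn(50)

    # grid search (slide 34): try a grid of stepsizes, keep the best-performing one
    grid = [10.0 ** k for k in range(-4, 1)]    # 1e-4, 1e-3, 1e-2, 1e-1, 1e0
    scores = {eta: loss_after_gd(X, y, eta) for eta in grid}
    best_eta = min(scores, key=scores.get)
    print(best_eta, scores[best_eta])

The same loop works for tuning eta0 in SGD: replace the black-box score with the loss after a fixed number of SGD iterations.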
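For steps 13–14, here is a minimal sketch of how the scikit-learn baseline might be set up. The synthetic data again stands in for the lab's polynomial features, and the parameter values below are only illustrative; consult the SGDRegressor documentation for the full list of options.

    import numpy as np
    from sklearn.linear_model import SGDRegressor

    # illustrative synthetic data standing in for the transformed features psi(x)
    rng = np.random.RandomState(0)
    X = rng.randn(200, 3)
    y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.randn(200)

    # scikit-learn's SGD baseline: squared loss by default, with a decaying
    # ("invscaling") learning rate controlled by eta0
    model = SGDRegressor(learning_rate='invscaling', eta0=0.01, max_iter=100, tol=None)
    model.fit(X, y)
    print(model.coef_, model.intercept_)

Running it for the same number of iterations as your own GD and SGD implementations makes the comparison in step 14 direct.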