
Introduction to Optimization
DOMINIK CSIBA, KNOYD BASECAMP, 20. JAN 2017, VIENNA
MOTIVATION
Why does optimization matter?
Recall the piggy!
The gap between the predictor we would like to have and the one we actually have decomposes into three parts: the approximation error, the estimation error, and the optimization error.
Solving the ERM
Recall the ERM – an optimization problem:
$$\min_{w} \; \frac{1}{n} \sum_{i=1}^{n} \ell\big(h_w(x_i), y_i\big)$$
Linear regression has a closed-form solution: $w^\star = (X^\top X)^{-1} X^\top y$.
What if there is no closed-form solution?
Example: classification.
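A minimal numpy sketch of the closed-form linear regression solution (the synthetic data below is my own illustration, not part of the slides; `np.linalg.solve` is used rather than forming the inverse explicitly):

```python
import numpy as np

# synthetic regression data (hypothetical example, only for illustration)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=100)

# closed-form ordinary least squares: w = (X^T X)^{-1} X^T y
w_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(w_hat)  # close to [1, -2, 0.5]
```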
Naïve Classification
Naïve model: $h_w(x) = \mathrm{sign}(w^\top x)$
Naïve loss: the 0/1 loss $\ell(h_w(x), y) = \mathbb{1}\big[h_w(x) \neq y\big]$
Naïve ERM: $\min_w \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}\big[\mathrm{sign}(w^\top x_i) \neq y_i\big]$, i.e. minimize the number of misclassified examples
HIGH-LEVEL INTUITION
How does optimization work?
Intuition: Find the bottom of a valley in dense fog using seven-league
boots, where you can ask for the following information:
◦ 0th order info: the altitude at any location
◦ 1st order info: the slope at any location
◦ 2nd order info: the curvature at any location
◦…
Most popular in practice: 1st order information.
How does optimization work formally?
Iterative methods: generate a sequence of points $w^1, w^2, w^3, \ldots$
Goal: get close to a minimizer $w^\star \in \arg\min_w f(w)$
“Close”: $f(w^t) - f(w^\star) \leq \epsilon$ for a chosen accuracy $\epsilon > 0$
Going down the hill – Gradient Descent (slide 9)
By far the most popular iterative scheme:
$$w^{t+1} = w^t - \eta^t \, \nabla f(w^t)$$
Intuition: we take a step down the hill
◦ Stepsize $\eta^t$: in some cases given by theory, otherwise difficult to pick
◦ Too small: convergence is very slow
◦ Too big: we overshoot the minimum and may diverge
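A minimal gradient descent sketch in Python (the quadratic objective, `grad`, and the fixed stepsize `eta` are illustrative choices of mine, not from the slides):

```python
import numpy as np

def gradient_descent(grad, w0, eta=0.1, iters=100):
    """Plain gradient descent: w <- w - eta * grad(w)."""
    w = np.asarray(w0, dtype=float)
    for _ in range(iters):
        w = w - eta * grad(w)
    return w

# toy quadratic f(w) = 0.5 * ||w - 3||^2, whose gradient is (w - 3)
grad = lambda w: w - 3.0
print(gradient_descent(grad, w0=[0.0, 0.0]))  # approaches [3, 3]
```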
BREAKDOWN OF THE PROBLEM
When is optimization easy?
The problem $\min_{w \in \mathcal{X}} f(w)$ is easy when:
Set $\mathcal{X}$:
◦ Continuous
◦ Convex
◦ Easy to project on
Function $f$:
◦ Continuous
◦ Convex + strongly convex
◦ Smooth
◦ Small n / small d
Example: Ridge regression
Non-continuous sets
Discrete:
◦ Examples: integer-valued parameters, allocation problems, graph problems
Separated:
◦ Example: a set split into several disconnected pieces (e.g. a union of disjoint intervals)
Non-convex sets
Convex set: every line segment connecting two points of the set lies entirely within the set
(Figure: examples of convex and non-convex sets.)
Projection on sets
In general, the projection can be arbitrarily complex – it can even be harder than the original optimization problem
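For well-behaved sets, though, the projection is cheap. A small sketch of two easy projections (the box and the Euclidean ball are my own illustrative examples):

```python
import numpy as np

def project_box(w, lo, hi):
    """Projection onto the box [lo, hi]^d: clip each coordinate."""
    return np.clip(w, lo, hi)

def project_l2_ball(w, radius=1.0):
    """Projection onto the Euclidean ball of a given radius: rescale if outside."""
    norm = np.linalg.norm(w)
    return w if norm <= radius else w * (radius / norm)

w = np.array([2.0, -0.5, 3.0])
print(project_box(w, lo=0.0, hi=1.0))    # [1. 0. 1.]
print(project_l2_ball(w, radius=1.0))    # rescaled to unit norm
```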
Non-continuous objectives
Almost impossible to optimize
◦ Example: the 0/1 loss of the naïve classifier (a step function)
If possible, always avoid this kind of optimization problem.
Question: what would happen if we just “connected” the non-continuous parts?
Lipschitz objectives
The convergence depends on the Lipschitz property of the objective:
$$|f(w) - f(w')| \leq L \, \|w - w'\| \quad \text{for all } w, w'$$
Dependence: the larger the Lipschitz constant $L$ (the steeper the function can get), the slower the convergence; a small $L$ makes the problem easier.
(Figure: functions with a large vs. a small Lipschitz constant.)
Non-convex objectives
Convex functions:
◦ Any chord connecting two points on the graph lies above the graph.
◦ At each point of the graph we can draw a line passing through that point which lies below the graph.
◦ The set of points above the graph (the epigraph) is a convex set.
(Figure: examples of convex and non-convex functions.)
Strongly-convex objectives
A function $f$ is $\mu$-strongly convex if $f(w) - \frac{\mu}{2}\|w\|^2$ is convex.
Example: the ridge regularizer $\frac{\lambda}{2}\|w\|^2$ is $\lambda$-strongly convex.
Note: a convex function is exactly a 0-strongly convex function.
Advantages:
◦ Stability: it can be shown that strongly convex objectives generalize better
◦ Minimization: the methods converge much faster
◦ Dependence: on the strong-convexity constant $\mu$ – specified later
Non-smooth objectives
A function is smooth if it does not have “kinks”
◦ Example: most of the standard functions
◦ Example of a non-smooth function: the absolute value $|w|$
Complications:
◦ How do we choose the stepsize?
◦ How do we make a gradient step at a “kink”?
◦ Subgradient: the slope of any line passing through a “kink” that stays below the function (the set of subgradients at $w$ is denoted $\partial f(w)$)
Subgradient method: $w^{t+1} = w^t - \eta^t g^t$ with $g^t \in \partial f(w^t)$
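A minimal subgradient-method sketch (the objective $f(w) = |w - 2|$, the starting point, and the decaying stepsize are my own illustrative choices):

```python
import numpy as np

def subgradient_method(subgrad, w0, eta0=1.0, iters=200):
    """Subgradient method: w <- w - eta_t * g, with g a subgradient and eta_t = eta0 / t."""
    w = float(w0)
    for t in range(1, iters + 1):
        g = subgrad(w)
        w = w - (eta0 / t) * g
    return w

# f(w) = |w - 2|; a valid subgradient is sign(w - 2) (any value in [-1, 1] works at the kink)
subgrad = lambda w: np.sign(w - 2.0)
print(subgradient_method(subgrad, w0=5.0))  # ends close to 2, oscillating with shrinking steps
```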
Lipschitz continuous gradients
The convergence depends on the Lipschitz property of the gradients (smoothness):
$$\|\nabla f(w) - \nabla f(w')\| \leq L \, \|w - w'\| \quad \text{for all } w, w'$$
Dependence: the smoothness constant $L$ bounds how fast the gradient can change; gradient descent converges with the constant stepsize $\eta = 1/L$, so a small $L$ allows larger steps and faster convergence.
(Figure: functions whose gradients change at a constant, large, or small rate.)
Simple non-smoothness
Suppose we can write the objective as $f(w) = g(w) + h(w)$, where $g$ is smooth and $h$ is non-smooth.
Example:
◦ Lasso: $\frac{1}{2}\|Xw - y\|^2 + \lambda \|w\|_1$ (smooth squared loss + non-smooth $\ell_1$ penalty)
Non-smoothness is not an issue if we can easily solve the proximal subproblem
$$\mathrm{prox}_{\eta h}(w) = \arg\min_{u} \Big\{ h(u) + \tfrac{1}{2\eta}\|u - w\|^2 \Big\}$$
(see the sketch below).
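A minimal proximal gradient (ISTA-style) sketch for the Lasso formulation above; the prox of the $\ell_1$ norm is soft-thresholding, and the toy data, `lam`, and the $1/L$ stepsize are my own choices:

```python
import numpy as np

def soft_threshold(w, tau):
    """Prox of tau * ||.||_1: shrink each coordinate towards zero by tau."""
    return np.sign(w) * np.maximum(np.abs(w) - tau, 0.0)

def proximal_gradient_lasso(X, y, lam=0.1, iters=500):
    """Minimize 0.5*||Xw - y||^2 + lam*||w||_1: gradient step on the smooth part,
    then the prox of the non-smooth part."""
    n, d = X.shape
    w = np.zeros(d)
    eta = 1.0 / np.linalg.norm(X, 2) ** 2   # 1/L, with L the smoothness constant of the quadratic
    for _ in range(iters):
        grad = X.T @ (X @ w - y)            # gradient of the smooth part
        w = soft_threshold(w - eta * grad, eta * lam)
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))
y = X[:, 0] - 2 * X[:, 1] + 0.01 * rng.normal(size=50)
print(proximal_gradient_lasso(X, y))  # mostly zeros except the first two coordinates
```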
Large-scale data
Gradient descent step on the ERM objective $f(w) = \frac{1}{n}\sum_{i=1}^{n} f_i(w)$:
$$w^{t+1} = w^t - \eta \, \frac{1}{n}\sum_{i=1}^{n} \nabla f_i(w^t)$$
Each step needs #dimension partial derivatives of #examples functions.
Gradient Descent
◦ Do the best we can using first-order information
◦ “Wrong step in a smart direction”
◦ Stepsize: constant
◦ Iteration cost depends on both the dimension and the number of examples!
Randomized Coordinate Descent
◦ Update only one randomly chosen coordinate at a time
◦ Stepsize: a constant for every dimension
◦ Iteration cost is independent of the dimension!
◦ “Smart step in a wrong direction”
(Figure: the iterates move along the coordinate axes.)
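A minimal randomized coordinate descent sketch on a toy quadratic (my own example; the coordinate-wise stepsizes 1/A[i, i] are the natural choice for this quadratic, not something specified on the slide):

```python
import numpy as np

def coordinate_descent(A, b, iters=1000, seed=0):
    """Randomized coordinate descent on f(w) = 0.5*w^T A w - b^T w.
    Each step updates one random coordinate i using its partial derivative."""
    rng = np.random.default_rng(seed)
    d = len(b)
    w = np.zeros(d)
    for _ in range(iters):
        i = rng.integers(d)
        partial = A[i] @ w - b[i]     # i-th partial derivative
        w[i] -= partial / A[i, i]     # exact minimization along coordinate i ("smart step")
    return w

A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
print(coordinate_descent(A, b), np.linalg.solve(A, b))  # both approach [0.2, 0.4]
```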
Stochastic Gradient Descent (slide 25)
◦ Update using only one randomly chosen example at a time:
$$w^{t+1} = w^t - \eta^t \, \nabla f_i(w^t), \quad i \text{ picked at random}$$
◦ “Wrong step in a wrong direction”
◦ Stepsize: decaying
◦ Iteration cost is independent of the number of examples!
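A minimal SGD sketch for least squares (the toy data, `eta0`, and the $\eta^0/t$ schedule are my own illustrative choices):

```python
import numpy as np

def sgd_least_squares(X, y, eta0=0.5, epochs=50, seed=0):
    """SGD on f(w) = (1/n) * sum_i 0.5*(x_i^T w - y_i)^2:
    one randomly chosen example per step, decaying stepsize eta_t = eta0 / t."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    t = 0
    for _ in range(epochs):
        for i in rng.permutation(n):            # standard practice: a random permutation per epoch
            t += 1
            grad_i = (X[i] @ w - y[i]) * X[i]   # gradient of the i-th example's loss
            w -= (eta0 / t) * grad_i
    return w

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -1.0, 2.0]) + 0.05 * rng.normal(size=200)
print(sgd_least_squares(X, y))  # roughly [1, -1, 2]
```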
STOCHASTIC GRADIENT DESCENT
Magic of SGD explained
Correct direction in expectation: for $f(w) = \frac{1}{n}\sum_{i=1}^{n} f_i(w)$ and $i$ chosen uniformly at random,
$$\mathbb{E}_i\big[\nabla f_i(w)\big] = \nabla f(w)$$
The variance of the stochastic gradient is not vanishing: even at the optimum, the individual $\nabla f_i(w^\star)$ are generally non-zero.
The stepsize has to be decaying, otherwise SGD will not converge!
For GD/RCD there is no such issue, because the gradient itself vanishes at the optimum: $\nabla f(w^\star) = 0$.
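A tiny numerical check of the unbiasedness claim on toy least-squares data of my own (averaging the per-example gradients over all $i$ equals the expectation over a uniformly random $i$):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = rng.normal(size=100)
w = rng.normal(size=4)

# full gradient of f(w) = (1/n) * sum_i 0.5*(x_i^T w - y_i)^2
full_grad = X.T @ (X @ w - y) / len(y)
# per-example gradients; their average matches the full gradient exactly
per_example = np.array([(X[i] @ w - y[i]) * X[i] for i in range(len(y))])
print(np.allclose(per_example.mean(axis=0), full_grad))  # True
print(per_example.std(axis=0))                           # non-zero: the variance does not vanish
```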
Sample convergence theorem
From: Convex Optimization: Algorithms and Complexity, S. Bubeck, 2015.
For a strongly convex objective, SGD with decaying stepsizes converges in function value at a rate of order $O(1/(\mu t))$; without strong convexity the rate degrades to $O(1/\sqrt{t})$.
Randomization of SGD
Theoretical results:
◦ On each iteration, we pick an example uniformly at random
◦ More advanced: Importance sampling – weighted probabilities for examples
Standard practice:
◦ Each epoch is a random permutation
◦ Performs better than uniform sampling
◦ The theory is considerably more difficult
Infinite data:
◦ SGD also converges on streamed data (each example is seen only once)
SGD stepsizes (slide 30)
Convergence for any stepsizes satisfying:
$$\sum_{t=1}^{\infty} \eta^t = \infty \qquad \text{and} \qquad \sum_{t=1}^{\infty} (\eta^t)^2 < \infty$$
Standard choice: $\eta^t = \eta^0 / t$ for some initial value $\eta^0 > 0$
There is no gold standard for choosing $\eta^0$
This leads to a Hyperparameter Optimization problem
HYPERPARAMETER OPTIMIZATION
Intuition
Hyperparameter Optimization is 0th order optimization.
◦ The model has parameters, which are set by Optimization.
◦ Optimization itself has parameters (e.g. the stepsize), which are set by Hyperparameter Optimization.
◦ Hyperparameter Optimization again has its own parameters, which would be set by Meta-Hyperparameter Optimization…
One could go on and on…
0th order optimization
Goal: find the hyperparameter value (e.g. $\eta^0$) with the best performance
Issue: we have no idea what the function looks like
Solution: try out a bunch of values and hope for the best
Grid search
The most basic algorithm for hyperparameter optimization
◦ Try out a grid of values
◦ Pick the best-performing one
Performs reasonably well
Usually sufficient
Notable competitor: Random Search
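A minimal grid search sketch using scikit-learn (the SGDRegressor, the toy data, and the parameter grid are illustrative choices; the exact hyperparameters to tune are up to you):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import GridSearchCV

# toy regression data (illustrative only)
X, y = make_regression(n_samples=500, n_features=5, noise=5.0, random_state=0)

# grid of hyperparameter values to try out
param_grid = {
    "eta0": [0.001, 0.01, 0.1],
    "learning_rate": ["constant", "invscaling"],
}

search = GridSearchCV(SGDRegressor(max_iter=1000, random_state=0), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```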
Labs
Labs
Goal: Train a 2nd-order polynomial predictor using both gradient descent
and stochastic gradient descent. Optimize the stepsizes and compare
against the scikit-learn implementation.
1. Fetch regression data using sklearn.datasets.load_boston()
2. Create a function psi(x), which transforms the features into 2nd-order
polynomial features (add the square of each feature and the product of
every pair of features). A possible starter sketch is given after step 15.
3. Create a transformed data matrix X, where each x is mapped to psi(x).
4. Create a function p2(x,w), which outputs the value of the polynomial
at x for given parameters w.
Labs
5. Create a function Loss(X,y,w), which computes the squared loss of
predicting y from X by p2(x,w) using parameters w.
6. Code up the update rule for gradient descent (slide 9). It should take a
point w and a stepsize eta as inputs. Hint: the gradient of the squared
loss was already computed yesterday.
7. Choose an arbitrary point and stepsize. Run gradient descent for 100
iterations and compute the Loss after each iteration. How does the loss
behave? Does it converge to something?
8. Can you find the eta for which the loss is smallest after 100 iterations?
Labs
9. Code up the update rule for stochastic gradient descent as given
on slide 25. It should take a point w and a stepsize eta as inputs.
10. Choose an arbitrary initial point and stepsize parameter eta0. Use
the stepsize rule defined on slide 30. How many iterations do you
need to converge?
11. Find the best eta0.
12. How does SGD compare against GD? Which algorithm would you
prefer?
Labs
13. Open the documentation for the function
sklearn.linear_model.SGDRegressor. Do you understand each of
the options? Try out several of them to get a better intuition.
14. Run the sklearn implementation against yours for the same
number of iterations. How do they compare?
15. Experiment with the setup!
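A possible starting point for the labs (a sketch only, not the reference solution: the feature ordering in psi, the absence of a bias term, and the helper gd_step are my own choices; load_boston availability depends on your scikit-learn version):

```python
import numpy as np

def psi(x):
    """2nd-order polynomial features: the raw features, their squares,
    and the product of every pair of features (one possible ordering)."""
    x = np.asarray(x, dtype=float)
    d = len(x)
    pairs = [x[i] * x[j] for i in range(d) for j in range(i + 1, d)]
    return np.concatenate([x, x ** 2, pairs])

def p2(x, w):
    """Value of the polynomial at x for parameters w."""
    return psi(x) @ w

def loss(X_transformed, y, w):
    """Squared loss of predicting y from the transformed data matrix."""
    residuals = X_transformed @ w - y
    return np.mean(residuals ** 2)

def gd_step(X_transformed, y, w, eta):
    """One gradient descent step on the squared loss (cf. slide 9)."""
    grad = 2.0 * X_transformed.T @ (X_transformed @ w - y) / len(y)
    return w - eta * grad
```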