Introduction to machine learning

The role of optimization in machine learning
DOMINIK CSIBA, MLMU BRATISLAVA, 19 APRIL 2017
Motivation
Linear models
Linear models as optimization

LASSO:  min_w  (1/2n) Σ_{i=1}^n (x_iᵀw − y_i)² + λ‖w‖₁
◦ x_i are the features, y_i the label; the first term is the loss function, λ‖w‖₁ the regularizer

Logistic regression:  min_w  (1/n) Σ_{i=1}^n log(1 + exp(−y_i x_iᵀw)) + (λ/2)‖w‖²
◦ again a loss function over features and labels plus a regularizer
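To make the two objectives concrete, here is a minimal numpy sketch (the function names are illustrative, not from any library) that evaluates both for given weights:

```python
import numpy as np

def lasso_objective(w, X, y, lam):
    """LASSO: least-squares loss on (features X, labels y) plus an L1 regularizer."""
    n = X.shape[0]
    loss = np.sum((X @ w - y) ** 2) / (2 * n)  # loss function
    return loss + lam * np.sum(np.abs(w))      # + regularizer

def logistic_objective(w, X, y, lam):
    """L2-regularized logistic regression; labels y are in {-1, +1}."""
    loss = np.mean(np.log1p(np.exp(-y * (X @ w))))  # logistic loss
    return loss + lam / 2 * np.sum(w ** 2)          # + regularizer
```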
Dimensionality Reduction - PCA
PCA as optimization

max_W  ‖XW‖²_F   subject to  WᵀW = I
◦ X holds the features, the columns of W are the principal components
◦ the loss function measures the captured variance; the constraint keeps the principal components orthonormal
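The constrained problem above has a closed-form solution through the eigenvectors of the covariance matrix; a minimal numpy sketch (the function name `pca` is just illustrative):

```python
import numpy as np

def pca(X, k):
    """Return the top-k principal components (as columns) of the data matrix X."""
    Xc = X - X.mean(axis=0)            # center the features
    cov = Xc.T @ Xc / len(Xc)          # covariance matrix
    _, vecs = np.linalg.eigh(cov)      # eigenvalues come in ascending order
    return vecs[:, ::-1][:, :k]        # largest-eigenvalue directions first
```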
Matrix completion

[Figure: a large matrix with entries in {−1, +1}, only partially observed — the task is to fill in the missing entries.]
Matrix completion as optimization

min_X  Σ_{(i,j)∈Ω} (X_ij − M_ij)²   subject to  rank(X) ≤ r
◦ M is the observed matrix, Ω the set of observed indices
◦ the loss function penalizes mismatch on the observed entries; the constraint asks for a low rank matrix
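One classical way to attack this problem is iterative hard thresholding: alternate between filling in the observed entries and projecting onto the rank constraint with a truncated SVD. A toy sketch under those assumptions (names illustrative):

```python
import numpy as np

def svd_project(X, r):
    """Project X onto the constraint set {rank <= r} via a truncated SVD."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    s[r:] = 0.0
    return (U * s) @ Vt

def complete(M, mask, r, iters=500):
    """Alternate: enforce the observed entries, then enforce low rank."""
    X = np.where(mask, M, 0.0)         # start from the observed matrix
    for _ in range(iters):
        X = svd_project(np.where(mask, M, X), r)
    return X
```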
Real-time bidding in online advertising

[Figure: a webpage with a banner slot; in an auction we bid 0.1$ against competitors bidding 0.05$ and 0.01$ — the highest bid wins the impression.]
Real-time bidding as optimization

max  Σ_t u_t(x_t) − penalty(spent − budget)
◦ impressions arrive sequentially; each allocation x_t is chosen from the feasible allocations
◦ u_t is the utility function of allocation x_t; the penalty discourages spending beyond the budget
Optimization in Machine Learning

Most of learning boils down to an optimization problem
Attracts many mathematicians
◦ Main topic of my PhD thesis
◦ Optimization has its own ecosystem in the machine learning community
Treated as a black box by many practitioners
◦ The black box gets faster and faster each year
◦ Optimization becomes a concern only when .fit() has trouble learning
Next: understanding some of the scenarios with such issues
Supervised learning
Main idea

[Figure: examples from the world (a cat, bread, a dog, …) paired with ground-truth labels such as CAT, BREAD, DOG.]
Learning the true predictor

Empirical Risk Minimization:
◦ Samples: (x_i, y_i), i = 1, …, n, drawn i.i.d. from an unknown distribution
◦ ERM: ĥ = argmin_{h∈H} (1/n) Σ_{i=1}^n ℓ(h(x_i), y_i), where H is the hypothesis class and ℓ the loss function
◦ Hopefully: ĥ is close to the best predictor in H on the whole distribution
◦ Usual form: min_w f(w) = (1/n) Σ_{i=1}^n f_i(w) — solve this!
How does optimization work?

Intuition: Find the bottom of a valley in a dense fog using a teleport,
where you can ask for the following information:
◦ 0th order info: the altitude at any location
◦ 1st order info: the slope at any location
◦ 2nd order info: the curvature at any location
◦ …
First-order information is by far the most popular.
Going down the hill – Gradient Descent

By far the most popular iterative scheme:  w_{k+1} = w_k − α ∇f(w_k)
Intuition: We take a step down the hill
◦ Stepsize α: in some cases given by theory, otherwise difficult to pick — too small and progress is slow, too big and the iterates overshoot
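The scheme above fits in a few lines of numpy (a generic sketch; `grad` is any function returning ∇f):

```python
import numpy as np

def gradient_descent(grad, w0, stepsize, iters):
    """Plain gradient descent: w_{k+1} = w_k - stepsize * grad(w_k)."""
    w = np.asarray(w0, dtype=float)
    for _ in range(iters):
        w = w - stepsize * grad(w)
    return w
```

For example, on f(w) = ½(w − 3)² the gradient is w − 3 and the iterates converge to 3 for any stepsize below 2.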
Big data
Large-scale data

Gradient descent step:  w_{k+1} = w_k − α ∇f(w_k),  where  f(w) = (1/n) Σ_{i=1}^n f_i(w)
◦ the gradient has #dimensions partial derivatives, and the sum has #examples functions — every step touches all of both
Gradient Descent

Do the best we can using first-order information
◦ a wrong step in a smart direction
◦ Stepsize: constant
◦ Iteration cost depends on both the dimension and the number of examples!

[Figure: contour plot with numbered GD iterates descending toward the minimum.]
Randomized Coordinate Descent

Update only one randomly chosen coordinate at a time
◦ a smart step in a wrong direction
◦ Stepsize: constant for every dimension
◦ Iteration cost is independent of the dimension!

[Figure: contour plot with numbered RCD iterates moving only along the axis directions (N/E/S/W).]
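A minimal sketch of the method (hypothetical names; `grad_j(w, j)` returns only the j-th partial derivative):

```python
import numpy as np

def coordinate_descent(grad_j, w0, stepsizes, iters, seed=0):
    """Each step updates a single, uniformly chosen coordinate."""
    w = np.array(w0, dtype=float)
    rng = np.random.default_rng(seed)
    for _ in range(iters):
        j = rng.integers(len(w))              # random coordinate
        w[j] -= stepsizes[j] * grad_j(w, j)   # one partial derivative per step
    return w
```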
Stochastic Gradient Descent

Update using only one random example at a time
◦ a wrong step in a wrong direction
◦ Stepsize: decaying
◦ Iteration cost is independent of the number of examples!

[Figure: contour plot with numbered SGD iterates zig-zagging toward the minimum.]
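A sketch with the decaying stepsize α_k = c/(k+1) (names illustrative; `grad_i(w, i)` is the gradient of the single example i):

```python
import numpy as np

def sgd(grad_i, w0, n, iters, c=1.0, seed=0):
    """SGD: one random example per step, stepsize decaying as c/(k+1)."""
    w = np.asarray(w0, dtype=float)
    rng = np.random.default_rng(seed)
    for k in range(iters):
        i = rng.integers(n)                  # pick one example
        w = w - c / (k + 1) * grad_i(w, i)
    return w
```

On f_i(w) = ½(w − a_i)² with c = 1, the iterate is exactly the running average of the sampled a_i, so it converges to the mean of the a_i.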
Magic of SGD explained

Correct direction in expectation:  E_i[∇f_i(w)] = ∇f(w)
The variance of ∇f_i(w) is not vanishing, even at the optimum!
The stepsize has to be decaying, otherwise SGD will not converge.
For GD/RCD there is no such issue, because ∇f(w) → 0 as w approaches the minimizer.
Stochastic Variance Reduced Gradient

A new method was proposed in 2013 to reduce the variance of SGD.
Outer loop (repeat forever):
◦ Store the current iterate as w̃. Compute and store the full gradient ∇f(w̃).
◦ Inner loop (repeat K times):
  ◦ sample i uniformly at random
  ◦ perform the update:  w ← w − α (∇f_i(w) − ∇f_i(w̃) + ∇f(w̃))
Correct direction! We have  E_i[∇f_i(w) − ∇f_i(w̃) + ∇f(w̃)] = ∇f(w),
and the variance of the update vanishes as w and w̃ approach the minimizer.
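The two loops above in numpy (a sketch; `grad_i` is the gradient of example i, `full_grad` the full gradient):

```python
import numpy as np

def svrg(grad_i, full_grad, w0, n, stepsize, inner_K, outer_T, seed=0):
    """SVRG: SGD steps corrected by a stored full gradient at a snapshot."""
    w = np.asarray(w0, dtype=float)
    rng = np.random.default_rng(seed)
    for _ in range(outer_T):
        w_snap = w                        # store the current iterate
        g_snap = full_grad(w_snap)        # full gradient at the snapshot
        for _ in range(inner_K):
            i = rng.integers(n)
            v = grad_i(w, i) - grad_i(w_snap, i) + g_snap  # unbiased estimate
            w = w - stepsize * v          # a constant stepsize now works
    return w
```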
Distributed problems
Distributed framework

[Figure: a master node connected to Node 1, Node 2, …, Node K, each holding part of the data.]

Communication is very expensive!
Naïve Approach 1: Distributed GD
Naïve Approach 2: One-shot averaging
Distributed convergence rates

Standard convergence time measure:  (# iterations) × (cost of one iteration)
Distributed convergence time measure:  (# communication rounds) × (communication cost + local computation)
Ideally, the two are similar.
Distributed methods: Intuition

Iterate over the following steps:
1. Compute the minimizers of the local objectives
2. Send the minimizers to the master node
3. Create a new local objective for each node based on the other minimizers
4. Distribute the local objectives back to the nodes

The local objectives act as local estimates of the global objective.
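As a degenerate special case, the one-shot-averaging baseline fits in a couple of lines (a toy sketch; each node is assumed to expose a local `solve` returning its minimizer):

```python
import numpy as np

def one_shot_average(local_solvers):
    """Naive baseline: every node minimizes its local objective once and
    the master node averages the minimizers -- one communication round."""
    return np.mean([solve() for solve in local_solvers], axis=0)
```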
Complex objectives
Deep Learning / Neural Networks

For a practitioner: the ultimate tool for machine learning
For a mathematician: a nightmare (or the ultimate challenge)
◦ Optimization without any assumptions (except continuity)
◦ No real guarantees on the generalization error
◦ No real guarantees on convergence
A lot of attention – about ¼ of the 2,500 papers submitted to NIPS 2016
Next: understand deep learning better by understanding when it fails
Learning parities (Failures of Deep Learning, Ohad Shamir, 2017)

TASK: learn a function which outputs the parity of the active entries in an unknown
subset of the coordinates of a vector. Formally:
◦ Choose a vector v* ∈ {0, 1}^d
◦ For x ∈ {−1, +1}^d define y(x) = Π_{j : v*_j = 1} x_j
◦ Learn y without any information on v*
The task is realizable by a single-layer neural network.
In the following experiment we try to learn y using a network of this form.
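The target function is simple to state in code; a minimal sketch (assuming entries in {−1, +1} and an indicator vector v of the unknown subset):

```python
import numpy as np

def parity_labels(X, v):
    """Parity of the {-1,+1} entries of each row of X on the coordinates
    where v equals 1 (the unknown subset)."""
    return np.prod(np.where(v == 1, X, 1), axis=1)
```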
Parities convergence
(Failures of Deep Learning, Ohad Shamir, 2017)
Non-informative gradients

Let F_{v*}(w) be the loss corresponding to the parity problem given v*.
Claim: Fix a point w and consider the gradient vectors ∇F_{v*}(w) over all
choices of v*. Their variance is upper-bounded by a quantity that decays
rapidly with the dimension d.
It follows that in large dimensions, all methods based only on gradient
information fail to converge: the gradient carries almost no information about v*.
A more general version of the above claim holds for linear functions
composed with a periodic function (Shamir, 2016).
Objective function example
(Distribution-specific Hardness of Learning Neural Networks, Ohad Shamir, 2016)
Final remarks

Optimization is the backbone of machine learning
◦ One does not realize how important it is until something goes wrong
Optimization offers a lot of challenging problems
◦ Ideal for mathematically oriented people with applied tastes
Optimization improves a lot by analyzing its failures
◦ “Learning from failures is the key to success” (attribute to anyone you like)
◦ Responsible for many of the modern advances in deep learning
Thank you for your attention!
Feel free to contact me on [email protected] with any further questions!