Neural Networks Teaser
February 27, 2017
Deep Learning in the News
Go falls to computers.
Learning
How do we teach a robot to recognize images as either a cat or a
non-cat? This sounds like a biology problem. How can we formulate
it as a mathematics problem?
R^(3×1000×1000) is the space of 1000 by 1000 RGB images.
C ⊂ R^(3×1000×1000) is the cat subset.
Try to learn the classifier function f_C : R^3000000 → {1, −1} so that
f_C(x) = 1 ⇐⇒ x ∈ C.
Usually we try to find the best function f, in a class of functions F,
that approximates f_C.
Let us play in a playground: playground.tensorflow.org/
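
As a minimal sketch of this setup (assuming NumPy; the image and the parameters are placeholders), a 1000 by 1000 RGB image flattens into a vector in R^3000000, and a candidate classifier is just a function from that vector to {1, −1}:

import numpy as np

# A 1000 x 1000 RGB image is a point in R^(3*1000*1000) = R^3000000.
image = np.random.rand(1000, 1000, 3)   # placeholder for a real photo
x = image.reshape(-1)                   # flatten to a vector of length 3,000,000

# A placeholder linear classifier; "learning" means choosing good (w, b)
# from some class of functions F so that f_C(x) = 1 exactly on cat images.
def f_C(x, w, b):
    return 1 if np.dot(w, x) + b > 0 else -1

w = np.zeros_like(x)   # parameters that a learning algorithm would fit
b = 0.0
print(f_C(x, w, b))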
Optimization and Learning
For example, support vector machines (SVM).
Data points (x_i, y_i), i = 1, ..., n, with x_i ∈ R^d and y_i ∈ {−1, 1}.
Classifier f(x) = w · x + b.
Optimization problem:
    min_{w ∈ R^d, b ∈ R}  ||w||^2 + C Σ_{i=1}^n max(0, 1 − y_i f(x_i))
The regularization term is convex and so is the loss function.
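
A small sketch of evaluating this objective in NumPy, with made-up data and an arbitrary candidate (w, b):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                       # n = 200 points in R^d, d = 5
y = np.sign(X[:, 0] + 0.1 * rng.normal(size=200))   # labels in {-1, 1}

def svm_objective(w, b, X, y, C=1.0):
    # ||w||^2 + C * sum_i max(0, 1 - y_i (w . x_i + b))
    margins = y * (X @ w + b)
    hinge = np.maximum(0.0, 1.0 - margins)
    return w @ w + C * hinge.sum()

w, b = np.ones(5), 0.0                              # a candidate, not the optimum
print(svm_objective(w, b, X, y))

Because both terms are convex in (w, b), minimizing this objective is a convex problem that standard solvers handle reliably.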
Scikit-learn for Quick and Dirty Machine Learning
Scikit-learn is a convenient Python library built on NumPy, SciPy,
and matplotlib for many standard machine learning algorithms.
Helpful examples at http://scikit-learn.org/stable/index.html
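
For instance, a linear SVM on a synthetic dataset takes only a few lines (a sketch; the dataset and parameters are arbitrary):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# Toy binary classification problem.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LinearSVC(C=1.0)               # the SVM objective from the previous slide
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))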
Obligatory Slide on "Big Data"
How many images do you think we have?
7 billion people, 3 billion of them with smartphones, 1 picture a day:
roughly 3 × 10^9 × 365 ≈ 10^12, about a trillion pictures a year.
Some claim that more data was generated in the last 2 years than in
the rest of the history of mankind.
In comparison: there are around 3 billion seconds in a 100-year lifetime.
We want algorithms that can continuously improve with such large
data sets.
If error = bias + variance, then we want a large and flexible class of
functions so that the bias is small, since large enough data can
control the variance.
Deep Learning: Learning Representation and Classifier
Imagine two inputs x_1^(0), x_2^(0), and that we are trying to learn a
classifier y = f(x_1^(0), x_2^(0)).
Linear classifiers f(x_1^(0), x_2^(0)) = w_1 x_1^(0) + w_2 x_2^(0) + b are
not always expressive enough.
Idea: introduce a non-linearity such as g(x) = (e^x − e^(−x)) / (e^x + e^(−x)).
Now we can stack many layers to get a composite classifier:
    f(X^(0)) = W_3 g(W_2 g(W_1 X^(0) + b_1) + b_2) + b_3.
The output of the first hidden layer, X^(1) = g(W_1 X^(0) + b_1), is a
feature representation of the input X^(0).
The output of the second hidden layer, X^(2) = g(W_2 X^(1) + b_2), is a
feature transformation of the feature extractor X^(1).
Finally, f(X^(0)) is a classifier trained on a representation learned
from the data. The representation is not fixed!
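
A minimal NumPy sketch of this forward pass (the layer widths, random weights, and input values are placeholders):

import numpy as np

rng = np.random.default_rng(0)

def g(x):
    # The tanh non-linearity (e^x - e^(-x)) / (e^x + e^(-x)).
    return np.tanh(x)

# Two inputs -> 4 hidden units -> 4 hidden units -> 1 output.
W1, b1 = rng.normal(size=(4, 2)), np.zeros(4)
W2, b2 = rng.normal(size=(4, 4)), np.zeros(4)
W3, b3 = rng.normal(size=(1, 4)), np.zeros(1)

X0 = np.array([0.5, -1.2])      # the input (x_1^(0), x_2^(0))
X1 = g(W1 @ X0 + b1)            # feature representation of the input
X2 = g(W2 @ X1 + b2)            # feature transformation of X^(1)
f = W3 @ X2 + b3                # the composite classifier f(X^(0))
print(f)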
Backpropagation
In order to train, we need to take the gradient of the loss
(1/n) Σ_{i=1}^n L(y_i − f(X_i)) with respect to the weight matrices W.
Exercise: take the derivative of f(x) = w_3 g(w_2 g(w_1 x + b_1) + b_2) + b_3,
with g(x) = (e^x − e^(−x)) / (e^x + e^(−x)), with respect to the w_i.
    df/dw_3 = g(w_2 g(w_1 x + b_1) + b_2)
    df/dw_2 = w_3 g'(w_2 g(w_1 x + b_1) + b_2) g(w_1 x + b_1)
    df/dw_1 = w_3 g'(w_2 g(w_1 x + b_1) + b_2) w_2 g'(w_1 x + b_1) x
We can exploit this shared structure to save on computation in a
recursive way.
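
These formulas can be checked numerically; the sketch below (with arbitrary scalar values) also shows the shared intermediate quantities that backpropagation reuses rather than recomputes:

import numpy as np

def g(x):  return np.tanh(x)            # (e^x - e^(-x)) / (e^x + e^(-x))
def gp(x): return 1.0 - np.tanh(x)**2   # g'(x)

x, b1, b2, b3 = 0.7, 0.1, -0.2, 0.05
w1, w2, w3 = 0.3, -1.1, 0.8

# Forward pass: intermediate values shared by all three derivatives.
h1 = g(w1 * x + b1)
h2 = g(w2 * h1 + b2)

df_dw3 = h2
df_dw2 = w3 * gp(w2 * h1 + b2) * h1
df_dw1 = w3 * gp(w2 * h1 + b2) * w2 * gp(w1 * x + b1) * x

# Finite-difference check of df/dw1.
def f(w1_):
    return w3 * g(w2 * g(w1_ * x + b1) + b2) + b3

eps = 1e-6
print(df_dw1, (f(w1 + eps) - f(w1 - eps)) / (2 * eps))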
Making Descent Work Better
In order to save time when descending on expressions such as
E(W) = (1/n) Σ_{i=1}^n L(y_i − f(X_i)), people often work with one
example at a time.
They may also descend with small batches (say 100 examples) at a time.
With momentum: ΔW = η ∇E(W) + α ΔW, where α is the momentum
parameter and η is the learning rate.
The overall emphasis is on faster first-order methods that try to avoid
getting stuck. Finding a global optimum may even give worse
generalization performance.
There are many issues in practice: https://arxiv.org/abs/1206.5533.
Try running a small example in scikit-learn.
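
A sketch of mini-batch descent with momentum, written as the update above combined with the usual descent step W ← W − ΔW (the gradient function and toy problem are made up for illustration):

import numpy as np

def sgd_momentum(W, grad_E, data, eta=0.01, alpha=0.9, batch_size=100, epochs=10):
    # dW <- eta * grad_E(W, batch) + alpha * dW,  then  W <- W - dW
    dW = np.zeros_like(W)
    for _ in range(epochs):
        np.random.shuffle(data)
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            dW = eta * grad_E(W, batch) + alpha * dW
            W = W - dW
    return W

# Toy usage: choose a scalar W minimizing E(W) = mean_i (W - x_i)^2.
xs = np.random.normal(loc=3.0, size=1000)
grad = lambda W, batch: np.mean(2.0 * (W - batch))
print(sgd_momentum(np.array(0.0), grad, xs))   # should end up near 3.0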
Open Challenges
Solve harder challenges with better network architectures,
optimization methods, and datasets.
Give a more satisfactory theoretical explanation for why they work.
Getting Started
Nice tutorials and free books:
http://www.deeplearningbook.org/
http://deeplearning.net/tutorial/
http://neuralnetworksanddeeplearning.com/index.html
http://www.cs.nyu.edu/~yann/talks/lecun-ranzato-icml2013.pdf
Popular packages:
Popular framework: http://caffe.berkeleyvision.org/
Destined to be the future standard high-level library: https://keras.io/
Popular lower-level library: http://deeplearning.net/software/theano/
New lower-level library: https://www.tensorflow.org/
Course to learn Keras without needing your own machine:
http://course.fast.ai/lessons/lessons.html
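
As a first experiment with Keras, a minimal binary classifier might look like the sketch below (layer sizes, optimizer, and the random placeholder data are arbitrary choices; the API shown is the Keras Sequential interface):

import numpy as np
from keras.models import Sequential
from keras.layers import Dense

# Placeholder data: 1000 samples with 20 features, labels in {0, 1}.
X = np.random.rand(1000, 20)
y = np.random.randint(2, size=1000)

model = Sequential()
model.add(Dense(16, activation='tanh', input_shape=(20,)))   # hidden layer with tanh
model.add(Dense(1, activation='sigmoid'))                    # output layer
model.compile(optimizer='sgd', loss='binary_crossentropy', metrics=['accuracy'])
model.fit(X, y, epochs=5, batch_size=100)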