Practical Methodology
Lecture slides for Chapter 11 of Deep Learning (www.deeplearningbook.org)
Ian Goodfellow, 2016-09-26

What Drives Success in ML?
• Arcane knowledge of dozens of obscure algorithms?
• Mountains of data?
• Knowing how to apply 3-4 standard techniques?
[Figure: small two-layer graphical model with visible units v1-v3 and hidden units h(1), h(2)]

Example: Street View Address Number Transcription (Goodfellow et al., 2014)

Three-Step Process
• Use needs to define metric-based goals
• Build an end-to-end system
• Data-driven refinement

Identify Needs
• High accuracy or low accuracy?
• Surgery robot: high accuracy
• Celebrity look-alike app: low accuracy

Choose Metrics
• Accuracy? (% of examples correct)
• Coverage? (% of examples processed)
• Precision? (% of detections that are right)
• Recall? (% of objects detected)
• Amount of error? (for regression problems)
(These classification metrics are sketched in code at the end of this section.)

End-to-End System
• Get up and running ASAP
• Build the simplest viable system first
• What baseline to start with, though?
• Copy the state of the art from a related publication

Deep or Not?
• Lots of noise, little structure -> not deep
• Little noise, complex structure -> deep
• Good shallow baselines:
• Use what you know
• Logistic regression, SVM, and boosted trees are all good

Choosing the Architecture Family
• No structure -> fully connected
• Spatial structure -> convolutional
• Sequential structure -> recurrent

Fully Connected Baseline
• 2-3 hidden layer feed-forward neural network (a.k.a. "multilayer perceptron")
• Rectified linear units
• Batch normalization
• SGD + momentum, or Adam
• Maybe dropout
[Figure: two-layer network diagram with weight matrices V and W]
(A code sketch of this baseline appears at the end of this section.)

Convolutional Network Baseline
• Download a pretrained network
• Or copy-paste an architecture from a related task: Inception, or a deep residual network
• Batch normalization
• Adam
• Fallback option: rectified linear convolutional net

Recurrent Network Baseline
• LSTM
• SGD
• Gradient clipping
• High forget gate bias
[Figure: LSTM cell - input, input gate, forget gate, state self-loop, output gate, output]
(A code sketch of this baseline appears at the end of this section.)

Data-Driven Adaptation
• Choose what to do based on data
• Don't believe hype
• Measure train and test error
• "Overfitting" versus "underfitting"

High Train Error
• Inspect data for defects
• Inspect software for bugs
• Don't roll your own unless you know what you're doing
• Tune the learning rate (and other optimization settings)
• Make the model bigger

Checking Data for Defects
• Can a human process it?
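To make the "Choose Metrics" definitions concrete, here is a minimal NumPy sketch of accuracy, coverage, precision, and recall for a toy detection-style task. The arrays, and the idea that the system may abstain from some examples, are assumptions invented for the example, not part of the slides.

```python
import numpy as np

# Toy data (hypothetical, for illustration only):
# y_true   : 1 if an object is truly present, 0 otherwise
# y_pred   : 1 if the system reports a detection, 0 otherwise
# processed: 1 if the system chose to answer at all (it may abstain)
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0])
y_pred = np.array([1, 0, 0, 1, 1, 1, 0, 0])
processed = np.array([1, 1, 1, 1, 1, 1, 0, 0])

# Accuracy: % of processed examples answered correctly
accuracy = (y_pred[processed == 1] == y_true[processed == 1]).mean()

# Coverage: % of examples the system chose to process
coverage = processed.mean()

# Precision: % of reported detections that are correct
precision = (y_true[y_pred == 1] == 1).mean()

# Recall: % of true objects that were detected
recall = (y_pred[y_true == 1] == 1).mean()

print(f"accuracy={accuracy:.2f} coverage={coverage:.2f} "
      f"precision={precision:.2f} recall={recall:.2f}")
```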
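The fully connected baseline can be written down in a few lines. The sketch below is an illustration in PyTorch (the slides do not prescribe a framework): the sizes n_features, n_hidden, and n_classes are made up, and it combines ReLU units, batch normalization, optional dropout, and SGD with momentum, with Adam left as a commented-out alternative.

```python
import torch
import torch.nn as nn

# Hypothetical problem sizes; replace with your own.
n_features, n_hidden, n_classes = 100, 256, 10

# 2-3 hidden layer feed-forward net ("multilayer perceptron") with
# rectified linear units, batch normalization, and a little dropout.
model = nn.Sequential(
    nn.Linear(n_features, n_hidden), nn.BatchNorm1d(n_hidden), nn.ReLU(), nn.Dropout(0.5),
    nn.Linear(n_hidden, n_hidden), nn.BatchNorm1d(n_hidden), nn.ReLU(), nn.Dropout(0.5),
    nn.Linear(n_hidden, n_classes),
)

# Either SGD + momentum or Adam, as on the slide.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

loss_fn = nn.CrossEntropyLoss()

# One illustrative training step on random data.
x = torch.randn(32, n_features)
y = torch.randint(0, n_classes, (32,))
optimizer.zero_grad()
loss = loss_fn(model(x), y)
loss.backward()
optimizer.step()
```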
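Likewise, a hedged PyTorch sketch of the recurrent baseline: an LSTM trained with SGD, gradient-norm clipping, and a high forget gate bias. The sizes and data are invented, and the forget-bias loop relies on PyTorch storing LSTM biases in input/forget/cell/output gate order.

```python
import torch
import torch.nn as nn

# Hypothetical sizes for a toy sequence-labeling problem.
n_inputs, n_hidden, n_classes = 50, 128, 10

lstm = nn.LSTM(input_size=n_inputs, hidden_size=n_hidden, batch_first=True)
readout = nn.Linear(n_hidden, n_classes)

# High forget gate bias: PyTorch stores LSTM biases as
# [input | forget | cell | output] gates, so fill the second quarter.
for name, param in lstm.named_parameters():
    if "bias" in name:
        n = param.size(0)
        with torch.no_grad():
            param[n // 4: n // 2].fill_(1.0)

params = list(lstm.parameters()) + list(readout.parameters())
optimizer = torch.optim.SGD(params, lr=0.1)
loss_fn = nn.CrossEntropyLoss()

# One illustrative step on random data, with gradient clipping.
x = torch.randn(8, 20, n_inputs)          # (batch, time, features)
y = torch.randint(0, n_classes, (8, 20))  # one label per time step
out, _ = lstm(x)
loss = loss_fn(readout(out).reshape(-1, n_classes), y.reshape(-1))
optimizer.zero_grad()
loss.backward()
torch.nn.utils.clip_grad_norm_(params, max_norm=1.0)
optimizer.step()
```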
Increasing Depth
[Figure: "Effect of Depth" - test accuracy (%) versus number of hidden layers (3 to 11)]

High Test Error
• Add dataset augmentation
• Add dropout
• Collect more data
(An example augmentation pipeline is sketched in code at the end of this section.)

Increasing Training Set Size
[Figure: two panels plotted against # train examples on a log scale (10^0 to 10^5). Left: error (MSE) for train and test with a quadratic model and with optimal-capacity models, alongside the Bayes error. Right: optimal capacity (polynomial degree).]

Tuning the Learning Rate
[Figure 11.1: Typical relationship between the learning rate (logarithmic scale) and the training error.]
(A learning-rate sweep is sketched in code at the end of this section.)

Reasoning about Hyperparameters (Table 11.1, excerpt)

Hyperparameter: Number of hidden units
• Increases capacity when: increased
• Reason: Increasing the number of hidden units increases the representational capacity of the model.
• Caveats: Increasing the number of hidden units increases both the time and memory cost of essentially every operation on the model.

Hyperparameter: Learning rate
• Increases capacity when: tuned optimally
• Reason: An improper learning rate, whether too high or too low, results in a model with low effective capacity due to optimization failure.

Hyperparameter: Convolution kernel width
• Increases capacity when: increased
• Reason: Increasing the kernel width increases the number of parameters in the model.
• Caveats: A wider kernel results in a narrower output dimension, reducing model capacity unless you use implicit zero padding.

Hyperparameter Search
[Figure 11.2: Comparison of grid search and random search. For illustration purposes we display two hyperparameters, but we are typically interested in having many more. (Left) To perform grid search, we provide a set of values for each hyperparameter. The search algorithm runs training for every joint hyperparameter setting in the cross product of these sets. (Right) To perform random search, we provide a probability distribution over joint hyperparameter configurations.]
(Both search strategies are sketched in code at the end of this section.)
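For the "add dataset augmentation" suggestion, one common recipe for natural images is random flips, crops, and color jitter. The sketch below assumes torchvision; the particular transforms and parameter values are illustrative choices, not recommendations from the slides.

```python
from torchvision import transforms

# A typical augmentation pipeline for small natural images.
train_transform = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomCrop(32, padding=4),   # random translations via padding + crop
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    transforms.ToTensor(),
])

# Evaluation data is left unaugmented.
test_transform = transforms.ToTensor()
```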
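A simple way to act on Figure 11.1 is to sweep the learning rate on a logarithmic scale and keep the value with the lowest training error. In this sketch, train_and_get_train_error is a hypothetical placeholder; the toy formula inside it only mimics the U-shaped curve so the example runs end to end.

```python
import numpy as np

def train_and_get_train_error(lr):
    # Hypothetical placeholder: substitute a short, real training run here.
    # The toy formula below just mimics the U-shaped curve of Figure 11.1.
    return (np.log10(lr) + 2.0) ** 2 + 1.0

# Sweep the learning rate on a logarithmic scale.
candidate_lrs = np.logspace(-4, 0, num=9)   # 1e-4 ... 1e0
train_errors = {lr: train_and_get_train_error(lr) for lr in candidate_lrs}

# Keep the learning rate with the lowest training error.
best_lr = min(train_errors, key=train_errors.get)
print(f"best learning rate: {best_lr:.0e}")
```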
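The grid-versus-random comparison of Figure 11.2 can be expressed directly in code. The hyperparameters searched here (learning rate and hidden layer width) and their ranges are assumptions chosen for illustration.

```python
import itertools
import random

# Grid search: one training run for every joint setting in the cross
# product of the per-hyperparameter value sets (Figure 11.2, left).
lr_grid = [1e-4, 1e-3, 1e-2, 1e-1]
hidden_grid = [64, 128, 256]
grid_configs = [{"lr": lr, "n_hidden": h}
                for lr, h in itertools.product(lr_grid, hidden_grid)]  # 12 runs

# Random search: sample each hyperparameter from its own distribution,
# e.g. a log-uniform learning rate (Figure 11.2, right).
def sample_config():
    return {
        "lr": 10 ** random.uniform(-4, -1),           # log-uniform on [1e-4, 1e-1]
        "n_hidden": random.choice([64, 128, 256, 512]),
    }

random_configs = [sample_config() for _ in range(len(grid_configs))]  # same budget
```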