Practical Methodology
Lecture slides for Chapter 11 of Deep Learning (www.deeplearningbook.org)
Ian Goodfellow, 2016-09-26

What Drives Success in ML?
• Arcane knowledge of dozens of obscure algorithms?
• Mountains of data?
• Knowing how to apply 3-4 standard techniques?
[Figure: small two-layer graphical model with visible units v1-v3 and hidden units h(1), h(2)]

Example: Street View Address Number Transcription (Goodfellow et al., 2014)

Three-Step Process
• Use needs to define metric-based goals
• Build an end-to-end system
• Data-driven refinement

Identify Needs
• High accuracy or low accuracy?
• Surgery robot: high accuracy
• Celebrity look-alike app: low accuracy

Choose Metrics
• Accuracy? (% of examples correct)
• Coverage? (% of examples processed)
• Precision? (% of detections that are right)
• Recall? (% of objects detected)
• Amount of error? (for regression problems)
(These classification metrics are sketched in code at the end of this section.)

End-to-End System
• Get up and running ASAP
• Build the simplest viable system first
• What baseline to start with, though?
• Copy the state of the art from a related publication

Deep or Not?
• Lots of noise, little structure -> not deep
• Little noise, complex structure -> deep
• Good shallow baselines:
• Use what you know
• Logistic regression, SVM, and boosted trees are all good

Choosing the Architecture Family
• No structure -> fully connected
• Spatial structure -> convolutional
• Sequential structure -> recurrent

Fully Connected Baseline
• 2-3 hidden layer feed-forward neural network (a.k.a. "multilayer perceptron")
• Rectified linear units
• Batch normalization
• SGD + momentum, or Adam
• Maybe dropout
[Figure: two-layer network diagram with weight matrices V and W]
(A code sketch of this baseline appears at the end of this section.)

Convolutional Network Baseline
• Download a pretrained network
• Or copy-paste an architecture from a related task: Inception, or a deep residual network
• Batch normalization
• Adam
• Fallback option: rectified linear convolutional net

Recurrent Network Baseline
• LSTM
• SGD
• Gradient clipping
• High forget gate bias
[Figure: LSTM cell - input, input gate, forget gate, state self-loop, output gate, output]
(A code sketch of this baseline appears at the end of this section.)

Data-Driven Adaptation
• Choose what to do based on data
• Don't believe hype
• Measure train and test error
• "Overfitting" versus "underfitting"

High Train Error
• Inspect data for defects
• Inspect software for bugs
• Don't roll your own unless you know what you're doing
• Tune the learning rate (and other optimization settings)
• Make the model bigger

Checking Data for Defects
• Can a human process it?
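To make the "Choose Metrics" definitions concrete, here is a minimal NumPy sketch of accuracy, coverage, precision, and recall for a toy detection-style task. The arrays, and the idea that the system may abstain from some examples, are assumptions invented for the example, not part of the slides.

```python
import numpy as np

# Toy data (hypothetical, for illustration only):
# y_true   : 1 if an object is truly present, 0 otherwise
# y_pred   : 1 if the system reports a detection, 0 otherwise
# processed: 1 if the system chose to answer at all (it may abstain)
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0])
y_pred = np.array([1, 0, 0, 1, 1, 1, 0, 0])
processed = np.array([1, 1, 1, 1, 1, 1, 0, 0])

# Accuracy: % of processed examples answered correctly
accuracy = (y_pred[processed == 1] == y_true[processed == 1]).mean()

# Coverage: % of examples the system chose to process
coverage = processed.mean()

# Precision: % of reported detections that are correct
precision = (y_true[y_pred == 1] == 1).mean()

# Recall: % of true objects that were detected
recall = (y_pred[y_true == 1] == 1).mean()

print(f"accuracy={accuracy:.2f} coverage={coverage:.2f} "
      f"precision={precision:.2f} recall={recall:.2f}")
```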
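The fully connected baseline can be written down in a few lines. The sketch below is an illustration in PyTorch (the slides do not prescribe a framework): the sizes n_features, n_hidden, and n_classes are made up, and it combines ReLU units, batch normalization, optional dropout, and SGD with momentum, with Adam left as a commented-out alternative.

```python
import torch
import torch.nn as nn

# Hypothetical problem sizes; replace with your own.
n_features, n_hidden, n_classes = 100, 256, 10

# 2-3 hidden layer feed-forward net ("multilayer perceptron") with
# rectified linear units, batch normalization, and a little dropout.
model = nn.Sequential(
    nn.Linear(n_features, n_hidden), nn.BatchNorm1d(n_hidden), nn.ReLU(), nn.Dropout(0.5),
    nn.Linear(n_hidden, n_hidden), nn.BatchNorm1d(n_hidden), nn.ReLU(), nn.Dropout(0.5),
    nn.Linear(n_hidden, n_classes),
)

# Either SGD + momentum or Adam, as on the slide.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

loss_fn = nn.CrossEntropyLoss()

# One illustrative training step on random data.
x = torch.randn(32, n_features)
y = torch.randint(0, n_classes, (32,))
optimizer.zero_grad()
loss = loss_fn(model(x), y)
loss.backward()
optimizer.step()
```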
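Likewise, a hedged PyTorch sketch of the recurrent baseline: an LSTM trained with SGD, gradient-norm clipping, and a high forget gate bias. The sizes and data are invented, and the forget-bias loop relies on PyTorch storing LSTM biases in input/forget/cell/output gate order.

```python
import torch
import torch.nn as nn

# Hypothetical sizes for a toy sequence-labeling problem.
n_inputs, n_hidden, n_classes = 50, 128, 10

lstm = nn.LSTM(input_size=n_inputs, hidden_size=n_hidden, batch_first=True)
readout = nn.Linear(n_hidden, n_classes)

# High forget gate bias: PyTorch stores LSTM biases as
# [input | forget | cell | output] gates, so fill the second quarter.
for name, param in lstm.named_parameters():
    if "bias" in name:
        n = param.size(0)
        with torch.no_grad():
            param[n // 4: n // 2].fill_(1.0)

params = list(lstm.parameters()) + list(readout.parameters())
optimizer = torch.optim.SGD(params, lr=0.1)
loss_fn = nn.CrossEntropyLoss()

# One illustrative step on random data, with gradient clipping.
x = torch.randn(8, 20, n_inputs)          # (batch, time, features)
y = torch.randint(0, n_classes, (8, 20))  # one label per time step
out, _ = lstm(x)
loss = loss_fn(readout(out).reshape(-1, n_classes), y.reshape(-1))
optimizer.zero_grad()
loss.backward()
torch.nn.utils.clip_grad_norm_(params, max_norm=1.0)
optimizer.step()
```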
Increasing Depth
[Figure: "Effect of Depth" - test accuracy (%) versus number of hidden layers (3 to 11)]

High Test Error
• Add dataset augmentation
• Add dropout
• Collect more data
(An example augmentation pipeline is sketched in code at the end of this section.)

Increasing Training Set Size
[Figure: two panels plotted against # train examples on a log scale (10^0 to 10^5). Left: error (MSE) for train and test with a quadratic model and with optimal-capacity models, alongside the Bayes error. Right: optimal capacity (polynomial degree).]

Tuning the Learning Rate
[Figure 11.1: Typical relationship between the learning rate (logarithmic scale) and the training error.]
(A learning-rate sweep is sketched in code at the end of this section.)

Reasoning about Hyperparameters (Table 11.1, excerpt)

Hyperparameter: Number of hidden units
• Increases capacity when: increased
• Reason: Increasing the number of hidden units increases the representational capacity of the model.
• Caveats: Increasing the number of hidden units increases both the time and memory cost of essentially every operation on the model.

Hyperparameter: Learning rate
• Increases capacity when: tuned optimally
• Reason: An improper learning rate, whether too high or too low, results in a model with low effective capacity due to optimization failure.

Hyperparameter: Convolution kernel width
• Increases capacity when: increased
• Reason: Increasing the kernel width increases the number of parameters in the model.
• Caveats: A wider kernel results in a narrower output dimension, reducing model capacity unless you use implicit zero padding.

Hyperparameter Search
[Figure 11.2: Comparison of grid search and random search. For illustration purposes we display two hyperparameters, but we are typically interested in having many more. (Left) To perform grid search, we provide a set of values for each hyperparameter. The search algorithm runs training for every joint hyperparameter setting in the cross product of these sets. (Right) To perform random search, we provide a probability distribution over joint hyperparameter configurations.]
(Both search strategies are sketched in code at the end of this section.)
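For the "add dataset augmentation" suggestion, one common recipe for natural images is random flips, crops, and color jitter. The sketch below assumes torchvision; the particular transforms and parameter values are illustrative choices, not recommendations from the slides.

```python
from torchvision import transforms

# A typical augmentation pipeline for small natural images.
train_transform = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomCrop(32, padding=4),   # random translations via padding + crop
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    transforms.ToTensor(),
])

# Evaluation data is left unaugmented.
test_transform = transforms.ToTensor()
```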
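A simple way to act on Figure 11.1 is to sweep the learning rate on a logarithmic scale and keep the value with the lowest training error. In this sketch, train_and_get_train_error is a hypothetical placeholder; the toy formula inside it only mimics the U-shaped curve so the example runs end to end.

```python
import numpy as np

def train_and_get_train_error(lr):
    # Hypothetical placeholder: substitute a short, real training run here.
    # The toy formula below just mimics the U-shaped curve of Figure 11.1.
    return (np.log10(lr) + 2.0) ** 2 + 1.0

# Sweep the learning rate on a logarithmic scale.
candidate_lrs = np.logspace(-4, 0, num=9)   # 1e-4 ... 1e0
train_errors = {lr: train_and_get_train_error(lr) for lr in candidate_lrs}

# Keep the learning rate with the lowest training error.
best_lr = min(train_errors, key=train_errors.get)
print(f"best learning rate: {best_lr:.0e}")
```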
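The grid-versus-random comparison of Figure 11.2 can be expressed directly in code. The hyperparameters searched here (learning rate and hidden layer width) and their ranges are assumptions chosen for illustration.

```python
import itertools
import random

# Grid search: one training run for every joint setting in the cross
# product of the per-hyperparameter value sets (Figure 11.2, left).
lr_grid = [1e-4, 1e-3, 1e-2, 1e-1]
hidden_grid = [64, 128, 256]
grid_configs = [{"lr": lr, "n_hidden": h}
                for lr, h in itertools.product(lr_grid, hidden_grid)]  # 12 runs

# Random search: sample each hyperparameter from its own distribution,
# e.g. a log-uniform learning rate (Figure 11.2, right).
def sample_config():
    return {
        "lr": 10 ** random.uniform(-4, -1),           # log-uniform on [1e-4, 1e-1]
        "n_hidden": random.choice([64, 128, 256, 512]),
    }

random_configs = [sample_config() for _ in range(len(grid_configs))]  # same budget
```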