Introduction to Deep Learning

CNN / AlexNet
Sungjoon Choi
AlexNet
+ not-so-minor (actually SUPER IMPORTANT) heuristics:




ReLU Nonlinearity
Local Response Normalization
Data Augmentation
Dropout
What is ImageNet?
ILSVRC 2010
ImageNet Large Scale Visual Recognition Challenge (ILSVRC)
It uses a subset of ImageNet with roughly 1M images in
1K categories.
Are these all just cats? (Of course, some are super cute!)
These are all different categories!
(Egyptian, Persian, Tiger, Siamese, and Tabby cat)
Convolutional Neural Network
This is pretty much everything about the convolutional
neural network.
Convolution + Subsampling + Full Connection
What is CNN?
CNNs are basically layers of convolutions followed by
subsampling and fully connected layers.
Intuitively speaking, the convolution and subsampling layers
work as feature extraction layers, while the fully connected
layers classify which category the current input belongs to
using the extracted features.
http://www.wildml.com/2015/11/understanding-convolutional-neural-networks-for-nlp/
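As a concrete illustration of this convolution + subsampling + full connection pipeline, here is a minimal Keras sketch; the 28 × 28 × 1 input, filter counts, and 10 output classes are arbitrary illustration values, not AlexNet's.

```python
import tensorflow as tf

# Minimal CNN sketch: convolutions + subsampling extract features,
# fully connected layers classify them.
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, 3, activation="relu", input_shape=(28, 28, 1)),
    tf.keras.layers.MaxPooling2D(2),                  # subsampling
    tf.keras.layers.Conv2D(64, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(2),                  # subsampling
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(128, activation="relu"),    # full connection
    tf.keras.layers.Dense(10, activation="softmax"),  # category scores
])
model.summary()
```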
Why is CNN so powerful?
Local Invariance
Loosely speaking, as the convolution filters are ‘sliding’ over
the input image, the exact location of the object we want to
find does not matter much.
Compositionality
There is a hierarchy in CNNs. It is GOOD!
Huge representation capacity!
https://starwarsanon.wordpress.com/tag/darth-sidious-vs-yoda/
What is Convolution?
http://deeplearning.stanford.edu/wiki/index.php/Feature_extraction_using_convolution
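A minimal NumPy sketch of the sliding-window operation (strictly speaking cross-correlation, which is what deep learning libraries call convolution); the 5 × 5 image and 3 × 3 averaging filter are toy values chosen for illustration.

```python
import numpy as np

def conv2d(image, filt):
    # Slide the filter over the image: valid convolution, stride 1, no padding.
    h, w = image.shape
    fh, fw = filt.shape
    out = np.zeros((h - fh + 1, w - fw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + fh, j:j + fw] * filt)
    return out

image = np.arange(25, dtype=float).reshape(5, 5)  # toy 5x5 "image"
filt = np.ones((3, 3)) / 9.0                      # 3x3 averaging filter
print(conv2d(image, filt).shape)                  # (3, 3): 5 - 3 + 1 = 3
```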
Details of Convolution
Zero-padding / Stride / Channel
Conv: Zero-padding?
What is the size of the input?
n_in = 5
What is the size of the output?
n_out = 5
What is the size of the filter?
n_filter = 3
What is the size of the zero-padding?
n_padding = 1
n_out = n_in + 2 * n_padding − n_filter + 1
5 = 5 + 2 * 1 − 3 + 1
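The same relation written as a tiny helper, so the numbers above can be checked (the function name is just for illustration):

```python
def conv_out_size(n_in, n_padding, n_filter):
    # n_out = n_in + 2 * n_padding - n_filter + 1  (stride 1)
    return n_in + 2 * n_padding - n_filter + 1

print(conv_out_size(n_in=5, n_padding=1, n_filter=3))  # 5, the zero-padded case above
print(conv_out_size(n_in=5, n_padding=0, n_filter=3))  # 3, without zero-padding
```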
Stride?
Conv: Stride?
(Left) Stride size: 1
(Right) Stride size: 2
If stride size equals the filter size, there will
be no overlapping.
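Extending the helper above with a stride term, assuming the usual formula n_out = (n_in + 2 * n_padding − n_filter) / stride + 1 with integer division:

```python
def conv_out_size(n_in, n_padding, n_filter, stride=1):
    # n_out = (n_in + 2 * n_padding - n_filter) // stride + 1
    return (n_in + 2 * n_padding - n_filter) // stride + 1

print(conv_out_size(n_in=5, n_padding=1, n_filter=3, stride=1))  # 5
print(conv_out_size(n_in=5, n_padding=1, n_filter=3, stride=2))  # 3
# When stride == filter size, the windows tile the input with no overlap:
print(conv_out_size(n_in=6, n_padding=0, n_filter=3, stride=3))  # 2 non-overlapping windows per row
```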
Conv: Channel
[batch, in_height, in_width, in_channels]
[filter_height, filter_width, in_channels, out_channels]
[batch, in_height=4, in_width=4, in_channels=3]
[filter_height=3, filter_width=3, in_channels=3, out_channels=7]
Conv: Channel
[batch, in_height=4, in_width=4, in_channels=3]
[filter_height=3, filter_width=3, in_channels=3, out_channels=7]
What is the number of parameters in this convolution layer?
189 = 3 * 3 * 3 * 7 (weights only, not counting biases)
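A sketch of the same shapes with tf.nn.conv2d, assuming a batch size of 1 and "SAME" padding; it reproduces the 189-weight count (tf.nn.conv2d itself has no bias term).

```python
import tensorflow as tf

x = tf.random.normal([1, 4, 4, 3])   # [batch, in_height, in_width, in_channels]
w = tf.random.normal([3, 3, 3, 7])   # [filter_height, filter_width, in_channels, out_channels]

y = tf.nn.conv2d(x, w, strides=1, padding="SAME")
print(y.shape)             # (1, 4, 4, 7): one output map per out_channel
print(tf.size(w).numpy())  # 189 = 3 * 3 * 3 * 7 weights (no bias in tf.nn.conv2d)
```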
AlexNet
What is the number of parameters?
Why are layers divided into two parts?
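(The two parts correspond to the paper splitting the network across two GPUs for memory reasons.) For the parameter count, here is a hedged single-tower Keras sketch that ignores the GPU split and LRN and simply stacks AlexNet's layer sizes; the 227 × 227 input is an assumption that makes the strides work out.

```python
import tensorflow as tf

# Single-tower approximation of the AlexNet layer sequence (no LRN, no 2-GPU split).
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(96, 11, strides=4, activation="relu",
                           input_shape=(227, 227, 3)),
    tf.keras.layers.MaxPooling2D(pool_size=3, strides=2),
    tf.keras.layers.Conv2D(256, 5, padding="same", activation="relu"),
    tf.keras.layers.MaxPooling2D(pool_size=3, strides=2),
    tf.keras.layers.Conv2D(384, 3, padding="same", activation="relu"),
    tf.keras.layers.Conv2D(384, 3, padding="same", activation="relu"),
    tf.keras.layers.Conv2D(256, 3, padding="same", activation="relu"),
    tf.keras.layers.MaxPooling2D(pool_size=3, strides=2),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(4096, activation="relu"),
    tf.keras.layers.Dense(4096, activation="relu"),
    tf.keras.layers.Dense(1000, activation="softmax"),
])
print(model.count_params())  # ~62M in this single tower; the paper reports ~60M
```

The fully connected layers dominate: the first dense layer alone holds 6 * 6 * 256 * 4096 ≈ 38M weights.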
AlexNet
ReLU Nonlinearity
ReLU
tanh
Faster Convergence!
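One intuition for the faster convergence: tanh saturates (its gradient vanishes for large |x|) while the ReLU gradient stays 1 for positive inputs. A small NumPy check with arbitrary sample inputs:

```python
import numpy as np

x = np.array([-6.0, -2.0, 0.5, 2.0, 6.0])

tanh_grad = 1.0 - np.tanh(x) ** 2   # d/dx tanh(x): vanishes for large |x|
relu_grad = (x > 0).astype(float)   # d/dx max(0, x): stays 1 for x > 0

print(np.round(tanh_grad, 4))  # [0.     0.0707 0.7864 0.0707 0.    ]
print(relu_grad)               # [0. 0. 1. 1. 1.]
```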
Local Response Normalization
The response-normalized activity is given by:

$b^i_{x,y} = a^i_{x,y} \Big/ \left(k + \alpha \sum_{j=\max(0,\, i-n/2)}^{\min(N-1,\, i+n/2)} \big(a^j_{x,y}\big)^2\right)^{\beta}$

where $a^i_{x,y}$ is the activity of the neuron computed by applying kernel $i$ at position $(x, y)$, $N$ is the number of kernels in the layer, and $k$, $n$, $\alpha$, $\beta$ are hyperparameters.
It implements a form of lateral inhibition inspired by real
neurons.
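A NumPy sketch of this cross-channel normalization; the defaults k = 2, n = 5, α = 1e-4, β = 0.75 are the values reported in the AlexNet paper, while the toy activation shape is an arbitrary assumption.

```python
import numpy as np

def local_response_norm(a, k=2.0, n=5, alpha=1e-4, beta=0.75):
    # a: activations of shape [height, width, channels]; normalize each channel i
    # by the sum of squares over the n neighboring channels around i.
    N = a.shape[-1]
    b = np.empty_like(a)
    for i in range(N):
        lo, hi = max(0, i - n // 2), min(N - 1, i + n // 2)
        denom = (k + alpha * np.sum(a[..., lo:hi + 1] ** 2, axis=-1)) ** beta
        b[..., i] = a[..., i] / denom
    return b

a = np.random.rand(4, 4, 8)          # toy activations
print(local_response_norm(a).shape)  # (4, 4, 8)
```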
Reducing Overfitting
It is often called regularization in the machine learning literature.
More details will be covered next week.
In AlexNet, two regularization methods are used:
- Data augmentation
- Dropout
Reg: Data Augmentation
http://www.slideshare.net/KenChatfield/chatfield14-devil
Reg: Data Augmentation 1
Original image (256 × 256) → smaller patch (224 × 224)
This increases the size of the training set
by a factor of 2048 (= 32 * 32 * 2).
The factor of two comes from horizontal reflections.
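A NumPy sketch of this crop-and-reflect augmentation (not the paper's actual code); the offsets are restricted to 32 values per axis so the count matches the 32 * 32 * 2 factor above.

```python
import numpy as np

def random_crop_and_flip(image, crop=224):
    # Assumes a 256 x 256 x 3 input image, as on the slide.
    h, w, _ = image.shape
    top = np.random.randint(0, h - crop)    # 32 possible vertical offsets
    left = np.random.randint(0, w - crop)   # 32 possible horizontal offsets
    patch = image[top:top + crop, left:left + crop]
    if np.random.rand() < 0.5:              # horizontal reflection -> factor of 2
        patch = patch[:, ::-1]
    return patch

image = np.random.rand(256, 256, 3)
print(random_crop_and_flip(image).shape)    # (224, 224, 3)
```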
Reg: Data Augmentation 2
Original patch (224 × 224) → color variation → altered patch (224 × 224)
To each RGB image pixel, the following quantity is added:
$[\mathbf{p}_1, \mathbf{p}_2, \mathbf{p}_3][\alpha_1 \lambda_1, \alpha_2 \lambda_2, \alpha_3 \lambda_3]^T$
where $\mathbf{p}_i$ and $\lambda_i$ are the $i$-th eigenvector and eigenvalue of the
3 × 3 covariance matrix of RGB pixel values, and each $\alpha_i$ is drawn from a
zero-mean Gaussian with standard deviation 0.1.
Probabilistically, no two training patches will ever be exactly the
same (effectively an infinite augmentation factor)!
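A NumPy sketch of this PCA color augmentation; note that the paper computes the covariance over the entire training set, whereas this illustration uses a single image for brevity, and σ = 0.1 is the Gaussian scale from the paper.

```python
import numpy as np

def pca_color_augmentation(image, sigma=0.1):
    pixels = image.reshape(-1, 3)                  # all RGB pixels as rows
    cov = np.cov(pixels, rowvar=False)             # 3 x 3 covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)         # lambda_i and p_i (columns)
    alphas = np.random.normal(0.0, sigma, size=3)  # alpha_i ~ N(0, 0.1)
    shift = eigvecs @ (alphas * eigvals)           # [p1 p2 p3][a1*l1, a2*l2, a3*l3]^T
    return image + shift                           # added to every pixel (broadcast)

image = np.random.rand(224, 224, 3)
print(pca_color_augmentation(image).shape)         # (224, 224, 3)
```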
Reg: Dropout
Original dropout [1] sets the output of each hidden
neuron to zero with a certain probability (0.5 in this paper).
At test time, they simply use all the neurons and multiply their outputs by 0.5.
[1] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. arXiv, 2012.
http://www.eddyazar.com/the-regrets-of-a-dropout-and-why-you-should-drop-out-too/
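A minimal NumPy sketch of the two behaviors described above (p = 0.5 as in the paper); modern libraries usually use the equivalent "inverted dropout" that rescales at training time instead.

```python
import numpy as np

def dropout_train(h, p=0.5):
    # Training: zero each hidden activation with probability p.
    mask = (np.random.rand(*h.shape) >= p).astype(h.dtype)
    return h * mask

def dropout_test(h, p=0.5):
    # Test time (AlexNet): keep every neuron but multiply its output by 0.5.
    return h * (1.0 - p)

h = np.random.rand(8)
print(dropout_train(h))  # roughly half the entries zeroed
print(dropout_test(h))   # all entries halved
```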