Deep Learning and Its Applications
Convolutional Neural Network and Its Application in Image Recognition
Xin Sui
Oct 28, 2016
Outline
1. A Motivating Example
2. The Convolutional Neural Network (CNN) Model
3. Training the CNN Model
4. Issues and Recent Advances
5. Code Demonstration
A Motivating Example
Image Classification: a classic application of CNNs
The CNN Model
Visualization of a typical CNN Model
Layers in the model
1. Convolutional Layer: 1 feature map → 8 feature maps
2. Pooling Layer: shrinks the resolution: 24 × 24 → 12 × 12
3. Convolutional Layer: 8 feature maps → 16 feature maps
4. Pooling Layer: shrinks the resolution: 12 × 12 → 4 × 4
5. Fully Connected Layer: each output is connected to all 4 × 4 × 16 pixels
6. Softmax Layer: applies the softmax function
Key Elements in Convolutional Layer
This convolutional layer maps 1 feature map of size $24 \times 24$ to 8 feature maps of size $24 \times 24$.
In: $p_{in}$ feature maps (images) of resolution $m_{in} \times n_{in}$, with pixels $I_i[x, y]$, $i \in [p_{in}]$, $x \in [m_{in}]$, $y \in [n_{in}]$, where $[a] = \{0, 1, 2, \dots, a-1\}$.
Out: $p_{out}$ feature maps of resolution $m_{out} \times n_{out}$: $O_j[x, y]$, $j \in [p_{out}]$, $x \in [m_{out}]$, $y \in [n_{out}]$.
Filter size: $u \times v \times p_{in}$.
Activation function $h(\cdot)$, e.g., the ReLU function $h(x) = \max\{0, x\}$.
2-D Discrete Convolution
For simplicity, assume $p_{in} = p_{out} = 1$ and $h(x) = x$ for now. Let $F[x, y]$ be the filter of size $u \times v \times 1$, that is, $x \in [u]$, $y \in [v]$.
Recall that $I[x, y]$ is the input image of size $m_{in} \times n_{in}$, while $O[x, y]$ is the output image of size $m_{out} \times n_{out}$.
Then
\[
O[x, y] = \sum_{a \in [u]} \sum_{b \in [v]} I[x - a, y - b] \, F[a, b] =: (I * F)[x, y]
\]
for all $x \in [m_{out}]$, $y \in [n_{out}]$.
The operator $*$ denotes 2-D discrete convolution; for brevity, write $O = I * F$.
Here $I[x - a, y - b]$ is taken to be $0$ whenever $x - a \notin [m_{in}]$ or $y - b \notin [n_{in}]$ (zero padding).
Visualization of 2-D convolution
Convolution detects features in an image
In practice, it is common to take $(m_{out}, n_{out}) = (m_{in}, n_{in})$; this is called ‘same padding’.
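As a concrete illustration (not part of the original slides), here is a minimal NumPy sketch of the 2-D convolution defined above with zero ('same') padding; the function name conv2d_same and the image/filter sizes are chosen just for this example.

import numpy as np

def conv2d_same(I, F):
    """O[x, y] = sum over a in [u], b in [v] of I[x - a, y - b] * F[a, b],
    with I[x - a, y - b] treated as 0 outside the image (zero padding),
    so the output has the same resolution as the input."""
    m, n = I.shape
    u, v = F.shape
    O = np.zeros((m, n))
    for x in range(m):
        for y in range(n):
            for a in range(u):
                for b in range(v):
                    if 0 <= x - a < m and 0 <= y - b < n:
                        O[x, y] += I[x - a, y - b] * F[a, b]
    return O

# A 24 x 24 image convolved with a 5 x 5 filter gives a 24 x 24 output.
I = np.random.rand(24, 24)
F = np.random.rand(5, 5)
print(conv2d_same(I, F).shape)  # (24, 24)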
Convolutional Layer
Let us remove the assumptions $p_{in} = p_{out} = 1$ and $h(x) = x$ one by one to obtain the final form of a convolutional layer.
When $p_{in} \neq 1$,
\[
O = \sum_{i \in [p_{in}]} I_i * F_i .
\]
Additionally, when $p_{out} \neq 1$,
\[
O_j = \sum_{i \in [p_{in}]} I_i * F_i^{(j)} \quad \text{for all } j .
\]
In the most general case, where $h(x) \neq x$ and there is an additional bias term,
\[
O_j = h\Big( \sum_{i \in [p_{in}]} I_i * F_i^{(j)} + b_j \Big) \quad \text{for all } j .
\]
Note that in a convolutional layer there are $p_{out}$ filters, each of size $u \times v \times p_{in}$.
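For illustration (not from the slides), here is a minimal NumPy/SciPy sketch of the general convolutional layer above. It uses scipy.signal.convolve2d in 'same' mode (which centers the filter, a common convention) for the 2-D convolutions, ReLU as the activation, and invented names and shapes.

import numpy as np
from scipy.signal import convolve2d

def conv_layer(I, filters, bias, h=lambda x: np.maximum(0.0, x)):
    """Convolutional layer O_j = h(sum_i I_i * F_i^(j) + b_j), same padding.
    I: (p_in, m, n); filters: (p_out, p_in, u, v); bias: (p_out,)."""
    p_in, m, n = I.shape
    p_out = filters.shape[0]
    O = np.empty((p_out, m, n))
    for j in range(p_out):
        acc = np.zeros((m, n))
        for i in range(p_in):
            acc += convolve2d(I[i], filters[j, i], mode='same')
        O[j] = h(acc + bias[j])
    return O

# 1 feature map of size 24 x 24 -> 8 feature maps of size 24 x 24, as on the slide.
I = np.random.rand(1, 24, 24)
O = conv_layer(I, np.random.randn(8, 1, 5, 5), np.zeros(8))
print(O.shape)  # (8, 24, 24)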
Key Elements in Pooling Layer
This pooling layer maps 8 feature maps of size $24 \times 24$ to 8 feature maps of size $12 \times 12$.
In: $p$ feature maps of resolution $m_{in} \times n_{in}$: $I_i[x, y]$, $i \in [p]$, $x \in [m_{in}]$, $y \in [n_{in}]$.
Out: $p$ feature maps of resolution $m_{out} \times n_{out}$: $O_i[x, y]$, $i \in [p]$, $x \in [m_{out}]$, $y \in [n_{out}]$.
Pooling size: $u \times v$.
Pooling method: max pooling, average pooling, etc.
Pooling
For all $i \in [p]$, $O_i$ is obtained from $I_i$ by summarizing each $u \times v$ block (e.g., by its maximum or its average).
Max-pooling also introduces non-linearity into the network.
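For concreteness (not from the slides), a minimal NumPy sketch of non-overlapping max pooling; the function name max_pool and the 2 × 2 window are assumptions for this example.

import numpy as np

def max_pool(I, u=2, v=2):
    """Max pooling with a u x v window and stride (u, v).
    I: (p, m, n) feature maps; returns (p, m // u, n // v)."""
    p, m, n = I.shape
    # Reshape each u x v block onto its own axes, then take the max over them.
    blocks = I[:, :m - m % u, :n - n % v].reshape(p, m // u, u, n // v, v)
    return blocks.max(axis=(2, 4))

# 8 feature maps of size 24 x 24 -> 8 feature maps of size 12 x 12, as on the slide.
I = np.random.rand(8, 24, 24)
print(max_pool(I).shape)  # (8, 12, 12)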
Fully-connected Layer
In: $p_{in}$ variables $x \in \mathbb{R}^{p_{in}}$ (every pixel is considered a variable).
Out: $p_{out}$ variables $y \in \mathbb{R}^{p_{out}}$.
Activation function $h(\cdot)$.
The model:
\[
y = h(W x + b),
\]
where $W \in \mathbb{R}^{p_{out} \times p_{in}}$, $b \in \mathbb{R}^{p_{out}}$, and $h(\cdot)$ is applied component-wise.
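A minimal NumPy sketch (not from the slides) of the fully-connected layer $y = h(Wx + b)$; the names and sizes are chosen for illustration only.

import numpy as np

def fully_connected(x, W, b, h=lambda z: np.maximum(0.0, z)):
    """Fully-connected layer y = h(W x + b).
    x: (p_in,), W: (p_out, p_in), b: (p_out,); h is applied component-wise."""
    return h(W @ x + b)

# E.g., flatten the 4 x 4 x 16 = 256 pixels from the previous layer and map them to 10 outputs.
x = np.random.rand(256)
y = fully_connected(x, np.random.randn(10, 256), np.zeros(10))
print(y.shape)  # (10,)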
Softmax Function
For any $x \in \mathbb{R}^K$, the softmax function for every $k \in [K]$ is defined as
\[
\mathrm{softmax}(k, x) = \frac{e^{x_k}}{\sum_{i \in [K]} e^{x_i}} .
\]
Let $p_k = P(Y = k)$ be the probability that a given image belongs to the $k$-th class. Then, under the model assumption
\[
P(Y = k) = \frac{1}{Z} \, e^{w_k^T x + b_k}
\]
for all $k \in [K]$ and some explanatory variables $x$, we have
\[
p_k = \mathrm{softmax}(k, W x + b),
\]
where $W = [w_1, w_2, \dots, w_K]^T$ and $b = (b_1, b_2, \dots, b_K)^T$.
Combine a fully-connected layer with $h(x) = x$ and $p_{out} = K$ with the softmax function to obtain the class probabilities.
The cross-entropy loss (essentially the negative log-likelihood) can be used to train the model.
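A minimal NumPy sketch (not from the slides) of the softmax function and the cross-entropy loss described above; the helper names and the 10-class example are assumptions.

import numpy as np

def softmax(z):
    """softmax(k, z) = exp(z_k) / sum_i exp(z_i), computed for all k at once."""
    z = z - z.max()          # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def cross_entropy(probs, k):
    """Negative log-likelihood of the true class k under the predicted probabilities."""
    return -np.log(probs[k])

# Fully-connected layer with h(x) = x followed by softmax, as on the slide.
x = np.random.rand(256)                  # features from the previous layer
W, b = np.random.randn(10, 256), np.zeros(10)
p = softmax(W @ x + b)                   # class probabilities, sums to 1
print(p.sum(), cross_entropy(p, k=3))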
Recap of the CNN Model
Convolutional Layer: generates features by varying filters.
Pooling Layer: performs subsampling that shrinks the dimensionality.
Fully-connected layer + Softmax function: obtain class probabilities.
Note that fully-connected layers can also be used on their own to introduce non-linearity ($h(x) \neq x$) or to shrink the dimensionality ($p_{out} < p_{in}$).
A typical CNN model: Conv-Pool-Conv-Pool-FC-FC-Softmax. The
more layers there are, the ‘deeper’ the model is.
Fully-connected layers are usually placed at the end of the model, because they connect all pixels and thus break their spatial correlation.
Optimization: Stochastic Gradient Descent
When 1) the dataset is too large, or 2) the data arrive sequentially, so that future data are not yet accessible, (full-batch) gradient descent may be impractical for training a CNN model.
Stochastic Gradient Descent (SGD) can be used instead.
Split the whole dataset $(X, y)$ into $B$ mini-batches $(X^{(i)}, y^{(i)})$, $i = 1, 2, \dots, B$.
At iteration $t$, pick a mini-batch $(X_t, y_t)$ (usually by cycling through the mini-batches) and perform the update $\theta^{(t)} \leftarrow \theta^{(t-1)} - \eta \nabla_\theta f(\theta^{(t-1)} \mid X_t, y_t)$, where $\theta$ is the set of all parameters in the model and $f(\cdot)$ is the loss function.
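To make the update concrete, here is a minimal sketch (not from the slides) of the SGD iteration on a toy least-squares problem rather than a CNN; the names sgd and grad_f, the data, and the hyperparameters are invented for this example.

import numpy as np

def sgd(theta, grad_f, batches, eta=0.01, epochs=10):
    """Plain SGD: theta <- theta - eta * grad_f(theta, X_t, y_t) for each mini-batch.
    grad_f(theta, X, y) must return the gradient of the loss on that mini-batch."""
    for _ in range(epochs):
        for X_t, y_t in batches:
            theta = theta - eta * grad_f(theta, X_t, y_t)
    return theta

# Toy example: least-squares loss f(theta) = ||X theta - y||^2 / (2 * batch size).
def grad_f(theta, X, y):
    return X.T @ (X @ theta - y) / len(y)

X, y = np.random.randn(100, 5), np.random.randn(100)
batches = [(X[i:i + 10], y[i:i + 10]) for i in range(0, 100, 10)]  # B = 10 mini-batches
print(sgd(np.zeros(5), grad_f, batches))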
Challenge: vanishing gradient problem
Vanishing Gradient Problem
Let $\sigma(x) = 1/(1 + e^{-x})$ be the commonly used sigmoid function.
Consider a chain network $x_{k+1} = \sigma(w_k x_k)$, $k = 1, 2, \dots, K$, where $x_1$ is the input, and let $f(x_{K+1})$ be the loss function.
By the chain rule,
\[
\frac{\partial f}{\partial w_1}
= \frac{\partial f}{\partial x_{K+1}} \frac{\partial x_{K+1}}{\partial x_K} \cdots \frac{\partial x_3}{\partial x_2} \frac{\partial x_2}{\partial w_1}
= f'(x_{K+1}) \Big[ \prod_{k=2}^{K} \sigma'(w_k x_k) \, w_k \Big] \sigma'(w_1 x_1) \, x_1 .
\]
Since $\sigma'(x) \le 1/4$,
\[
\Big| \frac{\partial f}{\partial w_1} \Big| \le (1/4)^K \Big| f'(x_{K+1}) \prod_{k=2}^{K} w_k \Big| \, |x_1| .
\]
Intuitively, when the $w_k$'s are not too large, the gradient vanishes as $K$ increases.
In practice, the first few layers are hard to train when the network is deep.
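A small numerical sketch (not from the slides) of this effect: it evaluates $\partial f / \partial w_1$ for the chain above with $f(x) = x$ (so $f'(x_{K+1}) = 1$) and all weights set to 1, and shows the gradient shrinking as $K$ grows.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def grad_w1(w, x1):
    """d f / d w_1 for the chain x_{k+1} = sigmoid(w_k x_k), with f(x) = x,
    computed term by term from the chain rule on the slide."""
    x = [x1]
    for wk in w:                                   # forward pass
        x.append(sigmoid(wk * x[-1]))
    sp = lambda z: sigmoid(z) * (1 - sigmoid(z))   # sigma'(z)
    g = sp(w[0] * x[0]) * x[0]                     # sigma'(w_1 x_1) x_1
    for k in range(1, len(w)):                     # product over k = 2, ..., K
        g *= sp(w[k] * x[k]) * w[k]
    return g

for K in (2, 5, 10, 20):
    w = np.full(K, 1.0)                # all weights equal to 1
    print(K, grad_w1(w, x1=1.0))       # the gradient shrinks rapidly as K grows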
To Deal With This Issue
There are several ways to deal with this issue in deep learning.
1. Use other activation functions, such as ReLU: $h(x) = \max\{0, x\}$. Since $h'(x) = 1$ whenever $x > 0$, it helps pass the gradient down the layers while keeping non-linearity.
2. Use other ways to connect layers, such as shortcut connections.
3. Modify the way parameters are updated, for example by using momentum in the updates (see the sketch below).
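As referenced in item 3, a minimal sketch (not from the slides) of SGD with classical momentum; the function name and hyperparameters are illustrative only.

import numpy as np

def sgd_momentum(theta, grad_f, batches, eta=0.01, mu=0.9, epochs=10):
    """SGD with classical momentum: v <- mu * v - eta * grad;  theta <- theta + v.
    The velocity v accumulates past gradients, which can keep the update moving
    even when individual gradients are small."""
    v = np.zeros_like(theta)
    for _ in range(epochs):
        for X_t, y_t in batches:
            v = mu * v - eta * grad_f(theta, X_t, y_t)
            theta = theta + v
    return theta

It can be dropped in place of the plain sgd function in the earlier toy least-squares sketch.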
Deep Learning
Why deep models?
Theoretically, the intuition is that the deeper the model, the more complex the feature space it can represent.
Computationally, deep learning models such as CNNs lend themselves to massive parallelization.
They are, however, hard to interpret.
Some frontier applications of deep learning
Packages: Theano/TensorFlow
These two packages are quite similar; in fact, some developers of Theano went on to develop TensorFlow at Google.
Low-level, highly customizable.
Automatic computation of derivatives.
Same code can be run on either CPU or GPU.
Fast (CUDA) C implementations.
High-level libraries, such as pylearn2, are available.
Actively developed and maintained. (In the past year, Theano added support for multiple GPUs, implemented average pooling, and made faster implementations available, as far as I know.)
Hard to debug.
(My code is available online)
Graph
$c = a^2 + 2b - 2a$
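A minimal sketch (assuming Theano is installed; not necessarily the exact demo code from the talk) of how this computational graph can be built, compiled, and differentiated:

import theano
import theano.tensor as T

a = T.dscalar('a')              # symbolic scalar inputs
b = T.dscalar('b')
c = a ** 2 + 2 * b - 2 * a      # symbolic expression: the computational graph

f = theano.function([a, b], c)  # compile the graph into a callable
print(f(3, 1))                  # 3**2 + 2*1 - 2*3 = 5.0

# Derivatives are computed automatically from the graph.
dc_da = theano.function([a], T.grad(c, a))
print(dc_da(3))                 # 2*3 - 2 = 4.0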