
Convolutional Neural Networks at Constrained Time Cost (CVPR 2015)
Authors: Kaiming He, Jian Sun (MSR)
Presenter: Hyunjun Ju
Motivation
• Most of the recent advanced CNNs are time-consuming: they require a high-end GPU or multiple GPUs and take one to several weeks to train, which can be too demanding for the rapidly changing industry.
• This paper investigates the accuracy of CNN architectures at constrained time cost.
• Factors: depth, width (the number of filters), filter size, stride.

Goal
Find an efficient and relatively accurate CNN model.
Time Complexity of Convolutions
$$O\left(\sum_{l=1}^{d} n_{l-1} \cdot s_l^2 \cdot n_l \cdot m_l^2\right)$$

where
• $l$: the index of a convolutional layer
• $d$: the depth (number of conv layers)
• $n_{l-1}$: the number of input channels of the $l$-th layer
• $n_l$: the number of filters (width) of the $l$-th layer
• $s_l$: the spatial size of the filter
• $m_l$: the spatial size of the output feature map
The time cost of the fc layers and pooling layers is not included in the above formulation. These layers often take only 5-10% of the computational time.
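As a quick illustration (my own sketch, not the paper's code), the formula can be tallied directly; the tuple encoding `(n_in, filter_size, n_out, output_map_size)` and the function name `conv_time_complexity` are just conventions chosen here:

```python
def conv_time_complexity(layers):
    """Sum of n_{l-1} * s_l^2 * n_l * m_l^2 over all conv layers.

    Each layer is given as (n_in, filter_size, n_out, output_map_size).
    """
    return sum(n_in * s * s * n_out * m * m for n_in, s, n_out, m in layers)
```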
Baseline Model
• Layer replacement method
• Progressively modify the model
• Investigate the accuracy through a series of controlled experiments.
• Eight-layer model
• Five convolutional layers
• Three fully connected layers
• Input: 224 × 224 color image with the mean subtracted
• All the conv/fc layers are followed by Rectified Linear Units (ReLU).
• They do not apply local normalization.
Conv1: 64 filters, 7 × 7, stride 2
Max pooling: 3 × 3, stride 2
Conv2: 128 filters, 5 × 5, stride 1
Max pooling: 2 × 2, stride 2
Conv3: 256 filters, 3 × 3, stride 1
Conv4: 256 filters, 3 × 3, stride 1
Conv5: 256 filters, 3 × 3, stride 1
SPP (spatial pyramid pooling)
Full6: 4096
Full7: 4096
Full8: softmax, 1000
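For illustration only, the conv cost of this baseline can be tallied with the formula above. The output map sizes used below (112, 56, 28, 28, 28) are my rough estimates implied by the strides; the exact values depend on padding and rounding and are not stated on the slide:

```python
# Hypothetical tally of the baseline's conv-layer cost (n_in * s^2 * n_out * m^2 per layer).
baseline = [
    (3,   7,  64, 112),  # Conv1: 64 filters, 7x7, stride 2 (input 224x224 RGB)
    (64,  5, 128,  56),  # Conv2: 128 filters, 5x5, after 3x3/2 max pooling
    (128, 3, 256,  28),  # Conv3: 256 filters, 3x3, after 2x2/2 max pooling
    (256, 3, 256,  28),  # Conv4: 256 filters, 3x3
    (256, 3, 256,  28),  # Conv5: 256 filters, 3x3
]
total = sum(n_in * s * s * n_out * m * m for n_in, s, n_out, m in baseline)
print(f"{total:,} multiply-accumulates in the conv layers")
```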
3-stage Design
[Diagram: Stage 1 = convolution → pooling; Stage 2 = convolution → pooling; Stage 3 = conv → conv → conv → pooling]
Stage: the layers between two nearby pooling layers.
Model Designs by Layer Replacement
• Model A is the baseline.
• The others are variations of model A.
• In each model, a few layers are replaced with other layers that preserve the time cost.
• Replacing layers gives rise to trade-offs.
Trade-offs between Depth and Filter Sizes
• Replace a larger filter with a cascade of smaller filters.
• The feature map size term is omitted here (it is the same on both sides).
• New time complexity of one conv layer: $n_{l-1} \cdot s_l^2 \cdot n_l$
• A ⇒ B: $256 \cdot 3^2 \cdot 256 \;\Rightarrow\; 256 \cdot 2^2 \cdot 256 + 256 \cdot 2^2 \cdot 256$
• A ⇒ C: $64 \cdot 5^2 \cdot 128 \;\Rightarrow\; 64 \cdot 3^2 \cdot 128 + 128 \cdot 3^2 \cdot 128$
• B ⇒ E: $64 \cdot 5^2 \cdot 128 \;\Rightarrow\; 64 \cdot 2^2 \cdot 128 + (128 \cdot 2^2 \cdot 128) \times 3$
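A minimal check (my own sketch, with the layer triples read off the equations above) that each replacement keeps the per-layer cost roughly unchanged:

```python
def cost(chain):
    """Per-position cost of a chain of conv layers, each given as (n_in, s, n_out)."""
    return sum(n_in * s * s * n_out for n_in, s, n_out in chain)

# A => B: one 3x3 layer (256 -> 256) becomes two 2x2 layers.
print(cost([(256, 3, 256)]), "vs", cost([(256, 2, 256)] * 2))                   # 589824 vs 524288

# A => C: the 5x5 layer (64 -> 128) becomes two 3x3 layers.
print(cost([(64, 5, 128)]), "vs", cost([(64, 3, 128), (128, 3, 128)]))          # 204800 vs 221184

# B => E: the 5x5 layer (64 -> 128) becomes four 2x2 layers.
print(cost([(64, 5, 128)]), "vs", cost([(64, 2, 128)] + [(128, 2, 128)] * 3))   # 204800 vs 229376
```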
Trade-offs between Depth and Filter Sizes
• Depth is more important than filter size.

When the time complexity is roughly the same, deeper networks with smaller filters show better results than shallower networks with larger filters.
Trade-offs between Depth and Width
• Increase the depth while properly reducing the number of filters (width) per layer.
• A ⇒ F: $128 \cdot 3^2 \cdot 256 + (256 \cdot 3^2 \cdot 256) \times 2 = 128 \cdot 3^2 \cdot 160 + (160 \cdot 3^2 \cdot 160) \times 4 + 160 \cdot 3^2 \cdot 256$
• A ⇒ G: $128 \cdot 3^2 \cdot 256 + (256 \cdot 3^2 \cdot 256) \times 2 = (128 \cdot 3^2 \cdot 128) \times 8 + 128 \cdot 3^2 \cdot 256$
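The same kind of check (again my own sketch, not the authors' code) shows that these two replacements preserve the per-position cost exactly:

```python
def cost(chain):
    """Per-position cost of a chain of conv layers, each given as (n_in, s, n_out)."""
    return sum(n_in * s * s * n_out for n_in, s, n_out in chain)

A = [(128, 3, 256)] + [(256, 3, 256)] * 2                    # three 3x3 layers of model A
F = [(128, 3, 160)] + [(160, 3, 160)] * 4 + [(160, 3, 256)]  # deeper but narrower (width 160)
G = [(128, 3, 128)] * 8 + [(128, 3, 256)]                    # even deeper, width 128
print(cost(A), cost(F), cost(G))                             # all three equal 1474560
```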
Trade-offs between Depth and Width
Increasing the depth leads to considerable gains, even though the width needs to be properly reduced. However, G is only marginally better than F.
Trade-offs between Width and Filter Size
• B ⇒ F: $128 \cdot 2^2 \cdot 256 + (256 \cdot 2^2 \cdot 256) \times 5 \;\Rightarrow\; 128 \cdot 3^2 \cdot 160 + (160 \cdot 3^2 \cdot 160) \times 4 + 160 \cdot 3^2 \cdot 256$
• E ⇒ I: $(64 \cdot 3^2 \cdot 64) \times 3 + 64 \cdot 3^2 \cdot 128 \;\Rightarrow\; 64 \cdot 2^2 \cdot 96 + 96 \cdot 2^2 \cdot 128 + (128 \cdot 2^2 \cdot 128) \times 2$
Trade-offs between Width and Filter Size
Unlike depth, which has a high priority, width and filter size do not show an apparent priority over each other.
But, Is Deeper Always Better?
• In experiments, they find that the accuracy is stagnant or even reduced in some of their very deep attempts.
• Two possible explanations:
1. The width/filter sizes are overly reduced, which may harm the accuracy.
2. Overly increasing the depth degrades the accuracy even if the other factors are not traded.
But, Is Deeper Always Better?
• To understand the main reason, they do not constrain the time complexity (they just add conv layers).
• The errors not only saturate at some point but also get worse when going deeper.
• The degradation is not due to over-fitting (the training errors are also worse).

Overly increasing the depth can harm the accuracy, even if the width/filter sizes are unchanged.
Adding a Pooling Layer (Feature Map Size and Width)
• Previously, the feature map size $m_l$ within each stage was unchanged (or nearly unchanged), thanks to padding: 2 pixels for the 5 × 5 filters, 1 pixel for the 3 × 3 filters.
• The feature map size is mainly determined by the strides of all previous layers.
• E ⇒ J: $256 \cdot 2^2 \cdot 256 + 256 \cdot 2^2 \cdot 256 = (256 \cdot 2^2 \cdot 2304 + 2304 \cdot 2^2 \cdot 256)/3^2$
(the $1/3^2$ factor reflects model J's smaller feature map area)
Model J (smaller feature map size, larger width) achieves error rates that are better than those of model E.
Delayed Subsampling of Pooling Layers
• A max pooling layer has two different roles:
1. Lateral suppression that increases the invariance to small local translations.
2. Reducing the spatial size of feature maps by subsampling.
• Usually, a max pooling layer plays the two roles simultaneously with stride > 1.
• We can separate these two roles using two different layers:
• the pooling layer, by setting its stride to 1;
• the following convolutional layer, by setting its stride to > 1 (the original stride of the pooling layer).
• This operation does not change the complexity of any convolutional layer.
The delayed model has lower top-5 error rates than the original model.
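A minimal PyTorch-style sketch of this separation (my own illustration, not the authors' code; the 128 → 256 channels mirror the 2 × 2/2 pooling between conv2 and conv3 in the baseline):

```python
import torch.nn as nn

# Original: the 2x2 max pooling both suppresses locally and subsamples (stride 2),
# and the following conv runs at stride 1 on the already-subsampled map.
original = nn.Sequential(
    nn.MaxPool2d(kernel_size=2, stride=2),
    nn.Conv2d(128, 256, kernel_size=3, stride=1, padding=1),
)

# Delayed subsampling: the pooling keeps stride 1 (local suppression only) and the
# conv takes over the stride-2 subsampling, so its output size, and hence its
# complexity, is essentially unchanged.
delayed = nn.Sequential(
    nn.MaxPool2d(kernel_size=2, stride=1),
    nn.Conv2d(128, 256, kernel_size=3, stride=2, padding=1),
)
```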
Comparisons (Fast models)
• "number ×" means relative numbers compared with model J’.
• The difference between complexity and seconds/mini-batch is mainly due to the overhead of the fc and pooling layers.

Model J’ has the best performance and low complexity.
Comparisons (Accurate models)
• VGG-16 and GoogLeNet are trained with more data augmentation than the others.

Model J’ has low complexity and relatively good performance.
Conclusion
• Constrained time cost is a practical issue for industrial and commercial requirements.
• They proposed models that are fast for practical applications yet more accurate than existing fast models.
• These are neither the best-performing nor the fastest models.