Convolutional Neural Networks at Constrained Time Cost (CVPR 2015)
Authors: Kaiming He, Jian Sun (MSR)
Presenter: Hyunjun Ju

Motivation
• Most recent advanced CNNs are time-consuming: they need a high-end GPU or multiple GPUs and one to several weeks to train, which can be too demanding for the rapidly changing industry.
• This paper investigates the accuracy of CNN architectures at constrained time cost.
• Factors: depth, width (the number of filters), filter size, stride.
• Goal: find an efficient and relatively accurate CNN model.

Time Complexity of Convolutions
O( Σ_{l=1}^{d} n_{l-1} · s_l² · n_l · m_l² )
where
• l : the index of a convolutional layer
• d : the depth (number of conv layers)
• n_{l-1} : the number of input channels of the l-th layer
• n_l : the number of filters (width) of the l-th layer
• s_l : the spatial size of the filter
• m_l : the spatial size of the output feature map
The time cost of fc layers and pooling layers is not included in this formulation; these layers often take 5-10% of the computation time.

Baseline Model
• Layer replacement method: progressively modify the model and investigate the accuracy through a series of controlled experiments.
• Eight-layer model: five convolutional layers and three fully connected layers.
• Input: 224 × 224 color image with the mean subtracted.
• All conv/fc layers use Rectified Linear Units (ReLU).
• No local normalization is applied.
• Architecture:
  Conv1: 64 filters, 7 × 7, stride 2
  Max pooling: 3 × 3, stride 2
  Conv2: 128 filters, 5 × 5, stride 1
  Max pooling: 2 × 2, stride 2
  Conv3: 256 filters, 3 × 3, stride 1
  Conv4: 256 filters, 3 × 3, stride 1
  Conv5: 256 filters, 3 × 3, stride 1
  SPP (spatial pyramid pooling)
  Full6: 4096
  Full7: 4096
  Full8: softmax, 1000

3-stage Design
• A stage is the set of layers between two nearby pooling layers.
• Stage 1: convolution, pooling
• Stage 2: convolution, pooling
• Stage 3: conv, conv, conv, pooling

Model Designs by Layer Replacement
• Model A is the baseline; the other models are variations of model A.
• In each model, a few layers are replaced with other layers that preserve the time cost.
• Replacing layers introduces trade-offs between the factors.

Trade-offs between Depth and Filter Sizes
• Replace a larger filter with a cascade of smaller filters.
• The feature map size is temporarily omitted from the cost (it stays the same), so the per-layer complexity is n_{l-1} · s_l² · n_l.
• A ⇒ B: 256 · 3² · 256 ⇒ 256 · 2² · 256 + 256 · 2² · 256
• A ⇒ C: 64 · 5² · 128 ⇒ 64 · 3² · 128 + 128 · 3² · 128
• B ⇒ E: 64 · 5² · 128 ⇒ 64 · 2² · 128 + (128 · 2² · 128) × 3
• Depth is more important than filter size: when the time complexity is roughly the same, deeper networks with smaller filters give better results than shallower networks with larger filters.

Trade-offs between Depth and Width
• Increase the depth while properly reducing the number of filters (width) per layer.
• A ⇒ F: 128 · 3² · 256 + (256 · 3² · 256) × 2 = 128 · 3² · 160 + (160 · 3² · 160) × 4 + 160 · 3² · 256
• A ⇒ G: 128 · 3² · 256 + (256 · 3² · 256) × 2 = (128 · 3² · 128) × 8 + 128 · 3² · 256
• Increasing the depth leads to considerable gains, even though the width must be properly reduced. However, G is only marginally better than F.

Trade-offs between Width and Filter Size
• B ⇒ F: 128 · 2² · 256 + (256 · 2² · 256) × 5 ⇒ 128 · 3² · 160 + (160 · 3² · 160) × 4 + 160 · 3² · 256
• E ⇒ I: (64 · 3² · 64) × 3 + 64 · 3² · 128 ⇒ 64 · 2² · 96 + 96 · 2² · 128 + (128 · 2² · 128) × 2
• Unlike depth, which has a high priority, width and filter size do not show a clear priority over each other.
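To make the bookkeeping behind these replacements concrete, here is a minimal Python sketch (the helper name conv_cost is mine, not from the paper) that evaluates the per-layer cost n_{l-1} · s² · n_l · m² and checks, for example, that the A ⇒ B and A ⇒ F replacements keep the complexity roughly constant:

```python
# Minimal sketch: per-layer time complexity n_{l-1} * s^2 * n_l * m^2,
# used to check that the layer replacements above roughly preserve the cost.
# (The helper name is illustrative, not from the paper.)

def conv_cost(n_in, s, n_out, m=1):
    """Cost of one conv layer: input channels * filter area * filters * output map area."""
    return n_in * s**2 * n_out * m**2

# A => B: one 3x3 layer (256 channels) replaced by a cascade of two 2x2 layers.
cost_a = conv_cost(256, 3, 256)
cost_b = 2 * conv_cost(256, 2, 256)
print(cost_a, cost_b, cost_b / cost_a)   # 589824 524288 ~0.89

# A => F: depth increased from 3 to 6 layers while the width drops from 256 to 160.
cost_a = conv_cost(128, 3, 256) + 2 * conv_cost(256, 3, 256)
cost_f = conv_cost(128, 3, 160) + 4 * conv_cost(160, 3, 160) + conv_cost(160, 3, 256)
print(cost_a, cost_f)                    # both 1474560: the cost is preserved exactly
```

A ⇒ F comes out exactly equal, while A ⇒ B is about 11% cheaper, which matches the "roughly the same complexity" framing above.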
But, Is Deeper Always Better?
• In experiments, the accuracy is stagnant or even reduced in some of the very deep attempts.
• Two possible explanations:
  1. The width/filter sizes are reduced too aggressively and may harm the accuracy.
  2. Overly increasing the depth degrades the accuracy even if the other factors are not traded off.
• To understand the main reason, they stop constraining the time complexity and simply add conv layers.
• The errors not only saturate at some point, but get worse when going deeper.
• The degradation is not due to over-fitting, since the training errors also get worse.
• So overly increasing the depth can harm the accuracy, even if the width/filter sizes are unchanged.

Adding a Pooling Layer (Feature Map Size and Width)
• In the previous models, the feature map size m_l of each stage is unchanged (or nearly unchanged) thanks to padding: 2 pixels for the 5 × 5 filters and 1 pixel for the 3 × 3 filters.
• The feature map size is mainly determined by the strides of all previous layers.
• E ⇒ J: 256 · 2² · 256 + 256 · 2² · 256 = (256 · 2² · 2304 + 2304 · 2² · 256) / 3²
• Model J (smaller feature maps, larger width) gives better error rates than model E.

Delayed Subsampling of Pooling Layers
• A max pooling layer has two different roles:
  1. Lateral suppression that increases the invariance to small local translations.
  2. Reducing the spatial size of the feature maps by subsampling.
• Usually a max pooling layer plays the two roles simultaneously, with stride > 1.
• The two roles can be separated into two layers: the pooling layer keeps stride = 1, and the following convolutional layer takes stride > 1 (the original stride of the pooling layer).
• This operation does not change the complexity of any convolutional layer. (A short sketch of this replacement appears at the end of these notes.)
• The delayed models have lower top-5 error rates than the original models.

Comparisons (Fast Models)
• "number ×" denotes values relative to model J'.
• The difference between the theoretical complexity and the measured seconds per mini-batch is mainly due to the overhead of the fc and pooling layers.
• Model J' has the best accuracy and low complexity.

Comparisons (Accurate Models)
• VGG-16 and GoogLeNet are trained with more data augmentation than the other models.
• Model J' has low complexity and relatively good accuracy.

Conclusion
• Constrained time cost is a practical issue for industrial and commercial requirements.
• The authors propose models that are fast enough for practical applications, yet more accurate than existing fast models.
• They are neither the most accurate models nor the fastest models.
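As referenced in the delayed-subsampling part above, here is a minimal PyTorch sketch of that replacement. The layer sizes (128 → 256 channels, 3 × 3 filters, 56 × 56 input) are illustrative assumptions, not the authors' exact configuration; the point is that moving the stride from the pooling layer to the following conv layer leaves the conv layer's output size, and therefore its complexity, unchanged.

```python
# Minimal PyTorch sketch of delayed subsampling (illustrative, not the authors' code):
# the pooling layer keeps stride 1 and the following conv layer takes over the stride of 2,
# so the conv layer's output map size (and hence its complexity) is unchanged.
import torch
import torch.nn as nn

# Original: the pooling layer subsamples (stride 2), then the conv runs with stride 1.
original = nn.Sequential(
    nn.MaxPool2d(kernel_size=2, stride=2),
    nn.Conv2d(128, 256, kernel_size=3, stride=1, padding=1),
)

# Delayed: the pooling layer keeps stride 1, the conv layer does the subsampling (stride 2).
delayed = nn.Sequential(
    nn.MaxPool2d(kernel_size=2, stride=1),
    nn.Conv2d(128, 256, kernel_size=3, stride=2, padding=1),
)

x = torch.randn(1, 128, 56, 56)
print(original(x).shape, delayed(x).shape)  # both [1, 256, 28, 28] for this input
```

Only the pooling layer becomes more expensive in the delayed variant, since it now operates on the full-resolution map; the cost n_{l-1} · s² · n_l · m² of the conv layer is the same in both cases.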