Rethinking Architectures of DCNNs and Object Detection in Scene Recognition

Current work and related consideration
• Task: object detection
• Algorithm: You Only Look Once (YOLO)
• Architecture: GoogLeNet-based
• Parameters: >= 97M (relatively small)
• Techniques:
  - Inception V3 (construction series)
  - Efficient grid-size reduction (channels in parallel)
  - Feature fusion by multi-resolution feature maps
• Problems: relatively high training loss and non-ideal mAP

Thinking:
• The structure of the model strongly impacts detection accuracy. (Is the loss reasonable? Is the gap between training loss and test loss small? Is the model too large? An oversized model overfits easily and gets out of control.) Keep searching for the balance between accuracy and model size. Is there a standard for construction?
• The relationship between objects and scenes: does scene classification benefit object detection? Merge scene information into object detection.

Wide-Residual-Inception Networks for Real-time Object Detection — Youngwan Lee [2017], Computer Vision Laboratory, Inha University
• Goal: scale down the size of the model further
• Feature extractor: ResNet + Inception blocks
• Detector: SSD with the WR-Inception network as the backbone

Results on KITTI (AP / AR in %):

                   Car            Pedestrian     Cyclist
Model              AP     AR      AP     AR      AP     AR      mAP     mAR
VGG-16             74     75      50     56      52     71      58      69
ResNet-101         76.04  74.82   47.74  56.07   53.61  75.26   58.9    70.06
WR-Inception       77.2   76.18   52.51  63.01   54.63  76.17   61.18   73.51
WR-Inception-12    78.24  80.24   51.08  64.29   59.28  75.26   63.03   75.14

Dataset: KITTI:
obtained with stereo cameras and LiDAR scanners in urban, rural, and highway driving environments; it labels nine categories in total: small cars, vans, trucks, pedestrians, sitting people, cyclists, trams, miscellaneous, and "don't care".

Evaluation metrics:
• Precision = TP / (TP + FP); mAP = precision averaged over classes
• Recall = TP / (TP + FN); mAR = recall averaged over classes
(TP: true positive; FP: false positive; FN: false negative)

Contribution:
• Proposes a model that requires less memory and fewer computations but shows better performance
• Ensures the real-time performance of the object detector

Query: KITTI is still a relatively small dataset, and its categories are limited. The performance of this model should also be tested on larger, more general datasets such as ImageNet.

A proper model for a specific dataset (Cambridge)
The somewhat unanswered question in deep learning: is the selected CNN optimal for the dataset in terms of accuracy and model size? Some standard is needed, but based on what?
Goal: given a CNN pre-trained for a specific dataset, refine the architecture so as to potentially increase accuracy while possibly reducing model size.
Standard: the feature-extraction ability of a CNN for a specific dataset.
Intuition: separation enhancement — separate the classes of the dataset as well as possible, assuming a constant depth of the network.
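That separation intuition can be made concrete with a toy sketch: average a layer's feature vectors per class and correlate the class means; low correlation between two class means suggests the layer already separates those classes well. A minimal illustration in Python — the class names and activation vectors below are made up, whereas the paper computes such matrices from VGG-11 activations:

```python
# Class-separation sketch: correlation (here, cosine similarity) between
# per-class mean feature vectors of one convolutional layer.
# Low off-diagonal values = the layer separates those two classes well.
# The vectors are made-up toy values, not real activations.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# hypothetical mean activation vectors for M = 3 classes at one layer
class_means = {
    "street":  [0.9, 0.1, 0.3, 0.0],
    "highway": [0.8, 0.2, 0.4, 0.1],  # similar to "street"
    "forest":  [0.1, 0.9, 0.0, 0.7],  # distinct from both
}

names = list(class_means)
C = [[cosine(class_means[a], class_means[b]) for b in names] for a in names]

for name, row in zip(names, C):
    print(name, ["%.2f" % x for x in row])
```

Here "street" and "highway" come out highly correlated (poorly separated) while "forest" is well separated from both, which is the kind of pattern the paper's correlation matrices visualize layer by layer.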
Refining Architectures of Deep Convolutional Neural Networks — Machine Intelligence Lab, University of Cambridge, UK, and Microsoft Research Cambridge, UK [CVPR 2016]

Separation enhancement and deterioration capacity of a layer

[Figure: class-correlation matrices C_l (M x M, l = 1, ..., L; m = 1, ..., M) for the 8 convolutional layers of VGG-11 trained on SAD and CAMIT-NSAD; dark blue = minimum correlation between classes, bright yellow = maximum.]

• The correlation matrices give an indication of the separation between classes at a given convolutional layer.
• Top row (SAD): the lower layers separate the classes better than the deeper layers.
• Bottom row (CAMIT-NSAD): the classes are separated less in the lower layers and more prominently in the deeper layers.

Comparing C_{l-1} and C_l: for which class pairs did the separation increase, and for which did it deteriorate?
• n_l^+ : number of class pairs whose separation increased from layer l-1 to layer l
• n_l^- : number of class pairs whose separation decreased from layer l-1 to layer l
• n_T : total number of class pairs
The separation behavior varies across layers and differs between datasets:
• case (a): n_l^- >= n_l^+ (the layer deteriorates separation) → compute a split factor r_s^l for the layer
• case (b): n_l^+ > n_l^- (the layer enhances separation) → compute a stretch factor r_e^l
Both factors are driven by n_l^+, n_l^-, and n_T, together with an averaged enhancement term psi(l) ≈ (sum_{i=l+1}^{L} n_i^+ / n_T) / (L - l) contributed by the layers above l.

Experimental setup: train 22084 / validation 3056 / test 5618 images.
Legend:
• DR = Deep Refined Architecture (proposed approach)
• DR-1 = Deep Refined Architecture with only the stretch network
• DR-2 = Deep Refined Architecture with only the symmetric split
• Sp-1 = L1-sparsified network
• Sp-2 = L2-sparsified network

Contribution:
• Provides a quantified standard for refining a network architecture
• Realizes a balance between precision and model size

Query: SAD and CAMIT-NSAD are relatively small datasets, and they contain only scene data.
What about large object datasets such as ImageNet? The generalization problem is avoided in this paper: when we do not know the source of the test data, how should we transfer the model (transfer learning) and refine a better one?

MIT: Object Detectors Emerge in Deep Scene CNNs — published as a conference paper at ICLR 2015
• The same network can do both object localization and scene recognition in a single forward pass.
• The deep features from Places-CNN tend to perform better on scene-related recognition tasks than the features from ImageNet-CNN.

Key Laboratory of Intelligent Information Processing, Chinese Academy of Sciences (CAS) [CVPR 2016]: scene recognition and classification — mix scene data and object data together in the training process.

What needs to be taken into consideration?
• Scale down the size of our model and make it easier to control while improving its feature-extraction ability.
• How can we train with both an object dataset and a scene dataset using one single feature extractor, and how can we merge the abstracted scene information with the object features?

Thank you!
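Backup: the mAP / mAR definitions from the KITTI slide, as runnable code. This is only the simplified per-class formula stated on the slide, with made-up counts; real KITTI evaluation ranks detections by confidence and integrates a precision-recall curve.

```python
# Simplified mAP / mAR from per-class TP, FP, FN counts.
# (Real detection benchmarks integrate a precision-recall curve;
# this is only the slide's per-class definition with made-up counts.)

def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

# hypothetical counts per class: (TP, FP, FN)
counts = {
    "car":        (740, 120, 260),
    "pedestrian": (500, 180, 500),
    "cyclist":    (520, 140, 480),
}

mAP = sum(precision(tp, fp) for tp, fp, _ in counts.values()) / len(counts)
mAR = sum(recall(tp, fn) for tp, _, fn in counts.values()) / len(counts)

print(f"mAP = {mAP:.3f}, mAR = {mAR:.3f}")  # prints: mAP = 0.795, mAR = 0.587
```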
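Backup: a toy sketch of the per-layer separation bookkeeping from the Cambridge paper. Given class-correlation matrices for consecutive layers, count the class pairs whose separation improved (n_l^+, correlation dropped) or deteriorated (n_l^-, correlation rose), then flag each layer for stretching or splitting. The matrices below are made-up values, and the flag is only the qualitative rule; the paper's actual split/stretch factors are quantitative.

```python
# Toy sketch: count class pairs whose separation improves/deteriorates
# between consecutive layers, then flag layers for stretch vs. split.
# Lower correlation between two classes = better separation.
# The correlation matrices below are made-up toy values.

def pair_counts(prev, curr):
    """n_plus: pairs whose correlation dropped (separation improved);
    n_minus: pairs whose correlation rose (separation deteriorated)."""
    m = len(curr)
    n_plus = n_minus = 0
    for i in range(m):
        for j in range(i + 1, m):
            if curr[i][j] < prev[i][j]:
                n_plus += 1
            elif curr[i][j] > prev[i][j]:
                n_minus += 1
    return n_plus, n_minus

# Correlation matrices C_l (M x M, symmetric) for 3 toy layers, M = 3 classes.
layers = [
    [[1.0, 0.9, 0.8], [0.9, 1.0, 0.7], [0.8, 0.7, 1.0]],
    [[1.0, 0.6, 0.9], [0.6, 1.0, 0.5], [0.9, 0.5, 1.0]],   # 2 pairs improve, 1 worsens
    [[1.0, 0.7, 0.95], [0.7, 1.0, 0.6], [0.95, 0.6, 1.0]], # all 3 pairs worsen
]

for l in range(1, len(layers)):
    n_plus, n_minus = pair_counts(layers[l - 1], layers[l])
    action = "stretch" if n_plus > n_minus else "split"
    print(f"layer {l}: n+={n_plus}, n-={n_minus} -> {action}")
```

With these toy matrices, layer 1 mostly enhances separation (flagged for stretching) while layer 2 deteriorates it for every pair (flagged for splitting).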