Rethinking architectures of DCNN and object detection in scene recognition
Current work and related considerations
• Task: object detection
• Algorithm: You Only Look Once (YOLO)
• Architecture: GoogLeNet-based
• Parameters: >= 97M (relatively small)
• Techniques: Inception V3 construction in series; efficient grid-size reduction (channels in parallel); feature fusion via multi-resolution feature maps (see the sketch after this list)
• Problems: relatively high training loss and non-ideal mAP
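As a concrete illustration of the efficient grid-size reduction technique listed above, here is a minimal PyTorch sketch (channel counts and names are illustrative assumptions, not the actual model's configuration): a stride-2 convolution branch and a pooling branch run in parallel, and their outputs are concatenated along the channel axis, so resolution halves without squeezing the representation through a single pooling bottleneck.

```python
import torch
import torch.nn as nn

class GridReduction(nn.Module):
    """Inception-V3-style grid-size reduction: parallel stride-2 branches
    concatenated on the channel axis (channel sizes are illustrative)."""
    def __init__(self, in_ch: int):
        super().__init__()
        # Convolution branch: learnable downsampling.
        self.conv_branch = nn.Sequential(
            nn.Conv2d(in_ch, in_ch, kernel_size=3, stride=2, padding=1),
            nn.ReLU(inplace=True),
        )
        # Pooling branch: parameter-free downsampling.
        self.pool_branch = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)

    def forward(self, x):
        # Both branches halve the spatial size; concatenation doubles channels.
        return torch.cat([self.conv_branch(x), self.pool_branch(x)], dim=1)

x = torch.randn(1, 64, 56, 56)
print(GridReduction(64)(x).shape)  # torch.Size([1, 128, 28, 28])
```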
Thinking:
• The structure of the model strongly impacts detection accuracy (Is the loss reasonable? Is the gap between training loss and test loss small? Is the model too large, and therefore easy to overfit and hard to control?). Keep searching for the balance between accuracy and model size. Is there a standard for construction?
• The relationship between objects and scenes: does scene classification benefit object detection? Consider merging scene information into object detection.
Wide-Residual-Inception Networks for Real-time Object Detection
Youngwan Lee [2017], Computer Vision Laboratory, Inha University

Goal: scale down the size of the model further.
Feature extractor: combines ResNet (residual shortcuts) with Inception (multi-branch convolutions).
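The figure on this slide pairs a ResNet shortcut with Inception-style branches. As a rough, hedged PyTorch sketch of that combination (branch widths and the two-branch layout are illustrative assumptions, not the paper's exact WR-Inception block):

```python
import torch
import torch.nn as nn

class WideResidualInceptionBlock(nn.Module):
    """Illustrative sketch only: a residual block whose body is an
    Inception-style multi-branch convolution. Widths are assumptions,
    not the paper's exact configuration; `channels` must be even."""
    def __init__(self, channels: int):
        super().__init__()
        half = channels // 2
        # Two parallel branches with different receptive fields.
        self.branch3x3 = nn.Sequential(
            nn.Conv2d(channels, half, 3, padding=1),
            nn.BatchNorm2d(half), nn.ReLU(inplace=True),
        )
        self.branch5x5 = nn.Sequential(
            nn.Conv2d(channels, half, 3, padding=1),
            nn.BatchNorm2d(half), nn.ReLU(inplace=True),
            nn.Conv2d(half, half, 3, padding=1),  # two 3x3 approximate one 5x5
            nn.BatchNorm2d(half), nn.ReLU(inplace=True),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        # Concatenate the branches, then add the ResNet-style shortcut.
        out = torch.cat([self.branch3x3(x), self.branch5x5(x)], dim=1)
        return self.relu(out + x)
```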
Detector: SSD with the WR-Inception network as the backbone.
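SSD attaches small convolutional prediction heads to backbone feature maps at several resolutions. A hedged sketch of that multi-scale head (class count, anchors per cell, and channel widths are placeholders, not the paper's settings):

```python
import torch
import torch.nn as nn

NUM_CLASSES, BOXES_PER_CELL = 9, 4  # placeholders, not the paper's values

class SSDHead(nn.Module):
    """One 3x3 conv per feature map predicts class scores and box offsets
    for every anchor at every spatial cell (illustrative sketch only)."""
    def __init__(self, channels_per_map):
        super().__init__()
        out_ch = BOXES_PER_CELL * (NUM_CLASSES + 4)  # scores + (dx,dy,dw,dh)
        self.preds = nn.ModuleList(
            nn.Conv2d(c, out_ch, kernel_size=3, padding=1)
            for c in channels_per_map
        )

    def forward(self, feature_maps):
        # One prediction tensor per scale; coarse maps see large objects.
        return [pred(fmap) for pred, fmap in zip(self.preds, feature_maps)]

# Example: three backbone scales (e.g. strides 8, 16, 32).
maps = [torch.randn(1, c, s, s) for c, s in [(128, 38), (256, 19), (512, 10)]]
outs = SSDHead([128, 256, 512])(maps)
print([o.shape for o in outs])
```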
Results on the KITTI validation set (AP/AR per class, %):

Model           | Car AP | Car AR | Ped. AP | Ped. AR | Cyc. AP | Cyc. AR | mAP   | mAR
VGG-16          | 74     | 75     | 50      | 56      | 52      | 71      | 58    | 69
ResNet-101      | 76.04  | 74.82  | 47.74   | 56.07   | 53.61   | 75.26   | 58.9  | 70.06
WR-Inception    | 77.2   | 76.18  | 52.51   | 63.01   | 54.63   | 76.17   | 61.18 | 73.51
WR-Inception-12 | 78.24  | 80.24  | 51.08   | 64.29   | 59.28   | 75.26   | 63.03 | 75.14
Dataset: KITTI, captured with stereo cameras and LiDAR scanners in urban, rural, and highway driving environments; it has nine annotation categories in total: cars, vans, trucks, pedestrians, sitting people, cyclists, trams, miscellaneous, and "do not care".
$$\mathrm{mAP} = \operatorname{avg}_{\text{classes}}\left(\frac{TP}{TP + FP}\right), \qquad \mathrm{mAR} = \operatorname{avg}_{\text{classes}}\left(\frac{TP}{TP + FN}\right)$$

TP: true positives; FP: false positives; FN: false negatives.
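Note that the slide uses point estimates of precision and recall; standard mAP integrates precision over the precision-recall curve. A minimal Python sketch of the definition exactly as written above, with made-up per-class counts:

```python
def mean_metric(tp, fp_or_fn):
    """Average TP / (TP + X) over classes: X = FP gives the slide's mAP,
    X = FN gives the slide's mAR."""
    vals = [t / (t + x) for t, x in zip(tp, fp_or_fn) if t + x > 0]
    return sum(vals) / len(vals)

tp = [80, 45, 30]   # per-class true positives (example numbers)
fp = [20, 25, 15]   # per-class false positives
fn = [15, 30, 25]   # per-class false negatives
print("mAP ~", mean_metric(tp, fp))  # ~0.70 for this toy data
print("mAR ~", mean_metric(tp, fn))
```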
• Contribution: proposes a model that requires less memory and fewer computations yet shows better performance, ensuring the real-time performance of the object detector.
• Query: KITTI is still a relatively small dataset, and its categories are limited. The performance of this model should be tested on larger, more general datasets such as ImageNet.
Proper model for a specific dataset (Cambridge)
• The somewhat unanswered question in deep learning: is the selected CNN optimal for the dataset in terms of accuracy and model size?
• A certain standard is needed, but based on what?
• Approach: given a CNN pre-trained for a specific dataset, refine the architecture to potentially increase accuracy while possibly reducing the model size.
• Standard: the feature-extraction ability of a CNN for a specific dataset.
• Intuition: separation enhancement, i.e., best separating the classes of a dataset, assuming a constant network depth.
Refining Architectures of Deep Convolutional Neural Networks: Machine Intelligence Lab, University of Cambridge, UK, and Microsoft Research Cambridge, UK [CVPR 2016]
Separation enhancement and deterioration capacity of a layer

[Figure: correlation matrices for the 8 convolutional layers of VGG-11 trained on SAD and on CAMIT-NSAD. Dark blue: minimum correlation between classes; bright yellow: maximum correlation.]

Notation: $C_l$ is the $M \times M$ class-correlation matrix of convolutional layer $l$, with layers $l = 1, \dots, L$ and classes $m = 1, \dots, M$.
• The correlation matrices give an indication of the separation between classes at a given convolutional layer.
• Top row (SAD): the lower layers separate the classes better than the deeper layers.
• Bottom row (CAMIT-NSAD): the classes are separated less in the lower layers and more prominently in the deeper layers.
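A hedged sketch of how such a class-correlation matrix can be computed from a layer's activations (the paper's exact construction may differ; all names here are illustrative): average the activations per class, then correlate the class-mean feature vectors.

```python
import numpy as np

def class_correlation(features, labels, num_classes):
    """features: (N, D) activations of one layer; labels: (N,) class ids.
    Returns an (M, M) correlation matrix between class-mean features
    (a sketch of the idea, not the paper's verified procedure)."""
    means = np.stack([features[labels == c].mean(axis=0)
                      for c in range(num_classes)])
    # Low off-diagonal correlation means well-separated classes.
    return np.corrcoef(means)

feats = np.random.randn(600, 256)            # fake layer activations
labs = np.random.randint(0, 10, size=600)    # fake labels for 10 classes
C = class_correlation(feats, labs, 10)
print(C.shape)  # (10, 10)
```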
Comparing $C_l$ and $C_{l+1}$: for which class pairs did the separation increase, and for which did it deteriorate? This probes the inter-class separation, which varies across layers for different datasets.

Let $n_l^{\uparrow}$ be the number of class pairs whose separation increased from layer $l-1$ to layer $l$, $n_l^{\downarrow}$ the number of class pairs whose separation decreased, and $n_T$ the total number of class pairs. Define the average downstream deterioration

$$\tau(l) = \frac{1}{L - l - 1} \sum_{i=l+1}^{L-1} \frac{n_i^{\downarrow}}{n_T}$$

Case (a), $n_l^{\uparrow} \geq n_l^{\downarrow}$: apply the split factor
$$r_s^l = 2 - \frac{n_l^{\downarrow}}{n_l^{\uparrow}}\, \tau(l)$$

Case (b), $n_l^{\uparrow} < n_l^{\downarrow}$: apply the stretch factor
$$r_e^l = 1 + \frac{n_l^{\downarrow}}{n_T}\, \tau(l)$$
Dataset splits: t (train): 22084, V (validation): 3056, T (test): 5618
DR=Deep Refined Architecture (proposed approach)
DR-1=Deep Refined Architecture with only the Stretch network
DR-2=Deep Refined Architecture with only the Symmetric Split
Sp-1=L1 Sparsified network
Sp-2=L2 Sparsified network
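To make the legend concrete, here is a hedged sketch of what a stretch and a symmetric split could look like when applied to a single convolutional layer; the widening rule and the grouped-convolution reading of "symmetric split" are my illustrative assumptions, not the paper's exact operations.

```python
import torch.nn as nn

def stretch(conv: nn.Conv2d, r_e: float) -> nn.Conv2d:
    """Widen a conv layer: scale its output channels by the stretch factor."""
    return nn.Conv2d(conv.in_channels, int(conv.out_channels * r_e),
                     conv.kernel_size, conv.stride, conv.padding)

def symmetric_split(conv: nn.Conv2d) -> nn.Conv2d:
    """Split a conv into two parallel halves via a grouped convolution."""
    return nn.Conv2d(conv.in_channels, conv.out_channels,
                     conv.kernel_size, conv.stride, conv.padding, groups=2)

base = nn.Conv2d(64, 128, 3, padding=1)
print(stretch(base, 1.25))     # 64 -> 160 output channels
print(symmetric_split(base))   # two groups, each 32 -> 64 channels
```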
• Contribution: provides a quantified approach to refining network architectures, realizing a balance between precision and model size.
• Query: SAD and CAMIT-NSAD are relatively small datasets, and they contain only scene data. What about large object datasets like ImageNet? The generalization problem is avoided in this paper: when we do not know the source of the test data, how should we transfer the model and refine a better one?
MIT: Object Detectors Emerge in Deep Scene CNNs (published as a conference paper at ICLR 2015)
• The same network can do both object localization and scene recognition in a single forward pass.
• The deep features from Places-CNN tend to perform better on scene-related recognition tasks than the features from ImageNet-CNN.

Key Laboratory of Intelligent Information Processing, Chinese Academy of Sciences (CAS), CVPR 2016: scene recognition and classification
• Mixes scene data and object data together in the training process.
What needs to be taken into consideration?
• Scale down the size of our model and make it easier to control, while improving its feature-extraction ability.
• How do we train on both an object dataset and a scene dataset with one single feature extractor, and how do we merge the abstracted scene information with the features of objects? (One possible shape is sketched below.)
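As a purely illustrative answer to the second question: a shared feature extractor feeding both a scene-classification head and a detection head, with the scene posterior broadcast over the spatial grid and concatenated into the detection features. All sizes and the fusion-by-concatenation choice are assumptions for the sketch.

```python
import torch
import torch.nn as nn

class SceneAwareDetector(nn.Module):
    """Illustrative sketch: one shared backbone, a scene head, and a
    detection head that consumes scene context (sizes are assumptions)."""
    def __init__(self, num_scenes=10, num_obj=9, boxes=4):
        super().__init__()
        self.backbone = nn.Sequential(          # stand-in feature extractor
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        self.scene_head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(128, num_scenes))
        # Detection head sees features + broadcast scene logits as context.
        self.det_head = nn.Conv2d(128 + num_scenes,
                                  boxes * (num_obj + 4), 3, padding=1)

    def forward(self, x):
        f = self.backbone(x)
        scene = self.scene_head(f)
        # Broadcast scene logits over the spatial grid and fuse by concat.
        ctx = scene[:, :, None, None].expand(-1, -1, f.shape[2], f.shape[3])
        return scene, self.det_head(torch.cat([f, ctx], dim=1))

model = SceneAwareDetector()
scene_logits, det_maps = model(torch.randn(2, 3, 128, 128))
print(scene_logits.shape, det_maps.shape)  # (2, 10) and (2, 52, 32, 32)
```

In practice one could alternate batches from the scene dataset (supervising only the scene head) and the object dataset (supervising only the detection head), so a single feature extractor serves both tasks.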
Thank you!