Inter-dependent CNNs for Joint Scene and Object Recognition

J. H. Bappy and Amit Roy-Chowdhury
Electrical and Computer Engineering
University of California, Riverside
[Figure: the joint classification task — an input image is labeled with a scene ("Living Room") together with its detected objects]
Motivation
• CNNs show superior performance in extracting hierarchical features from an image
• CNNs are typically used for either scene or object classification, but not for joint classification
• Certain objects co-occur with certain scenes, such as 'bed in a bedroom'
• Joint (scene-object) classification can be useful, as the two tasks share some low-level features
• Computational cost can be significantly reduced in this process
• Question: Can CNN-based scene and object recognition benefit each other by exploiting the interconnections between them in order to improve performance?
Joint Classification: Overall Idea
• The current best-performing detectors are based on finding region proposals to localize objects (e.g., Selective Search, Edge Boxes)
• In deep layers, the activations of receptive fields in the feature maps can be used to spot semantic regions, which can be considered object proposals
• These semantic regions are very useful, as they are related to the objects in an image
• As objects comprise a scene, detection scores from the detectors can also be useful for identifying robust features to classify a scene
• Region proposals: Selective Search vs. the proposed region proposal technique (fewer regions, and more semantics)
Joint Scene-Object Classification
State of the art:
• Most methods consider an individual task, either scene or object classification
• Existing joint scene and object classification uses a CRF model, and its recognition performance is low
• Deep learning based approaches provide the best results in recognition tasks
• Can we use deep learning to perform joint scene and object recognition?
• We propose a CNN-based joint classification framework in this paper
Proposed Framework Overview
The scene CNN and the object CNN interact with each other in the proposed method.
Proposed Method: Overview
• The proposed model has two CNNs, S-CNN and O-CNN
• For object detection, the feature maps of the final convolutional layer of S-CNN are used to propose the regions where an object might appear
• These proposals are fed into the O-CNN architecture to extract features and classify the proposals
• For scene classification, features from S-CNN and the object detection scores provided by the detectors are fed into a network that consists of three hidden layers
• Scene and objects interact in these layers, which eventually improves the scene classification performance
• Finally, both S-CNN and O-CNN are fine-tuned for better performance
Region Proposal Technique
• The receptive fields of all units of the last convolutional layer are used to generate object proposals
• If multiple proposals appear at the same spatial location, we keep only one of them
• In this way, we get approximately 250 proposals on average per image
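The proposal step above can be sketched as follows. This is a minimal NumPy sketch, not the paper's implementation: the stride, receptive-field size, and both thresholds are placeholder values.

```python
import numpy as np

def box_iou(a, b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def receptive_field_proposals(feat_map, stride=16, rf=196,
                              act_frac=0.5, dedup_iou=0.7):
    """Turn strongly activated units of the last conv layer into object
    proposals via their receptive fields.
    feat_map: (C, H, W) activations of the last convolutional layer."""
    saliency = feat_map.max(axis=0)      # per-location max over channels
    if saliency.max() <= 0:
        return []
    ys, xs = np.nonzero(saliency > act_frac * saliency.max())
    boxes = []
    for y, x in zip(ys, xs):
        # map unit (y, x) back to the centre of its receptive field
        cx, cy = x * stride + stride // 2, y * stride + stride // 2
        box = [cx - rf // 2, cy - rf // 2, cx + rf // 2, cy + rf // 2]
        # keep only one proposal per spatial location (drop near-duplicates)
        if all(box_iou(box, b) < dedup_iou for b in boxes):
            boxes.append(box)
    return boxes
```

Adjacent units map to heavily overlapping receptive fields, so the deduplication step is what keeps the proposal count low compared to exhaustive methods.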
Scene Classification
• Global and local features are fused
• The local feature encodes the probability of an object appearing in the scene
• Global features are extracted from the fc7 layer of S-CNN
• Three fully connected layers are used to represent the scene feature, which is then classified by a softmax layer
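The fusion step can be illustrated with a small forward pass. This is a toy NumPy sketch under assumed shapes; the layer sizes and the `(W, b)` weight format are illustrative, not the paper's configuration.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(z - z.max())
    return e / e.sum()

def scene_forward(fc7_feat, det_scores, layers):
    """Fuse the global fc7 feature (from S-CNN) with the local object
    detection score vector, pass the result through fully connected
    ReLU layers, and classify with a softmax layer.
    layers: list of (W, b) pairs -- three hidden layers plus the output layer."""
    h = np.concatenate([fc7_feat, det_scores])   # global + local features
    for W, b in layers[:-1]:
        h = np.maximum(0.0, W @ h + b)           # ReLU hidden layer
    W, b = layers[-1]
    return softmax(W @ h + b)                    # scene class probabilities
```

The output is a probability distribution over scene classes, so it sums to one regardless of the (hypothetical) layer dimensions chosen.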
Loss Function Formulation
• Classification: softmax classifier
• Loss: cross-entropy
Training hidden layers:
• Stochastic gradient descent is used to solve the optimization
• In each iteration, the network parameters are updated
Fine-tuning:
• The S-CNN and O-CNN networks are fine-tuned iteratively
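The loss and the per-iteration update can be written out as a toy sketch; the learning rate here is a placeholder, not a value reported on the poster.

```python
import numpy as np

def cross_entropy_loss(probs, label):
    """Cross-entropy loss for a softmax output vector and an integer
    class label; the epsilon guards against log(0)."""
    return -np.log(probs[label] + 1e-12)

def sgd_step(params, grads, lr=0.01):
    """One stochastic gradient descent update over all parameters."""
    return [p - lr * g for p, g in zip(params, grads)]
```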
Experiments
Datasets:
• MSRC (21 scene classes, 15 object classes)
• Scene15 (15 scene classes)
• SUN (150 scene classes, 120 object classes)
• MIT-67 (67 scene classes)
• VOC2010 (20 object classes)
CNN models (pre-trained):
• Places-205 model for scenes
• VGG-16 net for objects
Experimental Set-up
Evaluation criterion (object detection):
• Precision depends on both correct labeling and correct localization
• An overlap (> 50%) between the detected bounding box and the ground-truth box is considered correct localization
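This criterion can be made concrete with a short sketch, assuming boxes in the standard [x1, y1, x2, y2] format and overlap measured as intersection-over-union.

```python
def box_iou(a, b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def is_correct_detection(pred_label, pred_box, gt_label, gt_box):
    """A detection counts as correct only if the label matches AND the
    boxes overlap by more than 50% IoU (labeling + localization)."""
    return pred_label == gt_label and box_iou(pred_box, gt_box) > 0.5
```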
Number of region proposals:
• For object detection, we choose approximately 250 regions
• In the MSRC dataset, 1-3 objects are present per image, and each object occupies a significant portion of the whole image; so we choose the top 10 activated regions with the largest area from the proposed candidate bounding boxes for MSRC
• Similarly, we choose 50 regions for VOC2010
Comparison Methods

Baseline Methods:
Results on Object Detection
• Comparison of the number of proposals needed against Selective Search (2,400 proposals) for similar performance
• Our method achieves the best performance with few region proposals
• Computational efficiency compared to R-CNN: 240x faster on MSRC, 48x on VOC2010, and 9x on the other datasets
Results on Region Proposals
• For the MSRC and VOC2010 datasets, we achieve good recognition performance with few proposals
• Performance may sometimes degrade with a large set of proposals, due to an increasing false-positive rate
Results on Scene Classification
• Object detection helps S-CNN perform better
• The red box indicates the best performance
• On Scene15, the proposed method fails to achieve the best performance because we could not fine-tune O-CNN due to the lack of object annotations
Some Detection Results
Examples of some object detection results
Conclusion
• We propose a CNN-based joint scene and object classification framework
• The interaction between S-CNN and O-CNN significantly reduces computational cost
Future work:
• Develop a single network to solve both tasks, rather than having two separate networks for two tasks
Acknowledgements:
• NSF grant 1544969
• ONR N00014-15-C-5113 through Mayachitra, Inc.
Thank You!!