Inter-dependent CNNs for Joint Scene and Object Recognition
J. H. Bappy and Amit Roy-Chowdhury
Electrical and Computer Engineering, University of California, Riverside

Classification Task
(Figure: an input image is mapped to a scene label, e.g., "Living Room", together with the detected objects.)

Motivation
- CNNs show superior performance in extracting hierarchical features from an image.
- CNNs are typically used for either scene or object classification, but not for joint classification.
- Certain objects co-occur with a scene, such as a bed in a bedroom.
- Joint scene-object classification can be useful, as the two tasks share some low-level features.
- Computational cost can be significantly reduced in this process.
Question: Can CNN-based scene and object recognition benefit each other by exploiting the interconnections between them in order to improve performance?

Joint Classification: Overall Idea
- The current best-performing detectors are based on finding region proposals to localize objects (e.g., Selective Search, Edge Boxes).
- In deep layers, the activations of receptive fields of the feature maps can be used to spot semantic regions, which can be treated as object proposals.
- These semantic regions are very useful because they relate to the objects in an image.
- As objects comprise a scene, detection scores from the detectors can also help identify robust features for classifying the scene.
(Figure: region proposals from Selective Search vs. the proposed technique.)
Proposed region proposal technique: fewer regions and more semantics.

Joint Scene-Object Classification
State of the art:
- Most existing methods consider an individual task, either scene or object classification.
- Existing joint scene and object classification uses a CRF model, and its recognition performance is low.
- Deep learning based approaches provide the best results in recognition tasks.
Can we use deep learning to perform joint scene and object recognition? We propose a CNN-based joint classification framework in this paper.

Proposed Framework Overview
- The scene CNN (S-CNN) and the object CNN (O-CNN) interact with each other in the proposed method.

Proposed Method: Overview
- The proposed model has two CNNs: S-CNN and O-CNN.
- For object detection, feature maps of the final convolutional layer of S-CNN are used to propose regions where an object might appear.
- These proposals are fed into the O-CNN architecture to extract features and classify the proposals.
- For scene classification, features from S-CNN and the object detection scores provided by the detectors are fed into a network consisting of three hidden layers.
- Scene and objects interact in these layers, which eventually improves scene classification performance.
- Finally, both S-CNN and O-CNN are fine-tuned for better performance.

Region Proposal Technique
- The receptive fields of all the units of the last convolutional layer are used to generate object proposals.
- If multiple proposals appear in the same spatial location, we keep only one of them.
- In this way, we obtain approximately 250 proposals per image on average.

Scene Classification
- Global and local features are fused.
- Local features encode the probability of an object appearing in the scene.
- Global features are extracted from the fc7 layer of S-CNN.
- Three fully connected layers represent the scene feature, which is then classified by a softmax layer.

Loss Function Formulation
- Classification: softmax classifier; loss: cross-entropy.
- Training the hidden layers: stochastic gradient descent is used to solve the optimization, and the parameters are updated in each iteration.
(A minimal code sketch of this fusion head and its training step follows below.)
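To make the scene branch concrete, here is a minimal sketch (not the authors' released code) of the fusion head described above: fc7 features from S-CNN are concatenated with per-class object detection scores, passed through three fully connected hidden layers, and classified with a softmax / cross-entropy loss trained by SGD. The use of PyTorch, the class and variable names (SceneFusionHead, hidden_dim), and the layer widths are assumptions for illustration only.

```python
# Minimal sketch; framework (PyTorch), names, and layer sizes are assumptions,
# not taken from the paper or poster.
import torch
import torch.nn as nn

class SceneFusionHead(nn.Module):
    """Fuses global S-CNN fc7 features with local object-detection scores
    and classifies the scene through three hidden layers + softmax."""
    def __init__(self, fc7_dim=4096, num_objects=120, num_scenes=150,
                 hidden_dim=1024):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(fc7_dim + num_objects, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, num_scenes),  # softmax is applied inside the loss
        )

    def forward(self, fc7_feat, det_scores):
        # Global (fc7) and local (detection-score) features are concatenated.
        return self.mlp(torch.cat([fc7_feat, det_scores], dim=1))

# Training the hidden layers with cross-entropy loss and SGD.
model = SceneFusionHead()
criterion = nn.CrossEntropyLoss()        # softmax + cross-entropy
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)

fc7_feat = torch.randn(8, 4096)          # dummy S-CNN fc7 features
det_scores = torch.rand(8, 120)          # dummy per-class detection scores
labels = torch.randint(0, 150, (8,))     # dummy scene labels

logits = model(fc7_feat, det_scores)
loss = criterion(logits, labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()                         # one SGD parameter update
```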
Fine-Tuning
- The S-CNN and O-CNN networks are fine-tuned iteratively.

Experiments
Datasets:
- MSRC (21 scene classes, 15 object classes)
- Scene15 (15 scene classes)
- SUN (150 scene classes, 120 object classes)
- MIT-67 (67 scene classes)
- VOC2010 (20 object classes)
CNN models (pre-trained):
- Places-205 model for scenes
- VGG-16 net for objects

Experimental Set-up
Evaluation criterion (object detection):
- Precision depends on both correct labeling and correct localization.
- An overlap of more than 50% between the detected box and the ground-truth box is counted as correct localization (a minimal IoU sketch is given at the end of this poster).
Number of region proposals:
- For object detection, we choose approximately 250 regions.
- In the MSRC dataset, 1-3 objects are present in an image and each object occupies a significant portion of the whole image, so we choose the top 10 activated regions with the largest area from the proposed candidate bounding boxes. Similarly, we choose 50 regions for VOC2010.

Comparison Methods
(Baseline methods are listed in the comparison tables on the poster.)

Results on Object Detection
- Compared with Selective Search (about 2400 proposals), our method requires far fewer proposals to reach similar performance.
- Our method achieves the best performance with a small number of region proposals.
- Computational efficiency: about 240x faster than R-CNN on MSRC, 48x on VOC2010, and 9x on the other datasets.

Results on Region Proposals
- For the MSRC and VOC2010 datasets, we achieve good recognition performance with few proposals.
- Performance may sometimes degrade with a large set of proposals because the false-positive rate increases.

Results on Scene Classification
- Object detection helps S-CNN perform better.
- (In the results table, the red box indicates the best performance.)
- On Scene15, the proposed method fails to achieve the best performance because O-CNN could not be fine-tuned, owing to the lack of object annotations.

Some Detection Results
(Figure: examples of object detection results.)

Conclusion
- We propose a CNN-based joint scene and object classification framework.
- The interaction between S-CNN and O-CNN significantly reduces computational cost.
Future work:
- Develop a single network to solve both tasks, rather than two separate networks.
Acknowledgements:
- NSF grant 1544969
- ONR N00014-15-C-5113 through Mayachitra, Inc.
Thank you!
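As referenced in the experimental set-up, a detection is counted as correctly localized when it overlaps the ground-truth box by more than 50%. The sketch below illustrates that criterion with a standard intersection-over-union computation; the function names and the (x1, y1, x2, y2) box format are assumptions, not taken from the poster.

```python
# Minimal sketch of the >50%-overlap localization criterion.
# Assumption: boxes are (x1, y1, x2, y2) with x2 > x1 and y2 > y1.
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes."""
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def is_correct_detection(pred_box, pred_label, gt_box, gt_label, thresh=0.5):
    """A detection counts only if the label matches and the IoU exceeds 50%."""
    return pred_label == gt_label and iou(pred_box, gt_box) > thresh

# Example: the predicted box overlaps the ground truth by well over half.
print(is_correct_detection((10, 10, 60, 60), "bed", (15, 12, 65, 62), "bed"))  # True
```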