
Global-to-Local Motivated Scene Recognition
Le Dong, Ebroul Izquierdo
Department of Electronic Engineering, Queen Mary, University of London,
London E1 4NS, U.K.
{le.dong, ebroul.izquierdo}@elec.qmul.ac.uk
Abstract. In this paper, an approach to scene recognition that proceeds from global layout to local features is presented. This essential-centred approach provides a meaningful representation of natural scenes at multiple levels of categorization. The representation of a complex scene is initially built from a collection of global features from which properties related to the spatial layout of the scene and its semantic category can be estimated. The recognition of natural scenes thus relies partly on a global estimation of the features contained in the scene; further analysis of the locally conspicuous areas is then deployed on this basis. Such an integrated model guarantees interactive processing between local and global features, enabling low-level features to initiate scene recognition and categorization efficiently.
Keywords: essential-centred, global estimation, scene recognition.
1 Introduction
Related research and modelling experiments are dedicated to testing scene representation, rapid recognition and subsequent contextual object detection. From the perspective of psychological and behavioural studies, it has been suggested that the representation of a complex scene can be initially built from a collection of global features from which properties related to the spatial layout of the scene and its semantic category can be estimated [1-3]. There are two generic approaches to scene recognition: object-based and essential-centred. The former is content-emphasized: the representation of a scene is built from a list of the objects it contains. The latter is context-emphasized: a collection of intermediate abstract information is extracted from the whole scene. The two approaches are clearly complementary in terms of representation. In [4] and [5], the role of the low and medium levels of representation is to make available to the high level a useful, segmented representation of the scene. Following this approach, current computer vision models propose to drive the process of recognition by extracting a set of image-based features that are combined to form higher-level representations [6-8]. The scene identity level is then reached through the recognition of a set of objects or blocks delivered by the medium level of processing. On the other hand, some scene recognition studies suggest that humans can apprehend the meaning of a complex scene within a glance without necessarily remembering important objects and their locations [9, 10]. Taken together, those studies suggest that the identity of a natural scene may also be perceived from essential-centred features that do not rely on object or segmentation mechanisms. A few studies in computer vision have proposed such an alternative essential-centred representation [11-13]. Common to these studies is the goal of finding the basic-level category of the image directly from scene recognition.
This paper is dedicated to building an essential-centred representation in the initial stage, at the medium level of visual processing; further analysis of the locally conspicuous areas can then be deployed on this basis. Our goal is to generate an efficient representation of natural scenes based on the identification of spatial properties. In real cluttered environments, this integrated scheme offers an intriguing platform for scene recognition and object localization. Furthermore, the gap between low-level features and high-level semantic meaning can be bridged without the conventional local-to-global, feature-based multi-layer framework. More interestingly, the resulting scheme is independent of the complexity of the scene and able to provide a hierarchy of scene representations. It is therefore a promising approach to the recognition of complex scenes containing entities at multiple levels. The paper is organized as follows: the mechanism of essential-centred recognition is introduced in Section 2. In Section 3, the spatial layout representation is illustrated. Section 4 describes the global representation in scene recognition. Selected experimental evaluation is given in Section 5. Conclusions are presented at the end of the paper.
2 Essential-centred Recognition
As opposed to an object-centred representation, an essential-centred representation that encodes the distribution of textures in the image is described in this section. Although the resulting coarse representation is not adequate for representing objects within a scene, it captures enough information about the scene layout to reliably estimate structural and textural attributes of the scene.
Fig. 1. Illustration of a local and a global receptive area.
A local receptive area is tuned to a specific orientation and spatial scale, at a particular position in the scene. A global receptive area is tuned to a spatial pattern of
orientations and scales across the entire scene. A global receptive area can be generated as a combination of local receptive areas and implemented from a population of
local receptive areas like the ones found in the early visual areas. Larger receptive
areas, which can be selective to global scene properties, are found in higher cortical
areas [14]. The global feature illustrated in Fig. 1 is tuned to scenes with vertical
structures at the top part and relatively horizontal texture at the bottom part.
The sketch of the essential-centred spatiotemporal representation is based on a low-resolution encoding of the output magnitude of multi-scale oriented Gabor filters:

$$M(x, n, t) = \left| i(x, t) \ast g_n(x, t) \right|. \quad (1)$$

Here $i(x, t)$ is the input information at time $t$, $g_n(x, t)$ is the impulse response of a Gabor filter at that time, and $n$ indexes filters tuned to different scales and orientations. This essential-centred representation contains a coarse description of the structures of the image and their spatiotemporal layout. Each scene is represented by a feature vector obtained by rearranging the set of $M(x, n, t)$ values into a column vector. Note that the dimensionality of the vector is independent of the scene complexity. Applying principal component analysis (PCA) further reduces the dimensionality of the vector while preserving most of the information that accounts for the variability among scenes. The principal components are the eigenvectors of the covariance matrix $C = E[(v - m)(v - m)^T]$, where $v$ is the column vector of scene features and $m$ is the mean of $v$. The scene is then described by the $k$-dimensional vector obtained by projecting the scene features onto the first $k$ principal components, i.e. those with the largest eigenvalues.
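As an illustration, the representation in (1) and the subsequent PCA projection could be sketched as follows. This is a minimal sketch assuming grayscale input images and off-the-shelf scikit-image/scikit-learn routines; the filter spacing, pooling grid and function names are our illustrative choices, not the authors' implementation.

```python
import numpy as np
from scipy.signal import fftconvolve
from skimage.filters import gabor_kernel
from sklearn.decomposition import PCA

def gabor_magnitude_features(image, n_scales=4, n_orientations=8, grid=4):
    """Eq. (1): magnitude of multi-scale oriented Gabor filter outputs,
    encoded at low resolution by averaging over a grid x grid layout."""
    features = []
    for s in range(n_scales):
        frequency = 0.25 / (2 ** s)                   # illustrative scale spacing
        for o in range(n_orientations):
            theta = np.pi * o / n_orientations
            g = gabor_kernel(frequency, theta=theta)  # complex impulse response
            m = np.abs(fftconvolve(image, g, mode='same'))  # M = |i * g_n|
            h, w = m.shape                            # pool onto a coarse grid
            pooled = m[:h - h % grid, :w - w % grid] \
                .reshape(grid, h // grid, grid, w // grid).mean(axis=(1, 3))
            features.append(pooled.ravel())
    return np.concatenate(features)   # dimensionality independent of scene content

# images: a list of 2-D grayscale arrays
# V = np.stack([gabor_magnitude_features(im) for im in images])
# pca = PCA(n_components=8).fit(V)   # eigenvectors of the covariance matrix C
# v = pca.transform(V)               # k-dimensional scene feature vectors
```

Because the pooled magnitudes are averaged over a fixed grid, the feature vector has the same length for every image, which is what makes the dimensionality independent of scene complexity.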
3 Spatial Layout Representation
The layout of the scene can be represented by a set of global features that provide a
holistic low-dimensional representation of the scene. The global features are built by
pooling together the low-level features across large areas of the visual field. In order
to learn a small set of global features that capture the dominant statistical properties of
natural scenes, the aforementioned PCA was performed on the output magnitude of a
set of multi-scale oriented filters applied to natural scenes. The features delivered by PCA capture information about orientations and scales as well as their spatial distribution.
Given a set of scene features, the system can learn to predict the relevance and the value of each scene representation. The analysis is deployed in the spatiotemporal dimension, following [14]. The following factors are estimated for each scene:
◇ Pertinence. The pertinence of a scene representation is the likelihood that it is used for describing the three-dimensional scene. It can be approximated as $P = \{\, p(t) : p(r_j = 1 \mid f_i),\ i = j \,\}$, where $p(t)$ indicates the relevance for the scene of each input feature $f_i$ to the training attribute $r_j$ at time $t$.
◇ Fitness. The fitness of a scene representation estimates which semantic meaning would best apply to a scene. It can be estimated from the scene features as $E[v_j \mid f_i]$.
◇ Credit. The credit expresses how reliable the estimation of each scene representation is, given the scene features: $E[(v_j - E[v_j \mid f_i])^2 \mid f_i]$. The higher this expression, the less reliable the estimation of the property given the scene features.
The pertinence is calculated as the likelihood:

$$p(r_j = 1 \mid f_i, t) = \frac{p(f_i, t \mid r_j = 1)\, p(r_j = 1)}{p(f_i, t \mid r_j = 0)\, p(r_j = 0) + p(f_i, t \mid r_j = 1)\, p(r_j = 1)}. \quad (2)$$
$p(f_i, t \mid r_j = 1)$ and $p(f_i, t \mid r_j = 0)$ are modeled as temporal mixtures of Gaussians, $\sum_{m=1}^{N_c} p(f_i, t \mid c_m)\, p(c_m)$. The parameters of the mixtures are then estimated with the expectation maximization (EM) algorithm [15]. The prior $p(r_j = 1)$ is approximated by the frequency of use of attribute $j$ within the training set. Estimation of the value of each descriptor can be performed as the conditional expectation $\hat{v}_j = \iint v_j\, p(v_j, t \mid f_i)\, dv_j\, dt$. This function can be evaluated by estimating the joint distribution between the value of the attribute and the scene features, which is modeled by a temporal mixture of Gaussians, $\sum_{m=1}^{N_c} p(v_j, t \mid f_i, c_m)\, p(f_i \mid c_m)\, p(c_m)$, with $p(v_j, t \mid f_i, c_m)$ being a Gaussian with mean $a_m + f_i^T b_m$ and variance $\sigma_m^2$. The model parameters for each property are learned with the EM algorithm on the MIT-CSAIL database [15, 16]. Once the learning is completed, the conditional probability density function of the attribute value, given the scene features, is:
$$p(v_j, t \mid f_i) = \frac{\sum_{m=1}^{N_c} p(v_j, t \mid f_i, c_m)\, p(f_i \mid c_m)\, p(c_m)}{\sum_{m=1}^{N_c} p(f_i \mid c_m)\, p(c_m)}. \quad (3)$$
Therefore, given a new scene, the attribute value is estimated from the image features as a mixture of local linear regressions:

$$\hat{v}_j = \frac{\sum_{m=1}^{N_c} (a_m + f_i^T b_m)\, p(f_i \mid c_m)\, p(c_m)}{\sum_{m=1}^{N_c} p(f_i \mid c_m)\, p(c_m)}. \quad (4)$$
The estimation of $p(v_j, t \mid f_i)$ provides a way to measure the credit of the estimate produced by (4) for each scene:

$$\hat{\sigma}_j^2 = \frac{\sum_{m=1}^{N_c} \sigma_m^2\, p(f_i \mid c_m)\, p(c_m)}{\sum_{m=1}^{N_c} p(f_i \mid c_m)\, p(c_m)}. \quad (5)$$
The credit measure allows the system to reject estimates that are not expected to be reliable: the larger the variance, the less reliable the estimate.
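The pipeline of (2)-(5) could be approximated with standard tools as sketched below. scikit-learn's GaussianMixture stands in for the EM-fitted mixtures, and fitting the local regression parameters $a_m, b_m$ per component by responsibility-weighted least squares (rather than inside a joint EM loop) is a simplifying assumption of this sketch.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_pertinence(F_pos, F_neg, n_components=3):
    """Eq. (2): Bayes rule with mixture-of-Gaussians class likelihoods."""
    gmm1 = GaussianMixture(n_components).fit(F_pos)    # p(f | r=1), fitted by EM
    gmm0 = GaussianMixture(n_components).fit(F_neg)    # p(f | r=0)
    prior1 = len(F_pos) / (len(F_pos) + len(F_neg))    # frequency of the attribute
    def pertinence(F):
        l1 = np.exp(gmm1.score_samples(F)) * prior1
        l0 = np.exp(gmm0.score_samples(F)) * (1.0 - prior1)
        return l1 / (l0 + l1)
    return pertinence

def fit_attribute_regressor(F, v, n_components=3):
    """Eqs. (3)-(5): mixture of local linear regressions over scene features."""
    gmm = GaussianMixture(n_components).fit(F)
    resp = gmm.predict_proba(F)                        # component responsibilities
    X = np.hstack([np.ones((len(F), 1)), F])           # [1, f] for a_m + f^T b_m
    betas, sigma2 = [], []
    for m in range(n_components):
        sw = np.sqrt(resp[:, m])                       # weighted least squares
        beta = np.linalg.lstsq(X * sw[:, None], v * sw, rcond=None)[0]
        betas.append(beta)
        sigma2.append(np.average((v - X @ beta) ** 2, weights=resp[:, m]))
    sigma2 = np.asarray(sigma2)
    def predict(Fq):
        wq = gmm.predict_proba(Fq)     # normalised p(f|c_m)p(c_m), as in Eq. (3)
        Xq = np.hstack([np.ones((len(Fq), 1)), Fq])
        means = np.stack([Xq @ b for b in betas], axis=1)
        v_hat = (wq * means).sum(axis=1)               # Eq. (4)
        var = (wq * sigma2).sum(axis=1)                # Eq. (5): credit measure
        return v_hat, var              # reject estimates with large variance
    return predict
```

The rejection rule implied by the credit measure then amounts to thresholding the returned variance before accepting an attribute estimate.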
4 Global Representation in Recognition
In the proposed scheme, as shown in Fig. 2, visual context information is available early in the visual processing chain, providing an efficient shortcut for object detection and recognition. The processing of the global scene information is performed in parallel with the processing of local image structure, such as the detection of essential areas or object recognition. As described in the preceding sections, the scene is initially represented using a low-dimensional vector of global features computed from the same pool of low-level features used to compute local image saliency. The global scene features can then be used to predict the probability of presence of the target object in the scene, as well as its location and scale [17, 18].
Fig. 2 illustrates two parallel pathways of analysis. The local pathway represents a
scene as a collection of areas, and each area is represented using a set of local features.
In this example, the local pathway is based on the essential detection of a previously proposed bottom-up process [19]. In the global pathway, the scene is represented as a
collection of global features with large receptive areas that summarize the overall
layout of the scene as a single entity. This representation provides a description of the
entire scene that can be used for scene recognition. The representations of the scene in
both pathways are built upon the same set of low-level features. The images on the
right illustrate the differences between the two pathways: the local pathway analyzes small local areas for which many features are computed, whereas the global pathway gives only a coarse description of the entire scene.
Fig. 2. Parallel pathways for scene recognition: local processing detects essential areas, while global processing captures the global layout of the natural scene.
Our model implements part of a global-to-local processing of visual information, in which coarsely localized, unbound low-level features initiate scene recognition and object recognition before features of a higher level of complexity are integrated. This mechanism accords with the procedure followed by visual systems in scene recognition, where the global layout of the scene is generally captured first and the local essential areas are focused on subsequently.
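The parallel organisation of Fig. 2 could be sketched as two functions consuming one shared pool of filter outputs. The saliency computation below is a crude local-energy stand-in for the bottom-up essential detection of [19], and all names and parameters are our illustrative assumptions.

```python
import numpy as np
from scipy.signal import fftconvolve
from scipy.ndimage import maximum_filter
from skimage.filters import gabor_kernel

def shared_filter_bank(image, n_scales=3, n_orientations=4):
    """One pool of low-level Gabor magnitude maps feeding both pathways."""
    return [np.abs(fftconvolve(image,
                               gabor_kernel(0.25 / 2 ** s,
                                            theta=np.pi * o / n_orientations),
                               mode='same'))
            for s in range(n_scales) for o in range(n_orientations)]

def global_pathway(maps, grid=4):
    """Coarse grid pooling of every map -> holistic layout vector."""
    feats = []
    for m in maps:
        h, w = m.shape
        feats.append(m[:h - h % grid, :w - w % grid]
                      .reshape(grid, h // grid, grid, w // grid)
                      .mean(axis=(1, 3)).ravel())
    return np.concatenate(feats)

def local_pathway(maps, k=5, size=31):
    """Stand-in saliency: normalised filter energy, then the k strongest
    local maxima are taken as candidate essential areas."""
    sal = sum((m - m.min()) / (m.max() - m.min() + 1e-9) for m in maps)
    peaks = sal == maximum_filter(sal, size=size)
    ys, xs = np.nonzero(peaks)
    order = np.argsort(sal[ys, xs])[::-1][:k]
    return list(zip(ys[order], xs[order]))             # essential area centres
```

The key design point is that neither pathway recomputes low-level features: both operate on the same `shared_filter_bank` output, mirroring the shared pool of low-level features described above.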
5 Experimental Evaluation
A set of pictures including indoor and outdoor scenes was used for the similarity comparison and essential detection. Selected results for a subset containing 20 images are given in Fig. 3, which ranks the images by the similarity of their global features to the target image. The images most similar to the target in global-feature space are displayed first. Scenes with a similar spatial layout lie close together in the multi-dimensional space formed by the global layout representation; within this space, neighbouring scenes look alike.
Fig. 3. Similarity ranking by global-feature distance d to the target: 02.jpg (d=0.000), 03.jpg (0.125), 01.jpg (0.226), 04.jpg (0.286), 14.jpg (0.316), 13.jpg (0.371), 08.jpg (0.408), 07.jpg (0.436), 15.jpg (0.465), 20.jpg (0.490), 09.jpg (0.494), 05.jpg (0.498), 10.jpg (0.547), 12.jpg (0.611), 16.jpg (0.622), 06.jpg (0.661), 17.jpg (0.758), 18.jpg (0.844), 19.jpg (0.860), 11.jpg (0.969).
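A ranking of this kind could be reproduced along the following lines, assuming global feature vectors computed as in Section 2; normalising the distances into [0, 1] so that they resemble the d values above is our assumption.

```python
import numpy as np

def rank_by_global_similarity(V, names, target):
    """Order scenes by Euclidean distance to the target in global-feature space."""
    d = np.linalg.norm(V - V[target], axis=1)
    d = d / (d.max() + 1e-9)              # assumed normalisation to [0, 1]
    order = np.argsort(d)
    return [(names[i], round(float(d[i]), 3)) for i in order]

# V: (n, k) PCA-projected scene features; e.g.
# ranking = rank_by_global_similarity(V, names, target=names.index("02.jpg"))
```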
Fig. 4 shows essential detection in different scenes that reflect distinctive contexts, providing material for detailed analysis and perception. The essential areas (orange circles) are hierarchically detected based on the conspicuity reflected by the blue dashes.
Fig. 4. Essential detection.
In addition, eight categories of scenes from the MIT-CSAIL database, containing 2688 images in total, were used to evaluate the proposed framework on scene perception: city, mountain, highway, country, coast, street, building, and forest. To evaluate the contribution of the global features to the scene perception task, a retrieval evaluation was conducted using a Bayesian classifier to recognize the scene from global features alone. A 10-fold cross validation was used for the experiment with the different scenes. Table 1 shows the confusion matrix indicating the errors made for each scene category, using global features at a resolution of 2 cycles per image. Note that coast scenes and countryside scenes were often confused: the features typically occurring in coast scenes (a horizon between sky and sea) also occur frequently in countryside scenes (a horizon between sky and landscape), leading to confusion between the two categories (16.14% of the responses to coast scenes fell into the countryside category).
Table 1. Confusion matrix for scene recognition (%). Columns are the true categories, rows the assigned categories; each column sums to 100.

            City   Mountain  Highway  Country   Coast   Street  Building  Forest
City       87.03     0.92     1.86     0.81     0.56     5.22     4.63     1.12
Mountain    0.82    86.74     0.39     4.89     5.91     0.68     0.53     2.39
Highway     1.96     0.33    85.12     3.02     1.66     3.52     0.23     0.20
Country     0.93     3.97     3.07    78.51    16.14     0.80     0.47     1.10
Coast       0.64     1.37     1.52     4.68    69.98     0.18     0.21     0.33
Street      4.25     0.62     5.19     1.47     1.40    86.69     1.64     0.72
Building    3.58     0.75     1.35     0.59     0.48     2.04    91.22     1.05
Forest      0.79     5.30     1.50     6.03     3.87     0.87     1.07    93.09
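A sketch of this retrieval evaluation is given below, assuming precomputed global feature vectors V and integer labels y for the 2688 images; using scikit-learn's GaussianNB as a generic stand-in for the paper's Bayesian classifier is an assumption of this sketch.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.metrics import confusion_matrix

categories = ["city", "mountain", "highway", "country",
              "coast", "street", "building", "forest"]

def evaluate(V, y):
    """10-fold cross-validated scene recognition from global features only."""
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
    y_pred = cross_val_predict(GaussianNB(), V, y, cv=cv)
    # Percentage confusion matrix; rows here are the true categories
    # (Table 1 arranges true categories as columns), so transpose to compare.
    return 100.0 * confusion_matrix(y, y_pred, normalize="true")
```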
6 Conclusion
In our framework, a scene is initially processed as a single entity, and local information participates at a later stage of visual processing. The essential-centred recognition based on the spatial layout shows that the highest level of recognition, the identity of a scene, may be built from a set of three-dimensional properties available in the scene. It defines a general recognition framework within which complex scene categorization may be achieved free of segmentation, grouping, interpretation and object detection. The global processing approach provides a meaningful description of the scene at multiple levels of description, independently of scene complexity. The essential-centred scheme provides a novel approach to context modelling and can be further used to enhance local analysis. The global scene representation summarised in this paper delivers contextual information in parallel with the processing of local features, providing a formal instance of a mechanism for the guidance of attention.
References
1. Biederman, I.: Visual Object Recognition. An Invitation to Cognitive Science: Visual Cognition (2nd edition), vol. 2, 121--165 (1995)
2. Oliva, A., Schyns, P. G.: Diagnostic Colors Mediate Scene Recognition. Cognitive Psychology, vol. 41, 176--210 (2000)
3. Maljkovic, V., Martini, P.: Short-term Memory for Scenes with Affective Content. Journal of
Vision, vol. 5, 215--229 (2005)
4. Barrow, H. G., Tenenbaum, J. M.: Recovering Intrinsic Scene Characteristics from Images.
Computer Vision Systems, New York, Academic press, 3--26 (1978)
5. Marr, D.: Vision. San Francisco, CA. WH Freeman (1982)
6. Carson, C., Belongie, S., Greenspan, H., Malik, J.: Blobworld: Image Segmentation Using
Expectation-Maximization and its Application to Image Querying. IEEE Trans. on Pattern
Analysis and Machine Intelligence, vol. 24, 1026--1038 (2002)
7. Biederman, I.: Recognition-by-Components: A Theory of Human Image Understanding. Psychological Review, vol. 94, 115--147 (1987)
8. Barnard, K., Forsyth, D. A.: Learning the Semantics of Words and Pictures. In Proc. ICCV,
pp. 408--415, Vancouver, Canada (2001)
9. Henderson, J. M., Hollingworth, A.: High Level Scene Perception. Annual Review of Psychology, vol. 50, 243--271 (1999)
10. Rensink, R. A., O’Regan, J. K., Clark, J. J.: To See or Not to See: The Need for Attention
to Perceive Changes in Scenes. Psychological Science, vol. 8, 368--373 (1997)
11. Vailaya, A., Jain, A., Zhang, H. J.: On Image Classification: City Images vs. Landscapes.
Pattern Recognition, vol. 31, 1921--1935 (1998)
12. Oliva, A., Torralba, A.: Modeling the Shape of the Scene: A Holistic Representation of the
Spatial Envelope. International Journal of Computer Vision, vol. 42, 145--175 (2001)
13. Torralba, A., Oliva, A.: Depth estimation from image structure. IEEE Trans. on Pattern
Analysis and Machine Intelligence, vol. 24, 1226--1238 (2002)
14. Oliva, A., Torralba, A.: Building the Gist of a Scene: The Role of Global Image Features in
Recognition. Visual Perception, Progress in Brain Research, vol. 155 (2006)
15. Gershenfeld, N.: The Nature of Mathematical Modeling. Cambridge University Press (1999)
16. The MIT-CSAIL Database of Objects and Scenes,
http://people.csail.mit.edu/torralba/images/
17. Torralba, A.: Contextual Priming for Object Detection. International Journal of Computer
Vision, vol. 53, 153--167 (2003)
18. Oliva, A., Torralba, A., Castelhano, M. S., Henderson, J. M.: Top-Down Control of Visual
Attention in Object Detection. In Proc. ICIP, vol. 1, pp. 253--256 (2003)
19. Walther, D.: Interactions of Visual Attention and Object Recognition: Computational Modelling, Algorithms, and Psychophysics. PhD Thesis, California Institute of Technology, Pasadena, California, USA (2006)