
Qualitative Visual Environment Retrieval
Rajashekhara, Amit B. Prabhudesai, and Subhasis Chaudhuri
Abstract—A system for retrieval of an unstructured environment under static and dynamic scenarios is proposed. The use
of cylindrical mosaics or omnidirectional images is exploited for
providing a rich description about the surrounding environment
spanning 360◦ . The environment description is based on defining
the attributes of the nodes of a graph derived from the angular
partitions of the captured images. Content-based image retrieval
for each of these partitions is performed on an exemplar image
database to annotate the nodes of the graph. The complete environment description is recovered by collating the retrieval results
over all the partitions based on a simple voting scheme. This offers
a qualitative description of the location in a totally natural and
unstructured surrounding. The experiments yield quite promising
results.
Index Terms—Concentric mosaic, environment retrieval, node
annotation, omnicam, view partition.
I. INTRODUCTION

THE PROBLEM of localization of robots is a well-documented and well-researched one. Several approaches have been proposed to solve it. In the class of problems belonging to the simultaneous localization and mapping (SLAM)
category [1], the robot starts at an unknown location with no
prior knowledge of the landmark positions. From landmark
observations, it simultaneously estimates its location and those
of the landmarks. The robot then builds up a complete map
of landmarks for localization. Early approaches used laser
range data or sonar for localization, whereas recent approaches
have focused on using vision for mobile robot localization. An
illustrative example is the MINERVA tour guide robot [2] used
in the Smithsonian’s National Museum of American History.
Vision-based localization approaches have also been reported
by Kosecka and Li [3], Davison [4], Davison and Murray [5],
Kröse et al. [6], Goedeme et al. [7], and Se et al. [8]. These
works are not exhaustive but are quite representative of the
various approaches proposed. Among other approaches, the
global positioning system (GPS) [9] and location estimation
using global system for mobile communications (GSM) [10]
are becoming increasingly popular.
The robot localization problem attempts to answer the question "Where am I?" It is one of estimating the state of the robot at the time instant $t_k$, given all the measurements up to $t_k$.
Manuscript received July 8, 2005; revised January 9, 2006 and April 3,
2006. This work was supported in part by the Indian Department of Science
and Technology, Ministry of Science and Technology, under a Swarnajayanti
project. This paper was recommended by Associate Editor S. Sarkar.
Rajashekhara is with GE Healthcare Technologies, Bangalore 560 066, India.
A. B. Prabhudesai is with Siemens Corporate Technology, Bangalore 560
001, India.
S. Chaudhuri is with the Department of Electrical Engineering, Indian Institute of Technology, Bombay, Mumbai 400076, India (e-mail: [email protected]).
Digital Object Identifier 10.1109/TSMCB.2006.877797
Typically, a three-dimensional (3-D) state vector is used: $\mathbf{x} = [x, y, \theta]^T$, i.e., the position and orientation of the robot. Even
the GPS and the GSM systems provide only a metric coordinate
of the position in terms of latitude and longitude. None of these
systems provides a qualitative description of the surrounding
environment. In the proposed system, we attempt to provide a
rich description of the surrounding environment. Such a system
would be of significant interest to the wearable computing community and could also serve visually impaired
persons. To the best of our knowledge, there has been no prior
work directed toward generating a qualitative description of the
environment in a “humanlike” fashion that involves topological
relationships among various entities in a scene.
In this paper, we present a novel approach for environment
retrieval using cylindrical panoramic mosaics or omnidirectional images as input. The use of panoramic or omnidirectional
vision sensors in the vision system of a mobile robot has
previously been reported by Thompson and Zelinsky [11],
Menegatti et al. [12], Matsumoto et al. [13], [14], and
Dellaert et al. [2], although Dellaert et al. use an image mosaic
built up from the individual images. However, the crux of
all these approaches is positioning or navigation and not the
generation of a description of the surrounding environment.
Furthermore, all these approaches as well as those reported in
[2]–[5], [7], and [8] deal with an indoor environment, where
they rely on artificial or natural landmarks. In an unstructured
environment, artificial landmarks cannot be set up, and natural
landmarks cannot be segmented with accuracy. The outdoor
environment is totally unstructured, and this motivates the
use of global appearance-based features in our environment
retrieval system. Zhou et al. have proposed the use of global
appearance-based features [15], but again, the emphasis is on
robot localization.
We show how, given a 360◦ or a hemispherical view of a
totally unstructured and natural surrounding, descriptions such
as “buildings to our left,” “road in the front,” and “a lawn to
our right” can be obtained. We provide a description of the
environment in terms of a graph whose nodes correspond to one
of the annotated image classes. It may be noted that Ulrich and
Nourbakhsh [16] have also suggested the use of content-based
image retrieval (CBIR) [17] for localization. However, their
method mainly focuses on a restricted and pretrained environment such as halls, rooms, and corridors. Given the region
adjacency map of the environment and the color histogram of
each of the entities, they find out which node is currently being
traversed by the robot. Recently, Wolf et al. [18] have also
proposed a system combining an image retrieval system with
Monte Carlo localization (MCL). They represent an image by
a histogram of local features. However, the primary use of their
image retrieval system is to update the weights of the samples
in the subsequent MCL method. Again, their system was tested
on an indoor environment. Our method handles scenes from
a totally unconstrained natural static or dynamic environment
using a simple feature, such as color, for image retrieval. It can
handle the variability of the outdoor world. Our method assumes no a priori information about the environment. In the
proposed method, we are able to generate a fairly rich description of the environment using a limited number of image
classes. We represent the topological relationships among the
entities in the environment using a graph structure whose nodes
are annotated by an identifier associated with a particular image
class. We also show that the graph can be updated in real time
as the observer roams around.
The rest of this paper is organized as follows. The next
section presents a formal problem definition. Section III describes how one can obtain an environment representation from
a 360◦ view of a scene. Section IV explains the environment
retrieval mechanism. Section V focuses on the experimental
results of environment retrieval. Finally, this paper concludes in
Section VI with a look at a possible direction for future work.
II. PROBLEM DEFINITION
Concentric panoramic mosaics and images obtained from a
catadioptric imaging system such as an omnicam [19], [20] provide a 360◦ view of the environment. Given these panoramic/
omnidirectional images, we investigate the problem of generating a rich description of a static or dynamic environment.
A. Static Environment Retrieval
We propose to describe the environment using a graph to
indicate the topological relationships among the entities in the
environment. Let G denote a graph of the environment and gk
denote the kth node of this graph. The nodes are annotated
using an identifier associated with a class Ci . Mathematically,
this may be represented as
$$G \leftarrow \left\{\, g_k : g_k \in \{C_i\}_{i=1}^{M} \right\}. \qquad (1)$$
Here, M is the number of annotation classes into which the
image database is divided. Let V be a 360◦ view of the
environment. The maximum-likelihood solution to the problem
of static environment retrieval can now be written as
$$\hat{G} = \arg\max_{\{g_k\}} \; p\!\left(V \mid G, \{C_i\}_{i=1}^{M}\right). \qquad (2)$$
The granularity of the graph G in terms of the number of nodes is discussed in Section III.
B. Dynamic Environment Retrieval
In a manner similar to that of the static case, we build a
graph for each frame of the video sequence. Thus, the complete representation is given by a temporally evolving graph $\mathcal{G} = \{G_n\}_{n=0}^{N-1}$ corresponding to the $N$ frames of the video sequence. Let $\mathcal{V} = \{V_1, V_2, \ldots, V_N\}$ be the omnidirectional
video sequence as the robot moves in the environment. The
maximum-likelihood solution to the dynamic environment retrieval problem is given by

$$\hat{\mathcal{G}} = \arg\max_{\{G_i\}} \; p\!\left(\mathcal{V} \mid \mathcal{G}, \{C_i\}_{i=1}^{M}\right). \qquad (3)$$
C. Change Detection in the Environment
The graph G of a temporally evolving environment can
change due to two reasons:
1) a change in the entities of the environment as the robot
moves past a building, a lawn, or any one of the other
classes;
2) a change in the topological relationships of the entities
vis-a-vis the observer, as he/she turns left or right, or
reverses his/her direction.
We also address this problem of detecting the change in the
environment. For the first case aforementioned, we intend to
find the frame $n$ for which $G_n \neq G_{n-1}$ ($G_n$ denotes the graph for the current frame and $G_{n-1}$ that for the previous frame). The detected frame is given by

$$n : \; g_k^{(n)} \neq g_k^{(n-1)} \quad \text{for at least one } k. \qquad (4)$$
In other words, we find that the attribute of at least one node
of the graph has changed by comparing the graphs for two
successive frames.
In the second case aforementioned, we obtain a completely different annotation of the graph when the observer takes a turn along his path, but the environment otherwise remains the same. We may then relate the graphs in the $n$th and $(n-1)$th frames as $\tilde{G}_n = R\,\tilde{G}_{n-1}$, where $R$ represents a rotation operator and $\tilde{G}$ represents the subgraph of $G$ excluding the base (explained in the next section). Thus, the corresponding change detection problem has the following solution:

$$n : \; \#\bigl\{g_k^{(n)} \neq g_k^{(n-1)}\bigr\} \geq 2 \;\; \&\& \;\; \tilde{G}_n = R\,\tilde{G}_{n-1} \qquad (5)$$

where $\#\{g_k^{(n)} \neq g_k^{(n-1)}\}$ denotes the number of nodes for which the annotations have changed between the previous [$(n-1)$th] and the current ($n$th) frames.
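For intuition, here is a small worked instance of (4) and (5); the class labels and the single-node shift are hypothetical and chosen only to illustrate how a pure turn is distinguished from a change in scene content.

```latex
% Hypothetical annotations for the six directional nodes (base excluded):
%   frame (n-1):  (g_1, ..., g_6) = (L, W, W, B, B, R)
%   frame  n   :  (g_1, ..., g_6) = (R, L, W, W, B, B)
\begin{align*}
\#\bigl\{g_k^{(n)} \neq g_k^{(n-1)}\bigr\} &= 4 \;\geq\; 2,\\
\tilde{G}_n &= R\,\tilde{G}_{n-1},
\end{align*}
% where R is a one-node cyclic shift; by (5) the change is attributed to a
% turn by the observer rather than to new entities entering the scene.
```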
III. ENVIRONMENT REPRESENTATION
The proposed method uses either cylindrical panoramic mosaics or omnidirectional images as the input to the system for
building a description of the environment. We use the following
six classes (categories) for annotation: lawns L, woods W,
buildings B, water bodies H, roads R, and traffic T. We notice that most natural environments may be quite reasonably
described using images belonging to these classes. Our database consists of about 200 images divided nearly equally into
these six classes. In order to capture the variability in the
outdoor scenes, the images within a class were chosen to have
a moderately large intraclass variance (in the feature space).
However, there is a tradeoff between handling the variability in
the outdoor scenes and the discriminative power of the classifier
as they are inversely related. Furthermore, the use of more sample images, as will be discussed later in the text, will slow the retrieval process.
Fig. 1. Illustration of an environment in terms of attributes such as woods and lawns (FR, front; RT, right top; RB, right bottom; LB, left bottom; LT, left top).

Fig. 2. (a) View partitioning for an omnidirectional image. (b) Graphical representation of the environment for the aforementioned partitioning.
To represent the topological relationships among the entities
in an environment, we use a graph whose nodes are annotated
by an identifier associated with a particular class, as illustrated
in Fig. 1. As shown, an observer moves through an arbitrary environment along a set of points {P1, P2, P3, . . .} at time
instants {t1 , t2 , t3 , . . .}. The observer sees a part of woods W,
lawns L, buildings B, and water bodies H, around him/her.
Now, we face the question of how the environment should be represented topologically. In order to simplify the description,
we select a topology that is fixed with respect to the observer
rather than being environment specific as an outdoor scene is
totally unstructured and no prior information is available. As a
simple way of indicating relationships such as “to the left of”
or “in front of,” we construct a graph with six nodes. We divide
the 360◦ view into six viewing cones of 60◦ each. They are
front FR, left top LT, left bottom LB, back XX, right bottom RB,
and right top RT, respectively. It may be noted that the nodes
of the graph G are denoted by two characters, as illustrated
in Fig. 2. The attributes for the nodes are represented by a
single character, such as “L” and “W,” which corresponds to the
appropriate class. For an omnidirectional camera, one can also
see the reflection of the base (on which the observer is standing)
along the periphery. We would also like to find out if we are
walking on a road or, say, on grass. Hence, the base BS forms
the seventh node of the graph G. For the cylindrical mosaic, G
has six nodes only. One is now required to find out the attribute
of each node to recover the environment.
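For concreteness, a minimal sketch of this graph representation follows, assuming a plain Python dictionary keyed by the seven node identifiers; the container, the helper name make_graph, and the sample annotations are our own illustrative choices, not the authors' implementation.

```python
# Sketch of the environment graph G: six 60-degree viewing cones plus the
# base node BS, each annotated with one of the six class identifiers.
# Node and class names follow the text; everything else is illustrative.

CLASSES = {"L": "lawns", "W": "woods", "B": "buildings",
           "H": "water bodies", "R": "roads", "T": "traffic"}

# Directional nodes in the order listed in the text, plus the base node.
NODES = ["FR", "LT", "LB", "XX", "RB", "RT", "BS"]

def make_graph(annotations):
    """Build G as a node -> class-identifier mapping.

    `annotations` maps node names to class identifiers in CLASSES; the XX
    node, occluded by the observer, may be left unannotated (None).
    """
    graph = {node: None for node in NODES}
    for node, label in annotations.items():
        if node not in graph:
            raise ValueError(f"unknown node: {node}")
        if label is not None and label not in CLASSES:
            raise ValueError(f"unknown class identifier: {label}")
        graph[node] = label
    return graph

# A hypothetical viewpoint: woods ahead, buildings to the left,
# a lawn and a water body to the right, and a road underfoot.
G = make_graph({"FR": "W", "LT": "B", "LB": "B",
                "RB": "H", "RT": "L", "BS": "R"})
```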
It may be noted that one can select a different granularity for the environment description. For example, if we are interested only in a left/right description, a 90◦ split of the view is enough, and G will have just four nodes for the mosaic. If required, one can increase the number of nodes by reducing the view angle suitably. However, the retrieval of the attributes of the nodes may not be as accurate when the view granularity is much finer.

IV. ENVIRONMENT RETRIEVAL

The environment retrieval method involves three main processing steps, namely: 1) view partitioning; 2) feature computation; and 3) node annotation.

A. View Partitioning

We adopt two separate methods for partitioning the concentric cylindrical mosaics and the omnidirectional images to
match the nodes of the graph G.
Concentric cylindrical mosaics: The partitioning of the concentric mosaics involves extracting six equal nonoverlapping
windows that span the entire image. The vertical span of each
of these windows is the complete vertical span of the given
mosaic. Each subimage consists of a 60◦ field of view.
Omnidirectional images: In the case of omnidirectional images, we adopt a similar approach, except for one difference.
The part near the center of the image corresponds to the region directly above the observer, which in an outdoor scene carries
little information for navigation or description purposes. The
sky component is not considered while extracting the feature.
Again, the part of the omnicam image near the periphery corresponds to the surface on which the observer is standing while
capturing the images. This ring-shaped part extending inward
from the periphery to some reasonable extent is considered
as one single separate partition. The remaining annular part
of the image is now split into six sectoral views, as shown in
Fig. 2(a). Out of these sectors, one corresponds to the direction
opposite the direction of motion (sector marked as XX in
Fig. 2), which is not considered for feature computation as it is
always occluded by the mobile trolley or the person carrying it.
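A rough sketch of this partitioning for an omnidirectional image is given below; the NumPy masking approach, the 20% base ring, the discarded central disc, and the convention that the front direction maps to 0◦ are all our assumptions.

```python
import numpy as np

def partition_omni_image(img, base_frac=0.2, sky_frac=0.25):
    """Split a square omnicam image into six 60-degree sectors plus a base ring.

    Returns a dict mapping node names (FR, LT, LB, XX, RB, RT, BS) to boolean
    pixel masks. The base-ring and central-disc fractions are illustrative.
    """
    h, w = img.shape[:2]
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    ys, xs = np.mgrid[0:h, 0:w]
    r = np.hypot(ys - cy, xs - cx)
    theta = np.degrees(np.arctan2(ys - cy, xs - cx)) % 360.0  # assume front = 0 deg

    r_max = min(cy, cx)                                      # usable image radius
    base = (r <= r_max) & (r >= (1.0 - base_frac) * r_max)   # outer ring -> BS
    sky = r < sky_frac * r_max                 # central disc (overhead), ignored
    annulus = (r <= r_max) & ~base & ~sky

    masks = {"BS": base}
    for k, name in enumerate(["FR", "LT", "LB", "XX", "RB", "RT"]):
        lo = k * 60.0 - 30.0                   # sector k centred on k*60 degrees
        masks[name] = annulus & (((theta - lo) % 360.0) < 60.0)
    return masks
```

For the concentric cylindrical mosaics, the analogous step is simply the extraction of the six equal-width vertical strips described above.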
B. Feature Extraction
As the environment is fully unstructured and not previously trained for, we do not attempt to recognize objects in the
scene. They will appear at different scales, perspectives, and
locations, and under varying amounts of occlusions. Hence,
we prefer the CBIR method, in which an image is represented
by certain features, and the comparison of images is carried
out in this feature space. For CBIR, we desire a feature invariant to scaling, viewpoint, illumination changes, and the
geometric warping introduced by omnicam images. In the
literature, many researchers have proposed the use of textural
features for image retrieval. One of the popular representations
of image texture is the co-occurrence matrix proposed by
Haralick et al. [21]. Recently, Jhanwar et al. [22] have proposed
a translation- and illumination-invariant retrieval scheme using
motif co-occurrence matrix (MCM). Unlike the co-occurrence
matrix, MCM captures third-order statistical features for texture
Fig. 3. Cylindrical concentric mosaic of a university campus used as a test image.
description. Wavelet-based methods [23], [24] and Gabor filters
[25] have also been proposed for textural description. However,
the geometric distortion introduced in the catadioptric images
does not preserve texture; thus, texture-based methods cannot
be used in our system. Rajashekhar et al. [26] propose a CBIR
method based on projective invariance; however, this works
only for entities having linear structures. Furthermore, in omnicam images, straight lines are not mapped into straight lines;
therefore, we cannot use this method for retrieval purposes.
We are not aware of any existing CBIR scheme that can handle
all such variations. In [14], Matsumoto et al. propose transforming a raw omnidirectional image to a low-resolution image in
cylindrical projection. However, such a transformation is very
much viewpoint specific, and the textural properties of a scene
change quite substantially when viewed from a different point.
Hence, we do not perform the transformation suggested in [14].
Instead, we use color [27], [28] as a feature for image retrieval
on the scaled omnidirectional images, and our results provide
ample evidence of the efficacy of this simple scheme. Because
the color histogram relates only to the property of a point and
not to its neighborhood, and because most outdoor objects,
with the possible exception of glass buildings, are quite close
to being matte surfaces, it provides a convenient way of doing
CBIR in cylindrical or spherical images. The choice of the color
space may have a bearing on the accuracy of the results of
similarity matching using color histograms. We experimented
with both the red–green–blue (RGB) and hue–saturation–value
(HSV) color spaces. As would be expected, the HSV color
space yields better results, as it is relatively more robust to
changes in illumination.
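A minimal sketch of the per-partition color feature is given below, assuming an HSV histogram; the 16x4x4 bin quantization and the matplotlib-based color conversion are our assumptions, since the paper does not state these details.

```python
import numpy as np
from matplotlib.colors import rgb_to_hsv

def hsv_histogram(rgb_img, mask=None, bins=(16, 4, 4)):
    """Normalized HSV color histogram of the pixels selected by `mask`.

    `rgb_img` is an (H, W, 3) uint8 array and `mask` an optional (H, W)
    boolean array (e.g., one view partition). The bin counts are illustrative.
    """
    hsv = rgb_to_hsv(rgb_img.astype(np.float64) / 255.0)      # H, S, V in [0, 1]
    pixels = hsv[mask] if mask is not None else hsv.reshape(-1, 3)
    hist, _ = np.histogramdd(pixels, bins=bins, range=[(0.0, 1.0)] * 3)
    hist = hist.ravel()
    total = hist.sum()
    return hist / total if total > 0 else hist
```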
Fig. 4. Graph of the environment shown in Fig. 3. Here, the characters refer
to textual annotation, and the thumbnails provide a visual annotation.
C. Node Annotation
We partition the input panoramic/omnidirectional image as
discussed in Section IV-A and compute the color histogram for
each of the partitions. We experimented with the use of three
different distance metrics for the similarity measure, namely:
1) Euclidean distance; 2) Kullback–Leibler distance; and
3) Jeffrey divergence. The Euclidean distance metric yielded
the best results, and this was used to compile the final results. Assuming the components of the color histogram to be
Gaussian distributed about their nominal values, this metric yields a
maximum-likelihood solution to (2). The top 20 retrievals are
considered while deciding the annotation for a particular image.
To make the retrieval robust against illumination changes and
variations within a class, we use a simple voting scheme,
instead of focusing on retrieval rank, to decide the image
annotation. We prepare a frequency count for each class using the top 20 retrievals. Then, the class having the maximum representation is used to annotate the given query image. Let $m_j$ denote the number of retrieved images belonging to class $C_j$. Then, the majority class $C^*$ is defined as

$$C^* = C_k, \qquad k = \arg\max_i m_i \qquad (6)$$

where $i$ is in the range of 1–5, which corresponds to the specified image classes in the database.

Fig. 5. Retrieved visual description of the environment shown in Fig. 3.
Fig. 6. Cylindrical concentric mosaic of a lawn used as another test image.
The majority class is found for each of the partitions. This
information is used to build the graph G. The vertices of
the graph are indexed by the identifier corresponding to the
majority class C ∗ .
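The voting step can be sketched as follows, assuming the exemplar database is held as precomputed (histogram, class-label) pairs and reusing histograms of the kind sketched in Section IV-B; the Euclidean metric and the top-20 majority vote follow the text, while the helper names are illustrative.

```python
import numpy as np
from collections import Counter

def annotate_node(query_hist, database, top_k=20):
    """Annotate one view partition by majority vote over its top-k retrievals.

    `database` is a list of (histogram, class_label) pairs computed off-line
    for the exemplar images; `query_hist` is the histogram of one partition.
    """
    # Euclidean distance in histogram space, the metric reported to work best.
    ranked = sorted(database, key=lambda item: np.linalg.norm(query_hist - item[0]))
    # Majority class C* over the top-k retrievals, as in (6).
    votes = Counter(label for _, label in ranked[:top_k])
    return votes.most_common(1)[0][0]

def annotate_graph(partition_hists, database, top_k=20):
    """Build the environment graph by annotating each partition independently."""
    return {node: annotate_node(hist, database, top_k)
            for node, hist in partition_hists.items()}
```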
For the omnidirectional images acquired with the spherical-mirror–camera system, the part of the image extending about
50 pixels (20% of the radius) inward from the periphery is
treated as a separate partition. We use this partition as the base
BS in the environment. We obtain the retrievals for this partition
in the same way as for the other partitions. One may partition
the BS node into two suitable halves or four quadrants for finer
granularity in environment description, if required.
D. Dynamic Node Annotation
A scene where one moves with an omnidirectional camera
corresponds to a graph Gn that changes dynamically with time
tn . We perform the retrieval operations as previously discussed
on each frame of the video sequence. Currently, each frame
is processed independently to get the environment description
Gn . The complete temporal evolution of the environment as one
navigates through it is given by $\mathcal{G}$, obtained by concatenating the subgraphs $G_n$, i.e., $\mathcal{G} = \{G_1, G_2, \ldots, G_k, \ldots\}$.
E. Change Detection
The dynamic node annotation discussed in the previous
section helps us to detect a change in the environment given
the omnivideo sequence as the input. By comparing the annotations of the corresponding nodes for the graphs generated for
consecutive frames through a simple XOR operation, we can
detect a change in the scene. This may arise due to either of
the two reasons mentioned in Section II-C. Once a change is
detected at more than one node in the subgraph $\tilde{G}_n$ (excluding the base BS from $G_n$), we try to match $\tilde{G}_{n+1}$ with $\tilde{G}_n$ by shifting
the nodes to the left or right appropriately. If a match is found,
we declare that the observer has changed his direction. We note
that if there is a simultaneous change in the observer direction
and the scene content (mostly due to occlusion or disocclusion),
this cannot be recovered.
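A sketch of this change-detection logic is given below; it compares corresponding node annotations of consecutive graphs and then tries cyclic shifts of the directional nodes to explain the change as a turn. The dictionary graph format and the helper names follow our earlier sketches and are illustrative, not the authors' implementation.

```python
# Directional nodes in their circular order; the base BS is excluded
# when testing for a rotation of the view.
DIRECTIONAL = ["FR", "LT", "LB", "XX", "RB", "RT"]

def detect_change(g_prev, g_curr):
    """Classify the change between the graphs of two consecutive frames.

    Returns 'none' if all annotations agree, 'turn' if the directional
    subgraph of the current frame is a cyclic shift of the previous one
    (cf. (5)), and 'scene' if the entities in the environment changed.
    """
    changed = [n for n in g_curr if g_prev.get(n) != g_curr.get(n)]
    if not changed:
        return "none"
    if len([n for n in changed if n != "BS"]) >= 2:
        prev = [g_prev[n] for n in DIRECTIONAL]
        curr = [g_curr[n] for n in DIRECTIONAL]
        for shift in range(1, len(DIRECTIONAL)):
            if curr == prev[shift:] + prev[:shift]:
                return "turn"
    return "scene"
```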
F. Real-Time Operation
The color histogram provides a very compact feature vector.
In addition, the color histograms of all the database images
are computed off-line and stored. Given an input image, we
only have to compute the histograms for the six partitions of
the image and compute the similarity metric for each partition.
Histogram comparison has a linear time complexity, $O(r)$, in terms of the number of gray levels. The database comprises only around 30 images for each class, requiring very little computation. We performed experiments on a Pentium IV processor clocked at 2 GHz. The image resolution for all the omnicam test images was 512 × 512 pixels. It took approximately 100 ms to process a single omnidirectional image without any effort on code optimization. Hence, the environment updating can be performed at a rate of approximately 5–10 frames/s. However, an outdoor environment typically does not change very rapidly. Hence, an operation even at the rate of 1 frame/s should suffice, and building a real-time system poses no difficulty at all.

Fig. 7. Retrieved visual environment description for the cylindrical concentric mosaic image shown in Fig. 6.
V. RESULTS
We conducted extensive experiments on three categories of
images, namely: 1) cylindrical panoramic mosaics; 2) still omnicam images; and 3) images obtained from an omnicam video.
Because real-time generation of cylindrical mosaic video is not
possible, this was not considered in this paper. The panoramic
image mosaics were collected randomly from the Internet. The
omnidirectional images were generated using a hemispherical-mirror–camera system developed at the Indian Institute of Technology (IIT), Bombay. The camera was mounted on top of
the mirror with the optical axis coinciding with the axis of
the hemisphere. All images used for experimentation were
collected in the IIT campus and the adjoining urban localities.
For CBIR purposes, we initially created an image database
by manually annotating an appropriate set of training images
into several classes. The training images used were collected
partly from the web and partly from the images provided by
the University of Texas, Austin. The exemplar images are all
Fig. 8. (a) Frame from the omnicam video. (b) Description of the retrieved environment.
Fig. 9. (a) Another frame from the omnicam video sequence. (b) Description of the retrieved environment.
standard parallel plane images because the collection of such
data is much easier and serves the purpose equally well.
We demonstrate the performance of the proposed method
starting with a cylindrical mosaic of the environment. Fig. 3
shows one such cylindrical concentric mosaic image used in
the experiment. The graph generated by collating the retrieval
results for all the partitions of this mosaic is shown in Fig. 4. We
see that the majority class for the LT partition is the buildings B
class, as indicated by the image placed at the LT node of the
graph. As shown in Fig. 3, the rest of the image is dominated by
lawns, and this is correctly indicated in the graph. Fig. 5 shows
another way of describing the retrieved environment. We place
a characteristic thumbnail image of the retrieved class at each
node of the graph. To further illustrate the result of the environment retrieval problem, we present the results of analyzing
another concentric mosaic. Fig. 6 shows a concentric mosaic
of a lawn image. The thumbnail representation of the retrieved
environment is shown in Fig. 7. Such a representation is useful
in providing a feel for the environment one is surrounded by.
As an example of processing a static omnicam scene, we
present one of the frames from our omnicam video sequence
[Fig. 8(a)]. Notice the observer blocking the XX partition.
The complete environment recovered for this scene is shown
in Fig. 8(b). Another example of a static omnicam scene is
provided in Fig. 9(a). The recovered environment for this image
is shown in Fig. 9(b). The annotations of all the nodes are,
indeed, correct.
To test the performance of the proposed technique, we collected data at many locations and at different times of the
day when the ambient illumination changes. Because there is
considerable temporal redundancy in the omniview video, we
used a temporally downsampled video for our experiments.
Accordingly, we considered about 50 frames of the omniview
sequence for creating the animated video for demonstration
purposes. We compiled our results over all the data sets. Results
over this extensive data set were quite positive, with an accuracy
of about 80%, given the use of a simple feature such as color.
This may appear to be a bit poor, but several points deserve
explanation. A third of the labeling errors occurs between
the buildings B and the traffic T classes when both of these
classes receive comparable votes. This is not surprising for, in
an urban environment, both coexist in a scene. In addition, in
cases where a building may be partially occluded by trees,
the system is confused. The class having the second highest
representation among the top 20 retrievals is often the correct
label in such cases. However, we have not considered the
multiclass labeling problem in this paper. The discriminative
power of the system can be further improved by a simple yet
effective modification. We use a simple clustering technique to
segment out the region in each partition having the largest area
and use the corresponding local color histogram for retrieval
purposes. With the aforementioned modification, we achieved
an accuracy close to 90%.
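One way to realise this refinement is sketched below with k-means color clustering; the paper says only that "a simple clustering technique" is used, so the choice of k-means, the value of k, and the reuse of the hsv_histogram helper from the Section IV-B sketch are our assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def dominant_region_histogram(rgb_img, partition_mask, k=3):
    """Color histogram of the largest homogeneous region inside a partition.

    Pixels of the partition are clustered in RGB space with k-means (an assumed
    stand-in for the unspecified 'simple clustering technique'); the cluster
    covering the largest area supplies the local color histogram used for
    retrieval.
    """
    pixels = rgb_img[partition_mask].astype(np.float64)         # (N, 3) pixels
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(pixels)
    dominant = np.bincount(labels).argmax()                      # largest cluster
    region_mask = np.zeros(partition_mask.shape, dtype=bool)
    region_mask[partition_mask] = (labels == dominant)
    return hsv_histogram(rgb_img, mask=region_mask)   # helper from Sec. IV-B sketch
```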
VI. CONCLUSION
We have presented a CBIR-based approach to build a reasonably rich description of the environment using cylindrical
concentric mosaics or omnidirectional images. We describe the
visual environment using a graph whose nodes are annotated
with the identifiers of classes belonging to an annotated database. We tested our method extensively on static scenes as well
as on omnivideo sequences. For the latter case, we provide a
temporally evolving graph as well as an animated representation that tracks the change in environment over time. This representation also provides us with a mechanism to detect changes
in the scene. Our experiments have yielded quite promising
results in terms of the accuracy of description vis-a-vis the
computational complexity involved. For practical reasons, we
have included a Braille board display in the developed system to
display the environment annotations for the benefit of visually
impaired persons. A complete description of the developed
portable system can be found in a patent document [29].
In the current implementation, each graph at a given instant
is generated independently of the previous graph. In the future,
we intend to introduce a memory in the system to predict
the changes in the environment based on past observations.
Furthermore, it should be noted that the proposed method
of sensing the visual environment is quite different from the way humans do it. This is because we are good at extracting features only from the foveated part of the scene. However,
the peripheral vision does provide enough information for us
to figure out what surroundings we are in and that alone may
suffice for the proposed task of environment retrieval, albeit not
over the entire 360◦ view. This has been another motivation for
using a weak cue such as color for CBIR purposes. It would be
interesting to compare the performance of the proposed system
to that of the human visual system.
ACKNOWLEDGMENT
The authors would like to thank the reviewers for their
constructive comments.
REFERENCES
[1] J. J. Leonard and H. F. Durrant-Whyte, “Simultaneous map building and
localization for an autonomous mobile robot,” in Proc. IEEE/RSJ IROS,
Osaka, Japan, 1991, pp. 1442–1447.
[2] F. Dellaert, W. Burgard, D. Fox, and S. Thrun, “Using the CONDENSATION algorithm for robust, vision-based mobile robot localization,” in
Proc. IEEE Comput. Soc. Conf. Comput. Vis. and Pattern Recog., Fort
Collins, CO, Jun. 1999, pp. 588–594.
[3] J. Kosecka and F. Li, “Vision based topological Markov localization,” in
Proc. IEEE Int. Conf. Robot. and Autom., New Orleans, LA, Apr. 2004,
pp. 1481–1486.
[4] A. J. Davison, “Real-time simultaneous localization and mapping with a
single camera,” in Proc. 9th IEEE Int. Conf. Comput. Vis., Nice, France,
2003, pp. 1403–1410.
[5] A. J. Davison and D. W. Murray, “Simultaneous localization and map
building using active vision,” IEEE Trans. Pattern Anal. Mach. Intell.,
vol. 24, no. 7, pp. 865–880, Jul. 2002.
[6] B. J. A. Kröse, N. Vlassis, R. Bunschoten, and Y. Motomura, “A probabilistic model for appearance-based robot localization,” Image Vis. Comput., vol. 19, no. 6, pp. 381–391, Apr. 2001.
[7] T. Goedeme, M. Nuttin, T. Tuytelaars, and L. V. Gool, “Markerless computer vision based localization using automatically generated topological
maps,” in Proc. Eur. Navigat. Conf., Rotterdam, The Netherlands, 2004.
[8] S. Se, D. Lowe, and J. Little, “Mobile robot localization and mapping
with uncertainty using scale-invariant visual landmarks,” Int. J. Rob. Res.,
vol. 32, no. 4, pp. 431–443, 1996.
[9] D. Fox, J. Hightower, L. Liao, D. Schultz, and G. Borriello, “Bayesian
filters for location estimation,” IEEE Pervasive Comput., vol. 2, no. 3,
pp. 24–33, Jul.–Sep. 2003.
[10] T. Rappaport, J. Reed, and B. Woemer, “Position location using wireless communications on highways of the future,” IEEE Commun. Mag.,
vol. 34, no. 10, pp. 33–41, Oct. 1996.
[11] S. Thompson and A. Zelinsky, “Accurate local positioning using visual
landmarks from a panoramic sensor,” in Proc. IEEE Int. Conf. Robot. and
Autom., Washington, DC, May 2002, pp. 2656–2661.
[12] E. Menegatti, M. Zoccarato, E. Pagello, and H. Ishiguro, “Image-based
Monte Carlo localization with omnidirectional images,” Robot. Auton.
Syst., vol. 48, no. 1, pp. 17–30, Aug. 2004.
[13] Y. Matsumoto, M. Inaba, and H. Inoue, “Memory-based navigation using
omni-view sequence,” in Proc. Int. Conf. Field and Service Robot., 1997,
pp. 184–191.
[14] ——, “Visual navigation using view-sequenced route representation,” in
Proc. Int. Conf. Robot. and Autom., 1996, pp. 83–88.
[15] C. Zhou, Y. Wei, and T. Tan, “Mobile robot self-localization using global
visual appearance based features,” in Proc. IEEE Int. Conf. Robot. and
Autom., Sep. 2003, pp. 1271–1276.
[16] I. Ulrich and I. Nourbakhsh, “Appearance-based place recognition for
topological localization,” in Proc. IEEE Int. Conf. Robot. and Autom.,
San Francisco, CA, Apr. 2000, pp. 1023–1029.
[17] A. W. M. Smeulders, M. Worring, S. Santini, A. Gupta, and R. Jain,
“Content based image retrieval at the end of the early years,” IEEE Trans.
Pattern Anal. Mach. Intell., vol. 22, no. 12, pp. 1349–1380, Dec. 2000.
[18] J. Wolf, W. Burgard, and H. Burkhardt, “Robust vision based localization
by combining an image-retrieval system with Monte Carlo localization,”
IEEE Trans. Robot., vol. 21, no. 2, pp. 208–216, Apr. 2005.
[19] S. K. Nayar, “Catadioptric omnidirectional camera,” in Proc. IEEE Int.
Conf. Comput. Vis. and Pattern Recog., 1997, pp. 482–488.
[20] ——, “Omnidirectional vision systems,” in Proc. DARPA Image Understanding Workshop, Monterey, CA, Nov. 1998, pp. 93–99.
[21] R. M. Haralick, K. Shanmugam, and I. Dinstein, “Texture features for
image classification,” IEEE Trans. Syst., Man, Cybern., vol. SMC-3, no. 6,
pp. 610–621, Nov. 1973.
[22] N. Jhanwar, S. Chaudhuri, G. Seetharaman, and B. Zavidovique, “Content based image retrieval using motif cooccurrence matrix,” Image Vis.
Comput. J., vol. 22, no. 14, pp. 1211–1220, 2004.
[23] M. N. Do and M. Vetterli, “Wavelet based texture retrieval using generalized Gaussian density and Kullback–Leibler distance,” IEEE Trans.
Image Process., vol. 11, no. 2, pp. 146–158, Feb. 2002.
[24] M. Kokare, P. K. Biswas, and B. N. Chatterji, “Rotated complex
wavelet based texture features for content based image retrieval,” in Proc.
17th Int. Conf. Pattern Recog., Cambridge, U.K., Aug. 2004, vol. 1,
pp. 652–655.
[25] B. S. Manjunath and W. Y. Ma, “Texture features for browsing and retrieval of image data,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 8,
no. 8, pp. 837–841, Nov. 1996.
[26] Rajashekhara, S. Chaudhuri, and V. P. Namboodiri, “Image retrieval based
on projective invariance,” in Proc. IEEE Int. Conf. Image Process., Singapore, Oct. 2004, pp. 405–408.
[27] M. J. Swain and D. H. Ballard, “Color indexing,” Int. J. Comput. Vis.,
vol. 7, no. 1, pp. 11–32, Sep. 1991.
[28] D. Balthasar, “Color matching by using tuple matching,” in Proc. Int.
Conf. Image Anal. and Process., Sep. 2003, vol. 1, no. 12, pp. 402–407.
[29] S. Chaudhuri, Rajashekhara, and A. Prabhudesai, “Head mounted device
for semantic representation of the user surroundings,” Indian Patent Filed,
no. 133/MUM/2006, 2006.
Rajashekhara received the B.E. degree in electronic
and communication engineering from Kuvempu
University, Karnataka, India, in 1994, the M.Tech.
degree from the Mysore University, Mysore, India, in
1997, and the Ph.D. degree from the Indian Institute
of Technology (IIT), Bombay, India, in 2006.
He is currently part of the imaging team at GE
Healthcare Technologies, Bangalore, India. His research interests include signal and image processing,
pattern recognition, and computer vision.
Amit B. Prabhudesai received the B.E. degree in
electronics engineering from Bombay University,
Bombay, India, in 2004. He is currently working
toward the M.Tech degree at the Indian Institute of
Technology (IIT), Bombay.
He is currently with Siemens Corporate Technology, Bangalore, India. His research interests include
signal and image processing and computer vision.
Subhasis Chaudhuri was born in Bahutali, India.
He received the B.Tech. degree in electronics and
electrical communication engineering from the Indian Institute of Technology (IIT), Kharagpur, in
1985, the M.S. degree from the University of Calgary, Calgary, AB, Canada, and the Ph.D. degree
from the University of California, San Diego, both
in electrical engineering.
He joined the IIT, Bombay, as an Assistant Professor in 1990, where he is currently a Professor
and the Head of the Department of Electrical Engineering. He was a Visiting Professor with the University of Erlangen-Nuremberg, Germany, and the University of Paris XI, France. He is a
coauthor of the books Depth From Defocus: A Real Aperture Imaging Approach
(Springer, 1999) and Motion-Free Super-Resolution (Springer, 2005) and the
Editor of the book Super-Resolution Imaging (Kluwer Academic, 2001). His
research interests include image processing, computer vision, and multimedia.
He is an Associate Editor for the International Journal of Computer Vision.
Dr. Chaudhuri was a recipient of the Dr. Vikram Sarabhai Research Award
in 2001, the Prof. SVC Aiya Memorial Award in 2003, the Swarnajayanti
Fellowship in 2003, and the S. S. Bhatnagar Prize in engineering sciences in
2004. He is a Fellow of the Alexander von Humboldt Foundation, Germany,
the Indian National Academy of Engineering, and the National Academy of
Sciences, India. He is also an Associate Editor for the IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE.