Iterative Random Visual Word Selection

Thierry Urruty, Syntyche Gbèhounou, Huu Ton Le, Christine Fernandez
XLIM Laboratory, UMR CNRS 7252, University of Poitiers, France
[email protected], [email protected], [email protected], [email protected]

Jean Martinet
LIFL – UMR CNRS 8022, Lille 1 University, France
[email protected]
ABSTRACT
In content-based image retrieval, one of the most important steps is the construction of image signatures. To this end, many state-of-the-art approaches propose to build a visual vocabulary. In this paper, we propose a new methodology for visual vocabulary construction that achieves high retrieval results. Moreover, it is computationally inexpensive to build and needs no prior knowledge about the features or the dataset used. Classically, the vocabulary is built by aggregating a certain number of features into centroids using a clustering algorithm; the final centroids are assimilated to visual "words". Our approach for building a visual vocabulary is based on an iterative random visual word selection mixing a saliency map and the tf-idf scheme. Experimental results show that it outperforms the original "Bag of visual words" approach in efficiency and effectiveness.
Categories and Subject Descriptors
H.2 [Database Management]: Database Applications; H.3 [Information Storage and Retrieval]: Content Analysis and Indexing, Information Search and Retrieval; I.4 [Image Processing and Computer Vision]: Scene Analysis, Image Representation, Quantization

General Terms
Algorithms, Measurement, Performance

Keywords
Image Retrieval, Vocabulary Construction, Bags of Visual Words, Random Word Selection, Saliency Map

1. INTRODUCTION

In recent decades, online image databases have grown exponentially, especially with users sharing personal photographs on websites such as Flickr (www.flickr.com) or any other free image hosting service. However, such shared databases have heterogeneous user annotations, so semantic retrieval alone is not effective enough to access all images. The need for novel methods for content-based searching of image databases has become more pressing.
Building an effective and efficient content-based image retrieval (CBIR) system helps bridge the semantic gap between the low-level features and the high-level semantic concepts contained in an image. Many research efforts have tried to close this gap; over time, however, the goal of "closing" it has turned into "narrowing it down". To this end, work has been done to improve low-level features [4, 16, 25], multiplying the number of global and local image descriptors. Others have focused their efforts on selecting the best way to use, combine, refine and select those low-level features to optimise the effectiveness and efficiency of retrieval, using supervised and unsupervised classification methodologies. Among them, the "Bag of features" or "Bag of visual words" approach [2, 24] has been widely used to create image signatures and to index image databases.
More recently, visual attention models, which extract the most salient locations of an image, have been applied in many different domains. Some bio-inspired approaches are based on psycho-visual features [9, 17] while others make use of several low-level features [6, 28]. Saliency maps help select the most interesting parts of an image or a video; this reduces the computational cost of creating the image index or signature while increasing the precision of the retrieval.
In this article, we present a novel approach to construct a visual vocabulary. Our main objective is to propose a simpler, faster and more robust alternative to the "bag of visual words" approach, based on an iterative random visual word selection. In our word selection process, we introduce an information gain measure that mixes ideas from saliency maps and the tf-idf scheme.
The remainder of this article is structured as follows: we provide a brief overview of related work in Section 2. Then, we explain our iterative random visual word selection approach in Section 3 and present exhaustive experiments in Section 4. Finally, we discuss the findings of our study in Section 5 and conclude in Section 6.
2. STATE OF THE ART

We do not aim to cover in this section everything that has been done in content-based image retrieval. This research domain is very active, and we only present the most popular proposals and the ones that have inspired our work. We first present related work concerning low-level features, then we discuss methodologies for image representation, and finally we highlight some saliency-map-based approaches.
2.1 Image Features
The most popular descriptor in object recognition is SIFT (Scale Invariant Feature Transform), proposed by Lowe in 1999 [13]. The image is represented by a set of vectors of local invariant features. The effectiveness of SIFT and its different extensions has been demonstrated in numerous papers on object recognition and image retrieval [10, 11, 13, 14, 20, 25].
The original version of SIFT [13] is defined on greyscale images, and different colour variants of SIFT have been proposed. For example, OpponentSIFT, proposed by Van de Sande et al. [25], is recommended when no prior knowledge about the dataset is available. OpponentSIFT describes all the channels of the opponent colour space (Equation (1)) using SIFT descriptors:

\[
\begin{pmatrix} O_1 \\ O_2 \\ O_3 \end{pmatrix} =
\begin{pmatrix} \dfrac{R - G}{\sqrt{2}} \\[6pt] \dfrac{R + G - 2B}{\sqrt{6}} \\[6pt] \dfrac{R + G + B}{\sqrt{3}} \end{pmatrix}
\qquad (1)
\]
The O3 channel carries the intensity information, while the other two channels describe the colour information in the image.
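As an illustration of Equation (1), the per-pixel conversion from RGB to the opponent colour space can be sketched as follows (a minimal sketch with our own function name, not code from [25]):

```python
import numpy as np

def rgb_to_opponent(image):
    """Convert an (H, W, 3) RGB image to the opponent colour space of Equation (1)."""
    r, g, b = image[..., 0], image[..., 1], image[..., 2]
    o1 = (r - g) / np.sqrt(2.0)             # colour (red-green) channel
    o2 = (r + g - 2.0 * b) / np.sqrt(6.0)   # colour (yellow-blue) channel
    o3 = (r + g + b) / np.sqrt(3.0)         # intensity channel
    return np.stack([o1, o2, o3], axis=-1)
```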
The greyscale SIFT descriptor is 128-dimensional, whereas the colour versions of SIFT, e.g. OpponentSIFT, are 384-dimensional. This high dimensionality makes matching the feature vectors slow. The first solution proposed to address this issue is PCA-SIFT by Ke and Sukthankar [11], with 36 dimensions. This variant is faster for matching, but less distinctive than SIFT according to the comparative study of Mikolajczyk and Schmid [18]. In this study, they also proposed GLOH (Gradient Location and Orientation Histogram), a new variant of SIFT that is more effective than SIFT with the same number of dimensions; however, GLOH is more computationally expensive. Again with the aim of proposing a solution that is both less computationally expensive and more precise, Bay et al. [1] introduced SURF (Speeded Up Robust Features), a new detector-descriptor scheme. The detector used in the SURF scheme is based on the Hessian matrix and is applied to integral images to make it fast. Their average recognition rate for objects of art in a museum shows that SURF outperforms GLOH, SIFT and PCA-SIFT.
There are other interesting local point descriptors, such as:

• Colour moments (abbreviated CM in the remainder of the paper): measures that can be used to differentiate images based on their colour features. Once computed, these moments provide a measure of colour similarity between images. They are based on generalized colour moments [19] and are 30-dimensional. Given a colour image represented by a function I with RGB triplets at image position (x, y), the generalized colour moments are defined by Equation (2):

\[ M_{pq}^{abc} = \iint x^p\, y^q\, [I_R(x, y)]^a\, [I_G(x, y)]^b\, [I_B(x, y)]^c\, dx\, dy \qquad (2) \]

M_{pq}^{abc} is referred to as a generalized colour moment of order p+q and degree a+b+c. Only generalized colour moments up to the first order and the second degree are considered, so the resulting invariants are functions of the generalized colour moments M_{00}^{abc}, M_{10}^{abc} and M_{01}^{abc}, with

\[ (a, b, c) \in \{ (1,0,0),\ (0,1,0),\ (0,0,1),\ (2,0,0),\ (0,2,0),\ (0,0,2),\ (1,1,0),\ (1,0,1),\ (0,1,1) \}. \]

(A small numerical sketch of Equation (2) is given after this list.)

• Colour Moment Invariants (abbreviated CMI in the remainder of the paper), computed with the algorithm proposed by Mindru et al. [19]. The authors use generalised colour moments to construct combined invariants to affine transformations of the coordinates and to contrast changes. There are 24 basis invariants involving generalized colour moments over all 3 colour bands.
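To make Equation (2) concrete, the sketch below approximates the generalized colour moments of a discrete RGB image by replacing the double integral with a sum over pixel positions. The function name, the image size and the use of NumPy are our own illustrative assumptions, not the implementation of [19].

```python
import numpy as np

def generalized_colour_moment(image, p, q, a, b, c):
    """Discrete approximation of Equation (2): M_pq^abc for an (H, W, 3) RGB image."""
    h, w, _ = image.shape
    ys, xs = np.mgrid[0:h, 0:w].astype(float)           # pixel coordinates (y, x)
    r, g, bl = image[..., 0], image[..., 1], image[..., 2]
    return np.sum((xs ** p) * (ys ** q) * (r ** a) * (g ** b) * (bl ** c))

# The 9 zero-order moments M_00^abc of degree <= 2 used by the CM descriptor
degrees = [(1, 0, 0), (0, 1, 0), (0, 0, 1),
           (2, 0, 0), (0, 2, 0), (0, 0, 2),
           (1, 1, 0), (1, 0, 1), (0, 1, 1)]
img = np.random.rand(64, 64, 3)                          # stand-in for a real image
m00 = [generalized_colour_moment(img, 0, 0, *abc) for abc in degrees]
```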
The features presented above are among the works that inspired the design of our approach. The next section discusses variations around the popular visual-word image representation.
2.2 Bag-of-Words-based Approaches
Concerning image representation, plenty of solutions have been proposed. The most popular is commonly called "Bag of Visual Words" (BoVW) and was inspired by the bag-of-words representation used in text categorisation. This approach has different names; for example, it is called Bag of keypoints by Csurka et al. [2]. However, the general principle is the same: given a visual vocabulary, the idea is to characterise an image by a vector of visual word frequencies [24]. The visual vocabulary construction is often done by clustering low-level feature vectors, using for instance k-Means [2, 24] or Gaussian Mixture Models (GMM) [5, 22]. It is usual to apply a weighting value to the components of this vector. The standard weighting scheme is known as "term frequency-inverse document frequency", tf-idf [24], and it is computed as described in Equation (3).
Suppose there is a vocabulary of k words; then each document is represented by a k-vector V_d = (t_1, ..., t_i, ..., t_k)^T of weighted word frequencies with components:

\[ t_i = \frac{n_{id}}{n_d} \, \log \frac{N}{n_i} \qquad (3) \]

where n_{id} is the number of occurrences of word i in document d, n_d is the total number of words in document d, n_i is the number of occurrences of word i in the dataset, and N is the number of documents in the dataset.
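As an illustration of Equation (3), the following sketch turns a matrix of raw visual-word counts into tf-idf weighted signatures. Variable names follow the notation above; this is only an illustrative implementation of the formula, not code from [24].

```python
import numpy as np

def tfidf_signatures(counts):
    """counts[d, i] = n_id, the number of occurrences of word i in document d.

    Returns t_i = (n_id / n_d) * log(N / n_i) as in Equation (3), where n_i is the
    number of occurrences of word i over the whole dataset.
    """
    counts = np.asarray(counts, dtype=float)
    N = counts.shape[0]                                  # number of documents
    n_d = counts.sum(axis=1, keepdims=True)              # words per document
    n_i = counts.sum(axis=0)                             # occurrences of each word overall
    tf = np.divide(counts, n_d, out=np.zeros_like(counts), where=n_d > 0)
    idf = np.log(np.divide(N, n_i, out=np.ones_like(n_i), where=n_i > 0))
    return tf * idf
```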
Besides BoVW approaches, many other efficient methods exist, for example the Fisher kernel or VLAD (Vector of Locally Aggregated Descriptors). The former has been used by Perronnin and Dance [23] on visual vocabularies for image categorisation: they proposed to apply Fisher kernels to visual vocabularies represented by means of a GMM. In comparison to the BoVW representation, fewer visual words are required by this more sophisticated representation. VLAD was introduced by Jégou et al. [10] and can be seen as a simplification of the Fisher kernel. Considering a codebook C = {c_1, ..., c_k} of k visual words generated with k-Means, each local descriptor x is associated with its nearest visual word c_i = NN(x). The idea of the VLAD descriptor is to accumulate, for each visual word c_i, the differences x − c_i of the vectors x assigned to c_i.
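A minimal sketch of the VLAD aggregation described above, assuming a k-Means codebook is already available (the helper name and the final L2 normalisation are our own choices):

```python
import numpy as np

def vlad(descriptors, codebook):
    """Aggregate local descriptors (n, d) against a codebook (k, d) into a k*d VLAD vector."""
    # assign each descriptor to its nearest visual word c_i = NN(x)
    dists = np.linalg.norm(descriptors[:, None, :] - codebook[None, :, :], axis=2)
    nearest = dists.argmin(axis=1)
    k, d = codebook.shape
    v = np.zeros((k, d))
    for i in range(k):
        assigned = descriptors[nearest == i]
        if len(assigned):
            v[i] = (assigned - codebook[i]).sum(axis=0)  # accumulate residuals x - c_i
    v = v.ravel()
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v
```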
2.3 Saliency Map
Visual attention models are used to identify the most salient locations in an image; they are widely applied in many image-related research domains. In the last decades, many visual saliency frameworks have been published; some of them are inspired by psycho-visual features [9, 17], while others make use of several low-level features in different ways [6, 28]. The work of Itti et al. [9] can be considered an outstanding example of the former group: an input image is processed by parallel channels (colour, orientation, intensity), which are then combined to generate the final saliency map. Alternatively, Zhang et al. [28] present an approach based on natural statistical information. They define an expression called "pointwise mutual information" between the visual features and the presence of the target to indicate the overall saliency.
Recently, visual attention models have been used in new applications such as image retrieval, with interesting results. The most common approach consists in taking advantage of the saliency map to decrease the amount of information to be processed [6, 12, 27]. These methods usually exploit the information given by the visual attention model at an early stage; image information is either discarded or kept as input for the next stages depending on its saliency value. In [6], a SIFT descriptor is used for image retrieval and the saliency value is used to rank all the keypoints in the input image; only the most distinctive points are kept for the matching stage. Research in [27] shares the same idea: a threshold is defined and SURF features are computed only for pixels with a saliency (sometimes also called cornerness) value above this threshold. Their experiments show that the number of features can be reduced without affecting the performance of the classifier.
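The filtering strategy used in [6, 27] can be sketched as follows: keypoints whose saliency value falls below a threshold are simply discarded before the descriptor computation (the threshold value and the names are placeholders, not taken from those papers):

```python
def filter_keypoints_by_saliency(keypoints, saliency_map, threshold=0.5):
    """Keep only keypoints whose location has a saliency value above the threshold.

    keypoints: iterable of (x, y) pixel coordinates.
    saliency_map: 2D array (H, W) with saliency values, e.g. in [0, 1].
    """
    return [(x, y) for (x, y) in keypoints
            if saliency_map[int(round(y)), int(round(x))] > threshold]
```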
In [15], a weighting scheme is designed for visual words based on saliency, assuming that the importance of an image object can be estimated from the saliency value of its region. In [12], a visual attention model is combined with a segmentation procedure to extract the regions with the highest saliency values for further processing.
3. ITERATIVE RANDOM VISUAL WORD SELECTION
As discussed in the state of the art, many researchers and engineers have used BoVW as a tool to develop better object recognition systems and to construct higher semantic levels, often called Bags of Visual Phrases. In [3, 29], the authors propose hierarchical representations based on a visual word/phrase vocabulary. These methodologies have shown high performance when combining phrases and words in the retrieval phase. Their results also highlighted the importance of both the feature selection step and the visual vocabulary selection step in obtaining a more effective approach. Thus, in this paper we focus on opening the "black box" named Bag of Visual Words, in order to simplify it and to make it more robust to the choice of descriptor. In the next sections, we discuss the limitations of BoVW-based approaches that inspired our work and present our approach.
3.1 Robustness and BoVW
Most of the time, BoVW uses the well-known k-Means algorithm [2] or a Gaussian Mixture Model to construct the visual vocabulary. However, the effectiveness of these clustering algorithms tends to drop drastically as the dimensionality of the image features grows. As the authors mention in [21], the curse of dimensionality occurs and makes the result of the clustering close to random. This randomness has been one of the starting points of our reflection. A second aspect of our reflection was to minimise the importance of feature selection. Without a priori knowledge about the image dataset, our approach is designed to simplify the indexing step and the tuning of parameters according to the selected features.
3.2 Random Selection of Words
As mentioned in the previous section, using k-Means or a similar algorithm in the construction step of the visual vocabulary has no justification in high-dimensional spaces. k-Means is still widely used because it is well known and relatively inexpensive compared to other clustering algorithms. Experiments in previous works (e.g. [15]) have shown that k-Means yields near-random clustering results when the dimensionality grows, which is the case with many visual image descriptors; thus it is not well suited to generating visual words. One of our objectives is to simplify the construction of the visual vocabulary. We propose to replace the clustering step by randomly selecting a large set of visual words. This random selection can be made by either:
• randomly picking keypoints in all images of a collection
and using their computed descriptors as visual words;
• creating synthetic visual words with knowledge of the
feature space distribution.
Our experimental results show that both options give similar results, but randomly picking keypoints is probably easier when no prior knowledge on the feature space is available. However, the image dataset needs to contain heterogeneous images that reflect the real world, and the number of randomly picked words should be high enough (discussed in Section 4).
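A minimal sketch of the first option, randomly picking keypoint descriptors as initial visual words (the function name and the default size are our own; the experiments in Section 4 start from 1024 to 65536 words):

```python
import numpy as np

def random_initial_vocabulary(all_descriptors, n_words=4096, seed=None):
    """Pick n_words descriptors at random from the pooled keypoint descriptors of a collection.

    all_descriptors: (n_keypoints, d) array gathering the descriptors of all images.
    Returns an (n_words, d) array used as the initial random vocabulary.
    """
    rng = np.random.default_rng(seed)
    idx = rng.choice(all_descriptors.shape[0], size=n_words, replace=False)
    return all_descriptors[idx]
```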
The second step of the visual vocabulary construction is to identify the visual words with the best information gain IG within the set of randomly selected visual words. To do so, we need a selection criterion. Although many exist, since we are dealing with visual words we draw an analogy between information retrieval and the CBIR domain and choose to base part of the IG score on the tf-idf weighting scheme [26]. The other part of the IG score comes from saliency maps [9], computed using the graph-based visual saliency (GBVS) software (http://www.klab.caltech.edu/~harel/share/gbvs.php).
Data: set of random words, image dataset
Result: final visual vocabulary
extraction of random words;
while Nb words in Vocabulary > threshold do
    Assign each keypoint to its closest word w;
    Compute IG_w for each word w with Equation (4);
    Sort words according to their IG_w value;
    Delete α % of the words with the lowest IG value;
end
Algorithm 1: Visual vocabulary construction
To construct the visual vocabulary, we start from a set of random visual words and a heterogeneous training image dataset. As illustrated in Algorithm 1, we first assign all the keypoints extracted from each image to one of the randomly chosen visual words. Then, we compute the information gain for each visual word. As already mentioned, the information gain IG_w is based on the tf-idf weighting scheme and a saliency score, given by the following formula:
\[ IG_w = \frac{n_{wD}}{n_D} \, \log \frac{N}{n_w} + \frac{\sum Sal_{wD}}{n_{wD}} \qquad (4) \]
where IG_w is the IG value of the visual word w, n_{wD} is the frequency of w over all the keypoints of the dataset D, n_D the total number of keypoints in the dataset, N the number of images in the dataset, n_w the number of images containing the word w, and ∑ Sal_{wD} the sum of the saliency scores of all the keypoints assigned to word w. The tf-idf part is estimated dataset-wise instead of document-wise.
The whole set of random visual words is then sorted in ascending order of IG values. All words with no assigned keypoint have an IG value equal to 0 and are therefore deleted. Then, a threshold α is used to set the number or percentage of words to be deleted in each iteration. The while block is repeated until the desired number of words in the visual vocabulary is reached, reassigning only the keypoints that were orphaned by the previously deleted visual words. At the end of this step, the visual vocabulary is built.
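The construction loop of Algorithm 1 together with Equation (4) can be sketched as follows. The data layout (pooled descriptors, one saliency value and one image identifier per keypoint), the naming, and the full reassignment of all keypoints at each iteration are our own simplifying assumptions; the procedure described above only reassigns the keypoints orphaned by deleted words.

```python
import numpy as np

def information_gain(assign, saliency, image_ids, n_images, n_words):
    """IG_w = (n_wD / n_D) * log(N / n_w) + sum(Sal_wD) / n_wD, as in Equation (4)."""
    n_D = len(assign)
    ig = np.zeros(n_words)
    for w in range(n_words):
        mask = assign == w
        n_wD = mask.sum()
        if n_wD == 0:
            continue                                   # words without keypoints keep IG = 0
        n_w = len(np.unique(image_ids[mask]))          # images containing word w
        ig[w] = (n_wD / n_D) * np.log(n_images / n_w) + saliency[mask].sum() / n_wD
    return ig

def build_vocabulary(words, descriptors, saliency, image_ids, target_size, alpha=0.10):
    """Iteratively prune the random vocabulary, keeping the most informative words."""
    n_images = len(np.unique(image_ids))
    while len(words) > target_size:
        # assign each keypoint to its closest remaining word (L2 distance)
        dists = np.linalg.norm(descriptors[:, None, :] - words[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        ig = information_gain(assign, saliency, image_ids, n_images, len(words))
        order = np.argsort(ig)                          # ascending IG values
        n_delete = max(1, int(alpha * len(words)))      # delete the alpha % least informative
        words = words[np.sort(order[n_delete:])]
    return words
```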
The final step of our approach consists in indexing all the target dataset images using the visual vocabulary built from the training image dataset. It creates a signature for each image with respect to the occurrence frequency of each visual word in the image. This frequency histogram is L2-normalized for better retrieval results.
3.3 Stabilization process
As described above, the proposed algorithm is based on a random set of visual words. When using a random approach, it is natural to assume that several runs of the algorithm would create different visual vocabularies and therefore yield different experimental results. However, the exhaustive experiments presented in Section 4 show that the algorithm is more stable than we expected. Nevertheless, in order to avoid a hypothetical lack of stability that may appear for some descriptors, we propose a stability extension to our approach, inspired by strong clustering methods.
For this purpose, we generate β visual vocabularies and run the iterative process for each of them (see Algorithm 1). We then combine the obtained words and run the iterative process again until the desired number of visual words is reached. Combining a few sets of visual words improves the overall information gain of the vocabulary. However, since the results are already almost stable, our experiments (see Figure 6) show that β = 3 is a good parameter value; choosing β > 3 gives similar results at the cost of a longer construction time.
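Reusing the helpers from the previous sketches, the stabilization step amounts to building β vocabularies independently, pooling their words and pruning the pooled set once more (again a sketch under our own naming assumptions):

```python
import numpy as np

def stabilized_vocabulary(descriptors, saliency, image_ids, target_size,
                          beta=3, n_initial=4096, seed=0):
    """Build beta vocabularies, combine their words and run the iterative selection again."""
    vocabularies = []
    for b in range(beta):
        init = random_initial_vocabulary(descriptors, n_words=n_initial, seed=seed + b)
        vocabularies.append(build_vocabulary(init, descriptors, saliency,
                                             image_ids, target_size))
    pooled = np.vstack(vocabularies)                    # combine the beta vocabularies
    return build_vocabulary(pooled, descriptors, saliency, image_ids, target_size)
```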
4. EXPERIMENTS

4.1 Experimental setup
We considered three datasets in our experiments:

- University of Kentucky Benchmark, introduced by Nistér and Stewénius [20]. In the remainder, we refer to this dataset as "UKB" to simplify the reading. UKB is really interesting for image retrieval and presents three main advantages. First, it is a large benchmark composed of 10,200 images grouped in sets of 4 images showing the same object; Figure 1 is a good illustration of the diversity within a set of 4 images (changes of point of view, illumination, rotation, etc.). Second, it is easily accessible and many results are available for an effective comparison; in our case, we compare our results to those obtained by Jégou et al. [10] and Nistér and Stewénius [20]. Third, the evaluation score on UKB is simple: it counts the average number of relevant images (including the query itself) that are ranked in the first four nearest neighbours when searching the 10,200 images.
Figure 1: An example set of 4 images showing the same object in UKB.

- PASCAL Visual Object Classes challenge 2012 [4], called PASCAL VOC2012. This benchmark is composed of 17,215 images, among which only 11,530 are classified. Note that while we only considered the 11,530 classified images when testing our approach on PASCAL VOC2012, we used the full dataset to construct the vocabulary. PASCAL VOC2012 images represent realistic scenes and are categorised into 20 object classes, e.g. person, bird, airplane, bottle, chair and dining table.
- MIRFLICKR-25000 [8]. This image collection is composed of 25,000 images, supplied by a total of 9,862 Flickr users. They are annotated with tags (mainly in English), e.g. sky, sunset, graffiti/street-art, people and food. We used this image collection only to build a vocabulary, in order to have an independent dataset for visual word selection.

Note also that for all our experiments, we chose to use the Harris-Laplace point detector.
4.2 Observations
Our first task was to select a distance measure for our low-level features; this is not in the scope of this work but is still needed to guarantee good results. Studies exist in the state of the art, e.g. [7]. We ran a few experiments with several distance measures, including L2, L1, Bhattacharyya and χ² (Chi-Square). We selected the L2 distance between keypoints and the χ² distance between image signatures, as this combination gave the best results for the k-Means version of the bag of visual words. All our experiments therefore use the same framework.
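For reference, one common formulation of the χ² distance between two (L2-normalized) signature histograms is sketched below; the exact variant used in our implementation may differ slightly:

```python
import numpy as np

def chi_square_distance(h1, h2, eps=1e-10):
    """Chi-square distance between two histogram signatures of equal length."""
    h1 = np.asarray(h1, dtype=float)
    h2 = np.asarray(h2, dtype=float)
    return 0.5 * np.sum((h1 - h2) ** 2 / (h1 + h2 + eps))
```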
Figure 2: k-Means-based Bag of Words vs. random selection of words.
Figure 3: Recursive selection scores for 2048 random initial words.
The second observation is presented in Figure 2. It shows the UKB scores with respect to the number of visual words for selected descriptors, comparing the "Bag-of-words" approach using k-Means with a simple random selection of words. The overall tendency is that the k-Means-based BoVW approach outperforms the simple random selection of words for small vocabularies. However, when the number of randomly selected words is high enough, the figure shows a stable score value for the random selection that is equivalent to the k-Means scores. Thus, simply taking a high number of random visual words yields equivalent scores.
For a thorough comparison with k-Means-based BoVW,
we also need to take into account the retrieval process time,
which is computationally linked to the number of words used
to create the image signature. Hence, a recursive selection
approach has been deployed to reduce the number of words
and therefore to increase efficiency without losing effectiveness.
4.3 Iterative selection approach results
In this section, we guide the reader through the results and discussions that lead to the construction of our final approach. To make this section clearer, we present the UKB score results only for the Colour Moment Invariants descriptor, as it gives the best UKB score with just a set of random visual words. In the next section, we compare all the selected descriptors.

As mentioned earlier, the random selection of visual words gives results as good as the k-Means clustering algorithm; however, more words are needed to obtain these results. Therefore, we decided to apply a selection of words as presented in Algorithm 1. Figure 3 presents the UKB scores with respect to a sub-selection of the 2048 initial visual words. These exhaustive results, obtained from different runs, highlight important facts. First, the results are very stable: starting from a random vocabulary, we see that for a given number of words the score lies within a quite narrow window. For example, around a sub-selection of 150 words, the score ranges approximately from 3.05 to 3.15. The second important fact is the high score values of these sub-selections. In the previous results, neither the standard BoVW nor the random selection of vocabulary gave a score over 3. In Figure 3, the visual vocabularies containing between 90 and 1000 words (1300 if the figure is extended) give a score higher than 3.

Figure 4: Recursive selection mean scores for 1024 to 65536 random initial words.
Figure 4 presents the mean UKB score of 10 runs with respect to the number of visual words, using different initial random sets from 1024 to 65536 words. There is a clear evolution of the results: the initial number of visual words affects the overall results. Starting with 1024 words is not enough, but going higher than 4096 words has no further clear effect on the results. For the next experiments, we select 4096 as the maximum number of initial random visual words.
4.4 Mixing extension
Figure 5 presents the mean UKB score of our approach with its stabilization process with respect to the number of visual words, for different initial random sets from 512 to 4096 words. Our hypothesis of grouping 3 × 3 vocabularies with high IG proved good: the maximum mean value for 3 × 3 − 4096 is almost 3.20, which is far higher than the previous results (approximately 3.12), with only a set of 250 to 300 visual words. We also observe that grouping 3 × 3 vocabulary sets gives a mean result of 3 for sets of 50 words, which is far better than the original BoVW approach in terms of efficiency and effectiveness.

Similar experiments have been made grouping vocabularies 2 × 2 to 10 × 10; the results are presented in Figure 6. This figure shows the interest of mixing at least 3 × 3 vocabularies together, as the UKB score does not improve significantly for higher grouping values.

Other grouping techniques have been investigated to improve the score, like mixing again the already grouped and re-selected vocabularies, or mixing the most informative vocabulary sets together. However, the results were not promising.

4.5 Global comparison

In this section, we present our experimental results using visual descriptors other than CMI, then we compare our approach (noted IteRaSel, for Iterative Random Selection) to state-of-the-art approaches, and finally we give the results on the PASCAL VOC2012 dataset. Table 1 gives the mean UKB scores of 9 vocabulary constructions of 256 words using 4096 initial random words with a 3 × 3 vocabulary grouping approach. The results show that for most of the selected features there is a significant improvement, approximately +7%, in the UKB score compared to the BoVW approach. This demonstrates that without adding any prior knowledge regarding which feature to use, and how to use it, our iterative random selection of visual words performs better.
Descriptor   BoVW   IteRaSel   %
CM           2.62   2.81       +7%
SURF         2.69   2.75       +2.75%
CMI          2.96   3.18       +7.4%
SIFT         2.16   2.30       +6.5%
OppSIFT      2.30   2.46       +6.9%

Table 1: UKB scores, BoVW vs IteRaSel
Figure 5: Recursive selection mean scores for 512 to 4096 random initial words, grouped 3 × 3.
Since our approach is not based on supervised classification, it is difficult to compare it to other state-of-the-art methods. However, Table 2 presents the UKB scores given in [10], to which we add our best score obtained with the CMI visual descriptor.
Method          Best score   vs IteRaSel %
BoVW            2.95         -7%
Fisher Vector   3.07         -5%
VLAD            3.17         -1.9%
IteRaSel        3.23         0%

Table 2: UKB scores: IteRaSel vs other methods
Table 3 gives the mean PASCAL VOC2012 score of 3 vocabulary constructions of 256 words using 4096 initial random visual words, without any mixing extension between vocabularies. As the table shows, the results of our approach on PASCAL VOC2012 are globally similar to those of the Bag of Visual Words approach. Therefore, the proposed easy and fast vocabulary construction for any visual feature may replace the BoVW construction stage without losing retrieval effectiveness. Note that our score is often higher with a different number of visual words. We also suppose that tuning the constructed vocabularies could improve the performance.
Figure 6: Mean scores for 1024 to 4096 random initial words, grouped 1 × 1 to 10 × 10.
Descriptor   BoVW   IteRaSel   %
CM           28.8   27.5       -4.6%
CMI          31.3   31.4       +0.3%
SIFT         37.7   36.5       -3.2%
OppSIFT      37.4   36.8       -1.6%
SURF         33.7   35.1       +4.1%

Table 3: PASCAL VOC2012 scores
4.6 Mixing features
In this section, we propose to mix the retrieval results from different features using a frequency-plus-rank-based ordering. Figures 7 and 8 show the precision (score) obtained for all possible combinations of features on UKB and PASCAL VOC2012 respectively. The global trend in these two figures is that mixing features improves the overall precision. When mixing "good" and "not so good" features, the precision of the mix tends to remain close to that of the best feature. Mixing all 5 features yields a precision close to the best mix, which means that without any knowledge of the nature of the dataset and which feature to use, it is always a good option to mix more features. Analysing these mixed results more closely, we note that the best UKB score (3.275) is obtained with a four-feature combination: CM, CMI, OppSIFT and SURF; whereas on PASCAL VOC2012 the best four-feature combination is OppSIFT, CMI, SURF and SIFT.
Figure 7: UKB scores with respect to the number of mixed features on the UKB dataset.
Figure 8: Precision with respect to the number of mixed features on the PASCAL VOC2012 dataset.
5. DISCUSSION
The results presented in this article are part of the experiments we made to confirm that the proposed approach outperforms the "bag of visual words" in several ways.

First, the time required to create the visual vocabulary is O(k ∗ w ∗ N), where k is the number of iterations needed to reduce the number of words w and N is the number of interest keypoints of all images in the dataset. Note that k depends on the threshold that selects the number of words to delete; it has been fixed to 10% in our experiments, which means k < 40 iterations to obtain 128 selected visual words from 4096 random words. Note also that at each iteration, only the keypoints that are no longer assigned because of word deletion are processed, which considerably reduces the computational cost.

A second advantage of our approach is that the selection of the visual words is based on information mixing a saliency map and the tf-idf scheme. This choice was made to get rid of the notion of a "best distance" to use with a particular low-level feature. Thus, it makes our approach easier to implement, as it is feature independent and no a priori knowledge of the feature is needed. The image signature is constructed using the selected visual words; if several low-level features are used, one signature per feature is created. The retrieval step may then return several lists of results, or combine them, for example with a round-robin algorithm (sketched below).

A third advantage of our framework is that the creation of the visual vocabulary is based on a random selection of words from a heterogeneous image dataset; no supervised or semi-supervised learning is needed. Experiments not reported here, using UKB as the training set, show lower precision on all testing sets, even on UKB, which confirms the importance of using a heterogeneous dataset. As we already mentioned, comparing our approach to others is difficult, since the comparison needs to be performed using the same low-level features and the same set of keypoints. That is why our experiments only compare the "bag of visual words" to our iterative random word selection approach, using the feature extraction tools of [25]. We are fully aware that the retrieval results are not at their best; e.g. no filtering has been applied and no selection of keypoints has been made.

Our results highlight that, given a standard content-based image retrieval framework, our proposed methodology to construct a visual vocabulary achieves a mean score above the well-known BoVW approach, with a 7% mean score increase on the UKB dataset. We also show that mixing several low-level features generally benefits most of the features, with a mixing score close to the best feature score. Thus, combining features is a good way to obtain good results on any dataset without any knowledge of the visual content of the images.
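When one result list is produced per feature, a round-robin merge such as the following sketch can combine them into a single ranking; this is only an illustration of the idea, not the fusion actually used in Section 4.6:

```python
def round_robin_merge(ranked_lists, max_results=None):
    """Interleave several ranked result lists, skipping images that were already returned."""
    merged, seen = [], set()
    longest = max(len(lst) for lst in ranked_lists)
    for rank in range(longest):
        for lst in ranked_lists:
            if rank < len(lst) and lst[rank] not in seen:
                seen.add(lst[rank])
                merged.append(lst[rank])
    return merged[:max_results] if max_results else merged
```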
6. CONCLUSIONS AND FUTURE WORKS
In this paper, we have introduced a new iterative random visual word selection framework for content-based image retrieval. Such an approach has the potential to "narrow down" the semantic gap a little more by constructing a visual vocabulary, which is one level above low-level features. The main objective of our proposal is to present an efficient and effective alternative to the well-known "bag of visual words" approach.

We propose to randomly select a large number of visual words. Then, an iterative process is performed to select the most informative visual words from the random set. The information gain we use is based on a saliency map as well as the tf-idf scheme. To improve the robustness of our proposal, we enhance our framework with a recursive word mixing process.

The experimental results demonstrate the potential benefits of the proposed approach. It is shown that, even starting from a random set of visual words, the results are very stable and higher than those of BoVW. We also highlight the fact that the approach is suitable for different low-level features and gives good results on several datasets.

The more we work on this framework, the more perspectives we envision. A first perspective is to try other existing saliency maps, in order to improve the information gain of the constructed vocabulary. Another perspective is the application of the same framework at the "bag of phrases" level, mentioned in the related work section. To conclude with a general direction, we are interested in investigating how the iterative selection process could be applied to other domains where a space quantization is necessary, which requires defining relevant domain-specific information gain functions.
7. REFERENCES
[1] H. Bay, T. Tuytelaars, and L. Van Gool. SURF: Speeded up robust features. In A. Leonardis, H. Bischof, and A. Pinz, editors, Computer Vision – ECCV 2006, volume 3951 of Lecture Notes in Computer Science, pages 404–417. Springer Berlin Heidelberg, 2006.
[2] G. Csurka, C. Bray, C. Dance, and L. Fan. Visual
categorization with bags of keypoints. Workshop on
Statistical Learning in Computer Vision, ECCV, pages
1–22, 2004.
[3] I. Elsayad, J. Martinet, T. Urruty, and C. Djeraba.
Toward a higher-level visual representation for
content-based image retrieval. Multimedia Tools Appl.,
60(2):455–482, 2012.
[4] M. Everingham, L. Van Gool, C. K. I. Williams,
J. Winn, and A. Zisserman. The PASCAL Visual
Object Classes Challenge 2012 (VOC2012) Results.
http://www.pascal-network.org/challenges/VOC/voc2012/workshop/index.html.
[5] J. Farquhar, S. Szedmak, H. Meng, and
J. Shawe-Taylor. Improving ”bag-of-keypoints” image
categorisation: Generative models and pdf-kernels.
PASCAL Eprint Series, 2005.
[6] K. Gao, S. Lin, Y. Zhang, S. Tang, and H. Ren.
Attention model based sift keypoints filtration for
image retrieval. In R. Y. Lee, editor, ACIS-ICIS,
pages 191–196. IEEE Computer Society, 2008.
[7] M. Halvey, P. Punitha, D. Hannah, R. Villa,
F. Hopfgartner, A. Goyal, and J. M. Jose. Diversity,
assortment, dissimilarity, variety: A study of diversity
measures using low level features for video retrieval. In
European Conference on Information Retrieval, pages
126–137, 2009.
[8] M. J. Huiskes and M. S. Lew. The mir flickr retrieval
evaluation. In MIR ’08: Proceedings of the 2008 ACM
International Conference on Multimedia Information
Retrieval, New York, NY, USA, 2008. ACM.
[9] L. Itti, C. Koch, and E. Niebur. A model of
saliency-based visual attention for rapid scene
analysis. IEEE Trans. Pattern Anal. Mach. Intell.,
20(11):1254–1259, Nov. 1998.
[10] H. Jégou, M. Douze, C. Schmid, and P. Pérez.
Aggregating local descriptors into a compact image
representation. In 23rd IEEE Conference on Computer
Vision & Pattern Recognition (CVPR ’10), pages
3304–3311, San Francisco, United States, 2010. IEEE
Computer Society.
[11] Y. Ke and R. Sukthankar. Pca-sift: a more distinctive
representation for local image descriptors. In
Computer Vision and Pattern Recognition, 2004.
CVPR 2004. Proceedings of the 2004 IEEE Computer
Society Conference on, volume 2, pages II–506–II–513
Vol.2, 2004.
[12] Y. Lei, X. Gui, and Z. Shi. Feature description and
image retrieval based on visual attention model.
Journal of Multimedia, 6(1):56–65, 2011.
[13] D. G. Lowe. Object recognition from local
scale-invariant features. International Conference on
Computer Vision, 2:1150–1157, 1999.
[14] D. G. Lowe. Distinctive image features from
scale-invariant keypoints. International Journal of
Computer Vision, 60:91–110, 2004.
[15] J. Martinet. Human-centered region selection and
weighting for image retrieval. In S. Battiato and
J. Braz, editors, VISAPP (1), pages 729–734.
SciTePress, 2013.
[16] J. M. Martínez. MPEG-7 overview. www.chiariglione.org/mpeg/standards/mpeg-7, 2003.
[17] O. Le Meur, P. Le Callet, D. Barba, and D. Thoreau. A coherent computational approach to model the bottom-up visual attention. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 28(5):802–817, 2006.
[18] K. Mikolajczyk and C. Schmid. A performance
evaluation of local descriptors. IEEE Transactions on
Pattern Analysis & Machine Intelligence,
27(10):1615–1630, 2005.
[19] F. Mindru, T. Tuytelaars, L. Van Gool, and T. Moons. Moment invariants for recognition under changing viewpoint and illumination. Computer Vision and Image Understanding, 94(1–3):3–27, 2004.
[20] D. Nistér and H. Stewénius. Scalable recognition with
a vocabulary tree. In IEEE Conference on Computer
Vision and Pattern Recognition (CVPR), volume 2,
pages 2161–2168, June 2006.
[21] L. Parsons, E. Haque, and H. Liu. Subspace clustering for high dimensional data: a review. ACM SIGKDD Explorations Newsletter, 6(1):90–105, 2004.
[22] F. Perronnin, C. Dance, G. Csurka, and M. Bressan.
Adapted vocabularies for generic visual categorization.
In ECCV, pages 464–475, 2006.
[23] F. Perronnin and C. R. Dance. Fisher kernels on
visual vocabularies for image categorization. In 2007
IEEE Computer Society Conference on Computer
Vision and Pattern Recognition (CVPR 2007), 18-23
June 2007, Minneapolis, Minnesota, USA. IEEE
Computer Society, 2007.
[24] J. Sivic and A. Zisserman. Video Google: A text
retrieval approach to object matching in videos. In
Proceedings of the International Conference on
Computer Vision, pages 1470–1477, Oct. 2003.
[25] K. E. A. van de Sande, T. Gevers, and C. G. M.
Snoek. Evaluating color descriptors for object and
scene recognition. IEEE Transactions on Pattern
Analysis and Machine Intelligence, 32(9):1582–1596,
2010.
[26] C. J. van Rijsbergen. Information Retrieval.
Butterworth, 1979.
[27] Z. Zdziarski and R. Dahyot. Feature selection using
visual saliency for content-based image retrieval. In
Signals and Systems Conference (ISSC 2012), IET
Irish, pages 1–6, 2012.
[28] L. Zhang, M. H. Tong, T. K. Marks, H. Shan, and
G. W. Cottrell. Sun: A bayesian framework for
saliency using natural statistics. J Vis, 8(7):32.1–20,
2008.
[29] S. Zhang, Q. Tian, G. Hua, Q. Huang, and W. Gao.
Generating descriptive visual words and visual phrases
for large-scale image applications. IEEE Transactions
on Image Processing, 20(9):2664–2677, 2011.