
An Efficient Search Algorithm for Content-Based Image Retrieval
with User Feedback
Alex Po Leung and Peter Auer
Department für Mathematik und Informationstechnologie
Montanuniversität Leoben
Franz-Josef-Straße 18, 8700, Leoben, Austria.
Abstract
We propose a probabilistic model for the relevance feedback of users looking for target images. This model takes
into account user errors and user uncertainty about distinguishing similarly relevant images. Based on this model,
we have developed an algorithm that selects images to
be presented to the user for further relevance feedback until a satisfactory image is found. In each query session, the
algorithm maintains weights on the images in the database
which reflect the assumed relevance of the images. Relevance feedback is used to modify these weights. As a second ingredient, the algorithm uses a minimax principle to
select images for presentation to the user: any response
of the user will provide significant information about his
query, such that relatively few feedback rounds are sufficient
to find a satisfactory image. We have implemented this algorithm and have conducted experiments on both simulated
data and real data which show promising results.
1 Introduction
Content-based image retrieval with relevance feedback
can be divided into two sub-problems:
• how we can conduct a specific search to find a suitable
image in as few iterations as possible, and
• how we can learn a good similarity measure among
images based on long-term user feedback from a large
number of user search sessions or user labels from
datasets.
The focus of this work is the efficient search for a suitable image within a small number of iterations, without testing users' patience. For content-based image retrieval with feedback, we take into account that user feedback is very expensive.
In previous work [10, 11, 12, 13], active learning has
been used to select images around the decision boundary
for user feedback, for speeding up the search process and
to boost the amount of information which can be obtained
from user feedback. However, images around the decision
boundary are usually difficult to label. A user might find
it hard to label images in between two categories. Such difficulties and noise in user feedback are not explicitly modeled or taken into account in most previous work.
In contrast, we explicitly model the noisy user feedback and select images for presentation to the user such that, after obtaining the user feedback, the algorithm can efficiently search for suitable images by eliminating images not matching the user's query.
To solve the second of the two sub-problems, i.e. long-term learning, it is necessary to find a reasonable similarity measure among the images. In this paper, we do not address this problem, but we note that user labels have recently become easily obtainable because of the technological advances of the Internet. Large amounts of data for high-level features can be found in databases with user labels, often called image tags, such as Flickr, Facebook and Pbase. The popularity of these databases enhances the accuracy of image search engines. For example, the Yahoo image search engine uses tags from images on Flickr. Thus we will consider a combination of low-level visual features and high-level features obtained from user labels, and we assume that a reasonably good similarity measure among images can be defined using these features. In our experiments we will use a similarity measure based on the 2-norm. A combination of keywords and visual features has also been used in [3] and [4].
1.1 Previous Work
Traditionally, content-based image retrieval with user feedback is considered a learning problem using data from user feedback, and with visual features most previous work assumes that no labels describing the images in the datasets are available [11, 14, 15, 16].
Input: the images x in the database D, the similarity measure Φ, the relevance factor β > 1, and the number of images N to be presented in each iteration
Output: a suitable image I

Initialize all relevance weights: wx = 1.
for t = 1, 2, . . . do
    Calculate cluster centers c1, . . . , cN ∈ D by weighted K-means, based on Φ and the weights wx.
    Present images c1, . . . , cN to the user.
    if one of the images is suitable then
        Stop.
    end
    Let ci be the image selected by the user as most relevant.
    For any image x which is more similar to ci than to any other of the centers c1, . . . , cN, update the relevance weight: wx = β · wx.
end
Figure 1. Algorithms 1 and 2 use weighted K-means for clustering, with the relevance weights of the images updated according to the user feedback.
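As an illustration, the following is a minimal Python sketch of Algorithm 1 (and of Algorithm 2, which is the same procedure with N = 20). It is our own sketch, not the authors' implementation: scikit-learn's KMeans with sample weights stands in for the paper's weighted K-means variant, the Euclidean 2-norm stands in for Φ, and all function and parameter names are ours.

# A minimal sketch of Algorithm 1 (Figure 1); our own illustration, not the
# paper's implementation. Weighted K-means is approximated by scikit-learn's
# KMeans with sample_weight.
import numpy as np
from sklearn.cluster import KMeans

def noisy_nary_search(X, ask_user, beta=2.0, n_present=2, max_iters=500):
    """Search the database X (n_images x n_features) for a suitable image.

    ask_user(centre_indices) returns None if one of the presented images is
    suitable, otherwise the position i of the centre c_i judged most relevant.
    """
    w = np.ones(len(X))                                  # w_x = 1 for all x
    for t in range(1, max_iters + 1):
        # Partition the database into N regions of roughly equal total weight.
        km = KMeans(n_clusters=n_present, n_init=10)
        labels = km.fit_predict(X, sample_weight=w)
        # Represent each region by the database image closest to its centroid.
        centres = [int(np.argmin(((X - c) ** 2).sum(axis=1)))
                   for c in km.cluster_centers_]
        choice = ask_user(centres)
        if choice is None:                               # suitable image found
            return centres, t
        w[labels == choice] *= beta                      # boost the chosen region
    return None, max_iters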
Metric functions measuring similarity based on low-level visual features are obtained by discriminative methods. Long-term learning is used with training datasets from the feedback of different users [5, 6, 7, 8, 9]. However, because of different perceptions about
the same object, different users may give different kinds of
feedback for the same query target. Short-term learning using feedback from a single user in a single search session
can be used to deal with the different perceptions of objects.
Weighting the importance of different low-level features is
often used for short-term learning (e.g. PicSOM [2]).
The use of user feedback as training data has played an
important role in most recent work [17, 18, 19, 20, 21].
Feedback is used as positive or negative labels for training. But since the user chooses the most relevant image in each iteration, an image may be chosen even if it is rather dissimilar to any suitable image. Furthermore, images predicted to be positive examples by discriminative methods are traditionally selected for presentation in each round. Thus, mistakes of the discriminative method might hinder the search significantly by ignoring the part of the search space containing images which are incorrectly predicted as negative.
2 Our Approach
Assuming we have a reasonable feature vector or a good similarity measure obtained from high-level and low-level visual features, we present images such that we get informative feedback and can find a suitable image in a small number of iterations despite the noisy user feedback, thereby minimizing the retrieval cost for the user. The noise is due to:
• the user finding it hard to make certain choices,
• human error, and
• the fact that we cannot expect the similarity measure to be perfect.
2.1 The User Model
Suppose we have a database D of images x for image
retrieval, and we can measure how close any given image x
in the database is to a suitable image I, using a similarity
function Φ(x, I). We also assume that there is a limit N on
the number of images presented to the user in each iteration,
because of users’ inability to handle a large amount of data.
Let T be the number of iterations required for the retrieval
of a suitable image, and let St be the image subset presented
in iteration t, St ⊆ D. As the number of user responses
determines the retrieval costs, our objective is to minimize
T.
When images are presented to the user, the user gives
feedback by selecting a single image. The reliability of the
user feedback is dependent on the similarities of the presented images to suitable images. If a suitable image is
equally similar to two presented images, the user may struggle to tell which one of the two presented images is more
relevant. Thus, we consider the following noisy feedback
model: image x ∈ S is selected with probability
$$P(x \text{ is selected}) = (1 - \alpha \cdot |S|)\,\frac{\Phi(x, I)}{\sum_{y \in S} \Phi(y, I)} + \alpha \qquad (1)$$

where I is a suitable image, Φ is a similarity measure, and α is a constant noise rate. Possible similarity measures are

$$\Phi(v, I) = v \cdot I \qquad (2)$$

and

$$\Phi(v, I) = \exp(-a\,\|v - I\|^2). \qquad (3)$$
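As a sketch of this user model (our own code, with hypothetical names), the selection probability of Equation (1), combined with the Gaussian similarity of Equation (3), can be simulated as follows; note that Equation (1) requires α · |S| ≤ 1:

# A sketch of the noisy feedback model in Equation (1) with the Gaussian
# similarity of Equation (3). Our own illustration; requires alpha * |S| <= 1.
import numpy as np

def phi(v, target, a=1.0):
    # Equation (3): Phi(v, I) = exp(-a * ||v - I||^2)
    return np.exp(-a * np.sum((np.asarray(v) - target) ** 2, axis=-1))

def simulate_choice(presented, target, alpha=0.1, a=1.0, rng=None):
    """Draw the index of the image a noisy user selects from `presented`."""
    rng = rng or np.random.default_rng()
    similarities = phi(presented, target, a)
    # Equation (1): a similarity-proportional choice mixed with uniform noise.
    p = (1.0 - alpha * len(presented)) * similarities / similarities.sum() + alpha
    return int(rng.choice(len(presented), p=p))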
2.2 N-ary Search with Noise
Binary search is an efficient search method with logarithmic complexity. Our search can be formulated as a multi-dimensional N-ary search with noisy feedback. Our algorithm is listed in Figure 1. In contrast to binary search without noise, where irrelevant items can be discarded, we cope
with noise by putting weights wx on the images x. The
algorithm tries to divide the search space into N equally
sized regions, where the size is measured by the sum of the
weights on the images in the region. This is achieved by
a variant of weighted K-means. Then each region is represented by its center, and these center images are presented
to the user. The images in the region considered most relevant by the user (based on the center image) receive higher
weights in the next iteration of the search. This process continues until a suitable image is found.
When only two images are displayed, it can sometimes be hard for the user to tell which image is closer to the target, especially in the early stage of a search session. We therefore extend our idea of modeling the search as a binary search with noisy information to a practical search algorithm with 20 clusters. Algorithm 2, instead of finding two clusters at each iteration as Algorithm 1 in Figure 1, looks for 20 clusters with the 20 centroids as displayed images for user feedback. High-level features from user labels and low-level visual features from colors, textures and edge orientations are utilized to form the feature vector. The use of this combination of high-level and low-level features and the 20 clusters gives us a practical and efficient image search algorithm. Encouraging results showing the performance of the algorithms can be found in the experiment section.
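For concreteness, Algorithm 2 then amounts to running the search sketch from Figure 1 with 20 clusters. The snippet below (all names and numbers illustrative, reusing the hypothetical functions sketched earlier) wires it to a user simulated via Equation (1):

# Hypothetical usage: Algorithm 2 is the search sketch from Figure 1 run with
# N = 20 presented images, here against a user simulated via Equation (1).
import numpy as np

X = np.random.rand(9963, 761)        # stand-in for the combined feature vectors
target = X[42]                       # stand-in for the user's target image

def ask_user(centres):
    if any(np.array_equal(X[c], target) for c in centres):
        return None                  # a presented image is suitable (the target)
    return simulate_choice(X[centres], target, alpha=0.1)

presented, n_iterations = noisy_nary_search(X, ask_user, beta=2.0, n_present=20)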
Figure 2. Experiment 1: Algorithm 1 with synthesized data and varying α (average number of iterations; curves for α = 0.1, 0.2, 0.3). When α = 0.1, the average number of iterations stays around 20 with a small β. When α = 0.2, the average number of iterations is a bit higher but still around 30. However, when α = 0.3, the average number of iterations goes up to around 80.

3 Experiments

The objective of our experiments is to evaluate: (1) the efficiency of our search algorithm, (2) how well it copes with different sets of data, and (3) how well it copes with noise. We use the VOC2007 dataset with 23 categories and 9963 images. All images contain at least one labeled object. The dataset was originally built for the PASCAL Visual Object Classes Challenge 2007 [1] to recognize objects from a number of visual object classes in realistic scenes. There are twenty-three object classes selected:
• Person: person, foot, hand, head
• Animal: bird, cat, cow, dog, horse, sheep
• Vehicle: aeroplane, bicycle, boat, bus, car, motorbike, train
• Indoor: bottle, chair, dining table, potted plant, sofa, tv/monitor
Figure 3. Experiment 2: Algorithm 1 without noise from the user model. As expected, the algorithm performs best without noise. The figure shows the average number of iterations required to find the target with different β; the number can be as low as 15.
Figure 4. Experiment 3: With the 23-dimensional high-level feature vector from the VOC2007 dataset, Algorithm 1 can find the target image in about 50 iterations with an appropriate β when α = 0.1 or α = 0.2. When α = 0.3, the algorithm can find the target in about 80 iterations.
For each of the 9963 images in the dataset, there is one
corresponding annotation file giving a bounding box and an
object-class label for each object in one of the twenty-three
classes present in the image. Multiple objects from multiple classes may be present in the same image. To see how
well our search algorithms perform, both synthesized data
and images in the VOC2007 dataset are used for the empirical evaluation of the expected number of iterations required.
Our experiments use object sizes as high-level features from
VOC2007. However, other high-level features could also be used, such as the number of objects in the same category. There are 23 categories, so the high-level feature vector is 23-dimensional, where each entry is the object size (given by the bounding box). When an object does not exist in an image, the entry is 0. For the synthesized data, each entry of the 23-dimensional feature vector is a random number between 0 and 1. Ten thousand feature vectors are generated, representing ten thousand images, for the experiments with synthesized data.
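Under the same assumptions as the sketches above (all names ours), this synthesized-data setup can be reproduced roughly as follows, averaging the number of iterations over repeated sessions as in the experiments:

# A rough sketch of the synthesized-data experiments, reusing the hypothetical
# noisy_nary_search and simulate_choice functions sketched earlier.
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((10000, 23))          # ten thousand 23-dim random feature vectors

def average_iterations(alpha, beta, n_present=2, n_runs=3):
    counts = []
    for _ in range(n_runs):          # averaged over repeated sessions
        target = X[rng.integers(len(X))]
        def ask_user(centres):
            if any(np.array_equal(X[c], target) for c in centres):
                return None          # the target itself was presented
            return simulate_choice(X[centres], target, alpha=alpha, rng=rng)
        _, t = noisy_nary_search(X, ask_user, beta=beta, n_present=n_present)
        counts.append(t)
    return float(np.mean(counts))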
Four sets of experiments were conducted to demonstrate the performance of our algorithms, either with a constant error rate (α in Equation 1) and noise altering the correct user feedback, or without any noise from the user feedback (i.e., the parameter a in Equation 3 approaching infinity):
• Experiment 1: Algorithm 1 with synthesized data (Figure 2),
• Experiment 2: Algorithm 1 with synthesized data and no noise (Figure 3),
• Experiment 3: Algorithm 1 with VOC2007 (Figure 4), and
• Experiment 4: Algorithm 1 with VOC2007 and feature vectors normalized with the 2-norm (Figure 5).

Figure 5. Experiment 4: With a normalized feature vector from the VOC2007 dataset, Algorithm 1 performs better and finds the target image in 20 iterations when α = 0.1. It shows a similar performance for α = 0.2 and α = 0.3 as without normalization in Figure 4.
To reduce statistical fluctuations, each curve in the experiments is plotted using the average from three repeated experiments with the same set of parameters.
Experiments 1, 2, 3 and 4 are conducted with just N = 2 presented images in each iteration. Experiment 5 is conducted with N = 20 presented images in each iteration.
In Experiment 1, the performance of Algorithm 1 with synthesized data and varying α is demonstrated. When α = 0.1, the average number of iterations stays around 20 with a small β. When α = 0.2, the average number of iterations is a bit higher but still around 30. However, when α = 0.3, the average number of iterations goes up to around 80.
In Experiment 2, Algorithm 1 without noise from the user model performs best, as expected. Figure 3 shows the average number of iterations required to find the target with different β; the number can be as low as 15.
In Experiment 3, with the 23-dimensional high-level feature vector from the VOC2007 dataset, Algorithm 1 can find
the target image in about 50 iterations with an appropriate
β when α = 0.1 or α = 0.2. When α = 0.3, the algorithm
can find the target in about 80 iterations.
Table 1. Search for a car on grass in the VOC2007 dataset by a real user with high-level and low-level features: Iterations 1 and 2.

Table 2. Search for a car on grass in the VOC2007 dataset by a real user with high-level and low-level features: chosen images for Iterations 3, 4, 5 and 8 respectively.

Table 3. Search for a motorbike on grass in the VOC2007 dataset by a real user with high-level and low-level features: chosen images for Iterations 1, 2, 3, 4, 5, 6, 9 and 10 respectively.
In Experiment 4, with the same feature vector from the VOC2007 dataset but normalized with the 2-norm, Algorithm 1 performs better and finds the target image in 20 iterations when α = 0.1. It shows a similar performance for α = 0.2 and α = 0.3 as without normalization in Figure 4.
The last set of experiments (Experiment 5), with a real user, demonstrates the performance of Algorithm 2, which looks for 20 clusters with the 20 centroids as displayed images for user feedback. High-level features from user labels and low-level visual features from colors, textures and edge orientations are utilized to form the feature vector. The vector includes colors, textures, labeled object sizes and edge orientations with 23 (labels) + 738 (visual features) = 761 dimensions. The low-level visual features used are the same set of features as in the PicSOM system [2]. The use of this combination of high-level and low-level features and the 20 clusters gives us a practical and efficient image search algorithm, as shown in Tables 1, 2 and 3.
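To illustrate the composition of this vector (a sketch with placeholder arrays, not the actual PicSOM feature extraction), the 23 high-level dimensions and 738 low-level dimensions are simply concatenated and, as in Experiment 4, can be normalized with the 2-norm:

# Illustrative composition of the 761-dimensional feature vector; the arrays
# here are placeholders, not the actual PicSOM feature extraction.
import numpy as np

high_level = np.zeros(23)            # object sizes per class, 0 if absent
low_level = np.random.rand(738)      # colors, textures, edge orientations
feature = np.concatenate([high_level, low_level])    # 23 + 738 = 761 dims
feature = feature / np.linalg.norm(feature)          # optional 2-norm scaling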
4 Conclusion
In this work, we consider a probabilistic model for
content-based image retrieval which takes into account
noise from user feedback. An algorithm based on binary
search with noise is proposed and evaluated with experiments using both synthesized data and real data. We extend these ideas and build a practical search algorithm using
high-level and low-level features. The algorithm devised with our approach produces promising results in our experiments.
5 Acknowledgement
The research leading to these results has received funding from the European Community’s Seventh Framework
Programme (FP7/2007-2013) under grant agreement no. 216529.
References
[1] M. Everingham, L. Van Gool, C.K.I. Williams, J. Winn, and A. Zisserman, “The PASCAL Visual Object Classes Challenge 2007 Results”, 2007, http://www.pascal-network.org/challenges/VOC/voc2007/workshop.
[2] Markus Koskela, Jorma Laaksonen, and Erkki
Oja. “Inter-Query Relevance Learning in PicSOM
for Content-Based Image Retrieval”. In Supplementary Proceedings of 13th International Conference on Artificial Neural Networks / 10th International Conference on Neural Information Processing
(ICANN/ICONIP 2003). Istanbul, Turkey. June 2003.
[3] F. Jing, M. Li, H. Zhang, and B. Zhang, “A unified
framework for image retrieval using keyword and visual features”, IEEE Transactions on Image Processing, 2005, pp.979-989.
[4] X.S. Zhou and T.S. Huang, “Unifying Keywords and
Visual Contents in Image Retrieval”, IEEE MultiMedia, 2002, pp.23-33.
[5] X. He, O. King, W. Ma, M. Li, and H. Zhang, “Learning a semantic space from user’s relevance feedback
for image retrieval”, IEEE Trans. Circuits Syst. Video
Techn., 2003, pp.39-48.
[6] J. Fournier and M. Cord, “Long-term similarity learning in content-based image retrieval”, Proc. ICIP (1),
2002, pp.441-444.
[7] M. Koskela and J. Laaksonen, “Using Long-Term
Learning to Improve Efficiency of Content-Based Image Retrieval”, Proc. PRIS, 2003, pp.72-79.
[8] Jacob Linenthal and Xiaojun Qi, “An Effective Noise-Resilient Long-Term Semantic Learning Approach to Content-Based Image Retrieval”, IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP’08), March 30-April 4, Las Vegas, Nevada, USA, 2008.
[9] Michael Wacht, Juan Shan, and Xiaojun Qi, “A Short-Term and Long-Term Learning Approach for Content-Based Image Retrieval”, Proc. of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP’06), pp. 389-392, Toulouse, France, May 14-19, 2006.
[10] C. Zhang and T. Chen, “An active learning framework
for content-based information retrieval”, IEEE Transactions on Multimedia, 2002, pp.260-268.
[11] S. Tong and E.Y. Chang, “Support vector machine active learning for image retrieval”, Proc. ACM Multimedia, 2001, pp.107-118.
[12] P.-H. Gosselin, M. Cord, S. Philipp-Foliguet, “Active
learning methods for Interactive Image Retrieval” ,
IEEE Transactions on Image Processing, 2008.
[13] E. Chang, S. Tong, K. Goh, and C. Chang, “Support
Vector Machine Concept-Dependent Active Learning
for Image Retrieval”, IEEE Transactions on Multimedia, 2005.
[14] Y. Chen, X.S. Zhou, and T.S. Huang, “One-class SVM
for learning in image retrieval”, Proc. ICIP (1), 2001,
pp.34-37.
[15] Y. Rui and T.S. Huang, “Optimizing Learning in Image Retrieval”, Proc. CVPR, 2000, pp.1236-1236.
[16] J. Rocchio, “Relevance Feedback in Information Retrieval”, in Salton (ed.): The SMART Retrieval System: Experiments in Automatic Document Processing, Chapter 14, pages 313-323, Prentice-Hall, 1971.
[17] Remco C. Veltkamp, Mirela Tanase, “Content-based
Image Retrieval Systems: a Survey”. State-of-the-Art
in Content-Based Image and Video Retrieval 1999:
97-124.
[18] A.W.M. Smeulders, M. Worring, S. Santini, A. Gupta,
and R. Jain, “Content-Based Image Retrieval at the
End of the Early Years”, IEEE Trans. Pattern Anal.
Mach. Intell., 2000, pp.1349-1380.
[19] M. Crucianu, M. Ferecatu, and N. Boujemaa, “Relevance feedback for image retrieval: a short survey”, 20 p., State of the Art in Audiovisual Content-Based Retrieval, Information Universal Access and Interaction, Including Datamodels and Languages, report of the DELOS2 European Network of Excellence (FP6), 2004.
[20] M.S. Lew, N. Sebe, C. Djeraba, and R. Jain, “Content-based multimedia information retrieval: State of the art and challenges”, TOMCCAP, 2006, pp.1-19.
[21] R. Datta, D. Joshi, J. Li, and J.Z. Wang, “Image retrieval: Ideas, influences, and trends of the new age”,
ACM Comput. Surv., 2008.