An Efficient Search Algorithm for Content-Based Image Retrieval with User Feedback

Alex Po Leung and Peter Auer
Department für Mathematik und Informationstechnologie, Montanuniversität Leoben, Franz-Josef-Straße 18, 8700 Leoben, Austria

Abstract

We propose a probabilistic model for the relevance feedback of users looking for target images. This model takes into account user errors and user uncertainty about distinguishing similarly relevant images. Based on this model, we have developed an algorithm which selects images to be presented to the user for further relevance feedback until a satisfactory image is found. In each query session, the algorithm maintains weights on the images in the database which reflect the assumed relevance of the images. Relevance feedback is used to modify these weights. As a second ingredient, the algorithm uses a minimax principle to select images for presentation to the user: any response of the user will provide significant information about his query, such that relatively few feedback rounds are sufficient to find a satisfactory image. We have implemented this algorithm and have conducted experiments on both simulated and real data which show promising results.

1 Introduction

Content-based image retrieval with relevance feedback can be divided into two sub-problems:

• how we can conduct a specific search to find a suitable image in as few iterations as possible, and
• how we can learn a good similarity measure among images based on long-term user feedback from a large number of user search sessions or user labels from datasets.

The focus of this work is the efficient search for a suitable image within a small number of iterations, without testing users' patience. For content-based image retrieval with feedback, we consider the fact that user feedback is very expensive.
In previous work [10, 11, 12, 13], active learning has been used to select images around the decision boundary for user feedback, to speed up the search process and to boost the amount of information which can be obtained from user feedback. However, images around the decision boundary are usually difficult to label: a user might find it hard to label images in between two categories. Such difficulties and noise in user feedback are not explicitly modeled or taken into account in most previous work. In contrast, we explicitly model the noisy user feedback and select images for presentation to the user such that, after obtaining the user feedback, the algorithm can efficiently search for suitable images by eliminating images not matching the user's query.

To solve the second of the two sub-problems, i.e. long-term learning, it is necessary to find a reasonable similarity measure among the images. In this paper, we do not address this problem. But we note that user labels have recently become easily obtainable because of the technological advances of the Internet. Large amounts of data for high-level features can be found in databases with user labels, often called image tags, such as Flickr, Facebook, and Pbase. The popularity of these databases enhances the accuracy of image search engines. For example, the Yahoo image search engine uses tags from images on Flickr. Thus we will consider a combination of low-level visual features and high-level features obtained from user labels, and we assume that a reasonably good similarity measure among images can be defined using these features. In our experiments we will use a similarity measure based on the 2-norm. A combination of keywords and visual features has also been used in [3] and [4].
1.1 Previous Work

Traditionally, content-based image retrieval with user feedback is considered a learning problem using data from user feedback, and with visual features most previous work assumes that no label describing the images in the datasets is available [11, 14, 15, 16]. Metric functions measuring similarity based on low-level visual features are obtained by discriminative methods. Long-term learning is used with training datasets from the feedback of different users [5], [6], [7], [8] and [9]. However, because of different perceptions of the same object, different users may give different kinds of feedback for the same query target. Short-term learning, using feedback from a single user in a single search session, can be used to deal with these different perceptions; weighting the importance of different low-level features is often used for short-term learning (e.g. PicSOM [2]). The use of user feedback as training data has played an important role in most recent work [17, 18, 19, 20, 21], where feedback is used as positive or negative labels for training. But as the user chooses the most relevant image in each iteration, such an image may be chosen even if it is rather dissimilar to any suitable image.

Input: the images x in the database D, the similarity measure Φ, the relevance factor β > 1, and the number of images N to be presented in each iteration
Output: a suitable image I

    Initialize all relevance weights: wx = 1.
    for t = 1, 2, ... do
        Calculate cluster centers c1, ..., cN ∈ D by weighted K-means, based on Φ and the weights wx.
        Present images c1, ..., cN to the user.
        if one of the images is suitable then
            Stop.
        end
        Let ci be the image selected by the user as most relevant.
        For any image x which is more similar to ci than to any other image among c1, ..., cN, update the relevance weight: wx = β · wx.
    end

Figure 1. Algorithms 1 and 2 use weighted K-means for clustering, with discounts given to the weights according to the user feedback.
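As an illustration, the loop in Figure 1 might be sketched in Python as follows. This is a minimal sketch under several assumptions the paper does not fix: a plain weighted K-means with random initialization and a fixed iteration count, centers snapped to the nearest database image (so that c1, ..., cN ∈ D), and hypothetical callbacks `is_suitable` and `pick_most_relevant` standing in for the user's judgments.

```python
import random

def dist2(p, q):
    # Squared Euclidean distance between two feature vectors.
    return sum((a - b) ** 2 for a, b in zip(p, q))

def weighted_kmeans(points, weights, k, iters=20, rng=random):
    # Plain weighted K-means: each center moves to the weighted mean of
    # its assigned points. Initialization and iteration count are
    # illustrative choices, not specified in the paper.
    centers = [list(c) for c in rng.sample(points, k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p, w in zip(points, weights):
            j = min(range(k), key=lambda c: dist2(p, centers[c]))
            clusters[j].append((p, w))
        for j, members in enumerate(clusters):
            total = sum(w for _, w in members)
            if total > 0:
                dim = len(centers[j])
                centers[j] = [sum(w * p[d] for p, w in members) / total
                              for d in range(dim)]
    return centers

def search(database, is_suitable, pick_most_relevant,
           n=2, beta=2.0, max_iters=100, rng=random):
    # Sketch of the search loop in Figure 1: maintain relevance weights,
    # present the n cluster centers (snapped to actual database images),
    # and multiply by beta the weights of all images closest to the
    # center the user selects as most relevant.
    weights = {tuple(x): 1.0 for x in database}
    for t in range(max_iters):
        w = [weights[tuple(x)] for x in database]
        centers = weighted_kmeans(database, w, n, rng=rng)
        centers = [min(database, key=lambda x, c=c: dist2(x, c))
                   for c in centers]
        for c in centers:
            if is_suitable(c):
                return c, t + 1
        chosen = centers.index(pick_most_relevant(centers))
        for x in database:
            if min(range(n), key=lambda j, x=x: dist2(x, centers[j])) == chosen:
                weights[tuple(x)] *= beta
    return None, max_iters
```

A noise-free "user" can be simulated by a `pick_most_relevant` callback that always returns the presented center closest to a fixed target vector.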
Furthermore, images predicted to be positive examples by discriminative methods are traditionally selected for presentation in each round. Thus, mistakes of the discriminative method might hinder progress in the search significantly, by ignoring parts of the search space with images which are incorrectly predicted as negative.

2 Our Approach

Assuming we have a reasonable feature vector or a good similarity measure obtained from high-level features and low-level visual features, we present images such that we can get informative feedback and find a suitable image in a small number of iterations despite the noisy user feedback, minimizing the retrieval cost for the user. The noise is due to:

• the user finding it hard to make certain choices,
• human errors, and
• the fact that we cannot expect the similarity measure to be perfect.

2.1 The User Model

Suppose we have a database D of images x for image retrieval, and we can measure how close any given image x in the database is to a suitable image I using a similarity function Φ(x, I). We also assume that there is a limit N on the number of images presented to the user in each iteration, because of users' inability to handle a large amount of data. Let T be the number of iterations required for the retrieval of a suitable image, and let St ⊆ D be the image subset presented in iteration t. As the number of user responses determines the retrieval cost, our objective is to minimize T. When images are presented to the user, the user gives feedback by selecting a single image. The reliability of the user feedback depends on the similarities of the presented images to suitable images: if a suitable image is equally similar to two presented images, the user may struggle to tell which of the two is more relevant.
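To make the feedback model concrete: a presented image x ∈ S is selected with probability (1 − α·|S|) · Φ(x, I) / Σ_{y∈S} Φ(y, I) + α, as stated in Equation (1) below. The following minimal sketch simulates a user who follows this model, using an Equation-(3)-style similarity with an illustrative bandwidth a (the paper does not report the value used):

```python
import math
import random

def similarity(v, target, a=1.0):
    # Equation-(3)-style similarity Phi(v, I) = exp(-a * ||v - I||^2).
    # The bandwidth a is an illustrative setting, not from the paper.
    return math.exp(-a * sum((x - t) ** 2 for x, t in zip(v, target)))

def simulate_user_choice(presented, target, alpha=0.1, rng=random):
    # Sample the user's pick from the noisy feedback model:
    #   P(x selected) = (1 - alpha*|S|) * Phi(x, I) / sum_y Phi(y, I) + alpha.
    # Requires alpha * len(presented) < 1 so the probabilities are valid;
    # they sum to 1 by construction.
    sims = [similarity(x, target) for x in presented]
    total = sum(sims)
    probs = [(1.0 - alpha * len(presented)) * s / total + alpha for s in sims]
    r, cum = rng.random(), 0.0
    for image, p in zip(presented, probs):
        cum += p
        if r < cum:
            return image
    return presented[-1]
```

With α = 0 the model reduces to a softmax-like choice driven purely by similarity; the additive α term gives every presented image a floor probability, capturing human error.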
Thus, we consider the following noisy feedback model: image x ∈ S is selected with probability

    P(x is selected) = (1 − α·|S|) · Φ(x, I) / Σ_{y∈S} Φ(y, I) + α    (1)

where I is a suitable image, Φ is a similarity measure, and α is a constant noise rate. Possible similarity measures are

    Φ(v, I) = v · I, and    (2)
    Φ(v, I) = exp(−a·||v − I||²).    (3)

2.2 N-ary Search with Noise

Binary search is an efficient search method with logarithmic complexity, and our search can be formulated as a multidimensional N-ary search with noisy feedback. Our algorithm is listed in Figure 1. In contrast to binary search without noise, where irrelevant items can be discarded, we cope with noise by putting weights wx on the images x. The algorithm tries to divide the search space into N equally sized regions, where the size of a region is measured by the sum of the weights of its images. This is achieved by a variant of weighted K-means. Each region is then represented by its center, and these center images are presented to the user. The images in the region considered most relevant by the user (judged by its center image) receive higher weights in the next iteration of the search. This process continues until a suitable image is found.

When only two images are displayed, it can sometimes be hard for the user to tell which image is closer to the target, especially in the early stage of a search session. We therefore extend the binary search with noisy information to a practical search algorithm with 20 clusters. Algorithm 2, instead of finding two clusters at each iteration as Algorithm 1 in Figure 1 does, looks for 20 clusters and displays the 20 centroids for user feedback. High-level features from user labels and low-level visual features from colors, textures, and edge orientations are combined to form the feature vector. This combination of high-level and low-level features and the 20 clusters gives us a practical and efficient image search algorithm. Encouraging results showing the performance of both algorithms can be found in the experiments section.

3 Experiments

The objective of our experiments is to evaluate: (1) the efficiency of our search algorithm, (2) how well it copes with different sets of data, and (3) how well it copes with noise. We use the VOC2007 dataset with 23 categories and 9963 images. All images contain at least one labeled object. The dataset was originally built for the PASCAL Visual Object Classes Challenge 2007 [1] to recognize objects from a number of visual object classes in realistic scenes. The twenty-three object classes are:

• Person: person, foot, hand, head
• Animal: bird, cat, cow, dog, horse, sheep
• Vehicle: aeroplane, bicycle, boat, bus, car, motorbike, train
• Indoor: bottle, chair, dining table, potted plant, sofa, tv/monitor

For each of the 9963 images in the dataset, there is one corresponding annotation file giving a bounding box and an object-class label for each object in one of the twenty-three classes present in the image. Multiple objects from multiple classes may be present in the same image.

To see how well our search algorithms perform, both synthesized data and images from the VOC2007 dataset are used for the empirical evaluation of the expected number of iterations required. Our experiments use object sizes as high-level features from VOC2007, although other high-level features could also be used, such as the number of objects in the same category. As there are 23 categories, the high-level feature vector is 23-dimensional, where each entry is the object size (given by the bounding box); when an object does not exist in an image, the entry is 0. For the synthesized data, each entry of the 23-dimensional feature vector is a random number between 0 and 1. Ten thousand feature vectors are generated, representing ten thousand images in the experiments with synthesized data.

Four sets of experiments were conducted to demonstrate the performance of our algorithms, either with a constant error rate (α in Equation 1) altering the correct user feedback, or without any noise from the user feedback (i.e., with the parameter a in Equation 3 approaching infinity):

• Experiment 1: Algorithm 1 with synthesized data (Figure 2),
• Experiment 2: Algorithm 1 with synthesized data and no noise (Figure 3),
• Experiment 3: Algorithm 1 with VOC2007 (Figure 4), and
• Experiment 4: Algorithm 1 with VOC2007 and feature vectors normalized with the 2-norm (Figure 5).

To reduce statistical fluctuations, each curve in the experiments is plotted using the average of three repeated runs with the same set of parameters. Experiments 1, 2, 3, and 4 are conducted with just N = 2 presented images in each iteration; Experiment 5 is conducted with N = 20 presented images in each iteration.

In Experiment 1, the performance of Algorithm 1 with synthesized data and varying α is demonstrated. When α = 0.1, the average number of iterations stays around 20 with a small β. When α = 0.2, the average number of iterations is somewhat higher, but still around 30. However, when α = 0.3, the average number of iterations goes up to around 80.

[Plot omitted: average number of iterations for α = 0.1, 0.2, 0.3.]
Figure 2. Experiment 1: Algorithm 1 with synthesized data and varying α. When α = 0.1, the average number of iterations stays around 20 with a small β. When α = 0.2, the average number of iterations is somewhat higher, but still around 30. However, when α = 0.3, the average number of iterations goes up to around 80.

In Experiment 2, Algorithm 1 without noise from the user model naturally performs best. Figure 3 shows the average number of iterations required to find the target with different β; the number can be as low as 15.

[Plot omitted: average number of iterations.]
Figure 3. Experiment 2: Algorithm 1 without noise from the user model. Without noise, the algorithm naturally performs best. The figure shows the average number of iterations required to find the target with different β; the number can be as low as 15.

In Experiment 3, with the 23-dimensional high-level feature vector from the VOC2007 dataset, Algorithm 1 can find the target image in about 50 iterations with an appropriate β when α = 0.1 or α = 0.2. When α = 0.3, the algorithm can find the target in about 80 iterations.

[Plot omitted: average number of iterations for α = 0.1, 0.2, 0.3.]
Figure 4. Experiment 3: With the 23-dimensional high-level feature vector from the VOC2007 dataset, Algorithm 1 can find the target image in about 50 iterations with an appropriate β when α = 0.1 or α = 0.2. When α = 0.3, the algorithm can find the target in about 80 iterations.

[Plot omitted: average number of iterations for α = 0.1, 0.2, 0.3.]
Figure 5. Experiment 4: With a normalized feature vector from the VOC2007 dataset, Algorithm 1 performs better and finds the target image in 20 iterations when α = 0.1. It shows a similar performance for α = 0.2 and α = 0.3 to that without normalization in Figure 4.

Table 1. Search for a car on grass in the VOC2007 dataset by a real user with high-level and low-level features: Iterations 1 and 2.

Table 2. Search for a car on grass in the VOC2007 dataset by a real user with high-level and low-level features: Chosen images for Iterations 3, 4, 5, and 8 respectively.

Table 3. Search for a motorbike on grass in the VOC2007 dataset by a real user with high-level and low-level features: Chosen images for Iterations 1, 2, 3, 4, 5, 6, 9, and 10 respectively.
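Experiment 4, discussed next, normalizes each feature vector with the 2-norm before the search. A minimal sketch of that preprocessing step (the `eps` guard is an added safeguard, not from the paper):

```python
import math

def l2_normalize(v, eps=1e-12):
    # Scale a feature vector to unit 2-norm. The eps guard avoids
    # division by zero for an all-zero vector, e.g. an image
    # containing none of the 23 labeled object classes.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / max(norm, eps) for x in v]

# A simple 2-D example: [3, 4] has 2-norm 5.
vec = l2_normalize([3.0, 4.0])  # -> [0.6, 0.8]
```

Normalization removes overall scale differences between images (e.g. large vs. small bounding boxes overall), which is one plausible explanation for the improved performance reported for Experiment 4.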
In Experiment 4, with the same feature vector from the VOC2007 dataset but normalized with the 2-norm, Algorithm 1 performs better and finds the target image in 20 iterations when α = 0.1. It shows a similar performance for α = 0.2 and α = 0.3 to that without normalization in Figure 4.

The last set of experiments (Experiment 5), with a real user, demonstrates the performance of Algorithm 2, which looks for 20 clusters with the 20 centroids as displayed images for user feedback. High-level features from user labels and low-level visual features from colors, textures, and edge orientations are combined to form the feature vector. The vector includes colors, textures, labeled object sizes, and edge orientations, with 23 (labels) + 738 (visual features) = 761 dimensions. The low-level visual features are the same set of features as in the PicSOM system [2]. This combination of high-level and low-level features and 20 clusters gives us a practical and efficient image search algorithm, as shown in Tables 1, 2, and 3.

4 Conclusion

In this work, we consider a probabilistic model for content-based image retrieval which takes into account noise from user feedback. An algorithm based on binary search with noise is proposed and evaluated with experiments using both synthesized and real data. We extend these ideas to build a practical search algorithm using high-level and low-level features. The algorithm devised with our approach is shown by experiments to produce promising results.

5 Acknowledgement

The research leading to these results has received funding from the European Community's Seventh Framework Programme (FP7/2007-2013) under grant agreement no. 216529.

References

[1] M. Everingham, L. Van Gool, C.K.I. Williams, J. Winn, and A. Zisserman, "The PASCAL Visual Object Classes Challenge 2007 Results", 2007, http://www.pascal-network.org/challenges/VOC/voc2007/workshop.

[2] M. Koskela, J. Laaksonen, and E. Oja,
"Inter-Query Relevance Learning in PicSOM for Content-Based Image Retrieval", in Supplementary Proceedings of the 13th International Conference on Artificial Neural Networks / 10th International Conference on Neural Information Processing (ICANN/ICONIP 2003), Istanbul, Turkey, June 2003.

[3] F. Jing, M. Li, H. Zhang, and B. Zhang, "A unified framework for image retrieval using keyword and visual features", IEEE Transactions on Image Processing, 2005, pp. 979-989.

[4] X.S. Zhou and T.S. Huang, "Unifying Keywords and Visual Contents in Image Retrieval", IEEE MultiMedia, 2002, pp. 23-33.

[5] X. He, O. King, W. Ma, M. Li, and H. Zhang, "Learning a semantic space from user's relevance feedback for image retrieval", IEEE Trans. Circuits Syst. Video Techn., 2003, pp. 39-48.

[6] J. Fournier and M. Cord, "Long-term similarity learning in content-based image retrieval", Proc. ICIP (1), 2002, pp. 441-444.

[7] M. Koskela and J. Laaksonen, "Using Long-Term Learning to Improve Efficiency of Content-Based Image Retrieval", Proc. PRIS, 2003, pp. 72-79.

[8] J. Linenthal and X. Qi, "An Effective Noise-Resilient Long-Term Semantic Learning Approach to Content-Based Image Retrieval", Proc. ICASSP'08, Las Vegas, Nevada, USA, March 30-April 4, 2008.

[9] M. Wacht, J. Shan, and X. Qi, "A Short-Term and Long-Term Learning Approach for Content-Based Image Retrieval", Proc. ICASSP'06, pp. 389-392, Toulouse, France, May 14-19, 2006.

[10] C. Zhang and T. Chen, "An active learning framework for content-based information retrieval", IEEE Transactions on Multimedia, 2002, pp. 260-268.

[11] S. Tong and E.Y. Chang, "Support vector machine active learning for image retrieval", Proc. ACM Multimedia, 2001, pp. 107-118.

[12] P.-H. Gosselin, M. Cord, and S.
Philipp-Foliguet, "Active learning methods for interactive image retrieval", IEEE Transactions on Image Processing, 2008.

[13] E. Chang, S. Tong, K. Goh, and C. Chang, "Support Vector Machine Concept-Dependent Active Learning for Image Retrieval", IEEE Transactions on Multimedia, 2005.

[14] Y. Chen, X.S. Zhou, and T.S. Huang, "One-class SVM for learning in image retrieval", Proc. ICIP (1), 2001, pp. 34-37.

[15] Y. Rui and T.S. Huang, "Optimizing Learning in Image Retrieval", Proc. CVPR, 2000, pp. 1236-1236.

[16] J. Rocchio, "Relevance Feedback in Information Retrieval", in G. Salton (ed.), The SMART Retrieval System: Experiments in Automatic Document Processing, Chapter 14, pp. 313-323, Prentice-Hall, 1971.

[17] R.C. Veltkamp and M. Tanase, "Content-based Image Retrieval Systems: A Survey", State-of-the-Art in Content-Based Image and Video Retrieval, 1999, pp. 97-124.

[18] A.W.M. Smeulders, M. Worring, S. Santini, A. Gupta, and R. Jain, "Content-Based Image Retrieval at the End of the Early Years", IEEE Trans. Pattern Anal. Mach. Intell., 2000, pp. 1349-1380.

[19] M. Crucianu, M. Ferecatu, and N. Boujemaa, "Relevance feedback for image retrieval: a short survey", State of the Art in Audiovisual Content-Based Retrieval, Information Universal Access and Interaction, Including Datamodels and Languages, report of the DELOS2 European Network of Excellence (FP6), 2004, 20 p.

[20] M.S. Lew, N. Sebe, C. Djeraba, and R. Jain, "Content-based multimedia information retrieval: State of the art and challenges", TOMCCAP, 2006, pp. 1-19.

[21] R. Datta, D. Joshi, J. Li, and J.Z. Wang, "Image retrieval: Ideas, influences, and trends of the new age", ACM Comput. Surv., 2008.