Scalable Object Retrieval with Compact Image Representation from Generic Object Regions

SHAOYAN SUN and WENGANG ZHOU, CAS Key Laboratory of Technology in Geo-spatial Information Processing and Application System, University of Science and Technology of China
QI TIAN, University of Texas at San Antonio
HOUQIANG LI, CAS Key Laboratory of Technology in Geo-spatial Information Processing and Application System, University of Science and Technology of China

In content-based visual object retrieval, image representation is one of the fundamental issues in improving retrieval performance. Existing works adopt either local SIFT-like features or holistic features, and may suffer from sensitivity to background noise or from poor discriminative power. In this article, we propose a compact representation for scalable object retrieval built from a few generic object regions. The regions are identified with a general object detector and are described with a fusion of learning-based features and aggregated SIFT features. Further, we compress the feature representation for large-scale image retrieval scenarios. We evaluate the performance of the proposed method on two public ground-truth datasets, with promising results. Experimental results on a million-scale image database demonstrate superior retrieval accuracy with efficiency gains in both computation and memory usage.

Categories and Subject Descriptors: H.3.3 [Information Search and Retrieval]: Retrieval models
General Terms: Algorithms, Experimentation, Performance
Additional Key Words and Phrases: Image retrieval, compact image representation

ACM Reference Format:
Shaoyan Sun, Wengang Zhou, Qi Tian, and Houqiang Li. 2015. Scalable object retrieval with compact image representation from generic object regions. ACM Trans. Multimedia Comput. Commun. Appl. 12, 2, Article 29 (October 2015), 21 pages. DOI: http://dx.doi.org/10.1145/2818708

1. INTRODUCTION
The last decade has witnessed the explosive growth of digital visual content on the Internet. This growth has created demand for effective and efficient algorithms to retrieve relevant content from large-scale visual databases. As a result, content-based image retrieval has attracted extensive attention from both academia and industry. In this article, we target visual object retrieval in large-scale image databases.
In content-based image retrieval [Lew et al. 2006], the basic problem is to measure the similarity between images [Hoi et al. 2010]. Generally, images are represented by
This work was supported in part for Professor Houqiang Li by 973 Program under contract No. 2015CB351803, NSFC under contract No. 61325009 and No. 61390514; in part for Dr. Wengang Zhou by NSFC under contract No. 61472378 and the Fundamental Research Funds for the Central Universities under contract No. WK2100060014 and WK2100060011; and in part for Prof. Qi Tian by ARO grant W911NF-12-1-0057 and Faculty Research Awards by NEC Laboratories of America, respectively. This work was supported in part by NSFC under contract No. 61429201.
Authors’ addresses: S. Sun, W. Zhou, and H. Li, Electrical Engineering and Information Science Department, University of Science and Technology of China, Hefei, 230027; emails: [email protected], {zhwg, lihq}@ustc.edu.cn; Q. Tian, Department of Computer Science, University of Texas at San Antonio, San Antonio, TX, 78249; email: [email protected].
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]. c 2015 ACM 1551-6857/2015/10-ART29 $15.00 DOI: http://dx.doi.org/10.1145/2818708 ACM Trans. Multimedia Comput. Commun. Appl., Vol. 12, No. 2, Article 29, Publication date: October 2015. 29:2 S. Sun et al. visual features and image similarity is defined by the comparison of visual features [Zhou et al. 2011]. According to the image scope in feature extraction, visual features can be divided into two categories: local and global. Local features, such as SIFT [Lowe 2004], are extracted from detected interest points and designed to be robust to various changes in illumination, rotation, scaling, and partial occlusion. With such merit, local features have been popularly selected as a routine image representation in CBIR since the pioneering work of Video Google [Sivic and Zisserman 2003]. There are mainly two strategies to perform image retrieval with local features. In the first strategy, text retrieval techniques are leveraged to quantize the high-dimensional continuous local features to discrete visual words with a large visual codebook. Then, an image is represented in a sparse and uniform visual word histogram and an inverted file structure is adopted for efficient indexing and retrieval [Nister and Stewenius 2006; Zhang et al. 2011; Chu et al. 2014; Zhou et al. 2014]. In this paradigm, each local feature should be quantized and indexed individually, which causes severe memory overhead. The second strategy alleviates this problem by aggregating local features of an image into a dense feature vector with a small visual codebook [Jégou et al. 2010; Perronnin et al. 2010; Liu et al. 2015]. In this way, some hashing techniques for nearest neighbor search are adopted to identify relevant image results based on those dense representations. In contrast, global features describe the whole image content—such as color, edge, texture, and structure—into a single holistic representation. Representative global features include GIST [Oliva and Torralba 2001] and edgel [Cao et al. 2011]. These features represent an image with only one feature vector. Although efficient in memory cost and computation, those features suffer poor discriminative power. Apart from the handcrafted features, it is also possible to extract features in a datadriven manner. The explosive research on deep neural networks (DNNs) has recently witnessed the success of the data-driven features in multiple areas. With the deep architectures, high-level abstractions that are close to human cognition can be learned [Bengio 2009], so that DNN is suitable to extract semantic-aware features. In Hörster and Lienhart [2008], features are extracted in local patches with a deep restricted Boltzmann machine (DBN) and the BoW model is used to perform image retrieval. In Sun et al. [2014], convolutional neural networks (CNNs) are applied to extract image features for image retrieval. A comprehensive study on CNN-based CBIR is given in [Wang et al. 
2014], who use CNN to extract one feature from an image as a holistic descriptor, and demonstrate an impressive performance in their experiments. Both local and global features suffer some nontrivial issues in the scenario of image retrieval. Local invariant features are sensitive to rich texture from image backgrounds and suffer from the problem of burstiness in repetitive areas such as grass and carpets [Wang et al. 2011]. Though geometric verification [Zhou et al. 2013; Liu et al. 2014; Chu et al. 2013; Zhou et al. 2014] or retrieval list reranking [Mei et al. 2014; Xie et al. 2014] can alleviate their impact to some extent, they introduce more complexity. In addition, they are unstable when there are large changes in viewpoint. Global features describe the image as a whole and are more suitable for appearance-similar image search than object retrieval. In Figure 1, we illustrate the top 4 retrieval results from the UKBench dataset for one query with different features. The query image concerns a beer bottle placed on a carpet. The results in the top row are returned by the local feature–based method. Three unrelated objects are returned because most feature correspondences are from the background carpet area. The middle row shows results of one global CNN feature describing image content as a whole. We observe that two different kinds of bottles are returned, which implies that global features can capture the content appearance but may fail to describe the details of the target object. ACM Trans. Multimedia Comput. Commun. Appl., Vol. 12, No. 2, Article 29, Publication date: October 2015. Scalable Object Retrieval with Compact Image Representation from Generic Object Regions 29:3 Fig. 1. Retrieval results of one query from the UKBench dataset for three retrieval methods. The results are generated by a baseline local feature–based method [Nister and Stewenius 2006] (top row), one global feature–based method we implement with a CNN tool-kit [Jia et al. 2014] (middle row) and our method (last row), respectively. The bounding boxes are drawn by a general object detector [Cheng et al. 2014], indicating detected object patches. To avoid those issues discussed earlier, we propose extracting features from generalized object regions (denoted as object patches hereafter). The motivation is that, on an object level, the object appearance is kept consistent no matter how the background and image layout change, so that global features can describe the object precisely. With little background kept in object patches, the interference of noise features from the background area can be significantly weakened or eliminated. As a result, both global and local features are expected to benefit from this representation. However, we do not require the detected regions to be exactly meaningful objects, since our objective is to represent image content for similar image identification instead of object detection. To achieve scalable retrieval in a large database, we adopt product quantization (PQ) [Jégou et al. 2011] to compress the object-level feature and speed up distance computing. Different from the local feature–based methods by voting [Jégou et al. 2011; Zheng et al. 2014a; Jégou et al. 2008; Zhou et al. 2015], which consider thousands of local features in each image, we only preserve very few object level features and gain significant efficiency in memory. Moreover, as mentioned before, the object-level representation is more discriminative than global image–level representations. 
In large-scale image retrieval experiment on one million database images, we demonstrate state-of-the-art retrieval accuracy with very efficient memory usage and real-time search response. The framework of the proposed method is illustrated in Figure 2. It consists of three components: feature extraction, indexing and querying. In both the index phase and the query phase, the same feature extraction process is conducted. The inverted index is built in the index phase, storing all database-image features with their IDs and compressed representations. In the query phase, the distances between query-image features and related database-image features are computed, which are used for scoring the database images, and images with highest scores are returned as the retrieval results. In the feature extraction phase, we first detect potential objects in images with a general-object detector. With a few related works available in recent years, we take BING, which is proposed in Cheng et al. [2014] and demonstrated to be very fast, in our framework because image-retrieval systems usually require a real-time response. Then we extract features in the object patches. There are also multiple alternatives on what feature to extract. In our implementation, we test three kinds of features. ACM Trans. Multimedia Comput. Commun. Appl., Vol. 12, No. 2, Article 29, Publication date: October 2015. 29:4 S. Sun et al. Fig. 2. The proposed image retrieval framework with object-level features. The feature-extraction stage takes one image as the input and outputs a few object-level features. It contains three steps: object detection, CNN and VLAD feature extraction, and feature fusion. In the index phase, features are extracted from all database images, and are indexed into an inverted table and encoded with PQ. In the query phase, features extracted from the query image are assigned to related indexes and the distances between query features, and database features are computed for scoring and ranking. Images in this figure are from the Holidays dataset. To describe the patches effectively as a whole, we adopt the CNN model trained on ImageNet by Krizhevsky et al. [2012]. This model is originally trained to perform object classification and achieves great success. Therefore, the model is descriptive for objects and fits our scenario well. To describe the local properties in the object patches, we extract SIFT features and aggregate them with VLAD. We experimentally demonstrate performance improvement of the object-level representations over their image-level counterparts. In addition, we propose fusing the two representations on the feature level. Since CNN and VLAD specialize in describing different properties of the image (i.e., general semantics and local details), we make the fusion of them a better representation. Specially, as one global description, the CNN feature cannot handle image variance in scaling and rotating well, which can be complemented by the VLAD feature generated from SIFT. We denote such features extracted from object patches as object-level features hereafter. The rest of this article is organized as follows. We first introduce the related work in Section 2. Then we discuss how to extract object-level features in Section 3. We describe our image-retrieval framework with the proposed object-level feature, as well as the details of feature quantizing and indexing for large-scale image retrieval, in Section 4. 
Next, we provide experimental results with the proposed method in terms of accuracy, efficiency, and memory cost, as well as conduct comparisons with the state-of-the-art methods in Section 5. We present our conclusions in Section 6. 2. RELATED WORK Our work involves feature detection, feature description, feature fusion, and image indexing. In this section, we discuss the related works on each topic in the following. As a prerequisite step for feature description, feature detection aims at locating repeatable local structures. The SIFT feature [Lowe 2004], a representative local invariant feature, identifies interest points that are scale and translation invariant with the DoG detector. In addition, some works extend it with detectors such as the HessianAffine detector [Mikolajczyk and Schmid 2004], MSER detector [Kadir et al. 2004], and SURF [Bay et al. 2006]. However, these detectors detect only low-level salient structures, such as blobs and corners. The detected regions of interest or patches are usually simple corner points or texture with little semantic information; thousands of such patches can be detected in one image. To locate image patches that contain complicated objects, we need to ACM Trans. Multimedia Comput. Commun. Appl., Vol. 12, No. 2, Article 29, Publication date: October 2015. Scalable Object Retrieval with Compact Image Representation from Generic Object Regions 29:5 apply an object detector. Though detection models for certain objects (e.g., human face, pedestrian, vehicles) have been developed for decades and many successful models have been proposed (e.g., the Viola-Jones face detector [Viola and Jones 2004], the HoG-based human detector [Dalal and Triggs 2005; Amit et al. 2014], and the deformable partbased model [Felzenszwalb et al. 2008]), it is infeasible to train models object-wise in real applications. As an alternative, we can resort to general-object detection to find object patches regardless of object categories. While some works [Uijlings et al. 2013; Alexe et al. 2012; Endres and Hoiem 2010] have tried to solve this problem recently, they can hardly achieve high detection rate, high computational efficiency, and good generalization ability simultaneously. We take the recent work employing BING [Cheng et al. 2014] as our general object detector, because in this work, detection repeatability and effectiveness are improved greatly, which lends itself very suitably to detect object patches of interest in image retrieval. The other step in feature extraction is feature description. Traditionally, handcrafted descriptors are exploited to represent images or image patches. For example, the SIFT feature [Lowe 2004] computes the gradient magnitude around detected key points. In addition, there are some variances of SIFT, such as PCA-SIFT [Ke and Sukthankar 2004] and Edge-SIFT [Zhang et al. 2013]. The global GIST feature [Oliva and Torralba 2001] integrates orientation, color, and intensity information in the whole image. In recent years, CNN has been frequently applied in multiple computer-vision research areas. With local receptive fields and shared weights, CNN can extract high-level semantic features from raw pixels efficiently. In Krizhevsky et al. [2012], a CNN model is trained to perform image classification and achieves outstanding accuracy. In Sermanet et al. [2013], the proposed CNN model, along with some new twists, demonstrates stateof-the-art performance on pedestrian detection. 
Inspired by these successes, we make use of CNN to extract object-level features for image retrieval in this article. In Gong et al. [2014], CNN features are extracted from predefined subwindows in different scales of images, and pooled as the final representation. This work is similar to ours in that CNN features are extracted from local patches, but our representation is different in that we extract features only from object-like patches, so that the number of regions we examine is much smaller, which makes it scalable for the scenario of image retrieval. Feature fusion is a common technique to utilize the advantages of different features. Generally, feature fusion is performed in either the indexing or reranking phase. In Zhang et al. [2013] and Xie et al. [2015], semantic attributes are co-indexed into an inverted index from visual words; in Zheng et al. [2014a], a multi-index consisting of color and SIFT visual words is built. Zhang et al. [2012] proposes a graph-based, queryspecific fusion approach to merge retrieval results given by different features. All these methods treat different kinds of features separately, and fusing them requires some modification to the existing image-retrieval frameworks. In this article, we perform feature fusion from another perspective, that is, we explore the possibility of fusing different features on the feature level by combining multiple feature vectors into one single representation. Such an operation suffers from the concern that the space distribution and the distance metric of different features usually vary a lot. However, we observe that, after appropriate transformations of the original features, we can lay them on one united distance metric space. This feature-level fusion makes our representation scheme readily adaptable to the existing image retrieval framework. When images are represented by local features, it is intractable and prohibitive to perform large-scale image retrieval with the original features by exhaustive linear scan. To achieve scalability to a large image database, a visual codebook is trained to quantize local features to visual words. Based on the quantization results, the inverted index structure [Sivic and Zisserman 2003] can be used [Nister and Stewenius 2006; ACM Trans. Multimedia Comput. Commun. Appl., Vol. 12, No. 2, Article 29, Publication date: October 2015. 29:6 S. Sun et al. Jégou et al. 2010, 2011] to index a large image database for efficient retrieval. To lower the memory cost and speed up feature distance verification, features are usually hashed to binary features [Jégou et al. 2008; Liu et al. 2014]. In this article, we follow the work of Jégou et al. [2011] to quantize features to a number of inverted indexes, and store compressed features with PQ. 3. OBJECT-LEVEL FEATURE In this section, we introduce our object level feature for image representation. In Section 3.1, we discuss the general-object detector that we employ to detect possible object patches. Then, we introduce two kinds of image representations with CNN in Section 3.2 and VLAD in Section 3.3. Finally, we describe the proposed feature-level fusion in Section 3.4. 3.1. Object Patch Detection We choose the BING detector [Cheng et al. 2014] to detect generic object proposals/regions for feature extraction. The BING detector is demonstrated to have a very high detection rate with a relatively small number of object proposals. 
In addition, it enjoys excellent generalization ability to identify diverse objects with extremely high detection speed. The generalization ability means that the detected object proposals are generic over categories, and the high detection speed (300fps on a laptop) makes the detector well adapted to the task of image retrieval, which requires a real-time response. When applying the object detector in our framework, we also emphasize the detection repeatability and the saliency of the object proposals. Repeatability means that the proposals detected in two similar images should be consistent, so that the matching between them is reliable. Saliency means that an object proposal should be an informative area for discrimination. These two requirements are well satisfied by the BING detector [Cheng et al. 2014]. A simple example is shown in Figure 1. Moreover, as long as the matching is a reliable one between informative areas, it is sufficient for image retrieval; we do not require the detected area to be exactly a meaningful object.
We run the object detector on every image. The detector outputs thousands of candidate object proposals, each with a score indicating its likelihood of containing an object. Instead of keeping all of them, we preserve only a few with the highest scores. As reported in Cheng et al. [2014], the top 7 object proposals can hit an object with a probability of 45%; raising the probability to 80% would require considering 100 proposals, which would increase the complexity greatly and may introduce some noisy proposals.

3.2. Feature Extraction with CNN
Recent years have witnessed the great success of DNNs in many research areas, including computer vision. Among the many algorithms in the DNN framework, CNN has been demonstrated to be a powerful tool to extract expressive image features [Krizhevsky et al. 2012]. We make use of the pretrained CNN model designed by Krizhevsky et al. [2012] and implemented by Jia et al. [2014]. In this model, each input image (or object patch in our method) is resized to 224 × 224 and then passed through 5 convolutional layers and 3 fully connected layers. The output layer of this model has 1000 nodes for classification. We discard this layer because our objective is to represent the image rather than to classify it. In our framework, the output of the model is a 4096-D positive, real-valued vector. With the Caffe tool-kit, 500 features can be extracted in 1 second with a GPU and 50 with a CPU [Jia et al. 2014]. The time cost of feature extraction is therefore minor for image retrieval.
Before measuring the distance between different features, we need to normalize them. Following Arandjelovic and Zisserman [2012], we obtain the root feature by first L1-normalizing the feature vector and then computing the square root per dimension:

    x_i = \sqrt{x_i / \|x\|_1},    (1)

where \|x\|_1 is the L1 norm of x.
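To make this step concrete, the following is a minimal Python/NumPy sketch of the root normalization in Equation (1). It reflects our reading of the text rather than the authors' implementation, and the function name is ours.

    import numpy as np

    def root_normalize(x, eps=1e-12):
        """Root feature of Equation (1): L1-normalize, then take the
        element-wise square root (assumes a non-negative feature, as
        produced by the CNN described above)."""
        x = np.asarray(x, dtype=np.float64)
        l1 = x.sum() + eps          # L1 norm of a non-negative vector, guarded against zero
        return np.sqrt(x / l1)

    # Example: a 4096-D non-negative CNN activation vector.
    feat = np.random.rand(4096)
    root_feat = root_normalize(feat)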
3.3. Feature Extraction with VLAD
VLAD [Jégou et al. 2010] is a popular local-feature aggregation method. Specifically, SIFT features are first extracted in affine-invariant regions of an image. Then, each feature is quantized to one of k pretrained clusters. For each cluster, the residuals between all features quantized to it and the cluster centroid are accumulated, and all the summations are concatenated. The representation can be described as:

    v = [\sum_{q(x)=1} (x - c_1); \ldots; \sum_{q(x)=k} (x - c_k)],    (2)

where q(x) = t denotes SIFT features that are quantized to the t-th cluster, and c_t is the corresponding cluster centroid. The semicolon in the equation denotes a concatenation of column vectors. When extracting VLAD on one object patch P, we modify Equation (2) as:

    v_P = [\sum_{q(x)=1, p(x) \in P} (x - c_1); \ldots; \sum_{q(x)=k, p(x) \in P} (x - c_k)],    (3)

where q(x) = t, p(x) \in P denotes SIFT features that are located in patch P and quantized to the t-th cluster. With the second constraint, we make use of the geometric information of SIFT features, which is ignored in the original VLAD representation. Following Jégou et al. [2010], we perform L2-normalization on the VLAD vector.

3.4. Fusion on Feature Level
With CNN, we can extract image features representing high-level abstractions. Besides, with VLAD, some local properties of images are aggregated. While image retrieval with either feature is possible, and their performance is demonstrated in our initial study [Sun et al. 2014], we expect their fusion to generate a better representation with both high-level information and local invariant properties. To avoid complicating the retrieval system, we propose performing the fusion on the feature level by combining the CNN feature and the VLAD vector into a single feature vector.
The most intuitive feature-level fusion is to directly concatenate the two vectors into a longer vector. Denote the CNN feature as x_C and the VLAD feature as x_V. This simple fusion can be written as:

    x_{f1} = [x_C; x_V].    (4)

When computing the distance between two fused features x_{f1} and y_{f1}, we have

    D(x_{f1}, y_{f1}) = \|x_{f1} - y_{f1}\|^2 = \|x_C - y_C\|^2 + \|x_V - y_V\|^2.    (5)

However, it is problematic to add the distances of these two kinds of features, for two reasons. First, the two kinds of features have different scale distributions, so the distances computed with them are incomparable. Second, since the original features suffer from the problem of co-occurrence [Chum and Matas 2010] (so that some patterns are overcounted when comparing two features), the simple addition can lead to one distance dominating the summation with many overcounted patterns.
To overcome this difficulty, we propose applying principal component analysis (PCA) and whitening to the two features separately before the concatenation. In this way, we remove the redundancy in the features and make all feature components share the same variance. After these two operations, the original features lie in a unified distance metric space, and computing or comparing distances becomes reasonable. Moreover, PCA also helps to reduce the dimension of the original features. The success of PCA and whitening on VLAD has been demonstrated in Jégou and Chum [2012], where the two operations on the VLAD vector improve the retrieval performance remarkably. We perform similar transformations on our CNN and VLAD features, respectively, and then concatenate them into one single feature:

    \hat{x}_C = diag(\lambda_1^{-1/2}, \ldots, \lambda_{D_C}^{-1/2}) p_C^T x_C / \|diag(\lambda_1^{-1/2}, \ldots, \lambda_{D_C}^{-1/2}) p_C^T x_C\|,
    \hat{x}_V = diag(\gamma_1^{-1/2}, \ldots, \gamma_{D_V}^{-1/2}) p_V^T x_V / \|diag(\gamma_1^{-1/2}, \ldots, \gamma_{D_V}^{-1/2}) p_V^T x_V\|,    (6)
    x_f = [\hat{x}_C; \hat{x}_V],

where (\lambda_1, \ldots, \lambda_{D_C}) and (\gamma_1, \ldots, \gamma_{D_V}) denote the sorted eigenvalue lists of the two features, p_C and p_V are the associated eigenvector matrices, and D_C and D_V are the preserved feature dimensions of the CNN and VLAD features after PCA, respectively.
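As an illustration of Equation (6), the sketch below learns a PCA-whitening transform for each feature type and concatenates the whitened, re-normalized vectors. It is a sketch under our assumptions (mean-centering before projection, a small constant for numerical stability, random stand-in data), not the authors' code; all function and variable names are ours.

    import numpy as np

    def fit_pca_whitener(X, out_dim):
        """Learn a PCA-whitening transform from training features X (n_samples x dim)."""
        mean = X.mean(axis=0)
        Xc = X - mean
        cov = Xc.T @ Xc / (len(X) - 1)
        eigvals, eigvecs = np.linalg.eigh(cov)
        order = np.argsort(eigvals)[::-1][:out_dim]       # keep the out_dim largest eigenvalues
        P = eigvecs[:, order]                             # columns are principal directions
        scale = 1.0 / np.sqrt(eigvals[order] + 1e-12)     # whitening: divide by sqrt(eigenvalue)
        return mean, P, scale

    def whiten(x, mean, P, scale):
        """Project, whiten, and re-L2-normalize one feature, as in Equation (6)."""
        z = scale * (P.T @ (x - mean))
        return z / (np.linalg.norm(z) + 1e-12)

    # Learn the transforms on (hypothetical) training sets, then fuse one CNN/VLAD pair.
    cnn_train, vlad_train = np.random.rand(2000, 4096), np.random.rand(2000, 2048)
    pca_cnn = fit_pca_whitener(cnn_train, out_dim=512)
    pca_vlad = fit_pca_whitener(vlad_train, out_dim=512)

    x_c, x_v = np.random.rand(4096), np.random.rand(2048)
    x_fused = np.concatenate([whiten(x_c, *pca_cnn), whiten(x_v, *pca_vlad)])   # 1024-D fused feature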
4. IMAGE RETRIEVAL WITH OBJECT-LEVEL FEATURE
In this section, we introduce the image-retrieval framework with the proposed object-level feature. In Section 4.1, we discuss how to measure the similarity between the query and database images. In Section 4.2, we describe the feature quantization and indexing method for image retrieval on a large-scale database.

4.1. Similarity Measurement
In each image, we extract features (CNN feature, VLAD feature, or the fused feature in Equation (6)) from Np object patches as our object features. To make full use of the information in the whole image, we also extract one feature from the entire image. As a result, we generate N = Np + 1 object features in total from one image. The image is then represented as a group of feature vectors:

    X = \{x_1, \ldots, x_N\}, x_i \in R^m,    (7)

where m denotes the dimension of each feature vector. Given a query image X^q, to measure its similarity with a database image X^d, we define a matching score S(X^q, X^d) based on the distances between the object features in them:

    S(X^q, X^d) = \sum_{i=1}^{N} f(\min_j D(X_i^q, X_j^d)),    (8)

where D(X_i^q, X_j^d) represents the distance between the i-th object feature in X^q and the j-th object feature in X^d, and f(x) is an exponential function defined as:

    f(x) = \exp(-(\alpha x)^2).    (9)

ALGORITHM 1: Retrieval Process
Input: Features in a query image, Q; database size, N; image IDs corresponding to all features
Output: Returned retrieval results, R
Scores = [0, 0, . . . , 0]_N;
for each q ∈ Q do
    Compute the distance lookup table for efficient distance computation D(·, ·);
    Dists = [Inf, Inf, . . . , Inf]_N;
    C = MA(q);   // Neighbor clusters found by multiple assignment.
    for d ∈ {features indexed in C} do
        t = image_id(d);
        Dists[t] ← min(D(q, d), Dists[t]);
    end
    for each image i do
        if Dists[i] ≠ Inf then   // Ignore images that are never visited.
            Scores[i] += exp(−(α · Dists[i])^2);
        end
    end
end
R = Sort(Scores);

The exponential function penalizes large feature distances. With this setting, relevant images gain a high score from shared similar object patches, whereas irrelevant object patch pairs between two images contribute only a low score. It is possible to apply another decreasing function here (e.g., a sigmoid or tangent function). However, we find the selected exponential function simple and effective in experiments. The effect of the parameter α will be discussed in the experiments. Finally, the database images are ranked by the matching scores and returned to the user as retrieval results.

4.2. Quantization and Indexing
For large-scale image retrieval, time and memory costs should be taken into consideration. It is not scalable to do exhaustive search with the original features in the image database. We exploit PQ [Jégou et al. 2011] to compress the features and speed up feature distance computing, and adopt the inverted index structure to avoid exhaustive search.
4.2.1. Feature Quantization with PQ. In product quantization, the original feature space is decomposed into a Cartesian product of m low-dimensional subspaces. If the original feature is D dimensional, then the dimension of each subspace is D* = D/m. In each subspace, k* cluster centroids are trained and stored. With these settings, each feature is quantized m times, once in each subspace, and the IDs of the corresponding centroids are stored. When computing the distance between one query feature and one database feature, we apply the asymmetric distance computation (ADC) proposed in Jégou et al. [2011]: before search, we compute and store the distances of the query feature to the centroids in each subspace, and the final distance is computed by summing the precomputed distances in a lookup table.
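The following Python/NumPy sketch illustrates the ADC step with a per-query lookup table. It assumes the per-subspace codebooks have already been trained offline (e.g., by k-means) and uses random stand-in data; the function and variable names are ours, and the code is meant only to illustrate the computation described above.

    import numpy as np

    def adc_distances(query, db_codes, codebooks):
        """Asymmetric distance computation (ADC).
        query:     (D,) original query feature
        db_codes:  (n, m) uint8 PQ codes of the database features (m subspaces)
        codebooks: (m, k_star, D/m) per-subspace centroids trained offline
        Returns approximate squared distances from the query to every database feature."""
        m, k_star, d_sub = codebooks.shape
        q_sub = query.reshape(m, d_sub)                      # split the query into m sub-vectors
        # Lookup table: squared distance from each query sub-vector to every centroid (m x k_star).
        lut = ((codebooks - q_sub[:, None, :]) ** 2).sum(axis=2)
        # ADC: for each database feature, sum the table entries selected by its m centroid IDs.
        return lut[np.arange(m), db_codes].sum(axis=1)

    # Toy usage with the paper's setting D = 1024, m = 128 subspaces, k* = 256 centroids each.
    rng = np.random.default_rng(0)
    codebooks = rng.random((128, 256, 8))
    db_codes = rng.integers(0, 256, size=(1000, 128), dtype=np.uint8)
    dists = adc_distances(rng.random(1024), db_codes, codebooks)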
4.2.2. Inverted Index. We train k clusters in the complete feature space, and each database image feature is quantized to one of the clusters. The inverted index is built with each entry corresponding to one cluster, where the IDs of all features quantized to that cluster are stored. In the online querying stage, we apply the multiple assignment (MA) strategy proposed in Jégou et al. [2010]. First, the 10 cluster centroids nearest to the query feature are found with ANN algorithms. If the distance d of the query feature to one of these centroids is smaller than δ · d_0, where d_0 is the distance of the query feature to its nearest centroid (δ = 1.2, as set in Jégou et al. [2010]), then the inverted index list associated with this centroid is visited. We summarize the retrieval process in Algorithm 1.

5. EXPERIMENTAL RESULTS
In this section, we evaluate the proposed method on two public benchmark datasets: the Holidays dataset [Jégou et al. 2008] and the UKBench dataset [Nister and Stewenius 2006]. Wang and Jiang [2015] list a few common benchmark datasets for image retrieval. We chose these two datasets because they are among the most used ones in the field, and the images in them are very suitable for our scenario of object retrieval. The Holidays dataset contains 1491 holiday images from 500 groups. The first image in each group is selected as the query. Mean Average Precision (mAP) is used to evaluate the retrieval accuracy. In the UKBench dataset, there are 10200 images from 2550 object/scene categories, each containing 4 images. On this dataset, the NS-score (the average number of relevant images among the top-4 results, with a maximum of 4) is used to measure the retrieval accuracy. To evaluate the scalability of the proposed algorithm, we apply the MIR Flickr 1M dataset as the distractor dataset. This dataset contains 1 million images randomly retrieved from Flickr. We run all experiments on a single core of a PC with an I7-3770K CPU.
In Section 5.1, we explore the impact of the related parameters on retrieval performance with different features. Then, we illustrate some retrieval results to demonstrate the benefit introduced by feature fusion in Section 5.2. In Section 5.3, we show the experimental results on large-scale image retrieval with different experiment settings. After that, we analyze the time efficiency of the four main components in our method in Section 5.4. Finally, we compare our method in multiple experimental settings with other related algorithms in Section 5.5.

5.1. Impact of Parameters
In our method, there are 3 key parameters: α in the scoring function in Equation (9), the object patch number Np, and the feature dimension D. Here, we discuss their impact when CNN, VLAD, and the fused feature are adopted in the framework, respectively. When extracting VLAD features, we set the codebook size to 16, so that the initial VLAD feature dimension is 16 × 128 = 2048.
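Since α enters only through the scoring rule, here is a small illustrative sketch of the matching score in Equations (8) and (9), assuming the object-level features of the two images are already extracted and compared with Euclidean distances; it is not the authors' implementation, and the names are ours.

    import numpy as np

    def matching_score(query_feats, db_feats, alpha):
        """S(X^q, X^d) from Equations (8)-(9): every query object feature votes with
        f(x) = exp(-(alpha * x)^2), where x is its distance to the closest object
        feature of the database image."""
        diffs = query_feats[:, None, :] - db_feats[None, :, :]
        dists = np.linalg.norm(diffs, axis=2)        # pairwise L2 distances (N x N)
        min_dists = dists.min(axis=1)                # nearest database feature per query feature
        return np.exp(-(alpha * min_dists) ** 2).sum()

    # Toy usage: N = 8 object-level features (7 patches + whole image), 1024-D fused features.
    rng = np.random.default_rng(1)
    q_feats, d_feats = rng.random((8, 1024)), rng.random((8, 1024))
    score = matching_score(q_feats, d_feats, alpha=0.25)   # alpha = 0.25 is the tuned value for the fused feature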
First, we explore the impact of α by experiments on the Holidays and UKBench datasets. We fix Np = 7 and reduce the CNN and VLAD feature dimension to 512 by PCA, so that the dimension of the fused feature is 1024. To improve the performance of the 512-D VLAD feature, we whiten it according to Jégou and Chum [2012]. From Figure 3, we can see that when selecting the CNN feature, the mAP on Holidays waves from 0.771 to 0.789 and peaks at α = 3, while the NS-score on UKBench achieves the best result 3.613 during the interval α ∈ [2.0, 3.0]. Based on this observation, we set α = 3 for CNN feature in the rest of the experiments. When applying VLAD, the mAP on Holidays peaks at α = 0.5 with the value 0.639; at this point, the NS-Score on UKBench achieves the best value, 3.264. Thus we set α = 0.5 for VLAD in the following. A similar ACM Trans. Multimedia Comput. Commun. Appl., Vol. 12, No. 2, Article 29, Publication date: October 2015. Scalable Object Retrieval with Compact Image Representation from Generic Object Regions 29:11 Fig. 3. The impact of α on image-retrieval accuracy. The three figures show how retrieval accuracy changes with α when CNN (a), VLAD (b) and the fused feature (c) are tested, respectively. Fig. 4. The impact of object patch number Np on image-retrieval accuracy. The three figures show how the retrieval accuracy changes with Np when CNN (a), VLAD (b), and the fused feature (c) are used, respectively. trend is observed when the fused feature is used, and we get the optimized value of α = 0.25, where the mAP on Holidays is 0.837 and the NS-Score on UKBench is 3.814. Then, we study the impact of object patch number Np on the Holidays and UKBench datasets. In the related experiments, we keep the feature dimensions of CNN and VLAD as 512, and α as the corresponding optimized values summarized before. We evaluate Np with values from 1 to 35, that is, the feature number N in each image from 2 to 36. The experiment results with CNN, VLAD, and the fused feature are shown in Figure 4. We can see that, in all cases, both mAP on Holidays and NS-Score on UKBench have a rising trend when Np increases. For example, when using the fused feature, the mAP on Holidays is 0.837 when Np = 7, compared to 0.796 when only one object patch is considered, while on UKBench, the NS-Score increases from 3.755 to 3.814 when Np changes from 1 to 7. When Np is even larger, the retrieval accuracy still increases. However, as the computational and memory cost introduced by large Np is expensive, we just set Np = 7 in the following experiments. We demonstrate in Section 5.4 that, with such a setting, the average query time is about 1s in a 1 million–image database. To investigate how the feature dimension D affects retrieval performance, we evaluate multiple values of D with different features on the Holidays dataset. Here, we fix Np = 7 and set α as the optimized values as well. We test D = 64, 128, 256, 512, and 1024 for CNN and VLAD features, respectively. When fusing CNN and VLAD features, we keep them with the same dimension; thus, the dimension of the fused feature is 2D accordingly. As shown in Figure 5, the accuracy with the fused feature grows when the dimension D increases from 64 to 256, then keeps relatively stable after that. This indicates that, by applying PCA, the feature space distribution is well captured and some noises and redundancy are removed. ACM Trans. Multimedia Comput. Commun. Appl., Vol. 12, No. 2, Article 29, Publication date: October 2015. 29:12 S. Sun et al. Fig. 5. 
The impact of feature dimension D on image-retrieval accuracy. Note that the dimension of the fused feature is actually 2D in the figure, as it is the concatenation of the CNN and VLAD features.

Table I. The mAP Performance of Different Features Under Different Settings of Feature Dimensions on the Holidays Dataset
Feature    D = 256 (2D = 512)    D = 512 (2D = 1024)    D = 1024 (2D = 2048)
CNN        0.789                 0.789                  0.781
VLAD       0.613                 0.639                  0.637
Fused      0.828                 0.837                  0.837

5.2. Fused Feature versus Standalone Features
To demonstrate the performance boost brought by the proposed feature fusion, in this section we compare the retrieval results when using the fused feature and the standalone CNN and VLAD features. From Figure 5, we can see that, when D = 256, 512, and 1024, the fused feature outperforms both the CNN and VLAD features. One might argue that the feature dimension of the fused feature is twice that of the CNN and VLAD features. However, in this range, even when the features are compared at the same total dimension, the superiority of the fused feature is still significant, as summarized in Table I (e.g., compare the fused feature at 2D = 512 with CNN and VLAD at D = 512).
As discussed in Section 1, CNN and VLAD specialize in describing different properties of an image. Even though the CNN feature alone achieves promising accuracy (i.e., mAP 0.789 on Holidays and NS-Score 3.613 on UKBench), its fusion with VLAD improves performance further, to mAP 0.837 on Holidays and NS-Score 3.814 on UKBench. We argue that the capability of VLAD to handle image variations in scaling and rotation contributes to this improvement.

Fig. 6. Top 4 retrieval results with three features on UKBench. In each group, the first column represents the query image, and results with CNN, VLAD, and the fused feature are given in the top, middle, and bottom row, respectively.

To compare the three features more intuitively, we illustrate some retrieval results on UKBench with these three features in Figure 6. In Figure 6(a) and Figure 6(b), the related images of the query image in the database have severe viewpoint changes, so that the CNN feature alone retrieves only the query image. In these two cases, VLAD retrieves all 4 related images and 2 related images, respectively. The fused feature preserves all 4 related images in the first case and improves the result to 3 in the second. In Figure 6(c) and Figure 6(d), VLAD misses 3 related images due to poor feature matches, while CNN successfully retrieves 2 and 4 related images, respectively. In contrast, the fused feature returns 3 and 4 related images in these situations, respectively. We illustrate one case in which the performance of the fused feature is encumbered by the failure of VLAD in Figure 6(e). In this case, CNN and VLAD retrieve 4 and 1 related images, respectively, but the number with the fused feature is 3, with the fourth result being a false positive caused by the failure of VLAD. Figure 6(f) illustrates one case in which all three features successfully retrieve all 4 related images. In their rankings, the CNN feature favors images with similar shapes, while the VLAD feature is more concerned with the matching of local features.

5.3. Large-Scale Image Retrieval
To perform large-scale image retrieval, we apply the quantization and indexing method in Section 4.2.
According to the experimental results discussed in Section 5.1, we use the fused feature and fix the feature dimension to 2D = 1024. We extract all 8 features, from 7 object patches and the entire image.

Table II. Accuracy with ADC on the Holidays and UKBench Datasets
m      mAP                NS-Score
32     0.770 (↓ 0.067)    3.650 (↓ 0.164)
64     0.811 (↓ 0.026)    3.743 (↓ 0.071)
128    0.818 (↓ 0.019)    3.790 (↓ 0.024)
Note: Down arrows denote accuracy decreases compared with using the original feature.

Fig. 7. Retrieval accuracy of IVFADC when different vocabulary sizes k are used to build the inverted index, and no distractor dataset is added. Results on Holidays (a) and UKBench (b) are illustrated. The blue bars represent m = 64 in PQ, while red bars represent m = 128.

We first explore the impact of PQ on the retrieval accuracy without an inverted index. We denote this as the ADC method. When performing PQ, we test the number of subspaces m = 32, 64, 128, where the dimensions of the subspaces are 32, 16, and 8, respectively. The cluster number in each subspace is set as k* = 256, so that each centroid ID can be represented by an unsigned char variable with 1 byte of memory. We can see from Table II that the impairment of PQ to the accuracy is minor when m = 128. Compared with using the original feature, the mAP on Holidays drops from 0.837 to 0.818, while the NS-Score on UKBench drops from 3.814 to 3.790. When m = 64, the accuracy decrease is minor and acceptable. However, when m = 32, the accuracy drops severely because the 256 clusters in the subspaces can hardly represent the 32-dimensional sub-features well.
We then test retrieval performance when applying the inverted index (denoted as IVFADC). We test different vocabulary sizes k (i.e., the number of entries) in the inverted index. The performances are compared when no distractor dataset is added and when the MIR Flickr 1M dataset is added, respectively. We test PQ with m = 64 and m = 128 only, because the accuracy with m = 32 is much lower. When no distractor dataset is added, the retrieval accuracies on Holidays and UKBench are shown in Figure 7. We can conclude that a smaller vocabulary size k generates better accuracy, as expected, because when the quantization is coarse, more features are taken for distance computing, so that a high recall can be achieved. The extreme case of k = 1 is exactly the ADC version, where all database features are compared to the query feature. When k = 500 and m = 128, the best results are achieved, that is, mAP on Holidays is 0.804 and NS-Score on UKBench is 3.725.

Fig. 8. Retrieval accuracy and average query time of IVFADC when different vocabulary sizes k are used to build the inverted index, and the 1M distractor dataset is added. Accuracy on Holidays (a) and UKBench (b), and average query time on Holidays (c) and UKBench (d) are illustrated. The blue bars and lines represent m = 64 in PQ, while red bars and lines represent m = 128.

Next, we add the 1M distractor dataset to test retrieval performance. We demonstrate retrieval accuracy and average query time in Figure 8. All timings exclude feature extraction. When the PQ subspace number m is 128, the accuracy is better than with m = 64, while the time cost is slightly larger. Specifically, when k = 500 and m = 128, the mAP on Holidays is 0.642 and the NS-Score on UKBench is 3.707, and the time costs are 1.18s and 1.22s. When k = 500 and m = 64, we get
mAP 0.620 and NS-Score 3.634, and the average time costs are 1.03s and 0.91s for the two ground-truth datasets, respectively. When the vocabulary size k increases, both accuracy and time cost drop on the two datasets. When k = 4000, only about 0.4s is required to perform one query. The memory cost to store the quantized features depends only on m, as discussed in Section 4.2.

5.4. Computational Efficiency Analysis
In this section, we analyze the time cost of our method in more detail. From Algorithm 1, we can see that there are 4 main components in the retrieval process: lookup table computing (T1), computing distances to database features (T2), scoring the images (T3), and sorting (T4). In the following experiments, we show how the time costs of the 4 parts change with the database size.
We show the contribution of the 4 parts to the total time cost in Figure 9. The experiment is performed with k = 500 and m = 128 on Holidays. Obviously, the feature distance computing stage (T2) is the most time-consuming one when the database size grows large. This is because more database features are required to be compared with the query. When the database size is 1M, it costs 0.96s. Without the inverted index structure, the complexity of T2 is O(N^2 × n), where n represents the database size and N^2 denotes the square of the used object patch number, which is a constant. When the inverted index is built, and if the database features are equally distributed over the inverted index entries, the complexity reduces to O(N^2 × n/k). We observe that the time to compute the lookup table (T1) is nearly constant. When the database size is small, for example, 1000, it occupies most of the retrieval time. However, when the database size is very large, it accounts for only a small proportion. The time it takes is about 0.09s during the retrieval process. The time taken on scoring and sorting increases with the database size. However, even when the database size is 1M, they each cost only about 0.05s.

Fig. 9. (a) Contribution of the four main parts of the retrieval process to the total time cost: lookup table computing (T1), computing distances to database features (T2), scoring the images (T3), and sorting (T4). (b) The average number of candidate features retrieved for distance computing.

5.5. Comparison
To demonstrate the superiority of our method in accuracy and efficiency, we make a comparison with the baseline and state-of-the-art methods. For notational convenience, we denote our object-level representation as OR in the comparisons.
First, we compare our object-level representation with three baseline methods: VLAD [Jégou et al. 2010], CNN-1 [Sun et al. 2014], and Fused-1. In the CNN-1 and Fused-1 methods, only one CNN feature or fused feature is extracted on the entire image.
OR-VLAD-16 denotes the object-level version of the VLAD method with vocabulary size 16, that is, VLAD-16 plugged into our framework [Sun et al. 2014]. The comparison results are summarized in Table III. We observe that, in all cases, the object-level representations are superior to their image-level counterparts.

Table III. Accuracy Comparison of Object-Level Representation with Baselines
Methods       Holidays (mAP)     UKBench (NS-Score)
CNN-1         0.710              3.412
OR-CNN        0.789 (↑ 0.079)    3.613 (↑ 0.201)
VLAD-16       0.572              3.167
OR-VLAD-16    0.639 (↑ 0.067)    3.258 (↑ 0.091)
Fused-1       0.815              3.754
OR-Fused      0.837 (↑ 0.022)    3.814 (↑ 0.06)

We then compare our method with some recent image search algorithms. Here, we present our method with different configurations: the original fused feature (OR), the compressed feature with PQ when m = 128 (OR-ADC128) and when m = 64 (OR-ADC64), and the indexed feature with the inverted index with k = 500 and m = 128 (OR-IVFADC). The compared methods include: (1) CWVT [Wang et al. 2011], an improved vocabulary tree–based method with contextual weighting of local features in both the descriptor and spatial domains; (2) SCSM [Shen et al. 2012], where a spatially constrained similarity measure is used to perform object retrieval; (3) BoC [Wengert et al. 2011], an advanced color signature fused with the SIFT descriptor for image retrieval; (4) Semantic-aware co-indexing (SC) [Zhang et al. 2013], a fusion of local invariant features and semantic attributes for image retrieval; (5) Coupled multi-index (CM) [Zheng et al. 2014a], an index-level fusion method to exploit information from multiple features in images; and (6) Bayes merging of multiple vocabularies (BM) [Zheng et al. 2014b], in which multiple vocabularies are built under the principle that low correlation should exist among them.

Table IV. Comparison of the Proposed Method with State-of-the-Art Methods
Methods                      Holidays (mAP)    UKBench (NS-Score)
CWVT [Wang et al. 2011]      0.78              3.56
SCSM [Shen et al. 2012]      0.762             3.52
BoC [Wengert et al. 2011]    0.789             3.50
SC [Zhang et al. 2013]       0.809             3.60
CM [Zheng et al. 2014a]      0.840             3.71
BM [Zheng et al. 2014b]      0.819             3.62
OR                           0.837             3.814
OR-ADC128                    0.818             3.790
OR-ADC64                     0.811             3.743
OR-IVFADC                    0.804             3.725
Note: The performance of the comparison algorithms is cited from the reported results of the original papers.

The comparison in Table IV shows that our method achieves state-of-the-art retrieval accuracy on both the Holidays and UKBench datasets. When no inverted index is applied, our method achieves the best NS-Score on UKBench and an mAP on Holidays second only to CM [Zheng et al. 2014a]. Even when we index the features for efficiency, which results in a decrease in accuracy, the mAP on Holidays is still comparable with most recent works, and the NS-Score on UKBench is superior to the others. The results on UKBench are significantly improved by our method. This is possibly due to the characteristics of the database, in which objects usually dominate the image scene and burstiness occurs frequently in richly textured regions. This problem is prevalent in local feature–based methods, but can be alleviated by our object-level representation.
We also compare our method with four baselines in large-scale image retrieval experiments with different database sizes. The compared methods are Vocabulary Tree (VT) [Nister and Stewenius 2006], Hamming Embedding (HE) [Jégou et al. 2008], Product Quantization (PQ) [Jégou et al. 2011], and VLAD [Jégou et al. 2010].
The codebook size is 0.99M in VT and 200K in HE. In PQ, the codebook size is also 200K with IVFADC applied, and m = 8, k* = 256. In VLAD, the SIFT feature codebook size is 16, and the VLAD feature dimension is reduced from 2048 to 512 by PCA. For VLAD we perform exhaustive search without an inverted index, because each image is represented by only one 512-D feature.
The memory cost comparison between our method and the baseline methods is summarized in Table V. In our implementation, the original features are 1024-dimensional, m is set as 64 or 128, and k* = 256. When m = 64, the memory cost of storing the features for 1 million images is 1M × 8 × 64 × 1 byte = 512MB. When m = 128, the memory cost is 1GB. In addition, 1MB of memory is required to store the centroids. Our method does not store image IDs for features, because we have a fixed number of features (i.e., 8) for each image and the image ID can be calculated from the feature ID (image ID = feature ID / 8). In our method, apart from the memory cost of the indexed features discussed earlier, we require 4MB of memory to store the cluster centers of the inverted index when k = 1000, and 1MB of memory to store the centroids in each feature subspace. Therefore, in total, 5MB is used for the quantizer.

Table V. Memory Cost Comparison of Object-Level Representation with Baselines
Methods                           Memory for features (GB)    Memory for quantizer (MB)
VT [Nister and Stewenius 2006]    8.0                         142
HE [Jégou et al. 2008]            12.0                        398
PQ [Jégou et al. 2011]            12.0                        100
VLAD [Jégou et al. 2010]          1.0                         0.0078
OR-ADC64                          0.5                         5
OR-ADC128                         1.0                         5

The VT [Nister and Stewenius 2006] needs 4 bytes to store one image ID and another 4 bytes to store the tf-idf weight. For 1M images, 1M × 1000 × 8 bytes = 8GB of memory is required to store all features. In HE [Jégou et al. 2008], 4 bytes and 8 bytes are used to store the image ID and the Hamming code, respectively. Therefore, 12GB of memory is used to store all features. In these two methods, about 142MB of memory is required to store a hierarchical visual vocabulary tree. In addition, HE also requires 256MB to store the median vectors for each leaf node. In PQ [Jégou et al. 2011], 4 bytes and 8 bytes are used to store one image ID and the compressed feature, respectively, leading to a 12GB memory cost for all features. To store the 20k cluster centers of the inverted index and all centroids in each feature subspace, about 100MB of memory is needed. VLAD [Jégou et al. 2010] stores one 512-D feature for each image, so 1GB of memory is needed for the 1M image dataset. In addition, 8KB is required to store the 16 cluster centers used to aggregate the SIFT features.

Table VI. Time Cost Comparison of Object-Level Representation with Baselines
Methods                           Average query time (s)
VT [Nister and Stewenius 2006]    0.098
HE [Jégou et al. 2008]            0.254
PQ [Jégou et al. 2011]            1.054
VLAD [Jégou et al. 2010]          1.180
OR-ADC128                         1.179

The time cost comparison between our method and the compared methods is summarized in Table VI. We show the average query time of these methods when performing image retrieval on the 1-million-image dataset. In implementation, our method is closest to PQ [Jégou et al. 2011]. We use 8 object-level features per image, while PQ uses thousands of SIFT features, so we make fewer comparisons between images. On the other hand, to obtain high accuracy, our optimal codebook size is much smaller than that used in PQ, resulting in many more features stored in each inverted index entry. Consequently, their time costs are very similar.
The VT [Nister and Stewenius 2006] does not perform feature-distance computing, and its codebook size is very large, so that the time cost is much smaller, but its accuracy is much lower, as illustrated in the following.The HE [Jégou et al. 2008] has the same system framework as VT, but verifies feature distances using hamming codes. The time cost of HE is a little larger than VT. However, its accuracy is still lower than that of PQ and our method. The VLAD [Jégou et al. 2010] keeps only one feature for each image, but exhaustive search is used to ensure accuracy in our implementation, and the time cost is comparable to our method. The accuracy comparisons on Holidays and UKBench are plotted in Figure 10, which demonstrates the scalability of our method for image retrieval in the large image database. We can see when m = 128, k = 500 or 1k, the accuracies are higher than m = 64, k = 500. As discussed earlier, however, when m = 128, 1GB memory is required to store the compressed features for 1M images, while only 512MB is required for m = 64. Nonetheless, even 1GB memory cost is still affordable in many real-life applications. It is notable that the performance drop of our method on UKBench is slight when the distractor image number increases. This is because images in the UKBench dataset usually contain objects with noisy background. Compared to other methods, our method can describe the objects well while suppressing the distraction from background areas. 6. CONCLUSION In this article, we propose a novel image retrieval framework with compact image representation from generic object regions. We first identify regions of interest with ACM Trans. Multimedia Comput. Commun. Appl., Vol. 12, No. 2, Article 29, Publication date: October 2015. Scalable Object Retrieval with Compact Image Representation from Generic Object Regions 29:19 Fig. 10. Comparison of retrieval accuracies on Holidays (a) and UKBench (b) in large-scale image-retrieval experiments. Three configurations of our method are compared with four baselines. a generic object detector. To describe the detected regions, we apply CNN to describe the global content and VLAD to capture the local invariant patterns. In addition, we propose fusing the CNN and VLAD features for a more effective representation. The fusion is performed on the feature level to avoid any modification to existing retrieval frameworks; promising accuracy promotion is achieved. Scalability on a large image database is obtained based on the inverted indexing structure. The representation is efficient in memory overhead, and the retrieval process is time efficient. Moreover, experiments on benchmark datasets demonstrate state-of-the-art performance of our proposed method. REFERENCES Bogdan Alexe, Thomas Deselaers, and Vittorio Ferrari. 2012. Measuring the objectness of image windows. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2189–2202. Satpathy Amit, Jiang Xudong, and Eng How-Lung. 2014. Human detection by quadratic classification on subspace of extended histogram of gradients. IEEE Transactions on Image Processing 23, 1, 287–297. Relja Arandjelovic and Andrew Zisserman. 2012. Three things everyone should know to improve object retrieval. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2911–2918. Herbert Bay, Tinne Tuytelaars, and Luc Van Gool. 2006. Surf: Speeded up robust features. In Proceedings of European Conference on Computer Vision. Springer, 404–417. R in Machine Learning 2, Yoshua Bengio. 2009. 
REFERENCES

Bogdan Alexe, Thomas Deselaers, and Vittorio Ferrari. 2012. Measuring the objectness of image windows. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2189–2202.
Amit Satpathy, Xudong Jiang, and How-Lung Eng. 2014. Human detection by quadratic classification on subspace of extended histogram of gradients. IEEE Transactions on Image Processing 23, 1, 287–297.
Relja Arandjelovic and Andrew Zisserman. 2012. Three things everyone should know to improve object retrieval. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2911–2918.
Herbert Bay, Tinne Tuytelaars, and Luc Van Gool. 2006. SURF: Speeded up robust features. In Proceedings of European Conference on Computer Vision. Springer, 404–417.
Yoshua Bengio. 2009. Learning deep architectures for AI. Foundations and Trends in Machine Learning 2, 1, 1–127.
Yang Cao, Changhu Wang, Liqing Zhang, and Lei Zhang. 2011. Edgel index for large-scale sketch-based image search. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 761–768.
Mingming Cheng, Z. Zhang, W. Lin, and P. Torr. 2014. BING: Binarized normed gradients for objectness estimation at 300fps. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE.
Lingyang Chu, Shuqiang Jiang, Shuhui Wang, Yanyan Zhang, and Qingming Huang. 2013. Robust spatial consistency graph model for partial duplicate image retrieval. IEEE Transactions on Multimedia 15, 8, 1982–1996.
Lingyang Chu, Shuhui Wang, Yanyan Zhang, Shuqiang Jiang, and Qingming Huang. 2014. Graph-density-based visual word vocabulary for image retrieval. In IEEE International Conference on Multimedia and Expo. IEEE, 1–6.
Ondrej Chum and Jiri Matas. 2010. Unsupervised discovery of co-occurrence in sparse high dimensional data. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 3416–3423.
Navneet Dalal and Bill Triggs. 2005. Histograms of oriented gradients for human detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Vol. 1. IEEE, 886–893.
Ian Endres and Derek Hoiem. 2010. Category independent object proposals. In Proceedings of European Conference on Computer Vision. Springer, 575–588.
Pedro Felzenszwalb, David McAllester, and Deva Ramanan. 2008. A discriminatively trained, multiscale, deformable part model. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 1–8.
Yunchao Gong, Liwei Wang, Ruiqi Guo, and Svetlana Lazebnik. 2014. Multi-scale orderless pooling of deep convolutional activation features. In Proceedings of European Conference on Computer Vision. Springer, 392–407.
Steven C. H. Hoi, Wei Liu, and Shih-Fu Chang. 2010. Semi-supervised distance metric learning for collaborative image retrieval and clustering. ACM Transactions on Multimedia Computing, Communications and Applications 6, 3, 18.
Eva Hörster and Rainer Lienhart. 2008. Deep networks for image retrieval on large-scale databases. In Proceedings of the 16th ACM International Conference on Multimedia. ACM, New York, NY, 643–646.
Hervé Jégou and Ondřej Chum. 2012. Negative evidences and co-occurences in image retrieval: The benefit of PCA and whitening. In Proceedings of European Conference on Computer Vision. Springer, 774–787.
Hervé Jégou, Matthijs Douze, and Cordelia Schmid. 2008. Hamming embedding and weak geometric consistency for large scale image search. In Proceedings of European Conference on Computer Vision. Springer, 304–317.
Hervé Jégou, Matthijs Douze, and Cordelia Schmid. 2010. Improving bag-of-features for large scale image search. International Journal of Computer Vision 87, 3, 316–336.
Hervé Jégou, Matthijs Douze, and Cordelia Schmid. 2011. Product quantization for nearest neighbor search. IEEE Transactions on Pattern Analysis and Machine Intelligence, 117–128.
Hervé Jégou, Matthijs Douze, Cordelia Schmid, and Patrick Pérez. 2010. Aggregating local descriptors into a compact image representation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 3304–3311.
Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. 2014. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093.
Timor Kadir, Andrew Zisserman, and Michael Brady. 2004. An affine invariant salient region detector. In Proceedings of European Conference on Computer Vision. Springer, 228–241.
Yan Ke and Rahul Sukthankar. 2004. PCA-SIFT: A more distinctive representation for local image descriptors. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Vol. 2. IEEE, II–506.
Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2012. ImageNet classification with deep convolutional neural networks. In Neural Information Processing Systems.
Michael S. Lew, Nicu Sebe, Chabane Djeraba, and Ramesh Jain. 2006. Content-based multimedia information retrieval: State of the art and challenges. ACM Transactions on Multimedia Computing, Communications and Applications 2, 1, 1–19.
Zhen Liu, Houqiang Li, Liyan Zhang, Wengang Zhou, and Qi Tian. 2014. Cross-indexing of binary SIFT codes for large-scale image search. IEEE Transactions on Image Processing.
Zhen Liu, Houqiang Li, Wengang Zhou, Richang Hong, and Qi Tian. 2015. Uniting keypoints: Local visual information fusion for large-scale image search. IEEE Transactions on Multimedia 17, 4, 538–548.
Zhen Liu, Houqiang Li, Wengang Zhou, Ruizhen Zhao, and Qi Tian. 2014. Contextual hashing for large-scale image search. IEEE Transactions on Image Processing.
David G. Lowe. 2004. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 91–110.
Tao Mei, Yong Rui, Shipeng Li, and Qi Tian. 2014. Multimedia search reranking: A literature survey. Computing Surveys 46, 3, 38.
Krystian Mikolajczyk and Cordelia Schmid. 2004. Scale and affine invariant interest point detectors. International Journal of Computer Vision 60, 1, 63–86.
David Nister and Henrik Stewenius. 2006. Scalable recognition with a vocabulary tree. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2161–2168.
Aude Oliva and Antonio Torralba. 2001. Modeling the shape of the scene: A holistic representation of the spatial envelope. International Journal of Computer Vision, 145–175.
Florent Perronnin, Jorge Sánchez, and Thomas Mensink. 2010. Improving the Fisher kernel for large-scale image classification. In Proceedings of European Conference on Computer Vision. Springer, 143–156.
Pierre Sermanet, Koray Kavukcuoglu, Soumith Chintala, and Yann LeCun. 2013. Pedestrian detection with unsupervised multi-stage feature learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 3626–3633.
Xiaohui Shen, Zhe Lin, Jonathan Brandt, Shai Avidan, and Ying Wu. 2012. Object retrieval and localization with spatially-constrained similarity measure and k-NN re-ranking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 3013–3020.
Josef Sivic and Andrew Zisserman. 2003. Video Google: A text retrieval approach to object matching in videos. In Proceedings of the International Conference on Computer Vision. 1470–1477.
Shaoyan Sun, Wengang Zhou, Houqiang Li, and Qi Tian. 2014. Search by detection: Object-level feature for image retrieval. In Proceedings of International Conference on Internet Multimedia Computing and Service. ACM, New York, NY, 46.
J. R. R. Uijlings, K. E. A. van de Sande, T. Gevers, and A. W. M. Smeulders. 2013. Selective search for object recognition. International Journal of Computer Vision, 154–171.
Paul Viola and Michael J. Jones. 2004. Robust real-time face detection. International Journal of Computer Vision 57, 2, 137–154.
Ji Wan, Dayong Wang, Steven Chu Hong Hoi, Pengcheng Wu, Jianke Zhu, Yongdong Zhang, and Jintao Li. 2014. Deep learning for content-based image retrieval: A comprehensive study. In Proceedings of the ACM International Conference on Multimedia. ACM, New York, NY, 157–166.
Shuang Wang and Shuqiang Jiang. 2015. INSTRE: A new benchmark for instance-level object retrieval and recognition. ACM Transactions on Multimedia Computing, Communications and Applications 11, 3, 37.
Xiaoyu Wang, Ming Yang, Timothee Cour, Shenghuo Zhu, Kai Yu, and Tony X. Han. 2011. Contextual weighting for vocabulary tree based image retrieval. In Proceedings of the International Conference on Computer Vision. 209–216.
Christian Wengert, Matthijs Douze, and Hervé Jégou. 2011. Bag-of-colors for improved image search. In ACM International Conference on Multimedia. ACM, New York, NY, 1437–1440.
Lingxi Xie, Qi Tian, Wengang Zhou, and Bo Zhang. 2014. Fast and accurate near-duplicate image search with affinity propagation on the ImageWeb. Computer Vision and Image Understanding 124, 31–41.
Lingxi Xie, Jingdong Wang, Bo Zhang, and Qi Tian. 2015. Fine-grained image search. IEEE Transactions on Multimedia 17, 5, 636–647.
Shiliang Zhang, Qi Tian, Gang Hua, Qingming Huang, and Wen Gao. 2011. Generating descriptive visual words and visual phrases for large-scale image applications. IEEE Transactions on Image Processing 20, 9, 2664–2677.
Shiliang Zhang, Qi Tian, Ke Lu, Qingming Huang, and Wen Gao. 2013. Edge-SIFT: Discriminative binary descriptor for scalable partial-duplicate mobile search. IEEE Transactions on Image Processing 22, 7, 2889–2902.
Shaoting Zhang, Ming Yang, Timothee Cour, Kai Yu, and Dimitris N. Metaxas. 2012. Query specific fusion for image retrieval. In Proceedings of European Conference on Computer Vision. Springer, 660–673.
Shiliang Zhang, Ming Yang, Xiaoyu Wang, Yuanqing Lin, and Qi Tian. 2013. Semantic-aware co-indexing for image retrieval. In Proceedings of the International Conference on Computer Vision.
Liang Zheng, Shengjin Wang, Ziqiong Liu, and Qi Tian. 2014a. Packing and padding: Coupled multi-index for accurate image retrieval. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE.
Liang Zheng, Shengjin Wang, Wengang Zhou, and Qi Tian. 2014b. Bayes merging of multiple vocabularies for scalable image retrieval. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 1963–1970.
Wengang Zhou, Houqiang Li, Richang Hong, Yijuan Lu, and Qi Tian. 2015. BSIFT: Towards data-independent codebook for large scale image search. IEEE Transactions on Image Processing 24, 3, 967–979.
Wengang Zhou, Houqiang Li, Yijuan Lu, and Qi Tian. 2013. SIFT match verification by geometric coding for large-scale partial-duplicate web image search. ACM Transactions on Multimedia Computing, Communications and Applications, 4.
Wengang Zhou, Houqiang Li, Yijuan Lu, and Qi Tian. 2014. Encoding spatial context for large-scale partial-duplicate web image retrieval. Journal of Computer Science and Technology 29, 5, 837–848.
Wengang Zhou, Qi Tian, Yijuan Lu, Linjun Yang, and Houqiang Li. 2011. Latent visual context learning for web image applications. Pattern Recognition 44, 10, 2263–2273.
Wengang Zhou, Ming Yang, Houqiang Li, Xiaoyu Wang, Yuanqing Lin, and Qi Tian. 2014. Towards codebook-free: Scalable cascaded hashing for mobile image search. IEEE Transactions on Multimedia 16, 3, 601–611.

Received December 2014; revised March 2015; accepted May 2015