
Scalable Object Retrieval with Compact Image Representation
from Generic Object Regions
SHAOYAN SUN and WENGANG ZHOU, CAS Key Laboratory of Technology in Geo-spatial
Information Processing and Application System, University of Science and Technology of China
QI TIAN, University of Texas at San Antonio
HOUQIANG LI, CAS Key Laboratory of Technology in Geo-spatial Information Processing
and Application System, University of Science and Technology of China
In content-based visual object retrieval, image representation is one of the fundamental issues in improving
retrieval performance. Existing works adopt either local SIFT-like features or holistic features, and may
suffer from sensitivity to noise or from poor discriminative power. In this article, we propose a compact
representation for scalable object retrieval built from a few generic object regions. The regions are identified
with a general object detector and are described with a fusion of learning-based features and aggregated
SIFT features. Further, we compress the feature representation for large-scale image retrieval scenarios.
We evaluate the proposed method on two public ground-truth datasets, with promising results. Experimental
results on a million-scale image database demonstrate superior retrieval accuracy together with efficiency
gains in both computation and memory usage.
Categories and Subject Descriptors: H.3.3 [Information Search and Retrieval]: Retrieval models
General Terms: Algorithms, Experimentation, Performance
Additional Key Words and Phrases: Image retrieval, compact image representation
ACM Reference Format:
Shaoyan Sun, Wengang Zhou, Qi Tian, and Houqiang Li. 2015. Scalable object retrieval with compact image
representation from generic object regions. ACM Trans. Multimedia Comput. Commun. Appl. 12, 2, Article 29
(October 2015), 21 pages.
DOI: http://dx.doi.org/10.1145/2818708
1. INTRODUCTION
The last decade has witnessed explosive growth of digital visual content on the
Internet, creating demand for effective and efficient algorithms to retrieve such data
from large-scale visual databases. As a result, content-based image retrieval has
attracted extensive attention from both academia and industry. In this article, we target
visual object retrieval in large-scale image databases.
In content-based image retrieval [Lew et al. 2006], the basic problem is to measure
the similarity between images [Hoi et al. 2010]. Generally, images are represented by
This work was supported in part for Professor Houqiang Li by 973 Program under contract No.
2015CB351803, NSFC under contract No. 61325009 and No. 61390514; in part for Dr. Wengang Zhou
by NSFC under contract No. 61472378 and the Fundamental Research Funds for the Central Universities under contract No. WK2100060014 and WK2100060011; and in part for Prof. Qi Tian by ARO grant
W911NF-12-1-0057 and Faculty Research Awards by NEC Laboratories of America, respectively. This work
was supported in part by NSFC under contract No. 61429201.
Authors’ addresses: S. Sun, W. Zhou, and H. Li, Electrical Engineering and Information Science Department,
University of Science and Technology of China, Hefei, 230027; emails: [email protected], {zhwg,
lihq}@ustc.edu.cn; Q. Tian, Department of Computer Science, University of Texas at San Antonio, San
Antonio, TX, 78249; email: [email protected].
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted
without fee provided that copies are not made or distributed for profit or commercial advantage and that
copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by
others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to
post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions
from [email protected].
© 2015 ACM 1551-6857/2015/10-ART29 $15.00
DOI: http://dx.doi.org/10.1145/2818708
visual features and image similarity is defined by the comparison of visual features
[Zhou et al. 2011]. According to the image scope in feature extraction, visual features
can be divided into two categories: local and global.
Local features, such as SIFT [Lowe 2004], are extracted from detected interest points
and designed to be robust to various changes in illumination, rotation, scaling, and
partial occlusion. Owing to these merits, local features have been widely adopted as a
routine image representation in CBIR since the pioneering work of Video Google [Sivic
and Zisserman 2003]. There are mainly two strategies to perform image retrieval with
local features. In the first strategy, text retrieval techniques are leveraged to quantize
the high-dimensional continuous local features to discrete visual words with a large
visual codebook. Then, an image is represented in a sparse and uniform visual word
histogram and an inverted file structure is adopted for efficient indexing and retrieval
[Nister and Stewenius 2006; Zhang et al. 2011; Chu et al. 2014; Zhou et al. 2014].
In this paradigm, each local feature should be quantized and indexed individually,
which causes severe memory overhead. The second strategy alleviates this problem by
aggregating local features of an image into a dense feature vector with a small visual
codebook [Jégou et al. 2010; Perronnin et al. 2010; Liu et al. 2015]. In this way, some
hashing techniques for nearest neighbor search are adopted to identify relevant image
results based on those dense representations.
In contrast, global features encode the whole image content—such as color, edge,
texture, and structure—into a single holistic representation. Representative global
features include GIST [Oliva and Torralba 2001] and edgel [Cao et al. 2011]. These
features represent an image with only one feature vector. Although efficient in memory
cost and computation, they suffer from poor discriminative power.
Apart from handcrafted features, it is also possible to extract features in a data-driven
manner. The recent explosive research on deep neural networks (DNNs) has demonstrated
the success of data-driven features in multiple areas. With deep architectures, high-level
abstractions that are close to human cognition can be learned [Bengio 2009], so DNNs
are well suited to extracting semantic-aware features. In Hörster and Lienhart [2008],
features are extracted from local patches with a deep belief network (DBN) built from
restricted Boltzmann machines, and the BoW model is used to perform image retrieval. In
Sun et al. [2014], convolutional neural networks (CNNs) are applied to extract image
features for image retrieval. A comprehensive study of CNN-based CBIR is given in
Wang et al. [2014], where a CNN is used to extract one feature from an image as a holistic
descriptor, demonstrating impressive performance in their experiments.
Both local and global features suffer from some nontrivial issues in the scenario of image
retrieval. Local invariant features are sensitive to rich texture in image backgrounds
and suffer from the problem of burstiness in repetitive areas such as grass and carpets
[Wang et al. 2011]. Though geometric verification [Zhou et al. 2013; Liu et al. 2014; Chu
et al. 2013; Zhou et al. 2014] or retrieval list reranking [Mei et al. 2014; Xie et al. 2014]
can alleviate their impact to some extent, they introduce more complexity. In addition,
they are unstable when there are large changes in viewpoint. Global features describe
the image as a whole and are more suitable for appearance-similar image search than
object retrieval. In Figure 1, we illustrate the top 4 retrieval results from the UKBench
dataset for one query with different features. The query image shows a beer bottle
placed on a carpet. The results in the top row are returned by the local feature–based
method. Three unrelated objects are returned because most feature correspondences
are from the background carpet area. The middle row shows results of one global
CNN feature describing image content as a whole. We observe that two different kinds
of bottles are returned, which implies that global features can capture the content
appearance but may fail to describe the details of the target object.
Fig. 1. Retrieval results of one query from the UKBench dataset for three retrieval methods. The results
are generated by a baseline local feature–based method [Nister and Stewenius 2006] (top row), one global
feature–based method we implement with a CNN tool-kit [Jia et al. 2014] (middle row) and our method (last
row), respectively. The bounding boxes are drawn by a general object detector [Cheng et al. 2014], indicating
detected object patches.
To avoid the issues discussed earlier, we propose extracting features from generic
object regions (denoted as object patches hereafter). The motivation is that, at the
object level, the object appearance remains consistent no matter how the background
and image layout change, so global features can describe the object precisely. With
little background kept in the object patches, the interference of noise features from the
background area can be significantly weakened or eliminated. As a result, both global
and local features are expected to benefit from this representation. However, we do not
require the detected regions to be exactly meaningful objects, since our objective is to
represent image content for similar-image identification rather than object detection.
To achieve scalable retrieval in a large database, we adopt product quantization (PQ)
[Jégou et al. 2011] to compress the object-level features and speed up distance computation.
Different from local feature–based voting methods [Jégou et al. 2011; Zheng
et al. 2014a; Jégou et al. 2008; Zhou et al. 2015], which consider thousands of local
features in each image, we preserve only a few object-level features and gain significant
memory efficiency. Moreover, as mentioned before, the object-level representation is
more discriminative than global image–level representations. In large-scale image
retrieval experiments on one million database images, we demonstrate state-of-the-art
retrieval accuracy with very efficient memory usage and real-time search response.
The framework of the proposed method is illustrated in Figure 2. It consists of three
components: feature extraction, indexing, and querying. The same feature extraction
process is conducted in both the index phase and the query phase. The inverted index
is built in the index phase, storing all database-image features with their IDs and
compressed representations. In the query phase, the distances between query-image
features and related database-image features are computed and used to score the
database images, and the images with the highest scores are returned as the retrieval
results.
In the feature extraction phase, we first detect potential objects in images with a
general-object detector. Among the few related works available in recent years, we adopt
BING [Cheng et al. 2014], which has been demonstrated to be very fast, because
image-retrieval systems usually require a real-time response.
Then we extract features from the object patches. There are also multiple alternatives
for which features to extract. In our implementation, we test three kinds of features.
Fig. 2. The proposed image retrieval framework with object-level features. The feature-extraction stage
takes one image as the input and outputs a few object-level features. It contains three steps: object detection,
CNN and VLAD feature extraction, and feature fusion. In the index phase, features are extracted from all
database images, and are indexed into an inverted table and encoded with PQ. In the query phase, features
extracted from the query image are assigned to related indexes and the distances between query features,
and database features are computed for scoring and ranking. Images in this figure are from the Holidays
dataset.
To describe the patches effectively as a whole, we adopt the CNN model trained on
ImageNet by Krizhevsky et al. [2012]. This model was originally trained for object
classification and achieved great success; it is therefore descriptive for objects and fits
our scenario well. To describe the local properties of the object patches, we extract
SIFT features and aggregate them with VLAD. We experimentally demonstrate the
performance improvement of the object-level representations over their image-level
counterparts. In addition, we propose fusing the two representations on the feature
level. Since CNN and VLAD specialize in describing different properties of the image
(i.e., general semantics and local details), their fusion yields a better representation.
In particular, as a global description, the CNN feature cannot handle variations in
scale and rotation well, which is complemented by the VLAD feature generated from
SIFT. We denote such features extracted from object patches as object-level features
hereafter.
The rest of this article is organized as follows. We first introduce the related work in
Section 2. Then we discuss how to extract object-level features in Section 3. We describe
our image-retrieval framework with the proposed object-level feature, as well as the
details of feature quantizing and indexing for large-scale image retrieval, in Section 4.
Next, we provide experimental results with the proposed method in terms of accuracy,
efficiency, and memory cost, as well as conduct comparisons with the state-of-the-art
methods in Section 5. We present our conclusions in Section 6.
2. RELATED WORK
Our work involves feature detection, feature description, feature fusion, and image
indexing. In this section, we discuss related work on each of these topics.
As a prerequisite step for feature description, feature detection aims at locating
repeatable local structures. The SIFT feature [Lowe 2004], a representative local invariant feature, identifies interest points that are scale and translation invariant with
the DoG detector. In addition, some works extend it with detectors such as the Hessian-Affine detector [Mikolajczyk and Schmid 2004], the MSER detector [Kadir et al. 2004], and
SURF [Bay et al. 2006].
However, these detectors detect only low-level salient structures, such as blobs and
corners. The detected regions of interest or patches are usually simple corner points
or texture with little semantic information; thousands of such patches can be detected
in one image. To locate image patches that contain complicated objects, we need to
ACM Trans. Multimedia Comput. Commun. Appl., Vol. 12, No. 2, Article 29, Publication date: October 2015.
Scalable Object Retrieval with Compact Image Representation from Generic Object Regions
29:5
apply an object detector. Though detection models for certain objects (e.g., human face,
pedestrian, vehicles) have been developed for decades and many successful models have
been proposed (e.g., the Viola-Jones face detector [Viola and Jones 2004], the HoG-based
human detector [Dalal and Triggs 2005; Amit et al. 2014], and the deformable part-based model [Felzenszwalb et al. 2008]), it is infeasible to train models object-wise in
real applications. As an alternative, we can resort to general-object detection to find
object patches regardless of object categories. While some works [Uijlings et al. 2013;
Alexe et al. 2012; Endres and Hoiem 2010] have tried to solve this problem recently,
they can hardly achieve high detection rate, high computational efficiency, and good
generalization ability simultaneously. We adopt the recent BING detector [Cheng
et al. 2014] as our general object detector because it greatly improves detection
repeatability and effectiveness, making it well suited to detecting object patches of
interest in image retrieval.
The other step in feature extraction is feature description. Traditionally, handcrafted
descriptors are exploited to represent images or image patches. For example, the SIFT
feature [Lowe 2004] computes the gradient magnitude around detected key points. In
addition, there are some variants of SIFT, such as PCA-SIFT [Ke and Sukthankar
2004] and Edge-SIFT [Zhang et al. 2013]. The global GIST feature [Oliva and Torralba
2001] integrates orientation, color, and intensity information in the whole image. In
recent years, CNN has been frequently applied in multiple computer-vision research
areas. With local receptive fields and shared weights, CNN can extract high-level semantic features from raw pixels efficiently. In Krizhevsky et al. [2012], a CNN model is
trained to perform image classification and achieves outstanding accuracy. In Sermanet
et al. [2013], the proposed CNN model, along with some new twists, demonstrates state-of-the-art performance on pedestrian detection. Inspired by these successes, we make
use of CNN to extract object-level features for image retrieval in this article. In Gong
et al. [2014], CNN features are extracted from predefined subwindows in different
scales of images, and pooled as the final representation. This work is similar to ours
in that CNN features are extracted from local patches, but it differs in that we extract
features only from object-like patches, so the number of regions we examine is much
smaller, which makes our approach scalable to the image retrieval scenario.
Feature fusion is a common technique to utilize the advantages of different features.
Generally, feature fusion is performed in either the indexing or reranking phase. In
Zhang et al. [2013] and Xie et al. [2015], semantic attributes are co-indexed into an
inverted index from visual words; in Zheng et al. [2014a], a multi-index consisting of
color and SIFT visual words is built. Zhang et al. [2012] proposes a graph-based, query-specific fusion approach to merge retrieval results given by different features. All these
methods treat different kinds of features separately, and fusing them requires some
modification to the existing image-retrieval frameworks. In this article, we perform
feature fusion from another perspective, that is, we explore the possibility of fusing
different features on the feature level by combining multiple feature vectors into one
single representation. Such an operation raises the concern that the space distributions
and distance metrics of different features usually differ substantially. However,
we observe that, after appropriate transformations of the original features, we can
place them in one unified metric space. This feature-level fusion makes our
representation scheme readily adaptable to existing image retrieval frameworks.
When images are represented by local features, it is intractable and prohibitive to
perform large-scale image retrieval with the original features by exhaustive linear
scan. To achieve scalability to a large image database, a visual codebook is trained to
quantize local features to visual words. Based on the quantization results, the inverted
index structure [Sivic and Zisserman 2003] can be used [Nister and Stewenius 2006;
Jégou et al. 2010, 2011] to index a large image database for efficient retrieval. To
lower the memory cost and speed up feature distance verification, features are usually
hashed to binary features [Jégou et al. 2008; Liu et al. 2014]. In this article, we follow
the work of Jégou et al. [2011] to quantize features to a number of inverted indexes,
and store compressed features with PQ.
3. OBJECT-LEVEL FEATURE
In this section, we introduce our object-level feature for image representation. In
Section 3.1, we discuss the general-object detector that we employ to detect possible
object patches. Then, we introduce two kinds of image representations with CNN in
Section 3.2 and VLAD in Section 3.3. Finally, we describe the proposed feature-level
fusion in Section 3.4.
3.1. Object Patch Detection
We choose the BING detector [Cheng et al. 2014] to detect generic object proposals/regions for feature extraction. The BING detector is demonstrated to have a very
high detection rate with a relatively small number of object proposals. In addition, it
enjoys excellent generalization ability to identify diverse objects with extremely high
detection speed. The generalization ability means that the detected object proposals
are generic over categories, and the high detection speed (300fps on a laptop) makes
the detector well adapted to the task of image retrieval, which requires a real-time
response.
When applying the object detector in our framework, we also emphasize the detection
repeatability and the saliency of the object proposals. Repeatability means that in two
similar images, the proposals should be consistent, so that the matching between them
is reliable. Saliency means that an object proposal should be an informative area for
discrimination. These two requirements are well satisfied with the BING detector
[Cheng et al. 2014]. A simple example is shown in Figure 1. Moreover, when the
matching is a reliable one between informative areas, it is sufficient for image retrieval
and we do not require the detected area to be exactly a meaningful object.
We run the object detector on every image. The detector outputs thousands of candidate object proposals, each with a score indicating the likelihood that it contains an object.
Instead of keeping all of them, we only preserve a few, with the highest scores. As
reported in Cheng et al. [2014], the top 7 object proposals can hit an object with a
probability of 45%; however, if we want to raise the probability to 80%, 100 proposals
should be considered, which will increase the complexity greatly and may introduce
some noisy proposals.
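As an illustration of this selection step, the sketch below keeps the Np highest-scoring proposals and appends the whole image as an extra region; the `bing_detect` callable and its (x, y, w, h, score) output format are assumptions for illustration, not the BING interface itself.

```python
def select_object_patches(image, bing_detect, num_patches=7):
    """Keep the top-scoring object proposals plus the whole image.

    `bing_detect` is assumed to return a list of (x, y, w, h, score)
    tuples, one per candidate proposal.  Only the `num_patches`
    highest-scoring boxes are kept, and the full image is appended
    as one extra region (so N = num_patches + 1 regions in total).
    """
    proposals = sorted(bing_detect(image), key=lambda p: p[4], reverse=True)
    boxes = [p[:4] for p in proposals[:num_patches]]   # top-Np object patches
    h, w = image.shape[:2]
    boxes.append((0, 0, w, h))                         # the entire image as one region
    return [image[y:y + bh, x:x + bw] for (x, y, bw, bh) in boxes]
```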
3.2. Feature Extraction with CNN
Recent years have witnessed the great success of DNNs in many research areas, including computer vision. Among the many algorithms in the DNN family, CNN has been
demonstrated to be a powerful tool to extract expressive image features [Krizhevsky
et al. 2012].
We make use of the pretrained CNN model designed by Krizhevsky et al. [2012] and
implemented by Jia et al. [2014]. In this model, each input image (or object patch in our
method) is resized to 224 × 224 and then passed through 5 convolutional layers and 3
fully connected layers. The output layer of this model has 1000 nodes for classification.
We discard this layer because our objective is to represent the image rather than to
classify it. In our framework, the output of the model is a 4096-D positive, real-valued
vector. With the Caffe tool-kit, 500 features can be extracted per second with a GPU
and 50 per second with a CPU [Jia et al. 2014]. The time cost of feature extraction is
thus negligible for image retrieval.
Before measuring the distance between different features, we need to normalize
them. Following Arandjelovic and Zisserman [2012], we obtain the root feature by first
L1 normalizing the feature vector and then computing the square root per dimension:
x_i = \sqrt{ x_i / \|x\|_1 },    (1)

where \|x\|_1 is the L1 norm of x.
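A minimal NumPy sketch of this root normalization, assuming the input feature is nonnegative (as the CNN activations used here are); the small epsilon is added only to guard against an all-zero vector:

```python
import numpy as np

def root_normalize(x, eps=1e-12):
    """L1-normalize a nonnegative feature vector, then take the
    element-wise square root (Equation (1))."""
    x = np.asarray(x, dtype=np.float64)
    return np.sqrt(x / (x.sum() + eps))
```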
3.3. Feature Extraction with VLAD
VLAD [Jégou et al. 2010] is a popular local-feature aggregation method. Specifically,
SIFT features are first extracted in affine-invariant regions of an image. Then, each
feature is quantized to one of k pretrained clusters. For each cluster, the residuals
between all features quantized to it and the cluster centroid are accumulated, and the
resulting sums are concatenated. The representation can be described as:

v = \left[ \sum_{q(x)=1} (x - c_1); \; \ldots; \; \sum_{q(x)=k} (x - c_k) \right],    (2)
where q(x) = t denotes the SIFT features quantized to the t-th cluster, and c_t is the
corresponding cluster centroid. The semicolon in the equation denotes the concatenation
of two column vectors.
When extracting VLAD on one object patch P, we modify Equation (2) as:

v_P = \left[ \sum_{q(x)=1,\, p(x)\in P} (x - c_1); \; \ldots; \; \sum_{q(x)=k,\, p(x)\in P} (x - c_k) \right],    (3)
where q(x) = t, p(x) ∈ P denotes SIFT features that are located in patch P and quantized to the t-th cluster. With the second constraint, we make use of the geometry
information of SIFT features, which is ignored in the original VLAD representation.
Following Jégou et al. [2010], we perform L2-normalization on the VLAD vector.
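The following sketch illustrates this patch-restricted VLAD aggregation in NumPy; the brute-force nearest-centroid assignment and the (x, y, w, h) box convention are simplifying assumptions for illustration rather than the authors' implementation:

```python
import numpy as np

def vlad_for_patch(descriptors, keypoints, centroids, patch_box):
    """Aggregate the SIFT descriptors whose keypoints fall inside one
    object patch into a VLAD vector (Equation (3)).

    descriptors: (n, 128) SIFT descriptors of the whole image
    keypoints:   (n, 2) keypoint locations (x, y)
    centroids:   (k, 128) small visual codebook (k = 16 in this article)
    patch_box:   (x, y, w, h) of the object patch P
    """
    x0, y0, w, h = patch_box
    inside = ((keypoints[:, 0] >= x0) & (keypoints[:, 0] < x0 + w) &
              (keypoints[:, 1] >= y0) & (keypoints[:, 1] < y0 + h))
    descs = descriptors[inside]
    k, d = centroids.shape
    v = np.zeros((k, d))
    if len(descs) > 0:
        # Assign each descriptor to its nearest centroid and accumulate residuals.
        dists = ((descs[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
        assign = dists.argmin(axis=1)
        for i in range(k):
            v[i] = (descs[assign == i] - centroids[i]).sum(axis=0)
    v = v.reshape(-1)
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v   # L2 normalization, as in the article
```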
3.4. Fusion on Feature Level
With CNN, we can extract image features representing high-level abstractions. In addition,
with VLAD, local properties of images are aggregated. While image retrieval with
either feature is possible, and their performance is demonstrated in our initial study
[Sun et al. 2014], we expect their fusion to generate a better representation with
both high-level information and local invariant properties. To avoid complicating the
retrieval system, we propose performing the fusion on the feature level by combining
the CNN feature and VLAD vector into a single feature vector.
The most intuitive feature-level fusion is to directly concatenate the two vectors into one
long vector. Denote the CNN feature as x_C and the VLAD feature as x_V. This simple
fusion can be written as:

x_{f_1} = [x_C; x_V].    (4)

When computing the (squared Euclidean) distance between two fused features x_{f_1} and y_{f_1}, we have

D(x_{f_1}, y_{f_1}) = \|x_{f_1} - y_{f_1}\|^2 = \|x_C - y_C\|^2 + \|x_V - y_V\|^2.    (5)
However, it is problematic to add the distances of these two kinds of features for two
reasons. First, these two kinds of features have different scale distributions, thus the
distances computed with them are incomparable. Second, since the original features
suffer from the problem of co-occurrence [Chum and Matas 2010] (so that some patterns
are overcounted when comparing two features), the simple addition can lead to one
distance dominating the summation with many overcounted patterns.
To overcome this difficulty, we propose applying principal component analysis (PCA)
and whitening to the two features separately before concatenation. This removes the
redundancy in the features and makes all feature components share the same variance.
After these two operations, the original features lie in a uniform metric space, and
computing and comparing distances becomes meaningful.
Moreover, PCA can also help to reduce the dimension of the original features.
The success of PCA and whitening on VLAD has been demonstrated in Jégou and
Chum [2012], where the two operations on the VLAD vector improve the retrieval
performance remarkably. We perform a similar transformation on our VLAD and CNN
features, respectively, and then concatenate them into a single feature:

\hat{x}_C = \frac{\mathrm{diag}(\lambda_1^{-1/2}, \ldots, \lambda_{D_C}^{-1/2})\, p_C^T x_C}{\left\| \mathrm{diag}(\lambda_1^{-1/2}, \ldots, \lambda_{D_C}^{-1/2})\, p_C^T x_C \right\|}, \quad
\hat{x}_V = \frac{\mathrm{diag}(\gamma_1^{-1/2}, \ldots, \gamma_{D_V}^{-1/2})\, p_V^T x_V}{\left\| \mathrm{diag}(\gamma_1^{-1/2}, \ldots, \gamma_{D_V}^{-1/2})\, p_V^T x_V \right\|}, \quad
x_f = [\hat{x}_C; \hat{x}_V],    (6)

where (\lambda_1, \ldots, \lambda_{D_C}) and (\gamma_1, \ldots, \gamma_{D_V}) denote the sorted eigenvalue lists of the
two features, p_C and p_V are the associated eigenvector matrices, and D_C and D_V are the
feature dimensions of the CNN and VLAD features preserved after PCA, respectively.
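A compact sketch of Equation (6) with NumPy: the PCA-whitening transform is learned from a training set of features, each feature is projected, whitened, and L2-normalized, and the two halves are concatenated. The explicit mean subtraction is a standard PCA step that Equation (6) leaves implicit, and eigendecomposition of the covariance matrix is one of several equivalent ways to obtain the transform:

```python
import numpy as np

def pca_whiten_params(features, out_dim):
    """Learn a PCA-whitening transform from a matrix of training features
    (one feature per row), keeping the `out_dim` leading components."""
    mean = features.mean(axis=0)
    cov = np.cov(features - mean, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1][:out_dim]     # largest eigenvalues first
    return mean, eigvecs[:, order], eigvals[order]

def whiten_and_normalize(x, mean, P, lam, eps=1e-12):
    """Project, whiten, and L2-normalize one feature (one half of Equation (6))."""
    y = (P.T @ (x - mean)) / np.sqrt(lam + eps)
    return y / (np.linalg.norm(y) + eps)

def fuse(x_cnn, x_vlad, cnn_params, vlad_params):
    """Concatenate the whitened CNN and VLAD features into the fused vector x_f."""
    return np.concatenate([whiten_and_normalize(x_cnn, *cnn_params),
                           whiten_and_normalize(x_vlad, *vlad_params)])
```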
4. IMAGE RETRIEVAL WITH OBJECT-LEVEL FEATURE
In this section, we introduce the image-retrieval framework with the proposed object-level feature. In Section 4.1, we discuss how to measure the similarity between the
query and database images. In Section 4.2, we describe the feature quantization and
indexing method for image retrieval on a large-scale database.
4.1. Similarity Measurement
In each image, we extract features (CNN feature, VLAD feature, or the fused feature in
Equation (6)) on Np object patches as our object features. To make full use of the information
in the whole image, we also extract one feature from the entire image. As a result, we
generate N = Np + 1 object features in total from one image. The image is then
represented as a group of feature vectors:
X = \{x_1, \ldots, x_N\}, \quad x_i \in \mathbb{R}^m,    (7)

where m denotes the dimension of each feature vector.
Given a query image X^q, to measure its similarity with a database image X^d, we
define a matching score S(X^q, X^d) based on the distances between their object features:

S(X^q, X^d) = \sum_{i=1}^{N} f\!\left( \min_j D(x_i^q, x_j^d) \right),    (8)

where D(x_i^q, x_j^d) represents the distance between the i-th object feature of X^q and the
j-th object feature of X^d, and f(x) is an exponential function defined as:

f(x) = \exp\!\left( -(\alpha x)^2 \right).    (9)
ALGORITHM 1: Retrieval Process
Input:
    Features in a query image, Q;
    Database size, N;
    Image IDs corresponding to all features;
Output:
    Returned retrieval results, R;
Scores = [0, 0, . . . , 0]_N;
for each q ∈ Q do
    Compute the distance lookup table for efficient computation of D(·, ·);
    Dists = [Inf, Inf, . . . , Inf]_N;
    C = MA(q); // Neighbor clusters found by multiple assignment.
    for d ∈ {features indexed in C} do
        t = image_id(d);
        Dists[t] ← min(D(q, d), Dists[t]);
    end
    for i = 1 to N do
        if Dists[i] ≠ Inf then // Ignore images that are never visited.
            Scores[i] += exp(−(α · Dists[i])^2);
        end
    end
end
R = Sort(Scores);
The exponential function penalizes large feature distances. With this setting, relevant
images gain a high score from shared similar object patches, while irrelevant
object-patch pairs between two images contribute only a low score. It is possible to apply
another decreasing function here (e.g., a sigmoid or tangent function). However, we
find the selected exponential function simple and effective in experiments. The effect
of the parameter α will be discussed in the experiments.
Finally, the database images are ranked by the matching scores and returned to the
user as retrieval results.
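For uncompressed features, Equations (8) and (9) translate directly into the following NumPy sketch; in the indexed system of Section 4.2 the same score is accumulated from PQ-approximated distances instead:

```python
import numpy as np

def matching_score(query_feats, db_feats, alpha):
    """Matching score S(X^q, X^d) of Equations (8) and (9): each query
    object feature finds its closest feature in the database image, and
    that distance is turned into a score by the exponential weighting f.

    query_feats, db_feats: arrays of shape (N, m), one object-level
    feature per row (N = Np + 1 features per image).
    """
    diffs = query_feats[:, None, :] - db_feats[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))      # pairwise Euclidean distances
    min_dists = dists.min(axis=1)                   # min_j D(x_i^q, x_j^d)
    return float(np.exp(-(alpha * min_dists) ** 2).sum())
```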
4.2. Quantization and Indexing
For large-scale image retrieval, time and memory cost should be taken into consideration. It is not scalable to do exhaustive search with the original features in the image
database. We exploit PQ [Jégou et al. 2011] to compress the features and speed up feature distance computing, and adopt the inverted index structure to avoid exhaustive
search.
4.2.1. Feature Quantization with PQ. In product quantization, the original feature space
is decomposed into a Cartesian product of m low-dimensional subspaces. If the original
feature is D dimensional, then the dimension of each subspace is D∗ = D/m. In each
subspace, k∗ cluster centroids are trained and stored. With these settings, each feature
is quantized m times, each in one subspace and the IDs of corresponding centroids are
stored.
When computing the distance between one query feature and one database feature,
we apply asymmetric distance computation (ADC) proposed in Jégou et al. [2011],
in which, before search, we compute and store the distances of the query feature
to centroids in each subspace, and the final distance is computed by summing the
precomputed distances in a lookup table.
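A minimal sketch of PQ encoding and ADC distance computation, assuming the per-subspace codebooks have already been trained (e.g., with k-means); it follows the scheme of Jégou et al. [2011] in spirit rather than reproducing their implementation:

```python
import numpy as np

def pq_encode(x, codebooks):
    """Encode one feature with product quantization.

    codebooks: array of shape (m, k_star, d_star) with the centroids of
    each of the m subspaces; the code is one centroid ID per subspace.
    """
    m, k_star, d_star = codebooks.shape
    sub = x.reshape(m, d_star)
    return np.array([((codebooks[i] - sub[i]) ** 2).sum(axis=1).argmin()
                     for i in range(m)], dtype=np.uint8)  # k_star = 256 fits in one byte

def adc_lookup_table(query, codebooks):
    """Precompute squared distances from the query's subvectors to every
    centroid of every subspace (the ADC lookup table)."""
    m, k_star, d_star = codebooks.shape
    sub = query.reshape(m, d_star)
    return ((codebooks - sub[:, None, :]) ** 2).sum(axis=2)   # shape (m, k_star)

def adc_distance(table, code):
    """Approximate squared distance between the query and one encoded
    database feature: sum the table entries selected by the code."""
    return table[np.arange(len(code)), code].sum()
```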
4.2.2. Inverted Index. We train k clusters in the complete feature space, and each
database image feature is quantized to one of the clusters. The inverted index is built
with each entry corresponding to one cluster, where all IDs of features quantized to the
cluster are stored.
In the online querying stage, we apply multiple assignment (MA) proposed in Jégou
et al. [2010]. First, 10 cluster centroids nearest to the query feature are found with
ANN algorithms. If the distance d of the query feature to one of these centroids is smaller
than δ · d0, where d0 is the distance of the query feature to its nearest centroid (δ = 1.2,
as set in Jégou et al. [2010]), then the inverted index list associated with that centroid
is visited.
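A small sketch of this multiple-assignment step; a brute-force nearest-centroid search stands in for the ANN search used in practice, and δ = 1.2 follows the setting quoted above:

```python
import numpy as np

def multiple_assignment(query, coarse_centroids, max_neighbors=10, delta=1.2):
    """Select which inverted-index entries to visit for one query feature.

    Among the `max_neighbors` nearest coarse centroids, keep those whose
    distance to the query is within `delta` times the distance d0 to the
    nearest centroid."""
    dists = np.linalg.norm(coarse_centroids - query, axis=1)
    nearest = np.argsort(dists)[:max_neighbors]
    d0 = dists[nearest[0]]
    return [int(c) for c in nearest if dists[c] <= delta * d0]
```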
We describe the algorithm in the retrieval process in Algorithm 1.
5. EXPERIMENTAL RESULTS
In this section, we evaluate the proposed method on two public benchmark datasets:
the Holidays dataset [Jégou et al. 2008] and the UKBench dataset [Nister and
Stewenius 2006]. Wang and Jiang [2015] list a few common benchmark datasets for
image retrieval. We chose these two datasets because they are among the most used
ones in the field, and images in them are very suitable for our scenario of object
retrieval. The Holidays dataset contains 1491 holiday images from 500 groups. The
first image in each group is selected as query. Mean Average Precision (mAP) is used
to evaluate the retrieval accuracy. In the UKBench dataset, there are 10200 images
from 2550 object/scene categories, each containing 4 images. On this dataset, NS-score
(the average number of relevant images among the top-4 results, with a maximum of 4)
is used to measure the retrieval accuracy.
To evaluate the scalability of the proposed algorithm, we use the MIR Flickr 1M
dataset as the distractor dataset. This dataset contains 1 million images randomly retrieved
from Flickr. We run all experiments on a single core of a PC with an Intel i7-3770K
CPU.
In Section 5.1, we explore the impact of related parameters on retrieval performance
with different features. Then, we illustrate some retrieval results to demonstrate the
benefit introduced by feature fusion in Section 5.2. In Section 5.3, we show the experimental results on large-scale image retrieval with different experiment settings.
After that, we analyze the time efficiency on four main components in our method in
Section 5.4. At last, we compare our method in multiple experimental settings with
other related algorithms in Section 5.5.
5.1. Impact of Parameters
In our method, there are 3 key parameters: α in the scoring function in Equation (9),
object patch number Np, and feature dimension D. Here, we discuss their impact when
CNN, VLAD, and the fused feature are adopted in the framework, respectively. When
extracting VLAD features, we set the codebook size as 16, so that the initial VLAD
feature dimension is 16 × 128 = 2048.
First, we explore the impact of α by experiments on the Holidays and UKBench
datasets. We fix Np = 7 and reduce the CNN and VLAD feature dimension to 512 by
PCA, so that the dimension of the fused feature is 1024. To improve the performance
of the 512-D VLAD feature, we whiten it according to Jégou and Chum [2012].
From Figure 3, we can see that when selecting the CNN feature, the mAP on Holidays
varies from 0.771 to 0.789 and peaks at α = 3, while the NS-score on UKBench achieves
its best value of 3.613 in the interval α ∈ [2.0, 3.0]. Based on this observation, we set
α = 3 for CNN feature in the rest of the experiments. When applying VLAD, the mAP on
Holidays peaks at α = 0.5 with the value 0.639; at this point, the NS-Score on UKBench
achieves the best value, 3.264. Thus we set α = 0.5 for VLAD in the following. A similar
trend is observed when the fused feature is used, and we get the optimized value of
α = 0.25, where the mAP on Holidays is 0.837 and the NS-Score on UKBench is 3.814.
Fig. 3. The impact of α on image-retrieval accuracy. The three figures show how retrieval accuracy changes
with α when CNN (a), VLAD (b) and the fused feature (c) are tested, respectively.
Fig. 4. The impact of object patch number Np on image-retrieval accuracy. The three figures show how the
retrieval accuracy changes with Np when CNN (a), VLAD (b), and the fused feature (c) are used, respectively.
Then, we study the impact of object patch number Np on the Holidays and UKBench
datasets. In the related experiments, we keep the feature dimensions of CNN and
VLAD as 512, and α as the corresponding optimized values summarized before. We
evaluate Np with values from 1 to 35, that is, the feature number N in each image from
2 to 36.
The experiment results with CNN, VLAD, and the fused feature are shown in
Figure 4. We can see that, in all cases, both mAP on Holidays and NS-Score on UKBench
have a rising trend when Np increases. For example, when using the fused feature, the
mAP on Holidays is 0.837 when Np = 7, compared to 0.796 when only one object patch
is considered, while on UKBench, the NS-Score increases from 3.755 to 3.814 when
Np changes from 1 to 7. When Np is even larger, the retrieval accuracy still increases.
However, since a large Np introduces considerable computational and memory cost,
we set Np = 7 in the following experiments. We demonstrate in Section 5.4 that,
with such a setting, the average query time is about 1s in a 1 million–image database.
To investigate how the feature dimension D affects retrieval performance, we evaluate multiple values of D with different features on the Holidays dataset. Here, we fix
Np = 7 and set α as the optimized values as well. We test D = 64, 128, 256, 512, and
1024 for CNN and VLAD features, respectively. When fusing CNN and VLAD features,
we keep them with the same dimension; thus, the dimension of the fused feature is 2D
accordingly.
As shown in Figure 5, the accuracy with the fused feature grows when the dimension
D increases from 64 to 256, then keeps relatively stable after that. This indicates that,
by applying PCA, the feature space distribution is well captured and some noise and
redundancy are removed.
Fig. 5. The impact of feature dimension D on image-retrieval accuracy. Note that the dimension of the fused
feature is actually 2D in the figure, as it is the concatenation of the CNN and VLAD features.
Table I. The mAP Performance of Different Features Under Different
Settings of Feature Dimensions on the Holidays Dataset

Feature    D = 256 (fused: 2D = 512)    D = 512 (fused: 2D = 1024)    D = 1024 (fused: 2D = 2048)
CNN        0.789                        0.789                         0.781
VLAD       0.613                        0.639                         0.637
Fused      0.828                        0.837                         0.837
5.2. Fused Feature versus Standalone Features
To demonstrate the performance boost brought by the proposed feature fusion, in this
section we compare the retrieval results when using the fused feature and standalone
CNN and VLAD features. From Figure 5, we can see that, when D = 256, 512, and 1024,
the fused feature outperforms both CNN and VLAD features. It may be arguable that
the feature dimension of the fused feature is twice that of the CNN and VLAD features.
However, we can see that, in this range, even when comparing them at the same total
dimension (e.g., the fused feature at 2D = 512 against CNN and VLAD at D = 512),
the superiority of the fused feature is still significant, as summarized in Table I.
As discussed in Section 1, CNN and VLAD specialize in describing different properties of an image. Even though the CNN feature alone achieves promising accuracy
(i.e., mAP 0.789 on Holidays and NS-Score 3.613 on UKBench), its fusion with VLAD
improves performance further, with mAP 0.837 on Holidays and NS-Score 3.814 on
UKBench. We attribute this improvement to the capability of VLAD to handle variations
in scale and rotation.
To compare the three features more intuitively, we illustrate some retrieval results on
UKBench with these three features in Figure 6. In Figure 6(a) and Figure 6(b), the
database images related to the query undergo severe viewpoint changes, so that the
CNN feature alone retrieves only the query image itself. In these two cases, VLAD
retrieves all 4 related images and 2 related images, respectively, while the fused feature
preserves the 4 related images in the first case and improves the result to 3 in the
second. In Figure 6(c) and Figure 6(d), VLAD misses 3 related images due to
the poor feature matches, while CNN successfully retrieves 2 and 4 related images,
respectively. In contrast, the fused feature returns 3 and 4 related images in these
situations, respectively.
Fig. 6. Top 4 retrieval results with three features on UKBench. In each group, the first column represents
the query image, and results with CNN, VLAD, and the fused feature are given in the top, middle, and
bottom row, respectively.
We illustrate one case in which the performance of the fused
feature is hampered by the failure of VLAD, in Figure 6(e). In this case, CNN and
VLAD retrieve 4 and 1 related images, respectively, but the fused feature returns
only 3, with the fourth result being a false positive caused by the failure of
VLAD. Figure 6(f) illustrates one case in which all three features successfully retrieve
all the 4 related images. In their ranking, the CNN feature favors images with similar
shapes, while the VLAD feature is more concerned with the matching of local features.
5.3. Large-Scale Image Retrieval
To perform large-scale image retrieval, we apply the quantization and indexing method
in Section 4.2. According to the experimental results discussed in Section 5.1, we use
the fused feature, and fix the feature dimension to 2D = 1024. For each image, we
extract 8 features in total: 7 from object patches and 1 from the entire image.
Table II. Accuracy with ADC on the Holidays and UKBench Datasets

m      mAP                NS-Score
32     0.770 (↓ 0.067)    3.650 (↓ 0.164)
64     0.811 (↓ 0.026)    3.743 (↓ 0.071)
128    0.818 (↓ 0.019)    3.790 (↓ 0.024)

Note: Down arrows denote accuracy decreases compared with using
the original feature.
Fig. 7. Retrieval accuracy of IVFADC when different vocabulary size k is used to build the inverted index,
and no distractor dataset is added. Results on Holidays (a) and UKBench (b) are illustrated. The blue bars
represent m = 64 in PQ, while red bars represent m = 128.
We first explore the impact of PQ on the retrieval accuracy without an inverted
index. We denote this as the ADC method. When performing PQ, we test the number
of subspaces m = 32, 64, and 128, where the dimensions of the subspaces are 32, 16, and
8, respectively. The cluster number in each subspace is set to k∗ = 256, so that each
centroid ID can be represented by an unsigned char variable using 1 byte of memory.
We can see from Table II that the impairment of PQ to the accuracy is minor when
m = 128. Compared with using the original feature, the mAP on Holidays drops from
0.837 to 0.818, while the NS-Score on UKBench drops from 3.814 to 3.790. When
m = 64, the accuracy decrease is minor and acceptable. However, when m = 32, the
accuracy drops severely because the 256 clusters in the subspaces can hardly represent
the 32-dimensional subvectors well.
We then test retrieval performance when applying the inverted index (denoted as
IVFADC). We test different vocabulary sizes k (i.e., the number of entries) in the inverted
index. The performances are compared when no distractor dataset is added and when
the MIR Flickr 1M dataset is added, respectively. We test PQ with m = 64 and m = 128
only because the accuracy with m = 32 is much lower.
When no distractor dataset is added, the retrieval accuracies on Holidays and UKBench are shown in Figure 7. We can conclude that smaller vocabulary size k generates
better accuracy, as expected, because when the quantization is coarse, more features
are taken for distance computation, so that a high recall can be achieved. The extreme
case when k = 1 is exactly the ADC version, where all database features are compared
to the query feature. When k = 500 and m = 128, the best results are achieved, that is,
mAP on Holidays is 0.804 and NS-Score on UKBench is 3.725.
Next, we add the 1M distractor dataset to test retrieval performance. We demonstrate
retrieval accuracy and average query time in Figure 8. All timings exclude feature
extraction. When the PQ subspace number m is 128, the accuracy is better than with
m = 64, at a slightly higher time cost. Specifically,
when k = 500 and m = 128, the mAP on Holidays is 0.642 and NS-Score on UKBench
is 3.707, and the time costs are 1.18s and 1.22s.
Fig. 8. Retrieval accuracy and average query time of IVFADC when different vocabulary size k is used to
build the inverted index, and the 1M distractor dataset is added. Accuracy on Holidays (a) and UKBench
(b), and average query time on Holidays (c) and UKBench (d) are illustrated. The blue bars and lines
represent m = 64 in PQ, while red bars and lines represent m = 128.
When k = 500 and m = 64, we get mAP 0.620 and NS-Score 3.634, and the average time costs are 1.03s and 0.91s for the
two ground truth datasets, respectively. When the vocabulary size k increases, both
accuracy and time cost drop on the two datasets. When k = 4000, only about 0.4s is
required to perform one query. The memory cost to store quantized features depends
only on m, as discussed in Section 4.2.
5.4. Computational Efficiency Analysis
In this section, we analyze the time cost of our method in more detail. From Algorithm 1,
we can see that there are 4 main components in the retrieval process: lookup table
computation (T1), computation of distances to database features (T2), scoring the images
(T3), and sorting (T4). In the following experiments, we show how the time costs of the
4 parts change with the database size.
We show the contribution of the 4 parts to the total time cost in Figure 9. The
experiment is performed when k = 500, m = 128 on Holidays. Obviously, the feature
distance computation stage (T2) is the most time-consuming one when the database
size grows large. This is because more database features must be compared with the
query. When the database size is 1M, it costs 0.96s. Without the inverted index structure,
the complexity of T2 is O(N^2 · n), where n represents the database size and N^2 is the
square of the number of object-level features per image, which is a constant. When the
inverted index is built and database features are equally distributed over the inverted
index entries, the complexity reduces to O(N^2 · n/k). We observe that the time to
compute the lookup table (T1) is nearly constant.
Fig. 9. (a) Contribution of four main parts in the retrieval process to the total time cost: lookup table
computing (T1 ), distances to database features computing (T2 ), scoring the images (T3 ), and sorting (T4 ).
(b) The average number of candidate features retrieved for distance computing.
Table III. Accuracy Comparison of Object-Level Representation with Baselines

Methods       Holidays (mAP)     UKBench (NS-Score)
CNN-1         0.710              3.412
OR-CNN        0.789 (↑ 0.079)    3.613 (↑ 0.201)
VLAD-16       0.572              3.167
OR-VLAD-16    0.639 (↑ 0.067)    3.258 (↑ 0.091)
Fused-1       0.815              3.754
OR-Fused      0.837 (↑ 0.022)    3.814 (↑ 0.06)
When the database size is small, for example, 1000, it occupies most of the retrieval time.
However, when the database size is very large, it accounts for only a small proportion;
it takes about 0.09s during the retrieval process. The time spent on scoring and sorting
increases with the database size. However, even when the database size is 1M, each of
them costs only about 0.05s.
5.5. Comparison
To demonstrate the superiority in accuracy and efficiency of our method, we make a
comparison with the baseline and state-of-the-art methods. For notation convenience,
we denote our object-level representation as OR in the comparisons.
First, we compare our object-level representation with three baseline methods: VLAD
[Jégou et al. 2010], CNN-1 [Sun et al. 2014], and Fused-1. In the CNN-1 and Fused-1
methods, only one CNN feature or fused feature is extracted from the entire image. The
VLAD-16 plugged-in method [Sun et al. 2014] is the object-level representation built on
the VLAD method with vocabulary size 16, which we denote as OR-VLAD-16 here.
The comparison results are summarized in Table III. We observe that, in all cases, the
object-level representations are superior to their image level counterparts.
We then compare our method with some recent image search algorithms. Here, we
present our method with different configurations: the original fused feature (OR), compressed feature with PQ when m = 128 (OR-ADC128) and when m = 64 (OR-ADC64),
and indexed feature with the inverted index with k = 500, m = 128 (OR-IVFADC).
The compared methods include: (1) CWVT [Wang et al. 2011], an improved vocabulary
tree–based method with contextual weighting of local features in both descriptor and
spatial domains; (2) SCSM [Shen et al. 2012], where a spatially constrained similarity
measure is used to perform object retrieval; (3) BoC [Wengert et al. 2011], an advanced
color signature fused with the SIFT descriptor for image retrieval; (4) Semantic-aware
Co-indexing (SC) [Zhang et al. 2013], a fusion of local invariant features and semantic
attributes for image retrieval; (5) Coupled multi-index (CM) [Zheng et al. 2014a], an
index-level fusion method that exploits information from multiple features in images;
and (6) Bayes merging of multiple vocabularies (BM) [Zheng et al. 2014b], in which
multiple vocabularies are built with the principle that low correlation exists among them.
Table IV. Comparison of the Proposed Method with the State of the Art

Methods                       Holidays (mAP)    UKBench (NS-Score)
CWVT [Wang et al. 2011]       0.78              3.56
SCSM [Shen et al. 2012]       0.762             3.52
BoC [Wengert et al. 2011]     0.789             3.50
SC [Zhang et al. 2013]        0.809             3.60
CM [Zheng et al. 2014a]       0.840             3.71
BM [Zheng et al. 2014b]       0.819             3.62
OR                            0.837             3.814
OR-ADC128                     0.818             3.790
OR-ADC64                      0.811             3.743
OR-IVFADC                     0.804             3.725

Note: The performance of the comparison algorithms is cited from the
reported results of the original papers.
Table V. Memory Cost Comparison of Object-Level Representation with Baselines

Methods                           Memory for features (GB)    Memory for quantizer (MB)
VT [Nister and Stewenius 2006]    8.0                         142
HE [Jégou et al. 2008]            12.0                        398
PQ [Jégou et al. 2011]            12.0                        100
VLAD [Jégou et al. 2010]          1.0                         0.0078
OR-ADC64                          0.5                         5
OR-ADC128                         1.0                         5
The comparison in Table IV shows that our method achieves state-of-the-art retrieval
accuracy on both the Holidays and UKBench datasets. When no inverted index is
applied, the best results among the compared methods are achieved. Even when we
index the features for efficiency, which results in a decrease in accuracy, the mAP on
Holidays is still comparable with most recent works, and the NS-Score on UKBench
remains superior. Results on UKBench are significantly improved by our method. This
is possibly due to the nature of the dataset, in which objects usually dominate the image
scene and burstiness occurs frequently in richly textured regions; burstiness harms
local feature–based methods but is alleviated by our object-level representation.
We also compare our method with four baselines in large-scale image retrieval experiments with different database sizes. The compared methods are Vocabulary Tree (VT)
[Nister and Stewenius 2006], Hamming Embedding (HE) [Jégou et al. 2008], Product
Quantization (PQ) [Jégou et al. 2011], and VLAD [Jégou et al. 2010]. The codebook size
is 0.99M in VT and 200K in HE. In PQ, the codebook size is also 200K with IVFADC
applied, and m = 8, k∗ = 256. In VLAD, the SIFT feature codebook size is 16, and the
VLAD feature dimension is reduced to 512 from 2048 by PCA. We perform exhaustive
search without an inverted index, because one image is represented with only one
512-D feature.
The memory cost comparison between our method and the baseline methods is summarized
in Table V. In our implementation, the original features are 1024-dimensional, m is set to
64 or 128, and k∗ = 256. When m = 64, the memory cost of storing the features for 1 million
images is 1M × 8 × 64 × 1 byte = 512MB. When m = 128, the memory cost is 1GB. In
addition, 1MB of memory is required to store the centroids.
Our method does not store image IDs for features, because we have a fixed number
of features (i.e., 8) for each image and the image ID can be calculated from the feature
ID (image ID = feature ID / 8).
Table VI. Time Cost Comparison of Object-Level Representation with Baselines

Methods                           Average query time (s)
VT [Nister and Stewenius 2006]    0.098
HE [Jégou et al. 2008]            0.254
PQ [Jégou et al. 2011]            1.054
VLAD [Jégou et al. 2010]          1.180
OR-ADC128                         1.179
In our method, apart from the memory cost of the indexed features discussed earlier,
we require 4MB of memory to store the cluster centers of the inverted index when
k = 1000, and 1MB to store the centroids in each feature subspace. Therefore, 5MB in
total is used for the quantizer. The VT [Nister and Stewenius 2006] needs 4 bytes to
store one image ID and another 4 bytes to store the tf-idf weight; for 1M images with
roughly 1000 features each, 1M × 1000 × 8 bytes = 8GB of memory is required to store
all features. In the HE [Jégou et al. 2008], 4 bytes and 8 bytes are used to store the
image ID and the Hamming code, respectively, so 12GB of memory is used to store all
features. In these two methods, about 142MB of memory is required to store the
hierarchical visual vocabulary tree, and HE additionally requires 256MB to store the
median vectors of the leaf nodes. In the PQ baseline [Jégou et al. 2011], 4 bytes and
8 bytes are used to store one image ID and the compressed feature, respectively,
leading to a 12GB memory cost for all features.
To store 20k cluster centers for the inverted index and all centroids in each feature
subspace, it costs about 100MB memory. The VLAD [Jégou et al. 2010] stores one 512-D
feature for each image, so that 1GB memory cost is needed for the 1M image dataset.
In addition, 8KB is required to store 16 cluster centers to aggregate the SIFT features.
The time cost comparison between our method and the compared methods is summarized
in Table VI. We show the average query time of these methods when performing image
retrieval on the 1-million-image dataset. In implementation, our method is closest to PQ
[Jégou et al. 2011]. We use 8 object-level features per image, while PQ uses thousands of
SIFT features, so we make fewer comparisons between images. On the other hand, to
obtain high accuracy, our optimal codebook size is much smaller than that used in PQ,
resulting in many more features stored in each inverted index entry. Consequently, their
time costs are very similar. The VT [Nister and Stewenius 2006] does not perform
feature-distance computation and its codebook size is very large, so its time cost is much
smaller, but its accuracy is much lower, as illustrated in the following. The HE [Jégou
et al. 2008] has the same system framework as VT, but verifies feature distances using
Hamming codes. The time cost of HE is a little larger than that of VT. However, its
accuracy is still lower than that of PQ and our method. The
VLAD [Jégou et al. 2010] keeps only one feature for each image, but exhaustive search
is used to ensure accuracy in our implementation, and the time cost is comparable to
our method.
The accuracy comparisons on Holidays and UKBench are plotted in Figure 10, which
demonstrates the scalability of our method for image retrieval in the large image
database. We can see that when m = 128 with k = 500 or 1k, the accuracies are higher than with m =
64, k = 500. As discussed earlier, however, when m = 128, 1GB memory is required to
store the compressed features for 1M images, while only 512MB is required for m = 64.
Nonetheless, even 1GB memory cost is still affordable in many real-life applications.
It is notable that the performance drop of our method on UKBench is slight when the
distractor image number increases. This is because images in the UKBench dataset
usually contain objects with noisy background. Compared to other methods, our method
can describe the objects well while suppressing the distraction from background areas.
6. CONCLUSION
In this article, we propose a novel image retrieval framework with compact image
representation from generic object regions. We first identify regions of interest with a generic object detector. To represent the detected regions, we apply a CNN to describe the global content and VLAD to capture the local invariant patterns. In addition, we
propose fusing the CNN and VLAD features for a more effective representation. The fusion is performed at the feature level to avoid any modification to existing retrieval frameworks, and it yields a promising accuracy improvement. Scalability to large image databases is achieved with an inverted indexing structure. The representation incurs low memory overhead, and the retrieval process is time efficient. Moreover,
experiments on benchmark datasets demonstrate state-of-the-art performance of our
proposed method.
REFERENCES
Bogdan Alexe, Thomas Deselaers, and Vittorio Ferrari. 2012. Measuring the objectness of image windows.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 2189–2202.
Amit Satpathy, Xudong Jiang, and How-Lung Eng. 2014. Human detection by quadratic classification on
subspace of extended histogram of gradients. IEEE Transactions on Image Processing 23, 1, 287–297.
Relja Arandjelovic and Andrew Zisserman. 2012. Three things everyone should know to improve object
retrieval. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE,
2911–2918.
Herbert Bay, Tinne Tuytelaars, and Luc Van Gool. 2006. Surf: Speeded up robust features. In Proceedings of
European Conference on Computer Vision. Springer, 404–417.
Yoshua Bengio. 2009. Learning deep architectures for AI. Foundations and Trends in Machine Learning 2,
1, 1–127.
Yang Cao, Changhu Wang, Liqing Zhang, and Lei Zhang. 2011. Edgel index for large-scale sketch-based
image search. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
IEEE, 761–768.
Mingming Cheng, Z. Zhang, W. Lin, and P. Torr. 2014. BING: Binarized normed gradients for objectness
estimation at 300fps. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
IEEE.
Lingyang Chu, Shuqiang Jiang, Shuhui Wang, Yanyan Zhang, and Qingming Huang. 2013. Robust spatial
consistency graph model for partial duplicate image retrieval. IEEE Transactions on Multimedia 15, 8,
1982–1996.
Lingyang Chu, Shuhui Wang, Yanyan Zhang, Shuqiang Jiang, and Qingming Huang. 2014. Graph-density-based visual word vocabulary for image retrieval. In IEEE International Conference on Multimedia and
Expo. IEEE, 1–6.
Ondrej Chum and Jiri Matas. 2010. Unsupervised discovery of co-occurrence in sparse high dimensional data.
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 3416–3423.
Navneet Dalal and Bill Triggs. 2005. Histograms of oriented gradients for human detection. In Proceedings
of the IEEE Conference on Computer Vision and Pattern Recognition, Vol. 1. IEEE, 886–893.
Ian Endres and Derek Hoiem. 2010. Category independent object proposals. In Proceedings of European
Conference on Computer Vision. Springer, 575–588.
Pedro Felzenszwalb, David McAllester, and Deva Ramanan. 2008. A discriminatively trained, multiscale,
deformable part model. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 1–8.
Yunchao Gong, Liwei Wang, Ruiqi Guo, and Svetlana Lazebnik. 2014. Multi-scale orderless pooling of deep
convolutional activation features. In Proceedings of European Conference on Computer Vision. Springer,
392–407.
Steven C. H. Hoi, Wei Liu, and Shih-Fu Chang. 2010. Semi-supervised distance metric learning for collaborative
image retrieval and clustering. ACM Transactions on Multimedia Computing, Communications and
Applications 6, 3, 18.
Eva Hörster and Rainer Lienhart. 2008. Deep networks for image retrieval on large-scale databases. In
Proceedings of the 16th ACM International Conference on Multimedia. ACM, New York, NY, 643–646.
Hervé Jégou and Ondřej Chum. 2012. Negative evidences and co-occurences in image retrieval: The benefit
of PCA and whitening. In Proceedings of European Conference on Computer Vision. Springer, 774–787.
Hervé Jégou, Matthijs Douze, and Cordelia Schmid. 2008. Hamming embedding and weak geometric consistency for large scale image search. In Proceedings of European Conference on Computer Vision. Springer,
304–317.
Hervé Jégou, Matthijs Douze, and Cordelia Schmid. 2010. Improving bag-of-features for large scale image
search. International Journal of Computer Vision 87, 3, 316–336.
Hervé Jégou, Matthijs Douze, and Cordelia Schmid. 2011. Product quantization for nearest neighbor search.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 117–128.
Hervé Jégou, Matthijs Douze, Cordelia Schmid, and Patrick Pérez. 2010. Aggregating local descriptors into
a compact image representation. In Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition. IEEE, 3304–3311.
Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. 2014. Caffe: Convolutional architecture for fast feature embedding. arXiv
preprint arXiv:1408.5093.
Timor Kadir, Andrew Zisserman, and Michael Brady. 2004. An affine invariant salient region detector. In
Proceedings of European Conference on Computer Vision. Springer, 228–241.
Yan Ke and Rahul Sukthankar. 2004. PCA-SIFT: A more distinctive representation for local image descriptors. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Vol. 2. IEEE,
II–506.
Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2012. ImageNet classification with deep convolutional neural networks. In Neural Information Processing Systems.
Michael S. Lew, Nicu Sebe, Chabane Djeraba, and Ramesh Jain. 2006. Content-based multimedia information
retrieval: State of the art and challenges. ACM Transactions on Multimedia Computing, Communications
and Applications 2, 1, 1–19.
Zhen Liu, Houqiang Li, Liyan Zhang, Wengang Zhou, and Qi Tian. 2014. Cross-indexing of binary SIFT
codes for large-scale image search. IEEE Transactions on Image Processing.
Zhen Liu, Houqiang Li, Wengang Zhou, Richang Hong, and Qi Tian. 2015. Uniting keypoints: Local visual
information fusion for large-scale image search. IEEE Transactions on Multimedia 17, 4, 538–548.
Zhen Liu, Houqiang Li, Wengang Zhou, Ruizhen Zhao, and Qi Tian. 2014. Contextual hashing for large-scale
image search. IEEE Transactions on Image Processing.
David G. Lowe. 2004. Distinctive image features from scale-invariant keypoints. International Journal of
Computer Vision, 91–110.
Tao Mei, Yong Rui, Shipeng Li, and Qi Tian. 2014. Multimedia search reranking: A literature survey.
ACM Computing Surveys 46, 3, 38.
Krystian Mikolajczyk and Cordelia Schmid. 2004. Scale and affine invariant interest point detectors. International Journal of Computer Vision 60, 1, 63–86.
David Nister and Henrik Stewenius. 2006. Scalable recognition with a vocabulary tree. In Proceedings of the
IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2161–2168.
Aude Oliva and Antonio Torralba. 2001. Modeling the shape of the scene: A holistic representation of the
spatial envelope. International Journal of Computer Vision, 145–175.
Florent Perronnin, Jorge Sánchez, and Thomas Mensink. 2010. Improving the fisher kernel for large-scale
image classification. In Proceedings of European Conference on Computer Vision. Springer, 143–156.
Pierre Sermanet, Koray Kavukcuoglu, Soumith Chintala, and Yann LeCun. 2013. Pedestrian detection with
unsupervised multi-stage feature learning. In Proceedings of the IEEE Conference on Computer Vision
and Pattern Recognition. IEEE, 3626–3633.
Xiaohui Shen, Zhe Lin, Jonathan Brandt, Shai Avidan, and Ying Wu. 2012. Object retrieval and localization with spatially-constrained similarity measure and k-nn re-ranking. In Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition. IEEE, 3013–3020.
Josef Sivic and Andrew Zisserman. 2003. Video Google: A text retrieval approach to object matching in
videos. In Proceedings of the International Conference on Computer Vision. 1470–1477.
Shaoyan Sun, Wengang Zhou, Houqiang Li, and Qi Tian. 2014. Search by detection: Object-level feature
for image retrieval. In Proceedings of International Conference on Internet Multimedia Computing and
Service. ACM, New York, NY, 46.
J. R. R. Uijlings, K. E. A. van de Sande, T. Gevers, and A. W. M. Smeulders. 2013. Selective search for object
recognition. International Journal of Computer Vision, 154–171.
Paul Viola and Michael J. Jones. 2004. Robust real-time face detection. International Journal of Computer
Vision 57, 2, 137–154.
Ji Wan, Dayong Wang, Steven Chu Hong Hoi, Pengcheng Wu, Jianke Zhu, Yongdong Zhang, and Jintao Li.
2014. Deep learning for content-based image retrieval: A comprehensive study. In Proceedings of the
ACM International Conference on Multimedia. ACM, New York, NY, 157–166.
Shuang Wang and Shuqiang Jiang. 2015. INSTRE: A new benchmark for instance-level object retrieval and
recognition. ACM Transactions on Multimedia Computing, Communications and Applications 11, 3, 37.
Xiaoyu Wang, Ming Yang, Timothee Cour, Shenghuo Zhu, Kai Yu, and Tony X. Han. 2011. Contextual
weighting for vocabulary tree based image retrieval. In Proceedings of the International Conference on
Computer Vision. 209–216.
Christian Wengert, Matthijs Douze, and Hervé Jégou. 2011. Bag-of-colors for improved image search. In
ACM International Conference on Multimedia. ACM, New York, NY, 1437–1440.
Lingxi Xie, Qi Tian, Wengang Zhou, and Bo Zhang. 2014. Fast and accurate near-duplicate image search
with affinity propagation on the ImageWeb. Computer Vision and Image Understanding 124, 31–41.
Lingxi Xie, Jingdong Wang, Bo Zhang, and Qi Tian. 2015. Fine-grained image search. IEEE Transactions on
Multimedia 17, 5, 636–647.
Shiliang Zhang, Qi Tian, Gang Hua, Qingming Huang, and Wen Gao. 2011. Generating descriptive visual
words and visual phrases for large-scale image applications. IEEE Transactions on Image Processing
20, 9, 2664–2677.
Shiliang Zhang, Qi Tian, Ke Lu, Qingming Huang, and Wen Gao. 2013. Edge-SIFT: Discriminative binary
descriptor for scalable partial-duplicate mobile search. IEEE Transactions on Image Processing 22, 7,
2889–2902.
Shaoting Zhang, Ming Yang, Timothee Cour, Kai Yu, and Dimitris N. Metaxas. 2012. Query specific fusion
for image retrieval. In Proceedings of European Conference on Computer Vision. Springer, 660–673.
Shiliang Zhang, Ming Yang, Xiaoyu Wang, Yuanqing Lin, and Qi Tian. 2013. Semantic-aware co-indexing
for image retrieval. In Proceedings of the International Conference on Computer Vision.
Liang Zheng, Shengjin Wang, Ziqiong Liu, and Qi Tian. 2014a. Packing and padding: Coupled multi-index
for accurate image retrieval. In Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition. IEEE.
Liang Zheng, Shengjin Wang, Wengang Zhou, and Qi Tian. 2014b. Bayes merging of multiple vocabularies
for scalable image retrieval. In Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition. IEEE, 1963–1970.
Wengang Zhou, Houqiang Li, Richang Hong, Yijuan Lu, and Qi Tian. 2015. BSIFT: Towards data-independent
codebook for large scale image search. IEEE Transactions on Image Processing 24, 3, 967–979.
Wengang Zhou, Houqiang Li, Yijuan Lu, and Qi Tian. 2013. SIFT match verification by geometric coding for large-scale partial-duplicate web image search. ACM Transactions on Multimedia Computing,
Communications and Applications, 4.
Wengang Zhou, Houqiang Li, Yijuan Lu, and Qi Tian. 2014. Encoding spatial context for large-scale partial-duplicate web image retrieval. Journal of Computer Science and Technology 29, 5, 837–848.
Wengang Zhou, Qi Tian, Yijuan Lu, Linjun Yang, and Houqiang Li. 2011. Latent visual context learning for
web image applications. Pattern Recognition 44, 10, 2263–2273.
Wengang Zhou, Ming Yang, Houqiang Li, Xiaoyu Wang, Yuanqing Lin, and Qi Tian. 2014. Towards codebook-free: Scalable cascaded hashing for mobile image search. IEEE Transactions on Multimedia 16, 3, 601–
611.
Received December 2014; revised March 2015; accepted May 2015