
2016 Joint 8th International Conference on Soft Computing and Intelligent Systems and 2016 17th International Symposium
on Advanced Intelligent Systems
An Analysis of Dependency of Prior Probability for Codebook-Based Image
Representation
Yuki Shinomiya
School of System Engineering
Kochi University of Technology
Kami, Japan
Email: [email protected]
Yukinobu Hoshino
School of System Engineering
Kochi University of Technology
Kami, Japan
Email: [email protected]
Abstract—Codebook-based image representation has been widely used in many image recognition applications. While each application has an individual purpose, the pipeline of image representation follows the same approach. This paper focuses on the "prior probability", a parameter of a codebook, and verifies its relationship with recognition performance. In the experiment, a codebook is modified by adding two parameters to analyze the dependency on prior probability. The result shows a strong negative correlation between prior probability and recognition accuracy. Additionally, the recognition accuracy can be improved by offline optimization.

Keywords-image representation; codebook; prior probability;

I. INTRODUCTION

Recently, image recognition has attracted particular attention in machine vision and pattern recognition, and has been used in several applications such as large-scale image annotation [1], fine-grained object recognition [2], [3], image retrieval [4] and object detection [5]. Each of these applications has a specific purpose and different scales of target images and domains. For example, the following datasets have been used as benchmarks:
• ImageNet10K [1]: 9 million images of 10,184 categories
• Caltech-101 [6]: 9,145 images of 102 object categories
• Caltech-Bird200 [7]: 6,033 images of 200 bird species
• 15-Scenes [8]: 4,485 images of 15 scene categories
• PASCAL VOC 2007: 9,963 images containing 24,640 annotated objects

Images in the real world often include various deformations due to viewpoint and changes in the shooting environment. Local image features are a key technique for describing feature vectors that are robust to such deformations in many cases [9], [10]. In addition, an image can be treated as a set of extracted local descriptors by treating each local descriptor as a visual vocabulary, which is called a codebook approach.

This paper focuses on the prior probability, which is one of the parameters of a codebook and is a weight factor for each visual vocabulary. The dependency on prior probability in codebook-based approaches is analyzed by two experiments. One investigates the effect of prior probability on recognition performance in recent state-of-the-art techniques. The other verifies the impact of controlling the distribution of the prior probabilities of a codebook.

This paper is organized as follows: the next section describes recent codebook approaches in detail; the recognition performances are then evaluated in Section III, where we show the effect of prior probability. Section IV presents an approach that improves recognition accuracy by controlling the distribution of prior probability. In addition, we show the analyzed results of the relationship between prior probability and recognition performance. Finally, Section V presents the conclusions of this paper.

978-1-5090-2678-4/16 $31.00 © 2016 IEEE
DOI 10.1109/SCIS&ISIS.2016.45
II. CODEBOOK-BASED APPROACH

The pipeline of image recognition generally consists of the following three stages:
1) Extracting local image descriptors from a given image. A local image descriptor is extracted as a $d$-dimensional vector ($d = 128$ in the case of using SIFT [9]) and is invariant to changes of image scaling, illumination and rotation [9]–[11].
2) Encoding the local image descriptors. The set of local descriptors extracted from the image is encoded into a global image feature vector based on a codebook that has been created in advance.
3) Recognizing the object label by a discriminant model. A linear support vector machine (SVM) is often used as a discriminant model because its scalability is $O(N)$ with respect to the number $N$ of images. In the case of using a non-linear measure, the scalability increases to $O(N^2)$–$O(N^3)$.

The following section describes codebook creation and recent encoding strategies in detail.
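To make the three stages concrete, the following is a minimal NumPy sketch of the pipeline. It is not the authors' implementation: the descriptor extractor and the classifier are stand-ins (random vectors instead of SIFT, a nearest-prototype rule instead of a linear SVM), and all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def extract_descriptors(image, d=128, n_keypoints=50):
    # Stage 1 stand-in: a real system would extract dense SIFT here;
    # random vectors only keep the sketch self-contained.
    return rng.normal(size=(n_keypoints, d))

def encode(descriptors, codebook):
    # Stage 2: hard-assign each descriptor to its nearest visual word
    # and build a normalized frequency histogram (a BOVW-style signature).
    dists = np.linalg.norm(descriptors[:, None, :] - codebook[None, :, :], axis=2)
    hist = np.bincount(dists.argmin(axis=1), minlength=len(codebook))
    return hist / len(descriptors)

def recognize(signature, class_prototypes):
    # Stage 3 stand-in: a linear SVM would be trained here; a nearest
    # class-prototype rule keeps the sketch dependency-free.
    return int(np.argmax(class_prototypes @ signature))

codebook = rng.normal(size=(16, 128))   # K = 16 visual words (toy size)
prototypes = rng.normal(size=(3, 16))   # 3 hypothetical classes
label = recognize(encode(extract_descriptors(None), codebook), prototypes)
```

The signature returned by `encode` is a K-dimensional probability histogram, which is the representation the rest of the paper analyzes.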
A. Codebook Creation

A codebook is created by quantizing the local image descriptors extracted from images belonging to various categories. Let $I = \{x_i \in \mathbb{R}^d : i = 1, \ldots, N\}$ be a set of the $d$-dimensional local descriptors extracted from the images.

In the case of using k-means clustering, the model parameter is defined as $\Theta_{k\text{-means}} = \{\mu_k \in \mathbb{R}^d : k = 1, \ldots, K\}$, where $K$ is the number of clusters and $\mu_k$ is the $k$-th mean vector. The probability that a local descriptor $x_i$ belongs to a cluster $\mu_k$ is predicted by:

$$p(x_i; \mu_k) = \delta\left[k, \arg\min_{k} \|x_i - \mu_k\|\right], \quad (1)$$

where $\delta[\cdot,\cdot]$ is the Kronecker delta function, which is 1 if the $k$-th cluster $\mu_k$ is the nearest to the descriptor $x_i$ and 0 otherwise. The mean vector $\mu_k$ is calculated by:

$$\mu_k = \frac{\sum_{i=1}^{N} p(x_i; \mu_k)\, x_i}{\sum_{i=1}^{N} p(x_i; \mu_k)}, \quad (2)$$

where Eqs. (1) and (2) are repeated for $T$ iterations.

In the case of using a Gaussian Mixture Model (GMM), the model parameter is defined as $\Theta_{GMM} = \{\pi_k \in \mathbb{R}^1, \mu_k \in \mathbb{R}^d, \Sigma_k \in \mathbb{R}^{d \times d} : k = 1, \ldots, K\}$, where $K$ is the number of Gaussians and $\pi_k$, $\mu_k$ and $\Sigma_k$ are the prior probability (or mixing weight), mean vector and covariance matrix, respectively. The covariance matrix is often assumed to be a diagonal or symmetric positive-definite matrix. The probability of a descriptor is predicted by:

$$p(x_i; \mu_k, \Sigma_k) = \frac{1}{(2\pi)^{d/2} |\Sigma_k|^{1/2}} \exp\left(-\frac{1}{2}(x_i - \mu_k)^{\top} \Sigma_k^{-1} (x_i - \mu_k)\right). \quad (3)$$

Then, the model parameters are calculated by:

$$\gamma_i(k) = \frac{\pi_k\, p(x_i; \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j\, p(x_i; \mu_j, \Sigma_j)}, \quad (4)$$
$$\pi_k = \frac{1}{N} \sum_{i=1}^{N} \gamma_i(k), \quad (5)$$
$$\mu_k = \frac{\sum_{i=1}^{N} \gamma_i(k)\, x_i}{\sum_{i=1}^{N} \gamma_i(k)}, \quad (6)$$
$$\Sigma_k = \frac{\sum_{i=1}^{N} \gamma_i(k)\,(x_i - \mu_k)(x_i - \mu_k)^{\top}}{\sum_{i=1}^{N} \gamma_i(k)}, \quad (7)$$

where $\gamma_i(k)$ is a posterior probability and Eqs. (3)–(7) are repeated for $T$ iterations.

B. Encoding Strategy

The typical strategy is to represent an image signature by counting the number of descriptors that belong to each cluster created by k-means, as in Eq. (8); this is called bag-of-visual-vocabulary (BOVW) and is similar to document representation in natural language processing:

$$[G_\Theta]_k = \frac{1}{N} \sum_{i=1}^{N} \delta\left[k, \arg\min_{k} \|x_i - \mu_k\|\right], \quad (8)$$

where $I = \{x_i \in \mathbb{R}^d : i = 1, \ldots, N\}$ denotes the set of local descriptors extracted from a given image and $[G_\Theta]_k$ is the $k$-th element of a BOVW signature. The codebook size is normally set to $K \approx 4{,}000$, since the recognition performance on the Caltech-101 dataset is not significantly improved with $K > 4{,}000$ [12]. BOVW represents a frequency histogram as a $K$-dimensional image signature, so a large codebook size is required to represent the image signature precisely, which is a problem because it increases the cost of searching for the nearest neighbor.

The Fisher Vector (FV) tackles this problem by including higher-order statistics from a smaller codebook than BOVW [3], [13]. An FV signature includes mean and variance statistics in addition to frequency. The components (frequency, mean and variance) of an FV signature are respectively captured by:

$$G_{\gamma,k} = \frac{1}{N\sqrt{\pi_k}} \sum_{i=1}^{N} \left(\gamma_i(k) - \pi_k\right), \quad (9)$$
$$G_{\mu,k} = \frac{1}{N\sqrt{\pi_k}} \sum_{i=1}^{N} \gamma_i(k)\, \frac{x_i - \mu_k}{\sigma_k}, \quad (10)$$
$$G_{\sigma,k} = \frac{1}{N\sqrt{2\pi_k}} \sum_{i=1}^{N} \gamma_i(k) \left[\left(\frac{x_i - \mu_k}{\sigma_k}\right)^2 - 1\right], \quad (11)$$

and the FV signature is represented by concatenating the components of Eqs. (9)–(11) as

$$G_\Theta = \left[\cdots\ G_{\gamma,k}\ \cdots\ G_{\mu,k}\ \cdots\ G_{\sigma,k}\ \cdots\right]. \quad (12)$$

As improvement techniques for FV, power-normalization and $\ell_2$-normalization have been proposed [14], and they are applied to an FV signature as follows:

$$[G_\Theta]_j \leftarrow \mathrm{sign}\left([G_\Theta]_j\right) \sqrt{\left|[G_\Theta]_j\right|}, \quad (13)$$
$$[G_\Theta]_j \leftarrow [G_\Theta]_j \,/\, \|G_\Theta\|, \quad (14)$$

where $\mathrm{sign}(\cdot)$ indicates the sign of the input value. Power-normalized vectors are sensitive to small bin values, and the Euclidean distance between two such vectors has the same characteristics as the Hellinger distance [15], so nowadays power-normalization is well known as a de facto post-processing step [3]. The $\ell_2$-normalization removes the effect of the prior probability between two FV signatures [14].

As an almost similar approach to FV, the Vector of Locally Aggregated Descriptors (VLAD) has been proposed; it represents an image feature by aggregating the residuals between the local descriptors and each of their nearest vocabularies created by k-means, which is similar to the mean component
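The codebook-creation loop of Eqs. (1)–(2) can be sketched in a few lines of NumPy. This is an illustrative toy, not the authors' code; the empirical prior of each visual word is computed analogously to Eq. (5) so that its spread can be inspected.

```python
import numpy as np

def kmeans_codebook(X, K, T=30, seed=0):
    """Create a codebook by k-means following Eqs. (1)-(2):
    hard-assign each descriptor to its nearest mean, then update means."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), K, replace=False)]  # fancy indexing copies
    for _ in range(T):
        # Eq. (1): p(x_i; mu_k) = delta[k, argmin_k ||x_i - mu_k||]
        assign = np.linalg.norm(X[:, None] - mu[None], axis=2).argmin(axis=1)
        # Eq. (2): mean of the descriptors assigned to each cluster
        for k in range(K):
            if np.any(assign == k):
                mu[k] = X[assign == k].mean(axis=0)
    # Empirical prior of each visual word, analogous to Eq. (5)
    pi = np.bincount(assign, minlength=K) / len(X)
    return mu, pi

X = np.random.default_rng(1).normal(size=(500, 8))  # toy descriptors
mu, pi = kmeans_codebook(X, K=16)
```

By construction `pi` sums to 1; its standard deviation is the quantity the paper analyzes as an over-fitting indicator.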
of FV as shown in Eq. (10). A VLAD signature is given by the following equations:

$$G_{\mu,k} = \sum_{i=1}^{N} p(x_i; \mu_k)\,(x_i - \mu_k), \quad (15)$$
$$G_\Theta = \left[\cdots\ G_{\mu,k}\ \cdots\right], \quad (16)$$

where $\ell_2$-normalization is applied as a post-processing step.

III. EXPERIMENT 1: EFFECT OF PRIOR PROBABILITIES

This section discusses the effect of prior probabilities in codebook-based image representation on recognition performance.

A. Experimental Setup

The Caltech-101 dataset was used as a benchmark. The dataset consists of 9,145 images from 101 object categories and a background category; each category contains about 40 to 800 images. The images were resized so that the longest dimension equals 300 pixels.

The scale-invariant feature transform (SIFT) [9] was used for extracting local image descriptors. The SIFT descriptors were densely extracted at four scale levels (16, 22, 33 and 44 pixels) on the intersections of a regular grid with 8-pixel spacing, and the dimensionality was then reduced to 80 dimensions by principal component analysis. The SIFT descriptors extracted from 510 images (5 images randomly selected from each category) were used as the samples for creating the codebook, where the number of vocabularies was set to $K \in \{16, 32, 64, 128, 256\}$ and the number of iterations for creating a codebook was set to 30. Each covariance matrix of the codebook created by GMM was assumed to be a diagonal matrix.

The linear SVM implemented in the LIBLINEAR package [16] was used as a linear discriminant model. Randomly selected sets of 5, 10, 15, 20 or 30 images from each category were used for training the model, and the remaining images were used for testing. The hyper-parameter $C$, a penalty parameter of the objective function, was optimized by 5-fold cross-validation. The recognition performance was examined by the average accuracy over five sets of independent training and test images.

B. Experimental Results and Discussion

Fig. 1 shows the distribution of the prior probabilities of the codebooks created by k-means and GMM. The prior probability of k-means was calculated by $\frac{1}{N}\sum_{i=1}^{N} p(x_i; \mu_k)$, in the same way as Eq. (5). The mean of the prior probabilities is constant for a given codebook size, so the standard deviation of the prior probabilities can be used as a measure of over-fitting to the samples.

Figure 1: The distribution of the prior probabilities for a codebook size of $K = 16$. The dashed line denotes the mean of the prior probabilities ($\bar{\pi} = 1/16 = 0.0625$); the mean is always equal to $1/K$ because of the probabilistic constraint $\sum_{k=1}^{K} \pi_k = 1$.

Table I shows the standard deviations of the prior probabilities with respect to the codebook size; the bottom row (relative ratio) shows the relative spread of the standard deviation of GMM to that of k-means. The standard deviations of GMM were larger than those of k-means for all codebook sizes. In addition, the spread of the standard deviation of GMM increased sharply with the codebook size according to the relative ratio. So the distribution of prior probabilities of a codebook is a cause of the sparseness of FV, where it is empirically known that an FV signature becomes sparser as the number of Gaussians increases [14].

Table I: Standard deviation of prior probabilities $\pi_k$.

                        Codebook size
Method              16        32        64       128       256
GMM            0.03830   0.02257   0.01665   0.00940   0.00779
k-means        0.02921   0.01511   0.00800   0.00391   0.00188
Relative ratio 1.31119   1.49371   2.08125   2.40409   4.14362

Fig. 2 shows the recognition performances comparing the full FV (FV($\gamma+\mu+\sigma$)) and the original VLAD (VLAD) with two additional encodings (FV($\mu$) and VLAD+PN) for a codebook size of $K = 256$, where an FV($\gamma+\mu+\sigma$) signature has 41,216 dimensions and the other signatures (VLAD, FV($\mu$) and VLAD+PN) have 20,480 dimensions. FV($\mu$) allows a comparison with VLAD at the same dimensionality, and VLAD+PN isolates the improvement due to power-normalization. First, the original FV signature outperformed the other signatures; a possible cause is that FV($\gamma+\mu+\sigma$) captures more statistics than the others. VLAD+PN improved recognition accuracy by about 3% through power-normalization and significantly outperformed FV($\mu$), although the dimensionalities and the included statistics are the same.

Figure 2: The recognition performances of the following techniques: "FV($\gamma+\mu+\sigma$)", the fully encoded signature; "FV($\mu$)", the FV signature encoded with only the mean component (Eq. (10)); "VLAD", the original VLAD signature; "VLAD+PN", the VLAD signature with power-normalization applied.

From these results, recognition performance possibly depends on the distribution of the prior probabilities, and the standard deviation can be used as an indicator for creating a codebook.
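The standard deviations reported in Table I come from GMM codebooks fitted by Eqs. (3)–(7). The following is a compact diagonal-covariance EM sketch in NumPy on synthetic data; it is illustrative only (the paper's codebooks were trained on SIFT descriptors), and the mixing weights `pi` are the quantities whose spread Table I summarizes.

```python
import numpy as np

def gmm_em_diag(X, K, T=30, seed=0):
    """Diagonal-covariance GMM fitted by EM (Eqs. (3)-(7))."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    mu = X[rng.choice(N, K, replace=False)].copy()
    var = np.ones((K, d))
    pi = np.full(K, 1.0 / K)
    for _ in range(T):
        # Eq. (3): log Gaussian densities with diagonal covariance
        diff2 = (X[:, None, :] - mu[None]) ** 2 / var[None]
        logp = -0.5 * (diff2.sum(axis=2) + np.log(2 * np.pi * var).sum(axis=1))
        # Eq. (4): posterior responsibilities gamma_i(k), computed stably
        w = pi * np.exp(logp - logp.max(axis=1, keepdims=True))
        gamma = w / w.sum(axis=1, keepdims=True)
        nk = gamma.sum(axis=0)
        pi = nk / N                                 # Eq. (5): priors
        mu = gamma.T @ X / nk[:, None]              # Eq. (6): means
        var = gamma.T @ (X ** 2) / nk[:, None] - mu ** 2  # Eq. (7), diagonal
        var = np.maximum(var, 1e-6)                 # guard against collapse
    return pi, mu, var

X = np.random.default_rng(2).normal(size=(400, 5))  # toy descriptors
pi, mu, var = gmm_em_diag(X, K=8)
spread = pi.std()  # the over-fitting indicator analysed in Sec. III-B
```

Comparing `spread` against the same statistic from a k-means codebook reproduces, in spirit, the GMM-versus-k-means comparison of Table I.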
IV. EXPERIMENT 2: IMPACT OF CONTROLLING THE STANDARD DEVIATION OF PRIOR PROBABILITIES

In this section, the prediction function of GMM is parameterized to discuss controlling the standard deviation of the prior probabilities. This function is applied for encoding an FV signature, and the results are compared with those of the previous section (Sec. III).

A. Parameterization of Fisher Vector

Two scaling parameters, $\gamma$ and $\nu$, are added to the prediction function of Eq. (3) to control the standard deviation of the prior probabilities, in the same way as [17]:

$$p^{*}(x_i; \mu_k, \Sigma_k) = \frac{1}{(2\pi)^{d/2} |\Sigma_k|^{1/\gamma}} \exp\left(-\frac{1}{\nu}(x_i - \mu_k)^{\top} \Sigma_k^{-1} (x_i - \mu_k)\right), \quad (17)$$

and a codebook is created with Eqs. (4)–(7) in the same way as for GMM.

B. Experimental Setup

The range of the scaling parameters was defined as $\{2^i : i = -5, -3, -1, 1, 3, 5\}$. The case of $\gamma = 2$ and $\nu = 2$ in Eq. (17) is identical to the original prediction function of GMM in Eq. (3), so we reused the results of the previous experiments for the parameterized FV in that case. The other parameters were set to the same values as in the previous experiment in Sect. III.

C. Experimental Results and Discussion

Fig. 3 shows the effect of the scaling parameters: Figs. 3a and 3b show the standard deviations and recognition accuracies, respectively. In the contour of the standard deviations (Fig. 3a), the minimum standard deviation was recorded at $\gamma = 2^{-5}$ and $\nu = 2^{-5}$, and the contour shows that the standard deviation becomes smaller as the scaling parameters decrease. In the contour of the recognition accuracies (Fig. 3b), the maximum accuracy was also recorded at $\gamma = 2^{-5}$ and $\nu = 2^{-5}$. In addition, these contours (especially Fig. 3b) may not be simple convex distributions. Therefore, multipoint search algorithms such as particle swarm optimization and differential evolution are suitable for optimizing the two scaling parameters. Generally, a multipoint search algorithm requires many evaluations. However, the codebook creation step is normally offline, so the computational cost is not increased in practical applications.

Fig. 4 shows the relationships with the scaling parameters. The relationships were measured by Pearson's correlation analysis, and the blue lines indicate the lines of best fit obtained by the least-squares method. Fig. 4a shows the relationship between the standard deviation of the prior probabilities and recognition performance; the correlation coefficient was a strong negative $-0.62$. In addition, Table II shows the correlation coefficient with respect to the number of training images. Fig. 4b shows the relationship between the standard deviation of the prior probabilities and the sparseness of parameterized FV signatures; the correlation coefficient was an especially strong negative $-0.88$.

Table II: The relationship between the standard deviations and the recognition accuracies.

The number of training images      5      10      15      20      30
Correlation coefficient        -0.58   -0.60   -0.61   -0.62   -0.62

These results suggest that the parameterized FV has the potential to improve recognition accuracy on other image recognition tasks, because GMM does not have a regularization term.

V. CONCLUSIONS AND FUTURE WORK

This paper has reported the relationship between the prior probabilities and recognition performance in codebook-based image representation; it is possible to improve the recognition accuracy and the sparseness of image signatures by controlling the standard deviation of the prior probabilities of the codebook.

We consider that the following constraint conditions are suitable for optimizing the scaling parameters ($\gamma$ and $\nu$) by multipoint search algorithms such as particle swarm optimization and differential evolution:
• The standard deviation of the weights depends on the parameters.
• Absolute logarithmic terms of the scaling parameters, which enforce convergence around the origin.
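For reference, the parameterized prediction function of Eq. (17) is simple to express for a single component with diagonal covariance. This sketch is illustrative (vectorized NumPy, variable names chosen here); the final check confirms that $\gamma = \nu = 2$ recovers the ordinary Gaussian density of Eq. (3).

```python
import numpy as np

def p_star(x, mu, var, gamma=2.0, nu=2.0):
    # Eq. (17) for one component with diagonal covariance: the exponent
    # 1/2 on |Sigma_k| becomes 1/gamma, and the factor 1/2 inside the
    # exponential becomes 1/nu; gamma = nu = 2 recovers Eq. (3).
    d = len(mu)
    norm = (2.0 * np.pi) ** (d / 2.0) * np.prod(var) ** (1.0 / gamma)
    maha = np.sum((x - mu) ** 2 / var)   # (x - mu)^T Sigma^{-1} (x - mu)
    return np.exp(-maha / nu) / norm

# Sanity check: at the mean with unit variances and gamma = nu = 2,
# the value equals the standard Gaussian density (2*pi)^(-d/2).
p = p_star(np.zeros(3), np.zeros(3), np.ones(3))
```

Decreasing $\nu$ below 2 sharpens the density around the mean, which is how the parameterization reshapes the responsibilities and, through Eq. (5), the spread of the priors.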
Figure 3: Effect of the scaling parameters: (a) standard deviation of prior probabilities; (b) recognition accuracy of parameterized FV. [Contour plots over $\gamma, \nu \in \{2^{-5}, 2^{-3}, 2^{-1}, 2^{1}, 2^{3}, 2^{5}\}$.]

Figure 4: Relationship with the standard deviation of prior probabilities: (a) recognition performance; (b) sparseness of parameterized FV signatures.
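The analysis behind Fig. 4 (Pearson correlation plus a least-squares line of best fit) can be reproduced in a few lines. The numbers below are synthetic and illustrative, not the paper's measurements; they only mimic the decreasing trend of accuracy against the standard deviation of the priors.

```python
import numpy as np

# Illustrative values: std of prior probabilities per codebook vs. accuracy.
std_pi   = np.array([0.015, 0.030, 0.060, 0.090, 0.120])
accuracy = np.array([74.0, 70.5, 62.0, 55.0, 47.0])

r = np.corrcoef(std_pi, accuracy)[0, 1]             # Pearson's r
slope, intercept = np.polyfit(std_pi, accuracy, 1)  # line of best fit
# r is strongly negative here, mirroring the trend reported for Fig. 4a.
```

The same two calls, applied to the measured (std, accuracy) pairs, yield the coefficients reported in Table II and the fitted lines drawn in Fig. 4.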
ACKNOWLEDGMENT

This work was supported by JSPS KAKENHI Grant Number 25330240.

REFERENCES

[1] J. Deng, A. C. Berg, K. Li, and L. Fei-Fei, "What does classifying more than 10,000 image categories tell us?" in Proceedings of the 11th European Conference on Computer Vision: Part V, ser. ECCV'10. Berlin, Heidelberg: Springer-Verlag, 2010, pp. 71–84. [Online]. Available: http://dl.acm.org/citation.cfm?id=1888150.1888157

[2] H. Nakayama, "Augmenting descriptors for fine-grained visual categorization using polynomial embedding," in ICME. IEEE Computer Society, 2013, pp. 1–6.

[3] P.-H. Gosselin, N. Murray, H. Jégou, and F. Perronnin, "Revisiting the Fisher vector for fine-grained classification," Pattern Recognition Letters, vol. 49, pp. 92–98, Nov. 2014. [Online]. Available: https://hal.archives-ouvertes.fr/hal-01056223

[4] R. Arandjelović and A. Zisserman, "All about VLAD," in 2013 IEEE Conference on Computer Vision and Pattern Recognition, Portland, OR, USA, June 23-28, 2013. IEEE, 2013, pp. 1578–1585. [Online]. Available: http://dx.doi.org/10.1109/CVPR.2013.207

[5] K. Simonyan, O. M. Parkhi, A. Vedaldi, and A. Zisserman, "Fisher Vector Faces in the Wild," in British Machine Vision Conference, 2013.

[6] L. Fei-Fei, R. Fergus, and P. Perona, "Learning generative visual models from few training examples: An incremental Bayesian approach tested on 101 object categories," Comput. Vis. Image Underst., vol. 106, no. 1, pp. 59–70, Apr. 2007. [Online]. Available: http://dx.doi.org/10.1016/j.cviu.2005.09.012

[7] P. Welinder, S. Branson, T. Mita, C. Wah, F. Schroff, S. Belongie, and P. Perona, "Caltech-UCSD Birds 200," California Institute of Technology, Tech. Rep. CNS-TR-2010-001, 2010.

[8] S. Lazebnik, C. Schmid, and J. Ponce, "Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories," in Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition - Volume 2, ser. CVPR '06. Washington, DC, USA: IEEE Computer Society, 2006, pp. 2169–2178. [Online]. Available: http://dx.doi.org/10.1109/CVPR.2006.68

[9] D. G. Lowe, "Distinctive image features from scale-invariant keypoints," Int. J. Comput. Vision, vol. 60, no. 2, pp. 91–110, Nov. 2004. [Online]. Available: http://dx.doi.org/10.1023/B:VISI.0000029664.99615.94

[10] K. van de Sande, T. Gevers, and C. Snoek, "Evaluating color descriptors for object and scene recognition," IEEE Trans. Pattern Anal. Mach. Intell., vol. 32, no. 9, pp. 1582–1596, Sep. 2010. [Online]. Available: http://dx.doi.org/10.1109/TPAMI.2009.154

[11] H. Bay, A. Ess, T. Tuytelaars, and L. Van Gool, "Speeded-up robust features (SURF)," Comput. Vis. Image Underst., vol. 110, no. 3, pp. 346–359, Jun. 2008. [Online]. Available: http://dx.doi.org/10.1016/j.cviu.2007.09.014

[12] L. Seidenari, G. Serra, A. D. Bagdanov, and A. Del Bimbo, "Local pyramidal descriptors for image recognition," IEEE Trans. Pattern Anal. Mach. Intell., vol. 36, no. 5, pp. 1033–1040, 2014. [Online]. Available: http://doi.ieeecomputersociety.org/10.1109/TPAMI.2013.232

[13] F. Perronnin and C. R. Dance, "Fisher kernels on visual vocabularies for image categorization," in CVPR. IEEE Computer Society, 2007.

[14] F. Perronnin, J. Sánchez, and T. Mensink, "Improving the Fisher kernel for large-scale image classification," in Proceedings of the 11th European Conference on Computer Vision: Part IV, ser. ECCV'10. Berlin, Heidelberg: Springer-Verlag, 2010, pp. 143–156. [Online]. Available: http://dl.acm.org/citation.cfm?id=1888089.1888101

[15] R. Arandjelović and A. Zisserman, "Three things everyone should know to improve object retrieval," in IEEE Conference on Computer Vision and Pattern Recognition, 2012.

[16] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin, "LIBLINEAR: A library for large linear classification," Journal of Machine Learning Research, vol. 9, pp. 1871–1874, 2008. [Online]. Available: http://www.csie.ntu.edu.tw/~cjlin/liblinear/

[17] H. Ichihashi, K. Honda, A. Notsu, and K. Ohta, "Fuzzy c-means classifier with particle swarm optimization," in FUZZ-IEEE. IEEE, 2008, pp. 207–215.