2016 Joint 8th International Conference on Soft Computing and Intelligent Systems and 2016 17th International Symposium on Advanced Intelligent Systems

An Analysis of Dependency of Prior Probability for Codebook-Based Image Representation

Yuki Shinomiya
School of System Engineering, Kochi University of Technology, Kami, Japan
Email: [email protected]

Yukinobu Hoshino
School of System Engineering, Kochi University of Technology, Kami, Japan
Email: [email protected]

Abstract—Codebook-based image representation has been widely used in many image recognition applications. While each application has an individual purpose, the pipeline of image representation follows the same approach. This paper focuses on the "prior probability", which is a parameter of a codebook, and verifies its relationship with recognition performance. In the experiment, a codebook is modified by adding two parameters in order to analyze the dependency on prior probability. The results show a strong negative correlation between prior probability and recognition accuracy. Additionally, the recognition accuracy can be improved by offline optimization.

Keywords—image representation; codebook; prior probability;

I. INTRODUCTION

Recently, image recognition has attracted particular attention in machine vision and pattern recognition, and has been used in several applications such as large-scale image annotation [1], fine-grained object recognition [2], [3], image retrieval [4] and object detection [5]. Each of these applications has a specific purpose and a different scale of target images and domains. For example, the following datasets have been used as benchmarks:

• ImageNet10K [1]: 9 million images of 10,184 categories
• Caltech-101 [6]: 9,145 images of 102 object categories
• Caltech-Bird200 [7]: 6,033 images of 200 bird species
• 15-Scenes [8]: 4,485 images and 15 scene categories
• PASCAL VOC 2007: 9,963 images containing 24,640 annotated objects

Images in the real world often include various deformations due to viewpoint and changes in the shooting environment. Local image features are a key technique for describing feature vectors that are robust to such deformations in many cases [9], [10]. In addition, an image can be treated as a set of extracted local descriptors by treating each local descriptor as a visual vocabulary, which is called a codebook approach.

This paper focuses on the prior probability, which is one of the parameters of a codebook and acts as a weight factor for each visual vocabulary. The dependency on prior probability in codebook-based approaches is analyzed by two experiments. One investigates the effect of prior probability on recognition performance in recent state-of-the-art techniques. The other verifies its impact by controlling the distribution of the prior probabilities of a codebook.

This paper is organized as follows: the next section describes recent codebook approaches in detail; the recognition performances are then evaluated in Section III, where we show the effect of prior probability. Section IV presents an approach to improve recognition accuracy by controlling the distribution of prior probability, together with the analyzed relationship between prior probability and recognition performance. Finally, Section V presents the conclusions of this paper.

978-1-5090-2678-4/16 $31.00 © 2016 IEEE   DOI 10.1109/SCIS&ISIS.2016.45

II. CODEBOOK-BASED APPROACH

The pipeline of image recognition generally consists of the following three stages:

1) Extracting local image descriptors from a given image: a local image descriptor is extracted as a d-dimensional vector (d = 128 in the case of SIFT [9]) and is invariant to changes of image scaling, illumination and rotation [9]–[11].
2) Encoding the local image descriptors: the set of local descriptors extracted from the image is encoded into a global image feature vector based on a codebook that has been created in advance.
3) Recognizing the object label by a discriminant model: a linear support vector machine (SVM) is often used as a discriminant model because its scalability is O(N) with respect to the number N of images. In the case of a non-linear SVM, the cost increases to O(N²)–O(N³).

The following subsections describe codebook creation and recent encoding strategies in detail.

A. Codebook Creation

A codebook is created by quantizing the local image descriptors extracted from images belonging to various categories. Let I = {x_i ∈ R^d : i = 1, …, N} be a set of the d-dimensional local descriptors extracted from images. In the case of k-means clustering, the model parameter is defined as Θ_kmeans = {μ_k ∈ R^d : k = 1, …, K}, where K is the number of clusters and μ_k is the k-th mean vector. The probability that a local descriptor x_i belongs to a cluster μ_k is predicted by:

    p(x_i; μ_k) = δ[k, argmin_{k'} ||x_i − μ_{k'}||],                                (1)

where δ[·, ·] is the Kronecker delta function, which is 1 if the k-th cluster μ_k is the nearest to the descriptor x_i and 0 otherwise. The mean vector μ_k is calculated by:

    μ_k = ( Σ_{i=1}^{N} p(x_i; μ_k) x_i ) / ( Σ_{i=1}^{N} p(x_i; μ_k) ),             (2)

where Eqs. (1) and (2) are repeated for T iterations.

In the case of a Gaussian Mixture Model (GMM), the model parameter is defined as Θ_GMM = {π_k ∈ R, μ_k ∈ R^d, Σ_k ∈ R^{d×d} : k = 1, …, K}, where K is the number of Gaussians and π_k, μ_k and Σ_k are the prior probability (or mixing weight), mean vector and covariance matrix, respectively. The covariance matrix is often assumed to be a diagonal or symmetric positive-definite matrix. The probability of a descriptor is predicted by:

    p(x_i; μ_k, Σ_k) = 1 / ((2π)^{d/2} |Σ_k|^{1/2}) exp( −(1/2)(x_i − μ_k)^T Σ_k^{−1} (x_i − μ_k) ).   (3)

The model parameters are then calculated by:

    γ_i(k) = π_k p(x_i; μ_k, Σ_k) / Σ_{j=1}^{K} π_j p(x_i; μ_j, Σ_j),                (4)
    π_k = (1/N) Σ_{i=1}^{N} γ_i(k),                                                  (5)
    μ_k = ( Σ_{i=1}^{N} γ_i(k) x_i ) / ( Σ_{i=1}^{N} γ_i(k) ),                       (6)
    Σ_k = ( Σ_{i=1}^{N} γ_i(k)(x_i − μ_k)(x_i − μ_k)^T ) / ( Σ_{i=1}^{N} γ_i(k) ),   (7)

where γ_i(k) is a posterior probability and Eqs. (3)–(7) are repeated for T iterations.

B. Encoding Strategy

The typical strategy is to represent an image signature by counting the number of descriptors that belong to each cluster created by k-means, as in Eq. (8); this is called bag-of-visual-vocabulary (BOVW) and is similar to document representation in natural language processing:

    [G_Θ]_k = (1/N) Σ_{i=1}^{N} δ[k, argmin_{k'} ||x_i − μ_{k'}||],                  (8)

where I = {x_i ∈ R^d : i = 1, …, N} denotes the set of local descriptors extracted from a given image and [G_Θ]_k is the k-th element of a BOVW signature. The codebook size is normally set to K ≈ 4,000; recognition performance on the Caltech-101 dataset is not significantly improved with K > 4,000 [12]. BOVW represents a frequency histogram as a K-dimensional image signature, so a large codebook size is required to represent the image signature precisely, which is a problem because it increases the cost of searching for the nearest neighbor.

The Fisher Vector (FV) tackles this problem by including higher-order statistics computed from a smaller codebook than BOVW [3], [13]. An FV signature includes mean and variance statistics in addition to frequency. The components (frequency, mean and variance) of an FV signature are respectively captured by:

    G_{γ,k} = 1/(N√π_k) Σ_{i=1}^{N} (γ_i(k) − π_k),                                  (9)
    G_{μ,k} = 1/(N√π_k) Σ_{i=1}^{N} γ_i(k) (x_i − μ_k)/σ_k,                          (10)
    G_{σ,k} = 1/(N√(2π_k)) Σ_{i=1}^{N} γ_i(k) [ (x_i − μ_k)²/σ_k² − 1 ],             (11)

and the FV signature is represented by concatenating the components of Eqs. (9)–(11) as

    G_Θ = [ ⋯ G_{γ,k} ⋯ G_{μ,k} ⋯ G_{σ,k} ⋯ ].                                      (12)

As improvement techniques for FV, power-normalization followed by ℓ2-normalization have been proposed [14]; these are applied to an FV signature as follows:

    [G_Θ]_j ← sign([G_Θ]_j) √|[G_Θ]_j|,                                             (13)
    [G_Θ]_j ← [G_Θ]_j / ||G_Θ||,                                                     (14)

where sign(·) indicates the sign of the input value. Power-normalized vectors are sensitive to small bin values, and the Euclidean distance between two power-normalized vectors has the same characteristic as the Hellinger distance [15], so power-normalization is nowadays well known as a de facto post-processing step [3]. The ℓ2-normalization removes the effect of the prior probability between two FV signatures [14].

As an approach very similar to FV, the Vector of Locally Aggregated Descriptors (VLAD) has been proposed; it represents an image feature by aggregating the residuals between the local descriptors and their nearest vocabularies created by k-means, which is similar to the mean component of FV shown in Eq. (10). A VLAD signature is given by:

    G_{μ,k} = Σ_{i=1}^{N} p(x_i; μ_k) (x_i − μ_k),                                   (15)
    G_Θ = [ ⋯ G_{μ,k} ⋯ ],                                                          (16)

where ℓ2-normalization is applied as a post-processing step.

III. EXPERIMENT 1: EFFECT OF PRIOR PROBABILITIES

This section discusses the effect of prior probabilities in codebook-based image representation on recognition performance.

A. Experimental Setup

The Caltech-101 dataset was used as a benchmark. The dataset consists of 9,145 images from 101 object categories and a background category; each category contains about 40 to 800 images. The images were resized so that the longest dimension equals 300 pixels. The scale-invariant feature transform (SIFT) [9] was used for extracting local image descriptors. The SIFT descriptors were densely extracted at four scale levels (16, 22, 33 and 44 pixels) on the intersections of a regular grid with 8-pixel spacing, and their dimensionality was then reduced to 80 by principal component analysis.

The SIFT descriptors extracted from 510 images (5 images randomly selected from each category) were used as the samples for creating the codebook, where the number of vocabularies was set to K ∈ {16, 32, 64, 128, 256} and the number of iterations for creating a codebook was set to 30. Each covariance matrix of the codebook created by GMM was assumed to be a diagonal matrix. The linear SVM implemented in the LIBLINEAR package [16] was used as a linear discriminant model. Randomly selected sets of 5, 10, 15, 20 or 30 images from each category were used for training the model, and the remaining images were used for testing. The hyper-parameter C, a penalty parameter of the objective function, was optimized by 5-fold cross-validation. The recognition performance was examined by the average accuracy over five independent sets of training and test images.

B. Experimental Results and Discussion

Fig. 1 shows the distribution of the prior probabilities of the codebooks created by k-means and GMM. The prior probability of k-means was calculated by (1/N) Σ_{i=1}^{N} p(x_i; μ_k), in the same way as Eq. (5). The mean of the prior probabilities is constant for a given codebook size, so the standard deviation of the prior probabilities can be used as a measure of over-fitting to the samples.

Figure 1: The distribution of the prior probabilities for codebook size K = 16. The dashed line denotes the mean of the prior probabilities (1/16 = 0.0625); the mean is always equal to 1/K due to the probabilistic constraint Σ_{k=1}^{K} π_k = 1.

Table I shows the standard deviations of the prior probabilities with respect to the codebook size; the bottom row (relative ratio) shows the ratio of the GMM standard deviation to the k-means standard deviation.

Table I: Standard deviation of prior probabilities π_k.

    Method          16        32        64        128       256
    GMM             0.03830   0.02257   0.01665   0.00940   0.00779
    k-means         0.02921   0.01511   0.00800   0.00391   0.00188
    Relative ratio  1.31119   1.49371   2.08125   2.40409   4.14362

The standard deviations of GMM were larger than those of k-means for all codebook sizes. In addition, according to the relative ratio, the gap between the GMM and k-means standard deviations increased sharply with the codebook size. The distribution of the prior probabilities of a codebook is therefore a cause of the sparseness of FV; it is empirically known that an FV signature becomes sparser as the number of Gaussians increases [14].

Fig. 2 shows the recognition performances comparing the original FV (FV(γ+μ+σ)) and the original VLAD (VLAD) with two additional encodings (FV(μ) and VLAD+PN) for codebook size K = 256, where the FV signature FV(γ+μ+σ) has 41,216 dimensions and the other signatures (VLAD, FV(μ) and VLAD+PN) have 20,480 dimensions. FV(μ) allows a comparison with VLAD at the same dimensionality, and VLAD+PN isolates the contribution of power-normalization from the other improvements of FV.

Figure 2: The recognition performances of the following techniques: "FV(γ+μ+σ)", the fully encoded signature; "FV(μ)", the FV signature encoded by only the mean component (Eq. (10)); "VLAD", the original VLAD signature; "VLAD+PN", the VLAD signature with power-normalization applied.

First, the original FV signature outperformed the other signatures. A possible cause is that FV(γ+μ+σ) captures more statistics than the others.
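For concreteness, the FV encoding compared here (Eqs. (9)–(14)) can be sketched in a few lines of NumPy. This is a minimal illustrative sketch, not the implementation used in the experiments: the diagonal-GMM codebook parameters are randomly initialized rather than fitted with the EM updates of Eqs. (4)–(7), synthetic descriptors stand in for dense SIFT, and the dimensions are shrunk for readability.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "local descriptors" standing in for dense SIFT (toy sizes).
N, d, K = 500, 8, 4
X = rng.normal(size=(N, d))

# Toy diagonal-covariance GMM codebook (random init; a real codebook would
# be fitted on training descriptors with the EM updates of Eqs. (4)-(7)).
pi = np.full(K, 1.0 / K)          # prior probabilities (mixing weights), Eq. (5)
mu = rng.normal(size=(K, d))      # mean vectors
sigma2 = np.ones((K, d))          # diagonal variances

def posteriors(X):
    """gamma_i(k) of Eq. (4) for a diagonal GMM, computed in log-space."""
    log_p = -0.5 * ((((X[:, None, :] - mu[None]) ** 2) / sigma2[None]).sum(-1)
                    + np.log(2 * np.pi * sigma2).sum(-1))
    log_w = np.log(pi) + log_p
    log_w -= log_w.max(axis=1, keepdims=True)       # numerical stability
    w = np.exp(log_w)
    return w / w.sum(axis=1, keepdims=True)

def fisher_vector(X):
    """Concatenated FV of Eqs. (9)-(12) with normalizations (13)-(14)."""
    g = posteriors(X)                                # (N, K)
    n = X.shape[0]
    diff = (X[:, None, :] - mu[None]) / np.sqrt(sigma2)[None]   # (N, K, d)
    G_gamma = (g - pi).sum(0) / (n * np.sqrt(pi))                     # Eq. (9)
    G_mu = (g[..., None] * diff).sum(0) / (n * np.sqrt(pi))[:, None]  # Eq. (10)
    G_sigma = ((g[..., None] * (diff ** 2 - 1)).sum(0)
               / (n * np.sqrt(2 * pi))[:, None])                      # Eq. (11)
    fv = np.concatenate([G_gamma, G_mu.ravel(), G_sigma.ravel()])     # Eq. (12)
    fv = np.sign(fv) * np.sqrt(np.abs(fv))    # power-normalization, Eq. (13)
    return fv / np.linalg.norm(fv)            # l2-normalization, Eq. (14)

fv = fisher_vector(X)
print(fv.shape)                               # (K + 2*K*d,) -> (68,) here
print(round(float(np.linalg.norm(fv)), 6))    # 1.0 after l2-normalization
```

Dropping `G_gamma` and `G_sigma` from the concatenation yields the mean-only signature FV(μ), and replacing the soft posteriors with the hard assignment of Eq. (1) recovers the VLAD aggregation of Eq. (15).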
VLAD+PN improved recognition accuracy by about 3% through power-normalization and significantly outperformed FV(μ), although the dimensionalities and the included statistics are the same. From these results, recognition performance possibly depends on the distribution of the prior probabilities, and the standard deviation can be used as an indicator for creating a codebook.

IV. EXPERIMENT 2: IMPACT OF CONTROL OF STANDARD DEVIATION OF PRIOR PROBABILITIES

In this section, the prediction function of GMM is parameterized to control the standard deviation of the prior probabilities. This function is applied for encoding an FV signature, and the results are compared with those of the previous section (Sec. III).

A. Parameterization of Fisher Vector

Two scaling parameters γ and ν are added to the prediction function of Eq. (3), in the same way as [17], to control the standard deviation of the prior probabilities:

    p*(x_i; μ_k, Σ_k) = 1 / ((2π)^{d/2} |Σ_k|^{1/γ}) exp( −(1/ν)(x_i − μ_k)^T Σ_k^{−1} (x_i − μ_k) ),   (17)

and a codebook is created with Eqs. (4)–(7) in the same way as for GMM.

B. Experimental Setup

The range of the scaling parameters was defined as {2^i : i = −5, −3, −1, 1, 3, 5}. The case γ = 2 and ν = 2 in Eq. (17) is identical to the original prediction function of GMM shown in Eq. (3), so we reused the results of the previous experiments for the parameterized FV with γ = 2 and ν = 2. The other parameters were set the same as in the previous experiment in Sect. III.

C. Experimental Results and Discussion

Fig. 3 shows the effect of the scaling parameters: Figs. 3a and 3b show the standard deviations and recognition accuracies, respectively. In the contour of the standard deviations (Fig. 3a), the minimum standard deviation was recorded at γ = 2^−5 and ν = 2^−5, and the contour shows that the standard deviation becomes smaller as the scaling parameters decrease. In the contour of the recognition accuracies (Fig. 3b), the maximum accuracy was also recorded at γ = 2^−5 and ν = 2^−5. In addition, these contours (especially Fig. 3b) may not be simple convex distributions. Therefore, multi-point searching algorithms such as particle swarm optimization and differential evolution are suitable for optimizing the two scaling parameters. Generally, a multi-point searching algorithm requires many evaluations; however, the codebook creation step is normally offline, so the computational cost is not increased in practical applications.

Figure 3: Effect of the scaling parameters. (a) Standard deviation of prior probabilities. (b) Recognition accuracy of parameterized FV.

Fig. 4 shows the relationships with the scaling parameters. The relationships were measured by Pearson's correlation analysis, and the blue lines indicate the lines of best fit obtained by the least-squares method. Fig. 4a shows the relationship between the standard deviation of the prior probabilities and recognition performance; the correlation coefficient was a strong negative −0.62. In addition, Table II shows the correlation coefficient with respect to the number of training images.

Table II: The relationship between the standard deviations and the recognition accuracies.

    The number of training images   5       10      15      20      30
    Correlation coefficient         −0.58   −0.60   −0.61   −0.62   −0.62

Fig. 4b shows the relationship between the standard deviation of the prior probabilities and the sparseness of the parameterized FV signatures; the correlation coefficient was an especially strong negative −0.88. These results suggest that the parameterized FV has the potential to improve recognition accuracy on other image recognition tasks as well, because GMM does not have a regularization term.

Figure 4: Relationship with the standard deviation of prior probabilities. (a) Recognition performance. (b) Sparseness of parameterized FV signatures.

V. CONCLUSIONS AND FUTURE WORK

This paper has reported the relationship between the prior probabilities and recognition performance in codebook-based image representation; it is possible to improve the recognition accuracy and the sparseness of image signatures by controlling the standard deviation of the prior probabilities of a codebook. We consider the following constraint conditions suitable for optimizing the scaling parameters (γ and ν) with multi-point searching algorithms such as particle swarm optimization and differential evolution:

• The standard deviation of the weights depends on the parameters.
• Absolute logarithmic terms of the scaling parameters, which encourage convergence around the origin.

ACKNOWLEDGMENT

This work was supported by JSPS KAKENHI Grant Number 25330240.

REFERENCES

[1] J. Deng, A. C. Berg, K. Li, and L. Fei-Fei, "What does classifying more than 10,000 image categories tell us?" in Proceedings of the 11th European Conference on Computer Vision: Part V, ser. ECCV'10. Berlin, Heidelberg: Springer-Verlag, 2010, pp. 71–84.
[2] H. Nakayama, "Augmenting descriptors for fine-grained visual categorization using polynomial embedding," in ICME. IEEE Computer Society, 2013, pp. 1–6.
[3] P.-H. Gosselin, N. Murray, H. Jégou, and F. Perronnin, "Revisiting the Fisher vector for fine-grained classification," Pattern Recognition Letters, vol. 49, pp. 92–98, Nov. 2014.
[4] R. Arandjelović and A. Zisserman, "All about VLAD," in 2013 IEEE Conference on Computer Vision and Pattern Recognition, Portland, OR, USA, June 23–28, 2013. IEEE, 2013, pp. 1578–1585.
[5] K. Simonyan, O. M. Parkhi, A. Vedaldi, and A. Zisserman, "Fisher Vector Faces in the Wild," in British Machine Vision Conference, 2013.
[6] L. Fei-Fei, R. Fergus, and P. Perona, "Learning generative visual models from few training examples: An incremental Bayesian approach tested on 101 object categories," Comput. Vis. Image Underst., vol. 106, no. 1, pp. 59–70, Apr. 2007.
[7] P. Welinder, S. Branson, T. Mita, C. Wah, F. Schroff, S. Belongie, and P. Perona, "Caltech-UCSD Birds 200," California Institute of Technology, Tech. Rep. CNS-TR-2010-001, 2010.
[8] S. Lazebnik, C. Schmid, and J. Ponce, "Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories," in Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition - Volume 2, ser. CVPR '06. Washington, DC, USA: IEEE Computer Society, 2006, pp. 2169–2178.
[9] D. G. Lowe, "Distinctive image features from scale-invariant keypoints," Int. J. Comput. Vision, vol. 60, no. 2, pp. 91–110, Nov. 2004.
[10] K. van de Sande, T. Gevers, and C. Snoek, "Evaluating color descriptors for object and scene recognition," IEEE Trans. Pattern Anal. Mach. Intell., vol. 32, no. 9, pp. 1582–1596, Sep. 2010.
[11] H. Bay, A. Ess, T. Tuytelaars, and L. Van Gool, "Speeded-up robust features (SURF)," Comput. Vis. Image Underst., vol. 110, no. 3, pp. 346–359, Jun. 2008.
[12] L. Seidenari, G. Serra, A. D. Bagdanov, and A. Del Bimbo, "Local pyramidal descriptors for image recognition," IEEE Trans. Pattern Anal. Mach. Intell., vol. 36, no. 5, pp. 1033–1040, 2014.
[13] F. Perronnin and C. R. Dance, "Fisher kernels on visual vocabularies for image categorization," in CVPR. IEEE Computer Society, 2007.
[14] F. Perronnin, J. Sánchez, and T. Mensink, "Improving the Fisher kernel for large-scale image classification," in Proceedings of the 11th European Conference on Computer Vision: Part IV, ser. ECCV'10. Berlin, Heidelberg: Springer-Verlag, 2010, pp. 143–156.
[15] R. Arandjelović and A. Zisserman, "Three things everyone should know to improve object retrieval," in IEEE Conference on Computer Vision and Pattern Recognition, 2012.
[16] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin, "LIBLINEAR: A library for large linear classification," Journal of Machine Learning Research, vol. 9, pp. 1871–1874, 2008.
[17] H. Ichihashi, K. Honda, A. Notsu, and K. Ohta, "Fuzzy c-means classifier with particle swarm optimization," in FUZZ-IEEE. IEEE, 2008, pp. 207–215.