Image Processing Theory, Tools and Applications
Novel Color HWML Descriptors
for Scene and Object Image Classification
Sugata Banerji1, Atreyee Sinha and Chengjun Liu
Department of Computer Science,
New Jersey Institute of Technology,
Newark, NJ 07102, USA
Email: {sb256, as739, chengjun.liu}@njit.edu
Abstract—Several new image descriptors are presented in this paper that combine color, texture and shape information to create feature vectors for scene and object image classification. In particular, first, a new three dimensional Local Binary Patterns (3D-LBP) descriptor is proposed for color image local feature extraction. Second, three novel color HWML (HOG of Wavelet of Multiplanar LBP) descriptors are derived by computing the histogram of the orientation gradients of the Haar wavelet transformation of the original image and the 3D-LBP images. Third, the Enhanced Fisher Model (EFM) is applied for discriminatory feature extraction and the nearest neighbor classification rule is used for image classification. Finally, the Caltech 256 object categories database and the MIT scene dataset are used to show the feasibility of the proposed new methods.

Index Terms—The HOG of Wavelet of Multiplanar LBP (HWML) descriptor, the Fused Color HWML (FCHWML) descriptor, Enhanced Fisher Model (EFM), image search, scene and object classification

I. INTRODUCTION

In recent years, the use of color as a means to image retrieval [1], [2] and object and scene search [3], [4] has gained popularity. Color features can capture discriminative information by means of color invariants, color histograms, color texture, etc. The early methods for object and scene classification were mainly based on global descriptors such as the color and texture histograms [5], [6], [7]. One representative method is the color indexing system designed by Swain and Ballard, which uses the color histogram for image inquiry from a large image database [8]. These early methods are sensitive to viewpoint and lighting changes, clutter and occlusions. For this reason, global methods were gradually replaced by part-based methods, which became one of the popular techniques in the object recognition community.

More recent work on color based image classification appears in [9], [3], [10], which propose several new color spaces and methods for face, object and scene classification. The HSV color space is used for scene category recognition in [11], and an evaluation of local color invariant descriptors is performed in [12]. Fusion of color models, color region detection and color edge detection has been investigated for the representation of color images [13]. Some important contributions of color, texture, and shape abstraction for image retrieval are discussed by Datta et al. [14].

Lately, several methods based on LBP features have been proposed for image representation and classification [15], [16]. In a 3 × 3 neighborhood of an image, the basic LBP operator assigns a binary label 0 or 1 to each surrounding pixel by thresholding at the gray value of the central pixel, and replaces the central pixel's value with the decimal number converted from the resulting 8-bit binary number. Extraction of LBP features is computationally efficient, and with the use of multi-scale filters invariance to scaling and rotation can be achieved [15], [4]. Fusion of different features has been shown to achieve a good image retrieval success rate [4], [16], [17]. Local image descriptors have also been shown to perform well for texture based image retrieval [4], [17]. Several researchers have used the Haar wavelet transform for object detection in images, and LBP has also been combined with Haar-like features for face detection [18]. The Histogram of Orientation Gradients (HOG) descriptor [19], [20] is able to represent an image by its local shape and the spatial layout of the shape. The local shape is captured by the distribution over edge orientations within a region. Figure 1 shows how the HOG descriptor is formed by the gradient histograms from a scene image.

Fig. 1. The Histogram of Orientation Gradients (HOG) descriptor.

Fig. 2. A sample image from the MIT Scene dataset split up into various color components and grayscale.

1 Corresponding author

978-1-4673-2584-4/12/$31.00 ©2012 IEEE
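As a concrete illustration of the basic LBP operator described above, the following minimal Python sketch thresholds the eight neighbors of a 3 × 3 patch at the center value and packs the results into an 8-bit code. The clockwise-from-top-left neighbor ordering is an assumption for illustration; the paper does not fix one.

```python
def lbp_code(patch):
    """Basic LBP for a 3x3 patch (list of 3 rows of 3 gray values):
    threshold the 8 neighbors at the center value and read the
    resulting bits as an 8-bit binary number."""
    center = patch[1][1]
    # Neighbor ordering (clockwise from top-left) is a convention
    # assumed here, not specified by the paper.
    neighbors = [patch[0][0], patch[0][1], patch[0][2],
                 patch[1][2], patch[2][2], patch[2][1],
                 patch[2][0], patch[1][0]]
    code = 0
    for bit, value in enumerate(neighbors):
        if value >= center:          # binary label 1 if neighbor >= center
            code |= 1 << bit
    return code

patch = [[6, 5, 2],
         [7, 6, 1],
         [9, 8, 7]]
print(lbp_code(patch))  # -> 241
```

Any fixed neighbor ordering yields an equally valid (if differently numbered) labeling, which is why implementations only need to be internally consistent.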
Efficient retrieval requires a robust feature extraction method that has the ability to learn meaningful low-dimensional patterns in spaces of very high dimensionality [21], [22], [23]. Low-dimensional representation is also
important when one considers the intrinsic computational
aspect. PCA has been widely used to perform dimensionality
reduction for image indexing and retrieval [24], [25]. The EFM
feature extraction method has achieved good success for the
task of image based representation and retrieval [26].
In this paper, we employ three masks in three perpendicular
planes to generate a novel multiplanar 3D-LBP feature that
contains more information than the traditional LBP. Further,
we subject the 3D-LBP image to a Haar wavelet transformation and then generate the HOG of the resultant image to create
a robust feature vector. We extend this concept to different
color spaces and propose the new oRGB-HWML and YCbCr-HWML feature representations, and then integrate them with
other color HWML features to produce the novel Fused Color
HWML (FCHWML) descriptor. Feature extraction applies the
Enhanced Fisher Model (EFM) [24] and image classification
is based on the nearest neighbor classification rule.
Fig. 3. The proposed 3D-LBP descriptor. A 3×3×3 pixel region of the original image is magnified to show the 3D-LBP neighborhoods and the resulting LBP images.
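One plausible reading of the 3D-LBP neighborhoods of Fig. 3 can be sketched as follows: a five-plane stack is built by replicating planes, and each pixel of the three middle planes receives three LBP codes, one per perpendicular 3 × 3 mask. The plane-replication order and the neighbor ordering are assumptions for illustration; the paper does not specify them.

```python
def lbp_from_neighbors(center, neighbors):
    # Threshold each neighbor at the center value and pack the bits.
    code = 0
    for bit, value in enumerate(neighbors):
        if value >= center:
            code |= 1 << bit
    return code

def five_plane_stack(r, g, b):
    # One plausible reading: the first (R) and third (B) planes are
    # replicated on opposite sides of the existing R, G, B planes.
    return [b, r, g, b, r]

def threeplane_codes(vol, p, y, x):
    """Three LBP codes for pixel (p, y, x) of a five-plane stack
    vol[plane][row][col]: one 3x3 mask in each of the three
    perpendicular planes (xy, xz and yz)."""
    c = vol[p][y][x]
    xy = [vol[p][y + dy][x + dx]
          for dy in (-1, 0, 1) for dx in (-1, 0, 1) if (dy, dx) != (0, 0)]
    xz = [vol[p + dp][y][x + dx]
          for dp in (-1, 0, 1) for dx in (-1, 0, 1) if (dp, dx) != (0, 0)]
    yz = [vol[p + dp][y + dy][x]
          for dp in (-1, 0, 1) for dy in (-1, 0, 1) if (dp, dy) != (0, 0)]
    return tuple(lbp_from_neighbors(c, n) for n in (xy, xz, yz))
```

Applying this at every interior pixel of the three middle planes yields the three LBP images that the 3D-LBP method retains.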
Fig. 4. The proposed HWML descriptor. The original and 3D-LBP images
undergo Haar Wavelet Transformation and then HOG is generated for each
component of the resulting image and concatenated.
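The pipeline of Fig. 4 can be sketched in simplified form: one level of the 2D Haar transform splits an image into four subbands, and a coarse orientation-gradient histogram is computed per subband and concatenated. The averaging/differencing normalization, the bin count, and the omission of HOG cells, blocks and normalization are simplifications assumed here.

```python
import math

def haar2d(img):
    """One level of the 2D Haar transform: returns the four subbands
    (approximation LL and details LH, HL, HH). img is a list of rows
    with even height and width; averaging/differencing normalization."""
    h, w = len(img), len(img[0])
    lo = [[(r[2 * j] + r[2 * j + 1]) / 2 for j in range(w // 2)] for r in img]
    hi = [[(r[2 * j] - r[2 * j + 1]) / 2 for j in range(w // 2)] for r in img]
    def cols(m, op):
        return [[op(m[2 * i][j], m[2 * i + 1][j]) for j in range(len(m[0]))]
                for i in range(h // 2)]
    avg = lambda a, b: (a + b) / 2
    dif = lambda a, b: (a - b) / 2
    return cols(lo, avg), cols(lo, dif), cols(hi, avg), cols(hi, dif)

def hog_hist(band, bins=8):
    """Coarse HOG: a single histogram of gradient orientations over a
    subband, weighted by gradient magnitude (no cells or block
    normalization, unlike a full HOG)."""
    hist = [0.0] * bins
    for i in range(1, len(band) - 1):
        for j in range(1, len(band[0]) - 1):
            gx = band[i][j + 1] - band[i][j - 1]
            gy = band[i + 1][j] - band[i - 1][j]
            ang = math.atan2(gy, gx) % math.pi      # unsigned orientation
            hist[min(int(ang / math.pi * bins), bins - 1)] += math.hypot(gx, gy)
    return hist

img = [[(i * 3 + j) % 7 for j in range(8)] for i in range(8)]
# Concatenate one histogram per subband: 4 bands x 8 bins = 32 values.
feature = [v for band in haar2d(img) for v in hog_hist(band)]
```

In the full HWML descriptor this is done for the original image and the three 3D-LBP images, and all the resulting histograms are concatenated.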
II. IMPLEMENTATION DETAILS

We first review in this section five color spaces in which our new descriptor is defined, and then discuss the 3D-LBP descriptor, which is an improvement upon the traditional LBP descriptor. Next we present the new HWML descriptor in the oRGB and YCbCr color spaces and a combined color FCHWML descriptor.

A. Color spaces

A color image contains three component images, and each pixel of a color image is specified in a color space, which serves as a color coordinate system. The most commonly used color space is the RGB color space. Other color spaces are usually calculated from the RGB color space by means of either linear or nonlinear transformations. To reduce the sensitivity of RGB images to luminance, surface orientation, and other photographic conditions, the rgb color space is defined by normalizing the R, G, and B components. The HSV color space is motivated by the human vision system because humans describe color by means of hue, saturation, and brightness. Hue and saturation define chrominance, while intensity or value specifies luminance [27]. The YCbCr color space was developed for digital video standards and television transmissions. In YCbCr, the RGB components are separated into luminance, chrominance blue, and chrominance red:

    [Y ]   [ 16]   [ 65.4810   128.5530    24.9660] [R]
    [Cb] = [128] + [-37.7745   -74.1592   111.9337] [G]    (1)
    [Cr]   [128]   [111.9581   -93.7509   -18.2072] [B]

where the R, G, B values are scaled to [0, 1].

The oRGB color space [28] has three channels L, C1 and C2. The primaries of this model are based on the three fundamental psychological opponent axes: white-black, red-green, and yellow-blue. The color information is contained in C1 and C2. The value of C1 lies within [−1, 1] and the value of C2 lies within [−0.8660, 0.8660]. The L channel contains the luminance information and its values range between [0, 1]:

    [L ]   [0.2990    0.5870    0.1140] [R]
    [C1] = [0.5000    0.5000   -1.0000] [G]    (2)
    [C2]   [0.8660   -0.8660    0.0000] [B]

The five color spaces used in this paper are shown in Figure 2. The grayscale image has also been included as a reference.

B. The 3D-LBP and HWML descriptors

The LBP descriptor assigns an intensity value to each pixel of an image based on the intensity values of the eight neighboring pixels. Since a color image is represented by a three dimensional matrix, we extend this concept to assign an intensity value to each pixel based on its neighboring pixels not only on the same color plane but on the other planes as well. This method is explained in Figure 3. We replicate the first and third image planes on opposite sides of the three existing planes to create a five-plane matrix. After the LBP operation, only the three middle planes are retained.

The 3D-LBP method produces three images. We then apply the Haar wavelet transformation to the original and these three images to divide each image into four distinct regions. We then generate the HOG descriptor for each of these regions of the four images and concatenate them to get our final HWML feature vector. This process is illustrated in Figure 4.

Fig. 5. An overview of the multiple features fusion methodology, the EFM feature extraction method, and the classification stages.

Fig. 6. Example images from (a) the Caltech 256 dataset and (b) the MIT Scene dataset.
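The linear color conversions of equations (1) and (2) can be written out directly. This sketch simply transcribes the two matrices; the input R, G, B values are assumed to be scaled to [0, 1], as the paper states for equation (1).

```python
def rgb_to_ycbcr(r, g, b):
    """Eq. (1): YCbCr from RGB, with R, G, B scaled to [0, 1]."""
    y  =  16 +  65.4810 * r + 128.5530 * g +  24.9660 * b
    cb = 128 -  37.7745 * r -  74.1592 * g + 111.9337 * b
    cr = 128 + 111.9581 * r -  93.7509 * g -  18.2072 * b
    return y, cb, cr

def rgb_to_orgb(r, g, b):
    """Eq. (2): the L, C1, C2 channels. Note this is the linear step as
    printed in the paper; the full oRGB definition in [28] additionally
    rotates the chroma plane."""
    l  = 0.2990 * r + 0.5870 * g + 0.1140 * b
    c1 = 0.5000 * r + 0.5000 * g - 1.0000 * b
    c2 = 0.8660 * r - 0.8660 * g
    return l, c1, c2
```

A quick sanity check: white (1, 1, 1) maps to Y = 235 with Cb = Cr = 128, and to L = 1 with C1 = C2 = 0, matching the stated channel ranges.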
TABLE I
CATEGORY-WISE DESCRIPTOR PERFORMANCE (%) ON THE TOP 15 CLASSES OF THE CALTECH 256 DATASET. NOTE THAT THE CATEGORIES ARE SORTED ON THE FCHWML RESULTS.

| Category      | Fusion | YCbCr | oRGB | RGB | HSV |
| car-side      | 100    | 100   | 100  | 100 | 100 |
| faces-easy    | 100    | 100   | 100  | 100 | 100 |
| airplanes     | 100    | 92    | 96   | 92  | 100 |
| motorbikes    | 100    | 88    | 96   | 88  | 100 |
| bonsai        | 100    | 88    | 92   | 80  | 68  |
| leopards      | 96     | 96    | 96   | 88  | 96  |
| sunflower     | 92     | 100   | 96   | 88  | 80  |
| hibiscus      | 92     | 88    | 92   | 80  | 92  |
| watch         | 88     | 84    | 88   | 76  | 88  |
| grand-piano   | 88     | 76    | 80   | 88  | 76  |
| ketch         | 84     | 76    | 76   | 84  | 88  |
| telephone-box | 84     | 72    | 76   | 52  | 56  |
| american-flag | 80     | 60    | 72   | 44  | 40  |
| school-bus    | 76     | 80    | 80   | 72  | 80  |
| tomato        | 76     | 56    | 52   | 48  | 28  |

TABLE II
COMPARISON OF THE CLASSIFICATION PERFORMANCE (%) WITH OTHER METHODS ON THE CALTECH 256 DATASET

| #train | #test | Proposed method |  %   | Other methods  |  %   |
| 15360  | 5120  | oRGB-HWML       | 30.7 |                |      |
|        |       | YCbCr-HWML      | 31.1 |                |      |
|        |       | FCHWML          | 34.7 |                |      |
| 12800  | 6400  | oRGB-HWML       | 29.0 | oRGB-SIFT [3]  | 23.9 |
|        |       | YCbCr-HWML      | 29.7 | CSF [3]        | 30.1 |
|        |       | FCHWML          | 33.0 | CGSF [3]       | 35.6 |

C. The EFM-NN Classifier

Image classification using the new descriptor introduced in the preceding section is implemented using the Enhanced Fisher Model (EFM) feature extraction method [24] and the nearest neighbor classification rule (EFM-NN).

Let X ∈ R^N be a random vector whose covariance matrix is Σ_X:

    Σ_X = E{[X − E(X)][X − E(X)]^t}    (3)

where E(·) is the expectation operator and t denotes the transpose operation. The eigenvectors of the covariance matrix Σ_X can be derived by PCA:

    Σ_X = ΦΛΦ^t    (4)

where Φ = [φ1 φ2 . . . φN] is an orthogonal eigenvector matrix and Λ = diag{λ1, λ2, . . . , λN} a diagonal eigenvalue matrix with diagonal elements in decreasing order. An important application of PCA is dimensionality reduction:

    Y = P^t X    (5)

where P = [φ1 φ2 . . . φK], and K < N. Y ∈ R^K thus is composed of the most significant principal components. PCA, which is derived based on an optimal representation criterion, usually does not lead to good image classification performance. To improve upon PCA, the Fisher Linear Discriminant (FLD) analysis [29] is introduced to extract the most discriminating features.

The FLD method optimizes a criterion defined on the within-class and between-class scatter matrices, Sw and Sb [29]:

    Sw = Σ_{i=1}^{L} P(ωi) E{(Y − Mi)(Y − Mi)^t | ωi}    (6)

    Sb = Σ_{i=1}^{L} P(ωi)(Mi − M)(Mi − M)^t    (7)

where P(ωi) is a priori probability, ωi represent the classes, and Mi and M are the means of the classes and the grand mean, respectively. The criterion the FLD method optimizes is J1 = tr(Sw^{−1} Sb), which is maximized when Ψ contains the eigenvectors of the matrix Sw^{−1} Sb [29]:

    Sw^{−1} Sb Ψ = Ψ∆    (8)

where Ψ, ∆ are the eigenvector and eigenvalue matrices of Sw^{−1} Sb, respectively. The FLD discriminating features are defined by projecting the pattern vector Y onto the eigenvectors of Ψ:

    Z = Ψ^t Y    (9)

Z thus is more effective than the feature vector Y derived by PCA for image classification.

The FLD method, however, often leads to overfitting when implemented in an inappropriate PCA space. To improve the generalization performance of the FLD method, a proper balance between two criteria should be maintained: the energy criterion for adequate image representation and the magnitude criterion for eliminating the small-valued trailing eigenvalues of the within-class scatter matrix [24]. A better method, the Enhanced Fisher Model (EFM), is capable of improving the generalization performance of the FLD method [24]. Specifically, the EFM method improves the generalization capability of the FLD method by decomposing the FLD procedure into a simultaneous diagonalization of the within-class and between-class scatter matrices [24]. The simultaneous diagonalization is stepwise equivalent to two operations as pointed out by [29]: whitening the within-class scatter matrix and applying PCA to the between-class scatter matrix using the transformed data. The stepwise operation shows that during whitening the eigenvalues of the within-class scatter matrix appear in the denominator. Since the small (trailing) eigenvalues tend to capture noise [24], they cause the whitening step to fit for misleading variations, which leads to poor generalization performance. To achieve enhanced performance, the EFM method preserves a proper balance between the need that the selected eigenvalues account for most of the spectral energy of the raw data (for representational adequacy), and the requirement that the eigenvalues of the within-class scatter matrix (in the reduced PCA space) are not too small (for better generalization performance) [24].

Image classification is implemented using the nearest neighbor classification rule. Figure 5 gives an overview of the multiple features fusion methodology, the EFM feature extraction method, and the classification stages.
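The EFM procedure (PCA reduction, whitening of the within-class scatter, then PCA on the transformed between-class scatter) combined with nearest neighbor classification can be sketched as below. This is an illustrative sketch, not the paper's implementation: the function names, the chosen dimensions, and the small regularization floor on the within-class eigenvalues are all assumptions.

```python
import numpy as np

def efm_features(X, y, pca_dim, fld_dim):
    """Sketch of the EFM idea from [24]: PCA to pca_dim, then a stepwise
    simultaneous diagonalization of the within-class and between-class
    scatter matrices. Returns the data mean and overall projection."""
    mean = X.mean(axis=0)
    Xc = X - mean
    # PCA via SVD; keep pca_dim leading components (eq. (5)).
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    P = Vt[:pca_dim].T
    Y = Xc @ P
    classes = np.unique(y)
    M = Y.mean(axis=0)
    Sw = np.zeros((pca_dim, pca_dim))
    Sb = np.zeros_like(Sw)
    for c in classes:                      # eqs. (6) and (7)
        Yc = Y[y == c]
        Mi = Yc.mean(axis=0)
        pw = len(Yc) / len(Y)              # empirical prior P(w_i)
        Sw += pw * np.cov(Yc.T, bias=True)
        Sb += pw * np.outer(Mi - M, Mi - M)
    # Whiten Sw (floor the eigenvalues to avoid fitting noise), then
    # apply PCA to the transformed between-class scatter.
    w_vals, w_vecs = np.linalg.eigh(Sw)
    W = w_vecs / np.sqrt(np.maximum(w_vals, 1e-8))
    _, b_vecs = np.linalg.eigh(W.T @ Sb @ W)
    U = b_vecs[:, ::-1][:, :fld_dim]       # leading discriminant directions
    return mean, P @ W @ U                 # overall projection matrix

def nn_classify(mean, T, X_train, y_train, x):
    """Nearest neighbor rule in the EFM feature space."""
    Z_train = (X_train - mean) @ T
    z = (x - mean) @ T
    return y_train[np.argmin(np.linalg.norm(Z_train - z, axis=1))]
```

For L classes, fld_dim is at most L − 1, since the between-class scatter has rank L − 1; the trade-off in choosing pca_dim is exactly the energy/magnitude balance discussed above.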
III. EXPERIMENTAL RESULTS

A. Caltech 256 Dataset

The Caltech 256 dataset [30] holds 30,607 images divided into 256 object categories and a clutter class. The images have high intra-class variability and high object location variability. Each category contains at least 80 images, at most 827 images, and 119 images on average. The images represent a diverse set of lighting conditions, poses, backgrounds, and sizes. Images are in color, in JPEG format, with only a small percentage in grayscale. The average size of each image is 351×351 pixels. Some sample images from this dataset can be seen in Figure 6(a).

On this dataset, we conduct experiments for HWML descriptors from four different color spaces and their fusion. For each class, we make use of 50 images for training and 25 images for testing. The data splits are the ones that are provided on the Caltech website [30]. Figure 7 shows the detailed performance of our EFM-NN classification technique on this dataset. The best recognition rate that we obtain is 33.0%, which is a very respectable value for a dataset of this size and complexity. It can be seen from our previous work [3] that dense histograms perform poorly on this dataset, as the intra-class variability is very high and in several cases the object occupies a small portion of the full image. Also, previously, processor-intensive SIFT and visual vocabulary based methods achieved the best classification rate of 23.9% for a single color space and 35.6% after color and grayscale fusion. The proposed method is faster than the SIFT-based method, and the YCbCr-HWML alone achieves a success rate of 29.7%. Fusion of color spaces improves our result further by over 3%. Due to the nature of the 3D-LBP descriptor, it is not defined for grayscale images, and so we did not conduct any experiments for grayscale. Also, conversion to the rgb color space is undefined for grayscale images, and we did not use the rgb color space on this dataset as it contains some grayscale images.

Table II compares our results with those of SIFT-based methods. Table I shows the descriptor performance for the top 15 categories from this dataset. The FCHWML recognition rates for the top 15 categories lie between 76% and 100%, with five categories having a full success rate.

Fig. 7. The mean average classification performance of the five descriptors using the EFM-NN classifier on the Caltech 256 dataset: the RGB-HWML, the HSV-HWML, the oRGB-HWML, the YCbCr-HWML and the FCHWML descriptors.

TABLE III
CATEGORY-WISE DESCRIPTOR PERFORMANCE (%) ON THE MIT SCENE DATASET. NOTE THAT THE CATEGORIES ARE SORTED ON THE FCHWML RESULTS.

| Category      | Fusion | YCbCr | oRGB | RGB  | HSV  | rgb  |
| forest        | 98     | 98    | 96   | 97   | 97   | 95   |
| coast         | 94     | 90    | 92   | 93   | 90   | 91   |
| street        | 94     | 88    | 95   | 89   | 89   | 88   |
| inside city   | 92     | 92    | 92   | 92   | 88   | 89   |
| mountain      | 91     | 90    | 90   | 89   | 88   | 86   |
| tall building | 90     | 89    | 86   | 87   | 88   | 86   |
| highway       | 88     | 88    | 88   | 86   | 86   | 84   |
| open country  | 78     | 75    | 76   | 71   | 74   | 68   |
| Mean          | 90.6   | 88.7  | 88.2 | 88.1 | 87.5 | 85.7 |

TABLE IV
COMPARISON OF THE CLASSIFICATION PERFORMANCE (%) WITH OTHER METHODS ON THE MIT SCENE DATASET

| #train | #test | HWML  |  %   | [4]       |  %   | [31] |  %   |
| 2000   | 688   | oRGB  | 88.2 | CLF       | 86.4 |  -   |      |
|        |       | YCbCr | 88.7 | CGLF      | 86.6 |      |      |
|        |       | Fused | 90.6 | CGLF+PHOG | 89.5 |      |      |
| 800    | 1888  | oRGB  | 85.8 | CLF       | 79.3 |      | 83.7 |
|        |       | YCbCr | 85.7 | CGLF      | 80.0 |      |      |
|        |       | Fused | 87.8 | CGLF+PHOG | 84.3 |      |      |

Fig. 8. The mean average classification performance of the six descriptors using the EFM-NN classifier on the MIT Scene dataset: the rgb-HWML, the HSV-HWML, the RGB-HWML, the oRGB-HWML, the YCbCr-HWML and the FCHWML descriptors.
B. MIT Scene Dataset

The MIT Scene dataset [31] has 2,688 images classified into eight categories: 360 coast, 328 forest, 374 mountain, 410 open country, 260 highway, 308 inside city, 356 tall building, and 292 street images. All of the images are in color, in JPEG format, and the average size of each image is 256×256 pixels. There is a large variation in light and angles, along with a high intra-class variation. Figure 6(b) shows some sample images from this dataset.

From each class, we use 250 images for training and the rest of the images for testing the performance, and we do this for five random splits. Here YCbCr-HWML is the best single-color descriptor at 88.7%, followed closely by oRGB-HWML at 88.2%. The combined descriptor FCHWML gives a mean average performance of 90.6%. See Figure 8 for details. Table III shows the category-wise descriptor performance on the MIT Scene dataset. Table IV compares our results with those of other methods. Note that we tested our descriptor in the rgb color space as well for this dataset, and the fused descriptor contains the RGB, HSV, rgb, YCbCr and oRGB HWML descriptors.
IV. CONCLUSION

We have proposed a new LBP-based texture feature extraction method for color images and combined it with Haar wavelet features and HOG features to generate three new color descriptors: the oRGB-HWML descriptor, the YCbCr-HWML descriptor and the FCHWML descriptor for scene image and object image classification. Results of the experiments using two grand challenge datasets show that our oRGB-HWML and YCbCr-HWML descriptors improve recognition performance over conventional color LBP descriptors. The fusion of multiple color HWML descriptors (FCHWML) shows improvement in the classification performance, which indicates that the various color HWML descriptors are not redundant for image classification tasks.

REFERENCES

[1] C. Liu, "Capitalize on dimensionality increasing techniques for improving face recognition grand challenge performance," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 28, no. 5, pp. 725–737, 2006.
[2] P. Shih and C. Liu, "Comparative assessment of content-based face image retrieval in different color spaces," International Journal of Pattern Recognition and Artificial Intelligence, vol. 19, no. 7, 2005.
[3] A. Verma, S. Banerji, and C. Liu, "A new color SIFT descriptor and methods for image category classification," in International Congress on Computer Applications and Computational Science, Singapore, December 4-6 2010, pp. 819–822.
[4] S. Banerji, A. Verma, and C. Liu, "Novel color LBP descriptors for scene and image texture classification," in 15th Intl. Conf. on Image Processing, Computer Vision, and Pattern Recognition, Las Vegas, Nevada, July 18-21 2011.
[5] W. Niblack, R. Barber, and W. Equitz, "The QBIC project: Querying images by content using color, texture and shape," in SPIE Conference on Geometric Methods in Computer Vision II, 1993, pp. 173–187.
[6] M. Pontil and A. Verri, "Support vector machines for 3D object recognition," IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 20, no. 6, pp. 637–646, 1998.
[7] B. Schiele and J. Crowley, "Recognition without correspondence using multidimensional receptive field histograms," Int. Journal of Computer Vision, vol. 36, no. 1, pp. 31–50, 2000.
[8] M. Swain and D. Ballard, "Color indexing," International Journal of Computer Vision, vol. 7, no. 1, pp. 11–32, 1991.
[9] J. Yang and C. Liu, "Color image discriminant models and algorithms for face recognition," IEEE Trans. on Neural Networks, vol. 19, no. 12, pp. 2088–2098, 2008.
[10] C. Liu, "Learning the uncorrelated, independent, and discriminating color spaces for face recognition," IEEE Transactions on Information Forensics and Security, vol. 3, no. 2, pp. 213–222, 2008.
[11] A. Bosch, A. Zisserman, and X. Munoz, "Scene classification using a hybrid generative/discriminative approach," IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 30, no. 4, pp. 712–727, 2008.
[12] G. Burghouts and J.-M. Geusebroek, "Performance evaluation of local color invariants," Computer Vision and Image Understanding, vol. 113, pp. 48–62, 2009.
[13] H. Stokman and T. Gevers, "Selection and fusion of color models for image feature detection," IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 29, no. 3, pp. 371–381, 2007.
[14] R. Datta, D. Joshi, J. Li, and J. Wang, "Image retrieval: Ideas, influences, and trends of the new age," ACM Computing Surveys, vol. 40, no. 2, pp. 509–522, 2008.
[15] C. Zhu, C. Bichot, and L. Chen, "Multi-scale color local binary patterns for visual object classes recognition," in Int. Conf. on Pattern Recognition, Istanbul, Turkey, August 23-26 2010, pp. 3065–3068.
[16] M. Crosier and L. Griffin, "Texture classification with a dictionary of basic image features," in Proc. Computer Vision and Pattern Recognition, Anchorage, Alaska, June 23-28, 2008, pp. 1–7.
[17] J. Zhang, M. Marszalek, S. Lazebnik, and C. Schmid, "Local features and kernels for classification of texture and object categories: A comprehensive study," Int. Journal of Computer Vision, vol. 73, no. 2, pp. 213–238, 2007.
[18] L. Zhang, R. Chu, S. Xiang, S. Liao, and S. Z. Li, "Face detection based on multi-block LBP representation," in ICB'2007, 2007, pp. 11–18.
[19] A. Bosch, A. Zisserman, and X. Munoz, "Representing shape with a spatial pyramid kernel," in Int. Conf. on Image and Video Retrieval, Amsterdam, The Netherlands, July 9-11 2007, pp. 401–408.
[20] O. Ludwig, D. Delgado, V. Goncalves, and U. Nunes, "Trainable classifier-fusion schemes: An application to pedestrian detection," in 12th International IEEE Conference on Intelligent Transportation Systems, vol. 1, St. Louis, 2009, pp. 432–437.
[21] C. Liu, "A Bayesian discriminating features method for face detection," IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 25, no. 6, pp. 725–740, 2003.
[22] C. Liu and H. Wechsler, "Evolutionary pursuit and its application to face recognition," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 22, no. 6, pp. 570–582, 2000.
[23] C. Liu and H. Wechsler, "Independent component analysis of Gabor features for face recognition," IEEE Trans. on Neural Networks, vol. 14, no. 4, pp. 919–928, 2003.
[24] C. Liu and H. Wechsler, "Robust coding schemes for indexing and retrieval from large face databases," IEEE Trans. on Image Processing, vol. 9, no. 1, pp. 132–137, 2000.
[25] C. Liu, "Gabor-based kernel PCA with fractional power polynomial models for face recognition," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 26, no. 5, pp. 572–581, 2004.
[26] C. Liu, "Enhanced independent component analysis and its application to content based face image retrieval," IEEE Trans. Systems, Man, and Cybernetics, Part B: Cybernetics, vol. 34, no. 2, pp. 1117–1127, 2004.
[27] R. Gonzalez and R. Woods, Digital Image Processing. Prentice Hall, 2001.
[28] M. Bratkova, S. Boulos, and P. Shirley, "oRGB: A practical opponent color space for computer graphics," IEEE Computer Graphics and Applications, vol. 29, no. 1, pp. 42–55, 2009.
[29] K. Fukunaga, Introduction to Statistical Pattern Recognition, 2nd ed. Academic Press, 1990.
[30] G. Griffin, A. Holub, and P. Perona, "Caltech-256 object category dataset," California Institute of Technology, Tech. Rep. 7694, 2007. [Online]. Available: http://authors.library.caltech.edu/7694
[31] A. Oliva and A. Torralba, "Modeling the shape of the scene: A holistic representation of the spatial envelope," Int. Journal of Computer Vision, vol. 42, no. 3, pp. 145–175, 2001.