Mobile Phone Spam Image Detection
based on Graph Partitioning with Pyramid Histogram
of Visual Words Image Descriptor
So Yeon Kim
Department of Information and Computer Engineering
Ajou University
Suwon, S. Korea
[email protected]
Abstract: Image spam has long annoyed users everywhere, and it
is increasingly appearing on mobile phones. As spam filtering
systems become more sophisticated, spam is also becoming more
intelligent and has caused severe social problems. However,
there is not yet an effective solution for detecting spam images
on mobile phones. Because spam image data from mobile phones is
scarce, training a predictive model is difficult. To resolve
this issue, we recently proposed a phone spam image filtering
system that uses e-mail spam images, and showed that e-mail spam
data meaningfully improves the performance of phone spam image
detection. In this paper, we further investigate the
effectiveness of utilizing the graph structure in e-mail spam
data. Furthermore, we extensively explore how the classification
performance depends on the image descriptor, comparing the
Pyramid Histogram of Visual Words (PHOW) and the RGB histogram.
Keywords: graph partitioning; spectral clustering; PHOW;
image spam; spam detection; image classification; color SIFT
I. INTRODUCTION
Image spam is widespread in all kinds of media. Although there
have been many studies on detecting spam images in e-mails or
web pages, studies on spam images in mobile phones are far
scarcer. Following a series of personal information leaks, spam
messages are increasingly appearing in personal spaces such as
smartphones. Spam text messages have been irritating users for
years, and several approaches have been proposed for detecting
them effectively. Recently, unsolicited spam messages have
caused severe social problems because they are used for bank
fraud and financial crimes. Spam messages have evolved to evade
conventional text-based spam filtering systems. They include
unnecessary special characters or white spaces between words to
prevent the filter from detecting spam keywords. Usually, spam
messages can be detected with a user-supplied database of spam
numbers. This can nevertheless be circumvented by changing the
sending number or by using an actual [...]'s number so that the
message is not caught by the database.
Furthermore, image spam without any text is rapidly increasing
on mobile phones, making spam detection even harder. Because of
the high cost of image processing on a mobile phone and the
scarcity of phone spam image data, detecting spam images on a
mobile phone is a difficult problem. Accordingly, research on
phone spam images is necessary. However, the amount of phone
spam image data is still too small to train a predictive model
with sufficient accuracy.
* Corresponding Author
978-1-4799-8679-8/15/$31.00 copyright 2015 IEEE
ICIS 2015, June 28-July 1 2015, Las Vegas, USA
Kyung-Ah Sohn*
Department of Information and Computer Engineering
Ajou University
Suwon, S. Korea
[email protected]
In this respect, we recently proposed a phone spam image
filtering system that takes advantage of widely available e-mail
spam image data [1]. We showed that using a visually similar
sub-group of e-mail spam images in addition to phone spam
images is effective for phone spam image detection. In this
paper, we further investigate the effectiveness of using the
graph structure in e-mail spam data. To obtain a similar
sub-group of e-mail spam images, we use spectral clustering, a
graph partitioning algorithm, as well as k-means clustering. In
addition, we compare spam image classification performance
across multiple image descriptors: the RGB histogram and the
Pyramid Histogram of Visual Words (PHOW) descriptor in gray,
RGB, and opponent color modes.
II. METHODOLOGY
A. Image Descriptors
To obtain image features, we use existing image descriptors.
Each spam or non-spam image is represented by an RGB histogram
or by a Pyramid Histogram of Visual Words (PHOW) descriptor [2]
in gray, RGB, or opponent color mode. The PHOW descriptor is
computed with the VLFeat open source visual computing
library [3].
1) RGB histogram: For each image, a color histogram is computed
with 4 bins each for red, green, and blue, for a total of
64 bins. It describes an image by its RGB color distribution.
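As a minimal sketch of the RGB histogram described above, assuming a joint 4 x 4 x 4 binning (which yields the 64 bins mentioned) and L1 normalization, neither of which is spelled out in the text, the descriptor could be computed as follows. The function name rgb_histogram and the pixel-tuple input format are illustrative choices, not part of the original system.

```python
def rgb_histogram(pixels, bins_per_channel=4):
    """Joint RGB histogram with bins_per_channel**3 bins (64 by default).

    pixels: iterable of (r, g, b) tuples with values in 0..255.
    The L1 normalization is an assumption made for this sketch.
    """
    step = 256 // bins_per_channel
    hist = [0.0] * (bins_per_channel ** 3)
    for r, g, b in pixels:
        # Combine the three per-channel bin indices into one joint index.
        idx = ((r // step) * bins_per_channel + (g // step)) * bins_per_channel + (b // step)
        hist[idx] += 1.0
    total = sum(hist)
    return [h / total for h in hist] if total else hist


# Toy "image": two reddish pixels, one blue pixel, one black pixel.
pixels = [(255, 0, 0), (250, 10, 5), (0, 0, 255), (0, 0, 0)]
hist = rgb_histogram(pixels)
```

Because only color counts are kept, two images with the same palette but different layouts receive the same vector, which is the limitation the PHOW descriptor is meant to address.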
2) PHOW (gray, RGB, opponent): The image is represented by the
PHOW descriptor [2], which is based on the spatial pyramid
matching scheme [4].
a) Feature extraction: For each input image, multiple dense
SIFT descriptors are obtained in gray, RGB, and opponent color
modes [2]. The SIFT descriptor in opponent color space has been
shown to perform better than other color SIFT descriptors in
many categories of image datasets [5].
b) Bag of visual words: The extracted visual features of the
images are partitioned into 500 visual words by k-means
clustering [6], and a visual word dictionary is constructed.
Then, each input image is vector-quantized into visual words
using a kd-tree built from the visual word dictionary [7].
c) Spatial histogram: Each image is divided into 2 × 4
sub-regions to consider the spatial co-occurrences of histograms
in every sub-region of an image [4]. In each sub-region, a
histogram of the bag of visual words is obtained, namely a
distribution over the 500 visual words. Finally, the
concatenation of the 8 spatial histograms is the descriptor of
an image.
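The spatial histogram step can be illustrated with a short sketch. It assumes that each dense feature has already been quantized to a visual word index (the SIFT extraction and kd-tree lookup are not reproduced here), that the grid has 2 columns and 4 rows, and that the final vector is L1-normalized; the grid orientation, the normalization, and the name spatial_bow_histogram are assumptions for illustration only.

```python
def spatial_bow_histogram(keypoints, width, height,
                          vocab_size=500, grid_x=2, grid_y=4):
    """Concatenate one bag-of-visual-words histogram per grid cell.

    keypoints: list of (x, y, word_id) with 0 <= x < width,
               0 <= y < height and 0 <= word_id < vocab_size.
    Returns a vector of length grid_x * grid_y * vocab_size.
    """
    desc = [0.0] * (grid_x * grid_y * vocab_size)
    for x, y, word in keypoints:
        # Locate the grid cell that contains this keypoint.
        cx = min(int(x * grid_x / width), grid_x - 1)
        cy = min(int(y * grid_y / height), grid_y - 1)
        cell = cy * grid_x + cx
        desc[cell * vocab_size + word] += 1.0
    total = sum(desc)
    return [v / total for v in desc] if total else desc


# Three quantized features in a 100 x 200 image.
keypoints = [(0, 0, 3), (99, 199, 3), (60, 10, 7)]
desc = spatial_bow_histogram(keypoints, width=100, height=200)
```

With 8 cells and 500 visual words the descriptor has 4000 dimensions. The same word (index 3) observed in two different cells lands in two different positions of the vector, which is how coarse layout information is retained.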
The overview of the image descriptor extraction process is
summarized in Fig. 1.
Note that D is the similarity matrix between e-mail spam
images, which is computed with each image feature (RGB
histogram, PHOW-gray, PHOW-RGB, and PHOW-opponent).
The scaling parameter σ controls how rapidly the affinity W
falls off with the distances in D.
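To make the role of the scaling parameter concrete, the sketch below assumes the Gaussian affinity of Ng et al. [9], W_ij = exp(-D_ij^2 / (2 sigma^2)) with a zero diagonal. This form is an assumption based on the cited reference, and the function name gaussian_affinity is hypothetical.

```python
import math


def gaussian_affinity(dist, sigma):
    """Turn a symmetric distance matrix into an affinity matrix.

    Assumes W_ij = exp(-D_ij^2 / (2 * sigma^2)) for i != j and
    W_ii = 0, following the form used by Ng et al. [9].
    """
    n = len(dist)
    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                w[i][j] = math.exp(-dist[i][j] ** 2 / (2.0 * sigma ** 2))
    return w


D = [[0.0, 0.5], [0.5, 0.0]]
w_wide = gaussian_affinity(D, sigma=0.5)
w_narrow = gaussian_affinity(D, sigma=0.3)
```

For the same distance, the smaller scale gives a smaller affinity, so a small scaling parameter keeps only very close images strongly connected in the graph.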
As a result, the performance of phone spam image
classification with each sub-graph is compared. The
best-performing sub-graph is used for spam image classification.
Fig. 1. The overview of extracting image descriptor
B. Database Construction for Training
To obtain a sub-group of e-mail spam images that is similar to
the phone spam images, k-means clustering and spectral
clustering are used. With each clustering method, a sub-group of
e-mail spam images is added to the phone images. As a result,
the phone images and the similar sub-group of e-mail spam images
are used as training data for learning our model. Additionally,
a randomly selected sub-group of e-mail spam images is used as a
baseline. The overall process is illustrated in Fig. 2.
1) K-means clustering: In k-means clustering, a distance
matrix between e-mail and phone spam images is obtained. The
distance between two images is the standard Euclidean distance.
By k-means clustering, the distance matrix is partitioned into
k mutually exclusive clusters [6]. Note that here the distance
values are used as features and the centroid of each cluster is
the mean of the Euclidean distances between images in the
cluster. The most visually similar sub-group is the cluster
that has the smallest centroid.
K-means is performed iteratively to find the optimal cluster
centroids. Although 100 iterations are run, convergence to the
optimal solution is not guaranteed. Thus, we used the k-means++
algorithm, which chooses initial center points that are far
apart from each other rather than uniformly at random [8].
In [8], k-means++ was shown to improve both the running time
and the quality of the clustering result.
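The selection rule above can be sketched in plain Python. The sketch assumes that each e-mail image is described by its vector of distances to the phone spam images, that k-means++ seeding draws each new center with probability proportional to its squared distance from the nearest existing center (as in [8]), and that "smallest centroid" means the centroid with the smallest mean distance. The function names and the toy data are hypothetical.

```python
import random


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_pp_init(points, k, rng):
    """k-means++ seeding: later centers favour points far from earlier ones."""
    centers = [rng.choice(points)]
    while len(centers) < k:
        d2 = [min(sq_dist(p, c) for c in centers) for p in points]
        if sum(d2) == 0:  # all points coincide with a center already
            centers.append(rng.choice(points))
        else:
            centers.append(rng.choices(points, weights=d2, k=1)[0])
    return centers


def kmeans(points, k, iters=100, seed=0):
    rng = random.Random(seed)
    centers = kmeans_pp_init(points, k, rng)
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest center for every point.
        labels = [min(range(k), key=lambda c: sq_dist(p, centers[c]))
                  for p in points]
        # Update step: mean of the members (keep the old center if empty).
        new_centers = []
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                new_centers.append([sum(col) / len(members)
                                    for col in zip(*members)])
            else:
                new_centers.append(centers[c])
        if new_centers == centers:
            break
        centers = new_centers
    return labels, centers


def most_similar_cluster(centers):
    """Cluster whose centroid has the smallest mean distance to phone spam."""
    return min(range(len(centers)),
               key=lambda c: sum(centers[c]) / len(centers[c]))


# Rows: e-mail images; columns: distances to two phone spam images.
features = [[0.10, 0.10], [0.10, 0.11], [0.11, 0.10],
            [0.90, 0.90], [0.90, 0.91], [0.91, 0.90]]
labels, centers = kmeans(features, k=2)
best = most_similar_cluster(centers)
selected = [i for i, l in enumerate(labels) if l == best]
```

In this toy example the first three e-mail images, which lie close to the phone spam images, form the selected sub-group that would be added to the training data.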
2) Spectral clustering: Spectral clustering is a standard
graph cut algorithm used for graph clustering [9]. To partition
the e-mail spam image graph G, the normalized cut algorithm is
used, which considers not only inter-cluster similarity but
also intra-cluster similarity [10]. We use the implementation
of the normalized cut algorithm in the publicly available
Spectral Clustering Toolbox [11].
Given the e-mail spam images, the spam image graph G = (V, E)
is constructed. Each e-mail spam image is taken as a node, and
the similarity distance between each pair of images is taken as
an edge weight. To compute the similarity between a pair of
e-mail spam images, the Euclidean distances between phone and
e-mail spam images are computed in advance. As shown in Fig. 2,
each e-mail spam image has a vector of similarity distances to
all the phone spam images. The similarity between each pair of
these vectors is then computed. As a result, the similarity
matrix between e-mail spam images is obtained.
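The two-stage construction of the similarity matrix can be sketched as follows. The text does not state how two distance vectors are compared, so the sketch assumes the Euclidean distance again; the helper names and the toy feature vectors are hypothetical.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def distance_vectors(email_feats, phone_feats):
    """Stage 1: one vector per e-mail image, holding its distances
    to every phone spam image."""
    return [[euclidean(e, p) for p in phone_feats] for e in email_feats]


def pairwise_matrix(vectors):
    """Stage 2: compare the distance vectors of every pair of e-mail
    images (Euclidean distance is assumed here)."""
    n = len(vectors)
    return [[euclidean(vectors[i], vectors[j]) for j in range(n)]
            for i in range(n)]


phone_feats = [[0.0, 0.0], [1.0, 0.0]]
email_feats = [[0.0, 0.0], [0.0, 0.0], [3.0, 4.0]]
vectors = distance_vectors(email_feats, phone_feats)
D = pairwise_matrix(vectors)
```

Two e-mail images that relate to the phone spam images in the same way get a zero entry in D, regardless of how they relate to other e-mail images. This is what ties the e-mail graph to the phone data.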
The affinity matrix in spectral clustering [9] is defined as
Fig. 2. Database construction for training with k-means clustering and
spectral clustering
3) Random: To demonstrate that the use of advanced clustering
techniques such as spectral clustering and k-means clustering
is indeed meaningful, we used randomly selected 10, [...]
e-mail spam images together with phone spam images for training
the predictive model. In this part, the RGB histogram is used
to describe an image, for comparison with the PHOW descriptor.
The result obtained when training with randomly selected images
and the RGB histogram feature is therefore compared with the
results obtained using k-means clustering and spectral
clustering with the RGB histogram and PHOW descriptors.
C. Image Classification on Phone Spam Data
The phone and e-mail spam image data constructed by each
clustering method is used for training our predictive model.
First, the feature vector (RGB histogram, PHOW-gray, PHOW-RGB,
or PHOW-opponent) of each input image is obtained. Trained with
phone and e-mail spam images, the predictive model finally
classifies each phone image as spam or non-spam. We trained an
SVM on the training data and validated our result on the
validation set. To process large image data efficiently, a
χ²-kernel SVM using the homogeneous kernel map is used [12].
The map transforms the data into a linear representation, so
that the non-linear χ²-kernel SVM can be computed. In [12], the
χ²-kernel SVM was shown to perform better than other kernels.
The soft margin parameter of the SVM is set to 10.
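To illustrate why a kernel map allows a linear SVM to stand in for a kernel SVM, the sketch below implements the closed-form approximate feature map of [12] for the additive χ² kernel k(x, y) = 2xy / (x + y). The choice of the χ² kernel, the order and sampling period values, and the function names are assumptions for this sketch; library implementations such as VLFeat's may use different defaults and additional windowing.

```python
import math


def chi2_feature_map(x, order=3, period=0.5):
    """Map one non-negative scalar to 2 * order + 1 features whose dot
    products approximate the chi-squared kernel 2xy / (x + y)."""
    if x <= 0.0:
        return [0.0] * (2 * order + 1)

    def kappa(lam):
        # Spectrum of the chi-squared kernel: sech(pi * lambda).
        return 1.0 / math.cosh(math.pi * lam)

    log_x = math.log(x)
    out = [math.sqrt(x * period * kappa(0.0))]
    for j in range(1, order + 1):
        amp = math.sqrt(2.0 * x * period * kappa(j * period))
        out.append(amp * math.cos(j * period * log_x))
        out.append(amp * math.sin(j * period * log_x))
    return out


def map_histogram(hist, order=3, period=0.5):
    """Apply the scalar map to every histogram bin and concatenate."""
    feats = []
    for v in hist:
        feats.extend(chi2_feature_map(v, order, period))
    return feats


def chi2_kernel(a, b):
    """Exact additive chi-squared kernel between two histograms."""
    return sum(2.0 * x * y / (x + y) for x, y in zip(a, b) if x + y > 0)


a = [0.2, 0.5, 0.3]
b = [0.4, 0.4, 0.2]
exact = chi2_kernel(a, b)
approx = sum(u * v for u, v in zip(map_histogram(a), map_histogram(b)))
```

The plain dot product of the mapped vectors stays close to the exact kernel value, so a linear SVM trained on the mapped features behaves approximately like a χ²-kernel SVM while keeping linear training cost.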
D. 5-fold Cross Validation and Evaluation
To determine the optimal parameter in spectral clustering and
to prevent over-fitting to the training data, we evaluated our
results with 5-fold cross validation. Note that we train our
model with phone images and similar e-mail spam images and
classify phone images as spam or non-spam. As shown in Fig. 3,
the e-mail and phone image data are each divided into 5 folds.
Four folds of phone and e-mail images are used as the training
set, and the remaining fold of phone images is used as the
validation set. Namely, 80% of the phone and e-mail images are
used for training and 20% of the phone images are used for
testing our model in each run.
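The asymmetric split described above (training on phone and e-mail folds, validating on phone images only) can be sketched as follows. Round-robin fold assignment without shuffling or class stratification is an assumption of this sketch, and the function names are hypothetical.

```python
def make_folds(n_items, n_folds=5):
    """Assign item indices to folds in round-robin order."""
    return [list(range(f, n_items, n_folds)) for f in range(n_folds)]


def cv_splits(n_phone, n_email, n_folds=5):
    """Yield (train_phone, train_email, val_phone) index lists.

    Each run trains on n_folds - 1 folds of both datasets and validates
    on the held-out fold of phone images only. The held-out e-mail fold
    is simply left unused in that run.
    """
    phone_folds = make_folds(n_phone, n_folds)
    email_folds = make_folds(n_email, n_folds)
    for held_out in range(n_folds):
        train_phone = [i for f in range(n_folds) if f != held_out
                       for i in phone_folds[f]]
        train_email = [i for f in range(n_folds) if f != held_out
                       for i in email_folds[f]]
        yield train_phone, train_email, phone_folds[held_out]


# Example sizes: 471 phone images and a 20-image e-mail sub-group.
splits = list(cv_splits(n_phone=471, n_email=20))
```

Every phone image is used for validation exactly once across the five runs, while e-mail images only ever contribute to training.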
Fig. 3. 5-fold cross validation on phone image data trained with phone and e-mail image datasets together
III. RESULTS
A. Dataset
Image Spam Hunter [13] is a publicly available dataset of
image attachments in e-mail, which contains 929 e-mail spam
images. We used a similar sub-set of the 929 e-mail spam images,
obtained by k-means clustering, by spectral clustering, or by
random selection. Table I shows the size of the spam and
non-spam data when the data is clustered by spectral clustering
with each image descriptor, which yields better performance
than k-means clustering.
TABLE I.
DATASET SIZE USED IN SPECTRAL CLUSTERING
                              Phone   E-mail   Total
Spam       RGB histogram        66       12      78
           PHOW-gray            66      201     267
           PHOW-RGB             66       20      86
           PHOW-opponent        66      324     390
Non-spam                       405        -     405
B. Performance Comparison in 5-fold Cross Validation
We performed 5-fold cross validation to evaluate the results.
First, we visually examined how many clusters the data are
partitioned into. The heat-map of the affinity matrix in
spectral clustering is shown for different values of the
parameter σ. We used the RGB histogram, PHOW-gray, PHOW-RGB,
and PHOW-opponent image descriptors. The results of spectral
clustering across σ and of k-means clustering are compared with
the result using a randomly selected image subset as a
baseline. We evaluated our results in terms of accuracy,
sensitivity, specificity, and F-score. The results of spectral
clustering and the overall performance are shown as heat-maps
and as plots of accuracy, sensitivity, specificity, and F-score
in Figs. 4 and 5, as explained below.
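For reference, the four evaluation measures can be computed from the confusion counts as in the sketch below. It treats spam as the positive class and assumes that "F-score" means the balanced F1 score, which the text does not state explicitly; the function name evaluate is hypothetical.

```python
def evaluate(y_true, y_pred):
    """Accuracy, sensitivity, specificity and F1 for binary labels
    (1 = spam, 0 = non-spam)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    def ratio(num, den):
        return num / den if den else 0.0

    sensitivity = ratio(tp, tp + fn)   # true positive rate (recall)
    specificity = ratio(tn, tn + fp)   # true negative rate
    precision = ratio(tp, tp + fp)
    return {
        "accuracy": ratio(tp + tn, len(y_true)),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f_score": ratio(2 * precision * sensitivity,
                         precision + sensitivity),
    }


y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
scores = evaluate(y_true, y_pred)
```

When non-spam images dominate the data, accuracy and specificity can remain high even if many spam images are missed, which is why sensitivity and the F-score are reported alongside them.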
1) Random: The performance using randomly selected images is
shown as a green dotted line in Figs. 4 and 5. As a baseline,
the RGB histogram is used as the image descriptor. The
accuracy, sensitivity, specificity, and F-score are shown for
the case where a randomly selected 10% of the e-mail spam
images is used, which had the highest F-score.
In this result, because only the color distribution of images
is considered during training, the model tends to classify any
image, spam or non-spam, as non-spam. As there are more
non-spam images than spam images in the training data, the
model is more likely to learn the color distribution of
non-spam images. Thus, sensitivity (true positive rate) is
lower than accuracy and specificity.
2) RGB histogram: The result of spectral clustering as
heatmaps and the performance using the RGB histogram feature
are shown in Fig. 4. The overall performance of k-means
clustering or spectral clustering is better than that of a
randomly selected subset of images. For spectral clustering,
the result varies with the value of the parameter σ. As shown
in the heatmap, the data is likely to be clustered better when
σ is between approximately 0.5 and 0.7. Although the F-score of
the spectral clustering result is similar to that of the
k-means clustering result, the sensitivity improves when σ is
around 0.7. In contrast to the sensitivity, the specificity is
best when σ is 0.4. Given that the F-score is quite low
regardless of the clustering method, the overall performance is
influenced more by the choice of image feature than by the
clustering method. Nonetheless, the result shows that using a
similar sub-group of images can improve the performance.
Fig. 4. Heatmap of affinity matrix in spectral clustering with respect to different σ (upper), and the performance comparison of spectral clustering, k-means
clustering and the randomized method in 5-fold cross validation (lower). Affinity matrix is computed with RGB histogram feature
Fig. 5. Heatmap of affinity matrix in spectral clustering with respect to different σ (upper), and the performance comparison of spectral clustering, k-means
clustering and the randomized method in 5-fold cross validation (lower). Affinity matrix is computed with PHOW features (A: gray mode, B: RGB mode, C:
opponent mode)
3) PHOW (gray, RGB, opponent): The performance of the PHOW
descriptor is shown in Fig. 5. When the PHOW descriptor is
used, performance is on the whole significantly higher than
with the RGB histogram.
As the PHOW descriptor considers both geometric information
and color distribution, images can be distinguished more
precisely. For example, some spam images in Fig. 6 contain a
lot of text but differ in color distribution. Many e-mail spam
images contain only text in different colors and scales,
because they are designed to look like actual e-mail and to
evade the spam filtering system. Those images should be grouped
together, but they are assigned to different groups when the
RGB histogram feature is used. However, they are assigned to
the same group when the PHOW descriptor is used. This shows
that the performance is greatly improved when both geometric
and color information are used.
Tables II and III show the best accuracy, sensitivity,
specificity, and F-score for k-means clustering and spectral
clustering, respectively, with respect to each image feature.
The best performance of all features is obtained with the PHOW
descriptor in RGB mode. In spectral clustering, the result is
better when σ is large, except in RGB mode. Though the
improvements are quite marginal, we find that a PHOW descriptor
that considers the color distribution rather than only gray
levels is also meaningful.
TABLE II.
Fig. 6. Sample spam images which are correctly grouped in the same
cluster with the PHOW descriptor, but in a different one with RGB
histogram feature.
Additionally, the PHOW descriptors in the three color modes
(gray, RGB, and opponent) are compared. The performance of
spectral clustering is similar to or slightly better than that
of k-means clustering. Though the specificity (true negative
rate) and overall performance do not differ across clustering
methods, the sensitivity of spectral clustering varies
depending on the parameter σ. As the clustered e-mail spam data
from spectral clustering are added to the model training, the
true positive rate can be affected by the clustering result. As
the PHOW descriptor considers many more features than the RGB
histogram, the performance is improved considerably.
C. Averaged Performance Comparison in Optimal Parameter
The best performance of each image descriptor when using
k-means clustering and spectral clustering is compared in
Fig. 7. The best F-score is used for evaluation. As shown in
Fig. 7, the performance when trained with k-means clustering
and with spectral clustering is almost the same. Regardless of
the color mode, PHOW descriptors perform much better than the
RGB histogram. It shows that the color mode of PHOW descriptor
[...]. Therefore, we show that considering color distribution
together with geometric information has a bigger impact on the
overall performance than the color variance.
Fig. 7. Best performance comparison on RGB histogram, PHOW-gray,
PHOW-RGB, PHOW-opponent feature in k-means clustering, spectral
clustering (red dotted line: best performance when training with
randomly selected images)
BEST PERFORMANCE COMPARISON WITH RESPECT TO
IMAGE DESCRIPTORS IN K-MEANS CLUSTERING
              RGB         PHOW     PHOW     PHOW         Random
              histogram   (gray)   (RGB)    (opponent)   (10%)
Accuracy      73.47%      95.12%   95.54%   94.27%       72.25%
Sensitivity   42.42%      92.42%   92.42%   87.91%       32.03%
Specificity   78.52%      95.56%   96.05%   95.31%       78.81%
F-score       30.73%      84.19%   85.49%   81.15%       24.14%
TABLE III.
BEST PERFORMANCE COMPARISON WITH RESPECT TO
IMAGE DESCRIPTORS IN SPECTRAL CLUSTERING
              RGB         PHOW     PHOW     PHOW         Random
              histogram   (gray)   (RGB)    (opponent)   (10%)
Accuracy      81.75%      96.39%   96.82%   96.39%       72.25%
Sensitivity   30.55%      95.45%   87.91%   84.95%       32.03%
Specificity   90.12%      96.54%   98.27%   98.27%       78.81%
F-score       32.31%      88.28%   88.48%   86.76%       24.14%
D. Misclassified Samples in Best-performing Cluster
Sample false positives and false negatives in the validation
set, obtained when training with the best-performing cluster,
are shown in Fig. 8. Note that the best-performing cluster is
obtained by spectral clustering when σ is 0.3 with the
PHOW (RGB) descriptor.
The images in Fig. 8(a) are legitimate images that are
classified as spam. They include mobile coupons that the user
actually asked for. Those coupons share many visual features
with actual spam images. This is the reason why sensitivity is
generally lower than specificity. Also, a user may send another
user captured or saved images from the web that carry necessary
information. Those images contain a lot of text and look like
e-mail spam images. The false negatives in Fig. 8(b) also look
visually similar to the mobile coupons in Fig. 8(a). As these
examples show, the criteria for classifying an image as spam
are quite subjective: some images on a mobile phone are
considered spam by some users but non-spam by others.
Fig. 8. Examples of misclassified images
IV. CONCLUSION
We proposed a mobile phone spam image filtering system using a
large set of e-mail spam images. In [1], we recently showed
that e-mail spam image data is quite useful for phone spam
image classification. In this paper, we demonstrate that using
a similar sub-graph of e-mail spam images obtained by a graph
partitioning algorithm yields the desired performance, as does
the k-means clustering algorithm. Additionally, phone spam
image classification performance is compared for the RGB
histogram and for the PHOW descriptor in gray, RGB, and
opponent color modes, in order to consider the color
distribution of an image, its geometric information, and both
together.
The results showed that the PHOW descriptor in RGB mode, which
uses geometric and RGB color information, performs the best on
phone spam image classification. This shows that considering
both geometric and color information can improve spam image
classification performance. Also, a sophisticated clustering
technique has a positive impact. If the size of the phone image
data for validation grows, the improvements are expected to
become more pronounced. Furthermore, the approach can be
applied to data from other domains that face a similar data
insufficiency problem.
ACKNOWLEDGMENT
This research was supported by the Basic Science Research
Program through the National Research Foundation (NRF) of Korea
funded by the Ministry of Science, ICT, and Future Planning
(MSIP) (2014R1A1A3051169).
REFERENCES
[1] K. So Yeon, B. Yenewondim, and S. Kyung-Ah, "Investigating
    the Effectiveness of E-mail Spam Image Data for Phone Spam
    Image Detection Using Scale Invariant Feature Transform
    Image Descriptor," Information Science and Applications,
    LNEE 339, pp. 591-598, 2015.
[2] A. Bosch, A. Zisserman, and X. Munoz, "Image classification
    using random forests and ferns," in Computer Vision, 2007.
    ICCV 2007. IEEE 11th International Conference on, pp. 1-8,
    2007.
[3] A. Vedaldi and B. Fulkerson, "VLFeat: An open and portable
    library of computer vision algorithms," in Proceedings of the
    international conference on Multimedia, pp. 1469-1472, 2010.
[4] S. Lazebnik, C. Schmid, and J. Ponce, "Beyond bags of features:
    Spatial pyramid matching for recognizing natural scene
    categories," in Computer Vision and Pattern Recognition, 2006
    IEEE Computer Society Conference on, vol. 2, pp. 2169-2178,
    2006.
[5] K. E. Van De Sande, T. Gevers, and C. G. Snoek, "Evaluating
    color descriptors for object and scene recognition," Pattern
    Analysis and Machine Intelligence, IEEE Transactions on, vol.
    32(9), pp. 1582-1596, 2010.
[6] C. Elkan, "Using the triangle inequality to accelerate k-means,"
    in ICML, vol. 3, pp. 147-153, 2003.
[7] M. Muja and D. G. Lowe, "Fast Approximate Nearest
    Neighbors with Automatic Algorithm Configuration," in
    VISAPP (1), pp. 331-340, 2009.
[8] D. Arthur and S. Vassilvitskii, "k-means++: The advantages of
    careful seeding," in Proceedings of the eighteenth annual
    ACM-SIAM symposium on Discrete algorithms, pp. 1027-1035, 2007.
[9] A. Y. Ng, M. I. Jordan, and Y. Weiss, "On spectral clustering:
    Analysis and an algorithm," Advances in neural information
    processing systems, vol. 2, pp. 849-856, 2002.
[10] J. Shi and J. Malik, "Normalized cuts and image segmentation,"
    Pattern Analysis and Machine Intelligence, IEEE Transactions
    on, vol. 22, pp. 888-905, 2000.
[11] S. Agarwal, "Spectral Clustering Toolbox," Available:
    http://vision.ucsd.edu/~sagarwal/clustering.html, 2002.
[12] A. Vedaldi and A. Zisserman, "Efficient additive kernels via
    explicit feature maps," Pattern Analysis and Machine
    Intelligence, IEEE Transactions on, vol. 34(3), pp. 480-492,
    2012.
[13] Y. Gao, M. Yang, X. Zhao, B. Pardo, Y. Wu, T. N. Pappas, et
    al., "Image spam hunter," in Acoustics, Speech and Signal
    Processing, 2008. ICASSP 2008. IEEE International
    Conference on, pp. 1765-1768, 2008.