Fusion of Text and Image Features: A New
Approach to Image Spam Filtering
Congfu Xu1, Kevin Chiew2, Yafang Chen1, and Juxin Liu1
1 Institute of Artificial Intelligence, Zhejiang University, Hangzhou, China
2 School of Engineering, Tan Tao University, Long An, Vietnam
Abstract. While enjoying the convenience of email communication,
many users have also experienced annoying email spam. Although current spam detection approaches have gained a competitive edge against
text-based email spam, they still face the challenge arising from image-based spam (image spam in short). Image spam normally includes embedded images that carry the spam message in binary format rather than
text format, and costs more storage and bandwidth resources. In this paper, we propose a hybrid image spam filtering framework that detects spam
images based on both extracted text and image features. Our experimental results show that our approach achieves a significant improvement
in detection accuracy as compared with other methods that use only
text features or only image features, and that it works robustly in the presence of
complex backgrounds or compression artifacts.
1 Introduction
Nowadays one of the most pervasive applications of the Internet is the email service, which has brought great convenience to our communications. While enjoying
the facilities of the email service, users are also facing a large amount of annoying email
spam. Email spam, whose volume has reportedly grown tremendously in the past
few years, has also decreased the quality of the email service. This is
partly because email spam consumes storage and communication
bandwidth. Moreover, a recent news report1 cites a research result that spam
produces millions of tons of CO2 globally every year.
Many solutions have been proposed for detecting and filtering spam emails to prevent
them from being received, forwarded, and spread. The basic technique behind these
solutions is to train classifiers that distinguish spam from ham (i.e., legitimate, non-spam) content. These classifiers normally use two types of rules: (a) rules based
on connection and relay properties of emails, and (b) rules using features
extracted from the contents of emails. The second type of rules, which carry out
content filtering by using machine learning mechanisms such as Naive Bayes
classification or support vector machines (SVM), have been a cornerstone of
anti-spam systems [16] and have shown the advantage of high accuracy.
However, there is now a new attack that could be devastating to content filters. Instead of obscuring the message's text, spammers are now able to
1 See http://news.bbc.co.uk/2/hi/technology/8001749.stm
Y. Wang and T. Li (Eds.): Practical Applications of Intelligent Systems, AISC 124, pp. 129–140.
© Springer-Verlag Berlin Heidelberg 2011
springerlink.com
Fig. 1. Examples of spam images (note the large amount of text and the use of
text obfuscation techniques against OCR)
defeat text analysis techniques by replacing text with images. A whitepaper released in November 2006 [17] shows the rise of image spam from 10% in April
to 27% of all email spam in October 2006, totaling up to 48 billion emails every
day. A possible way to detect image spam is to use a pipeline consisting of an optical character recognition (OCR) system, which extracts and recognizes embedded text,
followed by a text classifier that separates spam from legitimate content. This
approach was found to be effective for clean images [8]. However, image
spam allows spammers to design spam as CAPTCHAs (see the right part
of Figure 1) or to obscure the image text in order to defeat OCR tools. Thus, if an image
spam filter relies on an OCR-based module as its only countermeasure against spam, it is vulnerable to image spam with obfuscated text.
In this paper, we propose a solution for image spam filtering. Since most
spam images contain large proportions of text as shown in Figure 1, our solution
first extracts the text information embedded in images, together with the image
information that can be identified by the unique properties [14] of spam images
as compared with those of natural scene images or generic computer-generated
graphic images. We then use a combinational filter with a two-layer structure for
training and classification, in which the bottom-layer classifiers obtain the image
spam confidence scores from the two types of features, and a top-layer classifier
makes the final decision by using the outputs of the bottom-layer classifiers.
The remaining sections of the paper are organized as follows. In
Section 2 we review the related work on filtering techniques for content-based image spam, after which in Section 3 we introduce the framework of
image spam filtering in detail. In Section 4, we report experimental results on
real data sets of ham and spam images, and we conclude the paper in Section 5.
2 Related Work
The detection of image spam is a special case of image categorization, which is
addressed as a two-class classification task between ham and spam images in [1, 6, 8]
and has been extensively studied in the context of many important applications.
In [1], Aradhye et al. used a support vector classifier to extract the text
regions in an image, after which they identified five visual features of
spam. The first feature is the relative area of the image occupied by text. It is
used with the underlying idea that spam images usually contain more text than
legitimate images. The other features, such as color heterogeneity and saturation,
are computed over text and non-text regions, based on the assumption that images
that are mainly synthetic are more likely to be spam.
Based on the method in [1], Dredze et al. [6] proposed to use different kinds of
features. Although some visual features are used (such as average RGB colors, the
relative area occupied by the most common color, and color saturation features
as in [1]), the most important role is played by metadata extracted from the
images. They also introduced a feature selection algorithm (JIT) that selects the
most discriminative features based on their speed as well as their predictive power.
Fumera et al. [8] proposed an approach to anti-spam filtering which exploits
the text information embedded into images sent as attachments. This approach is
based on the consideration that text embedded into images plays the same role as
text in the body of emails without images (i.e., it conveys the spam messages).
After extracting text with OCR tools from images attached to emails, they
carried out the semantic analysis of text using text categorization techniques
like the ones applied to the body of the email without images.
A method [4] is presented to recognize image spam based on detecting the
presence of content obscuring techniques which aim to compromise the OCR
effectiveness. The implementation is based on two low-level image features aimed
at measuring the extent of character breaking or the presence of small noise
components, and the presence of merged characters or large noise components.
Nhung and Phuong used simple edge-based features [16] to compute a vector
of similarity scores between an image and a set of templates. This similarity
vector is then used with an SVM to separate spam images from other common
categories of images. In [11] specific features are selected for inspection by the
components-based method, and then the spam-filter system uses these features
to identify image spam by feature matching.
3 Hybrid Framework for Image Spam Filtering
Since content obscuring techniques can defeat the attempts of using OCR
tools [8] to detect text embedded in images, we
propose an image categorization approach that uses both text and image
features to filter such image spam. Figure 2 shows the proposed hybrid framework for image spam filtering.
The framework works in three phases. Firstly, we calculate the text features of
an image attached to an input email. This work includes keyword detection and text-related
feature extraction. We then use an SVM to obtain the image spam confidence
score. Secondly, we define a small number of reliable spam-indicative features
from the image metadata and image color properties, and then use a second SVM
to classify the image. Lastly, we use a fusion classifier to make a decision
based on the outputs of both the text and image classifiers.
An example of a spam image is shown in Figure 3. The image classifier of our
framework identifies this image as a ham image with a confidence score of 0.0422283,
whereas the text classifier identifies it as a spam image with a confidence
score of 0.659779. The image is finally identified as a spam image after fusion
Fig. 2. Architecture of our hybrid framework for image spam filtering
Fig. 3. An example of spam image
of both confidence scores. The functions of major components are introduced as
follows.
3.1 Keyword Detection
Semantic analysis of text embedded in images first requires text extraction by
techniques such as OCR, which raises the following two issues: (a) high
computational complexity and (b) susceptibility to content obscuring techniques.
For the first issue, it is possible to reduce the computational complexity by using
a hierarchical architecture for the spam filter, in which text extraction and analysis are
carried out only if the preceding, less complex modules are unable to reliably
decide whether an email is legitimate or not. To further reduce computational
complexity, techniques based on image signatures could be employed.
For the second issue, since embedded text extraction is often inaccurate, we
use keyword detection to improve classification accuracy. We first define a keyword set composed of thirty words and five phrases. Then, for every image,
we calculate a feature indicating whether at least one element of the keyword
set is detected in the text extracted by an OCR system. OCR on
images attached to emails is performed with the demo version of the commercial
software ABBYY FineReader 8.0 Professional with default parameter settings.
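The binary keyword feature can be sketched as follows. This is a minimal illustration, not the authors' implementation: the keyword set shown is hypothetical (the paper does not list its thirty words and five phrases), and the function names `normalize` and `keyword_feature` are introduced here only for the example. The OCR output is assumed to be available as a plain string.

```python
import re

# Hypothetical keyword set for illustration only; the paper's actual
# thirty words and five phrases are not listed in the text.
KEYWORDS = {"viagra", "mortgage", "stock", "buy now", "limited offer"}

def normalize(text):
    """Lowercase and collapse every run of non-alphanumeric characters to one space."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()

def keyword_feature(ocr_text, keywords=KEYWORDS):
    """Return 1 if at least one keyword or phrase occurs as whole words in the OCR text, else 0."""
    padded = " " + normalize(ocr_text) + " "
    return int(any(" " + normalize(k) + " " in padded for k in keywords))
```

Matching on whole words after normalization keeps the feature tolerant of punctuation noise in the OCR output, while avoiding matches inside longer words.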
3.2 Text-Related Features Extraction
The text-related features detect the properties of text in an image. The text
regions in the image are firstly extracted. A subsequent step defines some features
from the image by using the extracted text regions. Our method of text region
extraction comprises the following three main steps.
Step 1: Edge detection. A convolution operation with a compass operator [12]
is used to generate intensity images of four oriented edges at 0°,
45°, 90°, and 135° orientations respectively. Color images are first
converted into grayscale images.
Step 2: Feature generation. We first subdivide an image into a grid of w × h
equally sized cells Cij where i = 1, . . . , w and j = 1, . . . , h (each cell is as big
as 10 × 10 pixels in this work), and then compute the six features over all
cells. These six features, namely mean μ, standard deviation σ, energy Eg ,
entropy Et , inertial-quadrature I, and local homogeneity H, are defined by
the following Equations (1) to (6) [5, 9]:
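\[ \mu = \frac{1}{w \times h} \sum_{i=1}^{w} \sum_{j=1}^{h} E(i, j) \tag{1} \]
\[ \sigma = \sqrt{\frac{1}{w \times h} \sum_{i=1}^{w} \sum_{j=1}^{h} [E(i, j) - \mu]^2} \tag{2} \]
\[ E_g = \sum_{i,j} E^2(i, j) \tag{3} \]
\[ E_t = \sum_{i,j} E(i, j) \log E(i, j) \tag{4} \]
\[ I = \sum_{i,j} (i - j)^2 E(i, j) \tag{5} \]
\[ H = \sum_{i,j} \frac{1}{1 + (i - j)^2} E(i, j) \tag{6} \]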
in which E(i, j) is the normalized symmetrical gray level co-occurrence matrix (GLCM) of cell Cij [10].
Step 3: Text region detection. We first use the K-means clustering based on
the above features to obtain the text areas and background areas, and then
refine the text region by morphological dilation and erosion.
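The per-cell features of Equations (1) to (6) can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions, not the authors' code: gray values are assumed to be already quantized to `levels` integer levels, the co-occurrence offset defaults to one pixel to the right, and the mean and standard deviation are normalized by the number of matrix entries. The function names `glcm` and `cell_features` are hypothetical. The entropy term follows the sign convention of Equation (4) as printed, which is the negative of the conventional entropy.

```python
import math

def glcm(cell, levels, dx=1, dy=0):
    """Normalized symmetric gray-level co-occurrence matrix of a 2-D list of ints in [0, levels)."""
    counts = [[0.0] * levels for _ in range(levels)]
    rows, cols = len(cell), len(cell[0])
    for y in range(rows):
        for x in range(cols):
            y2, x2 = y + dy, x + dx
            if 0 <= y2 < rows and 0 <= x2 < cols:
                a, b = cell[y][x], cell[y2][x2]
                counts[a][b] += 1  # count the pair in both directions
                counts[b][a] += 1  # so that the matrix is symmetric
    total = sum(map(sum, counts))
    return [[c / total for c in row] for row in counts] if total else counts

def cell_features(E):
    """Return (mean, std, energy, entropy, inertia, homogeneity) of a square matrix E."""
    n = len(E)
    entries = [(i, j, E[i][j]) for i in range(n) for j in range(n)]
    mean = sum(e for _, _, e in entries) / (n * n)
    std = math.sqrt(sum((e - mean) ** 2 for _, _, e in entries) / (n * n))
    energy = sum(e * e for _, _, e in entries)
    # Zero entries are skipped, using the convention 0 * log 0 = 0.
    entropy = sum(e * math.log(e) for _, _, e in entries if e > 0)
    inertia = sum((i - j) ** 2 * e for i, j, e in entries)
    homogeneity = sum(e / (1 + (i - j) ** 2) for i, j, e in entries)
    return mean, std, energy, entropy, inertia, homogeneity
```

The six-value tuples of all cells could then be passed to any K-means implementation with two clusters to separate text cells from background cells, as described in Step 3.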
Figure 4 illustrates the process of text region detection. Based on the extracted
text regions, we calculate the following simple features that are most indicative
of spam images: (1) Extent of text regions. The extent of text in the image is
defined as the ratio of the area of the extracted text regions to the
total area of the image; (2) Amount of text regions; and (3) Amount of text
letters.
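Features (1) and (2) can be sketched from a binary text mask as follows. This is a minimal sketch under assumptions that are not stated in the paper: the text regions are given as a 0/1 pixel mask, area is measured in pixels, and a region is a 4-connected component. The function name `text_region_features` is hypothetical. Feature (3) depends on the OCR or character segmentation output and is not covered here.

```python
from collections import deque

def text_region_features(mask):
    """mask: 2-D list of 0/1 where 1 marks a text pixel.

    Returns (extent, region_count): the fraction of the image covered by
    text pixels and the number of 4-connected text regions.
    """
    rows, cols = len(mask), len(mask[0])
    seen = [[False] * cols for _ in range(rows)]
    text_pixels = 0
    regions = 0
    for r in range(rows):
        for c in range(cols):
            if mask[r][c] and not seen[r][c]:
                regions += 1  # found an unvisited region; flood-fill it
                seen[r][c] = True
                queue = deque([(r, c)])
                while queue:
                    y, x = queue.popleft()
                    text_pixels += 1
                    for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                        if (0 <= ny < rows and 0 <= nx < cols
                                and mask[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
    return text_pixels / (rows * cols), regions
```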
Fig. 4. Illustration of the process of text region detection: (a) initial picture; (b) candidate of text region; (c) after erosion operation; (d) after dilation operation; (e) final result; (f) labeled by pane
Text may be inherently present in natural scene images in the form of
road signs, building names, company names, and so on, and synthetic images
may also include text. Nevertheless, the text features defined above are
intuitively expected to discriminate between spam images and non-spam
images. Figure 5 shows the distributions of features 1 and 3, from which we
can see that spam images and non-spam images fall in different
ranges. For feature 1, more than 40% of ham images fall in the range of
0 to 0.1, and more than 80% of spam images in the range of 0.2 to 0.6; whereas
for feature 3, ham images are concentrated in the range of 0 to 6, and spam
images in the range of 6 to 60.
Following [3], we also use three features to detect the presence of content
obscuring. The idea is to measure the perimetric complexity, a measure used in the
literature on the psychophysics of reading, and the aspect ratio (the ratio between width
and height). The perimetric complexity is defined as the squared length of the
boundary between black and white pixels in the whole image, divided by the
black area.
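The definition of perimetric complexity above can be sketched directly on a binarized image. This is an illustrative sketch, assuming that the image is a 2-D list with 1 for black and 0 for white, that the boundary length is counted in unit pixel edges, and that everything outside the image is white; the function name `perimetric_complexity` is hypothetical.

```python
def perimetric_complexity(binary):
    """Squared black/white boundary length divided by the black area.

    binary: 2-D list with 1 for black pixels and 0 for white pixels.
    Pixels outside the image are treated as white (an assumption).
    """
    rows, cols = len(binary), len(binary[0])

    def is_black(y, x):
        return 0 <= y < rows and 0 <= x < cols and binary[y][x] == 1

    black_area = 0
    perimeter = 0
    for y in range(rows):
        for x in range(cols):
            if binary[y][x] == 1:
                black_area += 1
                # Each side of a black pixel that faces a white pixel
                # contributes one unit of boundary length.
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if not is_black(ny, nx):
                        perimeter += 1
    if black_area == 0:
        return 0.0
    return perimeter ** 2 / black_area
```

Under these assumptions any solid square scores 16 regardless of its size, whereas broken or noisy characters, which have long boundaries relative to their area, score higher.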
Fig. 5. Feature distributions in all images (percentage of ham images and spam images per bin): (a) distribution of extent of text regions (feature 1), with bins 0−0.1, 0.1−0.2, 0.2−0.6, and more than 0.6; (b) distribution of amount of text letters (feature 3), with bins 0−6, 6−16, 16−60, and more than 60
3.3 Image Features Extraction
Our first group of image features relies on the following metadata: (1) File format. The file format of an image includes its extension, the actual file format (as
identified by metadata) and whether they match with each other; and (2) Image
metadata. We extract 10 features that are contained in the image metadata,
including whether the image has comments, bits per pixel, number of bands,
progressive flag, sample precision, transparent color, approx high, index value,
logical height and width.
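The file format feature can be sketched as a comparison between the file extension and the format detected from the file content. The sketch below is a swapped-in simplification, not the paper's extractor: it detects the actual format from the leading signature bytes of a few common image formats instead of a full metadata parser, and the names `MAGIC_NUMBERS`, `EXTENSION_ALIASES`, and `file_format_features` are hypothetical.

```python
import os

# Leading signature bytes of a few common image formats.
MAGIC_NUMBERS = [
    (b"\xff\xd8\xff", "jpeg"),
    (b"\x89PNG\r\n\x1a\n", "png"),
    (b"GIF87a", "gif"),
    (b"GIF89a", "gif"),
    (b"BM", "bmp"),
]

# Extensions that name the same format.
EXTENSION_ALIASES = {"jpg": "jpeg", "jpe": "jpeg"}

def file_format_features(filename, header):
    """Return (extension_format, actual_format, formats_match).

    filename: name of the attachment; header: its first few bytes.
    """
    ext = os.path.splitext(filename)[1].lstrip(".").lower()
    ext = EXTENSION_ALIASES.get(ext, ext)
    actual = next(
        (fmt for magic, fmt in MAGIC_NUMBERS if header.startswith(magic)),
        "unknown",
    )
    return ext, actual, ext == actual
```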
The rest of our image features are based on the following color properties: (1)
Color saturation. As defined by Frankel et al. [7], color saturation is quantified
as the fraction of the total number of pixels in the image for which the difference
max(R, G, B) − min(R, G, B) is greater than a predefined threshold; (2) Color
histogram. The color histogram is a compact summary of the image, and
legitimate images typically convey a much larger number of colors than spam
images. We chose a 6-bit color space, leading to 64 features; and (3) Color
moments. The use of color moments is based on the assumption that the distribution of color in an image can be interpreted as a probability distribution. The
color distribution of spam images is usually not continuous since they are synthetic.
In our study, we use the following three moments of an image's color distribution, namely mean, standard deviation, and skewness. Using the RGB channels
and three moments for each channel, we obtain nine features.
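The three groups of color features can be sketched as follows. This is an illustration under assumptions the paper does not fix: pixels are 8-bit `(R, G, B)` tuples, the saturation threshold of 50 is a placeholder for the unspecified predefined threshold, the 6-bit color space is obtained by keeping the top two bits of each channel, the histogram is normalized to fractions, and skewness is taken as the cube root of the third central moment. All function names are hypothetical.

```python
import math

def color_saturation(pixels, threshold=50):
    """Fraction of pixels whose max(R, G, B) - min(R, G, B) exceeds the threshold."""
    saturated = sum(1 for r, g, b in pixels if max(r, g, b) - min(r, g, b) > threshold)
    return saturated / len(pixels)

def color_histogram(pixels):
    """64-bin histogram: the top 2 bits of each 8-bit channel form a 6-bit color index."""
    hist = [0.0] * 64
    for r, g, b in pixels:
        hist[((r >> 6) << 4) | ((g >> 6) << 2) | (b >> 6)] += 1
    return [count / len(pixels) for count in hist]

def color_moments(pixels):
    """Mean, standard deviation, and skewness for each of the R, G, B channels (9 values)."""
    moments = []
    for channel in zip(*pixels):
        n = len(channel)
        mean = sum(channel) / n
        std = math.sqrt(sum((v - mean) ** 2 for v in channel) / n)
        third = sum((v - mean) ** 3 for v in channel) / n
        # Signed cube root, so that negative third moments stay real-valued.
        skew = math.copysign(abs(third) ** (1.0 / 3.0), third)
        moments.extend([mean, std, skew])
    return moments
```

Concatenating the outputs gives 1 + 64 + 9 = 74 color features per image.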
Figure 6 shows several ham and spam images and Figure 7 shows their color
saturation, from which we can see that spam images are generally more saturated
as compared with images of natural scenes.
3.4 Bottom-Layer Classifiers
Some significant advantages of an SVM, such as excellent generalization ability through the maximum margin approach, the absence of local minima, and the
sparse representation of the solution, are the major reasons for using an SVM as a
Fig. 6. Three ham images and two spam images: (a) ham image 1; (b) ham image 2; (c) ham image 3; (d) spam image 1; (e) spam image 2
Fig. 7. Color saturation of images in Figure 6
powerful model in classification tasks. Both the text classifier and the image classifier use SVMs to classify their respective feature sets, and output
spam confidence scores that serve as the inputs of classifier fusion for the final decision.
The kernel trick is another important factor in the success of SVMs. The polynomial kernel, the radial basis function (RBF) kernel, and the sigmoid kernel are three
typical kernels. In our study, LIBSVM2 is adopted and RBF is used as the kernel function since the corresponding Hilbert space is of infinite dimension. The
2 The software is available at http://www.csie.ntu.edu.tw/~cjlin/libsvm/
default parameters are used. In the previous sections, we extracted features and obtained the vector space model (VSM) which represents each image. The text-based
vector space includes seven features and the image-based vector space includes 87 features. The text classifier and the image classifier use their respective vectors
as inputs to the SVM for training and classification.
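The RBF kernel and the resulting decision value can be sketched as follows. This is a minimal sketch rather than LIBSVM itself: the support vectors, coefficients, and bias would come from training, and the values used in the usage note are hypothetical. The default `gamma = 1 / num_features` mirrors the LIBSVM default for the RBF kernel; the function names are introduced only for this example.

```python
import math

def rbf_kernel(x, y, gamma=None):
    """K(x, y) = exp(-gamma * ||x - y||^2); gamma defaults to 1 / number of features."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    if gamma is None:
        gamma = 1.0 / len(x)
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)

def decision_value(x, support_vectors, coefficients, bias=0.0, gamma=None):
    """Signed SVM decision value: sum_k coef_k * K(sv_k, x) + bias.

    coefficients are the products alpha_k * y_k obtained from training.
    """
    return bias + sum(
        coef * rbf_kernel(sv, x, gamma)
        for sv, coef in zip(support_vectors, coefficients)
    )
```

For example, with the hypothetical support vectors `[[0, 0], [2, 2]]` and coefficients `[1.0, -1.0]`, a point at `[0, 0]` receives a positive decision value and a point midway between the two receives zero.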
3.5 Classifier Fusion
Combining the outputs from multiple tools has been reported to be effective in terms
of improving information retrieval [13, 15] and classification performance [2, 18].
Our experiments also show that we can improve accuracy by combining the
results of several classifiers. Furthermore, it makes sense that by including the
outputs of several types of classifiers we can protect ourselves from the risk of any one
classifier being compromised. We use an SVM again to fuse the confidence scores
of the text and image classifiers. The outputs of the bottom-layer classifiers constitute
a vector for SVM training and classification. The vector is defined as (St, Si), in
which St is the confidence score of the text classifier and Si the confidence score of
the image classifier. As for the bottom-layer classifiers, LIBSVM with the RBF kernel is
adopted for classifier fusion.
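To illustrate why a trained fusion classifier can differ from simple averaging, the sketch below builds the fusion vector (St, Si) for the two confidence scores quoted for the example image of Figure 3 and applies an averaging baseline. It assumes that the scores lie in [0, 1] and that 0.5 is the spam cutoff; neither is stated in the paper, and the function name `fuse_by_averaging` is hypothetical.

```python
def fuse_by_averaging(s_text, s_image, threshold=0.5):
    """Averaging baseline: returns (mean score, True if the mean reaches the threshold)."""
    score = (s_text + s_image) / 2.0
    return score, score >= threshold

# Confidence scores quoted in the text for the example spam image.
s_t, s_i = 0.659779, 0.0422283

# Input vector (St, Si) that a trained top-layer classifier would receive.
fusion_vector = (s_t, s_i)

avg_score, avg_is_spam = fuse_by_averaging(s_t, s_i)
```

Under the assumed cutoff, the average of about 0.351 would label this image as ham, whereas a fusion classifier trained on (St, Si) vectors is free to learn a boundary that gives the text score more weight.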
4 Experiment
4.1 Experimental Setup
The experiments are carried out on corpora of images taken from real emails.
The corpora are collections of personal emails used in [6], containing 2006 ham
images and 3297 spam images. To the best of our knowledge, this is the only corpus of
real ham images publicly available to research communities3.
For the experiments, the images are first split into two subsets: about 60% are
randomly chosen for training the bottom-layer classifiers, and the other 40%
for testing. Then, for the fusion stage, about 50% of the images are randomly chosen
for training, and the other 50% for testing. We repeat this random selection 10
times and average all of the results.
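The repeated random split can be sketched as follows. This is a minimal illustration of the protocol described above for the bottom layer (60% training, 40% testing, 10 repetitions); the seed and the function names are hypothetical, and the paper does not state whether splits are stratified by class.

```python
import random

def split_indices(n, train_fraction, rng):
    """Shuffle the indices 0..n-1 and cut them into (train, test) lists."""
    indices = list(range(n))
    rng.shuffle(indices)
    cut = round(n * train_fraction)
    return indices[:cut], indices[cut:]

def repeated_splits(n, train_fraction=0.6, repeats=10, seed=0):
    """Return `repeats` independent random (train, test) index splits."""
    rng = random.Random(seed)
    return [split_indices(n, train_fraction, rng) for _ in range(repeats)]

# 2006 ham images plus 3297 spam images, as in the corpus described above.
splits = repeated_splits(2006 + 3297)
```

Each measure would then be computed on every test subset and averaged over the ten repetitions.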
We first reduce the images by scaling so that the width and height are no more
than 200 pixels. This simple mechanism makes our method robust to random
pixels and simple scaling. It also keeps the computational cost manageable, since
image analysis has high computational complexity. We then extract features
from all the images in the positive and negative sets.
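The target size for the scaling step can be computed as in the sketch below. It assumes that the aspect ratio is preserved and that images already within the limit are left unchanged; the paper states only the 200-pixel bound, and the function name `fit_within` is hypothetical.

```python
def fit_within(width, height, max_side=200):
    """Return the (width, height) after scaling so that neither side exceeds max_side."""
    if width <= 0 or height <= 0:
        raise ValueError("image dimensions must be positive")
    # Never enlarge: the scale factor is capped at 1.
    scale = min(1.0, max_side / width, max_side / height)
    return max(1, round(width * scale)), max(1, round(height * scale))
```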
In our evaluation, accuracy, precision, image spam recall (recall in short) and
image non-spam recall (non-spam recall in short) are defined as follows:
\[ \text{accuracy} = \frac{\#\ \text{of all images correctly classified}}{\#\ \text{of all images}} \]
\[ \text{precision} = \frac{\#\ \text{of spam images correctly classified}}{\#\ \text{of images classified as spam}} \]
\[ \text{recall} = \frac{\#\ \text{of spam images correctly classified}}{\#\ \text{of all spam images}} \]
\[ \text{non-spam recall} = \frac{\#\ \text{of non-spam images correctly classified}}{\#\ \text{of all non-spam images}} \]
3 Available at http://www.cs.jhu.edu/~mdredze/datasets/image_spam/
Fig. 8. Experimental results: (a) performance comparison for different approaches (image classifier, text classifier, fusion classifier with averaging, fusion classifier with SVM) in terms of accuracy, precision, recall, and non-spam recall; (b) performance comparison with Huang's approach (SA with Bayes-OCR, Huang's approach, our approach) in terms of precision and recall
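The four evaluation measures can be computed from true and predicted labels as in the sketch below, where label 1 denotes spam and 0 denotes non-spam. The function name `evaluate` and the convention of returning 0.0 for an empty denominator are choices made for this example.

```python
def evaluate(y_true, y_pred):
    """Return accuracy, precision, recall, and non-spam recall (1 = spam, 0 = non-spam)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # spam classified as spam
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # non-spam classified as spam
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # spam classified as non-spam
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # non-spam classified as non-spam

    def ratio(numerator, denominator):
        return numerator / denominator if denominator else 0.0

    return {
        "accuracy": ratio(tp + tn, len(pairs)),
        "precision": ratio(tp, tp + fp),
        "recall": ratio(tp, tp + fn),
        "non_spam_recall": ratio(tn, tn + fp),
    }
```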
All the experiments are conducted on a typical PC with a Core 2 Quad Q6600
CPU and 4 GB of memory, running Windows XP.
4.2 Experimental Results
Figure 8(a) shows the details of the experimental results, from which we can see that,
as compared with the text classifier, the image classifier obtains higher accuracy in classifying common categories of email images, whereas the text
classifier has a better discriminative capability in classifying spam images.
Although the fusion classifier with averaging achieves better overall accuracy,
we cannot see any improvement in the other indicators. The discriminative
capability is greatly improved when we fuse the confidence scores of the text classifier and the image classifier with an SVM. Therefore, we can draw the following conclusion
from the results: the fusion classifier with an SVM combines the classification
performance of the text and image classifiers in a complementary fashion that
unites the strengths of both.
To evaluate the performance of our approach, we compare it with the open-source
spam filter SpamAssassin4 (SA in short) in its standard configuration,
equipped with the Bayes-OCR plug-in for filtering image spam, and with the existing approach presented in a recent paper [11]. The comparative results
are shown in Figure 8(b). The results of SA with Bayes-OCR serve as our baseline:
its precision is very good (almost as high as 100%), while its
recall is poor (lower than 40%). Although our experiment
4 Available at http://spamassassin.apache.org/
and the approach in [11] do not use the same corpora, we can see from the figure
that our approach obtains better results, i.e., its precision is high enough to
compete with that of SA with Bayes-OCR, while its recall is much improved.
We also compare our approach with the existing approach in [6], which uses
the same corpus. The average accuracy of our approach is 98.205%, better than
the 98.004% achieved by the approach in [6].
For text-based anti-spam filtering experiments, a number of
public benchmark datasets are available; for our experiments, however, no
other shared ham images are available, and the only other public corpus is
SpamArchive5, which consists of 16,021 spam images. We hope that a larger corpus with real spam and non-spam images will become available in the future to facilitate
the experiments, so that a fairer comparison of the above-mentioned
approaches can be conducted.
5 Conclusion
In this paper, we have presented a novel hybrid framework for detecting spam
email with content embedded in images by fusion of classifiers. Given a spam
image, our method extracts both the text and image features,
inputs the corresponding vectors into the respective bottom-layer classifiers, and lastly
obtains the final decision based on the fusion of the outputs of these classifiers. Our
experimental results have shown that our approach achieves a significant
improvement in the accuracy of image spam detection as compared with other
approaches.
For the next stage of study, we will further formalize our framework and
approach, develop an online version of the fusion method that takes into account
the spam filter's handling capacity, and test the image model's ability in spam
detection.
Acknowledgments. This paper is supported by the 863 Plan project of
China (No. 2007AA01Z197) and the Natural Science Foundations of China
(No. 60970081), and partially supported by the National Basic Research Program of China (No. 2010CB327903). We would like to thank Dr. Mark Dredze
who is now in the Department of Computer Science at University of Pennsylvania for making his data set publicly available and sending us his code for
performing the feature extraction.
References
1. Aradhye, H.B., Myers, G.K., Herson, J.A.: Image analysis for efficient categorization of image-based spam e-mail. In: Proceedings of International Conference on
Document Analysis and Recognition, pp. 914–918 (August 2005)
5 SpamArchive was downloadable from SpamArchive.org, which has been shut down. It is now available at http://www.cs.jhu.edu/~mdredze/datasets/image_spam/
2. Bennett, P.N., Dumais, S.T., Horvitz, E.: The combination of text classifiers using
reliability indicators. Information Retrieval 8(1), 67–100 (2005)
3. Biggio, B., Fumera, G., Pillai, I., Roli, F.: Image spam filtering by content obscuring detection. In: Proceedings of the Fourth Conference on Email and Anti-Spam
(CEAS 2007), pp. 2–3 (August 2007)
4. Biggio, B., Fumera, G., Pillai, I., Roli, F.: Image spam filtering using visual information. In: Proceedings of the 14th International Conference on Image Analysis
and Processing (ICIAP 2007), pp. 105–110 (September 2007)
5. Cheng, H.D., Sun, Y.: A hierarchical approach to color image segmentation using
homogeneity. IEEE Transactions on Image Processing 9(12), 2071–2082 (2000)
6. Dredze, M., Gevaryahu, R., Elias-Bachrach, A.: Learning fast classifiers for image
spam. In: Proceedings of the Fourth Conference on Email and Anti-Spam (CEAS
2007), pp. 487–493 (August 2007)
7. Frankel, C., Swain, M., Athitsos, V.: Webseer: an image search engine for the world
wide web. Technical report, University of Chicago (1996)
8. Fumera, G., Pillai, I., Roli, F.: Spam filtering based on the analysis of text information embedded into images. Journal of Machine Learning Research (special issue
on Machine Learning in Computer Security) 7, 2699–2720 (2006)
9. Gopalan, C., Manjula, D.: Statistical modeling for the detection, localization
and extraction of text from heterogeneous textual images using combined feature
scheme, 1863–1703 (2010)
10. Haralick, R., Shanmugam, K., Dinstein, I.: Textural features for image classification. IEEE Transactions on Systems, Man, and Cybernetics 3(6), 610–621 (1973)
11. Huang, H., Guo, W., Zhang, Y.: A novel method for image spam filtering. In:
Proceedings of the 9th International Conference for Young Computer Scientists
(ICYCS 2008), pp. 826–830 (November 2008)
12. Jain, A.K.: Fundamentals of Digital Image Processing. Prentice-Hall, Inc., Upper
Saddle River (1989)
13. Lynam, T.R., Buckley, C., Clarke, C.L.A., Cormack, G.V.: A multi-system analysis
of document and term selection for blind feedback. In: Proceedings of the 13th
ACM Conference on Information and Knowledge Management (CIKM 2004), pp.
261–269 (November 2004)
14. Mehta, B., Nangia, S., Gupta, M., Nejdl, W.: Detecting image spam using visual
features and near duplicate detection. In: Proceedings of the 17th International
Conference on World Wide Web (WWW 2008), pp. 21–25 (April 2008)
15. Montague, M., Aslam, J.A.: Condorcet fusion for improved retrieval. In: Proceedings of the 11th ACM Conference on Information and Knowledge Management
(CIKM 2002), pp. 538–548 (November 2002)
16. Nhung, N.P., Phuong, T.M.: An efficient method for filtering image-based spam.
In: Proceedings of 2007 IEEE International Conference on Research, Innovation
and Vision for the Future, pp. 96–102 (March 2007)
17. Secure Computing Whitepaper. Image spam: The latest attack on the enterprise
inbox. Technical report (November 2006)
18. Zhang, Y.: Using bayesian priors to combine classifiers for adaptive filtering. In:
Proceedings of the 27th Conference on Research and Development in Information
Retrieval (SIGIR 2004), pp. 345–352 (July 2004)