An Approach to Image Spam Filtering Based on Base64 Encoding

An Approach to Image Spam Filtering Based on
Base64 Encoding and N-Gram Feature Extraction
Congfu Xu
Institute of Artificial Intelligence, College of Computer Science
Zhejiang University, Hangzhou 310027, China
Email: [email protected]

Yafang Chen
Institute of Artificial Intelligence, College of Computer Science
Zhejiang University, Hangzhou 310027, China
Email: [email protected]

Kevin Chiew
School of Information Systems, Singapore Management University
80 Stamford Road, Singapore 178902
Email: [email protected]
Abstract—Image spam is a variant of text spam that was invented
to escape traditional text-based spam classification and
filtering. Various approaches to image spam filtering have been
proposed, each with its own advantages and drawbacks in terms of
time cost and efficiency. In this paper, we propose a new approach
based on Base64 encoding of image files and the n-gram technique
for feature extraction. By transforming images into their Base64
representation, we extract features of an image with the n-gram
technique. With these features we train an SVM (support vector
machine), which proves effective and efficient in separating spam
images from legitimate images. With an online shared personal
corpus of images as the input, experimental results show that our
approach, in comparison with some existing methods of feature
extraction, achieves very high performance for image spam
classification in terms of basic measures such as accuracy,
precision, and recall. Moreover, our approach is practical in that
it takes less running time for image spam classification than the
other methods.
I. INTRODUCTION
Nowadays, with the increasing effectiveness of text-based spam
filtering technologies such as Bayesian filters, almost all junk
mail can be detected and blocked. However, image spam, in which
the message text of the spam is presented as pictures in an image
file, often with added noise and obfuscating techniques, was
invented to circumvent filtering software dedicated to text-based
spam filtering. Image spam first appeared at the end of 2005, now
accounts for roughly 40% of all spam traffic, and is still on the
rise [1]. Typically, an image spam email is about 3 to 4 times
larger than a corresponding plain text-based email. This size
brings several direct harms, two of which are immediately
apparent: contention for bandwidth when image spam is transferred
over the Internet, and the extra requirement for storage space.
To filter image spam, various approaches have been proposed, each
with its own advantages and drawbacks in terms of time cost and
efficiency. For example, a simple way to detect image spam is to
use optical character recognition (OCR) to extract embedded text
from images and then, with the extracted text and regular spam
messages, use text-based classifiers to separate spam images from
legitimate ones [2]. Although OCR can catch certain spam images,
its effectiveness is quickly reduced by spam images that use new
obfuscating techniques such as wavy text, colorful backgrounds, or
random noise or lines added to the images, as shown in Figure 1.
Fig. 1. Spam images with several obfuscating techniques: (a) wavy
text; (b) obfuscating text; (c) adding random noises; (d) using
colorful background.
In this paper, we explore image spam filtering with a new approach
based on Base64 encoding and the n-gram technique for feature
extraction. With our approach, we first convert the image file to
its Base64 representation, then tokenize the resulting Base64
string for feature extraction with the n-gram technique, and
lastly classify the image as spam or non-spam with a trained SVM
classifier. The experimental results show that, compared with some
existing methods of feature extraction, our approach achieves
higher performance in terms of basic measures such as precision,
recall, F1, and accuracy. Moreover, our approach takes less time
for classification than the other methods of feature extraction.
The remaining sections of this paper are organized as follows. We first review the related work of image spam filtering
in section II, and present our algorithm and feature extraction
process in section III. We then introduce some other feature
extraction methods in section IV for a fair comparison with our
method, following which in section V we present experimental
results before concluding this paper in section VI.
II. RELATED WORK
For spam images that use obfuscating techniques like wavy text or
colorful backgrounds to escape OCR-based detection, researchers
have proposed methods [3], [4] that complement traditional OCR
techniques in filtering these spam images. Biggio et al. [3]
propose a method to detect whether content obscuring techniques
are used to make OCR ineffective. More specifically, the method
detects whether characters are broken or merged and whether noise
components (like small dots) are added to the images. Wang et
al. [4] propose another method that analyzes whether the image
contents are similar to known spam images and labels them as spam
or not accordingly. Based on an implemented prototype system,
their method achieves a high detection rate with a false positive
rate of less than 0.001%.
Some other methods [5]–[7] use image metadata instead of extracted
text as features. These methods can perform classification at high
detection rates in less time. By extracting features from image
header information and file properties, Krasser et al. [6]
evaluate the classification performance with C4.5 decision trees
and SVM. This approach achieves low computational load by
eliminating 60% of spam images with a low false positive rate of
0.5%. Dredze et al. [5] propose an algorithm that trains a
classifier in even less time by taking additional features from
color properties such as average color, color saturation and
prevalent color coverage. Nhung and Phuong [7] propose another
method which uses simple edge-based features to represent major
shape properties of images and applies SVM to carry out image
classification based on the extracted features.
A recent approach [1] adopts the Fourier-Mellin Transform (FMT) to
train a one-class SVM classifier with Fourier-Mellin invariant
features. The authors evaluate the approach with 10-fold
cross-validation and obtain a precision of 98.9% on a personal
spam corpus and a self-collected public non-spam corpus.
III. OUR APPROACH
A. Main Idea
Although various approaches have been proposed for image spam
filtering, a crucial step common to all of them is extracting
features from the source file of an image spam. These approaches
extract visually perceptible features, such as foreground or
background colors, saturation, or gray scale, from a spam image
that may appear in ASCII format, JPEG format, or any other binary
format. In other words, as long as the approach can detect and
classify image spam, it makes no difference in which format an
image is presented or which features are extracted. Following this
clue, we propose a new approach that uses the n-gram method to
extract features from another format of images (i.e., the format
of Base64 encoding). The features we extract from a string in
Base64 format may not be visually perceptible when the image is
displayed on the screen; however, they may represent some
intrinsic characteristics of the image. To do this, we first
convert an image to a string in its Base64 format regardless of
whether the image contains textual elements, then divide the
Base64 string into groups from which we extract features by using
the n-gram method, and lastly use a trained SVM to carry out
classification of image spam. As shown in the framework of
Figure 2, our approach works via the following three steps:
Step 1: Format conversion. This step converts an image to its
Base64 representation regardless of whether the image contains
textual elements.
Step 2: Feature extraction. Based on the Base64 string produced in
Step 1, the n-gram technique is used to tokenize the string and
extract features that represent the image. This step represents
the image as a vector of binary features according to the feature
space.
Step 3: Classification. With the vector representation of images
calculated in Step 2, the SVM algorithm [8] is used to classify
the image as spam or non-spam.
Fig. 2. The framework of our approach: image → encode into Base64
→ Base64 string → n-gram → n-gram string vector → binary features
(feature vector in the feature space) → SVM → spam / non-spam.
This method can be easily implemented and has been shown by
experiments to be very fast, because it does not require any
additional text extraction or other time-consuming image
processing procedures (like OCR).
B. Base64 Encoding
An email is normally encoded into a sequence in a certain format
and transferred over the Internet. At the recipient side, the
encoded sequence is decoded to its original format. There are
various formats available to encode an email, such as Unicode,
HTML, or plain text. In our approach, we use Base64 encoding to
represent an email that may contain images.
(a) A GIF spam image
R0lGODlhXgGpAfcAAAAAAMzMzJkAAFJSUpNxUD1WSv///ygqKr1KB2ZmZuTk12wKBJqlmABQoa8w
B6RJKDteL5mQfhoZGacfB9+rlGaZZpJ+apiVjUg5KJwJB/ffzUttMMZbABoRCSE0HUpyStGplvmy
gv1/LOV/f7Wzj5mZZmKHV5FtS/f391N7VLIoKcwAAGFiPN12ISpELISUaKY/HlZFLs6Fgj4yH8xm
APKkZVZ2SoyojrOwm5prQ+XKwKISCGl3W3hZO1NSSwoQCduYbcVoU8V0TNW+rb2MYxwhELhBCHWa
XrGsjMNTB8WcfyklHZRkPXKIW4Ggg9VmIUZoVkRBPEBqN/bl2t6ljcyZZs0HBzMzM3ipOWZLM7xX
U97BrbN9Uj1QIDVHPleDZGaEStQnJ9uei3abO6/CsK0pB9aVjGaZZk1pQuvX0NSFWNSHZHOgipC5
qGaZM8umind5dctrQst5ZJ+Kcap4Ty5FH2diS7dLR8VrQJqfg16IPsxmM21pXt+jgrU6BwoIAENp
ScNiPW83ILo0MdCMjMbUxzhfPBkZD8xSIWuHWsBRByoiELyRb4Cbdc56UoyUVTRSLoWKamaZZklt
OtChgDg6Nfrv4TYnGVhtWui8jXCLS3plQiAiGrVaMXt2X6UZB+7QwfjCm9WMartiOb29vRAREGud
iPyWU0VHL3OVdVVOPP9mAKMVAN5hYc54S4NrVKSyguLp46oiBz9qQtG0o7xqQYCvnbrLvFqHVk04
Ie6rq9usjXhQMk15Q9fX12KKZmaZmVFxOy02HzRILJixmUdKQViEb6SCaGmVjJmEWkhPHLK2sJOt
bF57PyAZD54OB9/f3/m3i7MzB8VbB2uHjKO6pSY2KwkIB+a2izZSPH2GaLSUgM52RPPz83N+QcA5
Itg9PffX17yObKxKIEJhPYCkY2KSimmLaapfO9eNY3lmSUtOShARCCAtD7tDB+rMtqdqT1pbVczO
r7BTLSEhIaCliv9SADUjEsGDVmyjSVaHSUp0V4CITXV1bVhfS4WEfSH5BAQUAP8ALAAAAABeAakB
AAj/AD8IHEiwoMGDCBMqXMiwocOHECNKnEixosWLGDNq7Maxo8ePIEOKHEmypMmTKFOqXMmypcuX
MGPKnBlTIM2bOHPq3Mmzp8+fQEPa/IjCgNGjSJMqXcq0qdOnUKNKnUq1qtWrWLNq3cq1q1MUH4d2
LOq1rNmzaNOqXcu2rVqPYruR1SCGkN27d83o3cu3r14xgAMLHkylsOHDiLEpxpapsePHkCNLnvx4
seXLmDNr3sy5M2bKoEOLHk26tOnTqFNT1nC0o1iyFHYImE27tu3az56x2s2794TfsoLLKkOcuIPj
DqQpl+anuXM/7qIjSMKhuvXr2LNr3869u/fv4MOL/x9Pvrz58+jTk6eRySjYbmKNapCdob79DDvy
5//E/9Pv/xMIV1xxyCW33HPOGWFEdNIh4OCDikQo4TQ0VGjhhRhmqOGGHHbo4YcghijiiCSWaOKJ
KKY44hRGcRSfAWYIcF8GuT2j3w799QegcMMNWKCByiHoh4IKMujOgw5KqEgSTDI5DYUqRinllFRW
aeWVWKLIATYtwvdBR0YRIqN9Ndp4Y478/cfjgMYhtxxzQhZpJJJJSthkEk/mqeeefPbp55+ABiro
oIQWauihiCaq6KKMFpoElwa4+CVHYY5ZX5lnormjgGyWcdybQg5J5JxIKnlno6imquqqrLbq6qup
Kv8CqaRgGiDmfWWauZ+Oaq7JZoFvwtkckQuSWuqSp8Kq7LLMNuvss4wiMKuXtd56KaY34sgrcL52
GmycozZ4bITJ9nnnueimq+667Lbr7rvwxivvvPTWa++9+LorbZcvWnttbtnmCGCAnPro5oHgFisu
hOQ2CSiy+UYs8cQUV2zxxRi7684u/E7aTaW41pittmn2WrCnBy
(b) The first 2000 bytes of the Base64 string of the spam image
(c) A JPEG ham image
/9j/4AAQSkZJRgABAQEAtAC0AAD/4gxYSUNDX1BST0ZJTEUAAQEAAAxITGlubwIQAABtbnRyUkdC
IFhZWiAHzgACAAkABgAxAABhY3NwTVNGVAAAAABJRUMgc1JHQgAAAAAAAAAAAAAAAAAA9tYAAQAA
AADTLUhQICAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAABFj
cHJ0AAABUAAAADNkZXNjAAABhAAAAGx3dHB0AAAB8AAAABRia3B0AAACBAAAABRyWFlaAAACGAAA
ABRnWFlaAAACLAAAABRiWFlaAAACQAAAABRkbW5kAAACVAAAAHBkbWRkAAACxAAAAIh2dWVkAAAD
TAAAAIZ2aWV3AAAD1AAAACRsdW1pAAAD+AAAABRtZWFzAAAEDAAAACR0ZWNoAAAEMAAAAAxyVFJD
AAAEPAAACAxnVFJDAAAEPAAACAxiVFJDAAAEPAAACAx0ZXh0AAAAAENvcHlyaWdodCAoYykgMTk5
OCBIZXdsZXR0LVBhY2thcmQgQ29tcGFueQAAZGVzYwAAAAAAAAASc1JHQiBJRUM2MTk2Ni0yLjEA
AAAAAAAAAAAAABJzUkdCIElFQzYxOTY2LTIuMQAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAWFlaIAAAAAAAAPNRAAEAAAABFsxYWVogAAAAAAAAAAAAAAAA
AAAAAFhZWiAAAAAAAABvogAAOPUAAAOQWFlaIAAAAAAAAGKZAAC3hQAAGNpYWVogAAAAAAAAJKAA
AA+EAAC2z2Rlc2MAAAAAAAAAFklFQyBodHRwOi8vd3d3LmllYy5jaAAAAAAAAAAAAAAAFklFQyBo
dHRwOi8vd3d3LmllYy5jaAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAABkZXNjAAAAAAAAAC5JRUMgNjE5NjYtMi4xIERlZmF1bHQgUkdCIGNvbG91ciBzcGFjZSAt
IHNSR0IAAAAAAAAAAAAAAC5JRUMgNjE5NjYtMi4xIERlZmF1bHQgUkdCIGNvbG91ciBzcGFjZSAt
IHNSR0IAAAAAAAAAAAAAAAAAAAAAAAAAAAAAZGVzYwAAAAAAAAAsUmVmZXJlbmNlIFZpZXdpbmcg
Q29uZGl0aW9uIGluIElFQzYxOTY2LTIuMQAAAAAAAAAAAAAALFJlZmVyZW5jZSBWaWV3aW5nIENv
bmRpdGlvbiBpbiBJRUM2MTk2Ni0yLjEAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAHZpZXcAAAAA
ABOk/gAUXy4AEM8UAAPtzAAEEwsAA1yeAAAAAVhZWiAAAAAAAEwJVgBQAAAAVx/nbWVhcwAAAAAA
AAABAAAAAAAAAAAAAAAAAAAAAAAAAo8AAAACc2lnIAAAAABDUlQgY3VydgAAAAAAAAQAAAAABQAK
AA8AFAAZAB4AIwAoAC0AMgA3ADsAQABFAEoATwBUAFkAXgBjAGgAbQByAHcAfACBAIYAiwCQAJUA
mgCfAKQAqQCuALIAtwC8AMEAxgDLANAA1QDbAOAA5QDrAPAA9gD7AQEBBwENARMBGQEfASUBKwEy
ATgBPgFFAUwBUgFZAWABZwFuAXUBfAGDAYsBkgGaAaEBqQGxAbkBwQHJAdEB2QHhAekB8gH6AgMC
DAIUAh0CJgIvAjgCQQJLAlQCXQJnAnECegKEAo4CmAKiAqwCtgLBAssC1QLgAusC9QMAAwsDFgMh
Ay0DOANDA08DWgNmA3IDfgOKA5YDogOuA7oDxwPTA+AD7AP5BAYEEwQgBC0EOwRIBFUEYwRxBH4E
jASaBKgEtgTEBNME4QTwBP4FDQUcBSsFOgVJBVgFZwV3BYYFlg
(d) The first 2000 bytes of the Base64 string of the ham image
Fig. 3. Examples of spam and non-spam images and their Base64 formats
According to RFC 4648, Base64 encoding is designed to represent
arbitrary sequences of octets in a form that need not be readable
by humans. It converts a file to a string that contains only 64
ASCII characters (i.e., A–Z, a–z, 0–9, +, /) together with a
special suffix “=” used for padding. It first converts the data to
a byte array and then encodes 3 bytes at a time, from left to
right. Every 24-bit group (3 bytes) is split into four 6-bit
groups, and each 6-bit group is used as an index into the 64
printable characters “A–Z, a–z, 0–9, +, /” to find the
corresponding character. When fewer than 24 bits are left at the
end of the data, the symbol “=” is used for padding: since 3 bytes
are processed at a time, one “=” is appended to the encoded string
when 2 bytes are left at the end of the data, and two “=” are
appended when only one byte is left.
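The 3-byte to 4-character grouping and the padding rule described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation; the function name `encode_base64` is hypothetical, and its output is cross-checked against Python's standard `base64` module.

```python
import base64

# The 64 printable characters, indexed by 6-bit value (RFC 4648 alphabet).
ALPHABET = (
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789+/"
)


def encode_base64(data: bytes) -> str:
    """Encode bytes as Base64, 3 bytes (24 bits) at a time."""
    out = []
    for start in range(0, len(data), 3):
        chunk = data[start:start + 3]
        missing = 3 - len(chunk)  # 0, 1, or 2 bytes short of a full group
        # Zero-fill the group to 24 bits and read it as one integer.
        block = int.from_bytes(chunk + b"\x00" * missing, "big")
        # Split the 24 bits into four 6-bit indices into the alphabet.
        chars = [ALPHABET[(block >> shift) & 0x3F] for shift in (18, 12, 6, 0)]
        # One "=" when 2 bytes were left, two "=" when 1 byte was left.
        if missing:
            chars[4 - missing:] = "=" * missing
        out.extend(chars)
    return "".join(out)


if __name__ == "__main__":
    for sample in (b"GIF89a", b"ab", b"a"):
        mine = encode_base64(sample)
        reference = base64.b64encode(sample).decode("ascii")
        print(sample, mine, mine == reference)
```

In practice the standard library call `base64.b64encode` would be used directly; the hand-written version only makes the index lookup and padding visible.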
The Base64 string of an image reflects the grammar of the image's
file format. Taking GIF (graphics interchange format) as an
example, the GIF grammar [9] defines the GIF data stream in the
following way: it starts with a fixed-length header (usually
“GIF87a” or “GIF89a”) which defines the version number, followed
by a logical screen descriptor which gives the size and other
characteristics of the graphic, and then by other entities that
define further information about the image. Figure 3 shows two
example images and the first 2000 bytes of their Base64 strings.
As can be seen in Figure 3(b), the GIF fixed-length header
“GIF89a” is encoded to “R0lGODlh” at the very beginning of the
image’s Base64 string.
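As a small worked example of how the file grammar shows through the encoding, the sketch below decodes the first 16 characters of the Base64 string in Figure 3(b) and reads the GIF header and the start of the logical screen descriptor. It assumes the field layout of the GIF89a specification [9] (a 6-byte signature and version, followed by width and height as little-endian 16-bit integers); the helper name `parse_gif_prefix` is hypothetical.

```python
import base64


def parse_gif_prefix(b64_prefix: str):
    """Decode a Base64 prefix and read the GIF fields it covers."""
    raw = base64.b64decode(b64_prefix)
    header = raw[:6].decode("ascii")             # signature + version
    width = int.from_bytes(raw[6:8], "little")   # logical screen width
    height = int.from_bytes(raw[8:10], "little") # logical screen height
    return header, width, height


if __name__ == "__main__":
    # First 16 characters of the spam image's Base64 string in Figure 3(b).
    print(parse_gif_prefix("R0lGODlhXgGpAfcA"))  # ('GIF89a', 350, 425)
```

Because 3 input bytes always map to 4 output characters, the 6-byte header occupies exactly the first 8 Base64 characters, which is why the same prefix recurs across GIF89a files.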
C. N-Gram
Various methods have been proposed to detect image spam by
extracting different features for classification. These features
include embedded text features [2], image file metadata features
[5], [10], color-based features [5], edge-based features [7] and
histogram features [10]. Since we treat the Base64 format of an
image as a string, our approach follows the practice of text-based
spam classification. By tokenizing the encoded image string, we
use the n-gram technique to extract features of the image.
An n-gram is a subsequence of n items from a given sequence. The
n-gram method is widely used in natural language processing [11]
and text categorization [12] for feature extraction. For a string
of length m that consists of k distinct characters, the number of
n-grams in its feature space is bounded by $k^n$, since each of
the n positions in an n-gram can hold any of the k characters.
Thus with our approach, the feature space contains at most $65^n$
n-grams, because Base64 encoding uses 64 ASCII characters and a
padding symbol “=”.
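The tokenization and binary feature vector can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the text does not say whether n-grams overlap or how the feature space is limited to a given number of features, so the sketch assumes a sliding window with stride 1 and ranks n-grams by the number of training strings that contain them. All function names are hypothetical.

```python
from collections import Counter


def char_ngrams(text, n):
    """All overlapping character n-grams of text (assumed stride of 1)."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def build_feature_space(strings, n, max_features):
    """Pick the n-grams that occur in the most strings (assumed criterion)."""
    doc_freq = Counter()
    for s in strings:
        doc_freq.update(set(char_ngrams(s, n)))
    # Sort by descending document frequency, then alphabetically for stability.
    ranked = sorted(doc_freq.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:max_features]]


def binary_vector(text, n, feature_space):
    """1 if the n-gram occurs in text, else 0, for each feature."""
    present = set(char_ngrams(text, n))
    return [1 if gram in present else 0 for gram in feature_space]


if __name__ == "__main__":
    # Prefixes of the two Base64 strings shown in Figure 3.
    corpus = ["R0lGODlhXgGpAfcA", "/9j/4AAQSkZJRgAB"]
    space = build_feature_space(corpus, n=2, max_features=8)
    print(space)
    print(binary_vector(corpus[0], 2, space))
    print("upper bound for n=2:", 65 ** 2)
```

Each resulting vector would then be passed to the SVM as one training or test instance.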
IV. FEATURE EXTRACTION
Feature extraction plays an important role in classification.
The existing image spam filtering approaches extract image
features in several ways: (1) using basic file properties which
can be easily obtained from image file with low computational
cost; (2) extracting embedded text from images with the help
of OCR technique and generating textual features by analyzing
the extracted text; (3) extracting visual features from image
metadata like color, shape and edge features, saturation, etc.
We implement these 3 methods to make a comparison with
our method.
A. File Properties
Basic image file features can be quickly obtained from an image.
The features are listed as follows (refer to [6] for further
detail):
• Image width and height. These two features are given in the
header of the image file.
• Image file type. The following 4 image file types are taken into
consideration: GIF, JPEG, PNG and BMP.
• Image file size.
• Image area, defined by w × h, in which w and h are the width and
height of an image in pixels, respectively.
• Aspect ratio, defined by w/h, in which w and h are the width and
height of an image in pixels, respectively.
• Compression: image area/file size.
B. Textual Features
As aforementioned, OCR was the first method proposed for image
spam filtering. There are many commercial and open source OCR
tools (e.g., Bayes-OCR and FuzzyOCR) available. In our study, we
use Tesseract OCR, which has been open-sourced by Google
(available at http://code.google.com/p/tesseract-ocr). With the
embedded text extracted from images, we use the following textual
features as proposed by [13].
• TextLength: the number of characters of the whole text.
• WordsNumber: the number of words in the text.
• Ambiguity: n1/n2, where n1 is the number of special characters,
and n2 the number of normal characters.
• Correctness: Nn/Ns, where Nn is the number of words that contain
normal characters, and Ns the number of words that contain special
characters.
• SpecialLength: the maximum length of a continuous special
character sequence.
• SpecialDistance: the maximum distance between two special
characters.
Here the special-character set contains the following characters:
{!, ", #, $, %, &, ', (, ), *, +, ,, -, ..., /, @, ^}.
C. Visual Features
There are various visual features used to represent images, like
texture features, color features, and shape features. In our
study, by transforming an image into a gray level co-occurrence
matrix (GLCM), we extract 5 GLCM-based features [14] as visual
features. Meanwhile, as proposed in [13], perimetric complexity
[3] is also considered as a visual feature. These features are
defined as follows.
$$E = \sum_{i,j=0}^{N-1} P_{ij}^{2}$$
$$S = \sum_{i,j=0}^{N-1} P_{ij} \log_2 P_{ij}$$
$$C_{\mathrm{con}} = \sum_{i,j=0}^{N-1} P_{ij} (i - j)^{2}$$
$$H = \sum_{i,j=0}^{N-1} \frac{P_{ij}}{1 + (i + j)^{2}}$$
$$C_{\mathrm{cor}} = \sum_{i,j=0}^{N-1} \frac{P_{ij} (i - \mu_i)(j - \mu_j)}{\sigma_i \sigma_j}$$
$$PC = \frac{P^{2}}{A}$$
in which $E$ stands for energy, $S$ for entropy, $C_{\mathrm{con}}$
for contrast, $H$ for homogeneity, $C_{\mathrm{cor}}$ for
correlation, and $PC$ for perimetric complexity; $i$ is the row
number of the GLCM, $j$ the column number, $P_{ij}$ the value of
the normalized symmetrical GLCM at point $(i, j)$, $N$ the number
of gray levels in the image,
$\mu_i = \sum_{i,j=0}^{N-1} i P_{ij}$,
$\sigma_i = \sqrt{\sum_{i,j=0}^{N-1} P_{ij} (i - \mu_i)^{2}}$
($\mu_j$ and $\sigma_j$ are obtained by replacing $i$ with $j$ in
$\mu_i$ and $\sigma_i$), $P$ the length of the boundary between
black and white pixels in the whole image (the perimeter), and $A$
the black area.
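To make the GLCM-based definitions concrete, the following sketch builds a normalized symmetrical GLCM from a tiny grayscale image and evaluates three of the features (energy, contrast, and correlation). It is a simplified illustration, not the comparison implementation used in the experiments: the pixel offset (horizontally adjacent pixels) is an assumption, since the text does not state which offset is used, and the function names are hypothetical.

```python
import math


def glcm(image, levels):
    """Normalized symmetrical GLCM for horizontally adjacent pixels.

    image is a list of rows of integer gray levels in [0, levels).
    Assumes the image has at least two columns.
    """
    counts = [[0.0] * levels for _ in range(levels)]
    for row in image:
        for a, b in zip(row, row[1:]):
            counts[a][b] += 1
            counts[b][a] += 1  # count both directions to make it symmetric
    total = sum(sum(r) for r in counts)
    return [[c / total for c in r] for r in counts]


def glcm_features(p):
    """Energy, contrast, and correlation of a normalized GLCM p."""
    n = len(p)
    cells = [(i, j, p[i][j]) for i in range(n) for j in range(n)]
    energy = sum(v * v for _, _, v in cells)
    contrast = sum(v * (i - j) ** 2 for i, j, v in cells)
    mu_i = sum(i * v for i, _, v in cells)
    mu_j = sum(j * v for _, j, v in cells)
    sigma_i = math.sqrt(sum(v * (i - mu_i) ** 2 for i, _, v in cells))
    sigma_j = math.sqrt(sum(v * (j - mu_j) ** 2 for _, j, v in cells))
    # Undefined for a constant image (sigma = 0); not handled in this sketch.
    correlation = sum(
        v * (i - mu_i) * (j - mu_j) for i, j, v in cells
    ) / (sigma_i * sigma_j)
    return {"energy": energy, "contrast": contrast, "correlation": correlation}


if __name__ == "__main__":
    checkerboard = [[0, 1], [1, 0]]  # two gray levels, alternating
    print(glcm_features(glcm(checkerboard, levels=2)))
```

For the 2 × 2 checkerboard every neighboring pair differs, so all probability mass sits off the diagonal, giving energy 0.5, contrast 1, and correlation −1.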
V. EXPERIMENT
We conduct two sets of experiments to verify the effectiveness and
efficiency of our approach. In the first set of experiments, we
verify the classification performance under the measures of
precision, recall, F1, and accuracy with n-gram where
n = 1, 2, ..., 5. In the second set of experiments, we compare the
performance of our approach with that of other approaches of
feature extraction.
A. Corpus
Although there is no publicly available image spam corpus online,
some researchers have made their personal corpora available to the
research community. Dredze et al. [5] offer their personally
collected corpus (called Personal Ham Dataset and Personal Spam
Dataset) on their personal webpage (available at
http://www.cs.jhu.edu/~mdredze/datasets/image_spam/). To the best
of our knowledge, this is the only corpus shared online that
contains both spam and ham images. The corpus consists of 2020 ham
images and 3297 spam images, among which 1828 ham and 3209 spam
images can be recognized by image processors and are used in our
experiments, as listed in Table I. The spam and ham counts on the
right side of the table are the numbers of images that can be
recognized by Tesseract OCR.
TABLE I
SUMMARY OF CORPUS
        Available spam & ham    Recognized by Tesseract OCR
Spam    3209                    3108
Ham     1828                    1804
Fig. 4. Classification results using n-gram for feature
extraction: (a) precision, (b) recall, (c) F1, and (d)
classification accuracy obtained by n-gram features. Each panel
plots the measure (%, axis range 93 to 100) against the number of
features (0 to 7000) for 2-, 3-, 4-, and 5-grams.
B. Evaluation
We evaluate our method by using 10-fold cross-validation. Our
dataset is randomly divided into 10 subsets of approximately equal
size. One subset serves as the test set and the other 9 subsets
are used for training the SVM. We repeat the experiment 10 times,
using each subset as the test set in turn, and average the results
over all 10 runs.
The classification algorithm used in our method is the support
vector machine (SVM) [8]. We use the LIBLINEAR software [15]
(available at
http://www.csie.ntu.edu.tw/~cjlin/liblinear/index.html) with the
penalty parameter c = 0.5 and the solver s = 5 to conduct the
experiments. By using a linear kernel and a proper penalty
parameter, LIBLINEAR is usually much faster than LIBSVM, which is
introduced in [8].
For a fair comparison, we use the following measures, which are
widely used by other methods: accuracy, precision, recall and F1.
Our experiments are conducted on a workstation with an Intel
Core(TM)2 Duo CPU at 2.54 GHz and 4 GB of memory.
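The evaluation protocol described in this subsection can be sketched as follows. The sketch only covers the fold construction and the four measures; training the SVM itself is left out. It assumes that spam is treated as the positive class, which the text does not state explicitly, and the function names `k_fold_splits` and `binary_metrics` are hypothetical.

```python
import random


def k_fold_splits(n_samples, k=10, seed=0):
    """Randomly divide sample indices into k folds of near-equal size.

    Returns a list of (train_indices, test_indices), one pair per fold.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    splits = []
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, test))
    return splits


def binary_metrics(y_true, y_pred, positive=1):
    """Precision, recall, F1, and accuracy for a two-class problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return {"precision": precision, "recall": recall,
            "f1": f1, "accuracy": accuracy}


if __name__ == "__main__":
    # 3209 spam + 1828 ham images, as in Table I.
    splits = k_fold_splits(3209 + 1828, k=10)
    print([len(test) for _, test in splits])
    print(binary_metrics([1, 1, 1, 0, 0], [1, 1, 0, 1, 0]))
```

In a full run, each `(train, test)` pair would be used to train and test one classifier, and the per-fold measures would be averaged over the 10 runs.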
C. Experimental Results
In the first set of experiments, we use 2-, 3-, 4- and 5-grams
respectively to extract features from the Base64 string of an
image and show the results in Figure 4. From the figure we can see
that (1) for every measure, the best performance is achieved by
using 5-grams for feature extraction, and (2) the performance
drops for 2-grams as the number of features increases but remains
stable for 3-, 4-, and 5-grams.
Figure 5 shows the time requirement of spam classification for
each n-gram, which tells that (1) the time requirements for 2-gram
and 3-gram are very close, (2) the time requirements for 4-gram
and 5-gram are close and are nearly 5 times that for 2- or 3-gram,
and (3) the time requirement increases roughly linearly with the
number of features.
Fig. 5. Time requirement for classification with n-gram feature
extraction. The plot shows total time (s, axis range 0 to 400)
against the number of features (1000 to 7000) for 2-, 3-, 4-, and
5-grams.
In the second set of experiments, we conduct classification with
the different methods of feature extraction introduced in the
previous section, and compare the results with our approach for
5-gram, which achieves the best performance among all 2- to
5-grams of feature extraction.
Figure 6 shows the results of the comparison. From Figure 6(a) we
can see that our approach achieves the best results, over 99% for
all four performance measures, as compared with the other methods
of feature extraction, and uses the least time among all, as shown
in Figure 6(b). Since OCR is used for textual feature extraction,
that method takes a comparatively long time for classification, as
shown in Figure 6(b).
Since we use the personal corpus provided by Dredze et al. and
their reported results are so far satisfactory compared with other
methods, we list their experimental results (only accuracy and F1
available) as the baseline, which is F1 at 97.5% and accuracy at
98.4%. Both our F1 value and our accuracy are higher than the
baseline, from which we conclude that our approach achieves
satisfactory performance in image spam classification and has the
least time cost as compared with other methods.
Fig. 6. Comparison with other methods of feature extraction: (a)
performance comparison (precision, recall, F1, and accuracy, axis
range 85% to 100%) for our approach with 5-gram, visual features,
file properties, and textual features; (b) time requirement
comparison, with time for classification of 106 s for our
approach, 607.5 s for visual features, 145.6 s for file
properties, and 5418.7 s for textual features.
VI. CONCLUSION
In this paper, we have proposed a new approach to image spam
filtering. Our experiments show that our approach achieves high
performance with less running time than some other methods. In
summary, we have made the following contributions: (1) we have
shown that representing an image by its Base64 encoding and
extracting features with n-grams is a workable approach; (2) we
have conducted extensive experiments to verify the effectiveness
and efficiency of our approach, which achieves higher performance
on several commonly used measures than some other methods; (3) our
approach may provide a reference for future study on image spam
filtering in terms of image representation and feature extraction.
For the next stage of study, it would be interesting to apply our
approach to images presented in other media formats.
VII. ACKNOWLEDGEMENTS
The research work of this paper is supported by the 863
Plan of China (No. 2007AA01Z197) and the Natural Science
Foundations of China (No. 60970081), and partially supported
by the 973 Program of China (No. 2010CB327903).
REFERENCES
[1] H. Zuo, X. Li, O. Wu, W. Hu, and G. Luo, “Image spam filtering
using Fourier-Mellin invariant features,” in 2009 IEEE International
Conference on Acoustics, Speech and Signal Processing, 2009, pp. 849–852.
[2] G. Fumera, I. Pillai, and F. Roli, “Spam filtering based on the analysis
of text information embedded into images,” The Journal of Machine
Learning Research, vol. 7, pp. 2699–2720, 2006.
[3] B. Biggio, G. Fumera, I. Pillai, and F. Roli, “Image spam filtering using
visual information,” in Proceedings of the 14th International Conference
on Image Analysis and Processing (ICIAP 2007), 2007, pp. 105–110.
[4] Z. Wang, W. Josephson, Q. Lv, M. Charikar, and K. Li, “Filtering image
spam with near-duplicate detection,” in Proceedings of the Fourth Conference on Email and AntiSpam, CEAS’2007, Mountain View, California
USA, August 2007.
[5] M. Dredze, R. Gevaryahu, and A. Elias-Bachrach, “Learning fast classifiers for image spam,” in Proceedings of the Fourth Conference on Email
and AntiSpam, CEAS’2007, Mountain View, California USA, August
2007.
[6] S. Krasser, Y. Tang, J. Gould, D. Alperovitch, and P. Judge, “Identifying
image spam based on header and file properties using C4.5 decision
trees and support vector machine learning,” in Information Assurance
and Security Workshop, IAW ’07, IEEE SMC, 2007, pp. 255–261.
[7] N. Nhung and T. Phuong, “An effective method for filtering image-based
spam e-mail,” in IEEE International Conference on Research, Innovation
and Vision for the Future (RIVF 07), March 2007, pp. 96–102.
[8] C. W. Hsu, C. C. Chang, and C. J. Lin, “A practical guide to support
vector classification (2009),”
http://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf, 2009.
[9] W3C, “Cover sheet for the GIF89a specification,”
http://www.w3.org/Graphics/GIF/spec-gif89a.txt, March 2009.
[10] P. He, X. Wen, and W. Zheng, “A simple method for filtering image
spam,” in Eighth IEEE/ACIS International Conference on Computer and
Information Science (ICIS 2009), 2009, pp. 910–913.
[11] P. F. Brown, V. J. D. Pietra, P. V. deSouza, J. C. Lai, and R. L. Mercer, “Class-based n-gram models of natural language,” Computational
Linguistics, vol. 18, no. 4, pp. 467–479, 1992.
[12] J. Fürnkranz, “A study using n-gram features for text categorization,” Austrian Research Institute for Artificial Intelligence, Tech. Rep.
OEFAI-TR-98-30, 1998.
[13] F. Gargiulo and C. Sansone, “Combining visual and textual features
for filtering spam emails,” in 19th International Conference on Pattern
Recognition, 2008. ICPR 2008, December 2008, pp. 1–4.
[14] C. Gopalan and D. Manjula, “Statistical modeling for the detection,
localization and extraction of text from heterogeneous textual images
using combined feature scheme,” Signal, Image and Video Processing,
January 2010.
[15] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin,
“LIBLINEAR: a library for large linear classification,” Journal of Machine
Learning Research, vol. 9, pp. 1871–1874, 2008.