An Approach to Image Spam Filtering Based on Base64 Encoding and N-Gram Feature Extraction

Congfu Xu
Institute of Artificial Intelligence, College of Computer Science, Zhejiang University, Hangzhou 310027, China

Yafang Chen
Institute of Artificial Intelligence, College of Computer Science, Zhejiang University, Hangzhou 310027, China

Kevin Chiew
School of Information Systems, Singapore Management University, 80 Stamford Road, Singapore 178902

Abstract: Image spam is a variant of text spam that was invented to escape traditional text-based spam classification and filtering. Various approaches to image spam filtering have been proposed, each with its own advantages and drawbacks in terms of time cost and efficiency. In this paper, we propose a new approach based on Base64 encoding of image files and the n-gram technique for feature extraction. By transforming images into their Base64 representation, we extract the features of an image with the n-gram technique. With these features we train a support vector machine (SVM) that is effective and efficient at separating spam images from legitimate images. With an online shared personal corpus of images as the input, experimental results show that our approach, in comparison with some existing feature extraction methods, achieves very high image spam classification performance in terms of basic measures such as accuracy, precision, and recall. Moreover, our approach is practical in that it takes less running time for image spam classification than the other methods.

I. INTRODUCTION

Nowadays, with the increasing effectiveness of text-based spam filtering technologies such as Bayesian filters, which do an excellent job, almost all text-based junk mail can be detected and blocked.
However, image spam, in which the message text of the spam is presented as a picture in an image file, often with added noise and obfuscating techniques, was invented to circumvent filtering software dedicated to text-based spam filtering. Image spam first appeared at the end of 2005; it now accounts for roughly 40% of all spam traffic and is still on the rise [1]. Typically, an image spam email is about 3 to 4 times larger than a corresponding plain-text email. This brings several direct harms, two of which are immediately apparent: contention for bandwidth when image spam is transferred over the Internet, and the extra storage space required.

Various approaches to image spam filtering have been proposed, each with its own advantages and drawbacks in terms of time cost and efficiency. For example, a simple way to detect image spam is to use optical character recognition (OCR) to extract the embedded text from images and then, with the extracted text and regular spam messages, use text-based classifiers to separate spam images from legitimate ones [2]. Although OCR can catch certain spam images, its effectiveness is quickly reduced by spam images that use newer obfuscating techniques such as wavy text, colorful backgrounds, or random noise or lines added to the images, as shown in Figure 1.

Fig. 1. Spam images with several obfuscating techniques: (a) wavy text, (b) obfuscating text, (c) adding random noises, (d) using colorful background.

In this paper, we make another attempt at image spam filtering with a new approach based on Base64 encoding and the n-gram technique for feature extraction.
With our approach, we first convert the image file to its Base64 representation, then tokenize the resulting Base64 string with the n-gram technique for feature extraction, and finally classify the image as spam or non-spam with a trained SVM classifier. The experimental results show that, compared with some existing feature extraction methods, our approach achieves higher performance in terms of basic measures such as precision, recall, F1, and accuracy. Moreover, our approach takes less time for classification than the other feature extraction methods.

The remaining sections of this paper are organized as follows. We first review related work on image spam filtering in Section II, and present our algorithm and feature extraction process in Section III. We then introduce some other feature extraction methods in Section IV for a fair comparison with our method, after which we present experimental results in Section V before concluding the paper in Section VI.

II. RELATED WORK

For spam images that use obfuscating techniques such as wavy text or colorful backgrounds to escape OCR detection, researchers have proposed methods [3], [4] that complement traditional OCR in filtering these images. Biggio et al. [3] propose a method to detect whether content-obscuring techniques have been used to make OCR ineffective. More specifically, the method detects whether characters are broken or merged, or whether noise components (such as small dots) have been added to the images. Wang et al. [4] propose another method that analyzes whether the image content is similar to some known spam images and labels the image as spam or not accordingly. Based on an implemented prototype system, their method achieves a high detection rate with a false positive rate of less than 0.001%.

Some other methods [5]–[7] use image metadata instead of extracted text as features.
These methods can perform classification at high detection rates in less time. By extracting features from image header information and file properties, Krasser et al. [6] evaluate classification performance with C4.5 decision trees and SVM. This approach achieves a low computational load by eliminating 60% of spam images with a low false positive rate of 0.5%. Dredze et al. [5] propose an algorithm that allows a classifier to be trained in even less time by taking additional features from color properties such as average color, color saturation, and prevalent color coverage. Nhung and Phuong [7] propose another method that uses simple edge-based features to represent the major shape properties of images and applies SVM to classify images based on the extracted features.

A recent approach [1] adopts the Fourier-Mellin transform (FMT) to train a one-class SVM classifier with Fourier-Mellin invariant features. The authors evaluate the approach with 10-fold cross-validation and obtain a precision of 98.9% on a personal spam corpus and a public non-spam corpus that they collected themselves.

III. OUR APPROACH

A. Main Idea

Although various approaches have been proposed for image spam filtering, a crucial step in all of them is to extract features from the source file of a spam image. These approaches extract visually perceivable features, such as foreground or background colors, saturation, or gray scale, from a spam image that may be stored in ASCII format, JPEG format, or any other binary format. In other words, as long as an approach can detect and classify image spam, it makes no difference in which format the image is presented or which features are extracted.

Following this clue, we propose a new approach that uses the n-gram method to extract features from another representation of images, namely the Base64 encoding. The features we extract from the Base64 string of an image may not be visually perceivable when the image is displayed on the screen; however, they may represent some intrinsic characteristics of the image. To do this, we first convert an image to a string in Base64 format regardless of whether the image contains textual elements, then divide the Base64 string into groups from which we extract features with the n-gram method, and finally use a trained SVM to classify the image.

Fig. 2. The framework of our approach: image → (encode into Base64) → Base64 string → (n-gram) → n-gram string vector → (binary features over the feature space) → feature vector → SVM → spam / non-spam.

As shown in Figure 2, our approach works in the following three steps:

Step 1: Format conversion. This step converts an image to its Base64 representation regardless of whether the image contains textual elements.

Step 2: Feature extraction. Based on the Base64 string obtained in Step 1, the n-gram technique is used to tokenize the string and extract features that represent the image. This step represents the image as a vector of binary features over the feature space.

Step 3: Classification. With the vector representation of images computed in Step 2, the SVM algorithm [8] is used to classify the image as spam or non-spam.

This method is easy to implement, and our experiments show it to be very fast because it does not require any additional text extraction or other time-consuming image processing procedures (such as OCR).

B. Base64 Encoding

An email is normally encoded into a sequence in a certain format and transferred over the Internet. At the recipient side, the encoded sequence is decoded to its original format. Various formats, such as Unicode, HTML, or plain text, are available for encoding an email. In our approach, we use Base64 encoding to represent an email that may contain images. According to RFC 4648, Base64 encoding is designed to represent arbitrary sequences of octets in a form that need not be humanly readable.
It converts a file to a string that contains only 64 ASCII characters (i.e., A–Z, a–z, 0–9, +, /) together with a special suffix “=” used for padding. It first converts the data to a byte array and then encodes 3 bytes (from left to right) at a time. Every 24-bit (3-byte) substring is split into four 6-bit groups, and each 6-bit group is used as an index into the 64 printable characters “A–Z, a–z, 0–9, +, /” to find the corresponding character. When fewer than 24 bits are left at the end of the data, the symbol “=” is used for padding. Since 3 bytes are processed at a time, one “=” is appended to the encoded string when 2 bytes are left at the end of the data, and two “=” are appended when only 1 byte is left.

Base64 encodes an image file byte by byte, so the structure defined by the file grammar of the image format carries over to the Base64 string. Taking GIF (Graphics Interchange Format) as an example, the GIF grammar [9] defines the GIF data stream as follows: it starts with a fixed-length header (usually “GIF87a” or “GIF89a”) that defines the version number, followed by a logical screen descriptor that gives the size and other characteristics of the graphic, and then by other entities that define the remaining information about the image. Figure 3 shows two example images and the first 2000 bytes of their Base64 strings. As can be seen in Figure 3(b), the fixed-length GIF header “GIF89a” is encoded as “R0lGODlh” at the very beginning of the image's Base64 string.

Fig. 3. Examples of spam and non-spam images and their Base64 formats: (a) a GIF spam image; (b) the first 2000 bytes of the Base64 string of the spam image, beginning “R0lGODlhXgGpAfcAAAAAAMzMzJkAAFJSUpNxUD1WSv///ygqKr1KB2ZmZuTk12wKBJqlmABQoa8w”; (c) a JPEG ham image; (d) the first 2000 bytes of the Base64 string of the ham image, beginning “/9j/4AAQSkZJRgABAQEAtAC0AAD/4gxYSUNDX1BST0ZJTEUAAQEAAAxITGlubwIQAABtbnRyUkdC”.

C. N-Gram

Various methods have been proposed to detect image spam by extracting different features for classification. These features include embedded text features [2], image file metadata features [5], [10], color-based features [5], edge-based features [7], and histogram features [10].
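As an illustration of Steps 1 and 2 of Section III, a minimal Python sketch follows. It takes an n-gram to be any run of n consecutive characters of the Base64 string. It assumes Python's standard base64 module as the encoder and uses bare header bytes as stand-ins for real image files. The paper does not state how its feature space of 1000 to 7000 n-grams was selected, so the vocabulary built here (every n-gram seen in two toy strings) is purely hypothetical, and the function names to_base64, ngrams, and binary_vector are illustrative rather than the authors' code.

```python
import base64
from typing import List, Set


def to_base64(data: bytes) -> str:
    """Step 1: encode raw image bytes as a Base64 string (RFC 4648 alphabet)."""
    return base64.b64encode(data).decode("ascii")


def ngrams(text: str, n: int) -> Set[str]:
    """Step 2a: the set of distinct runs of n consecutive characters."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def binary_vector(text: str, vocabulary: List[str], n: int) -> List[int]:
    """Step 2b: one binary feature per vocabulary n-gram (1 = present)."""
    present = ngrams(text, n)
    return [1 if gram in present else 0 for gram in vocabulary]


# Header bytes only, used as stand-ins for real image files (assumption).
print(to_base64(b"GIF89a"))  # R0lGODlh
print(to_base64(b"ab"))      # YWI=  (2 bytes left over: one "=")
print(to_base64(b"a"))       # YQ==  (1 byte left over: two "=")

# Hypothetical feature space: every 2-gram seen in two toy "training" strings.
train = [to_base64(b"GIF89a"), to_base64(b"GIF87a")]
vocab = sorted(set().union(*(ngrams(s, 2) for s in train)))
vector = binary_vector(train[0], vocab, 2)
print(len(vocab), sum(vector))
```

In a full pipeline, vectors of this kind would be passed to a linear SVM for Step 3; that step is omitted here because it depends on an external library.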
Since we treat the Base64 representation of an image as a string, our approach follows the practice of text-based spam classification: we tokenize the encoded image string and use the n-gram technique to extract features of the image. An n-gram is a subsequence of n consecutive items from a given sequence. The n-gram method is widely used for feature extraction in natural language processing [11] and text categorization [12]. For a string drawn from k distinct characters, the number of n-grams in its feature space is bounded by $k^n$, since each of the n positions of an n-gram can hold any of the k characters. Thus with our approach, the feature space contains at most $65^n$ n-grams, because Base64 encoding uses 64 ASCII characters and the padding symbol “=”.

IV. FEATURE EXTRACTION

Feature extraction plays an important role in classification. Existing image spam filtering approaches extract image features in several ways: (1) using basic file properties, which can be obtained from the image file at low computational cost; (2) extracting the embedded text from images with the help of OCR and generating textual features by analyzing the extracted text; (3) extracting visual features from image metadata, such as color, shape, edge, and saturation features. We implement these three methods for comparison with our method.

A. File Properties

Basic image file features can be obtained quickly from an image. The features are listed as follows (refer to [6] for further detail):

• Image width and height. These two features are recorded in the header of the image file.
• Image file type. The following four image file types are considered: GIF, JPEG, PNG, and BMP.
• Image file size.
• Image area, defined as w × h, in which w and h are the width and height of the image in pixels.
• Aspect ratio, defined as w/h, with w and h as above.
• Compression, defined as image area / file size.

B. Textual Features

As mentioned above, OCR was the first technique proposed for image spam filtering. Many commercial and open-source OCR tools (e.g., Bayes-OCR and FuzzyOCR) are available. In our study, we use Tesseract OCR, which has been open sourced by Google (available at http://code.google.com/p/tesseract-ocr). With the embedded text extracted from images, we use the following textual features as proposed in [13]:

• TextLength: the number of characters in the whole text.
• WordsNumber: the number of words in the text.
• Ambiguity: n1/n2, where n1 is the number of special characters and n2 the number of normal characters.
• Correctness: Nn/Ns, where Nn is the number of words that contain normal characters and Ns the number of words that contain special characters.
• SpecialLength: the maximum length of a continuous sequence of special characters.
• SpecialDistance: the maximum distance between two special characters.

Here the special-character set contains the following characters: {!, ", #, $, %, &, ', (, ), *, +, ,, –, . . . , /, @, ^}.

C. Visual Features

Various visual features can be used to represent images, such as texture features, color features, and shape features. In our study, we transform an image into a gray-level co-occurrence matrix (GLCM) and extract 5 GLCM-based features [14] as visual features. In addition, as proposed in [13], perimetric complexity [3] is also considered as a visual feature. These features are defined as follows.

$$E = \sum_{i,j=0}^{N-1} P_{ij}^{2}$$

$$S = \sum_{i,j=0}^{N-1} P_{ij} \log_2 P_{ij}$$

$$C_{\mathrm{con}} = \sum_{i,j=0}^{N-1} P_{ij}\,(i-j)^{2}$$

$$H = \sum_{i,j=0}^{N-1} \frac{P_{ij}}{1+(i+j)^{2}}$$

$$C_{\mathrm{cor}} = \sum_{i,j=0}^{N-1} P_{ij}\,\frac{(i-\mu_i)(j-\mu_j)}{\sigma_i \sigma_j}$$

$$PC = \frac{P^{2}}{A}$$

in which $E$ stands for energy, $S$ for entropy, $C_{\mathrm{con}}$ for contrast, $H$ for homogeneity, $C_{\mathrm{cor}}$ for correlation, and $PC$ for perimetric complexity; $i$ is the row index of the GLCM, $j$ the column index, $P_{ij}$ the value of the normalized symmetrical GLCM at point $(i, j)$, $N$ the number of gray levels in the image, $\mu_i = \sum_{i,j=0}^{N-1} i\,P_{ij}$, $\sigma_i = \sqrt{\sum_{i,j=0}^{N-1} P_{ij}\,(i-\mu_i)^{2}}$ ($\mu_j$ and $\sigma_j$ are obtained by replacing $i$ with $j$ in $\mu_i$ and $\sigma_i$), $P$ the length of the boundary between black and white pixels in the whole image (the perimeter), and $A$ the black area.

V. EXPERIMENT

Next, we conduct two sets of experiments to verify the effectiveness and efficiency of our approach. In the first set of experiments, we evaluate classification performance in terms of precision, recall, F1, and accuracy with n-grams where n = 1, 2, . . . , 5. In the second set of experiments, we compare the performance of our approach with that of other feature extraction approaches.

A. Corpus

Although no public image spam corpus is available online, some researchers have made their personal corpora available to the research community. Dredze et al. [5] offer their personally collected corpus (called the Personal Ham Dataset and the Personal Spam Dataset) on their personal webpage (available at http://www.cs.jhu.edu/~mdredze/datasets/image_spam/). To the best of our knowledge, this is the only corpus shared online that contains both spam and ham images. The corpus consists of 2020 ham images and 3297 spam images, among which 1828 ham and 3209 spam images can be recognized by image processors and are used in our experiments, as listed in Table I. The spam and ham columns on the right side of the table give the numbers of images that can be recognized by Tesseract OCR.
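The paper names precision, recall, F1, and accuracy as its measures but does not define them. Under the usual definitions, and assuming spam is the positive class, they can be computed from the confusion counts as in the following Python sketch. The fold splitter illustrates a random 10-fold partition of the kind described in Section V-B; the function names, the random seed, and the toy labels are assumptions made for illustration.

```python
import random
from typing import Dict, Iterator, List, Sequence, Tuple


def measures(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Precision, recall, F1 and accuracy; label 1 = spam is the positive class (assumption)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return {"precision": precision, "recall": recall,
            "f1": f1, "accuracy": accuracy}


def k_fold_indices(n_items: int, k: int = 10,
                   seed: int = 0) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train, test) index lists for k folds of approximately equal size."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for held_out in range(k):
        test = folds[held_out]
        train = [j for f, fold in enumerate(folds) if f != held_out for j in fold]
        yield train, test


# Toy labels (not data from the paper): 3 spam and 2 ham images.
scores = measures([1, 1, 1, 0, 0], [1, 1, 0, 0, 1])
print(scores)

# 3209 spam + 1828 ham images are used in the experiments.
splits = list(k_fold_indices(3209 + 1828))
print(len(splits), len(splits[0][1]))
```

Averaging the four measures over the 10 (train, test) pairs gives figures of the kind reported in the experiments; the classifier itself is left out of the sketch.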
TABLE I
SUMMARY OF CORPUS

Available spam & ham      Recognized by Tesseract OCR
Spam      Ham             Spam      Ham
3209      1828            3108      1804

B. Evaluation

We evaluate our method using 10-fold cross-validation. Our dataset is randomly divided into 10 subsets of approximately equal size. One subset serves as the test set and the other 9 subsets are used to train the SVM. We repeat the experiment 10 times, using each subset as the test set in turn, and average the results over all 10 runs.

The classification algorithm used in our method is the support vector machine (SVM) [8]. We use the LIBLINEAR software [15] (available at http://www.csie.ntu.edu.tw/~cjlin/liblinear/index.html) with the penalty parameter c = 0.5 and the solver s = 5 to conduct the experiments. By using a linear kernel and a proper penalty parameter, LIBLINEAR is usually much faster than LIBSVM, which is introduced in [8]. For a fair comparison, we use the following measures, which are commonly used to evaluate other methods: accuracy, precision, recall, and F1. Our experiments are conducted on a workstation with an Intel Core(TM)2 Duo CPU at 2.54 GHz and 4 GB of memory.

C. Experimental Results

In the first set of experiments, we use 2-, 3-, 4-, and 5-grams respectively to extract features from the Base64 string of an image and show the results in Figure 4. From the figure we can see that (1) the best overall performance is achieved by using 5-grams for feature extraction, (2) the performance of 2-grams drops as the number of features increases, whereas it remains stable for 3-, 4-, and 5-grams, and (3) for every individual measure, the best performance is achieved by 5-gram feature extraction.

Fig. 4. Classification results using n-gram for feature extraction: (a) precision, (b) recall, (c) F1, and (d) classification accuracy (%) obtained by 2-, 3-, 4-, and 5-gram features, plotted against the number of features.

Figure 5 shows the time required for spam classification with each n-gram, which indicates that (1) the time requirements for 2-grams and 3-grams are very close, (2) the time requirements for 4-grams and 5-grams are close to each other and are nearly 5 times those for 2- or 3-grams, and (3) the time requirement increases roughly linearly with the number of features.

Fig. 5. Time requirement for classification with n-gram feature extraction: total time (s) against the number of features for 2-, 3-, 4-, and 5-grams.

In the second set of experiments, we conduct classification with the different feature extraction methods introduced in the previous section, and compare the results with our approach using 5-grams, which achieves the best performance among the 2- to 5-gram feature extractions. Figure 6 shows the results of the comparison. From Figure 6(a) we can see that our approach achieves the best values, over 99% for all four performance measures, compared with the other feature extraction methods, and uses the least time of all, as shown in Figure 6(b). Since OCR is used for textual feature extraction, that method takes a comparatively long time for classification, as shown in Figure 6(b).

Fig. 6. Comparison with other methods of feature extraction: (a) performance comparison (precision, recall, F1, and accuracy) of our approach with 5-gram, visual features, file properties, and textual features; (b) time requirement comparison.

Time for classification (second)
Our approach        106
Visual features     607.5
File properties     145.6
Textual features    5418.7

Since we use the personal corpus provided by Dredze et al., and their reported results are so far satisfactory compared with other methods, we take their experimental results (only accuracy and F1 are available) as the baseline, namely an F1 of 97.5% and an accuracy of 98.4%. Both our F1 value and our accuracy are higher than the baseline, from which we conclude that our approach achieves satisfactory image spam classification performance and uses the least time compared with the other methods.

VI. CONCLUSION

In this paper, we have proposed a new approach to image spam filtering. Our experiments have shown that our approach achieves high performance with less running time than some other methods. In summary, we have made the following contributions: (1) we have made a successful attempt to use Base64 encoding to represent an image and n-grams for feature extraction; (2) we have conducted extensive experiments to verify the effectiveness and efficiency of our approach, which achieves higher performance on several commonly used measures than some other methods; (3) our approach may provide a reference for future studies on image spam filtering in terms of image representation and feature extraction. As the next stage of this study, it would be interesting to apply our approach to images in other media formats.

VII. ACKNOWLEDGEMENTS

The research work of this paper is supported by the 863 Plan of China (No. 2007AA01Z197) and the Natural Science Foundation of China (No.
60970081), and partially supported by the 973 Program of China (No. 2010CB327903).

REFERENCES

[1] H. Zuo, X. Li, O. Wu, W. Hu, and G. Luo, “Image spam filtering using Fourier-Mellin invariant features,” in 2009 IEEE International Conference on Acoustics, Speech and Signal Processing, 2009, pp. 849–852.
[2] G. Fumera, I. Pillai, and F. Roli, “Spam filtering based on the analysis of text information embedded into images,” The Journal of Machine Learning Research, vol. 7, pp. 2699–2720, 2006.
[3] B. Biggio, G. Fumera, I. Pillai, and F. Roli, “Image spam filtering using visual information,” in Proceedings of the 14th International Conference on Image Analysis and Processing (ICIAP 2007), 2007, pp. 105–110.
[4] Z. Wang, W. Josephson, Q. Lv, M. Charikar, and K. Li, “Filtering image spam with near-duplicate detection,” in Proceedings of the Fourth Conference on Email and Anti-Spam (CEAS 2007), Mountain View, California, USA, August 2007.
[5] M. Dredze, R. Gevaryahu, and A. Elias-Bachrach, “Learning fast classifiers for image spam,” in Proceedings of the Fourth Conference on Email and Anti-Spam (CEAS 2007), Mountain View, California, USA, August 2007.
[6] S. Krasser, Y. Tang, J. Gould, D. Alperovitch, and P. Judge, “Identifying image spam based on header and file properties using C4.5 decision trees and support vector machine learning,” in Information Assurance and Security Workshop (IAW '07), IEEE SMC, 2007, pp. 255–261.
[7] N. Nhung and T. Phuong, “An effective method for filtering image-based spam e-mail,” in IEEE International Conference on Research, Innovation and Vision for the Future (RIVF 07), March 2007, pp. 96–102.
[8] C. W. Hsu, C. C. Chang, and C. J. Lin, “A practical guide to support vector classification,” http://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf, 2009.
[9] W3C, “Cover sheet for the GIF89a specification,” http://www.w3.org/Graphics/GIF/spec-gif89a.txt, March 2009.
[10] P. He, X. Wen, and W. Zheng, “A simple method for filtering image spam,” in Eighth IEEE/ACIS International Conference on Computer and Information Science (ICIS 2009), 2009, pp. 910–913.
[11] P. F. Brown, V. J. D. Pietra, P. V. deSouza, J. C. Lai, and R. L. Mercer, “Class-based n-gram models of natural language,” Computational Linguistics, vol. 18, no. 4, pp. 467–479, 1992.
[12] J. Fürnkranz, “A study using n-gram features for text categorization,” Austrian Research Institute for Artificial Intelligence, Tech. Rep. OEFAI-TR-98-30, 1998.
[13] F. Gargiulo and C. Sansone, “Combining visual and textual features for filtering spam emails,” in 19th International Conference on Pattern Recognition (ICPR 2008), December 2008, pp. 1–4.
[14] C. Gopalan and D. Manjula, “Statistical modeling for the detection, localization and extraction of text from heterogeneous textual images using combined feature scheme,” Signal, Image and Video Processing, January 2010.
[15] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin, “LIBLINEAR: a library for large linear classification,” Journal of Machine Learning Research, vol. 9, pp. 1871–1874, 2008.