Following the Trail of Image Spam

Shruti Wakade, Robert Bruen, Kathy J. Liszka, and Chien-Chung Chan
Department of Computer Science, University of Akron,
Akron, OH 44325-4003, USA
{liszka,chan}@uakron.edu
Abstract- Image spam has evolved from simple images
containing spam text to high-quality photographic images. This
paper explores the current trends in image spam by analyzing
the contents of a corpus built over the past three years, courtesy
of KnujOn. Statistics over a three-year period show that
spammers follow different patterns in sending spam, which are
based on various factors such as time of year, holidays and
politics. Other subjects appear to remain constant. Hate images
have surfaced in the past year, as have malware-embedded
images. Finally, we offer this corpus to others interested in this
research area.
Keywords- spam, image spam, malware-embedded images,
image scraping
1 INTRODUCTION
Each morning, a daemon running on a server farm in
Vermont activates, zips up a folder of images from the
prior day’s haul of spam, and forwards it to a server in
Akron, Ohio. These gems of information are stripped from
emails collected by KnujOn (“no junk” spelled backwards),
an anti-spam company [1]. Their mission is to fight against
Internet threats, and specifically those delivered by email.
They work with Internet governance bodies to help
investigate abusive registrars and track cyber criminals.
Part of the company’s business is to allow users to upload
their spam email and then process it, extracting hyperlinks
and other information to help track the source of the
message. In our case, images are stripped out and
forwarded to us in an effort to build a sizeable corpus for
the purpose of image spam research.
Spam is certainly not a new phenomenon. Filters work
diligently to protect us, but it is a never-ending battle. Once
a mere annoyance, the motivation of spammers has become
increasingly malicious. Originally these emails were
primarily digital forms of paper junk mail. Scams quickly
followed to advertise inferior, fake, or non-existent
products. Pharmaceutical email accounts for more than
70% of all spam [2]. Filled orders are often not the product
advertised, if delivered at all. In any given spam message, a
click on the embedded link will likely result in a drive-by
download of malware, or result in a phishing attempt [3].
Figure 1. Image spam.
No form of digital communication is immune to these
attacks. Social networks are becoming hotbeds of this
malignant activity. Anecdotally, in February 2011, a search
for “Viagra” on Twitter [4] delivered approximately one new
tweet every 30 seconds over a three-hour period.
That was just an observation of one spam keyword!
Email spam can be divided into two basic categories:
1) simple text and 2) an image embedded into the body of a
message or sent as an attachment. Image spam, as a
filter-avoiding tactic, first appeared around 2000, but drew
notice only in 2005, when it comprised a mere 1% of spam
emails. Within 18 months, images such as that in Figure 1
populated over 21% of all email messages [5]. It’s a
relatively effective mode of content delivery because
morphing and other digital manipulations make it difficult
for OCR readers to catch, and with minor changes,
fingerprinting (e.g., via MD5 hashes) is virtually
impossible. The Spammer’s Compendium [6] is an
excellent resource of information on different techniques
spammers use to avoid detection. Use of animated GIFs,
picture tilting, picture waving, and image slicing are
techniques described and demonstrated with real examples
collected by volunteers.
In this paper, we focus on trends in image spam
collected by KnujOn over the past three years. We have
examined content of the images on an almost daily basis
from April 2008 through January 2011. Some basic
statistics are provided along with several observations.
Finally, we delve into new techniques and trends in these
types of images, namely scraped images and malware
embedding.
Figure 2. Image spam potentially detectable by an OCR filter.
2 TRENDS
Most e-mail filters check for text-based spam but not
image spam. Spam filters look for phrases or words related
to spam, for example, Viagra, free, money, cash and so
forth. When the message is included in an image, an OCR
engine must read the content to detect these keywords. This is
time consuming for anti-spam software, yet spammers
readily find new ways to defeat OCR filters
by adding random noise to images, rotating contents, using
multipart GIF image formats, blurring the text, and adding
colorful backgrounds [7]. One exception may be more
mainstream advertising, such as
landscape lighting from Home Depot or Cisco routers.
Spammers have banked on image spam for over a
decade. Today, most spam is generated automatically and
dispersed with bots, thus the number of images generated is
limited only by the computational capacity of the
bot-infected computers. Figure 2 shows a high-quality image
spam example where words can easily be picked off by an
OCR filter. Figure 3 shows an example that could pass
through these filters as legitimate email. Other image spam
is simply photographs. Clearly pornography spammers
have an agenda, but the intent of other photographs is
elusive, with pictures of sailboats in a harbor or a quiet
city street.
Figure 3. Image spam undetectable by an OCR filter.
Image spam is usually composed of short text images,
URLs and hyperlinks. The content can be broadly
classified into the following categories:
• Advertisement/Marketing (Rolex watches, outdoor
  lighting, office furniture)
• Pharmaceuticals (prescription as well as non-prescription)
• Pornography
• Financial
• Freebies, coupons, software
• Politically motivated
According to Computer World [8], image spam hit a
peak in 2006-2007 with a dramatic decline at the end of
2008. However, with the shutdown of McColo, all spam
declined for a period of weeks until the spammers
re-established themselves with another host service. We’ve
certainly had no lack of data for research in image spam
identification since the alliance with KnujOn. Table 1
shows statistics for the number of unique images collected
per month from August 2008 into 2011. We used an MD5
checksum script to eliminate duplicate images. What we
found is that the computer-generated images were almost
all unique, while many of the photo-quality images (mainly
adult content) were eliminated because they were actually
identical. These statistics are only with respect to the
images that we have downloaded. We note that on some days
of the month, servers were down either in Vermont or
Ohio due to maintenance or weather. Since this started out
purely for the purpose of collecting enough images for
testing an artificial neural network [9], we were not
concerned with those missing days. We do feel, however,
that the analysis can be extrapolated to the overall picture.
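The duplicate-elimination step described above can be sketched as follows. This is a minimal illustration, not the script actually used for the corpus; the function names `md5_of_file` and `unique_images` and the flat one-folder-per-day layout are assumptions made for the example.

```python
import hashlib
from pathlib import Path


def md5_of_file(path: Path, chunk_size: int = 1 << 16) -> str:
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def unique_images(folder: Path) -> dict[str, Path]:
    """Map each distinct digest to the first file (by name) that produced it.

    Hypothetical helper: assumes every regular file in `folder` is an image.
    """
    seen: dict[str, Path] = {}
    for path in sorted(folder.iterdir()):
        if path.is_file():
            # setdefault keeps the first file seen for a digest and
            # silently skips byte-identical copies that come later.
            seen.setdefault(md5_of_file(path), path)
    return seen
```

Because MD5 only matches byte-identical files, a single changed pixel yields a different digest. That is consistent with the observation that machine-generated images almost never collapse under this check while re-sent photographs do.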
Month      Jan Feb Mar Apr May June July Aug Sep Oct Nov Dec
2008 2009  3660 5527 8021 3525 601 764 2268 1008 1277 10863 7840 12883 13329 8040 5883 4119
2010       3171 3451 16403 18462 7337 18141 6725 36003 9105 2233 2601 943
2011       717 781
Table 1. Unique image spam collected in 2008-2011
We manually checked images from January 2010
through February 2011 to see what trends are prevalent
during the year. To identify a trend, we noted the most
common type of spam in each month and then checked which
of these appear across most of the months of a year. Figure
4 shows a graph describing the frequency of occurrence of
47 categorical trends in a year. The category for photo
spam covers those photos that did not fit any other
category (the sailboat, for example). We put pictures of
seductive women in this category when they did not fit
clearly in the adult content category. As expected, the most
common types of images are pharmaceutical, photo spam,
and adult content (pornographic).
Figure 4. Frequency of trends
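The tallying procedure behind Figure 4 was done by hand, but its logic can be sketched in a few lines. The sketch below assumes each image has already been given a category label; the function names, the `top_n` parameter, and the label strings in the usage note are hypothetical and are not data from the corpus.

```python
from collections import Counter


def top_categories(labels, top_n=1):
    """Return the top_n most frequent category labels for one month."""
    return [name for name, _ in Counter(labels).most_common(top_n)]


def months_per_category(labels_by_month, top_n=1):
    """Count the months in which each category ranked among the top_n."""
    tally = Counter()
    for labels in labels_by_month.values():
        tally.update(top_categories(labels, top_n))
    return tally
```

For example, calling `months_per_category({"2010-01": ["pharma", "pharma", "adult"], "2010-02": ["photo", "photo", "pharma"]})` with made-up labels credits one month each to "pharma" and "photo". A category that dominates many months would stand out as a persistent trend, while one that dominates a single month points to a seasonal or event-driven burst.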
We observed that on special days of the year such as
New Year’s Day, Christmas or Valentine’s Day, spam
related to that specific event was noticeably
predominant. On 14-Feb-2011 we saw a large volume of
spam related to Valentine’s Day, such as that seen in
Figure 5.
Figure 5. Spam on Valentine’s Day 14 Feb 2011.
Similarly, the months of December and January contain
images of chocolates, inexpensive gifts, coupons, clothes
and such, all related to the holiday season of Christmas and
New Year’s.
In early 2010, we saw our first animal cruelty pictures,
and also our first “hate” images. These were particularly
disturbing to look at. The politically motivated hate images
first appeared in October 2010 and have continued through
January 2011. To date, only a small number have appeared
in the corpus, but the timing may not be coincidental to the
unrest, protests, and ensuing violence that marked world
events in January and February 2011.
3 SCRAPING IMAGES
Manual inspection of images shows that spammers
have devised a new way to prevent inspection of image
content by scraping the header part of the image. This
renders the image unreadable by a file reader although it
opens using a picture editor. The technique makes it
possible to successfully convey the intended message to the
user but prevents processing of images. Another technique
to tamper with the format of the image is to include
improper header information and/or incorrect color maps.
File readers expect these to be formatted properly and so fail
to read these spam images. Figure 6 is an example of such
an image found in our corpus.
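One simple way to flag images whose headers have been tampered with is to compare the leading bytes of a file against the signature its extension promises. The sketch below is only an illustration of that idea, not the tooling used on the corpus; it covers one kind of damage (a missing or wrong signature) and would not catch a bad color map. The names `SIGNATURES`, `sniff_format`, and `header_mismatch` are assumptions.

```python
# Leading "magic" bytes for three common image formats.
SIGNATURES = {
    "jpeg": (b"\xff\xd8\xff",),
    "png": (b"\x89PNG\r\n\x1a\n",),
    "gif": (b"GIF87a", b"GIF89a"),
}

# File extensions mapped to the format they claim to be.
EXTENSIONS = {"jpg": "jpeg", "jpeg": "jpeg", "png": "png", "gif": "gif"}


def sniff_format(data: bytes):
    """Return the format whose signature starts the data, or None."""
    for name, magics in SIGNATURES.items():
        if data.startswith(magics):
            return name
    return None


def header_mismatch(filename: str, data: bytes) -> bool:
    """True when a known image extension is not backed by its signature."""
    claimed = EXTENSIONS.get(filename.rsplit(".", 1)[-1].lower())
    return claimed is not None and sniff_format(data) != claimed
```

A file named `offer.gif` that begins with `GIF89a` passes, while the same name on data whose first bytes were stripped or overwritten is flagged for manual inspection.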
4 MALWARE EMBEDDING
Embedding malware in files is not a new concept. It is
used with MP3 files, video files, text documents and
others. We found images during three months (May,
September, and November 2010) with malware embedded
in them. These were the first occurrences since we began
downloading our corpus in April 2008. As yet, there is
little literature available on how these images are actually
used for malware embedding and how they attack their
victims.
In general, when a non-executable file such as a jpeg
containing an executable is double clicked, the
non-executable file is opened by its associated application. You
view the jpeg as a picture and nothing happens with the
embedded malware. Another component must be present,
such as a loader. This presumes that the host machine is
already infected. The second step is for the loader to
extract the code from the jpeg (or other image) and run it.
Typically, this works because the component on the
infected machine downloads the image from the web. In
our case, these images came wrapped in spam, so we are
certain that the complementary portion of the malware
infection was not present. We did perform some basic
reverse engineering and confirmed that the loader was not
present, so our machines were not infected.
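As a rough illustration of how an image might be screened for an embedded Windows executable, the sketch below searches the raw bytes for a DOS "MZ" header whose `e_lfanew` field (the 4-byte little-endian value at offset 0x3C) points at a "PE" signature. This is a heuristic offered under stated assumptions, not the reverse-engineering procedure used on the corpus: it would miss encrypted or otherwise encoded payloads, and the function name `find_embedded_pe` is hypothetical.

```python
import struct


def find_embedded_pe(data: bytes) -> list[int]:
    """Return offsets where an MZ header leads to a valid PE signature."""
    hits = []
    start = data.find(b"MZ")
    while start != -1:
        field = start + 0x3C  # location of e_lfanew in the DOS header
        if field + 4 <= len(data):
            (e_lfanew,) = struct.unpack_from("<I", data, field)
            pe_at = start + e_lfanew
            # A real PE file carries the 4-byte signature "PE", NUL, NUL here.
            if data[pe_at:pe_at + 4] == b"PE\x00\x00":
                hits.append(start)
        start = data.find(b"MZ", start + 1)
    return hits
```

Requiring the second signature keeps the check from firing on the two-byte sequence "MZ", which occurs by chance in compressed image data.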
Recently Microsoft's Malware Protection Center
discovered a variation of a malicious image which looks
like a simple png file [10, 11]. Amazingly, the image
displays instructions for the user to open it in MS Paint and
then resave it as an hta file, which is an HTML application.
Part of the image resembles random noise, but when the
file is resaved according to directions, it decompresses into
JavaScript. Now when this file is opened, the presumably
malicious payload is executed. Because it depends on the
user’s willing participation, this is a weak attempt at
spreading malware.
The curious user, however, will most likely regret it. Figure
7 shows an example of one of these images logged by
Microsoft [11]. Figure 8 shows the binary data of the
original image and of the subsequent hta file [11].
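The reason readable script text can hide behind what looks like noise is that PNG stores its pixel data as a zlib (DEFLATE) stream. The sketch below shows only that general property with a harmless placeholder string; it does not reproduce the sample analyzed in [11], and the variable names are assumptions.

```python
import zlib

# Harmless placeholder standing in for hidden script text (an assumption
# for illustration; no real payload is reproduced here).
placeholder = b"<script>/* placeholder */</script>\n" * 8

packed = zlib.compress(placeholder)   # the form a compressed stream holds
restored = zlib.decompress(packed)    # the form an uncompressed save exposes

print(len(placeholder), len(packed))  # the packed form is much shorter
print(b"<script>" in packed)          # is the marker visible while packed?
print(restored == placeholder)        # the round trip recovers the text
```

A keyword scan over the compressed file therefore has little chance of matching, which is why a filter would need to decode the image before inspecting it.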
Figure 6. Scraped image spam.
Figure 7. Image in .png form.
(a) Original image data.
(b) After resaving with an .hta extension.
Figure 8. Binary form of the image data.
This is a complicated technique and it requires the user to
actively participate in order to succeed. It is unlikely to
become widespread, although it presents possibilities
for hiding malware and data in image formats, then
dispersing them into the wild via email spam. One
could conjecture that hackers would use this technique
simply to share malware rather than execute direct attacks.
5 CONCLUSIONS
The study of image spam has provided us with
insightful observations. First, pharmaceuticals and
pornography are the main staples of unsolicited email. The
demand clearly must be there for these types of spam to
persist. However, spammers must feel that people can be
socially engineered to click on holiday images. There is
also a small shift in content as witnessed by the recent hate
messages and by the occurrence of image spam as malware
carriers. Understanding trends and types of images can be
useful in applying data mining techniques to develop
classification methods which can integrate this
knowledge.
Researchers interested in accessing this corpus of over
215,000 spam images (and growing) may contact the
authors. Because of the adult content, we do not make this
available over the Internet, but are happy to provide the
corpus on a DVD.
6 REFERENCES
1. KnujOn, http://www.knujon.com/, last accessed
March 2011.
2. M86 Security Lab, Spam Statistics for the week
ending March 13, 2011,
http://www.m86security.com/labs/spam_statistics.asp,
last accessed March 13, 2011.
3. D. Berlind, “Phishing: Spam that can’t be
ignored,” ZDNet, January 7, 2004,
http://www.zdnet.com/news/phishing-spam-thatcant-be-ignored/299295,
last accessed March 2011.
4. Twitter, http://www.twitter.com/, last accessed
March 2011.
5. J. Swartz, “Picture this: A sneakier kind of spam,”
USA Today, Jul. 23, 2006.
6. Spammer’s Compendium,
http://www.virusbtn.com/resources/spammerscompendium/index,
last accessed March 2011.
7. M. Kasser, “The Unwelcome Return of Image
Spam,” Tech Republic, March 2010,
http://www.techrepublic.com/blog/security/theunwelcome-return-of-image-spam/3248,
last accessed February 2011.
8. J. Cheung, “Researchers: image spam making
unexpected return,” Computer World, May 11,
2009.
9. P. Hope, J. R. Bowling, and K. J. Liszka,
“Artificial Neural Networks as a Tool for
Identifying Image Spam,” The 2009 International
Conference on Security and Management
(SAM'09), July 2009, pp. 447-451.
10. S. Beskerming, “Malware in Images, a Social
Engineering Example,” August 12, 2010,
http://www.beskerming.com/commentary/2010/08/12/527/Malware_in_Images,_a_Social_Engineering_Example,
last accessed February 2011.
11. Painting by Numbers, Microsoft Malware
Protection Center, Threat Response and research
blog, August 9, 2010,
http://blogs.technet.com/b/mmpc/archive/2010/08/09/painting-by-numbers.aspx,
last accessed February 2011.