Spam and Malware - CIS KDDM

Research of Alan Sprague:
Using Data Mining to Combat Spam, Phishing, and Malware
Department of Computer and Information Sciences
University of Alabama at Birmingham
June 2013
Univ. of Alabama @ Birmingham
Computer Forensics at UAB
 We offer BS and MS degrees with an emphasis on
forensics; the Criminal Justice Department participates in
these programs.
 Research center: CIA/JFR: http://thecenter.uab.edu
 Gary Warner
 Blog “Cyber Crime and Doing Time”: http://garwarner.blogspot.com
 My research
 Spam
 Phishing
 Malware
Outline
 This presentation will describe my research interests
in spam and malware.
 The next 9 slides: spam.
 Subsequent slides: malware.
Spam and the criminal web
 70-80% of all email in the world is spam.
 Spam enables various classes of antisocial activity:
 Spam advertises opportunities to buy counterfeit goods, for example pills (possibly adulterated).
 Spam delivers phish, which are commonly intended to steal credentials for banks and other financial institutions.
 Spam delivers malware.
Spam: Clustering, not Classification
 People commonly expect our research to be classification
of emails as ham or spam: desired or undesired. They
then expect us to help filter email, so that spam will not
be delivered.
 That is not our research. Instead, we start with a data
file that we expect is entirely spam, and our goal is to
cluster it into spam campaigns.
 This is an important goal, because after we understand
the various spam campaigns, we know which are the
largest, and we know what type of criminal activity each
campaign enables. This lets law enforcement
focus attention on the most harmful campaigns.
Background on Data Mining
 Data Mining studies the challenges and opportunities
offered by huge data files.
 Three methods are central to Data Mining.
 Clustering: group together records in the data file if
they resemble each other (without knowing the
“meaning” of any resulting group, called a cluster).
 Classification: assign each record to one of several
“classes”, each of which corresponds to a known type
of data.
 Frequent sets and association rules
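To make the contrast between clustering and classification concrete, here is a minimal Python sketch. The function names, the similarity test (same first two words of a subject line), and the keyword rule are illustrative assumptions, not the methods used in this research.

```python
def cluster(records, similar):
    """Clustering: group records that resemble each other, with no labels given.

    Greedy one-pass scheme (an assumption chosen for brevity): a record joins
    the first cluster whose first member it resembles, else starts a new one.
    """
    clusters = []
    for rec in records:
        for members in clusters:
            if similar(members[0], rec):
                members.append(rec)
                break
        else:
            clusters.append([rec])
    return clusters


def classify(record, rules, default="unknown"):
    """Classification: assign a record to one of several known classes."""
    for label, matches in rules:
        if matches(record):
            return label
    return default


def same_prefix(a, b):
    # Hypothetical similarity test: the first two words agree.
    return a.split()[:2] == b.split()[:2]


subjects = ["Order HCG online y5fh6", "Buy Cialis Online", "Order HCG online vfe3ih"]
clusters = cluster(subjects, same_prefix)

# Hypothetical known class, defined by a keyword rule.
rules = [("pills", lambda s: "HCG" in s or "Cialis" in s)]
labels = [classify(s, rules) for s in subjects]
```

Clustering discovers the two groups without being told what they mean; classification needs the class "pills" to be defined in advance.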
Our spam data
 Each day: 1 million spam messages
 Stored in the UAB Spam Data Mine
Preprocessing of spam data
 Parsing
 Subject
 Sender IP
 Sender name
 If body contains a URL: its domain and IP
 Word count of body
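The parsing step could look roughly like the following Python sketch, which uses only the standard library. The function names, the regular expressions, the choice of the first Received header as the source of the sender IP, and the optional `resolve` callback are all assumptions made for illustration; the actual parser behind the Spam Data Mine is not described here.

```python
import re
from email import message_from_string
from email.utils import parseaddr
from urllib.parse import urlparse

URL_RE = re.compile(r"https?://[^\s\"'<>]+")
BRACKETED_IP_RE = re.compile(r"\[(\d{1,3}(?:\.\d{1,3}){3})\]")


def body_text(msg):
    """Concatenate the text/plain parts of a message."""
    parts = []
    for part in msg.walk():
        if part.get_content_type() == "text/plain":
            payload = part.get_payload(decode=True)
            if payload is not None:
                parts.append(payload.decode("utf-8", errors="replace"))
    return "\n".join(parts)


def parse_spam(raw, resolve=None):
    """Extract the fields listed above from one raw email.

    resolve: optional callable mapping a domain to an IP (for example
    socket.gethostbyname); omitted by default so no network access is needed.
    """
    msg = message_from_string(raw)
    sender_name, sender_addr = parseaddr(msg.get("From", ""))
    # Assumption: the first Received header was written by our own server,
    # so its bracketed address is the IP that connected to us.
    received = msg.get_all("Received") or []
    ip_match = BRACKETED_IP_RE.search(received[0]) if received else None
    body = body_text(msg)
    domains = []
    for url in URL_RE.findall(body):
        host = urlparse(url).hostname
        if host and host not in domains:
            domains.append(host)
    return {
        "subject": msg.get("Subject", ""),
        "sender_ip": ip_match.group(1) if ip_match else None,
        "sender_name": sender_name,
        "sender_username": sender_addr.partition("@")[0],
        "url_domains": domains,
        "url_ips": [resolve(d) for d in domains] if resolve else [],
        "word_count": len(body.split()),
    }
```

Calling `parse_spam(raw)` on one message returns a dictionary with one entry per parsed field.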
Some spams, parsed
Subject                      Sender Name      Sender Username
Order HCG online y5fh6       EfrenGriffith    artq.com
Order HCG online vfe3ih      Victor           musicradio.com
Pfizer Inc Discount 43681    lefley           uab.edu
Buy Cialis Online            Tam Smith        adeptis.com
Your LinkedIn blocked        John Fial        irs.gov
Goal, for the Spam Data Mine
 Cluster each day’s emails to find the largest spam
campaigns, and then look for clues about where they are
coming from.
 Relate each day’s clusters to the previous day’s
clusters. Any new types of spam are considered
“emerging threats”.
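A minimal Python sketch of these two goals follows. It assumes that emails sharing a subject or a URL domain belong to the same campaign, and that a cluster counts as "emerging" when its Jaccard similarity to every cluster from the previous day is below a threshold. The field names, the linking rule, and the threshold of 0.5 are assumptions for illustration, not the clustering method actually used.

```python
from collections import defaultdict


def cluster_emails(emails, keys=("subject", "url_domain")):
    """Group emails that share a value for any key (connected components)."""
    parent = list(range(len(emails)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    first_seen = {}
    for i, email in enumerate(emails):
        for key in keys:
            value = email.get(key)
            if value is None:
                continue
            j = first_seen.setdefault((key, value), i)
            parent[find(i)] = find(j)  # merge with the first email seen with this value

    groups = defaultdict(list)
    for i in range(len(emails)):
        groups[find(i)].append(i)
    return sorted(groups.values(), key=len, reverse=True)  # largest campaign first


def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0


def emerging_clusters(today, yesterday, threshold=0.5):
    """Today's clusters (as feature sets) with no close match yesterday."""
    return [c for c in today if all(jaccard(c, p) < threshold for p in yesterday)]
```

Here each cluster passed to `emerging_clusters` is represented by a set of features, for example the URL domains seen in its emails.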
Largest Cluster on a particular day
[Figure: an email screenshot linked to the domain names and IP addresses found in the cluster, grouped into Subgroups 1, 2, and 3.]
Domain names: sincejust.com, agethough.com, numbertook.com, rolloccur.com, xtpnttm.cn, vlxejzg.cn, Ihusepod.cn, tyinoriv.cn, aoibejp.cn, curbdta.cn
IP addresses: 110.52.8.253, 124.42.91.162, 60.191.239.150, 88.80.16.161, 91.213.33.10, 203.93.208.86, 218.75.144.6, 220.196.59.35, 159.226.7.162
Why Is This Work Useful?
 Leading spammers use a large number of domains to
counter domain blacklisting.
 Shutting down those domains and their hosting servers can
cripple spammers’ ability to conduct spam-related
cyber crimes.
 Further investigation of domains and IP addresses may
lead to the identities of spammers.
Transition
 Spam clustering is an ongoing project. A different
thrust is the study of malware. I describe two
methods of static analysis of malware: one using blocks
and jumps, and one using strings.
Malware
 What is malware?
A program that performs actions that the user
does not want
 Executable file, i.e., machine code
 Each day, we add 5000 new malwares to our database
 Two types of analysis:
 Static analysis
 Dynamic analysis

Goals
 Malwares belong to families, such as Zeus, Reveton,
and Perfect Keylogger.
 Eventual goal: Put each malware into its family.
 Current goal: Cluster malwares, based on their
strings.
Static Analysis, using Blocks and Jumps
 Method to encode malwares:
Jumps (e.g. subroutines, and subroutine calls)
 Disassemble each malware, split it into “blocks”,
compute a hash value for each block. Also find each
jump, and write which block it is from and which it is
to.
 Result: each malware is a directed graph, with blocks as nodes and jumps as edges.
 When malwares are encoded this way, malwares will be
clustered together if their graphs are similar.
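The encoding can be sketched in Python as follows. The sketch assumes that disassembly and block splitting have already been done by an external tool, so each malware arrives as a list of blocks (lists of instruction strings) plus a list of jumps between block indices. The SHA-256 hash and the Jaccard similarity over nodes and edges are assumptions chosen for illustration; the hash function and graph-similarity measure used in the research are not specified here.

```python
import hashlib


def encode_as_graph(blocks, jumps):
    """Encode one malware as (nodes, edges).

    blocks: list of blocks, each a list of instruction strings.
    jumps:  list of (from_index, to_index) pairs between blocks.
    Nodes are block hashes; edges are (from_hash, to_hash) pairs.
    """
    def block_hash(block):
        text = "\n".join(block).encode("utf-8")
        return hashlib.sha256(text).hexdigest()[:16]

    hashes = [block_hash(b) for b in blocks]
    nodes = set(hashes)
    edges = {(hashes[src], hashes[dst]) for src, dst in jumps}
    return nodes, edges


def graph_similarity(g1, g2):
    """Jaccard similarity over the combined node and edge sets."""
    items1 = g1[0] | g1[1]
    items2 = g2[0] | g2[1]
    union = items1 | items2
    return len(items1 & items2) / len(union) if union else 1.0


# Hypothetical disassembly output for two samples that differ in one block.
blocks_a = [["push ebp", "mov ebp, esp"], ["call 0x401000"], ["ret"]]
blocks_b = [["push ebp", "mov ebp, esp"], ["call 0x401000"], ["xor eax, eax", "ret"]]
jumps = [(0, 1), (1, 2)]

graph_a = encode_as_graph(blocks_a, jumps)
graph_b = encode_as_graph(blocks_b, jumps)
similarity = graph_similarity(graph_a, graph_b)
```

Because blocks are identified by hash, two samples that share code produce overlapping nodes and edges, which is what lets a clustering step place them together.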

Static Analysis, using strings of printable characters
 A string is a run of at least 4 printable characters, ending with \0. Examples:
cxczxczxczxcc
Enter
%d-%02d-%02d_%02d-%02d-%02d-%d
JPEG Image saved successfully!^
Screenshot saving cancelled because of logging disabled.^
COXJPEGFile::fill_input_buffer : Catching CFileException^
%d-%d-%d_%d-%d-%d
_controlfp
1.12782
@.rsrc
Password:
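Extracting such strings from an executable can be sketched with a single regular expression over the raw bytes. The sketch assumes that "printable" means the ASCII range 0x20 to 0x7e; the function and pattern names are illustrative.

```python
import re

# A run of at least 4 printable ASCII bytes, immediately followed by a NUL byte.
STRING_RE = re.compile(rb"([\x20-\x7e]{4,})\x00")


def extract_strings(data: bytes):
    """Return the printable strings found in the raw bytes of one file."""
    return [match.decode("ascii") for match in STRING_RE.findall(data)]


# Made-up byte sequence standing in for the contents of an executable.
sample = b"\x01\x02Enter\x00\xffabc\x00Password:\x00_controlfp\x00xy"
strings = extract_strings(sample)
```

In this made-up sample, `abc` is skipped because it is shorter than 4 characters, and `xy` is skipped because it is too short and not terminated by \0.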
Data File for 1 Day
Each row is the list of strings in one malware.
A sample file of 5000 malwares looks like:
m1: cxczxczxczxcc, Enter, _controlfp, ….
m2: …………….
m3: …………….
m4: …………….
.
.
.
m5000: ………….
Frequent sets
 A typical application is retail data.
 Data File: Purchases at a large store.
 Each record: List of purchases of one customer.
 Question: Which items are often bought together?
 Our application: malware.
 Our data file: Strings in malwares.
 Each record: List of strings of one malware.
 Question: Which strings are often found together?
 Dual question: Which malwares have many strings in common?
Frequent sets: Tiny example
 6 malwares (so 6 records), 4 strings.
 The malwares:
 a, b, c, d
 b, c, d
 a, c, d
 a, b
 c, d
 b, d
 Incidence matrix
a b c d
1 1 1 1
0 1 1 1
1 0 1 1
1 1 0 0
0 0 1 1
0 1 0 1
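The incidence matrix for this tiny example can be built in a few lines of Python. The function name is illustrative; rows follow the order of the records and columns follow the sorted order of the strings.

```python
def incidence_matrix(records):
    """Return (strings, matrix): matrix[i][j] is 1 if record i contains strings[j]."""
    strings = sorted(set().union(*records))
    matrix = [[1 if s in rec else 0 for s in strings] for rec in records]
    return strings, matrix


# The six malwares of the tiny example, each as its list of strings.
records = [
    ["a", "b", "c", "d"],
    ["b", "c", "d"],
    ["a", "c", "d"],
    ["a", "b"],
    ["c", "d"],
    ["b", "d"],
]
strings, matrix = incidence_matrix(records)
```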
Frequent sets: Tiny example
 Strings a, c form a frequent set (records r1 and r3 contain both).
 But a, c is not maximal, because d is also in both records.
 Incidence matrix (* marks the entries for a, c, d in r1 and r3)
     a   b   c   d
r1  *1   1  *1  *1
r2   0   1   1   1
r3  *1   0  *1  *1
r4   1   1   0   0
r5   0   0   1   1
r6   0   1   0   1
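The check on this slide can be reproduced with a short Python sketch: find the records that contain a set of strings, then intersect those records to see which further strings they all share. The function names are illustrative.

```python
# The six records of the tiny example (index 0 is r1, index 5 is r6).
records = [
    {"a", "b", "c", "d"},
    {"b", "c", "d"},
    {"a", "c", "d"},
    {"a", "b"},
    {"c", "d"},
    {"b", "d"},
]


def supporting_rows(itemset, records):
    """Indices of the records that contain every string in itemset."""
    return [i for i, rec in enumerate(records) if set(itemset) <= rec]


def closure(itemset, records):
    """Intersection of all records containing itemset (itemset itself if none do)."""
    rows = supporting_rows(itemset, records)
    if not rows:
        return set(itemset)
    return set.intersection(*(records[i] for i in rows))


rows = supporting_rows({"a", "c"}, records)         # the records containing a and c
extra = closure({"a", "c"}, records) - {"a", "c"}   # strings shared by all of them
```

A non-empty `extra` means the set can be extended without losing any supporting record, which is why a, c is not maximal.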
Closed frequent sets
 A frequent set is closed if it equals the intersection of the records containing it.
 Alternate definition: a closed set is a maximal all-ones submatrix.
 Since rows and columns play the same role in this, one can let malwares and strings exchange roles.
 Ex: Incidence matrix
     a   b   c   d
r1  *1   1  *1  *1
r2   0   1   1   1
r3  *1   0  *1  *1
r4   1   1   0   0
r5   0   0   1   1
r6   0   1   0   1
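For the tiny example, the closed frequent sets can be enumerated by brute force, directly from the first definition above. This Python sketch tries every subset of strings, so it is exponential in the number of strings and only suitable for illustration; it is not the state-of-the-art algorithm, and the function name and the support threshold of 2 are assumptions.

```python
from itertools import combinations


def closed_frequent_sets(records, min_support):
    """Map each closed frequent set to the indices of the records containing it."""
    records = [frozenset(r) for r in records]
    items = sorted(set().union(*records))
    result = {}
    for size in range(1, len(items) + 1):
        for candidate in combinations(items, size):
            cand = frozenset(candidate)
            rows = [i for i, rec in enumerate(records) if cand <= rec]
            if len(rows) < min_support:
                continue  # not frequent
            # Closed: the set equals the intersection of the records containing it.
            if frozenset.intersection(*(records[i] for i in rows)) == cand:
                result[cand] = rows
    return result


# The six records of the tiny example (index 0 is r1, index 5 is r6).
records = [
    {"a", "b", "c", "d"},
    {"b", "c", "d"},
    {"a", "c", "d"},
    {"a", "b"},
    {"c", "d"},
    {"b", "d"},
]
closed = closed_frequent_sets(records, min_support=2)
```

Each entry of `closed` is one maximal all-ones submatrix: the key gives its columns (strings) and the value gives its rows (malwares).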
Closed Frequent Sets for Malware Analysis
 We wanted closed frequent sets with threshold 30.
 The lowest threshold the state-of-the-art algorithm could
handle was 1000.
 By being willing to discard strings that appear
more than 10 times, we recently managed
threshold 20.
 This work is ongoing.
The end