Chi-Squared Distance and Metamorphic Virus Detection

Annie H. Toderici∗ and Mark Stamp†
Abstract
Metamorphic malware changes its internal structure with each generation, while
maintaining its original behavior. Current commercial antivirus software generally
scans for known malware signatures and therefore cannot detect metamorphic malware that sufficiently morphs its internal structure.
Machine learning methods such as hidden Markov models (HMM) have shown
promise for detecting hacker-produced metamorphic malware. However, previous
research has shown that it is possible to evade HMM-based detection by carefully
morphing with content from benign files. In this paper, we combine HMM detection
with a statistical technique based on the chi-squared test to build an improved detection method. We discuss our technique in detail and provide experimental evidence
to support our claim of improved detection.
Keywords: metamorphic malware, chi-squared statistics, hidden Markov models,
malware detection
1 Introduction
Malicious software attacks can cause extensive financial damage. For example, MyDoom, a spam-mailing malware, caused an estimated $38 billion in damage while
the damage due to Conficker, a password-stealing botnet, has been estimated at $9.1
billion [7].
Antivirus software aims to detect malware. Signature detection is the approach most commonly used by commercial antivirus products [4]. This technique relies on a
database of known signatures for viruses and other malware, and attempts to match
these signatures against files on a user’s computer. If a match is found, the file is
likely infected by the corresponding malware. Traditionally, this method has been
effective for detecting most malware. The major weakness of signature detection is
that it cannot detect previously unknown viruses.
∗Department of Computer Science, San Jose State University
†Department of Computer Science, San Jose State University: [email protected]
As a way to evade signature detection, malware writers employ code obfuscation
methods [23]. Metamorphic viruses apply code obfuscation techniques at each generation and, consequently, a well designed metamorphic virus cannot be detected using
standard signatures [6].
Previous research has shown that machine learning methods such as hidden Markov
models (HMMs) are effective at detecting hacker-produced metamorphic viruses [29].
However, such a detection strategy can be defeated by inserting a sufficient amount
of code from benign files into each virus—at some point, the HMM classifier cannot
reliably distinguish such a virus from a benign file. An experimental metamorphic
virus generator was previously developed to exploit this strategy [16]. It was found
that the HMM-based approach is very robust when random changes are made to the
viral code, or when small blocks of code are copied from benign files into the viruses.
However, the HMM technique is relatively fragile when code is copied from benign
files in the form of contiguous blocks (e.g., entire subroutines). The motivation for
the research presented here is to improve on this weakness in the HMM detector.
In this paper, we analyze the utility of a statistical chi-squared test (or χ2 test)
for malware detection, as suggested by the theoretical framework in [11]. Our results
show that the chi-squared statistic, computed on instruction opcode frequencies, can
significantly improve virus detection—whether code is copied from benign files in the
form of many small segments or a contiguous block has little effect on the chi-squared
statistic. We show that by combining a chi-squared test with an HMM detector, we
can improve on the results obtained by either when used individually.
This paper is organized as follows. Section 2 gives relevant background information on malware, malware detection, code obfuscation, hidden Markov models,
and the chi-squared statistical test. In Section 3, we discuss our proposed detection
techniques, while Section 4 covers the performance criteria we use to quantify our
results. Then, in Section 5 we present our experimental results. Finally, conclusions
and suggestions for future work are given in Section 6.
2 Background
In this section, we briefly discuss background material that is relevant to the research
in this paper. First, we cover malware, with the focus on metamorphic malware and
code obfuscation techniques. Then we consider malware detection, with the emphasis
on a machine learning technique based on hidden Markov models. Finally, we discuss
a chi-squared statistical test and its potential application to malware detection.
2.1 Malware
As the name suggests, malware is software designed specifically for malicious purposes. Malware is often classified into various categories, including virus, worm,
trojan horse, spyware, adware, and botnet.
A virus is often defined as malware that relies on passive propagation, whereas a
worm uses active means [21]. That is, a worm actively propagates itself (typically,
via a network), while a virus requires outside assistance (e.g., an infected USB key
inserted into a computer). However, others define a virus as parasitic malware, in
contrast to a worm that is stand-alone code [4]. Here, we use the term “virus”
generically to refer to malware.
Next, we consider encrypted, polymorphic, and metamorphic viruses. These categories can be viewed as a hierarchy, employing increasingly sophisticated strategies
designed to evade signature detection.
2.2 Encrypted Viruses
Encrypted viruses encrypt their body using a different key at every infection. While
encryption is an effective means of evading signature detection, an encrypted virus
must include a plaintext decryptor routine. Therefore, it is possible to detect this
class of viruses by analyzing the decryptor to obtain a signature.
2.3 Polymorphic Viruses
Polymorphic viruses are encrypted viruses that obfuscate their decryption code by
mutating it. In practice, there are usually a relatively small number of decryptors,
making signature detection a viable option. In addition, if a part of the code looks
“suspicious,” we can execute it in a virtual machine. If the suspicious code is a
polymorphic virus, it will decrypt itself, at which point standard signature detection
can succeed.
2.4 Metamorphic Viruses
Metamorphic viruses change their internal structure at each generation. Unlike polymorphic viruses, the entire virus body is morphed while still maintaining its original behavior. If the morphing is sufficient, no common signature is available and hence
signature-based antivirus software cannot reliably detect well-designed metamorphic
viruses. Note that no encryption is necessary for a metamorphic virus.
Although metamorphic viruses are difficult to detect, fortunately, they have also proven difficult for malware writers to implement. One difficulty is that the malware
must mutate sufficiently, but yet the size of the code must not increase uncontrollably. Another concern is that the viruses must be sufficiently similar to benign code
to avoid detection by similarity-based or heuristic methods. Malware writers have
not yet successfully overcome these obstacles [29, 30]. In fact, most metamorphic generators introduce very little metamorphism, and those that do, produce variants that are easily distinguished from benign code. Nevertheless, it is possible to produce relatively strong, practical metamorphic generators [16, 19, 20], so metamorphic
detection is a worthy research problem.
2.4.1 Virus Obfuscation Techniques
Code obfuscation techniques are applied when creating metamorphic viruses. These
techniques can be used to create a vast number of distinct copies that have the same
behavior but different internal structure. In this section, we briefly discuss some
elementary code morphing techniques.
Register Renaming: Register renaming is one of the oldest and simplest techniques
used in metamorphic generators. For example, Figure 1 provides a code snippet where
the following substitutions have been made:
eax −→ ebx
ebx −→ ecx
ecx −→ eax
In spite of its simplicity, this technique does change the binary pattern in morphed
executable files. However, register renaming does not affect the opcode sequence.
Furthermore, it is relatively easy to detect register renaming by using signatures
with wildcard strings [24].
MOV eax, 4          MOV ebx, 4
ADD ebx, eax  −→    ADD ecx, ebx
SUB ecx, 1          SUB eax, 1
Figure 1: Register renaming example.
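The substitution above can be sketched as a single-pass rewrite over the instruction text. This is only an illustration of the technique; the helper function and register map are ours, not part of any real metamorphic generator:

```python
import re

# Register map from the example above: eax -> ebx, ebx -> ecx, ecx -> eax
RENAME = {"eax": "ebx", "ebx": "ecx", "ecx": "eax"}

def rename_registers(instructions):
    """Rewrite register names in one pass so substitutions cannot chain."""
    pattern = re.compile(r"\b(eax|ebx|ecx)\b")
    return [pattern.sub(lambda m: RENAME[m.group(1)], ins) for ins in instructions]

print(rename_registers(["MOV eax, 4", "ADD ebx, eax", "SUB ecx, 1"]))
# ['MOV ebx, 4', 'ADD ecx, ebx', 'SUB eax, 1']
```

Because the regular expression visits each register token exactly once, eax is mapped to ebx without subsequently being remapped to ecx.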
Equivalent Instruction Replacement: The instruction sets of modern processors include numerous equivalent instructions (or groups of instructions). For example, MOV
eax, 0, is equivalent to SUB eax, eax, and XOR eax, eax. Figure 2 illustrates an
example where a single instruction is equivalent to a sequence of instructions.
MOV eax, 1  −→   SUB eax, eax
                 ADD eax, 1
Figure 2: Equivalent instructions substitution.
Instruction Reordering: Instruction reordering consists of transposing instructions
that do not depend on the output of previous instructions. When instructions are
reordered, signatures involving the instructions can be broken, but code execution is
unaffected. Figure 3 shows an example of instruction reordering.
MOV eax, 4          ADD ecx, 1
ADD ecx, 1   −→     MOV ebx, 0
MOV ebx, 0          MOV eax, 4
Figure 3: Instruction reordering.
Junk Code: Junk code is any code that has no effect on program execution. Junk
code might be executed, with no effect on the program, or it might be “dead code”
that is never executed. Junk code is often inserted randomly throughout the body of
a metamorphic virus during the morphing process. The intention is that such junk
code will break up signatures.
A trivial example of junk code is the NOP instruction, which does nothing to affect
the CPU state. Other examples of junk code include MOV eax, eax and ADD eax,
0 and SUB eax, 0.
2.5 Malware Detection
In this section, we briefly discuss signature detection and heuristic detection. Other
detection strategies are used, but these are the most common today.
2.5.1 Signature Detection
Commercial antivirus software typically uses signature detection to identify malicious
files. A signature is created by analyzing the binary code of a virus, and selecting
a sequence of bits that is, ideally, unique to that virus [21]. The signature must be
long enough so that it is unlikely to appear in uninfected files. For example, the
Chernobyl/CIH virus can be detected using the signature [28]
E800 0000 005B 8D4B 4251 5050 0F01 4C24 FE5B 83C3 1CFA 8B2B .
String matching algorithms are applied when scanning for virus signatures. Examples of such algorithms include Aho-Corasick, Veldman, and Wu-Manber [4]. The
Aho-Corasick algorithm scans for exact matches, so a slight variation will escape detection [1]. On the other hand, Veldman and Wu-Manber allow for the use of wildcard
search strings [23].
Signature scanning is relatively easy and efficient, but the virus database needs
to be kept up to date. The crucial weakness of signature detection is that signatures
are unlikely to detect previously unseen malware, including metamorphic variants.
2.5.2 Heuristic Detection
Heuristic analysis can be used to detect unknown viruses and variants of known
viruses. Heuristic analysis can be based on a static or dynamic approach, or a combination of the two. An example of static heuristic analysis would be to look for opcode
sequences that match a general pattern found in viral code. For dynamic analysis,
the code might be executed in a virtual machine to watch for suspicious behavior.
An example of such suspicious behavior is opening an executable file for writing [4].
2.6 Hidden Markov Models
Machine learning can be defined as the study of computer algorithms that improve through experience [17]. Examples of such techniques include Naïve Bayes [18], decision trees [15],
hidden Markov models [22], and many other statistical learning methods [11].
A hidden Markov model (HMM) is a statistical modeling method that has been
used in speech recognition, bioinformatics, mouse gesture recognition, credit card
fraud detection, and computer virus detection research. It is widely used because it
is simple and computationally efficient [22].
The popularity of HMMs for virus detection stems from the fact that a program
can be represented as a sequence of instructions. The CPU executes the instructions
one at a time, which implies that programs can be treated as time series, which is an
ideal situation for an HMM.
For a Markov model of order one, we assume that the sequential data can be
modeled based solely on the current state, with no memory—what happens next in
the sequence depends only on the current state. In a typical Markov chain, the states
are fully observed. In the case of a hidden Markov model, as its name implies, the
states are not directly observed, as they are “hidden.” We can only estimate these
states while observing sequences of data [22].
2.6.1 HMM Notation
To describe an HMM, we will use the notation summarized in Table 1. The main
components of the HMM are the state transition probability matrix A, the observation
probability matrix B (which gives the likelihood of an observation given the state),
the initial state distribution π, the observation sequence O, and the hidden states X.
We denote the HMM model as λ = (A, B, π). The matrices A and B and the vector π are row stochastic, that is, each row is a probability distribution. The probability πxi is
the initial probability of starting in state Xi , while axi ,xi+1 is the probability of
transitioning from state Xi to state Xi+1 . Finally, bxi (Oi ) is the probability of
observing Oi when the underlying Markov process is in state Xi .
Figure 4 illustrates a generic HMM. The state transitions in Figure 4 are placed
above the dashed line to indicate that they are “hidden.” Some information about
the transitions can be deduced by analyzing the observations, since the observation
sequence is related to the states of the Markov process by the matrix B.
The utility of HMMs derives largely from the fact that there are efficient algorithms to solve the following three problems.
Table 1: HMM symbols and their meanings.

Symbol   Description
T        The length of the observation sequence
N        The number of states in the model
M        The number of distinct observation symbols
O        The observation sequence, O = {O0 , O1 , . . . , OT −1 }
X        The hidden states, X = {X0 , X1 , . . . , XT −1 }
A        The state transition probability matrix
B        The observation probability matrix
π        The initial state distribution
λ        The hidden Markov model: λ = (A, B, π)
Figure 4: Hidden Markov Model [22].
1. Given a model λ and an observed sequence O, determine P (O|λ). That is, we can score a sequence of observations against a given model.
2. Given a model λ and an observed sequence O, we can find the most likely hidden
state sequence {X0 , X1 , . . . , XT −1 }.
3. Given an observation sequence O, we can train a model λ to maximize the probability of O. That is, we can train a model to fit the data.
In this research, we apply the solution to the third problem to train an HMM to
fit opcode sequences extracted from a metamorphic family. Then we use the solution
to problem one to score files (based on extracted opcode sequences) and classify each
as either belonging to the metamorphic family or not.
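As a concrete illustration of the scoring step (problem 1), the forward algorithm computes P (O|λ) efficiently. The two-state model below uses invented toy values, not one of our trained models:

```python
def forward_probability(A, B, pi, observations):
    """Compute P(O | lambda) with the forward algorithm.

    A[i][j]: probability of a transition from state i to state j.
    B[i][k]: probability of emitting observation symbol k in state i.
    pi[i]:   probability of starting in state i.
    """
    n_states = len(pi)
    # Initialization: alpha_0(i) = pi_i * b_i(O_0)
    alpha = [pi[i] * B[i][observations[0]] for i in range(n_states)]
    # Induction over the remaining observations
    for obs in observations[1:]:
        alpha = [
            sum(alpha[i] * A[i][j] for i in range(n_states)) * B[j][obs]
            for j in range(n_states)
        ]
    # Termination: P(O | lambda) is the sum of the final alphas
    return sum(alpha)

# Toy two-state model over two observation symbols (invented values)
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
pi = [0.5, 0.5]
print(forward_probability(A, B, pi, [0, 1, 0]))  # ≈ 0.099375
```

In practice, scores are computed on much longer opcode sequences, where log-space arithmetic is used to avoid underflow; see [22] for details.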
For more information on HMMs in general, including detailed examples and
pseudo-code, see [22]. For additional information on the application of HMMs to
the malware detection problem, see, for example, [16, 30].
2.7 Chi-Squared Distance
In this section, we first discuss chi-squared tests in general terms. Then we consider
the use of such a test for malware detection.
2.7.1 CSD Notation
Suppose that Y is a statistical variable from a distribution under observation. Our
goal is to estimate the main characteristics of the probability distribution P of Y . Let
Y1 , Y2 , . . . , Yn be a random sample of elements from this distribution. These samples
reveal some information about the unknown parameter, say, θ of the probability
distribution P .
Let f (Y1 , Y2 , . . . , Yn ) denote a function that is used to estimate θ; we refer to f as
an estimator function. An estimator function can be used to compute the probability
distribution of a given sample.
The parameter space from which θ is drawn can be arbitrary, but will typically
depend on the problem and the choice of f . For example, we might have θ ∈ N,
where N is the set of natural numbers, or θ ∈ Rk , for some k, where R is the set of real
numbers. For example, if θ represents the parameters of a normal distribution, then θ
will have two real-valued dimensions, corresponding to the mean µ and variance σ 2 .
For the purpose of malware analysis, we restrict the parameter space to k-dimensional
vectors with elements from the set of natural numbers, that is, θ ∈ Nk . The general
form of P is assumed to be known.
Statistical testing is used to decide which hypothesis best fits an observed sequence of samples (Y1 , Y2 , . . . , Yn ). In statistical testing, initial hypotheses are proposed, which are either accepted or rejected after measuring the likelihood of these
hypotheses with respect to the probabilistic law P of Y .
The initial hypothesis is denoted H0 and is referred to as the null hypothesis. The
alternative hypothesis is denoted as H1 . The tests that we use will either accept or
reject the null hypothesis. To construct such a test, we determine an estimator that
separates possible test values into disjoint sets consisting of an acceptance region and
rejection region. Then given a sample, its estimator value is computed and compared
with the threshold to determine whether we accept or reject the null hypothesis.
There are two types of errors associated with this detection problem.
• A type I error occurs when we reject the null hypothesis, although the null
hypothesis is correct. The probability of such an error is denoted as α and it
corresponds to the false positive rate.
• A type II error occurs when we accept the null hypothesis, although the null
hypothesis is false. This probability corresponds to the false negative rate.
2.7.2 Statistical Test for Malware Detection
To formalize the virus detection framework, we adopt notation from Chess and
White [5]. A given antivirus system uses a detection algorithm D, which it applies to a program p. The goal of the algorithm is to determine whether p is infected
by a particular virus V . Then D(p) should return true if and only if the program p
is infected by the virus V . Filiol and Josse [11] further expand on this approach by
providing a statistical framework to describe the detection process.
For malware analysis, we consider opcode instruction frequencies. We model
the statistics of viruses from a given metamorphic family. Ideally, the spectrum of
instructions from these family viruses would be obtained by analyzing instruction
frequencies of all possible family viruses. However, for any realistic metamorphic
generator, an exact spectrum is impossible to compute. As an approximation, we
rely on instruction frequencies from a representative set of family viruses.
The related problem of modeling compilers is considered in [3]. Most compilers use
a relatively small subset of all possible instructions, different compilers use different
subsets, and the instructions that are common between the two will appear with
different frequencies. We can make use of such observations to develop an estimator
function which can be used to classify whether a given executable was generated
with a particular compiler. This is analogous to the metamorphic detection problem,
where a metamorphic generator can be considered a type of “compiler.”
The formal definition of the spectrum of a program is
    spectrum(C) = (Ii , ni )1≤i≤c    (1)
where, in our case, C represents a particular metamorphic generator, Ii represents
the ith instruction, and ni is the frequency of instruction i. Here, c denotes the total number of unique instructions that may be output by the generator. For the 80x86
architecture, the total number of possible instructions is 501 [14], but, for a given
metamorphic generator, only a relatively small subset of these instructions are likely
to occur.
In our experiments, the spectrum in equation (1) represents the expected value
of opcodes from a typical family virus. Consequently, for any family virus, we expect
to observe instruction i at approximately frequency ni .
Given an executable file that we want to classify, we first compute its spectrum.
That is, we compute the instruction frequencies observed in this file—we denote the
observed frequency for instruction i as n̂i . Then we rely on a statistical test to
determine whether we should accept or reject the null hypothesis.
To make this process concrete, we need to specify the null hypothesis, an estimator
function, and the decision threshold. The null hypothesis and alternative hypothesis
are specified as
H0 : n̂i = ni , 1 ≤ i ≤ c
H1 : n̂i ≠ ni , 1 ≤ i ≤ c
respectively. Note that the null hypothesis states that the observed frequencies match
the expected frequencies. If these frequencies are the same, then the suspected file
is likely a family virus, and the null hypothesis will be accepted. However, if the
frequencies are significantly different, then the alternative hypothesis will be accepted,
which means that the suspect file is classified as benign.
However, we cannot expect an exact match of any given spectrum to the expected
spectral values. Filiol and Josse [11] suggest using Pearson’s χ2 statistical test. This
test, which we denote as D2 , is given by

    D2 = Σ_{i=1}^{c} (n̂i − ni )2 / ni .
Pearson’s χ2 statistic is commonly used to determine whether the difference between the expected and observed data is significant. The decision threshold is obtained by comparing the estimator value given by D2 to the χ2 (α, c − 1) distribution,
that is, a chi-squared distribution with c − 1 degrees of freedom and a type I error
rate of α. Typically, the type I error rate is set to 0.05, which means that the test
tolerates no more than 5% of the virus files being misclassified as benign.
Using Pearson’s χ2 statistic, the null hypothesis and the alternative hypothesis
are
    H0 : D2 ≤ χ2 (α, c − 1)
    H1 : D2 > χ2 (α, c − 1)    (2)
The bottom line here is that the χ2 statistical test gives us a practical means to
determine whether an observed spectrum matches a given distribution.
To summarize, the following steps need to be performed to implement the Pearson χ2
test for malware detection.
1. Specify the null hypothesis H0 and the alternative hypothesis H1 .
2. Choose a significance level (we use α = 0.05).
3. Compute the estimator value D2 given the opcode frequencies in the file under
consideration (that is, n̂i for i = 1, 2, . . . , c) and the spectrum that corresponds
to the null hypothesis (that is, ni for i = 1, 2, . . . , c).
4. Determine whether to accept or reject the null hypothesis H0 based on equation (2).
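The steps above can be sketched as follows. The opcode names and helper functions are illustrative; in practice the critical value χ2 (α, c − 1) would be taken from a table or a statistics library:

```python
from collections import Counter

def chi_squared_statistic(observed_opcodes, family_spectrum):
    """Step 3: compute D^2 over normalized opcode frequencies."""
    counts = Counter(observed_opcodes)
    total = len(observed_opcodes)
    expected_total = sum(family_spectrum.values())
    d2 = 0.0
    for opcode, expected_count in family_spectrum.items():
        n_i = expected_count / expected_total    # expected relative frequency
        n_hat = counts.get(opcode, 0) / total    # observed relative frequency
        d2 += (n_hat - n_i) ** 2 / n_i
    return d2

def is_family_virus(observed_opcodes, family_spectrum, critical_value):
    """Step 4: accept H0 (family virus) iff D^2 <= chi^2(alpha, c - 1)."""
    return chi_squared_statistic(observed_opcodes, family_spectrum) <= critical_value

# A file whose opcode mix exactly matches the spectrum scores D^2 = 0
family_spectrum = {"MOV": 7, "PUSH": 10, "POP": 3}   # illustrative expected counts
perfect_match = ["MOV"] * 7 + ["PUSH"] * 10 + ["POP"] * 3
print(chi_squared_statistic(perfect_match, family_spectrum))  # 0.0
print(is_family_virus(perfect_match, family_spectrum, critical_value=5.991))  # True
```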
2.7.3 Example
To illustrate the statistical test discussed above, we consider a simplified problem involving only three instructions Ii1 , Ii2 , Ii3 . Suppose that for the family viruses under consideration, these instructions have frequencies ni1 , ni2 , and ni3 , respectively. The spectrum for
this example appears in Table 2.
Table 2: Example family virus spectrum.

Instruction   Opcode   Frequency ni
i1            MOV      7
i2            PUSH     10
i3            POP      3
Table 3: Example frequencies of suspect program.

Instruction   Opcode   Frequency n̂i
i1            MOV      6
i3            POP      11
The possible observations for this compiler are MOV, PUSH, and POP, and the corresponding frequencies are 7, 10, and 3. Since we can view the distribution of interest
as a histogram over these three instructions, the parameter space of θ is N3 .
Suppose that we have a suspect file that may or may not be a family virus. Given
the instructions and the frequencies in Table 3, we would like to perform the χ2 test
so as to classify the file as a family virus or benign.
The null hypothesis H0 is that the file is a family virus; we accept H0 provided that the estimator function D2 yields a score less than or equal to the χ2 value. That is,
    D2 = Σ_{i=1}^{c} (n̂i − ni )2 / ni ≤ χ2 (α, c − 1) .
When computing the estimator D2 , we use the opcode frequency counts for each
instruction. However, since χ2 is a probability distribution, the frequencies are normalized before performing this test—normalization is done by simply dividing the
count of an instruction by the total number of instructions. The normalized values
for the spectrum in Table 2 are (MOV, 0.35), (PUSH, 0.5), and (POP, 0.15). The normalized values for the file under consideration (see Table 3) are (MOV, 0.353) and (POP,
0.647). Therefore,
    D2 = (0.353 − 0.35)2 /0.35 + (0.0 − 0.5)2 /0.5 + (0.647 − 0.15)2 /0.15 ≈ 2.1467.
We compare D2 = 2.1467 to χ2 (0.05, 2) = 5.991. Since D2 ≤ χ2 (0.05, 2), we accept
the null hypothesis, that is, we classify the file as a family virus.
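The arithmetic above can be checked directly with a quick sketch. Note that keeping full precision in the normalized frequencies gives D2 ≈ 2.1471; the 2.1467 above comes from rounding the frequencies to three decimal places first:

```python
# Expected counts from Table 2 and observed counts from Table 3
expected = {"MOV": 7, "PUSH": 10, "POP": 3}
observed = {"MOV": 6, "PUSH": 0, "POP": 11}

exp_total = sum(expected.values())  # 20
obs_total = sum(observed.values())  # 17

# Normalize both histograms, then accumulate D^2 over the family spectrum
d2 = sum(
    (observed[op] / obs_total - expected[op] / exp_total) ** 2
    / (expected[op] / exp_total)
    for op in expected
)
print(round(d2, 3))   # 2.147
print(d2 <= 5.991)    # True: accept H0, classify as a family virus
```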
3 Proposed Virus Detector
We propose a metamorphic detection method that combines an HMM detector (as developed in [30] and further extended and analyzed in [16]) with a chi-squared distance
(CSD) estimator, as discussed above in Section 2.7. We refer to this combination of
HMM and CSD as a hybrid method.
Experiments show that the HMM detector performs extremely well in detecting
viruses morphed with short sequences of code, whether the code is randomly selected
or carefully chosen to be statistically similar to benign code [9]. The same high level
of detection is achieved even when the short sequences of code are taken directly
from benign files [16]. However, when long contiguous blocks (e.g., subroutines) from
benign files are inserted into morphed viruses, the HMM detector fails even when a relatively low percentage of such code is present [16].
Intuitively, the performance of our CSD estimator should not be affected by the
length of the code blocks copied from benign files—only the total amount of such
code should matter. Our goal is to experimentally verify this intuition and develop
a hybrid model that will outperform both the HMM and CSD detectors.
Below, we consider ways to combine the HMM and CSD scores to obtain a score
for the hybrid model. But first, we convert each of these scores into probabilities.
Let PHMM (X) be the probability that corresponds to an HMM score for X and
let Pχ2 (X) be the probability that corresponds to a CSD score for X. We can directly
compute PHMM (X) from the score; see [22] for the details. To compute a probability
for our CSD score, we use the fact that it is related to the χ2 distribution and make
use of the χ2 cumulative distribution function (CDF). We can write

    Pχ2 (X) = P (Y < D2 ) = CDFχ2 (D2 ) = γ(k/2, D2 /2) / Γ(k/2)
where Γ is the well-known gamma function¹

    Γ(z) = ∫0^∞ t^{z−1} · e^{−t} dt
and k represents the degrees of freedom, which in our case is one less than the number of unique instructions encountered in the training phase. The function γ(s, x) is the lower incomplete gamma function, which is given by

    γ(s, x) = ∫0^x t^{s−1} · e^{−t} dt.
With these probabilities in hand, the probability for a hybrid model could be
computed as
P (X) = PHMM (X) · Pχ2 (X).
¹The Γ function can be viewed as a generalization of the factorial function; for positive integers n, we have Γ(n) = (n − 1)!.
To allow for different weightings of the component probabilities, we define

    Phybrid (X) = PHMM (X)^{w1} · Pχ2 (X)^{w2}

where w1 and w2 are to be determined. Finally, for numerical stability, we transform Phybrid (X) to its corresponding log-likelihood form

    log Phybrid (X) = w1 · log PHMM (X) + w2 · log Pχ2 (X).    (3)
The values we used for w1 and w2 in equation (3) were determined by a grid
search over the range 0 to 1. For this search, we conducted 100 experiments with w1
and w2 varying independently, on a logarithmic scale. For each case, we computed
results analogous to those described in Section 4. From this process, we found that
our best results were obtained using the values
w1 = 10−8 and w2 = 10−9
and, consequently, these weights are used for all experiments discussed in the next
section.
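A sketch of the score combination is given below. The lower incomplete gamma function is evaluated with a standard power series; the component probabilities passed to hybrid_log_score are placeholders, since computing PHMM (X) requires a trained model:

```python
import math

def chi2_cdf(d2, k):
    """Chi-squared CDF: gamma(k/2, d2/2) / Gamma(k/2).

    The lower incomplete gamma function is evaluated with the series
    gamma(s, x) = x^s e^{-x} * sum_{n>=0} x^n / (s (s+1) ... (s+n)).
    """
    s, x = k / 2.0, d2 / 2.0
    if x <= 0.0:
        return 0.0
    term = 1.0 / s
    total = term
    n = 0
    while term > 1e-15 * total:
        n += 1
        term *= x / (s + n)
        total += term
    return (x ** s) * math.exp(-x) * total / math.gamma(s)

def hybrid_log_score(p_hmm, p_chi2, w1=1e-8, w2=1e-9):
    """Equation (3): weighted sum of the component log-likelihoods."""
    return w1 * math.log(p_hmm) + w2 * math.log(p_chi2)

# Probability attached to the D^2 value from the worked example (k = 2)
p_csd = chi2_cdf(2.147, 2)
print(round(p_csd, 3))  # ≈ 0.658
```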
4 Validation and Datasets
In this section, we discuss the validation method used and the metrics that we employ
to compare our experimental results. Then we briefly cover the datasets used in our
experiments.
4.1 Cross-Validation
Cross-validation, or rotational estimation, is used to enhance the statistical validity
of a limited data set [12]. This approach enables us to have sufficient training data,
while also obtaining meaningful results in the testing phase.
Since testing data must be disjoint from training data, a limited data set implies
that only a relatively small amount of data may be available for testing. Hence, we
may have insufficient data to obtain meaningful test results. With cross-validation,
the training data and the testing data are selected so that they do not intersect, and
the experiment is repeated multiple times on different subsets of data. This results in
many more test cases, which will tend to improve the overall reliability of the results.
The approach that we employ here is five-fold cross-validation, where the data
is divided into five equal subsets. Then four subsets are used for training with the remaining subset reserved for testing. This process is repeated five times, with a different
subset reserved for testing each time—each such selection is referred to as a “fold.”
For each fold, the evaluation performance is recorded and all folds are used to estimate
the performance of a particular experiment.
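The fold construction can be sketched as follows (indices only; a real run would shuffle the data and then train and score a model per fold):

```python
def five_fold_splits(n_items, n_folds=5):
    """Partition item indices into disjoint test folds; the remaining
    indices form the training set for each fold."""
    indices = list(range(n_items))
    folds = [indices[i::n_folds] for i in range(n_folds)]
    splits = []
    for i in range(n_folds):
        test_fold = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, test_fold))
    return splits

# With 10 items, each fold trains on 8 and tests on the other 2
for train, test_fold in five_fold_splits(10):
    print(len(train), len(test_fold))  # 8 2
```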
Next, we discuss the metrics we use to measure the accuracy of our classifiers.
4.2 Evaluation Metrics
Here, we briefly discuss false positive and false negative rates, and explain how we
combine these into a measure of accuracy. We also consider ROC curves and their
role in evaluating our metamorphic detection technique.
4.2.1 Accuracy Measure
There are four possible outcomes for detection, namely, true positive (TP), false
positive (FP), true negative (TN), and false negative (FN). A detection is considered
a true positive when a virus is correctly classified as a virus, whereas it is a true
negative when a benign file is correctly marked as benign. Of course, TP and TN are
desirable outcomes. A false positive occurs when a benign file is mistakenly classified
as a virus. Similarly, a false negative occurs when a virus file is not detected as a
virus, but instead is classified as benign. Table 4 shows these four possible outcomes.
Table 4: Possible outcomes for detection.

                           Predicted Class
Actual Class     Virus              Benign
Virus            True Positive      False Negative
Benign           False Positive     True Negative
Short of perfect detection, there is an inherent tradeoff between the FP and FN
rates. As an aside, we note that commercial antivirus makers generally try to avoid
false positives, although this leads to significantly higher false negative rates. Users
tend to notice (and be very unhappy) whenever an uninfected file is identified as
malicious, but they tend to be more forgiving of (or may not even notice) an occasional
false negative [4]. In any case, we would like to avoid both false positives and false
negatives.
The overall success rate, or accuracy rate, is the fraction of correct classifications
obtained from the total number of files tested. The accuracy rate is given by
    Accuracy Rate = (TP + TN) / (TP + TN + FP + FN) .    (4)
The error rate is one minus this accuracy rate, that is,
    Error Rate = 1 − Accuracy Rate = 1 − (TP + TN) / (TP + TN + FP + FN) .    (5)
Since we performed five-fold cross-validation, we have a total of five models for
each experiment. Each of these folds will yield an accuracy rate. Therefore, we use
the mean of the accuracy values obtained from the five folds, which we refer to as the
mean maximum accuracy (MMA) rate. This rate is computed as

    MMA = (1/5) Σ_{i=1}^{5} Accuracy Ratei    (6)

where Accuracy Ratei denotes the accuracy rate for the ith fold in the five-fold cross-validation.
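Equations (4) and (6) amount to a few lines of arithmetic. The per-fold confusion counts below are invented for illustration:

```python
def accuracy_rate(tp, tn, fp, fn):
    """Equation (4): fraction of correct classifications."""
    return (tp + tn) / (tp + tn + fp + fn)

# Hypothetical per-fold confusion counts: (TP, TN, FP, FN)
folds = [(38, 7, 1, 2), (39, 8, 0, 1), (37, 6, 2, 3), (40, 8, 0, 0), (38, 7, 1, 2)]

accuracies = [accuracy_rate(*fold) for fold in folds]
mma = sum(accuracies) / len(accuracies)  # Equation (6)
print(round(mma, 4))  # 0.95
```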
4.2.2 Receiver Operating Characteristic
The receiver operating characteristic (ROC) was developed for applications related
to signal detection [10]. But, ROC curves have gained popularity in the machine
learning community as a tool for evaluating the performance of algorithms [2, 25].
ROC curves are typically represented as two-dimensional plots. For the virus
detection problem under consideration here, we let the x-axis represent the false
positive rate and the y-axis represent the true positive rate. Figure 5 illustrates an
example with several ROC curves plotted on the same axis.
Figure 5: Example ROC curves.
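An ROC curve of this kind can be traced by sweeping a decision threshold over the detector's scores and recording the resulting (false positive rate, true positive rate) pairs; the following is a minimal sketch with hypothetical labels and scores, where a higher score means more virus-like.

```python
def roc_points(labels, scores):
    """Return (fpr, tpr) pairs, one per candidate threshold."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = []
    # Each distinct score value serves as a candidate threshold.
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for l, s in zip(labels, scores) if l == 1 and s >= t)
        fp = sum(1 for l, s in zip(labels, scores) if l == 0 and s >= t)
        points.append((fp / neg, tp / pos))
    return points

# Hypothetical test set: three viruses (1) and three benign files (0).
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
print(roc_points(labels, scores))
```

The last point is always (1.0, 1.0), since a low enough threshold flags every file as a virus.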
An algorithm that performs random classification with 50% accuracy will generate a diagonal line in the ROC space. Results closer to the top left area of the graph represent improved classification with a higher true positive rate. For example, the blue diamond line in Figure 5 represents perfect classification, i.e., a 100% true positive rate with a 0% false positive rate. The line indicated by red squares corresponds to a classification algorithm that achieves about a 78% true positive rate with a 10% false positive rate. The green triangle line indicates essentially random classification. In Section 5, we give ROC curves for an HMM-based metamorphic detector and our proposed CSD detector.

Figure 6: Process for obtaining opcode sequences.

Figure 7: Process for generating metamorphic viruses.
4.3 Datasets
The basic malware dataset used in this research consists of 200 Next Generation
Virus Construction Kit, or NGVCK, metamorphic family viruses [27]. For our representative benign files, we selected 40 Cygwin utility files [8]. The NGVCK files
were selected since previous research has shown that these viruses are highly metamorphic, and they have served as the basis for previous HMM-based metamorphic
detection [16, 29].
Each executable file was disassembled using IDA Pro [13] and opcode sequences
were extracted for use in our experiments. Figure 6 illustrates the process.
For many of our experiments, the opcode sequences from the NGVCK virus files
were further morphed, using a metamorphic generator similar to that in [16]. Figure 7
illustrates the basic steps taken to generate the various metamorphic viruses.
Table 5: Experiments.

  Training Dataset
    Original NGVCK viruses
    NGVCK morphed with 10% dead code
    NGVCK morphed with 10% subroutine code
While the NGVCK files can be detected using an HMM-based approach [29], the
generator in [16] is able to defeat the HMM detector. Since our goal is to improve
on the HMM detection results, these tools provide us with the material needed to
compare our CSD detector and hybrid approach with the HMM detector, as well as
to validate our HMM detector based on previous work.
5 Experimental Results
In this section, we present the results of our experiments using an HMM-based detector, a chi-squared distance (CSD) estimator, and our proposed hybrid virus detector.
As discussed in the previous section, we use ROC curves and the mean maximum
accuracy (MMA) rates to evaluate the performance of each of the three detectors.
Here, we refer to short segments of benign code that are interspersed throughout
the file as “dead code.” For the case where a long contiguous block is used, we
refer to the inserted code as “subroutine code.” For each of the three experiments
listed in Table 5, the parameters for our metamorphic generator were set to generate
increased dead code and increased subroutine code, both in increments of 10%, up to
a maximum of 40%. This yields 25 combinations of parameter values per experiment,
with each combination employing 200 distinct metamorphic virus files (and 40 benign
files), and each of these using five-fold cross validation.
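The distinction between the two insertion modes can be sketched as follows; the opcode mnemonics and the insertion scheme here are illustrative assumptions, not the actual generator of [16].

```python
import random

def insert_dead_code(virus_ops, benign_ops, pct, seed=0):
    """Intersperse benign opcodes throughout the file ('dead code')."""
    rng = random.Random(seed)
    n = int(len(virus_ops) * pct / 100)
    result = list(virus_ops)
    for op in rng.choices(benign_ops, k=n):
        result.insert(rng.randrange(len(result) + 1), op)
    return result

def insert_subroutine(virus_ops, benign_ops, pct):
    """Append one contiguous block of benign code ('subroutine code')."""
    n = int(len(virus_ops) * pct / 100)
    return list(virus_ops) + list(benign_ops[:n])

# Hypothetical opcode sequences for illustration.
virus  = ["mov", "add", "xor", "push", "pop", "cmp", "jne", "call", "ret", "jmp"]
benign = ["nop", "lea", "test", "inc"]
print(len(insert_dead_code(virus, benign, 40)))  # 10 opcodes + 40% → 14
```

Either mode adds the same fraction of benign opcodes; only their placement differs, which is exactly the property that separates the HMM and CSD results below.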
All experiments presented here rely on the datasets and procedures discussed in
Section 4. Additional experimental results can be found in [26].
5.1 Training on NGVCK Viruses
In this experiment, we used NGVCK metamorphic virus files as our base viruses.
That is, for each training set, we used the NGVCK viruses without further morphing.
For viruses in the test sets, various levels of morphing (dead code insertion and/or
subroutine insertion) were applied.
Table 6 contains MMA results for the HMM detector, the CSD estimator, and our
hybrid model. For each row in the table, the score in boldface is the best detection
rate for the specified level of morphing. For this case, relevant ROC curves appear
in Figure 8.
For this test case, we observe that the HMM results are extremely good, with
the hybrid results being only marginally better. The CSD results are comparable to
the HMM at lower levels of morphing, but at high levels of morphing, the HMM is
superior. We also note that CSD tends to perform somewhat better, relative to the
HMM, at higher levels of subroutine insertion rates, although it is still below the level
of the HMM.
5.2 Training Set Morphed with 10% Dead Code
For this experiment, the training dataset consists of NGVCK virus files that were further morphed by inserting 10% dead code. This experiment simulates the case where
the base viruses are more highly morphed, with the additional morphing consisting of
dead code selected from normal files. The purpose of inserting such code would be
to evade statistical-based detection strategies. Table 7 summarizes the MMA scores
obtained by the HMM detector, the CSD estimator, and the hybrid model. For this
case, relevant ROC curves appear in Figure 9.
In this case, the HMM results are somewhat stronger, relative to the CSD, than
in the previous experiment. As in the previous experiment, we see that the hybrid
model offers some improvement over the HMM classifier.
5.3 Training Set Morphed with 10% Subroutine Code
As in the previous experiment, the base metamorphic viruses were morphed by an
additional 10% of code, with the morphing code taken from benign files. However,
in this case, entire subroutines were extracted from benign files. This represents the
situation where the metamorphic viruses are more highly morphed than NGVCK,
with the additional morphing consisting of contiguous blocks of code from benign
files. The results for this experiment are given in Table 8. For this case, relevant
ROC curves appear in Figure 10.
For this experiment, the HMM detection rates are poor, which is consistent with
previous research [16]. But, the CSD performs well and the hybrid results improve
(slightly) on the CSD.
5.4 Discussion
As expected, for the CSD detector, it makes little difference whether the benign code
is dispersed throughout the file (dead code), or inserted as long contiguous blocks
(subroutine code). But, this is not the case for the HMM, which fails at relatively
low percentages when the inserted benign code is in the form of contiguous blocks.
These results show that the HMM and CSD detectors provide significantly different statistical measures. Most importantly from the perspective of metamorphic
detection, we have shown that it is possible to combine these two detectors to obtain
a hybrid detector that is stronger than either individual detector.
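As an illustration of score fusion, one simple approach is to normalize each detector's score against its training statistics and average the results; the normalization, the sign convention for the CSD, and the equal weighting below are assumptions for exposition, not necessarily the combination used in our hybrid model.

```python
def hybrid_score(hmm_score, csd_score, hmm_mean, hmm_std, csd_mean, csd_std):
    """Z-normalize each score against training statistics, then average."""
    z_hmm = (hmm_score - hmm_mean) / hmm_std
    # Assume the CSD is a distance (smaller = closer to the virus family),
    # so negate it to make larger values more virus-like.
    z_csd = -(csd_score - csd_mean) / csd_std
    return (z_hmm + z_csd) / 2.0

# Hypothetical scores and training statistics (mean, standard deviation).
score = hybrid_score(-1.2, 0.8, hmm_mean=-1.5, hmm_std=0.5,
                     csd_mean=1.0, csd_std=0.4)
print(round(score, 2))  # → 0.55
```

The appeal of such a fusion is that a file must look benign to both detectors to score low, so morphing that defeats one detector (e.g., subroutine insertion against the HMM) is still caught by the other.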
6 Conclusion
In this paper, we considered a hybrid metamorphic detection strategy that employs
both a machine learning (HMM) component and a statistical analysis (CSD) component. We showed that our hybrid detector generally outperforms either individual
technique. Our hybrid approach overcomes a significant weakness in HMM-based
metamorphic detection that was identified in previous research [16, 29].
Consistent with previous research, we found that the HMM detector performs
well when benign code is inserted in small blocks, but does much worse when the
morphing consists of contiguous blocks. As expected, the CSD estimator had similar
performance in both cases. The overall high level of performance of the CSD detector
was somewhat surprising, although the hybrid approach is superior in most cases.
But, given the simplicity of the CSD detector and the fact that its detection rates are
generally close to those of the hybrid model, the CSD detector might be preferable
in practice.
Future work could include the investigation of more statistical models and evaluations of their performance. It might also be worthwhile to investigate other methods
of combining two (or more) scores into a hybrid model. Also, additional tests with
other metamorphic generators and morphing strategies could prove interesting.
References
[1] A. V. Aho and M. J. Corasick, Efficient string matching: An aid to bibliographic
search, Communications of the ACM, Vol. 18, pp. 333–340, 1975.
[2] K. Ataman, W. N. Street, and Y. Zhang, Learning to rank by maximizing auc
with linear programming, IEEE Technical Report, 2006.
[3] T. H. Austin, E. Filiol, S. Josse, and M. Stamp, Exploring hidden Markov models
for virus analysis: A semantic approach, submitted for publication, 2012.
[4] J. Aycock, Computer Viruses and Malware, Springer, 2006.
[5] D. Chess and S. White, An undetectable computer virus, Virus Bulletin Conference, 2000.
[6] J. Borello and L. Me, Code obfuscation techniques for metamorphic viruses,
Journal in Computer Virology, Vol. 4, No. 3, pp. 211–220, August 2008.
[7] F. Coulter and K. Eichorn, A good decade for cybercrime, McAfee, Inc, Technical
Report, 2011.
[8] Cygwin, September 2011, [online], at http://www.cygwin.com
[9] P. Desai and M. Stamp, A highly metamorphic virus generator, International
Journal of Multimedia Intelligence and Security, Vol. 1, No. 4, pp. 402–427, 2010.
[10] J. Egan, Signal Detection Theory and ROC Analysis, Academic Press, Inc., 1975.
[11] E. Filiol and S. Josse, A statistical model for undecidable viral detection, Journal
in Computer Virology, Vol. 3, No. 1, pp. 65–74, April 2007.
[12] S. Geisser, Predictive Inference: An Introduction, Chapman and Hall, 1993.
[13] IDA Pro, interactive disassembler, 2011, [online], at
http://www.hex-rays.com/products/ida/index.shtml
[14] Intel, Intel® Architecture Software Developer's Manual, Volume 2: Instruction Set Reference Manual, October 2011.
[15] J. Kolter and M. Maloof, Learning to detect malicious executables in the wild,
Proceedings of KDD ’04, 2004.
[16] D. Lin and M. Stamp, Hunting for undetectable metamorphic viruses, Journal
in Computer Virology, Vol. 7, No. 3, pp. 201–214, August 2011.
[17] T. Mitchell, Machine Learning, McGraw Hill, 1997.
[18] M. Schultz, E. Eskin, and E. Zadok, Data mining methods for detection of new malicious executables, Proceedings of IEEE International Conference on Data Mining, 2001.
[19] S. Madenur Sridhara, Metamorphic worm that carries its own morphing engine, Master's Projects, Paper 240, 2012,
http://scholarworks.sjsu.edu/etd_projects/240.
[20] S. Madenur Sridhara and M. Stamp, Metamorphic worm that carries its own morphing engine, submitted for publication.
[21] M. Stamp, Information Security: Principles and Practice, 2nd edition, Wiley,
2011.
[22] M. Stamp, A revealing introduction to hidden Markov models, [online], at
www.cs.sjsu.edu/faculty/RUA/HMM.pdf
[23] P. Ször, The Art of Computer Virus Research and Defense, Addison-Wesley Professional, 2005.
[24] P. Ször and P. Ferrie, Hunting for metamorphic, Virus Bulletin, pp. 123–144,
2001.
[25] S. Thrun, L. K. Saul, and B. Scholkopf, Eds., AUC Optimization vs. Error Rate
Minimization, MIT Press, 2004.
[26] A. H. Toderici, Chi-squared distance and metamorphic virus detection, Master's Thesis, Department of Computer Science, San Jose State University, May 2012.
[27] VX Heavens, [online], at http://www.vx.netlux.org/
[28] R. Wang, Flash in the pan? Virus Bulletin, July 1998.
[29] W. Wong, Analysis and detection of metamorphic computer viruses, Master’s
Thesis, Department of Computer Science, San Jose State University, May 2006.
[30] W. Wong and M. Stamp, Hunting for metamorphic engines, Journal in Computer
Virology, Vol. 2, No. 3, pp. 211–229, December 2006.
Table 6: MMA results training on NGVCK viruses.

Dead Code   Subroutine     HMM       CSD      Hybrid
    0%          0%       99.75%    99.50%    99.75%
    0%         10%       87.50%    85.81%    88.96%
    0%         20%       82.00%    81.58%    84.20%
    0%         30%       79.75%    80.35%    81.69%
    0%         40%       77.25%    76.84%    78.94%
   10%          0%       92.00%    90.81%    93.48%
   10%         10%       86.00%    83.87%    88.47%
   10%         20%       83.00%    79.60%    84.96%
   10%         30%       78.50%    76.90%    83.46%
   10%         40%       77.50%    74.89%    80.19%
   20%          0%       91.50%    86.39%    92.98%
   20%         10%       85.50%    82.66%    88.47%
   20%         20%       82.75%    78.91%    85.71%
   20%         30%       80.25%    75.91%    82.20%
   20%         40%       77.75%    74.41%    79.95%
   30%          0%       90.75%    81.91%    92.22%
   30%         10%       86.50%    79.16%    88.47%
   30%         20%       82.25%    76.47%    85.71%
   30%         30%       81.50%    74.16%    82.20%
   30%         40%       79.50%    73.21%    82.20%
   40%          0%       90.25%    77.82%    91.47%
   40%         10%       85.75%    76.00%    89.22%
   40%         20%       83.00%    73.97%    84.96%
   40%         30%       81.25%    73.23%    84.21%
   40%         40%       80.50%    71.46%    82.20%
 Average                 84.09%    79.43%    86.25%
Figure 8: ROC curves for the HMM and CSD detectors, trained on NGVCK viruses, tested on viruses morphed with dead code or subroutine code. Panels: (a) HMM: dead code; (b) HMM: subroutine code; (c) CSD: dead code; (d) CSD: subroutine code.
Table 7: MMA results training on NGVCK morphed 10% dead code.

Dead Code   Subroutine     HMM       CSD      Hybrid
    0%          0%       99.50%    92.00%    99.24%
    0%         10%       87.75%    80.75%    88.46%
    0%         20%       81.75%    76.00%    83.95%
    0%         30%       80.00%    73.50%    81.18%
    0%         40%       76.75%    71.50%    77.18%
   10%          0%       99.75%    95.75%    99.24%
   10%         10%       90.75%    86.75%    90.47%
   10%         20%       85.75%    80.00%    86.46%
   10%         30%       81.75%    77.25%    83.46%
   10%         40%       80.00%    75.25%    81.20%
   20%          0%       99.75%    94.00%    98.49%
   20%         10%       90.25%    87.25%    91.73%
   20%         20%       86.25%    81.75%    88.21%
   20%         30%       82.75%    78.25%    86.71%
   20%         40%       81.00%    76.00%    84.20%
   30%          0%      100.00%    93.00%    98.24%
   30%         10%       93.50%    87.25%    93.47%
   30%         20%       88.75%    82.75%    91.22%
   30%         30%       84.00%    78.00%    87.46%
   30%         40%       83.00%    77.75%    86.96%
   40%          0%      100.00%    92.50%    97.99%
   40%         10%       93.25%    88.00%    93.47%
   40%         20%       87.75%    82.75%    90.97%
   40%         30%       85.25%    81.50%    89.22%
   40%         40%       83.75%    78.50%    88.46%
 Average                 88.12%    82.72%    89.40%
Figure 9: ROC curves for HMM and CSD detectors, trained on viruses morphed with 10% dead code, tested on viruses morphed with dead code or subroutine code. Panels: (a) HMM: dead code; (b) HMM: subroutine code; (c) CSD: dead code; (d) CSD: subroutine code.
Table 8: MMA results training on NGVCK morphed 10% subroutine code.

Dead Code   Subroutine     HMM       CSD      Hybrid
    0%          0%       62.25%    96.25%    97.24%
    0%         10%       60.50%    92.25%    94.97%
    0%         20%       59.25%    90.00%    92.47%
    0%         30%       58.25%    88.50%    90.72%
    0%         40%       57.00%    86.75%    88.71%
   10%          0%       61.75%    95.00%    97.23%
   10%         10%       60.50%    92.50%    96.22%
   10%         20%       59.50%    90.00%    93.72%
   10%         30%       58.25%    88.75%    92.22%
   10%         40%       59.00%    88.50%    91.22%
   20%          0%       65.75%    92.00%    95.23%
   20%         10%       64.25%    90.25%    94.23%
   20%         20%       64.00%    89.50%    92.72%
   20%         30%       62.25%    88.75%    91.96%
   20%         40%       59.75%    88.25%    90.97%
   30%          0%       68.75%    89.50%    91.96%
   30%         10%       68.00%    89.75%    92.21%
   30%         20%       67.00%    90.25%    92.46%
   30%         30%       64.75%    88.50%    90.71%
   30%         40%       63.75%    88.25%    90.96%
   40%          0%       71.25%    87.50%    90.45%
   40%         10%       70.25%    88.75%    91.46%
   40%         20%       69.25%    88.75%    91.46%
   40%         30%       66.00%    87.75%    90.96%
   40%         40%       65.25%    87.25%    89.71%
 Average                 63.46%    89.74%    92.48%
Figure 10: ROC curves for HMM and CSD detectors, trained on viruses morphed with 10% subroutine code, and tested on viruses morphed with dead code or subroutine code. Panels: (a) HMM: dead code; (b) HMM: subroutine code; (c) CSD: dead code; (d) CSD: subroutine code.