Chi-Squared Distance and Metamorphic Virus Detection

Annie H. Toderici∗ and Mark Stamp†

Abstract

Metamorphic malware changes its internal structure with each generation, while maintaining its original behavior. Commercial antivirus software generally scans for known malware signatures; therefore, it cannot detect metamorphic malware that sufficiently morphs its internal structure. Machine learning methods such as hidden Markov models (HMMs) have shown promise for detecting hacker-produced metamorphic malware. However, previous research has shown that it is possible to evade HMM-based detection by carefully morphing with content from benign files. In this paper, we combine HMM detection with a statistical technique based on the chi-squared test to build an improved detection method. We discuss our technique in detail and provide experimental evidence to support our claim of improved detection.

Keywords: metamorphic malware, chi-squared statistics, hidden Markov models, malware detection

1 Introduction

Malicious software attacks can cause extensive financial damage. For example, MyDoom, a spam-mailing malware, caused an estimated $38 billion in damage, while the damage due to Conficker, a password-stealing botnet, has been estimated at $9.1 billion [7]. Antivirus software aims to detect malware. Signature detection is the approach most commonly used by commercial antivirus products [4]. This technique relies on a database of known signatures for viruses and other malware, and attempts to match these signatures against files on a user’s computer. If a match is found, the file is likely infected by the corresponding malware. Traditionally, this method has been effective for detecting most malware. The major weakness of signature detection is that it cannot detect previously unknown viruses.
∗Department of Computer Science, San Jose State University
†Department of Computer Science, San Jose State University: [email protected]

As a way to evade signature detection, malware writers employ code obfuscation methods [23]. Metamorphic viruses apply code obfuscation techniques at each generation and, consequently, a well-designed metamorphic virus cannot be detected using standard signatures [6]. Previous research has shown that machine learning methods such as hidden Markov models (HMMs) are effective at detecting hacker-produced metamorphic viruses [29]. However, such a detection strategy can be defeated by inserting a sufficient amount of code from benign files into each virus; at some point, the HMM classifier cannot reliably distinguish such a virus from a benign file. An experimental metamorphic virus generator was previously developed to exploit this strategy [16]. It was found that the HMM-based approach is very robust when random changes are made to the viral code, or when small blocks of code are copied from benign files into the viruses. However, the HMM technique is relatively fragile when code is copied from benign files in the form of contiguous blocks (e.g., entire subroutines). The motivation for the research presented here is to address this weakness in the HMM detector.

In this paper, we analyze the utility of a statistical chi-squared test (or χ² test) for malware detection, as suggested by the theoretical framework in [11]. Our results show that the chi-squared statistic, computed on instruction opcode frequencies, can significantly improve virus detection; whether code is copied from benign files in the form of many small segments or one contiguous block has little effect on the chi-squared statistic. We show that by combining a chi-squared test with an HMM detector, we can improve on the results obtained by either technique used individually. This paper is organized as follows.
Section 2 gives relevant background information on malware, malware detection, code obfuscation, hidden Markov models, and the chi-squared statistical test. In Section 3, we discuss our proposed detection techniques, while Section 4 covers the performance criteria we use to quantify our results. Then, in Section 5 we present our experimental results. Finally, conclusions and suggestions for future work are given in Section 6.

2 Background

In this section, we briefly discuss background material that is relevant to the research in this paper. First, we cover malware, with the focus on metamorphic malware and code obfuscation techniques. Then we consider malware detection, with the emphasis on a machine learning technique based on hidden Markov models. Finally, we discuss a chi-squared statistical test and its potential application to malware detection.

2.1 Malware

As the name suggests, malware is software designed specifically for malicious purposes. Malware is often classified into various categories, including virus, worm, trojan horse, spyware, adware, and botnet. A virus is often defined as malware that relies on passive propagation, whereas a worm uses active means [21]. That is, a worm actively propagates itself (typically, via a network), while a virus requires outside assistance (e.g., an infected USB key inserted into a computer). However, others define a virus as parasitic malware, in contrast to a worm that is stand-alone code [4]. Here, we use the term “virus” generically to refer to malware.

Next, we consider encrypted, polymorphic, and metamorphic viruses. These categories can be viewed as a hierarchy, employing increasingly sophisticated strategies designed to evade signature detection.

2.2 Encrypted Viruses

Encrypted viruses encrypt their body using a different key at every infection. While encryption is an effective means of evading signature detection, an encrypted virus must include a plaintext decryptor routine.
Therefore, it is possible to detect this class of viruses by analyzing the decryptor to obtain a signature.

2.3 Polymorphic Viruses

Polymorphic viruses are encrypted viruses that obfuscate their decryption code by mutating it. In practice, there are usually a relatively small number of decryptors, making signature detection a viable option. In addition, if a part of the code looks “suspicious,” we can execute it in a virtual machine. If the suspicious code is a polymorphic virus, it will decrypt itself, at which point standard signature detection can succeed.

2.4 Metamorphic Viruses

Metamorphic viruses change their internal structure at each generation. Unlike polymorphic viruses, the entire virus is morphed while still maintaining its original behavior. If the morphing is sufficient, no common signature is available and hence signature-based antivirus software cannot reliably detect well-designed metamorphic viruses. Note that no encryption is necessary for a metamorphic virus.

Although metamorphic viruses are difficult to detect, fortunately, they have proven equally difficult for malware writers to implement. One difficulty is that the malware must mutate sufficiently, yet the size of the code must not increase uncontrollably. Another concern is that the viruses must be sufficiently similar to benign code to avoid detection by similarity-based or heuristic methods. Malware writers have not yet successfully overcome these obstacles [29, 30]. In fact, most metamorphic generators introduce very little metamorphism, and those that do produce variants that are easily distinguished from benign code. Nevertheless, it is possible to produce relatively strong, practical metamorphic generators [16, 19, 20], so metamorphic detection is a worthy research problem.

2.4.1 Virus Obfuscation Techniques

Code obfuscation techniques are applied when creating metamorphic viruses.
These techniques can be used to create a vast number of distinct copies that have the same behavior but different internal structure. In this section, we briefly discuss some elementary code morphing techniques.

Register Renaming: Register renaming is one of the oldest and simplest techniques used in metamorphic generators. For example, Figure 1 provides a code snippet where the following substitutions have been made:

    eax → ebx
    ebx → ecx
    ecx → eax

In spite of its simplicity, this technique does change the binary pattern in morphed executable files. However, register renaming does not affect the opcode sequence. Furthermore, it is relatively easy to detect register renaming by using signatures with wildcard strings [24].

    MOV eax, 4         MOV ebx, 4
    ADD ebx, eax  →    ADD ecx, ebx
    SUB ecx, 1         SUB eax, 1

    Figure 1: Register renaming example.

Equivalent Instruction Replacement: The instruction set for a modern processor includes numerous equivalent instructions (or groups of instructions). For example, MOV eax, 0 is equivalent to both SUB eax, eax and XOR eax, eax. Figure 2 illustrates an example where a single instruction is equivalent to a sequence of instructions.

    MOV eax, 1  →  SUB eax, eax
                   ADD eax, 1

    Figure 2: Equivalent instructions substitution.

Instruction Reordering: Instruction reordering consists of transposing instructions that do not depend on the output of previous instructions. When instructions are reordered, signatures involving those instructions can be broken, but code execution is unaffected. Figure 3 shows an example of instruction reordering.

    MOV eax, 4        ADD ecx, 1
    ADD ecx, 1   →    MOV ebx, 0
    MOV ebx, 0        MOV eax, 4

    Figure 3: Instruction reordering.

Junk Code: Junk code is any code that has no effect on program execution. Junk code might be executed, with no effect on the program, or it might be “dead code” that is never executed. Junk code is often inserted randomly throughout the body of a metamorphic virus during the morphing process.
The intention is that such junk code will break up signatures. A trivial example of junk code is the NOP instruction, which does nothing to affect the CPU state. Other examples of junk code include MOV eax, eax, ADD eax, 0, and SUB eax, 0.

2.5 Malware Detection

In this section, we briefly discuss signature detection and heuristic detection. Other detection strategies are used, but these are the most common today.

2.5.1 Signature Detection

Commercial antivirus software typically uses signature detection to identify malicious files. A signature is created by analyzing the binary code of a virus, and selecting a sequence of bits that is, ideally, unique to that virus [21]. The signature must be long enough so that it is unlikely to appear in uninfected files. For example, the Chernobyl/CIH virus can be detected using the signature [28]

    E800 0000 005B 8D4B 4251 5050 0F01 4C24 FE5B 83C3 1CFA 8B2B

String matching algorithms are applied when scanning for virus signatures. Examples of such algorithms include Aho-Corasick, Veldman, and Wu-Manber [4]. The Aho-Corasick algorithm scans for exact matches, so a slight variation will escape detection [1]. On the other hand, Veldman and Wu-Manber allow for the use of wildcard search strings [23]. Signature scanning is relatively easy and efficient, but the virus database needs to be kept up to date. The crucial weakness of signature detection is that signatures are unlikely to detect previously unseen malware, including metamorphic variants.

2.5.2 Heuristic Detection

Heuristic analysis can be used to detect unknown viruses and variants of known viruses. Heuristic analysis can be based on a static or dynamic approach, or a combination of the two. An example of static heuristic analysis would be to look for opcode sequences that match a general pattern found in viral code. For dynamic analysis, the code might be executed in a virtual machine to watch for suspicious behavior.
An example of such suspicious behavior is opening an executable file for writing [4].

2.6 Hidden Markov Models

Machine learning can be defined as the study of computer algorithms that improve through experience [17]. Examples of such techniques include Naïve Bayes [18], decision trees [15], hidden Markov models [22], and many other statistical learning methods [11]. A hidden Markov model (HMM) is a statistical modeling method that has been used in speech recognition, bioinformatics, mouse gesture recognition, credit card fraud detection, and computer virus detection research. It is widely used because it is simple and computationally efficient [22].

The popularity of HMMs for virus detection stems from the fact that a program can be represented as a sequence of instructions. The CPU executes the instructions one at a time, which implies that programs can be treated as time series, an ideal situation for an HMM. For a Markov model of order one, we assume that the sequential data can be modeled based solely on the current state, with no memory; what happens next in the sequence depends only on the current state. In a typical Markov chain, the states are fully observed. In the case of a hidden Markov model, as the name implies, the states are not directly observed; they are “hidden.” We can only estimate these states while observing sequences of data [22].

2.6.1 HMM Notation

To describe an HMM, we use the notation summarized in Table 1. The main components of the HMM are the state transition probability matrix A, the observation probability matrix B (which gives the likelihood of an observation given the state), the initial state distribution π, the observation sequence O, and the hidden states X. We denote the HMM model as λ = (A, B, π). The matrices A, B, and π are row stochastic, that is, each row is a probability distribution.
The probability π_Xi is the initial probability of starting in state Xi, while a_{Xi,Xi+1} is the probability of transitioning from state Xi to state Xi+1. Finally, b_Xi(Oi) is the probability of observing Oi when the underlying Markov process is in state Xi. Figure 4 illustrates a generic HMM. The state transitions in Figure 4 are placed above the dashed line to indicate that they are “hidden.” Some information about the transitions can be deduced by analyzing the observations, since the observation sequence is related to the states of the Markov process by the matrix B. The utility of HMMs derives largely from the fact that there are efficient algorithms to solve the following three problems.

Table 1: HMM symbols and their meanings.

    Symbol  Description
    T       The length of the observation sequence
    N       The number of states in the model
    M       The number of distinct observation symbols
    O       The observation sequence, O = {O0, O1, ..., OT−1}
    X       The hidden states, X = {X0, X1, ..., XT−1}
    A       The state transition probability matrix
    B       The observation probability matrix
    π       The initial state distribution
    λ       The hidden Markov model: λ = (A, B, π)

Figure 4: Hidden Markov Model [22].

1. Given a model λ and an observed sequence O, determine P(O | λ). That is, we can score a sequence of observations against a given model.

2. Given a model λ and an observed sequence O, we can find the most likely hidden state sequence {X0, X1, ..., XT−1}.

3. Given an observation sequence O, we can train a model λ to maximize the probability of O. That is, we can train a model to fit the data.

In this research, we apply the solution to the third problem to train an HMM to fit opcode sequences extracted from a metamorphic family. Then we use the solution to the first problem to score files (based on extracted opcode sequences) and classify each as either belonging to the metamorphic family or not. For more information on HMMs in general, including detailed examples and pseudo-code, see [22].
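As a small illustration of the scoring step (problem 1 above), the following Python sketch computes P(O | λ) with the standard forward algorithm. The toy model here (two hidden states, three observation symbols) is invented for illustration and is not one of the trained models from our experiments; in practice, scoring long opcode sequences is done with scaled log-likelihoods to avoid numerical underflow.

```python
# Forward algorithm: score an observation sequence O against a
# model lambda = (A, B, pi), i.e., compute P(O | lambda).

def forward_score(A, B, pi, O):
    N = len(pi)
    # alpha[i] = P(O_0, ..., O_t, X_t = i | lambda)
    alpha = [pi[i] * B[i][O[0]] for i in range(N)]
    for t in range(1, len(O)):
        alpha = [sum(alpha[j] * A[j][i] for j in range(N)) * B[i][O[t]]
                 for i in range(N)]
    return sum(alpha)

# Toy model: 2 hidden states, 3 observation symbols (e.g., opcodes 0, 1, 2).
A  = [[0.7, 0.3], [0.4, 0.6]]            # state transition matrix
B  = [[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]]  # observation probability matrix
pi = [0.6, 0.4]                          # initial state distribution

score = forward_score(A, B, pi, [0, 1, 2, 1])  # P(O | lambda)
```

A file is then classified by comparing such a score (suitably normalized by sequence length) to a threshold determined during training.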
For additional information on the application of HMMs to the malware detection problem, see, for example, [16, 30].

2.7 Chi-Squared Distance

In this section, we first discuss chi-squared tests in general terms. Then we consider the use of such a test for malware detection.

2.7.1 CSD Notation

Suppose that Y is a statistical variable from a distribution under observation. Our goal is to estimate the main characteristics of the probability distribution P of Y. Let Y1, Y2, ..., Yn be a random sample of elements from this distribution. These samples reveal some information about the unknown parameter, say, θ of the probability distribution P. Let f(Y1, Y2, ..., Yn) denote a function that is used to estimate θ; we refer to f as an estimator function. An estimator function can be used to compute the probability distribution of a given sample. The parameter space from which θ is drawn can be arbitrary, but will typically depend on the problem and the choice of f. For example, we might have θ ∈ N, where N is the set of natural numbers, or θ ∈ R^k, for some k, where R is the set of real numbers. For example, if θ represents the parameters of a normal distribution, then θ has two real-valued dimensions, corresponding to the mean µ and the variance σ². For the purpose of malware analysis, we restrict the parameter space to k-dimensional vectors with elements from the set of natural numbers, that is, θ ∈ N^k. The general form of P is assumed to be known.

Statistical testing is used to decide which hypothesis best fits an observed sequence of samples (Y1, Y2, ..., Yn). In statistical testing, initial hypotheses are proposed which are either accepted or rejected after measuring the likelihood of these hypotheses with respect to the probabilistic law P of Y. The initial hypothesis is denoted H0 and is referred to as the null hypothesis. The alternative hypothesis is denoted H1. The tests that we use will either accept or reject the null hypothesis.
To construct such a test, we determine an estimator that separates possible test values into disjoint sets consisting of an acceptance region and a rejection region. Then, given a sample, its estimator value is computed and compared with a threshold to determine whether we accept or reject the null hypothesis. There are two types of errors associated with this detection problem.

• A type I error occurs when we reject the null hypothesis, although the null hypothesis is correct. The probability of such an error is denoted α, and it corresponds to the false positive rate.

• A type II error occurs when we accept the null hypothesis, although the null hypothesis is false. This probability corresponds to the false negative rate.

2.7.2 Statistical Test for Malware Detection

To formalize the virus detection framework, we adopt notation from Chess and White [5]. A given antivirus system uses a detection algorithm D, which it applies to a program p. The goal of the algorithm is to determine whether p is infected by a particular virus V. Then D(p) should return true if and only if the program p is infected by the virus V. Filiol and Josse [11] further expand on this approach by providing a statistical framework to describe the detection process.

For malware analysis, we consider opcode instruction frequencies. We model the statistics of viruses from a given metamorphic family. Ideally, the spectrum of instructions for a family would be obtained by analyzing instruction frequencies of all possible family viruses. However, for any realistic metamorphic generator, an exact spectrum is impossible to compute. As an approximation, we rely on instruction frequencies from a representative set of family viruses. The related problem of modeling compilers is considered in [3].
Most compilers use a relatively small subset of all possible instructions, different compilers use different subsets, and the instructions that are common to two compilers will generally appear with different frequencies. We can make use of such observations to develop an estimator function which can be used to determine whether a given executable was generated with a particular compiler. This is analogous to the metamorphic detection problem, where a metamorphic generator can be considered a type of “compiler.” The formal definition of the spectrum of a program is

    spectrum(C) = {(Ii, ni)}, 1 ≤ i ≤ c                                (1)

where, in our case, C represents a particular metamorphic generator, Ii represents the ith instruction, and ni is the frequency of instruction i. Here, c denotes the total number of unique instructions that may be output by the generator. For the 80x86 architecture, the total number of possible instructions is 501 [14], but, for a given metamorphic generator, only a relatively small subset of these instructions is likely to occur.

In our experiments, the spectrum in equation (1) represents the expected value of opcodes from a typical family virus. Consequently, for any family virus, we expect to observe instruction i at approximately frequency ni. Given an executable file that we want to classify, we first compute its spectrum. That is, we compute the instruction frequencies observed in this file; we denote the observed frequency of instruction i as n̂i. Then we rely on a statistical test to determine whether we should accept or reject the null hypothesis. To make this process concrete, we need to specify the null hypothesis, an estimator function, and the decision threshold. The null hypothesis and alternative hypothesis are specified as

    H0: n̂i = ni, 1 ≤ i ≤ c
    H1: n̂i ≠ ni, 1 ≤ i ≤ c

respectively. Note that the null hypothesis states that the observed frequencies match the expected frequencies.
If these frequencies are essentially the same, then the suspect file is likely a family virus, and the null hypothesis will be accepted. However, if the frequencies are significantly different, then the alternative hypothesis will be accepted, which means that the suspect file is classified as benign. Of course, we cannot expect an exact match between any given spectrum and the expected spectral values. Filiol and Josse [11] suggest using Pearson’s χ² statistical test. This test statistic, which we denote as D², is given by

    D² = Σ_{i=1}^{c} (n̂i − ni)² / ni .

Pearson’s χ² statistic is commonly used to determine whether the difference between the expected and observed data is significant. The decision threshold is obtained by comparing the estimator value given by D² to the χ²(α, c − 1) distribution, that is, a chi-squared distribution with c − 1 degrees of freedom and a type I error rate of α. Typically, the type I error rate is set to 0.05, which means that the test tolerates no more than 5% of the virus files being misclassified as benign. Using Pearson’s χ² statistic, the null hypothesis and the alternative hypothesis are

    H0: D² ≤ χ²(α, c − 1)                                              (2)
    H1: D² > χ²(α, c − 1)

The bottom line here is that the χ² statistical test gives us a practical means to determine whether an observed spectrum matches a given distribution. To summarize, the following steps are performed to implement the Pearson χ² test for malware detection.

1. Specify the null hypothesis H0 and the alternative hypothesis H1.

2. Choose a significance level (we use α = 0.05).

3. Compute the estimator value D², given the opcode frequencies in the file under consideration (that is, n̂i for i = 1, 2, ..., c) and the spectrum that corresponds to the null hypothesis (that is, ni for i = 1, 2, ..., c).

4. Determine whether to accept or reject the null hypothesis H0 based on equation (2).
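These steps can be sketched in Python as follows. This is a minimal illustration rather than our actual implementation: the family spectrum is supplied as raw opcode counts, counts are normalized to proportions before D² is computed, and the χ²(α, c − 1) critical value is supplied by the caller. The opcode names and counts below are invented placeholders.

```python
from collections import Counter

# Sketch of the Pearson chi-squared test for malware detection (steps 1-4).
# Counts are normalized to proportions before computing the D^2 statistic.

def chi_squared_statistic(expected_counts, observed_counts):
    """D^2 = sum over the family spectrum of (nhat_i - n_i)^2 / n_i."""
    total_exp = sum(expected_counts.values())
    total_obs = sum(observed_counts.values())
    d2 = 0.0
    for opcode, count in expected_counts.items():
        n_i = count / total_exp                              # expected proportion
        nhat_i = observed_counts.get(opcode, 0) / total_obs  # observed proportion
        d2 += (nhat_i - n_i) ** 2 / n_i
    return d2

def classify(family_spectrum, suspect_opcodes, critical_value):
    """Accept H0 (family virus) iff D^2 <= chi^2(alpha, c - 1)."""
    d2 = chi_squared_statistic(family_spectrum, Counter(suspect_opcodes))
    return "family virus" if d2 <= critical_value else "benign"

# Hypothetical family spectrum and suspect file; with alpha = 0.05 and
# c = 3 opcodes, the critical value is chi^2(0.05, 2) = 5.991.
spectrum = {"MOV": 7, "PUSH": 10, "POP": 3}
suspect = ["MOV"] * 6 + ["POP"] * 11
label = classify(spectrum, suspect, 5.991)
```

Note that only opcodes appearing in the family spectrum contribute terms to D², matching the sum over i = 1, ..., c in the definition above.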
2.7.3 Example

To illustrate the statistical test discussed above, we consider a simplified problem involving only three instructions, Ii1, Ii2, and Ii3. Suppose that for the family viruses under consideration, the instructions Ii1, Ii2, and Ii3 have frequencies ni1, ni2, and ni3, respectively. The spectrum for this example appears in Table 2.

Table 2: Example family virus spectrum.

    Instruction  Opcode  Frequency ni
    i1           MOV          7
    i2           PUSH        10
    i3           POP          3

Table 3: Example frequencies of suspect program.

    Instruction  Opcode  Frequency n̂i
    i1           MOV          6
    i3           POP         11

The possible observations for this “compiler” are MOV, PUSH, and POP, and the corresponding frequencies are 7, 10, and 3. Since we can view the distribution of interest as a histogram over these three instructions, the parameter space of θ is N³. Suppose that we have a suspect file that may or may not be a family virus. Given the instructions and the frequencies in Table 3, we would like to perform the χ² test so as to classify the file as a family virus or benign. The null hypothesis H0 is that the file is a family virus, and we accept H0 provided that the estimator function D² yields a score less than or equal to the χ² critical value. That is, we accept H0 provided that

    D² = Σ_{i=1}^{c} (n̂i − ni)² / ni ≤ χ²(α, c − 1) .

When computing the estimator D², we use the opcode frequency counts for each instruction. However, since χ² is a probability distribution, the frequencies are normalized before performing this test; normalization is done by simply dividing the count of an instruction by the total number of instructions. The normalized values for the spectrum in Table 2 are (MOV, 0.35), (PUSH, 0.5), and (POP, 0.15). The normalized values for the file under consideration (see Table 3) are (MOV, 0.353) and (POP, 0.647). Therefore,

    D² = (0.353 − 0.35)²/0.35 + (0.0 − 0.5)²/0.5 + (0.647 − 0.15)²/0.15 ≈ 2.1467.

We compare D² = 2.1467 to χ²(0.05, 2) = 5.991. Since D² ≤ χ²(0.05, 2), we accept the null hypothesis, that is, we classify the file as a family virus.
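The arithmetic in this example is easily checked. The short sketch below uses the exact proportions 6/17 and 11/17 (rather than the rounded values 0.353 and 0.647), which gives D² ≈ 2.1471, consistent with the rounded result above.

```python
# Verify the worked example: family spectrum from Table 2,
# observed suspect-file counts from Table 3.
expected = {"MOV": 7, "PUSH": 10, "POP": 3}   # family spectrum (sums to 20)
observed = {"MOV": 6, "PUSH": 0, "POP": 11}   # suspect file (sums to 17)

total_e = sum(expected.values())
total_o = sum(observed.values())

# D^2 computed on normalized (proportion) frequencies
d2 = sum((observed[op] / total_o - expected[op] / total_e) ** 2
         / (expected[op] / total_e)
         for op in expected)

# chi^2(0.05, 2) = 5.991, so H0 is accepted and the file
# is classified as a family virus.
accept_h0 = d2 <= 5.991
```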
3 Proposed Virus Detector

We propose a metamorphic detection method that combines an HMM detector (as developed in [30] and further extended and analyzed in [16]) with a chi-squared distance (CSD) estimator, as discussed above in Section 2.7. We refer to this combination of HMM and CSD as a hybrid method. Experiments show that the HMM detector performs extremely well in detecting viruses morphed with short sequences of code, whether the code is randomly selected or carefully chosen to be statistically similar to benign code [9]. The same high level of detection is achieved even when the short sequences of code are taken directly from benign files [16]. However, when long contiguous blocks (e.g., subroutines) from benign files are inserted into morphed viruses, the HMM detector fails at a relatively low percentage of such code [16]. Intuitively, the performance of our CSD estimator should not be affected by the length of the code blocks copied from benign files; only the total amount of such code should matter. Our goal is to experimentally verify this intuition and develop a hybrid model that will outperform both the HMM and CSD detectors.

Below, we consider ways to combine the HMM and CSD scores to obtain a score for the hybrid model. But first, we convert each of these scores into probabilities. Let PHMM(X) be the probability that corresponds to an HMM score for X, and let Pχ²(X) be the probability that corresponds to a CSD score for X. We can directly compute PHMM(X) from the score; see [22] for the details. To compute a probability for our CSD score, we use the fact that it is related to the χ² distribution and make use of the χ² cumulative distribution function (CDF). We can write

    Pχ²(X) = P(Y < D²) = CDFχ²(D²) = γ(k/2, D²/2) / Γ(k/2)

where Γ is the well-known gamma function¹

    Γ(z) = ∫₀^∞ t^(z−1) e^(−t) dt

and k represents the degrees of freedom, which in our case is one less than the number of unique instructions encountered in the training phase.
The function γ(s, x) is the lower incomplete gamma function, which is given by

    γ(s, x) = ∫₀^x t^(s−1) e^(−t) dt.

With these probabilities in hand, the probability for a hybrid model could be computed as

    P(X) = PHMM(X) · Pχ²(X).

To allow for different weightings of the component probabilities, we define

    Phybrid(X) = PHMM(X)^w1 · Pχ²(X)^w2

where w1 and w2 are to be determined. Finally, for numerical stability, we transform Phybrid(X) to its corresponding log-likelihood form

    log Phybrid(X) = w1 · log PHMM(X) + w2 · log Pχ²(X).               (3)

The values we used for w1 and w2 in equation (3) were determined by a grid search over the range 0 to 1. For this search, we conducted 100 experiments with w1 and w2 varying independently on a logarithmic scale. For each case, we computed results analogous to those described in Section 4. From this process, we found that our best results were obtained using the values w1 = 10^−8 and w2 = 10^−9 and, consequently, these weights are used for all experiments discussed in the next section.

¹The Γ function can be viewed as a generalization of the factorial function; for positive integers n, we have Γ(n) = (n − 1)!.

4 Validation and Datasets

In this section, we discuss the validation method used and the metrics that we employ to compare our experimental results. Then we briefly cover the datasets used in our experiments.

4.1 Cross-Validation

Cross-validation, or rotational estimation, is used to enhance the statistical validity of a limited data set [12]. This approach enables us to have sufficient training data, while also obtaining meaningful results in the testing phase. Since testing data must be disjoint from training data, a limited data set implies that only a relatively small amount of data may be available for testing. Hence, we may have insufficient data to obtain meaningful test results.
With cross-validation, the training data and the testing data are selected so that they do not intersect, and the experiment is repeated multiple times on different subsets of the data. This results in many more test cases, which tends to improve the overall reliability of the results. The approach that we employ here is five-fold cross-validation, where the data is divided into five equal subsets. Then four subsets are used for training, with the remaining subset reserved for testing. This process is repeated five times, with a different subset reserved for testing each time; each such selection is referred to as a “fold.” For each fold, the evaluation performance is recorded, and all folds are used to estimate the performance of a particular experiment. Next, we discuss the metrics we use to measure the accuracy of our classifiers.

4.2 Evaluation Metrics

Here, we briefly discuss false positive and false negative rates, and explain how we combine these into a measure of accuracy. We also consider ROC curves and their role in evaluating our metamorphic detection technique.

4.2.1 Accuracy Measure

There are four possible outcomes for detection, namely, true positive (TP), false positive (FP), true negative (TN), and false negative (FN). A detection is considered a true positive when a virus is correctly classified as a virus, whereas it is a true negative when a benign file is correctly marked as benign. Of course, TP and TN are desirable outcomes. A false positive occurs when a benign file is mistakenly classified as a virus. Similarly, a false negative occurs when a virus file is not detected as a virus, but instead is classified as benign. Table 4 shows these four possible outcomes.

Table 4: Possible outcomes for detection.

                         Predicted Class
    Actual Class     Virus             Benign
    Virus            True Positive     False Negative
    Benign           False Positive    True Negative

Short of perfect detection, there is an inherent tradeoff between the FP and FN rates.
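The four outcomes in Table 4 can be tallied directly from paired actual and predicted labels, as in the following small sketch (the label vectors are invented for illustration):

```python
# Tally the four detection outcomes of Table 4 from paired
# (actual, predicted) labels, where True means "virus".

def confusion_counts(actual, predicted):
    tp = sum(a and p for a, p in zip(actual, predicted))
    tn = sum((not a) and (not p) for a, p in zip(actual, predicted))
    fp = sum((not a) and p for a, p in zip(actual, predicted))
    fn = sum(a and (not p) for a, p in zip(actual, predicted))
    return tp, fp, tn, fn

# Hypothetical results for five test files (three viruses, two benign).
actual    = [True, True, True, False, False]
predicted = [True, False, True, False, True]
tp, fp, tn, fn = confusion_counts(actual, predicted)
```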
As an aside, we note that commercial antivirus makers generally try to avoid false positives, although this leads to significantly higher false negative rates. Users tend to notice (and be very unhappy) whenever an uninfected file is identified as malicious, but they tend to be more forgiving of (or may not even notice) an occasional false negative [4]. In any case, we would like to avoid both false positives and false negatives. The overall success rate, or accuracy rate, is the fraction of correct classifications obtained from the total number of files tested. The accuracy rate is given by

    Accuracy Rate = (TP + TN) / (TP + TN + FP + FN)                    (4)

The error rate is one minus this accuracy rate, that is,

    Error Rate = 1 − Accuracy Rate = 1 − (TP + TN) / (TP + TN + FP + FN)    (5)

Since we performed five-fold cross-validation, we have a total of five models for each experiment. Each of these folds yields an accuracy rate. Therefore, we use the mean of the accuracy values obtained from the five folds, which we refer to as the mean maximum accuracy (MMA) rate. This rate is computed as

    MMA = (1/5) Σ_{i=1}^{5} Accuracy Rate_i                            (6)

where Accuracy Rate_i denotes the accuracy rate for the ith fold in the five-fold cross-validation.

4.2.2 Receiver Operating Characteristic

The receiver operating characteristic (ROC) was developed for applications related to signal detection [10]. But ROC curves have gained popularity in the machine learning community as a tool for evaluating the performance of algorithms [2, 25]. ROC curves are typically represented as two-dimensional plots. For the virus detection problem under consideration here, we let the x-axis represent the false positive rate and the y-axis represent the true positive rate. Figure 5 illustrates an example with several ROC curves plotted on the same axes.

Figure 5: Example ROC curves.

An algorithm that performs random classification with 50% accuracy will generate a diagonal line in the ROC space.
Results closer to the top-left area of the graph represent improved classification, with a higher true positive rate. For example, the blue diamond line in Figure 5 represents perfect classification, i.e., a 100% true positive rate with a 0% false positive rate. The line indicated by red squares corresponds to a classification algorithm that achieves about a 78% true positive rate with a 10% false positive rate. The green triangle line indicates essentially random classification. In Section 5, we give ROC curves for an HMM-based metamorphic detector and our proposed CSD detector.

4.3 Datasets

The basic malware dataset used in this research consists of 200 metamorphic family viruses produced by the Next Generation Virus Construction Kit (NGVCK) [27]. For our representative benign files, we selected 40 Cygwin utility files [8]. The NGVCK files were selected because previous research has shown that these viruses are highly metamorphic, and they have served as the basis for previous HMM-based metamorphic detection [16, 29]. Each executable file was disassembled using IDA Pro [13] and opcode sequences were extracted for use in our experiments. Figure 6 illustrates the process.

Figure 6: Process for obtaining opcode sequences.

For many of our experiments, the opcode sequences from the NGVCK virus files were further morphed using a metamorphic generator similar to that in [16]. Figure 7 illustrates the basic steps taken to generate the various metamorphic viruses.

Figure 7: Process for generating metamorphic viruses.

Table 5: Experiments.

  Training Dataset
  Original NGVCK viruses
  NGVCK morphed with 10% dead code
  NGVCK morphed with 10% subroutine code

While the NGVCK files can be detected using an HMM-based approach [29], the generator in [16] is able to defeat the HMM detector.
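The ROC points discussed above can be obtained by sweeping a decision threshold over the scores a classifier assigns to each file and recording the resulting (false positive rate, true positive rate) pairs. The following sketch illustrates this computation; the function name and score values are ours, for illustration only.

```python
# Sketch: compute ROC points from classifier scores by threshold sweep.
# Higher score = more likely to be a virus (hypothetical convention).

def roc_points(virus_scores, benign_scores):
    points = []
    thresholds = sorted(set(virus_scores + benign_scores), reverse=True)
    for t in thresholds:
        tp = sum(s >= t for s in virus_scores)   # viruses flagged as viruses
        fp = sum(s >= t for s in benign_scores)  # benign files flagged as viruses
        tpr = tp / len(virus_scores)   # true positive rate (y-axis)
        fpr = fp / len(benign_scores)  # false positive rate (x-axis)
        points.append((fpr, tpr))
    return points

# Hypothetical scores for four virus files and four benign files.
virus = [0.9, 0.8, 0.7, 0.4]
benign = [0.6, 0.3, 0.2, 0.1]
for fpr, tpr in roc_points(virus, benign):
    print(f"FPR={fpr:.2f}  TPR={tpr:.2f}")
```

A classifier that perfectly separates the two score populations traces the ideal curve through (0, 1), while heavily overlapping scores produce points near the diagonal.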
Since our goal is to improve on the HMM detection results, these tools provide the material needed to compare our CSD detector and hybrid approach with the HMM detector, as well as to validate our HMM detector against previous work.

5 Experimental Results

In this section, we present the results of our experiments using an HMM-based detector, a chi-squared distance (CSD) estimator, and our proposed hybrid virus detector. As discussed in the previous section, we use ROC curves and mean maximum accuracy (MMA) rates to evaluate the performance of each of the three detectors.

Here, we refer to short segments of benign code that are interspersed throughout the file as "dead code." For the case where a long contiguous block is used, we refer to the inserted code as "subroutine code." For each of the three experiments listed in Table 5, the parameters of our metamorphic generator were set to produce increasing amounts of dead code and subroutine code, each in increments of 10%, up to a maximum of 40%. This yields 25 combinations of parameter values per experiment, with each combination employing 200 distinct metamorphic virus files (and 40 benign files), and each of these evaluated using five-fold cross-validation. All experiments presented here rely on the datasets and procedures discussed in Section 4. Additional experimental results can be found in [26].

5.1 Training on NGVCK Viruses

In this experiment, we used NGVCK metamorphic virus files as our base viruses. That is, for each training set, we used the NGVCK viruses without further morphing. For viruses in the test sets, various levels of morphing (dead code insertion and/or subroutine insertion) were applied. Table 6 contains MMA results for the HMM detector, the CSD estimator, and our hybrid model. For each row in the table, the highest score is the best detection rate for the specified level of morphing. For this case, relevant ROC curves appear in Figure 8.
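The 25 parameter combinations per experiment arise from a 5 × 5 grid over the two morphing percentages, which can be enumerated directly:

```python
# Enumerate the 5 x 5 sweep: dead code and subroutine code each vary
# from 0% to 40% in 10% increments, giving 25 combinations.
from itertools import product

levels = [0, 10, 20, 30, 40]  # percent of inserted benign code
combinations = list(product(levels, levels))  # (dead_code %, subroutine %)
print(len(combinations))  # 25 parameter settings per experiment
```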
For this test case, we observe that the HMM results are extremely good, with the hybrid results being only marginally better. The CSD results are comparable to the HMM at lower levels of morphing, but at higher levels of morphing, the HMM is superior. We also note that the CSD tends to perform somewhat better, relative to the HMM, at higher subroutine insertion rates, although it still falls below the level of the HMM.

5.2 Training Set Morphed with 10% Dead Code

For this experiment, the training dataset consists of NGVCK virus files that were further morphed by inserting 10% dead code. This experiment simulates the case where the base viruses are more highly morphed, with the additional morphing consisting of dead code selected from benign files. The purpose of inserting such code would be to evade statistics-based detection strategies. Table 7 summarizes the MMA scores obtained by the HMM detector, the CSD estimator, and the hybrid model. For this case, relevant ROC curves appear in Figure 9.

In this case, the HMM results are somewhat stronger, relative to the CSD, than in the previous experiment. As in the previous experiment, we see that the hybrid model offers some improvement over the HMM classifier.

5.3 Training Set Morphed with 10% Subroutine Code

As in the previous experiment, the base metamorphic viruses were morphed with an additional 10% of code taken from benign files. However, in this case, entire subroutines were extracted from the benign files. This represents the situation where the metamorphic viruses are more highly morphed than NGVCK, with the additional morphing consisting of contiguous blocks of code from benign files. The results for this experiment are given in Table 8. For this case, relevant ROC curves appear in Figure 10.

For this experiment, the HMM detection rates are poor, which is consistent with previous research [16]. However, the CSD performs well, and the hybrid results improve (slightly) on the CSD.
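For illustration, a Pearson chi-squared statistic over opcode frequencies can be computed as sketched below. This is a generic sketch, not the exact CSD statistic defined earlier in the paper; the benign opcode distribution and observed counts are hypothetical.

```python
# Illustrative chi-squared statistic comparing a file's observed opcode
# counts against expected counts from a (hypothetical) benign model.

def chi_squared(observed_counts, expected_freqs):
    """Pearson chi-squared: sum over opcodes of (O - E)^2 / E."""
    n = sum(observed_counts.values())  # total opcodes in the file
    stat = 0.0
    for opcode, p in expected_freqs.items():
        expected = n * p
        observed = observed_counts.get(opcode, 0)
        stat += (observed - expected) ** 2 / expected
    return stat

# Hypothetical benign opcode distribution and one file's observed counts.
benign_model = {"mov": 0.40, "push": 0.20, "call": 0.15,
                "pop": 0.15, "jmp": 0.10}
suspect = {"mov": 120, "push": 90, "call": 40, "pop": 30, "jmp": 20}

print(f"chi-squared statistic: {chi_squared(suspect, benign_model):.2f}")
```

A file whose opcode profile matches the benign model yields a statistic near zero, and the statistic depends only on aggregate frequencies, which is consistent with the observation that dispersed and contiguous insertions of the same amount of benign code affect the CSD similarly.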
5.4 Discussion

As expected, for the CSD detector, it makes little difference whether the benign code is dispersed throughout the file (dead code) or inserted as long contiguous blocks (subroutine code). This is not the case for the HMM, which fails at relatively low insertion percentages when the benign code is in the form of contiguous blocks. These results show that the HMM and CSD detectors provide significantly different statistical measures. Most importantly from the perspective of metamorphic detection, we have shown that it is possible to combine these two detectors to obtain a hybrid detector that is stronger than either individual detector.

6 Conclusion

In this paper, we considered a hybrid metamorphic detection strategy that employs both a machine learning (HMM) component and a statistical analysis (CSD) component. We showed that our hybrid detector generally outperforms either individual technique. Our hybrid approach overcomes a significant weakness in HMM-based metamorphic detection that was identified in previous research [16, 29].

Consistent with previous research, we found that the HMM detector performs well when benign code is inserted in small blocks, but does much worse when the morphing consists of contiguous blocks. As expected, the CSD estimator performed similarly in both cases. The overall high performance of the CSD detector was somewhat surprising, although the hybrid approach is superior in most cases. Still, given the simplicity of the CSD detector and the fact that its detection rates are generally close to those of the hybrid model, the CSD detector might be preferable in practice.

Future work could include the investigation of additional statistical models and evaluation of their performance. It might also be worthwhile to investigate other methods of combining two (or more) scores into a hybrid model. Finally, additional tests with other metamorphic generators and morphing strategies could prove interesting.

References

[1] A. V. Aho and M. J. Corasick, Efficient string matching: An aid to bibliographic search, Communications of the ACM, Vol. 18, pp. 333–340, 1975.
[2] K. Ataman, W. N. Street, and Y. Zhang, Learning to rank by maximizing AUC with linear programming, IEEE Technical Report, 2006.
[3] T. H. Austin, E. Filiol, S. Josse, and M. Stamp, Exploring hidden Markov models for virus analysis: A semantic approach, submitted for publication, 2012.
[4] J. Aycock, Computer Viruses and Malware, Springer, 2006.
[5] D. Chess and S. White, An undetectable computer virus, Virus Bulletin Conference, 2000.
[6] J. Borello and L. Me, Code obfuscation techniques for metamorphic viruses, Journal in Computer Virology, Vol. 4, No. 3, pp. 211–220, August 2008.
[7] F. Coulter and K. Eichorn, A good decade for cybercrime, McAfee, Inc., Technical Report, 2011.
[8] Cygwin, September 2011, [online], at http://www.cygwin.com
[9] P. Desai and M. Stamp, A highly metamorphic virus generator, International Journal of Multimedia Intelligence and Security, Vol. 1, No. 4, pp. 402–427, 2010.
[10] J. Egan, Signal Detection Theory and ROC Analysis, Academic Press, Inc., 1975.
[11] E. Filiol and S. Josse, A statistical model for undecidable viral detection, Journal in Computer Virology, Vol. 3, No. 1, pp. 65–74, April 2007.
[12] S. Geisser, Predictive Inference: An Introduction, Chapman and Hall, 1993.
[13] IDA Pro, Interactive disassembler, 2011, [online], at http://www.hex-rays.com/products/ida/index.shtml
[14] Intel, Intel Architecture Software Developer's Manual, Volume 2: Instruction Set Reference Manual, October 2011.
[15] J. Kolter and M. Maloof, Learning to detect malicious executables in the wild, Proceedings of KDD '04, 2004.
[16] D. Lin and M. Stamp, Hunting for undetectable metamorphic viruses, Journal in Computer Virology, Vol. 7, No. 3, pp. 201–214, August 2011.
[17] T. Mitchell, Machine Learning, McGraw Hill, 1997.
[18] M. Schultz, E. Eskin, and E. Zadok, Data mining methods for detection of new malicious executables, Proceedings of the IEEE International Conference on Data Mining, 2001.
[19] S. Madenur Sridhara, Metamorphic worm that carries its own morphing engine, Master's Projects, Paper 240, 2012, http://scholarworks.sjsu.edu/etd_projects/240
[20] S. Madenur Sridhara and M. Stamp, Metamorphic worm that carries its own morphing engine, submitted for publication.
[21] M. Stamp, Information Security: Principles and Practice, 2nd edition, Wiley, 2011.
[22] M. Stamp, A revealing introduction to hidden Markov models, [online], at www.cs.sjsu.edu/faculty/RUA/HMM.pdf
[23] P. Ször, The Art of Computer Virus Research and Defense, Addison-Wesley Professional, 2005.
[24] P. Ször and P. Ferrie, Hunting for metamorphic, Virus Bulletin, pp. 123–144, 2001.
[25] S. Thrun, L. K. Saul, and B. Scholkopf, Eds., AUC Optimization vs. Error Rate Minimization, MIT Press, 2004.
[26] A. H. Toderici, Chi-squared distance and metamorphic virus detection, Master's Thesis, Department of Computer Science, San Jose State University, May 2012.
[27] VX Heavens, [online], at http://www.vx.netlux.org/
[28] R. Wang, Flash in the pan? Virus Bulletin, July 1998.
[29] W. Wong, Analysis and detection of metamorphic computer viruses, Master's Thesis, Department of Computer Science, San Jose State University, May 2006.
[30] W. Wong and M. Stamp, Hunting for metamorphic engines, Journal in Computer Virology, Vol. 2, No. 3, pp. 211–229, December 2006.

Table 6: MMA results training on NGVCK viruses.
  Dead Code   Subroutine     HMM       CSD       Hybrid
  0%          0%             99.75%    99.50%    99.75%
  0%          10%            87.50%    85.81%    88.96%
  0%          20%            82.00%    81.58%    84.20%
  0%          30%            79.75%    80.35%    81.69%
  0%          40%            77.25%    76.84%    78.94%
  10%         0%             92.00%    90.81%    93.48%
  10%         10%            86.00%    83.87%    88.47%
  10%         20%            83.00%    79.60%    84.96%
  10%         30%            78.50%    76.90%    83.46%
  10%         40%            77.50%    74.89%    80.19%
  20%         0%             91.50%    86.39%    92.98%
  20%         10%            85.50%    82.66%    88.47%
  20%         20%            82.75%    78.91%    85.71%
  20%         30%            80.25%    75.91%    82.20%
  20%         40%            77.75%    74.41%    79.95%
  30%         0%             90.75%    81.91%    92.22%
  30%         10%            86.50%    79.16%    88.47%
  30%         20%            82.25%    76.47%    85.71%
  30%         30%            81.50%    74.16%    82.20%
  30%         40%            79.50%    73.21%    82.20%
  40%         0%             90.25%    77.82%    91.47%
  40%         10%            85.75%    76.00%    89.22%
  40%         20%            83.00%    73.97%    84.96%
  40%         30%            81.25%    73.23%    84.21%
  40%         40%            80.50%    71.46%    82.20%
  Average                    84.09%    79.43%    86.25%

Figure 8: ROC curves for the HMM and CSD detectors, trained on NGVCK viruses, tested on viruses morphed with dead code or subroutine code. Panels: (a) HMM, dead code; (b) HMM, subroutine code; (c) CSD, dead code; (d) CSD, subroutine code.

Table 7: MMA results training on NGVCK morphed with 10% dead code.

  Dead Code   Subroutine     HMM       CSD       Hybrid
  0%          0%             99.50%    92.00%    99.24%
  0%          10%            87.75%    80.75%    88.46%
  0%          20%            81.75%    76.00%    83.95%
  0%          30%            80.00%    73.50%    81.18%
  0%          40%            76.75%    71.50%    77.18%
  10%         0%             99.75%    95.75%    99.24%
  10%         10%            90.75%    86.75%    90.47%
  10%         20%            85.75%    80.00%    86.46%
  10%         30%            81.75%    77.25%    83.46%
  10%         40%            80.00%    75.25%    81.20%
  20%         0%             99.75%    94.00%    98.49%
  20%         10%            90.25%    87.25%    91.73%
  20%         20%            86.25%    81.75%    88.21%
  20%         30%            82.75%    78.25%    86.71%
  20%         40%            81.00%    76.00%    84.20%
  30%         0%             100.00%   93.00%    98.24%
  30%         10%            93.50%    87.25%    93.47%
  30%         20%            88.75%    82.75%    91.22%
  30%         30%            84.00%    78.00%    87.46%
  30%         40%            83.00%    77.75%    86.96%
  40%         0%             100.00%   92.50%    97.99%
  40%         10%            93.25%    88.00%    93.47%
  40%         20%            87.75%    82.75%    90.97%
  40%         30%            85.25%    81.50%    89.22%
  40%         40%            83.75%    78.50%    88.46%
  Average                    88.12%    82.72%    89.40%

Figure 9: ROC curves for the HMM and CSD detectors, trained on viruses morphed with 10% dead code, tested on viruses morphed with dead code or subroutine code. Panels: (a) HMM, dead code; (b) HMM, subroutine code; (c) CSD, dead code; (d) CSD, subroutine code.
Table 8: MMA results training on NGVCK morphed with 10% subroutine code.

  Dead Code   Subroutine     HMM       CSD       Hybrid
  0%          0%             62.25%    96.25%    97.24%
  0%          10%            60.50%    92.25%    94.97%
  0%          20%            59.25%    90.00%    92.47%
  0%          30%            58.25%    88.50%    90.72%
  0%          40%            57.00%    86.75%    88.71%
  10%         0%             61.75%    95.00%    97.23%
  10%         10%            60.50%    92.50%    96.22%
  10%         20%            59.50%    90.00%    93.72%
  10%         30%            58.25%    88.75%    92.22%
  10%         40%            59.00%    88.50%    91.22%
  20%         0%             65.75%    92.00%    95.23%
  20%         10%            64.25%    90.25%    94.23%
  20%         20%            64.00%    89.50%    92.72%
  20%         30%            62.25%    88.75%    91.96%
  20%         40%            59.75%    88.25%    90.97%
  30%         0%             68.75%    89.50%    91.96%
  30%         10%            68.00%    89.75%    92.21%
  30%         20%            67.00%    90.25%    92.46%
  30%         30%            64.75%    88.50%    90.71%
  30%         40%            63.75%    88.25%    90.96%
  40%         0%             71.25%    87.50%    90.45%
  40%         10%            70.25%    88.75%    91.46%
  40%         20%            69.25%    88.75%    91.46%
  40%         30%            66.00%    87.75%    90.96%
  40%         40%            65.25%    87.25%    89.71%
  Average                    63.46%    89.74%    92.48%

Figure 10: ROC curves for the HMM and CSD detectors, trained on viruses morphed with 10% subroutine code, tested on viruses morphed with dead code or subroutine code. Panels: (a) HMM, dead code; (b) HMM, subroutine code; (c) CSD, dead code; (d) CSD, subroutine code.