Automatic Analysis of Malware Behavior
using Machine Learning
Authors: Konrad Rieck, Philipp Trinius,
Carsten Willems, and Thorsten Holz
Presented by: Satyajeet
Dept of Computer & Information Sciences
University of Delaware
CISC 879 - Machine Learning for Solving Systems Problems
Abstract & Introduction
• Malware
  • Poses a major threat to the security of computer systems.
  • Very diverse: viruses, internet worms, trojan horses, and others.
  • Large in volume: millions of hosts are infected.
  • Obfuscation and polymorphism impede detection at the file level.
  • Dynamic analysis helps in characterizing malware and defending against it.
Abstract & Introduction Contd..
• A framework for automatic analysis of malware behavior using machine learning.
• Clustering: the framework automatically identifies novel classes of malware with similar behavior.
• Classification: unknown malware is assigned to these discovered classes.
• An incremental approach for behavior-based analysis builds on both.
Automatic analysis of
Malware Behavior
• Framework steps and procedure
  • Malware binaries are executed and monitored in a sandbox environment. A report is generated on the system calls and their arguments.
  • The sequential reports are embedded in a vector space where each dimension is associated with a behavioral pattern.
  • ML techniques are then applied to the embedded reports to identify and classify malware.
  • Incremental analysis progresses by alternating between clustering and classification.
Report representation
• Can be textual or XML
  • Human readable and suitable for computing general statistics
  • But not efficient for automatic analysis
• Hence MIST (Malware Instruction Set)
  • Inspired by instruction sets used in processor design.
MIST
• Category of system calls
• Operation: reflects a particular system call
• Arguments as argblocks
Sandbox and MIST
representation
Representation
• These sequential reports identify typical malware behavior, such as changing registry keys or modifying system files.
• But they are still not suitable for efficient analysis techniques. Hence the need to embed behavior reports in a vector space using instruction q-grams.
• This embedding expresses similarity of behavior geometrically, as a distance between vectors.
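The q-gram embedding can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes a binary embedding (a q-gram either occurs or not) normalized to unit length, and the instruction names in the example reports are hypothetical placeholders rather than real MIST instructions.

```python
import math

def qgram_embed(report, q=2):
    """Embed a sequence of instructions as a sparse, unit-length q-gram vector.

    Assumption: binary embedding, i.e. each q-gram counts once however
    often it occurs in the report.
    """
    grams = {tuple(report[i:i + q]) for i in range(len(report) - q + 1)}
    if not grams:
        return {}
    weight = 1.0 / math.sqrt(len(grams))  # normalize to Euclidean norm 1
    return {g: weight for g in grams}

def distance(x, y):
    """Euclidean distance between two sparse vectors stored as dicts."""
    keys = set(x) | set(y)
    return math.sqrt(sum((x.get(k, 0.0) - y.get(k, 0.0)) ** 2 for k in keys))

# Hypothetical reports: r1 and r2 share most of their behavior, r3 does not.
r1 = ["load_dll", "open_file", "write_file", "set_registry"]
r2 = ["load_dll", "open_file", "write_file", "create_process"]
r3 = ["connect", "send", "recv", "close"]

x1, x2, x3 = (qgram_embed(r) for r in (r1, r2, r3))
print(distance(x1, x2), distance(x1, x3))
```

Reports that share many q-grams end up close together, while reports with no q-grams in common sit at the maximum distance for unit vectors.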
Clustering and Classification
• Reports are embedded in a vector space, so they are ready for applying ML techniques.
• Clustering of behavior: classes of malware with similar behavior are identified.
• Classification of behavior: malware is assigned to known classes of behavior.
• What allows us to do this?
  • Malware binaries usually come in families of similar variants with similar behavior patterns!
Contd..
Algorithms
• Prototype extraction
  • Iterative algorithm
  • Extracts a small set of prototypes from a set of reports. The first one is chosen at random.
• Clustering using prototypes
  • At the beginning, each prototype is an individual cluster.
  • The algorithm repeatedly determines and merges the nearest pair of clusters.
• Classification using prototypes
  • Allows learning to discriminate between classes of malware.
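Prototype extraction and prototype-based clustering might look like the sketch below. It rests on assumptions the slide does not state: prototypes are picked by farthest-first selection until every report lies within `max_dist` of some prototype, and clusters are merged with complete linkage until the nearest pair is farther apart than `max_dist`. The function names, parameters, and the 2-D points standing in for embedded reports are all hypothetical.

```python
import math
import random

def extract_prototypes(points, max_dist, dist=math.dist, first=None):
    """Return indices of prototypes chosen by farthest-first selection.

    Assumed stopping rule: stop once every point is within max_dist
    (expected to be >= 0) of its nearest prototype.
    """
    if not points:
        return []
    if first is None:
        first = random.randrange(len(points))  # first prototype at random
    prototypes = [first]
    # Distance from every point to its nearest prototype so far.
    nearest = [dist(p, points[first]) for p in points]
    while len(prototypes) < len(points) and max(nearest) > max_dist:
        far = max(range(len(points)), key=nearest.__getitem__)
        prototypes.append(far)
        nearest = [min(d, dist(p, points[far])) for d, p in zip(nearest, points)]
    return prototypes

def cluster_prototypes(protos, max_dist, dist=math.dist):
    """Agglomerative clustering: start with one cluster per prototype and
    merge the nearest pair until no pair is within max_dist.

    Assumption: complete linkage (distance of the farthest pair of members).
    """
    clusters = [[p] for p in protos]

    def linkage(a, b):
        return max(dist(x, y) for x in a for y in b)

    while len(clusters) > 1:
        d, i, j = min((linkage(clusters[i], clusters[j]), i, j)
                      for i in range(len(clusters))
                      for j in range(i + 1, len(clusters)))
        if d > max_dist:
            break
        clusters[i].extend(clusters.pop(j))  # j > i, so index i stays valid
    return clusters

# Hypothetical 2-D stand-ins for embedded reports.
points = [(0, 0), (0, 1), (10, 0), (10, 1), (5, 20)]
idx = extract_prototypes(points, max_dist=2.0, first=0)
print(idx)
print(cluster_prototypes([points[i] for i in idx], max_dist=12.0))
```

Because only the prototypes are clustered, the quadratic cost of comparing pairs is paid on a small subset of the reports rather than on the whole corpus.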
Algorithms Contd..
• Classification using prototypes (continued)
  • For each report, the algorithm determines the nearest prototype among the clusters in the training data. If the report lies within the radius, it is assigned to that cluster.
  • Otherwise the report is rejected and held back for later incremental analysis.
• Incremental analysis
  • Reports to be analyzed are received from a source.
  • They are initially classified using the prototypes of known clusters.
  • Variants of known malware are thereby identified and filtered out of further analysis.
  • Prototypes are extracted from the remaining reports and clustered again.
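Classification with rejection, as used in the first stage of the incremental analysis, can be sketched as below. This is an illustrative sketch under assumptions: known clusters are represented as hypothetical `(label, vector)` prototype pairs, 2-D points stand in for embedded reports, and `radius` is a hypothetical name for the rejection threshold.

```python
import math

def classify(report, prototypes, radius, dist=math.dist):
    """Return the label of the nearest prototype, or None if the report
    is farther than radius from every prototype (rejected)."""
    if not prototypes:
        return None
    label, proto = min(prototypes, key=lambda lp: dist(report, lp[1]))
    return label if dist(report, proto) <= radius else None

def split_batch(reports, prototypes, radius, dist=math.dist):
    """Split a batch into (report, label) pairs for known malware and a
    list of rejected reports held back for a later clustering pass."""
    assigned, rejected = [], []
    for r in reports:
        label = classify(r, prototypes, radius, dist)
        if label is None:
            rejected.append(r)
        else:
            assigned.append((r, label))
    return assigned, rejected

# Hypothetical prototypes of two known clusters and an incoming batch.
known = [("family_a", (0.0, 0.0)), ("family_b", (10.0, 0.0))]
batch = [(1.0, 0.0), (9.5, 0.5), (5.0, 20.0)]
assigned, rejected = split_batch(batch, known, radius=2.0)
print(assigned)
print(rejected)
```

In a full incremental loop, the rejected reports would then go through prototype extraction and clustering, and the resulting prototypes would be added to `known` for the next batch.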
Experiments and Results
Evaluating components
• Prototype extraction
  • Precision of 0.99 with the corpus compressed to 2.9% and 7% of its size
• Clustering
  • Evaluated using precision, recall, and compression.
  • Evaluated using F-measure.
  • F-measure in the experiments: MIST 1 = 0.93 and MIST 2 = 0.95, better than the 0.881 of previous related work.
• Classification
  • F-measure in the experiments: MIST 1 = 0.96 and MIST 2 = 0.99
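For reference, one way to compute precision, recall, and F-measure for a clustering is sketched below. The definitions are an assumption, since the slide only names the metrics: precision sums, over clusters, the size of the largest single-class group in each cluster; recall sums, over classes, the largest number of that class's reports that landed in one cluster; both are divided by the number of reports, and the F-measure is their harmonic mean. The cluster ids and labels are made up.

```python
from collections import Counter, defaultdict

def cluster_scores(clusters, labels):
    """Return (precision, recall, f_measure) for a non-empty clustering.

    clusters[i] is the cluster id and labels[i] the true class of report i.
    """
    n = len(labels)
    by_cluster = defaultdict(Counter)  # cluster id -> class counts
    by_label = defaultdict(Counter)    # class -> cluster id counts
    for c, y in zip(clusters, labels):
        by_cluster[c][y] += 1
        by_label[y][c] += 1
    precision = sum(max(cnt.values()) for cnt in by_cluster.values()) / n
    recall = sum(max(cnt.values()) for cnt in by_label.values()) / n
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure

# Made-up example: six reports, three clusters, two true classes.
clusters = [0, 0, 0, 1, 1, 2]
labels = ["a", "a", "b", "b", "b", "b"]
print(cluster_scores(clusters, labels))
```

Precision penalizes clusters that mix classes, recall penalizes classes scattered over several clusters, and the F-measure balances the two.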
Experiments and Results Contd..
Experiments and Results Contd..
Conclusion
• A new framework is introduced that overcomes several deficiencies of previous approaches.
• The framework is learning based.
• The framework can be implemented in practice.
  • Steps: collect malware, study it in a sandbox environment, embed the observed behavior in a vector space, and apply the learning algorithms (clustering and classification).
• This process is efficient and learns automatically after the initial setup and run.
Thank you!