Automatic Analysis of Malware Behavior using Machine Learning
Authors: Konrad Rieck, Philipp Trinius, Carsten Willems, and Thorsten Holz
Presented by: Satyajeet
Dept. of Computer & Information Sciences, University of Delaware
CISC 879 - Machine Learning for Solving Systems Problems

Abstract & Introduction
• Malware
  • Poses a major threat to the security of computer systems.
  • Very diverse: viruses, Internet worms, Trojan horses.
  • Widespread: millions of hosts infected.
  • Obfuscation and polymorphism impede detection at the file level.
  • Dynamic analysis helps in characterizing malware and defending against it.

Abstract & Introduction Contd.
• Framework for automatic analysis of malware behavior using machine learning.
• Clustering: automatically identifies novel classes of malware with similar behavior.
• Classification: assigns unknown malware to these discovered classes.
• An incremental approach combines both for behavior-based analysis.

Automatic analysis of Malware Behavior
• Framework steps and procedure
  • Malware binaries are executed and monitored in a sandbox environment, which generates a report of system calls and their arguments.
  • The sequential reports are embedded in a vector space where each dimension is associated with a behavioral pattern.
  • ML techniques are then applied to the embedded reports to identify and classify malware.
  • Incremental analysis progresses by alternating between clustering and classification.

Report representation
• Sandbox reports can be textual or XML.
• Human readable and suitable for computing general statistics.
• But not efficient for automatic analysis, hence MIST (Malware Instruction Set).
• Inspired by instruction sets used in processor design.
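
The embedding step listed under "Framework steps and procedure" can be illustrated with a minimal sketch. It assumes a report is already a flat list of instruction names, that each dimension records only whether a window of q consecutive instructions occurs, and that vectors are scaled to unit length. The instruction names, the choice q = 2, and the function names `qgrams`, `embed` and `distance` are hypothetical and are not taken from MIST or from the authors' implementation.

```python
import math

def qgrams(instructions, q=2):
    """Return every window of q consecutive instructions as a tuple."""
    return [tuple(instructions[i:i + q]) for i in range(len(instructions) - q + 1)]

def embed(instructions, q=2):
    """Map a report to a sparse unit-length vector with one dimension per q-gram.

    Assumption: binary presence (not counts), then L2 normalization.
    """
    present = set(qgrams(instructions, q))
    if not present:
        return {}
    weight = 1.0 / math.sqrt(len(present))
    return {gram: weight for gram in present}

def distance(u, v):
    """Euclidean distance between two sparse vectors stored as dicts."""
    keys = set(u) | set(v)
    return math.sqrt(sum((u.get(k, 0.0) - v.get(k, 0.0)) ** 2 for k in keys))

# Hypothetical instruction names, for illustration only.
report_a = ["open_file", "read_file", "write_registry", "close_file"]
report_b = ["open_file", "read_file", "write_registry", "connect_socket"]
report_c = ["create_process", "connect_socket", "send_data", "close_socket"]

va, vb, vc = embed(report_a), embed(report_b), embed(report_c)
print(distance(va, vb))  # reports sharing most q-grams are close
print(distance(va, vc))  # reports sharing no q-grams are far apart
```

With unit-length vectors, two reports that share no q-grams sit at distance sqrt(2), which gives the distance a fixed upper bound regardless of report length.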

MIST
• Category of system calls
• Operation: reflects a particular system call
• Arguments as argblocks

Sandbox and MIST representation

Representation
• These sequential reports identify typical malware behavior, such as changing registry keys or modifying system files.
• But they are still not suitable for efficient analysis techniques, hence the need to embed behavior reports in a vector space using instruction q-grams.
• This embedding expresses similarity of behavior geometrically, as a distance between vectors.

Clustering and Classification
• Reports are embedded in a vector space, so they are ready for ML techniques.
• Clustering of behavior: classes of malware with similar behavior are identified.
• Classification of behavior: malware is assigned to known classes of behavior.
• What allows us to do this?
  • Malware binaries typically belong to families of variants with similar behavior patterns!

Clustering and Classification Contd.

Algorithms
• Prototype extraction
  • Iterative algorithm.
  • Extracts a small set of prototypes from the set of reports. The first one is chosen at random.
• Clustering using prototypes
  • At the beginning, each prototype is an individual cluster.
  • The algorithm repeatedly determines and merges the nearest pair of clusters.
• Classification using prototypes
  • Allows learning to discriminate between classes of malware.

Algorithms Contd.
• For each report, the algorithm determines the nearest prototype among the clusters in the training data; if the report lies within the radius, it is assigned to that cluster.
• Otherwise the report is rejected and held back for later incremental analysis.
• Incremental analysis
  • Reports to be analyzed are received from a source.
  • Reports are initially classified using prototypes of known clusters.
  • Variants of known malware are thereby identified and filtered out of further analysis.
  • Prototypes are extracted from the remaining reports, which are then clustered again.

Experiments and Results

Evaluating components
• Prototype extraction
  • Precision of 0.99 at corpus compression of 2.9% and 7%.
• Clustering
  • Evaluated using precision, recall and compression.
  • Evaluated using F-measure.
  • F-measure in experiments: MIST 1 = 0.93 and MIST 2 = 0.95, better than the 0.881 of previous related work.
• Classification
  • F-measure in experiments: MIST 1 = 0.96 and MIST 2 = 0.99.

Experiments and Results Contd.

Conclusion
• A new framework is introduced which overcomes several deficiencies of previous approaches.
• The framework is learning based.
• The framework can be implemented in practice.
  • Steps: collect malware, study it in a sandbox environment, embed the observed behavior in a vector space, and apply learning algorithms (clustering and classification).
  • The process is efficient and learns automatically after the initial setup and run.

Thank you!
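
The prototype-based steps from the "Algorithms" slides can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: reports are stood in for by 2-D points, prototypes are picked with a farthest-first rule (one common way to realize "iterative, first one chosen at random"), and the merging of nearest clusters is omitted, so each new prototype simply starts as its own cluster. The names `extract_prototypes`, `classify`, `incremental_step`, `max_dist` and `radius` are hypothetical.

```python
import math
import random

def extract_prototypes(points, max_dist, rng=None):
    """Pick prototypes until every point lies within max_dist of one.

    Assumption: farthest-first selection; points must be non-empty.
    """
    if max_dist < 0:
        raise ValueError("max_dist must be non-negative")
    rng = rng or random.Random(0)
    prototypes = [rng.choice(points)]  # first prototype chosen at random
    # Distance from each point to its nearest prototype so far.
    nearest = [math.dist(p, prototypes[0]) for p in points]
    while max(nearest) > max_dist:
        far = points[nearest.index(max(nearest))]  # worst-covered point
        prototypes.append(far)
        nearest = [min(d, math.dist(p, far)) for p, d in zip(points, nearest)]
    return prototypes

def classify(point, labeled_prototypes, radius):
    """Return the label of the nearest prototype, or None to reject.

    labeled_prototypes is a non-empty list of (label, point) pairs.
    """
    label, proto = min(labeled_prototypes, key=lambda lp: math.dist(point, lp[1]))
    return label if math.dist(point, proto) <= radius else None

def incremental_step(batch, labeled_prototypes, radius, max_dist):
    """Classify a batch, then extract prototypes from the rejected points."""
    assigned, rejected = {}, []
    for p in batch:
        label = classify(p, labeled_prototypes, radius)
        if label is None:
            rejected.append(p)   # held back: no known cluster is close enough
        else:
            assigned[p] = label  # variant of a known class
    new_prototypes = extract_prototypes(rejected, max_dist) if rejected else []
    return assigned, new_prototypes

# Hypothetical data: two known classes and a batch with two unknown points.
known = [("A", (0.0, 0.0)), ("B", (5.0, 5.0))]
batch = [(0.1, 0.2), (5.2, 4.9), (9.0, 0.0), (9.1, 0.1)]
assigned, new_prototypes = incremental_step(batch, known, radius=1.0, max_dist=1.0)
print(assigned)
print(new_prototypes)
```

In this toy run the first two points fall within the radius of a known prototype, while the last two are rejected and summarized by a single new prototype, which would seed a new cluster in the next round.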