Cluster Selector

In recent years, the amount of malicious code (viruses, worms, Trojans,
etc.) sent over the Internet has grown sharply.
Because of this growth, malware is renewed and improved much faster than
the anti-virus software on the market today is updated.
Our solution focuses on the signature generation process. We have
developed an automatic system whose goal is to extract simple, unique
and optimal signatures for malicious files.
This way, any IDS/IPS will be able to neutralize hostile code in real time. In
addition, we have developed an evaluation environment whose objective is to
determine the best configuration for generating an optimal signature for
malicious files.
Extraction method:
The Interactive Disassembler (IDA):
IDA is a commercial disassembler widely used for reverse
engineering: it receives a binary file and translates it
back into assembly code. Using a dedicated
plug-in, IDA can identify, extract and normalize all the
functions in the file.
Data mining:
A classifier is trained on a set of byte segments, each labeled
as a function start, a function end, or neither. It then
classifies byte segments from a suspicious file in the same
way. The predicted start and end positions let us extract
the functions from a given file.
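The boundary-based extraction described above can be sketched as follows. This is only an illustration under stated assumptions: `toy_classifier` is a hypothetical stand-in for the trained classifier (it recognizes the common x86 prologue `push ebp; mov ebp, esp` and the `ret` opcode instead of learning from data), and `SEGMENT_LEN` and `extract_functions` are names invented for this example.

```python
from typing import Callable, List

SEGMENT_LEN = 3  # hypothetical window size in bytes


def toy_classifier(segment: bytes) -> str:
    """Stand-in for the trained classifier: returns 'start', 'end' or 'neither'."""
    if segment.startswith(b"\x55\x8b\xec"):  # x86: push ebp; mov ebp, esp
        return "start"
    if segment[:1] == b"\xc3":  # x86: ret
        return "end"
    return "neither"


def extract_functions(data: bytes, classify: Callable[[bytes], str]) -> List[bytes]:
    """Slide a window over the file and cut a function at each start/end pair."""
    functions = []
    start = None
    for i in range(len(data)):
        label = classify(data[i:i + SEGMENT_LEN])
        if label == "start" and start is None:
            start = i
        elif label == "end" and start is not None:
            functions.append(data[start:i + 1])  # include the end byte
            start = None
    return functions


sample = b"\x90\x55\x8b\xec\x33\xc0\xc3\x90\x55\x8b\xec\x40\xc3"
found = extract_functions(sample, toy_classifier)
```

Passing the classifier as a parameter keeps the extraction loop independent of how the labels are produced, so a trained model could replace the toy rule without changing the loop.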
Generally, the Signature Builder system operates as follows:
it builds a common functions library (CFL); given a
malicious file, it extracts the file's functions and filters
out the common ones using the CFL; and finally it chooses,
from the remaining functions (the candidates), the best one
to act as the malicious file's signature. The system extracts
functions from the malware files using several algorithms and
provides a signature for each malware file.
System flow:
• Initialize the system
• Initialize Configuration
• CFL Handling
• Receive File from Client
• Extracting Functions
• Filter Common Functions
• Generate Candidates
• Select Best Candidate
• Return Signature
Selection methods:
Random Selector:
Chooses a signature at random from the candidates.
Minimum Entropy Selector:
The selector calculates the entropy of each candidate and
selects the one with the minimum entropy.
• Let S be a string/signature.
• Let Sc be a character in S.
• Let |Sc| be the number of times Sc appears in S, and |S| the length of S.
• The entropy of S is:
$$E(S) = -\sum_{c \in C} \frac{|S_c|}{|S|} \log_2 \frac{|S_c|}{|S|}$$
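A minimal sketch of the Random Selector and the Minimum Entropy Selector, assuming each candidate is a byte string and that the entropy is the Shannon entropy over the byte values occurring in the candidate. The function names and the optional `rng` parameter are illustrative, not taken from the system.

```python
import math
import random
from collections import Counter
from typing import List, Optional


def entropy(s: bytes) -> float:
    """Entropy of s in bits: -sum over distinct bytes of (|Sc|/|S|) * log2(|Sc|/|S|)."""
    if not s:
        return 0.0
    n = len(s)
    return -sum((count / n) * math.log2(count / n) for count in Counter(s).values())


def random_selector(candidates: List[bytes], rng: Optional[random.Random] = None) -> bytes:
    """Choose a signature at random from the candidates."""
    rng = rng or random.Random()
    return rng.choice(candidates)


def min_entropy_selector(candidates: List[bytes]) -> bytes:
    """Choose the candidate with the lowest entropy."""
    return min(candidates, key=entropy)


candidates = [b"abcd", b"aabb", b"aaab"]
chosen = min_entropy_selector(candidates)
```

Here `b"abcd"` has an entropy of 2 bits and `b"aabb"` has 1 bit, while `b"aaab"` is lower still, so it is the one selected.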
Cluster Selector:
This selector groups the candidates into clusters by their distance
from each other and scores each cluster by the chance that it
contains the best signature. The score is computed from the
following quantities:
• Cs denotes the cluster size in bytes
• Fs denotes the file's size
• Fc denotes the number of functions in the cluster
• T denotes the total number of functions in the file
• Fl denotes the sum of the functions' lengths in the cluster
$$ClusterScore = \frac{C_s}{F_s} \cdot \frac{F_c}{T} \cdot \frac{F_l}{C_s}$$
Evaluation Environment:
The evaluation environment evaluates the different configurations of
the signature builder in order to judge the quality of the
signatures. The main idea is to check whether the signature of a
malicious file appears in a control group of benign files. A good
signature, which belongs to a malicious file, should not appear
in benign files.
Each configuration consists of the following inputs:
• CFL size in MB
• Maximum signature length in bytes
• Function similarity threshold
• Offset size in bytes
• Function extractor
• Function selection
The output consists of the following:
• Processed - the number of malware files for which the system
managed to generate a signature.
• Processed (%) - Processed / Total Malware Files.
• Signature Hits - the number of unique malware files whose
signature gives at least one false alarm (FA).
• Signature Hits (%) - Signature Hits / Processed.
• Unique Signature - the number of unique signatures
that did not produce an FA.
• Different Files - the number of distinct files in the control group
that have at least one hit.
• Different Files (%) - Different Files / Total Control Group Files.
Probability Selector:
Key idea: estimate the probability that each candidate
signature will match a randomly chosen block of bytes in the
code of a randomly chosen program.
Select one or more signatures with the lowest estimated
false-positive probabilities among all the candidates, provided
the probability is below a predefined threshold.
• For a given sequence of S bytes B = B1B2…BS, estimate the
probability p(B) that B occurs in a large body of normal,
uninfected code.
• TS - the number of S-byte sequences in a large corpus of
uninfected programs
• f(B) - the number of occurrences of B among those sequences
$$p(B_1 B_2 \ldots B_S) \approx \frac{f(B_1 B_2 B_3)\, f(B_2 B_3 B_4) \cdots f(B_{S-2} B_{S-1} B_S)}{f(B_2 B_3)\, f(B_3 B_4) \cdots f(B_{S-2} B_{S-1})\, T_3}$$
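The Probability Selector estimate can be sketched as below. The sketch reads the estimate as a product of 3-byte sequence counts divided by the counts of the overlapping 2-byte sequences and by T3, the total number of 3-byte sequences in the corpus. The function names, the in-memory corpus and the `threshold` parameter are assumptions made for illustration; a real corpus of uninfected programs would be far larger.

```python
from collections import Counter
from typing import Iterable, List, Optional


def ngram_counts(corpus: Iterable[bytes], n: int) -> Counter:
    """Count every n-byte sequence occurring in the corpus programs."""
    counts: Counter = Counter()
    for program in corpus:
        for i in range(len(program) - n + 1):
            counts[program[i:i + n]] += 1
    return counts


def estimate_probability(b: bytes, bigrams: Counter, trigrams: Counter) -> float:
    """Estimate p(B) from trigram and bigram counts (B must have at least 3 bytes)."""
    if len(b) < 3:
        raise ValueError("sequence must contain at least 3 bytes")
    t3 = sum(trigrams.values())
    if t3 == 0:
        return 0.0
    p = trigrams[b[0:3]] / t3  # f(B1B2B3) / T3
    for i in range(1, len(b) - 2):
        tri = trigrams[b[i:i + 3]]
        if tri == 0:
            return 0.0  # a trigram never seen in the corpus
        p *= tri / bigrams[b[i:i + 2]]  # f(Bi Bi+1 Bi+2) / f(Bi Bi+1)
    return p


def probability_selector(candidates: List[bytes], bigrams: Counter,
                         trigrams: Counter, threshold: float) -> Optional[bytes]:
    """Return the candidate with the lowest estimate if it is below the threshold."""
    best = min(candidates, key=lambda c: estimate_probability(c, bigrams, trigrams))
    if estimate_probability(best, bigrams, trigrams) < threshold:
        return best
    return None


corpus = [b"abcabcabd"]
bi = ngram_counts(corpus, 2)
tri = ngram_counts(corpus, 3)
p_abca = estimate_probability(b"abca", bi, tri)
```

In this toy corpus there are 7 three-byte sequences, `abc` and `bca` occur twice each and `bc` occurs twice, so the estimate for `abca` is (2 · 2) / (2 · 7) = 2/7.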
Ido Levin
Language:
IDE:
Ofir Nissel
Yotam Katzman
Operating System:
Academic Advisor:
Professional Advisor:
Dr. Yuval Elovici
Mr. Asaf Shabtai