Crucible
prepared by
Greg Hoglund,
2010,
HBGary, Inc.
Summary
Large scale automated malware analysis can provide raw data for subsequent analysis and intelligent
response to threats. In general, malware is largely self-similar in terms of behaviors, and such a system
should be able to determine if a program contains malicious behavior automatically. Also, depending on
the volumes required, such a system can be linearly scaled in terms of capacity so it can operate at the
perimeter of the network as opposed to being an end-node solution. There are several components to
the system outlined below. The logical basis for the system is that malware behavior can be recovered
by forcing malware to execute in an emulated environment, and that this can be performed at a
reasonable cost-per-malware-per-day. The goals are quite simply stated:
Recover code instructions, even if the program code is obfuscated and/or uses anti-debugging
technology
Recover as many instructions as possible, with an ideal of 100% code recovery
From said recovery, determine the behavior of the malware
Use said recovered information for secondary analysis, automatic IDS signature generation,
genome and lineage analytics, etc.
Contents
Summary ....................................................................................................................................................... 1
Logical basis for the system .......................................................................................................................... 4
Large-scale 'bump alongside the wire' ......................................................................................................... 5
Per node data collection ........................................................................................................................... 6
Software loop trace............................................................................................................................... 6
Data state trace..................................................................................................................................... 6
Loop Diagrams .............................................................................................................................................. 7
Emulated Process Environment .................................................................................................................. 10
Loader Design.......................................................................................................................................... 10
Input Expression Solver (IES)....................................................................................................................... 12
Flow Tracer and Mutation Engine ........................................................................................................... 13
Data flow tracing ..................................................................................................................................... 14
Data State Progression Map (DSPM) ...................................................................................................... 16
Crafted Input Expression Language (CIEL) .............................................................................................. 16
Co-instructions and Reverse Evaluation ................................................................................................. 18
Reporting .................................................................................................................................................... 19
Figure 1 - High level architecture of collection and analysis system ............................................................ 5
Figure 2 - Inexpensive cluster of motherboards ........................................................................................... 5
Figure 3 - Per node data collection ............................................................................................................... 6
Figure 4 - Emulated Environment for Malware Execution ......................................................................... 10
Figure 5 - Loader design .............................................................................................................................. 10
Figure 6 - Continued Execution ................................................................................................................... 11
Figure 7 - Continued Execution ................................................................................................................... 11
Figure 8 API Surface Emulation for continued execution ........................................................................... 12
Figure 9 - Conceptual diagram of input-controlled branches leading to a specific target block ................ 12
Figure 10 - prototype of FLOW TRACER illustrating controlled branches .................................................. 13
Figure 11 - data flow tracing, including arithmetic values.......................................................................... 15
Figure 13 - CIEL expression ......................................................................................................................... 16
Figure 12 - DSPM to Leaf Node for Fully Realized Capability...................................................................... 16
Figure 14 - mapping of comparisons to expressed CIEL fields .................................................................... 17
Figure 15 - prototype of buffer mutation ................................................................................................... 18
Figure 16 - pathing control flow to a capability .......................................................................................... 19
Figure 17 - using data in the path to automatically reconstruct a command and control protocol for
subsequent IDS signature generation ......................................................................................................... 20
Logical basis for the system
Fact:
•
If a line of code executes, it can be recovered
The mechanism:
• Get all the instructions to execute, therefore recovering them
What is a program?
• It’s code and data working together
• Both code and data both exist as data
• All data exists in memory
• Everything is just data in memory
What is program state?
• If everything is just data in memory, then a program state is the data at a given time in memory.
• As a program runs, it’s state changes, which is represented as changes to data in memory
Program State Capturing
• If a given program, and it’s state, exists in data memory, then to capture a ‘snapshot’ of this
state simply requires making a copy of all the data
Program State Restoration
• If we have taken a ‘snapshot’ of a program, then we can ‘restore’ that snapshot by halting
execution, overwriting the program data with the saved data, and restoring execution at the
previous point.
Code Recovery
• To obtain a line of code, it must be executed. By forcing all possible branching conditions in a
program, all reachable code can be forced to execute.
– This does not mean that encrypted code can be decrypted – and should not be confused
with cryptanalysis
– This assumes that self-modifying code has the self-contained means to decrypt itself
Large-scale 'bump alongside the
wire'
The system will ultimately be tagged off a
perimeter device that sucks binaries out of
streams, including email attachments (Fidelis
anyone?). This capability should be abstracted so
it will work with multiple devices. There is no
requirement that any network transactions be
paused, cached, or delayed, and such decisions
are beyond the scope of the work. The system
can be attached in any manner.
Figure 1 - High level architecture of collection and analysis system
The network acquisition system will collect suspicious binary packages
and rich documents (productivity applications, word / power-point,
PDF, etc) and executables (EXE's, packages which contain EXE's, ZIP,
etc), URL link data, and java-script. These parse-able and executable
artifacts are sent via server-class hardware to the tier controllers. Each
tier controller manages an array of inexpensive motherboards (AIM).
These are racked and stacked in some custom inexpensive enclosures
using off-the-shelf consumer grade hardware. This produces a
Figure 2 - Inexpensive cluster of
motherboards
'functionally parallel' (different tasks executed simultaneously over
different data spaces) computing cluster. Expect to spend about
$10,000 to setup a single 64 node cluster. For the capital expense of $10,000 USD you will get apprx.
100,000 malware samples per day in capacity. (see http://www.clustercompute.com/)
Per node data collection
The architecture of the system strongly supports modular
programming. Breaking tasks down into independent chunks
is essential. Malware analysis is ideal for this environment as
each malware sample can be treated as a single functional
task. There is no synchronization in data state or processing
required between multiple nodes, drastically simplifying the
design of the parallel computing backbone.
Each individual node will produce a large volume to low level
software behavior data. This data is delivered to a collection
point over a second network interface. The collected data will
be used to determine the fully realized capabilities of the
target under test. The two primary data sets are:
Figure 3 - Per node data collection
Software loop trace
This is a trace of the program that allows visualization of all loops,
both inner and outer. This diagram will resemble a fingerprint for the malware and will be one primary
means to determine variants.
Data state trace
This is a linear trace and collection of data-state snapshots which can provide a low-level detail of all
activity of the program. In particular, it can be used to backtrace from a leaf node (fully realized
capability) to the root, and recover all the required data state in order to reproduce (prove) the
capability exists.
Loop Diagrams
Almost all but the most trivial of software will contain loops. It is assumed that most targets under test
will contain many embedded loops. We propose a loop diagramming system that allows both outer and
inner loops to be combined along with annotations that describe behavior at various points in the loops.
These diagrams will be highly suitable for comparing multiple variants of a malware strain - as only small
subsets of the logic will differ.
The basic loop diagram is as follows:
In the diagram we see that both outer and inner loops, including nesting, is represented. Throughout
the orbit of a given loop, annotations can be added. When rendered procedurally, these diagrams can
be quite complex and will be similar to a graphical 'fingerprint' for a malware program.
Several archetype loops are shown here:
This is a blocking loop. The software in question makes an API call that blocks on an external event.
Command and control processing. This loop is a variation of the blocking loop. A call is made to read a
packet from the network (TCP/IP) and this call blocks. Once a packet arrives the program continues
execution and parses the data within the packet, making further decisions based on the data.
This loop represents a simple processing architecture with a sleep. The malware will wait some period
of time, then execute a predetermined set of tasks, and then wait and repeat again.
This is a more complex example. The malware tries to connect to a command and control server on the
web. This illustrates annotations made against the diagram, showing various events. Once a connection
is made, the webpage is read and then parsed for commands.
Emulated Process Environment
The target under test (asset) is any executable, script, or other interpreted
content that, when parsed or executed, could represent a malware threat.
The emulated process environment is designed to execute the target under
test (see figure 1). The emulated process environment would consist of
interpreter subsystems to execute/parse any language that is supported.
At a minimum, subsystems will include 32 bit x86 machine code and javascript.
Emulation subsystems:
Java-script
x86 machine code
Figure 4 - Emulated
Environment for Malware
Execution
In all cases, the environment will emulate an unpatched Windows XP
operating system with vulnerable Internet Explorer, vulnerable Acrobat Reader, and vulnerable Flash. It
is important to understand that the emulation environment will not contain any real software products,
only the presence of these software products will be emulated. No real copy of Windows will be
running, and no virtual machine products will be running. In other words, the implementation will be a
true and raw emulation environment, not a 'thick' emulator (nothing like VMWare, Bochs, or equivalent,
and never intended for an actual software installation to be placed upon it). This is similar in concept to
the WINE project for linux (REF). As an aside, because no actual software is being installed or hosted,
there are no issues with commercial software licensing.
The emulated process environment will be written in native 'c' code suitable for compilation on a posixcompliant platform, and will be architected specifically for deployment in a supercomputing array of
inexpensive motherboards (see figure X). The emulation environment will be designed for high speed
and high throughput. Modules within the emulation environment will be decomposed into decoupled
operational units that are intended to work in parallel with a minimum of locking, potentially
implemented on multiple threads, and would execute cleanly on a heavily multi-processed hardware
platform (see figure X).
Loader Design
A key development area will be the emulation of a loader for PE formatted
executables. This will be an extension to the 32 bit x86 machine code
emulation in combination with emulation of the windows OS environment.
The intention will be to load and fixup the memory associated with a 'target
under test' program, including the loading of system DLL's. As we have
already indicated, no real DLL's will be present or loaded, but the emulation
environment will acknowledge the target under test and provide feedback
such that the loading process is considered valid. To support the execution of
Figure 5 - Loader design
software intended to operate on the Windows platform, the emulation environment will include an 'API
Surface Emulator' - this will answer for any system API calls that are executed. A very large number of
API calls will need to be modeled for this to work, on the order of thousands.
This 'API surface emulator' may contain several subsystems, and may contain a
special interpreted language for specifying how API calls should be processed.
Again, the purpose is to update internal state within the engine and answer the
API call query in such a manner that the 'target under test' continues to execute
properly without error. Once loading has taken place, the target under test will
continue to execute. Again, the API surface emulator will play a big part in the
success of continued execution (see figure X).
Figure 7 - Continued
Execution
Figure 6 - Continued
Execution
Input Expression Solver (IES)
I/O emulation will be a subset of the API surface. I/O is important
because the emulation environment will not know how to respond to
a data query made to an external element. To address the possible
tree of control flows, whenever an I/O operation is performed, any
subsequent control flow that is driven by the values contained in the
response data will be crafted based upon the arithmetic comparisons
made against the data once it returns. First, a random or preset
response will be provided. Following this, data flow tracing will be
used to track every derived memory location that sources from the
response data. Whenever a control flow decision is based upon this
sourced data, the original location it was sourced from is recorded.
Then, using this source location information, the I/O response data
will be precisely mutated to affect the control flow, increasing code
coverage. This process will be repeated as necessary to cover all
control flow that is influenced by external I/O response data.
Figure 8 API Surface Emulation for
continued execution
In order to increase the performance, the design will include the
ability to snapshot (
) the program state at any point. Using such snapshot capability, the system will
snapshot execution and data state immediately prior to any crafted I/O response. This allows the
snapshot state to be restored for every subsequent crafted data mutation. In other words, the 'target
under test' will not need to be re-executed from the root, but rather can be restored directly before the
mutation operation, thus increasing speed and
effectiveness.
At any given basic block of control flow,
comparisons may be made against data that
was sourced from external input. At any of
these comparison points, the data can be data
flow traced back to the original source in the
crafted I/O response buffer.
In figure XX is illustrated the control flow for a
program as it accessed various offsets within a
Figure 9 - Conceptual diagram of input-controlled branches leading
buffer under trace. Branching decisions will be to a specific target block
made based upon values present in the buffer.
The first branching decision checks the second byte for the value 'B' (offset 1). A subsequent branching
decision tests for the presence of any value 0-9 (offset 3). Finally, another comparison is made against
offset 5 for the value 'K'. By crafting the input correctly, the target point can be reached.
Flow Tracer and Mutation Engine
The system has two primary subsystems, a control flow tracer (flow tracer) and a mutation engine.
The flow tracer is responsible for tracing all instruction execution at a single-step resolution. This
instruction flow is analyzed for specific operations:
MOVE
BRANCH
ARITHMETIC
COMPARE
These basic operations are evaluated to determine if
they operate upon, or are influenced by, external
supplied data. All operations that relate to external
supplied data are informed to the mutation engine,
which is then responsible for crafting input as
requested. For purposes of control flow resolution and
Figure 10 - prototype of FLOW TRACER illustrating
mutation, the mutation engine maintains a list of input- controlled branches
controlled branches, otherwise known as the 'Controlled
Branch List'.
For example:
mov ebx, [buf] <-- crafted input buffer
…
cmp ebx, 1 <-- compare result can be influenced by crafted input
…
jnz label <-- thus, branch can be influenced, forcing code coverage
In the example, assume the value stored in register EBX is being tracked. Such tracking is simply
represented at the FLOW TRACER component by tagging the EBX register as tainted., At the time the
compare instruction is executed, the FLOW TRACER determines that EBX carries a tracked value. In
addition, and more importantly, the FLOW TRACER can also calculate where in the original input buffer
this value was derived from. A reference to the tracking information is stored along with a reference to
the branching instruction. This information is notified to the MUTATION ENGINE so it can be stored in
the controlled branch list.
The key to success in the FLOW TRACER is accurate and detailed data flow tracing. Many different
instruction variants are possible, and can duplicate and derive data from the original crafted buffer in
many ways. All derived data must be tracked, regardless of location in execution space. Stated clearly,
the data must be tracked regardless of whether it's in a register, stack location, or heap location.
Furthermore, the FLAGS register values that result from operations made against tracked data must also
be considered as tracked, and thus any branches that influenced from those flags values are treated as
controlled branches. In figure XX is shown a prototype of the FLOW TRACER rendering all crafted-input
controlled branches.
Data flow tracing
Because execution forward of the crafted input point is traced, all derived data can thus be traced. Any
operation that moves or calculates a result from the input data can be evaluated. These operations are
linked to one another so they can be reversed by the mutation engine.
For example, if the following instruction references tracked data at ESI, then the operand is checked to
see if there is an offset modifier:
For example:
mov ebx, [buf] <-- crafted input buffer
…
mov eax, [ebx + 2] <-- offset 2 in the input buffer
In the above example, the eax register clearly contains a 32 bit value which is obtain from offset 2 in the
crafted input buffer. Furthermore, the value in eax represents four full bytes of the crafted input buffer,
starting at offset 2 and terminating at offset 6 (a length of 4 bytes, or 32 bits). Details such as little
endian / big endian byte order can be accounted for. Data types can also be accounted for:
For example:
mov ebx, [buf] <-- crafted input buffer
…
mov eax, byte ptr [ebx + 2] <-- offset 2 in the input buffer
In the above example, only a single byte is moved into the eax register. The LSB of eax then represented
tracked data, while the upper 3 bytes of eax, while containing data, would not be considered part of the
data flow trace.
Arithmetic can also be tracked. The result of arithmetic operations should be considered part of the
data flow. Although the value is not directly copied from the crafted input buffer, the resulting
arithmetic calculation can obviously be influenced.
For example:
mov ebx, [buf]
…
mov [edi], ebx
…
sub ecx, ebx
In the example, two values are derived from the crafted input buffer. A memory location pointed to by
the edi register is written to with a value stored in ebx. The data flow resolution will determine that
[edi] gets 4 bytes (32 bit value) read from positions 0-3 in the crafted input buffer. Also, a value stored
in the ecx register is arithmetically modified by
this same 32 bit number. This results in ecx
being tainted as a traced value. Thus, any
subsequent branching logic derived from the
value in ecx can clearly be influenced by the
value in the crafted input buffer. A prototype
of this level of tracing, including arithmetic
derivations, is shown in figure XXX.
Both the FLOW TRACER and MUTATION
ENGINE have already been prototyped with
great success.
Figure 11 - data flow tracing, including arithmetic values
See attached: [Automated Flow Resolution and Application for Dynamic Decompilation, HOGLUND, 2005 USAF Contract # FA8650-05-M-8021].
Data State Progression Map (DSPM)
As execution emulation continues in this manner, multiple snapshot will be created and will result in a
single-root, directed graph of data states. This tree of data states represent important points along the
control flow of the target under test. The further
down the tree, the more state transitions that have
taken place. It will be possible to define data states
that represent known malware behaviors. For
example, writing to a registry key, sniffing a keystroke,
or logging particular kinds of data to a log file. There
are nearly limitless possibilities, restricted only by that
which can be defined as software behavior (in other
words, nearly limitless). The definition of what
behaviors are noteworthy can be defined in a symbolic Figure 12 - DSPM to Leaf Node for Fully Realized Capability
language that is used and evaluated while the data
state tree is recorded. Once a clear malware behavior is identified, it will exist at a leaf node of the data
state tree. When that occurs, the data state tree can be traversed backwards and a complete trace of
the malware execution leading up to the suspicious behavior can be recovered.
Crafted Input Expression Language (CIEL)
We propose the development of a regular-expression-like language that can generate values (as
opposed to match them). This is similar to regular expressions, but the purpose is to generate every
possible data string that would match the expression. The MUTATION ENGINE in combination with the
DSPM would be able to generate a CIEL expression for any point in the control flow of the target under
test.
Figure 13 - CIEL expression
The CIEL syntax will describe a series of fields with value types and ranges. Fields will be able to be
anchored relative to the entire buffer, or relative to another field. The value range for a given field will
include numerical range, width or length of physical buffer, and selection of values from a set (similar to
regular expression syntax).
For example, assume that a malware program requires a particular IRC server response in order to
execute a given control flow.
Figure 14 - mapping of comparisons to expressed CIEL fields
In the example, assume all instructions represent an operation against a traced or derived value that can
be mapped back to specific offset in the crafted input buffer (these instructions are not intended to be
shown in order or inclusive of all instructions for the target). At step 0001, the malware performs a
string comparison against 'BOTCMD'. This API call would be emulated by the API emulation surface.
The arguments passed to the call are recovered and the system determines that the comparison expects
the substring 'BOTCMD' to be present at offset zero in the crafted input buffer. The corresponding CIEL
expression is illustrated as positional anchored at the beginning of the crafted input buffer. Next, the
malware compares a value in al against 0x20 (a space character). Assume that al represents any
character found after the substring from operation 1. The CIEL expression represents this with a
reference to position. A subsequent comparison (made numerically) translates to the characters
'DOWN', and finally a strtok call is used to derive yet another field.
It can be shown that any input buffer can be represented with an expression, and that all possible input
sequences that are relevant to the control flow can be represented as such.
Consider the following example program:
char * dd[] =
{
"SUBSCRIBE",
"NOTIFY",
"M_SEARCH",
"WHATEVER"
};
char _tokens[] = " \t\n";
char * res = strtok(lpString, _tokens);
if(0 == res) return 0;
if( 0 == stricmp(res, dd[c]))
{
…
The example program can take a variety of inputs and will branch based on these inputs. A prototype of
the BUFFER MUTATION engine working in conjunction with the FLOW TRACER produces the series of
inputs shown in figure XX.
Figure 15 - prototype of buffer mutation
Co-instructions and Reverse Evaluation
Arithmetic operations post a particular challenge, in that the arithmetic itself must be reversed in order
to arrive at the range of values that would influence branching logic. That is, assuming an arithmetic
result is used in a subsequent comparison operation, that operation will need to be reversed to
determine which starting value must be in the crafted input buffer.
For example:
mov ebx, [buf]
sub ebx, 10
cmp ebx, 5
je label
In the example, the first 32 bits of the crafted input buffer will need to be equal to 15 in order to
exercise the branch. In order to determine this value, the instructions will need to be reversed, and the
subtraction will need to be converted to an add:
cmp ebx, 5
add ebx, 10
mov ebx, [buf]
<-- value we need to control (need: 5)
<-- add 10 (need is now: 15)
<-- buffer reference (insert value 15 into buf)
This process is known as 'reverse evaluation' [Automated Flow Resolution and Application for Dynamic
Decompilation, HOGLUND, 2005 - USAF Contract # FA8650-05-M-8021]. A table of arithmetic instructions and
their reverse instruction (co-joined instruction) follow:
Instruction
ADD
SUB
OR
XOR
AND
SHL
SHR
ROL
ROR
Co-Instruction
SUB
ADD
NAND
XOR
AND
SHR
SHL
ROR
ROL
All manner of arithmetic can be reversed in this way, but it should be stated that certain operations
cannot be reversed (for example, an MD5 hash cannot be reversed in this manner). Control flow
branches that depend on irreversible arithmetic will not be able to calculated against, and the only
solution would be brute force. Depending on the situation, even brute force would not work and such
control flows would have to be abandoned. This is a simple mathematical property of the logic and
accepted as an absolute (no time will be wasted here trying to solve an unsolvable problem). In general
this almost never occurs, as most logic is represented simply by the compiler. Any irreversible logic
would have to be introduced as a form of obfuscation in most cases. Some general purpose algorithms,
such as compression, will also exhibit this behavior and a method will have to be devised to avoid
garden-pathing on these.
The reverse evaluation system should suffice for most problems in control flow recovery.
Reporting
Reporting will be a key feature once the data is collected.
Once a fully realized capability is detected, the system will
have the ability to generate a high level report of what
software activity enabled and lead up to the realized
capability. For example, this report would contain data
about what network packets had to arrive, what
commands were issued, which values had to exist in
memory, etc. The intention of this report is to service
high-level analysis and automatic creation of IDS rules,
although it's conceivable that this report could contain
nearly line-by-line singlestep data about program
execution.
Figure 16 - pathing control flow to a capability
The reporting portions of the system can exist as
a separate application that is not restricted by
the architecture of the emulation environment.
For example, the reporting system could be
written to work in a GUI-intensive environment
such as Microsoft Windows or a web-server.
The reporting system will consume the data
flows reported by the underlying recording
Figure 17 - using data in the path to automatically reconstruct a
command and control protocol for subsequent IDS signature
system (figure XX) and produce a higher level
generation
descriptive report illustrating salient data states
(such as command and control packet formats). From this, an alerting system can be developed that will
automatically recover important actionable artifacts such as unique network strings, URL's, IP's suitable
for automatic IDS deployment, and file and registry keys suitable for end-node protection.
© Copyright 2026 Paperzz