2012 IEEE 16th International Symposium
2012 ISCE 1569567245
Agent-based Optimization of Emulations of Network Server
Applications in Honeypots
Yilun Zhao and Jeremy J. Blum, Member, IEEE
The Pennsylvania State University, Harrisburg
Abstract--Low-interaction honeypots can provide a cost-effective
security mechanism for a wide range of computer systems. A central
challenge in the development of low-interaction honeypots is the
development of emulation programs that mimic the action of server
applications on the target platform. The emulation programs must
be of high enough fidelity to fool attackers. However, the manual
development of these emulations is extremely time-consuming. This
paper describes an agent-based optimization system that can
automate the generation of emulation programs for honeypots. The
system is evaluated in its ability to emulate a mail server. In
this evaluation, the system produced correct responses to more
than 99% of test data queries.
I. INTRODUCTION
Honeypots are decoy systems designed to capture malicious
users in order to collect data on patterns of malicious behavior
and deflect malicious attention from real systems. Honeypots
can play an important role in mitigating threats to an
increasing range of devices, given the trend of providing
network access for control and monitoring of devices from
consumer electronics through industrial control systems.
A large class of these honeypots emulates an entire network
of decoy systems on a single system. Central to the success of
these systems is a high-fidelity emulation of network server
applications. The traditional approach to the creation of these
emulations relies on manual coding of an emulation script, a
labor-intensive and time-consuming process, which must be
repeated whenever the real system is updated.
Attempts to introduce a level of automation into the
generation of these emulations rely on representations of
server behavior using extensions of finite state machine (FSM)
models. The central challenge, however, is that the generation
of these models involves computationally intractable subtasks.
For example, in order to generalize system behavior, regular
expressions must be developed that match a class of user inputs
to a single output, a problem which is NP-Hard.
In order to address this challenge, this paper describes an
agent-based optimization architecture that has successfully
solved intractable problems in other domains. The agent-based
optimization system attempts to find a compact
representation of an extended finite state machine (EFSM) that
can generalize the behavior of the network application. The
EFSM model is an FSM that has been extended with variables
that can store system-wide and per-session state. Formally, the
goal of the optimization problem is to minimize a weighted
combination of the error in the emulated responses and the size
of the finite state machine.
This optimization system is evaluated in its ability to
emulate a POP3 email server. The results indicated that the
optimization system generated a compact representation of this
server that generalized the application performance. The
resulting emulation generated correct responses for more than
99% of queries in an independent test set.
This research is funded in part by a seed grant from a National
Science Foundation Partnership for Innovation grant (M. Walters, PI)
and a grant from Novatech LLC (J. Blum, PI). Any opinions,
findings, and conclusions or recommendations expressed in this
material are those of the authors and do not necessarily reflect the
views of the National Science Foundation or Novatech LLC.
978-1-4673-1356-8/12/$31.00 ©2012 IEEE
II. PREVIOUS WORK
The Deception Toolkit, the first publicly available
honeypot, was "intended to make it appear to attackers as if the
system running DTK [had] a large number of widely known
vulnerabilities" [1]. Current honeypots can be classified into
three different levels: Low-Interaction Honeypots,
Medium-Interaction Honeypots and High-Interaction Honeypots [2].
High-interaction honeypots consist of actual devices deployed
as decoys, and, as a result, provide a high level of interactivity,
in which nothing in the system is emulated or restricted [2-3].
Given that actual systems are deployed, high-interaction
honeypots can be expensive to deploy. Moreover, they inherit
all of the vulnerabilities of the actual systems. They can be
compromised and used in attacks on other systems. More
importantly, attackers can use the high-interaction honeypots
to try to discover vulnerabilities in actual systems.
At the other end, low-interaction honeypots provide
emulations of actual systems [4]. As a result, low-interaction
honeypots can be inexpensive to deploy since a single system
can emulate an entire network of devices. In addition, because
these honeypots simply emulate server applications,
low-interaction honeypots can prevent attackers from learning
about underlying vulnerabilities by omitting these from the
emulation. On the other hand, the subset of functionality that
is typically provided by emulations of server programs can
limit the effectiveness of low-interaction honeypots. If an
attacker can quickly detect that only a subset of the
functionality is implemented, the attacker can determine that
the target is a honeypot as opposed to a preferred target.
As a result, a central challenge for these honeypots is
improving the emulation of network servers, both expanding
the functionality that is implemented and improving its
fidelity. While emulations can be manually created, several
researchers have attempted to automate or partially automate
the development of emulation programs. For example,
ScriptGen attempts to automatically generate compact
representations of server behavior using a finite state machine
model [5]. The approach takes as input a large dataset of
recorded conversations and attempts to distill the large input
set into a compact model. Another approach, RolePlayer,
incorporates byte-stream alignment algorithms in order to
identify the portions of responses that vary based on
differences in the request and current session data [6]. Replayer
addresses the similar problem of successfully replaying an
attack on a server by systematically identifying the portions
of the client messages that must be altered [7]. There have
also been a number of efforts to monitor the messages produced
by a server application in order to attempt to reverse engineer
a network protocol [8-10].
The problem that these automated approaches are
attempting to solve is a computationally intractable one.
Consider a simple server program which contains no record of
session state. In other words, the response of the server
program is simply a function of the previous client message.
Given a comprehensive set of client and server messages, the
emulation problem then becomes one of generalizing these
messages in a compact form. One compact form would be a
set of regular expressions that match all queries that
produce a particular output. The emulation problem then
reduces to the problem of generating regular expressions from
positive data, a problem that is NP-Hard [11]. Certainly, more
complex servers present even larger problems from an
emulation perspective.
In order to solve the emulation-generation problem, the
research described here instantiates an agent-based
optimization architecture, which has successfully solved
intractable problems in other domains [12, 13]. At the
beginning of an optimization run, creation agents generate
potential solutions and add them to the solution pool. These
initial solutions are rough attempts to find an average
performing solution. After the solution pool has been seeded
with a set of initial solutions, modification agents select one or
more existing solutions from the solution pool. These agents
then use a heuristic or meta-heuristic approach to attempt to
create a new solution that is an incremental improvement over
the selected solution. By incorporating domain-specific
knowledge into their solution improvement algorithms, these
agents can efficiently guide the search of the solution space.
Periodically, deletion agents remove poor-performing
solutions from the solution pool in order to keep the size of the
solution pool within predetermined bounds. The goal of the
system is that over many iterations of agent runs, the solutions
in the solution pool drift towards near-optimal solutions.

Table 1: Extended Finite State Machine
Current  Next   Order  Regular     Response                                          Variable List
State    State         Expression                                                    (Name, Location)
1        2      1      *           "Enter command: "
2        3      1      login       "Username: "
2        2      2      help        "Type 'login' to authenticate.\nEnter command: "
2        2      3      *           "Unrecognized command.\nEnter command: "
3        4      2      *           "Password required for " + UNAME + ": "           (UNAME,1)
3        -      3      *           "Invalid Password"

III. METHODOLOGY
A. Formal Problem Statement
The emulation of server programs is based on an Extended
Finite State Machine (EFSM), which determines the response
that should be generated based on the current state of the
server program and the current query. The EFSM is a Finite
State Machine which is extended to include session and query
variables, as well as variables that capture other state data like
date and time.
The EFSM model can be stored compactly in tabular form.
For example, Table 1 shows an EFSM for a sample server
program. Each row in the table corresponds to an edge in the
EFSM, with the source state and destination state, the regular
expression that must be matched to trigger a transition, the
response to be generated, and any variable values to extract
from the input. When determining the response, the emulator
program applies the regular expressions for the current state,
in the order specified by the order column in the table, until a
match of the input is found. The last rule for each state is a
default transition.
For example, in the sample server program represented in
Table 1, when a user initially connects to the server, the server
sends the message "Enter command: ". At this point, the user can
either type login to start the login sequence (states 3 and 4) or
help to display a help message. Any other input at this point
generates an error message. When the user enters a string after
the "Username: " prompt, this string is extracted, stored in the
UNAME variable, and then echoed back to the user in the
response. After the client enters a password, the server
terminates the connection.
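The table-driven lookup described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the authors' emulator: the `Transition` tuple, the `step` function, and the `{UNAME}` placeholder syntax are hypothetical, `.*` stands in for the table's `*` wildcard, and the "Invalid Password" row is attached to state 4 with order 1 on the assumption that it fires after the password prompt.

```python
import re
from typing import NamedTuple, Optional, Tuple


class Transition(NamedTuple):
    state: int                  # current state
    next_state: Optional[int]   # None: the connection is closed
    order: int                  # lower values are tried first
    pattern: str                # regular expression for the client input
    response: str               # "{NAME}" is replaced by a stored variable
    variables: Tuple[Tuple[str, int], ...] = ()  # (name, 1-based token position)


# Rows modeled on Table 1 (see the assumptions stated in the lead-in).
TABLE = [
    Transition(1, 2, 1, r".*", "Enter command: "),
    Transition(2, 3, 1, r"login", "Username: "),
    Transition(2, 2, 2, r"help", "Type 'login' to authenticate.\nEnter command: "),
    Transition(2, 2, 3, r".*", "Unrecognized command.\nEnter command: "),
    Transition(3, 4, 1, r".*", "Password required for {UNAME}: ", (("UNAME", 1),)),
    Transition(4, None, 1, r".*", "Invalid Password"),
]


def step(state, message, session):
    """Return (next_state, response) for one client message."""
    rows = sorted((t for t in TABLE if t.state == state), key=lambda t: t.order)
    for t in rows:
        if re.fullmatch(t.pattern, message):
            # Extract any variables defined on this transition.
            tokens = message.split()
            for name, position in t.variables:
                if position <= len(tokens):
                    session[name] = tokens[position - 1]
            # Substitute stored variables into the response template.
            response = t.response
            for name, value in session.items():
                response = response.replace("{" + name + "}", value)
            return t.next_state, response
    raise ValueError(f"no transition matches in state {state}")
```

A session would be driven by calling `step` once for the connection (with an empty message) and then once per client message, carrying the returned state and the same `session` dictionary forward.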
An extended finite state machine representation that exactly
matches all of the example requests and responses could be
generated simply by creating a new state for each request and
concatenating together with the “OR” keyword all requests for a
given state that generated the same response. However, there
are two significant problems with this approach. First, the size
of the resulting EFSM will be so large that the emulator will
not be able to process requests in a reasonable amount of time.
Second, this process does not attempt to generalize the
behavior of the system at all. Therefore, the emulation of the
application would only be correct if the input to the emulator is
exactly the same as an input seen earlier in the examples.
Instead, the goal of this research is to find a compact
representation of an extended finite state machine that can
generalize the behavior of the network application. Formally,
we would seek to minimize a weighted combination of the
error in the emulated responses and the size of the finite state
machine:
Minimize: $\frac{T - H}{T} + \alpha V + \beta E$
Where
T = total number of conversations within the test data
H = correct responses from test data for the EFSM
V = total number of states within the EFSM
$\alpha$ = penalty of each state
E = total number of transitions within the EFSM
$\beta$ = penalty of each transition
The first term in the expression to be minimized represents
the accuracy of the EFSM on the test data. In order to
encourage generalization, penalties are applied to the number
of states and the number of transitions. In this way, if there
are two models with the same level of accuracy, the optimization
system will favor the more compact representation of server
action.
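As a rough illustration of how such an objective could be scored, the sketch below assumes the error term is the fraction of incorrect responses and adds a per-state and a per-transition penalty. The function name, the parameter names `state_penalty` and `transition_penalty`, and their default values are hypothetical; the paper does not report the weights it used.

```python
def fitness(total, correct, num_states, num_transitions,
            state_penalty=0.01, transition_penalty=0.005):
    """Lower is better: error fraction plus size penalties (assumed form)."""
    error = (total - correct) / total
    return (error
            + state_penalty * num_states
            + transition_penalty * num_transitions)


# Two models with the same accuracy: the more compact one scores lower.
compact = fitness(total=100, correct=90, num_states=3, num_transitions=6)
bloated = fitness(total=100, correct=90, num_states=5, num_transitions=10)
```

With any positive penalties, `compact < bloated` holds, which is the tie-breaking behavior described in the text.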
B. Agent Optimization System
In order to generate a compact and accurate EFSM model,
an agent-based optimization system is instantiated for the
emulation automation problem. This system is characterized
by three different types of agents and a shared solution pool.
The three types of agents include creation agents, modification
agents, and deletion agents.
As shown in Figure 1, the optimization process
begins with a creation agent seeding the solution pool with a
set of initial solutions. Then, the modification agents run for a
certain number of iterations. During each iteration, a
modification agent is randomly selected. This agent chooses a
solution from the solution pool, creates a new solution in an
attempt to improve the chosen solution, and then inserts the
new solution back into the solution pool. At the end of each
round of modification agent runs, the deletion agent prunes the
solution pool by removing the poor-performing solutions from
the pool based on their fitness scores.
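The process described above can be sketched as a short Python loop. This is a generic skeleton under assumptions, not the authors' implementation: the `optimize` function and its parameters are hypothetical names, solutions are treated as opaque values, modification agents are plain callables, and lower fitness is taken to be better.

```python
import random


def optimize(create, modifiers, fitness, rounds, runs_per_round, pool_size, rng):
    """Seed a pool, let random agents add solutions, prune after each round."""
    pool = list(create())                    # creation agent seeds the pool
    for _ in range(rounds):
        for _ in range(runs_per_round):
            agent = rng.choice(modifiers)    # randomly select a modification agent
            chosen = rng.choice(pool)        # agent picks a solution from the pool
            pool.append(agent(chosen, rng))  # new solution goes back into the pool
        pool.sort(key=fitness)               # deletion agent: rank by fitness ...
        del pool[pool_size:]                 # ... and keep only the best
    return pool


# Toy usage: solutions are integers and the target value is 5.
toy_pool = optimize(
    create=lambda: [0],
    modifiers=[lambda x, rng: x + rng.choice([-1, 1])],
    fitness=lambda x: abs(x - 5),
    rounds=15, runs_per_round=50, pool_size=40,
    rng=random.Random(0),
)
```

Because pruning always keeps the best-ranked solutions, the best fitness in the pool can never get worse from one round to the next.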
1) Creation Agent
The creation agent seeds the solution pool with a set of
initial solutions. All of these solutions contain two states with
two default transitions.
The first default transition begins at the initial state and
terminates at the second state. The output generated from this
transition is the initial response that is sent from the server
when a client first connects. This message is chosen at random
from a set of all of the initial welcome messages in the training
data. The initial message is chosen with a probability
proportional to the frequency that message is used as the initial
message in the training data.
The second default transition is a self-loop, beginning and
ending at the second state. The output from this transition is
similarly chosen from the set of responses in the training data
that are generated after the first message from the client. Again,
the probability that a message is chosen from this set is
proportional to the frequency with which it appears as the
second response in all message exchanges between clients and
the server in the training data.
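A minimal sketch of this seeding step follows. It assumes each training conversation is given as a list of server responses in order (so index 0 is the welcome message and index 1 is the second response), and it represents a transition as a small dictionary; both choices, and the function names, are illustrative rather than taken from the paper.

```python
import random
from collections import Counter


def pick_by_frequency(messages, rng):
    """Choose one message with probability proportional to its frequency."""
    counts = Counter(messages)
    options = list(counts)
    weights = [counts[m] for m in options]
    return rng.choices(options, weights=weights, k=1)[0]


def create_initial_solution(conversations, rng):
    """Build the two-state, two-default-transition starting EFSM."""
    greetings = [conv[0] for conv in conversations if len(conv) > 0]
    seconds = [conv[1] for conv in conversations if len(conv) > 1]
    return [
        # Default transition from the initial state to the second state.
        {"state": 1, "next": 2, "pattern": ".*",
         "response": pick_by_frequency(greetings, rng)},
        # Default self-loop on the second state.
        {"state": 2, "next": 2, "pattern": ".*",
         "response": pick_by_frequency(seconds, rng)},
    ]
```

Calling `create_initial_solution` repeatedly with the same training data can yield different seeds whenever the data contains more than one distinct first or second response.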
2) Modification Agents
After the solution pool has been seeded with an initial set of
solutions, modification agents attempt to create improved
solutions based on the existing solutions in the solution pool.
These solutions are then evaluated based on the training data,
and inserted into the solution pool. As agents work to improve
one another’s solutions, the solutions in the solution pool will
tend to drift towards near-optimal solutions.
This optimization system uses seven modification agents:
• A state insertion agent, which attempts to improve a
  solution by adding new states to the EFSM.
• A transition insertion agent, which attempts to improve
  the EFSM by adding a new transition.
• A variable-identification agent, which attempts to
  generalize the EFSM by seeking to identify variables
  that could be added to the EFSM representation to
  capture system-wide or per-session state.
• A regular expression generation agent, which attempts
  to merge multiple transitions at a state by generalizing
  the regular expressions that trigger these transitions.
• A transition modification agent, which changes the
  destination state in an existing transition in the EFSM.
• A state removal agent, which removes a state from the
  EFSM in order to attempt to make the representation
  more compact.
• A transition removal agent, which removes a transition
  from the EFSM with the same goal.

Create initial solutions in solution pool using Creation Agent
For i = 0 -> total number of iterations
    For j = 0 -> total number of agents per iteration
        Randomly select an agent
        Selected agent -> pick a random solution from solution pool
        Selected agent -> perform optimization
        Selected agent -> insert new solution into solution pool
    Delete agent -> delete less performing solutions from the pool
Figure 1: Agent-Based Optimization System Pseudocode

The state insertion agent attempts to find a situation in
which the response to a non-initial client message differs from
the response generated by the EFSM. One of these unmatched
message exchanges is randomly chosen to attempt to improve
the EFSM.
The messages in this exchange are labeled as follows:
(Λ, r0), (c1, r1), (c2, r2), …, (cn, rn)
where
(Λ, r0) corresponds to the initial connection from the client,
and r0 is the initial response that the server generates upon
the connection
(ci, ri) are the ith message from the client and the ith
response from the server, respectively
Assume the client message ci, i > 1, is the one for which the
response is incorrect. The agent's logic presumes that this
difference may be due to a transition to a different state that
should have occurred due to an earlier message. The agent,
then, randomly chooses one of the transitions followed for
messages cj, j < i. This transition is changed so that it
terminates at a newly created state. A new transition, in the
form of a self-loop, is introduced at this state which correctly
generates ri in response to a client message ci. In addition, a
default self-loop transition is added at this state. Its response
is chosen from the set of responses in the training data that
would be generated at this new state, with a probability
proportional to the frequency with which the response occurs at
this new state.
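The state insertion step can be sketched for a single conversation as follows. This is a simplified illustration under several assumptions: the EFSM is a dictionary mapping each state to an ordered list of `(pattern, next_state, response)` rows whose last row is a `.*` default, a conversation is a list of `(client_message, server_response)` pairs with an empty client message for the initial connection, and the default response for the new state is drawn only from the later responses of this one conversation rather than from the full training data. The function names are hypothetical.

```python
import random
import re
from collections import Counter


def trace(efsm, conversation, start=1):
    """Return (state, row index, predicted response) for each exchange."""
    state, steps = start, []
    for client, _expected in conversation:
        for k, (pattern, nxt, response) in enumerate(efsm[state]):
            if re.fullmatch(pattern, client):
                steps.append((state, k, response))
                state = nxt
                break
    return steps


def insert_state(efsm, conversation, rng):
    steps = trace(efsm, conversation)
    # Exchanges with index i > 1 whose predicted response is wrong.
    wrong = [i for i in range(2, len(steps))
             if steps[i][2] != conversation[i][1]]
    if not wrong:
        return efsm
    i = rng.choice(wrong)
    j = rng.randrange(1, i)                  # an earlier client message c_j
    new_state = max(efsm) + 1
    new = {s: list(rows) for s, rows in efsm.items()}
    # Redirect the transition followed for c_j to the new state.
    state_j, row_j, _ = steps[j]
    pattern, _old_target, response = new[state_j][row_j]
    new[state_j][row_j] = (pattern, new_state, response)
    # Frequency-weighted default response for the new state.
    later = Counter(r for _c, r in conversation[j + 1:])
    options = list(later)
    default = rng.choices(options, weights=[later[r] for r in options], k=1)[0]
    client_i, response_i = conversation[i]
    new[new_state] = [
        (re.escape(client_i), new_state, response_i),  # self-loop fixing c_i
        (".*", new_state, default),                    # default self-loop
    ]
    return new
```

The returned EFSM is a modified copy, so the original solution remains in the pool unchanged and both can be compared by fitness.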
The transition insertion agent identifies a response ri, i ≥ 1,
which differs from the response predicted by the EFSM. This
agent attempts to fix this error by adding a transition at the
state in the EFSM where this error occurred. This transition
terminates at a randomly chosen state. The regular expression
for the transition is set equal to ci, and the response generated
is set equal to ri.
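A sketch of the transition insertion step is given below, using the same illustrative representation as before: an EFSM is a dictionary from state to an ordered list of `(pattern, next_state, response)` rows ending in a `.*` default, and a conversation is a list of `(client_message, server_response)` pairs whose first pair is the initial connection. Placing the new transition ahead of the existing rows, so that it is tried first, is an assumption; the paper does not state where in the order it is inserted.

```python
import random
import re


def replay(efsm, conversation, start=1):
    """Return (state, client, expected, predicted) for each exchange."""
    state, steps = start, []
    for client, expected in conversation:
        for pattern, nxt, response in efsm[state]:
            if re.fullmatch(pattern, client):
                steps.append((state, client, expected, response))
                state = nxt
                break
    return steps


def insert_transition(efsm, conversation, rng):
    steps = replay(efsm, conversation)
    # Consider only client messages (i >= 1), not the initial connection.
    errors = [(s, c, e) for s, c, e, p in steps[1:] if e != p]
    if not errors:
        return efsm
    state, client, expected = rng.choice(errors)
    target = rng.choice(sorted(efsm))        # randomly chosen destination state
    new = {s: list(rows) for s, rows in efsm.items()}
    # The pattern is the literal client message; the response is the observed one.
    new[state].insert(0, (re.escape(client), target, expected))
    return new
```

Because the destination is random, the new solution may fix one response while breaking later ones in the same conversation; the fitness evaluation and the deletion agent are what filter out such regressions.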
The variable-identification agent attempts to generalize the
EFSM by identifying strings in the client messages that are
echoed back later by the server. This agent scans the client
messages and server responses and chooses one conversation
at random in which a string from the client message, ci, is
echoed back by the server in a response rj, j ≥ i. The agent
then creates a variable at the transition followed due to
message ci. This variable is given a unique name, along with the
information that defines where in the client message it occurs,
e.g. a particular token in the message. Then the transition
followed for response rj is changed so that the string that is
echoed back is replaced by the variable value.
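The echo detection at the heart of this agent might look like the sketch below. It assumes whitespace-delimited tokens, case-sensitive matching, and conversations given as lists of `(client_message, server_response)` pairs; the `%VAR1%` placeholder follows the notation in Figure 2, while the function names are hypothetical.

```python
def find_echoes(conversation):
    """List (i, token position, j): token of c_i echoed in r_j with j >= i."""
    found = []
    for i, (client, _response) in enumerate(conversation):
        for position, token in enumerate(client.split(), start=1):
            for j in range(i, len(conversation)):
                if token in conversation[j][1].split():
                    found.append((i, position, j))
    return found


def templatize(response, token, name="VAR1"):
    """Replace the echoed token in a response with a variable placeholder."""
    return " ".join(f"%{name}%" if word == token else word
                    for word in response.split())


conversation = [
    ("", "+OK ready"),
    ("USER alice", "+OK password required for user alice"),
]
echoes = find_echoes(conversation)   # [(1, 2, 1)]
template = templatize(conversation[1][1], "alice")
```

Here the second token of the `USER` request is echoed in the same exchange, and `template` becomes `+OK password required for user %VAR1%`. An agent would pick one entry of `echoes` at random and rewrite the two transitions involved.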
The regular expression generation agent attempts to merge
multiple transitions at a given state. It groups together
transitions at a state that generate similar responses, where the
similarity between the responses exceeds a threshold. One of
these groups of transitions is chosen at random, and this agent
then merges the transitions by replacing the portions of the
client messages that differ with a wildcard. For example, if
requests “PASS ab” and “PASS cd” have the same response
“-ERR [AUTH] Authentication failed”, this agent would create a
transition in the EFSM to match both requests with a regular
expression in the form “PASS [^ ]+”.
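The merge step in this example can be sketched with a token-by-token comparison. This covers only the generalization of one already-selected group of requests; the grouping by response similarity and its threshold are not shown, and the approach of comparing space-separated tokens is an assumption rather than the paper's stated algorithm.

```python
import re


def generalize(requests):
    """Merge requests into one pattern, or return None if lengths differ."""
    token_lists = [request.split(" ") for request in requests]
    if len({len(tokens) for tokens in token_lists}) != 1:
        return None
    parts = []
    for column in zip(*token_lists):
        if len(set(column)) == 1:
            parts.append(re.escape(column[0]))  # token shared by all requests
        else:
            parts.append("[^ ]+")               # differing token: wildcard
    return " ".join(parts)


pattern = generalize(["PASS ab", "PASS cd"])    # "PASS [^ ]+"
```

The resulting pattern also matches requests that were never seen in training, such as `PASS xyz`, which is the generalization the agent is after.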
The transition modification agent, the state removal agent,
and transition removal agent are all designed to randomly
explore the neighborhood of a solution. The transition
modification agent randomly chooses one of the transitions in
the EFSM, and changes its destination state to a
randomly chosen state. The state removal agent chooses a
non-initial state at random from the EFSM, and removes it.
Any transitions that had this deleted state as their destination
are modified to terminate at a randomly chosen state. The
transition removal agent chooses a non-default transition at
random from the EFSM and deletes it. Both the state removal
agent and the transition removal agent can improve the EFSM
by making it more compact, improving the fitness by
reducing the penalties associated with each state and transition.
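As one example of these neighborhood moves, the state removal step could be sketched as follows. The dictionary representation (state mapped to a list of `(pattern, next_state, response)` rows) and the function name are assumptions made for illustration.

```python
import random


def remove_state(efsm, rng, initial=1):
    """Delete a random non-initial state and redirect transitions into it."""
    candidates = [s for s in efsm if s != initial]
    if not candidates:
        return efsm
    victim = rng.choice(candidates)
    remaining = [s for s in efsm if s != victim]
    new = {}
    for state in remaining:
        new[state] = [
            # Transitions that pointed at the deleted state get a random target.
            (pattern, rng.choice(remaining) if nxt == victim else nxt, response)
            for pattern, nxt, response in efsm[state]
        ]
    return new
```

The result is always a well-formed EFSM in the sense that every transition still terminates at an existing state, although its accuracy may be better or worse than the original.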
3) Deletion Agent
The deletion agent serves two purposes. First, it keeps the
solution pool to a manageable size. In addition, it removes the
weakest performing solutions, focusing the optimization search
on the most promising solutions and improving the speed with
which the optimization system converges to a near optimal
solution. The deletion agent used for this system simply ranks
the existing solutions based on their fitness, retains the top
performing solutions, and deletes the remaining ones.
IV. EXPERIMENTAL RESULTS
The optimization system was evaluated based on its ability
to create an EFSM that emulates a mail server using the Post
Office Protocol version 3 (POP3). The data used to train and
evaluate the optimization system was based on conversations
that represented unauthorized attempts to access the system.
There were no successful logins in the data.
Two large sets of messages were collected for this server.
One set was used for training the EFSM, and the other set
was used to evaluate the EFSM. Each of the sets contained
transcripts from 500 connections, with approximately
2,900 total messages.
The optimization system was configured with the following
parameters. The target solution pool size was 40 solutions.
The creation agent seeded the solution pool with one initial
solution. In each round of the optimization system, 50
modification agents were chosen to run. At the end of the
round, the deletion agent pruned the solution pool back to 40
solutions. The optimization system ran for a total of 15
rounds. On an Intel Xeon E51225 3.1GHz processor with
Windows 7 (64 bit) operating system, the optimization system
run took approximately 15 minutes.

[Figure 2 is a state diagram with an initial state, a 2nd state,
and a 3rd state. Its edges carry the following request / response
labels:
Any request / +OK Messaging Multiplexor (Sun Java(tm) System
Messaging Server 6.3-11.01 (built Feb 12 2010))
USER [^ ]+ / +OK password required for user %VAR1%
PASS [^ ]+ / -ERR [AUTH] Authentication failed
QUIT / +OK goodbye (two edges)
Any request / -ERR invalid command (two edges)]
Figure 2: EFSM for POP3 Mail Server

Multiple runs of the optimization system produced the
EFSM shown in Figure 2. As the figure shows, the
EFSM is extremely compact, with no extraneous transitions.
Moreover, when evaluated on the test data, the EFSM correctly
generated responses for 2870 out of a total of 2875 client
messages, i.e. an accuracy of 99.83%.
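The reported figure is a simple ratio of correct responses to client messages. The short sketch below checks the arithmetic and shows one way such a score could be computed from paired expected and predicted responses; the `accuracy` helper is a hypothetical name, not part of the described system.

```python
def accuracy(expected, predicted):
    """Fraction of responses that match exactly."""
    correct = sum(e == p for e, p in zip(expected, predicted))
    return correct / len(expected)


# Check of the reported result: 2870 correct out of 2875 client messages.
reported = round(100 * 2870 / 2875, 2)   # 99.83
```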
V. CONCLUSIONS AND FUTURE WORK
The agent-based optimization holds promise for the creation
of EFSMs for the emulation of network applications for
honeypots. In the evaluation of this approach for a mail
server, the system produced an EFSM that provided close to
100% accuracy.
This approach is extensible to other network applications.
The existing agents are generic enough that they may be useful
in the emulation of other applications. In addition, the agent
architecture is extremely flexible. If additional agents are
needed, it is a simple matter to create and install these for the
emulation of other services.
The current work is being extended to model web-based
services that are part of increasingly common interfaces for a
wide range of devices, ranging from consumer electronics to
SCADA systems. This optimization system is being used to
develop emulations for these devices. The ultimate test for
these emulations will be their ability to fool malicious actors
into believing that the honeypots are actual devices. This
deception would provide a number of benefits, including
allowing for the collection of malicious behavior that is needed
to train Intrusion Detection systems, for the identification of
potential flaws in devices or common configurations of
devices that malicious users attempt to exploit in the honeypot,
and for hiding real devices in a sea of honeypot systems.
REFERENCES
[1] Cohen, F., "The Deception ToolKit," Risks Digest (9), March 1998.
[2] Spitzner, L., Honeypots: Tracking Hackers, Addison Wesley, 2003.
[3] Capture-HPC Client Honeypot/Honeyclient,
http://projects.honeynet.org/capture-hpc
[4] Provos, N., "Honeyd - A Virtual Honeypot Daemon," 10th DFN-CERT
Workshop, Hamburg, Germany, February 2003.
[5] Leita, C., Mermoud, K., Dacier, M., "ScriptGen: an automated script
generation tool for honeyd," 21st Annual Computer Security
Applications Conference, 5-9 Dec. 2005, 214-227.
[6] Cui, W., Paxson, V., Weaver, N.C., Katz, R.H., "Protocol-Independent
Adaptive Replay of Application Dialog," 13th Symposium on Network
and Distributed System Security (NDSS), 2006.
[7] Newsome, J., Brumley, D., Franklin, J., Song, D., "Replayer: automatic
protocol replay by binary analysis," Proceedings of the 13th ACM
Conference on Computer and Communications Security, 2006.
[8] Caballero, J., Song, D., "Polyglot: Automatic Extraction of Protocol
Format using Dynamic Binary Analysis," ACM Conference on
Computer and Communications Security, 2007.
[9] Lin, Z., Jiang, X., Xu, D., Zhang, X., "Automatic Protocol Format
Reverse Engineering through Context-Aware Monitored Execution,"
15th Symposium on Network and Distributed System Security, 2008.
[10] Kruegel, C., Kirda, E., Comparetti, P.M., Wondracek, G., "Automatic
Network Protocol Analysis," 15th Annual Network and Distributed
System Security Symposium, 2008.
[11] Fernau, H., "Algorithms for learning regular expressions," Information
and Computation, 207(4), 2009.
[12] Blum, J., Eskandarian, A., "Enhancing Intelligent Agent Collaboration
for Flow Optimization of Railroad Traffic," Transportation Research:
Part A, 36(10), 2002.
[13] Blum, J., Mathew, T.V., "Intelligent Agent Optimization of Urban Bus
Transit System Design," ASCE Journal of Computing in Civil
Engineering, 25(5), pp. 331-347, 2011.