2012 IEEE 16th International Symposium 2012 ISCE 1569567245

Agent-based Optimization of Emulations of Network Server Applications in Honeypots

Yilun Zhao and Jeremy J. Blum, Member, IEEE
The Pennsylvania State University, Harrisburg

Abstract--Low-interaction honeypots can provide a cost-effective security mechanism for a wide range of computer systems. A central challenge in the development of low-interaction honeypots is the development of emulation programs that mimic the action of server applications on the target platform. The emulation programs must be of high enough fidelity to fool attackers. However, the manual development of these emulations is extremely time-consuming. This paper describes an agent-based optimization system that can automate the generation of emulation programs for honeypots. The system is evaluated in its ability to emulate a mail server. In this evaluation, the system produced correct responses to more than 99% of test data queries.

I. INTRODUCTION

Honeypots are decoy systems designed to capture malicious users in order to collect data on patterns of malicious behavior and deflect malicious attention from real systems. Honeypots can play an important role in mitigating threats to an increasing range of devices, given the trend of providing network access for control and monitoring of devices from consumer electronics through industrial control systems.

A large class of these honeypots emulates an entire network of decoy systems on a single system. Central to the success of these systems is a high-fidelity emulation of network server applications. The traditional approach to the creation of these emulations relies on manual coding of an emulation script, a labor-intensive and time-consuming process that must be repeated whenever the real system is updated.

Attempts to introduce a level of automation into the generation of these emulations rely on representations of server behavior using extensions of finite state machine (FSM) models. The central challenge, however, is that the generation of these models comprises computationally intractable subtasks. For example, in order to generalize system behavior, regular expressions must be developed that match a class of user inputs to a single output, a problem which is NP-Hard.

In order to address this challenge, this paper describes an agent-based optimization architecture that has successfully solved intractable problems in other domains. The agent-based optimization system attempts to find a compact representation of an extended finite state machine (EFSM) that can generalize the behavior of the network application. The EFSM model is an FSM that has been extended with variables that can store system-wide and per-session state. Formally, the goal of the optimization problem is to minimize a weighted combination of the error in the emulated responses and the size of the finite state machine.

This optimization system is evaluated in its ability to emulate a POP3 email server. The results indicated that the optimization system generated a compact representation of this server that generalized the application's behavior. The resulting emulation generated correct responses for more than 99% of queries in an independent test set.

This research is funded in part by a seed grant from a National Science Foundation Partnership for Innovation grant (M. Walters, PI) and a grant from Novatech LLC (J. Blum, PI). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation or Novatech LLC.

978-1-4673-1356-8/12/$31.00 ©2012 IEEE

II. PREVIOUS WORK

The Deception Toolkit, the first publicly available honeypot, was "intended to make it appear to attackers as if the system running DTK [had] a large number of widely known vulnerabilities" [1]. Current honeypots can be classified into three different levels: Low-Interaction Honeypots, Medium-Interaction Honeypots and High-Interaction Honeypots [2].

High-interaction honeypots consist of actual devices deployed as decoys and, as a result, provide a high level of interactivity, in which nothing in the system is emulated or restricted [2-3]. Given that actual systems are deployed, high-interaction honeypots can be expensive to deploy. Moreover, they inherit all of the vulnerabilities of the actual systems. They can be compromised and used in attacks on other systems. More importantly, attackers can use the high-interaction honeypots to try to discover vulnerabilities in actual systems.

At the other end, low-interaction honeypots provide emulations of actual systems [4]. As a result, low-interaction honeypots can be inexpensive to deploy, since a single system can emulate an entire network of devices. In addition, because these honeypots simply emulate server applications, low-interaction honeypots can prevent attackers from learning about underlying vulnerabilities by omitting these from the emulation.

On the other hand, the subset of functionality that is typically provided by emulations of server programs can limit the effectiveness of low-interaction honeypots. If an attacker can quickly detect that only a subset of the functionality is implemented, the attacker can determine that the target is a honeypot as opposed to a preferred target. As a result, a central challenge for these honeypots is improving the emulation of network servers, both expanding the functionality that is implemented and improving its fidelity.
While emulations can be manually created, several researchers have attempted to automate or partially automate the development of emulation programs. For example, ScriptGen attempts to automatically generate compact representations of server behavior using a finite state machine model [5]. The approach takes as input a large dataset of recorded conversations and attempts to distill the large input set into a compact model. Another approach, RolePlayer, incorporates byte-stream alignment algorithms in order to identify the portions of responses that vary with differences in the request and the current session data [6]. Replayer addresses a similar problem, creating the ability to successfully replay an attack on a server by systematically identifying the portions of the client messages that must be altered [7]. There have also been a number of efforts to monitor the messages produced by a server application in order to attempt to reverse engineer a network protocol [8-10].

The problem that these automated approaches are attempting to solve is a computationally intractable one. Consider a simple server program which contains no record of session state. In other words, the response of the server program is simply a function of the previous client message. Given a comprehensive set of client and server messages, the emulation problem then becomes one of generalizing these messages in a compact form. One compact form would be a set of regular expressions, each of which matches all queries that produce a particular output. The emulation problem then reduces to the problem of generating a regular expression from positive data, a problem that is NP-Hard [11]. Certainly, more complex servers present even larger problems from an emulation perspective.

In order to solve the emulation-generation problem, the research described here instantiates an agent-based optimization architecture, which has successfully solved intractable problems in other domains [12, 13]. At the beginning of an optimization run, creation agents generate potential solutions and add them to the solution pool. These initial solutions are rough attempts to find an average performing solution. After the solution pool has been seeded with a set of initial solutions, modification agents select one or more existing solutions from the solution pool. These agents then use a heuristic or meta-heuristic approach to attempt to create a new solution that is an incremental improvement over the selected solution. By incorporating domain-specific knowledge into their solution improvement algorithms, these agents can efficiently guide the search of the solution space. Periodically, deletion agents remove poor performing solutions from the solution pool in order to keep the size of the solution pool within predetermined bounds. The goal of the system is that, over many iterations of agent runs, the solutions in the solution pool drift towards near-optimal solutions.

III. METHODOLOGY

A. Formal Problem Statement

The emulation of server programs is based on an Extended Finite State Machine (EFSM), which determines the response that should be generated based on the current state of the server program and the current query. The EFSM is a Finite State Machine which is extended to include session and query variables, as well as variables that capture other state data like date and time. The EFSM model can be stored compactly in tabular form.

Table 1: Extended Finite State Machine

Current State | Next State | Order | Regular Expression | Response
--------------|------------|-------|--------------------|---------
1             | 2          | 1     | *                  | "Enter command: "
2             | 3          | 1     | login              | "Username: "
2             | 2          | 2     | help               | "Type 'login' to authenticate.\nEnter command: "
2             | 2          | 3     | *                  | "Unrecognized command.\nEnter command: "
3             | 4          | 2     | *                  | "Password required for " + UNAME + ": "
3             | -          | 3     | *                  | "Invalid Password"

Variable List (Name, Location): (UNAME, 1)
For example, Table 1 shows an EFSM for a sample server program. Each row in the table corresponds to an edge in the EFSM, with the source state and destination state, the regular expression that must be matched to trigger a transition, the response to be generated, and any variable values to extract from the input. When determining the response, the emulator program applies the regular expressions for the current state, in the order specified by the order column in the table, until a match of the input is found. The last rule for each state is a default transition.

For example, in the sample server program represented in Table 1, when a user initially connects to the server, the server sends the message "Enter command: ". At this point, the user can either type "login" to start the login sequence (states 3 and 4) or "help" to display a help message. Any other input at this point generates an error message. When the user enters a string after the "Username: " prompt, this string is extracted, stored in the UNAME variable, and then echoed back to the user in the response. After the client enters a password, the server terminates the connection.

An extended finite state machine representation that exactly matches all of the example requests and responses could be generated simply by creating a new state for each request and concatenating together, with the "OR" keyword, all requests for a given state that generated the same response. However, there are two significant problems with this approach. First, the size of the resulting EFSM will be so large that the emulator will not be able to process requests in a reasonable amount of time. Second, this process does not attempt to generalize the behavior of the system at all. Therefore, the emulation of the application would only be correct if the input to the emulator were exactly the same as an input that appeared earlier in the examples.
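The table-driven matching described for Table 1 can be sketched in a few lines of Python. This is a minimal illustration, not the authors' emulator: the `Transition` and `Emulator` names are hypothetical, "*" is written as the regular expression `.*`, the location 1 in the variable list is assumed to mean the first whitespace-separated token, and the two login rows are filled in from the prose description (state 3 reads the username, state 4 reads the password and closes the connection).

```python
import re
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Transition:
    state: int                     # source state
    next_state: Optional[int]      # None means the connection is closed
    order: int                     # rules are tried in ascending order
    pattern: str                   # regular expression for the client message
    response: str                  # may reference variables as {NAME}
    capture: Optional[str] = None  # variable filled from the first token (assumption)


# Illustrative rows modeled on the sample login server (assumed, see lead-in).
TABLE: List[Transition] = [
    Transition(1, 2, 1, r".*", "Enter command: "),
    Transition(2, 3, 1, r"login", "Username: "),
    Transition(2, 2, 2, r"help", "Type 'login' to authenticate.\nEnter command: "),
    Transition(2, 2, 3, r".*", "Unrecognized command.\nEnter command: "),
    Transition(3, 4, 1, r".*", "Password required for {UNAME}: ", capture="UNAME"),
    Transition(4, None, 1, r".*", "Invalid Password"),
]


class Emulator:
    def __init__(self, table: List[Transition], start: int = 1) -> None:
        self.table = table
        self.state: Optional[int] = start
        self.variables: Dict[str, str] = {}

    def step(self, message: str) -> str:
        """Apply the rules of the current state in order; the last one is the default."""
        rules = sorted((t for t in self.table if t.state == self.state),
                       key=lambda t: t.order)
        for rule in rules:
            if re.fullmatch(rule.pattern, message, flags=re.DOTALL):
                if rule.capture:
                    tokens = message.split()
                    self.variables[rule.capture] = tokens[0] if tokens else ""
                self.state = rule.next_state
                return rule.response.format(**self.variables)
        raise ValueError("no transition available (connection closed?)")
```

Calling `step("")` on a fresh `Emulator(TABLE)` models the initial connection; subsequent calls feed client messages one at a time.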
Instead, the goal of this research is to find a compact representation of an extended finite state machine that can generalize the behavior of the network application. Formally, we seek to minimize a weighted combination of the error in the emulated responses and the size of the finite state machine. The objective is the sum of an error term and two size penalties, defined in terms of the following quantities:

T = total number of conversations within the test data
H = correct responses from test data for the EFSM
V = total number of states within the EFSM, each of which incurs a fixed per-state penalty
E = total number of transitions within the EFSM, each of which incurs a fixed per-transition penalty

The first term in the expression to be minimized represents the accuracy of the EFSM on the test data. In order to encourage generalization, penalties are applied to the number of states and the number of transitions. In this way, if there are two models with the same level of accuracy, the optimization system will favor the more compact representation of server action.

B. Agent Optimization System

In order to generate a compact and accurate EFSM model, an agent-based optimization system is instantiated for the emulation automation problem. This system is characterized by three different types of agents and a shared solution pool. The three types of agents are creation agents, modification agents, and deletion agents.

As shown in Figure 1, the optimization process begins with a creation agent seeding the solution pool with a set of initial solutions. Then, the modification agents run for a certain number of iterations. During each iteration, a modification agent is randomly selected. This agent chooses a solution from the solution pool, creates a new solution in an attempt to improve the chosen solution, and then inserts the new solution back into the solution pool. At the end of each round of modification agent runs, the deletion agent prunes the solution pool by removing the poor performing solutions from the pool based on their fitness scores.
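The objective and the round structure described above can be sketched together in Python. The exact form of the objective is an assumption here: the error term is taken to be the fraction of incorrect responses, (T - H) / T, and the per-state and per-transition penalties are given the hypothetical names `alpha` and `beta` with arbitrary values. The two toy agents are stand-ins that only adjust counts; they are not the seven modification agents of this system.

```python
import random
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Solution:
    correct: int      # H: responses reproduced correctly
    total: int        # T: responses the solution was scored against
    states: int       # V
    transitions: int  # E


def fitness(s: Solution, alpha: float = 0.01, beta: float = 0.001) -> float:
    """Lower is better: error fraction plus size penalties (assumed form)."""
    return (s.total - s.correct) / s.total + alpha * s.states + beta * s.transitions


Agent = Callable[[Solution, random.Random], Solution]


def add_transition(s: Solution, rng: random.Random) -> Solution:
    # Toy stand-in: one more transition may fix a few errors.
    fixed = min(s.total, s.correct + rng.randint(0, 3))
    return Solution(fixed, s.total, s.states, s.transitions + 1)


def remove_transition(s: Solution, rng: random.Random) -> Solution:
    # Toy stand-in: a smaller model may lose a few correct responses.
    if s.transitions <= 2:
        return s
    kept = max(0, s.correct - rng.randint(0, 3))
    return Solution(kept, s.total, s.states, s.transitions - 1)


def optimize(seed_pool: List[Solution], agents: List[Agent], rounds: int,
             runs_per_round: int, pool_size: int, rng: random.Random) -> List[Solution]:
    pool = list(seed_pool)
    for _ in range(rounds):
        for _ in range(runs_per_round):
            agent = rng.choice(agents)        # randomly select an agent
            parent = rng.choice(pool)         # pick a random solution
            pool.append(agent(parent, rng))   # insert the new solution
        pool.sort(key=fitness)                # deletion step: rank by fitness
        del pool[pool_size:]                  # keep only the best performers
    return pool
```

Because the deletion step always keeps the best-ranked solution, the best fitness in the pool never gets worse from one round to the next, whatever the individual agents do.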
1) Creation Agent

The creation agent seeds the solution pool with a set of initial solutions. All of these solutions contain two states with two default transitions. The first default transition begins at the initial state and terminates at the second state. The output generated from this transition is the initial response that is sent from the server when a client first connects. This message is chosen at random from the set of all of the initial welcome messages in the training data. The initial message is chosen with a probability proportional to the frequency with which that message is used as the initial message in the training data.

The second default transition is a self-loop, beginning and ending at the second state. The output from this transition is similarly chosen from the set of responses in the training data that are generated after the first message from the client. Again, the probability that a message is chosen from this set is proportional to the frequency with which it appears as the second response in all message exchanges between clients and the server in the training data.

2) Modification Agents

After the solution pool has been seeded with an initial set of solutions, modification agents attempt to create improved solutions based on the existing solutions in the solution pool. These solutions are then evaluated based on the training data and inserted into the solution pool. As agents work to improve one another's solutions, the solutions in the solution pool tend to drift towards near-optimal solutions. This optimization system uses seven modification agents:

- A state insertion agent, which attempts to improve a solution by adding new states to the EFSM.
- A transition insertion agent, which attempts to improve the EFSM by adding a new transition.
- A variable-identification agent, which attempts to generalize the EFSM by seeking to identify variables that could be added to the EFSM representation to capture system-wide or per-session state.
- A regular expression generation agent, which attempts to merge multiple transitions at a state by generalizing the regular expressions that trigger these transitions.
- A transition modification agent, which changes the destination state of an existing transition in the EFSM.
- A state removal agent, which removes a state from the EFSM in order to attempt to make the representation more compact.
- A transition removal agent, which removes a transition from the EFSM with the same goal.

Create initial solutions in solution pool using Creation Agent
For i = 0 -> total number of iterations
    For j = 0 -> total number of agents per iteration
        Randomly select an agent
        Selected agent -> pick a random solution from solution pool
        Selected agent -> perform optimization
        Selected agent -> insert new solution into solution pool
    Delete agent -> delete less performing solutions from the pool

Figure 1: Agent-Based Optimization System Pseudocode

The state insertion agent attempts to find a situation in which the response to a non-initial client message differs from the response generated by the EFSM. One of these unmatched message exchanges is randomly chosen to attempt to improve the EFSM. The messages in this exchange are labeled as follows:

(Λ, r_0), (c_1, r_1), (c_2, r_2), ..., (c_n, r_n)

where (Λ, r_0) corresponds to the initial connection from the client, with r_0 the initial response that the server generates upon the connection, and (c_i, r_i) are the ith message from the client and the ith response from the server, respectively.

Assume the client message c_i, i > 1, is the one for which the response is incorrect. The agent's logic presumes that this difference may be due to a transition to a different state that should have occurred due to an earlier message. The agent, then, randomly chooses one of the transitions followed for messages c_j, j < i. This transition is changed so that it terminates at a newly created state.
A new transition, in the form of a self-loop, is introduced at this state, which correctly generates r_i in response to a client message c_i. In addition, a default transition, also a self-loop, is added, with its response drawn from the set of responses in the training data that would be generated at this new state. The default response is chosen with a probability proportional to the frequency with which the response occurs at this new state.

The transition insertion agent identifies a response r_i, i ≥ 1, which differs from the response predicted by the EFSM. This agent attempts to fix this error by adding a transition at the state in the EFSM where this error occurred. This transition terminates at a randomly chosen state. The regular expression for the transition is set equal to c_i, and the response generated is set equal to r_i.

The variable-identification agent attempts to generalize the EFSM by identifying strings in the client messages that are echoed back later by the server. This agent scans the client messages and server responses and chooses one conversation at random in which a string from the client message, c_i, is echoed back by the server in a response r_j, j ≥ i. The agent then creates a variable at the transition followed due to message c_i. This variable is given a unique name, along with the information that defines where in the client message it occurs, e.g. a particular token in the message. Then, the transition followed for response r_j is changed so that the string that is echoed back is replaced by the variable value.

The regular expression generation agent attempts to merge multiple transitions at a given state. It groups together transitions at a state that generate similar responses, where the similarity between the responses exceeds a threshold. One of these groups of transitions is chosen at random, and this agent then merges the transitions by replacing the portion of the client messages that differs with a wildcard.
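The merge step of the regular expression generation agent can be sketched as follows. This is an illustrative simplification rather than the agent's actual algorithm: it assumes messages are compared token by token on single spaces, that all messages in a group have the same number of tokens, and that any token position that differs is replaced by the wildcard `[^ ]+`. The function name `merge_requests` is hypothetical, and the grouping of transitions by response similarity is not shown.

```python
import re
from typing import List, Optional


def merge_requests(requests: List[str]) -> Optional[str]:
    """Return one regular expression matching all requests, or None if they
    cannot be aligned token by token."""
    token_lists = [r.split(" ") for r in requests]
    if len({len(tokens) for tokens in token_lists}) != 1:
        return None  # different token counts: no simple alignment
    parts = []
    for column in zip(*token_lists):
        if len(set(column)) == 1:
            parts.append(re.escape(column[0]))  # token shared by every request
        else:
            parts.append("[^ ]+")               # differing token becomes a wildcard
    return " ".join(parts)


pattern = merge_requests(["PASS ab", "PASS cd"])
print(pattern)
```

A pattern produced this way also matches requests that were never observed, which is the generalization the agent is after; the price is that it can over-generalize when the group of merged transitions is chosen poorly.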
For example, if requests "PASS ab" and "PASS cd" have the same response "-ERR [AUTH] Authentication failed", this agent would create a transition in the EFSM to match both requests with a regular expression in the form "PASS [^ ]+".

The transition modification agent, the state removal agent, and the transition removal agent are all designed to randomly explore the neighborhood of a solution. The transition modification agent randomly chooses one of the transitions in the EFSM and changes its destination state to a randomly chosen state. The state removal agent chooses a non-initial state at random from the EFSM and removes it. Any transitions that had this deleted state as their destination are modified to terminate at a randomly chosen state. The transition removal agent chooses a non-default transition at random from the EFSM and deletes it. Both the state removal agent and the transition removal agent can improve the EFSM by making it more compact, which improves the fitness by reducing the penalties associated with each state and transition.

3) Deletion Agent

The deletion agent serves two purposes. First, it keeps the solution pool to a manageable size. In addition, it removes the weakest performing solutions, focusing the optimization search on the most promising solutions and improving the speed with which the optimization system converges to a near-optimal solution. The deletion agent used for this system simply ranks the existing solutions based on their fitness, retains the top performing solutions, and deletes the remaining ones.

IV. EXPERIMENTAL RESULTS

The optimization system was evaluated based on its ability to create an EFSM that emulates a mail server using the Post Office Protocol version 3 (POP3). The data used to train and evaluate the optimization system was based on conversations that represented unauthorized attempts to access the system. There were no successful logins in the data. Two large sets of messages were collected for this server.
One set was used for training the EFSM, and the other set was used to evaluate the EFSM. Each of the sets contained transcripts from 500 connections, with approximately 2,900 total messages.

The optimization system was configured with the following parameters. The target solution pool size was 40 solutions. The creation agent seeded the solution pool with one initial solution. In each round of the optimization system, 50 modification agents were chosen to run. At the end of the round, the deletion agent pruned the solution pool back to 40 solutions. The optimization system ran for a total of 15 rounds. On an Intel Xeon E51225 3.1GHz processor with the Windows 7 (64 bit) operating system, the optimization system run took approximately 15 minutes.

Multiple runs of the optimization system produced the EFSM shown in Figure 2. As the figure shows, the EFSM is extremely compact, with no extraneous transitions. Moreover, when evaluated on the test data, the EFSM correctly generated responses for 2,870 out of a total of 2,875 client messages, i.e. an accuracy of 99.83%.

[Figure 2: EFSM for POP3 Mail Server. State diagram with three states (Initial State, 2nd State, 3rd State). Request / response labels on the transitions, in the order they appear in the figure:
QUIT / +OK goodbye
Any request / +OK Messaging Multiplexor (Sun Java(tm) System Messaging Server 6.3-11.01 (built Feb 12 2010))
USER [^ ]+ / +OK password required for user %VAR1%
QUIT / +OK goodbye
Any Request / -ERR invalid command
PASS [^ ]+ / -ERR [AUTH] Authentication failed
Any Request / -ERR invalid command]

V. CONCLUSIONS AND FUTURE WORK

Agent-based optimization holds promise for the creation of EFSMs for the emulation of network applications for honeypots. In the evaluation of this approach for a mail server, the system produced an EFSM that provided close to 100% accuracy.

This approach is extensible to other network applications. The existing agents are generic enough that they may be useful in the emulation of other applications. In addition, the agent architecture is extremely flexible.
If additional agents are needed, it is a simple matter to create and install these for the emulation of other services.

The current work is being extended to model web-based services that are part of increasingly common interfaces for a wide range of devices, ranging from consumer electronics to SCADA systems. This optimization system is being used to develop emulations for these devices. The ultimate test for these emulations will be their ability to fool malicious actors into believing that the honeypots are actual devices. This deception would provide a number of benefits, including allowing for the collection of data on malicious behavior that is needed to train intrusion detection systems, for the identification of potential flaws in devices or common configurations of devices that malicious users attempt to exploit in the honeypot, and for hiding real devices in a sea of honeypot systems.

REFERENCES

[1] Cohen, F., "The Deception ToolKit," Risks Digest (9), March 1998.
[2] Spitzner, L., Honeypots: Tracking Hackers, Addison Wesley, 2003.
[3] Capture-HPC Client Honeypot / Honeyclient, http://projects.honeynet.org/capture-hpc
[4] Provos, N., "Honeyd - A Virtual Honeypot Daemon," 10th DFN-CERT Workshop, Hamburg, Germany, February 2003.
[5] Leita, C., Mermoud, K., Dacier, M., "ScriptGen: an automated script generation tool for honeyd," 21st Annual Computer Security Applications Conference, 5-9 Dec. 2005, 214-227.
[6] Cui, W., Paxson, V., Weaver, N.C., Katz, R.H., "Protocol-Independent Adaptive Replay of Application Dialog," 13th Symposium on Network and Distributed System Security (NDSS), 2006.
[7] Newsome, J., Brumley, D., Franklin, J., Song, D., "Replayer: automatic protocol replay by binary analysis," Proceedings of the 13th ACM conference on Computer and communications security, 2006.
[8] Caballero, J., Song, D., "Polyglot: Automatic Extraction of Protocol Format using Dynamic Binary Analysis," ACM Conference on Computer and Communications Security, 2007.
[9] Lin, Z., Jiang, X., Xu, D., Zhang, X., "Automatic Protocol Format Reverse Engineering through Context-Aware Monitored Execution," 15th Symposium on Network and Distributed System Security, 2008.
[10] Kruegel, C., Kirda, E., Comparetti, P.M., Wondracek, G., "Automatic Network Protocol Analysis," 15th Annual Network and Distributed System Security Symposium, 2008.
[11] Fernau, H., "Algorithms for learning regular expressions," Information and Computation, 207(4), 2009.
[12] Blum, J., Eskandarian, A., "Enhancing Intelligent Agent Collaboration for Flow Optimization of Railroad Traffic," Transportation Research: Part A, 36(10), 2002.
[13] Blum, J., Mathew, T.V., "Intelligent Agent Optimization of Urban Bus Transit System Design," ASCE Journal of Computing in Civil Engineering, 25(5), pp. 331-347, 2011.