Using Swarming Agents for Scalable Security in
Large Network Environments
(Invited Paper)
Michael B. Crouse, Jacob L. White, Errin W. Fulp, and Kenneth S. Berenhaut
Departments of Computer Science and Mathematics
Wake Forest University
Winston-Salem, NC 27109
Email: [email protected]

Glenn A. Fink and Jereme Haack
Pacific Northwest National Laboratory
Richland, WA, USA
Email: [email protected]
Abstract- The difficulty of securing computer infrastructures increases as they grow in size and complexity. Network-based security solutions such as IDS and firewalls cannot scale because of the exponentially increasing computational costs inherent in detecting the rapidly growing number of threat signatures. Host-based solutions like virus scanners and IDS suffer similar issues that are compounded when enterprises try to monitor them in a centralized manner.
Swarm-based autonomous agent systems like digital ants and artificial immune systems can provide a scalable security solution for large network environments. The digital ants approach offers a biologically inspired design where each ant in the virtual colony can detect atoms of evidence that may help identify a possible threat. By assembling the atomic evidence from different ant types, the colony may detect the threat. This decentralized approach can require, on average, fewer computational resources than traditional centralized solutions; however, there are limits to its scalability. This paper describes how dividing a large infrastructure into smaller, managed enclaves allows the digital ants framework to operate effectively in larger environments. Experimental results will show that using smaller enclaves allows for more consistent distribution of agents and results in faster response times.
I. INTRODUCTION
Computing infrastructures continue to grow to provide
the computational resources required for various large-scale
and multi-tenant computing applications. Grid computing and
cloud computing are examples of services that offer large
amounts of dynamic computing power using large, highly
distributed compute servers. However, the size of the infrastructure cannot be measured by the number of physical servers alone; virtualization can provide multiple hosts on one physical server, creating a much larger virtual infrastructure.
This is the approach of NSF's GENI [1], DHS's DETER [2], and other Emulab-based environments [3]. Unfortunately,
as these infrastructures grow in size they also become more
complex, making traditional system management approaches
increasingly impractical.
Providing security in large-scale environments is especially
challenging. Firewalls and IDS typically form the first line
of defense against the misuse of computing resources. These
measures are commonly deployed as part of the implementation of a host-based security policy. Given the diverse range of dynamically changing services, virtualization, and per-user policy exceptions, managing thousands of individual host
policies is quite challenging. Administrators may not even
be aware of which services are being offered by individual
machines at any time. Scalable new approaches are needed to
manage security efficiently.
Swarm-based approaches map nicely to computer security
problems precisely because of computer infrastructures' dy­
namic and decentralized properties. Swarm solutions prescribe
relatively simple rules for interaction that produce emergent
behaviors sometimes referred to as self-organization. The
result is swarm intelligence that can address problems that
appear to be more complex than the rules used to define
the behaviors. Swarm-based approaches adapt to changing
threat levels and have been demonstrated to be more efficient
than traditional security approaches [4]. In addition, swarm-based approaches are robust, since swarms select for colony survival and do not depend on particular individual agents [5].
These features of swarm solutions are important attributes for
securing large dynamic infrastructures.
The swarm-based security management approach described
in this paper leverages the Sensor and Sentinel levels of
the digital ants framework hierarchy described in [6]. While
Sensors are simple, lightweight, ant-like agents that roam from
host to host searching for evidence of problems, Sentinels are
immobile, host-level agents that protect and configure a single
host or a collection of similarly configured hosts. A Sentinel
uses evidence presented by multiple Sensors to determine
whether a threatening or suspicious condition exists on its host
computer.
When a Sensor presents evidence that the Sentinel cannot account for, or that the Sentinel determines is truly suspicious, the Sentinel will reward the Sensor, causing it to enter an active, pheromone-dropping mode for a period of time. This pheromone attracts other Sensors to the suspect host, producing stigmergic communication among Sensors. The pheromone trails dissipate
over time so that solved problems no longer attract more
Sensors. Pheromone-based systems have been demonstrated
to simply and effectively solve highly constrained problems
where logic-based, optimizing approaches prove intractable
[7].
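As a concrete illustration of this mechanism, consider the following minimal Python sketch of reward-driven pheromone deposit and time-based evaporation. The class and parameter names (Host, deposit, evaporate, the 0.1 evaporation rate) are illustrative assumptions for exposition, not the digital ants framework's actual implementation.

```python
# Minimal sketch of reward-driven pheromone deposit and time-based
# evaporation; names and rates are illustrative assumptions, not the
# digital ants framework's actual implementation.

class Host:
    def __init__(self):
        self.pheromone = 0.0   # pheromone level visible to passing Sensors

    def deposit(self, amount=1.0):
        # A rewarded Sensor drops pheromone on the hosts it visits.
        self.pheromone += amount

    def evaporate(self, rate=0.1):
        # Pheromone dissipates each time step, so solved problems
        # eventually stop attracting Sensors.
        self.pheromone *= (1.0 - rate)

hosts = [Host() for _ in range(5)]
hosts[2].deposit()                      # a rewarded Sensor marks host 2
for _ in range(20):                     # twenty time steps of dissipation
    for h in hosts:
        h.evaporate()
print(round(hosts[2].pheromone, 4))     # trail has faded: ~0.1216
```

The multiplicative decay is one simple way to realize the dissipation described above; any monotonically decreasing update would serve the same stigmergic purpose.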
Management of the Sensor population is critical for the
digital ants to remain responsive to threats in large infrastructures. In a large network environment, Sensors may have to travel long distances to reach a host that will reward them.
We have found that subdividing a large network into smaller
localities called enclaves improves Sensor distribution and
produces faster response times to threats. However, dividing
the infrastructure into enclaves also requires increasing the
population of Sensors in the infrastructure which unfortunately
increases the computational cost of the approach. This paper
will describe how to balance the need for responsiveness and
efficiency when securing the infrastructure using the digital
ants approach.
The remainder of the paper is structured as follows: Section
II will review the digital ants framework and describe the
challenges associated with deploying digital ants to protect
a large computer infrastructure. Section III will empirically
show the performance of different agent population management approaches and provide some deployment guidelines.
Finally, Section IV will summarize the results and describe
open areas of research.
II. A SWARM-BASED APPROACH TO SECURITY
The digital ants framework is a hierarchy consisting of
the Supervisor, Sergeant, Sentinel, and Sensor levels [6].
The different levels form a mixed-initiative approach, where human administrators' decision-making and authority are complemented by the computational resources of rational agents.
At the highest level of the hierarchy, human administrators,
called Supervisors, provide overall governance to the infrastructure and interact primarily with the top level of software
agents, called Sergeants, that are responsible for a local subset
of a computer infrastructure called an enclave. We define an
enclave as a set of geographically or topologically collocated
machines that has a single owner and is managed under a
common policy. Thus, enclaves resemble the Internet's Autonomous Systems (ASes), except that they add geographic or topological locality [8]. We also envision that
enclaves would be much smaller than ASes. Sergeants provide
situational awareness to the Supervisor and create enclave
policies based on Supervisor guidance. The Sergeant can be
aware of issues that span multiple hosts within the enclave and
may communicate with peer Sergeants in other enclaves.
Sentinel agents provide status to their Sergeant and enforce
the Sergeant's policies on enclave hosts. Sentinels also enable
Sensors to traverse the geography, i.e., the digital ants overlay network. Each Sensor searches for a single, atomic indicator
such as network connection frequency, number of zombie
processes, or number of open files. Their power is in their
numbers, diversity, and stigmergic communication. As they
wander, Sensors randomly adjust their current direction, in a manner similar to the movement of real ants. They compare findings at the
current host with findings in their recent visits. If the findings
are outliers, the Sensor reports this to the Sentinel. If the
Sentinel cannot explain the findings as part of its normal operating state, it will reward the Sensor, creating pheromone-reinforced feedback behaviors. The Sentinel will use evidence
from the diverse attracted Sensors to diagnose a problem.
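The Sensor behavior just described can be summarized in a short sketch. The Python fragment below is a hypothetical rendering of one wander, compare, and report step; the function name, the sentinel_check callback, and the outlier test (a simple z-score against recent visits) are our assumptions, not the framework's interface.

```python
import random

def sensor_step(position, neighbors, readings, history, sentinel_check,
                window=10, threshold=3.0):
    """One wander-compare-report step for a single Sensor (illustrative)."""
    value = readings[position]      # the single atomic indicator this Sensor checks
    if len(history) >= 2:
        mean = sum(history) / len(history)
        spread = (sum((v - mean) ** 2 for v in history) / len(history)) ** 0.5
        if spread > 0 and abs(value - mean) > threshold * spread:
            # Outlier relative to recent visits: report to the host's
            # Sentinel, which may reward the Sensor (switching it into
            # the active, pheromone-dropping mode).
            sentinel_check(position, value)
    history.append(value)
    del history[:-window]           # remember only the most recent visits
    return random.choice(neighbors[position])   # random stagger to an adjacent host
```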
A. Considerations for Large-Scale Infrastructures
While higher Sensor populations provide greater responsiveness [4], higher populations are also more expensive in both computational power and communication bandwidth. An
advantage of the digital ants framework is that it adaptively
varies the Sensor population to increase during attacks and
decrease when no threat is apparent. However, the system does
need to maintain a minimum Sensor population to remain
responsive. It will be shown experimentally that the minimum
will depend on the size of the infrastructure, where larger
infrastructures will require larger minimum populations.
If there are proportionally few Sensors in a very large enclave, it may take an unacceptably long time for the Sensors to reach a compromised host and jointly diagnose a problem. For example, two threats may occur close to the maximum distance from one another, leaving the Sensors gathered at the host farthest from the new threat. As the size of the infrastructure increases, the time for the Sensors to react to new threats can become unacceptably long.
Consider the scalability problem associated with Sensor
distribution on an infrastructure that consists of h hosts, where
h is a perfect square. Assume the geography of the h hosts
presented to the Sensors is a toroidal grid (square matrix)
where every host has eight neighbors; edges wrap to the opposite side, forming an 8-way regular graph. For example, a geography consisting of h = 65,536 hosts can be arranged
as a 256 x 256 grid. The farthest distance between any two
hosts in this geography is the diameter, which is half of the
matrix dimension. For the 65,536 example, the diameter is
128. It is important to note that the diameter is the longest shortest path in the grid; however, it is likely that the initial Sensor will take a longer path due to its random stagger.1 Once the first Sensor finds a compromised host, other Sensor types should find this compromised host more quickly by following the pheromone trail.

1 The first Sensor to discover a potential problem, which is then rewarded and moves away from the suspicious host.
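A small code sketch may help make this geometry concrete. The following Python fragment (our own illustration, not part of the framework) computes the eight wrap-around neighbors of a host and the shortest 8-way distance between hosts; for a 256 x 256 torus it confirms the diameter of 128.

```python
# Sketch of the toroidal geography: hosts on an n x n grid where each
# host has eight neighbors and edges wrap around.

def neighbors(x, y, n):
    """The eight wrap-around neighbors of host (x, y) on an n x n torus."""
    return [((x + dx) % n, (y + dy) % n)
            for dx in (-1, 0, 1) for dy in (-1, 0, 1)
            if (dx, dy) != (0, 0)]

def distance(a, b, n):
    """Shortest number of 8-way steps between hosts a and b on the torus
    (Chebyshev distance with wrap-around)."""
    dx = min(abs(a[0] - b[0]), n - abs(a[0] - b[0]))
    dy = min(abs(a[1] - b[1]), n - abs(a[1] - b[1]))
    return max(dx, dy)

n = 256                                        # 256 x 256 = 65,536 hosts
print(distance((0, 0), (n // 2, n // 2), n))   # the diameter: 128
```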
Although pheromone can reduce the number of steps required [4], the delay associated with an initial Sensor is still
an issue for large networks. Therefore it might be beneficial
to divide a large infrastructure grid into smaller sub-grids,
where every sub-grid is still toroidal. Sensors are then evenly
distributed and remain within the bounds of their assigned
area. This approach ensures that Sensors are better distributed
across the entire infrastructure.
Using the current geography, the h hosts in the toroidal grid can be split into m sub-grids, each of size n^2, i.e.,

h = m n^2    (1)
with h, m, and n all integers. Given this representation there
are several possible sub-grid configurations. For example, h = 65,536 hosts could be represented as a single 256 x 256 grid,
4 sub-grids of size 128 x 128, or 16 sub-grids of size 64 x
64. However, an arrangement consisting of a large number of small sub-grids may result in a higher total number of Sensors, which is computationally more expensive. Therefore, if there
is an upper bound on the number of Sensors, it will limit the
number and size of the sub-grids. This relationship will be
explored empirically in the next section.
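For illustration, the sketch below enumerates the square sub-grid configurations permitted by equation (1), optionally capped by an upper bound on the total Sensor population (assuming, hypothetically, one Sensor of each of three types per sub-grid as the minimum).

```python
# Sketch: enumerate the sub-grid configurations of equation (1),
# h = m * n**2, optionally capped by an upper bound on Sensors
# (assumed minimum: one Sensor of each of three types per sub-grid).

def configurations(h, max_sensors=None, types=3):
    configs = []
    n = 1
    while n * n <= h:
        if h % (n * n) == 0:
            m = h // (n * n)              # number of n x n sub-grids
            if max_sensors is None or m * types <= max_sensors:
                configs.append((m, n))
        n += 1
    return configs

# All square sub-grid splits of a 65,536-host enclave:
print(configurations(65536))    # includes (16, 64), (4, 128), (1, 256), ...
```

Passing a finite max_sensors prunes the configurations with many small sub-grids, mirroring the trade-off described above.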
III. EXPERIMENTAL RESULTS
In this section, we investigate the scalability of the digital ants framework using simulations. We experimentally examine the impact of increasing the enclave size and Sensor populations. Analysis will show how appropriate initial Sensor populations for a given enclave size can be identified. The experiments will also give insight into the impact of implementing sub-grids within the enclave on the distribution of the Sensors within large-scale environments.
The simulations considered enclaves with geographies represented as toroidal square grids, as described in Section II-A; the grids consisted of 4,096, 16,384, and 65,536 hosts. In the
experiments we employed three types of simulated Sensors
and required evidence from one of each of the three types
to identify the compromised host [6]. Although the initial
number of Sensors deployed might vary depending on the
experiment, for simplicity the population remained constant
for the duration of the simulation (no Sensor birth or death
once a simulation had started).
We evaluate the performance in each grid size using variations of two metrics, hitting time and cover time, which are
commonly used to measure the performance of random walk
processes and are also relevant to the digital ants framework.
We performed each experiment multiple times and recorded
the average result for each configuration.
A. Hitting Time Analysis
As described in the previous section, the total number of Sensors present in the enclave will affect the responsiveness of the system; more Sensors typically yield faster responses.
One way of measuring responsiveness is to consider the hitting
time (i.e. the number of random steps required for an agent to
reach node u from node v in a graph). For these experiments,
hitting time was the number of steps required for at least one
Sensor of each type to visit a given compromised host. The
initial Sensor population and compromised host were located
at a distance equal to the diameter of the network; therefore
the expected hitting time can be considered, in some sense, an
expected worst case for the number of steps (time) required
to discover the threat.
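A simplified version of this measurement can be sketched as follows. The Python fragment below uses plain random walks with no pheromone and places all Sensors at the diameter distance from the compromised host, so it is a stripped-down, assumed variant of the setup rather than the simulator used for Figure 1.

```python
import random

# Sketch of a worst-case hitting time measurement: the number of steps
# until at least one Sensor of each type has visited the compromised
# host, with all Sensors starting a diameter away. Plain random walks,
# no pheromone; an assumed simplification, not the paper's simulator.

def hitting_time(n, sensors_per_type, types=3, seed=None):
    rng = random.Random(seed)
    target = (n // 2, n // 2)            # diameter distance from (0, 0)
    moves = [(dx, dy) for dx in (-1, 0, 1) for dy in (-1, 0, 1)
             if (dx, dy) != (0, 0)]
    positions = {t: [(0, 0)] * sensors_per_type for t in range(types)}
    visited_types = set()
    steps = 0
    while len(visited_types) < types:
        steps += 1
        for t in range(types):
            for i, (x, y) in enumerate(positions[t]):
                dx, dy = rng.choice(moves)
                pos = ((x + dx) % n, (y + dy) % n)
                positions[t][i] = pos
                if pos == target:
                    visited_types.add(t)
    return steps

# e.g., a 64 x 64 grid (4,096 hosts) with 4 Sensors of each type:
print(hitting_time(64, sensors_per_type=4, seed=1))
```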
Figure 1 shows the hitting time associated with increasing
population densities of Sensors for three different grid sizes
(4,096, 16,384, and 65,536 hosts). The x-axis measures the
Sensor density, which is the number of Sensors, per type,
divided by the number of hosts. A density of one represents
one Sensor of each type per host. As seen in the graph, the
hitting time reduces dramatically as the density increases.
[Fig. 1. Hitting time as agent density increases for various network sizes (4,096; 16,384; and 65,536 hosts). The x-axis is agent density; the y-axis is average hitting time.]
However, at approximately 0.001 density the reduction in hitting time becomes less significant; there is a diminishing return for higher Sensor populations. The density at which this occurs will be referred to as the effective density. Furthermore, it is important to note that the effective density is independent of grid size: a density of approximately 0.001, or 0.1%, was appropriate for all three grids simulated.
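To make the effective density concrete, the per-type Sensor counts it implies for the three simulated grids can be computed directly (a simple worked example derived from the density alone, not additional experimental data):

```python
# Per-type Sensor counts implied by the 0.1% effective density.
for hosts in (4096, 16384, 65536):
    sensors = round(0.001 * hosts)
    print(f"{hosts} hosts -> roughly {sensors} Sensors of each type")
# Output: roughly 4, 16, and 66 Sensors per type, respectively.
```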
Figure 1 also shows, as expected, that larger grids tend to
have longer average hitting times. For example, the hitting time associated with the network of 65,536 hosts is roughly three times larger than that for the network of 4,096 hosts. However, as
described in the previous section, a grid can be sub-divided
into smaller sub-grids, which may result in better performance.
Assume s is the effective number of Sensors (based on the effective density), where s >= m since there must be at least as many Sensors as there are sub-grids (assuming only one type of Sensor for simplicity). Substituting s for m in equation (1) and solving for n yields

n = sqrt(h / s)    (2)
where n is the dimension of the sub-grid and must be a positive
whole value. Therefore, as s increases, the size of the sub-grid
can decrease, which will yield lower hitting times.
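As a worked example of equation (2), the small helper below (hypothetical; it assumes h is a perfect square as in Section II-A, and searches upward for a dimension that divides the grid exactly) recovers the sub-grid dimension implied by a given Sensor count.

```python
import math

# Worked example of equation (2), n = sqrt(h / s); assumes h is a
# perfect square (Section II-A). The helper name is hypothetical.

def subgrid_dimension(h, s):
    """Smallest sub-grid dimension n such that h = m * n**2 with m <= s."""
    n = math.isqrt(h // s)                 # start near sqrt(h / s)
    while n * n * s < h or h % (n * n) != 0:
        n += 1                             # climb until m = h / n**2 <= s and the split is exact
    return n

# 65,536 hosts with s = 16 Sensors -> 16 sub-grids of 64 x 64:
print(subgrid_dimension(65536, 16))        # prints 64
```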
Consider the grid consisting of 65,536 hosts (a 256 x 256 grid) depicted in Figure 1. Using equations (1) and (2) and an agent density of 0.1%, this grid can be divided into 16 sub-grids of 4,096 hosts each. For this example, this is the smallest sub-grid dimension possible, since each sub-grid must have at least three Sensors (one of each type). Again, a large number of small sub-grids is sought, since this configuration tends to yield lower hitting times.
Given only one compromised host, it will reside in only one
sub-grid. As a result, the hitting time for the infrastructure is
now reduced to the hitting time of the smaller sub-grid. Figure
1 shows a graph of 4,096 hosts with the same agent density.
[Fig. 2. Cover time as agent density increases for various network sizes (4,096; 16,384; and 65,536 hosts). The x-axis is agent density; the y-axis is average cover time.]
Therefore, using sub-grids at the same density as the full grid of 65,536 hosts, the hitting time was reduced by a factor of three.
B. Cover Time Analysis
Another metric employed to measure Sensor performance
is the cover time (i.e. the number of iterations, or steps,
it takes for all hosts to be visited, or covered, by at least
one Sensor of every type). When this scenario is applied
to the digital ants framework it represents the case where
the compromised host is only discovered by the last Sensor
type that visits the host. This differs from the hitting time experiments, where the findings from the first and second visiting Sensors (of different types) are considered useful. Our
definition of cover time considers only the spread of agents
to every node without considering a particular compromised
host as a detection target. Thus, we are measuring the time it
takes for at least one Sensor of each kind to visit every node.
Therefore the number of steps required to discover the attacker
will be considerably higher than observed for the hitting time
experiments.
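A simplified cover time measurement can be sketched in the same style as the hitting time example; again this is a plain random-walk illustration without pheromone, with assumed names and parameters rather than the actual simulator.

```python
import random

# Sketch of a cover time measurement: steps until every host has been
# visited by at least one Sensor of every type. Plain random walks,
# no pheromone; an assumed simplification, not the paper's simulator.

def cover_time(n, sensors_per_type, types=3, seed=None):
    rng = random.Random(seed)
    moves = [(dx, dy) for dx in (-1, 0, 1) for dy in (-1, 0, 1)
             if (dx, dy) != (0, 0)]
    # uncovered[t] holds the hosts not yet visited by any type-t Sensor.
    uncovered = [{(x, y) for x in range(n) for y in range(n)}
                 for _ in range(types)]
    walkers = [[(rng.randrange(n), rng.randrange(n))
                for _ in range(sensors_per_type)] for _ in range(types)]
    for t in range(types):
        for pos in walkers[t]:
            uncovered[t].discard(pos)    # starting hosts count as visited
    steps = 0
    while any(uncovered):                # loop until every set is empty
        steps += 1
        for t in range(types):
            for i, (x, y) in enumerate(walkers[t]):
                dx, dy = rng.choice(moves)
                pos = ((x + dx) % n, (y + dy) % n)
                walkers[t][i] = pos
                uncovered[t].discard(pos)
    return steps

print(cover_time(32, sensors_per_type=8, seed=1))   # a 1,024-host grid
```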
Figure 2 shows the cover time for the three different
grid sizes as the agent density increases. As with hitting
time, the cover time decreases as the agent density increases; however, the diminishing return occurs at higher densities.
As expected, larger grid sizes tend to have higher cover times
than smaller grids. The cover time for 65,536 hosts is roughly
2.5 times higher than that for 4,096 hosts. The density where the cover time stabilizes is also higher than the density associated with the hitting time experiments. This is primarily due to the lack of pheromone; as a result, Sensors move only in a random-walk fashion.
As in the hitting time experiments, these findings can be used to partition the original grid into smaller sub-grids. However, this approach requires higher minimum Sensor populations; this will yield smaller sub-grid dimensions but may be computationally prohibitive.
IV. CONCLUSIONS

The goal of the digital ants framework is to defend large infrastructures while using minimal computational resources and network bandwidth. Traditional, always-on approaches are well suited to defense but can fall short in computational efficiency, especially when used to manage large computing infrastructures. The digital ants framework can provide a scalable, more efficient approach that addresses both needs.
One concern with deploying digital ants in large environments is the responsiveness of the system. In these large networks, agents may be located far from a host requiring assistance, resulting in long response times. Experimental results showed that agent density (the ratio of agents to hosts) is critical for the responsiveness of the digital ants framework in large environments. For example, an agent density of 0.1% was the lowest density observed to provide relatively good response times, and higher agent densities did not provide significantly better performance. In addition, by dividing the large network infrastructure into smaller parts (enclaves), the response time becomes equivalent to that of a much smaller network.
Future work will examine pheromone trail lengths and dissipation rates to determine appropriate values. Another interesting area of future work is the appropriate creation and termination of agents. This paper addressed the minimum population required; however, more work is needed to better understand the appropriate lifetime of agents within the system.
ACKNOWLEDGEMENTS
This work was funded by the U.S. Department of Energy and Pacific Northwest National Laboratory. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of any of the sponsors of this work.
REFERENCES
[1] GENI Testbed. [Online]. Available: http://www.geni.net/
[2] The DETER Network Security Testbed. [Online]. Available: http://www.isi.edu/deter/
[3] B. White, J. Lepreau, L. Stoller, R. Ricci, S. Guruprasad, M. Newbold, M. Hibler, C. Barb, and A. Joglekar, "An integrated experimental environment for distributed systems and networks," in Proceedings of the Fifth Symposium on Operating Systems Design and Implementation, December 2002, pp. 255-270.
[4] B. C. Williams, "A comparison of static to biologically modeled intrusion detection systems," Master's thesis, Wake Forest University, 2010.
[5] H. V. D. Parunak, "Go to the ant: Engineering principles from natural multi-agent systems," Annals of Operations Research, vol. 75, pp. 69-101, 1997. [Online]. Available: http://www.jacobstechnology.com/vrc/pdf/gotoant.pdf
[6] J. N. Haack, G. A. Fink, W. M. Maiden, D. McKinnon, and E. W. Fulp, "Ant-based cyber defense," in Proceedings of the 8th International Conference on Information Technology: New Generations, 2011.
[7] M. Dorigo and L. M. Gambardella, "Ant colony system: A cooperative learning approach to the traveling salesman problem," IEEE Transactions on Evolutionary Computation, vol. 1, no. 1, pp. 53-66, 1997.
[8] A. Tanenbaum and D. Wetherall, Computer Networks. Prentice Hall, 2011.