New Payload Attribution Methods for Network Forensic Investigations

MIROSLAV PONEC∗†, PAUL GIURA∗, JOEL WEIN∗† and HERVÉ BRÖNNIMANN∗
∗Polytechnic Institute of NYU, †Akamai Technologies
Payload attribution can be an important element in network forensics. Given a history of packet
transmissions and an excerpt of a possible packet payload, a Payload Attribution System (PAS)
makes it feasible to identify the sources, destinations and the times of appearance on a network of
all the packets that contained the specified payload excerpt. A PAS, as one of the core components
in a network forensics system, enables investigating cybercrimes on the Internet, by, for example,
tracing the spread of worms and viruses, identifying who has received a phishing email in an
enterprise, or discovering which insider allowed an unauthorized disclosure of sensitive information.
Due to the increasing volume of network traffic in today’s networks it is infeasible to effectively
store and query all the actual packets for extended periods of time in order to allow analysis of
network events for investigative purposes; therefore we focus on extremely compressed digests of
the packet activity. We propose several new methods for payload attribution which utilize Rabin
fingerprinting, shingling, and winnowing. Our best methods allow building practical payload attribution systems which provide data reduction ratios greater than 100:1 while supporting efficient
queries with very low false positive rates. We demonstrate the properties of the proposed methods
and specifically analyze their performance and practicality when used as modules of a network
forensics system ForNet. In our experiments, the proposed methods outperform current state-of-the-art methods
both in terms of false positive rate and data reduction ratio. Finally, these approaches directly allow
the collected data to be stored and queried by an untrusted party without disclosing any payload
information or the contents of queries.
Categories and Subject Descriptors: C.2.0 [Computer-Communication Networks]: General—
Security and protection
General Terms: Algorithms, Performance, Security
Additional Key Words and Phrases: Bloom Filter, Network Forensics, Payload Attribution
1.
INTRODUCTION
Cybercrime today is alive and well on the Internet and growing both in scope and
sophistication [Richardson and Peters 2007]. Given the trends of increasing Internet usage by individuals and companies alike and the numerous opportunities for
Author’s address: Polytechnic Institute of NYU, Department of Computer Science and Engineering, 6 MetroTech Center, Brooklyn, NY 11201, U.S.A.
Akamai Technologies, 8 Cambridge Center, Cambridge, MA 02142, U.S.A.
This research is supported by NSF CyberTrust Grant 0430444.
This paper is a significantly extended and updated version of [Ponec et al. 2007].
We would like to thank Kulesh Shanmugasundaram for helpful discussions.
Permission to make digital/hard copy of all or part of this material without fee for personal
or classroom use provided that the copies are not made or distributed for profit or commercial
advantage, the ACM copyright/server notice, the title of the publication, and its date appear, and
notice is given that copying is by permission of the ACM, Inc. To copy otherwise, to republish,
to post on servers, or to redistribute to lists requires prior specific permission and/or a fee.
© 20YY ACM 0000-0000/20YY/0000-0001 $5.00
ACM Journal Name, Vol. V, No. N, Month 20YY, Pages 1–32.
anonymity and non-accountability of Internet use, we expect this trend to continue
for some time. While there is much excellent work targeted at preventing cybercrime, there is unfortunately a parallel need for good tools that aid law enforcement and corporate security professionals in investigating crimes that have already been committed.
Identifying the sources and destinations of all packets that appeared on a network
and contained a certain excerpt of a payload, a process called payload attribution,
can be an extremely valuable tool in helping to determine the perpetrators or the
victims of a network event and to analyze security incidents in general [Shanmugasundaram et al. 2003; Staniford-Chen and Heberlein 1995; Garfinkel 2002; King
and Weiss 2002].
It is possible to collect full packet traces even with commodity hardware [Anderson and Arlitt 2006] but the storage and analysis of terabytes of such data from
today’s high-speed networks is extremely cumbersome. Supporting network forensics by simply capturing and logging raw network traffic, however, is infeasible for
anything but short periods of history. First, storage requirements limit the time
over which the data can be archived (e.g., a 100 Mbit/s WAN can fill up 1 TB in
just one day) and it is a common practice to overwrite old data when that limit
is reached. Second, string matching over such massive amounts of data is very
time-consuming.
Recently Shanmugasundaram et al. [2004] presented an architecture for network forensics in which payload attribution is a key component. They introduced the idea of using Bloom filters to achieve a reduced
size digest of the packet history that would support queries about whether any
packet containing a certain payload excerpt has been seen; the reduction in data
representation comes at the price of a manageable false positive rate in the query
results. Subsequently a different group has offered a variant technique for the same
problem [Cho et al. 2006].
Our contribution in this paper is to present new methods for payload attribution
that have substantial performance improvements over these state-of-the-art payload
attribution systems. Our approach to payload attribution, which constitutes a
crucial component of a network forensic system, can be easily integrated into any
existing network monitoring system. The best of our methods allow data reduction
ratios greater than 100:1 and achieve very low overall false positive rates. With a
data reduction ratio of 100:1 our best method gives no false positive answers for
query excerpt sizes of 250 bytes and longer; in contrast, the prior best techniques
had 100% false positive rate at that data reduction ratio and excerpt size. The
reduction in storage requirements makes it feasible to archive data taken over an
extended time period and query for events in a substantially distant past. Our
methods are capable of effectively querying for small excerpts of a payload but can
also be extended to handle excerpts that span several packets. The accuracy of
attribution increases with the length of the excerpt and the specificity of the query.
Further, the collected payload digests can be stored and queries performed by an
untrusted party without disclosing any payload information nor the query details.
This paper is organized as follows. In the next section we review related prior
work. In Section 3 we provide a detailed design description of our payload attribution techniques, with a particular focus on payload processing and querying. In
Section 4 we discuss several issues related to the implementation of these techniques
in a full payload attribution system. In Section 5 we present a performance comparison of the proposed methods and quantitatively measure their effectiveness for
multiple workloads. Finally, we summarize our conclusions in Section 6.
2.
RELATED WORK
When processing a packet payload by the methods described in Section 3, the overall
approach is to partition the payload into blocks and store them in a Bloom filter.
In this section we first give a short description of Bloom filters and introduce Rabin
fingerprinting and winnowing, which are techniques for block boundary selection.
Thereafter we review the work related to payload attribution systems.
2.1
Bloom Filters
Bloom filters [Bloom 1970] are space-efficient probabilistic data structures supporting membership queries and are used in many network and other applications [Broder and Mitzenmacher 2002]. An empty Bloom filter is a bit vector
of m bits, all set to 0, that uses k different hash functions, each of which maps
a key value to one of the m positions in the vector. To insert an element into
the Bloom filter, we compute the k hash function values and set the bits at the
corresponding k positions to 1. To test whether an element was inserted, we hash
the element with these k hash functions and check if all corresponding bits are
set to 1, in which case we say the element is in the filter. The space savings of a
Bloom filter is achieved at the cost of introducing false positives; the greater the
savings, the greater the probability of a query returning a false positive. Equation 1
gives an approximation of the false positive rate α, after n distinct elements were
inserted into the Bloom filter [Mitzenmacher 2002]. More analysis reveals that an
optimal utilization of a Bloom filter is achieved when the number of hash functions,
k, equals (ln 2) · (m/n) and the probability of each bit of the Bloom filter being 0 is
1/2. In practice, of course, k has to be an integer, and a smaller k is often preferred
to reduce the amount of necessary computation. Note also that while we use Bloom
filters throughout this paper, all of our payload attribution techniques can be easily
modified to use any data structure which allows insertion and querying for strings
with no changes to the structural design and implementation of the attribution
methods.
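The insert/query cycle described above can be sketched in a few lines of Python. The construction of the k hash functions from salted SHA-256 is an illustrative choice (any k independent hash functions work); m and k follow the notation in the text.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter sketch: an m-bit vector with k hash functions."""

    def __init__(self, m: int, k: int):
        self.m, self.k = m, k
        self.bits = bytearray((m + 7) // 8)  # m bits, all initially 0

    def _positions(self, item: bytes):
        # Derive k positions by hashing the item under k different salts.
        for i in range(self.k):
            h = hashlib.sha256(i.to_bytes(4, "big") + item).digest()
            yield int.from_bytes(h, "big") % self.m

    def insert(self, item: bytes) -> None:
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def query(self, item: bytes) -> bool:
        # True means "probably inserted"; False is always correct.
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))
```

With m = 2^16 and k = 7, an inserted element is always found again, while a query for an element never inserted answers positively only with the false positive probability of Equation 1.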
α = (1 − (1 − 1/m)^(kn))^k ≈ (1 − e^(−kn/m))^k    (1)
2.2
Rabin Fingerprinting
Fingerprints are short checksums of strings with the property that the probability
of two different objects having the same fingerprint is very small. Rabin defined a
fingerprinting scheme [Rabin 1981] for binary strings based on polynomials in the
following way. We associate a polynomial S(x) of degree N − 1 with coefficients in
Z2 with every binary string S = (s1 , . . . , sN ), for N ≥ 1:
S(x) = s1·x^(N−1) + s2·x^(N−2) + · · · + sN.    (2)
Then we take a fixed irreducible polynomial P (x) of degree K, over Z2 and define
the fingerprint of S to be the polynomial f (S) = S(x) mod P (x).
This scheme, only slightly modified, has found several applications [Broder 1993],
for example, in defining block boundaries for identifying similar files [Manber 1994]
and for web caching [Rhea et al. 2003]. We derive a fingerprinting scheme for payload content based on Rabin's scheme in Section 3.3 and use it to pick content-dependent boundaries for a priori unknown substrings of a payload. For details on the applications, properties, and implementation issues of Rabin's scheme one can refer to [Broder 1993].
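A sketch of the fingerprint computation: the bits of S are fed in one at a time and the intermediate polynomial is reduced modulo P(x) by carry-less shift-and-XOR, i.e., arithmetic over Z2. The particular degree-16 polynomial below is an assumption made for illustration only; a real deployment would use a polynomial verified to be irreducible.

```python
# P(x) = x^16 + x^12 + x^3 + x + 1, encoded with bit i = coefficient of x^i.
# Assumed irreducible for illustration; not verified.
P = (1 << 16) | (1 << 12) | (1 << 3) | (1 << 1) | 1
K = 16  # degree of P(x)

def rabin_fingerprint(data: bytes) -> int:
    """f(S) = S(x) mod P(x) over Z2; the result has degree < K."""
    fp = 0
    for byte in data:
        for bit in range(7, -1, -1):
            fp = (fp << 1) | ((byte >> bit) & 1)  # shift in next coefficient
            if fp >> K:                           # degree reached K: reduce
                fp ^= P
    return fp
```

Identical strings always yield identical fingerprints; since the fingerprint fits in K = 16 bits here, a practical system would choose a larger K to keep the collision probability small.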
2.3
Winnowing
Winnowing [Schleimer et al. 2003] is an efficient fingerprinting algorithm enabling
accurate detection of full and partial copies between documents. It works as follows:
For each sequence of ν consecutive characters in a document, we compute its hash
value and store it in an array. Thus, the first item in the array is a hash of c1 c2 . . . cν ,
the second item is a hash of c2 c3 . . . cν+1 , etc., where ci are the characters in the
document of size Ω bytes, for i = 1, . . . , Ω. We then slide a window of size w through
the array of hashes and select the minimum hash within each window. If several
hashes tie for the minimum value, we choose the rightmost one. These selected
hashes form the fingerprint of the document. It is shown in [Schleimer et al. 2003]
that fingerprints selected by winnowing are better for document fingerprinting than
the subset of Rabin fingerprints which contains hashes equal to 0 mod p, for some
fixed p, because winnowing guarantees that in any window of size w there is at least
one hash selected. We will use this idea to select boundaries for blocks in packet
payloads in Section 3.5.
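The selection step can be sketched as follows; CRC-32 stands in for the ν-gram hash (the scheme works with any string hash, e.g. a Rabin fingerprint), and the parameters nu and w correspond to ν and w in the text.

```python
import zlib

def winnow(text: bytes, nu: int, w: int) -> set:
    """Winnowing sketch: hash every nu-gram, slide a window of w hashes,
    and keep the minimum of each window (rightmost one on ties).
    Returns a set of (position, hash) pairs forming the fingerprint."""
    grams = [zlib.crc32(text[i:i + nu]) for i in range(len(text) - nu + 1)]
    selected = set()
    for start in range(len(grams) - w + 1):
        window = grams[start:start + w]
        m = min(window)
        # rightmost occurrence of the minimum within the window
        idx = start + max(i for i, h in enumerate(window) if h == m)
        selected.add((idx, grams[idx]))
    return selected
```

By construction, every window of w consecutive hash positions contains at least one selected hash, which is exactly the guarantee that makes winnowing preferable to the "0 mod p" subset of Rabin fingerprints.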
2.4
Attribution Systems
There has been a major research effort over the last several years to design and
implement feasible network traffic traceback systems, which identify the machines
that directly generated certain malicious traffic and the network path this traffic
subsequently followed. These approaches, however, restrict the queries to network
floods, connection chains, or the entire payload of a single packet in the best case.
The Source Path Isolation Engine (SPIE) [Snoeren et al. 2001] is a hash-based
technique for IP traceback that generates audit trails for traffic within a network.
SPIE creates hash-digests of packets based on the packet header and a payload
fragment and stores them in a Bloom filter in routers. SPIE uses these audit trails
to trace the origin of any single packet delivered by the network in the recent past.
The router creates a packet digest for every forwarded packet using the packet’s
non-mutable header fields and a short prefix of the payload, and stores it in a Bloom
filter for a predefined time. Upon detection of a malicious attack by an intrusion
detection system, SPIE can be used to trace the packet’s attack path back to the
source by querying SPIE devices along the path.
In many cases, an investigator may not have any header information about a
packet of interest but may know some excerpt of the payload of the packets she
wishes to see. Designing techniques for this problem that achieve significant data
reduction (compared to storing raw packets) is a much greater challenge; the entire packet payload is much larger than the information hashed by SPIE; in addition we need to store information about numerous substrings of the payload to
support queries about excerpts. Shanmugasundaram et al. [2004] introduced the
Hierarchical Bloom Filter (HBF), a compact hash-based payload digesting data
structure, which we describe in Section 3.1. A payload attribution system based
on an HBF is a key module for a distributed system for network forensics called
ForNet [Shanmugasundaram et al. 2003]. The system has a low memory footprint and achieves reasonable processing speed at a low false positive rate. It
monitors network traffic, creates hash-based digests of payload, and archives them
periodically. A user-friendly query mechanism based on XML provides an interface
to answer postmortem questions about network traffic. SPIE and HBF are both
digesting schemes, but while SPIE is a packet digesting scheme, HBF is a payload
digesting scheme. With an HBF module running in ForNet (or a module using
any of our methods presented in this paper), one can query for substrings of the
payload (called excerpts throughout this paper).
Recently, another group suggested an alternative approach to the payload attribution problem, the Rolling Bloom Filter (RBF) [Cho et al. 2006], which uses
packet content fingerprints based on a generalization of the Rabin-Karp string-matching algorithm. Instead of aggregating queries in a hierarchy as an HBF does, they
aggregate query results linearly from multiple Bloom filters. The RBF tries to solve
a problem of finding a correct alignment of blocks in the process of querying an
HBF by considering many possible alignments of blocks at once, i.e., RBF is rolling
a fixed-size window over the packet payload and recording all the window positions
as payload blocks. They report performance similar to the best case performance
of the HBF.
The design of an HBF is well documented in the literature and is currently used in
practice. We created our own implementation of the HBF as an example of a current
payload attribution method and include it in our comparisons in Section 5. The
RBF’s performance is comparable to that of the HBF and experimental results
presented in [Cho et al. 2006] show that RBF achieves low false positive rates only
for small data reduction ratios (about 32:1).
3.
METHODS FOR PAYLOAD ATTRIBUTION
In this section we introduce various data structures for payload attribution. Our
primary goal is to find techniques that give the best data reduction for payload fragments of significant size at reasonable computational cost. When viewed
through this lens, roughly speaking, the technique that we call Winnowing Multi-Hashing (WMH) is the best and substantially outperforms previous methods; a
thorough experimental evaluation is presented in Section 5.
Our exposition of WMH will develop gradually, starting with naive approaches to
the problem, building through previous work (HBF), and introducing a variety of
new techniques. Our purpose is twofold. First, this exposition should develop a solid
intuition for the reader as to the various considerations that were taken into account
in developing WMH. Second, and equally important, there are a variety of lenses
through which one may consider and evaluate the different techniques. For example,
one may want to perform less computation and cannot utilize data aging techniques
and as a result opt for a method such as Winnowing Block Shingling (WBS) which
is more appropriate than WMH under those circumstances. Additionally, some
applications may have specific requirements on the block size and therefore prefer
a different method. By carefully developing and evaluating experimentally the
different methods, we present the reader with a spectrum of possibilities and a
clear understanding of which to use when.
As noted earlier, all of these methods follow the general program of dividing
packet payloads into blocks and inserting them into a Bloom filter. They differ in
how the blocks are chosen, what methods we use to determine which blocks belong
to which payload in which order (“consecutiveness resolution”), and miscellaneous
other techniques used to improve the number of necessary queries and to reduce the
probability of false positives. We first describe the basics of block based payload
attribution and the Hierarchical Bloom Filter [Shanmugasundaram et al. 2004] as
the current state-of-the-art method. We then propose several new methods which
solve multiple problems in the design of the former methods.
A naive method to design a simple payload attribution system is to store the
payload of all packets. In order to decrease the demand for storage capacity and
to provide some privacy guarantees, we can store hashes of payloads instead of the
actual payloads. This approach reduces the amount of data per packet to about
20 bytes (by using SHA-1, for example) at the cost of false positives due to hash
collisions. By storing payloads in a Bloom filter (described in Section 2.1), we can
further reduce the required space. The false positive rate of a Bloom filter depends
on the data reduction ratio it provides. A Bloom filter preserves privacy because
we can only ask whether a particular element was inserted into it, but it cannot
be coerced into revealing the list of elements stored; even if we try to query for
all possible elements, the result will be useless due to false positives. Compared to
storing hashes directly, the advantage of using Bloom filters is not only the space
savings but also the speed of querying. It takes only a short constant time to query
the Bloom filter for any packet.
Inserting the entire payload into a Bloom filter, however, does not support queries for payload excerpts. Instead of inserting the entire payload into
the Bloom filter we can partition it into blocks and insert them individually. This
simple modification can allow queries for excerpts of the payload by checking if all
the blocks of an excerpt are in the Bloom filter. Yet, we need to determine whether
two blocks appeared consecutively in the same payload, or if their presence is just
an artifact of the blocking scheme. The methods presented in this section deal
with this problem by using offset numbers or block overlaps. The simplest data
structure that uses a Bloom filter and partitions payloads into blocks with offsets
is a Block-based Bloom Filter (BBF) [Shanmugasundaram et al. 2004].
Note that, assuming that we do one decomposition of the payload into blocks
during payload processing, starting at the beginning of the packet, we will need to
query the data structure for multiple starting positions of our excerpt in the payload
during excerpt querying phase, as the excerpt need not start at the beginning of a
block. For example, if the payload being partitioned with a block size of 4 bytes
was ABCDEF GHIJ, we would insert blocks ABCD and EF GH into the Bloom
filter (the remainder IJ is not long enough to form a block and is therefore not
processed). Later on, when we query for an excerpt, for example, DEF GHI, we
would partition the excerpt into blocks (with a block size 4 bytes as done previously
on the original payload). This would give us just one block to query the Bloom
filter for, DEF G. However, because we do not know where the excerpt could be located within the payload, we also need to try partitioning the excerpt from starting position offsets 1 and 2, which gives us blocks EF GH and F GHI, respectively. We are then guaranteed that the Bloom filter answers positively for the correct block EF GH; however, we can also get positive answers for blocks DEF G and F GHI due to false positives of the Bloom filter. The payload attribution methods presented
in this section try to limit or completely eliminate (see Section 3.3) this negative
effect. Alternative payload processing schemes, such as [Cho et al. 2006; Gu et al. 2007], partition the payload at all possible starting offsets during the payload processing phase (essentially working on all n-grams of the payload), but this incurs a large processing overhead and multiplies the storage requirements.
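The blocking and the offset-trial querying just described can be sketched as follows. A plain Python set stands in for the Bloom filter so that the alignment logic is visible without Bloom filter false positives; the 4-byte block size is a toy value matching the ABCDEFGHIJ example (real deployments use e.g. 32 or 64 bytes).

```python
BLOCK = 4  # toy block size s in bytes

def blocks(data: bytes, s: int = BLOCK):
    """Fixed-size partition; a trailing remainder shorter than s is dropped."""
    return [data[i:i + s] for i in range(0, len(data) - s + 1, s)]

def process(payload: bytes, store: set) -> None:
    for b in blocks(payload):
        store.add(b)  # stand-in for a Bloom filter insertion

def query(excerpt: bytes, store: set) -> bool:
    # The excerpt may begin anywhere inside a block, so try all s alignments.
    for shift in range(BLOCK):
        candidate = blocks(excerpt[shift:])
        if candidate and all(b in store for b in candidate):
            return True
    return False
```

For the payload ABCDEFGHIJ the store receives ABCD and EFGH, and the query DEFGHI succeeds at shift 1 via block EFGH; with a real Bloom filter, the blocks DEFG and FGHI tried at the other shifts could also answer positively as false positives.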
We also need to set two parameters which determine the time precision of our
answers and the smallest query excerpt size. First, we want to be able to attribute
to each excerpt for which we query the time when the packet containing it appeared
on the network. We solve that by having multiple Bloom filters, one for each time
interval. The duration of each interval depends on the number of blocks inserted
into the Bloom filter. In order to guarantee an upper bound on the false positive
rate, we replace the Bloom filter by a new one and store the previous Bloom filter in
a permanent storage after a certain number of elements are inserted into it. There is
also an upper bound on the maximum length of one interval to limit the roughness
of time determination. Second, we specify the size of blocks. If the chosen block
size is too small, we get too many collisions as there are not enough unique patterns
and the Bloom filter gets filled quickly by many blocks. If the block size is too large,
there is not enough granularity to answer queries for smaller excerpts.
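The per-interval rotation described above can be sketched as below. A plain set again stands in for the Bloom filter, and the names RotatingDigest, n_max, and t_max are illustrative rather than taken from the paper.

```python
import time

class RotatingDigest:
    """Sketch of per-interval digests: the current filter (a set standing in
    for a Bloom filter) is archived once it holds n_max elements or once
    t_max seconds elapse, bounding both the false positive rate and the
    coarseness of time attribution."""

    def __init__(self, n_max: int, t_max: float):
        self.n_max, self.t_max = n_max, t_max
        self.archive = []  # (start_time, end_time, filter) triples
        self._fresh()

    def _fresh(self):
        self.current, self.count, self.start = set(), 0, time.time()

    def insert(self, block: bytes) -> None:
        now = time.time()
        if self.count >= self.n_max or now - self.start >= self.t_max:
            self.archive.append((self.start, now, self.current))
            self._fresh()
        self.current.add(block)
        self.count += 1
```

A query for a given time range would then consult only the archived filters whose (start, end) intervals overlap that range.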
We need to distinguish blocks from different packets to be able to answer who
has sent/received the packet. The BBF as briefly described above is not able to recognize the origins and destinations of packets. In order to work properly as an attribution system over multiple packets, a unique flow identifier (flowID) must be associated with each block before it is inserted into the Bloom filter. A flow identifier
can be the concatenation of source and destination IP addresses, optionally with
source/destination port numbers. We maintain a list (or a more efficient data structure) of flowIDs for each Bloom filter and our data reduction estimates include the
storage required for this list. The connection records (flowIDs) for each Bloom filter
(i.e., a time interval) can be alternatively obtained from other modules monitoring
the network. The need to test all the flowIDs in a list significantly increases the number of queries required for attribution, as the flowID of the packet that contained the query excerpt is not known a priori; this leads to a higher false positive rate and lower overall query performance. Therefore, we may either maintain
two separate Bloom filters to answer queries, one into which we insert blocks only
and one with blocks concatenated with the corresponding flowIDs, or insert both
into one larger Bloom filter. The former allows data aging, i.e., for very old data
we can delete the first Bloom filter and store only the one with flowIDs at the cost
of higher false positive rate and slower querying. Another method to save storage
space by reducing the space occupied by very old data is to take a Bloom filter of size 2b and replace it with a new Bloom filter of size b by computing the logical OR of the two halves of the original Bloom filter. This halves the amount of data and
still allows querying but the false positive rate increases significantly.
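The halving operation amounts to OR-ing the two halves of the bit vector; a hash position p computed for the old filter of m bits is then interpreted as p mod (m/2) when querying the folded filter. A minimal sketch:

```python
def fold(bits: bytearray) -> bytearray:
    """Replace a Bloom filter of 2b bits by one of b bits: OR the two halves.
    Every bit set at position p in the old filter is set at p mod b in the
    new one, so queries still work, at a higher false positive rate."""
    half = len(bits) // 2  # assumes an even number of bytes
    return bytearray(bits[i] | bits[half + i] for i in range(half))
```

The fold can be applied repeatedly to ever-older archives, trading accuracy for storage each time.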
An alternative construction which allows the determination of source/destination
pairs is using separate Bloom filters for each flow. Then instead of using one Bloom
filter and inserting blocks concatenated with flowIDs, we just select a Bloom filter
for the insertion of blocks based on the flowID. Because we cannot anticipate the
number of blocks each flow would contain during a time interval, we use small
Bloom filters, flush them to disk more often and use additional compression (such
as gzip) on the Bloom filters before saving to disk which helps to significantly
reduce storage requirements for very sparse flows. Having multiple small Bloom
filters also has some performance advantages compared to one large Bloom filter
because of caching; the size of a Bloom filter can be selected to fit into a memory
cache block. This technique would most likely use TCP stream reconstruction and would make the packet processing stateful, in contrast to the method using flowIDs. It
may thus be suitable when there is another module in the system, such as intrusion
detection (or prevention) system, which already does the stream reconstruction
and a PAS module can be attached to it. With this technique, the evaluation of the methods would depend heavily on the distribution of payload among the streams. For clarity of explanation, we do not take flowIDs into further consideration in the method descriptions throughout this section.
We have identified several important data structure properties of methods presented in this paper and a summary can be found in Table I. Figure 1 shows a tree
structure representing the evolution of methods. For example, a VHBF method
was derived from HBF by the use of variable-sized blocks. These properties of all
methods are thoroughly explained within the description of a method in which they
appear first. Their impact on performance is discussed in Section 5.
There are many possible combinations of the techniques presented in this paper
and the following list of methods is not a complete enumeration of all combinations. For example,
a method which builds a hierarchy of blocks with winnowing as a boundary selection
technique can be developed. However, the selected subset provides enough details
for a reader to construct and analyze the other alternative methods and as we
have experimented with them we believe the presented subset can accomplish the
goal of selecting the most suitable one, which is, in a general case, the Winnowing
Multi-Hashing technique (Section 3.12).
In all the methods in this section we can extend the answer from the simple
yes/no (meaning that there was or was not a packet containing the specified excerpt in a specified time interval and, if yes, also providing the list of flowIDs of packets that
contained the excerpt) to give additional details about which parts of the excerpt
were found (i.e., blocks) and return, for instance, the longest continuous part of the
excerpt that was found.
3.1
Hierarchical Bloom Filter (HBF)
This subsection describes the (former) state-of-the-art payload attribution technique, called an HBF [Shanmugasundaram et al. 2004], in detail and extends the
description of previous work from Section 2. The following eleven subsections, each showing a new technique, represent our novel contribution.

Table I. Summary of properties of methods from Section 3. We show how each method selects the boundaries of blocks when processing a payload and how it affects the block size, how each method resolves the consecutiveness of blocks, its special characteristics, and finally, whether each method allows false negative and N/A answers to excerpt queries.

Fig. 1. The evolution tree shows the relationship among the presented methods for payload attribution. Arrow captions describe the modifications made to the parent method.

Fig. 2. Processing of a payload consisting of blocks X0 X1 X2 X3 X4 X5 in a Hierarchical Bloom Filter.
An HBF supports queries for excerpts of a payload by dividing the payload of
each packet into a set of blocks of fixed size s bytes (where s is a parameter specified
by the system administrator1 ). The blocks of a payload of a single packet form a
hierarchy (see Figure 2) which is inserted into a Bloom filter with appropriate
offset numbers. Thus, besides inserting all blocks of a payload as in the BBF, we
insert several super-blocks, i.e., blocks created by the concatenation of 2, 4, 8, etc.,
subsequent blocks into the HBF. This produces the same result as having multiple
BBFs with block sizes multiplied by powers of two; a BBF can thus be viewed as the base level of the hierarchy in an HBF.
When processing a payload, we start at the level 0 of the hierarchy by inserting
all blocks of size s bytes. In the next level we double the size of a block and insert
all blocks of size 2s. In the n-th level we insert blocks of size 2^n·s bytes. We continue until the block size exceeds the payload size. The total number of blocks inserted into an HBF for a payload of size p bytes is Σ_l ⌊p/(2^l·s)⌋, where l is the level index s.t. 0 ≤ l ≤ ⌊log2(p/s)⌋. Therefore, an HBF needs about two times as much storage
space compared to a BBF to achieve the same theoretical false positive rate of a
Bloom filter, because the number of elements inserted into the Bloom filter is twice as high. However, for longer excerpts the hierarchy improves the confidence of the
query results because they are assembled from the results for multiple levels.
We use one Bloom filter to store blocks from all levels of the hierarchy to improve
space utilization because the number of blocks inserted into Bloom filters at different
levels depends on the distribution of payload sizes and is therefore dynamic. The
utilization of this single Bloom filter is easy to control by limiting the number of
inserted elements, thus we can limit the (theoretical) false positive rate.
Offset numbers are the sequence numbers of blocks within the payload. Offsets
are appended to block contents before insertion into an HBF: (content||offset),
where 0 ≤ offset ≤ ⌊p/(2^l·s)⌋ − 1, p is the size of the entire payload and l is the
1 The block size used for several years by an HBF-enabled system running in our campus network is 64 or 32 bytes, depending on whether it is deployed on the main gateway or on smaller local gateways. Longer blocks allow higher data reduction ratios but lower the querying capability for smaller excerpts.
Fig. 3. The hierarchy in the HBF does not cover double-blocks at odd offset numbers. In this
example, we assume that two payloads X0 X1 X2 X3 and Y0 Y1 Y2 Y3 were processed by the HBF. If
we query for an excerpt X1 Y2 , we would get a positive answer which represents an offset collision,
because there were two blocks (X1 ||1) and (Y2 ||2) inserted from different payloads but there was
no packet containing X1 Y2 .
level of the hierarchy. Offset numbers are unique within one level of the hierarchy. See
the example given in Fig. 2. We first insert all blocks of size s with the appropriate
offsets: (X0 ||0), (X1 ||1), (X2 ||2), (X3 ||3), (X4 ||4). Then we insert blocks at level 1 of
the hierarchy: (X0 X1 ||0), (X2 X3 ||1). And finally the second level: (X0 X1 X2 X3 ||0).
Note that in Figure 2 blocks X0 to X4 have size s bytes, but since block X5
is smaller than s it does not form a block and its content is not processed. We
analyze the percentage of discarded payload content for each method in Section 5.
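The insertion procedure above can be sketched as follows (a minimal illustration, not the paper's implementation; a plain Python set stands in for the Bloom filter, and flowIDs and hashing details are omitted):

```python
# Sketch of HBF insertion: every level doubles the block size, and each
# block is stored together with its per-level offset, modeling (content||offset).
# A Python set stands in for the Bloom filter; flowIDs are omitted.

def hbf_insert(payload: bytes, s: int, bf: set) -> None:
    size = s
    while size <= len(payload):            # stop once blocks exceed the payload
        for offset in range(len(payload) // size):
            block = payload[offset * size:(offset + 1) * size]
            bf.add((block, offset))        # models (content || offset)
        size *= 2                          # next level: double the block size

bf = set()
# An 11-byte payload with s = 2 yields 5 + 2 + 1 = 8 insertions, matching
# the sum of ⌊p/(2^l s)⌋ over the levels; the trailing byte is discarded.
hbf_insert(b"AABBCCDDEEF", s=2, bf=bf)
```

Note how the final byte `F` never forms a block, mirroring the discarded-content behavior discussed above.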
Offsets do not provide a reliable solution to the problem of detecting whether two
blocks appeared consecutively in the same packet. For example, in a BBF, if we
process two packets made up of blocks X0 X1 X2 X3 and Y0 Y1 Y2 Y3 Y4 , respectively,
and later query for an excerpt X2 Y3 , the BBF will answer that it had seen a packet
with a payload containing such an excerpt. We call this event an offset collision.
This happens because of inserting a block X2 with an offset 2 from the first packet
and a block Y3 with an offset 3 from the second packet into the BBF. When blocks
from different packets are inserted at the appropriate offsets, a BBF can answer as
if they occurred inside a single packet. An HBF reduces the false positive rate due
to offset collisions and due to the inherent false positives of a Bloom filter by adding
supplementary checks when querying for an excerpt composed of multiple blocks.
In this example, an HBF would answer correctly that it did not see such an excerpt
because the check for X2 Y3 in the next level of the hierarchy fails. However, if we
query for an excerpt X1 Y2 , both the HBF and the BBF fail to answer correctly (i.e.,
they answer positively as if there was a packet containing X1 Y2 ). Figure 3 provides an
example of how the hierarchy tries to improve the resistance to offset collisions but
still fails for two-block strings at odd offsets. We discuss offset collisions in an HBF
further in Section 3.8. Because in the actual payload attribution system we insert
blocks along with their flowIDs, collisions are less common, but they can still occur
for payload inside one stream of packets within one time interval as these blocks
have the same flowID and are stored in one Bloom filter.
Querying an HBF for an excerpt x starts with the same procedure as querying
a BBF. First, we have to try all possible offsets at which x could have occurred
inside one packet. We also have to try s possible starting positions of the first
block inside x since the excerpt may not start exactly on a block boundary of the
original payload. To do this, we slide a window of size s starting at each of the
first s positions of x and query the HBF for this window (with all possible starting
offsets). After a match is found for this first block, the query proceeds to try the
next block at the next offset until all blocks of an excerpt at level 0 are matched.
An HBF continues by querying the next level for super-blocks of size twice the size
of blocks in the previous level. Super-blocks start only at blocks from the previous
level which have even offset numbers. We go up in the hierarchy until all queries
for all levels succeed. The answer to an excerpt query is positive only if all answers
from all levels of the hierarchy were positive. The maximum number of queries to
a Bloom filter in an HBF in the worst case is roughly twice the number for a BBF.
3.2 Fixed Block Shingling (FBS)
In a BBF and an HBF we use offsets to determine whether blocks appeared consecutively inside one packet’s payload. This causes a problem when querying for an
excerpt because we do not know where the excerpt starts inside the payload (the
starting offset is unknown). We have to try all possible starting offsets, which not
only slows down the query process, but also increases the false positive rate because
a false positive result may occur for any of these queries.
As an alternative to using offsets we can use block overlapping, which we call
shingling. In this scheme, the payload of a packet is divided into blocks of size s
bytes as in a BBF, but instead of inserting these blocks we insert strings of size
s + o bytes (the block plus a part of the next block) into the Bloom filter. Blocks
overlap as shingles on a roof do (see Figure 4), and the overlapping part makes it
likely that two blocks appeared consecutively if they share a common part
and both of them are in the Bloom filter. For a payload of size p bytes, the number
of elements inserted into the Bloom filter is ⌊(p − o)/s⌋ for a FBS which is close
to ⌊p/s⌋ for a BBF. However, the maximum number of queries to a Bloom filter
in the worst case is about v times smaller than in a BBF, where v is the number
of possible starting offsets. Since the value of v can be estimated as the system’s
maximum transmission unit (MTU) divided by the block size s, this improvement
is significant, which is supported by the experimental results presented in Section 5.
Fig. 4. Processing of a payload with a Fixed Block Shingling (FBS) method (parameters: block
size s = 8, overlap size o = 2).
The goal of the FBS scheme (of using an overlapping part) is to avoid trying
all possible offsets during query processing in an HBF to solve the consecutiveness
problem. However, neither technique (shingling or offsets) is guaranteed
to answer correctly: an HBF can fail because of offset collisions, and a FBS because
multiple blocks can start with the same string, in which case a FBS confuses their
position inside the payload (see Figure 5). Thus both can increase the number of false
positive answers. For example, the FBS will incorrectly answer that it has seen a
string of blocks X0 X1 Y1 Y2 after processing two packets X and Y made of blocks
X0 X1 X2 X3 X4 and Y0 Y1 Y2 , respectively, where X2 has the same prefix (of size at
least o bytes) as Y1 .
Fig. 5. An example of a collision due to a shingling failure. The same prefix prevented a FBS
method from determining whether two blocks appeared consecutively within one payload. The
FBS method incorrectly treats the string of blocks X0 X1 Y1 Y2 as if it was processed inside one
payload.
Querying a Bloom filter in the FBS scheme is similar to querying a BBF except
that we do not use any offsets and therefore do not have to try all possible offset
positions of the first block of an excerpt. Thus when querying for an excerpt x
we slide a window of size s + o bytes starting at each of the first s positions of
x and query the Bloom filter for this window. When a match is found for this
first block, the query can proceed with the next block (including the overlap) until
all blocks of an excerpt are matched. Since these blocks overlap we assume that
they occurred consecutively inside one single payload. The answer to an excerpt
query is considered to be positive only if there exists an alignment (i.e., a position
of the first block’s boundary) for which all tested blocks were found in the Bloom
filter. Figures 6 and 7 show examples of querying in the FBS method. Note that
these examples ignore the need to determine all flowIDs of the excerpts found.
Therefore, even after a match is found for some alignment and flowID, we must
continue checking other alignments and flowIDs because multiple packets in multiple
flows could contain such an excerpt.
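The FBS processing and querying steps can be sketched as follows (an illustrative sketch, not the paper's implementation; a plain Python set stands in for the Bloom filter and flowIDs are ignored):

```python
# Sketch of Fixed Block Shingling: insert each s-byte block extended by an
# o-byte overlap into the next block; query by trying all s alignments of
# the first block boundary. A Python set stands in for the Bloom filter.

def fbs_insert(payload: bytes, s: int, o: int, bf: set) -> None:
    for start in range(0, len(payload) - (s + o) + 1, s):
        bf.add(payload[start:start + s + o])   # block plus overlap (shingle)

def fbs_query(excerpt: bytes, s: int, o: int, bf: set) -> bool:
    for align in range(s):                     # possible first-block alignments
        pos, ok = align, True
        while pos + s + o <= len(excerpt):
            if excerpt[pos:pos + s + o] not in bf:
                ok = False
                break
            pos += s
        if ok and align + s + o <= len(excerpt):  # at least one shingle tested
            return True
    return False

bf = set()
payload = b"abcdefghijklmnopqrstuvwx"
fbs_insert(payload, s=8, o=2, bf=bf)           # inserts ⌊(24-2)/8⌋ = 2 shingles
```

Querying with an excerpt cut from the middle of the payload succeeds at exactly one alignment, while an unseen excerpt fails at all alignments.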
3.3 Variable Block Shingling (VBS)
The use of shingling instead of offsets in a FBS method lets us avoid testing all
possible offset numbers of the first block during querying, but we still have to test
all possible alignments of the first block inside an excerpt (as shown in Fig. 6 and 7).
A Variable Block Shingling (VBS) solves this problem by setting block boundaries
based on the payload itself.
Fig. 6. An example of querying a FBS method (with a block size 8 bytes and an overlap size
2 bytes). Different alignments of the first block of the query excerpt (shown on top) are tested.
When a match is found in the Bloom filter for some alignment of the first block we try subsequent
blocks. In this example all blocks for the alignment starting at the third byte are found and
therefore the query substring (at the bottom) is reported as found. We assume that the FBS
processed the packet in Fig. 4 prior to querying.
Fig. 7. An example of querying a FBS method for an excerpt which is not supposed to be found
(i.e., no packet containing such string has been processed). The query processing starts by testing
the Bloom filter for the presence of the first block of the query excerpt at different alignment
positions. For alignment 2 the first block is found because we assume that the FBS processed
the packet in Fig. 4 prior to executing this query. The second block for this alignment
was also found, due to a false positive answer of the Bloom filter. The third block for this
alignment was not found, and therefore we continue by testing the first block at alignment 3.
As there was no alignment for which all blocks were found, we report that the query excerpt
was not found.
We slide a window of size k bytes through the whole payload and for each position
of the window we compute a value of function H(c1 , . . . , ck ) on the byte values of the
payload. When H(c1 , . . . , ck ) mod m is equal to zero, we insert a block boundary
immediately after the current position of byte ck . Note that we can choose to put
a block boundary before or after any of the bytes ci , 1 ≤ i ≤ k, but this selection
has to be fixed. Note that for use with shingling it is better to put the boundary
after the byte ck such that the overlaps are not restricted to strings having only
special values which satisfy the above condition for boundary insertion (which can
increase shingle collisions). When the function H is random and uniform, the
parameter m sets the expected size of a block. For random payloads we will get a
distribution of block sizes with an average size of m bytes. The drawback of this
variable block size technique is that we can get many very small blocks, which can
flood the Bloom filter, or some large blocks, which prevent us from querying for
smaller excerpts. Therefore, we introduce an enhanced version of this scheme,
EVBS, in the next section.
Fig. 8. Processing of a payload with a Variable Block Shingling (VBS) method.
In order to save computational resources it is convenient to use a function that
can use the computations performed for the previous positions of the window to
calculate a new value as we move from bytes c1 , . . . , ck to c2 , . . . , ck+1 . Rabin
fingerprints (see Section 2.2) have such an iterative property, and we define a fingerprint
F of a substring c1 c2 . . . ck , where ci is the value of the i-th byte of the substring
of a payload, as:
F (c1 , . . . , ck ) = (c1 p^(k−1) + c2 p^(k−2) + · · · + ck ) mod M,    (3)

where p is a fixed prime number and M is a constant. To compute the fingerprint
of the substring c2 . . . ck+1 , we only need to add the last element and remove the
first one:

F (c2 , . . . , ck+1 ) = (p F (c1 , . . . , ck ) + ck+1 − c1 p^k ) mod M.    (4)

Because p and k are fixed, we can precompute the value of p^k. It is also possible
to use Rabin fingerprints as hash functions in the Bloom filter. In our implementation
we use a modified scheme [Broder 1997] to increase randomness without any
additional computational costs:

F (c2 , . . . , ck+1 ) = (p (F (c1 , . . . , ck ) + ck+1 − c1 p^k )) mod M.    (5)
The advantage of picking block boundaries using Rabin functions is that when we
get an excerpt of a payload and divide it into blocks using the same Rabin function
that we used for splitting during the processing of the payload, we will get exactly
the same blocks. Thus, we do not have to try all possible alignments of the first
block of a query excerpt as in previous methods.
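A minimal sketch of this boundary selection, using the iterative update of Eq. (4), is shown below. The constants p, M, k, and m are illustrative choices of ours, not values from the paper, and the final assertion-style check simply demonstrates the position-independence property described above:

```python
# Content-defined block boundaries in the spirit of VBS: slide a k-byte
# window, maintain a Rabin-style fingerprint iteratively, and place a
# boundary immediately after byte c_k whenever F mod m == 0.
import random

P, M = 101, 1 << 20                     # illustrative prime p and modulus M

def boundaries(payload: bytes, k: int, m: int) -> list:
    out, f = [], 0
    pk = pow(P, k, M)                   # precomputed constant for the update
    for i, c in enumerate(payload):
        if i < k:
            f = (f * P + c) % M                          # build first window
        else:
            f = (f * P + c - payload[i - k] * pk) % M    # Eq. (4) update
        if i >= k - 1 and f % m == 0:
            out.append(i + 1)           # boundary immediately after c_k
    return out

# The same content yields the same boundaries regardless of its position,
# so an excerpt splits into the same blocks as the original payload:
rng = random.Random(42)
data = bytes(rng.randrange(256) for _ in range(1000))
full = boundaries(data, k=4, m=16)
part = boundaries(data[100:], k=4, m=16)
```

Every boundary of the excerpt whose k-byte window lies fully inside it coincides (after shifting by 100) with a boundary of the full payload, which is exactly why no alignment search is needed at query time.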
The rest of this method is similar to the FBS scheme, where instead of using
fixed-size blocks we have variable-size blocks depending on the payload. To process
a payload we slide a window of size k bytes through the whole payload. For each
of its positions we check whether the value of F modulo m is zero, and if so we set a
new block boundary. All blocks are inserted with an overlap of o bytes, as shown
in Figure 8.
Querying in a VBS method is the simplest of all methods in this section because
there are no offsets and no alignment problems. Therefore, this method involves
far fewer tests for membership in a Bloom filter. Querying for an excerpt proceeds
in the same way as processing the payload described in the previous paragraph, except
that instead of inserting blocks we query the Bloom filter for them. The answer is
positive only when all blocks are found in the Bloom filter. The maximum number
of queries to a Bloom filter in
the worst case is about v · s times smaller than in a BBF, where v is the number
of possible starting offsets and s is the number of possible alignments of the first
block in a BBF, assuming the average block size in a VBS method to be s.
Fig. 9. Processing of a payload with an Enhanced Variable Block Shingling (EVBS) method.
3.4 Enhanced Variable Block Shingling (EVBS)
The enhanced version of the variable block shingling method tries to solve a problem
with block sizes. A VBS can create many small blocks, which can flood the Bloom
filter and do not provide enough discriminability, or some large blocks, which can
prevent querying for smaller excerpts. In an EVBS we form superblocks composed
of blocks found by a VBS method to achieve better control over the size of blocks.
To be precise, when processing a payload we slide a window of size k bytes
through the entire payload and for each position of the window we compute the
value of the fingerprinting function H(c1 , . . . , ck ) on the byte values of the payload
as in the VBS method. When H(c1 , . . . , ck ) mod m is equal to zero, we insert
a block boundary after the current position of byte ck . We take the resulting
blocks of an expected size m bytes, one by one from the start of the payload,
and form superblocks, i.e., new non-overlapping blocks made of multiple original
blocks, with size at least m′ bytes, where m′ ≥ m. We do this by selecting
some of the original block boundaries to be the boundaries of the new superblocks.
Every boundary that creates a superblock of size greater than or equal to m′ is selected
(see Figure 9, where the minimum superblock size is m′ ). Finally, superblocks with an
overlap to the next superblock of size o bytes are inserted into the Bloom filter.
The maximum number of queries to a Bloom filter in the worst case is about the
same as for a VBS, assuming the average block sizes for the two methods are the
same.
This leads, however, to a problem when querying for an excerpt. If we use the
same fingerprinting function H and parameter m we get the same block boundaries
in the excerpt as in the original payload, but the starting boundary of the first
superblock inside the excerpt is unknown. Therefore, we have to try all boundaries
in the first m′ bytes of an excerpt (or the first boundary that follows, if there is
none within the first m′ bytes) as the first boundary of the first superblock. The
number of possible boundaries we have to try in an EVBS method (approximately
m′ /m) is much smaller than the number of possible alignments (i.e., the block
size s) in an HBF, for usual parameter values.
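The greedy superblock boundary selection described above can be sketched as follows (an illustrative sketch in Python; the input list of VBS boundaries below is hypothetical, not taken from the paper):

```python
# Sketch of EVBS superblock formation: keep only those VBS boundaries
# that close a superblock of at least m' bytes (greedy, left to right).

def superblock_boundaries(vbs_bounds: list, m_prime: int) -> list:
    selected, last = [], 0              # the payload start acts as boundary 0
    for b in sorted(vbs_bounds):
        if b - last >= m_prime:         # boundary closes a >= m' superblock
            selected.append(b)
            last = b
    return selected

# Hypothetical VBS boundaries at byte positions 3, 7, 9, 15, 16, 24:
bounds = superblock_boundaries([3, 7, 9, 15, 16, 24], m_prime=8)
```

With m′ = 8, only the boundaries at positions 9 and 24 survive, and every resulting superblock is at least m′ bytes long.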
Fig. 10. Processing of a payload with a Winnowing Block Shingling (WBS) method. First, we
compute hash values for each payload byte position. Subsequently, boundaries are selected
at the positions of the rightmost maximum hash value inside the winnowing window, which
we slide through the array of hashes. Bytes between consecutive pairs of boundaries form
blocks (plus the overlap).
3.5 Winnowing Block Shingling (WBS)
In a Winnowing Block Shingling method we use the idea of winnowing, described
in Section 2.3, to select boundaries of blocks and shingling to resolve the consecutiveness of blocks. We select the winnowing window size instead of a block size and
we are guaranteed to have at least one boundary in any window of this size inside
the payload. This also sets an upper bound on the block size.
We start by computing hash values for each payload byte position. In our implementation this is done by sliding a window of size k bytes through the whole
payload and for each position of the window we compute the value of a fingerprinting function H(c1 , . . . , ck ) on the byte values of the payload as in the VBS
method. In this way we get an array of hashes, where the i-th element is the
hash of bytes ci , . . . , ci+k−1 , where ci is the i-th byte of the payload of size p, for
i = 1, . . . , (p − k + 1). Then we slide a winnowing window of size w through this
array and for each position of the winnowing window we put a boundary immediately
before the position of the maximum hash value within this window. If multiple
hashes have the maximum value, we choose the rightmost one. Bytes between
consecutive pairs of boundaries form blocks (plus the beginning of size o of the next
block, the overlap) and they are inserted into a Bloom filter. See Figure 10.
When querying for an excerpt we perform the same process, except that we query the
Bloom filter for blocks instead of inserting them. If all blocks are found in the Bloom
filter, the answer to the query is positive. The maximum number of queries to a
Bloom filter in the worst case is about the same as for a VBS, assuming the average
block sizes for the two methods are the same.
There is at least one block boundary in any window of size w. Therefore the
longest possible block size is w + 1 + o bytes. This also guarantees that there are
always at least two boundaries to form a block in an excerpt of size at least 2w + o
bytes.
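The winnowing boundary selection can be sketched as follows (an illustrative sketch operating on a precomputed array of hashes; the function name and the sample hash values are ours, not the paper's):

```python
# Sketch of winnowing for WBS: slide a window of w hashes and mark the
# position of the rightmost maximum in each window as a block boundary.
# Overlapping windows often pick the same position, so far fewer than
# one boundary per byte is kept, yet every window contributes one.

def winnow_boundaries(hashes: list, w: int) -> list:
    picked = set()
    for start in range(len(hashes) - w + 1):
        window = hashes[start:start + w]
        m = max(window)
        # index of the rightmost occurrence of the maximum in this window
        picked.add(start + (w - 1) - window[::-1].index(m))
    return sorted(picked)

bounds = winnow_boundaries([3, 1, 4, 1, 5, 9, 2, 6], w=4)
```

Consecutive selected positions are never more than w apart, which is the guarantee used above to bound the maximum block size.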
3.6 Variable Hierarchical Bloom Filter (VHBF)
Querying an HBF involves trying s possible starting positions of the first block
in an excerpt. In a VHBF method we avoid this by splitting the payload into
variable-sized blocks (see Figure 11) determined by a fingerprinting function as in
Section 3.3 (VBS). Building the hierarchy, the insertion of blocks, and querying are
the same as in the original Hierarchical Bloom Filter; only the block boundary
definition has changed. Reducing the number of queries by a factor of s (the block
size in an HBF) helps reduce the resulting false positive rate of this method.
Notice that even if we added overlaps between blocks (i.e., used shingling), we
would still need to use offsets, because they determine whether to check the next
level of the hierarchy during the query phase: we check the next level only for
even offset numbers.
Fig. 11. Processing of a payload with a Variable Hierarchical Bloom Filter (VHBF) method.
3.7 Fixed Doubles (FD)
The method of fixed doubles is designed to address a shortcoming of the hierarchy in
an HBF. The hierarchy in an HBF is not complete in the sense that we do not insert all
double-blocks (blocks of size 2s), all quadruple-blocks (4s), and so on, into the
Bloom filter. For example, when inserting a packet consisting of blocks S0 S1 S2 S3 S4
into an HBF, we insert blocks S0 ,. . . , S4 , S0 S1 , S2 S3 , and S0 S1 S2 S3 . If we
query for an excerpt of size 2s (or up to size 4s − 2 bytes, see Figure 3), for example
S1 S2 , this block of size 2s is not found in the HBF (nor are any other double-blocks at
odd offsets), and in this case the false positive rate is worse than that of a BBF,
because an HBF needs about two times more space than a BBF to achieve
the same false positive rate of the Bloom filter. The same is true for other levels of
the hierarchy. In fact, the probability that this event happens rises exponentially
with the level number. As an alternative approach to the hierarchy, we insert all
double-blocks as shown in Figure 12, but do not continue to the next level, so as not
to increase the storage requirements. Note that this method is not identical to a FBS
scheme with an overlap of size s, because in a FD we insert all single blocks and
also all double-blocks.
In this method we use neither shingling nor offsets, because the consecutiveness
problem is solved by the level of double-blocks: each double-block overlaps by half
of its size with the previous one and by the other half with the next one.
Fig. 12. Processing of a payload with a Fixed Doubles (FD) method.
The query mechanism works as follows: We first find the correct alignment of
the first block of an excerpt by trying to query the Bloom filter for all windows of
size s starting at positions 0 through s − 1. Note that we can get multiple positive
answers and in that case we continue the process independently for all of them.
Then we split the excerpt into blocks of size s starting at the position found and
query for each of them. Finally, we query for all double-blocks, and if all answers
are positive we claim that the excerpt was found.
The FD scheme inserts 2⌊p/s⌋ − 1 blocks into the Bloom filter for a payload of
size p bytes, which is approximately the same as for an HBF and about two times
more than for a FBS scheme. The maximum number of queries to a Bloom filter
in the worst case is about two times the number for a FBS method.
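A sketch of FD processing and querying (an illustrative sketch only; a plain Python set again stands in for the Bloom filter, and excerpts shorter than two blocks are not handled):

```python
# Sketch of Fixed Doubles: insert every s-byte block and every
# double-block (two adjacent blocks); query by finding an alignment
# for which all single blocks and all double-blocks are present.

def fd_insert(payload: bytes, s: int, bf: set) -> None:
    blocks = [payload[i:i + s] for i in range(0, len(payload) - s + 1, s)]
    for b in blocks:
        bf.add(b)                                  # all single blocks
    for a, b in zip(blocks, blocks[1:]):
        bf.add(a + b)                              # all double-blocks

def fd_query(excerpt: bytes, s: int, bf: set) -> bool:
    for align in range(s):                         # find block alignment
        blocks = [excerpt[i:i + s]
                  for i in range(align, len(excerpt) - s + 1, s)]
        # This sketch requires at least two full blocks in the excerpt.
        if len(blocks) >= 2 and all(b in bf for b in blocks) and \
                all(a + b in bf for a, b in zip(blocks, blocks[1:])):
            return True
    return False

bf = set()
fd_insert(b"AAAABBBBCCCCDDDD", s=4, bf=bf)         # 2*(16/4) - 1 = 7 entries
```

A query for two adjacent blocks succeeds, while two non-adjacent blocks fail precisely because their double-block was never inserted.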
3.8 Variable Doubles (VD)
This method is similar to the previous one (FD) but the block boundaries are
determined by a fingerprinting function as in a VBS (Section 3.3). Hence, we do
not have to try finding the correct alignment of the first block of an excerpt when
querying, and the blocks have variable size. Both the number of blocks inserted
into the Bloom filter and the maximum number of queries in the worst case are
approximately the same as for the FD scheme. An example is given in Figure 13.
As in the FD method, we use neither shingling nor offsets, because the
consecutiveness problem is solved by the level of double-blocks, which overlap
with each other and with the single blocks.
During querying we simply divide the query excerpt into blocks by a fingerprinting method and query the Bloom filter for all blocks and for all double-blocks.
Finally, if all answers are positive we claim that the excerpt is found.
3.9 Enhanced Variable Doubles (EVD)
The Enhanced Variable Doubles method uses the technique from Section 3.4 (EVBS)
to create an extension of a VD method by forming superblocks of a payload. Then
these superblocks are treated the same way as blocks in a VD method. Thus, we
insert all superblocks and all doubles of these superblocks into the Bloom filter as
Fig. 13. Processing of a payload with a Variable Doubles (VD) method.
shown in Figure 14. The number of blocks inserted into the Bloom filter as well
as the maximum number of queries in the worst case is similar to that of the VD
method (assuming similar average block sizes of both schemes).
Fig. 14. Processing of a payload with an Enhanced Variable Doubles (EVD) method.
3.10 Multi-Hashing (MH)
The technique of VBS can be strengthened by using multiple independent VBS
methods, as this provides greater flexibility in the choice of parameters, such as the
expected block size. We call this technique Multi-Hashing; it uses t independent
fingerprinting methods (or fingerprinting functions with different parameters) to set
block boundaries, as shown in Figure 15. It is equivalent to using t independent
Variable Block Shingling methods, and the answer to excerpt queries is positive
only if all t methods answer positively.
Note that even if we set the overlapping part, i.e., the parameter o, to zero for all
instances of the VBS, we would still get a guarantee that the excerpt has appeared
on the network as one continuous fragment.
Moreover, by using expected block sizes of the instances as multiples of powers
of two we can generate a hierarchical structure with the MH method.
The expected number of blocks inserted into the Bloom filter for a payload of
size p bytes is Σ_{i=1}^{t} ⌊p/mi ⌋, where mi is the expected block size for the i-th VBS.
Fig. 15. Processing of a payload with a Multi-Hashing (MH) method. In this case, the MH uses
two independent instances of the Variable Block Shingling method simultaneously to process the
payload.
3.11 Enhanced Multi-Hashing (EMH)
The enhanced version of Multi-Hashing uses multiple instances of EVBS to increase the certainty of answers to excerpt queries. Blocks inserted by independent
instances of EVBS are different and overlap with each other, and therefore improve
the robustness of this method. Aspects other than the superblock formation are
the same as for the MH method. In our experiments in Section 5 we use two
independent instances of EVBS with identical parameters and store the data for both
in one Bloom filter.
3.12 Winnowing Multi-Hashing (WMH)
The WMH method uses multiple instances of WBS (Section 3.5) to reduce the
probability of false positives for excerpt queries. The WMH gives not only excellent
control over the block sizes due to winnowing (see Figure 18(c)) but also provides
much greater confidence about the consecutiveness of the blocks inside the query
excerpt because of overlaps both inside each instance of WBS and among the blocks
of multiple instances. Both querying and payload processing are done for all t WBS
instances and the final answer to an excerpt query is positive only if all t answers
are positive.
In our experiments in Section 5 we use two instances of WBS with identical
winnowing window size and store data from both methods in one Bloom filter. By
storing data of each instance in a separate Bloom filter we can allow data aging to
save space by keeping only some of the Bloom filters for very old data at the cost
of higher false positive rates.
For an example of processing payload and querying in a WMH method see the
multi-packet case in Fig. 16 and 17.
4. PAYLOAD ATTRIBUTION SYSTEMS (PAS)
A payload attribution system performs two separate tasks: payload processing
and query processing. In payload processing, the payload of all traffic that passed
through the network where the PAS is deployed is examined and some information is
saved into permanent storage. This has to be done at line speed and the underlying
raw packet capture component can also perform some filtering of the packets, for
example, choosing to process only HTTP traffic.
Data is stored in archive units, each of which has two timestamps (the start and
end of the time interval during which the data was collected). For each time interval
we also need to save all flowIDs (e.g., pairs of source and destination IP addresses)
to allow querying later on. This information can alternatively be obtained from
connection records collected by firewalls, intrusion detection systems, or other log
files.
During query processing, given the excerpt and a time interval of interest, we
retrieve all the corresponding archive units from storage. We query each unit
for the excerpt, and if we get a positive answer we successively query for each
of the flowIDs appended to the blocks of the excerpt and report all matches to the
user.
4.1 Attacks on PAS
As with any security system, there are ways an adversary can evade proper
attribution. We identify the following types of attacks on a PAS (mostly similar to those
in [Shanmugasundaram et al. 2004]):
4.1.1 Compression & Encryption. If the payload is compressed or encrypted, a
PAS can support queries only for the exact compressed or encrypted form.
4.1.2 Fragmentation. An attacker can transform a stream of data into a sequence
of packets with payload sizes much smaller than the (average) block size used
in the PAS. Methods with variable block sizes, where block boundaries depend on
the payload, are harder to defeat, but for very small fragments, e.g., 6 bytes each,
the system will not be able to perform the attribution correctly. A solution is to make
the PAS stateful so that it concatenates the payloads of one data stream prior to
processing. However, such a solution would impose additional memory and computational
costs, and there are known attacks on stateful IDS systems [Handley et al. 2001],
such as incorrect fragmentation and timing attacks.
4.1.3 Boundary Selection Hacks. For methods with block boundaries that depend
on the payload, an attacker can try to send special packets whose payload produces
too many or no boundaries. The PAS can use different parameters for the boundary
selection algorithm in each archive unit, making it infeasible for an attacker to fool
the system. Moreover, winnowing guarantees at least one boundary in each
winnowing window.
4.1.4 Hash Collisions. Hash collisions are very unpredictable and therefore hard
for an attacker to exploit, because we use a different salt for the hash computation
in each Bloom filter.
4.1.5 Stuffing. An attacker can inject characters into the payload that are
ignored by applications but change the payload structure at the network layer. Our
methods are robust against stuffing because the attacker has to modify most of the
payload to avoid correct attribution, as we can match even very small excerpts of
payload.
4.1.6 Resource Exhaustion. Flooding attacks can impair a PAS. However, our
methods are more robust to these attacks than raw packet loggers due to the data
reduction they provide. Moreover, processing identical payloads repeatedly does
not impact the precision of attribution because insertion into a Bloom filter is
an idempotent operation. On the other hand, the list of flowIDs is vulnerable to
flooding, for example, when a worm tries to propagate out of the network by trying
many random destination addresses.
4.1.7 Spoofing. Source IP addresses can be spoofed, and a PAS is primarily
concerned with attributing payload according to what packets have been delivered
by the network. The scope of possible spoofing depends on the deployment of the
system and the filtering applied in the affected networks.
4.2 Multi-packet queries
The methods described in Section 3 show how to query for excerpts inside one
packet’s payload. Nevertheless, we can extend the querying mechanism to handle
strings that span multiple packets.
For methods which use offsets, if the next block is not found at its sequential offset
number, querying has to continue with a zero offset instead and try all alignments
of that block as well, because the fragmentation into packets could leave a part
smaller than the block size at the end of the first packet. This is very inefficient
and increases the false positive rate. Moreover, for methods that form a hierarchy
of blocks, the hierarchy cannot be fully utilized. The payload attribution system
can perform TCP stream reconstruction and work on the reconstructed flow to
fix this.
Methods using shingling, on the other hand, can be extended without any further
changes if we return, as the answer to the query, the full sequence of blocks
found (see the WMH example in Figs. 16 and 17).
Fig. 16. Processing of payloads of two packets with a Winnowing Multi-Hashing (WMH) method
where both packets are processed by two independent instances of the Winnowing Block Shingling
(WBS) method simultaneously.
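As a toy illustration of this property, the following sketch uses fixed block boundaries instead of winnowing and a plain set instead of a Bloom filter; `OV`, `STEP`, and all strings are hypothetical values chosen for readability, not parameters of the actual system:

```python
# Consecutive blocks overlap by OV bytes, so a matched chain of blocks
# simply continues from the blocks of one packet into the blocks of the
# next; no offset numbers are needed to prove consecutiveness.
OV = 2      # overlap size in bytes (assumption)
STEP = 6    # distance between boundaries (winnowing would vary this)

def blocks(payload: str) -> set:
    """Cut the payload every STEP bytes and extend each block OV bytes
    to the right, so neighboring blocks share OV bytes."""
    return {payload[i:i + STEP + OV] for i in range(0, len(payload), STEP)}

# Two packets processed independently, as in Fig. 16.
stored = blocks("the quick brown fox ") | blocks("jumps over the lazy dog")

def query(excerpt: str) -> list:
    """Return the chain of stored full-size blocks found in the excerpt."""
    size = STEP + OV
    return [excerpt[i:i + size] for i in range(len(excerpt) - size + 1)
            if excerpt[i:i + size] in stored]

# An excerpt spanning both packets matches blocks from each packet;
# the overlap across the packet boundary itself is missing (the gap).
assert query("brown fox jumps over") == ["own fox ", "jumps ov"]
```

Because blocks never cross a packet boundary, the matched chain has a small gap at that boundary, exactly as described for WMH above.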
4.3 Privacy and Simple Access Control
Processing and archiving payload information must comply with the privacy and
security policies of the network where they are performed. Furthermore, the
ability to use the payload attribution system should be granted only to properly
authorized parties, and all necessary precautions must be taken to minimize the
possibility of a system compromise.
Fig. 17. Querying for an excerpt spanning multiple packets in a Winnowing Multi-Hashing (WMH)
method comprised of two instances of WBS. We assume the WMH method processed the packets in
Fig. 16 prior to querying. In this case, WMH can easily query for an excerpt spanning
two packets, and the blocks found overlap significantly, which increases the confidence of the
query result. However, a small gap remains between the two parts because WMH works on
individual packets (unless we perform TCP stream reconstruction).

The privacy stems from using a Bloom filter to hold the data: it is only possible
to query the Bloom filter for specific packet content; it cannot be forced to
provide a list of the packet data stored inside. Simple access control (i.e., restricting the
ability to query the Bloom filter) can be easily achieved as follows. Our methods
allow the collected data to be stored and queried by an untrusted party without
disclosing any payload information nor giving the query engine any knowledge of
the contents of queries. We achieve this by adding a secret salt when computing
hashes for insertion and querying the Bloom filter. A different salt is used for each
Bloom filter and serves the purpose of a secret key. We can also easily achieve much
finer granularity of access control by using different keys for different protocols or
subranges of the IP address space. Without the key the Bloom filter cannot be
queried; the key does not have to be made available to the querier, since only
the indices of the bits we want to test are disclosed, so a third party without
the key cannot query the Bloom filter. However, additional measures must be taken
to enforce that the third party provides correct answers and does not alter the
archived data. Also note that this kind of access control is not cryptographically
secure and some information leakage can occur. On the other hand, there is no
additional computational or storage cost associated with using it and also no need
for decrypting the data before querying as is common with standard techniques. A
detailed analysis of privacy achieved by using Bloom filters can be found in [Bellovin
and Cheswick 2004].
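The salting scheme can be sketched as follows; the filter size `M`, hash count `K`, and all names are illustrative assumptions rather than the parameters of a deployed system. The point is that the untrusted store only ever sees bit indices:

```python
import hashlib

# A sketch of salted-hash access control: the salt acts as the secret
# key for one Bloom filter, and neither payload nor salt ever reaches
# the untrusted store.
M = 1 << 18   # Bloom filter size in bits (assumption)
K = 4         # number of hash functions (assumption)

def bit_indices(block: bytes, salt: bytes) -> list:
    """Salted hash positions for one block.  Only these indices are
    disclosed to the untrusted store."""
    return [int.from_bytes(hashlib.sha256(salt + bytes([i]) + block)
                           .digest()[:8], "big") % M
            for i in range(K)]

class UntrustedBloomStore:
    """Holds only a bit array; it cannot enumerate the stored blocks."""
    def __init__(self):
        self.bits = bytearray(M // 8)

    def set_bits(self, idxs):
        for i in idxs:
            self.bits[i // 8] |= 1 << (i % 8)

    def test_bits(self, idxs):
        return all((self.bits[i // 8] >> (i % 8)) & 1 for i in idxs)

salt = b"per-filter-secret"   # a different salt for each Bloom filter
store = UntrustedBloomStore()
store.set_bits(bit_indices(b"some payload block", salt))
assert store.test_bits(bit_indices(b"some payload block", salt))
```

A querier without the salt cannot produce valid indices, so, barring the information leakage noted above, the archived bit array is useless to it.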
4.4 Compression
In addition to the inherent data reduction our attribution methods provide
through the use of Bloom filters, our experiments show that we can achieve a
further storage saving of about 20 percent by compressing the archived data, for
example with gzip, after careful optimization of the Bloom filter parameters
[Mitzenmacher 2002]. The results presented in the next section do not include
this additional compression.
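The effect of parameter choice on compressibility can be illustrated with synthetic bit arrays standing in for real Bloom filters; this is a sketch, and the fill ratios below are arbitrary examples:

```python
import gzip
import random

# A sparsely filled bit array has low entropy and compresses well,
# while one filled close to the classic ~50% optimum barely does --
# which is why the parameters must be tuned before relying on
# compression of the archived filters.
def random_bit_array(m_bits: int, set_count: int, seed: int = 7) -> bytes:
    rnd = random.Random(seed)
    bits = bytearray(m_bits // 8)
    for _ in range(set_count):           # set bits at random positions
        i = rnd.randrange(m_bits)
        bits[i // 8] |= 1 << (i % 8)
    return bytes(bits)

M = 8 * 32 * 1024                        # a 32 kB bit array
sparse = random_bit_array(M, M // 10)    # roughly 10% of bits set
dense = random_bit_array(M, M // 2)      # close to 40% of bits set

# The sparse array compresses substantially more than the dense one.
assert len(gzip.compress(sparse)) < len(gzip.compress(dense))
```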
5. EXPERIMENTAL RESULTS
In this section we show performance measurements of payload attribution methods
described in Section 3 and discuss the results from various perspectives. For this
purpose we collected a network trace of 4 GB of HTTP traffic from our campus
network. For performance evaluation throughout this section we consider processing
a 3.1 MB segment (about 5000 packets) of the trace as one unit trace collected
during one time interval. As discussed earlier, we store all network traffic information in
one Bloom filter in memory and we save it to a permanent storage in predefined
time intervals or when it becomes full. A new Bloom filter is used for each time
interval. The time interval should be short because it determines the time precision
at which we can attribute packets, i.e., we can determine only in which time interval
a packet appeared on the network, not the exact time. The results presented do
not depend on the size of the unit because we use a data reduction ratio to set the
Bloom filter size (e.g., 100:1 means a Bloom filter of size 31 kB). Each method uses
a single Bloom filter of the same size to store all data. Our results did not show any
deviations depending on the selection of the segment within the trace. All methods
were tested to select the best combination of parameters for each of them. Results
are grouped into subsections by different points of interest.
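For reference, the standard Bloom filter formulas behind such sizing decisions can be sketched as follows; the element count `n` is a hypothetical value for illustration, not a measurement from our traces:

```python
import math

# Standard Bloom filter arithmetic: m bits, n inserted elements,
# k hash functions.
def bloom_fp_rate(m_bits: int, n_elements: int, k_hashes: int) -> float:
    """Expected false positive rate of a standard Bloom filter."""
    return (1.0 - math.exp(-k_hashes * n_elements / m_bits)) ** k_hashes

def optimal_k(m_bits: int, n_elements: int) -> int:
    """Number of hash functions that minimizes the false positive rate."""
    return max(1, round((m_bits / n_elements) * math.log(2)))

m = 31 * 1024 * 8   # a 31 kB Bloom filter in bits (100:1 on a 3.1 MB unit)
n = 20_000          # hypothetical number of inserted blocks
k = optimal_k(m, n)           # 9 hash functions for this m/n ratio
fp = bloom_fp_rate(m, n, k)   # roughly 0.002 for these parameters
```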
5.1 Performance Metrics
To compare payload attribution methods we consider several aspects which are not
completely independent. The first and most important aspect is the amount of
storage space a method needs to allow querying with a false positive rate bounded
by a pre-defined value. We provide a detailed comparison and analysis in the
following subsections. Second, the methods differ in the number of elements they
insert into a Bloom filter when processing packets and also in the number of queries
to a Bloom filter performed when querying for an excerpt in the worst case (i.e.,
when the answer is negative). A summary can be found in Table II. Methods which
use shingling and a variable block size achieve a significant decrease in the
number of queries they have to perform to analyze each excerpt. This is important
not only for computational performance but also for the resulting false positive
rate, as each query to the Bloom filter risks a false positive answer. The
boundary selection techniques these methods use are computationally efficient and
can be performed in a single pass through the payload. The implementation can be
highly optimized for a particular platform, and parts of the processing can also
be offloaded to special hardware. Our implementation running on a Linux-based
commodity PC (with a kernel modified for fast packet capturing [NTOP 2008]) can
smoothly handle 200 Mbps, and the processing can easily be split among multiple
machines (e.g., by having each machine process packets according to a hash value
of the packet header).
5.2 Block Size Distribution
The graphs in Figure 18 show the distributions of block sizes for three different
methods of block boundary selection. We use a block (or winnowing window) size
parameter of 32 bytes, a small block size for an EVBS of 8 bytes, and an overlap of
4 bytes. Both VBS and EVBS show a distribution with an exponential decrease in
the number of blocks with an increasing block size, shifted by the overlap size for a
VBS or the block size plus the overlap size for an EVBS. Long tails were cropped
for clarity and the longest block was 1029 bytes long.
On the other hand, the winnowing method results in a fairly uniform distribution
in which the block sizes are bounded by the winnowing window size plus the
overlap. The apparent peaks at the smallest block size in graphs 18(a) and 18(c)
are caused by low-entropy payloads, such as long runs of zeros. The distributions
of block sizes obtained by processing random payloads, generated with the same
payload sizes as in the original trace, show the same distributions just without
these peaks. Nevertheless, the huge number of small blocks does not significantly
affect the attribution because inserting a block into the Bloom filter is an
idempotent operation.

Table II. Comparison of payload attribution methods from Section 3 based on the
number of elements inserted into a Bloom filter when processing one packet of a
fixed size and the number of blocks tested for presence in a Bloom filter when
querying for an excerpt of a fixed size in the worst case (i.e., when the answer is
negative). Note that the values are approximations and we assume all methods
have the same average block size. The variables refer to: n: the number of blocks
inserted by a BBF, taken as a base; s: the size of a block in fixed block size
methods (BBF, HBF, FBS, FD); v: the number of possible offset numbers; p: the
number of alignments tested for enhanced methods (EVBS, EVD, EMH). Note that the
actual number of bits tested or set in a Bloom filter depends on the number of hash
functions used for each method and therefore this table presents numbers of blocks.
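The winnowing boundary selection used by the WBS and WMH methods can be sketched as follows; this is a simplified sketch in which a cryptographic hash of each k-gram stands in for the Rabin rolling fingerprint, and `w` and `k` are example parameters:

```python
import hashlib

def winnow_boundaries(payload: bytes, w: int = 8, k: int = 4):
    """Select block boundaries by winnowing: hash every k-gram, then
    keep the position of the minimum hash in each window of w
    consecutive hashes (rightmost minimum on ties).  Whenever at least
    one full window exists, this guarantees a boundary in every run of
    w consecutive hash positions."""
    if len(payload) < k:
        return []
    # Hash of every k-gram (a stand-in for the Rabin rolling hash).
    hashes = [
        int.from_bytes(hashlib.sha1(payload[i:i + k]).digest()[:4], "big")
        for i in range(len(payload) - k + 1)
    ]
    boundaries = set()
    for start in range(len(hashes) - w + 1):
        window = hashes[start:start + w]
        m = min(window)
        # rightmost minimal hash in the window
        pos = start + max(i for i, h in enumerate(window) if h == m)
        boundaries.add(pos)
    return sorted(boundaries)
```

The guarantee of at least one boundary per window is what bounds the block size and eliminates N/A answers for sufficiently long excerpts.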
5.3 Unprocessed Payload
Some fraction of each packet’s payload is not processed by the attribution mechanisms presented in Section 3. Table III(b) shows how each boundary selection
method affects the percentage of unprocessed payload. For methods with a fixed
block size the part of a payload between the last block’s boundary and the end of
the payload is ignored by the payload attribution system. With the (enhanced)
Rabin fingerprinting and winnowing methods, the part from the beginning of the
payload to the first block boundary and the part between the last block
boundary and the end of the payload are not processed. The enhanced version
of Rabin fingerprinting achieves much better results because the small block size,
which was four times smaller than the superblock size in our test, applies when
selecting the first block boundary in a payload.
Winnowing performs better than other methods with a variable block size in
terms of unprocessed payload. Note also that a WMH method, even though it
uses winnowing for block boundary selection as well as a WBS does, has about
t times smaller percentage of unprocessed payload than a WBS because each of
the t instances of a WBS within the WMH covers a different part of the payload
independently. Moreover, the “inner” part of the payload is covered t times which
makes the method much more resistant to collisions because t collisions have to
occur at the same time to produce a false positive answer.
For large packets the small percentage of unprocessed payload does not pose a
problem; for very small packets, however, e.g., only 6 bytes long, it means that
they are possibly not processed at all. Therefore we can optionally insert the
entire payload of a packet, in addition to inserting all blocks, and add a special
query type to the system to support queries for exact packets. This will increase
the storage requirements only slightly because we would insert one additional
element per packet into the Bloom filter.

Fig. 18. The distributions of block sizes for three different methods of block
boundary selection after processing 100000 packets of HTTP traffic are shown:
(a) VBS (the inner graph shows the exponential decrease in log scale), (b) EVBS,
(c) WBS.

Table III. (a) False positive rates for data reduction ratio 130:1 for a WMH
method with a winnowing window size of 64 bytes and therefore an average block
size of about 32 bytes. The table summarizes answers to 10000 queries for each
query excerpt size. All 10000 answers should be NO; YES answers are due to false
positives inherent to a Bloom filter. WMH guarantees no N/A answers for these
excerpt sizes. (b) The percentage of unprocessed payload of 50000 packets
depending on the block boundary selection method used. Details are provided in
Sections 5.2 and 5.3.

Fig. 19. The number of correct answers to 10000 excerpt queries for a varying
length of the query excerpt for each method (with block size or winnowing window
size of 64 bytes) and data reduction ratio 100:1. This reduction ratio can be
further improved to about 120:1 by using a compressed Bloom filter. The WMH
method has no false positives for excerpt sizes of 250 bytes and longer. The
previous state-of-the-art method HBF does not provide any useful answers at this
high data reduction ratio for excerpts shorter than about 400 bytes.
5.4 Query Answers
Table IV. Measurements of the false positive rate for data reduction ratio 50:1.
The table summarizes answers to 10000 excerpt queries using all methods (with
block size 32 bytes) described in Section 3 for various query excerpt lengths
(top row). These queries were performed after processing a real packet trace,
and all methods use a Bloom filter of the same size (50 times smaller than the
trace size). All 10000 answers should be NO since these excerpts were not present
in the trace. YES answers are due to false positives in the Bloom filter; N/A
answers mean that no boundaries were selected inside the excerpt to form a block
for which we can query.

To measure and compare the performance of the attribution methods, and in
particular to analyze the false positive rate, we processed the trace with each
method and queried for random strings which included a small 8-byte excerpt in
the middle that did not occur in the trace. In this way we made sure that these
query strings did
not represent payloads inserted into the Bloom filters. Every method has to answer
either YES, NO, or not available (N/A) to each query. A YES answer means
a match was found for the entire query string for at least one of the flowIDs and
represents a false positive. A NO answer is the correct answer for the query, and
a N/A answer is returned if the blocking mechanism specific to the method did
not select even one block inside the query excerpt. The N/A answer can occur,
for example, when the query excerpt is smaller than the block size in an HBF, or
when there was one or no boundary selected in the excerpt by a VBS method and
so there was no block to query for.
In Table I we summarize the possibility of getting N/A and false negative answers
for each method. A false negative answer can occur when we query for an excerpt
which has size greater than the block size but none of the alignments of blocks inside
the excerpt fit the alignment that was used to process the payload which contained
the excerpt. For example, if we processed a payload ABCDEFGHI by an HBF with
block size 4 bytes, we would have blocks ABCD, EFGH, ABCDEFGH, and if we
queried for an excerpt BCDEFG, the HBF would answer NO. Note that
false negatives can occur only for excerpts smaller than twice the block size and
only for methods which involve testing the alignment of blocks.
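This false-negative example can be checked mechanically; the sketch below stores raw blocks in a set instead of Bloom filter bits, which is enough to exhibit the alignment problem:

```python
# The payload "ABCDEFGHI" processed with block size 4 yields blocks
# "ABCD", "EFGH" and the double-block "ABCDEFGH"; the excerpt "BCDEFG"
# contains no inserted block at any alignment, so the answer is NO.
BLOCK = 4

def fixed_blocks(payload: str) -> set:
    blocks = {payload[i:i + BLOCK]
              for i in range(0, len(payload) - BLOCK + 1, BLOCK)}
    blocks |= {payload[i:i + 2 * BLOCK]
               for i in range(0, len(payload) - 2 * BLOCK + 1, BLOCK)}
    return blocks

def query(excerpt: str, stored: set) -> bool:
    # try every alignment of a block inside the excerpt
    return any(excerpt[a:a + BLOCK] in stored
               for a in range(len(excerpt) - BLOCK + 1))

stored = fixed_blocks("ABCDEFGHI")
assert stored == {"ABCD", "EFGH", "ABCDEFGH"}
assert query("BCDEFG", stored) is False   # the false negative
assert query("ABCDE", stored) is True     # an aligned excerpt matches
```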
Table IV provides detailed results of 10000 excerpt queries for all methods with
the same storage capacity and a data reduction ratio of 50:1. The WMH method
achieves the best results among the listed methods for all excerpt sizes, and for
excerpts longer than 200 bytes it has no false positives. The WMH also excels in
the number of N/A answers among the methods with a variable block size because it
guarantees at least one block boundary in each winnowing window. The results
also show that methods with a variable block size are in general better than
methods with a fixed block size because there are no problems with finding the right
alignment. The enhanced version of Rabin fingerprinting for block boundary
selection does not perform better than the original version, mostly because we
need to try all alignments of blocks inside superblocks when querying, which
increases the false positive rate. The enhanced methods therefore improve block
size control but not necessarily the false positive rate.
The graph in Fig. 19 shows the number of correct answers to 10000 excerpt
queries as a function of the length of a query excerpt for each method with block
size parameter set to 64 bytes and data reduction ratio 100:1. The WMH method
outperforms all other methods for all excerpt lengths. On the other hand, the
HBF’s results are the worst because it can fully utilize the hierarchy only for
long excerpts, and it has a very high false positive rate at high data reduction
ratios due to the use of offsets, problems with block alignments, and the large
number of elements inserted into the Bloom filter. As can be observed when comparing
HBF’s results to the ones of a FD method, using double-blocks instead of building
a hierarchy is a significant improvement and a FD, for excerpt sizes of 220 bytes
and longer, performs even better than a variable block size version of the hierarchy,
a VHBF. It is interesting to see that an HBF outperforms a VHBF for excerpts
of length 400 bytes. For excerpts longer than 400 bytes (not shown) they perform
about the same, and both quickly achieve no false positives. The results
of methods in this graph are clearly separated into two groups by performance;
the curves representing the methods which use a variable block size and do not
use offset numbers or superblocks (i.e., VBS, VD, WBS, MH, WMH) have a concave
shape and in general perform better. For very long excerpts all methods provide
highly reliable results.
The Winnowing Multi-Hashing method achieves the best overall performance in all
our tests and allows querying for very small excerpts because the average block
size is approximately half of the winnowing window size (plus the overlap size).
The average block size was 18.9 bytes for a winnowing window size of 32 bytes and
an overlap of 4 bytes. Table III(a) shows the false positive rates for WMH for a data
reduction ratio 130:1. This data reduction ratio means that the total size of the
processed payload was 130 times the size of the Bloom filter which is archived
to allow querying. The Bloom filter could be additionally compressed to achieve
a final compression of about 158:1, but note that the parameters of the Bloom
filter (i.e., the relation among the number of hash functions used, the number of
elements inserted and the size of the Bloom filter) have to be set in a different
way [Mitzenmacher 2002]. We have to decide in advance whether we want to use
the additional compression of the Bloom filter and, if so, optimize the parameters
for it; otherwise the Bloom filter data would have very high entropy and be hard to
compress. The compression is possible because the approximate representation of
a set by a standard Bloom filter does not reach the information-theoretic lower
bound [Broder and Mitzenmacher 2002].
The winnowing block shingling (WBS) method performs almost as well as WMH
(see Fig. 19) and requires t times less computation. However, the confidence of
its results is lower than with WMH, because multi-hashing covers the majority of
the query excerpt multiple times. If storage savings are needed, a data-aging
method can be used to downgrade WMH data to a simple WBS later.
6. CONCLUSION
In this paper, we presented novel methods for payload attribution. When incorporated into a network forensics system they provide an efficient probabilistic query
mechanism to answer queries for excerpts of a payload that passed through the
network. Our methods allow data reduction ratios greater than 100:1 while having a very low false positive rate. They allow queries for very small excerpts of
a payload and also for excerpts that span multiple packets. The experimental results show that our methods represent a significant improvement in query accuracy
and storage space requirements compared to previous attribution techniques. More
specifically, we found that winnowing is the best technique for block boundary
selection in payload attribution applications; that shingling as a method for
consecutiveness resolution is a clear win over the use of offset numbers; and
that the use of multiple instances of payload attribution methods can provide
additional benefits, such as improved false positive rates and a data-aging
capability. These techniques combined form a payload attribution method called
Winnowing Multi-Hashing which substantially outperforms previous methods. The
experimental results also testify that in general the accuracy of attribution increases
with the length and the specificity of a query. Moreover, privacy and simple access
control is achieved by the use of Bloom filters and one-way hashing with a secret
key. Thus, even if the system is compromised no raw traffic data is ever exposed
and querying the system is possible only with the knowledge of the secret key. We
believe that these methods also have a much broader range of applicability in
various areas where large amounts of data are being processed.
REFERENCES
Anderson, E. and Arlitt, M. 2006. Full Packet Capture and Offline Analysis on 1 and 10 Gb
Networks. Technical Report HPL-2006-156.
Bellovin, S. and Cheswick, W. 2004. Privacy-enhanced searches using encrypted Bloom filters.
Cryptology ePrint Archive, Report 2004/022. Available at http://eprint.iacr.org/.
Bloom, B. 1970. Space/time tradeoffs in hash coding with allowable errors. In Communications
of the ACM (CACM). 422–426.
Broder, A. 1993. Some applications of Rabin’s fingerprinting method. In Sequences II: Methods
in Communications, Security, and Computer Science. Springer-Verlag, 143–152.
Broder, A. 1997. On the resemblance and containment of documents. In Proceedings of the
Compression and Complexity of Sequences.
Broder, A. and Mitzenmacher, M. 2002. Network Applications of Bloom Filters: A Survey. In Annual Allerton Conference on Communication, Control, and Computing. Urbana-Champaign, Illinois, USA.
Cho, C. Y., Lee, S. Y., Tan, C. P., and Tan, Y. T. 2006. Network forensics on packet fingerprints. In 21st IFIP Information Security Conference (SEC 2006). Karlstad, Sweden.
Garfinkel, S. 2002. Network forensics: Tapping the internet. O’Reilly Network.
Gu, G., Porras, P., Yegneswaran, V., Fong, M., and Lee, W. 2007. BotHunter: Detecting
Malware Infection Through IDS-Driven Dialog Correlation. In Proceedings of the 16th USENIX
Security Symposium. 167–182.
Handley, M., Kreibich, C., and Paxson, V. 2001. Network Intrusion Detection: Evasion, Traffic
Normalization, and End-to-End Protocol Semantics. In Proceedings of the USENIX Security
Symposium. Washington, USA.
King, N. and Weiss, E. 2002. Network Forensics Analysis Tools (NFATs) reveal insecurities, turn sysadmins into system detectives. Information Security. Available at www.infosecuritymag.com/2002/feb/cover.shtml.
Manber, U. 1994. Finding similar files in a large file system. In Proceedings of the USENIX
Winter 1994 Technical Conference. San Francisco, CA, USA, 1–10.
Mitzenmacher, M. 2002. Compressed Bloom Filters. IEEE/ACM Transactions on Networking
(TON) 10, 5, 604–612.
NTOP. 2008. PF_RING Linux kernel patch. Available at http://www.ntop.org/PF_RING.html.
Ponec, M., Giura, P., Brönnimann, H., and Wein, J. 2007. Highly Efficient Techniques for
Network Forensics. In Proceedings of the 14th ACM Conference on Computer and Communications Security. Alexandria, Virginia, USA, 150–160.
Rabin, M. O. 1981. Fingerprinting by random polynomials. Technical report 15-81, Harvard
University.
Rhea, S., Liang, K., and Brewer, E. 2003. Value-based web caching. In Proceedings of the
Twelfth International World Wide Web Conference.
Richardson, R. and Peters, S. 2007. 2007 CSI Computer Crime and Security Survey Shows
Average Cyber-Losses Jumping After Five-Year Decline. CSI Press Release. Available at
http://www.gocsi.com/press/20070913.jhtml.
Schleimer, S., Wilkerson, D. S., and Aiken, A. 2003. Winnowing: local algorithms for document fingerprinting. In SIGMOD ’03: Proceedings of the 2003 ACM SIGMOD international
conference on Management of data. ACM Press, New York, NY, USA, 76–85.
Shanmugasundaram, K., Brönnimann, H., and Memon, N. 2004. Payload Attribution via
Hierarchical Bloom Filters. In Proc. of ACM CCS.
Shanmugasundaram, K., Memon, N., Savant, A., and Brönnimann, H. 2003. ForNet: A Distributed Forensics Network. In Proc. of MMM-ACNS Workshop. 1–16.
Snoeren, A. C., Partridge, C., Sanchez, L. A., Jones, C. E., Tchakountio, F., Kent, S. T.,
and Strayer, W. T. 2001. Hash-based IP traceback. In ACM SIGCOMM. San Diego, California, USA.
Staniford-Chen, S. and Heberlein, L. 1995. Holding intruders accountable on the internet. In
Proceedings of the 1995 IEEE Symposium on Security and Privacy. Oakland.
Received February 2008; revised July 2008; accepted November 2008