BotGrep: Finding P2P Bots with
Structured Graph Analysis
Shishir Nagaraja, Prateek Mittal, Chi-Yao Hong,
Matthew Caesar, Nikita Borisov (UIUC)
USENIX Security Symposium 2010
Graph Theory: Cut
G = (V, E)
A cut C = (S, T) is a partition of the vertex set V of a graph G = (V, E)
Example: C = ({1, 2, 3, 4, 5}, {6, 7})
The cut-set of a cut is the set of edges whose end points are in different subsets of the partition
Example: {(5, 6), (1, 7)}
The size of a cut is the number of edges in the cut-set (here, 2)
small cut, min. cut…
[Figure: example graph with nodes 1 to 7]
http://en.wikipedia.org/wiki/Cut_%28graph_theory%29
2010/9/14
Speaker: Li-Ming Chen
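The cut example can be checked with a few lines of Python. This is a minimal sketch: the edge list `EDGES` is an assumption, read off the non-zero entries of the transition matrix P on the random-walk slide, and `cut_set` is a hypothetical helper name.

```python
# Edge list of the 7-node example graph (assumption: read off the
# non-zero entries of the transition matrix P on the random-walk slide).
EDGES = [(1, 2), (1, 3), (1, 4), (1, 5), (1, 7), (2, 3),
         (2, 5), (3, 4), (3, 5), (4, 5), (5, 6), (6, 7)]


def cut_set(edges, s, t):
    """Return the edges with one end point in s and the other in t."""
    s, t = set(s), set(t)
    return [(u, v) for u, v in edges
            if (u in s and v in t) or (u in t and v in s)]


crossing = cut_set(EDGES, {1, 2, 3, 4, 5}, {6, 7})
cut_size = len(crossing)  # size of the cut = number of edges in the cut-set
print(sorted(crossing), cut_size)
```

Under that edge list, the cut-set contains the same two edges as on the slide, (5, 6) and (1, 7), so the cut has size 2.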
Graph Theory: Random Walk (RW)
RW: a trajectory that consists of taking successive random steps on a graph
E.g., the path traced by a molecule as it travels in a liquid
RWs are usually assumed to be Markov processes
Example (starts at node 1):
$q_0 = [1, 0, 0, 0, 0, 0, 0]$
Transition matrix (row = from node, column = to node, nodes 1 to 7):
$$P = \begin{bmatrix}
0 & 1/5 & 1/5 & 1/5 & 1/5 & 0 & 1/5 \\
1/3 & 0 & 1/3 & 0 & 1/3 & 0 & 0 \\
1/4 & 1/4 & 0 & 1/4 & 1/4 & 0 & 0 \\
1/3 & 0 & 1/3 & 0 & 1/3 & 0 & 0 \\
1/5 & 1/5 & 1/5 & 1/5 & 0 & 1/5 & 0 \\
0 & 0 & 0 & 0 & 1/2 & 0 & 1/2 \\
1/2 & 0 & 0 & 0 & 0 & 1/2 & 0
\end{bmatrix}$$
Prob. after 1 RW step: $q_1 = q_0 P = [0, 1/5, 1/5, 1/5, 1/5, 0, 1/5]$
Prob. after 2 RW steps: $q_2 = q_1 P = [0.323, 0.090, 0.173, 0.090, 0.183, 0.140, 0]$
[Figure: example graph with nodes 1 to 7]
http://en.wikipedia.org/wiki/Random_walk
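The two random-walk steps can be reproduced in exact arithmetic. A minimal sketch, assuming the edge list below (read off the non-zero entries of P on this slide); `transition_matrix` and `step` are hypothetical helper names.

```python
from fractions import Fraction

# Assumption: edges read off the non-zero entries of P on this slide.
EDGES = [(1, 2), (1, 3), (1, 4), (1, 5), (1, 7), (2, 3),
         (2, 5), (3, 4), (3, 5), (4, 5), (5, 6), (6, 7)]
N = 7


def transition_matrix(edges, n):
    """Row-stochastic P with P[i][j] = 1 / deg(i) when i and j are adjacent."""
    adj = [[0] * n for _ in range(n)]
    for u, v in edges:
        adj[u - 1][v - 1] = adj[v - 1][u - 1] = 1
    return [[Fraction(a, sum(row)) for a in row] for row in adj]


def step(q, P):
    """One random-walk step: q_t = q_{t-1} P (row vector times matrix)."""
    n = len(q)
    return [sum(q[i] * P[i][j] for i in range(n)) for j in range(n)]


P = transition_matrix(EDGES, N)
q0 = [Fraction(1)] + [Fraction(0)] * (N - 1)  # the walk starts at node 1
q1 = step(q0, P)
q2 = step(q1, P)
print([str(x) for x in q1])
print([round(float(x), 3) for x in q2])
```

Using `Fraction` keeps the entries as exact rationals, so they can be compared directly with the 1/5, 1/3, ... values on the slide before rounding.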
Graph Theory: Stationary Distribution
Example 1:
$q_0 = [1, 0, 0, 0, 0, 0, 0]$
$q_{26} = [0.2083, 0.125, 0.1667, 0.125, 0.2083, 0.0833, 0.0833]$ (remains steady)
Example 2:
$q_0 = [1/7, 1/7, 1/7, 1/7, 1/7, 1/7, 1/7]$
$q_{11} = [0.2083, 0.125, 0.1667, 0.125, 0.2083, 0.0833, 0.0833]$ (remains steady)
[Figure: example graph with nodes 1 to 7]
A stationary distribution π is a vector, whose entries are nonnegative and sum to 1, that satisfies $\pi = \pi P$
Markov chain mixing time:
How large must t be until the time-t distribution (qt) is approximately π?
(to converge to stationary Dist.)
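The stationary distribution and a simple notion of mixing time can be sketched as follows. Assumptions: the edge list is read off the matrix P on the random-walk slide, closeness is measured by total variation distance, and the tolerance `eps=1e-4` is a hypothetical choice (the slide only says "approximately").

```python
# Assumption: edges read off the transition matrix P on the random-walk slide.
EDGES = [(1, 2), (1, 3), (1, 4), (1, 5), (1, 7), (2, 3),
         (2, 5), (3, 4), (3, 5), (4, 5), (5, 6), (6, 7)]
N = 7


def transition_matrix(edges, n):
    """Row-stochastic P with P[i][j] = 1 / deg(i) when i and j are adjacent."""
    adj = [[0] * n for _ in range(n)]
    for u, v in edges:
        adj[u - 1][v - 1] = adj[v - 1][u - 1] = 1
    return [[a / sum(row) for a in row] for row in adj]


def step(q, P):
    """One random-walk step: q_t = q_{t-1} P."""
    n = len(q)
    return [sum(q[i] * P[i][j] for i in range(n)) for j in range(n)]


def total_variation(p, q):
    """Total variation distance between two distributions."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def mixing_time(q0, P, pi, eps, max_steps=10_000):
    """Smallest t such that q_t is within eps (total variation) of pi."""
    q = list(q0)
    for t in range(max_steps + 1):
        if total_variation(q, pi) < eps:
            return t
        q = step(q, P)
    raise RuntimeError("walk did not mix within max_steps")


P = transition_matrix(EDGES, N)
degree = [sum(1 for e in EDGES if node in e) for node in range(1, N + 1)]
pi = [d / sum(degree) for d in degree]  # degree / (2 |E|)

t_point = mixing_time([1.0] + [0.0] * (N - 1), P, pi, eps=1e-4)
t_uniform = mixing_time([1.0 / N] * N, P, pi, eps=1e-4)
print([round(x, 4) for x in pi], t_point, t_uniform)
```

For a simple random walk on a connected undirected graph, the stationary distribution is proportional to the node degrees, which is why `pi` is computed directly from `degree` and matches the steady vector in both examples.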
http://en.wikipedia.org/wiki/Markov_chain
Outline
Problem Definition
System Architecture
Approach:
Prefiltering Step
Clustering P2P Nodes
Validation
(*) Privacy Preserving Graph Algorithms
Results & Discussion
Conclusion & My Comments
What is a Botnet?
Bots: compromised hosts, “Zombies”
Botnets: networks of bots that are under the control of a human operator (botmaster)
(generally looks like) Worm + C&C channel
Worm:
Propagation (vulnerabilities, file sharing, P2P, …)
Attack (DoS, spamming, phishing site, …)
Command and Control Channel:
Disseminate the botmasters’ commands to their bot armies
Communication (IRC, HTTP, … (can be encrypted))
Botnet Structure Change!!
Centralized structure → P2P, why?
Growing size of botnets
Development of mechanisms that detect traditional centralized C&C servers
P2P communication is more efficient and robust
Try to evade detection
[Figure: (traditional) centralized structure vs. P2P structure]
Question:
Whether ISPs can detect P2P botnets and use this as a basis for botnet defense.
Problem & Proposed Solution
Problem:
ISPs have significant visibility into the Comm. patterns
But, how to separate botnet traffic from background Internet traffic?
Proposed approach: BotGrep
An algorithm that isolates P2P Comm. structure
Only based on the information about which pairs of nodes communicate with one another
Input: a communication graph
Still works when only a partial view of the Comm. graph is available
Can support “privacy preserving collaboration”
Challenges
Background traffic volume is huge
Background traffic is highly variable and
continuously changing
Botnet traffic blends in with the regular traffic of the
legitimate users
botnet is tightly integrated and can NOT be separated
from the rest of the nodes by a small cut
Collaboration among ISPs: scaling issues
Collaboration among ISPs: privacy issues
Outline
Problem Definition
System Architecture
Approach:
Prefiltering Step
Clustering P2P Nodes
Validation
(*) Privacy Preserving Graph Algorithms
Results & Discussion
Conclusion & My Comments
BotGrep Architecture
Data source 1:
Combining observations across different network monitors into a single Comm. graph
Data source 2:
Borrow misuse detection to distinguish P2P bots from other P2P applications
(speeds up botnet identification)
Outputs:
A set of suspect hosts (and links)
Inference System
As mentioned, the botnet graph is embedded within a background Comm. graph
One common feature of structured P2P graphs:
Fast mixing time (∵ highly structured)
BotGrep exploits this feature by performing random walks to identify fast-mixing component(s) and isolate them from the rest of the Comm. graph
Outline
Problem Definition
System Architecture
Approach:
Prefiltering Step
Clustering P2P Nodes
Validation
(*) Privacy Preserving Graph Algorithms
Results & Discussion
Conclusion & My Comments
Problem Formulation
Given a Comm. graph: $G = (V, E)$
(note: no clear time period is specified)
Assume a P2P graph $G_p$ is embedded: $G_p \subset G$
The remaining subgraph contains the non-P2P Comm. edges: $G_n = G - G_p$
Goal:
Partition the input G into {Gp, Gn} in the presence of dynamic background traffic and with only partial visibility
Approach Overview (BotGrep)
Idea:
Perform random walks, and compare the relative mixing rates of subgraphs
3 steps:
(1) Pre-filtering (actually is k-means clustering)
Extract a small set of candidate P2P nodes (+ FP)
(2) Clustering P2P Nodes (sampling)
Apply modified SybilInfer Algo. to remove FP
(3) Validation
Validate step (2) based on fast-mixing characteristic
Step (1): Pre-filtering
Idea:
For short random walks, the state Prob. associated with nodes in
the fast-mixing subgraph is likely to be closer to the “stationary
distribution” than nodes in the slow-mixing subgraph
Input: G = (V, E)
Short RW, t = log(|V|)
init: starting state Prob. Dist. of the RW
Score $s_i$ of node i: based on $q_i / d_i$ (∵ the stationary Dist. is proportional to node degrees), with a dampening constant r to reduce the influence of high-degree nodes
k-means
Goal: the sum of squares J from the points to their assigned cluster centers $c_j$ is minimized (J is the cluster score)
?? the sum should run over $s_i \in$ cluster $j$
[Figure: example graph with nodes 1 to 7]
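The pre-filtering step can be sketched in plain Python. Several details here are assumptions, because the slide's formulas did not survive: the walk starts from the uniform distribution, the score is taken as s_i = (q_i / d_i)^(1/r), and the k-means routine is a simple 1-D Lloyd iteration with evenly spaced initial centers. The edge list is the 7-node example graph, read off the matrix P on the random-walk slide.

```python
import math

# Assumption: 7-node example graph read off the random-walk slide.
EDGES = [(1, 2), (1, 3), (1, 4), (1, 5), (1, 7), (2, 3),
         (2, 5), (3, 4), (3, 5), (4, 5), (5, 6), (6, 7)]


def short_walk_scores(edges, n, r=100.0):
    """Degree-normalized, dampened state probabilities after a short walk."""
    adj = [[] for _ in range(n)]
    for u, v in edges:
        adj[u - 1].append(v - 1)
        adj[v - 1].append(u - 1)
    q = [1.0 / n] * n                        # assumption: uniform start
    steps = max(1, math.ceil(math.log(n)))   # short walk, t = log(|V|)
    for _ in range(steps):
        nxt = [0.0] * n
        for i, nbrs in enumerate(adj):
            for j in nbrs:
                nxt[j] += q[i] / len(nbrs)
        q = nxt
    # Assumption: s_i = (q_i / d_i) ** (1 / r); requires every degree > 0.
    return [(q[i] / len(adj[i])) ** (1.0 / r) for i in range(n)]


def assign(values, centers):
    """Index of the nearest center for every value."""
    return [min(range(len(centers)), key=lambda j: (v - centers[j]) ** 2)
            for v in values]


def kmeans_1d(values, k, iters=50):
    """Plain 1-D k-means; returns labels, centers and the score J."""
    lo, hi = min(values), max(values)
    centers = [lo + (hi - lo) * (j + 0.5) / k for j in range(k)]
    for _ in range(iters):
        labels = assign(values, centers)
        for j in range(k):
            members = [v for v, lab in zip(values, labels) if lab == j]
            if members:
                centers[j] = sum(members) / len(members)
    labels = assign(values, centers)
    J = sum((v - centers[lab]) ** 2 for v, lab in zip(values, labels))
    return labels, centers, J


scores = short_walk_scores(EDGES, 7)
labels, centers, J = kmeans_1d(scores, k=2)
print(labels, J)
```

On a graph this small the clusters carry little meaning; the sketch only shows how the pieces named on the slide (short walk, q_i/d_i, dampening constant r, k-means score J) could fit together.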
Step (2): Clustering P2P Nodes
Step (1)’s output: {G1, G2, …, Gk}
Perform the “modified SybilInfer Algo.” on each subgraph to remove weakly connected nodes (FP)
Concept of the modified SybilInfer Algo. (3 steps):
Get “traces” T
A trace represents a related vertex-pair obtained by using RW*
Use sampling to get P2P nodes
Assume a cut X0 consists of P2P nodes, $X_0 \subset V$
Check whether a candidate X’ is better than X0 according to the probability $P(X \mid T)$
If better, X’ replaces X0; else X0 is retained
Repeat this over several runs
Get {X0, X1, …, XN}, $X_i \sim P(X \mid T)$; decide P[node i is P2P] = ?
G. Danezis and P. Mittal, “SybilInfer: Detecting Sybil Nodes using Social Networks,” in Proc. NDSS, 2009.
Step (2): Clustering P2P Nodes (cont’d)
Modified SybilInfer Algo.:
Step (1) Generation of traces:
Use a modified transition matrix P’ (ensures that the “stationary Dist.” of the RW is uniform over all vertices)
$q_t = q_{t-1} P'$
Perform a number n of RWs, starting at each node, each of length t = log(|V|)
The traces T are the set of starting and ending vertex-pairs of each RW
(we are interested in these pairs traversed by RW)
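Trace generation can be sketched as below. Assumptions: the move probability from i to a neighbor j is min(1/d_i, 1/d_j), which is consistent with the entries of P’ printed on the “My Observation” slide; any leftover probability keeps the walk at its current node (not stated on the slides); and the edge list is the 7-node example graph. `generate_traces` and `modified_walk_end` are hypothetical names.

```python
import math
import random

# Assumption: 7-node example graph read off the random-walk slide.
EDGES = [(1, 2), (1, 3), (1, 4), (1, 5), (1, 7), (2, 3),
         (2, 5), (3, 4), (3, 5), (4, 5), (5, 6), (6, 7)]


def modified_walk_end(adj, start, length, rng):
    """End vertex of one walk using P'_ij = min(1/d_i, 1/d_j).

    Assumption: the probability mass not assigned to any neighbor
    keeps the walk at the current node (self-loop).
    """
    node = start
    for _ in range(length):
        u = rng.random()
        acc = 0.0
        nxt = node                      # default: stay put
        for nb in adj[node]:
            acc += min(1.0 / len(adj[node]), 1.0 / len(adj[nb]))
            if u < acc:
                nxt = nb
                break
        node = nxt
    return node


def generate_traces(edges, n, walks_per_node, length, seed=0):
    """Traces T: (start, end) vertex pairs, walks_per_node walks per node."""
    rng = random.Random(seed)
    adj = {v: [] for v in range(1, n + 1)}
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    return [(s, modified_walk_end(adj, s, length, rng))
            for s in adj for _ in range(walks_per_node)]


t = math.ceil(math.log(7))              # walk length t = log(|V|)
traces = generate_traces(EDGES, 7, walks_per_node=3, length=t, seed=1)
print(traces[:5])
```

Only the start and end vertices are stored, matching the slide's definition of a trace; the intermediate vertices of each walk are discarded.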
(My Observation)
Traces T
Set of vertex-pairs
The end points reflect the connectivity of the graph
High-degree nodes may walk to low-degree nodes (~ equal prob. to each of their neighbors)
RW may be trapped by low-degree nodes (if they are connected)
P’ is a symmetric matrix (row = from node, column = to node, nodes 1 to 7):
$$P' = \begin{bmatrix}
0 & 1/5 & 1/5 & 1/5 & 1/5 & 0 & 1/5 \\
1/5 & 0 & 1/4 & 0 & 1/5 & 0 & 0 \\
1/5 & 1/4 & 0 & 1/4 & 1/5 & 0 & 0 \\
1/5 & 0 & 1/4 & 0 & 1/5 & 0 & 0 \\
1/5 & 1/5 & 1/5 & 1/5 & 0 & 1/5 & 0 \\
0 & 0 & 0 & 0 & 1/5 & 0 & 1/2 \\
1/5 & 0 & 0 & 0 & 0 & 1/2 & 0
\end{bmatrix}$$
RW will not converge by using P’ !!
[Figure: example graph with nodes 1 to 7]
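The matrix P’ as printed has a zero diagonal, so several of its rows sum to less than 1. The sketch below rebuilds it from the example graph and then, as an assumption that is not stated on the slide, assigns the leftover probability of each row to a self-loop (the usual Metropolis-Hastings convention). Under that assumption the matrix is symmetric with unit row sums, so the uniform distribution is stationary.

```python
from fractions import Fraction

# Assumption: 7-node example graph read off the random-walk slide.
EDGES = [(1, 2), (1, 3), (1, 4), (1, 5), (1, 7), (2, 3),
         (2, 5), (3, 4), (3, 5), (4, 5), (5, 6), (6, 7)]
N = 7


def modified_matrix(edges, n, self_loops):
    """P'_ij = min(1/d_i, 1/d_j) on edges; optional self-loop remainder."""
    deg = [0] * n
    for u, v in edges:
        deg[u - 1] += 1
        deg[v - 1] += 1
    M = [[Fraction(0)] * n for _ in range(n)]
    for u, v in edges:
        w = min(Fraction(1, deg[u - 1]), Fraction(1, deg[v - 1]))
        M[u - 1][v - 1] = M[v - 1][u - 1] = w
    if self_loops:
        for i in range(n):
            M[i][i] = 1 - sum(M[i])   # assumption: leftover mass stays put
    return M


def step(q, P):
    """One random-walk step: q_t = q_{t-1} P."""
    n = len(q)
    return [sum(q[i] * P[i][j] for i in range(n)) for j in range(n)]


P_slide = modified_matrix(EDGES, N, self_loops=False)
P_loop = modified_matrix(EDGES, N, self_loops=True)
uniform = [Fraction(1, N)] * N

print([str(sum(row)) for row in P_slide])   # row sums of the printed matrix
print(step(uniform, P_loop) == uniform)     # uniform is stationary with loops
```

Whether the slide's matrix was meant to include such self-loops cannot be recovered from the text; the sketch only shows what changes if they are added.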
Step (2): Clustering P2P Nodes (cont’d)
Modified SybilInfer Algo.:
Step (2) A (Bayesian) Prob. model for P2P nodes:
Given the set of traces T, compute the Prob. that any set of nodes X are all P2P nodes, P(X | T)
[Equation: Bayesian expression for P(X | T); annotations: goal, can be acquired, fixed]
Assign a uniform prob. to all walks ending in the set X
[Equation: terms annotated “trace ends in vertex v in X” and “trace ends in vertex a in X”, using the number of RWs ending in vertex a (or v)]
Step (2): Clustering P2P Nodes (cont’d)
Modified SybilInfer Algo.:
Step (3) Metropolis-Hastings Sampling:
Enumerating all subsets X of the graph is impossible
Instead, sample configurations Xi following this distribution: $X_i \sim P(X \mid T)$
Given a set of samples S = {X0, X1, …, XN}, we can compute the marginal Prob. of node i being a P2P node as the fraction of samples that contain node i
Threshold:
if P[node i is P2P] > 0.5 (node i exists in more than half of the samples), then P2P, else non-P2P (FP).
Next step: validate the P2P group!
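The sampling and thresholding idea can be sketched as follows. This is not the SybilInfer probability model: `toy_log_prob` is a hypothetical stand-in for log P(X | T) that simply favors sets close to a fixed target, and the proposal (flip one random node in or out of X) is an assumed, symmetric choice. Only the marginal computation and the 0.5 threshold come from the slide.

```python
import math
import random


def metropolis_hastings(nodes, log_prob, n_samples, seed=0):
    """Sample node sets X_i approximately following exp(log_prob(X))."""
    rng = random.Random(seed)
    nodes = list(nodes)
    current = set(nodes[: len(nodes) // 2])        # arbitrary initial X0
    current_lp = log_prob(current)
    samples = []
    for _ in range(n_samples):
        proposal = current ^ {rng.choice(nodes)}   # flip one node in or out
        proposal_lp = log_prob(proposal)
        accept = math.exp(min(0.0, proposal_lp - current_lp))
        if rng.random() < accept:                  # better sets always accepted
            current, current_lp = proposal, proposal_lp
        samples.append(frozenset(current))
    return samples


def marginals(samples, nodes):
    """P[node i is P2P] = fraction of samples that contain node i."""
    return {v: sum(v in x for x in samples) / len(samples) for v in nodes}


def classify(samples, nodes, threshold=0.5):
    """Nodes that appear in more than `threshold` of the samples."""
    probs = marginals(samples, nodes)
    return {v for v in nodes if probs[v] > threshold}


TARGET = {1, 2, 3, 4, 5}                           # hypothetical "true" P2P set


def toy_log_prob(x):
    """Hypothetical stand-in for log P(X | T): penalize distance to TARGET."""
    return -3.0 * len(set(x) ^ TARGET)


samples = metropolis_hastings(range(1, 8), toy_log_prob, n_samples=2000)
print(sorted(classify(samples, range(1, 8))))
```

Because the proposal is symmetric, the acceptance ratio reduces to the ratio of the two probabilities, which mirrors the slide's rule that a better X’ replaces X0.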
Step (3): Validation
SybilInfer only partitions a graph into two subgraphs
We need to use multiple iterations to get to the desired fastest mixing subgraph
This requires a validation test
If the cut passes all 3 validation tests below, then we are done:
(1) Graph conductance test
(2) q(t) entropy comparison test
(3) Degree-homogeneity test
Step (3): Validation (cont’d)
(1) Graph conductance test
P2P network is fast mixing → no small cut → graph conductance should be high
(2) q(t) entropy comparison test
RWs on structured homogeneous P2P graphs are characterized by a high-entropy state Prob. Dist., with $q_i^{(t)}$ close to $1/n$
KL divergence measure should be close to 0
(3) Degree-homogeneity test
To rule out star topology!
Measure the dispersion of degree values → should be homogeneous
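The three quantities behind the validation tests can be sketched with standard definitions. These are assumptions about the measures, since the slide gives no formulas or thresholds: conductance is taken for a given cut S as crossing edges over the smaller side's volume, the entropy comparison is the KL divergence of q from the uniform distribution, and degree dispersion is the coefficient of variation of the degrees.

```python
import math

# Assumption: 7-node example graph read off the random-walk slide.
EDGES = [(1, 2), (1, 3), (1, 4), (1, 5), (1, 7), (2, 3),
         (2, 5), (3, 4), (3, 5), (4, 5), (5, 6), (6, 7)]


def conductance(edges, s):
    """Crossing edges divided by the smaller volume (sum of degrees).

    Assumes both sides of the cut contain at least one edge end point.
    """
    s = set(s)
    crossing = sum((u in s) != (v in s) for u, v in edges)
    vol_s = sum((u in s) + (v in s) for u, v in edges)
    vol_rest = 2 * len(edges) - vol_s
    return crossing / min(vol_s, vol_rest)


def kl_from_uniform(q):
    """KL divergence D(q || uniform); 0 when q_i = 1/n for every node."""
    n = len(q)
    return sum(p * math.log(p * n) for p in q if p > 0)


def degree_dispersion(edges):
    """Coefficient of variation of node degrees (0 = fully homogeneous)."""
    deg = {}
    for u, v in edges:
        deg[u] = deg.get(u, 0) + 1
        deg[v] = deg.get(v, 0) + 1
    vals = list(deg.values())
    mean = sum(vals) / len(vals)
    var = sum((d - mean) ** 2 for d in vals) / len(vals)
    return math.sqrt(var) / mean


star = [(0, 1), (0, 2), (0, 3), (0, 4)]
ring = [(0, 1), (1, 2), (2, 3), (3, 0)]
print(conductance(EDGES, {6, 7}))
print(kl_from_uniform([1 / 7] * 7), kl_from_uniform([1.0] + [0.0] * 6))
print(degree_dispersion(star), degree_dispersion(ring))
```

The star and ring graphs illustrate the third test: the ring has identical degrees and a dispersion of zero, while the star's hub makes the dispersion clearly positive.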
Outline
Problem Definition
System Architecture
Approach:
Prefiltering Step
Clustering P2P Nodes
Validation
(*) Privacy Preserving Graph Algorithms (ignored)
Results & Discussion
Conclusion & My Comments
Dataset (Graphs)
Background traffic communication graph:
Constructed from 1 day of real-world traffic trace:
(1) Abilene’s NetFlow trace (2009/10/22) (104,426 nodes)
(2) CAIDA packet-level trace (2009/1/11) (3,839,936 nodes)
Botnet graph:
Synthetically add links between randomly selected “bots” in the background traffic
For sensitivity test, the structure of the botnet graph includes:
(1) de Bruijn, (2) Chord, (3) Kademlia, (4) LEET-Chord
Take the combined graph as the algorithm input
An Algorithm Example
Background traffic communication graph:
GD: Abilene’s trace
Botnet graph:
Gp: de Bruijn structure
Randomly select 10,000 nodes from GD
Parameters: m=10 (outgoing links), n=4 (dimensions)
[Figure: de Bruijn graph (m, n), illustrated with de Bruijn graph (2, 3)]
Combined input G = <V, E>:
N = |V| = 104,426 nodes (Abilene)
|E| = 647,053 edges
Goal:
Extract Gp from GD as accurately as possible!
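A de Bruijn overlay with these parameters can be generated and embedded as sketched below. Assumptions: nodes are numbered 0 to m^n - 1 and node x links to (x·m + j) mod m^n for j = 0..m-1 (the standard de Bruijn construction), host ids are simply 0..104,425, and self-loops are dropped when embedding. `embed` is a hypothetical helper.

```python
import random


def de_bruijn_edges(m, n):
    """Directed de Bruijn graph: m**n nodes, m outgoing links per node."""
    size = m ** n
    return [(x, (x * m + j) % size) for x in range(size) for j in range(m)]


def embed(edges, hosts, seed=0):
    """Map overlay node ids onto randomly selected host ids.

    Self-loops in the overlay are dropped (assumption).
    """
    size = 1 + max(max(u, v) for u, v in edges)
    chosen = random.Random(seed).sample(list(hosts), size)
    return [(chosen[u], chosen[v]) for u, v in edges if u != v]


small = de_bruijn_edges(2, 3)                 # the (2, 3) illustration
overlay = de_bruijn_edges(10, 4)              # m=10, n=4 as on the slide
bot_links = embed(overlay, range(104_426))    # hypothetical host ids
print(len(small), len(overlay), len(bot_links))
```

With m=10 and n=4 the overlay has 10^4 = 10,000 nodes, which agrees with the 10,000 nodes selected from GD on the slide.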
An Algorithm Example (cont’d)
(step 1)
Perform a short random walk starting from every node
Get si (use r = 100)
K-means clustering derives 10 clusters
(step 2 & 3)
Only check 4th cluster (yellow): 17,576 nodes, contains honey-net nodes
Recursively apply SybilInfer to this cluster and validate in 3 iterations
After validation: 10,143 nodes (TP: 9,905 / FP: 238)
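The final counts translate into rates with simple arithmetic. A minimal sketch: the precision follows directly from the TP and FP counts, while the detection rate assumes that all 10,000 inserted de Bruijn nodes count as bots (the slide does not state the denominator).

```python
tp, fp = 9905, 238            # counts from the slide
flagged = 10143               # nodes left after validation
bots = 10000                  # assumption: all inserted overlay nodes are bots

assert tp + fp == flagged     # the two counts add up to the flagged set

precision = tp / flagged      # share of flagged nodes that are real bots
detection_rate = tp / bots    # share of bots that were flagged
print(f"precision = {precision:.4f}, detection rate = {detection_rate:.4f}")
```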
Results
(1) Effect of botnet topology
(2) Effect of botnet graph size
(3) Effect of background graph size
(4) Effect of reduced visibility
(5) Leveraging Honeynets
(6) Effect of inference algorithm
(1) Effect of botnet topology
4 botnet graphs: de Bruijn, Chord, Kademlia, LEET-Chord
Overall, performance is stable across these graphs
Detection rate > 95%
FP rate < 0.42% for LEET-Chord
Stealthiness vs. resilience:
Randomly remove nodes (%) and check failed paths: LEET-Chord is less resilient to failure
→ the use of stealth to evade BotGrep would adversely affect the resilience of the botnet
(2) Effect of botnet graph size
Experiment:
Keep the size of the background traffic graph constant
Vary the size of the synthetic botnet graph
$10^2$, $10^3$, $10^4$, or $10^5$ bots
Finding:
Size increases, performance degrades (but only by a small amount)
(3) Effect of background graph size
A larger background graph → botnet is easier to hide inside (?)
Experiment:
Try to scale up the background graphs while retaining their statistical properties (ignore the procedure here), then insert the botnet
e.g., CAIDA: 3.8 million → 30 million nodes (×9)
Finding:
BotGrep scales well with network size!
[Slide annotation: ÷9]
(4) Effect of reduced visibility
Previous Experiments:
Gp is present in its entirety
Problems of reduced visibility:
Only deploy BotGrep at a subset of ISPs
Network traffic sampling
Experiment:
Study the Storm & Kraken botnets
Measure the number of inter-bot paths visible from ASes
Sort ASes (according to # of paths)
Combine the sorted ASes one by one and see their cumulative “visibility”
The 5 most-affected ASes contribute 57~65% visibility
(4) Effect of reduced visibility (cont’d)
57~65% visibility from Top 5 ASes
Apply BotGrep on a “combined graph” by removing 40% of
links from the botnet graph
(5) Leveraging Honeynets
Perform RWs starting only from the honey-net nodes to obtain
a set of candidate P2P nodes in the prefiltering stage (?)
Finding:
Significantly reduces FP rates
Also improves efficiency: only 1 iteration is required for the Modified SybilInfer Algo.
(6) Effect of inference algorithm
Compare with other graph partitioning algorithms
(1) Edge importance based community structure detection
Girvan-Newman Betweenness
Information centrality (too slow, not considered)
(2) Spectral-based approach
Modularity Eigenvector
Fast Greedy Modularity
Perform BFS and limit visited nodes by a size of 2k
Run-time: did not scale well for large datasets
Related Work (Botnet Detection)
Network based approaches
Detect attack traffic
Detect control (C&C) traffic
Traffic signature based detection
Statistical traffic analysis based detection
Hybrid approaches
Exploit DNS usage patterns, using Honeypot
Detect attack & control traffic
Combine network-based and host-based approach
Graph based approaches
Centralized structure
P2P structure
Conclusion
Goal: localize structured Comm. graphs within network traffic to identify botnet hosts and links
Propose BotGrep: searching for structured topologies, and separating them from the background Comm. graph
Tackling the privacy-preserving issues
Achieve low FP rate and high detection rate
Future work:
Consider temporal variation
Observing how parts of the Comm. graph change over time
Distinguish other P2P structures
Address the botnet response problem
Do not completely disconnect a node but mitigate its potential malicious activities
My Comments
About the approach:
In the 3 steps, the accuracy of the 1st step seems to be the key
factor, but not proved
1. deciding k (in k-means)
2. Clustering is based on the properties of each “node” (qi/di) after RW; are nodes in the same cluster really connected?
FN is due to? (pre-filtering? sampling?)
Why adopt modified SybilInfer Algo. in this paper to remove FP?
The original problem is dealing with P2P net + sybil nodes
Does the length of each RW affect the results?
Can we assign weights to edges?
Step 2 & 3 could deal with this issue…
e.g., # of connections between 2 nodes
My Comments
Time issues:
Does not consider the effects of time, traffic log in different
time period
(as mentioned) temporal variation of the Comm. graph
The effects of a small P2P network (at the early
stage) vs. the effects of a large P2P network (lots of
bots)