
BotGrep: Finding P2P Bots with
Structured Graph Analysis
Shishir Nagaraja, Prateek Mittal, Chi-Yao Hong,
Matthew Caesar, Nikita Borisov (UIUC)
USENIX Security Symposium 2010
Graph Theory: Cut
- G = (V, E)
- A cut C = (S, T) is a partition of the vertex set V of a graph G = (V, E) into two subsets S and T
  - e.g., ({1, 2, 3, 4, 5}, {6, 7})
- The cut-set of a cut is the set of edges whose end points are in different subsets of the partition
  - e.g., {(5, 6), (1, 7)}
- The size of a cut is the number of edges in the cut-set
  → small cut, min. cut…
[Figure: example graph with nodes 1-7, drawn before and after the cut ({1, 2, 3, 4, 5}, {6, 7})]
http://en.wikipedia.org/wiki/Cut_%28graph_theory%29
2010/9/14
Speaker: Li-Ming Chen
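The cut definitions can be checked on the 7-node example with a few lines of Python. This is only a sketch: the edge list `EDGES` is an assumption, inferred from the nonzero entries of the transition matrix P on the random-walk slide, and `cut_set` is a hypothetical helper name.

```python
# Edge list inferred from the nonzero entries of the transition matrix P
# on the random-walk slide (an assumption; the figure is not recoverable).
EDGES = [(1, 2), (1, 3), (1, 4), (1, 5), (1, 7), (2, 3), (2, 5),
         (3, 4), (3, 5), (4, 5), (5, 6), (6, 7)]


def cut_set(edges, s, t):
    """Return the edges with one end point in s and the other in t."""
    s, t = set(s), set(t)
    return [(u, v) for (u, v) in edges
            if (u in s and v in t) or (u in t and v in s)]


S, T = {1, 2, 3, 4, 5}, {6, 7}
crossing = cut_set(EDGES, S, T)   # the cut-set
cut_size = len(crossing)          # the size of the cut
print(sorted(crossing), cut_size)
```

Under the assumed edge list, the cut-set matches the one shown on the slide, {(5, 6), (1, 7)}.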
Graph Theory: Random Walk (RW)
- RW: a trajectory that consists of taking successive random steps on a graph
  - E.g., the path traced by a molecule as it travels in a liquid
- RWs are usually assumed to be Markov processes
- Example (transition matrix P; rows: from node, columns: to node, nodes 1-7):

$$
P = \begin{pmatrix}
0 & 1/5 & 1/5 & 1/5 & 1/5 & 0 & 1/5 \\
1/3 & 0 & 1/3 & 0 & 1/3 & 0 & 0 \\
1/4 & 1/4 & 0 & 1/4 & 1/4 & 0 & 0 \\
1/3 & 0 & 1/3 & 0 & 1/3 & 0 & 0 \\
1/5 & 1/5 & 1/5 & 1/5 & 0 & 1/5 & 0 \\
0 & 0 & 0 & 0 & 1/2 & 0 & 1/2 \\
1/2 & 0 & 0 & 0 & 0 & 1/2 & 0
\end{pmatrix}
$$

  - Starts at node 1: $q^{(0)} = (1, 0, 0, 0, 0, 0, 0)$
  - Prob. after 1 RW: $q^{(1)} = q^{(0)} \cdot P = (0, 1/5, 1/5, 1/5, 1/5, 0, 1/5)$
  - Prob. after 2 RW: $q^{(2)} = q^{(1)} \cdot P = (0.323, 0.090, 0.173, 0.090, 0.183, 0.140, 0)$
[Figure: example graph with nodes 1-7]
http://en.wikipedia.org/wiki/Random_walk
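The two random-walk steps of the example can be reproduced as follows. This is a minimal sketch using exact fractions; the edge list `EDGES` is an assumption inferred from the nonzero entries of P, and `transition_matrix` and `step` are hypothetical helper names.

```python
from fractions import Fraction

# Assumed edge list, inferred from the nonzero entries of P on this slide.
EDGES = [(1, 2), (1, 3), (1, 4), (1, 5), (1, 7), (2, 3), (2, 5),
         (3, 4), (3, 5), (4, 5), (5, 6), (6, 7)]
N = 7


def transition_matrix(edges, n):
    """Row-stochastic P: P[i][j] = 1/deg(i) if (i, j) is an edge, else 0."""
    adj = [[0] * n for _ in range(n)]
    for u, v in edges:
        adj[u - 1][v - 1] = adj[v - 1][u - 1] = 1
    return [[Fraction(a, sum(row)) for a in row] for row in adj]


def step(q, P):
    """One random-walk step: q_next = q * P (row vector times matrix)."""
    n = len(q)
    return [sum(q[i] * P[i][j] for i in range(n)) for j in range(n)]


P = transition_matrix(EDGES, N)
q0 = [Fraction(1)] + [Fraction(0)] * (N - 1)   # walk starts at node 1
q1 = step(q0, P)
q2 = step(q1, P)
print([str(x) for x in q1])
print([round(float(x), 3) for x in q2])
```

With this edge list, `q1` and the rounded `q2` agree with the vectors given on the slide.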
Graph Theory: Stationary Distribution
- Example 1:
  - $q^{(0)} = (1, 0, 0, 0, 0, 0, 0)$
  - $q^{(26)} = (0.2083, 0.125, 0.1667, 0.125, 0.2083, 0.0833, 0.0833)$ (remains steady)
- Example 2:
  - $q^{(0)} = (1/7, 1/7, 1/7, 1/7, 1/7, 1/7, 1/7)$
  - $q^{(11)} = (0.2083, 0.125, 0.1667, 0.125, 0.2083, 0.0833, 0.0833)$ (remains steady)
- A stationary distribution π is a vector, whose entries are nonnegative and sum to 1, that satisfies $\pi = \pi P$
- Markov chain mixing time:
  - How large must t be until the time-t distribution $q^{(t)}$ is approximately π? (i.e., to converge to the stationary Dist.)
[Figure: example graph with nodes 1-7]
http://en.wikipedia.org/wiki/Markov_chain
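A small sketch of the stationary distribution and of a mixing-time measurement on the same example. Assumptions: the edge list `EDGES` is inferred from the transition matrix on the random-walk slide, closeness to π is measured by total variation distance, and the tolerance `eps` is an arbitrary choice, so the step counts it reports need not equal the 26 and 11 quoted on the slide.

```python
# Assumed edge list, inferred from the transition matrix on the random-walk slide.
EDGES = [(1, 2), (1, 3), (1, 4), (1, 5), (1, 7), (2, 3), (2, 5),
         (3, 4), (3, 5), (4, 5), (5, 6), (6, 7)]
N = 7


def transition_matrix(edges, n):
    """Row-stochastic P of the simple random walk."""
    adj = [[0.0] * n for _ in range(n)]
    for u, v in edges:
        adj[u - 1][v - 1] = adj[v - 1][u - 1] = 1.0
    return [[a / sum(row) for a in row] for row in adj]


def step(q, P):
    """One step: q_next = q * P."""
    n = len(q)
    return [sum(q[i] * P[i][j] for i in range(n)) for j in range(n)]


def total_variation(p, q):
    """Total variation distance between two distributions."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def steps_to_mix(q0, P, pi, eps=1e-4, max_steps=1000):
    """Smallest t with TV(q0 * P^t, pi) <= eps, or None if never reached."""
    q = list(q0)
    for t in range(max_steps + 1):
        if total_variation(q, pi) <= eps:
            return t
        q = step(q, P)
    return None


P = transition_matrix(EDGES, N)
degree = [0] * N
for u, v in EDGES:
    degree[u - 1] += 1
    degree[v - 1] += 1
# For a simple random walk, pi is proportional to the node degrees.
pi = [d / (2 * len(EDGES)) for d in degree]

from_node_1 = steps_to_mix([1.0] + [0.0] * (N - 1), P, pi)
from_uniform = steps_to_mix([1.0 / N] * N, P, pi)
print([round(x, 4) for x in pi], from_node_1, from_uniform)
```

The degree-proportional vector `pi` rounds to the steady values shown in both examples, and `step(pi, P)` leaves it unchanged, which is the condition π = πP.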
Outline
- Problem Definition
- System Architecture
- Approach:
  - Prefiltering Step
  - Clustering P2P Nodes
  - Validation
  - (*) Privacy Preserving Graph Algorithms
- Results & Discussion
- Conclusion & My Comments
What is a Botnet?
- Bots: compromised hosts, “Zombies”
- Botnets: networks of bots that are under the control of a human operator (botmaster)
- (generally looks like) Worm + C&C channel
  → Command and Control Channel
  → Disseminates the botmasters’ commands to their bot armies
[Figure: Communication (IRC, HTTP, … (can be encrypted)); Worm; Attack (DoS, spamming, phishing site, …); Propagation (vulnerabilities, file sharing, P2P, …)]
Botnet Structure Change!!
- Centralized structure → P2P, why?
  - Growing size of botnets
    → P2P communication is more efficient and robust
  - Development of mechanisms that detect traditional centralized C&C servers
    → Try to evade detection
- Question:
  - Whether ISPs can detect P2P botnets and use this as a basis for botnet defense.
[Figure: (Traditional) centralized structure vs. P2P structure]
Problem & Proposed Solution
- Problem:
  - ISPs have significant visibility into the Comm. patterns
  - But, how to separate botnet traffic from background Internet traffic?
- Proposed approach: BotGrep
  - An algorithm that isolates P2P Comm. structure
  - Only based on the information about which pairs of nodes communicate with one another
    - Input: a communication graph
    - Still works when only a partial view of the comm. graph is available
    - Can support “privacy preserving collaboration”
Challenges
- Background traffic volume is huge
- Background traffic is highly variable and continuously changing
- Botnet traffic blends in with the regular traffic of the legitimate users
  → botnet is tightly integrated and can NOT be separated from the rest of the nodes by a small cut
- ISPs collaboration → scaling issues
- ISPs collaboration → privacy issues
BotGrep Architecture
- Data source 1:
  - Combining observations across different network monitors into a single Comm. graph
- Data source 2:
  - Borrow misuse detection to distinguish P2P bots from other P2P applications (speed up botnet identification)
- Outputs:
  - A set of suspect hosts (and links)
Inference System
- As mentioned, botnet graph is embedded within a background Comm. graph
- One common feature of P2P structured graph:
  - Fast mixing time (∵ highly structured)
  → BotGrep exploits this feature by performing random walks to identify fast-mixing component(s) and isolate them from the rest of the Comm. graph
Problem Formulation
- Given a Comm. graph: G = (V, E)
  - (note: no clear time period is specified)
- Assume a P2P graph Gp is embedded: $G_p \subset G$
  → remaining subgraph contains non-P2P Comm. edges: $G_n = G - G_p$
- Goal:
  - Partition the input G into {Gp, Gn} in the presence of dynamic background traffic and with only partial visibility
Approach Overview (BotGrep)
- Idea:
  - Perform random walks and compare the relative mixing rates of subgraphs
- 3 steps:
  - (1) Pre-filtering (actually is k-means clustering)
    - Extract a small set of candidate P2P nodes (+ FP)
  - (2) Clustering P2P Nodes (sampling)
    - Apply modified SybilInfer Algo. to remove FP
  - (3) Validation
    - Validate step (2) based on fast-mixing characteristic
Step (1): Pre-filtering
- Idea:
  - For short random walks, the state Prob. associated with nodes in the fast-mixing subgraph is likely to be closer to the “stationary distribution” than that of nodes in the slow-mixing subgraph
- Input: G = (V, E)
- Short RW, t = log(|V|), starting from the init distribution
- Per-node score s_i (computed from q_i and the node degree d_i):
  - ∵ stationary Dist. is proportional to node degrees
  - dampening constant, to undermine high-degree nodes
- k-means:
  - Goal: the sum of squares J from points to the assigned cluster centers c_j is minimized (J is the cluster score)
  - ?? the sum should be over s_i ∈ cluster j
[Figure: example graph with nodes 1-7; the equations for init, s_i and J are images on the slide]
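The pre-filtering idea can be sketched in Python as below. Several details are assumptions, because the slide's equations are not recoverable: the walk starts from a uniform distribution, the score is taken as s_i = (q_i / d_i)^(1/r) with dampening constant r, and the clustering is a plain 1-D Lloyd's k-means with evenly spaced initial centers. The names `short_walk_scores` and `kmeans_1d` are hypothetical, and the edge list is the one inferred from the transition matrix on the random-walk slide.

```python
import math

# Assumed example graph (edge list inferred from the random-walk slide).
EDGES = [(1, 2), (1, 3), (1, 4), (1, 5), (1, 7), (2, 3), (2, 5),
         (3, 4), (3, 5), (4, 5), (5, 6), (6, 7)]
ADJ = {v: [] for v in range(1, 8)}
for a, b in EDGES:
    ADJ[a].append(b)
    ADJ[b].append(a)


def short_walk_scores(adj, r=100):
    """Assumed score s_i = (q_i / d_i) ** (1 / r) after t = ceil(log|V|) steps.

    The uniform start and the exact form of s_i are assumptions.
    """
    nodes = sorted(adj)
    n = len(nodes)
    t = max(1, math.ceil(math.log(n)))
    q = {v: 1.0 / n for v in nodes}
    for _ in range(t):
        nxt = {v: 0.0 for v in nodes}
        for u in nodes:
            share = q[u] / len(adj[u])   # equal prob. to each neighbor
            for v in adj[u]:
                nxt[v] += share
        q = nxt
    return {v: (q[v] / len(adj[v])) ** (1.0 / r) for v in nodes}


def kmeans_1d(values, k, iters=50):
    """Plain Lloyd's k-means on scalars; returns (labels, centers, J)."""
    lo, hi = min(values), max(values)
    centers = [lo + (hi - lo) * (j + 0.5) / k for j in range(k)]
    labels = [0] * len(values)
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: (x - centers[j]) ** 2)
                  for x in values]
        for j in range(k):
            members = [x for x, lab in zip(values, labels) if lab == j]
            if members:
                centers[j] = sum(members) / len(members)
    # J: sum of squared distances from points to their assigned centers.
    J = sum((x - centers[lab]) ** 2 for x, lab in zip(values, labels))
    return labels, centers, J


scores = short_walk_scores(ADJ, r=100)
labels, centers, J = kmeans_1d([scores[v] for v in sorted(scores)], k=2)
print(labels, J)
```

On a 7-node toy graph the clusters carry little meaning; the point is only the data flow from short walk to per-node score to k-means with cluster score J.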
Step (2): Clustering P2P Nodes
- Step (1)’s output: {G1, G2, …, Gk}
  → perform “modified SybilInfer Algo.” on each subgraph to remove weakly connected nodes (FP)
- Concept of modified SybilInfer Algo. (3 steps):
  - Get “traces” T
    → A trace represents a related vertex-pair by using RW*
  - Use sampling to get P2P nodes
    → Assume a cut X0 consists of P2P nodes, $X_0 \subset V$
    → Check if X’ is better than X0 according to probability $P(X \mid T)$
      - If better, X’ replaces X0; else X0 is retained.
      - and then do it in several runs
  - Get {X0, X1, …, XN}, $X_i \sim P(X \mid T)$; decide P[node i is P2P] = ?
* G. Danezis and P. Mittal, “SybilInfer: Detecting Sybil Nodes using Social Networks,” in Proc. NDSS, 2009.
Step (2): Clustering P2P Nodes (cont’d)
- Modified SybilInfer Algo.:
  - Step (1) Generation of traces:
    - [Equation: modified transition matrix P’] (ensures that the “stationary Dist.” of the RW is uniform over all vertices)
      → $q^{(t)} = q^{(t-1)} \cdot P'$
    - Perform a number n of RWs, starting at each node, length t = log(|V|)
      → Traces T is the set of starting and ending vertex-pairs of each RW (we are interested in these pairs traversed by RW)
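A sketch of trace generation follows. The definition of P’ is not recoverable from the slide, so the code assumes the form used in the cited SybilInfer paper: P’[u][v] = min(1/d_u, 1/d_v) for each edge, with the leftover probability kept as a self-loop. The function names, the seed and the example graph (edge list inferred from the random-walk slide) are hypothetical choices.

```python
import math
import random

# Assumed example graph (edge list inferred from the random-walk slide).
EDGES = [(1, 2), (1, 3), (1, 4), (1, 5), (1, 7), (2, 3), (2, 5),
         (3, 4), (3, 5), (4, 5), (5, 6), (6, 7)]
ADJ = {v: [] for v in range(1, 8)}
for a, b in EDGES:
    ADJ[a].append(b)
    ADJ[b].append(a)


def modified_transition(adj):
    """Assumed P': min(1/d_u, 1/d_v) per edge, remainder as a self-loop."""
    P = {}
    for u in adj:
        row = {v: min(1.0 / len(adj[u]), 1.0 / len(adj[v])) for v in adj[u]}
        row[u] = max(0.0, 1.0 - sum(row.values()))   # self-loop mass
        P[u] = row
    return P


def generate_traces(adj, walks_per_node, rng):
    """Return (start, end) pairs of walks of length t = ceil(log|V|)."""
    P = modified_transition(adj)
    t = max(1, math.ceil(math.log(len(adj))))
    traces = []
    for start in sorted(adj):
        for _ in range(walks_per_node):
            cur = start
            for _ in range(t):
                targets = list(P[cur])
                weights = [P[cur][v] for v in targets]
                cur = rng.choices(targets, weights=weights)[0]
            traces.append((start, cur))
    return traces


P_mod = modified_transition(ADJ)
traces = generate_traces(ADJ, walks_per_node=3, rng=random.Random(0))
print(len(traces), traces[:5])
```

Under this assumed definition every row of P’ sums to 1 and the matrix is symmetric, which is what makes the uniform distribution stationary.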
(My Observation)
- Traces T
  - Set of vertex-pairs
  - The end points reflect the connectivity of the graph
- P’ is a symmetric matrix
  → high-degree nodes may walk to low-degree nodes (~ equal prob. to its neighbors)
  → RW may be trapped by low-degree nodes (if they are connected)
- Example (rows: from node, columns: to node, nodes 1-7):

$$
P' = \begin{pmatrix}
0 & 1/5 & 1/5 & 1/5 & 1/5 & 0 & 1/5 \\
1/5 & 0 & 1/4 & 0 & 1/5 & 0 & 0 \\
1/5 & 1/4 & 0 & 1/4 & 1/5 & 0 & 0 \\
1/5 & 0 & 1/4 & 0 & 1/5 & 0 & 0 \\
1/5 & 1/5 & 1/5 & 1/5 & 0 & 1/5 & 0 \\
0 & 0 & 0 & 0 & 1/5 & 0 & 1/2 \\
1/5 & 0 & 0 & 0 & 0 & 1/2 & 0
\end{pmatrix}
$$

- RW will not converge by using P’ !!
[Figure: example graph with nodes 1-7]
Step (2): Clustering P2P Nodes (cont’d)
- Modified SybilInfer Algo.:
  - Step (2) A (Bayesian) Prob. model for P2P nodes:
    - Given the set of traces T, compute the Prob. that any set of nodes X are all P2P nodes, $P(X \mid T)$
    - [Equation: Bayes’ rule for $P(X \mid T)$, with terms annotated “goal”, “can be acquired” and “fixed”]
    - Assign a uniform prob. to all walks ending in the set X
    - [Equation: Prob. of the traces given X, with terms annotated “trace ends in vertex v in X”, “trace ends in vertex a in X” and “number of RW ending in vertex a (or v)”]
Step (2): Clustering P2P Nodes (cont’d)
- Modified SybilInfer Algo.:
  - Step (3) Metropolis-Hastings Sampling:
    - Enumerating all subsets X of the graph is impossible
      → sample configurations Xi following this distribution: $X_i \sim P(X \mid T)$
    - Given a set of samples S = {X0, X1, …, XN}, we can compute marginal Prob. of nodes being P2P nodes as follows:
      $P[\text{node } i \text{ is P2P}] = \frac{1}{|S|} \sum_{X_j \in S} I(i \in X_j)$
    - Threshold:
      - if P[node i is P2P] > 0.5 (node i exists in more than half of the samples), then P2P, else non-P2P (FP).
- Next step: validate P2P group!
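The marginal probability and the 0.5 threshold reduce to counting how often a node appears in the sampled sets. The sketch below assumes the samples have already been drawn; the sample sets, node ids and function names are made up for illustration.

```python
def marginal_p2p_probability(samples, nodes):
    """Fraction of sampled sets X_j that contain each node."""
    total = len(samples)
    return {v: sum(1 for x in samples if v in x) / total for v in nodes}


def classify(samples, nodes, threshold=0.5):
    """Nodes whose marginal probability exceeds the threshold."""
    probs = marginal_p2p_probability(samples, nodes)
    return {v for v, p in probs.items() if p > threshold}


# Hypothetical samples X_0..X_3 (not taken from the paper).
samples = [{1, 2, 3, 4, 5}, {1, 2, 3, 5}, {1, 3, 4, 5, 6}, {1, 2, 3, 4, 5}]
nodes = range(1, 8)
probs = marginal_p2p_probability(samples, nodes)
p2p_nodes = classify(samples, nodes)
print(probs, p2p_nodes)
```

Here node 6 appears in one of four samples and node 7 in none, so both fall below the threshold and are treated as non-P2P.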
Step (3): Validation
- SybilInfer only partitions a graph into two subgraphs
  - We need to use multiple iterations to get to the desired fastest mixing subgraph
  → require a validation test
- If the cut passes all the 3 validation tests below, then we are done:
  - (1) Graph conductance test
  - (2) q(t) entropy comparison test
  - (3) Degree-homogeneity test
Step (3): Validation (cont’d)
- (1) Graph conductance test
  - P2P network is fast mixing → no small cut → graph conductance should be high
- (2) q(t) entropy comparison test
  - RWs on structured homogeneous P2P graphs are characterized by a high-entropy state Prob. Dist. $q_i^{(t)} \approx 1/n$
  - KL divergence measure should be close to 0
- (3) Degree-homogeneity test
  - To rule out star topology!
  - Measure the dispersion of degree values → should be homogeneous
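The three tests can be illustrated with textbook quantities. This is a sketch under assumptions: the slide does not give the exact statistics or thresholds, so the code uses the conductance of a single cut, the KL divergence from the uniform distribution, and the coefficient of variation of degrees. All function names and the toy graphs are hypothetical.

```python
import math


def conductance(adj, subset):
    """cut(S, V-S) / min(vol(S), vol(V-S)), with vol = sum of degrees."""
    subset = set(subset)
    cut = sum(1 for u in subset for v in adj[u] if v not in subset)
    vol_s = sum(len(adj[u]) for u in subset)
    vol_rest = sum(len(adj[u]) for u in adj if u not in subset)
    return cut / min(vol_s, vol_rest)


def kl_from_uniform(q):
    """KL(q || uniform) = sum q_i * log(q_i * n); zero when q is uniform."""
    n = len(q)
    return sum(p * math.log(p * n) for p in q if p > 0)


def degree_dispersion(adj, subset):
    """Coefficient of variation of degrees in the induced subgraph."""
    subset = set(subset)
    degs = [sum(1 for v in adj[u] if v in subset) for u in subset]
    mean = sum(degs) / len(degs)
    var = sum((d - mean) ** 2 for d in degs) / len(degs)
    return math.sqrt(var) / mean if mean else float("inf")


# Toy graphs: a 6-node ring (homogeneous) and a 6-node star.
ring = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
star = {0: [1, 2, 3, 4, 5], **{i: [0] for i in range(1, 6)}}

ring_cut = conductance(ring, {0, 1, 2})
ring_cv = degree_dispersion(ring, ring)
star_cv = degree_dispersion(star, star)
uniform_kl = kl_from_uniform([1 / 6] * 6)
skewed_kl = kl_from_uniform([0.5, 0.1, 0.1, 0.1, 0.1, 0.1])
print(ring_cut, ring_cv, star_cv, uniform_kl, skewed_kl)
```

The star has a large degree dispersion while the ring has none, which is the property the degree-homogeneity test relies on to rule out star topologies.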
Outline
- Problem Definition
- System Architecture
- Approach:
  - Prefiltering Step
  - Clustering P2P Nodes
  - Validation
  - (*) Privacy Preserving Graph Algorithms (ignored)
- Results & Discussion
- Conclusion & My Comments
Dataset (Graphs)
- Background traffic communication graph:
  - Constructed from a 1-day real-world traffic trace:
    - (1) Abilene’s NetFlow trace (2009/10/22) (104,426 nodes)
    - (2) CAIDA packet-level trace (2009/1/11) (3,839,936 nodes)
- Botnet graph:
  - Synthetically add links between randomly selected “bots” in the background traffic
  - For sensitivity test, the structure of botnet graph includes:
    - (1) de Bruijn, (2) Chord, (3) Kademlia, (4) LEET-Chord
→ Take the combined graph as the algorithm input
An Algorithm Example
- Background traffic communication graph:
  - GD: Abilene’s trace
- Botnet graph:
  - Gp: de Bruijn structure
    - Randomly select 10,000 nodes from GD
    - Parameters: m=10 (outgoing links), n=4 (dimensions)
- Combined input G = <V, E>:
  - N = |V| = 104,426 nodes (Abilene)
  - |E| = 647,053 edges
- Goal:
  - Extract Gp from GD as accurately as possible!
[Figure: de Bruijn graph (m, n), illustrated with de Bruijn graph (2, 3)]
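For reference, a de Bruijn overlay with m outgoing links and n dimensions can be generated as below. This is a sketch assuming the standard definition (node x links to (x·m + a) mod m^n for a = 0..m-1), not necessarily the exact construction used in the paper; `de_bruijn_edges` is a hypothetical name.

```python
def de_bruijn_edges(m, n):
    """Directed de Bruijn graph on m**n nodes with m outgoing links each.

    Node x (an n-digit base-m string) links to every node obtained by
    shifting one digit left and appending a new digit a.
    """
    size = m ** n
    return [(x, (x * m + a) % size) for x in range(size) for a in range(m)]


small = de_bruijn_edges(2, 3)     # the (2, 3) graph shown in the figure
botnet = de_bruijn_edges(10, 4)   # the parameters used in the example
botnet_nodes = {u for u, _ in botnet}
print(len(small), len(botnet_nodes), len(botnet))
```

With m=10 and n=4 this yields 10^4 = 10,000 overlay nodes, which matches the 10,000 bots selected from GD; mapping overlay ids onto randomly chosen background hosts would be a separate step.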
An Algorithm Example (cont’d)
- (step 1)
  - Perform a short random walk starting from every node
  - Get s_i (use r = 100)
  - K-means clustering derives 10 clusters
- (step 2 & 3)
  - Only check 4th cluster (yellow)
    - 17,576 nodes
    - Contains honey-net nodes
  - Recursively apply SybilInfer to this cluster and validate in 3 iterations
  - Validation
    - 10,143 nodes (TP: 9,905 / FP: 238)
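The numbers above translate into rates as follows. This is plain arithmetic on the slide's figures, under two assumptions: the detection rate is TP over the 10,000 embedded bots, and the FP rate is FP over the non-bot nodes of the Abilene graph. The paper may define these rates differently.

```python
total_nodes = 104_426    # |V| of the combined graph (Abilene)
bots = 10_000            # nodes selected for the de Bruijn botnet
tp, fp = 9_905, 238      # from the validated 10,143-node set

detected = tp + fp                    # size of the output set
fn = bots - tp                        # bots that were missed
detection_rate = tp / bots
fp_rate = fp / (total_nodes - bots)   # assumed denominator: non-bot nodes

print(detected, fn, f"{detection_rate:.2%}", f"{fp_rate:.2%}")
```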
Results






(1) Effect of botnet topology
(2) Effect of botnet graph size
(3) Effect of background graph size
(4) Effect of reduced visibility
(5) Leveraging Honeynets
(6) Effect of inference algorithm
(1) Effect of botnet topology
- 4 botnet graphs → de Bruijn, Chord, Kademlia, LEET-Chord
  - Overall, performance is stable across these graphs
  - Detection rate > 95%
  - FP rate < 0.42% for LEET-Chord
- Stealthiness vs. resilience:
  - Randomly removing nodes (%)
  - Check failed paths: LEET-Chord is less resilient to failure
  → the use of stealth to evade BotGrep would adversely affect the resilience of the botnet
(2) Effect of botnet graph size
- Experiment:
  - Keep the size of the background traffic graph constant
  - Vary the size of the synthetic botnet graph
    - $10^2$, $10^3$, $10^4$, or $10^5$ bots
- Finding:
  - As the size increases, performance degrades (but only by a small amount)
(3) Effect of background graph size
- A larger background graph → the botnet is easier to hide inside (?)
- Experiment:
  - Try to scale up the background graphs while retaining their statistical properties (ignore the procedure here), then insert botnet
  - e.g., CAIDA: 3.8 million → 30 million nodes (×9)
- Finding:
  - BotGrep scales well with network size!
[Figure annotation: ÷9]
(4) Effect of reduced visibility
- Previous Experiments:
  - Gp is present in its entirety
- Problems of reduced visibility:
  - Only deploy BotGrep at a subset of ISPs
  - Network traffic sampling
- Experiment:
  - Study Storm & Kraken botnets
  - Measure number of inter-bot paths visible from ASes
  - Sort ASes (according to # of paths)
  - Combine the sorted ASes and see the “visibility” they contribute
  → 5 most-affected ASes contribute views 57~65%
(4) Effect of reduced visibility (cont’d)
- 57~65% visibility from Top 5 ASes
  → Apply BotGrep on a “combined graph” by removing 40% of links from the botnet graph
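Reduced visibility can be simulated by dropping a fraction of the botnet links before combining the graphs. The sketch below assumes links are removed uniformly at random, which the slide does not state; `drop_links`, the seed and the toy edge list are hypothetical.

```python
import random


def drop_links(edges, fraction, rng):
    """Keep a uniformly random subset with (1 - fraction) of the edges."""
    keep = len(edges) - round(len(edges) * fraction)
    return rng.sample(edges, keep)


# Hypothetical botnet edge list with 100 links.
botnet_edges = [(i, (i + 1) % 100) for i in range(100)]
visible = drop_links(botnet_edges, fraction=0.4, rng=random.Random(0))
print(len(botnet_edges), len(visible))
```

The remaining 60% of the links would then be merged with the background graph to form the input for the detection steps.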
(5) Leveraging Honeynets
- Perform RWs starting only from the honey-net nodes to obtain a set of candidate P2P nodes in the prefiltering stage (?)
- Finding:
  - Significantly reduces FP rates
  - Also improves efficiency
    - Only 1 iteration is required for Modified SybilInfer Algo.
(6) Effect of inference algorithm
- Compare with other graph partitioning algorithms
  - (1) Edge importance based community structure detection
    - Girvan-Newman Betweenness
    - Information centrality (too slow, not considered)
  - (2) Spectral-based approach
    - Modularity Eigenvector
    - Fast Greedy Modularity
  - Perform BFS and limit visited nodes by a size of 2k
- Run-time:
  - Do not scale well to large datasets
Related Work (Botnet Detection)

Network based approaches

Detect attack traffic


Detect control (C&C) traffic



Traffic signature based detection
Statistical traffic analysis based detection
Hybrid approaches



Exploit DNS usage patterns, using Honeypot
Detect attack & control traffic
Combine network-based and host-based approach
Graph based approaches


Centralized structure
P2P structure
Conclusion
- Goal: localize structured Comm. graphs within network traffic to identify botnet hosts and links
  - Propose BotGrep: searching for structured topologies, and separating them from background Comm. graph
  - Tackling the privacy-preserving issues
  - Achieve low FP rate and high detection rate
- Future work:
  - Consider temporal variation
    - Observing how parts of the Comm. graph change over time
  - Distinguish other P2P structure
  - Address the botnet response problem
    - Do not completely disconnect a node but mitigate its potential malicious activities
My Comments
- About the approach:
  - In the 3 steps, the accuracy of the 1st step seems to be the key factor, but this is not proved
    - 1. Deciding k (in k-means)
    - 2. Clustering is based on the properties of each “node” (qi/di) after RW; are nodes in the same cluster really connected?
      - Step 2 & 3 could deal with this issue…
  - FN is due to? (pre-filtering? sampling?)
  - Why adopt modified SybilInfer Algo. in this paper to remove FP?
    - The original problem is dealing with P2P net + sybil nodes
  - Does the length of each RW affect the results?
  - Can we assign weights to edges?
    - e.g., # of connections between 2 nodes
My Comments (cont’d)
- Time issues:
  - Does not consider the effects of time, e.g., traffic logs from different time periods
  - (as mentioned) temporal variation of the Comm. graph
- The effects of a small P2P network (at the early stage) vs. the effects of a large P2P network (lots of bots)