Applying PCA for Traffic Anomaly Detection: Problems and Solutions
Daniela Brauckhoff (ETH Zurich, CH)
Kave Salamatian (Lancaster University, UK)
Martin May (Thomson, CH)
IEEE INFOCOM, April 2009
2010/3/2
Agenda
• Before Introduction
• Objective
• A Signal Processing View on PCA
• Extension of PCA to Stochastic Processes
• Validation
• Conclusion
What is PCA?
• PCA: Principal Component Analysis
• PCA's usage: reducing the dimensionality of data
  – e.g., a picture of size 1024 × 768 pixels
    • its dimension is its length × width, i.e., 786432 values
    • PCA can represent it in a much lower-dimensional space
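As a sketch of the idea (on synthetic data, not the traffic data used later in the talk), a minimal PCA in NumPy: center the data, diagonalize the sample covariance, and keep only the leading components.

```python
import numpy as np

rng = np.random.default_rng(0)
# 100 samples of a 5-dimensional signal that really lives on 2 directions
latent = rng.normal(size=(100, 2))
mixing = rng.normal(size=(2, 5))
x = latent @ mixing + 0.01 * rng.normal(size=(100, 5))

# Center the data, then diagonalize the sample covariance
x_centered = x - x.mean(axis=0)
cov = x_centered.T @ x_centered / (len(x) - 1)
eigvals, eigvecs = np.linalg.eigh(cov)      # ascending eigenvalue order
order = np.argsort(eigvals)[::-1]           # reorder to descending variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Keep the two principal components: 5-D -> 2-D
y = x_centered @ eigvecs[:, :2]
print(y.shape)   # (100, 2)
```

Because the synthetic data is (almost) planar, the first two eigenvalues carry nearly all of the variance, which is exactly what makes the reduced representation useful.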
What is PCA? (cont.)
[Figure: PCA illustration. Ref.: http://blog.finalevil.com/2008/07/pca.html]
Problems and Solutions
• Take the temporal correlation of the data into account
• Extend PCA
  – replace it with the Karhunen-Loève transform
Two different interpretations
1. As an efficient representation that transforms the data to a new coordinate system
   • the projection on the first coordinate contains the greatest variance
2. As a modeling technique
   • using a finite number of terms of an orthogonal series expansion of the signal, with uncorrelated coefficients
Background
• Suppose that we have a column vector of K correlated random variables:
  X = (X_1, ..., X_K)^T ∈ R^K
• Each random variable has its own observations; the i-th of the N dependent realization vectors is
  x^i = (x_1^i, ..., x_K^i)^T
• Note: the random variables here are the data collected from the network
Background (cont.1)
• To characterize the data collected from the network, we look for the most suitable basis (φ_1, ..., φ_K),
  – where each φ_i is an eigenvector of the covariance matrix of X, defined as
    Σ = E{(X − μ)(X − μ)^T}
    and estimated by
    Σ̂ = (1 / (N − 1)) x x^T
  – where μ is a column vector containing the means of the X_i
Background (cont.2)
• The most suitable basis: (φ_1, ..., φ_K)
• How do we find each φ_i?
  – i.e., solve the eigenvalue equation Σ φ_i = λ_i φ_i, with Σ = E{(X − μ)(X − μ)^T}
  – Method: SVD (Singular Value Decomposition)
• Note: the basis change matrix is U = [φ_1, ..., φ_K]
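The eigenvalue step can be sketched directly in NumPy; the data here is synthetic and only illustrates the mechanics. The SVD of the (symmetric) estimated covariance yields its eigenvectors, and each column of the resulting matrix satisfies the eigenvalue equation.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=(4, 200))             # K=4 variables, N=200 realizations
x = x - x.mean(axis=1, keepdims=True)     # zero-mean, as the derivation assumes

# Sigma_hat = x x^T / (N - 1): estimated covariance matrix
sigma_hat = x @ x.T / (x.shape[1] - 1)

# SVD of the symmetric covariance yields its eigenvectors phi_i
u, s, _ = np.linalg.svd(sigma_hat)

# Each column phi of u solves  Sigma_hat phi = lambda phi
phi, lam = u[:, 0], s[0]
print(np.allclose(sigma_hat @ phi, lam * phi))   # True
```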
Background (cont.3)
• But U = [φ_1, ..., φ_K] is a basis change matrix only when X is zero mean
• Therefore X must be replaced by X̃ = X − μ
  – i.e., ỹ = U x̃
  – not taking care of this can lead to large errors when using PCA
• Rewrite the initial vector of random variables:
  X = Σ_{i=1}^{K} Y_i φ_i
  – the coefficients Y_i carry the essential properties
  – i.e., they are what makes the data suitable for a PCA representation
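The centering step and the expansion can be checked numerically; a small sketch on synthetic data, assuming the convention that U's transpose maps centered data to the coefficients:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=(3, 50))             # K=3 variables, 50 realizations
mu = x.mean(axis=1, keepdims=True)
x_tilde = x - mu                         # X~ = X - mu: center first

# Basis from the covariance of the *centered* data
sigma = x_tilde @ x_tilde.T / (x.shape[1] - 1)
u_basis, _, _ = np.linalg.svd(sigma)

# y~ = U^T x~ : the (sample-)uncorrelated coefficients Y_i
y = u_basis.T @ x_tilde

# X~ = sum_i Y_i phi_i : the expansion reconstructs the centered data exactly
print(np.allclose(u_basis @ y, x_tilde))   # True
```

Skipping the subtraction of μ and projecting the raw x instead mixes the mean into every coefficient, which is the "large errors" pitfall the slide warns about.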
Stochastic Process
• The extension of PCA to stochastic processes that have temporal as well as spatial correlations
• Assume we have a K-vector of zero-mean stationary stochastic processes
  X(t) = (X_1(t), ..., X_K(t))^T
  – with covariance function Γ_{i,j}(τ) = E{X_i(t) X_j(t + τ)}
Stochastic Process (cont.1)
• The multi-dimensional Karhunen-Loève theorem states that one can rewrite this vector as a series expansion (the KL expansion):
  X_l(t) = Σ_{i=1}^{K} Σ_{j=1}^{∞} Y_{i,j}^l φ_{i,j}(t)
• Compare with the static case: X = Σ_{i=1}^{K} Y_i φ_i
Stochastic Process (cont.2)
• How do we get the basis functions φ_{i,j}(t)?
  – Solve the linear integral equations:
    Σ_{i=1}^{K} ∫_a^b Γ_{i,l}(s) φ_{i,j}(s − t) ds = λ_{l,j} φ_{l,j}(t)
  – Compare with Σ φ_i = λ_i φ_i
• Then we can obtain Y_{i,j}^l by
  Y_{i,j}^l = ∫_a^b X_l(s) φ_{i,j}(s) ds
  so that X_l(t) = Σ_{i=1}^{K} Σ_{j=1}^{∞} Y_{i,j}^l φ_{i,j}(t)
Stochastic Process (cont.3)
• The Galerkin method transforms the above integral equations into a matrix problem that can be solved by applying the SVD technique
• It is possible to derive the KL expansion using only a finite number of samples
  – time-sampled version: φ_{i,j}[k] = φ_{i,j}(kT)
  – finally, we obtain a discrete version of the KL expansion:
    X_l[k] = Σ_{i=1}^{K} Σ_{j=1}^{N} Y_{i,j}^l φ_{i,j}[k]
Stochastic Process (cont.4)
• Construct a KN × (n − N) observation matrix
• The discrete expansion X_l[k] = Σ_{i=1}^{K} Σ_{j=1}^{N} Y_{i,j}^l φ_{i,j}[k] then has KN eigenvectors
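The construction of the observation matrix can be sketched as follows; the dimensions (K = 2, N = 3, n = 100) and the random data are illustrative choices, not the paper's. Each column stacks N consecutive samples of all K series, so ordinary PCA/SVD on this matrix captures spatial and temporal correlation at once.

```python
import numpy as np

rng = np.random.default_rng(3)
K, N, n = 2, 3, 100                  # K series, temporal range N, n time samples
x = rng.normal(size=(K, n))

# Stack N consecutive samples of all K series into one KN-vector per time
# step, giving the KN x (n - N) observation matrix
cols = [np.concatenate([x[:, t + j] for j in range(N)]) for t in range(n - N)]
obs = np.stack(cols, axis=1)
print(obs.shape)                     # (6, 97)

# Ordinary PCA/SVD on this matrix then yields the KN eigenvectors of the
# spatio-temporal covariance, i.e. the discrete KL basis
sigma_hat = obs @ obs.T / (obs.shape[1] - 1)
eigvals = np.linalg.eigvalsh(sigma_hat)
print(eigvals.shape)                 # (6,)
```

With N = 1 the stacking is trivial and this reduces to the standard PCA approach, matching the remark in the validation section.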
Stochastic Process (cont.5)
• Use
  Σ̂ = (1 / (n − N − 1)) x x^T
  to estimate all the needed spatio-temporal covariances
Data Set and Metrics
• Three weeks of NetFlow data collected on one of the peering links of a medium-sized ISP (SWITCH, AS559)
• Recorded in August 2007
  – comprises a variety of traffic anomalies occurring in daily operation, such as network scans, denial-of-service attacks, alpha flows, etc.
Data Set and Metrics (cont.1)
• Computing the detection metrics:
  – distinguish between incoming and outgoing traffic, as well as UDP and TCP flows
  – for each of these four categories, compute seven commonly used traffic features:
    • byte counts
    • packet counts
    • flow counts
    • source and destination IP address entropy
    • source and destination IP address counts
Data Set and Metrics (cont.2)
• All metrics are obtained by aggregating the traffic in 15-minute intervals, resulting in a 28 × 96 matrix per measurement day
• Anomalies were identified by visual inspection
• This resulted in 28 detected anomalous events in UDP traffic and 73 in TCP traffic
Data Set and Metrics (cont.3)
• Use the vector of metrics from the first two days for building the model
• Derive a spatio-temporal correlation matrix with the temporal correlation range set to N = 1, ..., 5
  – note that setting N = 1 gives the standard PCA approach
  – apply the SVD decomposition to the data, resulting in a basis change matrix
ROC curves
• A Receiver Operating Characteristic (ROC) curve combines the two parameters, the false positive rate and the true positive rate, and captures the essential trade-off between them
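The trade-off can be traced out directly from detector scores; a small sketch with hypothetical scores and labels (illustrative values, not results from the paper):

```python
import numpy as np

# Hypothetical anomaly scores and ground truth (1 = anomalous interval)
scores = np.array([0.9, 0.8, 0.7, 0.6, 0.55, 0.5, 0.4, 0.3, 0.2, 0.1])
labels = np.array([1,   1,   0,   1,   0,    0,   1,   0,   0,   0])

# Sweep the detection threshold: each threshold gives one (FPR, TPR) point
thresholds = np.sort(np.unique(scores))[::-1]
tpr = [(scores >= t)[labels == 1].mean() for t in thresholds]
fpr = [(scores >= t)[labels == 0].mean() for t in thresholds]

# Area under the ROC curve summarizes the trade-off in a single value
auc = sum((fpr[i + 1] - fpr[i]) * (tpr[i + 1] + tpr[i]) / 2
          for i in range(len(fpr) - 1))
print(round(auc, 3))                 # 0.833
```

A detector whose curve sits closer to the top-left corner (high TPR at low FPR) is better; the area under the curve condenses this into one number.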
ROC curves (cont.2)
[Figure: ROC curve plots]
ROC curves (cont.3)
• The comparison of ROC curves shows a considerable improvement in anomaly detection performance when the KL expansion with N = 2, 3 is used, consistently for both UDP and TCP traffic, and thereafter a decrease for N ≥ 4
Effect of non-stationarity
• Stationarity issue: for N ≥ 4 the performance decreases
  – as N increases, the model contains more parameters and becomes more sensitive to the stationarity of the traffic metrics
Conclusion
• Direct application of the PCA method results in poor performance in terms of ROC curves
• The correct framework is not classical PCA but rather the Karhunen-Loève expansion
• A Galerkin method is provided for developing a predictive model; an important improvement is attained when temporal correlation is considered
Q&A
Thank you!