Anomaly Detection with Extreme Value Theory
A. Siffer, P-A Fouque, A. Termier and C. Largouet
April 26, 2017
Contents
Context
Providing better thresholds
Finding anomalies in streams
Application to intrusion detection
A more general framework
Context
General motivations
⊸ Massive usage of the Internet
• More and more vulnerabilities
• More and more threats
⊸ Awareness of sensitive data and infrastructures
⇒ Network security: a major concern
A Solution
⊸ IDS (Intrusion Detection System)
• Monitor traffic
• Detect attacks
⊸ Current methods: rule-based
• Work fine on common and well-known attacks
• Cannot detect new attacks
⊸ Emerging methods: anomaly-based
• Use network data to estimate normal behavior
• Apply algorithms to detect abnormal events (→ attacks)
Overview
⊸ Basic scheme: data → Algorithm → alerts
⊸ Many "standard" algorithms have been tested
⊸ Complex pipelines are emerging (ensemble/hybrid techniques): data → … → alerts
Inherent problem
⊸ Algorithms are not magic
• They give some information about data (scores)
• But the decision often relies on a human choice:
if score > threshold then trigger alert
⊸ The thresholds are often hard-set
• Expertise
• Fine-tuning
• Distribution assumption
⊸ Our idea: provide a dynamic threshold with a probabilistic meaning
Providing better thresholds
My problem
[Histogram: frequency of daily payment amounts by credit card (€)]
⊸ How to set zq such that P(X > zq) < q?
Solution 1: empirical approach
[Histogram with the empirical quantile zq: a fraction 1 − q of the data lies below zq, a fraction q above]
⊸ Drawbacks: stuck within the range of observed values, poor resolution in the tail
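A minimal sketch of this empirical approach in Python (my own illustration, not code from the talk): zq is simply the empirical (1 − q)-quantile, so it can never go beyond the largest observed value.

import numpy as np

def empirical_threshold(data, q):
    """Empirical z_q: the (1 - q)-quantile of the observed data."""
    return np.quantile(data, 1 - q)

# Illustration on stand-in data (not the real payment series):
payments = np.random.default_rng(0).exponential(scale=30, size=5000)
print(empirical_threshold(payments, q=1e-3))   # for q below 1/n this saturates at max(payments)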
Solution 2: Standard Model
[Histogram with a fitted standard distribution and the corresponding threshold zq]
⊸ Drawbacks: manual step, distribution assumption
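A comparable sketch for the parametric route, assuming purely for illustration that a Gaussian model is chosen; the model choice and the fit are the manual step, and the result is only as good as the distribution assumption.

from scipy.stats import norm

def gaussian_threshold(data, q):
    """Fit a Gaussian to the data and return z_q with P(X > z_q) = q under that model."""
    mu, sigma = norm.fit(data)          # maximum-likelihood mean and standard deviation
    return norm.ppf(1 - q, loc=mu, scale=sigma)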
Realities
[Several histograms of daily payments: the shapes vary widely across clients and over time]
⊸ Different clients and/or temporal drift
Results
Properties               Empirical quantile   Standard model
statistical guarantees   Yes                  Yes
easy to adapt            Yes                  No
high resolution          No                   Yes
Inspection of extreme events
[Histogram of daily payments by credit card (€): how to estimate the probability of values beyond the observed range?]
Extreme Value Theory
⊸ Main result (Fisher-Tippett-Gnedenko, 1928)
The extreme values of any distribution have nearly the same distribution (called the Extreme Value Distribution)
[Plot of P(X > x) for x from 0 to 6: heavy tail, exponential tail, and bounded tail distributions]
An impressive analogy
⊸ Let X1, X2, . . . , Xn be a sequence of i.i.d. random variables with

S_n = \sum_{i=1}^{n} X_i \qquad M_n = \max_{1 \le i \le n} X_i

⊸ Central Limit Theorem

\frac{S_n - n\mu}{\sqrt{n}} \xrightarrow{d} \mathcal{N}(0, \sigma^2)

⊸ FTG Theorem

\frac{M_n - a_n}{b_n} \xrightarrow{d} \mathrm{EVD}(\gamma)
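For reference, the limiting Extreme Value Distribution in the FTG theorem has a standard one-parameter form (a textbook statement, not specific to this work):

G_\gamma(x) = \exp\left(-(1 + \gamma x)^{-1/\gamma}\right) \quad \text{for } 1 + \gamma x > 0,
\qquad G_0(x) = \exp\left(-e^{-x}\right)

where γ > 0, γ = 0 and γ < 0 correspond to the heavy, exponential and bounded tails pictured above.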
A more practical result
⊸ Second theorem of EVT (Pickands-Balkema-de Haan, 1974)
The excesses over a high threshold follow a Generalized Pareto
Distribution (with parameters γ, σ)
⊸ What does it imply?
• We have a model for extreme events
• We can compute zq for q as small as desired
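Again for reference, the Generalized Pareto Distribution of the excesses Y = X − t has the standard tail form (textbook statement; for γ < 0 the support is bounded by −σ/γ):

P(Y > y) = \left(1 + \frac{\gamma y}{\sigma}\right)^{-1/\gamma} \quad \text{for } y \ge 0
\qquad \left(\text{and } e^{-y/\sigma} \text{ in the limit } \gamma = 0\right)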
How to use EVT
⊸ Get some data X1, X2 . . . Xn
⊸ Set a high threshold t and retrieve the excesses Yj = Xkj − t when Xkj > t
⊸ Fit a GPD to the Yj (→ find parameters γ, σ)
⊸ Compute zq such that P(X > zq) < q

q, X1, X2 . . . Xn → EVT → zq
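A minimal Python sketch of these four steps (my own illustration under stated assumptions, not the authors' implementation): t is taken as a high empirical quantile, the GPD is fitted by maximum likelihood via scipy, and zq follows from the usual Peaks-Over-Threshold quantile formula.

import numpy as np
from scipy.stats import genpareto

def evt_threshold(data, q, t=None):
    """Estimate z_q such that P(X > z_q) < q with the Peaks-Over-Threshold method."""
    data = np.asarray(data, dtype=float)
    n = data.size
    if t is None:
        t = np.quantile(data, 0.98)                    # assumed choice of the "high" threshold
    excesses = data[data > t] - t                      # Y_j = X_kj - t
    gamma, _, sigma = genpareto.fit(excesses, floc=0)  # GPD fit: gamma = shape, sigma = scale
    r = q * n / excesses.size
    if abs(gamma) < 1e-8:                              # exponential limit of the GPD
        return t - sigma * np.log(r)
    return t + (sigma / gamma) * (r ** (-gamma) - 1)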
Finding anomalies in streams
Streaming Peaks-Over-Threshold (SPOT) algorithm
⊸ Calibration: from q and an initial batch X1, X2 . . . Xn, compute a high threshold t and the initial zq
⊸ For each new value Xi (i > n) from the stream:
• if Xi > zq: trigger an alarm
• else if Xi > t: update the model (new excess over t, refit the GPD, recompute zq)
• else: drop the value
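A compact sketch of this loop (an illustration under the same assumptions as the evt_threshold sketch above, not the reference implementation):

import numpy as np
from scipy.stats import genpareto

class SPOT:
    """Streaming Peaks-Over-Threshold: calibrate on a batch, then process values one by one."""

    def __init__(self, q=1e-4):
        self.q = q

    def calibrate(self, batch):
        batch = np.asarray(batch, dtype=float)
        self.n = batch.size
        self.t = np.quantile(batch, 0.98)            # initial threshold t (assumed choice)
        self.peaks = list(batch[batch > self.t] - self.t)
        self._refresh_zq()

    def _refresh_zq(self):
        gamma, _, sigma = genpareto.fit(self.peaks, floc=0)
        r = self.q * self.n / len(self.peaks)
        if abs(gamma) < 1e-8:
            self.zq = self.t - sigma * np.log(r)
        else:
            self.zq = self.t + (sigma / gamma) * (r ** (-gamma) - 1)

    def step(self, x):
        """Process one stream value; return True when an alarm is triggered."""
        if x > self.zq:
            return True                              # alarm: anomalies do not update the model
        self.n += 1
        if x > self.t:                               # genuine peak: update the model
            self.peaks.append(x - self.t)
            self._refresh_zq()
        return False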
Can we trust that threshold zq?
⊸ An example with ground truth: a Gaussian White Noise
• 40 streams of 200 000 i.i.d. variables drawn from N(0, 1)
• q = 10^−3 ⇒ theoretical threshold zth ≃ 3.09
⊸ Averaged relative error
[Plot: averaged relative error of zq vs. number of observations, up to 200 000]
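This check is easy to reproduce with the evt_threshold sketch given earlier (the exact Gaussian quantile serves as ground truth):

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
stream = rng.standard_normal(200_000)     # one Gaussian white-noise stream
z_hat = evt_threshold(stream, q=1e-3)     # EVT estimate from the sketch above
z_th = norm.ppf(1 - 1e-3)                 # theoretical threshold, about 3.09
print(abs(z_hat - z_th) / z_th)           # relative error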
Application to intrusion detection
About the data
⊸ Lack of relevant public datasets to test the algorithms…
⊸ KDD99? See [McHugh 2000] and [Mahoney & Chan 2003]
⊸ We use MAWI instead
• 15 min a day of real traffic (.pcap file)
• Anomaly patterns given by MAWILab [Fontugne et al. 2010], with a taxonomy from [Mazel et al. 2014]
⊸ Preprocessing step: raw .pcap → NetFlow format (metadata only)
An example: detecting a network SYN scan
⊸ The ratio of SYN packets: a relevant feature to detect network scans [Fernandes & Owezarski 2009]
[Plot: ratio of SYN packets in 50 ms time windows vs. time (s), over about 800 s of traffic]
⊸ Goal: find peaks
SPOT results
⊸ Parameters: q = 10^−4, n = 2000 (calibration on the previous day's record)
[Plot: ratio of SYN packets in 50 ms time windows vs. time (s), with the alarms raised by SPOT]
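With the SPOT sketch above, the experiment reads roughly as follows (syn_ratio is a hypothetical array holding the 50 ms SYN-ratio series; the first 2000 values, taken from the previous day, serve for calibration):

detector = SPOT(q=1e-4)
detector.calibrate(syn_ratio[:2000])      # previous day's record (hypothetical variable)
alarms = [i for i, x in enumerate(syn_ratio[2000:], start=2000)
          if detector.step(x)]            # indices of the flagged 50 ms windows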
Do we really flag scan attacks?
⊸ The main parameter q: a False Positive regulator
[ROC-style plot: TPr (%) vs. FPr (%) for several values of q, ranging from 8·10^−6 to 6·10^−3]
⊸ 86% of scan flows detected with fewer than 4% false positives
A more general framework
SPOT Specifications
⊸ A single main parameter q
• With a probabilistic meaning → P(X > zq) < q
• False Positive regulator
⊸ Stream capable
• Incremental learning
• Fast (∼ 1000 values/s)
• Low memory usage (only the excesses)
Other things?
⊸ SPOT
• performs dynamic thresholding without any distribution assumption
• uses these thresholds to detect network anomalies
⊸ But it could be adapted to
• compute upper and lower thresholds
• other application fields
• drifting contexts (with an additional parameter) → DSPOT (see the sketch below)
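A hedged sketch of how the drifting variant could look (my own reading: the additional parameter is a local window size d, and SPOT runs on deviations from a moving average; this is not necessarily the authors' exact DSPOT):

import numpy as np
from collections import deque

class DriftSPOT:
    """Drift-aware variant: apply the SPOT sketch to deviations from a local average."""

    def __init__(self, q=1e-4, d=10):
        self.spot = SPOT(q)                       # SPOT sketch from earlier
        self.window = deque(maxlen=d)

    def calibrate(self, batch):
        batch = np.asarray(batch, dtype=float)
        d = self.window.maxlen
        residuals = [batch[i] - batch[i - d:i].mean() for i in range(d, batch.size)]
        self.spot.calibrate(residuals)
        self.window.extend(batch[-d:])

    def step(self, x):
        res = x - np.mean(self.window)            # deviation from the local level
        alarm = self.spot.step(res)
        if not alarm:
            self.window.append(x)                 # anomalies do not move the local model
        return alarm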
A recent example
⊸ Thursday, February 9, 2017
• 9 a.m.: explosion at the Flamanville nuclear plant
• 11 a.m.: official declaration of the incident by EDF
⊸ What about the EDF stock price?
EDF stock prices
[Plot: EDF stock price (€) vs. time on February 9, 2017, from about 09:00 to 17:15, ranging roughly from 9.05 to 9.40 €]
Conclusion
⊸ Context: A great deal of work has been done to develop anomaly detection algorithms
⊸ Problem: Decision thresholds rely on either a distribution assumption or expert knowledge
⊸ Our solution: Build dynamic thresholds with a probabilistic meaning
• Application to the detection of network anomalies
• But a general tool to monitor online time series without prior knowledge
⊸ Future: Adapt the method to higher dimensions