Anomaly Detection with Extreme Value Theory A. Siffer, P-A Fouque, A. Termier and C. Largouet April 26, 2017 Contents Context Providing better thresholds Finding anomalies in streams Application to intrusion detection A more general framework 1 Context General motivations ⊸ Massive usage of the Internet 2 General motivations ⊸ Massive usage of the Internet • More and more vulnerabilities 2 General motivations ⊸ Massive usage of the Internet • More and more vulnerabilities • More and more threats 2 General motivations ⊸ Massive usage of the Internet • More and more vulnerabilities • More and more threats ⊸ Awareness of the sensitive data and infrastructures 2 General motivations ⊸ Massive usage of the Internet • More and more vulnerabilities • More and more threats ⊸ Awareness of the sensitive data and infrastructures ⇒ Network security : a major concern 2 A Solution ⊸ IDS (Intrusion Detection System) • Monitor traffic • Detect attacks 3 A Solution ⊸ IDS (Intrusion Detection System) • Monitor traffic • Detect attacks ⊸ Current methods : rule-based • Work fine on common and well-known attacks • Cannot detect new attacks 3 A Solution ⊸ IDS (Intrusion Detection System) • Monitor traffic • Detect attacks ⊸ Current methods : rule-based • Work fine on common and well-known attacks • Cannot detect new attacks ⊸ Emerging methods : anomaly-based • Use the network data to estimate a normal behavior • Apply algorithms to detect abnormal events (→ attacks) 3 Overview ⊸ Basic scheme data Algorithm alerts 4 Overview ⊸ Basic scheme data Algorithm alerts ⊸ Many ”standard” algorithms have been tested 4 Overview ⊸ Basic scheme data Algorithm alerts ⊸ Many ”standard” algorithms have been tested ⊸ Complex pipelines are emerging (ensemble/hybrid techniques) data alerts 4 Inherent problem ⊸ Algorithms are not magic • They give some information about data (scores) 5 Inherent problem ⊸ Algorithms are not magic • They give some information about data (scores) • But the decision often rely on a human choice if score>threshold then trigger alert 5 Inherent problem ⊸ Algorithms are not magic • They give some information about data (scores) • But the decision often rely on a human choice if score>threshold then trigger alert ⊸ The thresholds are often hard-set • Expertise • Fine-tuning • Distribution assumption 5 Inherent problem ⊸ Algorithms are not magic • They give some information about data (scores) • But the decision often rely on a human choice if score>threshold then trigger alert ⊸ The thresholds are often hard-set • Expertise • Fine-tuning • Distribution assumption ⊸ Our idea: provide dynamic threshold with a probabilistic meaning 5 Providing better thresholds My problem 0.025 Frequency 0.020 0.015 0.010 0.005 0.000 0 25 50 75 100 125 150 Daily payment by credit card ( ) 175 200 6 My problem 0.025 Frequency 0.020 0.015 0.010 0.005 0.000 0 25 50 75 100 125 150 Daily payment by credit card ( ) 175 200 ⊸ How to set zq such that P(X€ > zq ) < q ? 6 Solution 1: empirical approach 0.025 Frequency 0.020 0.015 0.010 0.005 0.000 0 25 50 75 100 125 150 Daily payment by credit card ( ) 175 200 7 Solution 1: empirical approach 0.025 1−q q Frequency 0.020 0.015 0.010 zq 0.005 0.000 0 25 50 75 100 125 150 Daily payment by credit card ( ) 175 200 7 Solution 1: empirical approach 0.025 1−q q Frequency 0.020 0.015 0.010 zq 0.005 0.000 0 25 50 75 100 125 150 Daily payment by credit card ( ) 175 ⊸ Drawbacks: stuck in the interval, poor resolution 200 7 Solution 2: Standard Model 0.025 Frequency 0.020 0.015 0.010 0.005 0.000 0 25 50 75 100 125 150 Daily payment by credit card ( ) 175 200 8 Solution 2: Standard Model 0.025 Frequency 0.020 0.015 0.010 0.005 0.000 0 25 50 75 100 125 150 Daily payment by credit card ( ) 175 200 8 Solution 2: Standard Model 0.025 Frequency 0.020 0.015 0.010 zq 0.005 0.000 0 25 50 75 100 125 150 Daily payment by credit card ( ) 175 200 8 Solution 2: Standard Model 0.025 Frequency 0.020 0.015 0.010 zq 0.005 0.000 0 25 50 75 100 125 150 Daily payment by credit card ( ) 175 ⊸ Drawbacks: manual step, distribution assumption 200 8 Realities 0.015 0.20 0.15 0.10 0.05 0.00 0.010 0.005 0.000 0.020 0.015 0.010 0.005 0.000 0 100 200 300 16 18 0 25 20 22 24 0.03 0.02 0.01 20 40 60 80 100 0.00 50 75 ⊸ Different clients and/or temporal drift 9 Results Properties statistical guarantees easy to adapt high resolution Empirical quantile Yes Yes No Standard model Yes No Yes 10 Inspection of extreme events 0.025 Frequency 0.020 0.015 0.010 Probability estimation ? 0.005 0.000 0 25 50 75 100 125 150 Daily payment by credit card ( ) 175 200 11 Inspection of extreme events 0.025 Frequency 0.020 0.015 0.010 Probability estimation ? 0.005 0.000 0 25 50 75 100 125 150 Daily payment by credit card ( ) 175 200 11 Extreme Value Theory 12 Extreme Value Theory ⊸ Main result (Fisher-Tippett-Gnedenko, 1928) The extreme values of any distribution have nearly the same distribution (called Extreme Value Distribution) 12 Extreme Value Theory ⊸ Main result (Fisher-Tippett-Gnedenko, 1928) The extreme values of any distribution have nearly the same distribution (called Extreme Value Distribution) heavy tail exponential tail bounded tail 0.6 (X > x) 0.5 0.4 0.3 0.2 0.1 0.0 0 1 2 3 x 4 5 6 12 An impressive analogy ⊸ Let X1 , X2 , . . . Xn a sequence of i.i.d. random variables with Sn = n ∑ i=1 Xi Mn = max (Xi ) 1≤i≤n 13 An impressive analogy ⊸ Let X1 , X2 , . . . Xn a sequence of i.i.d. random variables with Sn = n ∑ i=1 Xi Mn = max (Xi ) 1≤i≤n ⊸ Central Limit Theorem Sn − nµ d √ −→ N (0, σ 2 ) n 13 An impressive analogy ⊸ Let X1 , X2 , . . . Xn a sequence of i.i.d. random variables with Sn = n ∑ i=1 Xi Mn = max (Xi ) 1≤i≤n ⊸ Central Limit Theorem Sn − nµ d √ −→ N (0, σ 2 ) n ⊸ FTG Theorem M n − an d −→ EVD(γ) bn 13 A more practical result ⊸ Second theorem of EVT (Pickands-Balkema-de Haan, 1974) The excesses over a high threshold follow a Generalized Pareto Distribution (with parameters γ, σ) 14 A more practical result ⊸ Second theorem of EVT (Pickands-Balkema-de Haan, 1974) The excesses over a high threshold follow a Generalized Pareto Distribution (with parameters γ, σ) ⊸ What does it imply ? • we have a model for extreme events • we can compute zq for q as small as desired 14 How to use EVT ⊸ Get some data X1 , X2 . . . Xn ⊸ Set a high threshold t and retrieve the excesses Yj = Xkj − t when Xkj > t 15 How to use EVT ⊸ Get some data X1 , X2 . . . Xn ⊸ Set a high threshold t and retrieve the excesses Yj = Xkj − t when Xkj > t ⊸ Fit a GPD to the Yj (→ find parameters γ, σ) 15 How to use EVT ⊸ Get some data X1 , X2 . . . Xn ⊸ Set a high threshold t and retrieve the excesses Yj = Xkj − t when Xkj > t ⊸ Fit a GPD to the Yj (→ find parameters γ, σ) ⊸ Compute zq such as P(X > zq ) < q 15 How to use EVT ⊸ Get some data X1 , X2 . . . Xn ⊸ Set a high threshold t and retrieve the excesses Yj = Xkj − t when Xkj > t ⊸ Fit a GPD to the Yj (→ find parameters γ, σ) ⊸ Compute zq such as P(X > zq ) < q q X1 , X2 . . . Xn EVT zq 15 Finding anomalies in streams Streaming Peaks-Over-Threshold (SPOT) algorithm 16 Streaming Peaks-Over-Threshold (SPOT) algorithm q (initial batch) X1 , X2 . . . Xn 0.30 Calibration 0.20 0.10 0 trigger alarm (stream) Xi>n yes Xi > zq zq t 20 40 60 80 100 120 update model yes no Xi > t no drop 16 Streaming Peaks-Over-Threshold (SPOT) algorithm q (initial batch) X1 , X2 . . . Xn 0.30 Calibration 0.20 0.10 0 trigger alarm (stream) Xi>n yes Xi > zq zq t 20 40 60 80 100 120 update model yes no Xi > t no drop 16 Streaming Peaks-Over-Threshold (SPOT) algorithm q (initial batch) X1 , X2 . . . Xn 0.30 Calibration 0.20 0.10 0 trigger alarm (stream) Xi>n yes Xi > zq zq t 20 40 60 80 100 120 update model yes no Xi > t no drop 16 Streaming Peaks-Over-Threshold (SPOT) algorithm q (initial batch) X1 , X2 . . . Xn 0.30 Calibration 0.20 0.10 0 trigger alarm (stream) Xi>n yes Xi > zq zq t 20 40 60 80 100 120 update model yes no Xi > t no drop 16 Streaming Peaks-Over-Threshold (SPOT) algorithm q (initial batch) X1 , X2 . . . Xn 0.30 Calibration 0.20 0.10 0 trigger alarm (stream) Xi>n yes Xi > zq zq t 20 40 60 80 100 120 update model yes no Xi > t no drop 16 Streaming Peaks-Over-Threshold (SPOT) algorithm q (initial batch) X1 , X2 . . . Xn 0.30 Calibration 0.20 0.10 0 trigger alarm (stream) Xi>n yes Xi > zq zq t 20 40 60 80 100 120 update model yes no Xi > t no drop 16 Streaming Peaks-Over-Threshold (SPOT) algorithm q (initial batch) X1 , X2 . . . Xn 0.30 Calibration 0.20 0.10 0 trigger alarm (stream) Xi>n yes Xi > zq zq t 20 40 60 80 100 120 update model yes no Xi > t no drop 16 Streaming Peaks-Over-Threshold (SPOT) algorithm q (initial batch) X1 , X2 . . . Xn 0.30 Calibration 0.20 0.10 0 trigger alarm (stream) Xi>n yes Xi > zq zq t 20 40 60 80 100 120 update model yes no Xi > t no drop 16 Streaming Peaks-Over-Threshold (SPOT) algorithm q (initial batch) X1 , X2 . . . Xn 0.30 Calibration 0.20 0.10 0 trigger alarm (stream) Xi>n yes Xi > zq zq t 20 40 60 80 100 120 update model yes no Xi > t no drop 16 Streaming Peaks-Over-Threshold (SPOT) algorithm q (initial batch) X1 , X2 . . . Xn 0.30 Calibration 0.20 0.10 0 trigger alarm (stream) Xi>n yes Xi > zq zq t 20 40 60 80 100 120 update model yes no Xi > t no drop 16 Streaming Peaks-Over-Threshold (SPOT) algorithm q (initial batch) X1 , X2 . . . Xn 0.30 Calibration 0.20 0.10 0 trigger alarm (stream) Xi>n yes Xi > zq zq t 20 40 60 80 100 120 update model yes no Xi > t no drop 16 Can we trust that threshold zq ? ⊸ An example with ground truth : a Gaussian White Noise • 40 streams with 200 000 iid variables drawn from N (0, 1) • q = 10−3 ⇒ theoretical threshold zth ≃ 3.09 17 Can we trust that threshold zq ? ⊸ An example with ground truth : a Gaussian White Noise • 40 streams with 200 000 iid variables drawn from N (0, 1) • q = 10−3 ⇒ theoretical threshold zth ≃ 3.09 ⊸ Averaged relative error Relative error 0.04 0.03 0.02 0.01 0 50000 100000 150000 Number of observations 200000 17 Application to intrusion detection About the data ⊸ Lack of relevant public datasets to test the algorithms ... 18 About the data ⊸ Lack of relevant public datasets to test the algorithms ... ⊸ KDD99 ? See [McHugh 2000] and [Mahoney & Chan 2003] 18 About the data ⊸ Lack of relevant public datasets to test the algorithms ... ⊸ KDD99 ? See [McHugh 2000] and [Mahoney & Chan 2003] ⊸ We rather use MAWI • 15 min a day of real traffic (.pcap file) • Anomaly patterns given by the MAWILab [Fontugne et al. 2010] with taxonomy [Mazel et al. 2014] 18 About the data ⊸ Lack of relevant public datasets to test the algorithms ... ⊸ KDD99 ? See [McHugh 2000] and [Mahoney & Chan 2003] ⊸ We rather use MAWI • 15 min a day of real traffic (.pcap file) • Anomaly patterns given by the MAWILab [Fontugne et al. 2010] with taxonomy [Mazel et al. 2014] ⊸ Preprocessing step : raw .pcap → NetFlow format (only metadata) 18 An example to detect network syn scan ⊸ The ratio of SYN packets : relevant feature to detect network scan [Fernandes & Owezarski 2009] 19 An example to detect network syn scan ratio of SYN packets in 50ms time window ⊸ The ratio of SYN packets : relevant feature to detect network scan [Fernandes & Owezarski 2009] 0.8 0.6 0.4 0.2 0.0 0.0 100.0 200.0 300.0 400.0 500.0 600.0 700.0 800.0 Time (s) 19 An example to detect network syn scan ratio of SYN packets in 50ms time window ⊸ The ratio of SYN packets : relevant feature to detect network scan [Fernandes & Owezarski 2009] 0.8 0.6 0.4 0.2 0.0 0.0 100.0 200.0 300.0 400.0 500.0 600.0 700.0 800.0 Time (s) ⊸ Goal: find peaks 19 SPOT results ⊸ Parameters : q = 10−4 , n = 2000 (from the previous day record) 20 SPOT results ratio of SYN packets in 50ms time window ⊸ Parameters : q = 10−4 , n = 2000 (from the previous day record) 0.8 0.6 0.4 0.2 0.0 0.0 100.0 200.0 300.0 400.0 500.0 600.0 700.0 800.0 Time (s) 20 Do we really flag scan attacks ? ⊸ The main parameter q: a False Positive regulator 21 Do we really flag scan attacks ? ⊸ The main parameter q: a False Positive regulator 100.0 8. 10 −5 −5 1. 10 −5 5. 10 TPr (%) 80.0 6. 10 −3 60.0 40.0 9. 10 −6 20.0 0.0 0.0 8. 10 −6 0.5 1.0 1.5 2.0 2.5 FPr (%) 3.0 3.5 4.0 4.5 21 Do we really flag scan attacks ? ⊸ The main parameter q: a False Positive regulator 100.0 8. 10 −5 −5 1. 10 −5 5. 10 TPr (%) 80.0 6. 10 −3 60.0 40.0 9. 10 −6 20.0 0.0 0.0 8. 10 −6 0.5 1.0 1.5 2.0 2.5 FPr (%) 3.0 3.5 4.0 4.5 ⊸ 86% of scan flows detected with less than 4% of FP 21 A more general framework SPOT Specifications ⊸ A single main parameter q • With a probabilistic meaning → P(X > zq ) < q • False Positive regulator 22 SPOT Specifications ⊸ A single main parameter q • With a probabilistic meaning → P(X > zq ) < q • False Positive regulator ⊸ Stream capable • Incremental learning • Fast (∼ 1000 values/s) • Low memory usage (only the excesses) 22 Other things ? ⊸ SPOT • performs dynamic thresholding without distribution assumption • uses it to detect network anomalies 23 Other things ? ⊸ SPOT • performs dynamic thresholding without distribution assumption • uses it to detect network anomalies ⊸ But it could be adapted to 23 Other things ? ⊸ SPOT • performs dynamic thresholding without distribution assumption • uses it to detect network anomalies ⊸ But it could be adapted to • compute upper and lower thresholds 23 Other things ? ⊸ SPOT • performs dynamic thresholding without distribution assumption • uses it to detect network anomalies ⊸ But it could be adapted to • compute upper and lower thresholds • other fields 23 Other things ? ⊸ SPOT • performs dynamic thresholding without distribution assumption • uses it to detect network anomalies ⊸ But it could be adapted to • compute upper and lower thresholds • other fields • drifting contexts (with an additional parameter) → DSPOT 23 A recent example 24 A recent example ⊸ Thursday the 9th of February 2017 24 A recent example ⊸ Thursday the 9th of February 2017 • 9h : explosion at Flamanville nuclear plant 24 A recent example ⊸ Thursday the 9th of February 2017 • 9h : explosion at Flamanville nuclear plant • 11h : official declaration of the incident by EDF 24 A recent example ⊸ Thursday the 9th of February 2017 • 9h : explosion at Flamanville nuclear plant • 11h : official declaration of the incident by EDF ⊸ What about the EDF stock prices ? 24 2 2 4 2 9 0 3 4 1 4 09 :0 09 :4 10 :3 11 :3 12 :1 13 :3 14 :4 15 :3 16 :2 17 :1 EDF stock price ( ) EDF stock prices 9.40 9.35 9.30 9.25 9.20 9.15 9.10 9.05 Time 25 2 2 4 2 9 0 3 4 1 4 09 :0 09 :4 10 :3 11 :3 12 :1 13 :3 14 :4 15 :3 16 :2 17 :1 EDF stock price ( ) EDF stock prices 9.40 9.35 9.30 9.25 9.20 9.15 9.10 9.05 Time 25 Conclusion 26 Conclusion ⊸ Context: A great deal of work has been done to develop anomaly detection algorithms 26 Conclusion ⊸ Context: A great deal of work has been done to develop anomaly detection algorithms ⊸ Problem: Decision thresholds rely on either distribution assumption or expertise 26 Conclusion ⊸ Context: A great deal of work has been done to develop anomaly detection algorithms ⊸ Problem: Decision thresholds rely on either distribution assumption or expertise ⊸ Our solution: Building dynamic threshold with a probabilistic meaning 26 Conclusion ⊸ Context: A great deal of work has been done to develop anomaly detection algorithms ⊸ Problem: Decision thresholds rely on either distribution assumption or expertise ⊸ Our solution: Building dynamic threshold with a probabilistic meaning • Application to detect network anomalies 26 Conclusion ⊸ Context: A great deal of work has been done to develop anomaly detection algorithms ⊸ Problem: Decision thresholds rely on either distribution assumption or expertise ⊸ Our solution: Building dynamic threshold with a probabilistic meaning • Application to detect network anomalies • But a general tool to monitor online time series in a blind way 26 Conclusion ⊸ Context: A great deal of work has been done to develop anomaly detection algorithms ⊸ Problem: Decision thresholds rely on either distribution assumption or expertise ⊸ Our solution: Building dynamic threshold with a probabilistic meaning • Application to detect network anomalies • But a general tool to monitor online time series in a blind way ⊸ Future: Adapt the method to higher dimensions 26
© Copyright 2026 Paperzz