
Towards Improved Sensitivity,
Specificity, and Timeliness of
Syndromic Surveillance
Systems
Anna L. Buczak, PhD,
Linda J. Moniz, PhD, Joseph Lombardo, MS
PHIN 2008, Session F7, August 27, 2008
This work was supported by the JHU/APL Internal
Research and Development (IR&D) Program
Outline
 Motivation
 Classical disease outbreak detection methods
 Novel machine learning methodology for disease
outbreak detection
 Results
 Conclusions
 Future directions
Motivation
 Develop methodology for reliable detection of disease
outbreaks
 Most existing methods use univariate statistics, i.e., they look at each syndrome / subsyndrome / age / gender combination separately
 Consequence: proliferation of false alarms
 Our goal: develop multivariate models for detecting abnormal relationships between time series
 Expected benefit: reduced false alarms
Approaches to Outbreak Detection
 Two broad types of approaches:
 Anomaly detection
 Detectors flag any anomalous behavior
 Statistical methods: CUSUM, C1, C2, C3, EWMA
 Machine learning methods: clustering techniques, SVMs
 Specific disease outbreak detection
 Detectors are geared towards a specific disease
 In depth knowledge about given disease manifestation
needed
 Separate model needed for each disease
 Methods: Bayesian networks, Markov Decision
Processes
This talk will concentrate on anomaly detection methods
Statistical disease outbreak detection
methods
 They often determine whether the counts in a given syndrome/
subsyndrome time series are unusually high and thus worth investigating.
 Statistical detection algorithms: C2, C3, EWMA
 C2 & C3: 7-day baseline, 2-day guard band
 Individual day statistic for day j with lag n:
$$S_{j,n} = \max\left\{0,\ \frac{\mathrm{Count}_j - (\mu_n + \sigma_n)}{\sigma_n}\right\}$$
 where $\mu_n$ is the 7-day average with n-day lag (so $\mu_3$ is the mean of counts on days j-3 through j-9)
 $\sigma_n$ is the standard deviation of the same 7-day window
 C2 statistic for day k is $S_{k,3}$ (2-day guard band, so the lag is n = 3)
 Alerts if the individual day statistic exceeds the threshold
 C3 statistic for day k is $S_{k,3} + S_{k-1,3} + S_{k-2,3}$
 Alerts when the statistic for day k exceeds the threshold
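The C2 and C3 definitions above can be sketched in a few lines of Python. This is a minimal illustration, not the ESSENCE implementation: the function names are hypothetical, the sample standard deviation is an assumption (the slide does not say sample or population), and the fallback for a zero standard deviation is an assumed safeguard.

```python
from statistics import mean, stdev

BASELINE_DAYS = 7  # baseline length stated on the slide
LAG = 3            # baseline ends 3 days back, leaving a 2-day guard band


def day_statistic(counts, j, lag=LAG, baseline=BASELINE_DAYS):
    """S_{j,n}: excess of counts[j] over (mean + 1 sd) of the lagged baseline, in sd units."""
    start = j - lag - baseline + 1          # day j-9 for lag=3, baseline=7
    if start < 0:
        raise ValueError("not enough history for the baseline window")
    window = counts[start : j - lag + 1]    # days j-9 .. j-3 inclusive
    mu = mean(window)
    sigma = stdev(window)                   # sample standard deviation (assumption)
    if sigma == 0:
        sigma = 1.0                         # assumed safeguard against division by zero
    return max(0.0, (counts[j] - (mu + sigma)) / sigma)


def c2(counts, k):
    """C2 statistic for day k: the individual day statistic S_{k,3}."""
    return day_statistic(counts, k)


def c3(counts, k):
    """C3 statistic for day k: S_{k,3} + S_{k-1,3} + S_{k-2,3}."""
    return sum(day_statistic(counts, k - i) for i in range(3))
```

With a list of daily counts, `c2(counts, k)` needs at least 9 days of history before day k and `c3(counts, k)` needs 11; an alert would be raised when the returned value exceeds a chosen threshold.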
[Figure: timeline from Day-9 to Day 0 (current count). The baseline for both C2 and C3 covers days -3 to -9.]
Sample Univariate Algorithm Output: C2
Accessible Alerting Algorithms: courtesy of Dr. Howard Burkom, JHU/APL
Sample Univariate Algorithm Output: C3
EWMA
 Exponentially Weighted Moving Average (EWMA): average with most weight on the most recent count Xt
$$Y_1 = X_1$$
$$Y_t = \omega X_t + (1 - \omega)\, Y_{t-1}$$
 28-day baseline, 2-day guard band
 Test statistic:
$$\frac{Y_t - \mu_t}{\sigma_t \sqrt{\omega / (2 - \omega)}} \ge \text{threshold}$$
 Often threshold = 3
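The EWMA recursion and test statistic can be sketched as follows. This is a rough illustration under stated assumptions: the baseline mean and standard deviation are taken from the raw counts in the 28 days preceding the 2-day guard band, the smoothing starts at the first available count, and the sample standard deviation is used. None of these details are specified on the slide, and the function names are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def ewma(counts, omega=0.8):
    """Y_1 = X_1; Y_t = omega * X_t + (1 - omega) * Y_{t-1}."""
    smoothed = [float(counts[0])]
    for x in counts[1:]:
        smoothed.append(omega * x + (1 - omega) * smoothed[-1])
    return smoothed


def ewma_test_statistic(counts, t, omega=0.8, baseline=28, guard=2):
    """Standardized EWMA value for day t against a lagged baseline window."""
    start = t - guard - baseline
    if start < 0:
        raise ValueError("not enough history for the baseline window")
    window = counts[start : t - guard]   # 28 days ending just before the guard band
    mu = mean(window)
    sigma = stdev(window) or 1.0         # assumed safeguard when the window is flat
    y_t = ewma(counts[: t + 1], omega)[-1]
    return (y_t - mu) / (sigma * sqrt(omega / (2 - omega)))
```

A day would be flagged when `ewma_test_statistic(counts, t)` is at least the threshold, for example 3.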
Sample Univariate Algorithm Output:
EWMA
Machine Learning for Disease Outbreak
Detection
 Approach:
 Learn the model of regular activities
 Use one-class Support Vector Machine
 Detect anomaly based on its dissimilarity from regular activities
 Advantages:
 Only normal behavior data needed
 Detectors flag any anomalous behavior
 Capable of detecting anomalies for new pathogens
 No need for separate models for each disease
[Figure: data points labeled Normal and Anomalous. In SVMs, a hyperplane in n-dimensional space divides data into two classes in terms of the largest margin.]
Support Vector Machines (SVM)
 SVM learning algorithm developed by Vapnik
based on statistical learning theory
 Learning problem:
 Find the best separating hyperplane
dividing two classes in terms of the largest
margin
 To construct a classifier for a given data
set, SVM solves a quadratic programming
problem
 To solve non-linear problems SVM employs
inner-product kernels such as: polynomial,
RBF, sigmoid, etc.
 SVMs build the decision surface using only
those training examples that are near the
boundary region
 Data points at the margin are called
support vectors
 One-Class SVM used
 Proposed by Schölkopf et al. (1999)
 Only positive training examples are needed
 SVM classifiers are based on hyperplanes corresponding to decision functions:
$$\mathrm{Class}(x) = \mathrm{sign}(w \cdot x + b)$$
 where w is the weight vector, b is the threshold of the decision rule, and x is the classified pattern.
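To make the decision function concrete, here is a minimal sketch of evaluating a linear decision rule for one pattern. The weights and threshold below are made-up illustrative values, not learned ones. The "+ b" sign convention and the treatment of a score of exactly zero as the positive class are assumptions, and no kernel is applied.

```python
def classify(w, x, b):
    """Return +1 or -1 according to the sign of w . x + b (assumed convention)."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score >= 0 else -1


# Hypothetical weight vector and threshold, for illustration only.
w = [1.0, -2.0]
b = 0.5
label_a = classify(w, [2.0, 1.0], b)   # score = 2 - 2 + 0.5 = 0.5
label_b = classify(w, [0.0, 1.0], b)   # score = 0 - 2 + 0.5 = -1.5
```

In a one-class setting, one natural reading is that +1 corresponds to "Normal" and -1 to "Anomalous", though that mapping is an assumption here.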
Details of the Approach
[Figure: system diagram. Data streams (3021), EWMA smoothing, SVM module (GI SVM, Fever SVM, Resp SVM, …, Combined Syn SVM), output: Normal / Anomalous.]
Data Sets
 Training
 ESSENCE data - no outbreaks (180 days)
 Flu season weeks removed from training data
 Testing (Recall)
 ESSENCE data - no outbreaks (132 days that were not
used in training)
 Simulated outbreaks added to real background data:
 Tularemia
 Hep A – Sets 1, 2 & 3
 Real problem
Real background data with simulated outbreaks
3021 Data Streams
 Number of data streams: (11+148)*19 = 3021
 Time series for each syndrome & subsyndrome: total / gender / age group / county
 Syndromes: Bot_Like, Fever, GI, Hem_Ill, Loc_Les, Lymph, Neuro, …
 Subsyndromes: AbdominalCramps, AbdominalPain, AbdominalPainGroup, AbdominalTenderness, Abscess, AcuteBloodAbnormalities, …, Unresponsive, UrinaryTract, ViralSyndrome, Vomiting, Wheezing
 Breakdowns: Total, Male, Female, Age 0-4, Age 5-17, …, Alexandria, …, Washington
 Example streams: Viral Syndrome: 0-4, GI - Male, Vomiting - Female
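The stream count can be checked by enumerating every (series, breakdown) pair. The sketch below assumes that 11 is the number of syndromes, 148 the number of subsyndromes, and 19 the number of breakdowns (total, gender, age group, county); the slide gives only the arithmetic, so this reading and the placeholder names are assumptions.

```python
from itertools import product

# Placeholder names; the real syndrome, subsyndrome, and breakdown labels are not all listed.
series = [f"syndrome_{i}" for i in range(11)] + [f"subsyndrome_{i}" for i in range(148)]
breakdowns = [f"breakdown_{i}" for i in range(19)]

# One time series per (syndrome or subsyndrome, breakdown) combination.
streams = [f"{s} - {b}" for s, b in product(series, breakdowns)]
n_streams = len(streams)
```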
Results: Univariate EWMA
 omega = 0.8, baseline = 28 days, guard band = 2 days
 Alert when EWMA Test Statistic >= 3
 Average of alarms per day on normal data: 28.1
 Proliferation of false alarms
[Figure: All Counts and All EWMA Alerts (alerts at 1 sigma) per day, 8/19/2006 to 8/14/2007.]
Results
Polling Univariate EWMA Results
 omega = 0.8, baseline = 28 days, guard band = 2 days
 Results per day:
 Specificity: 94.0% (th = 53), Sensitivity: 29.0%
 Specificity: 90.9% (th = 44), Sensitivity: 41.9%
 Specificity: 75.8% (th = 35), Sensitivity: 48.4%
 Specificity: 68.9% (th = 32), Sensitivity: 54.8%
 Results per outbreak:
 Specificity: 94.0%, Sensitivity: 80%
Initial SVM Results
 Only Fever SVM and GI SVM trained
 Results per day:
 Specificity: 94.7%, Sensitivity: 54.8%
 Results per outbreak:
 Specificity: 94.7%, Sensitivity: 100%
Similar Specificity -> SVM has Sensitivity better by 25.8%
Similar Sensitivity -> SVM has Specificity better by 25.8%
[Figure: Sensitivity versus 1 - Specificity for EWMA1, EWMA2, EWMA3, EWMA4, and SVM.]
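The per-day figures above depend on a polling threshold "th" and on how sensitivity and specificity are counted. The sketch below assumes that polling flags a day when at least `th` univariate streams alert, which is one plausible reading of the slide rather than a confirmed rule. Function names and the sample data are hypothetical.

```python
def polled_alerts(alerts_per_day, th):
    """Flag a day when at least `th` univariate streams alert (assumed polling rule)."""
    return [n >= th for n in alerts_per_day]


def sensitivity_specificity(flags, truth):
    """Per-day sensitivity and specificity from boolean alert flags and outbreak labels."""
    tp = sum(1 for f, t in zip(flags, truth) if f and t)
    fn = sum(1 for f, t in zip(flags, truth) if not f and t)
    tn = sum(1 for f, t in zip(flags, truth) if not f and not t)
    fp = sum(1 for f, t in zip(flags, truth) if f and not t)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one outbreak day and one normal day")
    return tp / (tp + fn), tn / (tn + fp)


# Made-up example: number of alerting streams per day and true outbreak days.
alerts_per_day = [20, 45, 55, 60, 25, 40]
outbreak_day = [False, False, True, True, False, True]

strict = sensitivity_specificity(polled_alerts(alerts_per_day, 53), outbreak_day)
loose = sensitivity_specificity(polled_alerts(alerts_per_day, 35), outbreak_day)
```

Lowering the threshold trades specificity for sensitivity, which is the pattern in the four EWMA operating points. The two 25.8% gaps quoted on the slide follow from the reported numbers: 54.8 - 29.0 = 25.8 for sensitivity at similar specificity, and 94.7 - 68.9 = 25.8 for specificity at the same sensitivity.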
Results
[Figure: All Counts and All Added Counts, 5/18/2007 to 7/17/2007, with EWMA alerts at 1 sigma and SVM alerts.]
Timeliness of Detection
 PU EWMA, Specificity: 94.0%
 Avg number of days to detect an outbreak: 2.25
 2 outbreaks not detected at all
 PU EWMA, Specificity: 68.9%
 Avg number of days to detect an outbreak: 1.8
 1 outbreak not detected at all
 SVM, Specificity: 94.7%
 Avg number of days to detect an outbreak: 1.67
 All outbreaks detected
SVM obtains better Timeliness of Detection, higher Specificity for same Sensitivity, and higher Sensitivity for same Specificity than PU EWMA
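A timeliness summary of this kind can be computed from outbreak windows and alert days. The sketch below is illustrative only: it assumes the delay is counted from the first outbreak day (an alert on that day counts as 0 days) and that the average is taken over detected outbreaks only. The slide does not state either convention, and the data are made up.

```python
def days_to_detect(start_day, end_day, alert_days):
    """Days from outbreak start to the first alert inside the outbreak, or None if missed."""
    hits = [d for d in alert_days if start_day <= d <= end_day]
    return min(hits) - start_day if hits else None


def summarize(delays):
    """Average delay over detected outbreaks and the number of missed outbreaks."""
    detected = [d for d in delays if d is not None]
    missed = len(delays) - len(detected)
    average = sum(detected) / len(detected) if detected else None
    return average, missed


# Made-up outbreak windows (start day, end day) and alert days.
outbreaks = [(10, 15), (30, 36), (50, 54)]
alert_days = [3, 12, 13, 31, 70]

delays = [days_to_detect(s, e, alert_days) for s, e in outbreaks]
average_delay, missed = summarize(delays)
```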
Conclusions
 New approach for disease outbreak detection designed
 Very promising initial results obtained for normal data, 4
simulated outbreaks, one real problem
 Favorable comparison of initial SVM results to PU EWMA
results
Future Directions
 Proof-of-concept:
 Remaining SVMs to be trained
 More sophisticated decision fusion module
 Testing on many different types of simulated outbreaks
(varying amplitude, day of injection)
 Testing on real Influenza outbreaks
 Explanation capability for the SVM system:
 Capability for users to drill down to find the reason for the
alert
Contact info:
Dr. Anna L. Buczak
National Security Technology Department
Johns Hopkins University Applied Physics Laboratory
tel. 443-778-9350
e-mail: [email protected]