
Anomaly detection through Bayesian Support Vector Machines
Vasilis A. Sotiris
Michael Pecht
AMSC663 Project Proposal
Detection Algorithm
[Figure: detection algorithm flow. Training data in the input space R^(n x m) is split by a Karhunen-Loève expansion into a model space R^(k x m), k < n, and a residual space R^(l x m), l < n; a decision boundary (positive class vs. negative class) is trained in each space, and the two boundaries are combined into the final detection decision.]
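A minimal sketch of this pipeline, assuming a PCA-based Karhunen-Loève expansion and scikit-learn's SVC as the per-space classifier (the value of k, the data shapes, and the rule for combining the two boundaries are illustrative assumptions, not the project's final design):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.svm import SVC

def split_spaces(X, k):
    """Split observations into a k-dimensional model space and the
    residual space left over after a PCA-based Karhunen-Loeve expansion."""
    pca = PCA(n_components=k).fit(X)
    model = pca.transform(X)                      # model-space projection
    residual = X - pca.inverse_transform(model)   # what the model misses
    return pca, model, residual

# Illustrative training set: rows are observations, +1 normal / -1 abnormal.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (180, 6)), rng.uniform(-4, 4, (20, 6))])
y = np.hstack([np.ones(180), -np.ones(20)])

pca, M, R = split_spaces(X, k=3)
clf_model = SVC(kernel="linear").fit(M, y)        # model-space boundary
clf_resid = SVC(kernel="linear").fit(R, y)        # residual-space boundary

# Combine the two decisions: flag an anomaly if either space says abnormal.
x_new = rng.normal(0, 1, (1, 6))
m_new = pca.transform(x_new)
r_new = x_new - pca.inverse_transform(m_new)
is_anomaly = (clf_model.predict(m_new)[0] < 0) or (clf_resid.predict(r_new)[0] < 0)
print("anomaly" if is_anomaly else "normal")
```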
Basics:
Linear Classification – Separable
• For separable data, the SVM finds a function D(x) that best separates the two classes (with maximum margin M)
• The function D(x) can then be used as a classifier
• Through the support vectors we can
– compress the input space
– detect anomalies
• By minimizing the norm of w we find the line (or linear surface) that best separates the two classes
• The decision function is a linear combination of the weight vector w
[Figure: optimal separating line D(x) in the (x1, x2) plane, with margin M and weight vector w between the normal and abnormal classes; training support vectors and a new observation vector are marked.]
\min_{w}\ \tfrac{1}{2}\,\|w\|^{2} = \tfrac{1}{2}\,w^{T}w

w = \sum_{i=1}^{n} \alpha_i\, y_i\, x_i \qquad (\alpha_i\ \text{are the Lagrange multipliers})

D(x) = \sum_{i=1}^{n} w_i x_i + b = \sum_{i=1}^{n} y_i\,\alpha_i\, x_i^{T}x + b
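A minimal illustration of this separable case, assuming scikit-learn's SVC with a linear kernel and a large C to approximate the hard margin (the data here is synthetic):

```python
import numpy as np
from sklearn.svm import SVC

# Two well-separated synthetic classes: +1 normal, -1 abnormal.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(6, 1, (50, 2))])
y = np.hstack([np.ones(50), -np.ones(50)])

clf = SVC(kernel="linear", C=1e6).fit(X, y)   # large C ~ hard margin

w, b = clf.coef_[0], clf.intercept_[0]        # D(x) = w.x + b
print("number of support vectors:", len(clf.support_vectors_))
print("class of a new observation:", clf.predict([[1.0, 1.0]])[0])
```

The support vectors alone determine w and D(x), which is what allows the input space to be compressed to just those points.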
Basics:
Linear Classification – Inseparable
• Maximize the margin M while minimizing the sum of the slack errors ξ_i
• The function D(x) can again be used as a classifier (now incorporating a degree of error)
[Figure: soft-margin separating line with margin M between the normal and abnormal classes in the (x1, x2) plane; slack errors ξ_i are marked for points that violate the margin, along with training support vectors and a new observation vector.]
\min_{w}\ \tfrac{1}{2}\,\|w\|^{2} = \tfrac{1}{2}\,w^{T}w + C\sum_{i=1}^{n}\xi_i
Nonlinear classification
• For inseparable data, the SVM finds a nonlinear function D(x) that best separates the two classes by:
– use of a kernel map k(·)
– K = F(x_i)^T F(x), the dot product of a feature map F
– an example of a feature map: F(x) = [x², √2 x, 1]^T
• The decision function D(x) requires only the dot product of the feature map F
– it therefore uses the same mathematical framework as the linear classifier
• The class y of the data is determined by the sign of D(x)
[Figure: nonlinear decision boundary D(x) separating the normal and abnormal classes in the (x1, x2) input space; a normal observation is marked.]
D(x) = \sum_{i=1}^{n} w_i x_i + b = \sum_{i=1}^{n} y_i\,\alpha_i\, F(x_i)^{T} F(x) + b

y = \mathrm{sign}(D(x)) = \begin{cases} +1, & D(x) \ge 0 \\ -1, & D(x) < 0 \end{cases}
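To see why only the dot product is needed, take the example feature map above with scalar x; its inner product collapses to a kernel that can be evaluated directly in the input space (the "kernel trick"):

F(x)^{T} F(z) = x^{2}z^{2} + 2xz + 1 = (xz + 1)^{2} = k(x, z)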
Nonlinear classification for detection
[Figure: the same training data shown in the (x1, x2) input space and mapped into an (F1, F2, F3) feature space; a linear boundary found in the feature space maps back to the nonlinear decision boundary D(x) in the input space.]
• Given: a training data set that contains the normal and artificial abnormal data points (blue crosses and red circles, respectively)
• Solve the linear optimization problem to find w and b in the feature space
• Form a nonlinear decision function by mapping back to the input space using the same kernel mapping
• The result is a decision boundary on the given training set that can be used to classify new observations
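A minimal sketch of this training step, assuming scikit-learn's SVC with a degree-2 polynomial kernel (which corresponds to a quadratic feature map like the F(x) shown earlier); the data ranges are made up to mimic the figure:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)

# Normal training data (blue crosses) and artificial abnormal points
# (red circles) scattered widely around the normal cloud.
X_normal = rng.normal(50, 3, size=(200, 2))
X_abnormal = rng.uniform(35, 65, size=(40, 2))

X = np.vstack([X_normal, X_abnormal])
y = np.hstack([np.ones(len(X_normal)), -np.ones(len(X_abnormal))])

# A degree-2 polynomial kernel plays the role of the quadratic feature map;
# an RBF kernel would work the same way.
clf = SVC(kernel="poly", degree=2, coef0=1, C=1.0).fit(X, y)

# Classify new observations: +1 -> normal, -1 -> anomaly.
print(clf.predict([[51.0, 49.0], [70.0, 30.0]]))
```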
Need for a soft decision boundary
• Class predictions are not probabilistic
• The SVM output is a 'hard' binary decision
– We would rather estimate the conditional distribution p(y|x) and thereby capture the uncertainty in each prediction
[Figure: hard vs. soft decision boundary around D(x); an observation that falls just outside the hard boundary could be a false alarm.]
• User-defined model parameters such as C can lead to poor generalization
• Bayesian methods can determine the model parameters from the data

Likelihood function:
p(y \mid x, w) = \prod_{i=1}^{n} p(y_i \mid x_i, w)
Validation
• Training data: simulate an (n x m) matrix of observations
• Test data: use the training data and inject a fault
• Construct D(x) with the BSVM against the training data
• Validation
– detect the injected faults
– reduce false alarms (compare to a standard SVM)
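A sketch of this validation loop, using a plain kernel SVM as a stand-in for the BSVM and a mean shift in one parameter as the injected fault (all sizes and thresholds here are illustrative assumptions):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(3)

# Training data: simulated (n x m) matrix of normal observations,
# plus a few artificial abnormal points to define the negative class.
n, m = 200, 4
X_train = rng.normal(0, 1, size=(n, m))
X_art = rng.uniform(-6, 6, size=(20, m))
X = np.vstack([X_train, X_art])
y = np.hstack([np.ones(n), -np.ones(20)])
clf = SVC(kernel="rbf", C=1.0).fit(X, y)

# Test data: reuse the training distribution and inject a fault
# (a mean shift in one parameter) into half of the observations.
X_test = rng.normal(0, 1, size=(100, m))
X_test[50:, 0] += 5.0                     # injected fault
truth = np.hstack([np.ones(50), -np.ones(50)])

pred = clf.predict(X_test)
detection = np.mean(pred[truth == -1] == -1)
false_alarm = np.mean(pred[truth == 1] == -1)
print(f"detection rate {detection:.2f}, false alarm rate {false_alarm:.2f}")
```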
BACKUP – Bayesian Classifier Design
• Loss functions for a soft decision boundary and a hard decision boundary (reference forms are given after the figure)
[Figure: loss plotted against D(x) for the two classes (curves labeled y = -1 and y = +1), comparing the soft-decision and hard-decision losses.]
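For reference, written as functions of the margin y·D(x), the hard-decision (hinge) loss and one common soft-decision loss derived from a sigmoid likelihood are (these standard forms are an assumption; the curves on the original figure may use a different soft loss):

L_{\mathrm{hard}}\big(y\,D(x)\big) = \max\big(0,\ 1 - y\,D(x)\big)
\qquad
L_{\mathrm{soft}}\big(y\,D(x)\big) = \ln\big(1 + e^{-y\,D(x)}\big)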