Detecting Recurring and Novel Classes in Concept-Drifting Data Streams:
Classification and Novel Class Detection in Data Streams
Mehedy Masud¹, Latifur Khan¹, Jing Gao², Jiawei Han², and Bhavani Thuraisingham¹
¹ Department of Computer Science, University of Texas at Dallas
² Department of Computer Science, University of Illinois at Urbana-Champaign
This work was funded in part by
Presentation Overview
• Stream Mining Background
• Novel Class Detection – Concept Evolution
Data Streams
• Data streams are:
◦ Continuous flows of data
◦ Examples: network traffic, sensor data, call center records
Data Stream Classification
• Uses past labeled data to build a classification model
• Predicts the labels of future instances using the model
• Helps decision making
[Figure: network traffic classification. A classification model at the firewall separates incoming network traffic into attack traffic, which is blocked and quarantined (with expert analysis and labeling), and benign traffic, which is forwarded to the server.]
Introduction
Challenges
• Infinite length
• Concept-drift
• Concept-evolution (emergence of novel classes)
• Recurrence (seasonal) classes
Infinite Length
• Impractical to store and use all historical data
◦ Requires infinite storage
◦ And running time
[Figure: an unbounded sequence of incoming 0/1 instances, illustrating that the stream never ends.]
Concept-Drift
[Figure: a data chunk of positive and negative instances. The decision hyperplane shifts from its previous position to the current one; instances lying between the two hyperplanes are the victims of concept-drift, since the outdated model now labels them incorrectly.]
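Concept-drift is easy to simulate with a toy generator. The sketch below is not from the paper; the generator, names, and drift rate are purely illustrative. It labels the same points with a slowly rotating hyperplane, so points near the old boundary end up with flipped labels.

```python
import numpy as np

rng = np.random.default_rng(0)

def chunk_labels(X, chunk_id, drift=0.05):
    """Label 2-D points with a hyperplane whose orientation drifts a little
    with every chunk (toy model of gradual concept-drift)."""
    angle = chunk_id * drift
    w = np.array([np.cos(angle), np.sin(angle)])  # current hyperplane normal
    return (X @ w > 0).astype(int)

X = rng.uniform(-1.0, 1.0, size=(1000, 2))
y_old = chunk_labels(X, chunk_id=0)    # labels under the previous hyperplane
y_new = chunk_labels(X, chunk_id=10)   # labels under the current hyperplane
print("instances whose label changed (victims of drift):", int((y_old != y_new).sum()))
```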
Concept-Evolution
[Figure: a two-dimensional feature space (x, y) divided by thresholds x1, y1, y2 into regions A, B, C and D that contain the existing + and − classes. In the second panel a novel class (plotted as X) has emerged in the feature space.]
Classification rules:
R1. if (x > x1 and y < y2) or (x < x1 and y < y1) then class = +
R2. if (x > x1 and y > y2) or (x < x1 and y > y1) then class = -
Existing classification models misclassify novel class instances
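A tiny sketch makes the point concrete (the threshold values for x1, y1, y2 are assumed for illustration, not taken from the figure): rules R1/R2 cover the entire feature space, so any novel-class instance is forced into an existing class.

```python
def classify(x, y, x1=0.5, y1=0.7, y2=0.3):
    """Rules R1/R2 from the slide (threshold values are illustrative)."""
    if (x > x1 and y < y2) or (x < x1 and y < y1):
        return '+'   # R1
    if (x > x1 and y > y2) or (x < x1 and y > y1):
        return '-'   # R2
    return '-'       # points exactly on a threshold: assign arbitrarily

# An instance from a newly emerged (novel) class is still forced into one of
# the existing classes, so the novel class goes undetected.
print(classify(0.9, 0.9))   # prints '-', even if the instance belongs to a novel class
```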
Background: Ensemble of Classifiers
[Figure: an unlabeled input (x, ?) is fed to each classifier C1, C2 and C3; their individual outputs (+, +, −) are combined by voting into the ensemble output (+).]
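The voting step can be written in a few lines. The sketch below uses a dummy stand-in classifier (ConstantClassifier is a hypothetical helper, not from the paper) just to show the majority vote from the figure.

```python
from collections import Counter

class ConstantClassifier:
    """Stand-in classifier that always outputs one label (illustration only)."""
    def __init__(self, label):
        self.label = label
    def predict(self, X):
        return [self.label for _ in X]

def ensemble_predict(classifiers, x):
    """Combine the individual classifier outputs by majority voting."""
    votes = [clf.predict([x])[0] for clf in classifiers]
    return Counter(votes).most_common(1)[0][0]

# Matches the figure: C1 -> '+', C2 -> '+', C3 -> '-', ensemble output '+'.
C1, C2, C3 = ConstantClassifier('+'), ConstantClassifier('+'), ConstantClassifier('-')
print(ensemble_predict([C1, C2, C3], x=[0.2, 0.7]))
```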
Background: Ensemble Classification of Data Streams
• Divide the data stream into equal sized chunks
◦ Train a classifier from each data chunk
◦ Keep an ensemble of the best L such classifiers
◦ Example: for L = 3 (a minimal code sketch follows the figure below)
• Note: each chunk Di may contain data points from different classes
[Figure: the stream arrives as labeled chunks D1, D2, … followed by the latest, still unlabeled chunk. A classifier Ci is trained on each labeled chunk, and the ensemble of the best L = 3 classifiers predicts the labels of the unlabeled chunk. This addresses infinite length and concept-drift.]
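A minimal sketch of this chunk-by-chunk scheme, assuming scikit-learn decision trees as the base learner; the base learner and the scoring rule (evaluate all candidates on the newest labeled chunk) are assumptions, not necessarily the paper's exact choices.

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

L = 3           # number of classifiers kept, as in the example above
ensemble = []   # current best-L classifiers

def update_ensemble(X_chunk, y_chunk):
    """Train a classifier on the newest labeled chunk, then keep the L
    classifiers that score best on that chunk."""
    global ensemble
    candidates = ensemble + [DecisionTreeClassifier().fit(X_chunk, y_chunk)]
    candidates.sort(key=lambda clf: accuracy_score(y_chunk, clf.predict(X_chunk)),
                    reverse=True)
    ensemble = candidates[:L]

def predict(x):
    """Majority vote of the current ensemble on one unlabeled instance."""
    votes = [clf.predict([x])[0] for clf in ensemble]
    return max(set(votes), key=votes.count)
```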
Introduction
Examples of Recurring and Novel Classes
• Twitter Stream – a stream of messages
• Each message may be given a category or “class”
◦ based on its topic
• Examples
◦ “Election 2012”, “London Olympic”, “Halloween”, “Christmas”, “Hurricane Sandy”, etc.
• Among these
◦ “Election 2012” and “Hurricane Sandy” are novel classes because they are new events.
• Also
◦ “Halloween” is a recurring class because it “recurs” every year.
Introduction
Concept-Evolution and Feature Space
[Figure: the same two-dimensional feature space as before, shown before and after a novel class (plotted as X) emerges among the existing + and − classes in the regions A, B, C, D.]
Classification rules:
R1. if (x > x1 and y < y2) or (x < x1 and y < y1) then class = +
R2. if (x > x1 and y > y2) or (x < x1 and y > y1) then class = -
Existing classification models misclassify novel class instances
Prior work
Novel Class Detection – Prior Work
• Three steps:
◦ Training and building the decision boundary
◦ Outlier detection and filtering
◦ Computing cohesion and separation
Prior work
Training: Creating Decision Boundary
• Training is done chunk-by-chunk (one classifier per chunk)
• An ensemble of classifiers is used for classification
[Figure: raw training data in the (x, y) feature space is clustered, and each cluster is summarized as a pseudopoint; together the pseudopoints form the decision boundary around the regions A, B, C and D.]
Addresses Infinite length problem
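A minimal sketch of the clustering step, using scikit-learn k-means; k = 50 is an assumed value, and the paper's pseudopoints additionally carry class information, which is omitted here.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_decision_boundary(X_chunk, k=50):
    """Cluster one training chunk and keep each cluster only as a pseudopoint
    (centroid, radius). The union of these hyperspheres serves as the model's
    decision boundary; raw points are discarded, which keeps storage bounded."""
    X_chunk = np.asarray(X_chunk)
    km = KMeans(n_clusters=k, n_init=10).fit(X_chunk)
    pseudopoints = []
    for c in range(k):
        members = X_chunk[km.labels_ == c]
        centroid = km.cluster_centers_[c]
        radius = float(np.linalg.norm(members - centroid, axis=1).max()) if len(members) else 0.0
        pseudopoints.append((centroid, radius))
    return pseudopoints
```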
Prior work
Outlier Detection and Filtering
• A test instance that falls inside the decision boundary is not an outlier.
• A test instance that falls outside the decision boundary is a raw outlier (R-outlier).
[Figure: a test instance x lies outside the decision boundary (regions A–D) of each model M1, …, ML in the ensemble of L models.]
• If x is an R-outlier with respect to all L models, then x is a filtered outlier (F-outlier), a potential novel class instance; otherwise x is treated as an existing class instance.
• R-outliers may appear as a result of a novel class, concept-drift, or noise. Therefore, they are filtered to reduce noise as much as possible.
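Using the pseudopoint boundaries sketched earlier, the filtering step reduces to an AND over the ensemble. This is a minimal sketch under an assumed data layout (each model represented as its list of pseudopoints).

```python
import numpy as np

def is_routlier(x, pseudopoints):
    """x is a raw outlier (R-outlier) for one model if it lies outside every
    pseudopoint hypersphere of that model's decision boundary."""
    return all(np.linalg.norm(x - centroid) > radius for centroid, radius in pseudopoints)

def is_foutlier(x, models):
    """x is a filtered outlier (F-outlier), i.e. a potential novel class
    instance, only if every model in the ensemble reports it as an R-outlier;
    otherwise x is treated as an existing class instance."""
    return all(is_routlier(x, pseudopoints) for pseudopoints in models)
```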
Prior work
Computing Cohesion & Separation
[Figure: an F-outlier x, its q nearest F-outlier neighbors λo,q(x) with mean distance a(x), and its q nearest neighbors in each existing class, e.g. λ+,q(x) with mean distance b+(x) and λ−,q(x) with mean distance b−(x).]
• a(x) = mean distance from an F-outlier x to the instances in λo,q(x)
• bc(x) = mean distance from x to its q nearest neighbors in class c; bmin(x) = minimum among all bc(x) (e.g. b+(x) in the figure)
• q-Neighborhood Silhouette Coefficient (q-NSC):
  q-NSC(x) = (bmin(x) − a(x)) / max(bmin(x), a(x))
If q-NSC(x) is positive, x is closer to the F-outliers than to any existing class.
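A direct transcription of the formula; the data layout (a dict of per-class instance arrays) is an assumption made for the sketch.

```python
import numpy as np

def q_nsc(x, foutliers, class_instances, q=5):
    """q-NSC(x) = (b_min(x) - a(x)) / max(b_min(x), a(x)) as on the slide.
    `foutliers`: the other F-outlier instances; `class_instances`: dict mapping
    each existing class label to an array of its instances."""
    x = np.asarray(x)
    def mean_dist_to_q_nearest(points):
        d = np.sort(np.linalg.norm(np.asarray(points) - x, axis=1))
        return d[:q].mean()
    a = mean_dist_to_q_nearest(foutliers)                  # cohesion a(x)
    b_min = min(mean_dist_to_q_nearest(pts)                # separation b_min(x)
                for pts in class_instances.values())
    return (b_min - a) / max(b_min, a)

# Positive q-NSC: x is closer to the other F-outliers than to any existing class.
```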
Limitation: Recurrence Class
Prior work
[Figure: timeline of stream chunks (chunk0, chunk1, …, chunk49, chunk50, chunk51, chunk52, …, chunk99, chunk100, chunk101, chunk102, …, chunk149, chunk150). A class that appears early in the stream disappears for many chunks and then recurs much later, long after the chunks that contained it were processed.]
Prior work
Why Are Recurring Classes Forgotten?
• Divide the data stream into equal sized chunks
◦ Train a classifier from each whole data chunk
◦ Keep an ensemble of the best L such classifiers
◦ Example: for L = 3
◦ Therefore, old models are discarded
◦ Old classes are “forgotten” after a while
[Figure: the same chunk-based ensemble as before; a classifier is trained per labeled chunk and only the best L = 3 are kept for prediction, so classifiers built on chunks containing the now-absent classes are eventually discarded. This addresses infinite length and concept-drift, but forgets recurring classes.]
Proposed method
CLAM: The Proposed Approach
CLAss Based Micro-Classifier Ensemble
[Flow diagram: the latest labeled chunk of the stream trains a new model, which updates the ensemble M; the ensemble keeps micro-classifiers for all classes. The latest unlabeled instance goes through outlier detection against M. If it is not an outlier, it is classified using M as an existing class instance; if it is an outlier, it is buffered for novel class detection.]
Proposed method
Training and Updating
• Each chunk is first separated into different classes
• A micro-classifier is trained from each class’s data
• Each new micro-classifier replaces one existing micro-classifier of the same class
• A total of L micro-classifiers make a Micro-Classifier Ensemble (MCE)
• C such MCEs (one per class) constitute the whole ensemble E
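A minimal sketch of this class-based bookkeeping. The micro-classifier is reduced to a centroid-plus-radius stand-in, and the replacement rule (drop the oldest) is an assumption; the paper may replace by evaluated quality.

```python
import numpy as np
from collections import defaultdict, deque

L = 3  # micro-classifiers kept per class (one MCE per class)
E = defaultdict(lambda: deque(maxlen=L))  # class label -> its Micro-Classifier Ensemble

def train_micro_classifier(X_class):
    """Toy stand-in for one micro-classifier, built from a single class's data.
    Here it is just the class centroid plus a radius; the paper uses
    cluster-based micro-classifiers."""
    X_class = np.asarray(X_class)
    centroid = X_class.mean(axis=0)
    radius = float(np.linalg.norm(X_class - centroid, axis=1).max())
    return centroid, radius

def update_with_chunk(X_chunk, y_chunk):
    """Separate the labeled chunk by class, train one micro-classifier per
    class, and let it replace the oldest micro-classifier in that class's MCE."""
    by_class = defaultdict(list)
    for x, y in zip(X_chunk, y_chunk):
        by_class[y].append(x)
    for label, X_class in by_class.items():
        E[label].append(train_micro_classifier(X_class))  # deque drops the oldest
```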
Proposed method
CLAM: The Proposed Approach
CLAss Based Micro-Classifier Ensemble
[Flow diagram, revisited: after training and updating, each incoming unlabeled instance is checked against the ensemble M; non-outliers are classified as existing class instances, while outliers are buffered for novel class detection.]
Proposed method
Outlier Detection and Classification
• A test instance x is first classified with each micro-classifier ensemble
• Each micro-classifier ensemble gives a partial output (Yr) and an outlier flag (boolean)
• If all micro-classifier ensembles flag x as an outlier, it is buffered and sent to the novel class detector
• Otherwise, the partial outputs are combined and a class label is predicted
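Continuing the earlier sketch, the outlier test and the combination of partial outputs can look like this; distance-to-centroid scoring and max-score combination are simplifying assumptions, not the paper's exact rules.

```python
import numpy as np

def clam_classify(x, E):
    """Classify one unlabeled instance with the class-based ensemble E
    (label -> list of (centroid, radius) micro-classifiers, as sketched above).
    Each class's MCE contributes a partial output and an outlier flag; if every
    MCE flags x as an outlier, x is buffered as a potential novel class instance."""
    x = np.asarray(x)
    partial_output, outlier_flags = {}, []
    for label, mce in E.items():
        dists = [np.linalg.norm(x - centroid) for centroid, _ in mce]
        partial_output[label] = -min(dists)   # partial output: closer is better
        outlier_flags.append(all(np.linalg.norm(x - c) > r for c, r in mce))
    if all(outlier_flags):
        return None                           # buffer for novel class detection
    return max(partial_output, key=partial_output.get)  # combine partial outputs
```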
Evaluation
• Competitors:
◦ CLAM (CL) – proposed work
◦ SCANR (SC) [1] – prior work
◦ ECSMiner (EM) [2] – prior work
◦ OLINDDA [3]-WCE [4] (OW) – another baseline
• Datasets: Synthetic, KDD Cup 1999, and Forest Covertype
1. M. M. Masud, T. M. Al-Khateeb, L. Khan, C. C. Aggarwal, J. Gao, J. Han, and B. M. Thuraisingham. “Detecting recurring and novel classes in concept-drifting data streams.” In Proc. ICDM ’11, Dec. 2011, pp. 1176–1181.
2. M. M. Masud, J. Gao, L. Khan, J. Han, and B. M. Thuraisingham. “Classification and novel class detection in concept-drifting data streams under time constraints.” IEEE Transactions on Knowledge and Data Engineering (TKDE), 23(6): 859–874, 2011.
3. E. J. Spinosa, A. P. de Leon F. de Carvalho, and J. Gama. “Cluster-based novel concept detection in data streams applied to intrusion detection in computer networks.” In Proc. 2008 ACM Symposium on Applied Computing, pp. 976–980, 2008.
4. H. Wang, W. Fan, P. S. Yu, and J. Han. “Mining concept-drifting data streams using ensemble classifiers.” In Proc. Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 226–235, Washington, DC, USA, Aug. 2003.
Overall Error
Evaluation
[Figure: error rates on (a) SynC20, (b) SynC40, (c) Forest, and (d) KDD.]
Evaluation
Number of Recurring Classes vs Error
Evaluation
Error vs Drift and Chunk Size
Evaluation
Summary Table
Conclusion
• Detects recurring classes
• Improved accuracy
• Running time
• Reduced human interaction
• Future work: use other base learners