Classification and Novel Class Detection in Data Streams

Mehedy Masud¹, Latifur Khan¹, Jing Gao², Jiawei Han², and Bhavani Thuraisingham¹
¹Department of Computer Science, University of Texas at Dallas
²Department of Computer Science, University of Illinois at Urbana-Champaign
This work was funded in part by
ICDM 2012, Brussels, Belgium, 12/11/2012

Presentation Overview
◦ Stream Mining Background
◦ Novel Class Detection – Concept Evolution

Data Streams
Data streams are:
◦ Continuous flows of data
◦ Examples: network traffic, sensor data, call center records

Data Stream Classification
◦ Uses past labeled data to build a classification model
◦ Predicts the labels of future instances using the model
◦ Helps decision making
[Diagram: network traffic passes through the classification model; attack traffic is blocked and quarantined at the firewall after expert analysis and labeling, while benign traffic reaches the server]

Introduction: Challenges
◦ Infinite length
◦ Concept-drift
◦ Concept-evolution (emergence of novel classes)
◦ Recurring (seasonal) classes

Infinite Length
Impractical to store and use all historical data:
◦ Requires infinite storage
◦ And infinite running time

Concept-Drift
[Figure: a data chunk of positive and negative instances; the separating hyperplane shifts from its previous to its current position, and instances near the boundary fall victim to concept-drift]

Concept-Evolution
[Figure: a two-dimensional feature space split at x1, y1, y2 into regions A–C containing the existing + and − classes; a novel class later appears in region D]
Classification rules:
R1. if (x > x1 and y < y2) or (x < x1 and y < y1) then class = +
R2. 
if (x > x1 and y > y2) or (x < x1 and y > y1) then class = −
Existing classification models misclassify novel class instances.

Background: Ensemble of Classifiers
[Diagram: an input x with unknown label is fed to classifiers C1 (+), C2 (+), and C3 (−); the individual outputs are combined by voting into the ensemble output +]

Background: Ensemble Classification of Data Streams
Divide the data stream into equal-sized chunks:
◦ Train a classifier from each data chunk
◦ Keep the best L such classifiers as the ensemble
◦ Example: for L = 3
Note: a chunk Di may contain data points from different classes.
[Diagram: labeled and unlabeled chunks arrive over time; each labeled chunk trains a classifier, and the ensemble of the best L classifiers makes the predictions]
Addresses infinite length and concept-drift.

Introduction: Examples of Recurring and Novel Classes
Twitter stream – a stream of messages. Each message may be given a category or "class" based on its topic.
◦ Examples: "Election 2012", "London Olympic", "Halloween", "Christmas", "Hurricane Sandy", etc.
◦ Among these, "Election 2012" and "Hurricane Sandy" are novel classes because they are new events.
◦ "Halloween" is a recurring class because it "recurs" every year.

Introduction: Concept-Evolution and Feature Space
[Same feature-space figure as the Concept-Evolution slide: existing classes + and − in regions A–C, with a novel class emerging in region D]
Classification rules:
R1. if (x > x1 and y < y2) or (x < x1 and y < y1) then class = +
R2. 
if (x > x1 and y > y2) or (x < x1 and y > y1) then class = −
Existing classification models misclassify novel class instances.

Prior Work: Novel Class Detection
Three steps:
◦ Training and building the decision boundary
◦ Outlier detection and filtering
◦ Computing cohesion and separation

Prior Work: Training – Creating the Decision Boundary
• Training is done chunk-by-chunk (one classifier per chunk)
• An ensemble of classifiers is used for classification
[Figure: raw training data is clustered, and the resulting clusters are summarized as pseudopoints that form the decision boundary]
Addresses the infinite length problem.

Prior Work: Outlier Detection and Filtering
◦ A test instance inside the decision boundary is not an outlier.
◦ A test instance outside the decision boundary is a raw outlier, or Routlier.
◦ If x is a Routlier with respect to all L models M1, …, ML of the ensemble, it becomes a filtered outlier (Foutlier) – a potential novel class instance; otherwise it is treated as an existing class instance.
Routliers may appear as a result of novel classes, concept-drift, or noise. Therefore, they are filtered to reduce noise as much as possible.

Prior Work: Computing Cohesion and Separation
[Figure: an Foutlier x with its q-neighborhood of other Foutliers (giving a(x)) and its q-neighborhoods in the existing classes (giving b+(x), b−(x))]
◦ a(x) = mean distance from an Foutlier x to its q nearest Foutlier instances
◦ bc(x) = mean distance from x to its q nearest instances of existing class c
◦ bmin(x) = minimum among all bc(x) (e.g., b+(x) in the figure)
q-Neighborhood Silhouette Coefficient (q-NSC):
q-NSC(x) = (bmin(x) − a(x)) / max(bmin(x), a(x))
If q-NSC(x) is positive, x is closer to the Foutliers than to any existing class.
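The cohesion/separation computation can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes instances are plain coordinate tuples and computes neighborhoods by brute-force distance sorting, whereas the actual method works over cluster summaries for efficiency.

```python
import math

def mean_dist(x, points):
    """Mean Euclidean distance from x to a set of points."""
    return sum(math.dist(x, p) for p in points) / len(points)

def q_nsc(x, foutliers, class_clusters, q=5):
    """q-Neighborhood Silhouette Coefficient of an Foutlier x.

    a(x): mean distance to the q nearest other Foutliers (cohesion).
    b_min(x): minimum over existing classes c of the mean distance to
    the q nearest instances of c (separation).
    Positive q-NSC means x sides with the Foutliers, i.e. evidence of
    a novel class.
    """
    others = [p for p in foutliers if p != x]
    a = mean_dist(x, sorted(others, key=lambda p: math.dist(x, p))[:q])
    b_min = min(
        mean_dist(x, sorted(pts, key=lambda p: math.dist(x, p))[:q])
        for pts in class_clusters.values()
    )
    return (b_min - a) / max(b_min, a)
```

For a tight clump of Foutliers far from every existing class, q-NSC approaches 1; for an Foutlier that actually sits among an existing class's instances, it goes negative.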
Prior Work: Limitation – Recurring Classes
[Figure: a stream timeline of chunks 0–150 illustrating a class that is present in early chunks, disappears, and later recurs]

Prior Work: Why Recurring Classes are Forgotten
Divide the data stream into equal-sized chunks:
◦ Train a classifier from each whole data chunk
◦ Keep the best L such classifiers as the ensemble (example: L = 3)
◦ Therefore, old models are discarded
◦ Old classes are "forgotten" after a while
[Diagram: as new chunks train new classifiers, only the best L are kept, so classifiers for classes absent from recent chunks are dropped]
Addresses infinite length and concept-drift.

Proposed Method: CLAM – CLAss Based Micro-Classifier Ensemble
[Flow diagram: the latest unlabeled instance is checked against the ensemble M, which keeps all classes; a non-outlier is classified as an existing class, while an outlier is buffered for novel class detection; the latest labeled chunk trains a new model that updates M]

Proposed Method: Training and Updating
• Each chunk is first separated into its different classes
• A micro-classifier is trained from each class's data
• Each new micro-classifier replaces one existing micro-classifier
• A total of L micro-classifiers make a Micro-Classifier Ensemble (MCE)
• C such MCEs, one per class, constitute the whole ensemble E

Proposed Method: Outlier Detection and Classification
• A test instance x is first classified with each micro-classifier ensemble
• Each micro-classifier ensemble gives a partial output (Yr) and an outlier flag (boolean)
• If all ensembles flag x as an outlier, it is buffered and sent to the novel class detector
• Otherwise, the partial outputs are combined and a class label is predicted

Evaluation
Competitors:
◦ CLAM (CL) – proposed work
◦ SCANR (SC) [1] – prior work
◦ ECSMiner (EM) [2] – prior work
◦ Olindda [3]-WCE [4] (OW) – another baseline
Datasets: synthetic, KDD Cup 1999, and Forest Covertype

1. M. M. Masud, T. M. Al-Khateeb, L. Khan, C. C. Aggarwal, J. Gao, J. Han, and B. M. Thuraisingham. "Detecting recurring and novel classes in concept-drifting data streams." In Proc. ICDM '11, Dec. 2011, pp. 1176–1181.
2. M. M. Masud, J. Gao, L. Khan, J. Han, and B. M. Thuraisingham. "Classification and novel class detection in concept-drifting data streams under time constraints." IEEE Transactions on Knowledge and Data Engineering (TKDE), 23(6):859–874, 2011.
3. E. J. Spinosa, A. P. de Leon F. de Carvalho, and J. Gama. "Cluster-based novel concept detection in data streams applied to intrusion detection in computer networks." In Proc. 2008 ACM Symposium on Applied Computing, pp. 976–980, 2008.
4. H. Wang, W. Fan, P. S. Yu, and J. Han. "Mining concept-drifting data streams using ensemble classifiers." In Proc. Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 226–235, Washington, DC, USA, Aug. 2003.
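The CLAM training/update and test-time steps described above can be sketched as follows. This is a hedged sketch under stated assumptions, not the paper's API: `train_micro_classifier` and `score_fn` are hypothetical placeholders for training a per-class model and for scoring a test instance against one micro-classifier ensemble.

```python
from collections import defaultdict

L = 3  # micro-classifiers kept per class (the slides' example value)

def update_clam(ensembles, labeled_chunk, train_micro_classifier):
    """Separate the chunk by class; train one micro-classifier per class.

    ensembles maps class label -> list of micro-classifiers (an MCE).
    The newest micro-classifier replaces the oldest once an MCE holds
    L models, so no class is ever dropped wholesale."""
    by_class = defaultdict(list)
    for features, label in labeled_chunk:
        by_class[label].append(features)
    for label, points in by_class.items():
        mce = ensembles.setdefault(label, [])
        mce.append(train_micro_classifier(points))
        if len(mce) > L:
            mce.pop(0)  # replace the oldest micro-classifier
    return ensembles

def classify_or_buffer(x, ensembles, score_fn, novel_buffer):
    """score_fn(mce, x) -> (partial_score, is_outlier) for one MCE.

    If every MCE flags x as an outlier, buffer it for novel class
    detection; otherwise predict the best-scoring non-outlier class."""
    results = {c: score_fn(mce, x) for c, mce in ensembles.items()}
    if all(flag for _, flag in results.values()):
        novel_buffer.append(x)  # potential novel class instance
        return None
    candidates = {c: s for c, (s, flag) in results.items() if not flag}
    return max(candidates, key=candidates.get)
```

Because each class keeps its own size-L ensemble, a class that vanishes from the stream keeps its micro-classifiers until it recurs, which is the mechanism that lets recurring classes be recognized instead of re-flagged as novel.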
Evaluation: Overall Error
[Figure: error rates on (a) SynC20, (b) SynC40, (c) Forest, and (d) KDD]

Evaluation: Number of Recurring Classes vs. Error
[Figure]

Evaluation: Error vs. Drift and Chunk Size
[Figure]

Evaluation: Summary Table
[Table]

Conclusion
◦ Detects recurrence
◦ Improved accuracy
◦ Running time
◦ Reduced human interaction
◦ Future work: use other base learners