数据挖掘 Introduction to Data Mining
Philippe Fournier-Viger
Full professor, School of Natural Sciences and Humanities
[email protected]
Spring 2017, A312: 15:20–16:55

Introduction
Last week:
◦ Clustering
◦ The second assignment was announced. It must be submitted before the 1st of May at 11:59 PM.
QQ group: 611811300

Course schedule (日程安排)
Week 1: Introduction – What is the knowledge discovery process?
Week 2: Exploring data
Week 3: Classification – decision trees
Week 4: Classification – naïve Bayes and other techniques
Week 5: Association analysis – Part 1
Week 6: –
Week 7: Association analysis – Part 2
Week 8: Clustering
Week 9: Advanced topics + details about the final exam

ABOUT THE FINAL EXAM

Final exam
Room and date to be determined.
Duration: 2 hours. It is a closed-book exam.
Answers must be written in English.
About 10 questions. Some typical questions in my exams:
◦ What are the advantages/disadvantages of using X instead of Y?
◦ When should X be used?
◦ How does X work? Why is X designed like that?
◦ A few questions may require some calculations or ask you to explain what the result of an algorithm will be for some data.

Final exam (continued)
During the exam, if you are not sure about the meaning of a question in terms of English, you may ask me or the teaching assistant for clarification.
No electronic devices are allowed, except a calculator. Besides that, a pen/pencil/eraser can be used during the exam.
Bring your student ID card.

What is important?
Understand what data mining is (week 1).
Exploring the data (week 2).
Classification (weeks 3 and 4):
◦ How the methods work, their advantages/disadvantages.
◦ For decision trees, you are expected to understand them quite well, since you did the first assignment.
◦ I will not ask you to do calculations for naïve Bayes (just understand the main idea).

What is important? (continued)
Association analysis (weeks 5 and 7):
◦ The problem of pattern mining and its variations.
◦ Be able to apply the Apriori algorithm and the other techniques that we discussed.
Clustering (week 8):
◦ What clustering is.
◦ DBScan, K-Means, etc.
Today (week 9):
◦ Outlier detection & sequence prediction.
The exam will focus on the above topics but may include other topics that we have discussed.

OUTLIER DETECTION (异常点检测)

Anomaly/Outlier Detection
What are anomalies/outliers (离群)?
◦ Data points that are considerably different from the other data points.
Different problem definitions:
◦ Find all the data points with anomaly scores greater than some threshold t.
◦ Find the n data points having the largest anomaly scores.
◦ Compute the anomaly score of a given data point x.
Applications: credit card fraud detection, telecommunication fraud detection, network intrusion detection, fault detection, terrorism detection.

Example: Ozone (臭氧) Depletion History
In 1985, three researchers (Farman, Gardiner and Shanklin) were puzzled by data gathered by the British Antarctic Survey showing that ozone levels for Antarctica (南极洲) had dropped 10% below normal levels.
Why did the Nimbus 7 satellite, which had instruments aboard for recording ozone levels, not record similarly low ozone concentrations?
The ozone concentrations recorded by the satellite were so low that they were being treated as outliers by a computer program and discarded!
Sources: http://exploringdata.cqu.edu.au/ozone.html and http://www.epa.gov/ozone/science/hole/size.html

Anomaly Detection Challenges
◦ How many outliers are there in the data?
◦ Methods are generally unsupervised: no training data is used, so it may become hard to validate the results.
◦ Finding outliers is like finding a needle in a haystack (大海捞针).
Assumption:
◦ There are considerably more “normal” data points than outliers in the data.
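To make the first two problem definitions above concrete, here is a minimal Python sketch (not from the course material). It assumes a very simple anomaly score, the absolute distance from the mean, and then selects outliers either with a threshold t or as the top-n points; the function names are hypothetical.

```python
# Minimal sketch (illustration only): selecting outliers from anomaly scores.
# Assumption: the anomaly score of a 1-D point is its absolute distance from the mean.

def anomaly_scores(values):
    """Compute a simple anomaly score for each data point."""
    mean = sum(values) / len(values)
    return [abs(v - mean) for v in values]

def outliers_by_threshold(values, t):
    """Definition 1: all points whose anomaly score is greater than threshold t."""
    scores = anomaly_scores(values)
    return [v for v, s in zip(values, scores) if s > t]

def top_n_outliers(values, n):
    """Definition 2: the n points having the largest anomaly scores."""
    scores = anomaly_scores(values)
    ranked = sorted(zip(values, scores), key=lambda vs: vs[1], reverse=True)
    return [v for v, _ in ranked[:n]]

data = [10, 11, 9, 10, 12, 10, 45]           # 45 is far from all the other values
print(outliers_by_threshold(data, t=10.0))   # -> [45]
print(top_n_outliers(data, n=1))             # -> [45]
```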
How to do anomaly detection?
General steps:
◦ Build a profile of the “normal” behavior. The profile can be patterns or summary statistics for the overall population.
◦ Use the “normal” profile to detect anomalies. Anomalies are observations whose characteristics differ significantly from the normal profile.
Different types of anomaly detection schemes:
◦ Graphical & statistical-based
◦ Distance-based
◦ Model-based
◦ …

Graphical Approaches
Boxplot (1-D), scatter plot (2-D), …
An outlier could be defined in terms of how far it is from the average (平均) or the standard deviation (标准偏差).
We can detect outliers visually.
Limitations:
◦ Time consuming
◦ Subjective

Convex Hull (凸包) Method
Extreme points are assumed to be outliers.
Use the convex hull method to detect extreme values.
Convex hull: the smallest convex set that contains all the data points (variation: contains at least 95% of the data points).
But what if an outlier occurs in the middle of the other data points?

Statistical Approaches
Assume a parametric model describing the distribution of the data (e.g. the normal distribution – 正态分布).
Apply a statistical test that depends on:
◦ the data distribution,
◦ the parameter(s) of the distribution (e.g. average, variance),
◦ the number of expected outliers (confidence limit).

Limitations of Statistical Approaches
Most of the tests are for a single attribute.
In many cases, the data distribution may not be known.
For high-dimensional data, it may be difficult to estimate the true distribution.

Distance-based Approaches
Data is represented as a vector of features, e.g. (Macy, 18 years old, Beijing, Female).
Three major approaches:
◦ Nearest-neighbor based
◦ Density based
◦ Clustering based

Nearest-Neighbor Based Approach
Approach: compute the distance between every pair of data points.
There are various ways to define outliers:
◦ data points for which there are fewer than p neighboring points within a distance D,
◦ the top n data points whose distance to the k-th nearest neighbor is the greatest,
◦ the top n data points whose average distance to the k nearest neighbors is the greatest.

Outliers in Lower Dimensional Projection
Divide each attribute into equal-depth intervals.
◦ Each interval contains a fraction f = 1/φ of the records.
Consider a k-dimensional cube created by picking grid ranges from k different dimensions.
◦ If the attributes are independent, we expect a region to contain a fraction f^k of the records.
◦ If there are N points, we can measure the sparsity of a cube D as
  S(D) = (count(D) − N·f^k) / sqrt(N·f^k·(1 − f^k)),
  where count(D) is the number of points that fall in D.
◦ A negative sparsity indicates that the cube contains fewer points than expected.
Example: N = 100, φ = 5, f = 1/5 = 0.2, so the expected number of points in a 2-dimensional cube is N·f² = 4.

Density-based method (LOF)
For each point, compute the density of its local neighborhood.
Compute the local outlier factor (LOF) of a sample p as the average of the ratio of the density of sample p and the density of its nearest neighbors (NN).
Outliers are the points with the largest LOF values.
In the NN approach, p2 is not considered an outlier, while the LOF approach finds both p1 and p2 as outliers.
[Figure: example data with points p1 and p2]

Clustering-Based
Basic idea:
◦ Cluster the data into groups of different density.
◦ Choose the points in small clusters as candidate outliers.
◦ Compute the distance between the candidate points and the non-candidate clusters.
◦ If the candidate points are far from all the other non-candidate points, they are outliers.
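As a concrete illustration of the nearest-neighbor based approach described earlier in this section, here is a short Python sketch (not part of the original slides; the function names are hypothetical). It scores each point by its distance to the k-th nearest neighbor and reports the top n points as outliers.

```python
# Minimal sketch (illustration only) of a nearest-neighbor based outlier detector:
# the anomaly score of a point is its distance to its k-th nearest neighbor.
import math

def knn_outlier_scores(points, k):
    """Score of each point = distance to its k-th nearest neighbor."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores

def top_n_knn_outliers(points, k, n):
    """The top n points whose distance to the k-th nearest neighbor is the greatest."""
    scores = knn_outlier_scores(points, k)
    ranked = sorted(range(len(points)), key=lambda i: scores[i], reverse=True)
    return [points[i] for i in ranked[:n]]

points = [(1, 1), (1, 2), (2, 1), (2, 2), (10, 10)]   # (10, 10) is isolated
print(top_n_knn_outliers(points, k=2, n=1))           # -> [(10, 10)]
```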
SEQUENCE PREDICTION (序列预测)

The problem of sequence prediction
Example: A, B, C → ?
Problem:
◦ Given a set of training sequences, predict the next symbol of a sequence.
Applications:
◦ webpage prefetching,
◦ analyzing the behavior of customers on websites,
◦ keyboard typing prediction,
◦ product recommendation,
◦ stock market prediction,
◦ …

General approach for this problem
Phase 1) Training: the training sequences are used to build a sequence prediction model.
Phase 2) Prediction: given a sequence (e.g. A,B,C), the prediction algorithm uses the model to output a prediction (e.g. D).

Sequential pattern mining
Discover patterns in the training sequences (e.g. with PrefixSpan and minsup = 33%), then use the patterns for prediction.
◦ It is time-consuming to extract the patterns.
◦ The patterns ignore rare cases.
◦ Updating the patterns is very costly!
[Figure: training sequences and the pattern/support table produced by PrefixSpan]

Dependency Graph (DG)
S1: {A,B,C,A,C,B,D}
S2: {C,C,A,B,C,B,C,A}
[Figure: the dependency graph built from S1 and S2 with a lookup window of size 2]
P(B|A) = 3 / SUP(A) = 3 / 4
P(C|A) = 3 / SUP(A) = 3 / 4
…

PPM – order 1 (Prediction by Partial Matching)
S1: {A,B,C,A,C,B,D}
S2: {C,C,A,B,C,B,C,A}
[Figure: the order-1 PPM tree built from S1 and S2]
P(B|A) = 2 / 4
P(C|A) = 1 / 4
…

PPM – order 2
S1: {A,B,C,A,C,B,D}
S2: {C,C,A,B,C,B,C,A}
[Figure: the order-2 PPM tree built from S1 and S2]
Predictions are inaccurate if there is noise…

All-K-Order Markov
Uses PPM models of order 1 to K for prediction.
More accurate than a fixed-order PPM, but has an exponential size.
Example (order 2):
P(C|AB) = 2 / 2
P(B|AC) = 1 / 1
P(A|BC) = 2 / 3
…
[Figure: the order-1 and order-2 PPM trees]

Limitations
Several models assume that each event depends only on the immediately preceding event; otherwise, the complexity is often exponential (e.g. All-K-Order Markov).
There are some improvements to reduce the size of Markovian models, but few works aim at improving their accuracy.
Several models are not noise tolerant.
Some models are costly to update (e.g. sequential patterns).
All the aforementioned models are lossy models.
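To make the order-1 PPM example above concrete, here is a small Python sketch (an illustration, not the authors' implementation). It counts, for S1 and S2, how often each symbol follows each other symbol, and divides by the support of the context symbol as on the slides, which reproduces P(B|A) = 2/4 and P(C|A) = 1/4.

```python
# Minimal sketch (illustration only) of an order-1 PPM-style model:
# count which symbol follows each symbol in the training sequences.
from collections import Counter, defaultdict

def train_order1(sequences):
    support = Counter()                 # SUP(x): number of occurrences of x
    successors = defaultdict(Counter)   # successors[x][y]: times y directly follows x
    for seq in sequences:
        support.update(seq)
        for x, y in zip(seq, seq[1:]):
            successors[x][y] += 1
    return support, successors

def predict_next(symbol, support, successors):
    """Predict the most probable symbol following `symbol`."""
    probs = {y: c / support[symbol] for y, c in successors[symbol].items()}
    return max(probs, key=probs.get), probs

S1 = ["A", "B", "C", "A", "C", "B", "D"]
S2 = ["C", "C", "A", "B", "C", "B", "C", "A"]
support, successors = train_order1([S1, S2])
print(predict_next("A", support, successors))
# -> ('B', {'B': 0.5, 'C': 0.25}), i.e. P(B|A) = 2/4 and P(C|A) = 1/4
```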
CPT: COMPACT PREDICTION TREE
Gueniche, T., Fournier-Viger, P., Tseng, V. S. (2013). Compact Prediction Tree: A Lossless Model for Accurate Sequence Prediction. Proc. 9th International Conference on Advanced Data Mining and Applications (ADMA 2013), Part II, Springer LNAI 8347, pp. 177-188.

Goal
◦ To provide more accurate predictions,
◦ a model having a reasonable size,
◦ a model that is noise tolerant.

Hypothesis
Idea:
◦ Build a lossless model (or a model where the loss of information can be controlled),
◦ use all the relevant information to perform each sequence prediction.
Hypothesis:
◦ This should increase prediction accuracy.

Challenges
1) Define a structure that is space-efficient for storing sequences,
2) the structure must be incrementally updatable to add new sequences,
3) propose a prediction algorithm that offers accurate predictions and, if possible, is also time-efficient.

Our proposal: the Compact Prediction Tree (CPT)
◦ A tree structure to store the training sequences,
◦ an indexing mechanism,
◦ each sequence is inserted one after the other in the CPT.

Example
We will consider the five following training sequences:
1. ABC
2. AB
3. ABDC
4. BC
5. BDE

Example (construction)
Initially, the prediction tree contains only the root, and the inverted index and the lookup table are empty. The sequences are then inserted one by one.

Example: Inserting <A,B,C>
[Figure: prediction tree, inverted index and lookup table after inserting s1 = <A,B,C>]

Example: Inserting <A,B>
[Figure: state after inserting s2 = <A,B>]

Example: Inserting <A,B,D,C>
[Figure: state after inserting s3 = <A,B,D,C>]

Example: Inserting <B,C>
[Figure: state after inserting s4 = <B,C>]

Example: Inserting <B,D,E>
[Figure: state after inserting s5 = <B,D,E>]
Inverted index after inserting all five sequences (rows = symbols, columns = sequences):
     s1 s2 s3 s4 s5
  A   1  1  1  0  0
  B   1  1  1  1  1
  C   1  0  1  1  0
  D   0  0  1  0  1
  E   0  0  0  0  1

Insertion
◦ Linear complexity, O(m), where m is the sequence length.
◦ A reversible operation (the sequences can be recovered from the CPT).
◦ The insertion order of the sequences is preserved in the CPT.

Space complexity
Size of the prediction tree:
◦ Worst case: O(N × average sequence length), where N is the number of sequences.
◦ In general, it is much smaller, because the sequences overlap.
[Figure: the prediction tree for the five example sequences]

Space complexity (cont’d)
Size of the inverted index: n × b bits, where n is the sequence count and b is the symbol count. It is small because it is encoded as bit vectors.

Space complexity (cont’d)
Size of the lookup table: n pointers, where n is the sequence count.

PREDICTION

Predicting the symbol following <A,B>
The bit vectors of A and B in the inverted index are combined with a logical AND:
◦ A: 1 1 1 0 0
◦ B: 1 1 1 1 1
◦ A AND B: 1 1 1 0 0
The logical AND indicates that the sequences common to A and B are s1, s2 and s3.
The lookup table allows traversing these sequences in the prediction tree from the end to the start.
Count table of the symbols occurring after {A,B}:
◦ C: 2 occurrences after {A,B}
◦ D: 1 occurrence after {A,B}
The symbol with the highest count (here C) is chosen as the prediction.

Complexity of prediction
1. Intersection of the bit vectors: O(v), where v is the number of symbols.
2. Traversing the sequences: O(n), where n is the sequence count.
3. Creating the count table: O(x), where x is the number of symbols appearing in the sequences after the target sequence.
4. Choosing the predicted symbol: O(y), where y is the number of distinct symbols in the count table.
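The following Python sketch (an illustration under simplifying assumptions, not the authors' implementation) mimics the prediction steps described above on the five example sequences. It uses sets of sequence ids instead of bit vectors, and a simplified rule for collecting the symbols located after the prefix; it reproduces the count table C: 2, D: 1 for the prefix <A,B>, so C would be predicted.

```python
# Minimal sketch (illustration only) of CPT-style prediction:
# find the training sequences containing every prefix symbol, then count
# the symbols that occur after the prefix in those sequences.
from collections import Counter

training = {"s1": "ABC", "s2": "AB", "s3": "ABDC", "s4": "BC", "s5": "BDE"}

# Inverted index: symbol -> set of sequence ids containing it
# (the real CPT encodes these sets as bit vectors).
inverted_index = {}
for sid, seq in training.items():
    for symbol in seq:
        inverted_index.setdefault(symbol, set()).add(sid)

def count_following_symbols(prefix):
    # Intersection of the index entries = sequences containing every prefix symbol
    # (the equivalent of the logical AND on bit vectors).
    common = set.intersection(*(inverted_index[s] for s in prefix))
    counts = Counter()
    for sid in sorted(common):
        i = 0
        # Simplified rule: count the symbols located after the point where the
        # prefix has been consumed as a subsequence of the training sequence.
        for symbol in training[sid]:
            if i < len(prefix) and symbol == prefix[i]:
                i += 1
            elif i == len(prefix):
                counts[symbol] += 1
    return counts

counts = count_following_symbols("AB")
print(counts)                    # -> Counter({'C': 2, 'D': 1})
print(counts.most_common(1))     # -> [('C', 2)], so C is predicted
```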
EXPERIMENTAL EVALUATION

Experimental evaluation
Datasets:
◦ BMS, FIFA, Kosarak: sequences of clicks on webpages,
◦ SIGN: sentences in sign language,
◦ BIBLE: sequences of characters in a book.

Experimental evaluation (cont’d)
Competitor algorithms:
◦ DG (lookup window = 4),
◦ All-K-Order Markov (order of 5),
◦ PPM (order of 1).
10-fold cross-validation.

Experimental evaluation (cont’d)
Measures:
◦ Accuracy = |success count| / |sequence count|
◦ Coverage = |prediction count| / |sequence count|

Experiment 1 – accuracy
CPT is the most accurate on all but one dataset. PPM and DG perform well in some situations.

Experiment 1 – size
CPT is:
◦ smaller than All-K-Order Markov,
◦ larger than DG and PPM.

Experiment 1 – time
CPT’s training time is at least 3 times lower than that of DG and AKOM, and similar to that of PPM.
CPT’s prediction time is quite high (a trade-off for more accuracy).

Experiment 2 – scalability
CPT shows a trend similar to the other algorithms.

Experiment 3 – prefix size
The prefix size is the number of symbols used for making a prediction.
For FIFA, the accuracy of CPT increases until a prefix size of around 8 (this depends on the dataset).

Optimisation #1 – RecursiveDivider
Example for {A,B,C,D}:
◦ Level 1: {B,C,D}, {A,C,D}, {A,B,D}, {A,B,C}
◦ Level 2: {C,D}, {B,D}, {B,C}, {A,D}, {A,C}, {A,B}
◦ Level 3: {D}, {C}, {B}, {A}
Accuracy and coverage increase, while training and prediction times remain more or less the same. Therefore, a high value for this parameter is better for all datasets.

Optimisation #2 – sequence splitting
Example: splitting the sequence {A,B,C,D,E,F,G} with split_length = 5 gives {C,D,E,F,G}.

Conclusion
CPT, a new model for sequence prediction:
◦ allows fast incremental updates,
◦ compresses the training sequences,
◦ integrates an indexing mechanism,
◦ includes two optimizations.
Results:
◦ in general, CPT is more accurate than the compared models, but its prediction time is greater (a trade-off),
◦ CPT is less than half the size of AKOM,
◦ sequence insertion is more than 3 times faster than for DG and AKOM.

CPT+: DECREASING THE TIME/SPACE COMPLEXITY OF CPT
Gueniche, T., Fournier-Viger, P., Raman, R., Tseng, V. S. (2015). CPT+: Decreasing the Time/Space Complexity of the Compact Prediction Tree. Proc. 19th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD 2015), Springer, LNAI 9078, pp. 625-636.

Introduction
Two optimisations to reduce the size of the tree used by CPT:
◦ compressing frequent substrings,
◦ compressing simple branches.
One optimisation to improve the prediction time and the noise tolerance.

(1) Compressing frequent substrings
This strategy is applied during training:
◦ it identifies frequent substrings in the training sequences,
◦ it replaces these substrings by new symbols.
Discovering the substrings is done with a modified version of the PrefixSpan algorithm.
◦ Parameters: minsup, minLength and maxLength.
[Figures: prediction tree, inverted index and lookup table before and after compressing frequent substrings]

(1) Compressing frequent substrings (cont’d)
Time complexity:
◦ training: a non-negligible cost to discover the frequent substrings,
◦ prediction: symbols are uncompressed on-the-fly in O(1) time.
Space complexity:
◦ O(m), where m is the number of frequent substrings.

(2) Compressing simple branches
A second optimization to reduce the size of the tree.
A simple branch is a branch where all nodes have a single child.
Each simple branch is replaced by a single node representing the whole branch.
[Figures: prediction tree, inverted index and lookup table before and after compressing simple branches]
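Here is a small Python sketch of the simple-branch compression just described (an illustration assuming a plain node/children tree, not the authors' code; the class and function names are hypothetical). Every chain of nodes that each have a single child is merged into one node whose label is the whole branch.

```python
# Minimal sketch (illustration only) of compressing simple branches:
# merge every chain of single-child nodes into one node holding the whole label.

class Node:
    def __init__(self, label):
        self.label = label      # a symbol, or a merged string of symbols
        self.children = []

def compress_simple_branches(node):
    """Recursively replace chains of single-child nodes by a single node."""
    for child in node.children:
        compress_simple_branches(child)
    # While this node has exactly one child, absorb that child into this node.
    while len(node.children) == 1:
        only_child = node.children[0]
        node.label = node.label + only_child.label
        node.children = only_child.children
    return node

# Example: the branch B -> D -> E (from the training sequence <B,D,E>).
root = Node("")
b, d, e = Node("B"), Node("D"), Node("E")
root.children.append(b); b.children.append(d); d.children.append(e)
for child in root.children:          # compress each branch below the root
    compress_simple_branches(child)
print(root.children[0].label)        # -> 'BDE': the simple branch became one node
```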
(2) Compressing simple branches (cont’d)
Time complexity:
◦ very fast: after building the tree, we only need to traverse the branches from the bottom using the lookup table.

(3) Improved noise reduction
Recall that CPT removes items from the sequence to be predicted in order to be more noise tolerant.
Improvements:
◦ only remove the less frequent symbols from sequences, assuming that they are more likely to be noise,
◦ consider a minimum number of sequences to perform a prediction,
◦ add a new parameter, the noise ratio (e.g. 20%), to determine how many symbols should be removed from a sequence (e.g. the 20% most infrequent symbols).
◦ Thus, the amount of noise is assumed to be proportional to the length of the sequences.

Experiments
Datasets: [table not shown]
Competitor algorithms: DG, TDAG, PPM, LZ78, All-K-Order Markov.

Prediction accuracy
[Figure: prediction accuracy per dataset]
CPT+ is also up to 4.5 times faster than CPT in terms of prediction time.

Scalability
[Figure: model size in nodes as a function of the sequence count]

Conclusion
CPT(+): a novel sequence prediction model with:
◦ a fast training time,
◦ good scalability,
◦ high prediction accuracy.
Future work: further compress the model; compare with other prediction models such as CTW and NN; data streams; user profiles; …
Open-source library for web prefetching: IPredict
https://github.com/tedgueniche/IPredict/tree/master/src/ca/ipredict

Conclusion
Today, we discussed:
◦ outlier detection,
◦ sequence prediction.
Next time: the final exam.

References
Tan, Steinbach & Kumar (2006). Introduction to Data Mining. Pearson Education, ISBN-10: 0321321367. Chapters 8 and 9 (and the accompanying slides).
Han & Kamber (2011). Data Mining: Concepts and Techniques.