Sequence Prediction - Philippe Fournier-Viger

Data Mining (数据挖掘)
Introduction to Data Mining
Philippe Fournier-Viger
Full professor
School of Natural Sciences and Humanities
[email protected]
Spring 2017
A312: 15:20- 16:55
Introduction
• Last week:
◦ Clustering
◦ The second assignment was announced. It must be submitted by May 1st at 11:59 PM.
QQ group: 611811300
Course schedule (日程安排)
Week 1: Introduction – What is the knowledge discovery process?
Week 2: Exploring data
Week 3: Classification – decision trees
Week 4: Classification – naïve Bayes and other techniques
Week 5: Association analysis – Part 1
Week 6
Week 7: Association analysis – Part 2
Week 8: Clustering
Week 9: Advanced topics + details about the final exam
ABOUT THE FINAL EXAM
Final exam
• Room and date: to be determined.
• Duration: 2 hours.
• It is a closed-book exam.
• Answers must be written in English.
• About 10 questions. Some typical questions in my exams:
◦ What are the advantages/disadvantages of using X instead of Y?
◦ When should X be used?
◦ How does X work? Why is X designed like that?
◦ A few questions that may require some calculations or explaining what the result of an algorithm will be for some data.
Final exam (continued)
• During the exam, if you are not sure about the meaning of a question in terms of English, you may ask me or the teaching assistant for clarification.
• No electronic devices are allowed, except a calculator.
• Besides that, a pen/pencil/eraser can be used during the exam.
• Bring your student ID card.
What is important?
• Understand what data mining is (week 1)
• Exploring the data (week 2)
• Classification (weeks 3 and 4)
◦ How the methods work, advantages/disadvantages
◦ For decision trees, you are expected to understand them quite well, since you did the first assignment.
◦ I will not ask you to do calculations for Naïve Bayes (just understand the main idea).
What is important?
• Association analysis (weeks 5 and 7)
◦ The problem of pattern mining and its variations.
◦ Be able to apply the Apriori algorithm and other techniques that we discussed.
• Clustering (week 8)
◦ What is clustering
◦ DBScan, K-Means, etc.
• Today (week 9)
◦ Outlier detection & sequence prediction
The exam will focus on the above topics but may include other topics that we have discussed.
OUTLIER DETECTION (异常点检测)
Anomaly/Outlier Detection
• What are anomalies/outliers (离群)?
◦ Data points that are considerably different from the other data points
• Different problem definitions:
◦ Find all the data points with anomaly scores greater than some threshold t
◦ Find the n data points having the largest anomaly scores
◦ Compute the anomaly score of a given data point x
• Applications: credit card fraud detection, telecommunication fraud detection, network intrusion detection, fault detection, terrorism detection
Example: Ozone (臭氧) Depletion History
• In 1985, three researchers (Farman, Gardiner and Shanklin) were puzzled by data gathered by the British Antarctic Survey showing that ozone levels for Antarctica (南极洲) had dropped 10% below normal levels.
• Why did the Nimbus 7 satellite, which had instruments aboard for recording ozone levels, not record similarly low ozone concentrations?
• The ozone concentrations recorded by the satellite were so low that they were being treated as outliers by a computer program and discarded!
Sources:
http://exploringdata.cqu.edu.au/ozone.html
http://www.epa.gov/ozone/science/hole/size.html
Anomaly Detection
• Challenges:
◦ How many outliers are there in the data?
◦ Methods are generally unsupervised
▪ No training is used. Thus, it may become hard to validate the results.
◦ Finding outliers is like finding a needle in a haystack (大海捞针)
• Assumption:
◦ There are considerably more “normal” data points than outliers in the data
How to do anomaly detection?
• General steps:
◦ Build a profile of the “normal” behavior
▪ The profile can be patterns or summary statistics for the overall population
◦ Use the “normal” profile to detect anomalies
▪ Anomalies are observations whose characteristics differ significantly from the normal profile
• Different types of anomaly detection schemes:
◦ Graphical & statistical-based
◦ Distance-based
◦ Model-based
◦ …
Graphical Approaches
• Boxplot (1-D), scatter plot (2-D), …
• An outlier could be defined in terms of how far it is from the average (平均) or the standard deviation (标准偏差) (see the sketch below).
• We can detect outliers visually.
• Limitations:
◦ Time-consuming
◦ Subjective
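As a small illustration of the boxplot idea mentioned above (this sketch, its function name and its 1.5 × IQR threshold are my own choices, not part of the slides), values outside the usual boxplot whiskers can be flagged in a few lines:

```python
import numpy as np

def boxplot_outliers(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR], the usual boxplot whiskers."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

print(boxplot_outliers([10, 12, 11, 13, 12, 11, 45]))  # -> [45]
```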
Convex Hull (凸包) Method
• Extreme points are assumed to be outliers.
• Use the convex hull method to detect extreme values.
• Convex hull: the smallest convex set that contains all the data points (variation: contains at least 95% of the data points).
• But what if the outlier occurs in the middle of the other data points?
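A possible sketch of this method using SciPy (the library choice, the sample data and the variable names are mine, not from the slides): the vertices of the convex hull are taken as the extreme points.

```python
import numpy as np
from scipy.spatial import ConvexHull

points = np.array([[0, 0], [1, 0], [0, 1], [1, 1], [0.5, 0.5], [3, 3]])
hull = ConvexHull(points)
extreme_points = points[hull.vertices]   # points on the hull = candidate outliers
print(extreme_points)                    # (3, 3) is among them; (0.5, 0.5) is not
```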
Statistical Approaches
• Assume a parametric model describing the distribution of the data (e.g. the normal distribution, 正态分布); see the sketch below.
• Apply a statistical test that depends on:
◦ the data distribution,
◦ the parameter(s) of the distribution (e.g. average, variance),
◦ the number of expected outliers (confidence limit).
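For example, under the normal-distribution assumption, one common rule of thumb is to flag points whose z-score exceeds 3. The sketch below (function name, threshold and sample data are illustrative, not from the slides) shows the idea:

```python
import numpy as np

def zscore_outliers(values, threshold=3.0):
    """Flag values whose z-score exceeds the threshold, assuming roughly normal data."""
    values = np.asarray(values, dtype=float)
    z = (values - values.mean()) / values.std()
    return values[np.abs(z) > threshold]

data = np.append(np.random.normal(loc=0, scale=1, size=1000), 8.0)  # inject one outlier
print(zscore_outliers(data))  # should contain 8.0
```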
Limitations of Statistical Approaches
• Most of the tests are for a single attribute.
• In many cases, the data distribution may not be known.
• For high-dimensional data, it may be difficult to estimate the true distribution.
Distance-based Approaches
• Data is represented as a vector of features, e.g. (Macy, 18 years old, Beijing, Female)
• Three major approaches:
◦ Nearest-neighbor based
◦ Density based
◦ Clustering based
Nearest-Neighbor Based Approach
Approach:
• Compute the distance between every pair of data points.
• There are various ways to define outliers (see the sketch below):
◦ Data points for which there are fewer than p neighboring points within a distance D
◦ The top n data points whose distance to the kth nearest neighbor is the greatest
◦ The top n data points whose average distance to the k nearest neighbors is the greatest
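A minimal sketch of the second definition above (the brute-force pairwise distances, parameter values and sample data are my own choices): score each point by its distance to its k-th nearest neighbor and return the top n.

```python
import numpy as np

def knn_distance_outliers(X, k=3, n=2):
    """Return the indices of the n points whose distance to their
    k-th nearest neighbor is the largest."""
    X = np.asarray(X, dtype=float)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # pairwise distances
    np.fill_diagonal(dists, np.inf)                                # ignore self-distances
    kth_dist = np.sort(dists, axis=1)[:, k - 1]                    # distance to the k-th NN
    return np.argsort(-kth_dist)[:n]

X = [[0, 0], [0, 1], [1, 0], [1, 1], [10, 10]]
print(knn_distance_outliers(X, k=2, n=1))  # -> [4], the isolated point (10, 10)
```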
Outliers in Lower Dimensional Projection
• Divide each attribute into φ equal-depth intervals
◦ Each interval contains a fraction f = 1/φ of the records
• Consider a k-dimensional cube created by picking grid ranges from k different dimensions
◦ If attributes are independent, we expect a region to contain a fraction f^k of the records
◦ If there are N points, we can measure the sparsity of a cube D by comparing its actual count to the expected count N × f^k
◦ Negative sparsity indicates that the cube contains fewer points than expected
Example: N = 100, φ = 5, f = 1/5 = 0.2, N × f^2 = 4
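To make this concrete, here is an illustrative sketch (mine, not from the slides). Since the exact formula is not shown above, it assumes the commonly used normalized sparsity coefficient S(D) = (count(D) − N·f^k) / sqrt(N·f^k·(1 − f^k)); very negative values indicate sparse cubes.

```python
import numpy as np

def cube_sparsity(X, phi=5):
    """Bin each attribute into phi equal-depth intervals and compute, for every
    occupied k-dimensional cube, the normalized sparsity score
    (count - N*f**k) / sqrt(N*f**k * (1 - f**k))."""
    X = np.asarray(X, dtype=float)
    N, k = X.shape
    f = 1.0 / phi
    bins = np.empty_like(X, dtype=int)
    for j in range(k):                       # equal-depth (quantile) binning per attribute
        edges = np.quantile(X[:, j], np.linspace(0, 1, phi + 1)[1:-1])
        bins[:, j] = np.searchsorted(edges, X[:, j])
    expected = N * f ** k
    scores = {}
    for cube in set(map(tuple, bins)):       # only cubes that contain at least one point
        count = np.sum(np.all(bins == cube, axis=1))
        scores[cube] = (count - expected) / np.sqrt(expected * (1 - f ** k))
    return scores                            # very negative scores = sparse cubes
```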
Density-based method (LOF)
• For each point, compute the density of its local neighborhood.
• Compute the local outlier factor (LOF) of a sample p as the average of the ratio of the density of sample p and the density of its nearest neighbors (NN).
• Outliers are points with the largest LOF values.
• In the NN approach, p2 is not considered an outlier, while the LOF approach finds both p1 and p2 as outliers.
(Figure: two example points p1 and p2 near clusters of different densities)
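A minimal sketch using scikit-learn's LOF implementation (the library choice, the parameter n_neighbors=3 and the toy data are mine):

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1], [0.5, 0.5], [8, 8]])
lof = LocalOutlierFactor(n_neighbors=3)   # density is compared to the 3 nearest neighbors
labels = lof.fit_predict(X)               # -1 = outlier, 1 = inlier
scores = -lof.negative_outlier_factor_    # larger score = more outlying
print(labels)                             # the isolated point (8, 8) is labeled -1
print(scores)
```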
Clustering-Based
Basic idea (see the sketch below):
◦ Cluster the data into groups of different density
◦ Choose points in small clusters as candidate outliers
◦ Compute the distance between candidate points and non-candidate clusters
◦ If candidate points are far from all the other non-candidate points, they are outliers
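An illustrative sketch of this scheme (the clustering algorithm, the cluster-size cutoff and the distance threshold are my own choices, not prescribed by the slides): points in very small k-means clusters are kept as outliers if they are far from every large cluster's centroid.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_based_outliers(X, n_clusters=3, min_size=2, dist_threshold=3.0):
    """Flag points that fall into small clusters and are far from all large clusters."""
    X = np.asarray(X, dtype=float)
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(X)
    sizes = np.bincount(km.labels_, minlength=n_clusters)
    small = np.where(sizes < min_size)[0]                  # candidate (small) clusters
    large_centroids = km.cluster_centers_[sizes >= min_size]
    if len(large_centroids) == 0:                          # degenerate case: all clusters are small
        return list(np.where(np.isin(km.labels_, small))[0])
    outliers = []
    for i, label in enumerate(km.labels_):
        if label in small:
            d = np.linalg.norm(large_centroids - X[i], axis=1).min()
            if d > dist_threshold:                         # far from every big cluster
                outliers.append(i)
    return outliers

X = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [20, 20]]
print(cluster_based_outliers(X))   # -> [5], the isolated point (20, 20)
```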
SEQUENCE PREDICTION (序列预测)
The problem of Sequence Prediction
(Figure: given the sequence A, B, C, what is the next symbol “?”)
• Problem:
◦ Given a set of training sequences, predict the next symbol of a sequence.
• Applications:
◦ webpage prefetching,
◦ analyzing the behavior of customers on websites,
◦ keyboard typing prediction,
◦ product recommendation,
◦ stock market prediction,
◦ …
General approach for this problem
Phase 1) Training: the training sequences are used to build a sequence prediction model.
Phase 2) Prediction: given a sequence (e.g. A,B,C), a prediction algorithm uses the model to output a prediction (e.g. D).
Sequential pattern mining
• Discovery of patterns
• Using the patterns for prediction
• It is time-consuming to extract patterns.
• Patterns ignore rare cases.
• Updating the patterns is very costly!
(Figure: sequences mined with PrefixSpan using minsup = 33%, producing a table of patterns and their support)
Dependency Graph (DG)
S1: {A,B,C,A,C,B,D}
S2: {C,C,A,B,C,B,C,A}
(Figure: a dependency graph over the symbols A, B, C and D, with a count for each symbol and a count on each arc. Caption: DG with lookup table of size 2)
Dependency Graph (DG)
S1: {A,B,C,A,C,B,D}
S2: {C,C,A,B,C,B,C,A}
P(B|A) = 3 / SUP(A) = 3 / 4
P(C|A) = 3 / SUP(A) = 3 / 4
…
(Figure: the same dependency graph. Caption: DG with lookup table of size 2)
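A minimal sketch of how such dependency-graph counts can be computed (illustrative code and names, not the original DG implementation): for each symbol we count its support and, within a window of the 2 following positions, how often every other symbol appears after it. On the two sequences above this reproduces P(B|A) = P(C|A) = 3/4.

```python
from collections import defaultdict

def build_dg(sequences, window=2):
    """Count SUP(x) and, for every pair (x, y), how often y appears within
    `window` positions after x (a sketch of the dependency-graph counts)."""
    sup = defaultdict(int)
    arc = defaultdict(int)
    for seq in sequences:
        for i, x in enumerate(seq):
            sup[x] += 1
            for y in set(seq[i + 1:i + 1 + window]):  # each successor counted once per position
                arc[(x, y)] += 1
    return sup, arc

sup, arc = build_dg([list("ABCACBD"), list("CCABCBCA")])
print(arc[("A", "B")] / sup["A"])   # P(B|A) = 3/4
print(arc[("A", "C")] / sup["A"])   # P(C|A) = 3/4
```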
PPM – order 1
(prediction by partial matching)
S1: {A,B,C,A,C,B,D}
S2: {C,C,A,B,C,B,C,A}
(Figure: an order-1 PPM tree giving, for each symbol, its count and the counts of the symbols that immediately follow it)
PPM – order 1
(prediction by partial matching)
S1: {A,B,C,A,C,B,D}
S2: {C,C,A,B,C,B,C,A}
P(B|A) = 2 / 4
P(C|A) = 1 / 4
…
(Figure: the same order-1 PPM tree)
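A minimal sketch of order-1 PPM counting (illustrative, not the original implementation): for each symbol, count how often every other symbol immediately follows it. With SUP(A) = 4 as the denominator, this reproduces P(B|A) = 2/4 and P(C|A) = 1/4 above.

```python
from collections import defaultdict

def ppm_order1(sequences):
    """Count, for every symbol x, how many times x occurs (sup) and how often
    each symbol y immediately follows x (follow)."""
    sup = defaultdict(int)
    follow = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for i, x in enumerate(seq):
            sup[x] += 1
            if i + 1 < len(seq):
                follow[x][seq[i + 1]] += 1
    return sup, follow

sup, follow = ppm_order1([list("ABCACBD"), list("CCABCBCA")])
print(follow["A"]["B"] / sup["A"])   # P(B|A) = 2/4
print(follow["A"]["C"] / sup["A"])   # P(C|A) = 1/4
```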
PPM – order 2
S1: {A,B,C,A,C,B,D}
S2: {C,C,A,B,C,B,C,A}
(Figure: an order-2 PPM tree with the contexts AB, AC and BC and the counts of the symbols that follow them)
Predictions are inaccurate if there is noise…
All-K-Order Markov
• Uses PPM from level 1 to K for prediction.
• More accurate than a fixed-order PPM,
• but the size is exponential.
(Figure: the order-1 and order-2 PPM trees combined into an All-K-Order Markov model)
Example: order 2
P(C|AB) = 2 / 2
P(B|AC) = 1 / 1
P(A|BC) = 2 / 3
…
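A minimal sketch of the All-K-Order Markov idea (illustrative code and names): keep successor counts for every context of length 1 to K, and predict with the longest context matching the end of the sequence. On the two training sequences it reproduces P(C|AB) = 2/2.

```python
from collections import defaultdict

def all_k_order_markov(sequences, K=2):
    """counts[context][symbol] = how often `symbol` follows `context`,
    for every context of length 1..K (hence the exponential size)."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for k in range(1, K + 1):
            for i in range(len(seq) - k):
                counts[tuple(seq[i:i + k])][seq[i + k]] += 1
    return counts

def predict(counts, sequence, K=2):
    """Use the longest suffix of `sequence` that appears as a context."""
    for k in range(min(K, len(sequence)), 0, -1):
        ctx = tuple(sequence[-k:])
        if ctx in counts:
            return max(counts[ctx], key=counts[ctx].get)
    return None

counts = all_k_order_markov([list("ABCACBD"), list("CCABCBCA")])
print(dict(counts[("A", "B")]))     # {'C': 2}  -> P(C|AB) = 2/2
print(predict(counts, list("AB")))  # 'C'
```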
Limitations
• Several models assume that each event depends only on the immediately preceding event.
• Otherwise, the complexity is often exponential (e.g. All-K-Order Markov).
• There have been some improvements to reduce the size of Markovian models, but little work to improve their accuracy.
• Several models are not noise tolerant.
• Some models are costly to update (e.g. sequential patterns).
• All the aforementioned models are lossy models.
CPT: COMPACT PREDICTION TREE
Gueniche, T., Fournier-Viger, P., Tseng, V.-S. (2013). Compact Prediction Tree: A Lossless Model for Accurate Sequence Prediction. Proc. 9th International Conference on Advanced Data Mining and Applications (ADMA 2013), Part II, Springer LNAI 8347, pp. 177-188.
Goal
◦ to provide more accurate predictions,
◦ a model having a reasonable size,
◦ a model that is noise tolerant.
Hypothesis
• Idea:
◦ build a lossless model (or a model where the loss of information can be controlled),
◦ use all the relevant information to perform each sequence prediction.
• Hypothesis:
◦ this would increase prediction accuracy.
Challenges
1) Define an efficient structure, in terms of space, to store sequences.
2) The structure must be incrementally updatable to add new sequences.
3) Propose a prediction algorithm that:
◦ offers accurate predictions,
◦ if possible, is also time-efficient.
Our proposal
Compact Prediction Tree (CPT)
• A tree structure to store the training sequences,
• an indexing mechanism,
• each sequence is inserted one after the other in the CPT.
• Illustration →
Example
We will consider the five following training sequences:
1. ABC
2. AB
3. ABDC
4. BC
5. BDE
Example (construction)
(Figure: an empty prediction tree (just the root), an empty inverted index and an empty lookup table)
Example: Inserting <A,B,C>
(Figure: the prediction tree gains the branch root → A → B → C; the inverted index gets a column s1 with bit 1 for A, B and C; the lookup table maps s1 to the node C)
Example: Inserting <A,B>
(Figure: <A,B> reuses the existing branch root → A → B; the inverted index gets a column s2 with bit 1 for A and B and 0 for C; the lookup table maps s2 to the node B)
Example: Inserting <A,B,D,C>
(Figure: a new sub-branch D → C is added under root → A → B; the inverted index gets a column s3 with bit 1 for A, B, C and D; the lookup table maps s3 to the new node C)
Example: Inserting <B,C>
(Figure: a new branch root → B → C is created; the inverted index gets a column s4 with bit 1 for B and C only; the lookup table maps s4 to the node C)
Example: Inserting <B,D,E>
(Figure: a new sub-branch D → E is added under root → B; the inverted index gets a column s5 with bit 1 for B, D and E; the lookup table maps s5 to the node E. This is the final state of the three structures after the five sequences have been inserted.)
Insertion
• Linear complexity: O(m), where m is the sequence length.
• A reversible operation (sequences can be recovered from the CPT).
• The insertion order of sequences is preserved in the CPT.
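To make the construction concrete, here is a simplified, hypothetical sketch of the three structures (my own SimpleCPT class, not the authors' implementation; in particular, the inverted index is kept as Python sets instead of bit vectors):

```python
from collections import defaultdict

class SimpleCPT:
    """A didactic sketch of the Compact Prediction Tree structures."""

    class Node:
        def __init__(self, symbol, parent):
            self.symbol = symbol
            self.parent = parent
            self.children = {}                   # symbol -> child Node

    def __init__(self):
        self.root = SimpleCPT.Node(None, None)
        self.inverted_index = defaultdict(set)   # symbol -> ids of sequences containing it
        self.lookup_table = {}                   # sequence id -> last node of that sequence

    def insert(self, seq_id, sequence):
        """Insert one training sequence in O(m) time, m = len(sequence)."""
        node = self.root
        for symbol in sequence:
            node = node.children.setdefault(symbol, SimpleCPT.Node(symbol, node))
            self.inverted_index[symbol].add(seq_id)
        self.lookup_table[seq_id] = node

    def recover(self, seq_id):
        """Lossless: walk up from the last node to recover a training sequence."""
        symbols, node = [], self.lookup_table[seq_id]
        while node.parent is not None:
            symbols.append(node.symbol)
            node = node.parent
        return list(reversed(symbols))

cpt = SimpleCPT()
for i, seq in enumerate(["ABC", "AB", "ABDC", "BC", "BDE"]):
    cpt.insert("s%d" % (i + 1), list(seq))
print(cpt.recover("s3"))   # ['A', 'B', 'D', 'C'] -- the model is reversible
```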
Space complexity
Size of the prediction tree:
◦ Worst case: O(N × average sequence length), where N is the number of sequences.
◦ In general, it is much smaller, because sequences overlap.
(Figure: the prediction tree of the running example)
Space complexity (cont’d)
Size of the inverted index: (n × b), where n = sequence count and b = symbol count.
It stays small because it is encoded as bit vectors.
(Figure: the inverted index of the running example)
Space complexity (cont’d)
Size of the lookup table: n pointers, where n is the sequence count.
(Figure: the lookup table pointing to the last node of each training sequence in the prediction tree)
PREDICTION
Predicting the symbol following <A,B>
(Figure: the prediction tree, inverted index and lookup table built from the five training sequences)
Predicting the symbol following <A,B>
(Figure: the bit vectors of A and B in the inverted index are intersected)
The logical AND indicates that the sequences common to A and B are: s1, s2 and s3.
Predicting the symbol following <A,B>
(Figure: the lookup table entries of s1, s2 and s3 point into the prediction tree)
The lookup table allows us to traverse the corresponding sequences from the end to the start.
Predicting the symbol following <A,B>
(Figure: the symbols found after {A,B} in the matching sequences are counted)
Count table:
C: 2 occurrences after {A,B}
D: 1 occurrence after {A,B}
The most frequent symbol, C, is therefore predicted.
Complexity of prediction
1. Intersection of bit vectors: O(v), where v is the number of symbols.
2. Traversing sequences: O(n), where n is the sequence count.
3. Creating the count table: O(x), where x is the number of symbols in the sequences after the target sequence.
4. Choosing the predicted symbol: O(y), where y is the number of distinct symbols in the count table.
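Continuing the hypothetical SimpleCPT sketch from the construction slides (still illustrative code, not the authors' implementation), the four steps above can be written as:

```python
from collections import defaultdict

def cpt_predict(cpt, prefix):
    """Predict the symbol following `prefix` with the SimpleCPT sketch:
    1) intersect the inverted-index entries of the prefix symbols,
    2) recover each matching sequence through the lookup table,
    3) count the symbols appearing after the last prefix symbol (count table),
    4) return the most frequent one."""
    id_sets = [cpt.inverted_index[s] for s in prefix if s in cpt.inverted_index]
    common = set.intersection(*id_sets) if id_sets else set()
    count_table = defaultdict(int)
    for seq_id in common:
        seq = cpt.recover(seq_id)
        positions = [i for i, s in enumerate(seq) if s == prefix[-1]]
        if positions:
            for symbol in seq[positions[-1] + 1:]:
                count_table[symbol] += 1
    return max(count_table, key=count_table.get) if count_table else None

print(cpt_predict(cpt, ["A", "B"]))   # count table {C: 2, D: 1} -> predicts 'C'
```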
EXPERIMENTAL EVALUATION
Experimental evaluation
Datasets:
• BMS, FIFA, Kosarak: sequences of clicks on webpages.
• SIGN: sentences in sign language.
• BIBLE: sequences of characters in a book.
Experimental evaluation (cont’d)
Competitor algorithms:
• DG (lookup window = 4)
• All-K-Order Markov (order of 5)
• PPM (order of 1)
Evaluation: 10-fold cross-validation
Experimental evaluation (cont’d)
Measures:
• Accuracy = |success count| / |sequence count|
• Coverage = |prediction count| / |sequence count|
Experiment 1 – Accuracy
• CPT is the most accurate, except for one dataset.
• PPM and DG perform well in some situations.
Experiment 1 – size
CPT is:
◦ smaller than All-K-Order Markov,
◦ larger than DG and PPM.
Experiment 1 – time (cont’d)
• CPT's training time is at least 3 times less than that of DG and AKOM, and similar to that of PPM.
• CPT's prediction time is quite high (a trade-off for more accuracy).
Experiment 2 – scalability
CPT shows a trend similar to other algorithms
Experiment 3 – prefix size
• Prefix size: the number of symbols to be used for making a prediction.
• For FIFA, the accuracy of CPT increases until a prefix size of around 8 (this depends on the dataset).
Optimisation #1 - RecursiveDivider
Example: {A,B,C,D}
Level 1: {B,C,D}, {A,C,D}, {A,B,D}, {A,B,C}
Level 2: {C,D}, {B,D}, {B,C}, {A,D}, {A,C}, {A,B}
Level 3: {D}, {C}, {B}, {A}
• Accuracy and coverage increase.
• Training time and prediction time remain more or less the same.
• Therefore, a high value for this parameter is better for all datasets.
Optimisation #2 – sequence splitting
Example:
splitting sequence {A,B,C,D,E,F,G} with split_length = 5 gives {C,D,E,F,G}
Conclusion
• CPT, a new model for sequence prediction:
◦ allows fast incremental updates,
◦ compresses training sequences,
◦ integrates an indexing mechanism,
◦ two optimizations.
• Results:
◦ in general, more accurate than the compared models, but the prediction time is greater (a trade-off),
◦ CPT is more than twice smaller than AKOM,
◦ sequence insertion is more than 3 times faster than with DG and AKOM.
CPT+: DECREASING THE TIME/SPACE COMPLEXITY OF CPT
Gueniche, T., Fournier-Viger, P., Raman, R., Tseng, V. S. (2015). CPT+: Decreasing the time/space complexity of the Compact Prediction Tree. Proc. 19th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD 2015), Springer, LNAI 9078, pp. 625-636.
Introduction
• Two optimisations to reduce the size of the tree used by CPT:
◦ compressing frequent substrings,
◦ compressing simple branches.
• An optimisation to improve prediction time and noise tolerance.
(1) Compressing frequent substrings
• This strategy is applied during training:
◦ it identifies frequent substrings in the training sequences,
◦ it replaces these substrings by new symbols.
• Discovering the substrings is done with a modified version of the PrefixSpan algorithm
◦ parameters: minsup, minLength and maxLength.
(1) Compressing frequent substrings
(Figures: the prediction tree, inverted index and lookup table before and after a frequent substring is replaced by a single new symbol)
(1) Compressing frequent substrings
• Time complexity:
◦ training: a non-negligible cost to discover the frequent substrings,
◦ prediction: symbols are uncompressed on-the-fly in O(1) time.
• Space complexity:
◦ O(m), where m is the number of frequent substrings.
(2) Compressing simple branches
• A second optimization to reduce the size of the tree.
• A simple branch is a branch where all nodes have a single child.
• Each simple branch is replaced by a single node representing the whole branch (see the sketch below).
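A minimal sketch of this compression, again building on the hypothetical SimpleCPT sketch from the CPT section (not the authors' code): starting from the leaves referenced in the lookup table, climb up while the node above has exactly one child, and replace the collected chain by a single merged node.

```python
def compress_simple_branches(cpt):
    """Collapse each simple branch (a chain of single-child nodes ending at a
    leaf) of a SimpleCPT into one node whose symbol is the whole subsequence."""
    for seq_id, leaf in cpt.lookup_table.items():
        if leaf.children:                          # not a leaf: nothing to collapse here
            continue
        symbols, top = [leaf.symbol], leaf
        # climb while the node above is not the root and has exactly one child
        while top.parent.parent is not None and len(top.parent.children) == 1:
            top = top.parent
            symbols.append(top.symbol)
        if len(symbols) > 1:                       # a branch of two or more nodes was found
            merged = SimpleCPT.Node(tuple(reversed(symbols)), top.parent)
            del top.parent.children[top.symbol]
            top.parent.children[merged.symbol] = merged
            cpt.lookup_table[seq_id] = merged
    return cpt

compress_simple_branches(cpt)
print(cpt.lookup_table["s5"].symbol)   # ('D', 'E'): the branch D -> E became one node
```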
(2) Compressing simple branches
(Figures: the prediction tree before and after each simple branch is replaced by a single node, shown with the inverted index and the lookup table)
(2) Compressing simple branches
• Time complexity:
◦ very fast,
◦ after building the tree, we only need to traverse the branches from the bottom using the lookup table.
(3) Improved Noise Reduction
• Recall that CPT removes items from a sequence to be predicted in order to be more noise tolerant.
• Improvements:
◦ only remove the less frequent symbols from sequences, assuming that they are more likely to be noise,
◦ consider a minimum number of sequences to perform a prediction,
◦ add a new parameter, the noise ratio (e.g. 20%), to determine how many symbols should be removed from sequences (e.g. the 20% most infrequent symbols).
◦ Thus, the amount of noise is assumed to be proportional to the length of sequences.
Experiment
Datasets
Competitor algorithms: DG, TDAG, PPM, LZ78, All-K-Order Markov
Prediction accuracy
(Figure: prediction accuracy per dataset)
CPT+ is also up to 4.5 times faster than CPT in terms of prediction time.
Scalability
(Figure: model size in nodes as a function of the sequence count, with PPM among the compared models)
Conclusion
• CPT(+): a novel sequence prediction model
◦ fast training time,
◦ good scalability,
◦ high prediction accuracy.
• Future work:
◦ further compress the model,
◦ compare with other prediction models such as CTW and NN,
◦ data streams, user profiles, …
◦ an open-source library for web prefetching.
• IPredict: https://github.com/tedgueniche/IPredict/tree/master/src/ca/ipredict
Conclusion
• Today, we discussed:
◦ outlier detection
◦ sequence prediction
• Next time: final exam
References
• Chapters 8 and 9 of Tan, Steinbach & Kumar (2006), Introduction to Data Mining, Pearson Education, ISBN-10: 0321321367 (and the accompanying PPTs)
• Han & Kamber (2011). Data Mining: Concepts and Techniques.