
A Comparison Study on Algorithms for
Incremental Update of Frequent Sequences
PhD Annual Talk
Supervisor: Dr. B. C. M. KAO
Speaker: Minghua ZHANG
Aug. 2, 2002
Content

• Introduction
• Problem Definition
• Algorithms
• Performance Comparison
• Conclusion
Introduction

• Mining frequent sequences is one of the important data mining problems.
  – applications:
    • web-log sequences → the most popular web-page visiting orders
    • customers’ buying sequences → customers’ purchase patterns
• Algorithms for solving the problem include AprioriAll, GSP, MFS, SPADE and PrefixSpan.
• These algorithms assume the database is static.
• In practice, the content of a sequence database changes over time.
Introduction

• Some incremental algorithms have been put forward based on the previous algorithms.
  – E.g. GSP+ (based on GSP), MFS+ (based on MFS) and ISM (based on SPADE)
• The above three incremental algorithms have been studied and evaluated separately, under different database update models.
• We compare the performance of the three incremental algorithms and their non-incremental counterparts in various cases.
Problem Definition

• item
  – I = {i1, i2, …, iM}: a set of literals called items.
• transaction (or itemset)
  – transaction t: a set of items such that t ⊆ I.
• sequence
  – sequence s = <t1, t2, …, tn>: an ordered list of transactions.
  – The length of s (represented by |s|) is defined as the number of items contained in s.
    • E.g. if s = <{1},{2,3},{1,4}>, then |s| = 5.
Problem Definition

• subsequence
  – Let s1 = <a1, a2, …, am> and s2 = <b1, b2, …, bn>.
  – If there exist integers j1, j2, …, jn with 1 ≤ j1 < j2 < … < jn ≤ m such that b1 ⊆ aj1, b2 ⊆ aj2, …, bn ⊆ ajn, then s2 is a subsequence of s1, or s1 contains s2 (represented by s2 ⊑ s1).
  – E.g. if s1 = <{1},{2,3},{4,5}> and s2 = <{2},{5}>, then s2 ⊑ s1.
• maximal sequence
  – Given a sequence set V, a sequence s in V is maximal if s is not a subsequence of any other sequence in V.
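As an illustration (not part of the talk), the containment test above can be sketched in Python, representing a sequence as a list of itemsets:

```python
def is_subsequence(s2, s1):
    """Return True if s2 is contained in s1, i.e. there exist positions
    j1 < j2 < ... < jn in s1 such that each itemset of s2 is a subset of
    the itemset of s1 at the matching position."""
    j = 0  # current scan position in s1
    for b in s2:
        # greedily advance to the next itemset of s1 that contains b
        while j < len(s1) and not set(b) <= set(s1[j]):
            j += 1
        if j == len(s1):
            return False  # ran out of itemsets to match against
        j += 1  # later itemsets of s2 must match strictly later positions
    return True

s1 = [{1}, {2, 3}, {4, 5}]
print(is_subsequence([{2}, {5}], s1))  # True, as in the example above
print(is_subsequence([{2, 4}], s1))    # False: no single itemset holds both 2 and 4
```

Matching each itemset of s2 greedily to its earliest feasible position is safe here: an earlier match never rules out a later one.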
Problem Definition

• Given a sequence database D and a sequence s:
  – support count: the number of sequences in D that contain s.
  – support: the fraction of sequences in D that contain s.
  – frequent: s is frequent if its support is no less than a threshold ρs.
• Mining Frequent Sequences
  – Inputs:
    • a database D of sequences
    • a user-specified minimum support threshold ρs (e.g. ρs = 1%)
  – Output:
    • maximal frequent sequences
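The support and maximality definitions can likewise be sketched (an illustrative Python fragment, not from the talk; the database is a plain list of sequences):

```python
# Illustrative sketch: support of a sequence in a database D and the
# maximal members of a sequence set V, per the definitions above.

def contains(s1, s2):
    """True if s2 is a subsequence of s1 (sequences as lists of itemsets)."""
    j = 0
    for b in s2:
        while j < len(s1) and not set(b) <= set(s1[j]):
            j += 1
        if j == len(s1):
            return False
        j += 1
    return True

def support(D, s):
    """Fraction of sequences in D that contain s."""
    return sum(contains(seq, s) for seq in D) / len(D)

def maximal(V):
    """Sequences in V that are not subsequences of any other member of V."""
    return [s for s in V if not any(s != t and contains(t, s) for t in V)]

D = [[{1}, {2, 3}], [{1}, {3}], [{2, 3}]]
print(support(D, [{1}]))                    # 2/3: two of the three sequences contain <{1}>
print(maximal([[{1}], [{1}, {3}], [{3}]]))  # only <{1},{3}> is maximal
```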
Problem Definition

• Two database update models
  – Sequence Model
    • Whole sequences are inserted into and/or removed from the old database.
      – D is the old database;
      – Δ− is the set of sequences removed;
      – Δ+ is the set of sequences inserted;
      – D’ is the new database.
    • E.g. a new web-log sequence is added to the database as a user finishes a visiting session.
    • GSP+ and MFS+ had been evaluated under this model.
Problem Definition

  – Transaction Model
    • Sequences in the old database are updated by appending new transactions at the end.
    • E.g. the old database contains the sequence <{1},{2,3}>. Appending the transaction {4,5} to its end gives <{1},{2,3},{4,5}> in the new database.
    • Customers’ purchasing history databases are usually updated this way.
    • ISM had been evaluated under this model.
• Incremental update problem
  – inputs: the old database, the new database, and the result of mining the old database
  – output: maximal frequent sequences in the new database
Problem Definition

• The two models can model each other.
  – for the Transaction Model:
    • if <{1},{2,3}> → <{1},{2,3},{4,5}>,
    • then it can be mapped to the following updates in the Sequence Model:
      – delete <{1},{2,3}> from the old database; and
      – insert <{1},{2,3},{4,5}> into the new database.
Problem Definition

  – for the Sequence Model:
    • if <{1},{2,3}> is inserted into the new database,
    • then it can be mapped to the following updates in the Transaction Model:
      – append {1} to the end of an initially empty sequence <> to get <{1}>; and
      – append {2,3} to the end of <{1}> to get <{1},{2,3}>.
• Algorithms designed for one model can also work under the other model.
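This mapping can be made concrete with a small sketch (hypothetical helper, not from the talk): a transaction-model append is expressed as one sequence-model deletion plus one insertion.

```python
# Sketch: a transaction-model update -- appending new transactions to one
# customer's sequence -- mapped to its sequence-model equivalent.

def append_as_sequence_ops(old_seq, new_transactions):
    """Return (delta_minus, delta_plus) for the sequence model."""
    new_seq = old_seq + new_transactions
    delta_minus = [old_seq] if old_seq else []  # a brand-new customer deletes nothing
    delta_plus = [new_seq]
    return delta_minus, delta_plus

d_minus, d_plus = append_as_sequence_ops([{1}, {2, 3}], [{4, 5}])
print(d_minus)  # [[{1}, {2, 3}]]          -- removed from the old database
print(d_plus)   # [[{1}, {2, 3}, {4, 5}]]  -- inserted into the new database
```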
Algorithm 1 -- GSP

• GSP is an iterative algorithm for mining frequent sequences.
  – Ci = {candidate sequences of length i}
  – Li = {frequent sequences of length i}

  i = 1;
  Ci = {<{x}> | x ∈ I};
  while ( |Ci| > 0 )
  {
      scan database to get Li from Ci;
      Ci+1 = GGen( Li );
      i++;
  }
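The loop above can be sketched in runnable form. This is a simplified illustration, not the paper's GSP: the stand-in candidate generator merely appends one frequent item as a new 1-item transaction, whereas the real GGen also grows the last itemset and prunes candidates with an infrequent subsequence.

```python
# A runnable sketch of GSP's level-wise loop (simplified GGen; see lead-in).

def gsp(D, min_count):
    def contains(s1, s2):
        """True if s2 is a subsequence of s1."""
        j = 0
        for b in s2:
            while j < len(s1) and not b <= s1[j]:
                j += 1
            if j == len(s1):
                return False
            j += 1
        return True

    items = sorted({x for seq in D for t in seq for x in t})
    C = [[frozenset([x])] for x in items]   # C1 = {<{x}> | x in I}
    frequent = []
    while C:                                # while |Ci| > 0
        # one database scan turns the candidates Ci into the frequent set Li
        L = [c for c in C if sum(contains(s, c) for s in D) >= min_count]
        frequent += L
        # simplified stand-in for GGen(Li): extend each Li member by one item
        C = [l + [frozenset([x])] for l in L for x in items]
    return frequent

D = [[{1}, {2}], [{1}, {2}], [{2}]]
print(gsp(D, 2))  # frequent sequences: <{1}>, <{2}> and <{1},{2}>
```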
Algorithm 1 -- GSP

• Analysis
  – GSP is efficient.
    • A candidate is generated and has its support counted only when all its subsequences are frequent.
  – Its I/O cost may be high.
    • All candidates generated in a pass are of the same length.
    • The number of I/O passes required is determined by the length of the longest frequent sequence.
Algorithm 2 -- MFS

• MFS tries to reduce the I/O cost needed by GSP.
• Basic ideas:
  – making use of a suggested frequent sequence set Sest
    • mine a sample of the database using GSP, or
    • use the result of the previous mining action
  – maintaining a sequence set, called MFSS
    • MFSS = the set of maximal frequent sequences known so far
  – generalizing the candidate generation function of GSP
    • input: a set of frequent sequences of various lengths
    • output: a set of candidate sequences of various lengths
Algorithm 2 -- MFS

• Comparison between GSP and MFS

  Algorithm GSP:
      i = 1;
      Ci = {<{x}> | x ∈ I};
      while ( |Ci| > 0 )
      {
          scan database to get Li from Ci;
          Ci+1 = GGen( Li );
          i++;
      }

  Algorithm MFS:
      MFSS = ø;
      C = {<{x}> | x ∈ I} ∪ Sest;
      while ( |C| > 0 )
      {
          scan database to get frequent sequences from C;
          update MFSS;
          C = MGen( MFSS );
      }

• Longer sequences can be generated and counted early; therefore, MFS can reduce I/O cost.
Algorithms 3 and 4 – GSP+ & MFS+

• GSP+ and MFS+ are incremental algorithms based on GSP and MFS, respectively.
• Observations:
  – If a sequence s is frequent in D, then its support count in D’ can be deduced by scanning Δ− and Δ+, without scanning D−.
  – If a sequence s is infrequent in D, then it cannot be frequent in D’ unless
    • its support count in Δ+ is large enough, and
    • its support count in Δ− is small enough.
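The first observation amounts to simple arithmetic on support counts. A sketch (not the paper's code; Δ− and Δ+ are passed as plain lists of sequences):

```python
# Sketch: updating a known support count using only the deltas, per the
# first observation:
#   count_D'(s) = count_D(s) - count over delta_minus + count over delta_plus

def contains(s1, s2):
    """True if s2 is a subsequence of s1 (sequences as lists of itemsets)."""
    j = 0
    for b in s2:
        while j < len(s1) and not b <= s1[j]:
            j += 1
        if j == len(s1):
            return False
        j += 1
    return True

def updated_count(old_count, delta_minus, delta_plus, s):
    return (old_count
            - sum(contains(seq, s) for seq in delta_minus)
            + sum(contains(seq, s) for seq in delta_plus))

# one removed sequence contains <{1}>; one of the two inserted sequences does
print(updated_count(5, [[{1}, {2}]], [[{1}], [{3}]], [{1}]))  # 5 - 1 + 1 = 5
```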
Algorithms 3 and 4 – GSP+ & MFS+

• A pruning technique is devised.
  – purpose: reduce the number of candidates whose supports w.r.t. D’ have to be counted.
  – method: deduce whether a candidate has the potential of being frequent by considering its support counts in Δ− and/or Δ+.
    • Two lemmas are proposed for deleting candidates that are frequent in D and infrequent in D, respectively.
  – result: if a candidate is deleted, then its support in D− does not need to be counted.
  – advantage: CPU cost is saved significantly.
Algorithms 3 and 4 – GSP+ & MFS+

• GSP+ and MFS+
  – Their structures are similar to those of GSP and MFS.
  – Major difference: each time after generating candidates, use the pruning technique to delete as many candidates as possible, then count the supports of the remaining candidates in D’.
  – They achieve efficiency through candidate pruning.
Algorithm 5 -- SPADE

• SPADE is an algorithm for mining frequent sequences.
• Database representation
  – Previous algorithms work on horizontal databases.
    • Every row in the database table is a transaction, with Customer ID as the primary sort key and transaction timestamp as the secondary sort key.
  – SPADE requires a vertical database.
    • Each item has an id-list:
      – a list of (Customer ID, transaction timestamp) pairs;
      – each pair identifies a transaction that contains the item.
    • The vertical database is composed of the id-lists of all items.
Algorithm 5 -- SPADE

• Example

  Horizontal database:

  Customer ID | Transaction timestamp | Itemset
  1           | 110                   | ABC
  1           | 120                   | EG
  1           | 130                   | CD
  2           | 210                   | AB
  2           | 220                   | ADE
  3           | 310                   | AF
  3           | 320                   | BE
  4           | 410                   | G
  4           | 420                   | ABF
  4           | 430                   | CE

  Vertical database:

  Item | Customer ID | Transaction timestamp
  A    | 1           | 110
  A    | 2           | 210
  A    | 2           | 220
  A    | 3           | 310
  A    | 4           | 420
  …    | …           | …
  G    | 1           | 120
  G    | 4           | 410
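A sketch of how the vertical representation is derived from the horizontal one, using the rows of the example above (illustrative code, not SPADE's implementation):

```python
# Build per-item id-lists from the example horizontal database.
from collections import defaultdict

horizontal = [  # (Customer ID, transaction timestamp, itemset) rows
    (1, 110, "ABC"), (1, 120, "EG"), (1, 130, "CD"),
    (2, 210, "AB"), (2, 220, "ADE"),
    (3, 310, "AF"), (3, 320, "BE"),
    (4, 410, "G"), (4, 420, "ABF"), (4, 430, "CE"),
]

id_lists = defaultdict(list)
for cust, ts, itemset in horizontal:
    for item in itemset:
        id_lists[item].append((cust, ts))  # each pair names one transaction

print(id_lists["A"])  # [(1, 110), (2, 210), (2, 220), (3, 310), (4, 420)]
print(id_lists["G"])  # [(1, 120), (4, 410)]
```

The two printed id-lists match the A and G rows of the vertical table above.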
Algorithm 5 -- SPADE

• the id-list of a sequence s
  – a list of (Customer ID, transaction timestamp) pairs such that:
    • s is a subsequence of the customer sequence identified by Customer ID; and
    • the transaction identified by the transaction timestamp contains the first item of s.
  – E.g. for the sequence <{B},{E}>, the pair (1, 110) appears in its id-list:

  Customer ID | Transaction timestamp | Itemset
  1           | 110                   | ABC
  1           | 120                   | EG
  1           | 130                   | CD
Algorithm 5 -- SPADE

• The support count of a sequence = the number of distinct customer IDs that appear in its id-list.
• How to compute the id-list of a sequence?
  – generating subsequences
    • For a sequence s of length 2 or longer, its two generating subsequences are obtained by removing the first or the second item of s.
    • E.g. if s = <{A,B},{C}>, then its generating subsequences are <{B},{C}> and <{A},{C}>.
  – id-list intersection
    • The id-list of a sequence can be computed from the id-lists of its two generating subsequences.
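A much-simplified illustration of id-list intersection, restricted to the single-item case (an assumption for brevity; SPADE's actual join also handles itemset-extensions): the id-list of <{x},{y}> keeps each entry of x's id-list that is followed, for the same customer, by an occurrence of y.

```python
# Simplified temporal join of two id-lists, plus support counting.

def temporal_join(idlist_x, idlist_y):
    """Id-list of <{x},{y}>: keep (c, t) from x's id-list when the same
    customer c has an occurrence of y strictly later than t."""
    return [(c, t) for (c, t) in idlist_x
            if any(c2 == c and t2 > t for (c2, t2) in idlist_y)]

def support_count(idlist):
    """Number of distinct customer IDs in the id-list."""
    return len({c for (c, _) in idlist})

# id-lists of B and E taken from the example database above
idlist_B = [(1, 110), (2, 210), (3, 320), (4, 420)]
idlist_E = [(1, 120), (2, 220), (3, 320)]
print(temporal_join(idlist_B, idlist_E))                 # [(1, 110), (2, 210)]
print(support_count(temporal_join(idlist_B, idlist_E)))  # 2
```

Customer 3 drops out because its only E occurs in the same transaction as B, not after it; customer 4 has no E at all.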
Algorithm 5 -- SPADE

• Three steps of SPADE
  – 1. Find frequent length-1 sequences.
  – 2. Find frequent length-2 sequences.
  – 3. Find long frequent sequences (length 3 or above).
• How to count the support of a candidate?
  – in step 1, by reading the id-list of the item from the database.
  – in step 2, by building a horizontal database from the vertical database and calculating a candidate’s support from the horizontal database directly.
    • The horizontal database holds information of all frequent items.
  – in step 3, by calculating the id-list of a candidate through id-list intersection.
Algorithm 5 -- SPADE

• Analysis
  – Efficiency
    • The vertical database representation allows efficient support counting using the idea of id-lists.
  – Memory requirement
    • In step 2, constructing a horizontal database may require a lot of memory, especially when the number of frequent items is large and the vertical database is large.
Algorithm 6 -- ISM

• ISM – an incremental algorithm based on SPADE.
  – information needed w.r.t. the old database:
    • all frequent sequences (and their support counts);
    • all sequences (and their support counts) in the negative border (NB)
  – a sequence s is in NB if
    • s is not frequent and |s| = 1; or
    • s is not frequent and both of its two generating subsequences are frequent.
  – the information is used to construct an increment sequence lattice, or ISL.
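The negative-border test can be sketched as follows (illustrative encoding, not ISM's code: a sequence is a tuple of frozensets so it can be a set member; helper names are hypothetical):

```python
# Negative-border membership per the definition above: a non-frequent
# sequence is in NB if it has length 1, or both of its generating
# subsequences (remove the first item, remove the second item) are frequent.

def remove_item(s, k):
    """Drop the k-th item (ordering items within each itemset sorted)."""
    out, seen = [], 0
    for t in s:
        items = sorted(t)
        if seen <= k < seen + len(items):
            t = frozenset(t) - {items[k - seen]}
        seen += len(items)
        if t:                      # drop itemsets emptied by the removal
            out.append(frozenset(t))
    return tuple(out)

def in_negative_border(s, frequent):
    """Assumes s itself is known to be non-frequent."""
    if sum(len(t) for t in s) == 1:
        return True
    return remove_item(s, 0) in frequent and remove_item(s, 1) in frequent

frequent = {(frozenset("B"), frozenset("C")), (frozenset("A"), frozenset("C"))}
s = (frozenset("AB"), frozenset("C"))   # <{A,B},{C}>, assumed not frequent
print(in_negative_border(s, frequent))  # True: both generators are frequent
```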
Algorithm 6 -- ISM

• An example of ISL (figure omitted)
Algorithm 6 -- ISM

• With the help of the ISL, ISM can find frequent sequences in the new database efficiently.
• The major work of ISM consists of operations on the ISL.
  – Are there any new sequences inserted into the database?
    • Yes: the ISL is adjusted.
    • No: no adjustment of the ISL is needed.
  – In later steps, ISM continues to update the ISL:
    • sequences may be moved from NB to the frequent set;
    • new sequences may be added to the ISL;
    • the ISL is output for later incremental updates.
Algorithm 6 -- ISM

• Summary
  – ISM works on the vertical database.
  – The size of the ISL greatly affects the performance of ISM.
  – Similar to SPADE, ISM may require a lot of memory.
  – ISM can only deal with the case of insertion.
Performance Comparison

• Synthetic dataset
  – Parameters of the dataset

  Parameter | Description                                                        | Value
  |D|       | Number of customers                                                | -
  |C|       | Average number of transactions per customer                        | 10
  |T|       | Average number of items per transaction                            | 2.5
  |S|       | Average no. of itemsets in maximal potentially frequent sequences  | 4
  |I|       | Average size of itemsets in maximal potentially frequent sequences | 1.25
  Ns        | Number of maximal potentially frequent sequences                   | 5,000
  NI        | Number of maximal potentially frequent itemsets                    | 25,000
  N         | Number of items                                                    | -
Performance Comparison

• Sequence Model
  – |D| = 1,000,000, |Δ+| = 100,000 = 10%·|D|, |Δ−| = 0, |D’| = 1,100,000 (figure omitted)
Performance Comparison

• Sequence Model
  – As the support threshold ρs increases, the running time of all six algorithms decreases.
  – The performance difference among the algorithms is more substantial when ρs is small.
  – GSP+ and MFS+ perform much better than GSP and MFS.
    • Many candidates are pruned.
  – ISM performs worse than SPADE.
    • Under the sequence model of database update, ISM needs to work harder to maintain the ISL.
  – SPADE is the most efficient algorithm.
    • Memory is enough (4 GB).
Performance Comparison

• The performance of ISM is greatly affected by the size of the ISL.
  – the number of items (N) has two opposing effects:
    • Each item derives one length-1 sequence, and all length-1 sequences are in the ISL → a large N gives a fat ISL.
    • With other parameters fixed, more items means fewer frequent sequences → a large N leads to a smaller ISL.
Performance Comparison

• Sequence Model
  – Execution time vs. the number of items (N) (figure omitted)
Performance Comparison

• Sequence Model
  – Effect of the database size (running on a PC with 512 MB of memory) (figure omitted)
Performance Comparison

• Transaction Model (figure omitted)
Performance Comparison

• Transaction Model
  – ISM performs best unless ρs is very small.
    • In the transaction model, there is much less change to the ISL.
    • When ρs is small, the ISL is large and ISM is not as efficient as SPADE.
  – GSP+ and MFS+ are not very effective.
    • Candidates that are frequent in D cannot be pruned.
    • Candidates that are infrequent in D are very unlikely to be pruned.
  – SPADE works consistently well over the range of ρs.
    • Memory is enough (4 GB).
Performance Comparison

• Transaction Model
  – Effect of the percentage of sequences updated (Δ%) (figure omitted)
Performance Comparison

• Transaction Model
  – Effect of the percentage of sequences updated (Δ%)
    • GSP+ and MFS+ are relatively unaffected by the percentage change.
    • The execution time of ISM increases linearly with the percentage change.
      – More updates lead to more changes to the ISL.
  – If the change is small, ISM performs best; otherwise, SPADE is the best choice.
Performance Comparison

• Transaction Model
  – Effect of the database size (running on a PC with 512 MB of memory) (figure omitted)
Conclusion

• Guidelines
  – Sequence Model
    • Main memory is relatively large & the number of items is small → ISM
    • Main memory is relatively large & the number of items is large → SPADE
    • Main memory is limited → GSP+ or MFS+
Conclusion

• Guidelines
  – Transaction Model
    • Main memory is abundant & the database update portion is small → ISM
    • Main memory is abundant & the database update portion is significant → SPADE
    • Main memory is limited → GSP, MFS, GSP+ or MFS+