A Comparison Study on Algorithms for Incremental Update of Frequent Sequences

PhD Annual Talk
Supervisor: Dr. B. C. M. Kao
Speaker: Minghua Zhang
Aug. 2, 2002

Content
– Introduction
– Problem Definition
– Algorithms
– Performance Comparison
– Conclusion

Introduction
Mining frequent sequences is one of the important data mining problems.
– Applications:
  – web-log sequences: the most popular web-page visiting orders
  – customers' buying sequences: customers' purchase patterns
Algorithms for solving the problem include AprioriAll, GSP, MFS, SPADE and PrefixSpan.
These algorithms assume the database is static. In practice, the content of a sequence database changes over time.

Introduction
Some incremental algorithms have been put forward based on the previous algorithms.
– E.g.: GSP+ (based on GSP), MFS+ (based on MFS) and ISM (based on SPADE)
The above three incremental algorithms have been studied and evaluated separately, under different database update models.
We compare the performance of the three incremental algorithms and their non-incremental counterparts in various cases.

Problem Definition
– item: I = {i1, i2, ..., iM} is a set of literals called items.
– transaction (or itemset) t: a set of items such that t ⊆ I.
– sequence s = <t1, t2, ..., tn>: an ordered list of transactions.
  – The length of s (written |s|) is the number of items contained in s.
  – E.g.: if s = <{1}, {2,3}, {1,4}>, then |s| = 5.

Problem Definition
– subsequence: given s1 = <a1, a2, ..., am> and s2 = <b1, b2, ..., bn>, s2 is a subsequence of s1 (s1 contains s2, written s2 ⊑ s1) if there exist integers 1 ≤ j1 < j2 < ... < jn ≤ m such that b1 ⊆ aj1, b2 ⊆ aj2, ..., bn ⊆ ajn.
  – E.g.: if s1 = <{1}, {2,3}, {4,5}> and s2 = <{2}, {5}>, then s2 ⊑ s1.
– maximal sequence: given a sequence set V, a sequence s in V is maximal if s is not a subsequence of any other sequence in V.

Problem Definition
Given a sequence database D and a sequence s:
– support count: the number of sequences in D that contain s.
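The subsequence relation and support count just defined can be sketched in a few lines of Python (a minimal sketch; the function names are illustrative, not taken from the papers):

```python
def is_subsequence(s2, s1):
    """Return True if s2 is a subsequence of s1 (s2 ⊑ s1).

    A sequence is a list of transactions (sets of items): s2 ⊑ s1 iff
    there exist indices 1 <= j1 < j2 < ... < jn <= m with each
    transaction bk of s2 a subset of the transaction a_jk of s1.
    """
    j = 0
    for b in s2:
        # greedily advance to the next transaction of s1 that covers b
        while j < len(s1) and not set(b) <= set(s1[j]):
            j += 1
        if j == len(s1):
            return False
        j += 1  # the next transaction of s2 must match strictly later
    return True

def support_count(D, s):
    """Number of sequences in database D that contain s."""
    return sum(1 for seq in D if is_subsequence(s, seq))
```

For the slide's example, `is_subsequence([{2}, {5}], [{1}, {2,3}, {4,5}])` returns True, while `is_subsequence([{5}, {2}], [{1}, {2,3}, {4,5}])` returns False, since the order of transactions matters.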
– support: the fraction of sequences in D that contain s.
– frequent: s is frequent if its support is no less than a threshold ρs.

Mining Frequent Sequences
– Inputs: a database D of sequences, and a user-specified minimum support threshold ρs (e.g., ρs = 1%)
– Output: the maximal frequent sequences

Problem Definition
Two database update models:
– Sequence Model
  – Whole sequences are inserted into and/or removed from the old database.
  – D is the old database; Δ− is the set of sequences removed; Δ+ is the set of sequences inserted; D′ is the new database.
  – E.g.: a new web-log sequence is added to the database when a user finishes a visiting session.
  – GSP+ and MFS+ were evaluated under this model.
– Transaction Model
  – Sequences in the old database are updated by appending new transactions at the end.
  – E.g.: the old database contains the sequence <{1}, {2,3}>. Appending the transaction {4,5} to its end gives the sequence <{1}, {2,3}, {4,5}> in the new database.
  – Customers' purchasing-history databases are usually updated this way.
  – ISM was evaluated under this model.

Incremental update problem
– inputs: the old database, the new database, and the result of mining the old database
– output: the maximal frequent sequences in the new database

Problem Definition
The two models can simulate each other.
– For the Transaction Model: if <{1}, {2,3}> becomes <{1}, {2,3}, {4,5}>, the update maps to the following updates in the Sequence Model:
  – delete <{1}, {2,3}> from the old database; and
  – insert <{1}, {2,3}, {4,5}> into the new database.
– For the Sequence Model: if <{1}, {2,3}> is inserted into the new database, the insertion maps to the following updates in the Transaction Model:
  – append {1} to the end of an initially empty sequence <>, to get <{1}>; and
  – append {2,3} to the end of <{1}>, to get <{1}, {2,3}>.
Algorithms designed for one model can therefore also work under the other model.

Algorithm 1 -- GSP
GSP is an iterative algorithm for mining frequent sequences.
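As a concrete illustration of GSP's candidate-generate-and-count loop, here is a toy Python sketch. It assumes, for simplicity, that every transaction holds a single item, and it generates length-(i+1) candidates by appending any item to a frequent length-i sequence; this is a cruder generator than GSP's GGen join, but by the Apriori property it still finds every frequent sequence:

```python
def mine_frequent(D, min_sup):
    """Level-wise mining in the spirit of GSP (single-item transactions).

    D is a list of item tuples; frequent sequences are found level by
    level, with one database scan per candidate length.
    """
    def contains(seq, s):
        it = iter(seq)
        return all(x in it for x in s)   # items of s appear in order in seq

    def support(s):
        return sum(contains(seq, s) for seq in D)

    items = sorted({x for seq in D for x in seq})
    Ci = [(x,) for x in items]           # length-1 candidates
    frequent = []
    while Ci:                            # one "pass" per candidate length
        Li = [s for s in Ci if support(s) >= min_sup]
        frequent += Li
        # naive next-level generation: extend each frequent sequence by one item
        Ci = [s + (x,) for s in Li for x in items]
    return frequent
```

For example, `mine_frequent([("a", "b", "c"), ("a", "c"), ("b", "c")], min_sup=2)` returns the five frequent sequences `[("a",), ("b",), ("c",), ("a", "c"), ("b", "c")]`.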
– Ci = {candidate sequences of length i}
– Li = {frequent sequences of length i}

  i = 1; C1 = {<{x}> | x ∈ I};
  while (|Ci| > 0) {
    scan database to get Li from Ci;
    Ci+1 = GGen(Li);
    i++;
  }

Algorithm 1 -- GSP
Analysis
– GSP is efficient: a candidate is generated and has its support counted only when all its subsequences are frequent.
– Its I/O cost may be high: all candidates generated in a pass are of the same length, so the number of I/O passes required is determined by the length of the longest frequent sequences.

Algorithm 2 -- MFS
MFS tries to reduce the I/O cost needed by GSP. Basic ideas:
– make use of a suggested frequent sequence set Sest
  – e.g., mine a sample of the database using GSP, or take the result of the previous mining action
– maintain a sequence set, called MFSS
  – MFSS = the set of maximal frequent sequences known so far
– generalize the candidate generation function of GSP
  – input: a set of frequent sequences of various lengths
  – output: a set of candidate sequences of various lengths

Algorithm 2 -- MFS
Comparison between GSP and MFS

  Algorithm GSP:
    i = 1; C1 = {<{x}> | x ∈ I};
    while (|Ci| > 0) {
      scan database to get Li from Ci;
      Ci+1 = GGen(Li);
      i++;
    }

  Algorithm MFS:
    MFSS = ∅; C = {<{x}> | x ∈ I} ∪ Sest;
    while (|C| > 0) {
      scan database to get the frequent sequences in C;
      update MFSS;
      C = MGen(MFSS);
    }

Longer sequences can be generated and counted early; therefore MFS can reduce I/O cost.

Algorithms 3 and 4 -- GSP+ & MFS+
GSP+ and MFS+ are incremental algorithms based on GSP and MFS, respectively.
Observations:
– If a sequence s is frequent in D, then its support count in D′ can be deduced by scanning Δ− and Δ+, without scanning D−.
– If a sequence s is infrequent in D, then it cannot be frequent in D′ unless
  – its support count in Δ+ is large enough, and
  – its support count in Δ− is small enough.

Algorithms 3 and 4 -- GSP+ & MFS+
A pruning technique is devised.
– purpose: reduce the number of candidates whose supports w.r.t. D′ have to be counted.
– method: deduce whether a candidate has the potential of being frequent by considering its support counts in Δ− and/or Δ+.
  – Two lemmas are proposed for deleting candidates, covering candidates that are frequent in D and infrequent in D, respectively.
– result: if a candidate is deleted, its support in D− does not need to be counted.
– advantage: significant CPU cost is saved.

Algorithms 3 and 4 -- GSP+ & MFS+
GSP+ and MFS+
– Their structures are similar to those of GSP and MFS.
– Major difference: each time after generating candidates, use the pruning technique to delete as many candidates as possible, then count the supports of the remaining candidates in D′.
– They achieve efficiency by candidate pruning.

Algorithm 5 -- SPADE
SPADE is an algorithm for mining frequent sequences.
Database representation
– Previous algorithms work on horizontal databases.
  – Every row in the database table is a transaction, with Customer ID as the primary sort key and transaction timestamp as the secondary sort key.
– SPADE requires a vertical database.
  – Each item has an id-list: a list of (Customer ID, transaction timestamp) pairs, where each pair identifies a transaction that contains the item.
  – The vertical database is composed of the id-lists of all items.

Algorithm 5 -- SPADE
Example

  Horizontal database:
  Customer ID  Timestamp  Itemset
  1            110        ABC
  1            120        EG
  1            130        CD
  2            210        AB
  2            220        ADE
  3            310        AF
  3            320        BE
  4            410        G
  4            420        ABF
  4            430        CE

  Vertical database (fragment):
  Item  Customer ID  Timestamp
  A     1            110
  A     2            210
  A     2            220
  A     3            310
  A     4            420
  ...   ...          ...
  G     1            120
  G     4            410

Algorithm 5 -- SPADE
The id-list of a sequence s
– a list of (Customer ID, transaction timestamp) pairs such that:
  – s is a subsequence of the customer sequence identified by the Customer ID; and
  – the transaction identified by the transaction timestamp contains the first item of s.
– E.g.: for the sequence <{B}, {E}>, the pair (1, 110) appears in its id-list.
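The horizontal-to-vertical conversion can be sketched as follows; the sample data mirrors the slide's example tables, and `to_vertical` is an illustrative name, not SPADE's API:

```python
from collections import defaultdict

def to_vertical(horizontal):
    """Build the vertical database: one id-list per item.

    `horizontal` maps each Customer ID to its timestamp-sorted list of
    (timestamp, itemset) rows; the id-list of an item is the list of
    (Customer ID, timestamp) pairs of the transactions containing it.
    """
    idlists = defaultdict(list)
    for cid in sorted(horizontal):
        for ts, itemset in horizontal[cid]:
            for item in itemset:
                idlists[item].append((cid, ts))
    return dict(idlists)

# The slide's example database
horizontal = {
    1: [(110, "ABC"), (120, "EG"), (130, "CD")],
    2: [(210, "AB"), (220, "ADE")],
    3: [(310, "AF"), (320, "BE")],
    4: [(410, "G"), (420, "ABF"), (430, "CE")],
}
```

Here `to_vertical(horizontal)["A"]` gives `[(1, 110), (2, 210), (2, 220), (3, 310), (4, 420)]` and `to_vertical(horizontal)["G"]` gives `[(1, 120), (4, 410)]`, matching the vertical-database fragment above.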
  Customer ID  Timestamp  Itemset
  1            110        ABC
  1            120        EG
  1            130        CD

Algorithm 5 -- SPADE
The support count of a sequence = the number of distinct Customer IDs that appear in its id-list.
How to compute the id-list of a sequence?
– generating subsequences
  – For a sequence s of length 2 or longer, its two generating subsequences are obtained by removing the first or the second item of s.
  – E.g.: if s = <{A,B}, {C}>, its generating subsequences are <{B}, {C}> and <{A}, {C}>.
– id-list intersection
  – The id-list of a sequence can be computed from the id-lists of its two generating subsequences.

Algorithm 5 -- SPADE
Three steps of SPADE
– 1. Find frequent length-1 sequences.
– 2. Find frequent length-2 sequences.
– 3. Find long frequent sequences (length 3 or above).
How to count the support of a candidate?
– In step 1, by reading the id-list of the item from the database.
– In step 2, by building a horizontal database from the vertical database (holding the information of all frequent items), and calculating a candidate's support from the horizontal database directly.
– In step 3, by calculating the id-list of a candidate through id-list intersection.

Algorithm 5 -- SPADE
Analysis
– Efficiency: the vertical database representation allows efficient support counting using the idea of id-lists.
– Memory requirement: in step 2, constructing a horizontal database may require a lot of memory, especially when the number of frequent items and the vertical database are large.

Algorithm 6 -- ISM
ISM is an incremental algorithm based on SPADE.
– Information needed w.r.t. the old database:
  – all frequent sequences (and their support counts);
  – all sequences (and their support counts) in the negative border (NB).
– A sequence s is in NB if:
  – s is not frequent and |s| = 1; or
  – s is not frequent and both of its two generating subsequences are frequent.
– The information is used to construct an incremental sequence lattice, or ISL.
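Returning to SPADE's id-list intersection described above: for the simplest case, a sequence <{x}, {y}> built from two single items, the intersection is a temporal join. The sketch below covers only this case (SPADE's full intersection also handles itemset extensions), and the helper names are illustrative:

```python
def temporal_join(idlist_x, idlist_y):
    """Id-list of the sequence <{x}, {y}> from the item id-lists of x and y.

    A pair (cid, ts) from x's id-list survives if the same customer also
    has y at a strictly later timestamp; the surviving pair keeps the
    timestamp of the FIRST item, matching the id-list definition above.
    """
    y_times = {}
    for cid, ts in idlist_y:
        y_times.setdefault(cid, []).append(ts)
    return [(cid, ts) for cid, ts in idlist_x
            if any(t > ts for t in y_times.get(cid, []))]

def support_from_idlist(idlist):
    # support count = number of distinct Customer IDs in the id-list
    return len({cid for cid, _ in idlist})
```

With the example database, B's id-list is [(1, 110), (2, 210), (3, 320), (4, 420)] and E's is [(1, 120), (2, 220), (3, 320), (4, 430)]; the join yields [(1, 110), (2, 210), (4, 420)] (customer 3's B is not followed by a strictly later E), so the support count of <{B}, {E}> is 3, and the pair (1, 110) indeed appears, as noted above.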
Algorithm 6 -- ISM
[Figure: an example of an ISL]

Algorithm 6 -- ISM
With the help of the ISL, ISM can find frequent sequences in the new database efficiently. The major work of ISM is to perform operations on the ISL.
– Are there any new sequences inserted into the database?
  – Yes: the ISL is adjusted.
  – No: no adjustment of the ISL is needed.
– In later steps, ISM continues to operate on the ISL:
  – sequences may be moved from the NB to the frequent set;
  – new sequences may be added to the ISL;
  – the ISL is output for later incremental updates.

Algorithm 6 -- ISM
Summary
– ISM works on the vertical database.
– The size of the ISL affects the performance of ISM greatly.
– Similar to SPADE, ISM may require a lot of memory.
– ISM can only deal with the case of insertion.

Performance Comparison
Synthetic dataset: parameters of the dataset

  Parameter  Description                                                      Value
  |D|        Number of customers                                              -
  |C|        Average number of transactions per customer                      10
  |T|        Average number of items per transaction                          2.5
  |S|        Average no. of itemsets in maximal potentially frequent seqs.    4
  |I|        Average size of itemsets in maximal potentially frequent seqs.   1.25
  Ns         Number of maximal potentially frequent sequences                 5,000
  NI         Number of maximal potentially frequent itemsets                  25,000
  N          Number of items                                                  -

Performance Comparison
Sequence Model
– |D| = 1,000,000, |Δ+| = 100,000 = 10%|D|, |Δ−| = 0, |D′| = 1,100,000
[Figure: execution time vs. support threshold ρs]

Performance Comparison
Sequence Model
– As the support threshold ρs increases, the running time of all 6 algorithms decreases. The performance difference among the algorithms is more substantial when ρs is small.
– GSP+ and MFS+ perform much better than GSP and MFS: many candidates are pruned.
– SPADE is the most efficient algorithm (memory is sufficient: 4 GB).
– ISM performs worse than SPADE: under the sequence model of database update, ISM needs to work harder to maintain the ISL.

Performance Comparison
The performance of ISM is affected greatly by the size of the ISL.
– the number of items (N) has two opposing effects:
  – Each item derives one length-1 sequence, and all length-1 sequences are in the ISL, so a large N gives a fat ISL.
  – With the other parameters fixed, more items means fewer frequent sequences, so a large N leads to a smaller ISL.

Performance Comparison
Sequence Model
– [Figure: execution time vs. the number of items (N)]
– [Figure: effect of the database size (running on a PC with 512 MB of memory)]

Performance Comparison
Transaction Model
– [Figure: execution time vs. support threshold ρs]
– ISM performs best unless ρs is very small.
  – In the transaction model, there is much less change to the ISL.
  – When ρs is small, the ISL is large, and ISM is not as efficient as SPADE.
– GSP+ and MFS+ are not very effective.
  – Candidates that are frequent in D cannot be pruned.
  – Candidates that are infrequent in D are very unlikely to be pruned.
– SPADE works consistently well over the range of ρs (memory is sufficient: 4 GB).

Performance Comparison
Transaction Model
– Effect of the percentage of sequences updated (%):
  – GSP+ and MFS+ are relatively unaffected by the percentage change.
  – The execution time of ISM increases linearly with the percentage change: more updates lead to more changes to the ISL.
  – If the change is small, ISM performs best; otherwise, SPADE is the best choice.
– [Figure: effect of the database size (running on a PC with 512 MB of memory)]

Conclusion
Guidelines
– Sequence Model
  – Main memory is relatively large & the number of items is small: ISM
  – Main memory is relatively large & the number of items is large: SPADE
  – Main memory is limited: GSP+ or MFS+
– Transaction Model
  – Main memory is abundant & the database update portion is small: ISM
  – Main memory is abundant & the database update portion is significant: SPADE
  – Main memory is limited: GSP, MFS, GSP+, or MFS+