
Efficient Mining of Closed Repetitive Gapped Subsequences from a Sequence Database
Bolin Ding, David Lo, Jiawei Han, and Siau-Cheng Khoo
ICDE 2009
Motivation
 A huge wealth of sequence data
 Program execution traces
 Sequences of words (text data)
 Customer purchasing records
 Credit card usage histories
 Protein sequences
 A database of long sequences
 Patterns repeat multiple times in a sequence
Example
 Consider a trading company handling the requests of customers:
 A : request placed
 B : request in-process
 C : request cancelled
 D : product delivered
 S1 = AABCDABB, S2 = ABCD
 Is pattern AB more frequent than CD?
[Figure: non-overlapping, maximum sets of instances highlighted in S1 and S2]
Our Goal
 Repetitive pattern/support
 Repetitive non-overlapping instances of a pattern
 Maximum embedding of instances
Problem Statement
Mining (Closed) Repetitive Gapped Subsequences
 Input SeqDB = {S1, S2, …, SN} and min_sup
 Output patterns with support sup(P) ≥ min_sup
 Subsequence: S = e1e2…em is a subsequence of another sequence S' = e'1e'2…e'n (m ≤ n)
 Landmark: if there exists a sequence of integers 1 ≤ k1 < k2 < … < km ≤ n s.t. S[i] = S'[ki] for i = 1, 2, …, m; such a sequence of integers <k1, …, km> is called a landmark.
 For example, S = AB is a subsequence of S' = ABAABAB; its landmarks include <1,2>, <1,5>, <1,7>, <3,5>, …
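As an illustration of the definition above, here is a minimal Python sketch (not from the paper; find_landmarks is an illustrative name) that enumerates all landmarks of a pattern in a sequence:

def find_landmarks(pattern, seq):
    """Enumerate all landmarks (1-based, strictly increasing position tuples)
    of `pattern` as a subsequence of `seq`."""
    results = []

    def extend(p_idx, start, chosen):
        if p_idx == len(pattern):            # every pattern event matched
            results.append(chosen)
            return
        for k in range(start, len(seq) + 1): # try each later position
            if seq[k - 1] == pattern[p_idx]:
                extend(p_idx + 1, k + 1, chosen + (k,))

    extend(0, 1, ())
    return results

# Landmarks of AB in ABAABAB include <1,2>, <1,5>, <1,7>, <3,5>, ...
print(find_landmarks("AB", "ABAABAB"))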
(Cont.)
 Instance of pattern in SeqDB
{(i, <k1, k2, …, km>): P[1] = Si[k1], P[2] = Si[k2], …, P[m] = Si[km]}
 Overlapping instances
(i, <k1, k2, …, km>) and (i, <k1’, k2’, …, km’>): kj = kj’ for some j
 sup(P): the size of a maximum set of NON-overlapping instances
  sup(P) = max{ |INS| : INS is a set of non-overlapping instances of P }
 Support set INS: a set of non-overlapping instances of P with |INS| = sup(P)
 Ex: S1(AB) = {(1,<1,2>),(1,<1,5>),(1,<4,5>)}
     S2(AB) = {(2,<1,3>),(2,<2,3>),(2,<1,4>),(2,<2,4>)}
 Instance set IAB = {(1,<1,2>),(1,<4,5>),(2,<1,3>),(2,<2,4>)} is non-overlapping
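A tiny sketch of the overlap test defined above (Python; overlap is an illustrative name, not from the paper). Instances are written as (sequence index, landmark tuple):

def overlap(inst1, inst2):
    """Per the definition above: two instances of the same pattern overlap
    iff they are in the same sequence and kj = kj' for some position index j."""
    (i1, lm1), (i2, lm2) = inst1, inst2
    return i1 == i2 and any(k == k2 for k, k2 in zip(lm1, lm2))

print(overlap((1, (1, 2)), (1, (1, 5))))   # True: both use position 1 as k1
print(overlap((1, (1, 2)), (1, (4, 5))))   # False: non-overlapping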
Repetitive Gapped Subsequence: Support
 About the definition of support
  Repetitive instances of a pattern within each sequence
  Why non-overlapping
   Avoid over-estimating the frequency: AAAABBBBCCCC contains 4^3 = 64 instances of ABC, but only 4 non-overlapping ones (see the sketch after the example below)
  Maximize the size of the non-overlapping instance set
   Measures how frequent a pattern is
   A unified definition for all patterns
 Example: S1 = A A B C D A B B, S2 = A B C D
  sup(AB) = 4
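The over-estimation point above can be checked with a small brute-force count (a Python sketch, not from the paper): enumerating every landmark of ABC in AAAABBBBCCCC gives 4^3 = 64 instances, while only 4 of them are pairwise non-overlapping (the greedy instance growth on the later slides finds such a set).

s = "AAAABBBBCCCC"
# every landmark <ka, kb, kc> of ABC, i.e. positions ka < kb < kc with the right events
embeddings = [(ka, kb, kc)
              for ka, a in enumerate(s, 1) if a == "A"
              for kb, b in enumerate(s, 1) if b == "B" and kb > ka
              for kc, c in enumerate(s, 1) if c == "C" and kc > kb]
print(len(embeddings))   # 64 = 4^3 instances, versus only 4 pairwise non-overlapping ones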
Properties of Repetitive Support
 Monotonicity
 If P’ is a super-pattern of P, then sup(P’) ≤ sup(P)
  Each INS' = a set of non-overlapping instances of P'
  From INS', construct INS = a set of non-overlapping instances of P with |INS| = |INS'|
  sup(P') = max{ |INS'| } ≤ max{ |INS| } = sup(P)
 Example: in ABAABA, sup(ABA) = 2 ≤ sup(AB) = 2
 Apriori Property
Properties of Repetitive Support
 Closed pattern
 P is non-closed: there exists a super-pattern P' s.t. sup(P') = sup(P)
 Support set of a non-closed pattern is extendable
  P' is a super-pattern of P
  sup(P') = sup(P) if and only if for any support set INS' of P', there exists a support set INS of P s.t.
   |INS'| = |INS|
   For any (i, <k1, k2, …, k|P|>) ∈ INS, there exists (i, <k1', k2', …, k|P'|'>) ∈ INS', s.t. <k1', k2', …, k|P'|'> is a super-sequence of <k1, k2, …, k|P|>
 Closed pattern is well defined
Computing Repetitive Support
 Greedy instance-growth algorithm: computing sup(ACB)
Position: 1 2 3 4 5 6 7 8 9
S1:       A B C A C B D D B
S2:       A C D B A C A D D
 Intuition: Extend each instance to the nearest possible event
Computing Repetitive Support
 Correctness of the greedy instance-growth algorithm
  Optimality: sup(P) = max{ |INS| }
   Leftmost support set INS of P → leftmost support set INS+ of pattern P○e
 Instance-growth routine INSgrow(P, INS, e):
  Given a support set INS of P, with |INS| = sup(P), and an event e
  Extend each instance in INS to the nearest possible occurrence of e
  INSgrow(P, INS, e) returns a support set INS+ of pattern P○e
Computing Repetitive Support
 Example: computing sup(ACB)
 Initialize support set INSA of A
 INSAC ← INSgrow(A, INSA, C)
 INSACB ← INSgrow(AC, INSAC, B)
 sup(ACB) ← |INSACB|
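A runnable Python sketch of the greedy instance growth described above, assuming events are single characters and SeqDB is a list of strings; ins_grow and repetitive_support are illustrative names, and the code follows the slides' description rather than the paper's exact pseudocode.

from bisect import bisect_right
from collections import defaultdict

def ins_grow(seqdb, ins, e):
    """Greedy instance growth, roughly INSgrow(P, INS, e) from the slides:
    extend each instance (i, landmarks) of the leftmost support set INS of P
    to the nearest unused occurrence of event e in sequence i; instances that
    cannot be extended are dropped.  Returns a support set of P○e."""
    # 1-based positions of e in every sequence
    pos = {i: [k for k, ev in enumerate(seq, 1) if ev == e]
           for i, seq in enumerate(seqdb, 1)}
    grown = []
    last_used = defaultdict(int)          # last occurrence of e assigned per sequence
    for i, lms in ins:                    # instances sorted by last landmark
        lo = max(lms[-1], last_used[i])   # must lie after both
        j = bisect_right(pos[i], lo)      # nearest occurrence of e beyond lo
        if j < len(pos[i]):
            k = pos[i][j]
            grown.append((i, lms + (k,)))
            last_used[i] = k
    return grown

def repetitive_support(seqdb, pattern):
    """sup(P): start from the occurrences of the first event and grow
    the instances event by event."""
    ins = [(i, (k,))
           for i, seq in enumerate(seqdb, 1)
           for k, ev in enumerate(seq, 1) if ev == pattern[0]]
    for e in pattern[1:]:
        ins = ins_grow(seqdb, ins, e)
    return len(ins), ins

# Example from the slides: S1 = ABCACBDDB, S2 = ACDBACADD
print(repetitive_support(["ABCACBDDB", "ACDBACADD"], "ACB"))

On these two sequences the sketch returns sup(ACB) = 3 with the support set {(1,<1,3,6>), (1,<4,5,9>), (2,<1,2,4>)}, matching the instance set shown on the closure-checking slide below.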
Mining All Frequent Patterns
 Depth-first search of the pattern space
 Closure checking
 Instance-border checking
 Support sets grow along the search, e.g., INSgrow(AA, INSAA, A)
[Search tree: patterns A, B, C, … are extended to AA, AB, AC, …, then AAA, AAB, …; frequent patterns: sup(P) ≥ min_sup; infrequent patterns: sup(P) < min_sup]
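A sketch of the depth-first mining loop (Python, reusing ins_grow from the earlier sketch; mine_frequent is an illustrative name). It applies only the Apriori-style pruning of this slide; closure checking and instance-border checking from the following slides are omitted.

def mine_frequent(seqdb, events, min_sup):
    """Depth-first search of the pattern space with Apriori pruning:
    grow each frequent pattern P on the right by one event e, reusing the
    support set of P via ins_grow (defined in the earlier sketch)."""
    results = {}

    def dfs(pattern, ins):
        for e in events:
            grown = ins_grow(seqdb, ins, e)   # support set of P○e
            if len(grown) >= min_sup:         # sup(P○e) = |grown|
                results[pattern + e] = len(grown)
                dfs(pattern + e, grown)
            # else: by monotonicity, no super-pattern of P○e can be frequent

    for e in events:
        ins = [(i, (k,)) for i, seq in enumerate(seqdb, 1)
               for k, ev in enumerate(seq, 1) if ev == e]
        if len(ins) >= min_sup:
            results[e] = len(ins)
            dfs(e, ins)
    return results

# e.g. mine_frequent(["ABCACBDDB", "ACDBACADD"], "ABCD", min_sup=3)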
Mining Closed Patterns
 Pattern extension
  Patterns with one more event than P = e1 e2 … em
  Extension(P, e) = {ee1e2…em, e1ee2…em, …, e1e2…eme}
 Closure checking
  Pattern P is NOT closed if and only if sup(P) = sup(Q) for some Q ∈ Extension(P, e)
  Closure checking alone is unable to prune the search space
 Ex. Given min_sup = 3, AB is not closed, because sup(ACB) = sup(AB) = 3:
  IAB = {(1,<1,2>),(1,<4,6>),(2,<1,4>)}
  IACB = {(1,<1,3,6>),(1,<4,5,9>),(2,<1,2,4>)}
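A one-line sketch of Extension(P, e) as defined above (Python; extensions is an illustrative name):

def extensions(pattern, e):
    """Extension(P, e): insert event e at every position of P = e1 e2 ... em,
    giving the m+1 super-patterns with exactly one more event."""
    return [pattern[:j] + e + pattern[j:] for j in range(len(pattern) + 1)]

print(extensions("AB", "C"))   # ['CAB', 'ACB', 'ABC']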
Mining Closed Patterns
 (Instance-border checking) Pattern P is prunable if:
  There exists Q ∈ Extension(P, e) for some e s.t.
   sup(P) = sup(Q) (P is NOT closed)
   Leftmost support set INSP and leftmost support set INSQ: for each (i, <k1, k2, …, k|P|>) ∈ INSP and the corresponding (i, <k1', k2', …, k|Q|'>) ∈ INSQ, k|Q|' ≤ k|P|
 Example: S = ACACBBDD
  Leftmost INSAB = { <1,5>, <3,6> }
  Leftmost INSACB = { <1,2,5>, <3,4,6> }
  AB is prunable
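A small check of the pruning condition on the example above, reusing repetitive_support from the earlier sketch (a sketch only; it assumes the two leftmost support sets pair up instance by instance in the same order):

seqdb = ["ACACBBDD"]
_, ins_ab = repetitive_support(seqdb, "AB")    # [(1, (1, 5)), (1, (3, 6))]
_, ins_acb = repetitive_support(seqdb, "ACB")  # [(1, (1, 2, 5)), (1, (3, 4, 6))]

# Same support, and each ACB instance ends no later than the corresponding
# AB instance (5 <= 5, 6 <= 6), so AB is prunable.
prunable = (len(ins_ab) == len(ins_acb) and
            all(lq[-1] <= lp[-1] for (_, lp), (_, lq) in zip(ins_ab, ins_acb)))
print(prunable)   # True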
Experimental Study
 Gazelle dataset (click stream)
 29369 sequences, 1423 distinct events, sequence length 1-651
 Vary min_sup
[Plots: number of patterns (log scale) and runtime in seconds (log scale) vs. min_sup, comparing mining all patterns ("All") against mining closed patterns ("Closed")]
Experimental Study
 Vary the number of sequences
 10000 distinct events, sequence length 50
 Vary the number of sequences: 5000-25000
Experimental Study
 Vary the average length of sequences
 10000 sequences, 10000 distinct events
 Vary the average length: 20-100
Conclusion
 Frequent repetitive-pattern mining
 Capture repetitive non-overlapping instances
 Maximum embedding (repetitive support)
 Closed repetitive patterns
 Efficient mining algorithms
 Future work
 Statistical explanation of repetitive support
 Use repetitive patterns as features for other mining tasks
 Approximate repetitive patterns