Efficient Mining of Closed Repetitive Gapped Subsequences from a Sequence Database
Bolin Ding, David Lo, Jiawei Han, and Siau-Cheng Khoo
ICDE 2009
2009-7-29
Motivation
A huge wealth of sequence data
Program execution traces
Sequences of words (text data)
Customer purchasing records
Credit card usage histories
Protein sequences
A database of long sequences
Patterns repeat multiple times in a sequence
Example
A trading company handles the requests of its customers.
A : request placed
B : request in-process
C : request cancelled
D : product delivered
S1 = AABCDABB, S2 = ABCD
Is pattern AB more frequent than CD?
Non-overlapping
Maximum
Our Goal
Repetitive pattern/support
Repetitive non-overlapping instances of a pattern
Maximum embedding of instances
Problem Statement
Mining (closed) repetitive gapped subsequences
Input: SeqDB = {S1, S2, …, SN} and min_sup
Output: patterns P with support sup(P) ≥ min_sup
Subsequence: S = e1e2…em is a subsequence of another sequence S’ = e’1e’2…e’n (m ≤ n) if there exists a sequence of integers 1 ≤ k1 < k2 < … < km ≤ n s.t. S[i] = S’[ki] for i = 1, 2, …, m.
Landmark: such a sequence of integers <k1, …, km> is called a landmark of S in S’.
For example, S = AB is a subsequence of S’ = ABAABAB; its landmarks are <1,2>, <1,5>, <1,7>, <3,5>, …
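The landmark definition can be checked by brute force. The sketch below is only an illustration, not part of the mining algorithm: it assumes events are single characters and sequences are Python strings, and the helper name `landmarks` is hypothetical. It enumerates every strictly increasing index tuple and keeps those that spell out S.

```python
from itertools import combinations


def landmarks(s, s_prime):
    """Return all landmarks of s in s_prime as tuples of 1-based positions."""
    n, m = len(s_prime), len(s)
    return [
        ks
        for ks in combinations(range(1, n + 1), m)  # k1 < k2 < ... < km
        if all(s[i] == s_prime[k - 1] for i, k in enumerate(ks))
    ]


print(landmarks("AB", "ABAABAB"))
# [(1, 2), (1, 5), (1, 7), (3, 5), (3, 7), (4, 5), (4, 7), (6, 7)]
```

The first four tuples match the landmarks listed in the example; exhaustive enumeration is exponential in general and is meant only for tiny inputs.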
(Cont.)
Instance of pattern in SeqDB
{(i, <k1, k2, …, km>): P[1] = Si[k1], P[2] = Si[k2], …, P[m] = Si[km]}
Overlapping instances
(i, <k1, k2, …, km>) and (i, <k1’, k2’, …, km’>): kj = kj’ for some j
sup(P): the size of a maximum set of NON-overlapping instances
sup(P) = max{|INS|: INS is a set of non-overlapping instances of P}
Support set INS: a set of non-overlapping instances with |INS| = sup(P)
Ex: S1(AB) = {(1,<1,2>),(1,<1,5>),(1,<4,5>)}
S2(AB) = {(2,<1,3>),(2,<2,3>),(2,<1,4>),(2,<2,4>)}
Instance set IAB = {(1,<1,2>),(1,<4,5>),(2,<1,3>),(2,<2,4>)} is non-redundant
Repetitive Gapped Subsequence-Support
About the definition of support
Repetitive instances of a pattern within each sequence
Why non-overlapping
Avoid over-estimating the frequency: AAAABBBBCCCC contains 4^3 = 64 instances of ABC, but only 4 non-overlapping ones
Maximize the size of the non-overlapping instance set
Measure how frequent a pattern is
A unified definition for all patterns
Example
S1 = AABCDABB, S2 = ABCD
sup(AB) = 4
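For sequences this small, repetitive support can be confirmed by exhaustive search. The following sketch assumes single-character events and string sequences; the function names (`instances`, `overlap`, `brute_force_sup`) are hypothetical. It enumerates all instances of a pattern in each sequence and looks for the largest pairwise non-overlapping subset. Instances in different sequences never overlap, so the per-sequence maxima are summed.

```python
from itertools import combinations


def instances(pattern, seq):
    """All landmarks (1-based, strictly increasing) of pattern in seq."""
    return [
        ks
        for ks in combinations(range(1, len(seq) + 1), len(pattern))
        if all(seq[k - 1] == e for k, e in zip(ks, pattern))
    ]


def overlap(a, b):
    """Two instances in the same sequence overlap if k_j == k_j' for some j."""
    return any(x == y for x, y in zip(a, b))


def brute_force_sup(pattern, seqdb):
    """Sum, over sequences, of the largest non-overlapping instance set."""
    total = 0
    for seq in seqdb:
        ins = instances(pattern, seq)
        for size in range(len(ins), 0, -1):  # try the largest subsets first
            if any(
                all(not overlap(a, b) for a, b in combinations(sub, 2))
                for sub in combinations(ins, size)
            ):
                total += size
                break
    return total


seqdb = ["AABCDABB", "ABCD"]
print(brute_force_sup("AB", seqdb), brute_force_sup("CD", seqdb))  # 4 2
print(len(instances("ABC", "AAAABBBBCCCC")))  # 64
```

Under this definition AB (support 4) is more frequent than CD (support 2) in the trading example. The search is exponential and is useful only as a reference check on tiny inputs.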
Properties of Repetitive Support
Monotonicity
If P’ is a super-pattern of P, then sup(P’) ≤ sup(P)
For each INS’ = a set of non-overlapping instances of P’
Construct from INS’:
INS = a set of non-overlapping instances of P
with |INS| = |INS’|
sup(P’) = max{ |INS’| } ≤ max{ |INS| } = sup(P)
Example:
ABA in ABAABA
AB in ABAABA
Apriori Property
Properties of Repetitive Support
Closed pattern
P is non-closed if there exists a super-pattern P’ s.t. sup(P’) = sup(P)
Support set of non-closed pattern is extendable
P’ is a super pattern of P
sup(P’) = sup(P) if and only if for any support set INS’ of P’,
there exists a support set INS of P, s.t.
|INS’| = |INS|
For any (i, <k1, k2, …, k|P|>) ∈ INS, there exists (i, <k1’, k2’, …, k|P’|’>) ∈ INS’, s.t. <k1’, k2’, …, k|P’|’> is a super-sequence of <k1, k2, …, k|P|>
Closed pattern is well defined
Computing Repetitive Support
Greedy instance-growth algorithm: computing sup(ACB)
Position:  1  2  3  4  5  6  7  8  9
S1:        A  B  C  A  C  B  D  D  B
S2:        A  C  D  B  A  C  A  D  D
Intuition: Extend each instance to the nearest possible event
Computing Repetitive Support
Correctness of greedy instance-growth algorithm
Optimality: sup(P) = max{ |INS| }
Leftmost support set INS of P →
Leftmost support set INS+ of pattern P○e
Instance-growth routine INSgrow(P, INS, e):
Given a support set INS of P, with |INS| = sup(P), and event e
Extend each instance in INS to the nearest possible event e
INSgrow(P, INS, e) returns a support set INS+ of pattern P○e
Computing Repetitive Support
Example: computing sup(ACB)
Initialize support set INSA of A
INSAC ← INSgrow(A, INSA, C)
INSACB ← INSgrow(AC, INSAC, B)
sup(ACB) ← |INSACB|
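The steps above can be sketched in Python. This is a minimal sketch of the greedy idea as described on the slides, not the authors' implementation: it assumes single-character events, sequences stored as strings, and 1-based positions. The names `ins_grow` and `sup_comp` are hypothetical, as is the choice to process the instances of each sequence from left to right while remembering the last position of e already used.

```python
def ins_grow(seqdb, ins, e):
    """Extend each instance in `ins` to the nearest possible event e.

    `ins` is a list of (i, landmark) pairs ordered by sequence index i and,
    within a sequence, from left to right.
    """
    grown = []
    last = {}  # per sequence: position of e consumed by the previous extension
    for i, lm in ins:
        lowest = max(last.get(i, 0), lm[-1])
        # str.find searches from 0-based index `lowest`,
        # i.e. strictly after 1-based position `lowest`
        pos = seqdb[i - 1].find(e, lowest)
        if pos == -1:
            continue  # no unused e to the right; later instances fail as well
        last[i] = pos + 1
        grown.append((i, lm + (pos + 1,)))
    return grown


def sup_comp(seqdb, pattern):
    """Grow a support set event by event; its size is sup(pattern)."""
    ins = [
        (i, (k,))
        for i, seq in enumerate(seqdb, 1)
        for k, ev in enumerate(seq, 1)
        if ev == pattern[0]
    ]
    for e in pattern[1:]:
        ins = ins_grow(seqdb, ins, e)
    return ins


seqdb = ["ABCACBDDB", "ACDBACADD"]
print(sup_comp(seqdb, "ACB"))
# [(1, (1, 3, 6)), (1, (4, 5, 9)), (2, (1, 2, 4))]
```

On the two example sequences S1 and S2 this yields three instances, so sup(ACB) = 3.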
Mining All Frequent Patterns
Depth-first search of the pattern space
Closure checking
Instance-border checking
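[Figure: search tree over patterns. The root branches to A, B, C, ……; A branches to AA, AB, AC, ……; AA is grown into AAA, AAB, …… by calls such as INSgrow(AA, INSAA, A). Legend: frequent patterns have sup(P) ≥ min_sup; infrequent patterns have sup(P) < min_sup.]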
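A depth-first traversal of the pattern space can be sketched as follows. It is an illustrative sketch under stated assumptions, not the paper's algorithm: events are single characters, sequences are strings, and the names `ins_grow` and `mine_frequent` are hypothetical. Each pattern is grown by appending one event with the greedy instance growth, and by the Apriori property a pattern below min_sup is not extended further. Closure checking and instance-border checking are deliberately left out.

```python
def ins_grow(seqdb, ins, e):
    """Extend each instance to the nearest unused occurrence of event e."""
    grown, last = [], {}
    for i, lm in ins:  # instances arrive left to right within each sequence
        pos = seqdb[i - 1].find(e, max(last.get(i, 0), lm[-1]))
        if pos != -1:
            last[i] = pos + 1
            grown.append((i, lm + (pos + 1,)))
    return grown


def mine_frequent(seqdb, min_sup):
    """Depth-first enumeration of all patterns with sup(P) >= min_sup."""
    if min_sup < 1:
        raise ValueError("min_sup must be at least 1")
    events = sorted(set("".join(seqdb)))
    found = {}

    def dfs(pattern, ins):
        found[pattern] = len(ins)
        for e in events:
            grown = ins_grow(seqdb, ins, e)
            if len(grown) >= min_sup:  # infrequent patterns are not extended
                dfs(pattern + e, grown)

    for e in events:
        ins = [
            (i, (k,))
            for i, s in enumerate(seqdb, 1)
            for k, ev in enumerate(s, 1)
            if ev == e
        ]
        if len(ins) >= min_sup:
            dfs(e, ins)
    return found


seqdb = ["ABCACBDDB", "ACDBACADD"]
patterns = mine_frequent(seqdb, 3)
print(patterns["AB"], patterns["ACB"])  # 3 3
```

The recursion terminates because a pattern longer than every sequence has no instances, which is why min_sup is required to be at least 1.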
Mining Closed Patterns
Pattern extension
Patterns with one more event in P = e1 e2 … em
Extension(P, e) = {ee1e2…em, e1ee2…em, …, e1e2…eme}
Closure checking
Pattern P is NOT closed if and only if sup(P) = sup(Q) for some event e and some Q ∈ Extension(P, e)
Closure checking alone is unable to prune the search space
Ex. Given min_sup = 3
AB is not closed: IAB = {(1,<1,2>),(1,<4,6>),(2,<1,4>)} and
IACB = {(1,<1,3,6>),(1,<4,5,9>),(2,<1,2,4>)}, so sup(AB) = sup(ACB) = 3
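Closure checking can be illustrated with a small sketch. It assumes single-character events and string sequences, computes support with the greedy instance growth, and tries only events that occur in the database (any other event gives support 0). The names `sup`, `extension` and `is_closed` are hypothetical, and the sketch is intended for patterns with sup(P) ≥ 1.

```python
def sup(seqdb, pattern):
    """Repetitive support via greedy instance growth (1-based positions)."""
    # only the last landmark position of each instance is needed for counting
    ends = [
        (i, k)
        for i, s in enumerate(seqdb, 1)
        for k, ev in enumerate(s, 1)
        if ev == pattern[0]
    ]
    for e in pattern[1:]:
        grown, last = [], {}
        for i, k in ends:
            pos = seqdb[i - 1].find(e, max(last.get(i, 0), k))
            if pos != -1:
                last[i] = pos + 1
                grown.append((i, pos + 1))
        ends = grown
    return len(ends)


def extension(p, e):
    """All patterns obtained by inserting event e at one position of p."""
    return [p[:j] + e + p[j:] for j in range(len(p) + 1)]


def is_closed(seqdb, p):
    """P is closed if no one-event extension has the same support."""
    events = sorted(set("".join(seqdb)))
    s = sup(seqdb, p)
    return not any(sup(seqdb, q) == s for e in events for q in extension(p, e))


seqdb = ["ABCACBDDB", "ACDBACADD"]
print(extension("AB", "C"))    # ['CAB', 'ACB', 'ABC']
print(is_closed(seqdb, "AB"))  # False, since sup(ACB) == sup(AB) == 3
```

This check recomputes the support of every extension from scratch, which is simple but costly; it decides closedness and does not by itself prune the search.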
Mining Closed Patterns
(Instance-border checking) Pattern P is prunable if:
There exists Q ∈ Extension(P, e) for some e s.t.
sup(P) = sup(Q) (P is NOT closed)
(leftmost) support set INSP and (leftmost) support set INSQ:
for each (i, <k1, k2, …, k|P|>) ∈ INSP and (i, <k1’, k2’, …, k|Q|’>) ∈ INSQ
k|Q|’ ≤ k|P|
Example
S = ACACBBDD
INSAB:
<1,5>, <3,6>
INSACB:
<1,2,5>, <3,4,6>
AB is prunable: sup(AB) = sup(ACB) = 2, and 5 ≤ 5, 6 ≤ 6
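The instance-border condition can be tried out on this example. The sketch below is one reading of the condition as stated on the slide, not a verified reimplementation: it assumes single-character events and string sequences, builds leftmost support sets with the greedy instance growth, and pairs the instances of P and Q in left-to-right order within each sequence (the pairing is an assumption). The names `leftmost_support_set` and `prunable_by` are hypothetical.

```python
def leftmost_support_set(seqdb, pattern):
    """Greedy instance growth; returns (i, landmark) pairs, 1-based."""
    ins = [
        (i, (k,))
        for i, s in enumerate(seqdb, 1)
        for k, ev in enumerate(s, 1)
        if ev == pattern[0]
    ]
    for e in pattern[1:]:
        grown, last = [], {}
        for i, lm in ins:
            pos = seqdb[i - 1].find(e, max(last.get(i, 0), lm[-1]))
            if pos != -1:
                last[i] = pos + 1
                grown.append((i, lm + (pos + 1,)))
        ins = grown
    return ins


def prunable_by(seqdb, p, q):
    """True if extension q has the same support as p and no later borders."""
    ins_p = leftmost_support_set(seqdb, p)
    ins_q = leftmost_support_set(seqdb, q)
    if len(ins_p) != len(ins_q):
        return False  # sup(P) != sup(Q)
    # compare the last landmark position (the border) of paired instances
    return all(
        ip == iq and lq[-1] <= lp[-1]
        for (ip, lp), (iq, lq) in zip(ins_p, ins_q)
    )


seqdb = ["ACACBBDD"]
print(leftmost_support_set(seqdb, "AB"))   # [(1, (1, 5)), (1, (3, 6))]
print(leftmost_support_set(seqdb, "ACB"))  # [(1, (1, 2, 5)), (1, (3, 4, 6))]
print(prunable_by(seqdb, "AB", "ACB"))     # True
```

By contrast, `prunable_by(seqdb, "AB", "ABD")` is False: ABD has the same support, but its instances end at positions 7 and 8, to the right of the borders 5 and 6 of AB.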
Experimental Study
Gazelle dataset (click stream)
29369 sequences, 1423 distinct events, sequence length 1-651
Vary min_sup
[Figure: two plots against min_sup (axis ticks 8, ..., 63, 64, 65, 66), each with curves All and Closed: |Patterns| (log-scale) and Runtime(s) (log-scale).]
Experimental Study
Vary the number of sequences
10000 distinct events, sequence length 50
Vary the number of sequences: 5000-25000
Experimental Study
Vary the average length of sequences
10000 sequences, 10000 distinct events
Vary the average length: 20-100
Conclusion
Frequent repetitive-pattern mining
Capture repetitive non-overlapping instances
Maximum embedding (repetitive support)
Closed repetitive patterns
Efficient mining algorithms
Future work
Statistical explanation of repetitive support
Use repetitive patterns as features for other mining tasks
Approximate repetitive patterns