An Efficient Polynomial Delay Algorithm for
Pseudo Frequent Itemset Mining
Takeaki Uno (National Institute of Informatics)
Hiroki Arimura (Hokkaido University)
2/Oct/2007 Discovery Science 2007
Frequent Pattern Mining
• the problem of finding all frequently appearing patterns in a
(large-scale) database
database: transaction, tree, string, graph, vector
pattern: subset, tree, path, sequence, graph, geograph…
[Figure: two example databases and the frequent patterns found in them. Experiment records ex1 to ex4 with ●/▲ outcomes yield patterns such as {ex1●, ex3▲}, {ex2●, ex4●}, {ex2●, ex3▲, ex4●}, {ex2▲, ex3▲}. Genome sequences such as ATGCGCCGTA yield patterns such as ATGCAT, CCCGGGTAA, GGCGTTA, ATAAGGG.]
This Research
• We address transaction databases
transaction database: a database D in which each record (transaction) T
is a subset of the itemset E, i.e., ∀T ∈ D, T ⊆ E
frequent itemset: subset of E included in at least σ transactions
(σ: minimum support threshold)
• problems
- too many patterns are found, which makes the valuable ones hard to find
- inclusion is too strict to deal with errors
→ "patterns ambiguously included in many transactions" are
important
We introduce an ambiguous inclusion,
and propose an efficient mining algorithm
Related Works
• Frequent itemset mining with such ambiguity goes by the names
fault-tolerant pattern, degenerate pattern, and soft occurrence
- ambiguity for inclusion: "a pattern is included if the ratio of
included items is more than the threshold"
- another approach: find combinations of an itemset and a
transaction set such that few pairs of item and transaction
violate the inclusion relation
- similarity is used for string matching and homology search
• Little "enumeration type" research with completeness
We look at practical models and algorithms from the viewpoint of algorithm theory
Notations for F.I.M.
• For itemset K,
occurrence of K: transaction of D including K
Occ(K): occurrence set of K: the set of occurrences of K
frq(K): frequency of K: the size of Occ(K)
D ＝ { {1,2,5,6,7,9},
      {2,3,4,5},
      {1,2,7,8,9},
      {1,7,9},
      {2,7,9},
      {2} }
Occ( {1,2} ) ＝ { {1,2,5,6,7,9}, {1,2,7,8,9} }
Occ( {2,7,9} ) ＝ { {1,2,5,6,7,9}, {1,2,7,8,9}, {2,7,9} }
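The notation on this slide can be checked with a few lines of Python. This is only a minimal sketch: the function names `occ` and `frq` mirror the slide's Occ(K) and frq(K), and the list `D` is the example database from the slide; nothing here is the authors' implementation.

```python
# Example database from the slide, one frozenset per transaction.
D = [
    frozenset({1, 2, 5, 6, 7, 9}),
    frozenset({2, 3, 4, 5}),
    frozenset({1, 2, 7, 8, 9}),
    frozenset({1, 7, 9}),
    frozenset({2, 7, 9}),
    frozenset({2}),
]


def occ(K, database):
    """Occ(K): the transactions of the database that include every item of K."""
    K = frozenset(K)
    return [T for T in database if K <= T]


def frq(K, database):
    """frq(K): the number of occurrences of K."""
    return len(occ(K, database))


if __name__ == "__main__":
    print(occ({1, 2}, D))
    print(frq({2, 7, 9}, D))
```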
Frequent Itemset
• Frequent itemset: itemset with frequency no less than σ
( σ is called minimum support (threshold) )
Ex.)
D ＝ { {1,2,5,6,7,9}, {2,3,4,5}, {1,2,7,8,9}, {1,7,9}, {2,7,9}, {2} }
Itemsets included in no less than 3 transactions:
{1} {2} {7} {9}
{1,7} {1,9}
{2,7} {2,9} {7,9}
{1,7,9} {2,7,9}
Frequent itemset mining:
problem of enumerating all frequent itemsets for given
database D and minimum support σ
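As a sanity check of the example, the following sketch enumerates frequent itemsets by brute force, level by level. The function name `frequent_itemsets` is hypothetical, and this naive approach is for illustration only; it is not the enumeration algorithm of the talk. It stops at the first level with no frequent itemset, which is safe because every subset of a frequent itemset is frequent.

```python
from itertools import combinations

# Example database from the slide.
D = [
    frozenset({1, 2, 5, 6, 7, 9}),
    frozenset({2, 3, 4, 5}),
    frozenset({1, 2, 7, 8, 9}),
    frozenset({1, 7, 9}),
    frozenset({2, 7, 9}),
    frozenset({2}),
]


def frequent_itemsets(database, sigma):
    """Return all non-empty itemsets included in at least sigma transactions."""
    items = sorted(set().union(*database))
    result = []
    for size in range(1, len(items) + 1):
        level = []
        for combo in combinations(items, size):
            K = frozenset(combo)
            if sum(1 for T in database if K <= T) >= sigma:
                level.append(K)
        if not level:
            break  # no frequent itemset of this size, hence none of larger size
        result.extend(level)
    return result


if __name__ == "__main__":
    for K in frequent_itemsets(D, 3):
        print(sorted(K))
```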
Inclusion with Ambiguity
• Ambiguous inclusion relation for itemset P and transaction T
• Popular definition: |P∩T| ／ |P| ≧ θ for threshold θ<1
→ monotonicity of frequent itemsets is lost
→ there can be a frequent itemset none of whose proper subsets is frequent
→ high computational cost
Ex.) for θ = 0.6
{1,2,3} is included in {1,2,4,5} (ratio 2/3)
{1,2,3,4,5,6,7} is included in {1,3,5,6,7} (ratio 5/7)
{1,2,3} is not included in {1,4,5} (ratio 1/3)
Ex.) D ＝ { {1,2}, {2,3}, {1,3} }, θ = 0.6
→ {1,2,3} is included in all three transactions
→ no non-empty proper subset of {1,2,3} is included in all of them
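The loss of monotonicity can be reproduced directly. The sketch below uses exact fractions for the ratio test; the names `ratio_included`, `ratio_frq`, `proper_subsets`, `THETA` and `EXAMPLE` are illustrative and not taken from the talk.

```python
from fractions import Fraction
from itertools import combinations

THETA = Fraction(3, 5)  # the threshold 0.6 of the slide, kept exact
EXAMPLE = [{1, 2}, {2, 3}, {1, 3}]  # the three transactions of the slide


def ratio_included(P, T, theta):
    """Ratio-based inclusion: |P ∩ T| / |P| >= theta."""
    P, T = set(P), set(T)
    return Fraction(len(P & T), len(P)) >= theta


def ratio_frq(P, database, theta):
    """Number of transactions that ratio-include P."""
    return sum(1 for T in database if ratio_included(P, T, theta))


def proper_subsets(P):
    """All non-empty proper subsets of P."""
    P = sorted(P)
    return [set(c) for r in range(1, len(P)) for c in combinations(P, r)]


if __name__ == "__main__":
    print(ratio_frq({1, 2, 3}, EXAMPLE, THETA))
    for S in proper_subsets({1, 2, 3}):
        print(sorted(S), ratio_frq(S, EXAMPLE, THETA))
```

With minimum support 3 on this tiny database, {1,2,3} is frequent while every smaller non-empty subset falls below the threshold, so a search that prunes on infrequent subsets would miss it.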
k-pseudo Inclusion
• Use a threshold on the number of non-included items:
k-pseudo inclusion: |P＼T| ≦ k for threshold k ≧ 0
( k-pseudo [occurrence / occurrence set / frequency] )
→ monotonicity is kept
→ able to find characterizations such as
"many transactions include at least 3 items of P"
Ex.) for k = 1
{1,2,3} is 1-pseudo included in {1,2,4,5} (1 item missing)
{1,2,3,4,5,6,7} is not 1-pseudo included in {1,3,5,6,7} (2 items missing)
{1,2,3} is not 1-pseudo included in {1,4,5} (2 items missing)
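A minimal sketch of the k-pseudo inclusion test and the resulting frequency follows. The function names are illustrative, and `D` is the six-transaction example database used throughout the slides.

```python
# Example database used on the other slides.
D = [
    frozenset({1, 2, 5, 6, 7, 9}),
    frozenset({2, 3, 4, 5}),
    frozenset({1, 2, 7, 8, 9}),
    frozenset({1, 7, 9}),
    frozenset({2, 7, 9}),
    frozenset({2}),
]


def pseudo_included(P, T, k):
    """k-pseudo inclusion: at most k items of P are missing from T."""
    return len(set(P) - set(T)) <= k


def pseudo_frq(P, database, k):
    """k-pseudo frequency: number of transactions that k-pseudo include P."""
    return sum(1 for T in database if pseudo_included(P, T, k))


if __name__ == "__main__":
    print(pseudo_included({1, 2, 3}, {1, 2, 4, 5}, 1))
    print(pseudo_frq({2, 7, 9}, D, 1), pseudo_frq({7, 9}, D, 1))
```

Removing an item from P can only decrease |P＼T|, so the k-pseudo frequency never drops when P shrinks; this is the monotonicity the slide refers to.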
k Pseudo Frequent Itemset
• k-pseudo frequent itemset: itemset k-pseudo included in at least
σ transactions of D
D ＝ { {1,2,5,6,7,9}, {2,3,4,5}, {1,2,7,8,9}, {1,7,9}, {2,7,9}, {2} }
1-pseudo frequent itemsets for σ=3:
{1,2,3} {1,2,4} {1,2,5} {1,2,7} {1,2,9} {1,3,7}
{1,3,9} {1,4,7} {1,4,9} {1,5,7} {1,5,9} {1,6,7}
{1,6,9} {1,7,8} {1,7,9} {1,8,9} {2,3,7} {2,3,9}
{2,4,7} {2,4,9} {2,5,7} {2,5,8} {2,5,9} {2,6,7}
{2,6,9} {2,7,8} {2,7,9} {2,8,9} {3,7,9} {4,7,9}
{5,7,9} {6,7,9} {7,8,9}
{1,2,7,9} {1,3,7,9} {1,4,7,9} {1,5,7,9}
{1,6,7,9} {1,7,8,9} {2,3,7,9} {2,4,7,9}
{2,5,7,9} {2,6,7,9} {2,7,8,9}
Many trivial patterns
How to efficiently enumerate?
Enumeration using Monotonicity
• Pseudo frequent itemsets have the monotone
property, so a simple backtrack
algorithm works
• For each k-pseudo frequent itemset P,
compute the k-pseudo frequency of each
P+e
• If the k-pseudo frequency of P+e
is no less than σ, make a recursive
call to enumerate the k-pseudo frequent
itemsets including P+e
[Figure: lattice of the subsets of {1,2,3,4}, from φ (000…0) at the bottom to {1,2,3,4} (111…1) at the top, with the region of frequent itemsets marked "freq"]
Polynomial time enumeration
How to compute efficiently?
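The backtrack search described above can be sketched as follows. This is a plain illustration under simplifying assumptions: the frequency of each P+e is recomputed from scratch by scanning the database, and only items larger than the last added one are tried so that each itemset is generated once. The names `pseudo_frq` and `enumerate_pseudo_frequent` are hypothetical.

```python
# Example database used on the other slides.
D = [
    frozenset({1, 2, 5, 6, 7, 9}),
    frozenset({2, 3, 4, 5}),
    frozenset({1, 2, 7, 8, 9}),
    frozenset({1, 7, 9}),
    frozenset({2, 7, 9}),
    frozenset({2}),
]


def pseudo_frq(P, database, k):
    """Number of transactions missing at most k items of P."""
    return sum(1 for T in database if len(P - T) <= k)


def enumerate_pseudo_frequent(database, k, sigma):
    """Backtrack over itemsets, extending P only by items after the last one tried."""
    items = sorted(set().union(*database))
    found = []

    def backtrack(P, start):
        found.append(P)
        for i in range(start, len(items)):
            Q = P | {items[i]}
            # Monotonicity: if Q is infrequent, no superset of Q is frequent.
            if pseudo_frq(Q, database, k) >= sigma:
                backtrack(Q, i + 1)

    if pseudo_frq(frozenset(), database, k) >= sigma:
        backtrack(frozenset(), 0)
    return found


if __name__ == "__main__":
    result = enumerate_pseudo_frequent(D, k=1, sigma=3)
    print(len(result), "itemsets, largest size", max(len(P) for P in result))
```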
Computing k-Pseudo Occurrences
• Define Occ_{=h}(P) = { T∈D | |P＼T| = h }
→ set of transactions missing exactly h items of P
→ Occ_{≦k}(P) = ∪_{h≦k} Occ_{=h}(P)
• Occ_{=h}(P∪{e}) = ( Occ_{=h}(P)∩Occ({e}) ) ∪ ( Occ_{=h-1}(P)＼Occ({e}) )
→ the pseudo occurrence sets are updated by taking intersections
• compute Occ_{=h}(P)∩Occ({e}) for all pairs of e and h
[Figure: transactions A to G arranged by candidate item (8 to 12) and by the buckets Occ0, Occ1, Occ2 of P]
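The update rule can be sketched with one bucket per number of missing items. Transactions are referred to by their index in `D`; the helper names `buckets_of` and `add_item` are illustrative only. A transaction that contains e stays in its bucket, and one that lacks e moves from bucket h-1 to bucket h (or drops out beyond k).

```python
# Example database used on the other slides; a transaction id is its list index.
D = [
    frozenset({1, 2, 5, 6, 7, 9}),
    frozenset({2, 3, 4, 5}),
    frozenset({1, 2, 7, 8, 9}),
    frozenset({1, 7, 9}),
    frozenset({2, 7, 9}),
    frozenset({2}),
]


def buckets_of(P, database, k):
    """buckets[h] = ids of the transactions missing exactly h items of P, for h = 0..k."""
    buckets = [set() for _ in range(k + 1)]
    for tid, T in enumerate(database):
        missing = len(P - T)
        if missing <= k:
            buckets[missing].add(tid)
    return buckets


def add_item(buckets, occ_e):
    """Buckets of P+e computed from the buckets of P and the occurrence set of e."""
    updated = []
    for h, bucket in enumerate(buckets):
        kept = bucket & occ_e                               # contain e: still miss h items
        moved = buckets[h - 1] - occ_e if h > 0 else set()  # lack e: now miss h items
        updated.append(kept | moved)
    return updated


if __name__ == "__main__":
    P = frozenset({1})
    occ_2 = {tid for tid, T in enumerate(D) if 2 in T}
    print(add_item(buckets_of(P, D, 1), occ_2))
```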
Taking Intersections Efficiently
• Occ_{=h}(P∪{e}) = ( Occ_{=h}(P)∩Occ({e}) ) ∪ ( Occ_{=h-1}(P)＼Occ({e}) )
→ has the same properties as usual occurrence sets
→ many existing techniques for updating occurrence sets can be used
(down project, delivery, bitmap…)
• Database reduction (FP-tree)
is also available
• In deeper levels of the recursion,
few transactions have to be scanned,
so the computation is fast
A: 1,2,5,6,7,9
B: 2,3,4,5
C: 1,2,7,8,9
D: 1,7,9
E: 2,7,9
F: 2
1: A,C,D
2: A,B,C,E,F
3: B
4: B
5: A,B
6: A
7: A,C,D,E
8: C
9: A,C,D,E
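The two tables above are the same database seen by transaction and by item. A small sketch of the conversion, with the transaction labels A to F taken from the slide and the function name `vertical` chosen for illustration:

```python
from collections import defaultdict

# Transactions labelled as on the slide.
D = {
    "A": {1, 2, 5, 6, 7, 9},
    "B": {2, 3, 4, 5},
    "C": {1, 2, 7, 8, 9},
    "D": {1, 7, 9},
    "E": {2, 7, 9},
    "F": {2},
}


def vertical(database):
    """Map each item to the sorted list of labels of the transactions containing it."""
    occ = defaultdict(list)
    for label in sorted(database):
        for item in sorted(database[label]):
            occ[item].append(label)
    return dict(occ)


if __name__ == "__main__":
    for item, labels in sorted(vertical(D).items()):
        print(f"{item}: {','.join(labels)}")
```

With this layout, the occurrence set of an itemset is the intersection of the lists of its items, for example `set(v[7]) & set(v[9])` for {7,9} when `v = vertical(D)`.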
Using Bottom-wideness
• Backtrack (depth-first search) generates several recursive calls
in each iteration
→ The computation tree spreads exponentially going down
→ The computation time is dominated by the bottom-level
iterations of the recursion tree
Since few occurrences have to be computed in the lower levels,
iterations near the root take a long time and iterations near the bottom take a short time
Amortized computation time is reduced to that of the bottom levels
For Large Minimum Support
• When σ is large, we access many transactions on the bottom levels
→ The improvement from bottom-wideness is not drastic
• Reduce the database to speed up the bottom levels
(1) Delete the items less than the maximum item in P
(2) Delete the items that are infrequent in the occurrence set database
(since they are never added in the recursive calls)
(3) Unify identical transactions
• In practice, the database size is constant in the
bottom levels
No big difference from small σ
[Figure: example of database reduction for P={1,3}, k=1, σ=4]
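The three reduction steps might look as follows in code. This is a sketch under assumptions that the slides do not spell out: each record of the occurrence set database is a triple (h, items, weight), where h is the number of items of P the transaction misses and weight counts merged copies; "infrequent" in step (2) is read as "P+e is not k-pseudo frequent"; and transactions are merged only when they agree on both h and the remaining items. The function name `reduce_database` is hypothetical.

```python
from collections import Counter


def reduce_database(records, max_item, k, sigma):
    """Reduce the occurrence set database of P; max_item is the maximum item of P."""
    # (1) Delete the items not greater than the maximum item of P.
    trimmed = [
        (h, frozenset(i for i in items if i > max_item), w)
        for h, items, w in records
    ]

    # (2) Keep item e only if P+e is k-pseudo frequent. A record still counts
    #     for P+e if it contains e, or if it lacks e but has h < k.
    weight_below_k = sum(w for h, _, w in trimmed if h < k)
    contain = Counter()
    contain_below_k = Counter()
    for h, items, w in trimmed:
        for i in items:
            contain[i] += w
            if h < k:
                contain_below_k[i] += w
    keep = {
        i for i in contain
        if contain[i] + (weight_below_k - contain_below_k[i]) >= sigma
    }

    # (3) Unify records that have become identical, summing their weights.
    merged = Counter()
    for h, items, w in trimmed:
        merged[(h, items & keep)] += w
    return [(h, items, w) for (h, items), w in merged.items()]


if __name__ == "__main__":
    records = [(0, {2, 3, 5}, 1), (0, {2, 3, 5}, 1), (1, {2, 3}, 1), (1, {4}, 1)]
    print(reduce_database(records, max_item=2, k=1, sigma=3))
```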
Small & Trivial Patterns
• Under the k-pseudo inclusion, any itemset of size no more than k is
included in every transaction
• itemsets of size slightly greater than k are also included in many
transactions
→ Many small and trivial frequent itemsets
• We want to ignore these itemsets in practice
→ Consider the problem of directly finding
pseudo frequent itemsets of size l
Directly Finding Large Itemset
• Searching all itemsets of size l needs exponential time
→ Pruning unnecessary search is crucial
→ Take candidates according to partial structure
• Let P be a k-pseudo frequent itemset of size l
• WLOG, P={1,…,l}, with the items
sorted in increasing order of |Occ_{=k}(P)＼Occ({e})|
• Consider the (k-1)-pseudo frequency of the itemset {1,…,y}
• Any transaction in Occ_{=k}(P)＼Occ({e}), e>y,
(k-1)-pseudo includes {1,…,y}
Search Route to Itemset of Size l
• Any transaction in Occ_{=k}(P)＼Occ({e}), e>y,
(k-1)-pseudo includes {1,…,y}
→ |Occ_{≦k-1}({1,…,y})| ≧ |∪_{e=y+1,…,|P|} (Occ_{=k}(P)＼Occ({e}))|
• the average of |Occ_{=k}(P)＼Occ({e})| over e∈P is (k / |P|) |Occ_{=k}(P)|,
and each transaction of Occ_{=k}(P) lies in exactly k of these sets
• 1,…,|P| are sorted in increasing order of |Occ_{=k}(P)＼Occ({e})|
→ |Occ_{≦k-1}({1,…,y})| ≧ |Occ_{=k}(P)|×(|P|-y)/|P| (partial frequency condition)
There is a sequence of itemsets from the empty set to P composed only
of itemsets satisfying the partial frequency condition
Example for Partial Frequency Condition
• Itemsets satisfying the partial frequency condition,
for k=1, σ=3, l=3
D ＝ { {1,2,5,6,7,9}, {2,3,4,5}, {1,2,7,8,9}, {1,7,9}, {2,7,9}, {2} }
1-pseudo frequent itemsets
satisfying the partial frequency condition
{1} {2} {5} {7} {9} {1,2} {1,5} {1,6} {1,7}
{1,8} {1,9} {2,3} {2,4} {2,5} {2,6} {2,7}
{2,8} {2,9} {3,5} {4,5} {5,6} {5,7} {5,9}
{6,7} {6,9} {7,8} {7,9} {8,9}
The number of itemsets to be searched decreases
→ efficient search is expected
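The list above can be reproduced with a short sketch. Note the assumption: the slides state the condition relative to a target itemset P, and here it is instantiated for an itemset Q as |Occ≦k-1(Q)| × l ≧ σ × (l − |Q|), which is a guess that happens to yield exactly the singletons and pairs listed on this slide. Itemsets are grown one item at a time from the empty set, keeping only those that meet the threshold. The names `count_at_most`, `satisfies_pfc` and `pfc_levels` are hypothetical.

```python
# Example database from the slide.
D = [
    frozenset({1, 2, 5, 6, 7, 9}),
    frozenset({2, 3, 4, 5}),
    frozenset({1, 2, 7, 8, 9}),
    frozenset({1, 7, 9}),
    frozenset({2, 7, 9}),
    frozenset({2}),
]


def count_at_most(Q, database, h):
    """Number of transactions missing at most h items of Q."""
    return sum(1 for T in database if len(Q - T) <= h)


def satisfies_pfc(Q, database, k, sigma, l):
    """Assumed form of the condition, in integers to avoid rounding."""
    return count_at_most(Q, database, k - 1) * l >= sigma * (l - len(Q))


def pfc_levels(database, k, sigma, l):
    """levels[s-1] = itemsets of size s (s < l) reachable through the condition."""
    items = sorted(set().union(*database))
    level = {frozenset()}
    levels = []
    for _ in range(1, l):
        level = {
            P | {e}
            for P in level
            for e in items
            if e not in P and satisfies_pfc(P | {e}, database, k, sigma, l)
        }
        levels.append(level)
    return levels


if __name__ == "__main__":
    singles, pairs = pfc_levels(D, k=1, sigma=3, l=3)
    print(sorted(sorted(P) for P in singles))
    print(len(pairs), "pairs")
```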
Restricted Search Route by P.F.C.
• Any k-pseudo frequent itemset of size l can be reached by passing
only through itemsets satisfying the partial frequency condition
→ Let's do backtrack search
• There always exists an item whose removal yields an itemset satisfying the condition
• Tail extension is not available
(removal of the tail may violate the condition)
• Simple hill climbing generates duplicates
• So, use a generation rule to avoid duplication (reverse search)
Reverse Search for P.F.C.
• Rule: generate itemset P only from the P＼{e} maximizing |Occ_{≦k-1}(P＼{e})|
(Ties are broken by choosing the minimum index)
ReverseSearch (P)
1. if |P| = l then output P; return;
2. for each e∉P do
if P+e is a k-pseudo frequent itemset satisfying P.F.C. then
if removing e maximizes |Occ_{≦k-1}((P+e)＼{e'})| over e'∈P+e then ReverseSearch (P+e)
3. end for
• |Occ_{≦k-1}(P＼{e})| can be efficiently computed by existing methods
O(|P|×||D||) time for one iteration
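A runnable sketch of the reverse search follows. It rests on the same assumption as before about the concrete form of the partial frequency condition, |Occ≦k-1(Q)| × l ≧ σ × (l − |Q|), and it recomputes every count by scanning the database instead of using the efficient occurrence updates of the earlier slides. The function names are hypothetical. Each itemset has exactly one parent under the rule, so no itemset is output twice.

```python
# Example database used on the other slides.
D = [
    frozenset({1, 2, 5, 6, 7, 9}),
    frozenset({2, 3, 4, 5}),
    frozenset({1, 2, 7, 8, 9}),
    frozenset({1, 7, 9}),
    frozenset({2, 7, 9}),
    frozenset({2}),
]


def count_at_most(Q, database, h):
    """Number of transactions missing at most h items of Q."""
    return sum(1 for T in database if len(Q - T) <= h)


def reverse_search(database, k, sigma, l):
    """Collect k-pseudo frequent itemsets of size l reached through the parent rule."""
    items = sorted(set().union(*database))
    output = []

    def pfc(Q):
        # Assumed form of the partial frequency condition (see the lead-in).
        return count_at_most(Q, database, k - 1) * l >= sigma * (l - len(Q))

    def parent(Q):
        # Remove the item whose removal maximizes the (k-1)-pseudo frequency;
        # max() keeps the first maximum, so ties go to the smallest item.
        e = max(sorted(Q), key=lambda x: count_at_most(Q - {x}, database, k - 1))
        return Q - {e}

    def search(P):
        if len(P) == l:
            output.append(P)
            return
        for e in items:
            if e in P:
                continue
            Q = P | {e}
            if count_at_most(Q, database, k) >= sigma and pfc(Q) and parent(Q) == P:
                search(Q)

    search(frozenset())
    return output


if __name__ == "__main__":
    for P in reverse_search(D, k=1, sigma=3, l=3):
        print(sorted(P))
```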
Conclusion
• Introduced an ambiguous inclusion relation in which at most k items
of the pattern are not included
• Pseudo frequent itemset mining under this inclusion (monotonicity,
intersection, many small and trivial patterns)
• Reverse search for directly finding frequent itemsets of a fixed size
Future works
• implementation and experiments
• extension of the technique to other pattern mining problems
• approach to inclusion with "ratio r %"
