Mining of Frequent Itemsets from Streams of Uncertain Data

Mining of Frequent Itemsets from Streams of Uncertain Data
Carson Kai-Sang Leung, Boyu Hao
ICDE 2009
&
A Tree-Based Approach for Frequent Pattern Mining from Uncertain Data
Carson Kai-Sang Leung, Mark Anthony F. Mateo, and Dale A. Brajczuk
PAKDD 2008

Outline
• Motivation
• Related Work
• Method
  • UF-streaming
  • SUF-growth
  • UF-growth Improvement
• Experimental Result
• Conclusion

Motivation
ICDE:
1. Can we handle streams of uncertain data?
2. Given streams of uncertain data, how can we effectively capture their important contents?
PAKDD:
1. Can we avoid generating candidates at all?
2. Since tree-based algorithms for handling precise data are usually faster than their Apriori-based counterparts, is this also the case when handling uncertain data?

Related Work

Existential probability: P(x, ti), where x is an item and ti is a transaction.
Using the "possible world" interpretation of uncertain data, there are two possible worlds for an item x and a transaction ti:
• (i) x ∈ ti, with probability P(x, ti)
• (ii) x ∉ ti, with probability 1 − P(x, ti)

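As an illustration of the possible-world interpretation, the sketch below enumerates every world of a single transaction and checks that the world probabilities sum to 1. The function name `possible_worlds`, the sample transaction, and the assumption that items occur independently are illustrative choices, not taken from the slides.

```python
from itertools import product


def possible_worlds(transaction):
    """Yield (world, probability) for every possible world of one transaction.

    `transaction` maps item -> existential probability P(x, t).
    Assumption: items occur independently of one another.
    """
    items = sorted(transaction)
    for mask in product([True, False], repeat=len(items)):
        prob = 1.0
        for item, present in zip(items, mask):
            p = transaction[item]
            # Present with probability p, absent with probability 1 - p.
            prob *= p if present else 1.0 - p
        world = frozenset(i for i, present in zip(items, mask) if present)
        yield world, prob


# Hypothetical transaction with two uncertain items.
t = {"a": 0.9, "d": 0.8}
worlds = dict(possible_worlds(t))

total_probability = sum(worlds.values())
# Probability that both a and d are present in the transaction.
p_a_and_d = sum(p for w, p in worlds.items() if {"a", "d"} <= w)
```

Under the independence assumption, the probability that an itemset is contained in a transaction reduces to the product of its items' existential probabilities (here 0.9 × 0.8).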
Related Work

The expected support of an itemset X in TDB:
expSup(X) = Σ_{i=1..|TDB|} Π_{x∈X} P(x, ti)

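A minimal sketch of the expected-support computation follows: for each transaction, multiply the existential probabilities of the items in X, then sum over all transactions. The function name `expected_support` and the two-transaction database are hypothetical; an item missing from a transaction is treated as having probability 0 there.

```python
from math import prod


def expected_support(itemset, transactions):
    """Sum, over all transactions, the product of P(x, t) for x in itemset.

    An item that does not appear in a transaction contributes probability 0.
    """
    return sum(
        prod(t.get(x, 0.0) for x in itemset)
        for t in transactions
    )


# Hypothetical two-transaction database (illustrative values only).
tdb = [
    {"a": 0.9, "b": 0.5},
    {"a": 0.4, "c": 0.7},
]

sup_a = expected_support({"a"}, tdb)        # 0.9 + 0.4
sup_ab = expected_support({"a", "b"}, tdb)  # 0.9 * 0.5 + 0.4 * 0
sup_bc = expected_support({"b", "c"}, tdb)  # b and c never co-occur
```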
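UF-streaming

Batch    Transactions   Contents
first    t1             {a:0.9, d:0.8, e:0.7, f:0.2}
first    t2             {a:0.9, c:0.7, d:0.7, e:0.6}
first    t3             {b:1.0, c:0.9}
second   t4             {b:1.0, c:0.9, d:0.3}
second   t5             {a:0.9, d:0.8}
second   t6             {b:1.0, d:0.7, e:0.1}
third    t7             {a:0.9, d:0.8}
third    t8             {b:1.0, c:0.9, d:0.3}
third    t9             {a:0.9, d:0.8, e:0.7}

First batch (preMinsup = 0.9):
• Expected support of each item: a: 1.8, b: 1.0, c: 1.6, d: 1.5, e: 1.3, f: 0.2
• Example: expSup({a}) = (1×0.9) + (1×0.9) = 1.8
• Frequent items (expected support ≥ preMinsup): a, b, c, d, e. Item f (0.2) is infrequent.

Figure: UF-tree
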
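The per-item totals for the first batch can be reproduced with a short sketch: add up each item's existential probabilities over t1 to t3, then keep the items that reach preMinsup. The helper name `item_supports` and the small tolerance used in the comparison are assumptions made for this illustration; this is a flat scan, not the UF-tree construction itself.

```python
from collections import defaultdict


def item_supports(batch):
    """Expected support of every single item in one batch of transactions."""
    totals = defaultdict(float)
    for transaction in batch:
        for item, prob in transaction.items():
            totals[item] += prob
    return dict(totals)


# First batch (t1, t2, t3) from the slide.
first_batch = [
    {"a": 0.9, "d": 0.8, "e": 0.7, "f": 0.2},
    {"a": 0.9, "c": 0.7, "d": 0.7, "e": 0.6},
    {"b": 1.0, "c": 0.9},
]
pre_minsup = 0.9

supports = item_supports(first_batch)
# A tiny tolerance guards against floating-point round-off at the threshold.
frequent_items = sorted(
    item for item, s in supports.items() if s >= pre_minsup - 1e-9
)
```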
UF-streaming
expSup({a,e}) = (1×0.9×0.6) + (1×0.9×0.7) = 1.17 ≥ preMinsup
expSup({d,e}) = (1×0.7×0.6) + (1×0.8×0.7) = 0.98 ≥ preMinsup
→ frequent: {a,e}, {d,e}

Figure: UF-tree for {e}-projected DB

All frequent: {a}, {a,d}, {a,e}, {b}, {c}, {d}, {d,e}, {e}
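The {e}-projected step can be approximated without a tree: take only the first-batch transactions that contain e and, for every other item x in them, accumulate P(x, ti) × P(e, ti). The function name `extension_supports` is hypothetical, and this flat scan is a simplification that stands in for the UF-tree traversal rather than reproducing it.

```python
from collections import defaultdict


def extension_supports(batch, suffix_item):
    """Expected support of {x, suffix_item} for every other item x,
    using only the transactions that contain suffix_item."""
    totals = defaultdict(float)
    for transaction in batch:
        if suffix_item not in transaction:
            continue  # transaction is not part of the projected database
        p_suffix = transaction[suffix_item]
        for item, prob in transaction.items():
            if item != suffix_item:
                totals[item] += prob * p_suffix
    return dict(totals)


# First batch (t1, t2, t3) from the slides.
first_batch = [
    {"a": 0.9, "d": 0.8, "e": 0.7, "f": 0.2},
    {"a": 0.9, "c": 0.7, "d": 0.7, "e": 0.6},
    {"b": 1.0, "c": 0.9},
]
pre_minsup = 0.9

e_extensions = extension_supports(first_batch, "e")
frequent_with_e = sorted(
    item for item, s in e_extensions.items() if s >= pre_minsup
)
```

With these inputs, only a and d extend e to a frequent pair, which agrees with {a,e} and {d,e} on the slide.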
SUF-growth

UF-growth Improvement
Improvement 1:
• To reduce memory consumption and to increase the chance of path sharing, we discretize the expected support of each tree node by rounding it to k decimal places (e.g., 2 decimal places). This limits the values in the range (0,1] to a maximum of 10^k possible values.
• Example: when k = 2, there are at most 100 possible expected support values, ranging from 0.01 to 1.00 inclusive.

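The effect of the rounding on node sharing can be sketched as follows. Here a tree node is modelled simply as an (item, probability) pair, so two occurrences can share a node only when both fields match. The names `discretize` and `node_keys` and the sample probabilities are hypothetical, and Python's built-in `round` is used as a stand-in for whatever rounding rule the authors apply.

```python
def discretize(prob, k=2):
    """Round a probability to k decimal places."""
    return round(prob, k)


def node_keys(occurrences, k=None):
    """Distinct (item, probability) keys; each key would need its own node.

    With k=None the raw probabilities are used; otherwise they are
    rounded to k decimal places first.
    """
    if k is None:
        return {(item, prob) for item, prob in occurrences}
    return {(item, discretize(prob, k)) for item, prob in occurrences}


# Hypothetical occurrences of item d with slightly different probabilities.
occurrences = [("d", 0.72), ("d", 0.71875), ("d", 0.7231), ("d", 0.3)]

before = node_keys(occurrences)       # one key per distinct raw value
after = node_keys(occurrences, k=2)   # nearby values collapse together
```

In this example the four raw values need four distinct keys, while rounding to two decimal places leaves only two, which is the kind of sharing the improvement aims for.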
UF-growth Improvement
Figure: UF-tree before improvement and after improvement

UF-growth Improvement
Improvement 2:
• To reduce memory usage, the improved UF-growth does not need to build subsequent UF-trees for any non-singleton pattern.
• Example of a subsequent UF-tree: the one for the {d,e}-projected database.
• Example:

UF-growth Improvement
{e}-projected → subsets: {a,e}, {a,d,e}, {d,e}
• expSup({a,e}) = 1×0.9×0.72 = 0.648
• expSup({a,d,e}) = 1×0.9×0.72×0.71875 = 0.46575
• expSup({d,e}) = 1×0.71875×0.72 = 0.5175

UF-growth Improvement

{e}-projected → subsets: {a,e}, {a,d,e}, {d,e}

expSup({a,e}) = 0.648 + (2×0.9×0.71875) = 1.94175
expSup({a,d,e}) = 0.46575 + (2×0.9×0.72×0.71875) = 1.39725
expSup({d,e}) = 0.5175 + (2×0.72×0.71875) = 1.5525

All frequent: {a}, {a,d}, {a,d,e}, {a,e}, {b}, {b,c}, {c}, {d}, {d,e}, {e}

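The arithmetic on the two preceding slides can be checked with a sketch that enumerates, for each path of the {e}-projected tree, every non-empty subset of the prefix items and accumulates count × P(e) × (product of the subset's probabilities). No further projected trees are built. The path layout below (occurrence counts 1 and 2, with probabilities 0.9, 0.72 and 0.71875) is an assumption read off from the totals on the slides, and `suffix_pattern_supports` is a hypothetical name.

```python
from collections import defaultdict
from itertools import combinations
from math import prod


def suffix_pattern_supports(paths, suffix):
    """Expected support of every pattern (subset of prefix items + suffix).

    Each path is (occurrence_count, {prefix_item: prob}, suffix_prob).
    """
    totals = defaultdict(float)
    for count, prefix, p_suffix in paths:
        items = sorted(prefix)
        for size in range(1, len(items) + 1):
            for combo in combinations(items, size):
                pattern = frozenset(combo) | {suffix}
                totals[pattern] += (
                    count * p_suffix * prod(prefix[x] for x in combo)
                )
    return dict(totals)


# Assumed paths ending in e, reconstructed from the slide arithmetic.
paths = [
    (1, {"a": 0.9, "d": 0.71875}, 0.72),
    (2, {"a": 0.9, "d": 0.72}, 0.71875),
]

supports = suffix_pattern_supports(paths, "e")
sup_ae = supports[frozenset({"a", "e"})]
sup_ade = supports[frozenset({"a", "d", "e"})]
sup_de = supports[frozenset({"d", "e"})]
```

Enumerating subsets per path trades extra computation for memory: a path with n prefix items yields 2^n − 1 patterns, but no additional tree has to be stored.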
Experimental Result

ICDE:

Experimental Result

PAKDD:

Conclusion

ICDE:
Experimental results showed the effectiveness of our algorithms.

PAKDD:
With our tree-based approach, users can mine frequent patterns from uncertain data effectively.
