Mining of Frequent Itemsets from Streams of Uncertain Data

Carson Kai-Sang Leung, Boyu Hao
ICDE 2009
&
A Tree-Based Approach for Frequent Pattern Mining from Uncertain Data
Carson Kai-Sang Leung, Mark Anthony F. Mateo, and Dale A. Brajczuk
PAKDD 2008
Outline
• Motivation
• Related Work
• Method
  • UF-streaming
  • SUF-growth
  • UF-growth Improvement
• Experimental Result
• Conclusion

Motivation
ICDE:
  1. Can we handle streams of uncertain data?
  2. Given streams of uncertain data, how can we effectively capture their important contents?
PAKDD:
  1. Can we avoid generating candidates at all?
  2. Since tree-based algorithms for handling precise data are usually faster than their Apriori-based counterparts, is this also the case when handling uncertain data?

Related Work

Existential probability: P(x, ti), where x is an item and ti is a transaction.

Using the “possible world” interpretation of uncertain data, there are two possible worlds for an item x and a transaction ti:
• (i) x ∈ ti, with probability P(x, ti)
• (ii) x ∉ ti, with probability 1 - P(x, ti)
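As a rough illustration of the possible-world view (a sketch, not code from either paper), the Python snippet below enumerates every possible world of one uncertain transaction. It assumes that items occur independently of one another; the name possible_worlds is hypothetical, and the probabilities are those of t1 in the running example.

```python
from itertools import product

def possible_worlds(transaction):
    """Yield (world, probability) pairs for one uncertain transaction.

    Assumption: items are independent, so a world's probability is the
    product of P(x, t) for present items and 1 - P(x, t) for absent ones.
    """
    items = sorted(transaction)
    for mask in product([True, False], repeat=len(items)):
        world = frozenset(x for x, present in zip(items, mask) if present)
        prob = 1.0
        for x, present in zip(items, mask):
            p = transaction[x]
            prob *= p if present else (1.0 - p)
        yield world, prob

# t1 of the running example
t1 = {"a": 0.9, "d": 0.8, "e": 0.7, "f": 0.2}
worlds = list(possible_worlds(t1))

# The probabilities of all possible worlds sum to 1.
total = sum(p for _, p in worlds)

# Probability that {a, e} appears in t1: sum over the worlds containing both.
prob_ae = sum(p for w, p in worlds if {"a", "e"} <= w)
```

Under the independence assumption, prob_ae collapses to the product 0.9 × 0.7, which is why the expected support can be computed without enumerating worlds.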

Related Work

The expected support of an itemset X in TDB:
expSup(X) = Σ_{i=1..|TDB|} Π_{x∈X} P(x, ti)
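A minimal Python sketch of the expected support, assuming item independence so that the probability of X occurring in a transaction is the product of its items' existential probabilities. The helper name expected_support is hypothetical; the three transactions are the first batch (t1 to t3) of the running example.

```python
from math import prod

def expected_support(itemset, transactions):
    """Sum, over all transactions, of the product of P(x, t) for x in itemset.

    An item missing from a transaction is treated as having probability 0.
    """
    return sum(prod(t.get(x, 0.0) for x in itemset) for t in transactions)

# First batch of the running example (t1, t2, t3)
first_batch = [
    {"a": 0.9, "d": 0.8, "e": 0.7, "f": 0.2},
    {"a": 0.9, "c": 0.7, "d": 0.7, "e": 0.6},
    {"b": 1.0, "c": 0.9},
]

exp_sup_ae = expected_support({"a", "e"}, first_batch)  # about 1.17
exp_sup_de = expected_support({"d", "e"}, first_batch)  # about 0.98
```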
UF-streaming

Batch    Transaction   Contents
first    t1            {a:0.9,d:0.8,e:0.7,f:0.2}
         t2            {a:0.9,c:0.7,d:0.7,e:0.6}
         t3            {b:1.0,c:0.9}
second   t4            {b:1.0,c:0.9,d:0.3}
         t5            {a:0.9,d:0.8}
         t6            {b:1.0,d:0.7,e:0.1}
third    t7            {a:0.9,d:0.8}
         t8            {b:1.0,c:0.9,d:0.3}
         t9            {a:0.9,d:0.8,e:0.7}

First batch (preMinsup = 0.9):
• Expected support of each item: a:1.8, b:1.0, c:1.6, d:1.5, e:1.3, f:0.2
• e.g., a: (1×0.9)+(1×0.9) = 1.8
• Frequent items: a, b, c, d, e (f is infrequent because 0.2 < preMinsup)
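The item-level scan of the first batch can be sketched in Python as follows. This is only an illustration: the helper name item_expected_supports is hypothetical, and the ">=" comparison against preMinsup follows the "≥ preMinsup" checks used in the example.

```python
from collections import defaultdict

def item_expected_supports(batch):
    """Accumulate the expected support of every single item in one batch."""
    totals = defaultdict(float)
    for transaction in batch:
        for item, prob in transaction.items():
            totals[item] += prob
    return dict(totals)

# First batch of the running example (t1, t2, t3)
first_batch = [
    {"a": 0.9, "d": 0.8, "e": 0.7, "f": 0.2},
    {"a": 0.9, "c": 0.7, "d": 0.7, "e": 0.6},
    {"b": 1.0, "c": 0.9},
]
pre_minsup = 0.9

supports = item_expected_supports(first_batch)

# Items whose expected support reaches preMinsup; f (0.2) is dropped.
frequent_items = sorted(x for x, s in supports.items() if s >= pre_minsup)
```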

[Figure: UF-tree]
UF-streaming
• expSup({a,e}) = (1×0.9×0.6)+(1×0.9×0.7) = 1.17 ≥ preMinsup
• expSup({d,e}) = (1×0.7×0.6)+(1×0.8×0.7) = 0.98 ≥ preMinsup
→ frequent: {a,e}, {d,e}
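The two checks above can be reproduced with a small Python sketch that scans only the transactions containing the suffix item e. It does not build a UF-tree; it just accumulates the same products. The function name frequent_extensions and the alphabetical item order are assumptions made for this illustration.

```python
def frequent_extensions(suffix, transactions, pre_minsup):
    """Return frequent 2-itemsets {x, suffix} with their expected supports.

    Only items that precede `suffix` in the assumed alphabetical order
    are considered, mimicking a projected database for `suffix`.
    """
    sums = {}
    for t in transactions:
        if suffix not in t:
            continue  # transaction is not part of the projected database
        for item, prob in t.items():
            if item < suffix:
                sums[item] = sums.get(item, 0.0) + prob * t[suffix]
    return {
        frozenset((item, suffix)): s
        for item, s in sums.items()
        if s >= pre_minsup
    }

# First batch of the running example (t1, t2, t3)
first_batch = [
    {"a": 0.9, "d": 0.8, "e": 0.7, "f": 0.2},
    {"a": 0.9, "c": 0.7, "d": 0.7, "e": 0.6},
    {"b": 1.0, "c": 0.9},
]

e_extensions = frequent_extensions("e", first_batch, pre_minsup=0.9)
```

Here {c,e} is filtered out because its expected support is only 0.7×0.6 = 0.42.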

UF-tree for {e}-projected DB

All frequent:{a},{a,d},{a,e},{b},{c},{d},{d,e},{e}
SUF-growth
UF-growth Improvement
Improvement 1:
• To reduce memory consumption and increase the chance of path sharing, we discretize and round the expected support of each tree node to k decimal places (e.g., 2 decimal places). This limits the values in the range (0,1] to a maximum of 10^k possible values.
• Example: at most 100 possible expected support values, ranging from 0.01 to 1.00 inclusive, when k = 2.
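To see why rounding helps path sharing, here is a hedged Python sketch of a prefix tree whose nodes are keyed by (item, expected support), so that two transactions share a node only when both values match. The Node class, build_tree function, alphabetical item order, and the three transactions are all hypothetical and chosen for illustration; they are not the papers' data or implementation.

```python
class Node:
    """Tree node holding an occurrence count and its children."""
    def __init__(self):
        self.count = 0
        self.children = {}  # (item, expected support) -> Node

def build_tree(transactions, k=None):
    """Insert transactions into a prefix tree; round values to k places if given.

    Returns the root and the number of non-root nodes created.
    """
    root = Node()
    n_nodes = 0
    for t in transactions:
        node = root
        for item in sorted(t):  # assumed canonical item order
            p = t[item] if k is None else round(t[item], k)
            key = (item, p)
            if key not in node.children:
                node.children[key] = Node()
                n_nodes += 1
            node = node.children[key]
            node.count += 1
    return root, n_nodes

# Hypothetical transactions with nearly identical probabilities
txs = [
    {"a": 0.901, "d": 0.718},
    {"a": 0.899, "d": 0.722},
    {"a": 0.9, "d": 0.72},
]

_, nodes_exact = build_tree(txs)         # no sharing: every value differs
tree, nodes_rounded = build_tree(txs, k=2)  # all three collapse into one path
```

With exact values each transaction gets its own path; after rounding to k = 2, the three transactions share a single path whose nodes carry a count of 3.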

UF-growth Improvement
Before improvement
After improvement
UF-growth Improvement
Improvement 2:
• The improved UF-growth does not need to build subsequent UF-trees for any non-singleton patterns, which reduces memory usage.
• Subsequent UF-tree: e.g., the one for the {d,e}-projected database.
• Example:

UF-growth Improvement
{e}-projected → subsets: {a,e}, {a,d,e}, {d,e}
• expSup({a,e}) = 1×0.9×0.72 = 0.648
• expSup({a,d,e}) = 1×0.9×0.72×0.71875 = 0.46575
• expSup({d,e}) = 1×0.71875×0.72 = 0.5175

UF-growth Improvement

{e}-projected → subsets: {a,e}, {a,d,e}, {d,e}

• expSup({a,e}) = 0.648+(2×0.9×0.71875) = 1.94175
• expSup({a,d,e}) = 0.46575+(2×0.9×0.72×0.71875) = 1.39725
• expSup({d,e}) = 0.5175+(2×0.72×0.71875) = 1.5525
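The running totals above can be reproduced by enumerating subsets along each path of the {e}-projected tree instead of building further projected trees. The Python sketch below is an illustration under stated assumptions: the two paths and their counts are read off the products in the example (one path with count 1, one with count 2), and the function name supports_with_suffix is hypothetical.

```python
from itertools import combinations
from math import prod

# (occurrence count, {item: expected support}) for each path, suffix e included.
# Values inferred from the products shown in the example.
paths = [
    (1, {"a": 0.9, "d": 0.71875, "e": 0.72}),
    (2, {"a": 0.9, "d": 0.72, "e": 0.71875}),
]

def supports_with_suffix(paths, suffix):
    """Expected support of every itemset formed by the suffix plus a
    non-empty subset of the other items on a path, summed over all paths."""
    result = {}
    for count, probs in paths:
        prefix_items = sorted(x for x in probs if x != suffix)
        for r in range(1, len(prefix_items) + 1):
            for combo in combinations(prefix_items, r):
                itemset = frozenset(combo) | {suffix}
                contribution = count * probs[suffix] * prod(probs[x] for x in combo)
                result[itemset] = result.get(itemset, 0.0) + contribution
    return result

e_supports = supports_with_suffix(paths, "e")
```

Each path contributes its count times the product of the chosen items' values, so {a,d,e} and {d,e} are obtained directly from the {e}-projected paths without a {d,e}-projected tree.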

All frequent: {a}, {a,d}, {a,d,e}, {a,e}, {b}, {b,c}, {c}, {d}, {d,e}, {e}


Experimental Result

ICDE:
Experimental Result

PAKDD:
Conclusion

ICDE:


Experimental results showed the effectiveness of our
algorithms.
PAKDD:

With our tree-based approach, users can mine
frequent patterns from uncertain data effectively.