Finding Highly Correlated Pairs Efficiently with Powerful Pruning
Jian Zhang, Joan Feigenbaum
CIKM’06
2007/5/3
Chen Yi-Chun
Outline
• Motivation
• TAPER
• Our approach
• Algorithm
• Experiment results
• Conclusion
Motivation
• We consider the problem of finding highly correlated pairs in a large data set.
• With massive data sets:
  – The total number of pairs may exceed the main-memory capacity.
  – The computational cost of the naïve method is prohibitive.
TAPER
• Two passes:
  – Generate a set of candidate pairs whose correlation coefficients may be above the threshold.
  – Compute the correlation coefficients of the candidate pairs.
Cont.
• Advantage
  – Computational simplicity
    • To decide whether a pair (a, b) should be pruned, TAPER uses an upper bound on φ(a, b) that considers only the frequencies of the individual items a and b.
• Disadvantage
  – A relatively large group of uncorrelated pairs is missed by the pruning rule.
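As a sketch, the individual-support upper bound that TAPER relies on (for sp(a) ≤ sp(b), upper(φ) = √(sp(a)/sp(b)) · √((1 − sp(b))/(1 − sp(a)))) can be coded as follows; the function names are illustrative, not from the paper:

```python
import math

def taper_upper_bound(sp_a: float, sp_b: float) -> float:
    """Upper bound on phi(a, b) computed from individual supports only.

    For sp(a) <= sp(b): upper(phi) = sqrt(sp(a)/sp(b)) * sqrt((1-sp(b))/(1-sp(a))).
    """
    lo, hi = min(sp_a, sp_b), max(sp_a, sp_b)
    return math.sqrt(lo / hi) * math.sqrt((1 - hi) / (1 - lo))

def taper_prune(sp_a: float, sp_b: float, delta: float) -> bool:
    # Prune (a, b): even the upper bound cannot reach the threshold delta.
    return taper_upper_bound(sp_a, sp_b) < delta
```

Because the bound depends only on the two supports, pairs with very different frequencies are pruned cheaply, but pairs with similar (possibly both low) supports always survive.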
Notation definition
• m : the total number of rows (baskets) in the data set
• R(a) : the set of rows containing item a
• sp(a) = |R(a)| / m : the support of item a
• δ : the user-specified correlation threshold
• φ(a, b) : the Pearson correlation coefficient of the pair (a, b)
Our approach
• Jaccard distance (JD): JD(a, b) = 1 − |R(a) ∩ R(b)| / |R(a) ∪ R(b)|
• A strong connection
  – If the pair (a, b) has a large correlation coefficient, then its JD must be small.
• Pearson correlation coefficient:

      φ(a, b) = ( sp(ab) − sp(a)·sp(b) ) / √( sp(a)·sp(b)·(1 − sp(a))·(1 − sp(b)) )

• Assume that sp(a) ≤ sp(b). Because a and b are highly correlated, we can replace sp(ab) with sp(a), giving

      φ(a, b) ≤ S = √( sp(a)·(1 − sp(b)) / ( sp(b)·(1 − sp(a)) ) )

• By the assumption that sp(a) ≤ sp(b), S ≤ 1.
• Note that sp(ab) = |R(a) ∩ R(b)| / m.
• Our rule is thus to prune the pair (a, b) when

      |R(a) ∩ R(b)| / |R(a) ∪ R(b)| < δ² / 2

• Given 1 ≥ S ≥ δ, the ratio |R(a) ∩ R(b)| / |R(a) ∪ R(b)| achieves its minimum value of δ²/2 when S = δ. The last inequality (S ≥ δ) comes from the fact that φ(a, b) ≤ S, so φ(a, b) ≥ δ implies S ≥ δ.
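A sketch of the resulting prune test on row sets, assuming the δ²/2 threshold (the helper names are mine):

```python
def jaccard_sim(r_a: set, r_b: set) -> float:
    """|R(a) ∩ R(b)| / |R(a) ∪ R(b)|."""
    union = len(r_a | r_b)
    return len(r_a & r_b) / union if union else 0.0

def prune(r_a: set, r_b: set, delta: float) -> bool:
    # Safe to prune: a pair with phi >= delta must have
    # Jaccard similarity at least delta**2 / 2.
    return jaccard_sim(r_a, r_b) < delta ** 2 / 2
```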
Min-hash function
      h_min(a) = min_{r ∈ R(a)} { h(r) }
• It has the following property:

      Pr( h_min(a) = h_min(b) ) = |R(a) ∩ R(b)| / |R(a) ∪ R(b)|
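This property can be observed empirically; a sketch using random permutations as the hash functions (an illustrative choice, with made-up row sets):

```python
import random

def min_hash(rows, h):
    """h_min(a) = min over r in R(a) of h(r)."""
    return min(h[r] for r in rows)

def collision_rate(r_a, r_b, m, trials=5000, seed=0):
    """Fraction of random hash functions on which the min-hashes agree."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        h = list(range(m))
        rng.shuffle(h)          # a random permutation plays the role of h
        hits += min_hash(r_a, h) == min_hash(r_b, h)
    return hits / trials

r_a, r_b = set(range(30)), set(range(20, 50))   # overlap 10, union 50
# Exact Jaccard similarity is 10/50 = 0.2; collision_rate should be close.
```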
Cont.
• Ex1. Assume that there are 10 rows (baskets) in total and we choose the following values of h:

      r    : 0  1  2  3  4  5  6  7  8  9
      h(r) : 17 21 9  44 5  16 1  20 37 8

• Also assume that item 3 appears in baskets 2, 5, and 8. Then

      h_min(3) = min{ h(2) = 9, h(5) = 16, h(8) = 37 } = 9
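Ex1 in code, with the values copied from the slide:

```python
# 10 rows with the hash values from Ex1; item 3 appears in baskets 2, 5, 8.
h = [17, 21, 9, 44, 5, 16, 1, 20, 37, 8]    # h[r] for r = 0..9
rows_of_item3 = {2, 5, 8}

h_min = min(h[r] for r in rows_of_item3)    # min{h(2)=9, h(5)=16, h(8)=37}
```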
False negative problem
• Note that this bound is tight.
  – Consider two items a and b with R(a) ∩ R(b) = R(a), i.e., R(a) ⊆ R(b).
  – Assume sp(a) and sp(b) are very small. Then

      φ(a, b) ≈ √( sp(a) / sp(b) )   and   |R(a) ∩ R(b)| / |R(a) ∪ R(b)| = sp(a) / sp(b) ≈ φ(a, b)²

• Hence, if we prune a pair when |R(a) ∩ R(b)| / |R(a) ∪ R(b)| < δ², we may have removed a pair whose φ(a, b) ≥ δ.
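A numeric check of the nested example (the support values below are illustrative):

```python
import math

sp_a, sp_b = 1e-4, 4e-4   # R(a) ⊆ R(b) with tiny supports, so sp(ab) = sp(a)
sp_ab = sp_a

phi = (sp_ab - sp_a * sp_b) / math.sqrt(sp_a * sp_b * (1 - sp_a) * (1 - sp_b))
jac = sp_a / sp_b         # |R(a) ∩ R(b)| / |R(a) ∪ R(b)|

# phi ≈ sqrt(sp(a)/sp(b)) = 0.5 and jac = 0.25 ≈ phi**2, as claimed.
```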
Multiple min-hash functions
• We use k independent min-hash functions and define an equivalence relation "≡".
  – For two items a and b, a ≡ b ⟺ a and b have the same min-hash values for all k hash functions.
  – With one min-hash function, Pr(a ≡ b) = x; with k independent functions, Pr(a ≡ b) = x^k ≤ x.
  – We repeat the whole process t times.
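A sketch of the equivalence test for one round, with the k hash functions simulated by random permutations (the names and values are illustrative):

```python
import random

def minhash_signature(rows, perms):
    """Length-k signature: one min-hash value per independent hash function."""
    return tuple(min(p[r] for r in rows) for p in perms)

rng = random.Random(1)
m, k = 10, 3
perms = [rng.sample(range(m), m) for _ in range(k)]  # k simulated hash functions

sig_a = minhash_signature({2, 5, 8}, perms)
sig_b = minhash_signature({2, 5, 8}, perms)
sig_c = minhash_signature({0, 1}, perms)

equivalent_ab = sig_a == sig_b    # a ≡ b: all k min-hash values agree
equivalent_ac = sig_a == sig_c    # disjoint row sets: signatures differ
```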
Cont.
• The probability that a and b belong to the same equivalence class in at least one of the t trials is

      1 − (1 − x^k)^t
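The trade-off between k and t can be explored directly:

```python
def detect_prob(x: float, k: int, t: int) -> float:
    """Probability that a pair with Jaccard similarity x lands in the same
    equivalence class in at least one of t independent trials."""
    return 1 - (1 - x ** k) ** t

# Larger k filters out dissimilar pairs; larger t recovers similar ones.
```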
Cont.
• Ex2. We show how candidates are generated after we obtain the min-hash values.
  – In round 1, v1 of item 3 is equal to v1 of item 17. Hence (3, 17) is put in the candidate set.
  – In round 2, no two vectors are equal.
  – In round 3, v3 of item 9 is equal to v3 of item 17. Hence (9, 17) is put in the candidate set.
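The slide's min-hash table is not reproduced in this text; the vectors below are hypothetical but consistent with the three rounds described, and show the candidate-generation step:

```python
# Hypothetical min-hash vectors (one per item per round); the actual table
# from the slide is not available, but these reproduce its three rounds.
rounds = [
    {3: (9, 4), 9: (2, 7), 17: (9, 4)},   # round 1: v1(3) == v1(17)
    {3: (1, 6), 9: (5, 3), 17: (8, 2)},   # round 2: no two vectors are equal
    {3: (0, 5), 9: (6, 1), 17: (6, 1)},   # round 3: v3(9) == v3(17)
]

candidate_set = set()
for vectors in rounds:
    items = sorted(vectors)
    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            a, b = items[i], items[j]
            if vectors[a] == vectors[b]:
                candidate_set.add((a, b))
# candidate_set is {(3, 17), (9, 17)}; (3, 9) is never generated.
```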
Algorithm
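The algorithm figure from the slides is not reproduced in this text version. A sketch of the overall two-phase scheme as described on the preceding slides (the parameter choices, function names, and permutation-based hashing are my assumptions):

```python
import math
import random
from collections import defaultdict

def find_correlated_pairs(items, m, delta, k=2, t=20, seed=0):
    """Min-hash rounds generate candidate pairs; the exact Pearson
    correlation is then computed for the candidates only."""
    rng = random.Random(seed)
    candidates = set()
    for _ in range(t):                      # t independent rounds
        perms = [rng.sample(range(m), m) for _ in range(k)]
        buckets = defaultdict(list)
        for a, rows in items.items():       # length-k signature per item
            sig = tuple(min(p[r] for r in rows) for p in perms)
            buckets[sig].append(a)
        for group in buckets.values():      # same signature -> candidate pair
            for i in range(len(group)):
                for j in range(i + 1, len(group)):
                    candidates.add(tuple(sorted((group[i], group[j]))))
    result = []
    for a, b in sorted(candidates):         # phase 2: exact correlation
        sp_a, sp_b = len(items[a]) / m, len(items[b]) / m
        sp_ab = len(items[a] & items[b]) / m
        den = math.sqrt(sp_a * sp_b * (1 - sp_a) * (1 - sp_b))
        if den > 0 and (sp_ab - sp_a * sp_b) / den >= delta:
            result.append((a, b))
    return result
```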
Experiment results
Conclusion
• Back to Ex2: (3, 9) is not in the candidate set. Does this agree with my substitution?