Finding Highly Correlated Pairs
Efficiently with Powerful Pruning
Jian Zhang, Joan Feigenbaum
CIKM’06
2007/5/3
Chen Yi-Chun
Outline
• Motivation
• TAPER
• Our approach
• Algorithm
• Experiment results
• Conclusion
Motivation
• We consider the problem of finding highly correlated pairs in a large data set.
• With massive data sets:
  – The total number of pairs may exceed the main-memory capacity.
  – The computational cost of the naïve method is prohibitive.
TAPER
• Two passes:
  – Generate a set of candidate pairs whose correlation coefficients may be above the threshold.
  – Compute the correlation coefficients of the candidate pairs.
Cont.
• Advantage
  – Computational simplicity: to decide whether a pair (a, b) should be pruned, TAPER uses an upper bound that considers only the frequencies of the individual items a and b (one such bound is sketched below).
• Disadvantage
  – A relatively large group of uncorrelated pairs is missed by the pruning rule, i.e., they survive as candidates and must still be checked.
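The bound itself is not reproduced on the slide. For concreteness, an item-frequency-only upper bound of this kind (assuming sp(a) ≤ sp(b)), which coincides with the quantity S derived later in these slides, is

  \phi(a, b) \le \sqrt{\frac{sp(a)\,(1 - sp(b))}{sp(b)\,(1 - sp(a))}}

Because it depends only on sp(a) and sp(b), it can be evaluated without counting the co-occurrences sp(ab).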
Notation definition
• m: the total number of rows (baskets) in the data set.
• R(a): the set of rows in which item a appears.
• sp(a) = |R(a)| / m: the support of item a; similarly, sp(ab) = |R(a) ∩ R(b)| / m.
• φ(a, b): the Pearson correlation coefficient of the pair (a, b).
Our approach
• Jaccard distance (JD):

  JD(a, b) = 1 - \frac{|R(a) \cap R(b)|}{|R(a) \cup R(b)|}

• A strong connection
  – If the pair (a, b) has a large correlation coefficient, then its JD must be small (a toy numeric check follows below).
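As a quick illustration of this connection, here is a toy Python check on made-up row sets (the values of m, R(a), and R(b) below are illustrative, not from the paper): a pair with heavily overlapping row sets has a large correlation coefficient and a small Jaccard distance.

```python
def phi(R_a, R_b, m):
    """Pearson (phi) correlation of two items given their row sets and the number of rows m."""
    sp_a, sp_b = len(R_a) / m, len(R_b) / m
    sp_ab = len(R_a & R_b) / m
    return (sp_ab - sp_a * sp_b) / (sp_a * sp_b * (1 - sp_a) * (1 - sp_b)) ** 0.5

def jaccard_distance(R_a, R_b):
    """Jaccard distance: 1 minus the intersection/union ratio of the two row sets."""
    return 1 - len(R_a & R_b) / len(R_a | R_b)

m = 1000
R_a = set(range(50))   # item a appears in rows 0..49
R_b = set(range(60))   # item b appears in rows 0..59, so the two row sets overlap heavily
print(phi(R_a, R_b, m))             # ~0.91 -> large correlation coefficient
print(jaccard_distance(R_a, R_b))   # ~0.17 -> small Jaccard distance
```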
• Pearson correlation coefficient:

  \phi(a, b) = \frac{sp(ab) - sp(a)\,sp(b)}{\sqrt{sp(a)\,sp(b)\,(1 - sp(a))\,(1 - sp(b))}}

• Assume that a and b are highly correlated. Because sp(ab) ≤ sp(a), we can replace sp(ab) with sp(a) to obtain an upper bound:

  \phi(a, b) \le S = \sqrt{\frac{sp(a)\,(1 - sp(b))}{sp(b)\,(1 - sp(a))}}

• By the assumption that sp(a) ≤ sp(b), S ≤ 1.
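A small numeric sanity check of this derivation, using assumed toy supports (not from the paper) that satisfy sp(ab) ≤ sp(a) ≤ sp(b):

```python
# Toy supports (assumed values) with sp_ab <= sp_a <= sp_b.
sp_a, sp_b, sp_ab = 0.02, 0.05, 0.018

phi = (sp_ab - sp_a * sp_b) / (sp_a * sp_b * (1 - sp_a) * (1 - sp_b)) ** 0.5
S = (sp_a * (1 - sp_b) / (sp_b * (1 - sp_a))) ** 0.5

print(phi, S)   # phi ≈ 0.56, S ≈ 0.62: phi <= S <= 1, as the derivation predicts
```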
• By definition, sp(ab) = |R(a) ∩ R(b)| / m.
• Our rule is thus to prune the pair (a, b) when

  \frac{|R(a) \cap R(b)|}{|R(a) \cup R(b)|} < \frac{\theta}{2}

  where θ is the correlation threshold.
• The last inequality in the derivation uses the fact that S ≤ 1; under this constraint the bounding ratio attains its minimum, which gives the factor 1/2.
Min-hash function

  h_{\min}(a) = \min_{r \in R(a)} h(r)

• It has the following property:

  \Pr\bigl(h_{\min}(a) = h_{\min}(b)\bigr) = \frac{|R(a) \cap R(b)|}{|R(a) \cup R(b)|}

  (For h a random permutation of the rows: the overall minimum over R(a) ∪ R(b) is equally likely to fall on any row of the union, and the two min-hash values coincide exactly when it falls in R(a) ∩ R(b).)
Cont.
• Ex1. Assume that there are 10 rows (baskets) in total and we choose the following values of h:

  r    : 0   1   2   3   4   5   6   7   8   9
  h(r) : 17  21  9   44  5   16  1   20  37  8

• Also assume that item 3 appears in baskets 2, 5, 8. Then

  h_{\min}(3) = \min\{h(2) = 9,\ h(5) = 16,\ h(8) = 37\} = 9
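The same computation in a few lines of Python, reproducing the numbers of Ex1 (the helper name min_hash is mine):

```python
# h(r) for the 10 baskets, copied from the table above.
h = {0: 17, 1: 21, 2: 9, 3: 44, 4: 5, 5: 16, 6: 1, 7: 20, 8: 37, 9: 8}

def min_hash(rows, h):
    """Min-hash value of an item given the set of rows (baskets) it appears in."""
    return min(h[r] for r in rows)

print(min_hash({2, 5, 8}, h))   # item 3 appears in baskets 2, 5, 8 -> min{9, 16, 37} = 9
```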
False negative problem
• Note that this bound is tight.
  – Consider two items a and b with R(a) ∩ R(b) = R(a), i.e., R(a) ⊆ R(b).
  – Assume sp(a) and sp(b) are very small. Then

    \phi(a, b) \approx \sqrt{\frac{sp(a)}{sp(b)}}
    \quad\text{and}\quad
    \frac{|R(a) \cap R(b)|}{|R(a) \cup R(b)|} = \frac{sp(a)}{sp(b)} = \phi(a, b)^2

• Hence, if we prune a pair when |R(a) ∩ R(b)| / |R(a) ∪ R(b)| < θ/2, we may have removed a pair whose φ(a, b) is above the threshold. (A numeric illustration follows below.)
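A numeric illustration of this tightness example, with assumed row sets (not from the paper): R(a) ⊆ R(b) and very small supports, so the intersection/union ratio equals φ(a, b)² and is much smaller than φ(a, b) itself.

```python
m = 100_000
R_b = set(range(400))   # sp(b) = 0.004
R_a = set(range(100))   # sp(a) = 0.001, and R(a) is a subset of R(b)

sp_a, sp_b, sp_ab = len(R_a) / m, len(R_b) / m, len(R_a & R_b) / m
phi = (sp_ab - sp_a * sp_b) / (sp_a * sp_b * (1 - sp_a) * (1 - sp_b)) ** 0.5
ratio = len(R_a & R_b) / len(R_a | R_b)

print(phi, ratio)   # phi ≈ 0.50, ratio = 0.25 ≈ phi**2
```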
Multiple min-hash functions
• We use k independent min-hash functions and define an equivalence relation "≡":
  – For two items a and b, a ≡ b ⟺ a and b have the same min-hash values for all k hash functions.
  – If, with one min-hash function, Pr(a ≡ b) = x, then with k independent functions Pr(a ≡ b) = x^k ≤ x.
  – We repeat the whole process t times.
Cont.
• The probability that a and b belong to the same equivalence class in at least one of the t trials is

  1 - (1 - x^k)^t
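To get a feel for how this probability behaves, here is a small evaluation with assumed parameter values (the particular x, k, and t below are illustrative, not taken from the paper's experiments):

```python
def detect_prob(x, k, t):
    """Probability that a pair with per-function collision probability x is caught
    in at least one of t rounds, each using k independent min-hash functions."""
    return 1 - (1 - x ** k) ** t

print(detect_prob(0.5, k=3, t=20))   # ~0.93: a strongly overlapping pair is almost surely caught
print(detect_prob(0.1, k=3, t=20))   # ~0.02: a weakly overlapping pair is rarely caught
```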
Cont.
• Ex2. We show how candidates are generated after we obtain the min-hash values.
  – In round 1, v1 of item 3 is equal to v1 of item 17. Hence (3, 17) is put in the candidate set.
  – In round 2, no two vectors are equal.
  – In round 3, v3 of item 9 is equal to v3 of item 17. Hence (9, 17) is put in the candidate set.
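A minimal sketch of this candidate-generation step, assuming each item carries one tuple of k min-hash values per round; the helper names and the toy signatures below are mine, shaped to mimic Ex2 rather than taken from its table.

```python
from collections import defaultdict
from itertools import combinations

def candidate_pairs(signatures):
    """signatures[item][round] is the tuple of k min-hash values of `item` in that round."""
    n_rounds = len(next(iter(signatures.values())))
    candidates = set()
    for rnd in range(n_rounds):
        buckets = defaultdict(list)        # group items whose vectors are identical in this round
        for item, vectors in signatures.items():
            buckets[vectors[rnd]].append(item)
        for items in buckets.values():
            candidates.update(combinations(sorted(items), 2))
    return candidates

# Toy signatures over 3 rounds: items 3 and 17 collide in round 1, items 9 and 17 in round 3.
sigs = {
    3:  [(5, 2), (7, 1), (4, 9)],
    9:  [(6, 3), (8, 2), (1, 6)],
    17: [(5, 2), (9, 4), (1, 6)],
}
print(candidate_pairs(sigs))   # {(3, 17), (9, 17)} -- note that (3, 9) is never a candidate
```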
Algorithm
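As a rough companion to this section, here is a minimal end-to-end sketch of the two-phase procedure described in these slides: min-hash based candidate generation followed by exact correlation computation. The function name, parameter names, and the random-permutation construction of the hash functions are my own assumptions, not the paper's pseudocode.

```python
import random
from collections import defaultdict
from itertools import combinations

def find_correlated_pairs(rows_of, m, theta, k=3, t=20, seed=0):
    """rows_of[item] = set of row ids containing the item; m = total number of rows."""
    rng = random.Random(seed)
    candidates = set()
    for _ in range(t):                                   # t independent rounds
        # k independent min-hash functions, each built from a random permutation of the rows
        perms = [{r: i for i, r in enumerate(rng.sample(range(m), m))} for _ in range(k)]
        buckets = defaultdict(list)
        for item, rows in rows_of.items():
            sig = tuple(min(p[r] for r in rows) for p in perms)
            buckets[sig].append(item)                    # identical signatures share a bucket
        for bucket in buckets.values():
            candidates.update(combinations(sorted(bucket), 2))

    result = []
    for a, b in candidates:                              # second pass: exact correlation
        sp_a, sp_b = len(rows_of[a]) / m, len(rows_of[b]) / m
        sp_ab = len(rows_of[a] & rows_of[b]) / m
        corr = (sp_ab - sp_a * sp_b) / (sp_a * sp_b * (1 - sp_a) * (1 - sp_b)) ** 0.5
        if corr >= theta:
            result.append((a, b, corr))
    return result
```

Here k controls how selective each round is and t controls how many rounds are run; by the formula above, a pair whose per-function collision probability is x is caught in the candidate phase with probability 1 − (1 − x^k)^t.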
Experiment results
Conclusion
• Back to Ex2: (3, 9) is not in the candidate set. Does this agree with my substitution?