Finding Near-Duplicate Web Pages:
A Large-Scale Evaluation of Algorithms
By Monika Henzinger
http://labs.google.com/people/monika/
Presented By
Harish Rayapudi
Shiva Prasad Malladi
Overview
• Introduction
• Broder’s Algorithm
• Charikar’s Algorithm
• Comparing the algorithms
• Combined algorithm
• Conclusion
Duplicate web pages
• Require more space to store the index
• Slow down performance

How to identify duplicate pages?
• Comparing all pairs of pages requires O(n^2) comparisons
• The indexed web contains about 13.92 billion pages
Experimental Data
• 1.6B pages from a real Google crawl
• 25%-30% of identical pages were removed before the authors received the data
  – unknown exactly how many pages were identical
• Of the remainder:
  – Broder’s Algorithm (Alg. B) found 1.7% near-duplicates
  – Charikar’s Algorithm (Alg. C) found 2.2% near-duplicates
Algorithms
• Broder’s and Charikar’s algorithms had not previously been evaluated against each other
• Both are used by successful web search engines
• The algorithms are compared on:
  1. Precision on a random subset
  2. The distribution of the number of term differences per near-duplicate pair
  3. The distribution of the number of near-duplicates per page
• The algorithms were evaluated on a set of 1.6B unique pages
Sample HTML Page
<html>
<body bgcolor="cream">
<H3>Harish Rayapudi Website</H3>
<H4><a href="http://www.google.com" target="_blank">Google</a></H4>
<H4>I am a Computer Science graduate student</H4>
</body>
</html>
Remove HTML & Formatting Info
Harish Rayapudi Website
http://www.google.com Google
I am a Computer Science graduate student

Remove "." and "/" from URLs
Harish Rayapudi Website
http www google com Google
I am a Computer Science graduate student
Tokens in the Page
Harish1 Rayapudi2 Website3
http4 www5 google6 com7 Google9 I10
am11 a12 Computer13 Science14
graduate15 student16
• We'll only look at the first 7 tokens
• The token sequence for this page P1 is {1,2,3,4,5,6,7}
• This token sequence is used by both algorithms
Tokens in a Similar Page
Harish1 Rayapudi2 Website3
http4 www5 yahoo8 com7 Yahoo17 I10
am11 a12 Computer13 Science14
graduate15 student16
• We'll only look at the first 7 tokens
• The token sequence for this page P2 is {1,2,3,4,5,8,7}
• This token sequence is used by both algorithms
Preprocessing step contd.
• Let n be the length of the token sequence; for pages P1 and P2, n = 7
• Each subsequence of k tokens is fingerprinted, resulting in n-k+1 shingles
• For k = 2, the shingles for page P1 {1,2,3,4,5,6,7} and page P2 {1,2,3,4,5,8,7} are:
  P1: {12, 23, 34, 45, 56, 67}
  P2: {12, 23, 34, 45, 58, 87}
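A minimal sketch of this shingling step in Python (function and variable names are illustrative, not from the paper; a real implementation would additionally fingerprint each shingle, e.g. with Rabin fingerprints):

def shingles(token_ids, k=2):
    # Each run of k consecutive token IDs forms one shingle,
    # giving n - k + 1 shingles for a sequence of length n.
    return [tuple(token_ids[i:i + k]) for i in range(len(token_ids) - k + 1)]

p1 = [1, 2, 3, 4, 5, 6, 7]
p2 = [1, 2, 3, 4, 5, 8, 7]
print(shingles(p1))  # [(1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (6, 7)]
print(shingles(p2))  # [(1, 2), (2, 3), (3, 4), (4, 5), (5, 8), (8, 7)]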
Broder’s Algorithm
• Shingles are fingerprinted with m different fingerprinting functions
• For m = 4 we have fingerprinting functions F1, F2, F3 and F4
• For P1, the result of applying the m functions:
Shingle:  12  23  34  45  56  67
F1:        4   7   1   8   1   6
F2:        7   4   2   5   8   3
F3:        5   8   3   9   7   4
F4:        9   5   6   7   8   5
• The smallest value of each function is taken, and an m-dimensional vector of min-values is stored for each page
• The 4-dimensional vector for page P1 is {1,2,3,5}
• For P2, the result of applying m functions:
Shingle:  12  23  34  45  58  87
F1:        4   7   1   8   9   5
F2:        7   4   2   5   7   3
F3:        5   8   3   9   3   6
F4:        9   5   6   7   1   4
• The 4-dimensional vector for page P2 is {1,2,3,1}
• The m-dimensional vector is reduced to an m'-dimensional vector of supershingles; m' is chosen such that m is divisible by m'
• Since m = 4, we take m' = 2
• Splitting the min-value vectors into non-overlapping pairs, P1 {1,2,3,5} gives {12, 35} and P2 {1,2,3,1} gives {12, 31}
• Each pair is fingerprinted to produce the supershingle vector: for P1, SS{12, 35} = {x, y}; for P2, SS{12, 31} = {x, z}
• The B-similarity of two pages is the number of identical entries in their supershingle vectors
• The B-similarity of pages P1 and P2 is 1 (common entry x)
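A minimal sketch of the min-value and supershingle computation in Python; the salted hash below is only an illustrative stand-in for the paper's fingerprinting functions, and all names are assumptions:

import hashlib

def fingerprint(value, salt):
    # Stand-in fingerprinting function: hash the value together with a per-function salt.
    digest = hashlib.md5(f"{salt}:{value}".encode()).digest()
    return int.from_bytes(digest[:8], "big")

def min_values(shingles, m=4):
    # m-dimensional vector of min-values, one entry per fingerprinting function.
    return [min(fingerprint(s, i) for s in shingles) for i in range(m)]

def supershingles(minvals, m_prime=2):
    # Fingerprint non-overlapping groups of min-values into m' supershingles.
    group = len(minvals) // m_prime
    return [fingerprint(tuple(minvals[i:i + group]), "ss")
            for i in range(0, len(minvals), group)]

def b_similarity(ss1, ss2):
    # Number of positions on which the two supershingle vectors agree.
    return sum(a == b for a, b in zip(ss1, ss2))

p1 = [(1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (6, 7)]
p2 = [(1, 2), (2, 3), (3, 4), (4, 5), (5, 8), (8, 7)]
# Result is 0, 1 or 2 depending on the pseudo-random fingerprints.
print(b_similarity(supershingles(min_values(p1)), supershingles(min_values(p2))))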
Experimental results of Broder’s Algorithm
• The algorithm generated 6 supershingles per page, 10.1B supershingles in total
  (for pages P1 and P2 we had 2 supershingles each: {x, y} and {x, z})
• For each pair of pages with an identical supershingle, the B-similarity is determined
  (for pages P1 and P2 the B-similarity was 1)
B-similarity graph
• Every page is a node in the graph
• There is an edge between two nodes if and only if the pair is B-similar
• The label of an edge is the B-similarity of the pair
• A node is considered a near-duplicate page if and only if it is incident to at least one edge
  (example: nodes P1 and P2 joined by an edge labeled 1)
• The average degree of the B-similarity graph is about 135
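A minimal sketch of constructing the B-similarity graph from precomputed supershingle vectors, following the description above; indexing pages by supershingle avoids comparing all pairs (the names and the edge rule are taken from this slide, not restated from the paper):

from collections import defaultdict
from itertools import combinations

def b_similarity_graph(supershingle_vectors):
    # Index pages by supershingle so only candidate pairs are compared.
    index = defaultdict(set)
    for page, vector in supershingle_vectors.items():
        for ss in vector:
            index[ss].add(page)
    edges = {}
    for pages in index.values():
        for p, q in combinations(sorted(pages), 2):
            if (p, q) not in edges:
                # Edge label = B-similarity (number of agreeing entries).
                edges[(p, q)] = sum(a == b for a, b in zip(supershingle_vectors[p],
                                                           supershingle_vectors[q]))
    # A page is a near-duplicate iff it is incident to at least one edge.
    near_duplicates = {page for pair in edges for page in pair}
    return edges, near_duplicates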
• A random sample of 96,556 B-similar pairs was taken; after sub-sampling, 1,910 pairs were chosen for evaluation
• The overall precision is 0.38
• The precision for pairs on the same site is 0.34, while for pairs on different sites it is 0.84
Table taken from the paper
Correctness of a near-duplicate pair
A pair was judged to be a correct near-duplicate when:
• the text differs only by a URL, session id, timestamp, or visitor count
• the difference is invisible to visitors
• the difference is a combination of the above items
• the pages are entry pages to the same site

• URL-only differences account for 41% of the correct pairs
Table taken from the paper
• 92% of them are on the same site
• Almost half the cases are pairs that could not be evaluated
Table taken from the paper
> diff google yahoo
1,2c1,2
< Harish Rayapudi Google Website
< http www google com
---
> Harish Rayapudi Yahoo Website
> http www yahoo com
• Term differences are calculated by executing the Linux diff command
• The average term difference is 24, the median is 11
• The figure shows the distribution of term differences up to 200
Figure taken from the paper
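A minimal sketch of counting term differences between two pages with Python's difflib, an illustrative stand-in for the authors' diff-based measurement (the tokenization and counting rule here are assumptions):

import difflib

def term_difference(text1, text2):
    # Count the terms that differ between the two token sequences, diff-style.
    tokens1, tokens2 = text1.split(), text2.split()
    matcher = difflib.SequenceMatcher(a=tokens1, b=tokens2)
    diff = 0
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op != "equal":
            diff += max(i2 - i1, j2 - j1)
    return diff

google = "Harish Rayapudi Google Website http www google com"
yahoo = "Harish Rayapudi Yahoo Website http www yahoo com"
print(term_difference(google, yahoo))  # 2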
Charikar’s Algorithm
• Example: P1, P2 = documents, shingle size k = 3, b = 3
  P1 = 1 2 3 4 5 6 7
  P2 = 2 3 4 1 2 7 9
• P1 shingles, each assigned 3 random values drawn here from [−1, 1]:
Shingle  Projection
123      (-0.7,  0.2,  0.5)
234      ( 0.3,  0.8, -0.4)
345      (-0.1,  0.1,  0.9)
456      ( 0.5, -0.2, -0.4)
567      (-0.9, -0.7, -0.5)
Adding the columns gives (-0.9, 0.2, 0.1)
• P2 shingles and their projections:
Shingle  Projection
234      ( 0.3,  0.8, -0.4)
341      (-0.1,  0.1, -0.3)
412      (-0.3, -0.2,  0.9)
127      (-0.7,  0.2, -0.2)
279      ( 0.6,  0.4, -0.8)
Adding the columns gives (-0.2, 1.3, -0.8)
P1 vector = (-0.9, 0.2, 0.1)
P2 vector = (-0.2, 1.3, -0.8)
P1 final vector = (0, 1, 1)
P2 final vector = (0, 1, 0)
C-similarity(P1, P2) = 2
1. Each token is projected into b-dimensional space by randomly choosing b entries from {−1, 1}
2. This projection is the same for all pages
3. For each page, a b-dimensional vector is created by adding the projections of all the tokens in its token sequence
4. The final vector for the page is created by setting every positive entry in the vector to 1 and every non-positive entry to 0, resulting in a random projection for each page
5. The C-similarity of two pages is the number of bits their projections agree on
6. We chose b = 384 so that both algorithms store a bit string of 48 bytes per page
7. We define two pages to be C-similar iff the number of agreeing bits in their projections lies above a fixed threshold
8. We set a threshold t; here t = 372
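A minimal sketch of this random-projection scheme in Python; the hash-based ±1 projection is an illustrative stand-in for the paper's implementation, and the small example pages reuse the token IDs from the earlier slides:

import hashlib

B = 384  # projection dimension; the paper uses b = 384 (a 48-byte bit string per page)

def token_projection(token, b=B):
    # Deterministic pseudo-random choice of b entries from {-1, +1},
    # identical for a given token on every page.
    bits, counter = [], 0
    while len(bits) < b:
        digest = hashlib.sha256(f"{token}:{counter}".encode()).digest()
        for byte in digest:
            for k in range(8):
                bits.append(1 if (byte >> k) & 1 else -1)
        counter += 1
    return bits[:b]

def page_bitstring(tokens, b=B):
    # Sum the token projections, then map positive sums to 1 and non-positive to 0.
    sums = [0] * b
    for token in tokens:
        for i, v in enumerate(token_projection(token, b)):
            sums[i] += v
    return [1 if s > 0 else 0 for s in sums]

def c_similarity(bits1, bits2):
    # Number of bit positions on which the two projections agree.
    return sum(x == y for x, y in zip(bits1, bits2))

p1 = page_bitstring([1, 2, 3, 4, 5, 6, 7])
p2 = page_bitstring([2, 3, 4, 1, 2, 7, 9])
sim = c_similarity(p1, p2)
print(sim, "C-similar?", sim >= 372)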
Experimental results of Charikar’s algorithm
• The algorithm returns all pairs with C-similarity at least t as near-duplicate pairs
• Alg. C found 1630M near-duplicate pairs, of which about 50% (1630M * 0.5 = 815M) were correct at t = 372
• This precision is better than Alg. B's
• C-similarity graph: defined analogously to the B-similarity graph
Experimental results of Charikar’s algorithm
• URL-only differences account for 72% of the correct pairs
Interesting Website! http://www.businessline.co.uk/
Table taken from the paper
Experimental results of Charikar’s algorithm
• 95% of the undecided pairs are on the same site
Table taken from the paper
Comparisons of both the algorithms
Manual Evaluation:
Alg. C outperforms Alg. B, with a precision of 0.50 versus 0.38 for Alg. B.
Term Difference:
The results for term differences are quite similar, except for the larger number of pairs with term differences above 200 (19 vs. 90).
Correlation:
Of 96,556 B-similar pairs, only 45 had C-similarity at least t = 372.
Of 169,757 C-similar pairs, 4% were B-similar and 95% had B-similarity 0.
Comparisons of both the algorithms
Table taken from the paper
Comparisons of both the algorithms
Table taken from the paper
Combined Algorithm:
Need:
Both algorithms wrongly identify pairs as near-duplicates either
a) because a small difference in tokens causes a large semantic difference, or
b) because of unlucky random choices.
In this algorithm (a minimal sketch follows the steps below):
1) First compute all B-similar pairs.
2) Then filter out those pairs whose C-similarity falls below a certain threshold.
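A minimal sketch of this filtering step in Python, reusing the b_similarity and c_similarity helpers sketched earlier; the structure, names, and B-similarity cutoff are illustrative assumptions, and the C-similarity threshold of 350 comes from the next slide:

def combined_near_duplicates(pages, b_cutoff=1, c_threshold=350):
    # `pages` maps a page id to (supershingle_vector, projection_bitstring).
    # A production system would index pages by supershingle instead of
    # comparing all pairs; this nested loop is only for illustration.
    ids = sorted(pages)
    results = []
    for i in range(len(ids)):
        for j in range(i + 1, len(ids)):
            ss_i, bits_i = pages[ids[i]]
            ss_j, bits_j = pages[ids[j]]
            # Step 1: keep only B-similar pairs.
            if b_similarity(ss_i, ss_j) >= b_cutoff:
                # Step 2: filter out pairs whose C-similarity is below the threshold.
                if c_similarity(bits_i, bits_j) >= c_threshold:
                    results.append((ids[i], ids[j]))
    return results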
Combined Algorithm:
The C-similarity threshold is chosen to give higher precision.
Here we select threshold = 350.
Figure taken from the paper
Combined Algorithm:
• R is the number of correct near-duplicate pairs returned divided by the number of correct near-duplicate pairs returned by Alg. B
• The figure plots precision versus R on set S1 for all C-similarity thresholds between 0 and 384
Figure taken from the paper
Combined Algorithm:
• The resulting algorithm returns, on the testing set S2, 363 out of the 962 pairs as near-duplicates, with a precision of 0.79 and an R-value of 0.79.
• The table shows that 82% of the returned pairs are on the same site and that the precision improvement is mostly achieved for these pairs. At 0.74, the precision for same-site pairs is much better than for either of the individual algorithms.
Table taken from the paper
Conclusion
• The authors performed an evaluation of two near-duplicate algorithms on 1.6B web pages.
• Neither performed well on pages from the same site, but a combined algorithm did, without sacrificing much recall.
Discussion
• How can Alg. B be improved?
• Can we improve both algorithms to perform better on pairs from the same website?