
Clustering and Load Balancing Optimization for Redundant Content Removal

Shanzhong Zhu (Ask.com)
Alexandra Potapova, Maha Alabduljalil (University of California, Santa Barbara)
Xin Liu (Amazon.com)
Tao Yang (University of California, Santa Barbara)
Redundant Content Removal in Search Engines
• Over 1/3 of Web pages crawled are near duplicates
• When to remove near duplicates?
 Offline removal
   [Pipeline: Web pages → offline data processing → duplicate filtering → online index]
 Online removal with query-based duplicate removal
   [Pipeline: user query → online index matching & result ranking → duplicate removal → final results]
Tradeoff of online vs. offline removal
• Online-dominating approach
 Impact on offline: high precision, low recall; removes fewer duplicates
 Impact on online: more burden on online deduplication
 Impact on overall cost: higher serving cost
• Offline-dominating approach
 Impact on offline: high precision, high recall; removes most duplicates; higher offline burden
 Impact on online: less burden on online deduplication
 Impact on overall cost: lower serving cost
Challenges & issues in offline duplicate handling
• Achieve high recall with high precision
 All-to-all duplicate comparison for complex/deep pairwise analysis
 Expensive parallelism management & elimination of unnecessary computation
• Maintain duplicate groups instead of duplicate pairs (sketched below)
 Reduces storage requirements
 Aids winner selection for duplicate removal
 Continuous group updates are expensive; approximation and error handling are needed
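As a minimal sketch (with hypothetical names, not the Ask.com implementation), duplicate groups can be maintained with a union-find structure keyed by page ID, so each detected duplicate pair merges two groups instead of being stored individually:

```cpp
#include <cstdint>
#include <unordered_map>

// Union-find sketch for maintaining duplicate groups by page ID.
// Illustrative only: the real group structure would also track group
// signatures and the winner page, which are omitted here.
class DuplicateGroups {
 public:
  // Returns the group representative of a page, creating a
  // singleton group the first time the page is seen.
  uint64_t Find(uint64_t page) {
    auto it = parent_.find(page);
    if (it == parent_.end()) {
      parent_[page] = page;
      return page;
    }
    if (it->second != page) {
      it->second = Find(it->second);  // path compression
    }
    return it->second;
  }

  // Records that two pages are near duplicates by merging their groups.
  void Union(uint64_t a, uint64_t b) {
    uint64_t ra = Find(a);
    uint64_t rb = Find(b);
    if (ra != rb) parent_[ra] = rb;
  }

 private:
  std::unordered_map<uint64_t, uint64_t> parent_;
};
```

Storing one representative per group rather than all pairs is what cuts the storage requirement: a group of k duplicates needs k entries instead of O(k²) pairs.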
Optimization for faster offline duplicate handling
• Incremental duplicate clustering and group management
 Approximated transitive relationships
 Lazy updates
• Avoid unnecessary computation while balancing computation among machines
 Multi-dimensional partitioning
 Faster many-to-all duplicate comparisons
[Figure: pages mapped across multiple page partitions]
Two-tier Architecture for Incremental Duplicate Detection
Approximation in Incremental Duplicate Group Management
• Example of incremental group merging/splitting
• Approximation
 A group is unchanged when updated pages are still similar to the group signatures
 Group splitting does not re-validate all relations
• Error of transitive relations after content updates
 A↔B, B↔C ⇒ A↔C
 A↔C may no longer hold if the content of B is updated
• Error prevention during duplicate filtering:
 Double-check the similarity threshold between a winner and a loser (see the sketch below)
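Because approximated transitive grouping can leave stale relations behind after content updates, the filtering step re-validates each winner/loser pair before removal. A minimal sketch, assuming a hypothetical Page record, a Jaccard-style Similarity() stand-in, and threshold τ (none of these names come from the paper):

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <iterator>
#include <vector>

// Hypothetical page record; fields are illustrative.
struct Page {
  uint64_t id;
  std::vector<uint32_t> features;  // sorted signature terms, e.g. shingles
};

// Jaccard similarity over sorted feature vectors, standing in for
// whatever pairwise measure the offline pipeline actually uses.
double Similarity(const Page& a, const Page& b) {
  std::vector<uint32_t> common;
  std::set_intersection(a.features.begin(), a.features.end(),
                        b.features.begin(), b.features.end(),
                        std::back_inserter(common));
  std::size_t uni = a.features.size() + b.features.size() - common.size();
  return uni == 0 ? 0.0 : static_cast<double>(common.size()) / uni;
}

// Re-validate each winner/loser pair so that a stale transitive
// relation (e.g., A~C inferred through an updated B) does not cause
// a false removal.
std::vector<Page> ConfirmLosers(const Page& winner,
                                const std::vector<Page>& losers,
                                double tau) {
  std::vector<Page> confirmed;
  for (const Page& loser : losers) {
    if (Similarity(winner, loser) >= tau) {
      confirmed.push_back(loser);  // safe to filter out
    }
    // Otherwise the page is kept; any remaining duplication is
    // caught by the online removal stage.
  }
  return confirmed;
}
```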
Multi-dimensional page partitioning
• Objective
 One page is mapped to one unique partition
 Dissimilar pages are mapped to different partitions
 Reduce unnecessary cross-partition comparisons
• Partitioning based on document length
 Outperforms signature-based mapping with higher recall rates
• Multi-dimensional mapping
 Improves the load imbalance caused by a skewed length distribution
[Figure: histogram of the skewed document-length distribution]
Multi-dimensional page partitioning
• Example: the dictionary is split into sub-dictionaries, and a page's length is counted per sub-dictionary, so A = (600) in the 1D length space becomes A = (280, 320) in the 2D length space (as sketched below)
[Figure: 1D vs. 2D length-space histograms]
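A minimal sketch of this mapping, assuming the dictionary is split into d sub-dictionaries by hashing term IDs and that grid cells have a fixed interval width (both are illustrative assumptions, not necessarily the paper's exact scheme):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Compute a d-dimensional length vector for a page by splitting the
// term dictionary into d sub-dictionaries (here via a simple hash of
// the term ID) and counting the page's terms in each.
std::vector<uint32_t> LengthVector(const std::vector<uint32_t>& terms,
                                   uint32_t dims) {
  std::vector<uint32_t> len(dims, 0);
  for (uint32_t t : terms) {
    ++len[t % dims];  // sub-dictionary chosen by hashing the term ID
  }
  return len;
}

// Map a length vector to a partition cell: each dimension is cut into
// fixed-width intervals, so pages with similar per-sub-dictionary
// lengths land in the same or neighboring cells.
std::vector<uint32_t> PartitionCell(const std::vector<uint32_t>& len,
                                    uint32_t interval_width) {
  std::vector<uint32_t> cell(len.size());
  for (std::size_t i = 0; i < len.size(); ++i) {
    cell[i] = len[i] / interval_width;
  }
  return cell;
}
```

For the slide's example, a 600-term page whose terms fall 280 into one sub-dictionary and 320 into the other gets the 2D vector (280, 320); splitting one skewed 1D length distribution into several flatter ones is what evens out partition sizes.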
When does page A compare with page B?
• Page length vectors A = (A1, A2), B = (B1, B2)
• Page A needs to be compared with B only if every coordinate of A falls within the τ-based length-filtering window around the corresponding coordinate of B, enlarged by ρ (see the sketch below)
• τ is the similarity threshold
• ρ is a fixed interval enlarging factor
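A minimal sketch of this test, assuming the per-dimension bound τ·Bi − ρ ≤ Ai ≤ Bi/τ + ρ (an assumption based on the standard length filter; the paper's exact bound may differ):

```cpp
#include <cstddef>
#include <vector>

// Per-dimension comparison test: skip the pairwise comparison unless
// every coordinate of A lies inside B's length-filtering window.
// The bound used here, tau*B[i] - rho <= A[i] <= B[i]/tau + rho, is
// the standard length filter enlarged by rho, and is an assumption.
bool NeedsComparison(const std::vector<double>& A,
                     const std::vector<double>& B,
                     double tau,    // similarity threshold, e.g. 0.8
                     double rho) {  // fixed interval enlarging factor
  for (std::size_t i = 0; i < A.size(); ++i) {
    if (A[i] < tau * B[i] - rho || A[i] > B[i] / tau + rho) {
      return false;  // too far apart in this dimension: cannot be similar
    }
  }
  return true;
}
```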
Implementation and Evaluations
• Implemented in the Ask.com offline platform in C++ for processing billions of documents
• Impact on relevancy
 Continuously monitor top query results
 The error rate of false removal is tiny
• Impact on cost
 Compare two approaches
– A: Online-dominating
 Offline removes 5% of duplicates first
 Most duplicates are hosted on online tier-2 machines
– B: Offline-dominating
Cost Saving with Offline-Dominating Approach
• Fixed QPS target. Two-tier online index for 3–8 billion URLs.
[Chart: relative cost saving vs. index size in billions of URLs, for series T1.4 and T2.5]
• 8%–26% cost saving with the offline-dominating approach
 Fewer tier-2 machines, since fewer duplicates are hosted
 Online tier-1 machines can answer more queries
 Online messages carry fewer duplicates
Reduction of unnecessary inter-machine communication & comparison
• Up to 87% saving when using up to 64 machines
[Chart: fraction of communication and comparison saved vs. number of machines (up to 64), for similarity thresholds 0.5, 0.6, 0.7, 0.8, and 0.9]
Effectiveness of 3D mapping
• Load balance factor with up to 64 machines
[Chart: load balance factor vs. number of machines (2–64) for 1D, 2D, and 3D mapping]
• Speedup of processing throughput
[Chart: throughput speedup vs. number of machines for 1D and 3D mapping]
Benefits of incremental computation
• Ratio of non-incremental duplicate detection time over incremental time for a 100-million-page dataset: up to 24-fold speedup
[Chart: incremental vs. non-incremental speedup ratio vs. number of updated pages]
• During a crawling update, 30% of updated pages have signatures similar to group signatures
[Chart: percentage of updated pages with signatures similar to group signatures]
Accuracy of distributed clustering and duplicate group management
• Relative error in precision compared to a single-machine configuration
[Chart: relative precision error vs. dataset size for 12, 24, and 36 machines]
• Relative error in recall
[Chart: relative recall error vs. dataset size for 12, 24, and 36 machines]
Concluding remarks
• Budget-conscious solution with offline-dominating redundant content removal
 Up to 26% cost saving
• Approximated incremental scheme for duplicate clustering with error handling
 Up to 24-fold speedup
 Undetected duplicates are handled online
• 3D mapping reduces unnecessary comparisons (up to 87%) while balancing load (3+ fold improvement)