Distributed Web Crawling over DHTs
Boon Thau Loo, Owen Cooper, Sailesh Krishnamurthy
CS294-4
Search Today
[Diagram: the search pipeline — Crawl → Index → Search]
What's Wrong?
- Users have a limited search interface.
- Today's web is dynamic and growing:
  - 550 billion documents estimated in 2001 (BrightPlanet); Google indexes 3.3 billion.
  - Timely re-crawls are required, but not feasible for all web sites.
- Search engines control your search results:
  - They decide which sites get crawled, and which get updated more frequently.
  - They may censor or skew result rankings.
- Challenge: user-customizable searches that scale.
Our Solution: A Distributed Crawler
- P2P users donate excess bandwidth and computation resources to crawl the web.
- Organized using Distributed Hash Tables (DHTs).
- DHT- and query-processor-agnostic crawler:
  - Designed to work over any DHT.
  - Crawls can be expressed as declarative recursive queries, making user customization easy.
- Queries can be executed over PIER, a DHT-based relational P2P query processor.
- Crawlees: web servers. Crawlers: PIER nodes.
Potential
- Infrastructure for crawl personalization:
  - User-defined focused crawlers.
  - Collaborative crawling/filtering (special interest groups).
- Other possibilities:
  - Bigger, better, faster web crawler.
  - Enables new search and indexing technologies: P2P web search; web archival and storage (with OceanStore).
  - Generalized crawler for querying distributed graph structures: monitoring file-sharing networks (e.g. Gnutella); P2P network maintenance (routing information, OceanStore metadata).
Challenges that We Investigated
- Scalability and throughput:
  - DHT communication overheads.
  - Balancing network load on crawlers. Network load has two components: download bandwidth and DHT bandwidth.
- Network proximity: exploit the network locality of crawlers.
- Limiting download rates on web sites: prevents denial-of-service on the crawled servers.
- Main tradeoff: tension between coordination and communication.
  - Balance load either on crawlers or on crawlees!
  - Exploiting network proximity costs extra communication.
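The per-site rate limiting mentioned above can be sketched as a minimal per-host throttle. This is an illustrative sketch, not PIER's actual mechanism; the class name and the `min_delay` parameter are assumptions:

```python
import time
from collections import defaultdict

class HostThrottle:
    """Enforce a minimum delay between downloads from the same host.

    Hypothetical sketch of per-server rate throttling; not the
    system's real CrawlWrapper logic.
    """

    def __init__(self, min_delay=2.0):
        self.min_delay = min_delay          # seconds between hits per host
        self.last_hit = defaultdict(float)  # host -> time of last download

    def ready(self, host, now=None):
        """Return True (and record the hit) if `host` may be crawled now."""
        now = time.monotonic() if now is None else now
        if now - self.last_hit[host] >= self.min_delay:
            self.last_hit[host] = now
            return True
        return False
```

A crawler thread would call `ready(host)` before downloading and re-queue the URL when it returns False, which is what makes a single per-host "control point" valuable.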
Crawl as a Recursive Query
[Dataflow diagram] Seed URLs and a DHT scan of WebPage(url) feed input URLs into the CrawlWrapper, which rate-throttles and reorders them. Each crawler thread downloads a page, extracts its outgoing links, and passes them through redirect filters and duplicate elimination; the results are published as WebPage(url) and Link(sourceUrl, destUrl) tuples, and each Link.destUrl recursively becomes new crawl input.
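The recursive dataflow above can be sketched as a single-node loop. This is a hypothetical sketch, not the distributed PIER query: `fetch` stands in for the Downloader and Extractor stages, and all names are assumptions:

```python
def crawl(seed_urls, fetch, max_pages=100):
    """Single-node sketch of the recursive crawl dataflow:
    input URLs -> download/extract -> dup-elim -> feed back as input.

    `fetch(url)` returns (content, outlink_urls); it is supplied by
    the caller and stands in for the Downloader + Extractor stages.
    """
    seen = set(seed_urls)        # duplicate elimination
    frontier = list(seed_urls)   # CrawlWrapper input queue
    webpage, link = [], []       # published WebPage / Link tuples
    while frontier and len(webpage) < max_pages:
        url = frontier.pop(0)
        _, outlinks = fetch(url)
        webpage.append(url)                # publish WebPage(url)
        for dest in outlinks:
            link.append((url, dest))       # publish Link(sourceUrl, destUrl)
            if dest not in seen:           # DupElim
                seen.add(dest)
                frontier.append(dest)      # recursion: Link.destUrl -> input
    return webpage, link
```

In the real system the frontier, dup-elim, and published tuples are all DHT-partitioned relations rather than local data structures.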
Crawl Distribution Strategies
- Partition by URL:
  - Ensures an even distribution of crawler workload.
  - Incurs high DHT communication traffic.
- Partition by hostname:
  - One crawler per hostname.
  - Creates a "control point" for per-server rate throttling.
  - May lead to uneven crawler load distribution.
  - Single point of failure: a "bad" choice of crawler limits per-site crawl throughput.
  - Slight variation: X crawlers per hostname.
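The two partitioning schemes differ only in which key is hashed onto the DHT. A minimal sketch (assuming a dense 0..n-1 node numbering for illustration; real DHTs hash onto a ring of node IDs):

```python
import hashlib
from urllib.parse import urlparse

def dht_node(key, num_nodes):
    """Map a string key onto one of `num_nodes` nodes via a stable hash."""
    digest = hashlib.sha1(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_nodes

def assign(url, num_nodes, scheme="url"):
    """Choose the crawler responsible for `url` under either scheme."""
    if scheme == "hostname":
        key = urlparse(url).hostname or ""  # one crawler per host
    else:
        key = url                           # spread every URL independently
    return dht_node(key, num_nodes)
```

Hashing the hostname sends every URL of a site to one node (enabling rate throttling, risking imbalance); hashing the full URL spreads load evenly but forces far more cross-node DHT traffic.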
Redirection
- A simple technique that lets a crawler redirect (pass on) its assigned work to another crawler, and so on.
- A second-chance distribution mechanism, orthogonal to the partitioning scheme.
- Example: under partition by hostname, the node responsible for www.google.com dispatches work (by URL) to other nodes, combining the load-balancing benefits of partition-by-URL with the control benefits of partition-by-hostname.
- When? Policy-based: crawler load (queue size) or network proximity.
- Why not? Redirection costs increased DHT control traffic; hence, limit the number of redirections per URL.
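The queue-size policy with a capped redirect budget can be sketched as follows. All names, the threshold, and the owner-lookup shortcut are assumptions, not the system's actual values:

```python
import random
from urllib.parse import urlparse

def route(url, nodes, queue_len, max_redirects=1, threshold=50, pick=None):
    """Sketch of policy-based redirection.

    Start at the node owning the URL's hostname; while that node's queue
    exceeds `threshold` and redirect budget remains, pass the URL on to
    another node chosen by `pick` (e.g. random, or nearest by ping).
    """
    pick = pick or (lambda: random.choice(nodes))
    node = nodes[hash(urlparse(url).hostname) % len(nodes)]  # hostname owner
    hops = 0
    while queue_len[node] > threshold and hops < max_redirects:
        node = pick()                  # second-chance placement
        hops += 1
    queue_len[node] += 1               # enqueue the URL at `node`
    return node, hops
```

Capping `max_redirects` is what bounds the extra DHT control traffic each URL can generate.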
Experiments
- Deployment:
  - Web crawler over PIER and the Bamboo DHT, on up to 80 PlanetLab nodes.
  - 3 crawl threads per crawler; 15-minute crawl duration.
- Distribution (partition) schemes:
  - URL.
  - Hostname.
  - Hostname with 8 crawlers per unique host.
  - Hostname with one level of redirection on overload.
- Crawl workloads:
  - Exhaustive crawl: seed URL http://www.google.com; 78,244 different web servers.
  - Crawl of a fixed number of sites: seed URL http://www.google.com; 45 web servers within google.
  - Crawl of a single site: http://groups.google.com.
Crawl of Multiple Sites I
[Figure: CDF of per-crawler downloads, 80 nodes] Partition by hostname shows severe imbalance (70% of crawlers idle); schemes that keep more crawlers busy do better.
[Figure: crawl throughput scaleup] Hostname can exploit at most 45 crawlers (one per site); Redirect (hybrid hostname/URL) does best.
Crawl of Multiple Sites II
[Figure: per-URL DHT overheads] For Redirect, per-URL DHT overheads peak at around 70 nodes; redirection incurs higher overheads only after queue size exceeds a threshold. Hostname incurs low overheads since the crawl stays within google.com, which has many self-links.
Network Proximity
- Sampled 5,100 crawl targets and measured ping times from each of 80 PlanetLab hosts.
- Partition by hostname approximates random assignment.
- Best-of-3 random is "close enough" to best-of-5 random.
- Sanity check: what if a single host crawls all targets?
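The best-of-k random assignment compared here can be sketched directly; `ping(node, target)` is an assumed callback returning a measured latency:

```python
import random

def best_of_k(target, nodes, ping, k=3, rng=random):
    """Pick k random candidate crawlers and assign `target` to the one
    with the lowest measured ping. Sketch of the "best-of-3 random"
    scheme from the proximity experiment; names are assumptions.
    """
    candidates = rng.sample(nodes, min(k, len(nodes)))
    return min(candidates, key=lambda n: ping(n, target))
```

Sampling only k candidates keeps the probing cost per target constant, which is why best-of-3 being "close enough" to best-of-5 matters in practice.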
Summary of Schemes
Loadbalance
download
bandwidth
Loadbalance
DHT
bandwidth
Rate limit
Crawlees
Network
proximity
DHT
Communication
overheads
URL
+
+
-
-
-
Hostname
-
-
+
?
+
Redirect
+
?
+
+
--
Related Work
- Herodotus, at MIT (Chord-based):
  - Partition by URL.
  - Batching with ring-based forwarding.
  - Experimented on 4 local machines.
- Apoidea, at GaTech (Chord-based):
  - Partition by hostname.
  - Forwards crawl work to the DHT neighbor closest to the website.
  - Experimented on 12 local machines.
Conclusion
Our main contributions:
- A DHT- and query-processor-agnostic distributed crawler.
- Crawls expressed as queries, permitting user-customizable refinement.
- Important trade-offs discovered in distributed crawling: coordination comes with extra communication costs.
- Deployment and experimentation on PlanetLab:
  - Examined crawl distribution strategies under different workloads on live web sources.
  - Measured the potential benefits of network proximity.
Backup slides
Existing Crawlers
- Cluster-based crawlers:
  - Google: a centralized dispatcher sends URLs to be crawled.
  - Hash-based parallel crawlers.
- Focused crawlers:
  - BINGO!: crawls the web given a basic training set.
- Peer-to-peer:
  - Grub: a SETI@Home-style infrastructure; 23,993 members.
Exhaustive Crawl
- Partition by hostname shows imbalance: some crawlers are over-utilized for downloads.
- Little difference in throughput; most crawler threads are kept busy.
Single Site
- URL is best, followed by Redirect and Hostname.
Future Work
- Fault tolerance.
- Security.
- Single-node throughput.
- Work-sharing between crawl queries: essential for overlapping users.
- Global crawl prioritization: a requirement of personalized crawls; online relevance feedback; deep web retrieval.