Frontera: crawling the Spanish web

Frontera: open source, large scale web crawling framework
Alexander Sibiryakov, October 1, 2015
[email protected]
Hello, attendees!
• Born in Yekaterinburg, RU.
• 5 years at Yandex, search quality department: social and QA search, snippets.
• 2 years at Avast! antivirus, research team: automatic false-positive resolution, large-scale prediction of malicious download attempts.
Task
• Crawl the Spanish web to gather statistics about hosts and their sizes.
• Limit the crawl to the .es zone.
• Breadth-first strategy: first crawl documents one click away, then two clicks, and so on.
• Finishing condition: no hosts left with fewer than 100 crawled documents.
• Low cost.
Spanish internet (.es) in 2012
• Domain names registered: 1.56M (39% growth per year)
• Web servers in the zone: 283.4K (33.1%)
• Hosts: 4.2M (21%)
• Spanish web sites in the DMOZ catalog: 22,043
* Source: OECD Communications Outlook 2013 report
Solution
• Scrapy* - network operations.
• Apache Kafka - data bus (offsets, partitioning).
• Apache HBase - storage (random access, linear scanning, scalability).
• Twisted.Internet - library providing the async primitives used in the workers.
• Snappy - efficient compression algorithm for IO-bound applications.
* Network operations in Scrapy are implemented asynchronously, on top of the same Twisted.Internet.
Architecture
[Architecture diagram: a Kafka topic serves as the data bus between the spiders, the crawling strategy workers (SW) and the storage workers (DB), which write to HBase.]
1. The big and small hosts problem
• When the crawler encounters a huge number of links from a single host and a simple prioritization model is used, the queue ends up flooded with URLs from that one host.
• That causes underuse of spider resources.
• We adopted an additional per-host (optionally per-IP) queue and a metering algorithm: URLs from big hosts are cached in memory (see the sketch after this slide).
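A minimal Python sketch of the per-host metering idea; the class, method names and limits are illustrative, not the actual Frontera queue code:

from collections import defaultdict, deque

class PerHostMeteredQueue(object):
    def __init__(self, per_host_limit=100):
        self.per_host_limit = per_host_limit
        self.overflow = defaultdict(deque)      # host -> URLs held back in memory

    def next_batch(self, candidates, batch_size):
        # candidates: iterable of (host, url) pairs, best-scored first
        taken = defaultdict(int)
        batch = []
        # Drain previously cached overflow first, still respecting the per-host cap.
        for host, urls in list(self.overflow.items()):
            while urls and taken[host] < self.per_host_limit and len(batch) < batch_size:
                batch.append(urls.popleft())
                taken[host] += 1
        for host, url in candidates:
            if len(batch) < batch_size and taken[host] < self.per_host_limit:
                batch.append(url)
                taken[host] += 1
            else:
                self.overflow[host].append(url)  # big host or full batch: keep in memory
        return batch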
3. DDoS'ing the Amazon AWS DNS service
• The breadth-first strategy means previously unknown hosts are visited first, which generates a huge amount of DNS requests.
• Solution: a recursive DNS server on each downloading node, with upstreams set to Verizon and OpenDNS.
• We used dnsmasq.
4. Tuning the Scrapy thread pool for efficient DNS resolution
• Scrapy uses a thread pool to resolve DNS names to IPs.
• When an IP is absent from the cache, the request is sent to the DNS server in its own thread, which is blocking.
• Scrapy reported numerous errors related to DNS name resolution and timeouts.
• We added options to Scrapy for adjusting the thread pool size and the timeout (see the settings sketch after this slide).
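In later Scrapy releases these knobs are exposed as settings; a minimal settings sketch, assuming the DNS-related settings of those releases (names and defaults may differ in Scrapy 0.24):

# settings.py
# More reactor threads allow more blocking DNS lookups to run in parallel.
REACTOR_THREADPOOL_MAXSIZE = 20

# Fail a DNS lookup after this many seconds instead of hanging the thread.
DNS_TIMEOUT = 60

# Keep resolved names in Scrapy's in-memory DNS cache.
DNSCACHE_ENABLED = True
DNSCACHE_SIZE = 10000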
5. Overloaded HBase region servers during the state check
• The crawler extracts hundreds of links per document on average.
• Before adding these links to the queue, they need to be checked against already-crawled URLs (to avoid repeated visits).
• On small volumes SSDs were just fine; after the table grew we had to move to HDDs, and response times grew dramatically.
• Solution: a host-local fingerprint function for the keys in HBase (see the sketch after this slide).
• Tuning the HBase block cache so that the average host's states fit into one block.
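A hedged sketch of what a host-local fingerprint can look like; the prefix length and hash choice are assumptions, not the exact Frontera implementation:

import hashlib
from urllib.parse import urlparse    # on Python 2.7: from urlparse import urlparse

def host_local_fingerprint(url):
    # Row keys sharing the same host prefix are stored next to each other in
    # HBase, so the state check for all links of one page touches one region
    # and a few blocks instead of being scattered across the whole table.
    host = urlparse(url).netloc
    host_part = hashlib.sha1(host.encode("utf-8")).digest()[:4]   # groups rows by host
    url_part = hashlib.sha1(url.encode("utf-8")).digest()         # keeps keys unique
    return host_part + url_part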
6. Intensive network traffic from workers to services
• We observed up to 1 Gbit/s of traffic between the workers, Kafka and HBase.
• Switched to the Thrift compact protocol for HBase communication.
• Enabled message compression in Kafka using Snappy (see the sketch after this slide).
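A sketch of both changes using today's Python clients; it assumes the modern happybase and kafka-python APIs, while the 2015 code used older clients with different call signatures:

import happybase
from kafka import KafkaProducer

# HBase over the Thrift compact protocol instead of the default binary one,
# which shrinks the on-the-wire size of requests and responses.
hbase = happybase.Connection(
    "hbase-thrift.example.internal",                    # hypothetical host
    protocol="compact",
    transport="framed",
)

# Kafka producer compressing every message batch with Snappy.
producer = KafkaProducer(
    bootstrap_servers="kafka.example.internal:9092",    # hypothetical broker
    compression_type="snappy",
)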
7. Further query and traffic optimizations to HBase
• The state check accounted for the lion's share of requests and network throughput.
• Consistency was another requirement.
• We created a local state cache in the strategy worker.
• For consistency, the spider log was partitioned by host, so the caches of different workers do not overlap.
State cache
• All operations are batched:
• if a key is absent from the cache, it is requested from HBase,
• every ~4K documents the cache is flushed to HBase.
• On reaching 3M elements (~1 GB), a flush and cleanup happens.
• A Least-Recently-Used (LRU) eviction policy seems a good fit here (see the sketch after this slide).
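A minimal sketch of such a cache with batched HBase I/O and LRU cleanup; the backend methods get_states()/put_states() and the limits are illustrative, not Frontera's actual code:

from collections import OrderedDict

class StateCache(object):
    def __init__(self, backend, flush_every=4000, max_size=3000000):
        self.backend = backend          # hypothetical object exposing get_states()/put_states()
        self.cache = OrderedDict()      # fingerprint -> state, least recently used first
        self.flush_every = flush_every
        self.max_size = max_size
        self.seen_since_flush = 0

    def fetch(self, fingerprints):
        # Batch-request from HBase only the keys that are absent from the cache.
        missing = [fp for fp in fingerprints if fp not in self.cache]
        if missing:
            self.cache.update(self.backend.get_states(missing))

    def update(self, fingerprint, state):
        self.cache.pop(fingerprint, None)   # re-insert so the key becomes most recent
        self.cache[fingerprint] = state
        self.seen_since_flush += 1
        if self.seen_since_flush >= self.flush_every:
            self.flush()

    def flush(self):
        self.backend.put_states(self.cache)
        self.seen_since_flush = 0
        # On reaching the size limit, evict the least recently used entries.
        while len(self.cache) > self.max_size:
            self.cache.popitem(last=False)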
Spider priority queue (slot)
• Each cell holds an array of: fingerprint, Crc32(hostname), URL, score (see the sketch after this slide).
• Dequeueing takes the top N.
• Such a design is vulnerable to huge hosts.
• This can be partially mitigated with a scoring model that takes the known document count per host into account.
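An illustrative sketch of the cell layout and the top-N dequeue; the real queue lives in HBase, and this in-memory version only mirrors the fields listed above:

import heapq
import zlib
from collections import namedtuple
from urllib.parse import urlparse    # on Python 2.7: from urlparse import urlparse

QueueCell = namedtuple("QueueCell", ["fingerprint", "host_crc32", "url", "score"])

def make_cell(fingerprint, url, score):
    host = urlparse(url).netloc
    return QueueCell(fingerprint, zlib.crc32(host.encode("utf-8")), url, score)

def dequeue_top_n(cells, n):
    # Highest score first; with a naive scoring model one huge host can
    # easily occupy all N slots of the batch.
    return heapq.nlargest(n, cells, key=lambda c: c.score)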
8. The big and small hosts problem (strikes back!)
• During the crawl we found a few very large hosts (>20M docs).
• All queue partitions got flooded with pages from these few huge hosts, because of the queue design and the scoring model used.
• We made two MapReduce jobs (a sketch of the second one follows this slide):
• queue shuffling,
• limiting every host to no more than 100 documents.
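A plain-Python, map/reduce-style sketch of the host-capping job; in production this ran as a Hadoop MapReduce job over the queue, and the record layout here is an assumption:

from itertools import groupby, islice
from operator import itemgetter

def mapper(records):
    # record: (fingerprint, host, url, score) -> re-key the queue by host
    for fingerprint, host, url, score in records:
        yield host, (score, fingerprint, url)

def reducer(mapped, per_host_limit=100):
    # For every host keep only the top-scored documents, dropping the rest.
    keyfunc = itemgetter(0)
    for host, group in groupby(sorted(mapped, key=keyfunc), key=keyfunc):
        best = sorted((value for _, value in group), reverse=True)
        for score, fingerprint, url in islice(best, per_host_limit):
            yield host, fingerprint, url, score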
Hardware requirements
• A single-threaded Scrapy spider gives about 1,200 pages/min from roughly 100 websites crawled in parallel.
• The spiders-to-workers ratio is 4:1 (without storing content).
• 1 GB of RAM for every SW (state cache, tunable).
• Example (the arithmetic is spelled out after this slide):
• 12 spiders ~ 14.4K pages/min,
• 3 SW and 3 DB workers,
• 18 cores in total.
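The example above as back-of-the-envelope arithmetic; the constants are the assumptions stated on this slide, not measured limits:

# Sizing sketch for the example configuration.
PAGES_PER_SPIDER_PER_MIN = 1200
SPIDERS_PER_SW = 4

spiders = 12
throughput = spiders * PAGES_PER_SPIDER_PER_MIN        # 14,400 pages/min
strategy_workers = spiders // SPIDERS_PER_SW           # 3 SW
db_workers = strategy_workers                          # 3 DB workers
total_cores = spiders + strategy_workers + db_workers  # 18 cores
sw_ram_gb = strategy_workers * 1                       # 1 GB state cache per SW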
Software requirements
• Apache HBase,
• Apache Kafka,
• Python 2.7+,
• Scrapy 0.24+,
• DNS service.
CDH (100% open source Hadoop package)
Maintaining Cloudera Hadoop on Amazon EC2
• CDH is very sensitive to free space on the root partition, where parcels and Cloudera Manager's storage live.
• We moved them to a separate EBS volume using symbolic links.
• The EBS volume should be at least 30 GB; baseline IOPS are enough.
• Initial hardware: 3 x m3.xlarge (4 CPU, 15 GB RAM, 2 x 40 GB SSD).
• After one week of crawling we ran out of space and started moving DataNodes to d2.xlarge (4 CPU, 30.5 GB RAM, 3 x 2 TB HDD).
Spanish (.es) internet crawl results
• fnac.es, rakuten.es, adidas.es, equiposdefutbol2014.es, druni.es and docentesconeducacion.es are the biggest websites,
• 68.7K domains found (~600K expected),
• 46.5M pages crawled overall,
• 1.5 months of crawling,
• 22 websites with more than 50M pages.
Where are the rest of the web servers?!
Bow-tie model
A. Broder et al., "Graph structure in the Web", Computer Networks 33 (2000), 309-320.
Y. Hirate, S. Kato, and H. Yamana, "Web Structure in 2005".
Meusel, Vigna, "Graph Structure in the Web - Revisited", WWW 2014.
Main features
• Online operation: scheduling of new batches, updating of DB state.
• Storage abstraction: write your own backend (SQLAlchemy and HBase backends are included).
• Canonical URL resolution abstraction: each document has many URLs; which one should be used?
• Scrapy ecosystem: good documentation, big community, ease of customization.
Main features
• Communication layer is Apache Kafka: topic partitioning, offsets mechanism.
• Crawling strategy abstraction: the crawling goal, URL ordering and scoring model are coded in a separate module (see the sketch after this slide).
• Polite by design: each website is downloaded by at most one spider.
• Python: workers, spiders.
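A hedged sketch of the idea behind the crawling strategy abstraction; the class and method names below are illustrative only and are not the exact distributed-frontera API:

# Hypothetical breadth-first strategy module: depth-based scoring plus a
# per-host document cap, kept apart from networking and storage code.
class BreadthFirstStrategy(object):
    def __init__(self, max_docs_per_host=100):
        self.max_docs_per_host = max_docs_per_host
        self.crawled_per_host = {}

    def score(self, url, depth, host):
        # Shallower documents first; stop scheduling hosts that hit the cap.
        if self.crawled_per_host.get(host, 0) >= self.max_docs_per_host:
            return None                      # None means: do not schedule
        return 1.0 / (depth + 1)

    def page_crawled(self, host):
        self.crawled_per_host[host] = self.crawled_per_host.get(host, 0) + 1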
References
• Distributed Frontera: https://github.com/scrapinghub/distributed-frontera
• Frontera: https://github.com/scrapinghub/frontera
• Documentation:
• http://distributed-frontera.readthedocs.org/
• http://frontera.readthedocs.org/
Future plans
• A lighter version, without HBase and Kafka, communicating over sockets.
• A revisiting strategy out of the box.
• A watchdog solution: tracking website content changes.
• A PageRank or HITS strategy.
• Our own HTML and URL parsers.
• Integration into Scrapinghub services.
• Testing on larger volumes.
Contribute!
• Distributed Frontera is historically the first attempt to implement a web-scale web crawler in Python.
• A truly resource-intensive task: CPU, network, disks.
• Made in Scrapinghub, the company where Scrapy was created.
• There are plans to make it an Apache Software Foundation project.
We’re hiring!
http://scrapinghub.com/jobs/
Thank you!
Alexander Sibiryakov,
[email protected]