Overview on Web Crawlers
International Journal of Engineering Technology Science and Research (IJETSR), www.ijetsr.com, ISSN 2394-3386, Volume 2, Issue 4, April 2015
Divya, Akshay Sawhney, Anand Ratna
Department of Computer Science and Engineering
Galgotias College of Engineering and Technology, Gr Noida, India
Abstract—The size of the WWW is increasing rapidly and its nature is dynamic, so building an efficient search mechanism is necessary. A vast number of pages are continually being added every day, so fetching information about a specific topic is gaining importance, and this poses exceptional scaling challenges for general-purpose crawlers and search engines. There are many challenges, including systems concerns such as managing very large data structures. This paper presents a basic survey of the science and practice of web crawling. The survey outlines the fundamental challenges and describes the state-of-the-art models and solutions.
Keywords— Web crawler, Depth First
Traversal, Seed URL, World Wide Web,
Frontier
I. INTRODUCTION
Nowadays the Internet holds a wide expanse of information, and finding relevant information requires an efficient mechanism. Search engines provide that mechanism. A web search engine is a software system designed to search for information on the World Wide Web. The search results are generally presented in a list of results, often referred to as search engine results pages (SERPs). The information may be a mix of web pages, images, and other types of files.
There are two types of search engines:
Crawler Based Search Engines – Crawler based search engines create their listings automatically: computer programs ("spiders") build them, not human editors. They are not organized by subject categories; a computer algorithm ranks all pages. Such search engines are huge and often retrieve a lot of information; for complex searches they allow searching within the results of a previous search and let the user refine the results. These search engines contain the full text of the web pages they link to, so one can find pages by matching words in the pages one wants.
Human Powered Directories – These are built by human selection, i.e. they depend on humans to create listings. They are organized into subject categories, and pages are classified by subject. Human powered directories never contain the full text of the web pages they link to, and they are smaller than most search engines.
The crawler is an important module of a web search engine. It is a program that downloads and stores web pages, often for a web search engine, and its quality directly affects the searching quality of the engine. The two main objectives are fast gathering and efficient gathering (as many useful web pages as possible, together with the links interconnecting them). The role of crawlers is to collect web content. Web crawlers are an essential component of search engines, yet running a web crawler is a challenging task: there are tricky performance and reliability issues and, even more importantly, there are social issues. Crawling is the most fragile application, since it involves interacting with hundreds of thousands of web servers and various name servers, all of which are beyond the control of the system. Web crawling speed is governed not only by the speed of one's own Internet connection, but also by the speed of the sites that are to be crawled. Especially when crawling sites on multiple servers, the total crawling time can be significantly reduced if many downloads are done in parallel.
Search Engine – Type
Google – Crawler-based search engine
AllTheWeb – Crawler-based search engine
Teoma – Crawler-based search engine
Inktomi – Crawler-based search engine
AltaVista – Crawler-based search engine
LookSmart – Human-powered directory
Open Directory – Human-powered directory
Yahoo – Human-powered directory; also provides crawler-based search results powered by Google
MSN Search – Human-powered directory powered by LookSmart; also provides crawler-based search results powered by Inktomi
AOL Search – Provides crawler-based search results powered by Google
AskJeeves – Provides crawler-based search results powered by Teoma
HotBot – Provides crawler-based search results powered by AllTheWeb, Google, Inktomi and Teoma ("4-in-1" search engine)
Lycos – Provides crawler-based search results powered by AllTheWeb
Netscape Search – Provides crawler-based search results powered by Google
Table 1: Summary of the different types of the major search engines
II. RELATED WORK
The first generation of crawlers [8], on which the majority of web search engines are based, relies heavily on traditional graph algorithms, such as breadth-first or depth-first traversal, for indexing the web. Document content is paid little heed, as the final goal of the crawl is to cover the whole web.
Topic oriented crawlers try to focus the crawling process on pages pertinent to the topic or query. They keep the overall number of web pages downloaded for processing [9][10] to a minimum, while maximizing the percentage of pages that are pertinent. Their performance mainly depends on the selection of starting URLs or starting pages, also known as seed pages or seed URLs. In general, users provide a set of seed pages as input to a crawler or, otherwise, seed pages are selected from among the top answers returned by a web search engine [11] when the topic is used as the query [12]. High-quality seed pages are either pages pertinent to the topic or pages from which pertinent pages can be reached within a small number of hops.
A focused crawler competently seeks out
documents about a specific topic and guides the
search based on both the content and link
structure of the web [2]. A focused crawler
develops a strategy that associates a score with
each link in the pages it has downloaded [9, 10,
13]. Those links are sorted according to the
scores and inserted in a queue. This strategy
ensures that the focused crawler preferentially
pursues promising crawl paths.
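As an illustration of this link-scoring strategy, the sketch below maintains a frontier ordered by score using a priority queue. The class and its interface are hypothetical, not taken from [9, 10, 13], and the scoring of links is left to the caller.

# Illustrative score-ordered frontier for a focused crawler (not from the paper).
import heapq

class PriorityFrontier:
    def __init__(self):
        self._heap = []        # entries are (-score, url); heapq is a min-heap
        self._enqueued = set() # URLs already queued, to avoid duplicates

    def add(self, url, score):
        # Insert a link with its score, skipping URLs already queued.
        if url not in self._enqueued:
            self._enqueued.add(url)
            heapq.heappush(self._heap, (-score, url))

    def next_url(self):
        # Return the most promising URL, or None if the frontier is empty.
        if not self._heap:
            return None
        neg_score, url = heapq.heappop(self._heap)
        return url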
A focused crawler is a computer program used for finding information associated with some specific topic on the WWW. The main goal of focused crawling is that the crawler selects and retrieves pertinent pages only and does not need to gather all web pages. As the crawler is only a computer program, it cannot predict in advance how pertinent a web page is [14]. In an attempt to find pages of a specific type or on a specific topic, focused crawlers aspire to recognize links that are likely to lead to target documents and to pass up links that lead off topic.
The Fish Search algorithm and the Shark Search algorithm were used previously for crawling with the topic keywords mentioned in a query. In the Fish Search algorithm, the system is driven by the query: it starts from a set of seed pages and takes only those pages whose content is similar to the given query, together with their related pages. The Fish Search technique allocates binary priority values (0 for not relevant, 1 for relevant) to candidate pages for downloading using simple keyword matching; hence, all pertinent pages are assigned the same priority value. Shark Search is an improved version of the Fish Search algorithm in which a child link inherits a fraction of the score of its parent link, and this score is combined with a value based on the anchor text that occurs around the link in the web page [15]. The Shark Search method proposes using the Vector Space Model (VSM) for allocating non-binary priority values to candidate pages.
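The score-inheritance idea can be sketched as follows. This is only illustrative: the decay factor, the weighting and the crude anchor-text similarity are assumptions, not the exact formulas of [15].

# Illustrative Shark-Search-style scoring (simplified; parameters are assumed).
DECAY = 0.5   # fraction of the parent score inherited by a child link (assumed)
BETA = 0.7    # weight of the inherited score versus the anchor-text score (assumed)

def anchor_similarity(anchor_text, query_terms):
    # Crude stand-in for a Vector Space Model similarity: the fraction of
    # query terms that appear in the anchor text.
    words = set(anchor_text.lower().split())
    if not query_terms:
        return 0.0
    return sum(1 for t in query_terms if t.lower() in words) / len(query_terms)

def child_score(parent_score, anchor_text, query_terms):
    # Combine the inherited score with the anchor-text score.
    inherited = DECAY * parent_score
    anchor = anchor_similarity(anchor_text, query_terms)
    return BETA * inherited + (1 - BETA) * anchor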
III. LITERATURE REVIEW
Web crawlers are almost as old as the web itself.
In the spring of 1993, just months after the
release of NCSA Mosaic, Matthew Gray [5]
wrote the first web crawler, the World Wide
Web Wanderer, which was used from 1993 to
1996 to compile statistics about the growth of
the web. A year later, David Eichmann [4] wrote
the first research paper containing a short
description of a web crawler, the RBSE spider.
Burner provided the first detailed description of
the architecture of a web crawler, namely the
original Internet Archive crawler [2]. Brin and
Page's seminal paper on the (early) architecture
of the Google search engine contained a brief
description of the Google crawler, which used a
distributed system of page-fetching processes
and a central database for coordinating the
crawl. Heydon and Najork described Mercator
[7,8], a distributed and extensible web crawler
that was to become the blueprint for a number of
other crawlers. Other distributed crawling systems described in the literature include PolyBot [9], UbiCrawler [1], C-proc [3] and Dominos [6].
Conceptually, the algorithm executed by
a web crawler is extremely simple: select a URL
from a set of candidates, download the
associated web pages, extract the URLs
(hyperlinks) contained therein, and add those
URLs that have not been encountered before to
the candidate set. Indeed, it is quite possible to
implement a simple functioning web crawler in a
few lines of a high-level scripting language such
as Perl.
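As a concrete (and deliberately naive) illustration of this loop, the following Python sketch ignores politeness, parallelism and error handling; the seed URL is a placeholder.

# A minimal, illustrative crawler loop: no politeness, error recovery,
# or duplicate-content detection. The seed URL is a placeholder.
import re
import urllib.request
from urllib.parse import urljoin

def crawl(seed, limit=20):
    frontier = [seed]          # candidate URLs, crawled in FIFO order
    seen = {seed}              # URLs already encountered
    while frontier and len(seen) < limit:
        url = frontier.pop(0)  # select a URL from the candidate set
        try:
            page = urllib.request.urlopen(url, timeout=5).read().decode("utf-8", "ignore")
        except Exception:
            continue           # skip pages that fail to download
        # Crude hyperlink extraction; a real crawler would use an HTML parser.
        for link in re.findall(r'href=["\'](.*?)["\']', page, re.IGNORECASE):
            absolute = urljoin(url, link)
            if absolute not in seen:   # add only URLs not encountered before
                seen.add(absolute)
                frontier.append(absolute)
    return seen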
IV. TYPES OF WEB CRAWLER
Focused Web Crawler: A focused crawler is a web crawler that tries to download pages that are related to each other [17]. It collects documents that are specific and relevant to the given topic [11][14]. It is also known as a topic crawler because of the way it works [15]. The focused crawler determines two things: relevancy and the way forward, i.e. how relevant a given page is to the particular topic and how to proceed from it. The benefit of the focused web crawler is that it is economically feasible in terms of hardware and network resources and can reduce the amount of network traffic and downloads [13]. The search exposure of the focused web crawler is also large [10][12].
Incremental Crawler: An incremental crawler incrementally refreshes the existing collection of pages by visiting them frequently, based upon an estimate of how often pages change [17]. It also replaces less important pages with new and more important pages, and it resolves the problem of the freshness of the pages. The benefit of the incremental crawler is that only valuable data is provided to the user, so network bandwidth is saved and data enrichment is achieved [18][23].
Distributed Crawler: Distributed web crawling is a distributed computing technique in which many crawlers work together in the web crawling process in order to achieve the widest possible coverage of the web. A central server manages the communication and synchronization of the nodes, which are geographically distributed [10]. It basically uses the PageRank algorithm for increased efficiency and quality of search. The benefit of the distributed web crawler is that it is robust against system crashes and other events, and can be adapted to various crawling applications [19].
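A common way to assign URLs to the nodes of a distributed crawler is to hash the host name, so that all pages of a site are handled by the same node. The following sketch is illustrative and is not taken from [19].

# Illustrative host-hash assignment of URLs to crawler nodes.
import hashlib
from urllib.parse import urlparse

def assign_node(url, num_nodes):
    # Hash the host name so every URL of a given site maps to the same node.
    host = urlparse(url).hostname or ""
    digest = hashlib.md5(host.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_nodes

# Example: with 4 nodes, all pages of example.com go to the same node index.
print(assign_node("http://example.com/page1.html", 4))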
Parallel Crawler: Multiple crawlers are often run in parallel; these are referred to as parallel crawlers [20]. A parallel crawler consists of multiple crawling processes [20], called C-procs, which can run on a network of workstations [21]. Parallel crawlers are concerned with page freshness and page selection [16]. A parallel crawler can run on a local network or be distributed at geographically distant locations [10]. Parallelization of the crawling system is vital for downloading documents in a reasonable amount of time [21].
Breadth First Crawler: A breadth-first crawler starts out with a small set of pages and then explores other pages by following links in a breadth-first [26] fashion. In practice, web pages are not traversed strictly in breadth-first fashion but may use a variety of policies; for example, the crawler may crawl the most important pages first.
Form Focused Web Crawler: To deal with the sparse distribution of forms on the Web, the form crawler [27] avoids crawling through unproductive paths by limiting the search to a particular topic, learning features of links and paths that lead to pages containing searchable forms, and employing appropriate stopping criteria.
Hidden Web Crawler: A lot of data on the web actually resides in databases and can only be retrieved by posting appropriate queries or by filling out forms on the web. Recently, interest has focused on access to this kind of data, called the "deep web" or "hidden web" [24]. Present-day crawlers crawl only the publicly indexable web (PIW), i.e. the set of pages reachable by following hyperlinks, ignoring search pages and forms that require authorization or prior registration. In reality they may thus ignore a huge amount of high-quality data that is hidden behind search forms.
Batch Crawler: A batch crawler crawls a snapshot of its crawl space until reaching a certain size or time limit. The crawl order does not contain repeated occurrences of any page, but the entire crawling process is periodically halted and restarted as a way to obtain a more recent snapshot of previously crawled pages.
V. FEATURES OF WEB CRAWLER
Robustness: ability to handle spider-traps
(cycles, dynamic web pages)
Politeness: policies about the frequency of robot
visits
Distribution: crawling should be distributed across several machines
Scalability: crawling should be extensible by adding machines, extending bandwidth, etc.
Efficiency: clever use of the processor, memory and bandwidth
Quality: should detect the most useful pages, to
be indexed first
Freshness: should continuously crawl the web
(visiting frequency of a page should be close to
its modification frequency)
Extensibility: should support new data formats
(e.g. XML-based formats), new protocols (e.g.
ftp)
VI. ARCHITECTURE AND OPERATIONS
Fig1: Architecture of crawler
Description of various parts:
URL frontier - The crawler starts with a list of URLs to visit, called the seeds. As the crawler visits these URLs, it identifies all the hyperlinks in each page and adds them to the list of URLs still to visit, called the crawl frontier.
1. Begin with known "seed" URLs.
2. Fetch and parse them.
3. Extract the URLs they point to.
4. Place the extracted URLs on a queue.
5. Fetch each URL on the queue and repeat.
DNS resolution module - domain name service resolution: look up the IP address of the domain name in order to fetch the page identified by a URL.
Fetching module - generally uses the HTTP protocol to fetch the URL; the page is then parsed. Text, embedded objects (images, videos, etc.) and links are extracted.
Parsing module - extracts the text and the links from the downloaded page.
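The three modules can be sketched in Python as follows; this is only illustrative (real crawlers cache DNS lookups, reuse connections and handle many content types).

# Illustrative versions of the DNS resolution, fetching and parsing modules.
import socket
import urllib.request
from html.parser import HTMLParser
from urllib.parse import urlparse, urljoin

def resolve(url):
    # DNS resolution module: look up the IP address of the URL's host name.
    host = urlparse(url).hostname
    return socket.gethostbyname(host)

def fetch(url):
    # Fetching module: download the page over HTTP.
    with urllib.request.urlopen(url, timeout=5) as response:
        return response.read().decode("utf-8", errors="ignore")

class PageParser(HTMLParser):
    # Parsing module: extract the text and the outgoing links of a page.
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.text, self.links = [], []
    def handle_data(self, data):
        if data.strip():
            self.text.append(data.strip())
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))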

VII. POLICY OF CRAWLER
Selection policy
Search engines cover only a fraction of the Internet. This requires downloading the most relevant pages first; hence a good selection policy is very important.
Common selection policies:
Restricting followed links
Path-ascending crawling
Focused crawling
Crawling the deep Web
Re-Visit Policy
The web is dynamic and crawling takes a long time, so cost factors play an important role in crawling. Freshness and age are the commonly used cost functions. The objective of the crawler is a high average freshness and a low average age of web pages.
Two re-visit policies are used:
Uniform policy: all pages are re-visited with the same frequency.
Proportional policy: pages that change more often are re-visited more often.
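Following the incremental-crawling literature (e.g. [17][25]), the freshness and age of a page p in the local collection at time t are usually defined as:

F_p(t) = 1 if the local copy of p is up-to-date at time t, and 0 otherwise.
A_p(t) = 0 if the local copy of p is up-to-date at time t, and t minus the time of the last modification of p otherwise.

The crawler then tries to keep the average freshness high and the average age low over the whole collection.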
Politeness Policy
Crawlers can have a crippling impact on the
overall performance of a site.
The costs of using Web crawlers include:
Network resources
Server overload
Server/router crashes
Network and server disruption
A partial solution to these problems is the robots
exclusion protocol.
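For illustration, the robots exclusion protocol can be honoured with the Python standard library as follows; the site URL and user-agent string are placeholders.

# Illustrative politeness check using robots.txt; URLs are placeholders.
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("http://www.example.com/robots.txt")
rp.read()                                    # download and parse robots.txt

if rp.can_fetch("MyCrawler/1.0", "http://www.example.com/some/page.html"):
    print("allowed to fetch")                # proceed with the download
else:
    print("disallowed by robots.txt")        # skip the page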
Parallelization Policy
The crawler runs multiple processes in parallel.
The goals are:
To maximize the download rate.
To minimize the overhead from parallelization.
To avoid repeated downloads of the same page.
The crawling system also requires a policy for assigning the new URLs discovered during the crawling process.
Duplicate elimination - detecting URLs and contents that have been processed a short time ago.
Content seen test - check whether a web page with the same content has already been seen at another URL; this requires a way to compute a fingerprint of a web page.
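A simple illustrative content-seen test keeps a set of fingerprints of the pages already processed; production systems usually use shingling or similar near-duplicate detection instead of exact hashes.

# Illustrative content-seen test based on an exact content hash.
import hashlib

seen_fingerprints = set()

def content_seen(page_bytes):
    # Fingerprint the page; identical content found at different URLs collides here.
    fp = hashlib.sha1(page_bytes).hexdigest()
    if fp in seen_fingerprints:
        return True          # this content has already been processed
    seen_fingerprints.add(fp)
    return False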
Fig2: Basic Operation of Crawler
VIII. STRATEGIES OF CRAWLER
Breadth-First Traversal: Given any graph and a set of seeds at which to start, the graph can be traversed using the following algorithm:
1. Put all the given seeds into the queue;
2. Prepare to keep a list of "visited" nodes (initially empty);
3. As long as the queue is not empty:
a. Remove the first node from the queue;
b. Append that node to the list of "visited" nodes;
c. For each edge starting at that node:
i. If the node at the end of the edge already appears on the list of "visited" nodes or is already in the queue, then do nothing more with that edge;
ii. Otherwise, append the node at the end of the edge to the end of the queue.
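A direct Python transcription of this procedure, with the link graph represented as an adjacency dictionary, might look as follows (illustrative only).

# Breadth-first traversal of a link graph given a set of seed nodes.
from collections import deque

def bfs(graph, seeds):
    # graph: dict mapping a node to the list of nodes it links to
    queue = deque(seeds)          # 1. put all the seeds into the queue
    visited = []                  # 2. list of visited nodes, initially empty
    queued = set(seeds)           # nodes that have ever been queued
    while queue:                  # 3. as long as the queue is not empty
        node = queue.popleft()    # 3a. remove the first node from the queue
        visited.append(node)      # 3b. append it to the visited list
        for neighbour in graph.get(node, []):   # 3c. for each outgoing edge
            if neighbour not in queued:         # 3c-i. skip nodes seen before
                queued.add(neighbour)
                queue.append(neighbour)         # 3c-ii. otherwise enqueue it
    return visited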
Fig3: Breadth First Traversal
Depth First Traversal: Use the depth-first search (DFS) algorithm:
1. Get the first unvisited link from the start page;
2. Visit the link and get its first unvisited link;
3. Repeat step 2 until there are no unvisited links left;
4. Go back to the next unvisited link at the previous level and repeat from step 2.
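An equivalent recursive sketch on the same adjacency-dictionary representation is shown below (illustrative only).

# Depth-first traversal of a link graph starting from a given page.
def dfs(graph, start, visited=None):
    if visited is None:
        visited = []
    visited.append(start)                     # visit the current page
    for neighbour in graph.get(start, []):    # follow its first unvisited link ...
        if neighbour not in visited:
            dfs(graph, neighbour, visited)    # ... as deep as possible, then backtrack
    return visited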
Fig4: Depth First Traversal
IX. CONCLUSIONS
This paper has explained the crawler and its architecture, operation, types and strategies. In this work, different research papers on search engines and web crawlers have been studied, and on that basis a general web crawler architecture has been described.
REFERENCES
[1] Boldi P., Codenotti B., Santini M., and Vigna S. UbiCrawler: a scalable fully distributed web crawler. Software Pract. Exper., 34(8):711-726, 2004.
[2] Burner M. Crawling towards eternity: building an archive of the World Wide Web. Web Tech. Mag., 2(5):37-40, 1997.
[3] Cho J. and Garcia-Molina H. Parallel crawlers. In Proc. 11th Int. World Wide Web Conference, 2002, pp. 124-135.
[4] Eichmann D. The RBSE Spider - Balancing effective search against web load. In Proc. 3rd Int. World Wide Web Conference, 1994.
[5] Gray M. Internet Growth and Statistics: Credits and background. http://www.mit.edu/people/mkgray/net/background.html
[6] Hafri Y. and Djeraba C. High performance crawling system. In Proc. 6th ACM SIGMM Int. Workshop on Multimedia Information Retrieval, 2004, pp. 299-306.
[7] Heydon A. and Najork M. Mercator: a scalable, extensible web crawler. World Wide Web, 2(4):219-229, December 1999.
[8] Najork M. and Heydon A. High-performance web crawling. Compaq SRC Research Report 173, September 2001.
[9] Shkapenyuk V. and Suel T. Design and implementation of a high-performance distributed web crawler. In Proc. 18th Int. Conf. on Data Engineering, 2002, pp. 357-368.
[10] Pavalam S. M., S. V. Kasmir Raja, Jawahar M., and Felix K. Akorli, "Web Crawler in Mobile Systems", International Journal of Machine Learning and Computing, Vol. 2, No. 4, August 2012.
[11] Md. Abu Kausar, V. S. Dhaka, Sanjeev Kumar Singh, "Web Crawler: A Review", International Journal of Computer Applications (0975-8887), Volume 63, No. 2, February 2013.
[12] Manas Kanti Dey, Debakar Shamanta, Hasan Md Suhag Chowdhury, Khandakar Entenam Unayes Ahmed, "Focused Web Crawling: A Framework for Crawling of Country Based Financial Data", Information and Financial Engineering (ICIFE), IEEE Conference Publications, 2010.
[13] Debashis Hati, Biswajit Sahoo, Amritesh Kumar, "Adaptive Focused Crawling Based on Link Analysis", 2nd International Conference on Education Technology and Computer (ICETC), 2010.
[14] Shashi Shekhar, Rohit Agrawal and Karm Veer Arya, "An Architectural Framework of a Crawler for Retrieving Highly Relevant Web Documents by Filtering Replicated Web Collections", 2010 International Conference on Advances in Computer Engineering, IEEE Conference Publications, 2010.
[15] Gautam Pant, Padmini Srinivasan, "Learning to Crawl: Comparing Classification Schemes", ACM Transactions on Information Systems, Vol. 23, No. 4, October 2005, pp. 430-462.
[16] Ah Chung Tsoi, Daniele Forsali, Marco Gori, Markus Hagenbuchner, Franco Scarselli, "A Simple Focused Crawler", Proc. 12th International WWW Conference, 2003 (poster), pp. 1.
[17] Junghoo Cho and Hector Garcia-Molina, "The evolution of the web and implications for an incremental crawler", in Proceedings of the 26th International Conference on Very Large Databases, 2000.
[18] A. K. Sharma and Ashutosh Dixit, "Self Adjusting Refresh Time Based Architecture for Incremental Web Crawler", International Journal of Computer Science and Network Security, Vol. 8, No. 12, 2008, pp. 349-354.
[19] Vladislav Shkapenyuk, Torsten Suel, "Design and Implementation of a High-Performance Distributed Web Crawler", CIS Department, Polytechnic University, Brooklyn, NY 11201.
[20] Junghoo Cho, Hector Garcia-Molina, "Parallel Crawlers", WWW2002, May 7-11, 2002, Honolulu, Hawaii, USA, ACM 1-58113-449-5/02/0005.
[21] Shruti Sharma, A. K. Sharma, J. P. Gupta, "A Novel Architecture of a Parallel Web Crawler", International Journal of Computer Applications (0975-8887), Volume 14, No. 4, January 2011.
[22] Ntoulas, A., Cho, J., and Olston, C., "What's new on the Web? The evolution of the Web from a search engine perspective", WWW '04, pp. 1-12, 2004.
[23] Niraj Singhal, Ashutosh Dixit, and A. K. Sharma, "Design of a Priority Based Frequency Regulated Incremental Crawler", International Journal of Computer Applications (0975-8887), Vol. 1, No. 1, 2010, pp. 42-47.
[24] A. K. Sharma, J. P. Gupta, D. P. Agarwal, "Augmented Hypertext Documents suitable for parallel crawlers", communicated to 21st IASTED International Multi-conference Applied Informatics AI-2003, Feb 10-13, 2003, Austria.
[25] Junghoo Cho and Hector Garcia-Molina, "The evolution of the Web and implications for an incremental crawler", Proc. of VLDB Conf., 2000.
[26] Bing Liu, "Web Content Mining", the 14th International World Wide Web Conference (WWW 2005), Chiba, Japan.
[27] Paolo Boldi, Bruno Codenotti, Massimo Santini, Sebastiano Vigna, "UbiCrawler: A Scalable Fully Distributed Web Crawler".