International Journal of Engineering Technology Science and Research (IJETSR), ISSN 2394 – 3386, Volume 2, Issue 4, April 2015

Overview on Web Crawlers
Divya, Akshay Sawhney, Anand Ratna
Department of Computer Science and Engineering, Galgotias College of Engineering and Technology, Greater Noida, India

Abstract— The size of the WWW is increasing rapidly and its nature is dynamic, so building an efficient search mechanism is essential. A vast number of pages are added every day, and fetching information about a specific topic is gaining importance; this poses exceptional scaling challenges for general-purpose crawlers and search engines. The challenges range from systems concerns, such as managing very large data structures, to social concerns about the load a crawler places on web servers. This paper presents a basic survey on the science and practice of web crawling: it outlines the fundamental challenges and describes the state-of-the-art models and solutions.

Keywords— Web crawler, Depth First Traversal, Seed URL, World Wide Web, Frontier

I. INTRODUCTION
The Internet today holds a wide expanse of information, and finding relevant information requires an efficient mechanism. Search engines provide that mechanism. A web search engine is a software system designed to search for information on the World Wide Web. The results are generally presented as a list, often referred to as search engine results pages (SERPs), and may be a mix of web pages, images, and other types of files. There are two types of search engines:

Crawler-based search engines create their listings automatically: computer programs ("spiders") build them, not human editors. They are not organized by subject categories; a computer algorithm ranks all pages. Such search engines are huge and often retrieve a lot of information, and for complex searches they allow searching within the results of a previous search, so the user can refine the result set. They store the full text of the web pages they link to, so one can find pages by matching words in the pages one wants.

Human-powered directories are built by human selection, i.e. they depend on humans to create listings. They are organized into subject categories, and human editors classify pages into those subjects. Human-powered directories never contain the full text of the web pages they link to, and they are smaller than most search engines.

The crawler is an important module of a web search engine. It is a program that downloads and stores web pages, usually on behalf of a search engine, and its quality directly affects the searching quality of that engine. The two main objectives are fast and efficient gathering: collecting as many useful web pages as possible, together with the links interconnecting them. The role of crawlers is therefore to collect web content. Web crawlers are an essential component of search engines, yet running a web crawler is a challenging task: there are tricky performance and reliability issues and, even more importantly, social issues. Crawling is a fragile application, since it involves interacting with hundreds of thousands of web servers and various name servers, all of which are beyond the control of the system. Crawling speed is governed not only by the speed of one's own Internet connection, but also by the speed of the sites being crawled. In particular, when crawling pages from multiple servers, the total crawling time can be reduced significantly if many downloads are done in parallel.
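To make the parallel-download point concrete, the following is a minimal sketch in Python using only the standard library; it is not taken from any of the systems surveyed here, and the seed URLs, worker count and timeout are illustrative placeholders.

```python
# Minimal sketch: fetching several pages in parallel with a thread pool.
# The seed list, worker count and timeout are illustrative placeholders.
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

SEEDS = [
    "http://example.com/",
    "http://example.org/",
    "http://example.net/",
]

def fetch(url, timeout=10):
    """Download one page and return (url, body), or (url, None) on failure."""
    try:
        with urlopen(url, timeout=timeout) as resp:
            return url, resp.read()
    except OSError:
        return url, None

# Several downloads proceed concurrently, so total time is roughly that of
# the slowest single fetch rather than the sum of all fetches.
with ThreadPoolExecutor(max_workers=4) as pool:
    for url, body in pool.map(fetch, SEEDS):
        size = len(body) if body is not None else 0
        print(f"{url}: {size} bytes")
```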
Search engine           – Type
Google                  – Crawler-based search engine
AllTheWeb               – Crawler-based search engine
Teoma                   – Crawler-based search engine
Inktomi                 – Crawler-based search engine
AltaVista               – Crawler-based search engine
LookSmart               – Human-powered directory
Open Directory          – Human-powered directory
Yahoo                   – Human-powered directory; also provides crawler-based results powered by Google
MSN Search              – Human-powered directory powered by LookSmart; also provides crawler-based results powered by Inktomi
AOL Search              – Provides crawler-based results powered by Google
AskJeeves               – Provides crawler-based results powered by Teoma
HotBot                  – Provides crawler-based results powered by AllTheWeb, Google, Inktomi and Teoma ("4-in-1" search engine)
Lycos                   – Provides crawler-based results powered by AllTheWeb
Netscape Search         – Provides crawler-based results powered by Google
Table 1: Types of the major search engines

II. RELATED WORK
The first generation of crawlers [8], on which the majority of web search engines are based, relies heavily on traditional graph algorithms, such as breadth-first or depth-first traversal, to index the web. Document content is paid little heed, since the final goal of the crawl is to cover the whole web.

Topic-oriented crawlers try to focus the crawling process on pages pertinent to a topic or query. They keep the overall number of downloaded pages to be processed [9][10] to a minimum while maximizing the percentage of pages that are pertinent. Their performance depends mainly on the selection of starting URLs, also known as seed pages or seed URLs. In general, users provide a set of seed pages as input to the crawler or, alternatively, seed pages are selected from the top answers returned by a web search engine [11] when the topic is issued as a query [12]. High-quality seed pages are either pages pertinent to the topic or pages from which pertinent pages can be reached within a small number of hops.

A focused crawler efficiently seeks out documents about a specific topic and guides the search based on both the content and the link structure of the web [2]. It follows a strategy that associates a score with each link in the pages it has downloaded [9, 10, 13]; the links are sorted by score and inserted into a queue, which ensures that the crawler preferentially pursues promising crawl paths (a small sketch of such a scoring queue follows this section). A focused crawler is thus a program for finding information related to a specific topic on the WWW: it selects and retrieves only pertinent pages and does not need to gather all web pages. Since the crawler is only a computer program, it cannot know in advance how pertinent a web page really is [14]. In attempting to find pages of a specific type or on a specific topic, focused crawlers aim to recognize links that are likely to lead to target documents and to pass up links that lead off topic.
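As an illustration of the link-scoring queue described in the focused-crawler paragraph above, here is a small sketch of a best-first frontier in Python. The keyword-overlap score and the example topic keywords are assumptions made only for this example; a real focused crawler would use a more sophisticated relevance measure [9, 10, 13].

```python
# Sketch of a best-first frontier for a focused crawler: each discovered
# link gets a relevance score, and the highest-scoring URL is crawled next.
# The keyword-overlap scoring below is a placeholder for a real relevance model.
import heapq

TOPIC_KEYWORDS = {"crawler", "search", "index"}   # illustrative topic

class Frontier:
    def __init__(self):
        self._heap = []      # entries are (-score, url); heapq is a min-heap
        self._seen = set()   # URLs already queued or crawled

    def add(self, url, anchor_text):
        if url in self._seen:
            return
        self._seen.add(url)
        words = set(anchor_text.lower().split())
        score = len(words & TOPIC_KEYWORDS)        # crude relevance score
        heapq.heappush(self._heap, (-score, url))  # negate for max-first order

    def next_url(self):
        """Return the most promising URL, or None when the frontier is empty."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[1]

# Usage: score links as they are extracted, then crawl in best-first order.
frontier = Frontier()
frontier.add("http://example.com/a", "web crawler architecture")
frontier.add("http://example.com/b", "holiday photos")
print(frontier.next_url())   # the crawler-related link comes out first
```

In a fuller design the score would also fold in the relevance of the parent page, which is essentially what Shark Search does, as discussed next.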
The Fish Search and Shark Search algorithms were used in early topical crawling driven by the keywords of a query. Fish Search starts from a set of seed pages and follows only those pages whose content is similar to the given query, together with their related pages. It assigns binary priority values to candidate pages for downloading using simple keyword matching (0 for not relevant, 1 for relevant), so all pertinent pages receive the same priority. Shark Search is an improved version of Fish Search in which a child link inherits part of the score of its parent link, and this score is combined with a value based on the anchor text that appears around the link in the web page [15]. Shark Search uses the Vector Space Model (VSM) to assign non-binary priority values to candidate pages.

III. LITERATURE REVIEW
Web crawlers are almost as old as the web itself. In the spring of 1993, just months after the release of NCSA Mosaic, Matthew Gray [5] wrote the first web crawler, the World Wide Web Wanderer, which was used from 1993 to 1996 to compile statistics about the growth of the web. A year later, David Eichmann [4] wrote the first research paper containing a short description of a web crawler, the RBSE spider. Burner provided the first detailed description of the architecture of a web crawler, namely the original Internet Archive crawler [2]. Brin and Page's seminal paper on the early architecture of the Google search engine contained a brief description of the Google crawler, which used a distributed system of page-fetching processes and a central database for coordinating the crawl. Heydon and Najork described Mercator [7, 8], a distributed and extensible web crawler that became the blueprint for a number of other crawlers. Other distributed crawling systems described in the literature include PolyBot [9], UbiCrawler [1], C-proc [3] and Dominos [6].

Conceptually, the algorithm executed by a web crawler is extremely simple: select a URL from a set of candidates, download the associated web page, extract the URLs (hyperlinks) contained therein, and add those URLs that have not been encountered before to the candidate set. Indeed, it is quite possible to implement a simple functioning web crawler in a few lines of a high-level scripting language such as Perl.
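As a concrete rendering of that loop, the sketch below uses Python rather than Perl. The seed URL, the naive regular-expression link extraction and the page limit are assumptions added only so the example is self-contained and terminates; a production crawler would also need the politeness, robustness and freshness features discussed later.

```python
# A minimal crawl loop: pop a URL, fetch it, extract links, and queue the
# ones not seen before. Seed URL and page limit are placeholders.
import re
from collections import deque
from urllib.parse import urljoin
from urllib.request import urlopen

LINK_RE = re.compile(r'href="(http[^"]+)"')  # naive link extraction

def crawl(seed, max_pages=20):
    frontier = deque([seed])
    visited = set()
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        if url in visited:
            continue
        visited.add(url)
        try:
            with urlopen(url, timeout=10) as resp:
                html = resp.read().decode("utf-8", errors="replace")
        except OSError:
            continue                      # skip unreachable pages
        for link in LINK_RE.findall(html):
            absolute = urljoin(url, link)
            if absolute not in visited:
                frontier.append(absolute)
    return visited

if __name__ == "__main__":
    print(crawl("http://example.com/"))
```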
IV. TYPES OF WEB CRAWLER
Focused web crawler: a focused crawler tries to download pages that are related to one another [17]; it collects documents that are specific and relevant to a given topic [11][14] and, because of this way of working, is also known as a topic crawler [15]. The focused crawler determines two things: relevancy (how relevant a given page is to the topic) and the way forward (how to proceed from it). Its benefits are that it is economically feasible in terms of hardware and network resources and can reduce network traffic and downloads [13], while its search coverage remains large [10][12].

Incremental crawler: an incremental crawler refreshes the existing collection of pages by revisiting them frequently, based on an estimate of how often each page changes [17]. It also replaces less important pages with new, more important ones, and thereby addresses the freshness of the collection. Its benefit is that only valuable data is delivered to the user, so network bandwidth is saved and data enrichment is achieved [18][23].

Distributed crawler: distributed web crawling is a distributed computing technique in which many crawlers work together to obtain the widest possible coverage of the web. A central server manages the communication and synchronization of the geographically distributed nodes [10]. It typically uses the PageRank algorithm for increased efficiency and quality of search. The benefit of a distributed crawler is that it is robust against system crashes and other events and can be adapted to various crawling applications [19].

Parallel crawler: multiple crawlers are often run in parallel and are then referred to as a parallel crawler [20]. A parallel crawler consists of multiple crawling processes [20], called C-procs, which can run on a network of workstations [21]. Parallel crawlers depend on page freshness and page selection [16]. A parallel crawler can run on a local network or be distributed over geographically distant locations [10]. Parallelizing the crawling system is vital for downloading documents in a reasonable amount of time [21].

Breadth-first crawler: starts out with a small set of pages and then explores other pages by following links in breadth-first fashion [26]. In practice web pages are not traversed strictly breadth-first; a variety of policies may be used, for example crawling the most important pages first.

Form-focused web crawler: to deal with the sparse distribution of forms on the web, the form crawler [27] avoids unproductive paths by limiting the search to a particular topic, learning features of links and paths that lead to pages containing searchable forms, and employing appropriate stopping criteria.

Hidden web crawler: a lot of data on the web actually resides in databases and can only be retrieved by posting appropriate queries or by filling out forms; interest has recently focused on access to this "deep web" or "hidden web" [24] (a minimal form-query sketch follows at the end of this section). Current-day crawlers crawl only the publicly indexable web (PIW), i.e. the set of pages reachable by following hyperlinks, ignoring search pages and forms that require authorization or prior registration. In doing so they may miss a huge amount of high-quality data hidden behind search forms.

Batch crawler: crawls a snapshot of its crawl space until reaching a certain size or time limit. The crawl order does not contain repeated occurrences of any page; instead, the entire crawling process is periodically halted and restarted to obtain a more recent snapshot of previously crawled pages.
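As promised above, here is a minimal hidden-web sketch in Python: it retrieves a page that sits behind a search form by posting a query, which is exactly the kind of access that ordinary hyperlink-following crawlers miss. The form URL and field name are hypothetical; real deep-web crawling also involves discovering forms and choosing queries, which this sketch does not attempt.

```python
# Sketch: retrieving a "hidden web" page by submitting a query to a search
# form with an HTTP POST. The form URL and field name are hypothetical.
from urllib.parse import urlencode
from urllib.request import urlopen

def query_hidden_web(form_url, field, query):
    """POST a single query term to a search form and return the result page."""
    data = urlencode({field: query}).encode("ascii")
    with urlopen(form_url, data=data, timeout=10) as resp:  # data= makes it a POST
        return resp.read().decode("utf-8", errors="replace")

# Usage (hypothetical endpoint): pages behind the form are reachable only
# through such queries, not by following hyperlinks.
# html = query_hidden_web("http://example.com/search", "q", "web crawler")
```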
V. FEATURES OF A WEB CRAWLER
Robustness: the ability to handle spider traps (cycles, dynamically generated pages).
Politeness: policies governing the frequency of robot visits to a site.
Distribution: crawling should be distributable across several machines.
Scalability: crawling should scale by adding machines, extending bandwidth, etc.
Efficiency: clever use of processor, memory and bandwidth.
Quality: the crawler should detect the most useful pages, to be indexed first.
Freshness: the crawler should crawl the web continuously, so that the visiting frequency of a page is close to its modification frequency.
Extensibility: the crawler should support new data formats (e.g. XML-based formats) and new protocols (e.g. FTP).

VI. ARCHITECTURE AND OPERATIONS
Fig. 1: Architecture of a crawler

The various parts of the architecture are as follows.

URL frontier: the crawler starts with a list of URLs to visit, called the seeds. As it visits these URLs, it identifies the hyperlinks in each page and adds them to the list of URLs to visit, called the crawl frontier. The basic operation is:
1. Begin with known seed URLs.
2. Fetch and parse them.
3. Extract the URLs they point to.
4. Place the extracted URLs on a queue.
5. Fetch each URL on the queue and repeat.

DNS resolution: domain name service resolution looks up the IP address of the domain name from which a page, identified by its URL, is to be fetched.

Fetching module: generally uses the HTTP protocol to fetch the page at a URL.

Parsing module: parses the fetched page; the text (and embedded objects such as images and videos) and the links are extracted.

VII. CRAWLER POLICIES
Selection policy: search engines cover only a fraction of the Internet, so the relevant pages must be chosen for download; a good selection policy is therefore very important. Common selection policies include restricting followed links, path-ascending crawling, focused crawling, and crawling the deep web.

Re-visit policy: the web is dynamic and a crawl takes a long time, so cost factors play an important role. Freshness and age are the commonly used cost functions, and the objective of the crawler is high average freshness and low average age of web pages. The two common re-visit policies are the uniform policy and the proportional policy.

Politeness policy: crawlers can have a crippling impact on the overall performance of a site. The costs of being crawled include network resources, server overload, server or router crashes, and network and server disruption. A partial solution to these problems is the robots exclusion protocol.

Parallelization policy: the crawler runs multiple processes in parallel. The goals are to maximize the download rate, to minimize the overhead of parallelization, and to avoid repeated downloads of the same page. The crawling system therefore requires a policy for assigning the new URLs discovered during the crawl.

Duplicate elimination: detecting URLs and contents that have already been processed a short time ago.

Content-seen test: checking whether a web page with the same content has already been seen at another URL; this requires a way to compute a fingerprint of a web page (a minimal fingerprinting sketch follows after Fig. 2).

Fig. 2: Basic operation of a crawler
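The content-seen test needs a compact fingerprint of each page. Below is a minimal sketch that hashes the normalised page text with SHA-1; the normalisation rule is an assumption made for illustration, and production crawlers typically rely on checksums or shingling schemes that this sketch does not attempt.

```python
# Sketch of a "content seen" test: fingerprint the normalised page text and
# reject pages whose fingerprint has already been recorded under another URL.
import hashlib

seen_fingerprints = set()

def fingerprint(html_text):
    """Collapse whitespace and case, then hash, so trivial differences
    (spacing, capitalisation) do not hide duplicate content."""
    normalised = " ".join(html_text.lower().split())
    return hashlib.sha1(normalised.encode("utf-8")).hexdigest()

def content_already_seen(html_text):
    fp = fingerprint(html_text)
    if fp in seen_fingerprints:
        return True          # duplicate content at a different URL
    seen_fingerprints.add(fp)
    return False

# Usage: call content_already_seen(page) after fetching; skip indexing and
# link extraction when it returns True.
```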
VIII. CRAWLER STRATEGIES
Breadth-first traversal: given any graph and a set of seeds at which to start, the graph can be traversed using the following algorithm:
1. Put all the given seeds into the queue.
2. Prepare an (initially empty) list of "visited" nodes.
3. As long as the queue is not empty:
   a. Remove the first node from the queue.
   b. Append that node to the list of visited nodes.
   c. For each edge starting at that node:
      i.  If the node at the end of the edge already appears on the list of visited nodes, or is already in the queue, do nothing more with that edge.
      ii. Otherwise, append the node at the end of the edge to the end of the queue.

Fig. 3: Breadth-first traversal

Depth-first traversal: uses the depth-first search (DFS) algorithm (a sketch contrasting both traversals follows after Fig. 4):
1. Get the first unvisited link from the start page.
2. Visit that link and get its first unvisited link.
3. Repeat the previous step until no unvisited links remain.
4. Go back to the next unvisited link in the previous level and repeat from step 2.

Fig. 4: Depth-first traversal
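The two traversals differ only in how the frontier is managed: a FIFO queue gives breadth-first order, a LIFO stack gives depth-first order. The sketch below runs both over a small hypothetical link graph; unlike the pseudo-code above, it simply tolerates duplicate entries in the frontier instead of checking the queue before appending.

```python
# Breadth-first vs depth-first traversal over a hypothetical link graph.
# The graph maps each page to the pages it links to.
from collections import deque

GRAPH = {
    "A": ["B", "C"],
    "B": ["D"],
    "C": ["E"],
    "D": [],
    "E": ["A"],       # a cycle, handled by the visited list
}

def traverse(seed, depth_first=False):
    frontier = deque([seed])
    visited = []
    while frontier:
        # pop() gives LIFO (depth-first), popleft() gives FIFO (breadth-first)
        node = frontier.pop() if depth_first else frontier.popleft()
        if node in visited:
            continue
        visited.append(node)
        for neighbour in GRAPH[node]:
            if neighbour not in visited:
                frontier.append(neighbour)
    return visited

print(traverse("A"))                    # breadth-first: ['A', 'B', 'C', 'D', 'E']
print(traverse("A", depth_first=True))  # depth-first:   ['A', 'C', 'E', 'B', 'D']
```

The visited list doubles as the crawl order, so the two calls make the difference in page ordering directly visible.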
IX. CONCLUSIONS
This paper has explained the web crawler, its architecture, its operation, its types and its strategies. Several research papers on search engines and web crawlers were studied, and on that basis a general web crawler architecture was described.

REFERENCES
[1] Boldi P., Codenotti B., Santini M., and Vigna S., "UbiCrawler: a scalable fully distributed web crawler", Software: Practice and Experience, 34(8):711–726, 2004.
[2] Burner M., "Crawling towards eternity: building an archive of the World Wide Web", Web Techniques Magazine, 2(5):37–40, 1997.
[3] Cho J. and Garcia-Molina H., "Parallel crawlers", in Proc. 11th Int. World Wide Web Conference, 2002, pp. 124–135.
[4] Eichmann D., "The RBSE Spider – balancing effective search against web load", in Proc. 3rd Int. World Wide Web Conference, 1994.
[5] Gray M., "Internet Growth and Statistics: Credits and background", http://www.mit.edu/people/mkgray/net/background.html
[6] Hafri Y. and Djeraba C., "High performance crawling system", in Proc. 6th ACM SIGMM Int. Workshop on Multimedia Information Retrieval, 2004, pp. 299–306.
[7] Heydon A. and Najork M., "Mercator: a scalable, extensible web crawler", World Wide Web, 2(4):219–229, December 1999.
[8] Najork M. and Heydon A., "High-performance web crawling", Compaq SRC Research Report 173, September 2001.
[9] Shkapenyuk V. and Suel T., "Design and implementation of a high-performance distributed web crawler", in Proc. 18th Int. Conf. on Data Engineering, 2002, pp. 357–368.
[10] Pavalam S. M., Kasmir Raja S. V., Jawahar M., and Felix K. Akorli, "Web Crawler in Mobile Systems", International Journal of Machine Learning and Computing, Vol. 2, No. 4, August 2012.
[11] Md. Abu Kausar, Dhaka V. S., and Sanjeev Kumar Singh, "Web Crawler: A Review", International Journal of Computer Applications (0975-8887), Vol. 63, No. 2, February 2013.
[12] Manas Kanti Dey, Debakar Shamanta, Hasan Md Suhag Chowdhury, and Khandakar Entenam Unayes Ahmed, "Focused Web Crawling: A Framework for Crawling of Country Based Financial Data", in Proc. Information and Financial Engineering (ICIFE), IEEE, 2010.
[13] Debashis Hati, Biswajit Sahoo, and Amritesh Kumar, "Adaptive Focused Crawling Based on Link Analysis", in Proc. 2nd International Conference on Education Technology and Computer (ICETC), 2010.
[14] Shashi Shekhar, Rohit Agrawal, and Karm Veer Arya, "An Architectural Framework of a Crawler for Retrieving Highly Relevant Web Documents by Filtering Replicated Web Collections", in Proc. International Conference on Advances in Computer Engineering, IEEE, 2010.
[15] Gautam Pant and Padmini Srinivasan, "Learning to Crawl: Comparing Classification Schemes", ACM Transactions on Information Systems, Vol. 23, No. 4, October 2005, pp. 430–462.
[16] Ah Chung Tsoi, Daniele Forsali, Marco Gori, Markus Hagenbuchner, and Franco Scarselli, "A Simple Focused Crawler", in Proc. 12th International WWW Conference (poster), 2003, p. 1.
[17] Junghoo Cho and Hector Garcia-Molina, "The evolution of the web and implications for an incremental crawler", in Proc. 26th International Conference on Very Large Databases, 2000.
[18] Sharma A. K. and Ashutosh Dixit, "Self Adjusting Refresh Time Based Architecture for Incremental Web Crawler", International Journal of Computer Science and Network Security, Vol. 8, No. 12, 2008, pp. 349–354.
[19] Vladislav Shkapenyuk and Torsten Suel, "Design and Implementation of a High-Performance Distributed Web Crawler", CIS Department, Polytechnic University, Brooklyn, NY 11201.
[20] Junghoo Cho and Hector Garcia-Molina, "Parallel Crawlers", in Proc. WWW2002, May 7–11, 2002, Honolulu, Hawaii, USA, ACM 1-58113-449-5/02/0005.
[21] Shruti Sharma, Sharma A. K., and Gupta J. P., "A Novel Architecture of a Parallel Web Crawler", International Journal of Computer Applications (0975-8887), Vol. 14, No. 4, January 2011.
[22] Ntoulas A., Cho J., and Olston C., "What's new on the Web? The evolution of the Web from a search engine perspective", in Proc. WWW '04, 2004, pp. 1–12.
[23] Niraj Singhal, Ashutosh Dixit, and Sharma A. K., "Design of a Priority Based Frequency Regulated Incremental Crawler", International Journal of Computer Applications (0975-8887), Vol. 1, No. 1, 2010, pp. 42–47.
[24] Sharma A. K., Gupta J. P., and Agarwal D. P., "Augmented Hypertext Documents suitable for parallel crawlers", communicated to 21st IASTED International Multi-conference Applied Informatics (AI-2003), Feb 10–13, 2003, Austria.
[25] Junghoo Cho and Hector Garcia-Molina, "The evolution of the Web and implications for an incremental crawler", in Proc. of VLDB Conf., 2000.
[26] Bing Liu, "Web Content Mining", 14th International World Wide Web Conference (WWW 2005), Chiba, Japan.
[27] Paolo Boldi, Bruno Codenotti, Massimo Santini, and Sebastiano Vigna, "UbiCrawler: A Scalable Fully Distributed Web Crawler".