In Seventeenth Annual International Conference of the Association of Management (AoM/IAoM) on Computer Science, Maximilian Press Publishers, San Diego, CA, pages 324-329, August 1999.

USING THE WEB EFFICIENTLY: MOBILE CRAWLERS

Jan Fiedler and Joachim Hammer
University of Florida
Gainesville, FL 32611-6125
[email protected], [email protected]
(Author’s current address: Intershop Communications GmbH, Leutragraben 2-4, 07743 Jena, Germany.)

ABSTRACT

Search engines have become important tools for Web navigation. In order to provide powerful search facilities, search engines maintain comprehensive indices of documents available on the Web. The creation and maintenance of Web indices is done by Web crawlers, which recursively traverse and download Web pages on behalf of search engines. Analysis of the collected information is performed after the data has been downloaded. In this research, we propose an alternative, more efficient approach to building Web indices based on mobile crawlers. Our proposed crawlers are transferred to the source(s) where the data resides in order to filter out any unwanted data locally before transferring it back to the search engine. Our approach to Web crawling is particularly well suited for implementing so-called “smart” crawling algorithms which determine an efficient crawling path based on the contents of Web pages that have been visited so far. In order to demonstrate the viability of our approach we have built a prototype mobile crawling system within the University of Florida Intranet.

Keywords: search engine, WWW, crawling, mobile code

1. Introduction

The current approach to Web searching is to create and maintain indices for Web pages, much like indices for a library catalog or access paths for tuples in a database. However, before the pages can be indexed they must first be collected and returned to the indexing engine. This is done by Web crawlers, which systematically traverse the Web using various crawling algorithms (e.g., breadth-first, depth-first). The pages are downloaded to a search engine, which parses the text and creates and stores the index. For examples of Web search engines see Google [2], AltaVista [1], Infoseek [7], etc.

However, given the explosive growth of the Web, there are several problems associated with this method of indexing. These are due to the rapidly growing number of Web pages that have to be indexed on the one hand, and the relatively slow increase in network bandwidth on the other. Specifically, we see the following problems with the way current search engines index the Web:

Scaling. The concept of “download-first-and-index-later” will likely not scale given the limitations in the infrastructure and projected growth rate of the Web. Using the estimates for growth of Web indices provided in [8], a Web crawler running in the year 2000 would have to retrieve Web data at a rate of 45 Mbit per second in order to download the estimated 480 GB of pages per day that are necessary to maintain the index. Looking at the fundamental limitations of storage technology and communication networks, it is unlikely that Web indices of this size can be maintained efficiently.

Efficiency. Current search engines add unnecessary traffic to the already overloaded Internet. While existing approaches are the only alternative for general-purpose search engines trying to build a comprehensive Web index, there are many scenarios where it is more efficient to download and index only selected pages. We call these systems specialized search engines and justify their usefulness here.

Quality of Index. The results of Web searches are overwhelming and require the user to act as part of the query processor. Current commercial search engines maintain Web indices of up to 110 million pages [8] and easily find several thousands of matches for an average query. Thus, increasing the size of the Web index does not automatically improve the quality of the search results if it simply causes the search engine to return twice as many matches to a query as before.

Since we cannot limit the number of pages on the Web, we have to find ways to improve the search results in a way that can accommodate the rapid growth of the Web. Therefore, we expect a new generation of specialized search engines to emerge in the near future. These specialized engines may challenge the dominance of today's general search engines by providing superior search results for specific subject areas, using a variety of new technologies from areas such as data mining, data visualization, and graph theory. In addition, we believe that a new crawling approach is needed to improve the efficiency of data collection when used in combination with specialized search engines.

2. A Mobile Approach to Web Crawling

In this overview paper, we summarize an alternative approach to Web crawling based on mobile crawlers. A much more detailed description can be found in the full version of this paper, which is available as a technical report [4] from our ftp server at ftp.dbcenter.cise.ufl.edu. Crawler mobility allows for more sophisticated crawling algorithms [3] and avoids some of the inefficiencies associated with the brute-force strategies exercised by current crawlers. We see mobile crawling as an efficient, scalable solution to establishing a specialized search index in the highly distributed, decentralized, and dynamic environment of the Web.

We define mobility in the context of Web crawling as the ability of a crawler to transfer itself to each Web server of interest before collecting pages on that server. After completing the collection process on a particular server, the crawler, together with the collected data, moves to the next server or to its home system. Mobile crawlers are managed by a crawler manager, which supplies each crawler with a list of target Web sites and monitors the location of each crawler. This is necessary to intervene in case one or more crawlers happen to interfere with each other (i.e., crawl the same Web space). However, the crawling strategy and the path taken are controlled separately by each crawler through its crawling algorithm. In addition, the crawler manager provides the necessary functionality for extracting the collected data from the crawler for use by the indexer (see Sec. 4). Figure 1 provides an overview of mobile Web crawling.

[Figure 1: An overview of mobile Web crawling — a search engine, with its index and Web crawler manager, communicating with HTTP servers on several remote hosts.]

In order to demonstrate the capabilities of mobile crawling, consider a search engine which wants to support high-quality searches for a particular application domain (e.g., preventive health care, gardening, sports) by building an index of relevant Web pages. The creation and maintenance of a suitable index to support such a specialized search engine using a traditional crawling approach is highly inefficient. This is due to the fact that traditional crawlers must download much more data than is effectively used (in the worst case, the whole Web). In contrast, in our approach, a mobile crawler is sent to each Web source that is expected to contain relevant information for a local pre-selection of pages. Initially, the crawler obtains a list of target locations from the crawler manager. These addresses are referred to as seed URLs since they indicate the beginning of the crawling process. In addition, the crawler manager also uploads the crawling strategy into the crawler in the form of a program. This program tells the crawler which pages are considered relevant and should be collected. In addition, it also determines the crawler's path through the Web site.
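The manager/crawler division of labor described above can be sketched in a few lines of Java. This is an illustrative sketch only: the actual prototype uploads a full rule-based strategy program, whereas here the "strategy" is reduced to a single keyword predicate, and all class and method names (`CrawlerSpec`, `isRelevant`, `nextTarget`) are our own inventions, not the paper's API.

```java
import java.util.ArrayDeque;
import java.util.List;
import java.util.Queue;

// Hypothetical specification a crawler manager hands to a mobile crawler:
// a queue of seed URLs plus a relevance test standing in for the uploaded
// strategy program.
class CrawlerSpec {
    private final Queue<String> seedUrls;   // target sites, visited in order
    private final String topicKeyword;      // pages mentioning this count as relevant

    CrawlerSpec(List<String> seeds, String keyword) {
        this.seedUrls = new ArrayDeque<>(seeds);
        this.topicKeyword = keyword.toLowerCase();
    }

    // The uploaded strategy, collapsed here into a single predicate.
    boolean isRelevant(String pageText) {
        return pageText.toLowerCase().contains(topicKeyword);
    }

    // Next site to migrate to; null means the crawler returns home.
    String nextTarget() {
        return seedUrls.poll();
    }
}
```

Keeping the specification separate from the crawler code mirrors the paper's design, in which the manager controls target lists while each crawler controls its own path.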
As we have mentioned before, a considerable amount of research has focused on optimal crawling strategies [3]. We see our mobile crawling approach as a complement to this work, providing an efficient infrastructure for implementing crawling strategies. Before the actual crawling begins, the crawler must migrate to a specific remote site using one of the seed URLs as the target address. After the crawler has successfully migrated to the remote host, the crawling algorithm is executed. This part of mobile crawling is very similar to traditional crawling since pages are retrieved and analyzed recursively. In fact, any breadth-first or depth-first crawling strategy can be employed. However, data collection can be sped up considerably by using more sophisticated (“smart”) algorithms. Since these strategies need access to the contents of the crawled pages in order to optimize their crawl path, mobile crawlers are particularly well suited for implementing them. When the crawler finishes, it either returns to the crawler manager (its home) or, if the list of seed URLs is not empty, migrates to the next Web site on the list and continues. Once the mobile crawler has successfully migrated back to its home, all pages retrieved by the crawler are transferred to the search engine via the crawler manager. The search engine can then generate the index as before. The main difference is that the set of pages to be indexed is significantly smaller and contains only those pages that are relevant to the underlying search topic. Note that a significant part of the transmission cost can be saved by compressing the pages as well as the crawler code prior to migration. In case a mobile crawler does not find any relevant information on a particular server, nothing besides the crawler code itself will be transmitted.
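The per-site crawl step just described can be sketched as a breadth-first walk that keeps only relevant pages. In this hedged sketch, two maps stand in for real HTTP access (page contents and out-links), and the relevance test is a plain keyword match; the actual prototype drives this loop from rules instead.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

// Breadth-first crawl of one site with local pre-selection: only pages
// the relevance test accepts are kept for transmission back home.
class SiteCrawler {
    static List<String> crawl(String seed,
                              Map<String, String> pageText,
                              Map<String, List<String>> links,
                              String keyword) {
        List<String> kept = new ArrayList<>();
        Set<String> visited = new HashSet<>();
        Queue<String> frontier = new ArrayDeque<>();
        frontier.add(seed);
        while (!frontier.isEmpty()) {
            String url = frontier.poll();
            if (!visited.add(url)) continue;                     // already crawled
            if (pageText.getOrDefault(url, "").toLowerCase().contains(keyword)) {
                kept.add(url);                                   // relevant: keep locally
            }
            frontier.addAll(links.getOrDefault(url, List.of())); // breadth-first expansion
        }
        return kept;
    }
}
```

A "smart" strategy would differ only in how the frontier is ordered, e.g., by prioritizing links found on pages that matched the relevance test.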
Furthermore, since the crawler depends on the resources of the remote host for its processing, it is not possible to predict how much memory will be available to it. This means that a crawler has to be able to dynamically interrupt its collection process whenever the available memory is exhausted and transmit the collected pages back to the search engine to free up resources. We are currently implementing a second prototype version of our mobile crawler that has the ability to transmit collected pages back to its home base whenever necessary. In addition, we are implementing a control interface that lets the crawler programmer specify ahead of time how resource-intensive the crawler's operation may be. Specifically, the crawler programmer can specify, for example, that the crawler never use more than a certain percentage of the available memory, or that it pause for t seconds between page accesses to reduce the load on the host server. However, several issues still have to be resolved before mobile crawling can become widely used. Some of these are policy issues (e.g., a crawler must have permission from the owner of a Web site to execute locally on the server). Note also that mobile crawlers are essentially a special case of mobile agents, so some of the important issues related to mobile crawling should be addressed in the broader context of mobile computing (e.g., our current version of the crawler code needs a run-time environment to be present at each site before it can execute). The latter problem can be alleviated in two ways: in the short term, by simplifying installation as much as possible, making the runtime environment a small server process which can be installed by the Web master of each participating Web site; in the long run, a better solution would be to standardize the runtime environment and make it an optional part of each Web server.
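The control interface described earlier in this section (a memory ceiling plus a pause between page accesses) might look as follows. The class name, method names, and thresholds are our assumptions for illustration, not the prototype's actual API.

```java
// Hypothetical throttle a crawler programmer configures before dispatch.
class CrawlerThrottle {
    private final double maxMemoryFraction;  // e.g. 0.5: flush pages home at 50% use
    private final long pauseMillis;          // delay between page accesses

    CrawlerThrottle(double maxMemoryFraction, long pauseMillis) {
        this.maxMemoryFraction = maxMemoryFraction;
        this.pauseMillis = pauseMillis;
    }

    // True once collected pages should be shipped back to free memory.
    boolean shouldFlush(long usedBytes, long availableBytes) {
        return usedBytes >= maxMemoryFraction * availableBytes;
    }

    // Politeness pause between page accesses to lighten the host's load.
    void pause() throws InterruptedException {
        Thread.sleep(pauseMillis);
    }
}
```

The crawl loop would call `shouldFlush` after each page and `pause` before each request, so both policies are enforced without the strategy program having to know about them.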
Given the emergence of mobile agents, a standardized runtime environment could benefit other roaming agents besides our mobile crawlers. Currently, we avoid the above problems by testing our crawlers in secured intranets where we either have control over the participating Web servers or can obtain the necessary execution permission without problems. In this sense, our environment at the University is not unlike the Intranet of a large corporation, which could use mobile crawlers for setting up search indexes with relatively little effort.

3. Cost-Benefit Analysis

We have analyzed the behavior of a mobile crawler using the following four parameters:

Data Access: By migrating to a remote Web server, mobile crawlers can access Web pages locally with respect to the server. This saves network bandwidth by eliminating the request/response messages used for data retrieval.

Remote Page Selection: By migrating to a remote Web server, mobile crawlers can select only the relevant pages before transmitting them over the network. This saves network bandwidth by discarding irrelevant information directly at the data source.

Remote Page Filtering: By migrating to a remote Web server, mobile crawlers can reduce the content of Web pages before transmitting them over the network. This saves network bandwidth by discarding irrelevant portions of the retrieved pages.

Remote Page Compression: By migrating to a remote Web server, mobile crawlers can compress the content of Web pages before transmitting them over the network. This saves network bandwidth by reducing the size of the retrieved data.

As mentioned before, the details of this analysis, including the formulae on which our calculations are based as well as the graphs depicting the results, can be found in the full paper. To summarize our findings, our experiments showed that mobile crawlers easily outperform traditional crawlers in terms of network efficiency. Most importantly, this claim holds independently of the type of search engine for which the crawler is working. Furthermore, as expected, the results of our analysis show that mobile crawlers do very well when used in the context of specialized search engines. Here, crawler mobility has an impact not only in the data domain (i.e., network bandwidth saved) but also in the time domain (i.e., the time needed to finish crawling a site). This is due to the fact that the network is still a bottleneck when crawling the Web. By significantly reducing the amount of data to be transmitted over the network, the time needed for data transmission, and therefore the time needed to finish crawling, is significantly reduced.

4. Architecture for Mobile Web Crawling

In order to confirm the results of our analysis and to establish a proof of concept, we implemented an environment which provides the infrastructure required by mobile crawlers. The prototype system, which is implemented in Java, extends the current Web architecture with a distributed runtime system for mobile crawlers and provides additional components for the management and control of mobile crawlers. The overall system architecture and essential system components of the prototype are depicted in Figure 2.

[Figure 2: System architecture overview — the distributed crawler runtime environment (virtual machines and communication subsystems colocated with HTTP servers, connected over the network) and the application framework architecture (crawler manager with crawler specs and in/outboxes, archive manager, query engine, command manager, and connection manager backed by a SQL database).]

The architecture depicted in Figure 2 consists of two major parts.
The first part, the distributed crawler runtime environment, provides the base functionality for the transfer and execution of mobile code. It establishes a distributed execution environment in which application-specific mobile crawlers can operate. The second major part of the system, the application framework architecture, serves as an application-independent interface to the distributed crawler runtime environment. The framework provides functionality for mobile crawler creation and management. In addition, it provides a query interface which allows applications to access the data retrieved by mobile crawlers.

4.1. Mobile Crawlers

In our prototype, mobile crawlers serve as “mobile containers” for the crawling algorithm as well as for the collected data. To provide real crawler mobility, a crawler needs to be able to save its runtime state, transfer it over the network, and restore it at the remote location. For interoperability, crawlers need to use a machine-independent representation of their runtime state. Since this kind of interoperability is difficult to achieve, we decided to minimize the runtime state needed by our crawlers as much as possible using a rule-based approach. The execution of a crawler program is equivalent to applying rules to the facts inside the crawler's knowledge base. The advantage of this approach is that rule-based programs do not have a real runtime state; with carefully designed rules, the program can be represented by facts only. Thus, saving the runtime state of our crawlers involves stopping the rule application process and saving the current crawler fact base. In this way, the crawler can easily migrate since all relevant data (rules and facts) are represented as simple ASCII strings within the crawler. The crawler object, which carries the rules and facts, migrates using the object serialization facilities of the Java language. 4.2.
Virtual Machine

The virtual machine is the heart of the distributed crawler runtime environment. Its main purpose is to provide an environment in which crawler code received via the network can be executed. Since crawler programs are specified as rules, we can model our virtual machine as an inference engine which takes care of the rule application process. To start the execution of a crawler, we initialize the inference engine with the rules and facts of the crawler to be executed. Starting the rule application process of the inference engine is equivalent to starting crawler execution. Once rule application has finished (either because no rule is applicable or due to an external signal), the rules and facts stored in the inference engine are extracted and stored back in the crawler. Thus, the inference engine establishes a virtual machine with respect to the crawler. The concrete implementation of our virtual machine uses an extended version of the Jess inference engine [5], which in turn is essentially a Java port of the well-known CLIPS system [6].

4.3. Query Engine

The query engine is part of the application framework architecture and is responsible for the communication between crawler and application. Since our mobile crawlers are application-independent, they have no information about the semantics of the data they retrieve. In order to use the crawler-retrieved data within an application, the data needs to be extracted from the crawler's information base. To provide efficient access to this information, we implemented a query engine which evaluates application-specific queries over the crawler information base. The query result is represented as structured data tuples, very similar to those of relational database systems. Since the crawler information base consists of facts generated by the rule-based crawler program, the query engine implementation is based on the same inference engine used for the virtual machine implementation.
Application-specific queries are translated into special query rules, which identify matching facts within the crawler information base. We refer the reader to [4] for an in-depth discussion of the system components and their implementation.

4.4. Prototype and Lessons Learned

One of the most important requirements for our prototype implementation was the ability to run on multiple host platforms (platform interoperability). Specifically, the crawler runtime system has to provide a common environment to mobile crawlers while running on different platforms, operating systems, and Web servers. To achieve the required platform independence we implemented the crawler prototype in Java. So far, the crawler and runtime environment have been successfully tested on Unix and Windows machines. For example, within the University of Florida Intranet, mobile crawlers successfully migrated between host servers running the Unix and Windows operating systems and collected data sets managed by more than ten different Web servers across campus. We are currently extending our prototype system with new crawler components which address the critical issues identified in Section 2. Our focus is on improving the security and stability of the distributed crawler runtime environment. We identified this as the most crucial point to be addressed before mobile crawlers can be used in a real network environment. We also plan to install our crawler runtime environment on Web servers outside of the University of Florida in order to evaluate our crawling approach in a broader context.

5. Conclusion

We have introduced an alternative approach to Web crawling based on mobile crawlers. The proposed approach surpasses the centralized architecture of current Web crawling systems by distributing the data retrieval process across the network. In particular, using mobile crawlers we are able to perform remote operations such as data analysis and data compression at the data source before the data is transmitted over the network. This allows for more intelligent crawling techniques and addresses the needs of applications which are interested only in certain subsets of the available data. We have developed and implemented an application framework which demonstrates our mobile Web crawling approach and allows applications to take advantage of mobile crawling.

The performance results of our approach are very promising. Mobile crawlers can significantly reduce the network load caused by crawling by reducing the amount of data transferred over the network. They achieve this reduction by performing data analysis and data compression at the data source, so that only relevant information is transmitted, in compressed form, over the network.

The prototype implementation of our mobile crawler framework provides an initial step towards mobile Web crawling. We have identified several issues which need to be addressed before mobile crawling can be used on a larger scale:

Security. Crawler migration and remote execution of code raise severe security problems because a mobile crawler might contain harmful code. We suggest introducing an identification mechanism for mobile crawlers based on digital signatures. Based on this crawler identification scheme, a system administrator would be able to grant execution permission only to certain crawlers, excluding crawlers from unknown (and potentially unsafe) sources. In addition, the virtual machine needs to be secured such that crawlers cannot get access to critical system resources. This is already partially implemented in the Jess inference engine; by restricting the functionality of the Jess inference engine, a secure sandbox scheme (similar to Java's) can be implemented relatively easily.

Integration of the mobile crawler virtual machine into the Web.
The availability of a mobile crawler virtual machine on as many Web servers as possible is crucial for the effectiveness of mobile crawling. This integration can be achieved through Java Servlets, for example, which extend Web server functionality with special Java programs. We realize, of course, that before this can be done, some effort has to be spent on standardizing the functionality of such runtime environments.

Research in mobile crawling algorithms. None of the current crawling algorithms has been designed with crawler mobility in mind. It seems worthwhile to spend some effort on the development of new algorithms which take advantage of crawler mobility. In particular, these algorithms have to deal with the loss of centralized control over the crawling process due to crawler mobility.

References

[1] AltaVista, “AltaVista Search Engine,” http://www.altavista.com.
[2] Brin, S., Page, L., “The Anatomy of a Large-Scale Hypertextual Web Search Engine,” Computer Science Department, Stanford University, Stanford, CA, USA, 1997.
[3] Cho, J., Garcia-Molina, H., Page, L., “Efficient Crawling Through URL Ordering,” Computer Science Department, Stanford University, Stanford, CA, USA, 1997.
[4] Fiedler, J., Hammer, J., “Using the Web Efficiently: Mobile Crawlers,” Technical Report, University of Florida, Gainesville, FL, November 1998, ftp://ftp.dbcenter.cise.ufl.edu/Pub/publications/Mobile-Crawling.pdf.
[5] Friedman-Hill, E., Jess Manual, Sandia National Laboratories, Livermore, CA, USA, 1997.
[6] Giarratano, J. C., CLIPS User's Guide, Software Technology Branch, NASA/Lyndon B. Johnson Space Center, USA, 1997.
[7] Infoseek, “Infoseek Search Engine,” http://www.infoseek.com.
[8] Sullivan, D., Search Engine Watch, Mecklermedia, 1998, http://www.searchenginewatch.com.