In Seventeenth Annual International Conference of the Association of Management (AoM/IAoM) on Computer Science, Maximilian Press Publishers, San Diego, CA, pages 324-329, August 1999.

USING THE WEB EFFICIENTLY: MOBILE CRAWLERS

Jan Fiedler and Joachim Hammer
University of Florida
Gainesville, FL 32611-6125
[email protected], [email protected]
(Author’s current address: Intershop Communications GmbH, Leutragraben 2-4, 07743 Jena, Germany.)

ABSTRACT

Search engines have become important tools for Web navigation. In order to provide powerful search facilities, search engines maintain comprehensive indices of documents available on the Web. The creation and maintenance of Web indices is done by Web crawlers, which recursively traverse and download Web pages on behalf of search engines. Analysis of the collected information is performed after the data has been downloaded. In this research, we propose an alternative, more efficient approach to building Web indices based on mobile crawlers. Our proposed crawlers are transferred to the source(s) where the data resides in order to filter out any unwanted data locally before transferring it back to the search engine. Our approach to Web crawling is particularly well suited for implementing so-called “smart” crawling algorithms which determine an efficient crawling path based on the contents of Web pages that have been visited so far. In order to demonstrate the viability of our approach we have built a prototype mobile crawling system within the University of Florida Intranet.

Keywords: search engine, WWW, crawling, mobile code

1. Introduction

The current approach to Web searching is to create and maintain indices for Web pages, much like indices for a library catalog or access paths for tuples in a database. However, before the pages can be indexed they must first be collected and returned to the indexing engine. This is done by Web crawlers, which systematically traverse the Web using various crawling algorithms (e.g., breadth-first, depth-first). The pages are downloaded to a search engine, which parses the text and creates and stores the index. For examples of Web search engines see Google [2], AltaVista [1], Infoseek [7], etc.

However, given the explosive growth of the Web, there are several problems associated with this method of indexing. These are due to the rapidly growing number of Web pages that have to be indexed on the one hand, and the relatively slow increase in network bandwidth on the other. Specifically, we see the following problems with the way current search engines index the Web:

Scaling. The concept of “download-first-and-index-later” will likely not scale given the limitations in the infrastructure and projected growth rate of the Web. Using the estimates for growth of Web indices provided in [8], a Web crawler running in the year 2000 would have to retrieve Web data at a rate of 45 Mbit per second in order to download the estimated 480 GB of pages per day that are necessary to maintain the index. Looking at the fundamental limitations of storage technology and communication networks, it is unlikely that Web indices of this size can be maintained efficiently.

Efficiency. Current search engines add unnecessary traffic to the already overloaded Internet. While existing approaches are the only alternative for general-purpose search engines trying to build a comprehensive Web index, there are many scenarios where it is more efficient to download and index only selected pages. We call these systems specialized search engines and justify their usefulness here.

Quality of Index. The results of Web searches are overwhelming and require the user to act as part of the query processor. Current commercial search engines maintain Web indices of up to 110 million pages [8] and easily find several thousands of matches for an average query. Thus, increasing the size of the Web index does not automatically improve the quality of the search results if it simply causes the search engine to return twice as many matches to a query as before.

Since we cannot limit the number of pages on the Web, we have to find ways to improve the search results in a way that can accommodate the rapid growth of the Web. Therefore, we expect a new generation of specialized search engines to emerge in the near future. These specialized engines may challenge the dominance of today's general search engines by providing superior search results for specific subject areas, using a variety of new technologies from areas such as data mining, data visualization, and graph theory. In addition, we believe that a new crawling approach is needed to improve the efficiency of data collection when used in combination with specialized search engines.

2. A Mobile Approach to Web Crawling

In this overview paper, we summarize an alternative approach to Web crawling based on mobile crawlers. A much more detailed description can be found in the full version of this paper, which is available as a technical report [4] from our ftp server at ftp.dbcenter.cise.ufl.edu. Crawler mobility allows for more sophisticated crawling algorithms [3] and avoids some of the inefficiencies associated with the brute-force strategies exercised by current crawlers. We see mobile crawling as an efficient, scalable solution to establishing a specialized search index in the highly distributed, decentralized, and dynamic environment of the Web.

We define mobility in the context of Web crawling as the ability of a crawler to transfer itself to each Web server of interest before collecting pages on that server. After completing the collection process on a particular server, the crawler, together with the collected data, moves to the next server or to its home system. Mobile crawlers are managed by a crawler manager, which supplies each crawler with a list of target Web sites and monitors the location of each crawler. This is necessary to intervene in case one or more crawlers happen to interfere with each other (i.e., crawl the same Web space). However, the crawling strategy and the path taken are controlled separately by each crawler through its crawling algorithm. In addition, the crawler manager provides the necessary functionality for extracting the collected data from the crawler for use by the indexer (see Sec. 4). Figure 1 provides an overview of mobile Web crawling.

[Figure 1: An overview of mobile Web crawling — a search engine, with its index and Web crawler manager, communicating with HTTP servers on several remote hosts.]

In order to demonstrate the capabilities of mobile crawling, consider a search engine which wants to support high-quality searches for a particular application domain (e.g., preventive health care, gardening, sports) by building an index of relevant Web pages. The creation and maintenance of a suitable index to support such a specialized search engine using a traditional crawling approach is highly inefficient. This is due to the fact that traditional crawlers must download much more data than is effectively used (in the worst case, the whole Web). In contrast, in our approach, a mobile crawler is sent to each Web source that is expected to contain relevant information for a local pre-selection of pages. Initially, the crawler obtains a list of target locations from the crawler manager. These addresses are referred to as seed URLs since they indicate the beginning of the crawling process. In addition, the crawler manager also uploads the crawling strategy into the crawler in the form of a program. This program tells the crawler which pages are considered relevant and should be collected. In addition, it also determines the crawler's path through the Web site.
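The manager/crawler division of labor described above can be sketched in a few lines of Java. This is an illustrative sketch only: the actual prototype uploads a full rule-based strategy program, whereas here the "strategy" is reduced to a single keyword predicate, and all class and method names (`CrawlerSpec`, `isRelevant`, `nextTarget`) are our own inventions, not the paper's API.

```java
import java.util.ArrayDeque;
import java.util.List;
import java.util.Queue;

// Hypothetical specification a crawler manager hands to a mobile crawler:
// a queue of seed URLs plus a relevance test standing in for the uploaded
// strategy program.
class CrawlerSpec {
    private final Queue<String> seedUrls;   // target sites, visited in order
    private final String topicKeyword;      // pages mentioning this count as relevant

    CrawlerSpec(List<String> seeds, String keyword) {
        this.seedUrls = new ArrayDeque<>(seeds);
        this.topicKeyword = keyword.toLowerCase();
    }

    // The uploaded strategy, collapsed here into a single predicate.
    boolean isRelevant(String pageText) {
        return pageText.toLowerCase().contains(topicKeyword);
    }

    // Next site to migrate to; null means the crawler returns home.
    String nextTarget() {
        return seedUrls.poll();
    }
}
```

Keeping the specification separate from the crawler code mirrors the paper's design, in which the manager controls target lists while each crawler controls its own path.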
As we have mentioned before, a considerable amount of research has focused on optimal crawling strategies [3]. We see our mobile crawling approach as a complement to this work, providing an efficient infrastructure for implementing crawling strategies. Before the actual crawling begins, the crawler must migrate to a specific remote site using one of the seed URLs as the target address. After the crawler has successfully migrated to the remote host, the crawling algorithm is executed. This part of mobile crawling is very similar to traditional crawling since pages are retrieved and analyzed recursively. In fact, any breadth-first or depth-first crawling strategy can be employed. However, data collection can be sped up considerably by using more sophisticated (“smart”) algorithms. Since these strategies need access to the contents of the crawled pages in order to optimize their crawl path, mobile crawlers are particularly well suited for implementing them. When the crawler finishes, it either returns to the crawler manager (its home) or, if the list of seed URLs is not empty, migrates to the next Web site on the list and continues. Once the mobile crawler has successfully migrated back to its home, all pages retrieved by the crawler are transferred to the search engine via the crawler manager. The search engine can then generate the index as before. The main difference is that the set of pages to be indexed is significantly smaller and contains only those pages that are relevant to the underlying search topic. Note that a significant part of the transmission cost can be saved by compressing the pages as well as the crawler code prior to migration. In case a mobile crawler does not find any relevant information on a particular server, nothing besides the crawler code itself will be transmitted.
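The per-site crawl step just described can be sketched as a breadth-first walk that keeps only relevant pages. In this hedged sketch, two maps stand in for real HTTP access (page contents and out-links), and the relevance test is a plain keyword match; the actual prototype drives this loop from rules instead.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

// Breadth-first crawl of one site with local pre-selection: only pages
// the relevance test accepts are kept for transmission back home.
class SiteCrawler {
    static List<String> crawl(String seed,
                              Map<String, String> pageText,
                              Map<String, List<String>> links,
                              String keyword) {
        List<String> kept = new ArrayList<>();
        Set<String> visited = new HashSet<>();
        Queue<String> frontier = new ArrayDeque<>();
        frontier.add(seed);
        while (!frontier.isEmpty()) {
            String url = frontier.poll();
            if (!visited.add(url)) continue;                     // already crawled
            if (pageText.getOrDefault(url, "").toLowerCase().contains(keyword)) {
                kept.add(url);                                   // relevant: keep locally
            }
            frontier.addAll(links.getOrDefault(url, List.of())); // breadth-first expansion
        }
        return kept;
    }
}
```

A "smart" strategy would differ only in how the frontier is ordered, e.g., by prioritizing links found on pages that matched the relevance test.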
Furthermore, since the crawler depends on the resources of the remote host for its processing, it is not possible to predict how much memory will be available to it. This means that a crawler has to be able to dynamically interrupt its collection process whenever the available memory is exhausted and transmit the collected pages back to the search engine to free up resources. We are currently implementing a second prototype version of our mobile crawler that has the ability to transmit collected pages back to its home base whenever necessary. In addition, we are implementing a control interface that lets the crawler programmer specify ahead of time how resource-intensive the crawler's operation may be. Specifically, the crawler programmer can specify, for example, that the crawler never use more than a certain percentage of the available memory, or that it pause for t seconds between page accesses to reduce the load on the host server. However, several issues still have to be resolved before mobile crawling can become widely used. Some of these are policy issues (e.g., a crawler must have permission from the owner of a Web site to execute locally on the server). Note also that mobile crawlers are essentially a special case of mobile agents, so some of the important issues related to mobile crawling should be addressed in the broader context of mobile computing (e.g., our current version of the crawler code needs a run-time environment to be present at each site before it can execute). The latter problem can be alleviated in two ways: in the short term, by simplifying installation as much as possible, making the runtime environment a small server process which can be installed by the Web master of each participating Web site; in the long run, a better solution would be to standardize the runtime environment and make it an optional part of each Web server.
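The control interface described earlier in this section (a memory ceiling plus a pause between page accesses) might look as follows. The class name, method names, and thresholds are our assumptions for illustration, not the prototype's actual API.

```java
// Hypothetical throttle a crawler programmer configures before dispatch.
class CrawlerThrottle {
    private final double maxMemoryFraction;  // e.g. 0.5: flush pages home at 50% use
    private final long pauseMillis;          // delay between page accesses

    CrawlerThrottle(double maxMemoryFraction, long pauseMillis) {
        this.maxMemoryFraction = maxMemoryFraction;
        this.pauseMillis = pauseMillis;
    }

    // True once collected pages should be shipped back to free memory.
    boolean shouldFlush(long usedBytes, long availableBytes) {
        return usedBytes >= maxMemoryFraction * availableBytes;
    }

    // Politeness pause between page accesses to lighten the host's load.
    void pause() throws InterruptedException {
        Thread.sleep(pauseMillis);
    }
}
```

The crawl loop would call `shouldFlush` after each page and `pause` before each request, so both policies are enforced without the strategy program having to know about them.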
Given the emergence of mobile agents, a standardized runtime environment could benefit other roaming agents besides our mobile crawlers. Currently, we avoid the above problems by testing our crawlers in secured intranets where we either have control over the participating Web servers or can obtain the necessary execution permission without problems. In this sense, our environment at the University is not unlike the Intranet of a large corporation, which could use mobile crawlers for setting up search indexes with relatively little effort.

3. Cost-Benefit Analysis

We have analyzed the behavior of a mobile crawler using the following four parameters:

Data Access: By migrating to a remote Web server, mobile crawlers can access Web pages locally with respect to the server. This saves network bandwidth by eliminating the request/response messages used for data retrieval.

Remote Page Selection: By migrating to a remote Web server, mobile crawlers can select only the relevant pages before transmitting them over the network. This saves network bandwidth by discarding irrelevant information directly at the data source.

Remote Page Filtering: By migrating to a remote Web server, mobile crawlers can reduce the content of Web pages before transmitting them over the network. This saves network bandwidth by discarding irrelevant portions of the retrieved pages.

Remote Page Compression: By migrating to a remote Web server, mobile crawlers can compress the content of Web pages before transmitting them over the network. This saves network bandwidth by reducing the size of the retrieved data.

As mentioned before, the details of this analysis, including the formulae on which our calculations are based as well as the graphs depicting the results, can be found in the full paper. To summarize our findings, our experiments showed that mobile crawlers easily outperform traditional crawlers in terms of network efficiency. Most importantly, this claim holds independently of the type of search engine for which the crawler is working. Furthermore, as expected, the results of our analysis show that mobile crawlers do very well when used in the context of specialized search engines. Here, crawler mobility has an impact not only in the data domain (i.e., network bandwidth saved) but also in the time domain (i.e., the time needed to finish crawling a site). This is due to the fact that the network is still a bottleneck when crawling the Web. By significantly reducing the amount of data to be transmitted over the network, the time needed for data transmission, and therefore the time needed to finish crawling, is significantly reduced.

4. Architecture for Mobile Web Crawling

In order to confirm the results of our analysis and to establish a proof of concept, we implemented an environment which provides the infrastructure required by mobile crawlers. The prototype system, which is implemented in Java, extends the current Web architecture with a distributed runtime system for mobile crawlers and provides additional components for the management and control of mobile crawlers. The overall system architecture and essential system components of the prototype are depicted in Figure 2.

[Figure 2: System architecture overview — the distributed crawler runtime environment (virtual machines and communication subsystems colocated with HTTP servers, connected over the network) and the application framework architecture (crawler manager with crawler specs and in/outboxes, archive manager, query engine, command manager, and connection manager backed by a SQL database).]

The architecture depicted in Figure 2 consists of two major parts.
The first part, the distributed crawler runtime environment, provides the base functionality for the transfer and execution of mobile code. It establishes a distributed execution environment in which application-specific mobile crawlers can operate. The second major part of the system, the application framework architecture, serves as an application-independent interface to the distributed crawler runtime environment. The framework provides functionality for mobile crawler creation and management. In addition, it provides a query interface which allows applications to access the data retrieved by mobile crawlers.

4.1. Mobile Crawlers

In our prototype, mobile crawlers serve as “mobile containers” for the crawling algorithm as well as for the collected data. To provide real crawler mobility, a crawler needs to be able to save its runtime state, transfer it over the network, and restore it at the remote location. For interoperability, crawlers need to use a machine-independent representation of their runtime state. Since this kind of interoperability is difficult to achieve, we decided to minimize the runtime state needed by our crawlers as much as possible using a rule-based approach. The execution of a crawler program is equivalent to applying rules to the facts inside the crawler's knowledge base. The advantage of this approach is that rule-based programs do not have a real runtime state; with carefully designed rules, the program can be represented by facts only. Thus, saving the runtime state of our crawlers involves stopping the rule application process and saving the current crawler fact base. In this way, the crawler can easily migrate since all relevant data (rules and facts) are represented as simple ASCII strings within the crawler. The crawler object, which carries the rules and facts, migrates using the object serialization facilities of the Java language. 4.2.
Virtual Machine

The virtual machine is the heart of the distributed crawler runtime environment. Its main purpose is to provide an environment in which crawler code received via the network can be executed. Since crawler programs are specified as rules, we can model our virtual machine as an inference engine which takes care of the rule application process. To start the execution of a crawler, we initialize the inference engine with the rules and facts of the crawler to be executed. Starting the rule application process of the inference engine is equivalent to starting crawler execution. Once rule application has finished (either because no rule is applicable or due to an external signal), the rules and facts stored in the inference engine are extracted and stored back in the crawler. Thus, the inference engine establishes a virtual machine with respect to the crawler. The concrete implementation of our virtual machine uses an extended version of the Jess inference engine [5], which in turn is essentially a Java port of the well-known CLIPS system [6].

4.3. Query Engine

The query engine is part of the application framework architecture and is responsible for the communication between crawler and application. Since our mobile crawlers are application-independent, they have no information about the semantics of the data they retrieve. In order to use the crawler-retrieved data within an application, the data needs to be extracted from the crawler's information base. To provide efficient access to this information, we implemented a query engine which evaluates application-specific queries over the crawler information base. The query result is represented as structured data tuples, very similar to those of relational database systems. Since the crawler information base consists of facts generated by the rule-based crawler program, the query engine implementation is based on the same inference engine used for the virtual machine implementation.
Application-specific queries are translated into special query rules, which identify matching facts within the crawler information base. We refer the reader to [4] for an in-depth discussion of the system components and their implementation.

4.4. Prototype and Lessons Learned

One of the most important requirements for our prototype implementation was the ability to run on multiple host platforms (platform interoperability). Specifically, the crawler runtime system has to provide a common environment to mobile crawlers while running on different platforms, operating systems, and Web servers. To achieve the required platform independence we implemented the crawler prototype in Java. So far, the crawler and runtime environment have been successfully tested on Unix and Windows machines. For example, within the University of Florida Intranet, mobile crawlers successfully migrated between host servers running the Unix and Windows operating systems and collected data sets managed by more than ten different Web servers across campus. We are currently extending our prototype system with new crawler components which address the critical issues identified in Section 2. Our focus is on improving the security and stability of the distributed crawler runtime environment. We identified this as the most crucial point to be addressed before mobile crawlers can be used in a real network environment. We also plan to install our crawler runtime environment on Web servers outside of the University of Florida in order to evaluate our crawling approach in a broader context.

5. Conclusion

We have introduced an alternative approach to Web crawling based on mobile crawlers. The proposed approach surpasses the centralized architecture of current Web crawling systems by distributing the data retrieval process across the network. In particular, using mobile crawlers we are able to perform remote operations such as data analysis and data compression at the data source before the data is transmitted over the network. This allows for more intelligent crawling techniques and addresses the needs of applications which are interested only in certain subsets of the available data. We have developed and implemented an application framework which demonstrates our mobile Web crawling approach and allows applications to take advantage of mobile crawling.

The performance results of our approach are very promising. Mobile crawlers can significantly reduce the network load caused by crawling by reducing the amount of data transferred over the network. They achieve this reduction by performing data analysis and data compression at the data source, so that only relevant information is transmitted, in compressed form, over the network.

The prototype implementation of our mobile crawler framework provides an initial step towards mobile Web crawling. We have identified several issues which need to be addressed before mobile crawling can be used on a larger scale:

Security. Crawler migration and remote execution of code raise severe security problems because a mobile crawler might contain harmful code. We suggest introducing an identification mechanism for mobile crawlers based on digital signatures. Based on this crawler identification scheme, a system administrator would be able to grant execution permission only to certain crawlers, excluding crawlers from unknown (and potentially unsafe) sources. In addition, the virtual machine needs to be secured such that crawlers cannot get access to critical system resources. This is already partially implemented in the Jess inference engine; by restricting the functionality of the Jess inference engine, a secure sandbox scheme (similar to Java's) can be implemented relatively easily.

Integration of the mobile crawler virtual machine into the Web.
The availability of a mobile crawler virtual machine on as many Web servers as possible is crucial for the effectiveness of mobile crawling. This integration can be achieved through Java Servlets, for example, which extend Web server functionality with special Java programs. We realize, of course, that before this can be done, some effort has to be spent on standardizing the functionality of such runtime environments.

Research in mobile crawling algorithms. None of the current crawling algorithms has been designed with crawler mobility in mind. It seems worthwhile to spend some effort on the development of new algorithms which take advantage of crawler mobility. In particular, these algorithms have to deal with the loss of centralized control over the crawling process due to crawler mobility.

References

[1] AltaVista, “AltaVista Search Engine,” http://www.altavista.com.
[2] Brin, S., Page, L., “The Anatomy of a Large-Scale Hypertextual Web Search Engine,” Computer Science Department, Stanford University, Stanford, CA, USA, 1997.
[3] Cho, J., Garcia-Molina, H., Page, L., “Efficient Crawling Through URL Ordering,” Computer Science Department, Stanford University, Stanford, CA, USA, 1997.
[4] Fiedler, J., Hammer, J., “Using the Web Efficiently: Mobile Crawlers,” Technical Report, University of Florida, Gainesville, FL, November 1998, ftp://ftp.dbcenter.cise.ufl.edu/Pub/publications/Mobile-Crawling.pdf.
[5] Friedman-Hill, E., Jess Manual, Sandia National Laboratories, Livermore, CA, USA, 1997.
[6] Giarratano, J. C., CLIPS User's Guide, Software Technology Branch, NASA/Lyndon B. Johnson Space Center, USA, 1997.
[7] Infoseek, “Infoseek Search Engine,” http://www.infoseek.com.
[8] Sullivan, D., Search Engine Watch, Mecklermedia, 1998, http://www.searchenginewatch.com.