Screen scraping web services

Alex van Oostenrijk
Department of Computer Science, Radboud University of Nijmegen
Email: [email protected]
December 2004

Abstract

Remote software systems can provide web services that offer us remote functions to call through the HTTP protocol. In this paper, we investigate whether we can use a web service as an intermediary between us and a web site, to which we have no means of access other than regular (browser) access, in order to restructure the information contained in the web site for our purposes. The technique we explore is called screen scraping, where our web service scans the web site's HTML for the information that we need. We build a sample ASP.NET implementation, and investigate what limitations there are to the protocols and data types that web services use. We find that speed is our biggest problem, and present methods to improve response times.
1 Introduction

Web services are commonly used to provide access to a system that is otherwise very secure (i.e. all ports are closed). The HTTP port (80) is normally left open for browsers to access, and it is onto HTTP that the web services' protocol, SOAP, is layered. This allows us to develop systems that allow remote procedure calls while keeping system administrators happy. We describe the workings of web services in detail in section 2.

However, web services need not always be integrated with the host system. It is also possible for a web service to be a system (host) by itself, and talk to another system in turn. In this text we treat a web service as a filter. It receives a request from a client, and talks to its server to retrieve the data it needs to answer the request, processes (restructures) it, and returns it to the client. In the sample application that we shall build, the web service's server is a remote web site, and the web service is used to retrieve data from that web site and restructure it so that its client can present the data in other formats than the web site does. Our demonstration project is described more fully in section 3. We develop a web service and a client for it (section 4).

Issues that we will discuss on the way include the web service protocols (SOAP, HTTP GET or HTTP POST), the data types that web service traffic can handle, and most importantly, speed (sections 5-7). It is also interesting to see how web services can be traced during development (section 8). Finally, we see how our data harvesting technique, screen scraping, can be made more robust (section 9).

2 Background

2.1 Definition of Web Service

What is a web service? We can find a number of different definitions on the Internet, including:

"A self-contained, modular application that can be described, published, located, and invoked over the Web. Platform-neutral and based on open standards, Web Services can be combined with each other in different ways to create business processes that enable you to interact with customers, employees, and suppliers." [5]

and
"A web service is a collection of functions that are packaged as a single entity and published to the network for use by other programs. Web services are building blocks for creating open distributed systems, and allow companies and individuals to quickly and cheaply make their digital assets available worldwide." [6]

and

"A Web service is any piece of software that makes itself available over the Internet and uses a standardized XML messaging system. There should be some simple mechanism for interested parties to locate the service and locate its public interface. The most prominent directory of Web services is currently available via UDDI, or Universal Description, Discovery, and Integration." [3]

From these definitions, we can distill the following properties of a web service:

• A web service is a web application. A web service is provided to clients by a web server on the Internet. This implies that communication with a web service occurs via the HTTP protocol (because web servers communicate through this protocol). This is no coincidence: since system administrators tend to close most communication ports for safety reasons, often only port 80 (the HTTP port) remains open.

• A web service is a collection of functions, similar to a collection of remote procedure calls (RPCs). As with RPCs, information is encoded in XML, albeit according to the specifications of the SOAP protocol used by web services.

• A web service can be described, published, and called over the web. Although not further explored in this article, it is good to know that web services have their own telephone book. Web services can be catalogued, so that they are easier to find than web sites.

2.2 Technical Realization

In order to make use of a web service, a client has to find it first. This is done through the UDDI¹ protocol. This is a large telephone book for web services. Providers of so-called UDDI registries include Microsoft and SAP. A UDDI entry points to a URI (an Internet location) where the web service can be found.

A client may ask a web service which functions it offers, and what data types are used to pass the function arguments and results. This information is expressed in a WSDL² document. Communication with a web service occurs over the HTTP protocol. Since HTTP offers no particular facilities to encode function arguments and results, there exists an additional XML layer, known as SOAP³ (Simple Object Access Protocol), for the encoding. SOAP messages are carried in the body of an HTTP request or HTTP response.

¹ Universal Description, Discovery and Integration
² Web Service Description Language
³ The O (for Object) in the SOAP acronym may be confusing, since SOAP has nothing to do with objects. The O is a relic from the days when SOAP was intended to be object-oriented, and was kept only because SOAP sounds good as an acronym.
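To give an impression of what such a SOAP exchange looks like on the wire, the following is an illustrative SOAP request for the method getCourseEx(year, code) of the course web service that we develop later in this article (section 4), sent in the body of an HTTP POST. The envelope structure is standard SOAP; the host, service path and XML namespace are assumptions made here for the sake of the example.

POST /CourseService.asmx HTTP/1.1
Host: www.cs.kun.nl
Content-Type: text/xml; charset=utf-8
SOAPAction: "http://tempuri.org/getCourseEx"

<?xml version="1.0" encoding="utf-8"?>
<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
  <soap:Body>
    <getCourseEx xmlns="http://tempuri.org/">
      <year>2004</year>
      <code>I00027</code>
    </getCourseEx>
  </soap:Body>
</soap:Envelope>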
2.3 Screen Scraping

In this article, we explore how a web service may be used to access an information system in such a way that the structure of the information obtained is more valuable than the structure in which that system provides it.

In this case, we will consider the web site of the Radboud University faculty of Computer Science, which contains a listing of courses [7]. The web site does not offer any way of structuring this information, other than listing the available courses alphabetically. The web site is a closed system: the user only has access to the data through HTTP, where the data are always returned in a fixed structure and order.

Our web service is also just a user, and has no special access to the database from which the web site gets its data. Is it possible to present the data in a different (searchable) structure?

A technique often used to access information stored in closed systems (often older systems that remain valuable only because of the information they store) is screen scraping. A screen scraper communicates with a system as though the scraper were an ordinary user. It navigates through the system's user screens and 'reads' information.
The designer of a screen scraper closely examines the source system to find out where, and on which screen, the data that he needs reside. The screen scraper becomes part of a bigger program outside of the information system, which can profit from the information that the screen scraper offers in several ways:

• The program can offer a different (more modern, faster) user interface to the old information system;

• The program can structure the information in a different way than the old information system.

2.4 Scraping Strategy

A screen scraper can be applied once or continuously. A one-time screen scraper retrieves all the information from the old information system and places it in a (modern) database. After this, the old system can be decommissioned. With a continuous screen scraper, the old system remains active and the screen scraper retrieves the information from the old system's screens every time it is requested.

Sometimes the old system is not an old system at all, but a system that is not controlled by the screen scraper developer and which can only be accessed through the screens that the system offers. An example of this is a third party web site. Interestingly, such a system (a web site) does not present its data in a fixed screen location, but rather in a continuous HTML stream. It is (almost) impossible to specify a fixed screen location for a data item, but a relative location can be given: the HTML can be searched context-sensitively using regular expressions.

We can still choose either a one-time or a continuous scraping approach. However, web sites tend to be dynamic, so that the one-time scrape must be executed regularly after all (for instance, daily) in order to keep the data current.
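As a small illustration of this context-sensitive searching, the following C# fragment extracts a course code and title from a piece of HTML with a regular expression. The row layout and the course title used here are made up for the example; a real scraper uses an expression tailored to the actual page.

using System;
using System.Text.RegularExpressions;

class RegexScrapeDemo {
    static void Main() {
        // A made-up fragment of HTML, standing in for part of a scraped page.
        string html = "<tr><td>I00027</td><td>Functional Programming</td></tr>";

        // Search by context (the surrounding td tags), not by screen position.
        Regex row = new Regex("<td>(?<code>I\\d{5})</td><td>(?<title>[^<]+)</td>");
        Match m = row.Match(html);
        if (m.Success) {
            Console.WriteLine(m.Groups["code"].Value);   // prints: I00027
            Console.WriteLine(m.Groups["title"].Value);  // prints: Functional Programming
        }
    }
}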
2.5 Existing Implementations

Screen scraping is often used to gain access to old but essential systems. But is the technique also used to access modern systems? We have found some implementations that do. HTML screen scraping is used to generate so-called RSS⁴ feeds for web sites that do not offer them. Examples include:

• Lau Taarnskov wrote RSSscraper [8], a Ruby program that, given a collection of regular expressions, is able to generate an RSS feed with the headlines from a news site. This program also acts as a web server. The author assumes that the program will run on a different computer than the web site itself (just as we do in this article).

• Bill Humphries [4] explains that a similar result can be achieved using a combination of the tools curl and tidy, and an XML processor.

These projects do not use a web service, but generate an RSS feed directly; the screen scraping technique, however, is the same.

⁴ Rich Site Summary

3 Problem Definition

How can we make existing information, offered by a web site, available through a web service?

The web site of the faculty of Computer Science of the Radboud University of Nijmegen has a list of courses that the faculty offers [7]. The search possibilities are limited, since there is only an alphabetical list of available courses.

It must be possible to design a web service that uses the information that the web site offers to create a better search system for the user. In this project, we will make a web service in ASP.NET that solves this problem using screen scraping. The user sends a request to the web service, which in turn talks to the university's web site and returns a table of search results (in XML). Possible search requests include:

• Return all courses in alphabetical order (same as the web site);

• Return all courses with the word "bachelor" in the course title;

• Return all courses that start in the fall;

• Return all courses, sorted by course code;

• Return the course with unique code I00027.
The web service should be able to execute these search requests. Now we can ask a number of questions pertaining to the development of the web service:

• Web services use HTTP GET, HTTP POST and SOAP protocols. How does this influence the data types we can use?

• Does this solution return the requested data fast enough? Internet communication can be slow, depending on the required number of network calls.

• How can we trace a web service in development?

4 Solution

4.1 Web Service

The web service is developed in ASP.NET, using Microsoft Visual Studio .NET. The following types are used by the service:

Course – (class) contains general data on a course.

public class Course {
    public string year;
    public string code;
    public string title;
}

Person – (class) contains information about a course's teacher.

public class Person {
    public string role;
    public string department;
    public string name;
}

CourseEx – (class) contains full course data.

public class CourseEx {
    public string year;
    public string code;
    public string title;
    public int weight;
    public Person[] people;
}

The web service offers two methods:

getCourses: returns a list of available courses, with year, code and title information. The required information is taken from the university's main course list page (which contains only year, code and title information). The courses are placed in an array of Course instances and returned to the caller in XML (simplified):

<?xml version="1.0" encoding="utf-8" ?>
<ArrayOfCursus>
  <Cursus>
    <jaar>2004</jaar>
    <code>I00001</code>
    <titel>Abstraction and Composition in Programming</titel>
  </Cursus>
  <Cursus>
    <jaar>2004</jaar>
    <code>I00004</code>
    <titel>Embedded Systems</titel>
  </Cursus>
  ...
</ArrayOfCursus>

getCourseEx(year, code): returns all data for a specific course. Given a year and a course code, this method returns all available data for the matching course. Apart from the year, title and code, this includes the course's weight (in credits) and the people that teach the course. The information is retrieved from the main page containing the list of courses, and from the course's own page. The information is stored in an instance of CourseEx and returned as XML (simplified):

<?xml version="1.0" encoding="utf-8" ?>
<CursusEx>
  <jaar>2004</jaar>
  <code>I00089</code>
  <titel>Semantics and Correctness</titel>
  <gewicht>6</gewicht>
  <personen>
    <Persoon>
      <rol>examinator</rol>
      <afdeling>g</afdeling>
      <naam>Dr Eric Reynolds</naam>
    </Persoon>
    <Persoon>
      <rol>teacher</rol>
      <afdeling>st</afdeling>
      <naam>Dr Raymond Watson</naam>
    </Persoon>
  </personen>
</CursusEx>

The information that the web service returns is retrieved from third party web pages. The web service loads the page and searches it using regular expressions. Of course, the page structure is apt to change every so often, but regular expression construction is easy enough for the web service to be kept up to date.
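The scraping can be sketched in C# as follows. This is not the actual implementation, but a simplified version of what a getCourses web method along these lines looks like; the service class name, the course list URL and the regular expression are assumptions made for the sake of the example and would have to match the real page.

using System.Collections;
using System.Net;
using System.Text;
using System.Text.RegularExpressions;
using System.Web.Services;

public class CourseService : WebService {

    [WebMethod]
    public Course[] getCourses() {
        // Download the main course list page (URL assumed for illustration).
        WebClient client = new WebClient();
        byte[] raw = client.DownloadData(
            "http://www.cs.kun.nl/dynamic/db/kiescollege1.cfm");
        string html = Encoding.UTF8.GetString(raw);

        // One match per course row; the pattern must be kept in sync with the page.
        Regex row = new Regex(
            "<td>(?<jaar>\\d{4})</td>\\s*<td>(?<code>I\\d{5})</td>\\s*<td>(?<titel>[^<]+)</td>");

        ArrayList found = new ArrayList();
        foreach (Match m in row.Matches(html)) {
            Course c = new Course();
            c.year  = m.Groups["jaar"].Value;
            c.code  = m.Groups["code"].Value;
            c.title = m.Groups["titel"].Value;
            found.Add(c);
        }

        // Return a typed array: as section 6 explains, a typed array can be
        // serialized into the SOAP response, whereas an ArrayList cannot.
        return (Course[]) found.ToArray(typeof(Course));
    }
}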
4.2 Web Client

It turns out that clients for our web service can be written in all sorts of programming languages. This includes Microsoft .NET languages such as C# and Visual Basic .NET, but also Borland Delphi (which comes with components for SOAP communication), PHP, and even the little-known scripting language Ruby can execute SOAP requests. The client was not written in C#, like the service itself, but rather in Delphi. The separation between implementation platforms is such that the client can only make use of the web service's WSDL information, not of any code shared between service and client.

The problem definition states that a number of different searches must be possible. However, a web service should be as simple as possible, so that it is easy for a (client) developer to understand, and to keep the number of network calls down (a speed issue). This is why the web service only has two methods. The client should easily be able to answer the questions from the problem definition using these methods.

In our implementation (not included in this article), we use the web service to retrieve a list of all courses, with columns for year, title and code. The user can then click on a course to get additional course information (credits and people). We have found that it is not possible to create a list that already contains credits and people information, because this information is stored on a separate web page for each course. It simply takes too long for the web service to retrieve all this information from the web site due to the number of network calls involved (for 132 courses, it took 4 minutes).
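Although our client is written in Delphi, any language with SOAP support will do. As an illustration only, a C# client built against a proxy class generated from the service's WSDL (for instance with Visual Studio's "Add Web Reference" or the wsdl.exe tool) could answer one of the problem definition's questions like this; the proxy class name is an assumption.

using System;

class ClientDemo {
    static void Main() {
        // CourseService is the proxy class generated from the WSDL (name assumed).
        CourseService service = new CourseService();
        Course[] courses = service.getCourses();

        // All courses with the word "bachelor" in the title, filtered client-side.
        foreach (Course c in courses) {
            if (c.title.ToLower().IndexOf("bachelor") >= 0) {
                Console.WriteLine(c.code + " " + c.title);
            }
        }
    }
}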
5 Protocols

A web service can make use of the protocols HTTP GET, HTTP POST or SOAP. A web service made with Visual Studio .NET supports all three, unless protocols are turned off in the WSDL specification. Are there any functional differences between these protocols? At first glance, there are none: HTTP GET and POST are supported to make web service debugging easier, by accessing the service with a browser. The function arguments are passed to the service as URL arguments (GET) or in the body of the HTTP request (POST), but the result is returned in XML just as with SOAP.

The difference is that SOAP is not necessarily tied to HTTP: it can also be used over a different transport protocol (while GET and POST cannot). Since web service responses are encoded in XML, there is no difference between the function return values returned by GET/POST and SOAP. This is not so for function arguments: with SOAP, these are encoded in XML, thus supporting complex types, while GET and POST only support simple data types. In this respect, SOAP is the more powerful after all.
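To make the HTTP GET variant concrete: a client can invoke a method simply by requesting a URL with the arguments in the query string, and the service answers with an XML document. The host and .asmx path below are assumptions; the pattern, however, is typical of ASP.NET web services.

using System;
using System.Net;
using System.Text;

class HttpGetDemo {
    static void Main() {
        // Arguments travel as URL parameters; only simple types fit this form.
        WebClient client = new WebClient();
        byte[] data = client.DownloadData(
            "http://localhost/CourseService.asmx/getCourseEx?year=2004&code=I00027");

        // The response is plain XML, just as with SOAP.
        Console.WriteLine(Encoding.UTF8.GetString(data));
    }
}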
6 Data Types

It turns out that SOAP supports all simple data types, as well as arrays, classes and structures. For classes, only those members are passed that can be read from an instance using introspection. We can conclude from this that the power of SOAP largely depends on the implementation language. A comparison:

• Through introspection, C# can return only public class members. Private members are ignored in SOAP communication.

• Delphi does not have introspection⁵ and is thus unable to pass class members dynamically.
• Ruby’s introspection is able to read both public and private class members and pass them
in a SOAP message. With this, Ruby is the
most powerful SOAP supporter: the programmer does not have to jump through any hoops
to include all class members in SOAP messages.
web service becomes a (database) writer instead
of a (web page) reader, and that multiple requests
to the web service may create race conditions. In
that case, a plaintext database will not suffice, but
a database with mutual exclusion (e.g. MySQL)
will work. This makes the web service more complex. Whereas a continuous service is not much
more than a simple filter, a one-time service must
do more work, and the chances of errors are greater.
Classes may also be nested (as long as this happens only via public members in C#). Arrays of objects are also translated correctly, but only if they
are system arrays with a fixed element type. The
Tracing
C# ArrayList is a polymorphic list so that C# has no 8
way of knowing the ArrayList’s element type. Such
Debugging a web service is hard, because it runs in
an array cannot be included in a SOAP message.
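The following fragment makes these rules concrete for our own types. It is a sketch only: the class and method names are invented for the example.

using System.Web.Services;

// Only the public members of a class are serialized when it is returned
// from a web method; a private member never appears in the SOAP message.
public class AnnotatedCourse {
    public string code;            // serialized
    public string title;           // serialized
    private string internalNote;   // ignored in SOAP communication
}

public class DataTypeDemo : WebService {
    // A typed array (fixed element type) serializes without problems.
    // For a polymorphic ArrayList the element type cannot be derived from
    // the declaration, so a typed array is the safe choice here.
    [WebMethod]
    public AnnotatedCourse[] getAnnotatedCourses() {
        AnnotatedCourse c = new AnnotatedCourse();
        c.code = "I00004";
        c.title = "Embedded Systems";
        return new AnnotatedCourse[] { c };
    }
}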
7 Speed

In this article, we explore the feasibility of screen scraping in web services. Screen scraping works, and is easy to implement using regular expressions. This is why the bad news is not the screen scraping itself, but the structure of the web site that contains the data.

The case of the university's course list is a good example. The list of courses (with year, code and title information) consists of one page and can be retrieved quickly. But the extra information for each course (credits and people) is stored on a separate page for each course. If a client wants a list of all courses with complete course information (year, code, title, weight and people), then for n courses the service must retrieve n + 1 web pages, and that simply takes too long. The delay for an ADSL connection and 132 courses (currently available on the university's web site) is almost four minutes. Because of this, the client's search possibilities are limited.

Now, the client program can only offer a list with limited course information, and by clicking on a course the user can get additional course information. This means that the user can never sort courses by weight, or request a list of courses taught by a given teacher. Depending on the information needs of the user, this can make the web service unusable.

The web service could be used to fill a database once (or once a day, for instance) with course data. In this scenario, the four minute wait may not be a problem. The danger of this approach is that the web service becomes a (database) writer instead of a (web page) reader, and that multiple requests to the web service may create race conditions. In that case, a plaintext database will not suffice, but a database with mutual exclusion (e.g. MySQL) will. This makes the web service more complex. Whereas a continuous service is not much more than a simple filter, a one-time service must do more work, and the chances of errors are greater.
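A middle ground between scraping on every request and maintaining a full database is to cache the scraped results in memory and refresh them at most once per day (or whatever interval is acceptable). A minimal sketch, assuming a helper ScrapeAllCourses() that performs the slow n + 1 page retrieval:

using System;

public class CourseCache {
    private static CourseEx[] cached;
    private static DateTime lastRefresh = DateTime.MinValue;
    private static readonly object padlock = new object();

    // Returns cached data, rescraping at most once per day. The first
    // request after the cache expires still pays the full scraping delay.
    public static CourseEx[] GetCourses() {
        lock (padlock) {   // serialize refreshes, avoiding the race conditions noted above
            if (DateTime.Now - lastRefresh > TimeSpan.FromDays(1)) {
                cached = ScrapeAllCourses();   // assumed helper: the slow part
                lastRefresh = DateTime.Now;
            }
            return cached;
        }
    }

    private static CourseEx[] ScrapeAllCourses() {
        // Placeholder for the n + 1 page retrieval described in the text.
        return new CourseEx[0];
    }
}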
8 Tracing

Debugging a web service is hard, because it runs in the context of a web server (in our case, Microsoft Internet Information Server), so that the debugger cannot access the running process. As long as we use Microsoft Visual Studio, however, this is not a problem: this IDE is able to intercept the process anyway, so that the developer can stop the service with breakpoints and examine the memory contents.

If the developer does not have access to Visual Studio, all is not lost. Breakpoints are not available, and a web service cannot send any information to stdout. However, it is possible to write debug information to the Windows event log. This is true for all processes that do not have a console.

Yet another method is the use of SOAP extensions. The .NET framework offers an attribute class named SoapExtension, from which the developer can derive a class. SoapExtension specifies the method ProcessMessage(message), which allows us to view the original SOAP message just after it is received or just before it is sent. Since the SOAP message is the run-time information in which the developer is most interested (this is data that cannot be verified at compile time), writing the message to a file on disk is a useful debugging method [1]. A class derived from SoapExtension already included in the .NET framework is TraceExtension. A sample of use:
[WebMethod]
[TraceExtension("d:\\trace.log")]
public int Add( int a, int b )
{
    return a + b;
}
This yields (somewhat compressed for readability):

Request:
<Add>
  <a>10</a>
  <b>5</b>
</Add>

Response:
<AddResponse>
  <AddResult>15</AddResult>
</AddResponse>

9 Regular Expressions and XHTML

An argument against screen scraping solutions is that when the owner of the source web site changes its structure, however slightly, the screen scraper will cease to work correctly. This is true. However, there are some arguments that justify the use of a screen scraper:

• If the web site belongs to a third party, screen scraping may be the only way we can access the data we need;

• Regular expressions are easy to build. When the structure of a web site changes, we only need to adapt our expressions to repair our screen scraper. It would be good if the screen scraper were able to send a message to its maintainer if a parse error occurs (possibly because the source web site's structure changed).

There are other ways to make the scraping process more robust [4]. With the tool tidy we can convert an HTML stream into an XHTML stream, from which an XML document can be derived. This document is always well-formed, whatever the web site's layout, so that it can always be parsed correctly. The price that we pay is that style sheets must be supplied to process the resulting XHTML correctly. These style sheets must be kept up to date (thus recreating the original problem), but at least they can be separated from the scraper code. They are only text documents, after all.
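As a sketch of this approach: once tidy has turned the page into well-formed XHTML, the scraper can load it as XML and select data by document structure instead of by regular expression. The file name and the XPath expression below are illustrative only and would have to match the actual document.

using System;
using System.Xml;

class XhtmlScrapeDemo {
    static void Main() {
        // courses.xhtml: the course list page after conversion by tidy (assumed).
        XmlDocument doc = new XmlDocument();
        doc.Load("courses.xhtml");

        // XHTML elements live in a namespace, which the XPath query must mention.
        XmlNamespaceManager ns = new XmlNamespaceManager(doc.NameTable);
        ns.AddNamespace("x", "http://www.w3.org/1999/xhtml");

        // Select table cells by structure; the exact path depends on the page.
        XmlNodeList cells = doc.SelectNodes("//x:table//x:tr/x:td", ns);
        foreach (XmlNode cell in cells) {
            Console.WriteLine(cell.InnerText);
        }
    }
}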
10 Conclusions

After implementing our screen scraper, we must conclude that although screen scraping works, it is too slow. In order to profit from an alternative view on the information offered by a web site, it is necessary to search multiple pages of that web site. In our example, this meant searching 132 pages per request, which took just under four minutes.

Moreover, screen scraping a web site is not robust, but this is a problem that is only important when the structure of a web site changes often. Also, this problem can be solved partially by converting HTML into XHTML, and parsing the XML instead of the HTML.

If one is prepared to augment the web service with a database, then the first problem is solved. Caching data can reduce the response time to an acceptable duration (i.e., comparable to the time it takes to request one web page using a browser), but a strategic time must be chosen to refresh the data in the cache. If this does not happen frequently enough, the web service is liable to produce old data.

References

[1] Richard Anderson, Brian Francis, Alex Homer, Rob Howard, Dave Sussman and Karli Watson [2001], Professional ASP.NET, Wrox Press Ltd.

[2] Michael Champion, Chris Ferris, Eric Newcomer and David Orchard [2002], Web Services Architecture, W3C Working Draft, November 2002, W3C.
    http://www.w3.org/TR/2002/WD-ws-arch-20021114/
[3] Factory 3x5 [2004], Glossary.
    http://www.factory3x5.com/more info/glossary.xml

[4] Bill Humphries [2003], "Supporting the Desperate Hacker", in More Like This Weblogs, 21 August 2003.
    http://www.whump.com/moreLikeThis/date/21/08/2003

[5] IBM [2002], IBM WebSphere Host Publisher Administrator's And User's Guide.
    http://www-306.ibm.com/software/webservers/hostpublisher/

[6] PerfectXML [2004], Glossary.
    http://www.perfectxml.com/glossary4.asp

[7] Radboud University of Nijmegen [2004], Faculty of Computer Science Courses Overview.
    http://www.cs.kun.nl/dynamic/db/kiescollege1.cfm

[8] Lau Taarnskov [2004], Ruby RSSscraper.
    http://rssscraper.rubyforge.org/

[9] Roger Wolter [2001], XML Web Services Basics, Microsoft Corp.
    http://msdn.microsoft.com/webservices/understanding/webservicebasics/default.aspx