Screen scraping web services

Alex van Oostenrijk
Department of Computer Science, Radboud University of Nijmegen
Email: [email protected]

December 2004

Abstract

Remote software systems can provide web services that offer us remote functions to call through the HTTP protocol. In this paper, we investigate whether we can use a web service as an intermediary between us and a web site to which we have no means of access other than regular (browser) access, in order to restructure the information contained in the web site for our purposes. The technique we explore is called screen scraping, where our web service scans the web site's HTML for the information that we need. We build a sample ASP.NET implementation, and investigate what limitations there are to the protocol and data types that web services use. We find that speed is our biggest problem, and present methods to improve response times.

1 Introduction

Web services are commonly used to provide access to a system that is otherwise very secure (i.e. all ports are closed). The HTTP port (80) is normally left open for browsers to access, and it is onto HTTP that the web services' protocol, SOAP, is layered. This allows us to develop systems that allow remote procedure calls while keeping system administrators happy. We describe the workings of web services in detail in section 2.

However, web services need not always be integrated with the host system. It is also possible for a web service to be a system (host) by itself, and talk to another system in turn. In this text we treat a web service as a filter. It receives a request from a client, and talks to its server to retrieve the data it needs to answer the request, processes (restructures) it, and returns it to the client.

In the sample application that we shall build, the web service's server is a remote web site, and the web service is used to retrieve data from that web site and restructure it so that its client can present the data in other formats than the web site does. Our demonstration project is described more fully in section 3. We develop a web service and a client for it (section 4).

Issues that we will discuss on the way include the web service protocols (SOAP, HTTP GET or HTTP POST), the data types that web service traffic can handle, and most importantly, speed (sections 5-7). It is also interesting to see how web services can be traced during development (section 8). Finally, we see how our data harvesting technique, screen scraping, can be made more robust (section 9).

2 Background

2.1 Definition of Web Service

What is a web service? We can find a number of different definitions on the Internet, including:

"A self-contained, modular application that can be described, published, located, and invoked over the Web. Platform-neutral and based on open standards, Web Services can be combined with each other in different ways to create business processes that enable you to interact with customers, employees, and suppliers." [5]

and

"A web service is a collection of functions that are packaged as a single entity and published to the network for use by other programs. Web services are building blocks for creating open distributed systems, and allow companies and individuals to quickly and cheaply make their digital assets available worldwide." [6]

and

"A Web service is any piece of software that makes itself available over the Internet and uses a standardized XML messaging system. There should be some simple mechanism for interested parties to locate the service and locate its public interface. The most prominent directory of Web services is currently available via UDDI, or Universal Description, Discovery, and Integration." [3]

From these definitions, we can distill the following properties of a web service:

• A web service is a web application. A web service is provided to clients by a web server on the Internet. This implies that communication with a web service occurs via the HTTP protocol (because web servers communicate through this protocol). This is no coincidence: since system administrators tend to close most communication ports for safety reasons, often only port 80 (the HTTP port) remains.

• A web service is a collection of functions, similar to a collection of remote procedure calls (RPCs). As with RPCs, information is coded in XML, albeit according to the specifications of the SOAP protocol used by web services.

• A web service can be described, published, and called over the web. Although not further explored in this article, it is good to know that web services have their own telephone book: web services can be catalogued, so that they are easier to find than web sites.
2.2 Technical Realization

In order to make use of a web service, a client has to find it first. This is done through the UDDI (Universal Description, Discovery and Integration) protocol. UDDI is a large telephone book for web services. Providers of so-called UDDI registries include Microsoft and SAP. A UDDI entry points to a URI (an Internet location) where the web service can be found.

A client may ask a web service which functions it offers, and what data types are used to pass the function arguments and results. This information is expressed in a WSDL (Web Service Description Language) document.

Communication with a web service occurs over the HTTP protocol. Since HTTP offers no particular facilities to encode function arguments and results, there exists an additional XML layer, known as SOAP (Simple Object Access Protocol), for the encoding. (The O in the SOAP acronym, for Object, may be confusing, since SOAP has nothing to do with objects. It is a relic from the days when SOAP was intended to be object-oriented, and was kept only because SOAP sounds good as an acronym.) SOAP messages are carried in the body of an HTTP request or HTTP response.
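To make this layering concrete, the listing below sketches what the traffic for a call to a method named getCourses (the service we build in section 4) could look like. The host name and XML namespace are placeholders rather than the values used by our implementation; ASP.NET derives the exact envelope from the WSDL document.

POST /CourseService.asmx HTTP/1.1
Host: www.example.org
Content-Type: text/xml; charset=utf-8
SOAPAction: "http://example.org/courses/getCourses"

<?xml version="1.0" encoding="utf-8"?>
<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
  <soap:Body>
    <getCourses xmlns="http://example.org/courses/" />
  </soap:Body>
</soap:Envelope>

The HTTP response carries a similar envelope whose body contains the XML-encoded return value.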
2.3 Screen Scraping

In this article, we explore how a web service may be used to access an information system in such a way that the structure of the information obtained is more valuable than the structure provided by that system. In this case, we will consider the web site of the Radboud University faculty of Computer Science, which contains a listing of courses [7]. The web site does not offer any way of structuring this information, other than listing the available courses alphabetically.

The web site is a closed system: the user only has access to the data through HTTP, where the data are always returned in a fixed structure and order. Our web service is also a user, and has no exclusive access to the database from which the web site gets its data. Is it possible to present the data in a different (searchable) structure?

A technique often used to access information stored in closed systems (often older systems, which remain valuable only because of the information they store) is screen scraping. A screen scraper communicates with a system as though the scraper is an ordinary user. It navigates through the system's user screens and 'reads' information. The designer of a screen scraper closely examines the source system to find at what location and on which screen the data that he needs resides. The screen scraper becomes part of a bigger program outside of the information system, which can profit from the information that the screen scraper offers in several ways:

• The program can offer a different (more modern, faster) user interface to the old information system;

• The program can structure the information in a different way than the old information system.

2.4 Scraping Strategy

A screen scraper can be applied once or continuously. A one-time screen scraper retrieves all the information from the old information system and places it in a (modern) database. After this, the old system can be decommissioned. With a continuous screen scraper, the old system remains active and the screen scraper retrieves the information from the old system's screens every time it is requested.

Sometimes the old system is not an old system at all, but a system that is not controlled by the screen scraper developer and which can only be accessed through the screens that the system offers. An example of this is a third party web site. Interestingly, such a system (a web site) does not present its data in a fixed screen location, but rather in a continuous HTML stream. It is (almost) impossible to specify a fixed screen location for a data item, but a relative location can be given: the HTML can be searched context-sensitively using regular expressions (a short sketch follows at the end of this section).

We can still choose either a one-time or a continuous scraping approach. However, web sites tend to be dynamic, so that the one-time scrape must be executed regularly after all (for instance, daily) in order to keep the data current.
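As an illustration of searching by relative location, the C# fragment below extracts a course code from a hypothetical piece of HTML by describing its context rather than its position on the screen. The markup and the pattern are invented for the example; the real expressions depend on the page's actual layout.

using System;
using System.Text.RegularExpressions;

class RelativeLocationExample
{
    static void Main()
    {
        // A fragment as it might appear somewhere in the HTML stream.
        string html = "<td>Course code:</td><td><b>I00027</b></td>";

        // Find whatever code follows the label, regardless of where it appears on the page.
        Match m = Regex.Match(html, @"Course code:</td>\s*<td><b>(?<code>I\d{5})</b>");
        if (m.Success)
            Console.WriteLine(m.Groups["code"].Value);   // prints I00027
    }
}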
2.5 Existing Implementations

Screen scraping is often used to gain access to old but essential systems. But is the technique also used to access modern systems? We have found some implementations that do. HTML screen scraping is used to generate so-called RSS (Rich Site Summary) feeds for web sites that do not offer them. Examples include:

• Lau Taarnskov wrote RSSscraper [8], a Ruby program that, given a collection of regular expressions, is able to generate an RSS feed with the headlines from a news site. This program also acts as a web server. The author assumes that the program will run on a different computer than the web site itself (just as we do in this article).

• Bill Humphries [4] explains that a similar result can be achieved using a combination of the tools curl and tidy, and an XML processor.

These projects do not use a web service, and they generate an RSS feed rather than answer queries, but the screen scraping technique is the same.

3 Problem Definition

How can we make existing information, offered by a web site, available through a web service? The web site of the faculty of Computer Science of the Radboud University of Nijmegen has a list of courses that the faculty offers [7]. The search possibilities are limited, since there is only an alphabetical list of available courses. It must be possible to design a web service that uses the information that the web site offers to create a better search system for the user.

In this project, we will make a web service in ASP.NET that solves this problem using screen scraping. The user sends a request to the web service, which in turn talks to the university's web site and returns a table of search results (in XML). Possible search requests include:

• Return all courses in alphabetical order (same as the web site);
• Return all courses with the word "bachelor" in the course title;
• Return all courses that start in the fall;
• Return all courses, sorted by course code;
• Return the course with unique code I00027.

The web service should be able to execute these search requests. Now we can ask a number of questions pertaining to the development of the web service:

• Web services use the HTTP GET, HTTP POST and SOAP protocols. How does this influence the data types we can use?
• Does this solution return the requested data fast enough? Internet communication can be slow, depending on the required number of network calls.
• How can we trace a web service in development?

4 Solution

4.1 Web Service

The web service is developed in ASP.NET, using Microsoft Visual Studio .NET. The following types are used by the service:

Course – (class) contains general data on a course.

public class Course {
    public string year;
    public string code;
    public string title;
}

Person – (class) contains information about a course's teacher.

public class Person {
    public string role;
    public string department;
    public string name;
}

CourseEx – (class) contains full course data.

public class CourseEx {
    public string year;
    public string code;
    public string title;
    public int weight;
    public Person[] people;
}

The web service offers two methods:

getCourses: returns a list of available courses, with year, code and title information. The required information is taken from the university's main course list page (which contains only year, code and title information). The courses are placed in an array of Course instances and returned to the caller in XML (simplified):

<?xml version="1.0" encoding="utf-8" ?>
<ArrayOfCursus>
  <Cursus>
    <jaar>2004</jaar>
    <code>I00001</code>
    <titel>Abstraction and Composition in Programming</titel>
  </Cursus>
  <Cursus>
    <jaar>2004</jaar>
    <code>I00004</code>
    <titel>Embedded Systems</titel>
  </Cursus>
  ...
</ArrayOfCursus>

getCourseEx(year, code): returns all data for a specific course. Given a year and a course code, this method returns all available data for the matching course. Apart from the year, title and code, this includes the course's weight (in credits) and the people that teach the course. The information is retrieved from the main page containing the list of courses, and from the course's own page. The information is stored in an instance of CourseEx and returned as XML (simplified):

<?xml version="1.0" encoding="utf-8" ?>
<CursusEx>
  <jaar>2004</jaar>
  <code>I00089</code>
  <titel>Semantics and Correctness</titel>
  <gewicht>6</gewicht>
  <personen>
    <Persoon>
      <rol>examinator</rol>
      <afdeling>g</afdeling>
      <naam>Dr Eric Reynolds</naam>
    </Persoon>
    <Persoon>
      <rol>teacher</rol>
      <afdeling>st</afdeling>
      <naam>Dr Raymond Watson</naam>
    </Persoon>
  </personen>
</CursusEx>

The information that the web service returns is retrieved from third party web pages. The web service loads the page and searches it using regular expressions. Of course, the page structure is apt to change every so often, but regular expression construction is easy enough for the web service to be kept up to date.

The problem definition states that a number of different searches must be possible. However, a web service should be as simple as possible, so that it is easy for a (client) developer to understand, and to keep the number of network calls down (a speed issue). This is why the web service only has two methods. The client should easily be able to answer the questions from the problem definition using these methods.
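The sketch below shows how such a service might be declared in ASP.NET. It is not the actual implementation: the XML namespace and the regular expression are placeholders and error handling is omitted. It only illustrates how a [WebMethod] that scrapes the course list page and returns a typed array could be put together, using the Course class defined above.

using System.Collections;
using System.Net;
using System.Text;
using System.Text.RegularExpressions;
using System.Web.Services;

[WebService(Namespace = "http://example.org/courses/")]
public class CourseService : WebService
{
    // The course overview page listed as reference [7].
    const string ListUrl = "http://www.cs.kun.nl/dynamic/db/kiescollege1.cfm";

    [WebMethod]
    public Course[] getCourses()
    {
        // Retrieve the raw HTML of the course list page.
        string html;
        using (WebClient client = new WebClient())
        {
            html = Encoding.UTF8.GetString(client.DownloadData(ListUrl));
        }

        // Hypothetical pattern with one match per course row; the real expression
        // must be written against the page's current markup.
        Regex row = new Regex(@"<td>(?<code>I\d{5})</td>\s*<td>(?<title>[^<]+)</td>");

        ArrayList courses = new ArrayList();
        foreach (Match m in row.Matches(html))
        {
            Course c = new Course();
            c.year = "2004";
            c.code = m.Groups["code"].Value;
            c.title = m.Groups["title"].Value.Trim();
            courses.Add(c);
        }

        // Return a typed array; ASP.NET serializes it to XML automatically.
        return (Course[])courses.ToArray(typeof(Course));
    }
}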
4.2 Web Client

It turns out that clients for our web service can be written in all sorts of programming languages. This includes Microsoft .NET languages such as C# and Visual Basic .NET, but also Borland Delphi (which comes with components for SOAP communication) and PHP; even the little-known scripting language Ruby can execute SOAP requests.

The client was not written in C#, like the service itself, but rather in Delphi. The separation between implementation platforms is such that the client can only make use of the web service's WSDL information, not of any code shared between service and client.

In our implementation (not included in this article), we use the web service to retrieve a list of all courses, with columns for year, title and code. The user can then click on a course to get additional course information (credits and people). We have found that it is not possible to create a list that already contains credits and people information, because this information is stored on a separate web page for each course. It simply takes too long for the web service to retrieve all this information from the web site due to the number of network calls involved (for 132 courses, it took 4 minutes).

5 Protocols

A web service can make use of the protocols HTTP (GET or POST) or SOAP. A web service made with Visual Studio .NET supports all three, unless protocols are turned off in the WSDL specification. Are there any functional differences between these protocols?

At first glance, there are none: HTTP GET and POST are supported to make web service debugging easier, by accessing the service with a browser. The function arguments are passed to the service as URL arguments (GET) or in the body of the request (POST), but the results are returned in XML just as with SOAP. The difference is that SOAP is not necessarily tied to HTTP: it can also be used over a different transport protocol (while GET and POST cannot).

Since web service responses are encoded in XML, there is no difference between the function return values returned by GET/POST and SOAP. This is not so for function arguments: with SOAP, these are encoded in XML, thus supporting complex types, while GET and POST only support simple data types. In this way, SOAP is more powerful after all.

6 Data Types

It turns out that SOAP supports all simple data types, as well as arrays, classes and structures. For classes, only those members are passed that can be read from an instance using introspection. We can conclude from this that the power of SOAP largely depends on the implementation language. A comparison:

• Through introspection, C# can return only public class members. Private members are ignored in SOAP communication.

• Delphi does not have introspection and is thus unable to pass class members dynamically. (Strictly speaking, Delphi does have introspection, known as RTTI (run-time type information), but this ill-documented aspect of the language is used internally by Borland and is far inferior to the introspection of C#, Ruby or Java.)

• Ruby's introspection is able to read both public and private class members and pass them in a SOAP message. With this, Ruby is the most powerful SOAP supporter: the programmer does not have to jump through any hoops to include all class members in SOAP messages.

Classes may also be nested (as long as this happens only via public members in C#). Arrays of objects are also translated correctly, but only if they are system arrays with a fixed element type. The C# ArrayList is a polymorphic list, so C# has no way of knowing the ArrayList's element type. Such a list cannot be included in a SOAP message.
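The small console experiment below illustrates the first point for C#. ASP.NET web services use the .NET XML serializer (XmlSerializer) to encode their messages, and that serializer only looks at public members. The class and values are invented for the demonstration; it is not part of the course service.

using System;
using System.IO;
using System.Xml.Serialization;

public class Demo
{
    public string title = "Embedded Systems";   // public field: appears in the XML
    private string note = "internal remark";     // private field: ignored by the serializer

    public override string ToString()
    {
        return title + " (" + note + ")";
    }
}

public static class Program
{
    public static void Main()
    {
        XmlSerializer serializer = new XmlSerializer(typeof(Demo));
        StringWriter writer = new StringWriter();
        serializer.Serialize(writer, new Demo());

        // The output contains a <title> element, but no trace of the private field.
        Console.WriteLine(writer.ToString());
    }
}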
7 Speed

In this article, we explore the feasibility of screen scraping in web services. Screen scraping works, and is easy to implement using regular expressions. This is why the bad news is not the screen scraping itself, but the structure of the web site that contains the data.

The case of the university's course list is a good example. The list of courses (with year, code and title information) consists of one page and can be retrieved quickly. But the extra information for each course (credits and people) is stored on a separate page for each course. If a client wants a list of all courses with complete course information (year, code, title, weight and people), then for n courses the service must retrieve n + 1 web pages, and that simply takes too long. The delay for an ADSL connection and 132 courses (currently available on the university's web site) is almost four minutes. Because of this, the client's search possibilities are limited.

Now the client program can only offer a list with limited course information, and by clicking on a course the user can get additional course information. This means that the user can never sort courses by weight, or request a list of courses taught by a given teacher. Depending on the information needs of the user, this can make the web service unusable.

The web service could be used to fill a database once (or once a day, for instance) with course data. In this scenario, the four minute wait may not be a problem. The danger of this approach is that the web service becomes a (database) writer instead of a (web page) reader, and that multiple requests to the web service may create race conditions. In that case, a plain-text database will not suffice, but a database with mutual exclusion (e.g. MySQL) will work. This makes the web service more complex. Whereas a continuous service is not much more than a simple filter, a one-time service must do more work, and the chances of errors are greater.
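A lighter alternative to a full database is to cache the scraped results in memory inside the service itself. The sketch below is not part of the implementation described above; it merely illustrates the idea, with the refresh interval chosen for the example. The lock also avoids the race condition between concurrent requests mentioned above.

using System;

public static class CourseCache
{
    static Course[] cached;                 // last scraped result
    static DateTime fetchedAt;              // when it was scraped
    static readonly TimeSpan MaxAge = TimeSpan.FromHours(24);   // refresh once a day
    static readonly object padlock = new object();

    // 'scrape' is the slow screen-scraping call (for instance, the body of getCourses).
    public static Course[] GetCourses(Func<Course[]> scrape)
    {
        lock (padlock)
        {
            if (cached == null || DateTime.Now - fetchedAt > MaxAge)
            {
                cached = scrape();
                fetchedAt = DateTime.Now;
            }
            return cached;
        }
    }
}

With such a cache, only the request that triggers a refresh pays the price of scraping; every other request is answered from memory.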
8 Tracing

Debugging a web service is hard, because it runs in the context of a web server (in our case, Microsoft Internet Information Server), so that the debugger cannot access the running process. As long as we use Microsoft Visual Studio, however, this is not a problem: this IDE is able to intercept the process anyway, so that the developer can stop the service with breakpoints and examine the memory contents.

If the developer does not have access to Visual Studio, all is not lost. Breakpoints are not available, and a web service cannot send any information to stdout. However, it is possible to write debug information to the Windows event log. This is true for all processes that do not have a console.

Yet another method is the use of SOAP extensions. The .NET framework offers a class named SoapExtension, from which the developer can derive his own class. SoapExtension specifies the method ProcessMessage(message), which allows us to view the original SOAP message just after it is received or just before it is sent.

Since the SOAP message is the run-time information in which the developer is most interested (this is data that cannot be verified at compile time), writing the message to a file on disk is a useful debugging method [1]. A class derived from SoapExtension already included in the .NET framework is TraceExtension. A sample of use:

[WebMethod]
[TraceExtension("d:\\trace.log")]
public int Add(int a, int b)
{
    return a + b;
}

This yields (somewhat compressed for readability):

Request:
<Add>
  <a>10</a>
  <b>5</b>
</Add>

Response:
<AddResponse>
  <AddResult>15</AddResult>
</AddResponse>

9 Regular Expressions and XHTML

An argument against screen scraping solutions is that when the owner of the source web site changes its structure, however slightly, the screen scraper will cease to work correctly. This is true. However, there are some arguments that justify the use of a screen scraper:

• If the web site belongs to a third party, screen scraping may be the only way we can access the data we need;

• Regular expressions are easy to build. When the structure of a web site changes, we only need to adapt our expressions to repair our screen scraper. It would be good if the screen scraper were able to send a message to its maintainer if a parse error occurs (possibly because the source web site's structure changed).

There are other ways to make the scraping process more robust [4]. With the tool tidy we can convert an HTML stream into an XHTML stream, from which an XML document can be derived. This document is always well-formed, whatever the web site's layout, so that it can always be parsed correctly. The price that we pay is that tidy must be supplied with style sheets in order to parse the original HTML correctly. These style sheets must be kept up to date (thus recreating the original problem), but at least they can be separated from the scraper code. They are only text documents, after all.
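The conversion step can be scripted. The sketch below pipes an HTML string through the external tidy tool and loads the result into an XML document that can be queried with XPath. It assumes tidy is installed and on the PATH; the command-line options and the sample markup are chosen for the illustration, not taken from our implementation.

using System;
using System.Diagnostics;
using System.Xml;

public static class TidyExample
{
    // Convert HTML to XHTML with tidy and parse the result as XML.
    public static XmlDocument ToXml(string html)
    {
        ProcessStartInfo psi = new ProcessStartInfo("tidy",
            "-asxhtml -numeric -quiet --show-warnings no --doctype omit");
        psi.RedirectStandardInput = true;
        psi.RedirectStandardOutput = true;
        psi.UseShellExecute = false;

        using (Process tidy = Process.Start(psi))
        {
            tidy.StandardInput.Write(html);
            tidy.StandardInput.Close();
            string xhtml = tidy.StandardOutput.ReadToEnd();
            tidy.WaitForExit();

            XmlDocument doc = new XmlDocument();
            doc.LoadXml(xhtml);
            return doc;
        }
    }

    public static void Main()
    {
        // Deliberately sloppy HTML: tidy repairs the unclosed <p> tag.
        XmlDocument doc = ToXml("<html><body><p>Functional Programming</body></html>");

        XmlNamespaceManager ns = new XmlNamespaceManager(doc.NameTable);
        ns.AddNamespace("x", "http://www.w3.org/1999/xhtml");
        foreach (XmlNode p in doc.SelectNodes("//x:p", ns))
            Console.WriteLine(p.InnerText);
    }
}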
10 Conclusions

After implementing our screen scraper, we must conclude that although screen scraping works, it is too slow. In order to profit from an alternative view on the information offered by a web site, it is necessary to search multiple pages of that web site. In our example, this meant searching 132 pages per request, which took just under four minutes.

Moreover, screen scraping a web site is not robust, but this is a problem that is only important when the structure of a web site changes often. Also, this problem can be solved partially by converting HTML into XHTML, and parsing the XML instead of the HTML.

If one is prepared to augment the web service with a database, then the first problem is solved. Caching data can reduce the response time to an acceptable duration (i.e., comparable to the time it takes to request one web page using a browser), but a strategic time must be chosen to refresh the data in the cache. If this does not happen frequently enough, the web service is liable to produce old data.

References

[1] Richard Anderson, Brian Francis, Alex Homer, Rob Howard, Dave Sussman and Karli Watson [2001], Professional ASP.NET, Wrox Press Ltd.

[2] Michael Champion, Chris Ferris, Eric Newcomer and David Orchard [2002], Web Services Architecture, W3C Working Draft, November 2002, W3C. http://www.w3.org/TR/2002/WD-ws-arch-20021114/

[3] Factory 3x5 [2004], Glossary. http://www.factory3x5.com/more info/glossary.xml

[4] Bill Humphries [2003], "Supporting the Desperate Hacker", in More Like This Weblogs, 21 August 2003. http://www.whump.com/moreLikeThis/date/21/08/2003

[5] IBM [2002], IBM WebSphere Host Publisher Administrator's and User's Guide. http://www-306.ibm.com/software/webservers/hostpublisher/

[6] PerfectXML [2004], Glossary. http://www.perfectxml.com/glossary4.asp

[7] Radboud University of Nijmegen [2004], Faculty of Computer Science Courses Overview. http://www.cs.kun.nl/dynamic/db/kiescollege1.cfm

[8] Lau Taarnskov [2004], Ruby RSSscraper. http://rssscraper.rubyforge.org/

[9] Roger Wolter [2001], XML Web Services Basics, Microsoft Corp. http://msdn.microsoft.com/webservices/understanding/webservicebasics/default.aspx