Intelligent agents (TME285)
Lecture 9, 20170215
Internet data acquisition

Mattias Wahde, PhD, Professor, Chalmers University of Technology
e-mail: [email protected], http://www.me.chalmers.se/~mwahde

Today’s learning goals
• After this lecture you should be able to
– Describe and use the HTMLDownloader class
– Describe and use the RSSDownloader class

Internet data acquisition
• Many IPAs require access to data obtained from the internet.
• For example, a news reader agent must, of course, be able to download news.
• Here, two approaches will be considered:
– Downloading and (custom) parsing of a given HTML file.
– Downloading and accessing data from an RSS feed (defined below).
• Classes for these tasks are included in the InternetDataAcquisition library.

HTML download
• The HTMLDownloader class downloads (at regular, user-specified intervals) the HTML contents of a web page.
• The data are stored as a single string, which must then be processed (parsed) in order to extract the relevant information.
• Note that the required parsing may differ from page to page.
• Note also that if the web page owner changes (the structure of) the page, one may have to modify the parsing accordingly.

HTML download
• For this reason, code that relies on a particular structure of a web page will often be brittle and error-prone.
• Still, in some cases it is necessary to use this approach.
• If so, check the HTMLDownloader class, which contains (some) methods for parsing the downloaded string.

HTML download
• After splitting a string into separate words (using, for example, the Split() method of the string class) and storing the words in a generic list, one can make use of the many methods available for generic lists in order to process the list of words.

HTML download
• NOTE: Some service providers do not allow (repeated) automatic downloads. If so, do respect these restrictions, and obtain the data from another source!
• Also, in some cases (e.g. Google searches), the links and other content on a page may be hidden in JavaScript code.

RSS feeds
• RSS (Really Simple Syndication) feeds are a special class of web pages intended for automatic analysis by computers.
• An RSS feed is formatted in a well-defined way (as an XML file), and can easily be parsed.
• Basically, an RSS feed can be stored in an object of class SyndicationFeed (defined in the namespace System.ServiceModel.Syndication).
• After downloading an RSS feed, the SyndicationFeed will contain a set of items of class SyndicationItem.

RSS feeds
• Each SyndicationItem, in turn, contains a number of well-defined fields, e.g.
– Title
– PublishDate
– Summary
– Id
– etc.
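The relation between SyndicationFeed and SyndicationItem described above can be illustrated with a minimal sketch. It does not use the RSSDownloader class (whose interface is not shown here); instead it loads a small, hand-written RSS document from a string using only standard .NET calls. The names FeedSketch, SampleRss, LoadFromString and DescribeItems are hypothetical, chosen only for this illustration. Depending on the .NET version, a reference to the assembly (or NuGet package) that provides System.ServiceModel.Syndication may be needed.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.ServiceModel.Syndication;
using System.Xml;

// Hypothetical helper class, for illustration only.
public static class FeedSketch
{
    // A tiny hand-written RSS 2.0 document, used here instead of a real download.
    public const string SampleRss =
        "<rss version=\"2.0\"><channel>" +
        "<title>Sample feed</title>" +
        "<link>http://example.com/</link>" +
        "<description>Hypothetical feed</description>" +
        "<item>" +
        "<title>First headline</title>" +
        "<guid isPermaLink=\"false\">item-1</guid>" +
        "<description>Short summary text</description>" +
        "<pubDate>Fri, 21 Oct 2016 07:14:17 GMT</pubDate>" +
        "</item>" +
        "</channel></rss>";

    // Parses an RSS document (given as a string) into a SyndicationFeed.
    public static SyndicationFeed LoadFromString(string xml)
    {
        using (StringReader stringReader = new StringReader(xml))
        using (XmlReader xmlReader = XmlReader.Create(stringReader))
        {
            return SyndicationFeed.Load(xmlReader);
        }
    }

    // Builds one text line per item from the fields Id, Title, PublishDate and Summary.
    public static List<string> DescribeItems(SyndicationFeed feed)
    {
        List<string> lines = new List<string>();
        foreach (SyndicationItem item in feed.Items)
        {
            // Title and Summary may be missing in some feeds, hence the null checks.
            string title = (item.Title != null) ? item.Title.Text : "";
            string summary = (item.Summary != null) ? item.Summary.Text : "";
            lines.Add(item.Id + " | " + title + " | " +
                      item.PublishDate.ToString("u") + " | " + summary);
        }
        return lines;
    }
}
```

A call such as FeedSketch.DescribeItems(FeedSketch.LoadFromString(FeedSketch.SampleRss)) should then return one line for the single item in the sample document. For a real feed, the XmlReader would be created from the feed's URL or from a downloaded string instead.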

RSS feeds: Date format
• One slight problem with RSS feeds (or, rather, with the XmlTextReader used for reading them) is that only some date formats are handled, e.g. “ddd, dd MMM yyyy hh:mm:ss”. (Example: Fri, 21 Oct 2016 07:14:17)
• Many RSS feeds, however, use a different date format, e.g. “ddd MMM dd yyyy hh:mm:ss ‘GMT+0000’”
• Example: Fri Oct 21 2016 07:28:19 GMT+0000.

RSS feeds: Date format
• For that reason, the RSSDownloader class contains a method for specifying a custom date format.
• Usage example: [code listing not recovered]
• The exact format specification will vary between RSS feeds.

The RSSReader application
• This program reads data (at regular intervals) from an RSS feed, and then shows them on screen.
• It uses two threads (in addition to the GUI thread, of course): one for downloading data, and one for displaying data on the screen.
• New items are shown in green, old items in gray.
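Returning to the date format issue discussed under “RSS feeds: Date format”, the following minimal sketch shows how a date string in the second format can be parsed with a custom format specification. It uses only the standard .NET method DateTime.TryParseExact and is not the RSSDownloader method mentioned above, whose name and signature are not shown here. The names DateFormatSketch, CustomFormat and TryParseFeedDate are hypothetical. The sketch assumes a 24-hour clock and therefore uses HH rather than hh.

```csharp
using System;
using System.Globalization;

// Hypothetical helper class, for illustration only.
public static class DateFormatSketch
{
    // Custom format matching strings such as "Fri Oct 21 2016 07:28:19 GMT+0000".
    // The part in single quotes is treated as literal text and must match exactly.
    // HH (24-hour clock) is an assumption; with hh, hours above 12 would be rejected.
    public const string CustomFormat = "ddd MMM dd yyyy HH:mm:ss 'GMT+0000'";

    // Returns true, and the parsed date (as UTC), if the text matches the custom format.
    public static bool TryParseFeedDate(string text, out DateTime result)
    {
        return DateTime.TryParseExact(
            text,
            CustomFormat,
            CultureInfo.InvariantCulture,  // English day and month names
            DateTimeStyles.AssumeUniversal | DateTimeStyles.AdjustToUniversal,
            out result);
    }
}
```

Since the exact format varies between feeds, a string in another layout (for example one with a comma after the day name) is rejected by this format and would need its own format specification.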