Intelligent agents (TME285) Lecture 9, 20170215

Internet data acquisition
Mattias Wahde, PhD, Professor, Chalmers University of Technology
e-mail: [email protected], http://www.me.chalmers.se/~mwahde
Today’s learning goals
• After this lecture you should be able to
– Describe and use the HTMLDownloader class
– Describe and use the RSSDownloader class
Internet data acquisition
• Many IPAs require access to data obtained from the
internet.
• For example, a news reader agent must, of course, be
able to download news.
• Here, two approaches will be considered:
– Downloading and (custom) parsing of a given HTML page.
– Downloading and accessing data from an RSS feed (defined
below).
• Classes for these tasks are included in the
InternetDataAcquisition library.
HTML download
• The HTMLDownloader class downloads (at regular, user-specified intervals) the HTML contents of a web page.
• The data are stored as a single string, which must then be
processed (parsed) in order to extract the relevant
information.
• Note that the required parsing may differ from page to
page.
• Note also that if the web page owner changes the
(structure of the) page, one may have to modify the
parsing accordingly.
HTML download
• For this reason, code that relies on a particular structure
of a web page will often be brittle and error-prone.
• Still, in some cases it is necessary to use this approach.
• If so, check the HTMLDownloader class, which contains
(some) methods for parsing the downloaded string.
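To illustrate why structure-dependent parsing is brittle, here is a minimal sketch of extracting the text between two markers in a downloaded string. It does not use the HTMLDownloader class; the class name HtmlSnippetParser and the method ExtractBetween are hypothetical, and a literal string stands in for the downloaded page.

```csharp
using System;

// Hypothetical helper, not part of the InternetDataAcquisition library.
public static class HtmlSnippetParser
{
    // Returns the (trimmed) text between the first occurrence of startMarker
    // and the next occurrence of endMarker, or null if either is missing.
    public static string ExtractBetween(string html, string startMarker, string endMarker)
    {
        int start = html.IndexOf(startMarker, StringComparison.OrdinalIgnoreCase);
        if (start < 0) { return null; }
        start += startMarker.Length;
        int end = html.IndexOf(endMarker, start, StringComparison.OrdinalIgnoreCase);
        if (end < 0) { return null; }
        return html.Substring(start, end - start).Trim();
    }

    public static void Main()
    {
        // A literal string stands in for the downloaded page contents.
        string html = "<html><head><title> Example news </title></head><body></body></html>";
        string title = ExtractBetween(html, "<title>", "</title>");
        Console.WriteLine(title ?? "(title not found)");
    }
}
```

If the page owner renames or removes the markers, the method returns null, so the calling code should always handle that case.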
HTML download
• After splitting a string into separate words (using, for
example, the Split() method of the string class) and storing
them in a generic list, one can make use of the many methods
available for generic lists in order to process the list of words.
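A minimal sketch of this kind of processing is given below, using only standard .NET calls (string.Split and List<string>). The class name WordListDemo, the method SplitIntoWords, the separator set, and the sample sentence are assumptions made for illustration.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical demo class, not part of the course library.
public static class WordListDemo
{
    // Splits text on whitespace and common punctuation, dropping empty entries.
    public static List<string> SplitIntoWords(string text)
    {
        char[] separators = { ' ', '\t', '\r', '\n', ',', '.', ';', ':', '!', '?' };
        string[] parts = text.Split(separators, StringSplitOptions.RemoveEmptyEntries);
        return new List<string>(parts);
    }

    public static void Main()
    {
        List<string> words = SplitIntoWords("Rain expected today, sunshine expected tomorrow.");

        int index = words.IndexOf("sunshine");                      // position of a given word
        List<string> longWords = words.FindAll(w => w.Length > 7);  // words longer than 7 characters
        bool mentionsRain = words.Contains("Rain");                 // simple membership test

        Console.WriteLine(words.Count + " words; 'sunshine' at index " + index);
        Console.WriteLine("Long words: " + string.Join(", ", longWords));
        Console.WriteLine("Mentions rain: " + mentionsRain);
    }
}
```

Other list methods such as FindIndex, GetRange, and RemoveAll can be used in the same way, for example to pick out the words that follow a known keyword.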
HTML download
• NOTE: Some service providers do not allow (repeated)
automatic downloads. If so, do respect these restrictions,
and obtain the data from another source!
• Also, sometimes (e.g. for Google searches) the links etc. on a
page may be hidden inside JavaScript code, and can then not be
found by simple parsing of the HTML string.
RSS feeds
• RSS (Really Simple Syndication) feeds are a special class of
web pages intended for automatic analysis by computers.
• An RSS feed is formatted in a well-defined way (as an
XML file), and can easily be parsed.
• Basically, an RSS feed can be stored in an object of class
SyndicationFeed (defined in the namespace
System.ServiceModel.Syndication).
• After downloading an RSS feed, the SyndicationFeed will
contain a set of items of class SyndicationItem.
RSS feeds
• Each SyndicationItem, in turn, contains a number of
well-defined fields, e.g.
– Title
– PublishDate
– Summary
– Id
– etc.
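The sketch below shows how a SyndicationFeed and its items might be accessed using the standard SyndicationFeed.Load method. It does not use the RSSDownloader class; the class name FeedDemo, the method ParseFeed, and the small sample feed are assumptions, and the feed is read from a string rather than downloaded. On newer .NET versions the System.ServiceModel.Syndication package must be referenced separately.

```csharp
using System;
using System.IO;
using System.ServiceModel.Syndication;
using System.Xml;

// Hypothetical demo class; the sample feed below is made up for illustration.
public static class FeedDemo
{
    public const string SampleRss =
        "<rss version=\"2.0\"><channel>" +
        "<title>Example feed</title><link>http://example.com/</link>" +
        "<description>Demo feed</description>" +
        "<item><title>First item</title><link>http://example.com/1</link>" +
        "<guid isPermaLink=\"false\">item-1</guid>" +
        "<pubDate>Fri, 21 Oct 2016 07:14:17 GMT</pubDate>" +
        "<description>Summary text</description></item>" +
        "</channel></rss>";

    // Parses RSS text into a SyndicationFeed.
    public static SyndicationFeed ParseFeed(string rssXml)
    {
        using (XmlReader reader = XmlReader.Create(new StringReader(rssXml)))
        {
            return SyndicationFeed.Load(reader);
        }
    }

    public static void Main()
    {
        SyndicationFeed feed = ParseFeed(SampleRss);
        foreach (SyndicationItem item in feed.Items)
        {
            // Title and Summary are text objects and may be missing in some feeds.
            string title = item.Title != null ? item.Title.Text : "";
            string summary = item.Summary != null ? item.Summary.Text : "";
            Console.WriteLine(item.Id + " | " + title + " | " + item.PublishDate.UtcDateTime);
            Console.WriteLine("  " + summary);
        }
    }
}
```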
RSS feeds: Date format
• One slight problem with RSS feeds (or, rather, with the
XmlTextReader used for reading them) is that the reader only
handles some date formats, e.g.
“ddd, dd MMM yyyy HH:mm:ss”
(Example: Fri, 21 Oct 2016 07:14:17).
• Many RSS feeds, however, use a different date format,
e.g. “ddd MMM dd yyyy HH:mm:ss 'GMT+0000'”
(Example: Fri Oct 21 2016 07:28:19 GMT+0000).
RSS feeds: Date format
• For that reason, the RSSDownloader class contains a
method for specifying a custom date format.
• Usage example:
• The exact format specification will vary between RSS feeds.
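As an illustration of how a custom date format string is matched against the date text in a feed, the sketch below uses the standard DateTime.ParseExact method. It does not show the RSSDownloader method itself; the class name FeedDateParser and the method ParseUtc are hypothetical. Note that in .NET format strings MMM is the abbreviated month name, yyyy the four-digit year, HH the hour on a 24-hour clock, and text in single quotes is matched literally.

```csharp
using System;
using System.Globalization;

// Hypothetical helper, not part of the course library.
public static class FeedDateParser
{
    // Parses text using exactly the given format and treats the result as UTC.
    // Throws FormatException if the text does not match the format.
    public static DateTime ParseUtc(string text, string format)
    {
        return DateTime.ParseExact(text, format, CultureInfo.InvariantCulture,
            DateTimeStyles.AssumeUniversal | DateTimeStyles.AdjustToUniversal);
    }

    public static void Main()
    {
        // Format with the fixed suffix 'GMT+0000' matched as literal text.
        DateTime custom = ParseUtc("Fri Oct 21 2016 07:28:19 GMT+0000",
                                   "ddd MMM dd yyyy HH:mm:ss 'GMT+0000'");
        // RFC 822 style format (here without a time zone suffix).
        DateTime standard = ParseUtc("Fri, 21 Oct 2016 07:14:17",
                                     "ddd, dd MMM yyyy HH:mm:ss");
        Console.WriteLine(custom.ToString("u"));
        Console.WriteLine(standard.ToString("u"));
    }
}
```

Using the invariant culture keeps the English day and month abbreviations from depending on the language settings of the computer running the agent.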
The RSSReader application
• This program reads data (at regular intervals) from an RSS
feed, and then shows them on screen.
• It uses two threads (in addition to the GUI thread, of
course): One for downloading data, and one for displaying
data on the screen.
• New items are shown in green, old items in gray.
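The two-thread structure can be sketched roughly as follows. This is not the RSSReader source code: the class name TwoThreadReaderSketch, the method IsNew, and the hard-coded item IDs are assumptions, a console stands in for the GUI, and an item is simply treated as new the first time its ID is seen.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading;

// Hypothetical sketch; the real application uses a GUI and actual downloads.
public static class TwoThreadReaderSketch
{
    // Returns true the first time an ID is seen, false on later occasions.
    public static bool IsNew(HashSet<string> seenIds, string id)
    {
        return seenIds.Add(id);
    }

    public static void Main()
    {
        // Thread-safe queue that passes item IDs from one thread to the other.
        BlockingCollection<string> queue = new BlockingCollection<string>();

        Thread downloadThread = new Thread(() =>
        {
            // Stand-in for repeated downloads: two rounds with overlapping items.
            string[][] rounds = { new[] { "item-1", "item-2" }, new[] { "item-2", "item-3" } };
            foreach (string[] round in rounds)
            {
                foreach (string id in round) { queue.Add(id); }
                Thread.Sleep(100); // a real download interval would be much longer
            }
            queue.CompleteAdding(); // signal that no more items will arrive
        });

        Thread displayThread = new Thread(() =>
        {
            HashSet<string> seenIds = new HashSet<string>();
            foreach (string id in queue.GetConsumingEnumerable())
            {
                // New items in green, previously seen items in gray.
                Console.ForegroundColor = IsNew(seenIds, id) ? ConsoleColor.Green : ConsoleColor.Gray;
                Console.WriteLine(id);
            }
            Console.ResetColor();
        });

        downloadThread.Start();
        displayThread.Start();
        downloadThread.Join();
        displayThread.Join();
    }
}
```

Keeping the download in its own thread means that a slow or failed download does not block the display of items that have already been received.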