role of dom towards extraction of data from html

APPENDIX A
ROLE OF DOM TOWARDS EXTRACTION OF DATA FROM
HTML SOURCE PAGE
A.1 HTML Format:
HTML, which stands for HyperText Markup Language, is the predominant markup
language for web pages [110]. It is the basic building-block of webpages and is written in the
form of HTML elements consisting of tags, enclosed in angle brackets (like <html>), within
the web page content. HTML tags normally come in pairs like <h1> and </h1>. The first tag
in a pair is the start tag, the second tag is the end tag (they are also called opening tags and
closing tags). In between these tags web designers can add text, tables, images, etc.
The purpose of a web browser is to read HTML documents and compose them into web
pages. The browser does not display the HTML tags, but uses the tags to interpret the content
of the page. HTML also provides a means to create structured documents by denoting structural
semantics for text such as headings, paragraphs, lists, links, quotes and other items. It can
embed scripts in languages such as JavaScript, which affect the behavior of HTML webpages.
An HTML document consists of the following important entities:
• Elements: An HTML element starts with a start tag (opening tag) and ends with an
end tag (closing tag). For example:
<html>
<body>
<p>This is my first paragraph.</p>
</body>
</html>
• Attributes: Attributes provide additional information about an element. Attributes are
always specified in the start tag and come in name/value pairs of the form
name="value". For example, HTML links are defined with the <a> tag, and the link
address is specified in the href attribute:
<a href="http://www.w3schools.com">This is a link</a>
• Text: The characters that do not belong to the markup and appear
between the start and end tags of an element. Note that the entity Text is not part of
the specification, yet its definition is useful for further processing.
Whereas a browser composes an HTML document into a visual or audible web page, a parser
can convert the document into a sequence of characters, a sequence of tag and text tokens,
or a tree-like structure. In the case of the
tree structure, a detailed description of access and manipulation operations is given by the
DOM specification.
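The token view mentioned above can be sketched with Python's standard html.parser module. This is only an illustration under stated assumptions: the class name TokenCollector and the tuple layout of the tokens are hypothetical choices, not part of any specification, and the input is the sample page from section A.1.

```python
from html.parser import HTMLParser


class TokenCollector(HTMLParser):
    """Collects start-tag, text, and end-tag tokens in document order."""

    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs.
        self.tokens.append(("start", tag, dict(attrs)))

    def handle_endtag(self, tag):
        self.tokens.append(("end", tag))

    def handle_data(self, data):
        # Skip whitespace-only text between tags.
        if data.strip():
            self.tokens.append(("text", data.strip()))


def tokenize(html):
    """Return the tag and text tokens of an HTML string."""
    collector = TokenCollector()
    collector.feed(html)
    collector.close()
    return collector.tokens


page = "<html><body><p>This is my first paragraph.</p></body></html>"
tokens = tokenize(page)
```

Here tokens holds three start-tag tokens, one text token and three end-tag tokens, in the order in which they occur in the page.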
A.2 HTML PAGE GENERATION:
HTML pages fall into one of two categories: static pages and dynamic pages.
A static web page is presented to the user exactly as it is
stored. It displays the same information to all users. Such pages are HTML documents
stored as files in a file system and made available by web servers over HTTP. An obvious
example of a static page is an old-style HTML document. The only way to change a static
page is to upload a new or updated version in its place. Every time a static file is downloaded,
the file contents that are sent to the browser are the same for everyone who accesses that file.
Dynamic web pages are generated by web applications. A dynamic web page
contains fresh information prepared for the individual user. It is not static because
it changes with time (as on a newspaper site), with the user (after login with a password)
and with user interaction (such as selecting different parameters). Content on a dynamic
web page can change in response to
different contexts. In these types of sites, content and page layout are created separately. The
content is stored in backend databases and is delivered to the user when needed. Here the
original web page acts only as a layout template with appropriate slots for placing content.
Before the page is served, these slots are dynamically filled with content, e.g. with a matching
query result. This separation also allows people with little experience in web design to update their
website.
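The idea of a layout template whose slots are filled before the page is served can be sketched as follows. This is a minimal illustration, not the mechanism of any particular web application: the slot names title and rows, the function render_page and the sample values are all hypothetical, and the list passed in merely stands in for a database query result.

```python
from html import escape
from string import Template

# Layout template with named slots; the slot names are hypothetical.
PAGE_TEMPLATE = Template(
    "<html><body><h1>$title</h1><ul>$rows</ul></body></html>"
)


def render_page(title, results):
    """Fill the template slots with content, e.g. rows of a query result."""
    # Escape the content so that it cannot be mistaken for markup.
    rows = "".join(f"<li>{escape(item)}</li>" for item in results)
    return PAGE_TEMPLATE.substitute(title=escape(title), rows=rows)


# Stand-in for a database query result (made-up values).
html_page = render_page("Search results", ["Item A", "Item B & C"])
```

The same template yields a different page for every result list, which is the sense in which layout and content are created separately.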
There are two types of page generation techniques for dynamic websites:
Server-side scripting is a web server technology in which a user's request is fulfilled
by running a script directly on the web server to generate the dynamic web page. A program
running on the web server is used to change the content of web pages. Such web pages are
often created with the help of server-side languages such as PHP, Perl, ASP and ASP.NET.
Scripts in these languages can be invoked through mechanisms such as the Common Gateway
Interface (CGI) to produce dynamic web pages.
From a security point of view, server-side scripts are never visible to the browser, as these
scripts are executed on the server and emit only the HTML corresponding to the user's input to the page.
Client-side scripting generally refers to the class of computer programs on the web that are
executed client-side, by the user's web browser, instead of server-side. In response to mouse
events or keyboard actions, this kind of script changes the interface behaviour within a
particular web page. Client-side scripts are often embedded within an HTML document
("embedded script"), but they may also be contained in a separate file, which is referenced by
the document that uses it ("external script"). Upon request, the necessary files are sent to the
user's computer by the web server on which they reside. The user's web browser executes the
script, then displays the document, including any visible output from the script. Client-side
scripts may also contain instructions for the browser to follow in response to certain user
actions (e.g., clicking a button). Often, these instructions can be followed without further
communication with the server. By viewing the file that contains the script, users may be
able to see its source code. Web authors write client-side scripts in languages such as
JavaScript and VBScript.
A.3 HTML DOM:
The HTML DOM is a W3C (World Wide Web Consortium) standard for accessing and
manipulating HTML documents [111]. It presents an HTML document as a tree structure and
defines the objects and properties of all document elements, and the methods
(interface) to access them.
HTML DOM Nodes:
In the DOM, everything in an HTML document is a node. The entire document is a
document node. Every HTML element is an element node. The text in an HTML element is
a text node. Every HTML attribute is an attribute node. Comments are comment nodes.
<html>
<head>
<title>DOM Tutorial</title>
</head>
<body>
<h1>DOM Lesson one</h1>
<p>Hello world!</p>
</body>
</html>
Fig A.1: DOM Example
The root node in the HTML above (figure A.1) is <html>. All other nodes in the document
are contained within <html>. The <html> node has two child nodes: <head> and <body>.
The <head> node holds a <title> node. The <body> node holds an <h1> and a <p> node. Text is
always stored in text nodes. In the figure above, in <title>DOM Tutorial</title>, the element
node <title> holds a text node with the value "DOM Tutorial". This text can be read through
the innerHTML property of the element, or through the nodeValue property of the text node itself.
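The node types described above can be inspected with a short sketch. It assumes Python's xml.dom.minidom, which is an XML DOM implementation rather than a browser's HTML DOM; it is usable here only because the document of figure A.1 happens to be well-formed XML. Real-world HTML would generally need an HTML parser instead.

```python
from xml.dom import minidom
from xml.dom.minidom import Node

# The document of figure A.1.
SOURCE = """<html>
<head>
<title>DOM Tutorial</title>
</head>
<body>
<h1>DOM Lesson one</h1>
<p>Hello world!</p>
</body>
</html>"""

doc = minidom.parseString(SOURCE)
root = doc.documentElement                       # the <html> element
title = doc.getElementsByTagName("title")[0]     # the <title> element
text_node = title.firstChild                     # the text node inside <title>

is_document = doc.nodeType == Node.DOCUMENT_NODE
is_element = title.nodeType == Node.ELEMENT_NODE
is_text = text_node.nodeType == Node.TEXT_NODE
title_text = text_node.nodeValue                 # "DOM Tutorial"
```

The element node <title> itself carries no value; the string "DOM Tutorial" belongs to the text node that is its child.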
HTML DOM Node Tree:
The HTML DOM views an HTML document as a node-tree. All the nodes in the tree have
relationships to each other. All nodes can be accessed through the tree. Their contents can be
modified or deleted, and new elements can be created. The node tree shown below in figure
A.2 shows the set of nodes, and the connections between them. The tree starts at the root
node and branches out to the text nodes at the lowest level of the tree:
Fig A.2: HTML DOM Tree
The terms parent, child, and sibling are used to describe the relationships. Parent nodes have
children. Children on the same level are called siblings (brothers or sisters).
• In a node tree, the top node is called the root
• Every node, except the root, has exactly one parent node
• A node can have any number of children
• A leaf is a node with no children
• Siblings are nodes with the same parent
Figure A.3 illustrates a part of the node tree and the relationships between the
nodes:
Fig A.3: DOM Tree and relationship with nodes
The following HTML example illustrates the parent and child relationships:
<html>
<head>
<title>DOM Tutorial</title>
</head>
<body>
<h1>DOM Lesson one</h1>
<p>Hello world!</p>
</body>
</html>
• The <html> node has no parent element; it is the root node
• The parent node of the <head> and <body> nodes is the <html> node
• The parent node of the "Hello world!" text node is the <p> node
and:
• The <html> node has two child nodes: <head> and <body>
• The <head> node has one child node: the <title> node
• The <title> node also has one child node: the text node "DOM Tutorial"
• The <h1> and <p> nodes are siblings, and both are child nodes of <body>
First Child - Last Child
From the HTML above:
• The <head> element is the first child of the <html> element, and the <body> element
is the last child of the <html> element
• The <h1> element is the first child of the <body> element, and the <p> element is the
last child of the <body> element
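The parent, child, sibling, first-child and last-child relationships listed above can be checked with a small sketch. It assumes Python's xml.dom.minidom as a stand-in for the HTML DOM, and the helper element_children is a hypothetical name. The example document is written without line breaks on purpose: when line breaks are present between tags, they become whitespace text nodes and would appear as additional children.

```python
from xml.dom import minidom
from xml.dom.minidom import Node

# The example document, written without whitespace between the tags.
SOURCE = (
    "<html><head><title>DOM Tutorial</title></head>"
    "<body><h1>DOM Lesson one</h1><p>Hello world!</p></body></html>"
)


def element_children(node):
    """Return only the element children, skipping text and comment nodes."""
    return [c for c in node.childNodes if c.nodeType == Node.ELEMENT_NODE]


doc = minidom.parseString(SOURCE)
html = doc.documentElement
head, body = element_children(html)
h1, p = element_children(body)

parent_of_head = head.parentNode          # the <html> element
parent_of_text = p.firstChild.parentNode  # the <p> element
siblings = h1.nextSibling is p and p.previousSibling is h1
first_of_html, last_of_html = html.firstChild, html.lastChild  # <head>, <body>
first_of_body, last_of_body = body.firstChild, body.lastChild  # <h1>, <p>
```

In this implementation the parent of the <html> element is the document node itself, so <html> is the root among the elements rather than a node with no parent at all.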
HTML DOM Properties and Methods:
Properties and methods define the programming interface of the HTML DOM. In the DOM,
an HTML document consists of a set of node objects. The nodes can be accessed with
JavaScript or other programming languages.
A property is often described as something that is (e.g. the name of a node), and a method
as something that is done (e.g. removing a node).
HTML DOM Properties:
• x.innerHTML - the markup and text contained in x
• x.nodeName - the name of x
• x.nodeValue - the value of x
• x.parentNode - the parent node of x
• x.childNodes - the child nodes of x
• x.attributes - the attribute nodes of x
In the list above, x is a node object (HTML element).
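Most of these properties can be tried out on the link example from section A.1. The sketch below assumes Python's xml.dom.minidom; the wrapping <p> element is added only for illustration. innerHTML is a browser property that minidom does not provide, so the serialisation method toxml() is used as a rough substitute, and it returns the element's own tags as well as its content.

```python
from xml.dom import minidom

# The link from section A.1, wrapped in a <p> element for illustration.
doc = minidom.parseString(
    '<p><a href="http://www.w3schools.com">This is a link</a></p>'
)
x = doc.getElementsByTagName("a")[0]

node_name = x.nodeName                 # "a" (browsers report "A" for HTML)
node_value = x.nodeValue               # None: element nodes carry no value
parent_name = x.parentNode.nodeName    # "p"
child_count = len(x.childNodes)        # 1, the text node
link_text = x.firstChild.nodeValue     # "This is a link"
href = x.attributes["href"].value      # value of the href attribute node
markup = x.toxml()                     # serialised element, tags included
```

The values show the division of labour between the node kinds: the element supplies the name, the attribute node supplies the link address, and the text node supplies the visible text.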
HTML DOM Methods:
• x.getElementById(id) - get the element with a specified id (x is normally the document node)
• x.getElementsByTagName(name) - get all elements with a specified tag name
• x.appendChild(node) - append a child node to x
• x.removeChild(node) - remove a child node from x
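The methods above can be sketched in the same way, again assuming Python's xml.dom.minidom in place of a browser. The text "A second paragraph" is made up for the example. getElementById is left out, since in minidom it depends on attributes having been declared as IDs.

```python
from xml.dom import minidom

doc = minidom.parseString(
    "<html><body><h1>DOM Lesson one</h1><p>Hello world!</p></body></html>"
)
body = doc.getElementsByTagName("body")[0]

# getElementsByTagName returns every matching descendant element.
count_before = len(doc.getElementsByTagName("p"))

# appendChild: build a new <p> element with a text node, attach it to <body>.
new_p = doc.createElement("p")
new_p.appendChild(doc.createTextNode("A second paragraph"))
body.appendChild(new_p)

# removeChild: detach the <h1> element from <body>.
h1 = doc.getElementsByTagName("h1")[0]
body.removeChild(h1)

count_after = len(doc.getElementsByTagName("p"))
remaining = [child.nodeName for child in body.childNodes]
```

After these calls <body> contains the two <p> elements and no <h1>, which shows that the tree is modified in place rather than copied.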
Since the HTML DOM defines a standard for accessing and manipulating HTML documents and
presents each document as a tree structure, it makes the structure of an HTML document
easy to understand. In particular, the DOM tree shows how deeply nested the
HTML page is.
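How nested a page is can be expressed as the depth of its element tree. The following sketch computes it recursively; it assumes Python's xml.dom.minidom, and element_depth is a hypothetical helper, not a DOM method.

```python
from xml.dom import minidom
from xml.dom.minidom import Node


def element_depth(node):
    """Depth of the deepest element under node, counting node itself as 1."""
    child_depths = [
        element_depth(child)
        for child in node.childNodes
        if child.nodeType == Node.ELEMENT_NODE
    ]
    return 1 + max(child_depths, default=0)


doc = minidom.parseString(
    "<html><head><title>DOM Tutorial</title></head>"
    "<body><h1>DOM Lesson one</h1><p>Hello world!</p></body></html>"
)
depth = element_depth(doc.documentElement)  # html > head > title gives 3
```

Text nodes are not counted here, so the value reflects only the nesting of tags.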
A.4 HTML DOM Inspector:
To understand the DOM tree of an HTML page, we used the IE DOM
Inspector tool, which presents the structure of a web page. Figure A.4 shows the search interface for
autonagar.com.
Fig A.4: Search Interface of autonagar.com
Figure A.5 shows the output of IE DOM Inspector applied to the web page returned after
submitting the search form (figure A.4). Here, the DOM tree is clearly shown, in which header is the parent
node and check, model, year, pics, mileage, price, color, city, listed and view are the child nodes
of that parent.
Fig A.5: DOM Tree Example by IE DOM Inspector
The source code for the web page is given below in figure A.6, which shows the parent-child
relationship discussed above in section A.3.
<P>
<SPAN class=linewrap3_check> </SPAN>
<SPAN class=linewrap3_model><A href="javascript:void(0);">Make-Model</A> </SPAN>
<SPAN class=linewrap3_year><A onclick="return
extendURL('compactlist=&Sort=Year&O=D&start=&dealerid=');"
href="javascript:void(0);">Year</A> </SPAN>
<SPAN class=linewrap3_pics><IMG border=0 alt=""
src="http://www.autonagar.com/imgs/trans.gif" width=10 height=9></SPAN>
<SPAN class=linewrap3_mileage><A onclick="return
extendURL('compactlist=&Sort=Mileage&O=D&start=&dealerid=');"
href="javascript:void(0);">Mileage</A> </SPAN>
<SPAN class=linewrap3_price><A onclick="return
extendURL('compactlist=&Sort=Price&O=D&start=&dealerid=');"
href="javascript:void(0);">Price</A> </SPAN>
<SPAN class=linewrap3_color><A onclick="return
extendURL('compactlist=&Sort=Color&O=D&start=&dealerid=');"
href="javascript:void(0);">Color</A> </SPAN>
<SPAN class=linewrap3_city><A onclick="return
extendURL('compactlist=&Sort=City&O=D&start=&dealerid=');"
href="javascript:void(0);">City</A> </SPAN>
<SPAN class=linewrap3_listed><A onclick="return
extendURL('compactlist=&Sort=PostDated&O=D&start=&dealerid=');"
href="javascript:void(0);">Listed On</A> </SPAN><SPAN class=linewrap3_view></SPAN>
<BR class=clear>
Fig A.6: Source code of autonagar.com
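As an illustration of how this structure supports data extraction, the sketch below recovers the column names and labels from a shortened copy of the fragment in figure A.6. It is an assumption-laden example rather than the extraction method used in this work: it relies on Python's html.parser, which tolerates the unquoted attribute values and unclosed tags of real-world HTML, the class HeaderExtractor is hypothetical, and the onclick handlers are omitted for brevity.

```python
from html.parser import HTMLParser

# Shortened copy of the fragment in figure A.6 (onclick handlers omitted).
FRAGMENT = """<P>
<SPAN class=linewrap3_model><A href="javascript:void(0);">Make-Model</A> </SPAN>
<SPAN class=linewrap3_year><A href="javascript:void(0);">Year</A> </SPAN>
<SPAN class=linewrap3_price><A href="javascript:void(0);">Price</A> </SPAN>
<SPAN class=linewrap3_listed><A href="javascript:void(0);">Listed On</A> </SPAN>
<BR class=clear>"""


class HeaderExtractor(HTMLParser):
    """Maps each SPAN class suffix to the label of the link inside it."""

    PREFIX = "linewrap3_"

    def __init__(self):
        super().__init__()
        self.columns = {}
        self._current = None    # class suffix of the SPAN being read
        self._in_link = False   # True while inside an <A> of that SPAN

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases tag and attribute names.
        css_class = dict(attrs).get("class") or ""
        if tag == "span" and css_class.startswith(self.PREFIX):
            self._current = css_class[len(self.PREFIX):]
        elif tag == "a" and self._current is not None:
            self._in_link = True

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_link = False
        elif tag == "span":
            self._current = None

    def handle_data(self, data):
        # Only text inside a link that sits inside a header SPAN is a label.
        if self._in_link and self._current is not None:
            self.columns[self._current] = data.strip()


extractor = HeaderExtractor()
extractor.feed(FRAGMENT)
extractor.close()
columns = extractor.columns
```

The resulting dictionary pairs each child node of the header (model, year, price, listed) with the label displayed for it, mirroring the parent-child structure seen in the DOM tree.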