
Exploring the use of markup and attributes for classifying
HTML snippets
Alexis Kalyvitis
October 5, 2013
One-year programme Master's Software Engineering
Supervisor : Marcel Worring
Tutor: Marc Van Agteren
Organisation: Usabilla B.V.
Publication status: Submitted for review
Universiteit van Amsterdam
Abstract
Web pages are made up of headers, navigational elements, content, footers and more. Each
of these elements has a different impact on the page. If we are able to distinguish between
these elements we can improve the classification of the web page as a whole by providing
insight into which elements contain meaningful information. In this paper we explore ways
to classify such elements based on their appearance, looking for cues in the HTML markup, the class
and id attributes, and their size and position within the page.
Contents

1 Introduction
  1.1 Research Objectives
2 Related Work
3 Methods and Algorithms
  3.1 Information Extraction
  3.2 Transformation
  3.3 Training the Classifiers
4 Experiment
  4.1 Evaluation and Analysis
  4.2 Dataset
  4.3 Results
5 Interpretation
6 Conclusions
7 Appendix
  7.1 SMO
  7.2 SMO - Stemmed
  7.3 Naive Bayes
  7.4 Naive Bayes - Stemmed
  7.5 C4.5
  7.6 C4.5 - Stemmed
  7.7 K*
  7.8 K* - Stemmed
Chapter 1
Introduction
The classification of web pages – or in their native form HTML documents – has been of high
interest to the research community for the past decade. Many researchers [1, 3, 4] proposed
ways to classify HTML documents based on numerous attributes found inside the documents
themselves, or even from pages that reference them.
But there is more we can learn from an HTML document than the class of its contents as
a whole. A typical web page is structured with headers, navigational elements, content and
more. Each of these elements is of different significance to the page it appears in. A
page's header is where the organization's logo and punchline would appear. The site's
main navigation would be useful in finding other pages related to the site, similar to a site
map. Menus would identify some of the products or services that are offered, and so on.
Knowing the types of these elements would eventually improve the classification of the
whole page, by using the right information out of the right element or by weighting elements
by their impact. Therefore we believe it is worth attempting to classify such
elements based on their appearance.
To classify these elements we look into the HTML markup that generates them and extract
meaningful information that can be used for classification. The downside of this approach is that
the HTML specification does not enforce the way elements may be created. There is almost
always more than one way to create an element. For example a button can be created using
a set of <button></button> tags, the <input type="submit"/> tag or even using an <a
href="#" class="button"></a> styled appropriately. No matter how a developer chooses
to create an element, its visual representation is the same from the user's point of view.
Figure 1.1: Two visually identical elements created using different markup.
The elements depicted in Figure 1.1 look exactly the same, but when we look at the underlying HTML markup that generates these two elements, we observe that they are written
in a fairly different manner.
Listing 1.1: Typical HTML markup for defining a navigation element
<ul class="nav nav-list">
<li class="nav-header">List header</li>
<li><a href="#">Home</a></li>
<li><a href="#">Library</a></li>
<li><a href="#">Applications</a></li>
<li class="nav-header">Another list header</li>
<li><a href="#">Profile</a></li>
<li><a href="#">Settings</a></li>
<li class="divider"></li>
<li><a href="#">Help</a></li>
</ul>
Listing 1.2: Alternative markup for defining a navigation element
<nav>
<h4>List header</h4>
<p><a href="#">Home</a></p>
<p><a href="#">Library</a></p>
<p><a href="#">Applications</a></p>
<h4>Another list header</h4>
<p><a href="#">Profile</a></p>
<p><a href="#">Settings</a></p>
<p class="divider"></p>
<p><a href="#">Help</a></p>
</nav>
As the above listings show, two HTML snippets can be created using different markup, yet
manage to produce elements of similar or even identical appearance. By looking at the markup
of these snippets, a person with moderate experience with HTML can identify what they
will generate once rendered in a web browser: a quick glance at the structure, the tags used,
and the class and id attributes will probably hint at how the snippets will look.
We see a few ways to approach the classification of these snippets. One may be a rule-based
or expert system which can be configured with a set of rules to discriminate the type of
an element, the same way an expert system would help a doctor choose the correct diagnosis
based on a list of given symptoms. This approach can be trivial to implement, but given
that HTML standards change often and practices vary among developers, a classification
that is correct for one developer might not be correct for another. Likewise, a classification that is
correct today might not be correct in the future. A prime example of this is the extensive use
of tables to lay out HTML pages during the late 1990s, which was made obsolete by the advent
of Cascading Style Sheets.
Another, preferable approach would be a method that can learn and adapt as standards and practices change, and that can overcome the ambiguities found in HTML and described
earlier in this section. A better way would therefore be to use snippets from live websites which we
can learn from and from which we can derive a set of rules. This method ensures that the rules
are not subjective to whoever defines them, but are based on examples from a number of
real snippets.
1.1 Research Objectives
Our goal is to achieve an acceptably accurate classification of web elements based on their
underlying HTML markup. To reach this goal we will investigate whether there is enough
information in the HTML markup that can accurately describe elements, such as HTML tag
names, attributes like id and class, size and offset, and whether it is possible to train classifiers
that are able to accurately predict the type of an element.
In the following sections we will go through the extraction of information out of HTML
snippets, transforming that information, training and evaluating the classifiers and presenting
the results.
Chapter 2
Related Work
HTML has received a lot of attention from the research community in the past several years.
Since the explosion of the amount of information available online, many researchers have focused
their attention on HTML documents and on taming the semi-structured nature of the
information they contain.
Sun et al. [6] used a Support Vector Machine to classify web pages using the site's text, title
and anchor text in several combinations. Even though some HTML markup was used to aid
classification, this study's main focus is text classification.
Glover et al. [3] combined the anchor text and the nearby words from anchors pointing to
a site from various sources into a virtual document. All these virtual documents were used
to classify the page they were pointing to. Similarly, Furnkranz [1] found that it is easier
to classify a hypertext document using information provided by documents that point to it
than using information provided by the document itself. In that experiment, clues were
sought in the anchor text, the context in which the anchor appears and the heading
preceding the anchor. In both cases the focus of the study was the content of the document.
Yang et al. [4] summarize the different approaches to hypertext categorization by comparing
several regularly used methods and their results.
Bouchachia and Hassler [8] combined both the structure and the content to classify XML
documents. Their classification approach is based on a k-nearest neighbor algorithm that
relies on an edit distance measure. Their study focuses on classifying a document based on
the content it holds, not on the type of the document itself.
Elsas [7] developed tag metrics which could be used to automatically classify HTML documents into three distinct categories. The categories used were data for pages that contained
data arranged in rows and columns, index or table of contents for pages containing hypertext links to other pages, and content for pages that contained a significant amount of natural
language or free text.
Candillier et al. [5] proposed methods of transforming the structure of XML trees for use in
classification. The authors propose using the set of "parent-child" relations, the set of
"next-sibling" relations, the set of paths starting from the root and the arity of the nodes.
This study is particularly insightful and relevant, as the hierarchy of nodes inside an HTML
document plays an important part in the underlying structure the markup describes.
As related work suggests, most research in hypertext classification has focused primarily on the textual features of a document [6, 3, 1, 8]. Documents have been classified based
on the information they contain, often using structure to weight the importance of words and
sentences as they occur inside the document. Except for [4, 5], the majority of studies presented in this section describe cases of text classification, occasionally using some XML/HTML
structure to aid their classifiers. All the studies mentioned above provide valuable insights for
our experiment, but unfortunately none of them focused on the structure of
elements or their visual representation, and thus there is no baseline to compare our results
against. It seems that what this paper attempts to find out has not been researched before.
Chapter 3
Methods and Algorithms
The construction of a classification system consists of two phases. First a learning algorithm
gets trained using data of which the classes are known (training data) to produce a model. Then
using the model we can classify data of which the class is not known (test data). But before we
begin training we must make sure that we have extracted the most relevant information possible
from our dataset and represented it in a way suited to a learning algorithm. In this section we will
describe how we approached information extraction, preprocessing and classification of HTML
snippets.
Figure 3.1: Methodology Data Flow Diagram. HTML snippets with a known class are preprocessed and used for training to produce a model; HTML snippets with an unknown class are preprocessed and passed through the model to obtain a class prediction.

3.1 Information Extraction
HTML Attributes
HTML elements can be augmented with attributes in order to provide semantics, style, data
and more. Most elements can take common attributes like id which provides the element with
a document-wide unique identifier or class which is used to group similar elements together.
Web developers often use these attributes to distinguish various parts of a web page in order to
style them through style sheets or manipulate them with scripting languages. These attributes
are a good place to mine for useful information that can be used for classification.
Document Object Model
Each snippet is parsed by an HTML parser that generates the snippet's document object model.
A Document Object Model (DOM) is a convention used for representing and interacting with
HTML documents, and it is used by web browsers to render HTML documents into interactive
web pages. Once parsed, the nodes of the HTML document are organized in a tree structure,
called the DOM tree [9].
With an active DOM it becomes trivial to extract tag names as well as class and id attributes
from nodes in the snippet and store them for later use.
Figure 3.2: An example snippet taken from the UvA front page
Listing 3.1: HTML markup that generates the snippet shown in Figure 3.2 with some parts
omitted for clarity
<html>
<head>
...
</head>
<body>
...
<div class="header-logo" id="snippet">
<map name="header-logo-map" id="header-logo-map">
<area ... href="http://www.uva.nl">
<area ... href="http://www.uva.nl">
</map>
<img src=".../uva-nl.jpg" usemap="#header-logo-map">
</div>
...
</body>
</html>
Listing 3.2: Produced bag-of-words array
[div, header-logo, map, header-logo-map, header-logo-map, area, area, img]
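The paper does not name the HTML parser used in the experiments, so the following Java sketch is illustrative only: it assumes the jsoup library (our choice, not necessarily the authors' tooling) to build the DOM of a snippet and collect tag names together with class and id attribute values into a bag of words comparable to Listing 3.2.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import java.util.ArrayList;
import java.util.List;

public class SnippetExtractor {

    // Walks the DOM of an HTML snippet and collects tag names plus the values
    // of the class and id attributes into a single bag of words.
    public static List<String> extractWords(String html) {
        List<String> words = new ArrayList<>();
        Document dom = Jsoup.parseBodyFragment(html);
        for (Element node : dom.body().select("*")) {
            if (node.tagName().equals("body")) {
                continue; // skip the wrapper element added by the parser
            }
            words.add(node.tagName());
            words.addAll(node.classNames());
            if (!node.id().isEmpty()) {
                words.add(node.id());
            }
        }
        return words;
    }

    public static void main(String[] args) {
        String snippet = "<div class=\"header-logo\" id=\"snippet\">"
                + "<map name=\"header-logo-map\" id=\"header-logo-map\">"
                + "<area href=\"http://www.uva.nl\"><area href=\"http://www.uva.nl\"></map>"
                + "<img src=\"uva-nl.jpg\" usemap=\"#header-logo-map\"></div>";
        // Prints a bag of words similar to Listing 3.2; the exact contents depend
        // on which attributes (class, id, name, ...) are collected.
        System.out.println(extractWords(snippet));
    }
}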
Size & Position
The size of each snippet is calculated using the size of the outermost node of the snippet.
Similarly the offset is measured as the amount of pixels between the outermost node of the
snippet’s side to the documents bounds. Due to some limitations imposed by the vector space
model described in section 3.2 we cannot use the snippets exact numeric size in pixels. The
bag-of-words approach is unable to process numeric attributes and it treats every number as
a unique word without taking into account its proximity to other numbers. For example, two
snippets of width 399 and 400 respectively will be treated as having a completely different width
as the word 399 is different than the word 400. To overcome this setback we will translate
sizes and offsets into their approximate textual equivalents. Taking into account the screen
size used in modern computer screens we assume an average of 1366x768 and classify sizes as
described in tables 1 to 3.
Table 3.1: Size translation

Pixels   Translation
<320     XSmall
<640     Small
<800     Medium
<1024    Large
>1024    XLarge

Table 3.2: Y-Axis Offset Translation

Pixels   Translation
<256     Top
<512     Middle
>512     Bottom

Table 3.3: X-Axis Offset Translation

Pixels   Translation
<454     Left
<910     Center
>910     Right
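A minimal Java sketch of this binning (the thresholds come from Tables 3.1 to 3.3; the class and method names are ours):

public class SizeBins {

    // Table 3.1: width or height in pixels mapped to a textual size
    static String sizeBin(int pixels) {
        if (pixels < 320)  return "XSmall";
        if (pixels < 640)  return "Small";
        if (pixels < 800)  return "Medium";
        if (pixels < 1024) return "Large";
        return "XLarge";
    }

    // Table 3.2: vertical offset in pixels mapped to a textual position
    static String yOffsetBin(int pixels) {
        if (pixels < 256) return "Top";
        if (pixels < 512) return "Middle";
        return "Bottom";
    }

    // Table 3.3: horizontal offset in pixels mapped to a textual position
    static String xOffsetBin(int pixels) {
        if (pixels < 454) return "Left";
        if (pixels < 910) return "Center";
        return "Right";
    }

    public static void main(String[] args) {
        // A 400 pixel wide snippet positioned at (100, 600) on a 1366x768 page
        System.out.println(sizeBin(400));     // Small
        System.out.println(xOffsetBin(100));  // Left
        System.out.println(yOffsetBin(600));  // Bottom
    }
}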
3.2 Transformation
Stemming
The element identifiers, namely the class and id attributes of an element, contain one or
more words and are usually informative as to the type of the element. We believe that this
information should be taken into account in our information retrieval process. The downside
is that these attributes may contain words from an infinitely large vocabulary.
Using the array produced in Listing 3.2 as an example, we get a hint that the snippet we
are about to classify is a logo: the word "logo" appears three times in the array. Unfortunately
the word "header-logo" is different from "header-logo-map", and the classifier will treat them as
such.
But if there were a way to distinguish and normalize such inflected words, we would have
described the snippet in question better (as far as the classifier is concerned). In linguistic
morphology and information retrieval this method is called stemming, and it is typically used
to reduce inflected or sometimes even derived words to their root word. For example a
stemmer for the English language should reduce the words "fishing", "fished", "fish", and
"fisher" to the root word "fish".
For our purpose it does not seem meaningful to use a stemmer for a specific language, such
as the Porter Stemming Algorithm [14], but rather a small-scale stemmer that is able
to identify certain abbreviations or compound words such as the ones found in Listing 3.2.
In the experiments we used our own implementation of a stemmer, which is able to normalize
complex or compound words containing a keyword into their respective root word or stem. The
available stems supported by the stemmer are simply the names of the classes
found in the dataset. Each stem in turn maps to an array of inflected words that relate to the
stem word. When the stemmer is called to reduce a given word to its potential stem, it loops
through the list of inflected words and compares each to the given word. The comparison
checks whether one of the words in the list of inflected words is a substring of the given word;
if a match is found, the given word is replaced with the respective stem word. Given, for example,
the stem word advertisement with the associated inflected words ad, commercial and promo, stemming
the word banner-promo finds that the inflected word promo is a substring of the given word,
so the word is reduced to advertisement.
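A minimal Java sketch of the substring-based stemmer described above. The stem table shown here is illustrative; the table used in the experiments mapped every class name in the dataset to its own list of inflected words.

import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ClassStemmer {

    // Stem word -> inflected words that should be reduced to it (example entries only).
    private static final Map<String, List<String>> STEMS = new LinkedHashMap<>();
    static {
        STEMS.put("advertisement", Arrays.asList("ad", "commercial", "promo"));
        STEMS.put("logo", Arrays.asList("logo", "brand"));
        STEMS.put("menu", Arrays.asList("menu", "nav"));
    }

    // Returns the stem whose inflected-word list contains a substring of the
    // given word, or the word unchanged if no stem matches.
    public static String stem(String word) {
        for (Map.Entry<String, List<String>> entry : STEMS.entrySet()) {
            for (String inflected : entry.getValue()) {
                if (word.contains(inflected)) {
                    return entry.getKey();
                }
            }
        }
        return word;
    }

    public static void main(String[] args) {
        System.out.println(stem("banner-promo"));  // advertisement
        System.out.println(stem("site-logo"));     // logo
        System.out.println(stem("main-nav"));      // menu
        System.out.println(stem("div"));           // div (no stem matches)
    }
}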
Vector Space
In turn, the array of now stemmed words is transformed into a Vector Space. A Vector Space
model is an algebraic model for representing text as vectors of identifiers [15, 16]. Each dimension in the Vector Space corresponds to a separate term in a document. If a term occurs
in the document, its value in the vector is equal to the number of occurrences. Typically
terms are single words, keywords, or longer phrases. Similar to the bag-of-words model, the
grammar and order in which terms appear is not considered. Listing 3.3 depicts the Vector
Space representing the values from Listing 3.2.
Listing 3.3: An example Vector Space Model
[div, 1, header-logo, 1, map, 1, header-logo-map, 2, area, 2, img, 1]
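A minimal Java sketch of this transformation, counting the occurrences of each term in a bag of words; the output mirrors the counts shown in Listing 3.3. In the actual experiments every distinct term across the dataset becomes one attribute of the feature vector.

import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class VectorSpace {

    // Each distinct word becomes a dimension whose value is its number of occurrences.
    public static Map<String, Integer> toVector(List<String> words) {
        Map<String, Integer> vector = new LinkedHashMap<>();
        for (String word : words) {
            vector.merge(word, 1, Integer::sum);
        }
        return vector;
    }

    public static void main(String[] args) {
        List<String> bag = Arrays.asList("div", "header-logo", "map",
                "header-logo-map", "header-logo-map", "area", "area", "img");
        // {div=1, header-logo=1, map=1, header-logo-map=2, area=2, img=1}
        System.out.println(toVector(bag));
    }
}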
3.3 Training the Classifiers
After extracting and preprocessing snippets as described above, the dataset is ready to be
split into parts and given to a classifier for training. In order to evaluate the accuracy of the
classifier, the dataset will be split into N parts, the classifier will train using N-1 parts of the
dataset and test on the remaining part. This will be repeated until all parts have been used
in testing. We will train several different classification algorithms with the same dataset to
compare results. The algorithms we selected for training are:
Naive Bayes selected to represent simple probabilistic classifiers.
SMO selected to represent support vector machines.
C4.5 selected to represent decision tree algorithms.
K* selected to represent instance-based (lazy) learning algorithms.
This particular group of algorithms was chosen solely to test algorithms that are
fundamentally different, not to benchmark particular algorithms. We use implementations of these classifiers from the WEKA machine learning software suite [10]. The criteria for
choosing WEKA for our experiment are its free and open-source availability, its portability to
almost any platform since it is implemented in the Java programming language, its easy integration with our own software, and most of all its high-quality, accurate implementations of machine
learning algorithms modeled after academic papers. Using such widely available implementations of the classifiers makes our experiment easily reproducible by anyone willing to extend
upon this work.
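A minimal sketch of this procedure using the WEKA Java API. The ARFF file name is hypothetical, and the choice of 10 folds is ours since the text only speaks of N parts; in WEKA the selected algorithms are weka.classifiers.bayes.NaiveBayes, weka.classifiers.functions.SMO, weka.classifiers.trees.J48 (C4.5) and weka.classifiers.lazy.KStar (K*).

import java.util.Random;
import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class TrainSnippetClassifier {

    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("snippets.arff");  // hypothetical dataset file
        data.setClassIndex(data.numAttributes() - 1);        // class is the last attribute

        Classifier classifier = new NaiveBayes();

        // Cross-validation: train on all folds but one, test on the remaining fold,
        // and repeat until every fold has been used for testing.
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(classifier, data, 10, new Random(1));

        System.out.println(eval.toSummaryString());
        System.out.println(eval.toClassDetailsString());  // precision, recall, F-measure per class
        System.out.println(eval.toMatrixString());         // confusion matrix
    }
}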
Chapter 4
Experiment
In this section we describe the execution and outcome of the experiments we conducted based
on Section 3, explain how we evaluated the results and which metrics we used to make measurements.
4.1 Evaluation and Analysis
To evaluate the classifier we used cross-validation, a technique for assessing how the results
of a statistical analysis will generalize to an independent data set. This technique is mainly
used in settings similar to our experiment, to estimate how accurately a predictive model will
perform in practice. A round of cross-validation involves partitioning a sample of data into
complementary subsets, performing the analysis on one subset (called the training set), and
validating the analysis on the other subset (called the test set). To reduce variability, multiple
rounds of cross-validation are performed using different partitions, and the validation results
are averaged over the rounds.
The accuracy of the classification is calculated as the sum of correctly classified instances
divided by the total number of instances. Accuracy is the most basic yet fundamental indication
of how well the classifier performs. Additionally we measure the precision and recall [17] to
see which classes are being classified correctly, which are failing and what causes the confusion.
In classification tasks, precision is the fraction of the instances assigned to a class that actually belong
to it, while recall is the fraction of the instances of a class that are correctly identified. The terms true positives,
true negatives, false positives, and false negatives compare the results of the classifier under
test with trusted external judgments. Precision and recall are usually combined into a single
measure, such as the F-measure. By measuring precision and recall we can gain better insight
as to the quality of the classification, finding out which classes performed well and how that
performance was achieved.
\[ \mathrm{Precision} = \frac{TP}{TP + FP} \tag{4.1} \]

\[ \mathrm{Recall} = \frac{TP}{TP + FN} \tag{4.2} \]

\[ F = 2 \cdot \frac{\mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{4.3} \]
To interpret the classification, confusion matrices are used to display the accuracy, precision and recall of individual classes. A confusion matrix is a specific table layout that allows
visualization of the performance of an algorithm. Each column of the matrix represents the
instances in a predicted class, while each row represents the instances in an actual class. Confusion matrices make it easier to inspect the individual classes and observe their performance.
The confusion matrices for the experiments can be found in the Appendix section.
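For illustration, a small Java sketch computing the metrics of Equations 4.1 to 4.3 from true-positive, false-positive and false-negative counts; the counts in the example are hypothetical and not taken from the experiments.

public class Metrics {

    static double precision(int tp, int fp) {
        return (double) tp / (tp + fp);                        // Equation 4.1
    }

    static double recall(int tp, int fn) {
        return (double) tp / (tp + fn);                        // Equation 4.2
    }

    static double fMeasure(double precision, double recall) {
        return 2 * precision * recall / (precision + recall);  // Equation 4.3
    }

    public static void main(String[] args) {
        int tp = 33, fp = 22, fn = 63;                         // hypothetical counts for one class
        double p = precision(tp, fp);
        double r = recall(tp, fn);
        System.out.printf("precision=%.3f recall=%.3f f-measure=%.3f%n", p, r, fMeasure(p, r));
    }
}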
4.2 Dataset
To conduct our experiments we require a dataset with HTML snippets whose classes are known
in advance. To make sure these snippets are classified correctly, the classification should be
made by experts in the field of web design. The dataset we used came from a web service
that enabled users to bookmark and share HTML snippets [11]. Users who participated in
our experiment installed a “bookmarklet” on their web browser and used it to select HTML
snippets from live websites that they found interesting or worth sharing. During the process
of bookmarking an HTML snippet, users were able to select the category to which the snippet
belonged.
After some examination of this dataset, it came to our attention that some of the snippets
were wrongly classified. For example, users would classify an image as a menu if the image
depicted a menu. We consider this classification wrong, as the actual type of the snippet is an
image. In order to avoid these erroneous cases we only use snippets that were classified
by experts whose judgment we trust to be of higher quality. Classifications made by these experts
are treated as correct and considered the gold standard. A significant consequence of this
decision is that the dataset shrank considerably: from about 15,000 records
in total we are left with roughly 1,500. The number of training examples should
ideally be higher than 1,500 in such experiments, but we opted for higher quality over higher
quantity.
The list of available categories that snippets could be classified into was determined by the
dataset, with the exclusion of some categories that had a very small number of examples.
Figure 4.1: Screenshot showing part of the dataset, attempting to emphasize the diversity in
the training examples
4.3 Results
Here we present the results from a combination of methods used during the experiment. We
executed the same experiment using several classification algorithms, with and without the use
of stemming.
Figure 4.2: Accuracy, (a) with stemming and (b) without stemming.

Figure 4.3: Precision & Recall, (a) with stemming and (b) without stemming.

Figure 4.4: SMO - Detailed accuracy by class, (a) with stemming and (b) without stemming.

Figure 4.5: Naive Bayes - Detailed accuracy by class, (a) with stemming and (b) without stemming.

Figure 4.6: C4.5 - Detailed accuracy by class, (a) with stemming and (b) without stemming.

Figure 4.7: K* - Detailed accuracy by class, (a) with stemming and (b) without stemming.
Chapter 5
Interpretation
The most accurate classification was achieved by the combination of a Naive Bayes classifier and
the use of a stemmer, with 59.3% of instances classified correctly. Its non-stemmed counterpart
achieved about half a percentage point less, at 58.7%. On the other hand, K* performed the worst
of the algorithms in this experiment, more than 5 percentage points below Naive Bayes. In most
cases the use of stemming improved accuracy by only one or two percentage points, except for C4.5,
where it actually decreased accuracy by about half a point.
Chapter 6
Conclusions
In order to make sense of the results we look at the confusion matrices and re-examine the
dataset. We observe that some classes draw a fair amount of false positives; classes such as
header, menu, logo and list tend to draw most of them. By revisiting the dataset
and comparing some examples, we find that this seems quite logical, as these classes are hard
to tell apart and often share a very similar HTML structure. As is evident from the confusion
matrix in Section 7.1 of the appendix, there is a high amount of confusion between button
and logo instances: in that experiment the classifier labeled 43 button instances
wrongly as logos, and only 33 correctly as buttons. Listings 6.1 and 6.2 compare a snippet of
class button with a snippet of class logo, emphasizing the similarities between these two classes.
Listing 6.1: Example HTML snippet of class button
<a href="#" class="button">
<img src="/path/to/button.jpg" />
</a>
Listing 6.2: Example HTML snippet of class logo
<a href="#" class="logo">
<img src="/path/to/logo.jpg" />
</a>
A known limitation of our experiment is the number of training examples. We believe that
accuracy would improve significantly if there were more examples to train with. Additionally,
there is variation in the number of training examples per class: in our dataset
there were 222 instances of class logo versus 96 instances of class button. The experiment
would have been more indicative if the number of examples in each class were more evenly
distributed.
Another limitation is the stemmer that was used in our experiments. Because stemmers
work best with natural language it is difficult to compare them in an experiment such as the
one described in this paper; words used in HTML attributes, for example, need not
be grammatically correct in any language.
Some challenges we faced during the implementation of these experiments were imposed by
the fact that we applied techniques previously used to analyze natural language to the analysis
of the structure and metadata of markup. Machine learning tasks like
HTML classification typically involve extracting text from documents and deriving the class of
the documents from that text. In our case we are not interested in the content itself, but in
the type of entity the content is presented in.
This paper is an exploratory attempt at applying a tried and well-researched technology to
a completely new problem domain. It is fairly obvious from the results that its use was not
optimal, but it serves as a first step towards solving a problem that has not been addressed before.
Snippet classification seems to be a domain that has not received attention in the past, but
it would be interesting to see in which use cases it would prove useful. Such a
tool could perhaps be useful in assistive technologies such as screen readers, providing the
blind or visually impaired with a richer experience or better guidance by additionally describing
elements on a website and how they can be interacted with. Another possible use might be in
search engines, using the class of web page elements to describe the context in which information
is displayed.
Chapter 7
Appendix
7.1 SMO
Correctly Classified Instances          472               56.8675 %
Incorrectly Classified Instances        358               43.1325 %
Kappa statistic                           0.4591
Mean absolute error                       0.2404
Root mean squared error                   0.3375
Relative absolute error                  88.3217 %
Root relative squared error              91.4935 %
Total Number of Instances               830

=== Detailed Accuracy By Class ===

                TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area  Class
                0.344    0.03     0.6        0.344   0.437      0.789     button
                0.801    0.19     0.504      0.801   0.619      0.868     header
                0.423    0.045    0.554      0.423   0.48       0.795     image
                0.31     0.043    0.53       0.31    0.391      0.787     list
                0.91     0.176    0.654      0.91    0.761      0.888     logo
                0.227    0.055    0.457      0.227   0.303      0.746     menu
Weighted Avg.   0.569    0.108    0.557      0.569   0.535      0.824

=== Confusion Matrix ===

   a   b   c   d   e   f   <-- classified as
  33  10   3   2  43   5 |  a = button
   2 129   9   0  11  10 |  b = header
   4  21  41   8  19   4 |  c = image
  11  30   9  35  11  17 |  d = list
   4   4   8   2 202   2 |  e = logo
   1  62   4  19  23  32 |  f = menu
7.2 SMO - Stemmed
Correctly Classified Instances          484               58.1032 %
Incorrectly Classified Instances        349               41.8968 %
Kappa statistic                           0.4746
Mean absolute error                       0.2405
Root mean squared error                   0.3375
Relative absolute error                  88.5348 %
Root relative squared error              91.5702 %
Total Number of Instances               833

=== Detailed Accuracy By Class ===

                TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area  Class
                0.344    0.028    0.611      0.344   0.44       0.79      button
                0.799    0.183    0.488      0.799   0.606      0.866     header
                0.406    0.035    0.6        0.406   0.484      0.737     image
                0.368    0.047    0.558      0.368   0.443      0.807     list
                0.905    0.171    0.67       0.905   0.77       0.888     logo
                0.285    0.058    0.506      0.285   0.364      0.784     menu
Weighted Avg.   0.581    0.104    0.579      0.581   0.554      0.826

=== Confusion Matrix ===

   a   b   c   d   e   f   <-- classified as
  33   7   3   5  41   7 |  a = button
   1 119   6   2   9  12 |  b = header
   5  19  39   9  19   5 |  c = image
   7  32   6  43  15  14 |  d = list
   3   5   9   3 209   2 |  e = logo
   5  62   2  15  19  41 |  f = menu
7.3 Naive Bayes
Correctly Classified Instances          488               58.7952 %
Incorrectly Classified Instances        342               41.2048 %
Kappa statistic                           0.486
Mean absolute error                       0.1638
Root mean squared error                   0.3196
Relative absolute error                  60.1778 %
Root relative squared error              86.648  %
Total Number of Instances               830

=== Detailed Accuracy By Class ===

                TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area  Class
                0.365    0.025    0.66       0.365   0.47       0.824     button
                0.789    0.164    0.536      0.789   0.638      0.88      header
                0.485    0.059    0.522      0.485   0.503      0.788     image
                0.398    0.059    0.517      0.398   0.45       0.822     list
                0.892    0.155    0.678      0.892   0.77       0.919     logo
                0.255    0.051    0.507      0.255   0.34       0.776     menu
Weighted Avg.   0.588    0.1      0.579      0.588   0.562      0.848

=== Confusion Matrix ===

   a   b   c   d   e   f   <-- classified as
  35   9   4   5  38   5 |  a = button
   1 127   8   6   6  13 |  b = header
   4  16  47   6  18   6 |  c = image
   8  26  14  45  11   9 |  d = list
   5   4  11   2 198   2 |  e = logo
   0  55   6  23  21  36 |  f = menu
7.4 Naive Bayes - Stemmed
Correctly Classified Instances          494               59.3037 %
Incorrectly Classified Instances        339               40.6963 %
Kappa statistic                           0.4933
Mean absolute error                       0.1589
Root mean squared error                   0.3184
Relative absolute error                  58.4828 %
Root relative squared error              86.3945 %
Total Number of Instances               833

=== Detailed Accuracy By Class ===

                TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area  Class
                0.458    0.027    0.688      0.458   0.55       0.834     button
                0.772    0.162    0.509      0.772   0.613      0.873     header
                0.469    0.06     0.506      0.469   0.486      0.756     image
                0.393    0.046    0.582      0.393   0.469      0.832     list
                0.874    0.141    0.704      0.874   0.78       0.925     logo
                0.292    0.067    0.477      0.292   0.362      0.803     menu
Weighted Avg.   0.593    0.096    0.588      0.593   0.574      0.852

=== Confusion Matrix ===

   a   b   c   d   e   f   <-- classified as
  44   6   4   3  31   8 |  a = button
   1 115   5   4   7  17 |  b = header
   5  15  45   6  17   8 |  c = image
   5  25  16  46  12  13 |  d = list
   9   6  12   2 202   0 |  e = logo
   0  59   7  18  18  42 |  f = menu
7.5 C4.5
Correctly Classified Instances          456               54.9398 %
Incorrectly Classified Instances        374               45.0602 %
Kappa statistic                           0.4367
Mean absolute error                       0.1908
Root mean squared error                   0.3257
Relative absolute error                  70.099  %
Root relative squared error              88.3074 %
Total Number of Instances               830

=== Detailed Accuracy By Class ===

                TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area  Class
                0.302    0.046    0.46       0.302   0.365      0.744     button
                0.783    0.176    0.516      0.783   0.622      0.849     header
                0.371    0.042    0.537      0.371   0.439      0.736     image
                0.301    0.06     0.442      0.301   0.358      0.741     list
                0.874    0.166    0.658      0.874   0.75       0.89      logo
                0.262    0.068    0.44       0.262   0.329      0.736     menu
Weighted Avg.   0.549    0.109    0.527      0.549   0.52       0.801

=== Confusion Matrix ===

   a   b   c   d   e   f   <-- classified as
  29   8   4   6  43   6 |  a = button
   3 126   8   1   6  17 |  b = header
   3  20  36  13  22   3 |  c = image
  11  27  10  34  11  20 |  d = list
  13   4   6   4 194   1 |  e = logo
   4  59   3  19  19  37 |  f = menu
7.6 C4.5 - Stemmed
Correctly Classified Instances          452               54.2617 %
Incorrectly Classified Instances        381               45.7383 %
Kappa statistic                           0.4284
Mean absolute error                       0.192
Root mean squared error                   0.3306
Relative absolute error                  70.6828 %
Root relative squared error              89.7098 %
Total Number of Instances               833

=== Detailed Accuracy By Class ===

                TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area  Class
                0.354    0.037    0.557      0.354   0.433      0.788     button
                0.711    0.159    0.493      0.711   0.582      0.816     header
                0.333    0.042    0.508      0.333   0.403      0.678     image
                0.308    0.082    0.379      0.308   0.34       0.695     list
                0.866    0.166    0.667      0.866   0.753      0.879     logo
                0.306    0.08     0.444      0.306   0.362      0.753     menu
Weighted Avg.   0.543    0.109    0.526      0.543   0.52       0.787

=== Confusion Matrix ===

   a   b   c   d   e   f   <-- classified as
  34   7   2   4  42   7 |  a = button
   1 106  10   8   7  17 |  b = header
   5  19  32  16  19   5 |  c = image
   6  28   8  36  15  24 |  d = list
   8   4   6  11 200   2 |  e = logo
   7  51   5  20  17  44 |  f = menu
7.7 K*
Correctly Classified Instances          433               52.1687 %
Incorrectly Classified Instances        397               47.8313 %
Kappa statistic                           0.3908
Mean absolute error                       0.1904
Root mean squared error                   0.3226
Relative absolute error                  69.9551 %
Root relative squared error              87.4448 %
Total Number of Instances               830

=== Detailed Accuracy By Class ===

                TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area  Class
                0.073    0.008    0.538      0.073   0.128      0.786     button
                0.801    0.182    0.514      0.801   0.626      0.87      header
                0.33     0.042    0.508      0.33    0.4        0.801     image
                0.177    0.039    0.417      0.177   0.248      0.745     list
                0.95     0.275    0.558      0.95    0.703      0.907     logo
                0.241    0.062    0.442      0.241   0.312      0.77      menu
Weighted Avg.   0.522    0.131    0.502      0.522   0.458      0.828

=== Confusion Matrix ===

   a   b   c   d   e   f   <-- classified as
   7   9   3   4  70   3 |  a = button
   1 129   3   5  11  12 |  b = header
   0  16  32  10  31   8 |  c = image
   2  31  16  20  26  18 |  d = list
   0   4   4   1 211   2 |  e = logo
   3  62   5   8  29  34 |  f = menu
7.8 K* - Stemmed
Correctly Classified Instances          447               53.6615 %
Incorrectly Classified Instances        386               46.3385 %
Kappa statistic                           0.4121
Mean absolute error                       0.186
Root mean squared error                   0.3194
Relative absolute error                  68.4615 %
Root relative squared error              86.6588 %
Total Number of Instances               833

=== Detailed Accuracy By Class ===

                TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area  Class
                0.177    0.008    0.739      0.177   0.286      0.801     button
                0.732    0.171    0.482      0.732   0.581      0.854     header
                0.333    0.039    0.525      0.333   0.408      0.747     image
                0.231    0.042    0.474      0.231   0.31       0.788     list
                0.931    0.244    0.594      0.931   0.725      0.917     logo
                0.326    0.083    0.452      0.326   0.379      0.792     menu
Weighted Avg.   0.537    0.124    0.541      0.537   0.494      0.833

=== Confusion Matrix ===

   a   b   c   d   e   f   <-- classified as
  17   8   2   5  56   8 |  a = button
   0 109   4   7  12  17 |  b = header
   1  17  32  10  29   7 |  c = image
   2  26  12  27  26  24 |  d = list
   1   6   7   1 215   1 |  e = logo
   2  60   4   7  24  47 |  f = menu
Bibliography
[1] Johannes Furnkranz, Exploiting Structural Information for Text Classification on the
WWW, Advances in Intelligent Data Analysis, Lecture Notes in Computer Science, 1999,
pp.487-497
[2] Esposito Floriana, Malerba Donato, Di Pace Luigi, Leo Pietro, A Machine Learning Approach to Web Mining, Advances in Artificial Intelligence, Lecture Notes in Computer
Science, 2000, pp.190-201
[3] Eric J. Glover, Kostas Tsioutsiouliklis, Steve Lawrence, David M. Pennock, Gary W. Flake,
Using Web Structure for Classifying and Describing Web Pages
[4] Yiming Yang, Sean Slattery, Rayid Ghani, A Study of Approaches to Hypertext Categorization, Journal of Intelligent Information Systems, 2002, pp.219-241
[5] Laurent Candillier, Isabelle Tellier, Fabien Torre, Transforming XML trees for efficient
classification and clustering, Advances in XML Information Retrieval and Evaluation,
2006, pp.469-480
[6] Aixin Sun, Ee-Peng Lim, and Wee-Keong Ng, Web classification using support vector
machines, Proceedings of the 4th international workshop on Web information and data
management (WIDM '02), 2002, pp.96-99
[7] Jonathan Elsas, Miles Efron, HTML Tag Based Metrics for use in Web Page Type Classification,
[8] Abdelhamid Bouchachia, Marcus Hassler, Classification of XML Documents, Proceedings
of the IEEE Symposium on Computational Intelligence and Data Mining 2007, pp.390-396
[9] World Wide Web Consortium (W3C), Document Object Model (DOM),
http://www.w3.org/DOM, Retrieved 6 Dec 2012
[10] Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, Ian H.
Witten, The WEKA Data Mining Software: An Update, SIGKDD Explorations, Volume
11, Issue 1, 2009.
[11] Usabilla B.V., Usabilla Discover, http://discover.usabilla.com, Retrieved 17 Jun 2013
[12] Mitchell T., Machine Learning, McGraw Hill. ISBN 0-07-042807-7, 1997, p.2.
[13] Jain, A.K., Duin, R.P.W., Jianchang Mao, Statistical pattern recognition: a review, Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol.22, no.1, pp.4-37, Jan
2000.
[14] M.F. Porter, An algorithm for suffix stripping, Readings in information retrieval, 1997,
pp.313-316
[15] G. Salton, A. Wong, C. S. Yang, A vector space model for automatic indexing, Commun.
ACM, November 1975, pp.613-620
[16] Amit Singhal, Modern Information Retrieval: A Brief Overview, Bulletin of the IEEE
Computer Society Technical Committee on Data Engineering, 2001
[17] I. Dan Melamed, Ryan Green, Joseph P. Turian, Precision and recall of machine translation. North American Chapter of the Association for Computational Linguistics on Human
Language Technology 2003, pp.61-63