1 - NAZOU

A ERID – Estimation of Relevance for Internet Documents
A.1 Basic Information
Internet documents vary in topic and in the structure of their content. ERID uses a
single neuron (perceptron) to estimate the relevance of an Internet document, taking
the defined keywords (expressed as regular expressions) as the features of the neuron
input. The perceptron is trained on a set of web pages that describe the required
domain well, so that it acquires the desired behavior. The WebCrawler tool uses the
resulting relevance rate to decide whether or not to store a copy of a crawled Internet
document.
A.1.1 Basic Terms
HTML – HyperText Markup Language
TLU – Threshold Logic Unit
A.1.2 Method Description
The ERID tool uses a single neuron (perceptron), a basic method well known in the
neural network community. The most important property of a neuron is that it can be
trained on an input set. In general, an input set is a set of features extracted from
the input data, and these features can vary in many ways. The main requirement on a
feature is measurability: it must be possible to transform the feature into a
numerical value. The numerical values of all features form the input vector. It is up
to the programmer how the features are measured and evaluated.
The perceptron is a type of artificial neural network invented in 1957 at the Cornell
Aeronautical Laboratory by Frank Rosenblatt. It can be seen as the simplest kind of
feedforward neural network: a linear classifier.
In the perceptron schema, xj denotes the j-th item of the input vector, wj denotes the
j-th item of the weight vector (the importance factor of the feature), a denotes the
weighted sum (the dot product of the input vector and the weight vector), b is the
threshold and y is the output of the neuron.
Figure 1: Schema of perceptron.
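The computation in the schema can be sketched in Java as follows. This is only an illustration, not ERID's actual code: the class name PerceptronSketch and its method names are hypothetical, and the convention that the neuron outputs 1 when a is greater than or equal to b is an assumption, since the text does not state whether the comparison is strict.

```java
// Hypothetical sketch of a single threshold logic unit; not ERID's actual code.
class PerceptronSketch {

    /** Weighted sum a = sum over j of (w_j * x_j). */
    static double weightedSum(double[] x, double[] w) {
        if (x.length != w.length) {
            throw new IllegalArgumentException("input and weight vectors differ in length");
        }
        double a = 0.0;
        for (int j = 0; j < x.length; j++) {
            a += w[j] * x[j];
        }
        return a;
    }

    /** Output y: 1 if a reaches the threshold b, otherwise 0 (assumed convention). */
    static int output(double[] x, double[] w, double b) {
        return weightedSum(x, w) >= b ? 1 : 0;
    }
}
```

For example, with x = (2, 0, 1) and w = (0.5, 1, -0.25) the weighted sum is 0.75, so the sketch outputs 1 for a threshold of 0.5 and 0 for a threshold of 1.0.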
A.1.3 Scenarios of Use
The ERID tool can be used for any categorization task in which features of the
categorized object can be identified and used as an input vector.
A.1.4 External Links and Publications
A.2 Integration Manual
The ERID tool is developed in Java SE 1.5 and distributed as a jar package. ERID
objects are accessible via Java interfaces. The tool is customized by setting up a
keyword matcher implemented with regular expressions.
A.2.1 Dependencies
ERID uses the following external libraries:
• Corporate memory libraries: libraries developed within the scope of the NAZOU
project.
• HtmlParser library (http://sourceforge.net/projects/htmlparser, version 2.0)
• Log4j library (http://logging.apache.org/log4j/, version 1.2.8+)
• JUnit library (http://sourceforge.net/projects/junit/, version 4.3+)
A.2.2 Installation
Installation requires the Apache Ant tool and Java SE 1.5 to be installed. The
installation procedure consists of the following steps:
1. Download the ERID source code.
2. Run the shell command ant jar to build the jar file.
A.2.3 Configuration
1. To set up ERID keywords and weights, edit the file build.properties and set the
following parameters:
• page.dir: the path to the directory where the categorized HTML files are stored.
ERID expects files prefixed by “yes” to fall into the category and files prefixed
by “no” not to fall into the category. This is important for training and
evaluation of ERID.
• stopword.properties: the path to the file containing the list of words that are
excluded from the evaluation.
• keyword.properties: the path to the file that contains the weight setup. The file
holds pairs of keyword (which can be written as a regular expression) and weight
that are loaded during ERID initialization. The initial weight can be set to a
random value. This property should not be set when running the find-candidates
task.
2. Run the shell command ant find-candidates to return the most frequently used
keywords in the training set.
3. Run the shell command ant train to train the ERID tool on the given training set.
By default the iteration limit is set to 1000 loops and the learning rate to 0.005.
Both values can be changed in the Evaluate Java class.
4. Run the shell command ant evaluate to check the training progress. The training
process (step 3) can be repeated until an acceptable result is returned.
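How keyword and weight pairs might be loaded can be sketched as follows. The exact layout of the keyword file is not specified here, so the sketch assumes the standard Java properties syntax with one keyword=weight entry per line. The class name KeywordWeightsSketch and the method parse are hypothetical and are not part of ERID.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Properties;
import java.util.TreeSet;
import java.util.regex.Pattern;

// Hypothetical sketch: assumes the keyword file holds "regex=weight" entries.
class KeywordWeightsSketch {

    /** Parses properties text into a map from keyword (regular expression) to weight. */
    static Map<String, Double> parse(String propertiesText) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(propertiesText));
        } catch (IOException e) {
            throw new IllegalStateException("cannot read keyword weights", e);
        }
        Map<String, Double> weights = new LinkedHashMap<String, Double>();
        // Sort the keywords so that the order of features is deterministic.
        for (String keyword : new TreeSet<String>(props.stringPropertyNames())) {
            Pattern.compile(keyword); // fail fast on an invalid regular expression
            weights.put(keyword, Double.valueOf(props.getProperty(keyword).trim()));
        }
        return weights;
    }
}
```

Under this assumption, a keyword containing a backslash, an equals sign, a colon or a space would have to be escaped according to the properties file syntax, because java.util.Properties treats these characters specially.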
A.2.4 Integration Guide
The tool can be used via the Java interfaces of the ERID and TLU implementations:
Figure 2: Tlu and Erid interface.
The ERID tool uses the TLU interface to access methods for perceptron initialization,
training and evaluation of an input set. The ERID interface is used for setting
keywords (features of a document), training the TLU by processing the content of
pre-categorized web pages, and evaluating the category of a given web page from its
content. See the JavaDoc for more information about the methods and their parameters.
Creating training set
a. Establish unambiguous categorization rules. For example, to categorize web pages
falling into the domain of job offers, a web page must contain at least the name of
the work position, its location, the offered salary and the requirements. Moreover,
the page must not contain multiple job offers, the work position must have a
contact person, etc.
b. Create copies of the contents of human-categorized web pages. In the case of job
offers, a user (the person who trains the ERID tool) should visit various job offer
portals and download pages at random. The user must mark web pages containing a
job offer (the pages fulfilling the rules) with the prefix “yes” and all other
pages with the prefix “no”. These web page copies should reside in a single
directory.
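The prefix convention can be sketched as a small helper that derives the expected result from a file name. The names TrainingLabelsSketch, labelFor and labeledFiles are hypothetical. The case-sensitive prefix match and the -1 value for unmarked files are assumptions made for illustration.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the "yes"/"no" prefix convention; not ERID's actual code.
class TrainingLabelsSketch {

    /** Returns 1 for "yes"-prefixed names, 0 for "no"-prefixed names, -1 otherwise. */
    static int labelFor(String fileName) {
        if (fileName.startsWith("yes")) return 1;
        if (fileName.startsWith("no")) return 0;
        return -1; // file is not marked and should be skipped
    }

    /** Lists the regular files in pageDir whose names carry a recognized prefix. */
    static List<File> labeledFiles(File pageDir) {
        List<File> result = new ArrayList<File>();
        File[] files = pageDir.listFiles();
        if (files == null) return result; // not a directory, or an I/O error occurred
        for (File f : files) {
            if (f.isFile() && labelFor(f.getName()) >= 0) {
                result.add(f);
            }
        }
        return result;
    }
}
```

Note that with a plain prefix test, any file whose name merely starts with the letters "no" (for example notes.html) would count as a negative example, so the training directory should contain only deliberately marked files.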
Selecting proper keywords (features)
The user can intuitively choose keywords that match the required domain. The selected
keywords together with the categorized training set define the domain used by ERID.
The user can run the find-candidates task (see Step 2 in section A.2.3 Configuration)
to review the most frequently used keywords in the page contents of the training set
and choose only those keywords that are relevant to the domain.
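The idea behind reviewing candidate keywords can be sketched as a word frequency count that skips stopwords. This is not the actual implementation of the find-candidates task. The class CandidateKeywordsSketch and the method topWords are hypothetical. Splitting on non-letter characters and lower-casing are assumptions, as is breaking ties alphabetically.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of candidate keyword selection; not ERID's actual code.
class CandidateKeywordsSketch {

    /** Returns up to 'limit' most frequent words, excluding the given stopwords. */
    static List<String> topWords(List<String> pageTexts, Set<String> stopwords, int limit) {
        final Map<String, Integer> counts = new HashMap<String, Integer>();
        for (String text : pageTexts) {
            // Split on runs of non-letter characters (assumed tokenization).
            for (String token : text.toLowerCase().split("[^\\p{L}]+")) {
                if (token.isEmpty() || stopwords.contains(token)) continue;
                Integer old = counts.get(token);
                counts.put(token, old == null ? 1 : old + 1);
            }
        }
        List<String> words = new ArrayList<String>(counts.keySet());
        // Most frequent first; ties are broken alphabetically.
        Collections.sort(words, new Comparator<String>() {
            public int compare(String a, String b) {
                int byCount = counts.get(b).compareTo(counts.get(a));
                return byCount != 0 ? byCount : a.compareTo(b);
            }
        });
        return words.subList(0, Math.min(limit, words.size()));
    }
}
```

The returned list is only a starting point: as described above, the user still decides which of the frequent words actually characterize the domain.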
Training TLU
The TLU is trained by running the train task (see Step 3 in section A.2.3
Configuration). The ERID tool reads the content of the web pages located in the
directory specified by the page.dir property and creates a list of input vectors and a
vector of expected results, which are passed to the method Tlu.train. This method
adjusts the weight of each keyword and stores the weights in the file specified by
keyword.properties.
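A training loop of this kind can be sketched with the classic perceptron learning rule. The sketch does not reproduce the actual signature or behavior of Tlu.train. The class TluTrainingSketch is hypothetical. Adjusting the threshold together with the weights and stopping early once a full pass causes no change are both assumptions. The learning rate and iteration limit are passed in so that the documented defaults (0.005 and 1000) can be used.

```java
// Hypothetical sketch of perceptron training; not ERID's actual Tlu implementation.
class TluTrainingSketch {
    final double[] weights;
    double threshold;

    TluTrainingSketch(int featureCount, double threshold) {
        this.weights = new double[featureCount]; // all weights start at 0.0
        this.threshold = threshold;
    }

    /** Returns 1 if the weighted sum reaches the threshold, otherwise 0. */
    int evaluate(double[] x) {
        double a = 0.0;
        for (int j = 0; j < weights.length; j++) {
            a += weights[j] * x[j];
        }
        return a >= threshold ? 1 : 0;
    }

    /** Trains on the given examples and returns the number of passes performed. */
    int train(double[][] inputs, int[] expected, double learningRate, int iterationLimit) {
        for (int pass = 1; pass <= iterationLimit; pass++) {
            boolean changed = false;
            for (int i = 0; i < inputs.length; i++) {
                int error = expected[i] - evaluate(inputs[i]); // -1, 0 or +1
                if (error == 0) continue;
                changed = true;
                for (int j = 0; j < weights.length; j++) {
                    weights[j] += learningRate * error * inputs[i][j];
                }
                threshold -= learningRate * error; // assumed: threshold is learned too
            }
            if (!changed) return pass; // every example is classified correctly
        }
        return iterationLimit;
    }
}
```

If the training set is not linearly separable, this loop never reaches a pass without changes and simply stops at the iteration limit, which is one reason to check the result with the evaluate task afterwards.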
Development Manual
A.2.5 Tool Structure
The ERID tool is composed of the following classes:
TluImpl class – provides a simple perceptron implementation. It exposes methods for
setting and getting weights, training weights and evaluating an input set.
EridImpl class – provides an implementation of keyword-based features working with
HTML pages.
WordStat class – helps during keyword selection.
Configuration class and Evaluation class – provide methods for computing initial
settings and for evaluating the trained TLU.
Figure 3: ERID tool block schema.
A.2.6 Method Implementation
The ERID tool uses the single neuron method (described in section A.1.2 Method
Description) to evaluate an input set (web page contents). The neuron method is
implemented in class Tlu. The Erid class counts the word occurrences and uses a
single, properly initialized Tlu object to evaluate the domain of the web page content.
Figure 4: Dependency diagram for Erid interface.
Figure 5: Dependency diagram for Tlu interface.
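Counting keyword occurrences to build an input vector can be sketched as follows. The class KeywordFeaturesSketch and the method toInputVector are hypothetical. The sketch assumes that the text has already been extracted from the HTML page and that raw match counts are used as feature values; ERID may normalize or weight them differently.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of keyword-based feature extraction; not ERID's actual code.
class KeywordFeaturesSketch {

    /** Counts the non-overlapping matches of each keyword pattern in the text. */
    static double[] toInputVector(String text, Pattern[] keywords) {
        double[] x = new double[keywords.length];
        for (int j = 0; j < keywords.length; j++) {
            Matcher m = keywords[j].matcher(text);
            while (m.find()) {
                x[j]++;
            }
        }
        return x;
    }
}
```

The resulting vector has one item per keyword, in the same order as the weight vector, so it can be passed directly to a TLU for evaluation.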
A.2.7 Enhancements and Optimizing
The ERID tool can be enhanced by implementing multi-level TLUs for more precise
domain categorization. TLUs can be combined using logical operators or by using the
output of one TLU as an input for another TLU.
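Both ways of combining TLUs can be sketched in a few lines. This enhancement is not implemented in ERID, so everything here is hypothetical, including the class MultiLevelTluSketch, its method names and the threshold convention (output 1 when the weighted sum is greater than or equal to the threshold).

```java
// Hypothetical sketch of combining TLUs; this enhancement is not part of ERID.
class MultiLevelTluSketch {

    /** A single TLU: 1 if the weighted sum of x reaches threshold b, otherwise 0. */
    static int tlu(double[] x, double[] w, double b) {
        double a = 0.0;
        for (int j = 0; j < x.length; j++) {
            a += w[j] * x[j];
        }
        return a >= b ? 1 : 0;
    }

    /** Combination by a logical operator: both TLUs must accept the input. */
    static int and(int y1, int y2) {
        return (y1 == 1 && y2 == 1) ? 1 : 0;
    }

    /** Combination by stacking: first-level outputs become the second-level input. */
    static int stacked(double[] x, double[][] firstWeights, double[] firstThresholds,
                       double[] secondWeights, double secondThreshold) {
        double[] hidden = new double[firstWeights.length];
        for (int k = 0; k < firstWeights.length; k++) {
            hidden[k] = tlu(x, firstWeights[k], firstThresholds[k]);
        }
        return tlu(hidden, secondWeights, secondThreshold);
    }
}
```

With second-level weights (1, 1) and a second-level threshold of 2, the stacked variant behaves like the logical AND of the two first-level TLUs, which shows that the logical operators are a special case of stacking.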
A.3 Manual for Adaptation to Other Domains
A.3.1 Configuring to Other Domain
The ERID tool can be adapted to other domains by customizing the training set
(described in paragraph “Creating training set” in section A.2.4 Integration Guide)
and by selecting proper keywords (described in paragraph “Selecting proper keywords”
in section A.2.4 Integration Guide).