Semi-automatic creation of domain ontologies with
centroid based crawlers
Carel Fenijn
Graduate Thesis Doctoraal Linguistics
Utrecht University, December 2007
Contents

1 Introduction
1.1 The World Wide Web
1.2 The Semantic Web
1.3 From World Wide Web to Semantic Web

2 Ontology Engineering
2.1 Ontology Definitions
2.2 Types of Ontologies
2.3 Classification of Ontologies
2.4 Ontology Languages
2.5 Ontology Design

3 Ontology Learning
3.1 Ontology Learning Techniques
3.2 Ontology Editors and Engineering Tools
3.3 Ontology Learning Approaches
3.4 Assessment of Ontology Learning Approaches

4 Information Retrieval: Focused Crawling
4.1 Definition Focused Crawling
4.2 Focused Crawling Techniques
4.3 Focused Crawling Approaches
4.4 Assessment of Focused Crawling Approaches

5 OntoSpider
5.1 The Ontology Engineering Component of OntoSpider
5.2 The IR Component of OntoSpider
5.3 Assessment of OntoSpider

6 Conclusion and Further Research
6.1 Some Notes on Methodology
6.2 Future Research

Bibliography
List of Figures
1.1 Layered Stack of the Semantic Web, from http://www.w3.org/
2.1 An Ontology Scale by Lassila and McGuinness, 2001
2.2 An Ontology Scale by Daconta et al., 2003
3.1 Opening screen of Protégé with the OntoLT plug-in marked in red
3.2 Tabs of OntoLT
3.3 Mapping Rule for Head Nouns and Modifiers
3.4 Above Rule in an older version of OntoLT
5.1 Simplified Possible View of OntoSpider
5.2 OntoSpider with OntoLT as the Ontology Learning Component
5.3 IR Component of OntoSpider
5.4 Rich output of OntoSpider
Abstract
Various approaches exist for the semi-automatic creation of ontologies from text. This thesis shows how centroid based focused crawlers can be used for this purpose, specifically for building domain ontologies in specialized fields such as linguistics. The approach is highly modular: the highly specialized output corpus of its Information Retrieval component serves as input to its Ontology Engineering component, which creates the ontologies. With this approach, domain ontologies can be created for subjects like natural language morphology. The overall approach that is proposed combines techniques from Information Retrieval and from Ontology Engineering. Some systems that could form the Ontology Engineering part of the approach are discussed. This study examines whether the use of centroid based focused crawlers can help in the semi-automatic creation of ontologies from text. More specifically, two types of focused crawlers will be compared: a general purpose centroid based focused crawler and a literature crawler. The approach that is proposed here is called "OntoSpider".
Acknowledgements
I would like to thank Dr. Paola Monachesi at Utrecht University, who supervised this thesis with good advice and much patience, as it was often difficult to make progress in this research while working a full-time job.

Thanks also to my former boss at Demon, Jim Segrave, for allowing me to shift work hours so that I could follow some courses that were relevant for this research. An interesting course in Information Retrieval by Maarten de Rijke and Valentin Jijkoun at the University of Amsterdam set me off in the direction of this research, and material from that course was used.
Chapter 1
Introduction
1.1 The World Wide Web
Information on the current Web, the World Wide Web (WWW), is stored in a decentralized way and may be available in many formats, traditionally mostly embedded in the relatively loose HTML format and increasingly in the more rigid XHTML, and it is often expressed in a natural language. Metadata is scarcely present, mostly in the form of keywords and DTDs. Where DTDs are available, they are often so generic that no specific semantic information can be derived from them. For agents like web crawlers, it is often difficult to extract the right information from the WWW, because agents do not 'understand' natural languages. Web pages are mostly created for human 'consumption' only.
Finding specific information on the WWW is often a very time consuming endeavour, with limited results. Because of the overload of information that is present on the World Wide Web, and the noise that accompanies it, much research is devoted to Information Retrieval from the World Wide Web.
For search engines like Google and Yahoo!, software agents called web crawlers or spiders crawl the Web to gather and index information and make it available. Typically, such search engines try to cover a very large part of the Web, and general purpose crawlers may be used for this. In order to keep the task manageable, efficient algorithms such as the PageRank algorithm [55] have been developed.

One type of crawler that has been developed is the focused crawler or topic-oriented crawler [17]. Focused crawling approaches try to deal with the enormous mass of information that is contained in the Web in a more efficient way than general purpose crawlers do, and offer ways to extract very specific on-topic data from it by selectively crawling the Web. This saves network traffic and processor
time, as only smaller subsets of the Web are crawled. Apart from their more limited use of resources, focused crawlers may also yield better results for specific domains than general purpose ones. A centroid based focused crawler is a special type of focused crawler that makes use of a centroid, which is a representation of highly on-topic information. Focused crawlers in general, and centroid based ones in particular, will be described in chapter 4.
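To make the notion of a centroid more concrete, the following sketch in Python builds a centroid as the average of the term frequency vectors of a few seed documents and scores a new page against it with cosine similarity. The seed texts, the tokenization and the weighting scheme are invented for the illustration; they are not the exact setup used by OntoSpider.

from collections import Counter
from math import sqrt

def term_vector(text):
    # Very crude tokenization; a real crawler would also apply stopword
    # removal, stemming and a TF-IDF style weighting.
    return Counter(text.lower().split())

def centroid(vectors):
    # Average the term frequencies of the seed documents.
    total = Counter()
    for v in vectors:
        total.update(v)
    return {term: count / len(vectors) for term, count in total.items()}

def cosine(v1, v2):
    dot = sum(v1[t] * v2.get(t, 0.0) for t in v1)
    n1 = sqrt(sum(x * x for x in v1.values()))
    n2 = sqrt(sum(x * x for x in v2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

# Hypothetical on-topic seed documents about morphology.
seeds = [
    "inflectional morphology studies affixes and stems",
    "derivational morphology forms new words from stems",
]
c = centroid([term_vector(s) for s in seeds])

# A candidate page would only be followed if it is close enough to the centroid.
page = "the morphology of verbs involves stems and affixes"
print(cosine(c, term_vector(page)))  # a similarity score between 0 and 1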
1.2 The Semantic Web
The Semantic Web, as presented in e.g. Tim Berners-Lee et al. ([7], 2001), is a vision of Tim Berners-Lee, the inventor of the current World Wide Web (WWW) and director of the World Wide Web Consortium (W3C). In Berners-Lee's vision, the future web, the Semantic Web, will also contain metadata. This metadata will enable agents to extract very specific information from web pages and to act intelligently on this information using logical inferences. Apart from the addition of metadata, other problems of the current Web will be addressed in the Semantic Web.
One such problem is that of trust and reliability of data. In a layered model of the design of the Semantic Web, trust even forms the highest layer of the stack (see Figure 1.1), based on proof and cryptography. If intelligent agents can make logical inferences based on explicit formal metadata, they can also account for these inferences, which can be requested and verified by humans if necessary. The interpretation of metadata will be based on information that is present in ontologies. As on the current World Wide Web, information on the Semantic Web will be stored in a decentralized way.
Ontologies are central to the implementation of the Semantic Web. They contain domain knowledge, specific data regarding a certain subject field, in a very structured way. Semantic Web agents will be able to interpret information that is found in Web pages using these ontologies, as the ontologies give the agents precise information about those pages. In addition, based on ontologies, such agents will be able to communicate with each other, as the ontologies provide a shared understanding of a given domain.
The development and maintenance of ontologies is the subject of Ontology Engineering. For the Semantic Web to work, a great many ontologies will have to be available. In the literature, this is referred to as the 'bootstrapping problem' of the Semantic Web, which is in a way a chicken-and-egg problem: ontologies are a necessary prerequisite for a working Semantic Web, but as long as
there is no Semantic Web yet, many people are not interested in producing ontologies.

Figure 1.1: Layered Stack of the Semantic Web, from http://www.w3.org/

One way of overcoming this problem may be to (semi-)automatically create a
lot of ontologies from existing resources, like knowledge bases and the World Wide
Web. As Ontology Engineering is such a central part of this study, it will be treated
in more detail in a separate chapter.
1.3 From World Wide Web to Semantic Web
The (semi-)automatic creation of ontologies has received considerable attention in recent years. The main reason for this is that the manual creation of ontologies is tedious and costly work. On the one hand, we know that a great many ontologies will need to be made in the bootstrapping phase and later phases of the Semantic Web, and much research has been carried out in this direction. On the other hand, in the Information Retrieval community, much work has been done on improving approaches and algorithms for efficient and effective IR on the current World Wide Web. One of the approaches in this field is that of focused crawlers, and here a centroid based approach is one of the options. We will investigate how these two research areas can be combined. For domains that have already been covered by ontologies, the creation of such ontologies from scratch might be less useful, unless this is done for the purpose of evaluating an Ontology Engineering approach; enriching such existing ontologies makes more sense. The approach proposed here may be particularly useful
in cases where no such ontologies exist yet and repositories of specialized research papers on a subject are available. Even once the Semantic Web is in a very advanced stage, there may still be highly specialized subjects for which no ontologies are available at all, and for which data on the World Wide Web can be of use.
The reason why little other research combines the two fields of Information Retrieval and Ontology Engineering may be that the semi-automatic creation of ontologies by itself is already a very difficult research field, with many problems that still need to be solved, like NLP problems and knowledge engineering problems, and most researchers concentrate on just that, or even on subproblems of it. For many specific purposes, selecting a set of relevant documents on which ontologies are based in a different way, for instance with clustering techniques, often suffices. Also, the field of Information Retrieval has its own issues that need to be resolved, both for IR on the Web and on the Semantic Web. Yet, a motivation for combining the two fields will be presented here. The World Wide Web, with all the shortcomings it has compared to the Semantic Web, does contain a huge wealth of information. As mentioned, focused crawlers can extract highly relevant and specialized information from this World Wide Web. The centroid that is used by centroid based focused crawlers and the set of downloaded pages should contain very specialized data. Usually the downloaded pages will be in a rough format, such as free text embedded in HTML. As was mentioned above, ontologies also contain very specific data. Clearly, there is a large gap between the raw specialized data that is gathered by focused crawlers and the richly structured data that is contained in ontologies; still, it is interesting to study the hypothesis that combining a focused crawler approach with an approach for the semi-automatic creation of ontologies from text can be fruitful. A study which seems to confirm this hypothesis is Ehrig ([31], 2002). His approach, which will be described in some more detail below, also combines a focused crawler with ontology creation. However, it uses ontological metadata for the enhancement of focused crawls, instead of simple vector based centroids. The study at hand will do the opposite: examine how focused crawlers may help in the semi-automatic creation of ontologies from text. Another, more recent, study that combines focused crawling with ontology learning is Su et al. ([67], 2004). They also use ontologies for improving focused crawls. As a side effect, the ontologies that are used are enriched in an automatic way. Some more detail on their approach will follow in chapter 3. Note that the definition of 'ontology' that is adopted here does not mention the Semantic Web at all. Even though ontologies play a crucial part in the emergence of the Semantic Web, the use of ontologies is more universal than that. In research projects, company intranets, etcetera, ontologies may play an important role as well, as part of Knowledge Management. In general, one of the first stages in Ontology Engineering processes, like the manual construction of ontologies, is the enumeration of terms that will be part of the ontology. Noy et
al. ([52], 2001) describe this as step 3, after determining the domain and scope of the ontology (step 1) and considering reusing existing ontologies (step 2). It may be interesting to see whether the resulting set of terms in the centroid of the focused crawler might be a good starting point for this third step in manual ontology creation as well.
Now that a motivation for combining a focused crawler with the semi-automatic creation of ontologies from text has been presented, the central areas of this approach, which combines Ontology Engineering and centroid based focused crawling, will be described in more detail in the following chapters. Chapter 2 presents the field of Ontology Engineering, describing its main concepts. Chapter 3 is on Ontology Learning; it mainly presents concepts, techniques and approaches that are specific to the (semi-)automatic creation of ontologies. Chapter 4 covers Information Retrieval, in particular focused crawling. In chapter 5, the approach itself, OntoSpider, will be presented. This approach employs centroid based focused crawlers to semi-automatically create domain ontologies based on data that is available on the World Wide Web. More specifically, the results of a General Purpose Focused Crawler will be compared with those of a Literature Crawler, from an Ontology Engineering point of view. For this purpose, hypotheses will be proposed.
Chapter 2
Ontology Engineering
The field of Ontology Engineering studies the theory and practice of how ontologies are designed and created. An overview of the recent state of the art in Ontology Engineering can be found in Gómez-Pérez et al. ([34], 2004). Much of this chapter is based on information from their book.
2.1 Ontology Definitions
Traditionally, Ontology is a branch of Philosophy that studies the 'being' of things: their essence, existence, properties, nature, classification, etcetera. Ancient Greek philosophers like Parmenides and Aristotle made important contributions to this discipline, and philosophers have continued to study it throughout history up to modern times.

More recently, the term 'ontology' has been adopted within a Knowledge Engineering setting. One very frequently cited definition is that of Gruber ([36], 1993): "An ontology is an explicit specification of a conceptualization." Clearly, this definition is rather vague, and other researchers have proposed definitions that are based on Gruber's but that are more precise. Studer et al. (1998), as cited in [34], define an ontology as "a formal explicit specification of a shared conceptualization. Conceptualization refers to an abstract model of some phenomenon in the world by having identified the relevant concepts of that phenomenon. Explicit means that the type of concepts used, and the constraints on their use are explicitly defined. 'Formal' refers to the fact that the ontology should be machine-readable and processable. 'Shared' reflects the notion that an ontology captures consensual knowledge, that is, it is not private of some individual, but accepted by a group." For the purpose of this study, this definition will be adopted. Ontology specifications are formulated in
ontology languages, and various of these have been developed in recent years.
2.2 Types of Ontologies
In the literature, various types of ontologies have been proposed. Gómez-Pérez et al. ([34], 2004) mention Top-Level Ontologies, which mainly deal with universal abstract categories; General or Common Ontologies, which contain common sense information; Knowledge Representation Ontologies, for which a KR paradigm is characteristic; Task Ontologies, which focus on a task or activity; Method Ontologies, which center around some method; and Application Ontologies, which are made for a specific application. The type of ontology that this study is concerned with is the Domain Ontology. Characteristic of a Domain Ontology is that a specific domain, like a scientific discipline or a specific business, is the subject of the ontology, and that the ontology therefore typically uses a more specialized vocabulary.
2.3 Classification of Ontologies
It is very common to distinguish between lightweight and heavyweight ontologies, and scales or hierarchies of ontologies have been proposed that place ontologies on a scale between lightweight and heavyweight, which also reflects the expressiveness of the formalisms that are used for these ontologies. Such ontology scales can help understand the differences, commonalities and relationships between e.g. semantic networks, thesauri, taxonomies, catalogs, ontologies, relational databases, UML, logics and the Object Oriented paradigm.
One classification is that of Lassila and McGuinness ([41], 2001). In their paper, they argue that the RDF formalism can be seen as a frame based formalism, and that frame-based representation is a suitable paradigm for ontology creation. They point out the connection between frame-based systems, object oriented programming and description logics, and argue that even catalogs, glossaries and controlled vocabularies could be seen as potential ontology specifications. The classification, which they present in the paper as an Ontology Spectrum, ranges from such catalogs and glossaries on one end to systems with general logical constraints on the other end. There is a clear dividing line between systems with informal is-a relations and those with formal is-a relations.

Figure 2.1: An Ontology Scale by Lassila and McGuinness, 2001
While explaining the characteristics of taxonomies and thesauri and the difference between taxonomies and ontologies, and as a basis for their definition of ontologies, Daconta et al. ([23], 2003) propose an Ontology Spectrum with weak semantics at the lower end of the scale and strong semantics at the higher end. The scale ranges
from the weakest, the Relational Model, via Taxonomy, Schema, ER, Thesaurus, Extended ER, XTM, RDF/S, Conceptual Model, UML, DAML+OIL, Description Logic, Local Domain Theory and First Order Logic to Modal Logic, which is at the high end of the spectrum. They go to great lengths describing what taxonomies and ontologies are, and hold that the main difference between taxonomies and ontologies is that the former lack the rigorous logic on which machines can base inferences, while the latter do have such rigorous logic.

Figure 2.2: An Ontology Scale by Daconta et al., 2003
2.4 Ontology Languages
Ontology languages are the formal languages in which ontologies are defined. In this section, only some important characteristics of the currently most widely used ontology languages will be described in a general, informal way. For formal specifications, ample literature is available.
2.4.1 XML/XML Schema
XML (Extensible Markup Language) is a formal language that conforms to the SGML specifications. It can be seen as a subset of SGML that is simpler and more practical in its use than SGML. Because XML, and the XHTML that was derived from it, are more rigidly defined than HTML, they are easier to process automatically in a consistent way than HTML is. One of the reasons for the W3C to develop XML was to deal with the shortcomings of HTML. In HTML, the representation of data and its presentation are mixed and messy; in XML they are strictly separated, enabling clear, unambiguous data representation with well-defined syntactic means. However, the use of XML is far broader and further reaching than just applications on the World Wide Web. At the time of writing, XML is the most common standard in use for Business to Business (B2B) information interchange.
XML Schema and its formal language, XML Schema Definition (XSD), allow one to create data models and to specify data types and criteria by which XML documents are or are not valid. Thus XML documents can be syntactically correct according to the XML specifications, but invalid given a specific XML Schema specification. An older schema language that was in common use for HTML and XML is the DTD. DTDs are increasingly making way for XML Schema, but for historical reasons they are still in wide use. Unlike DTDs, XML Schema itself conforms to the XML specifications.
In and of themselves, XML and XML Schema do not suffice as ontology languages, for only the correctness and validity of the syntax of XML documents can be determined, not the semantics of these documents. For instance, the XML markup <dictator>John</dictator> and <gardener>John</gardener> do not mean anything different to an XML parser, even though humans who choose or read these tags will most likely assign a certain meaning to them. However, fully-fledged ontology languages which are capable of expressing complex meaning have been formulated entirely in XML and XML Schema, which is the reason for mentioning XML here.
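A minimal sketch of this point, using the XML parser from Python's standard library: to the parser, the two markups differ only in the tag label, and no meaning is attached to either tag. The tag names are taken from the example above.

import xml.etree.ElementTree as ET

# Two fragments that differ only in the tag name; to the parser each is simply
# an element with a text value, and neither tag carries any meaning.
for fragment in ["<dictator>John</dictator>", "<gardener>John</gardener>"]:
    element = ET.fromstring(fragment)
    print(element.tag, element.text)  # e.g. "dictator John"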
2.4.2 RDF(S)
One of the many formal languages that have been constructed in accordance with the XML specifications is the Resource Description Framework (RDF) [3]. It was developed by the W3C to provide a solid formal basis for ontology languages, expressing meaning with RDF triples. RDF triples are sets of three identifiers, resources, one of which intuitively functions as a subject, one as an object, and one as a predicate or relation between subject and object, much like meaning can be represented in many natural languages and in First Order Predicate Logic (FOL). A triple like a, R, b could be represented in FOL with the two-place predicate R as Rab or R(a,b). The identifiers of RDF triples are often URIs; for the subject and the relation or predicate this is always the case. The object can be either a URI or a literal. The URIs, which are often URLs on the Web, ensure explicitness and precision of data representation. For example, thousands of different entities called 'John' can all have their own URL disambiguating them. Even though RDF's data model with triples is simple, its expressive power is considerable. Many RDF triples can combine into complicated webs of knowledge that are equivalent to semantic nets. Even though FOL predicates with more than two places cannot be represented with a single RDF triple, they can be represented with multiple RDF triples in an indirect way. Also, reification is part of RDF, so it is possible to make statements about RDF statements in this data model. Furthermore, RDF containers are part of the RDF data model, with groups of resources like bags (unordered sets) and sequences (ordered sets).
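The correspondence between RDF triples and two-place FOL predicates, and the way URIs disambiguate entities, can be sketched in a few lines of Python; the URIs below are invented for the illustration.

# Each triple is (subject, predicate, object); subjects and predicates are URIs,
# objects may be URIs or literals. The two distinct 'John' URIs keep two
# different people apart, which the bare name alone could not.
triples = [
    ("http://example.org/people/john-1", "http://example.org/terms/profession", "dictator"),
    ("http://example.org/people/john-2", "http://example.org/terms/profession", "gardener"),
]

def as_fol(subject, predicate, obj):
    # Render a triple in the R(a, b) notation used above.
    return f"{predicate}({subject}, {obj})"

for s, p, o in triples:
    print(as_fol(s, p, o))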
RDF has been extensively documented by the W3C, and all specifications are openly available in the RDF Concepts and Abstract Syntax document, the RDF Semantics document ([57]) and other documents. In addition, a document like the RDF Primer ([56]) makes the technology accessible to the public. RDF does not necessarily have to be represented in XML. Shorthand notations exist, like N3; tuple notations like {subject, predicate, object} and <subject> <predicate> <object> are in use, as well as graphical representations with directed labeled graphs. Although RDF graphs are easy for humans to read, it is more efficient to serialize the data in XML format so that it is easy for computer programs to process.
Unlike XML Schema with respect to XML, RDF Schema is not a schema language in which valid RDF representations are defined. RDF Schema was built on top of RDF and can be seen as a limited, lightweight ontology language, in the sense that it formally defines the class and subclass relations and allows RDF vocabularies to be formulated in which classes and properties are distinguished.
2.4.3 OWL
The constraints that RDFS imposes on RDF are quite limited. Other ontology languages were developed which impose more and more precise constraints and allow the formulation of heavyweight ontologies. One such language is the Web Ontology Language (OWL), which exists in three variants: OWL Lite, OWL DL and OWL Full. Like RDF, OWL is well documented with extensive open documentation, like the OWL Web Ontology Language Reference ([54]). Historically it descends from earlier ontology languages, DAML+OIL, which like OWL itself was based on RDF. In OWL Lite, relatively lightweight ontologies like taxonomies can be formulated; in OWL DL, which is more expressive, more heavyweight ones; and in OWL Full, which is the most expressive of the three, any ontology that the RDF formalism allows for can be formulated. The choice of OWL variant can depend on the purpose of a project: if one only needs to produce a taxonomy, the choice of OWL Lite can be evident, also for reasons of decidability and efficiency. A quick overview of the OWL specifications can be found in the OWL Web Ontology Language Overview ([53]).
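As an illustration of the kind of lightweight taxonomy that fits in OWL Lite, the sketch below declares two classes and one subclass axiom with the rdflib library and prints them in Turtle notation. The class names and the namespace are invented, this is not one of the ontologies discussed in this thesis, and rdflib version 6 or later is assumed.

from rdflib import Graph, Namespace
from rdflib.namespace import OWL, RDF, RDFS

EX = Namespace("http://example.org/linguistics#")  # hypothetical namespace

g = Graph()
g.bind("ex", EX)

# Two OWL classes and one taxonomic (subclass) relation.
g.add((EX.Grammar, RDF.type, OWL.Class))
g.add((EX.CategorialGrammar, RDF.type, OWL.Class))
g.add((EX.CategorialGrammar, RDFS.subClassOf, EX.Grammar))

# In rdflib 6 and later, serialize() returns a string.
print(g.serialize(format="turtle"))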
2.5 Ontology Design
Various very specific methodologies for ontology design have been proposed in the literature. They are applicable both to manual and to (semi-)automatic ontology design. The main methodologies for defining a classification of classes or concepts in ontologies are the Top-Down methodology, which goes from general to specific, and the Bottom-Up methodology, which goes in the opposite direction, from specific to more general classes or concepts.

The Top-Down methodology departs from general concepts and moves to specific ones. According to Uschold and Gruninger ([71], 1996), the amount of detail of the ontology is better controlled with this methodology than with the Bottom-Up methodology. A disadvantage of this methodology, however, is that it can become arbitrary which high-level concepts will get a place in the ontology when this
methodology is followed. In the Top-Down approach, the high-level concepts do
not follow from the lower level concepts themselves. Therefore, the ontology could
become less stable and the process may require more effort and re-work.
The Bottom-Up methodology goes from detailed and specific concepts to more general ones. Uschold and Gruninger ([71], 1996) maintain that this approach may also result in more effort and re-work, but for different reasons. The level of detail in the ontology may become very high in this approach, which may increase the chance of inconsistencies and may make commonalities between related concepts less transparent.
In the Middle-Out approach, one starts with the main concepts 'in the middle', i.e. those which are neither very high-level nor at the maximum level of specificity. Uschold and Gruninger ([71], 1996) hold that this approach strikes a balance in the level of detail of the resulting ontology. High-level and low-level concepts then follow naturally from these main concepts from which one departs.
An approach that Uschold and Gruninger ([71], 1996) do not mention is that of Mixture Ontology Design. Here, one could start with both high-level concepts and concepts at the lowest, most detailed level, thus mixing the Top-Down and Bottom-Up approaches. It is to be expected that this methodology would suffer from the drawbacks of both of the other methodologies. All in all, the Middle-Out Ontology Design strategy seems to be the most promising.
Many approaches that involve some cyclic, iterative way of constructing ontologies will, because of this, include the possibility of enriching existing ontologies. Approaches may also focus on the enrichment of existing ontologies as a goal in itself. Apart from enriching existing ontologies, there are also strategies that reuse existing ontologies to create totally new ontologies, and strategies exist that merge two or more existing ontologies into one.
Chapter 3
Ontology Learning
Ontology Learning is the acquisition of knowledge for the (semi-)automatic creation of ontologies. Very often Ontology Learning is done from text, but it can also be done from other sources, like databases. Because of the interdisciplinary nature of the subject, very many Ontology Learning approaches exist, and many methods and techniques are used in this field. Buitelaar et al. ([11], 2003) argue that, in spite of this multidisciplinary nature, Ontology Learning is a new and challenging area in its own right.
In this chapter, some existing surveys will first be treated. Then some specific approaches will be described that are similar or otherwise related to the OntoSpider approach presented in chapter 5. Finally, some general aspects of various approaches will be mentioned, like commonalities in system designs, convergence or divergence of NLP approaches, choice of AI technologies, etcetera. This study will mainly focus on ontology learning from text; it is not an exhaustive survey. The reason for examining various approaches in addition to using existing surveys was to get a better grasp of the subject matter and to avoid reinventing the wheel. Roughly, two types of related work can be distinguished: work that is similar to the total approach, i.e. it involves both a focused crawler and the (semi-)automatic creation of ontologies from text, and work that is only similar to part of it, i.e. work that only involves the use of focused crawlers, mainly for scientific data gathering purposes, or that only involves the (semi-)automatic creation of ontologies.
3.1 Ontology Learning Techniques
Because of the multidisciplinary nature of the field, many existing techniques from fields like Artificial Intelligence and Information Retrieval are used for Ontology Learning. This section presents some techniques that may be used by various Ontology Learning approaches. Many of these techniques are related to text processing and analysis. There is often a choice of algorithms that can be used to implement the techniques described here. Certain techniques are very general in nature and might as well have been presented in the chapter on Information Retrieval, chapter 4.

Web Mining is a research area in which information is extracted from the World Wide Web. For this extraction, among other things, Text Mining may be used; here the information extraction is specifically from texts in natural languages like English and French. Further techniques that are used include Chunk Parsing, POS Tagging and Semantic Tagging.
Stopping or stopword removal is a standard technique in NLP. The most frequent words in corpora, the stopwords, occur in practically any document; hence they are not significant for most IR purposes and are removed at a very early stage.
Like stopping, stemming is very standard in NLP and IR. Stemming reduces the various forms of a word that may be the result of morphological processes like inflection and derivation to a single stem or root. Often used stemmers are the Porter stemmer, which has modules for various languages, and the Lovins stemmer. From a morphological point of view, the results of stemmers are often quite crude, but from a pragmatic point of view they are still very effective.
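A small sketch of both preprocessing steps with NLTK, which ships a stopword list and an implementation of the Porter stemmer; it assumes the NLTK stopword corpus has been downloaded, and the example sentence is invented.

from nltk.corpus import stopwords      # requires: nltk.download("stopwords")
from nltk.stem import PorterStemmer

stop = set(stopwords.words("english"))
stemmer = PorterStemmer()

tokens = "the crawlers are downloading pages about inflectional morphology".split()

# Drop stopwords, then reduce the remaining tokens to their stems.
content = [t for t in tokens if t not in stop]
stems = [stemmer.stem(t) for t in content]
print(stems)  # stems such as 'crawler', 'download', 'page', ...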
Part-of-Speech tagging or POS-tagging is also a very common technique in NLP and IR. One of the most famous taggers is the Brill POS-tagger; another is the Monty POS-tagger.
Chunk Parsing is a shallow technique by which natural language sentences are parsed into chunks. Very roughly, these chunks correspond to syntactic phrases. Often, chunk parsing takes place after a POS-tagging phase, and the technique is widely used in IR. Approaches that have a chunk parser include SMES and SymOntos; the latter uses the CHAOS chunk parser. A specific application of chunk parsing is cascaded chunk parsing. In this approach, the output of one round of chunk parsing can be input to a next round in which new chunks can be parsed, so that multiple consecutive rounds of chunk parsing can take place.
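The following sketch combines POS-tagging and chunk parsing with NLTK: the tokens are tagged and then grouped into noun phrase chunks by a regular expression chunker. The toy grammar and sentence are illustrations only, not the chunkers used by SMES or CHAOS, and the NLTK tokenizer and tagger models are assumed to be installed.

import nltk  # assumes nltk.download("punkt") and nltk.download("averaged_perceptron_tagger")

sentence = "Categorial grammar describes the syntax of natural languages."
tokens = nltk.word_tokenize(sentence)
tagged = nltk.pos_tag(tokens)  # e.g. [('grammar', 'NN'), ('describes', 'VBZ'), ...]

# A toy noun phrase grammar: optional determiner, any adjectives, then nouns.
chunker = nltk.RegexpParser("NP: {<DT>?<JJ>*<NN.*>+}")
tree = chunker.parse(tagged)

# Print only the NP chunks found by the shallow parse.
for subtree in tree.subtrees(filter=lambda t: t.label() == "NP"):
    print(" ".join(word for word, tag in subtree.leaves()))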
Semantic Tagging or Semantic Annotation is the enrichment of natural language texts, like corpora, with semantic tags. A semantically tagged text often comes somewhat closer to an ontology, as it can be input to concept extraction modules or otherwise be part of larger approaches. As part of Semantic Tagging, various types of resolution may take place, like Synonymy, Hyponymy, Hyperonymy and Meronymy Resolution. Dill et al. ([28], 2003) maintain that "automated large-scale semantic tagging of ambiguous content can bootstrap and accelerate the creation of the Semantic Web". Their approach consists of SemTag, a Semantic Tagger that works through three stages: one for spotting, with tokenizing and label extraction, one for learning, and finally one for the actual semantic tagging itself. The other "half" of the approach, Seeker, will be described in another section. In practice, most of the semantic taggers that exist today only produce shallow results. If the resulting ontologies of systems that use these taggers should not be shallow, that could be achieved by combining shallow semantic taggers with other techniques.
SMES is presented by Maedche and Staab ([42], 2000; [43], 2000; [44], 2001)
as part of the Text-To-Onto approach. In [19] a Concept Extractor was developed
for the Ontolo approach.
Clustering is an IR technique in which documents are grouped together in
so-called clusters. This technique can be used for classification purposes, or as a
preparatory step for further analysis of the documents.
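As a sketch of how clustering can group a document set before further analysis, the following uses scikit-learn to turn a handful of invented toy documents into TF-IDF vectors and cluster them with k-means; scikit-learn is assumed to be available, and the number of clusters is chosen arbitrarily.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = [
    "inflectional morphology of verbs",
    "derivational morphology and affixes",
    "syntax of relative clauses",
    "parsing relative clauses with categorial grammar",
]

# Represent each document as a TF-IDF weighted term vector.
vectors = TfidfVectorizer().fit_transform(docs)

# Group the documents into two clusters.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(vectors)
print(km.labels_)  # one cluster id per document, e.g. [0 0 1 1]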
Various approaches use simple pattern matching. Perl or sed regular expressions, for example, can be very powerful. Especially as an addition to other techniques, or as part of techniques like POS-tagging, pattern matching can be very useful.
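As an illustration of the kind of simple pattern that is meant here, the sketch below uses a Python regular expression of the form "X such as Y and Z" to propose candidate taxonomic pairs from raw text. The pattern and the sentence are invented for the example; they are not taken from any of the systems discussed in this chapter.

import re

text = "Formal grammars such as categorial grammar and dependency grammar are studied."

# 'X such as Y and Z': X is a candidate superclass, Y and Z candidate subclasses.
pattern = re.compile(r"(\w+ grammars?) such as ([\w ]+?) and ([\w ]+?) are")
for match in pattern.finditer(text):
    superclass, sub1, sub2 = match.groups()
    print(superclass, "->", sub1, "|", sub2)
# prints: Formal grammars -> categorial grammar | dependency grammar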
Maedche and Staab emphasize the difference between taxonomic and non-taxonomic relation extraction from text. Much of the work that precedes theirs consists of very shallow approaches, which only succeed in taxonomic relation extraction. What is necessary according to these researchers is non-taxonomic relation extraction from text.
3.2 Ontology Editors and Engineering Tools
Clearly, if the creation of ontologies from text is not done fully automatically but
semi-automatically, an ontology engineer will have to correct, refine or expand the
ontologies. For this purpose, Ontology Editors and Engineering Tools can be used.
They can be considered part of overall semi-automatic approaches.
Da Silva et al. ([60], 2004) present a survey on eConstruction, Ontology Engineering/Design tools, and Ontology Exploitation software tools. The focus of the study is on software tools, and the following Ontology Design tools are described: LexiCon, OilED, Protégé2000, OntoEdit, LinkFactory, e-COGNOS and e-COSER, TERMINAE, Text-to-Onto and OntoLearn. In their conclusion on Ontology Design tools, the authors state that Protégé is the most recommended software tool for various reasons, including OWL compliance, the fact that it is freeware, and its good base of developers around the world that support it. The Ontology Exploitation that is evaluated in the survey is outside the scope of this study.
OntoEdit is an ontology engineering environment that is presented by Maedche and Staab ([42], 2000; [43], 2000; [44], 2001) as part of the Text-To-Onto approach. Only a limited version of OntoEdit is free of charge. The tool runs on Windows and Linux platforms.
Protégé is a very popular Ontology Engineering tool. It is an Open Source Java tool that can be used for editing domain ontologies or knowledge bases in a user friendly way with a GUI. The tool comes with a clear tutorial and good documentation, and is used by a large community. It is scalable, platform-independent and easy to extend with plug-ins. Furthermore, it supports data in various formats, like RDF and OWL. Many specialized ontologies in various domains have been developed with Protégé. Linguistics related ontologies include GOLD, an ontology for descriptive linguistics, and GUM, a general task and domain independent linguistically motivated ontology.
3.3 Ontology Learning Approaches

3.3.1 Surveys of Ontology Learning Approaches
Various surveys of ontology learning approaches exist. Some of these are briefly presented here in chronological order.

Maedche and Staab ([44, p. 76-78], 2001) include a brief survey of ontology learning approaches in the presentation of their own ontology learning framework, which includes Text-To-Onto, SMES and OntoEdit. The survey covers the following domains: free text, dictionary, knowledge base, semi-structured and relational schemata. The methods mentioned for free text, the subject matter of OntoSpider, are clustering, inductive logic programming, association rules, frequency based methods, pattern matching and classification methods. No extensive evaluation is made of the various approaches; they are presented in a table with references to the corresponding literature.
Ying Ding and Schubert Foo ([30], 2002) present a review of ontology generation, in which Infosleuth, SKC, AIFB approaches like SMES, OntoEditor and Text-to-Onto, ECAI 2000 (SVETLAN, Mo'K, SYLEX, ASIUM), Inductive Logic Programming (WOLFIE), DELOS, OntoWeb, DODDLE and some more approaches are described. Before presenting these approaches, some general notes on ontology creation are given. An important conclusion they draw is that the complexity of relation extraction is the main impediment to ontology learning and its application, and that learning ontologies from text is still largely a theoretical enterprise, which is not yet advanced enough for real applications.
A more extensive and recent survey of existing approaches can be found in Gomez et al. ([33], 2003). Many researchers have contributed to this survey, and it is very systematic. The following domains are covered: text, machine-readable dictionaries, knowledge bases, structured data, semi-structured data and unstructured data. For ontology learning from text, both methods and tools are described. The methods are usually named after one of the authors of the papers. The tools that are described are Caméléon, CORPORUM-Ontobuilder, DOE, KEA, LTG Text Processing Workbench, Mo'K Workbench, the Ontolearn Tool, Prométhé, SOAT, SubWordNet Engineering Process Tool, SVETLAN, a TFIDF based term classification system, TERMINAE, Text-To-Onto, TextStorm and Clouds, Welkin and WOLFIE. The authors do not pretend to present a complete survey, but do claim that the main approaches have been covered. The systematic presentation of the methods and approaches gives a very clear overview, and one thing that may strike the reader because of this is the fact that in many cases certain aspects of approaches are not disclosed in papers at all, which is indicated in the text with "information not available in papers". Approaches for the semi-automatic creation of ontologies incorporate various modules.
3.3.2 Descriptions of Ontology Learning Approaches
TERMINAE is presented in various papers, like Biebow et al. ([8], 1999), as a methodology and a tool for building ontologies from text or from scratch. Much attention is given to linguistics, and formality and traceability are requirements. Lexter is used for the extraction of terms from text. The approach, which focuses on technical text, evolved over time; one version uses Syntex and Caméléon as NLP tools for the subsequent linguistic analyses. The knowledge engineer is expected to have expertise in the area of the subject of the ontology and to have a good idea of how the resulting ontology will be applied; intuitive GUIs can be used to construct and adapt ontologies. The role of the knowledge expert is crucial in this approach. After normalization, the domain knowledge is formalized in a kind of description logic, which has rather limited expressive power. Subsequent work by the authors included work on other systems, like Géditerm, which also implemented part of the tasks of their methodology.
Text-To-Onto is presented by Maedche and Staab ([42], 2000; [43], 2000; [44], 2001) as an architecture and a system for the semi-automatic creation of ontologies from text. It was used in the On-To-Knowledge project. The authors stress that most of the approaches prior to the year 2000 only got to the taxonomic level, but not further, and that "non-taxonomic conceptual relations" are an important goal in ontology engineering. This view corresponds with the classification of ontologies by Lassila and McGuinness (2001). They use a balanced cooperative modeling paradigm as proposed by Morik (1993), which includes the use of Text Mining. An NLP module, SMES, is used for shallow text processing, with some extensions for heuristic correlations in order to attain a high recall of relevant linguistic dependency relations. SMES has access to a lexical database with German words. Dependency relations form the main output of SMES. Concept and relation extraction are performed by the learning module, the algorithm of which is based on Ramakrishnan Srikant and Rakesh Agrawal, "Mining Generalized Association Rules" (1995). The ontology engineer is presented with pairs of concepts which can be included in the ontology as non-taxonomic relations. For this purpose, OntoEdit is used. Furthermore, the ontology engineer can prune the resulting ontology and decide whether or not it is necessary to iterate the ontology learning cycle. The authors stress that this is just one of various possible strategies.
OntoLearn, presented in Missikoff et al. ([48], 2002), is a system that can automatically extract concepts from text to form semantic nets and specialized domain ontologies from corpora. It uses WordNet and large domain corpora. Projects that used OntoLearn include Harmonise, which produced a large ontology on tourism; other applications involved ontologies in the fields of Economy and Computer Networks. OntoLearn incorporates mainly three algorithms: one for terminology extraction, one for semantic disambiguation and one for semantic annotation and the creation of ontologies. A special algorithm, SSI (Structural Semantic Interconnections), was designed for semantic interpretation, which is also based on the principle of compositionality of meaning. The relevance of concepts that are extracted from a corpus is determined by comparing their frequencies with frequencies of occurrence in a generic corpus, which functions as a contrast corpus. For the purpose of evaluating resulting ontologies, glosses were added to the OntoLearn system. These will be described in a later section.
SymOntos is described by Missikoff et al. ([47], 2001). It is an approach that uses Web Mining for ontology creation and enrichment. The ontological data that is created is not very rich, roughly at the taxonomic level, but it can be used for the creation or enrichment of ontologies.
Ontolo, as presented in Chetrit ([19], 2004), is a tool for facilitating Ontology Construction from texts; in fact, that is literally the title of the thesis. The user manually inserts articles from the PubMed database. After POS-tagging, stemming and concept extraction, rudimentary ontologies are created by the Ontology Construction tool.
The system Asium is presented in Faure et al. ([32], 1998). It is a system that automatically acquires semantic knowledge and ontologies from text with Machine Learning techniques. Another system, Sylex, is used for syntactic parsing; after this parsing and post-processing, syntactic frames of clauses are produced. Along with these, an ontology of concepts can be formed. The clustering of words can be done in a hierarchical or in a pyramidal way. Pyramids of clusters are richer than simple hierarchies because multiple parents are possible. The relevance of concepts that can be derived from clusters is determined with a similarity measure that determines how close clusters are to each other. The user interactively validates learned clusters; for this purpose, a GUI is part of the system.
GATE is presented in Cunningham et al. ([22], 2002) and Bontcheva et al.
([9], 2004) as a framework and a graphical development environment for Language
Engineering. It was implemented in Java, is freely available, modular, Open Source
and well documented in articles and with an extensive user guide. GATE can be
seen as a general-purpose and flexible tool for NLP processing. Here, only GATE
v2 will be described.
The authors distinguish the following GATE resources: language resources (LRs), processing resources (PRs) and visual resources (VRs). The language resources with declarative data are strictly separated from the processing resources and the visual resources, which enables users, e.g. linguists or programmers, to concentrate on their own field of expertise in their work with GATE. All resources together are called CREOLE, a Collection of REusable Objects for Language Engineering. The GATE resources can be accessed both via a GUI and via the GATE API; the API makes it easier to automate certain tasks. GATE can deal with various data formats, which are converted into a GATE specific XML format before they are processed further. Examples of processing resources that are available in GATE are tokenizers and POS-taggers.
An important part of GATE v2 is JAPE, an engine for regular expressions that is based on finite state technology. Although the use of finite state technology does not by itself guarantee efficient processing, most tasks that are performed with the JAPE engine are generally efficient.
Other modules that are available as plug-ins to GATE are an implementation of the Google API and a web crawler.
Various resources that can process ontologies are available for GATE v2. The OntoGazetteer is an interface that enables one to view ontologies; with the OntoGazetteer Editor, the class hierarchies of RDF or RDF(S) ontologies can be edited. Protégé has been integrated with GATE.
OntoLT is a very likely candidate for the Ontology Engineering component of OntoSpider. For this reason, it will be described in more detail here. OntoLT is a plug-in for Protégé that requires Sun's Java Runtime Environment (JRE). A first beta version of the plug-in was made available to the public in November 2004. The current version is 2.0, and it works with version 3.x of Protégé. The following description is based on [10], [12], [13], [14], [15], [16] and [62], and on evaluations that were done with an earlier version on a machine running FreeBSD 5.x. Most screenshots were taken from the latest version. In Protégé, the OntoLT plug-in is represented by a tab.

Figure 3.1: Opening screen of Protégé with the OntoLT plug-in marked in red
OntoLT takes an XML-annotated corpus as input. The format of this XML annotation is proprietary and is called MM. This MM format encodes morphological, syntactic and semantic information. A software package that can produce the necessary XML annotation automatically is SCHUG/WebSchug. SCHUG, which stands for Shallow and Chunk based Unification Grammar, was introduced by Declerck et al. ([24], 2002) and described further in later work like Declerck et al. ([26], 2003). SCHUG maps XML with linguistic information onto feature structures, on which unification can work, activating rules that operate on the linguistic data. The technique of cascaded chunk processing is used at this point to perform various kinds of linguistic processing. The output of SCHUG is again XML encoded data, enriched with more linguistic annotations. SCHUG is able to process various natural languages, like German and Spanish, as is demonstrated in Declerck et al. ([24], 2002), and, as is demonstrated in Declerck et al. ([26], 2003), e.g. Central and Eastern European languages. Another application outside OntoLT that uses SCHUG is the MUMIS project, which performs Information Extraction on multimedia resources in the field of soccer. MUMIS is described in various papers, like Declerck et al. ([25], 2002).
A corpus consists of one or more documents that are marked in XML with <document> tags. Every document is represented by a separate file on disk and can consist of one or more sentences, indicated by <sentence> tags. Sentences consist of clauses, phrases and text, indicated with tags of the same name. The text contains <token> tags, from which the original sentences can be reconstructed. A simplified, abstract example of the XML structure follows; the dots are an informal representation of omitted information:
<?xml version='1.0' encoding='ISO-8859-1' ?>
<document name="./example.xml" date="2005-01-24">
<sentence id="1" stype="decl" corresp="">
<clauses>
</clauses>
<phrases>
</phrases>
<text>
<token>
</token>
</text>
</sentence>
<sentence id="...">
...
</sentence>
</document>
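A sketch of how such an annotated document could be read programmatically, using Python's ElementTree and the element names from the example above. Anything beyond the elements and attributes shown there, for instance where a token's surface form is stored, is an assumption, since the full MM format is not reproduced here.

import xml.etree.ElementTree as ET

tree = ET.parse("./example.xml")  # an MM annotated document as sketched above
root = tree.getroot()             # the <document> element

for sentence in root.findall("sentence"):
    # Each sentence carries an id and a sentence type in its attributes.
    print("sentence", sentence.get("id"), "type:", sentence.get("stype"))
    # Tokens live under <text>; their text content is taken to be the surface
    # form here, although the real MM markup may store it differently.
    tokens = [token.text for token in sentence.findall("./text/token")]
    print(tokens)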
A sample XML annotated English corpus that was supplied by SCHUG is included in the OntoLT package. Manual XML annotation is tedious, and the OntoLT plug-in is meant for semi-automatic use anyway, so the only realistic alternatives to using SCHUG are writing an alternative semi-automatic XML annotator or adapting an existing one to deal with this specific XML format. For the purpose of this study, WebSchug was chosen. Much is done by the module that produces the input XML for OntoLT: it takes care of POS-tagging, (other) morphological analysis, syntactic analysis and lexical semantic tagging, and provides XML markup for all of this. When the OntoLT tab is clicked, tabs for Operators, Mappings, Conditions and Corpora become visible (Figure 3.2).

Figure 3.2: Tabs of OntoLT
In the Corpora tab, new corpora can be imported. For this purpose, multiple XML annotated files can be selected and given a corpus name together. Clicking on the binoculars in the Candidate View tab and selecting the corpus then extracts candidate classes, slots and instances. The name of the extraction is derived from the time at which it took place. Extracted candidates can be inspected by clicking on key icons. The user can choose with which candidates the resulting ontology should be enriched. At the time of writing, OntoLT only allows for the extension of ontologies, not for creating smaller ontologies from existing ones.
The extraction takes place based on XPath expressions, which can be found under the XPaths tab. If an XPath expression matches, a mapping rule is activated based on which candidates may be extracted. Both the XPath expressions and the mapping rules can be adjusted or extended by the user. For the XPath expressions, a precondition language is available, comprising the predicates containsPath, HasValue, HasConcept, AND, OR, NOT and EQUAL, and the function ID. The beta version of OntoLT 1.0 includes two mapping rules, which consist of large conjunctions of conditions (Figure 3.3).
Figure 3.3: Mapping Rule for Head Nouns and Modifiers

Figure 3.4: Above Rule in an older version of OntoLT
Like similar approaches, OntoLT produces shallow ontologies. The object languages of the supported corpora are English and German. The concept of focus is also present in the OntoLT approach. For this, a statistical relevance metric is used, which can be adjusted manually under the Mappings tab by deselecting extracted terms of a previous extraction and then initiating a new extraction of candidates. Obviously, after deselection of terms, the resulting set of candidates will be smaller, and it should be more focused. Statistical preprocessing is based on the Chi-Square function that is described in section 4.2, based on its presentation in ([2], 2001). In that paper, the Chi-Square (χ²) function is used to construct topic signatures, a concept which resembles that of a centroid relatively closely. The function calculates similarity, as an alternative to the cosine similarity algorithm which is often used for this purpose; both methods use vectors with weight information. The aim of the paper is to enrich WordNet concepts with topic signatures, so that WordNet will also contain information on related concepts which are not part of the same synsets and are not present yet, like the concepts of chicken and farm. The approach thus aims at (semi-)automatically enriching existing ontologies or thesauri. In practice, the topic signatures that are constructed this way contain some noise, and various strategies are used to filter this noise out.
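A minimal sketch of the underlying idea: for each term, a 2x2 contingency table of occurrence counts in a domain corpus versus a contrast corpus is built, and the chi-square statistic ranks terms that are over-represented in the domain. The counts are invented, and the exact weighting used in ([2], 2001) and in OntoLT may differ.

def chi_square(domain_count, domain_total, contrast_count, contrast_total):
    # 2x2 contingency table: term vs. other terms, domain vs. contrast corpus.
    a, b = domain_count, domain_total - domain_count
    c, d = contrast_count, contrast_total - contrast_count
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0

# Hypothetical counts: occurrences of a term in a small domain corpus
# (10,000 tokens) versus a large general contrast corpus (1,000,000 tokens).
for term, dom, con in [("morphology", 120, 40), ("the", 600, 62000)]:
    print(term, round(chi_square(dom, 10_000, con, 1_000_000), 1))
# "morphology" scores far higher than "the", so it is the better topic term.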
OntoLT allows for ordering of predicates based on the frequency of their occurrences, so the top-N relevant candidates could be compared with a centroid. One would also expect that, in an approach that combines both, the top-N of the centroid and that of OntoLT would be very similar. The same similarity measures and weighting schemes could be used in both, but it would also be possible to base the centroid on a cosine similarity measure while OntoLT's ranking is based on the Chi-Square weighting in combination with absolute frequency counts.
A sample ontology that is very briefly presented as an experiment in [14] is for the field of neurology. It is described how the mapping rules HeadNounToClass_ModToSubClass and SubjToClass_PredToSlot_DObjToRange map between linguistic annotation in the corpus and Protégé classes and slots. Examples are given of how HeadNounToClass_ModToSubClass extracts classes from head nouns and subclasses from their modifiers. In OntoLT, Mapping is a class which consists of Conditions (constraints) and Operators. If the specific conditions on the XML annotated linguistic structure are met, the Operators enable the user to enrich the ontology by forming candidate classes, subclasses, etcetera. The CreateCls, AddSlot, CreateInstance and FillSlot operators are available for this purpose.

Since at present there are only two mapping rules, they will be described in some more detail here. Note that it is possible for the user to define other mapping rules than these.
The mapping rule HeadNounToClass_ModToSubClass can make two kinds of mappings: it can map head nouns to classes and modifiers to subclasses. The Conditions are a conjunction, as is shown in figure 3.3. In an older version of OntoLT, a long conjunction was visible, as can be seen in figure 3.4. The two Operators that are associated with it are CreateCls(HeadNoun, :THING) and CreateCls(HN_Mod, HeadNoun), the latter of which creates a subclass of the first. Examples of head nouns that could map to classes are the words "morphology" and "grammar". Examples of modifiers that could map to subclasses are "generative morphology" and "categorial grammar". So the modifiers result in more specific subclasses than the classes that are expressed by the head nouns they modify.
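To make the head-noun/modifier mapping more concrete, the following is a minimal, hypothetical Python sketch of how class and subclass candidates could be generated from annotated noun phrases. The data structures and function names are invented for illustration; they do not correspond to OntoLT's actual Java implementation.

    # Hypothetical sketch: generating class/subclass candidates from
    # (modifier, head noun) pairs, in the spirit of the
    # HeadNounToClass_ModToSubClass mapping rule described above.

    def head_noun_to_class_mod_to_subclass(noun_phrases):
        """noun_phrases: list of (modifier, head_noun) pairs taken from
        linguistically annotated text, e.g. [("generative", "morphology")]."""
        candidates = []  # (class_name, parent_class) suggestions
        for modifier, head in noun_phrases:
            # CreateCls(HeadNoun, :THING): the head noun becomes a class
            # directly under the ontology root.
            candidates.append((head, ":THING"))
            if modifier:
                # CreateCls(HN_Mod, HeadNoun): modifier plus head noun becomes
                # a subclass of the head-noun class.
                candidates.append((f"{modifier} {head}", head))
        return candidates

    if __name__ == "__main__":
        phrases = [("generative", "morphology"), ("categorial", "grammar")]
        for cls, parent in head_noun_to_class_mod_to_subclass(phrases):
            print(f"candidate class {cls!r} with parent {parent!r}")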
The mapping rule SubjToClass_PredToSlot_DObjToRange is a bit more complicated. It can map subjects and objects to classes, and predicates to slots, with the objects as ranges. A fictive example could be the sentence "semantics is intertwined with syntax", where both "semantics" and "syntax" would be turned into classes, and "intertwine" into a slot. Here as well, the Conditions are a long conjunction. The Operators that are associated with it are CreateCls(Subject_Text, :THING) and CreateCls(DObj_Text, :THING).
In the chapter on Ontology Engineering, the formal description of systems for semi-automatic ontology creation by Sintek et al. ([62], 2004) was mentioned. In their paper, OntoLT is described as an example or instantiation of such a system. An example of how parts of the process are made more explicit by the formal presentation is that a formal suggestion function σ is proposed. Furthermore, the set of mapping rules of OntoLT is also formally defined as a set of rules that map from a single sentence onto sets of suggestions.
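One possible way to write down this idea (an illustrative reading, not Sintek et al.'s exact notation) is as follows, where S is the set of sentences, Σ the set of possible suggestions and R the set of mapping rules:

\[
r : S \rightarrow 2^{\Sigma} \quad \text{for each } r \in R, \qquad
\sigma(s) = \bigcup_{r \in R} r(s)
\]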
Specia and Motta ([66], 2006) describe a hybrid approach that aims to provide rich semantic annotations for the Semantic Web in an automatic way. The approach integrates many systems, like AquaLog, resources from GATE, JAPE, Minipar, ESpotter and WordNet (for synonymy resolution and for finding deeper meanings). It takes raw text as input and, as the first step, detects linguistic triples in the syntax, which are then mapped onto semantic relationships in the second step. As JAPE can only detect these triples in a shallow way, Minipar is used to perform deeper processing. The linguistic component is based on that of AquaLog; an adaptation of its Relation Similarity Service (RSS) maps linguistic triples onto ontology triples. In this mapping, three cases are distinguished, depending on whether there is a match with existing relations in the Knowledge Base and/or the Domain Ontology. If multiple mappings are possible, a Word Sense Disambiguation (WSD) system is used to disambiguate words. An example of a linguistic triple is <noun phrase, verbal expression, noun phrase>. A pattern-based classification model is used to identify new relations between entities, and a repository of patterns is maintained by the approach. Newsletter texts are the input to this hybrid approach.
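A highly simplified sketch of the triple-mapping step, with invented data structures (the real Relation Similarity Service is far richer), could look like this:

    # Simplified, hypothetical sketch of mapping a linguistic triple onto an
    # ontology triple; all names and data are invented for illustration.

    def map_triple(linguistic_triple, known_relations):
        """linguistic_triple: (subject_np, verbal_expression, object_np).
        known_relations: dict mapping (class, class) pairs to lists of
        relation names already present in the Knowledge Base / ontology."""
        subj, verb, obj = linguistic_triple
        candidates = known_relations.get((subj, obj), [])
        if not candidates:
            return None                       # no matching relation at all
        if len(candidates) == 1:
            return (subj, candidates[0], obj)  # unique match
        # several possible mappings; a WSD step would pick one here,
        # this sketch simply flags the ambiguity.
        return (subj, candidates, obj)

    if __name__ == "__main__":
        kb = {("semantics", "syntax"): ["isIntertwinedWith"]}
        print(map_triple(("semantics", "is intertwined with", "syntax"), kb))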
Not many approaches are very similar to the OntoSpider approach that is proposed in this thesis; most work does not combine a focused crawler with the automatic creation of ontologies. However, Ehrig ([31], 2002) does propose a very similar approach. This approach involves two main cycles, in which the Ontology Maintenance cycle precedes the crawling cycle. An ontology is used by the focused crawler for its focus, so from the outset an ontology is the starting point, which will be enriched with relational metadata that the focused crawler finds during its crawls. This places a burden on the user: she must already be able to create reasonable ontologies from scratch. Given that the approach is mainly meant to let relational metadata enhance the results of a focused crawler, rather than to create ontologies from scratch, this is quite reasonable. The approach builds on KAON, the Karlsruhe Ontology and Semantic Web infrastructure, with the OntoMat implementation. Text-To-Onto is used for the semi-automatic generation of ontologies. Because relational metadata is hardly available on the current Web, the approach as it was presented in that thesis remained largely theoretical at the time of writing. The evaluation searches that did use relational metadata did yield better results, according to the writer. Only HTML is processed, but the intention was to also process other formats like PDF. A lot of questions were left open, which is no surprise with an advanced and ambitious approach like this. In "Ontology-Focused Crawling of Web Documents" by Ehrig, Maedche et al. ([45]), CATYRPEL is presented as actually using graph-oriented knowledge representations, rather than text-oriented ones.
To give an indication that the use of semi-structured sources like text corpora is not at all self-evident in the field of Ontology Engineering, and that there is a broader picture, one approach will be discussed here as an example that is not similar to that of OntoSpider at all. The approach of Sabou et al. ([58], 2006) makes use of ontologies that are already available on the Semantic Web to perform semantic mappings. One specific reason the authors mention for not using sources like text corpora for dynamically acquiring knowledge is that approaches that do use them suffer from noise because of shortcomings in Information Extraction technology. According to the authors, current syntactic approaches fall short in that they do not provide semantic mappings and fail when ontologies are too dissimilar. The authors also describe the limited use of the types of background knowledge in current systems: WordNet, reference domain ontologies and online textual resources. Sabou et al. pose as a hypothesis that semantic data on the Semantic Web can be used as a source of background knowledge in ontology mapping, and that mappings that other approaches fail to recognize can be discovered with this background knowledge. The implementation of Sabou et al. uses Swoogle'05, the ontology search engine for the Semantic Web. They describe their mapping strategies in description logic syntax for semantic relations. If ontologies can be found with Swoogle that contain the same concepts with identical names, mappings can be based on them. Even in this simple case, problems like contradictions may have to be dealt with. In order to go beyond Swoogle's coverage, further techniques are used to deal with issues like variants of compound names and synonymy resolution. Sabou et al. not only consider the use of the background knowledge of a single ontology for creating mappings, they also describe a strategy that performs the mappings in a recursive way, based on two or more ontologies. According to the authors, their approach can be combined with other approaches, like "syntactic" ones. They describe promising results of their experiments. The more the Semantic Web grows, the more fruitful approaches like that of Sabou et al. may prove.
3.3.3 Aspects of Existing Ontology Learning Approaches
In this section, some general aspects of various approaches will be mentioned, like commonalities in system designs, convergence or divergence of NLP approaches, choice of AI technologies, etcetera. At least three sorts of languages may be involved: the object languages, which are the natural languages from which the ontologies are created; the programming languages of the actual implementations, which may have an impact on the portability of the approaches; and finally the ontology languages of the resulting ontologies, which often reflect the time at which the research was done. Earlier research may have e.g. SHOE ontologies as a result, and more modern approaches may support OWL.
The selection of the sources of a system is part of its design. If systems are based on a corpus, the way Corpus Constitution takes place is not always explicitly mentioned. Some approaches, like Text-To-Onto, depart from a core ontology, not just from a corpus. If a system departs from seed documents, it should be clear how these are chosen.
In TERMINAE, the recommendation is that the domain expert decides on the
choice of documents that will be used for the corpus. This corpus should be complete. One application of the system involved the use of a corpus on the Knowledge
Engineering domain.
The Harmonize project on Tourism, which uses OntoLearn, used a corpus of
about 1 million words.
Clearly, if one wants to create a domain ontology for a domain for which databases or even ontologies already exist, it may be fruitful to make use of these. An example is the UMLS (Unified Medical Language System), which contains many medical terms. An ontology on biomedical terms could use this e.g. for merging or integration purposes, verification and assessment purposes, etcetera.
The role of the Ontology Engineer or Domain Expert may vary. In a centroid
based approach, the Domain Expert may enrich or prune a centroid. In general she
may enrich or prune ontologies using an ontology engineering tool.
The resulting ontologies of various approaches may be represented in one of the many ontology languages that exist. Older approaches may export e.g. SHOE ontologies whilst modern ones can export OWL ontologies. Some, like TERMINAE, have their own ontology language; in the case of TERMINAE this is some sort of restricted description logic. Shallow approaches can result in lightweight ontologies, which could be represented in RDF triples, among other possible representations. Many shallow approaches, like [19], do not aim for fully fledged ontologies yet, but do form a basis for ontology creation, i.e. they either produce output that can be processed by modules that create ontologies, or further development of the approaches and later versions could yield ontologies themselves.
Various programming languages can be used. The use of Java, Perl, ANSI C or C++ can make implementations very portable. TERMINAE and Ontolo are examples of approaches with implementations in Java.
The main object languages that are described in the literature are English, French and German. Often, approaches for languages other than English also include the possibility to process English texts. For English, WordNet is often used, and for other languages EuroWordNet.
Obviously, if natural language text, like a corpus, is used for the creation of ontologies, it is unavoidable to make use of linguistics and NLP approaches. Nearly any mature approach will have to deal with semantic phenomena like hyponymy and synonymy. In many cases, it is so clear that POS-tagging and stemming are necessary that they are not even mentioned.
The current state of linguistics is that there are various mainstream theories, like the Government and Binding model, the Barriers model, Head Driven Phrase Structure Grammar, Categorial Grammar, Lexical Functional Grammar and Optimality Theoretic approaches. Likewise, a wide variety of linguistic approaches is presented in the literature. Certain NLP techniques are reasonably standardized and mature, however: stemming algorithms, for instance, are often not based on a particular linguistic theory, whilst they process natural language text all the same. Also, POS-tagging, which is based on natural language morphology, is often part of the various approaches. In many cases in the literature there is not even an explicit mention of the linguistic or NLP approach that is used at all. Often simple pattern matching algorithms are used, not only in the stemming phase. Techniques include the use of N-grams (like bigrams and trigrams) and other collocation (co-occurrence) techniques. An example of an approach that only uses such shallow techniques is that of Ding and Engels ([29]): this uses co-occurrence as a measure
for similarity, which is very common in the IR community and could thus be characterized as a pure IR approach rather than a linguistic one. In a similar way, Sundblad ([68]) studies the automatic acquisition of hyponyms and meronyms from question corpora, for which simple pattern matching techniques are used. Sundblad does indicate that he intends to use linguistic techniques like Functional Dependency Grammar (FDG) in the future. The approach of Jianming et al. ([38]), which only focuses on semantic annotation of domain-specific sentences, uses Dependency Grammar and Link Grammar. For processing the semantics of texts, Discourse Theory seems an obvious choice. Handschuh et al. ([37]) present S-CREAM, Semi-Automatic Creation of Metadata, in which Discourse Theory and Centering Theory are part of the approach. Also, Lazy-NLP, as part of Amilcare, a tool for Information Extraction (IE) from text, is used in their approach. Quite exotic in this context is the language game approach that is part of Steels' ([65]) study, which is about physical robots that develop ontologies rather than software agents, but may have some relevance for the latter as well.
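As an illustration of how shallow such techniques can be, the following minimal Python sketch (not taken from any of the cited systems) counts bigram co-occurrences in a tokenized text; counts of this kind are the raw material for co-occurrence-based similarity measures.

    # Minimal illustration of a shallow co-occurrence technique: counting
    # bigrams in a tokenized text. Not taken from any of the cited systems.
    from collections import Counter

    def bigram_counts(tokens):
        """Return a Counter of adjacent word pairs (bigrams)."""
        return Counter(zip(tokens, tokens[1:]))

    if __name__ == "__main__":
        text = "categorial grammar and generative morphology and categorial grammar"
        counts = bigram_counts(text.lower().split())
        for (w1, w2), n in counts.most_common(3):
            print(f"{w1} {w2}: {n}")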
Quite a few approaches use a very specific NLP module, like [42], in which Maedche and Staab present Text-To-Onto, which uses the lexical analyzer of SMES to perform morphological analysis, POS-tagging, and more. In [70], Aussenac-Gilles et al. mention SYNTEX and Cameleon as NLP modules that are embedded in their TERMINAE approach. A common linguistic hypothesis that is used here is that meaning is specific to a domain and can be inferred by observing regularities in the use of words.
Summarizing the various linguistics and NLP approaches that are mentioned in the literature, a wide range of theories and techniques is used, and the linguistic theoretic framework is often implicit or even absent.
One obvious commonality between approaches is the modularity of the software designs. In many cases one should, theoretically, be able to plug in, plug out and combine parts of different approaches. Limiting factors for this could be differences in any of the formal and natural languages that were mentioned in the previous sections. For instance, if a module pre-processes German texts, its output obviously cannot be input to modules that can only deal with English texts. If implementations only run on one type of Operating System, they cannot be combined with modules that only run on other Operating Systems. An inventory of the extent to which various approaches can actually be combined could be very useful, but is beyond the scope of this thesis. Approaches with platform-independent implementation languages like Java, which process English texts, are expected to be easy to combine. Often, systems are very modular on paper, but in practice, adding, combining or replacing modules may still require a lot of work.
Another recurring feature of software designs is cyclicity. Often results, like resulting ontologies, can be reused for or fed back into the learning or training module, refining the final results.
Something that TERMINAE stands out in is the fact that traceability of results is a prerequisite of the approach. Other approaches also include traceability, but often they do not mention it explicitly.
In software engineering, completeness is one of the characteristics that one strives for. Many approaches do not even mention completeness, let alone discuss the completeness or incompleteness of their results. According to Maedche and Staab ([44], 2001), there is a tension between trying to reach the most complete ontology, which is not achievable, and settling for Scarce Models, and some balance must be struck between these two.
Even though efficiency and effectiveness are also important properties of application software, they are often not discussed at all in the literature that describes such software.
One thing all or most approaches seem to agree on is the assumption that the ontology engineer should be an expert on the domain of the ontology. What the approaches do differ in, however, is the involvement of the ontology engineer in the process of ontology creation. Some require very much work from the ontology engineer: in many stages of the process, she is expected to make corrections or input data. Many other approaches only expect the ontology engineer to evaluate the resulting ontology, so the degree of automation is higher, which may affect the quality of the resulting ontologies.
3.4 Assessment of Ontology Learning Approaches
The assessment of resulting ontologies is usually difficult. In a way, creating a domain ontology on a certain subject could be compared to writing a book on that subject. Different expert writers could write totally different books on the same subject that are all good and relevant. Likewise, for a given subject, quite dissimilar ontologies could be generated that are each good for very specific purposes. Of course, certain elements would have to be present in any introductory book on a subject. Likewise, a domain expert could specify elements that any relevant ontology on a certain subject should have. One way of assessing a resulting ontology could be to let a domain expert compare it with some standard ontology. For example, in [19] the resulting ontologies of Ontolo are compared with existing ones for the same keywords in the Gene Ontology.
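One very crude way to make such a comparison concrete, purely as an illustration and not as the evaluation protocol of [19] or any other cited work, is to measure the overlap between the concept labels of a generated ontology and those of a reference ontology:

    # Crude illustration of comparing a generated ontology with a reference
    # ontology by the overlap of their concept labels (Jaccard coefficient).
    # This is only a sketch, not the evaluation protocol of any cited work.

    def concept_overlap(generated, reference):
        """generated, reference: sets of concept labels (strings)."""
        generated = {c.lower() for c in generated}
        reference = {c.lower() for c in reference}
        if not generated and not reference:
            return 1.0
        return len(generated & reference) / len(generated | reference)

    if __name__ == "__main__":
        gen = {"Morphology", "Generative Morphology", "Grammar"}
        ref = {"morphology", "grammar", "phonology"}
        print(f"Jaccard overlap: {concept_overlap(gen, ref):.2f}")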
Within the OntoLearn approach, evaluation of certain formal aspects of semi-automatically generated ontologies by domain experts appeared to be difficult. For this reason, glosses were added to the system. These provide informal natural language representations for certain formal properties of ontologies, enabling domain experts who do not have much insight into these formal properties to evaluate them anyway. Various papers, e.g. Cucchiarelli et al. ([20] and [21], 2004), describe these.
Ideally, the assessment of ontologies could be defined in such a precise and formal way that the process itself could also be automated. An Automatic Assessment tool would ideally be so general that any approach that creates ontologies from text can be evaluated with it. Owing to the complexity of the subject, this could be the topic of a different thesis.
Chapter 4
Information Retrieval: Focused Crawling
Information Retrieval (IR) is an interdisciplinary science that covers many subjects, like the theory behind search engines, storage of information, software robots that index the Web, user friendly interfaces for the retrieval of relevant data, and information filtering. Within this broad science, only Focused Crawling will be studied here in some more detail, because the OntoSpider approach that is presented in this thesis investigates how focused crawlers can be of use for the semi-automatic creation of ontologies. Focused crawlers will be used in this approach to create specialized domain corpora, from which concepts and relations can be extracted.
4.1 Definition of Focused Crawling
Traditionally, many crawlers that index the World Wide Web use an overkill approach, trying to index as much of the Web as they can in order to cater for any possible query that can be entered in the interface of a search engine. Focused Crawling is traversing the Web with a robot while concentrating on the retrieval of documents that are relevant to a very specific topic or a set of such topics. The focused crawler will only try to traverse a subset of the Web which is relevant to these specific topics, thus saving bandwidth and computational resources and enabling higher precision and recall. At least theoretically, the World Wide Web could grow in such a manner, and become so dynamic, that a traditional 'overkill' breadth-first crawling approach that tries to index the whole Web becomes unfeasible, and the use of focused crawling approaches could be a more realistic option. The focused crawling process could be compared to the way humans process the huge amount of information that their senses register. Instead of examining every detail
that their eyes, ears and other senses perceive, they focus on parts of reality that are relevant to them at a given time, thus avoiding a waste of resources and an overload of information. The comparison between human information processing and focused crawling will be discussed in some more detail below. A Centroid Based Focused Crawler is a special type of crawler that uses a set of seed documents that are relevant to a certain topic to construct a centroid, an abstract representation, such as a vector, of the relevant terms in the documents. This centroid determines the focus of the crawler; based on it, the crawler will traverse a relevant subset of the World Wide Web, ignoring web pages that are not close to the centroid and collecting documents that are. In the rest of this section, only the general idea of one simple centroid based focused crawler approach will be described. In a centroid based approach, the focus of the crawler is based on the information that is contained in a centroid, which is typically a vector that contains highly relevant words for the topic at hand. This centroid is constructed from a set of on-topic documents. The closeness of the vector representations of web pages to the centroid, which is an indication of their similarity, can be calculated with various similarity measures. Some of the algorithms that can determine similarity will be presented in section 4.2. Links in web pages with a very high similarity can get a higher weight or be pushed up in the stack of pages that should be crawled, so that URLs that are probably on-topic will be visited earlier by the crawler. While the crawler is gathering data, the centroid can be enriched with it. The main methodologies that are used in Web Crawling in general are Depth-First search and Breadth-First search. Breadth-First searches are often used by more general crawlers, which attempt to gather information on many topics and do not focus on a particular subject. Many crawlers that are used for general purpose search engines work this way. Apart from these, methodologies that use some mixture of the two also exist.
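A minimal sketch of centroid construction, assuming simple whitespace tokenization and raw term frequencies (a real system would add stopping, stemming and TF.IDF weighting), could look like this:

    # Minimal sketch of building a centroid from seed documents as a raw
    # term-frequency vector; real systems would add stopping, stemming and
    # TF.IDF weighting.
    from collections import Counter

    def term_vector(text):
        """Very naive term-frequency vector: lowercase whitespace tokens."""
        return Counter(text.lower().split())

    def build_centroid(seed_documents):
        """Sum the term vectors of all seed documents into one centroid."""
        centroid = Counter()
        for doc in seed_documents:
            centroid += term_vector(doc)
        return centroid

    if __name__ == "__main__":
        seeds = ["categorial grammar and type logical semantics",
                 "lexical functional grammar and syntax"]
        print(build_centroid(seeds).most_common(5))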
One interesting possibility for focused crawling on scientific topics is to crawl between research papers in PDF or PS format themselves. One of the focused crawlers of the OntoSpider approach that is proposed in this thesis uses this type of crawling. I name this process Literature Crawling, and a crawler that works in accordance with such a process a Literature Crawler. Instead of extracting URLs from the pages that it retrieves, the crawler would extract references from the bibliography and try to find online versions of the works that are mentioned, using a general purpose search engine. This Literature Crawling would resemble the process that researchers manually follow when they study a specific subject: the researcher searches for relevant papers, and usually the bibliographies of these papers give pointers to similar relevant literature. If the researcher writes a paper on the subject, the list of papers thus obtained would roughly be the bibliography of that paper. Thus, the list of documents that is produced by a Literature Crawler could be seen as some sort of new bibliography.
The use of Google Scholar instead of a Literature Crawler will also be discussed, very briefly.
Crawls need to start somewhere. Mostly, no keyword search is used to start focused crawls, but URLs. These initial URLs that are used for crawls are called Seed URLs. One way of determining specialistic URLs for focused crawls is using a search engine like Google and finding a set of relevant URLs manually. Of course, an expert in a certain field may know relevant URLs or may have bookmarked them, and can use these as seed URLs. Also, the user could supply a set of very relevant keywords, from which the crawler could determine seed URLs itself, based on e.g. Google searches.
4.2 Focused Crawling Techniques
Many techniques that are used for Focused Crawling are quite general. Some of these techniques, like stopping, stemming and POS-tagging, have already been discussed in the chapter on Ontology Engineering. Some will also be discussed in the next section, which presents some of the existing Focused Crawler approaches. Various techniques and approaches that are specific to focused web crawling are discussed in Novak ([50], 2004). Techniques that help focus the crawls better include the use of a centroid, the use of the PageRank algorithm[55], the use of metadata in ontologies that already exist, the use of the HITS algorithm[39] as in the BINGO system[63], and the use of contrast corpora to exclude the negative class, i.e. documents that are less relevant to the crawls. These techniques can increase the harvest rate of the crawlers. Other techniques that may increase this harvest rate include the use of databases that were built during previous crawls, which could be seen as some sort of intelligent crawling memory. There are also techniques that defend the crawler against crawler traps, which may cause a focused crawler to crawl endlessly in a loop. One way of defending against such traps is to auto-detect them, if possible, or to enable the user to maintain a list of known sites that are problematic and should be excluded from the crawls. Much information on the World Wide Web is not available to superficial crawling, because it can only be retrieved dynamically from e.g. online databases via special interfaces like those of portals or search engines. Specialistic data is often part of the so-called "Deep Web", the part of the Web that is not available to superficial crawlers that only have access to information on the "Surface Web". One technique to let crawlers also retrieve information from the Deep Web is to enable them to deal with such interfaces automatically. This is mostly problematic and only an option if the data that will be retrieved is on very specific known subjects and special code exists to interact with the specific interfaces. Another way to allow a crawler to have access to the Deep Web is by using
general purpose search engines like Google or Google Scholar automatically. Certain information in the Deep Web will still be unreachable for automated processes. Examples are documents that are only available to paying users on sites that do not have a subscription service, or documents on sites that are only available to employees of specific companies.
Tunneling[6] is a process that addresses the problem that highly relevant pages may be hidden behind pages that are less relevant themselves. Various tunneling techniques have been proposed, some of which are sophisticated. A very crude way of tunneling is to not stop at irrelevant pages but to pursue their outlinks as well, though only up to a certain depth. The crawler could stop if these outlinks themselves also lead to irrelevant documents. Such a crude approach places a far larger burden on the crawler, in bandwidth usage and processor time, as it has to download and process far more irrelevant pages.
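A schematic sketch of such crude, depth-limited tunneling (illustrative only; page fetching, relevance testing and link extraction are passed in as stubs) could look like this:

    # Schematic sketch of crude, depth-limited tunneling: irrelevant pages
    # are still expanded, but only up to max_tunnel_depth consecutive
    # irrelevant hops. Fetching and relevance testing are stubbed out.

    def crawl_with_tunneling(frontier, fetch, is_relevant, extract_links,
                             max_tunnel_depth=2):
        """frontier: list of (url, tunnel_depth) pairs to visit."""
        collected = []
        seen = set()
        while frontier:
            url, depth = frontier.pop(0)
            if url in seen:
                continue
            seen.add(url)
            page = fetch(url)
            if page is None:
                continue
            if is_relevant(page):
                collected.append(url)
                next_depth = 0                    # reset: back on topic
            elif depth < max_tunnel_depth:
                next_depth = depth + 1            # tunnel one hop further
            else:
                continue                          # give up on this branch
            frontier.extend((link, next_depth) for link in extract_links(page))
        return collected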
As soon as a focused crawler drifts, the results of the crawl may become useless. Avoiding this drifting is therefore very important. Apart from the quality of the centroid, which ensures the focus of the crawls, other techniques may be used to avoid drifting. One way of doing so could be to require certain substrings in the URLs that are crawled. For scientific purposes, for example, only domains within the .edu TLD could be crawled. The latter could also be a way to avoid crawler traps, since such traps are probably less likely to be encountered within the .edu TLD. Another way to avoid drifting, which would make an approach more interactive, is to allow the user to steer the crawling process by giving feedback on the relevance of links and adjusting the search queue. Apart from comparisons with the centroid, other relevance measures may help to avoid drifting, like the number of links pointing to a particular page, making pages that are linked to more often more relevant than those that are not.
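The URL-substring restriction mentioned above can be illustrated with a few lines of Python; the allowed-suffix list is of course just an example.

    # Illustration of restricting crawls by URL substring / TLD; the allowed
    # suffixes below are only an example.
    from urllib.parse import urlparse

    def url_allowed(url, allowed_suffixes=(".edu",)):
        """Accept a URL only if its host name ends in one of the suffixes."""
        host = urlparse(url).netloc.split(":")[0].lower()
        return host.endswith(tuple(allowed_suffixes))

    if __name__ == "__main__":
        print(url_allowed("http://www.example.edu/papers/p1.pdf"))  # True
        print(url_allowed("http://www.example.com/index.html"))     # False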
Similarity is a notion from mathematics. In various contexts, mathematicians try to determine the similarity of structures like the shapes of geometric figures such as triangles, of matrices, or of patterns in strings. The opposite of similarity is dissimilarity. The determination of the similarity of documents is an important focus of attention within the Information Retrieval community. In a ranked presentation of retrieved documents, one often wants the documents that are most similar to a query higher in the ranking; hence similarity measures are mostly part of ranking algorithms. Similarity of documents can also be used for clustering purposes, etcetera. For the determination of similarity (or dissimilarity) of documents, many algorithms have been proposed in the literature and implemented in projects like the SimMetrics SourceForge project[61]. Only some of these algorithms will be presented here.
In the simple boolean model of IR, queries and documents are represented by sets. All weights in this model have boolean values, i.e. values from the set {0,1}.
A definition of similarity in this model is as follows:
\[
sim(d_j, q) =
\begin{cases}
1 & \text{if } \exists \vec{q}_{cc} : (\vec{q}_{cc} \in \vec{q}_{dnf}) \wedge (\forall k_i,\; g_i(\vec{d}_j) = g_i(\vec{q}_{cc})) \\
0 & \text{otherwise}
\end{cases}
\]
The vector model of IR is more advanced than the boolean model. In this model, queries and documents are represented by vectors, and the weights of terms are non-binary, e.g. they could have any real value between 0 and 1. One of the most common similarity measures in this vector model of IR is cosine similarity, which is a technique from vector algebra. Given a query vector $q$ and a document vector $d$, the inner product of $q$ and $d$, $q \cdot d$, is defined as $\sum_{i=1}^{n} q_i d_i$. An assumption here is that the vectors are of equal length; if they are not, there are algorithms to turn them into vectors of equal length. The cosine similarity algorithm works for any number of dimensions. One intuitive way of looking at cosine similarity is that it expresses the closeness of the angle between two vectors in vector space: the closer they are, the higher the similarity.
\[
sim(q, d) = \frac{\sum_{i=1}^{n} q_i \times d_i}{\sqrt{\sum_{i=1}^{n} (q_i)^2} \times \sqrt{\sum_{i=1}^{n} (d_i)^2}}
\tag{4.1}
\]
Note that instead of a query q and a document d between which similarity is determined, the comparison may just as well be between two documents, or between a centroid and a document, as long as vector representations are made of these.
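A small sketch of formula (4.1) in Python, using dictionaries as sparse term vectors (so vectors of unequal length are handled implicitly), might look as follows:

    # Sketch of cosine similarity (formula 4.1) over sparse term vectors
    # represented as {term: weight} dictionaries.
    import math

    def cosine_similarity(q, d):
        """q, d: dicts mapping terms to weights."""
        shared = set(q) & set(d)
        dot = sum(q[t] * d[t] for t in shared)
        norm_q = math.sqrt(sum(w * w for w in q.values()))
        norm_d = math.sqrt(sum(w * w for w in d.values()))
        if norm_q == 0 or norm_d == 0:
            return 0.0
        return dot / (norm_q * norm_d)

    if __name__ == "__main__":
        centroid = {"grammar": 3, "morphology": 2, "syntax": 1}
        page = {"grammar": 1, "syntax": 2, "football": 5}
        print(f"{cosine_similarity(centroid, page):.3f}")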
Another algorithm for determining similarity is the Chi-Square function. The following formulas for Chi-Square are taken literally from Agirre et al. ([2], 2001):

\[
w_{ij} =
\begin{cases}
\dfrac{freq_{ij} - m_{ij}}{m_{ij}} & \text{if } freq_{ij} > m_{ij} \\
0 & \text{otherwise}
\end{cases}
\tag{4.2}
\]

\[
m_{ij} = \frac{\sum_{i} freq_{ij} \times \sum_{j} freq_{ij}}{\sum_{ij} freq_{ij}}
\tag{4.3}
\]

In these Chi-Square formulas, $freq_{ij}$ indicates the frequency of word $j$ in document collection $i$, $m_{ij}$ the mean of word $j$ in document collection $i$, and $w_{ij}$ is the Chi-Square value of word $j$ in document collection $i$. Word frequencies of a document collection $i$ are contrasted with those of the contrast set, a reference corpus consisting of the documents that are not part of $i$ itself.
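A small sketch of formulas (4.2) and (4.3) in Python, assuming that word frequencies are given per document collection as dictionaries, could be:

    # Sketch of the Chi-Square style weights of formulas (4.2) and (4.3).
    # freqs: dict mapping collection id -> {word: frequency}.

    def chi_square_weights(freqs, collection):
        """Return {word: w_ij} for one document collection."""
        grand_total = sum(sum(words.values()) for words in freqs.values())
        collection_total = sum(freqs[collection].values())
        weights = {}
        for word, freq in freqs[collection].items():
            word_total = sum(words.get(word, 0) for words in freqs.values())
            m = (collection_total * word_total) / grand_total     # formula 4.3
            weights[word] = (freq - m) / m if freq > m else 0.0    # formula 4.2
        return weights

    if __name__ == "__main__":
        freqs = {"linguistics": {"grammar": 8, "the": 20},
                 "contrast":    {"grammar": 1, "the": 30}}
        print(chi_square_weights(freqs, "linguistics"))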
There are also probabilistic models of IR, in which other similarity measures
have been proposed, like ones based on Bayesian Probabilities.
4.3 Focused Crawling Approaches
Various focused crawling approaches have been proposed in the literature. Only
some will be mentioned here.
A relevant paper on the subject of Focused Crawling is Chakrabarti et al. ([17], 1999). In this paper, Focused Crawling was presented as a new approach that makes use of a classifier and a distiller. Their system starts with on-topic documents rather than keywords. The classifier determines the on-topicness of retrieved pages, from which more links can be accumulated. The distiller detects the hubs and also determines the priority of the candidate pages to be visited. The user has the opportunity to interactively change the set of documents the system is about to process, based on nearness to the supplied documents. Multiple threads are used by the crawler: worker threads individually gather information from the World Wide Web and invoke the classifier. Periodically, all data that the individual worker threads have gathered is stored in a central work pool, which is implemented with the Berkeley DB B-Tree storage manager. Also, a link graph is stored on disk. Soumen Chakrabarti has published more work on the subject of focused crawlers. He argues that one thing focused crawling makes possible is lightweight crawling: the less severe the bandwidth demands become, the more feasible it becomes for researchers to perform focused crawls from their home DSL or cable connections, or at their workstations on university campuses.
Diligenti et al. ([27], 2000) propose an algorithm for focused crawling that uses context graphs to model the context in which on-topic pages are located on the World Wide Web. The crawler that works with this algorithm is called the Context Focused Crawler (CFC). The authors distinguish between forward and backward crawling. Forward crawling is the regular way of crawling that follows hyperlinks in webpages. Backward crawling, however, is a more exceptional way of crawling that follows the pages that link to a certain page, the inlinks. The CFC uses search engines to find these inlinks. Classifiers are trained using linkage information. Here the link distance to a certain on-topic page is important: this is the shortest distance between other pages and the current on-topic page, in terms of the number of hyperlinks that connect them. Starting with seed URLs, the crawler will create Context Graphs from information it obtains by backward crawling and train classifiers. The approach also uses traditional TF.IDF measures and a Naive Bayes classifier. According to the authors, the results of the CFC are better than those of traditional focused crawlers.
Mukherjea ([49], 2000) describes the development of WTMS, a system that employs a focused crawler to perform Topic Management, allowing users to gather on-topic information from the World Wide Web for various purposes, like data analysis. The focused crawler starts with a set of on-topic seed URLs that are specified by the user, or with a set of on-topic keywords that he or she enters. From these, a centroid is created, which the author calls a representative document vector (RDV). A stop URL list is used to avoid crawling certain unwanted pages. Various heuristics are used to make the crawls more efficient. Determining the relevance of pages before they are actually downloaded saves bandwidth and time. One way of doing so is by determining the nearness of a potential candidate page to the page that is currently processed, based on the directory these pages are in. If pages of the same website are in directories that are not parent or sibling directories of the current page, they will not be retrieved. If too many pages within the same directory are irrelevant to the topic at hand, all pages in that directory will be ignored. WTMS allows for different graphical views of the websites and directories of these websites that are relevant to the topic, as well as of the paths between these websites.
Aggarwal et al. ([1], 2001) propose the concept of "intelligent crawling". By this they mean a type of crawling that learns the linkage structure of visited web pages during the crawls, and that can be an alternative to traditional focused crawling. In this approach, no initial seed URLs or other on-topic data are necessary for the crawler to do its work: it starts off in a general way, like a general purpose crawler. The user can indicate an on-topic predicate he or she is interested in, and the intelligent crawler will auto-focus on this predicate while it performs the crawls. This mechanism of auto-focusing is based on various characteristics of the web pages that are visited, like the words that they contain, substrings of the URLs, and inlinks from other on-topic pages and their siblings. To determine a priority order of web pages, a probabilistic model is used. According to the authors, their intelligent crawler approach deals better with the varying behaviour of different predicates, because its self-learning capabilities make it more flexible.
CROSSMARC is a Focused Crawler that is presented in Stamatakis et al. ([64], 2003). The authors distinguish between focused crawling, which identifies on-topic websites, and domain-specific spidering, which identifies on-topic web pages. The user indicates the topic of the crawls, either by entering search queries or, in a later version, by supplying seed URLs. Specific to the approach is the fact that it can process texts in various natural languages, not just one language like English or German. The authors evaluated various machine learning methods for text classification, most of which performed well with their approach.
The BINGO system by Sizov et al. ([63], 2003) is a focused crawling system that is designed to improve the recall of expert systems on the World Wide Web. For training the classifier, linear SVMs are used, as implemented in SVMLight. Kleinberg's HITS algorithm[39] is used besides a TF.IDF based similarity measure to determine the on-topicness of documents. BINGO is able to retrieve data from the Deep Web by automatically entering data into portals.
YAFC (Yet Another Focused Crawler) by Sakkis ([59], 2003) is a focused crawler approach that uses a neural network instead of the heuristic methods of most other approaches. Apart from this, it also makes use of Reinforcement Learning, in which decisions are based on rewards and punishments. The crawls start from seed URLs, and highly on-topic pages get higher rewards than irrelevant pages. The prototype implementation of the approach, also called YAFC, was compared with several other crawlers. However, according to the author, initial experiments were discouraging.
The Focused Crawler approach of Menezes ([46], 2004) stresses the role of the end user. The user can interact with the system in many ways. This is not only more user friendly, but may also yield better results, as the process is more personalized. A specific scout procedure is proposed that lets the user enter seed URLs and positive and negative examples from which the classifier can learn. According to the author, a personalized approach is natural for focused crawling approaches.
4.4 Assessment of Focused Crawling Approaches
For the assessment of focused crawlers, many concepts from Information Retrieval in general are used, like precision and recall. Informally, precision expresses the ratio between the number of relevant (on-topic) documents that have been retrieved and the total number of documents that have been retrieved; in the context of focused crawling this is often called the harvest rate of the crawler. Recall expresses the ratio between the number of relevant documents that have been retrieved and the total number of relevant documents that exist. Irrelevant documents that have been retrieved could be called false positives. The assessment of crawlers in general does not take into account how good the results should be in view of the creation of ontologies. For certain focused crawler applications, the efficiency of the crawlers can be most important, or user friendliness can be a very important feature. Clearly, if our aim is to get ontologies that are as complete as possible for a given domain, recall is most important. To get ontologies that are 'clean', i.e. that do not contain irrelevant concepts and relations, precision is also important, but in fact a crawler that has low precision can still be very useful if the irrelevant documents are filtered out at a later stage. Recall is more important, for once a specialized corpus has been created by the focused crawler, the concepts and relations that will eventually end up in the ontologies are limited by this corpus. The more concepts and relations that can potentially be extracted from the corpus, the more complete the ontology can become. Still, high precision is very useful during the crawls as well, for the higher the precision is, the more efficient the crawler will prove to be.
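In standard IR notation, with Rel the set of relevant documents and Ret the set of retrieved documents, these measures are:

\[
\text{precision} = \frac{|Rel \cap Ret|}{|Ret|}, \qquad
\text{recall} = \frac{|Rel \cap Ret|}{|Rel|}
\]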
Chapter 5
OntoSpider
In this chapter OntoSpider will be presented. The approach employs Information Retrieval techniques, namely centroid based focused crawlers, for Ontology Learning. A functional design of the approach will be proposed, as well as possible implementations and experiments that need to be conducted for testing hypotheses. More specifically, the results of the use of a general purpose focused crawler will be compared to those of a Literature Crawler. This is to test whether a Literature Crawler will perform better than a General Purpose Focused Crawler from an Ontology Learning point of view. On the one hand, it is expected that the precision of the Literature Crawler will be higher than that of the General Purpose Focused Crawler. On the other hand, it is also likely that the recall of the General Purpose Focused Crawler will be higher than that of the Literature Crawler. Figure 5.1 gives a very general overview of the overall approach, while figure 5.2 shows it in a bit more detail: on-topic information is gathered by the IR Component of OntoSpider, which results in a highly specialized corpus that is fed into the Ontology Learning component, which can eventually output ontologies for the Semantic Web. This Ontology Learning component will use NLP techniques to extract ontological information, both concepts and relations, from the highly specialized corpus.
Figure 5.1: Simplified Possible View of OntoSpider
5.1 The Ontology Engineering Component of OntoSpider
In this section some approaches and software packages will be discussed that could be used as the Ontology Engineering Component of OntoSpider, and a choice will be made from them. In fact, any ontology learning approach that takes a corpus as input and creates domain ontologies could be used. Candidates include OntoLT, approaches that use GATE, and the hybrid approach for extracting semantic relations from text by Specia and Motta (2006). Also, the possibility to develop an Ontology Learning component from scratch will be discussed.
In part based on surveys of existing research, delimitations were made in the OntoSpider approach that is presented in this thesis. The approach is semi-automatic, not manual or fully automatic. The type of ontologies that OntoSpider will produce are domain ontologies. One of the very first stages of creating a domain ontology involves the creation of the vocabulary that contains the terms that refer to the concepts that are relevant to the domain, and the relationships between these concepts. The centroid that is used in the IR component of OntoSpider could be used as a basis for this vocabulary if the components of OntoSpider are not strictly separate. However, in this study, these components are kept separate. The main goal is the creation of ontologies, not ontology refinement, ontology merging, etc. Of course such other ontology engineering approaches are also relevant, but they are not part of this study. To avoid pollution of the resulting ontologies by noise in the input, contrast corpora could be used. If RDF-triplets are output, perhaps even a concept of stop-triplets, similar to that of stopwords, could be used to filter out irrelevant ontological data; a database of very common RDF-triplets could be maintained for that purpose (a small sketch of such a filter follows below). Also, if there is much pollution, manual pruning could be done in the IR Component of OntoSpider, based on the resulting corpus or, if that is too involved, based on the bibliography that is produced as a side-effect. If important relations are missed, techniques like synonymy resolution and automatic deductive reasoning could discover such relations, or else synonymy resolution could be used in the assessment stage. For populating the ontology with classes, the Middle-Out strategy would be preferred, as this strikes the best balance in the level of detail of these classes, as has been argued by Uschold and Gruninger ([71], 1996). Techniques from Text Mining, more specifically Web Mining, will be used, so a restriction is that documents in formats like HTML, PDF and PS are processed and that e.g. no databases or other ontologies will be used by the IR Component of OntoSpider.
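As a purely hypothetical sketch of the stop-triplet idea mentioned above (the stop-triplet set and the example triples are invented for illustration):

    # Purely hypothetical sketch of a stop-triplet filter; the stop-triplet
    # set and the example triples are invented for illustration.

    STOP_TRIPLETS = {
        ("thing", "hasPart", "thing"),
        ("page", "contains", "text"),
    }

    def filter_triplets(triplets, stop_triplets=STOP_TRIPLETS):
        """Drop very common, uninformative (subject, predicate, object) triples."""
        return [t for t in triplets if t not in stop_triplets]

    if __name__ == "__main__":
        extracted = [("morphology", "subClassOf", "linguistics"),
                     ("page", "contains", "text")]
        print(filter_triplets(extracted))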
The choice of software is mainly based on the surveys and on the original literature that were described in chapter 3. The main prerequisites are: 1. The software and resources must be modular and pluggable; 2. They are preferably Open Source or otherwise freely available; 3. Domain ontologies must be created; 4. The ontologies that are produced must follow the latest standards, i.e. RDF(S) or OWL, either directly or via an export function.
Figure 5.2: OntoSpider with OntoLT as the Ontology Learning Component
Various off the shelf ontology learning packages exist that meet many of the prerequisites that were mentioned. Apart from using an off the shelf ontology learning package, it would also be possible to implement an ontology learning approach from scratch. An obvious disadvantage is the added complexity of the implementation, as many problems in Ontology Learning that have already been addressed in off the shelf packages would have to be solved again. However, so far, all existing software packages that were examined had their drawbacks: in some cases not all required software packages were freely available, or the creation of ontologies was not from text, or the approaches were based on existing ontologies, etc. For practical purposes and for initial testing, a pattern matching approach in Perl, for instance, could be useful anyway, as rapid prototyping in Perl can yield quick results, and from a development point of view it could be advantageous. If an implementation were made in Perl, re-inventing the wheel should still be avoided. The use of existing text mining Perl code such as the code that is presented in Konchady ([40], 2006) could be a good option, as well as the use of GATE resources.
As newsletter texts are the input to the hybrid approach of Specia and Motta that was described in chapter 3, it could be used as the Ontology Engineering component of OntoSpider: the highly specialized output corpus of OntoSpider could be its input instead of this newsletter corpus. The deep linguistic processing and the resulting rich semantic annotations of the approach make it an attractive candidate as opposed to the many shallow approaches that exist. An important disadvantage of using this approach within OntoSpider is that an already existing domain ontology is used by the approach. Ideally, no existing domain ontology should be necessary for an implementation of OntoSpider; perhaps the approach can be used without an existing ontology. Another way of combining the approach with OntoSpider could be to use the ontologies that OntoSpider produces as input domain ontologies, rather than using the output corpora of OntoSpider as text input to the approach, but this possibility is totally outside the scope of this study.
It must be stressed that the choice of software is not really essential for the purpose of this study, as long as the results of hypothesis testing are conclusive and domain ontologies are created.
After considering the various options, OntoLT was chosen for the Ontology Learning component of this study, as at least theoretically it meets most of the requirements that were mentioned above. The approach is modular and relatively platform-independent, as the Java Runtime Environment is available for many systems. It is a plugin for Protégé, which is a very common Ontology Engineering environment, and OntoLT can take text corpora as input. OntoLT can export in
environment, and OntoLT can take text corpora as input. OntoLT can export in
RDF and OWL, it creates domain ontologies and is the implementation of a very
formal approach that is freely available. It uses WordNet, so that also makes it
an attractive candidate. It does linguistic analysis, which can be extended. Corpora must be XML-annotated first, for which SCHUG can be used. The fact that
OntoLT depends on SCHUG is currently a big disadvantage, for it has very limited availability. Actual implementations that use OntoLT will either have to use
SCHUG, or develop an alternative software package that provides the XML annotations that OntoLT requires. At the request of the author, some documents had been
XML-annotated with SCHUG for test purposes, but in an actual semi-automatic
implementation that is intended to be widely available, automatic annotation with
SCHUG or an alternative of it must be possible. As long as this is not the case yet,
the Ontology Engineering component of OntoSpider remains theoretical.
5.2 The IR Component of OntoSpider
Figure 5.3: IR Component of OntoSpider
The Information Retrieval Component of OntoSpider consists of lightweight centroid
based focused crawlers that are designed for the creation of highly specialized corpora
from which domain ontologies can be created at a later stage. As an off the shelf
Ontology Learning software package will be used, the focus of the description of
OntoSpider in this thesis is on this Information Retrieval Component.
Each of the focused crawlers must meet various requirements: 1. It must make use of existing modules that were developed for Information Retrieval; 2. It must adhere to good crawler practices, like avoiding hammering sites and being registered at Robots databases (if the implementation is turned into a general purpose tool that many people can use, registration at Robots databases can be problematic, but in the current study this will be neglected); 3. It must be lightweight enough to be used from a broadband connection; 4. If any use is made of a search engine for the retrieval of documents, it must not be too dependent on the specific search engine and its algorithms: the approach would also have to work with an alternative search engine, and it should only use the search engine for the retrieval of documents that have already been identified.
Even though various focused crawler implementations exist, it was decided at a very early point not to use off the shelf software for the Information Retrieval component of OntoSpider. There were mainly two reasons for this. To a large extent, a centroid based focused crawler approach had already been implemented by the author for an assignment in Information Retrieval. The second reason was that using a Literature Crawler for the sake of Ontology Learning was a new idea, for which a design would have to be presented. Furthermore, for comparing the results of a General Purpose Focused Crawler with those of a Literature Crawler, it was useful to make sure that the two approaches have many characteristics in common. That way, differences in results can be attributed to the actual crawler approaches, rather than to peculiarities of the specific implementations.
5.2.1 Functional Design of the IR Component
As no off the shelf software package is chosen for the Information Retrieval Component of OntoSpider, details will be provided on the functional design of this component. Specific issues need to be addressed for the automated processing of files that are not flat ASCII texts, like PDF files, and more specifically scientific papers. One obvious issue is that these texts will either need to be converted to plain ASCII text for further processing, or that there will need to be code that handles the specific format, possibly with a module that is written for this purpose. The crawler that downloads these documents must be able to distinguish between files that need to be processed and those that do not. A simple way to address this problem is to make the decision based on the extensions of file names. If PDF and PostScript documents are to be processed, only those that actually have the file extension .pdf, .PDF, .ps or .PS will be processed.
Ideally, the focused crawler will keep track of bandwidth usage and enable the user to indicate an optional maximum bandwidth usage for the crawler. If no maximum is set, much of the bandwidth of the connection may be used by the crawler, as long as this connection is not saturated.
In all cases, the crawl is neither breadth-first nor depth-first; this applies to both the general purpose focused crawler and the literature crawler. This is because the queue is reordered on the fly based on cosine similarity values, which makes the crawls best-first. This way, the chance of drifting is minimized, whilst the crawler does not necessarily remain at the same site for a long time. One problem that can be anticipated is that for highly specialistic subjects there are too few results. Initially, only single crawlers are used. If efficiency appears to be an issue in the future, the use of parallel crawlers can be considered. At least at first, striving for effectiveness of the crawlers is most important.
General Purpose Focused Crawling
Many choices of General Purpose Focused Crawlers are possible. Here some pseudocode is presented for a basic crawler that has much in common with the Literature Crawler; an illustrative Python sketch of the same algorithm is given after the pseudocode. Note that embedding the documents in a set of irrelevant documents can make the crawler more efficient:
• Initialize ranking array R to empty array
• Let the user start with a set of one or more highly relevant seed documents D
which he or she chooses
• Create vector V0 representation from D, excluding the table of contents
• Make centroid C identical to vector V0
• Optionally ask user to prune vector V0 or to add to it
• Create ranking array R with outlinks, all with value 1
• Loop through ranking array R and
REPEAT for each item I :
– Retrieve I if possible
– If I could be retrieved, create vector representation V1 from it
– Calculate similarity of vector V1 to Centroid C
– Place I in ranking array R based on similarity
UNTIL: queue is empty, OR
user has interrupted the crawl, OR
a certain maximum (bytes, documents) has been reached
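The following is an illustrative, self-contained Python sketch of the pseudocode above. Fetching, link extraction and the similarity function are passed in as stubs (the similarity argument could, for instance, be the cosine similarity sketch from section 4.2), and all names are invented for this sketch rather than taken from an actual implementation.

    # Illustrative sketch of the centroid based focused crawler pseudocode
    # above; fetching and link extraction are passed in as functions, and
    # vectors are {term: weight} dicts. Names are invented for this sketch.

    def focused_crawl(seed_docs, fetch, extract_links, similarity,
                      max_documents=100):
        # Build centroid C from the seed documents (vector V0).
        centroid = {}
        for doc in seed_docs:
            for term in doc.lower().split():
                centroid[term] = centroid.get(term, 0) + 1

        # Ranking array R: (score, url) pairs, initialised from seed outlinks.
        ranking = [(1.0, url) for doc in seed_docs for url in extract_links(doc)]
        collected = []
        while ranking and len(collected) < max_documents:
            ranking.sort(reverse=True)          # best-first: highest score next
            score, url = ranking.pop(0)
            page = fetch(url)                   # retrieve item I if possible
            if page is None:
                continue
            vector = {}
            for term in page.lower().split():   # vector V1 of the page
                vector[term] = vector.get(term, 0) + 1
            sim = similarity(centroid, vector)
            collected.append((url, sim))        # a real crawler would threshold here
            ranking.extend((sim, link) for link in extract_links(page))
        return collected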
Since this study concentrates on the comparison between a General Purpose Focused Crawler and a Literature Crawler, non-focused web crawling is only briefly mentioned here. Crawling a very large part of the World Wide Web with 'regular' crawlers, i.e. crawlers that do not focus on a certain subject, is quite expensive; this is one of the reasons for using focused crawlers instead. It would probably require the use of parallel crawlers, and a lot of bandwidth and storage capacity would be needed. If these are available, there is much literature on how to perform such crawling in an efficient way. One relatively inexpensive way of testing the results of such 'regular' crawlers anyway is to use search engines for which data has been gathered with such crawlers. If search engines are used, it is important not to limit the search to the top N hits, but rather to use as many hits as the search results in. If only the top N hits are used, they could be based on an unknown, possibly complicated, ranking algorithm. So by using search engines, researchers can have access to the results of regular crawlers in a relatively inexpensive way. The results of queries at search engines can be 'polluted' by unknown factors if the search and ranking algorithms of the search engines are not fully open, so for e.g. one-time research purposes, a 'regular' crawler approach can still be justifiable. Again, this study only makes use of focused crawler approaches.
Literature Crawling
Here a tentative algorithm for Literature Crawling is proposed in pseudocode. The
algorithm is quite similar to that of the General Purpose Focused Crawler that was
presented above, but crucial differences are that crawling only takes place ’within’
scholarly papers and that they are not retrieved directly during crawls, based on
extracted URLs, but indirectly via a search engine. Google was chosen for the
retrieval of documents, as it currently covers a very extensive part of the World
Wide Web with fast access to billions of documents via a reliable API, the Google
API. As specified above, the choice of the specific search engine is immaterial to the
approach. Another difference with the General Purpose Focused Crawler is that,
instead of the usual outlinks, which are hyperlinks, bibliographic references will be
'visited' at subsequent crawls. Of course, the actual retrieval will also be via a
URL, after a lookup via the search engine. See the table below for more differences
between the General Purpose Centroid based Crawler approach and the Literature
Crawler approach.
• Initialize ranking array R to empty array
• Let the user start with a set of one or more highly relevant seed documents D
which he or she chooses
• Create vector V0 representation from D, excluding the table of contents
• Make centroid C identical to vector V0
• Optionally ask user to prune vector V0 or to add to it
• Create ranking array R with bibliography references, all with value 1
• Loop through ranking array R and
REPEAT for each item I :
– Retrieve I using a Google API if possible
– If I could be retrieved, create vector representation V1 from it
– Calculate similarity of vector V1 to Centroid C
– Place I in ranking array R based on similarity
UNTIL: either queue is empty, OR
user has interrupted the crawl, OR
a certain maximum (bytes, documents) has been reached
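As an illustration of the "Retrieve I using a Google API if possible" step, the sketch
below builds a simple query string from a parsed bibliographic reference (authors, year,
title). The field names and the build_query helper are hypothetical; in litcrawl.pl the
actual lookup is delegated to the googly.pl helper script, whose invocation is not shown
here.

  #!/usr/bin/perl -w
  # Sketch: turn a parsed bibliographic reference into a search engine query.
  # The hash layout and build_query() are illustrative assumptions.
  use strict;

  sub build_query {
      my ($ref) = @_;
      # Quote the title to favour exact matches, add author surnames and year
      my $query = '"' . $ref->{title} . '"';
      $query .= ' ' . join(' ', @{ $ref->{authors} });
      $query .= ' ' . $ref->{year} if $ref->{year};
      return $query;
  }

  my %reference = (
      authors => [ 'Diligenti', 'Coetzee', 'Lawrence' ],
      year    => 2000,
      title   => 'Focused Crawling using Context Graphs',
  );
  print build_query(\%reference), "\n";
  # prints: "Focused Crawling using Context Graphs" Diligenti Coetzee Lawrence 2000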
According to this algorithm, the ranking array R should grow as more documents
are processed. Note that the use of Google Scholar, a search engine for scholarly
literature, instead of a Literature Crawler would be an option. Google Scholar is in
beta at the time of writing; because of this, because its availability for automatic
processing is uncertain, and because its specifications are not open, it was decided
that a Literature Crawler would be developed. However, Google Scholar can certainly
be helpful for verification of the corpora and bibliographies that are output of the
Literature Crawler.
The extraction of bibliographic references is not always easy. If there are
specific BibTeX or bibitem elements in documents, or one of the commonly used
bibliography formats is used, it is easy to detect such bibliographic references. In
many cases, a relatively simple pattern matching approach may suffice. Usually,
bibliographic references are found at the end of a paper, in a section that is called
'References' or 'Bibliography'. In most cases, it will be sufficient for the software
to detect this References section and then process the individual references. If a
search engine is used to retrieve the actual literature, it may not even be necessary to
do much parsing of the bibliographic entries. A simple pattern matching approach
can detect the year of publication, the title and the authors from these entries, and
they can be used in a search query to retrieve the actual document. Here some
pseudocode for bibliography extraction follows (a Perl sketch of the pattern matching
is given after the pseudocode):
• Retrieve PDF or PS document.
• Convert retrieved document into ASCII format.
• Skip lines until ’References’ or ’Bibliography’ is found on a line.
• IF the References section is found,
THEN process further lines as follows:
– Pattern match year, names of authors, title.
– Add found data to ranking array R with bibliographic references.
ELSE abort bibliography extraction.
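A minimal Perl sketch of this extraction step is given below. It assumes the paper has
already been converted to plain ASCII text; the regular expressions are deliberately
naive (one reference per line, a four-digit year, a capitalised surname) and only
illustrate the kind of pattern matching meant above, not the OntoSpider implementation.

  #!/usr/bin/perl -w
  # Sketch: detect the References/Bibliography section in a plain ASCII text
  # and pattern-match year, (first) author and a rough title from each line.
  use strict;

  sub extract_references {
      my ($text) = @_;
      my @refs;
      my $in_refs = 0;
      for my $line (split /\n/, $text) {
          $in_refs = 1 if $line =~ /^\s*(References|Bibliography)\s*$/i;
          next unless $in_refs;
          my ($year) = $line =~ /\b((?:19|20)\d{2})\b/;
          next unless $year;                                # needs a year
          my ($author) = $line =~ /^\s*([A-Z][A-Za-z'-]+)/; # first surname
          my ($title)  = $line =~ /\(\d{4}\)\.?\s*([^.]+)/; # text right after "(year)"
          push @refs, { year => $year, author => $author, title => $title };
      }
      return \@refs;
  }

  my $sample = "References\nChakrabarti, S. (1999). Focused crawling: a new approach. Computer Networks.\n";
  for my $r (@{ extract_references($sample) }) {
      printf "%s (%s): %s\n", $r->{author} || '?', $r->{year}, $r->{title} || '?';
  }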
Various formats for bibliographic entries are in common use. The pattern
matching could make use of those formats that are most common, or it could be
set up in a very general way. As PDF and PS documents are analyzed with pattern
matching for bibliography extraction after they have been converted to plain ASCII
texts anyway, it is also possible to treat specific sections, like the table of contents
and a list of tables, in a special way. A keywords section, especially, normally consists
of crucial terms for the domain of the paper.
Literature is historical, in that older work that has not been republished obviously
cannot refer to work that is published at a later point. The backward crawling
approach of Diligenti et al. ([27], 2000) is a very natural way of overcoming this
limitation: literature that refers to the current on-topic paper is also processed.
Even though this approach certainly has merit in view of the Literature Crawler,
as it could yield more on-topic literature, no backward crawling is used for the
current research, as for our experiments we only want to use a search engine to
retrieve documents that we have already identified, and not depend on the search
algorithms of the search engine. A fully-fledged Literature Crawler will eventually
include backward crawling as well. Once the simple Literature Crawler that was
described here has been implemented, it will be relatively easy to add backward
crawling capabilities to it. Other techniques that are used for standard Web crawling
can be used for Literature Crawling as well, like an adapted version of Kleinberg's
HITS algorithm. All this is left for future research.
Commonalities between the crawler approaches
It is very important that the common module of the two crawlers contains as much as
possible of the functionality that the two crawlers have in common. This way, code
can be refined and corrected without necessarily favoring either of the two crawlers.
This section gives specifications on features that can be part of the common module.
Table 5.1: Differences between the two crawlers

  General Purpose Focused Crawler        Literature Crawler
  -------------------------------        ------------------
  Ranking array consists of              Ranking array consists of
  outlinks (URLs)                        bibliographic references

  Mainly HTML is processed               Only formats like PDF and PS
                                         are processed

  Retrieval of documents                 Retrieval of documents
  straight from the Web                  via the Google API

  All documents are treated              Some of the structure of documents
  as Bags of Words                       is analyzed, then documents are
                                         treated as Bags of Words

                                         As a side effect, a bibliography
                                         is created
Centroid
The program starts by determining the centroid, based on a set of relevant seed URLs
which are declared in an ASCII file or entered by the user. The user determines
highly on-topic seed URLs for the specific search that should be performed, e.g.
using a search engine or web portal to find relevant hits. Starting off with the right
seed URLs is very important; for specialistic topics, finding suitable seed URLs can
be highly nontrivial. The seed URLs are embedded in a set of irrelevant URLs, which
could be processed to form something like a dispersoid, the opposite of a centroid.
If a document were close to the dispersoid, it would be discarded. The set of
irrelevant documents could be composed in various ways, like based on other search
queries that have nothing to do with the topic of the focused crawler. The centroid of
the focused crawler itself can be kept constant, or it can be automatically adjusted
based on the data that has been processed. The user can specify how often the
centroid should be calculated by setting a counter variable to some integer value,
like 15.
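The following Perl sketch shows one way the centroid and the cosine similarity used
for ranking could be computed from simple term-frequency vectors. The vector
representation (a hash of term counts with stopwords removed) is an assumption for
illustration and is not the exact OntoSpider data structure.

  #!/usr/bin/perl -w
  # Sketch: build a centroid as the mean of term-frequency vectors of the seed
  # documents, and compute cosine similarity of a new document to that centroid.
  # Stopword removal keeps the dimensionality of the centroid down.
  use strict;

  my %stopwords = map { $_ => 1 } qw(the a an of and or in on to is are for with);

  sub term_vector {
      my ($text) = @_;
      my %tf;
      for my $w (map { lc } $text =~ /([A-Za-z]+)/g) {
          $tf{$w}++ unless $stopwords{$w};
      }
      return \%tf;
  }

  sub centroid {
      my (@vectors) = @_;
      my %c;
      for my $v (@vectors) {
          $c{$_} += $v->{$_} for keys %$v;
      }
      $c{$_} /= scalar(@vectors) for keys %c;   # mean of the seed vectors
      return \%c;
  }

  sub cosine {
      my ($x, $y) = @_;
      my ($dot, $nx, $ny) = (0, 0, 0);
      $dot += ($x->{$_} || 0) * $y->{$_} for keys %$y;
      $nx  += $_ ** 2 for values %$x;
      $ny  += $_ ** 2 for values %$y;
      return 0 unless $nx && $ny;
      return $dot / (sqrt($nx) * sqrt($ny));
  }

  my $c  = centroid(term_vector("focused crawling of linguistics papers"),
                    term_vector("centroid based crawling for ontologies"));
  my $v1 = term_vector("ontology learning from linguistics texts");
  printf "similarity to centroid: %.3f\n", cosine($c, $v1);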
If a page is retrieved, the URLs that are extracted from it will inherit the
cosine similarity value of the page, but this value will be somewhat downplayed,
by a factor that is declared in a variable. This way, a URL will be deemed less
relevant the further away it is, in number of links, from a known relevant page.
In the literature, this is described as 'better parents have
better children’. The relevance of the actual pages that these URLs refer to will be
computed anew as soon as they are retrieved.
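A small sketch of this inheritance step, assuming a decay factor stored in a variable
(here $link_decay, an illustrative name, not the actual OntoSpider variable), could
look as follows:

  #!/usr/bin/perl -w
  # Sketch: outlinks inherit the parent's cosine similarity, reduced by a decay
  # factor, so URLs further away (in links) from a known relevant page rank lower.
  use strict;

  my $link_decay = 0.8;   # how much a child's provisional score is downplayed

  sub provisional_scores {
      my ($parent_score, @outlinks) = @_;
      # each extracted URL gets the parent's similarity times the decay factor;
      # the real similarity is recomputed once the page itself is retrieved
      return map { { url => $_, score => $parent_score * $link_decay } } @outlinks;
  }

  my @queue_items = provisional_scores(0.62,
      'http://example.org/a.html', 'http://example.org/b.html');
  printf "%s gets provisional score %.3f\n", $_->{url}, $_->{score} for @queue_items;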
The maximum number of pages and bytes that must be retrieved is not so
important; these limits may be set to extremely high values if necessary. Of course, the
program must break out of this while loop if the queue (array) is empty! It must
also stop the loop if a flagfile is touched by the user; this allows for intermediate
evaluations at any point during the crawl.
One problem that can be anticipated is that the dimensionality of the centroid
could become very large, which could make processing of data inefficient. Stopword
removal and the use of a set of irrelevant documents, as was mentioned above, can
help in reducing the dimensionality of the centroid.
Figure 5.4: Rich output of OntoSpider
Interface between the OntoSpider Components
The interface between the components of OntoSpider can be very simple, as the
output of the Information Retrieval component is a highly specialized corpus. The
documents that are output of the Information Retrieval component can be stored
in a relational database like a MySQL or PostgreSQL database, or even a hashed
directory structure with separate files. The choice may depend on the storage format
of the input that the Ontology Learning component expects. It is important that humans
can easily inspect and remove documents from this corpus. If
the components of OntoSpider are kept strictly distinct, it is easy to combine either
component with other implementations, like third party software packages. If there
is some overlap in the components, e.g. if the centroids that are used in the IR
component are actually used as the basis for the vocabularies of the ontologies that
are created by the Ontology Learning component, this combination of different
approaches will become more difficult and the approach may become more cluttered.
As it was chosen to keep the components strictly separate and to use OntoLT, all
documents are stored in flat ASCII format, including the PDF and PostScript documents
that the Literature Crawler produces. These can be preprocessed by SCHUG or
equivalent, so that they are XML annotated for processing by OntoLT. This preprocessing stage can be seen as part of the interface between the two components, but
it would be better to consider it as part of the Ontological Engineering component
itself, as unstructured text is annotated by it. Clearly, the conversion from PS and
PDF to plain ASCII format must be of good quality, so that SCHUG or an equivalent
package is able to XML-annotate the texts in a reliable way.
5.3 Assessment of OntoSpider
The assessment of OntoSpider could be done at at least two stages: the results of the
IR Component and those of the Ontology Learning Component could be assessed,
or even at more stages if the centroid that evolves or the bibliography that is created
as a side effect are assessed as well. However, the final results of the overall approach, the
quality of the resulting ontologies, are most important. The first step of assessment
could be to check that crucial concepts are present in the ”vocabulary”. If the two
components of OntoSpider are kept strictly separate, a comparison could be made
between the concepts that are part of the centroid of the IR Component and the
concepts that are in this vocabulary. Presumably, important concepts that are part
of the centroid should also be in the vocabulary. The next step could be to check
that crucial relations between concepts are also there. Synonymy resolution could
discover missing concepts and relations that were missed at earlier stages, but this
should not be necessary at assessment time, for such synonymy resolution should be
built into the approach itself. Still, at assessment time there could be an extra check
that such built-in synonymy resolution was actually effective. It would be useful
to formally define a Quality Function Q that takes (the results of) an Ontology
Learning approach as argument and that expresses the quality of the ontologies
that are the result of the approach. This idea is discussed in the paragraph with
notes on methodology.
In Sintek et al ([62], 2004), a formal definition of the task of semi-automatic
creation of ontologies from text is proposed, to overcome the central issue of the
difficulty of evaluating the usefulness and accurateness of the resulting ontologies.
The definition that is presented is general enough to cover both ontologies and
knowledge bases, as long as they have a model-theoretic semantics. Various formal
definitions are proposed, one of which is for a suggestion function σ which maps a
text corpus C and an ontology O and (background) knowledge K to suggestions S.
Also, operations on ontologies are defined, + and −, which map two ontologies onto a
new ontology, based on entailment. O1 + O2 is defined as the most general ontology O
where O entails O1 and O entails O2; O1 − O2 is defined as the least general ontology
O where O1 entails O and O does not entail O2. The notions of most general and
least general ontologies are also formally defined, based on entailment of ontologies.
An important problem that the paper flags is that of evaluating the usefulness or
accurateness of (semi-)automatically generated ontologies. The formal definitions
that the authors propose could help solve this problem, as common use of formal
definitions should make it easier to compare (results of) approaches. The Quality
Function Q could be based on this formal approach, as well as on something advanced
like an Automatic Ontology Assessment Tool such as the one that was suggested in
chapter 3. The Ontology Assessment Tool that is presented in Wang et al. ([72],
2005) could be of use here.
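For reference, the notions just described can be written down schematically as follows;
this is only a sketch of the notation as reported here (with the entailment symbol
\models), not a verbatim reproduction of the formalization in Sintek et al. ([62], 2004).

  % Suggestion function: corpus, ontology and background knowledge map to suggestions
  \sigma : \mathcal{C} \times \mathcal{O} \times \mathcal{K} \rightarrow \mathcal{S}

  % Sum and difference of ontologies, defined via entailment
  O_1 + O_2 = \text{the most general } O \text{ such that } O \models O_1 \text{ and } O \models O_2
  O_1 - O_2 = \text{the least general } O \text{ such that } O_1 \models O \text{ and } O \not\models O_2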
Experiments for making good Evaluations
For testing whether the results of a General Purpose Focused Crawler are better
than or equivalent to the results of a Literature Crawler, careful experiments must
be set up. In the first place, a reasonable number of subjects must be found for which
there are already existing domain ontologies that have either been created manually
or that have been post-edited by humans. Then the crawlers must be 'instructed'
to create ontologies on the same subjects, and the results must be compared to the
manually (post-)edited ontologies, which function as a 'standard'. Of course, they only
serve as a standard temporarily, for the purpose of the evaluation. An important characteristic to
check is completeness of the ontologies: Are concepts or relations between concepts
missing in the results of one of the crawlers, and not in those of the other? All
findings can be expressed in percentages, compared to the ’standard’ ontology. The
correctness of the resulting ontologies is equally important. If there are concepts
and relations between concepts that should not be part of the resulting ontologies,
this must also be expressed by a score.
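At the vocabulary level, such completeness and correctness percentages could be computed
with a comparison as simple as the following Perl sketch. The hand-crafted 'standard'
ontology and the crawler output are represented here as plain concept lists, which is an
illustrative simplification.

  #!/usr/bin/perl -w
  # Sketch: completeness = fraction of 'standard' concepts found by the crawler,
  # correctness = fraction of crawler concepts that are also in the standard.
  use strict;

  sub scores {
      my ($standard, $candidate) = @_;
      my %std = map { lc $_ => 1 } @$standard;
      my %cnd = map { lc $_ => 1 } @$candidate;
      my $completeness = 100 * (grep { $cnd{$_} } keys %std) / scalar(keys %std);
      my $correctness  = 100 * (grep { $std{$_} } keys %cnd) / scalar(keys %cnd);
      return ($completeness, $correctness);
  }

  my @standard  = qw(clitic pronoun agreement verb subject object);
  my @candidate = qw(clitic pronoun verb noun corpus);
  printf "completeness: %.1f%%  correctness: %.1f%%\n", scores(\@standard, \@candidate);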
It is also important to determine at what point of specialization of the subject
the resulting ontologies become too meagre. Generally, the more specialized
the topic of the focused crawl is, the fewer on-topic documents
can be found on the World Wide Web and the smaller the resulting ontology
could become, if a certain "critical mass" is not reached. If the crawl only results in
one single paper or even none, the system should abort with a warning. Experiments
should be conducted to get a better idea of where to draw the line, as a priori this
is not clear. Probably the question of where to draw the line depends on the topic of
the focused crawls, as certain scientific fields, like the medical sciences, have very
much specialistic information online whilst in many other fields, not so much online
information can be found. Of course, it may not make sense to create ontologies on
certain extremely specialized subjects.
An example of topics that get progressively more specialistic is the following: Linguistics is a broad field,
Natural Language Syntax is a branch of linguistics, sentence structure is one of the
subjects in syntax, sentence structure in Semitic Languages is even more specific,
and VSO patterns in Modern Standard Arabic is an example of a still more specialistic
subject. It so happens that there are many studies on VSO patterns in Semitic
languages. This is no more than an example of how a subject can be narrowed down
more and more. It would probably not make sense to create an ontology on ’trivial’
subjects like VSO patterns in a specific dialect of Old South Arabic.
For assessing the results of OntoSpider, it is important that a Domain Expert is
available who can make judgments based on expertise in the domain that is covered.
The Ontology Engineer does not need to be the same person as the Domain Expert,
but this can be useful. Generally, as the Semantic Web keeps evolving, it can
be expected that more and more Domain Experts will gain expertise in Ontology
Engineering.
Chapter 6
Conclusion and Further Research
This study described research in the area of centroid based focused crawlers that
can help in the semi-automatic creation of ontologies, based on information that is
available on the World Wide Web.
When the results of two specific focused crawler approaches, like a Literature
Crawler and a General Purpose Focused Crawler, are compared, it can be difficult
to determine whether better results of either one are inherent to the specific focused crawler
approach or are the result of accidental factors. This study has proposed
ways of minimizing the influence of such accidental factors.
More research is necessary to be able to draw hard conclusions on the OntoSpider
approach. For that, at least one fully-fledged working implementation of
the total approach, that is, of both the Ontology Engineering component and the Information
Retrieval component, must be developed. At the time of writing, the development
of OntoSpider is still in progress and the approach is still theoretical. The latest version
of the OntoSpider implementations will be available under the GNU-FDL license at
the following URL: http://www.nomeka.info/research/OntoSpider/. As the writer is
limited in the time he can spend on these implementations, it is not certain how fast
they will progress. This study described how empirical research can be done with
this approach, based on testing hypotheses.
Some initial experimenting suggests that implementations of the approach that
was described here would be interesting. Such implementations could result in the
creation of domain ontologies in fields for which no such ontologies are available yet,
at relatively little cost. Apart from this, as a side effect of the Literature Crawler
approach, highly specialized corpora could be created which could be used as input
for other approaches, and bibliographies could be formed that can be useful for
scientific research as well. See figure 5.4 for an illustration of the rich output of
the OntoSpider approach. Eventually, a Literature Crawler with backward crawling
capabilities can create very complete specialized corpora and bibliographies.
6.1 Some Notes on Methodology
In the literature on methodology, various sorts of research are distinguished, including
explorative, descriptive and explanatory research. In a way, this thesis encompasses
all three types of research. With the literature study on existing approaches and
its description, some descriptive research is done. Quite extensive literature
study was done to determine whether there were any similar approaches or whether
findings of totally different approaches could be used. This thesis also contains
explorative research, in that a hypothesis is proposed on the relationship between
focused crawling data and the quality of ontologies. The research is explanatory in
that it tries to establish the validity or falsity of the hypothesis. However, no full
implementation of the total approach exists yet. Originally, the plan was to
have the whole approach implemented in time. In itself, establishing whether the
proposed hypothesis or its alternative hypothesis holds is not the only goal of this
research. If the alternative hypothesis holds, hopefully interesting domain ontologies
will be a side-effect or result of the research as well. The outcome of such ontologies
will eventually also be described. Even though the study falls within the general
bootstrapping problem of the Semantic Web, it is still interesting outside the scope
of this problem as well. The motivation and background were discussed in chapter
1. Part of the scientific approach is, apart from being precise and explicit, to be very
critical. Questions like ”Is the hypothesis not too self-evident? If not, why not?” ”Is
the research precise and explicit enough, so that it is not immune to refutation, and
can it be reproduced?” will need to be answered. Generally speaking, the goal of
this research was to determine whether centroid based focused crawler approaches
can be useful for the purpose of the semi-automatic creation of ontologies. One
could say that it is up to a domain expert to make this judgment based on intuitions
about the resulting ontologies. For initial tentative experiments, such an intuitive
verification will probably suffice. A more exact approach is to formulate hypotheses and
alternative hypotheses. For such a more exact approach, the following three crawler
implementations could be taken into account: 1. a General Purpose Crawler (G),
2. a General Purpose Centroid Based Focused Crawler (F) and 3. a Literature
Crawler (L). One null hypothesis could be that the quality of the ontologies that
are produced by G is better than or equal to that of F, with the alternative hypothesis
that the opposite is the case:
H0 : Q(ΩG) ≥ Q(ΩF)
Ha : Q(ΩG) < Q(ΩF)
Another interesting thing to test would be whether the quality of the ontologies
that are produced by F is better than or equal to that of those that are
created with L. For that, the null hypothesis and alternative hypothesis could be
the following:
H0 : Q(ΩF) ≥ Q(ΩL)
Ha : Q(ΩF) < Q(ΩL)
In this study, a General Purpose Focused Crawler approach is compared with
a Literature Crawler approach, so testing the latter hypothesis is relevant.
Even though a search engine like Google Scholar and automated ways of creating
bibliographies do exist, the use of a Literature Crawler for the semi-automatic creation of ontologies seems to be a new approach. Even though a start was made with
an implementation of the OntoSpider approach, it still needs to be fully implemented
and evaluated.
The most difficult part of a very exact approach is the assessment of the
resulting ontologies. Ideally, this assessment would take place in an automatic way
itself, or at any rate the resulting ontologies would have to be assessed in some
objective way. Some of the important aspects of ontologies that need to be assessed
are completeness and consistency. The research could use some manually created
comparison ontology.
In order to be able to express the quality of ontologies in an exact way, a
quantitative approach could be used. In the representations above, Q expresses a
quality function that represents the quality of ontologies that result from approaches.
Probably the simplest way of quantifying would be to go down the Ontology Scale to
the vocabulary level, and simply compare the concepts that are present in the various
ontologies with those that are in the comparison ontology. Of course, in itself this
is not sufficient. In fact it resembles the comparison of vector representations of
documents with a centroid, which is part of the OntoSpider approach itself. The
next step, in a higher layer of the Ontology Scale, could be to compare the concepts
together with their relations, i.e. the ontology triples (e.g. RDF triplets) that the
ontologies consist of. For initial experiments, the RDF triplet level seems sufficient. Further research
could add more complicated ontologies, like OWL ontologies. In both cases, even if
H0 cannot be rejected on reasonable grounds, the study is useful from a scientific
point of view. In science, often failures, dead ends, etc. are not mentioned in
publications, but pursuing wrong tracks in a search for better understanding can
also be important, not only so that other researchers do not repeat such research in
vain, but also because intermediate findings can be important in themselves. Maybe
in scientific research, a phenomenon like ”tunneling” could even be possible, where
paths that seem to have dead ends lead to a main road further on.
As far as the implementation is concerned, it is very important to make sure
that the conclusions are robust, i.e. it must be certain that if other means were
used, the results would not be far different. Also, the possibility of fatal bugs
that have a large impact on the conclusions must be ruled out. One way of making
sure the outcome is more solid is to run multiple tests and to compare the results.
It can be tempting to fiddle with the software until the results are as expected. This
must be avoided, and ideally anyone should be able to verify that similar results
can be obtained with the approach. On the other hand, it may not be realistic not
to allow for adaptation of the software so that better results are obtained. For this
reason, the design of OntoSpider involves a common module, OntoSpider.pm, that
is used by both the General Purpose Centroid Based Focused Crawler (focusbot.pl)
and the Literature Crawler (litcrawl.pl), so that code in this common module can
be improved on without favouring either crawler exclusively. For the very first implementation of the approach, the resulting ontologies of L and F will be compared.
In order to facilitate reproduction of the results, the software that is used will be
freely available under the GNU/FDL license. Whenever search engines are used
for Information Retrieval, this must be done with care. Often algorithms of these
search engines that are used for data gathering, ranking, etc. are not open. With
the crawlers that are described, the search engine only plays a role in the retrieval
of actual documents, not in the search or qualifying process itself. This does not
have an impact on the results of the research, for they would be the same if the
documents were retrieved from the Web in some other way. As mentioned, the
implementation has not been finished yet. The appendix shows some of the code
that has been developed so far.
All empirical data is gathered by the crawlers from the Internet. Compared
to research that requires questionnaires, this is not costly. Compared to research
that uses statistical data from existing sources, it is somewhat more involved. The
Web contains a treasure trove of information; data mining and text mining can extract useful
information at relatively little cost. If necessary, the cost in terms of bandwidth
usage can be measured with e.g. MRTG or Cricket. Disk storage is very cheap at
the time of writing. A working model that is used for the data and its forms is
that of the Ontology Scale or Ontological Spectrum. At the basis of the research are
quite simple, straightforward IR methods and techniques, like a TF.IDF similarity
measure, centroid based crawling, the use of contrast sets, etc. Implementations
could be done with rapid prototyping in Perl, with the use of existing resources
like WordNet and software packages like GATE and OntoLT. Common etiquette for
spidering the Web is observed, like the avoidance of hammering sites and the avoidance of
repetitive retrieval of the same documents. Also, a modular approach is taken. Part
of the research includes finding other linguistic ontologies, whether they are created
manually or automatically, and surveying the state of the art of linguistic ontologies at the
time of writing. One thing that is very important is that a solid method is used for
comparing and assessing linguistic ontologies. Also, finding similar approaches and
describing them is important.
Another important thing is to have a good delimitation of the subject of the
research. It is easy to get trapped in an overly ambitious approach, with complicated
ontologies, too wide a scope, etcetera. From the start, a general purpose centroid
based focused crawler approach was chosen for the Information Retrieval component
of the research. At a later point, a Literature Crawler approach was considered as
well. Theoretically, for Literature Crawling a backward crawling approach like the
one that was proposed by Diligenti et al. ([27]) looked very promising, but it was
decided to leave that for future research. At any rate, as far as the implementation is
concerned, much use would be made of existing Perl modules and software packages.
The type of ontologies was a further point of delimitation. It was suggested that
specific linguistic ontologies would be a good further specialization, for two reasons.
At the time of writing, there weren’t many such linguistic ontologies yet. Furthermore, computational linguistics is the field of study of the writer. The creation of
specialistic ontologies in general was too broad a goal. Even though the focus would
be on linguistic ontologies, the implementation should be general enough to allow for
the creation of scientific ontologies in general, otherwise the approach might become
too ad-hoc. As far as the type of ontologies is concerned, its place on the Ontology
Scale/Ontological Spectrum was also important. It seemed logical to start at the
bottom of the scale, with keyword sets, then proceed to RDF triplets and maybe to
proceed further. For the Ontology Engineering part of the research, at first OntoLT
was chosen, but as SCHUG, which was required for the XML annotation OntoLT
expects in the input documents, at that time appeared not to be a free software
package, experimenting with it became more difficult. So various packages, like OntoLT and GATE, were examined and described. At a later point, an approach with
off the shelf software could be used for the Information Retrieval component of the
research as well, as part of this or future research.
6.2 Future Research
Much related research could be done that involves the use of online sources with
specialistic information in textual format or in formats that can easily be converted
into text format. Sources for such research could be specialistic mailing lists like
the Linguist List, newsgroups and forums, though it is likely that such sources will
contain far too much noise, compared to the specialistic literature that is used by
a Literature Crawler. As more and more books are digitized by projects such as
Project Gutenberg[51] and especially, more recently, the ambitious Google Books
project[35], more information that can be used for creating domain ontologies will
become available as well over the years. Often content that is contained in those
projects is only available if the copyright has expired, and for many subjects it
may be too dated. Projects in which user-created content is central, like Wikimedia
projects such as Wikipedia, could be a source of highly specialistic information, but
they would probably be less suitable for the creation of domain ontologies than
scientific publications are, for they try to avoid overly technical descriptions, are often
out of balance as far as the coverage of subjects is concerned, and are as yet not
always reliable. Later stable versions of such projects could be useful all the same.
Future study could also investigate how the proposed OntoSpider crawlers can
be combined. If the crawlers complement each other as far as ontological data is
concerned, their resulting ontologies could be merged. This does not need to be at
the ontology level. Combining the corpora of the two crawlers before they are input
to the Ontology Learning component is also possible. It would be interesting to
investigate how good the results of a combination of the output of the two crawlers
are. If the expectation holds that the Literature Crawler will perform better than the
General Purpose Focused Crawler as far as precision is concerned, and that the
latter will perform better on recall, then combining the two crawlers could be a good way
of striking a balance between precision and recall. Furthermore, it is likely that
the recall of the combined crawlers will be better than that of any of the individual
crawlers by themselves.
Bibliography
[1] Aggarwal, Charu C.; Al-Garawi, Fatima and Yu, Philip S., Intelligent crawling
on the World Wide Web with arbitrary predicates, World Wide Web, 2001.
[2] Agirre E., Ansa O., Martínez D., Hovy E., Enriching WordNet concepts with
topic signatures, Proceedings of the SIGLEX workshop on "WordNet and
Other Lexical Resources: Applications, Extensions and Customizations”. In
conjunction with NAACL. 2001.
[3] Antoniou, Grigoris, van Harmelen, Frank, Web Ontology Language, Handbook on Ontologies in Information Systems, Springer-Verlag, 2003.
[4] Aussenac-Gilles, Nathalie; Biebow, Brigitte; Szulman, Sylvie, Corpus Analysis for conceptual modelling, Workshop on Ontologies and Texts, Knowledge
Engineering and Knowledge Management: Methods, Models and Tools, 12th
International Conference, EKAW”2000, Juan-les-pins, France, Octobre 2000,
Springer-Verlag
[5] Aussenac-Gilles, Nathalie; Biebow, Brigitte; Szulman, Sylvie, Revisiting Ontology Design: A Methodology Based on Corpus Analysis, Lecture Notes In
Computer Science, Proceedings of the 12th European Workshop on Knowledge Acquisition, Modeling and Management, p.172-188, 2000.
[6] Bergmark, Donna; Lagoze, Carl and Sbityakov, Alex, Focused Crawls, Tunneling, and Digital Libraries, Lecture Notes In Computer Science; Vol. 2458
archive Proceedings of the 6th European Conference on Research and Advanced Technology for Digital Libraries table of contents, 2002.
[7] Lee, T.B, Hendler, J. and Lassila, O., The Semantic Web, Scientific American, May 2001.
[8] Biebow, Brigitte and Szulman, Sylvie, TERMINAE: A Linguistics-Based
Tool for the Building of a Domain Ontology, 11th European Workshop,
Knowledge Acquisition, Modeling and Management (EKAW” 99), Dagstuhl
Castle, Germany, 26-29 Mai, 1999, p. 49-66
[9] Bontcheva, K. and Tablan, V. and Maynard, D. and Cunningham, H., Evolving GATE to Meet New Challenges in Language Engineering, Natural Language Engineering volume 10, number 3/4 pages 349-373, 2004.
[10] Buitelaar Paul, Olejnik Daniel and Sintek Michael, OntoLT: A Protégé Plug-In
for Ontology Extraction from Text In: Proceedings of the Demo Session of
ISWC-2003, Sanibel Island, Florida, October 2003.
[11] Buitelaar Paul, Cimiano Philipp and Magnini Bernardo, Ontology Learning
from Text: An Overview, DFKI, Language Technology Lab AIFB, University
of Karlsruhe, 2003.
[12] Buitelaar, Paul; Olejnik, Daniel; Hutanu, Mihaela; Schutz, Alexander; Declerck, Thierry; Sintek, Michael, Towards Ontology Engineering Based on
Linguistic Analysis, Saarbruecken; Kaiserslautern, Germany, 2004.
[13] Buitelaar, Paul; Sintek, Michael, OntoLT Version 1.0: Middleware for Ontology Extraction from Text, Saarbruecken; Kaiserslautern, Germany, 2004.
[14] Buitelaar, Paul; Olejnik, Daniel and Sintek, Michael, A Protégé Plug-In for
Ontology Extraction from Text Based on Linguistic Analysis In: Proceedings
of the 1st European Semantic Web Symposium (ESWS), Heraklion, Greece,
May 2004.
[15] Buitelaar, Paul; Sintek, Michael and Iqbal, Yasir, OntoLT Version 1.0: Short
User Guide, Saarbruecken; Kaiserslautern, Germany, 2004.
[16] Buitelaar, Paul, Practical HLT and ML - Ontology Learning Section,
Linguistic-based Extraction of Concepts and Relations with OntoLT, Saarbruecken, Germany, 2004.
[17] Chakrabarti, Soumen; van den Berg, Martin and Dom, Byron, Focused crawling: a new approach to topic-specific Web resource discovery Computer Networks vol. 31 number 11-16, Amsterdam, Netherlands, 1999.
[18] Chakrabarti, Soumen, Focused Crawling: The quest for topic-specific portals,
http://www.cs.berkeley.edu/~soumen/focus/, 1999.
[19] Chétrit, H., LiTH-IDA-EX-04/017-SE, A Tool for Facilitating Ontology Construction from Texts, Master's Thesis, Sweden, 2004.
[20] Cucchiarelli, A., Navigli, R., Neri, F, Velardi, P, Automatic Generation of
Glosses in the OntoLearn System, Proc. of 4th International Conference on
Language Resources and Evaluation (LREC 2004), Lisboa, 26-28th May,
2004.
[21] Cucchiarelli, A., Navigli, R., Neri, F., Velardi, P., Automatic Ontology Learning: Supporting a Per-Concept Evaluation by Domain Experts, Workshop on
Ontology Learning and Population, in the 16th European Conference on Artificial Intelligence (ECAI 2004), Valencia, Spain, August 22-23rd, 2004.
[22] Cunningham, H. and Maynard, D. and Bontcheva, K. and Tablan, V., GATE: A
framework and graphical development environment for robust NLP tools and
applications, Proceedings of the 40th Anniversary Meeting of the Association
for Computational Linguistics, 2002.
[23] Daconta, Michael, Obrst L. and Smith, K., The Semantic Web: A Guide
to the Future of XML, Web Services, and Knowledge Management, Wiley &
Sons, 2003.
[24] Declerck, Thierry, A set of tools for integrating linguistic and non-linguistic
information, Proceedings of SAAKM (ECAI Workshop), 2002.
[25] Declerck, Thierry and André, Elisabeth, L’indexation conceptuelle de documents multilingues et multimédias, Multilinguisme et traitement de linformation (Traité des sciences et techniques de l’information) , Lavoisier, 2002
[26] Declerck, Thierry and Crispi, Claudia, Multilingual Extension of a morphosyntactic Lattice to Central and Eastern European Languages, Proceedings of
IESL, Saarland University, Saarbruecken, 2003.
[27] Diligenti, Michelangelo; Coetzee, Frans; Lawrence, Steve; Lee Giles, C and
Gori, Marco, Focused Crawling using Context Graphs, in 26th International
Conference on Very Large Databases, VLDB, Cairo 2000.
[28] Dill et al., A case for automated large-scale semantic annotation, Web Semantics: Science, Services and Agents on the World Wide Web 1, 115-132,
2003.
[29] Ding, Y, Engels, R, IR and AI: Using Co-occurrence Theory to Generate
Lightweight Ontologies, Netherlands (Amsterdam), Norway (2001)
[30] Ding, Y. and Foo, S., Ontology Research and Development: Part 1 - A Review
of Ontology Generation. Journal of Information Science 28(2), 2002.
[31] Ehrig, M. Ontology-Focused Crawling of Documents and Relational Metadata,
Master”s Thesis, Karlsruhe 2002.
[32] Faure, D. and Nédellec, C. and Rouveirol, C., Acquisition of Semantic Knowledge using Machine learning methods: The System ASIUM, Technical report
number ICS-TR-88-16, 1998.
[33] Gómez-Pérez, Asunción, Manzano-Macho, David, et al. A survey of ontology
learning methods and techniques, OntoWeb Deliverable 1.5, Madrid, 2003.
[34] Gómez-Pérez, Asunción, Fernández-López, Mariano, Corcho, Oscar, Ontological Engineering with examples from the areas of Knowledge Management,
e-Commerce and the Semantic Web, Springer Verlag, London, 2004.
[35] Google Books, http://books.google.com/googlebooks/about.html
[36] Gruber, T. R., A Translation Approach to Portable Ontology Specifications,
Knowledge Acquisition, Knowledge Systems Laboratory, Computer Science
Department, Stanford University, 2003.
[37] Handschuh, S. and Staab, S. and Ciravegna, F., S-CREAM-Semi-automatic
CREAtion of Metadata, Proc. of the European Conference on Knowledge
Acquisition and Management, 2002.
[38] Jianming Li, Lei Zhang and Yong Yu. Learning to Generate Semantic Annotation for Domain Specific Sentences, first International Conference on
Knowledge Capture, 2001.
[39] Kleinberg, J., Authoritative sources in a hyperlinked environment. Proc. 9th
ACM-SIAM Symposium on Discrete Algorithms, 1998. Extended version in
Journal of the ACM 46(1999). Also appears as IBM Research Report RJ
10076, May 1997.
[40] Konchady, Manu, Text Mining Application Programming, Thomson Delmar
Learning, Charles River Media Programming Series, Boston, Massachusetts,
2006.
[41] Lassila, O. and McGuinness, D. L., The Role of Frame-Based Representation
on the Semantic Web, Knowledge Systems Laboratory, January, 2001.
[42] Maedche, A and Staab, S., Semi-Automatic Engineering of Ontologies from
Text, in Proceedings of the 12th International Conference on Software and Knowledge Engineering, Chicago, USA, KSI, 2000.
[43] Maedche, A. and Staab, S.: Software Demonstration: The Text-To-Onto Ontology Learning Environment, International Conference on Conceptual Structures: Logical, Linguistic, and Computational Issues (ICCS'2000), Darmstadt, 14-18 August, 2000.
[44] Maedche, A and Staab, S. Ontology learning for the Semantic Web, IEEE
Intelligent Systems, 16(2), 2001.
[45] Maedche, A and Ehrig, M. and Handschuh, S and Stojanovic, L. and Volz,
R., Ontology-Focused Crawling of Web Documents and RDF-based Metadata,
Karlsruhe, Germany.
[46] Menezes, Roger, Crawling the Web at Desktop Scales, Dissertation, 2004.
[47] Missikoff, Michele and Velardi, Paola and Fabriani, Paolo, Text Mining Techniques to Automatically Enrich a Domain Ontology, int. Journal of Applied
Intelligence, 2001.
[48] Missikoff, Michele and Navigli, Roberto and Velardi, Paola, Integrated Approach to Web Ontology Learning and Engineering, IEEE Computer, pp.
60-63, November 2002.
[49] Mukherjea, Sougata, WTMS: a system for collecting and analyzing topic-specific Web information, Computer Networks 33(1-6): 457-471, 2000.
[50] Novak, Blaz, A Survey of Focused Web Crawling Algorithms in: SIKDD 2004
at multiconference IS 2004, Ljubljana, Slovenia, 12-15 Oct 2004.
[51] http://www.gutenberg.org/wiki/Gutenberg:About
[52] Noy, Natalya Fridman and McGuinness, Deborah L. Ontology Development
101: A Guide to Creating Your First Ontology, Stanford Knowledge Systems
Laboratory Technical Report KSL-01-05 and Stanford Medical Informatics
Technical Report SMI-2001-0880, March 2001.
[53] OWL Web Ontology Language Overview, http://www.w3.org/TR/owl-features/, W3C Recommendation, 10 February 2004.
[54] OWL Web Ontology Language Reference, W3C Recommendation,
http://www.w3.org/TR/owl-ref/, 10 February 2004.
[55] Page, Lawrence; Brin, Sergey; Motwani, Rajeev; Winograd, Terry, The
PageRank Citation Ranking: Bringing Order to the Web., Technical Report,
Computer Science Department, Stanford University, 1998.
[56] RDF Primer, http://www.w3.org/TR/rdf-primer/, W3C Recommendation,
10 February 2004.
[57] RDF Semantics, http://www.w3.org/TR/rdf-mt/, W3C Recommendation,
10 February 2004.
[58] Sabou, Dr. Marta and d’Aquin, Dr. Mathieu and Motta, Prof. Enrico, Using the Semantic Web as Background Knowledge for Ontology Mapping In
Proceedings International Workshop on Ontology Matching (OM-2006), collocated with ISWC’06, 2006.
[59] Sakkis, George, Yet Another Focused Crawler, The First instructional Conference on Machine Learning (iCML 2003). December 3-8, 2003.
[60] Silva, Mrs. C. da, European eConstruction Software Implementation Toolset,
(Workshop on eConstruction N066 DRAFT CWA5), Delft, Netherlands,
2004.
[61] Simmetrics SourceForge Webpage http://sourceforge.net/projects/simmetrics/
[62] Sintek, Michael; Buitelaar, Paul and Olejnik, Daniel, A Formalization of
Ontology Learning from Text, in: Proceedings of EON2004 DFKI GmbH,
Kaiserslautern, Saarbruecken, 2004
[63] Sizov, Sergej; Graupmann, Jens and Theobald, Martin, From Focused Crawling
to Expert Information: an Application Framework for Web Exploration and
Portal Generation, Proceedings of the 29th International Conference on Very
Large Data Bases (VLDB-03), Berlin, 2003.
[64] Stamatakis, K.; Karkaletsis, V.; Paliouras, G.; Horlock, J.; Grover, C.; Curran, J.R. and Dingare, S, Domain-Specific Web Site Identification: The
CROSSMARC Focused Web Crawler, Proceedings of the Second International Workshop on Web Document Analysis (WDA), Edinburgh, UK, 2003.
[65] Steels, Luc, The Origins of Ontologies and Communication Conventions in
Multi-Agent Systems, Sony Computer Science Laboratory Paris and Artificial
Intelligence Laboratory Vrije Universiteit Brussel, November 19, 1997
[66] Specia, L., Motta, E., A hybrid approach for extracting semantic relations
from texts, 2nd Workshop on Ontology Learning and Population (OLP2) at
COLING/ACL 2006, pp. 57-64. July 22, Sydney, 2006.
[67] Su, Chang; Gao, Yang; Yang, Jianmei; Luo, Bin, An efficient Adaptive Focused Crawler Based on Ontology Learning, 5th International Conference on
Hybrid Intelligent Systems (HIS 2005), Rio de Janeiro, Brazil, 6-9 November
2005.
[68] Sundblad, H., Automatic Acquisition of Hyponyms and Meronyms from Question Corpora, in Proceedings of the Workshop on Natural Language Processing and Machine Learning for Ontology Engineering at ECAI'2002, Lyon,
France.
[69] Szulman, S, Biébow, B., Aussenac-Gilles, N, Vers un environnement intégré
pour la structuration de terminologies : TERMINAE LIPN - Université Paris
13, IRIT Toulouse, 2001.
[70] Szulman, S, Biébow, B., Aussenac-Gilles, N, Modelling the travelling domain
from an NLP description with TERMINAE, LIPN - Université Paris 13, IRIT
Toulouse, 2003.
[71] Uschold, M. and Gruninger, M., ONTOLOGIES: Principles, Methods and
Applications, Knowledge Engineering Review, Vol. 11, 2, 1996.
[72] Wang, J.Z. and Ali, Farha, An efficient ontology comparison tool for semantic Web applications, in: Web Intelligence, 2005. Proceedings. The 2005
IEEE/WIC/ACM International Conference on Web Intelligence, Clemson
Univ., CA, USA, 2005.
Chapter 7
Appendices
Appendix - An Implementation of OntoSpider
Several programming languages would qualify for implementations of OntoSpider,
like Java, Python and C. In this section, a possible Perl implementation of the
OntoSpider approach will be described very concisely. Perl is very suitable for
inter-process communication, string manipulations, etcetera. The focused crawler
focusbot.pl and the Literature Crawler litcrawl.pl are instances of the IR component
of this approach. Subroutines that the scripts have in common can be found in the
Perl module OntoSpider.pm.
Implementation of the script focusbot.pl
Focusbot.pl is an implementation of a general purpose centroid based focused crawler
that can make up the IR Component of OntoSpider. It is written in Perl and
makes use of the modules Digest::MD5, URI, LWP::UserAgent, LWP::MediaTypes
and WWW::RobotRules. The very first version of the crawler was written for an
assignment of a course in Information Retrieval. It runs on FreeBSD 5.x, but should
run on any Unix or Linux version with minimal modifications. The script identifies
itself to the web server as OntoSpider/version libwww-perl/version. It waits at least
10 seconds between page downloads (Perl sleep) to avoid hammering sites. The
robot and the IP address from which the crawls took place were registered on sites
that maintain robots databases, so that the identification can be easily checked by
webmasters whose sites are visited by it.
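A minimal sketch of how such a polite fetch can be set up with these modules is shown
below; the user agent string and URLs are placeholders, and this is not the actual
focusbot.pl code.

  #!/usr/bin/perl -w
  # Sketch: polite page retrieval with a custom agent string, robots.txt rules
  # and a sleep between downloads. Agent string and URLs are placeholders.
  use strict;
  use LWP::UserAgent;
  use WWW::RobotRules;

  my $agent = 'OntoSpider/0.1 libwww-perl';
  my $ua    = LWP::UserAgent->new(agent => $agent, timeout => 30);
  my $rules = WWW::RobotRules->new($agent);

  # Fetch and parse robots.txt for the site before crawling it
  my $robots_url = 'http://example.org/robots.txt';
  my $robots_res = $ua->get($robots_url);
  $rules->parse($robots_url, $robots_res->content) if $robots_res->is_success;

  for my $url ('http://example.org/a.html', 'http://example.org/b.html') {
      next unless $rules->allowed($url);      # respect robots.txt
      my $res = $ua->get($url);
      print "$url: ", ($res->is_success ? length($res->content) : 'failed'), "\n";
      sleep 10;                               # avoid hammering the site
  }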
The crawler was registered at the following Robots databases:
http://icehousedesigns.com/useragents/
http://www.jafsoft.com/searchengines/webbots.html
http://www.psychedelix.com/agents1.html and
http://www.robotstxt.org/wc/robots.html
Later Google searches for the keyword OntoSpider show web server logs of
remote sites that were visited by OntoSpider. Ideally, OntoSpider implementations
will eventually be so lightweight that they can be run on any machine with a
broadband connection like DSL. Even during testing, it could be useful to be able
to crawl from various IP addresses.
Implementation of the script litcrawl.pl
Like focusbot.pl, litcrawl.pl can make up the IR Component of OntoSpider. It
is an implementation of the Literature Crawler that was described earlier. The
crawler was written in a modular way, and testing various subroutines separately
was facilitated by using specific command line options with the Getopt Perl module.
In all cases, the option --debug (or -d for short) makes the script yield much
debug information, which can be used for testing and troubleshooting purposes.
One important problem that the Literature Crawler had to address is how to
adequately detect bibliographic references in a paper. Specific difficulties are the
fact that those references may be split over multiple lines and that they can be
in various formats. Initial implementations were very naive, with very simple Perl
regex matching.
Option --filename (or -f) allows one to specify one or more filenames of PDF
files that are on local disk. With this, the extraction of bibliographic references from
such a file can be tested and fine-tuned. If there are multiple filenames, they should
be separated by three hyphens ('---').
Google searches of litcrawl.pl were implemented with the Google API. For this
purpose, the script googly.pl was used from the book Google Hacks. A more integrated approach could be implemented at a later point. An account was registered
at Google to be able to use the Google API, and as Google imposes a maximum limit
on the number of lookups per day, the script itself also keeps track of this number,
and experiments were kept conservative.
As our aim is to find specialistic literature, the crawler should retrieve documents in text, PDF and PS format. Other formats could be interesting as well, and
it should not be difficult to extend the software so that it could deal with other types
like RTF documents. In order to avoid the retrieval of non-text pages, the module
LWP::MediaTypes was used, letting sub valid_mediatype() check for text/(anything)
or application/octet-stream before calling sub retrieve_page_and_extract_urls to
retrieve a page. I am sure that this can be improved upon; the module allows for the
addition of media types, etc. If a non-ASCII file is downloaded by accident, it will
be removed and a warning will be printed to STDERR. This can help in improving
the checks for plain text files. OntoSpider tries to behave by respecting robots.txt rules;
for that WWW::RobotRules was used.
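The kind of media type check described above could, in its simplest form, look like the
following sketch; the accept list and the helper name are illustrative, and this is not
the actual valid_mediatype() subroutine.

  #!/usr/bin/perl -w
  # Sketch: guess the media type of a URL by its suffix and only accept
  # text-like types or application/octet-stream (as PDF/PS sometimes arrive as).
  use strict;
  use LWP::MediaTypes qw(guess_media_type);

  sub looks_retrievable {
      my ($url) = @_;
      my $type = guess_media_type($url);
      return 1 if $type =~ m{^text/};
      return 1 if $type eq 'application/octet-stream';
      return 1 if $type eq 'application/pdf' || $type eq 'application/postscript';
      return 0;
  }

  for my $url ('http://example.org/paper.pdf',
               'http://example.org/page.html',
               'http://example.org/logo.png') {
      printf "%-35s %s\n", $url, looks_retrievable($url) ? 'retrieve' : 'skip';
  }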
Implementation of the module OntoSpider.pm
This subsection not only describes the contents of the Perl module OntoSpider.pm,
but also that of other implementation details that are common to the scripts focusbot.pl and litcrawl.pl.
For starters, it was decided to require some 'interesting' substrings in URLs; this
requirement is unimportant and may be dropped altogether once the mechanism
for detecting relevant URLs is improved on. As mentioned, the use of substrings
that are required in the URLs could be an easy way to avoid drifting and crawler
traps. This approach may be too detrimental to the retrieval of important on-topic
documents, which can be verified by experiments.
The queue of URLs that the crawler should visit is mainly based on similarity
with the centroid.
In order to increase the efficiency of the crawler, processing the same document
multiple times is avoided by taking the MD5 checksum of the documents that have
been retrieved. Identical documents cannot be detected this way in all cases.
The next step of detecting identical documents is to check whether the vector
representations of the given documents are almost identical. In a way, if the vector
representations are, say, 99.9% identical, they could be seen as identical fingerprints
of the documents they represent. Another way of making the crawler more
efficient is by keeping track of the results of previous crawls. Especially during testing,
if crawls are done based on the same seed URLs, and previous crawls were aborted
for some reason, it is important to reuse previous results.
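A sketch of both duplicate checks (exact duplicates via MD5 checksums, near-duplicates
via almost identical vector representations) might look like this; the 99.9% threshold
follows the text above, everything else is illustrative and not the OntoSpider.pm code.

  #!/usr/bin/perl -w
  # Sketch: skip documents whose MD5 checksum was seen before (exact duplicates),
  # and treat documents whose term vectors are almost identical (cosine > 0.999)
  # as duplicates as well. Thresholds and helper names are illustrative.
  use strict;
  use Digest::MD5 qw(md5_hex);

  my %seen_md5;
  my @seen_vectors;

  sub is_duplicate {
      my ($content, $vector) = @_;
      my $digest = md5_hex($content);
      return 1 if $seen_md5{$digest}++;              # byte-identical document
      for my $v (@seen_vectors) {
          return 1 if cosine($v, $vector) > 0.999;   # near-identical "fingerprint"
      }
      push @seen_vectors, $vector;
      return 0;
  }

  sub cosine {
      my ($x, $y) = @_;
      my ($dot, $nx, $ny) = (0, 0, 0);
      $dot += ($x->{$_} || 0) * $y->{$_} for keys %$y;
      $nx  += $_ ** 2 for values %$x;
      $ny  += $_ ** 2 for values %$y;
      return ($nx && $ny) ? $dot / (sqrt($nx) * sqrt($ny)) : 0;
  }

  my %v1 = (clitic => 3, spanish => 2);
  print is_duplicate("some text", \%v1) ? "dup\n" : "new\n";   # new
  print is_duplicate("some text", \%v1) ? "dup\n" : "new\n";   # dup (same MD5)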
Appendix - OntoSpider experiments
Very Early Experiments
Initial experiments with earlier versions of OntoSpider showed some clear problems
that needed to be solved:
Initially, the centroid was based on seed URLs without embedding in a set of
non-relevant URLs. The resulting centroids clearly contained some irrelevant elements
that one would like to exclude from the outset.
Experiments and results of the general purpose focused crawler
Experiments and results of the Literature Crawler
Finding seed documents
Finding suitable seed documents can be helped by running litcrawl.pl -b="search
string" -d. For example, with search string "clitics Spanish" this will yield the
following output at the time of writing:
Resulting URLs:
http://people.cohums.ohio-state.edu/grinstead11/delamora.pdf
http://individual.utoronto.ca/criscuer/ling/GenCliticsCuervo.pdf
And with search string "clitics Arabic":
Resulting URLs:
http://archimedes.fas.harvard.edu/mdh/arabic/NAACL.pdf
http://www.stanford.edu/~jurafsky/ArabicChunk.pdf
http://www.linguistics.ucla.edu/people/grads/jforeman/PreverbalSubjectsinMacuiltianguisZapot
http://www.unige.ch/lettres/linge/syntaxe/shlonsky/glow05/abstracts/-semitic
http://www.cs.um.edu.mt/~mros/WSL/papers/kamir:etal.pdf
Extracting bibliographic references for the queue
The extraction of bibliographic references from documents can be tested with local
documents using the option --filename (-f); sample output of the command
litcrawl.pl -f /tmp/biebow.pdf:
Extracted bibliography:
knowledge acquisition benefit from terminology ? In Proc. of the 9th Banff
Knowledge Ac-quisition for Knowledge-Based Systems Workshop, Banff, (1995)
domain ontology. In Proc. of EKAW’99, (1999)
l’acquisition des connaissances ‘a partir de textes. Th‘ese, EHESS Paris, (1994)
review. In Proc. of the 1997 AAAI Spring Symposium on Ontological Engineering, (1997)
In International Journal of Human-Computer Studies,43, (1995) 907-928
Criteria for Structuring Knowledge Bases. In Data and Knowledge Engineering, (1992)
conference on Formal Ontologies in Information Systems (FOIS’98), Trento,
Italy,(1998)
Document retrieval
The retrieval of documents by litcrawl.pl can be tested with the command line
option --retrieve (-r). If a URL is specified and the file does not exist locally yet,
it will be retrieved from the remote server. If it does exist locally, the script will
output "filename already present, no need to retrieve it". If a string is specified that
is not a URL, a Google search will be initiated with the Google API.
Sample invocation: litcrawl.pl -r=http://roa.rutgers.edu/files/537-0802/537-0802-PRINCE
The following command retrieved three documents which are all highly on-topic
given the subject:
litcrawl.pl -r "spanish clitics"
These were the documents that were retrieved at the time of writing:
GenCliticsCuervo.pdf from http://individual.utoronto.ca:80/criscuer/ling/GenCliticsCuervo.pdf;
this is the paper 'Spanish clitics: three of a perfect pair' by María Cristina Cuervo,
which does contain a bibliography itself.
0521571774ws.pdf from http://assets.cambridge.org:80/052157/1774/sample/0521571774ws.pdf;
this is the paper 'The Syntax of Spanish' by Karen Zagona, which does not contain
a bibliography.
Valeria Belloro 2005.pdf from http://formosan.sinica.edu.tw:80/RRG05/RRGPapers/Valeria
Generally, the filenames of the URLs that are shown with the command
litcrawl.pl -b "search string" can be retrieved with litcrawl.pl -r "search string".
Centroid Calculation
The command line option --centroid_from_docs (-c) will allow the user to create a
centroid from a corpus. The document set from which the centroid will be calculated
can be supplied on the command line; if there are multiple documents, they are
separated by three consecutive hyphens ('---'). At the time of writing, this
functionality is in the process of being implemented.
Sample invocation:
litcrawl.pl -c /tmp/biebow.pdf---/tmp/corpus.pdf -d
This will yield very much output, like the contents of the created bibliography, term
frequency and document frequency values, etc.
Appendix - OntoSpider Code
#!/usr/bin/perl -w
#
# $Id: litcrawl.pl,v 1.15 2007/10/02 07:06:09 carelf Exp carelf $
#
# Literature Crawler
#
# The first versions of this script are written to work on
# *nix systems like FreeBSD.
#
# Note that this is work in progress.
#
# Carel Fenijn, October 2007
#
use strict;
use Data::Dumper;
use Getopt::Long;
use Pod::Usage;
use OntoSpider;

$ENV{'PATH'} = "";

#
# Mainly Declarations
#
my $max_bibliographic_line_length = 400;
my $google_top_n = 5;            # top N hits of google will be used only
my $google_search_counter = 0;   # init
my $max_google_searches = 1000;  # Less or equal the Google imposed daily max
my $max_runs = 10;               # max amount of runs, may be set to a very high value
my $total_amount_of_docs;        # total amount of processed docs
my $total_amount_of_words;       # total amount of words of all processed docs
#
# %df_hash will contain df values of words, df is the amount of docs in
# which a word is found.
#
my %df_hash;
my $data_dir = "./data";
my $tmp_dir = "$data_dir/tmp";
my @centroid_vector_ary;         # represents centroid vector with weight values
my($centroid_words_ary_ref, $centroid_hash_ref, $centroid_vector_ary_ref, $doc_vector_hash_ref);
my $google_retrieve_command = "./googly.pl";
my $file_retrieve_command;
my $wget_command = "/usr/local/bin/wget";
my $fetch_command = "/usr/bin/fetch";
my $pdftohtml_command = "/usr/local/bin/pdftohtml";
my $english_stemmer_command = "./estemmer";
my $pdftohtml_command_options;

if($#ARGV == -1)
{
  pod2usage();
  exit;
}
my($help,$man,$testmode,$debugmode,$initial_centroid_mode,$seed_url,$input_filename,$bibliography,$centroid_from_docs,$retrieve_documents);
GetOptions(
  "help|?"                => \$help,
  "man"                   => \$man,
  "testmode"              => \$testmode,
  "filename=s"            => \$input_filename,
  "bibliography=s"        => \$bibliography,
  "centroid_from_docs=s"  => \$centroid_from_docs,
  "debugmode"             => \$debugmode,
  "retrieve=s"            => \$retrieve_documents,
  "initial_centroid_mode" => \$initial_centroid_mode,
  "seed_url=s"            => \$seed_url
);
print "Unprocessed by Getopt::Long\n" if $ARGV[0];
#
# Initially, @url_ary should contain URLs of highly on-topic seed documents.
# At a later point, it will contain other documents based on retrieved data.
# Please enter at least one URL of a PDF paper with English content here
# or leave @url_ary empty if all seed URLs are supplied on the command line:
#
#my @url_ary = (
#                'http://roa.rutgers.edu/files/537-0802/537-0802-PRINCE-0-0.PDF'
#
#                # bird flu / avian influenza
#
#                'www.aameda.org/MemberServices/Exec/Articles/fall05/Avian_Flu_in_Humans.pdf'
#              );
my @url_ary;
if(defined($seed_url))
{
  if($seed_url =~ /\,/)
  {
    @url_ary = split(/\,/,$seed_url);
  }
  else
  {
    @url_ary = ($seed_url);
  }
}
if($#url_ary != -1)
{
  OntoSpider::print_debug($debugmode,"Seed URLs: @url_ary\n");
}
else
{
  OntoSpider::print_debug($debugmode,"No seed URLs specified\n");
}
if(@url_ary == -1)
{
  print STDERR "Error: No seed URLs supplied, exiting...\n";
  print_verbose("You may supply seed URLs on the command line or in the\n");
  print_verbose("script\n");
  exit;
}
my $previous_string_representation;
my $avg_doclen;   # average document length, used for calculations
#
# Main Program
#
check_availability_of_resources();
process_command_line_options();
my $bibliography_ary_ref;
my $i = 0;   # while loop counter
my $stopword_ary_ref = OntoSpider::create_stopword_ary();
#
# Main while loop
#
while(1)
{
  if($i > $max_runs)
  {
    print("Maximum amount of runs \($max_runs\) reached, exiting...\n");
    exit;
  }
  if($i > 0)   # seed documents have already been processed, create new @url_ary
  {
    @url_ary = @{bibliography_to_url_ary($bibliography_ary_ref)};
  }
  my($doc_ary_ref) = retrieve_docs(\@url_ary);   # get docs from remote servers
  if($testmode)
  {
    print_data_str_to_file(\@url_ary,"$tmp_dir/url_ary_contents\_$i");
    print_data_str_to_file($doc_ary_ref,"$tmp_dir/doc_ary_contents\_$i");
  }
  #
  # Extract data from docs
  #
  if($i == 0)
  {
    #
    # First run: We are processing seed documents, so we build the
    # initial centroid at this point
    #
    ($bibliography_ary_ref,$doc_vector_hash_ref,$avg_doclen) = process_docs($doc_ary_ref);
    ($centroid_words_ary_ref, $centroid_hash_ref, $centroid_vector_ary_ref) = OntoSpider::determine_ce
  }
  if($initial_centroid_mode)
  {
    my $str_repr = Dumper($centroid_words_ary_ref);
    OntoSpider::print_debug($debugmode,"begin centroid_words_ary\n");
    OntoSpider::print_debug($debugmode,"$str_repr\n");
    OntoSpider::print_debug($debugmode,"end centroid_words_ary\n");
    $str_repr = Dumper($centroid_hash_ref);
    OntoSpider::print_debug($debugmode,"begin centroid_hash\n");
    OntoSpider::print_debug($debugmode,"$str_repr\n");
    OntoSpider::print_debug($debugmode,"end centroid_hash\n");
    exit;
  }
  if(defined($avg_doclen))
  {
    OntoSpider::print_debug($debugmode,"avg_doclen: $avg_doclen\n");
  }
  else
  {
    OntoSpider::print_debug($debugmode,"avg_doclen not defined\n");
  }
  OntoSpider::press_enter_to_continue();
  if($testmode)
  {
    print_data_str_to_file($bibliography_ary_ref,"$tmp_dir/bibliography_ary_contents\_$i");
    print_data_str_to_file($doc_vector_hash_ref,"$tmp_dir/doc_vector_hash_contents\_$i");
  }
  $i++;
}
#
# Subroutines
#
sub check_availability_of_resources
#
# Check the availability of resources:
# Create directories if necessary, check that commands that the
# program uses are available, etc.
#
{
  foreach my $dir ($data_dir,$tmp_dir)
  {
    if(! -d "$dir")
    {
      OntoSpider::print_debug($debugmode,"Creating $dir\n");
      mkdir $dir,0755;
    }
  }
  if(-x $wget_command)
  {
    $file_retrieve_command = $wget_command;
  }
  elsif(-x $fetch_command)
  {
    $file_retrieve_command = $fetch_command;
  }
  else
  {
    die "FATAL: Not an executable: \'$wget_command\' or \'$fetch_command\'\n";
  }
  foreach my $command (
                        $google_retrieve_command,
                        $pdftohtml_command
                      )
  {
    if(! -x $command)
    {
      die "FATAL: Not an executable: \'$command\'\n";
    }
  }
}
sub process docs
#
# Process documents that have been retrieved, PDF documents will first
# be converted to HTML
#
# Input: Reference to array with documents
# Output: Two references and a real:
#         Reference to array with bibliographic entries, based on google searches
#         Reference to hash with vector representations of documents
#         Real that represents the average doc length
#
{
my($doc ary ref) = $ [0];
my(@retrieved doc ary) = @{$doc ary ref};
my(@bibliography ary);
my(%doc vector hash);
my(%tf hash); # contains term frequencies in a doc
my $doc str; # string representation of a document
my $total doclen = 0; # total length of all documents together; init
my $total amount of docs = 0; # total amount of documents; init
DOCUMENT: foreach my $doc (@retrieved doc ary)
{
my @doc vector ary = (0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0);
OntoSpider:: print debug($debugmode,”Processing doc $doc\n”);
OntoSpider:: print debug($debugmode,”Converting doc to html. . .\n”);
system(”$pdftohtml command $pdftohtml command options $doc”);
my $html file = $doc;
$html file = ∼s/\.pdf/\.html/i;
if(−f ”$html file”)
{
$total amount of docs++;
local ∗HTMLF;
open(HTMLF, ”$html file”) ||
die ”FATAL: Could not open $html file for reading: $!”;
while(my $l = < HTMLF > )
{
$doc str. = $l;
if(probably a bibliographic reference($l))
{
OntoSpider:: print debug($debugmode,”Probably a bibliographic reference: $l\n”);
$l = ∼s/\ < br\ > //g;
push(@bibliography ary,$l);
print(”.”);
}
else
{
OntoSpider:: print debug($debugmode,”Probably NOT a bibliographic reference: $l\n”);
# init
}
}
close(HTMLF);
#
# At this point, $doc str contains the contents of a HTML document
#
# First we determine the docleng (document length) value
#
my @doclen ary = split(/\s+/,$doc str);
OntoSpider:: print debug($debugmode,”doclen ary: @doclen ary\n”);
$total doclen + = $#doclen ary;
# Now we’ll convert it into a flat ascii document
#
$doc str = OntoSpider:: normalize words($doc str);
OntoSpider:: print debug($debugmode,”doc str: \’$doc str\’\n”);
$doc str = OntoSpider:: remove stopwords($doc str,$stopword ary ref);
$doc str = OntoSpider:: standardize string($doc str);
OntoSpider:: print debug($debugmode,”doc str after stopword removal: \’$doc str\’\n”);
my $stop output file = ”$tmp dir/doc file.$∧T\.stop”;
local ∗DOCF;
open(DOCF,” > $stop output file”) ||
die ”FATAL: Could not open $stop output file for overwriting: $!”;
print DOCF ”$doc str”;
close(DOCF);
my $stem output file = ”$stop output file\.stem”;
system(”$english stemmer command $stop output file \ > $stem output file”);
sleep(2);
my @words ary;
local ∗STEMF;
open(STEMF,”$stem output file”) ||
die ”FATAL: Could not open $stem output file for reading: $!”;
while(my $l = < STEMF > )
{
my @found words ary = split(/\s+/,$l);
@words ary = (@words ary,@found words ary);
}
close(STEMF);
$total amount of words + = $#words ary;
foreach my $word (@words ary)
{
$tf hash{$word}++;
# one more time $word occurs in this doc
OntoSpider:: print debug($debugmode,”tf hash $word : $tf hash{$word}\n”);
}
foreach my $word (keys(%tf hash))
{
$df hash{$word}++;
# one more doc in which $word occurs
OntoSpider:: print debug($debugmode,”df hash $word : $df hash{$word}\n”);
}
$doc vector hash{$doc} = \@doc vector ary;
}
else
{
print(”Skipping $doc as it has apparently not been converted to HTML\n”);
next DOCUMENT;
}
}
print(”total doclen: \’$total doclen\’\n”);
print(”total amount of docs: \’$total amount of docs\’\n”);
my $avg doclen = $total doclen / $total amount of docs;
return(\@bibliography ary,\%doc vector hash,$avg doclen);
}
sub probably_a_bibliographic_reference
#
# Detect bibliographic references
#
# First arg: line from a paper
# Returns 1 if the line is probably a bibliographic reference,
#         0 otherwise
#
# Note that the algorithm should be refined.
# One problem is, that a bibliographic reference may be split up
# over multiple lines.
#
{
  my $line = $_[0];
  my $title_candidate;
  my $year_candidate;
  my $author_candidate_substr;
  #
  # Figure out the title of the paper
  #
  if($line =~ /\<i\>(.*)\<\/i\>/ &&
     length($line) <= $max_bibliographic_line_length)
  {
    $title_candidate = $1;
  }
  elsif($line =~ /((\w+\,\s+\w+[\.|\,]){2,})/ &&
        length($line) <= $max_bibliographic_line_length)
  {
    $title_candidate = $1;
  }
  #
  # Figure out the year of publication
  #
  if($line =~ /\((\d{4})\)/)
  {
    $year_candidate = $1;
  }
  if($line =~ /^(\S+)\s/)
  {
    $author_candidate_substr = $1;
    $author_candidate_substr =~ s/\,$//;
  }
  if($year_candidate)
  {
    #
    # At this point, it is very likely that we've found a bibliographic
    # entry
    #
    return(1);
  }
  return(0);
}
sub bibliography to url ary
#
# Input: Reference to array with bibliographic entries (strings)
# Output: Reference to array with URLs of these bibliographic entries,
#
based on google searches
#
# Note: This approach is very crude and can be refined, if a google
# search for a bibliographic entry does not yield a URL of a PDF or
# PS file, it will simply be omitted.
#
{
my $bibliography ary ref = $ [0];
my @url ary;
BIBLIO: foreach my $bibliographic entry (@{$bibliography ary ref})
{
if($google search counter > $max google searches)
{
print(”WARNING: maximum google searches of $max google searches reached\n”);
last BIBLIO;
}
$bibliographic entry = standardize bibliographic entry($bibliographic entry);
OntoSpider:: print debug($debugmode,”About to invoke $google retrieve command \’$bibliographic entry\
if($testmode)
{
print test(”Not invoking $google retrieve command \’$bibliographic entry\’ in testmode\n”);
return;
}
my $google hit counter = 0;
local ∗GOOGLEF;
open(GOOGLEF,”$google retrieve command \’$bibliographic entry\’ |”) ||
die ”FATAL: Could not pipe $google retrieve command \’$bibliographic entry\’: $!”;
GOOGLEHIT: while( < GOOGLEF > )
{
OntoSpider:: print debug($debugmode,”$ ”);
next if /∧\s+$/;
if(/http.∗pdf/i)
{
OntoSpider:: print debug($debugmode,”Google hit: $ \n”);
push(@url ary,$ );
$google hit counter++;
if($google hit counter > $google top n)
{
OntoSpider:: print debug($debugmode,”Max. google hits $google top n reached\n”);
last GOOGLEHIT;
}
}
else
{
OntoSpider:: print debug($debugmode,”No PDF file found.\n”);
}
}
close(GOOGLEF);
$google search counter++;
}
return(\@url ary);
}
sub standardize_bibliographic_entry
#
# Standardize a bibliographic entry
#
# Input: 'raw' bibliographic entry that was found in a document
# Output: standardized bibliographic entry, without commas, markup, etc.
#
{
  my($bibliographic_entry) = $_[0];
  $bibliographic_entry =~ s/\.//g;
  $bibliographic_entry =~ s/\,//g;
  $bibliographic_entry =~ s/\<[^\>]+\>//g;
  return($bibliographic_entry);
}
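#
# Print a message with a TESTMODE prefix, but only when --testmode is active
#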
sub print_test
{
  if($testmode)
  {
    print("TESTMODE \=\> @_");
  }
}
sub print_data_str_to_file
#
# Write Data::Dumper dump of data structure to file
#
{
  my($data_str_ref,$fname) = @_;
  my $string_representation = Dumper($data_str_ref);
  if($string_representation ne $previous_string_representation &&
     length(scalar($string_representation)) > 30)
  {
    print("Saving data structure contents to $fname\n");
    local *OUTPUTF;
    open(OUTPUTF,"> $fname") ||
      die "Could not open $fname for overwriting: $!";
    print OUTPUTF "$string_representation\n";
    close(OUTPUTF);
    $previous_string_representation = $string_representation;   # for the next run
  }
}
sub process_command_line_options
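#
# Dispatch on the command line options parsed by Getopt::Long:
# -h, -m, -f, -c, -b and -r perform their task and exit,
# -t and -d only set test and debug behaviour and fall through
# to the main crawl loop.
#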
{
  if($help)
  {
    pod2usage();
    exit;
  }
  if($man)
  {
    pod2usage({-verbose => 2, -output => \*STDOUT});
    exit;
  }
  if($testmode)
  {
    print_test("Running in test mode\n");
    $pdftohtml_command_options = '-i -noframes';
  }
  else
  {
    $pdftohtml_command_options = '-i -noframes -q';
  }
  if($input_filename)
  {
    OntoSpider::print_debug($debugmode,"Input filename: $input_filename\n");
    my @doc_ary = ($input_filename);
    my($bibliography_ary_ref,$doc_vector_hash_ref,$avg_doclen) = process_docs(\@doc_ary);
    my @bibliography_ary = @{$bibliography_ary_ref};
    print("Extracted bibliography:\n");
    foreach my $bibliography (@bibliography_ary)
    {
      print("$bibliography\n");
    }
    exit;
  }
  if($centroid_from_docs)
  {
    #
    # The command line option centroid_from_docs simulates the creation of
    # the initial centroid from the docs that are specified.
    #
    my(@doc_ary);
    if($centroid_from_docs =~ /\-\-\-/)
    {
      @doc_ary = split(/\-\-\-/,$centroid_from_docs);
    }
    else
    {
      @doc_ary = ($centroid_from_docs);
    }
    my $total_amount_of_docs = $#doc_ary+1;
    my($bibliography_ary_ref,$doc_vector_hash_ref,$avg_doclen) = process_docs(\@doc_ary);
    OntoSpider::print_debug($debugmode,"bibliography_ary: @{$bibliography_ary_ref}\n");
    my($centroid_words_ary_ref, $centroid_hash_ref, $centroid_vector_ary_ref) = OntoSpider::determine
    exit;
  }
  if($bibliography)
  {
    OntoSpider::print_debug($debugmode,"Bibliography: $bibliography\n");
    my @bibliography_ary;
    if($bibliography =~ /\-\-\-/)
    {
      @bibliography_ary = split(/\-\-\-/,$bibliography);
    }
    else
    {
      @bibliography_ary = ($bibliography);
    }
    @url_ary = @{bibliography_to_url_ary(\@bibliography_ary)};
    print("Resulting URLs:\n");
    foreach my $url (@url_ary)
    {
      print("$url\n");
    }
    exit;
  }
  if($retrieve_documents)
  {
    OntoSpider::print_debug($debugmode,"Running in retrieve mode\n");
    OntoSpider::print_debug($debugmode,"URLs: $retrieve_documents\n");
    my(@url_ary);
    if($retrieve_documents =~ /\-\-\-/)
    {
      @url_ary = split(/\-\-\-/,$retrieve_documents);
    }
    else
    {
      @url_ary = ($retrieve_documents);
    }
    my($doc_ary_ref) = retrieve_docs(\@url_ary);   # get docs from remote servers
    print("Retrieved documents:\n");
    foreach my $doc (@{$doc_ary_ref})
    {
      print("$doc\n");
    }
    exit;
  }
  if($debugmode)
  {
    OntoSpider::print_debug($debugmode,"Running in debug mode\n");
  }
}
sub retrieve_docs
#
# Retrieve documents from remote servers if necessary
#
# Input: Reference to array with URLs of documents, e.g. seed documents
# Output: Reference to array with filenames of retrieved documents
#
{
  my $url_ary_ref = $_[0];
  my @url_ary = @{$url_ary_ref};
  my @doc_ary;
  foreach my $url (@url_ary)
  {
    $url =~ s/\s+$//g;
    OntoSpider::print_debug($debugmode,"URL: \'$url\'\n");
    if($url =~ /http:\/\/.*\/([^\/]+\.pdf$)/i)
    {
      my $pdf_fname = $1;
      if(-f "$data_dir/$pdf_fname")
      {
        print("$data_dir/$pdf_fname already present, no need to retrieve it\n");
        push(@doc_ary,"$data_dir\/$pdf_fname");
      }
      else
      {
        my $previous_dir = `pwd`;
        chomp($previous_dir);
        chdir($data_dir);
        system("$file_retrieve_command $url");
        chdir($previous_dir);
        if(-f "$data_dir/$pdf_fname")
        {
          push(@doc_ary,"$data_dir\/$pdf_fname");
        }
        else
        {
          print STDERR "Warning: Could not retrieve $pdf_fname \(from $url\)\n";
        }
      }
    }
    else
    {
      my $google_hit_counter = 0;
      OntoSpider::print_debug($debugmode,"Invoking $google_retrieve_command \'$url\'");
      local *GOOGLEF;
      open(GOOGLEF,"$google_retrieve_command \'$url\' |") ||
        die "FATAL: Could not pipe $google_retrieve_command \'$url\': $!";
      GOOGLELOOP: while(<GOOGLEF>)
      {
        if(/(http.*pdf)/i)
        {
          my $found_url = $1;
          OntoSpider::print_debug($debugmode,"Google hit: $1\n");
          push(@url_ary, $found_url);
          $google_hit_counter++;
          last GOOGLELOOP if $google_hit_counter >= $google_top_n;
        }
      }
      close(GOOGLEF);
    }
  }
  return(\@doc_ary);
}
#
# POD (embedded documentation)
#
NAME
    litcrawl.pl -- Literature Crawler

SYNOPSIS
    litcrawl.pl [options] [file ...]

OPTIONS
    -b, --bibliography          specify bibliographic strings from which a URL queue must be created
    -c, --centroid_from_docs    build centroid from docs
    -d, --debug                 run in debug mode with much output
    -f=<filename>, --file=<filename>
                                specify input filename
    -h, --help                  help message
    -i, --interactive           run in interactive mode
    -m, --man                   show extensive help, like a manpage
    -r=<URL>, --retrieve=<URL>  retrieve docs from remote servers
    -t, --testmode              run in testmode, do not connect to remote servers
    -s=<URL>, --seedurl=<URL>   specify URL for seed document

DESCRIPTION
    litcrawl.pl, Literature Crawler written for a Masters Thesis

PREREQUISITES
    This program requires the following non-standard modules:
    • Pod::Usage

EXAMPLES
    litcrawl.pl --help
    litcrawl.pl --bibliography="PDF paper circumfix"

AUTHOR
    Carel Fenijn <[email protected]>
#!/usr/bin/perl -w
#
# $Id: focusbot.pl,v 1.2 2007/10/02 07:05:56 carelf Exp carelf $
#
# Focused Crawler that was originally written for assignment 4
# of the course Information Retrieval 2002/2003 UVA and later adapted
# for a thesis.
#
# Note that this is work in progress.
#
# Carel Fenijn, October 2007
#
# Some General Notes
#
# This code is based on code that was written for earlier assignments for
# this course, so there is quite some overlap.
#
# The following assumptions were made:
#
# - All retrieved docs are in English, and stop word removal and stemming
#   is based on that
#
# Note 0: If a comment is marked with [lwp], it means it was copy&pasted
# from the libwww-perl manpage(s)
#
# Note 1: If the comments of a subroutine mention an Input, this will be
# @_; if they mention an Output, this refers to the return value(s).
#
use strict;
use Digest::MD5 qw(md5_base64);
use URI;
use LWP::UserAgent;
use LWP::MediaTypes qw(guess_media_type);
require WWW::RobotRules;
select(STDOUT);
$| = 1;   # Unbuffer STDOUT
#
# Mainly Declarations
#
my $testmode = 0;
my $max amount of retrieved pages = 2000000000;
my $max amount of retrieved bytes = 1048576000000;
my $page download delay = 10; # Amount of seconds to wait between page downloads
my $data dir = ”.”;
my $base download dir = ”$data dir/downloads/$∧T”;
my $url to fname mappings file = ”$base download dir/url to fname mappings”;
my $url rankings file = ”$base download dir/url rankings file”;
my $focusbot flagfile = ”/tmp/focusbot.flag”;
my $centroid file = ”$base download dir/centroid data”;
my $long fname suffix = ”focusbotfname”;
my $max fname length = 20; # not so important value, avoid very long fnames
my $long subdir suffix = ”focusbotsubdir”;
my $max subdir length = 20; # not so important value, avoid very long subdirs
my $english stopword file = ”$data dir/english stopwords”;
my $english stemmer command = ”$data dir/estemmer”;
#
# Some datastructures that will be used for the centroid
#
my @centroid words ary; # contains all words of centroid in sorted order
my %centroid words hash; # hash of all centroid words
my @centroid vector ary; # represents centroid vector with weight values
my $amount of pages after which to recalculate centroid = 15;
#
# We let URLs inherit the cosine similarity values of their parents,
# but do want to downplay this a bit, for starters, subtract a small value.
#
my $sim_value_inheritance_downplay_factor = .1;
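#
# Example: a link found in a page with cosine similarity 0.42 enters the
# queue with an inherited value of 0.32 (0.42 - 0.1), unless its parent is
# a seed page, in which case it gets the fixed value 0.9.
#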
#
# For starters, we manually define @seed url ary, the seed document
# set that we will start the crawl with. This can become a set that the
# user supplies manually or the top N documents to some relevant query
# in Google.
#
my @seed url ary = (
’http://www.yourdictionary.com/morph.html’,
’http://www.facstaff.bucknell.edu/rbeard/’
);
#
# To make the crawls more restrictive, we require certain substrings
# in URLs, this restriction can be dropped without any problem.
#
my $required_url_substr = qq!morph|lingui|synta|semant|phon|edu|uni|sci|dict|lex|word!;
my $client_id = 'focusbot/$Revision: 1.2 $';         # Use RCS version, auto-updated
$client_id =~ s/\$\s*Revision\s*\:\s*(\S+)\s*\$/$1/; # Only use bare RCS rev nr
my $amount_of_retrieved_pages = 0;   # init
my $amount_of_retrieved_bytes = 0;   # init
my $avg_doclen = 0;                  # init
my $total_amount_of_docs = 0;        # init
my $total_amount_of_words = 0;       # init
#
# @q ary is the Queue of URLs that must be retrieved. It will NOT
# be like a FIFO stack (breadth−first) or LIFO stack (depth−first),
# but ordering will be adjusted on the fly based on cosine similarity
# values to get a focused crawl.
#
my @q ary = @seed url ary; # initially, only the seed URLs will be visited
my @uri start ary = (’A HREF’,’FRAME SRC’);
#
# %sim value hash and %sim value url hash will contain cosine
# similarity values of documents
#
my %sim value hash;
my %sim value url hash;
#
# %df hash will contain df values of words
#
my %df hash;
#
# %processed urls hash URLs that have been processed as keys
#
my %processed urls hash;
# init
#
# %docid hash keeps track of DOC IDs
#
# init
my $docid counter = 0;
my %docid hash;
# init
#
# %md5 hash records MD5 checksums of downloaded content.
#
my %md5 hash;
#
# Main Program
#
my @stopword_ary = @{&create_stopword_ary};
&print_test("Using base URLs: \'@seed_url_ary\'\n");
&print_test("Identify to the webserver as: \'$client_id\'\n");
&print_test("Base download dir: \'$base_download_dir\'\n");
if(! -d "$base_download_dir")
{
  mkdir($base_download_dir, 0755) ||
    die "FATAL: Could not create $base_download_dir: $!";
}
local *MAPF;
open(MAPF,"> $url_to_fname_mappings_file") ||
  die "Could not open $url_to_fname_mappings_file for overwriting: $!";
&determine centroid(’initial calculation’);
while($amount of retrieved pages < $max amount of retrieved pages &&
$amount of retrieved bytes < $max amount of retrieved bytes)
{
last if $#q ary == −1;
# Finish when the queue is empty
my $url = shift(@q ary);
# Not FIFO, for @q ary is adjusted
next if $processed urls hash{$url};
if($amount of retrieved pages % $amount of pages after which to recalculate centroid == 0)
{
&determine centroid(’recalculation’);
}
&print test(”Processing next url from queue: \’$url\’\n”);
if($url !∼ /∧http:\/\//i)
{
&print test(”Skipping $url, not starting with http:\/\/\n”);
$processed urls hash{$url} = 1;
next;
}
my($robotsrules) = &get robots rules(”$url/robots.txt”,$client id);
if($robotsrules→allowed($url))
{
if(&valid mediatype($url))
{
if(&retrieve page and extract urls($url))
{
$processed urls hash{$url} = 1;
}
else
{
print STDERR ”WARNING: Did not retrieve $url or not an ASCII file\n”;
next;
}
}
else
{
&print test(”Not a valid mediatype of $url\n”);
$processed urls hash{$url} = 1;
next;
}
}
else
{
&print test(”RobotRules disallow accessing URL $url\n”);
$processed urls hash{$url} = 1;
next;
}
my($amount of urls in queue);
if($testmode)
{
$amount of urls in queue = $#q ary;
}
&print test(”$amount of urls in queue URLs currently in the queue\n”);
@q ary = @{&recalculate q ary(\@q ary)};
&print test(”Sleeping $page download delay seconds to avoid hammering the site. . .\n”);
print(”Note that you can abort the crawl by touching /tmp/focusbot.flag\n”);
print(”You could press CTRL Z, then enter: touch /tmp/focusbot.flag \; fg\n”);
sleep($page download delay);
if(−f ”$focusbot flagfile”)
{
unlink($focusbot flagfile);
last;
}
print(”.”) unless $testmode;
}
close(MAPF);
my $amount_of_seconds_used = time-$^T;
print <<"FINALOUTPUT";
Finished!
Amount of retrieved pages: $amount_of_retrieved_pages
Amount of retrieved bytes: $amount_of_retrieved_bytes
Amount of seconds used: $amount_of_seconds_used
Downloads can be found in this dir: $base_download_dir
URL to Filename mappings can be found in the following file:
$url_to_fname_mappings_file
Ranking of the URLs can be found in this file:
$url_rankings_file
FINALOUTPUT
#
# Subroutines
#
sub retrieve_page_and_extract_urls
#
# Input: First arg: standardized URL
#        Second arg (optional): 'centroid_relevant' or 'centroid_nonrelevant'
# Output: 1 upon success
#         0 otherwise
#
# Side-effect(s):
#        Retrieve page and store this on disk
#        Make @q_ary grow if new URLs are detected, but not if
#        second arg eq 'centroid_nonrelevant'
#
{
my $url = $ [0];
my $centroid relevant mode = 0;
my $centroid nonrelevant mode = 0;
my @doc vector ary;
my @total words ary;
if($ [1] eq ’centroid relevant’)
{
$centroid relevant mode = 1;
}
elsif($ [1] eq ’centroid nonrelevant’)
{
$centroid nonrelevant mode = 1;
}
my $initial working dir = ‘pwd‘;
chomp($initial working dir);
&print test(”Trying to derive data from url \’$url\’. . .\n”);
#
# Create a user agent object [lwp]
#
my $ua = LWP:: UserAgent→new;
$ua→agent(”$client id ”);
#
# Create a request [lwp]
#
my $req = HTTP:: Request→new(GET ⇒ ”$url”);
#
# Pass request to the user agent and get a response back [lwp]
#
my $res = $ua→request($req);
#
# Check the outcome of the response [lwp]
#
if($res→is success)
{
$amount of retrieved pages++;
my $page content = $res→content;
my($fname,$subdir) = &url2fname($page content);
if($fname eq ””)
{
&print test(”Skipping url $url, probably known MD5 checksum\n”);
return(0);
}
my $output file = ”$subdir/$fname”;
&print test(”Subdir: \’$subdir\’\n”);
&print test(”Output File: \’$output file\’\n”);
my $stopped page content = &remove stopwords($page content);
local ∗OUTF;
local ∗STOPOUTF;
my $stop output file = ”$output file\.stop”;
if(!(open(OUTF,” > $output file”)))
{
print STDERR ”WARNING: Could not open $output file for overwriting: $!”;
chdir($initial working dir);
return(0);
}
if(!(open(STOPOUTF,” > $stop output file”)))
{
print STDERR ”WARNING: Could not open $stop output file for overwriting: $!”;
chdir($initial working dir);
return(0);
}
print OUTF ”$page content”;
close(OUTF);
print STOPOUTF ”$stopped page content”;
close(STOPOUTF);
if(! −T $output file)
{
print STDERR ”Oops, accidentally downloaded non−ASCII file\n”;
if($output file = ∼/$base download dir/) # double check
{
if(unlink($output file) &&
unlink($stop output file))
{
&print test(”Unlinked $output file and $stop output file\n”);
}
else
{
print STDERR ”Could not unlink $output file or $stop output file: $!”;
}
}
chdir($initial working dir);
return(0);
}
my $stem output file = ”$stop output file\.stem”;
system(”$english stemmer command $stop output file \ > $stem output file”);
if(!($centroid nonrelevant mode))
{
local ∗STEMF;
open(STEMF,”$stem output file”) ||
die ”FATAL: Could not open $stem output file for reading: $!”;
while(my $l = < STEMF > )
{
$l = &normalize words($l);
my(@words ary) = split(/\s+/,$l);
if($centroid relevant mode)
{
@centroid words ary = (@centroid words ary,@words ary);
}
else
{
@total words ary = (@total words ary,@words ary);
}
}
close(STEMF);
}
my $sim value = 1;
if($centroid relevant mode)
{
$sim value url hash{$url} = .9;
}
else
{
my $doc vector ary ref = &words ary2vector ary(\@total words ary,\@centroid words ary);
$sim value = &sim(\@centroid vector ary,$doc vector ary ref);
if($sim value == 0)
{
&print test(”Skipping document with cosine similarity value of 0\n”);
return(0);
}
$sim value hash{$sim value} = $url;
$sim value url hash{$url} = $sim value;
}
if(!($centroid nonrelevant mode))
{
print MAPF ”$url\:$fname\n”;
}
my($dev,$ino,$mode,$nlink,$uid,$gid,$rdev,$size,
$atime,$mtime,$ctime,$blksize,$blocks) = stat($output file);
$amount of retrieved bytes + = $size;
if(!($centroid nonrelevant mode))
{
while($page content = ∼/\ < a\s+href\ = \"([∧\"]+)\"/i)
{
my $detected url = URI→new abs($1,$res→base); # absolutize URLs
$page content = ∼s/\ < a\s+href\ = \”([∧\”]+)\”//i;
&print test(”detected url before standardization: \’$detected url\’\n”);
$detected url = &standardize url($detected url);
&print test(”detected url after standardization: \’$detected url\’\n”);
if($url !∼ /$required url substr/)
{
&print test(”Skipping $url, $required url substr is not substr\n”);
}
elsif($url = ∼/∧http:\/\//)
{
push @q ary, $detected url;
#
# Note: at this point, the detected URL inherits the
# cosine similarity value of the page in which it was found,
# as initial value wich can be adjusted later on!
#
if($centroid relevant mode) # exception for the seed URL set
{
$sim value url hash{$detected url} = .9;
}
else
{
$sim value url hash{$detected url} = ($sim value url hash{$url}−$sim value inheritance downplay factor);
if($sim value url hash{$detected url} < 0)
{
$sim value url hash{$detected url} = .00001;
}
}
}
else
{
&print test(”Not adding \’$detected url\’ to queue, does not start with http:\/\/\n”);
}
}
}
}
else
{
print STDERR ”Apparently I did not succeed in gathering data from $url\n”;
return(0);
}
chdir($initial working dir);
return(1);
}
sub print test
{
if($testmode)
{
print(”TESTMODE\ > $ [0]”);
}
}
sub clean up url
#
# Clean a URL up, e.g. remove trailing double quotes, whitespace
#
{
my $url = $ [0];
$url = ∼s/\s+$//;
$url = ∼s/∧\s+//;
$url = ∼s/∧\”//;
$url = ∼s/\”$//;
return($url);
}
sub standardize url
#
# Input: URL
# Output: URL in standardized format, if it is relative,
#
it will become an absolute URL.
#
{
my $url = $ [0];
$url = &clean up url($url);
return($url);
}
sub valid mediatype
#
# Return 1 if MediaType is octet/stream or text/∗
#
0 otherwise
#
{
my $url = $ [0];
my $guessed content type = guess media type($url);
&print test(”Guessed MediaType: $guessed content type\n”);
if($guessed content type = ∼/∧text\// ||
$guessed content type = ∼/∧application\/octet−stream$/i)
{
return(1);
}
return(0);
}
sub url2fname
#
# Input: string with content of retrieved page
# Output: filename of the downloaded page if the MD5 checksum is ’new’,
#
emtpy string otherwise
#
{
my $str = $ [0];
my $fname;
my $subdir;
my $digest = md5 base64($str);
if(defined($md5 hash{$digest}))
{
&print test(”Known MD5 checksum\n”);
return(””,””);
}
else
{
$md5 hash{$digest} = 1;
}
while($docid hash{$docid counter})
{
$docid counter++;
}
$docid hash{$docid counter} = 1;
$fname = $docid counter;
if($fname = ∼/(.)(.)(.)$/)
{
$subdir = ”$base download dir/$3/$2/$1”;
}
elsif($fname = ∼/(.)(.)$/)
{
$subdir = ”$base download dir/0/$2/$1”;
}
elsif($fname = ∼/(.)$/)
{
$subdir = ”$base download dir/0/0/$1”;
}
if(system(”mkdir −p $subdir”) 6= 0)
{
print STDERR ”WARNING: Could not create subdir: \’$subdir\’\n”;
}
else
{
&print test(”$subdir created\n”);
}
return($fname,$subdir);
}
sub get robots rules
#
# Note: much of this code is copy&pasted from the
# WWW:: RobotRules manpage(s) and slightly adapted.
#
{
my($url,$client id) = @ ;
my($robotsrules) = new WWW:: RobotRules ”$client id”;
use LWP:: Simple qw(get);
my($robots txt) = get($url);
$robotsrules→parse($url,$robots txt);
return($robotsrules);
}
sub remove stopwords
#
# This subroutine will remove stopwords
#
# First arg: string from which stopwords should be removed
# Returns string without the stop words
#
{
my $str = $ [0];
foreach my $stopword (@stopword ary)
{
while($str = ∼/\b$stopword\b/i)
{
$str = ∼s/\b$stopword\b/ /gi;
}
}
return($str);
}
sub create stopword ary
#
# Returns a reference to an array with stop words
#
{
my @stopword ary;
local ∗STOPWORDF;
open(STOPWORDF,”$english stopword file”) ||
die ”FATAL: Could not open $english stopword file for reading: $!”;
while(my $l = < STOPWORDF > )
{
$l = ∼s/\|.∗//;
# strip comments
$l = ∼s/\s+$//;
next if $l = ∼/∧\s∗$/;
# skip lines with only whitespace or comments
push @stopword ary, $l;
}
close(STOPWORDF);
return(\@stopword ary);
}
sub words_ary2vector_ary
#
# First arg: reference to array with words
# Second arg: reference to @centroid_words_ary
#
# Returns: reference to @query_vector_ary
#
{
  my @words_ary = @{$_[0]};
  my @centroid_words_ary = @{$_[1]};
  my $doclen = $#words_ary;
  my @vector_ary;
  my %words_hash;
  my %tf_hash;
  $total_amount_of_docs++;
  $total_amount_of_words += $doclen;
  $avg_doclen = $total_amount_of_words/$total_amount_of_docs;
  foreach my $word (@words_ary)
  {
    $tf_hash{$word}++;
  }
  foreach my $word (keys(%tf_hash))
  {
    $df_hash{$word}++;
  }
  for(my $i = 0;$i <= $#centroid_words_ary;$i++)
  {
    my $word = $centroid_words_ary[$i];
    my $tf;
    if(defined($tf_hash{$word}))
    {
      $tf = $tf_hash{$word};
    }
    else
    {
      $tf = 0;
    }
    #
    # Robertson/Okapi TF: nice-ir0203-week02-2.pdf p. 85ff
    #
    my $okapi_tf = $tf/($tf+.5+(1.5*($doclen/$avg_doclen)));
    #
    # IDF Karen Sparck Jones 1972 nice-ir0203-week02-2.pdf p. 89ff
    #
    my $df = 0;
    if($df_hash{$word})
    {
      $df = $df_hash{$word};
    }
    if($df == 0)
    {
      $vector_ary[$i] = 0;
    }
    else
    {
      my $idf = 1+log($total_amount_of_docs/$df);
      my $tf_idf_weight = $okapi_tf*$idf;
      $vector_ary[$i] = $tf_idf_weight;
    }
  }
  &print_test("vector_ary: @vector_ary\n");
  return(\@vector_ary);
}
sub normalize words
#
# For starters a simple approach: remove probable html
# tags and then all non−word chars
#
{
my $str = $ [0];
$str = ∼s/\ < [∧\ > ]+\ > //g;
$str = ∼s/\W/ /g;
$str = ∼s/\b\d+\b/ /g;
$str = ∼s/ / /g;
return($str);
}
sub sim
#
# Cosine Similarity between a centroid vector and a document vector
# From college slides nice-ir0203-week02-2.pdf p. 136
#
# First arg: reference to centroid vector array
# Second arg: reference to document vector array of one document
#
# Returns: Cosine Similarity
#
{
  my $query_vector_ary_ref = $_[0];
  my $doc_vector_ary_ref = $_[1];
  my @query_vector_ary = @{$query_vector_ary_ref};
  my @doc_vector_ary = @{$doc_vector_ary_ref};
  my $numerator = 0;     # init
  my $denominator = 0;   # init - we'll avoid dividing by zero, of course
  #
  # The Cosine Similarity formula is a fraction
  # We calculate the numerator first
  #
  for(my $i = 0;$i <= $#query_vector_ary;$i++)
  {
    $numerator += ($query_vector_ary[$i]*$doc_vector_ary[$i]);
  }
  #
  # The denominator in the Cosine Similarity formula is a product
  # of which we first calculate the left product term and then the
  # right product term, they could be dealt with in the same induction
  # on the length of the vector as both lengths are equal anyway, but
  # I think the code is clearer if they are calculated separately.
  #
  my $left_product_term = 0;    # init
  my $right_product_term = 0;   # init
  for(my $i = 0;$i <= $#query_vector_ary;$i++)
  {
    $left_product_term += (($query_vector_ary[$i])**2);
    $right_product_term += (($doc_vector_ary[$i])**2);
  }
  $left_product_term = sqrt($left_product_term);
  $right_product_term = sqrt($right_product_term);
  $denominator = $left_product_term*$right_product_term;
  my $cosine_similarity;
  if($denominator != 0)
  {
    $cosine_similarity = $numerator/$denominator;
  }
  else
  {
    $cosine_similarity = 0;
  }
  return($cosine_similarity);
}
sub determine centroid
#
# Determine the centroid based on a set of relevant URLs: @seed url ary
#
# First argument: either str. ’initial calculation’ or ’recalculation’
# If the first arg. is ’initial calculation’, all values of the centroid
# vector will be set to 1.
# if the first arg. is ’recalculation’, actual TF.IDF values will be used.
# In the latter case, please make sure that there is enough data for
# sensible TF.IDF values.
#
{
my $mode = $ [0];
foreach my $seed url (@seed url ary)
{
&print test(”Processing seed URL $seed url\n”);
my($robotsrules) = &get robots rules(”$seed url/robots.txt”,$client id);
if($robotsrules→allowed($seed url))
{
if(&valid mediatype($seed url))
{
if(&retrieve page and extract urls($seed url,’centroid relevant’))
{
$processed urls hash{$seed url} = 1;
}
else
{
print STDERR ”WARNING: Did not retrieve $seed url or not an ASCII file\n”;
}
}
else
{
&print test(”Not a valid mediatype of $seed url\n”);
$processed urls hash{$seed url} = 1;
}
}
else
{
&print test(”RobotRules disallow accessing URL $seed url\n”);
$processed urls hash{$seed url} = 1;
}
}
my %tf hash;
foreach my $word (@centroid words ary)
{
$tf hash{$word}++;
}
my $doclen = $#centroid words ary;
foreach my $word (@centroid words ary)
{
if($mode = ∼/initial calculation/i)
{
$centroid words hash{$word} = 1;
}
elsif($mode = ∼/recalculation/i)
{
my $tf;
if(defined($tf hash{$word}))
{
$tf = $tf hash{$word};
}
else
{
$tf = 0;
}
#
# Robertson/Okapi TF: nice−ir0203−week02−2.pdf p. 85ff
#
my $okapi tf = $tf/($tf+.5+(1.5∗($doclen/$avg doclen)));
#
# IDF Karen Sparck Jones 1972 nice−ir0203−week02−2.pdf p. 89ff
#
my $df = 0;
if($df hash{$word})
{
$df = $df hash{$word};
}
if($df == 0)
{
print STDERR ”WARNING: centroid $df should not be zero\n”;
print STDERR ”Setting centroid weight for word to 1\n”;
$centroid words hash{$word} = 1;
}
else
{
my $idf = 1+log($total amount of docs/$df);
my $tf idf weight = $okapi tf∗$idf;
$centroid words hash{$word} = $tf idf weight;
}
}
else
{
print STDERR ”WARNING: mode should be either initial calculation or recalculation, assuming initial calculation\n”;
$centroid words hash{$word} = 1;
next;
}
}
@centroid words ary = ();
foreach my $word (sort keys(%centroid words hash))
{
push @centroid words ary,$word;
push @centroid vector ary,$centroid words hash{$word};
}
}
sub recalculate_q_ary
#
# Here we reorder @q_ary based on cosine similarity values.
# In fact, we "throw away" the old @q_ary and replace it by
# one with reverse rankings of cosine similarity values.
#
# Side effect: the ranked URLs will be stored in $url_rankings_file
#
{
  my @q_ary;
  local *RANKF;
  open(RANKF,"> $url_rankings_file") ||
    die "FATAL: Could not open $url_rankings_file for overwriting: $!";
  foreach my $url (sort { $sim_value_url_hash{$b} <=> $sim_value_url_hash{$a} } keys %sim_value_url_hash)
  {
    push(@q_ary,$url);
    print RANKF "$url $sim_value_url_hash{$url}\n";
  }
  close(RANKF);
  return(\@q_ary);
}
#
# End Of Script
#
#
# $Id: OntoSpider.pm,v 1.9 2007/10/02 07:06:20 carelf Exp carelf $
#
# General perl module for OntoSpider that contains code that can be
# used by both the Literature Crawler and the General Purpose Focused
# Crawler.
#
# Note that this is work in progress.
#
# Carel Fenijn, October 2007
#
package OntoSpider;
use strict;
use Exporter ();
#
# Mainly Declarations
#
use vars qw(@ISA @EXPORT @EXPORT_OK $VERSION);
@ISA = qw(Exporter);
@EXPORT_OK = qw($verboseornoverbose $debug);
$VERSION = "1.0";
my $english_stopword_file = "english_stopwords";
my $english_verb_file = "english_verbs_list.txt";
my $english_noun_file = "english_nouns_list.txt";
my $english_adjectives_and_adverbials_file = "english_adjectives_or_adverbials_list.txt";
my $debugmode;
my $testmode = 1;
END { }
#
# Subroutines
#
sub normalize_words
#
# For starters a simple approach: remove probable html
# tags and then all non-word chars from a string
#
{
  my $str = $_[0];
  print_test("str: \'$str\'\n");
  $str =~ s/\<[^\>]+\>//g;
  $str =~ s/\W/ /g;
  $str =~ s/\b\d+\b/ /g;
  $str =~ s/ / /g;
  return($str);
}
sub remove stopwords
#
# This subroutine will remove stopwords
#
# First arg: string from which stopwords should be removed
# Second arg: reference to array with stopwords
#
If the second arg is the string ’default’ or there is
#
no second arg, the default stopword list (array) will be used.
# Returns string without the stop words
#
{
my $str = $ [0];
my $stopword ary ref;
if($ [1] eq ’default’ || !($ [1]))
{
$stopword ary ref = create stopword ary();
}
else
{
$stopword ary ref = $ [1];
}
foreach my $stopword (@{$stopword ary ref})
{
while($str = ∼/\b$stopword\b/i)
{
$str = ∼s/\b$stopword\b/ /gi;
}
}
return($str);
}
sub press_enter_to_continue
{
  print("Press ENTER to continue...\n");
  <STDIN>;
}
sub standardize_string
#
# Standardize a string, i.e. remove spurious whitespace and
# single characters
#
# First arg: string that must be standardized
# Returns standardized string
#
{
  my $str = $_[0];
  while($str =~ /\s\S\s/)
  {
    $str =~ s/\s\S\s/ /g;   # remove single chars
  }
  $str =~ s/\s+/ /g;        # remove spurious whitespace
  return($str);
}
sub create stopword ary
#
# Returns a reference to an array with stop words
#
{
my @stopword ary;
local ∗STOPWORDF;
open(STOPWORDF,”$english stopword file”) ||
die ”FATAL: Could not open $english stopword file for reading: $!”;
while(my $l = < STOPWORDF > )
{
$l = ∼s/\|.∗//;
# strip comments
$l = ∼s/\s+$//;
next if $l = ∼/∧\s∗$/;
# skip lines with only whitespace or comments
push @stopword ary, $l;
}
close(STOPWORDF);
return(\@stopword ary);
}
sub determine_centroid
#
# Determine the centroid based on an array with words from documents
#
# INPUT
#
# First argument: either str. 'initial calculation' or 'recalculation'
# If the first arg. is 'initial calculation', all values of the centroid
# vector will be set to 1.
# If the first arg. is 'recalculation', actual TF.IDF values will be used.
# In the latter case, please make sure that there is enough data for
# sensible TF.IDF values.
# Second argument: reference to @doc_ary, which contains the words of
# documents from which the centroid will be created.
# Third argument: average doclength so far (non-zero real)
#
# OUTPUT
#
# Returns: references to the following data structures:
#
#          @centroid_words_ary
#          %centroid_hash
#          @centroid_vector_ary
#
# Example: @centroid_words_ary = ('doctor','physician','illness','melon');
#          @centroid_vector_ary = (.9,.9,.8,0);
#          %centroid_hash = (
#                             'doctor'    => .9,
#                             'physician' => .9,
#                             'illness'   => .8,
#                             'melon'     => 0
#                           );
#
{
my $mode = $ [0];
my @centroid words ary = @{$ [1]};
my $avg doclen = $ [2];
my $total amount of docs = $ [3];
my $df hash ref = $ [4];
my %centroid hash;
my %tf hash;
my @centroid vector ary;
foreach my $word (@centroid words ary)
{
$tf hash{$word}++;
print debug($debugmode,”adding \’$word\’ to tf hash\n”);
}
my $doclen = $#centroid words ary;
print test(”doclen: \’$doclen\’\n”);
foreach my $word (@centroid words ary)
{
if($mode = ∼/initial calculation/i)
{
$centroid hash{$word} = 1;
print test(”adding \’$word\’ to centroid hash\n”);
}
elsif($mode = ∼/recalculation/i)
{
my $tf;
if(defined($tf hash{$word}))
{
$tf = $tf hash{$word};
}
else
{
$tf = 0;
}
print test(”recalculation of centroid, tf: \’$tf\’\n”);
#
# Robertson/Okapi TF: nice−ir0203−week02−2.pdf p. 85ff
#
my $okapi tf = $tf/($tf+.5+(1.5∗($doclen/$avg doclen)));
print test(”recalculation of centroid, okapi tf: \’$okapi tf\’\n”);
#
# IDF Karen Sparck Jones 1972 nice−ir0203−week02−2.pdf p. 89ff
#
my $df = 0;
if(${$df hash ref}{$word})
{
$df = ${$df hash ref}{$word};
print test(”df: \’$df\’ based on df hash \’$word\’\n”);
}
if($df == 0)
{
print STDERR ”WARNING: centroid $df should not be zero\n”;
print STDERR ”Setting centroid weight for $word to 1\n”;
$centroid hash{$word} = 1;
}
else
{
my $idf = 1+log($total amount of docs/$df);
my $tf idf weight = $okapi tf∗$idf;
$centroid hash{$word} = $tf idf weight;
print test(”idf: \’$idf\’\n”);
print test(”tf idf weight: \’$tf idf weight\’\n”);
print test(”centroid hash $word becomes \’$tf idf weight\’\n”);
}
}
    else
    {
      print STDERR "WARNING: mode should be either initial calculation or recalculation, assuming initial calculation\n";
      $centroid_hash{$word} = 1;
      print_test("initial calculation assumed, centroid_hash \'$word\' becomes 1\n");
      next;
    }
}
@centroid words ary = ();
# reset
foreach my $word (sort keys(%centroid hash))
{
push @centroid words ary,$word;
print test(”\’$word\’ added to centroid words ary\n”);
push @centroid vector ary,$centroid hash{$word};
print test(”\’$centroid hash{$word}\’ added to centroid vector ary\n”);
}
return(\@centroid words ary, \%centroid hash, \@centroid vector ary);
}
sub google hits list to url ary
#
# Input: filename of HTML file that contains URLs
# Output: reference to an array with URLs that have been extracted
#
{
my $google hits file = $ [0];
print test(”google hits file: $google hits file\n”);
if(!(−T $google hits file))
{
print STDERR ”Error: Not an ASCII file: \”$google hits file\”\n”;
exit;
}
local ∗GOOGLEHITF;
open(GOOGLEHITF,”$google hits file”) ||
die ”Error: could not open \’$google hits file\’ for reading: $!”;
while( < GOOGLEHITF > )
{
my $url;
if(/\&q\ = (http:\/\/\S+\.pdf)/)
{
$url = $1;
}
if($url)
{
print test(”URL: \’$url\’\n”);
}
}
close(GOOGLEHITF);
my @output ary;
return(\@output ary);
}
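#
# Print a debug message with a DEBUGMODE prefix when the first argument
# (the debug flag) is true; all arguments are passed through to print.
#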
sub print_debug
{
  my $debugmode = $_[0];
  if($debugmode)
  {
    print("DEBUGMODE\> @_");
  }
}
sub print_test
{
  if($testmode)
  {
    print("TESTMODE\> @_");
  }
}
sub extract_rdf_triplets
#
# This subroutine will attempt to extract RDF triplets from
# a flat ASCII text file.
# The RDF triplets will be represented as a hash, %rdf_triplet_hash,
# in which the keys are the RDF relations and the values are the
# objects between which the relations hold.
#
# TODO: Try to find third party software that does this job.
#
# First argument: the filename of the document with its full path
# Returns: a reference to %rdf_triplet_hash
#
{
  my $input_document = $_[0];
  local *INPUTF;
  my(%rdf_triplet_hash);
  if(! -T $input_document)
  {
    print("Error: \'$input_document\' should be the filename including path of a plain ASCII text file\n");
    exit;
  }
  my %english_verbs_hash = %{create_english_verbs_hash()};
  my %english_nouns_hash = %{create_english_nouns_hash()};
  my %english_ad_hash = %{create_english_ad_hash()};
  open(INPUTF,"$input_document") ||
    die "Error: Fatal: Could not open $input_document for reading: $!";
  while(my $l = <INPUTF>)
  {
    next if $l =~ /^\s*\#|\;/;   # allow for comments
    $l = remove_stopwords($l,'default');
    print_test("$l");
    my @words_ary = split(/\s+/,$l);
    foreach my $word (@words_ary)
    {
      $word =~ s/\,|\.$//;   # remove trailing interpunction signs
      next if $word !~ /[a-z][a-z]/i;
      if($english_verbs_hash{$word})
      {
        print("$word = VERB ");
      }
      elsif($english_nouns_hash{$word})
      {
        print("$word = NOUN ");
      }
      elsif($english_ad_hash{$word})
      {
        print("$word = ADJ_OR_ADV ");
      }
      else
      {
        if($word =~ /s$/)
        {
          $word =~ s/s$//;
          if($english_verbs_hash{$word})
          {
            print("$word = VERB ");
          }
          elsif($english_nouns_hash{$word})
          {
            print("$word = NOUN ");
          }
          else
          {
            print("$word = UNKNOWN_CATEGORY ");
          }
        }
        else
        {
          print("$word = UNKNOWN_CATEGORY ");
        }
      }
    }
  }
  close(INPUTF);
  return(\%rdf_triplet_hash);
}
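#
# Read $english_verb_file (lines starting with '#' or containing ';' are
# treated as comments) and return a reference to a hash with each verb
# as a key.
#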
sub create_english_verbs_hash
{
  my(%english_verbs_hash);
  local *VERBF;
  open(VERBF,"$english_verb_file") ||
    die "Error: Fatal: Could not open $english_verb_file for reading: $!";
  while(my $l = <VERBF>)
  {
    next if $l =~ /^\s*\#|\;/;   # allow for comments
    $l =~ s/\s+$//g;
    if($l =~ /(\S+)/)
    {
      $english_verbs_hash{$l} = 1;
    }
  }
  close(VERBF);
  return(\%english_verbs_hash);
}
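#
# Read $english_noun_file (same format as the verb list) and return a
# reference to a hash with each noun as a key.
#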
sub create_english_nouns_hash
{
  my(%english_nouns_hash);
  local *NOUNF;
  open(NOUNF,"$english_noun_file") ||
    die "Error: Fatal: Could not open $english_noun_file for reading: $!";
  while(my $l = <NOUNF>)
  {
    next if $l =~ /^\s*\#|\;/;   # allow for comments
    $l =~ s/\s+$//g;
    if($l =~ /(\S+)/)
    {
      $english_nouns_hash{$l} = 1;
    }
  }
  close(NOUNF);
  return(\%english_nouns_hash);
}
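#
# Read $english_adjectives_and_adverbials_file (same format as the verb
# list) and return a reference to a hash with each adjective or adverbial
# as a key.
#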
sub create_english_ad_hash
{
  my(%english_ad_hash);
  local *ADF;
  open(ADF,"$english_adjectives_and_adverbials_file") ||
    die "Error: Fatal: Could not open $english_adjectives_and_adverbials_file for reading: $!";
  while(my $l = <ADF>)
  {
    next if $l =~ /^\s*\#|\;/;   # allow for comments
    $l =~ s/\s+$//g;
    if($l =~ /(\S+)/)
    {
      $english_ad_hash{$l} = 1;
    }
  }
  close(ADF);
  return(\%english_ad_hash);
}
1;