ISSN: 2277-3754
ISO 9001:2008 Certified
International Journal of Engineering and Innovative Technology (IJEIT)
Volume 3, Issue 5, November 2013
A Methodological Approach for Converting Thesaurus to Domain Ontology: Application to Tourism
Mouhim Sanaa, Tatane Khalid, Cherkaoui Chihab, Douzi Hassan, Mammas Driss
Abstract— The use of ontologies in various applications, such
as information retrieval, indexing and annotation of resources,
and knowledge representation, has proved to be very advantageous.
However, the process of building ontologies poses several research
problems in the absence of a standardized methodology supported
by the scientific community. This paper proposes an approach for
converting a thesaurus into a domain ontology. Specifically, it
converts terms into concepts and thesaurus relations into
semantic relations.
Index Terms—thesaurus, ontology, term, concept, hierarchy,
generalization, association.
I. INTRODUCTION
A thesaurus is a controlled and structured vocabulary in
which concepts are represented by terms, organized so that
the relations between concepts are explicit and the preferred
terms are accompanied by entries for their synonyms or
quasi-synonyms. It is a documentary language used especially
for indexing and searching information in fields such as
medicine, tourism, and education. The terms are related to
each other by relations of synonymy, hierarchy and
association. The construction and development of thesauri
follow guidelines and standards such as ISO 2788 and ANSI Z39,
which are conventions for the content, the display, the
methods of construction, and the maintenance of monolingual
or multilingual thesauri. Reusing thesauri is very promising,
since they were designed by experts and required heavy
modeling effort. They have the advantage of using terms
identified as relevant in the field. However, the
non-standardized formats of thesauri (database, HTML, ASCII
file) have opened a new field of research concerned with
migrating thesauri to W3C formats such as RDF (Resource
Description Framework) and OWL (Web Ontology Language). In
this article, we propose an approach to convert a thesaurus
into a domain ontology. We focus on exploiting the relations
of association, equivalence and hierarchy between terms to
create the concepts and semantic relations of an ontology.
The article is structured as follows: the first part deals
with the main differences between thesauri and ontologies.
The second part presents the most important works on
converting thesauri to ontologies. The third part presents
the proposed approach for converting a thesaurus into a
domain ontology. In the fourth part, we apply the proposed
approach to build an ontology of tourism.
II. THESAURI AND ONTOLOGIES: LOCATING THE DIFFERENCES
To convert thesaurus formats into semantic web languages,
the differences between thesauri and ontologies should be
taken into account. At the level of knowledge representation,
thesauri suffer from a lack of conceptual abstraction
compared to ontologies [1]. In particular, a thesaurus does
not distinguish between terms and concepts. Furthermore, the
terms are grouped into a mono- or multi-hierarchical structure
and linked by ambiguous relations. These semantic links may
reflect the intended use of the thesaurus rather than real
semantic relations between terms [2]. They are mainly
relations of association and hierarchy. Such links cannot
guide inference or even suggest a query expansion to support
the process of information retrieval.
III. THESAURUS VS ONTOLOGY: BACKGROUND
Creating an ontology from a thesaurus has been the aim of
many interesting works. Noy transformed the NCI (National
Cancer Institute) thesaurus into OWL-DL and addressed the
problem of knowledge representation [3]. Wielinga proposes an
approach to transform the "Art and Architecture Thesaurus"
(AAT) into an ontology for indexing images. This approach is
entirely manual. The ontology, formalized in RDFS, is built in
two steps: the identification of concepts and the connection
of properties to specific subsets of the AAT [4]. Hepp
describes the GenTax methodology for constructing ontologies
in OWL and RDF-S (RDF-Schema) from hierarchical
classifications, thesauri and taxonomies. This approach allows
the semi-automatic creation of meaningful ontology classes for
a particular context while preserving the original hierarchy.
The basic idea of the GenTax methodology is to derive two
ontology classes from each category: a generic concept in a
given context and a broader taxonomic concept that preserves
the original hierarchy [5]. Chrisment proposes a method to
transform a thesaurus designed under the ISO 2788 and ANSI Z39
standards into a lightweight ontology for indexing a corpus.
The first step of the proposed methodology is to extract a set
of thesaurus concepts and their lexical variations.
The second step structures the ontology concepts by detecting
their taxonomic relations and associations in the thesaurus
and in the corpus [6]. All these works converge on the
usefulness of converting thesauri to semantic web languages.
In accordance with [6], the first step in building an ontology
from a thesaurus is to build semantic classes from the terms,
taking into account the lexicalization of concepts. However,
the lexicalization of a concept is restricted to the terms
present in the thesaurus. Moreover, all of these studies use
the hierarchical relations between terms to structure the
concepts into classes and subclasses. However, this relation
may cover other semantic links such as "instance" and
"part of", and few studies have taken this distinction into
account. The novelty of the process we propose for building an
ontology from a thesaurus lies in a method for constructing
concepts from the thesaurus terms. We also propose enriching
the lexicon by using lexical databases and dictionaries, so
that concepts are extended with the lexicalizations made
explicit both in the thesaurus and in these databases. In
addition, we propose a mechanism for making the hierarchical
relation explicit that disambiguates the "partOf" and
"instance" relations.
CONCEPTUALIZATION OF THE THESAURUS
This step transforms the thesaurus lexicon into an ontology.
First, concepts are created using the equivalence relation
between terms. Then these concepts are organized using both
the hierarchical and association relations.
A. From terms to concepts
The first step in the transformation of a thesaurus is the
passage from terms to concepts. To build a concept, we propose
to associate with each thesaurus term a set of synonyms,
lexical variations, and a definition in natural language.
Applying this rule provides, from the descriptors and the
equivalence relation between descriptors and non-descriptors,
a mechanism for grouping concepts and their lexical
variations. The grouping of terms is achieved through the
relations EMP (use) and EP (used for) between descriptors and
non-descriptors. The identifier of the concept is the name of
the descriptor. The non-descriptors are introduced as labels
of the concept (Fig 1).
Fig 1. Passage from terms to concepts using the relation of
equivalence
[Figure: Term 1 EP Term 2 and Term 2 EMP Term 1 yield a
concept named Term1 with the label Term2.]
To give a definition to a concept, the NE (scope note) is used
if it exists. If not, an external dictionary is used. The list
of synonyms can be established by using the lexical database
Wolf [7]. Lexical variations of the term must also be taken
into consideration. From this list and the term itself, a
concept referring to a semantic class can be built. It has a
name and includes the term, its lexical variations and its
synonyms.
B. Structuring of concepts
After defining the concepts of the ontology, the second step
is to form the hierarchical structure of classes and
subclasses. In general, most studies transform the specific
and generic relations of the thesaurus into a relation of
hierarchy. However, these two relations should be used with
caution because they also cover the relations "isPartOf" and
"isInstanceOf" (Fig 2).
Fig 2. Possible interpretation of the relation "Generic and
specific" of the thesaurus
[Figure: Term1 TG Term2 / Term2 TS Term1 may denote a
generic/specific concept pair, an isPartOf link, or an
isInstanceOf link.]
To help the knowledge engineer differentiate between these
three cases, we suggest interpreting the relation of
generality and specificity as follows:
1) The relation of hierarchy identifies a link between a class
and its members (subclasses): To interpret a generality
relation as a hierarchy relation, the logical test "some"
and "all" between the two descriptors should be validated
(Fig 3).
Fig 3. Example of application of the logical test "some and
all" for the relation of hierarchy
[Figure: some succulents are cacti and all cacti are
succulents; some desert plants are cacti and some cacti are
desert plants.]
As shown in this example, all "cactus", by definition and
whatever the context, are "succulent". We can therefore deduce
that the class "cactus" is a subclass of "succulent". In the
second example, however, some members of "desert plant" are
called cactus, but it is also known that not every cactus is a
"desert plant". These two descriptors must then be assigned to
different classes and not to the same hierarchy.
2) The relation of hierarchy can be interpreted as an
instance relation when it identifies a link between a
general category and an individual instance expressed by a
proper name.
Example: Mountain TG Alep is interpreted as Alep is an
instance of Mountain
Differentiating between the three scenarios of the hierarchy
relation in a thesaurus not only ensures the appropriate
hierarchical structure of concepts into classes and
subclasses, but also helps manage the multi-hierarchy that can
be encountered in the thesaurus.
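As a rough illustration of the interpretation rules in Section B, the following Python sketch encodes the decision taken for a single TG link. The boolean arguments stand for judgments made by the knowledge engineer; the function name, the enum, and the order in which the two checks are applied are assumptions made for illustration, not part of the method as stated.

```python
from enum import Enum


class Interpretation(Enum):
    SUBCLASS = "subClassOf"
    INSTANCE = "instanceOf"
    # Covers isPartOf and any other non-hierarchical reading;
    # the final choice is left to the knowledge engineer.
    NOT_HIERARCHICAL = "notHierarchical"


def interpret_generic_relation(all_specific_are_generic: bool,
                               specific_is_proper_name: bool = False) -> Interpretation:
    """Classify one 'specific TG generic' link.

    all_specific_are_generic: outcome of the "some"/"all" test, i.e. whether
        ALL members of the specific term are members of the generic term.
    specific_is_proper_name: whether the specific term names an individual.
    """
    if specific_is_proper_name:
        return Interpretation.INSTANCE
    if all_specific_are_generic:
        return Interpretation.SUBCLASS
    return Interpretation.NOT_HIERARCHICAL


# Judgments taken from the examples in the text.
print(interpret_generic_relation(True))    # Cactus / Succulent
print(interpret_generic_relation(False))   # Cactus / Desert Plant
print(interpret_generic_relation(True, specific_is_proper_name=True))  # Alep / Mountain
```

The test itself is not automated here: deciding whether "all cacti are succulents" remains a human judgment, and the sketch only records its consequence.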
C. Management of multi-hierarchy
Multi-hierarchy is the situation in which a descriptor has
several generic terms at the same time. In this case, the type
of each relation is determined using the "some" and "all" test
explained above. Three cases can occur:
1) The "some" and "all" test is validated in both relations:
Example: Piano TG Percussion Instrument and Piano TG
Cord Instrument
In this example, the descriptor "Piano" has two generic
terms, "Percussion Instrument" and "Cord Instrument". The
test is validated in both relations, since all pianos are
both percussion instruments and cord instruments. In this
case, the descriptor is interpreted as an instance of both
generic terms: Piano is introduced into the ontology as an
instance of the two concepts "Cord Instrument" and
"Percussion Instrument". It is important to note, however,
that the two concepts should not be declared as disjoint.
2) The "some" and "all" test is not validated in either
relation:
Example: Biochemistry TG Biology and Biochemistry
TG Chemistry
In this example, the descriptor "Biochemistry" has two
generic terms, "Biology" and "Chemistry". As the test is
not validated, the two relations are interpreted as
"partOf" relations. In this case, a new concept
"Biochemistry" is built in a hierarchy separate from
"Biology" and "Chemistry", while keeping a "partOf" link
between it and the two concepts.
3) The "some" and "all" test is validated in one of the two
relations and not in the other:
Example: Skull TG Bones and Skull TG Head
In this example, the descriptor "Skull" has two generic
terms, "Bones" and "Head". The test is validated between
"Skull" and "Bones" but not between "Skull" and "Head". In
this case, the concept "Skull" is introduced as a subclass
of "Bones". The concept "Head" is introduced in a different
hierarchy, keeping a "partOf" link between the two classes.
D. Removing redundancy in hierarchical relations
When a generic relation is interpreted as a relation between
a class and its subclass, it is a transitive relation. It
allows the following type of inference: if A is a subclass of
B and B is a subclass of C, then A is a subclass of C. It is
therefore unnecessary to state the relation A TG C explicitly.
E. Exploitation of the association relation
This relation covers associations between descriptors that
are neither equivalent nor hierarchical. The semantic or
conceptual association between the terms is made explicit by
the relation RT (Related Term). In the thesaurus, this
relation is symmetrical.
The interpretation of associative relations is the most
difficult. This relation indicates the existence of a semantic
link between two classes, but the type or nature of the
relation is not defined. It may be a relation between a
subject and the objects or phenomena studied, or between a
discipline and its practitioners, etc.
Example: Mathematics RT Mathematician
It can also link concepts related by a causal dependency,
concepts and their unit or measuring mechanism, objects and
their properties, etc.
Example:
Mathematics RT Mathematician
Mourn RT Death
Temperature RT Thermometer
The exploitation of these semantic relations depends heavily
on the interpretation of the knowledge engineer and on the
requirements that the ontology must meet. When such a relation
is difficult to make explicit, other resources should be used.
This can be the second phase of the ontology-building process,
which extracts the explanation of these semantic relations
from text.
IV. EXPERIMENTATION
In this part we focus on the construction of a French tourism
ontology by implementing the various steps detailed in the
previous section. In the tourism sector, there are a number of
controlled vocabularies and thesauri. We cite in particular
the "Thesaurus of Tourism and Recreation" of the World Tourism
Organization (WTO). The development and implementation of the
latest version of this multilingual thesaurus took more than
20 years of work, consolidating the efforts of several
specialists and domain experts under the direction of the WTO
and the French tourism directorate. The WTO Thesaurus is
structured into 20 semantic classes. For our specific needs,
we are interested in the following semantic fields expressed
in the thesaurus.
All these semantic fields will form the super classes of our
tourism ontology (Fig 4).
Fig 4. Semantic fields of the tourism ontology
Fig 6. Hierarchical structuring of some concepts of the ontology
of tourism
Recall that one of the basic rules we have defined is to
take into consideration the lexicalizations of concepts and
their synonyms. In this phase, we exploit the equivalence
relation between descriptors and non-descriptors expressed in
the thesaurus.
Example: COTE EP: baie, détroit, front de mer, golfe,
péninsule
As shown in this example, the descriptor "Cote" is
connected to the non-descriptors "baie, détroit, front de mer,
golfe, péninsule". In this case, a concept named "Cote" is
built, and the associated non-descriptors are expressed in
Protege2000 [3] as labels of the concept (Fig 5).
Fig 5. Lexicalization of the concept "cote" in Protege2000
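The passage from terms to concepts illustrated by the COTE example can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the `Concept` class, the `build_concepts` function and the dictionary-based input format are hypothetical and do not correspond to any Protege2000 API; only the COTE data comes from the thesaurus example above.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Concept:
    identifier: str                                   # name of the descriptor
    labels: List[str] = field(default_factory=list)   # non-descriptors (EP)
    definition: Optional[str] = None                  # scope note, if any


def build_concepts(used_for: Dict[str, List[str]],
                   scope_notes: Optional[Dict[str, str]] = None) -> Dict[str, Concept]:
    """Create one concept per descriptor; its EP terms become labels."""
    scope_notes = scope_notes or {}
    concepts = {}
    for descriptor, non_descriptors in used_for.items():
        concepts[descriptor] = Concept(
            identifier=descriptor,
            labels=list(non_descriptors),
            definition=scope_notes.get(descriptor),
        )
    return concepts


# EP relation taken from the thesaurus example in the text.
thesaurus_ep = {"COTE": ["baie", "détroit", "front de mer", "golfe", "péninsule"]}
concepts = build_concepts(thesaurus_ep)
print(concepts["COTE"])
```

Synonyms drawn from an external lexical database could be appended to `labels` in the same way; that step is omitted here because it depends on the resource used.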
The constructed concepts are then structured, as described
above, by making the generic relation explicit.
1) The relation TG is interpreted as an instance relation: In
the thesaurus of the World Tourism Organization, out of
1300 generic relations we judged 175 to be instance
relations.
Example: CAMEROUN TG AFRIQUE CENTRALE
2) The relation TG is interpreted as a relation of hierarchy:
The "some" and "all" test we proposed allowed us to
interpret 950 of the 1300 generic relations as
hierarchical relations.
Example: MANIFESTATION COMMERCIALE TG
MANIFESTATION TOURISTIQUE
3) The relation TG is not interpreted as a relation of
hierarchy: The "some" and "all" test allowed us to
interpret 200 generic relations as not being hierarchy
relations.
Example: ActivitéSportive TG ArticleDeSport
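Once TG relations have been classified as hierarchical, the redundancy rule of Section D (an explicit A TG C is unnecessary when A TG B and B TG C hold) can be applied to them. The sketch below is one possible way to do this, assuming the subclass links form an acyclic graph; the function name and the pair-based representation are illustrative choices, and the example reuses the abstract terms A, B and C from Section D.

```python
from typing import Dict, Set, Tuple

Edge = Tuple[str, str]  # (subclass, superclass)


def remove_redundant_links(edges: Set[Edge]) -> Set[Edge]:
    """Drop every subclass link that is implied by a longer chain of links.

    Assumes the subclass graph has no cycles.
    """
    parents: Dict[str, Set[str]] = {}
    for sub, sup in edges:
        parents.setdefault(sub, set()).add(sup)

    def reachable_without(start: str, target: str, skipped: Edge) -> bool:
        # Depth-first search that ignores the direct link being examined.
        stack, seen = [start], set()
        while stack:
            node = stack.pop()
            for parent in parents.get(node, ()):
                if (node, parent) == skipped:
                    continue
                if parent == target:
                    return True
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return False

    return {(a, c) for (a, c) in edges if not reachable_without(a, c, (a, c))}


links = {("A", "B"), ("B", "C"), ("A", "C")}
reduced = remove_redundant_links(links)
print(sorted(reduced))
```

A reasoner would infer the dropped link again from the two remaining ones, which is why it does not need to be stated.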
Multi-hierarchy was found in country names (field 20): while
a country is part of a continent, it may also belong to one or
more political, economic, geographic, ethnic, religious or
linguistic subdivisions.
Example: CAMEROUN
TG: AFRIQUE CENTRALE
TG: AFRIQUE FRANCOPHONE
TG: PAYS ISLAMIQUE
TG: SAHEL
In this case, the "some" and "all" test is validated in the
four generality relations. Each relation is then interpreted,
as explained above, as an instance relation. Recall also that
the descriptors Central Africa, Francophone Africa, Islamic
Countries and Sahel must not be declared as disjoint classes.
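The three multi-hierarchy cases of Section C can be summarized as a small decision function. The sketch below follows the rules as stated in the text (all tests validated: instance of every generic term; otherwise subclass where the test is validated and "partOf" where it is not). The function name, the triple format and the relation labels are assumptions made for illustration, and the test outcomes are supplied by the knowledge engineer rather than computed.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]  # (descriptor, relation, generic term)


def interpret_multi_hierarchy(descriptor: str,
                              test_results: Dict[str, bool]) -> List[Triple]:
    """Interpret the TG links of a descriptor that has several generic terms.

    test_results maps each generic term to the outcome of the
    "some"/"all" test for that link.
    """
    if len(test_results) < 2:
        raise ValueError("multi-hierarchy requires at least two generic terms")
    if all(test_results.values()):
        # Case 1: the generic terms must not be declared disjoint.
        return [(descriptor, "instanceOf", generic) for generic in test_results]
    # Cases 2 and 3: subclass where validated, partOf elsewhere.
    return [(descriptor, "subClassOf" if validated else "partOf", generic)
            for generic, validated in test_results.items()]


cameroun = interpret_multi_hierarchy("CAMEROUN", {
    "AFRIQUE CENTRALE": True,
    "AFRIQUE FRANCOPHONE": True,
    "PAYS ISLAMIQUE": True,
    "SAHEL": True,
})
skull = interpret_multi_hierarchy("Skull", {"Bones": True, "Head": False})
biochemistry = interpret_multi_hierarchy("Biochemistry",
                                         {"Biology": False, "Chemistry": False})
for triple in cameroun + skull + biochemistry:
    print(triple)
```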
The association relation TA expresses the existence of a
semantic link between descriptors. In some cases, this
relation is easily interpreted by the knowledge engineer. It
is then transformed into an object property linking the two
descriptors.
Example:
ARCHITECTURE TA MONUMENT
MOUNTAIN TA SPORTING
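The handling of TA links can be sketched as follows: links for which the knowledge engineer has chosen a property name become object-property triples, and the others are set aside for later interpretation (for example from text). This is a hedged illustration only; the function name and the property name "hasArchitecture" are hypothetical and do not come from the resulting ontology.

```python
from typing import Dict, List, Tuple

Pair = Tuple[str, str]
Triple = Tuple[str, str, str]  # (domain concept, object property, range concept)


def associations_to_properties(pairs: List[Pair],
                               property_names: Dict[Pair, str]
                               ) -> Tuple[List[Triple], List[Pair]]:
    """Turn interpreted TA links into object properties.

    TA is symmetrical in the thesaurus, so a name may be given for either
    direction; the direction chosen by the engineer fixes domain and range.
    """
    properties: List[Triple] = []
    unresolved: List[Pair] = []
    for a, b in pairs:
        if (a, b) in property_names:
            properties.append((a, property_names[(a, b)], b))
        elif (b, a) in property_names:
            properties.append((b, property_names[(b, a)], a))
        else:
            unresolved.append((a, b))
    return properties, unresolved


ta_links = [("ARCHITECTURE", "MONUMENT"), ("MOUNTAIN", "SPORTING")]
# Hypothetical name chosen by the knowledge engineer for one link only.
names = {("MONUMENT", "ARCHITECTURE"): "hasArchitecture"}
properties, unresolved = associations_to_properties(ta_links, names)
print(properties)
print(unresolved)
```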
The resulting ontology is rich in terms and concepts.
With 1300 concepts and 225 lexicalizations, it provides
broad coverage of the tourism sector. However, the semantic
relations are still limited, and no object properties have
been defined yet.
V. CONCLUSION
In this paper, we proposed an approach for manually
converting a thesaurus into an ontology. We focused on the
passage from terms to concepts and on some issues of the
hierarchy and association relations. The novelty of our
approach lies first in a method for building concepts from
the thesaurus terms. We also propose enriching the lexicon
using lexical databases and dictionaries, so that concepts
are extended with the lexicalizations made explicit both in
the thesaurus and in these databases. In addition, we proposed
a mechanism for making the hierarchical relation explicit that
disambiguates "partOf" and "instanceOf". Through the "some"
and "all" test, we were able to differentiate between the
different interpretations of the generality relation. The
proposed method was applied to the field of tourism by
converting the WTO thesaurus into a tourism ontology. This
manual interpretation is important for obtaining a first
ontology, which will then be enriched from other text
resources with a semi-automatic approach based on NLP tools.
REFERENCES
[1] Soergel, D., B. Lauser, A. Liang, F. Fisseha, J. Keizer and
S. Katz. (2004). Reengineering Thesauri for New Applications:
the AGROVOC Example. Journal of Digital Information, Volume 4,
Issue 4, 2004.
[2] Fernandez, M., A. Gomez-Pérez and N. Juristo. (1997).
METHONTOLOGY: From Ontological Art towards Ontological
Engineering. Proceedings of the AAAI-97 Spring Symposium
Series on Ontological Engineering, Stanford, CA, USA, 1997.
[3] Noy, N., R.W. Fergerson and M.A. Musen. (2000). The
knowledge model of Protégé2000: combining interoperability
and flexibility. In Proceedings of the International
Conference on Knowledge Engineering and Knowledge Management
(EKAW'00), 2000.
[4] Wielinga, B., G. Schreiber, J. Wielemaker and J.A.C.
Sandberg. (2001). From thesaurus to ontology. International
Conference on Knowledge Capture, Victoria, Canada, October
2001.
[5] Hepp, M. and J. Bruijn. (2003). GenTax: A Generic Methodology
for Deriving OWL and RDF-S Ontologies from Hierarchical
Classifications, Thesauri, and Inconsistent Taxonomies.
Digital Enterprise Research Institute (DERI), University of
Innsbruck.
[6] Chrisment, C., F. Genova, N. Hernandez and J. Mothe. (2006).
D'un thesaurus vers une ontologie de domaine pour
l'exploration d'un corpus. AMETIST, No. 0.
http://ametist.inist.fr/document.php?id=152.
[7] http://alpage.inria.fr/~sagot/wolf.html.