ISSN: 2277-3754
ISO 9001:2008 Certified
International Journal of Engineering and Innovative Technology (IJEIT)
Volume 3, Issue 5, November 2013

A Methodological Approach for Converting Thesaurus to Domain Ontology: Application to Tourism

Mouhim Sanaa, Tatane Khalid, Cherkaoui Chihab, Douzi Hassan, Mammas Driss

Abstract: The use of ontologies in applications such as information retrieval, indexing and annotation of resources, and knowledge representation has proved very advantageous. However, building ontologies poses several research problems in the absence of a standardized methodology supported by the scientific community. This paper proposes an approach for converting a thesaurus into a domain ontology. Specifically, it converts terms into concepts and thesaurus relations into semantic relations.

Index Terms: thesaurus, ontology, term, concept, hierarchy, generalization, association.

I. INTRODUCTION

A thesaurus is a controlled and structured vocabulary in which concepts are represented by terms, organized so that the relations between concepts are explicit, and in which preferred terms are accompanied by entries for their synonyms or quasi-synonyms. It is a documentary language used for indexing and searching information in various fields such as medicine, tourism and education. The terms are related to each other by relations of synonymy, hierarchy and association. The establishment and development of thesauri follow guidelines and standards such as ISO 2788 and ANSI Z39, which are conventions for the content, display, construction methods and maintenance of monolingual or multilingual thesauri. Reusing thesauri is very promising, since they were designed by experts and required heavy modeling efforts, and they have the advantage of using the terms identified as relevant in the field.
However, the non-standardized formats of thesauri (database, HTML, ASCII file) have opened a new field of research: the migration of thesauri to W3C formats such as RDF (Resource Description Framework) and OWL (Web Ontology Language). In this article, we are especially interested in proposing an approach to convert a thesaurus into a domain ontology. We focus on exploiting the relations of association, equivalence and hierarchy between terms to create the concepts and semantic relations of an ontology. The article is structured as follows: the first part deals with the main differences between thesauri and ontologies. The second part presents the most important works on converting thesauri to ontologies. The third part describes the proposed approach for converting a thesaurus into a domain ontology. In the fourth part, we apply the proposed approach to build a tourism ontology.

II. THESAURI AND ONTOLOGIES: LOCATING THE DIFFERENCES

To convert thesaurus formats into semantic web languages, the differences between thesauri and ontologies must be taken into account. At the level of knowledge representation, and compared to ontologies, thesauri suffer from a lack of conceptual abstraction [1]. The main issue is that a thesaurus does not distinguish between terms and concepts. Furthermore, the terms are grouped into a single or multiple hierarchical structure linked by ambiguous relations. These semantic links may reflect the intended use of the thesaurus rather than real semantic relations between terms [2]. They can be summarized mainly as relations of association and hierarchy. Such links cannot guide inference or even suggest a query expansion to support information retrieval.

III. THESAURUS VS ONTOLOGY: BACKGROUND

Creating an ontology from a thesaurus has been the aim of many interesting works. Noy addressed the conversion of the NCI (National Cancer Institute) thesaurus to OWL-DL and the knowledge representation problems it raises [3]. Wielinga proposed an approach to transform the Art and Architecture Thesaurus (AAT) into an ontology for indexing images. This approach is entirely manual. The ontology, formalized in RDFS, is built in two steps: the identification of concepts and the connection of properties to specific subsets of AAT [4]. Hepp describes the GenTax methodology for constructing ontologies in OWL and RDF-S (RDF Schema) from hierarchical classifications, thesauri and taxonomies. This approach allows the semi-automatic creation of meaningful ontology classes for a particular context while preserving the original hierarchy. The basic idea of GenTax is to derive two ontology classes from each category: a generic concept in a given context and a broader taxonomic concept that preserves the original hierarchy [5]. Chrisment proposes a method to transform a thesaurus designed under the ISO 2788 and ANSI Z39 standards into a lightweight ontology for indexing a corpus. The first step of that methodology extracts a set of thesaurus concepts and their lexical variations. The second step structures the ontology concepts by detecting their taxonomic and associative relations in the thesaurus and in the corpus [6].

All these works agree on the usefulness of converting thesauri to semantic web languages. In accordance with [6], the first step in building an ontology from a thesaurus is to build semantic classes from the terms, taking into account the lexicalization of concepts. However, the lexicalization of a concept is restricted to the terms present in the thesaurus. In addition, all of the studies use the hierarchical relations between terms to structure the concepts into classes and subclasses, although this relation may hide other semantic links such as "instance" and "part of". Few studies have taken this differentiation into account.

The innovation of the process we propose lies in a method for constructing concepts from the thesaurus terms. We also propose enriching the lexicon by using lexical databases and dictionaries, that is, extending the concepts with the lexicalizations made explicit both in the thesaurus and in these databases. In addition, we propose a mechanism for making the hierarchical relation of the thesaurus explicit, which disambiguates the relations "partOf" and "instance".

CONCEPTUALIZATION OF THE THESAURUS

This step transforms the thesaurus lexicon into an ontology. Concepts are first created using the equivalence relation between terms. These concepts are then grouped using both the hierarchical and the association relations.

A. From terms to concepts

The first step in the transformation of a thesaurus is the passage from terms to concepts. To build a concept, we propose to associate with each thesaurus term a set of synonyms, lexical variations, and a definition in natural language. Applying this rule provides, from the descriptors and the equivalence relation between descriptors and non-descriptors, a mechanism for grouping concepts and their lexical variations. The grouping of terms is achieved through the relations EMP and EP between descriptors and non-descriptors. The identifier of the concept is then the name of the descriptor. The non-descriptors are introduced as labels of the concept (Fig 1).

Fig 1. Passage from terms to concepts using the relation of equivalence (Term 1 EP Term 2 and Term 2 EMP Term 1 yield a concept named Term1 with the label Term2)

To give a definition to a concept, the scope note (NE) is used if it exists. If not, an external dictionary is used. The list of synonyms can be established by using the lexical database Wolf [7]. Lexical variations of the term must also be taken into consideration. From both this list and the term itself, a concept referring to a semantic class can be built. It has a name and includes the term, its lexical variations and its synonyms.

B. Structuring of concepts

After defining the concepts of the ontology, the second step is to form the hierarchical structure of classes and subclasses. In general, most studies transform the specific and generic relations of the thesaurus (TS and TG) into a relation of hierarchy. However, these two relations should be used with caution because they also cover the relations "isPartOf" and "isInstanceOf" (Fig 2).

Fig 2. Possible interpretations of the relation "generic and specific" (hierarchy between a generic and a specific concept, isPartOf, isInstanceOf)

To help the knowledge engineer differentiate between these three cases, we suggest interpreting the relation of generality and specificity as follows:

1) The relation of hierarchy identifies a link between a class and its members (subclasses): to interpret a generality relation as a hierarchy relation, the logical test "some" and "all" between the two descriptors must be validated (Fig 3).

Fig 3. Example of application of the logical test "some" and "all" for the relation of hierarchy (Succulent / Cactus: some, all; Desert Plant / Cactus: some, some)

As shown in this example, all cacti, by definition and whatever the context, are succulents. We can therefore deduce that the class "cactus" is a subclass of "succulent". In the second example, however, only some desert plants are cacti, and not all cacti are desert plants. These two descriptors must then be assigned to different classes and not to the same hierarchy.

2) The relation of hierarchy can be interpreted as an instance relation when it identifies a link between a general category and an individual member expressed by a proper name.
Example: Alep TG Mountain is interpreted as "Alep is an instance of Mountain".

Differentiating between the three scenarios of the hierarchy relation in a thesaurus not only ensures an appropriate hierarchical structure of concepts into classes and subclasses, but also makes it possible to manage the multi-hierarchy that can be encountered in the thesaurus.

C. Management of multi-hierarchy

Multi-hierarchy is the fact that a descriptor can have several generic terms at the same time. In this case, the type of each relation is interpreted using the test "some" and "all" as explained above. Three cases can occur:

1) The test "some" and "all" is validated in both relations:
Example: Piano TG Percussion Instrument and Piano TG Cord Instrument
In this example, the descriptor "Piano" has two generic terms, "Percussion Instrument" and "Cord Instrument". The test "some" and "all" is validated in both relations, since all pianos are percussion instruments and all pianos are cord instruments. In this case, the descriptor is interpreted as an instance of both generic terms. Piano is then introduced into the ontology as an instance of the two concepts "Cord Instrument" and "Percussion Instrument". It is important to note that these two concepts must not be declared as disjoint.

2) The test "some" and "all" is validated in neither relation:
Example: Biochemistry TG Biology and Biochemistry TG Chemistry
In this example, the descriptor "Biochemistry" has two generic terms, "Biology" and "Chemistry". As the test "some" and "all" is not validated, the two relations are interpreted as "partOf" relations. In this case a new concept "Biochemistry" is built in a hierarchy other than those of "Biology" and "Chemistry", while keeping a "partOf" link between it and the two concepts.

3) The test "some" and "all" is validated in one of the two relations and not in the other:
Example: Skull TG Bones and Skull TG Head
In this example, the descriptor "Skull" has two generic terms, "Bones" and "Head". The test "some" and "all" is validated between "Skull" and "Bones" but not between "Skull" and "Head". In this case, the concept "Skull" is introduced as a subclass of "Bones". The concept "Head" is introduced in a different hierarchy, keeping a "partOf" link between the two classes.

D. Removing redundancy in hierarchical relations

When a generic relation is interpreted as a relation between a class and its subclass, it is a transitive relation. It allows this type of inference: if A is a subclass of B and B is a subclass of C, then A is a subclass of C. It is therefore unnecessary to state the relation A TG C explicitly.

E. Exploitation of the association relation

This relation covers associations between descriptors that are neither equivalent nor hierarchical. The semantic or conceptual association between the terms is made explicit by the relation RT (Related Term). In the thesaurus, this relation is symmetrical. The associative relations are the most difficult to interpret. This relation indicates the existence of a semantic link between two classes, but the type or nature of the relation is not defined. It may be a relation between a subject and the objects or phenomena studied, or between a discipline and its practitioners, etc.
Example: Mathematics RT Mathematician
It can also link concepts related by a causal dependency, concepts and their unit or measuring mechanism, objects and their properties, etc.
Example: Mourn RT Death
Temperature RT Thermometer

The exploitation of these semantic relations depends heavily on the interpretation of the knowledge engineer and on the requirements that the ontology must meet. When such a relation is difficult to make explicit, other resources should be used. This can be the second phase in the process of building the ontology, which is to extract the explanation of these semantic relations from text.

IV. EXPERIMENTATION

In this part we focus on the construction of a French tourism ontology, implementing the various steps detailed in the previous section. In the tourism sector, there are a number of controlled vocabularies or thesauri. We cite in particular the "Thesaurus of Tourism and Recreation" of the World Tourism Organization (WTO). The development and implementation of the latest version of this multilingual thesaurus took more than 20 years of work, consolidating the efforts of several specialists and domain experts under the direction of the WTO and the French tourism directorate. The WTO Thesaurus is structured into 20 semantic classes. For our specific needs, we are interested in the following semantic fields expressed in the thesaurus. All these semantic fields will form the super classes of our tourism ontology (Fig 4).

Fig 4. Semantic fields of the tourism ontology

Fig 6. Hierarchical structuring of some concepts of the ontology of tourism

Recall that one of the basic rules we have defined is to take into consideration the lexicalizations of concepts and their synonyms. This phase exploits the equivalence relation between descriptors and non-descriptors expressed in the thesaurus.
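Two of the rules described earlier are mechanical enough to sketch in code: building one concept per descriptor with its non-descriptors as labels, and dropping a TG link that is already implied by transitivity. The Python below is a minimal illustration under stated assumptions, not the tooling used in this work. The names `Concept`, `build_concepts` and `remove_redundant_tg` are hypothetical, the thesaurus is assumed to be already parsed into a dictionary and a set of pairs, and the TG links passed in are assumed to have been judged subclass links by the knowledge engineer and to contain no cycle.

```python
from dataclasses import dataclass, field


@dataclass
class Concept:
    """One ontology concept built from a thesaurus descriptor (hypothetical model)."""
    name: str                                    # identifier = name of the descriptor
    labels: list = field(default_factory=list)   # non-descriptors linked by EP


def build_concepts(ep_relations):
    """Create one concept per descriptor; its non-descriptors become labels.

    ep_relations maps a descriptor to the list of its non-descriptors.
    """
    return {d: Concept(d, list(nds)) for d, nds in ep_relations.items()}


def remove_redundant_tg(subclass_links):
    """Drop every link (a, c) for which another path a -> ... -> c exists.

    subclass_links is a set of (specific, generic) pairs already interpreted
    as subclass links. Assumes the links form an acyclic graph.
    """
    links = set(subclass_links)
    parents = {}
    for specific, generic in links:
        parents.setdefault(specific, set()).add(generic)

    def reachable(start, goal, skip):
        # Depth-first search from start to goal that ignores the direct link `skip`.
        stack, seen = [start], set()
        while stack:
            node = stack.pop()
            for nxt in parents.get(node, ()):
                if (node, nxt) == skip or nxt in seen:
                    continue
                if nxt == goal:
                    return True
                seen.add(nxt)
                stack.append(nxt)
        return False

    return {(a, c) for (a, c) in links if not reachable(a, c, skip=(a, c))}


# Illustrative input: one EP entry from the tourism thesaurus and three generic links.
concepts = build_concepts({"COTE": ["baie", "détroit", "front de mer", "golfe", "péninsule"]})
kept = remove_redundant_tg({("A", "B"), ("B", "C"), ("A", "C")})
```

The "some" and "all" test itself is left out on purpose: it relies on the judgment of the knowledge engineer, so only its outcome (the set of links judged to be subclass links) appears as input here.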
Example:
COTE EP: baie, détroit, front de mer, golfe, péninsule

As shown in this example, the descriptor "Cote" is connected to the non-descriptors "baie, détroit, front de mer, golfe, péninsule". In this case, a concept named "Cote" is built, and the associated non-descriptors are expressed in Protege2000 [7] as labels of the concept (Fig 5).

Fig 5. Lexicalization of the concept "cote" in Protege2000

The constructed concepts are then structured by making the generic relation explicit, as described above.

1) The relation TG is interpreted as an instance: in the thesaurus of the World Tourism Organization, out of 1300 generic relations we judged 175 to be instance relations.
Example: CAMEROUN TG AFRIQUE CENTRALE

2) The relation TG is interpreted as a relation of hierarchy: the test "some" and "all" that we proposed allowed us to interpret 950 of the 1300 generic relations as hierarchical relations.
Example: MANIFESTATION COMMERCIALE TG MANIFESTATION TOURISTIQUE

3) The relation TG is not interpreted as a relation of hierarchy: the test "some" and "all" allowed us to interpret 200 generic relations as not being hierarchy relations.
Example: ActivitéSportive TG ArticleDeSport

Multi-hierarchy was found in country names (field 20): a country is part of a continent, but it may also belong to one or more political, economic, geographic, ethnic, religious or linguistic subdivisions.
Example:
CAMEROUN
TG: AFRIQUE CENTRALE
TG: AFRIQUE FRANCOPHONE
TG: PAYS ISLAMIQUE
TG: SAHEL

In this case, the test "some" and "all" is validated in the four generality relations. Each of them is then interpreted, as explained above, as an instance relation. Also recall that the descriptors Central Africa, Francophone Africa, Islamic Countries and Sahel must not be declared as disjoint classes.

The association relation TA expresses the existence of a semantic link between descriptors. In some cases, this relation is easily interpreted by the knowledge engineer.
It is transformed into an object property linking the two descriptors.
Example: ARCHITECTURE TA MONUMENT
MOUNTAIN TA SPORTING

The resulting ontology is rich in terms and concepts. With 1300 concepts and 225 lexicalizations, it provides broad coverage of the tourism sector. However, the semantic relations are still limited, and the ontology does not yet contain any object properties.

V. CONCLUSION

In this paper, we proposed an approach for manually converting a thesaurus to an ontology. We focused on the passage from terms to concepts and on some issues of the hierarchy and association relations. The innovation of our approach lies first in a method for building concepts from the thesaurus terms. We also propose enriching the lexicon using lexical databases and dictionaries, that is, extending the concepts with the lexicalizations made explicit both in the thesaurus and in these databases. In addition, we proposed a mechanism for making the hierarchical relation explicit, which disambiguates "partOf" and "instanceOf". Through the test "all and some", we were able to differentiate between the possible interpretations of the generality relation. The proposed method was applied to the field of tourism by converting the WTO thesaurus to a tourism ontology. This manual interpretation is important for obtaining a first ontology, which will be enriched from other text resources with a semi-automatic approach based on NLP tools.

REFERENCES

[1] Soergel D., B. Lauser, A. Liang, F. Fisseha, J. Keizer, S. Katz. (2004). Reengineering Thesauri for New Applications: the AGROVOC Example. Journal of Digital Information, Volume 4, Issue 4.
[2] Fernandez M., A. Gomez-Pérez and N. Juristo. (1997). METHONTOLOGY: From Ontological Art towards Ontological Engineering. Proceedings of the AAAI-97 Spring Symposium Series on Ontological Engineering, Stanford, CA, USA.
[3] Noy, N., R.W. Fergerson and M.A. Musen. (2000). The knowledge model of Protégé2000: combining interoperability and flexibility. Proceedings of the International Conference on Knowledge Engineering and Knowledge Management (EKAW'00).
[4] Wielinga, B., G. Schreiber, J. Wielemaker and J.A.C. Sandberg. (2001). From thesaurus to ontology. International Conference on Knowledge Capture, Victoria, Canada, October 2001.
[5] Hepp, M. and J. Bruijn. (2003). GenTax: A Generic Methodology for Deriving OWL and RDF-S Ontologies from Hierarchical Classifications, Thesauri, and Inconsistent Taxonomies. Digital Enterprise Research Institute (DERI), University of Innsbruck.
[6] Chrisment, C., F. Genova, N. Hernandez and J. Mothe. (2006). D'un thesaurus vers une ontologie de domaine pour l'exploration d'un corpus. AMETIST, Numéro 0. http://ametist.inist.fr/document.php?id=152.
[7] http://alpage.inria.fr/~sagot/wolf.html.