ENRICHING A TERMINOLOGICAL DATABASE WITH A SET OF IDIOMATIC EXPRESSIONS Rita Marinelli, Laura Cignoni Istituto di Linguistica Computazionale, Consiglio Nazionale delle Ricerche, Pisa (ITALY) [email protected], [email protected] Abstract The research described here is aimed at enriching the terminological database Mariterm with a set of idiomatic expressions related to the nautical field. The database is available at the Institute for Computational Linguistics (ILC) of the National Research Council (CNR) in Pisa (Italy), and contains semantic information for around 3500 Italian terms belonging to the maritime domain. Each Italian term is linked to other terms by means of semantic “internal relations” and is also connected to the equivalent synonyms in English. We relate on the methodology designed to expand the database, increase the lexical resource with explanations on the origins of the most common expressions, and study the similarities and differences in meaning and usage of some idiomatic expressions in the two languages. The possibility of using this linguistic resource also for didactic purposes in public and private schools is considered. Keywords: Lexical semantic databases, terminology, idiomatic expressions. 1 INTRODUCTION This paper describes a research project aimed at enhancing the terminological database Mariterm available at the Institute for Computational Linguistics (ILC) of the National Research Council (CNR) in Pisa with a set of maritime idiomatic expressions. The database already contains semantic information for a group of Italian terms belonging to the field, and linked to other terms by semantic relations, 1 2 according to the EuroWordNet /ItalWordNet model. The conceptual structure of the database also provides a set of relations, which connect the Italian terms to the equivalent synonyms (or near synonyms) in English, included in the Princeton database WordNet. Alongside the translation of each Italian term into English, a definition is provided for each word in the two languages. In the following sections we describe i) the terminological database and methodology designed to enhance the lexicon with around 200 idiomatic expressions pertinent to the maritime field in the two languages, taking into account the conceptual model of the database; ii) the enrichment of the lexical resource with explanations of the origins of the most common expressions, e.g. tenere la rotta (to hold the course), perdere la bussola (to lose one’s bearings), andare contro corrente (to sail against the wind). These phraseological units, groups of words having a fixed lexical composition and grammatical structure, are particularly attractive to those interested in learning the ways in which maritime language has developed; iii) the similarities and differences in meaning and usage of some idiomatic expressions in Italian and in English; iv) the possibility of using a linguistic resource of this type, managed by a tool which provides an appealing visualization of the terms and their connections to other terms, also for didactic purposes, in public and private schools. 2 THE MARITERM DATABASE Mariterm is a lexical database of a relational type, structured along the lines of the EuroWordNet (EWN)/ItalWordNet (IWN) model, following the perspective of the Princeton WordNet (WN) philosophy [1], [2], according to relational semantics. The terminological lexicon is organized around the concept of synset, “a set of words, with the same part-of-speech that can be interchanged in a certain context” [3], without changing the truth value of the proposition in which the word is included, e.g.: nave mercantile, nave da carico (cargo ship), spiaggia, arenile (beach); ancorarsi, gettare l’ancora (to anchor), armare, equipaggiare (to equip) [4]. 1 Web site of the EWN project:http://www.illc.uva.nl/EuroWordNet 2 Web site of the IWN project:http://www.ilc.cnr.it/iwndb/iwndb_php Proceedings of EDULEARN12 Conference. 2nd-4th July 2012, Barcelona, Spain. 0690 ISBN: 978-84-695-3491-5 The approximate 3,500 lemmas belonging to the maritime knowledge field, in particular to the two subdomains of maritime navigation and transport recorded in the database are clustered into about 2,500 synsets, and are connected to other synsets by means of semantic relations which reflect the user’s mental lexicon [5]: a) relations linking the terms with other Italian terms ‘within’ the database (or ‘internal relations’); b) relations connecting the terms to the synsets of the generic English WordNet (WN) (or ‘equivalence relations’): 2.1 Internal relations As previously mentioned, the concept of synset is assigned a central role and the synonymy relationship allows to group synonymous meanings in a synset linked to other synsets by languageinternal relations. WNs are based on the management of consistent conceptual structures, arranged hierarchically through ‘vertical’ relations of the type “has_hyperonym/has_hyponym”, which provide coherence to the system. The linguistic model of EWN/IWN uses many lexical-semantic relations, e.g. “part_of”, “cause” “purpose”, “subevent”, “belongs_to_class”, etc., reflecting the organization of lexical knowledge. These horizontal relations make it possible to establish a semantic relation between concepts which are neither synonyms nor hierarchically-dependent terms [6]: merce (goods) has_hyponym prodotto alimentare (food product) stivare (to stow) involved_agent stivatore (stevedor) Proper names, also included in the database, are considered “instances” of the class they belong to: the relation “belongs to class” and its reversed “has instance” connect “instances” with “synsets”: Venezia (Venice) belongs_to_class porto (harbour) porto (harbour) has_instance Venezia (Venice) What follows (Fig. 1) is a screenshot from Mariterm for the synset ‘nave’ (ship): Fig. 1. The synset ‘nave’ (ship) with some semantic relations. 2.2 Equivalence relations The conceptual structure of the model supplies lexical semantic relations between and among the terms found in the specialized lexicon, and also provides a set of semantic relations connecting the synsets of the terminological database to the general WordNet lexicon, i.e. the “equivalence relations”. 0691 These relations make it possible to link the terminological synsets to their closest equivalent English 3 concepts of the Princeton WN, using the Inter Lingual Index (ILI) , through the following relations: a) synonymy (or near synonymy): nave eq_ synonym ship (a vessel that carries passengers or freight) armare, equipaggiare eq_near synonym equip, fit out (provide with equipment) b) inclusion of synsets in a taxonomic chain: porto di scalo (calling-in port) eq_has_hyperonym seaport, haven, harbour (a place where ships can take on or discharge cargo) c) ‘part’, ‘subevent’, ‘role’, ‘means’, etc. relations, which allow a more exhaustive description of the semantic field of each term: marinaio (sailor) eq_has_holonym crew (a group of people who work on and operate a ship) comandante (captain) eq_role command (exert control over) Each term is connected to the ILI, and then linked to the Top Ontology (TO), a set of concepts which are at the same time abstract, hierarchically-organized and language-independent: naufragio (shipwreck) → Cause, Dynamic equipaggio (crew of a ship) → Group, Human, Occupation Fig. 2 shows equivalence relations for the synset ‘nave’ (ship): Fig. 2. The synset ‘nave’ (ship) with equivalence relations 3 MARITERM ENRICHMENT Within the context of everyday life one can find many linguistic expressions belonging to the maritime domain, for example nouns often used with an extension of meaning (metaphorical or metonymic): faro (beacon); verbs: ancorarsi (to anchor); dead metaphors: governo (government), corrente (stream); semantic calques: grattacielo (skyscraper); and idiomatic expressions: seguire la corrente (to go with the tide), arrivare al giro di boa (to reach the turning point) [7]. A list of words and expressions was obtained analyzing the contexts of usage of terms found in many 4 sources of the CLIC corpus (Corpus di Lingua Italiana Contemporanea), a large textual corpus of 3 The ILI is an unstructured fund of synsets (mainly taken from WorNet 1.5), the so-called ILI-records [3]. 4 The goal of the LE PAROLE project [15] was to create a comparable set of large Written Language Resources (WLRs) for all European languages. At the end of the PAROLE Project in 1997, the Corpus was enlarged (up to 100 million words) covering the years until 2005 and renamed CLIC. 0692 contemporary Italian [8] [9], and from the intersection of a number of dictionaries, web sources and specialized texts. As said above, synsets are known to contain different word-forms with the same meaning, as well as word-forms with more than one meaning appearing in as many different synsets as they have meanings [10], [11], [12]. For example, the string marea (tide) occurs in Mariterm as a noun in two different synsets, and therefore has two distinct senses. The former expresses the concept marea1, defined as movimento periodico dell'acqua del mare che si alza e si abbassa, the latter indicates 5 grande quantità di cose o persone, spec. se in movimento as in the examples from the CLIC Corpus: liberarsi della morsa delle lamiere sommerse, grazie all'alta *marea* (literal use) una *marea* brulicante di folla riempie all'improvviso ogni millimetro di vuoto (extended use) 6 In WN3.1 , the most recent version of WordNet, two different senses of the synset ‘tide’ are highlighted, even if the extension of meaning is not codified by a relation: S: (n) tide (the periodic rise and fall of the sea level under the gravitational pull of the moon) S: (n) tide (something that may increase or decrease (like the tides of the sea)) "a rising tide of popular interest" Therefore, the second sense of marea (tide) as an ‘extended’ sense of the former was considered worth representing by a particular relation encoding the synset extensions. A new semantic relation was used in the set of database relations to codify the phenomenon of sense extension for the metaphorical or metonymic uses of a word, i.e. the relation has_extension and its reverse is_extension_of, already employed for the representation of specific ‘extended’ senses of Proper Names [13], [14]: marea1 (movimento periodico dell’acqua del mare) has hyperonym fenomeno naturale (natural phenomenon) marea2 (abbondanza, grande numero) has hyperonym quantità (quantity) marea1 has extension marea2 marea2 is extension of marea1 Like the other synsets, the idiomatic expressions are connected to other concepts represented in the database, but at the same time continue to belong to a group of specific linguistic expressions. Each expression is linked to a ‘thematic’ term, a keyword representing a central relevant concept in the specialized lexicon and occurring in associated idiomatic expressions. The semantic field (‘node of information’ connected to the given keyword) is supplied with new information which helps enrich the conceptual network in which that term is included and plays a central role. The link is constituted by a specific ad hoc relation connecting the idiomatic expressions to the relative thematic word(s), e.g. tenere la rotta (Ζ rotta), tirare i remi in barca (Ζ remo, Ζ barca). These data, 7 information and semantic relations are managed by a tool which allows visualization and updating of the database. Each idiomatic expression is included in a set of items appropriately predisposed and managed by the tool, so that the idiomatic expressions form a repository of idioms that can be retrieved and listed. Like the other synsets in the database, each expression provided with a brief definition is connected to its thematic term through a particular relation; if the term, for example mare (sea), appears in an idiomatic expression, the new relation in idiomatic expression and its reverse idiomatic expression from are used: mare in idiomatic expression portare acqua al mare (fare una cosa sciocca, inutile) portare acqua al mare idiomatic expression from mare In some cases, the system shows a window with etymological and historical information related to the idiomatic expression searched, e.g.: tempesta in un bicchier d’acqua meaning ‘confusion for insignificant thing’ (see Figs. 3 and 4): 5 Il Sabatini-Coletti, Dizionario della Lingua Italiana, 2005. 6 http://wordnetweb.princeton.edu 7 The database management system has been realized in Visual Basic and the data are filed in a SQL database. 0693 Fig. 3. Example of the term ‘tempesta’ Fig. 4. ‘Tempesta in un bicchier d’acqua’ 4 ENGLISH TRANSLATION The idiomatic expressions included in the list are provided with the English translation, in keeping with the conceptual model of the database. Many similarities were found between the Italian and English nautical phrases, but sometimes an expression was idiomatic only in one of the two languages and not in the other [16]. With regard to the English translation, the following conditions may appear: 0694 a) The translation in English expresses the same concept, and uses words pertaining to the same maritime lexicon. For some expressions, the literal translation and idiomatic sense are the same, i.e. un pesce fuor d’acqua (a fish out of water), the same in the two languages, indicating unease and discomfort, or parlare come uno scaricatore di porto, which is similar to “to talk like a dockworker”. b) The idiomatic expression has an English translation conveying the same concept, but using terms that do not belong to the same (maritime) ‘world’ i.e. essere un porto di mare (to be like Piccadilly Circus); un mare di guai (a heap of trouble); goccia nel mare (drop in the bucket), intending that a certain amount of something is insufficient. c) Some English idiomatic expressions belonging to the maritime ‘world’ have an equivalent expression in Italian, which however does not belong to the maritime ‘world’ [17], e.g. “to burn one’s boats”, Italian: bruciare i ponti, lit: to burn the bridges, in the sense of leaving oneself with no means of escape [18]; “hit the deck”, Italian: buttarsi a terra, lit: throw yourself to the ground, with the meaning of protecting oneself. 5 CONCLUSIONS In the future, we intend to provide the database with additional idiomatic expressions in the two languages, and to study the phenomenon in more detail. The resource Mariterm could be used for those requiring an adequate knowledge of maritime terminology, in contexts in which the knowledge of nautical expressions is essential, for example for workers in yachting harbours and cruise ports. We suggest the possibility of using a tool of this type also for didactic purposes, in both public and private schools. The definitions and translations of specific terms and a number of multi-word units and idiomatic expressions can be made available for professional training in import-export companies and maritime agencies, and for didactic activities and tasks of various types in schools and nautical institutes. REFERENCES [1] Vossen, P. (1999). EuroWordNet General Document, 1999. http://www.hum.uva.nl/~EWN. [2] Miller G. A., Beckwith R., Fellbaum C., Gross D., Miller K.J. (1990). Introduction to WordNet: An On-line Lexical Database, International Journal of Lexicography, III, 4, pp. 235-244. [3] Vossen P. (2004). Eurowordnet: a multilingual database of autonomous and language-specific wordnets connected via an Inter-Lingual-Index. International Journal of Lexicography. 2004. 17 (2), pp.161-173. [4] Marinelli R., Roventini A. (2006). The Italian Maritime Lexicon and the ItalWordNet Semantic Database, In E. Miyares Bermúdez and L. Ruiz Miyares (Eds.), Linguistics in the Twenty First Century, Cambridge Scholar Press, pp. 173-182. [5] Miller G. A., Fellbaum C. (2007). WordNet then and now, Language Resources & Evaluation. 41, pp. 209-214. [6] Marinelli R., Mazzocchi F., Tiberi M., Motta M. (2010). Il modello semantico di EuroWordNet come strumento per la strutturazione della relazione associativa nei thesauri. Bollettino AIB, L, 3, pp. 249-263. [7] Marinelli R. (2008a). “Idiomatic Expressions and Metaphors from the Maritime Domain”. In M. Alvarez de la Granja (Ed.). Lenguaje figurado y motivacion. Una perspectiva desde la fraseologia. Studien zur romanischen Sprachwissenschaft und Interculturellen Kommunication. Frankfurt am Main: Peter Lang Internationaler Verlag der Wissenschaften, pp. 209-220. [8] Cignoni L., Coffey S. (1998). A Corpus-based study of Italian idiomatic phrases: from citation forms to ‘real-life’ occurrences. Proceedings of Euralex ’98, University of Liège, 1, pp. 291-300. [9] Marinelli R., (2008b). Analisi di metafore e espressioni idiomatiche per mezzo di risorse computazionali e corpora elettronici. In Critica del testo. L’Europa dei proverbi. A cura di A. Punzi e I. Tomassetti. Roma. XI, nn. 1-2, pp. 469-488. [10] Fellbaum C. (1998). Towards a representation of idioms in WordNet. Proceedings of the workshop on the Use of WordNet in Natural Language Processing Systems, Coling-ACL, Montreal. 0695 [11] Fellbaum C. (2011). Idioms and collocations. In C. Maienborn, K. von Heusinger, P. Portner (Eds.), Semantics, De Gruyter Mouton, 2011, pp. 441-456. [12] Osherson A., Fellbaum C. (2009). The representation of idioms in WordNet. Proceedings of ACL- IJCNLP 09. [13] Cignoni L., Coffey S. (2003). At the interface of onomastics and phraseology. Multiword units as proper names - proper names as ‘common’ phrasal units, Linguistica Computazionale, XVIXIX, IEPI. 1, pp. 251-262. [14] Marinelli R. (2010). Ontological variation and sense extension in proper names. Proceedings of the XXVIII AESLA Conference, “Analizar datos > Describir variacion – Analysing data > Describing variation”, University of Vigo. [15] Marinelli R., Biagini L., Bindi R., Goggi S., Monachini M., Orsolini P., Picchi E., Rossi S., Calzolari N., Zampolli A. (2003). The Italian PAROLE corpus: an overview. In Linguistica Computazionale. XVI-XVII. IEPI. 1, pp. 401-421. [16] Cignoni L., Coffey S., Moon R. (1999). “Languages in Contrast”, Idiom Variation in Italian and English, Two corpus-based studies, II, 2, John Benjamins Publishing Company, pp. 279-300. [17] Moon, R. (1998). Fixed Expressions and Idioms in English. Oxford. Clarendon Press. [18] McCarthy M., O’Dell F. (2002). English Idioms in Use. Dubai. Cambridge University Press. 0696
© Copyright 2026 Paperzz