3rd AIUCD Conference - 2014 (AIUCD'14) - Selected papers
in collaboration with
ALMA MATER STUDIORUM UNIVERSITY OF BOLOGNA
SCHOOL OF ARTS, HUMANITIES, AND CULTURAL HERITAGE
DEPARTMENT OF CLASSICAL PHILOLOGY AND ITALIAN STUDIES
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

Humanities and Their Methods in the Digital Ecosystem
Proceedings of AIUCD 2014 - Selected papers
Ed. by Francesca Tomasi, Roberto Rosselli Del Turco, and Anna Maria Tammaro

The Association for Computing Machinery
2 Penn Plaza, Suite 701
New York, New York 10121-0701

ACM COPYRIGHT NOTICE. Copyright © 2015 by the Association for Computing Machinery, Inc. Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from Publications Dept., ACM, Inc., fax +1 (212) 869-0481, or [email protected]. For other copying of articles that carry a code at the bottom of the first or last page, copying is permitted provided that the per-copy fee indicated in the code is paid through the Copyright Clearance Center, 222 Rosewood Drive, Danvers, MA 01923, +1-978-750-8400, +1-978-750-4470 (fax).

ACM ISBN: 978-1-4503-3295-8

CONFERENCE COMMITTEE

General Chair
Dino Buzzetti, University of Bologna, Italy - AIUCD President

Local Chair
Francesca Tomasi, University of Bologna, Italy - Department of Classical Philology and Italian Studies

Program Committee
Maristella Agosti, University of Padova, Italy; Fabio Ciotti, University of Roma Tor Vergata; Maurizio Lana, University of Piemonte Orientale; Federico Meschini, University of Tuscia; Nicola Palazzolo, University of Perugia; Roberto Rosselli Del Turco, University of Torino; Anna Maria Tammaro, University of Parma.

Local Program Committee (University of Bologna)
Luisa Avellini - Department of Classical Philology and Italian Studies
Federica Rossi - Head of the Library of the Department of Classical Philology and Italian Studies
Marialaura Vignocchi - Head of AlmaDL and CRR-MM
Fabio Vitali - Department of Computer Science and Engineering

Secretary (University of Bologna)
Marilena Daquino - Research assistant
Federico Nanni - PhD

PREFACE

The conference call and topics

AIUCD 2014,¹ the third AIUCD (Associazione per l'Informatica Umanistica e la Cultura Digitale²) Annual Conference, was devoted to discussing the role of Digital Humanities in the current research practices of the traditional humanities disciplines. The introduction of computational methods prompted a new characterization of the methodology and the theoretical foundations of the human sciences and a new conceptual understanding of the traditional disciplines. Art, archeology, philology, philosophy, linguistics, bibliography, history, archival sciences and diplomatics, as well as social and communication sciences, avail themselves of computational methods to formalize their research questions, and to innovate their practices and procedures.
A profound reorganization of disciplinary canons is therefore implied. New emerging notions such as the Semantic Web, Linked Open Data, digital libraries, digital archives, digital museum collections, information architecture and information visualization have turned into key issues of humanities research. A close comparison of research procedures in the traditional disciplines and in the digital humanities becomes inevitable in order to detect convergences and to renew their tools and methods from a new interdisciplinary and multidisciplinary perspective. We therefore invited submissions related, but not confined, to the following topics:
— Digital humanities and the traditional disciplines
— Computational concepts and methods in the humanities research practices
— Computational methods and their impact on the traditional methodologies
— The emergence of new disciplinary paradigms

Submission, review and selection process

The call attracted considerable interest in these topics from both the computer science and the humanities communities: 29 abstracts were submitted for a first evaluation; of these, 10 were accepted as main conference talks and 6 as posters. Full papers were subsequently submitted for a first round of peer review, carried out by the members of the Program Committee and by additional reviewers. A second review round verified that the authors had addressed the reviewers' suggestions. Eventually, 13 submissions were accepted for the present proceedings, organized into full papers and poster papers. The invited talk papers were also evaluated by the Program Committee.

Acknowledgments

Our thanks go to the School of Arts, Humanities, and Cultural Heritage, the Department of Classical Philology and Italian Studies, and the Department of Computer Science and Engineering for sponsoring the event. We also thank all the attendees of the conference: more than 100 people physically attended the meeting in Bologna. The Program Committee thanks our external reviewers for their effort in the review process. Finally, we are grateful to the Library of the Department of Classical Philology and Italian Studies for having accepted and hosted the workshop.

¹ Conference web site: http://aiucd2014.unibo.it/.
² FB: http://www.facebook.com/groups/aiucd; WIKI: http://linclass.classics.unibo.it/udwiki/; WEB SITE: http://www.umanisticadigitale.it/; LIST: [email protected]; BLOG: http://infouma.hypotheses.org/.

TABLE OF CONTENTS

INVITED TALK PAPERS

Keynote address
· Article 1 - Manfred Thaller. Are the Humanities an Endangered or a Dominant Species in the Digital Ecosystem? An Interview

Parallel methodologies
· Article 2 - Paola Italia, Fabio Vitali, and Angelo Di Iorio. Variants and Versioning between Textual Bibliography and Computer Science
· Article 3 - Karen Coyle, Gianmaria Silvello, and Annamaria Tammaro. Comparing Methodologies: Linked Open Data and Digital Libraries

PANELS
· Article 4 - Dino Buzzetti, Claudio Gnoli, and Paola Moscati. The Digital Humanities Role and the Liberal Disciplines
· Article 5 - Andrea Angiolini, Francesca Di Donato, Luca Rosati, Federica Rossi, Enrica Salvatori, and Stefano Vitali. Digital Humanities: Crafts and Occupations (Lavori e professioni nel campo delle Digital Humanities)

FULL PAPERS
· Article 6 - Costanza Giannaccini and Susanne Müller. Burckhardtsource.org: A Semantic Digital Library
· Article 7 - Maristella Agosti, Nicola Orio, and Chiara Ponchia. Guided Tours Across a Collection of Historical Digital Images
· Article 8 - Marilena Daquino and Francesca Tomasi. Ontological Approaches to Information Description and Extraction in the Cultural Heritage Domain
· Article 9 - Silvia Stoyanova and Ben Johnston. Remediating Giacomo Leopardi's Zibaldone: Hypertextual Semantic Networks in the Scholarly Archive
· Article 10 - Linda Spinazzè. 'Cursus in clausula', an Online Analysis Tool of Latin Prose
· Article 11 - Matteo Abrate, Angelo Del Grosso, Emiliano Giovannetti, Angelica Lo Duca, Andrea Marchetti, Lorenzo Mancini, Irene Pedretti, and Silvia Piccini. The Clavius on the Web Project: Digitization, Annotation and Visualization of Early Modern Manuscripts
· Article 12 - Vito Pirrelli, Ouafae Nahli, Federico Boschetti, Riccardo Del Gratta, and Claudia Marzi. Computational Linguistics and Language Physiology: Insights from Arabic NLP and Cooperative Editing
· Article 13 - Pietro Giuffrida. Aristotelian Cross-References Network: A Case Study for Digital Humanities
· Article 14 - Cristina Marras. Exploring Digital Environments for Research in Philosophy: Metaphorical Models of Sustainability

POSTER PAPERS
· Article 15 - Alessio Piccioli, Francesca Di Donato, Danilo Giacomi, Romeo Zitarosa, and Chiara Aiola. Linked Open Data Portal. How to Make Use of Linked Data to Generate Serendipity
· Article 16 - Marc Lasson and Corinne Manchio. Measuring the Words: Digital Approach to the Official Correspondence of Machiavelli
· Article 17 - Antonella Brunelli. PhiloMed: An Online Dialogue between Philosophy and Biomedical Sciences
· Article 18 - Marion Lamé. Primary Sources of Information, Digitization Processes and Dispositive Analysis

Computational Linguistics and Language Physiology: Insights from Arabic NLP and Cooperative Editing

Vito Pirrelli, Ouafae Nahli, Federico Boschetti, Riccardo Del Gratta, Claudia Marzi
Institute of Computational Linguistics "A. Zampolli", CNR, via Moruzzi 1, Pisa, Italy
{vito.pirrelli, ouafae.nahli, federico.boschetti, riccardo.delgratta, claudia.marzi}@ilc.cnr.it

ABSTRACT

Computer processing of written Arabic raises a number of challenges to traditional parsing architectures on many levels of linguistic analysis. In this contribution, we review some of these core issues and the demands they make, and suggest different strategies to tackle them successfully. Finally, we assess these issues in connection with the behaviour of neuro-biologically inspired lexical architectures known as Temporal Self-Organising Maps. We show that, far from being language-specific problems, issues in Arabic processing can shed light on some fundamental characteristics of the human language processor, such as structure-based lexical recoding, concurrent, competitive activation of output candidates and dynamic selection of optimal solutions.

Categories and Subject Descriptors

H.3.1 [Information Storage and Retrieval]: Content Analysis and Indexing - linguistic processing; I.2.0 [Artificial Intelligence]: General - cognitive simulation; I.2.4 [Artificial Intelligence]: Knowledge Representation Formalisms and Methods - semantic networks; I.2.6 [Artificial Intelligence]: Learning - connectionism and neural nets, language acquisition; I.2.7 [Artificial Intelligence]: Natural Language Processing - language parsing and understanding; I.7.5 [Document and Text Processing]: Document Capture - optical character recognition.
Keywords

Non-concatenative morphology, Optical Character Recognition, WordNet, Temporal Self-Organising Maps, Mental Lexicon, Language neuro-physiology.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]. AIUCD'14, September 18-19, 2014, Bologna, Italy. Copyright is held by the owner/author(s). Publication rights licensed to ACM. ACM 978-1-4503-3295-8/14/09…$15.00. DOI: http://dx.doi.org/10.1145/2802612.2802637

1. INTRODUCTION

Over the last twenty years, Computational Linguistics has considerably changed the way natural language is perceived as an object of scientific inquiry. We have witnessed a gradual shift of emphasis from purely abstract and formal aspects of language structure to a task-based investigation of the role of language in daily communicative exchanges and as a means of conveying, accessing and manipulating information. However, there is a real risk that such a shift of emphasis from competence to performance, from general models of language to functionally-specified aspects of language communication, may yield a motley assortment of unrelated computational techniques and ad-hoc optimal solutions, thereby losing sight of some fundamental aspects of human language and cognition. To minimise this risk, we suggest that current natural language technologies can benefit considerably from a better understanding of the neuro-functional underpinnings of the human processor and the psycho-cognitive bases of human communication. We believe that some non-trivial issues in the computer-based analysis of the Arabic language neatly illustrate the need for such an integration of perspectives.

At the Institute for Computational Linguistics (ILC-CNR), the Communication Physiology Lab and the Computational Philology Lab have launched a joint computer-based investigation of Arabic, from word structure to concept structure. Our current experience highlights some of the advantages of tackling traditional problems from a non-traditional perspective, and the added value of integrating complementary perspectives on the same issue.

The Arabic lexicon is organised around two fundamental pillars: a non-concatenative inflectional morphology based on discontinuous roots and vowel patterns, and a productive derivational morphology component, whereby root-sharing words are mutually related across both grammatical categories and conceptual boundaries to a much larger extent than in any Indo-European language. Most of the specific problems that arise in computerised Arabic text analysis stem from these two principles of lexicon organisation and can give us important insights into human language processing strategies at large. In what follows, we summarise some of the core problems in Arabic word processing, and envisage possible alternative ways to approach them. In all case studies provided here, processing consists in filtering and selecting the best possible candidate, given a number of competing solutions and some context-based constraints to satisfy.
These issues are exemplified and discussed in connection with the morphological analysis of written Arabic (section 2), optical character recognition of Arabic literary texts (section 3) and acquisition of lexico-semantic relations in the Arabic lexicon (section 4). Such a filtering approach is strongly reminiscent of the way the human processor is known to operate in the mental lexicon (section 5), through competitive co-activation of lexical candidates and goodness-of-fit criteria that guide the activation towards the optimal (i.e. the most compatible) candidate. Computer simulations of brain maps provide a neuro-biologically plausible way of implementing this approach through artificial neural networks, and point in the direction of a useful integration of traditional, task-oriented computational solutions with more principled models of language physiology.

2. ISSUES IN THE COMPUTATIONAL ANALYSIS OF ARABIC MORPHOLOGY

The Arabic language presents a number of challenges for computerised text analysis at the word level.

Primarily, the inconsistent use of diacritics in Arabic orthography raises problems of written word recognition for morphological analysis. Due to the unsystematic marking of diacritics, Arabic word forms can be found in written texts in many different spellings. For example, the word كَتَبَ (kataba, 'he wrote') can be written as كَتَبَ (kataba), كَتب (katb), كتَب (ktab), كتبَ (ktba), كَتَب (katab), كتَبَ (ktaba) or كتب (ktb). As a general strategy, to avoid a useless proliferation of inconsistently diacriticised forms in the morphological lexicon, written forms in a text are brought back to a unique underlying form without diacritics (e.g. كتب ktb), for the latter to be spelled out into the whole range of its possible vocalised instantiations. The strategy reduces the unpredictable variability of spelled-out forms to a unique underlying exponent. However, it may increase the ambiguity of the input and the resulting number of possible morphological parses, since the normalised exponent can be vocalised in ways that are incompatible with the original form. For example, given the input form كَتب (katb), its normalised exponent كتب (ktb) is spelled out as كَتَبَ (kataba, 'he wrote'), كَتَّبَ (kattaba, 'he made write'), كُتِبَ (kutiba, 'it was written') and كُتُب (kutub, 'books'), with the last two forms being incompatible with the original input.
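To make the normalisation step and the ensuing compatibility check concrete, the following minimal Python sketch (ours, for illustration only; it is not the classifier described at the end of this section) strips the optional diacritics and discards candidate vocalisations that contradict a diacritic actually written in the input token:

import re

# Arabic short-vowel and related marks (harakat) occupy U+064B..U+0652.
HARAKAT = re.compile('[\u064B-\u0652]')

def normalise(form):
    """Strip diacritics, reducing a surface form to its bare (undiacritised) exponent."""
    return HARAKAT.sub('', form)

def chunks(form):
    """Split a form into (consonant, trailing diacritics) pairs."""
    return re.findall('([^\u064B-\u0652])([\u064B-\u0652]*)', form)

def compatible(surface, candidate):
    """Keep a candidate vocalisation iff it has the same bare exponent as the surface
    token and preserves every diacritic actually written in that token."""
    if normalise(surface) != normalise(candidate):
        return False
    for (c_s, d_s), (c_c, d_c) in zip(chunks(surface), chunks(candidate)):
        if d_s and d_s not in d_c:      # a written diacritic must reappear on the same consonant
            return False
    return True

# Letters and diacritics spelled out as code points for clarity.
KAF, TA, BA = '\u0643', '\u062A', '\u0628'
FATHA, DAMMA, KASRA, SHADDA = '\u064E', '\u064F', '\u0650', '\u0651'

surface = KAF + FATHA + TA + BA                                   # katb, partially diacritised
candidates = {
    'kataba':  KAF + FATHA + TA + FATHA + BA + FATHA,
    'kattaba': KAF + FATHA + TA + SHADDA + FATHA + BA + FATHA,
    'kutiba':  KAF + DAMMA + TA + KASRA + BA + FATHA,
    'kutub':   KAF + DAMMA + TA + DAMMA + BA,
}
print([name for name, form in candidates.items() if compatible(surface, form)])
# -> ['kataba', 'kattaba']: kutiba and kutub contradict the fatha written on the kaf.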
Secondly, Arabic is a morphologically rich language, where much of the information related to the semantic and morpho-syntactic relationships between words in context is expressed at the word level [1]. A written form can consist of a single inflected unit called a "minimum word" [2], for example:

يَشْرَحُونَ
ja-ʃraħ=u:-na
IPFV.3-explain=They.PL-PRS.IND
'they explain'¹

The minimum word يَشْرَحُونَ [jaʃraħu:na] is made up of an inflectional prefix of the imperfective aspect and the third person [ja], a verbal stem [ʃraħ], a nominative pronoun [u:] specifying the plural, and an inflectional suffix [na] indicating the present indicative. To this "minimum word", proclitics and/or enclitics² could be added in order to obtain a "maximum word", for example:

وَسَيَشْرَحُونَهَا
wa=sa=ja-ʃraħ=u:-na=ha:
and=PFUT=IPFV.3-explain=They.3PLM-PRS.IND=it.3SF.ACC
'and they will explain it' (PFUT = future particle).

Here, the "minimum word" يَشْرَحُونَ [jaʃraħu:na] is preceded by proclitics (the conjunction [wa] and the future particle [sa]) and is followed by an enclitic representing an accusative pronoun [ha:]. This complex structure raises the segmentation problem of drawing the morphological boundaries between the sub-constituents of maximum words.

A further element of complexity of Arabic morphology is that agreement rules can vary according to the semantic marking [+/- human] of the verb arguments.³ Without semantic restrictions, the automated analysis of the form شرحك [ʃrħk] offers multiple parses⁴ linked to various lemmas, for example:

شَرَحَكَ
ʃaraħ-a=ka
explain.PST-3SGM=You.2SGM.ACC
'he explained you'

شَرَّحَكَ
ʃarraħ-a=ka
dissect.PST-3SGM=You.2SGM.ACC
'he dissected you'

شَرْحُكَ
ʃarħ-u=ka
explanation-NOM=Your.POSS.2SGM.GEN
'your explanation'

The forms شَرَحَ [ʃaraħa] ('explain') and شَرَّحَ [ʃarraħa] ('dissect') belong to a class of verbs that require 'non-human' direct objects and accept only 3-S/D-M/F accusative pronouns. Interestingly, the grammatically relevant distinction here does not rest on the general meaning of the two verbs, but only on the morpho-syntactic and semantic restrictions that the verbs impose on their arguments [3]. Both verbs are transitive, they take 'non-human' objects, and are incompatible with accusative pronouns in the first and second person (singular, dual and plural) and in the third person plural.
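As a toy illustration of how such argument restrictions could be used to prune candidate parses of شرحك [ʃrħk], consider the following sketch; the lexical entries, feature labels and restriction table are invented for the example and do not reproduce AraMorph's actual compatibility tables:

# Toy sketch (not AraMorph): prune candidate parses of [ʃrħk] using hand-written
# argument restrictions of the kind discussed above.

# Each candidate parse: (lemma, POS, features of the pronominal enclitic).
CANDIDATES = [
    ('ʃaraħa',  'VERB', {'person': 2, 'number': 'SG', 'case': 'ACC'}),  # 'he explained you'
    ('ʃarraħa', 'VERB', {'person': 2, 'number': 'SG', 'case': 'ACC'}),  # 'he dissected you'
    ('ʃarħ',    'NOUN', {'person': 2, 'number': 'SG', 'case': 'GEN'}),  # 'your explanation'
]

# Invented restriction table: these verbs take non-human objects, hence only
# third person singular/dual accusative enclitics are acceptable.
OBJECT_RESTRICTIONS = {
    'ʃaraħa':  {'person': {3}, 'number': {'SG', 'DU'}},
    'ʃarraħa': {'person': {3}, 'number': {'SG', 'DU'}},
}

def acceptable(lemma, pos, clitic):
    """Reject verb parses whose accusative enclitic violates the lexical restrictions."""
    if pos != 'VERB' or clitic.get('case') != 'ACC':
        return True                      # the restriction only concerns verb + accusative pronoun
    restr = OBJECT_RESTRICTIONS.get(lemma)
    if restr is None:
        return True
    return clitic['person'] in restr['person'] and clitic['number'] in restr['number']

surviving = [(l, p, c) for l, p, c in CANDIDATES if acceptable(l, p, c)]
print(surviving)    # only the nominal reading 'your explanation' survives the filter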
¹ Hereafter, depending on the context, Arabic forms are written in Arabic orthography (كَتَبَ), transliterated in italicised Latin characters (kataba), or transcribed in IPA [kataba]. Interlinear glosses follow the standard set of parsing conventions and grammatical abbreviations recommended in the Leipzig Glossing Rules ("The Leipzig Glossing Rules: Conventions for interlinear morpheme-by-morpheme glosses", February 2008). Segmental morphemes are separated by hyphens and clitic boundaries are marked by a "=" sign, both in the IPA transcription and in the interlinear gloss.
² Some prepositions, conjunctions and other particles (hereafter collectively referred to as function words) are morphologically realised as proclitics, while pronouns (in nominative, accusative or genitive case) are enclitics.
³ When the noun refers to a rational entity, namely a person (العاقل), the anaphoric pronoun agrees with it in number (singular, dual, plural) and gender (masculine and feminine). When the noun refers to an irrational entity, i.e. a non-human entity (غير العاقل), the anaphoric pronoun is always in the third person, and agrees with the noun in both number and gender only when the noun is singular or dual; when the noun is plural, the anaphoric pronoun is in the third person singular feminine.
⁴ We use the Arabic Morphological Analyzer suite developed by Tim Buckwalter, known as AraMorph. The Java AraMorph version 1.0 is freely available online (http://www.nongnu.org/aramorph/english/index.html) and contains, besides the rule engine, morphological data distributed over three lexicons (dictPrefixes, dictSuffixes, and dictStems), and three compatibility tables to control combinations of proclitics, stems and enclitics.

To sum up, some orthographic, morphological and syntactic characteristics of Arabic tend to considerably increase the ambiguity of the linguistic analyses produced by a morphological parser. To reduce the problem and regiment the interpolation of missing vowels/diacritics, spelling rules can act as a preliminary filter. Accordingly, we developed a classifier that compares the token in the text with the solutions offered by the morphological engine on the normalised form, and filters out proposed solutions that do not match the diacritic information provided by the original (non-normalised) token [4]. In addition, morpho-syntactic and semantic constraints should be jointly evaluated as early as possible during word parsing. Semantic properties of lexical units can in fact enforce constraints that help predict their syntactic properties. As a mid-term goal, we intend to augment the lexicon with a full list of proclitics and enclitics, so that our classifier can take into account the semantic restrictions of verbs and refine the compatibility among proclitics, minimum words and enclitics.

3. ACQUISITION OF ARABIC TEXTS AND COLLABORATIVE EDITING

In order to study textual phenomena and observe their frequency, it is necessary to build large collections of digital texts. For this reason, the process of digitization may be a bottleneck in Corpus Linguistics, and more generally in Computational Linguistics. Indeed, acquisition and correction of digital texts are time-consuming and expensive tasks. Repositories of out-of-print text images are rapidly increasing (see for instance [5] and [6]) but, excluding editions in English or in some other modern western languages, the machine-actionable texts extracted from those images and available online are often inaccurate. Texts written in historical languages, such as works in Ancient Greek, multilingual texts, such as the notes in the apparatus of a critical edition, and texts written in continuous scripts, such as Arabic or Sanskrit, are the most challenging cases for Optical Character Recognition (OCR) engines, especially commercial ones. Many improvements in the recognition of Arabic scripts have been achieved, as demonstrated in [7] and [8].

The improvements can be obtained in the pre-processing phase, by enhancing the quality of the text images, in the processing phase, by optimizing character segmentation, and in post-processing, by merging multiple outputs and enhancing spell-checking techniques. Indeed, the alignment of the output produced by multiple OCR engines, in conjunction with suitable criteria to select the best outcome at the granularity of the single character, can improve the accuracy of the recognition. For this reason, we have adapted the methods developed for texts written in Ancient Greek [9] to Arabic samples, and the first tests are encouraging, even though further investigation is needed to evaluate the results. The alignment strategy, which involves the application of the Needleman-Wunsch algorithm, allows us to use commercial software, such as Abbyy FineReader, open source software, such as Tesseract/Cube [10], and experimental software, such as the new OCR engine under development at the University "Sidi Mohamed Ben Abdellah" of Fez [11], the main partner of the ILC-CNR in the creation of digital resources and computational tools for the processing of the Arabic language.

In order to facilitate manual corrections by focusing the attention on possible errors, the web application created for the correction of Ancient Greek OCR [12] has been adapted to the peculiarities of the Arabic language. In particular, the current extension takes into account the right-to-left direction of the text to be corrected and provides a spell-checker based on Arabic literary texts. As shown in Figure 1, the image of a text line is followed by the related OCR output. Words underlined with a red wavy line are not recognised by the spell-checker provided by the browser (in this case, Firefox), whereas words coloured in red are not recognised by the spell-checker based on literary texts. In this way, possible errors are identified by crossing two different methods: the first may generate some false positives on literary terms, and the second may generate some false positives on colloquial terms.

Figure 1. Arabic OCR Proof-reading Web App.
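The character-level alignment step mentioned above can be sketched with a minimal Needleman-Wunsch implementation (ours; the scoring values and the transliterated example strings are arbitrary, and this is not the actual pipeline of [9]). Aligned positions where two engines disagree are the natural candidates for output merging or manual proof-reading:

def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align two strings at character granularity; return the padded pair."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
    # traceback
    out_a, out_b, i, j = [], [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append('-'); i -= 1
        else:
            out_a.append('-'); out_b.append(b[j - 1]); j -= 1
    return ''.join(reversed(out_a)), ''.join(reversed(out_b))

ocr1, ocr2 = 'kitab', 'ktaab'            # placeholder transliterated outputs of two engines
a, b = needleman_wunsch(ocr1, ocr2)
disagreements = [k for k, (x, y) in enumerate(zip(a, b)) if x != y]
print(a, b, disagreements)               # positions flagged for voting or manual correction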
4. TOWARDS A LITERARY ARABIC WORDNET

To provide as much lexico-semantic information as possible at the very early stages of Arabic text processing, we developed a strategy to augment the Arabic WordNet [13]. WordNet [14] is a lexical semantic resource that aims at creating, managing and organising a conceptual net through concepts expressed by words. The model contains, among other types of information, words, providing information about individual lemmas, and senses, providing the sense(s) associated with individual lemmas. Synsets are defined as sets of senses associated with lemmas referring to the same concept. For example, build up, fortify and gird are (quasi) synonyms of one sense of the lemma arm as a verb (e.g. in the sentence "The farmers armed themselves with rifles"). WordNet also contains semantic relations among synsets, such as hyponymy, hypernymy, meronymy, antonymy and so on.

Our basic idea was to update a WordNet-like resource by adding information gathered from a bilingual dictionary. This strategy is general-purpose and widely used for other languages [15], [16], [17]. Due to lexical polysemy, however, the use of bilingual links to expand a network of word meanings must be carefully constrained to avoid the inclusion of spurious associations. Bilingual information must thus be cross-checked with the content and structure of AWN, with a view to obtaining two principal results: new Arabic lemmas are added to the WordNet and clustered into synsets; existing lemmas are rearranged in terms of new synonymic patterns. As a result, new and existing lemmas are grouped together into new synonymic clusters.
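For readers less familiar with the synset model used throughout this section, the following minimal snippet shows how senses and synset relations can be inspected in Princeton WordNet through the NLTK interface (our illustration, not necessarily the tooling used in this work):

# Requires: pip install nltk, then nltk.download('wordnet')
from nltk.corpus import wordnet as wn

# Senses of 'arm' as a verb; one of them groups arm / build up / fortify / gird.
for s in wn.synsets('arm', pos=wn.VERB):
    print(s.name(), s.lemma_names(), '-', s.definition())

# Relations hold between synsets, e.g. the holonymy and antonymy mentioned in the text.
print(wn.synset('arm.n.01').part_holonyms())   # synsets of which the body part 'arm' is a part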
4.1 Resources Used

In this section, we list and briefly describe the lexical resources used to update the Arabic WordNet. Interested readers may find more information in [13].

Princeton WordNet (PWN), version 3.0, contains 147,306 distinct words and 117,659 distinct synsets (collections of synonyms), which are associated with 206,941 senses. Nouns and verbs, with 146,312 and 25,047 senses respectively, cover 70.1% and 12.1% of the total amount. PWN contains relations, such as "antonymy" and "part holonymy", which, in addition to "hypernymy/hyponymy", connect synsets. For example, arm as a limb is part of the human body (part holonym); disarm is an antonym of arm, etc.

The Arabic WordNet (AWN) contains 12,114 distinct words and 9,576 synsets, which are associated with 17,419 senses. Nouns and verbs cover 84.2% and 14.1% of the senses respectively. The Arabic WordNet contains many geographic/political named entities (about 1,000) and a set of Common Base Concepts (CBCs). Vocalization of the Arabic words is partially inconsistent [18].

DictStems is the Arabic-English bilingual dictionary that, along with the dictSuffixes and dictPrefixes resources, is part of the AraMorph suite. This dictionary contains more than 30,000 Arabic words and more than 25,000 English translational equivalents. The dictStems dictionary is a component of the morphological engine and does not properly serve as a bilingual lexical resource. However, we think that it can be used as a good starting point to enhance the Arabic WordNet, since it contains many Arabic-English word pairs. Nonetheless, we are fully aware of the limitations of this resource. For example, in Arabic, the sense of become is translated by the verb صَارَ [sˁa:ra] when expressed without a temporal connotation, and by بَاتَ [ba:ta], أَصْبَحَ [ʔasˁbaħa] and أَمْسَى [ʔamsa:] when a temporal connotation (night, morning and evening respectively) is added. However, dictStems translates both بَاتَ [ba:ta] and أَصْبَحَ [ʔasˁbaħa] with the generic verb become. Similarly, many English translations are provided by the periphrastic combination of a verb plus an adjective/noun, for example become old/rich, give breakfast, be hostile, etc. In most of these cases, although the individual English words in the periphrastic translation are present in PWN, their combinations are not present as independent PWN entries in their own right.

The different nature of AWN and PWN on the one hand, and dictStems on the other, explains the limited overlap in word coverage between the first two and dictStems. About 60% of the English words in dictStems are found in PWN, but only nearly 10% of the Arabic words in dictStems are in AWN. The low coverage of AWN with respect to the Arabic words in dictStems can be explained as a result of how AWN is designed and how its coverage percentage is calculated. AWN was designed to contain 1,000 named entities and 9,000 basic concepts. This means that coverage of Arabic words in AWN would increase if we limited ourselves to considering words in dictStems that are either basic concepts or named entities.

4.2 AWN: Enhancing Strategy

The augmenting algorithm of AWN is depicted by the flowchart of Figure 2. From the list of Arabic-English word pairs, we have selected the entries where a single Arabic word is translated by more than one English word.

Figure 2. Workflow for Arabic WordNet Enhancement.

For each English translational equivalent, we extracted all synsets from Princeton WordNet that contain it, and considered how many synsets were shared by the family of all translational equivalents of the same Arabic word. In calculating common synsets, three cases may occur: all English words in the same translational family share all synsets (spread=1 in Figure 2), that is to say they are synonyms under every possible sense they may take; words share no synsets (spread=0); only some synsets are shared (spread between 0 and 1). In the last case, words are synonyms only in certain senses. We have focused on the spread=1 case, in line with the following assumption: two (or more) Arabic words that are translated by more than one English word sharing all their synsets (fully synonymic translational equivalents) are strong candidates for being synonyms in Arabic too.
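The spread measure can be rendered schematically as follows (our formalisation of the idea, computed over PWN via NLTK; the entry shown is hypothetical and the exact formula used in [13] may differ):

from nltk.corpus import wordnet as wn

def spread(english_equivalents):
    """Share of PWN synsets common to *all* translational equivalents of one Arabic word:
    1.0 means the equivalents are synonyms under every sense, 0.0 means no shared sense."""
    synset_sets = [set(wn.synsets(w)) for w in english_equivalents]
    union = set.union(*synset_sets)
    if not union:
        return 0.0
    return len(set.intersection(*synset_sets)) / len(union)

# Hypothetical dictStems-style entry: one Arabic lemma glossed by two English words.
print(spread(['gift', 'present']))   # a value of 1.0 would flag a fully synonymic family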
Table 1. Mapping results

# Lines    # Synsets    # Mapped Synsets    %
203        233          76                  32.6

# Arabic Words    # Arabic Words in AWN    # Discovered Synonyms    %
203               11                       42                       20.7

The number of such clues is very limited: out of 34,000 candidate word pairs, only 200 pairs fulfil the requirements above. However, these 200 items provide very interesting results (see Table 1). Manual validation of the results in Table 1 shows that 38 of the 42 candidate synonyms selected by the algorithm are true (newly discovered) synonyms.

We are currently investigating two additional improvements to augment AWN. One strategy is to analyse word pairs with different spreads. In previous work [13], we showed that there are about 9,000 word pairs with a spread between 0 and 1. We expect spread values to correlate with accuracy in the resulting candidate synonyms, with values close to 1 being more likely to select reliable candidates. Another strategy can be suggested along the lines of previous work we carried out on Ancient Greek [19]. We can select entries from the bilingual resource where the Arabic lemma is translated by an English phrasal structure expressing a subsumption (IS-A) relation. For example, in dictStems, [zahorijjah] is translated as "flower vase", with "vase" being the head of the English translation. By parsing the English phrase and extracting its syntactic head, we can eventually select the Arabic word [zahorijjah] as a candidate hyponym of the Arabic equivalent of "vase": [ʔina:ʔ].
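A naive sketch of this head-extraction heuristic is given below; the bilingual entries and the English-to-Arabic head mapping are purely illustrative, and real glosses would call for genuine syntactic parsing:

# Naive sketch: for a simple 'modifier(s) + noun' English gloss, treat the final noun
# as the head and propose the Arabic lemma as a candidate hyponym of that head's
# Arabic equivalent. Illustration only; not the actual extraction procedure.

def head_of(gloss):
    """Assume the head of a plain modifier-noun gloss is its last token."""
    return gloss.split()[-1]

entries = [('zahorijjah', 'flower vase')]          # hypothetical dictStems-style entry
english_to_arabic_head = {'vase': 'ʔina:ʔ'}        # illustrative equivalence only

for arabic, gloss in entries:
    head = head_of(gloss)
    hypernym = english_to_arabic_head.get(head)
    if hypernym:
        print(f'candidate IS-A: {arabic} -> {hypernym} (via English head "{head}")')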
5. BIO-COMPUTATIONAL DYNAMICS OF ARABIC MORPHOLOGY

Arabic non-concatenative morphology seriously challenges the design of a computational simulation of a biologically inspired neural architecture of the mental lexicon. Since lexical organisation relies on the process of input recoding and on adaptive strategies for long-term memory, modelling the mental lexicon focuses on both processing and storage dynamics. A fundamental issue in lexical processing is the emergence of a morphological level of organisation in the lexicon, based on paradigmatic relations between fully-stored word forms. The emergence of morphological structure can be defined as the task of identifying recurrent morphological formatives within morphologically complex word forms. In the perspective of adaptive strategies for lexical acquisition and processing based on emergent morphological relations between fully-stored word forms (an abstractive approach, after Blevins [20]), paradigmatic relations can be accounted for as the result of the long-term entrenchment of neural circuits (chains of memory nodes) that are repeatedly activated by a memory map in the process of recoding input words into a two-dimensional lexical layer.

In the computational framework of Temporal Self-Organising Maps (TSOMs, [21], [22], [23], [24]), based on Self-Organising Maps with Hebbian connections defined over a temporal layer, the identification/perception of surface morphological relations involves the co-activation of recoded representations of morphologically-related input words. The highly non-linear nature of Arabic morphology urges the development of co-activated memory chains, namely the propensity to respond to the same symbol at different positions in time. Generally, a node that is selectively sensitive to a particular symbol in a specific ordered position reaches a high level of co-activation when the same symbol is shifted by a few positions. Repeated exposure to the underlying structure of Arabic word forms enforces both a mild sensitivity to time-invariant symbol encoding (intra-paradigmatic relations) and a position sensitivity to time-bound instances of the same vowel symbol (inter-paradigmatic relations).

Traditionally, lexical processing has been modelled in terms of basic mechanisms of human memory for serial order, as proposed in the vast literature on immediate serial recall and visual word recognition ([25], [26], for detailed reviews). Some of the earliest psychological accounts of serial order assume that item sequences are represented as temporal chains made up of stimulus-response links. However, it is difficult, for example, to align word forms of differing lengths [27], hindering recognition of the sequences shared by morphologically-related forms. Conventionally, the task of singling out morphological formatives within morphologically complex word forms has been taken to model morphology induction. Accordingly, the issue of word alignment appears to be crucial for morphology induction. Since languages widely differ in the way morphological information is sequentially encoded, ranging from suffixation to prefixation, circumfixation, infixation and combinations thereof, a principled solution to alignment issues is needed. Thus, algorithms for morphology acquisition should be adaptable to the morphological structure of a target language and, with identical settings of initial parameters and a comparable set of assumptions concerning input representations, should be able to successfully deal with inflectional systems as diverse as, for example, Italian, German and Arabic. In this perspective, the non-concatenative morphology of Arabic seriously challenges traditional morpheme-based approaches [28], [29], while suggesting a reappraisal of morphology induction through adaptive memory self-organisation strategies.

The computational framework of TSOMs, through chains of activated nodes encoding time sequences of symbols, is conducive to enforcing alignment through synchrony (i.e. co-activation) (for a detailed description see [30]). In fact, node co-activation is the immediate correlate of the perception of similarity between strings: the extent to which two strings are perceived as similar by the map is given by the number of highly co-activated nodes that are involved in processing them. At their core, TSOMs are dynamic memories trained to store and classify input stimuli through patterns of activation of map nodes (simulating memory traces). A pattern of node activation is the processing response of a map when a stimulus is input. After being trained on a set of stimuli, the map gets increasingly attuned to respond to similar stimuli with largely overlapping activation patterns and/or synchronously co-activated nodes. It has been shown that, in TSOMs trained on the inflectional morphology of Italian and German [31], intra- and inter-paradigmatic families of related forms induce a strongly paradigm-related co-organisation and co-activation of nodes, which facilitates paradigmatic extension. Concatenative morphologies, in fact, are shown to exhibit a strong correlation between the topological distance (proximity) of node chains on the map and the morphological similarity of their corresponding word forms.
The correlation reflects the topological organisation of a map into clusters of co-activating nodes (e.g. nodes that respond to either the same stimuli or to very similar stimuli), and the sensitivity of the map to the immediately preceding context of the current stimulus. On the contrary, for those (non-concatenative) morphologies where structures are misaligned [32], perception of internal structure is better correlated with high levels of co-activation than with topological proximity of memory nodes.

Discontinuous morphological formatives, e.g. roots in the Arabic inflectional system, appear embedded in considerably different contexts, as for example in kataba ('he wrote') and yaktubu ('he writes'). By topologically aligning these two morphologically-related word forms, the sub-lexical constituents they share will hardly emerge. This may keep the nodes responding to the root (k-t-b) in the two forms far apart on artificial brain maps. Nonetheless, the responses of the two pools of nodes are maximally synchronised when the root k-t-b is presented. These levels of co-activation can be taken to mean that k, t and b are perceived as instantiations of the same consonantal skeleton in both word forms. Figure 3 provides an example of the differences in node co-activation levels associated with the two input verb forms.

Figure 3. Co-activation level distances for the input forms kataba ('he wrote') and yaktubu ('he writes').

As previously claimed, lexical processing relies on the emergence of paradigmatic relations between fully-stored morphologically complex words. TSOMs adhere perfectly to this perspective, since they become able, during unsupervised training, to find out what is common and what is different within any set of paradigmatically-related verb forms, as a dynamic result of persistent co-activation levels across forms within paradigms (intra-paradigmatic relations) and between paradigms (inter-paradigmatic relations). In addition, both intra- and inter-paradigmatic relations are determinants of paradigmatic extension and generalisation. In detail, extending the underlying structure of known Arabic verb forms to novel forms requires both sensitivity to the time-invariant symbol encoding of the root consonant skeleton (as shown in Figure 3) and a position sensitivity to time-bound instances of the same vowel symbol (see Figure 4 and Figure 5).

Figure 4. Co-activation level distances for the imperfective input forms yaktubu ('he writes') and yadxulu ('he enters').

Figure 5. Co-activation level distances for the perfective input forms kataba ('he wrote') and daxala ('he entered').

Due to the chiefly non-concatenative nature of consonantal roots and vowel patterns, and the concurrent presence of prefixes and suffixes, TSOMs can provide a flexible and adaptive computational approach to Arabic morphology, capable of casting its highly non-linear nature into a self-organising network of densely interconnected memory nodes, where distributed populations of memory nodes can be trained to selectively respond to either time-invariant or context-sensitive recodings of symbols.

6. GENERAL DISCUSSION

Over the last decades, a growing body of evidence on the mechanisms governing lexical storage, access, acquisition and processing has questioned traditional computational models of language architecture and word usage.
By pulling together the cognitive, neuro-functional and psycho-computational implications of these mechanisms, a new view of the language architecture is emerging, based on the dynamic interaction between storage and processing [33], [34], [35], [36], [37]. According to such an "integrative" conception of the mental lexicon (as opposed to the "discriminative" view that takes lexical processes and lexical representations to belong to two distinct and neuro-anatomically segregated modules of the human language processor), several levels of linguistic knowledge are understood to be interfaced in the lexicon, and to simultaneously interact in word processing. Word forms are concurrently, redundantly and competitively stored in a distributed network of perisylvian brain circuits, whose organisation appears to comply more with a principle of maximization of redundant opportunities [38], [39] than with requirements of minimization of storage and logical necessity. In this perspective, words are not localised in a single brain region, but are better viewed as emergent properties of the functional interaction between different brain regions.

Computational modelling of such a novel view crucially rests on an activation metaphor: exposure to an input word triggers the distributed activation of clusters of parallel processing units (or nodes), each of which tends to respond more strongly to specific instances of an input symbol (e.g. a letter or a sound of an input word). In addition, nodes are arranged on a bi-dimensional grid (a map) that is topologically organised: nodes that respond to similar stimuli tend to lie close on the map. Crucially, a map's organisation is not wired-in and given once and for all, but is the outcome of a process of adaptive self-organisation, which heavily depends on the underlying structure of the training data. In particular, three basic factors appear to affect self-organisation: (i) similarity: similar symbols trigger overlapping activation patterns; (ii) frequency: frequent symbols tend to recruit dedicated nodes; and (iii) symbol timing: nodes react differently depending on the time-bound context where a symbol is repeatedly found. Activation of such self-organising maps by an input word thus reflects some basic structural properties of the training words to be stored, making room for the wide range of cross-linguistic typological variability in the marking of morphological information. Accordingly, the perception of similarity between words may depend on the most recurrent patterns shared by inflected words that belong to the same paradigmatic family. Adaptivity to recurrent morphological patterns allows the TSOMs reviewed in section 5 to adjust themselves to morphological systems as different as German or Italian conjugation on the one hand, and Arabic conjugation on the other.

We believe that an important lesson can be learned from the study of the behaviour of Temporal Self-Organising Maps. Simultaneous activation of several competing candidates for lexical access or production, far from being the occasional outcome of a severely underspecified or noisy input, is the ordinary operational state of the mental lexicon. A word's memory trace tends to contain, over and above its proper representational nodes, also nodes that belong to other lexical neighbours. Activation of one such distributed representation thus prompts co-activation of other representations that compete for selection.
In an optimally organised lexicon, connections between nodes are distributed in such a way that co-activation does not hinder selective lexical access. However, we know that competition between neighbouring forms, which are attested with different frequencies, may occasionally cause interference and delay, with recognition and production becoming suboptimal [40], [41], [42].

In the light of these remarks, some of the core problems associated with Arabic Natural Language Processing (exemplified and reviewed in sections 2, 3 and 4) strike us as being connected with some persisting limitations of traditional processing architectures, where rules and lexical representations are logically distinct and functionally segregated, and different levels of linguistic representation (morphological, morpho-syntactic and semantic) are accessed one after the other. Our experience with Arabic NLP through a variety of approaches and software architectures shows that a new conception of language processing as a classification task, resulting from the on-line, probabilistically-supported resolution of concurrent and conflicting multi-level constraints, is probably closer to the way speakers understand and produce their language. Over the last fifteen years, such a view has been implemented through stochastic parsing architectures, based on information-theoretic principles such as Maximum Entropy and Bayesian learning. Recent artificial neural networks such as TSOMs show that such a dynamic and adaptive view of language processing can effectively be implemented by biologically-inspired networks.

By understanding more of the way the human processor deals with issues of concurrent and competitive word storage, we can gain many insights into the architecture of language in the brain. In addition, we suggest that effective integration of some of these insights within computer processing architectures can be instrumental in designing and developing more adaptive and intelligent systems, where processing resources and structures are dynamically apportioned as a function of past experience (long-term, cumulative frequency effects) and salience (short-term context-sensitivity effects) in the input.

REFERENCES

[1] Tsarfaty, R., Seddah, D., Kübler, S., and Nivre, J. 2013. Parsing Morphologically Rich Languages: Introduction to the Special Issue. Computational Linguistics 39, 1, 15-22.
[2] Dichy, J. 1997. Pour une lexicomatique de l'arabe: l'unité lexicale simple et l'inventaire fini des spécificateurs du domaine du mot. Meta 42, 2, 291-306.
[3] Jackendoff, R. 2002. Foundations of Language. Brain, Meaning, Grammar, Evolution. Oxford University Press, New York.
[4] Nahli, O. 2013. Computational contributions for Arabic language processing. The automatic morphologic analysis of Arabic texts. Studia graeco-arabica 3, 195-206.
[5] Internet Archive: http://archive.org/.
[6] Google Books: http://books.google.com/.
[7] Dichy, J. and Kanoun, S., Eds. 2013. Linguistic Knowledge integration in optical Arabic word and text recognition process. Linguistica Communicatio 15, 1-2.
[8] Märgner, V. and El Abed, H. 2012. Guide to OCR for Arabic Scripts. Springer, London.
[9] Boschetti, F., Romanello, M., Babeu, A., Bamman, D., and Crane, G. 2009. Improving OCR accuracy for classical critical editions. In Proceedings of the 13th European Conference on Research and Advanced Technology for Digital Libraries (ECDL'09), M. Agosti, J. Borbinha, S. Kapidakis, C. Papatheodorou and G. Tsakonas, Eds. Springer-Verlag, Berlin, Heidelberg, 156-167.
[10] Tesseract/Cube: http://code.google.com/p/tesseract-ocr/.
[11] Lasri, Y. 2014. Contribution à la reconnaissance optique (OCR) du texte arabe imprimé. MA Thesis, Université "Sidi Mohamed Ben Abdellah", Fès.
[12] Boschetti, F. 2013. Acquisizione e Creazione di Risorse Plurilingui per gli Studi di Filologia Classica in Ambienti Collaborativi. In Collaborative Research Practices and Shared Infrastructures for Humanities Computing, M. Agosti and F. Tomasi, Eds. Proceedings of Revised Papers AIUCD 2013 (Padua, Italy, December 11-12, 2013), 55-67.
[13] Del Gratta, R. and Nahli, O. 2014. Enhancing Arabic WordNet with the use of Princeton WordNet and a bilingual dictionary. IEEE - CiST'14 Colloquium on Information Science and Technology - ANLP Invited Session.
[14] Fellbaum, C., Ed. 1998. WordNet: An Electronic Lexical Database (Language, Speech, and Communication). The MIT Press, Cambridge, MA.
[15] Sagot, B. and Fišer, D. 2011. Extending Wordnets by learning from multiple resources. In LTC'11: 5th Language and Technology Conference, Poznań, Poland.
[16] Rodríguez, H., Farwell, D., Farreres, J., Bertran, M., Martí, M.A., Black, W., Elkateb, S., Kirk, J., Vossen, P., and Fellbaum, C. 2008. Arabic WordNet: Current State and Future Extensions. In Proceedings of the Fourth International Global WordNet Conference, 387-406.
[17] Vossen, P., Ed. 1998. EuroWordNet: A Multilingual Database with Lexical Semantic Networks. Kluwer Academic Publishers, Norwell, MA, USA.
[18] Fellbaum, C., Alkhalifa, M., Black, W.J., Elkateb, S., Pease, A., Rodríguez, H., and Vossen, P. 2006. Building a WordNet for Arabic. In Proceedings of the 5th Conference on Language Resources and Evaluation (ELRA - LREC 2006, Genova), 29-34.
[19] Boschetti, F., Del Gratta, R., and Lamé, M. 2014. Computer Assisted Annotation of Themes and Motifs in Ancient Greek Epigrams: First Steps. In Proceedings of CLIC, Computational Linguistics Italian Conference, Pisa, Italy.
[20] Blevins, J.P. 2006. Word-based morphology. Journal of Linguistics 42, 531-573.
[21] Ferro, M., Pezzulo, G., and Pirrelli, V. 2010. Morphology, Memory and the Mental Lexicon. In Lingue e Linguaggio IX, 2 (Interdisciplinary aspects to understanding word processing and storage), V. Pirrelli, Ed. Il Mulino, Bologna, 199-238.
[22] Pirrelli, V., Ferro, M., and Calderone, B. 2011. Learning paradigms in time and space. Computational evidence from Romance languages. In Morphological Autonomy: Perspectives from Romance Inflectional Morphology, M. Goldbach, M.O. Hinzelin, M. Maiden, and J.C. Smith, Eds. Oxford University Press, Oxford, 135-157.
[23] Marzi, C., Ferro, M., and Pirrelli, V. 2012. Word alignment and paradigm induction. Lingue e Linguaggio XI, 2, 251-274.
[24] Marzi, C., Ferro, M., and Pirrelli, V. 2014. Morphological structure through lexical parsability. Lingue e Linguaggio XIII, 2, 263-290.
[25] Henson, R.N.A. 1999. Coding position in short-term memory. International Journal of Psychology 34, 5-6, 403-409.
[26] Davis, C.J. 2010. The spatial coding model of visual word identification. Psychological Review 117, 3, 713-758.
[27] Davis, C.J. and Bowers, J.S. 2004. What do letter migration errors reveal about letter position coding in visual word recognition? Journal of Experimental Psychology: Human Perception and Performance 30, 923-941.
[28] Halle, M. and Marantz, A. 1993. Distributed Morphology and the pieces of inflection. In The View from Building 20, K. Hale and S.J. Keyser, Eds. MIT Press, Cambridge, MA, 111-176.
[29] Embick, D. and Halle, M. 2005. On the Status of Stems in Morphological Theory. In Romance Languages and Linguistic Theory 2003, T. Geerts, I. van Ginneken, and H. Jacobs, Eds. John Benjamins, Amsterdam, 37-62.
[30] Marzi, C. 2014. Models and Dynamics of the Morphological Lexicon in Mono- and Bilingual Acquisition. Unpublished PhD dissertation, University of Pavia.
[31] Marzi, C., Ferro, M., Caudai, C., and Pirrelli, V. 2012. Evaluating Hebbian Self-Organizing Memories for Lexical Representation and Access. In Proceedings of the 8th International Conference on Language Resources and Evaluation (ELRA - LREC 2012, Malta), 886-893.
[32] Marzi, C., Nahli, O., and Ferro, M. 2014. Word Processing for Arabic Language. IEEE - CiST'14 Colloquium on Information Science and Technology - ANLP Invited Session.
[33] Hickok, G.M. and Poeppel, D. 2004. Dorsal and ventral streams: a framework for understanding aspects of the functional anatomy of language. Cognition 92, 67-99.
[34] D'Esposito, M. 2007. From cognitive to neural models of working memory. Philosophical Transactions of the Royal Society B: Biological Sciences 362, 761-772.
[35] Saur, D., Kreher, B.W., Schnell, S., Kümmerer, D., Kellmeyer, P., Vry, M.-S., Umarova, R., Musso, M., Glauche, V., Abel, S., Huber, W., Rijntjes, M., Hennig, J., and Weiller, C. 2008. Ventral and dorsal pathways for language. Proceedings of the National Academy of Sciences 105, 46, 18035-18046.
[36] Forkel, S.J., Thiebaut de Schotten, M., Dell'Acqua, F., Kalra, L., Murphy, D.G.M., Williams, S.C.R., and Catani, M. 2014. Anatomical predictors of aphasia recovery: a tractography study of bilateral perisylvian language networks. Brain 137, 2027-2039.
[37] Ma, W.J., Husain, M., and Bays, P.M. 2014. Changing concepts of working memory. Nature Neuroscience 17, 3, 347-356.
[38] Libben, G. 2005. Everything is psycholinguistics: Material and methodological considerations in the study of compound processing. Canadian Journal of Linguistics 50, 267-283.
[39] Baayen, R.H. 2007. Storage and computation in the mental lexicon. In The Mental Lexicon: Core Perspectives, G. Jarema and G. Libben, Eds. Elsevier, 81-104.
[40] Luce, P., Pisoni, D., and Goldinger, S.D. 1990. Similarity neighborhoods of spoken words. In Cognitive Models of Speech Processing: Psycholinguistic and Computational Perspectives, G.T.M. Altmann, Ed. MIT Press, Cambridge, MA, 122-147.
[41] Huntsman, L.A. and Lima, S.D. 2002. Orthographic Neighbors and Visual Word Recognition. Journal of Psycholinguistic Research 31, 289-306.
[42] Goldrick, M., Folk, J.R., and Rapp, B. 2010. Mrs. Malaprop's neighborhood: Using word errors to reveal neighborhood structure. Journal of Memory and Language 62, 2, 113-13.