Humanities and Their Methods in the Digital Ecosystem

3rd AIUCD Conference - 2014 (AIUCD’14) - Selected papers
in collaboration with
ALMA MATER STUDIORUM
UNIVERSITY OF BOLOGNA
SCHOOL OF ARTS, HUMANITIES, AND CULTURAL HERITAGE
DEPARTMENT OF CLASSICAL PHILOLOGY AND ITALIAN STUDIES
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
Humanities and Their Methods in the
Digital Ecosystem
Proceedings of AIUCD 2014
Selected papers
Ed. by Francesca Tomasi, Roberto Rosselli Del Turco,
and Anna Maria Tammaro
The Association for Computing Machinery
2 Penn Plaza, Suite 701
New York New York 10121-0701
ACM COPYRIGHT NOTICE. Copyright © 2015 by the Association for Computing Machinery,
Inc. Permission to make digital or hard copies of part or all of this work for
personal or classroom use is granted without fee provided that copies are not made or
distributed for profit or commercial advantage and that copies bear this notice and the
full citation on the first page. Copyrights for components of this work owned by others
than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to
republish, to post on servers, or to redistribute to lists, requires prior specific permission
and/or a fee. Request permissions from Publications Dept., ACM, Inc.,
fax +1 (212) 869-0481, or [email protected].
For other copying of articles that carry a code at the bottom of the first or last page,
copying is permitted provided that the per-copy fee indicated in the code is paid
through the Copyright Clearance Center, 222 Rosewood Drive, Danvers, MA 01923,
+1-978-750-8400, +1-978-750-4470 (fax).
ACM ISBN: 978-1-4503-3295-8
CONFERENCE COMMITTEE
General Chair
Dino Buzzetti, University of Bologna, Italy – AIUCD President
Local Chair
Francesca Tomasi, University of Bologna, Italy - Department of Classical Philology and
Italian Studies
Program Committee
Maristella Agosti, University of Padova, Italy;
Fabio Ciotti, University of Roma Tor Vergata, Italy;
Maurizio Lana, University of Piemonte Orientale, Italy;
Federico Meschini, University of Tuscia, Italy;
Nicola Palazzolo, University of Perugia, Italy;
Roberto Rosselli Del Turco, University of Torino, Italy;
Anna Maria Tammaro, University of Parma, Italy.
Local Program Committee (University of Bologna)
Luisa Avellini - Department of Classical Philology and Italian Studies
Federica Rossi - Head of the Library of the Department of Classical Philology and Italian Studies
Marialaura Vignocchi - Head of AlmaDL and CRR-MM
Fabio Vitali - Department of Computer Science and Engineering
Secretary (University of Bologna)
Marilena Daquino - Research assistant
Federico Nanni - PhD
PREFACE
The conference call and topics
AIUCD 2014,1 the third AIUCD (Associazione per l’Informatica Umanistica e la Cultura
Digitale2) Annual Conference, was devoted to discussing the role of Digital Humanities in the
current research practices of the traditional humanities disciplines.
The introduction of computational methods prompted a new characterization of the
methodology and the theoretical foundations of the human sciences and a new conceptual
understanding of the traditional disciplines. Art, archeology, philology, philosophy, linguistics,
bibliography, history, archival sciences and diplomatics, as well as social and communication
sciences, avail themselves of computational methods to formalize their research questions,
and to innovate their practices and procedures. A profound reorganization of disciplinary
canons is therefore implied.
Newly emerging notions such as the Semantic Web, Linked Open Data, digital libraries, digital
archives, digital museum collections, information architecture and information visualization have
turned into key issues of humanities research. A close comparison of research procedures in
the traditional disciplines and in the digital humanities becomes inevitable in order to detect
convergences and to renew their tools and methods from a new interdisciplinary and
multidisciplinary perspective.
We therefore invited submissions related, but not confined, to the following topics:
— Digital humanities and the traditional disciplines
— Computational concepts and methods in the humanities research practices
— Computational methods and their impact on the traditional methodologies
— The emergence of new disciplinary paradigms
Submission, review and selection process
The call attracted considerable interest in these topics from both the computer science and the
humanities communities: 29 abstracts were submitted for a first evaluation; of these, 10 were
accepted as main conference talks and 6 as posters. Final papers were subsequently
submitted for a first peer review, carried out by the members of the Program Committee as
well as by additional reviewers. A second peer review was necessary in order to verify the
authors' acceptance of the reviewers' suggestions. Eventually, 13 submissions were accepted
for the present proceedings, organized into full papers and poster papers. The invited talk
papers were also evaluated by the Program Committee.
Acknowledgments
Our thanks go to the School of Arts, Humanities, and Cultural Heritage, the Department of
Classical Philology and Italian Studies, and the Department of Computer Science and
Engineering for sponsoring the event.
We also thank all the attendees of the conference: more than 100 people physically attended
the meeting in Bologna.
The Program Committee thanks our external reviewers for their effort in the review process.
Finally, we are grateful to the Library of the Department of Classical Philology and Italian
Studies for agreeing to host the workshop.
1 Conference web site: http://aiucd2014.unibo.it/.
2 FB: http://www.facebook.com/groups/aiucd; WIKI: http://linclass.classics.unibo.it/udwiki/; WEB SITE: http://www.umanisticadigitale.it/; LIST: [email protected]; BLOG: http://infouma.hypotheses.org/.
TABLE OF CONTENTS
INVITED TALK PAPERS
Keynote address
· Article 1 - Manfred Thaller. Are the Humanities an Endangered or a Dominant Species in
the Digital Ecosystem? An Interview
Parallel methodologies
· Article 2 - Paola Italia, Fabio Vitali, and Angelo Di Iorio. Variants and Versioning
between Textual Bibliography and Computer Science
· Article 3 - Karen Coyle, Gianmaria Silvello, and Annamaria Tammaro. Comparing
Methodologies: Linked Open Data and Digital Libraries
PANELS
· Article 4 - Dino Buzzetti, Claudio Gnoli, and Paola Moscati. The Digital Humanities Role
and the Liberal Disciplines
· Article 5 - Andrea Angiolini, Francesca Di Donato, Luca Rosati, Federica Rossi, Enrica
Salvatori, and Stefano Vitali. Digital Humanities: Crafts and Occupations (Lavori e
professioni nel campo delle Digital Humanities)
FULL PAPERS
· Article 6 - Costanza Giannaccini and Susanne Müller. Burckhardtsource.org: A Semantic
Digital Library
· Article 7 - Maristella Agosti, Nicola Orio, and Chiara Ponchia. Guided Tours Across a
Collection of Historical Digital Images
· Article 8 - Marilena Daquino and Francesca Tomasi. Ontological Approaches to
Information Description and Extraction in the Cultural Heritage Domain
· Article 9 - Silvia Stoyanova and Ben Johnston. Remediating Giacomo Leopardi’s
Zibaldone: Hypertextual Semantic Networks in the Scholarly Archive
· Article 10 - Linda Spinazzè. ‘Cursus in clausula’, an Online Analysis Tool of Latin Prose
· Article 11 - Matteo Abrate, Angelo Del Grosso, Emiliano Giovannetti, Angelica Lo Duca,
Andrea Marchetti, Lorenzo Mancini, Irene Pedretti, and Silvia Piccini. The Clavius on the
Web Project: Digitization, Annotation and Visualization of Early Modern Manuscripts
· Article 12 - Vito Pirrelli, Ouafae Nahli, Federico Boschetti, Riccardo Del Gratta, and
Claudia Marzi. Computational Linguistics and Language Physiology: Insights from Arabic
NLP and Cooperative Editing
· Article 13 - Pietro Giuffrida. Aristotelian Cross-References Network: A Case Study for
Digital Humanities
· Article 14 - Cristina Marras. Exploring Digital Environments for Research in Philosophy:
Metaphorical Models of Sustainability
POSTER PAPERS
· Article 15 - Alessio Piccioli, Francesca Di Donato, Danilo Giacomi, Romeo Zitarosa, and
Chiara Aiola. Linked Open Data Portal. How to Make Use of Linked Data to Generate
Serendipity
· Article 16 - Marc Lasson and Corinne Manchio. Measuring the Words: Digital Approach
to the Official Correspondence of Machiavelli
· Article 17 - Antonella Brunelli. PhiloMed: An Online Dialogue between Philosophy and
Biomedical Sciences
· Article 18 - Marion Lamé. Primary Sources of Information, Digitization Processes and
Dispositive Analysis
Computational Linguistics and Language Physiology:
Insights from Arabic NLP and Cooperative Editing
Vito Pirrelli, Ouafae Nahli, Federico Boschetti, Riccardo Del Gratta, Claudia Marzi
Institute of Computational Linguistics “A. Zampolli”, CNR
via Moruzzi 1
Pisa, Italy
{vito.pirrelli, ouafae.nahli, federico.boschetti, riccardo.delgratta, claudia.marzi}@ilc.cnr.it
ABSTRACT
Computer processing of written Arabic raises a number of
challenges to traditional parsing architectures on many levels of
linguistic analysis. In this contribution, we review some of these
core issues and the demands they make, and suggest different
strategies to tackle them successfully. Finally, we assess these
issues in connection with the behaviour of neuro-biologically
inspired lexical architectures known as Temporal Self-Organising
Maps. We show that, far from being language-specific problems,
issues in Arabic processing can shed light on some fundamental
characteristics of the human language processor, such as
structure-based lexical recoding, concurrent, competitive
activation of output candidates and dynamic selection of optimal
solutions.
Categories and Subject Descriptors
H.3.1 [Information Storage and Retrieval]: Content Analysis
and Indexing – linguistic processing; I.2.0 [Artificial
Intelligence]: General – cognitive simulation; I.2.4 [Artificial
Intelligence]: Knowledge Representation Formalisms and
Methods – semantic networks; I.2.6 [Artificial Intelligence]:
Learning – connectionism and neural nets, language acquisition;
I.2.7 [Artificial Intelligence]: Natural Language Processing –
language parsing and understanding; I.7.5 [Document and
Text Processing]: Document Capture – optical character recognition.
Keywords
Non-concatenative morphology, Optical Character Recognition,
WordNet, Temporal Self-organising Maps, Mental Lexicon,
Language neuro-physiology.
1. INTRODUCTION
Over the last twenty years, Computational Linguistics has
considerably changed the way natural language is perceived as an
object of scientific inquiry. We have witnessed a gradual shift of
emphasis from purely abstract and formal aspects of language
structure to a task-based investigation of the role of language in
daily communicative exchanges and as a means of conveying,
accessing and manipulating information. However, there is a real
risk that such a shift of emphasis from competence to performance,
from general models of language to functionally-specified aspects
of language communication, may yield a motley assortment of
unrelated computational techniques and ad-hoc optimal solutions,
thereby losing sight of some fundamental aspects of human
language and cognition. To minimise this risk, we suggest that
current natural language technologies can benefit considerably
from a better understanding of the neuro-functional underpinnings
of the human processor and the psycho-cognitive bases of human
communication. We believe that some non-trivial issues in the
computer-based analysis of the Arabic language illustrate the need
for such an integration of perspectives in a neat way.

Permission to make digital or hard copies of all or part of this work for personal
or classroom use is granted without fee provided that copies are not made or
distributed for profit or commercial advantage and that copies bear this notice
and the full citation on the first page. Copyrights for components of this work
owned by others than the author(s) must be honored. Abstracting with credit is
permitted. To copy otherwise, or republish, to post on servers or to redistribute
to lists, requires prior specific permission and/or a fee. Request permissions
from [email protected].
AIUCD'14, September 18-19, 2014, Bologna, Italy
Copyright is held by the owner/author(s). Publication rights licensed to ACM.
ACM 978-1-4503-3295-8/14/09…$15.00
DOI: http://dx.doi.org/10.1145/2802612.2802637
At the Institute for Computational Linguistics (ILC-CNR), the
Communication Physiology Lab and the Computational Philology
Lab have launched a joint effort into a computer-based
investigation of Arabic, from word structure to concept structure.
Our current experience highlights some of the advantages of
tackling traditional problems from a non-traditional perspective,
and the added value of integrating complementary perspectives on
the same issue. The Arabic lexicon is organised around two
fundamental pillars: a non-concatenative inflectional morphology
based on discontinuous roots and vowel patterns, and a productive
derivational morphology component, whereby root-sharing words
are mutually related across both grammatical categories and
conceptual boundaries to a much larger extent than in any Indo-European language. Most of the specific problems that arise in
computerised Arabic text analysis stem from these two principles
of lexicon organisation and can give us important insights into
human language processing strategies at large.
In what follows, we summarise some of the core problems in
Arabic word processing, and envisage possible alternative ways to
approach them. In all case studies provided here, processing
consists in filtering and selecting the best possible candidate,
given a number of competing solutions and some context-based
constraints to satisfy. These issues are exemplified and discussed
in connection with the morphological analysis of written Arabic
(section 2), optical character recognition of Arabic literary texts
(section 3) and acquisition of lexico-semantic relations in the
Arabic lexicon (section 4). Such a filtering approach is strongly
reminiscent of the way the human processor is known to operate
in the mental lexicon (section 5), through the competitive co-activation
of lexical candidates and goodness-of-fit criteria that
guide activation towards the optimal (i.e. the most compatible)
candidate. Computer simulations of brain maps provide a neuro-biologically
plausible way of implementing this approach through
artificial neural networks, and point in the direction of a useful
integration of traditional, task-oriented computational solutions
with more principled models of language physiology.
2. ISSUES IN THE COMPUTATIONAL ANALYSIS OF ARABIC MORPHOLOGY
The Arabic language presents a number of challenges for computerised text analysis at the word level.

Primarily, the inconsistent use of diacritics in Arabic orthography raises problems of written word recognition for morphological analysis. Due to the unsystematic marking of diacritics, Arabic word forms can be found in written texts in many different spellings. For example, the word كَتَبَ (kataba, 'he wrote') can be written as كَتَبَ (kataba), كَتب (katb), كتَب (ktab), كتبَ (ktba), كَتَب (katab), كتَبَ (ktaba) or كتب (ktb). As a general strategy, to avoid a useless proliferation of inconsistently diacriticised forms in the morphological lexicon, written forms in a text are brought back to a unique underlying form without diacritics (e.g. كتب ktb), for the latter to be spelled out into the whole range of its possible vocalised instantiations. The strategy reduces the unpredictable variability of spelled-out forms to a unique underlying exponent. However, it may increase the ambiguity of the input and the resulting number of possible morphological parses, since the normalised exponent can be vocalised in ways that are incompatible with the original form. For example, given the input form كَتب (katb), its normalised exponent كتب (ktb) is spelled out as كَتَبَ (kataba, 'he wrote'), كَتَّبَ (kattaba, 'he made write'), كُتِبَ (kutiba, 'it was written') and كُتُب (kutub, 'books'), with the last two forms being incompatible with the original input.

Secondly, Arabic is a morphologically rich language, where much information related to the semantic and morpho-syntactic relationships between words in context is expressed at the word level [1]. A written form can consist of a single inflected unit called "minimum word" [2], for example:

يَشْرَحُونَ
ja-ʃraħ=u:-na
IPFV.3-explain=They.PL-PRS.IND
'they explain'¹

The minimum word يَشْرَحُونَ [jaʃraħu:na] is made up of an inflectional prefix of the imperfective aspect and the third person [ja], a verbal stem [ʃraħ], a nominative pronoun [u:] specifying the plural, and an inflectional suffix [na] indicating the present indicative. To this "minimum word", proclitics and/or enclitics² could be added in order to obtain a "maximum word", for example:

وَسَيَشْرَحُونَهَا
wa=sa=ja-ʃraħ=u:-na=ha:
and=PFUT=IPFV.3-explain=They.3PLM-PRS.IND=it.3SF.ACC
'and they will explain it' (PFUT = future particle)

Here, the "minimum word" يَشْرَحُونَ [jaʃraħu:na] is preceded by proclitics (the conjunction [wa] and the future particle [sa]) and is followed by an enclitic representing an accusative pronoun [ha:]. This complex structure raises the segmentation problem of drawing the morphological boundaries between the sub-constituents of maximum words.

A further element of complexity of Arabic morphology is that agreement rules can vary according to the semantic marking [+/- human] of the verb arguments.³ Without semantic restrictions, the automated analysis of the form شرحك [ʃrħk] offers multiple parses⁴ linked to various lemmas, for example:

شَرَحَكَ
ʃaraħ-a=ka
explain.PST-3SGM=You.2SGM.ACC
'he explained you'

شَرَّحَكَ
ʃarraħ-a=ka
dissect.PST-3SGM=You.2SGM.ACC
'he dissected you'

شَرْحُكَ
ʃarħ-u=ka
explanation-NOM=Your.POSS.2SGM.GEN
'your explanation'

The forms شَرَحَ ([ʃaraħa] 'explain') and شَرَّحَ ([ʃarraħa] 'dissect') belong to a class of verbs that require 'not human' direct objects and accept only third person singular/dual masculine/feminine accusative pronouns. Interestingly, the grammatically relevant distinction here does not rest on the general meaning of the two verbs, but only on the morpho-syntactic and semantic restrictions that the verbs impose on their arguments [3]. Both verbs are transitive, take 'not human' objects, and are incompatible with accusative pronouns in the first and second person (singular, dual and plural) and in the third person plural.

¹ Hereafter, depending on the context, Arabic forms are written in Arabic orthography (كَتَبَ), transliterated in Latin italicised characters (kataba), or transcribed in IPA [kataba]. Interlinear glosses follow the standard set of parsing conventions and grammatical abbreviations recommended in the Leipzig Glossing Rules ("The Leipzig Glossing Rules: Conventions for interlinear morpheme-by-morpheme glosses", February 2008). Segmental morphemes are separated by hyphens and clitic boundaries are marked by a "=" sign, both in the IPA transcription and in the interlinear gloss.
² Some prepositions, conjunctions and other particles (hereafter collectively referred to as function words) are morphologically realised as proclitics, while pronouns (in nominative, accusative or genitive case) are enclitics.
³ When the noun refers to a rational entity, namely a person (العاقل), the anaphoric pronoun agrees with it in number (singular, dual, plural) and gender (masculine and feminine). When the noun refers to an irrational entity, i.e. a non-human entity (غير العاقل), the anaphoric pronoun is always in the third person, and agrees with the noun in both number and gender only when the noun is singular or dual; when the noun is plural, the anaphoric pronoun is in the third person singular feminine.
⁴ We use the Arabic Morphological Analyzer suite developed by Tim Buckwalter, known as AraMorph. The Java AraMorph version 1.0 is freely available online (http://www.nongnu.org/aramorph/english/index.html) and contains, besides the rule engine, morphological data distributed over three lexicons (dictPrefixes, dictSuffixes, and dictStems) and three compatibility tables to control combinations of proclitics, stems and enclitics.
To sum up, some orthographic, morphological and syntactic
characteristics of Arabic tend to considerably increase the
ambiguity of the analyses produced by a morphological parser. To
reduce the problem and regiment the interpolation of missing
vowels/diacritics, spelling rules can act as a preliminary filter.
Accordingly, we developed a classifier that compares the token in
the text with the solutions offered by the morphological engine on
the normalised form, and filters out proposed solutions that do not
match the diacritic information provided by the original (non-normalised)
token [4]. In addition, morpho-syntactic and semantic
constraints should be jointly evaluated as early as possible during
word parsing. Semantic properties of lexical units can in fact
enforce constraints that help predict their syntactic properties.
As a mid-term goal, we intend to augment the lexicon with a full
list of proclitics and enclitics, so that our classifier can take
into account the semantic restrictions of verbs, and refine
compatibility among proclitics, minimum words and enclitics.
3. ACQUISITION OF ARABIC TEXTS AND COLLABORATIVE EDITING
In order to study textual phenomena and observe their frequency, it is necessary to build large collections of digital texts. For this reason, the process of digitization may be a bottleneck in Corpus Linguistics and, more generally, in Computational Linguistics. Indeed, the acquisition and correction of digital texts are time-consuming and expensive tasks. Repositories of out-of-print text images are rapidly increasing (see for instance [5] and [6]) but, excluding editions in English or in some other modern Western languages, the machine-actionable texts extracted from those images and available online are often inaccurate. Texts written in historical languages, such as works in Ancient Greek, multilingual texts, such as the notes in the apparatus of a critical edition, and texts written in continuous scripts, such as Arabic or Sanskrit, are the most challenging cases for Optical Character Recognition (OCR) engines, especially commercial ones. Many improvements in the recognition of Arabic scripts have been achieved, as demonstrated in [7] and [8]. Improvements can be obtained in the pre-processing phase, by enhancing the quality of the text images; in the processing phase, by optimizing character segmentation; and in post-processing, by merging multiple outputs and enhancing spell-checking techniques.

Indeed, aligning the outputs produced by multiple OCR engines, in conjunction with suitable criteria to select the best outcome at the granularity of the single character, can improve recognition accuracy. For this reason, we have adapted the methods developed for texts written in Ancient Greek [9] to Arabic samples, and the first tests are encouraging, even though further investigation is needed to evaluate the results. The alignment strategy, which involves the application of the Needleman-Wunsch algorithm, allows us to combine commercial software, such as Abbyy FineReader, open-source software, such as Tesseract/Cube [10], and experimental software, such as the new OCR engine under development at the University "Sidi Mohamed Ben Abdellah" of Fez [11], the main partner of ILC-CNR in the creation of digital resources and computational tools for the processing of the Arabic language.

In order to facilitate manual correction by focusing attention on possible errors, the web application created for the correction of Ancient Greek OCR [12] has been adapted to the peculiarities of the Arabic language. In particular, the current extension takes into account the right-to-left direction of the text to be corrected, and provides a spell-checker based on Arabic literary texts. As shown in Figure 1, the image of a text line is followed by the related OCR output. Words underlined with a red wavy line are not recognised by the spell-checker provided by the browser (in this case, by Firefox), whereas words coloured in red are not recognised by the spell-checker based on literary texts. In this way, possible errors are identified by crossing two different methods: the first may generate some false positives on literary terms, and the second may generate some false positives on colloquial terms.

Figure 1. Arabic OCR Proof-reading Web App.
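The character-level alignment step can be sketched with a textbook Needleman-Wunsch implementation. This is a minimal illustration of the technique, not the project's actual pipeline: the scoring parameters and the example strings are assumptions, and a real system would further merge the aligned outputs by per-character selection criteria.

```python
# Minimal Needleman-Wunsch global alignment of two OCR outputs at
# character granularity. Scores (match/mismatch/gap) are illustrative.

def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Return globally aligned versions of strings a and b,
    padding with '-' where one output lacks a character."""
    n, m = len(a), len(b)
    # DP table of best alignment scores.
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i-1][j-1] + (match if a[i-1] == b[j-1] else mismatch)
            score[i][j] = max(diag, score[i-1][j] + gap, score[i][j-1] + gap)
    # Traceback from the bottom-right corner, preferring diagonal moves.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i-1][j-1] + (
                match if a[i-1] == b[j-1] else mismatch):
            out_a.append(a[i-1]); out_b.append(b[j-1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i-1][j] + gap:
            out_a.append(a[i-1]); out_b.append('-'); i -= 1
        else:
            out_a.append('-'); out_b.append(b[j-1]); j -= 1
    return ''.join(reversed(out_a)), ''.join(reversed(out_b))

# Two hypothetical engine outputs for the same text line.
aligned1, aligned2 = needleman_wunsch("explnation", "explanation")
# aligned1 == "expl-nation", aligned2 == "explanation"
```

Once outputs are aligned position by position, a voting or confidence rule can pick the best character at each slot.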
4. TOWARDS A LITERARY ARABIC WORDNET
To provide as much lexico-semantic information as possible at the
very early stages of Arabic text processing, we developed a
strategy to augment the Arabic WordNet [13].
WordNet [14] is a lexical semantic resource that aims at creating,
managing and organising a conceptual net through concepts
expressed by words. The model contains, among other types of
information, words, providing information about individual
lemmas, and senses, providing the sense(s) associated with
individual lemmas. Synsets are defined as sets of senses
associated with lemmas referring to the same concept. For
example, build up, fortify and gird are (quasi) synonyms of the
sense of the lemma arm as a verb (e.g. in the sentence “The
farmers armed themselves with rifles”). WordNet also contains
semantic relations among synsets: for example hyponymy,
hypernymy, meronymy, antonymy and so on.
Our basic idea was to update a WordNet-like resource by adding
information gathered from a bilingual dictionary. This strategy is
general-purpose and widely used for other languages [15], [16],
[17]. Due to lexical polysemy, however, the use of bilingual links
to expand a network of word meanings must be carefully
constrained to avoid inclusion of spurious associations. Bilingual
information must thus be cross-checked with the content and
structure of AWN, with a view to obtaining two principal results:
new Arabic lemmas are added to the WordNet and clustered into
synsets; existing lemmas are rearranged in terms of new
synonymic patterns. As a result, new and existing lemmas are
grouped together into new synonymic clusters.
4.1 Resources Used
In this section, we list and briefly describe the lexical resources used to update the Arabic WordNet. Interested readers may find more information in [13].

Princeton WordNet (PWN), version 3.0, contains 147,306 distinct words and 117,659 distinct synsets (collections of synonyms), which are associated with 206,941 senses. Nouns and verbs, with 146,312 and 25,047 senses respectively, cover 70.1% and 12.1% of the total. PWN contains relations, such as "antonymy" and "part holonymy", which, in addition to "hypernymy/hyponymy", connect synsets. For example, arm as a limb is part of the human body (part holonym); disarm is an antonym of arm, etc.

The Arabic WordNet (AWN) contains 12,114 distinct words and 9,576 synsets, which are associated with 17,419 senses. Nouns and verbs cover 84.2% and 14.1% of the senses respectively. The Arabic WordNet contains many geographic/political named entities (about 1,000) and a set of Common Base Concepts (CBCs). Vocalization of the Arabic words is partially inconsistent [18].

DictStems is the Arabic-English bilingual dictionary that, along with the dictSuffixes and dictPrefixes resources, is part of the AraMorph suite. This dictionary contains more than 30,000 Arabic words and more than 25,000 English translational equivalents. The dictStems dictionary is a component of the morphological engine and does not properly serve as a bilingual lexical resource. However, we think that it can be used as a good starting point to enhance the Arabic WordNet, since it contains many Arabic-English word pairs. Nonetheless, we are fully aware of the limitations of this resource. For example, in Arabic, the sense of become is translated by the verb صَارَ [sˁa:ra] when expressed without a temporal connotation, and by بَاتَ [ba:ta], أَصْبَحَ [ʔasˁbaħa] and أَمْسَى [ʔamsa:] when a temporal connotation (night, morning, evening respectively) is added. However, dictStems translates both بَاتَ [ba:ta] and أَصْبَحَ [ʔasˁbaħa] with the generic verb become. Similarly, many English translations are provided as the periphrastic combination of a verb plus an adjective/noun, for example become old/rich, give breakfast, be hostile, etc. In most of these cases, although the individual English words in the periphrastic translation are present in PWN, their combinations are not present as independent PWN entries in their own right.

The different nature of AWN and PWN on the one hand, and dictStems on the other, explains the limited overlap in word coverage between the first two and dictStems. About 60% of the English words in dictStems are found in PWN, but only nearly 10% of the Arabic words in dictStems are in AWN. The low coverage of AWN with respect to the Arabic words in dictStems can be explained as a result of how AWN is designed and how the coverage percentage is calculated. AWN was designed to contain 1,000 named entities and 9,000 basic concepts. This means that the coverage of Arabic words in AWN would increase if we limited ourselves to considering words in dictStems that are either basic concepts or named entities.

4.2 AWN: Enhancing Strategy
The augmenting algorithm of AWN is depicted by the flowchart of Figure 2. From the list of Arabic-English word pairs, we selected the entries where a single Arabic word is translated by more than one English word.

Figure 2. Workflow for Arabic WordNet Enhancement.
For each English translational equivalent, we extracted all synsets
from Princeton WordNet that contain it, and considered how
many synsets were shared by the family of all translational
equivalents of the same Arabic word.
In calculating common synsets, three cases may occur: all English
words in the same translational family share all synsets (spread=1
in Figure 2), that is to say they are synonyms under every possible
sense they may take; words share no synsets (spread=0); only
some synsets are shared (spread between 0 and 1). In the last case,
words are synonyms only in certain senses.
We have focused on the spread=1 case, in line with the following
assumption: two (or more) Arabic words that are translated by
more than one English word sharing all their synsets (fully
synonymic translational equivalents) are strong candidates for
being synonyms in Arabic too.
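The spread computation can be illustrated with a small sketch. The text does not spell out the exact formula, so as an assumption we take spread to be the ratio between the synsets shared by all English equivalents and the synsets covered by any of them: this yields 1 exactly when every equivalent shares all its synsets, and 0 when no synset is shared. The synset identifiers below are hypothetical.

```python
# Illustrative "spread" computation over a translational family;
# the intersection/union formalisation is our own assumption.

def spread(synsets_per_word):
    """synsets_per_word maps each English translational equivalent of a
    single Arabic word to the set of PWN synset ids containing it."""
    families = list(synsets_per_word.values())
    shared = set.intersection(*families)    # synsets shared by all words
    covered = set.union(*families)          # synsets covered by any word
    return len(shared) / len(covered)

# Hypothetical synset ids, for illustration only.
fully_synonymic = {"fortify": {"s1", "s2"}, "gird": {"s1", "s2"}}
partial = {"fortify": {"s1", "s2"}, "strengthen": {"s2", "s3"}}
disjoint = {"fortify": {"s1"}, "breakfast": {"s9"}}

# spread == 1.0 for fully synonymic families (the case selected above),
# 0 < spread < 1 for partially overlapping ones, 0.0 for disjoint ones.
```

Under this formalisation, the spread=1 families are exactly those whose Arabic source word is a strong candidate for new Arabic synonymy.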
Table 1. Mapping results

# Lines   # Synsets   # Mapped Synsets   %
203       233         76                 32.6

# Arabic Words   # Arabic Words in AWN   # Discovered Synonyms   %
203              11                      42                      20.7
The number of such clues is very limited: out of 34,000 candidate
word pairs, only 200 pairs fulfil the requirements above. However,
these 200 items provide very interesting results (see Table 1).
Manual validation of the results in Table 1 shows that 38 of the 42
candidate synonyms selected by the algorithm are true (newly
discovered) synonyms.
We are currently investigating two additional improvements to
augment AWN. One strategy is to analyse word pairs with
different spreads. In previous work [13], we showed that there are
about 9,000 word-pairs with a spread between 0 and 1. We expect
spread values to correlate with accuracy in the resulting candidate
synonyms, with values close to 1 being more likely to select more
reliable candidates.
Another strategy can be suggested along the lines of previous
work we carried out on Ancient Greek [19]. We can select entries
from the bilingual resource where the Arabic lemma is translated
by an English phrasal structure expressing a subsumption (IS-A)
relation. For example, in dictStems, [zahorijjah] is translated as
“flower vase”, with “vase” being the head of the English
translation. By parsing the English phrase and extracting its
syntactic head, we can eventually select the Arabic word
[zahorijjah] as a candidate hyponym of the Arabic equivalent of
“vase”: [ʔina:ʔ].
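A minimal sketch of this head-extraction strategy is given below. A real pipeline would obtain the head from a syntactic parser; here, as an illustrative shortcut, the head of a right-headed English compound such as "flower vase" is approximated by its last token:

```python
# Minimal sketch of the IS-A extraction strategy: when the bilingual entry
# translates an Arabic lemma with an English noun phrase, take the phrase's
# syntactic head as the hypernym. A real system would parse the phrase; here
# we approximate the head of a noun-noun compound as its last token, which
# holds for right-headed English compounds such as "flower vase".

def head(english_phrase):
    tokens = english_phrase.split()
    return tokens[-1]

# The dictStems entry cited in the text: zahorijjah -> "flower vase".
entry = {"arabic_lemma": "zahorijjah", "translation": "flower vase"}

hypernym = head(entry["translation"])
# zahorijjah is proposed as a hyponym of the head's Arabic equivalent
# (the text gives ʔina:ʔ for "vase").
print(f"{entry['arabic_lemma']} IS-A {hypernym}")  # zahorijjah IS-A vase
```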
5. BIO-COMPUTATIONAL DYNAMICS OF
ARABIC MORPHOLOGY
Arabic non-concatenative morphology poses a serious challenge to the computational simulation of a biologically inspired neural architecture of the mental lexicon.
Since lexical organisation relies on the process of input recoding
and adaptive strategies for long-term memory, modelling the
mental lexicon focuses on both processing and storage dynamics.
A fundamental issue in lexical processing is represented by the
emergence of the morphological organisation level in the lexicon,
based on paradigmatic relations between fully-stored word forms.
The emergence of morphological structure can be defined as the
task of identifying recurrent morphological formatives within
morphologically complex word forms. In the perspective of
adaptive strategies for lexical acquisition and processing based on
emergent morphological relations between fully-stored word
forms (defined as an abstractive approach after Blevins, [20]),
paradigmatic relations can be accounted for as the result of long-term entrenchment of neural circuits (chains of memory nodes)
that are repeatedly being activated by a memory map in the
process of recoding input words into a two-dimensional lexical
layer.
In the computational framework of Temporal Self-Organising
Maps (TSOMs, [21], [22], [23], [24]), based on Self-Organising
Maps with Hebbian connections defined over a temporal layer, the
identification/perception of surface morphological relations
involves the co-activation of recoded representations of
morphologically-related input words.
The highly non-linear nature of Arabic morphology urges the
development of co-activation of memory chains, namely, the
propensity to respond to the same symbol at different positions in
time. Generally, a node that is selectively sensitive to a particular
symbol in a specific ordered position reaches a high level of co-activity when the same symbol is shifted by a few positions.
Repeated exposure to the underlying structure of Arabic word forms enforces both a mild sensitivity to time-invariant symbol encoding (intra-paradigmatic relations) and a position sensitivity to time-bound instances of the same vowel symbol (inter-paradigmatic relations).
Traditionally, lexical processing has been modelled in terms of
basic mechanisms of human memory for serial order, as proposed
in the vast literature on immediate serial recall and visual word
recognition ([25, 26], for detailed reviews). Some of the earliest
psychological accounts of serial order assume that item sequences
are represented as temporal chains made up of stimulus-response
links. However, it is difficult to align, for example, word forms of
differing lengths [27], hindering recognition of shared sequences
between morphologically-related forms. Conventionally, morphology induction has been defined as the task of singling out morphological formatives within morphologically complex word forms.
Accordingly, the issue of word alignment appears to be crucial for
morphology induction. Since languages widely differ in the way
morphological information is sequentially encoded – ranging from
suffixation, to prefixation, circumfixation, infixation, and
combination thereof - a principled solution to alignment issues is
needed. Thus, algorithms for morphology acquisition should be
adaptable to the morphological structure of a target language, and
with identical settings of initial parameters and a comparable set
of assumptions concerning input representations, should be able to
successfully deal with as diverse inflectional systems as, for
example, Italian, German and Arabic.
In this perspective, the non-concatenative morphology of the Arabic language seriously challenges traditional morpheme-based
approaches [28], [29], while suggesting a reappraisal of
morphology induction through adaptive memory self-organisation
strategies.
The computational framework of TSOMs, through chains of
activated nodes encoding time sequences of symbols, is conducive
to enforcing alignment through synchrony (i.e. co-activation) (for a detailed description see [30]). In fact, node co-activation
represents the immediate correlate to the perception of similarity
between strings: the extent to which two strings are perceived as
similar by the map is given by the amount of highly co-activated
nodes that are involved in processing them.
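The contrast between rigid positional alignment and the shift-tolerant matching that co-activation affords can be illustrated with a toy comparison. This is only an illustrative proxy, not the TSOM mechanism of [30]:

```python
# Toy contrast: a rigid, chaining-style alignment versus shift-tolerant
# matching. Arabic prefixes and vowel patterns displace the root consonants,
# so position-by-position comparison misses them, while matching that allows
# positional shifts (as node co-activation does) recovers the shared skeleton.

def positional_matches(w1, w2):
    # Rigid, position-by-position comparison of the two strings.
    return sum(a == b for a, b in zip(w1, w2))

def is_subsequence(pattern, word):
    # Shift-tolerant matching: symbols must occur in order, gaps allowed.
    it = iter(word)
    return all(symbol in it for symbol in pattern)

# kataba ('he wrote') vs. yaktubu ('he writes'): rigid alignment matches
# almost nothing, but the discontinuous root k-t-b is present in both.
print(positional_matches("kataba", "yaktubu"))  # 1
print(is_subsequence("ktb", "kataba"))          # True
print(is_subsequence("ktb", "yaktubu"))         # True
```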
At their core, TSOMs are dynamic memories trained to store and
classify input stimuli through patterns of activation of map nodes
(simulating memory traces). A pattern of node activation is the
processing response of a map when a stimulus is input. After
being trained on a set of stimuli, the map gets increasingly attuned
to respond to similar stimuli with largely overlapping activation
patterns, and/or synchronously co-activated nodes.
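As a rough illustration only, the training loop of such a dynamic memory might be sketched as follows. The map size, learning rate, symbol coding and the omission of a topological neighbourhood function are arbitrary simplifications for the sake of brevity, and do not reproduce the architecture of [21], [22], [23], [24]:

```python
# Heavily simplified sketch of TSOM-like training dynamics: map nodes
# compete for each input symbol (best-matching unit), the winner's weights
# move toward the input, and a Hebbian temporal connection from the previous
# winner to the current one is strengthened. Repeated exposure entrenches
# chains of memory nodes, as described in the text. All parameters are
# illustrative choices, not those of the cited TSOM papers.
import random

random.seed(0)
SYMBOLS = sorted(set("kataba" + "yaktubu" + "#"))
N_NODES = 20

def one_hot(symbol):
    return [1.0 if s == symbol else 0.0 for s in SYMBOLS]

# Node weights over the symbol dimensions, plus Hebbian temporal connections.
weights = [[random.random() for _ in SYMBOLS] for _ in range(N_NODES)]
temporal = [[0.0] * N_NODES for _ in range(N_NODES)]

def best_matching_unit(vec):
    def dist(w):
        return sum((a - b) ** 2 for a, b in zip(w, vec))
    return min(range(N_NODES), key=lambda i: dist(weights[i]))

def train(word, rate=0.3, hebb=0.1):
    prev = None
    for symbol in "#" + word:  # '#' marks word onset
        vec = one_hot(symbol)
        bmu = best_matching_unit(vec)
        # Move the winner's weights toward the current input symbol.
        weights[bmu] = [w + rate * (v - w) for w, v in zip(weights[bmu], vec)]
        # Strengthen the temporal connection from the previous winner.
        if prev is not None:
            temporal[prev][bmu] += hebb
        prev = bmu

for _ in range(20):  # repeated exposure entrenches the node chains
    train("kataba")
    train("yaktubu")

# The frequent transition k -> a in kataba leaves an entrenched Hebbian
# connection between the nodes that ended up tuned to k and to a.
k_node = best_matching_unit(one_hot("k"))
a_node = best_matching_unit(one_hot("a"))
print(temporal[k_node][a_node] > 0)
```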
It has been shown that the paradigmatic organisation of the inflectional morphology of Italian and German emerges in TSOMs [31]: different types of intra- and inter-paradigmatic family relations induce a strongly paradigm-related co-organisation and co-activation of nodes, which facilitates paradigmatic extension.
Concatenative morphologies, in fact, are shown to exhibit a strong
correlation between topological distance (proximity) of node
chains on the map and morphological similarity of their
corresponding word forms. The correlation reflects the topological
organisation of a map by clusters of co-activating nodes (e.g.
nodes that respond to either the same stimuli or to very similar
stimuli), and the sensitivity of the map to the immediately
preceding context of the current stimulus. On the contrary, for
those (non-concatenative) morphologies where structures are
misaligned [32], perception of internal structure is better
correlated with high levels of co-activation than with topological
proximity of memory nodes.
Discontinuous morphological formatives - e.g. roots in the Arabic
inflectional system – appear embedded in considerably different
contexts, as for example in kataba (‘he wrote’) and yaktubu (‘he
writes’). By topologically aligning these two morphologically-related word forms, the sub-lexical constituents they share will hardly emerge. This may keep the nodes responding to the root (k-t-b) in the two forms far apart on artificial brain maps. Responses
of the two pools of nodes are maximally synchronised when the
root k-t-b is presented. These levels of co-activity response can be
taken to mean that k, t, b are perceived as instantiations of the
same consonantal skeleton in both word forms.
Figure 3 provides an example of differences in node co-activation
levels associated with the two input verb forms.
Figure 4. Co-activation level distances for the imperfective
input forms yaktubu (‘he writes’) and yadxulu (‘he enters’).
Figure 5. Co-activation level distances for the perfective input
forms kataba (‘he wrote’) and daxala (‘he entered’).
Due to the chiefly non-concatenative nature of consonantal roots and vowel patterns, and the concurrent presence of prefixes and suffixes, TSOMs can provide a flexible and adaptive computational approach to Arabic morphology, capable of casting its highly non-linear nature into a self-organising network of densely interconnected memory nodes, where distributed populations of memory nodes can be trained to selectively respond to either time-invariant or context-sensitive recoding of symbols.
Figure 3. Co-activation level distances for the input forms kataba (‘he wrote’) and yaktubu (‘he writes’).
6. GENERAL DISCUSSION
As previously claimed, lexical processing relies on the emergence
of paradigmatic relations between fully-stored morphologically
complex words. TSOMs adhere perfectly to this perspective, since they become able, during unsupervised training, to find out what is common and what is different within any set of paradigmatically-related verb forms, as a dynamic result of persistent co-activation
levels across forms within paradigms (intra-paradigmatic relations)
and between paradigms (inter-paradigmatic relations).
In addition, both intra- and inter-paradigmatic relations are determinants of paradigmatic extension and generalisation. In
detail, extending the underlying structure of known Arabic verb
forms to novel forms requires sensitivity to both time-invariant symbol encoding of the root consonant skeleton (as shown in
Figure 3), and a position sensitivity to time-bound instances of the
same vowel symbol (see Figure 4 and Figure 5).
Over the last decades, a growing body of evidence on the
mechanisms governing lexical storage, access, acquisition and
processing has questioned traditional computational models of
language architecture and word usage. By pulling together
cognitive, neuro-functional and psycho-computational
architecture is emerging, based on the dynamic interaction
between storage and processing [33], [34], [35], [36], [37].
According to such an “integrative” conception of the mental
lexicon (as opposed to the “discriminative” view that takes lexical
processes and lexical representations to belong to two distinct and
neuro-anatomically segregated modules of the human language
processor) several levels of linguistic knowledge are understood
to be interfaced in the lexicon, and to simultaneously interact in
word processing. Word forms are concurrently, redundantly and
competitively stored in a distributed network of perisylvian brain
circuits, whose organisation appears to comply more with a
principle of maximization of redundant opportunities [38, 39],
than with requirements of minimization of storage and logical
necessity. In this perspective, words are not localised in a single brain region, but they are better viewed as emergent properties of
the functional interaction between different brain regions.
Computational modelling of such a novel view crucially rests on
an activation metaphor: exposure to an input word triggers the
distributed activation of clusters of parallel processing units (or
nodes) each of which tends to respond more highly to specific
instances of an input symbol (e.g. a letter or a sound of an input
word). In addition, nodes are arranged on a bi-dimensional grid (a
map) that is topologically organised: nodes that respond to similar
stimuli tend to lie close on the map. Crucially, a map organisation
is not wired-in and given once and for all, but it is the outcome of
a process of adaptive self-organisation, which heavily depends on
the underlying structure of the training data. In particular, three basic
factors appear to affect self-organisation: (i) similarity: similar
symbols trigger overlapping activation patterns; (ii) frequency:
frequent symbols tend to recruit dedicated nodes; and (iii) symbol
timing: nodes react differently depending on the time-bound
context where a symbol is repeatedly found.
Activation of such self-organising maps by an input word thus
reflects some basic structural properties of the training words to
be stored, making room for the wide range of cross-linguistic
typological variability in the marking of morphological
information. Accordingly, perception of similarity between words
may depend on the most recurrent patterns shared by inflected
words that belong to the same paradigmatic family. Adaptivity to
recurrent morphological patterns allows the TSOMs reviewed in
section 5 to adjust themselves to morphological systems as different as German or Italian conjugations on the one hand, and Arabic conjugation on the other.
We believe that an important lesson can be learned from the study
of the behaviour of temporal self-organising maps. Simultaneous
activation of several competitive candidates for lexical access or
production, far from being the occasional outcome of a severely
underspecified or noisy input, is the ordinary operational state of
the mental lexicon. A word memory trace tends to contain, over
and above its proper representational nodes, also nodes that
belong to other lexical neighbours. Activation of one such
distributed representation thus prompts co-activation of other
representations that compete for selection. In an optimally
organised lexicon, connections between nodes are distributed in
such a way that co-activation does not hinder selective lexical
access. However, we know that competition between
neighbouring forms, which are attested with different frequencies, may occasionally cause interference and delay, with
recognition and production being suboptimal [40], [41], [42].
In the light of these remarks, some of the core problems
associated with Arabic Natural Language Processing (exemplified
and reviewed in sections 2, 3 and 4) strike us as being connected
with some persisting limitations of traditional processing
architectures, where rules and lexical representations are logically
distinct and functionally segregated, and different levels of
linguistic representation (morphological, morpho-syntactic and
semantic) are accessed one after the other. Our experience with
Arabic NLP through a variety of approaches and software
architectures shows that a new conception of language processing
as a classification task, resulting from on-line, probabilistically-supported resolution of concurrent and conflicting multi-level
constraints, is probably closer to the way speakers understand and
produce their language. Over the last fifteen years, such a view
has been implemented through stochastic parsing architectures,
based on information-theoretic principles such as Maximum
Entropy and Bayesian learning. Recent artificial neural networks
such as TSOMs show that such a dynamic and adaptive view on language processing can effectively be implemented by biologically-inspired networks.
By understanding more of the way the human processor deals
with issues of concurrent and competitive word storage, we can
gain many insights into the architecture of language in the brain.
In addition, we suggest that effective integration of some of these
insights within computer processing architectures can be
instrumental in designing and developing more adaptive and
intelligent systems, where processing resources and structures are
dynamically apportioned as a function of past experience (long-term, cumulative frequency effects) and salience (short-term context-sensitivity effects) in the input.
REFERENCES
[1] Tsarfaty, R., Seddah, D., Kubler, S., and Nivre, J. 2013. Parsing Morphologically Rich Languages: Introduction to the Special Issue. Computational Linguistics 39, 1, 15–22.
[2] Dichy, J. 1997. Pour une lexicomatique de l’arabe: l'unité lexicale simple et l'inventaire fini des spécificateurs du domaine du mot. Meta 42, 2, 291-306.
[3] Jackendoff, R. 2002. Foundations of Language. Brain, Meaning, Grammar, Evolution. Oxford University Press, New York.
[4] Nahli, O. 2013. Computational contributions for Arabic language processing. The automatic morphologic analysis of Arabic texts. Studia graeco-arabica 3, 195-206.
[5] Internet Archive: http://archive.org/.
[6] Google Books: http://books.google.com/.
[7] Dichy, J. and Kanoun, S., Eds. 2013. Linguistic Knowledge integration in optical Arabic word and text recognition process. Linguistica Communicatio 15, 1-2.
[8] Märgner, V. and El Abed, H. 2012. Guide to OCR for Arabic Scripts. Springer, London.
[9] Boschetti, F., Romanello, M., Babeu, A., Bamman, D., and Crane, G. 2009. Improving OCR accuracy for classical critical editions. In Proceedings of the 13th European Conference on Research and Advanced Technology for Digital Libraries (ECDL’09), M. Agosti, J. Borbinha, S. Kapidakis, C. Papatheodorou and G. Tsakonas, Eds. Springer-Verlag, Berlin, Heidelberg, 156-167.
[10] Tesseract/Cube: http://code.google.com/p/tesseract-ocr/.
[11] Lasri, Y. 2014. Contribution à la reconnaissance optique (OCR) du texte arabe imprimé. Fès: Université “Sidi Mohamed Ben Abdellah” de Fès, MA Thesis.
[12] Boschetti, F. 2013. Acquisizione e Creazione di Risorse
Plurilingui per gli Studi di Filologia Classica in Ambienti
Collaborativi. In Collaborative Research Practices and
Shared Infrastructures for Humanities Computing, M.
Agosti and F. Tomasi, Eds. Proceedings of Revised Papers
AIUCD 2013 (Padua, Italy, December 11-12, 2013), 55-67.
[13] Del Gratta, R. and Nahli, O. 2014. Enhancing Arabic
WordNet with the use of Princeton WordNet and a bilingual
dictionary. IEEE - CiST14 Colloquium on Information
Science and Technology - ANLP Invited Session.
[14] Fellbaum, C., Ed. 1998. WordNet: An Electronic Lexical
Database (Language, Speech, and Communication). The
MIT Press, Cambridge, MA.
[15] Sagot, B. and Fišer D. 2011. Extending Wordnets by
learning from multiple resources. In LTC’11: 5th Language
and Technology Conference, Poznań, Poland.
[16] Rodríguez, H., Farwell, D., Farreres, J., Bertran, M., Martí,
M.A., Black, W., Elkateb, S., Kirk, J., Vossen, P., and
Fellbaum, C. 2008. Arabic Wordnet: Current State and
Future Extensions. In Proceedings of the Fourth
International Global WordNet - Conference, 387-406.
[17] Vossen, P., Ed., 1998. EuroWordNet: A Multilingual
Database with Lexical Semantic Networks. Kluwer
Academic Publishers, Norwell, MA, USA.
[18] Fellbaum, C., Alkhalifa, M., Black, W. J., Elkateb, S., Pease,
A., Rodríguez, H., and Vossen, P. 2006. Building a
WordNet for Arabic. In Proceedings of the 5th Conference
on Language Resources and Evaluation (ELRA - LREC
2006, Genova), 29-34.
[19] Boschetti, F., Del Gratta, R., and Lamè, M. 2014. Computer
Assisted Annotation of Themes and Motifs in Ancient
Greek Epigrams: First Steps. In Proceedings of CLIC,
Computational Linguistics Italian Conference. Pisa, Italy.
[20] Blevins, J.P. 2006. Word-based morphology. Journal of
Linguistics 42, 531-573.
[21] Ferro, M., Pezzulo, G., and Pirrelli, V. 2010. Morphology,
Memory and the Mental Lexicon. In Lingue e Linguaggio,
vol. IX(2), Interdisciplinary aspects to understanding word
processing and storage, V. Pirrelli, Ed. Il Mulino, Bologna,
199-238.
[22] Pirrelli, V., Ferro, M., and Calderone, B. 2011. Learning
paradigms in time and space. Computational evidence from
Romance languages. In Morphological Autonomy:
Perspectives from Romance Inflectional Morphology, M.
Goldbach, M. O. Hinzelin, M. Maiden, and J.C. Smith, Eds.
Oxford University Press, Oxford, 135-157.
[23] Marzi, C., Ferro, M., and Pirrelli, V. 2012. Word alignment and paradigm induction. Lingue e Linguaggio XI, 2, 251-274.
[24] Marzi, C., Ferro, M., and Pirrelli, V. 2014. Morphological
structure through lexical parsability. Lingue e Linguaggio
XIII, 2, 263-290.
[25] Henson, R. N. A. 1999. Coding position in short-term memory. International Journal of Psychology 34, 5-6, 403-409.
[26] Davis, C. J. 2010. The spatial coding model of visual word
identification. Psychological Review 117, 3, 713-758.
[27] Davis, C. J. and Bowers, J. S. 2004. What do letter
migration errors reveal about letter position coding in visual
word recognition? Journal of Experimental Psychology:
Human Perception and Performance 30, 923-941.
[28] Halle, M. and Marantz, A. 1993. Distributed Morphology
and the pieces of inflection. In The view from building 20, K.
Hale and S. J. Keyser, Eds. MIT Press, Cambridge, MA,
111-176.
[29] Embick, D. and Halle, M. 2005. On the Status of stems in
morphological theory. In Romance Languages and
Linguistics Theory 2003, T. Geerts, I. van Ginneken, and H.
Jacobs, Eds. John Benjamins, Amsterdam, 37-62.
[30] Marzi, C. 2014. Models and dynamics of the morphological
lexicon in mono- and bilingual acquisition. PhD
unpublished dissertation. University of Pavia.
[31] Marzi, C., Ferro, M., Caudai, C. and Pirrelli, V. 2012.
Evaluating Hebbian Self-Organizing Memories for Lexical
representation and Access. Proceedings of 8th International
Conference on Language Resources and Evaluation, (ELRA
- LREC 2012, Malta), 886-893.
[32] Marzi, C., Nahli, O., and Ferro, M. 2014. Word Processing
for Arabic Language. IEEE - CiST14 Colloquium on
Information Science and Technology - ANLP Invited Session.
[33] Hickok, G.M., and Poeppel, D. 2004. Dorsal and ventral
streams: a framework for understanding aspects of the
functional anatomy of language. Cognition 92, 67-99.
[34] D’Esposito, M. 2007. From cognitive to neural models of
working memory. Philosophical Transactions of the Royal
Society B: Biological Sciences 362, 761-772.
[35] Saur, D., Kreher, B.W., Schnell, S., Kümmerer, D.,
Kellmeyer, P., Vry, M.-S., Umarova, R., Musso, M.,
Glauche, V., Abel, S., Huber, W., Rijntjes, M., Hennig, J.,
and Weiller, C. 2008. Ventral and dorsal pathways for language. Proc. Nat. Academy of Sciences 105, 46, 18035-18046.
[36] Forkel, S. J., Thiebaut de Schotten, M., Dell’Acqua, F.,
Kalra, L., Murphy, D.G.M., Williams, S.C.R., and Catani,
M. 2014. Anatomical predictors of aphasia recovery: a
tractography study of bilateral perisylvian language
networks. Brain 137, 2027-2039.
[37] Ma, W. J., Husain, M., and Bays, P.M. 2014. Changing
concepts of working memory. Nature Neuroscience 17, 3,
347-356.
[38] Libben, G. 2005. Everything is psycholinguistics: Material
and methodological considerations in the study of
compound processing. Canadian Journal of Linguistics 50,
267-283.
[39] Baayen, R.H. 2007. Storage and computation in the mental
lexicon. In The Mental Lexicon: Core Perspectives, G.
Jarema, G. Libben, Eds. Elsevier, 81-104.
[40] Luce, P., Pisoni, D., and Goldinger, S.D. 1990. Similarity
neighborhoods of spoken words. In Cognitive models of
speech pro-cessing: Psycholinguistic and computational
perspectives, G.T.M. Altmann, Ed. MIT Press, Cambridge,
MA, 122-147.
[41] Huntsman, L.A. and Lima, S.D. 2002. Orthographic
Neighbors and Visual Word Recognition. Journal of
Psycholinguistic Research 31, 289-306.
[42] Goldrick, M., Folk, J. R., and Rapp, B. 2010. Mrs.
Malaprop’s neighborhood: Using word errors to reveal
neighborhood structure. Journal of Memory and Language
62, 2, 113-134.