LEXBANK: A MULTILINGUAL LEXICAL RESOURCE FOR LOW-RESOURCE
LANGUAGES
by
Feras Ali Al Tarouti
M.S., King Fahd University of Petroleum & Minerals, 2008
B.S., University of Dammam, 2001
A dissertation submitted to the Graduate Faculty of the
University of Colorado Colorado Springs
in partial fulfillment of the
requirements for the degree of
Doctor of Philosophy
Department of Computer Science
2016
© Copyright by Feras Ali Al Tarouti 2016
All Rights Reserved
This dissertation for Doctor of Philosophy degree by
Feras Ali Al Tarouti
has been approved for the
Department of Computer Science
by
Jugal Kalita, Chair
Tim Chamillard
Rory Lewis
Khang Nhut Lam
Sudhanshu Semwal
Date
Al Tarouti, Feras A. (Ph.D., Computer Science)
LexBank: A Multilingual Lexical Resource for Low-Resource Languages
Dissertation directed by Professor Jugal Kalita
In this dissertation, we present new methods to create essential lexical resources for low-resource languages. Specifically, we develop methods for enhancing automatically created wordnets. As a baseline, we start by producing core wordnets for several languages, using methods that require only limited, freely available resources (Lam et al., 2014a,b, 2015b). Then, we establish the semantic relations between synsets in the wordnets we create. Next, we introduce a new method to automatically add glosses to the synsets in our wordnets. Our techniques use limited resources as input to ensure that they can be used felicitously with languages that currently lack many original resources. Most existing research works with languages that have significant lexical resources available, which are costly to construct. To make our created lexical resources publicly available, we developed LexBank, a web-based system that provides language services for several low-resource languages.
To my mother, father and my wife.
Acknowledgments
I would like to express my appreciation to my wife and the mother of my kids, Omima, for the unlimited support she gave me during my journey toward my Ph.D. I am also very grateful for the support and guidance provided by my advisor, Dr. Jugal Kalita. In addition, I would like to thank my dissertation committee members, Dr. Sudhanshu Semwal, Dr. Tim Chamillard, Dr. Rory Lewis and Dr. Khang Nhut Lam, for their guidance and consultation.
Table of Contents

1 Introduction
  1.1 Motivation
  1.2 Research Focus
    1.2.1 Assamese Language
      1.2.1.1 Assamese Script
      1.2.1.2 Assamese Morphology
      1.2.1.3 Assamese Syntax
    1.2.2 Vietnamese Language
      1.2.2.1 Vietnamese Script
      1.2.2.2 Vietnamese Morphology
      1.2.2.3 Vietnamese Syntax
  1.3 Research Contributions
2 Case Study: The Current Status of and Challenges in Processing Information in Arabic
  2.1 Introduction
  2.2 Fundamentals of Arabic
    2.2.1 Arabic Script
    2.2.2 Arabic Morphology
    2.2.3 Arabic Syntax
  2.3 Summary
3 Literature Review
  3.1 Automatic Construction of Wordnets
  3.2 Wordnet Management Tools
  3.3 Creating Bilingual Dictionaries
  3.4 Summary
4 Automatically Constructing Structured Wordnets
  4.1 Constructing Core Wordnets
  4.2 Constructing Wordnet Semantic Relations
  4.3 Experiments and Evaluation
  4.4 Summary
5 Enhancing Automatic Wordnet Construction Using Word Embeddings
  5.1 Introduction
  5.2 Similarity Metrics
  5.3 Generating Word Embeddings
  5.4 Removing Irrelevant Words in Synsets
  5.5 Validating Candidate Relations
  5.6 Selecting Thresholds
  5.7 Experiments
    5.7.1 Generating Vector Representations of Wordnet Words
    5.7.2 Producing Word Embeddings for Arabic
  5.8 Evaluation and Discussion
  5.9 Summary
6 Selecting Glosses for Wordnet Synsets Using Word Embeddings
  6.1 Literature Review
  6.2 Creating Language Model Using Word Embeddings
  6.3 Generating Vector Representation of Wordnet Synsets
  6.4 Automatically Selecting a Synset Gloss From a Corpus Using Synset2Vec
  6.5 Evaluation
    6.5.1 Using Synset2vec to Select Glosses for PWN Synsets
    6.5.2 Using Synset2vec to Select Glosses for Arabic, Assamese and Vietnamese Synsets
    6.5.3 Results and Discussion
  6.6 Summary
7 LexBank: A Multilingual Lexical Resource
  7.1 Introduction
  7.2 Database Design
    7.2.1 The System Settings Database
      7.2.1.1 Users_Info
      7.2.1.2 System_log
    7.2.2 The Lexical Resources Database
      7.2.2.1 CoreWordnet
      7.2.2.2 Sem_Relations
      7.2.2.3 WordnetGlosses
      7.2.2.4 Sem_Relations_Eval_Data
      7.2.2.5 Sem_Relations_Eval_Response
      7.2.2.6 WordnetGlosses_Eval_Data
      7.2.2.7 WordnetGlosses_Eval_Response
  7.3 Application Layer
  7.4 Web Interface Design and Implementation
    7.4.1 Registration Form
    7.4.2 Log-in Form
    7.4.3 The Main Menu
    7.4.4 Searching Wordnet by Lexeme Web Form
    7.4.5 Searching Wordnet by OffsetPos Web Form
    7.4.6 Evaluating Semantic Relations Between Synsets Web Form
    7.4.7 Evaluating Wordnet Synset Glosses Web Form
    7.4.8 Searching Bilingual Dictionary Web Form
    7.4.9 Users Management Web Form
  7.5 Summary
8 Conclusions
9 Future Work
  9.1 Extending Bilingual Dictionaries
    9.1.1 Related Work
    9.1.2 Extending Bilingual Dictionaries Using Structured Wordnets
  9.2 Integrating Part-of-speech Tagging into Wordnet Construction
  9.3 Wordnet Expansion Using Word Embeddings
  9.4 Producing Vector Representation for Multi-word Lexemes
  9.5 Vector Representation for Multi-lingual Wordnets
Bibliography
Appendices
A Papers Resulting from the Dissertation
B Data Processing Software Code
  B.1 ComputCosineSim.py
  B.2 GenerateVectorForSynset.py
  B.3 GenerateVectorForGloss.py
  B.4 ComputeGlossSynsetSimilarity.py
C Microsoft SQL Server Tables
D LexBank Utility Class
E IRB Approval Letter
List of Tables

3.1 A list of the Java libraries tested in (Finlayson, 2014).
3.2 A comparison between some of the Java libraries for accessing the PWN.
4.1 Wordnet semantic relations.
4.2 Size, coverage and precision of the core wordnets we create for Arabic, Assamese and Vietnamese.
4.3 Precision of the semantic relations established for our Arabic wordnet.
5.1 An example of cosine similarity between words in a candidate synset.
5.2 The weighted average similarity between related words in AWN.
5.3 Comparison between the weighted similarity averages obtained using different word2vec settings.
5.4 Comparison between the number of synsets in AWN and our Arabic wordnet using different threshold values.
5.5 Precision of the Arabic wordnet we create.
5.6 Precision of the Assamese wordnet we create.
5.7 Precision of the Vietnamese wordnet we create.
5.8 Examples of related words and their cosine similarity from our Arabic wordnet.
5.9 Examples of related words and their cosine similarity from our Assamese wordnet.
5.10 Examples of related words and their cosine similarity from our Vietnamese wordnet.
6.1 Meanings of the noun “spill” and its synonyms.
6.2 Cosine similarity between the different synset vectors and glosses of the word “abduction” in PWN.
6.3 The precision of selecting glosses for PWN synsets.
6.4 Examples of Arabic glosses we produce in our Arabic wordnet.
6.5 Examples of Assamese glosses we produce in our Assamese wordnet.
6.6 Examples of Vietnamese glosses we produce in our Vietnamese wordnet.
6.7 The precision of selecting glosses for Arabic, Assamese and Vietnamese synsets.
List of Figures

3.1 An overview of the CSS management tool, adapted from (Nagvenkar et al., 2014).
4.1 IWND.
4.2 Core wordnet mapping to structured wordnet.
4.3 Creating wordnet semantic relations using intermediate wordnet.
4.4 The effect of missing synsets in recovering wordnet semantic relations using intermediate wordnet.
4.5 Percentage of synset semantic relations recovered for the Arabic, Assamese and Vietnamese wordnets.
5.1 A histogram of synonyms, semantically related words, and non-related words extracted from AWN.
6.1 An example of creating a vector for a wordnet synset that includes more than one word.
6.2 An example of creating vectors for wordnet synsets that share a single word.
7.1 An overview of the LexBank system.
7.2 LexBank web site map.
7.3 The registration web form.
7.4 Sequence diagram of the registration process.
7.5 The log-in web form.
7.6 Sequence diagram of the log-in process.
7.7 The main menu.
7.8 The web form for searching wordnet by lexeme. The form shows the result of searching the Arabic lexeme (ﻣﺼﺮ), which means Egypt.
7.9 Sequence diagram of the process of searching wordnet using a lexeme.
7.10 The web form for searching wordnet by OffsetPos. The form shows the result of searching the Arabic synset (08897065-n).
7.11 The web form for searching wordnet by OffsetPos. The form shows the result of searching the Vietnamese synset (08897065-n).
7.12 The web form for searching wordnet by OffsetPos. The form shows the result of searching the Assamese synset (08897065-n). The third part meronym in Assamese is wrong; it comes from the verb meaning of “desert”, which means to leave without intending to return.
7.13 Sequence diagram of the process of searching wordnet using OffsetPos.
7.14 The web form for evaluating semantic relations between synsets in a wordnet. The form shows an example of evaluating a hyponymy relation between two Assamese lexemes, one for radio telegraph and the other for radio.
7.15 Sequence diagram of the process of evaluating the relation between two lexemes.
7.16 The web form for evaluating wordnet synset glosses. The form shows an example of evaluating the Arabic synset (13108841-n).
7.17 Sequence diagram of the process of evaluating wordnet synset glosses.
7.18 The web form for searching a bilingual dictionary. The form shows the result of translating the Arabic word (ﻣﺼﺮ), which means Egypt, to Assamese.
7.19 Sequence diagram of the process of searching a bilingual dictionary.
7.20 The web form for managing users in LexBank.
7.21 Sequence diagram of the process of managing users in LexBank.
9.1 The IW approach for creating a new bilingual dictionary.
9.2 Extending bilingual dictionaries using structured wordnets.
Chapter 1
INTRODUCTION
1.1 Motivation
The word lexicon denotes a repository that stores the vocabulary of a person, a language or a branch of knowledge such as computer science, the military or medicine (Wikipedia, 2016c). In linguistics, a lexicon is an inventory of basic units of meaning, or lexemes. In practice, a lexicon (we may also call it a dictionary) may be printed as a book or stored in a computer database that can be searched and used by a program. A lexical resource is a group of lexical units that provide linguistic information. The lexical units can be morphemes, words or multi-word expressions. The basic unit of a lexical resource is usually called a lexical entry. Some lexical resources can be used by humans directly, while others are machine readable. Lexical resources are the foundation of most Natural Language Processing (NLP) applications.
There are many types of lexical resources. Depending on its type, a lexical resource can provide syntactic, morphological, phonological or semantic information. Monolingual dictionaries, bilingual dictionaries and wordnets are examples of lexical resources. A few fortunate languages, such as English and Chinese, have a relatively large number of high-quality lexical resources. These languages are usually called resource-rich.
Most high-quality lexical resources for the resource-rich languages have been painstakingly created by researchers manually over many years. Unfortunately, most other existing
human languages lack many such lexical resources. Languages which lack lexical and other resources are called resource-low or resource-poor languages. While some of these languages have some resources, many others barely have any. Especially poor in this context are the endangered languages around the world.
One important resource that is very helpful in computational processing and in human language learning is a thesaurus, which provides synonyms and antonyms of words. An extended version of a thesaurus that provides additional relations among words in the computational context is usually called a wordnet. A wordnet is a structured lexical ontology that groups words based on their meaning into sets called synsets. For example, the words helicopter, chopper, whirlybird and eggbeater are grouped in one synset that has the gloss an aircraft without wings that obtains its lift from the rotation of overhead blades. The wordnet connects synsets with each other based on semantic relations. Wordnets are used in many applications such as word sense disambiguation, machine translation, information retrieval, text classification and text summarization.
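As a concrete illustration of synsets and their relations, the Princeton WordNet can be queried through NLTK's wordnet corpus reader. The snippet below is a minimal sketch; the exact synset names and glosses printed depend on the installed WordNet version.

from nltk.corpus import wordnet as wn  # requires: python -m nltk.downloader wordnet

# Inspect every synset containing the word "helicopter".
for synset in wn.synsets('helicopter'):
    print(synset.name())         # e.g., 'helicopter.n.01'
    print(synset.lemma_names())  # members: helicopter, chopper, whirlybird, eggbeater
    print(synset.definition())   # the gloss
    print(synset.hypernyms())    # semantic relation to more general synsets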
The Princeton WordNet (PWN) is the original English version of such a wordnet; it has been produced through diligent manual work, augmented by the development of computational tools, over several decades at Princeton University. Similar complete wordnets have also been produced for a small number of additional languages such as French (Sagot and Fišer, 2008), Finnish (Lindén and Carlson, 2010) and Japanese (Kaji and Watanabe, 2006). Efforts to produce wordnets for a variety of other languages have been proposed, but most are moving slowly, such as the efforts to construct the Asian wordnets (Charoenporn et al., 2008) and the Indian wordnets (Bhattacharyya, 2010).
Another important type of resource is the bilingual dictionary, an essential tool for human language learners. Most existing (online) bilingual dictionaries are between two resource-rich languages or between a resource-rich language and a resource-poor language. Fortunately, many endangered languages have one bilingual dictionary, usually created by explorers, evangelists or other scholars. However, dictionaries or translators between two resource-poor languages barely exist. Wiktionary1, a dictionary created by volunteers, supports over 172 languages, although coverage is poor for many of them. The online translation engines developed by Google2 and Microsoft3 provide pairwise translations, including translations for single words, for 103 and 53 languages, respectively. While this is a wide range of languages, these machine translators still leave out many widely-spoken languages, not to mention endangered ones. Some 7,097 languages are spoken in the world today (Gordon and Grimes, 2005), of which about 400 are spoken by at least a million people.
In previous work, we focused on developing new techniques that leverage existing resources for resource-rich languages to build bilingual dictionaries, core wordnets and other resources, such as simple translators, for resource-poor languages, including a few endangered ones (Lam et al., 2014a,b, 2015b). In this thesis work, we take these resources to the next level by improving their functionality, quality and coverage. We present several new techniques that we did not use in our previous work. Our ultimate
goal is to produce an integrated multilingual lexical resource available online, one that includes several important individual resources for several languages. We believe that our resources will help researchers, speakers, learners and other users of these languages.

1 http://en.wiktionary.org/wiki/Wiktionary:Main_Page
2 http://translate.google.com/
3 http://www.bing.com/translator
1.2 Research Focus
The goal of this dissertation is to create and make available multilingual lexical resources for several languages by bootstrapping from a limited number of existing resources. Our study has the potential not only to construct new lexical resources, but also to provide support for communities using languages with limited resources. Additionally, our research presents novel approaches to generate new lexical resources from a limited number of existing resources.

The main focus of our work is to collect data from disparate sources, develop algorithms for mining and integrating such data, produce lexical resources, and evaluate the resources with regard to the quality and quantity of entries. To develop and test our ideas, we work with a few languages for which we have in-house expertise. These include Assamese (asm), Arabic (arb), English (eng) and Vietnamese (vie). In Chapter 2 we present a detailed introduction to Arabic. Next, we present brief introductions to Assamese and Vietnamese.
1.2.1 Assamese Language
Assamese is an Indo-European language that is spoken by more than 15 million people (Hinkle et al., 2013). It is mainly used in the Indian states of Assam, Arunachal Pradesh, Meghalaya, Nagaland and West Bengal. Assamese has four dialects: Standard Assamese, Jharwa, Mayang and Western Assamese (Gordon and Grimes, 2005). We present a brief description of the script, morphology and syntax of Assamese.
1.2.1.1 Assamese Script
The Assamese script consists of 37 consonants, 11 vowels, 147 conjuncts and a few punctuation marks (Hinkle et al., 2013). Unlike English, where a written letter may have variable pronunciations, each Assamese written letter has one pronunciation. A consonant that does not occur at the end of a word is assumed to have an implicit vowel /a/ following it. However, when several consonants need to be pronounced together, they are usually written using a conjunct letter.

When a vowel follows a consonant, the vowel is not written explicitly, but implicitly as an operator. These operators are attached to consonants in different positions (Hinkle et al., 2013): they can appear to the left, to the right, below or above the consonant. Foreign words can appear in Assamese script as transliterations. However, it is not unusual to write foreign words in foreign alphabets within a piece of Assamese text.
1.2.1.2 Assamese Morphology
Assamese morphology has two types of morphological transformations: derivational and inflectional. Around 48% of Assamese words are constructed using these two types of transformation (Sharma et al., 2008). The derivational transformation in Assamese is usually performed by changing the vowel component of a word, while the inflectional transformation is performed by adding prefixes or suffixes to a word. Assamese is well known for its complex suffixes. It is common for an Assamese word to include a sequence of suffixes; four to six suffixes in sequence are not uncommon (Saharia et al., 2009).
In Assamese, suffixes are used for many purposes. The most common purpose of suffixes is determination (Sharma et al., 2008). In fact, a large number of Assamese suffixes are determiners. As in other languages, some determiners are attached to nouns and pronouns to make them specific, similar to using this and that in English. Unlike in many other languages, such as English, where affixes are used, determiners in Assamese are also used to turn singular nouns into plurals.
1.2.1.3 Assamese Syntax
Assamese has relatively unconstrained syntax and is considered a free-word-order language: sentences can be written in different word orders and still have the same meaning. The normal form of a simple Assamese sentence is Subject+Object+Verb (SOV) (Sarma, 2012), although other orders are acceptable.
1.2.2 Vietnamese Language
Vietnamese, the primary language of Vietnam, is an Austroasiatic language that arose in Indo-China (Thompson, 1987). It is the first language of more than 75 million people living in Vietnam (Gordon and Grimes, 2005). Also, due to emigration, it is the first language of many people living around the world, especially in East and Southeast Asia. Vietnamese, also called Annamese, has five main dialects that differ mainly in their sound systems: Northern Vietnamese, North-Central Vietnamese, Mid-Central Vietnamese, South-Central Vietnamese and Southern Vietnamese (Wikipedia, 2016a). In the next sections, we present a brief description of the script, morphology and syntax of Vietnamese.
1.2.2.1 Vietnamese Script
Old Vietnamese texts were written using Chinese characters. In the 17th century, the Latin alphabet was introduced to Vietnamese by the French. By the beginning of the 20th century, the Romanized version of Vietnamese had become dominant (Thompson, 1987).

Compared to other languages, Vietnamese has a large number of vowels. It has 11 single vowels in addition to three types of composed vowels: centering diphthongs, closing diphthongs and triphthongs (Gordon and Grimes, 2005). These composed vowels are created by combining single vowels. Vowels are modified by diacritics, which can be written above or below a vowel and are used to specify the vowel's tone. These tones differ in length, pitch height, pitch melody and phonation. There are 25 consonants in Vietnamese, represented in the written script by a variable number of letters. Some consonants are represented by one letter and others by a digraph, a combination of two letters. Some consonants are represented by more than one digraph or letter (Wikipedia, 2016a).
1.2.2.2 Vietnamese Morphology
In Vietnamese, the majority of words are polysyllabic, that is, composed of two or more syllables (Noyer, 1998). Polymorphemic words in Vietnamese are constructed in three ways: combining two words, adding affixes to a stem, or reduplication. Words formed using reduplicative morphology are constructed by duplicating a word or a part of a word. There are a small number of affixes in Vietnamese, most of them prefixes and suffixes. One distinct characteristic of Vietnamese is that it does not have any number, gender, case or tense distinctions (Wikipedia, 2016b). However, a noun classifier is usually used as a determiner and is added after the word to specify those characteristics.
1.2.2.3 Vietnamese Syntax
Vietnamese sentences follow the Subject+Verb+Object (SVO) word order. To distinguish between verbs and nouns in a Vietnamese sentence, a copula is used before nouns. Noun phrases are usually composed of a noun and a modifier; the modifier can be a numeral, a classifier, a prepositional phrase or another descriptive word. As in other languages, pronouns are used to substitute for nouns and noun phrases.
1.3 Research Contributions
The resources created in Khang Lam's Ph.D. dissertation (Lam, 2015) and reported in (Lam et al., 2014a,b, 2015b) have many holes. For example, the wordnets have only synsets, which are sets of synonyms for words. In this dissertation work, we develop algorithms and models to automatically establish the semantic relations between synsets in our previously created core wordnets for our languages of focus, using both pre-existing resources and resources we create ourselves by bootstrapping. The following are the contributions of this thesis:
• We construct the rest of the structure for our core wordnets with acceptable quality. We focus on the construction of wordnet semantic relations, such as Hypernyms, Hyponyms, Member Meronyms, Part Meronyms and Part Holonyms, between the synsets. We believe that our work contributes significantly to the repository of resources for languages that lack them.
• We present a method to enhance the quality of the wordnets we create in the first task by filtering out mistakenly created synsets and relations. In this task, we use a state-of-the-art technique, word embeddings (Mikolov et al., 2013); a sketch of the filtering idea appears after this list. This method provides a solution to the problem of wrong translations produced by the translation method.
• We produce an approach to create a vector representation for synsets. This approach aims to provide a better way of representing meaning. The representation can be used in several areas; in this work, we use it to automatically extract glosses from corpora for the wordnet synsets we create in the previous tasks. It can also be used for the word-sense disambiguation (WSD) problem, which arises with words that have multiple meanings.
• Then, based on the vector representation of synsets, we present a novel approach to add a gloss to each synonym set (synset) in our core wordnets. A gloss is a definition or a sentence that clarifies the meaning of the synset. Glosses are mostly added manually by humans or generated automatically using rule-based approaches (Cucchiarelli et al., 2004).
• Finally, we present LexBank, a system that makes the resources we create available to the public. We design and implement the system so that it provides useful services, in a user-friendly manner, for users who seek linguistic resources. We aim to make the system flexible and expandable so it can accommodate additional languages and resources.
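To make the embedding-based contributions concrete, the sketch below shows the kind of computation involved: cosine similarity between word vectors, filtering a candidate synset by each member's mean similarity to the other members, and averaging member vectors into a synset vector. This is a minimal sketch: the embed lookup table and the threshold value are placeholder assumptions, and Chapters 5 and 6 describe how the actual thresholds and representations are chosen.

import numpy as np

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def filter_synset(words, embed, threshold=0.4):
    """Drop candidate words whose mean similarity to the other members of a
    synset falls below a threshold. `embed` maps a word to its vector and
    `threshold` is a placeholder value."""
    kept = []
    for w in words:
        others = [o for o in words if o != w]
        if not others:
            return list(words)
        mean_sim = np.mean([cosine(embed[w], embed[o]) for o in others])
        if mean_sim >= threshold:
            kept.append(w)
    return kept

def synset_vector(words, embed):
    """A simple synset representation: the mean of the member word vectors."""
    return np.mean([embed[w] for w in words], axis=0)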
Chapter 2
CASE STUDY: THE CURRENT STATUS OF AND CHALLENGES IN
PROCESSING INFORMATION IN ARABIC
Since Arabic is one of the languages we use in our experiments throughout this dissertation, in this chapter we present the current status of Arabic language processing as an example.
2.1 Introduction
According to Ethnologue (Gordon and Grimes, 2005), Arabic is the official language of more than 223 million people in 25 countries, which makes it one of the most widely-spoken languages in the world. Arabic is the language of Islam, the religion of 1.6 billion people around the world. Muslims are required to use Arabic to read the Quran (the Holy Book of Islam) and to perform the rituals of Islam. There are around 30 major dialects of Arabic. These dialects have different phonologies, morphologies, syntax and even lexicons (Habash, 2010). However, these dialects are not used as official languages by themselves; they are used for informal speech. For formal writing and speaking, the official Modern Standard Arabic (MSA) is used. MSA was developed based on Classical Arabic, the language of historical literature. Dialects are nowadays commonly used for writing in social media, but they are rarely used in books, newspapers and literary writing. Most Arabs can speak MSA; however, it is not the natively spoken language of any region (Diab and Habash, 2007). This coexistence between MSA and
dialects is problematic for Arabic language processing. This happens to be a problem in
most widely spoken languages in the world (Haugen, 1966).
One important survey (Farghaly and Shaalan, 2009) discussed the importance of research in the field of Arabic processing from three perspectives. The first is the perspective of non-Arabic speakers who need to process a huge amount of Arabic text; the Department of Homeland Security in the United States is a good example. With increasing security risks, there is a crucial need to understand the meaning of Arabic documents and to retrieve important information from them, such as names, organizations and places. The second perspective is that of Arabic speakers. Machine translation, information retrieval, summarization and linguistic tools are some of the applications requested by Arabic speakers.
In the rest of this chapter, we give a summary of the features that make the processing of Arabic text so challenging, along with some of the solutions and resources that have been designed to address these challenges. In Section 2.2, we discuss the fundamental issues in Arabic: the script, the morphological issues and the syntactic issues. Among the most valuable resources developed for Arabic processing are the Penn Arabic Treebank (PATB), the Prague Arabic Dependency Treebank (PADT), and the Columbia Arabic Treebank (CATiB).
2.2 Fundamentals of Arabic
In this section we discuss the script, morphology and syntax of Arabic.
2.2.1 Arabic Script
Arabic is written in a right-to-left script. The Arabic script is also used by languages such as Kurdish, Urdu, Persian and Pashto (Habash, 2010). One important aspect of Arabic is that most Arabic letters are composed of two parts: a base form and a mark. There are three kinds of marks in Arabic letters. The first kind consists of dots, which are used to distinguish between letters that share the same base form; examples are the letters (ﺏ) “ba”, (ﺕ) “ta” and (ﺙ) “tha”. The second kind of mark is the Hamza mark (ﺀ), which can be written above some letters, as in (ﺅ) “u”, or under some letters, as in (ﺇ) “I”. Unfortunately, people often misspell words by omitting such marks, making it hard to distinguish between similar letters and causing ambiguity in the text. It is also important to note that Hamza (ﺀ) can also be considered a letter by itself; an example of a word that contains the Hamza letter is (ﺳﻤﺎﺀ), which means “sky”. The third kind of mark is the Hamza mark that distinguishes the letter (ﻙ) “Kaf” from the letter (ﻝ) “Lam”.
Most letters in Arabic have several shapes. The shape of a written letter is determined by the position of that letter in the word. Take the letter (ﻕ) “qaf” as an example: if it appears at the beginning of a word, it has the shape (ﻗـ); it has the shape (ـﻘـ) if it appears in the middle of a word, and the shape (ـﻖ) if it is at the end of a word. Word processors select the appropriate letter shape based on the rules which govern these shapes, and therefore there is only one key for each letter.
Inflectional morphology is also a factor that governs the shape of some Arabic letters. The Arabic letter “Hamza” is a good example. The word (ﺃﺻﺪﻗﺎﺀ), which means “friends”, becomes (ﺃﺻﺪﻗﺎﺋﻲ) instead of (ﺃﺻﺪﻗﺎﺀﻱ) when we add the letter (ﻱ), which represents the possessive pronoun “my”.
In Arabic, each letter is mapped to one unvarying sound, which makes it a phonetic language. For example, the Arabic letter (ﺳـ) always has the pronunciation /s/. In contrast, the letter “s” in English has three pronunciations, /z/, /s/ and /sh/, as in “nose”, “salt” and “sugar”, respectively. However, in Arabic a short vowel may be added to a letter to change its sound. There are three short vowels in Arabic, which means that each letter has three more sounds in addition to its original sound. There are no dedicated letters to represent short vowels; they may be specified in the written language using optional diacritics. To show how the short vowels change the sound of Arabic letters, consider the letter (ﺳـ) again. We said that (ﺳـ) is pronounced /s/; however, if we add the short vowel “Dhamma”, it is pronounced “su” and may be written, with the Dhamma diacritic, as (سُ). If we add the short vowel “Kasra”, it is pronounced “si” and may be written, with the Kasra diacritic, as (سِ). Keep in mind that in MSA the writing of diacritics is optional, although a change in a diacritic of a letter can change the meaning of the word and may even change the morphological structure of the sentence. Clearly, this is a major source of ambiguity in Arabic processing (Diab and Habash, 2007).
Obviously, with all these problems caused by the Arabic script, Arabic input text has to be pre-processed to enhance recognition during the actual processing. This preprocessing, called normalization, aims to standardize different Arabic script variations. Several solutions have been proposed to normalize the Arabic script. For example, (Larkey et al., 2002) normalized the corpus, the queries and the dictionaries of Arabic using the following steps. They first unified the encoding and removed punctuation in the text. Then they removed all the diacritics and the non-letter elongation character called “tatweel”. After that, they removed the Hamza mark (ﺀ) from the letter “Alif” to standardize all the variants (ﺃ), (ﺇ) and (ﺁ) to (ﺍ). Also, they replaced (ﻯﺀ) with (ﺉ), (ﻯ) with (ﻱ), and (ﺓ) with (ﻩ). The Stanford Natural Language Processing Group adopted a similar procedure in the Stanford Arabic Statistical Parser (Green and Manning, 2010). The normalization process, as you might expect, does not come without a price: since the purpose of all these removed marks is to reduce ambiguity, normalizing the variant scripts increases the probability of ambiguity (Farghaly and Shaalan, 2009).
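The steps of (Larkey et al., 2002) translate almost directly into a short normalization routine. The following is a minimal sketch in Python; the Unicode codepoints for the diacritics, tatweel and letter variants are standard, but the exact rule set and ordering in the original system may differ.

import re

# Arabic diacritics (tashkeel) occupy U+064B..U+0652; tatweel is U+0640.
DIACRITICS = re.compile('[\u064B-\u0652]')

def normalize_arabic(text):
    """Normalize Arabic text following the steps described by Larkey et al."""
    text = DIACRITICS.sub('', text)                 # drop short-vowel diacritics
    text = text.replace('\u0640', '')               # drop tatweel (elongation)
    for alif in ('\u0622', '\u0623', '\u0625'):     # alif variants with Hamza/Madda
        text = text.replace(alif, '\u0627')         # unify to bare alif
    text = text.replace('\u0649\u0621', '\u0626')   # alif maqsura + hamza -> ya-hamza
    text = text.replace('\u0649', '\u064A')         # alif maqsura -> ya
    text = text.replace('\u0629', '\u0647')         # ta marbuta -> ha
    return text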
Unlike English and some other languages, Arabic has no silent letters. An example of a silent letter in English is the letter “p” in the word “pneumonia”. There are also no new sounds in Arabic produced by combining two letters. For instance, in English, “c” and “h” are combined to produce three distinct sounds: the sound at the beginning of “cheese”, the sound at the beginning of “character”, and the sound at the beginning of “chef”.
It is well known that splitting text into sentences is an essential step in many Natural Language Processing (NLP) applications. In English, this is a relatively easy task, since English sentences start with an uppercase letter and finish with a period. However, splitting Arabic sentences is not as easy, since there is no capital form of Arabic letters (Chinese, Japanese and Korean have no capitalization either). In addition, punctuation rules in Arabic are not strict, so many people do not use punctuation properly. In fact, Arabic writers use coordination, subordination and logical connectives excessively to conjoin sentences (Farghaly and Shaalan, 2009). Hence, it is not unusual for an Arabic article to contain a complete paragraph with no period other than the one at the end of the paragraph. Therefore, Arabic texts must go through complicated preprocessing.
The lack of capitalization obviously makes it hard to detect named entities (Darwish, 2013), an essential part of Information Retrieval (IR). In English, extracting named entities such as cities, names of people, addresses and organizations is done with the help of capitalization and punctuation. For example, to recognize a name like “Barack H. Obama”, a simple algorithm can search for an uppercase word followed by an initial with an optional period, followed by another uppercase word. We are not claiming that NER in English is straightforward or simple in general, but since Arabic does not have these features, new methods must be used to address the problem of named entity recognition (Darwish, 2013).
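To illustrate how far the capitalization cue carries in English, and by contrast why it is unavailable for Arabic, consider the toy pattern below. It is a deliberately naive sketch that matches the “Barack H. Obama” shape and produces plenty of false positives; it is not a serious NER system.

import re

# Uppercase word, an optional initial with a period, then another uppercase word.
NAME = re.compile(r'\b[A-Z][a-z]+ (?:[A-Z]\. )?[A-Z][a-z]+\b')

print(NAME.findall('Barack H. Obama visited the city.'))
# prints ['Barack H. Obama']; the same pattern would also match many
# ordinary capitalized word pairs, which is why this cue alone is weak.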
2.2.2 Arabic Morphology
Arabic has a very rich and complex morphology (Attia, 2008). As in the other Semitic languages, morphology in Arabic is of two types: derivational and inflectional. Derivational morphology is the process of creating new words by mapping a root to a pattern. The root carries the meaning while the pattern changes the structure of the root, generating a new word with a different part of speech. This type of derivational morphology is called nonlinear morphology (Bhattacharya et al., 2005). Inflectional morphology, on the other hand, is the process of modifying words with features to create plural, feminine or definite forms of the word (Habash, 2010).
A morpheme is “a linguistic form which bears no partial phonetic-semantic resemblance to any other form” (Bloomfield, 1933). Morphological processing in NLP is the process of decomposing a word into morphemes. This is a relatively easy task for concatenative morphology; however, in languages with nonconcatenative morphology, like Arabic, it is much harder. In Arabic, words are built by merging a consonantal root and a vocalism (McCarthy, 1981). The root carries a semantic field while the vocalism specifies the grammatical form. An example showing the nonconcatenative morphology of Arabic is the word (ﻛﺘﺐ) “katab”, which means “to write”: it is composed by associating the root /k-t-b/, which carries the meaning of “writing”, with a vocalism that specifies its grammatical form.
Several approaches have been used to decompose Arabic words. The first approach recovers the root by stripping all prefixes and suffixes from the word and then matching the rest of the word against a lexicon of roots (Hlal, 1985). This approach is very common; however, it requires a lexicon of all possible Arabic roots, prefixes, infixes and suffixes (Beesley, 1996; Shaalan et al., 2006). Buckwalter introduced another approach in his morphological analyzer (BAMA) (Buckwalter, 2004). Rather than recovering the root, BAMA recovers the stem and considers it the main building block of the Arabic word. The stem is recovered by removing just the prefixes and the suffixes. Therefore, BAMA decomposes an Arabic word into three parts: a stem, prefixes and suffixes.
The decomposition process searches for prefixes and suffixes in the word that satisfy constraints governing the possibility of combining them with the stem. BAMA has a bidirectional transliteration scheme between Arabic script and Latin script, which means that developers can work with unstructured Arabic texts without any knowledge of the Arabic language. For this reason, many recent statistical Arabic NLP systems use BAMA as the foundation for machine translation and information retrieval. However, BAMA has the limitation of giving a general analysis that includes all possible cases of the word without considering the context of the input text. A more refined result can be obtained using a disambiguation module that considers the context of the input text and eliminates the incorrect analyses (Habash and Rambow, 2005).
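To make the prefix-stem-suffix idea concrete, here is a toy decomposition in the spirit of BAMA. The mini-lexicons are hypothetical stand-ins; the real analyzer uses large affix and stem tables together with compatibility constraints between prefix, stem and suffix classes.

# Hypothetical mini-lexicons; real tables are far larger and constrained.
PREFIXES = {'', 'ال', 'و', 'وال'}   # e.g., definite article, conjunction
SUFFIXES = {'', 'ه', 'ها', 'ي'}     # e.g., attached pronouns
STEMS = {'كتاب', 'بيت'}             # hypothetical stem lexicon

def analyze(word):
    """Return all (prefix, stem, suffix) splits licensed by the lexicons."""
    analyses = []
    for i in range(len(word) + 1):
        for j in range(i, len(word) + 1):
            pre, stem, suf = word[:i], word[i:j], word[j:]
            if pre in PREFIXES and stem in STEMS and suf in SUFFIXES:
                analyses.append((pre, stem, suf))
    return analyses

print(analyze('والكتاب'))  # one analysis: ('وال', 'كتاب', '')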
Dialectal Arabic differs from MSA morphologically, lexically and phonologically (Habash et al., 2013). Furthermore, dialectal Arabic has no standard orthographies and no language academies. Therefore, the tools and resources designed for MSA do not work with dialectal Arabic. Recently, several research efforts have focused on Arabic dialectal texts (Habash and Rambow, 2005; Habash et al., 2013; Zaidan and Callison-Burch, 2014). The state-of-the-art dialectal Arabic morphological analyzer is the Columbia Arabic Language and dialect Morphological Analyzer (CALIMA) (Habash et al., 2013).
Arabic is an agglutinative language, which means that Arabic words usually include affixes and clitics that represent different parts of speech. Consider the word (كتبته) “katabto ho”, which means “I wrote it”. This word is a verb with both the subject and the object attached to it. The subject is marked by the diacritic on the fourth letter (ت) “ta”, while the object is the suffix (ـه) “ho”. This is just a simple example; words often have more complex structures that include other clitics to specify gender, person, number and voice. Hence, due to complex phonological rules, the decomposition of Arabic words is relatively more difficult than in other languages.
2.2.3 Arabic Syntax
According to (Habash, 2010), there are two kinds of sentences in Arabic: sentences that start with a verb (V-SENT) and sentences that start with a noun (N-SENT). Verb-subject-object (VSO) is the primary structure of a V-SENT sentence in Classical and Modern Standard Arabic. However, the object-verb-subject (OVS) and subject-verb-object (SVO) orders are also commonly used. Arabic is a pro-drop language, which means that subjectless sentences are perfectly grammatical in Arabic. Also, unlike English, equational sentences like “He a journalist” are allowed without the need of a “to be” verb. Russian, Hungarian, Hebrew, and the Quechuan languages also allow this type of sentence.
In Arabic, constituent questions are usually formed by starting with a wh-phrase. However, it is grammatically correct for a constituent question not to start with the wh-phrase. For example, the question (ﺃﻛﻠﺖ ﻣﺎﺫﺍ ﺑﺎﻷﻣﺲ؟) literally means “you ate what yesterday?”. Furthermore, relative clauses in Arabic are connected using relative pronouns. For example, the sentence (ﺍﺣﺒﺒﺖ ﺍﻟﺒﻴﺖ ﺍﻟﺬﻱ ﺍﺷﺘﺮﻳﺘﻪ) contains two clauses: (ﺃﺣﺒﺒﺖ ﺍﻟﺒﻴﺖ), which means “I liked the house”, and (ﺍﻟﺬﻱ ﺍﺷﺘﺮﻳﺘﻪ), which means “which I bought”. The two clauses are connected using the relative pronoun (ﺍﻟﺬﻱ), which means “which”. An Arabic relative pronoun must agree in number and gender with the noun it modifies in the second clause.
2.3 Summary
In this chapter, we presented a short overview of information processing in Arabic. We summarized the challenges that developers and researchers face when processing Arabic text due to many of its features. The lack of capitalization, dropped subjects, missing short vowels and the nonconcatenative morphology are some of these features. In addition, there are many dialects of Arabic, which are used in informal speech and writing; these dialects must be treated differently when processing Arabic texts. Much research has been conducted to address the challenges of Arabic text processing, and some valuable resources and techniques have been developed. However, more work needs to be done to give Arabic developers and speakers the support they need.
Chapter 3
LITERATURE REVIEW
In this chapter, we provide a summary of the main existing approaches for creating
lexical resources. We focus on two types of lexical resources: wordnets and bilingual
dictionaries.
3.1 Automatic Construction of Wordnets
A wordnet is a lexical ontology of words. There are two ways to construct wordnets for languages that do not possess such resources: manual construction and automatic construction. We intend to use automatic construction, starting from the core wordnets we created in our earlier work (Lam et al., 2014a,b, 2015b) and other existing resources that are freely available. Other efforts are underway to manually (or mostly manually) create wordnets in a variety of languages, although progress seems slow all around.
High-quality wordnets have been developed for a small number of languages. Wordnets, other than the Princeton WordNet (Fellbaum, 1998; Miller, 1995), are typically constructed by one of two approaches. The first approach, which is called the expansion approach, translates the PWN to target languages (Akaraputthiporn et al., 2009; Barbu, 2007;
Bilgin et al., 2004; Kaji and Watanabe, 2006; Lam et al., 2014b; Lindén and Niemi, 2014;
Oliver and Climent, 2012; Sagot and Fišer, 2008; Saveski and Trajkovski, 2010). In contrast, the second approach, which is called the merge approach, builds the semantic taxonomy of a wordnet in a target language, and then aligns it with the Princeton WordNet by
generating translations (Borin and Forsberg, 2014; Gunawan and Saputra, 2010; Maziarz et al., 2013; Rigau et al., 1998). To construct the taxonomic relations between words, definitions of words are first retrieved from machine-readable dictionaries. Then a genus disambiguation process, the process of finding a word with a broad meaning under which more specific words fall, is performed using the definitions to construct a hierarchical class of concepts. Next, the classes are merged with the synsets in the PWN using a bilingual dictionary to form the target wordnet.
The expansion approach dominates the merge approach in popularity. Wordnets generated using the merge approach may have structures different from the Princeton WordNet. In contrast, wordnets created using the expansion approach have the same structure as the Princeton WordNet, which provides a level of uniformity among them, possibly at the cost of some natural language-specific expressiveness (Leenoi et al., 2008). Many approaches to constructing wordnets are semi-automatic and, therefore, can be used only for languages that have some existing lexical resources; any attempt to build wordnets for resource-poor languages using these methods would be doomed from the start. Moreover, while wordnets are always difficult to evaluate, it is even harder to evaluate machine-created wordnets in resource-poor languages, because these languages do not have gold standards to compare with and frequently do not have easily accessible experts to evaluate such resources.
Crouch (Crouch, 1990) clusters documents using a complete link clustering algorithm and generates thesaurus classes or synonym lists based on user-supplied parameters.
Curran and Moens evaluate the performance and efficiency of thesaurus extraction methods and also propose an approximation method that provides better time complexity with little loss in accuracy (Curran and Moens, 2002a,b). Ramirez and Matsumoto develop a multilingual Japanese-English-Spanish thesaurus using two freely available resources: Wikipedia and the Princeton WordNet (Ramírez et al., 2013). They extract translation tuples from Wikipedia articles in these languages, disambiguate them by mapping them to wordnet senses, and extract a multilingual thesaurus with a total of 25,375 entries. One thing we must note about all these approaches is that they are resource-hungry, requiring a large corpus of Wikipedia or non-Wikipedia documents and wordnets. For example, Lin works with a 64-million-word English corpus to produce a high-quality thesaurus with about 10,000 entries (Lin, 1998). Ramirez and Matsumoto have the entire Wikipedia at their disposal, with millions of articles in three languages, although for experiments they use only about 13,000 articles in total (Ramírez et al., 2013). Furthermore, (Miller and Gurevych, 2014) work with more than 19 thousand Wiktionary senses and 16 thousand Wikipedia articles to produce a three-way alignment of WordNet, Wiktionary and Wikipedia. When we work with low-resource or endangered languages, we do not have the luxury of collecting such big corpora or accessing even a few thousand articles from Wikipedia or the entire Web. Many such languages have little or no Web presence. As a result, we have to work with whatever limited resources are available.
In this work we introduce approaches to generate synonyms, hypernyms, hyponyms and some other semantic relations. To enhance the quality of the wordnets we create, we present several approaches to measure relatedness between concepts or words. Some potential approaches for measuring semantic relatedness are dictionary-based (Kozima and Furugori, 1993) and thesaurus-based (Hirst and St-Onge, 1998).
Oliver (Oliver, 2014) presented approaches for constructing wordnets using the expand model (Vossen, 1998) and made them available through a Python toolkit1. The author designed three strategies that use three types of resources to construct wordnets: dictionaries, semantic networks (Navigli and Ponzetto, 2010) and parallel corpora. While the construction approaches using dictionaries and semantic networks were direct, machine translation and automatic sense-tagging were used to construct wordnets from parallel corpora. The toolkit provides access to the three construction methods as well as to some freely available lexical resources. To test the dictionary-based approach, wordnets were constructed for six languages (Spanish, Catalan, French, Italian, German and Portuguese) with precision between 48.09% and 84.8%. The semantic-network-based approach yielded wordnets for the six languages with precision between 49.43% and 94.58%. The parallel-corpus-based approach achieved precision between 70.26% and 93.81% with machine translation, and between 75.35% and 82.44% with automatic sense-tagging. The author stated that the automatically calculated precision values are very prone to errors.

1 http://lpg.uoc.edu/wn-toolkit
Another example of constructing wordnets using dictionary-based methods is JAWS (Mouton and de Chalendar, 2010), a French wordnet of nouns constructed by translating wordnet nouns using a bilingual dictionary and a syntactic language model. The construction of JAWS starts by copying the structure (the synsets with no words) of the source wordnet. Then, the phrases available in the bilingual dictionary are used to fill out the initial synsets. Finally, the language model is used to incrementally add new phrases to JAWS. An improved version of JAWS, called WoNeF (Pradet et al., 2014), includes parts of speech and was evaluated using a gold standard produced by two annotators. In addition, WoNeF uses a better translation selection algorithm that uses machine learning to select variable thresholds for translations.
In (Lam et al., 2014b), we presented three methods to construct wordnet synsets for several resource-rich and resource-poor languages. We used some publicly available wordnets, a machine translator and a single bilingual dictionary. Our algorithms translate synsets of existing wordnets to a target language T, then apply a ranking method to the translation candidates to find the best translations in T. The approaches we used are applicable to any language which has at least one existing bilingual dictionary translating from English to it.
In the first approach, which we call the direct translation approach (DR), for each synset in PWN we directly translate the words from English to the target language. In the second approach, which we call IW, we extract candidates from several intermediate wordnets, rather than just using PWN, to disambiguate the translation. In the third approach, which we call IWND, we try to reduce the number of bilingual dictionaries used in the second approach: when the intermediate wordnet is not PWN, we translate the extracted words from that wordnet to English, and then use a single bilingual dictionary to translate the words from English to the target language. In all of these methods, after extracting the candidates, we use a ranking method to select the best translations and insert them as a synset in the target wordnet.
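As a rough illustration of the DR approach and the ranking step, the sketch below translates each English word of a PWN synset through a bilingual dictionary and ranks the candidates by how many source words vote for them. The dictionary format and the rank-by-votes heuristic are simplifying assumptions; the actual ranking method of (Lam et al., 2014b) is more involved.

from collections import Counter

def translate_synset(synset_words, bilingual_dict, top_k=3):
    """Sketch of DR: translate each English word of a synset and rank the
    target-language candidates by how many source words produced them.
    `bilingual_dict` maps an English word to a list of translations
    (an assumed input format)."""
    votes = Counter()
    for word in synset_words:
        for candidate in bilingual_dict.get(word, []):
            votes[candidate] += 1
    return [cand for cand, _ in votes.most_common(top_k)]

# Toy example with a hypothetical dictionary: t1 gets two votes, t2 one.
d = {'helicopter': ['t1'], 'chopper': ['t1', 't2']}
print(translate_synset(['helicopter', 'chopper'], d))  # ['t1', 't2']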
3.2 Wordnet Management Tools
Library      URL
CICWN        http://fviveros.gelbukh.com/wordnet.html
extJWNL      http://extjwnl.sourceforge.net/
Javatools    http://www.mpi-inf.mpg.de/yago-naga/javatools/
Jawbone      http://sites.google.com/site/mfwallace/jawbone/
JawJaw       http://www.cs.cmu.edu/~hideki/software/jawjaw/
JAWS         http://lyle.smu.edu/~tspell/jaws/
JWI          http://projects.csail.mit.edu/jwi/
JWNL         http://sourceforge.net/apps/mediawiki/jwordnet/
URCS         http://www.cs.rochester.edu/research/cisd/wordnet/
WNJN         http://wnjn.sourceforge.net/
WNPojo       http://wnpojo.sourceforge.net/
WordnetEJB   http://wnejb.sourceforge.net/

Table 3.1. A list of the Java libraries tested in (Finlayson, 2014).
Maintaining wordnets is an important area of research. The manual construction of a wordnet is an intensive process that requires a large number of specialists working for several years. Furthermore, a wordnet is not static: the meanings of many phrases change over time, and new phrases appear every year. For example, the country Sudan was divided into two countries, Sudan and South Sudan, in 2011. If one searches PWN 3.1 for Sudan, only the senses corresponding to the old Sudan show up, since the new sense has not yet been added. Moreover, the representation of wordnets evolves over time; for example, many old wordnets were upgraded to provide an XML representation. In addition, as this section shows, many wordnets are built based on the PWN. Every time PWN gets updated, these wordnets must also be updated to preserve the alignment with PWN. All of these issues show the need for wordnet maintenance tools.
One recent work on tools for maintaining wordnets is (Mladenovic et al., 2014).
The tools are designed to provide upgrade, cleaning, validation, search, import and
export functionalities for the Serbian wordnet (Christodoulakis et al., 2002). Another
recent work (Finlayson, 2014) develops a Java library, called JWI, for accessing the PWN
and compares it with eleven other libraries. The comparison between the libraries was
based on five features: special requirements, the similarity metrics provided, the ability
to edit the wordnet, whether they work with the Maven build system, and forward
compatibility with Java. Table 3.1 shows the tested libraries and Table 3.2 shows a
summary of the comparison.
             Similarity Metrics
Library      Standalone  Metric   Editing  Maven  Minimum Java
CICWN        Yes         No       No       No     1.6
extJWNL      No          No       Yes      Yes    1.6
Javatools    Yes         Yes      No       No     1.6
Jawbone      Yes         Yes      No       No     1.6
JawJaw       Yes         Yes      No       No     1.5
JAWS         Yes         No       No       No     1.4
JWI          Yes         Yes      No       No     1.5
JWNL         No          Yes      No       Yes    1.4
URCS         Yes         No       No       No     1.6
WNJN         No          No       No       No     1.5
WNPojo       No          No       No       No     1.6
WordnetEJB   No          No       No       No     1.6

Table 3.2. A comparison between some of the Java libraries for accessing the PWN.
Another wordnet management tool was also presented recently for the IndoWordNet2
(Nagvenkar et al., 2014). The tool, which is called the Concept Space Synset Management
Tool3 (CSS), provides an interactive user interface for creating new language synsets and
2 http://www.cfilt.iitb.ac.in/indowordnet/
3 http://indradhanush.unigoa.ac.in/conceptspace
linking them to other Indian language wordnets. The CSS tool uses a role-based access
control to restrict the access to the wordnet. Figure 3.1 shows an overview of the CSS tool.
Figure 3.1: An overview of the CSS management tool, adapted from (Nagvenkar et al.,
2014)
Sense marking is the process of tagging words with senses in a corpus. It is a necessary
task in preparing training data for machine learning techniques. Since sense marking is an
intensive process, sense marking tools are very handy. For example, the Indian Institute
of Technology Bombay has developed a sense marker tool for the IndoWordNet (Prabhugaonkar et al., 2014). The sense marking tool shows a highlighted word in a piece of text
and asks the annotator to choose the most appropriate sense from the available senses. The
tool also allows the annotator to add new senses that do not exist in the wordnet.
3.3 Creating Bilingual Dictionaries
Bilingual dictionaries are essential lexical resources which we use in our approaches.
The majority of low-resource languages have bilingual dictionaries that provide phrase translation between them and rich-resource languages. However, only relatively few bilingual
dictionaries are available for translation between low-resource languages. Several methods have been presented to automatically construct such dictionaries between low-resource
languages. Since the wordnets we create in this dissertation are aligned with each other, we
believe that they can be good resources for phrase translation between languages. In this
section, we discuss some methods for automatically creating bilingual dictionaries.
Given two input dictionaries L1-Lp and Lp-L2, a naïve method to create a new bilingual dictionary L1-L2 may use Lp as a pivot in a straightforward transitive approach.
However, if a word has more than one sense, being a polysemous word, this method may
introduce incorrect translations. After computing an initial bilingual dictionary, past researchers have used several approaches to mitigate the effect of ambiguity in word senses.
Methods used for disambiguation use wordnet distance between source and target words in
some way, look at dictionary entries in both forward and backward directions and compute
the amount of overlap to compute disambiguation scores (Ahn and Frampton, 2006; Bond
and Ogura, 2008; Gollins and Sanderson, 2001; Lam and Kalita, 2013; Shaw et al., 2013;
Soderland et al., 2010; Tanaka and Umemura, 1994).
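To make the naïve transitive approach concrete, the following sketch composes two dictionaries through a pivot language (our own illustration; the function name and toy data are hypothetical). It also shows how a polysemous pivot word introduces an incorrect translation:

# A minimal sketch of the naive transitive (pivot) approach. Each dictionary
# maps a word to the set of its translations. Without sense disambiguation,
# a polysemous pivot word propagates incorrect translations.
def build_pivot_dictionary(dict_l1_lp, dict_lp_l2):
    new_dict = {}
    for word_l1, pivots in dict_l1_lp.items():
        translations = set()
        for pivot in pivots:                      # follow every pivot sense
            translations |= dict_lp_l2.get(pivot, set())
        if translations:
            new_dict[word_l1] = translations
    return new_dict

# Toy example: Vietnamese -> English -> German.
vi_en = {"ngân hàng": {"bank"}}                   # financial institution
en_de = {"bank": {"Bank", "Ufer"}}                # "Ufer" (river bank) is a
result = build_pivot_dictionary(vi_en, en_de)     # polysemy error
print(result)  # {'ngân hàng': {'Bank', 'Ufer'}}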
Researchers have also merged information from several sources such as parallel corpora or comparable corpora (Nerima and Wehrli, 2008; Otero and Campos, 2010) and a
wordnet (István and Shoichi, 2009; Lam and Kalita, 2013) to address the ambiguity prob-
lem. Some researchers extract bilingual dictionaries directly from monolingual corpora,
parallel corpora or comparable corpora using statistical methods (Bouamor et al., 2013;
Brown, 1997; Haghighi et al., 2008; Héja, 2010; Ljubešić and Fišer, 2011; Nakov and Ng,
2009; Yu and Tsujii, 2009).
Obviously, the quality and quantity of existing resources strongly affect the accuracies of newly-created dictionaries. For instance, Nerima and Wehrli create new English-German and English-Italian bilingual dictionaries with 21,600 and 26,834 entries, respectively, from 76,311 entries in an English-French dictionary, 45,492 entries in a German-French dictionary, and 36,672 entries in a French-Italian dictionary (Nerima and Wehrli,
2008). Given parallel corpora of Lithuanian consisting of 1,765,000 tokens and Hungarian
including 2,121,000 tokens, Héja can extract only 2,616 correct translation candidates with
accuracy over a certain threshold from 4,025 translation candidates (Héja, 2010). Thus,
new bilingual dictionaries created using current approaches have very few entries compared to the size of the input dictionaries. Furthermore, most resource-poor languages do
not have any corpora, or even online documents. Some languages have only one very small
bilingual dictionary, such as the Karbi-English dictionary of 2,341 words.
In (Lam et al., 2015b), we present approaches to automatically build a large number of new bilingual dictionaries for low-resource languages, especially resource-poor and
endangered languages, using a single input bilingual dictionary. Our algorithms produce
translations of words in a source language to many target languages using publicly available wordnets and a machine translator (MT). Our approaches may produce any bilingual
dictionary as long as one of the two languages is English or has a wordnet linked to the
PWN. Using our approaches and starting with 5 available bilingual dictionaries, we created 48 new bilingual dictionaries. Of these, 30 pairs of languages are not supported by the
popular MTs: Google4 and Bing5.
3.4 Summary
In this chapter, we have discussed the existing methods for the automatic construction of wordnets. We have also discussed several tools and systems for managing wordnets.
Moreover, we covered some of the approaches for automatically creating bilingual dictionaries.
4 http://translate.google.com/
5 http://www.bing.com/translator
Chapter 4
AUTOMATICALLY CONSTRUCTING STRUCTURED WORDNETS
The core idea behind a wordnet is to group words which are synonyms, or roughly
synonymous, into lexical categories that are called synsets. Then, semantic relations between these synsets are established in a hierarchical manner. In this chapter, we present
a method to automatically construct the wordnet semantic relations such as Hypernyms,
Hyponyms, Member Meronyms, Part Meronyms and Part Holonyms using PWN.
4.1 Constructing Core Wordnets
In (Lam et al., 2014b), we introduced an approach, which we refer to as the IWND
approach, that creates wordnet synsets with relatively high coverage. As Figure 4.1 shows,
in IWND, to create wordnet synsets for a target language T we used existing wordnets and
a machine translator (MT) and/or a single bilingual dictionary. First, we extracted every
synset in Princeton WordNet (PWN) using the unique offset-POS key, which refers to the
offset for a synset with a particular part-of-speech (POS). Notice here that each synset
may have one or more words, each of which may be in one or more synsets. Words in a
synset have the same sense. Then, we extracted the corresponding synsets for each offset-POS from existing wordnets linked to PWN, in several languages. Next, we translated
the extracted synsets in each language to T to produce synset candidates using MT or a
dictionary. Then, we applied a ranking method on these candidates to find the correct
words for a specific offset-POS in T.
Figure 4.1: Creating wordnet synsets using the IWND algorithm (Lam et al., 2014b).
The ranking method we used in (Lam et al., 2014b) is based on the occurrence count
of a candidate. Specifically, the rank of a word w, the so-called rankw , is computed as
below.
$$\mathit{rank}_w = \frac{occur_w}{numCandidates} \times \frac{numDstWordNets}{numWordNets}$$

where:
- numCandidates is the total number of translation candidates of an offset-POS,
- occurw is the occurrence count of the word w among the numCandidates candidates,
- numWordNets is the number of intermediate wordnets used, and
- numDstWordNets is the number of distinct intermediate wordnets that have words
translated to the word w in the target language.
4.2 Constructing Wordnet Semantic Relations
Synsets in a wordnet are linked in a hierarchical fashion. The hierarchy in a wordnet
is established using the super-subordinate relation between synsets. For example, nouns
are linked using hyperonymy, which is a relation between a general synset and a specific
one. An example of a hyperonymy relation is the relation between the synsets {food,
solid_food} and {baked_goods}. The hyperonymy relation is transitive; for example, the
synset {bread}, which is a hyponym of the synset {baked_goods}, is also a hyponym of
the synset {food, solid_food}. Table 4.1 shows the semantic relations available in wordnet
(Wikipedia, 2015).
In (Lam et al., 2014b), we constructed core wordnets, which essentially means that
we created synsets with no connections between them. As Figure 4.2 shows, our goal is to
recover the taxonomy of synsets. To establish the semantic relations between the synsets
Phrase Type  Relation          Definition
Nouns        Hypernyms         Y is a hypernym of X if every X is a (kind of) Y
             Hyponyms          Y is a hyponym of X if every Y is a (kind of) X
             Coordinate terms  Y is a coordinate term of X if X and Y share a hypernym
             Meronyms          Y is a meronym of X if Y is a part of X
             Holonyms          Y is a holonym of X if X is a part of Y
Verbs        Hypernyms         The verb Y is a hypernym of the verb X if the activity X is a (kind of) Y
             Troponyms         The verb Y is a troponym of the verb X if the activity Y is doing X in some manner
             Entailments       The verb Y is entailed by X if by doing X you must be doing Y
             Coordinate terms  Those verbs sharing a common hypernym

Table 4.1. Wordnet semantic relations.
Figure 4.2: Core wordnet mapping to structured wordnet.
we created in (Lam et al., 2014b), we rely on the Princeton WordNet (Fellbaum, 2005) as
an intermediate resource.
As Figure 4.3 shows, to construct the links between synsets in our wordnet for language T, we extract each synseti from wordnett and map it to synsetj, which is the corresponding synset in the Princeton WordNet. Then, for each synsetj in the Princeton WordNet, we extract each semantic relation rj and the linked synsetk. Next, we check the
availability of synsetk in wordnett. Finally, if synsetk is available in wordnett, we add a
relation between synseti and synsetk to wordnett.
Figure 4.3: Creating wordnet semantic relations using intermediate wordnet.
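A sketch of this relation-transfer loop, assuming the target wordnet is a dictionary keyed by offset-POS and that PWN relations are available as (relation, linked offset-POS) pairs (both hypothetical layouts):

def recover_relations(wordnet_t, pwn_relations):
    """Copy PWN relations into the target wordnet where both ends exist."""
    recovered = []
    for offset_i in wordnet_t:                       # synset_i in wordnet_t
        for relation, offset_k in pwn_relations.get(offset_i, []):
            if offset_k in wordnet_t:                # synset_k exists in T
                recovered.append((offset_i, relation, offset_k))
    return recovered

wordnet_t = {"02084071-n": {"chien"}, "02083346-n": {"canide"}}
pwn_relations = {"02084071-n": [("hypernym", "02083346-n"),
                                ("hypernym", "99999999-n")]}  # missing in T
print(recover_relations(wordnet_t, pwn_relations))
# [('02084071-n', 'hypernym', '02083346-n')] -- missing synsets are skipped,
# which is exactly the fragmentation problem discussed below.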
We notice here that although we used some disambiguation methods when we created
the core wordnets, there still are words that are misplaced. This will cause some false
classification of synset relations. Another challenge is that translation leads to loss of some
information. For example, it is very important to distinguish between classes and instances
in wordnets (Miller and Hristea, 2006). There is no guarantee that an instance will not
be translated into the target language as a class and vice versa. Furthermore, as Figure
4.4 shows, since the core wordnets are automatically created, there will be some missing
synsets that might not be available in the target languages. That will lead to fragments
in the recovered links. All the previous issues need to be observed and dealt with to obtain
acceptable accuracy.
Figure 4.4: The effect of missing synsets in recovering wordnet semantic relations using
intermediate wordnet.
4.3 Experiments and Evaluation
In this section, we generate the semantic relations between synsets in three wordnets:
Arabic, Assamese and Vietnamese. We start by creating the core wordnets using the algorithms
we described in Section 4.1. Table 4.2 shows the result of creating the core wordnets for the
three languages. Next we apply our method, which we presented in Section 4.2, to link the
synsets.
Language     Synsets   Coverage   Precision /4.00
Arabic       93,383    59.95%     3.82
Assamese     107,616   36.95%     3.78
Vietnamese   55,451    36.20%     3.75

Table 4.2. Size, coverage and precision of the core wordnets we create for Arabic, Assamese and Vietnamese.
Relation        Precision
SimilarTo       75.62%
Hypernym        70.41%
Hyponym         71.23%
MemberMeronym   77.54%
PartHolonym     84.29%
Average         75.82%

Table 4.3. Precision of the semantic relations established for our Arabic wordnet.
The algorithm was able to recover a total of 206,766 relations between the Arabic
synsets, 139,502 relations between the Assamese synsets and 146,172 relations between
the Vietnamese synsets. As Figure 4.5 shows, most of the recovered relations are hyponym
and hypernym relations.
To evaluate our algorithm, we evaluated the relations recovered for the Arabic wordnet. We asked three Arabic speakers to evaluate a sample of 500 relations. The sample consists of
the following relations: 100 “hypernym” relations, 100 “hyponym” relations, 100 “similar to” relations, 100 “MemberMeronym” relations and 100 “PartHolonym” relations. The
evaluation was done using True and False questions, where True gives a score of 1 and
False gives a score of 0 to the relation.
As Table 4.3 shows, the precision of the algorithm was between 70.41%, which was
for the “hypernym” relation, and 84.29%, which was for the “PartHolonym” relation. The
average precision score was 75.82%.
Figure 4.5: Percentage of synset semantic relations recovered for the Arabic, Assamese
and Vietnamese wordnets.
4.4 Summary
In this chapter, we presented an approach that automatically constructs semantic relations between synsets in a wordnet. The approach depends on the PWN to establish the
links between the synsets. We conducted an experiment to evaluate our algorithm. Our
approach produces semantic relations between the Arabic synsets with 75.82% precision.
Chapter 5
ENHANCING AUTOMATIC WORDNET CONSTRUCTION USING WORD
EMBEDDINGS
In the previous chapters, we have shown that a wordnet for a new language, possibly
resource-poor, can be constructed automatically by translating wordnets of resource-rich
languages. The quality of these constructed wordnets is affected by the quality of the
resources used such as dictionaries and translation methods in the construction process.
Recent work shows that vector representation of words (word embeddings) can be used to
discover related words in text. In this chapter, we propose a method that performs such
similarity computation using word embeddings to improve the quality of automatically
constructed wordnets.
5.1 Introduction
It is well known that one way to find semantically related words is to use context as
a lead (Firth, 1957; Harris, 1954). Words that share the same neighbors are usually somehow
related to each other. For example, consider the two sentences:
“He rides his bike to the park everyday” and
“He rides his bicycle to the park everyday”.
One can conclude that the words “bike” and “bicycle” are similar or semantically related
since they appear in similar contexts. This observation led researchers to what are called
distributional methods, which are widely used these days. In these methods, also known
as vector semantics and word embeddings, co-occurrences of the words in a corpus are
represented as vectors in a multidimensional space forming a word-word matrix (Jurafsky
and Martin, 2016).
Since a corpus consists of a large number of distinct words, these vectors are usually
long and sparse. The sparseness of the vectors is caused by the fact that a word often co-occurs with a limited number of other words in a given corpus. For this reason, special
algorithms are used to process and save these sparse vectors. Usually, the co-occurrence
of a word is limited to a specific window of words before and after the word. According
to (Jurafsky and Martin, 2016), there are two types of co-occurrence: first-order co-occurrence and second-order co-occurrence. The first type is used to describe words that
appear next to each other, while in the second type, the words share similar surrounding
words.
In order to reduce the effect of stop words, i.e., words that co-occur with most of
the words, usually the pointwise mutual information measure (PMI) (Fano and Hawkins,
1961) is used rather than pure co-occurrence counts. This measure compares the probability of
the co-occurrence of two words to the probability of the words occurring together by chance alone.
The PMI between two words w1 and w2 is

$$PMI(w_1, w_2) = \log_2 \frac{P(w_1, w_2)}{P(w_1)\,P(w_2)} \qquad (5.1)$$

where P(w1) is the probability of word w1,
P(w2) is the probability of word w2, and
P(w1, w2) is the probability of w1 in the context of w2.
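A small sketch of Equation 5.1 over raw counts; for brevity the window is simplified to whole-sentence co-occurrence, and the toy corpus is hypothetical:

import math
from collections import Counter
from itertools import combinations

# Toy PMI computation (Equation 5.1) using sentence-level co-occurrence.
sentences = [["he", "rides", "his", "bike"],
             ["he", "rides", "his", "bicycle"],
             ["the", "bike", "is", "red"]]

word_count = Counter(w for s in sentences for w in s)
pair_count = Counter(frozenset(p) for s in sentences
                     for p in combinations(sorted(set(s)), 2))
total_words = sum(word_count.values())
total_pairs = sum(pair_count.values())

def pmi(w1, w2):
    p12 = pair_count[frozenset((w1, w2))] / total_pairs
    p1 = word_count[w1] / total_words
    p2 = word_count[w2] / total_words
    return math.log2(p12 / (p1 * p2))

print(pmi("rides", "bike"))  # log2((1/18) / (1/6 * 1/6)) = 1.0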
5.2 Similarity Metrics

There are many ways to compute similarity between vectors (Jurafsky and Martin,
2016). We list three common metrics used to measure similarity or relatedness between
two vectors $\vec{A}$ and $\vec{B}$ of size N.
• Cosine Similarity: It is the most common measure used in natural language processing. It produces similarity values from 0 to 1. When using raw co-occurrences
or PMI, words with cosine similarity value near 1 are supposedly very similar and
words with cosine similarity value near 0 are supposedly unrelated. Cosine similarity
is measured using the following formula:

$$cosine(\vec{A}, \vec{B}) = \frac{\sum_{i=1}^{N} A_i B_i}{\sqrt{\sum_{i=1}^{N} A_i^2}\,\sqrt{\sum_{i=1}^{N} B_i^2}} \qquad (5.2)$$
• Jaccard Measure: It was introduced by (Jaccard, 1912) and adapted by (Grefenstette, 2012) to be used with vectors. The Jaccard similarity is computed using the
following formula:

$$Jaccard_{sim}(\vec{A}, \vec{B}) = \frac{\sum_{i=1}^{N} \min(A_i, B_i)}{\sum_{i=1}^{N} \max(A_i, B_i)} \qquad (5.3)$$
• Dice Measure: It was originally used with binary vectors and was adapted by (Curran, 2004) to be applied to semantic similarity. The Dice similarity measure is
computed using the following equation:

$$Dice_{sim}(\vec{A}, \vec{B}) = \frac{2 \sum_{i=1}^{N} \min(A_i, B_i)}{\sum_{i=1}^{N} (A_i + B_i)} \qquad (5.4)$$
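For reference, a direct sketch of the three metrics (Equations 5.2-5.4) over numpy arrays; note that the Jaccard denominator uses the element-wise maximum:

import numpy as np

def cosine(a, b):                     # Equation 5.2
    return a.dot(b) / (np.sqrt(a.dot(a)) * np.sqrt(b.dot(b)))

def jaccard(a, b):                    # Equation 5.3
    return np.minimum(a, b).sum() / np.maximum(a, b).sum()

def dice(a, b):                       # Equation 5.4
    return 2 * np.minimum(a, b).sum() / (a + b).sum()

a = np.array([1.0, 2.0, 0.0])
b = np.array([1.0, 1.0, 1.0])
print(cosine(a, b), jaccard(a, b), dice(a, b))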
5.3 Generating Word Embeddings
In order to validate the synsets we create using translation and obtain relations between them, we use the word2vec algorithm (Mikolov et al., 2013) to generate word representations from an existing corpus. The word2vec algorithm uses a feedforward neural
network to predict the vector representation of words within a multi-dimensional language
model. Word2vec has two variations: Skip-Gram (SG) and Continuous Bag-Of-Words
(CBOW). In the SG version, the neural network predicts words adjacent to a given word
on either side, while in the CBOW model the network predicts the word in the middle of a
given sequence of words. In the work presented in this section, we generate representations
of words using both models with several different vector and window sizes to obtain the
settings for the highest precision. The purpose of the steps discussed next is to improve the
quality of synsets produced by the translation process in addition to generating relations
among the synsets.
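As one concrete realization (an assumption on our part, since we used the word2vec software itself), the gensim re-implementation of word2vec exposes these choices directly; the corpus file name is a placeholder:

from gensim.models import Word2Vec

# Hypothetical sketch: SG vs. CBOW with the window and vector sizes
# explored in this chapter, using the gensim implementation of word2vec.
sentences = [line.split() for line in open("corpus.txt", encoding="utf-8")]

sg = Word2Vec(sentences, sg=1, vector_size=100, window=3, min_count=5)
cbow = Word2Vec(sentences, sg=0, vector_size=100, window=3, min_count=5)

# Cosine similarity between two candidate synonyms under each model.
print(sg.wv.similarity("bike", "bicycle"))
print(cbow.wv.similarity("bike", "bicycle"))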
5.4 Removing Irrelevant Words in Synsets
We compute the cosine similarity between word vectors within each single synset in
TWN, the wordnet being constructed in language T , to filter false word members within
synsets. To filter the initially constructed synsets in TWN, we pick a threshold value α such
that the selected words have cosine similarity larger than α with each other. We describe
the filtering process we propose below.
1. Let
synsetic = {word1, word2, word3, word4}     (5.5)
be a candidate synset to be potentially included in TWN.
2. Compute the cosine similarity between all the possible pairs of words in synsetic .
3. Extract the pair of words with the highest cosine similarity.
4. If this pair of words has cosine similarity larger than α, keep the pair in the final
synset synseti; otherwise, discard synsetic itself, as it may have been a low-quality
candidate synset generated in the translation process.
5. Next, among the remaining words in synsetic , keep a word if it has a connection with
any word in synseti with similarity higher than α.
For example, let us assume that the cosine similarities between the words in synsetic
are as shown in Table 5.1 and α=0.70. First, the pair with the highest cosine similarity,
(word1, word2), is kept in the final synseti since its cosine similarity is larger than α. Then,
word3 is discarded since it does not have a cosine similarity larger than α with any of the
Pair              Cosine Similarity
(word1, word2)    0.91
(word1, word3)    0.22
(word1, word4)    0.82
(word2, word3)    0.34
(word2, word4)    0.72
(word3, word4)    0.12

Table 5.1. An example of cosine similarity between words in a candidate synset.
words in the current final synseti. Finally, word4 is kept in synseti since it does have a cosine
similarity with word1 that satisfies the threshold α.
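A sketch of this filtering procedure, assuming a similarity(w1, w2) function over the word embeddings is available (the names are our own):

from itertools import combinations

# Sketch of synset filtering: keep the best pair if it clears the threshold,
# then admit any remaining word connected to the kept words.
def filter_synset(candidate, similarity, alpha=0.70):
    pairs = sorted(combinations(candidate, 2),
                   key=lambda p: similarity(*p), reverse=True)
    best = pairs[0]
    if similarity(*best) <= alpha:
        return None                    # discard the whole candidate synset
    kept = set(best)
    for w in candidate:                # single pass over remaining words
        if w not in kept and any(similarity(w, k) > alpha for k in kept):
            kept.add(w)
    return kept

# Reproducing the Table 5.1 example:
sims = {frozenset(("word1", "word2")): 0.91, frozenset(("word1", "word3")): 0.22,
        frozenset(("word1", "word4")): 0.82, frozenset(("word2", "word3")): 0.34,
        frozenset(("word2", "word4")): 0.72, frozenset(("word3", "word4")): 0.12}
sim = lambda a, b: sims[frozenset((a, b))]
print(filter_synset(["word1", "word2", "word3", "word4"], sim))
# {'word1', 'word2', 'word4'}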
5.5 Validating Candidate Relations
Similarly, we compute the cosine similarity between words within pairs of semantically related synsets. This allows us to verify the constructed relations between synsets in
TWN. For example, let
synseti = {wordi1, wordi2, wordi3, wordi4}, and
synsetj = {wordj1, wordj2, wordj3, wordj4}
be synsets in TWN. And let
ρij be a candidate semantic relation between synseti and synsetj.
We compute the cosine similarity between all the possible pairs of words from synseti to
synsetj and take the maximum similarity obtained. Then, if this value is larger than a
threshold αρ, we retain the relation ρij; otherwise, we discard it. The pseudo code of
the validation algorithm is shown in Algorithm 1.
Algorithm 1: Validating Semantic Relation
Data: synseti, synsetj, relation ρij, threshold αρ
Result: retain or discard the relation ρij
initialization;
Similaritymax ← 0;
foreach wordi in synseti do
    foreach wordj in synsetj do
        sim ← ComputeCosineSimilarity(wordi, wordj);
        if sim > Similaritymax then
            Similaritymax ← sim;
        end
    end
end
if Similaritymax < αρ then
    Discard(ρij);
end
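A direct Python transcription of Algorithm 1, with the similarity function assumed to come from the word embedding model:

# Sketch of Algorithm 1: retain a candidate relation only if the best
# cross-synset word pair clears the relation-specific threshold alpha_rho.
def validate_relation(synset_i, synset_j, alpha_rho, similarity):
    max_sim = max(similarity(wi, wj)
                  for wi in synset_i for wj in synset_j)
    return max_sim >= alpha_rho        # True = retain, False = discard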
5.6 Selecting Thresholds
To pick the synset similarity threshold value α and the threshold α ρ for each semantic
relation we create, we compute the cosine similarity between pairs of synonym words,
semantically related words, and non-related words obtained from existing wordnets. Then,
based on the previous data, we select the threshold values that are associated with higher
precision and maximum coverage.
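One simple way to realize this selection is sketched below; the pair samples are assumed to come from an existing wordnet such as AWN, and the product of precision and coverage is our own stand-in for the trade-off described above:

import numpy as np

# Sketch: choose the threshold that balances precision (purity of the pairs
# kept above the threshold) against coverage (fraction of related pairs kept).
def pick_threshold(related, unrelated, candidates=None):
    related, unrelated = np.array(related), np.array(unrelated)
    if candidates is None:
        candidates = np.arange(0.0, 1.0, 0.01)
    best_t, best_score = 0.0, -1.0
    for t in candidates:
        kept_rel = (related >= t).sum()
        kept_unrel = (unrelated >= t).sum()
        if kept_rel + kept_unrel == 0:
            continue
        precision = kept_rel / (kept_rel + kept_unrel)
        coverage = kept_rel / len(related)
        score = precision * coverage   # assumed combined objective
        if score > best_score:
            best_t, best_score = t, score
    return best_t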
5.7 Experiments
In this section, we discuss the enhancement of the Arabic, Assamese and Vietnamese
wordnets we create, using the method we described in the previous sections.
5.7.1 Generating Vector Representations of Wordnet Words
For generating vector representations of Arabic words, we use the following freely
available corpora:
• Watan-2004 corpus (12 million words) (Abbas et al., 2011),
• Khaleej-2004 corpus (3 million words) (Abbas and Smaili, 2005), and
• 21 million words of Wikipedia1 Arabic articles.
We process and combine the three corpora into a single plain text file.
For both Assamese and Vietnamese, we used Wikipedia articles to generate the vector
representations for words. The size of the Assamese Wikipedia articles we used is 1.4
million words, while the size of the Vietnamese articles was 80 million words.
Figure 5.1: A histogram of synonyms, semantically related words, and non-related words
extracted from AWN.
In order to compute the synset similarity threshold value α and the threshold for
each semantic relation α ρ , we use the freely available Arabic wordnet (AWN) (Rodríguez
et al., 2008). AWN was manually constructed in 2006 and has been semi-automatically
enhanced and extended several times. We start by extracting synonym words, semantically
related words, and non-related words from AWN.
1 https://ar.wikipedia.org
Relation            Weighted Average Similarity
Synonyms            0.28
Hypernyms           0.22
TopicDomains        0.23
PartHolonyms        0.28
InstanceHypernyms   0.08
MemberMeronyms      0.29

Table 5.2. The weighted average similarity between related words in AWN.
The Python program that we wrote to compute the cosine similarity between the words is listed in Appendix B.1. Then, we
use the histogram representation of the cosine similarity of the previous sets of words to
set the thresholds. As Figure 5.1 shows, more than 67% of the non-related words have
cosine similarity less than 0.1, while about 23% of the synonym words in AWN have a
cosine similarity less than 0.1. Furthermore, about 34% of the semantically related words
in AWN have cosine similarity less than 0.1. Table 5.2 shows the weighted average cosine
similarity between synonyms, hypernyms, topic-domain related, part-holonyms, instance-hypernyms, and member-meronyms in AWN, where the frequency of the similarity value is
the weight.
5.7.2 Producing Word Embeddings for Arabic
In this part of the experiment, we use the word2vec algorithm to produce vector
representations of Arabic words. We test the word2vec algorithm with different window sizes to
select the window size that produces the highest similarity. We generate word embeddings
using the CBOW version with window sizes 3, 5 and 8. Next, we compute the weighted
averages of the cosine similarity between the synonyms in AWN. The highest weighted
average we obtained was 0.288 with window size 3, while the weighted averages obtained
with window sizes 5 and 8 were 0.283 and 0.277, respectively. Then, we compare between
the SG and the CBOW approaches with different vector sizes. Table 5.3 shows the weighted
average cosine similarity obtained between 16,000 pairs of synonyms in AWN using both
Algorithm   Vector Size   Similarity Average
SG          100           0.289
SG          200           0.258
SG          500           0.194
CBOW        100           0.288
CBOW        200           0.259
CBOW        500           0.195

Table 5.3. Comparison between the weighted similarity averages obtained using different word2vec settings.
Threshold   AWN     Our Arabic WordNet
0.000       5,941   17,349
0.100       3,433   2,073
0.288       2,471   943
0.500       1,190   271
0.750       209     13

Table 5.4. Comparison between the number of synsets in AWN and our Arabic wordnet using different threshold values.
variations of word2vec, with window size=3 and vector size set to 100, 200, and 500.
We notice that both versions produce almost similar results, with a slight advantage to SG
at the cost of more execution time. However, for the corpus we use, smaller vector sizes
produce better precision.
5.8 Evaluation and Discussion
We compute cosine similarity between semantically related words extracted from
our initial Arabic, Assamese and Vietnamese wordnets produced in the previous chapter.
The language model to calculate the cosine similarity is created using CBOW with vector
size=100 and window size=3. Table 5.4 shows a comparison between the number of Arabic
synsets we create and the number of synsets in AWN.
We notice that the translation method we use produces a high number of synsets compared to the manually constructed AWN. However, the number of synsets sharply decreases
after filtering the initial synonyms using the method described in Section 5.4. Although
                 Threshold Range
Relation         0-0.1    0.1-0.288   0.288-1
Synonyms         34.8%    56.8%       78.4%
Hypernyms        45.2%    57.2%       84.4%
PartHolonym      50.8%    75.2%       90.4%
MemberMeronym    40.8%    56.8%       79.6%
Overall          42.9%    61.5%       83.2%

Table 5.5. Precision of the Arabic wordnet we create.
                 Threshold Range
Relation         0-0.1    0.1-0.288   0.288-1
Synonyms         52.0%    57.6%       88.0%
Hypernyms        37.6%    49.6%       76.0%
PartHolonym      51.2%    46.4%       82.4%
MemberMeronym    62.4%    67.2%       81.6%
Overall          50.8%    55.2%       82.0%

Table 5.6. Precision of the Assamese wordnet we create.
our Arabic wordnet is automatically created, the number of synsets we create is 60% of the
number of synsets in the manually created AWN when filtering the synsets using α= 0.1.
We evaluate precision by comparing 600 pairs of synonyms, hypernyms, part-holonyms,
and member-meronyms with three ranges of cosine similarity values: 0 to 0.1, 0.1 to 0.288,
and 0.288 to 1. We asked 3 Arabic speakers to evaluate the pairs using a 0 to 5 scale where 0
represents the minimum score and 5 represents the maximum score. We compute precision
by taking the average score and converting it to a percentage. See Table 5.5.
                 Threshold Range
Relation         0-0.1    0.1-0.288   0.288-1
Synonyms         31.2%    40.2%       57.6%
Hypernyms        31.8%    39.0%       69.4%
PartHolonym      32.2%    42.8%       75.0%
MemberMeronym    22.0%    24.0%       73.8%
Overall          29.3%    36.5%       68.95%

Table 5.7. Precision of the Vietnamese wordnet we create.
Table 5.8. Examples of related words and their cosine similarity from our Arabic wordnet.
The precision of the synonyms, hypernyms, part-holonyms, and member-meronyms
we produce is 78.4%, 84.4%, 90.4%, and 79.6% respectively, with the threshold set to
0.288. This is higher than the precision obtained by (Lam et al., 2014b) which produces
synonyms with 76.4% precision when just using PWN. Furthermore, the precision of the
Assamese and Vietnamese wordnets are shown in Tables 5.6 and 5.7, respectively. As
shown in Tables 5.8, 5.9 and 5.10, our results suggest that producing synsets with lower
precision reduces the quality of the other created semantic relations. Our results show that
pairs with higher cosine similarity are more likely to be semantically related. This confirms
the benefit of combining the translation method with word embeddings in the process of
automatically generating new wordnets.
5.9 Summary
In this chapter, we discuss an approach for enhancing the automatically generated
wordnets we create for low-resource languages. Our approach takes advantage of word
embeddings to enhance the translation method for automatic wordnet creation. We present
Table 5.9. Examples of related words and their cosine similarity from our Assamese wordnet.
Table 5.10. Examples of related words and their cosine similarity from our Vietnamese
wordnet.
an application of our approach to producing a new Arabic wordnet. Our method automatically produces Arabic synonyms with 78.4% precision and semantically related pairs of
words with up to 90.4% precision.
Acknowledgment This chapter is based on the paper “Enhancing Automatic Wordnet
Construction Using Word Embeddings” (Al Tarouti and Kalita, 2016), written in collaboration with Jugal Kalita, that appeared in the Proceedings of the Workshop on Multilingual
and Cross-lingual Methods in NLP, San Diego, USA, June 2016. Association for Computational Linguistics: Human Language Technologies (NAACL-HLT).
Chapter 6
SELECTING GLOSSES FOR WORDNET SYNSETS USING WORD
EMBEDDINGS
Word embeddings provide a way to represent words as vectors in a multi-dimensional
space such that related words are represented as vectors with similar direction. It has been
shown that this model can be used to discover relations between words effectively. In this
chapter, we introduce a method to represent wordnet synsets in a similar way. A wordnet
synset is a group of synonym words grouped together because they all represent the same
concept. Our proposed method can be used in several NLP applications such as word-sense
disambiguation and automatic wordnet construction. To test our method we use it in the
task of selecting glosses for wordnet synsets of several languages.
6.1 Literature Review
Several methods were introduced to produce vector representations of meanings.
Clustering is one technique that is commonly used to separate the vector of a multi-sense
word into several vectors which represent the senses of the word. For example, (Neelakantan et al., 2015) modified the skip-gram version of the word2vec algorithm to produce
multiple word embeddings per word. In this work, the senses of a word are learned online
by creating clusters of the contexts of the word. When a new context of the word starts
to appear far away from the center of the known context, a new vector is created for the
new context. A global context-aware neural model was presented by (Huang et al., 2012)
to learn the context vectors of words using both local and global context. To evaluate
their neural architecture, the authors produced a new dataset that provides similarity, based on
human judgments, between words within specific contexts.
Other techniques for producing sense vector representations are based on ontologies.
For example, (Chen et al., 2014) modified the objective of the skip-gram model of the
word2vec algorithm to assign vector representations to the synsets based on their glosses.
The work also presented two word-sense-disambiguation algorithms based on the sense
vectors. Another approach to learning synset embeddings was introduced by (Rothe and
Schütze, 2015). The approach, which is called AutoExtend, is a neural network based
learning model. It includes hidden layers for both synset lexemes and embeddings. Foley
and Kalita (Foley and Kalita, 2016) compared several models which use WordNet
to create sense vectors. They also presented an approach, called the hyponym tree
propagation model (HTP), that uses a vector space model (VSM) to produce sense vectors.
6.2 Creating Language Model Using Word Embeddings
We start by creating word embeddings using a corpus and the word2vec software
(Mikolov et al., 2013). word2vec is a two-layer feedforward neural-network learning
model that produces multi-dimensional vector representations of words. There are two
implementations of this learning model: Skip-Gram (SG) implementation and Continuous
Bag-Of-Words (CBOW) implementation. In the SG implementation, the model learns the
words around a given word, while in the CBOW implementation the model learns the word
within a given sequence of words.
6.3 Generating Vector Representation of Wordnet Synsets
In this section, we present our method to produce vector representations of wordnet synsets. We build our
method based on the vectors of the synonym words produced by the word embedding
method. We believe that combining the vectors of synonym words into one vector can
produce a way to represent meaning. Next, we describe our proposed method to build the
vector representation of synsets, which we call synset2vec.
Synset Key   Gloss                                         Synonyms
00076884-n   a sudden drop from an upright position        {spill, tumble, fall}
00329619-n   the act of allowing a fluid to escape         {spill, spillage, release}
04277034-n   a channel that carries excess water over      {spill, spillway, wasteweir}
             or around a dam or other obstruction
15049594-n   liquid that is spilled                        {spill}

Table 6.1. Meanings of the noun “spill” and its synonyms.
Let
synseti = {word1, word2, ..., wordj} be a synset in wordnetx,
{n1, n2, ..., nj} be the numbers of synsets for each word in synseti, and
{$\vec{V}_1, \vec{V}_2, ..., \vec{V}_j$} be the set of corresponding vectors for {word1, word2, ..., wordj} in the
word embedding model.
We identify two cases:
1. The first case is when a word, which does not have any synonyms, represents several
synsets, i.e., has more than one meaning. Therefore, the vector that is produced by
the word embedding is actually representing the combined meanings of the word. For
example, in PWN, the word “abduction” is the only word in both synset 00775460-n, “the criminal act of capturing and carrying away by force a family member”, and
synset 00333037-n, “moving of a body part away from the central axis of the body”.
Hence, the vector for “abduction” actually represents both meanings.
2. The second case is when a word, which does have one or more synonyms, has
one or more meanings. In this case, the synonyms might or might not have other
meanings as well. For example, the noun “spill” has four meanings in PWN and it has 6
synonyms. Table 6.1 shows all the meanings of the noun “spill” and all its synonyms
in PWN.
Obviously, to generate a combined vector for a synset, we need a way to limit the
effect of the other meanings that the synonyms might hold. To do so, we start by solving
the second case where the synsets have more than one word. In this case, we normalize
the vector of each word by dividing its coordinates by the number of synsets that the word
belongs to. This reduces the noise when generating the synset vector caused by the other
meanings that a word can hold. We define the vector of synseti ($\vec{V}_{s_i}$) as follows:

$$\vec{V}_{s_i} = \frac{1}{j}\left(\vec{V}_1 \cdot \frac{1}{n_1} + \vec{V}_2 \cdot \frac{1}{n_2} + \dots + \vec{V}_j \cdot \frac{1}{n_j}\right)$$
Figure 6.1 shows an example of creating a vector for the synset 00076884-n, which includes
three words: spill, tumble and fall.
Figure 6.1: An example of creating a vector for a wordnet synset that includes more than
one word.
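A sketch of this computation for a multi-word synset; embeddings maps a word to its vector and num_synsets(word) counts the synsets the word belongs to (both assumed to be available from the embedding model and the wordnet):

import numpy as np

# synset2vec for a multi-word synset: down-weight each member's vector by
# the number of synsets the member belongs to, then average (the equation
# above).
def synset_vector(synset_words, embeddings, num_synsets):
    vecs = [embeddings[w] / num_synsets(w) for w in synset_words]
    return np.mean(vecs, axis=0)

# e.g., for synset 00076884-n = {spill, tumble, fall}:
# v_si = synset_vector(["spill", "tumble", "fall"], embeddings, num_synsets)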
Next, we produce vectors for the synsets that share a single word, i.e., words that do
not have any synonyms and have more than one meaning. In this case, for each synset,
we produce the synset vector by combining the word vector with the vector of a word in a
related synset, e.g., a hypernym, a hyponym, or a meronym. For example, let synseti and
synsetk be synsets that both include the same single word w. And let h1 be a word from the
hypernym of synseti and h2 be a word from the hypernym of synsetk. We define the vector
of synseti ($\vec{V}_{s_i}$) as follows:
$$\vec{V}_{s_i} = \frac{1}{2}\left(\vec{V}_w \cdot \frac{1}{n_w} + \vec{V}_{h_1} \cdot \frac{1}{n_{h_1}}\right)$$
Similarly, we define the vector of synsetk ($\vec{V}_{s_k}$) as follows:

$$\vec{V}_{s_k} = \frac{1}{2}\left(\vec{V}_w \cdot \frac{1}{n_w} + \vec{V}_{h_2} \cdot \frac{1}{n_{h_2}}\right)$$
Figure 6.2 shows an example of creating vectors for the two synsets of the word “abduction”. In Appendix B.2 we list a Python implementation of the procedure.
Figure 6.2: An example of creating vectors for wordnet synsets that share a single word.
6.4 Automatically Selecting a Synset Gloss From a Corpus Using Synset2Vec
In this section, we give one example use of our model. We show how our proposed
model can be used in the automatic selection of glosses for wordnet synsets. The automatic
selection of a synset gloss is a word-sense disambiguation problem. A gloss is a short sentence
which is usually manually attached to a synset to clarify the meaning of the synset. This
short sentence can be a definition or an example sentence of one of the members of the
synset. We test our method using PWN and then apply it to automatically add glosses to
wordnets created in (Lam et al., 2014b).
In the following steps, we present our method to select a gloss for synseti we defined
in Section 6.3.
• Let G = {g1, g2, ..., gy} be the set of candidate glosses that include a word belonging to
synseti.
• To select the closest gloss to synseti from G, we generate a vector for each gloss gz ∈
G. We list a Python function for this step in Appendix B.3.
• Assume that the gloss gz consists of the words {w1, w2, ..., wd},
{m1, m2, ..., md} are the numbers of synsets for each word in gz, and
{$\vec{V}_{w_1}, \vec{V}_{w_2}, ..., \vec{V}_{w_d}$} is the set of corresponding vectors for {w1, w2, ..., wd}.
• We compute the vector of gloss gz as follows:

$$\vec{V}_{g_z} = \frac{1}{d}\left(\vec{V}_{w_1} \cdot \frac{1}{m_1} + \vec{V}_{w_2} \cdot \frac{1}{m_2} + \dots + \vec{V}_{w_d} \cdot \frac{1}{m_d}\right)$$

• Then, we compute the cosine similarity between the vector of each gloss gz and $\vec{V}_{s_i}$.
We present a Python implementation for this step in Appendix B.4.
• Finally, we select the gloss with the highest cosine similarity with $\vec{V}_{s_i}$.
For instance, as shown in Table 6.2, if we consider the word “abduction” which belongs
to two synsets and does not have any synonyms, we notice that our algorithm was able to
distinguish between the two meanings and select the right gloss for both synsets.
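Putting the steps together, a sketch of the gloss selector (with synset_vector as sketched in Section 6.3's example and cosine as in Equation 5.2; the helper names are our own):

import numpy as np

def cosine(a, b):
    return a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Sketch: vectorize each candidate gloss exactly like a synset (down-weight
# every word by its synset count, then average) and keep the closest gloss.
def select_gloss(synset_vec, candidate_glosses, embeddings, num_synsets):
    def gloss_vector(words):
        vecs = [embeddings[w] / num_synsets(w)
                for w in words if w in embeddings]
        return np.mean(vecs, axis=0)
    scored = [(cosine(synset_vec, gloss_vector(g)), g)
              for g in candidate_glosses]          # g is a list of words
    return max(scored, key=lambda x: x[0])[1]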
6.5 Evaluation
In this section, we introduce two forms of evaluation. First, we apply our method
to select glosses for the PWN synsets. In this case, we directly compare our results to
the manually attached glosses in PWN. Then, we apply our method to attach glosses to
wordnet synsets generated by (Lam et al., 2014b). In this case, we ask human judges to
evaluate the resulting glosses for three languages: Arabic, Assamese and Vietnamese.
Synset Key   Gloss                                            Cosine Similarity
00333037-n   the criminal act of capturing and carrying       0.172
             away by force a family member
00333037-n   moving of a body part away from the central      0.214
             axis of the body
00775460-n   the criminal act of capturing and carrying       0.204
             away by force a family member
00775460-n   moving of a body part away from the central      0.189
             axis of the body

Table 6.2. Cosine similarity between the different synset vectors and glosses of the word “abduction” in PWN.
6.5.1 Using Synset2vec to Select Glosses for PWN Synsets
In order to evaluate our synset vector representation in the task of selecting glosses for
wordnets, we use it in the process of gloss selection for PWN synsets. We take advantage of
the glosses manually added to the synsets in PWN to automatically measure the precision of
our synsets representation. The following steps describe the evaluation process of selecting
glosses for PWN synsets.
• For each synseti in PWN, we construct a set of candidate glosses. The candidate
glosses are extracted from PWN using the following method. First the gloss attached
to synseti in PWN is added to the candidate set of glosses. Next, to generate negative
glosses for synseti , we extract words which belong to synseti and other synsets, i.e.,
words that have the meaning of synseti and one or more other meanings. This allows us to
examine the ability of the algorithm to differentiate between the different meanings
of synsets.
• We randomly select two types of synsets from PWN: synsets that have single words,
i.e. synsets that are represented by only single words, and synsets that include multiple synonym words.
• We generate the synset vectors using the algorithm we described in Section 6.3.
61
• Next, we generate the gloss vectors using the method we described in Section 6.4.
• Then, we compute the cosine similarity between the vector of synseti and the vector of each gloss in the candidate set.
• Finally, we select the gloss with the highest cosine similarity.
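As an illustration of how such a candidate set can be built, the NLTK interface to PWN can enumerate the competing glosses of a synset's polysemous members (a sketch of ours, not the exact evaluation code):

from nltk.corpus import wordnet as wn

# Sketch: the positive gloss is the synset's own definition; the negative
# glosses are the definitions of other synsets sharing any member word.
def candidate_glosses(synset):
    candidates = {synset.definition()}
    for lemma in synset.lemmas():
        for other in wn.synsets(lemma.name()):
            if other != synset:
                candidates.add(other.definition())
    return candidates

print(candidate_glosses(wn.synset("abduction.n.01")))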
6.5.2 Using Synset2vec to Select Glosses for Arabic, Assamese and Vietnamese Synsets
In this section, we examine the precision of our method by applying it for the pur-
pose of selecting glosses from corpora to attach to the wordnets we create in the previous
chapters. In this experiment, we use the wordnets of the languages: Arabic, Assamese and
Vietnamese. Next, we describe the steps of evaluating glosses selected by our method for
the synsets of the target languages.
• For each synseti in the target wordnet wordnett , we generate a set of candidate
glosses by extracting the set of sentences that include any member of synseti from
the corpora we described in Section 5.7.
• We randomly select two types of synsets from wordnett : synsets that have single
words, i.e., synsets that are represented by only single words, and synsets that include
multiple synonym words.
• We generate the synset vectors using the algorithm we described in Section 6.3.
• Next, we generate vectors for each sentence in the set of candidate glosses using the
method we described in Section 6.4.
• Then, we compute the cosine similarity between synseti and each sentence in the
candidate set.
• Next, the top 3 sentences with the highest cosine similarity with the synseti are selected.
• Finally, 3 native speakers of the target language are asked to evaluate the selected
sentences using a 5-point scale.

Synset Type     Number of Synsets   Precision
Single Member   1400                76.5%
Multi Member    600                 79.6%

Table 6.3. The precision of selecting glosses for PWN synsets.
6.5.3 Results and Discussion
As shown in Table 6.3, we used our algorithm to select glosses for 1400 single-
member synsets from PWN. The algorithm achieved 76.5% precision. In addition, we used
it to select glosses for 600 multi-member synsets from PWN. The precision was 79.6% in
this case. As expected, the precision of selecting glosses is better when it is used with
multi-member synsets since more information about the context of the sense is provided by
the multi-member synsets.
In the second evaluation, we randomly selected 300 synsets from the Arabic, Assamese and Vietnamese wordnets we create (100 synsets each). For each synset, we extracted all the sentences that included any member of the synset from the corpora. The
sentences were sorted according to the cosine similarity with the synset vector and the top
3 sentences were selected.
As shown in Table 6.7, the precision of selecting glosses for the Arabic synsets is
81.4% when selecting the sentences with the highest cosine similarity with the synset vector. Furthermore, the precision of the top 2 and top 3 sentences is 70.4% and 65.8% respectively. The overall precision of selecting glosses using our method for the Arabic synsets is
72.6%. Table 6.4 shows some examples of glosses we produce for the Arabic synsets along
with their cosine similarity values.

Wordnet      Top 1    Top 2    Top 3    Overall
Arabic       81.4%    70.4%    65.8%    72.6%
Assamese     85.2%    83.2%    84.6%    84.4%
Vietnamese   39.4%    36.6%    37.0%    37.6%

Table 6.7. The precision of selecting glosses for Arabic, Assamese and Vietnamese synsets.

Table 6.4. Examples of Arabic glosses we produce in our Arabic wordnet.
The precision of our method for selecting glosses for the Assamese synsets is 85.2%
when selecting the sentences with the highest cosine similarity. Moreover, the top 2 and
top 3 selected sentences achieved 83.2% and 84.6% respectively. The overall precision for
Assamese glosses is 84.4%. Table 6.5 shows some examples of glosses we produce for the
Assamese synsets along with their cosine similarity values.

Table 6.5. Examples of Assamese glosses we produce in our Assamese wordnet.
The top Vietnamese glosses selected by our method have 39.4% precision. The top 2
and top 3 Vietnamese glosses selected by our method have 36.6% and 37% precision. Table
6.6 shows some examples of glosses we produce for the Vietnamese synsets along with
their cosine similarity values.

Table 6.6. Examples of Vietnamese glosses we produce in our Vietnamese wordnet.
In general, the precision of recently published algorithms for the task of multilingual word-sense disambiguation is around 68.7% (Apidianaki and Von Neumann, 2013),
meaning that our algorithm is showing reasonably good performance for English, Arabic
and Assamese. However, we notice that our method performs poorly with Vietnamese. The
reason behind the poor results with Vietnamese is that Vietnamese words are not separated
by white spaces (Gordon and Grimes, 2005). This means that the meaning of most of the
words can change based on the following words. This makes the process of generating the
vectors for both the synsets and sentences extremely difficult since the word2vec algorithm
assumes that words are separated by white spaces. The same problem appears in the
process of automatically generating bilingual dictionaries for Vietnamese (Lam et al., 2015a).
One possible solution to this problem is replacing the white spaces within single Vietnamese words with a special non-white character. This requires the existence of a language
dictionary to distinguish the words that include white spaces within them.
6.6 Summary
In this chapter, we presented a new method for selecting synset glosses from a
corpus. Our glosses are example sentences that clarify the meaning of the synset. The
method can be used for low-resource languages to attach glosses to wordnets constructed
automatically. Our method presents vector representation for wordnet synsets in a multidimensional space. We construct a synset vector by grouping the word embedding vector
of each synonym in the synset. Our evaluation showed that our method selects glosses with
precision up to 84.4%.
Chapter 7
LEXBANK: A MULTILINGUAL LEXICAL RESOURCE
Figure 7.1: An overview of LexBank system.
7.1 Introduction
In this chapter, we discuss the design and implementation of LexBank: a system that
provides access to the multilingual lexical resources we create in this dissertation. We aim
to give public users the ability to access and use the resources that we have created in our
project. The system provides wordnet search services for several resource-poor languages
in addition to bilingual dictionary look-up services. The system also receives evaluation and feedback from users to improve the quality of the resources.
As Figure 7.1 shows, the system is divided into three layers: a web interface, an application layer and a database layer. The web interface allows users to log into the system and
access the search services. The web interface also provides a control panel for administrators to allow them to manage the system. The application layer includes all the software
required to securely execute users' requests. The database layer has two databases:
a lexical resources database and a system database. The system database stores users' information and the system settings. The design of the system allows inclusion of new language
resources and easy modifications.
7.2 Database Design
LexBank uses two databases: one for storing the system settings and one for storing
the lexical resources. We have used Microsoft SQL Server to construct the databases. The
SQL code we use to construct the databases is listed in Appendix C. Next, we describe
each database in detail.
7.2.1 The System Settings Database
There are two tables in the settings database: Users_Info and System_log. We describe both of the tables below.
7.2.1.1 Users_Info
The Users_Info table contains information about registered users. The following are
the fields contained in the Users_Info table.
• UserId: a unique short alias name, which is selected by the user, that is used to
identify users in the system.
• UserName: the full name of the user.
• UserEmail: the email address of the user.
• UserPwd: the encrypted password used by the user to access the system.
• UserPriv: a text field that determines the privileges that the user has. There are two
levels of users in the system. The first level is administrator, which has the privileges
of managing users and data in the system. The second level is client, which has the
privilege of browsing the available resources.
• UserStatus: this field specifies the status of the user. The status can be Active,
Inactive or New.
7.2.1.2 System_log
The System_log table keeps records of all the user activities in the system. This helps
us in maintenance and keeping track of the utilization of the system. The following fields
are contained in the System_log table.
• EventId: a unique key that is used to identify the event.
• EventDesc: a text description of the event.
• EventTime: the date and time of the event.
• UserId: the identification key of the user who committed the event.
7.2.2 The Lexical Resources Database
The lexical resources database contains the resources we produced in this thesis. For
each language supported by the system, the database maintains tables for storing the core
wordnet, the semantic relations, the wordnet glosses, the evaluation data for the semantic
relations and the evaluation data for the wordnet glosses. Next, we describe each table in
this database.
7.2.2.1 CoreWordnet
The CoreWordnet table stores the wordnet synsets we created in this thesis. The
core wordnet groups the synonym words into sets called synsets. In this table, synsets are
identified using the offset-pos of the corresponding synset in PWN. In PWN, the offset-pos
consists of two parts: the byte offset used to locate the synset in the data file and the part-of-speech of the synset. The following are the fields in the CoreWordnet table.
• offset-pos: the offset-pos of the wordnet synset which is used as an identifier for the
synset.
• Member: a word that belongs to the synset.
7.2.2.2 Sem_Relations
Whereas the synonymy relation is stored in the CoreWordnet table, other semantic
relations such as hyperonymy and meronymy are stored in the Sem_Relations table. As
we described in Section 4.2, the semantic relations are directed relations. Therefore, we
should maintain the direction by specifying the two sides of each synset in the relation. The
Sem_Relations table contains the following fields.
• Left_offset-pos: this field specifies the offset-pos of the synset on the left side of the
relation.
• Relation: a text field that specifies the relation between the left side and the right side
synsets.
• Right_offset-pos: the offset-pos of the synset on the right side of the relation.
7.2.2.3 WordnetGlosses
The WordnetGlosses table stores the wordnet glosses we generate in Chapter 6. The
following are the fields of the WordnetGlosses table.
• offset-pos: the offset-pos of the wordnet synset.
• Gloss: a text field that contains the gloss of the synset.
7.2.2.4 Sem_Relations_Eval_Data
The Sem_Relations_Eval_Data table contains the semantic relations sample data
used in the evaluation. This table contains the following fields.
• RelationKey: a unique identification number used to identify the semantic relation
being evaluated.
• Left_offset-pos: the offset-pos of the synset on the left side of the relation being
evaluated.
• Word1: this field specifies the word on the left side of the relation being evaluated.
• Relation: a text field that specifies the type of relation being evaluated.
• Right_offset-pos: the offset-pos of the synset on the right side of the relation being
evaluated.
• Word2: this field specifies the word on the right side of the relation being evaluated.
• COS: the cosine distance, as measured in Section 5.4, between the left word and the
right word in the relation being evaluated.
7.2.2.5 Sem_Relations_Eval_Response
The Sem_Relations_Eval_Response table contains the collected responses of the semantic relations we produce from evaluators. This table consists of the following fields.
• AnswerKey: a unique integer that is generated automatically to identify the response.
• RelationKey: the key of the semantic relation being evaluated.
• Score: an integer value from 1 to 5 that represents the score assigned by the evaluator
to the semantic relation.
• UserId: identification key of the evaluator who evaluated the response.
7.2.2.6 WordnetGlosses_Eval_Data
The WordnetGlosses_Eval_Data table holds the wordnet glosses sample being evaluated by the users. The table includes the following fields.
• GlossKey: an automatically generated unique integer used to identify the gloss being
evaluated.
• offset-pos: the offset-pos of the wordnet synset.
• Word: the word which is used in the gloss to represent the wordnet synset.
• Sentence: the sentence selected as gloss for this wordnet synset.
• PWNGloss: the English gloss of the corresponding synset in PWN.
• CosSem: the cosine similarity between the selected sentence and the synset as measured in Section 6.4.
• GlossRank: an integer value that represents the rank of the gloss among the other
candidate glosses. The rank is assigned by the system to the gloss being evaluated
based on the CosSem value. Glosses with the highest CosSem value have a rank value
1.
7.2.2.7 WordnetGlosses_Eval_Response
Responses from the users for evaluating the wordnet glosses we produced in Section
6.4 are stored in the WordnetGlosses_Eval_Response table. This table consists of the following fields:
• AnswerKey: a unique integer number that is generated automatically to identify the
response.
• GlossKey: the key of the gloss being evaluated.
• Score: an integer value from 1 to 5 that represents the score assigned by the evaluator
to the gloss.
• UserId: identification key of the evaluator who evaluated the gloss.
7.3 Application Layer
In this section, we describe the main functions provided by LexBank. In order to
maintain simplicity, we implement most of the functions of the system in one utility class
(LexBankUtils.cs) written in Microsoft C#. The utility class, which is listed in Appendix
D, consists of the following methods.
• IsUserIdAvailable(): takes a userId and returns true if this has never been used by
another user before.
• EncryptPassword(): takes a plain text password and returns an encrypted password.
• DecryptPassword(): takes an encrypted password and returns a decrypted password.
• CreateNewUser(): takes the details of a new user and creates an account for him by
storing the data in the Users_Info table.
• IsAuthenticated(): takes the user identification and password and returns true if it
matches the user information in the users table.
• FindSynSet(): takes a lexeme and returns a list of synsets that include this lexeme.
• FindSynSetLexemes(): takes an OffsetPos of a synset and returns the list of lexemes
of this synset.
• IsSynSetAvailable(): takes an OffsetPos of a synset in a specific wordnet, and returns
true if the synset is available in the specified wordnet.
• FindSynSetRelations(): takes an OffsetPos of a synset and returns all the semantically
related lexemes.
• FindGloss(): takes an OffsetPos of a synset and returns the gloss of the synset.
• ReadRelation(): takes a RelationKey and returns the details of the relation.
• ReadSynsetGloss(): takes a GlossKey and returns the details of the gloss.
• EvaluateRelation(): takes a RelationKey, Score and UserId and stores them in the evaluation table of the semantic relations.
• EvaluateGloss(): takes GlossKey, Score and UserId and stores them in the evaluation
table of the wordnet glosses.
• LogEvent(): takes an event description and stores it in the System_log table.
• ChangeUserStatus(): takes the UserId of a user and changes his status to a specified
new status.
• RetrieveUsers(): a method that returns a list of all the users in the system and their
information.
7.4
Web Interface Design and Implementation
In this section, we describe the design of the web interface of LexBank. The web
interface is implemented in ASP.NET using Microsoft Visual Studio 2012. Figure 7.2
shows the site map of the web interface. The interface is accessed by the login web page
(frmLogin.aspx). New users need to register to gain access to the system. Registration can
be done by filling in the web registration form (frmRegister.aspx). Once a user logs into the
system, the main menu web page (frmMainMenu.aspx) is shown. The main menu includes
links to access the services available in the system. In the following sections, we describe
each web page in the system.
Figure 7.2: LexBank web site map
7.4.1
Registration Form
New users need to register in the system using the registration form (frmRegister.aspx). As shown in Figure 7.3, a new user needs to provide the full name, email, email
confirmation, user identification, password and password confirmation, and then press the
Register button.
The registration process starts when a new user submits his information through the
registration web form. Once the registration form receives the information, it checks if all
the fields meet the requirements of the system. The requirements include a valid format
for the email address and the password. The requirements also include that the user
identification has not been used before by an existing user. If the information sent by the
user passes the validation process, the registration form calls the CreateNewUser() method
from the utility class. The CreateNewUser() method uses the EncryptPassword() method to
Figure 7.3: The registration web form
encrypt the password, and then it writes the data into the Users_info table. The registration
process is summarized in the sequence diagram shown in Figure 7.4.
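To make the flow concrete, the following is a minimal sketch of the registration logic in Python (the actual utility class is written in C#; see Appendix D). The function names mirror the LexBankUtils methods, the db object is assumed to be a sqlite3-style connection over the Users_Info table of Appendix C, and both the SHA-256 hash and the initial privilege and status values are illustrative assumptions.

import hashlib

def encrypt_password(plain):
    # stand-in for EncryptPassword(); a hash is assumed here, while the
    # real method uses reversible encryption
    return hashlib.sha256(plain.encode('utf-8')).hexdigest()

def is_user_id_available(db, user_id):
    # mirrors IsUserIdAvailable(): true if no existing user has this id
    row = db.execute("SELECT 1 FROM Users_Info WHERE UserId = ?",
                     (user_id,)).fetchone()
    return row is None

def create_new_user(db, user_id, name, email, password):
    # mirrors CreateNewUser(): validate the id, encrypt the password,
    # then store the record (initial privilege/status values assumed)
    if not is_user_id_available(db, user_id):
        return False
    db.execute("INSERT INTO Users_Info (UserId, UserName, UserEmail, "
               "UserPwd, UserPriv, UserStatus) "
               "VALUES (?, ?, ?, ?, 'client', 'Inactive')",
               (user_id, name, email, encrypt_password(password)))
    return True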
7.4.2
Log-in Form
Registered users can log in to the system using the login web page (frmLogin.aspx)
which is shown in Figure 7.5. A user with an active account needs to provide his user
identification and password to start the login process.
As shown in Figure 7.6, when the login web form (frmLogin.aspx) receives the userid
and the password, it calls the IsAuthenticated() method from the utility class. Then, the
password is encrypted using the EncryptPassword() method and compared with the encrypted password stored in the users table. If the userid and the password provided by the user match
Figure 7.4: Sequence diagram of the registration process
Figure 7.5: The log-in web form
Figure 7.6: Sequence diagram of the login process
the userid and the password stored in the users table, the main menu of the web interface
is shown to the user; otherwise, an error message is shown. The main menu is
shown in Figure 7.7.
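A minimal sketch of the corresponding authentication check follows, again in Python rather than the C# of Appendix D; encrypt_password() is the stand-in hash from the registration sketch above, and the db connection is the same assumption.

def is_authenticated(db, user_id, password):
    # mirrors IsAuthenticated(): encrypt the submitted password and
    # compare it with the encrypted value stored for this userid
    row = db.execute("SELECT UserPwd FROM Users_Info WHERE UserId = ?",
                     (user_id,)).fetchone()
    return row is not None and row[0] == encrypt_password(password)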
7.4.3
The Main Menu
The main menu includes links to access the services available in the system. The
services presented by the web interface are given below.
• Searching wordnet using lexeme, provided by the web page (frmWordnetSearch.aspx).
• Searching wordnet using OffsetPos, provided by the web page (frmSynsetDetails.aspx).
• Evaluating semantic relations between synsets, provided by the web page (frmEvalRelations.aspx).
• Evaluating wordnet glosses, provided by the web page (frmEvalGloss.aspx).
• Searching a bilingual dictionary, provided by the web page (frmDictionarySearch.aspx).
Figure 7.7: The main menu
• User management, provided by the web page (frmManageUsers.aspx).
7.4.4
Searching Wordnet By Lexeme Web Form
The web form (frmWordnetSearch.aspx) allows users to search for the synsets of a
lexeme in a specific language. As shown in Figure 7.8, this web form consists of the
following components.
• A text box used to allow the user to enter a lexeme.
• A drop-down menu to allow the user to select the language.
• A list box for showing the synsets list of the entered lexeme.
• A list box for showing the synonyms of the entered lexeme.
• A list box for showing the related lexemes.
• A button to start the searching process.
The searching process, as shown in Figure 7.9, starts when the user submits a lexeme
and a language to the frmWordnetSearch.aspx web form. Then, the method FindSynset()
Figure 7.8: The web form for searching wordnet by lexeme. The form shows the result of
searching for the Arabic lexeme (مصر), which means Egypt.
from the utility class is called to retrieve the synsets that include the entered lexeme and
show the result in the synsets list. Next, when the user selects a synset from the synsets
list, the frmWordnetSearch.aspx web form calls the FindSynsetLexemes() method from the
utility class to show the synonyms of the lexeme in the synonym list. It also calls the
FindSynsetRelations() method to obtain the related lexemes and show them to the user in
the related lexemes list. The user can also expand the details of a synset shown in the
synset list and the related lexemes list by double-clicking on the synset OffsetPos. This
shows the frmSynsetDetails.aspx web form which we describe next.
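The following is a minimal sketch of this search flow in Python, assuming a sqlite3-style connection over tables shaped like those in Appendix C; the Arabic tables are used for illustration, and the function names mirror the utility-class methods.

def find_synsets(db, lexeme):
    # mirrors FindSynSet(): all Offset-POSs whose synset contains the lexeme
    rows = db.execute("SELECT Offset_Pos FROM Arabic_CorWordnet "
                      "WHERE Member = ?", (lexeme,)).fetchall()
    return [r[0] for r in rows]

def find_synset_lexemes(db, offset_pos):
    # mirrors FindSynSetLexemes(): all members (synonyms) of one synset
    rows = db.execute("SELECT Member FROM Arabic_CorWordnet "
                      "WHERE Offset_Pos = ?", (offset_pos,)).fetchall()
    return [r[0] for r in rows]

def find_synset_relations(db, offset_pos):
    # mirrors FindSynSetRelations(): relations with this synset on the left
    return db.execute("SELECT Relation, Right_Offset_Pos "
                      "FROM Arabic_Sem_Relations "
                      "WHERE Left_Offset_Pos = ?", (offset_pos,)).fetchall()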
7.4.5
Searching Wordnet by OffsetPos Web Form
Wordnet search using OffsetPos is provided by the frmSynsetDetails.aspx. An exam-
ple of searching for a synset with the OffsetPos (08897065-n) in our Arabic, Vietnamese
Figure 7.9: Sequence diagram of the process of searching wordnet using lexeme
and Assamese wordnets using the frmSynsetDetails.aspx web form is shown in Figures
7.10, 7.11 and 7.12 respectively. This web form consists of the following components.
• A text box for entering the OffsetPos of the synset.
• A drop-down menu to allow the user to select the language.
• A text box for showing the gloss of the synset.
• A text box for showing the English gloss of the synset.
• A list box to show the synonym list of the synset.
• A list box to show the related synsets and lexemes of the entered synset.
• A button to start the search process.
Figure 7.10: The web form for searching wordnet by OffsetPos. The form shows the result
of searching the Arabic synset (08897065-n).
In this form, the user starts the process of searching wordnet by submitting the OffsetPos of the synset and the target language to the frmSynsetDetails.aspx web form. The
web form calls the FindGloss() method from the utility class to retrieve the gloss of the
synset. It also calls the FindSynSetLexemes() and the FindSynSetRelations() methods to
obtain the synonym list and related synsets of the input synset and show them in the form.
7.4.6
Evaluating Semantic Relations Between Synsets Web Form
The web form frmEvalRelations.aspx allows users to evaluate semantic relations between lexemes and synsets in the system. The form shows the relation as a sentence and
asks the user to rate the correctness of the sentence using a Likert-type scale. The form
consists of the following components.
Figure 7.11: The web form for searching wordnet by OffsetPos. The form shows the result
of searching the Vietnamese synset (08897065-n).
Figure 7.12: The web form for searching wordnet by OffsetPos. The form shows the result
of searching the Assamese synset (08897065-n). The third part meronym in Assamese is
wrong. It comes from the verb meaning of “desert” which means to leave without intending
to return.
Figure 7.13: Sequence diagram of the process of searching wordnet using OffsetPos.
Figure 7.14: The web form for evaluating semantic relations between synsets in a wordnet.
The form shows an example of evaluating a hyponymy relation between the two Assamese
lexemes, one for radio telegraph and the other for radio.
• A text box showing the relation key.
• A text box showing the relation in the form of a sentence.
• A text box showing the UserId of the evaluator.
• An option box that allows the user to rate the relation.
• A button to submit the score.
• A button to end the evaluation session.
Figure 7.15: Sequence diagram of the process of evaluating the relation between two lexemes.
The evaluation form frmEvalRelations.aspx starts the evaluation process by calling
the ReadRelation() method from the utility class to show the relation details to the user.
When the user submits the score he assigns to a relation, the evaluation form frmEvalRelations.aspx stores the score by calling the EvaluateRelation() method from the utility class.
Then, the evaluation form reads the next relation and shows it to the user. The user can
stop the evaluation process by clicking the End Session button. The user can resume the
evaluation at any time without re-evaluating the relations he has already evaluated.
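A minimal sketch of this evaluation cycle in Python, using the Arabic evaluation tables; column names follow the field lists in Sections 7.2.2.4 and 7.2.2.5, and the assumption is that the "next" relation is simply one the user has not scored yet, which also yields the resume-without-re-evaluating behavior described above.

def next_unscored_relation(db, user_id):
    # mirrors ReadRelation() for the next relation this user has not scored
    return db.execute(
        "SELECT RelationKey, Word1, Relation, Word2 "
        "FROM Arabic_Sem_Relations_Eval_Data "
        "WHERE RelationKey NOT IN "
        "  (SELECT RelationKey FROM Arabic_Sem_Relations_Eval_Response "
        "   WHERE UserId = ?)", (user_id,)).fetchone()

def evaluate_relation(db, relation_key, score, user_id):
    # mirrors EvaluateRelation(): store one Likert-scale response
    db.execute("INSERT INTO Arabic_Sem_Relations_Eval_Response "
               "(RelationKey, Score, UserId) VALUES (?, ?, ?)",
               (relation_key, score, user_id))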
7.4.7
Evaluating Wordnet Synsets Glosses Web Form
Figure 7.16: The web form for evaluating wordnet synsets glosses. The form shows an
example of evaluating Arabic synset (13108841-n).
The glosses of the wordnets can be evaluated using the frmEvalGloss.aspx web form.
To evaluate a synset gloss, the form attaches the English gloss of the synset obtained from
the PWN to the selected gloss in the target language. Then, the user is asked if the lexeme
in the selected gloss has the same meaning as in the PWN gloss. This evaluation form is
composed of the following components.
• A text box showing the gloss key.
• A text box showing a lexeme from a synset, a candidate gloss written in the target
language, and the English gloss of the synset.
• A text box showing the UserId of the evaluator.
• An option box that allows the user to rate the candidate gloss.
• A button to submit the score.
• A button to end the evaluation session.
Figure 7.17: Sequence diagram of the process of evaluating a wordnet gloss.
The web form frmEvalGloss.aspx starts the evaluation process for glosses by calling
the ReadSynsetGloss() method from the utility class to obtain the lexeme, the candidate
gloss and the English gloss of the synset being evaluated. Then, the web form uses the previous data to construct a question for the user. When the user submits the score he assigns
to the candidate gloss, the evaluation form stores the score by calling the EvaluateGloss()
method from the utility class. Then, the evaluation form reads the next gloss and shows it to
the user. The user can stop the gloss evaluation process by clicking the End Session button.
The user can resume the gloss evaluation process any time he wishes without re-evaluating
the glosses he has already evaluated.
7.4.8
Searching Bilingual Dictionary Web Form
Figure 7.18: The web form for searching a bilingual dictionary. The form shows the result
of translating the Arabic word (مصر), which means Egypt, to Assamese.
The web form (frmDictionarySearch.aspx) allows users to use the bilingual dictionaries we create in (Lam et al., 2015b) to translate words between languages. As shown in
Figure 7.18, this form consists of the following components.
• A text box used to allow the user to enter a word.
• A drop-down menu to allow the user to select the source language.
• A drop-down menu to allow the user to select the target language.
• A list box for showing the translations list of the entered word.
• A button to start the searching process.
The translation process, as shown in Figure 7.19, starts when the user submits a word,
source language, and target language to the frmDictionarySearch.aspx web form. Then, the
method Translate() from the utility class is called to retrieve the translation list from the
bilingual dictionary and show them to the user.
Figure 7.19: Sequence diagram of the process of searching a bilingual dictionary.
Figure 7.20: The web form for managing users in LexBank.
7.4.9
Users Management Web Form
The web form frmManageUsers.aspx allows the administrators of LexBank to man-
age users. Access to this form is restricted to administrators. The form lists all registered
users with related information. An administrator can activate the accounts of new users
using this form. He can also deactivate any user from the list. This form can be extended
in the future by adding more functionality. As shown in Figure 7.20, this form consists of
the following components.
• ID: the UserId of the user.
• Name: the full name of the user.
• Email: the email address of the user.
• Privilege: the privilege assigned to the user. This can be administrator or client.
• Status: the current status of the user.
• Change Status: a command link to change the current status of the user. The status
of the user can be changed to Active or Inactive.
Figure 7.21: Sequence diagram of the process of managing users in LexBank.
As summarized in the sequence diagram shown in Figure 7.21, an administrator user
starts the process of user management by trying to access the frmManageUsers.aspx web
form. The web form calls the method IsAdmin() from the utility class to verify if the user
is authorized to access the form or not. If the user is not authorized, an error message
is sent to the user. Otherwise, if the user is authorized, the web form calls the method
RetrieveUsers() to obtain the list of registered users in the system. Then, the administrator
can select a user from the list and click the change status link to change the current status
of the user. The web form calls the ChangeUserStatus() method from the utility class to
store the new status and reload the updated users list on the screen.
7.5
Summary
In this chapter, we have described the design and implementation of LexBank, the
multilingual lexical resource we produce in this thesis. The architecture of LexBank consists of three layers: the database layer, the application layer and the web interface layer.
The database layer consists of two databases: a system settings database and a resource database.
The application layer of the system is implemented using Microsoft C#. It provides administrative and resource access services to the web interface. The web interface is designed
and implemented using Microsoft Visual Studio 2012. The interface includes web forms
for managing users and provides different wordnet search services in several languages.
The system can be easily updated to accommodate additional language services and languages.
Chapter 8
CONCLUSIONS
In this chapter, we summarize the main contributions of this dissertation. This dissertation is motivated by the fact that many languages around the world lack the computational
lexical resources that are essential in natural language processing. Our first goal in this dissertation was to develop automatic techniques, relying on only a few available public resources,
for constructing wordnets for low-resource languages. A wordnet is a structured lexical ontology that groups words based on their meaning into sets called synsets.
A wordnet is a very important lexical resource that is used in many applications, such as
translation, word-sense disambiguation, information retrieval and document classification.
The second goal of this dissertation is to design and implement a system that makes the
lexical resources we produce available to the public. Below, we list the main contributions
of this dissertation.
• We have developed an approach for constructing structured wordnets. This approach extends the approach for constructing core wordnets presented by (Lam
et al., 2014b). A core wordnet consists only of synsets, which group synonymous
words into sets with unique IDs. In a more comprehensive wordnet, these synsets
are semantically connected to represent the relations among their meanings. Our
approach produces synsets that are connected by semantic relations. Examples of
the semantic relations we produced are: synonyms, hypernyms, topic-domain relations, part-holonyms, instance-hypernyms and member-meronyms.
• We presented an approach for enhancing the quality of automatically constructed
wordnets. The approach is based on the vector representation of words (word embeddings). Word embeddings are produced by a machine learning technique that maps
words to vectors of real numbers in a multi-dimensional space. Our approach uses the
word2vec algorithm (Mikolov et al., 2013) to generate word representations from an
existing corpus. The word2vec algorithm trains a shallow feedforward neural network that learns vector representations of words by predicting words from their contexts.
Our approach computes the cosine similarity, using word2vec, between semantically
related words in our constructed wordnets and filters any entries which do not satisfy
a pre-selected threshold value.
• We introduced synset2vec, which is an algorithm for representing wordnet synsets in
a multi-dimensional space. Word embeddings provide an excellent vector representation of words. However, these word representations are affected by the fact that many
words have multiple meanings. In order to represent meanings rather than words,
we combine the vectors of synset lexemes into a single vector that represents the
meaning. We believe that this vector representation can be used in many important
applications. For example, it can be used in word-sense disambiguation, machine
translation and gloss selection for wordnet synsets.
• We used our algorithm synset2vec to add glosses to our automatically constructed
synsets. Glosses are a very important part of wordnets. A gloss is used to declare or
clarify the meaning of a synset in a wordnet. A gloss can be a definition statement or
an example sentence that shows the usage of the synonyms of the synset. To select
a gloss (an example sentence) from a corpus for a synset, we used synset2vec to
generate vector representations of the candidate glosses and the synset. We compute
the cosine similarity between each candidate gloss and the synset. Finally, we select
the gloss with the highest cosine similarity to the synset and attach it to the synset.
• We have developed LexBank, a web application that gives public users access to
the resources we created. LexBank provides useful services, in a user-friendly manner, for users who seek linguistic assistance. It also includes evaluation web forms
that are used to gather feedback from human judges. The design of LexBank is
flexible, and it can be easily expanded to accommodate new languages and resources.
Chapter 9
FUTURE WORK
In this chapter, we propose some potential future work that can be performed as an
extension of the work presented in this dissertation. The general goal of the proposed
future work is to enhance the quality and extend the coverage of the lexical resources we
have created. For example, we produced our core wordnets using machine translation and
small dictionaries. The quality of these wordnets is limited by the resources we used to
create them. It is well-known that these resources do not guarantee high coverage and
accuracy for all of the target languages. Below, we list some of the potential future work.
9.1
Extending Bilingual Dictionaries
In this section, we describe a task that can be undertaken as future work: a new
method to extend the bilingual dictionaries created in (Lam et al., 2015b). To increase
the coverage of the bilingual dictionaries, we take advantage of the wordnets we have
created in this dissertation. This section is divided into two parts. In the first part, we
describe the approach we used in (Lam et al., 2015b) to create the bilingual dictionaries.
In the second part, we describe the proposed method to extend these bilingual dictionaries.
9.1.1
Related Work
In (Lam et al., 2015b) we have created a large number of new bilingual dictionaries
using intermediate core wordnets and a machine translator. A dictionary or a lexicon, as
defined by (Landau, 1984), consists of sorted 2-tuple <LexicalUnit, Definition> entries.
Each entry is called a LexicalEntry. The first part of a LexicalEntry is the phrase being defined, while the second part is the definition of the phrase. The definition includes the
meaning of the LexicalUnit and usually has several Senses, each of which is a separate representation of a single aspect of the meaning of a phrase. In (Lam et al., 2015b), the entries
in the dictionaries are of the form <LexicalUnit, Sense1>, <LexicalUnit, Sense2>, ....
The approach for creating dictionaries using intermediate wordnets and a machine
translator (IW) is illustrated in Figure 9.1 and Algorithm 2.
Figure 9.1: The IW approach for creating a new bilingual dictionary
Suppose that we would like to construct a bilingual dictionary Dict(S,D), where S
is a source language and D is a target language, given the dictionary Dict(S,R) where R
is a resource-rich intermediate language. The IW algorithm reads each LexicalEntry from
Dict(S,R) and extracts SenseR from it. Then, it retrieves all Offset-POSs of SenseR from
the wordnet of language R (Algorithm 2, lines 2-5). All the synonyms of the extracted
Offset-POSs are extracted from all the available intermediate wordnets. Then, the algorithm constructs a candidate set candidateSet for the final translations in language D by
translating all the extracted synonyms to language D using machine translation (Algorithm
3). Each candidate in candidateSet has two attributes: word, which represents a
translation in language D, and rank, which counts the occurrences of this translation. The
rank attribute is used to order the candidates in descending order, so that the top candidate
is the best translation. Finally, the sorted candidates are inserted into the new dictionary
Dict(S,D) (Algorithm 2, lines 8-10).
Algorithm 2: IW algorithm (taken from (Lam et al., 2015b))
Input: Dict(S,R)
Output: Dict(S,D)
 1: Dict(S,D) := ϕ
 2: for all LexicalEntry ∈ Dict(S,R) do
 3:   for all SenseR ∈ LexicalEntry do
 4:     candidateSet := ϕ
 5:     Find all Offset-POSs of synsets containing SenseR from the R Wordnet
 6:     candidateSet = FindCandidateSet(Offset-POSs, D)
 7:     Sort all candidates in descending order based on their rank values
 8:     for all candidate ∈ candidateSet do
 9:       SenseD = candidate.word
10:       Add tuple <LexicalUnit, SenseD> to Dict(S,D)
11:     end for
12:   end for
13: end for
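For illustration, the following is a minimal Python sketch of the IW procedure, combining Algorithm 2 with the FindCandidateSet routine of Algorithm 3 (listed below). It assumes dict_s_r maps each LexicalUnit to its senses in R, wordnet_r maps a sense to its Offset-POSs, and intermediate_synonyms() and translate() are hypothetical wrappers around the intermediate-wordnet lookup and the machine translator.

from collections import Counter

def build_dictionary(dict_s_r, wordnet_r, target_lang):
    # sketch of the IW algorithm: translate the intermediate-wordnet
    # synonyms of each sense and rank candidates by occurrence count
    dict_s_d = {}
    for lexical_unit, senses in dict_s_r.items():
        for sense_r in senses:
            candidates = Counter()  # candidate.word -> candidate.rank
            for offset_pos in wordnet_r.get(sense_r, []):
                for word in intermediate_synonyms(offset_pos):
                    candidates[translate(word, target_lang)] += 1
            # store the candidates in descending rank order, best first
            dict_s_d.setdefault(lexical_unit, []).extend(
                word for word, rank in candidates.most_common())
    return dict_s_d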
Algorithm 3: FindCandidateSet(Offset-POSs, D) (taken from (Lam et al., 2015b))
Input: Offset-POSs, D
Output: candidateSet
 1: candidateSet := ϕ
 2: for all Offset-POS ∈ Offset-POSs do
 3:   for all word in the Offset-POS extracted from the intermediate wordnets do
 4:     candidate.word = translate(word, D)
 5:     candidate.rank++
 6:     candidateSet += candidate
 7:   end for
 8: end for
 9: return candidateSet

9.1.2
Extending Bilingual Dictionaries Using Structured Wordnets
In this section, we propose a new method to extend the dictionaries we created in (Lam
et al., 2015b) using the structured wordnets that we have created in this dissertation. The
following steps, which are summarized in Figure 9.2, describe the proposed method to
extend the dictionaries.
Figure 9.2: Extending bilingual dictionaries using structured wordnets
• We start by extracting each input entry Si of the source language S in the bilingual
dictionary from S to D.
• Then, we retrieve the synsets list of Si from the wordnet of S.
• Next, we extract the corresponding synsets from the wordnet of D.
• For each synset member Dk we extracted from the wordnet of D, we create a lexical
entry (Si, Dk).
• In addition, for each synset we extracted from the wordnet of D, we extract its direct
hypernyms and create a lexical entry (Si, Hl) for each hypernym member Hl.
• Finally, we add each lexical entry created in the previous steps to the bilingual
dictionary from S to D if it does not already exist in the dictionary, as sketched below.
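The following Python sketch illustrates these steps under stated assumptions: synsets_of() returns the Offset-POSs of a word in a wordnet, members() returns the lexemes of a synset, and hypernyms() returns its direct hypernym synsets; all three are hypothetical helpers over the structured wordnets, and dict_s_d maps each source entry to its list of translations.

def extend_dictionary(dict_s_d, wordnet_s, wordnet_d):
    # extend Dict(S,D) using the aligned structured wordnets of S and D
    for s_i in list(dict_s_d.keys()):
        for offset_pos in synsets_of(wordnet_s, s_i):
            # aligned synsets share the same Offset-POS across wordnets
            new_words = list(members(wordnet_d, offset_pos))
            for hyper in hypernyms(wordnet_d, offset_pos):
                new_words.extend(members(wordnet_d, hyper))
            for d_k in new_words:
                if d_k not in dict_s_d[s_i]:  # skip existing entries
                    dict_s_d[s_i].append(d_k)
    return dict_s_d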
9.2
Integrating Part-of-speech Tagging into Wordnet Construction
Since our approach for automatic wordnet construction is based on translation, some
of the generated synsets include words that have the wrong part-of-speech. One solution
is to use a Part-Of-Speech Tagger (POS Tagger) to correct the wrong form of the words in
the synset.
A POS Tagger is a computer program which is used to specify the part-of-speech
of words in a text written in some language. For example, the Stanford Part-Of-Speech
Tagger (Toutanova et al., 2003), which is freely available, provides part-of-speech tagging
for Arabic, Chinese, French, Spanish and German. POS Taggers are available for Assamese
(Saharia et al., 2009) and Vietnamese (Le-Hong et al., 2010) as well. Since we are dealing
with low-resource languages, many languages do not have any POS Taggers, and therefore,
this approach is not applicable to them.
To correct the part-of-speech in the words within a synset, we propose the following
steps.
• For each synset synseti in a wordnet wordnetT, we extract the part-of-speech of the
synset from the Offset-POS of synseti.
• For each word wordj in synseti, we find the part-of-speech of wordj and compare it
with the part-of-speech of synseti. If the parts-of-speech of wordj and synseti do not
match, we convert wordj to the correct part-of-speech form and update synseti, as
sketched below.
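A minimal sketch of this correction pass, assuming the wordnet is a mapping from Offset-POS strings to word lists, and that pos_tag() and convert_form() are hypothetical wrappers around a POS tagger and a morphological generator for the target language.

def correct_synset_pos(wordnet_t):
    for offset_pos, words in wordnet_t.items():
        # the part-of-speech letter is the suffix of the Offset-POS,
        # e.g. the 'n' in 08897065-n
        synset_pos = offset_pos.rsplit('-', 1)[1]
        for i, word in enumerate(words):
            if pos_tag(word) != synset_pos:
                # replace the word with its form in the synset's POS
                words[i] = convert_form(word, synset_pos)
    return wordnet_t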
9.3
Wordnet Expansion Using Word Embeddings
One possible way to automatically improve the coverage of a wordnet is by looking
for additional related words in a corpus using word embeddings. In Chapter 6, we introduced synset2vec, which produces vector representations of synsets in a multi-dimensional
space. Taking advantage of synset2vec, we believe it is possible to look for previously unknown words that are semantically related to a synset and include them in the wordnet.
Below, we present a brief description of our idea.
• Assume that we would like to expand a wordnet wordnetT of language T . First, word
embeddings for T are generated.
• Next, for each synset synseti in wordnetT, the vector V⃗i for synseti is generated using
synset2vec.
• Then, all the words whose cosine similarity with V⃗i meets a preselected threshold α
are extracted. From these words, only the words that do not have any semantic
relation with synseti are inserted into a candidate set Ci.
• Next, for each word wordj in Ci, a semantic relation rj is selected based on a classification approach.
• Finally, wordj is inserted into wordnetT and connected to synseti using the semantic
relation rj, as sketched below.
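A minimal sketch of this expansion idea, reusing GenerateVectorForSynset from Appendix B.2 and assuming a gensim word2vec model that supports similar_by_vector(); related_words(), classify_relation() and add_relation() are hypothetical helpers for the existing relations of a synset, the proposed relation classifier, and the wordnet update.

def expand_synset(model, wordnet_t, offset_pos, alpha=0.6, topn=20):
    # candidate words are the nearest neighbors of the synset vector
    synset_vector = GenerateVectorForSynset(offset_pos, "")
    for word, cos in model.similar_by_vector(synset_vector, topn=topn):
        if cos >= alpha and word not in related_words(wordnet_t, offset_pos):
            relation = classify_relation(word, offset_pos)
            add_relation(wordnet_t, offset_pos, relation, word)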
9.4
Producing Vector Representation for Multi-word Lexemes
One issue that appears when producing vector representations is that wordnet lex-
emes can be multi-word phrases. Most of the existing tools for producing word embeddings consider single words only. This means that they produce vectors for lexical units
that are surrounded by spaces. Therefore, when we try to generate a vector for a wordnet
synset, we avoid multi-word lexemes. An enhanced version of our approach for generating vectors for wordnet synsets can be achieved by including a vector representation for
multi-word lexemes. The vectors of the single words within a multi-word lexeme should
be aggregated into a single vector within the synset. However, one issue that arises is
that each single word within a multi-word lexeme may have several meanings when it
appears individually. Therefore, careful research is needed to determine a good
solution for this problem.
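One simple aggregation, sketched below under the assumption of a gensim model, is to average the component-word vectors; as noted above, this is only a first approximation, because each component word may be ambiguous on its own.

import numpy as np

def multiword_vector(model, lexeme, dim=100):
    # average the vectors of the single words inside a multi-word lexeme
    vectors = [model[w] for w in lexeme.split() if w in model]
    if not vectors:
        return np.zeros(dim)
    return np.mean(vectors, axis=0)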
9.5
Vector Representation for Multi-lingual Wordnets
In this dissertation, we produced vector representations for individual wordnets. One
line of work that might help with problems such as wordnet expansion and machine translation is
obtaining the vector representation of aggregated wordnets of several languages. Since all
of the wordnets we create in this dissertation are aligned with PWN, synsets having the
same Offset-Pos in different wordnets actually represent the same meaning. Therefore, we
believe that combining the vectors of aligned synsets from different languages will produce
a representation of the meaning across several languages. One can use this representation to
discover the closest meaning of new words that are not included within the wordnets. This
could also be used in discovering a rough translation for words that are not included in a
dictionary.
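A minimal sketch of this aggregation idea, reusing GenerateVectorForSynset from Appendix B.2; aligned_synsets is assumed to map each language to the synset sharing one Offset-Pos, and the averaging presumes the per-language embedding spaces are comparable.

import numpy as np

def multilingual_synset_vector(aligned_synsets):
    # average the aligned synset vectors from the individual wordnets
    vectors = [GenerateVectorForSynset(syn, "")
               for syn in aligned_synsets.values()]
    return np.mean(vectors, axis=0)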
BIBLIOGRAPHY
M. Abbas and K. Smaili. Comparison of topic identification methods for arabic language.
In International Conference on Recent Advances in Natural Language Processing, RANLP 2005, volume 14, 2005.
M. Abbas, K. Smaïli, and D. Berkani. Evaluation of topic identification methods on arabic
corpora. JDIM, 9(5):185–192, 2011.
K. Ahn and M. Frampton. Automatic generation of translation dictionaries using intermediary languages. In Proceedings of the International Workshop on Cross-Language
Knowledge Induction, pages 41–44. Association for Computational Linguistics, 2006.
P. Akaraputthiporn, K. Kosawat, and W. Aroonmanakun. A Bi-directional Translation
Approach for Building Thai Wordnet. In Asian Language Processing, 2009. IALP’09.
International Conference on, pages 97–101. IEEE, 2009.
F. Al Tarouti and J. Kalita. Enhancing automatic wordnet construction using word embeddings. In Proceedings of the Workshop on Multilingual and Cross-lingual Methods
in NLP, pages 30–34, San Diego, California, June 2016. Association for Computational
Linguistics.
M. Apidianaki and R. J. Von Neumann. Limsi: Cross-lingual word sense disambiguation
using translation sense clustering. In Second Joint Conference on Lexical and Computational Semantics (* SEM), volume 2, pages 178–182, 2013.
M. A. Attia. Handling Arabic morphological and syntactic ambiguity within the LFG
framework with a view to machine translation. PhD thesis, University of Manchester,
2008.
E. Barbu and V. Barbu Mititelu. Automatic building of wordnets. Recent Advances in Natural Language Processing IV: Selected Papers from
RANLP 2005, 292:217, 2007.
K. R. Beesley. Arabic finite-state morphological analysis and generation. In Proceedings
of the 16th conference on Computational linguistics-Volume 1, pages 89–94. Association
for Computational Linguistics, 1996.
S. Bhattacharya, M. Choudhury, S. Sarkar, and A. Basu. Inflectional morphology synthesis
for bengali noun, pronoun and verb systems. Proc. of NCCPB, 8, 2005.
P. Bhattacharyya. Indowordnet. In Proc. of LREC-10, 2010.
O. Bilgin, Ö. Çetinoğlu, and K. Oflazer. Building a wordnet for Turkish. Romanian Journal
of Information Science and Technology, 7(1-2):163–172, 2004.
L. Bloomfield. Language. Holt, Rinehart and Winston, New York, 1933.
F. Bond and K. Ogura. Combining linguistic resources to create a machine-tractable
Japanese-Malay dictionary. Language Resources and Evaluation, 42(2):127–136, 2008.
L. Borin and M. Forsberg. Swesaurus; or, the frankenstein approach to wordnet construction. In Proceedings of the Seventh Global WordNet Conference (GWC 2014), 2014.
D. Bouamor, N. Semmar, C. France, and P. Zweigenbaum. Using Wordnet and semantic
similarity for bilingual terminology mining from comparable corpora. In Proceedings of
the 6th Workshop on Building and Using Comparable Corpora, pages 16–23. Citeseer,
2013.
R. D. Brown. Automated dictionary extraction for “knowledge-free” example-based translation. In Proceedings of the Seventh International Conference on Theoretical and
Methodological Issues in Machine Translation, pages 111–118, 1997.
T. Buckwalter. Issues in arabic orthography and morphology analysis. In Proceedings of
the Workshop on Computational Approaches to Arabic Script-based Languages, pages
31–34. Association for Computational Linguistics, 2004.
T. Charoenporn, V. Sornlertlamvanich, C. Mokarat, and H. Isahara. Semi-automatic compilation of Asian WordNet. In 14th Annual Meeting of the Association for Natural Language Processing, pages 1041–1044, 2008.
X. Chen, Z. Liu, and M. Sun. A unified model for word sense representation and disambiguation. In EMNLP, pages 1025–1035. Citeseer, 2014.
D. Christodoulakis, K. Oflazer, D. Dutoit, S. Koeva, G. Totkov, K. Pala, D. Cristea, D. Tufiş,
M. Grigoriadou, I. Tsakou, and others. BalkaNet: A Multilingual Semantic Network for
Balkan Languages. In Proceedings of the 1st International Wordnet Conference, Mysore,
India, 2002.
C. J. Crouch. An approach to the automatic construction of global thesauri. Information
Processing & Management, 26(5):629–640, 1990.
A. Cucchiarelli, R. Navigli, F. Neri, and P. Velardi. Automatic Generation of Glosses in the
OntoLearn System. In LREC. Citeseer, 2004.
J. R. Curran. From distributional to semantic similarity. 2004.
J. R. Curran and M. Moens. Improvements in automatic thesaurus extraction. In Proceedings of the ACL-02 workshop on Unsupervised lexical acquisition-Volume 9, pages
59–66. Association for Computational Linguistics, 2002a.
J. R. Curran and M. Moens. Scaling context space. In Proceedings of the 40th Annual
Meeting on Association for Computational Linguistics, pages 231–238. Association for
Computational Linguistics, 2002b.
K. Darwish. Named entity recognition using cross-lingual resources: Arabic as an example.
In ACL (1), pages 1558–1567, 2013.
M. Diab and N. Habash. Arabic dialect processing tutorial. In Proceedings of the Human Language Technology Conference of the NAACL, Companion Volume: Tutorial Abstracts, pages 5–6. Association for Computational Linguistics, 2007.
R. M. Fano and D. Hawkins. Transmission of information: A statistical theory of communications. American Journal of Physics, 29(11):793–794, 1961.
A. Farghaly and K. Shaalan. Arabic natural language processing: Challenges and solutions.
ACM Transactions on Asian Language Information Processing (TALIP), 8(4):14, 2009.
C. Fellbaum. A semantic network of English verbs. WordNet: An electronic lexical
database, 3:153–178, 1998.
C. Fellbaum. WordNet and Wordnets. In A. Barber, editor, Encyclopedia of Language and
Linguistics, pages 2–665. Elsevier, 2005.
M. A. Finlayson. Java libraries for accessing the Princeton WordNet: Comparison and
evaluation. In Proceedings of the 7th Global Wordnet Conference, pages 78–85, 2014.
J. R. Firth. A synopsis of linguistic theory, 1930-1955. 1957.
D. Foley and J. Kalita. Integrating wordnet for multiple sense embeddings in vector semantics. In REU on Machine Learning and Applications. University of Colorado, Colorado
Springs, 2016.
T. Gollins and M. Sanderson. Improving cross language retrieval with triangulated translation. In Proceedings of the 24th annual international ACM SIGIR conference on Research
and development in information retrieval, pages 90–95. ACM, 2001.
R. G. Gordon and B. F. Grimes. Ethnologue: Languages of the world, volume 15. SIL
international Dallas, TX, 2005.
S. Green and C. D. Manning. Better arabic parsing: Baselines, evaluations, and analysis. In
Proceedings of the 23rd International Conference on Computational Linguistics, pages
394–402. Association for Computational Linguistics, 2010.
G. Grefenstette. Explorations in automatic thesaurus discovery, volume 278. Springer
Science & Business Media, 2012.
G. Gunawan and A. Saputra. Building synsets for Indonesian Wordnet with monolingual
lexical resources. In Asian Language Processing (IALP), 2010 International Conference
on, pages 297–300. IEEE, 2010.
N. Habash and O. Rambow. Arabic tokenization, part-of-speech tagging and morphological
disambiguation in one fell swoop. In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, pages 573–580. Association for Computational
Linguistics, 2005.
N. Habash, R. Roth, O. Rambow, R. Eskander, and N. Tomeh. Morphological analysis and
disambiguation for dialectal arabic. In HLT-NAACL, pages 426–432, 2013.
N. Y. Habash. Introduction to arabic natural language processing. Synthesis Lectures on
Human Language Technologies, 3(1):1–187, 2010.
A. Haghighi, P. Liang, T. Berg-Kirkpatrick, and D. Klein. Learning Bilingual Lexicons
from Monolingual Corpora. In ACL, volume 2008, pages 771–779, 2008.
Z. S. Harris. Distributional structure. Word, 10(2-3):146–162, 1954.
E. Haugen. Dialect, language, nation. American anthropologist, 68(4):922–935, 1966.
L. Hinkle, A. Brouillette, S. Jayakar, L. Gathings, M. Lezcano, and J. Kalita. Design and
evaluation of soft keyboards for brahmic scripts. ACM Transactions on Asian Language
Information Processing (TALIP), 12(2):6, 2013.
G. Hirst and D. St-Onge. Lexical chains as representations of context for the detection
and correction of malapropisms. WordNet: An electronic lexical database, 305:305–332,
1998.
E. Héja. Dictionary Building based on Parallel Corpora and Word Alignment. In Proceedings of the XIV Euralex International Congress, Leeuwarden, pages 6–10, 2010.
Y. Hlal. Morphological analysis of arabic speech. In Workshop Papers Kuwait/Proceedings
of Kuwait Conference on Computer Processing of the Arabic Language, pages 273–294,
1985.
E. H. Huang, R. Socher, C. D. Manning, and A. Y. Ng. Improving word representations
via global context and multiple word prototypes. In Proceedings of the 50th Annual
Meeting of the Association for Computational Linguistics: Long Papers-Volume 1, pages
873–882. Association for Computational Linguistics, 2012.
V. István and Y. Shoichi. Bilingual dictionary generation for low-resourced language pairs.
In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 2-Volume 2, pages 862–870. Association for Computational Linguistics,
2009.
P. Jaccard. The distribution of the flora in the alpine zone. New phytologist, 11(2):37–50,
1912.
D. Jurafsky and J. H. Martin. Speech and Language Processing (3rd Edition Draft).
Prentice-Hall, Inc., Upper Saddle River, NJ, USA, 2016.
H. Kaji and M. Watanabe. Automatic Construction of Japanese WordNet. Proceedings of
LREC2006, Italy, 2006.
H. Kozima and T. Furugori. Similarity between words computed by spreading activation
on an English dictionary. In Proceedings of the sixth conference on European chapter of
the Association for Computational Linguistics, pages 232–239. Association for Computational Linguistics, 1993.
K. N. Lam. Automatically Creating MultiLingual Resources. PhD thesis, University of
Colorado, Colorado Springs, Apr. 2015.
K. N. Lam and J. Kalita. Creating Reverse Bilingual Dictionaries. In HLT-NAACL, pages
524–528. Citeseer, 2013.
K. N. Lam, F. Al Tarouti, and J. Kalita. Creating Lexical Resources for Endangered Languages. In Proceedings of the 2014 Workshop on the Use of Computational Methods
in the Study of Endangered Languages, pages 54–62, Baltimore, Maryland, USA, June
2014a. Association for Computational Linguistics.
K. N. Lam, F. A. Tarouti, and J. Kalita. Automatically constructing Wordnet synsets.
In 52nd Annual Meeting of the Association for Computational Linguistics (ACL 2014),
Baltimore, USA, June, 2014b.
K. N. Lam, F. Al Tarouti, and J. Kalita. Phrase translation using a bilingual dictionary and
n-gram data: A case study from vietnamese to english. In Proceedings of NAACL-HLT,
pages 65–69, 2015a.
K. N. Lam, F. Al Tarouti, and J. Kalita. Automatically Creating a Large Number of New
Bilingual Dictionaries. In Twenty-Ninth AAAI Conference on Artificial Intelligence, Feb.
2015b.
S. I. Landau. Dictionaries. NY: Scribners, 1984.
L. S. Larkey, L. Ballesteros, and M. E. Connell. Improving stemming for arabic information retrieval: light stemming and co-occurrence analysis. In Proceedings of the 25th
annual international ACM SIGIR conference on Research and development in information retrieval, pages 275–282. ACM, 2002.
P. Le-Hong, A. Roussanaly, T. M. H. Nguyen, and M. Rossignol. An empirical study of
maximum entropy approach for part-of-speech tagging of vietnamese texts. In Traitement
Automatique des Langues Naturelles-TALN 2010, page 12, 2010.
D. Leenoi, T. Supnithi, and W. Aroonmanakun. Building a Gold Standard for Thai WordNet. In Proceeding of The International Conference on Asian Language Processing 2008
(IALP2008), pages 78–82, 2008.
D. Lin. Automatic retrieval and clustering of similar words. In Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and 17th International
Conference on Computational Linguistics-Volume 2, pages 768–774. Association for
Computational Linguistics, 1998.
K. Lindén and J. Niemi. Is it possible to create a very large wordnet in 100 days? an
evaluation. Language resources and evaluation, 48(2):191–201, 2014.
K. Lindén and L. Carlson. FinnWordNet - WordNet på finska via översättning (FinnWordNet - WordNet in Finnish via translation). LexicoNordica, 17(17), 2010.
N. Ljubešić and D. Fišer. Bootstrapping bilingual lexicons from comparable corpora for
closely related languages. In Text, Speech and Dialogue, pages 91–98. Springer, 2011.
M. Maziarz, M. Piasecki, E. Rudnicka, and S. Szpakowicz. Beyond the transfer-and-merge
wordnet construction: plwordnet and a comparison with wordnet. In RANLP, pages
443–452, 2013.
J. J. McCarthy. A prosodic theory of nonconcatenative morphology. Linguistic inquiry, 12
(3):373–418, 1981.
T. Mikolov, W.-t. Yih, and G. Zweig. Linguistic regularities in continuous space word
representations. In HLT-NAACL, pages 746–751, 2013.
G. A. Miller. WordNet: a lexical database for English. Communications of the ACM, 38
(11):39–41, 1995.
G. A. Miller and F. Hristea. WordNet nouns: Classes and instances. Computational Linguistics, 32(1):1–3, 2006.
T. Miller and I. Gurevych. Wordnet-wikipedia-wiktionary: Construction of a three-way
alignment. In LREC, pages 2094–2100, 2014.
M. Mladenovic, J. Mitrovic, and C. Krstev. Developing and Maintaining a WordNet: Procedures and Tools. In Proceedings of the 7th Global Wordnet Conference (GWC 2014),
pages 55–62, 2014.
C. Mouton and G. de Chalendar. JAWS: Just another WordNet subset. Proc. of TALN’10,
2010.
A. S. Nagvenkar, N. R. Prabhugaonkar, V. P. Prabhu, R. N. Karmali, and J. D. Pawar. Concept Space Synset Manager Tool. In Proceedings of the 7th Global Wordnet Conference,
pages 86–94, 2014.
P. Nakov and H. T. Ng. Improved statistical machine translation for resource-poor languages using related resource-rich languages. In Proceedings of the 2009 Conference on
Empirical Methods in Natural Language Processing: Volume 3-Volume 3, pages 1358–
1367. Association for Computational Linguistics, 2009.
R. Navigli and S. P. Ponzetto. BabelNet: Building a very large multilingual semantic
network. In Proceedings of the 48th annual meeting of the association for computational
linguistics, pages 216–225. Association for Computational Linguistics, 2010.
A. Neelakantan, J. Shankar, A. Passos, and A. McCallum. Efficient non-parametric estimation of multiple embeddings per word in vector space. arXiv preprint arXiv:1504.06654,
2015.
L. Nerima and E. Wehrli. Generating Bilingual Dictionaries by Transitivity. In LREC,
volume 8, pages 2584–2587, 2008.
R. Noyer. Vietnamese 'morphology' and the definition of word. University of Pennsylvania
Working Papers in Linguistics, 5(2):5, 1998.
A. Oliver. Wn-toolkit: Automatic generation of wordnets following the expand model.
Proceedings of the 7th Global WordNetConference, Tartu, Estonia, 2014.
A. Oliver and S. Climent. Parallel corpora for Wordnet construction: machine translation
vs. automatic sense tagging. In Computational Linguistics and Intelligent Text Processing, pages 110–121. Springer, 2012.
P. G. Otero and J. R. P. Campos. Automatic generation of bilingual dictionaries using intermediary languages and comparable corpora. In Computational Linguistics and Intelligent
Text Processing, pages 473–483. Springer, 2010.
N. R. Prabhugaonkar, J. D. Pawar, and T. Plateau. Use of Sense Marking for Improving
WordNet Coverage. In Proceedings of the 7th Global Wordnet Conference, pages 95–99,
2014.
Q. Pradet, G. de Chalendar, and J. B. Desormeaux. Wonef, an improved, expanded and
evaluated automatic french translation of wordnet. Proceedings of the 7th Global WordNetConference, Tartu, Estonia, 2014.
J. Ramírez, M. Asahara, and Y. Matsumoto. Japanese-Spanish thesaurus construction using
English as a pivot. arXiv preprint arXiv:1303.1232, 2013.
G. Rigau, H. Rodriguez, and E. Agirre. Building accurate semantic taxonomies from
monolingual MRDs. In Proceedings of the 17th international conference on Computational linguistics-Volume 2, pages 1103–1109. Association for Computational Linguistics, 1998.
H. Rodríguez, D. Farwell, J. Ferreres, M. Bertran, M. Alkhalifa, and M. A. Martí. Arabic
wordnet: Semi-automatic extensions using bayesian inference. In LREC, 2008.
S. Rothe and H. Schütze. Autoextend: Extending word embeddings to embeddings for
synsets and lexemes. arXiv preprint arXiv:1507.01127, 2015.
B. Sagot and D. Fišer. Building a free French wordnet from multilingual resources. In
OntoLex, 2008.
N. Saharia, D. Das, U. Sharma, and J. Kalita. Part of speech tagger for assamese text. In
Proceedings of the ACL-IJCNLP 2009 Conference Short Papers, pages 33–36. Association for Computational Linguistics, 2009.
R. C. S. K. Sarma. Structured and logical representations of assamese text for question-answering system. In 24th International Conference on Computational Linguistics,
page 27, 2012.
M. Saveski and I. Trajkovski. Automatic construction of wordnets by using machine translation and language modeling. In 13th Multiconference Information Society, Ljubljana,
Slovenia, 2010.
K. Shaalan, A. A. Monem, and A. Rafea. Arabic morphological generation from interlingua. In Intelligent Information Processing III, pages 441–451. Springer, 2006.
U. Sharma, J. K. Kalita, and R. K. Das. Acquisition of morphology of an indic language from text corpus. ACM Transactions on Asian Language Information Processing
(TALIP), 7(3):9, 2008.
R. Shaw, A. Datta, D. VanderMeer, and K. Dutta. Building a scalable database-driven
reverse dictionary. Knowledge and Data Engineering, IEEE Transactions on, 25(3):
528–540, 2013.
S. Soderland, O. Etzioni, D. S. Weld, K. Reiter, M. Skinner, M. Sammer, J. Bilmes, and
others. Panlingual lexical translation via probabilistic inference. Artificial Intelligence,
174(9):619–637, 2010.
K. Tanaka and K. Umemura. Construction of a bilingual dictionary intermediated by a third
language. In Proceedings of the 15th conference on Computational linguistics-Volume 1,
pages 297–303. Association for Computational Linguistics, 1994.
L. C. Thompson. A Vietnamese reference grammar, volume 13. University of Hawaii Press,
1987.
K. Toutanova, D. Klein, C. D. Manning, and Y. Singer. Feature-rich part-of-speech tagging
with a cyclic dependency network. In Proceedings of the 2003 Conference of the North
American Chapter of the Association for Computational Linguistics on Human Language
Technology-Volume 1, pages 173–180. Association for Computational Linguistics, 2003.
P. Vossen. Introduction to eurowordnet. In EuroWordNet: A multilingual database with
lexical semantic networks, pages 1–17. Springer, 1998.
Wikipedia. Wordnet — wikipedia, the free encyclopedia, 2015. URL http://en.wikipedia.org/w/index.php?title=WordNet&oldid=656664111. [Online; accessed 22-April-2015].
Wikipedia. Vietnamese language — wikipedia, the free encyclopedia, 2016a. URL https://en.wikipedia.org/w/index.php?title=Vietnamese_language&oldid=731154067. [Online; accessed 30-July-2016].
Wikipedia. Vietnamese morphology — wikipedia, the free encyclopedia, 2016b. URL https://en.wikipedia.org/w/index.php?title=Vietnamese_morphology&oldid=730832239. [Online; accessed 30-July-2016].
Wikipedia. Lexicon — wikipedia, the free encyclopedia, 2016c. URL https://en.wikipedia.org/w/index.php?title=Lexicon&oldid=718057169. [Online; accessed 3-August-2016].
K. Yu and J. Tsujii. Extracting bilingual dictionary from comparable corpora with dependency heterogeneity. In Proceedings of Human Language Technologies: The 2009
Annual Conference of the North American Chapter of the Association for Computational
Linguistics, Companion Volume: Short Papers, pages 121–124. Association for Computational Linguistics, 2009.
O. F. Zaidan and C. Callison-Burch. Arabic dialect identification. Computational Linguistics, 40(1):171–202, 2014.
Appendix A
PAPERS RESULTING FROM THE DISSERTATION
Appendix B
DATA PROCESSING SOFTWARE CODE
B.1 ComputCosineSim.py
###########################
# Program to compute cosine similarity
# between semantically related words in a WordNet
# using Word2Vec
# Author: Feras Al Tarouti
# Date : Feb 4 2016
import unicodecsv as csv
import codecs
import gensim
import editdistance

word2vecmodel = gensim.models.Word2Vec.load_word2vec_format(
    'VieVectors_SG_Size100_W5.bin', binary=True)

with open('LexBankVieSemRelatedWords_WithCOS.csv', 'wb') as outfile:
    writer = csv.writer(outfile)
    writer.writerow(['OffsetPos1', 'Word1', 'Relation', 'OffsetPos2',
                     'Word2', 'COS', 'ld'])
    with open('LexBankVieSemRelatedWords.csv', 'rb') as infile:
        reader = csv.reader(infile, delimiter=',', quoting=csv.QUOTE_NONE)
        firstline = True
        rownum = 0
        for row in reader:
            if firstline:
                firstline = False
            else:
                print("Compute Similarity for pairs number: {0}".format(rownum))
                SynsetID1 = row[0]
                Word1 = row[1]
                Relation = row[2]
                SynsetID2 = row[3]
                Word2 = row[4]
                try:
                    cos = round(word2vecmodel.similarity(Word1, Word2), 3)
                except Exception:
                    # words missing from the model get similarity 0
                    cos = 0.0
                ld = editdistance.eval(Word1, Word2)
                newrow = [SynsetID1, Word1, Relation, SynsetID2, Word2,
                          cos, ld]
                writer.writerow(newrow)
            rownum = rownum + 1
B.2 GenerateVectorForSynset.py
###########################
# A function for computing a synset vector
# Author: Feras Al Tarouti
# Date : May 18 2016
import numpy as np

def GenerateVectorForSynset(syn, thislemma):
    FinalVector = np.zeros(100)
    VectorList = []  # the vector set for this synset
    LemmasList = FindLemmasOfSyns(syn)  # the list of lemmas for this synset
    for lemma in LemmasList:
        if lemma != thislemma:
            Vector = GenerateVectorForLemma(lemma)
            if np.count_nonzero(Vector) > 0:
                VectorList.append(Vector)  # add the lemma vector to the set
    # Find out if this synset has only one word; in this case we have to
    # find a related word and add it to the vector set
    if len(VectorList) < 2:
        # we need to find a related synset
        relatedword = FindRelatedSyn(syn)
        if relatedword != "":
            Vector = GenerateVectorForLemma(relatedword)
            if np.count_nonzero(Vector) > 0:
                VectorList.append(Vector)
    if not VectorList:  # guard against division by zero below
        return FinalVector
    for vec in VectorList:
        FinalVector = np.add(FinalVector, vec)
    # compute the average of the collected vectors
    numbofVec = len(VectorList)
    scalar = np.divide(float(1), float(numbofVec))
    FinalVector = np.multiply(FinalVector, scalar)
    return FinalVector
B.3 GenerateVectorForGloss.py
###########################
# A function for computing a gloss vector
# Author: Feras Al Tarouti
# Date : May 18 2016
def GenerateVectorFor(thisSentence, lemma):
    VectorList = []  # the vector set for this sentence
    FinalVector = np.zeros(100)
    for word in thisSentence.split():
        skip = False
        if word not in stopwrds and word != lemma:
            try:
                Vector = word2vecmodel[word]
                NofSyns = FindNumberOfSyns(word)
                # scale the vector based on the number of synsets
                if NofSyns > 1:
                    thisScalar = np.divide(float(1), float(NofSyns))
                    Vector = np.multiply(Vector, thisScalar)
                VectorList.append(Vector)
                skip = False  # we have this word in our model
            except Exception:
                skip = True
    if len(VectorList) > 0:
        for vec in VectorList:
            FinalVector = np.add(FinalVector, vec)
        numbofVec = len(VectorList)
        scalar = np.divide(float(1), numbofVec)
        FinalVector = np.multiply(FinalVector, scalar)
    return FinalVector
B.4 ComputeGlossSynsetSimilarity.py
###########################
# A program for computing similarity between synset and gloss
# Author: Feras Al Tarouti
# Date : May 18 2016
# First Step : Open the synset-gloss files, and read the sentence
# Second Step : Generate the vector for the synset
# Third Step : Generate the vector for the sentence
# Fourth Step : Compute the cosine similarity between the synset vector
#               and the sentence vector
# Fifth Step : Save the result
###########################
with open(InputDataFile, 'rb') as SentencesFile, \
        open(outputfile, 'wb') as out_file:
    reader = csv.reader(SentencesFile, encoding='utf-8', delimiter=',')
    writer = csv.writer(out_file, encoding='utf-8')
    writer.writerow(['ID', 'CosSem'])
    rownum = 0
    for row in reader:
        if rownum != 0:
            print("Computing Cosine Similarity for Row numb: {0}".format(
                rownum))
            thisSenID = row[0]     # the current sentence ID
            thisSynset = row[1]    # the current synset ID
            thisSynMem = row[2]    # number of members of this synset
            thiswrd = row[3]       # the word used in this sentence
            thiswrdSyns = row[4]   # number of synsets for this word
            thisSentence = row[5]  # the current sentence
            # compute a vector for this synset
            thisSynsetVector = GenerateVectorForSynset(thisSynset, "")
            # generate a vector for this sentence
            thisSentenceVector = GenerateVectorFor(thisSentence, "")
            CosDistance = ComputeCosine(thisSynsetVector, thisSentenceVector)
            x = Decimal(CosDistance)
            if math.isnan(x):
                CosDistance = 0
            newrow = [thisSenID, CosDistance]
            writer.writerow(newrow)
        rownum = rownum + 1
Appendix C
MICROSOFT SQL SERVER TABLES
-- Database: `LexBank_System`
-- ---------------------------------------------------------
-- Table structure for table `Users_Info`
--
USE [LexBank_System]
GO
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
SET ANSI_PADDING ON
GO
CREATE TABLE [dbo].[Users_Info](
[UserId] [varchar](50) NOT NULL,
[UserName] [varchar](100) NOT NULL,
[UserEmail] [varchar](70) NOT NULL,
[UserPwd] [varchar](max) NOT NULL,
[UserPriv] [varchar](15) NOT NULL,
[UserStatus] [varchar](15) NOT NULL,
CONSTRAINT [PK_Users_Info] PRIMARY KEY CLUSTERED
(
[UserId] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY =
OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
GO
SET ANSI_PADDING OFF
GO
-- ---------------------------------------------------------
-- Table structure for table `System_Log`
--
USE [LexBank_System]
GO
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
SET ANSI_PADDING ON
GO
CREATE TABLE [dbo].[System_Log](
[EventId] [int] IDENTITY(1,1) NOT NULL,
[EventDesc] [varchar](200) NOT NULL,
[EventTime] [datetime] NOT NULL,
[UserId] [varchar](50) NOT NULL
) ON [PRIMARY]
GO
SET ANSI_PADDING OFF
GO
-- ---------------------------------------------------------
-- Database: `LexBank_Resources`
-- ---------------------------------------------------------
-- Table structure for table `Arabic_CoreWordnet`
--
USE [LexBank_Resources]
GO
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
SET ANSI_PADDING ON
GO
CREATE TABLE [dbo].[Arabic_CorWordnet](
[Offset_Pos] [nvarchar](10) NOT NULL,
[Member] [nvarchar](200) NOT NULL
) ON [PRIMARY]
SET ANSI_PADDING OFF
GO
-- ---------------------------------------------------------
-- Table structure for table `Assamese_CoreWordnet`
--
USE [LexBank_Resources]
GO
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
SET ANSI_PADDING ON
GO
CREATE TABLE [dbo].[Assamese_CorWordnet](
[Offset_Pos] [nvarchar](10) NOT NULL,
[Member] [nvarchar](200) NOT NULL
) ON [PRIMARY]
SET ANSI_PADDING OFF
GO
-- ---------------------------------------------------------
-- Table structure for table `Vietnamese_CoreWordnet`
--
USE [LexBank_Resources]
GO
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
SET ANSI_PADDING ON
GO
CREATE TABLE [dbo].[Vietnamese_CorWordnet](
[Offset_Pos] [nvarchar](10) NOT NULL,
[Member] [nvarchar](200) NOT NULL
) ON [PRIMARY]
SET ANSI_PADDING OFF
GO
-- ---------------------------------------------------------
-- Table structure for table `Arabic_Sem_Relations`
--
USE [LexBank_Resources]
GO
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
SET ANSI_PADDING ON
GO
CREATE TABLE [dbo].[Arabic_Sem_Relations](
[Left_Offset_Pos] [nvarchar](10) NOT NULL,
[Relation] [nvarchar](50) NOT NULL,
[Right_Offset_Pos] [nvarchar](10) NOT NULL
) ON [PRIMARY]
SET ANSI_PADDING OFF
GO
-- ---------------------------------------------------------
-- Table structure for table `Assamese_Sem_Relations`
--
USE [LexBank_Resources]
GO
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
SET ANSI_PADDING ON
GO
CREATE TABLE [dbo].[Assamese_Sem_Relations](
[Left_Offset_Pos] [nvarchar](10) NOT NULL,
[Relation] [nvarchar](50) NOT NULL,
[Right_Offset_Pos] [nvarchar](10) NOT NULL
) ON [PRIMARY]
SET ANSI_PADDING OFF
GO
-- --------------------------------------------------------
-- Table structure for table `Vietnamese_Sem_Relations`
--
USE [LexBank_Resources]
GO
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
SET ANSI_PADDING ON
GO
CREATE TABLE [dbo].[Vietnamese_Sem_Relations](
[Left_Offset_Pos] [nvarchar](10) NOT NULL,
[Relation] [nvarchar](50) NOT NULL,
[Right_Offset_Pos] [nvarchar](10) NOT NULL
) ON [PRIMARY]
SET ANSI_PADDING OFF
GO
-- --------------------------------------------------------
-- Table structure for table `Arabic_WordnetGlosses`
--
USE [LexBank_Resources]
GO
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
SET ANSI_PADDING ON
GO
CREATE TABLE [dbo].[Arabic_WordnetGlosses](
[Offset_Pos] [varchar](10) NOT NULL,
[Gloss] [varchar](4000) NULL
) ON [PRIMARY]
GO
SET ANSI_PADDING OFF
GO
-- --------------------------------------------------------
-- Table structure for table `Assamese_WordnetGlosses`
--
USE [LexBank_Resources]
GO
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
SET ANSI_PADDING ON
GO
CREATE TABLE [dbo].[Assamese_WordnetGlosses](
[Offset_Pos] [varchar](10) NOT NULL,
[Gloss] [varchar](4000) NULL
) ON [PRIMARY]
GO
SET ANSI_PADDING OFF
GO
-- --------------------------------------------------------
-- Table structure for table `Vietnamese_WordnetGlosses`
--
USE [LexBank_Resources]
GO
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
SET ANSI_PADDING ON
GO
CREATE TABLE [dbo].[Vietnamese_WordnetGlosses](
[Offset_Pos] [varchar](10) NOT NULL,
[Gloss] [varchar](4000) NULL
) ON [PRIMARY]
GO
SET ANSI_PADDING OFF
GO
-- --------------------------------------------------------
-- Table structure for table `Arabic_Sem_Relations_Eval_Data`
--
USE [LexBank_Resources]
GO
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
SET ANSI_PADDING ON
GO
CREATE TABLE [dbo].[Arabic_Sem_Relations_Eval_Data](
[RelationKey] [int] IDENTITY(1,1) NOT NULL,
[Left_Offset_Pos] [nvarchar](10) NOT NULL,
[Word1] [nvarchar](100) NOT NULL,
[Relation] [nvarchar](50) NOT NULL,
[Right_Offset_Pos] [nvarchar](10) NOT NULL,
[Word2] [nvarchar](100) NOT NULL,
[COS] [real] NULL
) ON [PRIMARY]
GO
SET ANSI_PADDING OFF
GO
-- --------------------------------------------------------
-- Table structure for table `Assamese_Sem_Relations_Eval_Data`
--
USE [LexBank_Resources]
GO
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
SET ANSI_PADDING ON
GO
CREATE TABLE [dbo].[Assamese_Sem_Relations_Eval_Data](
[RelationKey] [int] IDENTITY(1,1) NOT NULL,
[Left_Offset_Pos] [nvarchar](10) NOT NULL,
[Word1] [nvarchar](100) NOT NULL,
[Relation] [nvarchar](50) NOT NULL,
[Right_Offset_Pos] [nvarchar](10) NOT NULL,
[Word2] [nvarchar](100) NOT NULL,
[COS] [real] NULL
) ON [PRIMARY]
GO
SET ANSI_PADDING OFF
GO
-- --------------------------------------------------------
-- Table structure for table `Vietnamese_Sem_Relations_Eval_Data`
--
USE [LexBank_Resources]
GO
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
SET ANSI_PADDING ON
GO
CREATE TABLE [dbo].[Vietnamese_Sem_Relations_Eval_Data](
[RelationKey] [int] IDENTITY(1,1) NOT NULL,
[Left_Offset_Pos] [nvarchar](10) NOT NULL,
[Word1] [nvarchar](100) NOT NULL,
[Relation] [nvarchar](50) NOT NULL,
[Right_Offset_Pos] [nvarchar](10) NOT NULL,
[Word2] [nvarchar](100) NOT NULL,
[COS] [real] NULL
) ON [PRIMARY]
GO
SET ANSI_PADDING OFF
GO
-- --------------------------------------------------------
-- Table structure for table `Arabic_Sem_Relations_Eval_Response`
--
USE [LexBank_Resources]
GO
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
SET ANSI_PADDING ON
GO
CREATE TABLE [dbo].[Arabic_Sem_Relations_Eval_Response](
[AnswerKey] [int] IDENTITY(1,1) NOT NULL,
[RelationKey] [int] NOT NULL,
[Score] [int] NOT NULL,
[UserId] [varchar](50) NULL
) ON [PRIMARY]
GO
SET ANSI_PADDING OFF
GO
-- --------------------------------------------------------
-- Table structure for table `Assamese_Sem_Relations_Eval_Response`
--
USE [LexBank_Resources]
GO
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
SET ANSI_PADDING ON
GO
CREATE TABLE [dbo].[Assamese_Sem_Relations_Eval_Response](
[AnswerKey] [int] IDENTITY(1,1) NOT NULL,
[RelationKey] [int] NOT NULL,
[Score] [int] NOT NULL,
[UserId] [varchar](50) NULL
) ON [PRIMARY]
GO
SET ANSI_PADDING OFF
GO
-- --------------------------------------------------------
-- Table structure for table `Vietnamese_Sem_Relations_Eval_Response`
--
USE [LexBank_Resources]
GO
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
SET ANSI_PADDING ON
GO
CREATE TABLE [dbo].[Vietnamese_Sem_Relations_Eval_Response](
[AnswerKey] [int] IDENTITY(1,1) NOT NULL,
[RelationKey] [int] NOT NULL,
[Score] [int] NOT NULL,
[UserId] [varchar](50) NULL
) ON [PRIMARY]
GO
SET ANSI_PADDING OFF
GO
-- --------------------------------------------------------
-- Table structure for table `Arabic_WordnetGloss_Eval_Data`
--
USE [LexBank_Resources]
GO
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
SET ANSI_PADDING ON
GO
CREATE TABLE [dbo].[Arabic_WordnetGloss_Eval_Data](
[GlossKey] [int] IDENTITY(1,1) NOT NULL,
[Offset-pos] [varchar](10) NOT NULL,
[Word] [nvarchar](500) NULL,
[Sentence] [nvarchar](4000) NULL,
[PWNGloss] [nvarchar](900) NULL,
[CosSem] [real] NULL,
[GlossRank] [int] NULL
) ON [PRIMARY]
GO
SET ANSI_PADDING OFF
GO
-- --------------------------------------------------------
-- Table structure for table `Assamese_WordnetGloss_Eval_Data`
--
USE [LexBank_Resources]
GO
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
SET ANSI_PADDING ON
GO
CREATE TABLE [dbo].[Assamese_WordnetGloss_Eval_Data](
[GlossKey] [int] IDENTITY(1,1) NOT NULL,
[Offset-pos] [varchar](10) NOT NULL,
[Word] [nvarchar](500) NULL,
[Sentence] [nvarchar](4000) NULL,
[PWNGloss] [nvarchar](900) NULL,
[CosSem] [real] NULL,
[GlossRank] [int] NULL
) ON [PRIMARY]
GO
SET ANSI_PADDING OFF
GO
-- --------------------------------------------------------
-- Table structure for table `Vietnamese_WordnetGloss_Eval_Data`
--
USE [LexBank_Resources]
GO
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
SET ANSI_PADDING ON
GO
CREATE TABLE [dbo].[Vietnamese_WordnetGloss_Eval_Data](
[GlossKey] [int] IDENTITY(1,1) NOT NULL,
[Offset-pos] [varchar](10) NOT NULL,
[Word] [nvarchar](500) NULL,
[Sentence] [nvarchar](4000) NULL,
[PWNGloss] [nvarchar](900) NULL,
[CosSem] [real] NULL,
[GlossRank] [int] NULL
) ON [PRIMARY]
GO
SET ANSI_PADDING OFF
GO
-- --------------------------------------------------------
-- Table structure for table `Arabic_WordnetGloss_Eval_Response`
--
USE [LexBank_Resources]
GO
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
SET ANSI_PADDING ON
GO
CREATE TABLE [dbo].[Arabic_WordnetGloss_Eval_Response](
[AnswerKey] [int] IDENTITY(1,1) NOT NULL,
[GlossKey] [int] NOT NULL,
[Score] [int] NOT NULL,
[UserId] [varchar](50) NULL
) ON [PRIMARY]
GO
SET ANSI_PADDING OFF
GO
-- --------------------------------------------------------
-- Table structure for table `Assamese_WordnetGloss_Eval_Response`
--
USE [LexBank_Resources]
GO
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
SET ANSI_PADDING ON
GO
CREATE TABLE [dbo].[Assamese_WordnetGloss_Eval_Response](
[AnswerKey] [int] IDENTITY(1,1) NOT NULL,
[GlossKey] [int] NOT NULL,
[Score] [int] NOT NULL,
[UserId] [varchar](50) NULL
) ON [PRIMARY]
GO
SET ANSI_PADDING OFF
GO
-- --------------------------------------------------------
-- Table structure for table `Vietnamese_WordnetGloss_Eval_Response`
--
USE [LexBank_Resources]
GO
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
SET ANSI_PADDING ON
GO
CREATE TABLE [dbo].[Vietnamese_WordnetGloss_Eval_Response](
[AnswerKey] [int] IDENTITY(1,1) NOT NULL,
[GlossKey] [int] NOT NULL,
[Score] [int] NOT NULL,
[UserId] [varchar](50) NULL
) ON [PRIMARY]
GO
SET ANSI_PADDING OFF
GO
-- --------------------------------------------------------
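To illustrate how the resource tables above fit together, the sketch below reads one synset's members, its outgoing semantic relations, and its gloss from the Arabic tables. It is illustrative only: the pyodbc connection string, server name, and the example offset-POS value are assumptions, not part of the LexBank source.

import pyodbc

# Hypothetical connection settings; adjust driver, server, and credentials.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=localhost;DATABASE=LexBank_Resources;Trusted_Connection=yes;"
)
cursor = conn.cursor()

offset_pos = "00001740-n"  # illustrative synset identifier (offset + POS)

# Words belonging to the synset
cursor.execute("SELECT Member FROM Arabic_CoreWordnet WHERE Offset_Pos = ?",
               offset_pos)
members = [row.Member for row in cursor.fetchall()]

# Semantic relations with this synset on the left-hand side
cursor.execute("SELECT Relation, Right_Offset_Pos FROM Arabic_Sem_Relations "
               "WHERE Left_Offset_Pos = ?", offset_pos)
relations = cursor.fetchall()

# The synset's generated gloss, if one exists
cursor.execute("SELECT Gloss FROM Arabic_WordnetGlosses WHERE Offset_Pos = ?",
               offset_pos)
row = cursor.fetchone()
gloss = row.Gloss if row is not None else None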
Appendix D
LEXBANK UTILITY CLASS
using System;
using System.Collections.Generic;
using System.Linq;
using System.Web;
using System.Data;
using System.Data.SqlClient;
using System.Web.Configuration;
using System.IO;
using System.Text;
using System.Security.Cryptography;

namespace LexBank2016
{
    public class LexBankUtils
    {
        private string LexBankConnectionString =
            WebConfigurationManager.ConnectionStrings["LexBankData"].ToString();

        // Checks whether the given user id is still available (not already taken).
        public Boolean IsUserIdAvailable(string UserId)
        {
            Boolean result = false;
            using (SqlConnection connection = new SqlConnection(LexBankConnectionString))
            {
                connection.Open();
                using (SqlCommand command = new SqlCommand(
                    "SELECT UserId FROM Users_Info WHERE UserId LIKE @UserId",
                    connection))
                {
                    command.Parameters.AddWithValue("@UserId", UserId.Trim());
                    var firstColumn = command.ExecuteScalar();
                    if (firstColumn == null)
                    {
                        result = true;
                    }
                }
            }
            return result;
        }

        // Encrypts a plaintext password with AES. The key and IV are derived from
        // a fixed passphrase and salt with PBKDF2 (Rfc2898DeriveBytes).
        public string EncryptPassword(string PlanePassword)
        {
            string EncryptionKey = "LexBank";
            byte[] PlaneBytes = Encoding.Unicode.GetBytes(PlanePassword);
            using (Aes PasswordEncryptor = Aes.Create())
            {
                Rfc2898DeriveBytes PBKDF = new Rfc2898DeriveBytes(EncryptionKey,
                    new byte[] { 0x49, 0x76, 0x61, 0x6e, 0x20, 0x4d, 0x65, 0x64,
                                 0x76, 0x65, 0x64, 0x65, 0x76 });
                PasswordEncryptor.Key = PBKDF.GetBytes(32);
                PasswordEncryptor.IV = PBKDF.GetBytes(16);
                using (MemoryStream ms = new MemoryStream())
                {
                    using (CryptoStream cs = new CryptoStream(ms,
                        PasswordEncryptor.CreateEncryptor(), CryptoStreamMode.Write))
                    {
                        cs.Write(PlaneBytes, 0, PlaneBytes.Length);
                        cs.Close();
                    }
                    PlanePassword = Convert.ToBase64String(ms.ToArray());
                }
            }
            return PlanePassword;
        }

        // Reverses EncryptPassword using the same derived key and IV.
        public string DecryptPassword(string EncryptedPassword)
        {
            string EncryptionKey = "LexBank";
            byte[] DecryptedBytes = Convert.FromBase64String(EncryptedPassword);
            using (Aes PasswordEncryptor = Aes.Create())
            {
                Rfc2898DeriveBytes PBKDF = new Rfc2898DeriveBytes(EncryptionKey,
                    new byte[] { 0x49, 0x76, 0x61, 0x6e, 0x20, 0x4d, 0x65, 0x64,
                                 0x76, 0x65, 0x64, 0x65, 0x76 });
                PasswordEncryptor.Key = PBKDF.GetBytes(32);
                PasswordEncryptor.IV = PBKDF.GetBytes(16);
                using (MemoryStream ms = new MemoryStream())
                {
                    using (CryptoStream cs = new CryptoStream(ms,
                        PasswordEncryptor.CreateDecryptor(), CryptoStreamMode.Write))
                    {
                        cs.Write(DecryptedBytes, 0, DecryptedBytes.Length);
                        cs.Close();
                    }
                    EncryptedPassword = Encoding.Unicode.GetString(ms.ToArray());
                }
            }
            return EncryptedPassword;
        }

        // Inserts a new account with the default "client" privilege and "New" status.
        public Boolean CreateNewUser(string UserId, string UserName,
            string UserEmail, string UserPwd)
        {
            Boolean result = false;
            string UserPriv = "client";
            string UserStatus = "New";
            using (SqlConnection connection = new SqlConnection(LexBankConnectionString))
            {
                connection.Open();
                using (SqlCommand command = new SqlCommand(
                    "INSERT INTO Users_Info VALUES(@UserId, @UserName, @UserEmail, " +
                    "@UserPwd, @UserPriv, @UserStatus)", connection))
                {
                    command.Parameters.AddWithValue("@UserId", UserId.Trim());
                    command.Parameters.AddWithValue("@UserName", UserName.Trim());
                    command.Parameters.AddWithValue("@UserEmail", UserEmail.Trim());
                    command.Parameters.AddWithValue("@UserPwd", UserPwd.Trim());
                    command.Parameters.AddWithValue("@UserPriv", UserPriv.Trim());
                    command.Parameters.AddWithValue("@UserStatus", UserStatus.Trim());
                    try
                    {
                        int c = command.ExecuteNonQuery();
                        if (c == 1)
                            result = true;
                    }
                    catch (Exception)
                    {
                        // Insert failed; result stays false.
                    }
                }
            }
            return result;
        }

        // Verifies the credentials of an "Active" user and logs successful logins.
        public bool IsAuthenticated(string userid, string userpassword)
        {
            bool result = false;
            SqlConnection LexBankDataConnection =
                new SqlConnection(LexBankConnectionString);
            SqlCommand AuthCommand = new SqlCommand(
                "SELECT UserId, UserPriv, UserStatus FROM Users_Info " +
                "WHERE UserId = @userid AND UserPwd = @userpassword",
                LexBankDataConnection);
            AuthCommand.Parameters.AddWithValue("@userid", userid);
            AuthCommand.Parameters.AddWithValue("@userpassword",
                EncryptPassword(userpassword.Trim()));
            LexBankDataConnection.Open();
            SqlDataReader reader = AuthCommand.ExecuteReader();
            while (reader.Read())
            {
                string UserStatus = reader["UserStatus"].ToString();
                if (UserStatus == "Active")
                {
                    result = true;
                    LogEvent("Login", DateTime.Now, userid.Trim());
                }
            }
            return result;
        }

        // Returns the Offset_Pos ids of all synsets containing the given lexeme.
        public List<string> FindSynSet(string lexeme, string WordNet)
        {
            List<string> result = new List<string>();
            using (SqlConnection connection = new SqlConnection(LexBankConnectionString))
            {
                connection.Open();
                using (SqlCommand command = new SqlCommand(
                    "SELECT * FROM " + WordNet + " WHERE Member LIKE @lexeme",
                    connection))
                {
                    command.Parameters.AddWithValue("@lexeme", lexeme.Trim());
                    SqlDataReader reader = command.ExecuteReader();
                    while (reader.Read())
                    {
                        result.Add(reader.GetString(0).Trim());
                    }
                }
            }
            return result;
        }

        // Returns all member lexemes of the synset identified by Offset_Pos.
        public List<string> FindSynSetLexemes(string OffsetPos, string WordNet)
        {
            List<string> result = new List<string>();
            using (SqlConnection connection = new SqlConnection(LexBankConnectionString))
            {
                connection.Open();
                using (SqlCommand command = new SqlCommand(
                    "SELECT * FROM " + WordNet + " WHERE Offset_Pos LIKE @OffsetPos",
                    connection))
                {
                    command.Parameters.AddWithValue("@OffsetPos", OffsetPos.Trim());
                    SqlDataReader reader = command.ExecuteReader();
                    while (reader.Read())
                    {
                        result.Add(reader.GetString(1).Trim());
                    }
                }
            }
            return result;
        }

        // Checks whether a synset id is included in the given wordnet table.
        public Boolean IsSynSetAvailable(string OffsetPos, string Wordnet)
        {
            Boolean result = false;
            using (SqlConnection connection = new SqlConnection(LexBankConnectionString))
            {
                connection.Open();
                using (SqlCommand command = new SqlCommand(
                    "SELECT Offset_Pos FROM " + Wordnet.Trim() +
                    " WHERE Offset_Pos LIKE @OffsetPos", connection))
                {
                    command.Parameters.AddWithValue("@OffsetPos", OffsetPos.Trim());
                    SqlDataReader reader = command.ExecuteReader();
                    if (reader.Read())
                        result = true;
                }
            }
            return result;
        }

        // Collects the relations of a synset, expanding each related synset
        // into its member lexemes.
        public Dictionary<string, string> FindSynSetRelations(string OffsetPos,
            string WordNet, string RelationsTable)
        {
            Dictionary<string, string> result = new Dictionary<string, string>();
            using (SqlConnection connection = new SqlConnection(LexBankConnectionString))
            {
                connection.Open();
                using (SqlCommand command = new SqlCommand(
                    "SELECT * FROM " + RelationsTable.Trim() +
                    " WHERE Left_Offset_Pos LIKE @OffsetPos", connection))
                {
                    command.Parameters.AddWithValue("@OffsetPos", OffsetPos.Trim());
                    SqlDataReader reader = command.ExecuteReader();
                    string Relation = "";
                    int c = 0;
                    while (reader.Read())
                    {
                        if (IsSynSetAvailable(reader.GetString(2).Trim(), WordNet))
                        {
                            Relation = reader.GetString(1).Trim() + " : " +
                                reader.GetString(2).Trim();
                            string RelatedOffsetPos = reader.GetString(2).Trim();
                            List<string> RelatedLexemes =
                                FindSynSetLexemes(RelatedOffsetPos, WordNet);
                            foreach (string lexeme in RelatedLexemes)
                            {
                                c++;
                                result.Add(RelatedOffsetPos + c.ToString(),
                                    Relation + "-->" + lexeme);
                            }
                        }
                    }
                }
            }
            return result;
        }

        // Returns the gloss of a synset, or a default message if none exists.
        public string FindGloss(string OffsetPos, string GlossTable)
        {
            string result = "Gloss is not available";
            using (SqlConnection connection = new SqlConnection(LexBankConnectionString))
            {
                connection.Open();
                using (SqlCommand command = new SqlCommand(
                    "SELECT * FROM " + GlossTable + " WHERE Offset_Pos LIKE @OffsetPos",
                    connection))
                {
                    command.Parameters.AddWithValue("@OffsetPos", OffsetPos.Trim());
                    SqlDataReader reader = command.ExecuteReader();
                    while (reader.Read())
                    {
                        result = reader.GetString(1).Trim();
                    }
                }
            }
            return result;
        }

        // Reads a relation and returns it to be evaluated.
        public List<string> ReadRelation(string RelationKey, string RelationDataTable)
        {
            List<string> Result = new List<string>();
            try
            {
                SqlConnection MyConnection =
                    new SqlConnection(LexBankConnectionString);
                string Sqls = "SELECT [RelationKey], [Word1], [Relation], [Word2] " +
                    "FROM " + RelationDataTable + " WHERE [RelationKey] = @RelationKey";
                SqlCommand Mycommand = new SqlCommand(Sqls, MyConnection);
                // Bind the relation key used in the WHERE clause.
                Mycommand.Parameters.AddWithValue("@RelationKey", RelationKey);
                DataTable MyTable = new DataTable();
                using (SqlDataAdapter Myadapter = new SqlDataAdapter(Mycommand))
                {
                    Myadapter.Fill(MyTable);
                    if (MyTable.Rows.Count > 0)
                    {
                        for (int x = 0; x < 4; x++)
                        {
                            Result.Add(MyTable.Rows[0][x].ToString());
                        }
                    }
                }
                return Result;
            }
            catch (Exception)
            {
                return Result;
            }
        }

        // Reads a synset gloss from the table and returns it to be evaluated.
        public List<string> ReadSynsetGloss(int GlossKey, string TableName)
        {
            List<string> Result = new List<string>();
            try
            {
                SqlConnection MyConnection =
                    new SqlConnection(LexBankConnectionString);
                // Column name matches the Appendix C schema ([PWNGloss]).
                string Sqls = "SELECT [GlossKey], [Word], [Sentence], [PWNGloss] " +
                    "FROM " + TableName + " WHERE [GlossKey] = @GlossKey";
                DataTable MyTable = new DataTable();
                SqlCommand Mycommand = new SqlCommand(Sqls, MyConnection);
                Mycommand.Parameters.AddWithValue("@GlossKey", GlossKey);
                using (SqlDataAdapter Myadapter = new SqlDataAdapter(Mycommand))
                {
                    Myadapter.Fill(MyTable);
                    if (MyTable.Rows.Count > 0)
                    {
                        for (int x = 0; x < 4; x++)
                        {
                            Result.Add(MyTable.Rows[0][x].ToString());
                        }
                    }
                }
                return Result;
            }
            catch (Exception)
            {
                return Result;
            }
        }

        // Stores a user's score for a semantic relation.
        public Boolean EvaluateRelation(int RelationKey, int Score, string UserId,
            string EvaluationTable)
        {
            try
            {
                SqlConnection MyConnection =
                    new SqlConnection(LexBankConnectionString);
                string sqls = "INSERT INTO " + EvaluationTable +
                    " ([RelationKey], [Score], [UserID]) " +
                    "VALUES (@RelationKey, @Score, @UserId)";
                var command = new SqlCommand(sqls, MyConnection);
                command.Parameters.AddWithValue("@RelationKey", RelationKey);
                command.Parameters.AddWithValue("@Score", Score);
                command.Parameters.AddWithValue("@UserId", UserId.Trim());
                MyConnection.Open();
                command.ExecuteNonQuery();
                MyConnection.Close();
                return true;
            }
            catch (Exception)
            {
                return false;
            }
        }

        // Stores a user's score for a synset gloss.
        private Boolean EvaluateGloss(int GlossKey, int Score, string UserId,
            string EvaluationTable)
        {
            try
            {
                SqlConnection MyConnection =
                    new SqlConnection(LexBankConnectionString);
                string sqls2 = "INSERT INTO " + EvaluationTable +
                    " ([GlossKey], [Score], [UserID]) " +
                    "VALUES (@GlossKey, @Score, @UserId)";
                var command = new SqlCommand(sqls2, MyConnection);
                command.Parameters.AddWithValue("@GlossKey", GlossKey);
                command.Parameters.AddWithValue("@Score", Score);
                command.Parameters.AddWithValue("@UserId", UserId);
                MyConnection.Open();
                command.ExecuteNonQuery();
                MyConnection.Close();
                return true;
            }
            catch (Exception)
            {
                return false;
            }
        }

        // Appends an event to the System_Log audit table.
        public void LogEvent(string EventDesc, DateTime EventTime, string UserId)
        {
            using (SqlConnection connection = new SqlConnection(LexBankConnectionString))
            {
                connection.Open();
                using (SqlCommand command = new SqlCommand(
                    "INSERT INTO System_Log([EventDesc], [EventTime], [UserId]) " +
                    "VALUES(@EventDesc, @EventTime, @UserId)", connection))
                {
                    command.Parameters.AddWithValue("@EventDesc", EventDesc.Trim());
                    // Bind the timestamp as a typed DATETIME parameter.
                    command.Parameters.Add("@EventTime", SqlDbType.DateTime).Value =
                        EventTime;
                    command.Parameters.AddWithValue("@UserId", UserId.Trim());
                    command.ExecuteNonQuery();
                }
            }
        }

        // Updates a user's status (e.g., from 'New' to 'Active').
        public void ChangeUserStatus(string UserId, string NewStatus)
        {
            using (SqlConnection connection = new SqlConnection(LexBankConnectionString))
            {
                connection.Open();
                using (SqlCommand command = new SqlCommand(
                    "UPDATE Users_Info SET UserStatus = @UserStatus " +
                    "WHERE UserId = @UserId", connection))
                {
                    command.Parameters.AddWithValue("@UserId", UserId.Trim());
                    command.Parameters.AddWithValue("@UserStatus", NewStatus.Trim());
                    command.ExecuteNonQuery();
                }
            }
        }

        // Returns all user accounts for the administration pages.
        public DataTable RetrieveUsers()
        {
            DataTable result = new DataTable();
            using (SqlConnection connection = new SqlConnection(LexBankConnectionString))
            {
                connection.Open();
                using (SqlCommand command = new SqlCommand(
                    "SELECT [UserId], [UserName], [UserEmail], [UserPriv], " +
                    "[UserStatus] FROM [Users_Info]", connection))
                {
                    SqlDataAdapter dadapter = new SqlDataAdapter(command);
                    dadapter.Fill(result);
                }
            }
            return result;
        }
    }
}
Appendix E
IRB APPROVAL LETTER