Extraction of synonyms and semantically related words from chat logs
Fredrik Norlindh
Uppsala University
Department of Linguistics and Philology
Master’s Programme in Language Technology
Master’s Thesis in Language Technology
November 20, 2012
Supervisors:
Mats Dahllöf, Uppsala University
Sonja Petrović Lundberg, Artificial Solutions
Abstract
This study explores synonym extraction from domain specific chat log
collections by means of a tool-kit called JavaSDM. JavaSDM uses random
indexing and measures distributional similarities. The focus of this study
is to evaluate the effect of different preprocessing operations of training
data and different extraction criteria. Four chat log collections containing
approximately 1,000,000 tokens each were compared: one English and one
Swedish from a retail company and one English and one Swedish from
a travel company. One gold standard was based on synonym dictionaries
and one was a manually extended version of that gold standard. The
extended gold standard included antonyms, misspellings and near-related
hyponyms/hypernyms/siblings and was about 20 % bigger. On average
around two of the extracted synonym candidates per test word were falsely
classified as incorrect because they were not included in the dictionary-based gold standard.
Precision, recall and f-score were computed. Test words were either
nouns, verbs or adjectives. The f-scores were three to five times higher
when using the extended gold standard.
The best f-scores were achieved when training data had been lemmatized. POS-tagging improved precision but decreased recall and decreased
the number of extractions of misspellings as synonyms. A cosine similarity
score threshold of 0.5 could be used to increase precision and f-score without
substantially decreasing recall.
Contents

Acknowledgments
1 Introduction
2 Background
  2.1 Synonymy and other lexical sense relations
  2.2 Usage of extracted synonyms
  2.3 Random Indexing
  2.4 Synonym Extraction Methods
  2.5 Evaluation Methods
3 Data and Method
  3.1 Data
  3.2 Preprocessings
    3.2.1 Extraction of User Input from Chat Logs
    3.2.2 Tokenization
    3.2.3 Part-Of-Speech Tagging
    3.2.4 Lemmatization
    3.2.5 Stop Word Lists
  3.3 Synonym Extraction by Random Indexing
    3.3.1 Random Indexing Tool-Kit
    3.3.2 Extraction Criteria
  3.4 Evaluation
    3.4.1 Test Words
    3.4.2 Gold Standard
    3.4.3 Adapted Gold Standard
4 Results
  4.1 Tests
  4.2 Threshold
  4.3 Part-of-speech tagging
  4.4 Lemmatization
  4.5 The different gold standards
  4.6 Precision
  4.7 Recall
  4.8 F-score
  4.9 Comparison Overview
5 Conclusions
  5.1 Overview of the results
  5.2 Discussion
    5.2.1 Future Study suggestions
References
Bibliography
Acknowledgments
I would like to thank Mats Dahllöf for continuous support and feedback
throughout this project.
I am very grateful to Artificial Solutions for the opportunity to do this
project with a leading language technology company and for the use of their data.
I especially want to thank my supervisor Sonja Petrovic Lundberg.
I would also like to thank Per Starbäck for the time he took to read my thesis
and the advice he gave me regarding formatting aesthetics, and Jörg Tiedemann
for helpful comments and feedback.
1 Introduction
This study explores synonym extraction from chat log collections by means
of an open source tool-kit called JavaSDM. JavaSDM measures distributional
similarities and uses a method called random indexing. The focus of this study
is to evaluate effects of different preprocessing operations of training data and
different extraction criteria. Preprocessing operations are for example lemmatization and POS-tagging. An extraction criterion is for example a similarity
threshold t which says that words with lower similarity score than t are not to be
considered as synonym suggestions. The study is intended to help the automatic
part of a semi-automatic synonym extraction process (a process where humans
extract synonyms from automatically extracted synonym candidates).
The effects these operations and criteria have on two languages and two domains will be compared. Artificial Solutions provided four different collections
of chat logs, each containing approximately 1,000,000 tokens: one Swedish
and one English collection from a travel company, and one Swedish and one
English collection from a retail company.
Four gold standards were created for each chat log corpus: one dictionary-based
and one adapted gold standard for graph words, and one dictionary-based
and one adapted gold standard for lemmas. The dictionary-based gold standards
were based on existing synonym dictionaries but did not include words that did
not occur in the input data. Inflections were added to the graph word based gold
standards. The adapted gold standards were created by manually reviewing synonym
proposals that were not included in the dictionary-based gold standards. The
proposals that actually were correct were added to the adapted gold standards,
which also include all words from the dictionary-based gold standards. Graph word
data and lemmatized data are different things, but unless a combination of
both is created and the JavaSDM code is altered, exactly one
of them must be chosen for synonym extraction with JavaSDM.
2 Background
Section 2.1 discusses when words can be seen as synonyms. Section 2.2 exemplifies how extracted synonyms can be used. Section 2.3 is about random
indexing, which is implemented by JavaSDM. Section 2.4 gives an overview of
different synonym extraction approaches that have been used in prior studies.
Section 2.5 is about ways people have evaluated synonym extractors.
2.1 Synonymy and other lexical sense relations
A synonym is a word having the same or nearly the same meaning as another
word, for example "pal" and "friend" or "jump" and "leap". Creating an absolute
definition of how similar a pair of words needs to be to count as synonyms is almost
impossible, or at least rather subjective (Edmonds and Hirst (2002)). Are, for
example, "buxom", "chubby", "plump", "obese" and "fat" all synonyms? The lexical
database WordNet has chosen to classify their relation as "similar" instead
of "synonym". The difficulty of synonym extraction evaluation is illustrated by
the fact that not even professional lexicographers always agree on whether
two words are synonyms when evaluating a synonym extractor (Plas and
Bouma (2004)) (see Section 2.5 about evaluation methods).
Polysemous and homonymous graph words have different senses. This means
that if two graph words are synonyms in one situation they are not necessarily
synonyms in other situations. For example “I don’t buy her story” could be
replaced with “I don’t believe her story” but “I will buy a new house” can’t be
replaced by “I will believe a new house”.
Hyponyms and hypernyms can be synonymous, especially within specific
domains. For example, a lot of companies have an "XXX membership card" and,
due to the obvious reference, the company and their customers often only
say "membership card" although there are lots of other membership cards in
the world. Whether a hypernym and a hyponym can be used as synonyms
depends on how precise they are and on the context. For example "thing", which
is a hypernym of everything, is not to be considered a synonym of everything.
"Marabou" (the biggest chocolate company in Sweden today) is a hyponym
often used as a synonym/referent for "choklad" ("chocolate") in Sweden. Proper
nouns such as "Marabou" are generally not included in synonym dictionaries,
but there are exceptions. For example "MAC" and "computer" are listed as
synonyms in English thesauri.
Conceptual siblings are words that are hyponyms "on the same level" of the
same hypernym. For example "father" and "mother" are siblings that are used
to tell that a person is the parent of a child. Another example of conceptual
siblings are products that have different names and/or are created in a slightly
different way. This is very common at pharmacies in for example Sweden,
where the pharmacists occasionally say they are out of the medicine you ask
for or that they have a similar, cheaper medicine that you should pick according
to the laws regarding high-cost protection. Siblings can also be antonymous, for
example "man" and "woman". Antonymous word pairs often have paradigmatic
relations, which means that it is possible to replace one word, for example "bad",
with its antonym "good" and still have a syntactically and semantically coherent
sentence, and due to irony they can mean the same thing. Combinations of
negations and antonyms can be used as a synonym phrase, for example one
can say "Not bad!" instead of "Good!". Actually "bad" can be positive in a slang
sense, for example "he's a baaad dude" (Cruse (2002)). Since antonyms often
represent opposite polarities on the same "scale", they are often used in the same
contexts, for example "I hate/love you". This is why they might be suggested as
synonym candidates when measuring distributional similarity.
In many situations the positive/negative energy of a word is used to decide
whether two words can be used to send the same message. For example the
definitions of “stupid” and “ugly” are not synonyms but in sentences such as
“that dress looks ugly/stupid” they are both used to say that something does not
look good because they carry the same negative energy. The affective meaning
of words is studied by, for example, Strapparava and Mihalcea (2007).
Another relation that has not been mentioned in earlier synonym extraction
projects is misspellings. Due to the fact that the training data in this project
is unedited chat data, misspellings occur. Actually some words were more
frequently misspelled than correctly spelled.
Sometimes users of a dialogue system mix words from different languages.
During the Swedish tests in this study for example “infant” was extracted as a
synonym for “bebis” (“infant”) because they were used in similar contexts. In
such cases the translation should be considered as a synonym.
2.2 Usage of extracted synonyms
Semi-automatic synonym extraction is probably the best way to find many
synonyms of good quality. Muller et al. (2006) state that synonym extractors
are supposed to complement existing thesauri or to help targeting a specific
domain. Specific domains might use words in new senses or new words that
are not used outside the domain. Existing thesauri don't include all possible
synonyms in a language, and new words and new word senses might appear
every day. Automatic synonym extraction is a much more effective way to find
synonym candidates than letting lexicographers think or analyze corpora by
themselves. Lexicographers can then weed out bad synonym candidates in a set
of proposals. Synonyms are relevant for dialogue systems to understand users
and to widen their own vocabulary. Dialogue systems, such as those created
by Artificial Solutions, are often domain specific.
Plas and Bouma (2004) exemplify why it might be a good thing if hypernyms,
hyponyms and conceptual siblings are found. If a question answering
system is asked "What actor is used as Jar Jar Binks' voice?", the system looks
for the name of a person if it knows that an actor "IS-A" person. If a
question is "What is the profession of Renzo Piano?" and the system has access
to a document saying "Renzo Piano is an architect and an Italian", then the
knowledge that architect is a profession would help answering the question.
The knowledge that product Y is a similar alternative to product X, as in
the pharmacy example in Section 2.1, is useful when a customer is looking for
something that, for example, is not at hand or is more expensive than a similar
product.
2.3 Random Indexing
JavaSDM, which is the tool-kit used for synonym extraction in this study, uses
random indexing. Word space models use distributional statistics to generate
high-dimensional vector spaces, in which words are represented by context
vectors whose relative directions are assumed to indicate semantic similarity.
Words with similar meanings tend to occur in similar contexts. Contexts can be
either whole documents or context windows. A context window can include x
words directly before the focus word and y words directly after.
Random indexing is a method based on an incremental word space model
which accumulates context vectors based on the occurrence of words in contexts. Each word in the text is assigned a sparse random index vector with
dimensionality usually chosen to be in the range of a couple of hundred up
to several thousands, depending on the size and redundancy of the data. A
small number of the elements of the vector are randomly distributed +1’s and
an equal number of elements are distributed -1’s and the rest is set to 0. The
index vectors are used to construct context vectors for each word in the text
(Sahlgren (2005) and Hassel (2007)). This is exemplified by Figure 2.1.
Figure 2.1: A context window focused on the token “merchandise”, taking note of the
co-occurring tokens. The row cv represents the continuously updated context vectors, and
the row rl represents the static randomly generated index labels. The grey fields are not
involved in this update example.
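
The update step can be illustrated with a minimal Python sketch. This is not the JavaSDM implementation; the function and variable names are illustrative, and the parameter values are the defaults mentioned in Section 3.3.1:

import random
from collections import defaultdict

DIM = 1000         # dimensionality d
RANDOM_DEGREE = 8  # number of non-zero elements: four +1's and four -1's
WINDOW = 4         # symmetric context window, four words on each side

def make_index_vector(dim=DIM, degree=RANDOM_DEGREE):
    # Sparse ternary random index vector: half the non-zeros are +1, half -1.
    vec = [0.0] * dim
    positions = random.sample(range(dim), degree)
    for i, pos in enumerate(positions):
        vec[pos] = 1.0 if i < degree // 2 else -1.0
    return vec

def train(tokens):
    # Accumulate one context vector per word by adding the (weighted)
    # index vectors of the neighbours in the sliding window.
    index_vectors = defaultdict(make_index_vector)
    context_vectors = defaultdict(lambda: [0.0] * DIM)
    for i, focus in enumerate(tokens):
        lo, hi = max(0, i - WINDOW), min(len(tokens), i + WINDOW + 1)
        for j in range(lo, hi):
            if j == i:
                continue
            weight = 2.0 ** (1 - abs(i - j))   # "Manges" weighting, 2^(1-distance)
            neighbour = index_vectors[tokens[j]]
            cv = context_vectors[focus]
            for k in range(DIM):
                cv[k] += weight * neighbour[k]
    return context_vectors

vectors = train("can i buy the merchandise online to be picked up".split())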
Sahlgren (2005) states that the most well-known word space alternatives
to random indexing such as Latent Semantic Analysis (LSA) first construct a
huge co-occurrence matrix and then use a separate time and memory consuming dimension reduction phase. He lists four advantages of Random indexing
compared to other word space methodologies:
• The context vectors can be used for similarity computations even after
just a few examples have been encountered (Ferret (2010) confirmed this
by trying different frequency thresholds deciding which words got their
own context vectors, i.e. decided which words could be extracted, and
found that such thresholds only lowered results). By contrast, most other
word space models require the entire data to be sampled before similarity
computations can be performed.
• Random Indexing uses “implicit” dimension reduction, since the fixed
dimensionality d is much lower than the number of contexts c in the data.
This leads to a significant gain in processing time and memory consumption as compared to word space methods that employ computationally
expensive dimension reduction algorithms.
• The dimensionality d of the vectors is a parameter in Random Indexing.
This means that d does not change once it has been set. When encountering
new data, the values of the elements of the context vectors increase,
but never their dimensionality. In HAL (Hyperspace Analogue to Language)
and LSA, matrices are created of size T × T or D × T, where D
is the number of indexed documents and T is equal to the number of
unique terms found in the data set so far (Hassel (2007)).
• Random indexing can be used with any type of context. Other word
space models typically use either documents or words as contexts. Random indexing is not limited to these naive choices, but can be used with
basically any type of contexts.
2.4 Synonym Extraction Methods
There have been plenty of papers written about synonym extraction methods.
Three important approaches are:
1. Distributional approaches to monolingual corpora, for example Rosell et al.
(2009) and Ferret (2010). This method, which is used by JavaSDM,
measures distributional similarities within a corpus. The approach is based
on the distributional hypothesis (Harris (1954)) that words that occur in
similar contexts also tend to have similar meanings/functions. Systems
using this approach provide ranked lists of semantically related words
according to their context similarities. Random indexing is a common
method for this (Gorman and Curran (2006)).
2. Alignment of bi- or multilingual corpora, for example van der Plas and
Tiedemann (2006) and Blondel and Senellart (2002). This is a method
using two or more parallel corpora based on the assumption that two
words are similar if they have the same translation. If two words also
share translational contexts they are even more similar. This is measured
in a similar way to distributional monolingual corpora. Automatic word
alignment can be used to find the translations in parallel corpora. This
approach is known as a way to separate synonyms from other semantic
relations; for example “apple” is typically not translated with a word for
“fruit” nor “pear”.
3. Dictionary-based approaches, for example Wang and Hirst (2012) and
Muller et al. (2006). This method measures distributional similarities
among definitions in dictionaries. It is similar to the distributional approaches
to monolingual corpora, but specifically a dictionary is used instead. Hence,
relatedness of words is measured using distributional similarity in the same
way as in the monolingual case but with a different type of context. Automatic
word alignment can be used to find translations in parallel data.
The two latter approaches are not relevant for this study since each chat data
collection is a monolingual corpus. A method with a distributional approach
assesses the degree of synonymy between words according to their co-occurrence
patterns within text corpora, under the assumption that similar words tend
to appear in similar contexts (Wang (2009)). For example, van der Plas and
Tiedemann (2006) mention suggestions of hyponyms, antonyms and hypernyms
as synonyms as a problem encountered with distributional similarities, but as
exemplified in Section 2.1 such synonym proposals are not necessarily negative.
Wu and Zhou (2003) used a bilingual method, a monolingual dictionary
and a monolingual corpus and concluded that the corpus based method can
get nuanced synonyms or near-synonyms which generally cannot be found
in manually built thesauri, exemplified by "handspring → handstand",
"audiology → otology", "roisterer → carouser" and "parmesan → gouda".
2.5 Evaluation Methods
The validation of a method's ability to propose synonyms can be done with a
test using a gold standard or with an extrinsic evaluation.
Muller et al. (2006) state that "Comparing to already existing thesaurus is
a debatable means when automatic construction is supposed to complement an
existing one, or when a specific domain is targeted". They also say that, in order
to get an indication of the extraction quality, manual verification of a sample
of synonym proposals is a common practice. Time is the reason only a sample
of the proposals is verified. Such verification is normally done either by the
authors of a study or by independent lexicographers. The larger the sample, the
more reliable the conclusions, but the more time is required. Automatic verification
can verify more proposals in much shorter time but is less reliable.
The simplest evaluation measure of synonym extractions is a comparison
with manually created thesauri (Wu and Zhou (2003)). Precision, recall and
f-score are the most common measures. Gold standard words are often taken
from one or several thesauri such as WordNet, Roget's, the Macquarie and Moby.
If a word is polysemous, all the synonyms of all its senses are included in the
gold standard in prior studies, for example Wu and Zhou (2003), Curran and
Moens (2002) and van der Plas and Tiedemann (2006). Wang and Hirst (2012),
who evaluated a dictionary method, tried to reduce this “problem” by using a
POS-tagger. In their study a synonym proposal was considered correct if it both
got the same POS-tag as the test word and was included in the same synset in
WordNet. Test data can be created using different techniques. van der Plas and
Tiedemann (2006) automatically tagged all words and made a list of all words
tagged as nouns with a frequency of 4 or more. Then they randomly used 1,000
of these as test words and the synonyms found in Dutch EuroWordnet for these
words were used as gold standard. A correct synonym extraction, for example
“e-mail” (“e-mail”) for “e-postadress” (“e-mail address”) might be considered as
incorrect because it is not included in the thesauri. They had 10 lexicographers
evaluate a sample of 100 incorrect suggestions. 10 out of the 100 word pairs
were classified as synonyms by all evaluators and 37 word pairs were classified
as synonyms by more than half of the human evaluators.
There have been several other evaluation methods, for example the TOEFL
(Test of English as a Foreign Language) test, used by for example Ferret (2010) and Wang
(2009). TOEFL is originally an obligatory test for non-native speakers of English
who intend to study at a university with English as the teaching language. The
synonym extractor is supposed to pick the one out of four words that is most
similar to a test word (Ferret (2010)).
Another method, clustering, has been used by Rosell et al. (2009). First
they assigned every word in their data to one of several clusters based on the
similarity measure. A test group had graded the similarities between word pairs
from 0-5 to build a synonym dictionary. The word-pairs with mean grade of
3.0 to 5.0 were put on a list. They used the list of synonym pairs to see if the
words ended up in the same cluster or not. If a word pair graded as synonyms
by the users had ended up in the same cluster it was counted as correct.
van der Plas and Tiedemann (2006) acknowledged that their conclusions
about precision when using a monolingual distributional approach might have
been a bit unfair since they always extracted the same number of synonym
candidates no matter how low the similarity scores were. Wu and Zhou (2003)
tried different similarity thresholds and showed that a higher threshold increases
precision but also decreases recall and vice versa.
Wu and Zhou (2003) describe evaluation alternatives for synonyms extracted
from a corpus as follows: "Comparing the results with the human-built
thesauri is not the best way to evaluate synonym extraction because the
coverage of the human-built thesaurus is also limited. However, manually evaluating the results is time consuming. And it also cannot get the precise evaluation
of the extracted synonyms. Although the human-built thesauri cannot help to
precisely evaluate the results, they can still be used to detect the effectiveness
of extraction methods.”
3 Data and Method
In this chapter Section 3.1 discusses the data used in this study. Section 3.2 is
about the preprocessing operations and the tools used for them. Section 3.3
describes the random indexing tool-kit and the extraction criteria used in this
study. Section 3.4 is about the creation of test data and gold standards.
Figure 3.1 gives an overview of the extraction experiments.
Figure 3.1: A flow chart of the extraction experiments
3.1 Data
Artificial Solutions provided four chat log collections with sizes of approximately 1,000,000 tokens:
• Swedish Retail: Swedish chat log collection from the retail company
• English Retail: English chat log collection from the retail company
• Swedish Travel: Swedish chat log collection from the travel company
• English Travel: English chat log collection from the travel company
A chat log file could look like this:
<SESSION ID="-H-vkmsKoZ" ADMIN="no" CONTINUED="no">
<INFO>
<BEGIN LOAD="7" TIME="1304203914 2011-05-01 00:51:54"/>
<END LOAD="7" TIME="1304204091 2011-05-01 00:54:51"/>
<BOT NAME="XXXX"/>
<LOG VERSION="2"/>
<KBLOAD TIME="1302168630 2011-04-07 11:30:30"/>
<TRANSACTIONS COUNT="4"/>
</INFO>
<TRANSACTION TIME="1304203914 2011-05-01 00:51:54"
DURATION="26">
<QUESTION></QUESTION>
<ANSWER ID="50888" EMOTION="Happy.Happy" CATEGORY=
"Library.Dialogue.Initial">Welcome to XXX. How can I help you
today? </ANSWER>
<RECOGNITION AUTO_RANK="1" AUTO_SUBRANK="11" CUSTOM_RANK="1"
CUSTOM_SUBRANK="1"/>
</TRANSACTION>
<TRANSACTION TIME="1304203936 2011-05-01 00:52:16"
DURATION="457">
<QUESTION>hi</QUESTION>
<ANSWER ID="50890" EMOTION="Nodding.Happy" CATEGORY=
"Library.Dialogue.Positive">Welcome.</ANSWER>
<RECOGNITION AUTO_RANK="3" AUTO_SUBRANK="11" CUSTOM_RANK="17"
CUSTOM_SUBRANK="11"/>
</TRANSACTION>
<TRANSACTION TIME="1304203950 2011-05-01 00:52:30"
DURATION="342">
<QUESTION>i want pepper mill</QUESTION>
<ANSWER ID="50486" EMOTION="Action.Pointing_left"
CATEGORY="Library.User.Personal.Feelings
">You can find information about it on the webpage I've just
opened for you.</ANSWER>
<RECOGNITION AUTO_RANK="3" AUTO_SUBRANK="11" CUSTOM_RANK="14"
CUSTOM_SUBRANK="6"/>
</TRANSACTION>
<TRANSACTION TIME="1304204089 2011-05-01 00:54:49"
DURATION="385">
<QUESTION>thanks for your help</QUESTION>
<ANSWER ID="50464" EMOTION="Happy.Smiling" CATEGORY=
"Library.Dialogue.Positive">You are welcome. </ANSWER>
<RECOGNITION AUTO_RANK="7" AUTO_SUBRANK="11" CUSTOM_RANK="17"
CUSTOM_SUBRANK="11"/>
</TRANSACTION>
</SESSION>
Text categorized as “QUESTION” has been written by users and is used as
training corpora in this study. The actual user input looks like this:
hi
i want pepper mill
thanks for your help
3.2 Preprocessings
3.2.1 Extraction of User Input from Chat Logs
The actual user input was extracted from the chat logs by using regular expressions in the Linux terminal.
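
The exact command-line expressions are not reproduced here; the following is a minimal, equivalent sketch in Python, assuming the chat log format shown above (the file name chatlog.xml is hypothetical):

import re

# Extract the user turns, i.e. the content of <QUESTION> elements.
with open("chatlog.xml", encoding="utf-8") as f:
    log = f.read()

for q in re.findall(r"<QUESTION>(.*?)</QUESTION>", log, flags=re.DOTALL):
    q = q.strip()
    if q:               # skip empty first turns like <QUESTION></QUESTION>
        print(q)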
3.2.2 Tokenization
The JavaSDM tokenizer was not fully appropriate for noisy data; for example,
character sequences like ",candy" and "broken,it" were not split up. I wrote a
new tokenizer which splits up tokens that include one or more punctuation marks.
It does not split up hyphenated words and uses abbreviation lists to avoid splitting
up abbreviations. The abbreviation lists were created by extracting Swedish
abbreviations from the Stockholm-Umeå Corpus and Talbanken and English
abbreviations from the Penn Treebank.
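
A minimal sketch of this tokenization behaviour is given below. The abbreviation list is illustrative, not the lists extracted from the corpora mentioned above, and the assumption that split-off punctuation marks are kept as separate tokens is mine:

import re

# Illustrative abbreviation list (the thesis uses SUC/Talbanken/Penn Treebank lists).
ABBREVIATIONS = {"e.g.", "i.e.", "etc.", "t.ex.", "bl.a."}

def tokenize(text):
    tokens = []
    for chunk in text.split():
        if chunk.lower() in ABBREVIATIONS:
            tokens.append(chunk)   # keep known abbreviations intact
        else:
            # split off punctuation, but keep word-internal hyphens ("e-mail")
            tokens.extend(re.findall(r"\w+(?:-\w+)*|[^\w\s]", chunk))
    return tokens

print(tokenize("broken,it was cheap, e.g. the e-mail"))
# ['broken', ',', 'it', 'was', 'cheap', ',', 'e.g.', 'the', 'e-mail']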
3.2.3 Part-Of-Speech Tagging
The open source Hidden Markov Model tagger Hunpos was used to POS-tag
the data before training. Swedish models trained by Beáta Megyesi on the
Stockholm-Umeå Corpus and English models trained on the Wall Street Journal
corpus were downloaded from http://stp.lingfil.uu.se/~bea/resources/hunpos/
and http://code.google.com/p/hunpos/.
The tagging was not more fine-grained than "_a"=adjective, "_n"=noun,
"_v"=verb and "_other"=any other word class. POS-tagging introduces a distinction
between homonyms of different word classes. Morphological tagging can
be useful for separating homonyms, but during pre-tests it gave lower results
than the chosen POS-tagging (see Section 5.2).
The test words were either nouns, verbs or adjectives (see Section 3.4.1).
After Hunpos tagging, past participle tags were converted to adjective tags
because they are very similar and in Svenska Akademins ordlista (SAOL, The
dictionary of the Swedish Academy) several words, for example “intresserad”
(“interested”), are classified both as adjective and perfect participle. According
to Holmer (2009) a past participle has a function somewhere between adjective
and verb. If a past participle has been used frequently enough with an adjective
function rather than a verb function, it is included as an individual lexeme classified
as adjective in SAOL. Proper nouns were seen as nouns since both domains
in this study have named products. As mentioned in Section 2.1, names can
be near-synonyms, for example “MAC” and “computer” and “Marabou” and
“chocolate”.
Tests using the dictionary-based gold standards were run to find indications
of whether to tag the whole corpus or only verbs, nouns and adjectives since
they are the only classes included in the test data. These tests, which were done
before any manual correction, showed results favoring tagging only nouns, verbs
and adjectives. This POS-tagging tag set was consequently used in this study.
The corpora could look like this:
the
fat_a
lady_n
sings_v
POS-tagged data were required by the lemmatization tools.
3.2.4 Lemmatization
The open source lemmatizer Lempas by Silvia Cinkova was used for Swedish
(http://ufal.mff.cuni.cz/~cinkova/). A lemmatizer from the NLTK
tool-kit was used for English (http://nltk.org/). Data needed to be POS-tagged
before lemmatization could be performed by these lemmatizers. The purpose
is to group the inflected forms of a word as a single item
(http://www.collinsdictionary.com/dictionary/english/lemmatisation). If for
example the graph words "run" and "runs" occur 5 times each, their lemma
occurs 10 times if the data is lemmatized.
Cases where it is important to distinguish between different inflections are a
disadvantage of lemmatization. For example the plural form "tillgångar" ("assets")
often refers to money, which is not what its lemma form "tillgång" ("asset"/"access") is used for.
3.2.5 Stop Word Lists
Both Swedish and English stop word lists were taken from the NLTK tool-kit
(http://nltk.org/). A regular expression for tokens that did not contain any
alphabetic character was added to the stop word lists.
A few quick tests were enough to clearly see that the use of stop word lists
improved precision; therefore all tests used stop word lists. JavaSDM can
read a given list of stop words and remove all those words from the training data
before doing any training. Tests using the dictionary-based gold standards were
run and indicated that it is better to keep stop words during training and instead
use the list of stop words to prevent those words from being given as
synonym proposals (see Section 5.2 for a discussion of why). For that reason, and
in order not to double the number of experiments and manual evaluation, that
method was used in the experiments.
3.3 Synonym Extraction by Random Indexing
3.3.1 Random Indexing Tool-Kit
The tool-kit called JavaSDM (http://www.csc.kth.se/~xmartin/java/), which
was written by Martin Hassel, is the software used to extract synonyms in
this study. It is an open source tool-kit which uses random indexing (more about
random indexing in Section 2.3). JavaSDM can use one or more documents
to train a model for synonym extraction. The model can be given a list of test
words and provide a ranked list of synonym candidates for each test word.
Dimensionality, random degree, weighting scheme and context window size
are training parameters that can be altered. All these parameters affect synonym
extraction but an examination of them is beyond the scope of this study. For
example, 100 models would have to be created for each chat log corpus in
order to find the best context window combining 1–10 words before and 1–10
words after the focus word. Each of these models would need to be combined
with the other parameters, which means 100 times more models. Each new
parameter setting multiplies the number of tests. For this reason default settings
were used in the experiments.
• A dimensionality as low as possible without decreasing extraction quality is desired
since it reduces time and memory consumption. Where that point lies depends
on a combination of the size and redundancy of the data. The best dimensionality is usually “between a couple of hundred and several thousands
depending on the size and redundancy of the data” (Hassel (2007)). The
default dimensionality is 1000.
• Random degree is the number of random +1’s and -1’s in each random
vector. The default random degree is 8 which means four +1’s and four
-1’s.
• Context window size is the number of words before and after the focus
word that will be considered as the context for each word in the text.
When scanning through the text, the random index vectors of the neighbors in the sliding window are added to the context vector of the current
focus word (Sahlgren (2005) and Hassel (2007)). A brief context window
test gave better results for symmetrical windows with 4 or 6 words before
and after the focus word than for a window with 5 words before and after
the focus word. This indicates that the effect of the window size is not
obvious. A symmetric context window with 4 words on each side of the
focus word is the default setting.
• The three simplest weighting schemes available in JavaSDM are:
– ConstantWS: Using a constant weight makes the weighting word
order independent.
– MangesWS: Calculates the weighting based upon the distance to the
current label in the following manner: weight = 2^(1 − distance to focus word).
– MartinsWS: Calculates the weighting based upon the distance to the
current label in the following manner: weight = 1/distance to focus word.
"Manges weighting scheme" is the default weighting scheme. It weights
with exponential dampening, 2^(1−d), where d is the distance between the
focus word and the neighbor. This weighting scheme has for example
been used in papers by the creator of JavaSDM (Rosell et al. (2009)).
When the context window has four words on each side of the focus word,
the Manges, Martins and Constant weighting schemes define weights like this:
Token:    can   i     buy   the   merchandise    online   to    be    pick
Manges:   1/8   1/4   1/2   1     (focus word)   1        1/2   1/4   1/8
Martins:  1/4   1/3   1/2   1     (focus word)   1        1/2   1/3   1/4
Constant: 1     1     1     1     (focus word)   1        1     1     1
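
The weights in the table can be reproduced with the following small sketch; the function names are illustrative, not JavaSDM class names:

def manges_ws(distance):    # exponential dampening: 2^(1 - d)
    return 2 ** (1 - distance)

def martins_ws(distance):   # inverse distance: 1 / d
    return 1 / distance

def constant_ws(distance):  # word-order independent
    return 1

# Distances 1..4 give Manges 1, 1/2, 1/4, 1/8; Martins 1, 1/2, 1/3, 1/4; Constant always 1.
for d in range(1, 5):
    print(d, manges_ws(d), martins_ws(d), constant_ws(d))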
An extraction example from JavaSDM is given below. JavaSDM measures the
similarity between two words by the cosine similarity of their corresponding
context vectors (the dot product of the normalized vectors). The
maximal cosine similarity is 1. Ferret (2010) showed that cosine similarity gave
the best results in a comparison with four other semantic similarity measures
called ehlert, jaccard, lin and dice.
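
Assuming the standard definition, the cosine similarity between the context vectors u and v of two words is

\[
\operatorname{sim}(\mathbf{u},\mathbf{v}) = \frac{\mathbf{u}\cdot\mathbf{v}}{\lVert\mathbf{u}\rVert\,\lVert\mathbf{v}\rVert} = \frac{\sum_i u_i v_i}{\sqrt{\sum_i u_i^2}\,\sqrt{\sum_i v_i^2}}
\]

which equals 1 when the two vectors point in exactly the same direction.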
The test word is “cute” and the other words are synonym proposals followed
by their cosine similarity to the test word.
cute
ugly        0.92525595
pretty      0.9251133
sexy        0.915225
weird       0.9052225
funny       0.89562523
thick       0.890025
stupid      0.88813686
beautiful   0.876878
hair        0.8744281

3.3.2 Extraction Criteria
The extraction criteria were:
• Candidate set size limit:
– a limit of at most 15 synonym proposals for models trained on graph
word data
– a limit of at most 10 synonym proposals for models trained on
lemmatized data
A higher number of proposals is expected to improve recall and lower
precision. In preliminary test runs 5, 10, 15 and 30 were used as the
maximal number of proposals per test word. The f-scores for lemma
models were highest when using 10, and 15 candidates gave the best f-score
for graph word models. Thus the number of synonym proposals was
limited to 10 for lemma based models and 15 for graph word based
models. The reason graph word based models benefitted from more
candidates is probably that they have more words in the gold standard.
30 proposals was very memory- and time-consuming.
• Different cosine similarity thresholds:
– 0 (no threshold)
– 0.5
– 0.7
– 0.85
The purpose of a similarity score threshold is to weed out bad synonym
proposals and thereby get better precision. With a too high threshold,
good synonym proposals might not be given and thereby recall would
be lower. A cosine similarity threshold was not an optional setting in the
downloaded JavaSDM code, and therefore new code had to be written.
Cosine similarity thresholds of 0, 0.5, 0.7 and 0.85 were used. These
have different effects on precision, recall and f-score. The thresholds were
chosen after running pre-tests with many different thresholds to see around
which cosine similarity scores the results differ more or less.
The synonym candidates with a cosine similarity score above the threshold
are shown, but never more than the maximum number of candidates. A maximum
number of candidates is preferable because there are words with really
high numbers of candidates with high similarity, and the more suggestions
the more time and memory consumption (when having 30 as a maximum
number of candidates, my computer randomly crashed even with a cosine
similarity threshold of 0.85, and it was very time consuming). Having 30
candidates as the maximum number did not clearly decrease f-score in preliminary
tests. If a user wants to use this tool to get a list of synonym candidates,
he/she should set the highest number of candidates he/she wants to read.
Another alternative was to, instead of limiting the candidate set size, use a
threshold like the one mentioned above and also have a relative threshold
compared to the synonym candidate with the highest similarity score. A few
attempts were made to examine this idea but as only negative effects
were found this method was not further explored in this study.
• Extracted words whose POS-tag does not match the test word’s tag are not
given as synonym proposals. This is a criterion intended to separate the verb
“book” from the noun “book” and also to exclude extractions of words
from other word classes than the test word since synonym words are
supposed to be from the same word class. For example “hair” would not
have been extracted as a synonym candidate to “cute” as in the extraction
example in Section 3.3.1. This criterion is only used when a model is
trained on POS-tagged data.
This criterion would have a negative effect if a POS-tagger has mistagged
an extraction that actually is of the correct word class.
• Extracted stop words are not synonym proposals. The stop word lists were
collected as described in Section 3.2.5, but those tokens were kept during
training. Instead, JavaSDM extractions of tokens that are in those lists
were filtered out and thereby not counted as synonym proposals. A sketch
combining these extraction criteria is given after this list.
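
A minimal sketch combining the extraction criteria above is given below; it assumes a list of (word, POS tag, cosine score) candidates already ranked by similarity, and the function name and parameters are illustrative:

def filter_candidates(candidates, test_word_pos, stop_words,
                      threshold=0.5, max_candidates=10, use_pos=True):
    proposals = []
    for word, pos, score in candidates:
        if score < threshold:
            break                        # candidates are ranked, so stop here
        if word in stop_words:
            continue                     # stop words are never proposed
        if use_pos and pos != test_word_pos:
            continue                     # word class must match the test word
        proposals.append(word)
        if len(proposals) == max_candidates:
            break                        # candidate set size limit
    return proposals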
3.4 Evaluation
The focus of this study is to evaluate preprocessing operations and extraction
criteria. The evaluation is done by computing precision, recall and f-score.
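
Assuming the standard definitions, and consistent with the "Correct Proposals / Proposals" and "Correct Proposals / Words in Gold Standard" columns reported in Chapter 4, the metrics are computed per test word as

\[
\mathrm{precision} = \frac{|\mathrm{correct\ proposals}|}{|\mathrm{proposals}|}, \qquad
\mathrm{recall} = \frac{|\mathrm{correct\ proposals}|}{|\mathrm{gold\ standard\ words}|}, \qquad
\mathrm{f\text{-}score} = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}
\]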
One test word set based on lemmas and one based on graph words were
created. For each set one dictionary-based and one adapted gold standard
were created. The dictionary-based gold standards were created from existing
synonym dictionaries and inflections of those synonyms and the test words
themselves. The adapted gold standard is the dictionary-based gold standard
extended by manual correction.
3.4.1 Test Words
The test words in this study were selected among adjectives, nouns and verbs.
The POS-tagging described in Section 3.2.3 was needed for this. The test words
had to be chosen from the training data for JavaSDM to have a possibility
to suggest any synonym candidates. Keyword extraction based on a method
by Ortuño et al. (2002) was used to extract the top 200 nouns, 100 verbs and
100 adjectives from the chat log collections. Those words also had to have a
frequency of at least 30 in the chat log collection. This was done with POS-tagged
data. No more than one inflected form of a word was included in the
test word set based on graph words in order to have the same number of test
words for lemmas and graph words. The test word set based on lemmas consisted
of the lemmas of the words in the test word set based on graph words. If no
synonyms qualifying for the gold standard were found for a word, the word
was not used as a test word.
3.4.2 Gold Standard
The web-pages www.synonymer.se and www.synonymer.org were used to look
up base forms of Swedish synonyms. WordNet and www.thesaurus.com were
used to find English synonyms. For example the adjective able got the synonyms
bright, capable, easy, good, strong, worthy included in the gold standard. The
English synonym sources included a lot more words than Swedish sources.
For a synonym found in a thesaurus to be included in the gold standard, its
graph word had to occur at least once with the correct POS-tag in the training
data. Otherwise it is impossible to extract that word and the non-extraction
of the word would say nothing about the quality of JavaSDM. For example,
if an English thesaurus had "book" as a synonym of "order" it would have been
included in the gold standard because of lines like line 2 and not because of
lines like line 1:
Line 1: I read_v a book_n
Line 2: I want_v to book_v a ticket_n
Synonym phrases and compound words (separated by space) were not
included in the gold standard because JavaSDM only extracts single tokens. For
example "be good for" was not included in the gold standard as a synonym
for "benefit". With a collocation extractor a phrase like that could have been
transformed into the token “be_good_for”.
If a model is trained on perfectly lemmatized data it can't be used to
extract other inflected forms of a word than the lemma, which models trained
on graph word data often can (during the tests there were occasions when the
lemmatizers had failed to lemmatize every form of a word the same way). It is
also possible that a form of a word identical with the lemma is never used in
graph word data. Therefore the lemma and graph word models needed different
gold standards. First the lemma gold standard was created as described above.
Then a triple column version of the training data was created which showed
the graph word version of the training data in the first column, the lemmatized
version in the second column and the POS-tags in the third column like this:
dogs dog  NN
are  be   VB
cute cute JJ
Every graph word that had the same lemma and POS-tag as a word in the
lemma gold standard was included in the graph word gold standard. Table 3.1
shows the effect of this gold standard creation method for the noun "babe":
Table 3.1: Gold standards for the test word "babe" (noun)
Lemma Gold Standard: baby, child, newborn
Graph Word Gold Standard: babies, baby, babys, children, childs, newborn
In this example there are twice as many words in the graph word gold
standard. Still one can see that not all inflections are included in the graph
word gold standard, for example the plural form of newborn. That is because
“newborns” never occurred in the training data. On the other hand the informal
or incorrect forms “babys” and “childs” were included because they were used by
users and they had been lemmatized as “baby” and “child” and were POS-tagged
as nouns.
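
A minimal sketch of this expansion step, assuming the triple-column format shown above (the variable names and the small example data are illustrative):

# (graph word, lemma, POS) triples corresponding to the three-column training data.
triples = [
    ("dogs", "dog", "NN"),
    ("are", "be", "VB"),
    ("cute", "cute", "JJ"),
]

# Lemma gold standard for one test word as (lemma, POS) pairs.
lemma_gold = {("baby", "NN"), ("child", "NN"), ("newborn", "NN")}

def expand_to_graph_words(lemma_gold, triples):
    # Every graph word whose lemma and POS match a lemma gold standard
    # entry is included in the graph word gold standard.
    graph_gold = set()
    for graph_word, lemma, pos in triples:
        if (lemma, pos) in lemma_gold:
            graph_gold.add(graph_word)
    return graph_gold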
3.4.3 Adapted Gold Standard
The dictionary-based gold standards include far from all synonyms. For example
”monteringsanvisning” (“assembly instruction”) could not be counted as a correct
synonym proposal for "instruktion" ("instruction") in tests with the dictionary-based
gold standard, since it was not included there. Manual correction was
done to avoid that, to get an idea of how useful an addition a synonym extractor
can be to existing thesauri, and to get more accurate results.
Manual control was done on synonym proposals, from all models, that
were not included in the gold standard when not using any threshold. Tests
using thresholds could not possibly find words that were not found without a
threshold. Manually found synonyms, antonyms and semantically near-related
words such as hypernyms, hyponyms, siblings were added to the extended
gold standard (more about synonyms and near-related words in Section 2.1).
For example “trasig” (“broken”) was added to the gold standard for “skadad”
("wounded") and "kontokort" ("credit card") was added to the gold standard for
“kort” (“card”). Antonyms/siblings such as “mother” and “father” were also added
to the extended gold standard. For the Swedish test word “bebis” (“baby”) the
non-Swedish word "infant" had been extracted, and since "infant" is an English
translation of "bebis" it was added to the extended gold standard. Generally
the synonym proposals manually found as correct were added to both lemma
and graph word gold standards. When graph word based models had found more
than one inflection of a word applicable for the gold standard, only one inflection of
that word was added to the lemma gold standard although all the inflections
were added to the graph word gold standard. If a word was found by a lemma
based model but not by any of the graph word based models there was no
chance that the graph word based models would find that word in tests using
thresholds since these models would not find more words than the ones they
found using no threshold. Still it was important to add that word to both lemma
and graph word gold standards to get a good recall and f-score comparison.
Examples of differences between dictionary-based and adapted gold standards
are shown in Tables 3.2–3.4.
Table 3.2: English Retail Lemma, Test word = crap (noun)
Dictionary-based Gold Standard: nonsense, bunk, twaddle
Adapted Gold Standard: nonsense, bunk, twaddle, rubbish, shit
Misspellings were also added to the extended gold standard. In order to save
time, Levenshtein's algorithm was used to see if an extraction was a misspelling
of the test word (Jurafsky and Martin (2009): p. 108). The editing distance
had to be 1 or 2 and the test word had to have a length of at least five letters,
otherwise too many words could be seen as misspellings. The longer a word is,
the greater the risk of longer editing distances, but it would have required further
research to decide at which word lengths to increase the allowed editing distance.
If a word with editing distance 1 or 2 already was included in the gold standard,
for example "monkeys" - "monkey", then that misspelling was not added to the
gold standard. More misspellings were manually found afterwards among the
incorrectly classified extractions.
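
A sketch of this misspelling criterion is given below; the edit distance function is a standard dynamic-programming implementation and not necessarily the one used in this study:

def levenshtein(a, b):
    # Standard edit distance with insertions, deletions and substitutions.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (ca != cb))) # substitution
        prev = curr
    return prev[-1]

def is_misspelling(candidate, test_word, gold_standard):
    # Edit distance 1-2, test word at least five letters long, and the
    # candidate not already counted as a gold standard word.
    if len(test_word) < 5 or candidate in gold_standard:
        return False
    return 1 <= levenshtein(candidate, test_word) <= 2

print(is_misspelling("telefonnumer", "telefonnummer", set()))  # True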
On average, 1–3 new tokens per test word were added to the adapted gold
standards for the four chat collections respectively. Approximately one fifth
were misspellings.
Table 3.5 shows the number of tokens included in each gold standard. The
table also shows that there are more keywords with synonyms and semantically
related words in the retail data than in the travel data.
Table 3.3: English Retail Graph Word, Test word = noticed (verb)
Dictionary-based Gold Standard: observe, advert, catch, mark, mind, minded, note, noted, recognize, recognized, refer, referred, referring, regard, regarding, see, seeing, spot, spotted
Adapted Gold Standard: observe, advert, catch, mark, mind, minded, note, noted, recognize, recognized, refer, referred, referring, regard, regarding, see, seeing, spot, spotted, discovered, found, realised, realized, notice (inflection of the test word)
Table 3.4: Swedish Retail Lemma, Test word = nummer (noun)
Dictionary-based Gold Standard: exemplar, format, mått, nr, personnummer, siffra, storlek, stycke, tal, telefonnummer
Adapted Gold Standard: exemplar, format, mått, nr, personnummer, siffra, storlek, stycke, tal, telefonnummer, telefonnr, telnr, telefonnumer (misspelling)
Table 3.5: The gold standards are defined with the following abbreviations: DB = dictionary-based gold standard, A = adapted gold standard, GW = graph words, L = lemma. The chat collection and number of test words are defined like this: R-Swe(244) = Swedish Retail and 244 test words.

        R-Swe(244)  T-Swe(152)  R-Eng(171)  T-Eng(175)
DB-GW   3065        1509        3322        2447
A-GW    3675        1852        3642        2696
DB-L    1271        574         1874        1359
A-L     1849        890         2139        1568
4 Results
The first section shows tested combinations of models and criteria that will
be discussed in this chapter. It also tells which metrics have been
computed in the study. Sections 4.2–4.4 summarize the main effects of
thresholds, POS-tagging and lemmatization in the experiments. The dictionary-based and the adapted gold standards are compared in Section 4.5. F-score is the
main metric but Sections 4.6 and 4.7 exemplify how precision and recall were
affected by preprocessing operations and extraction criteria. Section 4.8 shows
the f-score for all tests using the adapted gold standards. Tables in Section 4.9
give a comparison overview of the evaluated combinations of preprocessing
operations and extraction criteria.
4.1 Tests
16 tests were run on each of the four chat log corpora. Table 4.1 shows the
following parameters:
Table 4.1: Name = the name of the test, Text Unit = the text unit the model was trained on, Set = the candidate set size limit, Threshold = the cosine similarity threshold and POS = part-of-speech tagged or not:

Name              Text Unit   Set  Threshold  POS
15-Graph-0        Graph word  15   0          -
15-Graph-0p5      Graph word  15   0.5        -
15-Graph-0p7      Graph word  15   0.7        -
15-Graph-0p85     Graph word  15   0.85       -
15-GraphPOS-0     Graph word  15   0          +
15-GraphPOS-0p5   Graph word  15   0.5        +
15-GraphPOS-0p7   Graph word  15   0.7        +
15-GraphPOS-0p85  Graph word  15   0.85       +
10-Lemma-0        Lemma       10   0          -
10-Lemma-0p5      Lemma       10   0.5        -
10-Lemma-0p7      Lemma       10   0.7        -
10-Lemma-0p85     Lemma       10   0.85       -
10-LemPOS-0       Lemma       10   0          +
10-LemPOS-0p5     Lemma       10   0.5        +
10-LemPOS-0p7     Lemma       10   0.7        +
10-LemPOS-0p85    Lemma       10   0.85       +
The metrics that were computed for all the test words are discussed in this
chapter:
• Precision, recall and f-score separately
– Using the dictionary-based and the adapted gold standards separately
– Averages and standard deviations
4.2 Threshold
As expected, the higher the threshold used, the higher the precision reached. Strict
extraction criteria in general benefit precision but not recall. A threshold of
at least 0.5 is recommended since practically no correct synonym proposals
were extracted below that level. Thus precision and f-score are lower with
lower threshold. Models trained on both POS-tagged and lemmatized data
from English Retail and Swedish Travel got better precision and f-score when
using a threshold of 0.7 but in all the other cases 0.5 got the best f-score and
recall.
Figure 4.1 is an example of how precision, recall and f-score were affected by
thresholds in this study.
Figure 4.1: A diagram showing the effect of cosine similarity threshold on precision, recall
and f-score when using a model trained on the lemmatized version of Swedish Retail.
4.3 Part-of-speech tagging
POS-tagging improves precision but lowers the number of misspellings found
among the synonym proposals. Misspellings are harder to give a correct POS-tag,
and due to the word-class criterion a correct tag is necessary. The fact that
POS-tagged models got the best f-score for all collections when using the
dictionary-based gold standard indicates that POS-tagging is useful for handling
"well-known" words (see appendix). When using travel data, regardless of language,
POS-tagging improved f-score for both lemma and graph word based models. Models
trained on POS-tagged data got more words with high similarity scores.
4.4 Lemmatization
Models trained on lemmatized data gave better precision, recall and
f-score than graph word based models for all chat collections. Models trained on
lemmatized data got more words with high cosine similarity score than graph
word based models. Worth noting is that models trained on lemmatized data
can propose misspellings which in theory could be lemmatized misspellings. For
example “essen” could be a lemmatization of the token “essens” which probably
is a misspelling of “essence”.
4.5 The different gold standards
POS-tagging combined with lemmatization improved f-score when using the
dictionary-based gold standards. When using the adapted gold standards,
POS-tagging lowered f-score for all chat collections but Swedish Travel. The best
cosine similarity thresholds were either equal or lower when using the adapted
gold standards. The results for precision, recall and f-score were much higher
when using the adapted gold standards.
When excluding misspellings from the adapted gold standard there were not
many differences in the ranking of the models and thresholds, except
for English Travel. The differences were the cosine similarity thresholds, as
described above regarding adapted gold standards including misspellings, and
the fact that lemmatized English Retail data gave better f-score if it wasn't
POS-tagged when the adapted gold standard without misspellings was used.
These observations are well exemplified by Tables 4.2–4.4 showing the
ranking lists according to average f-score results using the dictionary-based gold
standard, the adapted gold standard and the adapted gold standard without
misspellings for Swedish Retail.
Table 4.2: Swedish Retail using the dictionary-based gold standard:

Rank  Model             F-score
1.    10-LemPOS-0p7     0.059+-0.085
2.    10-LemPOS-0p5     0.059+-0.084
3.    10-LemPOS-0       0.059+-0.084
4.    10-Lemma-0p7      0.055+-0.085
5.    10-LemPOS-0p85    0.055+-0.103
6.    10-Lemma-0p5      0.053+-0.075
7.    10-Lemma-0        0.053+-0.075
8.    15-Graph-0p5      0.05+-0.059
9.    15-Graph-0        0.049+-0.059
10.   10-Lemma-0p85     0.048+-0.104
11.   15-GraphPOS-0p5   0.047+-0.059
12.   15-GraphPOS-0     0.047+-0.059
13.   15-Graph-0p7      0.043+-0.063
14.   15-GraphPOS-0p7   0.043+-0.057
15.   15-GraphPOS-0p85  0.03+-0.062
16.   15-Graph-0p85     0.027+-0.069

Table 4.3: Swedish Retail using the adapted gold standard:

Rank  Model             F-score
1.    10-Lemma-0p5      0.247+-0.17
2.    10-Lemma-0        0.246+-0.171
3.    10-Lemma-0p7      0.236+-0.177
4.    10-LemPOS-0p5     0.231+-0.179
5.    10-LemPOS-0       0.231+-0.179
6.    10-LemPOS-0p7     0.231+-0.179
7.    15-Graph-0p5      0.176+-0.129
8.    15-Graph-0        0.175+-0.129
9.    15-Graph-0p7      0.156+-0.130
10.   10-LemPOS-0p85    0.155+-0.177
11.   15-GraphPOS-0p5   0.149+-0.126
12.   15-GraphPOS-0     0.149+-0.126
13.   15-GraphPOS-0p7   0.142+-0.127
14.   10-Lemma-0p85     0.118+-0.154
15.   15-GraphPOS-0p85  0.086+-0.123
16.   15-Graph-0p85     0.07+-0.113
Table 4.4: Swedish Retail using the adapted gold standard without misspellings:

Rank  Model             F-score
1.    10-LemPOS-0p7     0.20861813+-0.17
2.    10-LemPOS-0p5     0.20810248+-0.17
3.    10-LemPOS-0       0.20809248+-0.17
4.    10-Lemma-0p5      0.20524344+-0.157
5.    10-Lemma-0        0.2045285+-0.157
6.    10-Lemma-0p7      0.20224719+-0.168
7.    10-LemPOS-0p85    0.14682704+-0.182
8.    15-Graph-0p5      0.14472191+-0.114
9.    15-Graph-0        0.14366198+-0.113
10.   15-GraphPOS-0p5   0.13288288+-0.119
11.   15-GraphPOS-0     0.13287288+-0.119
12.   15-Graph-0p7      0.12905452+-0.12
13.   15-GraphPOS-0p7   0.12848777+-0.122
14.   10-Lemma-0p85     0.10565685+-0.163
15.   15-GraphPOS-0p85  0.08052743+-0.123
16.   15-Graph-0p85     0.06271605+-0.112

4.6 Precision
For all chat log collections, the tests with a cosine similarity threshold of 0.85
were the four models with the highest precision, as shown in Table 4.5 and Table 4.6.
POS-tagging and lemmatization had a positive effect on precision. When models
are trained on lemmatized data, more words get higher similarity scores.
Although models trained on graph word data have a limit of 15 synonym
proposals instead of 10, the models trained on lemmas extracted almost as
many or more words when the similarity threshold was set to 0.85. Table 4.5
exemplifies how prominently POS-tagging improved precision for all chat log
collections but Swedish Retail.
Table 4.5: Precision ranking for English Retail using the adapted gold standard:

Rank  Model             Precision      Correct Proposals / Proposals
1     10-LemPOS-0p85    0.418+-0.327   89/213
2     15-GraphPOS-0p85  0.357+-0.367   91/255
3     10-Lemma-0p85     0.275+-0.311   131/476
4     15-Graph-0p85     0.249+-0.313   125/502
5     10-LemPOS-0p7     0.247+-0.256   166/671
6     10-LemPOS-0p5     0.217+-0.246   180/831
7     10-LemPOS-0       0.215+-0.243   180/838
8     15-GraphPOS-0p7   0.213+-0.238   191/895
9     15-GraphPOS-0p5   0.196+-0.221   231/1180
10    15-GraphPOS-0     0.192+-0.216   231/1201
11    10-Lemma-0p7      0.185+-0.174   245/1324
12    15-Graph-0p7      0.162+-0.154   283/1750
13    10-Lemma-0p5      0.159+-0.154   262/1644
14    10-Lemma-0        0.159+-0.15    262/1650
15    15-Graph-0p5      0.139+-0.117   338/2426
16    15-Graph-0        0.136+-0.116   338/2494
When the threshold is 0.85, models trained on POS-tagged Swedish data made
more synonym proposals than untagged models, but they made far fewer proposals
when the threshold was low, see Table 4.6.
Table 4.6: Precision ranking for Swedish Retail using the adapted gold standard:

Rank  Model             Precision      Correct Proposals / Proposals
1     10-Lemma-0p85     0.333+-0.384   133/399
2     15-Graph-0p85     0.294+-0.380   147/500
3     10-LemPOS-0p85    0.291+-0.351   195/671
4     15-GraphPOS-0p85  0.268+-0.391   187/697
5     10-LemPOS-0p7     0.247+-0.206   402/1625
6     10-Lemma-0p7      0.244+-0.218   423/1731
7     10-LemPOS-0p5     0.24+-0.195    413/1720
8     10-LemPOS-0       0.24+-0.195    413/1720
9     10-Lemma-0p5      0.224+-0.171   508/2265
10    10-Lemma-0        0.222+-0.171   508/2279
11    15-Graph-0p7      0.206+-0.201   462/2246
12    15-GraphPOS-0p7   0.187+-0.185   422/2256
13    15-GraphPOS-0p5   0.182+-0.165   461/2531
14    15-GraphPOS-0     0.182+-0.165   461/2531
15    15-Graph-0p5      0.18+-0.133    631/3498
16    15-Graph-0        0.178+-0.133   632/3550
A summary of the precision effects of preprocessings and thresholds for the
four chat log collections:
1. A high threshold was best: 0.85 gave the highest precision no matter what model or domain.
2. POS-tagging improved precision for both lemma and graph word based models (Swedish Retail was the only exception).
3. The lemma based models got higher precision than the graph word based models.
4.7 Recall

For Swedish and English Retail and Swedish Travel, no more correct candidates were found among the synonym proposals below a cosine similarity of 0.5. Only English Travel yielded a few correct extractions below 0.5, as shown in Table 4.7. Graph word based models generally extract more correct candidates because all inflections of the test words and of other synonyms count as correct. If, for example, proposals with the same lemma as the test word are subtracted, the numbers of correct synonym proposals are fairly close for graph word and lemma models.
A summary of the recall effects of preprocessings and thresholds for the four chat log collections (the extraction criteria behind the model names are sketched after this list):

1. Low thresholds benefit recall, but practically no more correct synonyms were found with thresholds below 0.5 given the selected candidate set size limits.
2. Lemma based models got higher recall than graph word based models.
3. POS-tagged models got lower recall.
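As a reminder of the extraction criteria encoded in the model names (e.g. 10-Lemma-0p5 = at most 10 candidates, lemmatized training data, cosine similarity threshold 0.5), the following sketch shows how a ranked neighbour list could be cut by candidate set size and threshold. The neighbour list format is an assumption made for illustration; it is not JavaSDM's API.

    def select_candidates(neighbours, max_candidates, threshold):
        """Keep at most `max_candidates` neighbours whose cosine similarity
        reaches `threshold`; `neighbours` is assumed sorted by similarity."""
        kept = [(word, sim) for word, sim in neighbours if sim >= threshold]
        return kept[:max_candidates]

    # Hypothetical ranked output for one test word.
    neighbours = [("ersätta", 0.81), ("skifta", 0.62), ("ringa", 0.48), ("hitta", 0.31)]
    print(select_candidates(neighbours, max_candidates=10, threshold=0.5))
    # -> [('ersätta', 0.81), ('skifta', 0.62)]; with threshold 0.0 all four are kept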
Table 4.7: Recall ranking for English Travel using the adapted gold standard:

Rank   Model             Recall          Correct Proposals / Words in Gold Standard
1      10-Lemma-0        0.136±0.157     213/1568
2      10-Lemma-0p5      0.135±0.155     212/1568
3      10-LemPOS-0       0.106±0.147     166/1568
4      10-LemPOS-0p5     0.105±0.144     164/1568
5      15-Graph-0        0.104±0.135     280/2696
6      15-Graph-0p5      0.102±0.133     274/2696
7      10-Lemma-0p7      0.098±0.125     153/1568
8      15-GraphPOS-0     0.083±0.116     223/2696
9      15-GraphPOS-0p5   0.082±0.112     220/2696
10     10-LemPOS-0p7     0.075±0.123     118/1568
4.8 F-score

The f-score rankings in Tables 4.8–4.11 were computed with the adapted gold standards including misspellings.

When using the dictionary-based gold standard, POS-tagging gave the best f-scores for all collections. When excluding misspellings from the adapted gold standard, models that were both lemmatized and POS-tagged gave the highest f-score for all collections except English Retail. When including misspellings in the adapted gold standard, models trained on data without POS-tags performed better than models trained on POS-tagged data.
Table 4.8: F-score ranking Swedish Retail using the adapted gold standard:

Rank   Model              F-score
1      10-Lemma-0p5       0.247±0.17
2      10-Lemma-0         0.246±0.171
3      10-Lemma-0p7       0.236±0.177
4      10-LemPOS-0p7      0.231±0.179
5      10-LemPOS-0p5      0.231±0.179
6      10-LemPOS-0        0.231±0.179
7      15-Graph-0p5       0.176±0.129
8      15-Graph-0         0.175±0.129
9      15-Graph-0p7       0.156±0.13
10     10-LemPOS-0p85     0.155±0.177
11     15-GraphPOS-0p5    0.149±0.126
12     15-GraphPOS-0      0.149±0.126
13     15-GraphPOS-0p7    0.142±0.127
14     10-Lemma-0p85      0.118±0.154
15     15-GraphPOS-0p85   0.086±0.123
16     15-Graph-0p85      0.07±0.113
Table 4.9: F-score ranking English Retail using the adapted gold standard:

Rank   Model              F-score
1      10-Lemma-0p7       0.141±0.139
2      10-Lemma-0p5       0.139±0.134
3      10-Lemma-0         0.138±0.134
4      10-LemPOS-0p5      0.121±0.143
5      10-LemPOS-0        0.121±0.143
6      10-LemPOS-0p7      0.118±0.143
7      15-Graph-0p5       0.111±0.107
8      15-Graph-0         0.11±0.106
9      15-Graph-0p7       0.105±0.111
10     10-Lemma-0p85      0.1±0.14
11     15-GraphPOS-0p5    0.096±0.114
12     15-GraphPOS-0      0.095±0.113
13     15-GraphPOS-0p7    0.084±0.114
14     10-LemPOS-0p85     0.076±0.165
15     15-Graph-0p85      0.060±0.127
16     15-GraphPOS-0p85   0.047±0.129
Table 4.10: F-score ranking Swedish Travel using the adapted gold standard:

Rank   Model              F-score
1      10-LemPOS-0p7      0.235±0.199
2      10-LemPOS-0p5      0.23±0.195
3      10-LemPOS-0        0.229±0.195
4      10-Lemma-0p5       0.218±0.177
5      10-Lemma-0p7       0.215±0.184
6      10-Lemma-0         0.215±0.174
7      10-LemPOS-0p85     0.19±0.219
8      15-GraphPOS-0p5    0.179±0.148
9      15-GraphPOS-0      0.179±0.148
10     15-GraphPOS-0p7    0.17±0.149
11     15-Graph-0p5       0.163±0.146
12     15-Graph-0         0.160±0.139
13     15-Graph-0p7       0.142±0.149
14     10-Lemma-0p85      0.138±0.2
15     15-GraphPOS-0p85   0.091±0.159
16     15-Graph-0p85      0.071±0.139
Table 4.11: F-score ranking English Travel using the adapted gold standard:

Rank   Model              F-score
1      10-Lemma-0p5       0.136±0.126
2      10-LemPOS-0p5      0.135±0.146
3      10-LemPOS-0        0.135±0.146
4      10-Lemma-0         0.135±0.124
5      10-Lemma-0p7       0.119±0.121
6      15-GraphPOS-0p5    0.112±0.115
7      15-GraphPOS-0      0.112±0.114
8      15-Graph-0p5       0.11±0.102
9      10-LemPOS-0p7      0.109±0.141
10     15-Graph-0         0.109±0.097
11     15-GraphPOS-0p7    0.089±0.115
12     15-Graph-0p7       0.087±0.104
13     10-Lemma-0p85      0.065±0.125
14     10-LemPOS-0p85     0.038±0.124
15     15-Graph-0p85      0.03±0.093
16     15-GraphPOS-0p85   0.028±0.108
4.9 Comparison Overview

Tables 4.12–4.16 compare each tested combination of preprocessing operations and extraction criteria with every other combination. The comparison is done by counting in what percentage of the tests the combination associated with the row gave better results than the combination associated with the column. Both dictionary-based and adapted gold standard tests are counted, and misspellings are included in all gold standard tests presented in this section. The five tables represent the Swedish corpora, the English corpora, the corpora from the retail company, the corpora from the travel company, and all corpora.
• LU0 = Lemmatized and untagged, threshold = 0.0 (no threshold)
• LU5 = Lemmatized and untagged, threshold = 0.5
• LU7 = Lemmatized and untagged, threshold = 0.7
• LU85 = Lemmatized and untagged, threshold = 0.85
• LP0 = Lemmatized and POS-tagged, threshold = 0.0 (no threshold)
• LP5 = Lemmatized and POS-tagged, threshold = 0.5
• LP7 = Lemmatized and POS-tagged, threshold = 0.7
• LP85 = Lemmatized and POS-tagged, threshold = 0.85
• GP0 = Graph words and POS-tagged, threshold = 0.0 (no threshold)
• GP5 = Graph words and POS-tagged, threshold = 0.5
• GP7 = Graph words and POS-tagged, threshold = 0.7
• GP85 = Graph words and POS-tagged, threshold = 0.85
• GU0 = Graph words and untagged, threshold = 0.0 (no threshold)
• GU5 = Graph words and untagged, threshold = 0.5
• GU7 = Graph words and untagged, threshold = 0.7
• GU85 = Graph words and untagged, threshold = 0.85
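A sketch of the pairwise comparison behind Tables 4.12–4.16 is given below: for every pair of combinations, count in what percentage of the tests the row combination got a better result than the column combination. The input dictionary of f-scores is made up for illustration only.

    def win_matrix(fscores):
        """fscores maps combination name -> list of results, one per test.
        Returns {row: {col: percentage of tests where row beat col}}."""
        names = list(fscores)
        matrix = {}
        for row in names:
            matrix[row] = {}
            for col in names:
                if row == col:
                    continue
                wins = sum(1 for a, b in zip(fscores[row], fscores[col]) if a > b)
                matrix[row][col] = 100.0 * wins / len(fscores[row])
        return matrix

    # Hypothetical f-scores for three combinations over four tests
    # (two gold standards x two chat collections).
    fscores = {
        "LU5": [0.25, 0.14, 0.05, 0.04],
        "LP5": [0.23, 0.14, 0.06, 0.05],
        "GU5": [0.18, 0.11, 0.04, 0.03],
    }
    print(win_matrix(fscores)["LU5"])
    # -> {'LP5': 25.0, 'GU5': 100.0}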
These tables show that models trained on both lemmatized and POS-tagged data gave the best f-scores. A threshold of 0.5 was better than no threshold regardless of the preprocessing operations. For models trained on lemmatized Swedish data, a threshold of 0.7 gave a better f-score than 0.5; for models trained on graph words, 0.5 was the threshold giving the best f-score. POS-tagging had a clearly positive impact on the travel data, whereas on graph word data from the retail company POS-tagging lowered the f-score. 0.85 was the threshold giving the lowest f-scores.
Table 4.12: All tests (including both dictionary-based and adapted gold standard tests on Retail and Travel data in Swedish and English):

[16 × 16 matrix of pairwise win percentages over the combinations LU0–GU85 defined above; each cell gives the percentage of tests in which the row combination gave better results than the column combination.]
Table 4.13: Swedish (including both dictionary-based and adapted gold standard tests on Swedish Retail and Travel):

[16 × 16 matrix of pairwise win percentages over the combinations LU0–GU85; each cell gives the percentage of tests in which the row combination gave better results than the column combination.]
Table 4.14: English (including both dictionary-based and adapted gold standard tests on English Retail and Travel):

[16 × 16 matrix of pairwise win percentages over the combinations LU0–GU85; each cell gives the percentage of tests in which the row combination gave better results than the column combination.]
Table 4.15: Retail (including both dictionary-based and adapted gold standard tests on Swedish and English Retail):

[16 × 16 matrix of pairwise win percentages over the combinations LU0–GU85; each cell gives the percentage of tests in which the row combination gave better results than the column combination.]
Table 4.16: Travel (including both dictionary-based and adapted gold standard tests on Swedish and English Travel):

[16 × 16 matrix of pairwise win percentages over the combinations LU0–GU85; each cell gives the percentage of tests in which the row combination gave better results than the column combination.]
5 Conclusions

The purpose of this study was to evaluate the effects of different preprocessing operations of chat log collections and of different extraction criteria when extracting synonyms with JavaSDM. In the experiments and in this chapter, lemmatization and POS-tagging are the main preprocessing operations and cosine similarity the main extraction criterion discussed.

Conclusions were only drawn when a similar effect was seen on at least two of the chat collections with either the language or the domain in common. What strengthens the conclusions drawn in this chapter is that the rankings of models and thresholds looked very similar no matter which gold standard or chat collection was used. The only obvious difference was that POS-tagging had a more positive effect on travel data.

The standard deviations made all results overlap, since precision and recall often varied from 0.0 to 1.0 for single test words. Consequently these deviations are not very informative. Figure 5.1 is an example of why the standard deviations had such wide ranges.

Figure 5.1: The percentage of test words whose synonym proposals got a specific f-score. The results in the diagram are from a model trained on the lemmatized chat collection of Swedish Retail, tested with 250 test words; 25 per cent of the test words got an f-score of 0.

Section 5.1 is a summary of the conclusions drawn from the results in this study. Section 5.2 discusses interesting observations from the study that were outside the scope of the purpose of the thesis, alternative ways things could have been done, things that would have been tested or studied if there had been more time, and suggestions for future studies.
5.1 Overview of the results

The purpose of this study was to compare preprocessing operations and extraction criteria and thereby help users of JavaSDM with synonym extraction. The following is a list of the main conclusions that can be drawn from this study.
• POS-tagging clearly improves precision for all data collections except Swedish Retail.
• Lemmatized data gives better precision, recall and f-score for all collections (this is discussed in Section 5.2).
• As expected, the best precision is achieved with higher thresholds; 0.85 was the highest threshold in this study.
• The best recall is achieved with a threshold of 0.5 or lower. 0.5 is preferable, since lower thresholds decrease precision and f-score.
• The best f-score for retail data is achieved without POS-tagging when misspellings are included in the gold standard.
• The best f-score for travel data is achieved with lemmatized and POS-tagged data.
• The best threshold regarding f-score was generally 0.5 for unlemmatized data. With lemmatized data the synonym candidates tend to get higher similarity scores, so 0.7 is roughly even with 0.5; for models trained on both lemmatized and POS-tagged Swedish corpora, 0.7 gave a better f-score every time. With larger corpora, words will probably get even higher similarity scores, since words in lemmatized data have higher frequencies and the lemmatized models already got higher cosine similarity scores than the graph word models.
• If misspellings are ignored, POS-tagging improves f-score for all collections except English Retail.
• The fact that the adapted gold standards give a three to four times better f-score than the dictionary-based one indicates how useful synonym extraction can be when handling a domain specific vocabulary or complementing existing thesauri.
5.2 Discussion

As van der Plas and Tiedemann (2006) stated, higher lemmatization and POS-tagging quality would improve the results. For example, when a Swedish gold standard included "smart" the program extracted "smar", since that was how "smart" had been lemmatized. In such cases the random indexing is successful and the problem is the lemmatization quality, or in this example rather the lemmatizer's choice of lemma form for the inflections of "smart". Unedited chat logs often lack punctuation and capitalization of proper nouns and often contain misspellings and words from foreign languages. In specific domains there may be named entities that are rarely mentioned outside the domain. These are factors that can make things more difficult for lemmatizers and part-of-speech taggers.
Excluding from the gold standard those dictionary synonyms that do not appear in the training data makes the recall scores more reflective of JavaSDM's ability to extract synonyms. Still, it is not certain that the synonyms that did occur in the training data appeared in a sense that is synonymous with the test word. POS-tagging solves part of this problem, but good word-sense or morphological tagging would do it more accurately. Morphology is in theory useful for separating noun homonyms in Swedish: for example, gender separates the Swedish graph word "lag", which means "law" if it is utrum (common gender) and "team" if it is neutrum (neuter), and definiteness can decide whether the Swedish graph word "banan" means "banana" or "the track". Morphological information did not improve results in this study, but the training data was rather small and it could be worth trying morphological tagging on a larger corpus.
Lemmatized Gold Standard: ersätta, skifta, ändra

Graph Word Gold Standard: ersätta, skifta, ändra, ersatt, ersatte, ersatts, ersätter, skiftar, ändrade, ändrat
When using the graph word gold standard, all inflections of a synonym that have occurred in the training data are seen as correct synonyms, e.g. "ersatt", "ersatte", "ersätta" and "ersatts" count as four correct synonyms for "byta" (see Section 3.4.2). Is it better if, for example, a graph word based model's proposals include those four correct candidates than if a lemmatized model's proposals only include two correct candidates, "ersätta" and "skifta"? Precision-wise the answer is yes according to the evaluation method, if one compares 4/15 to 2/10. Recall-wise the answer is no in this example, since the graph word gold standard includes more inflections of the other words as well and thereby gives a recall of 4/10, while the lemma model would get 2/3. The number of found words was counted and compared to get an indication of the effects of this risk. If a lemma version makes more correct extractions this issue is no problem, but if a graph word version makes more correct extractions yet has lower recall, the issue could be present.
For a clearer recall comparison between lemma and graph word based models, one could have counted extractions of words with the same lemma form as one suggestion, lemmatized each extraction and used the lemmatized gold standard for evaluation. That would, for example, have meant a recall of 1/3 instead of 4/10 for a graph word based model that extracted "ersatt", "ersatte", "ersätta" and "ersatts" for "byta", using the gold standard above. The extractions and the transformed extractions for "byta" are shown below:
Extractions: ersatt, ersätta, ersatte, ersatts, ringa, hitta, välja, använda, montera, ställa, skriva, se, hämtar, paxa, läsa

Transformed Extractions: ersätta, ringa, hitta, välja, använda, montera, ställa, skriva, se, hämtar, paxa, läsa
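A minimal sketch of this transformed evaluation, with a toy lemma lookup standing in for a real lemmatizer (the lookup table is an illustrative assumption, not part of the study's pipeline):

    # Toy lemma lookup standing in for a real lemmatizer (illustration only).
    LEMMA = {"ersatt": "ersätta", "ersatte": "ersätta", "ersatts": "ersätta"}

    def lemmatized_recall(extractions, lemma_gold):
        """Map each extraction to its lemma, keep each lemma only once,
        and compute recall against the lemmatized gold standard."""
        lemmas = []
        for word in extractions:
            lemma = LEMMA.get(word, word)
            if lemma not in lemmas:
                lemmas.append(lemma)
        correct = [lemma for lemma in lemmas if lemma in lemma_gold]
        return lemmas, len(correct) / len(lemma_gold)

    extractions = ["ersatt", "ersätta", "ersatte", "ersatts", "ringa", "hitta"]
    print(lemmatized_recall(extractions, {"ersätta", "skifta", "ändra"}))
    # -> (['ersätta', 'ringa', 'hitta'], 0.333...): the recall of 1/3 from the example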
How to count precision, and thereby f-score, when using such a method is not as self-evident. One could choose between the following:

• The precision would be 1/15 if a lemma form is only counted as correct once.
• The precision would be 1/11 if all extractions with an identical lemma form are counted as only one extraction.
• The precision would be 4/15 if each inflection is counted individually.

If "ersätta" had been an incorrect extraction, would the four inflections be counted as one or as four incorrect extractions? If the user of JavaSDM does not need words in more than one form and wants fewer extractions to read, it would be better to lemmatize the extractions and only show a recurring lemma once.
In this study, models were trained either on graph words or on lemmas, but one could try to combine graph words and lemmas when training models. For example, each focus word could be kept as a graph word while the surrounding words are lemmatized, or the other way around: "the dogs waved their tails" could become "the dog waved their tail" or "the dogs wave their tails". A sketch of the first variant is given below.
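The following sketch shows what such mixed training contexts could look like: the focus word stays a graph word while its context window is lemmatized. The parallel token/lemma lists and the window size are illustrative assumptions; this is not how JavaSDM builds its contexts.

    def mixed_contexts(tokens, lemmas, window=2):
        """Yield (graph word focus, lemmatized context) pairs for training.
        `tokens` and `lemmas` are parallel lists of equal length."""
        for i, focus in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            context = [lemmas[j] for j in range(lo, hi) if j != i]
            yield focus, context

    tokens = ["the", "dogs", "waved", "their", "tails"]
    lemmas = ["the", "dog", "wave", "their", "tail"]
    for focus, context in mixed_contexts(tokens, lemmas):
        print(focus, context)
    # e.g. "waved" is kept as a graph word, its context is ['the', 'dog', 'their', 'tail']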
Preliminary tests indicated that it is better to keep stop words during training and instead use the stop word list to prevent those words from being proposed as synonyms. An explanation could be that stop words often carry syntactic information and are, for example, used by POS-taggers. The syntactic information from stop words is probably not as relevant if the data is POS-tagged.
Antonyms, which prior studies have mentioned as a problem, might be used ironically. With smileys, irony is sometimes revealed more clearly in chat logs. Smileys can carry a lot of information and are therefore something that should be explored further for use in dialogue systems. Sometimes smileys are used instead of words to communicate positive or negative sentiment. Like all other tokens without alphabetic characters, smileys could not be extracted in this study.
When training models with JavaSDM, one can set different parameters such as the type of weighting scheme and the size of the context windows. Default parameters were used in this study, since no report was found stating when to use which combination of parameters. Different settings might be better for graph word data but not for lemmatized data, and vice versa. The corpus sizes might also have affected the conclusions drawn in this study.
There are more things left to try regarding model training on POS-tagged data. There is a weighting scheme that, during training, can give extra weight to content words, for example words tagged as nouns, adjectives and verbs. It is not certain that it would improve results, since the pre-tests indicated that it can be a good choice to keep stop words during training, but it should definitely be tested.
Misspelled tokens are intended to have the same meaning as the test words and should therefore have distributions suitable for the test word. Graph word data is better for this, since lemmatizers and part-of-speech taggers are not trained to handle unknown spellings. Even if misspellings should be seen as correct synonym suggestions, and actually might be very relevant, there are other ways to find misspellings. Still, distributional similarity can be used to increase the probability that a detected or suspected misspelling really is a misspelling. JavaSDM found several misspellings with a frequency of 1, which in a way is not as relevant as frequent misspellings. Frequency counting was done and showed that some misspellings were more frequent than the "correct" spelling; those misspellings are very interesting. It is likely that very frequent misspellings are not misspellings in the users' heads. A sketch of such a frequency comparison is given below.
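A sketch of the frequency comparison mentioned above: given corpus token counts, check whether a suspected misspelling is actually more frequent than its expected spelling. The token list and word pairs are made up for illustration.

    from collections import Counter

    def compare_spellings(tokens, pairs):
        """For each (suspected misspelling, expected spelling) pair, return both
        corpus frequencies and whether the misspelling is the more frequent variant."""
        counts = Counter(tokens)
        return [(wrong, counts[wrong], right, counts[right], counts[wrong] > counts[right])
                for wrong, right in pairs]

    # Made-up token list; in practice this would be the tokenized chat log collection.
    tokens = ["recieve", "recieve", "recieve", "receive", "luggage", "lugage"]
    print(compare_spellings(tokens, [("recieve", "receive"), ("lugage", "luggage")]))
    # -> [('recieve', 3, 'receive', 1, True), ('lugage', 1, 'luggage', 1, False)]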
Compound words are not handled automatically by JavaSDM. A space marks the start and end of a word, which means that "answering machine", for example, is read as "answering" and "machine", and "washing machine" as "washing" and "machine". Such an example increases the risk of having "washing" suggested as a synonym for "answering". That problem could be avoided if such phrases were somehow recognized as one word when training JavaSDM models. This is a problem because compound words do not get as much data as they should, and spurious distributional relations may be calculated: e.g. if "asbra" ("cadaver" + "good", meaning "extremely good") is written "as bra", then "asbra" gets one occurrence less and "as" ("cadaver") also becomes associated with a positive word such as "good", and vice versa. This is extra difficult for verb compounds, which can have other words in between, for example "Switch off the light!" and "Switch the light off!". I wrote a simple program that extracted nouns followed by nouns and then checked whether these coordinate nouns also occurred compounded as one word, with or without a hyphen. It then compared how many times each compounding method was used. Quite often the coordinate nouns had occurred in all three versions, and even more often the compounding method used was one that would be considered a misspelling in formal Swedish. This should be examined in further studies, because it is an obstacle for synonym extraction and a well-known problem in linguistics; incorrectly separated compounds are the most common problem in Swedish (Melin, 2006). A rough sketch of such a program is shown below.
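The simple program described above could look roughly like this sketch: for a given noun-noun pair, count how often it occurs as two separate words, hyphenated, or written together. Everything here, including the example tokens, is a schematic reconstruction rather than the original program, and a POS-tagger would normally supply the noun pairs.

    from collections import Counter

    def compounding_variants(tokens, noun_pairs):
        """For each noun pair, count how often it occurs as two separate words,
        as a hyphenated compound, and written together as one word."""
        unigrams = Counter(tokens)
        bigrams = Counter(zip(tokens, tokens[1:]))
        return {pair: {"separate": bigrams[pair],
                       "hyphen": unigrams[pair[0] + "-" + pair[1]],
                       "together": unigrams[pair[0] + pair[1]]}
                for pair in noun_pairs}

    # Made-up tokens; a POS-tagger would normally identify the noun-noun bigrams.
    tokens = ["tvätt", "maskin", "tvättmaskin", "tvätt-maskin", "tvättmaskin"]
    print(compounding_variants(tokens, [("tvätt", "maskin")]))
    # -> {('tvätt', 'maskin'): {'separate': 1, 'hyphen': 1, 'together': 2}}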
5.2.1 Future Study Suggestions
The following is a summary of the most relevant future study suggestions mentioned in the discussion:

• Lemmatize graph word extractions for a better comparison.
• Join compound words into one token. Try to handle incorrect spellings of Swedish compound words.
• Examine the effects of JavaSDM parameters, for example different context window sizes.
• Try weighting words in the context windows differently depending on their POS-tag.
Bibliography
V.D. Blondel and P. Senellart. Automatic extraction of synonyms in a dictionary.
Technical report, University of Louvain, 2002.
D. Alan Cruse. Notes on meaning in language. Master’s thesis, Mitthögskolan,
2002.
James R. Curran and Marc Moens. Improvements in automatic thesaurus
extraction. In Proceedings of the ACL-02 workshop on Unsupervised lexical
acquisition - Volume 9, ULA ’02, pages 59–66, Stroudsburg, PA, USA, 2002.
Association for Computational Linguistics. doi: 10.3115/1118627.1118635.
URL http://dx.doi.org/10.3115/1118627.1118635.
Philip Edmonds and Graeme Hirst. Near-synonymy and lexical choice. Computational Linguistics, 28(2), 2002.
Olivier Ferret. Testing semantic similarity measures for extracting synonyms from a corpus. In Proceedings of LREC, 2010.
James Gorman and James R. Curran. Random indexing using statistical weight
functions. In Proceedings of the 2006 Conference on Empirical Methods in
Natural Language Processing, EMNLP ’06, pages 457–464, Stroudsburg, PA,
USA, 2006. Association for Computational Linguistics. ISBN 1-932432-73-6.
URL http://dl.acm.org/citation.cfm?id=1610075.1610139.
Zelig S. Harris. Distributional structure. Word, 10:146–162, 1954.
Martin Hassel. Resource Lean and Portable Automatic Text Summarization. PhD
thesis, KTH, 2007.
Louise Holmer. Passiv och perfekt particip i SAOL Plus: en dokumentation av den lexikografiska arbetsprocessen. Master's thesis, University of Gothenburg, 2009.
Daniel Jurafsky and James H. Martin. Speech and Language Processing. An
introduction to Natural Language Processing, Computational Linguistics, and
Speech Recognition. Pearson Education International, 2009.
Lars Melin. Experimentell språkvård. Språkvård 4, 2006.
Philippe Muller, Nabil Hathout, and Bruno Gaume. Synonym extraction using
a semantic distance on a dictionary. In Proceedings of the First Workshop on
Graph Based Methods for Natural Language Processing, TextGraphs-1, pages
65–72, Stroudsburg, PA, USA, 2006. Association for Computational Linguistics. URL http://dl.acm.org/citation.cfm?id=1654758.1654773.
M. Ortuño, P. Carpena, P. Bernaola-Galván, E. Muñoz, and A.M. Somoza. Keyword detection in natural languages and DNA. EPL (Europhysics Letters), 57(5):759–764, 2002.
Lonneke van der Plas and Gosse Bouma. Syntactic contexts for finding semantically related words. In CLIN, pages 173–186, 2004.
M. Rosell, M. Hassel, and V. Kann. Global evaluation of random indexing through Swedish word clustering compared to the People's Dictionary of Synonyms. 2009.
Magnus Sahlgren. An introduction to random indexing. SICS, Swedish Institute
of Computer Science, 2005.
Carlo Strapparava and Rada Mihalcea. SemEval-2007 task 14: Affective text. In Proceedings of the 4th International Workshop on Semantic Evaluations (SemEval-2007), pages 70–74, 2007.
Lonneke van der Plas and Jörg Tiedemann. Finding synonyms using automatic
word alignment and measures of distributional similarity. In Proceedings of the
COLING/ACL on Main conference poster sessions, COLING-ACL ’06, pages
866–873, Stroudsburg, PA, USA, 2006. Association for Computational Linguistics. URL http://dl.acm.org/citation.cfm?id=1273073.1273184.
Tong Wang. Extracting synonyms from dictionary definitions. Master's thesis, University of Toronto, 2009.
Tong Wang and Graeme Hirst. Exploring patterns in dictionary definitions for
synonym extraction. Natural Language Engineering, 18:313–342, 2012.
Hua Wu and Ming Zhou. Optimizing synonym extraction using monolingual
and bilingual resources. In Proceedings of the second international workshop
on Paraphrasing - Volume 16, PARAPHRASE ’03, pages 72–79, Stroudsburg,
PA, USA, 2003. Association for Computational Linguistics. doi: 10.3115/
1118984.1118994. URL http://dx.doi.org/10.3115/1118984.1118994.