Choosing the most reasonable split of a compound word

DEGREE PROJECT IN COMPUTER SCIENCE AND ENGINEERING,
SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2017
Choosing the most reasonable
split of a compound word using
Wikipedia
YVONNE LE
KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF COMPUTER SCIENCE AND COMMUNICATION
Choosing the most reasonable split of a
compound word using Wikipedia
YVONNE LE ([email protected])
Master’s Thesis at CSC
Supervisor: Olov Engwall ([email protected])
Examiner: Viggo Kann ([email protected])
Abstract
The purpose of this master's thesis is to make use of the category taxonomy of Wikipedia to determine the most reasonable split from the suggestions generated by an independent
compound word splitter.
The articles a word is found in can be seen as a group
of contexts the word can occur in and as different representations of the word, i.e. an article is a representation of
the word. Instead of only analysing the data of each single
article, the intention is to find more data for each representation/context to perform the analysis on. The idea is to
expand each article representing one context by including
related articles in the same category.
Two perceptions of a ”reasonable split” were studied.
In the first case a split consists of exactly two parts; in the
second case the number of parts is unlimited.
This approach is well suited for choosing the correct
split out of several suggestions but unsuitable for identifying compound words: more often than not it decides
not to split a compound word. It is also very dependent
on the compound words appearing in Wikipedia.
Referat
Choosing the most reasonable split of a compound
word using Wikipedia
The purpose of this degree project is to select the most reasonable
split of a compound word using Wikipedia's category taxonomy.
Suggestions for different splits are generated by an independent,
pre-existing algorithm.
The articles a word is found in can be seen as a group of
contexts the word can occur in and as different representations of
the word. The intention is to find more data for each
representation/context to perform the analysis on, instead of
only analysing the article the word was found in. The idea to be
tested is to expand each article representing one context by
including related articles in the same category.
Two different views of a ”reasonable split” were studied.
In the first case compound words were only split into two parts;
in the second case they were split into an unrestricted number of
parts.
The method proved to excel at choosing the correct split once
it made an attempt. A major drawback was that it often chose not
to split compounds even when it should have. The method is highly
dependent on the compounds appearing in Wikipedia.
Contents

1 Introduction
  1.1 Research Question
  1.2 Objective
  1.3 Delimitations

2 Background
  2.1 Early history and progress
  2.2 Application
    2.2.1 Machine Translation
    2.2.2 Information Retrieval
  2.3 Linguistics
    2.3.1 Semantic classification
      2.3.1.1 Copulative
      2.3.1.2 Determinative
      2.3.1.3 Endocentric
      2.3.1.4 Exocentric
    2.3.2 How much should be split?
    2.3.3 Where to split?
    2.3.4 What is a reasonable split?
  2.4 Compound splitting methods
    2.4.1 Statistical methods
      2.4.1.1 Component frequencies
      2.4.1.2 n-grams
      2.4.1.3 Part of speech combinations
      2.4.1.4 Parallel corpus
    2.4.2 Linguistic methods
      2.4.2.1 Semantic vector space

3 Resources
  3.1 Findwise's compound splitter
  3.2 Search platforms
    3.2.1 Elastic
    3.2.2 Solr
  3.3 Wikipedia
  3.4 Choice

4 Methodology
  4.1 Concept
  4.2 Architecture
  4.3 Implementation
    4.3.1 Split suggestions retrieval
    4.3.2 Articles retrieval
    4.3.3 Computation
  4.4 Evaluation metrics
    4.4.1 Terms
    4.4.2 Metrics
  4.5 Test data
    4.5.1 Den stora svenska ordlistan (DSSO)
    4.5.2 Svenska akademiens ordlista (SAOL)
    4.5.3 Annotators
  4.6 Data sets
    4.6.1 Data set 1
    4.6.2 Data set 2
    4.6.3 Data set 3
    4.6.4 Data set 4

5 Experiment
  5.1 Language analyzers
  5.2 Redo the expansion
  5.3 No stemming as base index
  5.4 Merge words

6 Results
  6.1 Case 1
    6.1.1 No stemming
    6.1.2 Swedish light
    6.1.3 Swedish
    6.1.4 Comparison against baseline
  6.2 Case 2
    6.2.1 No stemming
    6.2.2 Swedish light
    6.2.3 Swedish
    6.2.4 Comparison against baseline

7 Discussion
  7.1 Test case 1
    7.1.1 Optimizations
    7.1.2 Comparison to baseline
  7.2 Test case 2
    7.2.1 Optimization
    7.2.2 Comparison to baseline
  7.3 Strengths
  7.4 Weaknesses
  7.5 Further work

8 Conclusion

Bibliography
Chapter 1
Introduction
Swedish is a compounding language, allowing a new word to be formed by joining a theoretically unlimited number of words together. The words are joined without blank
spaces and are sometimes connected with joining morphemes. Being able to split compound words is a powerful resource in natural language processing, e.g. to expand
queries in information retrieval or in statistical machine translation.
Depending on the application one may want to split into a different number of
parts. For statistical machine translation one may want to do an aggressive splitting
of ”bordtennisspelare” (table tennis player) into ”bord tennis spelare” (table tennis
player), enabling one-by-one translation of all the constituents. However, for a search
engine one may be satisfied with ”bordtennis spelare”, because querying ”bordtennis”
(table tennis) is more relevant than querying ”bord” (table).
How can one find the most reasonable split? Previous research has studied a window of words appearing in the same document [23]; this study is a continuation
studying an expanded context using the free encyclopedia Wikipedia. Wikipedia
has a taxonomy system in which every article belongs to at least one category [27].
The intention is to group together other articles in the same category and study the
new expanded context with more data instead of only the data in a single article.
1.1 Research Question
Is it possible to choose the most reasonable split using an expanded context generated by Wikipedia’s category taxonomy?
1.2 Objective
Findwise has a compound word splitter for the Swedish language which generates possible
splits. The purpose of this master's thesis is to make use of Wikipedia's category
taxonomy to determine the most reasonable split among the suggestions.
The method will be tested on two perceptions of a reasonable split. The goal is
to study the method and its performance from the two points of view.
1.3 Delimitations
This method will only be tested for the Swedish language and the time complexity
will not be taken into account. The algorithm will only work for compound words
which appear in the Swedish Wikipedia and it will not be able to split words which
Findwise’s compound word splitter fails to split.
Chapter 2
Background
This chapter will present the history, state-of-the-art and application of automatic
compound splitting. It will also present the relevant theory.
2.1 Early history and progress
Research on compound analysis has been carried out for a long time in several languages. Fujisaki et al. [11] experimented with applying a probabilistic grammar for
parsing Japanese noun compounds; Rackow et al. [21] used a monolingual corpus
to find appropriate English translations of German noun compounds, and Lauer [14]
used a statistical approach in his publication. Numerous studies have since been
carried out, covering statistical, linguistic and other approaches, as well as
combinations of them. The state of the art studies the
semantic vector space using word embeddings. Both Dima and Hinrichs [6] and
Soricut and Och [25] have studied this approach with positive results.
2.2 Application
As mentioned in the introduction, compound splitting is a powerful tool in
natural language processing. Two important use cases are machine translation
and information retrieval. This section will further explain the details.
2.2.1 Machine Translation
Machine translation is a field which investigates the use of computers to automate
translation from one language to another [12]. The methods are heavily dependent
on data of varying types, e.g. corpora or rules. Compound splitting can improve
translations by providing more information about a word for a more accurate analysis. Furthermore, it gives the system an additional chance to analyse compound
words which do not occur in the bilingual corpora, dictionaries or other text sources
used.
Statistical machine translation (SMT) is a method which analyses bilingual text
corpora to build statistical models that transform text from the source language
to the target language. What it can translate is therefore limited by the text
corpora; it would never be able to translate a word which did not appear in them.
The same holds for many other methods, such as direct translation using
a dictionary. However, compound splitting would be able to break an unknown
compound word into smaller known constituents. Presume the
Swedish compound word ”favoritmånad” (favourite month) did not occur in the corpora but ”favorit” (favourite) and ”månad” (month) did. The translation system
would fail to translate the compound as-is, but it would be able to acquire a translation by performing compound splitting and then finding the proper translations for
”favorit” (favourite) and ”månad” (month).
Compound splitting is not only useful when attempting to translate a
word, it is also useful in the pre-processing work. Applying compound splitting to
the corpora before analysing them would result in more data to work with and would
expose common patterns. The corpora with compound splitting applied can be a complement
to the original corpora.
2.2.2 Information Retrieval
Query expansion is a process carried out to expand the original query to get improved retrieval performance in information retrieval. Query expansion can be used
to improve both precision and recall. Better precision means a larger share of the
retrieved documents are relevant; better recall means a larger share of all relevant
documents are retrieved.
In some texts the same concept may be referred to using different words, especially in languages with compounding. An example is the Swedish word ”vinterskor”
(winter shoes) and the Swedish phrase ”skor man har på vintern” (shoes you use
during winter). They both capture the same concept and both are relevant documents
to retrieve. By only using the search query ”vinterskor” we would only find
documents explicitly containing the word ”vinterskor” and not the other phrase.
However, by splitting the compound word, expanding the query with ”vinter skor”
(winter shoes) and weighting the terms appropriately, the other relevant documents
would be found and retrieved as well.
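To make the idea concrete, one way such a weighted expansion could look is sketched below; the Lucene-style query syntax and the boost value are illustrative and not taken from the thesis:

```python
# Hypothetical weighted query expansion (Lucene-style syntax): the original
# compound keeps full weight, while its constituents are added as a boosted-down
# alternative, so documents phrasing the concept with separate words also match.
original = "vinterskor"
constituents = ["vinter", "skor"]
expanded_query = f'{original} OR ({" AND ".join(constituents)})^0.5'
print(expanded_query)  # -> vinterskor OR (vinter AND skor)^0.5
```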
The previous example was about using compound splitting on the query, but it
can also be used on the documents. If we were to query ”skor” (shoes) we would be
interested in all kinds of shoes, but only documents containing the word ”skor” explicitly
would be retrieved. That would exclude all the documents describing different types
of shoes such as ”vintersko” (winter shoe) or ”fotbollssko” (football shoe). By using
compound splitting on the documents we would be able to capture them as well.
Compound splitting is thus valuable in both machine translation and information retrieval for getting improved results. It has a positive impact on applications
relying on accurate results. It is difficult to discuss the impact it has from an ethical
or sustainability standpoint because it depends on the application which uses it.
2.3 Linguistics
Compound words in Swedish are words that consist of two or more independent
words [4]. As mentioned in the introduction, words can be joined by joining morphemes such as an s, an extra vowel, letter replacement or letter removal [9]. This section
will present an overview of the linguistic theory and why compound word splitting
is a difficult challenge. It will end by explaining the two definitions of
”reasonable split” which will be used in this project.
2.3.1 Semantic classification
Swedish compound words can be classified into the following four types:
• Copulative
• Determinative
• Endocentric
• Exocentric
A compound word can belong to several types. This section will explain these
categories to deepen the understanding of different types of compound words.
2.3.1.1 Copulative
Copulative words are words whose parts are equally weighted [2]. The constituents can usually be shuffled without losing the meaning. An example of copulative words is colour combinations, e.g. ”blåröd” in Swedish, which is ”blue red” in
English. Shuffling the constituents into ”rödblå” (red blue) does not change the meaning.
2.3.1.2 Determinative
Determinative words are words for which the first part determines the second part
[17]. An example is the Swedish word ”matsked” (tablespoon), where ”mat” (food)
is an attribute of ”sked” (spoon).
2.3.1.3 Endocentric
All endocentric words are determinative. An endocentric word has a head, which is the
dominant part of the compound. The head holds the basic meaning
of the whole word, and the whole compound word inherits its inflectional properties
from it [3]. An endocentric word is also a hyponym of its head. ”Blackboard” is an
example of an English endocentric word where ”board” is the head and ”blackboard”
is a hyponym of ”board”. ”Vattensäng” (water bed) is an example of a Swedish
endocentric word where ”säng” (bed) is the head and ”vattensäng” is a hyponym
of ”säng”.
2.3.1.4 Exocentric
Exocentric words, also known as bahuvrihi, are words that do not have a head
which can represent the whole compound word [3]. Splitting these words results
in words which do not provide any semantic context for the original word. ”Skinhead” is
an example of an English exocentric word: neither ”skin” nor ”head” can
represent the original word alone. ”Fågelskrämma” (scarecrow) is a Swedish
example: neither ”fågel” (bird) nor ”skrämma” (scare) can represent
the original word alone.
2.3.2 How much should be split?
All compound words can be divided into (at least) two parts, a first element and a last
element [19], in Swedish called ”förled” and ”efterled” [4] [1]. A characteristic of
compound words is that the first element and last element can be standalone words
and can even be compound words themselves. An example is the Swedish
compound word ”bordtennisspelare” (table tennis player), which consists of the first
element ”bordtennis” (table tennis) and the last element ”spelare” (player). ”Bordtennis” is the first element of that word, but is itself a compound word as well,
made up of ”bord” (table) and ”tennis” (tennis).
Splitting a word which should not be split is called oversplitting. It is a challenge
connected with ambiguity. Many Swedish words can be interpreted as compound
words although they are also non-compound words. All Swedish compound words
which are joined together correctly are technically correct, no matter how long they
are or what words they are made of. An example is the Swedish word ”sparris”
(asparagus). Splitting it into ”sparr is” (spar ice) is technically not wrong, but
it is unreasonable since it is not common. One challenge in compound splitting
is therefore to know when it is appropriate to stop splitting. In this thesis,
technically correct splits are not automatically evaluated as correct; they must
be deemed reasonable by the annotators or an open source word list. More about
the evaluation and the definition of reasonable will be explained later in the report.
2.3.3 Where to split?
Ambiguity is a difficult challenge in Swedish compound splitting and is connected with the problem of oversplitting mentioned in the previous section.
An example is the Swedish word ”matris”. It can be interpreted both as the
non-compound word ”matris” (matrix) and as the compound word ”mat ris” (food rice).
Another example of ambiguity is the Swedish word ”kamplåtar”. It can be interpreted as a compound word in two ways, either ”kamp låtar” (battle songs) or ”kam
plåtar” (comb plates).
Absence of context adds an extra layer of difficulty in determining whether a
word is a compound word and, if so, which words it is composed of.
2.3.4 What is a reasonable split?
The conclusion is that it is difficult to define ”reasonable”; it depends on context
and application. However, a takeaway from this section is that compound words
can always be divided into two constituents. This will therefore be one of the cases
the algorithm is tested on: a correct split consisting of two constituents.
The second case tests the algorithm's performance from the standpoint that
compound words can be split into more than two parts. This would be the case when
the constituents themselves are compound words.
An example which illustrates what would be acceptable in the two cases can be
made with the Swedish word ”bordtennisspelare” (table tennis player). The first
definition would accept ”bordtennis spelare” (table tennis and player) as a correct
split and the second definition would accept ”bord tennis spelare” (table and tennis
and player) as a correct split.
A reasonable split is defined as one where the constituents make sense when put
together and do not change the meaning of the original compound word (unless it is
an exocentric word). The algorithm should not oversplit ambiguous constituents
into smaller parts which change the meaning of the word, e.g. split ”tennis” (tennis)
into ”tenn is” (tin ice) when splitting ”bordtennis” (table tennis). The second
case will be evaluated against splits of the same words made by annotators.
2.4 Compound splitting methods
There are different approaches to compound splitting. The two main approaches are
statistical methods and linguistic methods. Combinations of the two are also quite
common to maximise the result.
The two approaches are explained in this section, followed by related work
which used them.
2.4.1 Statistical methods
Statistical methods depend on the corpora from which the statistics used in compound splitting are computed. These methods are therefore more adaptable than linguistic methods: a linguistic method is bound to one language because of the unique
language traits used, whereas a statistical method can be applied to other languages
by changing the corpora. A disadvantage is that, since the linguistic aspect is
ignored, a large and versatile amount of data is required to cover all the language
traits for the method to be successful.
2.4.1.1 Component frequencies
A frequency based metric is described in Koehn and Knight [13] where the following
formula is used to pick the most probable split:
\[
\operatorname*{arg\,max}_{S} \left( \prod_{p_i \in S} \operatorname{count}(p_i) \right)^{1/n} \tag{2.1}
\]
where:
S = a split suggestion (the arg max is taken over all suggestions)
p_i = part i of the split suggestion
n = number of constituents in the split suggestion
Given the counts of words in the corpus, the split S with the highest geometric
mean of the word frequencies of its parts p_i is picked; n is the number of parts in a
split. If a word as a whole occurs more frequently than its constituents, the word
should be kept as a whole. E.g. the English compound word ”wardrobe” would
have the split S = ”ward” + ”robe” as one of its split suggestions. The remaining
variables would be p_1 = ”ward”, p_2 = ”robe” and n = 2 because the split is made up of
two constituents.
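A minimal sketch of this metric, assuming a precomputed word-frequency table (the function and variable names here are illustrative, not from the paper):

```python
from math import prod

def best_split_by_frequency(suggestions, count):
    """Pick the split whose parts have the highest geometric mean of corpus
    frequencies (equation 2.1). A one-part "split" means keeping the word whole."""
    def geometric_mean(parts):
        return prod(count.get(part, 0) for part in parts) ** (1 / len(parts))
    return max(suggestions, key=geometric_mean)

# Example: "wardrobe" is kept whole when it is more frequent than ward/robe combined.
counts = {"wardrobe": 120, "ward": 40, "robe": 15}
print(best_split_by_frequency([["wardrobe"], ["ward", "robe"]], counts))
```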
A drawback of this approach is that prepositions and determiners will always
occur more frequently in the corpus than other words. This means that
splits containing prepositions will get a higher score. To exclude these mistakes,
knowledge of the part of speech of the words is used, and words whose constituents
are not nouns, adverbs, adjectives or verbs are not split.
Sjöbergh and Kann [23] presented a frequency based metric as well. This approach focused on the context the compound word appeared in by studying
the 50 words before and the 50 words after the compound word.
Constituents occurring closer to the compound word got a higher score. Similar to
the approach by Koehn and Knight [13] mentioned earlier, the split with
the highest geometric mean was chosen. The drawbacks of this method were that
constituents rarely appear close to the compound word and that short constituents,
which usually are common words, appear more often by chance.
2.4.1.2 n-grams
Sjöbergh and Kann [23] also presented a method of compound splitting
based on analysing n-grams. They had a list of compound heads and tails, and the frequencies
of all character 4-grams within compound heads and tails (not overlapping a head/tail
border) were counted. The frequency data indicates which 4-grams are most common
and should therefore not be split. To get a better guess, the frequencies
of all character 4-grams containing the suggested split point were added up. This was
done for every split suggestion. The suggestion with the lowest sum is then
the split with the highest probability.
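The scoring idea can be sketched as follows, assuming a frequency table of character 4-grams collected from known compound heads and tails (all names are illustrative):

```python
def straddling_ngram_score(word, split_points, fourgram_freq, n=4):
    """Sum the frequencies of all character 4-grams that straddle a suggested
    split point. Frequent 4-grams tend to lie inside heads or tails, so a low
    sum indicates a more plausible set of split points."""
    score = 0
    for point in split_points:                       # index where the word is cut
        lo = max(point - n + 1, 0)
        hi = min(point, len(word) - n + 1)
        for start in range(lo, hi):
            score += fourgram_freq.get(word[start:start + n], 0)
    return score

# The suggestion with the lowest score would be chosen, e.g.:
# best = min(suggestions, key=lambda pts: straddling_ngram_score(word, pts, freqs))
```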
2.4.1.3 Part of speech combinations
Some part of speech combinations are more common than others. More than 25%
of compounds are noun-noun combinations and very few are pronoun-pronoun
combinations. By analysing which part of speech combinations were the most common, Sjöbergh and Kann [23] developed a method calculating the probability
of such combinations. The algorithm looks at the words in the suggested split
and determines their parts of speech. It then selects the part of speech combination with
the highest probability.
2.4.1.4 Parallel corpus
Koehn and Knight [13] present an approach using a parallel corpus. This works
since English compound concepts are mostly written as separate words, as opposed to Swedish
compound words (e.g. “ishockeyspelare” and “ice hockey player”). The approach
requires a translation lexicon, and the easiest way to obtain one is to learn it from a parallel
corpus. The approach translates the constituents of every split. If one of the
(translated) splits can be found in the translation lexicon, it is considered
a correct split.
Since some words can have different translations depending on the context, this
can cause problems. An example mentioned in the paper is the word “grundrechte” (basic rights). This word should have the split “grund rechte”. However,
since “grund” usually translates into “reason” or “foundation”, this approach will
look for “reason rights” in the translation lexicon. This does not exist and the
correct split “grund rechte” will therefore be neglected. This can however be accounted for by building a second translation lexicon and joining it with the first one.
The first translation lexicon is obtained by learning from a parallel corpus without
splitting the German compound words, and the second one, which complements
the first, is obtained by learning from a parallel corpus with the German
compound words split using a frequency based approach. The resulting translation
lexicon would then learn that “grund” ↔ “basic”.
2.4.2 Linguistic methods
A linguistic approach relies on a lexical database, and linguistic knowledge is also
applied. As opposed to statistical methods, the data for linguistic methods is
enriched with additional information such as part of speech tags or information
about different frequencies.
2.4.2.1 Semantic vector space
Daiber, Quiroz, Wechsler and Frank [5] analyse compounds by exploiting regularities
in the semantic vector space. They are exploiting the fact that linguistic items with
similar distributions have similar meanings and their work is based on analogies
such as “king is to man what queen is to woman”. This paper only focuses on
endocentric compounds, i.e. compounds whose head describes the basic
meaning of the whole word. The head and the whole compound word will therefore
be close to each other in the semantic space, and this can be used to make correct
splits. This worked well and outperformed a commonly used compound splitter on
a gold standard task [20].
Chapter 3
Resources
Apart from the programming itself, several resources were used to reach the
project objective, mainly the following:
• Findwise's compound splitter: To generate the split suggestions our algorithm
should choose from
• Search platform: To store and fetch articles
• Wikipedia: The articles which will be stored in the search platform and which
the algorithm will review
This chapter will present the resources used in this project together with other
alternatives. A summary can be found at the end of the chapter stating which
resources were chosen.
3.1 Findwise's compound splitter
Findwise's compound splitter returns a list of possible splits of a compound word.
It is based on a generated text file containing the following data for each word:
• Word in its baseform
• Conjugations not ending a compound word
• Conjugations ending a compound word
The algorithm iterates through a graph built from the information in the text
file to return all possible splits. The compound splitter is programmed to
determine not only the positions to split at, but also which part of speech the constituents belong to. Therefore, ambiguous cases can generate multiple suggestions.
E.g. ”spelkonsol” (gaming console) will generate both ”spel konsol” (game console)
and ”spela konsol” (gaming console) because the constituent is ambiguous and can
be both the noun ”spel” and the verb ”spela”.
The splitter was evaluated on splitting the compound words in the Den stora
svenska ordlistan (DSSO) word list. The compound words in the list have
a specific tag, and information about the first element and the last element of each
compound word is also provided, i.e. the correct split into two constituents, which
is our first case to evaluate.
Findwise’s compound splitter was tested on splitting words into two constituents.
The accuracy was calculated as
\[
\text{Accuracy} = \frac{\text{Correct split}}{\text{Correct split} + \text{Wrong faulty split}} \tag{3.1}
\]
A splitting attempt was considered a ”correct split” if the correct split could
be found in the top x suggestions of the returned sorted list of possible splits. The
results were the following:
• The first split suggestion is a correct split - Accuracy: 0.74
• Consider a hit in the top 2 splittings as a correct split - Accuracy: 0.90
• Consider a hit in the top 3 splittings as a correct split - Accuracy: 0.92
• Consider a hit in any splitting as a correct split - Accuracy: 0.93
3.2 Search platforms
A search platform will be used to store and fetch the articles according to the
algorithm presented in this project. The most important fields to be indexed are
the article id, category and content for querying purposes. Two alternatives will be
presented in this section.
3.2.1 Elastic
Elastic was created in 2012 and is an open-source search platform based on Lucene
[7].
Elastic supports aggregations, multitenancy and has better support for complex
and analytical queries compared to Solr.
3.2.2 Solr
Solr was created in 2004 and is also an open source search platform based on Lucene [16] [10].
Compared to Elastic, which is relatively new, Solr is more mature and stable.
3.3 Wikipedia
Wikipedia was used as the data for studying the contexts words appear in. It is a
free encyclopedia created collaboratively by anyone who uses it and has grown to
become one of the world's largest encyclopedias since its creation in 2001. The
English Wikipedia has, as of today (2016-01-21), 5,059,991 articles and is the largest
one; the second largest is the Swedish Wikipedia with a total of 2,662,793 articles [28].
Wikipedia has a couple of ways to browse articles and group similar ones. One
way is categories. All articles in Wikipedia belong to at least one category,
and categories are intended to group together pages on similar topics [27]. By clicking on
different categories one can find similar articles related to the topic from different
points of view. E.g. the English article ”Table tennis” has the categories ”Table
tennis” and ”Racquet sports”, amongst others [30]. Both are related to table
tennis but put the topic into different contexts. In the first context table tennis
is viewed as the main topic of the category and you will find articles about the
equipment used, different play styles and so on. In the other, table tennis
is viewed as one type of racquet sport and you will find articles about other
racquet sports. This can be interpreted as each Wikipedia article being connected
to several contexts, where each context takes the form of a group of related articles
from a different point of view. This project utilises this type of grouping for the
context expansion.
Another way of grouping articles is to consider all internal links appearing in
the same article as a group. Internal links are links to other articles with the
aim of allowing readers to deepen their understanding of a topic [31]. Internal links
are therefore in some way relevant to the topic since they appear in the main
article. However, a drawback is that they all get the same weight; e.g. an internal
link mentioning the material ”sponge” in the article ”Table tennis” gets the same
relevancy as ”tennis grips”, although the latter may be more specific
and relevant to the topic.
Backlinks can also be considered as a group of articles. As mentioned earlier, all
Wikipedia articles have internal links pointing to other Wikipedia articles. Similarly,
they also have a collection of incoming links (links from other Wikipedia articles
which link to them); these are called backlinks. Although all the articles
in the backlinks refer to the same article, they may not have discussed
the topic in the same context. E.g. the English article ”Table tennis” has the
articles ”North Korea”, ”Table (furniture)” and ”Tennis” among its backlinks, and these are
different contexts [29]. The context interpretation discussed in the paragraph about
categories is also applicable here, where each backlink is a context. The difference
here is that a single article is a context instead of a group of articles.
Some articles also have related articles presented in a ”see also” section. E.g. a town can have neighbouring towns in its ”see also” section. It is
more common in certain types of articles, and which articles have this section is
very inconsistent.
Another possible encyclopedia which could be used is the Swedish Nationalencyklopedin. It is stated on its website that it has over 200,000 articles [15]. By
performing a search on the website a more exact number can be found, which is
313,146 articles in total. Many articles have a simplified version called a ”simple”
article; the original is called a ”long” article. The simple articles are written in
a simple and casual language without abbreviations but cover the same
topics. The number of articles covering a unique topic is therefore less than 313,146
since there are some duplicates. 72,925 of the articles on Nationalencyklopedin are
classified as ”simple” articles and 240,221 as ”long” articles.
The articles in Nationalencyklopedin do not belong to any categories but have
a list of links to related pages. Nationalencyklopedin also states that the articles
cover seven main topics. However, it is not possible to browse through the main
topics and there is no visible information about which main topic an article belongs
to.
The articles on Nationalencyklopedin are, in contrast to Wikipedia, reviewed by
experts. Therefore, although the amount of data is much smaller than Wikipedia's,
the data itself is more reliable. The advantage of using Nationalencyklopedin over
Wikipedia is thus the quality of the data; the disadvantage is the quantity. Since
this project uses a statistical approach, quantity is more important than quality.
3.4 Choice
Elastic was chosen as the search platform because of its better support for analytical
queries, which could potentially be of use. Wikipedia was chosen because it was the
initial intention of the project and because the large amount of data is crucial for a
statistical approach.
Chapter 4
Methodology
This chapter gives an overview of the system and what its main parts do.
The idea behind the algorithm is explained first, before the details are presented.
This is followed by a presentation of the evaluation metrics and the test data.
4.1 Concept
Instead of analysing the article the compound word occurred in, the intention is
to find a larger context for analysis. The articles it was found in can be seen as
representations of the word, i.e. an article is a representation of the word. The idea
is to expand the articles representing it by including related articles which can be
found in the categories it belongs to.
An illustration of the expansion process can be seen in figure 4.1; it
describes how the representation changes.
Figure 4.1. Illustration of the data retrieval process
1. Get articles: ”string” –> articles
2. Get categories: article –> categories
3. Get articles: categories –> articles
In the end, each representation of the word, originally one single
article, is now a representation made up of a bundle of several articles.
The next step is to get the split suggestions from the splitter and count the
occurrences of the constituents in each bundle. All constituents occurring in the
same bundle means the split suggestion is good enough.
4.2 Architecture
The architecture consists of three parts: Findwise's compound splitter, a class which
communicates with the Elastic server, and a main class which sends data requests to the
other components and performs the calculation. Figure 4.2 shows a simple overview of it.
Figure 4.2. Simplified overview of the system architecture
4.3 Implementation
4.3.1 Split suggestions retrieval
The compound word to be split is sent to the compound splitter and the
retrieved results are filtered. The filter removes all suggestions with more
than two parts for the first case, and suggestions with only one part (the whole word)
for the second case.
4.3.2 Articles retrieval
A single class manages all requests to the Elastic server. There are two main
requests:
Get categories of the base articles: Search for at most n articles which contain the compound word and return those articles' categories. No
analysis of the categories' relevancy is done; the categories are taken in the order
they are presented by Wikipedia, which is alphabetical. All the categories
originating from the same article are bundled together.
Get articles of the categories: Search for and return at most m articles
belonging to each category in every category bundle.
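A minimal sketch of these two requests, assuming an Elasticsearch index named wiki with content and categories fields and the 8.x-style Python client (the index layout and names are illustrative, not taken from the thesis):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

def categories_of_base_articles(word, n=15):
    """Fetch at most n articles containing the word; return one category bundle per article."""
    hits = es.search(index="wiki", query={"match": {"content": word}}, size=n)["hits"]["hits"]
    return [hit["_source"]["categories"] for hit in hits]

def articles_of_categories(category_bundle, m=10):
    """Fetch at most m articles for each category in a bundle and merge their texts."""
    texts = []
    for category in category_bundle:
        hits = es.search(index="wiki",
                         query={"term": {"categories": category}}, size=m)["hits"]["hits"]
        texts.extend(hit["_source"]["content"] for hit in hits)
    return texts
```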
4.3.3 Computation
A list of integers, one per split suggestion, is initialized first.
Every element in the list represents one split suggestion and the integer is the
number of bundles in which all constituents of the split could be found.
For each bundle of articles, check whether all parts of a split suggestion can be found
in that bundle. If so, increment the counter. When all bundles are done, do the
same for the next split suggestion.
Lastly, the split suggestion with the highest count in the list is chosen
as the best split. This means its constituents could be found together in the
most contexts. The implication is that the constituents have a strong connection
with the original compound word and it is therefore reasonable to split it this way.
If nothing was found, the algorithm does not choose any split suggestion at all and
returns the word as a whole.
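A minimal sketch of this counting step, assuming each expanded context has already been flattened into one text per bundle (all names are illustrative):

```python
def choose_split(word, suggestions, bundles):
    """Pick the split suggestion whose constituents co-occur in the most bundles.

    suggestions: list of splits, each a list of constituent strings.
    bundles: list of strings, each the concatenated text of one expanded context.
    Returns the word as a whole if no bundle supports any suggestion.
    """
    scores = [0] * len(suggestions)
    for text in bundles:
        for i, parts in enumerate(suggestions):
            if all(part in text for part in parts):   # every constituent found in this bundle
                scores[i] += 1
    if not scores or max(scores) == 0:
        return [word]                                  # no supporting context: do not split
    return suggestions[scores.index(max(scores))]
```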
4.4 Evaluation metrics
The same metrics and definitions as in the paper by Koehn and Knight [13] are used
for the evaluation. This section first covers the terms used in the metrics and then
the metrics themselves.
4.4.1 Terms
• Correct split: Words that should be split and were split correctly.
• Correct non: Words that should not be split and were not.
• Wrong not: Words that should be split but were not.
• Wrong faulty split: Words that should be split, were split, but wrongly
(either too much or too little)
• Wrong split: Words that should not be split, but were.
4.4.2 Metrics
\[
\text{Precision} = \frac{\text{Correct split}}{\text{Correct split} + \text{Wrong faulty split}} \tag{4.1}
\]
Precision measures the share of correct splits among all attempted splits.
\[
\text{Recall} = \frac{\text{Correct split}}{\text{Correct split} + \text{Wrong faulty split} + \text{Wrong not split}} \tag{4.2}
\]
Recall measures the share of correct splits among all the words that should be split.
\[
\text{Accuracy} = \frac{\text{Correct}}{\text{Correct} + \text{Wrong}} \tag{4.3}
\]
Accuracy measures the share of correct decisions (correct splits and correct non splits) among all decisions.
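Expressed directly in code, the three metrics become the following sketch; the argument names mirror the terms defined in 4.4.1:

```python
def precision(correct_split, wrong_faulty_split):
    return correct_split / (correct_split + wrong_faulty_split)

def recall(correct_split, wrong_faulty_split, wrong_not):
    return correct_split / (correct_split + wrong_faulty_split + wrong_not)

def accuracy(correct_split, correct_non, wrong_not, wrong_faulty_split, wrong_split):
    correct = correct_split + correct_non
    wrong = wrong_not + wrong_faulty_split + wrong_split
    return correct / (correct + wrong)
```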
4.5 Test data
The test cases are combinations of data sets from different sources. The sources
used were DSSO, Svenska akademiens ordlista and data from annotators. This
section covers the sources the data was collected from and the test cases used.
The algorithm is dependent on Findwise's compound word splitter generating
splits. Since this is an evaluation of the algorithm and not of Findwise's splitter, all
words which the splitter could not split were filtered away; the words tested are
only those which Findwise's splitter managed to split.
4.5.1 Den stora svenska ordlistan (DSSO)
DSSO was used to collect compound words. It is a word list created collaboratively
by its users [18] [26]. The same DSSO file that Findwise used in their evaluation was
used in this project. The file consists of around 17,000 marked compound
words together with the correct split into two parts.
4.5.2 Svenska akademiens ordlista (SAOL)
SAOL was used to collect non-compound words. This was done manually by going
through every page and randomly adding words that were not marked as compound
words in the word list. A couple of words were chosen from every page if possible,
in an attempt to include non-compound words starting with every letter of the
alphabet and with a letter combination distribution similar to the Swedish language
the dictionary represents. 515 non-compound words in total were collected through
this method.
4.5.3 Annotators
The third source was two subsets of the compound words from DSSO annotated by
annotators. Each subset consists of 500 words and each subset was annotated by 5
different annotators.
The annotation was carried out by handing the annotators a list of 500 words
with the instruction to split the words as much as possible while the meaning
of the resulting ”group of words” should still remain the same as that of the original
compound word, according to the annotators themselves. The key word should not be
split. E.g. ”långdistansförhållande” (long distance relationship) should be split
aggressively into ”lång distans förhållande” (long distance relationship), whereas
”bordtennisbord” (table tennis table) should only be split into ”bordtennis bord”
(’table tennis’ table), in the author's view. Splitting the latter into three constituents
would not retain the original key word ”table tennis”.
The annotators were free to put 0, 1 or more marks where they thought the
word should be split.
The two groups of 5 annotators each were handed two different word lists: the
first one consists of compound words which occur in Wikipedia and the second
one of words that do not.
For the evaluation phase, a split was marked as correct if at least 2 annotators
had agreed on the split.
4.6 Data sets
This section presents the data sets used.
Some data had a Wikipedia filter applied before being shuffled and randomly
selected into a data set. This filter removed all words which could not
be found in any Wikipedia article. It is only applied to some test cases, to be able
to evaluate the algorithm's maximum potential.
Data sets 1 and 2 were used to evaluate the first case and data sets 3 and 4 to
evaluate the second case. The first case is a split into two constituents and the
second case is a split into as many constituents as possible while still retaining the
meaning of the word. All the words used in data sets 1 and 2 were checked with
Findwise's splitter to make sure it could generate at least one two-part split suggestion.
4.6.1 Data set 1
Data set 1 contains 1000 already annotated compound words from DSSO and 287
non-compound words from SAOL.
This data set only contains words that occur in Wikipedia.
4.6.2 Data set 2
Data set 2 contains 1000 already annotated compound words and 300 non-compound
words.
4.6.3 Data set 3
Data set 3 contains 500 compound words annotated by 5 annotators and 287 non-compound words from SAOL.
This data set only contains words that occur in Wikipedia.
4.6.4 Data set 4
Data set 4 contains 500 compound words annotated by 5 annotators and 300 non-compound words from SAOL.
Chapter 5
Experiment
This chapter covers the experiments that were applied to some of the test
cases.
5.1 Language analyzers
The purpose of testing different language analyzers was to compare the impact they
had. Elastic’s Swedish language analyzer based on the Snowball stemmer [8] [24],
the light stemmer created by Jacques Savoy [22] and no stemming were tested.
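For reference, the three options roughly correspond to index settings of the following shape (a sketch with illustrative names; "swedish" is Elastic's Snowball-based analyzer and "light_swedish" is the stemmer derived from Savoy's light stemmer):

```python
# Illustrative Elasticsearch analysis settings for the three tested options.
analysis_settings = {
    "analysis": {
        "filter": {
            "swedish_light_stemmer": {"type": "stemmer", "language": "light_swedish"}
        },
        "analyzer": {
            # Option 1: no stemming, only lowercasing.
            "no_stemming": {"type": "custom", "tokenizer": "standard", "filter": ["lowercase"]},
            # Option 2: the light stemmer by Jacques Savoy.
            "swedish_light": {
                "type": "custom",
                "tokenizer": "standard",
                "filter": ["lowercase", "swedish_light_stemmer"],
            },
            # Option 3: the built-in "swedish" (Snowball-based) analyzer is used as-is.
        },
    }
}
```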
5.2 Redo the expansion
There is a high risk that the compound word will not be found in Wikipedia, especially
longer and more unusual ones. Some words have split suggestions which have at
least one constituent in common. This can be exploited by redoing the search on that
dominant word. In this way, articles that are relevant but do not contain the whole word
may be found, giving at least some context.
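A sketch of how such a dominant word could be picked; preferring the longest shared constituent is an assumption made here for illustration, not something stated in the thesis:

```python
def dominant_constituent(suggestions):
    """Return a constituent shared by every split suggestion, or None.
    The retrieval can then be redone on this word when the whole compound
    is not found in any Wikipedia article."""
    common = set(suggestions[0])
    for parts in suggestions[1:]:
        common &= set(parts)
    return max(common, key=len) if common else None    # assumption: prefer the longest

# e.g. dominant_constituent([["bordtennis", "spelare"], ["bord", "tennis", "spelare"]])
# -> "spelare"
```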
5.3 No stemming as base index
The task of a stemmer is to reduce a word to its root form, e.g. ”player” and
”plays” have the root form ”play”. The first 15 base articles which will be expanded
are crucial for finding relevant data to base the compound splitting on. Querying
the root form returns more articles, but also more irrelevant articles. E.g. querying
the root form ”play” instead of ”player” would also return articles that do not
mention the word ”player”, increasing the recall but decreasing the precision.
Since only 15 base articles are used, precision is more important than recall.
Retrieving the first 15 base articles without any stemming was therefore tested,
in an attempt to increase the number of relevant articles the context is expanded from.
5.4 Merge words
Because the external compound splitter is programmed to also determine the
part of speech, it generates different representations of the same split
suggestion, as mentioned earlier. E.g. the word ”lekplats” (playground) has the
split suggestions ”lek plats” (play ground) and ”leka plats” (to play ground) because
of ambiguity: the constituent could be both the noun ”lek” (play) and the verb ”leka”
(to play). However, instead of choosing one randomly when there is a tie between two
suggestions, the suggestion which requires no modifications of the original word when
the constituents are merged back together is chosen. In this case, ”lek plats” is chosen
because it is identical to the original word when the constituents are put together.
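A sketch of this tie-breaking rule (names are illustrative):

```python
def break_tie(word, tied_suggestions):
    """Among equally scored suggestions, prefer the one that reproduces the
    original word exactly when its constituents are concatenated, e.g.
    'lek' + 'plats' == 'lekplats' but 'leka' + 'plats' != 'lekplats'."""
    for parts in tied_suggestions:
        if "".join(parts) == word:
            return parts
    return tied_suggestions[0]   # otherwise fall back to the first suggestion
```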
Chapter 6
Results
This chapter presents the results from the experiments and is divided into two parts.
The first part covers the first case and the second part the second case. The terms
and metrics were explained in section 4.4.
6.1 Case 1
The algorithm is only allowed to split the compound word into two constituents or
not at all.
The first row shows the result of the default algorithm without any optimization.
The second and last rows show the results of the experiment when two optimization
options each were enabled.
• First row - Default algorithm.
• Second row - Stemming is initially disabled as described in 5.3 and the algorithm is allowed to perform an additional attempt on a new query as described
in 5.2.
• Last row - The algorithm is allowed to perform an additional attempt on a
new query as described in 5.2 and should choose the split suggestion which is
closest to the original compound word when the constituents are put together
as described in 5.4.
6.1.1 No stemming
Experiment on the index without stemming. In this case, optimizing with initially
disabling stemming as the second row suggests is cancelled out because no stemming
is used on the index at all. Data set 1 was tested.
                      Correct  Correct  Wrong  Wrong faulty  Wrong  Precision  Recall  Accuracy
                      split    non      non    split         split
Default                 618      195     305        77         92      0.89      0.62     0.63
Base index and Redo     660      176     197       143        111      0.82      0.66     0.65
Redo and Merge          671      176     197       132        111      0.84      0.67     0.66

Table 6.1. The results for data set 1 tested on the algorithm without a stemmer for case 1.
6.1.2 Swedish light
Experiment on the index with stemming using the Swedish light analyzer created
by Jacques Savoy [22]. This stemmer does not stem as aggressively as Elastic’s
Swedish stemmer. Data set 1 was tested.
                      Correct  Correct  Wrong  Wrong faulty  Wrong  Precision  Recall  Accuracy
                      split    non      non    split         split
Default                 473      187     385       142        100      0.77      0.47     0.51
Base Index and Redo     619      120     124       247        167      0.71      0.63     0.58
Redo and Merge          701      120     134       165        167      0.81      0.70     0.63

Table 6.2. The results for data set 1 tested on the algorithm with the Swedish light language analyzer for case 1.
6.1.3 Swedish
Experiment on the index with stemming using the Swedish stemmer based on the
Snowball stemmer [8] [24]. This stemmer uses a more aggressive approach when
stemming resulting in better recall but worse precision. More (relevant and irrelevant) data is collected with this stemmer. Data set 1 was tested.
                      Correct  Correct  Wrong  Wrong faulty  Wrong  Precision  Recall  Accuracy
                      split    non      non    split         split
Default                 448      210     433       119        166      0.79      0.45     0.51
Base Index and Redo     634      127     124       242        160      0.72      0.69     0.63
Redo and Merge          713      127     124       163        160      0.81      0.71     0.62

Table 6.3. The results for data set 1 tested on the algorithm with Elastic's Swedish language analyzer for case 1.
6.1.4 Comparison against baseline
Comparison of the best combination of optimizations (or lack of) of each stemming
option and the top suggestion in Findwise’s compound splitter. Data set 2 was
tested.
                Correct  Correct  Wrong  Wrong faulty  Wrong  Precision  Recall  Accuracy
                split    non      non    split         split
No stemming       530      189     331       139        111      0.80      0.53     0.55
Light             584      133     273       143        167      0.80      0.58     0.55
Swedish           596      140     266       138        160      0.81      0.59     0.57
Findwise          778        0       0       222        300      0.78      0.78     0.59

Table 6.4. The results for data set 2 tested on Findwise's compound word splitter versus this project's method.
The largest impact of an optimization can be seen in table 6.2, where the
number of correct splits increased from 473 to 619, and table 6.3, where the number
of wrong nons decreased from 433 to 124. However, this optimization also decreased
the number of correct nons.
The precision was high in all the cases, meaning the algorithm often chose the
right split when it had to choose. However, recall and accuracy were lower, ranging
from 0.51 to 0.71. This was caused by the high number of wrong nons and wrong
splits.
6.2 Case 2
The second part covers the second case, where there is no limit on the number of
parts.
The first row shows the result of the default algorithm without any optimization.
The second and last row show the results of the experiment when optimization
options were used.
• First row - Default algorithm.
• Second row - Stemming is initially disabled as described in 5.3.
• Last row - The algorithm is allowed to perform an additional attempt on a
new query as described in 5.2 and 5.4.
6.2.1 No stemming
Experiment on the index without stemming. In this case, optimizing with initially
disabling stemming as the second row suggests is cancelled out because no stemming
is used on the index at all. Data set 3 was tested.
              Correct  Correct  Wrong  Wrong faulty  Wrong  Precision  Recall  Accuracy
              split    non      non    split         split
Default         344      193     131        25         94      0.93      0.68     0.62
Base Index      344      193     131        25         94      0.93      0.68     0.62
Redo            365      175     101        34        112      0.91      0.73     0.67

Table 6.5. The results for data set 3 tested on the algorithm with no stemmer for case 2.
6.2.2 Swedish light
Experiment on the index with stemming using the Swedish light analyzer created
by Jacques Savoy [22]. This stemmer does not stem as aggressively as Elastic's
Swedish stemmer. Data set 3 was tested.
              Correct  Correct  Wrong  Wrong faulty  Wrong  Precision  Recall  Accuracy
              split    non      non    split         split
Default         347      184     115        38        103      0.90      0.69     0.67
Base Index      362      125     101        37        162      0.91      0.72     0.62
Redo            374      118      77        49        169      0.89      0.75     0.63

Table 6.6. The results for data set 3 tested on the algorithm with the Swedish light language analyzer for case 2.
6.2.3 Swedish
Experiment on the index with stemming using the Swedish stemmer based on the
Snowball stemmer [8] [24]. This stemmer uses a more aggressive approach when
stemming resulting in better recall but worse precision. More (relevant and irrelevant) data is collected with this stemmer. Data set 3 was tested.
              Correct  Correct  Wrong  Wrong faulty  Wrong  Precision  Recall  Accuracy
              split    non      non    split         split
Default         366      307      87        47         80      0.89      0.73     0.73
Base Index      371      143      95        34        144      0.92      0.74     0.65
Redo            377      123      73        50        164      0.88      0.75     0.64

Table 6.7. The results for data set 3 tested on the algorithm with Elastic's Swedish language analyzer for case 2.
6.2.4 Comparison against baseline
Comparison of the best optimization option (or lack of) of each stemming option
and the top suggestion in Sjöbergh and Kann’s compound splitter. Data set 4 was
tested.
                 Correct  Correct  Wrong  Wrong faulty  Wrong  Precision  Recall  Accuracy
                 split    non      non    split         split
No stemming        182      195     307        11        105      0.94      0.36     0.47
Light              184      155     287        29        145      0.86      0.37     0.60
Swedish            187      158     292        21        142      0.90      0.37     0.43
Sjöbergh Kann      441      290      32        27         10      0.94      0.88     0.91

Table 6.8. The results for data set 4 tested on the algorithm developed by Sjöbergh and Kann for case 2 [23].
The number of wrong nons in test case 2 was significantly lower by default
compared to test case 1. This had a large impact on the precision, which landed
around 0.90 for all cases.
Similarly to test case 1, applying a base index and redo increased the number of
correct splits. However, the impact on wrong nons was not as large as in test case 1.
This is probably because of the low number it had even before applying anything,
as mentioned in the previous paragraph.
The precision landed around 0.90 for all cases whereas recall and accuracy
ranged between 0.62 and 0.75. The algorithm is good at choosing the correct split
but less good at identifying compound words. More often than not it would rather
not split a word because the conditions were too weak, i.e. some but not every single
constituent could be found in any bundle of articles.
The developed algorithm had slightly lower precision than Sjöbergh and Kann's,
but the recall and accuracy were much lower. Sjöbergh and Kann's recall was 0.88
while the project's algorithm landed around 0.37. Sjöbergh and Kann's accuracy
was 0.91 while the project's algorithm ranged between 0.43 and 0.60.
Chapter 7
Discussion
7.1 Test case 1
7.1.1 Optimizations
Having a non-stemmed base index and redoing the retrieval for compound words
that were not found had a large positive impact on the number of correct splits and
wrong nons for test case 1. It is an effect of a higher precision of relevant articles to expand
and the increased quantity of retrieved articles and data. Redoing the retrieval on a
substring increased the chance of finding some relevant articles instead of not doing
any analysis at all and wrongly marking the word as a non-compound word.
However, this optimization also decreased the number of correct nons. Although
the algorithm got better at correctly splitting more compound words, it also got
worse at judging when not to split. The increased amount of data increased the
amount of relevant data but also the amount of less relevant data, which got the same
weight.
The number of wrong faulty splits decreased when applying ”merge”. Many suggestions were split at the same position, but the constituents were different forms of the same words, and they received the same score because of the stemming. An example is the suggestions for the compound word ”lekplats” (playground), which included ”leka plats” (to play ground) and ”lek plats” (play ground). The first suggestion would be chosen without merge. Using merge, however, ”lek plats” would be chosen instead, which is the correct one. Applying ”merge” increased the chance of choosing noun-noun splits, which are also statistically more common.
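A hypothetical sketch of the merge idea is given below: suggestions whose constituents reduce to the same stems are collapsed into a single candidate. Which surface form is kept and how the scores are combined are assumptions for illustration only; stem() stands in for the Swedish stemmer.

from collections import defaultdict

def merge_suggestions(suggestions, stem):
    # suggestions: list of (constituents, score), e.g. (["leka", "plats"], 1.0).
    merged = defaultdict(list)
    for constituents, score in suggestions:
        key = tuple(stem(c) for c in constituents)
        merged[key].append((constituents, score))
    result = []
    for variants in merged.values():
        # Assumed rule: keep the shortest surface form (e.g. "lek plats" rather
        # than "leka plats") and sum the scores of all merged variants.
        best = min(variants, key=lambda v: sum(len(c) for c in v[0]))[0]
        result.append((best, sum(score for _, score in variants)))
    return result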
7.1.2 Comparison to baseline
No filtering out of words which could not be found in Wikipedia was done when
comparing the algorithm against the baseline. Findwise’s default compound word
splitter was used as a baseline for test case 1.
Findwise’s compound word splitter beat the algorithm in all metrics but precision. This is partly because the splitter assumes all words are compound words
and should be split; it therefore does not judge whether a word should be split or not. This results in the splitter having 0 correct nons and 0 wrong nons. The number of compound words is much higher than the number of non-compound words (the ratio is approximately 3:1), which makes the metrics misleading and partial to a splitter that attempts to split all the words.
In conclusion, Findwise’s default splitter was better at splitting based on the quantity of correctly split compound words, but the algorithm developed in this thesis had a higher precision and chose the correct split more often. If both splitters were to split the same document, Findwise’s compound splitter would return more correctly split words but also more incorrect splits. The thesis’ splitter would return fewer correct splits but also fewer incorrect splits. The document split by this thesis’ algorithm would therefore be more understandable, although not as many compound words were split.
7.2 Test case 2

7.2.1 Optimization
The results on test case 2 were slightly better than on test case 1. However, that was also because the criterion for a correct split was much looser. There could be several correct suggestions (all suggestions which at least two annotators considered correct were judged as correct), hence there was a higher chance of scoring a correct one. This explains the low number of wrong nons.
One major difference from test case 1 was that although the number of correct splits increased after applying redo, the precision decreased. One common factor was that the number of wrong splits also increased. So, although redoing the search on a substring of the original word increased the quantity of correct splits, the algorithm also got worse at identifying words that should not be split. As mentioned for test case 1, this is due to the increased amount of data and the fact that all data (both very relevant and less relevant) has the same weight.
7.2.2 Comparison to baseline
The performance of this thesis’ splitter was worse than Sjöbergh and Kann’s. Since this data set included words that do not appear in Wikipedia, many words were returned as a whole, which partially caused the low recall and accuracy. Only around 40% of the compound words were correctly identified as compound words. The number of wrong faulty splits was otherwise low, and the precision was equal to or slightly worse than Sjöbergh and Kann’s.
7.3 Strengths
The algorithm had a high precision in the test cases and is good at choosing the correct split from a set of suggestions, thanks to the possibility of analysing context. Even
though many suggestions are technically correct, it was able to choose the most common interpretation and therefore the most reasonable one.
7.4 Weaknesses
Since the algorithm performed well on the task of choosing the correct split (shown by the high precision), the major bottleneck was that it was bad at identifying compound words. This can be seen by comparing the number of wrong nons with the number of wrong faulty splits. The number of wrong nons was always higher than the number of wrong faulty splits. The most extreme case is the last comparison against the baseline for test case 2, with 307 wrong nons against 11 wrong faulty splits for the ”No stemming” index.
A large reason for the high number of wrong nons was that the algorithm would mark a word as ”non-compound” if it did not find the compound word in Wikipedia. A lot of the compound words were not found in Wikipedia, and this poor decision affected the result negatively. Trying to split all the words even though they were not found in Wikipedia would increase the number of correct splits but at the same time increase the number of wrong splits (attempts to split non-compound words).
7.5 Further work
To counter the issue of not being able to find relevant context/articles, further work should focus on studying methods to capture such context. This would decrease the number of wrong nons and improve the low recall and accuracy, since many of the errors arose from not finding any articles containing the word.
Methods worth studying to find more and better contexts include improving the data source, using synonyms, and studying different substrings of the word that can capture relevant context.
Improving the data source can be done by changing to or adding other encyclopedias, e.g. Nationalencyklopedin, which was mentioned in the background. Another suggestion is to change the way the context is expanded. In this project the order of the base articles and categories was not changed after retrieval, and they were therefore alphabetically ordered. An improvement would be to first sort them based on some criterion, e.g. popularity or size. This would probably increase the relevance of the data and result in more accurate data for the algorithm to base the compound-splitting decision on. Expanding by backlinks instead of categories is also worth trying.
Since many compound words are rare, especially long ones, studying different substrings of the compound words in combination with synonyms could increase the number of articles found for rare words.
Further work on improving the precision can be done by giving words different weights. As mentioned earlier, all words get the same weight as long as they are found in the same context/category. A word appearing in the same document
as the whole compound word is therefore given the same weight as a word appearing in a different document but in the same category. To improve the analysis, words that are found closer to the original compound word should weigh heavier. Implementing stop words is also worth trying. A small sketch of such a weighting is given below.
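The following hypothetical Python sketch illustrates the proposed weighting: terms from the article that actually contains the compound word count more than terms that only come from related articles in the same category. The function and the weight values are placeholders, not part of the thesis implementation.

def weighted_term_counts(base_article_terms, related_articles_terms,
                         base_weight=2.0, related_weight=1.0):
    # base_article_terms: terms from the article containing the compound word.
    # related_articles_terms: lists of terms from related articles in the same
    # category. The weights are arbitrary placeholder values.
    counts = {}
    for term in base_article_terms:
        counts[term] = counts.get(term, 0.0) + base_weight
    for article_terms in related_articles_terms:
        for term in article_terms:
            counts[term] = counts.get(term, 0.0) + related_weight
    return counts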
Chapter 8
Conclusion
The method developed in this thesis yielded a high precision but low recall and accuracy. It chose the correct split most of the time but had difficulties separating compound words from non-compound words.
A major limitation is that it is highly dependent on the compound words appearing in Wikipedia, which has a large impact on the result when comparing to other algorithms.
The approach is good at identifying the correct split when the compound word is a common word; for other use cases there are other algorithms that perform better.
Bibliography
[1] Carina Ahlin. Spargris – Är det en gris som man sparar till jul? 2016. url: https://gupea.ub.gu.se/bitstream/2077/29653/1/gupea_2077_29653_1.pdf (visited on 01/24/2016).
[2] Johan van der Auwera and Ekkehard König. The Germanic languages. Routledge, 1994.
[3] Réka Benczes. Creative compounding in English. John Benjamins Pub. Co., 2006, pp. 8–9.
[4] Sven Eriksson and Carl-Johan Markstedt. Svenska Impulser 2 - Språkets byggstenar. 2016. url: http://sanomautbildning.se/upload/SvenskaImpulser2_sid455_474.pdf (visited on 01/24/2016).
[5] Joachim Daiber et al. “Splitting Compounds by Semantic Analogy”. In: arXiv preprint arXiv:1509.04473 (2015).
[6] Corina Dima and Erhard Hinrichs. “Automatic Noun Compound Interpretation using Deep Neural Networks and Word Embeddings”. In: IWCS 2015 (2015), p. 173.
[7] Elastic. Elastic | We’re About Data. 2015. url: https://www.elastic.co/about (visited on 12/22/2015).
[8] ElasticSearch. Stemmer Token Filter. 2016. url: https://www.elastic.co/guide/en/elasticsearch/reference/2.0/analysis-stemmer-tokenfilter.html (visited on 12/20/2016).
[9] Claes-Christian Elert. Ordbildning. 2016. url: https://en.wikipedia.org/wiki/Compound_(linguistics) (visited on 01/23/2016).
[10] The Apache Software Foundation. APACHE SOLR™ 5.4.1. 2016. url: http://lucene.apache.org/solr/ (visited on 01/15/2016).
[11] Tetsu Fujisaki et al. “A probabilistic parsing method for sentence disambiguation”. In: Current Issues in Parsing Technology. Springer, 1991, pp. 139–152.
[12] Daniel Jurafsky and James H. Martin. Speech and Language Processing. Pearson Education, 2009, p. 895. isbn: 9780135041963.
[13] Philipp Koehn and Kevin Knight. “Empirical methods for compound splitting”. In: Proceedings of the tenth conference on European chapter of the Association for Computational Linguistics – Volume 1. Association for Computational Linguistics. 2003, pp. 187–193.
[14] Mark Lauer. “Designing statistical language learners: Experiments on noun compounds”. In: arXiv preprint cmp-lg/9609008 (1996).
[15] Nationalencyklopedin. Uppslagsverket. 2015. url: http://www.ne.se/info/tj%C3%A4nster/uppslagsverket (visited on 12/18/2015).
[16] M. Nayrolles. Mastering Apache Solr: A practical guide to get to grips with Apache Solr. Inkstall Solutions LLC, 2014. isbn: 9788192784502. url: https://books.google.se/books?id=HSWeAwAAQBAJ.
[17] Lena Öhrman. “Felaktigt särskrivna sammansättningar”. In: C-uppsats i datorlingvistik, Institutionen för lingvistik, Stockholms universitet (1998).
[18] Apache OpenOffice. Förbättra vår ordlista. 2016. url: https://www.openoffice.org/sv/dsso.html (visited on 01/26/2016).
[19] Norstedts ordböcker. Ord.se. 2016. url: http://www.ord.se/oversattning/engelska/?s=element&l=ENGSVE (visited on 12/20/2016).
[20] Maja Popović, Daniel Stein, and Hermann Ney. “Statistical machine translation of German compound words”. In: Advances in Natural Language Processing (2006), pp. 616–624.
[21] Ulrike Rackow, Ido Dagan, and Ulrike Schwall. “Automatic translation of noun compounds”. In: Proceedings of the 14th conference on Computational linguistics – Volume 4. Association for Computational Linguistics. 1992, pp. 1249–1253.
[22] Jacques Savoy. “Report on CLEF-2001 experiments: Effective combined query-translation approach”. In: Evaluation of Cross-Language Information Retrieval Systems. Springer Berlin Heidelberg. 2002, pp. 27–43.
[23] Jonas Sjöbergh and Viggo Kann. “Vad kan statistik avslöja om svenska sammansättningar”. In: Språk och Stil 16 (2006), pp. 199–214.
[24] Snowball. Swedish Stemming Algorithm. 2016. url: http://snowball.tartarus.org/algorithms/swedish/stemmer.html (visited on 12/20/2016).
[25] Radu Soricut and Franz Och. “Unsupervised morphology induction using word embeddings”. In: Proc. NAACL. 2015.
[26] Språkteknologi.se. Den stora svenska ordlistan, svensk rättstavningslista. 2016. url: http://sprakteknologi.se/resurser/sprakresurser/den-storasvenska-ordlistan-svensk-raettstavningslista-foer-open-office (visited on 01/26/2016).
[27] sv.wikipedia.org. Kategorier. 2016. url: https://sv.wikipedia.org/wiki/Special:Kategorier (visited on 11/26/2015).
[28] Wikipedia. List of Wikipedias. 2016. url: https://en.wikipedia.org/wiki/List_of_Wikipedias (visited on 01/21/2016).
[29] Wikipedia. Pages that link to "Table tennis". 2016. url: https://en.wikipedia.org/wiki/Special:WhatLinksHere/Table_tennis (visited on 01/10/2016).
[30] Wikipedia. Table tennis. 2016. url: https://en.wikipedia.org/wiki/Table_tennis (visited on 01/10/2016).
[31] Wikipedia. Wikipedia:Manual of Style/Linking. 2016. url: https://en.wikipedia.org/wiki/Wikipedia:Manual_of_Style/Linking (visited on 01/10/2016).