Morpheme Segmentation Gold Standards for Finnish and English

Mathias Creutz
Helsinki University of Technology∗
[email protected]

Krister Lindén
Helsinki University of Technology∗, University of Helsinki†
[email protected]
[email protected]

∗ Neural Networks Research Centre, Helsinki University of Technology,
P.O.Box 5400, FIN-02015 HUT, Finland
† University of Helsinki, Department of General Linguistics,
P.O.Box 9, FIN-00014 University of Helsinki, Finland
Abstract
This document describes Hutmegs, the Helsinki University of Technology Morphological Evaluation Gold Standard package, which contains gold-standard morphological
segmentations for 1.4 million Finnish and 120 000 English words. The Gold Standards comprise surface-string, or allomorph, segmentations of word forms, as well as
deep-level, or morpheme, segmentations of the words. The segmentations have been
produced semi-automatically and are based on existing resources: the two-level morphological analyzer for Finnish (FINTWOL) and the English CELEX database. For
some cases where the transition between two morphemes does not appear clear-cut, so-called
“fuzzy morpheme boundaries” have been marked as an option. The Hutmegs
package also contains evaluation scripts that allow the user to compute the accuracy, relative to the Gold Standard, of a segmentation produced by a morphology-learning algorithm. The use of Hutmegs is free for academic purposes, but in order to
access the gold-standard segmentations, inexpensive licenses must be purchased from
Lingsoft Inc. (for Finnish) and the Linguistic Data Consortium (for English).
1 Introduction
With the emergence of large amounts of textual data in several languages the prospects
for designing algorithms that are capable of acquiring language in an unsupervised
manner from data seem more and more promising. Also due to the large amounts of
data available there is an increasing need for minimally supervised natural language
processing systems.
However, a crucial point in the development of data-driven algorithms is the evaluation. The lack of gold-standard references makes it difficult for researchers to assess
the performance of their algorithms and to compare them to algorithms developed by
others. It is our hope that by providing morphological segmentations for large numbers
of Finnish and English word forms, together with software for evaluating segmentation
accuracy, we can facilitate the benchmarking and comparison of different algorithms.
In the field of unsupervised learning of the morphology of a natural language, segmentation of word forms into morphemes (or morpheme-like units) is commonly considered an important goal. There are a number of data-driven algorithms that work
more or less without supervision and induce, from nothing more than raw text, plausible morpheme segmentations for the words occurring in the text, e.g., (Goldsmith,
2001; de Marcken, 1996; Déjean, 1998; Baayen and Schreuder, 2000; Creutz and Lagus, 2002; Creutz, 2003; Creutz and Lagus, 2004). Morphemes have been defined
in linguistic theory as the smallest meaningful units of language. A morpheme corresponds to a meaning and it functions as the smallest element in the syntax of the
language (Matthews, 1991). Therefore, morphemes can conceivably be very useful
from the point of view of artificial language production or understanding and in applications, such as speech recognition (Siivola et al., 2003), machine translation and
information retrieval.
In writing systems where word boundaries are not explicitly marked, word or morpheme segmentation is the first necessary step for any natural language processing task
dealing with written text. Languages employing such writing systems comprise, e.g.,
Chinese and Japanese. The need for common standards for segmentation (in addition
to part-of-speech tagging and syntactic bracketing) is apparent. Within the Penn Chinese Treebank project (Xue et al., 2004) a 100 000 word corpus of Mandarin Chinese
has been segmented into words, tagged with part-of-speech tags and provided with
syntactic bracketing.
In Western languages, there are spaces between the words, and word segmentation
of written text is trivial. However, a large amount of work has gone into the annotation
of corpora, e.g., part-of-speech tagging, morphological analysis and syntactic bracketing. For American English, the Penn Treebank (Marcus et al., 1993) is an example of
existing resources.
A more detailed annotation of the morphological structure of words can be found
in the CELEX databases of English, Dutch and German (Baayen et al., 1995). Among
other things, the databases provide information on the derivational and compositional
structure as well as inflectional paradigms of tens of thousands of word forms.
Corresponding morphological analyses of word forms, though less detailed, can be
obtained using software based on the two-level morphology of Koskenniemi (1983).
Such TWOL analyzers exist for, e.g., Finnish, the Scandinavian languages (Swedish,
Danish, Norwegian), English and German.1
What the existing analyzers and databases lack, however, is an explicit morpheme
segmentation of the surface forms of the words. In the case of the annotated Chinese
corpus, the text has indeed been segmented into words. In contrast, the morphological
analyses provided by CELEX and TWOL consist of base forms of the words together
with morphosyntactic tags. The emphasis of the TWOL is on inflectional morphology.
Additionally, a set of tags indicating derivational endings exists and the boundaries
within compound words are usually marked. CELEX provides a detailed description
of inflectional category together with derivational and compositional structure. This
information can be seen as a morpheme segmentation of a word, but the morphemes
are not indicated as they are realized on the surface, as word segments or allomorphs,
but as deep-level morphemes (or base forms), e.g., the English word ‘bacteriologist’
yields the segmentation ‘bacterium+ology+ist’.
We have produced segmentations of the surface forms of both Finnish and English
words. We propose the use of these segmentations as a reference, or Gold Standard,
that can be freely used for research purposes, e.g., for the evaluation of unsupervised
morphology-learning algorithms or as sub-word units for language modelling in automatic speech recognition. Our work consists in processing the output of the Finnish
TWOL and the contents of the English CELEX database to produce an alignment
between surface, or allomorph, segmentation and deep-level, or morpheme, segmentation, as in the following examples:2
tieteellisessä    tietee:tiede|N llise:DN-LLINEN ssä:INE
bacteriologist    bacteri:bacterium|N olog:ology|s ist:ist|s
The Finnish Gold Standard contains segmentations for 1.4 million distinct word
forms (word types). The English Gold Standard contains segmentations for 120 000
word types. The locations of morpheme boundaries in the surface form are not always
obvious and the interpretation chosen by us relies on (Hakulinen, 1979) for Finnish
and (Quirk et al., 1985) for English.
The segmentation has been performed semi-automatically with the help of rulesets
and a number of scripts. The necessary steps for arriving at the final segmentation are
described briefly in the following sections, together with information on formats and
access to files and license issues. In addition, we introduce so-called “fuzzy morpheme
boundaries”, which can be applied in cases where it is inconvenient to define one exact
transition point between two morphemes.
1 Licenses can be obtained from Lingsoft, Inc. URL: http://www.lingsoft.fi
2 The Finnish word ‘tieteellisessä’ means ‘in [the] scientific’.
1.1 Concatenative morphology
A segmentation into morphemes implies that words are formed mainly through a concatenation process. Our Gold Standard relies on the Item and Arrangement model of
morphology: Words consist of sequences of morphemes and due to some restrictions
in the language only a certain realization of a morpheme is possible in a particular
context. This realization variant is called an allomorph. An example of allomorphs in
Finnish is the inessive case ending, which is realized as ‘-ssa’ or ‘-ssä’ depending on
the preceding word stem, e.g., ‘talo+ssa’, ‘metsä+ssä’.
Another type of concatenative morphology is represented by the Item and Process
model. Also in this model, words are thought to consist of sequences of morphemes.
When the morphemes are joined together they trigger morpho-phonological processes,
which alter their realization. As an example we can say that the basic form of the
Finnish inessive ending is ‘-ssA’, where the capital letter ‘A’ represents an open unrounded vowel, which becomes the open, unrounded back vowel ‘a’, when the stem
contains back vowels; and the open, unrounded front vowel ‘ä’ otherwise.
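The Item and Process description of the inessive ending can be illustrated with a minimal Python sketch. It is only an illustration of the rule as stated above, not part of Hutmegs: the function name `inessive` and the set `BACK_VOWELS` are hypothetical, and the rule is deliberately simplified (it looks for a back vowel anywhere in the stem and ignores, e.g., compounds and loan words).

```python
# Simplified vowel-harmony rule for the abstract ending '-ssA'.
# Assumption: a stem counts as "back" if it contains any of a, o, u.
BACK_VOWELS = set("aou")


def inessive(stem):
    """Attach '-ssA' to a stem, realizing 'A' as 'a' or 'ä'."""
    has_back_vowel = any(ch in BACK_VOWELS for ch in stem.lower())
    vowel = "a" if has_back_vowel else "ä"
    return stem + "ss" + vowel


print(inessive("talo"))   # talossa
print(inessive("metsä"))  # metsässä
```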
In our gold-standard segmentation no morpho-phonological processes have been
marked explicitly. However, we believe that the Gold Standard can be useful also in
the evaluation of algorithms relying on the Item and Process model. For instance, such
an algorithm would first segment the words ‘talossa’ and ‘metsässä’ into ‘talo+ssa’ and
‘metsä+ssä’, and then learn phonological processes, enabling it to conclude that ‘-ssa’
and ‘-ssä’ are realizations of the same morpheme. Since the Gold Standard contains an
alignment between allomorphs and their underlying morphemes, it is straight-forward
to evaluate how accurately the algorithm discovers these relationships.
As for other models of morphology, which do not assume that words are formed
by concatenation, our Gold Standard is less informative. For instance, the patterns of
vowel change in so-called strong verbs in English have not been indicated, e.g., ‘sing’–
‘sang’–‘sung’, ‘ring’–‘rang’–‘rung’. All three forms of either verb are merely marked
as allomorphs of the base forms ‘sing’ and ‘ring’, respectively.
1.2 Fuzzy locations of morpheme boundaries
In some cases, the “linguistically correct” location of a morpheme boundary may not
seem the only plausible solution. Historic development of the language may affect the
way linguists describe the contemporary morphology. However, from the point of view
of natural language applications, this may not be the optimal description.
In our reference segmentations, there is a notation for marking “fuzziness” of morpheme boundaries. The fuzziness consists in alternative locations for the same morpheme boundary, i.e., the boundary does not have an unambiguous location. Users of
our Gold Standard can choose whether they want to evaluate their own segmentation
against the one correct “linguistic” segmentation, or against the fuzzy segmentation.
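The choice between strict and fuzzy evaluation can be sketched as follows. This is not the evaluation script shipped with Hutmegs; it is a hypothetical illustration in which a proposed segmentation is scored by boundary precision and recall against whichever accepted gold-standard alternative agrees with it best. The function names and the choice of metric are assumptions.

```python
def boundaries(segmentation):
    """Character offsets of the morpheme boundaries in, e.g., 'invit+ed'."""
    positions, offset = set(), 0
    for segment in segmentation.split("+")[:-1]:
        offset += len(segment)
        positions.add(offset)
    return positions


def precision_recall(proposed, gold_alternatives):
    """Score a proposal against the best-matching accepted segmentation."""
    proposed_set = boundaries(proposed)
    best, best_f = (0.0, 0.0), -1.0
    for gold in gold_alternatives:
        gold_set = boundaries(gold)
        hits = len(proposed_set & gold_set)
        # Convention (assumption): an empty boundary set scores 1.0.
        precision = hits / len(proposed_set) if proposed_set else 1.0
        recall = hits / len(gold_set) if gold_set else 1.0
        total = precision + recall
        f_score = 2 * precision * recall / total if total else 0.0
        if f_score > best_f:
            best, best_f = (precision, recall), f_score
    return best


strict = ["invite+s"]              # only the "linguistic" segmentation
fuzzy = ["invite+s", "invit+es"]   # fuzzy alternative also accepted
print(precision_recall("invit+es", strict))  # (0.0, 0.0)
print(precision_recall("invit+es", fuzzy))   # (1.0, 1.0)
```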
We allow fuzziness as follows: If at the end of a morpheme, there is one phoneme
(or sometimes more) that may be totally absent in some allomorphs of the morpheme,
this phoneme is considered to lie on a fuzzy boundary between two morphemes. (The
latter morpheme is always a suffix.) The phoneme is on the fuzzy boundary only
if it alternates phonologically with a “null phoneme”, not if it is replaced by another phoneme. This is a somewhat arbitrary definition, but our motivation is that the
phoneme (or phonemes) seems to be a seam, or a joint, which is not always needed.
If the “joint phoneme” is present only in combination with some following suffixes, it
could be considered part of the suffix as easily as part of the preceding morpheme.
For instance, in English, the stem-final ‘e’ in verbs is dropped in some forms. The
user of the Gold Standard can choose whether to consider only the traditional linguistic
segmentation correct, as in:
invite, invite+s, invit+ed and invit+ing,
or whether also to allow for an alternative interpretation, where the ‘e’ is considered
part of the suffix, as in:
invit+e, invit+es, invit+ed and invit+ing.
In the former case, there are two allomorphs of the stem (‘invite’ and ‘invit’), and
one allomorph for the suffixes. In the latter case, there is only one allomorph of the
stem (‘invit’), whereas there are two allomorphs of the third person in the present tense
(‘-s’ and ‘-es’) and an additional infinitive ending (‘-e’). Since there is a much greater
number of different stems than suffixes in the English language, the latter interpretation
lends itself to more compact Item and Arrangement models of morphology. The same
reasoning applies to the Finnish language and some examples of fuzzy morpheme
boundaries for Finnish will be presented in Section 2.6.
2 Finnish morpheme segmentations
This section describes the steps in the process of acquiring morpheme segmentations
for 1.4 million Finnish word forms, as well as the notation used for the final segmentations.
2.1 Data
A 32 million word corpus was compiled containing Finnish text from books and newspapers as well as newswires. The newswires originate from the Finnish National News
Agency and the rest of the material from the Finnish Center for Scientific Computation
(CSC).3
A word list was produced containing each word type occurring in this corpus, i.e.,
one occurrence of every distinct word form. All upper-case letters were converted
to lower-case. Compound words containing hyphens were stored both as entire words
and as separate words split at the hyphens. (The latter approach was due to the fact that
in Finnish hyphens are almost as obvious delimiters as spaces, and that morphologylearning algorithms may benefit from this knowledge and treat hyphens as word breaks
in the pre-processing.) The resulting word list contained 1.7 million word types.
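The construction of the word list can be sketched in a few lines of Python. This is an illustrative sketch rather than the script actually used: the function name `build_word_list` is hypothetical and tokenization of the corpus is assumed to have been done already.

```python
def build_word_list(tokens):
    """Collect lower-cased word types; hyphenated words are stored both
    as entire words and as separate words split at the hyphens."""
    word_types = set()
    for token in tokens:
        word = token.lower()
        word_types.add(word)
        if "-" in word:
            # Skip empty parts caused by leading or trailing hyphens.
            word_types.update(part for part in word.split("-") if part)
    return sorted(word_types)


print(build_word_list(["Viljo-eno", "talo", "Talo"]))
# ['eno', 'talo', 'viljo', 'viljo-eno']
```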
2.2 TWOL analyses
The obtained word list was processed using the FINTWOL analyzer (a morphological
transducer lexicon description for Finnish), licensed from Lingsoft, Inc. Out of the
1.7 million word forms, 1.4 million were recognized by TWOL, and for these word
forms a morphological analysis was obtained.
The analysis consists of the base form of the word together with morphosyntactic
tags. Also tags corresponding to derivational endings exist and the boundaries within
compound words are marked with a number sign ‘#’. In case of ambiguous readings,
analyses for all alternative interpretations are produced.
We shall follow the development of the segmentation of a particular word: ‘kahvinjuojia’ ([some] coffee-drinkers). The TWOL analysis looks like this:
"kahvin#juoja" DV-JA N PTV PL
The base form is ‘kahvinjuoja’ (coffee-drinker), which contains an agentive suffix (DV-JA). The word is a noun (N) and inflected into the partitive case (PTV) plural (PL).
2.3 Deep-level morpheme segmentation
The TWOL analysis is not a morpheme segmentation. First of all, the base form is a
compound, and only its last part (‘juojia’) has so far been analyzed. The morphological
structure of the first part (‘kahvin’) is now obtained by running it alone through the
TWOL analyzer, which yields:
"kahvi" N GEN SG
The base form is ‘kahvi’ (coffee), which is a noun (N) and inflected into the genitive
case (GEN) singular (SG).
3 URL: http://www.csc.fi/index.phtml.en
By appending the two analyses, we almost obtain a deep-level morpheme segmentation:
kahvi N GEN SG juoja DV-JA N PTV PL
Next, tags corresponding to no surface morphemes, e.g., the null morpheme for singular (SG), are removed and the part-of-speech information is appended to the preceding stem label (if the preceding label corresponds to a stem). The tags are re-ordered
to reflect the order in which the morphemes are realized in the surface form:
kahvi|N GEN juoja DV-JA PL PTV
This representation now corresponds to a deep-level morpheme representation, or
a “base-form morpheme segmentation”. Each morpheme label corresponds to an allomorph that is present in the word form. One problem remains, however: Unlike the
TWOL, we want to treat inflection and derivation equally. Thus, the stem morpheme
for ‘juojia’ (drinkers) should be ‘juoda’ (to drink) instead of ‘juoja’ (drinker). We will
come back to this in Section 2.5 below.
2.4 Surface-level allomorph segmentation
The surface form of the word can be aligned with the deep-level morpheme segmentation in order to get a surface-level segmentation. Each segment will represent an
allomorph (a realization variant) of an underlying morpheme. The alignment of our
example word ‘kahvinjuojia’ looks like this:
Surface allomorphs      kahvi    n    juo    j      i   a
Underlying morphemes    kahvi|N  GEN  juoja  DV-JA  PL  PTV
In order to obtain a linguistically correct alignment a ruleset has been devised. The
ruleset is based on (Hakulinen, 1979) and defines, e.g., all possible allomorph realizations of the suffix tags. The rules state, for instance, that the plural of nominals (PL) is
realized as ‘-i-’, ‘-j-’, or ‘-t’.
2.5 Deep-level morpheme segmentation revisited
The morpheme labels preceding derivational endings are not yet correct (cf. Section 2.3 above). At this stage, the surface segmentation is utilized. All instances in
all segmented words of the allomorph ‘juo’, except those preceding derivational suffixes, are investigated and the corresponding underlying morphemes are collected. If
the morpheme is unambiguous, such as in the case of ‘juo’, which is always an allomorph of the verb ‘juoda’ (to drink), the labels are completed automatically:
kahvi|N GEN juoda|V DV-JA PL PTV
In case of ambiguities, such as the allomorph ‘tunne’, which can correspond to
either the noun ‘tunne’ (feeling) or the noun ‘tunti’ (hour), the completions have been
made manually.
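The completion step can be sketched as follows. The sketch is hypothetical and only mirrors the description above: the function names are our own, and the input is assumed to be a list of (allomorph, morpheme) pairs collected from positions that do not precede a derivational suffix.

```python
from collections import defaultdict


def collect_candidates(observations):
    """Map each stem allomorph to the set of morphemes it realizes."""
    candidates = defaultdict(set)
    for allomorph, morpheme in observations:
        candidates[allomorph].add(morpheme)
    return candidates


def complete_label(allomorph, candidates):
    """Return the morpheme label if it is unambiguous, otherwise None
    (ambiguous or unseen allomorphs are left for manual completion)."""
    options = candidates.get(allomorph, set())
    if len(options) == 1:
        return next(iter(options))
    return None


observed = [
    ("juo", "juoda|V"), ("juo", "juoda|V"),
    ("tunne", "tunne|N"), ("tunne", "tunti|N"),
]
table = collect_candidates(observed)
print(complete_label("juo", table))    # juoda|V
print(complete_label("tunne", table))  # None
```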
2.6 Fuzzy morpheme boundaries
The motivations for so called “fuzzy morpheme boundaries” have been given in Section 1.2. There are a number of cases where fuzzy boundaries have been allowed
as alternative segmentations for Finnish words. In the following, some examples are
given:
The proper name ‘Windsor’ has three allomorphs in Finnish: ‘Windsor’ (nominative singular, genitive plural), ‘Windsori’ (oblique cases in singular, nominative
plural), and ‘Windsore’ (oblique cases in plural). The following segmentations are linguistically conventional, e.g., ‘Windsor’, ‘Windsori+n’, ‘Windsori+lla’, ‘Windsori+t’,
‘Windsor+i+en’, ‘Windsore+i+lla’. Since the final vowel of the stem is not always
present, it belongs to a fuzzy boundary, and can therefore also be attached to the ending: ‘Windsor’, ‘Windsor+in’, ‘Windsor+illa’, ‘Windsor+it’, ‘Windsor+i+en’, ‘Windsor+ei+lla’.
The adjective ‘hapan’ (sour) has four allomorphs: ‘hapan’, ‘happama’, ‘happame’,
and ‘happam’. Only the final vowels ‘a’ and ‘e’ in ‘happama’/‘happame’ are on a fuzzy
boundary. (The consonants ‘n’ and ‘m’ alternate with each other.) E.g., traditionally
we have: ‘hapan+ta’, ‘happama+lla’, ‘happame+sti’, ‘happam+i+a’, ‘happam+uus’;
due to fuzziness we also allow: ‘happam+alla’, ‘happam+esti’.
Sometimes the baseform of a word ends in a fuzzy phoneme, e.g., the adjective
‘mukava’ (nice, comfortable) having the allomorphs ‘mukava’ and ‘mukav’. Some traditional segmentations are ‘mukava’, ‘mukava+n’, ‘mukav+i+a’, ‘mukav+uus’. As the
final ‘a’ is fuzzy, we obtain the alternative segmentation ‘mukav+an’ for ‘mukavan’.
In order to obtain a consistent inflection paradigm, we also allow the segmentation
‘mukav+a’ for ‘mukava’. That is, the baseform is allowed to have an ending ‘-a’, even
though there is no ending according to the traditional view.
An extensive list of the cases where fuzzy boundaries are applied for Finnish is
shown in Table 1.
Table 1: List of all morpheme-final Finnish phonemes that can lie on a fuzzy boundary,
and thus are allowed to be attached to either the morpheme preceding or following the
morpheme boundary. Examples are given of words segmented into morphemes, where
the fuzzy phonemes are present (column labelled “Allomorph with fuzzy phonemes”),
and missing (column labelled “Without”). The morphemes in question are rendered in
bold-face and their part of speech is indicated in the second column from the left. The
phonemes on the fuzzy boundary are rendered in italics. The selected words illustrate
how fuzziness is applied in different inflection paradigms.

Fuzzy     Part of Example morpheme segmentations
phonemes  speech  Allomorph with fuzzy phonemes                   Without
-a        Adj.    hauska+n, hausk+an                              hausk+uus
          Noun    koira, koir+a                                   koir+i+a
          Verb    laula+vat, laul+avat                            laul+ele+vat
          Suffix  sano+tta+isi+in, sano+tt+aisi+in                sano+tt+i+in
          Noun    Tuomaa+n, Tuoma+an                              Tuomas
-e        Adj.    hauske+mpi, hausk+empi                          hausk+uus
          Noun    hyttyse+stä, hyttys+estä                        hyttys+i+stä
          Noun    kahve+i+ssa, kahv+ei+ssa                        kahv+i+en
          Verb    pääte+tä+än, päät+etä+än                        päät+i+t
          Suffix  laul+ele+vat, laul+el+evat                      laul+el+i+vat
          Noun    Tuomakse+n, Tuomaks+en                          Tuomas
          Noun    tähde+n, tähd+en                                tähd+i+stä
          Noun    venee+seen, vene+eseen                          vene
          Noun    Windsore+i+lla, Windsor+ei+lla                  Windsor
-i        Adj.    kallii+ksi, kalli+iksi                          kallis
          Noun    kahvi, kahv+i                                   kahv+i+en
          Noun    tähti, täht+i                                   tähd+i+stä
          Noun    Windsori+lla, Windsor+illa                      Windsor
-n        Verb    vastan+nut, vasta+nnut                          vasta+us
-o        Adj.    hausko+i+ssa, hausk+oi+ssa                      hausk+uus
          Verb    laulo+i+vat, laul+oi+vat                        laul+el+i+vat
          Suffix  valitsi+jo+i+ta, valitsi+j+oi+ta                kannatta+j+i+a
-s        Verb    vastas+i, vasta+si                              vasta+us
-t        Verb    vastat+a, vasta+ta                              vasta+us
          Noun    venet+tä, vene+ttä                              vene
-ä        Noun    pyörä+ni, pyör+äni                              pyör+i+ltä
          Adj.    iäkkää+llä, iäkkä+ällä                          iäkäs
          Adj.    jämäkkä+nä, jämäkk+änä                          jämäkk+yys
          Suffix  jämäkä+mpä+nä, jämäkä+mp+änä                    jämäkä+mp+i+in
          Verb    päätä+mme, päät+ämme                            päät+i+mme
-ö        Adj.    jämäkö+i+tä, jämäk+öi+tä                        jämäkk+yys
-an       Num.    kahdeksan, kahdeksa+n, kahdeks+an               kahdeks+i+ssa
-en       Num.    kymmenen, kymmene+n, kymmen+en                  kymmen+i+ä
-ne       Verb    pahene+e, pahen+ee, pahe+nee                    pahe+nta+a
-se       Verb    narise+e, naris+ee, nari+see                    nari+na
-si       Verb    narisi+ja, naris+ija, nari+sija                 nari+na
-ts       Verb    valits+i, valit+si, vali+tsi                    vali+nta
-än       Num.    yhdeksän, yhdeksä+n, yhdeks+än                  yhdeks+i+ssä
-tse      Verb    valitse+n, valits+en, valit+sen, vali+tsen      vali+nta
-tsi      Verb    valitsi+ja, valits+ija, valit+sija, vali+tsija  vali+nta
2.7 Syntax of the Finnish gold-standard segmentation file
In the Finnish gold-standard segmentation file each row has the format:
<wordform><TAB><segmentations><NEWLINE>
<wordform> and <segmentations> can contain any printable characters except
space, tab, newline, and carriage return. There are some characters with a special
function. If the special function is not intended, the character in question is preceded
by the escape character \ (backslash). Backslash itself is written as \\ (double backslash). <wordform> is the word for which a morphological analysis, i.e., segmentation, is available. All characters are lower-case. <segmentations> contains one or
several alternative analyses for <wordform>. The alternatives are separated using ,
(comma). Each segmentation consists of chunks, which are separated by space. Each
chunk consists of two parts, separated by : (colon). The first part is the allomorph (the
surface segment of the word) and the second part is the morpheme, i.e., the base form
or deep-level representation. The format of a segmentation is thus:
<allomorph1>:<morpheme1> <allomorph2>:<morpheme2> ...
E.g.,
arvoamme    arvo:arvo|N a:PTV mme:1PL, arvo:arvo|N amme:amme|N
The word ‘arvoamme’ has two interpretations: ‘arvo+a+mme’ (‘of our value’) and the
improbable ‘arvo+amme’ (‘valuable bathtub’).
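The row format described above can be read with a short parser. The following is a minimal sketch under stated assumptions, not the Hutmegs tooling: the function names `split_unescaped` and `parse_row` are hypothetical, escape sequences are kept verbatim rather than decoded, and the optional blank after the comma is tolerated as in the example row.

```python
def split_unescaped(text, separator):
    """Split on a one-character separator unless it is backslash-escaped."""
    parts, current, i = [], [], 0
    while i < len(text):
        ch = text[i]
        if ch == "\\" and i + 1 < len(text):
            current.append(text[i:i + 2])  # keep the escape sequence intact
            i += 2
            continue
        if ch == separator:
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
        i += 1
    parts.append("".join(current))
    return parts


def parse_row(row):
    """Return (wordform, analyses); each analysis is a list of
    (allomorph, morpheme) chunks."""
    wordform, segmentations = row.rstrip("\r\n").split("\t", 1)
    analyses = []
    for alternative in split_unescaped(segmentations, ","):
        chunks = []
        for chunk in alternative.split():
            allomorph, morpheme = split_unescaped(chunk, ":")
            chunks.append((allomorph, morpheme))
        analyses.append(chunks)
    return wordform, analyses


row = "arvoamme\tarvo:arvo|N a:PTV mme:1PL, arvo:arvo|N amme:amme|N\n"
word, analyses = parse_row(row)
print(word)           # arvoamme
print(analyses[0])    # [('arvo', 'arvo|N'), ('a', 'PTV'), ('mme', '1PL')]
```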
Table 2: Finnish part-of-speech tags.
A       adjective
ADV     adverb
A/N     adjective or noun
CLI     clitic (suffix-like particle, e.g., -kin, -kaan, -pa, -ko)
N       noun
NUM     numeral
PFX     prefix
PRON    pronoun
V       verb
2.7.1 The morpheme part of a chunk
The second part of each chunk in the segmentation can consist of a string in lowercase, which is the base form of the morpheme, usually followed by | (vertical line)
and a part-of-speech tag, e.g., arvo|N; indicating that ‘arvo’ (value) is a noun stem.
Sometimes the vertical line and part-of-speech tag are missing, which is the case when
the part-of-speech information is missing in FINTWOL or when the part of speech
does not belong to the major parts of speech listed in Table 2.
The morpheme part of the chunk can also consist of a suffix label, which is written
in capital letters, e.g., PTV. The first or last character can, however, be a digit: 1, 2, 3,
or 4, e.g., 1PL, and the label can contain a hyphen, e.g., DV-MA. Tables 3 and 4 show
the possible suffix labels (inflectional suffixes in Table 3 and derivational suffixes in
Table 4). Note that for Finnish we have chosen not to include null morphemes, i.e.,
morphemes that are not realized as segments of words. Therefore there are no suffix
labels for, e.g., nominative case, singular number, or present tense.
The morpheme part of a chunk can also consist of a single ∼ (tilde sign), which
means that a segment of the word corresponds to no underlying morpheme. The only
case where this applies is the hyphen in compounds, such as:
viljo-eno    viljo:viljo|N -:∼ eno:eno|N
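The three kinds of morpheme part in the Finnish file (stem with optional part-of-speech tag, suffix label, tilde) can be told apart mechanically. The sketch below is hypothetical: the function name and the regular expression are our own reading of the description, and any digit is allowed at the edges of a label because Table 3 also lists INF5.

```python
import re

# Suffix labels: capital letters, optional hyphen-separated parts, and an
# optional digit as first or last character (e.g., PTV, 1PL, PCP1, DV-MA).
SUFFIX_LABEL = re.compile(r"^[0-9]?[A-Z]+(?:-[A-Z]+)*[0-9]?$")


def classify_morpheme(morpheme):
    """Return a (kind, label, part_of_speech) triple for a morpheme part."""
    if morpheme in ("~", "\u223c"):  # ASCII tilde or the typeset tilde sign
        return ("none", None, None)
    if SUFFIX_LABEL.match(morpheme):
        return ("suffix", morpheme, None)
    base, separator, pos = morpheme.partition("|")
    return ("stem", base, pos if separator else None)


for part in ["arvo|N", "PTV", "1PL", "DV-MA", "\u223c"]:
    print(part, classify_morpheme(part))
```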
2.7.2 The allomorph part of a chunk
The allomorph part of a chunk can contain either of the special characters ˆ (caret) or
" (citation mark), which mark fuzzy boundaries. The caret indicates that the following
morpheme boundary may come earlier and at the earliest at the point of the caret, e.g.,
ilmenevistä    ilmeˆne:ilmetä|V v:PCP1 i:PL stä:ELA
yields the possible segmentations:
ilmene+v+i+stä, ilmen+ev+i+stä, and ilme+nev+i+stä.
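The expansion of caret-marked boundaries can be sketched as follows. This is an illustrative reading of the notation, not the Hutmegs implementation: the function name `expand_caret` is hypothetical, the input is assumed to be the list of allomorph parts of one analysis, and a caret is assumed never to occur in the last allomorph.

```python
from itertools import product


def expand_caret(allomorphs):
    """Return all segmentations licensed by caret-marked fuzzy boundaries."""
    options = []
    for segment in allomorphs:
        segment = segment.replace("\u02c6", "^")  # normalize the typeset caret
        if "^" in segment:
            head, tail = segment.split("^", 1)
            # The boundary may fall anywhere between the caret and the end
            # of the allomorph; the remainder moves to the next segment.
            options.append([(head + tail[:k], tail[k:])
                            for k in range(len(tail), -1, -1)])
        else:
            options.append([(segment, "")])
    results = []
    for choice in product(*options):
        parts, carried = [], ""
        for kept, moved in choice:
            parts.append(carried + kept)
            carried = moved
        parts[-1] += carried  # only relevant if the assumption above fails
        results.append("+".join(parts))
    return results


print(expand_caret(["ilme^ne", "v", "i", "stä"]))
# ['ilmene+v+i+stä', 'ilmen+ev+i+stä', 'ilme+nev+i+stä']
```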
Table 3: The Finnish inflectional suffixes and their labels (most labels are the same as
in FINTWOL). The suffixes are grouped into nominal and verbal suffixes. The nominal
suffixes are further grouped into case, number, comparison, adverbial, possessive and
other suffixes. The case suffixes are additionally grouped into four sub-categories. The
verbal suffixes are grouped into tense, voice, mood, infinitive, person and participle
suffixes.

INFLECTIONAL SUFFIXES

Nominal
  Case: Grammatical
    GEN       genitive
    PTV       partitive
    ACC       accusative (of certain pronouns)
  Case: Location (internal)
    INE       inessive
    ELA       elative
    ILL       illative
  Case: Location (external)
    ADE       adessive
    ABL       ablative
    ALL       allative
  Case: Marginal
    ESS       essive
    TRA       translative
    INS       instructive
    ABE       abessive
    COM       comitative
  Number
    PL        plural
  Comparison
    CMP       comparative
    SUP       superlative
  Adverbial
    MAN       manner
    PROL      prolative
  Possessive
    1SG       1st person singular
    2SG       2nd person singular
    3SGPL     3rd person singular or plural
    1PL       1st person plural
    2PL       2nd person plural
  Other
    PRONSUF   clitic-like suffix of pronouns, e.g., jo+ssa+kin

Verbal
  Tense
    PAST      past tense
  Voice
    PSS       passive
  Mood
    COND      conditional
    POTN      potential
    IMPV      imperative
  Infinitive
    INF1      1st infinitive
    INF2      2nd infinitive
    INF3      3rd infinitive
    INF5      5th infinitive
  Person
    SG1       1st person singular
    SG2       2nd person singular
    SG3       3rd person singular
    PL1       1st person plural
    PL2       2nd person plural
    PL3       3rd person plural
    PE4       4th person (for passive voice)
  Participle
    PCP1      present participle
    PCP2      past participle active
    PSSPCP2   past participle passive
    DV-MA     agent participle
Table 4: The labels for the Finnish derivational suffixes (most labels are the same as in
FINTWOL). The suffixes are grouped according to the part of speech of the stem that
they can be attached to: noun, verb, adjective, or numeral.

DERIVATIONAL SUFFIXES

Noun stem
  DN-INEN, DN-ITTAIN, DN-LAINEN, DN-LAISTA, DN-LAISTU, DN-LLINEN,
  DN-MAINEN, DN-TAR, DN-TON, DN-UUS
Verb stem
  DV-AISE, DV-ELE, DV-ILE, DV-JA, DV-MA, DV-MATON, DV-MINEN, DV-NA,
  DV-NEISUUS, DV-NTA, DV-NTAA, DV-NTI, DV-SKELE, DV-SKENTELE, DV-TTA,
  DV-TU, DV-U, DV-US, DV-UTTA, DV-UTU, DV-VAINEN
Adjective stem
  DA-US, DA-UUS
Numeral stem
  ORD       ordinal
  FRAC      fractional
The citation mark indicates that an additional morpheme boundary may be inserted
at any point from the location of the citation mark to the end of the morpheme, e.g.,
ilmene    ilme"ne:ilmetä|V
yields the possible segmentations:
ilmene, ilmen+e, and ilme+ne.
Thus, as there is no suffix allomorph to which to attach the fuzzy stem-final phonemes,
it is necessary to create a new, but optional, suffix. The same applies when the fuzzy
boundary is followed by a clitic or a stem. The fuzzy phonemes are not attached to the
following clitic or stem, because they clearly do not belong to it. Instead an optional
suffix can be inserted, as for:
kenttävartioon    kentt"ä:kenttä|N vartio:vartio|N on:ILL,
which yields the possible segmentations:
kenttä+vartio+on and kentt+ä+vartio+on.
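The optional suffix introduced by the citation mark can be expanded in the same spirit. Again this is a hypothetical sketch of the notation rather than the Hutmegs code; the function names are our own, and the input is assumed to be the list of allomorph parts of one analysis.

```python
from itertools import product


def expand_quote(segment):
    """Return the alternative splits of one allomorph: unsplit, or with an
    extra boundary anywhere from the citation mark to the end."""
    if '"' not in segment:
        return [[segment]]
    head, tail = segment.split('"', 1)
    alternatives = [[head + tail]]  # no additional boundary
    for k in range(len(tail) - 1, -1, -1):
        alternatives.append([head + tail[:k], tail[k:]])
    return alternatives


def expand_word(allomorphs):
    """Combine the alternatives of all allomorphs of a word."""
    per_segment = [expand_quote(segment) for segment in allomorphs]
    return ["+".join(part for pieces in choice for part in pieces)
            for choice in product(*per_segment)]


print(expand_word(['ilme"ne']))
# ['ilmene', 'ilmen+e', 'ilme+ne']
print(expand_word(['kentt"ä', "vartio", "on"]))
# ['kenttä+vartio+on', 'kentt+ä+vartio+on']
```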
3 English morpheme segmentations
This section describes the steps in the process of acquiring morpheme segmentations
for almost 120 000 English word forms, as well as the notation used for the final segmentations.
3.1 Data
The English lemma and wordform lexicons in the CELEX database have been used as
data for our task. These lexicons contain roughly 70 000 distinct word forms. (The
number pertains to the number of word types that are spelled differently, without considering the fact that identical word forms may be ambiguous and have several different
meanings and morphological structures.)
The CELEX database also contains information on phrasal verbs, e.g., ‘shrug off’.
These were left out, as we are only concerned with the space-delimited words of English.
3.2 Complete and flat segmentation
A flat, non-hierarchical, segmentation for each word form was extracted from the
CELEX database. The segmentation is complete in the sense that it identifies all the
morphemes a word form contains. The information gathered for each word contains
the morphemes (“deep-level” or baseform) together with part-of-speech and type-of-flection labels. For instance, for the word ‘bacteriologists’ the following information
is obtained:
bacterium|N ology|s ist|s N+P
That is, the base form of the word consists of three morphemes: ‘bacterium’, which is
a noun (N), and ‘ology’ and ‘ist’, which are suffixes (s). The whole word is a noun and
inflected into plural (N+P).
3.3 Surface-level allomorph segmentation
The surface form of the word is aligned with the deep-level morpheme segmentation
in order to get a surface-level segmentation. Each segment will represent an allomorph
(a realization variant) of an underlying morpheme. The alignment of the word ‘bacteriologists’ looks like this:
Surface allomorphs      bacteri      olog     ist    s
Underlying morphemes    bacterium|N  ology|s  ist|s  N+P
In order to obtain a linguistically correct alignment for English, we consulted
(Quirk et al., 1985).
3.4 Possessive forms of nouns
The English CELEX lexicons do not contain nouns in the possessive form, such as
man’s, men’s, father’s, fathers’. In order to provide analyses for word forms in the
possessive form, we have automatically generated the possessive forms for all nouns
in the data. Nouns in singular ending in ‘s’ receive two variants: a possessive ending consisting
only of an apostrophe and a possessive ending consisting of an apostrophe and ‘s’, e.g.,
Julius’ and Julius’s.
Nouns in singular and plural ending in a letter other than ‘s’ receive the apostrophe
followed by ‘s’ as their possessive ending, e.g.,
man’s, father’s, mother’s and men’s.
Nouns in plural ending in ‘s’ receive an apostrophe as their possessive ending, e.g.,
fathers’ and mothers’.
This automatic generation of possessives makes the original word list grow from
about 70 000 to almost 120 000 word forms. Many of the resulting words are not likely
to occur in real texts, although they do not seem to be incorrect.
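The generation rules above can be summarized in a short Python sketch. It is an illustration of the three rules as stated, not the script actually used; the function name `possessive_forms` is hypothetical and a plain ASCII apostrophe is assumed by default.

```python
def possessive_forms(noun, plural, apostrophe="'"):
    """Generate the possessive form(s) of a noun."""
    if noun.endswith("s"):
        if plural:
            # Plural ending in 's': apostrophe only.
            return [noun + apostrophe]
        # Singular ending in 's': two variants.
        return [noun + apostrophe, noun + apostrophe + "s"]
    # Singular or plural ending in another letter: apostrophe followed by 's'.
    return [noun + apostrophe + "s"]


print(possessive_forms("Julius", plural=False))  # ["Julius'", "Julius's"]
print(possessive_forms("men", plural=True))      # ["men's"]
print(possessive_forms("fathers", plural=True))  # ["fathers'"]
```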
3.5 Fuzzy morpheme boundaries
In English, the stem-final ‘e’ is dropped in some forms. This phenomenon is the only
one that fulfills our criterion for “fuzzy boundaries” (cf. Section 1.2).
3.6 Syntax of the English gold-standard segmentation file
The English gold-standard segmentation file follows the same syntax as the Finnish
file, but the morpho-syntactic labels are different. Each row in the file has the format:
<wordform><TAB><segmentations><NEWLINE>
<wordform> and <segmentations> can contain any printable characters except
space, tab, newline, and carriage return. There are some characters with a special function. If the special function is not intended, the character in question is preceded by the
escape character \ (backslash). Backslash itself is written as \\ (double backslash).
<wordform> is the word for which a morphological analysis, i.e., segmentation, is
available. <segmentations> contains one or several alternative analyses for <wordform>. The alternatives are separated using , (comma). Each segmentation consists
of chunks, which are separated by space. Each chunk consists of two parts, separated
by : (colon). The first part is the allomorph (the surface segment of the word) and
the second part is the morpheme, i.e., the base form or deep-level representation. The
format of a segmentation is thus:
<allomorph1>:<morpheme1> <allomorph2>:<morpheme2> ...
E.g.,
dials    dial:dial|N s:N+P, dial:dial|N s:V+e3S
The word ‘dials’ has two interpretations: a noun in plural or a present tense verb in
the third person singular. (In CELEX the stem ‘dial’ is given as the noun ‘dial’ in
both cases with a derivation from noun to verb called conversion. We indicate the
conversion with the stem ‘dial|N’ having the suffix ‘s:V+e3S’. An alternative markup could have been ‘dial|N ∼:V s:e3S’ indicating an explicit derivational step
with an empty surface string.)
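As a minimal sketch of how a row in this format could be read, the following Python code splits a line into the word form and its alternative analyses. The helper names `split_unescaped` and `parse_gold_line` are hypothetical, and the code assumes that a backslash protects the character that follows it; escape pairs are kept as they are rather than being unescaped.

```python
def split_unescaped(text, sep):
    """Split text on sep, ignoring separators preceded by a backslash."""
    parts, current, i = [], [], 0
    while i < len(text):
        ch = text[i]
        if ch == "\\" and i + 1 < len(text):
            current.append(text[i:i + 2])  # keep the escape pair intact
            i += 2
        elif ch == sep:
            parts.append("".join(current))
            current = []
            i += 1
        else:
            current.append(ch)
            i += 1
    parts.append("".join(current))
    return parts


def parse_gold_line(line):
    """Return (wordform, analyses); each analysis is a list of
    (allomorph, morpheme) pairs."""
    wordform, segmentations = line.rstrip("\r\n").split("\t", 1)
    analyses = []
    for alternative in split_unescaped(segmentations, ","):
        chunks = []
        for chunk in split_unescaped(alternative.strip(), " "):
            if chunk:
                allomorph, morpheme = split_unescaped(chunk, ":")
                chunks.append((allomorph, morpheme))
        analyses.append(chunks)
    return wordform, analyses


word, analyses = parse_gold_line("dials\tdial:dial|N s:N+P, dial:dial|N s:V+e3S")
print(word, analyses)
```

For the word ‘dials’ this yields two analyses of two chunks each, corresponding to the noun and verb interpretations.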
3.6.1 The morpheme part of a chunk
The second part of each chunk in the segmentation can consist of a string, which is the
base form of the morpheme, followed by | (vertical line) and a part-of-speech tag, e.g.,
dial|N, indicating that ‘dial’ is a noun stem. In some rare cases the part-of-speech
tag is missing (when CELEX does not indicate the part of speech). The possible part-of-speech tags are listed in Table 5.
The morpheme part of the chunk can also consist of a label, e.g., V+e3S, which
indicates both the part of speech of the whole word and the inflection form, which
often corresponds to a surface allomorph. The label consists of a part-of-speech tag,
which is separated from a string of inflection features by a + (plus sign), e.g.,
dial:dial|N s:V+e3S.
In this interpretation, the whole word ‘dials’ is a verb (V), inflected into present tense
(e), third person (3) singular (S). Table 6 shows the possible inflection features for
English.
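To make the label syntax concrete, here is a small Python sketch that splits a label such as V+e3S into its part-of-speech tag and the names of its inflection features. The feature names are transcribed from Table 6; the function name `decode_label` and the decision to accept the ‘X’ feature for every part of speech are assumptions made for this example.

```python
# Feature names transcribed from Table 6, keyed by part-of-speech tag.
NOUN = {"S": "singular", "P": "plural", "o": "possessive", "r": "rare form"}
VERB = {"S": "singular", "e": "present tense", "a": "past tense",
        "i": "infinitive", "p": "participle", "1": "1st person",
        "2": "2nd person", "3": "3rd person", "r": "rare form"}
ADJ_ADV = {"b": "positive", "c": "comparative", "s": "superlative"}
OTHER = {"X": "headword form"}
FEATURES_BY_POS = {"N": NOUN, "V": VERB, "A": ADJ_ADV, "B": ADJ_ADV}


def decode_label(label):
    """Split a label like 'V+e3S' into (POS tag, list of feature names).

    Unknown feature codes are returned unchanged.
    """
    pos, _, features = label.partition("+")
    table = {**OTHER, **FEATURES_BY_POS.get(pos, {})}
    return pos, [table.get(code, code) for code in features]


print(decode_label("V+e3S"))
print(decode_label("N+Po"))
```

Each feature is a single character, so the feature string can simply be read one character at a time.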
The morpheme part of a chunk can also consist of a single ∼ (tilde sign), which
means that a segment of the word corresponds to no underlying morpheme. The only
case where this applies is the hyphen in compounds, such as:
ball-dresses    ball:ball|N -:∼ dress:dress|N es:N+P
3.6.2 The allomorph part of a chunk
The allomorph part of a chunk can consist of a single ∼ (tilde sign), which means that
the chunk corresponds to a null morpheme, i.e., an inflection feature that is not realized
as a segment of the word, e.g.,
dress    dress:dress|N ∼:N+S, dress:dress|N ∼:V+i.
Here the segmentation of the word ‘dress’ consists of only one segment, which consists
of the entire word. The word can be interpreted as a noun in singular (N+S) or as the
infinitive of a verb (V+i), but neither of these morphological features is realized as a
word segment.
Table 5: English part-of-speech tags. (The tags are the same as in CELEX.)
A    adjective
B    adverb
C    conjunction
D    article
I    interjection
N    noun
O    pronoun
P    preposition
Q    numeral
V    verb
p    prefix
s    suffix
?    undetermined
Table 6: English inflection features. (All tags except the possessive tag have been
adopted directly from CELEX.) The features are grouped according to part of speech.
Nouns
  S    singular
  P    plural
  o    possessive
  r    rare form
Verbs
  S    singular
  e    present tense
  a    past tense
  i    infinitive
  p    participle
  1    1st person
  2    2nd person
  3    3rd person
  r    rare form
Adjectives and adverbs
  b    positive
  c    comparative
  s    superlative
Other
  X    headword form
The allomorph part of a chunk can contain either of the special characters ˆ (caret)
or " (citation mark), which mark fuzzy boundaries. The caret indicates that the following morpheme boundary may come earlier, at the point of the caret, e.g.,
loves    lovˆe:love|V s:V+e3S
yields the possible segmentations:
love+s and lov+es.
The citation mark indicates that an additional morpheme boundary may be inserted
at the citation mark, e.g.,
love    lov"e:love|V ∼:V+i
yields the possible segmentations:
love and lov+e.
Thus, as there is no suffix allomorph to which to attach the fuzzy stem-final ‘e’, it
is necessary to make a suffix of the ‘e’. The same applies when the fuzzy boundary
is followed by a morpheme that is not a suffix. The fuzzy ‘e’ is not attached to the
following morpheme, because it clearly does not belong to it. Instead an ‘e’-suffix can
be inserted, as for:
lovebird    lov"e:love|V bird:bird|N ∼:N+S
which yields the possible segmentations:
love+bird and lov+e+bird.
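The effect of the two fuzzy-boundary markers can be illustrated with a short Python sketch that enumerates the surface segmentations licensed by a list of allomorphs. The sketch assumes that the file uses the ASCII characters `^`, `"` and `~`, that each allomorph contains at most one marker, and that no escaped special characters occur; the function names are hypothetical.

```python
from itertools import product


def alternatives(allomorph):
    """Return the options for one allomorph as (pieces, carry) pairs,
    where carry is the material handed over to the next allomorph."""
    if "^" in allomorph:
        head, _, tail = allomorph.partition("^")
        # The boundary stays in place, or moves back to the caret.
        return [([head + tail], ""), ([head], tail)]
    if '"' in allomorph:
        head, _, tail = allomorph.partition('"')
        # No extra boundary, or an extra boundary at the citation mark.
        return [([head + tail], ""), ([head, tail], "")]
    return [([allomorph], "")]


def fuzzy_segmentations(allomorphs):
    """Enumerate the surface segmentations allowed by fuzzy boundaries."""
    surface = [a for a in allomorphs if a != "~"]  # skip null allomorphs
    results = []
    for combo in product(*(alternatives(a) for a in surface)):
        pieces, carry = [], ""
        for segs, next_carry in combo:
            segs = list(segs)
            segs[0] = carry + segs[0]
            pieces.extend(segs)
            carry = next_carry
        if not carry:  # a caret needs a following allomorph to attach to
            results.append("+".join(pieces))
    return results


print(fuzzy_segmentations(["lov^e", "s"]))
print(fuzzy_segmentations(['lov"e', "bird", "~"]))
```

The caret moves material into the following allomorph, whereas the citation mark creates a segment of its own, which matches the three examples given above.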
4 Installation and evaluation
This section describes how to obtain the gold-standard segmentations for Finnish and
English and install them on your computer. A Linux or Unix operating system is
necessary. Instructions are also given regarding a few scripts, which are helpful in the
evaluation of a segmentation produced by some algorithm.
4.1 Download and licensing conditions
The Helsinki University of Technology Morphological Evaluation Gold Standard package, Hutmegs, is a collection of files and documentation that are free to use for noncommercial purposes as long as a reference is made to the source of the material.4
The Hutmegs package can be downloaded from the following Internet address:
http://www.cis.hut.fi/projects/morpho/.
However, the Finnish gold-standard segmentations are based on the FINTWOL analyzer, which is a commercial product. To obtain the complete Finnish Gold Standard,
a missing component must be licensed from Lingsoft, Inc.5 in Helsinki, Finland. The
missing component is a file containing the FINTWOL analyses for the word forms
comprised in the Gold Standard. The list of FINTWOL analyses has the product name
‘LS Hutmegs-Fintwol’ (version 1.0) and the one-time license fee for non-commercial
activities is 600 euros (as of September, 2004). To purchase this component contact
Lingsoft through e-mail: [email protected]. If the component is not purchased,
the user will have access to all Hutmegs scripts and documentation, but only a sample
Gold Standard containing the analyses of no more than 700 Finnish word forms.
Likewise, the CELEX database (Baayen et al., 1995) is a prerequisite for accessing
the complete English Gold Standard. Two files from CELEX describing the English
morphology must be available in order to generate the English gold-standard segmentations. Non-commercial licenses are available from the Linguistic Data Consortium6
in Philadelphia, USA. The product name of the database is CELEX2 and its catalogue number is LDC96L14.7 The non-member price is 150 US dollars (as of September, 2004). The Hutmegs package provides sample gold-standard segmentations for
roughly 600 English word forms, which can be viewed without access to the CELEX
database.
4 Cite this work like this: “Mathias Creutz and Krister Lindén. 2004. Morpheme Segmentation Gold Standards for Finnish and English. Technical Report A77, Helsinki University of Technology, October. URL: http://www.cis.hut.fi/projects/morpho/”
5 URL: http://www.lingsoft.fi
6 URL: http://wave.ldc.upenn.edu/
7 URL: http://wave.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC96L14
4.2 Installation
The downloaded Hutmegs package version 1.0 is unpacked using the following command:
tar xzf hutmegs_ver1.0.tar.gz.
A directory tree will be created with the root hutmegs and the subdirectory
ver1.0 (or later version). This directory contains the following:
README      Some useful information.
Makefile    A Makefile showing examples of how to decode the encoded Finnish
            and English Gold Standards and how to use the evaluation scripts
            (described below).
bin/        A directory containing all installation and evaluation scripts described in this document. The scripts are Perl scripts.
lib/        A directory containing the encoded Finnish and English Gold Standards as well as the decoded samples. This directory also contains
            example morpheme category files for Finnish and English, which can
            be used in connection with the boundary_detection_stats.pl
            and evaluate_tags.pl evaluation scripts.
test/       A directory containing example files utilized by the Makefile.
doc/        A directory containing this document.
4.2.1 Configuring the scripts
The scripts in the bin/ directory are Perl scripts. The first line in every script needs
to point to the location of the Perl interpreter on the computer system. The value is by
default set to /usr/bin/perl. If this is incorrect, either edit the scripts in the bin/
directory manually, or adjust the value of the PERLCMD variable in Makefile, and run
‘make perlconfig’. This command will update the scripts automatically with the
value of the PERLCMD variable.
4.2.2 Decoding the Finnish Gold Standard
Once the FINTWOL output file has been purchased from Lingsoft, Inc., the decoding
of the Finnish Gold Standard is invoked from the command line using the following
syntax:
decodefintwol.pl fintwol_output encodedgoldstd.fin > goldstd.fin.
Make sure that the FINTWOL output file fintwol_output is unpacked. The decodefintwol.pl script is in the bin/ directory and the encodedgoldstd.fin file
is in the lib/ directory.
4.2.3 Decoding the English Gold Standard
The compilation of the English Gold Standard involves two steps. First, run:
decodecelex.pl eml.cd emw.cd encodedgoldstd.eng > goldstd.tmp,
where eml.cd and emw.cd are files from the English CELEX database. The locations on the CD-ROM of these files are english/eml/eml.cd and english/emw/emw.cd,
respectively. The encodedgoldstd.eng file is in the Hutmegs lib/ directory.
Next, run:
postprocesscelex.pl < goldstd.tmp > goldstd.eng.
The scripts decodecelex.pl and postprocesscelex.pl are in the Hutmegs bin/
directory.
4.3 Evaluation scripts
The four evaluation scripts are found in the Hutmegs bin/ directory. The scripts
are align_segmentations.pl, boundary_precision_recall.pl,
boundary_detection_stats.pl, and evaluate_tags.pl.
4.3.1 align_segmentations.pl
The script align_segmentations.pl produces an optimal morpheme-boundary
alignment between the gold-standard segmentation of words and the segmentations
produced by some algorithm. This alignment is needed as input to the other scripts
described below. Invoke align_segmentations.pl from the command line using
the following syntax:
align_segmentations.pl [-fuzzy] goldstdfile < segs_2b_evaluated.
The -fuzzy switch is optional and activates the use of fuzzy morpheme boundaries.
That is, for some morpheme boundaries several locations are considered correct. When
the -fuzzy switch is not utilized, all morpheme boundaries have an unambiguous,
linguistically conventional, location (cf. Section 1.2). The parameter goldstdfile
indicates the location of the appropriate gold-standard file (Finnish or English), and
segs_2b_evaluated indicates the location of the file containing the segmentations to
be evaluated.
Input
Each line of segs_2b_evaluated must comply with the following format:
<segmentation><TAB><wcount><NEWLINE>,
where the segmentation consists of chunks separated by space, and each chunk consists of two parts separated by : (colon). The first part is a segment of the word and
the second part is some tag, a class assigned by the algorithm to the segment. This
corresponds to the allomorph and morpheme parts in the chunks in the gold-standard
segmentations (cf. Sections 2.7 and 3.6). The word count (wcount) is a positive integer indicating the number of times the word form has occurred in the data processed
by the algorithm.
For instance, here is an extract of the segmentation file produced by the so called
Category-Learning algorithm presented in (Creutz and Lagus, 2004). The data used
consisted of a subset of the Brown corpus comprising 250 000 word tokens:
aspect:STM                     13
aspect:STM s:SUF               17
asphalt:STM                    2
aspir:STM ant:SUF              1
aspir:STM ation:SUF            2
aspir:STM ation:SUF s:SUF      2
aspir:STM ed:SUF               1
aspir:STM es:SUF               1
aspir:STM ing:SUF              1
The Category-Learning algorithm works in an unsupervised way and proposes a morpheme segmentation for every word form in the data. Additionally, each morpheme is
tagged with one of the categories prefix (PRE), stem (STM), and suffix (SUF). In this
example the word ‘aspires’, which has occurred once in the data, has been segmented
into ‘aspir+es’. The first segment has been tagged as a stem and the second segment
as a suffix.
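A minimal Python sketch of writing and reading lines in this input format is given below. The function names `format_line` and `parse_line` are hypothetical, and the sketch assumes that the tag itself never contains a colon, so that the last colon in a chunk separates the segment from its tag.

```python
def format_line(chunks, wcount):
    """Build one '<segmentation><TAB><wcount>' line from (segment, tag) pairs."""
    segmentation = " ".join(f"{segment}:{tag}" for segment, tag in chunks)
    return f"{segmentation}\t{wcount}"


def parse_line(line):
    """Read a line back into a list of (segment, tag) pairs and a word count."""
    segmentation, wcount = line.rstrip("\r\n").split("\t")
    chunks = [tuple(chunk.rsplit(":", 1)) for chunk in segmentation.split(" ")]
    return chunks, int(wcount)


line = format_line([("aspir", "STM"), ("es", "SUF")], 1)
chunks, wcount = parse_line(line)
word = "".join(segment for segment, _ in chunks)  # the word form is implicit
print(repr(line), word, wcount)
```

The word form does not appear as a field of its own in this format; it is obtained by concatenating the segments.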
Output
Each row in the output of align_segmentations.pl has the following format:
<word><TAB><goldstd_seg><TAB><algorithm_seg><TAB><wcount><NEWLINE>.
The first field contains the word itself. The second field contains the gold-standard
segmentation of the word aligned against the proposed segmentation in the third field.
The fourth field indicates the word count, i.e., the number of times the word occurred
in the data. When the gold-standard segmentation is aligned against the segmentation
proposed by the algorithm, there will always be as many chunks in both representations. If necessary, either or both segmentations will be extended with “null chunks”
(∼:∼).
For instance, the following output is produced for the word ‘aspires’, when the
-fuzzy switch is not in use:
aspires    ∼:∼ aspire:aspire|V s:V+e3S    aspir:STM ∼:∼ es:SUF    1
The alignment can be presented more explicitly as:
Gold Standard:    ∼:∼          aspire:aspire|V    s:V+e3S
Algorithm:        aspir:STM    ∼:∼                es:SUF
Null chunks on the gold-standard side can be interpreted as insertions of incorrect
morpheme boundaries. In this case, the algorithm has proposed an incorrect boundary
at the end of ‘aspir’. Null chunks in the segmentation proposed by the algorithm can
be interpreted as deletions of desired boundaries. In the example, the algorithm has
missed the boundary at the end of ‘aspire’. (Note that the alignment is an optimal
alignment of morpheme boundaries and not an optimal alignment of morphemes. In
our implementation, two morphemes must end at the same point in order to be aligned
with each other, which is the case for ‘s’ vs. ‘es’ above, but which is not the case for
‘aspire’ vs. ‘aspir’.)
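The boundary-based alignment described above can be sketched in Python as follows. This is not the implementation in align_segmentations.pl, which may work differently; it is a simple construction, under the assumptions that a single gold-standard segmentation without fuzzy markers is given and that null allomorphs are written with an ASCII tilde. Chunks are aligned only if they end at the same character offset, and a null chunk is inserted on the side that has no boundary at a given offset.

```python
NULL = ("~", "~")


def end_positions(chunks):
    """Map the end offset of every surface segment to its chunk.

    Chunks with a null allomorph ('~') occupy no characters and are skipped.
    """
    ends, offset = {}, 0
    for segment, tag in chunks:
        if segment == "~":
            continue
        offset += len(segment)
        ends[offset] = (segment, tag)
    return ends


def align(gold, proposed):
    """Pad both segmentations with null chunks so that chunks ending at
    the same offset share a position."""
    gold_ends = end_positions(gold)
    prop_ends = end_positions(proposed)
    aligned_gold, aligned_prop = [], []
    for pos in sorted(set(gold_ends) | set(prop_ends)):
        aligned_gold.append(gold_ends.get(pos, NULL))
        aligned_prop.append(prop_ends.get(pos, NULL))
    return aligned_gold, aligned_prop


gold = [("aspire", "aspire|V"), ("s", "V+e3S")]
proposed = [("aspir", "STM"), ("es", "SUF")]
print(align(gold, proposed))
```

For ‘aspires’ the sketch reproduces the alignment shown above: a null chunk on the gold-standard side for the inserted boundary after ‘aspir’, and a null chunk on the algorithm side for the missed boundary after ‘aspire’.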
In this particular example, the alignment is different, if the -fuzzy switch is activated. The stem-final ‘e’ can then be attached to the suffix, which makes the segmentation proposed by the algorithm correct. Consequently, there are no insertions or
deletions:
aspires    aspir:aspire|V es:V+e3S    aspir:STM es:SUF    1
When there are many alternative correct segmentations in the Gold Standard, there
may be several equally good alignments with the single segmentation proposed by the
segmentation algorithm. This is represented in the output as several gold-standard segmentations separated by , (comma) in the <goldstd_seg> field. These are matched
with equally many alignments of the proposed segmentation in the <algorithm_seg>
field. In the following example the algorithm has incorrectly segmented the word ‘conjunction’ into two stems ‘conj’+‘unction’. The correct segmentations, according to the
Gold Standard, are ‘con+junction’ (prefix + noun stem) and ‘conjunct+ion’ (verb stem
+ suffix). The long line has been broken onto three lines after the first and second field:
conjunction ...
con:con|p ∼:∼ junction:junction|N, ∼:∼ conjunct:conjoin|V ion:ion|s ...
∼:∼ conj:STM unction:STM, conj:STM ∼:∼ unction:STM    2
4.3.2 boundary_precision_recall.pl
The script boundary_precision_recall.pl computes accuracy statistics for the
morpheme boundaries proposed by the algorithm. First, an alignment against the gold-standard segmentation must be produced using the script align_segmentations.pl.
Then, boundary_precision_recall.pl is invoked from the command line as follows:
boundary_precision_recall.pl < alignment,
where alignment is the output of align_segmentations.pl. The script boundary_precision_recall.pl only evaluates the placing of the morpheme boundaries,
i.e., the segmentation into allomorphs of the surface form of words. The underlying
morphemes and their labels play no role whatsoever in this evaluation.
The output of the script looks like the following (explanations are given after the
example):
Number of word tokens: 250000, of which omitted 9070 (3.63%)
Token precision: 70.36%
Token recall: 64.70%
Token F-measure: 67.41%
Number of word types: 21434, of which omitted 3864 (18.03%)
Type precision: 76.85%
Type recall: 66.13%
Type F-measure: 71.09%
Number of desired morph types: 8749; recognized: 8999
All recognized morph types: 11604, of which omitted 2605 (22.45%)
The evaluation statistics are computed both on word tokens and word types. Precision is the proportion of morpheme boundaries suggested by the algorithm that are
correct. Recall is the proportion of morpheme boundaries in the Gold Standard that
were correctly recognized by the algorithm. The F-measure is the harmonic mean of
precision and recall [see e.g., (Manning and Schütze, 1999)]:
\[
\text{F-measure} = \frac{1}{\frac{1}{2}\left(\frac{1}{\text{Precision}} + \frac{1}{\text{Recall}}\right)}.
\]
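As a quick check of the harmonic mean, the following Python sketch recomputes the F-measure from the precision and recall figures of the example output. The function name `f_measure` is chosen for this illustration only.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (both on the same scale)."""
    if precision == 0 or recall == 0:
        return 0.0
    return 1.0 / (0.5 * (1.0 / precision + 1.0 / recall))


# Figures from the example output above (percentages).
print(round(f_measure(70.36, 64.70), 2))  # word tokens
print(round(f_measure(76.85, 66.13), 2))  # word types
```

Rounded to two decimals, the results agree with the token and type F-measures reported in the example (67.41 % and 71.09 %).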
In the evaluation statistics for word tokens, the segmentation of each occurrence of
every word in the data is taken into account. That is, words occurring many times have
a higher impact on the result than rare words. In the evaluation statistics for word types,
only one occurrence of each distinct word form is considered. That is, the accuracy of
the segmentation of rare words dominates, since most of the words are rare.
The number and proportion of omitted word tokens and types correspond to words
in the data, for which there was no gold-standard segmentation available. These words
are left out of the evaluation.
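The difference between token-based and type-based statistics can be made concrete with a short Python sketch. It is a simplification, not the code of boundary_precision_recall.pl: it assumes one gold-standard segmentation per word, counts only word-internal boundaries, and uses hypothetical function names and toy data.

```python
def boundaries(segments):
    """Word-internal boundary offsets of a segmentation (list of strings)."""
    positions, offset = set(), 0
    for segment in segments[:-1]:
        offset += len(segment)
        positions.add(offset)
    return positions


def boundary_scores(items, by_token):
    """Compute (precision, recall) over (gold, proposed, wcount) triples.

    With by_token=True every word is weighted by its count; otherwise each
    distinct word form counts once.
    """
    correct = proposed_total = gold_total = 0
    for gold, proposed, wcount in items:
        weight = wcount if by_token else 1
        g, p = boundaries(gold), boundaries(proposed)
        correct += weight * len(g & p)
        proposed_total += weight * len(p)
        gold_total += weight * len(g)
    precision = correct / proposed_total if proposed_total else 0.0
    recall = correct / gold_total if gold_total else 0.0
    return precision, recall


# Toy data: one wrongly segmented rare word and one correct frequent word.
items = [
    (["aspire", "s"], ["aspir", "es"], 1),
    (["aspect", "s"], ["aspect", "s"], 17),
]
print(boundary_scores(items, by_token=True))
print(boundary_scores(items, by_token=False))
```

In the toy data the frequent, correctly segmented word dominates the token-based figures, whereas the type-based figures give both word forms equal weight.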
The number of desired morph types indicates how many distinct allomorphs (segments) there are in the gold-standard segmentations of the words in the data set used.
The number of recognized morph types indicates how many distinct allomorphs there
are in the segmentations of these words proposed by the algorithm. If these two figures are roughly equal, the algorithm has managed to segment the words to the same
degree as in the Gold Standard. That is, on the average, the words have neither been
split excessively into many short segments, nor too reluctantly into too long and few
segments per word.
The number of all recognized morph types indicates how many distinct allomorphs
there are in the segmentations proposed by the algorithm for all words in the data,
including those words that have been omitted in the evaluation, because there is no
gold-standard segmentation available for them.
4.3.3 boundary_detection_stats.pl
The script boundary_detection_stats.pl is helpful, when it is necessary to know
roughly where morpheme boundaries have been detected correctly and incorrectly. The
script is invoked as follows:
boundary_detection_stats.pl morpheme_categories < alignment,
where alignment is the output of align_segmentations.pl, which must be run
first. The file morpheme_categories contains a grouping of morphemes in the Gold
Standard into broader categories. This is one possible grouping for the English morphemes (consult Tables 5 and 6 for an explanation of the tags):
# All labels corresponding to surface allomorphs are
# grouped into two classes: stems and affixes
#
STEM    A B C D I N O P Q V ?    # Part-of-speech tags for stems
AFFX    p s                      # Derivational pre- & suffixes
AFFX    A+c A+s B+c B+s          # Comparatives and superlatives
                                 # for adjectives and adverbs
AFFX    N+P N+Po N+So            # Plurals & possessives of nouns
AFFX    V+a1S V+e3S V+pe         # Verb endings (past, pres., -ing)
AFFX    ∼                        # Hyphen or extra ending
                                 # due to fuzziness
Each line of the file contains two tab-separated fields. The left field contains a
morpheme label for a broad morpheme category; in this example STEM for stems and
AFFX for affixes. The right field consists of space-separated labels for part-of-speech
or suffix type. The number sign (#) indicates the start of a comment. Everything from
the number sign to the end of the line is ignored, including any preceding whitespace
(tab or space).
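Reading such a category file could look roughly as follows in Python. The function name `parse_category_file` is hypothetical, and the sketch assumes that the number sign never occurs as part of a label and that an ASCII tilde is used in the file.

```python
def parse_category_file(lines):
    """Map each part-of-speech or suffix label to its broad category."""
    mapping = {}
    for raw in lines:
        # Drop the comment and the whitespace that precedes it.
        content = raw.split("#", 1)[0].strip()
        if not content:
            continue  # comment-only or empty line
        category, labels = content.split("\t", 1)
        for label in labels.split():
            mapping[label] = category.strip()
    return mapping


example = [
    "# Stems and affixes",
    "STEM\tA B C D I N O P Q V ?\t# Part-of-speech tags for stems",
    "AFFX\tp s\t# Derivational pre- & suffixes",
    "AFFX\tA+c A+s B+c B+s\t# Comparatives and superlatives",
    "\t\t# for adjectives and adverbs",
    "AFFX\t~\t# Hyphen or extra ending",
]
categories = parse_category_file(example)
print(categories["V"], categories["A+c"], categories["~"])
```

A category such as AFFX may appear on several lines; the labels of all these lines end up in the same mapping.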
The part-of-speech tags and suffix labels correspond to information in the morpheme parts of the chunks in the Gold Standard. If the morpheme part of the chunk
contains a base form as for, e.g., aspire|V, only the part-of-speech tag (V) is retained.
The question mark (?) covers the cases, where there is a baseform, but the part-of-speech tag is missing. In the current example, word stems of any part of speech are
classified as STEM, whereas derivational prefixes and suffixes (e.g., mis|p and ion|s)
are classified as AFFX.
If the morpheme-part of the chunk contains no baseform, but only a suffix label,
the entire label is retained (e.g., A+c representing the comparative ending ‘-er’). In the
example, all such cases are classified as AFFX. The special character ∼ (tilde) covers
cases, where there is a segment in the surface-form of the word, but no underlying
morpheme. Hyphens in compounds are such a case. The other case occurs if fuzzy
morpheme boundaries are activated and consists in the optional endings at the end of
baseforms of words, e.g., the ‘-e’ of ‘aspir+e’.
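The lookup described in the last two paragraphs can be sketched as a small Python function that reduces the morpheme part of a chunk to the label used in the category file. How a base form without a part-of-speech tag is actually written in the Gold Standard is not stated here, so the sketch assumes that it appears either without a vertical line or with an empty tag; the function name and the small mapping are illustrative only.

```python
def category_key(morpheme):
    """Reduce the morpheme part of a chunk to its category-file label."""
    if "|" in morpheme:
        # Base form with a part-of-speech tag: keep only the tag.
        pos = morpheme.rsplit("|", 1)[1]
        return pos or "?"  # assumption: an empty tag means 'missing'
    if morpheme == "~" or "+" in morpheme:
        # Tilde or suffix label: the entire label is retained.
        return morpheme
    return "?"  # assumption: bare base form without a tag


# A few entries of the example grouping given above.
categories = {"V": "STEM", "p": "AFFX", "s": "AFFX", "A+c": "AFFX", "~": "AFFX"}

for morpheme in ["aspire|V", "mis|p", "ion|s", "A+c", "~"]:
    print(morpheme, "->", categories[category_key(morpheme)])
```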
The script evaluates only the segmentation proposed by the algorithm, not the tagging.
The output looks like this:
Incorrectly inserted boundaries (#insertions/#morphemes)
AFFX         tok: 2288/62154 = 0.0368     typ: 518/14215 = 0.0365
STEM         tok: 15728/244871 = 0.0642   typ: 2462/18319 = 0.1344
--
Recall for detection of beginning of morpheme
AFFX         tok: 39634/57639 = 68.76%    typ: 8859/12923 = 68.56%
STEM         tok: 3132/8456 = 37.04%      typ: 1037/2041 = 50.78%
--
Recall for detection of ending of morpheme
AFFX         tok: 6431/10988 = 58.52%     typ: 1972/3312 = 59.56%
STEM         tok: 36335/55106 = 65.94%    typ: 7923/11652 = 68.00%
--
Recall for detection of morpheme transitions
AFFX AFFX    tok: 4608/6888 = 66.89%      typ: 1371/2088 = 65.65%
AFFX STEM    tok: 1823/4100 = 44.46%      typ: 602/1224 = 49.16%
STEM AFFX    tok: 35026/50750 = 69.02%    typ: 7488/10834 = 69.12%
STEM STEM    tok: 1309/4356 = 30.05%      typ: 435/818 = 53.20%
--
Statistics are computed both for word tokens (tok) and types (typ). First, the
distribution of incorrectly inserted morpheme boundaries is shown. For instance, there
were a total number of 62 154 morpheme tokens classified as AFFX according to the
Gold Standard and the morpheme category file. The algorithm proposed 2288 incorrect
boundaries within the affix morphemes, resulting in an average of 0.0368 insertions per
affix. For the stem morphemes the proportion of insertions was higher, 0.0642, which
is natural, since the stems are typically longer than the affixes, and there is thus more
“room” for insertions.
Next, the recall for the detection of the beginning of morphemes is shown. (The
first morpheme in every word is left out from this calculation, since the beginning
of the first morpheme is trivial to detect.) For instance, there were 2041 non-word-initial stem morphemes in the list of word types. The algorithm managed to place a
morpheme boundary at the beginning of these morphemes in 1037 cases, representing
50.78 % of all the cases.
Likewise, the recall for the detection of the ending of morphemes is computed.
(Here the last morpheme of every word is left out of the calculation.) For instance,
there were 11 652 non-word-final stem morphemes in the list of word types. The algorithm correctly detected the end of 7923, or 68.00 %, of these morphemes.
Finally, the placing of boundaries between morphemes is evaluated. According
to the example, the easiest boundaries to detect are those between a stem and
a following affix; 69.02 % were detected (when looking at word tokens). The most
difficult transitions to detect were those between two adjacent stems: only 30.05 %
were found (word tokens).
4.3.4 evaluate_tags.pl
The script evaluate_tags.pl evaluates the tagging proposed by the algorithm, and
is invoked as follows:
evaluate_tags.pl morpheme_categories < alignment.
alignment is the output of align_segmentations.pl, which must be run first.
The file morpheme_categories contains a grouping of morphemes in the Gold Standard
into the categories learned by the algorithm. Our example algorithm tags the
segments it discovers as prefixes, stems, or suffixes (PRE, STM, or SUF). It is thus
necessary to map the category labels used in the Gold Standard file onto these three
categories. Exactly the same syntax applies as for the morpheme_categories file in
Section 4.3.3:
PRE    p                        # (Derivational) prefix
STM    A B C D I N O P Q V ?    # Part-of-speech tags for stems
SUF    s                        # Derivational suffix
SUF    A+c A+s B+c B+s          # Comparatives and superlatives
                                # for adjectives and adverbs
SUF    N+P N+Po N+So            # Plurals & possessives of nouns
SUF    V+a1S V+e3S V+pe         # Verb endings (past, pres., -ing)
SUF    ∼                        # Hyphen or extra ending
                                # due to fuzziness
The output of the script looks like this (explanations are given after the example):
Number of word tokens: 250000, of which omitted 9070 (3.63%)
Number of word types: 21434, of which omitted 3864 (18.03%)
Token #tags(PRE)    desi: 4537 (1.48%)        reco: 3092 (1.02%)
Token #tags(STM)    desi: 244871 (79.76%)     reco: 246071 (81.56%)
Token #tags(SUF)    desi: 57616 (18.77%)      reco: 52548 (17.42%)
Token #tags(All)    desi: 307025 (100.00%)    reco: 301712 (100.00%)
Type #tags(PRE)     desi: 1304 (4.01%)        reco: 823 (2.70%)
Type #tags(STM)     desi: 18319 (56.31%)      reco: 19237 (63.18%)
Type #tags(SUF)     desi: 12910 (39.68%)      reco: 10387 (34.12%)
Type #tags(All)     desi: 32533 (100.00%)     reco: 30446 (100.00%)
Number of correctly segmented word tokens: 206884 (85.87%)
Number of correctly segmented word types: 11425 (65.03%)
Token tags correct: 240764/242003 (99.49%)
Type tags correct: 18629/18963 (98.24%)
Token precision (PRE): 96.39% (909/943)
Token recall (PRE): 76.90% (909/1182)
Token F-measure (PRE): 85.55%
Token precision (STM): 99.45% (207912/209061)
Token recall (STM): 99.96% (207912/208001)
Token F-measure (STM): 99.70%
Token precision (SUF): 99.83% (31943/31999)
Token recall (SUF): 97.33% (31943/32819)
Token F-measure (SUF): 98.56%
Type precision (PRE): 96.41% (295/306)
Type recall (PRE): 76.42% (295/386)
Type F-measure (PRE): 85.26%
Type precision (STM): 97.41% (11792/12106)
Type recall (STM): 99.83% (11792/11811)
Type F-measure (STM): 98.60%
Type precision (SUF): 99.87% (6542/6551)
Type recall (SUF): 96.70% (6542/6765)
Type F-measure (SUF): 98.26%
The figures are, as usual, calculated for both word tokens and word types. First, the
number of word tokens and word types in the data are shown together with the number and proportion of word forms for which there is no gold-standard segmentation,
and which are consequently left out of the evaluation. Then the number and proportion of
morphemes tagged with each of the three tags (PRE, STM, SUF) are shown for both the
gold-standard segmentation (in the “desired” field) and the segmentation proposed by
the algorithm (in the “recognized” field). Ideally, the figures in both fields should match.
The evaluation that follows after this is based only on the words that have been
segmented correctly. The number and proportion of these correctly segmented word
tokens and types are shown: 85.87 % (tokens), 65.03 % (types). For the correctly
segmented words, 99.49 % of the tags proposed by the algorithm match the tags in
the Gold Standard (word tokens). Evaluated on word types, the tagging is correct for
98.24 % of the morphemes.
Finally, precision, recall and F-measure are calculated for each morpheme category
separately, on word tokens as well as word types. For instance, the precision for prefixes is 96.39 % (word tokens), which means that 96.39 % of the morphemes tagged
as prefixes by the algorithm are indeed prefixes (when only the correctly segmented
words are considered). Recall for prefixes is lower, 76.90 % (word tokens), indicating
that there were a number of prefixes in the Gold Standard that were tagged as something else by the algorithm.
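The per-category figures can be recomputed from the counts printed in parentheses in the example output. The following Python sketch does this for the prefix and stem categories on word tokens; the function name `category_scores` is an assumption made for this illustration.

```python
def category_scores(correct, proposed, desired):
    """Return (precision, recall, F-measure) for one morpheme category.

    correct:  morphemes tagged with the category by both sides
    proposed: morphemes tagged with the category by the algorithm
    desired:  morphemes tagged with the category in the Gold Standard
    """
    precision = correct / proposed if proposed else 0.0
    recall = correct / desired if desired else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f


# Token counts for prefixes (PRE) and stems (STM) from the example output.
for name, counts in [("PRE", (909, 943, 1182)), ("STM", (207912, 209061, 208001))]:
    p, r, f = category_scores(*counts)
    print(name, f"{100 * p:.2f}% {100 * r:.2f}% {100 * f:.2f}%")
```

The printed values match the token precision, recall and F-measure lines of the example for both categories.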
5 Discussion
The gold-standard segmentations have been produced semi-automatically. The quality of the FINTWOL analyzer as well as the quality of the English CELEX database
determine to a large degree the quality of the final gold-standard segmentations. FINTWOL has some tendencies to over-generate, especially for compound words. This
means that there can be many alternative analyses for a word, some of which may be
peculiar and improbable. Also the stem-final fuzziness applied to the Finnish words
has been produced using automatic rules, which made it impossible to consider every
case separately due to time constraints.
Derivational morphology is more exhaustively described for English in CELEX
than for Finnish in FINTWOL. Finnish derivational morphemes have been provided
in the Gold Standard, when the information is available in FINTWOL. Additionally,
some obvious mistakes have been corrected for Finnish.
We use the CELEX data to identify the exact morpheme segmentation in the surface string of words. For instance, for the word ‘conductivities’ we are not satisfied
merely with the fact that it is derived from ‘conduct’; we also split the word into the
segments ‘conduct+iv+iti+es’. Others have used CELEX as well, e.g., (Schone and
Jurafsky, 2001; Baayen and Schreuder, 2000), but they do not aim at evaluating a
precise segmentation. Schone and Jurafsky (2001) evaluate conflation sets, e.g., the
word ‘conduct’ is derivationally related to ‘conduct’, ‘conducts’, ‘conducted’, ‘conducting’, ‘conduction’, ‘conductive’, ‘conductivity’, etc. Baayen and Schreuder (2000)
use CELEX, but they evaluate manually and any segmentation which is not wrong or
improbable is deemed acceptable regardless of the number of segments discovered.
CELEX sometimes gives derivations with morphemes, which have no corresponding surface realization in the final word form. For example, the word form ‘conductive’
is derived from ‘conduct’ via the intermediary form ‘conduction’. In those cases we
stay true to CELEX by introducing a corresponding empty string, or null morpheme,
indicated by ∼ (tilde) on the surface, e.g., ‘conduct+∼+ive’. (The deep-level morpheme segmentation is here: ‘conduct|V+ion|s+ive|s’.) However, derivational morphemes corresponding to empty strings are impossible to discover with current segmentation algorithms, so we drop those morphemes before evaluation. If someone
invents a method for discovering them, they can easily be retained. At this point, the
Finnish and English Gold Standards differ. Whereas there are some null morphemes
in the analyses of the English words, we have excluded all null morphemes from the
analyses of Finnish words.
Any null morphemes are ignored by the current evaluation scripts (cf. Section 4.3).
However, when fuzzy boundaries are allowed, we could make use of these morphemes:
For instance, English verbs usually do not have any ending marking the infinitive. But
if the final ‘e’ is detached from the verb stem due to fuzziness, it could be considered
as an infinitive ending instead of, as in the current implementation, being treated as
an optional “anonymous” ending (cf. Sections 3.6.2 and 4.3.3). Corresponding cases
exist for Finnish: The nominative case of nominals could obtain an ending due to fuzzy
boundaries, and so could the present tense of verbs.
There are special cases when the inflectional information is in the middle of the
string like ‘aides-de-camp’ as a plural form of ‘aide-de-camp’. These cases are rare in
English, so we found it to be most consistent with the Item and Arrangement model
not to split them at all, but to treat them as irregular forms, or allomorphs. Compare
the inflection of the words ‘sing’ as ‘sing’, ‘sang’, ‘sung’. We do not split this into
‘s+?+ng’, but instead we consider the baseform to have three allomorphs ‘sing’, ‘sang’,
‘sung’, which are selected in their entirety in different contexts.
For other languages, such as Arabic and other Semitic languages with frequent infixes,
one would likely choose a different segmentation standard. Already the Finnish language exhibits some “infix-like behaviour”, as there are sometimes inflectional “suffixes” in the middle of words. In numeral expressions each number is inflected separately even if the expression is written without spaces in one sequence, e.g., ‘kuude+lla+kymmene+llä+neljä+llä euro+lla’ (with sixty-four euros). Some pronouns also
have their inflection morphemes in the middle of the word, e.g., ‘jo+lla+kin’ (‘with
something’), which is the segmentation in our Gold Standard.
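Segmentations written with ‘+’ as the boundary marker, as in the examples above, can be turned into sets of boundary positions and compared with a proposed segmentation. The following is a minimal sketch of such a comparison using generic boundary precision and recall. The function names and the convention of returning 1.0 when a set of boundaries is empty are assumptions made for illustration, and the sketch does not reproduce the Hutmegs evaluation scripts or their handling of fuzzy boundaries.

```python
from typing import Set, Tuple


def boundary_positions(segmentation: str, sep: str = "+") -> Set[int]:
    """Return character offsets of morpheme boundaries in the unsegmented word."""
    positions = set()
    offset = 0
    morphs = segmentation.split(sep)
    # A boundary follows every morph except the last one.
    for morph in morphs[:-1]:
        offset += len(morph)
        positions.add(offset)
    return positions


def precision_recall(proposed: str, gold: str) -> Tuple[float, float]:
    """Compare the boundaries of a proposed segmentation with a gold segmentation."""
    p = boundary_positions(proposed)
    g = boundary_positions(gold)
    hits = len(p & g)
    # Assumed convention: an empty boundary set counts as perfect (1.0).
    precision = hits / len(p) if p else 1.0
    recall = hits / len(g) if g else 1.0
    return precision, recall


print(sorted(boundary_positions("jo+lla+kin")))    # [2, 5]
print(precision_recall("jolla+kin", "jo+lla+kin"))  # (1.0, 0.5)
```

In the usage example the proposed segmentation misses the boundary before the word-internal case ending, so recall drops while precision stays at 1.0.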
To conclude, Hutmegs is an evaluation package for morphological segmentations
of Finnish and English vocabularies. It is mainly based on an Item and Arrangement
model of morphology, which results in particular strengths and weaknesses. The
possibility to use fuzzy morpheme boundaries may alleviate some of the rigidity that
would ensue if one single correct answer had to be chosen.
We hope that the growing circle of researchers studying morphology induction
will discover the Hutmegs package and find it useful for their work. By providing
a segmentation standard that can be used for benchmarking and evaluating different
morphology-learning algorithms, we wish to contribute to the promotion of research
in this fascinating field.
Acknowledgements
We are grateful to D.Sc. (Tech.) Krista Lagus for her thorough examination of the
manuscript and her insightful comments. Furthermore, we would like to thank the
Graduate School of Language Technology in Finland for funding our work, and we express
our gratitude to the Finnish Center for Scientific Computation and the Finnish National
News Agency for providing text corpora and making it possible for us to produce a
large vocabulary of Finnish words. Our sincere thanks go to Juhani Reiman, CEO of
Lingsoft, Inc., for sharing our interest in making the Finnish word segmentations
public for research, and to Mr. Sami Virpioja for drawing our attention to a few bugs in
earlier versions of the Finnish Gold Standard.
References
R. Harald Baayen and Robert Schreuder. 2000. Towards a psycholinguistic computational model for morphological parsing. Philosophical Transactions of the Royal
Society (Series A: Mathematical, Physical and Engineering Sciences), 358:1–13.
R. Harald Baayen, Richard Piepenbrock, and Léon Gulikers. 1995. The CELEX
lexical database (CD-ROM). Linguistic Data Consortium, University of Pennsylvania,
Philadelphia, PA. URL: http://wave.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC96L14.
Mathias Creutz and Krista Lagus. 2002. Unsupervised discovery of morphemes. In
Proceedings of the 6th Meeting of the ACL Special Interest Group of Computational
Phonology (SIGPHON), pages 21–30, Philadelphia, Pennsylvania, USA.
Mathias Creutz and Krista Lagus. 2004. Induction of a simple morphology for
highly-inflecting languages. In Proceedings of the 7th Meeting of the ACL Special Interest
Group of Computational Phonology (SIGPHON), pages 43–51, Barcelona, Spain.
Mathias Creutz. 2003. Unsupervised segmentation of words using prior distributions
of morph length and frequency. In Proceedings of the 41st Annual Meeting of the
Association for Computational Linguistics (ACL), pages 280–287, Sapporo, Japan.
Carl G. de Marcken. 1996. Unsupervised Language Acquisition. Ph.D. thesis, MIT.
Hervé Déjean. 1998. Morphemes as necessary concept for structures discovery from
untagged corpora. In Workshop on Paradigms and Grounding in Natural Language
Learning, pages 295–299, Adelaide.
John Goldsmith. 2001. Unsupervised learning of the morphology of a natural language. Computational Linguistics, 27(2):153–198.
Lauri Hakulinen. 1979. Suomen kielen rakenne ja kehitys (The structure and development
of the Finnish language). Kustannus-Oy Otava, 4th edition.
Kimmo Koskenniemi. 1983. Two-level morphology: A general computational model
for word-form recognition and production. Ph.D. thesis, University of Helsinki.
Christopher D. Manning and Hinrich Schütze. 1999. Foundations of Statistical Natural Language Processing. The MIT Press.
Mitchell P. Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. 1993. Building
a large annotated corpus of English: the Penn Treebank. Computational Linguistics, 19.
P. H. Matthews. 1991. Morphology. Cambridge Textbooks in Linguistics, 2nd edition.
Randolph Quirk, Sidney Greenbaum, Geoffrey Leech, and Jan Svartvik. 1985. A
Comprehensive Grammar of the English Language. Longman, Essex.
Patrick Schone and Daniel Jurafsky. 2001. Knowledge-free induction of inflectional
morphologies. In Proceedings of the 2nd Meeting of the North American Chapter
of the Association for Computational Linguistics NAACL-2001.
Vesa Siivola, Teemu Hirsimäki, Mathias Creutz, and Mikko Kurimo. 2003. Unlimited vocabulary speech recognition based on morphs discovered in an unsupervised
manner. In Proceedings of the 8th European Conference on Speech Communication and Technology (Eurospeech), pages 2293–2296, Geneva, Switzerland.
Nianwen Xue, Fei Xia, Fu-Dong Chiou, and Martha Palmer. 2004. The Penn Chinese Treebank: Phrase structure annotation of a large corpus. Natural Language
Engineering, 10(4):1–30. URL: http://www.cis.upenn.edu/%7Echinese/.