
Unsupervised Segmentation for Statistical Machine Translation
Siriwan Sereewattana
Master of Science
School of Informatics
University of Edinburgh
2003
Abstract
An unsupervised approach is applied to segment German-English and French-English parallel corpora for statistical machine translation. The approach requires no language- nor domain-specific knowledge whatsoever. Segmentation is shown to effectively reduce the number of unknown words and singletons in the corpora, which helps improve the translation model. As a result, word error rates are lowered by 0.37% and 2.15% in the translation of German to English and French to English respectively. The benefits of segmentation to statistical machine translation are more pronounced when the training data size is small.
Acknowledgements
I would like to thank Miles Osborne and Chris Callison-Burch for their guidance; and
Pronab Saha for his much-needed moral support.
Declaration
I declare that this thesis was composed by myself, that the work contained herein is
my own except where explicitly stated otherwise in the text, and that this work has not
been submitted for any other degree or professional qualification except as specified.
(Siriwan Sereewattana)
Table of Contents

1 Introduction
2 Statistical Machine Translation
   2.1 The Translation Model
3 Unsupervised Morphology
   3.1 Unsupervised Segmentation for Morphology
4 Unsupervised Segmentation for Machine Translation
   4.1 Methodology
   4.2 Segmentation for Machine Translation
       4.2.1 Corpora
       4.2.2 Segmenting
       4.2.3 Postprocessing
5 Experiments and Results
   5.1 Evaluation Metrics
   5.2 Experimental Results
       5.2.1 Baseline Segmentation
       5.2.2 Optimal Segmentation
   5.3 Configuring Segmentation for Machine Translation
       5.3.1 What to Segment
       5.3.2 How to Segment
       5.3.3 How to Reconstruct a Fluent Sentence
6 Conclusion
   6.1 Future Work
A Software
   A.1 Segmentation
   A.2 Machine Translation
B Segmentation Lexicons
Bibliography
Chapter 1
Introduction
Machine translation is arguably the hardest task in all of natural language processing because it requires solving a number of natural language understanding and generation problems as well as extensive linguistic knowledge of the languages involved. However, in the early 1990s researchers at IBM devised a statistical approach to machine translation (Brown et al. (1993)) that requires minimal, if any, linguistic knowledge. They used machine learning techniques to automatically induce bilingual dictionaries and translation rules by examining co-occurrences of words and their relative orderings in millions of sentences paired with their translations into another language.
Although in principle the statistical framework of machine translation can be applied to any language pair, in practice its application is limited to languages with large
parallel corpora. Al-Onaizan et al. (2000) explain in simple terms the importance of
data in statistical machine translation:
If a program sees a particular word or phrase one thousand times during
training, it is more likely to learn a correct translation pattern than if it sees
it ten times, or once, or never.
When more data are available for the training of a statistical translation system,
the problem of sparse data or unseen words is ameliorated and the system is able to
estimate better bilingual dictionaries and better translation rules. As a result, the overall
translation improves.
[Figure 1.1 plot: WER (%) against training corpus size (1,000 sentences).]
Figure 1.1: Word error rates of French to English translation are lowered as the training
corpus size grows.
Figure 1.1 from Callison-Burch (2002) shows that the word error rate of French to English translation falls as the size of the training corpus increases. Word error rate is a common measure of translation quality, computed from the minimum number of insertion, deletion and substitution operations required to transform a machine-translated sentence into a reference sentence. Similar improvements in translation quality are also reported for the translation of German and Spanish to English when more training data are used.
Unfortunately, large parallel corpora are not always readily available for arbitrary languages. Building such a corpus takes a great amount of effort and time. When the size of the training corpus is small, the data sparsity problem is worse. This is exacerbated by the fact that a statistical translation system treats sequences of characters separated by spaces as words. By treating distinct sequences of characters as distinct words, the system cannot identify the relationship between different word forms that are derived from the same root. For example, it is unaware of any similarity between “eat”, “eats”, and “eating”, and will have to learn the translation of each independently.
Brown et al. (1993) note that the problem of sparse data is more prominent when a language is morphologically rich, such as French. In French a verb can take so many forms that it is unlikely that all of them will be seen in a corpus. For example, there are a total of 41 different forms of the verb devoir but only seven were present in their training corpus.
Brown et al. propose a morpho-syntactic approach to address such sparse statistics
problems in machine translation:
We plan to ameliorate these problems with a simple, inflectional analysis
of verbs, nouns, adjectives, and adverbs, so that the brotherhood of the
several forms of the same word is manifest in our representation of the
data ... Thus, our intention is to transform (je mange la pêche | I eat the
peach) into, e.g., (je manger, 13spres la pêche | I eat, x3spres the peach).
Here, eat is analyzed into a root eat, and an ending, x3spres, that indicates
the present tense form used except in the third person singular.
While their proposed approach helps alleviate the data sparsity problem, it requires
an intensive morpho-syntactic analysis of both source and target languages. It is clear
that linguistic knowledge is mandatory but may not be readily available.
Instead of human linguistic knowledge, this report applies an unsupervised segmentation to discover morphology-like regularities from monolingual corpora as an
alternative preprocessing technique for statistical machine translation. The unsupervised approach to segmentation requires no linguistic knowledge whatsoever and as
such can be applied to any language.
In this report, regular sequences of characters are separated from the words in which they are embedded by an unsupervised segmenter. Two language pairs are used: German-English and French-English. The following table shows segmentation examples from each monolingual corpus. A segmentation position within a word is marked by “+ +” symbols.

German:   dies ist eine große Heraus+ +forderung für die Völker+ +gemeinschaft
French:   La République franç+ +ais est condamn+ +ée les dépens
English:  the part+ +ies also discussed inter+ +national matters
Without increasing the size of the training data, experimental results for both German-English and French-English language pairs show an overall improvement in translation quality when the corpora are segmented. Word error rates of the German to English and the French to English translations are lowered by 0.37% and 2.15% respectively. The numbers of singletons, or words that occur only once, and of unseen words in the segmented corpora also decrease significantly, which helps improve the translation models.

[Figure 1.2 bar charts. Figure 1.2: Segmentation improves the translation quality of both German-English and French-English language pairs by ameliorating the data sparsity problem.]

Figure 1.2 shows lowered rates of sparse words in the segmented corpora and lowered translation word error rates in both language pairs.
The remainder of this report is structured as follows:
• Chapter 2 surveys techniques, both general and language dependent, devised to
improve the translation quality in the statistical framework and their implications.
• Chapter 3 surveys unsupervised morphological learning methodologies with an
emphasis on unsupervised segmentation.
• Chapter 4 describes the unsupervised segmentation algorithm and methodology
employed in this report. Various segmentation results are also presented.
• Chapter 5 describes the configuration of segmentation for statistical machine
translation of each language pair. Experimental results based on the best configuration and a non-strategic segmentation are presented.
• Chapter 6 discusses the implications of the experimental results as well as future work.
Chapter 2
Statistical Machine Translation
A statistical approach to machine translation was proposed in the early 1990s by a group of researchers at IBM (Brown et al. (1990, 1993)), when large machine-readable bilingual corpora became increasingly available. Their statistical
machine translation makes few, if any, assumptions about the source and target languages. The approach relies exclusively on a large number of source sentences aligned
with their translations in the target language.
Statistical machine translation involves three computational tasks: the approximation of the language model probability and the translation model probability, and an
efficient search for a translation text that maximizes their product.
For the rest of the report, an example of a French, f, to English, e, translation will
be used although it should be noted that any source and target languages are applicable.
For each translation pair, (e, f), a conditional probability Pr(e|f) is calculated. Using Bayes's theorem, this is equivalent to:

Pr(e|f) = Pr(e) Pr(f|e) / Pr(f)    (2.1)
Since f is given, Equation (2.1) can be simplified as:
Pr(e|f) ∝ Pr(e) Pr(f|e)    (2.2)
Pr(e) is the language model that determines the well-formedness of the English string, whereas Pr(f|e) is the translation model of the source and target languages. Finding the best translation for a given French string amounts to searching for the English string that maximizes the product of the language model and the translation model probabilities.

[Figure 2.1 diagram: an input French string is passed to a global search for argmax_e P(e|f), which combines the translation model P(f|e) and the language model P(e) to produce the English translation string.]

Figure 2.1: Statistical machine translation architecture.
ê = argmax_e Pr(e) Pr(f|e)    (2.3)
Brown et al. refer to Equation (2.3) as the Fundamental Equation of Statistical
Machine Translation.
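As an illustration of Equation (2.3), the decoding step can be sketched as a scoring loop over candidate English strings. The sketch below is purely illustrative: lm_prob, tm_prob and the candidate pool are hypothetical stand-ins for the real language model, translation model and search procedure, which in practice is a specialized decoder rather than an exhaustive loop.

```python
import math

def decode(f, candidates, lm_prob, tm_prob):
    """Return the candidate e maximizing Pr(e) * Pr(f|e) (Equation 2.3).

    Scores are combined in log space to avoid numerical underflow on
    long sentences; lm_prob and tm_prob are assumed to return
    probabilities strictly greater than zero.
    """
    best_e, best_score = None, float("-inf")
    for e in candidates:
        score = math.log(lm_prob(e)) + math.log(tm_prob(f, e))
        if score > best_score:
            best_e, best_score = e, score
    return best_e
```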
2.1 The Translation Model
During the training phase, the translation is modelled by examining the co-occurrence
of words in each sentence pair and trying to connect the words in the source string and
the target string. In the automatic word alignment, every French word is connected to exactly one English word, but each English word can produce zero or more
French words. The number of French words that an English word produces in a given
alignment is called the fertility of that alignment. A distortion occurs when the position
of a French word is different from that of the English word that produces it.
Figure 2.2 illustrates the alignment between an English source sentence and its French translation. The word overlap is connected to the words se chevauchent and therefore has fertility 2. Two English words can produce one French word, such as of the and des.

the activities of the parties overlap to a very small extent
les activités des parties se chevauchent à un degré très faible

Figure 2.2: An example of word alignment between an English and a French sentence.
Given sentences in the source language and their translations in the target language, a translation model is constructed by estimating a word-for-word alignment probability between both languages. The probability of an alignment a given an English source sentence e and its French translation f can be written as follows:
Pr(a|e, f) = Pr(a, e, f) / Pr(e, f)
           = Pr(a, f|e) · Pr(e) / (Pr(f|e) · Pr(e))
           = Pr(a, f|e) / Pr(f|e)
Since more than one alignment is possible for each sentence pair, the translation
model probability term Pr(f|e) of the Fundamental Equation of Statistical Machine
Translation is therefore the sum over all possible alignments between the source string
and the target string.
Pr(f|e) = ∑_a Pr(f, a|e)    (2.4)
It should be re-emphasized that although the above example focuses on a French-English language pair, in theory the framework can be applied to any source and target
languages.
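As an aside on Equation (2.4): the report's alignment models involve fertility and distortion, which do not decompose simply, but for the simplest of the IBM models (Model 1) the exponential sum over alignments collapses into a product of per-position sums and can be computed exactly. A minimal sketch under that assumption, with a hypothetical word translation table t[e][f]:

```python
def model1_likelihood(f_words, e_words, t, epsilon=1.0):
    """Pr(f|e) under an IBM Model 1-style translation table t[e][f].

    The sum over all alignments factorizes: each French word is
    generated independently by some English word (or NULL), so
    Pr(f|e) = eps / (l+1)^m * prod_j sum_i t(f_j | e_i).
    """
    e_with_null = ["NULL"] + e_words          # position 0 is the empty word
    prob = epsilon / (len(e_with_null) ** len(f_words))
    for f in f_words:
        prob *= sum(t.get(e, {}).get(f, 1e-12) for e in e_with_null)
    return prob
```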
A major problem plaguing statistical machine translation is sparse and unseen data.
Most often, this is due to limited availability of training data. Given a large amount
of data, the parameters of a translation model can be estimated more accurately which
results in a more accurate translation of unseen data.
Because statistical machine translation treats space-separated units as words, it is
unaware of the similarity between different forms of the same word, for example “eat”,
“eats”, and “eating”, and tries to model the translation of each form as a distinct word.
This linguistic naïveté often results in an oversized vocabulary which complicates the translation model. It also exacerbates the problem of data sparsity, especially when a language is morphologically rich. Brown et al. (1993) discuss the problem of sparse data in French to English translation:
In English, for example, we recognize no kinship among the several forms
of the verb to eat (eat, ate, eaten, eats and eating). ... In French, the
verb manger (to eat) has thirty-nine forms (manger, mange, manges, ...,
mangeassent). The translation model must learn to connect the five forms
of to eat to the thirty-nine forms of manger. In the 28,850,104 French
words that make up our training data, only thirteen of the thirty-nine forms
of manger actually appear.
To improve the quality of automatic translation in the statistical framework, two directions have been taken: increasing the amount of training data and manipulating existing
data for a better translation model. The following describes a number of techniques
that have been devised to improve the translation quality. While many succeed without
any linguistic knowledge, others require language- or domain-specific knowledge in
order to do so.
Language Independent Approaches As shown in Figure 1.1 in the introduction,
higher translation accuracy rates can be achieved by increasing the size of the training
corpus. However, building a large parallel corpus manually requires a lot of time and
money. Automatic approaches have therefore been devised to gather more data for
machine translation.
• Koehn (2002) develops a crawler to gather web pages of the proceedings of the
European Parliament in 11 languages from the European Parliament website.
Several other steps are then taken to automatically extract and align English
sentences from the page contents with their corresponding translation in other
languages. The result is multilingual corpora each of which contains over half
a million sentence pairs aligned with their English translation. Koehn reports
performance gain when machine translation systems are trained on increasing
amounts of training data. The availability of quality multilingual corpora extends
the capability and the quality of statistical machine translation in multiple source
languages.
• Callison-Burch and Osborne (2003) apply co-training in machine learning to
add more aligned sentences for the training of a machine translation system.
Their approach bootstraps different translation systems with small parallel corpora. After each round of translation evaluation, the N best translations are added
into the training corpora and removed from the untranslated pool. In their approach, documents in different source languages provide different views for cotraining. When trained on larger corpora newly acquired by co-training, the
translation systems have been shown to sometimes perform better, with lower word
error rates. This co-training technique has proved to be a resource-efficient way
to increase the size and quality of parallel corpora for statistical machine translation.
Language or Domain Specific Approaches Instead of increasing the size of the training data, many have tried to transform existing data in language- or domain-specific ways in order to improve the translation quality. Although these approaches achieve a remarkably higher translation quality, their application is often limited to certain languages or domains.
• García-Varea and Casacuberta (2000) divide the sentence pairs in a travel domain into categories of the words they contain, such as DATES, HOURS, ROOMS
and NUMBERS. In each group of sentences, they extract the categorized instances by a rule-based tagger and replace them with their corresponding keywords. For example, veintiseis de abril in the source Spanish sentence and April
the twenty-sixth in the English sentence are replaced by $DATE. A translation
system is then trained on multiple parallel corpora which include, for each category, a corpus of sentences with keywords in place of categorized words, and
a corpus of actual categorized instances. The vocabulary size and average sentence length of each corpus are significantly reduced and as a result the overall
translation quality improves noticeably. However, the categories used are by and
large specific to a particular domain. The application of this method to other
domains requires careful consideration for the choice of categories because every additional category requires more work in both the extraction and translation
steps.
• More information can be added to the data in an attempt to improve the translation model. Brown et al. (1992) perform morpho-syntactic analysis on the
French and English corpora. The surface text is transformed into a new representation that reflects syntax and morphology of words. The transformations
include part of speech annotation, question inversion, and normalization of inflectional morphology. They achieve an improvement in the translation quality
by increasing the number of translations judged acceptable by humans from 39 to 60 out of 100 sentences.
• Koehn and Knight (2003) split German compounds into units in order to promote a one-to-one alignment in the German to English translation model. In
effect a one-to-one alignment helps improve the fertility probability in the word
alignment model. The splitting step is performed with linguistic information,
such as part of speech, as well as the statistics of words obtained from the German corpus. They evaluate the splitting results against manually annotated word
splits. When the most linguistically accurate splitting method is applied to machine translation, the translation quality turns out to be less accurate than other
less linguistically motivated methods including the biggest split which favors as
many splits as possible in a compound.
It has been shown that more training data helps improve statistical translation quality. However, the effort and time required to build a large parallel corpus of good quality are almost too great to be practical. Alternatively, the performance of a statistical
translation system can be improved by transforming existing data using linguistic or
domain-specific knowledge. As a result, the application of statistical translation is
limited to certain languages or domains.
The unsupervised segmentation approach presented in this report aims to improve
the translation quality by exploiting the regularities of raw data, and thereby alleviating the data sparsity problem of statistical machine translation. The approach does not
require a bigger training corpus, or any language- or domain-specific knowledge. The
segmenter developed in this report is based on techniques of unsupervised segmentation for morphology which are reviewed in the next chapter.
Chapter 3
Unsupervised Morphology
Morphology is concerned with the ways in which words are formed from smaller
meaningful units, or morphemes. Morphology benefits statistical machine translation
in the recognition of unknown words. By identifying morphemes in unknown words,
their syntactic functions can be inferred and the translation can be performed based on
known morphemic components.
There are three distinct types of morphology: inflection, derivation, and compounding (Hutchins and Somers (1992)). Inflectional morphology is the system defining the
possible variations on a root, or base form, which do not change its original word
class. For example, “play”,“plays” and “played” share the same “play” root and are all
of the same verb category. Derivational morphology is the formation of morphemes
into words of a different grammatical category. Compounding is concerned with the
forming of a new word from two or more whole words. A new compound may or may
not share the same meaning as its components, for example, “skyrocket” and “blackboard”.
Isolating languages, such as Chinese, have almost no inflectional morphology;
whereas polysynthetic languages, such as Inuktitut, inflect verbs and nouns to express
the grammatical meaning of a sentence. French is also considered an inflectionally rich
language due to various conjugations of verbs. English, on the other hand, has more
derivations but limited inflectional verb forms. Languages with rich morphology are
problematic for statistical machine translation because most often the available data
lacks instances of all possible forms of a word to efficiently train a translation system.
For example, Brown et al. (1993) discover that only 13 of the possible 39 forms of
the French verb manger actually appear in their training data of nearly thirty million
words.
In a language like German, new words can be formed by simply writing two or more words together without a space or a hyphen in between. Since statistical machine translation treats space-delimited units as words, such a phenomenon poses a data sparsity problem for a translation system: new words are constantly formed which have hardly, if ever, been seen before.
Learning morphology of a natural language is a task that does not usually come
with a set of rules or examples. However, with adequate linguistic information and external knowledge, it is shown that morphology can successfully be learned by machine.
For example, Yarowsky and Wicentowski (2000) use syntactic information and a list of
<inflection, root> examples to induce inflectional morphology, both regular and irregular, of unseen words in English and Spanish. Grabar and Zweigenbaum (1999) use thesauri
and English medical dictionaries to discover inflection, derivation and composition of
medical terms in other languages.
Contrary to the approaches above, this report focuses on unsupervised, knowledge-free approaches under a concatenative model, in which segmentation has been widely employed to split words into smaller meaningful units. Dealing with irregular morphology, such as “ate” of the root “eat”, is beyond the scope of the report. Various
segmentation techniques for morphology are reviewed in the following section. These
techniques are adapted for the development of an unsupervised segmenter for statistical
machine translation in this report.
3.1 Unsupervised Segmentation for Morphology
The task of unsupervised segmentation for morphology is performed on a set of individual words or a stream of symbols, for example continuous speech and concatenated
text. When the data has clear word boundaries such as spaces, a list of words can be
easily extracted for morphological segmentation. In continuous speech and concatenated text, there are no clear word boundaries to exploit. Therefore, a segmenter has
to implicitly learn the global structure of the data in order to split a stream of symbols into smaller meaningful units. Unsupervised segmentation approaches in both
scenarios are discussed below.
Minimum Description Length (MDL) Based on information theory, the MDL
framework provides a way to fit statistical models to data (Rissanen and Ristad (1994))
and a way of trading off hypothesis complexity against the number of errors committed
by the hypothesis (Mitchell (1997)). A hypothesis is chosen where
h_MDL = argmax_{h ∈ H} [log₂ P(D|h) + log₂ P(h)]    (3.1)
where h is a hypothesis in the hypothesis space H and D is the input data. Given the data, MDL biases toward the hypothesis that minimizes the combined description lengths, measured in bits, of the model and the data it describes; maximizing the log-probabilities above is equivalent to minimizing the corresponding description lengths. The encoding schemes and search heuristics for an optimal hypothesis may vary, and their choice can considerably affect the outcome.
The MDL principle has been successfully employed by many in an unsupervised
approach to segmentation.
• Kazakov and Manandhar (2001) use a simplified version of MDL as the fitness
function of a genetic algorithm to identify morphemes from French verbs. Each
word is segmented into a prefix and a suffix. Based on the total number of
characters of entries in the prefix and suffix lexicons, the best hypothesis is the
one with the lowest number. For example, consider a list of various conjugations
of the verb chanter as shown in the table below. To encode these words, it
takes 36 characters. When they are segmented, only 22 characters are needed to
encode the lexicons which therefore represent a better model for this word list.
Word        Length   Segmentation   Prefixes   Suffixes
chanta      6        chan ta        chan       ta
chantai     7        chant ai       chant      ai
chantais    8        chant ais                 ais
chantait    8        chan tait                 tait
chanter     7        chant er                  er
Total characters:    36                        22 (9 + 13)
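A minimal sketch of this simplified MDL fitness, counting the characters needed to encode the prefix and suffix lexicons of a candidate segmentation (the function name and representation are illustrative, not Kazakov and Manandhar's actual implementation):

```python
def description_length(segmentations):
    """Total characters in the prefix and suffix lexicons.

    `segmentations` holds one (prefix, suffix) pair per word; each
    distinct lexicon entry is encoded only once, which is where the
    saving over the raw word list comes from.
    """
    prefixes = {p for p, _ in segmentations}
    suffixes = {s for _, s in segmentations}
    return sum(map(len, prefixes)) + sum(map(len, suffixes))

splits = [("chan", "ta"), ("chant", "ai"), ("chant", "ais"),
          ("chan", "tait"), ("chant", "er")]
print(description_length(splits))  # 22 (9 + 13), versus 36 unsegmented
```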
In their approach, the initial lexicons consist of randomly segmented words, with
no repetition. The lexicons are then altered by shifting the segmentation position
to the left or to the right and are re-evaluated with the fitness function. The naïve interpretation of MDL yields a reasonably good result because it takes advantage
of the orthographic regularities in French verb conjugation. The result of its
application to other languages is not reported.
• Brent et al. (1995) use MDL to discover English suffixes from both raw and
part-of-speech annotated data. Goldsmith (2001), on the other hand, relies on
raw data of The Adventures of Tom Sawyer in different European languages to
discover sets of suffixes that are associated with each stem in each language. Their output lexicons are comparable with gold standards produced by linguists.
• de Marcken (1995, 1996); Hua (2000) apply MDL to the segmentation of continuous speech and concatenated text, in English and Chinese. Their approaches
work well when tested with unseen data. Common segmentation mistakes include under segmentation, such as “toinsure”; over segmentation, such as “f eas
ibility” and “rig id”; and confused segmentation, such as “took now” instead of
“too know”.
Multiple Word and Multiple Character Approaches Many techniques are devised to identify morphemic boundaries in compounds and continuous speech by examining multiple adjacent words or multiple character sequences in the data.
• Jacquemin (1997) uses a multiple word approach to induce derivational morphology from data in the medical domain. Each multiple word context is examined to conflate similar multi-word terms together. Based on their suffixes, these
terms are then clustered into classes of derivational forms that co-occur. As a
result, multiple words in different graphical forms can be recognized as having
a similar informational content, for example, “acoustic signal” and “acoustical
signal”.
• Deligne and Bimbot (1997) use a multi-gram model of character sequences
to segment concatenated English text and continuous speech. The objective is
to find a model, which is a set of character n-grams and frequency thresholds,
that maximizes the likelihood of the data and itself. In their experiments, word
boundaries can be discovered using a five-gram model, or a model that consists
of sequences of five characters or shorter, and the initial frequency threshold of
five. Similar mistakes to de Marcken’s and Hua’s approaches are observed in the
results including under and over segmentation.
Deligne and Bimbot explain that a five-gram model is suitable for English data because most lexical items in English are shorter than eight characters. When longer sequences of letters are allowed and the thresholds are kept constant, they find that the final lexicon is also bigger, because there are many lengthy items that are made up of shorter ones. For example, in a nine-gram model words such as “becauseof” are discovered, which are repetitions of “because” in the seven-gram lexicon and “of” in the bi-gram lexicon.
Apparently, the choice of a multi-gram model must be proportional to the average length of the lexical items expected in the data. Overlapping distributions when using higher-order n-grams pose a potential problem for their approach.
• Ando and Lee (1999, 2003) propose a similar character multi-gram approach to segment Japanese kanji sequences of compound nouns. They consider a set of n-grams but compare the counts of each separately. By considering the distribution of one n-gram at a time and using a “voting” system instead of raw frequency, their approach ameliorates the problem faced by Deligne and Bimbot above, where the frequencies of higher-order n-grams overlap those of lower-order ones. A threshold is also used to control the granularity of segmentation. The n-grams and threshold value are set empirically, but preference is given to smaller cardinalities, shorter n-grams and larger threshold values.
They train the program on multi-gram frequencies obtained from a large raw
news corpus. The segmentation is then performed on a list of compound nouns.
Their approach yields segmentation results with a high accuracy rate when evaluated against manually annotated references.
Unlike Deligne and Bimbot's approach, where a multi-gram model consists of a continuous range of n-grams, Ando and Lee's approach allows the omission of certain n-grams from the model. For example, their best experimental result comes from a multi-gram model of bi-grams, four-grams and six-grams. A problem encountered by their approach is single-character affixes, which are either wrongly segmented or not discovered at all.
Syntactic and Semantic Approaches More information about the data can be exploited in unsupervised segmentation and morphology learning. Language-specific
tools, such as part of speech taggers, are often used in order to obtain additional information about the data. Although the use of linguistic tools may improve the performance of a learner, its application is limited to certain languages for which such tools
are available.
• Schone and Jurafsky (2001) combine various information sources from an English corpus to produce an output of a set of inflectional and derivational morphemes for each base form. A part of speech tagger is used to annotate the
corpus. The information used includes semantic relatedness of affixes, affix frequency, syntactic context, and transitive closure. They identify the lexicons of
potential prefixes and suffixes using a character tree or trie, and analyze their
semantic and syntactic relatedness by using Latent Semantic Analysis and surrounding words, respectively. Although their approach yields an accurate result,
the learning process is rather complicated for the task.
Unsupervised segmentation has been successfully applied to split orthographic
words into smaller meaningful units. Most of the segmentation techniques discussed in
this chapter rely on large amounts of data, although a few use language-specific tools
such as part of speech taggers.
The unsupervised segmenter developed in this report is based on segmentation techniques for morphology. However, it is not intended to learn morphology per se. Instead, its induced segmentation lexicons are expected to ameliorate the problem of unknown words in statistical machine translation and thereby improve the overall translation quality. Moreover, the segmenter does not assume any linguistic knowledge or
external resources such as linguistic tools of any kind. The unsupervised segmentation
for machine translation is described in the next chapter.
Chapter 4
Unsupervised Segmentation for
Machine Translation
The unsupervised segmenter described in this chapter is inspired by the work of Ando
and Lee (2003) and Deligne and Bimbot (1997) in unsupervised segmentation for morphology discussed in the previous chapter. However, the segmenter in this report does
not perform morphological analysis on the data per se. Therefore, the linguistic accuracy of learned segmentation lexicons is not evaluated explicitly. Instead, a statistical
translation system is used to determine the segmentation quality.
It should be noted that the translation quality is not an indicator of the linguistic
quality of segmentation lexicons. As shown in Koehn and Knight (2003), the most
linguistically accurate splits of German compounds do not result in the best translation quality. In their experiments, several methods are used to select among different
segmentation options for each compound. In one method, a part of speech tagger is
used to ensure resulting segments are only content words including nouns, adverbs,
adjectives and verbs. In another, the empirical frequency in the data is used to decide
whether to keep or split compounds. They also consider splitting compounds into as
many parts as possible and using the translation result of a parallel corpus as a segmentation guideline. When evaluated against a manually segmented gold standard, the
segmentation based on both syntactic and translation information yields the highest accuracy rate. However, when these segmentation methods are applied to German noun
and prepositional phrases in a German to English machine translation task, the biggest
performance gain comes from the most split and frequency-based methods.
The unsupervised segmenter in this report is trained on empirical frequencies of
various character-based n-grams obtained from raw corpora. Segmentation is performed on whole sentences instead of a list of words in order to take into account
the context of words to be segmented. Additionally, by considering whole sentences
rather than individual words as Ando and Lee do, no linguistic presumption is made as
to which words are selected for segmentation. A threshold value is used to adjust the
granularity of segmentation.
The following sections describe the unsupervised segmenter for statistical machine
translation developed in this report. Empirical experiments performed to optimize the
setting of segmentation parameters for machine translation are presented in the next
chapter.
4.1 Methodology
The character multi-gram approach to unsupervised segmentation in this report is
adapted from the work of Deligne and Bimbot (1997) and Ando and Lee (2003). While
Deligne and Bimbot’s approach is employed to recover English words from continuous speech, Ando and Lee’s is used to split Japanese compounds. Their approaches are
modified for a universal, language-independent unsupervised segmenter in this report
in the following ways:
• The training and segmentation are performed based on the distributions of character n-grams obtained from a raw training corpus. However, rather than using
raw n-gram frequencies, the segmenter adopts a voting mechanism devised by
Ando and Lee. This is to prevent smaller n-grams from dominating the segmentation decision, because they tend to have higher frequencies.
• Similar to Deligne and Bimbot’s approach, segmentation is performed on whole
sentences rather than individual words and a multi-gram model consists of a
continuous range of n-grams. This allows the segmenter to take into account the
context of the words to segment. Instead of concatenating text, spaces are treated as a character in order to preserve word boundaries.
• A threshold is used to control the granularity of segmentation as in Ando and
Lee’s approach. Its value is set empirically.
Let #(s) denote the number of times the character sequence s occurs in the training corpus.
Ando and Lee formally describe the segmentation algorithm as follows:
Fix a location k and an n-gram order n, and let S_L^n and S_R^n be the non-straddling n-grams just to the left and the right of it, respectively. For j ∈ {1, 2, ..., n−1}, let t_j^n be the straddling n-gram with j characters to the right of location k. We define I_>(y, z) to be the indicator function that is 1 when y > z, and 0 otherwise.

v_n(k) = 1/(2(n−1)) · ∑_{d ∈ {L,R}} ∑_{j=1}^{n−1} I_>(#(S_d^n), #(t_j^n))    (4.1)
The average vote of the overall multi-gram N is therefore the sum of each n-gram’s
vote for the location, divided by the number of n-grams in the model:
v_N(k) = (1/|N|) ∑_{n ∈ N} v_n(k)    (4.2)
Let us consider an input string The actions were dismissed and a multi-gram model of N = {2, 3}. All spaces are replaced with “_”. The position in the input string starts from 0 before the first letter T. Position 1 is between T and h, 2 between h and e, and so on.

[Figure 4.1 diagram: the string The_actions_were_dismissed with character positions numbered 0 to 26, showing the non-straddling tri-grams S_L and S_R and the straddling tri-grams t_1 and t_2 around position 7. Figure 4.1: Comparing tri-gram counts for a potential segmentation boundary at position 7. Counts of adjacent tri-grams, S_L and S_R, are compared with straddling tri-grams, t_1 and t_2.]

Figure 4.1 illustrates the segmenter at position 7 between t and i in actions. It then performs the following steps to calculate the average vote for position 7:
1. Given the frequency tables obtained from the training corpus:

   Bi-gram   Count        Tri-gram   Count
   ct        16           act        5
   io        14           cti        3
   ti        9            ion        9
   ...       ...          tio        5
                          ...        ...
2. Compare the bi-gram frequencies:

   Comparison          Answer   Points
   Is #(ct) > #(ti)?   yes      1
   Is #(io) > #(ti)?   yes      1
3. Calculate the bi-gram average point by adding all points and dividing by the number of comparisons performed: 2/2 = 1.
4. Compare the tri-gram frequencies:

   Comparison            Answer   Points
   Is #(act) > #(cti)?   yes      1
   Is #(act) > #(tio)?   no       0
   Is #(ion) > #(cti)?   yes      1
   Is #(ion) > #(tio)?   yes      1
5. Calculate the tri-gram average point = 3/4 = 0.75.
6. Finally, calculate the average multi-gram vote = (1 + 0.75)/2 = 0.875.
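The vote computed in the worked example can be expressed compactly as follows. This is a sketch of Equations (4.1) and (4.2): `counts[n]` is assumed to map each character n-gram to its training frequency, unseen sequences default to a count of 1 (as noted below), and the input string is assumed to be padded with spaces so that the slices never run out of range.

```python
def vote(text, k, counts, orders):
    """Average multi-gram vote v_N(k) for a boundary at position k.

    For each order n, the non-straddling n-grams S_L and S_R adjacent
    to position k are compared against every n-gram straddling k; each
    win scores a point, normalized by the 2(n-1) comparisons made.
    """
    def freq(n, s):
        return counts[n].get(s, 1)            # unseen sequences count as 1

    total = 0.0
    for n in orders:
        left = text[k - n:k]                  # S_L: n-gram ending at k
        right = text[k:k + n]                 # S_R: n-gram starting at k
        points = 0
        for j in range(1, n):                 # straddling n-grams t_j
            straddle = text[k - (n - j):k + j]
            points += freq(n, left) > freq(n, straddle)
            points += freq(n, right) > freq(n, straddle)
        total += points / (2 * (n - 1))       # Equation (4.1)
    return total / len(orders)                # Equation (4.2)
```

With the frequency tables above, `vote("The_actions_were_dismissed", 7, counts, [2, 3])` reproduces the average of 0.875 computed in steps 1-6.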
Note that spaces are assumed to pad the input string. This is applicable when processing frequency comparisons at the beginning or final positions, where either the left or the right character sequence is shorter than the n-gram. For example, when processing at position 1 for a tri-gram model, two spaces are assumed to precede T. Hence, __T is the tri-gram to the left of position 1 and he_ is the one to the right. All unseen n-gram sequences are assumed to have a frequency of 1.

[Figure 4.2 plot: the average vote at each position of the padded string _The_actions_were_dismissed_, with horizontal lines at the thresholds t = 0.70 and t = 0.55 marking segmentation boundaries. Figure 4.2: Segmentation boundaries based on the average vote of a multi-gram model and three threshold values, 0, 0.55, and 0.70.]
After all positions are processed, the input string is segmented based on the resulting multi-gram votes and a threshold value t. When the threshold is 0, segmentation occurs at locally optimal positions.
Based on the different threshold values depicted in Figure 4.2, the input string can be segmented in three different ways:

   Threshold (t)   Result string
   0.70            The act ion s were dismissed
   0.55            T he a ct ion s were dis mi ssed
   0.00            T he a ct ion s w e re dis mi ssed
As illustrated in the segmentation results above, the lower the threshold value becomes, the more segmentation boundaries are defined. The threshold can therefore be
used to adjust the segmentation granularity.
4.2 Segmentation for Machine Translation
The unsupervised segmenter described in the previous section is used to segment parallel corpora for statistical machine translation. The corpora and segmentation task are
discussed below.
4.2.1 Corpora
Two European Parliament parallel corpora, French-English and German-English, are
used for development and evaluation. Extremely long or extremely short sentences are
excluded. Sentence pairs are selected based on their English translations. The English
translation of each sentence pair has 35 words or less. English sentences shorter than
four words are filtered out. So are sentences with only numbers and punctuation.
In each corpus, there are 41,000 sentence pairs in total. The testing data of 1,000
pairs are randomly selected. The remaining 40,000 pairs are used for the segmentation
development of which 1,000 are again randomly selected as a heldout set for statistical
machine translation evaluation. The best segmentation parameters for each corpus are
set by evaluating the segmentation results with a translation system.
The preprocessing steps include removing punctuation; replacing all number sequences with 0 as a placeholder; and expanding contracted forms into single words,
for example, “Europe’s way” is expanded as “Europe ’s way” and “qu’il” as “que il”.
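A rough sketch of these preprocessing steps; the exact regular expressions and the handling of French elisions such as “qu'il” are not specified in the report, so the rules below are illustrative only:

```python
import re

def preprocess(sentence):
    """Expand contractions, collapse numbers to 0, strip punctuation."""
    # "Europe's" -> "Europe 's"; full elision handling (qu'il -> que il)
    # would need a small language-specific expansion table
    sentence = re.sub(r"(\w)[’'](\w)", r"\1 ’\2", sentence)
    # any number sequence, e.g. "1,234.5", becomes the placeholder 0
    sentence = re.sub(r"\d+([.,]\d+)*", "0", sentence)
    # drop remaining punctuation, then normalize whitespace
    sentence = re.sub(r"[^\w\s’']", " ", sentence)
    return " ".join(sentence.split())
```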
Table 4.1 shows statistics of the processed data. German words are longer on average than English and French words. Since German makes heavy use of compounds, which individually do not occur often in the corpus, the German corpus has the largest vocabulary. On average, French words are slightly shorter than English words and the French vocabulary is approximately 10% bigger.
4.2.2 Segmenting
The training data set of 39,000 sentences in each language pair is separated into two monolingual corpora which are used for the training of n-gram counts. The heldout set of 1,000 sentences is used to configure the parameters for each language. The parameters include the order of n-grams in the model and the threshold value. Segmentation is performed on one sentence at a time.
                        German                         English
                        Train      Develop   Test      Train      Develop   Test
Sentences               39,000     1,000     1,000     39,000     1,000     1,000
Words                   485,850    7,448     12,170    556,684    7,837     13,968
Characters              3,839,072  61,804    95,004    3,648,447  55,185    91,104
Avg sentence (words)    12.46      7.45      12.17     14.27      7.84      13.97
Avg word (characters)   7.90       8.30      7.81      6.55       7.04      6.52
Vocabulary size         29,564     1,416     3,348     17,532     1,371     2,873

                        French                         English
                        Train      Develop   Test      Train      Develop   Test
Sentences               39,000     1,000     1,000     39,000     1,000     1,000
Words                   555,320    9,224     13,916    506,655    7,418     12,647
Characters              3,595,462  59,955    91,039    3,336,863  53,017    84,168
Avg sentence (words)    14.24      9.22      13.92     12.99      7.42      12.65
Avg word (characters)   6.47       6.50      6.54      6.59       7.15      6.66
Vocabulary size         20,411     1,349     2,985     16,992     1,278     2,752

Table 4.1: Statistics of parallel corpora before segmentation.
Given a training corpus and an input string, the segmentation algorithm works as
follows:
Training

1. Choose a set of n-grams for the model, N = {n}.
2. Obtain counts of each n-gram from the training corpus. Keep the information in an n-gram frequency table.

Voting

1. Start from the 0-th position, k = 0, before the first letter of the input string.
   (a) For each n-gram order in N,
       i. Based on the n-gram frequency table, compare the counts of the n-gram sequences adjacent to the left of the position, S_L, and to the right, S_R, with the other n-character sequences that straddle them.
       ii. Award one point to the position whenever the S_L or S_R frequency is greater than that of a straddling sequence.
       iii. Calculate the n-gram average point by summing all points and dividing by the total number of comparisons performed. This is the average score awarded to the position by this particular n-gram.
       iv. Repeat from step 1(a)i using the next n-gram order.
   (b) Calculate the average N vote for the position by summing all n-gram scores and dividing by the number of participating n-grams in N. This is the final N vote for the position.
2. Repeat step 1(a) for the next k in the input string. Stop after processing the last k.

Segmenting

1. Set a threshold value, t.
2. Segment the input string at the positions where the votes are higher than t.
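Putting the training, voting and segmenting steps together, a sketch of the full loop, reusing the `vote` function from the sketch in Section 4.1 (the padding width and the “+ +” marker conventions follow the description in this chapter; details such as the boundary placeholder character are this sketch's own):

```python
from collections import Counter

def train_counts(sentences, orders):
    """Training steps 1-2: n-gram frequency tables from the raw corpus."""
    counts = {n: Counter() for n in orders}
    for sent in sentences:
        text = sent.replace(" ", "_")
        for n in orders:
            counts[n].update(text[i:i + n] for i in range(len(text) - n + 1))
    return counts

def segment(sentence, counts, orders, t):
    """Voting and segmenting: split wherever the average vote exceeds t."""
    pad = max(orders)                        # space padding keeps slices in range
    text = "_" * pad + sentence.replace(" ", "_") + "_" * pad
    pieces = []
    for k in range(pad, len(text) - pad):
        if k > pad and vote(text, k, counts, orders) > t:
            pieces.append("|")               # candidate boundary before text[k]
        pieces.append(text[k])
    out = "".join(pieces)
    # boundaries falling on existing word edges are discarded; the rest
    # become the "+ +" marker described below
    out = out.replace("_|", "_").replace("|_", "_")
    return out.replace("|", "+ +").replace("_", " ").strip()
```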
Each segmentation position within a word is marked by “+ +” symbols. One “+” is placed at the end of the preceding segment and the other at the beginning of the following segment, to indicate that a suffix and a prefix are expected respectively. Two multi-gram models are considered for each corpus: a 2-10 and a 2-12 multi-gram. In each model, all orders of n-grams within the range are used. In general, using a higher threshold with a shorter model gives a similar result to a lower threshold with a longer model. However, in the German corpus, where words are longer on average, a longer model results in a better segmentation.
Although the segmentation results are ultimately evaluated by a statistical machine translation system, certain criteria are devised to initially judge the segmentation quality and filter out unlikely candidates. This initial screening has proved very efficient: development time is saved by not translating poorly segmented data, especially when each machine translation run can take more than three hours.
The initial criteria include the total number of segments, the average segment length, the average number of segmentations per word, and also human judgement of the overall quality of the segmentation results. The settings that result in too frequent or too short segments are judged unlikely candidates and are filtered out. For example, the following threshold values for an English 2-10 multi-gram model are rejected:
   Threshold (t)   Result string
   0.40            T+ +he app+ +l+ +ic+ +ation i+ +s dis+ +miss+ +ed a+ +s in+ +ad+ +missi+ +ble
   0.45            T+ +he appl+ +ic+ +ation i+ +s dis+ +miss+ +ed a+ +s in+ +ad+ +missi+ +ble
After the initial screening, two segmentation models are selected for each of the
German and French corpora and three for the English. The candidate models of each
corpus are:
• German N={2-10} t=0.60 and N={2-12} t=0.60
• French N={2-10} t=0.55 and N={2-10} t=0.60
• English N={2-10} t=0.55, N={2-10} t=0.60 and N={2-10} t=0.65
Model                Result string
German   Original    die Klage wird als offensichtlich unzulässig abgewiesen
         2-10, 0.60  die Klage wird als offen+ +sichtlich un+ +zulässig abgewiesen
         2-12, 0.60  die Klage wird als offensichtlich un+ +zulässig abgewiesen
French   Original    elles comportent les principaux éléments suivants
         2-10, 0.55  elles com+ +port+ +ent les princip+ +aux élé+ +ment+ +s suivants
         2-10, 0.60  elles comport+ +ent les principaux élément+ +s suivants
English  Original    The application is dismissed as inadmissible
         2-10, 0.55  The application i+ +s dismiss+ +ed as in+ +admissible
         2-10, 0.60  The application is dismiss+ +ed as in+ +admissible
         2-10, 0.65  The application is dismiss+ +ed as in+ +admissible

Table 4.2: Examples of segmented text based on the candidate models. 2-10 indicates the multi-gram, N, and 0.60 the threshold, t.
4.2.3 Postprocessing
After candidate models are selected, they are reapplied to segment the training data.
The segmentation of both training and heldout data sets is a preliminary step to machine translation evaluation.
However, using the same multi-gram model does not guarantee a consistent lexicon across the entire corpus: not all occurrences of the same word are segmented the same way, and some are not segmented at all. This is because the segmenter takes into account the context of the words it segments, but the same word can occur in different contexts. To decide whether a word should be segmented and how, two criteria are used: frequency-based and greedy.
• Frequency-based segmentation If a word has been segmented in more than one
way, the segmentation with the highest frequency is chosen.
• Greedy segmentation All occurrences of a word are segmented if at least one of its occurrences is segmented.
Based on these criteria, the postprocessing steps work as follows:

1. From both the training and development data, record in a table all segmented words with (a) their segmentation patterns and (b) the frequency of each pattern. Call this the segmentation table, S.
2. For each segmented word, s, in S,
   (a) Select the segmentation pattern with the highest frequency and delete the other patterns, if any, from S.
   (b) Segment all occurrences of s in the corpus based on the selected pattern.
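A sketch of this postprocessing pass. Once the table is built, the greedy criterion comes for free: every word that was segmented anywhere receives its single most frequent pattern everywhere. The "++" placeholder used to keep segmented words intact under split() is an implementation detail of this sketch, not of the report.

```python
import re
from collections import Counter, defaultdict

def build_segmentation_table(segmented_sentences):
    """Step 1: record each word's segmentation patterns and counts."""
    table = defaultdict(Counter)
    for sent in segmented_sentences:
        # fuse the "+ +" marker so split() keeps segmented words whole
        for word in re.sub(r"\+ \+", "++", sent).split():
            if "++" in word:
                surface = word.replace("++", "")
                table[surface][word.replace("++", "+ +")] += 1
    return table

def apply_table(sentence, table):
    """Step 2: resegment every occurrence with its majority pattern."""
    return " ".join(table[w].most_common(1)[0][0] if w in table else w
                    for w in sentence.split())
```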
The segmentation table S constitutes the segmentation vocabulary learned from the data by the unsupervised segmenter. Examples of segmentation lexicons are shown in Appendix B. When the order of the multi-gram is higher, the segments are also bigger, as seen in the German 2-12 and 2-10 multi-gram segmentation results. In the case of the English and French corpora, however, the order of the multi-gram is constant while the threshold value varies; the higher the value, the bigger the segments. Both the multi-gram order and the threshold value can therefore be used to adjust the size of segments. It should be noted that when a language has a longer average word length, using a low order of multi-gram may not capture all possible words. Since German has longer words than English, it might be better to use the order of the multi-gram as a segment size adjuster.
The unsupervised segmenter for machine translation described in this chapter is
knowledge-free and language-independent. It requires no linguistic tools nor external
resources. Instead, it is trained on frequencies of multiple character n-grams extracted
from raw data. Segmentation is performed on English, French and German corpora.
Candidate models are selected for each language. Since the goal of segmentation is to
improve statistical translation quality, the segmentation result of each candidate model
is not evaluated for its linguistic accuracy, but with a machine translation system.
The next chapter presents the evaluation and optimization of segmentation for the
translation of German to English and French to English. Experimental results of segmentation of unseen data are also presented and discussed.
Chapter 5
Experiments and Results
This chapter describes the evaluation metrics and statistical machine translation results
of the baseline and optimal segmentation of the German-English and French-English
language pairs. Various experiments carried out to achieve the optimal segmentation
of each language pair are also presented.
A machine translation system is used to translate German to English and French
to English based on segmented corpora. Two automatic evaluation metrics are then
employed to measure the quality of the translation sentences. The machine translation architecture as depicted in Figure 2.1 is extended as an analysis-transfer-synthesis
architecture (Brown et al. (1992)) which includes three components:
1. Analysis In this report, the unsupervised segmenter analyzes a surface string based on a segmentation lexicon and produces a segmented string as output. Other research has employed more linguistically comprehensive analyses in this step, for example part of speech, morphology and word sense, as reported in Brown et al. (1992); García-Varea and Casacuberta (2000); Ueffing and Ney (2003); Nießen and Ney (2000).
2. Transfer The machine translation process occurs in the transfer step where the
language and translation models of the segmented strings are constructed, and
the decoder, when given a segmented source sentence, produces a translation
based on both models.
3. Synthesis In the final step, a synthesizer transforms the transfer output back into
surface strings.
Figure 5.1 illustrates the extended machine translation architecture used in this report. f and e are the original, unsegmented source and target sentences respectively; f' and e' are their segmented versions.

[Figure 5.1 diagram:
f: Le recours est rejeté comme irrecevable
ANALYSIS (f → f'): Le re+ +cours est rejeté comme irrecev+ +able
TRANSFER (f' → e'): a global search for argmax_e' P(e'|f'), combining the translation model P(f'|e') and the language model P(e'), produces: The application is dismiss+ +ed as in+ +admissible
SYNTHESIS (e' → e): The application is dismissed as inadmissible
Figure 5.1: The analysis-transfer-synthesis architecture of a statistical machine translation system.]
5.1 Evaluation Metrics
The goal of the unsupervised segmentation in this report is to improve the translation quality of statistical machine translation. The segmentation results are therefore
evaluated using a translation system.
It should be noted that good machine translation quality does not necessarily mean the segmentation is morphologically or linguistically correct. In this report, a segmentation is considered successful when it reflects the regularities of a corpus in a way that complements its corresponding translation. The translation models based on segmented corpora are also examined, although no quantitative evaluation is performed.
Two automatic word-based evaluation metrics are used to measure the quality of
machine translations against reference human translations. They are word error rate
and position-independent word error rate. Both metrics have been employed to evaluate machine translation systems in such works as Callison-Burch and Osborne (2003)
and Ueffing and Ney (2003).
1. Word Error Rate (WER) computes the score based on string edit distance, which is the minimum number of insertion, deletion and substitution operations required to transform a machine-translated sentence into a reference sentence.
2. Position-Independent Word Error Rate (PER) does not consider the positions of words in the translation sentence. The number of matching words in the reference and translation sentences is used in this metric. PER is considered more forgiving, but it does not capture the accuracy and fluency of the translation as well as WER does.
Other evaluation techniques include subjective sentence error rate (SSER) and
BLEU. SSER is a metric based on human judgement. Human judges grade each translation sentence according to a predefined scale, such as a score of 1 to 4, or 0 to 1, or
a category of “wrong”, “acceptable”, “perfect”, and so on. SSER is used as the reported metric in Brown et al. (1990, 1992); Nießen and Ney (2000). BLEU is an automatic evaluation metric devised by researchers at IBM (Papineni et al. (2002)). It computes a score for a translation sentence given a reference translation based on a modified n-gram precision criterion. Considered comparable to human judgement, BLEU is used as an evaluation metric in Koehn and Knight (2003); Ueffing and Ney (2003).
5.2 Experimental Results
In each language pair, a total of 1,000 unseen sentences are used as the testing data. The
training data consists of 40,000 sentences. Statistical machine translation experiments
are performed on the baseline and optimal segmentation.
5.2.1 Baseline Segmentation
In the baseline segmentation, words longer than five characters in each corpus are
segmented after the fifth character. A maximum of two segments per word is allowed.
For example:
German:    Annahme eines Aktionsplans durch die Kommission
Segmented: Annah+ +me eines Aktio+ +nsplans durch die Kommi+ +ssion
French:    adoption par la Commission de un plan de action
Segmented: adopt+ +ion par la Commi+ +ssion de un plan de actio+ +n
English:   the Commission adopts an action plan
Segmented: the Commi+ +ssion adopt+ +s an actio+ +n plan
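This baseline rule is trivial to implement. A minimal Perl sketch of the naïve
segmenter follows; the "+" markers use the convention adopted throughout this report
(a trailing "+" expects a suffix, a leading "+" expects a prefix):

    use strict;
    use warnings;

    # Naive baseline: split every word longer than five characters
    # after its fifth character, giving at most two segments per word.
    sub naive_segment {
        my ($sentence) = @_;
        my @out;
        for my $word (split /\s+/, $sentence) {
            if (length($word) > 5) {
                push @out, substr($word, 0, 5) . '+', '+' . substr($word, 5);
            } else {
                push @out, $word;
            }
        }
        return join ' ', @out;
    }

    print naive_segment('Annahme eines Aktionsplans durch die Kommission'), "\n";
    # Annah+ +me eines Aktio+ +nsplans durch die Kommi+ +ssion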
Figure 5.2: Translation error rates (%) of the naïve segmentation of German to English,
left, and French to English, right. (Chart values: German WER 86.45 / PER 56.97 with
naïve segmentation, against 60.72 / 31.59 for the original corpus; French WER 78.05 /
PER 49.39 with naïve segmentation, against 46.40 / 21.79 for the original corpus.)
The segmentation results are shown in Figure 5.2. The translation results based on
naïve segmentation indicate that a non-strategic segmentation worsens the performance
of a translation system. In the case presented above, translation error rates increase by
approximately 30-60% in each language pair.
5.2.2 Optimal Segmentation
The best segmentation configuration for each language pair obtained during development
is used in the testing phase. Section 5.3 describes in detail how an optimal
configuration for each language pair is achieved.
Based on the best configuration for the German to English translation, only the German
corpus is segmented, using a 2-12 multi-gram and 0.60 threshold model, whereas for
the French to English translation, only the English corpus is segmented, using a 2-10
multi-gram and 0.65 threshold model.
The translation results and the rates of rare and unseen words are shown in Figure
5.3. The word error rate of the German-English pair decreases by approximately 0.40%,
whereas that of the French-English pair grows. In the German to English translation,
segmentation helps break down distinct compounds into regular strings and
therefore reduces the number of singletons in the training corpus and of unseen words
in the testing data. A lower singleton rate helps improve the quality of the translation
model. Fewer unseen words mean more words in the source string can be translated to
a target word.
Figure 5.3: Translation word error rates (%), left, of German to English and French
to English based on the optimal segmentation; unseen word and singleton rates (%),
right, of the segmented corpora. (Chart values: WER German 72.61 original vs. 72.24
segmented, French 63.59 vs. 66.49; singletons German 48.76 vs. 46.35, French 41.82
vs. 41.53; unseen words German 10.48 vs. 8.33, French 6.50 vs. 6.22.)
The regularities reflected in the segmentation benefit the overall performance
of a translation system in the German-English language pair.
In contrast, fewer unseen words and singletons fail to improve the overall French
to English translation quality, due to the limitations of the spell-checking synthesizer.
There are numerous incomplete, pairless segments in the raw translation sentences. These
segments expect either a suffix or a prefix but their adjacent segments do not; their
reconstruction into surface words is therefore the task of a synthesizer. Since the
synthesizer reconstructs a segment into a word based on the uni-gram word frequencies
of the corpus, these reconstructed words, although valid, often do not exactly match the
references. For example, “improve+” is synthesized as “improvement” but the reference
word is “improved”. Section 5.3.3 describes the spell-checking, frequency-based
synthesizer in detail.
Vocabulary Size and Synthesizer Two approaches are taken to improve the French
to English translation quality and to address the incomplete-segments problem: reducing
the vocabulary size and improving the synthesizer. Firstly, the segmentation
lexicon learned during the segmentation development, as described in Section 5.3, is
used to segment the English training and testing data. As a result, the vocabulary
size decreases from 17,099 to 17,029 words but there are more unknown words. This
however turns out to be advantageous since the transfer translation output has fewer
pairless segments to be reconstructed.
Figure 5.4: Using a translation system as the synthesizer. TRANSFER maps the source
sentence f ("le issue de le débat une résolution a été adoptée") to a segmented
translation e' ("after has been discus+ resolut+ +ion adop+ +ted") through a global
search argmax_e' P(e'|f) over the translation model P(f|e') and language model P(e');
SYNTHESIS then maps e' to the surface sentence e ("after discussed has been resolution
adopted") through a second search argmax_e P(e|e') over the translation model P(e'|e)
and language model P(e). (Ref: a resolution was adopted at the close of the debate)

Figure 5.5: Translation word error rates (%) of French to English using various
segmentation lexicons and a new synthesizer, left, and corresponding numbers of
incomplete segments before synthesis, right. (Chart values: WER original French-English
63.59; testing lexicon 66.49 (+2.90); development lexicon 63.05 (-0.54); 100-suffix
lexicon 61.44 (-2.15). Pairless segments: 720, 690 and 627; singletons (%): 41.96,
41.53 and 41.48 for the testing, development and 100-suffix lexicons respectively.)
After the synthesis step, the translation
quality is shown to improve significantly, by 2.15%.
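Applying a previously induced segmentation lexicon to new data amounts to a table
lookup: the lexicon maps each known surface word to its segmentation pattern, and
unknown words pass through unchanged. A small Perl sketch, with hypothetical lexicon
entries, illustrates the idea:

    use strict;
    use warnings;

    # Segment a sentence with a learned segmentation lexicon: each
    # known word is replaced by its stored pattern; words not in the
    # lexicon are left intact and may remain unknown to the
    # translation model.
    sub apply_lexicon {
        my ($lexicon, $sentence) = @_;       # hash ref, string
        return join ' ',
            map { exists $lexicon->{$_} ? $lexicon->{$_} : $_ }
            split /\s+/, $sentence;
    }

    my %lexicon = (                          # illustrative entries
        'dismissed'    => 'dismiss+ +ed',
        'inadmissible' => 'in+ +admissible',
    );
    print apply_lexicon(\%lexicon,
        'the application is dismissed as inadmissible'), "\n";
    # the application is dismiss+ +ed as in+ +admissible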
In the second approach, the synthesizer is made more aware of the surrounding
context of the segments to be reconstructed. This is achieved by treating the segmented
and the original sentences as translations of one another. The translation system is
therefore trained twice. The first time, the original French corpus is used as the source
sentences and the segmented English corpus as the target. The second time, the segmented
English corpus is the source and the original English corpus the target. The translation
process is subsequently carried out by first translating the French test data into
segmented English translations based on the original-French-to-segmented-English
translation model. Then the segmented translations are translated into surface English
form based on the segmented-English-to-original-English model. Figure 5.4 illustrates
this double translation process. The example shows that not only are incomplete
segments reconstructed but word positions are also rearranged.
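Written out, the double translation is simply two chained source-channel searches
over the quantities labelled in Figure 5.4 (again a restatement, not a new model):

\[
\hat{e}' = \operatorname*{argmax}_{e'} P(f \mid e')\,P(e'), \qquad
\hat{e} = \operatorname*{argmax}_{e} P(\hat{e}' \mid e)\,P(e)
\]

so the second translation system plays the role of the synthesizer, with the segmented
translation ê' acting as its "source language".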
Also in the second approach, the segmentation lexicon is limited to only the hundred most frequent suffixes. These suffixes are used to segment the English data. With
a smaller lexicon and a new synthesizer, the French to English translation word error
rate decreases by 0.54%.
Figure 5.5 shows the improved translation quality of the segmentation based on the
development lexicon and of a lexicon consisting of only suffixes.
Figure 5.6: Translation word error rates (%) of German to English, left, and French to
English, right, based on various training data sizes (10,000, 20,000 and 40,000
sentences), comparing the original and segmented corpora.
The number of incomplete
segments appears to directly affect the translation results. The transfer output, or
raw segmented translations, of the segmentation based on the testing lexicon has the
highest number of pairless segments, which cannot be properly reconstructed into surface
words by the spell-checking, frequency-based synthesizer. Although there are only 30
pairless segments fewer in the case of the 100-suffix segmentation, the quality of
reconstructed translations does not degrade as badly, thanks to the new translation
synthesizer. In the absence of a translation synthesizer, a smaller number of singletons
helps improve the translation model and therefore lowers the translation error rate, as
shown in the translation results of segmented data based on the development lexicon.
Training Data Size The above translation results indicate that regularities discovered by the unsupervised segmenter in many ways complement the statistical translation model. To validate the quality of induced segmentation lexicons and their benefit
to statistical machine translation, the size of training data is varied for further evaluation.
The segmented training data based on the testing lexicons is halved and quartered.
The translation is then performed on the same testing data of 1,000 sentences. Incomplete segments are reconstructed to surface words by a spell-checking synthesizer.
Figure 5.6 shows the translation results.
The translation quality based on segmented corpora does not deteriorate as much
as that of the unprocessed corpora when the training data size decreases.
Figure 5.7: An example of German-English alignments. Original: "der
Welthandelsorganisation" aligns only to "NULL World"; segmented: "der Welt+ +handel+
+sorganisation" aligns to "the World Trade Organisation". (Reference: the World Trade
Organisation)
However, the performance
gain from segmentation lessens when more training data becomes available.
In fact, in the French-English pair, the training data of 40,000 segmented sentences
worsens the translation quality.
Segmentation, however, is most beneficial to a compounding language like German.
Segmenting distinct compounds into regular words or text sequences improves the
translation model not only by reducing the number of singletons but also by increasing
the one-to-one alignment probability between source and target words. As shown in
Figure 5.7, more English words can be aligned with German segments, which subsequently
improves the overall translation quality.
Based on the experimental results, segmentation has been shown to improve the
translation quality by effectively lowering word error rates by 0.37% and 2.15% in the
German-English and French-English language pairs respectively.
Without any language- or domain-specific knowledge, the unsupervised approach
to segmentation successfully induces concatenative regularities from a corpus. Segmentation
is most favorable in the translation of a compounding source language, such
as German, to a target language of shorter average word length. When a language is
rich in inflectional and derivational morphology, segmenting the data into pieces entails
the development of a more complicated synthesizer. This is apparent in the French
to English translation. In both cases, segmentation significantly reduces the number of
unknown words and singletons, which proves extremely helpful when the training
data is scarce.
The remainder of this chapter describes the experiments carried out during the
development to determine the best segmentation settings for statistical machine translation of both language pairs.
5.3 Configuring Segmentation for Machine Translation
A training data set of 39,000 sentences and a heldout set of 1,000 sentences in each
language pair are used to empirically configure the best segmentation for statistical
machine translation.
The configuration is based on a selection of candidate segmentation models produced
by an unsupervised segmenter, as described in the previous chapter, which include two
models for the French and German corpora and three for the English. The
configuration is in fact twofold because it aims to improve the translation model of a
statistical translation system by improving the segmentation quality of the corpora.
The following sections describe the configuration process.
5.3.1 What to Segment
The possibilities of segmentation for statistical machine translation include segmenting
both the source and target corpora, segmenting either the source or target corpora,
and segmenting nothing at all. All possible combinations in each language pair are
evaluated to find the best translation. Table 5.1 shows translation error rates of some
combinations.
For the German-English pair, the best combination is an original, unsegmented
English corpus and a segmented German corpus using a 2-12 multi-gram with a 0.60
threshold. For the French-English pair, the best combination is an original, unsegmented
French corpus and a segmented English corpus based on a 2-10 multi-gram,
0.65 threshold model.
German-English (English model in rows, German model in columns):

                  German 2-10, t=0.60     German 2-12, t=0.60
English           WER      PER            WER      PER
2-10, t=0.55      76.26    46.22          81.61    50.13
2-10, t=0.60      69.19    37.04          68.28    39.74
Original          63.44    34.56          56.74*   24.63*

French-English (English model in rows, French model in columns):

                  French 2-10, t=0.55     French Original
English           WER      PER            WER      PER
2-10, t=0.55      73.44    44.11          60.54    29.43
2-10, t=0.60      61.61    33.72          47.98    25.19
2-10, t=0.65      56.97    29.52          45.76*   20.79*

Table 5.1: Translation error rates (%) of various combinations of segmentation models
in each language pair. The best combinations are marked with an asterisk.
In the French-English pair, when both corpora are segmented, certain morphological paradigms are reflected in the segments, for instance, “envisag+ +er”, “envisag+
+era”, “envisag+ +ées” in French, and “introduc+ +ed”, “introduc+ +es”, “introduc+
+ing” in English. However, the overall translation quality is considerably damaged due
to wrong segment alignments.
Since segmentation relies on the regularities of monolingual corpora, it is not always the case that the source and target words in a parallel corpus are segmented in
the same way. For example, while the English word “prejudice” is not segmented, its
French translation “préjudice” is segmented as “pré+ +judice”. Without corresponding
segments in both languages, this renders the translation model more complex because
of inconsistent fertility and alignment between the source and target words. Figure 5.8
shows an example of the damaging effects of over-segmentation in the French corpus.
On the other hand, the German to English translation benefits a great deal from
segmentation. Since German has many compounds, breaking them into segments
helps promote a one-to-one fertility and a better alignment with English words
in the translation model, as shown in Figure 5.7.
Figure 5.8: An example of French-English alignments. Original: "ministre irlandais de
le Environnement" aligns to "Irish Environment Minister NULL"; over-segmented:
"ministre irland+ +ais de le Environnement" aligns only to "Irish Environment NULL".
(Reference: Irish Minister for the Environment)
From the experimental results of all possible segmentation combinations, the best
segmentation settings of each language pair are selected as follows:
• German-English Segmented German corpus using a 2-12 multi-gram and 0.60
threshold model and the original English corpus.
• French-English The original French corpus and segmented English corpus using a 2-10 multi-gram and 0.65 threshold model.
5.3.2 How to Segment
The German and English segmentation lexicons based on the best models are closely
examined in order to improve the quality of segmentation for machine translation. As
a result, unlikely segments, such as single-character segments, are discovered. While
one-character segments are possible in German compounds, they are unlikely as English
prefixes. These tiny segments in turn worsen the alignment in the translation
sentences, where one target word is aligned with multiple segments in the source sentence,
or vice versa. Un-segmentation is therefore performed on unlikely segments in
each language.
Many un-segmentation operations have been considered and evaluated with a translation
system. The best un-segmentation for machine translation is described below:
German: Unsegment
• Single-character prefixes by joining them with the segments that follow.
• Suffixes of less than or equal to two characters by joining them with the preceding segments.
English: Unsegment
• Single-character prefixes by joining them with the following segments.
• Suffixes of less than or equal to two characters by joining them with the preceding segments.
• Possible named entities which are words with initial capitalization. Capitalized
words at the beginning of a sentence are not considered entities.
• Middle segments by joining them with the preceding segments. Because of this
operation, there are at most two segments per word.
The following are examples of un-segmentation in each language:
German      Segmented                   Unsegmented
Prefixes    P+ +ortugies+ +ische        Portugies+ +ische
Suffixes    wirtschaft+ +lich+ +er      wirtschaft+ +licher

English     Segmented                   Unsegmented
Prefixes    c+ +ustoms                  customs
Suffixes    out+ +com+ +e               out+ +come
Entities    Mr Einem Austria+ +n        Mr Einem Austrian
Middles     organi+ +z+ +ations         organiz+ +ations
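These operations are simple string manipulations on the segmentation patterns. The
Perl sketch below reproduces the English rules on single patterns (named-entity
detection, which operates on whole sentences, is not shown); it is a minimal
illustration, not the exact postprocessing code used in the experiments:

    use strict;
    use warnings;

    sub content { my $s = shift; $s =~ s/\+//g; return $s }

    sub join_segs {                      # merge two adjacent segments
        my ($a, $b) = @_;
        $a =~ s/\+$//;                   # drop a's trailing marker
        $b =~ s/^\+//;                   # drop b's leading marker
        return $a . $b;                  # b keeps its own trailing "+"
    }

    sub unsegment {
        my @seg = split / /, shift;
        # Suffixes of at most two characters join the preceding segment.
        if (@seg > 1 and length(content($seg[-1])) <= 2) {
            my $s = pop @seg;
            push @seg, join_segs(pop @seg, $s);
        }
        # Single-character prefixes join the following segment.
        if (@seg > 1 and length(content($seg[0])) == 1) {
            my $p = shift @seg;
            unshift @seg, join_segs($p, shift @seg);
        }
        # Remaining middle segments join the preceding segment,
        # leaving at most two segments per word.
        while (@seg > 2) {
            my $last = pop @seg;
            my $mid  = pop @seg;
            push @seg, join_segs(pop @seg, $mid), $last;
        }
        return join ' ', @seg;
    }

    print unsegment('out+ +com+ +e'), "\n";         # out+ +come
    print unsegment('c+ +ustoms'), "\n";            # customs
    print unsegment('organi+ +z+ +ations'), "\n";   # organiz+ +ations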
Details of resulting segmentation lexicons are shown in Appendix B. It should be
noted that un-segmentation is not intended to linguistically correct the segmentation
results. Instead, it is carried out to improve the translation model by promoting a better
alignment between the source and target sentences.
5.3.3 How to Reconstruct a Fluent Sentence
In the French to English translation where the English corpus is segmented, the raw
translation sentences consist of numerous incomplete and incompatible segments which
do not constitute a fluent sentence after synthesis. Incomplete segments are those that
expect either a prefix or a suffix but the segments preceding or following them expect
none, for instance “Summi+ point”. On the other hand, incompatible segments, such
as “inclu+ +lating”, yield invalid surface words after they are joined.
A spell-checking, frequency-based synthesizer is developed to correct invalid segments
using a vocabulary extracted from the English corpus. Each word and segment
in the raw translation is looked up in the table. If an exact match is found, it is
considered valid. If no match is found, a character at the end is removed and the lookup
is performed on the remaining character sequence. Characters are removed iteratively
until the closest match is found. If there is more than one match, the most
frequent word is selected.
For example, given a word frequency table below:
Word       Frequency      Word         Frequency
Summers    3              include      150
Summit     196            included     57
Summits    2              includes     51
Sun        5              including    374
...                       inclusion    40
                          ...
“Summi+” is synthesized as “Summit” after removing the final “+”, whereas
“inclu+ +lating” is reconstructed separately into two surface words, “including” and
“speculating”.
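A minimal Perl sketch of this lookup follows. It interprets the procedure described
above as matching a prefix segment (trailing "+") against the beginnings of vocabulary
words and a suffix segment (leading "+") against their endings, backing off by dropping
final characters; the frequency of "speculating" is invented for the example, and the
real synthesizer may differ in its details:

    use strict;
    use warnings;

    # Reconstruct a pairless segment into the most frequent vocabulary
    # word it could belong to. %$freq is a uni-gram frequency table.
    sub synthesize {
        my ($segment, $freq) = @_;
        my $is_suffix = $segment =~ /^\+/;    # "+lating" expects a prefix
        (my $s = $segment) =~ s/\+//g;
        while (length $s) {
            return $s if exists $freq->{$s};  # exact surface word
            my @cand = grep {
                $is_suffix ? /\Q$s\E$/ : /^\Q$s\E/
            } keys %$freq;
            if (@cand) {                      # most frequent candidate wins
                my ($best) = sort { $freq->{$b} <=> $freq->{$a} } @cand;
                return $best;
            }
            chop $s;                          # back off: drop a character
        }
        return $segment;                      # give up, keep the segment
    }

    my %freq = (
        Summit => 196, Summits => 2, Summers => 3, Sun => 5,
        include => 150, included => 57, includes => 51,
        including => 374, inclusion => 40,
        speculating => 12,                    # invented for illustration
    );
    print synthesize('Summi+',  \%freq), "\n";   # Summit
    print synthesize('inclu+',  \%freq), "\n";   # including
    print synthesize('+lating', \%freq), "\n";   # speculating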
Figure 5.9: Translation error rates (%) of German to English, left, and French to
English, right, based on the best segmentation of development data. (Chart values:
German WER 60.72 original vs. 56.73 segmented, PER 31.59 vs. 24.63; French WER 46.40
original vs. 45.76 segmented, PER 21.79 vs. 20.79.)
Based on the best segmentation configuration and the spell-checking synthesizer,
the translation quality of the heldout data in both language pairs is effectively improved. As depicted in Figure 5.9, the word error rates of original translations are
reduced by 0.64% in the French to English translation and almost 4% in the German
to English. The best development configuration and synthesizer are used in the translation of unseen data in the testing phase as discussed in Section 5.2.2.
Table 5.2 shows the statistics of the parallel corpora that yield the best translation
quality during both the development and testing phases. Segmentation effectively reduces the vocabulary size of each translation pair. The numbers of unseen words and
rare words also decrease. For example, segmentation reduces the German vocabulary
size by nearly 2,000 words or about 7%.
A translation model comprises not only a lexical translation model but also a hidden
word alignment model. When the length of segmented sentences in one language is
proportional to that of the other, such as segmented German and English, segmentation
helps improve the alignment model by promoting a one-to-one relationship between
the source and target words.
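The alignment model enters the translation model as a hidden variable (Brown et al.
(1993); this is the P(a, f|e) component shown in Figure A.2):

\[
P(f \mid e) = \sum_{a} P(f, a \mid e)
\]

so any segmentation that simplifies the space of likely alignments a, for instance by
encouraging one-to-one links, directly simplifies the estimation of P(f|e).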
However, when the lengths of the segmented sentences are not proportional, as with
French and segmented English, segmentation can have a destructive side-effect on a
statistical system. The word-alignment model becomes more complex, which results in
many incomplete segments in the translations, as seen so often during the experiments.
Segmented German          Develop            Test
Word error rate (%)       56.73  (-3.99)     72.24  (-0.37)
Avg sentence (words)      9.61   (+2.16)     15.39  (+3.22)
Avg word (characters)     6.43   (-1.87)     6.17   (-1.64)
Vocabulary size           27,657 (-1,907)    27,777 (-1,930)
Singletons (%)            46.50  (-2.43)     46.35  (-2.41)
Unseen words (%)          8.09   (-2.01)     8.33   (-2.15)

Original French           Develop            Test
Word error rate (%)       -                  -
Avg sentence (words)      9.22               13.92
Avg word (characters)     6.50               6.54
Vocabulary size           20,411             20,542
Singletons (%)            59.75              43.45
Unseen words (%)          9.71               7.50

Original English          Develop            Test
Word error rate (%)       -                  -
Avg sentence (words)      7.48               13.97
Avg word (characters)     7.04               6.52
Vocabulary size           17,532             17,623
Singletons (%)            41.06              40.89
Unseen words (%)          6.78               5.53

Segmented English         Develop            Test
Word error rate (%)       45.76  (-0.94)     61.44  (-2.15)
Avg sentence (words)      8.50   (+1.08)     14.40  (+1.75)
Avg word (characters)     6.24   (-0.91)     5.84   (-0.82)
Vocabulary size           16,936 (-60)       17,029 (-76)
Singletons (%)            41.53  (-0.40)     41.48  (-0.29)
Unseen words (%)          7.96   (-0.45)     6.39   (-0.11)

Table 5.2: Statistics of parallel corpora with the best translation quality. Differences
between the segmented and original corpora are shown in parentheses.
Fortunately, this negative side-effect is outweighed by the benefits of segmentation,
such as fewer words in the vocabulary and fewer singletons.
Word error rates of the development data are lower than those of the testing data
because on average a sentence in the testing data in both translation pairs is longer by
approximately five words which renders the translation task of the testing data more
difficult.
In this chapter, the evaluation metrics, segmentation configuration, and experimental
results of segmentation for statistical machine translation are presented. The results
show that segmentation helps improve the translation quality of both the German-English
and French-English language pairs by alleviating the data sparsity problem.
The benefits of segmentation are more pronounced when the training data size is small.
Chapter 6
Conclusion
In this report, an unsupervised segmentation for statistical machine translation is
presented. The segmentation of two parallel corpora, German-English and French-English,
has been shown to successfully improve the overall translation quality. Moreover, the
unsupervised approach presented requires no language- or domain-specific knowledge in
order to achieve higher translation accuracy.
The unsupervised segmentation can be applied to any corpus in any language and
domain. However, the parameters, i.e. the order of multi-gram and threshold value,
for different corpora will vary and have to be set empirically. Since segmentation is
aimed at capturing the regularities of a corpus, different corpora that have different
vocabularies, different word frequencies, or even different sizes are unlikely to have
the same parameter settings or to produce the same segmentation lexicon.
The experimental results show that segmentation results reflect the regularities of
each corpus by reducing the vocabulary size and number of unknown words. When
using a segmented German corpus in a German to English translation task, and a segmented English corpus in a French to English, the translation word error rates are
lowered by 0.37% and 2.15% respectively.
The application of segmentation to other translation pairs is expected to yield
similar results. Languages with rich compounding, such as Finnish and Inuktitut, are
expected to benefit most from segmentation, because the segmented data improves the
alignment of words in the source and target sentences. An example of a better word
alignment in the German to English translation is shown below, where the first alignment
is based on the original German corpus and the second on a segmented corpus:
der Welthandelsorganisation
NULL World
der Welt+ +handel+ +sorganisation
the World Trade Organisation
( Reference: the World Trade Organisation )
The advantages of segmented data are more pronounced when the training corpus
size is small. Since the data is segmented into regular sequences of text, the number
of rare sequences or words is considerably reduced. Subsequently, a better translation
model can be constructed.
In the German to English translation, most German compounds are distinct and do
not appear frequently enough to be learned correctly by a translation model. By breaking
up compounds into regular words that are consistent with the data, the number of
singleton compounds is lessened. Moreover, the segmented German data also promotes a
one-to-one relationship between German and English words in the alignment model,
which ultimately results in a better translation quality. The improved translation
accuracy of German to English based on various training corpus sizes is shown in the
following graph:
(Graph: German to English word error rates (%) of the original and segmented corpora
for training sizes of 10,000, 20,000 and 40,000 sentences; cf. Figure 5.6, left.)
Similarly, segmentation has proved to lower the French to English translation word
error rate, especially when the training corpus size is limited. When more training
data becomes available, however, segmentation has been shown to hurt the overall
quality. Segmentation of English data introduces a complicated synthesis problem. In the
raw English translation, many segments appear alone or with incompatible affixes.
Experimental results show that a simple frequency-based synthesizer often results in
ungrammatical surface words. When the segment context is taken into account during
the synthesis, word error rates are lowered and the overall translation quality is better
than that of the original data.
6.1 Future Work
Without losing its language- and domain-independence, the unsupervised segmentation for statistical machine translation presented in this report can be enhanced in several ways. They are described as follows:
• Developing a synthesizer that is more context sensitive. It is obvious from the
experimental results that the more context of pairless segments is taken into
account, the better the quality of the synthesized surface words becomes. Apart from
considering the synthesis as a translation task as presented in this report, other
approaches can also be applied, for instance, using maximum entropy to determine the surface form of a segment (Ueffing and Ney (2003)). This is especially
appropriate for a translation of a morphologically rich language, such as French
and Spanish, to a less rich one, such as English.
• Improving the translation model by dividing the translation into a morphological
and a lexical translation subtask. This approach is most suitable when a translation involves a morphologically rich language and the synthesis task involves
reconstruction of words in various forms. In this report, a number of stem boundaries have been successfully learned by the segmenter, for example, “adopt+
+ant”, “adopt+ +ent”, “adopt+ +er” in French and “adop+ +ting”, “adop+ +ted”,
“adop+ +tion” in English. Regular signatures of stems and suffixes in each language can be extracted from segmented words (Goldsmith (2001)). Irregular
segments that do not belong in any signature are rejoined. A morphological
translation system is then trained on instances of these signatures. Similar to the
word categorization approach of García-Varea and Casacuberta (2000), signature
instances in the original sentences are replaced by their stem keywords, such as
ADOPT in the French sentences and ADOP in the English. A lexical translation
system is trained on the parallel corpus with stem replacements. Since there are
fewer words in the vocabulary of each corpus and the average sentence length
becomes shorter, the translation models are expected to be less complex and
more accurate. Accordingly, the translation process is carried out in two steps,
based on the lexical and morphological translation models. A small sketch of the
signature-grouping step appears after this list.
• Using a named entity recognizer to exclude words from segmentation and translation.
The current segmenter does not differentiate entities from ordinary words; the
current implementation assumes entities to be capitalized words in a sentence. This
assumption is not valid in all cases, and as a result certain entities that should not
be segmented end up being segmented. When named entities are properly
identified in the data, they can be replaced by their category keywords, such as
PER for a person’s name or ORG for an organization’s name. Subsequently,
the translation process can be performed in two steps: translating the sentences
and translating the entities. The goal of this approach is also to improve the
translation model by reducing the number of parameters.
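As a first step toward the signature-based approach sketched above, stems and their
suffix sets can be grouped directly from a segmentation lexicon. The following Perl
sketch (with made-up lexicon entries) shows the grouping; it is an illustration in the
spirit of Goldsmith (2001), not a full morphology learner:

    use strict;
    use warnings;

    # Group the suffixes observed with each stem in a segmentation
    # lexicon; stems sharing the same suffix set form a signature.
    my @entries = ('adopt+ +ant', 'adopt+ +ent', 'adopt+ +er',
                   'envisag+ +er');   # made-up lexicon entries
    my %suffixes;                     # stem => { suffix => 1, ... }
    for my $entry (@entries) {
        my ($stem, $suffix) = split / /, $entry;
        $stem   =~ s/\+$//;
        $suffix =~ s/^\+//;
        $suffixes{$stem}{$suffix} = 1;
    }
    my %signature;                    # "ant.ent.er" => [stems]
    for my $stem (keys %suffixes) {
        my $sig = join '.', sort keys %{ $suffixes{$stem} };
        push @{ $signature{$sig} }, $stem;
    }
    # Stems sharing a signature could then be replaced by a keyword
    # such as ADOPT in the training data.
    for my $sig (sort keys %signature) {
        print "$sig: @{ $signature{$sig} }\n";    # ant.ent.er: adopt
    }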
Appendix A
Software
A.1 Segmentation
The unsupervised segmenter in this report is developed in C and Perl. It
consists of four components:
1. Chopper splits the input data into n-gram sequences. The orders of the n-grams
constitute the multi-gram segmentation model. Chopper is implemented in Perl.
2. Text2wfreq is a tool in the CMU-Cambridge Language Modeling Toolkit, discussed in the Machine Translation section below. The module creates a frequency table for each n-gram in the multi-gram model.
3. Tango, or Threshold And Maximum for N-Grams that Overlap (Ando and Lee
(2003)), is the segmentation algorithm. The multi-gram model votes to determine
segmentation boundaries in the input data based on the frequency tables.
Tango computes the multi-gram vote for each position and inserts a boundary when
the vote exceeds a specified threshold. Tango is the only C component. (A sketch
of the voting rule is given after this list.)
4. Postprocessor selects the most frequent segmentation pattern of each word in the
data and reapplies that pattern to all instances of the word to ensure consistency.
All segmented words and their segmentation patterns make up the segmentation
lexicon of the data. Postprocessor is a Perl component.
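The following Perl sketch illustrates the voting rule at the heart of Tango, after Ando
and Lee (2003): at each character position of a word, every n-gram order compares the
corpus frequencies of the n-grams that sit flush against the position with those that
straddle it, and a boundary is placed where the averaged vote exceeds the threshold t.
This is a simplification (word edges are skipped rather than padded, and the
local-maximum condition of the full algorithm is omitted), not the actual C
implementation:

    use strict;
    use warnings;

    # Vote for a boundary between characters $k-1 and $k of $word.
    # %$freq maps n-grams to corpus counts (built by Chopper and
    # text2wfreq); @$orders lists the n-gram orders of the model.
    sub vote {
        my ($word, $k, $orders, $freq) = @_;
        my ($total, $norm) = (0, 0);
        for my $n (@$orders) {
            next if $k < $n or $k + $n > length $word;
            my $left  = $freq->{ substr($word, $k - $n, $n) } || 0;
            my $right = $freq->{ substr($word, $k, $n) }      || 0;
            for my $i (1 .. $n - 1) {      # n-grams straddling $k
                my $strad = $freq->{ substr($word, $k - $i, $n) } || 0;
                $total += ($left  > $strad ? 1 : 0)
                        + ($right > $strad ? 1 : 0);
                $norm  += 2;
            }
        }
        return $norm ? $total / $norm : 0;
    }

    sub segment_word {
        my ($word, $orders, $freq, $t) = @_;
        my @segs;
        my $prev = 0;
        for my $k (grep { vote($word, $_, $orders, $freq) > $t }
                   1 .. length($word) - 1) {
            push @segs, substr($word, $prev, $k - $prev);
            $prev = $k;
        }
        push @segs, substr($word, $prev);
        return join '+ +', @segs;          # e.g. Welt+ +handel+ ...
    }

In the full pipeline, segment_word would be applied to every word of the corpus, after
which the Postprocessor enforces a single pattern per word type.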
Figure A.1: Software components of an unsupervised segmenter. An input string is
split by Chopper into multi-grams (N = {n}); text2wfreq builds the multi-gram
frequency tables; Tango produces a segmentation (t = 0.0); Postprocessor applies the
segmentation lexicon to produce the segmented string.
The training phase includes splitting a training corpus into n-grams and creating
a frequency table for each of them. The process takes less than 30 minutes for a 3.5
MB monolingual corpus of 40,000 sentences on a 256 MB, 1.13 GHz Intel machine.
The training process can be incremental when more training data becomes available,
by simply updating the multi-gram frequency tables. Segmentation of input data is
carried out by Tango based on the frequency tables and then by Postprocessor. The
segmentation task takes approximately 20 minutes for a corpus of the same size.
A.2 Machine Translation
There are three components used in a statistical machine translation system in this
report:
Figure A.2: Software components of a statistical machine translation system. GIZA++
provides the translation model P(a, f|e), the CMU-Cambridge toolkit the language model
P(e), and the ISI ReWrite Decoder performs the global search argmax_e P(e|f) that
turns an input source string into a target translation string.
1. GIZA++ constructs a translation model and a hidden alignment model from
aligned sentence pairs during the training phase. GIZA++ is developed by Och
and Ney (2000). It is an extension of the program GIZA, which was developed by
the Statistical Machine Translation team during the 1999 summer workshop
at the Center for Language and Speech Processing at Johns Hopkins University.
In the experiments, the default parameters are used, which include five iterations
each of IBM models 3 and 4 (Brown et al. (1993)).
2. CMU-Cambridge Language Modeling Toolkit version 2 by Clarkson and Rosenfeld (1997) is used to construct a tri-gram language model of the target sentences.
The default parameters are also used in the experiments which include zero minimum uni-gram count and a Good-Turing discounting method.
3. ISI ReWrite Decoder, Germann et al. (2001), performs a global search for a
translation of an input string based on the language and translation models. In
the experiments, the default fast greedy decoding is used.
During the training phase, constructing a translation model of a 7 MB parallel corpus
consisting of 40,000 sentence pairs takes approximately an hour on a 512 MB,
2.40 GHz Intel machine; a language model takes about half an hour. The training takes
longer for a segmented corpus because there are more parameters, i.e. words
in the vocabulary and words in each sentence, to estimate. The decoding or translating
phase takes longer still. A translation of 1,000 unsegmented sentences takes slightly over
an hour on the same machine. When the data is segmented, the decoding takes almost
two hours. The decoding time can be longer if both the source and target corpora are
segmented. In total, an average machine translation run, both training and testing,
takes approximately two and a half hours for original corpora and three hours for
segmented corpora.
Appendix B
Segmentation Lexicons
Lexicons used in the segmentation of each corpus for the best translation quality are
presented and described below.
• German In segmenting German compounds, middle segments are allowed. As
a result, the German lexicon consists of both affixes and middle segments.
• French French segmentation is not used in the translation with the best result. The
segmentation shown here is based on a 2-10 multi-gram and a 0.60 threshold.
Middle segments are not allowed and are concatenated with their preceding segments.
Suffixes of less than two characters are also concatenated. Words with
initial capitalization are assumed to be named entities and are not segmented.
• English Like French, the English lexicon has no middle segments. Single-character
prefixes and suffixes of less than or equal to two characters are concatenated
with adjacent segments. Named entities are not segmented.
The German and French lexicons are extremely large compared to the English
lexicon. This is because the German corpus consists mostly of compounds which
do not occur frequently in the corpus; the number of distinct segmented compounds is
therefore high. In the French data, different conjugations and inflections of the same
roots contribute many unique surface words. However, when these German compounds
and French conjugations are segmented, the number of resulting affixes decreases
significantly, reflecting the fact that many surface words share the same components.
              German    French    English
Lexicon size  8,380     6,480     2,199
Prefixes      2,816     2,935     1,040
Middles       2,212     0         0
Suffixes      3,073     1,921     1,005

Table B.1: Statistics of segmentation lexicons. The size of a lexicon is the number of
unique segmented words in the corpus.
English suffixes

+ing        +tection    +ure        +duction     +bate
+tion       +lations    +urity      +form        +tary
+ation      +formation  +ers        +gration     +wards
+ion        +national   +mer        +visions     +ressed
+ment       +cess       +ternal     +ons         +ward
+port       +ity        +ative      +ence        +inciples
+ies        +ial        +laration   +view        +ject
+operation  +ction      +sed        +rough       +ical
+ities      +gress      +tance      +out         +action
+ent        +change     +work       +lating      +legation
+ments      +ional      +ions       +ssible      +llow
+ding       +refore     +nal        +zation      +tails
+ted        +sion       +icultural  +tocol       +come
+ement      +ainst      +spect      +ises        +parations
+ations     +ess        +ditions    +largement   +ach
+ating      +ference    +bility     +struments   +ised
+ting       +nagement   +ining      +nual        +cial
+tions      +ance       +ople       +mocratic    +lation
+ble        +vention    +nership    +fication    +inciple
+ded        +ning       +jects      +tation      +able

Table B.2: One hundred most frequent English suffixes, in descending order (read down
the columns).
German

Abwasser+ +projekt                     Förder+ +mechanismen
Abwasser+ +behandlung                  Förder+ +instrumentarium
Abwasserauf+ +bereitungs+ +anlage      Förder+ +programme
Abwasser+ +systems                     Förder+ +gebietskarte
Abwasser+ +reinigung                   Förder+ +programms
Abwasser+ +netzes                      Förder+ +gebieten
Abwasserbeseitig+ +ungsanlagen         Förder+ +gebiete
Abwasserauf+ +bereitung                Förder+ +konzepts
Abwasserauf+ +bereitung+ +s+ +netze    Gemeinschafts+ +gesetzgeber
Abwasserauf+ +bereitung+ +s+ +netzen   Gemeinschaf+ +t+ +s+ +unternehmens
Abwassersammel+ +einrichtungen         Gemeinschafts+ +markensystem
Bildung+ +s+ +maßnahmen                Gemeinschaft+ +s+ +bereich
Bildungs+ +bereichen                   Gemeinschaft+ +s+ +organe
Bildung+ +sausgaben                    Gemeinschaft+ +s+ +in+ +itiativen
Bildungs+ +bereich                     Gemeinschaft+ +scharta
Bildungs+ +einrichtungen               Gemeinschafts+ +flotte
Bildungs+ +wesen                       Handel+ +s+ +system
Bildungs+ +leistungen                  Handels+ +präferenzen
Bildungs+ +zentren                     Handel+ +s+ +ver+ +handlungen
Chemie+ +erzeugnisse                   Handel+ +s+ +bestimm+ +ungen
Chemie+ +unternehmen                   Handel+ +s+ +statistiken
Chemie+ +industrie                     Handels+ +zugeständnisse
Dienstleistung+ +s+ +sektors           Handels+ +aktivitäten
Dienstleistung+ +s+ +zentren           Investition+ +s+ +minister
Dienstleistung+ +s+ +sektor            Investition+ +s+ +entscheidung

Table B.3: Examples from the German segmentation lexicon.
French

accept+ +able      adopt+ +ant      contin+ +entale
accept+ +er        adopt+ +ent      contin+ +ents
accept+ +eront     adopt+ +er       contin+ +gents
accept+ +ée        adopt+ +eront    contin+ +uation
accept+ +ées       adopt+ +ons      contin+ +ue
accessi+ +bilité   adopt+ +ée       contin+ +ues
accessi+ +ble      adopt+ +és       contin+ +us
accessi+ +bles     adopt+ +es       ex+ +ceptionnel
accompagn+ +ant    adress+ +es      ex+ +ceptionnelle
accompagn+ +ent    adress+ +es      ex+ +ceptionnelles
accompagn+ +er     adress+ +és      ex+ +ceptionnels
accompagn+ +ée     attach+ +ant     ex+ +ceptions
accompagn+ +és     attach+ +ement   ex+ +clusif
accompli+ +ra      attach+ +ent     ex+ +clusion
accord+ +ent       attach+ +era     ex+ +cédent
accord+ +er        attach+ +ons     ex+ +cédents
accord+ +era       attach+ +e       ex+ +erce
accord+ +eront     attach+ +es      ex+ +ercer
accord+ +ons       attach+ +s       ex+ +ercice
accord+ +ée        atteign+ +ent    ex+ +igences
accord+ +és        atteind+ +rait   ex+ +istants
accueil+ +le       atteind+ +ront   ex+ +iste
accueil+ +lir      atten+ +dre      ex+ +onération
accueill+ +ons     atten+ +tion     ex+ +plicites
accueilla+ +nt     atten+ +tive     ex+ +pliquant

Table B.4: Examples from the French segmentation lexicon.
English

accord+ +ance       assess+ +ing         dis+ +appointed
accord+ +ing        assess+ +ment        dis+ +bursement
account+ +ability   assess+ +ments       dis+ +charge
account+ +able      assis+ +ted          dis+ +content
account+ +ing       assis+ +ting         dis+ +continue
achiev+ +ing        attach+ +ing         dis+ +continuity
achieve+ +ment      attach+ +ment        dis+ +courage
achieve+ +ments     attemp+ +ted         dis+ +covered
activ+ +ate         attend+ +ance        dis+ +criminating
activ+ +ated        attend+ +ing         dis+ +location
activ+ +ation       chang+ +ing          ground+ +handling
activ+ +ists        change+ +over        ground+ +ing
activ+ +ity         character+ +isation  ground+ +water
adapt+ +ation       character+ +ises     ground+ +work
adapt+ +ations      character+ +istics   group+ +ing
adop+ +ting         characteri+ +sed     group+ +ings
adop+ +tion         charg+ +ing          guarantee+ +ing
applica+ +bility    charge+ +able        guid+ +ance
applica+ +ble       circum+ +stances     guid+ +ing
applica+ +nts       circum+ +vent        im+ +mediate
apply+ +ing         circum+ +vention     im+ +mediately
appoin+ +ting       citizen+ +ship       im+ +migration
appoint+ +ments     contin+ +ent         im+ +perative
appreci+ +ates      contin+ +ental       im+ +peratives
appreci+ +ation     contin+ +ents        im+ +pression

Table B.5: Examples from the English segmentation lexicon.
Bibliography
Al-Onaizan, Y., Germann, U., Hermjakob, U., Knight, K., Koehn, P., Marcu, D., and
Kenji, Y. (2000). Translating with scarce resources. In Proceedings of the National
Conference on Artificial Intelligence (AAAI).
Ando, R. K. and Lee, L. (1999). Unsupervised statistical segmentation of Japanese
kanji strings. CS Technical Report TR99-1756, Cornell University.
Ando, R. K. and Lee, L. (2003). Mostly-unsupervised statistical segmentation of
Japanese kanji sequences. Journal of Natural Language Engineering. Scheduled
to appear in Volume 9, number 1.
Brent, M., Murthy, S., and Lundberg, A. (1995). Discovering morphemic
suffixes: A case study in MDL induction.
Brown, P., Pietra, S. A. D., Pietra, V. J. D., and Mercer, R. L. (1993). The mathematics
of statistical machine translation: Parameter estimation. Computational Linguistics,
19(2):263–312.
Brown, P. F., Cocke, J., Pietra, S. D., Pietra, V. J. D., Jelinek, F., Lafferty, J. D., Mercer,
R. L., and Roossin, P. S. (1990). A statistical approach to machine translation.
Computational Linguistics, 16(2):79–85.
Brown, P. F., Pietra, S. A. D., Pietra, V. J., Lafferty, J. D., and Mercer, R. L. (1992).
Analysis, statistical transfer, and synthesis in machine translation. In Proceedings
of the Fourth International Conference on Theoretical and Methodological Issues
in Machine Translation: Empiricist vs. Rationalist Methods in MT, pages 83–100,
Montreal, Canada.
Callison-Burch, C. (2002). Co-training for statistical machine translation. Master’s
thesis, Division of Informatics, University of Edinburgh, Edinburgh, U.K.
Callison-Burch, C. and Osborne, M. (2003). Bootstrapping parallel corpora. In NAACL
workshop, Building and Using Parallel Texts: Data Driven Machine Translation and
Beyond, Edmonton, Canada.
Clarkson, P. and Rosenfeld, R. (1997). Statistical language modeling using the
CMU-Cambridge toolkit. In ESCA Eurospeech.
de Marcken, C. (1995). The unsupervised acquisition of a lexicon from continuous
speech. Technical Report AIM-1558, Massachusetts Institute of Technology.
de Marcken, C. (1996). Linguistic structure as composition and perturbation. In Joshi,
A. and Palmer, M., editors, Proceedings of the Thirty-Fourth Annual Meeting of the
Association for Computational Linguistics, pages 335–341, San Francisco. Morgan
Kaufmann Publishers.
Deligne, S. and Bimbot, F. (1997). Inference of variable-length acoustic units for
continuous speech recognition. In Proc. ICASSP ’97, pages 1731–1734, Munich,
Germany.
García-Varea, I. and Casacuberta, F. (2000). Word categorization in statistical translation.
Germann, U., Jahr, M., Knight, K., Marcu, D., and Yamada, K. (2001). Fast decoding
and optimal decoding for machine translation. In ACL, Toulouse, France.
Goldsmith, J. (2001). Unsupervised learning of the morphology of a natural language.
Computational Linguistics, 27(2):153–198.
Hua, Y. (2000). Unsupervised word induction using the MDL criterion. In ISCSL, Beijing.
Hutchins, W. J. and Somers, H. L. (1992). An Introduction to Machine Translation.
Harcourt Brace Jovanovich.
Jacquemin, C. (1997). Guessing morphology from terms and corpora. In 20th Annual
International ACM SIGIR Conference on Research and Development in Information
Retrieval, SIGIR '97, pages 156–167, Philadelphia, PA.
Papineni, K., Roukos, S., Ward, T., and Zhu, W.-J. (2002). BLEU: A method for automatic
evaluation of machine translation. In Meeting of the Association for Computational
Linguistics, pages 311–318.
Kazakov, D. and Manandhar, S. (2001). Unsupervised learning of word segmentation
rules with genetic algorithms and inductive logic programming. Machine Learning,
43:121–162.
Koehn, P. (2002). Europarl: A multilingual corpus for evaluation of machine translation.
Information Sciences Institute, University of Southern California.
Koehn, P. and Knight, K. (2003). Empirical methods for compound splitting. In
Meeting of the Association for Computational Linguistics, Budapest, Hungary.
Mitchell, T. M. (1997). Machine Learning. McGraw-Hill.
Grabar, N. and Zweigenbaum, P. (1999). Language-independent automatic acquisition of
morphological knowledge from synonym pairs.
Nießen, S. and Ney, H. (2000). Improving SMT quality with morpho-syntactic analysis.
In Proc. COLING 2000: The 18th Int. Conf. on Computational Linguistics,
pages 1081–1085, Saarbrücken, Germany.
Och, F. and Ney, H. (2000). Improved statistical alignment models. In ACL 2000, pages
440–447, Hong Kong, China.
Rissanen, J. and Ristad, E. (1994). Language acquisition in the MDL framework.
Schone, P. and Jurafsky, D. (2001). Knowledge-free induction of inflectional morphologies.
Ueffing, N. and Ney, H. (2003). Using POS information for statistical machine translation
into morphologically rich languages. In the 10th Conference of the European
Chapter of the Association for Computational Linguistics (EACL), pages 347–354,
Budapest, Hungary.
Yarowsky, D. and Wicentowski, R. (2000). Minimally supervised morphological analysis by multimodal alignment.