Parallel Corpora

Guest Lecture: Corpus linguistics and ontologies
Jennifer Spenader
http://odur.let.rug.nl/~spenader/
University of Groningen
and Stockholm University
Wednesday, 13 October
Outline
What are parallel corpora
Where do they come from
What format are they in
Cross-linguistic comparative studies
Parallel corpora disambiguate
WSD
Determining meaning
Machine Translation
Translation related uses
Translation aids
Omissions
What are parallel corpora?
- Translations of the same text
- Two traditions of research
  - Parallel corpora used for comparative language study, in the tradition of corpus linguistics
  - Statistical analysis used to discover patterns between languages, with little or no linguistic information
  - Borin (2002): Research from these two traditions is beginning to converge
Constructed parallel corpora
- Constructed parallel corpora
  - Created by taking already existing translations of a given set of texts
  - Swedish-English Parallel Corpus: Swedish-English novels and children’s books
- Already existing parallel corpora
  - Bible translations: computerized versions freely available for most languages
  - Software and military manuals
Where do parallel corpora come from?
- Naturally produced
  - Texts produced by governments that by law must have translations
  - Canadian Hansard: parliamentary debate transcripts available in English and French
  - Europarl: European parliamentary debate transcripts available in 11 EU languages
    - Extremely large! (last count: 27 million words for each language!)
    - Dutch, Danish, English, Finnish, French, German, Greek, Italian, Portuguese, Swedish, Spanish
What format are parallel corpora in?
- Raw parallel texts
  - These are useful for very simple investigations that don’t involve translation comparisons
- Aligned corpora
  - More useful format, where areas in one text are mapped to areas in the other text that are believed to be semantically related
  - These are also called bitext mappings
  - Mappings are done by paragraph, by sentence, or by a given number of characters
  - Can be done by hand or automated
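One possible way to store a sentence-level mapping is as a list of links between sentence indices. The sketch below is only an illustration: the `Link` class, the `aligned_text` helper, and the English/Swedish toy sentences are assumptions made up for this example, not the format of any particular corpus.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Link:
    """One alignment link (hypothetical format)."""
    src: tuple  # indices of source-language sentences
    tgt: tuple  # indices of target-language sentences

def aligned_text(links, src_sents, tgt_sents):
    """Yield (source text, target text) for each link in the mapping."""
    for link in links:
        yield (" ".join(src_sents[i] for i in link.src),
               " ".join(tgt_sents[j] for j in link.tgt))

# Invented toy bitext (English / Swedish).
en = ["I was tired.", "I went home and slept."]
sv = ["Jag var trött.", "Jag gick hem.", "Jag sov."]

# A 1-1 link followed by a 1-2 link (one sentence translated as two).
links = [Link((0,), (0,)), Link((1,), (1, 2))]

pairs = list(aligned_text(links, en, sv))
for src, tgt in pairs:
    print(src, "<->", tgt)
```

Storing index tuples rather than single indices lets the same structure cover one-to-many links, and an empty tuple could mark a passage with no counterpart in the other text.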
What can you do with raw parallel texts?
- Anna Wärnsby: Investigation of epistemic modality in Swedish and English using the English-Swedish Parallel Corpus (ESPC)
- ESPC
  - 64 English text samples and their translations, and 72 Swedish texts and their translations
  - Fiction and non-fiction; the fiction texts are extracts
  - Total size: 2.8 million words
  - “The corpus is only available for research at the Department of English at the Universities of Lund and Göteborg. Scholars outside these departments can gain access to the corpus by visiting, or cooperating with, one of these departments.”
What can you do with raw parallel texts?
- Anna Wärnsby: Investigation of epistemic modality in Swedish and English
  (1) John is putting on his coat. He must be ready to leave.
  (2) John must pass this exam in order to keep his job.
- Goal: to discover co-occurring features that signal epistemic modality
- Investigate the same features in parallel corpora in relation to can, must, måste, and kan
- Can make an analysis of which features are important without reference to the translation
- Swedish and English results are maximally comparable because the data from each language is maximally comparable
Exploiting the correspondences between parallel corpora
Especially, what can you do with parallel corpora that you can’t do with non-parallel corpora?
Dan Melamed (2001, p. 1):
“Bitexts are one of the richest sources of linguistic knowledge because the translation of a text into another language can be viewed as a detailed annotation of what that text means. One might think that if that other language is also a natural language, then a computer is no further ahead, because it cannot understand the original text. However, just the knowledge that the two data streams are semantically equivalent leads to a kind of understanding that enables computers to perform an important class of “intelligent” functions.”
Disambiguation: using one language to disambiguate another
1. If a lexical item in one language is ambiguous, and each meaning corresponds to a distinct lexical item in another language, we can use parallel corpora to extract a large number of “disambiguated examples” in the first language
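The extraction step above could be sketched roughly as follows. Everything here is an assumption made for illustration: the sense inventory `SENSE_MARKERS` (French words taken to signal each sense of English "bank"), the helper names, and the three invented sentence pairs. A real study would work from an aligned corpus and a more careful list of translation equivalents.

```python
import re

# Hypothetical sense inventory: French translations assumed to signal each sense.
SENSE_MARKERS = {
    "financial": {"banque"},
    "river": {"rive", "berge"},
}

def tokens(sentence):
    """Lowercased word tokens of a sentence."""
    return re.findall(r"\w+", sentence.lower())

def sense_tagged_examples(bitext, word="bank"):
    """Label each English sentence containing `word` by the sense its translation signals."""
    examples = []
    for en, fr in bitext:
        if word not in tokens(en):
            continue
        fr_tokens = set(tokens(fr))
        senses = [s for s, markers in SENSE_MARKERS.items() if markers & fr_tokens]
        if len(senses) == 1:  # keep only unambiguous evidence
            examples.append((en, senses[0]))
    return examples

# Invented sentence-aligned toy bitext (English, French).
bitext = [
    ("The bank lowered its lending rate.", "La banque a baissé son taux de prêt."),
    ("We walked along the bank of the river.", "Nous avons marché le long de la rive du fleuve."),
    ("She opened the window.", "Elle a ouvert la fenêtre."),
]

examples = sense_tagged_examples(bitext)
for sentence, sense in examples:
    print(sense, ":", sentence)
```

Sentences whose translation signals no sense, or more than one, are skipped, so the output is smaller than the input but each label is backed by the translation.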
Word sense disambiguation
1. Word sense disambiguation is an important subtask in NLP
2. Words like “bank” or “course” in English have different meanings
3. When searching for information on “the World Bank”, for instance, we don’t want to retrieve articles about the river banks of the world
   - The user intends the “financial” meaning of bank
   - The financial “bank” itself has several senses, e.g. the physical building vs. the organization
   - E.g. “The bank was on the corner” vs. “The bank lowered its lending rate”
   - “The bank was sold” is still ambiguous
Parallel corpora and word sense disambiguation
- To build a WSD system you need already disambiguated data to train on, and disambiguated data to test on
- If you want to try to statistically derive rules for disambiguation, you will need a great deal of data
- Even rule-based systems need many examples in order to test how well they work
- By exploiting parallel corpora, training and test sets of examples can be more easily constructed
Disambiguation for determining meaning
- Spenader, J. (2004). Using Simple Word Alignment Measures to Study Discourse Particles. To appear in Sprache und Datenverarbeitung, International Journal for Language Data Processing
- The particle ju in Swedish, like many discourse particles, has no clear “meaning”
- Aijmer (1977) identified four uses of ju
- Aijmer (1996), in a parallel corpus study using ESPC, identified four main functions: Modality, Interactive, Interpersonal and Discourse Functions
- E.g. Modality is often translated with I suppose or I could
- Discourse Functions are usually translated with since, as, because (ju occurs in a clause with a reason or explanation)
Examples of translations
(3) ...om all de övriga medlemsstaterna accepterar det så kan en medlemsstat gå ur, och det kan man ju säga är en juridisk självklarhet.
(4) ...if all the other member States agree, a Member State can terminate its membership, and that is, I suppose, self-evident in law...
(5) och jag kan ju amerikanskan så jag tog platsen.
(6) and as I know how to speak the American language, I got the job.
Discourse Particles study
- Used the Europarl corpus
- Can use a larger corpus to confirm that the smaller-scale introspective work is accurate
- Can examine groups of examples of different translation equivalents to see if they correspond to different uses
- Results: the translation equivalents identified by Aijmer are in fact more frequent in translations of passages with ju than in the corpus as a whole
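The kind of comparison reported in the results can be illustrated with a small sketch: how often does a candidate equivalent appear in translations of sentences containing the particle, compared with all translations? The functions `rate` and `compare` are hypothetical helpers, the first two sentence pairs are shortened from examples (3)-(6), the last two are invented, and sentence-level counting is a simplifying assumption rather than the measure used in the study.

```python
import re

def rate(sentences, phrase):
    """Share of sentences that contain the phrase (whole words, case-insensitive)."""
    pattern = re.compile(r"\b" + re.escape(phrase) + r"\b", re.IGNORECASE)
    hits = sum(1 for s in sentences if pattern.search(s))
    return hits / len(sentences) if sentences else 0.0

def compare(bitext, particle, equivalent):
    """Return (rate in translations of particle sentences, rate in all translations)."""
    with_particle = [en for sv, en in bitext if particle in sv.lower().split()]
    everything = [en for _, en in bitext]
    return rate(with_particle, equivalent), rate(everything, equivalent)

# Toy sentence-aligned bitext (Swedish, English).
bitext = [
    ("det kan man ju säga är en juridisk självklarhet.",
     "that is, I suppose, self-evident in law."),
    ("och jag kan ju amerikanskan så jag tog platsen.",
     "and as I know how to speak the American language, I got the job."),
    ("Det regnar idag.", "It is raining today."),
    ("Vi åker hem nu.", "We are going home now."),
]

result = compare(bitext, "ju", "I suppose")
print(result)
```

Matching the particle as a whole token matters here: a substring test would also count words such as "juridisk".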
Heavy statistics with Very Large Corpora
- The roots of NLP lie in the desire to do machine translation
- Rule-based machine translation was the earlier method
- In the 80’s Brown et al. showed that with very large corpora, and a lot of computing power, you could do statistical machine translation
Statistical Machine Translation
- Large, aligned, parallel corpora
- Use word alignment statistics
- I.e. translation without a dictionary
- The larger the corpora, the better the results
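To make "word alignment statistics" concrete, here is a minimal sketch of expectation-maximization training in the style of IBM Model 1, the simplest of the models associated with Brown et al. It is an illustration under simplifying assumptions (no NULL word, a three-sentence invented English/German corpus, hypothetical function names), not a description of any production system.

```python
from collections import defaultdict

def train_ibm1(pairs, iterations=20):
    """Estimate word translation probabilities t[(f, e)] = P(f | e) with EM."""
    tgt_vocab = {f for _, fs in pairs for f in fs}
    t = defaultdict(lambda: 1.0 / len(tgt_vocab))  # uniform starting point
    for _ in range(iterations):
        count = defaultdict(float)
        total = defaultdict(float)
        # E-step: share each target word among the source words of its sentence.
        for es, fs in pairs:
            for f in fs:
                norm = sum(t[(f, e)] for e in es)
                for e in es:
                    frac = t[(f, e)] / norm
                    count[(f, e)] += frac
                    total[e] += frac
        # M-step: re-estimate the probabilities from the expected counts.
        for (f, e), c in count.items():
            t[(f, e)] = c / total[e]
    return t

def best_translation(t, e):
    """Most probable target word for source word e under t."""
    candidates = [f for (f, e2) in t if e2 == e]
    return max(candidates, key=lambda f: t[(f, e)])

# Invented toy corpus (English, German), already tokenized.
corpus = [
    (["the", "house"], ["das", "Haus"]),
    (["the", "book"], ["das", "Buch"]),
    (["a", "book"], ["ein", "Buch"]),
]

t = train_ibm1(corpus)
for word in ["the", "house", "book", "a"]:
    print(word, "->", best_translation(t, word))
```

No dictionary is supplied: the pairings emerge only from which words co-occur across aligned sentences, which is also why more data tends to give better estimates.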
Translation helpers
- When translating, source-language expressions can be looked up in a parallel corpus to retrieve target-language suggestions, statistics etc., improving both the translation and its consistency (translation memory)
- Automatic dictionaries
- Multi-word units: non-compositional compounds (Melamed, D. 1997)
- Translation omissions automatically detected
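A translation memory lookup can be sketched with fuzzy string matching from Python's standard library. The `suggest` function, the similarity cutoff, and the two English/Swedish entries are assumptions made up for this illustration; real translation memory tools use their own matching schemes.

```python
import difflib

def suggest(memory, query, cutoff=0.6):
    """Return (stored source, stored translation) for the closest match, or None."""
    matches = difflib.get_close_matches(query, list(memory), n=1, cutoff=cutoff)
    return (matches[0], memory[matches[0]]) if matches else None

# Invented memory of previously translated segments (English -> Swedish).
memory = {
    "Press the start button.": "Tryck på startknappen.",
    "Close the lid.": "Stäng locket.",
}

hit = suggest(memory, "Press the stop button.")
print(hit)
```

The translator still has to adapt the retrieved segment, but reusing earlier translations of similar sentences is what helps keep terminology consistent.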
Errors of omission in translation
- Omissions in translation frequently occur
  - Tired translators skip
  - Wrong key, leads to deletions
- Proof-reading is costly
- Automatic detection of omissions would be useful
- Translations also contain intentional omissions
  - How to distinguish intentional abridgements from unintentional errors?
ADOMIT: Automatic detection of omissions in translation
- Melamed (2001)
- Draw little graph
Advantages of ADOMIT
- Relies on the geometric analysis of bitexts
- Works entirely without linguistic resources, thus the technique is easily used with different languages
- Because it is linguistically ignorant, it can detect word processing errors as well
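The geometric idea can be illustrated with a much simplified sketch; this is not Melamed's algorithm. Assume a bitext map given as points (position in source text, position in target text), sorted by source position. Where source text advances but the target barely moves, the map is nearly horizontal, which suggests source material without a translation. The function name, the thresholds, and the points are all invented for this example.

```python
def flat_segments(points, max_slope=0.2, min_run=50):
    """Return (x_start, x_end) source spans where the map is nearly horizontal."""
    gaps = []
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        dx, dy = x1 - x0, y1 - y0
        # Long stretch of source text with almost no matching target text.
        if dx >= min_run and dy / dx <= max_slope:
            gaps.append((x0, x1))
    return gaps

# Invented bitext map: character offsets of corresponding points.
points = [(0, 0), (100, 105), (200, 210), (400, 220), (500, 330)]

suspects = flat_segments(points)
print(suspects)
```

Nothing in the check refers to words or grammar, which is why the same approach carries over to other language pairs and can also flag deletions caused by editing accidents.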
Summary
- Parallel corpora offer a rich source of additional knowledge about language.
- There are opportunities both for cross-linguistic corpus research and for more statistics-based extraction of a large variety of information