Parallel Corpora
Guest Lecture: Corpus linguistics and ontologies

Jennifer Spenader
http://odur.let.rug.nl/ spenader/
University of Groningen and Stockholm University
Wednesday, 13 October

Outline

- What are parallel corpora
  - Where do they come from
  - What format are they in
- Cross-linguistic comparative studies
- Parallel corpora disambiguate
  - WSD
  - Determining meaning
- Machine Translation
- Translation related uses
  - Translation aids
  - Omissions

What are parallel corpora?

- Translations of the same text
- Two traditions of research
  - Parallel corpora used for comparative language study, in the tradition of corpus linguistics
  - Statistical analysis used to discover patterns between languages, with little or no linguistic information
- Borin (2002): Research from these two traditions is beginning to converge

Constructed parallel corpora

- Constructed parallel corpora
  - Created by taking already existing translations of a given set of texts
  - Swedish-English Parallel Corpus: Swedish-English novels and children's books
- Already existing parallel corpora
  - Bible translations: computerized versions freely available for most languages
  - Software and military manuals

Where do parallel corpora come from?

- Naturally produced
  - Texts produced by governments that by law must have translations
  - Canadian Hansard: parliamentary debate transcripts available in English and French
  - Europarl: European parliamentary debate transcripts available in 11 EU languages
    - Extremely large! (last count 27 million words for each language!)
    - Dutch, Danish, English, Finnish, French, German, Greek, Italian, Portuguese, Swedish, Spanish

What format are parallel corpora in?

- Raw parallel texts
  - Useful for very simple investigations that don't involve translation comparisons
- Aligned corpora
  - A more useful format, in which regions of one text are mapped to regions of the other text that are believed to be semantically related
  - These mappings are also called bitext mappings
  - Mappings can be made by paragraph, by sentence, or by a given number of characters
  - Alignment can be done by hand or automatically

What can you do with raw parallel texts?

- Anna Wärnsby: Investigation of epistemic modality in Swedish and English using the English-Swedish Parallel Corpus (ESPC)
- ESPC
  - 64 English text samples and their translations, and 72 Swedish texts and their translations
  - Fiction and non-fiction; the fiction texts are extracts
  - Total size: 2.8 million words
  - "The corpus is only available for research at the Department of English at the Universities of Lund and Göteborg. Scholars outside these departments can gain access to the corpus by visiting, or cooperating with, one of these departments."

What can you do with raw parallel texts? (continued)

- Anna Wärnsby: Investigation of epistemic modality in Swedish and English

  (1) John is putting on his coat. He must be ready to leave.
  (2) John must pass this exam in order to keep his job.

- Goal: to discover co-occurring features that signal epistemic modality
- Investigate the same features in the parallel corpus in relation to can, must, måste, and kan
- The analysis of which features are important can be made without reference to the translation
- Swedish and English results are maximally comparable because the data from each language are maximally comparable

Exploiting the correspondences between parallel corpora

Especially, what can you do with parallel corpora that you can't do with non-parallel corpora?

Dan Melamed, 2001: p. 1
"Bitexts are one of the richest sources of linguistic knowledge because the translation of a text into another language can be viewed as a detailed annotation of what that text means. One might think that if that other language is also a natural language, then a computer is no further ahead, because it cannot understand the original text. However, just the knowledge that the two data streams are semantically equivalent leads to a kind of understanding that enables computers to perform an important class of "intelligent" functions."
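
To make the notion of an aligned bitext a little more concrete, here is a minimal sketch of one possible in-memory representation of a sentence-level bitext mapping. The `Link` class, the `aligned_segments` function and the toy English/Swedish sentences are all invented for illustration; real aligned corpora use their own file formats.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Link:
    """One alignment link (hypothetical structure): sentence indices on each side.

    A link may cover several sentences on one side, e.g. two source
    sentences translated as a single target sentence.
    """
    src: Tuple[int, ...]
    tgt: Tuple[int, ...]


def aligned_segments(src_sents: List[str], tgt_sents: List[str],
                     links: List[Link]) -> List[Tuple[str, str]]:
    """Return one (source text, target text) pair per alignment link."""
    pairs = []
    for link in links:
        source_text = " ".join(src_sents[i] for i in link.src)
        target_text = " ".join(tgt_sents[j] for j in link.tgt)
        pairs.append((source_text, target_text))
    return pairs


# Toy data, invented for this sketch (not taken from any corpus).
english = ["He put on his coat.", "He left.", "It was raining."]
swedish = ["Han tog på sig rocken och gick.", "Det regnade."]

# The first two English sentences map to one Swedish sentence (a 2-1 link).
links = [Link(src=(0, 1), tgt=(0,)), Link(src=(2,), tgt=(1,))]

pairs = aligned_segments(english, swedish, links)
for source_text, target_text in pairs:
    print(source_text, "<->", target_text)
```

Storing index tuples rather than text keeps the raw texts untouched, so the same corpus can carry paragraph-level and sentence-level mappings side by side.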

Disambiguation: using one language to disambiguate another

- If a lexical item in one language is ambiguous, and each meaning corresponds to a distinct lexical item in another language, we can use parallel corpora to extract a large number of "disambiguated examples" in the first language

Word sense disambiguation

1. Word sense disambiguation (WSD) is an important subtask in NLP
2. Words like "bank" or "course" in English have different meanings
3. When searching for information on "the world bank", for instance, we probably don't want to retrieve articles about the river banks of the world
   - The user intends the "financial" meaning of bank
   - The financial "bank" itself has several senses, e.g. the physical building vs. the organization
   - E.g. "The bank was on the corner" vs. "The bank lowered its lending rate"
   - And "The bank was sold" is still ambiguous

Parallel corpora and word sense disambiguation

- To build a WSD system you need already disambiguated data to train on, and disambiguated data to test on
- If you want to derive disambiguation rules statistically, you need a great deal of data
- Even rule-based systems need many examples in order to test how well they work
- By exploiting parallel corpora, training and test sets of examples can be constructed more easily

Disambiguation for determining meaning

- Spenader, J. (2004). Using Simple Word Alignment Measures to Study Discourse Particles. To appear in Sprache und Datenverarbeitung, International Journal for Language Data Processing
- The particle ju in Swedish, like many discourse particles, has no clear "meaning"
- Aijmer (1977) identified four uses of ju
- Aijmer (1996), in a parallel corpus study using ESPC, identified four main functions: Modality, Interactive, Interpersonal and Discourse Functions
- E.g. Modality is often translated with I suppose or I could
- Discourse Functions are usually translated with since, as, because (ju occurs in a clause with a reason or explanation)

Examples of translations

(3) ...om all de övriga medlemsstaterna accepterar det så kan en medlemsstat gå ur, och det kan man ju säga är en juridisk självklarhet.
(4) ...if all the other member States agree, a Member State can terminate its membership, and that is, I suppose, self evident in law...
(5) och jag kan ju amerikanskan så jag tog platsen.
(6) and as I know how to speak the American language, I got the job.
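
The idea of letting the translation act as a label can be sketched in a few lines. The code below is only an illustration under simple assumptions: it takes sentence-aligned pairs, keeps those whose Swedish side contains ju, and assigns a function label according to which English cue phrase appears in the translation. The cue phrases come from the slide above; the function names `contains` and `label_by_translation`, and the plain string matching, are assumptions of this sketch and not the method used in the studies cited.

```python
import re
from typing import Dict, List, Tuple

# Cue phrases per function, as listed on the slide above.
CUES: Dict[str, List[str]] = {
    "Modality": ["i suppose", "i could"],
    "Discourse": ["since", "as", "because"],
}


def contains(phrase: str, text: str) -> bool:
    """True if `phrase` occurs in `text` as whole word(s), ignoring case."""
    pattern = r"\b" + re.escape(phrase.lower()) + r"\b"
    return re.search(pattern, text.lower()) is not None


def label_by_translation(pairs: List[Tuple[str, str]], word: str,
                         cues: Dict[str, List[str]]) -> List[Tuple[str, List[str]]]:
    """Label each source sentence containing `word` via cues in its translation."""
    labelled = []
    for source, target in pairs:
        if not contains(word, source):
            continue  # the word under study does not occur in this sentence
        labels = [label for label, phrases in cues.items()
                  if any(contains(p, target) for p in phrases)]
        labelled.append((source, labels))
    return labelled


# Sentence pairs (3)/(4) and (5)/(6) from the slide.
pairs = [
    ("...om all de övriga medlemsstaterna accepterar det så kan en medlemsstat "
     "gå ur, och det kan man ju säga är en juridisk självklarhet.",
     "...if all the other member States agree, a Member State can terminate "
     "its membership, and that is, I suppose, self evident in law..."),
    ("och jag kan ju amerikanskan så jag tog platsen.",
     "and as I know how to speak the American language, I got the job."),
]

labelled = label_by_translation(pairs, "ju", CUES)
for source, labels in labelled:
    print(labels, "|", source[:40])
```

The same pattern would apply to an ambiguous English word such as "bank", with the cue lists replaced by its distinct translations in the other language. Short cues like "as" will of course also match unrelated uses, so real work needs word alignment rather than sentence-level matching.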

Discourse Particles study

- Used the Europarl corpus
- A larger corpus can be used to confirm that the smaller-scale introspective work is accurate
  - Groups of examples with different translation equivalents can be examined to see if they correspond to different uses
- Results: the translation equivalents identified by Aijmer are in fact more frequent in translations of passages with ju than in the corpus as a whole

Heavy statistics with Very Large Corpora

- The roots of NLP lie in the desire to do machine translation
- Rule-based machine translation was the earlier method
- In the 80's Brown et al. showed that with very large corpora, and a lot of computing power, you could do statistical machine translation
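
To give a flavour of how word correspondences can be estimated from aligned text alone, here is a small sketch of expectation-maximisation training in the style of IBM Model 1, one well-known statistical alignment model. It is a simplified illustration (no NULL word, no smoothing), the three Dutch-English sentence pairs are invented toy data, and the function names are assumptions of this sketch.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def train_model1(pairs: List[Tuple[str, str]],
                 iterations: int = 20) -> Dict[Tuple[str, str], float]:
    """Estimate t(e|f), the probability that foreign word f translates as e."""
    tokenized = [(f.split(), e.split()) for f, e in pairs]
    e_vocab = {e for _, es in tokenized for e in es}

    # Start from a uniform distribution: no dictionary knowledge at all.
    t: Dict[Tuple[str, str], float] = defaultdict(lambda: 1.0 / len(e_vocab))

    for _ in range(iterations):
        count: Dict[Tuple[str, str], float] = defaultdict(float)
        total: Dict[str, float] = defaultdict(float)
        # E-step: distribute each English word over the foreign words
        # in the same sentence pair, in proportion to the current t.
        for fs, es in tokenized:
            for e in es:
                norm = sum(t[(e, f)] for f in fs)
                for f in fs:
                    frac = t[(e, f)] / norm
                    count[(e, f)] += frac
                    total[f] += frac
        # M-step: re-estimate t(e|f) from the fractional counts.
        t = defaultdict(float, {(e, f): c / total[f]
                                for (e, f), c in count.items()})
    return t


def best_translation(t: Dict[Tuple[str, str], float], f: str) -> str:
    """Return the English word with the highest estimated t(e|f)."""
    candidates = {e: p for (e, g), p in t.items() if g == f}
    return max(candidates, key=candidates.get)


# Toy sentence-aligned corpus, invented for this sketch.
corpus = [("het huis", "the house"),
          ("het boek", "the book"),
          ("een boek", "a book")]

t = train_model1(corpus)
for word in ["het", "huis", "boek", "een"]:
    print(word, "->", best_translation(t, word))
```

Nothing but co-occurrence across sentence pairs drives the estimates: "het" appears with "the" in two pairs, which pushes "huis" towards "house" and "boek" towards "book", which in turn leaves "a" for "een".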

Statistical Machine Translation

- Large, aligned, parallel corpora
- Use word alignment statistics
- I.e. translation without a dictionary
- The larger the corpora, the better the results

Translation helpers

- When translating, source-language expressions can be looked up in a parallel corpus to find their target-language renderings; this can offer suggestions, statistics, etc. for improving translations and their consistency (translation memory)
- Automatic dictionaries
- Multi-word units: non-compositional compounds (Melamed, D. 1997)
- Translation omissions detected automatically

Errors of omission in translation

- Omissions in translation occur frequently
  - Tired translators skip text
  - A wrong key press leads to deletions
- Proof-reading is costly
- Automatic detection of omissions would be useful
- Translations also contain intentional omissions
  - How to distinguish intentional abridgements from unintentional errors?
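
One way to picture automatic omission detection is geometric: plot corresponding positions in the source text against positions in the translation, and look for a stretch where the source advances a long way while the translation barely moves. The sketch below is a deliberately naive illustration of that picture and not a published algorithm. The function name `flag_possible_omissions`, both thresholds and the toy correspondence points are invented for illustration.

```python
from typing import List, Tuple


def flag_possible_omissions(bitext_map: List[Tuple[int, int]],
                            min_src_gap: int = 50,
                            max_slope: float = 0.2) -> List[Tuple[int, int]]:
    """Return source spans that have almost no counterpart in the target.

    bitext_map holds (source position, target position) correspondence
    points, e.g. character offsets of segments believed to match.
    Both thresholds are arbitrary values chosen for this toy example.
    """
    points = sorted(bitext_map)
    flagged = []
    for (s0, t0), (s1, t1) in zip(points, points[1:]):
        src_gap = s1 - s0
        tgt_gap = t1 - t0
        # A long source stretch with a nearly flat slope suggests that
        # source material was skipped in the translation.
        if src_gap >= min_src_gap and tgt_gap / src_gap <= max_slope:
            flagged.append((s0, s1))
    return flagged


# Toy correspondence points, invented for this sketch: the source runs
# from offset 80 to 200 while the target only moves from 85 to 90.
toy_map = [(0, 0), (40, 42), (80, 85), (200, 90), (240, 128)]

suspects = flag_possible_omissions(toy_map)
print(suspects)
```

A check like this uses no dictionary or grammar, only positions, so it cannot by itself tell an intentional abridgement from a slip; it only points a proof-reader to where to look.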

ADOMIT: Automatic detection of omissions in translation

- Melamed (2001)
- (Draw little graph)

Advantages of ADOMIT

- Relies on the geometric analysis of bitexts
- Works entirely without linguistic resources, so the technique is easily used with different languages
- Because it is linguistically ignorant, it can detect word processing errors as well

Summary

- Parallel corpora offer a rich source of additional knowledge about language.
- There are opportunities both for cross-linguistic corpus research and for more statistics-based extraction of a large variety of information