Corpora and Concordances in
Human Translation, Machine
Translation, and
Language/Translator Education
Barbara Gawronska
Dept. of Foreign Languages and Translation, University of
Agder, Kristiansand, Norway
Oslo, Oct 2008
Outline
- Concordance Tools for Human Users
  - concordances in the past and now
  - how can language learners, language teachers, and translators benefit from modern concordance tools – some examples
- Monolingual Corpora in Machine Translation (MT)
  - how can MT-systems developers benefit from monolingual corpora?
- Between MT and human translation: electronic translation aids
Concordance Tools for Human
Users
Concordance – the traditional definition and
the modern developments
- http://en.wiktionary.org/wiki/concordance: ”An alphabetical verbal index showing the places in the text of a book where each principal word may be found, with its immediate context in each place”
  - this definition is a little old-fashioned, since modern tools also allow searching for excerpts where the context is defined in terms of parts of speech.
- The search results may be sorted in different ways: by frequency or by collocational strength, calculated with different statistical measures
Concordance – a very useful complement to a
dictionary (for some reason, seldom employed in
language education and translator education…)
- words and phrases can be looked up by their context; the user can easily get an overall picture of the usage of a word
a &noun of laughter
- a roar of laughter
- a burst of laughter
- a round of laughter
- a bray of laughter
- a fit of laughter

a pack of &noun
- a pack of wolves
- a pack of thieves
- a pack of liers
- a pack of bloodhounds
- a pack of flatheads
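A search with a part-of-speech variable such as &noun can be pictured as sliding a pattern over a tagged corpus. The following is only a minimal sketch of that idea, not a description of how any particular concordance tool works: the tag names, the tiny SAMPLE corpus, and the function names are assumptions made for illustration.

```python
from collections import Counter
from typing import List, Tuple

Token = Tuple[str, str]  # (word form, part-of-speech tag); the tag set is hypothetical

def _item_matches(token: Token, item: str) -> bool:
    """A pattern item starting with '&' is a POS variable, anything else a literal word."""
    word, tag = token
    if item.startswith("&"):
        return tag == item[1:]
    return word.lower() == item.lower()

def match_pattern(tagged: List[Token], pattern: List[str]) -> List[str]:
    """Return every (lowercased) token sequence in the corpus that fits the pattern."""
    hits = []
    width = len(pattern)
    for start in range(len(tagged) - width + 1):
        window = tagged[start:start + width]
        if all(_item_matches(tok, item) for tok, item in zip(window, pattern)):
            hits.append(" ".join(word.lower() for word, _ in window))
    return hits

# Invented mini-corpus, tagged by hand for the example.
SAMPLE: List[Token] = [
    ("There", "adv"), ("was", "verb"), ("a", "det"), ("roar", "noun"),
    ("of", "prep"), ("laughter", "noun"), (".", "punct"),
    ("A", "det"), ("burst", "noun"), ("of", "prep"), ("laughter", "noun"),
    ("followed", "verb"), ("by", "prep"), ("a", "det"), ("pack", "noun"),
    ("of", "prep"), ("wolves", "noun"), (".", "punct"),
]

hits = match_pattern(SAMPLE, ["a", "&noun", "of", "laughter"])
# Sorting by raw frequency is one of the orderings mentioned on the slides.
ranked_by_frequency = Counter(hits).most_common()
```

On a real corpus the hits would be counted and sorted in the same way, by raw frequency or by an association measure.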
An example: Corpus Culler – a multilanguage
concordance tool (LexwareLabs AB, www.lexwarelabs.com
– a company cooperating with Högskolan i Skövde)
- Excerpts are selected from corpora of modern English/Swedish/Polish of about 30 million word tokens each
- Authentic modern language – corpora collected from the web
- Pre-defined part-of-speech variables can be combined with variables defined by the user
- Statistical information available (raw frequency and different association measures: MI, chi-square, t-score, etc.)
Translation equivalents listed by Lexin
- English entry word: pack
  - Swedish translation: koppel, släpp, flock, skock (substantiv)
  - Composition:
    - a pack of dogs--ett koppel hundar
    - a pack of thieves--en tjuvliga
    - you're just telling a pack of lies!
- English entry word: flock
  - Swedish translation: (om djur) skock, flock (substantiv); oordnad grupp; hjord
  - Example: a flock of sheep--en skock får
Concordance complements the dictionary
information
(LexwareLabs AB, www.lexwarelabs.com)
A concordance may facilitate different educational tasks, and
help the translator. It can e.g. show the use of prepositions
and particles with verbs…
…stylistic variation, semantically similar words
Distinctions between semantically similar words that
are not exact synonyms; valency patterns
(e.g. enable/facilitate)
The usage of ”support verbs”: results of
search for ”göra &noun”
Results of search for göra &noun, statistics
Finding translation equivalents, avoiding
interference: ”göra motstånd” compared to Polish
”&verb opór” (opór = motstånd, ’resistance’)
The concordance tool is connected to dictionaries; a word may
be seen in a broader context, and dictionary definitions pop up
on request
The dictionary function is also available for
Swedish
Even specialized dictionaries…
Non-dictionary expressions and new
expressions (the variable &new)
WordNet: 155,327 entries, 207,016 senses
UMLS: 975,354 concepts and 2.4 million concept names
Absent in dictionaries:
- new words: pneumonoultramicroscopicsilicovolcanoconiosis, fusioneer, biofraud
- new multiwords: fridge googling, chief hacking officer
- tricky new formations (but also misspelled words): cyberchondriac, spamdexing, b4, lol, cul8er, håglöss, nydist, syniker, turktumlare, Kristi himmelfärsdag…
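The idea behind a variable like &new can be illustrated as a lookup against a reference word list: whatever the lexicon does not contain is reported as a candidate. This is a minimal sketch under that assumption. The function name, the tiny LEXICON set, and the tokenization rule are hypothetical, and a real tool would presumably add lemmatization and spelling checks.

```python
import re
from collections import Counter
from typing import Iterable, List

def new_expressions(text: str, lexicon: Iterable[str], min_count: int = 1) -> List[str]:
    """List word forms from the text that are missing from the lexicon, most frequent first."""
    known = {entry.lower() for entry in lexicon}
    # Runs of letters and digits count as tokens, so forms like "b4" or "cul8er" survive.
    tokens = re.findall(r"[^\W_]+", text.lower())
    counts = Counter(tok for tok in tokens if tok not in known)
    return [word for word, count in counts.most_common() if count >= min_count]

# Hypothetical stand-in for a real dictionary word list.
LEXICON = {"the", "is", "worried", "see", "you", "later"}

candidates = new_expressions("The cyberchondriac is worried. cul8er!", LEXICON)
```

Misspellings are reported alongside genuine new formations, which matches the caveat above that the two are hard to tell apart automatically.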
Corpora and Concordances in
Machine Translation
Stochastic MT
Statistical MT (Brown et al. 1993)
Example-based Machine Translation (EBMT)
KBMT
Knowledge Based Machine Translation (KBMT) – Nirenburg et al., Hobbs, Wilks, and others
- knowledge stored in lexicons, onomasticons, and ontologies
- rule-based parsing and semantico-pragmatic analysis aimed at conceptual representations
Is merging of the different approaches possible? YES. The
MT system Verbmobil (a simplified version of Figure 11, p.
17 in Wahlster 2000)
How can MT and automated Information
Extraction be facilitated by unilingual corpora
and concordances?
- statistics-based term extraction
- automatic extraction of inflectional information
- collocation extraction
- extraction of knowledge about concrete and metaphorical usage of words and phrases
Extracting English compounds from a biomedical corpus (Dura,
Gawronska, and Erlendsson 2006)
\[ \mathrm{Tscore} = \frac{f(x,y) - f(x)\,f(y)}{f(x)\,f(y)} \]
f(x) corpus frequency of word x
f(x,y) corpus frequency of word pair (x, y)
N total number of words in the corpus
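To show how the quantities f(x), f(x,y), and N feed into a t-score ranking of word pairs, here is a minimal sketch. It assumes the common textbook form of the measure, t = (f(x,y) - f(x)f(y)/N) / sqrt(f(x,y)), which may differ in normalization from the variant used by Dura, Gawronska, and Erlendsson. The function names and the toy token list are invented for the example.

```python
import math
from collections import Counter
from typing import List, Tuple

def t_score(f_xy: int, f_x: int, f_y: int, n: int) -> float:
    """Textbook t-score (an assumption): observed minus expected pair count over sqrt(observed)."""
    expected = f_x * f_y / n
    return (f_xy - expected) / math.sqrt(f_xy)

def rank_pairs(tokens: List[str]) -> List[Tuple[Tuple[str, str], float]]:
    """Score every adjacent word pair in the token list and sort by descending t-score."""
    n = len(tokens)
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    scored = {
        pair: t_score(count, unigrams[pair[0]], unigrams[pair[1]], n)
        for pair, count in bigrams.items()
    }
    return sorted(scored.items(), key=lambda item: item[1], reverse=True)

# Invented toy corpus; a real run would use millions of tokens.
toy_tokens = "cell cycle arrest in cell cycle control of cell cycle".split()
ranking = rank_pairs(toy_tokens)
```

In the toy corpus the recurring pair "cell cycle" ends up at the top of the ranking, which is the behaviour a compound extractor relies on.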
Extracting Latin terms from a biomedical corpus (Dura,
Gawronska, and Erlendsson 2006)
\[ MI(x,y) = H(x) + H(y) - H(x,y) \]
\[ \max\left( \frac{MI(x,y)}{H(x)},\ \frac{MI(x,y)}{H(y)} \right)^{2} \]
H(x) entropy of word x = -p(x)log2(p(x)),
where probability p(x) = f(x) / N
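The definitions above translate into a few lines of arithmetic. The sketch below computes the entropy term H, the MI score, and the larger of the two ratios MI/H(x) and MI/H(y). Two points are assumptions rather than statements from the slides: H(x,y) is taken to be the same entropy term applied to p(x,y) = f(x,y) / N, and the exponent 2 that appears in the slide formula is not applied. All function names are hypothetical.

```python
import math

def entropy_term(count: int, n: int) -> float:
    """H = -p * log2(p) with p = count / n, as defined on the slide; 0 for unseen items."""
    if count == 0:
        return 0.0
    p = count / n
    return -p * math.log2(p)

def mi(f_x: int, f_y: int, f_xy: int, n: int) -> float:
    """MI(x, y) = H(x) + H(y) - H(x, y), with H(x, y) computed from f(x, y) / N (assumption)."""
    return entropy_term(f_x, n) + entropy_term(f_y, n) - entropy_term(f_xy, n)

def normalized_mi(f_x: int, f_y: int, f_xy: int, n: int) -> float:
    """Larger of MI/H(x) and MI/H(y); ratios with a zero entropy term are skipped."""
    score = mi(f_x, f_y, f_xy, n)
    ratios = [
        score / h
        for h in (entropy_term(f_x, n), entropy_term(f_y, n))
        if h > 0
    ]
    return max(ratios) if ratios else 0.0

# Two words that each occur 4 times in 16 tokens and always occur together.
example = normalized_mi(4, 4, 4, 16)
```

With these counts each entropy term is 0.5, so MI is 0.5 and the normalized value is 1.0, the case of two words that never occur apart.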
Extracting ’superanimate’ nouns for Polish
(Gawronska et al 2002)
Pojawili się więc Algierczycy, Jemeńczycy, obywatele Bangladeszu, Uzbecy,
Kirgizi i Tadżycy.
’There arrived Algerians, Yemenis, citizens of Bangladesh, Uzbeks,
Kirgizis, and Tadjiks’
Stop-list and a suffix list with declension numbers
PolWN Database

Word         Cat  SemCat  Gender  Number  Case  Decl
Algierczycy  n    hum     ma      pl      nom   35
Jemeńczycy   n    hum     ma      pl      nom   35
obywatele    n    hum     ma      pl      nom   14
Uzbecy       n    hum     ma      pl      nom   36
Kirgizi      n    hum     ma      pl      nom   38
Tadżycy      n    hum     ma      pl      nom   35
Pojawili     v    hum     ma      pl
Translation Memories – between MT
and Human Translation
- Alignment techniques and EBMT techniques can be employed for building and searching the translator’s own corpora
- Knowledge about existing corpus and concordance techniques helps translators build their own translation memory databases
- Most popular translation memories:
  - TRADOS Translation Memory Desktop
  - Déjà Vu
  - MemoQ
  - Similis
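Searching a translation memory can be pictured as a fuzzy lookup: the new source segment is compared with every stored source segment, and the translation of the closest one is offered if it is similar enough. The sketch below is an illustration only. It uses the character-based similarity ratio from Python's standard difflib module, which is an assumption and not the matching method of any of the products listed above; the function name, the threshold, and the three stored pairs (taken from the dictionary examples earlier in the talk) are chosen for the example.

```python
from difflib import SequenceMatcher
from typing import List, Optional, Tuple

Memory = List[Tuple[str, str]]  # (source segment, stored translation)

def best_match(query: str, memory: Memory, threshold: float = 0.7) -> Optional[Tuple[str, str, float]]:
    """Return (source, translation, similarity) for the closest stored segment, or None."""
    best_pair = None
    best_score = 0.0
    for source, target in memory:
        # ratio() is a similarity between 0.0 (nothing shared) and 1.0 (identical).
        score = SequenceMatcher(None, query.lower(), source.lower()).ratio()
        if score > best_score:
            best_pair, best_score = (source, target), score
    if best_pair is None or best_score < threshold:
        return None
    return best_pair[0], best_pair[1], best_score

# Hypothetical memory built from example pairs shown earlier.
MEMORY: Memory = [
    ("a pack of dogs", "ett koppel hundar"),
    ("a flock of sheep", "en skock får"),
    ("a pack of thieves", "en tjuvliga"),
]

suggestion = best_match("a pack of dog", MEMORY)
```

A query below the threshold returns None, which corresponds to a segment the translator has to handle without a suggestion from the memory.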
Conclusions: electronic corpora and
concordance programs facilitate the
following translation-related tasks:
- Language education: learning terminology and words in context, easy creation of exercises and tests
- Creation of Machine Translation systems (however, corrections made by translators will always be necessary)
- Creation of computer-based translation aids:
  - Dictionaries
  - Language aids providing grammatical information (morphology, noun/verb paradigms)
  - Style checkers
  - Terminology aids, such as glossaries of ‘authorized’ terminology for a particular scientific, technical or commercial field, for particular clients, agencies and customers
  - Translation memories
Thank you!
References
Dura, E. and Gawronska, B. (2007) Novelty Extraction from Special and Parallel Corpora. In: Proceedings of 3rd Language &
Technology Conference 2007, 305-309. Adam Mickiewicz University, Poznan, Poland. ISBN 978-83-7177-407-2.
Dura, E, Gawronska, B, Olsson, B., and Erlendsson, B. (2006) Towards Information Fusion in Pathway Evaluation: Encoding
Relations in Biomedical Texts. In: Proceedings of The 9th International Conference on Information Fusion, Florence, Italy,
10-13 July 2006, 240-247
Huenerfauth, Matt (2004) Spatial and Planning Models of ASL Classifier Predicates for Machine Translation. To appear in
Proceedings of TMI 2004, Baltimore, U.S.
Hutchins, John (1999a) The development and use of machine translation systems and computer-based translation tools.
International Symposium on Machine Translation and Computer Language Information Processing, 26–28 June 1999,
Beijing, China.
http://ourworld.compuserve.com/homepages/WJHutchins/Beijing.htm
Hutchins, John (1999b) Retrospect and prospect in computer-based translation. (Paper presented at the MT Summit, Singapore,
1999)
http://ourworld.compuserve.com/homepages/WJHutchins/MTS-99.htm
Hutchins, John (2000) The IAMT Certification initiative and defining translation system categories. (Presented at EAMT
Workshop, Ljubljana, May 2000)
http://ourworld.compuserve.com/homepages/WJHutchins/IAMTcert.htm
Hutchins, John (unpublished) The history of machine translation in a nutshell
http://ourworld.compuserve.com/homepages/WJHutchins/Nutshell.htm
Jurafsky, D. & Martin, J.H. (2000) Speech and Language Processing. Prentice Hall Series in Artificial Intelligence. Chapters 18, 20, 21.
Kay, Martin (1996) Machine Translation: The Disappointing Past and Present. In: Survey of the State of the Art in Human
Language Technology (Eds. Varile, Giovanni Battista & Zampolli, Antonio).
http://cslu.cse.ogi.edu/HLTsurvey/ch8node4.html http://www.multilingual.com/machineTranslation62.htm
Seligman, Mark, Dillinger, Mike, and Zong, Chengqing (2004) Cooperative Spoken Language Understanding for Robust Speech
Translation. Paper submitted for TMI 2004.
Somers, H. (2003) Machine Translation: Latest Developments. In: Mitkov (ed.): The Oxford Handbook of Computational Linguistics
Wahlster, W., (2000) Verbmobil: Foundations of Speech-to-Speech Translation, Springer-Verlag, Berlin.