International Journal of Engineering & Technology IJET-IJENS Vol:10 No:05
Development of Parallel Corpus and
English to Urdu Statistical Machine Translation
Aasim Ali, Shahid
Muhammad Kamran Malik
National University of Computer & Emerging Sciences, Lahore Campus
PUCIT, University of the Punjab, Lahore
Abstract-- This paper describes the development of a parallel
corpus for statistical machine translation of English text into
Urdu. The issues faced during this effort are shared and
discussed.
1. INTRODUCTION
Phrase-based Statistical Machine Translation (SMT) has been
the focus of research in the area of machine translation for
decades [5] [8]. The alignment of phrases is computed on the
basis of word-to-word alignment [3]. The translated phrases of
the target language are then sequenced using an n-gram
language model [3]. An open source toolkit, Moses [7], is used
for our experiments. The typological distance between the two
languages of the selected pair, Urdu and English, makes it
challenging to find a useful set of parameters for enhancing
translation quality.
Section 2 of this paper contains a brief account of the basics
of SMT, resource requirements, Hindi-Urdu language comparison,
and Google Translate. Section 3 describes the issues related to
corpus development, viz., sentence alignment, encoding,
paragraphing, and punctuation mismatch. Section 4 describes
the experimental setup. Section 5 reports the results,
including error analysis. Section 6 notes future directions.
2. LITERATURE REVIEW
Brown et al. [2] practically started the era of SMT. Their
work, known as the IBM models of SMT, was given a complete
mathematical formulation in [3], where the models are named
IBM Model 1 to IBM Model 5.
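As an illustration of the simplest of these models, the following is a minimal sketch of IBM Model 1 parameter estimation with expectation-maximization. It is a simplified reading of the formulation in [3], not the implementation used in this work. The function name, the NULL token handling, and the toy sentence pairs (romanized Urdu) are assumptions made for the example.

```python
from collections import defaultdict


def train_ibm_model1(pairs, iterations=10):
    """Estimate lexical translation probabilities t(f | e) with EM.

    pairs: list of (source_tokens, target_tokens).
    Returns a dict-like object keyed by (target_word, source_word).
    """
    target_vocab = {f for _, tgt in pairs for f in tgt}
    uniform = 1.0 / len(target_vocab)
    t = defaultdict(lambda: uniform)  # start from a uniform table

    for _ in range(iterations):
        count = defaultdict(float)  # expected count of (f, e) links
        total = defaultdict(float)  # expected count of e being linked
        for src, tgt in pairs:
            src = ["NULL"] + list(src)  # allow target words with no source
            for f in tgt:
                norm = sum(t[(f, e)] for e in src)
                for e in src:
                    frac = t[(f, e)] / norm  # E-step: soft alignment
                    count[(f, e)] += frac
                    total[e] += frac
        # M-step: renormalize the expected counts per source word.
        t = defaultdict(
            float, {(f, e): c / total[e] for (f, e), c in count.items()}
        )
    return t


# Hypothetical toy corpus, invented for illustration only.
toy_pairs = [
    (["red", "book"], ["laal", "kitaab"]),
    (["red", "house"], ["laal", "ghar"]),
    (["book"], ["kitaab"]),
]
table = train_ibm_model1(toy_pairs, iterations=5)
```

With only three sentence pairs, the table already prefers "kitaab" over "laal" as the translation of "book", because "book" co-occurs with "kitaab" in two of them.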
2.1. SMT Basics
SMT requires a parallel corpus in which each sentence of the
source language is aligned with its translation in the target
language. A practically successful approach to automatic
sentence alignment was demonstrated by Brown et al. [2], who
found the sentence alignments of a parallel corpus
algorithmically, on the basis of word count. Och and Ney [12]
introduced the idea of a character-count-based mechanism for
sentence alignment.
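The idea behind length-based alignment can be sketched with a small dynamic program. This is a simplified illustration, not the algorithm of [2] or [12]: it uses the absolute difference of lengths as the cost, where the published methods use probabilistic scores. The function name, the allowed groupings (1-1, 2-1, 1-2), and the `merge_penalty` parameter are assumptions.

```python
def align_by_length(src_lens, tgt_lens, merge_penalty=1.0):
    """Align two sentence sequences using only their lengths.

    src_lens, tgt_lens: sentence lengths (word or character counts).
    Returns a list of (source_indexes, target_indexes) groups.
    """
    inf = float("inf")
    n, m = len(src_lens), len(tgt_lens)
    groupings = [(1, 1), (2, 1), (1, 2)]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0

    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for di, dj in groupings:
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                # Cost of a group: mismatch between the summed lengths.
                diff = abs(sum(src_lens[i:ni]) - sum(tgt_lens[j:nj]))
                penalty = 0.0 if (di, dj) == (1, 1) else merge_penalty
                new_cost = cost[i][j] + diff + penalty
                if new_cost < cost[ni][nj]:
                    cost[ni][nj] = new_cost
                    back[ni][nj] = (i, j)

    if cost[n][m] == inf:
        raise ValueError("no alignment with the allowed groupings")

    # Follow the back-pointers from the end to recover the groups.
    aligned = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        aligned.append((list(range(pi, i)), list(range(pj, j))))
        i, j = pi, pj
    return aligned[::-1]


# Three source sentences against two target sentences (made-up lengths).
groups = align_by_length([5, 4, 6], [5, 10])
# groups == [([0], [0]), ([1, 2], [1])]
```

In the example, the second and third source sentences are merged into one unit because their combined length matches the second target sentence. In practice the two languages rarely have equal sentence lengths, so a real aligner would scale one side by an expected length ratio.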
In the first phase of SMT, a translation table is generated,
which is later used at the time of decoding. This translation
table requires alignment at the level of translation units. A
translation unit contains one or more words of source language
aligned to one or more words of target language. Translation
units are built from automatic word alignment, by combining
the neighboring words [6].
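The way translation units grow out of word alignment can be sketched as follows. This is a simplified version of alignment-consistent phrase extraction, not the exact procedure implemented in Moses: it keeps only the tightest target span for each source span and does not extend spans over unaligned target words. The function name, the `max_len` limit, and the hand-made alignment in the example are assumptions.

```python
def extract_phrase_pairs(src, tgt, alignment, max_len=3):
    """Collect phrase pairs that are consistent with a word alignment.

    alignment: set of (i, j) pairs linking src[i] to tgt[j].
    A pair is consistent when no word inside the target span is
    linked to a word outside the source span.
    """
    pairs = set()
    for s_start in range(len(src)):
        last = min(s_start + max_len, len(src))
        for s_end in range(s_start, last):
            # Target positions linked to this source span.
            linked = [j for (i, j) in alignment if s_start <= i <= s_end]
            if not linked:
                continue
            t_start, t_end = min(linked), max(linked)
            if t_end - t_start + 1 > max_len:
                continue
            # Reject the span if a target word links outside it.
            violates = any(
                t_start <= j <= t_end and not (s_start <= i <= s_end)
                for (i, j) in alignment
            )
            if violates:
                continue
            pairs.add((
                " ".join(src[s_start:s_end + 1]),
                " ".join(tgt[t_start:t_end + 1]),
            ))
    return pairs


# Hypothetical alignment, written by hand for illustration.
english = ["he", "is", "going", "to", "school"]
urdu = ["wo", "skool", "ja", "raha", "hai"]
links = {(0, 0), (1, 4), (2, 2), (2, 3), (4, 1)}
phrases = extract_phrase_pairs(english, urdu, links)
```

Here "going to school" pairs with "skool ja raha" even though the word order differs, while "he is" yields no pair because its linked target words span the whole sentence and exceed `max_len`.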
Later approaches showed improved results by adding
grammatical categories (as in Yamada and Knight [20]) and
other morphological and syntactic information (as in [4] [9]
[10]).
2.2. Resource Requirement
A large collection of sentences aligned with their
corresponding translations is required for an algorithm to
learn translation parameters. Brown et al. [2] used 2 million
parallel sentences of English and French. However, there have
been experiments with scarce-resource language pairs using
modest collections. Nießen and Ney [11] used 5,000 parallel
sentences with the support of an external dictionary. The
problem of sparseness was handled with the help of
morphological analysis and synthesis.
2.3. Hindi-Urdu Language Comparison
Urdu and Hindi are structurally almost the same [17]. So it
seems that English-Hindi SMT may work well for Urdu as well.
But the ground reality is different because of the
dissimilarity in lexicon, orthographic direction, and
case-marker affixation. The following examples are from [18]:
Example 1:
English: scientific kind of a celestial trip for,
planetarium visit (come)
Hindi/Transliteration: vaigyaanika tariike ke eka divya
saira ke lie, taaraamandala aaem
Example 2:
English: players should just play
Hindi/Transliteration: khilaadiyom ko kevala khelanaa
caahie
Example 3:
English: the president of America visited India in June
Hindi/Transliteration: amariikaa ke raashtrapati ne
juuna mem bhaarata kii yaatraa kii
103705-6161 IJET-IJENS © October 2010 IJENS
Sinha [19] has attempted English-Urdu Machine Translation
via Hindi. The following issues are reported: (a) the fluency
of Urdu is badly compromised compared to direct English-Urdu
translation, (b) inappropriate choices of lexical mapping get
multiplied due to the two stages of lexical mapping, (c) the
difference of orthography and the one-to-many mapping of Hindi
consonants to Urdu consonants adversely affect the
transliteration of proper nouns and unknown words.
Translation is not only about structure; it is more about
semantic transfer. Therefore, lexical differences may not be
ignored. However, going from Urdu to Hindi has some
advantages: (a) consonant characters map one-to-one, and
(b) multiple tokens in Urdu (e.g. in the case of
post-positions) may be synthesized using morphological
information (e.g. case) and easily combined to generate Hindi
words (which carry the post-position as part of the preceding
noun).
2.4. Google Translate
Google Translate also provides English-to-Urdu translation.
Its word mapping seems better than its morpho-syntactic
synthesis, viz. grouping of words into phrases, agreement
between a verb and its arguments (e.g. gender agreement), and
distortion. The following example, taken from Google
Translate, shows a gender agreement error: the verb takes the
feminine form “rahi” although the subject “he” is masculine:
English: he is going to school.
Urdu/Transliteration: wo skool ja rahi hai
3. CORPUS DEVELOPMENT
A bilingual parallel corpus is the fundamental resource for
SMT. There is no published work on the development of a
parallel corpus for the selected pair, i.e. English-Urdu, and
authentic parallel data is not available in any public
collection. The most authentic parallel data in this regard
was the English and Urdu translations of the Quran, but it
could not be adopted for religious reasons. The next option
was tafaseer and ahadeeth. Tafaseer are mostly in handwritten
form, especially their Urdu versions. The availability of
ahadeeth in semi-text¹ form then solved the problem of data.
Although the ahadeeth were parallel, each hadeeth may contain
multiple sentences, so sentence-level alignment was performed
manually.
¹ “Semi-text” means that the data had to be converted into
Unicode-based text files.
3.1. Sentence Alignment Issues
A tool was developed to assist in manual alignment. A
screenshot of the tool is shown in Fig. 1. During the
alignment process, it was learnt that the English translation
of a hadeeth generally contains more sentences than the Urdu
translation. So, as a general guideline, each English sentence
was kept as a single unit. However, two or more English
sentences were sometimes taken as a single unit because the
corresponding translation on the Urdu side could not be
separated.
Fig. 1. Sentence Alignment Assistant
3.2. Punctuation and Other Issues
Sometimes punctuation marks were not aligned. For example,
there might be a full stop on the English side but a comma on
the Urdu side, or an exclamation mark on the English side (for
the vocative case) but no punctuation mark on the Urdu side.
Similarly, there were mismatches in colons, bullets,
numbering, and paragraphing. However, there were no encoding
issues because the data was already in Unicode.
3.3. Translation Related Issues
All ahadeeth in the Urdu translation contain the complete
chain of narrators, but the English side contains only the
original narrator who was witness to that hadeeth. Since the
chain of narrators had no parallel translation on the English
side, it was removed from the Urdu side as well. Mostly, the
word “he” (third person singular) on the English side was
translated as “انہوں نے” (third person plural) or “آپﷺ نے”
(second person), as a mark of honor. It was decided to keep
these translations as they are.
4. EXPERIMENT
The data prepared for experimentation consisted of 6,000
parallel sentences (counted on the English side), of which
5,000 were used for training, 800 for tuning, and the
remaining 200 for testing.
The Moses toolkit [7], along with GIZA++ (software for
word/phrase alignment) [14] and mkcls (a utility for making
bilingual word classes) [13], was used for training. The MERT
script [15] was used for tuning, and BLEU [16] was used to
score the test output.
The whole corpus was divided into four partitions for the
purpose of cross-validation [1]. In each run, three of the
partitions are joined to be used as training data, so the SMT
system is run four times. Three-fourths of the remaining
(fourth) partition is used as development data for tuning,
and the remaining one-fourth of that partition is used as
test data.
Thus, in each run, three-fourths of the whole data is used for
training, three-sixteenths for tuning, and the remaining
one-sixteenth for testing.
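The partitioning scheme described above can be sketched in a few lines. This is an illustration under assumptions: contiguous partitions, no shuffling, and a hypothetical function name. The paper does not state how sentences were assigned to partitions.

```python
def four_fold_splits(pairs, folds=4):
    """Yield (train, tune, test) lists for each cross-validation run.

    In each run one partition is held out: its first three-fourths
    become tuning data and its last one-fourth becomes test data.
    """
    n = len(pairs)
    bounds = [k * n // folds for k in range(folds + 1)]
    parts = [pairs[bounds[k]:bounds[k + 1]] for k in range(folds)]
    for held in range(folds):
        train = [p for k, part in enumerate(parts) if k != held for p in part]
        held_out = parts[held]
        cut = len(held_out) * 3 // 4
        yield train, held_out[:cut], held_out[cut:]


# Sixteen dummy sentence pairs make the fractions easy to read.
dummy = [("en%d" % k, "ur%d" % k) for k in range(16)]
sizes = [(len(tr), len(tu), len(te)) for tr, tu, te in four_fold_splits(dummy)]
# Each run uses 12 pairs for training, 3 for tuning and 1 for testing,
# i.e. 3/4, 3/16 and 1/16 of the data.
```

Every sentence pair serves as tuning or test data in exactly one run and as training data in the other three.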
5. RESULT
The results are reported for all four experiments, with and
without tuning. Table I shows the BLEU scores achieved in each
of the four experiments, and their averages, (a) before tuning
and (b) after tuning.
Table I
SMT Accuracies, (a) without tuning, (b) with tuning
Run #      (a) BLEU score      (b) BLEU score
           (before Tuning)     (after Tuning)
1          7.94                8.7
2          10.63               11.05
3          8.13                8.85
4          6.86                7.54
Average    8.39                9.035
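The averages in Table I can be re-derived from the per-run scores. The following small check is not part of the original experiments. The variable names are illustrative, and decimal arithmetic is used so that the reported digits are compared exactly.

```python
from decimal import Decimal

# Per-run BLEU scores as reported in Table I.
before = [Decimal(s) for s in ("7.94", "10.63", "8.13", "6.86")]
after = [Decimal(s) for s in ("8.7", "11.05", "8.85", "7.54")]


def mean(scores):
    """Arithmetic mean of a list of Decimal scores."""
    return sum(scores) / len(scores)


avg_before = mean(before)  # 8.39
avg_after = mean(after)    # 9.035
gains = [a - b for a, b in zip(after, before)]
avg_gain = mean(gains)     # 0.645
```

Tuning raises the score in every run, and the average gain is 0.645 BLEU points.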
5.1. Error Analysis
The major sources of error were the following:
(a) Infrequently occurring proper names
(b) Extra honor level of pronouns
(c) Long-distance dependencies
6. FUTURE WORK
More parallel text is needed for proper learning of
translation parameters. Every word on the Urdu side must be
analyzed for morpho-syntactic support so that translation may
be made free from typological differences, such as gender,
honor, and case.
REFERENCES
[1] (2004) E Alpaydin: Introduction to Machine Learning. The MIT
Press. ISBN: 0262012111.
[2] (1990) P F Brown, J Cocke, S A D Pietra, V J D Pietra, F Jelinek,
J D Lafferty, R L Mercer, P S Roossin: A Statistical Approach to
Machine Translation. Computational Linguistics Volume 16,
Number 2, June 1990.
[3] (1993) P F Brown, S A D Pietra, V J D Pietra, R L Mercer: The
Mathematics of Statistical Machine Translation: Parameter
Estimation. ACL 1993. Computational Linguistics Volume 19,
Issue 2, June 1993, Pages: 263-311.
[4] (2006) N Habash, F Sadat: Arabic Preprocessing Scheme for
Statistical Machine Translation. Proceedings of the Human
Language Technology Conference of the North American Chapter
of the ACL, pages 49-52, New York, June 2006.
[5] (1994) J Hutchins: Research methods and system designs in
machine translation: a ten-year review, 1984-1994. International
conference ‘Machine Translation: ten years on’, Cranfield
University, England, 12-14 November 1994.
[6] (2008) D Jurafsky, J H Martin: Speech and Language Processing.
2nd Edition. May 2008. ISBN-10: 0131873210.
[7] (2007) P Koehn, H Hoang, A Birch, C Callison-Burch, M
Federico, N Bertoldi, B Cowan, W Shen, C Moran, R Zens, C
Dyer, O Bojar, A Constantin, E Herbst: Moses: Open Source
Toolkit for Statistical Machine Translation. ACL Demos.
[8] (2008) A Lopez: Statistical Machine Translation. ACM
Computing Surveys (CSUR), Volume 40, Issue 3, Article No. 8,
August 2008.
[9] (2000) S Nießen, H Ney: Improving SMT Quality with
Morpho-Syntactic Analysis. Proceedings of the 18th conference on
Computational linguistics, Volume 2, July 31-August 04, 2000,
Saarbrücken, Germany, Pages: 1081-1085.
[10] (2001) S Nießen, H Ney: Morpho-Syntactic Analysis for
Reordering in Statistical Machine Translation. In Proceedings of
MT Summit VIII, pages 247-252, Santiago de Compostela, Spain,
September.
[11] (2004) S Nießen, H Ney: Statistical Machine Translation with
Scarce Resources Using Morpho-syntactic Information.
Computational Linguistics, Volume 30, Issue 2 (June 2004),
Pages: 181-204.
[12] (2000) F J Och, H Ney: Improved Statistical Alignment Models.
Proceedings of the 38th Annual Meeting on Association for
Computational Linguistics, p.440-447, October 03-06, 2000,
Hong Kong.
[13] (1999) F J Och: An Efficient Method of Determining Bilingual
Word Classes. EACL’99: Ninth Conference of the European
Chapter of the Association for Computational Linguistics, Bergen,
Norway (1999) Pages 71-76.
[14] (2003) F J Och, H Ney: A Systematic Comparison of Various
Statistical Alignment Models. Computational Linguistics Volume
29, number 1, pp 19-51, March 2003.
[15] (2003) F J Och: Minimum error rate training in statistical
machine translation. Proceedings of the 41st Annual Meeting on
Association for Computational Linguistics, p.160-167, July 07-12,
2003, Sapporo, Japan.
[16] (2002) K Papineni, S Roukos, T Ward, W J Zhu: BLEU: a Method
for Automatic Evaluation of Machine Translation. Proceedings of
the 40th Annual Meeting of the Association for Computational
Linguistics (ACL), Philadelphia, July 2002, pp. 311-318.
[17] (2004) Raghavendra U, T A Faruquie: An English-Hindi
Statistical Machine Translation System. IJCNLP 2004, LNAI
3248, pp. 254-262.
[18] (2008) A Ramanathan, P Bhattacharyya, J Hegde, R M Shah,
Sasikumar M: Simple Syntactic and Morphological Processing
Can Help English-Hindi Statistical Machine Translation. IJCNLP
2008.
[19] (2009) R M K Sinha: Developing English-Urdu Machine
Translation via Hindi. In Third Workshop on Computational
Approaches to Arabic Script based Languages (CAASL3), MT
Summit XII, Ottawa, Canada.
[20] (2001) K Yamada, K Knight: A Syntax-based Statistical
Translation Model. Proceedings of the 39th Annual Meeting on
Association for Computational Linguistics, 523-530, July 06-11,
2001, Toulouse, France.