International Journal of Engineering & Technology IJET-IJENS Vol:10 No:05 31

Development of Parallel Corpus and English to Urdu Statistical Machine Translation

Aasim Ali, Shahid Muhammad Kamran Malik
National University of Computer & Emerging Sciences, Lahore Campus
[email protected], [email protected]
PUCIT, University of the Punjab, Lahore
[email protected]

Abstract-- This paper describes the development of a parallel corpus for statistical machine translation of English text into Urdu. The issues faced during this effort are presented and discussed.

1. INTRODUCTION

Phrase-based Statistical Machine Translation (SMT) has been a focus of machine translation research for decades [5] [8]. The alignment of phrases is computed on the basis of word-to-word alignment [3]. The translated phrases of the target language are then sequenced using an n-gram language model [3]. An open source toolkit, Moses [7], is used for our experiments. The typological distance between the two languages of the selected pair, Urdu and English, makes it challenging to find a useful set of parameters for enhancing translation quality.

Section 2 of this paper gives a brief account of the basics of SMT, resource requirements, a Hindi-Urdu language comparison, and Google Translate. Section 3 describes the issues related to corpus development, viz., sentence alignment, encoding, paragraphing, and punctuation mismatch. Section 4 describes the experiment. Section 5 reports the results, including error analysis. Section 6 notes future directions.

2. LITERATURE REVIEW

Brown et al. [2] practically started the era of SMT. Their work is known as the IBM models of SMT; the complete mathematical formulation is given in [3], where the models are named IBM Model 1 to IBM Model 5.

2.1. SMT Basics

SMT requires a parallel corpus, in which each sentence of the source language is aligned with its translation in the target language.
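The word-to-word alignment that underlies these models can be illustrated with a minimal sketch of IBM Model 1 training by expectation-maximisation. This is not the configuration used in the paper: the three-sentence romanised Urdu corpus, the function names `train_ibm_model1` and `best_translation`, and the omission of the NULL source word are all assumptions made for illustration.

```python
from collections import defaultdict


def train_ibm_model1(pairs, iterations=20):
    """Estimate t(urdu_word | english_word) with EM, as in IBM Model 1.

    `pairs` is a list of (english_tokens, urdu_tokens) sentence pairs.
    Simplification (assumption): no NULL source word is modelled.
    """
    # Uniform start: any constant works because the E-step normalises.
    t = defaultdict(lambda: 1.0)
    for _ in range(iterations):
        count = defaultdict(float)  # expected count of (urdu, english) links
        total = defaultdict(float)  # expected count of each English word
        # E-step: distribute each Urdu word over the English words of its pair.
        for english, urdu in pairs:
            for u in urdu:
                norm = sum(t[(u, e)] for e in english)
                for e in english:
                    frac = t[(u, e)] / norm
                    count[(u, e)] += frac
                    total[e] += frac
        # M-step: renormalise expected counts into probabilities.
        t = defaultdict(
            float, {(u, e): c / total[e] for (u, e), c in count.items()}
        )
    return t


def best_translation(t, english_word):
    """Return the Urdu word with the highest t(urdu | english_word)."""
    candidates = {u: p for (u, e), p in t.items() if e == english_word}
    return max(candidates, key=candidates.get)


# Toy, hypothetical corpus in romanised Urdu (not taken from the paper's data).
toy_corpus = [
    ("this book".split(), "yeh kitab".split()),
    ("this house".split(), "yeh ghar".split()),
    ("a book".split(), "ek kitab".split()),
]
table = train_ibm_model1(toy_corpus)
for word in ["this", "book", "house", "a"]:
    print(word, "->", best_translation(table, word))
```

Even on three sentence pairs, the co-occurrence pattern is enough for EM to separate "yeh" from "kitab"; a real system such as GIZA++ applies the same principle, with further refinements, to a much larger corpus.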
A practically successful approach to automatic sentence alignment was demonstrated by Brown et al. [2], who took a parallel corpus and found the sentence alignments algorithmically, on the basis of word count. Och and Ney [12] introduced the idea of a character-count-based mechanism for sentence alignment.

In the first phase of SMT, a translation table is generated, which is later used at decoding time. This translation table requires alignment at the level of translation units. A translation unit contains one or more words of the source language aligned to one or more words of the target language. Translation units are built from the automatic word alignment by combining neighboring words [6]. Later approaches showed improved results by adding grammatical category (as in Yamada and Knight [20]) and other morphological and syntactic information (as in [4] [9] [10]).

2.2. Resource Requirement

A large collection of sentences aligned with their corresponding translations is required for an algorithm to learn translation parameters. Brown et al. [2] used 2 million parallel sentences of English and French. However, there have been experiments with resource-scarce language pairs using modest collections. Nießen and Ney [11] used 5,000 parallel sentences with the support of an external dictionary. The problem of sparseness was handled with the help of morphological analysis and synthesis.

2.3. Hindi-Urdu Language Comparison

Urdu and Hindi are structurally almost the same [17]. So it seems that English-Hindi SMT may work well for Urdu as well. But the ground reality is different because of the dissimilarity in lexicon, orthographic direction, and case-marker affixation.
Following are examples from [18]:

Example 1:
English: scientific kind of a celestial trip for, planetarium visit (come)
Hindi/Transliteration: vaigyaanika tariike ke eka divya saira ke lie, taaraamandala aaem

Example 2:
English: players should just play
Hindi/Transliteration: khilaadiyom ko kevala khelanaa caahie

Example 3:
English: the president of America visited India in June
Hindi/Transliteration: amariikaa ke raashtrapati ne juuna mem bhaarata kii yaatraa kii

103705-6161 IJET-IJENS © October 2010 IJENS

Sinha [19] has attempted English-Urdu machine translation via Hindi. The following issues are reported: (a) the fluency of the Urdu output is severely compromised compared to direct English-Urdu translation, (b) inappropriate lexical choices are multiplied by the two stages of lexical mapping, (c) the difference in orthography and the one-to-many mapping of Hindi consonants to Urdu consonants adversely affect the transliteration of proper nouns and unknown words.

Translation is not only about structure; it is more about semantic transfer. Therefore, lexical differences may not be ignored. However, going from Urdu to Hindi has some advantages because of: (a) one-to-one mapping of consonant characters, (b) multiple tokens in Urdu (e.g. in the case of post-positions) may be synthesized using morphological information (e.g. case), and can easily be combined to generate Hindi words (which carry the post-position as part of the preceding noun).

2.4. Google Translate

Google Translate also provides English-to-Urdu translation. Its word mapping seems better than its morpho-syntactic synthesis, viz. grouping of words into phrases, agreement between a verb and its arguments (e.g. gender agreement), and distortion. Following is an example taken from Google Translate, in which the masculine subject "he" is paired with the feminine verb form "rahi":

English: he is going to school.
Urdu/Transliteration: wo skool ja rahi hai

3.
CORPUS DEVELOPMENT

A bilingual parallel corpus is the fundamental resource for SMT. There is no published work on the development of a parallel corpus for the selected pair, i.e. English-Urdu. Authentic parallel data is not available in any public collection. The most authentic parallel data in this regard was the English and Urdu translation of the Quran. It could not be adopted for religious reasons. The next option was tafaseer and ahadeeth. Tafaseer are mostly in handwritten form, especially their Urdu versions. The availability of ahadeeth in semi-text form then solved the problem of data (here "semi-text" means that the data still had to be converted into Unicode-based text files). Although the ahadeeth were parallel, each hadeeth may contain multiple sentences, so sentence-level alignment was performed manually.

3.1. Sentence Alignment Issues

A tool was developed to assist in manual alignment. A screenshot of the tool is shown in Fig. 1. During the alignment process, it was learnt that the English translation of a hadeeth generally contains more sentences than the Urdu translation. So, as a general guideline, a sentence on the English side was kept as a single unit. However, sometimes two or more English sentences were taken as a single unit because the corresponding translation on the Urdu side could not be separated.

Fig. 1. Sentence Alignment Assistant

3.2. Punctuation and Other Issues

Sometimes punctuation marks were not aligned. For example, there would be a full stop on the English side but a comma on the Urdu side. There would be an exclamation mark on the English side, for the vocative case, but no punctuation mark on the Urdu side. Similarly, there were mismatches of colons, bullets, numbering, and paragraphing. However, there were no encoding issues because the data was already in Unicode.

3.3. Translation Related Issues

All ahadeeth in the Urdu translation contain the complete chain of narrators, but the English side contains only the original narrator who was witness to that hadeeth. The chain of narrators therefore had no parallel translation on the English side, so it was removed from the Urdu side as well. Mostly, the word "he" (third person singular) on the English side was translated as "انہوں نے" (third person plural) and "آپﷺ نے" (second person), due to honor. It was decided to keep these as they are.

4. EXPERIMENT

The data prepared for experimentation was 6,000 parallel sentences (counting from the English side), of which 5,000 were used for training, 800 for tuning, and the remaining 200 for testing. The Moses toolkit [7], along with GIZA++ (software for word/phrase alignment) [14] and mkcls (a utility for making bilingual word classes) [13], was used for training. The MERT script [15] was used for tuning, and BLEU [16] was used for testing.

The whole corpus was divided into four partitions for the purpose of cross-validation [1]. In each run, three partitions are joined to be used as training data, so the SMT system is run four times. Three-fourths of the remaining (4th) partition is used as development data, for tuning. The remaining one-fourth of the same (4th) partition is used as test data. Thus, in each run, three-fourths of the whole data is used for training, three-sixteenths for tuning, and the remaining one-sixteenth for testing.

5. RESULT

The results are reported for all four experiments, with and without tuning. Table I shows the BLEU scores achieved in all four experiments, and their average, (a) when tuning is not performed and (b) after tuning.

Table I
SMT Accuracies, (a) without tuning, (b) with tuning

Run #     (a) BLEU score (before Tuning)   (b) BLEU score (after Tuning)
1         7.94                             8.7
2         10.63                            11.05
3         8.13                             8.85
4         6.86                             7.54
Average   8.39                             9.035

5.1.
Error Analysis

The major causes of errors were: (a) infrequently occurring proper names, (b) the extra honor level of pronouns, (c) long-distance dependencies.

6. FUTURE WORK

More parallel text is needed for appropriate learning of translation parameters. Every word on the Urdu side must be analyzed for morpho-syntactic support so that translation may be made free from typological differences such as gender, honor, and case.

REFERENCES

[1] (2004) E Alpaydin: Introduction to Machine Learning. The MIT Press. ISBN: 0262012111.
[2] (1990) P F Brown, J Cocke, S A D Pietra, V J D Pietra, F Jelinek, J D Lafferty, R L Mercer, P S Roossin: A Statistical Approach to Machine Translation. Computational Linguistics, Volume 16, Number 2, June 1990.
[3] (1993) P F Brown, S A D Pietra, V J D Pietra, R L Mercer: The Mathematics of Statistical Machine Translation: Parameter Estimation. ACL 1993. Computational Linguistics, Volume 19, Issue 2, June 1993, Pages 263-311.
[4] (2006) N Habash, F Sadat: Arabic Preprocessing Scheme for Statistical Machine Translation. Proceedings of the Human Language Technology Conference of the North American Chapter of the ACL, pages 49-52, New York, June 2006.
[5] (1994) J Hutchins: Research methods and system designs in machine translation: a ten-year review, 1984-1994. International conference 'Machine Translation: Ten Years On', Cranfield University, England, 12-14 November 1994.
[6] (2008) D Jurafsky, J H Martin: Speech and Language Processing. 2nd Edition. May 2008. ISBN-10: 0131873210.
[7] (2007) P Koehn, H Hoang, A Birch, C Callison-Burch, M Federico, N Bertoldi, B Cowan, W Shen, C Moran, R Zens, C Dyer, O Bojar, A Constantin, E Herbst: Moses: Open Source Toolkit for Statistical Machine Translation. ACL Demos.
[8] (2008) A Lopez: Statistical Machine Translation. ACM Computing Surveys (CSUR), Volume 40, Issue 3, Article No. 8, August 2008.
[9] (2000) S Nießen, H Ney: Improving SMT Quality with Morpho-Syntactic Analysis. Proceedings of the 18th Conference on Computational Linguistics, Volume 2, July 31-August 04, 2000, Saarbrücken, Germany, Pages 1081-1085.
[10] (2001) S Nießen, H Ney: Morpho-Syntactic Analysis for Reordering in Statistical Machine Translation. In Proceedings of MT Summit VIII, pages 247-252, Santiago de Compostela, Spain, September.
[11] (2004) S Nießen, H Ney: Statistical Machine Translation with Scarce Resources Using Morpho-syntactic Information. Computational Linguistics, Volume 30, Issue 2 (June 2004), Pages 181-204.
[12] (2000) F J Och, H Ney: Improved Statistical Alignment Models. Proceedings of the 38th Annual Meeting on Association for Computational Linguistics, pp. 440-447, October 03-06, 2000, Hong Kong.
[13] (1999) F J Och: An Efficient Method of Determining Bilingual Word Classes. EACL'99: Ninth Conference of the European Chapter of the Association for Computational Linguistics, Bergen, Norway (1999), Pages 71-76.
[14] (2003) F J Och, H Ney: A Systematic Comparison of Various Statistical Alignment Models. Computational Linguistics, Volume 29, Number 1, pp. 19-51, March 2003.
[15] (2003) F J Och: Minimum error rate training in statistical machine translation. Proceedings of the 41st Annual Meeting on Association for Computational Linguistics, pp. 160-167, July 07-12, 2003, Sapporo, Japan.
[16] (2002) K Papineni, S Roukos, T Ward, W J Zhu: BLEU: a Method for Automatic Evaluation of Machine Translation. Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, July 2002, pp. 311-318.
[17] (2004) Raghavendra U, T A Faruquie: An English-Hindi Statistical Machine Translation System. IJCNLP 2004, LNAI 3248, pp. 254-262.
[18] (2008) A Ramanathan, P Bhattacharyya, J Hegde, R M Shah, Sasikumar M: Simple Syntactic and Morphological Processing Can Help English-Hindi Statistical Machine Translation. IJCNLP 2008.
[19] (2009) R M K Sinha: Developing English-Urdu Machine Translation via Hindi. In Third Workshop on Computational Approaches to Arabic Script based Languages (CAASL3), MT Summit XII, Ottawa, Canada.
[20] (2001) K Yamada, K Knight: A Syntax-based Statistical Translation Model. Proceedings of the 39th Annual Meeting on Association for Computational Linguistics, pp. 523-530, July 06-11, 2001, Toulouse, France.
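As a supplementary illustration of the four-fold partitioning described in Section 4, the following minimal sketch shows how each run could draw its training, tuning, and test sets. The function name `cross_validation_runs`, the use of contiguous (unshuffled) partitions, and the 16-pair toy corpus are assumptions; the paper does not state how the partitions were drawn.

```python
def cross_validation_runs(sentence_pairs, folds=4):
    """Yield (train, dev, test) for each of the `folds` runs.

    Assumption: partitions are contiguous slices of the corpus in its
    original order; the paper does not specify this detail.
    """
    n = len(sentence_pairs)
    bounds = [i * n // folds for i in range(folds + 1)]
    parts = [sentence_pairs[bounds[i]:bounds[i + 1]] for i in range(folds)]
    for held_out in range(folds):
        # Three partitions are joined to form the training data.
        train = [
            pair
            for i, part in enumerate(parts)
            if i != held_out
            for pair in part
        ]
        # The held-out partition is split 3:1 into tuning and test data.
        rest = parts[held_out]
        cut = (3 * len(rest)) // 4
        yield train, rest[:cut], rest[cut:]


# Hypothetical placeholder corpus of 16 sentence pairs.
corpus = [(f"en{i}", f"ur{i}") for i in range(16)]
for run, (train, dev, test) in enumerate(cross_validation_runs(corpus), 1):
    print(run, len(train), len(dev), len(test))
```

With 16 pairs, every run uses 12 for training, 3 for tuning, and 1 for testing, which matches the three-fourths, three-sixteenths, and one-sixteenth proportions given in Section 4.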