Available online at www.sciencedirect.com
ScienceDirect
Computer Speech and Language 32 (2015) 91–108

Translating noun compounds using semantic relations☆

Renu Balyan ∗, Niladri Chatterjee 1
Department of Mathematics, IIT Delhi, Hauz Khas, New Delhi, India

Received 20 March 2014; received in revised form 13 September 2014; accepted 22 September 2014
Available online 2 October 2014

Abstract

Despite a research history of more than 20 years, English to Hindi machine translation often suffers badly from incorrect translations of noun compounds. The problems are of various types, such as the absence of proper postpositions, inappropriate word order and incorrect semantics. Existing English to Hindi machine translation systems show this vulnerability irrespective of the underlying technique. A potential solution to this problem lies in understanding the semantics of the noun compounds. The present paper proposes a scheme based on semantic relations to address this issue. The scheme works in three steps: identification of the noun compounds in a given text, determination of the semantic relationship(s) between their constituent nouns, and finally, selection of the right translation pattern. The scheme first provides translation patterns for different semantic relations for 2-word noun compounds. These patterns are used recursively to find the semantic relations and the translation patterns for 3-word and 4-word noun compounds. Frequency and probability based adjacency and dependency models are used for bracketing (grouping) the constituent words of 3-word and 4-word noun compounds into 2-word noun compounds. The semantic relations and the translation patterns generated for 2-word, 3-word and 4-word noun compounds are evaluated. The proposed scheme is compared with some well-known English to Hindi translators, viz. AnglaMT, Anuvadaksh, Bing and Google, and also with the Moses baseline system.
The results obtained show significant improvement over the Moses baseline system. The scheme also performs better than the other online MT systems in terms of recall and precision.
© 2014 Elsevier Ltd. All rights reserved.

Keywords: Noun compounds; Semantic relation; Translation pattern; Bracketing; Machine translation

☆ This paper has been recommended for acceptance by Prof. R. K. Moore.
∗ Corresponding author. Tel.: +91 9891830407.
E-mail addresses: [email protected] (R. Balyan), [email protected] (N. Chatterjee).
1 Tel: +91 1126591490.
http://dx.doi.org/10.1016/j.csl.2014.09.007
0885-2308/© 2014 Elsevier Ltd. All rights reserved.

1. Introduction

A compound noun is a noun that is made up of two or more words, such as noun + noun (e.g. water tank, football), adjective + noun (e.g. full moon, blackboard), verb + noun (e.g. washing machine, swimming pool), noun + verb (e.g. haircut, rainfall) and preposition + noun (e.g. underworld). Single (1-word) compound nouns, such as football, blackboard and rainfall, are mostly part of the lexicon, and hence their translation is obtained from the lexicon. In this paper we consider compound nouns that are sequences of nouns occurring as separate words. Henceforth, we call these compound nouns noun compounds (NCs). Noun compounds can be freely constructed in English (Jones, 1983), and sometimes these can be quite long. For illustration, consider NCs such as “colon cancer tumor suppressor protein” and “water meter cover adjustment screw”, which are made of 5 words. Noun compounds form an important and substantial part of an English corpus. For illustration, the written portion of the British National Corpus (BNC), consisting of 84M words (Burnard, 2000), has 2.6% bigram nominal compounds, and the Reuters corpus, consisting of 108M words (Rose et al., 2002), has 3.9% (Kim and Baldwin, 2005).
It is also estimated that 88.4% of noun compounds in the Wall Street Journal section of the Penn Treebank are binary, i.e. 2-word (Baldwin and Tanaka, 2004). As a consequence, any machine translation (MT) system needs to handle English noun compounds with particular care when translating them into a target language. This is particularly true for a language like Hindi, where the nouns of an English noun compound are often separated by postpositions after translation, and inappropriate use of these postpositions may render the translation syntactically and/or semantically incorrect. This can be observed by analyzing the outputs of different English to Hindi machine translation systems available online. The present study, in particular, is based on five such systems, viz. the AnglaMT², Anuvadaksh³, Bing⁴, Google⁵ and MaTra⁶ translators. Our analysis finds that several different types of problems may be encountered while translating English noun compounds into Hindi. Some major problems in this regard are:

• Omission of appropriate postpositions: In certain cases simple juxtaposition of nouns does not convey the semantics in Hindi. The nouns need to be separated by appropriate postpositions. Omitted or incorrect postpositions in the translation of a noun compound may result in erroneous semantics. For example: “mustard oil” ∼ sarson (mustard) kaa (of) tel (oil); “body ache” ∼ shareer (body) mein (in) dard (ache). The postpositions that convey the correct sense in these two cases are kaa and mein respectively.
• Use of a single word: Often the Hindi translation of an English noun compound comprising two or more nouns is a single word. Any attempt to translate the nouns separately with a separator word (such as a postposition) is erroneous. For illustration, “boys’ hostel” ∼ chhaatraavaas, “cow dung” ∼ gobar, “blood pressure” ∼ raktchaap and “wine bar” ∼ madhushaalaa.
In some cases the translated Hindi word is a completely new word that is in no way related to the constituent nouns of the noun compound, whereas in other cases a single word is formed by concatenating or combining the translations of the constituent nouns.
• Bracketing (sequential grouping): As the number of nouns in a noun compound increases, the number of possible groupings also increases. For illustration, the 3-word NCs “olympic gold medal” and “rolled gold medal” have very similar surface structure, but their nouns are grouped differently because their semantic structures are different:
  - olympic gold medal → (olympic (gold medal)).
  - rolled gold medal → ((rolled gold) medal).
For a 3-word NC “N1 N2 N3” there are two possible ways of grouping the nouns, viz. (N1 (N2 N3)) and ((N1 N2) N3), while for a 4-word NC there are 6 possibilities. In a similar vein, there exist 13 different ways of grouping a 5-word NC.

According to Nakov (2013), longer noun compounds (i.e. with five or more terms) are rarely dealt with, except in scientific texts and technical literature. Thus, in this work we restrict ourselves to noun compounds with a maximum length of 4. However, appropriate grouping of the nouns of a noun compound while preserving its semantics still poses a big challenge for any English to Hindi machine translation system.

The present work focuses on the development of a scheme for efficient handling of the above issues in English to Hindi machine translation. The necessity also arises from the fact that the above shortcomings are not due to the underlying translation paradigm; rather, the problem is inherent in the language characteristics. For elucidation, the systems considered in this work follow different translation schemes. While AnglaMT (Sinha et al., 1995) is a rule-based system, Google and Bing use statistical approaches, and Anuvadaksh and MaTra (Ananthakrishanan et al., 2006) are hybrid MT systems; yet each of the systems suffers from the above problems. Consequently, interpretation of noun compounds assumes a very important role in English to Hindi machine translation.

² http://www.tdil-dc.in/.
³ http://www.tdil-dc.in/.
⁴ http://www.bing.com/translator/.
⁵ http://translate.google.com/.
⁶ http://www.cdacmumbai.in/matra/.

In this work we develop a scheme based on semantic relations for interpretation of the noun compounds. The intuition behind the proposed scheme is that an appropriate set of rules can be generated in a supervised way if the right semantic relation can be established between the nouns of a noun compound. Here the term “semantic relation” implies the underlying relation between the head noun and the other nouns of the noun compound. A noun compound can be properly interpreted by discovering the semantic relations between its constituent nouns. Most related research has proposed semantic labels based on the theory that a compound noun expresses one of a small number of covert semantic relations (Barker and Szpakowicz, 1998). Levi (1978) offers nine semantic labels representing underlying predicates deleted during compound formation. Warren (1978) describes a multi-level system of semantic labels for noun-noun relationships. We have used some of these semantic labels for this work.

The paper is organized as follows: Section 2 looks at some of the related works in this domain. Section 3 discusses the proposed scheme. It also discusses noun compound identification and the semantic relations. Section 4 describes the algorithms for semantic relation identification and translation pattern generation for a 2-word noun compound. Section 5 deals with handling of the 3-word and 4-word noun compounds. The experimental setup and results are given in Section 6. Section 7 concludes the paper.

2. Related work

Noun compound interpretation has been studied in the context of various applications including question-answering and machine translation (Baldwin and Tanaka, 2004; Cao and Li, 2002; Lauer, 1995; Moldovan et al., 2004). Works on automatic/semi-automatic interpretation of NCs by Lapata (2002), Rosario and Hearst (2001), Moldovan et al. (2004) and Kim and Baldwin (2005) either made assumptions about the scope of semantic relations, or restricted the domain of interpretation. However, the methods using verbs and verb semantics for interpreting noun compounds avoid any such assumption, and outperform the above methods (Kim and Baldwin, 2006; Nakov, 2008; Nakov and Hearst, 2006). The work of Kim and Baldwin (2013a) is based on lexical similarity with tagged noun compounds, where the lexical similarity measures are derived from WordNet. Kim and Baldwin (2013b) investigate word sense distributions in noun compounds. They disambiguate the word sense of the component words of a noun compound by investigating the “semantic collocation” between them. On the other hand, Nakov (2013) uses syntax and semantics for noun compound interpretation.

One of the earlier works, by Rackow et al. (1992), used a combination of linguistic rules and statistical data for resolving ambiguities in German to English compound translations. Maalej (1994) studied English to Arabic machine translation of noun compounds. Shahzad et al. (1999) used non-aligned corpora for identifying translations of noun compounds, which is an extension of the work carried out by Tanaka and Yoshihiro (1999). A feasibility study of shallow parsing methods for translating Japanese and English noun compounds is presented in Tanaka and Baldwin (2003a). Tanaka and Baldwin (2003b) used Japanese-English as a test case for a compositional translation method which makes use of a word level translation lexicon and monolingual corpus data.
Baldwin and Tanaka (2004) proposed a support vector learning based method employing a target language corpus and bilingual dictionary data for English to Japanese translation and vice versa. The work of Tanaka and Baldwin was further extended by Bungum and Oepen (2009) for translating Norwegian nominal compounds into English, using different ranking strategies for obtaining high quality translations.

However, it is observed that most of the works carried out so far have focused on 2-word noun compounds; longer (3-word, 4-word) noun compounds are rarely studied. Further, this problem has hardly been studied for the Indian languages. The only work on translation of nominal compounds from English to Hindi is discussed by Paul et al. (2010) and Mathur and Paul (2009). Mathur and Paul follow a template based corpus search approach as used by Bungum and Oepen (2009) and Baldwin and Tanaka (2004). However, their system, unlike others, attempts to select the correct sense of the nominal components by running the word sense disambiguation system of Patwardhan et al. (2005) on the source language. Paul et al. (2010) have used only the eight prepositions given by Lauer (1995) for paraphrasing the noun compounds, and translate them using a one-to-one mapping from the English prepositions to Hindi postpositions. However, this approach is not very robust, as some English prepositions may have more than one Hindi translation, and the correct one can be determined only from the context. Hence in this study we have used “verb + preposition(s)” for paraphrasing the noun compounds instead of using only the prepositional paraphrases. We have developed a scheme for handling 2-word noun compounds, and extended the idea to 3-word and 4-word English noun compounds as well.

[Fig. 1: flowchart with the components Tagger, NC Extractor, Paraphrase Generator, Web Search, Verb Extractor, SR(s) Finder and TP(s) Finder; 3-word and 4-word NCs are first reduced to 2-word NC groups (N1N2) using bracketing.]
Fig. 1. Proposed scheme for handling translation of noun compounds (NC: noun compound, SR: semantic relation and TP: translation pattern).

3. Noun compound identification and semantic relations

The proposed approach solves the problem of translating noun compounds in three steps:

• Identification of a noun compound: a tagger is used for this, and the part-of-speech information given by the tagger is used for identifying a noun compound.
• Interpretation of the semantic relation between the nouns in a noun compound: the semantic relations are identified by paraphrasing the noun compounds using verbs and preposition(s). A set of “seed verbs” has been identified for representing each semantic relation.
• Generation of the translation pattern(s) for a noun compound: the translation patterns are generated based on the semantic relation occurring between the nouns of the noun compound.

Fig. 1 provides a schematic representation of the proposed scheme.

We have used a semi-supervised approach for noun compound identification. Here, we have used the Stanford tagger (Toutanova et al., 2003) for tagging a document. Any consecutive sequence of words tagged as noun (NN or NNS) by the tagger is considered a noun compound. A corpus comprising 15,000 English sentences has been used for this purpose. The corpus was cleaned, split into sentences using a sentence boundary program, and tagged using the Stanford tagger.
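The identification rule stated above (any consecutive run of words tagged NN or NNS is a noun compound) can be sketched as follows. This is only a minimal illustration, not the authors' implementation: the function name `extract_noun_compounds` and the hand-tagged toy sentence are hypothetical, the tagger itself is not called, and keeping only runs of two or more nouns is our reading of "sequence of nouns".

```python
from typing import List, Tuple

NOUN_TAGS = {"NN", "NNS"}  # the two tags treated as nouns in the text


def extract_noun_compounds(tagged: List[Tuple[str, str]]) -> List[List[str]]:
    """Return every maximal run of two or more consecutive NN/NNS tokens."""
    compounds: List[List[str]] = []
    run: List[str] = []
    for word, tag in tagged:
        if tag in NOUN_TAGS:
            run.append(word)
        else:
            # A non-noun token closes the current run (assumption: length >= 2).
            if len(run) >= 2:
                compounds.append(run)
            run = []
    if len(run) >= 2:  # a run may end at the end of the sentence
        compounds.append(run)
    return compounds


# Hand-tagged toy sentence (hypothetical; a real pipeline would use tagger output).
sentence = [("The", "DT"), ("water", "NN"), ("meter", "NN"), ("cover", "NN"),
            ("was", "VBD"), ("replaced", "VBN"), ("near", "IN"),
            ("the", "DT"), ("bus", "NN"), ("stop", "NN"), (".", ".")]
print(extract_noun_compounds(sentence))
```

On the toy sentence the sketch yields one 3-word and one 2-word candidate; tagging errors would propagate directly into the candidate list, which is why the incorrect identifications are counted separately in the corpus statistics.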
The procedure resulted in 22,074 occurrences of noun compounds in total, comprising 16,606 different noun compounds. Of these, 14,015 noun compounds occurred only once in the corpus. However, 125 noun compounds were found to have been incorrectly identified as NCs. Table 1 summarizes the findings. The small proportion of incorrect noun compounds in comparison with the correctly identified ones indicates that one can depend on the tagger's output for noun compound identification.

Table 1
Statistics of noun compounds in the corpus.
# Sentences | 2-word NCs | 3-word NCs | 4-word NCs | NCs > 4-word | Incorrect NCs
15,000 | 13,067 | 2,815 | 571 | 153 | 125

Once the noun compounds are identified, we aim at establishing the semantic relation (SR) between the constituent words. However, there has been little agreement among researchers on the kind of relations that can hold between the two nouns in a noun compound. This is evident from the different sets of relations that have been used in the literature. For illustration:

• Abstract relations, such as Agent, Location and Instrument, suggested in Barker and Szpakowicz (1998), Finin (1980), Girju et al. (2005), Kim and Baldwin (2005, 2008), Moldovan et al. (2004), Rosario and Hearst (2001).
• Generalized prepositions, such as Of, For and In, proposed by Lauer (1995).
• Recoverably deletable predicates (RDPs) to interpret semantic relations in the NCs (Levi, 1978). Semantic relations such as Have, Make, Be, From, For and In are examples of this kind.

For the present work we have used a set of 20 semantic relations which are taken from the existing literature. These are Agent, Beneficiary, Cause, Container, Content, Equative, Instrument, Location, Material, Possessor, Product, Purpose, Result, Source, Time, Topic, Experiencer, Specialization, Attribute-Transfer and Use. We excluded some of the semantic relations found in literature (e.g.
Extent, Probability, Frequency, Influence, Synonymy, Possibility) due to their lack of instances. Another important semantic relation, viz. Property, has also been ignored, as it is satisfied primarily by “adjective + noun” or “proper noun + common noun” pairs, for example “blue car” and “Delhi city”.

Section 4 describes the scheme used for semantic relation identification and translation pattern generation for the 2-word noun compounds. Section 5 discusses the bracketing issues, and how this 2-word scheme can be used recursively for generating translation patterns for 3-word and 4-word noun compounds.

4. Semantic relation identification and translation patterns for 2-word NCs

In order to find the translation pattern for a 2-word noun compound we first need to find the semantic relation between the two nouns of the noun compound.

4.1. Semantic relation identification

As a semantic relation can be represented by, and is dominated by, a set of verbs (Nakov and Hearst, 2006; Nakov, 2008), the proposed scheme tries to uncover the relationship between the two nouns by rewriting or paraphrasing the noun compound as a phrase that contains a verb and one or more preposition(s). For illustration, the noun compound “family car” can be represented by the following paraphrases: “car owned by family”, “car possessed by family”, “car belonging to family”. The verbs ‘own’, ‘possess’ and ‘belong’ along with the prepositions ‘by’ and ‘to’ provide evidence for the presence of the semantic relation Possessor. Similarly, the noun compound “olive oil” can be represented by the following paraphrases: “oil obtained from olive”, “oil made from olive”, “oil coming from olive”. These constructs indicate the presence of the semantic relation Material between the two nouns. Thus, for each semantic relation a group of verbs, called seed verbs, is used. These seed verbs have been taken from Nakov and Hearst (2006), and the shared Task⁷ data 2008.
Each semantic relation is represented by a group of verbs and a verb assigned to a semantic relation may belong to multiple semantic relations. A set of 728 seed verbs and 30 prepositions have been identified for the purpose of paraphrasing. Table 2 presents some examples of the seed verbs assigned to the semantic relations.

⁷ http://multiword.sourceforge.net/.

Table 2
Seed verbs associated with the semantic relations.
S. No. | Semantic relations | Seed verbs with suitable prepositions
1 | Cause | Cause, promote, lead to, result in, generate, create, carry, spread, transmit, bring, infect, responsible for, give, pass
2 | Experiencer | Spread, acquire, suffer from, die of, develop, contract, catch, diagnosed of, have, beat, infected by, survive from, get, pass, fall, transmit, avoid
3 | Possessor | Own, owned by, possess, possessed by, have, belong to, related to, borrow, take, grant, request
4 | Product | Produce, make, manufacture, build, assemble, create
5 | Time | Arrive in, leave at, conducted in, occur in, happen during, experience in
6 | Material | Made of, made from, contain, originate from, composed of, produced from
7 | Source | Come from, caused by, induced by, relate to, arise from, result from, generated by
8 | Purpose | Cure, relieve, treat, help with, reduce, heal, prevent, prescribed for, block, control, end, intended for
9 | Container | Contained in, created in, built in, built for, provided in, experienced in, included by
10 | Location | Live in, work on, come from, work in, reside in, located in, bred in, kept in, made in, born from

[Fig. 2: the Paraphrase Generator takes the noun compound (NC) N1N2 “anthrax death” together with the verbs and prepositions, and outputs paraphrase candidates such as “death act as anthrax”, “death aid as anthrax”, “death arise from anthrax”, “death caused by anthrax”, “death caused from anthrax”, “death consist of anthrax”, “death result from anthrax”, “death supported by anthrax”.]
Fig. 2. Paraphrase generation.
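The paraphrase generation of Fig. 2, followed by ranking the paraphrases by frequency and letting the seed verbs of the top-ranked ones vote for a relation (the procedure formalized in Algorithm 1), can be sketched as below. This is a sketch under stated assumptions rather than the authors' code: `SEED_VERBS` is a small subset of Table 2, `TOY_FREQ` holds invented counts standing in for search engine and Netspeak frequencies, the function names are hypothetical, and ignoring paraphrases with zero frequency is our own simplification.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple

# Small subset of the Table 2 seed verbs (the full resource has 728 verbs).
SEED_VERBS: Dict[str, Set[str]] = {
    "Source": {"caused by", "arise from", "result from"},
    "Material": {"made of", "made from", "composed of"},
    "Possessor": {"owned by", "possessed by", "belong to"},
}

# Invented counts standing in for web frequencies (not real data).
TOY_FREQ: Dict[str, int] = {
    "death caused by anthrax": 900,
    "death result from anthrax": 400,
    "death made from anthrax": 3,
}


def generate_paraphrases(n1: str, n2: str,
                         verb_phrases: List[str]) -> List[Tuple[str, str]]:
    """Rewrite the compound 'N1 N2' as 'N2 <verb phrase> N1' for each verb phrase."""
    return [(vp, f"{n2} {vp} {n1}") for vp in verb_phrases]


def pick_relations(n1: str, n2: str, seeds: Dict[str, Set[str]],
                   freq: Dict[str, int], top_k: int = 15) -> List[str]:
    """Return the relation(s) whose seed verbs occur most among the top paraphrases."""
    verb_phrases = sorted({vp for vps in seeds.values() for vp in vps})
    candidates = generate_paraphrases(n1, n2, verb_phrases)
    # Rank paraphrases by frequency; unseen paraphrases count as zero.
    ranked = sorted(candidates, key=lambda c: freq.get(c[1], 0), reverse=True)
    # Assumption: paraphrases that were never observed are not allowed to vote.
    top = [vp for vp, phrase in ranked[:top_k] if freq.get(phrase, 0) > 0]
    votes = Counter(rel for vp in top for rel, vps in seeds.items() if vp in vps)
    if not votes:
        return []
    best = max(votes.values())
    return sorted(rel for rel, count in votes.items() if count == best)


print(pick_relations("anthrax", "death", SEED_VERBS, TOY_FREQ))
```

Because a verb may belong to several relations, the vote can end in a tie; the sketch then returns all tied relations, which mirrors the "semantic relation(s)" wording used for the output of the procedure.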
For a 2-word noun compound the paraphrases are formed using these seed verbs and prepositions. Paraphrase generation for a noun compound, viz. anthrax death, is shown in Fig. 2. For identifying the semantic relation between the nouns of a noun compound, paraphrases are generated and the web frequency of these paraphrases is found using search engines and the Netspeak⁸ web service (Potthast et al., 2010). The top 15 paraphrases in terms of web frequencies are identified, and the verb parts of these paraphrases are extracted. The semantic relation whose seed verb group contains the maximum number of these extracted verbs is selected, as this group of verbs best represents the relation existing between the two nouns of the noun compound. The algorithm for semantic relation identification between the two nouns in a 2-word noun compound is described in Algorithm 1.

⁸ http://www.netspeak.eu/.

Algorithm 1 (Semantic relation identification for a 2-word English noun compound).
Input: A 2-word English noun compound, seed verbs, prepositions.
Output: The semantic relation(s) for the noun compound.
1. The 2-word NC (N1 N2) is split into its two nouns (N1 and N2)
2. Using the two nouns (N1 and N2) do:
   2.1 Form the paraphrases using the seed verbs, the prepositions and the nouns
   2.2 (a) Find the web frequency of all the paraphrases using search engine and Netspeak
       (b) Find the top 15 paraphrases having highest frequencies
       (c) Find the verbs forming these 15 paraphrases obtained from (b)
       (d) Find the semantic relation(s) having the maximum number of verbs extracted in (c)
   2.3 Return the semantic relation(s)

Table 3
Semantic relations and the Hindi translation patterns: N1T and N2T are the translations of N1 and N2.
Semantic relation | Definition | Examples | Hindi translation patterns
Possessor | N1 has N2 – N1 is owner | Company car, family estate, girl mouth, child foot | N1T + kaa/ke/kii + N2T
Possessor | N1 has N2 – N1 is borrower | Student loan, national debt | N1T N2T
Source | N1 is the source of N2 and N1 is not a body part | Northern wind, foreign capital | N1T N2T
Source | N1 is the source of N2 but N1 is a body part | Chest pain, stomach ache, heart attack | N1T + mein + N2T
Equative⁹ | N1 is also head – N1, N2 are human | Composer arranger, player coach, lady doctor | N1T N2T
Experiencer | N2 experiences N1 (an animated entity experiencing a state/feeling) | Heart patient, cancer patient | N1T + kaa/ke/kii + N2T
Specialization | N1 is specialization of N2 but N1 and N2 are human | Boy child, girl child, baby boy, baby girl | Single word
Specialization | NC is specialization of N2 | Fighter planes, war ships | N1T N2T
Attribute-Transfer | Salient attribute of N1 is transferred to N2 | Iron will, crescent wrench, lion heart, doe eye, chicken heart | N1T + jaise + N2T + vaalaa or single word
Use | N2 uses N1 | Laser printer, water gun, electron microscope | N1T + vaalaa/vaale/vaalii + N2T
Use | N2 uses N1 and N1 is a concept | Faith cure, shock treatment, milieu therapy | N1T N2T

⁹ In this semantic relation there are a number of cases where a kin relationship may exist between the two nouns. Such cases are rare in English but very common in Hindi, and thus need to be considered for Hindi to English translation of NCs. Some examples that illustrate this point are: “bhaaii bahan ∼ brother sister”, “betaa betii ∼ son daughter” etc.

4.2. Translation pattern generation

Once the semantic relation between the nouns of a 2-word NC is identified, the translation pattern for the 2-word NC is generated. To this end, 5 semantic relation groups have been formed based on the translation patterns of the NCs with these semantic relations. All the semantic relations belonging to one group are represented by the same translation pattern. The semantic relation Attribute-Transfer has not been allotted to any of the groups, as it has a special pattern and works differently from the others. The semantic relation groups formed on the basis of translation patterns are as follows:

• SR group1 contains those semantic relations that do not require any postposition between the two nouns in their Hindi translations. The translation is obtained by simple juxtaposition of the respective Hindi translations of the nouns. The semantic relations belonging to this group are: Beneficiary, Equative, Instrument, Location, Possessor, Product, Purpose, Source, Topic, Specialization, Use.
• SR group2 comprises semantic relations whose Hindi translations involve one of the postpositions “kaa/ke/kii” between the nouns. The postposition to be used out of these three (kaa/ke/kii) is determined on the basis of the gender and number of the head noun in the noun compound. The semantic relations belonging to this group are: Agent, Container, Material, Possessor, Result, Experiencer.
• SR group3 consists of semantic relations whose translation pattern requires one of the postpositions “vaalaa/vaale/vaalii” between the nouns. The gender and number of the head noun in the noun compound help in determining which of these three (vaalaa/vaale/vaalii) is to be used. Semantic relations belonging to this group are: Content, Temporal, Specialization, Use.
• Translation patterns of the semantic relations belonging to SR group4 involve the postposition “mein”. The set consists of only one semantic relation, viz. Source.
• SR group5 consists of the semantic relation Cause, and the postposition used by this semantic relation is “se”.

Four semantic relations, viz. Possessor, Source, Specialization and Use, have each been placed in two different SR groups. These semantic relations have different translation patterns depending upon the condition satisfied, as shown in Table 3.
Thus, if a noun compound has any of these 4 semantic relations between its nouns, then the translation pattern is decided by referring to the lexical semantic category of the nouns, as mentioned in the modified definitions given in Table 3. However, it is not possible to determine from the lexical database whether the noun is a borrower or an owner (as required for the Possessor semantic relation). In order to solve this problem we resort to the specific verbs that help in deciding the semantic relation (see Table 2), and their web frequencies and probabilities. The translation pattern is decided accordingly.

The semantic interpretation allows us to form rules for generating the translation patterns for different NCs. Table 3 gives a description, along with examples, of some of these semantic relations. It is clear from Table 3 that in order to get the correct translation patterns we had to modify the definitions of some of the existing semantic relations to some extent. The methodology used for generating the translation pattern(s) for a 2-word NC is described in Algorithm 2.

Algorithm 2 (Translation pattern generation for a 2-word English noun compound).
Input: A 2-word English noun compound.
Output: The Hindi translation pattern for the noun compound.
Notation: N1T and N2T are the Hindi translations of the English nouns N1 and N2 respectively.
1. Split the 2-word NC (N1 N2) into two nouns N1 and N2
2. Using the 2-word NC do:
   2.1 Determine the SR between the nouns N1 and N2 in the NC using Algorithm 1
   2.2 Determine the SR group to which the SR belongs
       2.2.1 Switch SR group
             Case SR group1: Hindi Translation Pattern is “N1T N2T”.
             Case SR group2:
                a. Determine the gender and number of the head noun, N2, from the lexicon
                b. If the gender of N2 is feminine, Hindi Translation Pattern is “N1T + kii + N2T”.
                c. If the gender of N2 is masculine and the number is singular, Hindi Translation Pattern is “N1T + kaa + N2T”.
                d. If the gender of N2 is masculine and the number is plural, Hindi Translation Pattern is “N1T + ke + N2T”.
             Case SR group3:
                a. Determine the gender and number of the head noun, N2, from the lexicon
                b. If the gender of N2 is feminine, Hindi Translation Pattern is “N1T + vaalii + N2T”.
                c. If the gender of N2 is masculine and the number is singular, Hindi Translation Pattern is “N1T + vaalaa + N2T”.
                d. If the gender of N2 is masculine and the number is plural, Hindi Translation Pattern is “N1T + vaale + N2T”.
             Case SR group4: Hindi Translation Pattern is “N1T + mein + N2T”.
             Case SR group5: Hindi Translation Pattern is “N1T + se + N2T”.
   2.3 If the SR between N1 and N2 is Attribute-Transfer, then Hindi Translation Pattern is “N1T + jaise + N2T + vaalaa”.
   2.4 Return the Translation Pattern

In certain cases there may be multiple translations in the lexicon for a noun in the noun compound. In order to select the best translation of the noun, the disambiguation tool SenseRelate::TargetWord¹⁰ has been used. It disambiguates a target word with respect to its context by finding the sense that is most related to its neighbors according to a WordNet::Similarity measure of relatedness (Patwardhan et al., 2005). Once the sense of the word is computed, the keywords in the definition of the computed sense are compared with the definition terms for that word in the lexicon. The translation whose definition matches the maximum number of keywords is selected as the most appropriate translation.

¹⁰ http://search.cpan.org/dist/WordNet-SenseRelate-TargetWord/.

Also, it has been observed that the noun components of a noun compound may indicate the presence of more than one semantic relation. This may also lead to the generation of multiple translation patterns for the noun compound.
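The case analysis of Algorithm 2 can be sketched as a small function. This is an illustrative sketch only: the names `agree` and `translation_pattern`, the string labels for the SR groups, and passing the gender and number of the head noun as arguments (instead of looking them up in a lexicon) are all assumptions made for the example. The two Hindi examples are the ones quoted in the Introduction.

```python
from typing import Tuple

# (masculine singular, masculine plural, feminine) forms, as in Algorithm 2.
KAA_FORMS: Tuple[str, str, str] = ("kaa", "ke", "kii")
VAALAA_FORMS: Tuple[str, str, str] = ("vaalaa", "vaale", "vaalii")


def agree(forms: Tuple[str, str, str], gender: str, number: str) -> str:
    """Pick the postposition form that agrees with the head noun N2."""
    if gender == "f":
        return forms[2]  # feminine, regardless of number
    return forms[0] if number == "sg" else forms[1]


def translation_pattern(n1t: str, n2t: str, sr_group: str,
                        gender: str = "m", number: str = "sg") -> str:
    """Build the Hindi pattern for translated nouns n1t and n2t.

    sr_group is one of "group1" .. "group5" or "attribute-transfer"
    (hypothetical labels for the SR groups of Section 4.2).
    """
    if sr_group == "group1":
        return f"{n1t} {n2t}"  # simple juxtaposition
    if sr_group == "group2":
        return f"{n1t} {agree(KAA_FORMS, gender, number)} {n2t}"
    if sr_group == "group3":
        return f"{n1t} {agree(VAALAA_FORMS, gender, number)} {n2t}"
    if sr_group == "group4":
        return f"{n1t} mein {n2t}"
    if sr_group == "group5":
        return f"{n1t} se {n2t}"
    if sr_group == "attribute-transfer":
        return f"{n1t} jaise {n2t} vaalaa"
    raise ValueError(f"unknown SR group: {sr_group}")


# "mustard oil": head noun tel is masculine singular, postposition kaa.
print(translation_pattern("sarson", "tel", "group2", "m", "sg"))
# "body ache": Source with a body part as N1, postposition mein.
print(translation_pattern("shareer", "dard", "group4"))
```

Relations that sit in two groups (Possessor, Source, Specialization and Use) would need the extra condition from Table 3 to choose the group before this function is called; that choice is deliberately left outside the sketch.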
In case the translation pattern cannot be identified using the scheme, we use the transliterated forms of the two nouns, juxtaposed without any intervening postposition. Some examples of this kind are “safety pin”, “wall hanging”, “miss world” etc.

For a 3-word or a 4-word noun compound we first resolve the bracketing issue, and then the scheme used for 2-word noun compounds is applied recursively to obtain the right translation pattern(s). This is described in detail in the next section.

5. Bracketing and translation patterns for 3-word and 4-word NCs

For a noun compound consisting of 3 nouns (say N1, N2 and N3) one of the following three cases is possible:

(a) Right Bracketing: N1 modifies the result of N2 modifying N3 ∼ (N1 (N2 N3)),
(b) Left Bracketing: the result of N1 modifying N2 in turn modifies N3 ∼ ((N1 N2) N3),
(c) No Bracketing: N1, N2 and N3 are independent ∼ (N1 N2 N3).

According to Finin (1980), structure (b) is preferred over (a) in English, and structure (c) is more common in long sequences.

5.1. Adjacency and dependency models for bracketing

Two models, viz. adjacency by Marcus (1980), Pustejovsky et al. (1993), Resnik (1993), and dependency by Lauer (1995), have been used in the literature to decide a suitable bracketing (left or right):

• The adjacency model checks how strongly N2 modifies N3 as opposed to N1 N2 being a compound, to decide the correct bracketing (Nakov and Hearst, 2005). If the web frequency/probability of N1 N2 is greater than the web frequency/probability of N2 N3, then a left bracketing is predicted; otherwise a right bracketing is predicted.
• The dependency model checks whether N1 modifies N3 as opposed to N1 modifying N2. For this model a left bracketing is predicted if the web frequency/probability of N1 N2 is greater than the web frequency/probability of N1 N3; otherwise a right bracketing is predicted.
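The two frequency-based decision rules above can be sketched as follows. The function names and the bigram counts are hypothetical (the counts are invented for illustration and are not web frequencies from the paper); ties are sent to right bracketing, following the "otherwise" wording of the two rules.

```python
from typing import Dict, Tuple

Counts = Dict[Tuple[str, str], int]


def adjacency_bracketing(freq: Counts, n1: str, n2: str, n3: str) -> str:
    """Left if Freq(N1 N2) > Freq(N2 N3), otherwise right."""
    return "left" if freq.get((n1, n2), 0) > freq.get((n2, n3), 0) else "right"


def dependency_bracketing(freq: Counts, n1: str, n2: str, n3: str) -> str:
    """Left if Freq(N1 N2) > Freq(N1 N3), otherwise right."""
    return "left" if freq.get((n1, n2), 0) > freq.get((n1, n3), 0) else "right"


# Invented counts; ("olympic", "medal") is the skip bigram obtained by dropping N2.
toy_counts: Counts = {
    ("olympic", "gold"): 50,
    ("gold", "medal"): 900,
    ("olympic", "medal"): 300,
}

print(adjacency_bracketing(toy_counts, "olympic", "gold", "medal"))
print(dependency_bracketing(toy_counts, "olympic", "gold", "medal"))
```

The only difference between the two models is the bigram that N1 N2 is compared against: the adjacent pair N2 N3 for adjacency, and the skip bigram N1 N3 for dependency.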
We have used both frequency and probability based approaches for each of the dependency and adjacency models to resolve bracketing, and compared the results. The web frequency for a 3-word NC “N1 N2 N3” is computed as follows: the web frequencies for the bigrams (N1 N2, N2 N3), and also for the skip bigram (N1 N3) obtained by skipping N2, are computed from the web. In the probability based approach, for the NC “N1 N2 N3” we take Pr(Na → Nb | Nb) to be the conditional probability that the word Na precedes a given fixed word Nb. Both models (adjacency and dependency) based on the two approaches (web frequency and probability) are summarized in Table 4.

Table 4
Adjacency and dependency models for frequency and probability based approaches.

| Model | Web frequency based | Probability based | Bracketing |
|---|---|---|---|
| Adjacency | Freq(N1 N2) > Freq(N2 N3) | Pr(N1 → N2 | N2) > Pr(N2 → N3 | N3) | Left |
| Adjacency | Freq(N1 N2) < Freq(N2 N3) | Pr(N1 → N2 | N2) < Pr(N2 → N3 | N3) | Right |
| Dependency | Freq(N1 N2) > Freq(N1 N3) | Pr(N1 → N2 | N2) > Pr(N1 → N3 | N3) | Left |
| Dependency | Freq(N1 N2) < Freq(N1 N3) | Pr(N1 → N2 | N2) < Pr(N1 → N3 | N3) | Right |

In the probability based approach, the probability Pr(Na → Nb | Nb) is estimated as #(Na Nb)/#(Nb), where #(Na Nb) and #(Nb) are the corresponding bigram and unigram frequencies. The frequencies are taken to be the values returned by the search engines and the Netspeak web service in response to queries for the exact phrase “Na Nb” and for the word “Nb”. Two search engines, namely Bing and Google, and the Netspeak (NS) web service have been used to extract the web frequencies of the bigrams and the unigrams of each NC. Two examples with the bracketing predictions for the frequency and probability based adjacency and dependency models are shown in Tables 5 and 6.
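The probability based variant differs from the frequency based one only in that each bigram count is normalised by the unigram count of its second word. A minimal sketch follows; the function names are hypothetical, the counts in the example are made-up illustrative numbers (not values from the paper), and returning 0.0 for a zero unigram count is an assumption, since the text does not say how that case is handled.

```python
def cond_prob(bigram_count: int, unigram_count: int) -> float:
    """Estimate Pr(Na -> Nb | Nb) as #(Na Nb) / #(Nb)."""
    if unigram_count == 0:
        # Assumption: treat an unseen second word as probability zero.
        return 0.0
    return bigram_count / unigram_count


def bracket_by_probability(c12: int, c23: int, c13: int, c2: int, c3: int) -> dict:
    """Apply the probability based rows of Table 4.

    c12, c23, c13 are the counts of "N1 N2", "N2 N3" and "N1 N3";
    c2 and c3 are the unigram counts of N2 and N3.
    """
    p12 = cond_prob(c12, c2)   # Pr(N1 -> N2 | N2)
    p23 = cond_prob(c23, c3)   # Pr(N2 -> N3 | N3)
    p13 = cond_prob(c13, c3)   # Pr(N1 -> N3 | N3)
    return {
        "adjacency": "left" if p12 > p23 else "right",
        "dependency": "left" if p12 > p13 else "right",
    }


# Hypothetical counts, chosen only to exercise the code.
result = bracket_by_probability(c12=2_000, c23=5_000, c13=300, c2=100_000, c3=1_000_000)
print(result)
```

Note that normalisation can reverse a frequency based decision: in this example "N2 N3" is the more frequent bigram, yet the adjacency rule still predicts a left bracketing because N3 is a much more common word than N2.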
These tables show the web frequencies and the probabilities obtained for two 3-word noun compounds, together with the bracketing derived from these frequencies and probabilities. It is clear from the example in Table 5 that the two models may contradict each other. In such a case we propose to follow the majority and decide the bracketing accordingly. Thus, for “Hydrogen Ion Exchange” our scheme decides a left bracketing, as per the majority. However, there are a few cases, such as “watershed development planner”, where the bracketing cannot be decided on the basis of the majority. In such cases we consider both alternatives (left as well as right bracketing), which results in multiple translation patterns. Table 6 illustrates the handling of the case “watershed development planner”. A detailed analysis of the results related to bracketing is presented in Section 6.

Table 5
Sample NC for which the Adjacency (Adj) and Dependency (Dep) models contradict (Notation: RB: Right Bracketing; LB: Left Bracketing; NS: Netspeak web service). Cells give web frequencies/probabilities.

| 3-word NC: Hydrogen ion exchange (N1 N2 N3) | Google | Bing | NS |
|---|---|---|---|
| Hydrogen ion (N1 N2) | 2,760,000/0.007976879 | 429,000/0.0007234 | 55,000/0.0045454 |
| Ion exchange (N2 N3) | 5,380,000/0.005028037 | 1,120,000/0.0044268 | 304,000/0.0039481 |
| Hydrogen exchange (N1 N3) | 242,000/0.000226168 | 21,600/0.0000854 | 13,000/0.0001688 |

Bracketing: Frequency: Adj RB, Dep LB; Probability: Adj LB, Dep LB.

Table 6
Sample NC for which Adjacency (Adj) and Dependency (Dep) models contradict (Notation: RB: Right Bracketing; LB: Left Bracketing; NS: Netspeak web service). Cells give web frequencies/probabilities.

| 3-word NC: Watershed development planner (N1 N2 N3) | Google | Bing | NS |
|---|---|---|---|
| Watershed development (N1 N2) | 262,000/0.00017013 | 61,300/0.0001868 | 35,000/0.0001222 |
| Development planner (N2 N3) | 208,000/0.001808696 | 98,600/0.0023309 | 11,000/0.0010185 |
| Watershed planner (N1 N3) | 7,620/0.00006626 | 7,190/0.0001699 | 810/0.000075 |

Bracketing: Frequency: Adj RB, Dep LB; Probability: Adj RB, Dep LB.

5.2. Translation pattern generation for 3-word NCs

Once the bracketing of a 3-word NC is identified, it is subjected to Algorithm 3 to determine its translation pattern. The overall translation pattern of the 3-word noun compound depends recursively on the translation patterns of the 2-word NCs present in it.

Algorithm 3 (Translation pattern generation for a 3-word English noun compound).
Input: A 3-word noun compound, bracketing of the noun compound.
Output: The Hindi translation pattern for the noun compound.
Notation: NTL and NTR are the Hindi translations of the nouns, and NH is the head noun for the 2-word NC.
1. If the input NC has left bracketing, i.e. ((N1 N2) N3)
1.1. Find the translation pattern for N1 N2 (left bracketed nouns) using Algorithm 2 and set it to NTL.
1.2. Find the head noun in the NC (N1 N2), set it to NH. (Generally, the rightmost noun is the head noun.)
1.3. Find the translation pattern for NH N3 (a 2-word NC formed from the head noun of the left bracketed nouns and the 3rd noun, N3) using Algorithm 2 and set it to NTR.
1.4. Combine the two translation patterns obtained in steps (1.1) and (1.3), NTL NTR.
1.5. Remove the common noun part present in both the translation patterns from either NTL or NTR.
1.6. The pattern obtained after removal of the duplicate is the final translation pattern for the 3-word NC.
1.7. Return the translation pattern.
2. If the NC has right bracketing, i.e. (N1 (N2 N3))
2.1. Find the translation pattern for N2 N3 (right bracketed nouns) using Algorithm 2 and set it to NTR.
2.2. Find the head noun in the NC (N2 N3), set it to NH. (The rightmost noun is the head noun.)
2.3. Find the translation pattern for N1 NH (a 2-word NC formed from the head noun of the right bracketed nouns and the 1st noun, N1) using Algorithm 2 and set it to NTL.
2.4. Combine the two translation patterns obtained in steps (2.1) and (2.3), NTL NTR.
2.5. Remove the duplicate common noun part from the translation pattern NTL.
2.6. The pattern obtained after removal of the duplicate is the final translation pattern for the 3-word NC.
2.7. Return the translation pattern.

This scheme is illustrated below with the 3-word NC “olive oil bottle”. This NC has left bracketing, as can be determined using the approach discussed in Section 5.1, i.e. by using the adjacency and dependency models with the web frequency based as well as the probability based approach. Thus the NC is represented as ((olive oil) bottle).
• The translation pattern for “olive oil” is computed using Algorithm 2. This NC has the semantic relation Material, as obtained using Algorithm 1, which belongs to SR group2 with the translation pattern “kaa/kii/ke”. This will be translated as (jaitun(olive) kaa/kii/ke tel(oil)). As the Hindi word tel is masculine and singular, kaa is selected from these three patterns. Thus the final translation pattern is jaitun kaa tel.
• Head noun for (olive oil) is “oil”.
• The translation pattern for (oil bottle) is determined. The semantic relation between the two words of the NC is Content, which belongs to SR group3 with the translation pattern “vaalaa/vaalii/vaale”. This will be translated as (tel(oil) vaalaa/vaalii/vaale shishi(bottle)). The Hindi word shishi(bottle) is feminine and singular, hence vaalii is selected from these three patterns. Thus the final translation pattern is tel vaalii shishi.
• Combining the translation patterns, the pattern generated is (jaitun kaa tel) + (tel vaalii shishi).
Removing the duplicate word (tel), the final translation pattern is (jaitun kaa tel vaalii shishi).

Similarly, the NC “plastic oil bottle” has right bracketing and is represented as (plastic (oil bottle)). The translation pattern obtained for this NC is (plastic kii tel vaalii shishi). The observations and test results of the scheme related to bracketing and the translation patterns for 3-word noun compounds are discussed in detail in Section 6.

5.3. Translation pattern generation for 4-word NCs

Marcus (1980) assumed that an arbitrarily long modifier string can be analyzed by iteratively examining only the three leftmost nouns in the modifier string. This work was further extended by Resnik (1993). We, however, have extended the 3-word NC bracketing concept to 4-word NCs using the above-mentioned web frequency based approach. The possible bracketing combinations for a 4-word NC are enumerated and explored to find the most probable one. A 4-word NC “N1 N2 N3 N4” may have the following bracketing combinations:
1. All the four nouns are independent, (N1 N2 N3 N4). For this pattern the scheme finds the web frequency of the 4-gram “N1 N2 N3 N4”.
2. The leftmost (N1) and the rightmost (N4) nouns are independent and the middle two nouns (N2 N3) form a group, i.e. the bracketing is (N1 (N2 N3) N4). Here the scheme considers the web frequency of the trigram “N1 N3 N4”, as N1 and N4 are independent, and N3 is the head noun of (N2 N3) where N2 is a modifier.
3. Here N1 N2 form one group and N3 N4 form another group. Hence the bracketing is ((N1 N2)(N3 N4)). The scheme finds the web frequency of the bigram “N2 N4”, as N2 is the head noun of (N1 N2) and N4 is the head noun of (N3 N4), and N1 and N3 are their respective modifiers.
4. In this case the leftmost three nouns (N1 N2 N3) form a group and the rightmost noun (N4) is independent. Hence the bracketing is ((N1 N2 N3) N4). The 3-word noun group (N1 N2 N3) may have either a left or a right bracketing, i.e. one of (((N1 N2) N3) N4) or ((N1 (N2 N3)) N4) is a possibility. In this pattern we find the web frequency of the bigram “N3 N4”, as N1 N2 is the modifier for N3.
5. The leftmost noun (N1) is independent and the rightmost three nouns (N2 N3 N4) form a group, i.e. we have the bracketing (N1 (N2 N3 N4)). The trigram (N2 N3 N4) may have either a left or a right bracketing, i.e. (N1 ((N2 N3) N4)) or (N1 (N2 (N3 N4))). The web frequency of the bigram “N1 N4” is determined, since N2 N3 acts as the modifier for N4, which is the head noun of (N2 N3 N4).

To determine the bracketing pattern for a 4-word noun compound, web frequencies are computed for the sequences N1 N2 N3 N4, N1 N3 N4, N2 N4, N3 N4 and N1 N4. The noun compound is assigned the pattern with the highest frequency, after which its translation pattern can be generated. The pattern having the highest frequency will consist of nouns that are independent (pattern 1), 2-word nouns (patterns 2 and 3) or 3-word nouns (patterns 4 and 5). The translation patterns for these 2-word and 3-word NCs can be generated as discussed in the previous sections and combined to form the translation pattern for the 4-word noun compound.

6. Experimental results

The approaches discussed for 2-word, 3-word and 4-word noun compounds have been manually evaluated by two evaluators for translation patterns. These evaluators were not shown the standard results and were asked to mark the translation pattern(s) obtained for an NC as correct or incorrect. The annotators’ agreement was measured using the κ coefficient (Cohen, 1960).
The κ coefficient is computed as

$$\kappa = \frac{P(A) - P(E)}{1 - P(E)},$$

where P(A) is the observed agreement among the annotators, and P(E) is the expected agreement, i.e. the probability that the annotators agree by chance. The value of κ lies in the interval [−1, 1]. A κ value of +1 means perfect agreement, and a κ value of −1 means perfect disagreement.

The results obtained for the semantic relation and the bracketing have also been compared with the reference results given in the datasets collected from the literature. We also compare the outputs of the proposed system (the system obtained by integrating the Moses baseline system with the noun compound translation (NCT) system) with the Moses baseline system and some of the state-of-the-art translators. The precision, recall and F-scores have been calculated for all the systems.

6.1. 2-word NC evaluation

The test set for 2-word NCs has been extracted from the data provided by Kim (2008 Shared Task). A test set of 200 2-word NCs has been selected randomly from a training corpus of 1088 tokens defined by Barker and Szpakowicz (1998). The test set is chosen so that instances of all the 20 SRs considered in the paper are evenly distributed. The SR(s) obtained for an NC using the approach discussed were compared with the SRs given in the set of 1088 NCs to find the accuracy of the method. Some observations for the 2-word NCs are:
• Automatic evaluation of the identified semantic relations resulted in an accuracy of 64%. Here the semantic relations obtained using the proposed scheme were compared with the semantic relations specified in the reference set.
• Manual evaluation of the same resulted in 83% accuracy. This may be attributed to the fact that human evaluators are able to interpret certain semantic relations in a way that may not be possible in automatic evaluation. For illustration, the NC “cabinet member” is marked with the semantic relation Container in the reference set, whereas the proposed method assigned the semantic relation Possessor. The manual evaluators marked this as a correct assignment of the semantic relation, whereas automatic evaluation marked it as incorrect. Some other such cases are “steel hammer”, “hospital room” and “sea shell”.
• Most of the NCs with SRs Cause, Material, Product, Purpose, Possessor and Source were correctly identified. Some examples are “anthrax death”, “olive oil”, “petroleum product”, etc.
• NCs with SRs Beneficiary, Result and Use were incorrectly identified in most cases. Some example NCs are “machine translation”, “faith cure” and “consumer price”.

The translation patterns obtained for the NCs based on the extracted SRs were given to two evaluators to mark the patterns as either correct or incorrect. A kappa (κ) value of 0.414 was obtained for the two annotators. The strength of agreement is considered to be ‘moderate’.

6.2. 3-word NC evaluation

We have considered a test set of 90 3-word NCs collected from the existing literature for evaluating the performance on 3-word NCs. The web frequencies required by the adjacency and the dependency models have been obtained for each of these NCs, and both models were used for bracketing the noun compounds. Out of the 90 NCs considered, the adjacency and the dependency models agreed for 76 NCs under the probability based approach. Under the web frequency based approach, however, the two models agreed for only 64 NCs. The accuracy of the two models, each with both the frequency and the probability based approach, is shown in Table 7. The adjacency model with the frequency based approach is found to be the best.

Table 7
3-word NC bracketing accuracy.

| Approach | Model | Correct bracketing | Accuracy (%) |
|---|---|---|---|
| Frequency based | Adjacency | 68 | 75.56 |
| Frequency based | Dependency | 65 | 72.22 |
| Probability based | Adjacency | 60 | 66.67 |
| Probability based | Dependency | 63 | 70.00 |
The bracketing obtained for these 90 NCs was compared with the standard bracketing from the reference set of Resnik (1993); 68 NCs were found to be bracketed correctly, matching the reference bracketing. The translation patterns have been obtained using the proposed scheme for these 68 NCs only. It has been observed that for some NCs multiple translation patterns were obtained. This is because at times more than one semantic relation is identified for a given NC by the scheme, resulting in multiple translation patterns. It is further observed that NCs involving nouns of the semantic category human are not assigned proper SRs in many cases. Hence more SRs are needed to support proper identification. In some cases the search engine reported a very high frequency for a paraphrase, but manual verification showed that the reported web frequencies were not for the exact paraphrase but for some related paraphrases. Such cases were removed from the list of high frequency paraphrases and were not considered any further.

The accuracy of the translation patterns generated for 3-word NCs was calculated only for those 68 NCs for which the bracketing was correct. A kappa (κ) value of 0.416 was obtained for the two evaluators for the translation patterns. The strength of agreement between the two evaluators for these translation patterns is considered to be ‘moderate’.

6.3. 4-word NC evaluation

Instances of 4-word NCs are rare in routine text. We could collect only a small set of 35 NCs from the existing literature and the corpus discussed in Section 3. Web frequencies for this set were found using two search engines, viz. Bing and Google, to determine the most probable bracketing for 4-word NCs. It is observed that for 75% of the test cases the pattern ((N1 N2 N3) N4) obtains the highest frequency for both search engines. For nearly half of the remaining test cases there is a conflict in selecting the highest frequency pattern. The pattern (N1 (N2 N3 N4)) obtains the highest frequency for only 9.375% of the cases, and for 3.125% of the test cases the highest frequency pattern is ((N1 N2)(N3 N4)). Thus for a 4-word NC (N1 N2 N3 N4), it is assumed that the NC strictly follows the bracketing ((N1 N2 N3) N4), which corroborates the findings of Marcus (1980). The 3-word NC (N1 N2 N3) within the 4-word NC has been handled by deciding the bracketing first and then generating the translation patterns for the two 2-word NCs recursively.

The translation patterns obtained for the 4-word NCs have been marked correct by both evaluators for 64% of the cases. The evaluators did not agree on 18% of the translation patterns. The remaining 18% of the translation patterns have been found to be incorrect by both evaluators. Of all the correct translation patterns, 29% included the postpositions kaa/ke/kii in the translations, and 14% had other postpositions. It is further observed that 57% of the correct translation patterns were juxtapositions of word-by-word translations of N1, N2, N3 and N4, without any postpositions.

For NCs involving more than 4 words the authors feel that using the translated or transliterated forms of the individual nouns, in the same sequence in which they appear, generates acceptable translation patterns. This also avoids the overhead caused by the large number of possible bracketings and the many iterations required for SR identification.

Fig. 3. Schematic diagram for the Moses + NCT system (components: source sentence, Moses baseline system, noun compound translator, transliteration module, SenseRelate::TargetWord tool, lexicon, target sentence).

6.4.
Comparison with the state-of-the-art translators

Sections 6.1–6.3 discuss the accuracy of the proposed scheme for semantic relation identification, and the translation patterns generated for the 2-word, 3-word and 4-word noun compounds. In this section we consider the state-of-the-art translators AnglaMT, Anuvaadaksh, Bing, Google and Moses (http://www.statmt.org/moses/; Koehn et al., 2007) to find how these systems translate the noun compounds occurring in a sentence. We also integrate the proposed noun compound translator (NCT) system with the Moses baseline system to compare the translation quality of the integrated system with that of the Moses baseline system. We refer to the integrated system as “Moses + NCT” in the subsequent discussion. It is schematically represented in Fig. 3.

A test set of 200 sentences containing 242 noun compounds (each sentence contains at least one noun compound) is considered for evaluation. It has been observed that all the translators suffer from various problems related to incorrect postpositions while translating noun compounds. The problems range from simple transliteration of nouns to altering the positions of the nouns within a noun compound, conveying erroneous, if not meaningless, semantics. Moreover, there are cases where the correct semantics are not preserved, primarily due to ambiguous word sense. The different types of errors that occurred during the translation of noun compounds by the different translators, and their percentages, are summarized in Table 8. The minimum error percentage in each row is shown in bold.

Table 8
Percentages of errors occurring in the NCs translated by the translators.

| Types of errors | AnglaMT | Anuvaadaksh | Bing | Google | MaTra | Moses | Moses + NCT |
|---|---|---|---|---|---|---|---|
| Word-order | 3.74 | 3.10 | 4.18 | 2.09 | 3.85 | 18.79 | **0** |
| Postposition | 24.06 | 32.30 | 30.96 | 28.45 | 32.91 | 21.48 | **15.05** |
| Semantic | 16.58 | 11.95 | **5.44** | 5.86 | 13.68 | 16.11 | 8.60 |
| Transliteration | 6.95 | 16.37 | 5.02 | **1.67** | 23.50 | 36.24 | 12.90 |

Some of the observations in this regard are:
• The major problem faced by the translators is related to postpositions. However, the Moses + NCT system has the lowest percentage of postposition related errors among the translators.
• With respect to semantics, the Moses + NCT system shows considerable improvement over the Moses baseline system. The percentage of semantics related errors has reduced from 16.11 to 8.60. This has been possible due to the introduction of the SenseRelate::TargetWord tool. Only Bing and Google perform somewhat better than Moses + NCT in this regard.
• The Google system has the best transliteration module, with the lowest percentage of errors. Although Moses + NCT performs much better than the Moses baseline system, some more improvements are needed in this regard.
• The Moses + NCT system does not suffer from word-order related errors, whereas for all other translators there are cases where the nouns in a noun compound are either placed far apart or reverse their order after translation. This leads to incorrect translation of noun compounds.

Table 9 illustrates some of the errors made by the different translators while translating noun compounds extracted from some of the test sentences. Here the error codes ‘P’, ‘T’, ‘O’, ‘S’, ‘N’ and ‘PoS’ denote errors due to postpositions, translation/transliteration, word-order, semantics, number and part-of-speech, respectively.

Table 9
Translation outputs of various translators for some example noun compounds in a sentence.

The translations of noun compounds shown in Table 9 clearly indicate the inability of the translators to handle these noun compounds. All the translators except Google and the Moses + NCT system translate the noun compound “death sentence” incorrectly. This is because the word ‘sentence’ has two meanings: a linguistic unit and a decree of punishment. The word ‘player’ in the noun compound “record player” has been translated incorrectly by all the translators except Bing and Google, which have transliterated the word correctly. The sense considered for ‘player’ by most of the translators is: a person taking part in a sport or game. For the noun compound “wedding present” the postposition chosen by Google is incorrect. The Hindi word for ‘present’ is masculine and singular, hence the postposition must agree with a masculine singular noun. All other translators except Moses + NCT have made some errors. Only Moses + NCT transliterated the noun compound correctly, and thereby avoided the mistakes. For the noun compounds “hotel electrician” and “working hours” the mistakes are primarily either missing or incorrect postpositions. Anuvaadaksh produced a wrong word order for “working hours”. Only Moses + NCT could translate both of them correctly. The AnglaMT system has taken the PoS of the word ‘forces’ in the noun compound “security forces” to be verb, and hence translated it incorrectly. Only the Anuvaadaksh system and the Moses + NCT system proposed in this paper could translate it correctly.

Table 10 presents the precision, recall and F-scores of the Moses baseline system and the various online translators mentioned above. The values in bold give the maximum for each row (i.e. recall, precision and F-score). We compare these results with the results obtained from the Moses + NCT system.

Table 10
Precision, recall and F-score for the different translators.

| Scores | AnglaMT | Anuvaadaksh | Bing | Google | MaTra | Moses | Moses + NCT |
|---|---|---|---|---|---|---|---|
| Recall | 0.870 | 0.934 | 0.983 | **0.996** | 0.971 | 0.616 | 0.927 |
| Precision | 0.492 | 0.381 | 0.521 | 0.568 | 0.255 | 0.477 | **0.699** |
| F-score | 0.629 | 0.541 | 0.681 | 0.715 | 0.404 | 0.538 | **0.823** |

In this scenario, we have considered recall as the ratio of the number of noun compounds translated by the system to the total number of noun compounds present in the document, i.e.

$$\text{Recall} = \frac{\#\,\text{Translated NCs}}{\#\,\text{Total NCs}}$$

In a similar vein, precision is defined as the ratio of the number of correctly translated noun compounds to the total number of noun compounds identified and translated by the MT system. Hence,

$$\text{Precision} = \frac{\#\,\text{Correctly translated NCs}}{\#\,\text{Translated NCs}}$$

Some observations related to the experiments are:
• The Moses baseline system has the lowest recall. This may be attributed to the presence of many out-of-vocabulary (OOV) words that simply need to be transliterated. The baseline system does not have a built-in transliteration module. The system recall increases significantly (from 0.616 to 0.927) with the addition of the transliteration module, as can be seen from the Moses + NCT system results. The Google translator records the highest recall.
• The AnglaMT system has lower recall than all other translators except the baseline system. This may be due to errors in the transliterated terms. The NCs that are incorrectly transliterated are considered incorrect, resulting in lower recall.
• The precision and F-score for the Moses + NCT system are the highest. It has shown a significant improvement over the Moses baseline system. This is possible only because of the hybrid approach that has been used for handling the issues related to NC translation.

7. Concluding remarks

This paper deals with the handling of noun compounds while translating from English to Hindi. Existing English to Hindi machine translation systems are often not able to handle English noun compounds while translating them to Hindi.
The problems can be of various types, ranging from the simple syntactic level to complicated semantics. As a consequence, the solution scheme needs to address the following tasks:
• to determine whether the word sequence in a sentence is a noun compound;
• to identify and interpret the semantic relation of the words present in a noun compound;
• to choose the proper translation pattern for the English noun compound in the target language.

In this work we investigated the above problems for noun compounds consisting of two, three and four consecutive nouns. A hybrid approach has been proposed for solving this difficulty. A knowledge base consisting of a set of seed verbs has been constructed. This helps in identifying the semantic relation between the nouns of a noun compound. A database containing noun compounds that are translated as single words has been formed. This helps in determining the translation patterns of noun compounds that translate to single words without having to go through the process of semantic relation identification. A rule-based scheme has been used for generating the translation patterns. The rules are formulated by identifying the semantic relations between the nouns of the noun compounds of the source language. Twenty different semantic relations have been considered for study.

The evaluation results for the translation patterns obtained for noun compounds have been presented and the inter-annotator agreement has been computed. For semantic relation identification the accuracy of objective evaluation is 64%, whereas for subjective evaluation it is 83%. The apparent discrepancy arises from the inherent ambiguity of the perceived semantic relation between two nouns, which human evaluators are capable of understanding but which cannot be resolved automatically.
In this study we have considered the fifteen paraphrases with the highest frequencies for determining the semantic relations. The proposed system records the highest precision and F-scores in comparison with the Moses baseline and other online English to Hindi translators.

The errors made by machine translation systems following different translation paradigms have been analyzed. It has been observed that the percentage of errors related to postpositions and word-order occurring while translating noun compounds was comparable for all the systems. Number related errors have been found to be more common for statistical systems (Google and Bing, here). However, transliteration and semantic related errors were more frequent in the rule-based system (i.e. AnglaMT) and the hybrid systems (i.e. Anuvaadaksh and MaTra) considered for this work. Tables 8 and 9 show the percentages of the various errors and certain example cases when the noun compounds are translated using the MT systems.

Presently the study focuses on noun sequences only. This work can be extended to generate translation patterns for other categories of compound nouns, such as “adjective + noun”, “verb + noun” and “noun + verb”.

Acknowledgements

We would like to thank the two anonymous reviewers for their helpful comments, suggestions and valuable input on this work.

References

Ananthakrishanan, R., Kavitha, M., Hegde, J., Shekhar, C., Shah, R., Bade, S., Sasikumar, M., 2006. MaTra: a practical approach to fully-automatic indicative English–Hindi machine translation. In: Symposium on Modeling and Shallow Parsing of Indian Languages (MSPIL’06).
Baldwin, T., Tanaka, T., 2004. Translation by machine of compound nominals: getting it right. In: Proceedings of ACL 2004 Workshop on Multiword Expressions: Integrating Processing, pp. 24–31.
Barker, K., Szpakowicz, S., 1998. Semi-automatic recognition of noun modifier relationships. In: Proceedings of International Conference on Computational Linguistics (COLING-1998), pp. 96–102.
Bungum, L., Oepen, S., 2009. Automatic translation of Norwegian noun compounds. In: Proceedings of European Association for Machine Translation (EAMT-09), pp. 136–143.
Burnard, L., 2000. User reference guide for the British National Corpus. In: Technical Report. Oxford University Computing Services.
Cao, Y., Li, H., 2002. Base noun phrase translation using web data and the EM algorithm. In: Proceedings of International Conference on Computational Linguistics (COLING-2002), pp. 1–7.
Cohen, J., 1960. A coefficient of agreement for nominal scales. Educ. Psychol. Meas. 20, 37–46.
Finin, T.W., (Ph.D. thesis) 1980. The Semantic Interpretation of Compound Nominals. University of Illinois.
Girju, R., Moldovan, D., Tatu, M., Antohe, D., 2005. On the semantics of noun compounds. Comput. Speech Lang. 19 (44), 479–496.
Jones, K.S., 1983. Compound noun interpretation problems. In: Technical Report. University of Cambridge.
Kim, S.N., Baldwin, T., 2013a. A lexical semantic approach to interpreting and bracketing English noun compounds. Nat. Lang. Eng. 19 (3), 385–407.
Kim, S.N., Baldwin, T., 2013b. Word sense and semantic relations in noun compounds. ACM Trans. Speech Lang. Process. 10 (July (3)), Special Issue on Multiword Expressions: From Theory to Practice and Use, Part 2, Article No. 9.
Kim, S.N., Baldwin, T., 2008. An unsupervised approach to interpreting noun compounds. In: Proceedings of IEEE International Conference on Natural Language Processing and Knowledge Engineering (IEEE NLP-KE’08), pp. 1–7.
Kim, S.N., Baldwin, T., 2006. Interpreting semantic relations in noun compounds via verb semantics. In: Proceedings of COLING/ACL 2006 Main Conference Poster Session, pp. 491–498.
Kim, S.N., Baldwin, T., 2005. Automatic interpretation of compound nouns using WordNet similarity. In: Proceedings of International Joint Conference on Natural Language Processing (IJCNLP-2005), pp. 945–956.
Koehn, P., Hoang, H., Birch, A., Callison-Burch, C., Federico, M., Bertoldi, N., Cowan, B., Shen, W., Moran, C., Zens, R., Dyer, C., Bojar, O., Constantin, A., Herbst, E., 2007. Moses: open source toolkit for statistical machine translation. In: Proceedings of the ACL 2007 Demo and Poster Sessions, Prague, Czech Republic, pp. 177–180.
Lapata, M., 2002. The disambiguation of nominalizations. Comput. Linguist. 28 (3), 357–388.
Lauer, M., (Ph.D. thesis) 1995. Designing Statistical Language Learners: Experiments on Noun Compounds. Macquarie University.
Levi, J., 1978. The Syntax and Semantics of Complex Nominals. Academic Press, New York.
Maalej, Z., 1994. English–Arabic machine translation of nominal compounds. In: Bouillon, P., Estival, D. (Eds.), Proceedings of Workshop on Compound Nouns: Multilingual Aspects of Nominal Composition, pp. 135–146.
Marcus, M., (Ph.D. thesis) 1980. A Theory of Syntactic Recognition for Natural Language. MIT Press, pp. 336–343.
Mathur, P., Paul, S., 2009. Automatic translation of nominal compounds from English to Hindi. In: Proceedings of International Conference on Natural Language Processing (ICON).
Moldovan, D., Badulescu, A., Tatu, M., Antohe, D., Girju, R., 2004. Models for the semantic classification of noun phrases. In: Proceedings of HLT-NAACL 2004: Workshop on Computational Lexical Semantics, pp. 60–67.
Nakov, P., 2013. On the interpretation of noun compounds: syntax, semantics, entailment. Nat. Lang. Eng. 1 (1), 1–40.
Nakov, P., 2008. Paraphrasing verbs for noun compound interpretation. In: Proceedings of LREC 2008 Workshop: Towards a Shared Task for Multiword Expressions (MWE 2008).
Nakov, P., Hearst, M., 2006. Using verbs to characterize noun–noun relations. In: Proceedings of International Conference on Artificial Intelligence: Methodology, Systems Applications (AIMSA), pp. 233–244.
Nakov, P., Hearst, M., 2005.
Search engine statistics beyond the n-gram: application to noun compound bracketing. In: Proceedings of Computational Natural Language Learning (CoNLL-2005), pp. 17–24.
Patwardhan, S., Banerjee, S., Pedersen, T., 2005. SenseRelate::TargetWord - a generalized framework for word sense disambiguation. In: Proceedings of ACL Interactive Poster and Demonstration Sessions.
Paul, S., Mathur, P., Kishore, S., 2010. Syntactic construct: an aid for translating English nominal compound into Hindi. In: Proceedings of NAACL HLT Workshop on Extracting and Using Constructions in Computational Linguistics, pp. 32–38.
Potthast, M., Trenkmann, M., Stein, B., 2010. Netspeak: assisting writers in choosing words. In: Proceedings of 32nd European Conference on Advances in Information Retrieval (ECIR 10), Lecture Notes in Computer Science. Springer, 672–672.
Pustejovsky, J., Anick, P., Bergler, S., 1993. Lexical semantic techniques for corpus analysis. Comput. Linguist. 19 (2), 331–358.
Rackow, U., Dagan, I., Schwall, U., 1992. Automatic translation of noun compounds. In: Proceedings of COLING, pp. 1249–1253.
Resnik, P., (Ph.D. thesis) 1993. Selection and Information: A Class-based Approach to Lexical Relationships. University of Pennsylvania, pp. 126–131.
Rosario, B., Hearst, M., 2001. Classifying the semantic relations in noun compounds via a domain-specific lexical hierarchy. In: Proceedings of Empirical Methods in Natural Language Processing (EMNLP-2001), pp. 82–90.
Rose, T., Stevenson, M., Whitehead, M., 2002. The Reuters corpus volume 1 - from yesterday’s news to tomorrow’s language resources. In: Proceedings of Language Resources and Evaluation (LREC 2002), pp. 827–833.
Shahzad, I., Ohtake, K., Masuyama, S., Yamamoto, K., 1999. Identifying translations of compound nouns using non-aligned corpora. In: Proceedings of Workshop MAL, pp. 108–113.
Sinha, R.M.K., Sivaraman, K., Agrawal, A., Jain, R., Srivastava, R., Jain, A., 1995.
AnglaBharti: a multilingual machine aided translation project on translation from English to Hindi. In: Proceedings of IEEE International Conference Systems, Man and Cybernetics. IEEE Press, pp. 609–614.
Tanaka, T., Baldwin, T., 2003a. Noun–noun compound machine translation: a feasibility study on shallow processing. In: Proceedings of ACL-2003 Workshop on Multiword Expression: Analysis, Acquisition and Treatment, pp. 17–24.
Tanaka, T., Baldwin, T., 2003b. Translation selection for Japanese–English noun–noun compounds. In: Proceedings of Machine Translation Summit (MT Summit IX), pp. 378–385.
Tanaka, T., Yoshihiro, M., 1999. Extraction of translation equivalents from non-parallel corpora. In: Proceedings of Theoretical and Methodological Issues in Machine Translation, pp. 109–119.
Toutanova, K., Klein, D., Manning, C., Singer, Y., 2003. Feature-rich part-of-speech tagging with a cyclic dependency network. In: Proceedings of HLT-NAACL 2003, pp. 252–259.
Warren, B., 1978. Semantic Patterns of Noun-Noun Compounds. Gothenburg Studies in English 41. Acta Universitatis Gothoburgensis, Gothenburg.

Renu Balyan is a research student in the Department of Mathematics, IIT Delhi. She holds M.Phil. (CS) and MCA degrees. Her areas of interest include machine translation, information extraction and data structures. She has worked as a research fellow and project engineer with the Centre for Development of Advanced Computing, Noida, for almost 6 years. She has also worked as an intern with Dublin City University, Ireland. She has published 14 papers in national and international conferences.

Niladri Chatterjee is a Professor of Statistics and Computer Science in the Department of Mathematics, IIT Delhi. His primary research areas are Natural Language Processing, Semantic Web and Statistical Modeling. His association with IIT Delhi spans close to 15 years. Prior to that he worked in the Dept. of Computer Science, University College London, and at the Indian Statistical Institute, Calcutta. In 2010 he was a Visiting Scientist in the Dipartimento di Informatica, University of Pisa, Italy. He has over 70 publications in international and national journals and conferences. He was the Organizing Chair of “CICLING–2012” in March 2012.