Translating noun compounds using semantic relations

Computer Speech and Language 32 (2015) 91–108
Renu Balyan ∗ , Niladri Chatterjee 1
Department of Mathematics, IIT Delhi, Hauz Khas, New Delhi, India
Received 20 March 2014; received in revised form 13 September 2014; accepted 22 September 2014
Available online 2 October 2014
Abstract
Despite having a research history of more than 20 years, English to Hindi machine translation often suffers badly from incorrect
translations of noun compounds. The problems encountered are of various types, such as the absence of proper postpositions,
inappropriate word order, and incorrect semantics. Existing English to Hindi machine translation systems show this vulnerability irrespective of the underlying technique. A potential solution to this problem lies in understanding the semantics of the
noun compounds. The present paper proposes a scheme based on semantic relations to address this issue. The scheme works in
three steps: identification of the noun compounds in a given text, determination of the semantic relationship(s) between them, and
finally, selecting the right translation pattern. The scheme provides translation patterns for different semantic relations for 2-word
noun compounds first. These patterns are used recursively to find the semantic relations and the translation patterns for 3-word and
4-word noun compounds. Frequency and probability based adjacency and dependency models are used for bracketing (grouping) the
constituent words of 3-word and 4-word noun compounds into 2-word noun compounds. The semantic relations and the translation
patterns generated for 2-word, 3-word and 4-word noun compounds are evaluated. The proposed scheme is compared with some
well-known English to Hindi translators, viz. AnglaMT, Anuvadaksh, Bing, Google, and also with the Moses baseline system. The
results obtained show significant improvement over the Moses baseline system. The scheme also performs better than the other online MT
systems in terms of recall and precision.
© 2014 Elsevier Ltd. All rights reserved.
Keywords: Noun compounds; Semantic relation; Translation pattern; Bracketing; Machine translation
1. Introduction
A compound noun is a noun that is made up of two or more words, such as, noun + noun (e.g. water tank, football),
adjective + noun (e.g. full moon, blackboard), verb + noun (e.g. washing machine, swimming pool), noun + verb (e.g.
haircut, rainfall), preposition + noun (e.g. underworld). Single (1-word) compound nouns, such as football, blackboard
and rainfall are mostly a part of the lexicon and hence the translation of these single compound nouns is obtained from
the lexicon. In this paper we consider compound nouns that are sequences of nouns occurring as separate words.
Henceforth, we refer to these compound nouns as noun compounds (NCs). Noun compounds can
be freely constructed in English (Jones, 1983), and sometimes these can be quite long. For illustration consider NCs
such as “colon cancer tumor suppressor protein” and “water meter cover adjustment screw”, which are made of 5 words.
This paper has been recommended for acceptance by Prof. R. K. Moore.
∗ Corresponding author. Tel.: +91 9891830407.
E-mail addresses: [email protected] (R. Balyan), [email protected] (N. Chatterjee).
1 Tel.: +91 1126591490.
http://dx.doi.org/10.1016/j.csl.2014.09.007
Noun compounds form an important and substantial part of an English corpus. For illustration, the written portion
of British National Corpus (BNC) consisting of 84M words (Burnard, 2000) has 2.6% bigram nominal compounds
and the Reuters corpus consisting of 108M words (Rose et al., 2002) has 3.9% (Kim and Baldwin, 2005). It is also
estimated that 88.4% of noun compounds in the Wall Street Journal section of the Penn Treebank are binary i.e. 2-word
(Baldwin and Tanaka, 2004).
As a consequence, any machine translation (MT) system needs to handle English noun compounds carefully
while translating them into a target language. This is particularly true for a language like Hindi, where the nouns of an
English noun compound are often separated by postpositions after translation, and inappropriate use of these postpositions may render
the translation syntactically and/or semantically incorrect. This can be observed by analyzing the outputs of different
English to Hindi machine translation systems available online. The present study, in particular, is based on five such
systems, viz. AnglaMT,2 Anuvadaksh,3 Bing,4 Google5 and MaTra6 translators. Our analysis finds that several different
types of problems may be encountered while translating the English noun compounds into Hindi. Some major problems
in this regard are:
• Omission of appropriate postpositions: In certain cases simple juxtaposition of nouns does not convey the semantics in
Hindi. They need to be separated by appropriate postpositions. Omitted or incorrect postpositions in the translation
of the noun compounds may result in erroneous semantics. For example: “mustard oil” ∼ sarson (mustard) kaa (of)
tel (oil); “body ache” ∼ shareer (body) mein (in) dard (ache). The postpositions that convey the correct sense in the
two cases discussed above are kaa and mein respectively.
• Use of a single word: Often the Hindi translation of an English noun compound comprising two or more nouns
is a single word. Any attempt to translate the nouns separately with a separator word (such as postpositions) is
erroneous. For illustration, “boys’ hostel” ∼ chhaatraavaas, “cow dung” ∼ gobar, “blood pressure” ∼ raktchaap
and “wine bar” ∼ madhushaalaa. In some cases the translated Hindi word is a completely new word, in no
way related to the constituent nouns of the noun compound, whereas in other cases a single word is formed by
concatenating or combining the translations of the constituent nouns of the noun compound.
• Bracketing (Sequential grouping): As the number of nouns in a noun compound increases, the number of possible
combinations also increases. For illustration, the following 3-word NCs “olympic gold medal” and “rolled gold
medal” have very similar structure, but have different grouping of the nouns as their semantic structures are different:
- olympic gold medal → (olympic (gold medal)).
- rolled gold medal → ((rolled gold) medal).
For a 3-word NC “N1 N2 N3” there are two possibilities, viz. (N1 (N2 N3)) and ((N1 N2) N3), of grouping the nouns;
while for a 4-word NC there are 6 possibilities. In a similar vein, there exist 13 different ways of grouping a 5-word NC.
According to Nakov (2013), longer noun compounds (i.e. no. of terms ≥ 5) are rarely dealt with except in scientific
texts and technical literature. Thus, in this work we restrict ourselves to noun compounds with a maximum length
of 4. However, appropriate grouping of the nouns of a noun compound and preserving its semantics still poses a
big challenge for any English to Hindi machine translation system.
The present work focuses on the development of a scheme for efficient handling of the above issues corresponding
to English to Hindi machine translation. The necessity also arises from the fact that the above shortcomings are not due
to the underlying translation paradigm; rather the problem is inherent in the language characteristics. For elucidation,
the systems considered in this work follow different translation schemes. While AnglaMT (Sinha et al., 1995) is a
rule-based system, Google and Bing use statistical approaches, and Anuvadaksh and MaTra (Ananthakrishanan et al.,
2006) are hybrid MT systems – yet each of the systems suffers from the above problems. Consequently, interpretation
of noun compounds assumes a very important role in English to Hindi machine translation.
2 http://www.tdil-dc.in/.
3 http://www.tdil-dc.in/.
4 http://www.bing.com/translator/.
5 http://translate.google.com/.
6 http://www.cdacmumbai.in/matra/.
In this work we develop a scheme based on semantic relations for interpretation of the noun compounds. The
intuition behind the proposed scheme is that an appropriate set of rules can be generated in a supervised way if the right
semantic relation can be established between the nouns of a noun compound. Here the term “semantic relation” implies
the underlying relation between the head noun, and the other nouns of the noun compound. Proper interpretation of a
noun compound can be made with the help of discovering the semantic relations between the constituent nouns.
Most related research has proposed semantic labels based on the theory that a compound noun expresses
one of a small number of covert semantic relations (Barker and Szpakowicz, 1998). Levi (1978) offers nine semantic
labels representing underlying predicates deleted during compound formation. Warren (1978) describes a multi-level
system of semantic labels for noun-noun relationships. We have used some of these semantic labels for this work.
The paper is organized as follows: Section 2 looks at some of the related works in this domain. Section 3 discusses
the proposed scheme. It also discusses noun compound identification, and the semantic relations. Section 4 describes
the algorithms for semantic relation identification and translation pattern generation for a 2-word noun compound.
Section 5 deals with handling of the 3-word and 4-word noun compounds. The experimental setup and results are given
in Section 6. Section 7 concludes the paper.
2. Related work
Noun compound interpretation has been studied in the context of various applications including question-answering
and machine translation (Baldwin and Tanaka, 2004; Cao and Li, 2002; Lauer, 1995; Moldovan et al., 2004). Works on
automatic/semi-automatic interpretation of NCs by Lapata (2002), Rosario and Hearst (2001), Moldovan et al. (2004),
Kim and Baldwin (2005) either made assumptions about the scope of semantic relations, or restricted the domain of
interpretation. However, the methods using verbs and verb semantics for interpreting noun compounds avoid any such
assumption, and outperform the above methods (Kim and Baldwin, 2006; Nakov, 2008; Nakov and Hearst, 2006). Kim
and Baldwin (2013a)’s work is based on the lexical similarity with tagged noun compounds, where the lexical similarity
measures are derived from the WordNet. Kim and Baldwin (2013b) investigate word sense distributions in the noun
compounds. They disambiguate the word sense of the component words in the noun compounds, by investigating
“semantic collocation” between them. On the other hand, Nakov (2013) uses syntax and semantics for noun compound
interpretation.
One of the earlier works of Rackow et al. (1992) used a combination of linguistic rules and statistical data for
resolving ambiguities for German to English compound translations. Maalej (1994) has studied English to Arabic
machine translation of noun compounds. Shahzad et al. (1999) used non-aligned corpora for identifying translations
of the noun compounds, which is an extension of the work carried out by Tanaka and Yoshihiro (1999). A feasibility
study on the ability of the shallow parsing methods used for translating Japanese and English noun compounds has
been discussed in Tanaka and Baldwin (2003a). Tanaka and Baldwin (2003b) used Japanese-English as a test case for
a compositional translation method which makes use of a word level translation lexicon and monolingual corpus data.
Baldwin and Tanaka (2004) proposed a support vector learning based method employing a target language corpus,
and a bilingual dictionary data for English to Japanese and vice versa. The work of Tanaka and Baldwin was further
extended by Bungum and Oepen (2009) for translating Norwegian nominal compounds into English and by using
different ranking strategies for obtaining high quality translations.
However, it is observed that most of the works carried out so far have focused on 2-word noun compounds; higher
word order (3-word, 4-word) noun compounds are rarely studied. Further, this problem has hardly been studied for
the Indian languages. The only work taken up for nominal compounds translation for English to Hindi is discussed by
Paul et al. (2010) and Mathur and Paul (2009). Mathur and Paul follow a template based corpus search approach as
used by Bungum and Oepen (2009) and Baldwin and Tanaka (2004). However, their system, unlike others, attempts
to select the correct sense of nominal components by running a word sense disambiguation system by Patwardhan
et al. (2005) on the source language. Paul et al. (2010) have used only eight prepositions given by Lauer (1995) for
paraphrasing the noun compounds, and translating them by using one-to-one mapping from the English prepositions
to Hindi postpositions. However, this approach is not very robust as some English prepositions may have more than
one Hindi translation, and the correct one can be determined only from the context. Hence in this study we have used
“verb + preposition(s)” for paraphrasing the noun compounds instead of using only the prepositional paraphrases. We
have developed a scheme for handling 2-word noun compounds; and extended the idea for 3-word and 4-word English
noun compounds as well.
[Figure 1: flowchart with the components Tagger, NC Extractor, bracketing step (extracts 2-word NC groups N1N2 from a 3/4-word NC), Paraphrase Generator (verbs and prepositions), Web Search, Verb Extractor, SR(s) Finder (SR: seed verbs) and TP(s) Finder (TP: SRs); the input is an English sentence and the output is the TP(s) for the NC (N1N2).]
Fig. 1. Proposed scheme for handling translation of noun compounds (NC: noun compound, SR: semantic relation and TP: translation pattern).
3. Noun compound identification and semantic relations
The proposed approach consists of three steps to solve the problem of translation of noun compounds. These steps
are:
• Identification of a noun compound: a tagger has been used for this, and the part-of-speech information given by the
tagger is used for identifying a noun compound.
• Interpretation of the semantic relation between the nouns in a noun compound: the semantic relations have been
identified by paraphrasing the noun compounds using verbs and preposition(s). A set of “seed verbs” has been
identified for representing each semantic relation.
• Generation of the translation pattern(s) for a noun compound: the translation patterns are generated based on the
semantic relation occurring between the nouns of the noun compound.
Fig. 1 provides a schematic representation of the proposed scheme.
We have used a semi-supervised approach for noun compound identification. Here, we have used the Stanford tagger
(Toutanova et al., 2003) for tagging a document. Any consecutive sequence of words tagged as noun (NN or NNS) by
the tagger is considered as a noun compound. A corpus comprising 15,000 English sentences has been used for this
purpose. The corpus was cleaned, split into sentences using a sentence boundary detection program, and tagged using the Stanford
tagger. The procedure resulted in 22,074 occurrences of noun compounds in total, comprising 16,606 different noun
compounds. Of these, 14,015 noun compounds occurred only once in the corpus. However, 125 noun compounds
were found to have been incorrectly identified as NCs. Table 1 summarizes the findings. The small proportion of incorrect
noun compounds in comparison with the correctly identified ones indicates that one can depend on the tagger’s output
for noun compound identification.
Table 1
Statistics of noun compounds in the corpus.
# Sentences | 2-word NCs | 3-word NCs | 4-word NCs | NCs > 4-word | Incorrect NCs
15,000 | 13,067 | 2,815 | 571 | 153 | 125
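The identification rule described above (any consecutive run of words tagged NN or NNS is a noun compound) can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the sentence has already been tagged and is given as a list of (word, tag) pairs, and the function name `extract_noun_compounds` is hypothetical.

```python
from typing import List, Tuple

# Penn Treebank noun tags treated as NC constituents in the text.
NOUN_TAGS = {"NN", "NNS"}

def extract_noun_compounds(tagged: List[Tuple[str, str]]) -> List[List[str]]:
    """Return maximal runs of two or more consecutive NN/NNS tokens."""
    compounds: List[List[str]] = []
    run: List[str] = []
    for word, tag in tagged:
        if tag in NOUN_TAGS:
            run.append(word)
        else:
            if len(run) >= 2:      # a single noun is not a compound
                compounds.append(run)
            run = []
    if len(run) >= 2:              # flush a run that ends the sentence
        compounds.append(run)
    return compounds

# Hand-tagged example sentence (tags written by hand, not tagger output).
tagged_sentence = [("The", "DT"), ("water", "NN"), ("tank", "NN"), ("is", "VBZ"),
                   ("near", "IN"), ("the", "DT"), ("hostel", "NN"), (".", ".")]
print(extract_noun_compounds(tagged_sentence))
```

Because the rule relies only on tags, any tagging error (for example an adjective tagged as NN) propagates directly into the extracted list, which is the source of the incorrect NCs counted in Table 1.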
Once the noun compounds are identified, the next step is to establish the semantic relation (SR) between the constituent
words. However, there has been little agreement among researchers on the kind of relations that can hold between
the two nouns in a noun compound. This is evident from the different sets of relations that have been used in
the literature. For illustration,
• Abstract relations, such as, Agent, Location, Instrument, suggested in Barker and Szpakowicz (1998), Finin (1980),
Girju et al. (2005), Kim and Baldwin (2005, 2008), Moldovan et al. (2004), Rosario and Hearst (2001).
• Generalized prepositions, such as, Of, For, In, proposed by Lauer (1995).
• Recoverably deletable predicates (RDPs) to interpret semantic relations in the NCs (Levi, 1978). Semantic relations,
such as Have, Make, Be, From, For, In are examples of this kind.
For the present work we have used a set of 20 semantic relations which are taken from the existing literature. These are
Agent, Beneficiary, Cause, Container, Content, Equative, Instrument, Location, Material, Possessor, Product, Purpose,
Result, Source, Time, Topic, Experiencer, Specialization, Attribute-Transfer and Use.
We excluded some of the semantic relations found in literature (e.g. Extent, Probability, Frequency, Influence,
Synonymy, Possibility) due to their lack of instances. Another important semantic relation, viz. Property, has also been
ignored as this is satisfied primarily by a combination of “adjective + noun” or “proper noun + common noun” pairs.
For example “blue car”, “Delhi city”.
Section 4 describes the scheme used for semantic relation identification and translation pattern generation for the 2-word noun compounds. Section 5 discusses the bracketing issues, and how this 2-word scheme can be used recursively
for generating translation patterns for 3-word and 4-word noun compounds.
4. Semantic relation identification and translation patterns for 2-word NCs
In order to find the translation pattern for a 2-word noun compound we first need to find the semantic relation
between the two nouns of the noun compound.
4.1. Semantic relation identification
As a semantic relation can be represented and is dominated by a set of verbs (Nakov and Hearst, 2006; Nakov,
2008), the proposed scheme tries to uncover the relationship between two noun pairs by rewriting or paraphrasing the
noun compounds as a phrase that contains a verb and one or more preposition(s).
For illustration, the noun compound “family car” can be represented by the following paraphrases: “car owned by
family”, “car possessed by family”, “car belonging to family”. The verbs ‘own’, ‘possess’ and ‘belong’ along with
the prepositions ‘by’ and ‘to’ provide an evidence for the presence of the semantic relation: Possessor. Similarly, the
noun compound “olive oil” can be represented by the following paraphrases: “oil obtained from olive”, “oil made from
olive”, “oil coming from olive”. These constructs indicate the presence of semantic relation: Material between the two
nouns. Thus, each semantic relation is represented by a group of verbs, called seed verbs. These seed verbs have been taken
from Nakov and Hearst (2006), and the shared Task7 data 2008. A verb assigned to one semantic relation
may also belong to other semantic relations. A set of 728 seed verbs
and 30 prepositions has been identified for the purpose of paraphrasing. Table 2 presents some examples of the seed
verbs assigned to the semantic relations.
7 http://multiword.sourceforge.net/.
Table 2
Seed verbs associated with the semantic relations.
S. No. | Semantic relation | Seed verbs with suitable prepositions
1 | Cause | Cause, promote, lead to, result in, generate, create, carry, spread, transmit, bring, infect, responsible for, give, pass
2 | Experiencer | Spread, acquire, suffer from, die of, develop, contract, catch, diagnosed of, have, beat, infected by, survive from, get, pass, fall, transmit, avoid
3 | Possessor | Own, owned by, possess, possessed by, have, belong to, related to, borrow, take, grant, request
4 | Product | Produce, make, manufacture, build, assemble, create
5 | Time | Arrive in, leave at, conducted in, occur in, happen during, experience in
6 | Material | Made of, made from, contain, originate from, composed of, produced from
7 | Source | Come from, caused by, induced by, relate to, arise from, result from, generated by
8 | Purpose | Cure, relieve, treat, help with, reduce, heal, prevent, prescribed for, block, control, end, intended for
9 | Container | Contained in, created in, built in, built for, provided in, experienced in, included by
10 | Location | Live in, work on, come from, work in, reside in, located in, bred in, kept in, made in, born from
[Figure 2: the noun compound (NC) N1N2 “anthrax death”, together with the verbs and prepositions, is passed to the Paraphrase Generator, which outputs paraphrase candidates: death act as anthrax; death aid as anthrax; death arise from anthrax; death caused by anthrax; death caused from anthrax; death consist of anthrax; death result from anthrax; death supported by anthrax.]
Fig. 2. Paraphrase generation.
For a 2-word noun compound the paraphrases are formed using these seed verbs and prepositions. Paraphrase
generation for a noun compound, viz. anthrax death, is shown in Fig. 2.
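The paraphrase generation step can be illustrated with a short sketch. It assumes, as the candidates for “anthrax death” suggest, that each candidate has the form “N2 verb preposition N1”; the function name `generate_paraphrases` is hypothetical, and the verbs are used in whatever inflected form the seed list supplies.

```python
from itertools import product
from typing import List

def generate_paraphrases(n1: str, n2: str, verbs: List[str],
                         prepositions: List[str]) -> List[str]:
    """Build every 'N2 verb preposition N1' candidate for the NC 'N1 N2'."""
    return [f"{n2} {verb} {prep} {n1}" for verb, prep in product(verbs, prepositions)]

# Two verbs and two prepositions give four candidates, including
# implausible ones that are expected to have a low web frequency.
candidates = generate_paraphrases("anthrax", "death", ["caused", "arise"], ["by", "from"])
for phrase in candidates:
    print(phrase)
```

Over-generation is harmless in this setting because ungrammatical or implausible candidates are filtered out later by their low web frequency.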
For identifying the semantic relation between the nouns of a noun compound, paraphrases are generated and web
frequency of these paraphrases is found using search engines and the Netspeak8 web service (Potthast et al., 2010).
The top 15 paraphrases in terms of web frequencies are identified, and the verb parts of these paraphrases are extracted.
The semantic relation whose seed verb group contains the largest number of these extracted verbs is selected, since
this group of verbs best represents the relation existing between the two
nouns of the noun compound. The algorithm for semantic relation identification between the two nouns in a 2-word
noun compound is described in Algorithm 1.
Algorithm 1 (Semantic relation identification for a 2-word English noun compound).
Input: A 2-word English noun compound, seed verbs, prepositions.
Output: The semantic relation(s) for the noun compound.
1. The 2-word NC (N1 N2 ) is split in its two nouns (N1 and N2 )
2. Using the two nouns (N1 and N2 ) do:
2.1 Form the paraphrases using the seed verbs, the prepositions and the nouns
2.2 (a) Find the web frequency of all the paraphrases using search engine and Netspeak
(b) Find the top 15 paraphrases having highest frequencies
(c) Find the verbs forming these 15 paraphrases obtained from (b)
(d) Find the semantic relation(s) having the maximum number of verbs extracted in (c)
2.3 Return the semantic relation(s)
8 http://www.netspeak.eu/.
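Steps 2.2(b) to 2.2(d) of Algorithm 1 can be sketched as below. This is an illustration under stated assumptions: the web lookup is replaced by a dictionary of made-up frequencies keyed by verb phrase, only a few seed verbs from Table 2 are included, and the names `identify_relations`, `SEED_VERBS` and `toy_freq` are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Set

# A small excerpt of the seed verbs in Table 2 (lowercased).
SEED_VERBS: Dict[str, Set[str]] = {
    "Cause": {"cause", "promote", "lead to", "result in"},
    "Source": {"come from", "caused by", "arise from", "result from"},
}

def identify_relations(paraphrase_freq: Dict[str, int],
                       seed_verbs: Dict[str, Set[str]],
                       top_k: int = 15) -> List[str]:
    """Return the relation(s) whose seed verbs occur most often among the
    verbs of the top_k most frequent paraphrases."""
    ranked = sorted(paraphrase_freq.items(), key=lambda item: item[1], reverse=True)
    top_verbs = [verb for verb, freq in ranked[:top_k] if freq > 0]
    counts = Counter({relation: sum(verb in verbs for verb in top_verbs)
                      for relation, verbs in seed_verbs.items()})
    best = max(counts.values(), default=0)
    if best == 0:
        return []                      # no seed verb matched
    return sorted(relation for relation, count in counts.items() if count == best)

# Made-up frequencies for the paraphrases of one noun compound.
toy_freq = {"caused by": 900, "result from": 400, "arise from": 120,
            "lead to": 50, "consist of": 3}
print(identify_relations(toy_freq, SEED_VERBS))
```

Returning a list rather than a single label keeps ties visible, which matches the "semantic relation(s)" output of Algorithm 1.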
Table 3
Semantic relations and the Hindi translation patterns: N1T and N2T are the translations of N1 and N2.
Semantic relation | Definition | Examples | Hindi translation patterns
Possessor | N1 has N2 – N1 is owner | Company car, family estate, girl mouth, child foot | N1T + kaa/ke/ki + N2T
Possessor | N1 has N2 – N1 is borrower | Student loan, national debt | N1T N2T
Source | N1 is the source of N2 and N1 is not a body part | Northern wind, foreign capital | N1T N2T
Source | N1 is the source of N2 but N1 is a body part | Chest pain, stomach ache, heart attack | N1T + mein + N2T
Equative9 | N1 is also head – N1, N2 are human | Composer arranger, player coach, lady doctor | N1T N2T
Experiencer | N2 experiences N1 (an animated entity experiencing a state/feeling) | Heart patient, cancer patient | N1T + kaa/ke/ki + N2T
Specialization | N1 is specialization of N2 but N1 and N2 are human | Boy child, girl child, baby boy, baby girl | Single word
Specialization | NC is specialization of N2 | Fighter planes, war ships | N1T N2T
Attribute-Transfer | Salient attribute of N1 is transferred to N2 | Iron will, crescent wrench, lion heart, doe eye, chicken heart | N1T + jaise + N2T + vaalaa or single word
Use | N2 uses N1 | Laser printer, water gun, electron microscope | N1T + vaalaa/vaale/vaali + N2T
Use | N2 uses N1 and N1 is a concept | Faith cure, shock treatment, milieu therapy | N1T N2T
4.2. Translation pattern generation
Once the semantic relation between the nouns of a 2-word NC is identified, the translation pattern for the 2-word
NC is generated. In order to achieve this objective 5 semantic relation groups have been formed based on the translation
patterns of the NCs with these semantic relations. All the semantic relations belonging to one group are represented by
the same translation pattern. The semantic relation Attribute-Transfer has not been allotted to any of the groups as it
has a special pattern and works differently from others. The semantic relation groups formed on the basis of translation
patterns are as follows:
• SR group1 contains those semantic relations that do not require any postposition between the two nouns in their
Hindi translations. The translation is obtained by simple juxtaposition of their respective Hindi translations. The
semantic relations belonging to this group are: Beneficiary, Equative, Instrument, Location, Possessor, Product,
Purpose, Source, Topic, Specialization, Use.
• SR group2 comprises semantic relations whose Hindi translations involve one of the postpositions “kaa/ke/kii”
between the nouns. The form to be used out of these three (kaa/ke/kii) is determined on the basis of gender
and number of the head noun in the noun compound. The semantic relations belonging to this group are: Agent,
Container, Material, Possessor, Result, Experiencer.
• SR group3 consists of semantic relations whose translation pattern requires one of the postpositions “vaalaa/vaale/vaalii”
between the nouns. The gender and number of the head noun in the noun compound help in determining
the form to be used out of these three (vaalaa/vaale/vaalii). Semantic relations belonging to this group are: Content,
Time, Specialization, Use.
• Translation patterns of the semantic relations belonging to SR group4 involve the postposition “mein”. The set
consists of only one semantic relation, viz. Source.
• SR group5 consists of the semantic relation Cause and the postposition used by this semantic relation is “se”.
9 In this semantic relation there are a number of cases where a kin relationship may exist between the two nouns. Such cases are rare in English
but very common in Hindi, and thus need to be considered for Hindi to English translation of NCs. Some examples that can illustrate this point are:
“bhaaii bahan ∼ brother sister”, “betaa betii ∼ son daughter” etc.
Four semantic relations, viz. Possessor, Source, Specialization and Use have been grouped into two different SR
groups. These semantic relations have different translation patterns depending upon the condition satisfied as shown in
Table 3. Thus, if a noun compound has any of these 4 semantic relations between the nouns then the translation pattern
is decided by referring to the lexical semantic category of the nouns as mentioned in the modified definitions given in
Table 3. However it is not possible to determine whether the noun is a borrower or an owner (required for Possessor
semantic relation) from the lexical database. In order to solve this problem we resort to the specific verbs (that help
in deciding the semantic relation, see Table 2), and their web frequencies and probabilities. The translation pattern is
decided accordingly.
The semantic interpretation allows us to form rules for generating the translation patterns for different NCs. Table 3
gives a description along with examples of some of these semantic relations. It is clear from Table 3 that in order to get
the correct translation patterns we had to modify the definitions of some of the existing semantic relations to some extent.
The methodology used for generating the translation pattern(s) for a 2-word NC is described in Algorithm 2.
Algorithm 2 (Translation pattern generation for a 2-word English noun compound).
Input: A 2-word English noun compound.
Output: The Hindi translation pattern for the noun compound.
Notation: N1T and N2T are the Hindi translations of the English nouns N1 and N2 respectively.
1. Split the 2-word NC, (N1 N2 ) into two nouns N1 and N2
2. Using the 2-word NC do:
2.1 Determine the SR between the nouns N1 and N2 in the NC using Algorithm 1
2.2 Determine the SR group to which the SR belongs
2.2.1 Switch SR group
Case SR group1:
Hindi Translation Pattern is “N1T N2T ”.
Case SR group2:
a. Determine the gender and number of the head noun, N2 from the lexicon
b. If the gender of N2 is feminine
Hindi Translation Pattern is “N1T + kii + N2T ”.
c. If the gender of N2 is masculine and the number is singular
Hindi Translation Pattern is “N1T + kaa + N2T ”.
d. If the gender of N2 is masculine and the number is plural
Hindi Translation Pattern is “N1T + ke + N2T ”.
Case SR group3:
a. Determine the gender and number of the head noun, N2 from the lexicon
b. If the gender of N2 is feminine
Hindi Translation Pattern is “N1T + vaalii + N2T ”.
c. If the gender of N2 is masculine and the number is singular
Hindi Translation Pattern is “N1T + vaalaa + N2T ”.
d. If the gender of N2 is masculine and the number is plural
Hindi Translation Pattern is “N1T + vaale + N2T ”.
Case SR group4:
Hindi Translation Pattern is “N1T + mein + N2T ”.
Case SR group5:
Hindi Translation Pattern is “N1T + se + N2T ”.
2.3 If the SR between N1 and N2 is Attribute-Transfer
Then Hindi Translation Pattern is “N1T + jaise + N2T + vaalaa”.
2.4 Return the Translation Pattern
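The switch in Algorithm 2 can be sketched in Python as follows. This is a minimal illustration under assumptions: the SR groups follow the lists in Section 4.2, the gender and number of the head noun are passed in directly instead of being looked up in a lexicon, and relations that appear in two groups (Possessor, Source, Specialization, Use) require the caller to supply the group that the conditions of Table 3 select. The names `translation_pattern`, `SR_GROUPS` and `agree` are hypothetical.

```python
from typing import Optional

SR_GROUPS = {
    1: {"Beneficiary", "Equative", "Instrument", "Location", "Possessor", "Product",
        "Purpose", "Source", "Topic", "Specialization", "Use"},
    2: {"Agent", "Container", "Material", "Possessor", "Result", "Experiencer"},
    3: {"Content", "Time", "Specialization", "Use"},  # Time is the temporal relation
    4: {"Source"},
    5: {"Cause"},
}

def agree(masc_sg: str, masc_pl: str, fem: str, gender: str, number: str) -> str:
    """Pick the postposition form agreeing with the head noun N2."""
    if gender == "f":
        return fem
    return masc_sg if number == "sg" else masc_pl

def translation_pattern(relation: str, gender: str = "m", number: str = "sg",
                        group: Optional[int] = None) -> str:
    if relation == "Attribute-Transfer":            # step 2.3: special pattern
        return "N1T + jaise + N2T + vaalaa"
    if group is None:
        groups = [g for g, rels in SR_GROUPS.items() if relation in rels]
        if len(groups) != 1:
            raise ValueError(f"cannot pick a unique SR group for {relation!r}: {groups}")
        group = groups[0]
    if group == 1:
        return "N1T N2T"
    if group == 2:
        return f"N1T + {agree('kaa', 'ke', 'kii', gender, number)} + N2T"
    if group == 3:
        return f"N1T + {agree('vaalaa', 'vaale', 'vaalii', gender, number)} + N2T"
    if group == 4:
        return "N1T + mein + N2T"
    if group == 5:
        return "N1T + se + N2T"
    raise ValueError(f"unknown SR group: {group}")

print(translation_pattern("Material", gender="m", number="sg"))
print(translation_pattern("Source", group=4))
```

Raising an error for an ambiguous relation, instead of silently choosing a group, keeps the extra decision about lexical semantic category explicit in the calling code.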
In certain cases there may be multiple translations in the lexicon for a noun in the noun compound. In order to select
the best translation of the noun, the disambiguation tool SenseRelate::TargetWord10 has been used. It disambiguates
a target word with respect to its context by finding the sense that is most related to its neighbors according to a
WordNet::Similarity measure of relatedness (Patwardhan et al., 2005). Once the sense of the word is computed, the
keywords in the definition of the computed sense are compared with the definition terms for that word in the lexicon. The
translation whose definition matches the maximum number of keywords is selected as the most appropriate translation.
Also, it has been observed that the noun components of a noun compound may indicate the presence of more than one
semantic relation. This may lead to the generation of multiple translation patterns for the noun compound. In case
the translation pattern cannot be identified using the scheme, we use the transliterated forms of the two nouns,
and they are juxtaposed without any intervening postposition. Some examples of this kind are “safety pin”,
“wall hanging”, “miss world” etc.
10 http://search.cpan.org/dist/WordNet-SenseRelate-TargetWord/.
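The keyword-matching step used to choose among several lexicon translations can be sketched as a simple overlap count. The sense keywords and lexicon definitions below are invented for illustration, the disambiguation tool itself is not called, and the function name `pick_translation` is hypothetical.

```python
from typing import Dict, Set

def pick_translation(sense_keywords: Set[str],
                     lexicon_entries: Dict[str, Set[str]]) -> str:
    """Return the translation whose definition terms share the most
    keywords with the definition of the disambiguated sense."""
    return max(lexicon_entries,
               key=lambda translation: len(sense_keywords & lexicon_entries[translation]))

# Invented data: keywords of the computed sense of "bank" and the definition
# terms attached to two candidate Hindi translations in a toy lexicon.
sense_keywords = {"financial", "institution", "money"}
lexicon_entries = {
    "baink": {"financial", "institution", "deposit"},
    "kinaaraa": {"river", "slope", "land"},
}
print(pick_translation(sense_keywords, lexicon_entries))
```

With equal overlaps, `max` returns the first entry in dictionary order, so a real system would need an explicit tie-breaking rule.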
For a 3-word and a 4-word noun compound we first resolve the bracketing issue, and then the scheme used for
2-word noun compounds is applied recursively to obtain the right translation pattern(s). This is described in detail in
the next Section.
5. Bracketing and translation patterns for 3-word and 4-word NCs
For a noun compound consisting of 3 nouns (say N1 , N2 and N3 ) one of the following three cases is possible:
(a) Right Bracketing: N1 modifies the result of N2 modifying N3 ∼ (N1 (N2 N3)),
(b) Left Bracketing: the result of N1 modifying N2 in turn modifies N3 ∼ ((N1 N2) N3),
(c) No Bracketing: N1, N2 and N3 are independent ∼ (N1 N2 N3).
According to Finin (1980), structure (b) is more preferred in English as compared to (a) and structure (c) is more
preferred in long sequences.
5.1. Adjacency and dependency models for bracketing
Two models, viz. adjacency by Marcus (1980), Pustejovsky et al. (1993), Resnik (1993), and dependency by Lauer
(1995) have been used in the literature to decide a suitable bracketing (left or right):
• The adjacency model checks how strongly N2 modifies N3 as opposed to N1 N2 being a compound, to decide
the correct bracketing (Nakov and Hearst, 2005). If web frequency/probability of N1 N2 is greater than the web
frequency/probability of N2 N3 , then a left bracketing is predicted otherwise a right bracketing is predicted.
• The dependency model checks whether N1 modifies N3 as opposed to N1 modifying N2 . For this model a left
bracketing is predicted if web frequency/probability of N1 N2 is greater than the web frequency/probability of N1 N3
otherwise it is a right bracketing.
We have used both frequency and probability based approaches for each of dependency and adjacency models for
resolving bracketing related issues, and compared the results.
The web frequency for a 3-word NC “N1 N2 N3 ” is computed as follows: The web frequencies for the bigrams
(N1 N2 , N2 N3 ) and also the skip bigram, where we skip N2 to get a bigram (N1 N3 ), are computed from the web. In the
probability based approach we consider the NC “N1 N2 N3 ” and Pr (Na → Nb |Nb ) to be the conditional probability that
the word Na precedes a given fixed word Nb . Both models (adjacency and dependency) based on the two approaches
(web frequency and probability) are summarized in Table 4.
Table 4
Adjacency and dependency models for frequency and probability based approaches.
Model        Web frequency based           Probability based                        Bracketing
Adjacency    Freq(N1 N2) > Freq(N2 N3)     Pr(N1 → N2 | N2) > Pr(N2 → N3 | N3)      Left
Adjacency    Freq(N1 N2) < Freq(N2 N3)     Pr(N1 → N2 | N2) < Pr(N2 → N3 | N3)      Right
Dependency   Freq(N1 N2) > Freq(N1 N3)     Pr(N1 → N2 | N2) > Pr(N1 → N3 | N3)      Left
Dependency   Freq(N1 N2) < Freq(N1 N3)     Pr(N1 → N2 | N2) < Pr(N1 → N3 | N3)      Right
In the probability based approach, the probability Pr(Na → Nb | Nb) is estimated as #(Na Nb)/#(Nb), where #(Na Nb) and
#(Nb) are the corresponding bigram and unigram frequencies. The frequencies are taken to be the values returned
by the search engines and the Netspeak web service in response to queries for the exact phrase “Na Nb” and for the
word “Nb”. Two search engines, namely Bing and Google, and the Netspeak (NS) web service have been used to extract
the web frequencies of the bigrams and the unigrams of each NC. Two examples with the bracketing predictions for the
frequency and probability based adjacency and dependency models are shown in Tables 5 and 6. These tables show the
web frequencies and the probabilities obtained for two 3-word noun compounds, together with the bracketing
predicted from them.
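The decision rules of Table 4 and the probability estimate above can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' code: the function names are hypothetical, ties are sent to right bracketing because the text says "otherwise a right bracketing", and the example figures are the Google values for “hydrogen ion exchange” as read from Table 5.

```python
def conditional_probability(bigram_count, unigram_count):
    """Estimate Pr(Na -> Nb | Nb) as #(Na Nb) / #(Nb)."""
    if unigram_count <= 0:
        raise ValueError("unigram count must be positive")
    return bigram_count / unigram_count


def adjacency(score_n1n2, score_n2n3):
    """Adjacency model: compare N1 N2 against N2 N3."""
    return "left" if score_n1n2 > score_n2n3 else "right"


def dependency(score_n1n2, score_n1n3):
    """Dependency model: compare N1 N2 against N1 N3."""
    return "left" if score_n1n2 > score_n1n3 else "right"


# Google frequencies and probabilities for "hydrogen ion exchange" (Table 5).
freq = {"n1n2": 2_760_000, "n2n3": 5_380_000, "n1n3": 242_000}
prob = {"n1n2": 0.007976879, "n2n3": 0.005028037, "n1n3": 0.000226168}

predictions = {
    ("frequency", "adjacency"): adjacency(freq["n1n2"], freq["n2n3"]),
    ("frequency", "dependency"): dependency(freq["n1n2"], freq["n1n3"]),
    ("probability", "adjacency"): adjacency(prob["n1n2"], prob["n2n3"]),
    ("probability", "dependency"): dependency(prob["n1n2"], prob["n1n3"]),
}
```

The same two functions serve both approaches; only the scores passed in (raw frequencies or estimated probabilities) differ.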
It is clear from the example in Table 5 that the two models may contradict each other. In such a case we propose
to consider the majority approach, and decide the bracketing accordingly. Thus, in case of “Hydrogen Ion Exchange”
our scheme decides a left bracketing, as per the majority.
However, there are a few cases, such as, “watershed development planner” where the bracketing cannot be decided
on the basis of the majority. In such cases we consider both the alternatives (left as well as right bracketing). This
of course results in multiple translation patterns. Table 6 illustrates the handling of the case “watershed development
planner”. A detailed analysis of the results related to bracketing is presented in Section 6.
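The majority rule, including the fallback to both alternatives when the votes are split, might be expressed as in the following Python sketch. The function name `decide_bracketing` and the list-of-strings vote representation are assumptions made for illustration.

```python
from collections import Counter


def decide_bracketing(votes):
    """Return the bracketing(s) to use given the model predictions.

    votes: list of "left"/"right" predictions, e.g. one per
    (approach, model) combination.
    Returns a single-element list for a clear majority, or both
    alternatives when the votes are tied.
    """
    counts = Counter(votes)
    left, right = counts["left"], counts["right"]
    if left > right:
        return ["left"]
    if right > left:
        return ["right"]
    # No majority: keep both alternatives, which yields multiple
    # translation patterns downstream.
    return ["left", "right"]
```

For instance, three left votes against one right vote give a left bracketing, while a two-against-two split returns both bracketings.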
5.2. Translation pattern generation for 3-word NCs
Once the bracketing of a 3-word NC is identified, it is subjected to Algorithm 3 for determination of its translation
pattern. The overall translation pattern of the 3-word noun compound depends recursively on the translation patterns
of the 2-word NCs present in the 3-word noun compound.
Algorithm 3 (Translation pattern generation for a 3-word English noun compound).
Input: A 3-word noun compound, bracketing of the noun compound.
Output: The Hindi translation pattern for the noun compound.
Notation: NTL and NTR are the Hindi translations of the nouns, and NH is the head noun for the 2-word NC.
1. If the input NC has left bracketing, i.e. ((N1 N2) N3)
1.1. Find the translation pattern for N1 N2 (left bracketed nouns) using Algorithm 2 and set it to NTL.
1.2. Find the head noun in the NC (N1 N2), set it to NH. (Generally, the rightmost noun is the head noun.)
1.3. Find the translation pattern for NH N3 (a 2-word NC formed from the head noun of the left bracketed nouns and the 3rd noun, N3) using Algorithm 2 and set it to NTR.
1.4. Combine the two translation patterns obtained in steps (1.1) and (1.3): NTL NTR.
1.5. Remove the common noun part present in both the translation patterns from either NTL or NTR.
1.6. The pattern obtained after removal of the duplicate is the final translation pattern for the 3-word NC.
1.7. Return the translation pattern.
2. If the NC has right bracketing, i.e. (N1 (N2 N3))
2.1. Find the translation pattern for N2 N3 (right bracketed nouns) using Algorithm 2 and set it to NTR.
2.2. Find the head noun in the NC (N2 N3), set it to NH. (The rightmost noun is the head noun.)
2.3. Find the translation pattern for N1 NH (a 2-word NC formed from the head noun of the right bracketed nouns and the 1st noun, N1) using Algorithm 2 and set it to NTL.
2.4. Combine the two translation patterns obtained in steps (2.1) and (2.3): NTL NTR.
2.5. Remove the duplicate common noun part from the translation pattern NTL.
2.6. The pattern obtained after removal of the duplicate is the final translation pattern for the 3-word NC.
2.7. Return the translation pattern.
Table 5
Sample NC for which the Adjacency (Adj) and Dependency (Dep) models contradict
(Notation: RB: Right Bracketing; LB: Left Bracketing; NS: Netspeak web service).
Web frequencies/probabilities for the 3-word NC “Hydrogen ion exchange” (N1 N2 N3):
Bigram                       Google                   Bing                   NS
Hydrogen ion (N1 N2)         2,760,000/0.007976879    429,000/0.0007234      55,000/0.0045454
Ion exchange (N2 N3)         5,380,000/0.005028037    1,120,000/0.0044268    304,000/0.0039481
Hydrogen exchange (N1 N3)    242,000/0.000226168      21,600/0.0000854       13,000/0.0001688
Bracketing predictions:
Approach       Adj    Dep
Frequency      RB     LB
Probability    LB     LB

Table 6
Sample NC for which Adjacency (Adj) and Dependency (Dep) models contradict
(Notation: RB: Right Bracketing; LB: Left Bracketing; NS: Netspeak web service).
Web frequencies/probabilities for the 3-word NC “Watershed development planner” (N1 N2 N3):
Bigram                            Google                 Bing                NS
Watershed development (N1 N2)     262,000/0.00017013     61,300/0.0001868    35,000/0.0001222
Development planner (N2 N3)       208,000/0.001808696    98,600/0.0023309    11,000/0.0010185
Watershed planner (N1 N3)         7,620/0.00006626       7,190/0.0001699     810/0.000075
Bracketing predictions:
Approach       Adj    Dep
Frequency      RB     LB
Probability    RB     LB

This scheme is illustrated below with the 3-word NC “olive oil bottle”. This NC has left bracketing, as
determined using the approach discussed in Section 5.1, i.e. by using the adjacency and dependency models with
the web frequency based as well as the probability based approach. Thus the NC is represented as ((olive
oil) bottle).
• The translation pattern for “olive oil” is computed using Algorithm 2. This NC has the semantic relation Material,
as obtained using Algorithm 1, which belongs to SR group2 with the translation pattern “kaa/kii/ke”. It will
be translated as (jaitun(olive) kaa/kii/ke tel(oil)). As the Hindi word tel is masculine and singular, kaa is selected from
these three patterns. Thus the final translation pattern is jaitun kaa tel.
• The head noun for (olive oil) is “oil”.
• The translation pattern for (oil bottle) is determined. The semantic relation between the two words of the NC is
Content, which belongs to SR group3 with the translation pattern “vaalaa/vaalii/vale”. It will be translated as
(tel(oil) vaalaa/vaalii/vale shishi(bottle)). The Hindi word shishi(bottle) is feminine and singular, hence vaalii is
selected from these three patterns. Thus the final translation pattern is tel vaalii shishi.
• Combining the translation patterns, the pattern generated is (jaitun kaa tel) + (tel vaalii shishi). Removing the
duplicate word (tel), the final translation pattern is (jaitun kaa tel vaalii shishi).
Similarly, the NC “plastic oil bottle” has right bracketing and is represented as (plastic (oil bottle)). The
translation pattern obtained for this NC is (plastic kii tel vaalii shishi).
The detailed observations and test results of the scheme related to bracketing issues and the translation patterns for
3-word noun compounds are discussed in Section 6.
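The two worked examples of this section (“olive oil bottle” and “plastic oil bottle”) can be reproduced with the following Python sketch of Algorithm 3. It is a simplified illustration under stated assumptions: `TWO_WORD_PATTERNS` is a hypothetical lookup table standing in for Algorithm 2, the entry for “plastic bottle” is inferred from the final pattern quoted in the text, the rightmost noun is always taken as the head, and the duplicated noun is assumed to sit at the boundary of the two patterns.

```python
# Hypothetical stand-in for Algorithm 2: 2-word NC -> Hindi pattern tokens.
TWO_WORD_PATTERNS = {
    ("olive", "oil"): ["jaitun", "kaa", "tel"],
    ("oil", "bottle"): ["tel", "vaalii", "shishi"],
    # Inferred from the final pattern "plastic kii tel vaalii shishi".
    ("plastic", "bottle"): ["plastic", "kii", "shishi"],
}


def two_word_pattern(modifier, head):
    return list(TWO_WORD_PATTERNS[(modifier, head)])


def three_word_pattern(n1, n2, n3, bracketing):
    """Combine two 2-word patterns into a 3-word pattern (Algorithm 3)."""
    if bracketing == "left":                 # ((N1 N2) N3)
        ntl = two_word_pattern(n1, n2)       # step 1.1
        head = n2                            # step 1.2: rightmost noun
        ntr = two_word_pattern(head, n3)     # step 1.3
        if ntl[-1] != ntr[0]:
            raise ValueError("expected a shared head translation")
        # Steps 1.4-1.6: concatenate and keep the shared noun only once.
        return " ".join(ntl + ntr[1:])
    if bracketing == "right":                # (N1 (N2 N3))
        ntr = two_word_pattern(n2, n3)       # step 2.1
        head = n3                            # step 2.2: rightmost noun
        ntl = two_word_pattern(n1, head)     # step 2.3
        if ntl[-1] != ntr[-1]:
            raise ValueError("expected a shared head translation")
        # Steps 2.4-2.6: drop the duplicated head from NTL, then concatenate.
        return " ".join(ntl[:-1] + ntr)
    raise ValueError("bracketing must be 'left' or 'right'")
```

Calling `three_word_pattern("olive", "oil", "bottle", "left")` and `three_word_pattern("plastic", "oil", "bottle", "right")` yields the two patterns derived in the text.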
5.3. Translation pattern generation for 4-word NCs
Marcus (1980) assumed that an arbitrarily long modifier string can be analyzed by iteratively examining only
the three leftmost nouns in the modifier string. This work was further extended by Resnik (1993).
However, we have extended the 3-word NCs bracketing concept for the 4-word NCs as well using the above-mentioned
web frequency based approach. The possible bracketing combinations that may occur for a 4-word NC are determined
and explored to find the most probable bracketing out of these combinations. A 4-word NC “N1 N2 N3 N4 ” may have
the following possible bracketing combinations:
1. All the four nouns are independent, (N1 N2 N3 N4 ). For this pattern the scheme finds the web frequency of the 4-gram
“N1 N2 N3 N4 ”.
2. The leftmost (N1 ) and the rightmost (N4 ) nouns are independent and the middle two nouns (N2 N3 ) form a group,
i.e. the bracketing is (N1 (N2 N3 ) N4 ). Here the scheme considers web frequency for trigram “N1 N3 N4 ” as N1 and
N4 are independent, and N3 is the head noun for (N2 N3 ) where N2 is a modifier.
3. Here N1 N2 form one group and N3 N4 form another group. Hence the bracketing is ((N1 N2 )(N3 N4 )). The scheme
finds the web frequency of the bigram “N2 N4 ” as N2 is the head noun for (N1 N2 ) and N4 is the head noun for
(N3 N4 ), and N1 and N3 are their respective modifiers.
4. In this case the leftmost three nouns (N1 N2 N3 ) form a group and the rightmost (N4 ) noun is independent. Hence
the bracketing is ((N1 N2 N3 ) N4 ). In this case the 3-word (N1 N2 N3 ) noun group may have either a left or a right
bracketing i.e. one of (((N1 N2 ) N3 ) N4 ) or ((N1 (N2 N3 )) N4 ) is a possibility. In this pattern we find the web frequency
for the bigram “N3 N4 ” as N1 N2 is the modifier for N3 .
5. The leftmost (N1 ) noun is independent and the rightmost three nouns (N2 N3 N4 ) form a group, i.e. we have the
bracketing (N1 (N2 N3 N4 )). The trigram (N2 N3 N4 ) may either have a left or a right bracketing i.e. (N1 ((N2 N3 ) N4 ))
or (N1 (N2 (N3 N4 ))). The web frequency for the bigram “N1 N4 ” is determined, since N2 N3 acts as the modifier for
N4 which is the head noun for (N2 N3 N4 ).
To determine the bracketing pattern for a 4-word noun compound, web frequencies are computed for the sequences
N1 N2 N3 N4, N1 N3 N4, N2 N4, N3 N4 and N1 N4. The noun compound is assigned the pattern having the highest frequency.
The translation pattern for the 4-word NC can be generated once the pattern with highest frequency is found. The
pattern having the highest frequency will consist of nouns that are independent (pattern 1), 2-word nouns (patterns 2
and 3) or 3-word nouns (patterns 4 and 5). The translation patterns for these 2-word and 3-word NCs can be generated
as discussed in previous sections and combined to form the translation pattern for the 4-word noun compound.
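The selection among the five bracketing patterns can be sketched as below. This is a hedged Python illustration: the function names are hypothetical, `web_frequency` stands for any callable that returns a hit count for an exact phrase (the paper obtains such counts from search engines), and the example compound and its counts are invented purely to exercise the code.

```python
def four_word_candidates(n1, n2, n3, n4):
    """Map each bracketing pattern to the n-gram whose web frequency
    represents it, following the five cases listed above."""
    return {
        "(N1 N2 N3 N4)": (n1, n2, n3, n4),   # all nouns independent
        "(N1 (N2 N3) N4)": (n1, n3, n4),     # N3 heads (N2 N3)
        "((N1 N2)(N3 N4))": (n2, n4),        # heads of the two groups
        "((N1 N2 N3) N4)": (n3, n4),         # N3 heads the left group
        "(N1 (N2 N3 N4))": (n1, n4),         # N4 heads the right group
    }


def best_bracketing(nouns, web_frequency):
    """Return the pattern whose representative n-gram is most frequent."""
    candidates = four_word_candidates(*nouns)
    return max(candidates, key=lambda p: web_frequency(" ".join(candidates[p])))


# Invented counts for a made-up example compound (not from the paper).
toy_counts = {
    "health care reform bill": 5,
    "health reform bill": 10,
    "care bill": 20,
    "reform bill": 500,
    "health bill": 30,
}
pattern = best_bracketing(
    ("health", "care", "reform", "bill"),
    lambda phrase: toy_counts.get(phrase, 0),
)
```

Note that the sketch compares raw counts of n-grams of different lengths, exactly as the procedure in the text does; no length normalisation is applied.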
6. Experimental results
The approaches discussed for 2-word, 3-word and 4-word noun compounds have been manually evaluated by two
evaluators for translation patterns. These evaluators were not shown the standard results and were asked to mark the
translation pattern(s) obtained for an NC as correct or incorrect. The inter-annotator agreement was measured using the κ
coefficient (Cohen, 1960), computed as
$$\kappa = \frac{P(A) - P(E)}{1 - P(E)}$$
where P(A) is the observed agreement among the annotators and P(E) is the expected agreement, i.e. the probability
that the annotators agree by chance. The value of κ is constrained to the interval [−1, 1]. A κ value of positive one
means perfect agreement, and a κ value of negative one means perfect disagreement. The results obtained for the
semantic relations and the bracketing have also been compared with the reference results mentioned in the datasets
collected from the literature.
We also compare the outputs of the proposed system (footnote 11) with the Moses baseline system and some of the
state-of-the-art translators. The precision, recall and F-scores have been calculated for all the systems.
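As a concrete illustration of the κ coefficient, the following Python sketch computes it for two annotators who label each item as correct or incorrect. The function name and the toy label lists are hypothetical; P(E) is estimated from each annotator's own label proportions, which is the standard reading of Cohen (1960).

```python
def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators over the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement P(A): fraction of items labelled identically.
    p_a = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement P(E): chance agreement from the marginals.
    categories = set(labels_a) | set(labels_b)
    p_e = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories
    )
    if p_e == 1.0:
        raise ValueError("kappa is undefined when P(E) equals 1")
    return (p_a - p_e) / (1 - p_e)


# Toy example: the annotators agree on 3 of 4 items.
ann1 = ["correct", "correct", "incorrect", "incorrect"]
ann2 = ["correct", "incorrect", "incorrect", "incorrect"]
kappa = cohen_kappa(ann1, ann2)
```

Here P(A) = 0.75 and P(E) = 0.5, so κ = 0.5.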
6.1. 2-word NC evaluation
The test set for 2-word NCs has been extracted from the data provided by Kim (2008 Shared Task). A test set of
200 2-word NCs has been selected randomly from a training corpus of 1088 tokens defined by Barker and Szpakowicz
(1998). The test set is so chosen that instances of all the 20 SRs considered in the paper are evenly distributed. The
SR(s) obtained for an NC using the approach discussed were compared with the SRs given in the 1088 NCs set to find
the accuracy of the method. Some observations for the 2-word NCs are:
• Automatic evaluation of the identified semantic relations resulted in an accuracy of 64%. Here the semantic relations
obtained by using the proposed scheme were compared with the semantic relations specified in the reference set.
• Manual evaluation of the same resulted in 83% accuracy. This may be attributed to the fact that the human evaluators
are able to interpret certain semantic relations in a way that may not be possible during automatic evaluation. For
illustration, the NC “cabinet member” is marked with the semantic relation Container in the reference set, whereas the
proposed method assigned the semantic relation Possessor. The manual evaluators marked this as a correct assignment of
the semantic relation, whereas automatic evaluation marked it as incorrect. Some other such cases are “steel hammer”,
“hospital room”, “sea shell”.
• Most of the NCs with SRs Cause, Material, Product, Purpose, Possessor and Source were correctly identified. Some
examples are “anthrax death”, “olive oil”, “petroleum product”, etc.
• NCs with SRs Beneficiary, Result and Use were incorrectly identified in most of the cases. Some example NCs are
“machine translation”, “faith cure”, “consumer price”.
11 The system obtained by integrating the Moses baseline system with the noun compound translation (NCT) system.

Table 7
3-word NC Bracketing accuracy.
Approach             Model         Correct bracketing    Accuracy (%)
Frequency based      Adjacency     68                    75.56
Frequency based      Dependency    65                    72.22
Probability based    Adjacency     60                    66.67
Probability based    Dependency    63                    70.00
The translation patterns obtained for the NCs based on the extracted SRs were given to two evaluators to mark the
patterns as either correct or incorrect. A kappa (κ) value of 0.414 was obtained for the two annotators. The strength of
agreement is considered to be ‘moderate’.
6.2. 3-word NC evaluation
We have considered a test set of 90 3-word NCs collected from existing literature for evaluating the performance
of the scheme on 3-word NCs. The web frequencies required by the adjacency and the dependency models were obtained
for each of these NCs, and both models were used for bracketing the noun compounds. It has
been found that, out of the 90 NCs considered, the adjacency and the dependency models agreed on 76 NCs for the
probability based approach. However, for the web frequency based approach the two models agreed on only 64 NCs.
The accuracy of the two models, each with both frequency and probability based approaches, is shown in Table 7.
The adjacency model with the frequency based approach is found to be the best.
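The accuracy figures of Table 7 follow directly from the counts of correctly bracketed NCs over the 90 test NCs, as the short Python check below shows. The assignment of each count to an (approach, model) pair follows the column order of Table 7 as read from the source and should be treated as an assumption.

```python
TOTAL_NCS = 90

# Correctly bracketed NCs per (approach, model), as read from Table 7.
correct = {
    ("frequency", "adjacency"): 68,
    ("frequency", "dependency"): 65,
    ("probability", "adjacency"): 60,
    ("probability", "dependency"): 63,
}

# Accuracy in percent, rounded to two decimals as in the table.
accuracy = {k: round(100 * v / TOTAL_NCS, 2) for k, v in correct.items()}
best = max(accuracy, key=accuracy.get)
```

For example, 68 of 90 gives 75.56%, which is the highest of the four values.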
The bracketing obtained for these 90 NCs was compared with the standard bracketing obtained from the reference
set from Resnik (1993); 68 NCs were found to be bracketed correctly and matched with the reference bracketing set.
The translation patterns for only these 68 NCs have been obtained using the proposed scheme. It has been observed
that for some NCs multiple translation patterns were obtained. This is because at times more than one semantic relation
is identified for a given NC by the scheme, resulting in multiple translation patterns.
It is further observed that the NCs involving nouns with semantic category human are not assigned proper SRs in
many cases. Hence there is a need for more SRs that may help in proper identification. In some cases the search engine
reported very high frequency for a paraphrase, but when manually verified we found that the web frequencies reported
were not of the exact paraphrase match but some related paraphrases. Such cases were removed from the list of high
frequency paraphrases and were not considered any further.
The accuracy of translation patterns generated for 3-word NCs was calculated only for those 68 NCs for which
bracketing has been correct. A Kappa (κ) value of 0.416 was obtained for two evaluators for the translation patterns.
The strength of agreement for these translation patterns for these two evaluators is considered to be ‘moderate’.
6.3. 4-word NC evaluation
Instances of 4-word NCs are rare in routine text. We could collect only a small set of 35 NCs from the existing
literature and the corpus discussed in Section 3. Web frequencies for this set were found using two search engines, viz. Bing
and Google, to determine the most probable bracketing for 4-word NCs. It is observed that for 75% of the test cases,
the pattern ((N1 N2 N3) N4) obtains the highest frequency for both search engines. For nearly half of the remaining test
cases there is a conflict in selecting the highest frequency pattern, while the pattern (N1 (N2 N3 N4)) obtains the highest
frequency for only 9.375% of the cases, and for 3.125% of the test cases the highest frequency pattern is ((N1 N2)(N3 N4)).
Thus for a 4-word NC (N1 N2 N3 N4), it is assumed that the NC will strictly follow the bracketing
((N1 N2 N3) N4), which corroborates the findings of Marcus (1980). The 3-word NC (N1 N2 N3) within the 4-word NC
has been handled by deciding the bracketing first and then generating the translation patterns for the two 2-word NCs
recursively.
[Fig. 3. Schematic diagram for the Moses + NCT system. Components shown: Source Sentence, Noun compound translator,
Transliteration Module, SenseRelate::TargetWord Tool, Lexicon, Moses Baseline System, Target Sentence.]
The translation patterns obtained for the 4-word NCs have been marked correct by both the evaluators for 64% of
the cases. The evaluators did not agree for 18% of the translation patterns. Remaining 18% translation patterns have
been found to be incorrect by both the evaluators. Of all the correct translation patterns 29% included postpositions
kaa/ke/kii in the translations; and 14% correct translation had other postpositions. It is further observed that 57% of
the correct translation patterns were juxtapositions of word by word translations of N1 , N2 , N3 and N4 . They did not
include any postpositions in the translations.
For NCs involving more than 4 words the authors feel that using the translated or transliterated forms of the individual
nouns in the same sequence in which they appear generates acceptable translation patterns. It also saves much of the
overhead caused by the possible number of bracketings and the huge number of iterations required for SR identification.
6.4. Comparison with the state-of-the-art translators
Sections 6.1–6.3 discuss the accuracy of the proposed scheme for semantic relation identification, and the translation
patterns generated for the 2-word, 3-word and 4-word noun compounds. In this section we consider the state-of-the-art
translators AnglaMT, Anuvaadaksh, Bing, Google and Moses (footnote 12) (Koehn et al., 2007) to find how these systems translate
the noun compounds occurring in a sentence. We also integrate the proposed noun compound translator (NCT) system
with the Moses baseline system to compare the translation quality of the integrated system over the Moses baseline
system. We refer to it as “Moses + NCT” in subsequent discussion. The integrated system is schematically represented
in Fig. 3.
A test set of 200 sentences consisting of 242 noun compounds (each sentence contains at least one noun compound)
is considered for evaluation. It has been observed that all the translators suffer from various problems related to incorrect
postpositions while translating noun compounds. The problems range from simple transliteration of nouns to altering
the positions of the nouns within a noun compound conveying erroneous, if not meaningless, semantics. Moreover,
there are cases where the correct semantics are not preserved primarily due to ambiguous word sense. The different
types of errors and their percentages that occurred during the translation of noun compounds by different translators
are summarized in Table 8, where the minimum error in each row is highlighted.
Table 8
Percentages of errors occurring in the NCs translated by the translators (* marks the minimum error in the row).
Types of errors    AnglaMT    Anuvaadaksh    Bing     Google    MaTra    Moses    Moses + NCT
Word-order         3.74       3.10           4.18     2.09      3.85     18.79    0*
Postposition       24.06      32.30          30.96    28.45     32.91    21.48    15.05*
Semantic           16.58      11.95          5.44*    5.86      13.68    16.11    8.60
Transliteration    6.95       16.37          5.02     1.67*     23.50    36.24    12.90

Some of the observations in this regard are:
• The major problem faced by the translators is related to postpositions. However, the Moses + NCT system has the lowest
percentage of postposition related errors in comparison with the other translators.
• With respect to semantics, the Moses + NCT system shows considerable improvement over the Moses baseline
system. The percentage of semantics related errors has reduced from 16.11 to 8.60. This has been possible
due to the introduction of the SenseRelate::TargetWord tool. Only Bing and Google perform somewhat better than
Moses + NCT in this regard.
• The Google system has the best transliteration module, with the lowest percentage of errors. Although Moses + NCT
performs much better than the Moses baseline system, some more improvements are needed in this regard.
• The Moses + NCT system does not suffer from word-order related errors. For all other translators, however, there are cases
where the nouns in a noun compound are either placed far apart or appear in reverse order after translation.
This leads to incorrect translation of noun compounds.
12 http://www.statmt.org/moses/.
Table 9 illustrates some of the errors made by different translators while translating noun compounds extracted
from some of the test sentences. Here the error codes ‘P’, ‘T’, ‘O’, ‘S’, ‘N’ and ‘PoS’ denote the errors due to
postpositions, translation/transliteration, word-order, semantics, number and part-of-speech, respectively.
The translations of noun compounds shown in Table 9 clearly indicate the inability of the translators to handle
these noun compounds. All the translators except Google and the Moses + NCT system translate the noun compound
“death sentence” incorrectly. This is because the word ‘sentence’ has two meanings: a linguistic unit and
a decree of punishment. The word ‘player’ in the noun compound “record player” has been translated
incorrectly by all the translators except Bing and Google, which have transliterated the word correctly. The
sense considered for ‘player’ by most of the translators is: a person taking part in a sport or game. For the
noun compound “wedding present” the postposition considered by Google is incorrect. The Hindi word for
‘present’ is masculine and singular, hence the correct postposition should be the masculine singular form. All other translators
excepting Moses + NCT have made some errors. Only Moses + NCT transliterated the noun compound correctly, and
thereby avoided the mistakes. For the noun compounds “hotel electrician” and “working hours” the mistakes are
primarily either missing postpositions or incorrect postpositions. Anuvaadaksh produced a wrong word order for
“working hours”. Only Moses + NCT could translate both of them correctly. The AnglaMT system has considered the
PoS of the word ‘forces’ as verb in the noun compound “security forces”, and hence translated it incorrectly. Only the
Anuvaadaksh system and the Moses + NCT system proposed in this paper could translate it correctly.
Table 10 presents the precision, recall and F-scores of the Moses baseline system and the various online translators
mentioned above. The maximum value in each row (i.e. recall, precision and F-score) is highlighted.
We compare these results with the results obtained from the Moses + NCT system. In this scenario, we have considered
recall as the ratio of the number of noun compounds translated by the system to the total number of noun compounds
present in the document, i.e.
$$\text{Recall} = \frac{\#\,\text{Translated NCs}}{\#\,\text{Total NCs}}$$
In a similar vein, precision is defined as the ratio of the number of correctly translated noun compounds to the total
number of noun compounds identified and translated by the MT system. Hence,
$$\text{Precision} = \frac{\#\,\text{Correctly translated NCs}}{\#\,\text{Translated NCs}}$$

Table 9
Translation outputs of various translators for some example noun compounds in a sentence.

Table 10
Precision, recall and F-score for the different translators (* marks the maximum value in the row).
Scores       AnglaMT    Anuvaadaksh    Bing     Google    MaTra    Moses    Moses + NCT
Recall       0.870      0.934          0.983    0.996*    0.971    0.616    0.927
Precision    0.492      0.381          0.521    0.568     0.255    0.477    0.699*
F-score      0.629      0.541          0.681    0.715     0.404    0.538    0.823*

Some observations related to the experiments are:
• The Moses baseline system has the least recall. This may be attributed to the presence of many out-of-vocabulary
(OOV) words that simply need to be transliterated. The baseline system does not have a built-in transliteration
module. The system recall increases significantly (from 0.616 to 0.927) with the addition of the transliteration
module as can be seen from the Moses + NCT system results. Google translator records the highest recall.
• AnglaMT system has lower recall as compared to all other translators except the Baseline system. This may be due
to the errors in the transliterated terms. The NCs that are incorrectly transliterated are considered incorrect, resulting
in lower recall.
• The precision and F-score for the Moses + NCT system are the highest. It has shown a significant improvement over
the Moses baseline system. This is attributed to the hybrid approach that has been used for handling the
issues related to NC translation.
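The recall and precision definitions used in this section can be written as small Python functions. This is an illustrative sketch: the function names are hypothetical, and the F-score is assumed to be the usual harmonic mean of precision and recall, since the text does not state its formula. The example reuses the precision and recall reported for the Moses baseline in Table 10.

```python
def recall(translated_ncs, total_ncs):
    """Recall as defined in the text: translated NCs / total NCs."""
    return translated_ncs / total_ncs


def precision(correct_ncs, translated_ncs):
    """Precision as defined in the text: correct NCs / translated NCs."""
    return correct_ncs / translated_ncs


def f_score(p, r):
    """Harmonic mean of precision and recall (assumed definition)."""
    return 2 * p * r / (p + r) if (p + r) > 0 else 0.0


# Precision and recall of the Moses baseline as listed in Table 10.
moses_f = f_score(0.477, 0.616)
```

Under the harmonic-mean assumption, the Moses baseline values give an F-score of about 0.538, matching the corresponding entry in Table 10.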
7. Concluding remarks
This paper deals with the handling of noun compounds while translating from English to Hindi. Existing English to
Hindi machine translation systems are often unable to handle English noun compounds when translating them into
Hindi. The problems can be of various types, ranging from the simple syntactic level to complicated semantics.
As a consequence, the solution scheme needs to address the following tasks:
• determining whether a word sequence in a sentence is a noun compound;
• identifying and interpreting the semantic relation between the words present in a noun compound;
• choosing the proper translation pattern for the English noun compound in the target language.
In this work we investigated the above problems for noun compounds of the form of two, three and four consecutive
nouns. A hybrid approach has been proposed for solving this difficulty. A knowledge base consisting of a set of seed
verbs has been constructed. This helps in identifying the semantic relation between the nouns of a noun compound. A
database containing noun compounds that are translated as single words has been formed. This helps in determining
the translation patterns of noun compounds that translate to single words without having to go through the process of
semantic relation identification. A rule-based scheme has been used for generating the translation patterns. The rules
are formulated by identifying the semantic relations between the nouns of the noun compounds of the source language.
Twenty different semantic relations have been considered for study.
The evaluation results for translation patterns obtained for noun compounds have been presented and the
inter-annotator agreement has been reported. For semantic relation identification the accuracy of objective evaluation is 64%,
whereas for subjective evaluation it is 83%. The apparent discrepancy is because of the inherent ambiguity of the
perceived semantic relation existing between two nouns, which human evaluators are capable of resolving but which
cannot be resolved automatically. In this study we have considered fifteen paraphrases with highest frequencies for
determining the semantic relations. The proposed system records the highest precision and F-scores in comparison
with the Moses baseline and other online English to Hindi translators. The errors made by machine translation
systems following different translation paradigms have been analyzed. It has been observed that the percentage of
errors related to postpositions and word-order occurring while translating noun compounds for all the systems was
comparable. Number related errors have been found to be more common for statistical systems (Google and Bing,
here). However, transliteration and semantic related errors were more frequent in the rule-based system (i.e. AnglaMT)
and the hybrid systems (i.e. Anuvaadaksh and MaTra) considered for this work. Tables 8 and 9 show the percentage
of various errors and certain example cases when the noun compounds are translated using the MT systems.
Presently the study focuses on noun sequences only. This work can be extended for generating translation patterns
for other categories of compound nouns, such as “adjective + noun”, “verb + noun”, “noun + verb”.
Acknowledgements
We would like to thank the two anonymous reviewers for their helpful comments, suggestions and valuable input
on this work.
References
Ananthakrishanan, R., Kavitha, M., Hegde, J., Shekhar, C., Shah, R., Bade, S., Sasikumar, M., 2006. MaTra: a practical approach to fully-automatic
indicative English–Hindi machine translation. In: Symposium on Modeling and Shallow Parsing of Indian Languages (MSPIL’06).
Baldwin, T., Tanaka, T., 2004. Translation by machine of compound nominals: getting it right. In: Proceedings of ACL 2004 Workshop on Multiword
Expressions: Integrating Processing, pp. 24–31.
Barker, K., Szpakowicz, S., 1998. Semi-automatic recognition of noun modifier relationships. In: Proceedings of International Conference on
Computational Linguistics (COLING-1998), pp. 96–102.
Bungum, L., Oepen, S., 2009. Automatic translation of Norwegian noun compounds. In: Proceedings of European Association for Machine
Translation (EAMT-09), pp. 136–143.
Burnard, L., 2000. User reference guide for the British National Corpus. In: Technical Report. Oxford University Computing Services.
Cao, Y., Li, H., 2002. Base noun phrase translation using web data and the EM algorithm. In: Proceedings of International Conference on
Computational Linguistics (COLING-2002), pp. 1–7.
Cohen, J., 1960. A coefficient of agreement for nominal scales. Educ. Psychol. Meas. 20, 37–46.
Finin, T.W., (Ph.D. thesis) 1980. The Semantic Interpretation of Compound Nominals. University of Illinois.
Girju, R., Moldovan, D., Tatu, M., Antohe, D., 2005. On the semantics of noun compounds. Comput. Speech Lang. 19 (4), 479–496.
Jones, K.S., 1983. Compound noun interpretation problems. In: Technical Report. University of Cambridge.
Kim, S.N., Baldwin, T., 2013a. A lexical semantic approach to interpreting and bracketing English noun compounds. Nat. Lang. Eng. 19 (3),
385–407.
Kim, S.N., Baldwin, T., 2013b. Word sense and semantic relations in noun compounds. ACM Trans. Speech Lang. Process. 10 (July (3)), Special
Issue on Multiword Expressions: From Theory to Practice and Use, Part 2, Article No. 9.
Kim, S.N., Baldwin, T., 2008. An unsupervised approach to interpreting noun compounds. In: Proceedings of IEEE International Conference on
Natural Language Processing and Knowledge Engineering (IEEE NLP-KE’08), pp. 1–7.
Kim, S.N., Baldwin, T., 2006. Interpreting semantic relations in noun compounds via verb semantics. In: Proceedings of COLING/ACL 2006 Main
Conference Poster Session, pp. 491–498.
Kim, S.N., Baldwin, T., 2005. Automatic interpretation of compound nouns using WordNet similarity. In: Proceedings of International Joint
Conference on Natural Language Processing (IJCNLP-2005), pp. 945–956.
Koehn, P., Hoang, H., Birch, A., Callison-Burch, C., Federico, M., Bertoldi, N., Cowan, B., Shen, W., Moran, C., Zens, R., Dyer, C., Bojar, O.,
Constantin, A., Herbst, E., 2007. Moses: open source toolkit for statistical machine translation. In: Proceedings of the ACL 2007 Demo and
Poster Sessions, Prague, Czech Republic, pp. 177–180.
Lapata, M., 2002. The disambiguation of nominalizations. Comput. Linguist. 28 (3), 357–388.
Lauer, M., (Ph.D. thesis) 1995. Designing Statistical Language Learners: Experiments on Noun Compounds. Macquarie University.
Levi, J., 1978. The Syntax and Semantics of Complex Nominals. Academic Press, New York.
Maalej, Z., 1994. English–Arabic machine translation of nominal compounds. In: Bouillon, P., Estival, D. (Eds.), Proceedings of Workshop on
Compound Nouns: Multilingual Aspects of Nominal Composition, pp. 135–146.
Marcus, M., (Ph.D. thesis) 1980. A Theory of Syntactic Recognition for Natural Language. MIT Press, pp. 336–343.
Mathur, P., Paul, S., 2009. Automatic translation of nominal compounds from English to Hindi. In: Proceedings of International Conference on
Natural Language Processing (ICON).
Moldovan, D., Badulescu, A., Tatu, M., Antohe, D., Girju, R., 2004. Models for the semantic classification of noun phrases. In: Proceedings of
HLT-NAACL 2004: Workshop on Computational Lexical Semantics, pp. 60–67.
Nakov, P., 2013. On the interpretation of noun compounds: syntax, semantics, entailment. Nat. Lang. Eng. 1 (1), 1–40.
Nakov, P., 2008. Paraphrasing verbs for noun compound interpretation. In: Proceedings of LREC 2008 Workshop: Towards a Shared Task for
Multiword Expressions (MWE 2008).
Nakov, P., Hearst, M., 2006. Using verbs to characterize noun–noun relations. In: Proceedings of International Conference on Artificial Intelligence:
Methodology, Systems Applications (AIMSA), pp. 233–244.
Nakov, P., Hearst, M., 2005. Search engine statistics beyond the n-gram: application to noun compound bracketing. In: Proceedings of Computational
Natural Language Learning (CoNLL-2005), pp. 17–24.
Patwardhan, S., Banerjee, S., Pedersen, T., 2005. SenseRelate::TargetWord - a generalized framework for word sense disambiguation. In: Proceedings
of ACL Interactive Poster and Demonstration Sessions.
Paul, S., Mathur, P., Kishore, S., 2010. Syntactic construct: an aid for translating English nominal compound into Hindi. In: Proceedings of NAACL
HLT Workshop on Extracting and Using Constructions in Computational Linguistics, pp. 32–38.
Potthast, M., Trenkmann, M., Stein, B., 2010. Netspeak: assisting writers in choosing words. In: Proceedings of 32nd European Conference on
Advances in Information Retrieval (ECIR 10), Lecture Notes in Computer Science. Springer, p. 672.
Pustejovsky, J., Anick, P., Bergler, S., 1993. Lexical semantic techniques for corpus analysis. Comput. Linguist. 19 (2), 331–358.
Rackow, U., Dagan, I., Schwall, U., 1992. Automatic translation of noun compounds. In: Proceedings of COLING, pp. 1249–1253.
Resnik, P., (Ph.D. thesis) 1993. Selection and Information: A Class-based Approach to Lexical Relationships. University of Pennsylvania, pp.
126–131.
Rosario, B., Hearst, M., 2001. Classifying the semantic relations in noun compounds via a domain-specific lexical hierarchy. In: Proceedings of
Empirical Methods in Natural Language Processing (EMNLP-2001), pp. 82–90.
Rose, T., Stevenson, M., Whitehead, M., 2002. The Reuters corpus volume 1 - from yesterday’s news to tomorrow’s language resources. In: Proceedings
of Language Resources and Evaluation (LREC 2002), pp. 827–833.
Shahzad, I., Ohtake, K., Masuyama, S., Yamamoto, K., 1999. Identifying translations of compound nouns using non-aligned corpora. In: Proceedings
of Workshop MAL, pp. 108–113.
Sinha, R.M.K., Sivaraman, K., Agrawal, A., Jain, R., Srivastava, R., Jain, A., 1995. AnglaBharti: a multilingual machine aided translation project on
translation from English to Hindi. In: Proceedings of IEEE International Conference Systems, Man and Cybernetics. IEEE Press, pp. 609–614.
Tanaka, T., Baldwin, T., 2003a. Noun–noun compound machine translation: a feasibility study on shallow processing. In: Proceedings of ACL-2003
Workshop on Multiword Expression: Analysis, Acquisition and Treatment, pp. 17–24.
Tanaka, T., Baldwin, T., 2003b. Translation selection for Japanese–English noun–noun compounds. In: Proceedings of Machine Translation Summit
(MT Summit IX), pp. 378–385.
Tanaka, T., Yoshihiro, M., 1999. Extraction of translation equivalents from non-parallel corpora. In: Proceedings of Theoretical and Methodological
Issues in Machine Translation, pp. 109–119.
Toutanova, K., Klein, D., Manning, C., Singer, Y., 2003. Feature-rich part-of-speech tagging with a cyclic dependency network. In: Proceedings of
HLT-NAACL 2003, pp. 252–259.
Warren, B., 1978. Semantic Patterns of Noun-Noun Compounds. Gothenburg Studies in English 41. Acta Universitatis Gothoburgensis, Gothenburg.
Renu Balyan is a research student in the Department of Mathematics, IIT Delhi. She holds M.Phil. (CS) and MCA
degrees. Her areas of interest include machine translation, information extraction and data structures. She worked as a
research fellow and project engineer with the Centre for Development of Advanced Computing, Noida, for almost 6 years. She
has also worked as an intern with Dublin City University, Ireland. She has published 14 papers in national and international
conferences.
Niladri Chatterjee is a Professor of Statistics and Computer Science in the Department of Mathematics, IIT Delhi. His
primary research areas are Natural Language Processing, Semantic Web and Statistical Modeling. He has been associated with IIT
Delhi for close to 15 years. Prior to that, he worked in the Department of Computer Science, University College London, and at the
Indian Statistical Institute, Calcutta. In 2010 he was a Visiting Scientist in the Dipartimento di Informatica, University of
Pisa, Italy. He has over 70 publications in international and national journals and conferences. He was the Organizing
Chair of “CICLING–2012” in March 2012.