Cross-Language Information Retrieval of Proper Nouns
using Context Information
Isao Goto, Noriyoshi Uratani, Terumasa Ehara
NHK Science and Technical Research Laboratories
1-10-11 Kinuta, Setagaya-ku, Tokyo 157-8510, JAPAN
{igoto, uratani, eharate}@strl.nhk.or.jp
Abstract
Translating news articles frequently
involves finding foreign language
equivalents for proper nouns occurring
for the first time in an original article, a
time-consuming and labor-intensive
task. We propose an Internet-based
technique for efficiently finding
foreign language equivalents for
proper nouns via Cross-Language
Information Retrieval (CLIR). In this
technique, the CLIR of proper nouns
that are not registered in any dictionary
is made possible using "Relevant
Context Words" occurring in the
original news articles. Our trial
searches for individual name words
using this technique resulted in a 90%
success rate.
1 Introduction
In our organization, English news articles for
overseas and multiplex broadcasting are
translated from Japanese news articles. Since the
news generally involves recent events, the
original Japanese articles frequently contain
proper nouns (such as the names of foreign
personages) appearing in print for the first time.
Finding the English equivalents for these proper
nouns represents a significant burden in the
translation process. A method for efficiently
finding borrowed¹ proper nouns would therefore
be highly useful.
Translation Example Browser, a kind of
Translation Memory system, is useful for finding
proper nouns (Kumano, 1998). However, proper nouns
that have never been translated cannot be found from
past translations. In such cases, it is very
useful to retrieve the translation of proper nouns
from the Internet via CLIR.
¹ Words that represent the pronunciation of foreign words in Japanese katakana.
The resource used for information retrieval is
the large database embodied by Internet Web
pages; CLIR adopts a query translation-based
method. However, query translation-based CLIR
cannot search successfully if the
Japanese query word cannot be translated into
English. Thus, when the name or term being
sought is not already registered in a dictionary,
conventional CLIR is unable to search English
Web pages in response to a Japanese query. This
makes it quite difficult to find the correct proper
noun equivalent using CLIR.
Query translation-based CLIR is categorized
as one of the following three types (Jones, 1999):
1) dictionary-based; 2) machine
translation-based; 3) parallel corpus-based.
Dictionary-based and machine translation-based
methods are also incapable of finding English
equivalents for proper nouns not registered in
dictionaries. The parallel corpus-based method is
unable to find equivalents of proper nouns
appearing for the first time, since the term does
not occur in the corpus.
Query translation-based CLIR has ambiguity,
since query words sometimes have many
different translation equivalents, and each
translation equivalent sometimes has numerous
meanings. The methods described below have
been proposed previously to handle ambiguities
and to increase accuracy in dictionary-based
CLIR.
In the case of numerous translation equivalents
having different meanings, the following
methods have been suggested, all of which assume
that the query consists of many words. Dagan (1994)
and Jang (1999) select the combination of query
equivalents with the highest statistical reliability
in a target-language corpus. Dorr (1998)
uses a structured lexicon to reduce translation
ambiguities. Hull (1997) structures
translated queries using the Boolean model. All
these CLIR methods for handling translation
ambiguities presuppose that the
required equivalents exist.
Other methods for handling ambiguities in the
meanings of a translated word have been
suggested. Ballesteros (1997) and Chen (1999)
proposed the query expansion method to reduce
the ambiguity of retrieved words. This method
searches the source-language corpus for words
appearing near the query word, translates the
query and the words found, and adds
the found words in the target language to the
translated query. Ballesteros also searches the
target-language corpus for words appearing near
the translated query word and
adds them to the translated query.
However, these CLIR methods still presuppose
query equivalents, and new proper nouns do not
appear in the corpus.
Knight (1998) and Fujii (1999) have proposed
transliteration methods capable of suggesting
English word candidates from Japanese katakana
words. Nevertheless, they cannot ensure that the
correct English equivalent will be among the
words suggested. Moreover, even if the
suggestions include the correct English
equivalent, numerous other words that are
similarly spelled will be retrieved as well, due to
the ambiguity of the transliteration.
This paper proposes a method for performing
CLIR from Web pages for proper nouns (such as
the names of foreign individuals) spelled in
katakana in Japanese articles. This technique
enables retrieval of proper nouns not registered in any
dictionary, using “Relevant Context Words”
occurring in articles. In addition, this technique
requires neither a corpus that includes the query word
nor the generation of a correct English equivalent through
transliteration.
2 Proposed CLIR method
2.1 “Relevant Context Words”
For proper nouns (such as the names of foreign
individuals) in the Japanese articles to be
translated, the social position of these individuals
(e.g., office or titles) is often described or
clarified by the context. In addition to these
words that specify the individual in question, the
Japanese articles also refer to topics related to
these individuals. These relevant words related to
the proper nouns occurring in the articles are
called "Relevant Context Words."
In searching vast databases (such as a search
engine's indexed Web pages) for borrowed
proper nouns, we may find proper nouns that
refer to different individuals but have quite
similar spellings. For example, the name of a
certain individual, expressed as "エリクソン"
(erikuson) in Japanese, has multiple English
equivalents ("Erickson," "Ericson," "Ericsson,"
"Erikkson," "Erikson," and "Eriksson"),
referring to various actors, singers, writers, and
so on. However, the "エリクソン" appearing in
a news article references a specific individual,
and the exact English equivalent for the person is
required. This emphasizes the importance of
context, including brief descriptions of the proper
nouns retrieved or topics related to them. For our
purposes, search results are useless without
contextual information concerning the retrieved
words.
Figure 1. Translators’ Workbench (callout in the figure: Sergeyev in katakana)
The CLIR method we propose is as follows. We
extract the Relevant Context Words of the
borrowed proper noun being searched for, which
is written in katakana. We translate the Relevant
Context Words into English. Then we use the
English equivalents of these Relevant Context
Words as keywords for a search. From the search
results, we pick out the proper nouns and
compare them with the original Japanese proper
noun based on pronunciation. The closest match
then becomes the search result.
2.2 User Interface
This technique is implemented in the integrated
translation aid environment Translators'
Workbench (Kumano, 2000), which is centered
on a bilingual editor, as shown in Figure 1.
The editor shows a Japanese article to be
translated on the left, and a translator creates the
English article on the right. Using the mouse, the
translator selects the portion to be searched for in
a Japanese article (セルゲーエフ) and clicks the
search button. The system retrieves not just the
proper noun to be searched for, but also the context
information found in the article. The retrieval
results present English equivalents for the proper
noun being searched for and the extracted context
words, as shown in Figure 2. Clicking the
English equivalents displays the actual Web
pages where the English equivalent and Relevant
Context Words occur.
Figure 2. Retrieval results
2.3 Extraction of "Relevant Context Words"
This search technique first requires extracting
Relevant Context Words from an article. Ideally,
to expand the coverage of a search, we need to
obtain a considerable amount of information
from the article.
Here, we consider the case of an individual
name. Then we consider the words that indicate
the post of the person being searched for as one
of the Relevant Context Words. The Web may
also offer information on related news topics,
such as places visited by the individual. Topics
related to the individual are likely to appear near
the name of the person. Thus, we treat the
individual's post and news topics related to the
individual as Relevant Context Words.
When searching for the name of an individual,
we can classify the Relevant Context Words as
one of the following two types, based on the
position of the words in the article:
Relevant Context Words 1:
Post describing the individual, or news
topics related to the individual
Relevant Context Words 2:
Information that specifies the individual,
such as a title
We can retrieve effectively using keywords
composed of the combination of the Relevant
Context Words 1 and 2.
Many keywords may be extracted as Relevant
Context Words 1. These keywords are ranked
based on how closely they are related to the
individual. Using keywords in rank order
enables efficient searches. The positional
relationship between the name of the individual
and the keywords in the article gives a rough idea
of how closely keywords are related to the
individual. In news documents, when a proper
noun (such as the name of a foreign individual)
appears in an article, the proper noun is explained
in context before occurring for the first time. In
many cases, an explanation of a proper noun is
given immediately before the proper noun. If an
article concerns an issue involving a foreign
country, the name of the country, which is the
area of interest, often appears at the beginning of
the article.
Japanese article (personal name chosen by the user: ゴンチグドルジ):
モンゴルでは、・・・・・民主党のゴンチグドルジ前国会議長に対して・・
Extraction of personal name context:
“Relevant Context Words 1”: モンゴル (Mongol), 民主党 (Democratic Party)
“Relevant Context Words 2”: 前国会議長 (Former Chairman)
Figure 3. Extracting Typical Context of a Personal Name
We need not organize Relevant Context Words
1 into a hierarchy. Nor do we distinguish whether a
Relevant Context Word 1 is a post or a topic. These
characteristics make this a robust technique.
Rankings that express how closely keywords are
related to an individual are divided into the
following three levels:
Rank 1: The words before the name of the
individual
Rank 2: From the beginning of an article to
the sentence immediately before the
sentence containing the name of the
individual
Rank 3: The words after the name of the
individual to the end of an article
In a Japanese news article, the title of an
individual often follows the individual's name.
Thus, by extracting the word immediately after
the name of the individual, we can extract
Relevant Context Words 2. If a general honorific
(such as "氏" or "さん" = "Mr./Ms.") occurs
immediately after the name of the individual, the
title of the individual often precedes the name. In
this case, the word immediately before the name
is extracted into the Relevant Context Words 2
category. Figure 3 shows the extraction of a
typical context for a personal name.
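The positional rules above can be sketched in code. This is only an illustrative sketch under several assumptions: the article is already split into sentences and tokens (the experiment uses a morphological analyzer for this), Rank 1 is read as "before the name within the same sentence", and the names `rank_context_words`, `extract_rcw2`, `is_keyword` and `HONORIFICS` are hypothetical helpers, not part of the described system.

```python
from typing import Callable, Dict, List, Optional

# General honorifics named in the text ("Mr./Ms.").
HONORIFICS = {"氏", "さん"}


def rank_context_words(
    sentences: List[List[str]],
    name: str,
    is_keyword: Callable[[str], bool],
) -> Dict[str, int]:
    """Map each candidate keyword to rank 1, 2, or 3 by its position."""
    # Find the first sentence that contains the name.
    for s_idx, tokens in enumerate(sentences):
        if name in tokens:
            n_idx = tokens.index(name)
            break
    else:
        return {}  # name not found in the article

    # Rank 1: words before the name (assumed: within the same sentence).
    rank1 = sentences[s_idx][:n_idx]
    # Rank 2: from the beginning of the article to the preceding sentence.
    rank2 = [t for s in sentences[:s_idx] for t in s]
    # Rank 3: words after the name to the end of the article.
    rank3 = sentences[s_idx][n_idx + 1:] + [
        t for s in sentences[s_idx + 1:] for t in s
    ]

    ranks: Dict[str, int] = {}
    for rank, words in ((1, rank1), (2, rank2), (3, rank3)):
        for word in words:
            if is_keyword(word) and word not in ranks:
                ranks[word] = rank  # keep the best (lowest) rank seen
    return ranks


def extract_rcw2(tokens: List[str], name: str) -> Optional[str]:
    """Pick the title-like word next to the name (Relevant Context Words 2)."""
    i = tokens.index(name)
    after = tokens[i + 1] if i + 1 < len(tokens) else None
    before = tokens[i - 1] if i > 0 else None
    # A general honorific after the name suggests the title precedes the name.
    return before if after in HONORIFICS else after
```

Here `is_keyword` stands in for whatever test decides that a token is a usable keyword (for example, a place or organization name). A real tokenizer may also emit particles between a title and a name, which this sketch does not skip.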
2.4 Cross-Language Information Retrieval
For CLIR, we translate Japanese Relevant
Context Words into English using a dictionary.
For the name of an individual, we create
keywords by combining Relevant Context Words
1 and Relevant Context Words 2. Searches using
these keywords show that the closer together
Relevant Context Words 1 and
Relevant Context Words 2 appear on a Web page, the higher the
likelihood that the name of the individual to be
searched for will also be found on that page.
We use a search engine that prioritizes results in
which keywords are found closer together within
a Web page. We extract the proper noun
candidates from the HTML pages that have been
returned as search results. Figure 4 shows this
CLIR flow.
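The keyword creation step can be illustrated with a short sketch. It assumes that each keyword sequence is simply one translated Relevant Context Word 1 followed by one translated Relevant Context Word 2, joined by a space; the function name `keyword_sequences` is hypothetical.

```python
from itertools import product
from typing import List


def keyword_sequences(rcw1_en: List[str], rcw2_en: List[str]) -> List[str]:
    """Combine every translated RCW1 with every translated RCW2."""
    return [f"{w1} {w2}" for w1, w2 in product(rcw1_en, rcw2_en)]


if __name__ == "__main__":
    # Translations taken from the Gonchigdorj example in this paper.
    for query in keyword_sequences(
        ["Democratic Party", "Mongol"],
        ["Former Speaker", "Former Chairman"],
    ):
        print(query)
```

Because a Japanese word may have several dictionary translations, the number of sequences grows with the product of the two lists, which is one reason to use the keywords in rank order.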
Translation of Relevant Context Words from Japanese into English
“Relevant Context Words 1”: 民主党 → Democratic Party; モンゴル → Mongol
“Relevant Context Words 2”: 前国会議長 → Former Speaker, Former Chairman
Creation of English Keyword Sequences by combination of Relevant Context Words 1 and 2
Democratic Party Former Speaker
Democratic Party Former Chairman
Mongol Former Speaker
Mongol Former Chairman
Retrieval using English Keyword Sequences
Extraction of proper noun candidates in the retrieval result HTML pages
(Expression of the retrieval result HTML page)
... to choose Radnaasumbereliyn Gonchigdorj, former chairman of the Mongolian State ...
Assumed proper noun candidates in the retrieval result HTML page
Figure 4. Flow of Cross-Language Information Retrieval
2.5 Selecting the English proper noun by comparing pronunciation
Words borrowed from foreign languages have
pronunciations close to those in the original
language. We therefore compare proper nouns expressed
in Japanese katakana with the extracted proper
nouns expressed in English, using their
pronunciations as the criterion. Both words are
converted into phonetic symbols, and the resulting
character strings are compared using Dynamic
Programming (DP) matching. Here, we use AKPR
(alphabetized katakana phonetic representation)
for the phonetic symbols, converting both the
English word and the katakana word into AKPR.
While katakana characters themselves can be
converted into unique AKPR, English words are
not so easily converted into AKPR. However,
converting English words into katakana has been
studied (Sumiyoshi, 1994), and precise conversion is
possible by simple processing. We use this
technique as the conversion method.
English word: Gonchigdorj → Convert → AKPR: gontigudo-ji, gontigudaruji, gonchaigudo-ji, ... etc.
DP matching
Katakana word: ゴンチグドルジ → Convert → AKPR: gontigudoruji
Figure 5. Conversion example of an English word to AKPR considering ambiguity and comparing with the katakana word
If a word spelled in the same way can be
pronounced differently depending on the original
language and the individual (i.e., conversion is
ambiguous), all combinations of ambiguities are
considered. Figure 5 shows an example. To
calculate the degree of similarity, we adopt the
most similar of multiple candidates given in the
Roman character representation for English
words. Including ambiguous conversion results
increases comparison accuracy for a word that is
spelled in the same way, but pronounced
differently (such as in English characters for
non-English languages).
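The comparison step can be sketched as follows. The cost function used in the DP matching is not specified here, so this sketch assumes plain Levenshtein distance with unit costs; the function names are hypothetical, and the AKPR strings are taken from the Gonchigdorj example.

```python
from typing import Iterable


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed by dynamic programming (unit costs)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # match or substitute
            ))
        prev = curr
    return prev[-1]


def best_variant_distance(katakana_akpr: str, english_akpr_variants: Iterable[str]) -> int:
    """Adopt the most similar of the ambiguous AKPR conversions."""
    return min(edit_distance(katakana_akpr, v) for v in english_akpr_variants)


if __name__ == "__main__":
    variants = ["gontigudo-ji", "gontigudaruji", "gonchaigudo-ji"]
    print(best_variant_distance("gontigudoruji", variants))
```

Taking the minimum over all variants means that an English spelling is judged by its most favorable reading, which matches the idea of adopting the most similar of the multiple candidates.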
Here, we consider relying on transliteration
instead of AKPR. When using
transliteration from katakana to English, the
accuracy of comparison nearly matches that
obtained using AKPR. However, English spelling is more ambiguous
than AKPR, so the calculation time for the comparison
increases. In transliteration from English to
katakana, the accuracy of comparison may fall
short of that obtained using AKPR. This is because
katakana includes different characters or
character combinations that have similar
pronunciations. For example, ‘サ’ (sa) and
‘シャ’ (sha) are not the same in
katakana, while their ‘s’ and ‘a’ are identical in AKPR.
Table 1. Experimental Results
Name (Japanese) | Correct English equivalent | Relevant Context Words 1 | Relevant Context Words 2 | Number of Japanese articles | Number of successfully retrieved articles
ウォルフォウイッツ | Wolfowitz | America, U.S., Russia | Deputy Defense Secretary | 3 | 3
エンリレ | Enrile | Philippine, Manila, Senator | former Defense Minister | 5 | 5
カークパトリック | Kirkpatrick | U.N., America, U.S., Washington | Ambassador | 1 | 1
カニングハム | Cunningham | America, U.S. | U.N. ambassador | 1 | 1
カラード | Carrard | IOC | Director General | 1 | 1
カルデロン | Calderon | Peru | Attorney General | 1 | 1
キシン | Qichen | Beijing, China, Taiwan, Shanghai, Taipei | Vice Premier | 1 | 1
キム・ジョンナム | Kim Jong Nam | North Korea | son | 2 | 1
クフォー | Kufuor | Ghana | President | 1 | 1
クリストドゥロス | Christodoulos | Greek Orthodox church, Athens | Archbishop | 1 | 1
ゲオルギエフスキ | Georgievski | Macedonia, Yugoslavia, Albania | Prime Minister | 2 | 2
ゴンチグドルジ | Gonchigdorj | Democratic Party, Mongol | former Speaker | 5 | 5
サンチャゴ | Santiago | Philippine, Manila | Senator | 1 | 1
シェイファー | Schaffer | Republican Party | Representative | 1 | 1
ジェフォーズ | Jeffords | Republican Party, America, U.S., Vermont | Senator | 1 | 1
ジャファル・ウマル・タリブ | Jaffar Umar Thalib | Muslim, Kalimantan, Indonesia, Java | commander | 1 | 1
ショウ | Siew | Taiwan, Taipei | Deputy leader | 1 | 0
チャンピ | Ciampi | Italia | President | 1 | 1
デービス | Davis | California, America, U.S. | Governor | 1 | 1
ドブリアンスキー | Dobriansky | America, U.S., Tibet, China | Undersecretary | 1 | 1
ナンシー・カセボム | Nancy Kassebaum | America, U.S., North Korea | former Senator | 1 | 1
バーンズ | Burns | America, U.S. | Middle East envoy | 1 | 0
ファラヒアン | Fallahian | Iran | former Intelligence Minister | 1 | 1
ブラッター | Blatter | FIFA | President | 1 | 1
ペリー・コモ | Perry Como | America, U.S., Florida | singer | 1 | 0
ベルルスコーニ | Berlusconi | Italia | former Prime Minister | 4 | 4
マリン | Marine | America, U.S., Beijing, Taiwan, China, New York | Charge d'affaires | 1 | 0
マンフレット・ゲンツ | Manfred Gentz | DaimlerChrysler, America, U.S., Germany | board | 1 | 1
ミッチェル | Mitchell | America, U.S., Palestine, Israel | former Senator | 1 | 1
メニューイン | Menuhin | England, Belgium, Brussels, Russia, Poland, Vienna, Europa | violinist | 1 | 1
ラジョイ | Rajoy | Spain | Interior Minister | 1 | 1
リー・シャオミン | Li Shaomin | Hong Kong, China, America, U.S. | associate professor | 1 | 1
リンド | Lindh | Sweden, EU, Europa | Foreign Minister | 1 | 1
ルウィン | Lwin | NLD | Secretary | 1 | 1
ルテッリ | Rutelli | Italia | former Rome mayor | 1 | 1
3 Experiment
To test our proposed technique, we used
international news articles taken from Japanese
broadcast manuscripts that had actually been
translated into English. First, we
selected articles from May 1, 2001 to May 30,
2001. From these articles, we extracted all the
names of individuals written in katakana and
translated in the actual English news. From
these names, we selected the names of
individuals that had not appeared in articles
during the one-year period preceding April 2001.
We used the selected names of individuals and
the original articles containing these names as the
test set.
The experiment is described below.
Relevant Context Words were automatically
extracted by the method given in Section 2.3
using a morphological analyzer. Words
identified as "organization names" or
"place names" by the morphological analysis
were extracted as Relevant Context Words 1. We
used keywords in the order in which they were
ranked. The search terminated when the optimal
search results were obtained.
To translate keywords from Japanese to English,
we used an in-house dictionary constructed from
a Japanese-English bi-lingual database of news
articles. In this experiment, the dictionary
includes all Relevant Context Words. Almost all
Relevant Context Words 1 are found in an
ordinary Japanese-English dictionary.
We used Google² as the search engine. The
search itself was performed in June 2001. We
assumed that words in the search results beginning with an
uppercase letter were proper nouns. For each
keyword sequence, we obtained the HTML pages
showing 100 Google search results. Then we
extracted the words with the best scores in
DP matching from the proper noun candidates in the
HTML pages. When multiple English
equivalents were extracted, we selected the most
frequently extracted word as the English
equivalent of the CLIR result. If the extracted
word of the CLIR result was the correct English
equivalent, we considered the search to be a
success.
² http://www.google.com/
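The candidate extraction and selection just described can be sketched as follows. This is one possible reading, not a definitive account: it assumes that the best-scoring capitalized word is taken from each result page and that the word chosen most often across pages becomes the result. The names `proper_noun_candidates`, `select_equivalent` and `distance` are hypothetical, and `distance` is any function giving a smaller value for a closer pronunciation match.

```python
import re
from collections import Counter
from typing import Callable, Iterable, List, Optional


def proper_noun_candidates(text: str) -> List[str]:
    """Treat every word beginning with an uppercase letter as a candidate."""
    return re.findall(r"\b[A-Z][A-Za-z'-]*", text)


def select_equivalent(
    pages: Iterable[str],
    distance: Callable[[str], float],
) -> Optional[str]:
    """Vote for the best-matching candidate of each page; return the winner."""
    votes: Counter = Counter()
    for page in pages:
        candidates = proper_noun_candidates(page)
        if candidates:
            # Smaller distance means closer pronunciation (assumed convention).
            votes[min(candidates, key=distance)] += 1
    if not votes:
        return None
    return votes.most_common(1)[0][0]
```

The uppercase heuristic also picks up sentence-initial words and titles such as "State", so the pronunciation score does most of the filtering.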
Table 1 shows the names of individuals in
Japanese from the test set, the correct English
equivalents of the names, all the Relevant
Context Words 1 used (in English), most of the Relevant
Context Words 2 (in English), the number of
Japanese articles containing the names, and the
number of articles from which English
equivalents were successfully retrieved.
The results show that of the 50 test
articles, correct English equivalents were
extracted from 45, a success rate of 90%.
In all, 31 of the 35 names were
successfully retrieved. Not distinguishing
between post and topic in Relevant Context
Words 1 did not negatively affect the results.
The failures fall into the following three types.
Type 1 is the inability to extract Relevant Context
Words 2 from the article. When the article is
a follow-up to earlier news, it sometimes does not include
sufficient information on the topic. One article
fell into this category. Type 2 is the inability to
extract the correct equivalent even though it appears in the Google
search results. One article fell into this category.
In this case, the name Siew could not be extracted
as the first of the equivalent candidates, because the
matching score when comparing ‘Siew’ and
‘ショウ’ (sho-) was poor. Type 3
is the correct equivalent not turning up in
the Google search results. Three articles fell into
this category. In this case, by retrieving more Google
search results or using more combinations of
Relevant Context Words as keywords, we can
expect to see higher retrieval success rates.
4 Conclusion
This paper proposed a new CLIR technique for
proper nouns using Relevant Context Words.
This CLIR technique enables searching of the
Internet for English equivalents used in news
translations, in which new proper nouns
continually appear. We built an experimental
system using this technique, then tested the
system by searching the Internet using the names
of individuals and articles actually used in news
translations. The experiment demonstrated the
effectiveness of the technique.
We plan to expand our method to include
proper nouns other than individual names.
Because it creates retrieval keywords from Relevant
Context Words without translating the query itself,
the method can be useful for retrieving
almost any borrowed word written in katakana.
References
Ballesteros, Lisa and W. Bruce Croft. 1997.
Phrasal Translation and Query Expansion
Techniques for Cross-Language Information
Retrieval. In Proceedings of the 20th SIGIR
Conference, pp.84-91.
Chen, Hsin-Hsi, Guo-Wei Bian, and
Wen-Cheng Lin. 1999. Resolving Translation
Ambiguity and Target Polysemy in
Cross-Language Information Retrieval. In
Proceedings of the Conference 37th Annual
Meeting of the Association for Computational
Linguistics, pp.215-222.
Dagan, Ido and Alon Itai. 1994. Word Sense
Disambiguation Using a Second Language
Monolingual Corpus. Computational Linguistics,
Vol.20, No.4, pp.563-596.
Dorr, Bonnie J. and Douglas W. Oard. 1998.
Evaluating Resources for Query Translation in
Cross-Language Information Retrieval. First
International Conference on Language Resources and
Evaluation.
Fujii, Atsushi and Tetsuya Ishikawa. 1999.
Cross-Language Information Retrieval for
Technical Documents. In Proceedings of the
Joint ACL SIGDAT Conference on Empirical
Methods in Natural Language Processing and
Very Large Corpora, pp.29-37, Jun.
Hull, David A. 1997. Using Structured
Queries for Disambiguation in Cross-Language
Information Retrieval. AAAI Spring Symposium
on Cross-Language Text and Speech Retrieval
Electronic Working Notes.
Jang, Myung-Gil, Sung Hyon Myaeng, and Se
Young Park. 1999. Using Mutual Information to
Resolve Query Translation Ambiguities and
Query Term Weighting. In Proceedings of the
Conference 37th Annual Meeting of the
Association for Computational Linguistics,
pp.223-229.
Jones, Gareth, Tetsuya Sakai, Nigel Collier,
Akira Kumano, and Kazuo Sumita. 1999.
Exploring the use of Machine Translation
resources for English-Japanese Cross-Language
Information Retrieval. In Proceedings of
Machine Translation for Cross-Language
Information Retrieval Workshop, Machine
Translation Summit VII, pp.15-22.
Knight, Kevin and Jonathan Graehl. 1998.
Machine Transliteration. Computational
Linguistics, Vol.24, No.4, pp.599-612.
Kumano, Tadashi, Hideki Tanaka, Noriyoshi
Uratani, and Terumasa Ehara. 1998. Translation
examples browser: Japanese to English
translation aid for news articles. International
Conference on Natural Language Processing and
Industrial Applications.
Kumano, Tadashi, Isao Goto, and Terumasa
Ehara. 2000. Translators’ Workbench: An
Integrated Translation Aid Environment by
centering bilingual Editor. In Proceedings of the
Seventh Annual Meeting of The Association for
Natural Language Processing, pp.143-146. (In
Japanese)
Sumiyoshi, Hideki and Teruaki Aizawa. 1994.
Simple Translation of English Proper Nouns to
Japanese Katakana (Phonogram Expression).
Information Processing Society of Japan Journal
Contents Vol.35 No.01, pp.35-45. (In Japanese)