IEICE TRANS. INF. & SYST., VOL.E90–D, NO.11 NOVEMBER 2007

PAPER

A Machine Learning Approach for an Indonesian-English Cross Language Question Answering System

Ayu PURWARIANTI†a), Masatoshi TSUCHIYA††b), Nonmembers, and Seiichi NAKAGAWA†c), Fellow

SUMMARY  We have built a CLQA (Cross Language Question Answering) system for a source language with limited data resources (e.g. Indonesian) using a machine learning approach. The CLQA system consists of four modules: question analyzer, keyword translator, passage retriever and answer finder. We used machine learning in two modules, the question classifier (part of the question analyzer) and the answer finder. In the question classifier, we classify the EAT (Expected Answer Type) of a question by using the SVM (Support Vector Machine) method. The features for the classification module are basically the output of our shallow question parsing module. To improve the classification score, we use statistical information extracted from our Indonesian corpus. In the answer finder module, instead of the common approach in which the answer is located by matching the named entities of the word corpus with the EAT of the question, we locate the answer by text chunking the word corpus. The features for the SVM based text chunking process consist of question features, word corpus features and similarity scores between the word corpus and the question keywords. In this way, we eliminate the named entity tagging process for the target document. As for the keyword translator module, we use an Indonesian-English dictionary to translate Indonesian keywords into English. We also use some simple patterns to transform some borrowed English words. The keywords are then combined in boolean queries in order to retrieve relevant passages using IDF scores. We first conducted an experiment using 2,837 questions (about 10% were used as the test data) obtained from 18 Indonesian college students. We next conducted a similar experiment using the NTCIR (NII Test Collection for IR Systems) 2005 CLQA task by translating the English questions into Indonesian. Compared to the Japanese-English and Chinese-English CLQA results in the NTCIR 2005, we found that our system is superior to the others except for one system that uses rich data resources employing three dictionaries. Further, a rough comparison with two other Indonesian-English CLQA systems revealed that our system achieved a higher accuracy score.

key words: Cross Language Question Answering, Indonesian-English CLQA, limited resource language, machine learning

Manuscript received March 28, 2007.
Manuscript revised May 25, 2007.
† The authors are with the Electronic and Information Department, Toyohashi University of Technology, Toyohashi-shi, 441–8580 Japan.
†† The author is with the Information and Multimedia Center, Toyohashi University of Technology, Toyohashi-shi, 441–8580 Japan.
a) E-mail: [email protected]
b) E-mail: [email protected]
c) E-mail: [email protected]
DOI: 10.1093/ietisy/e90–d.11.1841
Copyright © 2007 The Institute of Electronics, Information and Communication Engineers

1. Introduction

Recently, CLQA (Cross Language Question Answering) systems have become an area of much research interest. The CLEF (Cross Language Evaluation Forum) has conducted a CLQA task since 2003 [1] using English target documents and Italian, Dutch, French, Spanish, and German source questions. Indonesian-English CLQA has been one of the more recent goals of the CLEF since 2005 [2]. The NTCIR (NII Test Collection for IR Systems) has also been active in its own CLQA efforts since 2005 [3] by providing Japanese-English, English-Japanese, Chinese-English and English-Chinese data.
a) E-mail: [email protected] b) E-mail: [email protected] c) E-mail: [email protected] DOI: 10.1093/ietisy/e90–d.11.1841 Test Collection for IR Systems) has also been active in its own CLQA efforts since 2005 [3] by providing JapaneseEnglish, English-Japanese, Chinese-English and EnglishChinese data. In CLQA, the answer to a given question in a source language is searched for in documents of a target language. Accuracy is measured by the retrieved correct answer. The translation phase of CLQA makes answer searching more difficult than monolingual QA for the following reasons [3]: 1. Translated questions are represented with different expressions than those used in news articles in which answers appear; 2. Since keywords for retrieving documents are translated from an original question, document retrieval in CLQA becomes much more difficult than that in monolingual QA. Translations can be done for both questions and documents. Researchers tend to select question translation, however, due to its easiness compared with document translation. In question translation, one can either directly translate the source question into the target language or extract the keywords and then translate them into the target language. In CLQA, translation quality is an important factor in achieving an accurate QA system. Thus, the quality of the machine translation and dictionary used is very important. In the Indonesian language, there are a number of translation resources available [4], such as an online machine translation (Kataku) and an Indonesian-English dictionary (KEBI). Previous work in Indonesian-Japanese CLIR [4] shows that using a bilingual dictionary in a transitive approach can achieve a higher retrieval score than using a machine translation tool. Other work in the Indonesian language [5], however, showed that using an online dictionary available on the Internet gave a lower IR (Information Retrieval) performance score than the available machine translation because of the limited vocabulary size of the online dictionary. As for CLQA research, the best performance of Japanese-English CLQA in the NTCIR 2005 was obtained by a study [6] using three dictionaries (EDICT, 110,428 entries; ENAMDICT, 483,691 entries; and in-house translation dictionary, 660,778 entries) that can be qualified as high data resources. Accuracy was 31.5% on the first exact answer. This accuracy score outperformed other submitted runs, such as the second ranked run with 9% accuracy. For the Indonesian-English CLQA, there have been c 2007 The Institute of Electronics, Information and Communication Engineers Copyright IEICE TRANS. INF. & SYST., VOL.E90–D, NO.11 NOVEMBER 2007 1842 two trials [7], [8] conducted using the CLEF data set. Both systems used a machine translation tool to translate Indonesian questions into English. One system [7] used a commercially available Indonesian-English machine translation software (Transtool) and was able to obtain only 2 correct answers among 150 factoid questions. The other system [8] used an online Indonesian-English machine translation software (Kataku) and was able to obtain only 14 correct answers among 150 factoid questions. Different from both the Indonesian-English CLQA systems mentioned above, in this study we utilize an Indonesian-English KEBI [9] dictionary (29,047 word entries) and then combine the translation results in a boolean query to retrieve the relevant passages. 
For a limited resource language such as Indonesian, the idea of using a bilingual dictionary means that we have to provide our own question analyzer module. A question analyzer module aims to extract the question keywords and the EAT. Defining the EAT is a more difficult task than extracting the question keywords, and many approaches are available for defining the EAT, such as rule-based systems or machine learning approaches. In a rule-based system, the rules should be as complete as possible in order to get a high accuracy score for classifying the EAT; having only a few simple rules will lead to a low accuracy score. An example of a high quality question classifier that employs rich rules is shown in [6], which uses a Japanese morphological analyzer based on Nihongo Goi-Taikei that classifies 300,000 words into about 3,000 categories. As for Indonesian-English CLQA, rule-based question classifiers were used by [7], [8], which achieved low question answering accuracy. In those cases, the rules for the question classifier relied only on the question interrogative words. In order to avoid the human labour involved in the EAT classification module, we chose to use the machine learning approach.

In the machine learning approach, the pioneering work of [10] used the SNoW algorithm with features such as POS tags, chunks, named entities, head chunks, and semantically related words (a certain word in the question that triggers a flag indicating a certain EAT). They achieved 91% accuracy for coarse classes (6 classes) and 84.2% accuracy for fine classes (50 classes). Other work [11] compared several machine learning algorithms with simple features (e.g. bag-of-words and bag-of-n-grams). That study showed that the SVM (Support Vector Machine) algorithm outperformed other machine learning algorithms such as Nearest Neighbor, Naive Bayes, Decision Tree, and SNoW. Using SVM and bag-of-n-gram features, they achieved 87.4% accuracy for coarse classes and 79.2% for fine classes. Another experiment [12] utilized the SVM algorithm with richer features, such as subordinate word categories (using WordNet), question focus (extracted from a manually generated list of regular expressions) and syntactic semantic structure. System accuracy was 85.6% for the fine classes. Although this system achieved better accuracy results than the previous works, it still requires some rich resources such as WordNet and manually prepared patterns. As for our Indonesian EAT classification, we use the output of our question shallow parser and statistical information yielded from a monolingual corpus as the features of the SVM engine in order to classify the EAT of our Indonesian questions.

The EATs are usually used to locate the answers in the retrieved passages. Each word in a document is matched against the resulting EAT. In one commonly used approach, documents are tagged with named entities. Matching is then conducted between each named entity and the EAT. Apart from the named entity approach, there is another approach proposed by [13], in which a statistical method is used to locate the answer. This approach is called an experimentally Extended QBTE (Question-Biased Term Extraction) model. Basically, the method is a text chunking process in which the answer is located by classifying each word in the document into one of 3 classes (B, I and O). The B class means that the word is the first word in the answer, I means that it is part of the answer, and O means that the word is not part of the answer.
In this method, the classification is done by a maximum entropy algorithm, and using this approach means that one does not have to conduct EAT classification or perform named entity tagging. Features are taken from a rich POS resource that also includes the named entity information for both the source and target languages. Although the accuracy score obtained using this text chunking approach [13] was low in terms of the NTCIR 2005 CLQA task, as there were only 2 correct answers for the Japanese-English CLQA task, in the current study we adopt this approach in our answer finder module. However, different from [13], our approach employs a question analyzer module. We use the EAT and other question shallow parser results as features. For the source question, we do not use rich POS information, instead using only a common POS category and bi-gram frequency scores for our main question word (i.e. a question focus or a question clue word; for an explanation see Sect. 3.1.1). In the target document, we use a WordNet distance score for each document word.

The rest of this paper is organized as follows. In Sect. 2, we describe the Indonesian-English CLQA data. Section 3 gives an overview of our CLQA system along with the method used for each module. Section 4 shows the experimental results including an experiment using the translated questions of the NTCIR 2005 CLQA test data set. Section 5 concludes our research.

2. Indonesian-English CLQA Data

In order to gain an adequate amount of data, we collected our own Indonesian-English CLQA data. We asked 18 Indonesian college students to read an English article from the Daily Yomiuri Shimbun. Each student was asked to create about 200 Indonesian questions related to the content of the English article. The questions were factoid questions with certain EATs. There were 6 EATs: date, quantity, location, person, organization, and name (nouns except for the location, person and organization categories). After deleting duplicate questions and incorrectly formed questions, we obtained 2,837 questions in total. The number of questions for each EAT is shown in Table 1.

Table 1  Number of questions per EAT.

  EAT            Number of Questions
  date           459
  location       476
  name           482
  organization   475
  person         447
  quantity       498

Because of the development cost, the answer for each question is limited to the one answer given by the respondent. Alternative answers are not included in our in-house question-answer pair data. In the experiments against the NTCIR 2005 CLQA task data set, however, we matched our resulting answers with the alternative answers provided in the NTCIR 2005 CLQA task data.

Based on our observations of the written Indonesian questions, an Indonesian question always has an interrogative word. The interrogative word can be located at the beginning, the end or the middle of a question. In the question "Dimana Giuseppe Sinopoli lahir?" (Where was Giuseppe Sinopoli born?), the interrogative word dimana (where) is located as the first word in the question. In the question "Sebelum menjabat sebagai presiden Amerika, George W. Bush adalah gubernur dimana?" (Before serving as the President of the United States, where did George W. Bush serve as governor?), on the other hand, the interrogative word dimana (where) is the last word in the question. In addition to interrogative words, another important word is the main question word. Here, we define a main question word as a question focus or a question clue word (if the question focus does not exist in the question).
Further, related to the question focus, the order of the question focus and the interrogative word in a sentence is reversible. An interrogative word can appear either before or after a question focus. For example, in the question "Lonceng yang ada di Kodo Hall dibuat di negara mana?" (In which country was Kodo Hall's bell made?), the interrogative word mana (which) is located after the question focus negara (country). However, in the question "Apa nama buku yang ditulis oleh Koichi Hamazaki beberapa tahun yang lalu?" (What was the name of the book written by Koichi Hamazaki several years ago?), the interrogative word apa (what) is placed before the question focus buku (book).

Even though a question focus is an important clue word for defining the EAT, as mentioned before, not all question sentences have a question focus. Some examples include "Apa yang digunakan untuk menghilangkan lignin?" (What is used to dispose of lignin?) and "Siapa yang menyusun panduan untuk pengobatan dan pencegahan viral hepatitis pada tahun 2000?" (Who composed a reference book for medication and prevention of viral hepatitis in 2000?). For such questions without a question focus, we select other words such as "digunakan" (used) or "menyusun" (composed) as the question clue words.

Table 2 shows some question examples for each EAT and their respective patterns.

Table 2  Question example for each EAT (question, EAT, main question word, interrogative word).

  Siapakah ketua Komite Penyalahgunaan Zat di Akademi Pediatri di Amerika
  (Who is the head of the Committee on Substance Abuse at the American Academy of Pediatrics)
  EAT: Person / Main Question Word: ketua (head) / Interrogative Word: siapa (who)

  Apa nama institut penelitian yang meneliti Aqua-Explorer?
  (What is the name of the research institute that does the Aqua-Explorer experiment?)
  EAT: Organization / Main Question Word: institut (institute) / Interrogative Word: apa (what)

  Di kota manakah Universitas Mahidol berada?
  (In which city is Mahidol University located?)
  EAT: Location / Main Question Word: kota (city) / Interrogative Word: manakah (which)

  Ada berapa lonceng di Kodo Hall?
  (How many bells are there in Kodo Hall?)
  EAT: Quantity / Main Question Word: lonceng (bells) / Interrogative Word: berapa (how many)

  Kapankah insiden penyerangan Alondra Rainbow di Perairan Indonesia?
  (When did the Alondra Rainbow attack incident happen in Indonesian waters?)
  EAT: Date / Main Question Word: insiden (incident) / Interrogative Word: kapankah (when)

  Apa nama buku yang ditulis oleh Koichi Hamazaki beberapa tahun yang lalu?
  (What was the name of the book written by Koichi Hamazaki several years ago?)
  EAT: Name / Main Question Word: buku (book) / Interrogative Word: apa (what)

The question "Di prefektur manakah, Tokaimura terletak?" (In which prefecture is Tokaimura located?), for example, has "prefektur" (prefecture) as the main question word. The QA system should be able to locate the answer because the correct answer for the question, "Ibaraki Prefecture", contains it; here, the main question word is found to also be part of the answer. The question "Kapankah insiden penyerangan Alondra Rainbow di Perairan Indonesia?" (When did the Alondra Rainbow attack incident happen in Indonesian waters?) has no question focus (the question clue word is "insiden" (incident)), but the answer can be located by searching for the closest "date" word. The answer is "October".
Even though there are some similarities between Indonesian and English sentences, such as the subject-predicate-object order, there are still some syntactic differences, such as the word order in a noun phrase. For example, in English, the sentence "Country which has the biggest population?" is grammatically incorrect; the sentence should be "Which country has the biggest population?". In Indonesian, however, "Negara apa" (country which) has the same meaning as "Apa negara" (which country). Observing such differences between Indonesian and English sentences, we concluded that we could not directly use English sentence processing tools for the Indonesian language.

3. Schema of Indonesian-English CLQA

Using a common approach, we divide our CLQA system into four modules: question analyzer, keyword translator, passage retriever, and answer finder. As shown in Fig. 1, Indonesian questions are first analyzed by the question analyzer into keywords, the main question word, the EAT and phrase-like information. Our question analyzer consists of a question shallow parser and a question classifier module. The question classifier defines the EAT of a question using an SVM algorithm (provided by WEKA). Then, the Indonesian keywords and main question word are translated by the Indonesian-English keyword translator. The translation results are subsequently used to retrieve the relevant passages. In the final phase, the English answers are located using a text chunking program (Yamcha) with the input features explained in Sect. 3.1.4. Each module involved in the schema is described in the next section.

Fig. 1  Schema of Indonesian-English CLQA.

3.1 Modules in the Indonesian-English CLQA

3.1.1 Question Analyzer

Our Indonesian question analyzer consists of a question shallow parser and a question classifier. In the first module, a rule-based Indonesian question shallow parser accepts an Indonesian question sentence as the input and gives outputs such as:

• interrogative words (such as who, when, where, etc.)
  A written Indonesian question sentence must have at least one interrogative word. Thus, we simply list the interrogative words in Indonesian and then select the word within the list as the interrogative word. If there is more than one interrogative word in a question, we select the first word appearing in the sentence.

• question keywords (obtained by eliminating some stop words from the question)
  We used a stop word list to filter the question keywords. Basically, only verbs, nouns and adjectives are accepted as question keywords. As an additional stop word filtering process, we add some special stop words such as the word tahun (year), which appears before the year number, and tanggal (date), which appears before the date number, etc.

• main question word (with its position in relation to the interrogative word)
  A main question word is a word or phrase in a question that can be used to identify the EAT. A main question word can be a question focus or a question clue word. A question focus is a word or a phrase in a question that can be used to identify the ontological type of the answer entity. Usually, the question focus has the same ontological type as the expected answer type, except for the quantity type. Not all questions have a question focus. For questions without a question focus, we extract another word (a question clue word) as the main question word.
  We define a question clue word as a noun or a verb that can be used to help in defining the EAT if the question focus does not exist. For example, in the question "Dimana Giuseppe Sinopoli lahir?" (Where was Giuseppe Sinopoli born?), we assume that there is no question focus (and therefore the question clue word lahir (born) is chosen), whereas in "Di kota apa pengarang buku The Little Prince dilahirkan?" (In which city was the author of The Little Prince born?), kota (city) is the question focus.

• phrase-like information for the question interrogative word and/or main question word
  Here, we define 6 labels to be used: "-", "NP", "PN", "NP-POST", "N-NOF", "V-NOF". If a question focus is located before the question interrogative word, then the label is "NP". When it is located after the question interrogative word, the label is "NP-POST". If the question interrogative word is preceded by a preposition, then the label is "PN". If the question has no question focus and the question interrogative word is not preceded by a preposition, then the two labels "N-NOF" (the question clue word is a noun) and "V-NOF" (the question clue word is a verb) are used. If the question analyzer module could not extract a main question word, then the label "-" is used.

The output of the shallow parser is used as input for the question classifier. The question classifier's aim is to classify the EAT of a question sentence. As mentioned before, in our study we employ the SVM algorithm provided by WEKA [14] for our question classifier. The features for the SVM engine are the output of the shallow parser and two additional features, a bi-gram frequency feature of the main question word and a WordNet distance feature of the main question word.

The bi-gram frequency feature consists of two frequency scores for each EAT candidate. The first frequency score is calculated as in Eq. (1):

  S_i = \frac{\sum_{w_c \in W_i} f(w_c, qf)}{\sum_{j=1}^{6} \sum_{w \in W_j} f(w, qf)}, \quad i = 1, \ldots, 6    (1)

where W_i is a list of defined words for each EAT candidate, qf is the question focus, and f(w, qf) is the frequency of the bi-gram in which w immediately precedes qf in the Indonesian corpus. For each EAT candidate, the bi-gram frequency of its word list paired with the main question word is thus normalized. The second score is the number of unique words in each word list that form a bi-gram pair with the main question word. All 12 scores (2 × 6) are input as the additional features.

Another additional score is the WordNet distance. Here we calculate the distance between the translated main question word and 25 synsets taken from the 25 noun lexicographer files in WordNet. The specified WordNet synsets are act, animal, artifact, attribute, body, cognition, communication, event, feeling, food, group, location, motive, object, person, phenomenon, plant, possession, process, quantity, relation, shape, state, substance, and time.

A question example along with its features for the question classification is shown in Fig. 2.

Fig. 2  Example of features for the question classifier.

Figure 2 contains two examples whose main difference is in the bi-gram frequency information. For the first question, the main question word "kota" (city) exists in the Indonesian corpus, while for the second question, the main question word "prefektur" (prefecture) does not, which gives a zero score for the bi-gram frequency feature. The features are written in the 4th–8th rows of each question. Features in the 4th–6th rows are yielded by our question shallow parser. The next two rows are the additional features (bi-gram frequency score and WordNet distance).
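To make the bi-gram frequency feature concrete, the sketch below shows one way the two scores per EAT candidate could be computed from bi-gram counts. It is only an illustration of Eq. (1): the toy corpus, the word lists W_i and the helper names are hypothetical, not the implementation used in the paper.

```python
from collections import Counter
from typing import Dict, List, Tuple

def bigram_counts(corpus_tokens: List[str]) -> Counter:
    # Count pairs (w1, w2) where w1 immediately precedes w2 in the corpus.
    return Counter(zip(corpus_tokens, corpus_tokens[1:]))

def bigram_frequency_features(
    main_question_word: str,
    eat_word_lists: Dict[str, List[str]],  # W_i: defined words per EAT candidate
    counts: Counter,                       # bi-gram counts from an Indonesian corpus
) -> Dict[str, Tuple[float, int]]:
    """Return (S_i, number of unique bi-gram partners) for each EAT candidate."""
    def f(w: str) -> int:
        # f(w, qf): frequency of w appearing directly before the main question word
        return counts[(w, main_question_word)]

    per_class = {eat: sum(f(w) for w in ws) for eat, ws in eat_word_lists.items()}
    total = sum(per_class.values())        # denominator of Eq. (1)

    features = {}
    for eat, ws in eat_word_lists.items():
        s_i = per_class[eat] / total if total > 0 else 0.0
        uniques = sum(1 for w in set(ws) if f(w) > 0)
        features[eat] = (s_i, uniques)
    return features

# Toy example; the real word lists cover all six EATs.
corpus = "dia lahir di kota Bandung dan bekerja di kota Jakarta".split()
lists = {"location": ["di"], "person": ["oleh"]}
print(bigram_frequency_features("kota", lists, bigram_counts(corpus)))
# {'location': (1.0, 1), 'person': (0.0, 0)}
```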
For the first question, the main question word "kota" (city) is statistically related to the "location" and "organization" entities (7th row). The highest relation is with the "location" entity, with 0.86 as the first bi-gram frequency score. There are 23 preceding words for the "city" word related to the "location" class and 10 words related to the "organization" class. As for the WordNet distance, the main question word "kota" (city) is a hyponym of the "location" word. These corresponding types of information from the bi-gram frequency score and WordNet lead to a correct prediction of the EAT. In the second question, even though the bi-gram frequency score is zero, the other features could still lead the question classification module to a good prediction.

As has been proven in our Indonesian QA [15], using a question class (EAT) in the answer finder gives higher performance than not using a question class (i.e. depending solely on the question shallow parser result and mainly on the main question word). Compared to a monolingual system, using the question class in a cross language QA system has more benefits. The first benefit occurs when there is more than one translation for a main question word with different meanings. For example, in the question "Posisi apakah yang dijabat George W. Bush sebelum menjadi presiden Amerika?" (What was George W. Bush's position before he became the President of the United States?), the main question word is posisi (position), which can be interpreted as either a place (location) or an occupation (name). By classifying the question into "name", the answer extractor will automatically avoid the "location" answer. The second benefit relates to the problem of an out-of-vocabulary main question word. By providing the question class even though the main question word cannot be translated, the answer can still be predicted using the question class.

Even though our question shallow parser is a rule-based module, it was built with simple rules. First, each word in the question is assigned the POS taken from a POS source (an Indonesian-English dictionary compiled with POS information) and some lists for words other than nouns, adjectives and verbs. We do not deal with POS ambiguity here because our aim is only to extract some information based on the word POS and not to form the sentence's parse tree. The first piece of information taken from the question is the question interrogative word. Next, the words before and after the question interrogative word are checked to obtain the question focus or the question clue word as the main question word. If the word is a noun and not listed in the stop word list, then it is classified as the question focus. If the word is a verb or a noun and not listed in the stop word list, then it is a question clue word. Phrase-like information is defined based on the position of the question focus or question clue word. The implementation cost of our question shallow parser is only about 2 weeks of programming.

Table 3  Question classifier accuracy.

  Features   Accuracy Score
  B          91.9%
  S          95.5%
  B+S        93.8%
  B+W        94.0%
  B+P        94.2%
  S+W        95.7%
  S+P        95.9%
  B+S+W      95.0%
  B+S+P      95.1%
  S+W+P      96.0%
  B+W+P      94.6%
  B+S+W+P    95.4%
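The shallow parser rules described in this subsection are simple enough to fit in a short routine. The following is a minimal sketch of our reading of that procedure, not the authors' code; the interrogative, preposition and stop word lists, the POS lookup and the rule ordering are placeholders.

```python
# Hypothetical lexicons; in the paper the POS comes from an Indonesian-English
# dictionary compiled with POS information plus word lists for other categories.
INTERROGATIVES = {"siapa", "siapakah", "apa", "apakah", "dimana", "mana",
                  "manakah", "kapan", "kapankah", "berapa"}
PREPOSITIONS = {"di", "ke", "dari", "pada"}
STOP_WORDS = {"yang", "tahun", "tanggal"}

def pos_of(word, pos_dict):
    return pos_dict.get(word.lower(), "other")   # coarse POS lookup (placeholder)

def shallow_parse(question, pos_dict):
    tokens = [t.strip("?,.").lower() for t in question.split()]

    # 1) interrogative word: the first token found in the interrogative list
    idx, interrogative = next(
        ((i, t) for i, t in enumerate(tokens) if t in INTERROGATIVES), (None, None))

    # 2) main question word: a neighbouring noun becomes the question focus
    focus, clue, label = None, None, "-"
    neighbours = (idx - 1, idx + 1) if idx is not None else ()
    for j in neighbours:
        if 0 <= j < len(tokens) and tokens[j] not in STOP_WORDS \
                and pos_of(tokens[j], pos_dict) == "noun":
            focus = tokens[j]
            label = "NP" if j < idx else "NP-POST"
            break

    # 3) otherwise: PN if a preposition precedes the interrogative word,
    #    else pick a noun/verb elsewhere in the question as the clue word
    if focus is None and idx is not None:
        if idx > 0 and tokens[idx - 1] in PREPOSITIONS:
            label = "PN"
        else:
            for j, w in enumerate(tokens):
                p = pos_of(w, pos_dict)
                if j != idx and w not in STOP_WORDS and p in ("noun", "verb"):
                    clue, label = w, ("N-NOF" if p == "noun" else "V-NOF")
                    break

    # 4) keywords: nouns, verbs and adjectives that are not stop words
    keywords = [w for j, w in enumerate(tokens)
                if j != idx and w not in STOP_WORDS
                and pos_of(w, pos_dict) in ("noun", "verb", "adj")]
    return {"interrogative": interrogative, "main_question_word": focus or clue,
            "label": label, "keywords": keywords}

pos_dict = {"kota": "noun", "pengarang": "noun", "buku": "noun", "dilahirkan": "verb"}
print(shallow_parse("Di kota apa pengarang buku The Little Prince dilahirkan?", pos_dict))
# {'interrogative': 'apa', 'main_question_word': 'kota', 'label': 'NP', ...}
```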
Even though it is a simple rule-based question shallow parser, it improves the question classification accuracy from 91.9% (bag-of-words feature) to 95.5% (question shallow parser feature), as shown in Table 3.

3.1.2 Keyword Translator

Based on our observations of the collected Indonesian questions, we concluded that there are three types of words used in the Indonesian question sentences:

1. Native Indonesian words, such as "siapakah" (who), "bandara" (airport), "bekerja" (work), etc.
2. English words, such as "barrel", "cherry", etc.
3. Transformed English words, such as "presiden" (president), "agensi" (agency), "prefektur" (prefecture), etc.

We use an Indonesian-English bilingual dictionary [9] (29,047 entries) to translate the non-stop-word Indonesian words into English. To handle the second type of keyword, we simply search for the keyword in the English corpus. For the third type of keyword, we apply some transformation rules, such as "k" into "c", or "si" into "cy", etc. Using this strategy, among the 3,706 unique keywords in our 2,837 questions, we obtained only 153 OOV words (4%). In addition, we also augmented the English translations by adding synonyms from WordNet.

An example of the keyword translation process is shown in Fig. 3.

Fig. 3  Example of keyword translation result.

The third row shows the keywords extracted from the question. The fourth to sixth rows are the keyword translation results. Two words ("letak", "pulau") can be translated with the Indonesian-English dictionary, "Zamami" is an English term that is not translated because it exists in the English corpus, and "prefektur" is transformed into "prefecture" using the transformation rules mentioned above. The next row shows the addition of more translations from WordNet. The WordNet addition process is employed for the word "letak" (location, site, position). An example of the keyword addition process is as follows. The word "location" has 4 synsets in WordNet. Only 2 synsets have synonyms. The first synset has the synonyms "placement", "locating", "position", "positioning" and "emplacement". Because one of these synonyms is also included in the translation results of the bilingual dictionary, all synonyms of the first synset are included as additional keywords.

3.1.3 Passage Retriever

The English translations are then combined into a boolean query. By joining all the translations into a boolean query, we do not filter the keywords down to only one translation candidate, as is done in a machine translation method. The operators used in the boolean query are the "or", "or2", and "and" operators. The "or" operator is used to join the translation sets of each Indonesian keyword. As shown in Fig. 4, the "or" operator joins the translation sets of "prefektur", "letak", "pulau", and "Zamami". The "or2" [16] operator is used for synonyms. Figure 4 shows that the boolean query for the translations of the word "letak" is "location 'or2' site 'or2' position". The "and" operator is used if a translation result contains more than one term. For example, the boolean query for "territorial water" (the translation of "perairan") is "territorial 'and' water".

Fig. 4  Example of passage retriever result.
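To illustrate how the three keyword types and the three operators fit together, the sketch below builds a boolean query string from Indonesian keywords. The dictionary, the English vocabulary check and the rule list are tiny placeholders (only the two transformation rules named above are included), so this is an illustration of the scheme rather than the actual translator; the WordNet synonym expansion is omitted.

```python
# Placeholder resources; the paper uses the KEBI dictionary (29,047 entries),
# the Daily Yomiuri corpus vocabulary, and a larger set of transformation rules.
ID_EN_DICT = {"letak": ["location", "site", "position"],
              "pulau": ["island"],
              "perairan": ["territorial water"]}
ENGLISH_VOCAB = {"zamami", "island", "location", "site", "position", "agency"}
TRANSFORM_RULES = [("si", "cy"), ("k", "c")]   # borrowed-word transformations

def translate_keyword(word):
    w = word.lower()
    if w in ID_EN_DICT:                 # type 1: native word, dictionary lookup
        return ID_EN_DICT[w]
    if w in ENGLISH_VOCAB:              # type 2: English word found in the corpus
        return [w]
    for src, dst in TRANSFORM_RULES:    # type 3: transformed English word
        cand = w.replace(src, dst)
        if cand != w and cand in ENGLISH_VOCAB:
            return [cand]
    return []                           # OOV keyword

def boolean_query(keywords):
    # 'or' joins the translation set of each keyword, 'or2' joins synonyms,
    # 'and' joins the terms of a multi-word translation.
    groups = []
    for kw in keywords:
        terms = [" 'and' ".join(t.split()) for t in translate_keyword(kw)]
        if terms:
            groups.append(" 'or2' ".join(terms))
    return " 'or' ".join("(" + g + ")" for g in groups)

print(boolean_query(["agensi", "letak", "pulau", "Zamami", "perairan"]))
# (agency) 'or' (location 'or2' site 'or2' position) 'or' (island) 'or'
# (zamami) 'or' (territorial 'and' water)
```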
The IDF score for each translation depends on the number of terms in the translation result. For example, if an Indonesian word is translated into only one English word, then the IDF score for this translation is equal to the IDF score of the English word (the number of documents in the corpus divided by the number of documents containing the English word). If the translation consists of more than one English word joined by the "or2" operator, then the IDF score is calculated from the documents containing at least one of the translations. For the "and" operator, the IDF score is calculated from the documents containing all terms joined by the "and" operator.

We retrieve the relevant passages in two steps: document retrieval and passage retrieval. For document retrieval, we select documents with an IDF score higher than half of the highest IDF score. As for the passage retrieval, we select the passages in the retrieved documents that fall within the three highest IDF scores. One passage consists of three sentences. Some examples of such retrieved passages are shown in Fig. 4. The three passages with the highest IDF scores were retrieved. The first and third passages are correct passages, but the second passage is not. This occurred because we do not limit the distances among keywords in the passage retrieval.

3.1.4 Answer Finder

As mentioned before, we do not perform named entity tagging in the answer finder phase; instead, we perform an answer tagging process. In the common approach, researchers tend to use two processes in a CLQA system. The first process is named entity tagging, which tags all named entities in the document. The second process is answer selection, which matches the named entities with the question features. If both processes employ machine learning approaches, then two sets of training data have to be prepared. In this common approach, the error of the answer finder results from the error of the named entity tagger compounded by the error of the answer selection process.

Our approach simplifies the answer finding process: we directly match the document features with the question features, which shortens the development time. It also means that, for a new language with no named entity tagger available, one does not have to prepare two sets of training data. Instead, one only has to prepare a single training data set for the answer tagging process, which can be built from the question-answer pairs (already available as the QA data) and the answer-tagged passages (which can be easily prepared by automatically searching for and tagging the answers in the relevant passages).

In our answer tagging process, we treat the answer finder as a text chunking process. Here, each word in the corpus is given a status of B, I or O based on features of both the document words and the question. We use the text chunking software Yamcha [17], which works using the SVM algorithm.
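The answer tagging can be pictured as follows: each token of a retrieved passage becomes one training row whose last column is its B/I/O label, with the question features repeated on every row. The sketch below uses a deliberately simplified, hypothetical feature set (a single keyword-overlap flag plus the question's EAT and phrase-like label); the actual feature set is described next, and the real rows are produced for Yamcha.

```python
def bio_tag(passage_tokens, answer_tokens):
    """Label each passage token B (answer start), I (inside the answer) or O."""
    labels = ["O"] * len(passage_tokens)
    n = len(answer_tokens)
    lowered = [t.lower() for t in passage_tokens]
    target = [t.lower() for t in answer_tokens]
    for i in range(len(passage_tokens) - n + 1):
        if lowered[i:i + n] == target:
            labels[i] = "B"
            for j in range(i + 1, i + n):
                labels[j] = "I"
    return labels

def training_rows(passage_tokens, answer_tokens, question_features, keywords):
    """One row per token: word, keyword-overlap flag, shared question features,
    and the gold B/I/O label in the last column (Yamcha-style layout)."""
    rows = []
    for tok, lab in zip(passage_tokens, bio_tag(passage_tokens, answer_tokens)):
        overlap = "1" if tok.lower() in keywords else "0"
        rows.append("\t".join([tok, overlap] + question_features + [lab]))
    return rows

passage = "Tokaimura is located in Ibaraki Prefecture northeast of Tokyo".split()
answer = "Ibaraki Prefecture".split()
q_feats = ["location", "NP"]          # e.g. the question's EAT and phrase-like label
print("\n".join(training_rows(passage, answer, q_feats, {"tokaimura", "prefecture"})))
```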
The answer finder with a text chunking process is also used in QBTE [13]. The QBTE approach employed maximum entropy as the machine learning algorithm. In the QBTE approach, [13] used POS information for words. For example, the word "Tokyo" is analyzed as POS1=noun, POS2=propernoun, POS3=location, and POS4=general. The POS information of each word in the question is matched with the POS information of each word in the corpus by using a true/false score as a feature for the machine learning. In our Indonesian-English CLQA, we do not use such information. The POS information in our Indonesian-English CLQA is similar to the POS1 mentioned in [13]. Even though our Indonesian-English dictionary is larger than the Japanese-English dictionary (24,805 entries) used by Sasaki, our Indonesian-English dictionary does not possess the POS2, POS3 and POS4 information. Another difference in our study is that we use the question class as one of the question features, for the reasons mentioned in the second to last paragraph of Sect. 3.1.1. We also use the results of our question shallow parser along with the bi-gram frequency score.

For the document features, each word is morphologically analyzed into its root word using TreeTagger [18]. The root word, its orthographic information, and its POS (noun, verb, etc.) information are used as document features. Different from our Indonesian QA [15], we do not calculate the bi-gram frequency score for the document word. Instead, we calculate its WordNet distance to the 25 synsets listed in the noun lexicographer files of WordNet. Each document word is also complemented by its similarity scores with the main question word and the question keywords. If a question keyword consists of two successive words, such as "territorial water" as the translation of "perairan", then when a document word matches one of the words of that question keyword, the score is divided by the number of words in the question keyword. For example, for the document word "territorial", the similarity score against "territorial water" is 0.5. An example of the features used in the Yamcha text chunking software is shown in Fig. 5.

Fig. 5  Example of features in the answer finder module.

4. Experimental Result

4.1 Question Classifier

In the question classification experiment, we applied the SVM algorithm from the WEKA software [14] with a linear kernel and the "string to word vector" filter to process the string values. We used 10-fold cross validation for the accuracy calculation, using the collected data described in Sect. 2. We tried several feature combinations for the question classifier. The results are shown in Table 3. "B" designates the bag-of-words feature, "S" designates the shallow parser result features, including the interrogative word, main question word, etc., "W" designates the WordNet distance feature for the main question word, and "P" designates the bi-gram frequency feature for the main question word.

As Table 3 shows, feature combinations that include the bag-of-words feature gave lower performance than those without it. This result is different from what would be obtained using a monolingual QA system. Our results with monolingual Indonesian question classification [15] show that using the bag-of-words feature improves classification accuracy. We believe this is because the keywords used in the CLQA are more varied than those used in the monolingual QA. For the queries used in our CLQA, users generated questions based on their own translation knowledge; that is, they were free to use any Indonesian terms as translations of the English terms they found in the English article. In the monolingual QA system [15], however, users tended to use keywords as written in the monolingual (Indonesian) article.

The highest accuracy result is 96.02%, achieved by the combination of the shallow parser result (S), bi-gram frequency (P) and WordNet distance (W) features. This is also different from the monolingual QA [15], where the result using the above combination is lower than that obtained using only the S+P features. This is because there are many English keywords used in the Indonesian questions for the CLQA system, thereby making the WordNet distance feature a useful feature.
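The classification setup can be mimicked outside WEKA. The following is an analogous sketch with scikit-learn (a linear SVM over a bag of feature tokens), used here only as an illustrative stand-in for the WEKA configuration; the feature strings, their discretization and the expected prediction are toy values, not the paper's actual data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# One toy instance per EAT: the space-joined shallow parser output plus
# discretized additional features (all strings here are illustrative only).
train_x = [
    "siapakah ketua NP bigram_person wn_person",
    "apa institut NP bigram_org wn_group",
    "manakah kota NP bigram_loc wn_location",
    "berapa lonceng NP-POST bigram_quan wn_quantity",
    "kapankah insiden N-NOF bigram_date wn_time",
    "apa buku NP bigram_name wn_artifact",
]
train_y = ["person", "organization", "location", "quantity", "date", "name"]

# CountVectorizer plays the role of WEKA's "string to word vector" filter.
clf = make_pipeline(CountVectorizer(token_pattern=r"[^ ]+"), LinearSVC())
clf.fit(train_x, train_y)

# A question whose main question word is out of vocabulary (zero bi-gram score);
# with this toy data the remaining tokens ("manakah", "wn_location") point to
# "location". The paper evaluates the real classifier with 10-fold cross validation.
print(clf.predict(["manakah prefektur NP bigram_zero wn_location"]))
```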
The detailed accuracy for each EAT with the S+P+W features is shown in Table 4. The lowest performance is for the "organization" class, which is a difficult class. For example, for the question "Siapa yang mengatakan bahwa 10% warga negara Jepang telah mendaftarkan diri untuk mengikuti Pemilu pada tahun 2000?" (Who said that 10% of Japanese citizens had registered for the national election in the year 2000?), "person" was obtained as the classification result. Even for a human, it would be quite difficult to determine the question class of the above example without knowing the correct answer ("Foreign Ministry").

Table 4  Confusion matrix of the S+P+W features for question classification.

  in / out   date   loc   name   org   person   quan
  date        459     0      0     0        0      2
  loc           0   460      4    10        0      0
  name          0    10    467    33        0      0
  org           0     4     10   403        8      0
  person        0     2      1    29      439      0
  quan          0     0      0     0        0    496

4.2 Keyword Translation

As mentioned before, we translated Indonesian queries into English using an Indonesian-English dictionary. Apart from that, we also transformed some Indonesian words into English. Using the transformation module, we were able to translate 38 OOV words with only 1 incorrect result. Examples of the transformation results are shown in Table 5.

Table 5  Examples of transformation results.

  Indonesian   English Translation
  -- Correct Translation --
  prefektur    prefecture
  agensi       agency
  jerman       german, germany
  -- Incorrect Translation --
  mula         mole, mule

4.3 Passage Retriever

For the passage retriever, we used two evaluation measures: precision and recall. Precision shows the average ratio of relevant documents. A relevant document is a document that contains a correct answer, without considering any available supporting evidence. Recall refers to the number of questions that might have a correct answer in the retrieved passages. We tested three schemas for the passage retriever. The results are shown in Table 6. In Table 6, "In-house-sw" means that the keywords are filtered using an additional stop word elimination process, and "In-house-sw-wn" means that WordNet is also used to augment the Indonesian-English translation. For the English target corpus, we used the Daily Yomiuri Shimbun from the years 2000 and 2001. We found that using WordNet for the expansion gave a lower recall result, which means that a smaller number of candidate passages might contain the answer. Moreover, by using WordNet, some irrelevant passages got a higher IDF score than the relevant ones.

Table 6  Passage retriever accuracy.

  Run-ID           Precision %   Precision #       Recall %   Recall #
  In-house         12.6%         1,044 of 8,311    71.1%      202
  In-house-sw      12.1%         1,092 of 9,026    73.2%      208
  In-house-sw-wn   19.7%         845 of 4,295      70.4%      200

4.4 Answer Finder

To locate the answer, we used the SVM based text chunking software (Yamcha) with the default values for the SVM configuration. We ranked the answer results with the following five schemas (illustrated in the sketch after this list):

1. A: Using only the Text Chunking score
2. B: (0.3 x Passage Retrieval score) + (0.7 x Text Chunking score)
3. C: (0.5 x Passage Retrieval score) + (0.5 x Text Chunking score)
4. D: (0.7 x Passage Retrieval score) + (0.3 x Text Chunking score)
5. E: Using only the Passage Retrieval score
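A small sketch of how these schema weights can be applied when ranking answer candidates, together with the MRR measure used below, is given here; the candidate scores are made-up values, and how the two scores are normalized is not specified in this sketch.

```python
def combined_score(passage_score, chunk_score, w_passage=0.3, w_chunk=0.7):
    # Schema B by default; schemas A/C/D/E correspond to the weight pairs
    # (0, 1), (0.5, 0.5), (0.7, 0.3) and (1, 0).
    return w_passage * passage_score + w_chunk * chunk_score

def rank_answers(candidates, w_passage=0.3, w_chunk=0.7):
    # candidates: list of (answer string, passage retrieval score, chunking score)
    scored = [(combined_score(p, c, w_passage, w_chunk), a) for a, p, c in candidates]
    return [a for _, a in sorted(scored, reverse=True)]

def mrr(ranked_lists, gold_answers, k=5):
    # Mean reciprocal rank over the top-k answers of each question.
    total = 0.0
    for ranked, gold in zip(ranked_lists, gold_answers):
        for r, ans in enumerate(ranked[:k], start=1):
            if ans.lower() == gold.lower():
                total += 1.0 / r
                break
    return total / len(gold_answers)

# Toy example: three candidates for one question, with made-up scores.
cands = [("Ibaraki Prefecture", 0.8, 0.9), ("Tokyo", 0.9, 0.2), ("October", 0.1, 0.4)]
ranked = rank_answers(cands)                            # schema B
print(ranked[0], mrr([ranked], ["Ibaraki Prefecture"]))  # Ibaraki Prefecture 1.0
```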
All results are shown in Table 7. To measure the CLQA performance, we used the Top1, Top5 and MRR scores for the exact answers.

Table 7  Answer finder accuracy.

  Run-ID           Top1         Top5          MRR
  In-house
    A              20.1% (57)   34.2% ( 97)   25.6%
    B              23.9% (68)   34.2% ( 97)   28.1%
    C              23.2% (66)   33.8% ( 96)   27.7%
    D              23.2% (66)   33.8% ( 96)   27.6%
    E              17.3% (49)   32.8% ( 93)   23.6%
  In-house-sw
    A              20.4% (58)   35.6% (101)   26.9%
    B              24.3% (69)   36.6% (104)   29.2%
    C              23.9% (68)   36.6% (104)   29.0%
    D              23.9% (68)   36.3% (103)   28.8%
    E              19.0% (54)   34.5% ( 98)   25.2%
  In-house-sw-wn
    A              20.4% (58)   34.9% ( 99)   26.5%
    B              25.0% (71)   35.6% (101)   29.5%
    C              24.7% (70)   35.6% (101)   29.2%
    D              24.7% (70)   35.2% (100)   29.1%
    E              18.0% (51)   33.5% ( 95)   24.3%

Table 7 shows that using the passage retrieval score ("B", "C" and "D") improves the overall accuracy. Here, the number of retrieved passages influences the machine learning accuracy, because the highest question answering accuracy scores (Top1 and MRR) are achieved by the "In-house-sw-wn" run, which reached the lowest recall score but the highest precision score in the passage retrieval evaluation.

In addition to the experiments shown in Table 7, we also conducted experiments with question translations that did not use the transformation process. The experimental results in Table 8 show that the transformation process benefits the answer finder accuracy, similar to the improvement given by the stop word elimination process shown in Table 7.

Table 8  Answer finder accuracy for translation without the transformation process.

  Run-ID   Top1         Top5          MRR
  A        19.0% (54)   33.5% ( 95)   25.2%
  B        23.2% (66)   33.8% ( 96)   27.5%
  C        22.5% (64)   33.8% ( 96)   27.1%
  D        22.5% (64)   33.5% ( 95)   27.0%
  E        17.3% (49)   31.3% ( 89)   23.2%

4.5 Experiments with the NTCIR 2005 CLQA Task Test Data

As mentioned at the beginning of our paper, we also conducted a CLQA experiment using the NTCIR 2005 CLQA Task Data [3]. Because the English-Japanese and Japanese-English data sets are parallel, we translated the English question sentences into Indonesian and used the translated Indonesian questions to find the answers in the English documents (Daily Yomiuri Shimbun, years 2000–2001). To prepare the Indonesian questions, we asked two Indonesians to translate the 200 English questions into Indonesian. The translated Indonesian questions were then labeled as belonging to one of two categories: Trans1 and Trans2.

Table 9  Passage retriever accuracy for the NTCIR translated test data set.

  Run-ID         Precision %   Precision #     Recall %   Recall #
  Trans1         18.5%         792 of 4281     76.0%      152
  Trans1-sw      17.3%         839 of 4842     76.5%      153
  Trans1-sw-wn   19.3%         753 of 3903     74.5%      149
  Trans2         10.1%         883 of 8733     77.0%      154
  Trans2-sw      12.2%         976 of 8028     78.5%      157
  Trans2-sw-wn   12.4%         912 of 7346     77.5%      155

This question test set was categorized by NTCIR 2005 into 9 EATs: money, numex, percent, date, time, organization, location, person, and artifact. This categorization is quite similar to the EATs used in our system, namely "date", "organization" (except for the "country" question focus), "location", "person", "artifact" (equivalent to the "name" EAT), "money" ("quantity"), "numex" ("quantity") and "percent" ("quantity"). The only EAT that we do not have in our training data is the "time" EAT. It is interesting to observe whether our system can handle such out-of-EAT questions. There are 14 questions in NTCIR 2005 with the "time" EAT.
It should be noted, however, that in our data there is no question whose answer includes special terms such as "p.m." or "a.m.". Further, there is also no question similar to "What time did the inaugural flight depart from Haneda Airport's new runway B?". As a result of these differences, in the question classification, our system classified all "time" questions as the "date" EAT. Even though there is no similar question-answer pair in our training data for these 14 questions, for the first translation set, our system was able to locate 2 correct answers. Further, we believe that if we add "time" questions to our training data, our system will achieve even better performance.

Our experimental results are shown in the following two tables. The passage retriever accuracy is shown in Table 9, which is similar to the result described in the previous section: the highest recall score is achieved by the passage retrieval module with additional stop word elimination only, that is, without the WordNet expansion. This result is seen in both translation data sets. Table 10 shows the question answering accuracy results. We used only the "B" ranking score calculation ((0.3 x Passage Retrieval score) + (0.7 x Text Chunking score)).

Table 10  Question answering accuracy for the NTCIR translated questions.

  Run-ID         Top1         Top5         MRR
  Trans1         22.5% (45)   34.0% (68)   27.2%
  Trans1-sw      22.0% (44)   35.5% (71)   27.6%
  Trans1-sw-wn   22.0% (44)   34.5% (69)   27.4%
  Trans2         21.0% (42)   31.5% (63)   25.7%
  Trans2-sw      22.0% (44)   32.5% (65)   26.7%
  Trans2-sw-wn   22.5% (45)   32.5% (65)   26.8%

As a rough comparison, we also include the question answering accuracy of the best run-ids in the Japanese-English task and the Chinese-English task of the NTCIR 2005 CLQA task, shown in Table 11 [3]. In the NTCIR 2005 CLQA task, each answer is labeled as R or U. R indicates that the answer is correct and extracted from relevant passages, whereas U indicates that the answer is correct but taken from irrelevant passages. In our experiment, we do not evaluate the relevancy of a passage, and therefore the results shown in Table 11 are the R+U answers for the Top1 and Top5 answers.

Table 11  Question answering accuracy of run-ids in the NTCIR 2005 CLQA task [3].

  Run-ID             Top1         Top5          MRR
  Japanese-English
    NCQAL-J-E-01     31.5% (63)   58.5% (117)   42.0%
    Forst-J-E-02      9.0% (18)   19.5% ( 39)   12.8%
    Forst-J-E-03      8.5% (17)   19.0% ( 38)   12.3%
  Chinese-English
    UNTIR-C-E-01      6.5% (13)   N/A           N/A
    NTOUA-C-E-03      3.5% ( 7)   N/A           N/A
    NTOUA-C-E-02      3.0% ( 6)   N/A           N/A

The best result (NCQAL-J-E-01) [6] used some high quality data resources. To translate Japanese keywords into English, they used three dictionaries: EDICT, ENAMDICT and an in-house translation dictionary. In the question analyzer, they used the ALTJAWS morphological analyzer based on the Japanese lexicon "Nihongo Goi-Taikei". The second and third best ranked teams in the Japanese-English task (Forst-J-E) [19] used the EDR Japanese-English bilingual dictionary to translate Japanese keywords into English and then retrieved the English passages. The English passages were translated into Japanese using a machine translation system. The translated Japanese passages along with the Japanese question sentence were used as input for a Japanese monolingual QA system. Here, it can be noted that the number of entries of the EDR Japanese-English dictionary is higher than that of our KEBI Indonesian-English dictionary.
In the Chinese-English task, the best result (UNTIR-C-E-01) [20] used currently available resources such as the BabelFish machine translation†, the Lemur IR toolkit††, the Chinese Encoding Converter†††, LingPipe†††† and Minipar††††† to annotate English documents. By using these resources, even though the result was lower than the submitted runs for the Japanese-English task, this team achieved the highest accuracy in the Chinese-English task, with 13 correct answers as Top1. The second and third best teams (NTOUA) [21] merged several English-Chinese dictionaries to get better translation coverage and used web-based searching to translate unknown words.

† http://babelfish.altavista.com
†† http://www.lemur.project.org
††† http://www.mandarintools.com/zhcode.html
†††† http://www.alias-i.com/lingpipe
††††† http://www.cs.ualberta.ca/~lindek/minipar.htm

As mentioned before, our answer finder module is similar to the QATRO [13] submitted run, which used a maximum entropy machine learning algorithm. That team used only 300 questions as training data, and this is likely the main reason for its low accuracy score (2 correct answers as Top1). Here, we tried various sizes of training data to show the effect of training data size on the CLQA accuracy scores. The results shown in Fig. 6 are the accuracy scores (using the "B" ranking score calculation) for the first translation set of the NTCIR 2005 CLQA task test set.

Fig. 6  QA accuracy for various sizes of training data.

Figure 6 indicates that the size of the training data does influence the accuracy of a question answering system: larger training data leads to higher question answering accuracy. Figure 6 also shows that with 250 question-answer pairs of training data (a smaller size than that used in QATRO), our system was still able to achieve 12.3% accuracy for the Top1 answer (24.6 correct answers on average), which was surpassed only by the NCQAL [6] system.

5. Conclusion and Future Research

Our experiments show that for a resource-poor language such as Indonesian, it is possible to build a cross language question answering system using a machine learning approach with promising results. Using a bilingual dictionary in the translation module means that it is necessary to build a question analyzer module for the Indonesian language. In the question analyzer, we built two main systems, a question shallow parser and a question classification system. Our question shallow parser was built using the simple rules described in Sect. 3.1.1. The EATs were classified using a machine learning method. Using the results of the question shallow parser improves the question classification accuracy compared to the results obtained by using the bag-of-words feature. The use of machine learning in our answer finder also allowed us to obtain promising results. Compared to other Indonesian-English CLQA systems, our system achieved better performance. By conducting our experiment using the translated question test set of the NTCIR 2005 CLQA Task, we could determine that our system was surpassed only by one system (NCQAL [6]) that used a combination of rich resources and rich rule-based modules.

We believe that our system is suitable for languages with poor resources. For a source language, a POS resource is required, mainly for nouns, verbs and adjectives. Other modules, such as the question shallow parser, can be built with minimal programming.
For a target language, two main resources are required: a POS resource (or POS tagger) and an ontology resource such as WordNet. Even though data resources similar to WordNet are not readily available for many languages, current research on Indonesian Question Answering [15] shows that this kind of information can be replaced by statistical information yielded from a monolingual corpus.

For future research, we would like to conduct experiments on Indonesian-Japanese CLQA, which is more problematic than Indonesian-English CLQA. One of the problems is the lack of adequate resources between Indonesian and Japanese, which makes translation more difficult, especially with respect to the translation of proper nouns.

Acknowledgments

This work was partially supported by a grant for "Frontiers of Intelligent Sensing" from the Ministry of Education, Culture, Sports, Science and Technology.

References

[1] B. Magnini, S. Romagnoli, A. Vallin, J. Herrera, A. Penas, V. Peinado, F. Verdejo, and M. de Rijke, "The multiple language question answering track at CLEF 2003," Proc. CLEF 2003 Workshop, pp.471–486, Norway, Aug. 2003.
[2] A. Vallin, B. Magnini, D. Giampiccolo, L. Aunimo, C. Ayache, P. Osenova, A. Peas, M. de Rijke, B. Sacaleanu, D. Santos, and R. Sutcliffe, "Overview of the CLEF 2005 multilingual question answering track," Proc. CLEF 2005 Workshop, pp.307–331, Vienna, Austria, Sept. 2005.
[3] Y. Sasaki, H.H. Chen, K.-h. Chen, and C.J. Lin, "Overview of the NTCIR-5 cross-lingual question answering task (CLQA1)," Proc. NTCIR-5 Workshop Meeting, pp.230–235, Tokyo, Japan, Dec. 2005.
[4] A. Purwarianti, M. Tsuchiya, and S. Nakagawa, "Indonesian-Japanese transitive translation using English for CLIR," Journal of Natural Language Processing, Association for Natural Language Processing, vol.14, no.2, 2007.
[5] M. Adriani and C. van Rijsbergen, "Term similarity based query expansion for cross language information retrieval," Proc. Research and Advanced Technology for Digital Libraries (ECDL'99), pp.311–322, Paris, Springer Verlag, 1999.
[6] H. Isozaki, K. Sudoh, and H. Tsukada, "NTT's Japanese-English cross language question answering system," Proc. NTCIR-5 Workshop Meeting, pp.186–193, Tokyo, Japan, Dec. 2005.
[7] M. Adriani and Rinawati, "University of Indonesia participation at query answering-CLEF 2005," Working Notes CLEF 2005 Workshop, http://clef.iei.pi.cnr.it/2005/working notes/workingnotes2005/rinawati05.pdf, Vienna, Austria, Sept. 2005.
[8] Wijono, S. Hartati, I. Budi, L. Fitria, and M. Adriani, "Finding answers to Indonesian questions from English documents," Working Notes CLEF 2006 Workshop, http://www.clefcampaign.org/2006/working notes/workingnotes2006/wijonoCLEF2006.pdf, Spain, Sept. 2006.
[9] Indonesian Agency for the Assessment and Application of Technology, "KEBI, Kamus Elektronik Bahasa Indonesia," http://nlp.aia.bppt.go.id/kebi/
[10] X. Li and D. Roth, "Learning question classifiers," Proc. 19th International Conference on Computational Linguistics (COLING 2002), pp.556–562, Taipei, Taiwan, 2002.
[11] D. Zhang and W.S. Lee, "Question classification using support vector machines," Proc. 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp.26–32, Toronto, Canada, 2003.
[12] M. Skowron and K. Araki, "Effectiveness of combined features for machine learning based question classification," Journal of Natural Language Processing, Information Processing Society of Japan, vol.6, pp.63–83, 2005.
[13] Y. Sasaki, "Baseline systems for NTCIR-5 CLQA1: An experimentally extended QBTE approach," Proc. NTCIR-5 Workshop Meeting, pp.230–235, Tokyo, Japan, Dec. 2005.
[14] I.H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques, 2nd ed., Elsevier, 2005.
[15] A. Purwarianti, M. Tsuchiya, and S. Nakagawa, "A machine learning approach for Indonesian question answering system," Proc. IASTED AIA 2007, pp.537–542, Innsbruck, Austria, Feb. 2007.
[16] A. Pirkola, "The effects of query structure and dictionary setups in dictionary-based cross language information retrieval," Proc. 21st Annual International ACM SIGIR, pp.55–63, 1998.
[17] T. Kudoh and Y. Matsumoto, "Use of support vector learning for chunk identification," Proc. Fourth Conference on Natural Language Learning (CoNLL-2000), pp.142–144, Lisbon, Portugal, 2000.
[18] H. Schmid, "Probabilistic part-of-speech tagging using decision trees," Proc. International Conference on New Methods in Language Processing, pp.44–49, Manchester, UK, 1994.
[19] T. Mori and M. Kawagishi, "A method of cross language question-answering based on machine translation and transliteration," Proc. NTCIR-5 Workshop Meeting, pp.215–222, Tokyo, Japan, Dec. 2005.
[20] J. Chen, R. Li, P. Yu, H. Ge, P. Chin, F. Li, and C. Xuan, "Chinese QA and CLQA: NTCIR-5 QA experiments at UNT," Proc. NTCIR-5 Workshop Meeting, pp.242–249, Tokyo, Japan, Dec. 2005.
[21] C.L. Lin, Y.C. Tzeng, and H.H. Chen, "System description of NTOUA group in CLQA1," Proc. NTCIR-5 Workshop Meeting, pp.250–255, Tokyo, Japan, Dec. 2005.

Ayu Purwarianti received her Bachelor's and Master's degrees from Bandung Institute of Technology in 1998 and 2002, respectively. Since 2004, she has been studying at Toyohashi University of Technology as a doctoral student. Her research interest is in natural language processing, especially for the Indonesian language.

Masatoshi Tsuchiya received his B.E., M.E. and Dr. of Informatics degrees from Kyoto University in 1998, 2000 and 2007, respectively. He joined the Computer Center of Toyohashi University of Technology, which was reorganized into the Information Media Center in 2005, as an assistant professor in 2004. His major interest in research is natural language processing.

Seiichi Nakagawa received the Dr. of Eng. degree from Kyoto University in 1977. He joined the faculty of Kyoto University in 1976 as a Research Associate in the Department of Information Sciences. From 1980 to 1983, he was an Assistant Professor, from 1983 to 1990, he was an Associate Professor, and since 1990, he has been a Professor in the Department of Information and Computer Sciences, Toyohashi University of Technology, Toyohashi. From 1985 to 1986, he was a Visiting Scientist in the Department of Computer Science, Carnegie-Mellon University, Pittsburgh, USA. He received the 1997/2001 Paper Award from the IEICE and the 1988 JC Bose Memorial Award from the Institution of Electro, Telecomm. Engrs. His major interests in research include automatic speech recognition/speech processing, natural language processing, human interface and artificial intelligence. He is a Fellow of the IPSJ.