6. FRAMEWORK IMPLEMENTATION AND RESULTS

6.1 INTRODUCTION
The proof of the framework is discussed in this chapter. To evaluate the performance of the proposed framework, two individual experiments are carried out: i) query conversion using the bilingual ontology and language grammar rules (pre-processing), and ii) conversion of the retrieved results (post-processing). The results of these two experiments are then compared with the existing approach, with effectiveness measured by mean average precision (MAP).

6.2 APPROACHES FOR EVALUATING INFORMATION RETRIEVAL
Evaluating a cross-language information retrieval system in the typical way requires three things: a collection of documents or information, a test suite of information needs expressed as queries, and a set of relevance judgments. The standard approach to information retrieval evaluation revolves around the notion of relevant and non-relevant information. With respect to a user's information need, a document or information set in the test collection is given a binary classification as either relevant or non-relevant. Fifty information needs have been found to be a sufficient minimum [73]. Relevance is assessed relative to an information need, not to a query: a retrieved result is relevant if it addresses the stated information need, not merely because it contains all or some of the words in the query.

6.3 TEST COLLECTION
In this research work, English and Telugu webpages are used to evaluate query expansion using the ontology and language grammar rules. The evaluation tasks share the same information collection, containing both Telugu and English web pages in HTML format. The task has few queries, and some queries have no relevant results.

6.4 EVALUATION OF RESULTS
The two most frequent and basic measures of information retrieval effectiveness used here are precision and weighted precision, first used by Kent et al. [74].

Precision = (Relevant Results) / Ret100                                          …… (6.1)

Weighted Precision = (N1 x W1 + N2 x W2 + N3 x W3) / ((N1 + N2 + N3) x W3)       …… (6.2)

where Ret100 is the number of results retrieved at the cut-off (100), N1, N2 and N3 are the counts of relevant results judged at relevance grades 1, 2 and 3, and W1 < W2 < W3 are the corresponding grade weights, W3 being the highest.

The major advantage of using precision together with weighted precision is that, in many cases, one matters more than the other. For example, web searches always provide users with ranked results in which the first items are the most likely to be relevant to the given query (high precision), but they are not designed to return every relevant result for the query. Recall, by contrast, is a non-decreasing function of the number of results retrieved: a recall of 1 can always be obtained by retrieving all results for all queries. Precision, on the other hand, usually decreases as the number of results retrieved increases.

6.4.1 Mean Average Precision
Mean average precision (MAP) provides a single-figure measure of quality across recall levels. Among various evaluation measures, MAP has been shown to have especially good discrimination and stability [75]. Average precision (AP) is the average of the precision values obtained over the top k retrieved results each time a relevant result is retrieved, and this value is then averaged over the information needs. If the set of relevant documents for an information need qj is {d1, ..., dmj} and Rjk is the set of ranked retrieval results from the top result down to document dk, then:

MAP(Q) = (1 / |Q|) * Σ(j = 1..|Q|) [ (1 / mj) * Σ(k = 1..mj) Precision(Rjk) ]

The MAP value estimates the average area under the precision-recall curve for a set of queries, and the measure covers all recall levels.
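To make these measures concrete, the following Java sketch computes precision at a cut-off (equation 6.1 with k = 100), weighted precision (equation 6.2) and average precision/MAP. The class and method names and the array-based representation of the relevance judgments are illustrative choices made here, not part of the implemented framework.

import java.util.List;

/**
 * Illustrative implementations of the evaluation measures used in this chapter.
 * Only the formulas follow equations (6.1), (6.2) and the standard MAP definition;
 * the names and data representation are illustrative.
 */
public class RetrievalMetrics {

    /** Precision at a cut-off: relevant results among the first k retrieved (eq. 6.1 with k = 100). */
    public static double precisionAtK(boolean[] relevant, int k) {
        int hits = 0;
        for (int i = 0; i < k && i < relevant.length; i++) {
            if (relevant[i]) hits++;
        }
        return (double) hits / k;
    }

    /**
     * Weighted precision (eq. 6.2): n1, n2, n3 are the counts of relevant results
     * judged at grades 1, 2 and 3; w1 < w2 < w3 are the grade weights (here 1, 2, 3).
     */
    public static double weightedPrecision(int n1, int n2, int n3,
                                           double w1, double w2, double w3) {
        int total = n1 + n2 + n3;
        if (total == 0) return 0.0;
        return (n1 * w1 + n2 * w2 + n3 * w3) / (total * w3);
    }

    /** Average precision of a single ranked result list (judgments collapsed to binary). */
    public static double averagePrecision(boolean[] relevant, int totalRelevant) {
        if (totalRelevant == 0) return 0.0;
        int hits = 0;
        double sum = 0.0;
        for (int i = 0; i < relevant.length; i++) {
            if (relevant[i]) {
                hits++;
                sum += (double) hits / (i + 1);   // precision at each rank where a relevant result appears
            }
        }
        return sum / totalRelevant;
    }

    /** MAP: mean of the per-query average precision values. */
    public static double meanAveragePrecision(List<Double> averagePrecisions) {
        double sum = 0.0;
        for (double ap : averagePrecisions) sum += ap;
        return averagePrecisions.isEmpty() ? 0.0 : sum / averagePrecisions.size();
    }
}

As a check, weightedPrecision(24, 32, 8, 1, 2, 3) returns 112/192 ≈ 0.5833, which matches the existing-system value reported for Query 2 in Table 6.5 later in this chapter.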
For many applications, measuring at fixed low numbers of retrieved results, such as 10 or 30 results, is useful. This is referred to as precision at k. It has the advantage that no estimate of the size of the set of relevant results is required, but it is the least stable of the commonly used evaluation measures and does not average well. In this research work, average precision (AP) and mean average precision (MAP) are used to measure the results of all experiments, because MAP evaluates the performance of retrieval over the entire query set. The first 500 returned results are considered when calculating MAP.

6.5 EXPERIMENTAL FRAMEWORK AND TOOLKIT
The work has been implemented using Java and the Carrot toolkit, an open-source toolkit whose initial version, Carrot2, was implemented in 2001 by Dawid Weiss. The toolkit setup incorporates the open-source Indri search engine, developed at the Center for Intelligent Information Retrieval (CIIR) at the University of Massachusetts Amherst and the Language Technologies Institute (LTI) at Carnegie Mellon University, which combines an inference network with a language model for retrieval and provides a query log toolkit to capture and analyse user interaction data as well as a set of structured query operators. The Carrot search engine does not by itself support Telugu, so the necessary modifications have been added to it. In this research work, the experimental CLIR system is constructed using the Carrot search engine toolkit, taking advantage of its clustering query language and its built-in clustering models.

6.6 EXPERIMENTAL SETTINGS FOR PRE-PROCESSING
The original user query is written in Telugu. To retrieve more related results, it needs to be converted into English and separated into words according to the corresponding language grammar rules. Once the user gives the input to the framework, a tokenization or lexical analysis process splits the characters into "words" or "tokens". Tokenization can decrease the length of index terms; hence index efficiency may be improved by this processing. Tokenization takes into account the factors discussed in chapter 4 under the tokenizer. In the pre-processing system, all user queries are processed by the components described in chapter 4. Because some user queries have no relevant results in the available collection, these queries are ignored in all experiments.

The bilingual ontology, language grammar rules and OOV components constructed in chapter 4 are used to expand and convert the Telugu query terms. This expansion and the conversion of the queries into their English equivalents using the language grammar rules are performed by the following procedure (a code sketch is given at the end of this section):
(1) The tokenized user query terms are classified into subject, object, verb and inflection.
(2) The English equivalent of each term, including both root terms and node terms, is taken from the ontology and used to replace the Telugu term. Each of these English terms inherits the term weight of the Telugu term.
(3) If a term cannot be found in the ontology because of an inflection attached to the verb, the inflection table and the associated rule are used to identify the root word in that term. Once the root word is found, its English equivalent is taken from the ontology. All inflections listed in the table are included in the new English query; each conversion uses the query term to find the English equivalent.
(4) If different terms have identical conversions, the converted terms are weighted, and the new term weight is the maximum weight amongst the duplicates.
(5) If a Telugu query term is found in the bilingual ontology, its siblings and child nodes are sorted into a list according to their term weights and index, and only the top 5 terms from the list are added to the query along with their term weights.
(6) Query terms which are not found in the ontology are considered out-of-vocabulary terms; these terms are transliterated literally and retained in the query, given the likelihood that they represent meaningful terms. All untranslatable terms are likewise treated as out-of-vocabulary terms and transliterated literally.
(7) Once the terms are finalized using the language grammar rules, the query is reconstructed in the target language. The policy for out-of-vocabulary words which contain special Telugu characters is to neglect them, i.e., words containing special characters that cannot be converted are ignored.

The retrieval performance of the pre-processing experiment is measured using MAP.
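The control flow of the conversion procedure above can be sketched in Java as follows. BilingualOntology, InflectionTable and Transliterator are hypothetical placeholder interfaces standing in for the chapter 4 components, and the sketch omits the term weighting and the grammar-rule reordering of subject, object and verb; it is intended only to illustrate the lookup, inflection-stripping, expansion and out-of-vocabulary steps, not the actual implementation.

import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

/**
 * Minimal sketch of the pre-processing conversion described in section 6.6.
 * BilingualOntology, InflectionTable and Transliterator are hypothetical
 * interfaces standing in for the chapter 4 components.
 */
public class QueryPreprocessor {

    interface BilingualOntology {
        Optional<String> englishEquivalent(String teluguTerm);   // root/node term lookup
        List<String> relatedTerms(String teluguTerm);            // siblings and children, sorted by weight
    }
    interface InflectionTable {
        Optional<String> rootOf(String inflectedTerm);           // strip the verb inflection
    }
    interface Transliterator {
        String romanize(String teluguTerm);                      // literal transliteration for OOV terms
    }

    private final BilingualOntology ontology;
    private final InflectionTable inflections;
    private final Transliterator transliterator;
    private static final int EXPANSION_LIMIT = 5;                // only the top 5 related terms are added

    public QueryPreprocessor(BilingualOntology o, InflectionTable i, Transliterator t) {
        this.ontology = o; this.inflections = i; this.transliterator = t;
    }

    /** Converts a tokenized Telugu query into an expanded English query. */
    public List<String> convert(List<String> teluguTokens) {
        List<String> english = new ArrayList<>();
        for (String token : teluguTokens) {
            // 1. direct ontology lookup
            Optional<String> eq = ontology.englishEquivalent(token);
            // 2. if not found, strip the inflection and retry with the root word
            if (!eq.isPresent()) {
                Optional<String> root = inflections.rootOf(token);
                if (root.isPresent()) eq = ontology.englishEquivalent(root.get());
            }
            if (eq.isPresent()) {
                english.add(eq.get());
                // 3. expansion: add up to five related terms from the ontology
                List<String> related = ontology.relatedTerms(token);
                english.addAll(related.subList(0, Math.min(EXPANSION_LIMIT, related.size())));
            } else {
                // 4. out-of-vocabulary: keep a literal transliteration of the term
                english.add(transliterator.romanize(token));
            }
        }
        return english;
    }
}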
6.7 EXPERIMENTAL SETTINGS FOR POST-PROCESSING
The queries finalized in the pre-processing system are sent to the post-processing system, which retrieves results related to the query; the conversion and re-ranking of these results are done in the post-processing stage. The detailed working procedure of the post-processing system is shown in chapter 5. In the post-processing system, the following steps are followed to convert the retrieved results (a sketch is given after this list):
(1) The retrieved results are given to the tokenizer, and steps (1) to (5) of the conversion are repeated in the post-processing stage to convert the results.
(2) The converted results are re-ranked by the re-ranking system explained in chapter 5, and the results are shown to the user.
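The post-processing stage can be sketched in the same style. RetrievedResult and ResultConverter are hypothetical placeholders, and the relevance score used for sorting is only illustrative; the actual conversion reuses the chapter 4 components and the actual re-ranking criteria are those described in chapter 5.

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/**
 * Sketch of the post-processing stage described in section 6.7.
 * RetrievedResult and ResultConverter are hypothetical placeholders.
 */
public class PostProcessor {

    public static class RetrievedResult {
        public String title;
        public String snippet;
        public double score;          // relevance score assigned during re-ranking
    }

    interface ResultConverter {
        RetrievedResult toTelugu(RetrievedResult englishResult);               // tokenize and convert, as in pre-processing
        double relevanceScore(RetrievedResult result, List<String> queryTerms); // illustrative scoring hook
    }

    private final ResultConverter converter;

    public PostProcessor(ResultConverter converter) { this.converter = converter; }

    /** Converts each retrieved result and re-ranks the converted list before display. */
    public List<RetrievedResult> process(List<RetrievedResult> retrieved, List<String> queryTerms) {
        List<RetrievedResult> converted = new ArrayList<>();
        for (RetrievedResult r : retrieved) {
            RetrievedResult c = converter.toTelugu(r);              // repeat the conversion steps on the result
            c.score = converter.relevanceScore(c, queryTerms);      // score used for re-ranking
            converted.add(c);
        }
        converted.sort(Comparator.comparingDouble((RetrievedResult r) -> r.score).reversed());
        return converted;
    }
}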
6.8 TESTING AND RESULTS
The framework was deployed on different Java-enabled computers and the system was tested in December 2012. The browsing experience of 125 users in the age group of 18 to 35, with browsing periods of 15 to 30 minutes, was benchmarked. The users were trained in the use of the systems and asked to enter queries of their choice.

Figure 6.1 Step by step process of the system

The overall aim of the experimentation was to observe the data and evaluate the precision, weighted precision and time taken. Seventy percent of the users accessed Telugu information over the web regularly. The users were graduate and post-graduate students of engineering and were knowledgeable in browsing content in the Telugu language. The users were given the option of browsing the content through the proposed and the existing system, and a blind testing approach was used: the existing system was labelled as system 1 and the proposed system was labelled as system 2, with Google's Telugu search taken as the existing system; this measure ensured that no bias was present. The same users were given this prototype, and their responses were tabulated. Research hypotheses were framed to validate the work and are discussed below. The first hypothesis concerns the capability of existing search engines to retrieve content in another language for a given user query. The case studies show the results as they appear from the search engine. The pre-processing system imposes some overhead on the processing of the queries; hence, the time taken to complete the results can differ and will definitely be more than that of the regular systems. The precision of the system is measured as the ratio of the relevant results retrieved to the results retrieved.

The ultimate goal of any cross-language information retrieval system is to increase precision and sort the results in the order of relevance. If the relevance ordering of the top-ranked results is improved, the overhead imposed in terms of the additional time taken is acceptable. The key point is that the overhead must not defeat the purpose of the system and must remain within acceptable bounds.

Hypothesis 1: The present search engines do not have the complete capability to retrieve content in other languages.
Hypothesis 2: Word sense can be better represented by the grammar rule based method.

The language grammar rules are a major part of the framework. The rate of growth of the ontology can be exponential, and hence mechanisms to control its size are essential. Case 1 shows the results retrieved by the existing system and by the proposed system: figure 6.2 shows that there are no results in the existing system, while a few results are returned by the proposed system, as shown in figure 6.3.

Figure 6.2 Results for query term "మయిలాడుతురై" in existing system
Figure 6.3 Results for query term "మయిలాడుతురై" in proposed system

In case 2 the user gives the query term "Kiran Kumar Reddy", for which the existing system retrieves only the results available in the Telugu language and shows them to the user, as shown in figure 6.4; the results retrieved for the same query by the proposed framework are shown in figure 6.5.

Table 6.1 shows the relative retrieval efficiency for different user queries. The existing system retrieves a much smaller number of results because it considers only the content available in the language of the user query, whereas the proposed system retrieves a larger number of results related to the user query because it also considers results in other languages. From this table, hypothesis 1 is validated.

Table 6.1 Relative retrieval efficiency
User query                                                     Existing Telugu system results    Proposed rule based system
మయిలాడుతురై (mayiladuthurai)                                   No results                        10
కిరణ్ కుమార్ రడ్డి (kiran kumar reddy)                           790                               8270
మండ్ేలా (mandela)                                               1950                              5460
కిరణ్ కుమార్ రడ్డి రాజీనామా (kiran kumar reddy resigns)           458                               1400
అతడు జయించాడు (he won the match)                                377                               1220
నేను భారతదేశం లో (I am in India)                                 865                               2080
సో షల్ మీడ్డయా (social media)                                    509                               1250
అతను చేసిన సాహితీ (his literature work)                          1070                              2060
ఆమె ఒక పుస్త కం తీస్ుకువచ్చంది (she brought a book)                104                               209
తెలుగు స్ంస్కృతి (Telugu heritage)                                10100                             13800
ఈ రోజు ఉదయం (today morning)                                     3940                              6500

The results of this research work are compared with the existing Telugu search engine, by which the Telugu-English CLIR results can be measured.

Figure 6.4 Results for query term "కిరణ్ కుమార్ రడ్డి" in existing system
Figure 6.5 Results for query term "కిరణ్ కుమార్ రడ్డి" in proposed system

In an experimental setting there are many parameters that can be tested, such as the efficiency of the pre-processing in terms of the time taken for task completion, the precision of the results retrieved by the system, and user acceptance. Each of these parameters has an impact on the overall effectiveness of the proposed system. The comparison between the systems gives an idea of the progressive improvement in the efficiency of the system; for user acceptance, a survey questionnaire was administered, and the questions are shown in annexure 2.
Hypothesis 3: The grammar rule based method and the size of the ontology play a key role in the increase of efficiency.
Hypothesis 4: The time taken for retrieval is comparable between the two models.

The system was run on different queries, and the two systems are compared in terms of the time taken for query processing. The completion times were calculated by the system and entered by the users; the results are shown in Table 6.2. The results support Hypothesis 1 and Hypothesis 4: the existing search engines do not consider results in other languages, and the time taken by the proposed system to retrieve results is only slightly higher than that of the existing system. There is a definite overhead in processing the query from the web, but the results remain within acceptable limits.

Table 6.2 Time taken for query processing in the existing and proposed systems
User query    Existing (seconds)    Proposed (seconds)
1             31                    36
2             35                    31
3             31                    29
4             24                    28
5             25                    43
6             34                    32
7             32                    54
8             27                    33
9             20                    30
10            22                    31
11            28                    35
12            34                    31
13            56                    25
14            25                    33
15            33                    25
16            44                    31
17            28                    35
18            20                    34
19            28                    54
20            36                    38
Average       30.65                 33.65

The precision percentage of the retrieved results, in terms of the content retrieved, is reported in Table 6.3. The users were given a sheet and asked to rank the results in order of relevance; they were also asked to mark results that were not within the scope of the query at all. The overall relevance of the result set, rather than of individual results, was tabulated. The precision values show the accuracy of the data retrieved.

Table 6.3 Precision percentages for retrieved results in existing and proposed systems
User query    Existing (%)    Proposed with pre-processing (%)    Proposed with pre- and post-processing (%)
1             21              60                                  60
2             86              86                                  49
3             26              20                                  56
4             43              68                                  77
5             50              84                                  86
6             32              89                                  45
7             10              35                                  83
8             34              59                                  67
9             49              49                                  64
10            66              47                                  71
11            35              56                                  51
12            51              21                                  46
13            58              67                                  83
14            40              40                                  65
15            63              72                                  72
16            24              53                                  84
17            63              47                                  56
18            40              62                                  58
19            37              17                                  61
20            11              33                                  63
Average       41.95           53.25                               64.85

The results show that the precision of the system increases with its varied usage, and the precision percentages show a large difference between the existing and proposed systems. These results validate research hypothesis 2 and hypothesis 3. Significance tests for these experiments were carried out between the existing and proposed systems using the same query set; the calculations show that there is a significant difference between the methods, and the experiment using the language grammar rules model shows the greater improvement.

Table 6.4 Precision for results (ES = existing system, PS = proposed system)
User query    Relevant results @100 (ES)    Relevant results @100 (PS)    Precision @100 (ES)    Precision @100 (PS)
Query 1       0                             45                            0.0000                 0.4500
Query 2       64                            83                            0.6400                 0.8300
Query 3       38                            54                            0.3800                 0.5400
Query 4       53                            81                            0.5300                 0.8100
Query 5       87                            51                            0.8700                 0.5100
Query 6       54                            80                            0.5400                 0.8000
Query 7       30                            61                            0.3000                 0.6100
Query 8       80                            93                            0.8000                 0.9300
Query 9       39                            67                            0.3900                 0.6700
Query 10      23                            48                            0.2300                 0.4800
Query 11      18                            39                            0.1800                 0.3900

The results in Table 6.4 suggest that the grammar rule based approach to Telugu CLIR greatly improves the retrieval performance and user acceptance. The best retrieval results for the user queries are 0.3368 and 0.2305 for simple and complex queries respectively, attained when the language grammar rules are applied along with the bilingual ontology. Unlike dictionary-based conversion methods, which suffer from out-of-vocabulary terms for which no conversion can be produced, the proposed approach still converts the content, although the conversion may occasionally be inappropriate.
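The chapter does not state which significance test was carried out. As an illustration only, one common choice for comparing two systems over the same query set is a paired t-test on the per-query precision values; the sketch below applies it to the "Existing" and "Proposed with pre- and post-processing" columns of Table 6.3, and the resulting |t| would be compared against the critical value for 19 degrees of freedom. The class name and the choice of test are assumptions made here, not the procedure actually used.

/**
 * Paired t-test over per-query precision scores, shown only as one common
 * choice for comparing two systems evaluated on the same query set.
 */
public class PairedTTest {

    /** Returns the t statistic for the paired differences (df = n - 1). */
    public static double tStatistic(double[] a, double[] b) {
        int n = a.length;
        double meanDiff = 0.0;
        for (int i = 0; i < n; i++) meanDiff += (a[i] - b[i]);
        meanDiff /= n;

        double var = 0.0;
        for (int i = 0; i < n; i++) {
            double d = (a[i] - b[i]) - meanDiff;
            var += d * d;
        }
        var /= (n - 1);                         // sample variance of the differences

        return meanDiff / Math.sqrt(var / n);   // t = mean difference / standard error
    }

    public static void main(String[] args) {
        // per-query precision percentages from Table 6.3
        double[] existing = {21, 86, 26, 43, 50, 32, 10, 34, 49, 66,
                             35, 51, 58, 40, 63, 24, 63, 40, 37, 11};
        double[] proposed = {60, 49, 56, 77, 86, 45, 83, 67, 64, 71,
                             51, 46, 83, 65, 72, 84, 56, 58, 61, 63};
        System.out.println("t = " + tStatistic(proposed, existing));
    }
}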
Table 6.5 Weighted precision for results (ES = existing system, PS = proposed system; the grade columns give the number of relevant results judged at relevance grades 3, 2 and 1 within the top 100)
User query    Rel. @100 (ES)    Grade 3    Grade 2    Grade 1    Weighted precision (ES)    Rel. @100 (PS)    Grade 3    Grade 2    Grade 1    Weighted precision (PS)
Query 1       0                 0          0          0          0.0000                     45                23         16         6          0.7926
Query 2       64                8          32         24         0.5833                     83                53         22         8          0.8474
Query 3       38                8          12         18         0.5789                     54                35         11         8          0.8333
Query 4       53                12         18         23         0.5975                     81                41         22         18         0.7613
Query 5       51                17         13         21         0.6405                     87                49         27         11         0.8123
Query 6       74                27         34         13         0.7297                     80                47         23         10         0.8208
Query 7       30                8          15         7          0.6778                     61                42         16         3          0.8798
Query 8       69                29         22         18         0.7198                     93                53         31         9          0.8244
Query 9       39                11         19         9          0.6838                     67                37         22         8          0.8109
Query 10      23                7          13         3          0.7246                     48                28         12         8          0.8056
Query 11      18                3          4          11         0.5185                     39                19         12         8          0.7607

Each weighted precision value follows equation (6.2); for example, for Query 2 in the existing system, (8 x 3 + 32 x 2 + 24 x 1) / ((8 + 32 + 24) x 3) = 112/192 ≈ 0.5833. As Table 6.5 shows, this approach improves the retrieval performance, and the user also obtains more information related to the given query. A clear increase in retrieval performance is also observed when moving from general CLIR to rule based CLIR for Telugu.

6.9 CONCLUSION
In this chapter, the research work evaluates the effectiveness of each component individually when it is used to convert user queries in cross-language information retrieval for Telugu. Compared to other dictionary-based approaches, the results show that query conversion based on the bilingual ontology is an effective approach to CLIR for Telugu. Query conversion and content conversion using the bilingual ontology and the language grammar rules are different mechanisms, and combining them to implement CLIR for Telugu leads to better retrieval performance. In this research work, the experiments using the ontology and the language grammar rules are also compared between user queries with out-of-vocabulary terms and user queries without out-of-vocabulary terms. The experimental results illustrate that the combination of the language grammar rules with the bilingual ontology performs better than the bilingual ontology alone.