6 - Shodhganga

1. POST PROCESSING
5.1 INTRODUCTION
Chapter 4 explained the pre-processing of the user’s query. In the post-processing
stage, the processed query is submitted to the search engine, and the retrieved
results are converted by the post-processing system using the grammar rule structure
and the ontology model. Finally, the results are re-ranked by the re-ranking algorithm
and shown to the user in the language of the user’s query. The grammar-based system
plays a role both in assembling the results in the target language and in the
re-ranking system, where the most relevant results are shown first.
5.2 METHODOLOGY OF PROPOSED POST-PROCESSING
The major objective of the post-processing stage is to convert the retrieved
results related to the query into the Telugu language. There are three distinct
components in the process (Figure 5.1).
Figure 5.1 Overall process of post-processing
5.2.1 Tokenizer
The tokenizer works in the same way as in the pre-processing stage. Here it is
used to tokenize the results retrieved for the given queries. Tokens are separated
by whitespace characters, such as a space or line break, or by punctuation
characters. Figure 5.2 illustrates the process for a sample user-given Telugu query.
Figure 5.2 Tokenizer process (Results retrieved → Tokenization → Output)
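The splitting rule described above (whitespace or punctuation as separators) can be sketched in Python as follows. This is only an illustrative sketch, not the tokenizer used in this work: the function name `tokenize` and the choice of ASCII punctuation from `string.punctuation` as the separator set are assumptions. An explicit separator set is used instead of the `\w` pattern because Telugu words contain combining vowel signs that should stay attached to their base letters.

```python
import re
import string

# Assumed separator set: any run of whitespace or ASCII punctuation.
_SEPARATORS = re.compile("[" + re.escape(string.punctuation) + r"\s]+")


def tokenize(snippet):
    """Split a result snippet into tokens, dropping empty strings."""
    return [token for token in _SEPARATORS.split(snippet) if token]


if __name__ == "__main__":
    print(tokenize("Mobile computing, a general term."))
    print(tokenize("మొబైల్ టవి - వికీపీడియా"))
```

Because the split is driven only by separators, the same function handles English and Telugu snippets without any language-specific logic.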
When a snippet is given to the tokenizer, it is split into tokens for further
processing. The place of the tokenizer within the overall system is shown in Figure
5.3.
1. The user’s query is received and processed by the pre-processing system.
2. The system converts the query into an equivalent English query and passes it
to the search engine.
3. The search engine retrieves the results related to the query.
4. The output of the search engine is passed to the post-processing system, and
the processed results are presented to the user.
Figure 5.3 Process Flow of system
5.2.2 Language Grammar Rules
The tokenized snippet terms are sent to the language grammar rule component
for processing. The detailed flow of the grammar structure is explained in Appendix 1;
this subsection summarizes only the essentials. The component works in the same way
as in the pre-processing stage.
Once the terms are identified, the component looks into the ontology for equivalent
terms in order to convert the results into the language of the user’s query. Once the
results are converted, the re-ranking process starts.
5.2.3 Re-ranking system
In Web searches for queries processed by the pre-processing system, it has been
observed that the majority of the snippets contain the search query terms. Hence,
methods that manipulate the results based on the snippets must also take into
account the linkages of the search term in the context of the snippet, which
requires the ontology. The snippets are assigned a rank based on the inter-term
relationships in an organized set of steps. This approach, outlined in Chapter 4,
is combined with the post-processing system.
The ontological method models the set of keywords retrieved by the search
process as a unified whole, from which the content can be re-ranked using the fuzzy
relations between the query term and the ontology.
Figure 5.4 Term frequency for the query terms relationship (node labels: Query, Meaning, Relevant, Relationship, Related; values shown: 0.9, 0.7, 1.0, 0.5, 0.6, 0.8)
The processing of each term is considered as a step in the computation of the
information gain, and the consolidated information gain tf_ij is calculated for the
entire snippet contents. Here, the index i identifies the snippet and the index j
identifies the term within that snippet. Each snippet is randomly chosen from among
the search results. The terms visited in snippet i can be written as tf_i1, tf_i2,
tf_i3, …
For each term in the snippet, the distance vector measure is calculated in terms
of the term-relationship frequency, which measures the level of the term-relationship
value. The term relationship is then calculated for the snippet according to how
each term is related to the contents of the ontology, in the dependency tree order
shown in Figure 5.4. The position of the term in the parse tree is found first. For
meanings the value is 1.0, for relationships it is 0.9, and for the third position
(related) it is 0.75. The following positions are assigned the values 0.60, 0.55,
0.50, 0.45 and so on until 10 terms are reached. All other terms are assigned 0.05,
and anything beyond that is not assigned a value and is left out. These values
(1.0, 0.90, 0.75, …) were arrived at by experimentation. A sample term frequency
calculated for the ontology is given in Figure 5.5.
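The position-based values above can be illustrated with a short Python sketch. Several points are assumptions rather than part of the described method: the text lists the values only up to 0.45, so continuing the 0.05 step down to the tenth position is assumed; every position past the tenth is given 0.05 and the cut-off after which terms are left out is not modelled; and summing the per-term values into one snippet score is just one possible way to consolidate them. The names `position_weight`, `snippet_score` and `rerank` are hypothetical.

```python
def position_weight(position):
    """Value for a 1-based position in the dependency-tree order."""
    if position == 1:
        return 1.0    # meaning
    if position == 2:
        return 0.9    # relationship
    if position == 3:
        return 0.75   # related
    if position <= 10:
        # 0.60, 0.55, 0.50, 0.45 are from the text; the further 0.05 steps are assumed.
        return round(0.60 - 0.05 * (position - 4), 2)
    return 0.05       # all other terms


def snippet_score(term_positions):
    """Assumed consolidation: sum of the values of the snippet's terms."""
    return sum(position_weight(p) for p in term_positions)


def rerank(snippets):
    """snippets maps a snippet id to its term positions; returns ids, best first."""
    return sorted(snippets, key=lambda sid: snippet_score(snippets[sid]), reverse=True)


if __name__ == "__main__":
    print(rerank({"R1": [1, 2], "R2": [3], "R3": [1, 2, 3]}))
```

With these inputs the snippet whose terms occupy the three top positions is ranked first, followed by the one covering two of them.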
The similarity of the query results is found next by comparing the non-stop
terms of the snippets. Two snippets are considered similar if more than 60% of
their terms match. The value of 60% was arrived at by experimentation; a theoretical
basis for it will be derived in future work. Similar snippets are clustered in the
order (meaning, related, relationship; snippet number). The results contain a mix of
English and Telugu content. For the English results, the smoothening approach is
used.
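A minimal Python sketch of the 60% similarity test is given below. The text does not state which snippet the percentage is measured against, so the sketch assumes the share of matching terms is taken relative to the smaller of the two term sets. The stop-word list and the names `content_terms` and `are_similar` are illustrative only.

```python
# Illustrative stop-word list; the list used in this work is not reproduced here.
STOP_WORDS = {"a", "an", "the", "is", "of", "in", "to", "and"}


def content_terms(snippet):
    """Lower-cased non-stop terms of a snippet, as a set."""
    return {w.lower() for w in snippet.split() if w.lower() not in STOP_WORDS}


def are_similar(snippet_a, snippet_b, threshold=0.60):
    """True if more than `threshold` of the terms match (assumed: of the smaller set)."""
    terms_a, terms_b = content_terms(snippet_a), content_terms(snippet_b)
    if not terms_a or not terms_b:
        return False
    overlap = len(terms_a & terms_b) / min(len(terms_a), len(terms_b))
    return overlap > threshold


if __name__ == "__main__":
    print(are_similar("mobile computing is the use of portable devices",
                      "mobile computing devices in use"))
```

Snippets that pass this test would then be placed in the same cluster before ordering.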
5.2.4 Smoothening Approach
The resultant snippets in English are taken one at a time. The basic task is to
identify the root word of each term in the snippet. First, the snippets are split
into sentences. Sentences are classified as simple or complex based on their
structure: a simple sentence is one that follows the subject-verb-object form, and
all other sentences are complex. For each sentence, the terms are classified into
clauses and stop words.
Figure 5.5 Sample term frequency (node labels: Related, Relationship, Mobile (మొబైల్), Computing (కంప్యూటంగ్), Technical (సాంకేతిక), Item (వస్తువు), Telephone (దూర వాణి), Standard (సామర్ధాయాన్ని))
A clause is a verb, adverb or adjective. The stop words are identified from the
sentences. The terms are converted into their root words using Porter’s stemming
algorithm. Language-specific rules are then applied to identify the translation
heuristics. A single term may exist in different tenses and word forms; hence the
query-specific information tree sequence is used to disambiguate the sense of the
term. Morphological rules are then applied to get the translation for known grammar
forms and terms. Out-of-vocabulary terms are treated in the same manner as proper
nouns: such terms are transliterated automatically.
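The first stages of the smoothening approach (sentence splitting, stop-word separation and reduction to root words) can be sketched as below. This is a simplified illustration under stated assumptions: `crude_root` is a small suffix-stripping stand-in and is not Porter’s algorithm (a full implementation, such as the `PorterStemmer` class in NLTK, would replace it), and the stop-word list and function names are hypothetical.

```python
import re

# Illustrative stop-word list and suffix list; both are assumptions.
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "in", "to", "and", "for"}
SUFFIXES = ("ing", "ed", "es", "s")


def split_sentences(snippet):
    """Split a snippet at sentence-ending punctuation followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", snippet) if s.strip()]


def crude_root(term):
    """Stand-in for a stemmer: strip one common suffix if at least 3 letters remain."""
    for suffix in SUFFIXES:
        if term.endswith(suffix) and len(term) - len(suffix) >= 3:
            return term[: -len(suffix)]
    return term


def smooth(snippet):
    """Return, per sentence, its stop words and the roots of the remaining terms."""
    result = []
    for sentence in split_sentences(snippet):
        words = re.findall(r"[A-Za-z]+", sentence.lower())
        stop_words = [w for w in words if w in STOP_WORDS]
        roots = [crude_root(w) for w in words if w not in STOP_WORDS]
        result.append({"sentence": sentence, "stop_words": stop_words, "roots": roots})
    return result


if __name__ == "__main__":
    for entry in smooth("Mobile computing is used. Phones are portable devices."):
        print(entry)
```

The root words produced at this stage are what the language-specific and morphological rules would then operate on.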
Case 1: Figure 5.6 shows an example of the results retrieved for a user’s query
that has been pre-processed and converted into English by the pre-processing system.
The step-by-step operation of the post-processing system on the retrieved results is
as follows:
Step 1: Relevant results related to the pre-processed user query are retrieved from
the Web.
Step 2: Each result is tokenized into tokens.
Step 3: Using the English grammar rules, the terms (subject, verb and object) are
identified. To apply the grammar rules, the tokens are first examined for
inflection; if an inflection is found, the equivalent grammar rule is used to
identify the subject, verb and object.
Step 4: Once the terms (subject, verb, object and inflection) are identified, the
ontology is searched for equivalent terms; in this case, for Telugu terms.
Step 5: The terms that are not available in the ontology are sent to the OOV
component to be transliterated literally.
Step 6: Once the terms are converted, the result is converted into Telugu.
Step 7: Using the ontology, the re-ranking process is carried out and the results
are shown to the user in the user’s native language. Figure 5.7 shows the results
processed by the post-processing system.

మొబైల్ కంప్యూటంగ్ - వికీపీడియా
te.wikipedia.org/wiki/మొబైల్_కంప్యూటంగ్
మొబైల్ కంప్యూటంగ్ (Mobile computing) అనేది చలనంలో ఉనిప్ుుడు సాంకేతిక వస్తువులనత వాడటాన్నకి ఒక వూకిుకుని సామర్ధాయాన్ని వర్ధ్ణంచడాన్నకి వాడే సాధారణ ప్దం, స్థి రముగా ఒక చోట అమర్ధ్క చేస్థ మాత్రమే వాడటాన్నకి వీల ైన స్తలభంగా ...

మొబైల్ టవి - వికీపీడియా
te.wikipedia.org/wiki/మొబైల్ టవి
మొబైల్ టెలివిజన్ అంటే చేతిలో ఇమిడే ప్ర్ధ్కరముతో టెలివిజన్ చూడడం. ఆ ...

మొబైల్ నంబర్ పో ర్టబిలిటీ - వికీపీడియా
te.wikipedia.org/wiki/మొబైల్_నంబర్_పో ర్టబిలిటీ
మొబైల్ నంబర్ పో ర్టబిలిటీ (Mobile Number Portability or MNP ) మొబైల్ ఫో నత వాడకందారల కు, ఒక మొబైల్ నెటవర్క్ ఆప్ర్ధేటర్క నతండి మర్ధొక ఆప్ర్ధేటర్కకు మార్ధ్ినప్ుడు త్మ మొబైల్ టెలిఫో న్ నంబర్కనత ఉంచతకోగలిగే సౌలభూం కలిుస్తుంది. ...
Figure 5.6 Results retrieved related to the query

R1: మొబైల్ కంప్యూటంగ్ (Mobile computing) సాంకేతిక వస్తువులనత వూకిుకుని సామర్ధాయాన్ని వర్ధ్ణంచడాన్నకి సాధారణ ప్దం, స్థి రముగా అమర్ధ్క చేస్థ వాడటాన్నకి వీల ైన స్తలభంగా...
R2: మొబైల్ టెలివిజన్ చేతిలో ప్ర్ధ్కరముతో టెలివిజన్ చూడడం. ఆ ...
R3: మొబైల్ నంబర్ పో ర్టబిలిటీ (Mobile Number Portability or MNP) మొబైల్ ఫో నత, మొబైల్ నెటవర్క్ ఆప్ర్ధేటర్క మార్ధ్ినప్ుడు మొబైల్ టెలిఫో న్ నంబర్కనత...
Figure 5.7 Final Results to the user

The flow chart of the system for a given pre-processed query is shown in Figure 5.8.
Start → Retrieve relevant results related to the pre-processed query → Tokenize the snippet into tokens using tokenizer → Inflection Table lookup → Language Grammar Rules to identify Subject, verb and object → Rule identification based on the inflection and verb → Lookup into the ontology for equivalent terms (Yes: continue; No: Transliteration) → Results conversion into the user native language → Results re-ranking using ontology → Stop
Figure 5.8 Flow Chart for the Post-Processing stage
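The ontology lookup with a transliteration fallback for out-of-vocabulary terms (Steps 4 and 5 of Case 1) can be sketched in Python as follows. The sketch is illustrative only: the dictionary `ONTOLOGY` is a hypothetical stand-in for the ontology, populated with three English and Telugu pairs taken from Figure 5.5, and `transliterate` is a placeholder that merely marks a term as needing transliteration rather than producing Telugu script.

```python
# Hypothetical stand-in for the ontology; the pairs are taken from Figure 5.5.
ONTOLOGY = {
    "mobile": "మొబైల్",
    "technical": "సాంకేతిక",
    "item": "వస్తువు",
}


def transliterate(term):
    """Placeholder for the OOV component; only marks the term for transliteration."""
    return f"<translit:{term}>"


def convert_terms(terms):
    """Replace each term by its ontology equivalent, or fall back to transliteration."""
    converted = []
    for term in terms:
        key = term.lower()
        if key in ONTOLOGY:
            converted.append(ONTOLOGY[key])         # Step 4: equivalent term found
        else:
            converted.append(transliterate(term))   # Step 5: out-of-vocabulary term
    return converted


if __name__ == "__main__":
    print(convert_terms(["Mobile", "Bluetooth", "item"]))
```

Keeping the fallback in a separate function mirrors the flow chart, where transliteration is a distinct branch taken only when the ontology has no equivalent term.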
5.3 CONCLUSION
In this chapter, the post-processing system for presenting content to the user
has been explained in detail. The re-ranking algorithm is snippet-based and takes
into account the grammatical structure of the resultant snippets. The highlight of
this work is the semantic nature of the entire processing; over the past two
chapters, the Telugu equivalent for a user-generated query has been produced. The
next step is to evaluate the system with various parameters, which is done in the
next two chapters.