design of a context based indexing system

Chapter 4
DESIGN OF A CONTEXT BASED INDEXING SYSTEM
4.1 INTRODUCTION
This chapter presents an indexing structure in which the index is built on the basis of the context of
the document, derived using ontology, rather than on the basis of terms. The ontology-based
collection selection method presented in this work uses context to describe collections and search
engines. The indexer extracts the context of the documents that the crawler has collected in the
repository, using the context repository, the thesaurus and the ontology repository, and then
indexes the documents according to their respective contexts.
4.2 ARCHITECTURE OF CONTEXT BASED INDEXING SYSTEM
The context based indexing system [95] constructs the index on the basis of the semantics of the
document. Once document preprocessing is complete, the term with the maximum frequency that
also matches the title is extracted from the document. This maximum frequency keyword is then
searched in the thesaurus (which can be taken online from thesaurus.com) and in the context
repository. This step helps in extracting the context of the document, but a keyword may have
multiple contexts, so at this stage multiple contexts may be extracted. The next step is to extract
the specific context of the document from these multiple contexts. The multiple contexts and the
terms of the document are compared with the ontology repository: by matching the keywords of
the document and the multiple contexts with the concepts and the relationship terms in the
ontology repository, the specific context of the document is extracted. The posting list in the
index now pairs each context with the list of documents that contain the term with that specific
context. The architecture of the context based indexing system is shown in Fig. 4.1.
[Figure 4.1: block diagram. Components: Web Pages Repository, Documents Preprocessing, Indexer, Maximum frequency keyword extraction, Thesaurus, Context repository, Ontology repository, Document context, Index creation, Searcher, Search Interface (query with context). Sample index entries (Context, Terms, Doc Ids): Fruit, apple, 1, 3, 5; Computer, ipod, 6, 78, 90.]
Figure 4.1 Architecture of Context based Indexing
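The index layout in Fig. 4.1 can be pictured as a nested mapping from context to term to document identifiers. The following is a minimal sketch under that assumption, not the implementation used in this work; the names `context_index` and `add_posting` are hypothetical, and the rows are the sample entries from the figure.

```python
from collections import defaultdict

# Hypothetical in-memory layout: context -> term -> sorted list of document ids.
context_index = defaultdict(lambda: defaultdict(list))


def add_posting(index, context, term, doc_id):
    """Record that doc_id contains term used in the given context."""
    postings = index[context][term]
    if doc_id not in postings:
        postings.append(doc_id)
        postings.sort()


# Sample rows taken from Fig. 4.1.
for doc_id in (1, 3, 5):
    add_posting(context_index, "fruit", "apple", doc_id)
for doc_id in (6, 78, 90):
    add_posting(context_index, "computer", "ipod", doc_id)

print(dict(context_index["fruit"]))     # {'apple': [1, 3, 5]}
print(dict(context_index["computer"]))  # {'ipod': [6, 78, 90]}
```

Keying on the context first means that a lookup never mixes the postings of one context with those of another.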
A brief description of the various components of the architecture is given in the next section.
4.2.1 DESCRIPTION OF VARIOUS COMPONENTS
This section briefly describes each component of the architecture.
A. Repository of web pages: This is the database that contains the set of documents collected by
the crawler.
B. Indexer: After the documents have been gathered by the crawler, the indexer maintains an
index of the documents in the form of posting lists, which contain each term together with the
identifiers of the documents that contain it, along with other related information.
C. Preprocessing of document: The preprocessing step involves stemming as well as removal of
stop words. A stop word is a word that carries little or no semantic content. Common stop words
are prepositions and articles, as well as high frequency words that do not help retrieval.
D. Thesaurus: It is an online lexicon, available to users on the World Wide Web at thesaurus.com,
which contains words along with their multiple meanings.
E. Context Repository: This is a database that contains the various contexts and maintains several
types of context data. New contexts derived from the thesaurus are also stored in this repository.
F. Ontology Repository: This is a database of ontologies, which contains the concepts of various
domains along with the relationships among them.
G. Context of the document: This context represents the theme of the document, extracted using
the context repository, the thesaurus and the ontology repository.
H. Index: This is the final index, constructed after the context of the document has been
extracted. Rather than being formed on a term basis, the index is constructed on a context basis,
with the context as the first field, the term as the next field and finally the document identifiers of
the relevant documents.
I. Searcher: It is the module of the search engine that receives user queries via the user interface,
searches the index and returns the results to the user.
J. Search Interface: It is the user interface through which the user types the query along with the
specified context.
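The preprocessing and keyword extraction performed before the context is looked up can be sketched as follows. This is only an illustration: the stop-word list, the crude `naive_stem` rule and the function names are hypothetical, and "matched with the title" is assumed here to mean that the term also occurs in the page title.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; a real system would use a fuller one.
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "in", "on", "and", "to", "it", "for"}


def naive_stem(word):
    """Very crude suffix stripping, standing in for a real stemmer."""
    if word.endswith("ing") and len(word) > 5:
        return word[:-3]
    if word.endswith("s") and not word.endswith(("ss", "us")) and len(word) > 3:
        return word[:-1]
    return word


def preprocess(text):
    """Lowercase, tokenize, drop stop words, and stem the remaining tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS]


def max_frequency_keyword(title, body):
    """Return the most frequent body term that also occurs in the title."""
    title_terms = set(preprocess(title))
    counts = Counter(t for t in preprocess(body) if t in title_terms)
    return counts.most_common(1)[0][0] if counts else None


title = "Apple - Wikipedia, the free encyclopedia"
body = "The apple is the pomaceous fruit of the apple tree. Apples are cultivated tree fruits."
print(max_frequency_keyword(title, body))  # apple
```

Restricting the count to title terms keeps a frequent but off-topic word from being chosen as the keyword.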
4.2.2 ALGORITHM FOR INDEX CONSTRUCTION
The algorithm in Fig. 4.2 shows the steps for constructing the context based index and for the
context based searching that follows.
Algorithm Index_construct()
{
Step 1. Document preprocessing.
        // includes stemming and stop word removal
Step 2. Extraction of maximum frequency term matched with title.
Step 3. Maximum frequency keyword searched in thesaurus and the context repository.
        // thesaurus taken online from thesaurus.com
Step 4. Extraction of contexts.
        // multiple contexts may be extracted
Step 5. Extraction of specific context.
        // by matching the keywords of the document and the multiple contexts
        // with the concepts and the relationship terms in the ontology repository
Step 6. Construction of context based index.
Step 7. Firing of user’s query with context.
Step 8. Searching of index on context basis.
Step 9. Matching of query terms with terms in the index.
Step 10. Documents provided to the user.
}
Figure 4.2 Algorithm (Index_construct)
As shown in the algorithm, the context of the document is extracted on the basis of the maximum
frequency keyword. Keywords that can have multiple contexts are processed further using the
ontology repository, which contains the ontology representation of keywords in different
contexts, to extract the specific context. The index obtained after applying the above algorithm
thus consists of the context, the term and the identifiers of the documents related to that context.
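Steps 6 to 10 of the algorithm, that is, index construction followed by a query fired with a context, can be sketched as below. This is a hedged illustration rather than the system's actual code: `build_index`, `search` and the sample records are hypothetical, and combining several query terms by union is an assumption that the chapter does not specify.

```python
def build_index(records):
    """Build context -> term -> set of doc ids from (doc_id, term, context) records."""
    index = {}
    for doc_id, term, context in records:
        index.setdefault(context, {}).setdefault(term, set()).add(doc_id)
    return index


def search(index, query_terms, context):
    """Select the context first, then match query terms inside that context."""
    terms_in_context = index.get(context, {})
    results = set()
    for term in query_terms:
        results |= terms_in_context.get(term, set())
    return sorted(results)


# Hypothetical records, as they might come out of steps 1 to 5.
records = [
    (1, "apple", "fruit"),
    (3, "apple", "fruit"),
    (6, "apple", "computer"),
    (78, "ipod", "computer"),
]
index = build_index(records)

print(search(index, ["apple"], "fruit"))     # [1, 3]
print(search(index, ["apple"], "computer"))  # [6]
print(search(index, ["apple"], "music"))     # []
```

The same query term returns different documents depending on the context supplied with it, which is the behaviour the context based index is meant to provide.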
4.2.3 EXAMPLE ILLUSTRATING CONTEXT BASED INDEXING
This example extracts the context of the web pages retrieved in response to the query apple.
Consider the web pages in the repository related to the keyword apple. The keyword apple has
two contexts: one related to computers (Apple iPhone, iPad, iPod etc.) and the other related to
fruit. When the query “apple” is fired on the Google search interface, the links shown in Fig. 4.3
are retrieved. The results page contains results related to both contexts of apple.
Figure 4.3 Results of Google for query “apple”
The results for the query in tabular form are shown in Table 4.1.
Table 4.1 Results retrieved from www.google.com for the query “Apple”.
Sr. No. | Result retrieved at www.google.com for query “apple” (title and URL) | Contents of the retrieved result
1 | Apple - Wikipedia, the free encyclopedia (en.wikipedia.org/wiki/Apple) | The apple is the pomaceous fruit of the apple tree, species Malus domestica in the rose family (Rosaceae). It is one of the most widely cultivated tree fruits…
2 | Apple Inc. - Wikipedia, the free encyclopedia (en.wikipedia.org/wiki/Apple_inc) | Apple Inc. is an American Multinational Corporation that designs and sells consumer electronics, computer software and…
3 | All Apple Phones (www.gsmarena.com/apple-phones-48.php) | Apple GSM cellphones.. Apple iPhone 5 review hotter than ever apple iPad 3review fast 4ward. Apple iPhone 4S review. Love and hate…
4 | Apple – YouTube (www.youtube.com/user/apple) | Apple designs the Mac along with OS X iLife and iWork. It leads the digital music revolution with iPods and iTunes. It reinvented the mobile phone with iPhone…
As can be seen from Table 4.1, there are documents related to both contexts of apple. Suppose the
context of the document shown in the snapshot in Fig. 4.4 is to be extracted. As per the
algorithm, the maximum frequency keyword is found first. In this document, the term with the
maximum frequency, as matched with the keywords of the document as well as the URL
(http://en.wikipedia.org/wiki/Apple), is apple.
Figure 4.4 Web Page Retrieved in response to Query “apple”
When the online thesaurus WordNet is consulted, as shown in Fig. 4.5, the meaning of the term
apple is retrieved along with related words such as fruit, orchard apple tree, cultivated and Malus
pumila. These related words match the ontology of the keyword apple in the domain of fruit.
Moreover, they also match the terms in the document. So the context of the document is
extracted as fruit.
Figure 4.5 WordNet retrieval in response to keyword “apple”
Now consider a document related to the other context, i.e. computer, as shown in Fig. 4.6.
Figure 4.6 Web page Retrieved in response to query “apple”
Suppose the context of the above document is to be extracted. The maximum frequency keyword,
as matched with the keywords of the document as well as the URL
(http://www.apple.com/in/ipod/), is ipod.
When the online thesaurus WordNet is consulted, as shown in Fig. 4.7, the meaning of the term
ipod is retrieved along with related words such as device, store and files. These related words
match the ontology of the keyword apple in the domain of computer related devices. Moreover,
they also match the terms in the document. So the context of the document is extracted as
computer.
Figure 4.7 WordNet retrieval in response to keyword “ipod”
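The disambiguation used in the two examples above, choosing the candidate context whose ontology terms match the document, can be sketched as a simple overlap count. The dictionaries below are hypothetical stand-ins for the context repository and the ontology repository, filled with the related words quoted in the text; the scoring rule (number of shared terms) is an assumption, not the measure used in this work.

```python
# Hypothetical stand-ins for the context repository and ontology repository.
CANDIDATE_CONTEXTS = {"apple": ["fruit", "computer"], "ipod": ["computer"]}
ONTOLOGY = {
    "fruit": {"fruit", "tree", "orchard", "cultivated", "malus"},
    "computer": {"device", "store", "files", "iphone", "ipod", "software"},
}


def extract_context(keyword, doc_terms):
    """Return the candidate context whose ontology terms best match the document."""
    doc_terms = set(doc_terms)
    scores = {
        context: len(ONTOLOGY.get(context, set()) & doc_terms)
        for context in CANDIDATE_CONTEXTS.get(keyword, [])
    }
    if not scores or max(scores.values()) == 0:
        return None  # no evidence for any candidate context
    return max(scores, key=scores.get)


fruit_page = ["apple", "pomaceous", "fruit", "tree", "malus", "cultivated"]
gadget_page = ["apple", "ipod", "device", "store", "files", "music"]

print(extract_context("apple", fruit_page))  # fruit
print(extract_context("ipod", gadget_page))  # computer
```

A keyword with a single candidate context, such as ipod here, goes through the same function and simply has one score to compare.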
Now consider the case of a keyword that has a single context, as shown in Fig. 4.8. Indexing in
search engines is one such query.
Figure 4.8 Web page Retrieved in response to query “indexing”
Suppose the context of the above document is to be extracted. As per the algorithm, the
maximum frequency keyword is found first. In this document, the term with the maximum
frequency, as matched with the keywords of the document as well as the URL
(http://en.wikipedia.org/wiki/Search_engine_indexing), is “search engine”.
When the online thesaurus WordNet is consulted, as shown in Fig. 4.9, the meaning of the term is
retrieved along with related words such as computer, database, retrieves and document. These
related words match the ontology of the keyword search engine in the domain of search engines.
The ontology for search engine consists of the same terms as its WordNet retrieval. Moreover,
these related words also match the terms in the document. So the context of the document is
extracted as “search engines”. Thus a single context is extracted for this query.
Figure 4.9 WordNet retrieval in response to keyword “search engine”
If the maximum frequency keyword turns out to be indexing, then the WordNet retrieval can be
done as shown in Fig. 4.10.
Figure 4.10 WordNet retrieval in response to keyword “indexing”
The example discussed in Section 4.2.3 demonstrates how the context of a document can be
extracted. After the index has been constructed on the basis of context, context ontology can be
applied for ranking and searching. Context ontology is a shared vocabulary for sharing context
information in a pervasive computing domain. The next chapter discusses context ontology
driven information retrieval, query expansion and search.
