International Journal of Machine Learning and Computing, Vol.1, No. 1, April 2011 A New Context Oriented Synonym Based Searching Technique for Digital Collection M Thangaraj, Member, IACSIT, V Gayathri Abstract—Each year sees the introduction of new digital libraries promoted as valuable resources for education and other needs. Yet systematic evaluation of the implementation and efficacy of these digital library systems is often lacking, due to the traditional keyword based search. Not only the keywords, but the synonym of it also plays an important role in the searching era. In order to overcome the issues of the traditional search, this paper suggests an effective technique, NCOSBS (New Context – Oriented Synonym Based Search). Our proposal is illustrated by showing improvements with respect to existing technique, and providing ground for further research and discussion. Index Terms— B+-Tree; Digital Collection; Hash Table; Synonym–Based Search; obvious that very few results are relevant to the user needs out of the huge set of results even though they contain the keyword. Thus, we need an effective searching technique in digital collection, to produce the best result. The main problem here is, the relation between the terms in the given query [3] [9] i.e., the meaning of the entire query is missing. Thus it is needed to consider the query as the contexts instead of considering just as keywords. This paper suggests a technique named New Context Oriented Synonyms Based Search (NCOSBS) to tackle the above summarized problem. The remainder of the paper is organized as follows: Section II is devoted to the issues relevant to searching. In Section III, we describe the architecture for NCOSBS. Section IV shows our performance evaluation result. Finally, in Section V we present conclusion. I. INTRODUCTION II. RELATED WORK With the tremendous growth of the Web, information “Big Bang” has been taken place on the Internet [2]. Search Engines have become one of the most helpful tools for obtaining useful information from the “Big Bang”. Increasing growth of information volume causes an increasing need to develop automatic methods for retrieval of documents and ranking them according to their relevance to the query. Similarly digital libraries offer diverse information resources in digital format. Traditionally, libraries have been warehouse of knowledge providing information services to the users. Even after the technological intrusion, this basic function remains the corner stone of the library structure. However, the ways and means by which the information is organized, processed, stored, retrieved and filtered have undergone tremendous changes to bring in new technologies and devices. Digital libraries provide instant access to all information, for all sectors of society, from anywhere in the world. Dr.S R Ranganathan [1], the father of library science has rightly pointed out “Right information to the right user at the right time in the right form”. It is observed that the features of digital library seem to reflect the vision of Dr.S.R.Ranganathan. Any given query may fetch huge number of results. It is Manuscript received March 28, 2011. Dr.M Thangaraj is now Associate Professor, in Department of Computer Science, Madurai Kamaraj University, Madurai, TN, India (e-mail: [email protected]). V Gayathri is with the Department of Computer Science, Madurai Kamaraj University, Madurai, TN, India (e-mail: [email protected]). Generally search systems are based on the importance of the papers and / or the existence of the keywords. They do not give much importance for the context. For checking the existence of the keyword, similarity techniques like Text-based similarity [8], Google based similarity [7] [6] is used. Even though there are many techniques are available, still the end users are struggling to get the desired information. Because, in a keyword – based search, the main ambiguity is that, a single word may have different meanings, where as different words may also refer to the same thing. Thus we need to search by considering the context of the given query. In [4], it suggests some structures used for searching like Trie, Composite tree, and bi-partite graph. The main problem with these structures is, time consuming and needs more number of comparisons. To overcome these issues, NCBS[5] uses B+-tree as a searching structure. Here, contexts are extracted from the documents in the corpus using pattern extraction based techniques. All the documents in the corpus are classified based on the context regardless of the query and are mapped into the NCBS structure (combination of B+-tree and Inverted List). When a query is given, the relevant context is identified and returned with its synonym as well as the appropriate document of the context. The main drawback of this method is it can search only with the context, not with its synonyms. It will just return the list of synonyms. Thus our improved version NCOSBS will search in both Context and its synonyms. 100 International Journal of Machine Learning and Computing, Vol.1, No. 1, April 2011 III. PROPOSED WORK In this work, we propose a technique named NCOSBS, which effectively retrieves the relevant document from the collection using both context and synonym. It has two major segments such as pre–processing and the query evaluation. The following process is carried out in the pre-processing phase: Constructing NCOSBS Structure: Patterns are extracted from the digital collection to organize the documents. Documents in the digital collection, which are matched to the extracted patterns, are mapped to the NCOSBS structure. Note that pattern extraction and construction of NCOSBS structure are pre-executed and not query dependent. The following steps are used in the query evaluation: a) Processing the query: The query pattern is extracted from the given input text. b) Identification of Context / Synonyms: The context / synonyms relevant to the query pattern is extracted using the NCOSBS structure. c) Retrieving Relevant Documents: The document(s) relevant to the context / synonym is retrieved. The NCOSBS architecture is given in Figure 1. The data set are nothing but the information in the digital collection pertaining to multiple topics. Using the pre-processing phase, the documents in the digital collection are classified based on the context extracted (using pattern extraction technique). Data Query Query Processing Ctxt Tree Root Node Internal Node Leaf Node Hash Table Ptr to Syn Tree Syn set 1 Syn set 2 … Syn set 3 … … … … … … … … … … … Figure 2. Overview of NCOSBS structure Thus each bucket contains the set of synonyms and a pointer to the respective synonym B+-tree. The synonyms tree is also constructed based on the synonym with its prefix and suffix terms (<prefix><context><suffix>). The leaf node contains the following information: synonym details, context to which it belongs to and a pointer to its respective document. Constructing NCOSBS Structure Pre Processing I/P filled by the alphabetically ordered synonyms of all contexts. Synonyms are retrieved from the Leaf of the Context tree and filled in the synonyms structure. Identifying Query Context / Synonyms B. Identification of Context / Synonym Retrieving Relevant Document Query Evaluation O/P Figure1. NCOSBS Architecture When a query is specified, the relevant pattern is extracted from the collection of patterns. Pattern matching is performed in the searching structure with the extracted query pattern. The relevant context / synonym is identified and the respective document is returned. A. Constructing NCOSBS Structure The digital collections are classified into contexts using pattern extraction based techniques (similar to NCBS). Then the classified contexts are mapped to our NCOSBS structure. NCOSBS structure is a combination of B+-tree and Hash table. The context B+-tree is constructed based on the classified context with its prefix and suffix terms (<prefix><context><suffix>). The leaf node contains the Context, its synonyms and a pointer to its relevant document. The structure of NCOSBS is shown in Figure 2. In Hash Table, each bucket is allotted for a set of alphabets, and it is NCOSBS structure is searched against the given query context. Initially the query context is searched in the Context tree of NCOSBS (Similar to our earlier model NCBS). If it is found in the Context tree, then its relevant document is returned. If the given query context is not found in the Context tree, then it is searched by the starting letter with the synonyms available in the hash table. Once the appropriate bucket of the hash table is traced, then further comparisons are done in its respective Synonyms tree. When the user needs to search for the related synonyms of the given query, it is easier to trace from the NCOSBS structure. Once the respective context (in case of the given query is synonym, find its respective context) is identified, the synonyms list of that context is identified from the Context tree and returned. When searching is stopped without getting the exact context or its related synonyms document as in the query, then the Context/Synonym, up to which the search mechanism found its match with the query context, is returned as a result. When there is no match occurs, then the Contexts in the root of the Context tree are returned as a suggestion for the user’s reference. Instead of getting out with empty result set, the user can get some information to 101 International Journal of Machine Learning and Computing, Vol.1, No. 1, April 2011 make improvement in their searching query task. The user can insert the same query into our database by giving some additional information. Algorithm Data Structure Input Output : Context Identification : See Fig 2. : Query q, Node nodepointer : Context with its document as well as related synonyms Let i, 0 ≤ i < degree of the Context Tree; j, 0 ≤ j ≤ 2 (j – each tuple in the query); x – nodeof Context Tree; doc-ptr– Pointer to the relevant document of the Context; <Pi> - Prefix tuple of the ith node segment of the nodepointer; <Ci> - Context tuple of the ith node segment of the nodepointer; <Si> - Suffix tuple of the ith node segment of the nodepointer; find (Query q, Node nodepointer) { if (nodepointer is NULL) return (NULL); if (nodepointer is a leaf) { x= find (q, nodepointer); return(x, doc-ptr, Synonyms ); \\x is the context } else { for each tuple <qj> in the query context for i = 0 to degree-1{ if (<qj> = <Ci>) { x = find (qj+1, nodepointer -> childi); if (x = NULL) return(current Context from Tree Leaf); return (x); } else if (<qj> = <Pi>) return (find (<qj+1>, nodepointer)); } \\ if qj is not available in any of the node segments in \\ nodepointer, then, search for it in the last subtree x= find(<qj>, nodepointer -> childj); if(x!=NULL) return (x); } x=findHash(q); if(x!=NULL) return (x); return headnode; } When no match found in both prefix and context tuples then, it is searched in the last (i+1) sub tree. In case no query tuple is found in the Context tree, then the same query is searched against the synonyms structure by invoking the Synonym Identification algorithm. If the query elements are not available even in the collection of synonyms then, the Contexts in the root node are returned as a result to the user. Algorithm Data Structure Input Output : Synonym Identification : See Fig 2. : Query q <qp, qc, qs> : Synonym with its document as well as relevant Context Let i, 0 ≤ i ≤ 2 (i – each tuple in the query); findHash (Query q) { for each query tuple<qi> in the query context { Compute hash value hv for <qi>; Identify SynNodePointer of hv; x = findSyn (q, SynNodePointer); if (x = NULL) continue; \\continues with next iteration (i++) return (x); } return (NULL); } Input : Query q <qp, qc, qs> SynNode SynNodePointer Output : Relevant Synonym Node Let d be the degree of the Synonyms Tree, i, 0≤ i ≤ d findSyn (Query q, SynNode SynNodePointer) { if (SynNodePointer is NULL) return (NULL); if (SynNodePointer is a leaf) { if (SynNodePointer.Synonym = q) return (SynNodePointer); } else { if (<qc> < SynNodePointer.Ci) findSyn (q, SynNodePointer.childi); } return (NULL); } In Context Identification algorithm, each query tuple is compared with the context tuple of the node segment. If match found then, subsequent tuples are searched in its sub tree. If no match found then, it is compared with the prefix tuple. When the query tuple is found at the prefix, then the consequent tuples are compared with the context tuple of the same node segment. If the first tuple of the query is not found, then the process is continued with the next tuple. In Synonym Identification algorithm, query is taken as an input. The hash value of the query tuple is computed by which the relevant bucket is identified. The query is now searched against the respective Synonyms tree, using the findSyn function. If the query tuple is not found, then the same process is continued for the next tuple. In findSyn function, context tuple of query is searched against the elements in the synonyms tree. If it is found then the rest of the query is also compared and the respective synonym node is returned. 102 International Journal of Machine Learning and Computing, Vol.1, No. 1, April 2011 REFERENCES IV. PERFORMANCE ANALYSIS NCOSBS approach has been implemented using C++. The set of experiment explores the effects of data and query parameters on performance. All experiments reported in this section are done on Pentium Duo 1.60 GHz with 1015MB RAM and 150 GB of secondary storage, running Windows Vista. In order to test the model and to find the performance, about 5000 documents have been created and various queries have been implemented. When compared to existing methods and techniques, this model requires a quite large storage space. Since storage space is now less expensive, the effectiveness of the access structure need to be concentrated. [1] [2] [3] [4] [5] [6] [7] [8] [9] S Abideen, Srivathsan, “Convergence of Digital Libraries with Knowledgement,” Proc. Of ICDL, India, 2004. http://news.netcraft.com/archieves/web_server_survey.html Information Explosion. Fabrizio Lamberti, Andrea Sanna, Claudio Demartini, “A Relation-Based Page Rank Algorithm for Semantic Web Search Engines”, IEEE Transactions on Knowledge and Data Engineering, Vol.21, No.1, 2009. Gae-won You, Seung-won Hwang, “Search structures and algorithms for personalized ranking”, Science Direct, Information Sciences, 178, 3925–3942, 2008. M Thangaraj, V Gayathri, “An Effective Technique For Context – Based Digital Collection Search”, Proc. of ICMLC 2011, Vol 5, Singapore, pp 128 – 131. Ramiz M. Aliguliyev, “A new sentence similarity measure and sentence based extractive technique for automatic text summarization”, Expert Systems with Applications 36, 7764–7772, 2009. Rudi L. Cilibrasi, Paul M.B. Vitanyi, “The Google Similarity Distance”, IEEE Transactions on Knowledge and Data Engineering, Vol. 19, No. 3, 2007. Yen-Liang Chen, Yu-Ting Chiu, “An IPC-based vector space model for patent retrieval”, Information Processing and Management, Article in press, 2010. Yufei Li, Yuan Wang, Xiaotao Huang, “A Relation-Based Search Engine in Semantic Web”, IEEE Transactions on Knowledge and Data Engineering, Vol.19, No.2, 2007. Figure 3. Synonym Search Comparisons Figure 3 shows the amount of comparisons needed for retrieving the synonym. As NCBS does not have the facility of searching the synonyms, it will search the whole structure and finally return empty result. But our NCOSBS structure will return the exact synonym in less number of comparisons. V. CONCLUSION In this paper, we have proposed a model NCOSBS and analyzed the major issues related to the operations of the digital content. The results of this structure show the effectiveness of the contexts as well as the importance of the searching mechanism. This work can be extended further by incorporating additional features like learning mechanism, automatic query completion. M Thangaraj received his post-graduate degree in Computer Science from Alagappa University, Karaikudi, M.Tech. degree in Computer Science from Pondicherry University and Ph.D. degree in Computer Science from Madurai Kamaraj University, Madurai, TN, South India in 2006. He is now the ASSOCIATE PROFESSOR of Computer Science Department at M.K.University. He is an active researcher in Web mining, Semantic Web and Information Retrieval and has published more than 50 papers in Journals and Conference Proceedings. He is a senior member of IACSIT. He has served as program chair and program committee member of many international conferences held in India for about one decade. ACKNOWLEDGMENT The authors wish to thank the IACSIT for giving an opportunity to publish in this journal and also thank the anonymous reviewers of ICMLC for their significant comments. 103 V Gayathri received her post-graduate degree in Computer Science in 2008 and M.Phil. degree in Computer Science in 2010 from Madurai Kamraj University, Madurai, TN, India. She is currently a Ph.D. SCHOLAR in Dept. of Computer Science, M. K. University, Madurai, TN, India. Her research interests include Context-based Search, Information Retrieval and Semantic Web.
© Copyright 2025 Paperzz