Slide 1: Relevance Feedback and Other Query Modification Techniques
Course: Information Retrieval and Recommendation Techniques
Advisor: Prof. 黃三益
Presenters: 楊錦生 (d9142801) and 曾繁絹 (d9142803), first-year Ph.D. students

Slide 2: Introduction
Precision vs. recall: when a high recall ratio is critical, users need to retrieve more of the relevant documents. Methods to retrieve more:
- "Expand" the search by broadening a narrow Boolean query or by looking further down a ranked list of retrieved documents.
- Modify the original query.

Slide 3: Introduction (cont'd)
The "word mismatch" problem: some of the unretrieved relevant documents are indexed by a different set of terms than those in the query or in most of the other relevant documents. Approaches for improving the initial query:
- Relevance feedback
- Automatic query modification

Slide 4: Conceptual Model of Relevance Feedback
[Figure: a feedback loop. The user issues a Query; the system returns a Result Set; the user supplies Relevance Feedback on it; a New Query is then formed based on the result set.]

Slide 5: Basic Ideas about Relevance Feedback
Two components of relevance feedback:
- Reweighting the query terms based on the distribution of these terms in the relevant and nonrelevant documents retrieved in response to the query.
- Changing the actual terms in the query.

Slide 6: Basic Ideas about Relevance Feedback (cont'd)
Evaluating relevance feedback:
- Comparing the results after one iteration of feedback against those using no feedback generally shows spectacular improvement.
- Another way to evaluate the results is to compare only the residual collections, i.e., the rankings with the documents already judged by the user removed.

Slide 7: Basic Approach to Relevance Feedback
Rocchio's approach used the vector space model to rank documents:

Q_1 = Q_0 + \frac{1}{n_1} \sum_{i=1}^{n_1} R_i - \frac{1}{n_2} \sum_{i=1}^{n_2} S_i

where Q_0 is the original query vector, Q_1 the modified query vector, R_i the vectors of the n_1 judged relevant documents, and S_i the vectors of the n_2 judged nonrelevant documents.

Slide 8:
Ide developed three particular strategies extending Rocchio's approach:
1. The basic Rocchio formula, minus the normalization for the number of relevant and nonrelevant documents.
2. Feedback from relevant documents only.
3. Limited negative feedback, from only the highest-ranked nonrelevant document.

Slide 9: Term Reweighting without Query Expansion
A probabilistic model proposed by Robertson and Sparck Jones (1976):

W_{ij} = \log \frac{r / (R - r)}{(n - r) / ((N - n) - (R - r))}

where
- W_{ij} = the term weight for term i in query j
- r = the number of relevant documents for query j having term i
- R = the total number of relevant documents for query j
- n = the number of documents in the collection having term i
- N = the number of documents in the collection

Slide 10: Term Reweighting without Query Expansion (cont'd)
Croft (1983) extended this weighting scheme as follows.

Initial search: W_{ijk} = (C + IDF_i) \cdot f_{ik}

Feedback: W_{ijk} = \left(C + \log \frac{p_{ij} (1 - q_{ij})}{(1 - p_{ij}) q_{ij}}\right) \cdot f_{ik}

where
- W_{ijk} = the term weight for term i in query j and document k
- IDF_i = the IDF weight for term i in the entire collection
- p_{ij} = the probability that term i is assigned within the set of relevant documents for query j
- q_{ij} = the probability that term i is assigned within the set of nonrelevant documents for query j
- f_{ik} = K + (1 - K)(freq_{ik} / maxfreq_k), with freq_{ik} the frequency of term i in document k and maxfreq_k the maximum frequency of any term in document k

Slide 11: Query Expansion
The query can be expanded by:
- offering users a selection of the terms most closely related to the initial query terms (from a thesaurus), or
- presenting users with a sorted list of terms drawn from the relevant documents or from all retrieved documents.

Slide 12: Query Expansion (cont'd)
- A list of candidate terms from the relevant/nonrelevant documents is produced using ranking methods.
- Terms are then either selected by the user from the top N, or added to the query automatically.
- The early SMART experiments both expanded the query and reweighted the query terms by adding the vectors of the relevant and nonrelevant documents.
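The Robertson and Sparck Jones weight above is a pure function of the four counts r, R, n, and N, so it can be computed in a couple of lines. A minimal Python sketch (the counts below are invented for illustration; practical systems usually add 0.5 smoothing to each factor to avoid division by zero, which is omitted here to match the slide's formula):

```python
import math

def rsj_weight(r, R, n, N):
    """Robertson-Sparck Jones (1976) relevance weight for a term:
    log of the term's odds in the relevant set over its odds in
    the nonrelevant set.
      r = relevant docs for the query containing the term
      R = total relevant docs for the query
      n = docs in the collection containing the term
      N = total docs in the collection
    """
    relevant_odds = r / (R - r)
    nonrelevant_odds = (n - r) / ((N - n) - (R - r))
    return math.log(relevant_odds / nonrelevant_odds)

# Hypothetical counts: a term appears in 8 of 10 relevant
# documents but in only 100 of 10,000 documents overall,
# so it receives a strongly positive weight.
w = rsj_weight(r=8, R=10, n=100, N=10_000)
```

A term common in the relevant set but rare in the collection gets a large positive weight; a term with the same distribution in both sets gets a weight near zero.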
Slide 13: Query Expansion (cont'd)
Modification of the terms in relevant/nonrelevant documents:
- Any relevant document(s) can serve as a "new query" (Noreault, 1979).
- If no relevant documents are indicated, the term list shown to the user is the list of related terms based on those previously sorted in the inverted file.

Slide 14: Query Expansion with Term Reweighting
The vast majority of relevance feedback and query expansion research has used both query expansion and term reweighting. Three of the most used feedback methods:

Ide Regular:
Q_1 = Q_0 + \sum_{i=1}^{n_1} R_i - \sum_{i=1}^{n_2} S_i

Slide 15: Query Expansion with Term Reweighting (cont'd)
Ide dec-hi:
Q_1 = Q_0 + \sum_{i=1}^{n_1} R_i - S
where S is the single top-ranked nonrelevant document.

Standard Rocchio:
Q_1 = Q_0 + \frac{1}{n_1} \sum_{i=1}^{n_1} R_i - \frac{1}{n_2} \sum_{i=1}^{n_2} S_i

Slide 16: Automatic Query Modification
The major disadvantage of relevance feedback is that it increases the burden on the users [X97]. Approaches for automatic query modification:
- Local feedback
- Automatic query expansion: dictionary-based, global analysis, or local analysis

Slide 17: Local Feedback
Local feedback is similar to relevance feedback. The difference: the top-ranked documents are assumed to be relevant, without human judgment. It saves the cost of relevance judgments, but it can result in poor retrieval if the top-ranked documents are nonrelevant.

Slide 18: Automatic Query Expansion
Basic idea: expand the user query by adding semantically similar and/or statistically associated terms with corresponding weights. Thesauri are needed for the similarity judgment. Two approaches to thesaurus construction:
- Manual thesauri
- Automatic thesauri

Slide 19: Dictionary-based Query Expansion
Based on manual thesauri (e.g., WordNet [M95]). In the expansion process, words that are synonyms of (or stand in other semantic relations to) the initial query terms are selected, and each is assigned a weight. Disadvantages:
- Construction of a manual thesaurus requires a lot of human labor.
- A general manual thesaurus does not consistently improve retrieval performance.
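The three feedback formulas above (Ide Regular, Ide dec-hi, standard Rocchio) differ only in how the judged document vectors are accumulated into the new query. A minimal Python sketch over plain dense term-weight vectors (the query vector and relevance judgments are invented for illustration):

```python
# Toy comparison of the three feedback formulas over dense
# term-weight vectors; all vectors and judgments are invented.

def add(u, v, scale=1.0):
    """Element-wise u + scale * v."""
    return [a + scale * b for a, b in zip(u, v)]

def ide_regular(q0, relevant, nonrelevant):
    # Q1 = Q0 + sum(Ri) - sum(Si): no normalization at all.
    q = list(q0)
    for r in relevant:
        q = add(q, r, 1.0)
    for s in nonrelevant:
        q = add(q, s, -1.0)
    return q

def ide_dec_hi(q0, relevant, top_nonrelevant):
    # Q1 = Q0 + sum(Ri) - S: subtract only the single
    # highest-ranked nonrelevant document S.
    q = list(q0)
    for r in relevant:
        q = add(q, r, 1.0)
    return add(q, top_nonrelevant, -1.0)

def rocchio(q0, relevant, nonrelevant):
    # Q1 = Q0 + (1/n1) sum(Ri) - (1/n2) sum(Si):
    # add the centroid of the relevant documents and
    # subtract the centroid of the nonrelevant ones.
    q = list(q0)
    for r in relevant:
        q = add(q, r, 1.0 / len(relevant))
    for s in nonrelevant:
        q = add(q, s, -1.0 / len(nonrelevant))
    return q

q0 = [1.0, 0.0, 0.5]                          # initial query vector
rel = [[0.8, 0.4, 0.0], [0.6, 0.6, 0.2]]      # judged relevant
nonrel = [[0.0, 0.9, 0.1], [0.1, 0.7, 0.0]]   # judged nonrelevant

q_rocchio = rocchio(q0, rel, nonrel)
```

In all three variants, terms frequent in the relevant documents gain weight and terms frequent in the nonrelevant documents lose weight (and can go negative); the variants trade off how aggressively negative feedback is applied.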
Slide 20: Example: WordNet

Semantic Relation        Syntactic Category   Examples
Synonymy (similar)       N, V, Aj, Av         sad / unhappy; rapidly / speedily
Antonymy (opposite)      Aj, Av, (N, V)       powerful / powerless; rapidly / slowly
Hyponymy (subordinate)   N                    sugar maple / maple; tree / plant
Meronymy (part)          N                    brim / hat; gin / martini
Troponymy (manner)       V                    march / walk; whisper / speak
Entailment               V                    drive / ride; divorce / marry

Note: N = nouns, Aj = adjectives, V = verbs, Av = adverbs.

Slide 21: Automatic Thesauri Construction
Thesauri are constructed from the whole (or a part) of the data corpus. The basic idea of automatic thesaurus construction is term co-occurrence. Methods:
- Traditional TF×IDF [Y02]
- Variants of TF×IDF (e.g., the similarity thesaurus [QF93])
- The association-rule mining approach [WBO00]

Slide 22: Example of Thesaurus Construction
With each term t_i is associated a vector over the N documents:

t_i = (w_{i,1}, w_{i,2}, ..., w_{i,N})

where

w_{i,j} = \frac{\left(0.5 + 0.5 \, \frac{tf_{i,j}}{maxtf_j}\right) itf_j}{\sqrt{\sum_{l=1}^{N} \left(0.5 + 0.5 \, \frac{tf_{i,l}}{maxtf_l}\right)^2 itf_l^2}}

with tf_{i,j} the frequency of term t_i in document d_j, maxtf_j the maximum term frequency in document d_j, and itf_j the inverse term frequency of document d_j. The relationship between two terms t_u and t_v is then

S_{u,v} = t_u \cdot t_v = \sum_j w_{u,j} \times w_{v,j}

according to [QF93].

Slide 23: Example of Thesaurus Construction (cont'd)
[Figure: a sample term-association graph linking terms such as CRM, Knowledge Discovery, Text Mining, Data Mining, Data Warehouse, Decision Tree, Clustering Analysis, Classification Analysis, C4.5, and Prediction, with association weights ranging from 0.12 to 0.90.]

Slide 24: Global Analysis
The whole collection of documents is used for thesaurus creation. Approaches:
- Similarity thesaurus [QF93]
- Statistical thesaurus [CY92]

Slide 25: Global Analysis (cont'd)
[Figure: flow of global analysis. The Data Corpus feeds Thesaurus Construction, producing a Thesaurus; the Initial User Query plus the Thesaurus feed Query Expansion, producing an Expanded Query used to Retrieve Relevant Documents.]

Slide 26: Local Analysis
Unlike global analysis, only the top-ranked documents are used for constructing the thesaurus. Approaches:
- Local clustering [AF77]
- Local context analysis [X97, XC96, XC00]
According to [XC96, X97, XC00], local analysis is more effective than global analysis.
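The term-term association S_{u,v} of [QF93] reduces to a dot product between length-normalized term vectors indexed by documents. A simplified Python sketch on a tiny invented term-document frequency matrix (it uses only the augmented-frequency part 0.5 + 0.5·tf/maxtf and drops the itf_j factor, so it illustrates the mechanics rather than the exact [QF93] weighting):

```python
import math

# Tiny invented term-document frequency matrix: one row per term,
# one column per document. Each term gets a length-normalized
# vector over the documents, and the association S_uv between two
# terms is the dot product of their vectors, so terms that
# co-occur in the same documents score high.
tf = {
    "data":      [3, 0, 2, 1],
    "mining":    [2, 0, 2, 0],
    "warehouse": [1, 0, 0, 1],
    "recipe":    [0, 4, 0, 0],
}
num_docs = 4

# Maximum term frequency in each document (over all terms).
max_tf = [max(tf[t][j] for t in tf) for j in range(num_docs)]

def term_vector(term):
    # Augmented frequency 0.5 + 0.5 * tf/maxtf where the term
    # occurs, 0 otherwise; then length-normalize. (The full
    # [QF93] weight also multiplies in an inverse-term-frequency
    # factor itf_j per document, omitted here.)
    w = [0.5 + 0.5 * tf[term][j] / max_tf[j] if tf[term][j] > 0 else 0.0
         for j in range(num_docs)]
    norm = math.sqrt(sum(x * x for x in w))
    return [x / norm for x in w]

def association(u, v):
    # S_uv = t_u . t_v
    return sum(a * b for a, b in zip(term_vector(u), term_vector(v)))

# "data"/"mining" share documents; "mining"/"recipe" never do.
s_close = association("data", "mining")
s_far = association("mining", "recipe")
```

Ranking all terms by association with a query term, and adding the top few to the query, is exactly the expansion step the global-analysis flow on the next slide performs with this thesaurus.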
Slide 27: Local Analysis (cont'd)
[Figure: flow of local analysis. The Initial User Query drives a first retrieval; the Top-Ranked Documents feed Thesaurus Construction and Query Expansion, producing an Expanded Query used in a second retrieval of Relevant Documents.]

Slide 28: Comparison of the Approaches

Relevance Feedback
- Advantage: retrieves more relevant documents.
- Disadvantage: the burden it places on the users.

Local Feedback
- Advantage: saves the cost of relevance judgments.
- Disadvantage: results in poor retrieval if the top-ranked documents are nonrelevant.

Dictionary-based QE
- Advantage: robust average performance across queries.
- Disadvantages: the cost of the human labor needed to construct a dictionary; the lack of domain-specific words.

Global Analysis QE
- Advantages: relatively robust average performance across queries; provides a thesaurus-like resource that can be used for browsing or other types of concept search.
- Disadvantages: expensive in disk space and computer time to do the global analysis and build the searchable database; individual queries can be significantly degraded by expansion.

Local Analysis QE
- Advantages: relatively efficient, since expansion is based on the high-ranked documents; needs no global thesaurus construction phase.
- Disadvantages: may be slightly slower at run time; it is not clear how well the technique works with queries that retrieve few relevant documents.

Slide 29: References
[AF77] Attar, R. and Fraenkel, A. S., "Local Feedback in Full-Text Retrieval Systems," Journal of the ACM, Vol. 24, No. 3, 1977, pp. 397-417.
[BR99] Baeza-Yates, R. and Ribeiro-Neto, B., Modern Information Retrieval, Addison-Wesley/ACM Press, Harlow, England, 1999.
[CY92] Crouch, C. J. and Yang, B., "Experiments in Automatic Statistical Thesaurus Construction," Proceedings of the 15th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 1992, pp. 77-88.
[M95] Miller, G. A., "WordNet: A Lexical Database for English," Communications of the ACM, Vol. 38, No. 11, November 1995, pp. 39-41.
[QF93] Qiu, Y. and Frei, H. P., "Concept Based Query Expansion," Proceedings of the 16th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 1993, pp. 160-169.

Slide 30: References (cont'd)
[WBO00] Wei, J., Bressan, S., and Ooi, B. C., "Mining Term Association Rules for Automatic Global Query Expansion: Methodology and Preliminary Results," Proceedings of the First International Conference on Web Information Systems Engineering, Vol. 1, 2000, pp. 366-373.
[X97] Xu, J., "Solving the Word Mismatch Problem Through Automatic Text Analysis," Ph.D. Thesis, University of Massachusetts at Amherst, 1997.
[XC96] Xu, J. and Croft, W. B., "Query Expansion Using Local and Global Document Analysis," Proceedings of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 1996, pp. 4-11.
[XC00] Xu, J. and Croft, W. B., "Improving the Effectiveness of Information Retrieval with Local Context Analysis," ACM Transactions on Information Systems, Vol. 18, No. 1, 2000, pp. 79-112.
[Y02] Yang, C., "Investigation of Term Expansion on Text Mining Techniques," Master's Thesis, National Sun Yat-sen University, Taiwan, 2002.