
Relevance Feedback and Other Query Modification Techniques
Course: Information Retrieval and Recommendation Techniques
Advisor: Prof. 黃三益
Presenters: 楊錦生 (d9142801) and 曾繁絹 (d9142803), first-year Ph.D. students
Introduction
• Precision vs. Recall
• When a high recall ratio is critical, users have to retrieve more relevant documents.
• Methods to retrieve more:
- "Expand" the search by broadening a narrow Boolean query or looking further down a ranked list of retrieved documents.
- Modify the original query.
Introduction (cont’d)
• "Word Mismatch" problem:
- Some of the unretrieved relevant documents are indexed by a different set of terms than those in the query or in most of the other relevant documents.
• Approaches for improving the initial query:
- Relevance Feedback
- Automatic Query Modification
Conceptual Model of Relevance Feedback
[Figure: the relevance feedback loop: Query → Result Set → User Relevance Feedback → New Query Based on Result Set]
Basic Ideas about Relevance Feedback
• Two components of relevance feedback:
- Reweighting of query terms based on the distribution of these terms in the relevant and nonrelevant documents retrieved in response to those queries
- Changing the actual terms in the query
Basic Ideas about Relevance Feedback (cont’d)
• Evaluation of Relevance Feedback
- The results after one iteration of feedback generally show spectacular improvement over those obtained with no feedback.
- Another way to evaluate the results is to compare only the residual collections (i.e., the documents not already shown to the user during feedback).
Basic Approach to Relevance Feedback
• Rocchio’s approach used the vector space model to rank documents:

Q_1 = Q_0 + \frac{1}{n_1} \sum_{i=1}^{n_1} R_i - \frac{1}{n_2} \sum_{i=1}^{n_2} S_i

where R_i are the vectors of the n_1 relevant documents retrieved and S_i are the vectors of the n_2 nonrelevant documents retrieved.

• Ide developed three particular strategies extending Rocchio’s approach:
1. Basic Rocchio formula, minus the normalization for the number of relevant and nonrelevant documents
2. Allowed only feedback from relevant documents
3. Allowed limited negative feedback from only the highest-ranked nonrelevant document
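To make the Rocchio update concrete, here is a minimal sketch I added (not part of the original slides), assuming the query and the documents are already represented as term-weight vectors of equal length:

```python
import numpy as np

def rocchio_update(q0, relevant, nonrelevant):
    """Rocchio feedback: Q1 = Q0 + (1/n1)*sum(R_i) - (1/n2)*sum(S_i).

    q0          -- initial query vector (1-D NumPy array)
    relevant    -- list of vectors of the n1 relevant documents
    nonrelevant -- list of vectors of the n2 nonrelevant documents
    """
    q1 = q0.astype(float).copy()
    if relevant:                                # skip the term if n1 = 0
        q1 += np.mean(relevant, axis=0)
    if nonrelevant:                             # skip the term if n2 = 0
        q1 -= np.mean(nonrelevant, axis=0)
    return q1

# Example with a five-term vocabulary
q0  = np.array([1.0, 0.0, 1.0, 0.0, 0.0])
rel = [np.array([0.9, 0.1, 0.8, 0.0, 0.3])]
non = [np.array([0.0, 0.7, 0.1, 0.9, 0.0])]
print(rocchio_update(q0, rel, non))
```

Dropping the 1/n_1 and 1/n_2 normalization (i.e., summing instead of averaging) gives Ide’s first strategy above.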
Term Reweighting without Query Expansion
• A probabilistic model proposed by Robertson and Sparck Jones (1976):

W_{ij} = \log \frac{r / (R - r)}{(n - r) / ((N - n) - (R - r))}

W_ij = the term weight for term i in query j
r = the number of relevant documents for query j having term i
R = the total number of relevant documents for query j
n = the number of documents in the collection having term i
N = the number of documents in the collection
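As an added illustration (not from the slides), a small sketch that computes this weight from the four counts; the constant k is the commonly used 0.5 correction for zero counts, not something the slide specifies:

```python
import math

def rsj_weight(r, R, n, N, k=0.5):
    """Robertson/Sparck Jones relevance weight for one term.

    r -- number of relevant documents containing the term
    R -- total number of relevant documents for the query
    n -- number of documents in the collection containing the term
    N -- total number of documents in the collection
    k -- smoothing constant (0.5 is customary; k=0 gives the raw formula)
    """
    numerator   = (r + k) / ((R - r) + k)
    denominator = ((n - r) + k) / (((N - n) - (R - r)) + k)
    return math.log(numerator / denominator)

# Example: the term occurs in 8 of 10 relevant documents
# and in 50 of the 10,000 documents in the collection.
print(rsj_weight(r=8, R=10, n=50, N=10000))
```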
Term Reweighting without Query Expansion (cont’d)
• Croft (1983) extended this weighting scheme as below:

Initial search:  W_{ijk} = (C + IDF_i) \cdot f_{ik}

Feedback:  W_{ijk} = \left(C + \log \frac{p_{ij}(1 - q_{ij})}{(1 - p_{ij}) q_{ij}}\right) \cdot f_{ik}

W_ijk = the term weight for term i in query j and document k
IDF_i = the IDF weight for term i in the entire collection
p_ij = the probability that term i is assigned within the set of relevant documents for query j
q_ij = the probability that term i is assigned within the set of nonrelevant documents for query j
f_ik = K + (1 - K)(freq_ik / maxfreq_k)
freq_ik = the frequency of term i in document k
maxfreq_k = the maximum frequency of any term in document k
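A minimal sketch I added of the two weights above; the values of the tuning constants C and K in the example call are arbitrary assumptions, not values from Croft (1983):

```python
import math

def f_ik(freq_ik, maxfreq_k, K=0.5):
    """Normalized within-document frequency: K + (1 - K) * (freq_ik / maxfreq_k)."""
    return K + (1 - K) * (freq_ik / maxfreq_k)

def initial_weight(idf_i, freq_ik, maxfreq_k, C=1.0, K=0.5):
    """Initial-search weight: (C + IDF_i) * f_ik."""
    return (C + idf_i) * f_ik(freq_ik, maxfreq_k, K)

def feedback_weight(p_ij, q_ij, freq_ik, maxfreq_k, C=1.0, K=0.5):
    """Feedback weight: (C + log(p(1 - q) / ((1 - p) q))) * f_ik."""
    odds = (p_ij * (1 - q_ij)) / ((1 - p_ij) * q_ij)
    return (C + math.log(odds)) * f_ik(freq_ik, maxfreq_k, K)

# Example: a term occurring 3 times in a document whose most frequent term occurs 10 times
print(initial_weight(idf_i=2.3, freq_ik=3, maxfreq_k=10))
print(feedback_weight(p_ij=0.6, q_ij=0.1, freq_ik=3, maxfreq_k=10))
```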
Query Expansion
• The query could be expanded by
- offering users a selection of terms that are the terms most closely related to the initial query terms (thesaurus)
- presenting users with a sorted list of terms from the relevant documents or all retrieved documents
Query Expansion (cont’d)
• A list of terms from the relevant/nonrelevant documents is proposed using ranking methods:
- User selection from the top N terms
- Automatically added to the query
• The early SMART experiments both expanded the query and reweighted the query terms by adding the vectors of the relevant and nonrelevant documents.
Query Expansion (cont’d)
• Modification of terms in relevant/nonrelevant documents:
- Any relevant document(s) can be used as a “new query” (Noreault, 1979)
- If no relevant documents are indicated, the term list shown to the user is the list of related terms based on those previously sorted in the inverted file
Query Expansion with Term Reweighting
• The vast majority of relevance feedback and query expansion research has been done using both query expansion and term reweighting.
• Three of the most used feedback methods:
- Ide Regular

Q_1 = Q_0 + \sum_{i=1}^{n_1} R_i - \sum_{i=1}^{n_2} S_i
Query Expansion with Term Reweighting (cont’d)
- Ide dec-hi

Q_1 = Q_0 + \sum_{i=1}^{n_1} R_i - S

where S is the vector of the top-ranked nonrelevant document

- Standard Rocchio

Q_1 = Q_0 + \frac{1}{n_1} \sum_{i=1}^{n_1} R_i - \frac{1}{n_2} \sum_{i=1}^{n_2} S_i
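For comparison, a short NumPy sketch I added of the three update rules; rel and nonrel are lists of document vectors, and nonrel[0] is assumed to be the highest-ranked nonrelevant document:

```python
import numpy as np

def ide_regular(q0, rel, nonrel):
    """Q1 = Q0 + sum of relevant vectors - sum of nonrelevant vectors."""
    return q0 + np.sum(rel, axis=0) - np.sum(nonrel, axis=0)

def ide_dec_hi(q0, rel, nonrel):
    """Q1 = Q0 + sum of relevant vectors - top-ranked nonrelevant vector."""
    return q0 + np.sum(rel, axis=0) - nonrel[0]

def standard_rocchio(q0, rel, nonrel):
    """Q1 = Q0 + mean of relevant vectors - mean of nonrelevant vectors."""
    return q0 + np.mean(rel, axis=0) - np.mean(nonrel, axis=0)
```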
Automatic Query Modification
• The major disadvantage of relevance feedback is that it increases the burden on the users [X97].
• Approaches for automatic query modification:
- Local feedback
- Automatic query expansion
  - Dictionary-based
  - Global analysis
  - Local analysis
Local Feedback
• Local feedback is similar to relevance feedback.
• Difference: the top-ranked documents are assumed to be relevant, without human judgment.
• It saves the costs of relevance judgment, but it can result in poor retrieval if the top-ranked documents are nonrelevant.
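A minimal sketch of this idea (my illustration, not from the slides); the search function is a hypothetical placeholder for the underlying ranked-retrieval engine, and rocchio_update is the function sketched earlier:

```python
def local_feedback(query_vec, search, k=10):
    """Pseudo-relevance (local) feedback: treat the top-k documents as relevant.

    query_vec -- initial query vector
    search    -- assumed function mapping a query vector to a ranked list of
                 (doc_id, doc_vector) pairs, best first
    k         -- number of top-ranked documents assumed relevant
    """
    ranked = search(query_vec)
    pseudo_relevant = [vec for _, vec in ranked[:k]]
    # Reuse the Rocchio update with no explicit nonrelevant documents.
    new_query = rocchio_update(query_vec, pseudo_relevant, [])
    return search(new_query)
```

If the top k documents are actually nonrelevant, the expanded query drifts away from the user’s intent, which is exactly the failure mode noted above.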
Automatic Query Expansion
• Basic idea:
- Expand a user query by adding semantically similar and/or statistically associated terms, each with a corresponding weight.
• Thesauri are needed for similarity judgment.
• Two approaches for thesauri construction:
- Manual thesauri
- Automatic thesauri
Dictionary-based Query Expansion
• Based on manual thesauri (e.g., WordNet [M95]).
• In the expansion process, synonyms (or words in other semantic relations) of the initial query terms are selected and each is assigned a weight.
• Disadvantages:
- Construction of a manual thesaurus requires a lot of human labor.
- A general manual thesaurus does not consistently improve retrieval performance.
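As an added illustration of dictionary-based expansion (not part of the original slides), a sketch using WordNet through the NLTK interface; the weight of 0.5 for expansion terms is an arbitrary assumption:

```python
from nltk.corpus import wordnet as wn   # requires a one-time nltk.download('wordnet')

def expand_with_wordnet(query_terms, expansion_weight=0.5):
    """Add WordNet synonyms of each query term at a reduced weight.

    Returns a dict mapping term -> weight; original query terms keep
    weight 1.0, synonyms get the (arbitrarily chosen) expansion_weight.
    """
    expanded = {t.lower(): 1.0 for t in query_terms}
    for term in query_terms:
        for synset in wn.synsets(term):
            for lemma in synset.lemma_names():
                word = lemma.replace('_', ' ').lower()
                if word not in expanded:
                    expanded[word] = expansion_weight
    return expanded

print(expand_with_wordnet(["car", "rapid"]))
```

Other semantic relations from the table on the next slide (hypernyms, meronyms, and so on) could be pulled in the same way through the corresponding synset methods.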
Example - WordNet
Semantic Relation        Syntactic Category   Examples
Synonymy (similar)       N, V, Aj, Av         sad, unhappy; rapidly, speedily
Antonymy (opposite)      Aj, Av, (N, V)       powerful, powerless; rapidly, slowly
Hyponymy (subordinate)   N                    sugar maple, maple; tree, plant
Meronymy (part)          N                    brim, hat; gin, martini
Troponymy (manner)       V                    march, walk; whisper, speak
Entailment               V                    drive, ride; divorce, marry

Note: N = Nouns, Aj = Adjectives, V = Verbs, Av = Adverbs
Automatic Thesauri Construction Approach
• Thesauri are constructed from the whole (or a part) of the data corpus.
• Basic idea of automatic thesauri construction:
- Term co-occurrence
• Methods of automatic thesauri construction:
- Traditional TFxIDF [Y02]
- Variant of TFxIDF (i.e., similarity thesaurus [QF93])
- Mining Association Rules approach [WBO00]
Example of Thesaurus Construction
• To each term t_i is associated a vector:

t_i = (w_{i,1}, w_{i,2}, ..., w_{i,N})

where

w_{i,j} = \frac{\left(0.5 + 0.5 \frac{tf_{i,j}}{\max_j tf_{i,j}}\right) itf_j}{\sqrt{\sum_{l=1}^{N} \left(0.5 + 0.5 \frac{tf_{i,l}}{\max_l tf_{i,l}}\right)^2 itf_l^2}}

• The relationship between two terms t_u and t_v:

S_{u,v} = \vec{t_u} \cdot \vec{t_v} = \sum_{d_j} w_{u,j} \times w_{v,j}

(according to [QF93])
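A compact sketch I added that follows the formulation above; the itf values are assumed to be precomputed per document as in [QF93] (their exact definition is not reproduced here):

```python
import numpy as np

def term_vectors(tf, itf):
    """Build one normalized vector per term from a term-document frequency matrix.

    tf  -- array of shape (n_terms, n_docs); tf[i, j] = frequency of term i in doc j
    itf -- array of shape (n_docs,); inverse term frequency of each document
           (assumed precomputed following [QF93])
    Returns W of shape (n_terms, n_docs) such that S_{u,v} = W[u] . W[v].
    """
    tf = np.asarray(tf, dtype=float)
    max_per_term = tf.max(axis=1, keepdims=True)                    # max_j tf_{i,j}
    norm_tf = 0.5 + 0.5 * tf / np.where(max_per_term == 0, 1, max_per_term)
    w = norm_tf * itf                                               # numerator of w_{i,j}
    w /= np.linalg.norm(w, axis=1, keepdims=True)                   # denominator (row L2 norm)
    return w

def term_similarity(W, u, v):
    """S_{u,v} = sum over documents of w_{u,j} * w_{v,j}."""
    return float(np.dot(W[u], W[v]))

# Tiny example: 3 terms in 4 documents, with made-up itf values
tf = np.array([[3, 0, 1, 0],
               [2, 1, 0, 0],
               [0, 4, 2, 1]])
itf = np.array([0.5, 0.8, 0.7, 1.0])
W = term_vectors(tf, itf)
print(term_similarity(W, 0, 1))
```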
Example of Thesaurus Construction (cont’d)
[Figure: example thesaurus graph over data-mining-related terms (CRM, Knowledge Discovery, Text Mining, Data Mining, Data Warehouse, Decision Tree, Clustering Analysis, Classification Analysis, C4.5, Prediction), with edges labeled by similarity weights such as 0.90, 0.75, and 0.56]
Global Analysis
• The whole collection of documents is used for thesaurus creation.
• Approaches:
- Similarity Thesaurus [QF93]
- Statistical Thesaurus [CY92]
Global Analysis (cont’d)
[Figure: global analysis flow: Data Corpus → Thesaurus Construction → Thesaurus; Initial User Query + Thesaurus → Query Expansion → Expanded Query → Retrieve → Relevant Documents]
Local Analysis
• Unlike global analysis, only the top-ranked documents are used for constructing the thesaurus.
• Approaches:
- Local Clustering [AF77]
- Local Context Analysis [X97, XC96, XC00]
• According to [XC96, X97, XC00], local analysis is more effective than global analysis.
Local Analysis (cont’d)
[Figure: local analysis flow: Initial User Query → 1st Retrieve → Top Ranked Documents → Thesaurus Construction → Query Expansion → Expanded Query → 2nd Retrieve → Relevant Documents]
Approach: Relevance Feedback
Advantages: 1. Retrieves more relevant documents.
Disadvantages: 1. The burden on the users.

Approach: Local Feedback
Advantages: 1. It saves the costs of relevance judgment.
Disadvantages: 1. It can result in poor retrieval if the top-ranked documents are nonrelevant.

Approach: Dictionary-based QE
Advantages: 1. Robust in the average performance of queries.
Disadvantages: 1. Costs of human labor to construct a dictionary. 2. Lack of domain-specific words.

Approach: Global Analysis QE
Advantages: 1. Relatively robust in the average performance of queries. 2. It provides a thesaurus-like resource that can be used for browsing or other types of concept search.
Disadvantages: 1. Expensive in terms of disk space and computer time to do the global analysis and build the searchable database. 2. Individual queries can be significantly degraded by expansion.

Approach: Local Analysis QE
Advantages: 1. Relatively efficient to do expansion based on highly ranked documents. 2. Needs no global thesaurus construction phase.
Disadvantages: 1. It may be slightly slower at run time. 2. It is not clear how well this technique will work with queries that retrieve few relevant documents.
References
[AF77] Attar, R. and Fraenkel, A. S., “Local Feedback in Full-Text Retrieval Systems,” Journal of the ACM, Volume 24, Issue 3, 1977, pp. 397-417.
[BR99] Baeza-Yates, R. and Ribeiro-Neto, B., Modern Information Retrieval, Addison Wesley/ACM Press, Harlow, England, 1999.
[CY92] Crouch, C. J. and Yang, B., “Experiments in Automatic Statistical Thesaurus Construction,” Proceedings of the 15th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 1992, pp. 77-88.
[M95] Miller, G. A., “WordNet: A Lexical Database for English,” Communications of the ACM, Vol. 38, No. 11, November 1995, pp. 39-41.
[QF93] Qiu, Y. and Frei, H. P., “Concept Based Query Expansion,” Proceedings of the 16th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 1993, pp. 160-169.
[WBO00] Wei, J., Bressan, S., and Ooi, B. C., “Mining Term Association Rules for Automatic Global Query Expansion: Methodology and Preliminary Results,” Proceedings of the First International Conference on Web Information Systems Engineering, Volume 1, 2000, pp. 366-373.
References (cont’d)
[X97] Xu, J., “Solving the Word Mismatch Problem Through Automatic Text Analysis,” PhD Thesis, University of Massachusetts at Amherst, 1997.
[XC96] Xu, J. and Croft, W. B., “Query Expansion Using Local and Global Document Analysis,” Proceedings of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 1996, pp. 4-11.
[XC00] Xu, J. and Croft, W. B., “Improving the Effectiveness of Information Retrieval with Local Context Analysis,” ACM Transactions on Information Systems, Volume 18, Issue 1, 2000, pp. 79-112.
[Y02] Yang, C., “Investigation of Term Expansion on Text Mining Techniques,” Master Thesis, National Sun Yat-sen University, Taiwan, 2002.