
Less is More
Probabilistic Models for Retrieving Fewer
Relevant Documents
Harr Chen, David R. Karger
MIT CSAIL
ACM SIGIR 2006
August 9, 2006
Outline
• Motivations
• Expected Metric Principle
• Metrics
• Bayesian Retrieval
• Objectives
• Heuristics
• Experimental Results
• Related Work
• Future Work and Conclusions
Motivation
• In IR, we have formal models and formal metrics
• Models provide framework for retrieval
– E.g.: Probabilistic
• Metrics provide rigorous evaluation mechanism
– E.g.: Precision and recall
• Probability ranking principle (PRP) provably
optimal for precision/recall
– Ranking by probability of relevance
• But other metrics capture other notions of result
set quality, and PRP isn’t necessarily optimal for those metrics
Example: Diversity
• User may be satisfied with one relevant result
– Navigational queries, question/answering
• In this case, we want to “hedge our bets” by
retrieving for diversity in the result set
– Better to satisfy different users with different
interpretations than one user many times over
• Reciprocal rank/search length metrics capture
this notion
• PRP is suboptimal
IR System Design
• Metrics define preference ordering on result sets
– Metric[Result set 1] > Metric[Result set 2] ⇒
Result set 1 preferred to Result set 2
• Traditional approach: Try out heuristics that we
believe will improve relevance performance
– Heuristics not directly motivated by metric
– E.g. synonym expansion, pseudorelevance feedback
• Observation: Given a model, we can try to
directly optimize for some metric
Expected Metric Principle (EMP)
• Knowing which metric to use tells us what to
maximize for – the expected value of the
metric for each result set, given a model
[Diagram: from a three-document corpus, enumerate every ordered result set of size two – (1, 2), (1, 3), (2, 1), (2, 3), (3, 1), (3, 2) – calculate E[Metric] for each using the model, and return the set with the maximum score (see the sketch below).]
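A minimal sketch of the brute-force selection pictured above, assuming a model-supplied `expected_metric` hook (a hypothetical name, not from the paper) that returns E[metric | result set]:

```python
from itertools import permutations

def emp_select(candidates, n, expected_metric):
    """Brute-force EMP: score every ordered result set of size n drawn
    from `candidates` and return the one with the highest expected
    metric value under the model."""
    best_set, best_score = None, float("-inf")
    for result_set in permutations(candidates, n):  # ordered sets, as in (1, 2), (2, 1), ...
        score = expected_metric(result_set)
        if score > best_score:
            best_set, best_score = result_set, score
    return best_set, best_score

# Toy usage: three documents, metric = expected number of relevant results,
# with assumed per-document relevance probabilities standing in for the model.
rel_prob = {"d1": 0.9, "d2": 0.5, "d3": 0.4}
print(emp_select(list(rel_prob), 2, lambda docs: sum(rel_prob[d] for d in docs)))
# best ordered pair is ('d1', 'd2')
```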
Our Contributions
• Primary: EMP – metric as retrieval goal
– Metric designed to measure retrieval quality
• Metrics we consider: precision/recall @ n, search length,
reciprocal rank, instance recall, k-call
– Build probabilistic model
– Retrieve to maximize an objective: the expected value
of the metric
• Expectations calculated according to our probabilistic model
– Use computational heuristics to make optimization
problem tractable
• Secondary: retrieving for diversity (special case)
– A natural side effect of optimizing for certain metrics
Detour: What is a Heuristic?
Ad hoc approach
• Use heuristics that are believed to be correlated with good performance
• Heuristics used to improve relevance
• Heuristics (probably) make system slower
• Infinite number of possibilities, no formalism
• Model, heuristics intertwined

Our approach
• Build model that directly optimizes for good performance
• Heuristics used to improve efficiency
• Heuristics (probably) make optimization worse
• Well-known space of optimization techniques
• Clean separation between model and heuristics
Search Length/Reciprocal Rank
• (Mean) search length (MSL): number of irrelevant
results until first relevant
• (Mean) reciprocal rank (MRR): one over rank of
first relevant
[Example: the first relevant document appears at rank 3 – search length = 2, reciprocal rank = 1/3.]
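A minimal sketch of both metrics, assuming a binary relevance list in rank order (the tie-breaking convention for an all-irrelevant ranking is an assumption, not stated on the slide):

```python
def search_length(relevance):
    """Search length: number of irrelevant results before the first
    relevant one (`relevance` is a list of booleans in rank order)."""
    for i, rel in enumerate(relevance):
        if rel:
            return i
    return len(relevance)  # assumed convention if nothing relevant is retrieved

def reciprocal_rank(relevance):
    """Reciprocal rank: 1 / rank of the first relevant result (0 if none)."""
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            return 1.0 / rank
    return 0.0

# The slide's example: first relevant document at rank 3.
ranking = [False, False, True, False, True]
print(search_length(ranking), reciprocal_rank(ranking))  # 2 and 1/3
```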
Instance Recall
• Each topic has multiple instances (subtopics,
aspects)
• Instance recall @ n is the fraction of instances covered
(in union) by the first n results
[Example: instance recall @ 5 = 0.75.]
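A minimal sketch, assuming each retrieved document is annotated with the set of topic instances it covers (the toy data below merely reproduces the slide's 0.75 value and is not from the paper):

```python
def instance_recall_at_n(doc_instances, topic_instances, n):
    """Instance recall @ n: fraction of the topic's instances covered
    (in union) by the top n results. `doc_instances` is a rank-ordered
    list of sets of instance identifiers."""
    covered = set().union(*doc_instances[:n]) & set(topic_instances)
    return len(covered) / len(topic_instances)

# Hypothetical example: 3 of 4 instances covered by the top 5 results.
docs = [{"a"}, set(), {"a", "b"}, {"c"}, set()]
print(instance_recall_at_n(docs, {"a", "b", "c", "d"}, 5))  # 0.75
```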
k-call @ n
• Binary metric: 1 if the top n results contain at least k
relevant documents, 0 otherwise
• 1-call is (1 – %no)
– See TREC robust track
[Example: the top 5 results contain exactly two relevant documents – 1-call @ 5 = 1, 2-call @ 5 = 1, 3-call @ 5 = 0.]
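A minimal sketch of the metric itself, on the same binary-relevance representation as above:

```python
def k_call_at_n(relevance, k, n):
    """k-call @ n: 1 if the top n results contain at least k relevant
    documents, 0 otherwise (`relevance` is a list of booleans in rank order)."""
    return int(sum(relevance[:n]) >= k)

# Matches the slide's example: exactly two relevant documents in the top five.
ranking = [False, True, False, True, False]
print([k_call_at_n(ranking, k, 5) for k in (1, 2, 3)])  # [1, 1, 0]
```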
Motivation for k-call
• 1-call: Want one relevant document
– Many queries satisfied with one relevant result
– Only need one relevant document, more room to
explore → promotes result set diversity
• n-call: Want all relevant documents
– “Perfect precision”
– Home in on one interpretation and stick to it!
• Intermediate k
– Risk/reward tradeoff
• Plus, easily modeled in our framework
– Binary variable
Bayesian Retrieval Model
• There exist distributions that generate relevant
documents and irrelevant documents
• PRP: rank by Pr[r | d] ∝ Pr[d | r] / Pr[d | ¬r]
• Remaining modeling questions: form of rel/irrel
distributions and parameters for those
distributions
• In this paper, we assume multinomial models,
and choose parameters by maximum a posteriori
– Prior is background corpus word distribution
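The slides fix the distributional form (multinomial) and estimate parameters by maximum a posteriori with a background-corpus prior. A minimal sketch of one plausible instantiation; the prior strength `mu`, the toy vocabulary, and the function names are assumptions, not taken from the paper:

```python
import math
from collections import Counter

def map_multinomial(term_counts, background, mu=2.0):
    """MAP estimate of a multinomial over the vocabulary, with a prior
    proportional to the background corpus distribution (the prior
    strength `mu` is an assumed knob, not specified on the slides)."""
    total = sum(term_counts.values())
    return {t: (term_counts.get(t, 0) + mu * p) / (total + mu)
            for t, p in background.items()}

def prp_score(doc_terms, rel_model, irrel_model):
    """PRP-style ranking score: log Pr[d | r] - log Pr[d | not r], which
    under the Bayes-rule step above orders documents like Pr[r | d]."""
    return sum(math.log(rel_model[t]) - math.log(irrel_model[t]) for t in doc_terms)

# Toy usage over a hypothetical two-term vocabulary.
background = {"fish": 0.5, "chips": 0.5}
rel = map_multinomial(Counter({"fish": 3}), background)
irrel = map_multinomial(Counter({"chips": 3}), background)
print(prp_score(["fish", "fish"], rel, irrel))  # positive: looks relevant
```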
Objective
• Probability Ranking Principle (PRP): maximize
Pr[ r | d ] at each step in ranking
• Expected Metric Principle (EMP): maximize
E[metric | d1...dn] for the complete result set
• In particular, for k-call, maximize:
E[at least k relevant | d1...dn] = Pr[at least k relevant | d1...dn]
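For intuition only: if each document in the result set were assigned an independent relevance probability (an assumption made here purely for illustration – the paper's model couples documents through shared parameters), the k-call objective Pr[at least k relevant | d1...dn] could be evaluated with a standard dynamic program:

```python
def prob_at_least_k(rel_probs, k):
    """Pr[at least k of the documents are relevant], assuming independent
    per-document relevance probabilities (illustrative assumption only).
    dist[j] holds the probability that exactly j documents so far are relevant."""
    dist = [1.0]
    for p in rel_probs:
        new = [0.0] * (len(dist) + 1)
        for j, q in enumerate(dist):
            new[j] += q * (1 - p)   # this document irrelevant
            new[j + 1] += q * p     # this document relevant
        dist = new
    return sum(dist[k:])

print(prob_at_least_k([0.9, 0.5, 0.4], 1))  # 1-call objective, ~0.97
print(prob_at_least_k([0.9, 0.5, 0.4], 2))  # 2-call objective
```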
Optimization of Objective
• Exact optimization of objective is usually NP-hard
– E.g.: Exact optimization for k-call reducible to NP-hard
maximum graph clique problem
• Approximation heuristic: Greedy algorithm
– Select documents successively in rank order
– Hold previous documents fixed, optimize objective at
each rank
[Diagram build across three slides: rank 1 – choose d1 to maximize E[metric | d]; rank 2 – hold d1 fixed, choose d2 to maximize E[metric | d, d1]; rank 3 – hold d1, d2 fixed, choose d3 to maximize E[metric | d, d1, d2]; see the sketch below.]
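A sketch of the greedy approximation just described, assuming the same model-supplied `expected_metric` hook as in the earlier brute-force sketch:

```python
def greedy_emp(candidates, n, expected_metric):
    """Greedy EMP: build the ranking one position at a time, holding the
    already-chosen prefix fixed and picking the document that maximizes
    E[metric | prefix + d] at each rank."""
    ranking, pool = [], list(candidates)
    for _ in range(min(n, len(pool))):
        best = max(pool, key=lambda d: expected_metric(tuple(ranking) + (d,)))
        ranking.append(best)
        pool.remove(best)
    return ranking
```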
Greedy on 1-call and n-call
• 1-greedy
– Greedy algorithm reduces to ranking each successive
document assuming all previous documents are
irrelevant
Pr[r1 ∨ r2 ∨ ... ∨ ri]
= Pr[r1 ∨ ... ∨ ri−1] + Pr[¬(r1 ∨ ... ∨ ri−1) ∧ ri]
= Pr[r1 ∨ ... ∨ ri−1] + Pr[¬(r1 ∨ ... ∨ ri−1)] · Pr[ri | ¬(r1 ∨ ... ∨ ri−1)]
⇒ maximize Pr[ri | ¬r1, ¬r2, ..., ¬ri−1]
– Algorithm has “discovered” incremental negative
pseudorelevance feedback
• n-greedy: Assume all previous documents relevant
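A sketch of the 1-greedy reduction: each successive document is scored as if everything already selected were irrelevant. One simple way to realize this (an assumption here, not the paper's exact estimator) is to fold the selected documents' terms into an "irrelevant" count that the scoring function penalizes – incremental negative pseudorelevance feedback:

```python
from collections import Counter

def one_greedy(docs, n, score):
    """1-greedy sketch. `docs` maps doc id -> list of terms; `score(terms,
    irrel_counts)` is a model-supplied hook that should rate documents
    lower when they resemble the accumulated 'irrelevant' counts."""
    ranking, irrel_counts, pool = [], Counter(), set(docs)
    for _ in range(min(n, len(pool))):
        best = max(pool, key=lambda d: score(docs[d], irrel_counts))
        ranking.append(best)
        pool.remove(best)
        irrel_counts.update(docs[best])  # treat the chosen document as irrelevant from now on
    return ranking
```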
Greedy on Other Metrics
• Greedy with precision/recall reduces to PRP!
• Greedy on k-call for general k (k-greedy)
– More complicated…
• Greedy with MSL, MRR, instance recall works
out to 1-greedy algorithm
– Intuition: to make first relevant document appear
earlier, we want to hedge our bets as to query
interpretation (i.e., diversify)
Experiments Overview
• Experiments verify that optimizing for a metric
improves performance on that metric
– They do not tell us which metrics to use
• Looked at ad hoc diversity examples
• TREC topics/queries
• Tuned weights on separate development set
• Tested on:
– Standard ad hoc (robust track) topics
– Topics with multiple annotators
– Topics with multiple instances
Diversity on Google Results
• Task: reranking top 1,000 Google results
• In optimizing 1-call, our algorithm finds more
diverse results than PRP or the original Google ranking
Experiments: Robust Track
• TREC 2003, 2004 robust tracks
– 249 topics
– 528,000 documents
            1-call   10-call   MRR     MSL     P@10
PRP         0.791    0.020     0.563   3.052   0.333
1-greedy    0.835    0.004     0.579   2.763   0.269
10-greedy   0.671    0.084     0.517   3.992   0.337
• 1-call, 10-call results statistically significant
Experiments: Instance Retrieval
• TREC-6,7,8 interactive tracks
– 20 topics
– 210,000 documents
– 7 to 56 instances per topic
• PRP baseline: instance recall @ 10 = 0.234
• Greedy 1-call: instance recall @ 10 = 0.315
Experiments: Multi-annotator
• TREC-4,6 ad hoc retrieval
– Independent annotators assessed same topics
– TREC-4: 49 topics, 568,000 documents, 3 annotators
– TREC-6: 50 topics, 556,000 documents, 2 annotators
                    1-call (1)   1-call (2)   1-call (3)   Total
TREC-4   PRP        0.735        0.551        0.653        1.939
TREC-4   1-greedy   0.776        0.633        0.714        2.122
TREC-6   PRP        0.660        0.620        N/A          1.280
TREC-6   1-greedy   0.800        0.820        N/A          1.620
• More annotators are satisfied using 1-greedy
Related Work
• Fits in risk minimization framework (objective as
negative loss function)
• Other approaches look at optimizing for metrics
directly, with training data
• Pseudorelevance feedback
• Subtopic retrieval
• Maximal marginal relevance
• Clustering
• See paper for references
Future Work
• General k-call (k = 2, etc.)
– Determine whether this is what users want
• Better underlying probabilistic model
– Our contribution is in the ranking objective, not the
model → the model can be arbitrarily sophisticated
• Better optimization techniques
– E.g., Local search would differentiate algorithms for
MRR and 1-call
• Other metrics
– Preliminary work on mean average precision, precision
@ recall
• (Perhaps) surprisingly, these metrics are not optimized by PRP!
Conclusions
• EMP: Metric can motivate model – choosing and
believing in a metric already gives us a
reasonable objective, E[metric]
• Can potentially apply EMP on top of a variety of
different underlying probabilistic models
• Diversity is one practical example of a natural
side effect of using EMP with the right metric
Acknowledgments
• Harr Chen supported by the Office of Naval
Research through a National Defense Science
and Engineering Graduate Fellowship
• Jaime Teevan, Susan Dumais, and anonymous
reviewers provided constructive feedback
• ChengXiang Zhai, William Cohen, and Ellen
Voorhees provided code and data