Risk Minimization and Language Modeling in Text Retrieval
ChengXiang Zhai
Thesis Committee:
John Lafferty (Chair),
Jamie Callan
Jaime Carbonell
David A. Evans
W. Bruce Croft (Univ. of Massachusetts, Amherst)
Information Overload

(Figure: growth in the number of web sites over time.)
Text Retrieval (TR)

(Diagram: a user issues a query, e.g. "Tips on thesis defense", to a retrieval system; the system searches a database/collection of text documents and returns the relevant documents.)
Challenges in TR

• Ad hoc parameter tuning
• Relevance (independent, topical) vs. utility
Sophisticated Parameter Tuning
in the Okapi System
“k1, b and k3 are parameters which depend on the nature of the queries and
possibly on the database; k1 and b default to 1.2 and 0.75 respectively,
but smaller values of b are sometimes advantageous; in long queries k3 is
often set to 7 or 1000 (effectively infinite).” (Robertson et al. 1999)
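To make the tuning burden concrete, here is a minimal sketch of the BM25 ranking function where these parameters live; the scoring form is the standard Okapi one, but the function name and inputs are illustrative:

```python
import math

def bm25_term_score(tf, qtf, df, N, dl, avgdl, k1=1.2, b=0.75, k3=1000):
    """Contribution of one query term to the Okapi BM25 score:
    k1 controls term-frequency saturation, b controls document-length
    normalization, k3 controls query-term-frequency saturation."""
    idf = math.log((N - df + 0.5) / (df + 0.5))  # inverse document frequency
    tf_part = ((k1 + 1) * tf) / (k1 * ((1 - b) + b * dl / avgdl) + tf)
    qtf_part = ((k3 + 1) * qtf) / (k3 + qtf)
    return idf * tf_part * qtf_part
```

Each of k1, b, and k3 must be tuned by hand; the risk-minimization framework aims to replace such tuning with principled estimation.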
More Than "Relevance"

(Diagram: a pure relevance ranking vs. the desired ranking, which also accounts for redundancy and readability.)
Meeting the Challenges

Risk Minimization Framework, built on two foundations:
• Statistical language models → principled parameter estimation
• Bayesian decision theory → utility-based retrieval
Map of Thesis

New TR framework: the Risk Minimization Framework.
New TR models, with their distinguishing features:
• Two-stage Language Model: automatic parameter setting
• KL-divergence Retrieval Model: natural incorporation of feedback
• Aspect Retrieval Model: non-traditional ranking
Retrieval as Decision-Making

Given a query:
• Which documents should be selected? (D)
• How should these documents be presented to the user? (π)
Choose: (D, π)

(Diagram: for a single query, candidate presentation strategies π — an unordered subset, a ranked list (1 2 3 4), a clustering.)
Generative Model of Document & Query

(Diagram.) A user $U$ (partially observed) generates a query model $\theta_Q$ according to $p(\theta_Q \mid U)$, and the query $q$ (observed) is drawn from $p(q \mid \theta_Q, U)$. Symmetrically, a source $S$ (partially observed) generates a document model $\theta_D$ according to $p(\theta_D \mid S)$, and the document $d$ (observed) is drawn from $p(d \mid \theta_D, S)$. The models $\theta_Q$ and $\theta_D$ must be inferred.
Bayesian Decision Theory

Given the observed query $q$ (from user $U$) and document collection $C$ (from source $S$), each candidate choice $(D_i, \pi_i)$ incurs a loss $L(D, \pi, \theta)$ that depends on the hidden variables $\theta$ (the unobserved models). The Bayes risk of a choice $(D, \pi)$ is its expected loss under the posterior over $\theta$, and retrieval picks the risk-minimizing choice:

RISK MINIMIZATION:
$$(D^*, \pi^*) = \arg\min_{D,\pi} \int_{\Theta} L(D, \pi, \theta)\, p(\theta \mid q, U, C, S)\, d\theta$$
Special Cases

• Set-based models (choose D)
  – Boolean model
• Ranking models (choose π)
  – Independent loss (→ Probability Ranking Principle, PRP)
    • Relevance-based loss → probabilistic relevance model, two-stage LM
    • Distance-based loss → vector-space model, KL-divergence model
  – Dependent loss
    • MMR loss
    • MDR loss → aspect retrieval model
Map of Existing TR Models

(Diagram: a taxonomy of retrieval models, branching on how the relevance of (R(q), R(d)) is formalized.)

• Similarity between representations (different representations & similarity measures):
  – Vector space model (Salton et al., 75)
  – Probability distribution model (Wong & Yao, 89)
• Probability of relevance, P(r=1 | q, d), r ∈ {0, 1}:
  – Regression model (Fox, 83)
  – Generative models:
    • Document generation: classical probabilistic model (Robertson & Sparck Jones, 76)
    • Query generation: LM approach (Ponte & Croft, 98; Lafferty & Zhai, 01a)
• Probabilistic inference, P(d→q) or P(q→d) (different inference systems):
  – Probabilistic concept space model (Wong & Yao, 95)
  – Inference network model (Turtle & Croft, 91)
Where Are We?

(Roadmap: Risk Minimization Framework → Two-stage Language Model · KL-divergence Retrieval Model · Aspect Retrieval Model. Up next: the Two-stage Language Model.)
Two-stage Language Models

Loss function:
$$l(d, \theta_Q, \theta_D) = \begin{cases} 0 & \text{if } \theta_Q = \theta_D \\ c & \text{otherwise} \end{cases}$$

Risk ranking formula:
$$R(d, q) \stackrel{rank}{=} p(q \mid \hat\theta_D, U)$$

Two-stage smoothing, laid over the generative model of document & query:
• Stage 1 (document side): compute $\hat\theta_D$ (Dirichlet prior smoothing)
• Stage 2 (query side): compute $p(q \mid \hat\theta_D, U)$ (mixture model)
The Need for Query Modeling (Dual Role of Smoothing)

(Figure: retrieval performance as a function of the amount of smoothing, plotted separately for keyword queries and verbose queries; the two query types behave very differently, suggesting smoothing does more than explain unseen words — it also models noise in the query.)
Interaction of the Two Roles of Smoothing

Relative performance (precision) of Jelinek-Mercer (JM), Dirichlet prior (Dir), and absolute discounting (AD) smoothing, by query type:

Method   Title queries   Long queries
JM       0.228           0.278
Dir      0.256           0.276
AD       0.237           0.260

(Chart: the same numbers as a bar chart; the relative ranking of the methods changes with query type.)
Two-stage Smoothing

• Stage 1: explain unseen words — Dirichlet prior smoothing (Bayesian)
• Stage 2: explain noise in the query — two-component mixture

$$p(w \mid d) = (1-\lambda)\,\frac{c(w,d) + \mu\, p(w \mid C)}{|d| + \mu} + \lambda\, p(w \mid U)$$
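A minimal sketch of the two-stage smoothed probability, assuming term-count dictionaries; the default mu and lam values here are purely illustrative, since the whole point of the next two slides is to estimate them automatically:

```python
def two_stage_prob(w, doc_counts, doc_len, p_c, p_u, mu=2500.0, lam=0.5):
    """Two-stage smoothed p(w|d): Dirichlet-prior smoothing with the
    collection model p_c (stage 1), then interpolation with the user
    background model p_u (stage 2)."""
    dirichlet = (doc_counts.get(w, 0) + mu * p_c.get(w, 1e-9)) / (doc_len + mu)
    return (1 - lam) * dirichlet + lam * p_u.get(w, 1e-9)
```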
Estimating μ Using Leave-One-Out

Leave each occurrence of a word out of its document in turn, and ask how well the smoothed model built from the rest predicts it:

$$\ell_{-1}(\mu \mid C) = \sum_{i=1}^{N} \sum_{w \in V} c(w, d_i)\, \log \frac{c(w, d_i) - 1 + \mu\, p(w \mid C)}{|d_i| - 1 + \mu}$$

Maximum-likelihood estimator, computed with Newton's method:

$$\hat\mu = \arg\max_{\mu}\, \ell_{-1}(\mu \mid C)$$
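A toy sketch of this estimator, assuming documents are term-count dictionaries. The thesis uses Newton's method with analytic derivatives; this sketch substitutes finite differences for brevity:

```python
import math

def loo_loglik(mu, docs, p_bg):
    """Leave-one-out log-likelihood of the Dirichlet parameter mu.
    docs: list of {term: count}; p_bg: collection model {term: prob}."""
    ll = 0.0
    for d in docs:
        dl = sum(d.values())
        for w, c in d.items():
            ll += c * math.log((c - 1 + mu * p_bg.get(w, 1e-9)) / (dl - 1 + mu))
    return ll

def estimate_mu(docs, p_bg, mu=1000.0, iters=20, eps=1e-3):
    """Newton iteration on the leave-one-out log-likelihood."""
    for _ in range(iters):
        g = (loo_loglik(mu + eps, docs, p_bg) - loo_loglik(mu - eps, docs, p_bg)) / (2 * eps)
        h = (loo_loglik(mu + eps, docs, p_bg) - 2 * loo_loglik(mu, docs, p_bg)
             + loo_loglik(mu - eps, docs, p_bg)) / (eps * eps)
        if h >= 0:                      # stop if curvature is not negative
            break
        mu = max(1.0, mu - g / h)       # Newton step, kept positive
    return mu
```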
Estimating λ Using a Mixture Model

Stage 2 treats the query as generated by a mixture over the stage-1 smoothed document models, each interpolated with the user background model:

$$p(q \mid \lambda, U) = \sum_{i=1}^{N} \pi_i \prod_{j=1}^{m} \big( (1-\lambda)\, p(q_j \mid \hat\theta_{d_i}) + \lambda\, p(q_j \mid U) \big)$$

Maximum-likelihood estimator, computed with the Expectation-Maximization (EM) algorithm:

$$\hat\lambda = \arg\max_{\lambda}\, p(q \mid \lambda, U)$$
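An EM sketch under a simplifying assumption: uniform, fixed document weights $\pi_i$, which turns the objective into a sum of per-document query log-likelihoods sharing one λ. The full model would also maintain posteriors over documents:

```python
def estimate_lambda(query, doc_models, p_user, n_iter=50):
    """EM for the shared stage-2 weight lambda.
    query: list of terms; doc_models: list of {term: prob} (stage-1 smoothed);
    p_user: user background model {term: prob}."""
    lam = 0.5
    for _ in range(n_iter):
        # E-step: posterior probability that each (document, query-term)
        # occurrence was generated by the background component.
        total, count = 0.0, 0
        for theta in doc_models:
            for t in query:
                bg = lam * p_user.get(t, 1e-9)
                total += bg / ((1 - lam) * theta.get(t, 1e-9) + bg)
                count += 1
        # M-step: lambda = expected fraction of background-generated terms.
        lam = total / count
    return lam
```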
Automatic 2-stage vs. Optimal 1-stage Results

Average precision (3 databases × 4 query types, 150 topics). SK/LK = short/long keyword queries; SV/LV = short/long verbose queries.

             AP88-89                       WSJ87-92                      ZIFF1-2
             SK     LK     SV     LV       SK     LK     SV     LV       SK     LK     SV     LV
Optimal-JM   20.3%  36.8%  18.8%  28.8%    19.4%  34.8%  17.2%  27.7%    17.9%  32.6%  15.6%  26.7%
Optimal-Dir  23.0%  37.6%  20.9%  29.8%    22.3%  35.3%  19.6%  28.2%    21.5%  32.6%  18.5%  27.9%
Auto-2stage  22.2%* 37.4%  20.4%  29.2%    21.8%* 35.8%  19.9%  28.8%*   20.0%  32.2%  18.1%  27.9%*

The automatic two-stage method, with no tuning, is close to (and sometimes better than) the best single-stage result obtained by exhaustive parameter search.
Where Are We?

(Roadmap: Risk Minimization Framework → Two-stage Language Model · KL-divergence Retrieval Model · Aspect Retrieval Model. Up next: the KL-divergence Retrieval Model.)
KL-divergence Retrieval Models

Loss function:
$$l(d, \theta_Q, \theta_D) = c\,\Delta(\theta_Q, \theta_D) \propto c\, D(\theta_Q \,\|\, \theta_D)$$

Risk ranking formula (documents ranked by increasing divergence):
$$R(d, q) \stackrel{rank}{=} D(\hat\theta_Q \,\|\, \hat\theta_D)$$

(Diagram: the generative model of document & query, with the estimated query model $\hat\theta_Q$ compared to the estimated document model $\hat\theta_D$ via KL divergence.)
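A minimal scoring sketch, assuming language models are term-to-probability dictionaries; in practice $\hat\theta_D$ is smoothed, so it assigns nonzero probability to every query-model term:

```python
import math

def kl_divergence(p_q, p_d):
    """D(theta_Q || theta_D), summed over terms the query model covers."""
    eps = 1e-12
    return sum(p * math.log(p / p_d.get(w, eps)) for w, p in p_q.items() if p > 0)

def rank_documents(p_q, doc_models):
    """Rank (doc_id, model) pairs by increasing divergence from the query model."""
    return sorted(doc_models, key=lambda pair: kl_divergence(p_q, pair[1]))
```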
Expansion-based vs. Model-based Feedback

(Diagram, top — expansion-based feedback: each document is scored by query likelihood $p(Q \mid \theta_D)$ using its document model $\theta_D$; feedback documents from the results are used to modify the query $Q$ itself.)

(Diagram, bottom — model-based feedback: each document is scored by the KL divergence $D(\theta_Q \,\|\, \theta_D)$; feedback documents are used to modify the query model $\theta_Q$.)
Feedback as Model Interpolation

Documents are still scored by $D(\theta_{Q'} \,\|\, \theta_D)$, but the query model is interpolated with a feedback model $\theta_F$ estimated from the feedback documents $F = \{d_1, d_2, \ldots, d_n\}$:

$$\theta_{Q'} = (1-\alpha)\,\theta_Q + \alpha\,\theta_F$$

• $\alpha = 0$: $\theta_{Q'} = \theta_Q$ — no feedback
• $\alpha = 1$: $\theta_{Q'} = \theta_F$ — full feedback

Two methods for estimating $\theta_F$: a generative model, and divergence minimization.
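The interpolation itself is a one-liner over the union vocabulary (sketch):

```python
def interpolate(theta_q, theta_f, alpha):
    """theta_Q' = (1 - alpha) * theta_Q + alpha * theta_F, term by term."""
    return {t: (1 - alpha) * theta_q.get(t, 0.0) + alpha * theta_f.get(t, 0.0)
            for t in set(theta_q) | set(theta_f)}
```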
θ_F Estimation Method I: Generative Mixture Model

Each word in $F = \{d_1, \ldots, d_n\}$ is generated either by the topic model $\theta$ (with probability $1-\lambda$) or by the background model $p(w \mid C)$ (with probability $\lambda$):

$$\log p(F \mid \theta) = \sum_i \sum_w c(w; d_i)\, \log\big( (1-\lambda)\, p(w \mid \theta) + \lambda\, p(w \mid C) \big)$$

Maximum likelihood:
$$\theta_F = \arg\max_{\theta}\, \log p(F \mid \theta)$$
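An EM sketch for $\theta_F$ with the background weight $\lambda$ held fixed, assuming feedback documents are term-count dictionaries:

```python
def estimate_theta_f(feedback_docs, p_bg, lam=0.9, n_iter=30):
    """Fit the topic model theta in the two-component mixture
    (1-lam)*p(w|theta) + lam*p(w|C) to the feedback documents."""
    counts = {}
    for d in feedback_docs:                          # pool counts over F
        for w, c in d.items():
            counts[w] = counts.get(w, 0) + c
    z0 = sum(counts.values())
    theta = {w: c / z0 for w, c in counts.items()}   # initialize from counts
    for _ in range(n_iter):
        # E-step: posterior that each occurrence of w came from the topic.
        post = {w: (1 - lam) * theta[w]
                   / ((1 - lam) * theta[w] + lam * p_bg.get(w, 1e-9))
                for w in counts}
        # M-step: re-estimate theta from topic-attributed counts.
        z = sum(counts[w] * post[w] for w in counts)
        theta = {w: counts[w] * post[w] / z for w in counts}
    return theta
```

With a high background weight (e.g. lam = 0.9), common words are absorbed by the background and θ_F concentrates on discriminative topic words, as the "airport security" example later in the talk shows.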
θ_F Estimation Method II: Empirical Divergence Minimization

Choose $\theta$ to be close to the feedback document models $\theta_{d_1}, \ldots, \theta_{d_n}$ and far from the collection (background) model $\theta_C$:

$$D_{emp}(\theta; F, C) = \frac{1}{|F|} \sum_{i=1}^{|F|} D(\theta \,\|\, \theta_{d_i}) - \lambda\, D(\theta \,\|\, \theta_C)$$

$$\theta_F = \arg\min_{\theta}\, D_{emp}(\theta; F, C)$$
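This objective admits a closed-form solution. The sketch below implements the form I believe the derivation yields — log-probabilities averaged over the feedback document models, minus $\lambda$ times the background log-probability, scaled by $1/(1-\lambda)$, exponentiated and normalized — so treat the exact expression as an assumption:

```python
import math

def divergence_min_theta_f(doc_models, p_bg, lam=0.5):
    """Closed-form divergence-minimization feedback model (sketch).
    doc_models: list of {term: prob}; p_bg: collection model."""
    eps = 1e-9
    vocab = set(p_bg).union(*doc_models)
    n = len(doc_models)
    scores = {}
    for w in vocab:
        avg_log = sum(math.log(m.get(w, eps)) for m in doc_models) / n
        scores[w] = math.exp((avg_log - lam * math.log(p_bg.get(w, eps))) / (1 - lam))
    z = sum(scores.values())
    return {w: s / z for w, s in scores.items()}
```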
Example of Feedback Query Model

TREC topic 412: "airport security". Mixture model approach; feedback on the top 10 documents from a web database.

λ = 0.9:
W               p(W|θ_F)
security        0.0558
airport         0.0546
beverage        0.0488
alcohol         0.0474
bomb            0.0236
terrorist       0.0217
author          0.0206
license         0.0188
bond            0.0186
counter-terror  0.0173
terror          0.0142
newsnet         0.0129
attack          0.0124
operation       0.0121
headline        0.0121

λ = 0.7:
W               p(W|θ_F)
the             0.0405
security        0.0377
airport         0.0342
beverage        0.0305
alcohol         0.0304
to              0.0268
of              0.0241
and             0.0214
author          0.0156
bomb            0.0150
terrorist       0.0137
in              0.0135
license         0.0127
state           0.0127
by              0.0125

With the smaller background weight (λ = 0.7), common function words leak into the feedback model.
Model-based Feedback vs. Simple LM

Collection  Metric   Simple LM   Mixture     Improv.   Div. Min.   Improv.
AP88-89     AvgPr    0.21        0.296       +41%      0.295       +40%
            InitPr   0.617       0.591       -4%       0.617       +0%
            Recall   3067/4805   3888/4805   +27%      3665/4805   +19%
TREC8       AvgPr    0.256       0.282       +10%      0.269       +5%
            InitPr   0.729       0.707       -3%       0.705       -3%
            Recall   2853/4728   3160/4728   +11%      3129/4728   +10%
WEB         AvgPr    0.281       0.306       +9%       0.312       +11%
            InitPr   0.742       0.732       -1%       0.728       -2%
            Recall   1755/2279   1758/2279   +0%       1798/2279   +2%
Where Are We?

(Roadmap: Risk Minimization Framework → Two-stage Language Model · KL-divergence Retrieval Model · Aspect Retrieval Model. Up next: the Aspect Retrieval Model.)
Aspect Retrieval

Query: What are the applications of robotics in the world today?
Find as many DIFFERENT applications as possible.

Example aspects:
A1: spot-welding robotics
A2: controlling inventory
A3: pipe-laying robots
A4: talking robot
A5: robots for loading & unloading memory tapes
A6: robot [telephone] operators
A7: robot cranes
...

Aspect judgments: a binary matrix over documents and aspects A1 ... Ak, e.g.

      A1 A2 A3 ...       Ak
d1    1  1  0  0  ... 0  0
d2    0  1  1  1  ... 0  0
d3    0  0  0  0  ... 1  0
...
dk    1  0  1  0  ... 0  1
Evaluation Measures

• Aspect Coverage (AC): measures per-document coverage
  – #distinct-aspects / #docs
  – Maximizing it is equivalent to the "set cover" problem (NP-hard)
• Aspect Uniqueness (AU): measures redundancy
  – #distinct-aspects / #aspects
  – Maximizing it is equivalent to the "volume cover" problem (NP-hard)
• Example (counts accumulated down a ranked list; a sketch for computing them follows below):

Rank  #docs  #aspects  #distinct-aspects  AC           AU
d1    1      2         2                  2/1 = 2.0    2/2 = 1.0
d2    2      5         4                  4/2 = 2.0    4/5 = 0.8
d3    3      8         5                  5/3 ≈ 1.67   5/8 = 0.625
...
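A small sketch computing accumulated AC and AU down a ranked list, assuming per-document aspect judgments are given as sets:

```python
def aspect_measures(ranked_docs, judgments):
    """Return (doc, AC, AU) at each rank.
    judgments: {doc_id: set of aspect ids the document covers}."""
    seen, total = set(), 0
    out = []
    for k, d in enumerate(ranked_docs, start=1):
        aspects = judgments.get(d, set())
        total += len(aspects)            # all aspect occurrences so far
        seen |= aspects                  # distinct aspects so far
        ac = len(seen) / k               # distinct aspects per document
        au = len(seen) / total if total else 0.0  # distinct / all occurrences
        out.append((d, ac, au))
    return out

# Reproduces the example: d1 covers 2 aspects, d2 adds 3 (2 new), d3 adds 3 (1 new).
print(aspect_measures(["d1", "d2", "d3"],
                      {"d1": {1, 2}, "d2": {2, 3, 4}, "d3": {2, 4, 5}}))
```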
Loss Function $L(\theta_{k+1} \mid \theta_1, \ldots, \theta_k)$

When choosing the next document $d_{k+1}$, its loss is conditioned on the (known) models $\theta_1, \ldots, \theta_k$ of the documents already ranked.

Maximal Marginal Relevance (MMR): the best $d_{k+1}$ is novel and relevant — combine a novelty/redundancy score $Nov(\theta_{k+1} \mid \theta_1, \ldots, \theta_k)$ with a relevance score $Rel(\theta_{k+1})$.

Maximal Diverse Relevance (MDR): the best $d_{k+1}$ is complementary in coverage — compare aspect coverage distributions $p(a \mid \theta_i)$.
Maximal Marginal Relevance (MMR) Models

• Maximize aspect coverage indirectly, through redundancy elimination
• Elements:
  – A redundancy/novelty measure
  – A strategy for combining novelty and relevance
• Proposed & studied six novelty measures
• Proposed & studied four combination strategies
Comparison of Novelty Measures (Aspect Coverage)

(Plot: average aspect coverage vs. aspect recall for the baseline Relevance ranking and the novelty measures AvgKL, AvgMix, KLMin, KLAvg, MixMin, MixAvg.)
Comparison of Novelty Measures (Aspect Uniqueness)

(Plot: average aspect uniqueness vs. aspect recall for the same baseline and novelty measures.)
A Mixture Model for Redundancy

Model the new document as a mixture of the reference ("old") document model and the collection background model:

$$p(w) = \lambda\, p(w \mid \mathrm{Old}) + (1-\lambda)\, p(w \mid \mathrm{Background})$$

The mixing weight $\lambda$ is estimated by maximum likelihood using the Expectation-Maximization algorithm; the better the old document explains the new one, the larger $\lambda$, so $\lambda$ serves as the redundancy measure.
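A sketch of this estimation for a single new document against one reference document; the fixed-point iteration is plain EM for the lone mixing weight:

```python
def redundancy(new_doc, p_old, p_bg, n_iter=30):
    """Fit lambda in lambda*p(w|Old) + (1-lambda)*p(w|Background) to the
    words of a new document; the fitted lambda is the redundancy score.
    new_doc: {term: count}; p_old, p_bg: {term: prob}."""
    lam, n = 0.5, sum(new_doc.values())
    for _ in range(n_iter):
        # E-step: expected number of occurrences generated by the Old model.
        expected = sum(
            c * lam * p_old.get(w, 1e-9)
            / (lam * p_old.get(w, 1e-9) + (1 - lam) * p_bg.get(w, 1e-9))
            for w, c in new_doc.items())
        lam = expected / n               # M-step
    return lam
```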
Cost-based Combination of Relevance and Novelty

$$l(d_k \mid d_1, \ldots, d_{k-1}, \theta_Q, \{\theta_i\}_{i=1}^{k-1}) = c_2\, p(\mathrm{Rel} \mid d_k)\,\big(1 - p(\mathrm{New} \mid d_k)\big) + c_3\,\big(1 - p(\mathrm{Rel} \mid d_k)\big)$$

Since $c_2$ and $c_3$ are constants, minimizing this loss is rank-equivalent to maximizing

$$p(\mathrm{Rel} \mid d_k)\,\big(\rho + p(\mathrm{New} \mid d_k)\big) \;\stackrel{rank}{=}\; p(q \mid d_k) \cdot \big(\rho + p(\mathrm{New} \mid d_k)\big), \qquad \rho = \frac{c_3}{c_2} - 1$$

i.e., a relevance score multiplied by a novelty score.
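The resulting scorer is a one-liner (the cost values are hypothetical defaults; only their ratio matters):

```python
def mmr_score(p_rel, p_new, c2=1.0, c3=1.5):
    """Rank-equivalent cost-based score: p(Rel|d) * (rho + p(New|d))."""
    rho = c3 / c2 - 1
    return p_rel * (rho + p_new)
```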
Maximal Diverse Relevance (MDR) Models

• Maximize aspect coverage directly, through aspect modeling
• Elements:
  – An aspect loss function
  – A generative aspect model
• Proposed & studied a KL-divergence aspect loss function
• Explored two aspect models (PLSI, LDA)
Aspect Generative Model of Document & Query

As in the basic generative model, user $U$ produces the query $q$ and source $S$ produces the document $d$, but both are now mediated by a shared set of aspect models $\theta = (\theta_1, \ldots, \theta_A)$: the query model is drawn from $p(\theta_Q \mid \theta, U)$ and the query from $p(q \mid \theta, \theta_Q)$; the document model from $p(\theta_D \mid \theta, S)$ and the document from $p(d \mid \theta, \theta_D)$.

PLSI (for $d = d_1 \cdots d_n$):
$$p(d \mid \theta, \theta_D) = \prod_{i=1}^{n} \sum_{a=1}^{A} p(d_i \mid \theta_a)\, p(a \mid \theta_D)$$

LDA:
$$p(d \mid \theta, \alpha) = \int \prod_{i=1}^{n} \sum_{a=1}^{A} p(d_i \mid \theta_a)\, p(a \mid \pi)\, \mathrm{Dir}(\pi \mid \alpha)\, d\pi$$
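A sketch of the PLSI-style document likelihood, assuming the aspect unigram models and the document's aspect mixing weights are given; estimating them (EM for PLSI, variational inference for LDA) is the hard part and is omitted here:

```python
import math

def aspect_log_likelihood(doc_terms, aspect_models, p_aspect):
    """log p(d | theta, theta_D): each term is generated by first drawing
    an aspect a ~ p_aspect, then the term from p(w | theta_a).
    aspect_models: list of {term: prob}; p_aspect: list of mixing weights."""
    ll = 0.0
    for w in doc_terms:
        p_w = sum(theta_a.get(w, 1e-12) * pa
                  for theta_a, pa in zip(aspect_models, p_aspect))
        ll += math.log(p_w)
    return ll
```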
Aspect Loss Function

$$l(d_k \mid d_1, \ldots, d_{k-1}, \theta_Q, \{\theta_i\}_{i=1}^{k-1}) = D\big(\theta_Q \,\|\, \theta_{1,\ldots,k}\big)$$

where the combined aspect coverage of the already-ranked documents and the candidate is

$$p(a \mid \theta_{1,\ldots,k}) = \frac{\lambda}{k-1} \sum_{i=1}^{k-1} p(a \mid \theta_i) + (1-\lambda)\, p(a \mid \theta_k)$$

(Diagram: the aspect generative model, with the candidate scored by $D(\hat\theta_Q \,\|\, \hat\theta_{1,\ldots,k})$.)
Aspect Loss Function: Illustration

(Diagram: the desired coverage $p(a \mid \theta_Q)$ is compared, via KL divergence, to the combined coverage of what is "already covered" — $p(a \mid \theta_1), \ldots, p(a \mid \theta_{k-1})$ — mixed with the new candidate's coverage $p(a \mid \theta_k)$. A candidate that fills the gaps in the desired coverage is perfect; one that duplicates the covered aspects is redundant; one that matches neither is non-relevant.)

Combined coverage:
$$p(a \mid \theta_{1,\ldots,k}) = \frac{\lambda}{k-1} \sum_{i=1}^{k-1} p(a \mid \theta_i) + (1-\lambda)\, p(a \mid \theta_k)$$
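A sketch of the aspect loss for a single candidate, assuming aspect coverage distributions are given as probability vectors over a fixed set of aspects:

```python
import math

def aspect_loss(p_q, covered, candidate, lam=0.5):
    """D(p(a|theta_Q) || combined coverage), where the combined coverage
    mixes the average distribution of the already-ranked documents with
    the candidate's distribution.
    p_q, candidate: lists of aspect probabilities; covered: list of such lists."""
    if covered:
        k1 = len(covered)
        combined = [lam * sum(c[a] for c in covered) / k1
                    + (1 - lam) * candidate[a] for a in range(len(p_q))]
    else:
        combined = list(candidate)
    return sum(p * math.log(p / max(c, 1e-12))
               for p, c in zip(p_q, combined) if p > 0)

# The best next document minimizes this loss: it covers the query aspects
# that the already-ranked documents leave uncovered.
```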
Preliminary Evaluation: MMR vs. MDR

Relative improvements over a relevance-only ranking:

                 Relevant Data       Mixed Data
Ranking Method   AC      AU          AC      AU      Prec.
MMR              +2.6%   +13.8%      +1.5%   +2.2%    0.0%
MDR              +9.8%   +4.5%       +1.5%   +3.4%   -13.8%

• On the relevant data set, both MMR and MDR are effective, but they complement each other:
  – MMR improves AU more than AC
  – MDR improves AC more than AU
• On the mixed data set, however:
  – MMR is only effective when relevance ranking is accurate
  – MDR improves AC, even though relevance ranking is degraded
Further Work Is Needed

• Controlled experiments with synthetic data
  – Level of redundancy
  – Density of relevant documents
  – Per-document aspect counts
• Alternative loss functions
• Aspect language models, especially along the lines of LDA
  – Aspect-based feedback
Summary of Contributions

New TR framework — the Risk Minimization Framework:
• Unifies existing models
• Incorporates LMs
• Serves as a map for exploring new models

New TR models — specific contributions:

Two-stage Language Model:
• Empirical study of smoothing (dual role of smoothing)
• New smoothing method (two-stage smoothing)
• Automatic parameter setting (leave-one-out, mixture model)

KL-divergence Retrieval Model:
• Query/document distillation
• Feedback with LMs (mixture model & divergence minimization)

Aspect Retrieval Model:
• Evaluation criteria (AC, AU)
• Redundancy/novelty measures (mixture weight)
• MMR with LMs (cost-based combination)
• Aspect-based loss function ("collective KL-divergence")
Future Research Directions

• Better approximation of the risk integral
• More effective LMs for "traditional" retrieval
  – Can we beat TF-IDF without increasing computational complexity?
  – Automatic parameter setting, especially for feedback models
  – Flexible passage retrieval, especially with HMMs
  – Beyond unigrams (more linguistics)
More Future Research Directions
• Aspect Retrieval Models
– Document structure/sub-topic modeling
– Aspect-based feedback
• Interactive information retrieval models
– Risk minimization for information filtering
– Personalized & context-sensitive retrieval
Thank you!