Lidan Wang (1), Jimmy Lin (2), and Donald Metzler (3)
(1) Department of Computer Science, University of Maryland, College Park
(2) College of Information Studies, University of Maryland, College Park
(3) Information Sciences Institute, University of Southern California
Motivation
[Figure: ranking models (vector space models, language models, linear feature-based models, gradient boosted regression trees) plotted along two axes: efficiency and ranked effectiveness]
How to define and learn effective ranking models for web-scale collections?
Web-Scale Ranked Retrieval
Additive ranking models: the final document score is the sum of partial document scores from weak rankers
[Diagram: document D is scored by weak rankers H1, H2, …, Ht, producing partial scores S1, S2, …, St]
Weak ranker Ht: tree; linear feature-based; etc.
Query execution time depends on:
Complexity of the model (e.g., number of trees, tree depth)
Feature computation costs (may require online evaluation)
Number of documents to evaluate (e.g., the web)
Web-Scale Ranked Retrieval
[Diagram: training data (d1, y1), …, (dn, yn) → learning → model H(θ) → ranked retrieval over documents d1…dm; impractical for complex models]
We are interested in the run-time efficiency of H(θ)
Machine-learned models can be very expensive
~10,000 trees (Yahoo's LTR challenge runner-up)
Cannot meet stringent time requirements
Our work: enable machine-learned models via a cascade
Overview of the paper
A Cascade Ranking Model for ranked retrieval
Highly efficient for web-scale ranked retrieval
Result quality at least as good as effectiveness-centric monolithic models
A boosting algorithm for learning the cascade
Provides a knob to control the tradeoff between effectiveness and efficiency
Optimizes the entire cascade wrt the tradeoff
Related Work
Learning to efficiently rank (our earlier work: Wang et al., SIGIR 2010; Wang et al., CIKM 2010)
Learns a model wrt the tradeoff between effectiveness and efficiency, but learns monolithic models
Query evaluation and caching (orthogonal to learning a cascade)
Early exits (Cambazoglu et al.; Anh et al.)
Caching postings lists and query results (Baeza-Yates et al.)
Coarse-to-fine models: real-time object detection (Viola et al., 2002); not applicable to ranked retrieval
Cascade Ranking Model
Skewed distribution of relevant/non-relevant docs; we are interested only in the top K
Ranking with a monolithic model is costly, and unnecessary
Wastes computational effort on a large number of non-relevant documents
Cascade ranking model:
[Diagram: Stage 0 scores the input documents {di} with H0, producing ranking R{H0}; each subsequent stage t applies a pruning function Jt (discarding docs, yielding R{·, Jt}) and then a weak ranker Ht (yielding R{·, Ht}); after Stage T, the top K docs are returned. Accuracy and cost increase from stage to stage.]
Enable complex models without incurring high latency
Effectiveness improves with more complex models
Good efficiency, since most input documents are eliminated very early on
Capture different effectiveness/efficiency tradeoffs via:
pruning aggressiveness, weak ranker complexity, and jointly learning pruning functions and weak rankers
Final score of a non-pruned document: H(d) = Σt αt·Ht(d), where αt is the weight on weak ranker Ht
Weak ranker Ht: a single feature (more on this later)
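To make the stage-by-stage flow concrete, here is a minimal, illustrative sketch of cascade scoring; the names and structure are assumptions, not the paper's implementation. Documents are pruned, the survivors are rescored, and the weighted partial scores accumulate into the final additive score.

```python
# Minimal sketch of cascade scoring; function names and structure are
# illustrative assumptions, not the paper's implementation.

def cascade_score(docs, stages, top_k):
    """docs: dict mapping doc_id -> feature dict.
    stages: list of (prune_fn, beta, weak_ranker, alpha) tuples.
    Stage 0 (no pruning) can be represented with a no-op prune_fn."""
    # Running additive scores for the documents still in the cascade.
    scores = {d: 0.0 for d in docs}
    for prune_fn, beta, weak_ranker, alpha in stages:
        # Pruning: discard documents flagged by this stage's pruning function.
        scores = prune_fn(scores, beta)
        # Scoring: each survivor accumulates a weighted partial score.
        for d in scores:
            scores[d] += alpha * weak_ranker(docs[d])
    # Return the top K surviving documents by final additive score.
    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```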
Pruning Functions
[Diagram: stage t takes R{·, Ht-1} as input; Jt prunes (discarding docs) to give R{·, Jt}, then Ht rescores to give R{·, Ht}]
Apply pruning function Jt before Ht at each stage
Jt removes the bottom βt% of docs from the stage input
βt is a pruning threshold, controlling pruning aggressiveness
Flexible to use various pruning functions Jt (for determining the bottom βt% of docs to remove), sketched after this list:
Rank-based
Score-based
Mean-max score-based
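The following are hedged, illustrative forms of these three pruning-function families; the exact parameterizations in the paper may differ. Each takes the current document scores and a pruning threshold β, matching the prune_fn signature used in the cascade scoring sketch above.

```python
# Illustrative pruning functions; exact parameterizations in the paper may differ.

def prune_rank(scores, beta):
    """Rank-based: keep only the top (1 - beta) fraction of documents by current score."""
    keep = max(1, int(len(scores) * (1.0 - beta)))
    ranked = sorted(scores, key=scores.get, reverse=True)
    return {d: scores[d] for d in ranked[:keep]}

def prune_score(scores, beta):
    """Score-based: drop documents whose score falls below a cutoff placed a
    beta fraction of the way up the observed score range (assumed form)."""
    lo, hi = min(scores.values()), max(scores.values())
    cutoff = lo + beta * (hi - lo)
    return {d: s for d, s in scores.items() if s >= cutoff}

def prune_mean_max(scores, beta):
    """Mean-max score-based: cutoff interpolates between the mean and the max
    of the current scores (assumed form)."""
    mean = sum(scores.values()) / len(scores)
    cutoff = beta * max(scores.values()) + (1.0 - beta) * mean
    return {d: s for d, s in scores.items() if s >= cutoff}
```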
Learning the Cascade
A cascade S is parameterized by {<Jt(βt), Ht, αt>}, t = 1…T
Jt: pruning function at stage t (rank, score, or mean-max score)
βt: pruning threshold at stage t
Ht, αt: weak ranker and its weight, respectively, at stage t
Formulation of the cascade learning problem:
Given a tradeoff metric (more on this later)
and a set of training data D = {<query, documents, qrels>},
learn the cascade S* that maximizes the tradeoff metric over D
Key issue: how to learn cascades that jointly optimize two competing metrics (effectiveness and efficiency)
Tradeoff Metric
Effectiveness metric (E): NDCG@20 (although other metrics can be substituted)
Cascade cost (C): sum of the individual stage costs
Cost of stage t = (unit cost of Ht) × (number of documents that will be scored by Ht, i.e., those that have passed pruning Jt)
Tradeoff metric: combines effectiveness (E) and cascade cost (C), balanced by a tradeoff parameter
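As a concrete reading of these definitions, the sketch below computes per-stage and total cascade cost and combines cost with effectiveness through a simple linear tradeoff. The linear form E - γ·C is an assumption for illustration, not necessarily the exact combination used in the paper.

```python
# Illustrative cost and tradeoff computation; the linear E - gamma * C form is an assumption.

def stage_cost(unit_cost, num_docs_scored):
    # Cost of stage t: unit cost of Ht times the number of documents Ht actually scores.
    return unit_cost * num_docs_scored

def cascade_cost(unit_costs, docs_per_stage):
    # Total cascade cost C: sum of the individual stage costs.
    return sum(stage_cost(c, n) for c, n in zip(unit_costs, docs_per_stage))

def tradeoff(effectiveness, cost, gamma):
    # Reward effectiveness (e.g., NDCG@20) and penalize cost, balanced by gamma.
    return effectiveness - gamma * cost
```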
Learning Algorithm
Boosting-based learning algorithm:
[Diagram: stage t (St) takes R{·, Ht-1} as input; Jt prunes (discarding docs) to give R{·, Jt}, then Ht rescores to give R{·, Ht}]
· Initialize importance weights on the training instances (queries): P0(qi) = 1/N, where N is the number of queries and qi is a query
· Initialize the cascade S = {}
· For t = 1…T do:
· Select a stage <Jt(βt), Ht> over the weighted training instances
· Set the weight αt for Ht, based on E(St, qi) (the effectiveness of stage St for qi) and C(St, qi) (the cost of stage St for qi); an Ht that ranks costly or hard queries well gets a larger weight αt
· Add the full stage <Jt(βt), Ht, αt> to the cascade S
· Update the importance distribution so that the next round pays more attention to underperforming queries
A sketch of this loop is given below.
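The following is a minimal sketch of this boosting loop. The αt and re-weighting formulas follow the AdaRank pattern and are assumptions rather than the paper's exact equations; select_stage, effectiveness, and cost are supplied by the caller, and per-query tradeoff values are assumed to lie in (-1, 1).

```python
import math

# Boosting-style cascade learner sketch; AdaRank-flavored weight and update
# formulas are assumptions, and the helper functions are caller-supplied.
def learn_cascade(queries, select_stage, effectiveness, cost, T, gamma):
    # Uniform importance weights over training queries.
    P = {q: 1.0 / len(queries) for q in queries}
    cascade = []
    for t in range(T):
        # Jointly pick a pruning function (with threshold) and weak ranker
        # for this stage, using the current query weights.
        stage = select_stage(queries, P)
        # Per-query tradeoff-adjusted performance of the stage (assumed in (-1, 1)).
        perf = {q: effectiveness(stage, q) - gamma * cost(stage, q) for q in queries}
        # Weak-ranker weight: stages that handle heavily weighted (hard or
        # costly) queries well receive a larger alpha.
        num = sum(P[q] * (1.0 + perf[q]) for q in queries)
        den = sum(P[q] * (1.0 - perf[q]) for q in queries)
        alpha = 0.5 * math.log(num / den)
        cascade.append((stage, alpha))
        # Re-weight queries: underperforming queries get more attention next round.
        P = {q: math.exp(-perf[q]) for q in queries}
        Z = sum(P.values())
        P = {q: w / Z for q, w in P.items()}
    return cascade
```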
Stage Construction
Used single features as weak rankers Ht: Dirichlet, BM25, and term proximity features
Jointly select Jt(βt) and Ht at each stage to balance cost and effectiveness
Jt(βt) and the weak rankers Ht, Ht+1, Ht+2, … are inter-dependent
To improve tractability, only jointly consider Jt(βt) and Ht
A generalization of effectiveness-centric models (AdaRank)
A possible concretization of this joint selection is sketched below.
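One way to read the joint selection of Jt(βt) and Ht is as a search over candidate combinations. The sketch below enumerates a candidate grid and keeps the combination with the best importance-weighted tradeoff; the grid, the tradeoff form, and the factory shape are illustrative assumptions, intended to pair with the learn_cascade sketch above.

```python
# Illustrative stage selection: enumerate (pruning function, threshold, feature)
# candidates and keep the one with the best importance-weighted tradeoff.
def make_select_stage(pruning_fns, betas, features, effectiveness, cost, gamma):
    def select_stage(queries, P):
        best, best_value = None, float("-inf")
        for J in pruning_fns:        # rank-, score-, or mean-max score-based
            for beta in betas:       # candidate pruning thresholds
                for H in features:   # single-feature weak rankers (BM25, Dirichlet, proximity, ...)
                    stage = (J, beta, H)
                    # Importance-weighted effectiveness/cost tradeoff over queries.
                    value = sum(P[q] * (effectiveness(stage, q) - gamma * cost(stage, q))
                                for q in queries)
                    if value > best_value:
                        best, best_value = stage, value
        return best
    return select_stage
```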
Experiment
Search engine: Ivory
First stage: standard inverted index
Subsequent stages: filtering of candidate docs
TREC web collections: Gov2, Clue
Comparison methods:
Query likelihood (QL)
Cascade model
Feature pruning-based (Wang et al., SIGIR 2010)
AdaRank (effectiveness-centric; Xu et al., SIGIR 2007)
Effectiveness vs Efficiency
Gov2 (25 million docs); tradeoff parameter = 0.1
[Bar charts: NDCG@20 (effectiveness) and average query execution time for QL, Cascade, Feature, and AdaRank; annotated relative differences: +0.1% and +0.6% in NDCG@20, -49% and -45% in query execution time]
Cascade is more efficient; effectiveness is not sacrificed
Effectiveness vs Efficiency
Clue (50 million docs); tradeoff parameter = 0.1
[Bar charts: NDCG@20 (effectiveness) and average query execution time for QL, Cascade, Feature, and AdaRank; annotated relative differences: -1.1% and +3.2% in NDCG@20, -25% and -35% in query execution time]
Cascade is more efficient; effectiveness is not sacrificed
Analysis of the Cascade
Tradeoff parameter = 0.1
[Table: per-stage breakdown for Stage 0 through Stage 3]
Filtered: input documents at each stage that are pruned by Jt
Filter loss: relevant documents that are incorrectly pruned by Jt
Pruning function Jt: S = score-based; R = rank-based; M = mean-max score-based
Varying Tradeoff Parameter
[Plots: effectiveness vs. query execution time on Gov2 (25 million docs) and Clue (50 million docs) as the tradeoff parameter varies]
Cascade's NDCG improves faster when given more time
Conclusions
We introduced a cascade ranking model for efficient ranked retrieval
We proposed a novel boosting algorithm for learning the cascade
Future work:
Other objective metrics
Other pruning functions
Different types of weak rankers
Although our approach is orthogonal to query evaluation and caching techniques, combining it with them may yield further improvements
Thanks!
Code and results are reproducible from Ivory: http://ivory.cc
I'm graduating next year and will be looking for a job
Questions?