Lidan Wang¹, Jimmy Lin², and Donald Metzler³

¹Department of Computer Science, University of Maryland, College Park
²College of Information Studies, University of Maryland, College Park
³Information Sciences Institute, University of Southern California
Motivation

[Figure: retrieval models (vector space models, language models, linear feature-based models, gradient boosted regression trees) plotted by efficiency vs. ranked effectiveness]

How to define and learn effective ranking models for web-scale collections?
Web-Scale Ranked Retrieval

 Additive ranking models: the final document score is the sum of partial document scores from weak rankers

[Figure: document D passes through weak rankers H1, H2, …, Ht, producing partial scores S1, S2, …, St]

 Weak ranker Ht: tree, linear feature-based model, etc.
 Query execution time depends on:
  Complexity of the model (e.g., number of trees, tree depth)
  Feature computation costs (may need online evaluation)
  Number of documents to evaluate (e.g., the web)
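A minimal sketch of additive scoring as described above; the toy weak rankers and feature names are illustrative assumptions, not the models used in the paper:

```python
# Minimal sketch of an additive ranking model: the final score of a document is the sum
# of partial scores from its weak rankers. Every candidate document is scored by every
# weak ranker, which is what makes large monolithic models expensive at query time.

from typing import Callable, Dict, List

Doc = Dict[str, float]                       # a document's feature values
WeakRanker = Callable[[Doc], float]          # maps document features to a partial score

def additive_score(doc: Doc, weak_rankers: List[WeakRanker]) -> float:
    return sum(h(doc) for h in weak_rankers)

# Two toy weak rankers (illustrative only):
weak_rankers = [
    lambda f: 1.2 * f.get("bm25", 0.0),             # e.g., a single-feature ranker
    lambda f: 0.8 * f.get("term_proximity", 0.0),   # e.g., another single-feature ranker
]
print(additive_score({"bm25": 3.1, "term_proximity": 0.4}, weak_rankers))
```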
Web-Scale Ranked Retrieval

[Figure: training data (d1, y1) … (dn, yn) feeds a learning step that produces a model H(θ), which is then used for ranked retrieval over documents d1…dm; impractical for complex models]

 We are interested in the run-time efficiency of H(θ)
 Machine-learned models can be very expensive
  ~10,000 trees (runner-up in the Yahoo! Learning to Rank Challenge)
  Can't meet stringent time requirements
 Our work: enable machine-learned models via a cascade
Overview of the paper

 A cascade ranking model for ranked retrieval
  Highly efficient for web-scale ranked retrieval
  Result quality at least as good as effectiveness-centric monolithic models
 A boosting algorithm for learning the cascade
  Provides a knob to control the tradeoff between effectiveness and efficiency
  Optimizes the entire cascade wrt the tradeoff
Related Work

 Learning to efficiently rank (learns monolithic models)
  Our earlier work: Wang et al., SIGIR 2010; Wang et al., CIKM 2010
  Learns a model wrt the tradeoff between effectiveness and efficiency
 Query evaluation and caching (orthogonal to learning a cascade)
  Early exits (Cambazoglu et al.; Anh et al.)
  Caching of postings lists and query results (Baeza-Yates et al.)
 Coarse-to-fine models (not applicable to ranked retrieval)
  Real-time object detection (Viola et al., 2002)
Cascade Ranking Model

 Skewed distribution of relevant/non-relevant documents; we are interested only in the top K
 Ranking with a monolithic model is costly, and unnecessary
  Wastes computational effort on a large number of non-relevant documents
 Cascade ranking model:

[Figure: a cascade of stages. Stage 0 applies weak ranker H0 to the input documents {di}; each subsequent stage t first applies a pruning function Jt, discarding documents, and then applies weak ranker Ht to the survivors; accuracy and cost increase from stage to stage; the final stage T outputs the top K documents]
Cascade Ranking Model

[Figure: the same cascade diagram: at each stage, pruning function Jt discards documents and weak ranker Ht rescores the survivors, with the top K documents emitted after stage T]

 Enables complex models without incurring high latency
  Effectiveness improves with more complex models
  Good efficiency, since most input documents are eliminated very early on
 Captures different effectiveness/efficiency tradeoffs via:
  Pruning aggressiveness
  Weak ranker complexity
  Jointly learning pruning functions and weak rankers
 Final score of a non-pruned document: a weighted sum of the weak ranker scores, with weight αt on weak ranker Ht (sketched below)
 Weak ranker Ht: a single feature (more on this later)
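A minimal sketch of scoring with the cascade at query time, under assumed data structures (an illustration, not the Ivory implementation; rank-based pruning is shown for simplicity, and each surviving document accumulates αt·Ht(d), so the final score is the weighted sum described above):

```python
# Minimal sketch of cascade scoring. Stage 0 scores every input document with H0; each
# later stage drops the bottom beta_t fraction of survivors (rank-based pruning shown for
# simplicity) and adds alpha_t * H_t(d) to the rest. Names and types are illustrative.

from typing import Callable, Dict, List, Tuple

Doc = Dict[str, float]                                  # a document's feature values
Ranker = Callable[[Doc], float]                         # weak ranker H_t
Stage = Tuple[float, Ranker, float]                     # (beta_t, H_t, alpha_t)

def cascade_score(docs: List[Doc], h0: Ranker, alpha0: float,
                  stages: List[Stage], k: int) -> List[Tuple[Doc, float]]:
    # Stage 0: score all input documents with H0.
    alive = [(d, alpha0 * h0(d)) for d in docs]
    # Stages 1..T: prune, then rescore the surviving documents.
    for beta, h, alpha in stages:
        alive.sort(key=lambda ds: ds[1], reverse=True)
        alive = alive[:max(1, int(len(alive) * (1.0 - beta)))]   # drop bottom beta_t fraction
        alive = [(d, s + alpha * h(d)) for d, s in alive]        # accumulate alpha_t * H_t(d)
    alive.sort(key=lambda ds: ds[1], reverse=True)
    return alive[:k]                                             # top-K documents
```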
Pruning Functions

[Figure: stage t takes R{·, Ht-1} as input; pruning function Jt produces R{·, Jt}, discarding documents; weak ranker Ht then produces R{·, Ht}]

 Apply pruning function Jt before Ht at each stage
 Jt removes the bottom βt% of documents from the stage's input
  βt is a pruning threshold, controlling pruning aggressiveness
 Flexible to use various pruning functions Jt (for determining the bottom βt% of documents to remove), as sketched below:
  Rank-based
  Score-based
  Mean-max score-based
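Illustrative sketches of the three pruning families; the rank-based version follows the definition above directly, while the concrete score-based and mean-max cutoffs are assumptions for illustration (the exact formulas in the paper may differ):

```python
# Illustrative sketches of the three pruning families. Each takes (doc, score) pairs from
# the previous stage plus a threshold beta in [0, 1] and returns the documents to keep
# (assumes a non-empty input). The score-based and mean-max cutoffs are assumed forms.

from typing import Dict, List, Tuple

Doc = Dict[str, float]
Scored = List[Tuple[Doc, float]]

def rank_prune(scored: Scored, beta: float) -> Scored:
    """Keep the top (1 - beta) fraction of documents by current rank."""
    ranked = sorted(scored, key=lambda ds: ds[1], reverse=True)
    return ranked[:max(1, int(len(ranked) * (1.0 - beta)))]

def score_prune(scored: Scored, beta: float) -> Scored:
    """Keep documents whose score clears a cutoff placed a beta fraction of the way
    from the minimum to the maximum score (assumed form)."""
    lo = min(s for _, s in scored)
    hi = max(s for _, s in scored)
    cutoff = lo + beta * (hi - lo)
    return [(d, s) for d, s in scored if s >= cutoff]

def mean_max_prune(scored: Scored, beta: float) -> Scored:
    """Keep documents whose score clears a cutoff interpolated between the mean score
    and the maximum score (assumed form of mean-max score-based pruning)."""
    mean = sum(s for _, s in scored) / len(scored)
    hi = max(s for _, s in scored)
    cutoff = mean + beta * (hi - mean)
    return [(d, s) for d, s in scored if s >= cutoff]
```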
Overview of the paper

 A cascade ranking model for ranked retrieval
  Highly efficient for web-scale ranked retrieval
  Result quality at least as good as effectiveness-centric monolithic models
 A boosting algorithm for learning the cascade
  Provides a knob to control the tradeoff between effectiveness and efficiency
  Optimizes the entire cascade wrt the tradeoff
Learning the Cascade

 A cascade S is parameterized by {<Jt(βt), Ht, αt>}, t = 1…T
  Jt: pruning function at stage t (rank-based, score-based, or mean-max score-based)
  βt: pruning threshold at stage t
  Ht, αt: weak ranker and its weight, respectively, at stage t
 Formulation of the cascade learning problem:
  A tradeoff metric between effectiveness and efficiency (more on this later)
  A set of training data: D = {<query, documents, qrels>}
  Learn the cascade S* that maximizes the tradeoff metric over the training data (a reconstruction of the objective is sketched below)
 Key issue: how to learn cascades that jointly optimize two competing metrics (effectiveness and efficiency)
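The objective on the original slide did not survive extraction; the following is a plausible reconstruction, where $\mathcal{T}$ denotes the tradeoff metric defined on the next slide, $E(S, q)$ and $C(S, q)$ are the effectiveness and cost of cascade $S$ on query $q$, and the summation over training queries is an assumption:

```latex
% Assumed reconstruction: the learned cascade maximizes the tradeoff metric
% \mathcal{T} over the queries q in the training set D.
S^{*} = \arg\max_{S} \sum_{q \in D} \mathcal{T}\left( E(S, q),\, C(S, q) \right)
```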
Tradeoff Metric

 Cascade cost (C): sum of the individual stage costs
  Cost of stage t: the unit cost of Ht multiplied by the number of documents that will be scored by Ht (those that have passed pruning Jt)
 Effectiveness metric (E): NDCG@20, although other metrics can be substituted
 Tradeoff metric: combines effectiveness and cascade cost through a tradeoff parameter (see the sketch below)
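A sketch of these quantities in LaTeX; the exact equations on the slide were lost in extraction, so the symbols ($u_t$ for the unit cost of $H_t$, $D_t$ for the documents that pass $J_t$) and the linear form of the tradeoff with parameter $\gamma$ are assumptions, chosen to be consistent with the definitions above and the "tradeoff parameter = 0.1" setting used in the experiments:

```latex
% Assumed notation: u_t is the unit cost of weak ranker H_t, and D_t(q) is the set of
% documents that pass pruning J_t for query q and are therefore scored by H_t.
C(S, q) = \sum_{t} u_t \cdot |D_t(q)|
% One plausible (assumed) linear form of the tradeoff metric, with tradeoff parameter \gamma:
\mathcal{T}(E, C) = E - \gamma \cdot C
```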
Learning Algorithm

[Figure: stage t (St) takes R{·, Ht-1} as input; pruning function Jt produces R{·, Jt}, discarding documents; weak ranker Ht produces R{·, Ht}]

Boosting-based learning algorithm:

· Initialize importance weights on the training instances (queries): P0(qi) = 1/N, where N is the number of queries and qi is a query
· Initialize the cascade S = {}
· For t = 1…T do:
  · Select a stage <Jt(βt), Ht> over the weighted training instances
  · Set the weight αt for Ht, based on E(St, qi), the effectiveness of stage St for qi, and C(St, qi), the cost of stage St for qi (an Ht that ranks costly or hard queries well gets a larger weight αt)
  · Add the full stage <Jt(βt), Ht, αt> to the cascade S
  · Update the importance distribution, paying more attention to underperforming queries in the next round
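A minimal runnable sketch of this boosting loop. The exact formulas for αt and for the importance-distribution update were equations on the original slides and are not reproduced here; the AdaBoost-style forms below, and the helper names candidate_stages and stage_tradeoff, are illustrative assumptions:

```python
# Sketch of the boosting loop: select a stage on the weighted queries, set its weight,
# add it to the cascade, and reweight the queries. Weight and update formulas are
# assumed AdaBoost-style forms, not the paper's exact equations.

import math

def learn_cascade(queries, candidate_stages, stage_tradeoff, T):
    """queries: training queries; candidate_stages: candidate <J_t(beta_t), H_t> pairs;
    stage_tradeoff(stage, q): per-query score combining E(S_t, q) and C(S_t, q)."""
    N = len(queries)
    P = {q: 1.0 / N for q in queries}            # importance weights P_0(q_i) = 1/N
    cascade = []                                  # the cascade S, initially empty

    for t in range(T):
        # Select the stage that does best on the importance-weighted training queries.
        stage = max(candidate_stages,
                    key=lambda s: sum(P[q] * stage_tradeoff(s, q) for q in queries))

        # Set alpha_t (assumed form): stages that do well on heavily weighted, i.e.
        # hard or costly, queries get a larger weight.
        weighted = sum(P[q] * stage_tradeoff(stage, q) for q in queries)
        weighted = min(max(weighted, 1e-6), 1 - 1e-6)   # clamp so the log stays finite
        alpha = 0.5 * math.log(weighted / (1.0 - weighted))

        cascade.append((stage, alpha))            # add the full stage <J_t, H_t, alpha_t>

        # Update the importance distribution: underperforming queries gain weight.
        for q in queries:
            P[q] *= math.exp(-alpha * stage_tradeoff(stage, q))
        Z = sum(P.values())
        P = {q: w / Z for q, w in P.items()}      # renormalize to a distribution

    return cascade
```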
Stage Construction

 Used single features as weak rankers Ht
  Dirichlet, BM25
  Term proximity features
 Jointly select Jt(βt) and Ht at each stage to balance cost and effectiveness (see the enumeration sketch below)
  Jt(βt) and the weak rankers Ht, Ht+1, Ht+2, … are inter-dependent
  To improve tractability, only Jt(βt) and Ht are considered jointly
 A generalization of effectiveness-centric models (AdaRank)
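A sketch of how the per-stage candidates might be enumerated for the boosting loop above: each candidate pairs a pruning function Jt (with a threshold βt) with a single-feature weak ranker Ht. The specific feature names and threshold grid are illustrative assumptions, not the configuration used in the paper:

```python
# Enumerate candidate <J_t(beta_t), H_t> pairs; the learner jointly picks the best pair
# at each stage instead of searching over whole sequences of future weak rankers.
# The lists below are assumptions for illustration.

from itertools import product

FEATURES = ["dirichlet", "bm25", "term_proximity"]   # single-feature weak rankers H_t
PRUNERS = ["rank", "score", "mean_max"]              # pruning function families J_t
BETAS = [0.1, 0.3, 0.5, 0.7]                         # pruning thresholds beta_t

def candidate_stages():
    """Return the candidate stage list that a learner such as learn_cascade above
    would search over when selecting each stage."""
    return [{"pruner": j, "beta": b, "feature": h}
            for j, b, h in product(PRUNERS, BETAS, FEATURES)]
```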
Experiment

 Search engine: Ivory
  First stage: standard inverted index
  Subsequent stages: filtering of candidate documents
 TREC web collections
  Gov2, Clue
 Comparison methods
  Query likelihood
  Cascade model
  Feature pruning-based (Wang et al., SIGIR 2010)
  AdaRank, effectiveness-centric (Xu et al., SIGIR 2007)
Effectiveness vs. Efficiency
Gov2 (25 million docs)

[Figure: bar charts of NDCG@20 (effectiveness) and average query execution time for QL, Cascade, Feature pruning, and AdaRank, annotated with relative differences of +0.1% and +0.6% in NDCG@20 and reductions of 49% and 45% in average query execution time]

Cascade is more efficient; effectiveness is not sacrificed
Tradeoff parameter = 0.1
Effectiveness vs. Efficiency
Clue (50 million docs)

[Figure: bar charts of NDCG@20 (effectiveness) and average query execution time for QL, Cascade, Feature pruning, and AdaRank, annotated with relative differences of -1.1% and +3.2% in NDCG@20 and reductions of 25% and 35% in average query execution time]

Cascade is more efficient; effectiveness is not sacrificed
Tradeoff parameter = 0.1
Analysis of the Cascade

[Table: per-stage breakdown (Stage 0 through Stage 3) of the cascade learned with tradeoff parameter = 0.1, showing the pruning function used, the fraction of documents filtered, and the filter loss at each stage]

 Filtered: input documents at each stage that are pruned by Jt
 Filter loss: relevant documents that are incorrectly pruned by Jt
 Pruning function Jt: S = score-based; R = rank-based; M = mean-max score-based
Varying Tradeoff Parameter

[Figure: effectiveness/efficiency tradeoff curves on Gov2 (25 million docs) and Clue (50 million docs) as the tradeoff parameter varies]

Cascade's NDCG improves faster when given more time
Conclusions

 We introduced a cascade ranking model for efficient ranked retrieval
 We proposed a novel boosting algorithm for learning the cascade
 Future work
  Other objective metrics
  Other pruning functions
  Different types of weak rankers
  Combining the cascade with orthogonal techniques such as query evaluation and caching, which may yield further improvements
Thanks!

 Code and results are reproducible from Ivory: http://ivory.cc
 I'm graduating next year and will be looking for a job
 Questions?