
Learning to Rank with Ties
Authors: Ke Zhou, Gui-Rong Xue, Hongyuan Zha, and Yong Yu
Presenter: Davidson
Date: 2009/12/29
Published in SIGIR 2008
Contents
 Introduction
 Pairwise learning to rank
 Models for paired comparisons with ties
    General linear models
    Bradley-Terry model
    Thurstone-Mosteller model
 Loss functions
 Learning by functional gradient boosting
 Experiments
 Conclusions and future work
2/24
Introduction
 Learning to rank:
    Ranking objects in response to queries
    Applications: document retrieval, expert finding, web spam detection, product rating, etc.
 Learning to rank methods:
    Pointwise approach
    Pairwise approach
    Listwise approach
3/24
Existing approaches (1/2)
 Vector space model methods
    Represent the query and documents as vectors of features
    Compute the distance between vectors as a similarity measure
 Language modeling based methods
    Use a probabilistic framework for the relevance of a document with respect to a query
    Estimate the parameters of the probability models
4/24
Existing approaches (2/2)
 Supervised machine learning framework
    Learn a ranking function from pairwise preference data
    Minimize the number of contradicting pairs in the training data
 Direct optimization of a loss function designed from a performance measure
    Obtain a ranking function that is optimal with respect to some performance measure
    Requires absolute judgment data
5/24
Pairwise learning to rank
 Uses pairs of preference data
    x ≻ y means x is preferred to y
 How to obtain preference judgments?
    Clickthroughs
       User click counts on search results
       Use heuristic rules such as clickthrough rates
    Absolute relevance judgments
       Human labels
       E.g. (4-level judgments): perfect, excellent, good, and bad
       Need to convert the judgments to preference data
6/24
Conversion from preference judgments to preference data
 E.g. 4 samples with 2-level judgments:

   Data   Judgment
   A      Relevant
   B      Relevant
   C      Irrelevant
   D      Irrelevant

 Preference data: (A, C), (A, D), (B, C), (B, D)
 Tie cases (A, B) and (C, D) are ignored!
7/24
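To make this conversion concrete, here is a minimal Python sketch (my own illustration, not code from the paper) that derives both preference pairs and tie pairs from graded labels:

```python
from itertools import combinations

def judgments_to_pairs(labels):
    """Convert absolute judgments {doc: grade} into preference
    and tie pairs. A higher grade means more relevant."""
    prefs, ties = [], []
    for (a, ga), (b, gb) in combinations(labels.items(), 2):
        if ga > gb:
            prefs.append((a, b))      # a is preferred to b
        elif gb > ga:
            prefs.append((b, a))      # b is preferred to a
        else:
            ties.append((a, b))       # equal grades: a tie
    return prefs, ties

# The 4-sample, 2-level example from the slide:
labels = {"A": 1, "B": 1, "C": 0, "D": 0}
prefs, ties = judgments_to_pairs(labels)
print(prefs)  # [('A', 'C'), ('A', 'D'), ('B', 'C'), ('B', 'D')]
print(ties)   # [('A', 'B'), ('C', 'D')] -- kept here rather than ignored
```

The point of the paper is precisely that the second list, which conventional pairwise methods throw away, carries usable training signal.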
Models for paired comparisons with ties
 Notations:
    x ≻ y  =>  x is preferred to y
    x ≈ y  =>  x and y are preferred equally (a tie)
 Proposed framework
    General linear models for paired comparisons (Oxford University Press, 1988)
    Statistical models for paired comparisons
       Bradley-Terry model (Biometrika, 1952)
       Thurstone-Mosteller model (Psychological Review, 1927)
8/24
General linear models (1/4)
 For a non-decreasing function F: R → R with
      F(+∞) = 1,  F(−∞) = 0,  F(−x) = 1 − F(x)
 The probability that document x is preferred to document y is:
      P(x ≻ y) = F(h(x) − h(y))
   where h(·) is the scoring/ranking function
9/24
General linear models (2/4)
 With ties, the model becomes:
      P(x ≻ y) = F(h(x) − h(y) − ε)
      P(x ≈ y) = F(h(x) − h(y) + ε) − F(h(x) − h(y) − ε)
    ε = a threshold that controls the tie probability
    When ε = 0, this model is identical to the original general linear model
10/24
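A minimal sketch of these two formulas (my own illustration; the logistic function below is just one admissible choice of F, anticipating the Bradley-Terry slide):

```python
import math

def F(x):
    """Logistic CDF: one admissible non-decreasing F with
    F(-inf)=0, F(+inf)=1, and F(-x) = 1 - F(x)."""
    return 1.0 / (1.0 + math.exp(-x))

def pair_probs(hx, hy, eps):
    """General linear model with ties, given scores hx, hy
    and a tie threshold eps >= 0."""
    p_x_wins = F(hx - hy - eps)
    p_tie = F(hx - hy + eps) - F(hx - hy - eps)
    p_y_wins = 1.0 - p_x_wins - p_tie   # = F(hy - hx - eps) by symmetry
    return p_x_wins, p_tie, p_y_wins

print(pair_probs(1.0, 0.5, eps=0.0))  # eps = 0 recovers the tie-free model
print(pair_probs(1.0, 0.5, eps=0.3))  # a positive eps allocates mass to ties
```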
General linear models (3/4)
 [Figure: plot of F(x) without ties, showing the regions corresponding to P(y ≻ x) and P(x ≻ y).]
11/24
General linear models (4/4)
 [Figure: plot of F(x) with tie threshold ε; the band of width 2ε between the regions P(y ≻ x) and P(x ≻ y) gives the tie probability.]
12/24
Bradley-Terry model
 The function is set to be
      F(x) = 1 / (1 + exp(−x))
   so that
      P(x ≻ y) = e^h(x) / (e^h(x) + e^h(y))
 With ties:
      P(x ≻ y) = e^h(x) / (e^h(x) + θ·e^h(y))
      P(x ≈ y) = (θ² − 1)·e^h(x)·e^h(y) / [(e^h(x) + θ·e^h(y))·(e^h(y) + θ·e^h(x))]
   where θ = e^ε
13/24
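As a quick numerical check (my own sketch, with hypothetical scores), the three Bradley-Terry probabilities can be computed directly and verified to sum to one:

```python
import math

def bt_ties(hx, hy, eps):
    """Bradley-Terry model with ties: theta = e^eps."""
    theta = math.exp(eps)
    ex, ey = math.exp(hx), math.exp(hy)
    p_x = ex / (ex + theta * ey)
    p_y = ey / (ey + theta * ex)
    p_tie = (theta**2 - 1) * ex * ey / ((ex + theta * ey) * (ey + theta * ex))
    return p_x, p_tie, p_y

p_x, p_tie, p_y = bt_ties(1.0, 0.5, eps=0.3)
print(p_x, p_tie, p_y, p_x + p_tie + p_y)  # the three outcomes sum to 1
```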
Thurstone-Mosteller model
 The function is set to be the Gaussian cumulative distribution
      Φ(x) = (1/√(2π)) ∫_{−∞}^{x} e^(−t²/2) dt
 With ties:
      P(x ≻ y) = Φ(h(x) − h(y) − ε)
      P(x ≈ y) = Φ(h(x) − h(y) + ε) − Φ(h(x) − h(y) − ε)
14/24
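The same probabilities under the Thurstone-Mosteller model, sketched with the standard normal CDF expressed via the error function (again my illustration, not the authors' code):

```python
import math

def phi(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def tm_ties(hx, hy, eps):
    """Thurstone-Mosteller model with ties."""
    d = hx - hy
    p_x = phi(d - eps)
    p_tie = phi(d + eps) - phi(d - eps)
    p_y = 1.0 - p_x - p_tie            # = phi(-d - eps) by symmetry
    return p_x, p_tie, p_y

print(tm_ties(1.0, 0.5, eps=0.3))
```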
Loss functions
 Training data:
      {x_i ≻ y_i}, i = 1, ..., N          (N preference pairs)
      {x_i ≈ y_i}, i = N+1, ..., N+M      (M tie pairs)
 Minimize the empirical risk:
      R̂[h] = Σ_{i=1..N} L(h, x_i ≻ y_i) + Σ_{i=N+1..N+M} L(h, x_i ≈ y_i)
 Loss function (negative log-likelihood):
      L(h, x_i ≻ y_i) = −log P(x_i ≻ y_i)
      L(h, x_i ≈ y_i) = −log P(x_i ≈ y_i)
15/24
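Transcribing the empirical risk directly, assuming the Bradley-Terry tie model from the previous slides (a sketch with a hypothetical scoring function h; eps must be positive so that tie probabilities are nonzero):

```python
import math

def empirical_risk(h, prefs, ties, eps):
    """Negative log-likelihood over N preference pairs and M tie
    pairs under the Bradley-Terry tie model. h maps a document
    to its score; eps > 0 controls the tie probability."""
    theta = math.exp(eps)
    risk = 0.0
    for x, y in prefs:                       # x preferred to y
        ex, ey = math.exp(h(x)), math.exp(h(y))
        risk -= math.log(ex / (ex + theta * ey))
    for x, y in ties:                        # x and y tied
        ex, ey = math.exp(h(x)), math.exp(h(y))
        p_tie = (theta**2 - 1) * ex * ey / ((ex + theta * ey) * (ey + theta * ex))
        risk -= math.log(p_tie)
    return risk
```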
Learning by functional gradient boosting
 Obtain a function h* from a function space H that minimizes the empirical loss:
      h* = argmin_{h ∈ H} R̂[h]
 Apply the gradient boosting algorithm
    Approximate h* by iteratively constructing a sequence of base learners
    Base learners are regression trees
    The number of iterations and the shrinkage factor of the boosting algorithm are found by cross validation
16/24
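Below is a compact sketch of this training loop, my own reconstruction rather than the authors' code: it assumes the Bradley-Terry tie loss, uses scikit-learn regression trees as base learners, and fixes the shrinkage and iteration count that the paper tunes by cross validation.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_rank_boost(X, prefs, ties, eps=0.3, n_iters=100, shrinkage=0.1, depth=3):
    """Functional gradient boosting for ranking with ties.
    X: (n_docs, n_features); prefs/ties: lists of (i, j) document
    index pairs, i preferred to j for prefs, i tied with j for ties."""
    theta = np.exp(eps)
    scores = np.zeros(X.shape[0])          # current h(x) for every document
    trees = []
    for _ in range(n_iters):
        g = np.zeros_like(scores)          # gradient of the empirical risk
        e = np.exp(scores)
        for i, j in prefs:                 # L = -log P(x_i > x_j)
            p = e[i] / (e[i] + theta * e[j])
            g[i] += p - 1.0
            g[j] += 1.0 - p
        for i, j in ties:                  # L = -log P(x_i ~ x_j)
            g[i] += (-1.0 + e[i] / (e[i] + theta * e[j])
                          + theta * e[i] / (e[j] + theta * e[i]))
            g[j] += (-1.0 + e[j] / (e[j] + theta * e[i])
                          + theta * e[j] / (e[i] + theta * e[j]))
        tree = DecisionTreeRegressor(max_depth=depth)
        tree.fit(X, -g)                    # fit base learner to negative gradient
        scores += shrinkage * tree.predict(X)
        trees.append(tree)
    return trees

def rank_score(trees, X, shrinkage=0.1):
    """Score new documents by summing the boosted trees."""
    return shrinkage * sum(t.predict(X) for t in trees)
```

Each iteration fits a regression tree to the pointwise negative gradient of the risk, so documents involved in many violated pairs receive the largest corrections.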
Experiments
 Learning to rank methods:
    BT (Bradley-Terry model) and TM (Thurstone-Mosteller model)
    BT-noties and TM-noties (BT and TM without ties)
    RankSVM, RankBoost, AdaRank, FRank
 Datasets (LETOR data collection)
    OHSUMED (16,140 pairs, 3-level judgments)
    TREC2003 (49,171 pairs, binary judgments)
    TREC2004 (74,170 pairs, binary judgments)
17/24
Performance measures
 Precision
    Binary judgments only
 Mean Average Precision (MAP)
    Binary judgments only
    Sensitive to the entire ranking order
 Normalized Discounted Cumulative Gain (NDCG)
    Multi-level judgments
18/24
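For reference, a minimal NDCG sketch using one common gain/discount convention (the LETOR evaluation scripts may differ in details such as the discount base):

```python
import math

def dcg_at_k(relevances, k):
    """DCG@k with the common (2^rel - 1) / log2(pos + 1) convention."""
    return sum((2**rel - 1) / math.log2(i + 2)
               for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k):
    """NDCG@k: DCG of the ranking divided by the DCG of the ideal ranking."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

print(ndcg_at_k([2, 0, 1, 2, 0], k=5))  # graded (multi-level) judgments
```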
Performance comparisons over OHSUMED
19/24
Performance comparisons over TREC2003
20/24
Performance comparisons over TREC2004
21/24
Performance comparison: ties from different relevance levels
22/24
Learning curves of the Bradley-Terry model
23/24
Conclusions and future work
 Conclusions
    Tie data improve ranking performance
    Ties among relevant documents help: the common features of relevant documents are extracted
    Ties among irrelevant documents are less effective, since irrelevant documents have more diverse features
    BT and TM are comparable in most cases
 Future work
    Theoretical analysis of ties
    New methods/algorithms/loss functions for incorporating ties
24/24