A Support Vector Method for Optimizing
Average Precision
SIGIR 2007
Yisong Yue
Cornell University
In Collaboration With:
Thomas Finley, Filip Radlinski, Thorsten Joachims
(Cornell University)
Motivation
• Learn to Rank Documents
• Optimize for IR performance measures
• Mean Average Precision
• Leverage Structural SVMs
[Tsochantaridis et al. 2005]
MAP vs Accuracy
• Average precision is the average of the precision scores at the rank
locations of each relevant document.
• Ex: a ranking with relevant documents at ranks 1, 3, and 5 has average precision
$\frac{1}{3}\left(\frac{1}{1} + \frac{2}{3} + \frac{3}{5}\right) \approx 0.76$
• Mean Average Precision (MAP) is the mean of the Average
Precision scores for a group of queries.
• A machine learning algorithm optimizing for Accuracy might learn a
very different model than optimizing for MAP.
• Ex: an alternative ranking has average precision of about 0.64, but a
maximum accuracy of 0.8, versus 0.6 for the ranking above.
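As a concrete check of the definition above, here is a minimal Python sketch (the function name and the 0/1-label input format are our own choices, not from the talk):

def average_precision(ranking):
    # ranking: 0/1 relevance labels in rank order, best first
    hits, total = 0, 0.0
    for k, rel in enumerate(ranking, start=1):
        if rel:
            hits += 1
            total += hits / k
    return total / hits if hits else 0.0

# Relevant documents at ranks 1, 3, and 5, as in the example above:
print(average_precision([1, 0, 1, 0, 1]))  # (1/1 + 2/3 + 3/5) / 3 = 0.756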
Recent Related Work
Greedy & Local Search
• [Metzler & Croft 2005] – optimized for MAP using gradient descent; expensive for a large number of features.
• [Caruana et al. 2004] – iteratively built an ensemble to greedily improve arbitrary performance measures.
Surrogate Performance Measures
• [Burges et al. 2005] – used neural nets optimizing for cross entropy.
• [Cao et al. 2006] – used SVMs optimizing for a modified ROC-Area.
Relaxations
• [Xu & Li 2007] – used boosting with an exponential loss relaxation.
Conventional SVMs
• Input examples denoted by x (high dimensional point)
• Output targets denoted by y (either +1 or -1)
• SVMs learn a hyperplane w; predictions are $\mathrm{sign}(w^T x)$
• Training involves finding the w which minimizes
$$\frac{1}{2}\|w\|^2 + \frac{C}{N}\sum_{i=1}^{N}\xi_i$$
• subject to
$$\forall i: \; y_i(w^T x_i) \ge 1 - \xi_i$$
• The sum of slacks $\sum_i \xi_i$ upper bounds the accuracy loss
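The last bullet is easy to verify numerically. Below is a minimal sketch on assumed toy data (variable names are ours): for any w, each misclassified point has slack at least 1, so the slack sum bounds the number of errors.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))        # toy inputs
y = rng.choice([-1, 1], size=20)    # toy labels
w = rng.normal(size=5)              # an arbitrary hyperplane

slacks = np.maximum(0.0, 1.0 - y * (X @ w))   # xi_i = max(0, 1 - y_i w^T x_i)
errors = np.sum(np.sign(X @ w) != y)          # 0/1 accuracy loss
assert slacks.sum() >= errors   # every misclassified point has slack >= 1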
Adapting to Average Precision
• Let x denote the set of documents/query examples for a query
• Let y denote a (weak) ranking (each $y_{ij} \in \{-1, 0, +1\}$)
• Same objective function:
$$\frac{1}{2}\|w\|^2 + \frac{C}{N}\sum_i \xi_i$$
• Constraints are defined for each incorrect labeling y' over the set of documents x:
$$\forall y' \ne y: \; w^T\Psi(y, x) \ge w^T\Psi(y', x) + \Delta(y, y') - \xi$$
• The joint discriminant score for the correct labeling must be at least as large as that of any incorrect labeling plus the performance loss.
Adapting to Average Precision
• Minimize
$$\frac{1}{2}\|w\|^2 + \frac{C}{N}\sum_i \xi_i$$
subject to
$$\forall y' \ne y: \; w^T\Psi(y, x) \ge w^T\Psi(y', x) + \Delta(y, y') - \xi$$
where
$$\Psi(y', x) = \sum_{i:\mathrm{rel}} \sum_{j:\mathrm{!rel}} y'_{ij}\,(x_i - x_j)$$
and
$$\Delta(y, y') = 1 - \mathrm{AvgPrec}(y')$$
• The sum of slacks $\sum \xi$ upper bounds the MAP loss.
• After learning w, a prediction is made by sorting on $w^T x_i$
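The definitions of Ψ and Δ translate directly into code. Here is a minimal sketch (function and argument names are our own; Y[i, j] holds the pairwise label y'_ij):

import numpy as np

def joint_feature_map(Y, X_rel, X_non):
    # Psi(y', x) = sum over (relevant i, non-relevant j) of y'_ij (x_i - x_j),
    # where y'_ij = +1 if relevant doc i is ranked above non-relevant doc j,
    # and -1 otherwise.
    psi = np.zeros(X_rel.shape[1])
    for i in range(len(X_rel)):
        for j in range(len(X_non)):
            psi += Y[i, j] * (X_rel[i] - X_non[j])
    return psi

def map_loss(avgprec_of_y_prime):
    # Delta(y, y') = 1 - AvgPrec(y')
    return 1.0 - avgprec_of_y_prime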
Too Many Constraints!
• For Average Precision, the true labeling is a ranking where the
relevant documents are all ranked in the front.
• An incorrect labeling would be any other ranking, e.g., one that
interleaves a non-relevant document among the relevant ones.
• Such an incorrect ranking can have Average Precision of about 0.8, with $\Delta(y, y') \approx 0.2$.
• Exponential number of rankings, thus an exponential number
of constraints!
Structural SVM Training
• STEP 1: Solve the SVM objective function using only the current
working set of constraints.
• STEP 2: Using the model learned in STEP 1, find the most violated
constraint from the exponential set of constraints.
• STEP 3: If the constraint returned in STEP 2 is more violated than
the most violated constraint in the working set by some small constant,
add that constraint to the working set.
• Repeat STEP 1-3 until no additional constraints are added. Return
the most recent model that was trained in STEP 1.
STEPS 1-3 are guaranteed to loop for at most a polynomial number of
iterations. [Tsochantaridis et al. 2005]
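The loop above is compact enough to sketch in Python. The subroutines solve_qp, find_most_violated, and violation are hypothetical placeholders for the QP solver of STEP 1, the oracle of STEP 2, and the quantity Delta(y, y') + w·Psi(y', x) - w·Psi(y, x):

def cutting_plane_train(examples, eps=1e-3):
    # examples: list of (x, y) pairs; eps: the "small constant" of STEP 3
    working_set = []
    w, xi = solve_qp(working_set)              # STEP 1 (on the empty set)
    while True:
        grew = False
        for n, (x, y) in enumerate(examples):
            y_bad = find_most_violated(w, x, y)          # STEP 2
            if violation(w, x, y, y_bad) > xi[n] + eps:  # STEP 3
                working_set.append((n, y_bad))
                grew = True
        if not grew:
            return w        # no constraint added: converged
        w, xi = solve_qp(working_set)          # re-solve STEP 1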
Illustrative Example
Original SVM Problem
• Exponential constraints
• Most are dominated by a small set of "important" constraints
Structural SVM Approach
• Repeatedly finds the next most violated constraint…
• …until the set of constraints is a good approximation.
Finding Most Violated Constraint
• Structural SVM is an oracle framework.
• Requires subroutine to find the most violated constraint.
– Dependent on formulation of loss function and joint feature
representation.
• Exponential number of constraints!
• Efficient algorithm in the case of optimizing MAP.
Finding Most Violated Constraint
$$H(y'; w) = \Delta(y, y') + \sum_{i:\mathrm{rel}} \sum_{j:\mathrm{!rel}} y'_{ij}\,(w^T x_i - w^T x_j)$$
Observation
• MAP is invariant to the order of documents within a relevance class
– Swapping two relevant or two non-relevant documents does not change MAP.
• The joint SVM score is maximized by sorting documents by score, $w^T x$
• Reduces to finding an interleaving
between two sorted lists of documents
Finding Most Violated Constraint
$$H(y'; w) = \Delta(y, y') + \sum_{i:\mathrm{rel}} \sum_{j:\mathrm{!rel}} y'_{ij}\,(w^T x_i - w^T x_j)$$
• Start with perfect ranking
• Consider swapping adjacent
relevant/non-relevant documents
• Find the best feasible ranking of the
non-relevant document
• Repeat for next non-relevant
document
• Never want to swap past previous
non-relevant document
• Repeat until all non-relevant
documents have been considered
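Putting the steps above together gives the following Python sketch of the greedy search. It assumes the unnormalized Ψ from the earlier slide (the published formulation may scale it) and takes as input the scores w^T x of the relevant and non-relevant documents; all names are our own:

import numpy as np

def most_violated_ranking(rel_scores, non_scores):
    # Returns, for each non-relevant document (in descending score order),
    # how many relevant documents end up ranked above it in the y'
    # maximizing H(y'; w).
    r = np.sort(np.asarray(rel_scores))[::-1]   # relevant scores, descending
    s = np.sort(np.asarray(non_scores))[::-1]   # non-relevant scores, descending
    R = len(r)
    ks = np.arange(1, R + 1)
    positions, prev_p = [], 0
    for j, sj in enumerate(s, start=1):
        # delta[k-1]: gain in H from moving this document above relevant doc k,
        # given the j-1 higher-scoring non-relevant docs already sit above it:
        # the increase in MAP loss plus the flipped pairwise score terms.
        delta = (ks / (ks + j - 1) - ks / (ks + j)) / R + 2.0 * (sj - r[ks - 1])
        # benefit[p] = sum of delta over k = p+1..R (suffix sums; benefit[R] = 0)
        benefit = np.append(np.cumsum(delta[::-1])[::-1], 0.0)
        # never swap past the previously placed non-relevant document
        p = prev_p + int(np.argmax(benefit[prev_p:]))
        positions.append(p)
        prev_p = p
    return positions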
Quick Recap
SVM Formulation
• SVMs optimize a tradeoff between model complexity and MAP loss
• Exponential number of constraints (one for each incorrect ranking)
• Structural SVMs find a small subset of important constraints
• Requires sub-procedure to find most violated constraint
Find Most Violated Constraint
• Loss function invariant to re-ordering of relevant documents
• SVM score imposes an ordering of the relevant documents
• Finding interleaving of two sorted lists
• Loss function has certain monotonic properties
• Efficient algorithm
Experiments
• Used TREC 9 & 10 Web Track corpus.
• Features of document/query pairs computed
from outputs of existing retrieval functions.
(Indri Retrieval Functions & TREC Submissions)
• Goal is to learn a recombination of outputs which
improves mean average precision.
Comparison with Best Base Methods
[Figure: bar chart of Mean Average Precision for SVM-MAP vs. Base 1, Base 2, and Base 3 across six datasets: TREC 9 Indri, TREC 10 Indri, TREC 9 Submissions, TREC 10 Submissions, and TREC 9/10 Submissions without the best method.]
Comparison with Other SVM Methods
[Figure: bar chart of Mean Average Precision for SVM-MAP vs. SVM-ROC, SVM-ACC, SVM-ACC2, SVM-ACC3, and SVM-ACC4 across the same six datasets.]
Moving Forward
• Approach also works (in theory) for other measures.
• Some promising results when optimizing for NDCG
(with only 1 level of relevance).
• Currently working on optimizing for NDCG with multiple
levels of relevance.
• Preliminary MRR results not as promising.
Conclusions
• Principled approach to optimizing average precision
(avoids difficult-to-control heuristics)
• Performs at least as well as alternative SVM methods.
• Can be generalized to a large class of rank-based
performance measures.
• Software available at http://svmrank.yisongyue.com