Interactively Optimizing Information Retrieval Systems as a Dueling Bandits Problem
ICML 2009
Yisong Yue and Thorsten Joachims, Cornell University

Learning To Rank
• Supervised learning problem
  – Extension of classification/regression
  – Relatively well understood
  – Highly applicable in information retrieval
• Requires explicitly labeled data
  – Expensive to obtain
  – Expert-judged labels do not necessarily equal search-user utility
  – Does not generalize to other search domains

Our Contribution
• Learn from implicit feedback (users' clicks)
  – Reduces labeling cost
  – More representative of end-user information needs
• Learn using pairwise comparisons
  – Humans are more adept at making pairwise judgments
  – Via interleaving [Radlinski et al., 2008]
• Online framework (the Dueling Bandits Problem)
  – Leverages users when exploring new retrieval functions
  – Exploration vs. exploitation tradeoff (regret)

Team-Game Interleaving (u = thorsten, q = "svm")

Ranking r1 = f1(u, q):
  1. Kernel Machines
     http://svm.first.gmd.de/
  2. Support Vector Machine
     http://jbolivar.freeservers.com/
  3. An Introduction to Support Vector Machines
     http://www.support-vector.net/
  4. Archives of SUPPORT-VECTOR-MACHINES ...
     http://www.jiscmail.ac.uk/lists/SUPPORT...
  5. SVM-Light Support Vector Machine
     http://ais.gmd.de/~thorsten/svm light/

Ranking r2 = f2(u, q):
  1. Kernel Machines
     http://svm.first.gmd.de/
  2. SVM-Light Support Vector Machine
     http://ais.gmd.de/~thorsten/svm light/
  3. Support Vector Machine and Kernel ... References
     http://svm.research.bell-labs.com/SVMrefs.html
  4. Lucent Technologies: SVM demo applet
     http://svm.research.bell-labs.com/SVT/SVMsvt.html
  5. Royal Holloway Support Vector Machine
     http://svm.dcs.rhbnc.ac.uk

Interleaving(r1, r2):
  1. Kernel Machines (T2)
     http://svm.first.gmd.de/
  2. Support Vector Machine (T1)
     http://jbolivar.freeservers.com/
  3. SVM-Light Support Vector Machine (T2)
     http://ais.gmd.de/~thorsten/svm light/
  4. An Introduction to Support Vector Machines (T1)
     http://www.support-vector.net/
  5. Support Vector Machine and Kernel ... References (T2)
     http://svm.research.bell-labs.com/SVMrefs.html
  6. Archives of SUPPORT-VECTOR-MACHINES ... (T1)
     http://www.jiscmail.ac.uk/lists/SUPPORT...
  7. Lucent Technologies: SVM demo applet (T2)
     http://svm.research.bell-labs.com/SVT/SVMsvt.html

Invariant: for every k, the top k contains, in expectation, the same number of results from each team.
Interpretation: (r2 ≻ r1) ↔ clicks(T2) > clicks(T1)
[Radlinski, Kurup, Joachims; CIKM 2008]

Dueling Bandits Problem
• Continuous space of bandits F
  – E.g., the parameter space of retrieval functions (weight vectors)
• Each time step compares two bandits
  – E.g., an interleaving test on two retrieval functions
  – Comparisons are noisy and independent
• Choose the pair (f_t, f_t') to minimize regret:

    R_T = Σ_{t=1..T} [ P(f* ≻ f_t) + P(f* ≻ f_t') − 1 ]

  i.e., the fraction of users who would prefer the best bandit f* over the chosen pair.

Examples of per-step regret:
• P(f* ≻ f) = 0.9,  P(f* ≻ f') = 0.8  → regret 0.7
• P(f* ≻ f) = 0.7,  P(f* ≻ f') = 0.6  → regret 0.3
• P(f* ≻ f) = 0.51, P(f* ≻ f') = 0.55 → regret 0.06

Modeling Assumptions
• Each bandit f ∈ F has an intrinsic value v(f)
  – Never observed directly
  – v(f) is assumed strictly concave, so the best bandit f* is unique
• Comparisons are based on v(f) through a link function σ:
  – P(f ≻ f') = σ( v(f) − v(f') )
  – σ is L-Lipschitz
  – For example, the logistic function σ(x) = 1 / (1 + exp(−x))

Probability Functions
(Figure: plots of example link functions σ.)

Dueling Bandit Gradient Descent
• Maintain a current point f_t
  – Compare it against a candidate f_t' close to f_t (distance set by the explore step size)
  – Update toward f_t' if it wins the comparison
• The expected update is close to the gradient of P(f_t ≻ f')
  – Builds on Bandit Gradient Descent [Flaxman et al., 2005]
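The per-step regret and the logistic link above can be checked with a few lines of Python; function names here are mine, for illustration only:

```python
import math

def step_regret(p_best_vs_f, p_best_vs_fprime):
    """Per-step regret: P(f* > f_t) + P(f* > f_t') - 1, the fraction of
    users who would have preferred the best bandit f* over the chosen pair."""
    return p_best_vs_f + p_best_vs_fprime - 1.0

def logistic(x):
    """Example link function sigma(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

# The three examples from the slides:
print(round(step_regret(0.9, 0.8), 2))    # 0.7
print(round(step_regret(0.7, 0.6), 2))    # 0.3
print(round(step_regret(0.51, 0.55), 2))  # 0.06

# The link is symmetric around 0: comparing a bandit against itself is a coin flip.
print(logistic(0.0))                      # 0.5
```

Note that a uniformly random pair far from f* pushes each probability toward 1, so per-step regret approaches 1, matching the "random point has regret almost 1" observation later in the deck.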
Dueling Bandit Gradient Descent (illustration)
(Figure: a sequence of update steps showing the current point, a losing candidate, and a winning candidate; δ is the explore step size, γ the exploit step size.)

Analysis (Sketch)
• Dueling Bandit Gradient Descent
  – Sequence of partially convex functions c_t(f) = P(f_t ≻ f)
  – Random binary updates whose expectation is close to the gradient
• Bandit Gradient Descent [Flaxman et al., SODA 2005]
  – Sequence of convex functions
  – Randomized updates whose expectation is close to the gradient
  – Can be extended to our setting (but assumes more information)

Analysis (Sketch)
• Convex functions satisfy
    c(x) − c(x*) ≤ ∇c(x) · (x − x*)
• Our updates have both additive and multiplicative error
  – The error depends on the exploration step size δ
  – Main analytical contribution: bounding the multiplicative error

Regret Bound
• Regret grows as O(T^{3/4}):
    E[R_T] ≤ 2 T^{3/4} √(10 R d L)
  (R: radius of F, d: dimension, L: Lipschitz constant of σ)
• Average regret R_T / T shrinks as O(T^{−1/4})
  – In the limit, we do as well as knowing f* in hindsight
• Step sizes: δ = O(T^{−1/4}), γ = O(T^{−1/2})

Practical Considerations
• The step size parameters must be set
  – The right values depend on P(f ≻ f')
• They cannot be set optimally
  – We do not know the specifics of P(f ≻ f')
  – The algorithm should be robust to parameter settings
• In the experiments, parameters are set approximately

Regret Comparison: DBGD vs. BGD
(Figure: average regret over roughly 10,000 iterations for DBGD and two BGD variants.)
• 50-dimensional parameter space
• Value function v(x) = −xᵀx
• Logistic transfer function
• A random point has regret of almost 1
• More experiments in the paper

Web Search Simulation
• Leverage a web search dataset
  – 1000 training queries, 367 dimensions
• Simulate "users" issuing queries
  – Value function based on NDCG@10 (a ranking measure)
  – A logistic link makes comparisons probabilistic
• Use a linear ranking function
• Not intended to compete with supervised learning
  – Feasibility check for online learning with users
  – Supervised labels are difficult to acquire "in the wild"

Web Simulation Results
(Figure: training NDCG@10 over roughly 10 million queries for DBGD sampling 1, 10, or 100 queries per update, against a Ranking SVM baseline.)
• Parameters chosen by best final performance
• Curves are essentially identical on validation and test sets (no over-fitting)
• Sampling multiple queries per update makes no difference
What Next?
• Better simulation environments
  – More realistic user modeling assumptions
• DBGD is simple and extensible
  – Incorporate pairwise document preferences
  – Deal with ranking discontinuities
• Test on real search systems
  – Varying scales of user communities
  – Provides insight and guides future development

Extra Slides

Active vs. Passive Learning
• Passive data collection (offline)
  – Biased by the current retrieval function
• Point-wise evaluation
  – Design the retrieval function offline
  – Evaluate online
• Active learning (online)
  – Automatically propose new rankings to evaluate
  – Our approach

Relative vs. Absolute Metrics
• Our framework is based on relative metrics
  – E.g., comparing pairs of results or rankings
  – A relatively recent development
• Absolute metrics
  – E.g., absolute click-through rate
  – More common in the literature
  – Suffer from presentation bias
  – Less robust to the many different sources of noise

What Results do Users View/Click?
(Figure: for each rank 1-11, the number of times the result was selected and the mean time spent in its abstract. [Joachims et al., TOIS 2007])
(Figure: time spent in each result by frequency of document selected.)

Analysis (Sketch)
• Convex functions satisfy
    c(x) − c(x*) ≤ ∇c(x) · (x − x*)
  – We have both multiplicative and additive error
  – The error depends on the exploration step size δ
  – Main technical contribution: bounding the multiplicative error
• Existing results yield sub-linear bounds on
    E[ Σ_{t=1..T} ( P(f_t ≻ f_t') − P(f_t ≻ f*) ) ]

Analysis (Sketch)
• We know how to bound E[ Σ_{t=1..T} ( P(f_t ≻ f_t') − P(f_t ≻ f*) ) ]
• Regret: R_T = Σ_{t=1..T} [ P(f* ≻ f_t) + P(f* ≻ f_t') − 1 ]
• Using the Lipschitz property and the symmetry of σ, we can show
    E[R_T] ≤ 2 E[ Σ_{t=1..T} ( P(f_t ≻ f_t') − P(f_t ≻ f*) ) ] + δLT

More Simulation Experiments
• Logistic transfer function σ(x) = 1 / (1 + exp(−x))
• Four choices of value function
• δ, γ set approximately

NDCG
• Normalized Discounted Cumulative Gain
• Handles multiple levels of relevance
• DCG
  – Contribution of the i-th rank position: (2^{y_i} − 1) / log(i + 1)
  – Example: gains 1, 3, 1, 0, 1 give DCG = 1/log(2) + 3/log(3) + 1/log(4) + 0/log(5) + 1/log(6) ≈ 5.45
• NDCG is normalized DCG
  – The best possible ranking has score NDCG = 1

Considerations
• NDCG is discontinuous w.r.t. the function parameters
  – Try larger values of δ, γ
  – Try sampling multiple queries per update
• Homogeneous user values (NDCG@10)
  – Not an optimization concern
  – A modeling limitation
• Not intended to compete with supervised learning
  – A sanity check of the feasibility of online learning with users
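The DCG example from the NDCG slide can be verified directly; the sketch below uses natural logarithms, which reproduce the 5.45 figure, and relevance labels y with gain 2^y − 1:

```python
import math

def dcg(relevances):
    """DCG with gain (2^y - 1) and discount log(i + 1), ranks i starting at 1."""
    return sum((2 ** y - 1) / math.log(i + 1)
               for i, y in enumerate(relevances, start=1))

def ndcg(relevances):
    """NDCG: DCG normalized by the best possible (sorted) ranking, so that
    a perfect ranking scores exactly 1."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

labels = [1, 2, 1, 0, 1]      # gains (2^y - 1) = 1, 3, 1, 0, 1 as on the slide
print(round(dcg(labels), 2))  # 5.45
```

Swapping the rank-2 and rank-1 items moves the largest gain to the smallest discount, raising DCG; this sensitivity to position swaps, together with hard rank cutoffs like NDCG@10, is what makes NDCG discontinuous in the ranking-function parameters.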