Generic Exploration and K-armed Voting Bandits

Tanguy Urvoy, Fabrice Clerot, Raphael Féraud, Sami Naamane
Orange-labs, 2 avenue Pierre Marzin, 22307 Lannion, FRANCE

Proceedings of the 30th International Conference on Machine Learning, Atlanta, Georgia, USA, 2013. JMLR: W&CP volume 28. Copyright 2013 by the author(s).

Abstract

We study a stochastic online learning scheme with partial feedback, where the utility of decisions is only observable through an estimation of the environment parameters. We propose a generic pure-exploration algorithm, able to cope with various utility functions, from multi-armed bandit settings to dueling bandits. The primary application of this setting is to offer a natural generalization of dueling bandits to situations where the environment parameters reflect the idiosyncratic preferences of a mixed crowd.

1. Introduction

The stochastic multi-armed bandit became popular as a stripped-down model of the exploration versus exploitation balance in sequential decision problems. In its simplest formulation, we are facing a slot machine with several arms. The rewards of these arms are modeled by unknown but bounded and independent random variables. To maximize our long-term reward, we would like to play an arm with maximal expected value, but we need to explore all the arms efficiently in order to find it.

The cost of ignorance is traditionally expressed in terms of expected regret: the expected difference of reward between a playing policy established with perfect knowledge of the environment parameters and a given "unaware" policy. Following Lai & Robbins (1985), several regret analyses have been proposed (see for instance Auer et al., 2002; Audibert et al., 2008; Auer & Ortner, 2011).

Another way to evaluate bandit algorithms is to consider pure-exploration or PAC sample complexity: how to find a nearly optimal arm with high confidence in a minimum number of trials? For multi-armed bandits, several algorithms were already proposed and studied from this perspective in (Even-Dar et al., 2002; 2006). We can also reverse the PAC question and control the prediction accuracy after a fixed number of samples, as in (Audibert et al., 2010; Bubeck et al., 2011). When the number of trials is bounded by a known horizon, we can adopt an explore-then-exploit strategy to control the final regret with an efficient exploration algorithm (Yue et al., 2012). This is the perspective we adopt here.

1.1. Rigged bandits

As an introduction to our generic exploration setting, we consider the rigged bandits problem, a simplified model of click fraud in online advertising. In this variant of multi-armed bandits we know from a reliable source (the barmaid of the casino) that the m best arms of the slot machine have been rigged and only deliver counterfeit money. To maximize our gain, we want to design a sequence of experiments in order to determine and play the (m+1)-th best arm as early as possible while avoiding the rigged arms.

The main characteristic of this problem lies in the absence of a direct utility feedback: to estimate our real income we need to know with enough confidence which arms were rigged. This problem also requires what we call a generic exploration policy: we do not want to design a new exploration algorithm for each possible fraud-detection criterion we have at hand (see Section 5.1 for a formalized example).

1.2. Dueling bandits
The dueling bandit problem, introduced by Yue & Joachims (2009) to formalize online learning from preference feedback, shares the indirect (or parametric) feedback property with rigged bandits. The initial motivation to depart from the absolute-reward model came from information retrieval evaluation, where the implicit feedback provided by click logs is strongly biased by the ranking itself. A solution was proposed by Joachims (2003) to circumvent this problem: by interleaving two ranking models and checking where the user clicked, one obtains an unbiased, but pairwise, preference feedback. Further experiments were performed in (Chapelle et al., 2012).

The original definition of the dueling bandits problem (Yue et al., 2012; Yue & Joachims, 2011) was built upon strong assumptions about the preference matrix: existence of a strict linear ordering, stochastic transitivity, and stochastic triangular inequality (see Yue & Joachims, 2011). An extension of this setting with restricted pairing was proposed by Di Castro et al. (2011), but this extension also assumes the preference matrix to be the byproduct of an inherent value for each arm. In a situation where the preferences reflect the expression of a mixed crowd, there can be several inconsistencies or voting paradoxes which contradict these assumptions.

The definition we propose here is more relaxed: we do not assume the existence of a perfect linear order, nor do we assume the existence of an inherent value of the arms. We simply try to sample the preference matrix efficiently in order to propose a "best element" similar to the one we would choose with perfect knowledge of the crowd preferences. Electing a "best element" or a "best linear ordering" from such a preference matrix is a tough but old and well-studied problem (see Charon & Hudry, 2010, for a survey), but works on online and noisy versions of this problem are scarce (see however Ravikumar et al., 1987; Feige et al., 1994, for related problems). If we change the election criterion according to the vast social choice theory (see Chevaleyre et al., 2007, for a survey), we can derive several unexplored flavors of dueling bandits: for instance Borda bandits, Copeland bandits, Slater bandits, or Kemeny bandits.

1.3. Toward generic exploration algorithms

The traditional approach to exotic sequential decision problems is to design tailor-made algorithms which handle simultaneously the exploration of the environment and the exploitation of its knowledge. One purpose of this article is to explore the possibility of a generic algorithm which automatically generates an efficient exploration policy for any given decision criterion. We propose a generic algorithm with theoretical guarantees for parametric decision problems and we evaluate its performance on relatively simple variants of rigged and dueling bandits.

2. Main Problem Statement

Consider a stationary environment modeled by a vector of N unknown parameters $\mu = (\mu_1, \dots, \mu_N) \in [0,1]^N$ (by convention, we write vectors in bold face). We have a noisy perception of $\mu$, modeled by a vector $X$ of N independent random variables $X_i \in [0,1]$ verifying $E[X_i] = \mu_i$ for each index $i = 1, \dots, N$. The only hint we may have about $\mu$ is a set of feasible environment configurations $F \subseteq [0,1]^N$. Let $D$ be a set of decisions and $U : D \times F \to \mathbb{R}^+$ be a given utility function.
From this utility function we can derive a decision function $f : F \to D$ which computes an optimal option $f(x) \in \arg\max_d U(d, x)$ for each feasible realization $x$ of the random vector $X$. The parametric decision problem consists in finding the best decision $d^* := f(\mu)$ with high probability after a minimum number of samples of the environment parameters. If we have a metric $L : D \times D \to \mathbb{R}^+$ to compare decisions, we can also search for an ε-approximation of $d^*$ (i.e. a decision $d \in D$ such that $L(d^*, d) \le \varepsilon$). We use a "budgeted" version of PAC learning:

Definition 1. An algorithm is an (ε, δ)-PAC algorithm with horizon T for the parametric decision problem if it outputs an ε-approximation of $d^*$ with probability at least $1 - \delta$ when it terminates with strictly less than T samples. We call exploration time the number of parameter samples required for termination.

This definition extends PAC learning to finite horizons: to avoid confusion we use the term exploration time instead of sample complexity when T is finite. The exploration time at horizon T with $\delta = 1/T$ provides an upper bound for the expected cumulative regret (see Yue et al., 2012, Section 4).

The decision function may be the result of quite a complex algorithm, but in the problems we consider here, $D$ is finite and the decision function partitions the input space into single-decision areas. For these problems we will assume that the environment state $\mu$ falls outside of the decision frontiers; in other words, we will assume that there exists a neighborhood of $\mu$ where $f$ is constant. In order to analyze the performance of our algorithm in the next section, we will need a finer description of this neighborhood. For instance, the binary decision function defined in Figure 1 is constant on a neighborhood of $\mu$, but it is also independent of $x_2$ for any configuration where $x_1 < 1/2$.

Figure 1. A simple example of binary decision function defined by $f(x_1, x_2) = [x_1 > 1/8 \wedge (x_1 < 1/2 \vee x_2 > x_1)]$ (we use the square brackets $[\cdot]$ to denote the characteristic function of a predicate). How accurately do we need to estimate $\mu$ in order to take the good decision? The environment state $\mu$ is at distance $\Delta_1$ from the nearest decision frontier, but we can stop exploring $x_2$ as soon as we know that $x_1 < 1/2$.

Let us introduce some more notation: hereafter $\|x\|_\infty := \max_i |x_i|$ denotes the $l_\infty$ norm, $B(\mu, r) := \{x \mid \|x - \mu\|_\infty < r\}$ denotes the $l_\infty$ ball (or box) of radius $r$ around $\mu$, and $e_i$ denotes the standard basis vector with a 1 in the $i$-th coordinate and 0's elsewhere.

Definition 2. Let $H$ be a subset of $F$. The decision function $f$ is independent of its parameter $i$ on $H$ if for any $\alpha \in [-1, +1]$ we have:
$$x, \; x + \alpha e_i \in H \;\Rightarrow\; f(x) = f(x + \alpha e_i).$$

For instance, in Figure 1 the decision is independent of $x_2$ on the set $B(\mu, \Delta_2)$. This local independence, parametrized by the $\Delta_i$ radii, captures the sensitivity of the decision to its input parameters around the environment state $\mu$.
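To make Definition 2 concrete, the following Python sketch (ours, not from the paper) implements the binary decision function of Figure 1 together with a brute-force grid check of independence on an axis-aligned box H; the helper name, the grid resolution and the example boxes are illustrative choices only.

```python
import itertools
import numpy as np

def f(x1, x2):
    """Binary decision function of Figure 1: [x1 > 1/8 and (x1 < 1/2 or x2 > x1)]."""
    return int(x1 > 1/8 and (x1 < 1/2 or x2 > x1))

def independent_on_box(g, box, i, grid=50):
    """Brute-force check of Definition 2 on an axis-aligned box.

    `box` is a list of (low, high) intervals, one per coordinate.  The decision
    is independent of coordinate i on the box if moving only x_i, while staying
    inside the box, never changes the decision."""
    axes = [np.linspace(lo, hi, grid) for lo, hi in box]
    for point in itertools.product(*axes):
        ref = g(*point)
        for xi in axes[i]:
            moved = list(point)
            moved[i] = xi
            if g(*moved) != ref:
                return False
    return True

# On a box where x1 < 1/2 everywhere, the decision no longer depends on x2:
print(independent_on_box(f, [(0.2, 0.4), (0.0, 1.0)], i=1))   # True
# On a box straddling the x1 = 1/2 frontier it still does:
print(independent_on_box(f, [(0.4, 0.6), (0.0, 1.0)], i=1))   # False
```

On the first box the decision is constant in $x_2$, which is exactly the situation exploited by the elimination mechanism described next.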
3. The SAVAGE Algorithm

We propose a generic zooming algorithm to solve the N-dimensional parametric decision problem with high confidence. This algorithm, called SAVAGE (Sensitivity Analysis of VAriables for Generic Exploration), is described in Algorithm 1. It works by progressively reducing a box-shaped confidence set $H$ until a single decision remains in $f(H)$. The algorithm stops exploring a parameter when it knows, from a sensitivity analysis subroutine IndepTest(f, H, i), that, given our knowledge of the environment, the final decision will not change according to this parameter; in other words, when $f$ is independent of $i$ on $H$ as formalized in Definition 2.

Algorithm 1 SAVAGE algorithm
1: Input: $X = (X_1, \dots, X_N)$, $f$, $F$, $T$, $\delta$
2: Initialization:
3: $W := \{1, \dots, N\}$, $H := F$, $s := 1$
4: $\forall i \in W$: $\hat{\mu}_i := 1/2$ and $t_i := 0$
5: while $\neg$Accept$(f, H, W) \wedge s \le T$ do
6:   Pick a variable index $i \in \arg\min_{W} \{t_1, \dots, t_N\}$
7:   $t_i := t_i + 1$
8:   Sample the $i$-th distribution: $x_i \leftarrow X_i$
9:   $\hat{\mu}_i := (1 - \tfrac{1}{t_i})\hat{\mu}_i + \tfrac{1}{t_i} x_i$
10:  $H := H \cap \{x \mid |x_i - \hat{\mu}_i| < c(t_i)\}$
11:  $W := W \setminus \{j \mid \text{IndepTest}(f, H, j)\}$
12:  $s := s + 1$
13: end while
14: return $\hat{d} \in f(H)$

The boundaries of $H$ are defined by the confidence radius
$$c(t) = \sqrt{\frac{1}{2t}\log\Big(\frac{\eta(t)}{\delta}\Big)}, \qquad (1)$$
where the $\eta$ function is set to $2NT$ when the horizon $T$ is finite, and to $\frac{\pi^2 N t^2}{3}$ when it is infinite (PAC setting). Termination is controlled by the predicate
$$\text{Accept}(f, H, W) := \text{``}W = \emptyset\text{''}, \qquad (2)$$
which implies
$$|f(H)| = 1. \qquad (3)$$

Theorem 1. If $f$ is independent of each parameter $i$ on $F \cap B(\mu, \Delta_i)$, with $\Delta_i > 0$, then SAVAGE is a $(0, \delta)$-PAC algorithm with horizon $T$ for the parametric decision problem. When $T = \infty$, its sample complexity is bounded by
$$O\Bigg(\sum_{i=1}^{N} \frac{\log\big(\tfrac{N}{\delta \Delta_i}\big)}{\Delta_i^2}\Bigg).$$
When $T < \infty$, its exploration time is bounded by
$$O\Bigg(\sum_{i=1}^{N} \frac{\log\big(\tfrac{NT}{\delta}\big)}{\Delta_i^2}\Bigg).$$

Proof sketch. We first establish, with the union bound and the Hoeffding lemma (Hoeffding, 1963), that the unknown parameter value $\mu$ stays inside the confidence set $H$ during the whole computation with probability at least $1 - \delta$. Then we bound the exploration time of each parameter with a projection argument. Any point $x$ in $H$ can be projected onto
$$x' := x + \sum_{j \notin W} (\mu_j - x_j)\, e_j. \qquad (4)$$
By construction we have $f(x') = f(x)$. The same projection can be done for any point $x + \alpha e_i \in H$. By hypothesis we have $f(x') = f(x' + \alpha e_i)$, hence $f(x + \alpha e_i) = f(x)$. This proof is detailed step by step in the extended version.

By definition, if $f$ is independent of each parameter $i$ on $B(\mu, \Delta_i)$, then $f$ is constant on the minimal box $B(\mu, \Lambda)$, where $\Lambda = \min_i \Delta_i$. An exploration policy without elimination would reach this neighborhood after $O\big(\tfrac{N \log(NT/\delta)}{\Lambda^2}\big)$ samples. This means that SAVAGE will outperform a uniform exploration policy as soon as the $\Delta_i$ are not all equal. It is also worth noting, for practical purposes, that this improvement holds even if we replace IndepTest(f, H, i) with a sufficient condition of independence. Such a relaxation however requires replacing (2) by (3) to ensure termination.

3.1. Independence predicates

The independence predicate IndepTest(f, H, i) is a property of the decision function and its feasible set. It can thus be specialized via symbolic calculus or handcrafted for specific problems where the properties of $f$ and $F$ are well known. For example, in the traditional multi-armed bandit setting, where $f(x_1, \dots, x_N) \in \arg\max\{x_1, \dots, x_N\}$ and $H(t)$ is encoded by a product of confidence intervals $[a_i, b_i]$, we can use the SAVAGE algorithm with the following specialized predicate IndepTest$_f(H(t), i)$:
$$(\exists j, \; b_i \le a_j) \;\vee\; (\forall k \ne i, \; b_k \le a_i). \qquad (5)$$
With this predicate, we almost fall back to the "arm elimination" of (Even-Dar et al., 2002). We however slightly depart from this algorithm by forcing the inclusion of the successive confidence sets:
$$[a_i, b_i] := \big[\max\{a_i, \hat{\mu}_i - c(t_i)\}, \; \min\{b_i, \hat{\mu}_i + c(t_i)\}\big].$$
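As an illustration, here is a minimal Python sketch (ours) of the SAVAGE loop specialized to the plain best-arm problem, with the confidence radius (1) and the elimination predicate (5). The Bernoulli toy environment, the function names, and the choice of returning the empirical best arm as the decision $\hat{d} \in f(H)$ are assumptions made for the example.

```python
import math
import random

def savage_best_arm(sample, N, T, delta):
    """Minimal SAVAGE loop for the plain best-arm problem, using the confidence
    radius (1) with eta(t) = 2*N*T and the elimination predicate (5).
    `sample(i)` must return one observation of X_i in [0, 1]."""
    def radius(t):
        return math.sqrt(math.log(2 * N * T / delta) / (2 * t))

    mu_hat = [0.5] * N                     # initial estimates, as in Algorithm 1
    t = [0] * N
    a, b = [0.0] * N, [1.0] * N            # box-shaped confidence set H
    W = set(range(N))                      # still-explored parameters

    def indep(i):
        # Predicate (5): i is surely dominated, or i surely dominates every k != i.
        dominated = any(b[i] <= a[j] for j in range(N) if j != i)
        dominating = all(b[k] <= a[i] for k in range(N) if k != i)
        return dominated or dominating

    s = 1
    while W and s <= T:                    # not Accept(f, H, W) and s <= T
        i = min(W, key=lambda j: t[j])     # least-sampled active parameter
        t[i] += 1
        x = sample(i)
        mu_hat[i] += (x - mu_hat[i]) / t[i]
        r = radius(t[i])
        a[i] = max(a[i], mu_hat[i] - r)    # force nested confidence intervals
        b[i] = min(b[i], mu_hat[i] + r)
        W = {j for j in W if not indep(j)}
        s += 1
    # Any decision consistent with H; for the max problem, the empirical best arm.
    return max(range(N), key=lambda j: mu_hat[j])

# Toy run on three hypothetical Bernoulli arms:
means = [0.2, 0.5, 0.8]
print(savage_best_arm(lambda i: float(random.random() < means[i]),
                      N=3, T=100_000, delta=0.05))   # prints 2 with high probability
```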
If we rather want to retrieve the (m+1)-th best arm, as in rigged bandits, the independence predicate becomes
$$(\exists A, |A| = m+1, \forall j \in A, \; b_i \le a_j) \;\vee\; (\exists B, |B| = N - m, \forall k \in B, \; b_k \le a_i). \qquad (6)$$
A simple formalization of the independence thus allows us to apply SAVAGE and Theorem 1 to several other variants of multi-armed bandits.

When the knowledge about $f$ or $F$ is scarce, and the dimension of the problem is not too high, another solution, which we only explored empirically, is to estimate the independence predicate by "introspective" simulations. We used the multi-start random-walk approximation detailed in Algorithm 2. It provides an "almost-everywhere statement" of the property, with an asymmetric risk of failure which can be made arbitrarily low by increasing the number of samples (parameters m and M). This kind of method is widely used in sensitivity analysis (see Saltelli et al., 2000, for a survey).

Algorithm 2 Parameters Elimination by Sampling
1: Input: $f$, $H$, $W$, $m$, $M$
2: Initialization:
3: $S \leftarrow \emptyset$
4: for $l = 1, \dots, m$ do
5:   Sample $x$ uniformly from $H$
6:   $x' \leftarrow x$
7:   for $s = 1, \dots, M$ do
8:     Pick a random parameter $i \in W \setminus S$
9:     Re-sample $x_i$ until $x \in H$
10:    if $f(x) \ne f(x')$ then
11:      $S := S \cup \{i\}$
12:    end if
13:    $x'_i \leftarrow x_i$
14:  end for
15: end for
16: $W := W \cap S$
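A possible rendering of Algorithm 2 in Python is sketched below (ours, not the authors' implementation); it assumes H is given as a product of intervals, so re-sampling a single coordinate always stays inside H and the rejection step of line 9 becomes implicit. The toy decision function in the usage example is hypothetical.

```python
import random

def eliminate_by_sampling(g, box, W, m, M):
    """Sketch of Algorithm 2: keep only the parameters the decision still seems
    to depend on.  `box` is the confidence set H given as one (low, high)
    interval per parameter, and `W` is the set of active parameter indices.
    With a box-shaped H, re-sampling a single coordinate always stays in H,
    so the rejection step of line 9 is implicit here.  Returns the reduced W."""
    S = set()                                   # parameters seen to flip the decision
    for _ in range(m):                          # multi-start
        x = [random.uniform(lo, hi) for lo, hi in box]
        x_prev = list(x)
        for _ in range(M):                      # random walk on the free coordinates
            candidates = list(W - S)
            if not candidates:
                break
            i = random.choice(candidates)
            x[i] = random.uniform(*box[i])      # re-sample coordinate i inside H
            if g(x) != g(x_prev):
                S.add(i)
            x_prev[i] = x[i]
    return W & S

# Example with the toy decision [x1 > 1/8], which ignores x2 on this box:
toy = lambda v: int(v[0] > 0.125)
print(eliminate_by_sampling(toy, [(0.05, 0.3), (0.0, 1.0)], {0, 1}, m=20, M=50))
# -> {0} with high probability: only x1 can still change the decision
```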
3.2. Approximate decision

If the decision function $f$ is λ-Lipschitz for the decision comparison metric $L$ and the $l_\infty$ norm, with a known Lipschitz constant, we can relax the problem by searching only for an ε-approximation of the best decision. In order to do so we replace the Accept(f, H, W) condition by
$$\forall x, x' \in H, \quad \|x - x'\|_\infty \le \frac{\varepsilon}{\lambda}. \qquad (7)$$

Theorem 2. If $f$ is λ-Lipschitz around the environment state $\mu$, and if $f$ is independent of each parameter $i$ on $F \cap B(\mu, \Delta_i)$ with radius $\Delta_i > 0$, then SAVAGE with (7) as acceptance condition is an (ε, δ)-PAC algorithm with horizon $T$ for the parametric decision problem. When $T = \infty$, its sample complexity is bounded by
$$O\Bigg(\sum_{i:\,\Delta_i \ge \varepsilon/\lambda} \frac{\log\big(\tfrac{N}{\delta \Delta_i}\big)}{\Delta_i^2}\Bigg) + O\bigg(\frac{\lambda^2 N_{\varepsilon,\lambda}}{\varepsilon^2} \log\Big(\frac{\lambda N}{\delta \varepsilon}\Big)\bigg),$$
where $N_{\varepsilon,\lambda} = |\{i \mid \Delta_i < \varepsilon/\lambda\}|$. When $T < \infty$, its exploration time is bounded by
$$O\Bigg(\sum_{i:\,\Delta_i > \varepsilon/\lambda} \frac{\log\big(\tfrac{NT}{\delta}\big)}{\Delta_i^2}\Bigg) + O\bigg(\frac{\lambda^2 N_{\varepsilon,\lambda}}{\varepsilon^2} \log\Big(\frac{NT}{\delta}\Big)\bigg).$$
See the extended version for the proof.

4. Application to K-armed Dueling Bandits

The K-armed dueling problem, as presented in (Yue et al., 2012; Yue & Joachims, 2011), assumes the existence of an environment preference matrix $\mu$ of which we only have a noisy perception, modeled, as in (Feige et al., 1994), by a $K \times K$ matrix of random variables $X_{i,j} \in [0,1]$ verifying $E[X_{i,j}] = \mu_{i,j}$. Our aim is to design a sequence of pairwise experiments $(i_t, j_t)$, called duels, for $t = 1, \dots, T$ in order to find the best arm. These works also assume the following properties for the preference matrix (WLOG for a proper indexation of the matrix):

strict linear order: if $i < j$ then $\mu_{i,j} > \tfrac12$;
γ-relaxed stochastic transitivity: if $1 < j < k$ then $\gamma \cdot \mu_{1,k} \ge \max\{\mu_{1,j}, \mu_{j,k}\}$;
stochastic triangular inequality: if $1 < j < k$ then $\mu_{1,k} \le \mu_{1,j} + \mu_{j,k} - \tfrac12$.

These last three assumptions are realistic when the preference matrix is the result of a perturbed linear order. This is indeed the case for some generative models where the number of parameters of the environment is assumed to be K: the inherent values of the arms.

In a situation where the preferences may contain cycles (or voting paradoxes) there is no clear notion of what the best arm is, and the notion of regret is unclear. To avoid these problems, we propose to consider a "voting" variant of K-armed dueling bandits where a pairwise election criterion is used to determine the best candidate from the preference matrix. Several election systems can be used, but we focus here on a simple and well-established one: the Copeland pairwise aggregation method (see Charon & Hudry, 2010).

4.1. Copeland bandits

From now on, we call K×K preference matrix a $K \times K$ matrix $(x_{i,j})$ such that $x_{i,j} + x_{j,i} = 1$ for each $i, j \in \{1, \dots, K\}$ (we use lower-case letters to match the notations of Section 2). If $x$ is a $K \times K$ preference matrix, we define the Copeland score of an arm $i$ by its number of one-to-one majority victories:
$$U_{Cop}(i, x) = \sum_{j} \big[x_{i,j} > \tfrac12\big]. \qquad (8)$$
Any element of $\arg\max_i U_{Cop}(i, x)$ is called a Copeland winner of the matrix.

4.1.1. General Copeland bandits

With a preference matrix of size K we have $N = K(K-1)/2$ free parameters to estimate: we can encode $H$ as a product of intervals $[a_{i,j}, b_{i,j}]$ and apply the SAVAGE algorithm with
$$\text{IndepTest}_f(H, (i,j)) = \big(a_{i,j} > \tfrac12 \;\vee\; b_{i,j} < \tfrac12\big) \;\vee\; \text{Cop}(H, (i,j)),$$
where
$$\text{Cop}(H, (i,j)) := \exists i^+ \text{ s.t. } \Big(\min_{x \in H} U_{Cop}(i^+, x) > \max_{x \in H} U_{Cop}(i, x)\Big) \wedge \Big(\min_{x \in H} U_{Cop}(i^+, x) > \max_{x \in H} U_{Cop}(j, x)\Big). \qquad (9)$$
By applying Theorem 1, we obtain an exploration time bound of order
$$O\Bigg(\sum_{i<j} \frac{\log(KT/\delta)}{\Delta_{i,j}^2}\Bigg) \le O\bigg(K^2 \, \frac{\log(KT/\delta)}{\Lambda^2}\bigg), \qquad (10)$$
where $\Delta_{i,j} = |\mu_{i,j} - \tfrac12|$ for any $i < j$, and $\Lambda = \min_{i<j} \Delta_{i,j}$. This bound requires weak assumptions about the preference matrix $\mu$, but its strong dependence on the "hard" parameters (when $\mu_{i,j}$ is close to $\tfrac12$) makes it quite conservative. The behavior of the algorithm is more efficient in practice.

4.1.2. Condorcet assumption

If there exists an arm $f(x)$ preferred to all the others, it is unique and verifies $U_{Cop}(f(x), x) = K - 1$. The existence of this arm, called the Condorcet winner of the matrix, allows us to tighten the exploration bound.

Property 1. If the environment state $\mu$ admits arm $i^*$ as a Condorcet winner with $\Delta = \min_{j \ne i^*} \mu_{i^*,j} - \tfrac12$ and $\Delta_{i,j} = \max\{\Delta, |\mu_{i,j} - \tfrac12|\}$, then $f$ is independent of $x_{i,j}$ on $B(\mu, \Delta_{i,j})$ for any $i < j$.

By applying Theorem 1 when Property 1 holds, we obtain a bound which is less sensitive to the presence of tight duels than (10), without changing the algorithm. If we know the existence of a Condorcet winner, we can also tame the SAVAGE algorithm by restricting the feasible set $F$ to the $K \times K$ preference matrices admitting a Condorcet winner:
$$F_{Cond} := \{x \mid \exists i^*, \; U_{Cop}(i^*, x) = K - 1\}. \qquad (11)$$
We can obtain a formal independence test in $F_{Cond}$ by replacing (9) with
$$\Big(\max_{x \in H} U_{Cop}(i, x) < K - 1\Big) \;\vee\; \Big(\max_{x \in H} U_{Cop}(j, x) < K - 1\Big). \qquad (12)$$
To stop exploration with an ε-approximation of the winner (here ε is the number of tolerated defeats), we replace Accept(f, H, W) by
$$\forall x \in H, \quad K - 1 - U_{Cop}(f(x), x) \le \varepsilon. \qquad (13)$$

Theorem 3. If the environment state $\mu$ is known to admit a Condorcet winner $i^* = f(\mu)$, then SAVAGE with $F_{Cond}$ as feasible set and (13) as acceptance condition is an (ε, δ)-PAC algorithm with horizon $T$ for the Copeland bandits problem. When $T = \infty$, its sample complexity is bounded by
$$O\Bigg(\sum_{j=\varepsilon+1}^{K-1} \frac{j \cdot \log\big(\tfrac{K}{\delta \Delta_j}\big)}{\Delta_j^2}\Bigg),$$
where for each $j \ne i^*$ we have $\Delta_j = \mu_{i^*,j} - \tfrac12$ (indexed WLOG by increasing values of $\Delta_j$). When $T < \infty$, its exploration time is bounded by
$$O\Bigg(\sum_{j=\varepsilon+1}^{K-1} \frac{j \cdot \log\big(\tfrac{KT}{\delta}\big)}{\Delta_j^2}\Bigg).$$

This is a significant improvement over (10), but it does not remove the quadratic term $K^2$. This leading $K^2$ factor is the price we pay for accepting less constrained preference matrices.
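The Copeland machinery above is easy to express in code. The sketch below (ours, not the authors' implementation) computes the Copeland score (8), tests for a Condorcet winner, and evaluates an interval-based version of predicate (9); it assumes the confidence set H is encoded by two full K×K matrices `a` and `b` of lower and upper bounds with `a[j, i] = 1 - b[i, j]`, so that row-wise scores of `a` and `b` give the minimal and maximal Copeland scores over H.

```python
import numpy as np

def copeland_scores(x):
    """Copeland score (8): number of one-to-one majority victories of each arm."""
    wins = (x > 0.5).astype(int)
    np.fill_diagonal(wins, 0)
    return wins.sum(axis=1)

def condorcet_winner(x):
    """Index of the Condorcet winner (Copeland score K - 1) if it exists, else None."""
    scores = copeland_scores(x)
    best = int(scores.argmax())
    return best if scores[best] == x.shape[0] - 1 else None

def copeland_indep_test(a, b, i, j):
    """Interval-based sketch of predicate (9) for the duel (i, j).

    a and b are K x K matrices of lower/upper confidence bounds with
    a[j, i] = 1 - b[i, j]; row-wise scores of a (resp. b) then give the minimal
    (resp. maximal) Copeland score over the box H.  The duel needs no more
    sampling if its outcome is already decided, or if some arm i_plus surely
    beats both i and j on the Copeland score."""
    if a[i, j] > 0.5 or b[i, j] < 0.5:     # duel already decided
        return True
    lower, upper = copeland_scores(a), copeland_scores(b)
    K = a.shape[0]
    return any(lower[k] > upper[i] and lower[k] > upper[j]
               for k in range(K) if k not in (i, j))
```

For the Condorcet-restricted test (12), the last condition would instead check that neither i nor j can still reach the maximal score K − 1 on the upper bounds.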
4.2. Borda bandits

Another simple way to elect the winner of the matrix is to use the Borda count. Each competitor is ranked according to its mean performance against the others:
$$U_{Bor}(i, x) = x_{i,\cdot} = \sum_{j} x_{i,j}. \qquad (14)$$
The main advantage of this criterion is that it both offers stability (the utility is linear) and clearly reduces the dimension of the problem to only K parameters: $x_{i,\cdot}$ for $i = 1, \dots, K$. This means that we can simply wrap a classical bandit algorithm to search for the Borda winner of the matrix. It is however quite easy to design a Condorcet preference matrix where the Borda winner is not the Condorcet winner (there also exist K-armed stochastic bandit settings which are non-transitive; see Gardner, 1970).

The decision criterion underlying the Beat the Mean Bandit algorithm proposed by Yue & Joachims (2011) is different: the matrix rows are explored to find Borda losers, which are progressively eliminated from the matrix until only one arm remains. This election procedure, called bottom-up Borda elimination, returns the Condorcet winner if there exists one. SAVAGE being a generic algorithm, it can be applied directly to these two voting criteria.
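The remark above, that the Borda and Condorcet winners can differ, is easy to check numerically; the 3×3 matrix below is a hypothetical example built for this purpose, not one used in the paper.

```python
import numpy as np

def borda_winner(x):
    """Borda winner (14): arm with the best mean performance against the others."""
    return int(x.sum(axis=1).argmax())

# Hypothetical 3x3 preference matrix: arm 0 narrowly wins both of its duels
# (Condorcet winner), while arm 1 gets the largest Borda score by crushing arm 2.
x = np.array([[0.50, 0.51, 0.51],
              [0.49, 0.50, 0.99],
              [0.49, 0.01, 0.50]])
copeland = (x > 0.5).sum(axis=1)          # Copeland scores: [2, 1, 0]
print(int(copeland.argmax()))             # 0 -> Condorcet winner (score K - 1)
print(borda_winner(x))                    # 1 -> Borda winner
```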
5. Simulations

In order to compare algorithms of different natures, we used a pure-exploration online setting where at each time step $t$ the algorithm chooses a parameter $i_t$ to explore, chooses a decision $\hat{d}_t$ accordingly, and receives an unknown reward $U(\hat{d}_t, \mu)$. We considered both the best-decision rate and the regret $U(d^*, \mu) - U(\hat{d}_t, \mu)$. For all algorithms except Interleave Filtering and Beat the Mean, we took $\hat{d}_t := f(\hat{\mu}(t))$. For PAC algorithms, we took $\varepsilon = 0$ and $\delta = 1/T$ (explore-then-exploit setting). The sample time at which the best-decision rate reaches $1 - \delta$ gives an empirical estimate of the PAC exploration time. To avoid nasty side effects, we shuffled the matrices/parameters at each run.

5.1. Bandits simulations

For bandit problems the decision space and the exploration space coincide, but we are here in a pure-exploration setting where the arm we predict to be the best is not necessarily the one we explore. We considered the following algorithms for our bandit simulations:

Uniform: baseline uniform exploration policy (each arm is explored once in a round-robin manner);
Naive UCB: UCB1 (as in Auer et al., 2002);
Naive Elimination: applies the Action Elimination algorithm (as described in Even-Dar et al., 2002);
Wrapped UCB: applies UCB1 to the wrapped reward random variable $\hat{U}_i = U(i, \hat{\mu})$, used as a proxy for $U(i, \mu)$;
Wrapped Elimination: applies Action Elimination with the above "wrapped reward";
SAVAGE: applies Algorithm 1 with $\eta(t) := 2NT$ and predicate (6) with $m = 4$;
SAVAGE Sampling: Algorithm 1 with a sampled independence predicate and 1000 simulations per arm (see Algorithm 2).

We compared these algorithms on several Bernoulli reward distributions. We give here the simulation results for a rigged bandits problem specially designed to illustrate the different exploration behaviors of the algorithms. In this setting, the utility of arm $i$ (indexed WLOG by decreasing $\mu_i$) is defined by $U(i, \mu) = 0$ if $i < 5$ and $U(i, \mu) = \mu_i$ otherwise (see Figure 2 for the reward distribution and Figure 3 for the simulation results).

Figure 2. A 100-armed rigged bandit designed to game simple exploration algorithms: the four apparently best arms only deliver counterfeit money. The algorithm must sample arms 5 to 10 intensively in order to guess that arm 5 is the best. (The plot contrasts the apparent and the real expected reward of each arm: a short head containing the rigged arms, followed by a long tail.)

Figure 3. Behavior of the different algorithms over 1000 simulations with the distribution of Figure 2. The top plot depicts the best-arm rate, the middle plot shows the cumulative regret, and the bottom one tracks the number of active arms for the elimination algorithms. The time scale is logarithmic.

As expected, maximizing policies like UCB explore the head of the distribution aggressively, but after around $10^3$ samples they fall into the rigged arms and neglect the second part of the distribution head. The wrapped versions are less sensitive to this trap, but the non-linearity of the utility, and the violation of independence it induces, both cripple their regret performance at the beginning of the runs. As expected, the SAVAGE versions perform well; more surprising is the side effect of sampling, which makes the algorithm more aggressive against weak arms.

5.2. Dueling bandits simulations

For the dueling bandits simulations, we considered the Uniform, SAVAGE, and SAVAGE Sampling policies plus the following ones:

SAVAGE/Condorcet: Algorithm 1 with (12) as independence test;
SAVAGE Sampling/Borda: Algorithm 1 with a sampled oracle and a Borda relaxation of the Condorcet feasible set, i.e. $\exists i, \sum_j x_{i,j} > K/2$;
Interleave Filtering: as in (Yue et al., 2012);
Beat the Mean: as in (Yue & Joachims, 2011), with $\arg\max\{\hat{P}_b \mid b \in W_l\}$ for $\hat{d}_t$.

5.2.1. Condorcet simulations

We first used "hard" $K \times K$ Condorcet preference matrices $\mu$ defined by $\mu_{i,j} = \tfrac12 + j/(2K)$ for each $i < j$. The matrices of this family verify all the assumptions defined in Section 4, but they also offer some difficult duels involving the Condorcet winner (for instance, if $K = 100$ we have $\mu_{1,2} = 0.51$, hence $\Delta = 0.01$).
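For reference, here is how such a matrix can be generated; this sketch is ours and uses 0-based indices, so arm 0 plays the role of the Condorcet winner.

```python
import numpy as np

def hard_condorcet_matrix(K):
    """'Hard' Condorcet preference matrix of Section 5.2.1: mu[i, j] = 1/2 + (j+1)/(2K)
    for i < j (0-based), so arm 0 beats everyone but its duel against arm 1 is tight."""
    mu = np.full((K, K), 0.5)
    for i in range(K):
        for j in range(i + 1, K):
            mu[i, j] = 0.5 + (j + 1) / (2 * K)
            mu[j, i] = 1.0 - mu[i, j]
    return mu

mu = hard_condorcet_matrix(100)
print(mu[0, 1])   # 0.51: the hardest duel of the winner, hence Delta = 0.01
```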
Figure 4. Average good-prediction rate and regret over 500 simulations of a 30-armed Condorcet bandit instance. We used the Copeland index (8) to compute the regret.

The results of these experiments appear in Figure 4 and Figure 5. The SAVAGE Sampling policy slightly improves on the formal version, but its heavy introspection cost makes it difficult to deploy on high-dimensional problems. The low-dimensional instance of Figure 4 is not favorable for Interleave Filtering and Beat the Mean, which were designed to drop the $O(K^2)$ term of the bound with partial (hence risky) exploration strategies (see Yue & Joachims, 2011).

Figure 5. Same setting as in Figure 4, with K = 400 and the horizon set to $10^8$. When we increase K the problem becomes difficult.

5.2.2. General case simulations

In order to study the behavior of the algorithms with more realistic, non-Condorcet, preferences, we generated uniformly random preference matrices and performed the same experiments. As expected, when the Condorcet hypothesis is violated, the performance of the specialized algorithms (including SAVAGE/Condorcet) collapses well behind the baseline uniform exploration policy (see Figure 6).

Figure 6. Result of 500 simulations with 30×30 randomized preference matrices.

6. Conclusion

We proposed SAVAGE, a flexible and generic algorithm based on sensitivity analysis of parameters for online learning with indirect feedback. We provided PAC theoretical guarantees for this algorithm when it is used with proper independence predicates. We also proposed a generic "introspective sampling" method to approximate these predicates. The "voting bandits" framework we proposed naturally extends dueling bandits to realistic situations where the preferences reflect mixed and inconsistent opinions. The SAVAGE algorithm is robust and clearly outperforms state-of-the-art algorithms in such situations. Our simulations confirmed and reinforced the theoretical results on various parametric decision problems, from classical bandits to K-armed dueling bandits. The construction of a generic exploration algorithm reaching optimality for any provided decision function remains a challenging open problem.

Acknowledgments

We would like to thank the reviewers for their careful reading and helpful comments.

References

Audibert, J.-Y., Munos, R., and Szepesvári, Cs. Exploration-exploitation trade-off using variance estimates in multi-armed bandits. Theoretical Computer Science, 2008.

Audibert, J.-Y., Bubeck, S., and Munos, R. Best arm identification in multi-armed bandits. In COLT, Haifa, Israel, 2010. Omnipress.

Auer, P. and Ortner, R. UCB revisited: Improved regret bounds for the stochastic multi-armed bandit problem. Periodica Mathematica Hungarica, 61(1-2):55-65, 2011.

Auer, P., Cesa-Bianchi, N., and Fischer, P. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235-256, 2002.

Bubeck, S., Munos, R., and Stoltz, G. Pure exploration in finitely-armed and continuous-armed bandits. Theoretical Computer Science, 412(19):1832-1852, 2011.

Chapelle, O., Joachims, T., Radlinski, F., and Yue, Y. Large-scale validation and analysis of interleaved search evaluation. ACM Transactions on Information Systems, 30(1):6, 2012.

Charon, I. and Hudry, O. An updated survey on the linear ordering problem for weighted or unweighted tournaments. Annals of Operations Research, 175(1):107-158, 2010.

Chevaleyre, Y., Endriss, U., Lang, J., and Maudet, N. A short introduction to computational social choice. In SOFSEM, volume 4362 of LNCS, pp. 51-69. Springer-Verlag, 2007.

Di Castro, D., Gentile, C., and Mannor, S. Bandits with an edge. CoRR, abs/1109.2296, 2011.
Even-Dar, E., Mannor, S., and Mansour, Y. PAC bounds for multi-armed bandit and Markov decision processes. In COLT, pp. 255-270, London, UK, 2002. Springer-Verlag.

Even-Dar, E., Mannor, S., and Mansour, Y. Action elimination and stopping conditions for the multi-armed bandit and reinforcement learning problems. Journal of Machine Learning Research, 7:1079-1105, 2006.

Feige, U., Raghavan, P., Peleg, D., and Upfal, E. Computing with noisy information. SIAM Journal on Computing, 23(5):1001-1018, 1994.

Gardner, M. Mathematical games: The paradox of the nontransitive dice and the elusive principle of indifference. Scientific American, 223:110-114, December 1970.

Hoeffding, W. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13-30, 1963.

Joachims, T. Evaluating retrieval performance using clickthrough data. In Text Mining, pp. 79-96. Physica/Springer Verlag, 2003.

Lai, T. L. and Robbins, H. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4-22, 1985.

Ravikumar, B., Ganesan, K., and Lakshmanan, K. B. On selecting the largest element in spite of erroneous information. In STACS, volume 247 of LNCS, pp. 88-99. Springer Berlin Heidelberg, 1987.

Saltelli, A., Chan, K., and Scott, E. M. (eds.). Sensitivity Analysis. Wiley Series in Probability and Statistics. J. Wiley & Sons, New York, Chichester, Weinheim, 2000.

Yue, Y. and Joachims, T. Interactively optimizing information retrieval systems as a dueling bandits problem. In ICML, volume 382 of ACM Proceeding Series, pp. 1201-1208. ACM, 2009.

Yue, Y. and Joachims, T. Beat the mean bandit. In ICML, pp. 241-248. Omnipress, 2011.

Yue, Y., Broder, J., Kleinberg, R., and Joachims, T. The K-armed dueling bandits problem. Journal of Computer and System Sciences, 78(5):1538-1556, 2012.