
Generic Exploration and K-armed Voting Bandits
Tanguy Urvoy
Fabrice Clerot
Raphael Féraud
Sami Naamane
Orange-labs, 2 avenue Pierre Marzin, 22307 Lannion, FRANCE
Abstract
We study a stochastic online learning scheme with partial feedback where the utility of decisions is only observable through an estimation of the environment parameters. We propose a generic pure-exploration algorithm, able to cope with various utility functions from multi-armed bandit settings to dueling bandits. The primary application of this setting is to offer a natural generalization of dueling bandits to situations where the environment parameters reflect the idiosyncratic preferences of a mixed crowd.
1. Introduction
The stochastic multi-armed bandit became popular as a stripped-down model of the exploration-versus-exploitation balance in sequential decision problems. In its simplest formulation, we are facing a slot machine with several arms. The rewards of these arms are modeled by unknown but bounded and independent random variables. To maximize our long-term reward, we would like to play an arm with maximal expected value, but we need to explore all the arms efficiently in order to find it.

The cost of ignorance is traditionally expressed in terms of expected regret: the expected difference of reward between a playing policy established with perfect knowledge of the environment parameters and a given "unaware" policy. Following Lai & Robbins (1985), several regret analyses have been proposed (see for instance Auer et al., 2002; Audibert et al., 2008; Auer & Ortner, 2011).

Another way to evaluate bandit algorithms is to consider pure exploration or PAC sample complexity: how to find a nearly optimal arm with high confidence in a minimum number of trials? For multi-armed bandits, several algorithms were already proposed and studied from this perspective in (Even-Dar et al., 2002; 2006). We can also reverse the PAC question and control the prediction accuracy after a fixed number of samples, as in (Audibert et al., 2010; Bubeck et al., 2011).

When the number of trials is bounded by a known horizon, we can adopt an explore-then-exploit strategy and control the final regret with an efficient exploration algorithm (Yue et al., 2012). This is the perspective we adopt here.

1.1. Rigged bandits

As an introduction to our generic exploration setting, we consider the rigged bandits problem, a simplified model of click fraud in online advertising. In this variant of multi-armed bandits, we know from a reliable source (the barmaid of the casino) that the $m$ best arms of the slot machine have been rigged and only deliver counterfeit money. To maximize our gain, we want to design a sequence of experiments in order to determine and play the $(m+1)$-th best arm as early as possible while avoiding the rigged arms.

The main characteristic of this problem lies in the absence of a direct utility feedback: to estimate our real income we need to know with enough confidence which arms were rigged. This problem also requires what we call a generic exploration policy: we do not want to design a new exploration algorithm for each possible fraud-detection criterion we have at hand (see Section 5.1 for a formalized example).
1.2. Dueling bandits
The dueling bandit problem, introduced by Yue & Joachims (2009) to formalize online learning from preference feedback, shares the indirect or parametric feedback property with rigged bandits. The initial motivation to depart from the absolute-reward model came from information retrieval evaluation, where the implicit feedback collected through click logs is strongly biased by the ranking itself. A solution was proposed by Joachims (2003) to circumvent this problem: by interleaving two ranking models and checking where the user clicked, one obtains an unbiased – but pairwise – preference feedback. Further experiments were performed in (Chapelle et al., 2012).
The original definition of the dueling bandits problem
(Yue et al., 2012; Yue & Joachims, 2011) was built
upon strong assumptions about the preference matrix:
existence of a strict linear ordering, stochastic transitivity and stochastic triangular inequality (see Yue
& Joachims, 2011). An extension of this setting with
restricted pairing was proposed by Di Castro et al.
(2011), but this extension also assumes the preference
matrix to be the byproduct of an inherent value for
each arm.
In a situation where the preferences reflect the expression of a mixed crowd, there can be several inconsistencies or voting paradoxes which contradict these assumptions. The definition we propose here is more relaxed: we do not assume the existence of a perfect linear order, nor do we assume the existence of an inherent value for each arm. We simply try to sample the preference matrix efficiently in order to propose a "best element" similar to the one we would choose with perfect knowledge of the crowd preferences.
Electing a "best element" or a "best linear ordering" from such a preference matrix is a tough but old and well-studied problem (see Charon & Hudry, 2010, for a survey), but works on online and noisy versions of this problem are scarce (see however Ravikumar et al., 1987; Feige et al., 1994, for related problems). If we change the election criterion according to the vast social choice theory literature (see Chevaleyre et al., 2007, for a survey), we can derive several unexplored flavors of dueling bandits: for instance Borda bandits, Copeland bandits, Slater bandits, or Kemeny bandits.
1.3. Toward generic exploration algorithms
The traditional approach to exotic sequential decision problems is to design tailor-made algorithms which handle simultaneously the exploration of the environment and the exploitation of its knowledge. One purpose of this article is to explore the possibility of a generic algorithm which automatically generates an efficient exploration policy for any given decision criterion. We propose a generic algorithm with theoretical guarantees for the case of parametric decision problems, and we evaluate its performance on relatively simple variants of rigged and dueling bandits.
2. Main Problem Statement
Consider a stationary environment modeled by a vector of $N$ unknown parameters $\boldsymbol{\mu} = (\mu_1, \ldots, \mu_N) \in [0,1]^N$ (by convention, we write vectors in bold face). We have a noisy perception of $\boldsymbol{\mu}$ modeled by a vector $\mathbf{X}$ of $N$ independent random variables $X_i \in [0,1]$ verifying $\mathbb{E}[X_i] = \mu_i$ for each index $i = 1, \ldots, N$. The only hint we may have about $\boldsymbol{\mu}$ is a set of feasible environment configurations $\mathcal{F} \subseteq [0,1]^N$.

Let $\mathcal{D}$ be a set of decisions and $U : \mathcal{D} \times \mathcal{F} \to \mathbb{R}^+$ be a given utility function. From this utility function we can derive a decision function $f : \mathcal{F} \to \mathcal{D}$ which computes an optimal option $f(\mathbf{x}) \in \arg\max_d U(d, \mathbf{x})$ for each feasible realization $\mathbf{x}$ of the random vector $\mathbf{X}$. The parametric decision problem consists in finding the best decision $d^* := f(\boldsymbol{\mu})$ with high probability after a minimum number of samples of the environment parameters.
If we have a metric $L : \mathcal{D} \times \mathcal{D} \to \mathbb{R}^+$ to compare decisions, we can also search for an $\varepsilon$-approximation of $d^*$ (i.e. a decision $d \in \mathcal{D}$ such that $L(d^*, d) \le \varepsilon$).
We use a "budgeted" version of PAC learning:

Definition 1. An algorithm is an $(\varepsilon, \delta)$-PAC algorithm with horizon $T$ for the parametric decision problem if it outputs an $\varepsilon$-approximation of $d^*$ with probability at least $1 - \delta$ when it terminates with strictly fewer than $T$ samples. We call exploration time the number of parameter samples required for termination.

This definition extends PAC learning to finite horizons: to avoid confusion, we use the term exploration time instead of sample complexity when $T$ is finite. The exploration time at horizon $T$ with $\delta = 1/T$ provides an upper bound for the expected cumulative regret (see Yue et al., 2012, Section 4).
The decision function may be the result of quite a complex algorithm, but in the problems we consider here, $\mathcal{D}$ is finite and the decision function partitions the input space into single-decision areas. For these problems we will assume that the environment state $\boldsymbol{\mu}$ falls outside the decision frontiers; in other words, we will assume that there exists a neighborhood of $\boldsymbol{\mu}$ where $f$ is constant. In order to analyze the performance of our algorithm in the next section, we will need a finer description of this neighborhood. For instance, the binary decision function defined in Figure 1 is constant on a neighborhood of $\boldsymbol{\mu}$, but it is also independent of $x_2$ for any configuration where $x_1 < 1/2$.

Figure 1. A simple example of binary decision function defined by $f(x_1, x_2) = [x_1 > 1/8 \wedge (x_1 < 1/2 \vee x_2 > x_1)]$ (we use the square brackets $[\cdot]$ to denote the characteristic function of a predicate). How accurately do we need to estimate $\boldsymbol{\mu}$ in order to take the good decision? The environment state $\boldsymbol{\mu}$ is at distance $\Delta_1$ from the nearest decision frontier, but we can stop exploring $x_2$ as soon as we know that $x_1 < 1/2$.
Let us introduce some more notations: hereafter $\|\mathbf{x}\|_\infty := \max_i |x_i|$ will denote the $l_\infty$ norm, $B(\boldsymbol{\mu}, r) := \{\mathbf{x} \mid \|\mathbf{x} - \boldsymbol{\mu}\|_\infty < r\}$ will denote the $l_\infty$ ball (or box) of radius $r$ around $\boldsymbol{\mu}$, and $\mathbf{e}_i$ will denote the standard basis vector with a 1 in the $i$-th coordinate and 0's elsewhere.

Definition 2. Let $\mathcal{H}$ be a subset of $\mathcal{F}$. The decision function $f$ is independent of its parameter $i$ on $\mathcal{H}$ if for any $\alpha \in [-1, +1]$ we have:
$$\mathbf{x},\, \mathbf{x} + \alpha\mathbf{e}_i \in \mathcal{H} \;\Rightarrow\; f(\mathbf{x}) = f(\mathbf{x} + \alpha\mathbf{e}_i).$$

For instance, on Figure 1 the decision is independent of $x_2$ on the set $B(\boldsymbol{\mu}, \Delta_2)$. This local independence, parametrized by the radii $\Delta_i$, captures the sensitivity of the decision to its input parameters around the environment state $\boldsymbol{\mu}$.
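To make Definition 2 concrete, here is a minimal Python sketch (the helper names are ours, chosen for illustration; this is not code from the paper) of the Figure 1 decision function together with a brute-force grid check of independence on an axis-aligned box:

```python
import itertools

def f(x1, x2):
    # Binary decision function of Figure 1:
    # f(x1, x2) = [x1 > 1/8 and (x1 < 1/2 or x2 > x1)]
    return int(x1 > 1/8 and (x1 < 1/2 or x2 > x1))

def independent_on_box(decision, box, index, grid=20):
    # Brute-force check of Definition 2 on a product of intervals:
    # the decision must not change when only coordinate `index` moves.
    axes = [[lo + k * (hi - lo) / grid for k in range(grid + 1)] for lo, hi in box]
    for point in itertools.product(*axes):
        for value in axes[index]:
            moved = list(point)
            moved[index] = value
            if decision(*moved) != decision(*point):
                return False
    return True

# On a box entirely inside {x1 < 1/2}, the decision ignores x2:
print(independent_on_box(f, [(0.2, 0.4), (0.0, 1.0)], index=1))  # True
# On a box straddling the frontier x2 = x1, it does not:
print(independent_on_box(f, [(0.6, 0.8), (0.4, 1.0)], index=1))  # False
```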
3. The SAVAGE Algorithm
We propose a generic zooming algorithm to solve the $N$-dimensional parametric decision problem with high confidence. This algorithm, called SAVAGE (Sensitivity Analysis of VAriables for Generic Exploration), is described in Algorithm 1. It works by progressively reducing a box-shaped confidence set $\mathcal{H}$ until a single decision remains in $f(\mathcal{H})$. The algorithm stops exploring a parameter when it knows from a sensitivity-analysis subroutine $\mathrm{IndepTest}(f, \mathcal{H}, i)$ that, given our knowledge of the environment, the final decision will not change according to this parameter; in other words, when $f$ is independent of $i$ on $\mathcal{H}$ as formalized in Definition 2.

Algorithm 1 SAVAGE algorithm
1: Input: $\mathbf{X} = (X_1, \ldots, X_N)$, $f$, $\mathcal{F}$, $T$, $\delta$
2: Initialization:
3: $\mathcal{W} := \{1, \ldots, N\}$, $\mathcal{H} := \mathcal{F}$, $s := 1$
4: $\forall i \in \mathcal{W}$: $\hat\mu_i := 1/2$ and $t_i := 0$
5: while $\neg\mathrm{Accept}(f, \mathcal{H}, \mathcal{W}) \wedge s \le T$ do
6:   Pick a variable index $i \in \arg\min_{j \in \mathcal{W}} t_j$
7:   $t_i := t_i + 1$
8:   Sample the $i$-th distribution $x_i \leftarrow X_i$
9:   $\hat\mu_i := (1 - \frac{1}{t_i})\hat\mu_i + \frac{1}{t_i} x_i$
10:  $\mathcal{H} := \mathcal{H} \cap \{\mathbf{x} \mid |x_i - \hat\mu_i| < c(t_i)\}$
11:  $\mathcal{W} := \mathcal{W} \setminus \{j \mid \mathrm{IndepTest}(f, \mathcal{H}, j)\}$
12:  $s := s + 1$
13: end while
14: return $\hat d \in f(\mathcal{H})$
The boundaries of $\mathcal{H}$ are defined by the confidence radius:
$$c(t) = \sqrt{\frac{1}{2t}\log\frac{\eta(t)}{\delta}}, \qquad (1)$$
where the $\eta$ function is set to $2NT$ when the horizon $T$ is finite, and to $\frac{\pi^2 N t^2}{3}$ when it is infinite (PAC setting).
Termination is controlled by the predicate:
$$\mathrm{Accept}(f, \mathcal{H}, \mathcal{W}) := \text{``}\mathcal{W} = \emptyset\text{''} \qquad (2)$$
which implies
$$|f(\mathcal{H})| = 1. \qquad (3)$$
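For illustration only, here is a minimal Python sketch of Algorithm 1 when $\mathcal{H}$ is a product of intervals. The callables `sample`, `indep_test` and `decision`, and all other names, are our own placeholders, not the authors' implementation:

```python
import math

def confidence_radius(t, n_params, horizon, delta):
    # Equation (1) with the finite-horizon choice eta(t) = 2*N*T.
    eta = 2 * n_params * horizon
    return math.sqrt(math.log(eta / delta) / (2 * t))

def savage(sample, decision, indep_test, n_params, horizon, delta):
    # sample(i): draw one observation of X_i in [0, 1]
    # decision(point): the decision function f evaluated at a point of H
    # indep_test(boxes, i): a sufficient condition for independence of f from i on H
    active = set(range(n_params))              # W
    boxes = [(0.0, 1.0)] * n_params            # H encoded as a product of intervals
    mu_hat = [0.5] * n_params
    counts = [0] * n_params
    for _ in range(horizon):
        if not active:                         # Accept: W is empty
            break
        i = min(active, key=lambda j: counts[j])   # least-sampled active parameter
        counts[i] += 1
        mu_hat[i] += (sample(i) - mu_hat[i]) / counts[i]   # running mean
        c = confidence_radius(counts[i], n_params, horizon, delta)
        lo, hi = boxes[i]                      # force inclusion of successive boxes
        boxes[i] = (max(lo, mu_hat[i] - c), min(hi, mu_hat[i] + c))
        active = {j for j in active if not indep_test(boxes, j)}
    return decision([(lo + hi) / 2 for lo, hi in boxes])   # some point of H
```

The returned decision is computed at an arbitrary point of the final box, which is legitimate once only one decision remains on $\mathcal{H}$.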
Theorem 1. If $f$ is independent of each parameter $i$ on $\mathcal{F} \cap B(\boldsymbol{\mu}, \Delta_i)$, with $\Delta_i > 0$, then SAVAGE is a $(0, \delta)$-PAC algorithm with horizon $T$ for the parametric decision problem. When $T = \infty$, its sample complexity is bounded by:
$$O\left(\sum_{i=1}^{N} \frac{\log\frac{N}{\delta \Delta_i}}{\Delta_i^2}\right).$$
When $T < \infty$, its exploration time is bounded by:
$$O\left(\sum_{i=1}^{N} \frac{\log\frac{NT}{\delta}}{\Delta_i^2}\right).$$
Proof sketch. Using the union bound and Hoeffding's inequality (Hoeffding, 1963), we first establish that the unknown parameter value $\boldsymbol{\mu}$ stays inside the confidence set $\mathcal{H}$ during the whole computation with probability at least $1 - \delta$. Then we bound the exploration time of each parameter with a projection argument. Any point $\mathbf{x}$ in $\mathcal{H}$ can be projected onto:
$$\mathbf{x}' := \mathbf{x} + \sum_{j \notin \mathcal{W}} (\mu_j - x_j)\,\mathbf{e}_j \qquad (4)$$
By construction we have $f(\mathbf{x}') = f(\mathbf{x})$. The same projection can be done for any point $\mathbf{x} + \alpha\mathbf{e}_i \in \mathcal{H}$. By hypothesis we have $f(\mathbf{x}') = f(\mathbf{x}' + \alpha\mathbf{e}_i)$, hence $f(\mathbf{x} + \alpha\mathbf{e}_i) = f(\mathbf{x})$. This proof is detailed step by step in the extended version.
By definition, if $f$ is independent of each parameter $i$ on $B(\boldsymbol{\mu}, \Delta_i)$, then $f$ is constant on the minimal box $B(\boldsymbol{\mu}, \Lambda)$, where $\Lambda = \min_i \Delta_i$. An exploration policy without elimination would reach this neighborhood after $O\!\left(N\,\frac{\log(NT/\delta)}{\Lambda^2}\right)$ samples.

This means that SAVAGE will outperform a uniform exploration policy as soon as the $\Delta_i$ are not all equal. It is also worth noting, for practical purposes, that this improvement holds even if we replace $\mathrm{IndepTest}(f, \mathcal{H}, i)$ with a sufficient condition of independence. Such a relaxation however requires replacing (2) by (3) to ensure termination.
3.1. Independence predicates
The independence predicate $\mathrm{IndepTest}(f, \mathcal{H}, i)$ is a property of the decision function and its feasible set. It can thus be specialized via symbolic calculus or hand-crafted for specific problems where the properties of $f$ and $\mathcal{F}$ are well known.

For example, in the traditional multi-armed bandit setting, when $f(x_1, \ldots, x_N) \in \arg\max\{x_1, \ldots, x_N\}$ and $\mathcal{H}(t)$ is encoded by a product of confidence intervals $[a_i, b_i]$, we can use the SAVAGE algorithm with the following specialized predicate $\mathrm{IndepTest}_f(\mathcal{H}(t), i)$:
$$(\exists j,\; b_i \le a_j) \;\vee\; (\forall k \ne i,\; b_k \le a_i). \qquad (5)$$
With this predicate, we almost fall back to the "arm elimination" of (Even-Dar et al., 2002). We however slightly depart from this algorithm by forcing inclusion of the successive confidence sets: $[a_i, b_i] := [\max\{a_i, \hat\mu_i - c(t_i)\}, \min\{b_i, \hat\mu_i + c(t_i)\}]$.
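With intervals stored as pairs $(a_i, b_i)$, predicate (5) takes a few lines; a sketch in the same hypothetical style as above:

```python
def indep_argmax(boxes, i):
    # Predicate (5): parameter i can be dropped when arm i is surely beaten
    # by some arm j (b_i <= a_j), or surely beats every other arm (b_k <= a_i).
    a_i, b_i = boxes[i]
    beaten = any(b_i <= a_j for j, (a_j, _) in enumerate(boxes) if j != i)
    winner = all(b_k <= a_i for k, (_, b_k) in enumerate(boxes) if k != i)
    return beaten or winner
```

Plugged into the earlier sketch as `indep_test`, this reproduces an arm-elimination behavior.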
If we rather want to retrieve the $(m+1)$-th best arm, as in rigged bandits, the independence predicate becomes:
$$(\exists A,\; |A| = m+1,\; \forall j \in A,\; b_i \le a_j) \;\vee\; (\exists B,\; |B| = N-m,\; \forall k \in B,\; b_k \le a_i). \qquad (6)$$
A simple formalization of independence allows us to apply SAVAGE and Theorem 1 to several other variants of multi-armed bandits.
Algorithm 2 Parameters Elimination by Sampling
1: Input: $f$, $\mathcal{H}$, $\mathcal{W}$, $m$, $M$
2: Initialization:
3: $S \leftarrow \emptyset$
4: for $l = 1, \ldots, m$ do
5:   Sample $\mathbf{x}$ uniformly from $\mathcal{H}$
6:   $\mathbf{x}' \leftarrow \mathbf{x}$
7:   for $s = 1, \ldots, M$ do
8:     Pick a random parameter $i \in \mathcal{W} \setminus S$
9:     Re-sample $x_i$ until $\mathbf{x} \in \mathcal{H}$
10:    if $f(\mathbf{x}) \ne f(\mathbf{x}')$ then
11:      $S := S \cup \{i\}$
12:    end if
13:    $x'_i \leftarrow x_i$
14:  end for
15: end for
16: $\mathcal{W} := \mathcal{W} \cap S$
When the knowledge about $f$ or $\mathcal{F}$ is scarce, and the dimension of the problem is not too high, another solution, which we only explored empirically, is to estimate the independence predicate by "introspective" simulations. We used the multi-start random-walk approximation detailed in Algorithm 2. It provides an "almost-everywhere" statement of the property, with an asymmetric risk of failure which can be made arbitrarily low by increasing the number of samples (parameters $m$ and $M$). This kind of method is widely used in sensitivity analysis (see Saltelli et al., 2000, for a survey).
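When $\mathcal{H}$ is a product of intervals, the flavor of this introspective estimate can be sketched as follows; unlike Algorithm 2, which screens all remaining parameters in a single pass, this simplified version tests one parameter at a time (all names are ours):

```python
import random

def sampled_indep_test(decision, boxes, i, n_starts=100, n_moves=100):
    # Monte-Carlo estimate of IndepTest(f, H, i): declare f independent of
    # parameter i on the box H unless we find two points of H, differing only
    # in coordinate i, on which the decision changes. The test is one-sided:
    # a False answer is always correct, a True answer may be wrong, with a
    # risk that decreases as n_starts and n_moves grow.
    lo_i, hi_i = boxes[i]
    for _ in range(n_starts):
        point = [random.uniform(lo, hi) for lo, hi in boxes]   # random start in H
        base = decision(point)
        for _ in range(n_moves):
            moved = list(point)
            moved[i] = random.uniform(lo_i, hi_i)              # move coordinate i only
            if decision(moved) != base:
                return False                                   # parameter i still matters
    return True
```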
3.2. Approximate decision
If the decision function $f$ is $\lambda$-Lipschitz for the decision comparison metric $L$ and the $l_\infty$ norm, with a known Lipschitz constant, we are able to relax the problem by searching only for an $\varepsilon$-approximation of the best decision. In order to do so we replace the $\mathrm{Accept}(f, \mathcal{H}, \mathcal{W})$ condition by:
$$\forall \mathbf{x}, \mathbf{x}' \in \mathcal{H},\quad \|\mathbf{x} - \mathbf{x}'\|_\infty \le \frac{\varepsilon}{\lambda}. \qquad (7)$$

Theorem 2. If $f$ is $\lambda$-Lipschitz around the environment state $\boldsymbol{\mu}$, and if $f$ is independent of each parameter $i$ on $\mathcal{F} \cap B(\boldsymbol{\mu}, \Delta_i)$ with radius $\Delta_i > 0$, then SAVAGE with (7) as acceptance condition is an $(\varepsilon, \delta)$-PAC algorithm with horizon $T$ for the parametric decision problem. When $T = \infty$, its sample complexity is bounded by:
$$O\left(\sum_{i : \Delta_i \ge \varepsilon/\lambda} \frac{\log\frac{N}{\delta\Delta_i}}{\Delta_i^2}\right) + O\left(\frac{\lambda^2 N_{\varepsilon,\lambda}}{\varepsilon^2}\log\frac{\lambda N}{\delta\varepsilon}\right);$$
where $N_{\varepsilon,\lambda} = |\{i \mid \Delta_i < \varepsilon/\lambda\}|$.
When $T < \infty$, its exploration time is bounded by:
$$O\left(\sum_{i : \Delta_i > \varepsilon/\lambda} \frac{\log\frac{NT}{\delta}}{\Delta_i^2}\right) + O\left(\frac{\lambda^2 N_{\varepsilon,\lambda}}{\varepsilon^2}\log\frac{NT}{\delta}\right).$$
See the extended version for the proof.

4. Application to K-armed Dueling Bandits

The K-armed dueling problem, as presented in (Yue et al., 2012; Yue & Joachims, 2011), assumes the existence of an environment preference matrix $\boldsymbol{\mu}$ of which we only have a noisy perception, modeled, as in (Feige et al., 1994), by a $K \times K$ matrix of random variables $X_{i,j} \in [0,1]$ verifying $\mathbb{E}[X_{i,j}] = \mu_{i,j}$. Our aim is to design a sequence of pairwise experiments $(i_t, j_t)$, called duels, for $t = 1, \ldots, T$ in order to find the best arm. They also assume the following properties for the preference matrix (WLOG for a proper indexation of the matrix):

strict linear order: if $i < j$ then $\mu_{i,j} > \frac{1}{2}$;

$\gamma$-relaxed stochastic transitivity: if $1 < j < k$ then $\gamma \cdot \mu_{1,k} \ge \max\{\mu_{1,j}, \mu_{j,k}\}$;

stochastic triangular inequality: if $1 < j < k$ then $\mu_{1,k} \le \mu_{1,j} + \mu_{j,k} - \frac{1}{2}$.

These last three assumptions are realistic when the preference matrix is the result of a perturbed linear order. This is indeed the case for some generative models where the number of parameters of the environment is assumed to be $K$: the inherent values of the arms. In a situation where the preferences may contain cycles (or voting paradoxes), there is no clear notion of what the best arm is, and the notion of regret is unclear.

To avoid these problems, we propose to consider a "voting" variant of K-armed dueling bandits where a pairwise election criterion is used to determine the best candidate from the preference matrix. Several election systems can be used, but we will focus here on a simple and well-established one: the Copeland pairwise aggregation method (see Charon & Hudry, 2010).

4.1. Copeland bandits

From now on, we call $K \times K$ preference matrix a $K \times K$ matrix $(x_{i,j})$ such that $x_{i,j} + x_{j,i} = 1$ for each $i, j \in \{1, \ldots, K\}$ (we use lower-case letters to match the notations of Section 2). If $\mathbf{x}$ is a $K \times K$ preference matrix, we define the Copeland score of an arm $i$ by its number of one-to-one majority victories:
$$U_{\mathrm{Cop}}(i, \mathbf{x}) = \sum_j \left[x_{i,j} > \tfrac{1}{2}\right]. \qquad (8)$$
Any element of $\arg\max_i U_{\mathrm{Cop}}(i, \mathbf{x})$ is called a Copeland winner of the matrix.

4.1.1. General Copeland bandits

With a preference matrix of size $K$ we have $N = K(K-1)/2$ free parameters to estimate: we can encode $\mathcal{H}$ as a product of intervals $[a_{i,j}, b_{i,j}]$ and apply the SAVAGE algorithm with
$$\mathrm{IndepTest}_f(\mathcal{H}, (i,j)) = \left(a_{i,j} > \tfrac{1}{2} \;\vee\; b_{i,j} < \tfrac{1}{2}\right) \;\vee\; \mathrm{Cop}(\mathcal{H}, (i,j)),$$
where
$$\mathrm{Cop}(\mathcal{H}, (i,j)) := \exists i^+ \text{ s.t. } \big(\min U_{\mathrm{Cop}}(i^+, \mathcal{H}) > \max U_{\mathrm{Cop}}(i, \mathcal{H})\big) \;\wedge\; \big(\min U_{\mathrm{Cop}}(i^+, \mathcal{H}) > \max U_{\mathrm{Cop}}(j, \mathcal{H})\big). \qquad (9)$$
By applying Theorem 1, we obtain an exploration time bound of order:
$$O\left(\sum_{i<j} \frac{\log(KT/\delta)}{\Delta_{i,j}^2}\right) \le O\left(K^2\,\frac{\log(KT/\delta)}{\Lambda^2}\right), \qquad (10)$$
where $\Delta_{i,j} = |\mu_{i,j} - \frac{1}{2}|$ for any $i < j$, and $\Lambda = \min_{i<j} \Delta_{i,j}$. This bound requires weak assumptions about the preference matrix $\boldsymbol{\mu}$, but its strong dependence on the "hard" parameters (when $\mu_{i,j}$ is close to $\frac{1}{2}$) makes it quite conservative. In practice, the behavior of the algorithm is more efficient.
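To make the Copeland machinery concrete, here is a minimal sketch of the score (8) and of the predicate (9) when $\mathcal{H}$ is a product of intervals $[a_{i,j}, b_{i,j}]$; the pessimistic/optimistic score bounds below are the straightforward counts, and all names are our own:

```python
def copeland_score(x):
    # Equation (8): number of one-to-one majority victories of each arm.
    k = len(x)
    return [sum(1 for j in range(k) if j != i and x[i][j] > 0.5) for i in range(k)]

def copeland_bounds(a, b):
    # Pessimistic (low) and optimistic (high) Copeland scores over the box
    # a[i][j] <= x[i][j] <= b[i][j].
    k = len(a)
    low = [sum(1 for j in range(k) if j != i and a[i][j] > 0.5) for i in range(k)]
    high = [sum(1 for j in range(k) if j != i and b[i][j] > 0.5) for i in range(k)]
    return low, high

def indep_copeland(a, b, i, j):
    # Left part of the predicate: the duel (i, j) is already settled.
    settled = a[i][j] > 0.5 or b[i][j] < 0.5
    # Cop(H, (i, j)) of equation (9): some arm surely dominates both i and j
    # for every preference matrix of the box.
    low, high = copeland_bounds(a, b)
    dominated = any(low[p] > high[i] and low[p] > high[j]
                    for p in range(len(a)) if p not in (i, j))
    return settled or dominated
```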
4.1.2. Condorcet assumption
If there exists an arm $f(\mathbf{x})$ preferred to all the others, it is unique and verifies $U_{\mathrm{Cop}}(f(\mathbf{x}), \mathbf{x}) = K - 1$. The existence of this arm, called the Condorcet winner of the matrix, allows us to tighten the exploration bound.

Property 1. If the environment state $\boldsymbol{\mu}$ admits arm $i^*$ as a Condorcet winner with $\Delta = \min_{j \ne i^*} \mu_{i^*,j} - \frac{1}{2}$ and $\Delta_{i,j} = \max\{\Delta, |\mu_{i,j} - \frac{1}{2}|\}$, then $f$ is independent of $x_{i,j}$ on $B(\boldsymbol{\mu}, \Delta_{i,j})$ for any $i < j$.

By applying Theorem 1 when Property 1 holds, we obtain a bound which is less sensitive to the presence of tight duels than (10), without changing the algorithm. If we know that a Condorcet winner exists, we can also tame the SAVAGE algorithm by restricting the feasible set $\mathcal{F}$ to the $K \times K$ preference matrices admitting a Condorcet winner:
$$\mathcal{F}_{\mathrm{Cond}} := \{\mathbf{x} \mid \exists i^*,\; U_{\mathrm{Cop}}(i^*, \mathbf{x}) = K - 1\}. \qquad (11)$$
We can obtain a formal independence test in $\mathcal{F}_{\mathrm{Cond}}$ by replacing (9) with:
$$\Big(\max_{\mathbf{x}\in\mathcal{H}} U_{\mathrm{Cop}}(i, \mathbf{x}) < K-1\Big) \;\vee\; \Big(\max_{\mathbf{x}\in\mathcal{H}} U_{\mathrm{Cop}}(j, \mathbf{x}) < K-1\Big). \qquad (12)$$

To stop exploration with an $\varepsilon$-approximation of the winner (here $\varepsilon$ is the number of tolerated defeats), we replace $\mathrm{Accept}(f, \mathcal{H}, \mathcal{W})$ by:
$$\forall \mathbf{x} \in \mathcal{H},\quad K - 1 - U_{\mathrm{Cop}}(f(\mathbf{x}), \mathbf{x}) \le \varepsilon. \qquad (13)$$

Theorem 3. If the environment state $\boldsymbol{\mu}$ is known to admit a Condorcet winner $i^* = f(\boldsymbol{\mu})$, then SAVAGE with $\mathcal{F}_{\mathrm{Cond}}$ as feasible set and (13) as acceptance condition is an $(\varepsilon, \delta)$-PAC algorithm with horizon $T$ for the Copeland bandits problem. When $T = \infty$, its sample complexity is bounded by:
$$O\left(\sum_{j=\varepsilon+1}^{K-1} \frac{j \cdot \log\frac{K}{\delta\Delta_j}}{\Delta_j^2}\right),$$
where for each $j \ne i^*$ we have $\Delta_j = \mu_{i^*,j} - \frac{1}{2}$ (indexed WLOG by increasing values of $\Delta_j$). When $T < \infty$, its exploration time is bounded by:
$$O\left(\sum_{j=\varepsilon+1}^{K-1} \frac{j \cdot \log\frac{KT}{\delta}}{\Delta_j^2}\right).$$

This is a significant improvement over (10), but it does not remove the quadratic term $K^2$. This leading $K^2$ factor is the price we pay for accepting less constrained preference matrices.

Figure 2. A 100-armed rigged bandit designed to game simple exploration algorithms: the four apparently best arms only deliver counterfeit money (the plot shows the apparent and real expected reward of each arm, with a long tail of weak arms). The algorithm must sample arms 5 to 10 intensively in order to guess that arm 5 is the best.

4.2. Borda bandits

Another simple way to elect the winner of the matrix is to use the Borda count. Each competitor is ranked according to its mean performance against the others:
$$U_{\mathrm{Bor}}(i, \mathbf{x}) = x_{i,\cdot} = \sum_j x_{i,j}. \qquad (14)$$
The main advantage of this criterion is that it both offers stability (the utility is linear) and clearly reduces the dimension of the problem to only $K$ parameters: $x_{i,\cdot}$ for $i = 1, \ldots, K$. This means that we can simply wrap a classical bandit algorithm to search for the Borda winner of the matrix. It is however quite easy to design a Condorcet preference matrix where the Borda winner is not the Condorcet winner (there also exist K-armed stochastic bandit settings which are non-transitive; see Gardner, 1970).

The decision criterion underlying the Beat the Mean Bandit algorithm proposed by (Yue & Joachims, 2011) is different: the matrix rows are explored to find Borda losers, which are progressively eliminated from the matrix until only one arm remains. This election procedure, called bottom-up Borda elimination, returns the Condorcet winner if one exists. SAVAGE being a generic algorithm, it can be applied directly to these two voting criteria.
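Since the Borda score (14) is linear in the matrix entries, electing the Borda winner from an estimated matrix is immediate; a small sketch (our naming) which also illustrates, on a toy matrix, that the Borda winner may differ from the Condorcet winner:

```python
def borda_winner(x):
    # Equation (14): U_Bor(i, x) = sum_j x[i][j]; return an arg max.
    scores = [sum(row) for row in x]
    return max(range(len(x)), key=lambda i: scores[i])

# Toy 3x3 preference matrix: arm 0 is the Condorcet winner (it beats both
# other arms), but arm 1's crushing win over arm 2 makes it the Borda winner.
x = [[0.50, 0.51, 0.51],
     [0.49, 0.50, 0.99],
     [0.49, 0.01, 0.50]]
print(borda_winner(x))  # 1
```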
5. Simulations
In order to compare algorithms of different natures, we used a pure-exploration online setting where, at each time step $t$, the algorithm chooses a parameter $i_t$ to explore, chooses a decision $\hat d_t$ accordingly, and receives an (unobserved) reward $U(\hat d_t, \boldsymbol{\mu})$. We considered both the best-decision rate and the regret $U(d^*, \boldsymbol{\mu}) - U(\hat d_t, \boldsymbol{\mu})$. For all algorithms except Interleave Filtering and Beat the Mean, we took $\hat d_t := f(\hat{\boldsymbol{\mu}}(t))$. For PAC algorithms, we took $\varepsilon = 0$ and $\delta = 1/T$ (explore-then-exploit setting). The sample time at which the best-decision rate reaches $1 - \delta$ gives an empirical estimation of the PAC exploration time. To avoid nasty side-effects, we shuffled the matrices/parameters at each run.
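A sketch of the evaluation loop behind these measurements, written around a hypothetical `explorer` object exposing `pick()`, `update()` and `decision()` methods (these names and this interface are ours, introduced only to fix ideas):

```python
def run_pure_exploration(explorer, sample, utility, best_decision, horizon):
    # At each step: explore one parameter, commit to a decision d_hat, and
    # record the (hidden) regret U(d*, mu) - U(d_hat, mu) and the hit rate.
    hits, cumulative_regret = [], 0.0
    best_utility = utility(best_decision)
    for _ in range(horizon):
        i = explorer.pick()                    # parameter explored at this step
        explorer.update(i, sample(i))          # noisy observation of mu_i
        d_hat = explorer.decision()            # current guess, e.g. f(mu_hat(t))
        hits.append(d_hat == best_decision)    # best-decision indicator
        cumulative_regret += best_utility - utility(d_hat)
    return sum(hits) / horizon, cumulative_regret
```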
5.1. Bandits simulations
For bandit problems the decision space and the exploration space coincide, but we are here in a pure-exploration setting where the arm we predict to be the best is not necessarily the one we explore. We considered the following algorithms for our bandit simulations:
Uniform: baseline uniform exploration policy (each arm is explored once in a round-robin manner);

Naive UCB: UCB1 (as in Auer et al., 2002);

Naive Elimination: applies the Action Elimination algorithm (as described in Even-Dar et al., 2002);

Wrapped UCB: applies UCB1 to the wrapped reward random variable $\hat U_i = U(i, \hat{\boldsymbol{\mu}})$, used as a proxy for $U(i, \boldsymbol{\mu})$;

Wrapped Elimination: applies Action Elimination with the above "wrapped reward";

SAVAGE: applies Algorithm 1 with $\eta(t) := 2NT$ and predicate (6) with $m = 4$;

SAVAGE Sampling: Algorithm 1 with a sampled independence predicate and 1000 simulations per arm (see Algorithm 2).

We compared these algorithms on several Bernoulli reward distributions. We give here the simulation results for a rigged bandits problem specially designed to illustrate the different exploration behaviors of the algorithms. In this setting, the utility of arm $i$ (indexed WLOG by decreasing $\mu_i$) is defined by $U(i, \boldsymbol{\mu}) = 0$ if $i < 5$ and $U(i, \boldsymbol{\mu}) = \mu_i$ otherwise (see Figure 2 for the reward distribution, and Figure 3 for the simulation results). As expected, maximizing policies like UCB aggressively explore the head of the distribution but, after around $10^3$ samples, fall into the rigged arms and neglect the second part of the distribution head. The wrapped versions are less sensitive to this trap, but the non-linearity of the utility and the violation of independence it induces both cripple their regret performance at the beginning of the runs. As expected, the SAVAGE versions perform well; more surprising is the side effect of sampling, which makes the algorithm more aggressive against weak arms.

Figure 3. Behavior of the different algorithms for 1000 simulations with the Figure 2 distribution. The top panel depicts the best-arm rate, the middle panel shows the cumulative regret, and the bottom one tracks the number of active arms for the elimination algorithms. Time scale is logarithmic.
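For concreteness, a hedged sketch of the rigged-bandit utility used in this experiment and of the "wrapped reward" proxy (the helper names are ours):

```python
def rigged_utility(i, mu, n_rigged=4):
    # U(i, mu) for the rigged bandit of Figure 2: with arms ranked by
    # decreasing mu, the n_rigged apparently best arms pay nothing.
    order = sorted(range(len(mu)), key=lambda j: mu[j], reverse=True)
    rank = order.index(i)                 # 0 = apparently best arm
    return 0.0 if rank < n_rigged else mu[i]

def wrapped_reward(i, mu_hat, n_rigged=4):
    # Proxy used by the "wrapped" policies: the same utility, but computed
    # on the current estimate of the parameters instead of the true mu.
    return rigged_utility(i, mu_hat, n_rigged)
```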
5.2. Dueling bandits simulations
For the dueling bandits simulations, we considered
Uniform, SAVAGE, and SAVAGE Sampling policies plus the following ones:
SAVAGE/Condorcet: Algorithm 1 with (12) for
the independence test;
SAVAGE Sampling/Borda: Algorithm 1 with a sampled oracle and a Borda relaxation of the Condorcet feasible set, i.e. $\exists i,\; \sum_j x_{i,j} > K/2$;

Interleave Filtering: as in (Yue et al., 2012);

Beat the Mean: as in (Yue & Joachims, 2011) with $\arg\max\{\hat P_b \mid b \in \mathcal{W}_l\}$ for $\hat d_t$.
5.2.1. Condorcet simulations
We first used "hard" $K \times K$ Condorcet preference matrices $\boldsymbol{\mu}$ defined by $\mu_{i,j} = \frac{1}{2} + j/(2K)$ for each $i < j$. The matrices of this family verify all the assumptions defined in Section 4, but they also offer some difficult duels involving the Condorcet winner (for instance, if $K = 100$ we have $\mu_{1,2} = 0.51$, hence $\Delta = 0.01$).

The results of these experiments appear in Figure 4 and Figure 5. The SAVAGE Sampling policy slightly improves on the formal version, but its heavy introspection cost makes it difficult to deploy on high-dimension problems. The low-dimension instance of Figure 4 is not favorable to Interleave Filtering and Beat the Mean, which were designed to drop the $O(K^2)$ term of the bound with partial – hence risky – exploration strategies (see Yue & Joachims, 2011).

Figure 4. Average good-prediction rate and regret for 500 simulations of a 30-armed Condorcet bandit instance. We used the Copeland index (8) to compute the regret.

Figure 5. Same setting as in Figure 4 with $K = 400$ and the horizon set to $10^8$. When we increase $K$ the problem becomes more difficult.
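The "hard" Condorcet matrices above are easy to generate; a small sketch (0-based indices in code, 1-based in the formula) with a sanity check that arm 1 is indeed the Condorcet winner:

```python
def hard_condorcet_matrix(k):
    # mu[i][j] = 1/2 + j/(2K) for i < j (1-based indices), and mu[j][i] = 1 - mu[i][j].
    mu = [[0.5] * k for _ in range(k)]
    for i in range(k):
        for j in range(i + 1, k):
            mu[i][j] = 0.5 + (j + 1) / (2.0 * k)   # j + 1: back to 1-based indexing
            mu[j][i] = 1.0 - mu[i][j]
    return mu

mu = hard_condorcet_matrix(100)
print(mu[0][1])                                     # 0.51, hence Delta = 0.01
print(all(mu[0][j] > 0.5 for j in range(1, 100)))   # True: arm 1 is the Condorcet winner
```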
5.2.2. General case simulations
In order to study the behavior of the algorithms with more realistic – non-Condorcet – preferences, we generated uniformly random preference matrices and performed the same experiments. As expected, when the Condorcet hypothesis is violated, the performance of the specialized algorithms (including SAVAGE/Condorcet) collapses well behind the baseline uniform exploration policy (see Figure 6).

Figure 6. Result of 500 simulations with 30×30 randomized preference matrices.

6. Conclusion

We proposed SAVAGE, a flexible and generic algorithm based on sensitivity analysis of parameters for online learning with indirect feedback. We provided PAC theoretical guarantees for this algorithm when used with proper independence predicates. We also proposed a generic "introspective sampling" method to approximate these predicates.

The "voting bandits" framework we proposed naturally extends dueling bandits to realistic situations where the preferences reflect mixed and inconsistent opinions. The SAVAGE algorithm is robust and clearly outperforms state-of-the-art algorithms in such situations.

Our simulations confirmed and reinforced the theoretical results on various parametric decision problems, from classical bandits to K-armed dueling bandits.

The construction of a generic exploration algorithm reaching optimality for any provided decision function remains a challenging open problem.

Acknowledgments

We would like to thank the reviewers for their careful reading and helpful comments.
References
Audibert, J.Y., Munos, R., and Szepesvári, Cs.
Exploration-exploitation trade-off using variance estimates in multi-armed bandits. Theoretical Computer Science, 2008.
Audibert, J.Y., Bubeck, S., and Munos, R. Best arm identification in multi-armed bandits. In COLT, Haifa, Israel, 2010. Omnipress.
Auer, P. and Ortner, R. UCB revisited: Improved
regret bounds for the stochastic multi-armed bandit problem. Period.Math.Hungar., 61(1–2):55–65,
2011.
Auer, P., Cesa-Bianchi, N., and Fischer, P. Finite-time
analysis of the multiarmed bandit problem. Mach.
Learn., 47(2-3):235–256, 2002.
Bubeck, S., Munos, R., and Stoltz, G. Pure exploration in finitely-armed and continuous-armed bandits. Theor. Comput. Sci., 412(19):1832–1852, 2011.
Chapelle, O., Joachims, T., Radlinski, F., and Yue,
Y. Large-scale validation and analysis of interleaved
search evaluation. ACM Trans. Inf. Syst., 30(1):6,
2012.
Charon, I. and Hudry, O. An updated survey on the
linear ordering problem for weighted or unweighted
tournaments. Annals OR, 175(1):107–158, 2010.
Chevaleyre, Y., Endriss, U., Lang, J., and Maudet,
N. A short introduction to computational social
choice. In SOFSEM, volume 4362 of LNCS, pp. 51–
69. Springer-Verlag, 2007.
Di Castro, D., Gentile, C., and Mannor, S. Bandits
with an edge. CoRR, abs/1109.2296, 2011.
Even-Dar, E., Mannor, S., and Mansour, Y. PAC
bounds for multi-armed bandit and Markov decision processes. In COLT, pp. 255–270, London, UK,
2002. Springer-Verlag.
Even-Dar, E., Mannor, S., and Mansour, Y. Action
elimination and stopping conditions for the multiarmed bandit and reinforcement learning problems.
JMLR, 7:1079–1105, 2006.
Feige, U., Raghavan, P., Peleg, D., and Upfal, E. Computing with noisy information. SIAM J. Comput.,
23(5):1001–1018, 1994.
Gardner, M. Mathematical games: The paradox of the nontransitive dice and the elusive principle of indifference. Scientific American, 223:110–114, December 1970.
Hoeffding, W. Probability inequalities for sums of
bounded random variables. J. of the American Statistical Association, 58(301):13–30, 1963.
Joachims, T. Evaluating retrieval performance using
clickthrough data. In Text Mining, pp. 79–96. Physica/Springer Verlag, 2003.
Lai, T.L. and Robbins, H. Asymptotically efficient
adaptive allocation rules. Advances in Applied
Mathematics, 6(1):4–22, 1985.
Ravikumar, B., Ganesan, K., and Lakshmanan, K.B.
On selecting the largest element in spite of erroneous
information. In STACS, volume 247 of LNCS, pp.
88–99. Springer Berlin Heidelberg, 1987.
Saltelli, A., Chan, K., and Scott, E.M. (eds.). Sensitivity analysis. Wiley series in probability and
statistics. J. Wiley & sons, New York, Chichester,
Weinheim, 2000.
Yue, Y. and Joachims, T. Interactively optimizing
information retrieval systems as a dueling bandits
problem. In ICML, volume 382 of ACM Proceeding
Series, pp. 1201–1208. ACM, 2009.
Yue, Y. and Joachims, T. Beat the mean bandit. In
ICML, pp. 241–248. Omnipress, 2011.
Yue, Y., Broder, J., Kleinberg, R., and Joachims, T.
The k-armed dueling bandits problem. J. Comput.
Syst. Sci., 78(5):1538–1556, 2012.