Optimal Adaptive Policies for Sequential Allocation Problems

Advances in Applied Mathematics 17, 122–142 (1996), Article No. 0007

Apostolos N. Burnetas
Department of Operations Research, Case Western Reserve University, Cleveland, Ohio 44106-7235

and

Michael N. Katehakis
GSM and RUTCOR, Rutgers University, Newark, New Jersey 07102-1895

Received March 10, 1995
Consider the problem of sequential sampling from $m$ statistical populations to maximize the expected sum of outcomes in the long run. Under suitable assumptions on the unknown parameters $\theta \in \Theta$, it is shown that there exists a class $C_R$ of adaptive policies with the following properties. (i) The expected $n$-horizon reward $V_n^{\pi^0}(\theta)$ under any policy $\pi^0$ in $C_R$ is equal to $n\mu^*(\theta) - M(\theta)\log n + o(\log n)$, as $n \to \infty$, where $\mu^*(\theta)$ is the largest population mean and $M(\theta)$ is a constant. (ii) Policies in $C_R$ are asymptotically optimal within a larger class $C_{UF}$ of "uniformly fast convergent" policies, in the sense that $\limsup_{n\to\infty} (n\mu^*(\theta) - V_n^{\pi^0}(\theta))/(n\mu^*(\theta) - V_n^{\pi}(\theta)) \le 1$ for any $\pi \in C_{UF}$ and any $\theta \in \Theta$ such that $M(\theta) > 0$. Policies in $C_R$ are specified via easily computable indices, defined as unique solutions to dual problems that arise naturally from the functional form of $M(\theta)$. In addition, the assumptions are verified for populations specified by nonparametric discrete univariate distributions with finite support. In the case of normal populations with unknown means and variances, we leave the verification of one assumption as an open problem. © 1996 Academic Press, Inc.
1. INTRODUCTION
Consider, for each $a = 1, \ldots, m$, the i.i.d. random variables $Y_a, Y_{aj}$, $j = 1, 2, \ldots$, with univariate density function $f_a(y, \theta_a)$ with respect to some known measure $\nu_a$, where $\theta_a$ is a vector of unknown parameters $(\theta_{a1}, \ldots, \theta_{a k_a})$. For each $a$, $k_a$ is known and the vector $\theta_a$ belongs to some known set $\Theta_a$ that in general depends on $a$ and is a subset of $\mathbb{R}^{k_a}$. The functional form of $f_a(\cdot,\cdot)$ is known and allowed to depend on $a$. The information specified by $(\Theta_a, f_a(\cdot,\cdot), \nu_a)$ is said in the literature to define population $a = 1, \ldots, m$. The practical interpretation of the model is that $Y_{aj}$ represents the reward received the $j$th time population $a$ is sampled. The objective is to determine an adaptive rule for sampling from the $m$ populations so as to maximize the sum of realized rewards $S_n = X_0 + X_1 + \cdots + X_{n-1}$ as $n \to \infty$, where $X_t$ is $Y_{ak}$ if at time $t$ population $a$ is sampled for the $k$th time.
In their paper [25] on this problem, Lai and Robbins give a method for constructing adaptive allocation policies that converge fast to an optimal one under complete information and possess the remarkable property that their finite-horizon expected loss due to ignorance ("regret") attains, asymptotically, a minimum value. The analysis was based on a theorem (Theorem 1) establishing the existence of an asymptotic lower bound on the regret of any policy in a certain class of candidate policies; see UF policies below. Knowledge of the functional form of this lower bound was used to construct, via suitably defined "upper confidence bounds" for the sample means of each population, adaptive allocation policies that attain it.

The assumptions that they made for the partial information model restricted the applicability of the method to the case in which each population is specified by a density that depends on a single unknown parameter, as is the case of a single-parameter exponential family.
The contributions of this paper are the following. (a) It is shown that Theorem 1 holds, under no parametric assumptions, for a suitable unique extension of the coefficient in the lower bound; see Theorem 1(1) below. (b) We give the explicit form of a new set of indices that are defined as the unique solutions to dual problems that arise naturally from the definition of the (new) lower bound. (c) We give sufficient conditions under which the adaptive allocation policies defined by these indices possess the optimality properties of Theorem 1(2) below. (d) We show that the sufficient conditions hold for an arbitrary nonparametric, discrete, univariate distribution. (e) We discuss the problem of normal populations with unknown variance, where we leave the verification of one sufficient condition as an open problem.

We first discovered the form of the indices used in this paper when we employed the dynamic programming approach to study a Bayes version of this problem [6, 7]. The ideas involved in the present paper are a natural extension of [25]; they are essentially a simplification of work in [8] on dynamic programming.

Our work is related to that of [33], which obtained adaptive policies with regret of order $O(\log n)$, as in our Theorem 1, for general nonparametric models, under appropriate assumptions on the rate of convergence of the estimates.

Starting with [31, 3], the literature on versions of this problem is large; see [24, 26, 22, 4, 15, 17–20] for work on the so-called multiarmed bandit problem and [16, 29, 30, 1, 12, 27, 14, 5, 2, 28] for more general dynamic programming extensions. For a survey see also [18].
2. THE PARTIAL INFORMATION MODEL
The statistical framework used below is as follows. For any population $a$ let $(\mathcal{Y}_a^{(n)}, \mathcal{B}_a^{(n)})$ denote the sample space of a sample of size $n$: $(Y_{a1}, \ldots, Y_{an})$, $1 \le n \le \infty$. For each $\theta_a \in \Theta_a$, let $P_{\theta_a}$ be the probability measure on $\mathcal{B}_a^{(1)}$ generated by $f_a(y, \theta_a)$ and $\nu_a$, and $P_{\theta_a}^{(n)}$ the measure on $\mathcal{B}_a^{(n)}$ generated by $n$ independent replications of $Y_a$. In what follows, $P_{\theta_a}^{(n)}$ will often be abbreviated as $P_{\theta_a}$. The joint (product) sample space for the $m$ populations will be denoted by $(\mathcal{Y}^{(n)}, \mathcal{B}^{(n)})$; the probability measure on $\mathcal{B}^{(n)}$ will be denoted by $P_\theta^{(n)}$ and abbreviated as $P_\theta$, where $\theta = (\theta_1, \ldots, \theta_m) \in \Theta := \Theta_1 \times \cdots \times \Theta_m$.
2.1. Sample Paths, Adaptive Policies, and Statistics

Let $A_t, X_t$, $t = 0, 1, \ldots$ denote, respectively, the action taken (i.e., population sampled) and the outcome observed at period $t$. A history or sample path at time $n$ is any feasible sequence of actions and observations during the first $n$ time periods, i.e., $\omega_n = (a_0, x_0, \ldots, a_{n-1}, x_{n-1})$. Let $(\Omega^{(n)}, \mathcal{F}^{(n)})$, $1 \le n \le \infty$, denote the sample space of the histories $\omega_n$, where $\Omega^{(n)}$ is the set of all histories $\omega_n$ and $\mathcal{F}^{(n)}$ is the $\sigma$-field generated by $\Omega^{(n)}$. Events will be defined on $\mathcal{F}^{(n)}$ or on $\mathcal{B}_a^{(n)}$ and will be denoted by capital letters. The complement of an event $B$ will be denoted by $\bar{B}$.

A policy $\pi$ represents a generally randomized rule for selecting actions (populations) based on the observed history; i.e., $\pi$ is a sequence $\{\pi_0, \pi_1, \ldots\}$ of history-dependent probability measures on the set of populations $\{1, \ldots, m\}$, so that $\pi_t(a) = \pi_t(a, \omega_t)$ is the probability that policy $\pi$ selects population $a$ at time $t$ when the observed history is $\omega_t$. Any policy $\pi$ generates a probability measure on $\mathcal{F}^{(n)}$ that will be denoted by $P_\theta^\pi$ (cf. [10, p. 47]). Let $C$ denote the set of all policies. Expectation under a policy $\pi \in C$ will be denoted by $E_\theta^\pi$. For notational convenience we may also use $\pi_t$ to denote the action selected by policy $\pi$ at time $t$.

Given the history $\omega_n$, let $T_n(a)$ denote the number of times population $a$ has been sampled, $T_n(a) := \sum_{t=0}^{n-1} 1\{\pi_t = a\}$. Finally, assume that there are estimators $\hat{\theta}_a^{T_n(a)} = g_a(Y_{a1}, \ldots, Y_{a T_n(a)}) \in \Theta_a$ for $\theta_a$. The initial estimates $\hat{\theta}_a^0$ are arbitrary, unless otherwise specified. Properties of the estimators are given by Conditions (A2) and (A3) below.
Remark 1. Note the distinction between the policy-dependent $(\Omega^{(n)}, \mathcal{F}^{(n)}, P_\theta^\pi)$ and policy-independent $(\mathcal{Y}_a^{(n)}, \mathcal{B}_a^{(n)}, P_{\theta_a}^{(n)})$ probability spaces; see also [33]. However, since $\hat{\theta}_a^j$ is a function of $Y_{a1}, \ldots, Y_{aj}$ only, it is easy to see by conditioning that relations of the following type hold, for any sequence of subsets $F_{naj}$ of $\Theta_a$, $n, j \ge 1$:

    $P_\theta^\pi(\hat{\theta}_a^{T_n(a)} \in F_{n a T_n(a)},\ T_n(a) = j) \le P_{\theta_a}(\hat{\theta}_a^j \in F_{naj})$,                  (2.1)

    $P_\theta^\pi(\hat{\theta}_a^{T_n(a)} \in F_{n a T_n(a)}) \le P_{\theta_a}(\hat{\theta}_a^j \in F_{naj}$ for some $j \le n)$.                  (2.2)
2.2. Unobservable Quantities

We next list notation regarding the unobservable quantities: the population means $\mu_a$; the Kullback–Leibler information number $I(\theta_a, \theta_a')$; the set $O(\theta)$ of "optimal" populations, for any parameter value $\theta$; the subset $\Delta\Theta_a(\theta_a)$ of the parameter space $\Theta_a$ consisting of all parameter values for which population $a$ is uniquely optimal (henceforth called critical); the minimum discrimination information $K_a(\theta)$ for the hypothesis that population $a$ is critical; the analogous quantities $\Delta\Theta_a(\theta_a, \varepsilon)$ and $J_a(\theta_a, \theta; \varepsilon)$ for $\mu^*(\theta) - \varepsilon$, for any $\varepsilon > 0$; the set $B(\theta)$ of all critical populations; and the constant $M(\theta)$:

    (1) (a) $\mu_a(\theta_a) := E_{\theta_a} Y_a$,
        (b) $I(\theta_a, \theta_a') := E_{\theta_a} \log(f_a(Y_a; \theta_a)/f_a(Y_a; \theta_a'))$,
    (2) (a) $\mu^* = \mu^*(\theta) := \max_{a=1,\ldots,m} \{\mu_a(\theta_a)\}$,
        (b) $O(\theta) := \{a : \mu_a = \mu^*(\theta)\}$,
    (3) (a) $\Delta\Theta_a(\theta_a, \varepsilon) := \{\theta_a' \in \Theta_a : \mu_a(\theta_a') > \mu^*(\theta) - \varepsilon\}$, for $\varepsilon > 0$,
        (b) $w_a(\theta_a, z) := \inf\{I(\theta_a, \theta_a') : \mu_a(\theta_a') > z\}$, for $-\infty \le z \le \infty$,
        (c) $J_a(\theta_a, \theta; \varepsilon) := w_a(\theta_a, \mu^*(\theta) - \varepsilon) = \inf\{I(\theta_a, \theta_a') : \theta_a' \in \Delta\Theta_a(\theta_a, \varepsilon)\}$, for $\varepsilon \ge 0$,
    (4) (a) $\Delta\Theta_a(\theta_a) := \Delta\Theta_a(\theta_a, 0) = \{\theta_a' \in \Theta_a : \mu_a(\theta_a') > \mu^*(\theta)\}$,
        (b) $B(\theta) := \{a : a \notin O(\theta) \text{ and } \Delta\Theta_a(\theta_a) \ne \emptyset\}$,
    (5) (a) $K_a(\theta) := J_a(\theta_a, \theta; 0) = \inf\{I(\theta_a, \theta_a') : \theta_a' \in \Delta\Theta_a(\theta_a)\}$, for $a \in B(\theta)$,
        (b) $M(\theta) := \sum_{a \in B(\theta)} (\mu^*(\theta) - \mu_a(\theta_a))/K_a(\theta)$.

In the definition of $M(\theta)$ we have used the fact that $K_a(\theta) \in (0, \infty)$, $\forall a \in B(\theta) \subset \bar{O}(\theta)$, which is a consequence of the fact that $I(\theta_a, \theta_a') = 0$ only when $\theta_a = \theta_a'$.
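As a concrete illustration of the quantities above (not an example from the paper), consider single-parameter Bernoulli populations with $\Theta_a = (0, 1)$ and $\mu_a(\theta_a) = \theta_a$. Then $I$ is the binary Kullback–Leibler divergence, every suboptimal $a$ belongs to $B(\theta)$ whenever $\mu^*(\theta) < 1$, and $K_a(\theta) = \inf\{I(\theta_a, \theta') : \theta' > \mu^*(\theta)\} = I(\theta_a, \mu^*(\theta))$, so $M(\theta)$ can be computed directly. A minimal sketch:

```python
import math

def kl_bern(p, q):
    """Binary Kullback-Leibler divergence I(p, q), for p, q in (0, 1)."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def regret_constant(theta):
    """M(theta) for Bernoulli populations with means theta_a in (0, 1):
    here K_a(theta) = kl_bern(theta_a, mu*) and every suboptimal a is in B(theta)."""
    mu_star = max(theta)
    return sum((mu_star - t) / kl_bern(t, mu_star)
               for t in theta if t < mu_star)

# hypothetical three-population instance
M = regret_constant([0.6, 0.4, 0.3])
```

By Theorem 1 below, $M(\theta)\log n$ is then the leading term of the minimal achievable regret for this instance.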
Under the assumptions made in [25], the constant $K_a(\theta)$ reduces to $I(\theta_a, \theta^*)$, thus giving the form for $M(\theta)$ used in that paper.
2.3. Optimality Criteria

Let $V_n^\pi(\theta) = E_\theta^\pi \sum_{t=0}^{n-1} X_t$ and $V_n(\theta) = n\mu^*(\theta)$ denote, respectively, the expected total reward during the first $n$ periods under policy $\pi$ and the optimal $n$-horizon reward, which would be achieved if the true value $\theta$ were known to the experimenter. Let $R_n^\pi(\theta) = V_n(\theta) - V_n^\pi(\theta)$ represent the loss, or regret, due to partial information, when policy $\pi$ is used; maximization of $V_n^\pi(\theta)$ with respect to $\pi$ is equivalent to minimization of $R_n^\pi(\theta)$.

In general it is not possible to find an adaptive policy that minimizes $R_n^\pi(\theta)$ uniformly in $\theta$. However, if we let $g^\pi(\theta) = \lim_{n\to\infty} V_n^\pi(\theta)/n$, then $\lim_{n\to\infty} (V_n(\theta) - V_n^\pi(\theta))/n = \lim_{n\to\infty} R_n^\pi(\theta)/n = \mu^*(\theta) - g^\pi(\theta) \ge 0$, $\forall \theta$.

A policy $\pi$ will be called uniformly convergent (UC) if $R_n^\pi(\theta) = o(n)$, and uniformly fast convergent (UF) if $R_n^\pi(\theta) = o(n^\alpha)$, $\forall \alpha > 0$; in both cases $\forall \theta \in \Theta$ as $n \to \infty$.

A UF policy $\pi^0$ will be called of uniformly maximal convergence rate (UM) if $\limsup_{n\to\infty} R_n^{\pi^0}(\theta)/R_n^\pi(\theta) \le 1$, $\forall \theta \in \Theta$ such that $M(\theta) > 0$, for all UF policies $\pi$. Note that, according to this definition, a UM policy has the maximum rate of convergence only for those values of the parameter space for which $M(\theta) > 0$; when $M(\theta) = 0$ it is UF. Let $C_{UC} \supset C_{UF} \supset C_{UM}$ denote the classes of UC, UF, and UM policies, respectively.
3. THE MAIN THEOREM
We start by giving the explicit form of the indices $U_a(\omega_k)$ which define a class of adaptive policies that will be shown to be UM under Conditions (A1)–(A3) below. For $a = 1, \ldots, m$, $\theta_a \in \Theta_a$, and $0 \le g \le \infty$, let

    $u_a(\theta_a, g) = \sup_{\theta_a' \in \Theta_a} \{\mu_a(\theta_a') : I(\theta_a, \theta_a') < g\}$.                  (3.1)

Given the history $\omega_k$ and the statistics $T_k(a)$, $\hat{\theta}_a = \hat{\theta}_a^{T_k(a)}$ for $\theta_a$, define the index $U_a(\omega_k)$ as

    $U_a(\omega_k) = u_a(\hat{\theta}_a, \log k / T_k(a))$,                  (3.2)

for $k \ge 1$; $U_a(\omega_0) = \mu_a(\hat{\theta}_a^0)$. We assume, throughout, that when $j = 0$ in a ratio of the form $\log k / j$, the latter is equal to $\infty$.
Note also that $U_a(\omega_k)$ is a function of $k$, $T_k(a)$, and $\hat{\theta}_a^{T_k(a)}$ only, and that we allow $T_k(a) = 0$ in (3.2), in which case $U_a(\omega_k) = \sup_{\theta_a' \in \Theta_a} \{\mu_a(\theta_a')\}$; in applications this will be equivalent to taking some small number of samples from each population to begin with.
Remark 2. (a) For all $a$ and $\omega_k$, $U_a(\omega_k) \ge \mu_a(\hat{\theta}_a)$; i.e., the indices are inflations of the current estimates of the means. In addition, $U_a(\omega_k)$ is increasing in $k$ and decreasing in $T_k(a)$, thus giving the "undersampled" actions a higher chance of being selected. In the case of a one-dimensional parameter vector, they yield the same value as those in [25, 23].

(b) The analysis remains the same if in the definition of $U_a(\omega_k)$ we replace $\log k / j$ by a function of the form $(\log k + h(\log k))/j$, where $h(t)$ is any function of $t$ with $h(t) = o(t)$ as $t \to \infty$. Up to this equivalence, the index $U_a(\omega_k)$ is uniquely defined.

(c) We note that $u_a(\theta_a, g)$ and $w_a(\theta_a, z)$ are connected by the following duality relation. The condition $u_a(\theta_a, g) > z$ implies $w_a(\theta_a, z) \le g$. In addition, when, for $g = g_0$, the supremum in $u_a(\theta_a, g_0)$ is attained at some $\theta_a^0 = \theta_a^0(g_0) \in \Theta_a$ such that $I(\theta_a, \theta_a^0) = g_0$ (as is the case, for example, when $\mu_a(\theta_a')$ is a linear function of $\theta_a'$), then $\theta_a^0$ also attains the infimum in $w_a(\theta_a, z_0)$ for $z_0 = u_a(\theta_a, g_0)$; i.e., $u_a(\theta_a, g_0) = \mu_a(\theta_a^0) = z_0$ and $w_a(\theta_a, z_0) = I(\theta_a, \theta_a^0) = g_0$. This type of duality is well known in finance [32, p. 113].

(d) For $z \in \mathbb{R}$, let $W_a(\omega_k, z) = w_a(\hat{\theta}_a^{T_k(a)}, z)$. It follows from (c) above that, $\forall \omega_k$, the condition $U_a(\omega_k) > z$ implies the condition $W_a(\omega_k, z) \le \log k / T_k(a)$.

Furthermore, when the supremum in $u_a(\hat{\theta}_a, \log k / T_k(a))$ is attained at some $\theta_a^0 = \theta_a^0(\omega_k) \in \Theta_a$ such that $I(\hat{\theta}_a, \theta_a^0) = \log k / T_k(a)$,

    $U_a(\omega_k) = \mu_a(\theta_a^0)$,                  (3.3)

    $W_a(\omega_k, \mu_a(\theta_a^0)) = I(\hat{\theta}_a, \theta_a^0) = \log k / T_k(a) = J_a(\hat{\theta}_a, \theta; \mu^*(\theta) - \mu_a(\theta_a^0))$.                  (3.4)
The conditions given below are sufficient for Theorem 1(2).

Condition A1. $\forall \theta \in \Theta$ and $\forall a \notin O(\theta)$ such that $\Delta\Theta_a(\theta_a, 0) = \emptyset$ and $\Delta\Theta_a(\theta_a, \varepsilon) \ne \emptyset$, $\forall \varepsilon > 0$, the following relation holds: $\lim_{\varepsilon \to 0} J_a(\theta_a, \theta; \varepsilon) = \infty$.

Condition A2. $P_{\theta_a}(\|\hat{\theta}_a^k - \theta_a\| > \varepsilon) = o(1/k)$, as $k \to \infty$, $\forall \varepsilon > 0$, $\forall \theta_a \in \Theta_a$, $\forall a$.

Condition A3. $P_{\theta_a}(u_a(\hat{\theta}_a^j, \log k / j) < \mu_a(\theta_a) - \varepsilon$, for some $j \le k) = o(1/k)$, as $k \to \infty$, $\forall \varepsilon > 0$, $\forall \theta_a \in \Theta_a$, $\forall a$.
Remark 3. To see the significance of Condition (A1), consider the following examples.

EXAMPLE 1. Take $m = 2$, $\Theta_1 = [0, 1]$, $\Theta_2 = [0, 0.5]$, $\mu^*(\theta) = \mu_1(\theta_1) = 0.5 > \mu_2(\theta_2)$, $f_a(y; \theta_a) = \theta_a^y (1 - \theta_a)^{(1-y)}$, $y = 0, 1$.

EXAMPLE 2. Take $\Theta_2 = [0, 0.51]$ in Example 1.

EXAMPLE 3. Take $\Theta_1 = \Theta_2 = [0, 1]$, $\mu^*(\theta) = \mu_1(\theta_1) = 1 > \mu_2(\theta_2)$, in Example 1.

Situations such as that in Example 1 are excluded, while Examples 2 and 3 satisfy (A1).
Remark 4. (a) Note that (A2) is a condition on the rate of convergence of $\hat{\theta}_a^k$ to $\theta_a$, and it holds in the usual case in which $\hat{\theta}_a^k$ either equals, or follows the same distribution as, the mean of i.i.d. random variables $Z_j$ with finite moment generating function in a neighborhood of zero. In this case (A2) can be verified using large deviation arguments. This implies that $\sum_{k=1}^{n-1} P_{\theta_a}(\|\hat{\theta}_a^k - \theta_a\| > \varepsilon) = o(\log n)$, as $n \to \infty$.

(b) From the continuity of $I(\theta_a, \theta_a')$ and, hence, of $J_a(\theta_a, \theta; \varepsilon)$ in $\theta_a$, it follows that the event $\{J_a(\hat{\theta}_a^k, \theta; \varepsilon) < J_a(\theta_a, \theta; \varepsilon) - \delta\}$ is contained in the event $\{\|\hat{\theta}_a^k - \theta_a\| > \eta\}$, for some $\eta = \eta(\delta) > 0$. Thus, Condition (A2) implies $P_{\theta_a}[J_a(\hat{\theta}_a^k, \theta; \varepsilon) < J_a(\theta_a, \theta; \varepsilon) - \delta] = o(1/k)$, as $k \to \infty$. The latter can be written in the form below, as required for the proof of Proposition 2(a): $\forall \delta > 0$,

    $\sum_{k=1}^{n-1} P_{\theta_a}(J_a(\hat{\theta}_a^k, \theta; \varepsilon) < J_a(\theta_a, \theta; \varepsilon) - \delta) = o(\log n)$   as $n \to \infty$.
Remark 5. Condition (A3) can be written as $\sum_{k=1}^{n-1} P_{\theta_a}[u_a(\hat{\theta}_a^j, \log k / j) < \mu_a(\theta_a) - \varepsilon$, for some $j \le k] = o(\log n)$, as $n \to \infty$. It is used in this form in the proof of Proposition 2(b).
Let $C_R$ denote the class of policies which in every period select any action with the largest index value $U_a(\omega_k) = u_a(\hat{\theta}_a, \log k / T_k(a))$. We can now state the following theorem.
THEOREM 1. (1) For any $\theta \in \Theta$ and $a \in B(\theta)$ such that $K_a(\theta) \ne 0$, the following is true, $\forall \pi \in C_{UF}$:

    $\liminf_{n\to\infty} E_\theta^\pi T_n(a)/\log n \ge 1/K_a(\theta)$.

(2) If Conditions (A1), (A2), and (A3) hold and $\pi^0 \in C_R$, then, $\forall \theta \in \Theta$:

    (a) $\limsup_{n\to\infty} E_\theta^{\pi^0} T_n(a)/\log n \le 1/K_a(\theta)$,   if $a \in B(\theta)$,
        $\lim_{n\to\infty} E_\theta^{\pi^0} T_n(a)/\log n = 0$,   if $a \notin B(\theta)$,

    (b) $0 \le R_n^{\pi^0}(\theta) = M(\theta)\log n + o(\log n)$, as $n \to \infty$,

    (c) $C_R \subset C_{UM}$.
Proof. Parts (1) and (2a) are proved in Propositions 1 and 2, respectively.

For part (2b), note first that

    $R_n^\pi(\theta) = n\mu^*(\theta) - \sum_{a=1}^m E_\theta^\pi \sum_{t=1}^{T_n(a)} Y_{at} = \sum_{a=1}^m (\mu^*(\theta) - \mu_a(\theta_a)) E_\theta^\pi T_n(a)$,   $\forall \pi \in C$.

Using the definition of $M(\theta)$ in Subsection 2.2, $\forall \pi^0 \in C_R$, $\forall \theta \in \Theta$,

    $\limsup_{n\to\infty} R_n^{\pi^0}(\theta)/\log n \le M(\theta)$   (from part (2a));

hence, $C_R \subset C_{UF}$. Thus, it follows from part (1) that

    $\liminf_{n\to\infty} R_n^{\pi^0}(\theta)/\log n \ge M(\theta)$,

and the proof is easy to complete using the above observations.

To show the last claim, we need only divide both $R_n^{\pi^0}(\theta)$ and $R_n^\pi(\theta)$ by $M(\theta)\log n$, when $M(\theta) > 0$.
Remark 6. (a) It is instructive to compare the maximum expected finite-horizon reward under complete information about $\theta$ (Eq. (3.5)) with the asymptotic expression for the expected finite-horizon reward of a UM policy $\pi^0$ under partial information about $\theta$ (Eq. (3.6)), established by Theorem 1:

    $V_n(\theta) = n\mu^*(\theta)$,                  (3.5)

    $V_n^{\pi^0}(\theta) = n\mu^*(\theta) - M(\theta)\log n + o(\log n)$   (as $n \to \infty$).                  (3.6)

(b) The results of Theorem 1 can be expressed in terms of the rate of convergence of $V_n^\pi(\theta)/n$ to $\mu^*(\theta)$, as follows. If $\pi \in C_{UC}$, then $\lim_{n\to\infty} V_n^\pi(\theta)/n = \mu^*(\theta)$ for all $\theta$; no claim regarding the rate of convergence can be made. If $\pi \in C_{UF}$, then it is also true that $|V_n^\pi(\theta)/n - \mu^*(\theta)| = o(n^{\alpha-1})$ for all $\theta$ (and $\forall \alpha > 0$); therefore $V_n^\pi(\theta)/n$ converges to $\mu^*(\theta)$ at least as fast as $\log n / n$. A UM policy $\pi^0$ is such that, for all $\theta \in \Theta$ with $M(\theta) > 0$, the rate of convergence of $V_n^{\pi^0}(\theta)/n$ to $\mu^*(\theta)$ is the maximum among all policies in $C_{UF} \subset C_{UC} \subset C$, and it is equal to $M(\theta)\log n / n$.

(c) For $\theta \in \Theta$ such that $M(\theta) = 0$, it is true that $V_n^{\pi^0}(\theta) = n\mu^*(\theta) + o(\log n)$; therefore $V_n^{\pi^0}(\theta)/n$ converges to $\mu^*(\theta)$ faster than $\log n / n$. However, this does not necessarily represent the fastest rate of convergence.
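A policy in $C_R$ is straightforward to simulate. The sketch below runs the index rule on two Bernoulli populations with hypothetical means $\theta = (0.6, 0.4)$ over $n = 2000$ periods (our own illustration, not an experiment from the paper); Theorem 1 suggests the suboptimal population should be sampled only on the order of $\log n$ times.

```python
import math, random

def kl_bern(p, q):
    """Binary KL divergence, with arguments clamped away from {0, 1}."""
    eps = 1e-12
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def index_bern(theta_hat, g, iters=60):
    """u_a(theta_hat, g) by bisection on [theta_hat, 1)."""
    lo, hi = theta_hat, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if kl_bern(theta_hat, mid) < g:
            lo = mid
        else:
            hi = mid
    return lo

def run(theta=(0.6, 0.4), n=2000, seed=1):
    """Play the arm of largest index U_a = u_a(mean_a, log k / T_k(a))."""
    rng = random.Random(seed)
    pulls = [0, 0]
    means = [0.0, 0.0]
    for k in range(n):
        if k < 2:
            a = k                      # sample each population once to start
        else:
            a = max(range(2), key=lambda i:
                    index_bern(means[i], math.log(k) / pulls[i]))
        x = 1.0 if rng.random() < theta[a] else 0.0
        pulls[a] += 1
        means[a] += (x - means[a]) / pulls[a]
    return pulls

pulls = run()
```

With these parameters the suboptimal arm typically receives on the order of $\log n / K_2(\theta) \approx \log(2000)/\mathrm{kl}(0.4, 0.6)$ pulls, far fewer than the optimal arm.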
In the proof of Proposition 1 below, we use the notation $\theta_{a'} := (\theta_1, \ldots, \theta_{a-1}, \theta_a', \theta_{a+1}, \ldots, \theta_m)$, $\forall \theta_a' \in \Delta\Theta_a(\theta_a)$, and the following remark.

Remark 7. (a) The definition of $\Delta\Theta_a(\theta_a)$ implies that if $a \in B(\theta) \ne \emptyset$, then $O(\theta_{a'}) = \{a\}$, $\forall \theta_a' \in \Delta\Theta_a(\theta_a)$, and thus $E_{\theta_{a'}}^\pi(n - T_n(a)) = o(n^\alpha)$, $\forall \alpha > 0$, $\forall \theta_a' \in \Delta\Theta_a(\theta_a)$, $\forall \pi \in C_{UF}$, the latter being a consequence of the definition of $C_{UF}$.

(b) Let $Z_t$ be i.i.d. random variables such that $S_n/n = \sum_{t=1}^n Z_t/n$ converges a.s. $(P)$ to a constant $\mu$, let $b_n$ be an increasing sequence of positive constants such that $b_n \to \infty$, and let $\lfloor b_n \rfloor$ denote the integer part of $b_n$. Then $\max_{k \le \lfloor b_n \rfloor}\{S_k\}/b_n$ converges a.s. $(P)$ to $\mu$ and, $\forall \delta > 0$,

    $P(\max_{k \le \lfloor b_n \rfloor}\{S_k\}/b_n > \mu + \delta) = o(1)$   (as $n \to \infty$).

PROPOSITION 1. If $\pi \in C_{UF}$, then, for any $a \in B(\theta) \ne \emptyset$,

    $\liminf_{n\to\infty} E_\theta^\pi T_n(a)/\log n \ge 1/K_a(\theta)$.                  (3.7)
Proof. The proof is an adaptation of the proof of Theorem 1 in [25] for the constant $K_a(\theta)$. From the Markov inequality it follows that

    $P_\theta^\pi(T_n(a)/\log n \ge 1/K_a(\theta)) \le E_\theta^\pi T_n(a) K_a(\theta)/\log n$,   $\forall n > 1$.

Thus, to show (3.7), it suffices to show that

    $\lim_{n\to\infty} P_\theta^\pi(T_n(a)/\log n \ge 1/K_a(\theta)) = 1$

or, equivalently,

    $\lim_{n\to\infty} P_\theta^\pi(T_n(a)/\log n < (1 - \varepsilon)/K_a(\theta)) = 0$,   $\forall \varepsilon > 0$.                  (3.8)

By the definition of $K_a(\theta)$ we have, $\forall \delta > 0$, $\exists \theta_a' = \theta_a'(\delta) \in \Delta\Theta_a(\theta_a)$ such that $K_a(\theta) < I(\theta_a, \theta_a') < (1 + \delta)K_a(\theta)$.

Fix such a $\delta > 0$ and $\theta_a'$, let $I_\delta = I(\theta_a, \theta_a')$, and define the sets $A_n^\delta := \{T_n(a)/\log n < (1 - \delta)/I_\delta\}$ and $C_n^\delta := \{\log L_{T_n(a)} \le (1 - \delta/2)\log n\}$, where $\log L_k = \sum_{i=1}^k \log(f_a(Y_{ai}; \theta_a)/f_a(Y_{ai}; \theta_a'))$.

We will show that $P_\theta^\pi(A_n^\delta) = P_\theta^\pi(A_n^\delta C_n^\delta) + P_\theta^\pi(A_n^\delta \bar{C}_n^\delta) = o(1)$, as $n \to \infty$, $\forall \delta > 0$. Indeed, $P_\theta^\pi(A_n^\delta C_n^\delta) \le n^{1-\delta/2} P_{\theta_{a'}}^\pi(A_n^\delta C_n^\delta) \le n^{1-\delta/2} P_{\theta_{a'}}^\pi(A_n^\delta) \le n^{1-\delta/2} E_{\theta_{a'}}^\pi(n - T_n(a))/(n - (1 - \delta)\log n / I_\delta) = o(n^\alpha)/(n^{\delta/2}(1 - O(\log n)/n)) = o(1)$, for $\alpha < \delta/2$.

The first inequality follows from the observation that on $C_n^\delta \cap \{T_n(a) = k\}$ we have $f_a(Y_{a1}; \theta_a) \cdots f_a(Y_{ak}; \theta_a) \le n^{1-\delta/2} f_a(Y_{a1}; \theta_a') \cdots f_a(Y_{ak}; \theta_a')$; note also that $e^{(1-\delta/2)\log n} = n^{1-\delta/2}$. The third relation is the Markov inequality, and the fourth is due to Remark 7(a) above.

To see that $P_\theta^\pi(A_n^\delta \bar{C}_n^\delta) = o(1)$, note that

    $P_\theta^\pi(A_n^\delta \bar{C}_n^\delta) \le P_\theta^\pi(\max_{k \le \lfloor b_n \rfloor}\{\log L_k\} > (1 - \delta/2)\log n)$
    $= P_\theta^\pi(\max_{k \le \lfloor b_n \rfloor}\{\log L_k\}/b_n > I_\delta(1 + \delta/(2(1 - \delta))))$
    $\le P_{\theta_a}(\max_{k \le \lfloor b_n \rfloor}\{\log L_k\}/b_n > I_\delta(1 + \delta/(2(1 - \delta))))$,

where $b_n := (1 - \delta)\log n / I_\delta$ and the last inequality follows by an argument like that in Remark 1. Thus the result follows from Remark 7(b), since $\log L_k / k \to I_\delta$ a.s. $(P_{\theta_a})$.

To complete the proof of (3.8), it suffices to notice that the choices of $\delta$ and $\theta_a'(\delta)$ imply $(1 - \delta)/I_\delta > (1 - \delta)/((1 + \delta)K_a(\theta)) > (1 - \varepsilon)/K_a(\theta)$, and $P_\theta^\pi(T_n(a)/\log n < (1 - \varepsilon)/K_a(\theta)) \le P_\theta^\pi(A_n^\delta) = o(1)$, when $\delta < \varepsilon/(2 - \varepsilon)$.
To facilitate the proof of Proposition 2 below we introduce some notation and state a remark.

For any $\varepsilon > 0$, let

    $T_n^{(1)}(a, \varepsilon) = \sum_{k=1}^{n-1} 1(\pi_k = a, U_a(\omega_k) > \mu^*(\theta) - \varepsilon)$,

and

    $T_n^{(2)}(a, \varepsilon) = \sum_{k=1}^{n-1} 1(\pi_k = a, U_a(\omega_k) \le \mu^*(\theta) - \varepsilon)$.
Remark 8. Let $Z_t$ be any sequence of constants (or random variables) and let $\tau_k := \sum_{t=0}^{k-1} 1\{Z_t = a\}$. This definition of $\tau_k$ implies that (pointwise, if we have random variables)

    $\sum_{k=1}^{n-1} 1\{Z_k = a, \tau_k \le c\} \le c + 1$.

Indeed, note that $\sum_{k=1}^{n-1} 1\{Z_k = a, \tau_k = i\} \le 1$, $\forall i = 0, \ldots, \lfloor c \rfloor$. Therefore, $\sum_{k=1}^{n-1} 1\{Z_k = a, \tau_k \le c\} = \sum_{k=1}^{n-1} \sum_{i=0}^{\lfloor c \rfloor} 1\{Z_k = a, \tau_k = i\} = \sum_{i=0}^{\lfloor c \rfloor} \sum_{k=1}^{n-1} 1\{Z_k = a, \tau_k = i\} \le \lfloor c \rfloor + 1 \le c + 1$.
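The counting bound of Remark 8 is easy to verify numerically; the following small sketch (our own check, not from the paper) confirms it on arbitrary sequences.

```python
import random

def count_bound_holds(zs, a, c):
    """Remark 8: with tau_k = #{t < k : z_t = a}, the number of indices
    k = 1, ..., n-1 with z_k = a and tau_k <= c cannot exceed c + 1."""
    tau, hits = 0, 0
    for k, z in enumerate(zs):
        if k >= 1 and z == a and tau <= c:
            hits += 1
        if z == a:          # tau for the next index counts occurrences up to k
            tau += 1
    return hits <= c + 1

rng = random.Random(0)
ok = all(count_bound_holds([rng.randrange(3) for _ in range(200)], a=1, c=7.5)
         for _ in range(20))
```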
PROPOSITION 2. For any $\theta \in \Theta$ the following are true:

(a) Under (A1) and (A2), if $\pi^0 \in C_R$ and $a \notin O(\theta)$, then

    $\lim_{\varepsilon\to 0} \limsup_{n\to\infty} E_\theta^{\pi^0} T_n^{(1)}(a, \varepsilon)/\log n \le 1/K_a(\theta)$,   if $a \in B(\theta)$,                  (3.9)

    $\lim_{\varepsilon\to 0} \limsup_{n\to\infty} E_\theta^{\pi^0} T_n^{(1)}(a, \varepsilon)/\log n = 0$,   if $a \notin B(\theta)$.                  (3.10)

(b) Under (A3), if $\pi^0 \in C_R$, then $\lim_{n\to\infty} E_\theta^{\pi^0} T_n^{(2)}(a, \varepsilon)/\log n = 0$, $\forall a$ and $\forall \varepsilon > 0$.

(c) Under (A1), (A2), and (A3), if $\pi^0 \in C_R$, then $\limsup_{n\to\infty} E_\theta^{\pi^0} T_n(a)/\log n$ is less than or equal to $1/K_a(\theta)$ if $a \in B(\theta)$, and it is equal to $0$ if $a \notin B(\theta)$.
Proof. (a) Fix $\pi^0 \in C_R$, $\theta \in \Theta$, $a \notin O(\theta)$, i.e., $\mu^* > \mu_a(\theta_a)$. Let $\varepsilon \in (0, \mu^* - \mu_a(\theta_a))$, and consider two cases.

Case 1. There exists $\varepsilon_0 > 0$ such that $\Delta\Theta_a(\theta_a, \varepsilon_0) = \emptyset$. For any $\varepsilon < \varepsilon_0$ and any $\theta_a' \in \Theta_a$ it is true that $\mu_a(\theta_a') \le \mu^*(\theta) - \varepsilon_0 < \mu^*(\theta) - \varepsilon$; therefore, $T_n^{(1)}(a, \varepsilon) = 0$ for all $\varepsilon < \varepsilon_0$, and (3.10) holds.

Case 2. $\Delta\Theta_a(\theta_a, \varepsilon) \ne \emptyset$, $\forall \varepsilon > 0$. Note that $J_a(\theta_a, \theta; \varepsilon) > 0$, $\forall \varepsilon \in (0, \mu^* - \mu_a(\theta_a))$. Let $J_\varepsilon = J_a(\theta_a, \theta; \varepsilon)$ and $\hat{J}_\varepsilon = J_a(\hat{\theta}_a^{T_k(a)}, \theta; \varepsilon)$; then, $\forall \delta > 0$, we have sample-path-wise:

    $T_n^{(1)}(a, \varepsilon) \le \sum_{k=1}^{n-1} 1(\pi_k^0 = a, \hat{J}_\varepsilon \le \log k / T_k(a))$
    $= \sum_{k=1}^{n-1} 1(\pi_k^0 = a, \hat{J}_\varepsilon \le \log k / T_k(a), \hat{J}_\varepsilon \ge J_\varepsilon - \delta) + \sum_{k=1}^{n-1} 1(\pi_k^0 = a, \hat{J}_\varepsilon \le \log k / T_k(a), \hat{J}_\varepsilon < J_\varepsilon - \delta)$
    $\le \sum_{k=1}^{n-1} 1(\pi_k^0 = a, T_k(a) \le \log n / (J_\varepsilon - \delta)) + \sum_{k=1}^{n-1} 1(\pi_k^0 = a, \hat{J}_\varepsilon < J_\varepsilon - \delta)$
    $\le \log n / (J_\varepsilon - \delta) + 1 + \sum_{k=1}^{n-1} 1(\pi_k^0 = a, \hat{J}_\varepsilon < J_\varepsilon - \delta)$.

For the first inequality we have used an immediate consequence of the "duality" relations of Remark 2(c) with $z_0 = \mu^*(\theta) - \varepsilon$, which imply that the event $\{U_a(\omega_k) > \mu^*(\theta) - \varepsilon\}$ is contained in the event $\{J_a(\hat{\theta}_a^{T_k(a)}, \theta; \varepsilon) \le \log k / T_k(a)\}$. For the last inequality we have used Remark 8.

Taking expectations of the first and last terms and using Remark 4(b), it follows that, since $\delta$ is arbitrarily small, $E_\theta^{\pi^0} T_n^{(1)}(a, \varepsilon) \le 1 + \log n / J_a(\theta_a, \theta; \varepsilon) + E_\theta^{\pi^0}[\sum_{k=1}^{n-1} 1(\pi_k^0 = a, \hat{J}_\varepsilon < J_\varepsilon - \delta)]$. In addition,

    $E_\theta^{\pi^0} \sum_{k=1}^{n-1} 1(\pi_k^0 = a, \hat{J}_\varepsilon < J_\varepsilon - \delta) = E_\theta^{\pi^0} \sum_{j=0}^{T_n(a)} 1(J_a(\hat{\theta}_a^j, \theta; \varepsilon) < J_\varepsilon - \delta)$
    $\le E_\theta^{\pi^0} \sum_{j=0}^{n-1} 1(J_a(\hat{\theta}_a^j, \theta; \varepsilon) < J_\varepsilon - \delta)$
    $\le E_{\theta_a} \sum_{j=0}^{n-1} 1(J_a(\hat{\theta}_a^j, \theta; \varepsilon) < J_\varepsilon - \delta) = o(\log n)$,

where the second inequality follows from Remark 1 and the last equality follows from Remark 4(b). Therefore, $E_\theta^{\pi^0} T_n^{(1)}(a, \varepsilon) \le \log n / J_a(\theta_a, \theta; \varepsilon) + o(\log n)$.

Thus the proof of part (a) is complete, since $\lim_{\varepsilon\to 0} J_a(\theta_a, \theta; \varepsilon) = K_a(\theta)$ if $\Delta\Theta_a(\theta_a, 0) \ne \emptyset$, from the definition of $K_a(\theta)$, and $\lim_{\varepsilon\to 0} J_a(\theta_a, \theta; \varepsilon) = \infty$ if $\Delta\Theta_a(\theta_a, 0) = \emptyset$, from (A1).
(b) Note first that for $\pi^0 \in C_R$ the following inequality holds pointwise on $\Omega^{(n)}$:

    $T_n^{(2)}(a, \varepsilon) \le \sum_{k=1}^{n-1} 1(u_{a^*}(\hat{\theta}_{a^*}^j, \log k / j) \le \mu^*(\theta) - \varepsilon$, for some $j \le k)$,   $\forall a^* \in O(\theta)$, $\forall \theta \in \Theta$.

Indeed, $T_n^{(2)}(a, \varepsilon) = \sum_{k=1}^{n-1} 1(\pi_k^0 = a, U_a(\omega_k) \le \mu^*(\theta) - \varepsilon)$, and, since $\pi^0 \in C_R$, the condition $\pi_k^0 = a$ implies that $U_a(\omega_k) = \max_{a'} U_{a'}(\omega_k) \ge U_{a^*}(\omega_k)$; thus, the event $\{\pi_k^0 = a, U_a(\omega_k) \le \mu^*(\theta) - \varepsilon\}$ is contained in the event $\{U_{a^*}(\omega_k) \le \mu^*(\theta) - \varepsilon\}$, for any $a^* \in O(\theta)$. The latter event is contained in the event $\{u_{a^*}(\hat{\theta}_{a^*}^j, \log k / j) \le \mu^*(\theta) - \varepsilon$, for some $j \le k\}$. Therefore, using also (2.2),

    $E_\theta^{\pi^0} T_n^{(2)}(a, \varepsilon) \le \sum_{k=1}^{n-1} P_{\theta_{a^*}}(u_{a^*}(\hat{\theta}_{a^*}^j, \log k / j) \le \mu^*(\theta) - \varepsilon$, for some $j \le k) = o(\log n)$,

by Condition (A3).

The proof of (c) follows from (a) and (b) when we let $\varepsilon \to 0$, since $T_n(a) \le 1 + T_n^{(1)}(a, \varepsilon) + T_n^{(2)}(a, \varepsilon)$, $\forall n \ge 1$, $\forall \varepsilon > 0$.
4. APPLICATIONS OF THEOREM 1
4.1. Discrete Distributions with Finite Support
Assume that the observations $Y_{aj}$ from population $a$ follow a univariate discrete distribution, i.e., $f_a(y, p_a) = p_{ay} = P(Y_a = y)$, $y \in S_a = \{r_{a1}, \ldots, r_{a d_a}\}$, where the unknown parameters $p_{ay}$ are in $\Theta_a = \{p_a \in \mathbb{R}^{d_a} : p_{ay} > 0, \forall y = 1, \ldots, d_a, \sum_y p_{ay} = 1\}$ and the $r_{ay}$ are known. Here we use the notation $\theta_a = p_a$, $\theta = p = (p_1, \ldots, p_m)$, and $\nu_a$ is the counting measure on $\{r_{a1}, \ldots, r_{a d_a}\}$.

Thus we can write $I(p_a, q_a) = \sum_{y=1}^{d_a} p_{ay} \log(p_{ay}/q_{ay})$, $\mu_a(p_a) = r_a' p_a = \sum_y r_{ay} p_{ay}$, $\mu^* = \mu^*(p) = \max_a \{r_a' p_a\}$, and $\Delta\Theta_a(p, \varepsilon) = \{q_a : \mu_a(q_a) > \mu^*(p) - \varepsilon\}$, where $r_a'$ denotes the transpose of the vector $r_a$. Note that computation of the constant $K_a(p)$ as a function of $p$ involves the minimization of a convex function subject to two linear constraints; hence,

    $K_a(p) = w_a(p_a, \mu^*(p)) = \min_{q_a \ge 0} \{I(p_a, q_a) : r_a' q_a \ge \mu^*(p),\ \sum_{y=1}^{d_a} q_{ay} = 1\}$.                  (4.1)
For any estimator $\hat{p}_a^t$ of $p_a$, the computation of the index $U_a(\omega_k)$ involves the solution of the dual problem of (4.1) (with $p_a$ replaced by $\hat{p}_a^t$ in (4.1)), which in this case is the maximization of a linear function subject to a constraint with convex level sets and a linear constraint; hence, with $t = T_k(a)$,

    $U_a(\omega_k) = u_a(\hat{p}_a^t, \log k / t) = \max_{q_a \ge 0} \{r_a' q_a : I(\hat{p}_a^t, q_a) \le \log k / t,\ \sum_{y=1}^{d_a} q_{ay} = 1\}$.                  (4.2)
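Both (4.1) and (4.2) can be reduced to one-dimensional searches. For (4.2), the Lagrangian stationarity conditions give an optimizer of the form $q_y = \lambda p_y/(\nu - r_y)$ with $\lambda = 1/\sum_y p_y/(\nu - r_y)$ and $\nu > \max_y r_y$; the attained mean is $r'q = \nu - \lambda$ and the attained divergence $I(p, q)$ is decreasing in $\nu$, so the constraint can be made tight by bisection on $\nu$. By the duality of Remark 2(c), $K_a(p)$ in (4.1) is the divergence level at which the index reaches $\mu^*(p)$. The sketch below (our own numerical illustration, with hypothetical helper names and data, not the authors' code) implements this reduction:

```python
import math

def dual_stats(p, r, nu):
    """For nu > max(r): multiplier lam, attained mean r'q, and I(p, q)
    at the dual-feasible point q_y = lam * p_y / (nu - r_y)."""
    s = sum(py / (nu - ry) for py, ry in zip(p, r))
    lam = 1.0 / s
    mean = nu - lam           # since sum_y q_y (nu - r_y) = lam and sum_y q_y = 1
    kl = sum(py * math.log(nu - ry) for py, ry in zip(p, r)) + math.log(s)
    return lam, mean, kl

def index_discrete(p, r, gamma, iters=200):
    """U = max{ r'q : I(p, q) <= gamma }, Eq. (4.2): bisect on nu,
    since the attained divergence decreases in nu."""
    if gamma <= 0:
        return sum(py * ry for py, ry in zip(p, r))
    rmax = max(r)
    lo, hi = rmax + 1e-12, rmax + 1.0
    while dual_stats(p, r, hi)[2] > gamma:
        hi = rmax + 2 * (hi - rmax)
    for _ in range(iters):
        mid = (lo + hi) / 2
        if dual_stats(p, r, mid)[2] > gamma:
            lo = mid
        else:
            hi = mid
    return dual_stats(p, r, hi)[1]

def K_discrete(p, r, mu_star, iters=200):
    """K_a(p) of Eq. (4.1), assuming r'p < mu_star < max(r): bisect nu until
    the attained mean equals mu_star, then report the divergence there."""
    rmax = max(r)
    lo, hi = rmax + 1e-12, rmax + 1.0
    while dual_stats(p, r, hi)[1] > mu_star:   # mean decreases toward r'p as nu grows
        hi = rmax + 2 * (hi - rmax)
    for _ in range(iters):
        mid = (lo + hi) / 2
        if dual_stats(p, r, mid)[1] > mu_star:
            lo = mid
        else:
            hi = mid
    return dual_stats(p, r, hi)[2]

p, r = [0.5, 0.3, 0.2], [0.0, 0.5, 1.0]     # hypothetical support and masses
U = index_discrete(p, r, 0.1)               # inflated mean, between r'p and max(r)
K = K_discrete(p, r, 0.6)                   # discrimination level at mu* = 0.6
```

The round trip `index_discrete(p, r, K_discrete(p, r, z)) = z` reflects exactly the primal–dual relation between (4.1) and (4.2).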
In Proposition 3 below it is shown that Conditions (A1), (A2), and (A3) are satisfied for estimators defined from the observations from population $a$. Given $\omega_k$ with $T_k(a) = t$, define:

    (1) For $t \ge 1$, $n_t(y; a) = \sum_{j=1}^t 1(Y_{aj} = r_{ay})$, $f_t(y; a) = n_t(y; a)/t$, and $f_t(a) = [f_t(y; a)]_{y \in S_a}$.

    (2) For $t \ge 0$, let $\hat{p}_{a,y}^t = (1 - w_t)/d_a + w_t f_t(y; a)$, where $w_t = t/(d_a + t)$, and let $\hat{p}_a^t = [\hat{p}_{a,y}^t]_{y \in S_a}$.
In the proof of Proposition 3 we make use of the following quantities and properties:

    (1) For $a = 1, \ldots, m$ and $p_a, q^1, q^2 \in \Theta_a$, let $l(p_a; q^1, q^2) = I(p_a, q^2) - I(p_a, q^1) = \sum_{y \in S_a} p_{ay} \log[q^1_y/q^2_y]$.

    (2) Let $L_t(q^1, q^2) = \prod_{j=1}^t q^1_{Y_{aj}}/q^2_{Y_{aj}}$.

    (3) For $p_a \in \Theta_a$, let $F_{kt}(p_a) = \{q \in \Theta_a : I(p_a, q) \le \log k / t\}$.

Note that $U_a(\omega_k) = \sup\{r_a' q : q \in F_{k, T_k(a)}(\hat{p}_a^{T_k(a)})\}$.
Remark 9. (a) $\log L_t(q^1, q^2) = t\, l(f_t(a); q^1, q^2)$.

(b) $\sup_{q^1} l(p_a; q^1, q^2) = l(p_a; p_a, q^2) = I(p_a, q^2)$.

(c) $e^{t I(f_t(a), q^2)} = e^{t\, l(f_t(a); f_t(a), q^2)} = \sup_{q^1} L_t(q^1, q^2)$.

(d) $t \cdot l(\hat{p}_a^t; q^1, q^2) = w_t[b(q^1, q^2) + t\, l(f_t(a); q^1, q^2)]$, where $b(q^1, q^2) = \sum_y \log[q^1_y/q^2_y]$.

(e) $t I(\hat{p}_a^t, q^2) \le w_t[b^0(q^2) + t I(f_t(a), q^2)]$, where $b^0(q^2) = -\sum_y \log q^2_y$.

Indeed, (a) follows from the observation that

    $\log L_t(q^1, q^2) = \sum_{j=1}^t \log(q^1_{Y_{aj}}/q^2_{Y_{aj}}) = \sum_{j=1}^t \sum_{y=1}^{d_a} 1(Y_{aj} = y) \log(q^1_y/q^2_y) = t\, l(f_t(a); q^1, q^2)$.

(b) is a restatement of the information inequality $-I(p_a, q^1) \le 0$. (c) follows from (b). To see (d) and (e), recall that $\hat{p}_a^t = (1 - w_t)/d_a + w_t f_t(a)$, where $w_t = t/(t + d_a)$; note that $t(1 - w_t)/d_a = w_t$, and use (a) (for (d)) and (b) (for (e)).
PROPOSITION 3. The discrete distribution model satisfies Conditions (A1), (A2), and (A3) of Theorem 1.

Proof. (1) It is easy to see that Condition (A1) holds. Indeed, note that, $\forall \varepsilon \ge 0$, $\Delta\Theta_a(p; \varepsilon) \ne \emptyset$ if and only if $\max_y r_{ay} > \mu^*(p) - \varepsilon$. Thus, if $\Delta\Theta_a(p, \varepsilon) \ne \emptyset$, $\forall \varepsilon > 0$, and $\Delta\Theta_a(p) = \emptyset$, then $\max_y r_{ay} = \mu^*(p)$ and $\lim_{\varepsilon\to 0} J(p_a, p; \varepsilon) = I(p_a, q^e) = \infty$, where $q^e$ is the unit vector of $\mathbb{R}^{d_a}$ with nonzero component corresponding to $\max_y r_{ay}$.

(2) We next show that Condition (A2) holds. Since $1 - w_t \to 0$, it follows from the definition of $\hat{p}_{a,y}^t$ that for any $\varepsilon > 0$ there exists $t_0(\varepsilon) \ge 1$ such that

    $P_{p_a}(|\hat{p}_{a,y}^t - p_{ay}| > \varepsilon) \le P_{p_a}(|f_t(y; a) - p_{ay}| > \varepsilon/2)$,   $\forall t \ge t_0$.

Because $f_t(y; a)$ is the average of i.i.d. Bernoulli random variables with mean $p_{ay}$, it follows from standard results of large deviations theory (cf. [11; 9, p. 27]) that $P_{p_a}[|f_t(y; a) - p_{ay}| > \varepsilon/2] \le C e^{-g t}$ for some $C, g > 0$; therefore, for $t \ge t_0(\varepsilon)$, $P_{p_a}[|\hat{p}_{a,y}^t - p_{ay}| > \varepsilon] \le C e^{-g t} = o(1/t)$.

(3) To show that the model satisfies Condition (A3), we must prove that $P_{p_a}[\bigcup_{t=0}^k B_{kt}] = o(1/k)$, as $k \to \infty$, $\forall \varepsilon > 0$, $\forall a$, where $B_{kt} = B_{kt}(a, \varepsilon) = \{u_a(\hat{p}_a^t, \log k / t) \le \mu_a(p_a) - \varepsilon\}$.

On the event $B_{kt}$ it is true that $\mu_a(q) < \mu_a(p_a) - \varepsilon$, $\forall q \in F_{kt}(\hat{p}_a^t)$. Therefore, $B_{kt} \subset B_{kt}'$, where $B_{kt}' = \{\mu_a(q) < \mu_a(p_a) - \varepsilon, \forall q \in F_{kt}(\hat{p}_a^t)\}$; thus it suffices to prove that $\sum_{t=0}^{k-1} P_{p_a}[B_{kt}'] = o(1/k)$.

From Lemma 1 it follows that, for $\varepsilon > 0$ sufficiently small, there exists a probability vector $q^0 = q^0(\varepsilon) \in \Theta_a$ such that $\mu(q^0) = \mu_a(p_a) - \varepsilon$ and $\{q : \mu(q) < \mu_a(p_a) - \varepsilon\} = \{q : l(q; p_a, q^0) < -c_\varepsilon\}$, where $c_\varepsilon = I(q^0, p_a)$. Hence, $B_{kt}' \subset B_{kt}''$, where $B_{kt}'' = \{l(q; p_a, q^0) < -c_\varepsilon, \forall q \in F_{kt}(\hat{p}_a^t)\}$, and it is sufficient to prove that $\sum_{t=0}^{k-1} P_{p_a}[B_{kt}''] = o(1/k)$.

On the event $B_{kt}''$ the following are true. First, since $\hat{p}_a^t \in F_{kt}(\hat{p}_a^t)$, it follows that $l(\hat{p}_a^t; p_a, q^0) < -c_\varepsilon$. Second, since $l(q^0; p_a, q^0) = -c_\varepsilon$, it follows that $q^0 \notin F_{kt}(\hat{p}_a^t)$, i.e., $I(\hat{p}_a^t, q^0) > \log k / t$. Therefore, $B_{kt}'' \subset B_{kt}'''$, where $B_{kt}''' = \{l(\hat{p}_a^t; p_a, q^0) < -c_\varepsilon,\ I(\hat{p}_a^t, q^0) > \log k / t\}$. Let $\bar{I}(q^0) = \max_{q \in \Theta_a} I(q, q^0) = \max_{y \in S_a} \log(1/q^0_y) < \infty$.

For $t < \log k / \bar{I}(q^0)$, it is true that $I(\hat{p}_a^t, q^0) \le \bar{I}(q^0) < \log k / t$; thus $B_{kt}''' = \emptyset$. Therefore, in order to prove the proposition, it suffices to show that $\sum_{t=\lfloor \log k / \bar{I} \rfloor}^{k-1} P_{p_a}[B_{kt}'''] = o(1/k)$. This follows from Lemma 2(2); thus the proof is complete.
LEMMA 1. For any $p_a \in \Theta_a$ and $\varepsilon > 0$ sufficiently small, there exists a vector $q^0(\varepsilon) \in \Theta_a$ such that $\mu_a(q^0(\varepsilon)) = \mu_a(p_a) - \varepsilon$ and $\{q \in \Theta_a : \mu_a(q) < \mu_a(p_a) - \varepsilon\} = \{q \in \Theta_a : l(q; p_a, q^0(\varepsilon)) < -I(q^0(\varepsilon), p_a)\}$.
Proof. For $\nu \ge 0$ define $\tilde q(\nu) \in S_a$ as follows: $\tilde q_y(\nu) = p_{ay} e^{-\nu r_{ay}}/b(\nu)$, where $b(\nu) = \sum_{y \in S_a} p_{ay} e^{-\nu r_{ay}}$. We prove that for all $\varepsilon > 0$ there exists $\nu = \nu(\varepsilon) > 0$ such that $H_\mu(\varepsilon) = H_l(\nu(\varepsilon))$, where $H_\mu(\varepsilon) = \{q : \mu_a(q) = \mu_a(p_a) - \varepsilon\}$, $H_l(\nu) = \{q : l(q; p_a, \tilde q(\nu)) = -c_\nu\}$, and $c_\nu = -l(\tilde q(\nu); p_a, \tilde q(\nu)) = I(\tilde q(\nu), p_a)$. Indeed, for all $\varepsilon, \nu > 0$, $H_\mu(\varepsilon)$ and $H_l(\nu)$ are parallel hyperplanes, since by construction of $\tilde q(\nu)$,

$$l(q; p_a, \tilde q(\nu)) = \sum_{y \in S_a} q_y \log \frac{p_{ay}}{\tilde q_y(\nu)} = \nu \sum_{y \in S_a} q_y r_{ay} + \log b(\nu) = \nu \mu_a(q) + \log b(\nu).$$

In addition, $\tilde q(\nu) \in H_l(\nu)$, since $l(\tilde q(\nu); p_a, \tilde q(\nu)) = -I(\tilde q(\nu), p_a) = -c_\nu$. Hence, for $H_\mu(\varepsilon) = H_l(\nu(\varepsilon))$ it suffices to choose $\nu(\varepsilon)$ such that $\tilde q(\nu(\varepsilon)) \in H_\mu(\varepsilon)$, i.e., $\mu_a(\tilde q(\nu(\varepsilon))) = \mu_a(p_a) - \varepsilon$.
For $\nu = 0$ it is true that $b(0) = 1$ and $\tilde q(0) = p_a$; thus $\mu_a(\tilde q(0)) = \mu_a(p_a)$. As $\nu \to \infty$,

$$\tilde q_y(\nu) \to \tilde q_y(\infty) := \begin{cases} p_{ay}\big/\sum_{z : r_{az} = r_a^0} p_{az}, & \text{if } r_{ay} = r_a^0, \\ 0, & \text{otherwise}, \end{cases}$$

where $r_a^0 = \min_z r_{az}$. Thus, as $\nu \to \infty$, $\mu_a(\tilde q(\nu)) \to \mu_\infty := r_a^0 < \mu_a(p_a)$. Therefore, for any $\varepsilon < \mu_a(p_a) - \mu_\infty$ there exists $\nu(\varepsilon) > 0$ such that $\mu_a(\tilde q(\nu(\varepsilon))) = \mu_a(p_a) - \varepsilon$, because $\mu_a(\tilde q(\nu))$ is continuous in $\nu$.
Let $H_\mu^-(\varepsilon)$, $H_\mu^+(\varepsilon)$, $H_l^-(\nu)$, $H_l^+(\nu)$ denote the corresponding half-spaces of $H_\mu$, $H_l$, i.e., $H_\mu^-(\varepsilon) = \{q : \mu_a(q) < \mu_a(p_a) - \varepsilon\}$, etc. To prove that $H_\mu^-(\varepsilon) = H_l^-(\nu(\varepsilon))$, it suffices to show that $p_a \in H_\mu^+(\varepsilon)$ and $p_a \in H_l^+(\nu(\varepsilon))$. The first is immediate, while for the second we note that $l(p_a; p_a, \tilde q(\nu(\varepsilon))) = I(p_a, \tilde q(\nu(\varepsilon))) > 0 > -I(\tilde q(\nu(\varepsilon)), p_a)$. Thus the lemma follows with $q^0(\varepsilon) = \tilde q(\nu(\varepsilon))$.
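The construction in Lemma 1 is easy to reproduce numerically: tilt $p_a$ by $e^{-\nu r_{ay}}$, choose $\nu(\varepsilon)$ by bisection so that the tilted mean drops by exactly $\varepsilon$, and check the parallel-hyperplane identity $l(q; p_a, \tilde q(\nu)) = \nu \mu_a(q) + \log b(\nu)$. A sketch, with illustrative numbers of our own:

```python
import math

def tilt(p, r, nu):
    """q~_y(nu) = p_y * exp(-nu * r_y) / b(nu), with b(nu) the normalizer."""
    w = [py * math.exp(-nu * ry) for py, ry in zip(p, r)]
    b = sum(w)
    return [wy / b for wy in w], b

def mean(q, r):
    return sum(qy * ry for qy, ry in zip(q, r))

def nu_of_eps(p, r, eps, hi=1.0):
    """Bisect for nu(eps) with mu(q~(nu)) = mu(p) - eps (mean decreases in nu)."""
    target = mean(p, r) - eps
    while mean(tilt(p, r, hi)[0], r) > target:  # grow bracket until feasible
        hi *= 2.0
    lo = 0.0
    for _ in range(100):
        mid = (lo + hi) / 2.0
        lo, hi = (mid, hi) if mean(tilt(p, r, mid)[0], r) > target else (lo, mid)
    return hi

p, r, eps = [0.2, 0.5, 0.3], [0.0, 1.0, 2.0], 0.25
nu = nu_of_eps(p, r, eps)
q0, b = tilt(p, r, nu)
assert abs(mean(q0, r) - (mean(p, r) - eps)) < 1e-8  # mean dropped by eps

# Parallel-hyperplane identity: l(q; p, q~(nu)) is affine in mu(q).
q = [0.4, 0.4, 0.2]  # arbitrary test distribution
l = sum(qy * math.log(py / q0y) for qy, py, q0y in zip(q, p, q0))
assert abs(l - (nu * mean(q, r) + math.log(b))) < 1e-8
```

The identity holds exactly by construction; only floating-point error appears, which is why the hyperplanes $H_\mu(\varepsilon)$ and $H_l(\nu)$ are parallel.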
LEMMA 2. (1) $\forall p_a, q \in \Theta_a$, $\forall c, \delta > 0$, and $\forall b_1, b_2 \in \mathbb{R}$,

$$\sum_{t=\lceil \delta \log k \rceil}^{k-1} P_{p_a}\big[\, l(f_t(a); p_a, q) < -c + b_1/t,\ I(f_t(a), q) > \log k/t + b_2/t \,\big] = o(1/k),$$

as $k \to \infty$.

(2) $\forall c, \delta > 0$ and $p_a, q \in \Theta_a$,

$$\sum_{t=\lceil \delta \log k \rceil}^{k-1} P_{p_a}\big[\, l(\hat p_a^t; p_a, q) < -c,\ I(\hat p_a^t, q) > \log k/t \,\big] = o(1/k),$$

as $k \to \infty$.
Proof. (1) The proof is an adaptation of Lemma 2 in [25]. From Remark 9(b) it follows that, for all $k, t$,

$$P_{p_a}\big[\, l(f_t(a); p_a, q) < -c + b_1/t,\ I(f_t(a), q) > \log k/t + b_2/t \,\big] = P_{p_a}\Big[ L_t(p_a, q) \le e^{b_1} e^{-ct},\ \sup_p L_t(p, q) > e^{b_2} k \Big].$$

After a change-of-measure transformation between $p_a$ and $q$ we obtain

$$P_{p_a}\Big[ L_t(p_a, q) \le e^{b_1} e^{-ct},\ \sup_p L_t(p, q) > e^{b_2} k \Big] \le e^{b_1} e^{-ct}\, P_q\Big[ \sup_p L_t(p, q) > e^{b_2} k \Big]. \tag{4.3}$$

Since $\Theta_a$ is a subset of a compact set, for any $\delta > 0$ there exist $M < \infty$, a finite collection of vectors $q^{(i)} \in \Theta_a$, and neighborhoods $N_i(\delta) = \{p \in \Theta_a : \|p - q^{(i)}\| < \delta\}$, $i = 1, \ldots, M$, such that $\bigcup_i N_i(\delta) \supseteq \Theta_a$.

For all $i = 1, \ldots, M$ and $y \in S_a$, it is true that $\sup_{p \in N_i(\delta)} p_y/q_y \le (q_y^{(i)} + \delta)/q_y$; thus,

$$E_q\Big[ \sup_{p \in N_i(\delta)} \frac{p_{Y_a}}{q_{Y_a}} \Big] \le E_q\Big[ \frac{q_{Y_a}^{(i)} + \delta}{q_{Y_a}} \Big] = 1 + |S_a|\,\delta.$$

Therefore, for any $\varepsilon > 0$, selecting $\delta < \varepsilon/|S_a|$ we obtain $E_q\big[\sup_{p \in N_i(\delta)} p_{Y_a}/q_{Y_a}\big] \le 1 + \varepsilon$, $i = 1, \ldots, M$, and thus

$$P_q\Big[ \sup_{p \in N_i(\delta)} L_t(p, q) > e^{b_2} k \Big] \le e^{-b_2} k^{-1} E_q\Big[ \sup_{p \in N_i(\delta)} L_t(p, q) \Big] \le e^{-b_2} k^{-1} \Big( E_q\Big[ \sup_{p \in N_i(\delta)} \frac{p_{Y_a}}{q_{Y_a}} \Big] \Big)^t \le e^{-b_2} k^{-1} (1 + \varepsilon)^t, \tag{4.4}$$

where the first inequality follows from the Markov inequality, the second from the observation that $\sup_{p \in N_i(\delta)} L_t(p, q) \le \prod_{j=1}^{t} \sup_{p \in N_i(\delta)} p_{Y_{aj}}/q_{Y_{aj}}$ together with the fact that $Y_{aj}$, $j = 1, \ldots, t$, are i.i.d., and the third from the choice of $\delta$.
Combining (4.3) and (4.4),

$$P_{p_a}\Big[ L_t(p_a, q) \le e^{b_1} e^{-ct},\ \sup_p L_t(p, q) > e^{b_2} k \Big] \le e^{b_1} e^{-ct} \sum_{i=1}^{M} P_q\Big[ \sup_{p \in N_i(\delta)} L_t(p, q) > e^{b_2} k \Big] \le M e^{b_1 - b_2} k^{-1} e^{-ct} (1 + \varepsilon)^t.$$

Selecting $\varepsilon$ so that $e^{-c}(1 + \varepsilon) < 1$, we obtain

$$\sum_{t=\lceil \delta \log k \rceil}^{k} P_{p_a}\Big[ L_t(p_a, q) \le e^{b_1} e^{-ct},\ \sup_{p \in \Theta_a} L_t(p, q) > e^{b_2} k \Big] \le M' k^{-1} \sum_{t=\lceil \delta \log k \rceil}^{\infty} \big( e^{-c}(1 + \varepsilon) \big)^t \le M' k^{-1 - \delta(c - \log(1 + \varepsilon))},$$

where $M' = M e^{b_1 - b_2}\big/\big(1 - e^{-c}(1 + \varepsilon)\big)$.

Since $\delta > 0$ and $c - \log(1 + \varepsilon) > 0$ by the selection of $\varepsilon$, it follows that $-1 - \delta(c - \log(1 + \varepsilon)) < -1$, and the proof is complete.
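The geometric-series step above, $\sum_{t \ge \delta \log k} \rho^t \le \rho^{\delta \log k}/(1 - \rho) = k^{\delta \log \rho}/(1 - \rho)$ with $\rho = e^{-c}(1 + \varepsilon) < 1$, is where the extra factor $k^{-\delta(c - \log(1+\varepsilon))}$ comes from. A quick numeric check, with illustrative constants of our own:

```python
import math

def tail_sum(rho, start, terms=2000):
    """Partial geometric tail sum_{t=start}^{start+terms} rho^t (converges fast)."""
    return sum(rho**t for t in range(start, start + terms))

c, eps, delta = 0.5, 0.1, 2.0
rho = math.exp(-c) * (1 + eps)          # rho = e^{-c}(1 + eps), here about 0.667
assert rho < 1

for k in (10, 100, 1000):
    start = math.ceil(delta * math.log(k))          # first index, >= delta*log(k)
    lhs = tail_sum(rho, start)
    rhs = k ** (delta * math.log(rho)) / (1 - rho)  # = k^{-delta(c-log(1+eps))}/(1-rho)
    assert lhs <= rhs * (1 + 1e-9)                  # bound holds, up to float slack
```

Since the prefactor already carries $k^{-1}$, the total decays like $k^{-1-\delta(c - \log(1+\varepsilon))}$, which is $o(1/k)$.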
(2) From Remark 9(c), on the event $\{l(\hat p_a^t; p_a, q) < -c\}$ it is true that $w_t\big( b(p_a, q) + t\, l(f_t(a); p_a, q) \big) < -ct$; therefore, since $0 < w_t < 1$, $b(p_a, q) + t\, l(f_t(a); p_a, q) < -ct/w_t < -ct$.

Also, on the event $\{I(\hat p_a^t, q) > \log k/t\}$ it is true that $w_t\big( b^0(q) + t\, I(f_t(a), q) \big) > \log k$; therefore, since $0 < w_t < 1$, $b^0(q) + t\, I(f_t(a), q) > \log k/w_t > \log k$.

Hence,

$$\sum_{t=\lceil \delta \log k \rceil}^{k} P_{p_a}\big[\, l(\hat p_a^t; p_a, q) < -c,\ I(\hat p_a^t, q) > \log k/t \,\big] \le \sum_{t=\lceil \delta \log k \rceil}^{k} P_{p_a}\big[\, l(f_t(a); p_a, q) < -c - b(p_a, q)/t,\ I(f_t(a), q) > \log k/t - b^0(q)/t \,\big],$$

and the result follows from part (1), with $b_1 = -b(p_a, q)$ and $b_2 = -b^0(q)$.
4.2. Normal Distributions
Assume that the observations $Y_{aj}$ from population $a$ are normally distributed with unknown mean $\mu_a$ and known variance $\sigma_a^2$, i.e., $\theta_a = \mu_a$, and that $\Theta_a = (-\infty, \infty)$. Given history $\omega_k$, define $\hat\mu_a^{T_k(a)} = \big(\sum_{j=1}^{T_k(a)} Y_{aj}\big)/T_k(a)$.

From the definition of $\Theta_a$ it follows that $D\Theta_a(\theta) = (\mu^*(\theta), \infty)$; therefore, $B(\theta) = \{1, \ldots, m\}$, $\forall \theta \in \Theta$. Also it can be seen after some algebra that $K_a(\theta) = \tfrac{1}{2} \log\big(1 + (\mu^*(\theta) - \mu_a)^2/\sigma_a^2\big)$ and

$$U_a(\omega_k) = \hat\mu_a^{T_k(a)} + \sigma_a \big( e^{2 \log k / T_k(a)} - 1 \big)^{1/2} = \hat\mu_a^{T_k(a)} + \sigma_a \big( k^{2/T_k(a)} - 1 \big)^{1/2}.$$

Therefore, Condition (A1) of Theorem 1 holds. Condition (A2) follows from standard large deviations arguments. Condition (A3) follows from an inequality for the tails of the normal distribution (cf. [13, p. 166]). Therefore, any index policy in $C_R$ is UM.

For details and variations of this model see [25, 23, 21].
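For the known-variance case the index has the closed form above, so computing it and selecting an arm is a one-liner; a minimal sketch (sample data and arm count are hypothetical, our own illustration):

```python
import math

def index_known_var(mu_hat, sigma, k, T):
    """U_a(omega_k) = mu_hat + sigma * sqrt(k^(2/T) - 1); note that
    exp(2*log(k)/T) = k^(2/T), so the two forms in the text coincide."""
    return mu_hat + sigma * math.sqrt(k ** (2.0 / T) - 1.0)

# Hypothetical state after k = 50 total samples over three arms.
k = 50
arms = [  # (sample mean, known sigma, times sampled T_k(a))
    (0.40, 1.0, 30),
    (0.55, 1.0, 15),
    (0.10, 2.0, 5),
]
indices = [index_known_var(m, s, k, T) for m, s, T in arms]
choice = max(range(len(arms)), key=lambda a: indices[a])
assert all(idx > m for idx, (m, _, _) in zip(indices, arms))  # inflation term > 0
```

Note how the under-sampled, high-variance third arm receives the largest inflation, which is exactly the exploration behavior the index is designed to produce.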
4.3. Normal and Discrete Distributions
Assume that $\exists\, m_1 < m$ such that:

(1) For $a = 1, \ldots, m_1$, $Y_{aj}$ are normally distributed with unknown mean $\mu_a$ and known variance $\sigma_a^2$, i.e., $\theta_a = \mu_a$, and $\Theta_a = (-\infty, \infty)$.

(2) For $a = m_1 + 1, \ldots, m$, $Y_{aj}$ follow a discrete distribution with known support $S_a = \{r_{a1}, \ldots, r_{a d_a}\}$ and unknown parameters $p_a \in \Theta_a = \{p_a \in \mathbb{R}^{d_a} : p_{ay} > 0,\ \forall y = 1, \ldots, d_a,\ \sum_y p_{ay} = 1\}$.

In this case $B(\theta) = \{1, \ldots, m_1\} \cup \{a > m_1 : \max_y r_{ay} > \mu^*(\theta)\}$.

Conditions (A1), (A2), and (A3) are satisfied. Indeed, they have been verified separately for $a > m_1$ in Subsection 4.1 and for $a \le m_1$ in Subsection 4.2. Thus any index policy in $C_R$ is UM.
4.4. Normal Distributions with Unknown Variance
Assume that $Y_{aj}$ are normally distributed with unknown mean $\mu_a$ and unknown variance $\sigma_a^2$, i.e., $\theta_a = (\mu_a, \sigma_a^2)$, and that $\Theta_a = \{\theta_a : \mu_a \in \mathbb{R},\ \sigma_a^2 \ge 0\}$. Given history $\omega_k$, define $\hat\theta_a^{T_k(a)} = \big(\hat\mu_a^{T_k(a)}, \hat\sigma_a^{2, T_k(a)}\big)$, where $\hat\mu_a^{T_k(a)} = (1/T_k(a)) \sum_{j=1}^{T_k(a)} Y_{aj}$ and $\hat\sigma_a^{2, T_k(a)} = s^2(T_k(a)) = (1/T_k(a)) \sum_{j=1}^{T_k(a)} \big( Y_{aj} - \hat\mu_a^{T_k(a)} \big)^2$.

From the definition of $\Theta_a$ it follows that $D\Theta_a(\theta) \ne \emptyset$, $\forall \theta \in \Theta$; therefore, $B(\theta) = \{1, \ldots, m\}$, $\forall \theta \in \Theta$. Also it can be seen after some algebra that $K_a(\theta) = \tfrac{1}{2} \log\big(1 + (\mu^*(\theta) - \mu_a)^2/\sigma_a^2\big)$ and

$$U_a(\omega_k) = \hat\mu_a^{T_k(a)} + \hat\sigma_a^{T_k(a)} \big( e^{2 \log k / T_k(a)} - 1 \big)^{1/2} = \hat\mu_a^{T_k(a)} + \hat\sigma_a^{T_k(a)} \big( k^{2/T_k(a)} - 1 \big)^{1/2}.$$

Therefore, Condition (A1) of Theorem 1 holds. It is easy to see that (A2) holds, using large deviations arguments. However, we have not been able to prove that (A3) is satisfied, so this remains an open problem.
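The unknown-variance index simply plugs the sample standard deviation into the known-variance formula. A short sketch, with hypothetical data of our own:

```python
import math

def index_unknown_var(samples, k):
    """U_a = mu_hat + sigma_hat * sqrt(k^(2/T) - 1), using the MLE variance
    estimate sigma_hat^2 = (1/T) * sum (Y_j - mu_hat)^2 (divide by T, not T-1),
    as in the definition of s^2(T_k(a)) above."""
    T = len(samples)
    mu_hat = sum(samples) / T
    var_hat = sum((y - mu_hat) ** 2 for y in samples) / T
    return mu_hat + math.sqrt(var_hat) * math.sqrt(k ** (2.0 / T) - 1.0)

ys = [0.2, -0.1, 0.4, 0.3]           # hypothetical observations from one arm
u = index_unknown_var(ys, k=20)
assert u >= sum(ys) / len(ys)        # index never falls below the sample mean
```

The open question in the text is whether this plug-in index satisfies (A3), i.e., whether $\hat\sigma_a$ is reliable enough at small sample sizes; the computation itself is unaffected.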
ACKNOWLEDGMENTS
We are grateful to H. Robbins for helpful comments on this paper.
REFERENCES

1. R. Acosta-Abreu and O. Hernandez-Lerma, Iterative adaptive control of denumerable state, average-cost Markov systems, Control Cybernet. 14 (1985), 313–322.
2. R. Agrawal, D. Teneketzis, and V. Anantharam, Asymptotically efficient adaptive allocation schemes for controlled Markov chains: Finite parameter space, IEEE Trans. Automat. Control 34 (1989), 1249–1259.
3. R. Bellman, "Dynamic Programming," Princeton Univ. Press, Princeton, NJ, 1957.
4. D. A. Berry and B. Fristedt, "Bandit Problems: Sequential Allocation of Experiments," Chapman & Hall, London, 1985.
5. F. J. Beutler and D. Teneketzis, Routing in queueing networks under imperfect information: Stochastic dominance and thresholds, Stochastics Stochastics Rep. 26 (1989), 81–100.
6. A. N. Burnetas and M. N. Katehakis, "On Optimal Sequential Allocation Policies for the Finite Horizon One-Armed Bandit Problem," Technical Report, Rutgers University, 1989.
7. A. N. Burnetas and M. N. Katehakis, On sequencing two types of tasks on a single processor under incomplete information, Probab. Engrg. Inform. Sci. 7 (1993), 85–119.
8. A. N. Burnetas and M. N. Katehakis, "Optimal Adaptive Policies for Dynamic Programming," Technical Report, Rutgers University, 1994.
9. A. Dembo and O. Zeitouni, "Large Deviations Techniques and Applications," Jones & Bartlett, Boston, 1993.
10. E. B. Dynkin and A. A. Yushkevich, "Controlled Markov Processes," Springer-Verlag, Berlin/New York, 1979.
11. R. S. Ellis, "Entropy, Large Deviations and Statistical Mechanics," Springer-Verlag, Berlin/New York, 1985.
12. A. Federgruen and P. Schweitzer, Nonstationary Markov decision problems with converging parameters, J. Optim. Theory Appl. 34 (1981), 207–241.
13. W. Feller, "An Introduction to Probability Theory and Its Applications," Vol. 1, 3rd ed., Wiley, New York, 1967.
14. B. L. Fox and J. F. Rolph, Adaptive policies for Markov renewal programs, Ann. Statist. 1 (1973), 334–341.
15. J. C. Gittins, Bandit processes and dynamic allocation indices (with discussion), J. Roy. Statist. Soc. Ser. B 41 (1979), 335–340.
16. J. C. Gittins and K. D. Glazebrook, On Bayesian models in stochastic scheduling, J. Appl. Probab. 14 (1977), 556–565.
17. J. C. Gittins and D. M. Jones, A dynamic allocation index for the discounted multiarmed bandit problem, Biometrika 66 (1979), 561–565.
18. Z. Govindarajulu, "The Sequential Statistical Analysis of Hypothesis Testing, Point and Interval Estimation, and Decision Theory," 2nd ed., Am. Sci. Press, Syracuse, NY, 1987.
19. M. N. Katehakis and C. Derman, Computing optimal sequential allocation rules in clinical trials, in "Adaptive Statistical Procedures and Related Topics" (J. Van Ryzin, Ed.), Vol. 8, pp. 29–39, IMS Lecture Notes–Monograph Series, Inst. Math. Statist., Hayward, CA, 1985.
20. M. N. Katehakis and A. F. Veinott, Jr., The multi-armed bandit problem: Decomposition and computation, Math. Oper. Res. 12 (1987), 262–268.
21. M. N. Katehakis and H. Robbins, "Sequential Allocation Involving Normal Populations," Technical Report, Rutgers University, 1994.
22. P. R. Kumar, A survey of some results in stochastic adaptive control, SIAM J. Control Optim. 23 (1985), 329–380.
23. T. L. Lai, Adaptive treatment allocation and the multi-armed bandit problem, Ann. Statist. 15 (1987), 1091–1114.
24. T. L. Lai and H. Robbins, Asymptotically optimal allocation of treatments in sequential experiments, in "Design of Experiments: Ranking and Selection: Essays in Honor of Robert E. Bechhofer" (T. J. Santner and A. C. Tamhane, Eds.), Vol. 56, pp. 127–142, Dekker, New York, 1984.
25. T. L. Lai and H. Robbins, Asymptotically efficient adaptive allocation rules, Adv. in Appl. Math. 6 (1985), 4–22.
26. Z. Li and C. Zhang, Asymptotically efficient allocation rules for two Bernoulli populations, J. Roy. Statist. Soc. Ser. B 54 (1992), 609–616.
27. P. Mandl, Estimation and control in Markov chains, Adv. in Appl. Probab. 6 (1974), 40–60.
28. K. M. Van Hee, Markov decision processes with unknown transition law: The average return case, in "Recent Developments in Markov Decision Processes" (D. J. White, R. Hartley, and L. C. Thomas, Eds.), pp. 227–244, Academic Press, New York, 1980.
29. R. A. Milito and J. B. Cruz, Jr., A weak contrast function approach to adaptive semi-Markov decision models, in "Stochastic Large Scale Engineering Systems" (S. Tzafestas and C. Watanabe, Eds.), pp. 253–278, Dekker, New York, 1992.
30. U. Rieder and J. Weishaupt, Customer scheduling with incomplete information, Probab. Engrg. Inform. Sci., to appear.
31. H. Robbins, Some aspects of the sequential design of experiments, Bull. Amer. Math. Soc. 58 (1952), 527–535.
32. H. R. Varian, "Microeconomic Analysis," 2nd ed., Norton, New York, 1984.
33. S. Yakowitz and W. Lowe, Nonparametric bandit methods, Ann. Oper. Res. 28 (1991), 297–312.