On an Index Policy for Restless Bandits

On an Index Policy for Restless Bandits
Author(s): Richard R. Weber and Gideon Weiss
Source: Journal of Applied Probability, Vol. 27, No. 3 (Sep., 1990), pp. 637-648
Published by: Applied Probability Trust
Stable URL: http://www.jstor.org/stable/3214547 .
Accessed: 24/10/2011 10:35
Your use of the JSTOR archive indicates your acceptance of the Terms & Conditions of Use, available at .
http://www.jstor.org/page/info/about/policies/terms.jsp
JSTOR is a not-for-profit service that helps scholars, researchers, and students discover, use, and build upon a wide range of
content in a trusted digital archive. We use information technology and tools to increase productivity and facilitate new forms
of scholarship. For more information about JSTOR, please contact [email protected].
Applied Probability Trust is collaborating with JSTOR to digitize, preserve and extend access to Journal of
Applied Probability.
http://www.jstor.org
J. Appl. Prob. 27, 637-648 (1990)
Printedin Israel
Trust1990
? AppliedProbability
ON AN INDEX POLICY FOR RESTLESS BANDITS
RICHARD R. WEBER,*Universityof Cambridge
GIDEONWEISS,**
GeorgiaInstituteof Technology
Abstract
We investigatethe optimal allocation of effort to a collection of n projects.The
projectsare 'restless'in that the state of a project evolves in time, whetheror not
it is allocatedeffort.The evolution of the state of each projectfollows a Markovrule,
but transitions and rewardsdepend on whether or not the project receives effort.
The objective is to maximize the expected time-averagerewardunder a constraint
that exactlym of the n projectsreceiveeffortat any one time. We show that as m and
n tend to oo with m/n fixed, the per-project reward of the optimal policy is
asymptoticallythe same as that achieved by a policy which operates under the
relaxedconstraintthat an averageof m projectsbe active. The relaxedconstraintwas
considered by Whittle (1988) who described how to use a Lagrangianmultiplier
approach to assign indices to the projects. He conjectured that the policy of
allocatingeffortto the m projectsof greatestindex is asymptoticallyoptimal as m
and n tend to oo. We show that the conjectureis true if the differentialequation
describingthe fluid approximationto the index policy has a globally stable equilibrium point. This need not be the case, and we present an example for which the
index policy is not asymptoticallyoptimal. However,numericalwork suggeststhat
such counterexamplesare extremely rare and that the size of the suboptimality
which one might expect is minuscule.
FLUID APPROXIMATIONS;GITTINS INDEX; LARGE DEVIATION THEORY; MULTI-ARMED
BANDIT PROBLEM;STOCHASTICSCHEDULING
1. Restlessbandits
Whittle (1988) has recently studied an interestinggeneralizationof the classical
multi-armedbandit problem.The classicalproblemconcernsn projects,the state of
projecti at time t beingdenotedxi(t). At eachtime tjust one projectis to be operated.If
projecti is operatedthena rewardg,(x,(t)) is receivedandthe transitionxi(t) ->xi(t + 1)
follows a Markovrule specificto projecti. The n - 1 projectswhich are not operated
produceno reward,and theirstatesdo not change.Onethinksof a gamblerwho at each
turn can pull exactlyone of the n armsof a multi-armedbandit, or slot-machine,and
who desiresto maximizehis time-averagerewardby an optimalsequenceof pulls.
Received 14 July 1989; revision received 29 September1989.
* Postal address:CambridgeUniversity EngineeringDepartment,
ManagementStudies Group, Mill
Lane, CambridgeCB2 IRX, UK.
** Postal address:School of Industrialand Systems Engineering,Georgia Institute of Technology,
Atlanta, GA 30332-0205, USA.
637
638
R. R WEBERAND G. WEISS
Gittins(1970)(seeGittinsandJones(1974))showedthatan indexpolicyis optimal
forthisproblem.TheGittinsindex,denotedbyvi(x,),canbe calculated
foreachproject
as a functionof thelabeli andstatex, alone;theoptimalpolicyis simplyto operatethe
projectof greatestindex.
Whittlehasstudieda variationin whichtwogeneralizations
areintroduced.
Firstly,at
eachmomentexactlym of theprojectsareto be operated.Secondly,thosen - m which
arenotoperatedmaynevertheless
contribute
rewardandchangetheirstate.A projectis
saidto be activeorpassivedependinguponwhetheror not it is operated.It is because
passiveprojectsmaychangestatethattheyarecalledrestlessbandits(sincewearenow
banditmachinefor whicheventhosearmswhicharenot
thinkingof a multi-armed
Whittle
pulledchangestate).
gave severalexamplesof problemswhichare nicely
modelledas restlessbandits,forexamplethe exhaustionandrecuperation
modesof n
to a change
workers,wherem mustalwaysbe active.Herea changeof statecorresponds
A
of a worker's
condition.
of
state
also
a
to
physical
change
correspond changeof
may
information.
Onemightgaindifferentinformation
whenmakinga projectactivethan
when makingit passive.For example,projectsmightdeterioratein an unobserved
mannerwhennotmadeactive.Wemightthinkof m helicopters
tryingto keeptrackof
the positionsof n submarines.
Forsimplicityof expositionwe supposethatall projectsareof the sametype:they
havethe samefinitestatespace,withstateslabelled(1, - - -, k), andchangetheirstate
accordingto an identicalMarkovrule.It is convenientto formulatethe modelin
continuoustime.Wesupposethatfollowingentryto statei thenextpotentialchangeof
stateoccursat a timewhichis exponentially
distributed
withmean1.At thattimethe
to
moves
state
with
where
action a = 1 or 2 denote
project
probabilityP1,(a),
j
thattheactiveorpassiveactionwasbeingappliedto theprojectjustpriorto
respectively
its potentialchangeof state.Notethatthediagonalentriesof P(a) arenotnecessarily
0,
so the projectmaynot actuallychangestate.Thisuniformization
deviceis a standard
ideain thestudyof Markovprocessesandsimplifiestheanalysis.Weshallalsosuppose
matricesaresuchthatthestatesforma singleclosedclassregardless
of
thatthetransition
theprojectis in statei
the policyemployed.Rewardis earnedat a rateg(i, a) whenever
andactiona is taken.
If one wereattemptingto maximizeaveragerewardoverthe infinitehorizonfor a
singleproject,thesolutionwouldbe foundfromtheoptimalityequation
y + f(i)= max
g(i, a) +
a-1,2
k
Pj(a)f(j) ,
j-I
and y wouldbe the time-averageoptimalreward.Considernow the optimalityequation
obtained if the rewardreceived when taking the passive action is subsidisedby an
amountv:
(1)
y(v)+ f(i) = max g(i, 1)+ E PU(1)f(j),v + g(i, 2) + 1 PU,(2)f(j).
j-1
j-1
One thinks of v as a subsidy which is paid for taking the passive action. It is intuitive that
as v increases from - cc to + cc the proportion of time for which it is optimal to take the
Onan indexpolicyfor restlessbandits
639
passive action increases.One can interpretv as the Lagrangianmultiplierassociated
with a constraintthat the passive action be taken for a proportionof the time 1 - a,
O a 5 1. This leads us to considera problemin which the constraintis relaxedto
demandingonly that the numberof projectsto be made active satisfiesEm(t) = an in
time average.It is clearthat the policy whichmaximizesthe averagerewardsubjectto
this constraintsatisfies(1) whenv is pitchedat the rightlevel. A subsidypolicyis defined
by Whittle as a policy inducedby (1) for some value of v and which resolvesany tie
between the terms within the maximizationoperatorby choosingthe passive action.
However,except for a finite numberof values of a, for which the non-randomizingvsubsidypoliciesare optimal,the appropriatesubsidyv will be such that the active and
passiveactionsareequallyattractivefor some statei. By appropriatelyrandomizingthe
choice of action in this state a stationarypolicy is obtainedfor which the constraintis
satisfied.Denote by r(a) the maximumaveragerewardthat can be achievedunderthe
relaxedconstraint;this can be expressedas (Whittle(1988), Proposition1)
(2)
r(a) = inf (y(v) - v(1 - a)).
Clearlynr(a)is an upperboundforRn)(a), the maximalaveragerewardto be obtained
from n projectssubjectto the moredemandingconstraintthatexactlym = an areto be
made active at all times. The policy which achievesthe value nr(a) simplyappliesthe
same policy to each project independently,making each project active, passive, or
perhapsrandomizingbetweenactiveand passiveactions,on the basisof the stateof the
projectalone. We call this policy, which is optimal underthe relaxedconstraint,the
relaxed-constraint
optimalpolicy, or morebrieflythe relaxedpolicy, and denote it areI.
Remark. If a and n aresuchthatan is not an integer,thenwe interpretthe constraint
m = an as demandingthat Lan]projectsbe made active, n - [an] projectsbe made
passive, and the remainingprojectbe made active with probabilityan - Lan].This is
consistent with the idea that the resource available for applying active actions is
continuouslydivisible. This interpretationalso avoids technical issues that are not
centralto our discussion.
Whittledefinesthe index v(i) for a projectin state i as the leastvalueof the subsidyv
for whichit couldbe optimalin (1) to makethe projectpassivein statei. It turnsout that
this indexreducesto the usualGittinsindex in the case that the passiveprojectsdo not
changestate and yield no rewards.If indexingis to be meaningfulit should induce a
consistentordering,meaningthat if it is optimal to make a projectpassive when the
subsidy is v then it is also optimal to make it passive for all v'> v. The concept is
formalizedin Whittle'sdefinitionof indexability.
Definition. Let D(v) be the set of statesfor whicha projectwouldbe madepassive
undera v-subsidypolicy. Theprojectis indexableifD(v) increasesmonotonically
from 0
to ({1, , k ) as v increasesfrom - c0 to + c0.
?
Whittle'sindexpolicy, denotedoind, is the one whichat all times makesactive the m
(= an) projectsof greatestindex (wherethis is interpretedaccordingto the remark
640
R. R WEBERAND G. WEISS
rewardfora problemwithn projects
above).Forthispolicywedenotethetime-average
by R(,3(a). Clearly
(3)
5 R() (a) 5 nr(a).
Rf6,(a)
One expectsthat underthe index policy the rewardper projectwill be close to r(a) and
the policy will be nearlyoptimal. This is Whittle'sconjecture,which can be stated as
follows.
Conjecture(Whittle(1988)).
(4)
R (a)/n -- r(a) as n, m ---oo, m = an.
Becauseit is possibleto computethe v(i)'s, the indexpolicyis easyto implement.The
truth of the conjecturewould make the index policy an attractive policy for the
constrainedproblem,sincewe couldbe surethatfor largen it nearlyachievesa rewardof
nr(a);this is also a quantitythat can be computed.
The conjectureis plausible since, as n increases,one expects a weaker coupling
betweenthe statesof distinctprojects.If the relaxedpolicy is appliedto n projectsthen
the equilibriumnumberin state i will be binomiallydistributedas B(nnit,n7r(1 - n;)),
where 7rt is the proportionof time a single project spends in state i. If the initial
distributionis the relaxedpolicy'sequilibriumdistributionand one startsto applythe
indexpolicythen,at leastinitially,the relaxedandindexpolicieswill differin the actions
they take on a numberof stateswhose expectedvalue is only O(ji), and the expected
differencein rewardper projectbetweenthe policieswill be O(1/n). Sinceactionsdo
not differvery much, the equilibriumdistributionof the index policy will also have
about n7r + O(/n) projectsin state i.
However,we have discoveredthatconjecture(4) is falseandwe showthis in Section4
using ideas from the theory of large deviations and a specific counterexample.The
counterexamplewas by no meanseasy to find.We still believe the conjectureto be true
in most circumstances.Section 3 describesan analysisof the index policy using the
theoryof largedeviations.In it we give a sufficientconditionfor the truthof (4). Section
2 beginsthe paperwith a proofof a positiveresult:thatthe secondinequalityin (3) is an
equalityto withinO(,/n). Thusfor largen impositionof the moredemandingconstraint
does not lead to substantialreductionof rewardper project.
2. The asymptotic reward of the optimal policy
We begin by showing that asymptotically nothing is gained by the relaxation of the
constraint. Asymptotically the optimal reward per project is the same for the constrained
and relaxed problems.
Theorem 1.
(5)
R" (a)/n
- r(a)
as n, m - oo, m = an.
bandits
Onanindexpolicy
forrestless
641
Proof. Let arbe the equilibriumdistributionfor the stateof a singleprojectwhenthe
relaxedpolicy is employed.Considerthe expected-average-cost
optimalityequationfor
the constrainedproblem:
R(p(a)/n + f(x) = max
{(l/n) i • g(x, a1)+ Eaf(X)}
a
f
Herexi denotesthe stateof projecti, ai denotesthe actiontakenfor thatproject,andXi
denotes the state of projecti subsequentto the next potentialtransition(whichoccurs
after time exponentiallydistributedwith mean 1/n), x, XE({1,.
k}", a•(1, 2}".
?.,
Assumedata in the problemis suchthat ntis rationaland n is suchthat
nn is a vectorof
integers.Let ni(x) denote the numberof projectsin state i. Supposethe statex is such
that the numberof projectsin state i is exactly ni(x)= nin. Then, applicationof the
relaxed policy satisfies the constraintthat exactly m projects will be made active.
Supposenow that the relaxedpolicy is appliedto everyproject;denoteby a" the action
appliedto projecti. Moreover,supposethatfor a time 6 the constantactiona"is applied
to projecti even if that projectchangesstate. The expectednumberof potentialstate
changeswhichoccurduringthis intervalof length3 is n3. The expectedrewardobtained
duringthe intervalis boundedbelow by 3r(a)n - nJ2G, where G = 2 maxi,aIg(i, a)l.
Clearlythe policy is suboptimal,so
(6)
JR(n)(a) + f(x) > ({r(a)n - n32G+ Eaf(X6)}.
Here Xf'is the state of the projecti aftertime 3. Define,for any two statesx and y, the
distanced(x, y) as the minimalnumberof componentsin whichx andy can be madeto
differby permutingthe componentsof y: this is d(x, y) = (1/2)Z,Ini(x) - n;(y) I.Note
that we can write ni(X6) = Y,1+ -
+
Yk
where Y,, - -,
Yk
are independent random
'
variables, Yj has a binomial distribution B(nj(x), Pji(aj,6)) and Pi,(aj,) =
e- (3,1+ 3P11(a)+ 62Pj2(aj)/2!+ . ) denotes the probabilitywith which a projectin
statej is in state i aftertime 6 given thatthe fixedactionajis applied.Fromthis, and the
fact that ni(x) = ni7t,it follows that the expected value and variance of n1(X6)are
ni(x) + no(6) and 26n;(x)({1 - Pi,(a)} + no(6). By the centrallimit theoremandthe fact
that d(.) satisfiesa triangleinequality,it follows that Ed(x, X6))< no(3) + A ,n for
some A.
Supposewe could show that thereexists a B > 0 such thatf(y) - f(x) > - Bd(x, y),
for any x and y. Then from (6) we wouldhave
r(a) - JG - B(o(3)/6 + A//Jn}.
R(")(a)/n>
By taking3 sufficientlysmall and then lettingn -? 0 (througha subsequencefor which
nn E Zk) the right-hand side of (7) has a limit greater than r(a) - e, for any e > 0. This
would provethe theorem.
Since d(.) obeys a triangleinequality,it sufficesto prove the claim of the above
paragraphto considerx, y suchthatd(x, y) = 1. Supposethis is the caseandthatx andy
differon projecti. We use a couplingargument:supposethat in statey we applyto each
projectexactlythe sameactionwhichooptappliesto that projectin statex. We continue
(7)
r(a) >
642
R. R WEBER AND G. WEISS
to do this until thereis a potentialchangeof state for projecti. This occursaftera time
which is exponentiallydistributedwith mean I and we deduce from the optimality
equation
f(x) - f(y) > - G + Ea,(f(X) - f(Y)},
whereX and Ydenotethe statesof the projectsat the time of the firstpotentialchangeof
state for projecti. Now d(X, Y) 5 1. Moreover,thereis alwayssome probability,say at
least o > 0, that Xi = Yiand thereforethat d(X, Y) = 0. Thus from (8), we have
(8)
m
min
x,y: d(x,y)- 1
f(x) - f(y))}
(f(x) - f(y))},
- G + (1 - o) x,y: min
d(x,y)- 1
which impliesf(x) - f(y) > - G/o, completingthe proof of the theorem.
3. A sufficientconditionfor asymptoticoptimalityof the indexpolicy
In this sectionwe considerthe index policy,and, appliedto a collectionof n projects.
... , Znk(t)) be a state for the system,whereza,(t) denotesthe proporLet zn(t)=
(z,l(t),
in state i. Possibletransitionsof the processare of the form z, -- zn +
tion of projects
whereetjis a vectorwith - 1 in the ith component, + 1 in thejth component
(l1n)e,., in all other
and
components.Considera policy in which the m = an projectsof
O's
index
are
made
active. We extend the definitionof the index policy by stating
greatest
that when an is not an integerthenLan]projectsof greatestindex are made active and
then one furtherproject,with a greatestindex amongstthose remaining,is madeactive
with probabilityan - [an]. Let and qj denote the transitionrates from state i to j
qj. respectively.For convenience,we have chosen to
underthe active and passive actions
write the transitionrate matricesso that (qj), and (q]j)have columns summingto 0
(whichis contraryto the usualconventionfor Markovprocesses,but convenientin this
exposition).The remainderof the paperdoes not employthe uniformization,but works
directlywith the transitionrates.Define for any numbersa' and a2, and 1 5 i - k,
ui(z) = min zi, max 0, a-
hi
zh
1 - u,(z)= min z,, max 0, 1 - a Oi(z, a', a2) = ui(z)a +1
i
zi,
k1
Zh
zi.
- ui(z))a2.
Here ui(z) and 1 - u,(z) are the probabilitiesthat a projectwhichis selectedat random
from amongstthose whichare presentlyin state i will be madeactive or passive.Under
the index policy the transitionrate associatedwith ey is nqi1(z,)zni,where
qi(z, qj, qj) =(ui(z)q1) +(1 - ui(z))qj }.
qj;(z)=
Define the path z(t) startingat z(0), I z1(0)= 1, by the differentialequation
(9)
(10)
dz/dt =
i,j,
qi,(z)ze1i = Q(z)z,
bandits
Onanindexpolicy
forrestless
or equivalently, dzi/dt = j ?,
q,.(z)z,
z,(t).
643
j?i
qij(z)zi. This is the fluid approximation for
A sufficientconditionforthe asymptoticoptimalityof the indexpolicycanbe statedas
follows.
Theorem2. Let r be the equilibriumdistributionof a singleprojectoperatedunder
the relaxed-constraint
optimalpolicy. Supposethat the differentialequation(10) has no
limit cycles, nor do its solutionsbehavechaotically.Thenif the projectsare indexable,
(10) has the uniquefixed point n and z(t) -* rfor all z(O).Furthermoreconjecture(4) is
(a)/n -- r(a), as n and m increase to infinity with m = an.
true:
R(,
Firstly,we prove a lemma which shows that indexabilityimplies that n is a unique
fixed point. Considera singleproject.Supposeit is indexableand that, withoutloss of
generality,v(1) < ... < v(k). Let a(i, 0) be the policywhichtakesthe passiveaction in
states 1,- - -, i - 1, the active action in states i + 1,- - -, k, and whichtakesthe passive
and active actions in state i with probabilities0 and 1 - 0 respectively,0 0 5 1. Let
_
a(i, 0) denote the time-averageproportionof time that the active action is takenusing
policy a(i, 0).
Lemma 1. Supposethe projectis strictly indexable, such that v(1) < ... < v(k).
Thena(i, 6) is a strictlydecreasingfunctionof 0.
Proof. y(v) is a convex, piecewise-linear,increasingfunctionof v. This followsfrom
the fact that in the set S of stationarynon-randomizingpolicies, a policy a ES has a
rewardfunction,say y,(v), which is linearin v and y(v) = max-s{(y,(v)}. Also, y(v) is
strictlyincreasingfor v > v(1). By indexabilityandthe definitionsof v(i - 1)and v(i) we
have that
= y(v(i - 1)) + (v - v(i - 1))(1 - a(i, 0))
and
,•(i,o)(V)
= y(v(i)) + (v - v(i))(1 - a(i, 1))
7y,(i,,)(V)
are both subgradientsto y(v) at v = v(i). Hence
(11)
y(v(i)) = y(v(i - 1)) + (v(i) - v(i - 1))(1 - a(i, 0)).
Now since v(i - 1) < v(i),
(12)
=
+
Y,1,,1)(v(i 1)) y(v(i)) (v(i-
1) - v(i))(1
-
a(i, 1))< y(v(i - 1)).
So (11) and (12) implya(i, 0) > a(i, 1).Now supposethat nOand n' arethe equilibrium
distributionsof policies a(i, 0) and a(i, 1). The equilibriumdistributioninducedby
a(i, 6) is a linear combination of no and y', namely no =(1 -p)nor + prn', where
p -6rp/(
=
0{n +(1 - 6)n}). Note that n7 = 7Ci /({0zn +(1 - 6)n'}). Thus a(i, 6) =
1+ ... + af_~ +
is the ratio of two linear functions of 6 and is therefore
(ze
0z?)
6
monotone as goes from 0 to 1. Since a(i, 0) > a(i, 1) it must be strictly decreasing. This
proves the lemma.
We shall also make use of the following result which states that on average z, does not
much differ from a fixed distribution r. In the context of diffusion processes this result
644
R. R WEBER AND G. WEISS
was provedby FreidlinandVentsel(1984).It hasbeen reformulatedby MitraandWeiss
(1988) in a formwhichis appropriateto a Markovjump process.In fact,theirstatement
arrival
of the propositionalso allowsstate-dependent,exogenous,Lipschitz-continuous
and departurerates.
Proposition(see Mitra and Weiss (1988), Theorem 2). Suppose there exists a
probabilitydistributionCsuch thatfor everyinitialprobabilitydistributionz(0) thefluid
approximationdzldt = Q(z)z has z(t) -~o , and the transitionratesq1,(z)are bounded
Thenfor everye > 0 thereexistpositiveconstantsc, andc2such
andLipschitz-continuous.
thatfor any initialstate zn(0)
I
lim -
(13)
o
t"-
t fo
t
P(t
zn(u) -
212>e)du:c: exp( - nc2).
Observe
Proofof Theorem2. Note that our q1j(z)are indeed Lipschitz-continuous.
also that in state zn(t) the index policy has instantaneousreward per project of
ZiO(zn(t),g(i, 1),g(i, 2)), wherewe use the functionq;that was definedat the startof
this sectionto evaluatethe rewardobtainedby the mixingof activeand passiveactions.
Similarly,if a is the equilibriumdistributionof the state of a single projectunderthe
relaxed policy then r(a) = l, #O(n,g(i, 1),g(i, 2)). ClearlyOi(z,(t),g(i, 1),g(i, 2)) is a
continuousfunctionof z over a compactregion.Supposefor the momentthat (10) has
the uniqueequilibriumIt. Sincewe assume(10) has no limit cyclesor chaoticbehaviour
it must be that for any initial distributionsz(0), z(t) -- n as t - 00.. TakingC= nr,the
propositionaboveimpliesthatthe differencebetweenr(a) andRMj(a)/n canboundedby
2 max Ig(i, a)l c1exp( - nc2)
i,a
+
sup
z:IIz-n 2<e i
I ;(z, g(i, 1),g(i, 2)) - i(7r,g(i, 1),g(i, 2))|.
This may be made smaller than any arbitraryrt, by first choosing e such that the
secondtermis less than rt/2and then takingn largeenoughthatthe firsttermis also less
than rq/2.
The proof of the theoremis completedby showingthat Q(z)z has the unique zero
Q(n)n = 0. Note firstthat Q(n)n = 0, since (from (10) and the followingline) this is
simply a statementof the balance equations for the equilibriumdistributionof the
relaxedpolicy. Similarly,if 4 : rnwerea secondprobabilityvectorsuchthat Q(4)4= 0,
then C would have to be the stationarydistributionof some other policy of the form
a(i, 6), such that a(i, 6) = a. But this would contradict Lemma 1. Hence Q(z)z = 0 has
the uniqueroot z = i. This completesthe proof of Theorem2.
4. Suboptimalityof the indexpolicy
It has been establishedthat Whittle'sconjectureis trueif the differentialequationfor
the fluidapproximationhas an equilibriumpoint whichis globallyasymptoticallystable
within the (k - 1)-dimensionalspace of probabilityvectors. Indexabilitywas used as
Onan indexpolicyfor restlessbandits
645
sufficientconditionto guaranteeuniquenessof the equilibriumpoint.In this sectionwe
begin by showing that even if it were known by some other argumentthat the
equilibriumpoint is unique,indexabilityis a necessaryconditionfor the stabilityof that
point. AlthoughLemma2 is interestingin itself,the mainreasonfor its presentationis in
orderto explainhow the questionof the stabilityof the equilibriumpoint of (10), which
is apparentlya non-lineardifferentialequation,reducesto a questionof the stabilityof a
linear system. In the second part of this section we use these ideas to explain a
counterexampleto the conjecturethat occursbecausethe equilibriumpoint is unstable.
Lemma 2. Supposethatfor a given a the stable point of (10) is the equilibrium
distributionn, and i is a state such that, O< u,(n) < 1; uj(n)= O,j < i; uj(n)= 1, j > i.
Supposethat no0and n' are the equilibriumdistributionsof policies a(i, 0) and a(i, 1).
Recall that a(i, 0) is the non-randomizing
policy whichtakes the activeaction in states
i,... , k, and the passiveaction in states 1,... , i - 1. a(i, 1) is the non-randomizing
policy whichtakestheactiveactionin statesi + 1, ... , k, and thepassiveactionin states
-1,., -, i. Then in orderthat z(t)-- n for all z(0) it is necessarythat a(i, 1)<a(i, 0).
Equivalentlythis is
+ 7r < r? +
+ r? = a(i, 0).
a(i, 1) = n•i +
.
.
Note that(4) mightbe truefor some valuesof a and untruefor others.Condition(14)
must hold if (4) is true for any a betweena(i, 1) and a(i, 0). If (4) is true for all a then
indexabilityis required.
(14)
Proof. Let qjkbe the jth column of the matrix(q ). In a region Ci, definedas the
closureof the set (z: 0 < ui(z) < 1, 2 zi = 1), Equation(10) can be written
dz(t)/dt = Aiz(t) + b,
where
b = (1 - a)q7 + aq',
(15)
Ai = (q - q7
Iq7-I - q | OIqi+l - qi'I .. Iq - qil),
and Ai is a matrixpartitionedby columns.Interestingly,(10) is just a lineardifferential
equationin regionCi. Indeeddz/dt is linearin z in eachof k regions,C1,. ., Ck.We can
eliminatezi from the right-handside of (10). (This is a consequenceof the fact that z is
constrainedto the (k - 1)-dimensionalsubspaceof probabilityvectors.)Let A, be the
(k - 1) x (k - 1) matrixformedfromAiby deletingthe ith rowand columnand let 4i',
q , i and it be (k - 1)-dimensionalvectorsformedfrom q]',q7, z and nrrespectivelyby
deleting the ith component. From d(2 - ?)/dt = A,(2 - r) we see that if
n for all
z(t)-z(0) thenA1mustbe a stabilitymatrix.Thusthe eigenvaluesofA, musthavenegativereal
parts.Using the fact that
\j<i
we can eliminate
q'
j>i
and q} from (15), to get Ai = OiBi, where
O~i
=c
li, li+,I...? ~)
R. R WEBER AND G. WEISS
646
B,= Ik-- (1/t,/I
"... f 1/•
I? I' ' II?7io ).
HereBi is the sum of the identitymatrixand a rank-2matrixhavingall its firsti - 1 and
last k - i columns identical. Qi is a Metzler matrix (of negative diagonaland nonnegativeoff-diagonalentries)and has non-positivecolumnsums.Thereforeits PerronFrobeniuseigenvalueis non-positive.In fact, this eigenvaluecannotbe 0, for if 4 > 0
were the correspondingeigenvectorwe wouldhave
0,
,
Q(0t)(~l
"
-*i-l,
,
0,O, i+,
=
~k)
0,
whichcontradictsthe irreducibilityof Q(n). Hencethe realpartsof the eigenvaluesof
$Q
mustbe negativeand Q, is a stabilitymatrix.A necessaryconditionfor a d X d matrixto
be a stabilitymatrixis that its determinanthave the same sign as ( - 1)d (sincethis has
the signof the productof the realeigenvalueswhenall arenegative).Thereforedet(Bi)is
necessarilypositive. It is not hardto show that
det(Bi)
=
(1 -
ro-
r-0 1
...
*
--
I
+1nI,
)n
which is positive if and only if (14) holds. This completesthe proofof the lemma.
Considernow the problemdescribedby the followingdata in which k = 4. All the
followingmatrixcalculationswerecarriedout usingthe softwarePC-MATLAB3.10.
-2.5
0.0025 0
0.5 -0.2825
1.0
0.28
1.0
q)
0
1.0
0.5 0
- 56.5
56.0
1.0
0
0
'I
2.0 - 2.0]
0
-2.5
0.5
1.0
1.01
0
- 2.0
0
1.0
L
10
1.01
0
-2.0
2.0
'1)
10
10
10]
10
g(j,2 1)
- 2.01
0
In states 1, 3 and 4 the transitionratesare the samefor both passiveand active actions.
In state 2 the passiveratesq2 are chosento be 200 times the active ones q2i.The reader
can checkthatprojectsareindexableand thatas the subsidyfor passivityincreasesfrom
- oo throughthe values - 10, 0, 9 and 10, the set of states which ought to be made
passive, D(v), increasesmonotonicallyby the addition of states 1, 2, 3 and 4 in that
order.Supposea is chosen so that the relaxed-constraint
optimalpolicy makesstate 1
with
someprobabilities9 and
passive,states3 and 4 active,and state2 passiveor active
1 - 6. Then
4=-
- 3.0
55.0
1.0
-0.0025
-2.28
2.0
0.9975
0.72
-2.0
Onan indexpolicyfor restlessbandits
647
is not a stabilitymatrix since its eigenvaluesare - 7.4037 and 0.0618 + 3.9670i.
This leads to a counterexampleto conjecture(4). Supposewe take a = 0.835. For this
value of a the relaxedpolicyis passiveon state 1, activeon states3 and 4, and on state2
it is active and passive with probabilities0.9938 and 0.0062 respectively.The equilibriumis n = (0.1644, 0.0973, 0.3281, 0.4102).
For this value of a the solution to (10) does not tend to nras t - co. Numerical
integrationof (10) showsthat the fluidflowapproximationfor the index policyactually
tends to a limit cycle of period 1.6384498 in which the state for which 0 < ui(z) < 1
alternatesbetweenstate 1 and state 2. So, in contrastto the relaxedpolicy, projectsin
state 1 are sometimesmade active. The proofwhich Weissgives for the propositionin
Section 3 can be adaptedin an obviousway to a versionin whichthe assumptionthat
z(t) tends to a unique equilibrium,Q(C)C= 0, is replacedby the assumptionthat z(t)
tends to a unique limit cycle for all initial probabilityvectors z(0). It can then be
shownusingargumentssimilarto those in Section3 that the asymptoticaveragereward
per projectof the index policy is the rewardobtainedby averagingthe rewardfunction around the path of the limit cycle. For our data this integral comes to
10 - 1.26577X 10-4. Clearlythe relaxedpolicyachievesan averagerewardof 10. Thus
the index policy is asymptoticallysuboptimalby the tiny amountof 0.00126577%.
A2
5. Conclusions
For the data of the counterexamplein Section 4 there is a heuristicpolicy which is
asymptoticallyoptimal.Supposethat m/n = a and the relaxedpolicy takes the active
and passive actions in state 2 with probabilities1 - 6 and 0. Suppose there are n,
m } of theseprojects
projectsin state2. Considerthe policywhichmakesmin{(1 - m)n2,
active,andthenmakesenoughof the remainingprojectsin states 1 and 3 active,to bring
the total numberof active projectsto exactlym = an. (Whichprojectsin states 1 and 3
are made active does not mattersince the transitionratesdo not dependon the action
takenin these states).The resultingMarkovprocessis a migrationprocessand satisfies
detailedbalanceequations.So one can findexpressionsfor the equilbriumdistribution
and showthat it has asymptoticallythe sameproportionsof projectsin each stateas the
relaxedpolicy.
Conjecture(4) is alwaystrue when k = 2. In this case the regionsC, and C2have a
single point as theirboundary.The trajectoriesof dz/dt must enterone or the otherof
these regionsand neverleave it. The only behaviourconsistentwith this is z(t) - 7, so
the conditions of Theorem 2 are met. In fact, for the case k = 2 we have derived
expressionsfor the equilibriumdistributionof the index policy. One can give a direct
proof of the truthof conjecture(4). It turnsout that the asymptoticdifferencebetween
Rj (a)/n and r(a) is even less than O(1/Jn).
Ourcounterexamplehad k = 4 and it is not clearwhethertheremightbe a counterexamplewithk = 3. Certainlyit will be harderto discoversuchan example,sincewe can
show that when k = 3 indexabilityalwaysimplies the stabilityof A,. Thus the equilibriumpoint of (10) is at least locallyasymptoticallystable.
648
R. R WEBER AND G. WEISS
By randomlygeneratingvaluesforthe activeandpassivematrices,Q'and Q2,we have
founda numberof counterexamplesfor k = 4 and k = 5. In ourearliestexperimentswe
generatedthe off-diagonalentries in these matricesas uniform randomvariablesin
[0, 10] and then multipliedthe columnsby randomfactors.Roughly90%of the test
problemswere indexable,but in a sample of over 20000 test problemsno counterexample to the conjecturewas found. Counterexampleswere finally discoveredby
restrictingattentionto test matricesfor whichQ'and Q2 differedin just one column,as
in the exampleof Section4. Ourthinkingwasthatforthis veryspecialisedcasea proofof
the conjectureor a counterexamplemight more easily be discovered.The experiments
were rewardedwith counterexamples.Whilewe did not try to accuratelyestimatetheir
wereproducedfor less than 1 in 1000
frequency,ourimpressionis thatcounterexamples
test problems.The size of the asymptoticsuboptimalityof the indexpolicywas no more
than 0.002%in any example.Of courseone should not place too much emphasison
resultswhichdependon the waytest problemsaregenerated.We maybe missinga class
of examplesfor which the degreeof suboptimalityis greater.A better understanding
might lead to more dramatic counterexamples,but the reasoning that led to the
counterexamplein Section4 does not seem to help. Nonetheless,the evidenceso far is
that counterexamplesto the conjectureare rareand that the degreeof suboptimalityis
very small.It appearsthat in most cases the index policy is a very good heuristic.
Acknowledgements
We wish to thankDr AlanWeiss,of AT&TBell Laboratories,who explainedto us his
interestingworkon largedeviationtheoryand Markovjump processes,and Professor
Keith Glover,of CambridgeUniversityEngineeringDepartment'sControlGroup,who
producedthe numbersfor the matrices(qj) and (qj) in the counterexampleof Section4.
These provided a more attractive counterexamplethan those found by randomly
generatingtest problems.
Our researchhas been sponsoredby National Science Foundationgrant #ECS8712798. The second author acknowledgesadditional supportfrom the FAW (ForUlm, West GerWissensverarbeitung),
schungs Institut for Anwendungsorientierte
many, for supportduringthe summerof 1988.
References
FREIDLIN,M. I. AND VENTSEL,A. D. (1984) Random Perturbationsof Dynamical Systems.
Springer-Verlag,New York.
J. C. ANDJONES,D. M. (1974) A dynamic allocation index for the sequentialdesign of
GITTINS,
experiments. In Progress in Statistics, ed. J. Gani, North-Holland, Amsterdam, 241-266.
MITRA,D. ANDWEISS,A. (1988) A fluid limit of a closed queueing network with applications to data
networks.
WHITTLE,P. (1988) Restless bandits: activity allocation in a changing world. In A Celebration of
AppliedProbability,ed. J. Gani, J. Appl. Prob. 25A, 287-298.