On an Index Policy for Restless Bandits Author(s): Richard R. Weber and Gideon Weiss Source: Journal of Applied Probability, Vol. 27, No. 3 (Sep., 1990), pp. 637-648 Published by: Applied Probability Trust Stable URL: http://www.jstor.org/stable/3214547 . Accessed: 24/10/2011 10:35 Your use of the JSTOR archive indicates your acceptance of the Terms & Conditions of Use, available at . http://www.jstor.org/page/info/about/policies/terms.jsp JSTOR is a not-for-profit service that helps scholars, researchers, and students discover, use, and build upon a wide range of content in a trusted digital archive. We use information technology and tools to increase productivity and facilitate new forms of scholarship. For more information about JSTOR, please contact [email protected]. Applied Probability Trust is collaborating with JSTOR to digitize, preserve and extend access to Journal of Applied Probability. http://www.jstor.org J. Appl. Prob. 27, 637-648 (1990) Printedin Israel Trust1990 ? AppliedProbability ON AN INDEX POLICY FOR RESTLESS BANDITS RICHARD R. WEBER,*Universityof Cambridge GIDEONWEISS,** GeorgiaInstituteof Technology Abstract We investigatethe optimal allocation of effort to a collection of n projects.The projectsare 'restless'in that the state of a project evolves in time, whetheror not it is allocatedeffort.The evolution of the state of each projectfollows a Markovrule, but transitions and rewardsdepend on whether or not the project receives effort. The objective is to maximize the expected time-averagerewardunder a constraint that exactlym of the n projectsreceiveeffortat any one time. We show that as m and n tend to oo with m/n fixed, the per-project reward of the optimal policy is asymptoticallythe same as that achieved by a policy which operates under the relaxedconstraintthat an averageof m projectsbe active. The relaxedconstraintwas considered by Whittle (1988) who described how to use a Lagrangianmultiplier approach to assign indices to the projects. He conjectured that the policy of allocatingeffortto the m projectsof greatestindex is asymptoticallyoptimal as m and n tend to oo. We show that the conjectureis true if the differentialequation describingthe fluid approximationto the index policy has a globally stable equilibrium point. This need not be the case, and we present an example for which the index policy is not asymptoticallyoptimal. However,numericalwork suggeststhat such counterexamplesare extremely rare and that the size of the suboptimality which one might expect is minuscule. FLUID APPROXIMATIONS;GITTINS INDEX; LARGE DEVIATION THEORY; MULTI-ARMED BANDIT PROBLEM;STOCHASTICSCHEDULING 1. Restlessbandits Whittle (1988) has recently studied an interestinggeneralizationof the classical multi-armedbandit problem.The classicalproblemconcernsn projects,the state of projecti at time t beingdenotedxi(t). At eachtime tjust one projectis to be operated.If projecti is operatedthena rewardg,(x,(t)) is receivedandthe transitionxi(t) ->xi(t + 1) follows a Markovrule specificto projecti. The n - 1 projectswhich are not operated produceno reward,and theirstatesdo not change.Onethinksof a gamblerwho at each turn can pull exactlyone of the n armsof a multi-armedbandit, or slot-machine,and who desiresto maximizehis time-averagerewardby an optimalsequenceof pulls. Received 14 July 1989; revision received 29 September1989. * Postal address:CambridgeUniversity EngineeringDepartment, ManagementStudies Group, Mill Lane, CambridgeCB2 IRX, UK. ** Postal address:School of Industrialand Systems Engineering,Georgia Institute of Technology, Atlanta, GA 30332-0205, USA. 637 638 R. R WEBERAND G. WEISS Gittins(1970)(seeGittinsandJones(1974))showedthatan indexpolicyis optimal forthisproblem.TheGittinsindex,denotedbyvi(x,),canbe calculated foreachproject as a functionof thelabeli andstatex, alone;theoptimalpolicyis simplyto operatethe projectof greatestindex. Whittlehasstudieda variationin whichtwogeneralizations areintroduced. Firstly,at eachmomentexactlym of theprojectsareto be operated.Secondly,thosen - m which arenotoperatedmaynevertheless contribute rewardandchangetheirstate.A projectis saidto be activeorpassivedependinguponwhetheror not it is operated.It is because passiveprojectsmaychangestatethattheyarecalledrestlessbandits(sincewearenow banditmachinefor whicheventhosearmswhicharenot thinkingof a multi-armed Whittle pulledchangestate). gave severalexamplesof problemswhichare nicely modelledas restlessbandits,forexamplethe exhaustionandrecuperation modesof n to a change workers,wherem mustalwaysbe active.Herea changeof statecorresponds A of a worker's condition. of state also a to physical change correspond changeof may information. Onemightgaindifferentinformation whenmakinga projectactivethan when makingit passive.For example,projectsmightdeterioratein an unobserved mannerwhennotmadeactive.Wemightthinkof m helicopters tryingto keeptrackof the positionsof n submarines. Forsimplicityof expositionwe supposethatall projectsareof the sametype:they havethe samefinitestatespace,withstateslabelled(1, - - -, k), andchangetheirstate accordingto an identicalMarkovrule.It is convenientto formulatethe modelin continuoustime.Wesupposethatfollowingentryto statei thenextpotentialchangeof stateoccursat a timewhichis exponentially distributed withmean1.At thattimethe to moves state with where action a = 1 or 2 denote project probabilityP1,(a), j thattheactiveorpassiveactionwasbeingappliedto theprojectjustpriorto respectively its potentialchangeof state.Notethatthediagonalentriesof P(a) arenotnecessarily 0, so the projectmaynot actuallychangestate.Thisuniformization deviceis a standard ideain thestudyof Markovprocessesandsimplifiestheanalysis.Weshallalsosuppose matricesaresuchthatthestatesforma singleclosedclassregardless of thatthetransition theprojectis in statei the policyemployed.Rewardis earnedat a rateg(i, a) whenever andactiona is taken. If one wereattemptingto maximizeaveragerewardoverthe infinitehorizonfor a singleproject,thesolutionwouldbe foundfromtheoptimalityequation y + f(i)= max g(i, a) + a-1,2 k Pj(a)f(j) , j-I and y wouldbe the time-averageoptimalreward.Considernow the optimalityequation obtained if the rewardreceived when taking the passive action is subsidisedby an amountv: (1) y(v)+ f(i) = max g(i, 1)+ E PU(1)f(j),v + g(i, 2) + 1 PU,(2)f(j). j-1 j-1 One thinks of v as a subsidy which is paid for taking the passive action. It is intuitive that as v increases from - cc to + cc the proportion of time for which it is optimal to take the Onan indexpolicyfor restlessbandits 639 passive action increases.One can interpretv as the Lagrangianmultiplierassociated with a constraintthat the passive action be taken for a proportionof the time 1 - a, O a 5 1. This leads us to considera problemin which the constraintis relaxedto demandingonly that the numberof projectsto be made active satisfiesEm(t) = an in time average.It is clearthat the policy whichmaximizesthe averagerewardsubjectto this constraintsatisfies(1) whenv is pitchedat the rightlevel. A subsidypolicyis defined by Whittle as a policy inducedby (1) for some value of v and which resolvesany tie between the terms within the maximizationoperatorby choosingthe passive action. However,except for a finite numberof values of a, for which the non-randomizingvsubsidypoliciesare optimal,the appropriatesubsidyv will be such that the active and passiveactionsareequallyattractivefor some statei. By appropriatelyrandomizingthe choice of action in this state a stationarypolicy is obtainedfor which the constraintis satisfied.Denote by r(a) the maximumaveragerewardthat can be achievedunderthe relaxedconstraint;this can be expressedas (Whittle(1988), Proposition1) (2) r(a) = inf (y(v) - v(1 - a)). Clearlynr(a)is an upperboundforRn)(a), the maximalaveragerewardto be obtained from n projectssubjectto the moredemandingconstraintthatexactlym = an areto be made active at all times. The policy which achievesthe value nr(a) simplyappliesthe same policy to each project independently,making each project active, passive, or perhapsrandomizingbetweenactiveand passiveactions,on the basisof the stateof the projectalone. We call this policy, which is optimal underthe relaxedconstraint,the relaxed-constraint optimalpolicy, or morebrieflythe relaxedpolicy, and denote it areI. Remark. If a and n aresuchthatan is not an integer,thenwe interpretthe constraint m = an as demandingthat Lan]projectsbe made active, n - [an] projectsbe made passive, and the remainingprojectbe made active with probabilityan - Lan].This is consistent with the idea that the resource available for applying active actions is continuouslydivisible. This interpretationalso avoids technical issues that are not centralto our discussion. Whittledefinesthe index v(i) for a projectin state i as the leastvalueof the subsidyv for whichit couldbe optimalin (1) to makethe projectpassivein statei. It turnsout that this indexreducesto the usualGittinsindex in the case that the passiveprojectsdo not changestate and yield no rewards.If indexingis to be meaningfulit should induce a consistentordering,meaningthat if it is optimal to make a projectpassive when the subsidy is v then it is also optimal to make it passive for all v'> v. The concept is formalizedin Whittle'sdefinitionof indexability. Definition. Let D(v) be the set of statesfor whicha projectwouldbe madepassive undera v-subsidypolicy. Theprojectis indexableifD(v) increasesmonotonically from 0 to ({1, , k ) as v increasesfrom - c0 to + c0. ? Whittle'sindexpolicy, denotedoind, is the one whichat all times makesactive the m (= an) projectsof greatestindex (wherethis is interpretedaccordingto the remark 640 R. R WEBERAND G. WEISS rewardfora problemwithn projects above).Forthispolicywedenotethetime-average by R(,3(a). Clearly (3) 5 R() (a) 5 nr(a). Rf6,(a) One expectsthat underthe index policy the rewardper projectwill be close to r(a) and the policy will be nearlyoptimal. This is Whittle'sconjecture,which can be stated as follows. Conjecture(Whittle(1988)). (4) R (a)/n -- r(a) as n, m ---oo, m = an. Becauseit is possibleto computethe v(i)'s, the indexpolicyis easyto implement.The truth of the conjecturewould make the index policy an attractive policy for the constrainedproblem,sincewe couldbe surethatfor largen it nearlyachievesa rewardof nr(a);this is also a quantitythat can be computed. The conjectureis plausible since, as n increases,one expects a weaker coupling betweenthe statesof distinctprojects.If the relaxedpolicy is appliedto n projectsthen the equilibriumnumberin state i will be binomiallydistributedas B(nnit,n7r(1 - n;)), where 7rt is the proportionof time a single project spends in state i. If the initial distributionis the relaxedpolicy'sequilibriumdistributionand one startsto applythe indexpolicythen,at leastinitially,the relaxedandindexpolicieswill differin the actions they take on a numberof stateswhose expectedvalue is only O(ji), and the expected differencein rewardper projectbetweenthe policieswill be O(1/n). Sinceactionsdo not differvery much, the equilibriumdistributionof the index policy will also have about n7r + O(/n) projectsin state i. However,we have discoveredthatconjecture(4) is falseandwe showthis in Section4 using ideas from the theory of large deviations and a specific counterexample.The counterexamplewas by no meanseasy to find.We still believe the conjectureto be true in most circumstances.Section 3 describesan analysisof the index policy using the theoryof largedeviations.In it we give a sufficientconditionfor the truthof (4). Section 2 beginsthe paperwith a proofof a positiveresult:thatthe secondinequalityin (3) is an equalityto withinO(,/n). Thusfor largen impositionof the moredemandingconstraint does not lead to substantialreductionof rewardper project. 2. The asymptotic reward of the optimal policy We begin by showing that asymptotically nothing is gained by the relaxation of the constraint. Asymptotically the optimal reward per project is the same for the constrained and relaxed problems. Theorem 1. (5) R" (a)/n - r(a) as n, m - oo, m = an. bandits Onanindexpolicy forrestless 641 Proof. Let arbe the equilibriumdistributionfor the stateof a singleprojectwhenthe relaxedpolicy is employed.Considerthe expected-average-cost optimalityequationfor the constrainedproblem: R(p(a)/n + f(x) = max {(l/n) i • g(x, a1)+ Eaf(X)} a f Herexi denotesthe stateof projecti, ai denotesthe actiontakenfor thatproject,andXi denotes the state of projecti subsequentto the next potentialtransition(whichoccurs after time exponentiallydistributedwith mean 1/n), x, XE({1,. k}", a•(1, 2}". ?., Assumedata in the problemis suchthat ntis rationaland n is suchthat nn is a vectorof integers.Let ni(x) denote the numberof projectsin state i. Supposethe statex is such that the numberof projectsin state i is exactly ni(x)= nin. Then, applicationof the relaxed policy satisfies the constraintthat exactly m projects will be made active. Supposenow that the relaxedpolicy is appliedto everyproject;denoteby a" the action appliedto projecti. Moreover,supposethatfor a time 6 the constantactiona"is applied to projecti even if that projectchangesstate. The expectednumberof potentialstate changeswhichoccurduringthis intervalof length3 is n3. The expectedrewardobtained duringthe intervalis boundedbelow by 3r(a)n - nJ2G, where G = 2 maxi,aIg(i, a)l. Clearlythe policy is suboptimal,so (6) JR(n)(a) + f(x) > ({r(a)n - n32G+ Eaf(X6)}. Here Xf'is the state of the projecti aftertime 3. Define,for any two statesx and y, the distanced(x, y) as the minimalnumberof componentsin whichx andy can be madeto differby permutingthe componentsof y: this is d(x, y) = (1/2)Z,Ini(x) - n;(y) I.Note that we can write ni(X6) = Y,1+ - + Yk where Y,, - -, Yk are independent random ' variables, Yj has a binomial distribution B(nj(x), Pji(aj,6)) and Pi,(aj,) = e- (3,1+ 3P11(a)+ 62Pj2(aj)/2!+ . ) denotes the probabilitywith which a projectin statej is in state i aftertime 6 given thatthe fixedactionajis applied.Fromthis, and the fact that ni(x) = ni7t,it follows that the expected value and variance of n1(X6)are ni(x) + no(6) and 26n;(x)({1 - Pi,(a)} + no(6). By the centrallimit theoremandthe fact that d(.) satisfiesa triangleinequality,it follows that Ed(x, X6))< no(3) + A ,n for some A. Supposewe could show that thereexists a B > 0 such thatf(y) - f(x) > - Bd(x, y), for any x and y. Then from (6) we wouldhave r(a) - JG - B(o(3)/6 + A//Jn}. R(")(a)/n> By taking3 sufficientlysmall and then lettingn -? 0 (througha subsequencefor which nn E Zk) the right-hand side of (7) has a limit greater than r(a) - e, for any e > 0. This would provethe theorem. Since d(.) obeys a triangleinequality,it sufficesto prove the claim of the above paragraphto considerx, y suchthatd(x, y) = 1. Supposethis is the caseandthatx andy differon projecti. We use a couplingargument:supposethat in statey we applyto each projectexactlythe sameactionwhichooptappliesto that projectin statex. We continue (7) r(a) > 642 R. R WEBER AND G. WEISS to do this until thereis a potentialchangeof state for projecti. This occursaftera time which is exponentiallydistributedwith mean I and we deduce from the optimality equation f(x) - f(y) > - G + Ea,(f(X) - f(Y)}, whereX and Ydenotethe statesof the projectsat the time of the firstpotentialchangeof state for projecti. Now d(X, Y) 5 1. Moreover,thereis alwayssome probability,say at least o > 0, that Xi = Yiand thereforethat d(X, Y) = 0. Thus from (8), we have (8) m min x,y: d(x,y)- 1 f(x) - f(y))} (f(x) - f(y))}, - G + (1 - o) x,y: min d(x,y)- 1 which impliesf(x) - f(y) > - G/o, completingthe proof of the theorem. 3. A sufficientconditionfor asymptoticoptimalityof the indexpolicy In this sectionwe considerthe index policy,and, appliedto a collectionof n projects. ... , Znk(t)) be a state for the system,whereza,(t) denotesthe proporLet zn(t)= (z,l(t), in state i. Possibletransitionsof the processare of the form z, -- zn + tion of projects whereetjis a vectorwith - 1 in the ith component, + 1 in thejth component (l1n)e,., in all other and components.Considera policy in which the m = an projectsof O's index are made active. We extend the definitionof the index policy by stating greatest that when an is not an integerthenLan]projectsof greatestindex are made active and then one furtherproject,with a greatestindex amongstthose remaining,is madeactive with probabilityan - [an]. Let and qj denote the transitionrates from state i to j qj. respectively.For convenience,we have chosen to underthe active and passive actions write the transitionrate matricesso that (qj), and (q]j)have columns summingto 0 (whichis contraryto the usualconventionfor Markovprocesses,but convenientin this exposition).The remainderof the paperdoes not employthe uniformization,but works directlywith the transitionrates.Define for any numbersa' and a2, and 1 5 i - k, ui(z) = min zi, max 0, a- hi zh 1 - u,(z)= min z,, max 0, 1 - a Oi(z, a', a2) = ui(z)a +1 i zi, k1 Zh zi. - ui(z))a2. Here ui(z) and 1 - u,(z) are the probabilitiesthat a projectwhichis selectedat random from amongstthose whichare presentlyin state i will be madeactive or passive.Under the index policy the transitionrate associatedwith ey is nqi1(z,)zni,where qi(z, qj, qj) =(ui(z)q1) +(1 - ui(z))qj }. qj;(z)= Define the path z(t) startingat z(0), I z1(0)= 1, by the differentialequation (9) (10) dz/dt = i,j, qi,(z)ze1i = Q(z)z, bandits Onanindexpolicy forrestless or equivalently, dzi/dt = j ?, q,.(z)z, z,(t). 643 j?i qij(z)zi. This is the fluid approximation for A sufficientconditionforthe asymptoticoptimalityof the indexpolicycanbe statedas follows. Theorem2. Let r be the equilibriumdistributionof a singleprojectoperatedunder the relaxed-constraint optimalpolicy. Supposethat the differentialequation(10) has no limit cycles, nor do its solutionsbehavechaotically.Thenif the projectsare indexable, (10) has the uniquefixed point n and z(t) -* rfor all z(O).Furthermoreconjecture(4) is (a)/n -- r(a), as n and m increase to infinity with m = an. true: R(, Firstly,we prove a lemma which shows that indexabilityimplies that n is a unique fixed point. Considera singleproject.Supposeit is indexableand that, withoutloss of generality,v(1) < ... < v(k). Let a(i, 0) be the policywhichtakesthe passiveaction in states 1,- - -, i - 1, the active action in states i + 1,- - -, k, and whichtakesthe passive and active actions in state i with probabilities0 and 1 - 0 respectively,0 0 5 1. Let _ a(i, 0) denote the time-averageproportionof time that the active action is takenusing policy a(i, 0). Lemma 1. Supposethe projectis strictly indexable, such that v(1) < ... < v(k). Thena(i, 6) is a strictlydecreasingfunctionof 0. Proof. y(v) is a convex, piecewise-linear,increasingfunctionof v. This followsfrom the fact that in the set S of stationarynon-randomizingpolicies, a policy a ES has a rewardfunction,say y,(v), which is linearin v and y(v) = max-s{(y,(v)}. Also, y(v) is strictlyincreasingfor v > v(1). By indexabilityandthe definitionsof v(i - 1)and v(i) we have that = y(v(i - 1)) + (v - v(i - 1))(1 - a(i, 0)) and ,•(i,o)(V) = y(v(i)) + (v - v(i))(1 - a(i, 1)) 7y,(i,,)(V) are both subgradientsto y(v) at v = v(i). Hence (11) y(v(i)) = y(v(i - 1)) + (v(i) - v(i - 1))(1 - a(i, 0)). Now since v(i - 1) < v(i), (12) = + Y,1,,1)(v(i 1)) y(v(i)) (v(i- 1) - v(i))(1 - a(i, 1))< y(v(i - 1)). So (11) and (12) implya(i, 0) > a(i, 1).Now supposethat nOand n' arethe equilibrium distributionsof policies a(i, 0) and a(i, 1). The equilibriumdistributioninducedby a(i, 6) is a linear combination of no and y', namely no =(1 -p)nor + prn', where p -6rp/( = 0{n +(1 - 6)n}). Note that n7 = 7Ci /({0zn +(1 - 6)n'}). Thus a(i, 6) = 1+ ... + af_~ + is the ratio of two linear functions of 6 and is therefore (ze 0z?) 6 monotone as goes from 0 to 1. Since a(i, 0) > a(i, 1) it must be strictly decreasing. This proves the lemma. We shall also make use of the following result which states that on average z, does not much differ from a fixed distribution r. In the context of diffusion processes this result 644 R. R WEBER AND G. WEISS was provedby FreidlinandVentsel(1984).It hasbeen reformulatedby MitraandWeiss (1988) in a formwhichis appropriateto a Markovjump process.In fact,theirstatement arrival of the propositionalso allowsstate-dependent,exogenous,Lipschitz-continuous and departurerates. Proposition(see Mitra and Weiss (1988), Theorem 2). Suppose there exists a probabilitydistributionCsuch thatfor everyinitialprobabilitydistributionz(0) thefluid approximationdzldt = Q(z)z has z(t) -~o , and the transitionratesq1,(z)are bounded Thenfor everye > 0 thereexistpositiveconstantsc, andc2such andLipschitz-continuous. thatfor any initialstate zn(0) I lim - (13) o t"- t fo t P(t zn(u) - 212>e)du:c: exp( - nc2). Observe Proofof Theorem2. Note that our q1j(z)are indeed Lipschitz-continuous. also that in state zn(t) the index policy has instantaneousreward per project of ZiO(zn(t),g(i, 1),g(i, 2)), wherewe use the functionq;that was definedat the startof this sectionto evaluatethe rewardobtainedby the mixingof activeand passiveactions. Similarly,if a is the equilibriumdistributionof the state of a single projectunderthe relaxed policy then r(a) = l, #O(n,g(i, 1),g(i, 2)). ClearlyOi(z,(t),g(i, 1),g(i, 2)) is a continuousfunctionof z over a compactregion.Supposefor the momentthat (10) has the uniqueequilibriumIt. Sincewe assume(10) has no limit cyclesor chaoticbehaviour it must be that for any initial distributionsz(0), z(t) -- n as t - 00.. TakingC= nr,the propositionaboveimpliesthatthe differencebetweenr(a) andRMj(a)/n canboundedby 2 max Ig(i, a)l c1exp( - nc2) i,a + sup z:IIz-n 2<e i I ;(z, g(i, 1),g(i, 2)) - i(7r,g(i, 1),g(i, 2))|. This may be made smaller than any arbitraryrt, by first choosing e such that the secondtermis less than rt/2and then takingn largeenoughthatthe firsttermis also less than rq/2. The proof of the theoremis completedby showingthat Q(z)z has the unique zero Q(n)n = 0. Note firstthat Q(n)n = 0, since (from (10) and the followingline) this is simply a statementof the balance equations for the equilibriumdistributionof the relaxedpolicy. Similarly,if 4 : rnwerea secondprobabilityvectorsuchthat Q(4)4= 0, then C would have to be the stationarydistributionof some other policy of the form a(i, 6), such that a(i, 6) = a. But this would contradict Lemma 1. Hence Q(z)z = 0 has the uniqueroot z = i. This completesthe proof of Theorem2. 4. Suboptimalityof the indexpolicy It has been establishedthat Whittle'sconjectureis trueif the differentialequationfor the fluidapproximationhas an equilibriumpoint whichis globallyasymptoticallystable within the (k - 1)-dimensionalspace of probabilityvectors. Indexabilitywas used as Onan indexpolicyfor restlessbandits 645 sufficientconditionto guaranteeuniquenessof the equilibriumpoint.In this sectionwe begin by showing that even if it were known by some other argumentthat the equilibriumpoint is unique,indexabilityis a necessaryconditionfor the stabilityof that point. AlthoughLemma2 is interestingin itself,the mainreasonfor its presentationis in orderto explainhow the questionof the stabilityof the equilibriumpoint of (10), which is apparentlya non-lineardifferentialequation,reducesto a questionof the stabilityof a linear system. In the second part of this section we use these ideas to explain a counterexampleto the conjecturethat occursbecausethe equilibriumpoint is unstable. Lemma 2. Supposethatfor a given a the stable point of (10) is the equilibrium distributionn, and i is a state such that, O< u,(n) < 1; uj(n)= O,j < i; uj(n)= 1, j > i. Supposethat no0and n' are the equilibriumdistributionsof policies a(i, 0) and a(i, 1). Recall that a(i, 0) is the non-randomizing policy whichtakes the activeaction in states i,... , k, and the passiveaction in states 1,... , i - 1. a(i, 1) is the non-randomizing policy whichtakestheactiveactionin statesi + 1, ... , k, and thepassiveactionin states -1,., -, i. Then in orderthat z(t)-- n for all z(0) it is necessarythat a(i, 1)<a(i, 0). Equivalentlythis is + 7r < r? + + r? = a(i, 0). a(i, 1) = n•i + . . Note that(4) mightbe truefor some valuesof a and untruefor others.Condition(14) must hold if (4) is true for any a betweena(i, 1) and a(i, 0). If (4) is true for all a then indexabilityis required. (14) Proof. Let qjkbe the jth column of the matrix(q ). In a region Ci, definedas the closureof the set (z: 0 < ui(z) < 1, 2 zi = 1), Equation(10) can be written dz(t)/dt = Aiz(t) + b, where b = (1 - a)q7 + aq', (15) Ai = (q - q7 Iq7-I - q | OIqi+l - qi'I .. Iq - qil), and Ai is a matrixpartitionedby columns.Interestingly,(10) is just a lineardifferential equationin regionCi. Indeeddz/dt is linearin z in eachof k regions,C1,. ., Ck.We can eliminatezi from the right-handside of (10). (This is a consequenceof the fact that z is constrainedto the (k - 1)-dimensionalsubspaceof probabilityvectors.)Let A, be the (k - 1) x (k - 1) matrixformedfromAiby deletingthe ith rowand columnand let 4i', q , i and it be (k - 1)-dimensionalvectorsformedfrom q]',q7, z and nrrespectivelyby deleting the ith component. From d(2 - ?)/dt = A,(2 - r) we see that if n for all z(t)-z(0) thenA1mustbe a stabilitymatrix.Thusthe eigenvaluesofA, musthavenegativereal parts.Using the fact that \j<i we can eliminate q' j>i and q} from (15), to get Ai = OiBi, where O~i =c li, li+,I...? ~) R. R WEBER AND G. WEISS 646 B,= Ik-- (1/t,/I "... f 1/• I? I' ' II?7io ). HereBi is the sum of the identitymatrixand a rank-2matrixhavingall its firsti - 1 and last k - i columns identical. Qi is a Metzler matrix (of negative diagonaland nonnegativeoff-diagonalentries)and has non-positivecolumnsums.Thereforeits PerronFrobeniuseigenvalueis non-positive.In fact, this eigenvaluecannotbe 0, for if 4 > 0 were the correspondingeigenvectorwe wouldhave 0, , Q(0t)(~l " -*i-l, , 0,O, i+, = ~k) 0, whichcontradictsthe irreducibilityof Q(n). Hencethe realpartsof the eigenvaluesof $Q mustbe negativeand Q, is a stabilitymatrix.A necessaryconditionfor a d X d matrixto be a stabilitymatrixis that its determinanthave the same sign as ( - 1)d (sincethis has the signof the productof the realeigenvalueswhenall arenegative).Thereforedet(Bi)is necessarilypositive. It is not hardto show that det(Bi) = (1 - ro- r-0 1 ... * -- I +1nI, )n which is positive if and only if (14) holds. This completesthe proofof the lemma. Considernow the problemdescribedby the followingdata in which k = 4. All the followingmatrixcalculationswerecarriedout usingthe softwarePC-MATLAB3.10. -2.5 0.0025 0 0.5 -0.2825 1.0 0.28 1.0 q) 0 1.0 0.5 0 - 56.5 56.0 1.0 0 0 'I 2.0 - 2.0] 0 -2.5 0.5 1.0 1.01 0 - 2.0 0 1.0 L 10 1.01 0 -2.0 2.0 '1) 10 10 10] 10 g(j,2 1) - 2.01 0 In states 1, 3 and 4 the transitionratesare the samefor both passiveand active actions. In state 2 the passiveratesq2 are chosento be 200 times the active ones q2i.The reader can checkthatprojectsareindexableand thatas the subsidyfor passivityincreasesfrom - oo throughthe values - 10, 0, 9 and 10, the set of states which ought to be made passive, D(v), increasesmonotonicallyby the addition of states 1, 2, 3 and 4 in that order.Supposea is chosen so that the relaxed-constraint optimalpolicy makesstate 1 with someprobabilities9 and passive,states3 and 4 active,and state2 passiveor active 1 - 6. Then 4=- - 3.0 55.0 1.0 -0.0025 -2.28 2.0 0.9975 0.72 -2.0 Onan indexpolicyfor restlessbandits 647 is not a stabilitymatrix since its eigenvaluesare - 7.4037 and 0.0618 + 3.9670i. This leads to a counterexampleto conjecture(4). Supposewe take a = 0.835. For this value of a the relaxedpolicyis passiveon state 1, activeon states3 and 4, and on state2 it is active and passive with probabilities0.9938 and 0.0062 respectively.The equilibriumis n = (0.1644, 0.0973, 0.3281, 0.4102). For this value of a the solution to (10) does not tend to nras t - co. Numerical integrationof (10) showsthat the fluidflowapproximationfor the index policyactually tends to a limit cycle of period 1.6384498 in which the state for which 0 < ui(z) < 1 alternatesbetweenstate 1 and state 2. So, in contrastto the relaxedpolicy, projectsin state 1 are sometimesmade active. The proofwhich Weissgives for the propositionin Section 3 can be adaptedin an obviousway to a versionin whichthe assumptionthat z(t) tends to a unique equilibrium,Q(C)C= 0, is replacedby the assumptionthat z(t) tends to a unique limit cycle for all initial probabilityvectors z(0). It can then be shownusingargumentssimilarto those in Section3 that the asymptoticaveragereward per projectof the index policy is the rewardobtainedby averagingthe rewardfunction around the path of the limit cycle. For our data this integral comes to 10 - 1.26577X 10-4. Clearlythe relaxedpolicyachievesan averagerewardof 10. Thus the index policy is asymptoticallysuboptimalby the tiny amountof 0.00126577%. A2 5. Conclusions For the data of the counterexamplein Section 4 there is a heuristicpolicy which is asymptoticallyoptimal.Supposethat m/n = a and the relaxedpolicy takes the active and passive actions in state 2 with probabilities1 - 6 and 0. Suppose there are n, m } of theseprojects projectsin state2. Considerthe policywhichmakesmin{(1 - m)n2, active,andthenmakesenoughof the remainingprojectsin states 1 and 3 active,to bring the total numberof active projectsto exactlym = an. (Whichprojectsin states 1 and 3 are made active does not mattersince the transitionratesdo not dependon the action takenin these states).The resultingMarkovprocessis a migrationprocessand satisfies detailedbalanceequations.So one can findexpressionsfor the equilbriumdistribution and showthat it has asymptoticallythe sameproportionsof projectsin each stateas the relaxedpolicy. Conjecture(4) is alwaystrue when k = 2. In this case the regionsC, and C2have a single point as theirboundary.The trajectoriesof dz/dt must enterone or the otherof these regionsand neverleave it. The only behaviourconsistentwith this is z(t) - 7, so the conditions of Theorem 2 are met. In fact, for the case k = 2 we have derived expressionsfor the equilibriumdistributionof the index policy. One can give a direct proof of the truthof conjecture(4). It turnsout that the asymptoticdifferencebetween Rj (a)/n and r(a) is even less than O(1/Jn). Ourcounterexamplehad k = 4 and it is not clearwhethertheremightbe a counterexamplewithk = 3. Certainlyit will be harderto discoversuchan example,sincewe can show that when k = 3 indexabilityalwaysimplies the stabilityof A,. Thus the equilibriumpoint of (10) is at least locallyasymptoticallystable. 648 R. R WEBER AND G. WEISS By randomlygeneratingvaluesforthe activeandpassivematrices,Q'and Q2,we have founda numberof counterexamplesfor k = 4 and k = 5. In ourearliestexperimentswe generatedthe off-diagonalentries in these matricesas uniform randomvariablesin [0, 10] and then multipliedthe columnsby randomfactors.Roughly90%of the test problemswere indexable,but in a sample of over 20000 test problemsno counterexample to the conjecturewas found. Counterexampleswere finally discoveredby restrictingattentionto test matricesfor whichQ'and Q2 differedin just one column,as in the exampleof Section4. Ourthinkingwasthatforthis veryspecialisedcasea proofof the conjectureor a counterexamplemight more easily be discovered.The experiments were rewardedwith counterexamples.Whilewe did not try to accuratelyestimatetheir wereproducedfor less than 1 in 1000 frequency,ourimpressionis thatcounterexamples test problems.The size of the asymptoticsuboptimalityof the indexpolicywas no more than 0.002%in any example.Of courseone should not place too much emphasison resultswhichdependon the waytest problemsaregenerated.We maybe missinga class of examplesfor which the degreeof suboptimalityis greater.A better understanding might lead to more dramatic counterexamples,but the reasoning that led to the counterexamplein Section4 does not seem to help. Nonetheless,the evidenceso far is that counterexamplesto the conjectureare rareand that the degreeof suboptimalityis very small.It appearsthat in most cases the index policy is a very good heuristic. Acknowledgements We wish to thankDr AlanWeiss,of AT&TBell Laboratories,who explainedto us his interestingworkon largedeviationtheoryand Markovjump processes,and Professor Keith Glover,of CambridgeUniversityEngineeringDepartment'sControlGroup,who producedthe numbersfor the matrices(qj) and (qj) in the counterexampleof Section4. These provided a more attractive counterexamplethan those found by randomly generatingtest problems. Our researchhas been sponsoredby National Science Foundationgrant #ECS8712798. The second author acknowledgesadditional supportfrom the FAW (ForUlm, West GerWissensverarbeitung), schungs Institut for Anwendungsorientierte many, for supportduringthe summerof 1988. References FREIDLIN,M. I. AND VENTSEL,A. D. (1984) Random Perturbationsof Dynamical Systems. Springer-Verlag,New York. J. C. ANDJONES,D. M. (1974) A dynamic allocation index for the sequentialdesign of GITTINS, experiments. In Progress in Statistics, ed. J. Gani, North-Holland, Amsterdam, 241-266. MITRA,D. ANDWEISS,A. (1988) A fluid limit of a closed queueing network with applications to data networks. WHITTLE,P. (1988) Restless bandits: activity allocation in a changing world. In A Celebration of AppliedProbability,ed. J. Gani, J. Appl. Prob. 25A, 287-298.
© Copyright 2026 Paperzz