
Reinforcement with Fading Memories
Kuang Xu
Graduate School of Business, Stanford University, Stanford, CA 94305, [email protected]
Se-Young Yun
Los Alamos National Laboratory, Los Alamos, NM 87545, [email protected]
We study the effect of lossy memory on decision making in the context of a sequential action-reward problem.
An agent chooses a sequence of actions which generate discrete rewards at different rates. She is allowed to
make new choices at rate β, while past rewards disappear from her memory at rate µ. We focus on a family
of decision heuristics where the agent makes a new choice by randomly selecting an action with a probability
approximately proportional to the amount of past rewards associated with each action in her memory.
We provide closed-form formulae for the agent’s steady-state choice distribution in the regime where the
memory span is large (µ → 0), and show that the agent’s success critically depends on how quickly she
updates her choices relative to the speed of memory decay. If β ≫ µ, the agent almost always chooses the
best action, i.e., the one with the highest reward rate. Conversely, if β ≪ µ, the agent chooses an action with
a probability roughly proportional to its reward rate.*
Key words : stochastic model, memory, reinforcement learning, Markov process, fluid model.
1. Introduction
Can one make good choices despite having a faulty memory? We investigate in this paper the effect
of lossy memory on dynamic decision making via a simple action-reward model. An agent operating
in continuous time makes choices among a finite menu of actions. When action k is chosen, the
agent accrues discrete rewards according to a Poisson process with rate λk , while a unit of reward
“expires” and disappears from her memory after a random amount of time that is exponentially
distributed with rate µ. Opportunities for the agent to make new choices arise according to a
Poisson process with rate β. Instead of trying to find an “optimal” strategy, we will restrict our
attention to a simple family of heuristics that the agent uses to make new choices, referred to as
the reward matching rule, which reinforces actions with more positive past stimuli: she randomly
samples an action such that the probability of choosing action k is proportional to max{Qk , α},
where Qk is the total units of rewards accrued through action k that have yet to expire, and α > 0
an exploration parameter representing the agent’s minimal willingness to experiment with any
action. We are interested in the probability that the agent chooses an action in steady state.
* This version: October 19, 2016. The authors would like to thank Steven Callander, Dana Foarta, J. Michael Harrison,
David M. Kreps, Mihalis Markakis, Michael Ostrovsky, Dan Russo, Daniela Saban and Takuo Sugaya for the insightful
discussions and feedback on the manuscript.
Our model aims to capture the choice behavior of a non-sophisticated human decision maker
whose imperfect memory is subject to unpredictable losses. For instance, we may think of the agent
as a consumer, the actions as products or service providers, and the opportunities to choose new
actions as the times when the consumer is allowed to re-select her membership or subscription.
The rewards act as positive experiences, or impressions, resulting from using a service, and the
strength of each impression diminishes as time progresses. Furthermore, the agent that we have in
mind behaves more like an ordinary consumer, who, possibly being unaware of the memory loss,
employs a simple choice heuristic that tends to reinforce actions associated with more positive
impressions. This is in contrast to a well-informed decision maker or organization who has the
capability to deploy a sophisticated, “optimal” decision policy tailored to the problem. A more
extensive discussion on the motivation for our modeling assumptions will be given in Section 3.
Preview of Main Results. Our main results provide simple, closed-form formulae for the distribution of the agent’s choice in steady state, in the regime where the agent’s memory span is large
(µ → 0). Using these formulae, we show a “phase transition” in which the agent’s success critically
depends on how fast her choices are updated (β) relative to the rate of memory decay (µ), when α
is relatively small compared to the overall memory size. In particular, if β ≫ µ,¹ the agent nearly
always chooses the best action, i.e., the one with the highest λk. In contrast, if β ≪ µ, the agent
splits her attention more smoothly among all actions, with frequencies roughly proportional to the
respective λk ’s.
1.1. The Model
We now describe our model more formally, as is depicted in Figure 1. The system consists of
an agent operating in continuous time, who makes choices by selecting from a set of K actions,
K = {1, . . . , K }. We denote by C(t) ∈ K the action chosen by the agent at time t, and refer to
{C(t)}t∈R+ as the choice process. When C(t) = k for some k ∈ K, the agent accrues discrete
rewards according to a Poisson process with rate λk > 0, where each reward can be treated as a
token of unit size. We refer to λk as the reward rate of action k. We assume that there exists one
reward rate that strictly dominates the rest, and, without loss of generality, we rank them in a
decreasing order, so that λ1 > λ2 ≥ . . . ≥ λK > 0.
The agent’s fading memory is modeled by the “expiration” of past rewards. Each reward is
associated with a lifespan, an exponential random variable with rate µ > 0 drawn independently
from all other aspects of the system, and a reward departs permanently after staying in the system
for a time duration equal to its lifespan. We refer to µ as the rate of memory decay. For k ∈ K,
¹ The notation β ≫ µ represents $\lim_{\mu \to 0} \mu/\beta = 0$.
Figure 1: A snapshot of the sequential decision-making problem, where the choice at time t is action 1, and the agent accrues discrete rewards according to a Poisson process of rate λ1. Each existing recallable reward departs from the system at rate µ, as depicted by the horizontal arrows.
we denote by Qk (t) the total units of rewards accrued from choosing action k that have yet to
depart by time t, and we refer to {Q(t)}t∈R+ as the recallable reward process. Note that at time t
the recallable rewards associated with action k depart at an aggregate rate of µQk (t), and arrive
at rate λk, if C(t) = k, or 0, otherwise. We assume that the agent's initial recallable rewards at time 0 form a bounded vector in $\mathbb{Z}_+^K$.
The agent has the opportunity to make a new choice at a set of update points scattered across
time. The update points are distributed according to an exogenous Poisson process with rate β,
which we refer to as the update rate, so that the time between two adjacent update points is an
exponentially distributed random variable with mean β −1 , independent from all other aspects of
the system.
How does the agent make a new choice when an update point arises? In this work, we will focus
on a specific family of choice heuristics referred to as the reward matching rules. We assume that
at time t the agent makes a new choice by sampling from the distribution
$$\mathbb{P}(\text{new choice} = k) = \frac{Q_k(t) \vee \alpha}{\sum_{i \in \mathcal{K}} \left(Q_i(t) \vee \alpha\right)}, \qquad k \in \mathcal{K}, \qquad (1)$$
where x ∨ y = max{x, y }. Here, α is a positive constant referred to as the exploration parameter,
which captures the agent’s willingness to experiment with an action even if there is currently little
or no recallable reward attached to it. The reward matching decision rule can thus be seen as a
simple form of reinforcement behavior, where the agent is more likely to choose actions associated
with more past rewards.
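As a concrete illustration of the reward matching rule in Eq. (1), the short Python sketch below samples a new action with probability proportional to max{Q_k, α}. It is our own illustration rather than code from the paper; the array Q, the parameter alpha, and the function name are assumptions made for the example.

```python
import numpy as np

def reward_matching_choice(Q, alpha, rng=np.random.default_rng()):
    """Sample a new action with probability proportional to max(Q_k, alpha),
    as in the reward matching rule of Eq. (1)."""
    weights = np.maximum(np.asarray(Q, dtype=float), alpha)
    probs = weights / weights.sum()
    return rng.choice(len(weights), p=probs)

# Example: three actions; action 0 currently has the most recallable rewards.
print(reward_matching_choice([5.0, 1.0, 0.0], alpha=0.5))
```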
1.2. Performance Metrics and Scaling Regime
The main goal of this paper is to study the agent’s long-run behavior in the model described above,
and a key quantity of interest is the steady-state distribution of the choice process, C(·), i.e.,
$$\lim_{t \to \infty} \mathbb{P}(C(t) = k), \qquad k \in \mathcal{K}. \qquad (2)$$
Unfortunately, the choice process is non-Markovian as its transitions are influenced by the recallable
reward process Q(·) in a non-trivial way, and this makes it difficult to obtain explicit expressions
for its steady-state distribution. To circumvent this challenge, we will focus on the regime where
the average lifespan of a reward, $\mu^{-1}$, tends to infinity, as follows.² Fix the action set, K, and the
reward rates, {λk }k∈K . We consider a sequence of systems, indexed by m ∈ N, where the average
lifespan of a reward in the mth system is equal to m, i.e.,
µ = 1/m.
(3)
We will denote by αm , βm , Qm (·), and C m (·) the corresponding quantities in the mth system.
Finally, as the rewards’ average lifespan, m, tends to infinity, the total amount of recallable rewards
in the system also scales as Θ(m).³ For this reason, we will scale the exploration parameter $\alpha_m$ linearly with respect to m, by setting $\alpha_m = \alpha_0 \cdot m$, where $\alpha_0$ is a fixed parameter.
1.3. Notation
For x, y ∈ ℝ, x ∨ y and x ∧ y denote max{x, y} and min{x, y}, respectively. For a vector $x \in \mathbb{R}_+^K$, ‖x‖ denotes the L1 norm of x, $\|x\| = \sum_{i=1}^{K} |x_i|$. For two functions f and g, f ≫ g means that $\lim_{x \to \infty} f(x)/g(x) = \infty$; f ≪ g is defined analogously. We use the notation X(·) and X[·] as a shorthand for the processes $\{X(t)\}_{t \in \mathbb{R}_+}$ and $\{X[n]\}_{n \in \mathbb{N}}$, respectively. For two random variables X and Y taking values in ℝ, we use $X \stackrel{d}{=} Y$ to mean that X and Y have the same distribution. The expression $X \preceq Y$ means X is stochastically dominated by Y, i.e., P(X ≥ x) ≤ P(Y ≥ x) for all x ∈ ℝ. For an event A, we use X|A to denote the random variable with the cumulative distribution function P(X ≤ x | A), x ∈ ℝ. For a stochastic process $\{X(t)\}_{t \in \mathbb{R}_+}$, we denote by X(∞) the random variable distributed according to the stationary distribution of $\{X(t)\}_{t \in \mathbb{R}_+}$, if it exists.
2. Main Results
We now present our main results, which provide exact expressions for the steady-state distribution
of the choice process, as well as the expected value of the (scaled) recallable reward process, in the
limit as m → ∞. For the remainder of the paper, we will assume that the update rate βm satisfies
either β_m ≫ 1/m or β_m ≪ 1/m, as m → ∞.
² The limit that we are taking is not the same as setting µ to zero, in which case there would be no memory loss. This is because we focus on the sequence of steady-state distributions, by first taking the limit as t → ∞, and then as µ → 0.
³ To see this, note that if the choice process were to stay in action 1 all the time, then $Q_1^m(\cdot)$ would evolve according to an M/M/∞ queue with arrival rate λ1 and departure rate $m^{-1}$, leading to a steady-state expected total recallable reward of λ1·m.
Theorem 2.1 (Steady-State Choice Probabilities) Fix α0 > 0. The following limits exist:
$$c_k = \lim_{m \to \infty} \mathbb{P}\big(C^m(\infty) = k\big), \qquad k \in \mathcal{K}. \qquad (4)$$
Furthermore, their expressions depend on the scaling of βm , as follows.
1. (Memory-abundant regime) Suppose that β_m ≫ 1/m, as m → ∞. If α0 ≤ λ1/K, then
$$c_1 = 1 - \frac{K-1}{\lambda_1}\,\alpha_0, \qquad c_k = \frac{\alpha_0}{\lambda_1}, \quad k = 2, \ldots, K. \qquad (5)$$
If α0 > λ1/K, then
$$c_k = 1/K, \qquad k = 1, \ldots, K. \qquad (6)$$
2. (Memory-deficient regime) Suppose that β_m ≪ 1/m, as m → ∞. Then,
$$c_k = \frac{\lambda_k \vee \alpha_0 + (K-1)\alpha_0}{\sum_{i \in \mathcal{K}} \left[\lambda_i \vee \alpha_0 + (K-1)\alpha_0\right]}, \qquad k = 1, \ldots, K. \qquad (7)$$
The next theorem characterizes the steady-state expected value of the recallable reward process, $Q^m(\cdot)$. Define the scaled recallable reward process:
$$\bar{Q}^m(t) = \frac{1}{m}\, Q^m(mt), \qquad t \in \mathbb{R}_+. \qquad (8)$$
Theorem 2.2 (Steady-State Recallable Rewards) Fix α0 > 0. For all m ∈ ℕ, the scaled recallable reward process, $\{\bar{Q}^m(t)\}_{t \in \mathbb{R}_+}$, admits a unique steady-state distribution, and the following limits exist:
$$q_k = \lim_{m \to \infty} \mathbb{E}\big(\bar{Q}^m_k(\infty)\big), \qquad k \in \mathcal{K}. \qquad (9)$$
Furthermore, their expressions depend on the scaling of β_m, as follows.
1. (Memory-abundant regime) Suppose that β_m ≫ 1/m, as m → ∞. If α0 ≤ λ1/K, then
$$q_1 = \lambda_1 - (K-1)\alpha_0, \qquad q_k = (\lambda_k/\lambda_1)\,\alpha_0, \quad k = 2, \ldots, K. \qquad (10)$$
If α0 > λ1/K, then
$$q_k = \lambda_k/K, \qquad k = 1, \ldots, K. \qquad (11)$$
2. (Memory-deficient regime) Suppose that β_m ≪ 1/m, as m → ∞. Then,
$$q_k = \lambda_k \cdot \frac{\lambda_k \vee \alpha_0 + (K-1)\alpha_0}{\sum_{i \in \mathcal{K}} \left[\lambda_i \vee \alpha_0 + (K-1)\alpha_0\right]}, \qquad k = 1, \ldots, K. \qquad (12)$$
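To make the formulae concrete, the sketch below (our own illustration, not part of the paper) evaluates Eqs. (5) through (7) and (10) through (12) for the parameters used in Figure 2, namely (λ1, λ2, λ3, λ4) = (8, 6, 4, 2) and α0 = 0.5; the function and variable names are ours.

```python
import numpy as np

def limiting_values(lam, alpha0):
    """Limiting choice probabilities c_k and scaled recallable rewards q_k from
    Theorems 2.1 and 2.2, in the memory-abundant and memory-deficient regimes."""
    lam = np.asarray(lam, dtype=float)
    K, lam1 = len(lam), lam[0]              # actions are ranked so lam[0] is the largest rate
    if alpha0 <= lam1 / K:                  # memory-abundant regime, Eqs. (5) and (10)
        c_ab = np.full(K, alpha0 / lam1)
        c_ab[0] = 1.0 - (K - 1) * alpha0 / lam1
        q_ab = (lam / lam1) * alpha0
        q_ab[0] = lam1 - (K - 1) * alpha0
    else:                                   # memory-abundant regime, Eqs. (6) and (11)
        c_ab = np.full(K, 1.0 / K)
        q_ab = lam / K
    w = np.maximum(lam, alpha0) + (K - 1) * alpha0   # memory-deficient regime, Eqs. (7) and (12)
    c_df = w / w.sum()
    q_df = lam * c_df
    return c_ab, q_ab, c_df, q_df

c_ab, q_ab, c_df, q_df = limiting_values([8, 6, 4, 2], alpha0=0.5)
print(c_ab)   # [0.8125 0.0625 0.0625 0.0625]  -- winner takes all
print(c_df)   # approx [0.365 0.288 0.212 0.135]  -- roughly proportional to the reward rates
```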
2.1. Implications of the Theorems
A prominent feature of Theorems 2.1 and 2.2 is that the results critically depend on how quickly
the update rate βm scales relative to the rate of memory decay 1/m, but are otherwise insensitive
to the exact expression of βm . This leads to a natural dichotomy of the system dynamics into
what we called the memory-abundant and memory-deficient regimes, and we discuss below several
interesting properties of the two regimes.
Winner takes all. Suppose that the constant α0 in the exploration parameter is substantially
smaller than λ1 /K. Then, our results show that the best action attracts nearly all of the agent’s
attention, if β_m ≫ 1/m (memory-abundant regime). Specifically, in the limit as m → ∞, the agent chooses action 1 with a probability of $\big(1 - \frac{K-1}{\lambda_1}\alpha_0\big)$, which is almost 1 when α0 is very small compared
to λ1 /K. Moreover, the system exhibits an interesting winner-takes-all phenomenon, whereby the
degree to which the agent focuses on the best action is independent of the reward rates of all other
actions. For instance, if we were to increase the reward rate of the second-best action, λ2 , the
probability that the agent chooses action 1 would stay unchanged (and close to 1), and that of
choosing action 2 would still be close to zero no matter how close λ2 is to λ1 . One consequence
of this effect is that, if we imagine the actions as being “competitive” service providers vying for
a consumer’s attention, then the winner-takes-all phenomenon can be interpreted as a form of
extreme competition among the providers, where the best attracts nearly all of the consumer’s
budget, even if its superiority compared to the second best is not significant.
Figure 2: An example of the limiting steady-state distribution of the choice process $C^m(\cdot)$ in the two regimes given by Eqs. (5) through (7), with reward rates (λ1, λ2, λ3, λ4) = (8, 6, 4, 2) and α0 = 0.5. The agent's attention sharply concentrates on the best action in one regime, and is more spread out among all actions in the other.
Reward-rate-proportional allocation. In contrast to the memory-abundant regime, an update rate
that scales substantially slower than the memory decay rate will lead to an allocation of choices
that is much smoother as a function of the reward rates (Figure 2 provides a graphical comparison
of the qualitative difference between the agent’s behavior under the two regimes). In this case, the
probability that the agent chooses action k is proportional to [λk ∨ α0 + (K − 1)α0 ], which, when
α0 is small, is essentially equal to the reward rate, λk . The agent’s behavior thus exhibits a certain
weak reinforcement, where the better actions do attract more attention but only proportional to
their respective reward rates, and the above-mentioned winner-takes-all effect disappears.
The intuition behind this reward-rate-proportional allocation can be roughly explained as follows.
When the update rate is slow, by the time a new choice is to be made the agent would have
forgotten the rewards associated with all actions other than her most recent choice. Nevertheless,
she still exhibits a mild preference over actions with higher reward rates in steady state, because
she does retain a good memory of the rewards of her most recent choice, C(t), which can be shown
to be proportional to $\lambda_{C(t)}$. The reward rate $\lambda_{C(t)}$ dictates how likely the agent is to repeat the
same choice, which in turn leads to the reward rates' approximately linear appearance in the
steady-state distribution of the choice process.
Rapid updates lead to total oblivion. The theorems also show that, surprisingly, a slower update
rate might not always be bad. While it may be tempting to conclude that faster update rates
always lead to better outcomes for the agent, this turns out to be true only when the exploration
parameter is relatively small. In fact, as soon as α0 grows beyond λ1 /K, in the memory-abundant
regime, the agent becomes completely oblivious and chooses actions uniformly at random (Eq. (6)).
In contrast, the agent is always biased towards the best action in the memory-deficient regime,
albeit mildly, as long as α0 < λ1 (beyond which point the exploration parameter is so large that it
trivially overwhelms even the best action). The true culprit behind the complete oblivion is a lack
of patience: when the update rate is fast and the exploration parameter high, the agent tends to
switch her choices rapidly, before she is able to collect enough rewards to “learn” the reward rate
of any individual action. Consequently, in the long run, not a single component of the recallable
reward, not even the best action, can distinguish itself by consistently rising above the exploration
parameter, and hence complete oblivion becomes the only possible outcome.
2.2. Numerical Results
While Theorems 2.1 and 2.2 only apply in the limit where the average reward lifespan tends to
infinity, the simulation results in Figures 3 and 4 show that our theoretical results also provide
fairly accurate predictions, quantitatively and qualitatively, in systems with a moderate, finite
lifespan. In Figure 3, we fix m, while varying the update rate, β, and we observe that concentration
on the best action intensifies as β increases. Figure 4 shows example sample paths of the scaled recallable reward processes, $\bar{Q}^m(\cdot)$. One can discern without difficulty that, even for a modest m, the process in the memory-abundant regime (plots (a) through (c)) stays close to a set of smooth "fluid" trajectories, whose invariant state (the flat portion of the trajectories) coincides with the limiting value of $\mathbb{E}(\bar{Q}^m(\infty))$ predicted in Eq. (10), while in the memory-deficient regime the process changes more abruptly over time. As will become clear in the sequel, this qualitative discrepancy is a direct consequence of the overall system dynamics being dominated by the evolution of either $\bar{Q}^m(\cdot)$ or $C^m(\cdot)$, depending on the scaling of the update rate β_m, a key fact that we will exploit in our proof of the main theorems.
Figure 3: Time-average scaled recallable rewards (plot (a)) and distribution of choices (plot (b)) as a function of the update rate, β, with K = 4, (λ1, λ2, λ3, λ4) = (8, 6, 4, 2), α0 = 1, and m = 200. Each point is averaged over 5 × 10^7 to 5 × 10^5 time units for β ranging from 10^{-4} to 1, respectively (longer duration for a slower update rate).
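For readers who wish to reproduce experiments of this kind, the event-driven (Gillespie-style) simulation sketched below follows the model of Section 1.1 with the scaling µ = 1/m and α_m = α0·m of Section 1.2. It is our own minimal sketch, not the authors' simulation code, and the function name and default parameters are assumptions made for illustration.

```python
import numpy as np

def simulate(lam, alpha0, m, beta, T, seed=0):
    """Event-driven simulation of the action-reward model: rewards arrive at rate
    lam[C] for the current choice C, each recallable reward expires at rate 1/m,
    and update points arrive at rate beta, at which a new choice is drawn by
    reward matching with exploration parameter alpha0 * m.  Returns the fraction
    of time each action is chosen over the horizon [0, T]."""
    rng = np.random.default_rng(seed)
    K = len(lam)
    Q = np.zeros(K)              # recallable rewards per action
    C = 0                        # current choice
    t, time_in = 0.0, np.zeros(K)
    while t < T:
        rates = np.array([lam[C], Q.sum() / m, beta])   # arrival, expiration, update
        total = rates.sum()
        dt = min(rng.exponential(1.0 / total), T - t)
        time_in[C] += dt
        t += dt
        if t >= T:
            break
        event = rng.choice(3, p=rates / total)
        if event == 0:                       # a reward accrues to the current action
            Q[C] += 1
        elif event == 1:                     # one recallable reward expires
            k = rng.choice(K, p=Q / Q.sum())
            Q[k] -= 1
        else:                                # update point: reward matching, Eq. (1)
            w = np.maximum(Q, alpha0 * m)
            C = rng.choice(K, p=w / w.sum())
    return time_in / time_in.sum()

print(simulate([8, 6, 4, 2], alpha0=1.0, m=200, beta=1.0, T=2000.0))
```

With parameters comparable to Figure 3 (m = 200, α0 = 1), increasing β should shift the empirical choice frequencies from roughly reward-rate-proportional towards concentration on action 1.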
3. Motivation and Related Research
3.1. Assumptions on Memory and Choice Heuristics
The assumption of a reward’s lifespan being exponentially distributed is motivated by the exponential memory decay theory that dates back to the celebrated work of Ebbinghaus (9), which
posits that the recall probability of a past event decays exponentially as time passes. The reward
matching heuristic (Eq. (1)) is inspired by the so-called probability matching behavior originating
from the psychology and behavioral economics literature (see (25, 6) and the references therein),
where it has been empirically observed that, faced with a sequential action-reward problem similar
to ours, human agents or animal subjects often randomly select an action with probabilities proportional to their associated number of past successes, as opposed to greedily choosing the action
with the highest empirical, or perceived, average payoff.
It is important to note that neither our memory decay model nor choice heuristic comes close to
realistically modeling human behavior. The simple reward matching rule clearly does not capture
the full complexity of human decision making, and there has been evidence suggesting that memory decay may not necessarily follow an exponential curve (27).

Figure 4: Example sample paths under different memory decay and update rates, with K = 3, (λ1, λ2, λ3) = (8, 6, 4), and α0 = 1. Plots (a) through (c) correspond to the memory-abundant regime with m = 150, 400, and 800, respectively (µ = 1/150, 1/400, 1/800, each with β = 1), where the solid lines are the stochastic scaled recallable reward processes, $\bar{Q}^m(\cdot)$, and the dashed lines the fluid solution, q(·); the lines, from top to bottom, correspond to actions 1 through 3, respectively. Plot (d) corresponds to the memory-deficient regime (µ = 1/100, β = 1/500), where the update rate β is much smaller than µ.

Nevertheless, the simplified model
exhibits some key properties that an actual human agent likely shares. For instance, it is reasonable
to assume that more recent impressions weigh more in an agent’s mind than distant ones, even
though the decay over time might not be exponential. Similarly, one might agree that the agent
would be biased toward actions with more positive past impressions, while the exact form or degree
of such bias may vary on a case-to-case basis. As such, our model is intended to be a first step
towards understanding more realistic behavioral and decision models.
3.2. Related Literature
The impact of limited memory on sequential decision making has been examined in the statistics
literature dating back to the seminal work of (8, 14), which proposed algorithms for solving multi-armed bandit and sequential hypothesis testing problems when the decision maker has access to only
a finite number of bits of memory. In the queueing theory literature, (18) and (11) study load
balancing problems with space-bounded routing algorithms. In contrast to the present paper, the
memories in these models are assumed to be perfect and immune to random erasures.
The dynamics induced in our model are related to the celebrated Pólya's urn scheme (cf. (19)),
which has found a wide range of applications, with the restaurant process in non-parametric Bayesian
statistics being a prominent example (cf. Section 3.1 of (20)).⁴ In its basic form, one ball is drawn
uniformly at random from an urn containing balls of different colors. The chosen ball is then
returned to the urn along with a new ball of the same color, and the procedure repeats. Since the
probability of a certain color being chosen is proportional to the number of balls of that color
present in the urn, by viewing the balls as rewards and colors as actions, our model can be thought
of as the urn scheme with the modifications where (a) each ball disappears from the urn after some
random amount of time, corresponding to memory loss, and (b) the number of new balls added
is random, with a distribution that depends on the color, corresponding to the variations in the reward
rates. Note that in Pólya’s urn scheme the total number of balls in the urn eventually tends to
infinity as the number of draws increases, while in our model the recallable rewards always remain a finite random vector as a result of memory loss. As such, our model can be viewed as a stationary variation of the urn scheme, which admits very different dynamics.
On the technical end, a part of our proof relies on a certain fluid model to approximate the
evolution of the original, stochastic recallable reward process. While the use of fluid models is a
popular technique in the literature of queueing networks (cf. (16, 5, 23)), our work departs from
conventional fluid models in a notable way: the recallable reward process is not Markovian, and
the analysis hence must take into account the dynamics of the auxiliary choice process. It is worth
noting that there has been a large-deviation theory developed for obtaining fluid-like approximation
results where the transition rates of a continuous-time jump process are modulated by a finite-state
auxiliary chain (cf. Theorem 1 of (18), which is based on Theorem 8.15 of (21)). Unfortunately,
these results do not apply in our model for two reasons. First, they require the transition rate of the
jump process of interest to be bounded, a condition not met here due to the aggregate departure
rate of the rewards being unbounded over the state space. Second, the existing results typically
have one scaling parameter, m, whereby time is sped up by a factor of m, and space scaled down
by a factor of 1/m. We use a similar scaling for the recallable reward process, Qm (·), while the
update rate β_m serves as a second scaling parameter that is not captured by the model in (18, 21).
The addition of βm is non-trivial: in fact, if βm is too small, the fluid approximation does not
hold, and a different (memory-deficient) regime arises. Due to these considerations, we will develop
our fluid approximation results starting from first principles, by using elementary properties of
continuous-time martingales and some arguments developed by (17).
⁴ The restaurant process is also known in the literature as the "Chinese restaurant process". We tend to believe that the use of "Chinese" in this context is unnecessary.
Figure 5: A high-level illustration of the interaction between the choice and recallable reward processes: the reward matching policy maps the recallable reward Q(t) to the choice C(t), which in turn governs how new rewards are acquired.

4. Proof Overview
The remainder of the paper is devoted to the proof of Theorem 2.2; the proof of Theorem 2.1 is a
straightforward extension of Theorem 2.2 and will be given in Section 7. We discuss in this section
some high-level ideas before delving into the details. Our proof consists of two main components
corresponding to the memory-abundant and memory-deficient regimes, respectively. The system
dynamics turn out to be quite different depending on the scaling of βm , and as a result, so are our
proof techniques. Fix m ∈ N, and define the joint reward-choice process
$$W^m(t) = \big(Q^m(t), C^m(t)\big), \qquad t \in \mathbb{R}_+. \qquad (13)$$
It is not difficult to check that the joint process $\{W^m(t)\}_{t \in \mathbb{R}_+}$ is Markovian, while, for any fixed m, the component processes $\{Q^m(t)\}_{t \in \mathbb{R}_+}$ and $\{C^m(t)\}_{t \in \mathbb{R}_+}$ depend on one another and are not individually Markovian (see Figure 5). However, a key observation is that one of the two component processes becomes asymptotically Markovian as m → ∞:
1. (Memory-abundant) If β_m ≫ 1/m, a large number of choice updates takes place before substantial change occurs in the recallable reward process. As a result, the recallable reward process, $\{Q^m(t)\}_{t \in \mathbb{R}_+}$, becomes a sufficient Markov representation of the system dynamics, as m → ∞.
2. (Memory-deficient) If β_m ≪ 1/m, the interval between two consecutive update points is large. By the time a new choice is to be made, the agent will have "forgotten" nearly all of the rewards associated with the actions other than the most current choice, and the rewards associated with the current choice will have become quite predictable. Here, the choice process $\{C^m(t)\}_{t \in \mathbb{R}_+}$ becomes asymptotically Markovian as m → ∞.
Our proofs for the two regimes will be tailored to the dichotomy outlined above.
1. For the memory-abundant regime, our analysis focuses on characterizing the scaled recallable reward process, $\bar{Q}^m(\cdot)$, and uses a certain fluid model to show that, as m → ∞, the evolution of $\bar{Q}^m(\cdot)$ is well approximated by the solution to a system of ordinary differential equations (ODEs), and its stationary distribution converges to the unique invariant state of the ODEs. This portion of the proof will be presented in Section 5.
2. For the memory-deficient regime, we turn instead to the choice process, C m (·), and use its
asymptotic Markovian property to obtain an explicit formula for its steady-state distribution,
in the limit as m → ∞. We then leverage this formula to calculate the expected steady-state
recallable reward using the stationarity properties of the continuous-time processes. The proof
for this portion of the theorem is presented in Section 6.
5. The Memory-abundant Regime
We establish the first claim of Theorem 2.2 in this section, and we will do so by actually proving a stronger statement, that the steady-state distribution of $\bar{Q}^m_k(\cdot)$ converges in L1 to the corresponding expressions in Eqs. (10) and (11), as m → ∞. As was alluded to earlier, our main tool is certain fluid solutions, defined as the solutions to a set of ordinary differential equations, whose dynamics will be shown to closely approximate that of $\bar{Q}^m(\cdot)$ when m is large. We begin by introducing the fluid solutions and some of their basic properties.
5.1. Fluid Solutions and Their Basic Properties
Define the function $p: \mathbb{R}_+^K \to \mathbb{R}_+^K$, where
$$p_k(q) = \frac{q_k \vee \alpha_0}{\sum_{i \in \mathcal{K}} (q_i \vee \alpha_0)}, \qquad q \in \mathbb{R}_+^K,\ k \in \mathcal{K}. \qquad (14)$$
Definition 5.1 (Fluid Solutions) Fix $q^0 \in \mathbb{R}_+^K$. A continuous function, $q: \mathbb{R}_+ \to \mathbb{R}_+^K$, is called a fluid solution with initial condition $q^0$, if it satisfies the following:
1. $q(0) = q^0$;
2. for almost all t ∈ ℝ_+, the function is differentiable along all coordinates, with
$$\dot{q}_k(t) = \lambda_k p_k\big(q(t)\big) - q_k(t), \qquad \forall k \in \mathcal{K}, \qquad (15)$$
where p(·) is defined in Eq. (14).
We say that $q^I \in \mathbb{R}_+^K$ is an invariant state of the fluid solutions if, by setting $q^0 = q^I$, we have that $q(t) = q^I$ for all t ∈ ℝ_+.
The first term on the right-hand side of Eq. (15) corresponds to the instantaneous arrival rate
of the rewards, where pk (q(t)) is the probability that action k is chosen given the current states
of recallable rewards, and λk the reward rate for that action. The second term corresponds to the
instantaneous departure rate of rewards, which is proportional to the amount of recallable rewards
associated with action k. The following lemma states some basic properties of the fluid solutions.
The proof is given in Appendix B.1.
Lemma 5.2 For all $q^0 \in \mathbb{R}_+^K$, there exists a unique fluid solution $\{q(q^0, t)\}_{t \in \mathbb{R}_+}$ with initial condition $q^0$. Furthermore, for all t ∈ ℝ_+, $q(q^0, t)$ is a continuous function with respect to $q^0$.
The next result states that the fluid solutions admit a unique invariant state; the proof is given
in Appendix B.2. Note that the expressions of the invariant state coincide with those in Eqs. (10)
and (11) in Theorem 2.2.
Theorem 5.3 Fix α0 > 0. The fluid solutions admit a unique invariant state, $q^I$, whose expression depends on the value of α0, as follows.
1. Suppose α0 ∈ (0, λ1/K]. Then,
$$q_k^I = \begin{cases} \lambda_1 - (K-1)\alpha_0, & k = 1, \\ (\lambda_k/\lambda_1)\,\alpha_0, & k = 2, \ldots, K. \end{cases} \qquad (16)$$
2. Suppose α0 > λ1/K. Then,
$$q_k^I = \frac{\lambda_k}{K}, \qquad k = 1, \ldots, K. \qquad (17)$$
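As a quick numerical sanity check on Theorem 5.3 (our own sketch, not part of the paper), one can integrate the fluid ODE of Eq. (15) with a simple Euler scheme and compare the terminal state with the closed-form invariant state $q^I$; the function name, step size, and horizon below are arbitrary choices of ours.

```python
import numpy as np

def fluid_terminal_state(lam, alpha0, q0, dt=1e-3, T=50.0):
    """Euler integration of the fluid ODE (15): dq_k/dt = lam_k * p_k(q) - q_k,
    where p_k(q) is the reward-matching probability of Eq. (14)."""
    lam = np.asarray(lam, dtype=float)
    q = np.asarray(q0, dtype=float).copy()
    for _ in range(int(T / dt)):
        w = np.maximum(q, alpha0)
        p = w / w.sum()
        q += dt * (lam * p - q)
    return q

lam, alpha0 = [8.0, 6.0, 4.0], 1.0
print(fluid_terminal_state(lam, alpha0, q0=[0.0, 0.0, 0.0]))
# Theorem 5.3 (alpha0 <= lam_1/K = 8/3): q^I = (8 - 2*1, (6/8)*1, (4/8)*1) = (6, 0.75, 0.5)
```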
The following theorem is the main result of this section, which would imply the first claim of
Theorem 2.2.
Theorem 5.4 Fix m ∈ ℕ. The process $\bar{Q}^m(\cdot)$ admits a unique steady-state distribution, denoted by $\pi^m$. Suppose that β_m ≫ 1/m as m → ∞. We have that
$$\pi^m \text{ converges weakly to } \delta_{q^I}, \quad \text{as } m \to \infty, \qquad (18)$$
where $q^I$ is the unique invariant state of the fluid model, and $\delta_{q^I}$ is the probability measure with unit mass on $q^I$. Furthermore, we have that
$$\lim_{m \to \infty} \mathbb{E}\big(|\bar{Q}^m_k(\infty) - q_k^I|\big) = 0, \qquad \forall k \in \mathcal{K}. \qquad (19)$$
The remainder of Section 5 is devoted to showing Theorem 5.4 and consists of three main parts,
as illustrated in Figure 6.
1. Theorem 5.5 of Section 5.2 shows that the scaled recallable reward process converges uniformly
to a fluid solution over any finite time horizon, as m → ∞. This is the more technical portion
of the proof, and the key idea is to use a quasi-fluid process, one where the evolution of the
reward process is deterministic while the choice process remains random, as an intermediary
to bridge the gap between the fluid solution and the original stochastic process.
2. Theorem 5.12 of Section 5.3 shows that starting from any bounded initial condition, q 0 , the
fluid solution q(q 0 , t) converges exponentially fast to the unique invariant state, as t → ∞.
The proof is centered around the evolution of a simple piece-wise linear potential function,
which ensures that q(q 0 , t) rapidly enters a suitable subset of the state space, from which point
exponential convergence takes place.
3. Finally, in Section 5.4, by combining Theorems 5.5 and 5.12, we show that the sequence of
steady-state scaled recallable rewards converges in L1 to a unique limiting point concentrated
on the invariant state of the fluid solution, completing the proof of Theorem 5.4.
Figure 6: An illustrative summary of the convergence results for the memory-abundant regime: $\bar{Q}^m(t) \to q(t)$ as m → ∞ (arrow (a)), $q(t) \to q^I$ as t → ∞ (arrow (b)), $\bar{Q}^m(t) \to \bar{Q}^m(\infty)$ as t → ∞ (arrow (c)), and $\bar{Q}^m(\infty) \to q^I$ as m → ∞ (arrow (d)). The arrows (a) and (b) correspond to Theorems 5.5 and 5.12, respectively, and (c) and (d) to Theorem 5.4.
5.2. Convergence to the Fluid Solution over Finite Horizon
In this subsection, we show that, with high probability, the scaled recallable reward process converges to the fluid solution uniformly over any finite horizon, as m → ∞, as summarized in the following theorem. (See plots (a) and (b) of Figure 4 for a sample-path example of how this convergence takes place.)
Theorem 5.5 Suppose that β_m ≫ 1/m, as m → ∞. Fix $q^0 \in \mathbb{R}_+^K$, and suppose that
$$\lim_{m \to \infty} \mathbb{P}\big(\|\bar{Q}^m(0) - q^0\| > \epsilon\big) = 0, \qquad \forall \epsilon > 0. \qquad (20)$$
Let $q(q^0, \cdot)$ be the unique fluid solution with initial condition $q^0$, as defined in Lemma 5.2. We have that, for all T > 0,
$$\lim_{m \to \infty} \mathbb{P}\Big(\sup_{0 \le t \le T} \big\|\bar{Q}^m(t) - q(q^0, t)\big\| > \epsilon\Big) = 0, \qquad \forall \epsilon > 0. \qquad (21)$$
A main challenge in proving Theorem 5.5 is that the processes $Q^m(\cdot)$ and $C^m(\cdot)$ interact in an intricate way: the recallable reward vector influences how the subsequent choices are made, while the current choice in turn dictates the arrival rate of new rewards associated with each action. The main idea of the proof is to "disentangle" their interactions by means of an intermediate, quasi-fluid process, where the dynamics of the arrivals and departures of rewards become deterministic and "fluid-like", while the choice process remains random.
We now construct the quasi-fluid process. Fix $q \in \mathbb{R}_+^K$ and $c \in \mathcal{K}$, and define
$$G_k(q, c) = \lambda_k \mathbb{I}(c = k) - q_k, \qquad k \in \mathcal{K}. \qquad (22)$$
In words, Gk (q, c) represents the rate of change in recallable rewards associated with action k when
the current recallable reward vector and action are q and c, respectively. Define the scaled action process:
$$\bar{C}^m(t) = C^m(mt), \qquad t \in \mathbb{R}_+. \qquad (23)$$
Note that, unlike $\bar{Q}^m(\cdot)$, the scaling of $\bar{C}^m(\cdot)$ only occurs in time and not in space. We then
construct a quasi-fluid solution, $\{V^m(t)\}_{t \in \mathbb{R}_+}$, by letting
$$V_k^m(t) = q_k^0 + \int_0^t G_k\big(V^m(s), \bar{C}^m(s)\big)\,ds, \qquad k \in \mathcal{K},\ t \in \mathbb{R}_+. \qquad (24)$$
In words, V m (·) corresponds to a system in which recallable rewards are governed by deterministic
arrival and departure processes, as specified by Eq. (22), and where the choice process is scaled in
time by a factor of m, but remains stochastic.
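To make the construction of Eq. (24) concrete, the sketch below (ours, driven by a hypothetical piecewise-constant choice path supplied by hand) Euler-integrates $\dot{V}_k(t) = G_k(V(t), c(t))$; in the actual coupling, the driving path would be the time-scaled choice process $\bar{C}^m(\cdot)$ itself.

```python
import numpy as np

def quasi_fluid(lam, q0, choice_path, T, dt=1e-3):
    """Integrate V_k'(t) = G_k(V(t), c(t)) = lam_k * 1{c(t) = k} - V_k(t), as in
    Eq. (24), along a piecewise-constant choice path given as a list of
    (start_time, action) pairs sorted by start_time."""
    lam = np.asarray(lam, dtype=float)
    V = np.asarray(q0, dtype=float).copy()
    times, actions = zip(*choice_path)
    for step in range(int(T / dt)):
        t = step * dt
        c = actions[np.searchsorted(times, t, side="right") - 1]   # current action
        drift = -V.copy()              # departure term: -V_k for every k
        drift[c] += lam[c]             # arrival term only for the current action
        V += dt * drift
    return V

# Hypothetical path: action 0 on [0, 3), action 1 on [3, 6).
print(quasi_fluid([8.0, 6.0], q0=[0.0, 0.0], choice_path=[(0.0, 0), (3.0, 1)], T=6.0))
```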
We note some basic properties of $V_k^m(\cdot)$ that will become useful. First, since $\bar{C}^m(\cdot)$ is a Markov jump process and the function $G_k(\cdot, \cdot)$ is uniformly Lipschitz continuous, analogously to Lemma 5.2, it is not difficult to show that $V^m(\cdot)$, expressed as a solution to the above integral equation, is uniquely defined almost surely. Also, it is not difficult to verify that
$$0 \le V_k^m(t) \le q_k^0 + \lambda_k t, \qquad \forall t \in \mathbb{R}_+, \qquad (25)$$
where the right-hand side of the second inequality corresponds to the scenario where $\bar{C}^m(s) = k$ for all s ∈ [0, t] and the negative drift term $(-q_k)$ is absent from the expression of $G_k(q, c)$.
The proof of Theorem 5.5 proceeds in two main steps. We first show that $\{\bar{Q}^m(t)\}_{t \ge 0}$ converges to the quasi-fluid process, $\{V^m(t)\}_{t \ge 0}$, as m → ∞, as is summarized in the following proposition.
Proposition 5.6 Fix $q^0 \in \mathbb{R}_+^K$, and suppose that
$$\lim_{m \to \infty} \mathbb{P}\big(\|\bar{Q}^m(0) - q^0\| > \epsilon\big) = 0, \qquad \forall \epsilon > 0. \qquad (26)$$
Then, for all T > 0,
$$\lim_{m \to \infty} \mathbb{P}\Big(\sup_{0 \le t \le T} \big\|\bar{Q}^m(t) - V^m(t)\big\| > \epsilon\Big) = 0, \qquad \forall \epsilon > 0. \qquad (27)$$
We then show that, when β_m ≫ 1/m, the process $\{V^m(t)\}_{t \ge 0}$ converges to the fluid solution, $\{q(q^0, t)\}_{t \ge 0}$:

Proposition 5.7 Suppose, in addition to the condition of Eq. (26) in Proposition 5.6, that mβ_m → ∞, as m → ∞. For all ε > 0,
$$\lim_{m \to \infty} \mathbb{P}\Big(\sup_{0 \le t \le T} \big\|V^m(t) - q(q^0, t)\big\| > \epsilon\Big) = 0.$$
Note that Proposition 5.6 holds independently of the scaling of βm , while Proposition 5.7 requires
that the system be in the memory-abundant regime. Together, Propositions 5.6 and 5.7 imply
Theorem 5.5. The proof for Propositions 5.6 and 5.7 will be given in Sections 5.2.1 and 5.2.2,
respectively.
5.2.1. Proof of Proposition 5.6 The proof of Proposition 5.6 is largely based on maximal inequalities for continuous-time martingales, and the main steps follow the approach developed in
(17). Define $X^m(t) \in \mathbb{R}^K$, where
$$X_k^m(t) = \bar{Q}_k^m(t) - \Big( \bar{Q}_k^m(0) + \int_0^t G_k\big(\bar{Q}^m(s), \bar{C}^m(s)\big)\,ds \Big), \qquad t \in \mathbb{R}_+,\ k \in \mathcal{K}. \qquad (28)$$
The proof proceeds in two stages: we first show, via Gronwall's lemma (Proposition A.2 in Appendix A), that bounding the deviation of $\bar{Q}_k^m(\cdot)$ from $V_k^m(\cdot)$ can be reduced to bounding the magnitude of $X_k^m(\cdot)$. We then show that this can be accomplished by writing $X_k^m(\cdot)$ as a sum of two continuous-time martingales. From the definition of $\{V^m(t)\}_{t \ge 0}$, we have that
$$\begin{aligned}
\big\|\bar{Q}^m(t) - V^m(t)\big\| &= \sum_{k \in \mathcal{K}} \Big| \bar{Q}_k^m(t) - q_k^0 - \int_0^t G_k\big(V^m(s), \bar{C}^m(s)\big)\,ds \Big| \\
&\le \sum_{k \in \mathcal{K}} \big|\bar{Q}_k^m(0) - q_k^0\big| + \sum_{k \in \mathcal{K}} \big|X_k^m(t)\big| + \sum_{k \in \mathcal{K}} \int_0^t \Big| G_k\big(\bar{Q}^m(s), \bar{C}^m(s)\big) - G_k\big(V^m(s), \bar{C}^m(s)\big) \Big|\,ds \\
&\le \big\|\bar{Q}^m(0) - q^0\big\| + \big\|X^m(t)\big\| + \int_0^t \sum_{k \in \mathcal{K}} \Big| G_k\big(\bar{Q}^m(s), \bar{C}^m(s)\big) - G_k\big(V^m(s), \bar{C}^m(s)\big) \Big|\,ds \\
&\stackrel{(a)}{\le} \big\|\bar{Q}^m(0) - q^0\big\| + \big\|X^m(t)\big\| + \int_0^t \sum_{k \in \mathcal{K}} \big|\bar{Q}_k^m(s) - V_k^m(s)\big|\,ds \\
&= \big\|\bar{Q}^m(0) - q^0\big\| + \big\|X^m(t)\big\| + \int_0^t \big\|\bar{Q}^m(s) - V^m(s)\big\|\,ds. \qquad (29)
\end{aligned}$$
For step (a), we observe that for all c, k ∈ K, $G_k(\cdot, c)$ is a 1-Lipschitz continuous function. We now apply Gronwall's lemma (Proposition A.2 in Appendix A) to Eq. (29), with $a(t) = \|\bar{Q}^m(0) - q^0\| + \|X^m(t)\|$, $b(t) = 1$, and $f(t) = \|\bar{Q}^m(t) - V^m(t)\|$, and obtain that
$$\begin{aligned}
\mathbb{P}\Big(\sup_{0 \le t \le T} \big\|\bar{Q}^m(t) - V^m(t)\big\| > \epsilon\Big) &\le \mathbb{P}\Big(\sup_{0 \le t \le T} \big(\|\bar{Q}^m(0) - q^0\| + \|X^m(t)\|\big)\, e^{t} > \epsilon\Big) \\
&\le \mathbb{P}\Big(\|\bar{Q}^m(0) - q^0\| > \tfrac{\epsilon}{2}\, e^{-T}\Big) + \mathbb{P}\Big(\sup_{0 \le t \le T} \|X^m(t)\| > \tfrac{\epsilon}{2}\, e^{-T}\Big). \qquad (30)
\end{aligned}$$
Because we have assumed that $\lim_{m \to \infty} \mathbb{P}\big(\|\bar{Q}^m(0) - q^0\| > \epsilon\big) = 0$ for all ε > 0, to prove Proposition 5.6, it therefore suffices to show that
$$\lim_{m \to \infty} \mathbb{P}\Big(\sup_{0 \le t \le T} \|X^m(t)\| > \epsilon\Big) = 0, \qquad \forall \epsilon > 0. \qquad (31)$$
We now prove Eq. (31) by expressing $X_k^m(\cdot)$ as the sum of two continuous-time martingales corresponding to the arrivals and departures of rewards, respectively. Fix t ∈ ℝ_+ and k ∈ K, and denote by $A_k^m(t)$ and $D_k^m(t)$ the amount of rewards associated with action k that have arrived and departed, respectively, during the interval [0, t]. In particular, we can write
$$Q_k^m(t) = Q_k^m(0) + A_k^m(t) - D_k^m(t), \qquad t \in \mathbb{R}_+. \qquad (32)$$
We have that
$$\begin{aligned}
|X_k^m(t)| &= \Big| \frac{1}{m}\big(A_k^m(mt) - D_k^m(mt)\big) - \int_0^t G_k\big(\bar{Q}^m(s), \bar{C}^m(s)\big)\,ds \Big| \\
&= \Big| \frac{1}{m}\Big(A_k^m(mt) - \int_0^{mt} \lambda_k \mathbb{I}(C^m(s) = k)\,ds\Big) - \frac{1}{m}\Big(D_k^m(mt) - \int_0^{mt} Q_k^m(s)\, m^{-1}\,ds\Big) \Big| \\
&\le \frac{1}{m}\Big| A_k^m(mt) - \int_0^{mt} \lambda_k \mathbb{I}(C^m(s) = k)\,ds \Big| + \frac{1}{m}\Big| D_k^m(mt) - \int_0^{mt} Q_k^m(s)\, m^{-1}\,ds \Big|. \qquad (33)
\end{aligned}$$
For the remainder of the proof, we will focus on showing that, for all ε > 0,
$$\lim_{m \to \infty} \mathbb{P}\Big(\sup_{0 \le t \le T} \frac{1}{m}\Big| A_k^m(mt) - \int_0^{mt} \lambda_k \mathbb{I}(C^m(s) = k)\,ds \Big| > \epsilon\Big) = 0, \qquad (34)$$
and
$$\lim_{m \to \infty} \mathbb{P}\Big(\sup_{0 \le t \le T} \frac{1}{m}\Big| D_k^m(mt) - \int_0^{mt} Q_k^m(s)\, m^{-1}\,ds \Big| > \epsilon\Big) = 0. \qquad (35)$$
In light of Eq. (33), the above two equations imply the validity of Eq. (31), which proves Proposition
5.6. The proofs for both Eqs. (34) and (35) hinge upon the following technical result, which states
that the sample path of a certain time-inhomogeneous Poisson process stays close to its mean, with
high probability. The proof of the lemma is given in Appendix B.3, and uses a similar line of
argument as that of Theorem 2.2 in (17).
Lemma 5.8 Fix $l \in \mathbb{Z}^{K+1}$ and T ∈ ℝ_+. Let $\{N(t)\}_{t \ge 0}$ be the counting process where N(t) is the number of times in [0, t] that the process $W^m(\cdot)$ jumps from state w to w + l, for some $w \in \mathbb{Z}^{K+1}$. Denote by $\psi: \mathbb{Z}^{K+1} \to \mathbb{R}_+$ the corresponding rate function of N(·), so that the instantaneous transition rate of N(·) at time t is equal to $\psi(W^m(t))$. For all ε, φ > 0, we have that
$$\mathbb{P}\Big(\sup_{0 \le t \le T} \Big| N(t) - \int_0^t \psi\big(W^m(s)\big)\,ds \Big| \ge \epsilon\Big) \le 2 e^{-\phi T \cdot h(\epsilon/\phi T)} + \mathbb{P}\Big(\sup_{0 \le t \le T} \psi\big(W^m(t)\big) \ge \phi\Big), \qquad (36)$$
where h(x) = (1 + x) log(1 + x) − x.
We now prove Eqs. (34) and (35). In the context of Lemma 5.8, $A_k^m(\cdot)$ is the counting process with $l = e_k$, where $e_k$ is the $(K+1) \times 1$ vector whose kth coordinate is 1 and all other coordinates are zero, corresponding to an arrival to $Q_k^m(\cdot)$. The rate of $A_k^m(\cdot)$ at time t is equal to $\lambda_k \mathbb{I}(C^m(t) = k)$, which is bounded from above by $\lambda_k$ for all t ∈ ℝ_+. By applying Lemma 5.8, with $\psi(W^m(t)) = \lambda_k \mathbb{I}(C^m(t) = k)$ and $\phi = \lambda_k$, we have that $\mathbb{P}\big(\sup_{0 \le t \le T} \psi(W^m(t)) > \phi\big) = 0$, and for all ε > 0,
$$\begin{aligned}
\lim_{m \to \infty} \mathbb{P}\Big(\sup_{0 \le t \le T} \frac{1}{m}\Big| A_k^m(mt) - \int_0^{mt} \lambda_k \mathbb{I}(C^m(s) = k)\,ds \Big| \ge \epsilon\Big)
&= \lim_{m \to \infty} \mathbb{P}\Big(\sup_{0 \le t \le mT} \Big| A_k^m(t) - \int_0^{t} \lambda_k \mathbb{I}(C^m(s) = k)\,ds \Big| \ge m\epsilon\Big) \\
&\le \lim_{m \to \infty} 2 \exp\Big(-\lambda_k m T \cdot h\Big(\frac{\epsilon}{\lambda_k T}\Big)\Big) = 0. \qquad (37)
\end{aligned}$$
The proof of Eq. (35) uses essentially the same idea, but the argument needs to be more delicate due to the fact that the rate of the counting process $D_k^m(t)$, which is equal to $Q_k^m(t)\mu = Q_k^m(t)\, m^{-1}$, is not bounded over the state space. Therefore, we first derive an upper bound on the tail probabilities of $Q_k^m(t)$, as follows. Let ψ(·) be the rate function for $D_k^m(\cdot)$ as in Lemma 5.8, and $\phi = 2 q_k^0 + (1 + \epsilon)\lambda_k T$. Fixing ε > 0, we have that, for all $r \in \mathbb{Z}_+$,
$$\begin{aligned}
\mathbb{P}\Big(\sup_{0 \le t \le mT} \psi\big(W^m(t)\big) \ge \phi\Big) &= \mathbb{P}\Big(\sup_{0 \le t \le mT} Q_k^m(t)\, m^{-1} \ge \phi\Big) \\
&\le \mathbb{P}\big(Q_k^m(0) \ge 2 m q_k^0\big) + \max_{r = 1, \ldots, 2 m q_k^0} \mathbb{P}\Big(\sup_{0 \le t \le mT} Q_k^m(t) \ge m\phi \,\Big|\, Q_k^m(0) = r\Big) \\
&= \mathbb{P}\big(Q_k^m(0) \ge 2 m q_k^0\big) + \max_{r = 1, \ldots, 2 m q_k^0} \mathbb{P}\Big(\sup_{0 \le t \le mT} Q_k^m(t) - Q_k^m(0) \ge m\phi - r \,\Big|\, Q_k^m(0) = r\Big) \\
&\le \mathbb{P}\big(Q_k^m(0) \ge 2 m q_k^0\big) + \max_{r = 1, \ldots, 2 m q_k^0} \mathbb{P}\Big(\sup_{0 \le t \le mT} A_k^m(t) \ge m\phi - r \,\Big|\, Q_k^m(0) = r\Big) \\
&\stackrel{(a)}{=} \mathbb{P}\big(Q_k^m(0) \ge 2 m q_k^0\big) + \mathbb{P}\Big(\sup_{0 \le t \le mT} A_k^m(t) \ge (1+\epsilon)\, m \lambda_k T\Big) \\
&\stackrel{(b)}{\le} \mathbb{P}\big(Q_k^m(0) \ge 2 m q_k^0\big) + \mathbb{P}\Big(\sup_{0 \le t \le mT} \Big| A_k^m(t) - \int_0^t \lambda_k \mathbb{I}(C^m(s) = k)\,ds \Big| \ge m\big[(1+\epsilon)\lambda_k T - \lambda_k T\big]\Big) \\
&\stackrel{(c)}{\le} \mathbb{P}\big(m^{-1} Q_k^m(0) \ge 2 q_k^0\big) + 2 \exp\big(-\lambda_k m T \cdot h(\epsilon)\big), \qquad (38)
\end{aligned}$$
where step (a) follows from the fact that $A_k^m(\cdot)$ is independent of $Q_k^m(0)$, and hence the maximum in the second term is attained by setting $r = 2 m q_k^0$. Step (b) follows from $\lambda_k \mathbb{I}(C^m(s) = k) \le \lambda_k$ for all s ∈ ℝ_+, and (c) from the inequality in Eq. (37), by replacing ε with $\epsilon \lambda_k T$.
We are now ready to establish Eq. (35):
$$\begin{aligned}
\mathbb{P}\Big(\sup_{0 \le t \le T} \frac{1}{m}\Big| D_k^m(mt) - \int_0^{mt} Q_k^m(s)\, m^{-1}\,ds \Big| > \epsilon\Big) &= \mathbb{P}\Big(\sup_{0 \le t \le mT} \Big| D_k^m(t) - \int_0^{t} Q_k^m(s)\, m^{-1}\,ds \Big| > m\epsilon\Big) \\
&\stackrel{(a)}{\le} 2 \exp\Big(-\phi m T \cdot h\Big(\frac{\epsilon}{\phi T}\Big)\Big) + \mathbb{P}\Big(\sup_{0 \le t \le mT} \psi\big(W^m(t)\big) \ge \phi\Big) \\
&\stackrel{(b)}{\le} 2 \exp\Big(-\phi m T \cdot h\Big(\frac{\epsilon}{\phi T}\Big)\Big) + 2 \exp\big(-\lambda_k m T \cdot h(\epsilon)\big) + \mathbb{P}\big(m^{-1} Q_k^m(0) \ge 2 q_k^0\big), \qquad (39)
\end{aligned}$$
where step (a) follows from Lemma 5.8, and (b) from Eq. (38). Note that φ, ε, and T are positive constants, and by our assumption, $\lim_{m \to \infty} \mathbb{P}\big(|m^{-1} Q_k^m(0) - q_k^0| > \delta\big) = 0$ for all δ > 0. Therefore, Eq. (35) follows by taking the limit in the above inequality as m → ∞. This completes the proof of Proposition 5.6.
5.2.2. Proof of Proposition 5.7 For the purpose of this proof, we will use an alternative representation of the fluid solutions using integral equations. Define the drift function, $F: \mathbb{R}_+^K \to \mathbb{R}^K$:
$$F_k(q) = \lambda_k p_k(q) - q_k, \qquad q \in \mathbb{R}_+^K,\ k \in \mathcal{K}, \qquad (40)$$
where p(·) is defined in Eq. (14). It can be verified from the definition of p(·) that there exists $l_F \in \mathbb{R}_+$ such that $F_k(\cdot)$ is $l_F$-Lipschitz continuous for all k ∈ K. Fix $q^0 \in \mathbb{R}_+^K$, and let $\{q(t)\}_{t \in \mathbb{R}_+}$, $q(t) \in \mathbb{R}_+^K$, be a solution to the following integral equation:
$$q_k(t) = q_k^0 + \int_0^t F_k\big(q(s)\big)\,ds, \qquad k \in \mathcal{K}. \qquad (41)$$
Similar to Lemma 5.2, it is not difficult to show that the function q(·) defined in Eq. (41) exists, is unique, and coincides with the fluid solution with initial condition $q^0$. We have that
$$\begin{aligned}
\big\|V^m(t) - q(t)\big\| &= \sum_{k \in \mathcal{K}} \Big| \int_0^t \Big[ G_k\big(V^m(s), \bar{C}^m(s)\big) - F_k\big(q(s)\big) \Big]\,ds \Big| \\
&\le \sum_{k \in \mathcal{K}} \Big| \int_0^t \Big[ G_k\big(V^m(s), \bar{C}^m(s)\big) - F_k\big(V^m(s)\big) \Big]\,ds \Big| + \int_0^t \big\| F\big(V^m(s)\big) - F\big(q(s)\big) \big\|\,ds \\
&\le \sum_{k \in \mathcal{K}} \Big| \int_0^t \Big[ G_k\big(V^m(s), \bar{C}^m(s)\big) - F_k\big(V^m(s)\big) \Big]\,ds \Big| + l_F \int_0^t \big\| V^m(s) - q(s) \big\|\,ds, \qquad (42)
\end{aligned}$$
where the last inequalities come from the fact that $F_k(\cdot)$ is $l_F$-Lipschitz continuous for all k ∈ K. In
a manner analogous to Eq. (29) from the proof of Proposition 5.6, by applying Gronwall's lemma (Proposition A.2 in Appendix A), we have that, for all ε > 0,
$$\mathbb{P}\Big(\sup_{0 \le t \le T} \big\|V^m(t) - q(t)\big\| > \epsilon\Big) \le \mathbb{P}\bigg(\sup_{0 \le t \le T} \sum_{k \in \mathcal{K}} \Big| \int_0^t \Big[ G_k\big(V^m(s), \bar{C}^m(s)\big) - F_k\big(V^m(s)\big) \Big]\,ds \Big| > \epsilon\, e^{-l_F T}\bigg). \qquad (43)$$
Therefore, in order to establish Proposition 5.7, it suffices to show that
$$\lim_{m \to \infty} \mathbb{P}\bigg(\sup_{0 \le t \le T} \sum_{k \in \mathcal{K}} \Big| \int_0^t \Big[ G_k\big(V^m(s), \bar{C}^m(s)\big) - F_k\big(V^m(s)\big) \Big]\,ds \Big| > \epsilon\bigg) = 0, \qquad \forall \epsilon > 0. \qquad (44)$$
In what follows, we will show (44) by using the discrete-time embedded process of $C^m(\cdot)$ and analyzing the system dynamics at the times when new choices are made. We begin by introducing some notation. Fixing $i \in \mathbb{Z}_+$, we denote by $S_i$ the ith update point, i.e., the time of the ith update in $C^m(\cdot)$, with $S_0 = 0$, and
$$\tau_i^m = m^{-1}(S_i - S_{i-1}), \qquad i \in \mathbb{N}. \qquad (45)$$
Note that $\{\tau_i^m\}_{i \in \mathbb{N}}$ are independent exponential random variables with mean $(m\beta_m)^{-1}$. Let $\{\bar{Q}^m[i]\}_{i \in \mathbb{N}}$ be the discrete-time process, where $\bar{Q}^m[i]$ corresponds to the value of $\bar{Q}^m(\cdot)$ immediately following the ith update point:
$$\bar{Q}_k^m[i] = m^{-1} Q_k^m(S_i) = \bar{Q}_k^m(S_i/m), \qquad k \in \mathcal{K},\ i \in \mathbb{N}, \qquad (46)$$
and let $\{Z_k^m[i]\}_{i \in \mathbb{N}}$ be the process of indicator variables:
$$Z_k^m[i] = \mathbb{I}\big(C^m(S_i) = k\big), \qquad k \in \mathcal{K},\ i \in \mathbb{N}. \qquad (47)$$
That is, $Z_k^m[i] = 1$ if action k is selected at the ith update point. Finally, let $\{\widehat{Q}^m(t)\}_{t \ge 0}$ be a right-continuous piece-wise constant process which coincides with $\bar{Q}^m(\cdot)$ at the points $\{S_i/m\}_{i \in \mathbb{N}}$:
$$\widehat{Q}^m(t) = \bar{Q}^m[i], \qquad \forall t \in [S_i/m, S_{i+1}/m). \qquad (48)$$
Fix k ∈ K, and denote by It the number of updates in C m (·) by time t, i.e.,
It = max{i : Si ≤ t}.
(49)
By the triangle inequality, we have that
Z t
m
−1
m
m
λk sup Gk (V (s), C (s)) − Fk (V (s))ds
0≤t≤T
Z0 t m
−1
m
=λk sup λk I(C (s) = k) − pk (V (s)) ds
0≤t≤T
0
Z t Z t m
m
m
m
b
b
≤ sup I(C (s) = k) − pk (Q (s)) ds + pk (Q (s)) − pk (V (s)) ds
0≤t≤T
0
0
#
"I +1
mt
Z t X
m
m
m
m
m
b
= sup pk (Q (s)) − pk (V (s)) ds
τi Zk [i] − pk (Q [i]) + 0≤t≤T 0
i=1
I +1
Z t mt
m
X m
m
m
m
b (s)) − pk (V (s)) ds
≤ sup pk (Q
τi Zk [i] − pk (Q [i]) + sup 0≤t≤T 0≤t≤T
0
i=1
(50)
We now derive upper bounds on tail probabilities for each term on the right-hand side of Eq. (50).
For the first term, define
ξnm
=
n
X
m
τim Zkm [i] − pk (Q [i]) ,
n ∈ N.
(51)
i=1
Fix m ∈ N and > 0. Let
Jm = (1 + )mβm T.
(52)
(To avoid the excessive use of floors and ceilings, we assume that Jm , as well as the number of
summands in the subsequent Lemma 5.9, are positive integers. The results extend easily to the
general case.) Define the events
Am
= {ImT <
Jm } ,
and
Bm
=
max |ξnm |
0≤n≤Jm
≤ .
(53)
We have that
I +1
!
mt
m
X m
m
max |ξnm | ≥ P sup τi Zk [i] − pk (Q [i]) ≥ =P
0≤n≤ImT
0≤t≤T i=1
m
≤1 − P(Am
∩ B )
m
≤(1 − P(Am
)) + (1 − P(B )).
(54)
m
With the above equation in mind, we now proceed to demonstrate that both P(Am
) and P(B )
are diminishingly small in the limit as m → ∞. The following lemma is useful, whose proof is a
straightforward application of Chebyshev’s inequality and is given in Appendix B.4.
Lemma 5.9 Fix r, ε, and y > 0. Let $\{X_i\}_{i \in \mathbb{N}}$ be a sequence of independent and identically distributed exponential random variables with mean $r^{-1}$. Then,
$$\mathbb{P}\bigg(\sum_{i=1}^{(1+\epsilon) r y} X_i \le y\bigg) \le \frac{1+\epsilon}{r \epsilon^2 y}. \qquad (55)$$
Recall that the $\tau_i^m$'s
are i.i.d. exponential random variables with mean (mβm )−1 . Applying Lemma
5.9 with r = mβm and y = T , we have that
(1+)mβm T
P(Am
)
= P(ImT < (1 + )mβm T ) = 1 − P
X
!
τim
≤T
≥1−
i=1
We next turn to the value of
P(Bm ).
1+
(mβm )−1 .
T 2
(56)
Recall from the definition of Zkm [i] that
m
m
E(Zkm [i] Q [i]) = pk (Q [i]).
(57)
It is therefore not difficult to verify that {ξnm }n∈N is a martingale, and our objective would be
to derive an upper bound on its maximum upward excursion over {0, . . . , Jm }. Unfortunately, we
cannot apply the Azuma-Hoeffding inequality, because the ith increment of {ξnm }n∈N involves the
term τim , which does not admit a bounded support. Instead, we will use the following upper bound
on the moment generating function of ξnm , whose proof is based on Doob’s inequality and is given
in Appendix B.5.
Lemma 5.10 Fix n ∈ ℕ and θ ∈ (0, mβ_m). We have that
$$\mathbb{E}\big(\exp(\theta \xi_n^m)\big) \le \exp\bigg(\frac{\theta^2 n}{4 m \beta_m (m\beta_m - \theta)}\bigg). \qquad (58)$$
We are now ready to establish an upper bound on the quantity 1 − P(Bm ). Recall that Jm =
(1 + )mβm T . Fix η ∈ (0, min{T, 1}/2), and let θ0 = η 1+
T −1 mβm . In particular, θ0 ∈ (0, mβm /2).
We have that
1 − P(Bm )
=P
(a)
max ξnm
0≤n≤Jm
≥
≤ inf exp (−θ) E exp θξJmm
θ>0
(b)
Jm θ02 /4
≤ exp −θ0 +
mβm (mβm − θ0 )
(c)
Jm θ02
≤ exp −θ0 +
2(mβm )2
= exp −θ0
(1 + )T θ0
−
2mβm
(d)
≤ exp (−θ0 ( − /2))
2
−1
= exp −η
T mβm .
2 + 2
(59)
Step (a) follows from Doob’s inequality and the fact that, for any θ > 0, the sequence {exp (θξnm )}n∈N
is a submartingale, and (b) from Lemma 5.10 with n = Jm . Step (c) is due to θ0 < mβm /2, and
hence mβm − θ0 > mβm /2. Finally, step (d) follows from the definition of θ0 and that η < 1.
With a derivation analogous to that of Eq. (59), we have that
2
m
−1
P
min ξn ≤ − ≤ exp −η
T mβm .
0≤n≤Jm
2 + 2
(60)
Therefore, combining Eq. (54) with Eqs. (56), (59) and (60), we have that
I +1
!
mt
m
X m
m
m
P sup τi Zk [i] − pk (Q [i]) ≥ ≤(1 − P(Am
)) + (1 − P(B ))
0≤t≤T
i=1
1+
2
−1
−1
≤ 2 (mβm ) + 2 exp −η
T mβm . (61)
T
2 + 2
Since mβm → ∞ as m → ∞, the above equation further implies that
I +1
!
mt
2
1+
m
X m
−1
−1
m
(mβm ) + 2 exp −η
T mβm
lim P sup τi Zk [i] − pk (Q [i]) ≥ ≤ lim
m→∞
m→∞
T 2
2 + 2
0≤t≤T i=1
=0
(62)
We now bound the second term in Eq. (50). It is not difficult to verify, from Eq. (14), that there
exists lp ∈ R+ such that pk (·) is lp -Lipschitz continuous for all k ∈ K. We have that
Z t Z t bm
m
m
m
b (s)) − pk (V (s)) ds ≥ ≤P sup
P sup p k (Q
l
Q
(s)
−
V
(s)
ds
≥
p k
k
0≤t≤T
0≤t≤T 0
0
Z T bm
m
≤P
lp Qk (s) − Vk (s) ds ≥ .
(63)
0
It therefore suffices to show that, if mβm → ∞ as m → ∞, then
Z T bm
m
lim P
Qk (s) − Vk (s) ds ≥ = 0,
m→∞
∀ > 0.
0
To this end, we have that
Z
0
T
ImT Z
X
bm
m
Qk (s) − Vk (s) ds =
≤
Si+1 /m i=0
ImT
Si /m
i=0
Si /m
XZ
bm
m
Qk (t) − Vk (s) ds
Si+1 /m bm
Qk (t) − Vkm (Si /m) + Vkm (s) − Vkm (Si /m) ds
(64)
ImT
(a) X
mT Z Si+1 /m
IX
m m
m
=
τi+1 Qk [i] − Vk (Si /m) +
|Vkm (s) − Vkm (Si /m)|ds
i=0
i=0 Si /m
mT
m
IX
m
m
m
≤T sup Qk (s) − Vkm (s) +
τi+1
τi+1
sup Gk (Vkm (s), C (s))
0≤s≤T
0≤s≤T
i=0
+1
ImT
m
m
X m 2
m
m
(τi )
=T sup Qk (s) − Vk (s) + sup Gk (Vk (s), C (s))
0≤s≤T
0≤s≤T
i=1
ImT
m
X+1
(b)
m
m
≤ T sup Qk (s) − Vk (s) + λk + sup |Vk (s)|
(τim )2
0≤s≤T
0≤s≤T
i=1
ImT +1
X
(c)
m
(τim )2 .
(65)
≤ T sup Qk (s) − Vkm (s) + [q0 + λk (T + 1)]
0≤s≤T
i=1
bm
In step (a) we have invoked the property that Q
k (·) is piece-wise constant. Step (b) follows from
the definition of Gk (·, ·) in Eq. (22), and (c) from the fact that |Vkm (t)| ≤ q 0 + λk t for all t ∈ R+
PImT +1 m 2
(Eq. (25)). It remains to derive an upper bound on the tail probabilities of the term i=1
(τi ) ,
which is isolated in the form of the following lemma. The proof involves an elementary application
of Markov’s inequality, and is given in Appendix B.6.
Lemma 5.11 Suppose that mβ_m → ∞ as m → ∞. We have that
$$\lim_{m \to \infty} \mathbb{P}\bigg(\sum_{i=1}^{I_{mT}+1} (\tau_i^m)^2 \ge \epsilon\bigg) = 0, \qquad \forall \epsilon > 0. \qquad (66)$$
We are now ready to prove Eq. (64). By Eq. (65), we have that
Z T bm
m
lim P
Qk (s) − vk (s) ds ≥ m→∞
0
m
m
+ lim P
≤ lim P sup Qk (s) − Vk (s) ≥
m→∞
m→∞
2T
0≤s≤T
ImT +1
X
i=1
(τim )2 ≥
2[q0 + λk (T + 1)]
!
=0.
(67)
for all > 0, where the first inequality follows from a union bound, and the last step from Proposition 5.6 and Lemma 5.11. Substituting Eqs. (62) and (67) into Eq. (50), we have that
!
X Z t
m
Gk (V m (s), C (s)) − Fk (V m (s))ds > lim P sup
m→∞
0≤t≤T
k∈K
0
I +1
!
mt
m
X m
m
= lim P sup τi Zk [i] − pk (Q [i]) ≥ /λk
m→∞
0≤t≤T i=1
Z t m
m
b
+ lim P sup pk (Q (s)) − pk (V (s)) ds ≥ /λk = 0.
m→∞
0≤t≤T
0
This establishes Eq. (44), which, in light of Eq. (43) completes the proof of Proposition 5.7.
(68)
5.3. Exponential Convergence of Fluid Solutions to Invariant State
In this section, we show that the fluid solution converges exponentially fast to the unique invariant
state, starting from any bounded set of initial conditions.
Theorem 5.12 Fix $h_0 > 0$. There exist a, b > 0 such that for all $u \in \mathbb{R}_+^K$ with $\|u\| \le h_0$,
$$\big\|q(u, t) - q^I\big\| \le a \exp(-bt), \qquad \forall t \in \mathbb{R}_+. \qquad (69)$$
The proof of Theorem 5.12 is given in Appendix B.7. A main difficulty of the proof is that,
depending on the initial condition, the individual coordinates of q(·) may not converge to their
equilibrium state monotonically, as can be seen in the trajectory of q2 (·) in the first three plots of
Figure 4. Fortunately, we are able to obtain a better control of q(·) with the aid of the following
piece-wise linear potential function:
$$g(q) = \sum_{k \in \mathcal{K}} q_k \vee \alpha_0, \qquad q \in \mathbb{R}_+^K. \qquad (70)$$
By analyzing the evolution of g(q(·)), we then show that, for every bounded initial condition of q(·), the fluid solution rapidly enters a desirable subset of the state space, from which point exponential convergence to the invariant state takes place.
5.4. Convergence of Steady-State Distributions
We complete the proof of Theorem 5.4 in this subsection. We will begin by establishing a useful
m
distributional upper bound on Qm
k (∞). Fix t ∈ R+ , and denote by Uλ (t) the number of jobs in
system at time t in an M/M/∞ queue with arrival rate λ and departure rate 1/m, and by Uλm its
steady-state distribution. It is well known that Uλm is a Poisson random variable with mean mλ,
and it follows that, for all λ > 0,
m−1 Uλm →λ,
almost surely, as m → ∞.
(71)
Define W = RK
+ × K, and the process
m
m
W (t) = (Q (t), C m (t)),
t ∈ R+ .
(72)
We have the following lemma, whose proof is given in Appendix B.8.
m
m
Lemma 5.13 The process {W (t)}t∈R+ is positive recurrent, and Qk (∞) m−1 Uλm1 , for all k ∈ K.
m
Denote by $\pi_W^m$ the probability distribution over $\mathcal{W}$ that corresponds to the steady-state distribution of $\bar{W}^m(\cdot)$. Lemma 5.13 can be used to show that the sequence $\{\pi_W^m\}_{m \in \mathbb{N}}$ is tight, as formalized in the following result, whose proof is given in Appendix B.9.
Lemma 5.14 For every x ∈ ℝ_+, define $\mathcal{Q}_x = \{q \in \mathbb{R}_+^K : \max_{k \in \mathcal{K}} q_k \le x\} \times \mathcal{K}$. Then, for every ε > 0, there exists M ∈ ℕ such that
$$\pi_W^m(\mathcal{Q}_M) \ge 1 - \epsilon, \qquad \forall m \in \mathbb{N}. \qquad (73)$$
We say that y is a limit point of {xi }i∈N if there exists a sub-sequence of {xi } that converges to
y. It is not difficult to verify that the space W is separable as a result of both RK
+ and K being
m
separable. By Prohorov’s theorem, the tightness of {πW
}m∈N thus implies that any sub-sequence
mi
m
of {πW
}m∈N admits a limit point with respect to the topology of weak convergence. Let {πW
}i∈N
m
be a sub-sequence of {πW
}m∈N , and πW a limit point of the sub-sequence. Denote by π and π mi
m
mi
the marginals of πW and πW
over the first K coordinates (corresponding to Q (·)), respectively.
In the remainder of the proof, we will show that π necessarily concentrates on q I .
We first show that π is a stationary measure with respect to the deterministic fluid solution.
To this end, we will use the method of continuous test functions (cf. Section 4 of (10)). Let C be
the space of all bounded continuous functions from RK
+ to R+ . We will demonstrate that, for all
f ∈ C,
Eπ f (q(q 0 , t)) − Eπ f (q 0 ) = 0,
∀t ∈ R+ .
(74)
m
m
Define Qm = {x/m, x ∈ ZK
+ }, m ∈ N. Fix f ∈ C . Let W (0) be distributed according to πW , and
define
m
F (q , t) =E f (Q (t)) Q (0) = q 0 ,
m
0
m
t ∈ R+ , q 0 ∈ Q m .
(75)
Fix t > 0. We have that
Eπ f (q(q 0 , t)) − Eπ f (q 0 ) (a)
≤ lim sup Eπ f (q(q 0 , t)) − Eπmi f (q(q 0 , t)) + lim sup Eπmi f (q(q 0 , t)) − Eπmi F mi (q 0 , t) i→∞
i→∞
mi
0
0 + lim sup Eπmi F (q , t) − Eπ f (q )
i→∞
(b)
≤ lim sup Eπ f (q(q 0 , t)) − Eπmi f (q(q 0 , t)) + lim sup Eπmi f (q(q 0 , t)) − Eπmi F mi (q 0 , t) i→∞
i→∞
0
0 + lim sup Eπmi f (q ) − Eπ f (q )
i→∞
(c)
= lim sup Eπmi f (q(q 0 , t)) − Eπmi F mi (q 0 , t) .
(76)
i→∞
Step (a) follows from the triangle inequality, and (b) from the fact that Qm (·) is stationary when
initialized according to π m , and therefore,
Z
m
m
m 0
Eπm F (q , t) =
E f (Q (t)) Q (0) = q 0 dπ m = Eπm f (q 0 ) .
q 0 ∈RK
+
(77)
For step (c), we note that since $q(q^0, t)$ depends continuously on $q^0$ (Lemma 5.2), the function
mi
g(x) = f (q(x, t)), x ∈ RK
converges
+ , belongs to C . Therefore, the step follows from the fact that π
weakly to π as i → ∞.
Fix ε > 0. There exists M > 0 such that the right-hand side of Eq. (76) satisfies
lim sup Eπmi f (q(q 0 , t)) − Eπmi F mi (q 0 , t) i→∞
Z
Z
0
mi
m 0
mi ≤ lim sup f (q(q , t))dπ −
F (q , t)dπ +
0
i→∞
q 0 ∈QM
Zq ∈QM
Z
0
mi
m 0
mi f (q(q , t))dπ −
F (q , t)dπ lim sup i→∞ q 0 ∈RK \QM
q 0 ∈RK
+ \QM
+
Z
(a)
f (q(q 0 , t)) − F mi (q 0 , t) dπ mi + 2 sup f (x)
≤ lim sup
i→∞
x∈RK
+
q 0 ∈QM
(b)
=2 sup |f (x)|.
(78)
x∈RK
+
where (a) follows from the tightness property of Eq. (73), and step (b) involves an argument that
allows for interchanging the order of taking the limit and integration. We isolate step (b) in the
following lemma, whose proof leverages Theorem 5.5.
Lemma 5.15 Fix a compact set S ⊂ RK
+ and f ∈ C , we have that
Z
lim sup
m→∞
Proof.
q 0 ∈S
f (q(q 0 , t)) − F m (q 0 , t) dπ m = 0.
(79)
We first show the following uniform convergence property:
lim
m
sup P kq(x, t) − Q (x, t)k > δ = 0,
m→∞ x∈S∩Qm
m
m
∀δ > 0.
(80)
m
where Q (x, ·) denotes a process Q (·) initialized with P(Q (0) = x) = 1. Suppose, for the sake of
contradiction, that there exist δ > 0 and a sequence {xi }i∈N , xi ∈ S ∩ Qm , such that
m
lim sup P kq(xi , t) − Q (xi , t)k > δ > 0.
(81)
i→∞
Because S is compact, there exists a sub-sequence {xij }j∈N ⊂ {xi }i∈N and x∗ ∈ S such that xij → x∗
as j → ∞. We have that
(a)
m
m
lim sup P kq(xij , t) − Q (xij , t)k > δ = lim sup P kq(x∗ , t) − Q (xij , t)k > δ > 0,
j→∞
(82)
j→∞
where step (a) follows from the fact that, for all t ∈ R+ , q(x, t) is continuous with respect to x
(Lemma 5.2). This leads to a contradiction with Theorem 5.5, and hence proves Eq. (80).
27
Xu and Yun: Reinforcement with Fading Memories
Fix δ > 0. We have that
Z
Z
(a)
f (q(q 0 , t)) − F m (q 0 , t) dπ m ≤ lim
lim
m→∞
m
mi
i
E f (q(q 0 , t)) − f (Q (t)) Q (0) = q 0 dπ m
i→∞ q 0 ∈S
!
m
≤wf (S, δ) + sup |f (x)| lim sup P kq(x, t) − Q (x, t)k > δ
q 0 ∈S
i→∞ x∈S∩Qm
x∈RK
+
(b)
=wf (S, δ),
(83)
where wf is the modulus of continuity of f in S: wf (S, δ) = supx,y∈S,kx−yk≤δ |f (x) − f (y)|. Step (a)
follows from the fact that π m (Qm ) = 1, and (b) from Eq. (80). Because S is compact, f is uniformly
continuous in S and hence limδ→0 wf (S, δ) = 0. The lemma follows by taking the limit as δ → 0 in
Eq. (83).
Q.E.D.
Since Eq. (78) holds for all > 0, combining it with Eq. (76), we conclude that
Eπ f (q(q 0 , t)) − Eπ f (q 0 ) ≤ lim sup Eπmi f (q(q 0 , t)) − Eπmi F mi (q 0 , t) = 0,
(84)
i→∞
and this proves Eq. (74). We now show that Eq. (74) implies that π = δqI , i.e., δqI is the unique
invariant measure with respect to the dynamics induced by the fluid solution. Define the truncated
norm:
4
kxkM = min{kxk, M },
M ∈ R+ ,
(85)
and the set
4
RL = {q ∈ RK
+ : kq k ≤ L},
L ∈ R+ ,
(86)
Let q 0 be a random vector in RK
+ distributed according to π. Suppose, for the sake of contradiction,
4
that E(kq 0 − q I kM ) = µ0 > 0 for some M > 0, where the expectation is taken with respect to the
randomness in q 0 . Fix L > 0 such that P(q 0 ∈
/ RL ) ≤
µ0
4
· M1 . By Theorem 5.12, there exists T > 0
such that supx∈RL kq(x, T ) − q I kM < µ0 /4. Therefore,
E(kq(q 0 , T ) − q I kM ) < µ0 /4 + M · P(q 0 ∈
/ RL ) ≤ µ0 /2 = E(kq 0 − q I kM )/2,
(87)
0
I
where the first inequality uses the fact that kq kM ≤ M for all q ∈ RK
+ . Because E(kq − q kM ) was
assumed to be strictly positive, this is in contradiction with the stationarity of π with respect to
the fluid solution, which would imply that E(kq(q 0 , T ) − q I kM ) = E(kq 0 − q I kM ) for all T ∈ R+ . We
thus conclude that
E(kq 0 − q I kM ) = 0,
∀M ∈ R+ ,
(88)
which in turn implies that π = δqI . This proves Eq. (18).
Finally, to show Eq. (19), note that by Lemma 5.13, for all m ∈ N and k ∈ K, the random variable
m
Qk (∞)
is non-negative and stochastically dominated by m−1 Uλm1 . Using similar steps as those in
Eq. (180), it is not difficult to show that supm∈N E(|m−1 Uλm1 |2 ) < ∞, which, implies the uniform
m
integrability of {Qk }m∈N . In combination with Eq. (18), this proves the convergence in L1 stated
in Eq. (19). This completes the proof of Theorem 5.4 as well as the first claim of Theorem 2.2.
28
Xu and Yun: Reinforcement with Fading Memories
6.
The Memory-Deficient Regime
We now prove the second claim of Theorem 2.2 concerning the memory-deficient regime, where
βm 1/m, as m → ∞. The statement is repeated below for easy reference.
Theorem 6.1 Suppose that βm 1/m, as m → ∞. Then,
λk ∨ α0 + (K − 1)α0
,
i∈K [λi ∨ α0 + (K − 1)α0 ]
k = 1, . . . , K,
qk = λk P
(89)
where qk = limm→∞ E (m−1 Qm
k (∞)).
4
Fix m ∈ N. Let Snm be the nth update point, and set S0m = 0. Denote by
C m [n] = C m (Snm ),
n ∈ Z+ ,
(90)
the embedded discrete-time process associated with the choice process {C m (t)}t∈R+ . The discretetime processes {Qm [n]}n∈Z+ and {W m [n]}n∈Z+ are defined analogously.
The key to proving Theorem 6.1 hinges upon the following property: the discrete-time choice
process, {C m [n]}n∈N , becomes asymptotically Markovian in the limit as m tends to infinity. Note
that {C m [n]}n∈N is not Markovian for a finite m, because the choices are sampled from a distribution
that is a function of the current recallable reward vector, which in turn depends on past choices.
However, the dependence of the reward vector on past choices is greatly weakened in the memorydeficient regime: because the life span of a unit reward is so small compared to the time between two
adjacent choices, by the time a new choice is to be made, the agent will have essentially forgotten
all of the past rewards associated with any action other than the one she currently uses. Therefore,
when m is large, knowing the choice C m [n] for some n ∈ N, we can predict with high accuracy
the recallable reward vector at the n + 1 update point, rendering the choice process approximately
Markovian. The approximate Markovian property will allow us to explicitly characterize the steadystate distribution of C m [·] in the limit as m → ∞, which we then leverage to analyze the steady-state
distribution of the recallable reward process, {Qm (t)}t∈N . Following this line of thinking, our proof
will consist of three main parts:
1. (Proposition 6.3) Fix n ∈ N. We first show that, in the limit as m → ∞, the value of the
scaled recallable reward process at the (n + 1)-th update point converges in probability to a
deterministic vector that only depends on C m [n].
2. (Proposition 6.4) Using the convergence of the recallable reward vector, we derive an explicit
expression for the steady-state distribution of the discrete choice process, {C m [n]}n∈N , in the
limit as m → ∞.
29
Xu and Yun: Reinforcement with Fading Memories
3. Using the stationarity of the original recallable reward process, we combine the above two
results to arrive at a characterization of the steady-state distribution of {Qm (t)}t∈R+ , thus
completing the proof of Theorem 6.1.
We begin by stating some basic properties of Qm [·] and C m [·], which will allow us to relate the
steady-state behavior of these processes to their continuous-time counterparts; the proof of the
following lemma is given in Appendix B.10.
Lemma 6.2 Fix m ∈ N. The discrete-time process {W m [n]}n∈N is positive recurrent, whose steadystate distribution satisfies
m
Qm
k [∞] Uλ1 ,
∀k ∈ K,
(91)
and
d
C m [∞] = C m (∞),
d
Qm [∞] = Qm (∞).
and
(92)
For the remainder of the proof, we will fix h : Z+ → Z+ to be an increasing function, such that
1/βm h(m) m, and, in particular,
lim h(m)/m = ∞, and lim h(m)βm = 0.
m→∞
m→∞
(93)
m
Define the scaled discrete-time recallable-reward process: Q [n] = m−1 Qm (Snm ) = m−1 Qm [n], n ∈ N.
m
The following proposition states that when m is large, the value of Q [1] converges to a determinm
istic value that depends solely on C m [0]: Qk [1] is equal to λk for k = C m [0], and zero, otherwise.
In other words, the agent will have forgotten all past rewards other than those associated with her
current choice. The proof is given in Appendix B.11.
Proposition 6.3 Define the K × K matrix q ∗ , where
∗
qk,i
= I(k = i)λk ,
k, i ∈ K,
(94)
Suppose that the continuous-time process {W m (t)}t∈N is initialized at t = 0 according to its steadystate distribution. Then, for all i, k ∈ K,
m
∗ lim P Q [1] − qk,i
≤ C m [0] = k = 1,
m→∞
∀ > 0,
(95)
and
∗ m
lim E m−1 Qm (h(m)) − qk,i
C [0] = k = 0.
m→∞
(96)
Using Proposition 6.3, we will be able to obtain an explicit expression for the limit
limm→∞ P(C m [∞] = k), as is stated in the following proposition.
30
Xu and Yun: Reinforcement with Fading Memories
m
Proposition 6.4 Define pm
k = P(C [∞] = k), k ∈ K. We have that
lim pm
k =
m→∞
where z0 =
Proof.
P
i∈K [λi
λk ∨ α0 + (K − 1)α0
,
z0
k ∈ K,
(97)
∨ α0 + (K − 1)α0 ] is a normalizing constant.
Suppose that {W m [n]}n∈N is initialized at n = 0 in its steady-state distribution. We have
that, for all i ∈ K,
m
P(C [1] = i) =
X
P(C [0] = k)
u∈Qm
k∈K
4 1 K
Z .
m +
where Qm =
X m
m
m
ui ∨ α0
P
,
P Q [1] = u C [0] = k
∈K (uj ∨ α0 )
(98)
By Eq. (95) in Proposition 6.3, Eq. (98) implies that, there exists a sequence of
K × 1 vectors {δ m }m∈N with limm→∞ maxi∈K |δim | = 0, such that for all i ∈ K and m ∈ N,
P(C m [1] = i) =δim +
X
∗
∨ α0
qk,i
,
∗
j∈K (qk,j ∨ α0 )
P(C m [0] = k) P
k∈K
(99)
d
d
∗
where qk,i
= I(k = i)λk . Note that by the stationarity of C m [·], we have that C m [0] = C m [1] =
m
m T
C m [∞]. Therefore, by treating pm = (pm
1 , p2 , . . . , pK ) as a column vector, and defining the K × K
matrix, R, where
∗
qk,i
∨ α0
,
∗
j∈K (qk,j ∨ α0 )
∀i, k ∈ K,
Rk,i = P
(100)
Eq. (99) can be written more compactly as
m
m
pm
i = δi + (Rp )i ,
∀i ∈ K.
(101)
It is not difficult to verify that R is row-stochastic and corresponds to the transition kernel of a
discrete-time, irreducible Markov chain over K. Therefore, the equation x = Rx admits a unique
solution, x∗ , in the K-dimensional simplex. Furthermore, it is not difficult to check that the following choice of x∗ satisfies x∗ = Rx∗ , and is hence the unique solution:
λk ∨ α0 + (K − 1)α0
,
i∈K [λi ∨ α0 + (K − 1)α0 ]
x∗k = P
k = 1, 2, . . . , K.
(102)
Since limm→∞ maxi∈K |δim | = 0, combining Eqs. (101) and (102), we conclude that pm converges to
x∗ coordinate-wise, as m → ∞. This completes the proof of Proposition 6.4.
Q.E.D.
We are now ready to complete the proof of Theorem 6.1. Fix m ∈ N. Suppose that we initiate
the continuous-time process {W m (t)}t∈N in its steady-state distribution so that it is stationary.
The stationarity implies that for all k ∈ K,
−1 m
E(m−1 Qm
Qk (h(m))) =
k (0)) =E(m
X
m
P(C m (0) = i)E m−1 Qm
k (h(m)) C (0) = i
i∈K
(a) X
=
i∈K
m
m
X m
P(C n [∞] = i)E m−1 Qm
pi E m−1 Qm
k (h(m)) C [0] = i =
k (h(m)) C [0] = i ,
i∈K
31
Xu and Yun: Reinforcement with Fading Memories
m
m
where pm
i is as defined in Proposition 6.4, and step (a) follows from the fact that C [∞] = C (∞)
(Lemma 6.2). Taking the limit as m → ∞, we have that
qk = lim E(m−1 Qm
k (0))= lim
m→∞
(a) X
=
i∈K
m→∞
X
−1 m
pm
Qk (h(m)) C m [0] = i
i E m
i∈K
λi ∨ α0 + (K − 1)α0 ∗
λk ∨ α0 + (K − 1)α0
qk.i = λk P
.
z0
i∈K [λi ∨ α0 + (K − 1)α0 ]
(103)
where step (a) follows from Eq. (96) in Proposition 6.3 and Eq. (97) in Proposition 6.4. This
completes the proof of Theorem 6.1.
7.
Proof of Theorem 2.1
We prove Theorem 2.1 in this section. Let W m [·] = (C m [·], Qm [·]) be the discrete-time embedded
Markov chain defined in Eq. (90). By Lemma 6.2, C m [∞] and C m (∞) admit the same distribution,
and it hence suffices to characterize the former. Because the limiting expressions for C m [∞] in the
memory deficient regime have already been established in Proposition 6.4, we will focus on the
memory abundant regime, where βm 1/m. By Theorem 5.4, for all > 0
lim P(km−1 Qm [∞] − q I k > ) = lim P(km−1 Qm (∞) − q I k > ) = 0.
m→∞
m→∞
(104)
where the first step follows from Eq. (92) in Lemma 6.2. Fix k ∈ K. We have that
lim P(C m [∞] = k) = lim
m→∞
m→∞
q ∨ (mα0 )
q I ∨ α0
Pk
P(Qm [∞] = q) = P k I
,
i qi ∨ (mα0 )
i qi ∨ α0
K
X
(105)
q∈R+
where the last equality follows from Eq. (104) and the fact that
q ∨(mα0 )
Pk
i qi ∨(mα0 )
is uniformly bounded
I
over q ∈ RK
+ . Eq. (105), along with the expression for q in Theorem 5.3, leads to Eqs. (5) and (6),
depending on the value of k and α0 . This completes the proof of Theorem 2.1.
8.
Discussions
There is a number of extensions of our model that are potentially interesting for future work. While
we currently assume that all past rewards depart from the system at the same rate, we believe
that our results could be generalized so that the memory decay rates are allowed to depend on
the action, e.g., the rewards associated with action k decay at rate µk = µ0k m−1 , where {µ0k }k∈K
are positive constants. In this case, the expressions for the limiting steady-state probabilities of
C m (·) in Theorem 2.1 can be modified by replacing the appearance of λk with λk /µ0k . One may
also consider other memory models, such as when memory decay follows a power law (27), though
relaxing the independence assumption could make the model more difficult to analyze.
Our main results have also left out the scaling regime where βm = Θ(m−1 ). The complication here
is that neither Qm (·) nor C m (·) becomes Markovian as m grows, and our current proof technique,
which exploits the asymptotic Markovian property, does not extend to this regime easily.
32
Xu and Yun: Reinforcement with Fading Memories
The reward matching rule is a special case of a broader family of choice heuristics, in which
action k is chosen with a probability proportional to w(Qk (t)) ∨ α, where w : R+ → R+ is a nondecreasing weight function. For instance, when w(x) = exp(cx) for some constant c > 0, the sampling
procedure becomes reminiscent of the celebrated logit model in choice theory (cf. Chapter 3, (22))
as well as the exponential-weight algorithm in the multi-arm bandit literature (cf. (2)). We believe
that it is possible to extend our results to this broader family of heuristics with additional work.
For instance, in the memory-deficient regime, the limiting probability of choosing action k would
become proportional to w(λk ) ∨ α0 + (K − 1)α0 , i.e., λk is replaced by w(λk ). Note that if we had the
x
freedom of choosing w(·), then applying a highly skewed weight function, e.g., w(x) = 1020 , would
induce the agent to almost always choose the best action in steady state, even in the memorydeficient regime. However, the downside is that with a skewed weight function, the agent is highly
unlikely to give up the current choice at an update point, regardless of its reward rate. It will
thus take a long time for the agent’s time-average behavior to approach that of the steady state,
resulting in bad transient performance. It would be interesting to better understand the tradeoff
between the agent’s steady-state versus transient performance as the weight function varies.
While memory loss is regarded more as a physical constraint in the current work, one could also
ask whether it would be be beneficial to artificially induce memory loss even when perfect recall
is feasible. For instance, it has been observed in the learning theory literature that policies with
built-in regularization that penalizes distant past experience can achieve better regret in multi-arm
bandit problems with time-varying, or adversarially chosen, parameters (cf. (12, 24, 4, 15)). This
is because these policies tend to be more adaptive to the environment and not “weighed down” by
past experience. While our theorems do not directly apply to time-varying reward rates, Theorem
5.12 shows that the fluid solution converges exponentially quickly to the invariant state from any
bounded initial condition, suggesting that the reward matching rule could be an attractive option
in a non-stationary environment.
Finally, while our memory model was mainly motivated by human decision making, the phenomenon of forgetting arises in organizations as well, as departures of employees could lead to the
loss of skills and expertise (cf. (3)). It could be interesting to investigate the impact of organizational
forgetting on an enterprise’s efficiency in dynamic decision problems.
References
[1] W. F. Ames and B. Pachpatte. Inequalities for Differential and Integral Equations, volume 197. Academic Press, 1997.
[2] P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire. The nonstochastic multiarmed bandit problem.
SIAM Journal on Computing, 32(1):48–77, 2002.
33
Xu and Yun: Reinforcement with Fading Memories
[3] C. L. Benkard. Learning and forgetting: The dynamics of aircraft production. The American Economic
Review, 90(4):1034–1054, 2000.
[4] O. Besbes, Y. Gur, and A. Zeevi.
Non-stationary stochastic optimization.
Operations Research,
63(5):1227–1244, 2015.
[5] M. Bramson. State space collapse with application to heavy traffic limits for multiclass queueing
networks. Queueing Systems, 30(1-2):89–140, 1998.
[6] C. Camerer and T. Hua Ho. Experience-weighted attraction learning in normal form games. Econometrica, 67(4):827–874, 1999.
[7] E. A. Coddington and N. Levinson. Theory of Ordinary Differential Equations. Tata McGraw-Hill
Education, 1955.
[8] T. Cover and M. Hellman. The two-armed-bandit problem with time-invariant finite memory. IEEE
Transactions on Information Theory, 16(2):185–195, 1970.
[9] H. Ebbinghaus. Memory: A contribution to experimental psychology. Dover Publications, Inc, 1964. H.
A. Ruger and C. E. Bussenius, Trans. Originally published, 1885.
[10] S. N. Ethier and T. G. Kurtz. Markov Processes: Characterization and Convergence. John Wiley &
Sons, 2005.
[11] D. Gamarnik, J. N. Tsitsiklis, and M. Zubeldia. Delay, memory, and messaging tradeoffs in distributed
service systems. In Proceedings of the ACM SIGMETRICS International Conference, pages 1–12. ACM,
2016.
[12] A. Garivier and E. Moulines. On upper-confidence bound policies for non-stationary bandit problems.
arXiv preprint arXiv:0805.3415, 2008.
[13] G. Grimmett and D. Stirzaker. Probability and Random Processes. Oxford University Press, 2001.
[14] M. E. Hellman and T. M. Cover. Learning with finite memory. The Annals of Mathematical Statistics,
pages 765–782, 1970.
[15] N. B. Keskin and A. Zeevi. Chasing demand: Learning and earning in a changing environment. Available
at SSRN 2389750, 2013.
[16] T. G. Kurtz. Solutions of ordinary differential equations as limits of pure jump markov processes.
Journal of applied Probability, 7(1):49–58, 1970.
[17] T. G. Kurtz. Strong approximation theorems for density dependent markov chains. Stochastic Processes
and their Applications, 6(3):223–240, 1978.
[18] M. Mitzenmacher, B. Prabhakar, and D. Shah. Load balancing with memory. In Proceedings of IEEE
Symposium on Foundations of Computer Science (FOCS), pages 799–808. IEEE, 2002.
[19] R. Pemantle. A survey of random processes with reinforcement. Probab. Surv, 4(0):1–79, 2007.
34
Xu and Yun: Reinforcement with Fading Memories
[20] J. Pitman. Combinatorial Stochastic Processes. Technical Report 621, Dept. Statistics, UC Berkeley,
2002. Lecture notes for St. Flour course, 2002.
[21] A. Shwartz and A. Weiss. Large Deviations for Performance Analysis: Queues, Communication and
Computing, volume 5. CRC Press, 1995.
[22] K. E. Train. Discrete Choice Methods with Simulation. Cambridge University Press, 2009.
[23] J. N. Tsitsiklis and K. Xu. On the power of (even a little) resource pooling. Stochastic Systems,
2(1):1–66, 2012.
[24] T. Van Erven and W. Kotl. Follow the leader with dropout perturbations. In COLT, pages 949–974,
2014.
[25] N. Vulkan. An economists perspective on probability matching. Journal of economic surveys, 14(1):101–
118, 2000.
[26] R. Washburn and A. Willsky. Optional sampling of submartingales indexed by partially ordered sets.
The Annals of Probability, pages 957–970, 1981.
[27] J. T. Wixted and E. B. Ebbesen. Genuine power curves in forgetting: A quantitative analysis of
individual subject forgetting functions. Memory & Cognition, 25(5):731–739, 1997.
Appendix A:
Technical Preliminaries
Proposition A.1 (Doob’s inequality (cf. Section 12.6, (13))) Let {X(t)}t≥0 be a discreteor continuous-time non-negative submartingale. Fix T ≥ 0. We have that
E (X(T ))
, ∀c > 0.
P sup X(t) ≥ c ≤
c
0≤t≤T
Proposition A.2 (Gronwall’s lemma (cf. Section 1.3, (1))) Let f, b : R+ → R+ be continuous and non-negative functions, and let a : R+ → R+ be a continuous, positive and non-decreasing
Rt
function. If f (t) ≤ a(t) + 0 b(s)f (s)ds, for all t ∈ [0, T ], then
Z T
f (T ) ≤ a(T ) exp
b(t) dt .
0
Appendix B:
B.1.
Proofs
Proof of Lemma 5.2
Proof.
The existence and uniqueness of the fluid solution follow from Picard’s existence theorem
(Section 2, Chapter 1, (7)) by verifying that the right-hand side of Eq. (15) is uniformly Lipschitzcontinuous in q(t) over RK
+ , and (trivially) continuous in t. To show the fluid solution’s continuous
dependence on initial condition, note that because of the Lipschitz continuity of pk (·), there exists
a constant l > 0, such that for initial conditions x, y ∈ RK
+ and t ∈ R+ , we have that
Z t
kq(x, t) − q(y, t)k ≤kx − y k +
k(p(q(x, s)) − q(x, s)) − (p(q(y, s)) − q(y, s))kds
0
35
Xu and Yun: Reinforcement with Fading Memories
t
Z
≤kx − y k + l
kq(x, s) − q(y, s)kds
0
≤kx − y kelt ,
(106)
where the last inequality follows from Gronwall’s lemma (Proposition A.2). Therefore,
limx→y q(x, t) = q(y, t) for all y ∈ RK
+ . This completes the proof of the lemma.
B.2.
Q.E.D.
Proof of Theorem 5.3
Proof.
It can be easily verified that the expressions in Eqs. (16) and (17) are valid invariant
states in their respective regimes of α0 values. We next show that they are indeed the unique
invariant states in both cases. Let q I be an invariant state of the fluid solutions. Then, q I must
satisfy:
qkI ∨ α0
− qkI = 0,
I
q
∨
α
0
i∈K i
λk P
∀k ∈ K.
(107)
K− = k ∈ K : qkI ≤ α0 .
(108)
Define a partition of K into K+ and K− , where:
K+ = k ∈ K : qkI > α0 ,
Note that by definition,
qkI ∨ α0 = qkI ,
k ∈ K+ ,
(109)
qkI ∨ α0 = α0 ,
k ∈ K− .
(110)
Suppose that α0 ∈ (0, λ1 /K]. We next show that in this case K+ is the singleton set, {1}. First,
suppose that K+ is empty, that is, qkI ≤ α0 for all k ∈ K. We have that
α0
α0
λ1
q1I = λ1 P
= λ1
=
> α0 ,
( i∈K+ qiI ) + α0 |K− |
Kα0 K
(111)
where the last inequality follows from the assumption that α0 < λ1 /K. This leads to a contradiction,
implying that K+ 6= ∅. Fix k ∈ K+ . Combining Eqs. (107), (109) and (110), we have that
qkI
λk P
= qkI ,
( i∈K+ qiI ) + α0 |K− |
(112)
which, after rearrangement, can be written as
X
qiI = λk − α0 |K− |.
(113)
i∈K+
Note that in Eq. (113) only the term λk on the right-hand depends on k. This implies that, if
K+ 6= ∅, then all coordinates in K+ must have the same arrival rate of rewards. Thus, we can assume
that there exists λ̃ ∈ R+ , such that λk = λ̃, for all k ∈ K+ , and it suffices to show that λ̃ = λ1 .
36
Xu and Yun: Reinforcement with Fading Memories
For the sake of contradiction, suppose that λ̃ 6= λ1 , and hence λ1 > λ̃. This implies that 1 ∈
/ K+ ,
and hence q1I ≤ α0 . Invoking Eq. (107) again, we have that
λ1 (b)
α0
(a)
= α0
q1I = λ1 P
> α0 ,
I
−
( i∈K+ qi ) + α0 |K |
λ̃
(114)
where step (a) follows from Eq. (113), and (b) from the fact that λ1 > λ̃. Clearly, Eq. (114) contradicts with the fact that q1I ≤ α0 . We thus conclude that λ̃ = λ1 , K+ = {1}, and K− = {2, . . . , K }.
By setting k = 1 in Eq. (113), we further conclude that
q1I =
X
qiI = λ1 − α0 |K− | = λ1 − (K − 1)α0 .
(115)
i∈K+
Now, fix k ∈ {2, . . . , K }. We have, from Eq. (107), that
λk
α0
= α0 ,
qkI = λk P
I
−
( i∈K+ qi ) + α0 |K |
λ1
where the last equality follows from the fact that
!
X
I
qi + α0 |K− | = q1I + (K − 1)α0 = [λ1 − (K − 1)α0 ] + (K − 1)α0 = λ1 .
(116)
(117)
i∈K+
Combining Eqs. (115) and (116), we have shown that the expressions in Eq. (107) are the unique
invariant state of the fluid model, when α0 ∈ (0, λ1 /K].
We turn next to the second case of Theorem 5.3, and assume that α0 > λ1 /K. First, suppose
that K+ 6= ∅. By the arguments leading to Eq. (114), which do not depend on the value of α0 , we
know that K+ = {1}, and
q1I = λ1 − (K − 1)α0 ≤ λ1 − (1 − 1/K)λ1 =
λ1
≤ α0 ,
K
(118)
where both inequalities follow from the assumption that α0 > λ1 /K. This contradicts with the
assumption that 1 ∈ K+ and hence q1I > α0 , and we conclude that K+ must be empty. This implies
that qkI ≤ α0 for all k ∈ K, and by Eq. (107), we have that
qkI = λk
α0
= λk K,
α0 K
k = 1, . . . , K.
(119)
This completes the proof of Theorem 5.3.
B.3.
Proof of Lemma 5.8
Proof.
Define
Z
γ(t) = N (t) −
0
t
ψ(W m (s))ds,
t ∈ R+ .
(120)
37
Xu and Yun: Reinforcement with Fading Memories
Let {Ft }t∈R+ be the natural filtration associated with W m (·). It is not difficult to show, by the
definition of N (·), that {γ(t)}t∈R+ is martingale with respect to {Ft }t∈R+ . Define the stopping time
Ts = inf {t : ψ(W m (t)) ≥ φ}.
(121)
Let {Ñ (t)}t∈N be a counting process defined as:
Ñ (t) = N (t ∧ Ts ),
t ∈ R+ .
(122)
That is, Ñ (t) coincides with N (t) up until t = Ts , and stays constant afterwards. Let
4
ψ̃(t) = ψ(W m (t))I(t ≤ Ts ).
(123)
Then, it is not difficult to show that Ñ (·) is a counting process whose instantaneous rate at time
t is ψ̃(t), and the process
Z
γ̃(t) = Ñ (t) −
t
t ∈ R+ ,
ψ̃(s)ds,
0
is a martingale with respect to {Ft }t∈R+ . From the definition of γ̃(·) and γ(·), we have that, for all
> 0,
P
sup |γ(t)| ≥ =P
0≤t≤T
sup |γ(t)| ≥ , Ts > T + P sup |γ(t)| ≥ , Ts ≤ T
0≤t≤T
≤P sup |γ̃(t)| ≥ + P (Ts ≤ T )
0≤t≤T
m
=P sup |γ̃(t)| ≥ + P sup ψ(W (t)) ≥ φ .
0≤t≤T
0≤t≤T
(124)
0≤t≤T
To complete the proof, therefore, it suffices to show that
P sup |γ̃(t)| ≥ ≤ 2e−φT ·h(/φT ) .
(125)
0≤t≤T
We now show Eq. (125) using Doob’s inequality (Proposition A.1 in Appendix A), by following
a line of arguments similar to that used in the proof of Theorem 2.2 of (17). First, we introduce a
representation of the Markov process W m (·) using Poisson processes. Let {Ξy (·)}y∈Z|K|+1 a family
of mutually independent unit-rate Poisson counting processes, indexed by Z|K|+1 . For every y ∈ Ξy ,
let ψy (·) be the rate function of W m (·) for the jump with value y, i.e., ψy (w) is the instantaneous
rate at which W m (·) jumps to state w + y when in state w ∈ ZK+1
. Then, the process W m (·) can
+
be expressed as a solution to the following integral equation:
Z t
X
W m (t) = W m (0) +
y Ξy
ψy (W m (s))ds ,
y∈ZK+1
+
0
t ∈ R+ .
(126)
38
Xu and Yun: Reinforcement with Fading Memories
In this representation, Ξy
R
t
ψ (W m (s))ds
0 y
counts the number of jumps of value y over the interval
[0, t]. We thus have that, by setting y = l,
Z t
m
ψ(W (s))ds ,
N (t) = Ξl
t ∈ R+ ,
(127)
0
and
t
Z
γ̃(t) = Ñ (t) −
t
Z
ψ̃(s)ds = Ξl
Z t
ψ̃(s)ds −
ψ̃(s)ds,
0
0
t ∈ R+ .
(128)
0
Fix θ > 0. Since γ̃(·) is a martingale and exp(·) a positive convex function, {exp(θγ̃(t))})t∈R+ is
a submartingale. From Eq. (128), we have that
E (exp (γ̃(T ))) = E (exp (Ξl (τT ) − τT )) ,
where τt =
Rt
0
(129)
ψ̃(s)ds. Note that τT is a stopping time with respect to Ξl (·). Since by definition
ψ̃(t) ≤ φ for all t, we have that τT ≤ φT . Applying the optional sampling theorem for submartingales
indexed by partially ordered sets (cf. (26)) to {exp(θγ̃(t))}t∈R+ , we have that
E (exp (θγ̃(T ))) ≤ E (exp (θΞl (φT ) − θφT )) ≤ exp (eθ − θ − 1)φT ,
(130)
where the last inequality follows from the fact that Ξl (φT ) is a Poisson random variable with
mean φT whose moment generating function is given by E(exp(aΞl (φT ))) = exp(φT (ea − 1)), a ∈ R.
Analogously, we can show that
E (exp (−θγ̃(T ))) ≤ exp
e−θ + θ − 1 φT ≤ exp
eθ − θ − 1 φT ,
(131)
where the last inequality follows from the fact that eθ − e−θ ≥ 2θ for all θ ≥ 0.
Note that for any θ > 0, both {exp (θγ(t))}t≥0 and {exp (−θγ(t))}t≥0 are non-negative submartingales. Using Doob’s inequality and Eqs. (130) and (131), we have that, for all , θ > 0,
P sup γ̃(t) ≥ + P sup −γ̃(t) ≥ =P (exp(θγ̃(T )) ≥ exp(θ)) + P (exp(−θγ̃(T )) ≥ exp(θ))
0≤t≤T
0≤t≤T
E (exp (θγ̃(T ))) E (exp (−θγ̃(T )))
+
exp(θ)
exp(θ)
θ
≤2 exp φT e − θ − 1 − θ .
≤
By setting θ = log (1 + /φT ) in Eq. (132), we conclude that
P sup |γ̃(t)| ≥ ≤ 2 exp(−φT · h(/φT )),
0≤t≤T
4
where h(x) = (1 + x) log(1 + x) − x. This completes the proof.
Q.E.D.
(132)
39
Xu and Yun: Reinforcement with Fading Memories
B.4.
Proof of Lemma 5.9
Proof.
Define Z =
1
(1+)ry
P(1+)ry
i=1
Xi . In particular, E(Z) = r−1 and STD(Z) = √
1
r−1 ,
(1+)ry
where STD(Z) stands for the standard deviation of Z. We have that
!
r
(1+)ry
X
ry
1+
−1
−1
P
Xi ≤ y ≤P |Z − r | ≥
, (133)
= P |Z − r | ≥ STD(Z) ≤
(1 + )r
1+
ry2
i=1
where the last step follows from Chebyshev’s inequality.
B.5.
Q.E.D.
Proof of Lemma 5.10
Proof.
We show the result by induction. For the base case, we extend the definition of {ξim }i∈N
4
by letting ξ0m = 0, and it is not difficult to see that the inequality holds when n = 0.
Fix i ∈ {0, . . . , n − 1}, and suppose that
E (exp (θξim ))
iθ2 /4
≤ exp
.
mβm (mβm − θ)
(134)
In light of the base case, it then suffices to show that the above equation implies
(i + 1)θ2 /4
m
E exp θξi+1
≤ exp
.
mβm (mβm − θ)
m
(135)
4
Let {Cl }l∈Z+ be the natural filtration induced by {τlm , Zkm [l], Q [l]}l∈N , with C0 = ∅. We have that
m
m
m
Ci =E exp (θξim ) exp θτi+1
E exp θξi+1
Zkm [i + 1] − pk (Q [i + 1]) Ci
m
(a)
m
= exp (θξim ) E exp θτi+1
Zkm [i + 1] − pk (Q [i + 1]) Ci ,
(136)
where step (a) follows from ξim being Ci -measurable. We now develop an upper bound for the
second term on the right-hand side of Eq. (136), as follows.
m
m
E exp θτi+1
Zkm [i + 1] − pk (Q [i + 1]) Ci
m
m
m
=E E exp θτi+1
Zkm [i + 1] − pk (Q [i + 1]) Q [i + 1] Ci
m
m
(a)
m
=E E pk (Q [i + 1]) exp θ 1 − pk (Q [i + 1]) τi+1
m
m
m
m
+ 1 − pk (Q [i + 1]) exp −θpk (Q [i + 1])τi+1
Q [i + 1] Ci


m
m
1
−
p
(Q
[i
+
1])
mβ
k
m
p
(Q
[i
+
1])mβ
(b)
k
m
Ci 
+
=E 
m
m
mβm + θpk (Q [i + 1]) mβm − θ 1 − pk (Q [i + 1])
!
m
mβm
mβm + 2θpk (Q [i + 1]) − θ =E
·
Ci
m
m
mβm + θpk (Q [i + 1]) mβm + θpk (Q [i + 1]) − θ 

m
m
θ2 pk (Q [i + 1]) 1 − pk (Q [i + 1])
Ci 

=E 1 + m
m
mβm + θpk (Q [i + 1]) mβm + θpk (Q [i + 1]) − θ 40
Xu and Yun: Reinforcement with Fading Memories
 
m
m
2
θ
p
(Q
[i
+
1])
1
−
p
(Q
[i
+
1])
(c)
k
k



Ci 
≤ E exp m
m
mβm + θpk (Q [i + 1]) mβm + θpk (Q [i + 1]) − θ
(d)
θ2 /4
≤ exp
.
mβm (mβm − θ)


(137)
m
Step (a) follows from the fact that, for a given value of Q [i + 1], Zkm [i + 1] is a Bernoulli random
m
m
variable with P (Zkm [i + 1] = 1) = pk (Q [i + 1]), and is independent from τi+1
. For step (b), note that
m
τi+1 is an exponential random variable with mean (mβm )−1 , independent from Ci and Q [i + 1].
Its moment generating function is given by E (ehτi+1 ) =
m
βm
βm −h
for all h ≤ βm , where h, in our case,
m
corresponds to θ(1 − pk (Q [i + 1])) and θpk (Q [i + 1]), for the two terms respectively. Step (c)
stems from the fact that 1 + x ≤ ex for all x ∈ R+ . Finally, for step (d) we have used the fact that
m
pk (Q [i + 1]) ∈ [0, 1] by definition, and that θ < mβm ; the exponent is hence bounded from above
m
by setting pk (Q [i + 1]) to 1/2 and 0 in the numerator and denominator, respectively.
Substituting Eq. (137) into Eq. (136), and invoking the induction hypothesis of Eq. (134), we
have that
m
E exp θξi+1
This proves our claim.
B.6.
m
Ci
=E E exp θξi+1
m
m
m
m
=E exp (θξi ) E exp θτi+1 Zk [i + 1] − pk (Q [i + 1]) Ci
θ2 /4
iθ2 /4
exp
≤ exp
mβm (mβm − θ)
mβm (mβm − θ)
2
(i + 1)θ /4
≤ exp
.
mβm (mβm − θ)
(138)
Q.E.D.
Proof of Lemma 5.11
Fix > 0 and m ∈ N. We have that
!
!
ImT +1
ImT +1
X
X
m 2
m 2
P
(τi ) ≥ ≤P
(τi ) ≥ , ImT < (1 + )mβm T + P (ImT ≥ (1 + )mβm T )
Proof.
i=1
i=1
(a)
≤P
(1+)mβm T
X
!
(τim )2 ≥ +
i=1
(b) (1 + )mβ T
m
≤
(c)
· E((τ1m )2 )
=2(1 + )T (mβm )−1 +
+
1+
(mβm )−1
T 2
1+
(mβm )−1
T 2
1+
(mβm )−1 ,
T 2
(139)
where step (a) follows from Eq. (56), (b) from the Markov’s inequality, and (c) from τim being an
exponentially distributed random variable with mean (mβm )−1 and hence E((τ1m )2 ) = 2(mβm )−2 .
Because mβm → ∞ as m → ∞, the claim follows.
Q.E.D.
41
Xu and Yun: Reinforcement with Fading Memories
B.7.
Proof of Theorem 5.12
Proof.
Fix u ∈ RK
+ , such that kuk ≤ h0 . For simplicity of notation, we will omit the dependence
of the initial condition and write q(·) in place of q(u, ·), whenever doing so does not cause any
confusion. The proof of the theorem is divided into two parts, depending on the value of α0 . For
the first part of the proof, we will assume that α0 ≤ λ1 /K.
Define the potential function
g(q) =
X
qk ∨ α0 ,
q ∈ RK
+.
(140)
k∈K
We will first partition
RK
+
into three disjoint subsets, depending on the value of g(·). Let λ1 − λ2
= min
, λ1 − (K − 1)α0 ,
2
(141)
and define
Q1 ={q ∈ RK
+ : g(q) ≤ λ1 − },
Q2 ={q ∈ RK
+ : λ1 − < g(q) ≤ λ1 },
Q3 ={q ∈ RK
+ : g(q) > λ1 }.
(142)
We observe some useful properties of the fluid solutions.
Lemma B.1 Fix t > 0 so that all coordinates of q(·) are differentialble. The following holds:
1. If q(t) ∈ Q1 ∪ Q2 , then q̇1 (t) ≥ 0. Furthermore, there exists a constant d1 > 0, such that if
q(t) ∈ Q1 , then q̇1 (t) ≥ d1 .
2. Denote by K + (t) the coordinates over which q(t) is at least α0 :
K+ (t) = {k ∈ K : qk (t) ≥ α0 }.
(143)
There exists a constant d2 > 0, such that if q(t) ∈ Q2 ∪ Q3 , then for all k ∈ K+ (t), k ≥ 2,
q̇k (t) ≤ −d2 .
3. If q(t) ∈ Q3 , then
Proof.
d
g(q(t))
dt
(144)
< 0.
Suppose that q(t) ∈ Q1 ∪ Q2 . We have that
q1 (t) ∨ α0
− q1 (t)
g(q(t))
q1 (t) ∨ α0
≥λ1
− q1 (t) ∨ α0
g(q(t))
λ1
=(q1 (t) ∨ α0 )
−1
g(q(t))
(a)
λ1
− 1 = 0,
≥ α0
λ1
q̇1 (t) =λ1
(145)
42
Xu and Yun: Reinforcement with Fading Memories
where step (a) follows from the assumption that q(t) ∈ Q1 ∪ Q2 . In a similar fashion, now suppose
that q(t) ∈ Q1 and hence g(q(t)) ≤ λ1 − . We obtain that
λ1
4
− 1 = d1 > 0.
q̇1 (t)≥α0
λ1 − (146)
This proves the first claim of the lemma.
For the second claim, fix q(t) ∈ Q2 ∪ Q3 and k ∈ K+ (t), k ≥ 2. We have that
qk (t) ∨ α0
− qk (t)
g(q(t))
λk
=qk (t)
−1
g(q(t))
(a)
λk
≤ α0
−1
λ1 − (b)
λ2
≤ α0
− 1 < 0,
λ2 + (λ1 − λ2 )/2
q̇k (t) =λk
(147)
where step (a) follows from the assumption that q(t) ∈ Q2 ∪ Q3 , and (b) from the fact that λ2 ≥ λk
for all k ≥ 2, and ≤
λ1 −λ2
.
2
This proves the second claim.
Finally, for the last claim, note that since α0 ≤ λ1 /K, the fact that q(t) ∈ Q3 implies that at least
one component of q(t) is no less than α0 , and hence K+ (t) 6= ∅. We have that
!
X
d X
d
g(q(t)) =
qk (t) ∨ α0 =
q̇k (t)
dt
dt k∈K
+
k∈K (t)
X
λk
=
qk (t)
−1
g(q(t))
k∈K+ (t)
(a) X
λk
−1
<
α0
λ1
+
k∈K (t)
≤0.
(148)
where the strict inequality in step (a) follows from the definition of Q3 , the fact that K+ (t) is
non-empty, and that λk ≤ λ1 for all k. This completes the proof of Lemma B.1.
Q.E.D.
We proceed by considering two cases for the initial condition of q(·).
Case 1: q(0) ∈ Q1 ∪ Q2 . Claim 3 in Lemma B.1 implies that
q(t) ∈
/ Q3 ,
∀t ∈ R+ .
(149)
By Claim 1 of the same lemma, we hence conclude that q1 (t) is non-decreasing in t for all t ∈ R+ .
By the same claim, we observe that q̇1 (t) is bounded from below by the constant d1 whenever q(t)
lies in Q1 . Since g(q(t)) ≥ q1 (t), we further conclude that in order for Eq. (149) to be valid, the
amount of time q(t) spends in Q1 must satisfy:
Z
λ1
I(q(t) ∈ Q1 )dt ≤ ,
d1
t≥0
(150)
43
Xu and Yun: Reinforcement with Fading Memories
which, combined with Eq. (149), implies that for all t ≥ 0:
Z t
λ1
I(q(s) ∈ Q2 )ds ≥ t − .
d1
0
(151)
Claim 2 of Lemma B.1 states that if q(t) ∈ Q2 , then any component of q(t) in K+ (t)\{1} will have a
negative drift that is bounded away from zero. Fix t ∈ R+ , and suppose there exists k 0 ∈ K+ (t)\{1}.
Since q̇k (t) ≤ λ1 for all k and t, by Eq. (151), we have that
λ1
λ1
λ21
λ1
qk0 (t) ≤ qk0 (0) + λ1
− t−
d2 ≤ h0 +
− t−
d2 .
d1
d1
d1
d1
(152)
Since the components of q(t) must be non-negative, the above equation implies that for all k ≥ 2,
qk (t) ≤ α0 ,
where T0 =
h0 +λ2
1 /d1
d2
∀t ≥ T0 ,
(153)
+ λd11 .
Eq. (153) will let us strengthen our previous observation. For all t ≥ T0 , we have that
g(q(t)) =
X
qk (t) ∨ α0 = q1 (t) ∨ α0 + (K − 1)α0 .
(154)
k∈K
Since q1 (t) is non-decreasing in t by the first claim of Lemma B.1, we conclude that g(q(t)) is
non-decreasing in t after T0 . Combined with Eq. (150), this implies that
q(t) ∈ Q2 ,
∀t ≥ T1 ,
(155)
where T1 = T0 + λd11 . Fix t > T1 . Observe that by the definition of , we have that q1 (t) ≥ α0 whenever
q(t) ∈ Q2 . We have that
g(q(t)) = q1 (t) + (K − 1)α0 ,
(156)
and hence
d I
q1 (t)
(q − q1 (t)) = −q̇1 (t) = −λ1
+ q1 (t)
dt
q1 (t) + (K − 1)α0
λ1
= − q1 (t)
−1
q1 (t) + (K − 1)α0
(a)
λ1
≤ − α0
−1
q (t) + (K − 1)α0
1
1
= − α0
−1
1 − (λ1 − (K − 1)α0 − q1 (t))/λ1
1
= − α0
−1
1 − (q1I − q1 (t))/λ1
(b)
α0
≤ − (q1I − q1 (t)),
λ1
(157)
44
Xu and Yun: Reinforcement with Fading Memories
where step (a) follows from the fact that q1 (t) ≥ α0 . For step (b), note that since q(t) ∈ Q2 and
g(q(t)) = q1 (t) + (K − 1)α0 , we have that
q1I − q1 (t) = q1I − [g(q(t)) − (K − 1)α0 ] ∈ [0, ) ⊂ [0, λ1 /2).
(158)
The inequality in step (b) thus follows from the fact that 1/(1 − x) ≥ 1 + x, for all x ∈ [0, 1), where
we let x = (q1I − q1 (t))/λ1 .
Recall that, for a > 0, the solution to the ordinary differential equation ẋ(t) = −ax(t) is given by
x(t) = x(0) exp(−at).
(159)
From Eq. (157), we conclude that for all s > 0,
α0
α0
q1I − q1 (T1 + s) ≤ (q1I − q1 (T1 )) exp − s ≤ exp − s ,
λ1
λ1
(160)
where the last inequality follows from the fact that q(T1 ) ∈ Q2 and qk (T1 ) ≤ α0 for all k ≥ 2,
and hence q1 (T1 ) ≥ λ1 − (K − 1)α0 − = q1I − . Let a1 = q1I ∨ , and b1 = α0 /λ1 . Since q1 (t) is
non-decreasing by Lemma B.1, from Eqs. (155) and (160), we have that if q(0) ∈ Q1 ∪ Q2 , then
|q1 (t) − q1I | ≤ a1 exp(−b1 t),
∀t ∈ R+ ,
(161)
We now show the convergence of qk (·) for k ≥ 2. Fix k ∈ {2, . . . , K }. Define
δk (t) = qk (t) − qkI ,
and recall from Theorem 5.3 that qkI =
λk
α ,
λ1 0
k ∈ K,
(162)
k = 2, . . . , K. By Eq. (153), we have that qk (t) ≤ α0
for all t ≥ T0 . Therefore, for all t ≥ T1 ,
α0
− qk (t)
g(q(t))
α0
=λk
− qk (t)
q1 (t) + (K − 1)α0
α0
=λk
− qk (t).
λ1 + δ1 (t)
q̇k (t) =λk
Using the Taylor expansion of the function f (x) =
α0
λ1 +x
(163)
around x = 0 and noting that δ1 (t) ≤
a1 exp(−b1 t) (Eq. (161)), we have that there exists T2 ≥ T1 , such that for all t ≥ T2 ,
δ̇k (t) ∈ −δk (t) ±
4 2λk α0 a1
,
λ1
where γk =
2λk α0
|δ1 (t)| = −δk (t) ± γk exp(−b1 t),
λ1
(164)
and we use the notation x ∈ y ± z to mean y − z ≤ x ≤ y + z. Given a fixed value
of δk (T2 ), Eq. (164) implies that the value of δk (T2 + t) must lie between the solutions to the ODEs
ẋ(t) = −x(t) + γk exp(−b1 t) and ẋ(t) = −x(t) − γk exp(−b1 t), with initial condition x(0) = δk (T2 ),
45
Xu and Yun: Reinforcement with Fading Memories
respectively. It can be verified that the solution to the ODE ẋ(t) = −x(t) + c1 exp(c2 t), with initial
c1
c1
condition x(0) = x0 , is given by x(t) = 1−c
exp(
−
c
x)
+
x
−
exp(−x). Setting c1 = γk and
2
0
1−c2
2
c2 = b1 , we thus conclude that there exist ak , bk > 0, such that
|δk (T2 + t)| ≤ (δk (T2 ) + ak ) exp(−bk ).
(165)
Note that δ̇k (t) is always bounded. We thus conclude hat there exists ak > 0 such that,
|qk (t) − qkI | = |δk (t)| ≤ ak exp(−bk t),
∀t ∈ R+ .
(166)
Since the above equation holds for all k ≥ 2, together with Eq. (161), we have proven Eq. (69) for
the first case, by assuming that q(0) ∈ Q1 ∪ Q2 .
Case 2: q(0) ∈ Q3 . We now consider the second case where the initial condition, q(0), belongs to
the set Q3 . Recall from Eq. (144) of Lemma B.1 that when q(t) ∈ Q2 ∪ Q3 , all components of q(t)
in K+ (t) exhibit a negative drift of magnitude at least d2 . Letting T3 = h0 /d2 , we thus conclude
that one of the following scenarios must occur:
1. q(s0 ) ∈ Q1 ∪ Q2 for some s0 ∈ [0, T3 ].
2. q(t) ∈ Q3 for all t ∈ [0, T3 ], and there exists s0 ∈ [0, T3 ] such that
qk (s0 ) ≤ α0 ,
∀k ≥ 2.
(167)
For the first scenario, it is equivalently to “re-initializing” the system at time t = s0 at a point
in Q1 . The proof for Case 1 presented earlier thus applies and Eq. (69) follows. In what follows,
we will focus on the second scenario. Without loss of generality, it suffices to show the validity of
Eq. (69) by fixing an initial condition:
q(0) ∈ Q3 , and max qk (0) ≤ α0 .
k≥2
(168)
Not that, by the definition of Q3 , this implies that
q1 (0) > q1I > α0 .
(169)
We next show that q1 (t) > q1I for all t ∈ R+ . Fix t > 0 such that
q(s) ∈ Q3 ,
∀s ∈ [0, t].
(170)
Such t exists because the set Q3 is open and q(·) is continuous. From Lemma B.1, we know that if
q(t) ∈ Q3 , then any component in K+ (t)\{1} must have a negative drift. It thus follows that
qk (t) < α0 ,
∀k ≥ 2.
(171)
46
Xu and Yun: Reinforcement with Fading Memories
We have that, whenever q1I < q1 (t) < q1I + 1,
d
λ1
(a)
I
q1 (t) − q1 = q̇1 (t) = − q1 (t) 1 −
dt
q1 (t) + (K − 1)α0
1
= − q1 (t) 1 −
1 + (q1 (t) − q1I )/λ1
1
I
≥ − (q1 + 1) 1 −
1 + (q1 (t) − q1I )/λ1
I
(b)
q +1
(q1 (t) − q1I )
≥− 1
2λ1
(172)
where step (a) follows from the fact that qk (t) ≤ α0 for all k ≥ 2, and (b) from 1/(1 + x) ≤ 1 − x/2
for all x ∈ [0, 1]. In light of Eq. (159), the above inequality implies if the value of q1 (t) − q1I is ever
to be below 1, then from that point on it will be bounded from below by an exponential function
that is strictly positive. This shows that q1 (t) > q1I for all t > 0.
We now prove the exponential rate of convergence of q1 (t) to q1I . Using again Eq. (172), we obtain
d
λ1
I
q1 (t) − q1 = − q1 (t) 1 −
dt
q1 (t) + (K − 1)α0
(a)
1
I
≤ − q1 1 −
1 + (q1 (t) − q1I )/λ1
(b)
qI
≤ − 1 (q1 (t) − q1I ),
(173)
λ1
where step (a) follows from the fact that q1 (t) > q1I , and (b) from 1/(1 + x) ≥ 1 − x for all x ≥ 0.
Using the reasoning identical to that in Eqs. (157) through (161), we conclude that there exists
a1 , b1 > 0, such that
|q1 (t) − q1I | ≤ a1 exp(−b1 t),
∀t ∈ R+ .
(174)
Finally, the fact that q1 (t) > q1I implies that g(q(t)) > λ1 , and hence q(t) ∈ Q3 for all t > 0. By
Eq. (171), this further implies that for any k ≥ 2, qk (t) ≤ α0 for all t ∈ R+ . The exponential rate
of convergence of qk (t) for k ∈ {2, . . . , K } thus follows from the same argument as in Eqs. (163)
through 166. This completes the proof of Theorem 5.12, assuming that α0 < λ1 /K.
We now turn to the case where α0 > λ1 /K. We will use a similar proof strategy by partitioning
the state space based on the value of g(q(t)), albeit with a different partitioning. In particular,
define the sets
Q̃1 ={q ∈ RK
+ : g(q) = Kα0 },
Q̃2 ={q ∈ RK
+ : g(q) > Kα0 }.
(175)
K
Note that by definition g(q) ≥ Kα0 for all q ∈ RK
+ , so the above sets constitute a partition of R+ .
We have the following lemma.
47
Xu and Yun: Reinforcement with Fading Memories
Lemma B.2 Fix t > 0 so that all coordinates of q(·) are differentialble. If q(t) ∈ Q̃2 , then there
exists a constant d3 > 0, such that
Proof.
d
g(q(t))
dt
< −d3 .
Suppose that q(t) ∈ Q̃2 . Since g(q(t)) > Kα0 , we have that K+ (t) 6= ∅. We have that
!
X
d
d X
q̇k (t)
qk (t) ∨ α0 =
g(q(t)) =
dt
dt k∈K
k∈K+ (t)
X
λk
=
qk (t)
−1
g(q(t))
k∈K+ (t)
X
(a)
λk
≤−
α0 1 −
Kα0
k∈K+ (t)
(b)
λ1
4
≤ − α0 1 −
= −d3 ,
(176)
Kα0
where step (a) follows from the fact that g(q(t)) > Kα0 , and (b) from K+ (t) 6= ∅. Finally, note that
since α0 > λ1 /K, d3 is strictly positive. This completes the proof of the lemma.
Q.E.D.
Lemma B.2 implies that if q(0) ∈ Q̃2 , then we must have that
q(t) ∈ Q̃1 ,
∀t ≥ h0 /d3 .
(177)
Since the rate of change in every coordinate of q(t) is bounded, in light of the above equation, it
suffices to establish Eq. (69) by considering only the scenario where q(t) ∈ Q̃1 for all t ∈ R+ . Fixing
k ∈ K, we have that
d
qk (t) ∨ α0
α0
(a)
(b)
(c)
qk (t) − qkI = λk
− qk (t) = λk
− qk (t) = λk /K − qk (t) = −(qk (t) − qkI ), (178)
dt
g(q(t))
g(q(t))
where step (a) follows from the fact that qk (t) ≤ α0 whenever q(t) ∈ Q̃1 , (b) from the fact that
g(q(t)) = Kα0 , and (c) from the definition qkI = λk /K. In light of Eq. (159), Eq. (178) implies that
|qk (t) − qkI | = |qk (0) − qkI | exp(−t) ≤ (h0 ∨ qkI ) exp(−t),
t ∈ R+ .
(179)
Since the above equation holds for all k, we have thus established an exponential rate of convergence
of q(t) to q I , as t → ∞. This completes the proof of Theorem 5.12.
B.8.
Q.E.D.
Proof of Lemma 5.13
Proof.
We will use a simple coupling argument as follows. Fixing k ∈ K, the evolution of the
process Uλmk (·) corresponds to that of Qm
k (·) if action k were selected for all t ∈ R+ , and it follows,
m
given the same initial condition, that Qm
k (t) is stochastically dominated by Uk (t) for all t ∈ R+ .
Since Ukm (·) is positive recurrent for all k ∈ K, we know that Qm (·) is also positive recurrent,
m
which in turn implies the positive recurrence of W (·), because the evolution of C m (·) is derived
by sampling from K based solely on the value of Qm (·). Lemma 5.13 follows form the abovementioned stochastic dominance, the fact that under any finite initial condition, Ukm (t) converges
in distribution to Uλmk as t → ∞, and the observation that Uλm0 Uλm whenever λ ≥ λ0 ≥ 0.
Q.E.D.
48
Xu and Yun: Reinforcement with Fading Memories
B.9.
Proof of Lemma 5.14
Proof.
To show Eq. (73), observe that
m
lim sup (1 − πW
(Qx ))
(a)
m
= lim sup P max Qk (∞) > x ≤ lim sup KP(Uλm1 /m > x)
x→∞ m∈N
x→∞ m∈N
k∈K
i
∞
∞ X (mλ1 )i (b)
X
emλ1
= lim sup Ke−mλ1
≤ lim sup K
x→∞ m∈N
x→∞ m∈N
i!
i
i=mx
i=mx
∞
∞
i
i
X
X
eλ1
emλ1
≤ lim sup K
≤ lim K
x→∞ m∈N
x→∞
mx
x
i=x
i=mx
x→∞ m∈N
≤ lim K2−(x−1)
x→∞
=0,
(180)
where step (a) follows from Lemma 5.13 and the union bound, and (b) from the elementary inequalm
ity i! ≥ (i/e)i . Eq. (180) thus shows that for all , inf m∈N πW
(Qx ) > 1 − for all sufficiently large x,
which proves Eq. (73).
B.10.
Q.E.D.
Proof of Lemma 6.2
Proof.
It is not difficult verify that {W m [n]}n∈Z+ is a time-homogeneous, aperiodic, and irre-
ducible Markov chain. Because the continuous-time process, {W m (t)}t∈R+ , is positive recurrent, so
is its discrete-time counterpart, {W m [n]}n∈N , and W m [n] converges to its steady-state distribution,
W m [∞], as n → ∞. Eq. (91) follows from the same argument as that of Lemma 5.13. We now show
Eq. (92). Denote by N m (t) the index of the last update point by time t:
N m (t) = sup{n : Snm ≤ t},
(181)
m
T m (t) = SN
m (t) .
(182)
and by T m (t) its value
Recall that the update points are generated according to a Poisson process, and it not difficult
to show that, almost surely, N m (t) → ∞ and T m (t) → ∞, as t → ∞. We thus have that, for all
k ∈ K,
P(C m [∞] = k) = lim P(C m [n] = k) = lim P(C m (T m (t)) = k) = P(C m (∞) = k).
n→∞
t→∞
(183)
The same argument applies for Qm [∞] versus Qm (·). This completes the proof of Lemma 6.2.
Q.E.D.
49
Xu and Yun: Reinforcement with Fading Memories
B.11.
Proof of Proposition 6.3
Proof.
We begin by showing the following, strengthened version of Lemma 5.13, which states
that a similar stochastic dominance property holds for Qm (∞) even when conditioning on C m (·)
being of a specific value.
Lemma B.3 Let W m = (Qm , C m ) be a random vector drawn from the steady-state distribution,
W m (∞). There exist constants , m0 and γ > 0, such that for all m ≥ m0 ,
m
{C = k } γUλm , ∀i, k ∈ K.
Qm
i
1
Let Qm = {x/m : x ∈ ZK
+ }. We have that, for all k ∈ K, n ∈ N,
m
m
X
X
α0
u ∨ α0
P k
P
P Q [n] = u
P(C m [n] = k)=
P Q [n] = u ≥
Kα0 + i∈K ui
i∈K (ui ∨ α0 )
u∈Qm
u∈Qm
(184)
Proof.
By Eq. (92) of Lemma 6.2, the above equation implies that
m
X
α0
m
m
P
P(C = k) = lim P(C [n] = k) ≥
P Q [∞] = u .
n→∞
Kα0 + i∈K ui
u∈Qm
where the last step follows from the fact that
α
0
P
Kα0 + i∈K ui
(185)
(186)
is always bounded from above by 1/K.
By Eq. (71), we have that Uλm1 /m → λ1 almost surely as m → ∞. Therefore, there exist m0 and
y > 0, such that
1 m
1
P
Uλ1 ≥ λ1 + y ≤
, ∀m ≥ m0 .
(187)
m
2K 2
Combining Eq. (91) in Lemma 6.2 with Eqs. (186) and (187), we have that, for all m ≥ m0 ,
m
(a) X
α0
P
P(C m = k) ≥
P Q [∞] = u
Kα0 + i∈K ui
u∈Qm
α0
m
m
≥P max Qi [∞] ≤ λ1 + y
+ P max Qi [∞] > λ1 + y · 0
i∈K
i∈K
Kα0 + K(λ1 + y)
(b)
1 m
α0
≥ 1 − KP
Uλ1 ≤ λ1 + y
m
Kα0 + K(λ1 + y)
(c) α (1 − 1/2K)
0
≥
Kα0 + K(λ1 + y)
=γ −1 ,
(188)
4 Kα0 +K(λ1 +y)
.
α0 (1−1/2K)
where γ =
Step (a) follows from Eq. (186), (b) from Lemma 5.13 and a union bound,
and (c) from Eq. (187).
Fix x ∈ Z+ . We have that, for all m ≥ m0 ,
m
m
P (Qm
= k) P (Qm
i ≥ x, C
i ≥ x)
P Qm
≥
x
C
=
k
=
≤
i
m
m
P(C = k)
P(C = k)
(a) P U m ≥ x
λ1
≤
P(C m = k)
(b)
≤ γP Uλm1 ≥ x ,
(189)
50
Xu and Yun: Reinforcement with Fading Memories
for all i, k ∈ K, where step (a) follows from Lemma 5.13, and (b) from Eq. (188). Since the above
inquality holds for all x ∈ Z+ , this completes the proof of Lemma B.3.
Q.E.D.
We now prove the convergence in Eq. (95). Recall that the first update point, S1m , is exponentially
distributed with mean 1/βm , and independent from W m (0). Define the event E m = {S1m ≥ h(m)}.
We have that
P(E m ) = P(S1m ≥ h(m)) = exp(−βm h(m)) → 1,
as m → ∞.
(190)
Fix i, k ∈ K. Recall from Eq. (71) that Uλm1 /m converges to λ1 almost surely as m → ∞. This implies
that there exists υ > 0, independent of m, such that
m
(a)
lim sup P Qi [0] > υ C m [0] = k ≤ lim sup P Uλm1 /m > γ −1 υ = 0,
m→∞
(191)
m→∞
where step (a) follows from Lemma B.3.
Fix x ∈ R+ . We have that
h m
i
m
lim sup P Qi [1] ≥ x C m [0] = k ≤ lim sup P Qi [1] ≥ x C m [0] = k, E m + (1 − P(E m C m [0] = k))
m→∞
m→∞
i
h m
(a)
= lim sup P Qi [1] ≥ x C m [0] = k, E m + (1 − P(E m ))
m→∞
m
(b)
= lim sup P Qi [1] ≥ x C m [0] = k, E m
(192)
m→∞
where step (a) follows from the independence between E m and C m [0], and (b) from Eq. (190).
We now bound the term on the right-hand side of Eq. (192). Denote by B(n, p) a binomial
random variable with n trials and a success probability of p per trial, and by Uλm (t) the number
of jobs in system at time t in an initially empty M/M/∞ queue with arrival rate λ and departure
rate 1/m. We have that
m
P Qi [1] ≥ x C m [0] = k, E m
m
1
(a)
m
m
m
m
−S1m /m
=P
≥ x C [0] = k, E
Uλk (S1 )I(i = k) + B Qi (0), e
m
(b)
m
1
m
m
m
−h(m)/m
≤P
U (S )I(i = k) + B Qi (0), e
≥ x C [0] = k
m λk 1
(c)
m
1
m
m
−h(m)/m
U I(i = k) + B Qi (0), e
≥ x C [0] = k
≤P
m λk
m
m
1
m
m
−h(m)/m
≤P
Uλk I(i = k) + B υm, e
≥ x C [0] = k, Qi (0) ≤ υ + P Qi (0) > υ C m [0] = k
m
m
1
m
−h(m)/m
(193)
=P
Uλk I(i = k) + B υm, e
≥ x + P Qi (0) > υ C m [0] = k .
m
For step (a), note that each unit of reward initially present at time t = 0 has probability of
exp(−S1m /m) or remaining in the system by t = S1m . Therefore, the rewards in site i at time S1m
satisfy the following decomposition:
d
m
m
m
−S1m /m
.
Qm
i [1] = Uλk (S1 )I(i = k) + B Qi (0), e
(194)
51
Xu and Yun: Reinforcement with Fading Memories
The first term corresponds to the units of rewards at t = S1m that had arrived during the interval
(0, S1m ), and hence is non-zero only if i = k. The second term corresponds to those individuals
initially present at t = 0 who remained in the system by t = S1m . Step (b) follows from the definition
of E m , and (c) from the well-known fact that the number of jobs in system in an initially empty
M/M/∞ queue at any time is always stochastically dominated by its steady-state distribution.
1
Because h(m)/m → ∞ as m → ∞, we have that limm→∞ E m
B υm, e−h(m)/m = 0. Applying
Markov’s inequality, we obtain that
P
1
B υm, e−h(m)/m → 0,
m
as m → ∞,
(195)
P
where → denotes convergence in probability. Recall from Eq. (71) that, almost surely,
1 m
∗
U I(i = k) → λk I(i = k) = qk,i
.
m λk
(196)
Fix > 0, and substitute Eq. (193) into Eq. (192). We have that
m
m
∗
∗
lim sup P Qi [1] ≥ qk,i
+ C m [0] = k = lim sup P Qi [1] ≥ qk,i
+ C m [0] = k, E m
m→∞
m→∞
m
1
∗
m
−h(m)/m
≤ lim sup P
≥ qk,i + + lim sup P Qi (0) > υ C m [0] = k
Uλk I(i = k) + B υm, e
m
m→∞
m→∞
1
(a)
∗
+
= lim sup P
U m I(i = k) + B υm, e−h(m)/m ≥ qk,i
m λk
m→∞
(b)
=0,
(197)
where step (a) follows from Eq. (191), and (b) from Eqs. (195) and (196). Using the same line of
arguments as that in Eqs. (192) through (197), we can show that
m
∗
lim sup P Qi [1] ≤ qk,i
− C m [0] = k = 0,
(198)
which, along with Eq. (197), yields that
m
∗ lim sup P Qi [1] − qk,i
> C m [0] = k = 0.
(199)
m→∞
m→∞
Since the above equation holds for all i, k ∈ K, this proves Eq. (95) in Proposition 6.3.
We now turn to Eq. (96). Fix i, k ∈ K. Using essentially identical arguments as those for Eq. (199),
we can show that
m
∗ lim sup P m−1 Qm
i (h(m)) − qk,i > C [0] = k = 0,
∀ > 0.
(200)
m→∞
We have that,
(a)
m
m
m
m
Qm
i (h(m))|{C [0] = k } Uλ1 (h(m)) + Qi (0)|{C [0] = k }
(b)
Uλm1 (h(m)) + γUλm1
(γ + 1)Uλm1 ,
(201)
52
Xu and Yun: Reinforcement with Fading Memories
Step (a) follows from a decomposition similar to Eq. (194), by the writing the recallable rewards at
time h(m) as those who arrived after t = 0, which is dominated by Uλm1 (h(m)), and those who were
m
in the system at t = 0, which is dominated by Qm
i (0)|{C [0] = k }. Step (b) follows from Lemma
5.13. Since Uλm1 is a Poisson distribution with mean mλ1 , it is not difficult show that there exists
a random variable Y ∈ R+ , such that
(a)
m
m−1 Qm
i (h(m))|{C [0] = k } 1 m
U (γ + 1) Y,
m λ1
∀m ∈ N.
(202)
Combining Eqs. (95) and (202), the dominated convergence theorem implies that, for all i, k ∈ K,
∗ m
lim E m−1 Qm
C [0] = k = 0.
i (h(m)) − qk,i
(203)
m→∞
This shows Eq. (96), and thus completes the proof of Proposition 6.3.
Q.E.D.