Online learning in non-stationary Markov decision processes

Gergely Neu
[email protected]

Thesis submitted to the Budapest University of Technology and Economics in partial fulfilment for the award of the degree of Doctor of Philosophy in Informatics.

Supervised by Dr. László Györfi, Dr. András György, and Dr. Csaba Szepesvári
Department of Computing Science and Information Technology
Magyar tudósok körútja 2., Budapest, HUNGARY
March 2013

Abstract

This thesis studies the problem of online learning in non-stationary Markov decision processes where the reward function is allowed to change over time. In every time step of this sequential decision problem, a learner has to choose one of its available actions after observing some part of the current state of the environment. The chosen action influences the observable state of the environment in a stochastic fashion and earns the learner some reward; however, the entire state (be it observed or not) also influences the reward. The goal of the learner is to maximize the total (non-discounted) reward that it receives. In this work, we assume that the unobserved part of the state evolves autonomously, independently of the observed part of the state and of the actions chosen by the learner, thus corresponding to a state sequence generated by an oblivious adversary such as nature. Otherwise, absolutely no statistical assumption is made about the mechanism generating the unobserved state variables. This setting fuses two important paradigms of learning theory: online learning and reinforcement learning. In this thesis, we propose and analyze a number of algorithms designed to work under various assumptions about the dynamics of the stochastic process characterizing the evolution of the observable states. For all of these algorithms, we provide bounds on the regret, defined as the performance gap between the total reward gathered by the learner and the total reward of the best available fixed policy.

Acknowledgments

First and foremost, I would like to thank my supervisors Csaba Szepesvári and András György. I cannot be thankful enough to Csaba, who introduced me to the exciting field of reinforcement learning and guided my first steps as a researcher. Later on, Andris initiated me into the subject of online prediction, which became my main interest through the years. Working with the two of them greatly inspired me both professionally and personally, and I'm deeply grateful for all their help. I would also like to thank László Györfi, who efficiently supported my work by encouraging me whenever I was uncertain about the next step.

I would also like to thank all my colleagues, ex-colleagues and related folks from the Department of Computer Science and Information Theory at the Budapest University of Technology and Economics (especially András Temesváry, Márk Horváth, Márta Pintér, László Ketskeméty and András Telcs for teaming up with me during the Great Frankfurt Snowstorm of 2010), the Reinforcement Learning and Artificial Intelligence Lab at the University of Alberta (especially Gábor Bartók along with his lovely family, Gábor Balázs, István Szita and Réka Mihalka, Dávid Pál, and Yasin Abbasi-Yadkori) and MTA SZTAKI (especially András Antos, the coauthor of the results presented in Chapter 4). Once again, I am very grateful to Csaba and his amazing family for welcoming me several times in Edmonton. I am also thankful to the unstoppable statistician/cyclists Gábor Lugosi and Luc Devroye, also known as the coauthors of the results presented in Chapter 6.
Finally, I would like to express my gratitude to the people who obviously had the greatest impact on my entire life: my parents, brother and grandmother. Most of all, I am eternally grateful to Éva Kóczi for her endless love and support through the ups and downs of the past years.

In loving memory of József Lancz

Contents

1 Introduction
  1.1 The learning model
  1.2 Contributions of the thesis
  1.3 Related work
  1.4 Applications
2 Background
  2.1 Online prediction of arbitrary sequences
  2.2 Stochastic multi-armed bandits
  2.3 Markov decision processes
    2.3.1 Loop-free stochastic shortest path problems
    2.3.2 Unichain MDPs
3 Online learning in known stochastic shortest path problems
  3.1 Problem setup
  3.2 A decomposition of the regret
  3.3 Full Information O-SSP
  3.4 Bandit O-SSP using black-box algorithms
    3.4.1 Expected regret against stationary policies
    3.4.2 Tracking
  3.5 Bandit O-SSP using Exp3
    3.5.1 Expected regret against stationary policies
    3.5.2 Tracking
    3.5.3 The case of α = 0
    3.5.4 A bound that holds with high probability
  3.6 Simulations
  3.7 The proof of Lemma 3.2
  3.8 A technical lemma
4 Online learning in known unichain Markov decision processes
  4.1 Problem setup
  4.2 A decomposition of the regret
  4.3 Full information O-MDP
  4.4 Bandit O-MDP
  4.5 The proofs of Propositions 4.1 and 4.2
    4.5.1 The change rate of the learner's policies
    4.5.2 Proof of Proposition 4.1
    4.5.3 Proof of Proposition 4.2
5 Online learning in unknown stochastic shortest path problems
  5.1 Problem setup
  5.2 Rewriting the regret
  5.3 Follow the perturbed optimistic policy
    5.3.1 The confidence set for the transition function
    5.3.2 Extended dynamic programming
  5.4 Regret bound
  5.5 Extended dynamic programming: technical details
6 Online learning with switching costs
  6.1 The Shrinking Dartboard algorithm revisited
  6.2 Prediction by random-walk perturbation
    6.2.1 Regret and number of switches
    6.2.2 Bounding the number of switches
    6.2.3 Online combinatorial optimization
7 Online lossy source coding
  7.1 Related work
  7.2 Limited-delay limited-memory sequential source codes
  7.3 The Algorithm
  7.4 Sequential zero-delay lossy source coding
  7.5 Extensions
8 Conclusions
  8.1 The gap between the lower and upper bounds
  8.2 Outlook
  8.3 The complexity of our algorithms
  8.4 Online learning with switching costs
  8.5 Closing the gap for online lossy source coding

Chapter 1
Introduction

The work in this thesis generalizes two major fields of sequential machine learning theory: online learning [25, 20] and reinforcement learning [86, 17, 92, 85]. The characteristics of these learning models can be summarized as follows:

Online learning: In each time step t = 1, 2, ..., T of the standard online learning (or online prediction) problem, the learner selects an action $a_t$ from a finite action space A, and consequently earns some reward $r_t(a_t)$. The goal of the learner is to maximize its total expected reward. This problem can be easily treated by standard statistical machinery if the sequence of reward functions is generated in an i.i.d. fashion (that is, if the rewards $(r_t(a))_{t=1}^T$ are independent and identically distributed). However, this assumption does not account for dynamic data, let alone acting in a reactive environment. The power of the online learning framework lies in the fact that it does not require any statistical assumptions to be made about the data generation process: it is assumed that the sequence of reward functions $(r_t)_{t=1}^T$, $r_t : A \to [0,1]$, is an arbitrary fixed sequence chosen by an external mechanism referred to as the environment or the adversary.
Of course, by dropping the strong statistical assumptions on the reward sequence, we can no longer hope to explicitly maximize the total cumulative reward $\sum_{t=1}^T r_t(a_t)$ and thus have to settle for a less ambitious goal. This goal is to minimize the performance gap between our algorithm and the strategy that selects the action that is best in hindsight. This performance gap is called the regret and is defined formally as
$$ L_T = \max_{a\in A} \sum_{t=1}^T r_t(a) - \sum_{t=1}^T r_t(a_t). $$
It is important to note that the best fixed action in the above expression can only be computed in full knowledge of the sequence of reward functions. While minimizing the regret intuitively seems to be very difficult, it is by now a very well understood problem, even in the significantly more challenging bandit setting where the learner only observes $r_t(a_t)$ after making its decision. In recent years, numerous algorithms have been proposed for different versions of the online learning problem under different assumptions on the action space A and on the amount of information revealed to the learner. The main shortcoming of this problem formulation is that it does not adequately account for the influence of the previous actions $a_1, \ldots, a_{t-1}$ on the reward function $r_t$; that is, it assumes that the decisions of the learner do not influence the mechanism generating the rewards. The formalism presented in this thesis provides a way of modeling and coping with such effects.

Reinforcement learning: In every time step t = 1, 2, ..., T of the standard reinforcement learning (RL) problem, the learner (or agent) observes the state $x_t$ of the environment, selects an action $a_t$ from a finite action space A, and consequently earns some reward $r(x_t, a_t)$. Finally, the next state $x_{t+1}$ of the environment is drawn from the distribution $P(\cdot|x_t, a_t)$. It is assumed that the state space X of the environment is finite, and that the reward function $r : X \times A \to [0,1]$ and the transition function $P : X \times X \times A \to [0,1]$ are fixed but unknown functions. The goal of the learner is to maximize its total reward in the Markov decision process (MDP) described above by the tuple (X, A, P, r). It is commonplace to consider learning algorithms that map the states of the environment to actions with a stationary state-feedback policy (or, in short, a policy) $\pi : X \to A$. A policy can be evaluated in terms of its total expected reward after T time steps in the MDP:
$$ R_T^\pi = \mathbb{E}\left[ \sum_{t=1}^T r\bigl(x'_t, \pi(x'_t)\bigr) \right], $$
where $x'_{t+1} \sim P(\cdot|x'_t, \pi(x'_t))$. Writing the total expected reward of the learner as
$$ \widehat{R}_T = \mathbb{E}\left[ \sum_{t=1}^T r(x_t, a_t) \right], $$
we can then define a good performance measure of a reinforcement learning algorithm using the following notion of regret:
$$ \widehat{L}_T = \max_\pi R_T^\pi - \widehat{R}_T. $$
The regret measures the amount of "lost reward" during the learning process, that is, the price of not knowing r and P before the learning process starts. The main limitation of the MDP formalism is that it does not account for non-random state dynamics that cannot be captured by the Markovian model P. In this thesis, we present a formalism that goes beyond the standard stochastic assumptions made in the reinforcement learning literature and provide algorithms with good theoretical performance guarantees. We study complex reinforcement learning problems where the performance of learning algorithms is measured by the total reward they collect during learning, and where the assumption that the state dynamics are completely Markovian is relaxed.
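Before turning to our formalism, the following minimal sketch (not part of the thesis, only an illustration) makes the first of the two regret notions above concrete: it computes $L_T$ from a full table of rewards and the actions actually played.

```python
def online_regret(rewards, actions):
    """Compute the regret L_T = max_a sum_t r_t(a) - sum_t r_t(a_t).

    rewards[t][a] holds r_t(a) for rounds t = 0, ..., T-1;
    actions[t] is the index of the action a_t chosen in round t.
    """
    num_actions = len(rewards[0])
    # Total reward of the best action fixed in hindsight.
    best_fixed = max(sum(r[a] for r in rewards) for a in range(num_actions))
    # Total reward actually earned by the learner.
    earned = sum(r[a_t] for r, a_t in zip(rewards, actions))
    return best_fixed - earned
```

For instance, `online_regret([[1, 0], [1, 0]], [1, 1])` returns 2: the best fixed action earns 2 while the learner earns 0.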
Our formalism is based on principles of reinforcement learning and online learning: we regard these complex problems as Markov decision problems where the reward functions are allowed to change over time, and we propose algorithms with theoretically guaranteed bounds on their performance. The performance criterion that we address is the worst-case regret, which is the one typically considered in the online learning literature. Learning in this model is called the online MDP or O-MDP problem. The main idea of our approach relies on the observation that in a number of practical problems, the hard-to-model, complex part of the environment influences only the rewards that the learner receives.

In the rest of this chapter, we describe our general model and show precisely how it relates to the two learning paradigms described above. In Section 1.2, we summarize the contributions of the present thesis. The most important related results from the literature are discussed in Section 1.3. We conclude this chapter with Section 1.4, where we briefly describe some practical problems where our formalism can be applied.

The rest of the thesis is organized as follows. In Chapter 2, we review some relevant concepts from online learning and reinforcement learning; the purpose of the chapter is to set up the formal definitions used later in the thesis. In Chapters 3–5, we discuss different versions of the online MDP problem and propose algorithms for the specified learning problems, providing a rigorous theoretical analysis for each of the proposed algorithms. In Chapter 6, we describe a special case of our setting called online learning with switching costs; for this problem, we provide two algorithms with optimal performance guarantees. In Chapter 7, we apply the methods described in Chapter 6 to construct adaptive coding schemes for the problem of online lossy source coding, and show that the proposed algorithm enjoys near-optimal performance guarantees. Since it is difficult to evaluate the contributions of Chapters 3–5 separately, we present conclusions for all chapters in Chapter 8.

1.1 The learning model

The interaction between the learner and the environment is shown in Figure 1.1. The environment is split into two parts: one part that has Markovian dynamics and another with unrestricted, autonomous dynamics. In each discrete time step t, the agent receives the state $x_t$ of the Markovian environment and possibly the previous state $y_{t-1}$ of the autonomous dynamics. The learner then makes a decision about the next action $a_t$, which is sent to the environment. The environment then makes a transition: the next state of the Markovian environment depends stochastically on the current state and the chosen action as $x_{t+1} \sim P(\cdot|x_t, a_t)$, while $y_{t+1}$ is generated by an autonomous dynamic that is not influenced by the learner's actions or by the state of the Markovian environment. After this transition, the agent receives a reward depending on the complete state of the environment and the chosen action, and then the process continues. The goal of the learner is to collect as much reward as possible. The modeling philosophy is that whatever can be modeled about the environment's dynamics should be modeled in the Markovian part, and the remaining "unmodeled dynamics" is what constitutes the autonomous part of the environment. A large number of practical operations research and control problems have the structure outlined above.
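To make the interaction protocol just described concrete, here is a minimal simulation sketch; the names `learner`, `autonomous_step`, and `reward` are illustrative placeholders rather than objects defined in the thesis.

```python
import random

def run_interaction(learner, P, reward, autonomous_step, x0, y0, T):
    """Simulate T steps of the learner-environment interaction of Section 1.1.

    P(x, a) returns a dict of next-state probabilities (the Markovian part),
    autonomous_step(t) returns the uncontrolled state y_t (chosen by "nature"),
    reward(x, y, a) returns the reward for the complete state (x, y) and action a.
    All of these names are placeholders.
    """
    x, y_prev = x0, y0
    total_reward = 0.0
    for t in range(1, T + 1):
        # The learner observes x_t and possibly the previous uncontrolled state.
        a = learner.act(x, y_prev)
        # The uncontrolled state evolves autonomously, unaffected by x_t or a_t.
        y = autonomous_step(t)
        # The reward depends on the complete state (x_t, y_t) and the action a_t.
        total_reward += reward(x, y, a)
        # Markovian transition: x_{t+1} ~ P(. | x_t, a_t).
        next_dist = P(x, a)
        x = random.choices(list(next_dist), weights=list(next_dist.values()))[0]
        y_prev = y
    return total_reward
```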
Examples of such problems include production and resource allocation problems, where the major source of difficulty is modeling prices; various problems in computer science, such as the k-server problem and paging problems; and web-optimization problems, such as ad-allocation problems with delayed information [see, e.g., 33, 105].

Figure 1.1: The interaction between the learner and the environment. At time t, the agent's action is $a_t$, the state of the Markovian dynamics is $x_t$, and the state of the uncontrolled dynamics is $y_t$; $q^{-1}$ is a one-step delay operator.

In the rest of the thesis, for simplicity and by slightly abusing terminology, we call the state $x_t$ of the Markovian part "the state" and regard dependency on $y_t$ as dependency on t by letting $r_t(\cdot,\cdot) = r(\cdot,\cdot,y_t)$. The goal of the learner is to maximize its total expected reward
$$ \widehat{R}_T = \mathbb{E}\left[\left.\sum_{t=1}^T r_t(x_t, a_t)\,\right| P\right], $$
where the notation $\mathbb{E}[\,\cdot\,|P]$ is used to emphasize that the state sequence $(x_t)_{t=1}^T$ is generated by the transition function P. Controllers of the form $\pi : X \to A$ are called stationary deterministic policies, where X is the state space of the Markovian part of the environment and A is the set of actions. The performance of a policy $\pi$ is measured by its total expected reward
$$ R_T^\pi = \mathbb{E}\left[\left.\sum_{t=1}^T r_t\bigl(x'_t, \pi(x'_t)\bigr)\,\right| P, \pi\right], $$
where $(x'_t)_{t=1}^T$ is the state sequence obtained by following policy $\pi$ in the MDP described by P. The learner's goal is to perform nearly as well as the best fixed stationary policy in hindsight in terms of the total reward collected, that is, to minimize the quantity
$$ \widehat{L}_T = \max_\pi R_T^\pi - \widehat{R}_T. \qquad (1.1) $$
In other words, we are interested in constructing algorithms that minimize the total expected regret, defined as the gap between the total accumulated reward of the learner and that of the best fixed controller. Naturally, no assumptions can be made about the autonomous part of the environment, as it is assumed that modeling this part of the environment lies outside the capabilities of the learner. Guaranteeing low regret is equivalent to a robust control guarantee: the guarantee on the performance must hold no matter how the autonomous state sequence $(y_t)$, or equivalently, the reward sequence $(r_t)$, is chosen. The potential benefit is that the results will be more generally applicable and the algorithms will enjoy added robustness, while, generalizing from results available for supervised learning [23, 62, 87], the algorithms can also avoid being too pessimistic despite the strong worst-case guarantees.¹

¹ Sometimes, robustness is associated with conservative choices and thus poor "average" performance. Although we do not study this question here, we note in passing that the algorithms we build upon have "adaptive variants" that are known to adapt to the environment in the sense that their performance improves when the environment is "less adversarial".

1.2 Contributions of the thesis

We have studied the above problem under various assumptions on the structure of the underlying MDP and on the feedback provided to the learner. To be able to present our contributions, we state our assumptions informally in the following list; the precise assumptions are presented in Chapter 2.

1. Loop-free episodic environments are episodic MDPs where transitions are only possible in a "forward" manner. Episodic MDPs capture learning problems where the learner has to repeatedly perform similar tasks consisting of multiple state transitions. At the beginning of each episode, the learner starts from a fixed state $x_0$, and the episode ends when the goal state $x_L$ is reached. We assume that all other states in the state space X can be visited at most once; thus, the transition structure does not allow loops.
The reward function $r_t : X \times A \to [0,1]$ remains fixed during each episode t = 1, 2, ..., T, but can change arbitrarily between consecutive episodes. In each time step l = 0, 1, ..., L−1 of episode t, the learner observes its state $x_l^{(t)}$ and has to decide about its action $a_l^{(t)}$. The total expected reward of the learner is defined as
$$ \widehat{R}_T = \mathbb{E}\left[\left. \sum_{t=1}^T \sum_{l=0}^{L-1} r_t\bigl(x_l^{(t)}, a_l^{(t)}\bigr) \,\right| P \right], $$
and the total expected reward of policy $\pi$ is defined as
$$ R_T^\pi = \mathbb{E}\left[\left. \sum_{t=1}^T \sum_{l=0}^{L-1} r_t(x'_l, a'_l) \,\right| P, \pi \right], $$
where we used the notation $\mathbb{E}[\,\cdot\,|P,\pi]$ to emphasize that the trajectory $(x'_l, a'_l)_{l=0}^{L-1}$ is generated by the transition model P and policy $\pi$. The minimal visitation probability α is defined as
$$ \alpha = \min_{x\in X}\, \min_{\pi\in A^X} \mathbb{P}\bigl[\exists l : x'_l = x \,\big|\, P, \pi\bigr]. $$
This problem will often be referred to as the online stochastic shortest path (O-SSP) problem.

(a) Full feedback with known transitions: We assume that the transition function P is fully known before the first episode and that the reward function $r_t$ is entirely revealed after episode t.

(b) Bandit feedback with known transitions: We assume that the transition function P is fully known before the first episode, but the reward function $r_t$ is only revealed along the trajectory traversed by the learner in episode t. In other words, the feedback provided to the learner after episode t is
$$ \Bigl(x_l^{(t)},\, a_l^{(t)},\, r_t\bigl(x_l^{(t)}, a_l^{(t)}\bigr)\Bigr)_{l=0}^{L-1}. $$

(c) Full feedback with unknown transitions: We assume that P is unknown to the learner, but the reward function $r_t$ is entirely revealed after episode t. The layer structure of the state space and the action space are assumed to be known, and the traversed trajectory is also revealed to the learner. In other words, the feedback provided to the learner after episode t is
$$ \Bigl(\bigl(x_l^{(t)}, a_l^{(t)}\bigr)_{l=0}^{L-1},\, r_t\Bigr). $$

2. Unichain environments are continuing MDPs where no episodes are specified. In each time step t = 1, 2, ..., T, the learner observes the state $x_t$ and has to decide about its action $a_t$, while the reward function $r_t : X \times A \to [0,1]$ is also allowed to change after each time step. For any stationary policy $\pi : X \to A$, we define the elements of the transition kernel $P^\pi$ as $P^\pi(x|y) = P(x|y, \pi(y))$ for all x, y ∈ X. We assume that for each policy $\pi$, there exists a unique probability distribution $\mu^\pi$ over the state space that satisfies
$$ \mu^\pi(x) = \sum_{y\in X} \mu^\pi(y)\, P^\pi(x|y) \qquad \text{for all } x \in X. $$
The distribution $\mu^\pi$ is called the stationary distribution corresponding to policy $\pi$. The minimal stationary visitation probability $\alpha'$ is defined as
$$ \alpha' = \min_{x\in X}\, \min_{\pi\in A^X} \mu^\pi(x). $$
We assume that every policy $\pi$ has a finite mixing time $\tau^\pi > 0$ that specifies the speed of convergence to the stationary distribution $\mu^\pi$. The total expected reward of the learner is defined as
$$ \widehat{R}_T = \mathbb{E}\left[\left. \sum_{t=1}^T r_t(x_t, a_t) \,\right| P \right], $$
and the total expected reward of any policy $\pi$ is defined as
$$ R_T^\pi = \mathbb{E}\left[\left. \sum_{t=1}^T r_t(x'_t, a'_t) \,\right| P, \pi \right], $$
where the trajectory $(x'_t, a'_t)$ is generated by following policy $\pi$ in the MDP specified by P.
(a) Full feedback with known transitions: We assume that the transition function P is fully known before the first time step and that the reward function $r_t$ is entirely revealed after time step t.

(b) Bandit feedback with known transitions: We assume that the transition function P is fully known before the first time step, but the reward function $r_t$ is only revealed at the state–action pair visited by the learner in time step t. In other words, the feedback provided to the learner after time step t is $\bigl(r_t(x_t, a_t), x_{t+1}\bigr)$.

3. Online learning with switching costs is a special version of the online prediction problem where switching between actions is subject to some cost K > 0. Alternatively, this problem can be regarded as a special case of our general setting, as we can construct a simple online MDP that models every online learning problem where switching between experts is expensive. The online MDP $(X, A, P, (r_t)_{t=1}^T)$ in question is specified as follows: the state $x_{t+1}$ of the environment is identical to the previously selected action $a_t$. In other words, X = A and the transition function P is such that for all x, y, z ∈ A, $P(y|x, y) = 1$ and $P(z|x, y) = 0$ if z ≠ y. The reward function in this online MDP is defined using the original reward function $g_t : A \to [0,1]$ of the prediction problem and the switching cost K as
$$ r_t(x, a) = g_t(a) - K\,\mathbb{I}_{\{a\neq x\}} $$
for all $(x, a) \in A^2$. (A concrete sketch of this construction is given after Table 1.1 below.) Note that K is allowed to be much larger than the maximal reward of 1. We consider two subclasses of online learning problems with switching costs.

(a) Online prediction with expert advice: We assume that the action set A is relatively small and the environment is free to choose the rewards for each different action. Actions in this setting are often referred to as "experts".

(b) Online combinatorial optimization: We assume that each action can be represented by a d-dimensional binary vector and the environment can only choose the rewards given for selecting each of the d components. Formally, the learner has access to the action space $A \subseteq \{0,1\}^d$, and in each round t the environment specifies a vector of rewards $g_t \in \mathbb{R}^d$. The reward given for selecting action a ∈ A is the inner product $g_t(a) = g_t^\top a$.

4. The online lossy source coding problem is a special case of Setting 3 in which a learner has to encode a sequence of source symbols $z_1, z_2, \ldots, z_T$ over a noiseless channel and produce a sequence of reproduction symbols $\hat{z}_1, \hat{z}_2, \ldots, \hat{z}_T$. A coding scheme consists of an encoder f mapping source symbols $(z_t)_{t=1}^T$ to channel symbols $(y_t)_{t=1}^T$ and a decoder g mapping channel symbols $(y_t)_{t=1}^T$ to reconstruction symbols $(\hat{z}_t)_{t=1}^T$. We assume that the learner has access to a fixed pool of coding schemes F. The goal of the learner is to select coding schemes $(f_t, g_t) \in F$ that minimize the cumulative distortion between the source sequence and the reproduction sequence, defined as
$$ \widehat{D}_T = \sum_{t=1}^T d(z_t, \hat{z}_t), $$
where the sequence $(\hat{z}_t)_{t=1}^T$ is produced by the sequence of applied coding schemes and d is a given distortion measure. We make no statistical assumptions about the sequence of source symbols. Denoting the cumulative distortion of a fixed coding scheme (f, g) ∈ F by $D_T(f, g)$, the goal of the learner can be formulated as minimizing the expected normalized distortion redundancy
$$ \widehat{R}_T = \frac{1}{T}\left( \mathbb{E}\bigl[\widehat{D}_T\bigr] - \min_{(f,g)\in F} D_T(f, g) \right), $$
which is in turn equivalent to regret minimization in the online learning problem where rewards correspond to negative distortions.
Additionally, in each time step t, the learning algorithm has to ensure that the receiving entity is informed of the identity of the decoder $g_t$ to be used for decoding the t-th channel symbol. We assume that transmitting the decoder $g_t$ is only possible on the same channel that is used for transmitting the source sequence. This gives rise to a cost for switching between coding schemes, making this problem an instance of the problems described in Setting 3a.

The contributions of this thesis for each setting are listed in Figure 1.2. All performance guarantees proved for Settings 1a through 2b concern learning schemes that have not been covered by previous works, with the exception of Setting 2a, where our contribution is an improvement of the regret bounds given by Even-Dar et al. [33]. Our main results concerning online MDPs are also presented in Table 1.1, along with the most important other results in the literature. For Setting 3a, we propose a new prediction algorithm with optimal performance guarantees. The same approach can be used for learning in Setting 3b, a problem that has not yet been addressed in the literature. The results for Setting 4 significantly improve a number of previous performance guarantees known for the problem: our guarantees on the expected regret match the best known lower bound for the problem up to a logarithmic factor.

Setting 1a
• Guarantee on the expected regret against the pool of all stationary policies (Proposition 3.1).
Setting 1b
• Guarantees on the expected regret against the pool of all stationary policies assuming α > 0 (Theorems 3.1, 3.2, 3.4).
• Guarantees on the expected regret against the pool of all non-stationary policies assuming α > 0 (Theorems 3.3, 3.5).
• Guarantee on the expected regret against the pool of all stationary policies allowing α = 0 (Theorem 3.6).
• High-confidence guarantee on the regret against the pool of all stationary policies assuming α > 0 (Theorem 3.7).
Setting 1c
• Guarantee on the expected regret against the pool of all stationary policies (Theorem 5.1).
Setting 2a
• Improved guarantee on the expected regret against the pool of all stationary policies (Theorem 4.1).
Setting 2b
• Guarantee on the expected regret against the pool of all stationary policies assuming α′ > 0 (Theorem 4.2).
Setting 3a
• Optimal guarantees on both the expected regret against the pool of the fixed actions in A and the number of action switches (Theorem 6.1).
Setting 3b
• Near-optimal guarantees on both the expected regret against the pool of the fixed actions in $A \subseteq \{0,1\}^d$ and the number of action switches (Theorem 6.2).
Setting 4
• Optimal guarantee on the expected regret against any finite pool of reference classes F (Theorem 7.1).
• Optimal guarantee on the expected regret against the pool of all quantizers with the quadratic distortion measure (Theorem 7.3).

Figure 1.2: The contributions of the thesis.

P known, $r_t$ observed:
  Even-Dar et al. [33], unichain environment: $\widehat{L}_T = O(\tau^2\sqrt{T\log|A|})$
P known, $r_t(x_t, a_t)$ observed:
  Neu et al. [78], SSP environment: $\widehat{L}_T = O(L^2\sqrt{T|A|}/\alpha)$
  Neu et al. [81], unichain environment: $\widehat{L}_T = O(\tau^{3/2}\sqrt{T|A|}/\alpha')$
P unknown, $r_t$ observed:
  Neu et al. [79], SSP environment: $\widehat{L}_T = O(L|X|\sqrt{|A|T})$
P unknown, $r_t(x_t, a_t)$ observed:
  Jaksch et al. [59], stochastic rewards, connected environment: $\widehat{L}_T = \tilde{O}(D|X|\sqrt{T|A|})$
  Future work

Table 1.1: Upper bounds on the regret for different feedback assumptions. Our results are the entries of Neu et al. [78, 79, 81]. Rewards are assumed to be adversarial unless stated otherwise explicitly.
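As a concrete illustration of the construction used in Setting 3 above, the following minimal sketch (illustrative only, not an algorithm from the thesis) builds the transition and reward functions of the online MDP that encodes a switching-cost prediction problem.

```python
def switching_cost_mdp(K):
    """Online MDP encoding of prediction with switching cost K (Setting 3).

    States coincide with actions: the next state is the action just chosen,
    so P(y | x, a) = 1 if y == a and 0 otherwise, and the per-round reward is
    r_t(x, a) = g_t(a) - K * 1{a != x}.
    """
    def P(y, x, a):
        # Deterministic transition to the chosen action.
        return 1.0 if y == a else 0.0

    def reward(g_t, x, a):
        # g_t maps actions to rewards in [0, 1]; switching from x to a costs K.
        return g_t[a] - (K if a != x else 0.0)

    return P, reward
```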
1.3 Related work

As noted earlier, our work is closely related to the field of reinforcement learning. Most works in the field consider the case when the learner controls a finite Markov decision process (MDP; see [19, 60, 15, 59], or Section 4.2.4 of [93] for a summary of available results and the references therein). While there exist a few works that extend the theoretical analysis beyond finite MDPs, these come with strong assumptions on the MDP (e.g., [61, 91, 1]).

The first work to address the theoretical aspects of online learning in non-stationary MDPs is due to Even-Dar et al. [32, 33], who consider the case when the reward function is fully observable. They propose an algorithm, MDP-E, which uses some (optimized) experts algorithm in every state, fed with the action values of the policy used in the last round. Assuming that the MDP is unichain and that the worst-case mixing time τ over all policies is uniformly small, the regret of their algorithm is shown to be $\tilde{O}(\tau^2\sqrt{T})$. Part of our work recycles the core idea underlying MDP-E: it uses black-box bandit algorithms at every state. An alternative Follow-the-Perturbed-Leader-type algorithm was introduced by Yu et al. [105] for the same full-information unichain MDP problem. The algorithm comes with improved computational complexity but an increased $\tilde{O}(T^{3/4+\epsilon})$ regret bound. Concerning bandit information and unichain MDPs, Yu et al. [105] introduced an algorithm with vanishing regret (i.e., the algorithm is Hannan consistent). Yu and Mannor [103, 104] considered the problem of online learning in MDPs where the transition probabilities may also change arbitrarily after each transition. This problem is significantly more difficult than the case where only the reward function changes arbitrarily. Accordingly, the algorithms proposed in these papers fail to achieve sublinear regret. Yu and Mannor [104] also considered the case when rewards are only observed along the trajectory traversed by the agent. However, this paper seems to have gaps: if the state space consists of a single state, the problem becomes identical to the non-stochastic multi-armed bandit problem. Yet, from Theorem IV.1 of Yu and Mannor [104] it follows that the expected regret of their algorithm is $O(\sqrt{\log|A|\, T})$, which contradicts the known $\Omega(\sqrt{|A|T})$ lower bound on the regret (see Auer et al. [10]).²

² To show this contradiction, note that the condition T > N in the bound of Theorem IV.1 of Yu and Mannor [104] can be traded for an extra O(1/T) term in the regret bound. Then the said contradiction can be arrived at by letting ε and δ converge to zero such that ε/δ³ → 0.

Some parts of our work can be viewed as stochastic extensions of works that considered online shortest path problems in deterministic settings. Here, the closest to our ideas and algorithms is the paper by György et al. [47]. They implement a modified version of the Exp3 algorithm of Auer et al. [9] over all paths using dynamic programming, estimating the reward-to-go via estimates of the immediate rewards. The resulting algorithm is shown to achieve $O(\sqrt{T})$ regret that scales polynomially with the size of the problem, and it can be implemented with linear complexity. A conceptually harder version of the shortest path problem is when only the rewards of whole paths are received, and the rewards corresponding to the individual edges are not revealed. Dani et al. [28] showed that this problem is actually not harder, by proposing a generalization of Exp3 to linear bandit problems, which can be applied to this setting and which gives an expected regret of $O(\sqrt{T})$ (again, scaling polynomially with the size of the problem), improving earlier results of Awerbuch and Kleinberg [12], McMahan and Blum [76], and György et al. [47]. Bartlett et al.
[14] showed that the algorithm can be extended so that the bound holds with high probability, while Cesa-Bianchi and Lugosi [26] improved the method of [28] for bandit problems with some special "combinatorial" structure. While the above, very similar approaches (an efficient implementation of a centralized Exp3 variant) are also appealing for our problem, they cannot be applied directly: the random transitions in the MDP structure disable the dynamic-programming-based approach of György et al. [47]. Furthermore, although Dani et al. [28] suggest that their algorithm can be implemented efficiently in the MDP setting, this does not seem to be straightforward at all. This is because the application of their approach requires representing policies via the distributions (or occupancy measures) they induce over the state space, but the non-linearity of the latter dependence makes dynamic programming highly non-trivial and causes difficulties in describing linear combinations of policies. The contextual bandit setting considered by Lazaric and Munos [70] can also be regarded as a simplified version of our model, with the restriction that the states are generated in an i.i.d. fashion.

More recently, Arora et al. [3] gave an algorithm for MDPs with deterministic transitions, arbitrary reward sequences and bandit information. Following the work of Ortner [83], who studied the same problem with i.i.d. rewards, they note that following any policy in a deterministic MDP leads to periodic behavior, and thus finding an optimal policy is equivalent to finding an "optimal cycle" in the transition graph. While this optimal cycle is well defined for stationary rewards, Arora et al. observe that it can be ill-defined for non-stationary reward sequences: in particular, it is easy to construct an example where the same policy can incur an average reward of either 0 or 1, depending on the state where we start to run the policy. A meaningful goal in this setting is to compete with the best meta-policy that can run any stationary policy starting from any state as its initial state. Arora et al. give an algorithm that enjoys a regret bound of $O(T^{3/4})$ against the pool of such meta-policies.

1.4 Applications

In this section, we outline how some real-world problems fit into our framework. The common feature of the examples to be presented is that the state space of the environment is a product of two parts: a controlled part X and an uncontrolled part Y. In all of these problems, the evolution of the controlled part of the state can be modeled as a Markov decision process described by {X, A, P}, while no statistical assumptions can be made about the sequence of uncontrolled state variables $(y_t)_{t=1}^T$. We assume that interaction between the controlled and uncontrolled parts of the environment is impossible, that is, the stochastic transitions of $(x_t)_{t=1}^T$ cannot be influenced by the irregular transitions of $(y_t)_{t=1}^T$, and vice versa.

Inventory management. Consider the problem of controlling an inventory so as to maximize revenue. This is an optimal control problem, where the state of the controlled system is the stock $x_t$ and the action $a_t$ is the amount of stock ordered.
The evolution of the stock is also influenced by the demand, which is assumed to be stochastic. Further, the revenue depends on the prices at which products are bought and sold. By assumption, the prices are not available at the time when the decisions are made. Since the prices can depend on many external, often unobserved factors $y_t$, their evolution is often hard to model. We assume that the influence of our purchases on the prices is negligible. Since $y_t$ is unobserved, this problem is covered by our Settings 1b and 2b.

Controlling the engine of a hybrid car. Consider the problem of switching between the electric motor and the internal combustion engine of a hybrid car so as to optimize fuel consumption (see, e.g., [13]). In this setting, the favorable engine depends on the road conditions and the intentions of the driver: for example, driving downhill can be used to recharge the batteries of the electric motor, while picking up higher speeds under normal road conditions can be more efficient with the internal combustion engine. We assume that we can observe the partial state $x_t$ of the engines and can decide to initiate a switching procedure from one engine to the other. The execution of the switching procedure depends on the state $x_t$ of the engines in a well-understood stochastic fashion. Other parts of the engine state, $y_t$, may or may not be observed. External conditions only influence this state variable and do not interfere with $x_t$. That is, the state $y_t$ can be seen as the uncontrolled state variable influencing the rewards given for high fuel efficiency. Since $y_t$ is not entirely observed, this problem is covered by our Settings 1b and 2b.

Storage control of wind plants. Hungarian regulations require wind plants to produce schedules of their actual production on a 15-minute basis, one month in advance. If production deviates from the schedule by more than a certain margin, the producer has to pay a penalty tariff depending on the deviation. As discussed by Hartmann and Dán [52], a possible way of meeting the schedule under adverse wind conditions is to use energy storage units: the excess energy accumulated when production would exceed the schedule can be fed into the power system when wind energy stays below the desired level. In this setting, we can assume that the state $x_t$ of the energy storage unit can be captured by a possibly unknown Markovian model, while the evolution of the wind speed over time, $(y_t)_{t=1}^T$, is clearly uncontrolled. Since the schedules are fixed well in advance, their influence can be incorporated into the Markovian model as well. This way, the uncontrolled state variable only influences the rewards (or negative penalties), and thus this problem also fits into our framework. Since $y_t$ is observed, this problem is covered by our Settings 1a and 2a. The case when the exact dynamics of the energy storage unit are unknown is covered by Setting 1c.

Adaptive routing in computer networks. Consider the problem of routing in a computer network where the goal is to transfer packets from a source node $x_0$ to a designated drain node $x_L$ with minimal delay (see, e.g., [50]). The delays can be influenced by external events such as the malfunction of some of the internal nodes. These external events are captured by the uncontrolled state $y_t$. Assume that in each node $x_l^{(t)}$, we can choose the next node using some interface $a_l^{(t)} \in A$ of the network layer.
Assuming that this interface decides about the actual next state $x_{l+1}^{(t)}$ using a simple randomized algorithm, we can cast our problem as an online learning problem in SSPs, covered by our Setting 1. If the delays are only observed on the actual path traversed by the packet and the algorithm implemented by the interfaces a ∈ A is known, the problem is covered by Setting 1b. Assuming that $y_t$ is revealed by some oracle after sending each packet, our algorithms for Setting 1c can be used even if the randomized algorithms used in the network layer are unknown.

Growth-optimal portfolio selection with transaction costs. Consider the problem of constructing sequential investment strategies for financial markets, where at each time step the investor distributes its capital among d assets (see, e.g., Györfi and Walk [42]). Formally, the investor's decision at time t is to select a portfolio vector $a_t \in [0,1]^d$ such that $\sum_{i=1}^d (a_t)_i = 1$. The i-th component of $a_t$ gives the proportion of the investor's capital $N_t$ invested in asset i at time t. The evolution of the market in time is represented by the sequence of market vectors $s_1, s_2, \ldots, s_T \in [0,\infty)^d$, where the i-th component of $s_t$ gives the price of the i-th asset at time t. It is practical to define the return vector $y_t \in [0,\infty)^d$ at time t with components $(y_t)_i = (s_t)_i / (s_{t-1})_i$. Furthermore, we assume that switching between portfolios is subject to some additional cost proportional to the price of the assets being bought or sold. The goal of the investor is to maximize its capital $N_T$, or, equivalently, to maximize its average growth rate $\frac{1}{T}\log N_T$. The problem of maximizing the growth rate under transaction costs can be formalized as an online MDP where the state at time t is given by the previous portfolio vector $a_{t-1}$ and the reward given for choosing action $a_t$ in state $x_t = a_{t-1}$ is
$$ r(x_t, y_t, a_t) = \log\left(\frac{N_t - c_t}{N_t}\right) + \log\bigl(a_t^\top y_t\bigr), $$
where $c_t$ is the transaction cost arising at time t. Since the relation between $c_t$ and the state $x_t$, the action $a_t$ and the capital $N_t$ is well defined, and the rewards are influenced by the uncontrolled sequence $(y_t)_{t=1}^T$ in a transparent way, we can assume full feedback. After discretizing the space of portfolios and prices, we can directly apply learning algorithms devised for Setting 2a to construct sequential investment strategies. The problem can also be approximately modeled by Setting 3a when assuming that $c_t \le \alpha N_t$ holds for some constant $\alpha \in (0,1)$: upper bounding the regret in the online learning problem with rewards $g_t(a) = \log(a^\top y_t)$ and switching cost $K = -\log(1-\alpha)$, we obtain a crude upper bound on the regret of the resulting sequential investment strategy.

Chapter 2
Background

In this chapter, we precisely describe the learning models outlined in Chapter 1. In the first half of the chapter, we review some concepts of online learning that are relevant for our work. The rest of the chapter discusses important tools for Markov decision processes that will be useful in the later chapters.

Throughout the thesis, we will use boldface letters (such as x, a, u, ...) to denote random variables. We use $\|v\|_p$ to denote the $L_p$-norm of a function or a vector. In particular, for $p = \infty$ the maximum norm of a function $v : S \to \mathbb{R}$ is defined as $\|v\|_\infty = \sup_{s\in S} |v(s)|$, and for $0 \le p < \infty$ and any vector $v = (v_1, \ldots, v_d) \in \mathbb{R}^d$, $\|v\|_p = \bigl(\sum_{i=1}^d |v_i|^p\bigr)^{1/p}$. We will use ln to denote the natural logarithm function.
For a logical expression A, the indicator function $\mathbb{I}_{\{A\}}$ is defined as
$$ \mathbb{I}_{\{A\}} = \begin{cases} 1, & \text{if } A \text{ is true}, \\ 0, & \text{if } A \text{ is false}. \end{cases} $$

2.1 Online prediction of arbitrary sequences

The general protocol of online prediction (also called online learning) is shown in Figure 2.1.

Figure 2.1: The online prediction protocol.
Parameters: finite set of actions A, upper bound H on the rewards, feedback alphabet Σ, feedback function $f_t : [0, H]^A \to \Sigma$.
For all t = 1, 2, ..., T, repeat:
1. The environment chooses rewards $r_t(a) \in [0, H]$ for all a ∈ A.
2. The learner chooses action $a_t$.
3. The environment gives feedback $f_t(r_t, a_t) \in \Sigma$ to the learner.
4. The learner earns reward $r_t(a_t)$.

Learning algorithms for online prediction problems are often referred to as experts algorithms. Formally, an experts algorithm E with action set $A_E$ and an adversary interact in the following way: at each round t, algorithm E chooses a distribution $p_t$ over the actions $A_E$ and picks an action $a_t$ according to this distribution. The adversary selects a reward function $r_t : A_E \to [0, H]$, where H > 0 is known to the learner. Then, the adversary receives the action $a_t$ and gives a reward of $r_t(a_t)$ to E. In the most general model, E only gets to observe some feedback $f_t(r_t, a_t)$, where $f_t$ is a fixed and known mapping from reward functions and actions to some alphabet Σ. Under full-information feedback, the algorithm E receives $f_t(r_t, a_t) = r_t$, that is, it gets to observe the entire reward function. When only $f_t(r_t, a_t) = r_t(a_t)$ is sent to the algorithm, we say that the algorithm works under bandit feedback. The goal of the learner in both settings is to minimize the total expected regret defined as
$$ \widehat{L}_T = \max_{a\in A_E} \mathbb{E}\left[\sum_{t=1}^T r_t(a)\right] - \mathbb{E}\left[\sum_{t=1}^T r_t(a_t)\right], $$
where the expectation is taken over the internal randomization of the learner. It will be customary to consider two types of adversaries: an oblivious adversary chooses a fixed sequence of reward functions $r_1, \ldots, r_T$, while a non-oblivious, or adaptive, adversary is defined by T possibly random functions such that the t-th function acts on the past actions $a_1, \ldots, a_{t-1}$ and on the history of the adversary (including the previously selected reward functions, as well as any variable describing the earlier evolution of the inner state of the adversary) and returns the random reward function $r_t$.

So far we have considered experts algorithms for a given horizon and action set. However, experts algorithms in fact come as meta-algorithms that can be specialized to a given horizon and action set (i.e., the meta-algorithms take as input T and the action set $A_E$). In what follows, at the price of abusing terminology, we will call such meta-experts algorithms experts algorithms, too.

Definition 2.1. Let C be a class of adversaries and consider an experts algorithm E. The function $B_E : \{1, 2, \ldots\} \times \{1, 2, \ldots\} \to [0, \infty)$ is said to be a regret bound for E against C if the following hold: (i) $B_E$ is non-decreasing in its second argument; and (ii) for any adversary from C, time horizon T and action set $A_E$, the inequality
$$ \widehat{L}_T = \max_{a\in A_E} \mathbb{E}\left[\sum_{t=1}^T r_t(a) - \sum_{t=1}^T r_t(a_t)\right] \le B_E(T, |A_E|) \qquad (2.1) $$
holds true, where $(a_1, r_1), \ldots, (a_T, r_T)$ is the sequence of actions and reward functions that arise from the interaction of E and the adversary when E is initialized with T and $A_E$.

Note that both terms of the regret involve the same sequence of rewards.
Although this may look unnatural when the opponent is not oblivious, the definition is standard and will be useful in this form for our purposes.¹ In what follows, to simplify the notation, we will also use $B_E(T, A_E)$ to mean $B_E(T, |A_E|)$. As usual, for the regret bound to make sense it has to be a sublinear function of T. Some algorithms do not need T as their input (all algorithms need to know the action set, of course). Such algorithms are called universal, whereas in the opposite case the algorithm is called non-universal. Oftentimes, a non-universal method can be turned into a universal one, at the price of deteriorating the bound $B_E$ by at most a constant factor, by either adaptively changing the algorithm's internal parameters or by resorting to the so-called "doubling trick" (cf. Exercises 2.8 and 2.9 in [25]).

¹ In Chapters 3 and 4, we decompose the online MDP problem into smaller online prediction problems where the reward sequences are generated by a non-oblivious adversary. The global decision problem then reduces to the problem of controlling the above notion of regret against this non-oblivious adversary.

Typically, algorithms are developed for the case when the adversaries generate rewards belonging to the [0, 1] interval. Given such an algorithm E, if the adversaries generate rewards in [0, H] instead of [0, 1] for some H > 0, one can still use E such that in each round, the reward that is fed to E is first divided by H. We shall refer to such a "composite" algorithm as E tuned to the maximum reward H. Clearly, if E enjoyed a regret bound $B_E$ against the class of all oblivious (respectively, adaptive) adversaries with rewards in [0, 1], then E tuned to the maximum reward H will enjoy the regret bound $H B_E$ against the class of all oblivious (respectively, adaptive) adversaries that generate rewards in [0, H]. Obviously, this tuning requires prior knowledge of H.

Note that typical full-information experts algorithms enjoy the same regret bounds for oblivious and non-oblivious adversaries. The reason for this is that if the distribution $p_t$ selected by the experts algorithm E is fully determined by the previous reward functions $r_1, \ldots, r_{t-1}$, then any regret bound that holds against an oblivious adversary also holds against a non-oblivious one by Lemma 4.1 of Cesa-Bianchi and Lugosi [25]. To our knowledge, all best-experts algorithms that enjoy a regret bound of optimal order against an adaptive adversary satisfy this condition on $p_t$.

An optimized best-experts algorithm in the full-information case is an algorithm whose expected regret can be bounded by $B_{OE}(T, A) = O(\sqrt{T \ln |A|})$, and similarly, an optimized |A|-armed bandit algorithm is one with a bound $B_{OB}(T, A) = O(\sqrt{T |A|})$ on its expected regret (in the case of bandit algorithms, actions are also called arms). Optimized best-experts algorithms include the exponentially weighted average forecaster (EWA) (a variant of the weighted majority algorithm of Littlestone and Warmuth [72] and of the aggregating strategies of Vovk [96], also known as Hedge by Freund and Schapire [36]) and the Follow-the-Perturbed-Leader (FPL) algorithm of Kalai and Vempala [63] (see also Hannan [51]). There exist a number of algorithms for the bandit case that attain regret of $O(\sqrt{T |A| \ln |A|})$, such as Exp3 by Auer et al. [10] and Green by Allenberg et al. [2], while the algorithm presented by Audibert and Bubeck [4] achieves the optimal rate $O(\sqrt{T |A|})$.
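To illustrate the kind of bandit algorithm referred to above, here is a minimal sketch of an Exp3-style forecaster in its reward-based form, assuming rewards in [0, 1] and a fixed exploration parameter gamma; tuning and refinements are discussed in the cited papers, and the function `get_reward` is a placeholder for the bandit feedback.

```python
import math
import random

def exp3(K, T, gamma, get_reward):
    """Sketch of an Exp3-style bandit forecaster for K arms and horizon T."""
    weights = [1.0] * K
    for t in range(1, T + 1):
        total = sum(weights)
        # Mix the exponential weights with uniform exploration.
        probs = [(1 - gamma) * w / total + gamma / K for w in weights]
        arm = random.choices(range(K), weights=probs)[0]
        reward = get_reward(t, arm)       # only r_t(a_t) is observed
        # Importance-weighted (unbiased) estimate of the reward of the chosen arm.
        estimate = reward / probs[arm]
        weights[arm] *= math.exp(gamma * estimate / K)
    return weights
```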
Although these papers prove the regret bound (2.1) only in the case of oblivious adversaries, that is, when the actions taken by the algorithm have no effect on the next rewards chosen by the adversary, it is not hard to see that the bounds proved in these papers continue to hold in the non-oblivious case, too. This is because the only non-algebraic step in the regret bound proofs uses the fact that the reward estimates constructed by the respective algorithms are unbiased estimates of the actual rewards, and this property continues to hold true even in the non-oblivious setting. In addition to the above algorithms, Poland [84] describes Follow-the-Perturbed-Leader-type algorithms that also satisfy these requirements.

All the algorithms discussed above were developed to deal with rewards in the [0, 1] interval. For rewards in [0, H], using the tuning method described above would yield a regret bound that scales linearly with H. While in general this dependence is inevitable, there are some specific cases where the problem structure can be exploited to obtain much tighter bounds. For a more careful treatment of this scaling issue, see [68].

2.2 Stochastic multi-armed bandits

A closely related line of work concerns multi-armed bandit problems where the environment generates rewards in an i.i.d. fashion, that is, it is assumed that for a fixed action a, the random rewards $r_t(a)$ are drawn from a fixed distribution d(a) with mean m(a). The goal of the learner in this setting is to select its actions $a_1, a_2, \ldots, a_T$ so as to minimize the pseudo-regret
$$ \tilde{L}_T = T \max_a m(a) - \sum_{t=1}^T m(a_t) = \sum_{t=1}^T \Bigl( \max_a m(a) - m(a_t) \Bigr). $$
Obviously, this learning problem is significantly easier than the case when the i.i.d. assumption is dropped: the lower bound on the regret for the bandit case² has been known since the seminal work of Lai and Robbins [69]. The distribution $d^*$ with the largest mean $m^*$ and the distribution $d_*$ with the second largest mean $m_*$ play an important role in characterizing the complexity of the problem: for any action selection strategy, the regret satisfies
$$ \tilde{L}_T \ge c\, \frac{\ln T}{\mathrm{KL}(d_* \,\|\, d^*)}, $$
where $\mathrm{KL}(d_* \| d^*)$ denotes the Kullback–Leibler divergence between $d_*$ and $d^*$, and c > 0 is some universal constant.³

² Clearly, regret minimization under full feedback is a trivial problem under these assumptions.
³ Note that this lower bound depends on the specific problem instance that the learner has to face. The best problem-independent lower bound on the regret is of order $\sqrt{T|A|}$, which is exactly the order of the best known upper bound on the regret in the non-i.i.d. case.

Since then, a number of efficient algorithms have been proposed that attain this optimal regret; see, e.g., [37, 73, 65]. A large family of methods implements the principle of "optimism in the face of uncertainty" (OFU) by computing an upper confidence bound for each action a ∈ A and selecting the action with the maximal upper bound. Using the notation $K_t(a) = \sum_{s=1}^{t-1} \mathbb{I}_{\{a_s = a\}}$, the upper confidence bound for action a is defined as
$$ \hat{r}_t(a) = \frac{\sum_{s=1}^{t-1} r_s(a)\,\mathbb{I}_{\{a_s = a\}}}{K_t(a)} + B_t\bigl(K_t(a)\bigr), $$
where the function $B_t : \mathbb{N} \to \mathbb{R}_+$ is chosen carefully such that the probability $\mathbb{P}[m(a) \ge \hat{r}_t(a)]$ vanishes quickly as $K_t(a)$ grows. For example, the seminal algorithm of Auer et al. [8], UCB1, uses $B_t(K_t) = \sqrt{\frac{2\ln t}{K_t}}$. In other words, the upper confidence bound $\hat{r}_t(a)$ is chosen to strategically overestimate the mean m(a). Selecting $a_t = \arg\max_a \hat{r}_t(a)$ is then equivalent to setting up confidence intervals on the means and acting according to the most optimistic model in this confidence set.
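A minimal sketch of the UCB1 index rule just described (assuming i.i.d. rewards in [0, 1]; `pull` is a placeholder for drawing a reward of the chosen arm):

```python
import math

def ucb1(K, T, pull):
    """Sketch of UCB1: play each arm once, then pick the arm maximizing
    the empirical mean plus the exploration bonus sqrt(2 ln t / K_t(a))."""
    counts = [0] * K          # K_t(a): number of pulls of each arm so far
    sums = [0.0] * K          # cumulative reward of each arm
    for t in range(1, T + 1):
        if t <= K:
            arm = t - 1       # initialization: pull every arm once
        else:
            indices = [sums[a] / counts[a]
                       + math.sqrt(2 * math.log(t) / counts[a])
                       for a in range(K)]
            arm = max(range(K), key=lambda a: indices[a])
        reward = pull(arm)    # i.i.d. reward draw, placeholder
        counts[arm] += 1
        sums[arm] += reward
    return sums, counts
```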
This idea of jointly selecting a model and an action in an optimistic fashion can be successfully applied to more complex action sets A. Some examples from the huge body of literature on optimistic algorithms are applications to stopping problems [77], bandit problems [7, 8, 29], variants of the pick-the-winner problem [34, 74, 77], and active learning [34, 22]. The principle has also been used to provide regret bounds for reinforcement learning problems by Auer and Ortner [11], Jaksch et al. [59] and Bartlett and Tewari [15]. The immense popularity of the optimistic approach is due to two factors: (i) it allows a relatively simple, clean and easy-to-interpret theoretical analysis of the resulting algorithms, yet (ii) optimistic methods have excellent empirical performance in a number of notoriously difficult-to-solve practical problems. For example, one of the greatest success stories of the online learning community is that of the optimistic tree-search algorithm underlying all the current top 10 computer Go players, UCT, first proposed by Kocsis and Szepesvári [67].

Despite their success in a number of practical problems, optimistic algorithms tend to be overly aggressive when the process generating the rewards does not conform to any stochastic assumptions, that is, in the case discussed in the previous section. In such cases, optimistic algorithms discard actions with initially low rewards too quickly to be able to discover possible higher rewards associated with the same actions in later stages of the learning process. On the other hand, universal bandit algorithms such as Exp3 by Auer et al. [10] and Green by Allenberg et al. [2] can be overly conservative when the reward sequences have nice statistical properties.

2.3 Markov decision processes

In this section, we formalize the two classes of finite Markov decision processes that we described in Chapter 1: loop-free stochastic shortest path problems (Settings 1a–1c) and unichain MDPs (Setting 2b). A finite MDP is formally defined by the tuple M = (X, A, P, r), where X, the state space, is a finite set of states; A, the action space, is a finite set of actions; $P : X \times X \times A \to [0,1]$ is the transition function describing the dynamics of the MDP; and $r : X \times A \to [0,1]$, the reward function, allocates rewards in [0, 1] to states and actions. An MDP with no reward specified will be referred to as an MDP\r, which is defined by the tuple (X, A, P).

The literature distinguishes between two main types of finite MDPs: episodic and continuing environments. While the first notion can be used to formulate control tasks that can be naturally broken down into a sequence of individual but similar tasks, the second notion stands closer to the traditional formulation of control problems where a controller has to govern the behavior of a plant over a possibly infinite time horizon. The two specific classes we consider are chosen to represent one of these types each. The assumptions on the transition function P and the precise interaction between the learner and the environment in M are described below for each model class.

In order to define our performance criteria, we need the definition of stochastic stationary policies. A stochastic stationary policy is a mapping $\pi : X \times A \to [0,1]$ from the set of admissible state–action pairs to the [0, 1] interval such that $\sum_{a\in A} \pi(x, a) = 1$ for every x ∈ X.
Following a policy π means that upon reaching state x ∈ X, the next action a is chosen according to the distribution π(x, ·), that is, a ∼ π(x, ·). To emphasize that π(x, ·) specifies a probability distribution over the actions for each state x ∈ X, in what follows we will use π(a|x) to denote π(x, a). We use e 1 , . . . , e d to denote the row vectors in the canonical basis of the Euclidean space Rd . Since we will identify the state space X with the integers {1, . . . , |X|}, we will also use the notation e x for x ∈ X. A convention that we will use is that the symbols x, x 0 , . . . will be reserved to denote a state in X, while a, a 0 , b will be reserved to denote an action in A. In expressions involving sums over X, the domain of x, x 0 , . . . will thus be suppressed to avoid clutter. The same holds for sums involving actions. 2.3.1 Loop-free stochastic shortest path problems Loop-free stochastic shortest path environments (or in short, SSPs) are episodic MDPs where transitions are only possible in a “forward” manner. An episodic MDP is an MDP with a few special states, the starting state and some terminal states: In an episodic MDP, the process is segmented into episodes. Each episode starts from the designated starting state and ends when it reaches a terminal state. In an episodic MDP the goal of the agent is to maximize the total expected reward collected in an episode. By switching from rewards to costs, the goal is turned into minimizing the total expected cost. In this case the problem of finding a minimizing policy is also called the stochastic shortest path (SSP) problem. By slightly abusing terminology, we will also call the problem of maximizing the total reward an SSP problem. (Note that the same terminology is used in Bertsekas and Tsitsiklis [17].) When, in addition, the state space has a layered structure with respect to the transitions, we get the so-called loop-free variant of the SSP problem. That the state space has a layered structure means that X = ∪Ll=0 Xl , where Xl is called the l th layer of the state space, Xl ∩ Xk = ; for all l 6= k, and the agent can only move between consecutive layers. In particular, each episode starts at layer 0, from state x 0 , and ends when it reaches any state x L belonging to the last layer XL (see Figure 2.2). Further, for any x ∈ Xl and a ∈ A, P (y|x, a) = 0 if y 6∈ Xl +1 , l = 0, . . . , L − 1. This assumption is equivalent to assuming that each path in the graph is of equal length.4 For any state x ∈ X we will use l x to denote the index of the layer x belongs to, that is, l x = l if x ∈ Xl . The protocol of traditional fixed-reward SSPs is shown on Figure 2.3, Figure 2.2 shows a representation of an example of an SSP. The feedback function is f t (r, x, a) = r in the fullinformation case and f t (r, x, a) = r (x, a) in the bandit case. The transition function P can be either known or unknown to the learner, depending on the nature of the problem. The assumption that we make on the transition function of SSPs are summarized for later reference below. 4 Note that all loop-free state spaces can be transformed to one that satisfies our assumptions with no significant increase in the size of the problem. A simple transformation algorithm is given in Appendix A of György et al. [47]. 20 X0 X1 X2 X3 X4 X0 X1 X2 X3 X4 l=0 l=1 l=2 l=3 l=4 l=0 l=1 l=2 l=3 l=4 a1 a2 Figure 2.2: An example of a stochastic shortest path problem when two actions a 1 (“up”) and a 2 (“down”) are available in all states. 
Nonzero transition probabilities under each action are indicated with arrows between circles representing states. In the case of two successor states, the successor states with the intended direction with larger probabilities are connected with solid arrows, dashed arrows indicate less probable transitions. Parameters: state space X, action space A, feedback alphabet Σ, feedback function f t : [0, H ]A → Σ. For all episodes t = 1, 2, . . . , T , repeat ) 1. Set x(t 0 = x 0 . For all layer indices l = 0, 1, . . . , L − 1, repeat (a) The learner observes xl(t ) and chooses al(t ) . (b) The environment gives feedback f t (r, xl(t ) , al(t ) ) ∈ Σ to the learner. (c) The learner earns reward r (xl(t ) , al(t ) ). ) (d) The environment draws the next state as xl(t+1 = P (·|xl(t ) , al(t ) ). Figure 2.3: The protocol of stochastic shortest path problems. Assumption S1. The state space be decomposed into L layers: X = X0 ∪ X1 ∪ · · · ∪ XL where Xi ∩ X j = ; if i 6= j . The transition function is such that transitions are only possible between consecutive layers, that is, there only exists a ∈ A such that P (x|y, a) > 0 if x ∈ Xl and y ∈ Xl +1 , l = 0, 1, . . . , L − 1. The first and last layers are singleton layers, that is, X0 = {x 0 } and XL = {x L }. Value functions are useful tools for addressing learning problems in an SSP. Fix a policy π and let (x0 , a0 ), (x1 , a1 ), . . . , (xL−1 , aL−1 ) be a random trajectory generated by following π. The value function, v π : X → R, and the action-value function, q tπ : X × A → R, of policy π are defined 21 for an SSP with transition function P and reward function r , respectively, by " π v (x) = E L−1 X k=l x π " q (x, a) = E L−1 X k=l x # ¯ ¯ r (xk , ak )¯xl = x, π, P , x ∈ X, # ¯ ¯ r (xk , ak )¯xl = x, al = a, π, P , (x, a) ∈ X × A, where we used the notation E [ ·| π, P ] to emphasize that the random trajectories are generated by following policy π and the transition model P . The function-pair (v π , q π ) is known to be the unique solution to the Bellman equations (see, e.g., Puterman [86]): q π (x, a) = r (x, a) + X P (x 0 |x, a)v π (x 0 ), (x, a) ∈ X × A; x0 v π (x) = X π(a|x)q π (x, a), a x ∈ X \ {x L }; (2.2) v π (x L ) = 0. A policy π is deterministic when for each state x ∈ X, π(·|x) is concentrated on a single state. For a deterministic policy π, we will use π(x) to denote the action for which π(a|x) = 1. Theorem 4.4.2 of [86] shows that there exists a stationary and deterministic policy π∗ satisfying v π = max v π (x 0 ). ∗ π Such a policy is called an optimal policy in the SSP M . Further, define the occupation measure, µπ : X → R, by ¯ # ¯ ¯ £ ¤ ¯ µπ (x) = E I{xl =x} ¯ P, π = P xl x = x ¯ P, π , ¯ l =0 " L X x ∈ X. Note that the restriction of µπ to any layer Xl defines a probability mass function. It is easy to see that one can compute the occupation measure µπ in a recursive fashion for any x ∈ X \ {x 0 } as µπ (x) = µπ (x 0 )π(a 0 |x 0 )P (x|x 0 , a 0 ), X (2.3) x 0 ∈Xl −1 x (x 0 ,a 0 )∈X×A where µπ (x 0 ) = 1. For some of the results presented in the thesis, we will need the following additional assumption on the SSPs underlying the considered decision problems: Assumption S2. For any policy π, the occupation measure at state x ∈ X is bounded from below as µπ (x) ≥ α(x). The minimal occupation measure is defined as def α = min min µπ (x). π x∈X 22 Parameters: state space X, action space A, initial state distribution µ1 , feedback alphabet Σ, feedback function f t : [0, H ]A → Σ. 
Initialization: The environment draws the initial state as x1 ∼ µ1.
For all time steps t = 1, 2, . . . , T, repeat
1. The learner observes xt and chooses at.
2. The environment gives feedback f t(r, xt, at) ∈ Σ to the learner.
3. The learner earns reward r(xt, at).
4. The environment draws the next state as xt+1 ∼ P(·|xt, at).
Figure 2.4: The protocol of continuing Markov decision processes.

2.3.2 Unichain MDPs

Unichain MDPs form a subclass of continuing Markov decision processes that is widely used in the reinforcement learning literature. Roughly speaking, the unichain assumption ensures that for any pair of states x, y ∈ X, it is possible to find a policy that takes the learner from state x to state y in finite time, thus making it possible to fix any mistakes made by the learning algorithm in early stages of the learning procedure. The interaction between the learner and the environment is shown in Figure 2.4. Once again, the feedback function is f t(r, x, a) = r in the full-information case and f t(r, x, a) = r(x, a) in the bandit case; the transition function P can be either known or unknown to the learner, depending on the nature of the learning problem.

Before describing our assumptions, a few more definitions are needed. Without loss of generality, we shall identify the states with the first |X| integers and assume that X = {1, 2, . . . , |X|}. Now, take a policy π and define the Markov kernel $P^\pi(x'|x) = \sum_a \pi(a|x) P(x'|x, a)$. The identification of X with the first |X| integers makes it possible to view $P^\pi$ as a matrix: $(P^\pi)_{x,x'} = P^\pi(x'|x)$. In what follows, we will also take this view when convenient. In general, distributions will also be treated as row vectors. Hence, for a distribution µ, $\mu P^\pi$ is the distribution over X that results from using policy π for one step after a state is sampled from µ (i.e., the "next-state distribution" under π). Finally, remember that the stationary distribution of a policy π is a distribution $\mu_{st}$ that satisfies $\mu_{st} P^\pi = \mu_{st}$. We assume that every (stochastic stationary) policy π has a well-defined unique stationary distribution $\mu^\pi_{st}$. This assumption, usually referred to as the unichain assumption, ensures that the average reward underlying any stationary policy is a well-defined single real number. It is well known that in this case the convergence to the stationary distribution is exponentially fast. Following Even-Dar et al. [33], we consider the following stronger "uniform mixing condition" (which implies the existence of the stationary distributions):

Assumption M1. There exists a number τ > 0 such that for any two arbitrary distributions µ and µ′ over X,
$$\sup_\pi \bigl\|(\mu - \mu') P^\pi\bigr\|_1 \le e^{-1/\tau} \|\mu - \mu'\|_1. \qquad (2.4)$$

As in Even-Dar et al. [33], we call the smallest τ such that (2.4) holds the mixing time of the transition probability kernel P. The next assumption ensures that every state is visited eventually no matter what policy is chosen:

Assumption M2. The stationary distributions are uniformly bounded away from zero:
$$\inf_{\pi, x} \mu^\pi_{st}(x) \ge \alpha' > 0$$
for some α′ ∈ ℝ.

Note that $e^{-1/\tau}$ is the supremum over all policies π of the Markov-Dobrushin coefficient of [58].
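As a quick numerical illustration (not part of the thesis), the contraction factor in (2.4) can be checked for a single made-up kernel $P^\pi$ by evaluating the ratio on pairs of point-mass distributions, where the supremum for a fixed π is attained; Assumption M1 itself additionally takes the supremum over all policies, and by the remark below it suffices to check deterministic ones.

```python
import numpy as np
from itertools import combinations

# Toy check of the mixing condition (2.4) for one fixed, made-up kernel P^pi.
rng = np.random.default_rng(1)
n = 4
P_pi = rng.dirichlet(np.ones(n), size=n)     # row x is P^pi(.|x)

coef = max(
    0.5 * np.abs(P_pi[x] - P_pi[y]).sum()    # ||(e_x - e_y) P^pi||_1 / ||e_x - e_y||_1
    for x, y in combinations(range(n), 2)
)
tau = -1.0 / np.log(coef)                    # the tau for which coef = e^{-1/tau}
print(f"contraction = {coef:.3f}, mixing time = {tau:.3f}")
```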
It is also known that the ergodicity coefficient, defined as
$$m_{P^\pi} = \sup_{\mu \ne \mu'} \frac{\|(\mu - \mu') P^\pi\|_1}{\|\mu - \mu'\|_1}$$
for the transition probability kernel $P^\pi$, satisfies
$$m_{P^\pi} = 1 - \min_{x, x' \in X} \sum_{y \in X} \min\bigl\{P^\pi(y|x),\, P^\pi(y|x')\bigr\}$$
(see, e.g., [58]), which implies that Assumption M1 is satisfied, that is, $m_{P^\pi} < 1$ for every π, if and only if $P^\pi$ is a scrambling matrix for every π ($P^\pi$ is a scrambling matrix if any two rows of $P^\pi$ share some column in which they both have a positive element). Since $m_{P^\pi}$ is a continuous function of π and the set of policies is compact, there is a policy $\pi_0$ with $m_{P^{\pi_0}} = \sup_\pi m_{P^\pi}$. Furthermore, if $P^\pi$ is a scrambling matrix for any deterministic policy π then it is also a scrambling matrix for any stochastic policy. Thus, to guarantee Assumption M1 it is enough to verify mixing for deterministic policies only.

The concept of value functions will be once again useful when addressing learning problems in unichain MDPs. Fix an arbitrary policy π and t ≥ 1. Let $\{(x'_s, a'_s)\}$ be a random trajectory generated by π and the transition probability kernel P. We will use $q^\pi$ to denote the action-value function underlying π and the reward function r, while we will use $v^\pi$ to denote the corresponding value function.⁵ That is, for (x, a) ∈ X × A,
$$q^\pi(x, a) = \mathbb{E}\left[\left.\sum_{s=1}^{\infty} \bigl\{r(x'_s, a'_s) - \rho^\pi\bigr\} \,\right|\, x'_1 = x,\ a'_1 = a,\ \pi,\ P\right],$$
$$v^\pi(x) = \mathbb{E}\left[\left.\sum_{s=1}^{\infty} \bigl\{r(x'_s, a'_s) - \rho^\pi\bigr\} \,\right|\, x'_1 = x,\ \pi,\ P\right],$$
where $\rho^\pi$ is the long-term average reward corresponding to π:
$$\rho^\pi = \lim_{S \to \infty} \frac{1}{S} \sum_{s=1}^{S} \mathbb{E}\bigl[r(x'_s, a'_s)\bigr].$$
The long-term average reward can be expressed as
$$\rho^\pi = \sum_x \mu^\pi_{st}(x) \sum_a \pi(a|x)\, r(x, a), \qquad (2.5)$$
where $\mu^\pi_{st}$ is the stationary distribution underlying policy π. The value functions $q^\pi$, $v^\pi$ are equivalently given by the Bellman equations
$$q^\pi(x, a) = r(x, a) - \rho^\pi + \sum_{x'} P(x'|x, a)\, v^\pi(x'), \qquad v^\pi(x) = \sum_a \pi(a|x)\, q^\pi(x, a), \qquad (2.6)$$
which hold simultaneously for all (x, a) ∈ X × A.

⁵ Most sources would call these functions differential action- and state-value functions. We omit this adjective for brevity.

Chapter 3
Online learning in known stochastic shortest path problems

This chapter discusses the problem of learning in loop-free episodic environments with non-stationary rewards. Throughout this chapter, we suppose that the MDP\r S = {X, A, P} satisfies Assumption S1 and the transition function P is known to the learner. After defining the learning problem in Sections 3.1 and 3.2, we present and analyze two families of algorithms: the first set of methods, presented in Sections 3.3 and 3.4, use black-box bandit algorithms in each state x ∈ X, while in Section 3.5 we open up these black boxes and directly analyze the special case when Exp3 is used to select actions in each state. We provide upper bounds on various notions of regret for both types of algorithms and notice that directly analyzing the methods based on Exp3 leads to a significant improvement in the bounds. Furthermore, we see that the latter approach helps in providing high probability bounds on the regret and relaxing some assumptions when only expected regret is concerned. Parts of this work were published in [78] and [80]. The results of this chapter are summarized by the following thesis:

Thesis 1. Proposed a family of efficient algorithms for online learning in known stochastic shortest path problems. Proved performance guarantees for Settings 1a and 1b. The proved bounds are optimal in terms of the dependence on the number of episodes.
[78, 80] 3.1 Problem setup In this section, we define the online version of the stochastic shortest path problem, based on the definition of the original SSP problem previously presented in Section 2.3.1. In the online version, referred to as the online SSP or O-SSP problem, the reward function is allowed to change between episodes, that is, instead of a single reward function r , at every episode a new reward function, r t : X × A → [0, 1], is used. Accordingly, the difference from the protocol described on Figure 2.3 is that r t takes the place of r in t -th episode. Note that the constraint 26 that r t depends only on the current state and action is assumed only for simplicity: the results of the chapter can easily be extended to the situation where r t is allowed to depend on the next state as well (i.e., when the reward function is of the form r t (x l , a l , x l +1 )). The goal is changed to maximizing the expected sum of the rewards received over the episodes (instead of considering the reward per episode, which is ill-defined when the reward function changes from episode to episode). We will assume that the agent knows the transition probabilities, but it does not know in advance the sequence (r t )Tt=1 . The agent can, however, learn about r t through interaction. In particular, the agent learns about reward r t (xl(t ) , al(t ) ) in time step l of episode t (after action al(t ) has actually been performed), where xl(t ) is the state ) visited and a(t is the action chosen at that time (in the easier, “full-information” version of the l problem, the agent learns the whole reward function r t ). As in the online prediction problem presented in Section 2.1, the sequence, (r t )t =1,2,... , does not have to satisfy any stochastic assumptions; we assume only that it is chosen in advance (possibly, in an adversarial manner). The performance of the agent at the end of episode T will be measured by how much reward it lost on expectation as compared to using the best fixed stationary policy in hindsight. In order to formally define this quantity, let ¯ # ¯ ¯ RbT = E r t (xl(t ) , al(t ) )¯ P , ¯ t =1 l =0 " T L−1 X X denote the total expected reward of the learner and ¯ # ¯ π 0 0 ¯ RT = E r t (xl , al )¯ P, π ¯ t =1 l =0 " T L−1 X X stand for the total expected reward of the policy π. Then, the performance difference described above, also known as the total expected regret, can be written using b T = sup R π − RbT . L T π 3.2 A decomposition of the regret In this section, we introduce some more notation to be able to treat the O-SSP problem more effectively. In particular, we slightly change our point of view on how the learner acts in the environment by assuming that the learner picks its actions from a stochastic policy πt fixed at the beginning of the episode. Using this formalism enables us to prove the key result of this chapter that allows for a simple decomposition of the learning problem into smaller individual learning problems. Let ³ ´ ) (t ) (t ) (t ) (t ) (t ) (t ) (t ) (t ) (t ) (t ) (t ) (t ) ut = x(t 0 , a0 , r t (x0 , a0 ), x1 , a1 , r t (x1 , a1 ), . . . , xL−1 , aL−1 , r t (xL−1 , aL−1 ), xL denote the information that becomes available to the learning algorithm by the end of episode 27 t . The information available to the algorithm at time l of episode t is the information available from previous episodes, def Ut −1 = (u1 , u2 , . . . 
, ut −1 ) , (with the understanding that U0 is the empty sequence) and also the information from the current episode: ´ ³ ) (t ) (t ) (t ) (t ) (t ) (t ) (t ) (t ) (t ) (t ) (t ) ut (l − 1) = x(t , a , r (x , a ), x , a , r (x , a ), . . . , x , a , r (x , a ) . t 0 t 1 0 0 0 1 1 1 l −1 l −1 t l −1 l −1 As mentioned above, we shall only deal with algorithms that for each episode follow a policy they come up with just before the episode starts. This means that before episode t begins all our algorithms will compute the policy πt to be followed in that episode based on Ut −1 . Thus, for any t ≥ 1 and 0 ≤ l ≤ L − 1, (x, a) ∈ X × A, ¯ ¯ h i h i ¯ (t ) ¯ (t ) ) (t ) P a(t = a x = x, U , u (l − 1) = P a = a x = x, U , = πt (a|x) . ¯ ¯ t −1 t t −1 l l l l This is not restrictive since for large t the information “thrown away” is negligible. On the other hand, if r t is chosen such that its components are correlated (e.g., r t (x 1 , a 1 ) = r t (x 2 , a 2 ) for all t for some (x 1 , a 1 ), (x 2 , a 2 ) ∈ X × A, (x 1 , a 1 ) 6= (x 2 , a 2 )) then we certainly may lose a lot with this assumption, as the “effective” size of the state space becomes smaller. However, such a situation should rather be modelled in the known dynamics part by reducing the state space accordingly, or model state similarity. In what follows, we will make use of the value functions corresponding to the reward functions r t . In particular, the value function v tπ and the action-value function q tπ are defined as the solution of the Bellman equations (2.2) with the role of r taken by r t . We also define the cumulative action-value, Q tπ : X × A → R, and the cumulative value function, Vtπ : X → R by the respective equations, Q tπ = t X q sπ and Vtπ = s=1 t X v sπ . s=1 With the help of value functions, the total expected reward, a.k.a. the total return of policy π for a period of T > 0 episodes can be written as RbTπ = T X t =1 v tπ (x 0 ) = VTπ (x 0 ). Thus, the total expected reward of the best policy in hindsight is given by R T∗ = sup T X π t =1 v tπ (x 0 ) = sup VTπ (x 0 ). π Note that by the linearity of the Bellman equations, VTπ corresponds to the value function comP puted with reward function Tt=1 r t . Thus, the finding the above optimal value is equivalent to finding the optimal policy in a traditional SSP. As noted in Section 2.3.1, there always exists a deterministic policy attaining this supremum. In what follows, we will simply refer to 28 π∗ def any deterministic policy satisfying VT T (x 0 ) = VT∗ (x 0 ) = maxπ VTπ (x 0 ) as an optimal policy.1 The def occupation measure generated by π∗T will be denoted by µ∗T = µπ∗T . To further simplify presentation, we introduce the following abbreviations: π vt = v t t , π qt = q t t , Vt = t X s=1 vt , Qt = t X qt , µt = µπt . s=1 With this notation, the total expected reward in the first T episodes becomes PT t =1 E [vt (x 0 )] = E [VT (x 0 )], and the regret can be written as b T = R ∗ − Rbπ = V ∗ (x 0 ) − E [VT (x 0 )] . L T T T The following performance difference lemma is the key to our main results. Note that a similar argument is used by Even-Dar et al. [33] to prove their main result about online learning in unichain MDPs in the full-information case (cf. Lemma 4.1). The benefit of this lemma is that the problem of bounding the regret is essentially reduced to the problem of bounding the difference between action-values of the policy followed by the agent. 
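Since the algorithms of this chapter repeatedly solve the Bellman equations (2.2) for the current policy and the current reward function, the computation can be made concrete with the following hedged sketch: $(v^\pi, q^\pi)$ are obtained by backward induction over the layers of the loop-free SSP. The argument names and array shapes (`layers`, `P`, `r`, `pi`) are illustrative and not taken from the thesis.

```python
import numpy as np

# Hedged sketch: solve the Bellman equations (2.2) of a loop-free SSP by
# backward induction over the layers. Illustrative shapes:
#   layers   - list of lists of state indices, layers[0] = [x_0], layers[-1] = [x_L]
#   P[x,a,y] = P(y|x,a),  r[x,a] in [0,1],  pi[x,a] = pi(a|x)
def policy_evaluation(layers, P, r, pi):
    n_states, n_actions = r.shape
    v = np.zeros(n_states)                  # v(x_L) = 0 by convention
    q = np.zeros((n_states, n_actions))
    for layer in reversed(layers[:-1]):     # layers L-1, ..., 0
        for x in layer:
            q[x] = r[x] + P[x] @ v          # q(x,a) = r(x,a) + sum_y P(y|x,a) v(y)
            v[x] = pi[x] @ q[x]             # v(x)   = sum_a pi(a|x) q(x,a)
    return v, q
```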
Similar statements also appeared much earlier in the control literature: The book by Cao [21], which puts performance difference statements in the center of the theory of MDPs, reviews many of them. Lemma 3.1 (Performance difference lemma for SSPs). For any deterministic policy π, arbitrary policy π̂ and any t ≥ 1, v tπ (x 0 ) − v tπ̂ (x 0 ) = L−1 X X l =0 x∈Xl ³ ´ µπ (x) q tπ̂ (x, π(x)) − v tπ̂ (x) . (3.1) In particular, for any T > 0, VT∗ (x 0 ) − VT (x 0 ) = L−1 X X l =0 x∈Xl ³ ´ µ∗T (x) QT (x, π∗T (x)) − VT (x) , where µ∗T is the occupation measure underlying the best policy π∗T in hindsight. Proof. The second part of the lemma follows from applying (3.1) to π = π∗T and π̂ = πt for t = 1, . . . , T , and summing up the resulting equalities. Hence, it remains to prove (3.1). Repeatedly 1 The existence of this maximum is a standard result of reinforcement learning, see [86]. 29 using the Bellman equations and reordering, we get v tπ (x 0 ) − v tπ̂ (x 0 ) = v tπ (x 0 ) − q tπ̂ (x 0 , π(x 0 )) + q tπ̂ (x 0 , π(x 0 )) − v tπ̂ (x 0 ) ³ ´ ³ ´ X = q tπ̂ (x 0 , π(x 0 )) − v tπ̂ (x 0 ) + P (x 1 |x 0 , π(x 0 )) v tπ (x 1 ) − v tπ̂ (x 1 ) x 1 ∈X1 ³ ´ ³ ´ X = µπ (x 0 ) q tπ̂ (x 0 , π(x 0 )) − v tπ̂ (x 0 ) + µπ (x 1 ) v tπ (x 1 ) − v tπ̂ (x 1 ) x 1 ∈X1 = µπ (x 0 ) ³ q tπ̂ (x 0 , π(x 0 )) − v tπ̂ (x 0 ) à + X x 1 ∈X1 = ··· = L−1 X µπ (x 1 ) X l =0 x l ∈Xl ³ ´ q tπ̂ (x 1 , π(x 1 )) − v tπ̂ (x 1 ) ´ + X x 2 ∈X2 P (x 2 |x 1 , π(x 1 ))µπ (x 1 ) ³ v tπ (x 2 ) − v tπ̂ (x 2 ) ´ ! ³ ´ µπ (x l ) q tπ̂ (x l , π(x l )) − v tπ̂ (x l ) , which proves the statement. Remark 3.1. Note that the lemma can be generalized easily to allow π to be a non-deterministic policy. 3.3 Full Information O-SSP In this short section we give an algorithm for the full-information case and a proof that bounds the algorithm’s regret, the purpose being to fix some ideas that will be useful later on, as well as to obtain a baseline result for loop-free SSPs. The algorithm presented in this section is a straightforward adaptation of the MDP-E algorithm of Even-Dar et al. [33] to our setting. The idea of the algorithm, shown as Algorithm 3.1, is to place an instance e(x) of a fullinformation experts algorithm E into each state x ∈ X, initialized with the action set A and tuned to maximum reward L − l x . Then, the policy to be followed in episode t is just πt , where πt (·|x) is the distribution returned by e(x) in episode t , while at the end of each episode e(x) is fed with the action-values q t (x, ·) as “rewards”. Based on Lemma 3.1, we immediately obtain a performance bound for Algorithm 3.1: Proposition 3.1. Let E be a full-information experts algorithm with a regret bound BE against the class of oblivious adversaries and tuned for rewards in [0, 1]. Then the regret of Algorithm 3.1 against an oblivious adversary satisfies bT ≤ L L(L + 1) BE (T, A) . 2 The above bound also holds against a non-oblivous adversary if the distribution selected by E in a round is fully determined by the previous reward functions observed by the algorithm. Proof. We prove the statement for the case of an oblivous adversary. The statement for nonoblivious adversaries follows directly from Lemma 4.1 of Cesa-Bianchi and Lugosi [25]. Note 30 Algorithm 3.1 Algorithm for the full-information O-SSP. 1. For all states x ∈ X, initialize e(x), an instance of the full-information experts algorithm E with action set A, tuned to maximum reward L − l x . 2. For t = 1, 2, . . . 
, T , repeat: (a) For all x ∈ X, let πt (·|x) denote the distribution chosen by e(x) at episode t . ) (b) Traverse a path ut following the policy πt , that is, use action a(t ∼ πt (·|x) upon lx reaching state x. (c) Observe the reward function r t . π (d) Compute q t = q t t from applying the Bellman equations (2.2) to πt and r t . (e) For all states x ∈ X, feed e(x) with the function q t (x, ·). that πt is a function of the previous deterministic rewards r 1 , . . . , r t −1 . In particular, πt is a deterministic function of these rewards. As a result, q t (x, ·) = qt (x, ·), the reward fed to e(x) is a deterministic sequence: q t (x, ·) = F t (r 1 , . . . , r t ) for some functions F t , t = 1, 2, . . .. As a consequence of this, VT and QT are also non-random. This should be kept in mind, though we will continue to use the boldface characters to refer to these quantities to avoid further clutter. b T = V ∗ (x 0 ) − VT (x 0 ). By Lemma 3.1, this latter Because VT is non-random, we have that L T expression can be written as VT∗ (x 0 ) − VT (x 0 ) = L−1 X X l =0 x∈Xl n o µ∗T (x) QT (x, π∗T (x)) − VT (x) . P Now, QT (x, π∗T (x)) − VT (x) ≤ maxa {QT (x, a) − VT (x)}. Here, QT (x, a) = Tt=1 qt (x, a) is the total P P “reward” for action a from the point of view of e(x), while VT (x) = Tt=1 a πt (a|x)qt (x, a) is the total expected reward incurred by e(x). In other words, QT (x, a) − VT (x) is the expected regret against action a. By our previous remark, the sequence qt (x, ·) = q t (x, ·) is a deterministic sequence. Since the regret bound BE is assumed to hold when E is used on an arbitrary deterministic reward sequence in [0, L − l x ], it must also hold when E is fed with sequence q t (x, ·). Hence, QT (x, a) − VT (x) ≤ (L − l x )BE (T, A). Finally, the desired bound is obtained by simple algebra. Remark 3.2. Applying EWA with (time-horizon dependent) optimized parameters as the experts algorithm E , the above bound becomes (even for non-oblivious adversaries)2 b T ≤ L(L + 1) L 2 2 See Theorem 2.2 in Cesa-Bianchi and Lugosi [25]. 31 s T ln |A| . 2 3.4 Bandit O-SSP using black-box algorithms In the bandit case, the rewards are observed only along the paths that the agent traverses. We start with an algorithm along the lines of the algorithm constructed in the previous section: the algorithm uses a bandit experts algorithm in each state. We show two complementary regret bounds for this algorithm that scale proportionally with the regret of the underlying bandit algorithm (Theorems 3.1 and 3.2). We also consider the case when the goal is to compete with sequences of policies, i.e., the case of “tracking” the best policy in hindsight (Theorem 3.3). Next, we study the case when the Exp3 algorithm of Auer et al. [10] is used as the base bandit algorithm. By exploiting the special properties of this algorithm, we show some improved regret bounds (Theorems 3.4 and 3.5). We also propose a variant of our algorithm that is also applicable beyond the scope of the results obtained previously, and prove an O(T 2/3 ) bound on its regret (Theorem 3.6). We finish with a regret bound that holds with high-probability (Theorem 3.7). The algorithms discussed in this section are direct descendants of the algorithm of Section 3.3. Similarly to Algorithm 3.1, they place a separate experts algorithm in each state x. However, in these case, a bandit algorithm B is needed. 
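The shared skeleton behind these constructions can be sketched as follows (a hedged illustration, not the thesis' pseudocode): one learner object per state is queried for π_t(·|x) at the start of each episode and is afterwards fed a vector of (possibly estimated) action values; the exponential-weights update shown is just one concrete full-information choice, and all names are hypothetical.

```python
import numpy as np

# Hedged sketch of the per-state construction shared by Algorithms 3.1 and 3.2:
# one no-regret learner per state, queried for pi_t(.|x) before the episode and
# fed q_t(x, .) (full information) or an estimate of it (bandit) afterwards.
class PerStateLearner:
    def __init__(self, n_actions, eta=0.1):
        self.w = np.ones(n_actions)
        self.eta = eta

    def policy(self):                 # the distribution pi_t(.|x) for this state
        return self.w / self.w.sum()

    def update(self, q_x):            # q_x is q_t(x, .) or an estimate of it
        self.w *= np.exp(self.eta * q_x)

n_states, n_actions = 6, 2
learners = [PerStateLearner(n_actions) for _ in range(n_states)]
pi_t = np.stack([b.policy() for b in learners])   # pi_t[x] = pi_t(.|x)
```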
The algorithms below will use the following estimate of $q_t$:
$$\hat{q}_t(x, a) = \begin{cases} \dfrac{\sum_{k=l_x}^{L-1} r_t\bigl(x^{(t)}_k, a^{(t)}_k\bigr)}{\pi_t(a|x)\,\mu_t(x)}, & \text{if } (x, a) = \bigl(x^{(t)}_{l_x}, a^{(t)}_{l_x}\bigr); \\[1mm] 0, & \text{otherwise.} \end{cases} \qquad (3.2)$$
Assume that the denominators in the definition of $\hat{q}_t$ are non-zero with probability one. In this case, it follows easily that $\hat{q}_t$ is a conditionally unbiased estimate of $q_t$ given $U_{t-1}$, i.e.,
$$\mathbb{E}\bigl[\hat{q}_t(x, a)\,\big|\,U_{t-1}\bigr] = q_t(x, a). \qquad (3.3)$$
Note that the estimates $\hat{q}_t$ can be computed at the end of episode t only, as the estimates need all the rewards up to the end of the episode. To tune the instances of the bandit algorithm B to be used in each state correctly, we have to suppose that Assumption S2 holds with α > 0:
$$\alpha = \min_{x \in X} \alpha(x) = \min_{x \in X} \min_\pi \mu^\pi(x) > 0.$$
A version of B tuned to work with maximum reward $(L - l_x)/\alpha(x)$ is used at state x. As we will see later, the quantity α plays an important role in describing the problem structure, since if α > 0, any algorithm automatically visits each state with a minimum (positive) frequency, so no resources have to be spent on exploring the state space. In particular, the algorithm below works only for α > 0, and a different algorithm is needed for α = 0 (see Algorithm 3.5).

3.4.1 Expected regret against stationary policies

Our bandit O-SSP algorithm for competing with the best stationary policy is shown as Algorithm 3.2.

Algorithm 3.2 Algorithm for the bandit O-SSP.
1. For all states x ∈ X, initialize b(x), an instance of the bandit algorithm B with action set A, tuned to maximum reward $(L - l_x)/\alpha(x)$.
2. For t = 1, 2, . . . , T, repeat:
(a) For all x ∈ X: let $\pi_t(\cdot|x)$ denote the distribution chosen by b(x) at episode t and choose $\tilde{a}^{(t)}(x)$ (independently) according to $\pi_t(\cdot|x)$.
(b) Follow the policy $\pi_t$ by using $\tilde{a}^{(t)}(x)$ upon reaching state x to obtain the path $u_t$.
(c) Observe the rewards $r_t\bigl(x^{(t)}_0, a^{(t)}_0\bigr), \ldots, r_t\bigl(x^{(t)}_{L-1}, a^{(t)}_{L-1}\bigr)$.
(d) Compute $\mu_t(x)$ for all x ∈ X using (2.3) recursively, and construct the estimates $\hat{q}_t$, based on $u_t$, using Equation (3.2).
(e) For all states x ∈ X, feed b(x) with the reward $\pi_t\bigl(\tilde{a}^{(t)}(x)\,\big|\,x\bigr)\,\hat{q}_t\bigl(x, \tilde{a}^{(t)}(x)\bigr)$.

The next lemma shows the relation between the regret of Algorithm 3.2 and that of the bandit algorithm B. As in the proof of Proposition 3.1, the plan is to reduce the regret analysis of the algorithm to bounding the regret of the individual bandit algorithms b(x) via the help of Lemma 3.1. To be able to do this, we need well-defined bandit problems for each x ∈ X. The difficulty here is that in reality the reward corresponding to a state-action pair (x, a) ∈ X × A is only defined if the agent's actual path $u_t$ contains (x, a). The bandit problem b(x) faces is well defined if one can specify reward functions $r^B_t : X \times A \to \mathbb{R}$ for each episode t = 1, 2, . . . , T such that the reward fed to b(x) in episode t can be shown to be equal to $r^B_t(x, \tilde{a}^{(t)}(x))$. The next technical lemma shows that it is possible to define such reward functions. In essence, for each state-action pair (x, a) ∈ X × A and each time instant t we define a random variable whose value equals the reward the bandit algorithm b(x) would have received had it chosen action a in state x (and, as a prerequisite, the agent had got to state x in the actual episode). More specifically, $r^B_t(x, a)$ follows the distribution of the cumulative reward obtained along a trajectory generated by policy $\pi_t$ (and the dynamics of the MDP) that starts in state x with action a.
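Before the formal statement of this lemma, the computational part of Algorithm 3.2 (steps (c)-(e)) can be made concrete with the following hedged sketch: the occupation measure $\mu_t$ comes from the forward recursion (2.3) and the estimate (3.2) is built from the observed path. Names and array shapes follow the earlier sketches and are not part of the thesis.

```python
import numpy as np

# Hedged sketch of steps (c)-(e) of Algorithm 3.2. Illustrative shapes:
# P[x,a,y] = P(y|x,a), pi_t[x,a] = pi_t(a|x), layers as in Section 2.3.1.
def occupation_measure(layers, P, pi_t):
    """Forward recursion (2.3): mu_t(x) for every state x."""
    mu = np.zeros(P.shape[0])
    mu[layers[0][0]] = 1.0                                # mu(x_0) = 1
    for l in range(len(layers) - 1):
        for x in layers[l]:
            for y in layers[l + 1]:
                mu[y] += mu[x] * (pi_t[x] @ P[x, :, y])   # sum_a pi(a|x) P(y|x,a)
    return mu

def estimate_q(path, path_rewards, pi_t, mu, n_states, n_actions):
    """Importance-weighted estimate (3.2), path = [(x_0,a_0), ..., (x_{L-1},a_{L-1})]."""
    q_hat = np.zeros((n_states, n_actions))
    for k, (x, a) in enumerate(path):
        # reward-to-go from layer l_x, divided by the probability of observing (x, a)
        q_hat[x, a] = sum(path_rewards[k:]) / (pi_t[x, a] * mu[x])
    return q_hat
```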
While conceptually straightforward, the proof of the lemma contains some complicated notation to correctly capture the necessary dependence among the reward functions, and therefore it is delegated to the end of this chapter, Section 3.7. Lemma 3.2. Assume α > 0. Fix T > 0 and consider an arbitrary reward sequence r 1 , . . . , r T : X × A → [0, 1].hThen there exists a sequence of reward functions rBt : X × A → R, t = 1, . . . , T such i B (t ) x that rBt (x, a) ∈ 0, L−l α(x) for all (x, a) ∈ X × A, b(x) receives reward rt (x, ã (x)) in episode t , £ ¤ £ ¤ E qt (x, a) − vt (x) = E rBt (x, a) − rBt (x, ã(t ) (x)) , 33 (3.4) rBt (·, ·) and ã(t ) (·) are independent given Ut −1 , and for each x ∈ X, rBt (x, ·) is selected by a suitable adaptive adversary interacting with the bandit algorithm b(x). Remark 3.3. Note that the adversaries in the different states are not independent, since their selected reward functions rBt (x, ·) have to satisfy (3.4) simultaneously for all x ∈ X. Using the reward functions defined in the above lemma, it is not hard to analyze the performance of Algorithm 3.2, leading to our first regret bound for the bandit case: Theorem 3.1. Assume α > 0. Fix T > 0 and consider an arbitrary reward sequence r 1 , . . . , r T : X × A → [0, 1]. Let B be a multi-armed bandit algorithm that enjoys a regret bound BB against the class of adaptive adversaries that generate rewards in the [0, 1] interval. Then, the regret of Algorithm 3.2 satisfies L(L + 1) BB (T, A). 2α Remark 3.4. One has α > 0 if, for example, bT ≤ L min min min min P (x 0 |x, a) > 0. 0≤l ≤L−1 x∈Xl a∈A x 0 ∈Xl +1 In fact, the assumption of α > 0 is closely related to assuming unichain MDPs without transient states, as both guarantee a minimal probability for visiting a state. Remark 3.5. The regret bound above is clearly loose in some cases, such as the case when we face L consecutive, independent bandit problems in each round (i.e., Xl = {x l } and P (x l +1 |x l , a) = 1 for all actions a and all l = 1, 2, . . . , L − 1) In this case, a reasonable bound would scale linearly with L, not quadratically as the above bound. However, when actions influence future rewards, one may have a heavier dependence on L. Remark 3.6. If one uses the algorithm of Audibert and Bubeck [4] with appropriate parameters as the base bandit algorithm B , the above bound gives bT ≤ L 15L(L + 1) p T |A|. 2α Proof. Proof of Theorem 3.1. As in the proof of Proposition 3.1, the plan is to reduce the problem to bounding the regret of the individual bandit algorithms b(x) via the help of Lemma 3.1, where the bandit problems at each state will be defined via Lemma 3.2. By Lemma 3.1, the expected regret of the algorithm can be written as bT = L L−1 X X l =0 x∈Xl £ ¤ µ∗T (x)E QT (x, π∗T (x)) − VT (x) . (3.5) Thus, it suffices to bound the individual terms of the sum on the right hand side. To do this fix some x ∈ Xl and consider T £ ¤ X £ ¤ E QT (x, π∗T (x)) − VT (x) = E qt (x, π∗T (x)) − vt (x) . t =1 34 (3.6) Now let rBt (x, ·), t = 1, . . . , T defined by Lemma 3.2. By the lemma, and especially by (3.4), the difference in (3.6) is the expected regret of algorithm b(x) against action a = π∗T (x) and a nonoblivous adversary that returns the reward function rBt (x, ·) in round t . 
Thus, thanks to 0 ≤ rBt (x, ·) ≤ L−l x α(x) , the assumption that B enjoys regret BB against the class of adaptive adversaries C that choose rewards in [0, 1] and that b(x) is tuned to the maximum reward L−l x α(x) , we get T T X ¤ L − lx £ £ ¤ X E rBt (x, a) − rBt (x, ã(t ) (x)) ≤ E qt (x, π∗T (x)) − vt (x) = BB (T, A) . α(x) t =1 t =1 (3.7) Combining (3.5), (3.6), and (3.7), and using α ≤ α(x) we get bT ≤ L L−1 X X l =0 x∈Xl µ∗T (x) L −l L(L + 1) BB (T, A) ≤ BB (T, A) , α(x) 2α which finishes the proof. According to this theorem, as long as 1/α is “reasonable”, the size of the state space, |X|, does not play a role in the regret. However, oftentimes one finds that 1/α can be exponentially large (even when α is positive) as a function of the size of X. In the following theorem, the factor (L + 1)/2α is traded for a factor κ|X|, where κ = max x∈X β(x) , α(x) β(x) = max µπ (x), π x ∈ X. (3.8) When β(x) scales proportionally to α(x) (if a state is rare to be visited under some policy then it is rare to be visited under all policies) then κ|X| can be smaller than (L + 1)/(2α). This is the case, for example, in the grid-world example considered in the simulations in Section 3.6. Theorem 3.2. Let B be a multi-armed bandit algorithm that enjoys a regret bound BB against the class of adaptive adversaries that generate rewards in the [0, 1] interval. Assume α > 0 and let κ be defined by (3.8). Then the regret of Algorithm 3.2 satisfies bT ≤ κ L ©PL−1 l =0 ª (L − l )|Xl | BB (T, A) ≤ κ |X| L BB (T, A) . Proof. Following the proof of Theorem 3.1 we obtain, for any 0 ≤ l ≤ L − 1, £ ¤ X ∗ L −l µ∗T (x)E QT (x, π∗T (x)) − VT (x) ≤ µT (x) BB (T, A) α(x) x∈Xl x∈Xl X ≤ X x∈Xl β(x) L −l BB (T, A) α(x) ≤ |Xl |κ(L − l )BB (T, A). Summing up for all l , we get b T ≤ κBB (T, A) L L−1 X (L − l )|Xl | ≤ κBB (T, A) L l =0 L−1 X l =0 35 |Xl | ≤ κBB (T, A) L |X|. Remark 3.7. The first bound in the theorem reveals that the cardinality of the layers for smaller indices has a bigger impact on the regret bound than the cardinality of layers with larger indices. The explanation, as it can be read out from the proof, is that for earlier layers, the range of the total reward-to-go is larger. 3.4.2 Tracking So far the regret of our algorithm was measured relative to the single best policy in hindsight. However, depending on the environment, the performance of the single best policy in hindsight might not be satisfactory, for example, if the environment “drifts”. A natural extension then is to design algorithms that can compete with the performance of any sequence of policies. This is called the tracking problem in the online learning literature (cf. Section 5.2 in [25]). To formalize the problem let π1:T = (π1 , π2 , . . . , πT ) be some sequence of policies. IntroducP π π ing VT 1:T = Tt=1 v t t , the total expected reward of the policy sequence π1:T can be written as π VT 1:T (x 0 ). The question in the case of tracking is how does the total expected reward of a learning algorithm fare compared to the above total reward, that is, we are interested in the expected tracking regret b T (π1:T ) = V π1:T (x 0 ) − E [VT (x 0 )] . L T (3.9) If the policies in the sequence change frequently, we expect a potentially larger gap: In particular, the regret compared to π1:T must clearly be linear if in every episode t , πt is the optimal policy for the reward function r t . However, if a good sequence exists that does not change frequently, a smaller regret must be possible. 
Thus, similarly to standard related results in online learning, we expect to be able to design algorithms whose regret scales with the complexity, C (π1:T ) = 1 + |{t : πt 6= πt +1 , 1 ≤ t ≤ T − 1}| of the competitor sequence π1:T . In particular, if the C = C (π1:T ) switches in π1:T occur evenly distributed in time, one can spend T /C episodes for learning about each policy πt . The rep gret suffered during T /C episodes can be bounded by O( T /C ), hence altogether the regret p suffered during the T = C (T /C ) episodes will be bounded by O( T C ). Thus, this quick, backp of-the-envelope calculation shows that the best we can expect is a regret of size O( T C ) against a sequence π1:T with complexity C = C (π1:T ). Again, the idea is to employ one bandit algorithm in each state x ∈ X. However, in this case, the bandit algorithms should be tuned to achieve a good tracking regret in the underlying (stateless) prediction problems. In fact, we will need bandit algorithms that achieve a sublinear strongly uniform regret bound BBT (T, A), when used with an action set A, where, following Hazan and Seshadhri [54, 55], the concept of strongly uniform regret is defined as the maximal regret of the algorithm over any contiguous time interval against a constant policy:3 a bandit 3 Note that Hazan and Seshadhri [54, 55] used the term adaptive regret instead of strongly uniform regret. 36 algorithm predicting the action at ∈ A in time step t is said to achieve the strongly uniform regret bound BBT on time horizon T > 0 if ( max 1≤t ≤t 0 ≤T " max E a∈A t0 X # " rs (a) − E s=t t0 X #) rs (as ) ≤ BBT (T, A), (3.10) s=t where rt : A → [0, 1] is the reward function for time step t . Several algorithms are known for the full-information case with vanishing tracking regret under various conditions and with different reward functions, see, for example, Willems [102], Herbster and Warmuth [57], Shamir and Merhav [90], Vovk [98], György et al. [45]. These methods can be extended to the bandit case as well, see, e.g., Auer et al. [10]. Further, these same methods can be shown to enjoy a sublinear strongly uniform regret bound BBT [54, 55, 46]. Similarly to Definition 2.1, we say that BBT is a strongly uniform regret bound for algorithm E against a class of adversaries C if the conditions of the definition are satisfied with the strongly uniform regret (3.10) in place of the regret in (2.1) . Later in this chapter, in Theorem 3.5, we will show that the Exp3.S algorithm of Auer et al. [10] enjoys a near-optimal strongly uniform regret bound, as given by (3.24) for optimal parameters when setting L − l x = 1 and α(x) = 1. BBT (T, A) = 2 T |A|(ln(T |A|) + 1)(e − 1), t ≥ 1. p (3.11) The following theorem shows that with a bandit algorithm that enjoys a good strongly uniform regret, Algorithm 3.2 achieves good tracking performance: Theorem 3.3. Assume that in Algorithm 3.2 a bandit algorithm B = BT is used that enjoys a strongly uniform regret bound BBT against the class of non-oblivious adversaries generating rewards in [0, 1]. Then the regret of this algorithm relative to any fixed sequence of policies π1:T can be bounded as ½ b T (π1:T ) ≤ C (π1:T ) min L ¾ L(L + 1) , κL|X| BBT (T, A). 2α Remark 3.8. In particular, if the Exp3.S algorithm of Auer et al. [10] is used with appropriate parameters, (3.11) gives the bound ¾p ½ L(L + 1) b , 2κL|X| T |A| (ln(T |A|) + 1) (e − 1) LT (π1:T ) ≤ C (π1:T ) min α on the tracking regret relative to π1:T . 
The bound depends on C and T on the typical almost optimal order, see, e.g., [57] for lower bounds, and [46] for an overview of related results. Proof. Fix any sequence π1:T = (π1 , . . . , πT ) of policies, and let t 1 = 1 < t 2 < t 3 < · · · < tC < tC +1 = T + 1 denote the change points of policy π1:T , that is πtc = πtc +1 = · · · = πtc+1 −1 for c = 1, . . . ,C . 37 Following the proof of Theorem 3.1 we obtain, for any l , T X X x∈Xl t =1 = ³ £ ´ ¤ µπt (x) E qt (x, πt (x)) − E [vt (x)] C X X x∈Xl c=1 C X X ³ £ ¤ £ ¤´ µπtc (x) E Qtc :tc+1 −1 (x, πt (x)) − E Vtc :tc+1 −1 (x) L −l BBT (T, A) α(x) x∈Xl c=1 ½ ¾ C X L −l ≤ min , Lκ|Xl | BBT (T, A), α c=1 ≤ µπtc (x) where the first inequality follows since BBT is a strongly uniform regret bound for BT , while the second inequality can be obtained by bounding µπt as in Theorems 3.1 and 3.2. Summing up for all l finishes the proof. 3.5 Bandit O-SSP using Exp3 In this section we demonstrate that by “opening up” the bandit algorithms that are used at the individual states, and in particular, by tuning the bandit algorithms in a state-dependent manner, better regret bounds can be achieved than showed in the previous section. In this section we will use the Exp3 algorithm of Auer et al. [10] (as described in Section 6.8 of CesaBianchi and Lugosi [25]), as it is probably the most well-known and simple bandit algorithm to date. Furthermore, we employ a more sophisticated estimate of qt , defined as follows: first, define r̂t (x, a) = r t (x,a) πt (a|x)µt (x) , 0, ³ ´ if (x, a) = xl(t ) , al(t ) ; x x (3.12) otherwise. Then define q̂t and v̂t as the solution to the Bellman-equations q̂t (x, a) = r̂t (x, a) + X P (x 0 |x, a)v̂t (x 0 ), (x, a) ∈ X × A; x0 v̂t (x) = X πt (a|x)q̂t (x, a), a x ∈ X \ {x L }; (3.13) v̂t (x L ) = 0. If we assume πt (a|x)µt (x) > 0 for all (x, a) ∈ X × A, we have E [ r̂t (x, a)| Ut −1 ] = rt (x, a) and E [ πt (a|x)| Ut −1 ] = πt (a|x), ¯ £ ¤ E q̂t (x, a)¯ Ut −1 = qt (x, a) (3.14) follows from the linearity of the Bellman-equations. In contrast to the estimates defined in Equation (3.2), these estimates use the information made available about the reward function r t for updating the policy in all the states, not only in the states contained in ut . Accordingly, 38 using the above estimates in Algorithm 3.2 leads to a much more robust performance4 . 3.5.1 Expected regret against stationary policies The method for regret minimization against the set of all stationary policies is given as Algorithm 3.3. The next theorem gives a performance bound on the algorithm. Algorithm 3.3 Algorithm for the bandit O-SSP based on Exp3 for α(x) > 0 for all x ∈ X. 1. Set γx ∈ (0, 1], η x > 0 for all x ∈ X, and w1 (x, a) = 1 for all (x, a) ∈ X × A. 2. For t = 1, 2, . . . , T , repeat: (a) For all (x, a) ∈ X × A let wt (x, a) γx πt (a|x) = (1 − γx ) P . + | A| w (x, b) b t (b) Obtain a path ut following the policy πt . (c) Compute µt (x) for all x ∈ X using (2.3) recursively, and construct estimates q̂t using (3.12) and (3.13) based on ut . (d) For all (x, a) ∈ X × A, set wt +1 (x, a) = wt (x, a)e η x q̂t (x,a) . Theorem 3.4. If Algorithm 3.3 is run with η x = then for all T |A| ln |A| ≥ maxx∈X (e−1)α(x) , 1 L−l x q α(x) ln |A| |A|(e−1)T and γx = q |A| ln |A| (e−1)α(x)T (x ∈ X), ¾p ½ L(L + 1) 0 b , 2κ |X|L T |A| ln |A|(e − 1), LT ≤ min p α β(x) . α(x) where κ0 = maxx∈X p Remark 3.9. 
The advantage of the above results over those of Theorem 3.1 and Theorem 3.2 is p p the better dependency on α and α(x) ( α instead of α, and α(x) in the definition of κ0 instead of α(x) in the definition of κ). £ ¤ b T = PL−1 Px∈X µ∗ (x)E QT (x, π∗ (x)) − VT (x) . Combining Proof. According to Lemma 3.1, L l T T l =0 £ ¤ this with the bounds on E QT (x, π∗T (x)) − VT (x) stated in Lemma 3.3 below and then finishing as in the proofs of Theorems 3.1 and Theorem 3.2 gives the result. Thus, it remains to prove the following result: 4 Note that all following results can be also proven for the estimates defined in Equation (3.2). Even though the proofs for the simpler estimates are somewhat more straightforward, we focus on these more refined estimates since they intuitively convey more information to the learner. Our experimental results presented in Section 3.6 support this intuition very spectacularly. 39 Lemma 3.3. Fix any x ∈ X. If Algorithm 3.3 is run with parameters γx ∈ (0, 1] and ηx ≤ α(x) γx (L − l x ) |A| (3.15) then, for any a ∈ A, E [QT (x, a) − VT (x)] ≤ γx (e −1)(L −l x )T + lnη|xA| . In particular, setting η x to its upper bound gives ½ ¾ |A| ln |A| E [QT (x, a) − VT (x)] ≤ (L − l x ) γx (e − 1)T + . γx α(x) Proof. Fix the state x ∈ X. We claim that by Lemma 3.7, for any b ∈ A, b T (x, b) − V b T (x) ≤ Q bT = where Q PT t =1 q̂t , M2 = L−l x α(x) . X ln |A| b T (x, a) + γx Q b T (x, b) , + (e − 2)η x M 2 Q ηx a (3.16) Indeed, the conditions of the lemma are satisfied (use q t (·) = q̂t (x, ·), πt (·) = πt (·|x), γ = γx , η = η x , A = A). Then, (3.46) is satisfied by the definition of πt (x, ·). To proceed, we need to provide an upper bound for q̂t (x, a) that holds for all a ∈ A. To this end, let l = l x and define for all x 0 ∈ Xk , (k > l ), x̃ ∈ Xl , ã ∈ A ¯ £ ¤ µt (x 0 |x̃, ã) = P xk = x 0 ¯ xl = x̃, al = ã, ut −1 , (3.17) where (x0 , a0 , x1 , a1 , . . . , xL ) is a random trajectory generated by πt , but is otherwise independent of ut and Ut −1 . Using this notation and the definition of q̂t and v̂t , we bound q̂t (x, a) as q̂t (x, a) = r̂t (x, a) + L−1 X X k=l +1 x 0 ∈Xk µt (x 0 |x, a) X πt (a 0 |x 0 )r̂t (x 0 , a 0 ) a0 0 ≤ L−1 X X µt (x |x, a)I{x 0 =x(tk ) } 1 + πt (a|x)µt (x) k=l +1 x 0 ∈Xk µt (x 0 ) ≤ L−1 X µt (xk(t ) |x, a) 1 + πt (a|x)µt (x) k=l +1 µt (x(t ) ) k Now notice that µt (x 0 ) = X X x̃∈Xl ã µt (x 0 |x̃, ã)µt (x̃)πt (ã|x̃) ≥ µt (x 0 |x, a)µt (x)πt (a|x) holds for all x 0 ∈ Xk , (k > l ), x ∈ Xl , a ∈ A. Using the above inequality for the case when µt (xk(t ) |x, a) > 0 gives µt (xk(t ) |x, a) µt (xk(t ) ) ≤ 1 , πt (a|x)µt (x) 40 thus we obtain an upper bound on q̂t (x, a) as q̂t (x, a) ≤ L −l (L − l )|A| ≤ πt (a|x)µt (x) γx α(x) (a ∈ A) , (3.18) γ γ x where in the last step we have used that µt (x) ≥ α(x) and πt (a|x) ≥ |A|(x) ≥ |Ax| for all (x, a) ∈ X × A. Furthermore, we have X πt (a|x)q̂t (x, a)2 = a X a L − lx X πt (a|x)q̂t (x, a) q̂t (x, a) ≤ q̂t (x, a) . | {z } α(x) a (3.19) x ≤ L−l α(x) Using (3.18) together with (3.15) gives (3.47), while by (3.19), we can choose M 2 = L−l x α(x) as re- quired. Now, taking expectations of both sides of (3.16), using (3.14) and Q T (x, a) ≤ (L − l x ) gives the desired result once η x is replaced with its upper bound from (3.15). 3.5.2 Tracking Next we consider how the results of the previous subsection can be improved for the tracking problem if we directly analyze the case when the tracking algorithm Exp3.S of Auer et al. [10] is plugged in Algorithm 3.2 as the bandit algorithm B . 
The resulting Algorithm 3.4 is given for completness. Algorithm 3.4 Algorithm for tracking the best dynamic policy in the bandit case. 1. Set γx ∈ (0, 1], η x > 0, δ > 0 and w1 (x, a) = 1 for all (x, a) ∈ X × A. 2. For t = 1, 2, . . . , T , repeat (a) Set γx wt (x, a) + πt (a|x) = (1 − γx ) P | A| b wt (x, b) for all (x, a) ∈ X × A. (b) Compute µt (x) for all x ∈ X using (2.3) recursively. (c) Draw a path ut randomly, according to the transition probabilities P and the policy πt . ³ ³ ´ ³ ´´ ) (t ) (t ) (t ) (d) Receive rewards r t (ut ) = r t x(t , a , . . . , r x , a . t 0 0 L−1 L−1 (e) Compute µt (x) for all x ∈ X using (2.3) recursively, and construct estimates q̂t using 3.12 and (3.13) based on ut . (f) For all (x, a) ∈ X × A, set wt +1 (x, a) = wt (x, a)e η x q̂t (x,a) + δ X wt (x, a). | A| a Theorem 3.5. Define α(x) = min µπ (x), β(x) = max µπ (x) π π 41 and set δ > 0, η x > 0, γx ≥ (L−l x )|A| α(x) η x for all x ∈ X. Assume that κ0 = maxx β(x) p α(x) < ∞. Then for any fixed sequence of policies π1:T = (π1 , . . . , πT ) the regret of Algorithm 3.4 satisfies ½ ¾ p L(L + 1) 0 b LT (π1:T ) ≤ C (π1:T ) min , 2κ |X|L 2 T |A| (ln(T |A|) + 1) (e − 1). p α Remark 3.10. This result improves Theorem 3.3 similarly to how Theorem 3.4 improves the black-box results of Theorem 3.1. Proof. We prove the statement by proving a bound on the strongly uniform regret defined as ½ ¾ max max E [Qt :t 0 (x, a)] − E [Vt :t 0 (x)] . 1≤t ≤t 0 ≤T x∈X a∈A To this end, fix an interval [r, s] ⊆ [1, T ] and consider def b s:s 0 = max V π 0 (x 0 ) − Vs:s 0 (x 0 ). L s:s π Fix a policy π. Applying Lemma 3.1 and summing for t = s, s + 1, . . . , s 0 we obtain π Vs:s 0 (x 0 ) − E [Vs:s 0 (x 0 )] = L−1 X X l =0 x l ∈Xl s0 X £ ¤ µ(x l ) E qt (x l , π(x l )) − vt (x l ) (3.20) t =s From now on, we follow the proof of Theorem 8.1 in Auer et al. [10]. Fix a state x ∈ X, and define P Wt (x) = a wt (x, a), 1 ≤ t ≤ T . Now, fix 1 ≤ t ≤ T − 1. Then, Wt +1 (x) X wt (x, a) η x q̂t (x,a) = e +δ. Wt (x) a Wt (x) We claim that η x q̂(x, a) ≤ 1. Indeed, thanks to the choice of η x and γx , η x ≤ αγx (L−l x )|A| holds, which, together with (3.18), implies η x q̂(x, a) ≤ 1. Now, the first term can be bounded identically to how the same expression was bounded in the proof of Theorem 3.4, giving (e − 2)η2x L − l x X Wt +1 (x) ηx ≤ 1+ v̂t (x) + q̂t (x, a) + δ . Wt (x) 1 − γx 1 − γx α(x) a Now let b s:s 0 (x) = V s0 X (3.21) v̂t (x) t =s Taking logarithms on both sides of (3.21) and summing over t = s, s + 1, . . . , s 0 , we get ln s0 X (e − 2)η2x L − l x X Ws (x) ηx b s:s 0 (x) + ≤ V q̂t (x, a) + δ(s 0 − s). Wr (x) 1 − γx 1 − γx α(x) t =s a Now let b = π(x). Since à ws (x, b) ≥ wr (x, b) exp η x s0 X ! q̂t (x, b) ≥ t =s 42 δ |A| à Wr exp η x s0 X t =s ! q̂t (x, b) (3.22) we have ln µ ¶ s0 X ws (x, b) δ Ws (x) ≥ ln ≥ ln + ηx q̂t (x, b). Wr (x) Wr (x) | A| t =s (3.23) Putting (3.22) and (3.23) together we get b s:s 0 (x) ≥ (1 − γx ) V s0 X s0 X ln (|A|/δ) L − lx X δ(s 0 − s) q̂t (x, b) − − (e − 2)η x q̂t (x, a) − ηx α(x) t =s a ηx t =s After taking expectations and using " E [Vs:s 0 (x)] ≥ (1 − γx )E s0 X P a Qs:s 0 (x, a) ≤ (L − l x )|A|(s # qt (x, π(x)) − (e − 2)η x t =s 0 − s) we get ln(|A|/δ) + δ(r − s) (L − l x )2 |A|(s 0 − s) − . α(x) ηx Using s 0 − s ≤ T , we get that E [Q s:s 0 (x, π(x)) − V Substituting δ = s:s 0 µ ¶ ln(|A|/δ) + δT L − lx (x)] ≤ + γx + η x (e − 2) |A| (L − l x )T. 
ηx α(x) 1 T, 1 ηx = L − lx and s γx = s (ln(T |A|) + 1)α(x) T (e − 1)|A| (ln(T |A|) + 1)|A| , T (e − 1)α(x) we obtain s E [Qs:s 0 (x, πt (x)) − Vs:s 0 (x)] ≤ 2(L − l x ) T |A|(ln(T |A|) + 1)(e − 1) . α(x) (3.24) Combining with (3.20) and using the arguments used for proving Theorem 3.3, we obtain the result. 3.5.3 The case of α = 0 In some problems the minimum state occupancy measure, α might be zero. In such cases the previous results are vacuous. We now show that by appropriately adjusting the previous algorithm, we can obtain a sublinear guarantee on the regret under the less restrictive assumption that there exists some policy πexp that visits each state with positive probability – in fact, all meaningful MDPs should satisfy this criterion with setting πexp to be, for example, the uniform policy. The algorithm below, which achieves this regret, is similar to Algorithm 3.3 in that the policy πt to be followed at time t is obtained by mixing the “Boltzmann policy”, π0t (a|x) = P wt (x, a)/ b wt (x, b) with πexp instead of the policy that explores all actions in all states with equal probability. However, the mixing happens at a global level: instead of deciding about whether to use πexp or π0t in each state independently of other states, a global randomized choice is made at the beginning of episode t between πexp and π0t . Provided that πexp is cho- 43 sen with probability γ, this ensures that the probability of visiting any state x ∈ X is at least γµπexp (x). Potentially, this is much larger than what we would get had we chosen between these policies at every state. Indeed, if zt ∈ {0, 1} is the indicator that πexp is followed in episode t (i.e., P [zt = 1|Ut −1 ] = γ) then i h ) µt (x) = P x(t = x|U t −1 lx h i h i ) (t ) + P x = x|z = 0, U P [zt = 0|Ut −1 ] = P x(t = x|z = 1, U P = 1|U ] [z t t −1 t t −1 t t −1 l l x x = γµπexp (x) + (1 − γ)µπ0t (x) (3.25) ≥ γµπexp (x) , whereas if one used the mixture policy πt (a|x) = γπexp (a|x) + (1 − γ)π0t (a|x) ((x, a) ∈ X × A), then from (2.3), by induction on l x , we would only get h i ) µ0t (x) = P x(t = x|U t −1 lx X X =γ µ0t (x 0 ) πexp (a 0 |x 0 )P (x|x 0 , a 0 )+ x 0 ∈Xl x −1 (1 − γ) ≥γ ≥γ a X x 0 ∈Xl x −1 X x 0 ∈Xl x −1 X x 0 ∈Xl x −1 µ0t (x 0 ) µ0t (x 0 ) X X a π0t (a 0 |x 0 )P (x|x 0 , a 0 ) πexp (a 0 |x 0 )P (x|x 0 , a 0 ) a0 (γl x −1 )µπexp (x 0 ) X πexp (a 0 |x 0 )P (x|x 0 , a 0 ) a0 = γl x µπexp (x) . Since the policy used in an episode is a mixture of stationary policies (and is not a stationary policy on its own), we need to modify the definitions of qt and vt . In particular, we let π0 πexp π0 πexp qt = (1 − γ)q t t + γq t vt = (1 − γ)v t t + γv t and use QT = PT t =1 qt and VT = PT t =1 vt (3.26) , as before. It can be easily verified that Lemma 3.1 stays of QT and VT as well. Introduce νt : X × A → [0, 1], νt (x, a) = h true for this definition i (t ) (t ) P xl = x, al = a|Ut −1 , and (similarly to the estimates defined in (3.12)) define x x r0t (x, a) = r t (x,a) , ³ ´ ) if (x, a) = xl(t ) , a(t ; l 0, otherwise. νt (x,a) x 44 x (3.27) Finally, define q0t and v0t as the solution to the Bellman-equations q0t (x, a) = r0t (x, a) + v0t (x) = X a X x0 P (x 0 |x, a)v0t (x 0 ), π0t (a|x)q0t (x, a), (x, a) ∈ X × A; x ∈ X \ {x L }; (3.28) v0t (x L ) = 0. £ ¤ £ ¤ Then, since by construction E r0t (x, a)|Ut −1 = r t (x, a) and E π0t (a|x)|Ut −1 = π0t (a|x) holds for £ ¤ π0 all (x, a) ∈ X × A, we also have E q0t (x, a)|Ut −1 = q t t (x, a). 
It is easy to see that the occupation measure νt can be efficiently calculated using νt (x, a) = γµπexp (x)πexp (a|x) + (1 − γ)µπ0t (x)π0t (a|x), (x, a) ∈ U . (3.29) The new algorithm is shown in Algorithm 3.5. Note that if the exploration policy πexp is the uniform policy πE (πE chooses each action with equal probability in each state, i.e., πE (a|x) = 1/|A| for all x, a) then the marginal distribution of actions in each state remains the same as if Exp3 was used (although the joint distributions are different, resulting in different histories and behaviors generated by the algorithm). Algorithm 3.5 Algorithm for the bandit O-SSP based on Exp3 using “global” exploration. 1. Set γ ∈ (0, 1], η x > 0 for all x ∈ X and w1 (x, a) = 1 for all (x, a) ∈ X × A. Initialize the exploration policy πexp and compute µπexp . 2. For t = 1, 2, . . . , T , repeat: (a) Draw a Bernoulli random variable zt ∈ {0, 1} with parameter γ. i. If zt = 1, then follow the exploration policy πexp throughout the episode. ii. If zt = 0, then follow the policy π0t given by wt (x, a) π0t (a|x) = P , b wt (x, b) (x, a) ∈ U . (b) Observe the path ut following the policy computed in the previous step. (c) Compute νt : X × A → [0, 1] using (3.29) (µπ0t can be computed from (2.3)) and then construct the estimates q0t via (3.27) and (3.28) by using the data ut . (d) For all (x, a) ∈ X × A, set wt +1 (x, a) = wt (x, a)e η x qt (x,a) . 0 The advantage of Algorithm 3.5 becomes apparent when α = minx∈X minπ µπ (x) = 0, that is, when certain policies do not visit some states: the next result shows that the algorithm achieves an O(T 2/3 ) regret even in this case. Theorem 3.6. Suppose that αexp = minx∈X µπexp (x) mina πexp (a|x) > 0. If Algorithm 3.5 is used 45 with parameters γ ∈ (0, 1] and any η x ≤ γαexp L−l x then µ ¶ η x (e − 2)(L − l x ) ln |A| E [QT (x, a) − VT (x)] ≤ γ + (L − l x )T + γαexp ηx for all x ∈ X. In particular, setting γ = q 3 r (e−2) ln |A| αexp T and η x = 2 3 s E [QT (x, a) − VT (x)] ≤ 3 T (L − l x ) 3 for all T ≥ ln |A| . αexp (e−2)2 1 L−l x 3 αexp ln2 |A| (e−2)T 2 yields (e − 2) ln |A| αexp Furthermore, if the latter condition on T holds for all x ∈ X, then 2 3 b T ≤ 3L(L + 1)T L 2 s 3 (e − 2) ln |A| . αexp Remark 3.11. (i) The theorem shows the advantage of allowing an arbitrary exploration policy πexp in Algorithm 3.5 that visits hardly accessible states with higher probability than, for example, the uniform policy πE ; that is, αexp may be substantially larger for a well-chosen exploration policy than for uniform exploration. (ii) Since γ does not depend on x, we cannot modify the theorem to obtain a bound where the constant is proportional to something similar to κ or κ0 as in Theorems 3.2 and 3.4. Proof. We will use the notations Q0T = PT 0 t =1 qt and V0T = PT 0 t =1 vt . First, let us fix an arbitrary x ∈ X. We will use a slightly modified version of Lemma 3.7 with A = A, π̃t = π0 (·|x), q t = q0t (x, ·), η = η x , γ = 0. Since r̂t (x, a) ≤ 1 γαexp and accordingly, q0t (x, a) ≤ L−l x γαexp holds for all (x, a) ∈ X × A, our condition on η x ensures that (3.47) holds. Since our estimates do not fulfill condition (3.48), we need to use X a π0t (a|x)(q0t (x, a))2 ≤ L − lx X 0 π (a|x)q0t (x, a) γαexp a t instead, where we have used the same upper bound for q0t as above. 
Applying this small modification, one can easily prove Q0T (x, b) − V0T (x) ≤ ln |A| η x (e − 2)(L − l x ) 0 + VT (x), ηx γαexp which becomes " E T ³ X t =1 π0 π0 q t t (x, b) − v t t (x) ´ # ≤ ln |A| η x (e − 2)(L − l x )2 + T ηx γαexp πexp after taking expectations of both sides. Using the definitions (3.26) and the trivial bound q t πexp v t (x) ≤ L − l x , we finally get µ ¶ ln |A| η x (e − 2)(L − l x ) E [QT (x, b) − VT (x)] ≤ + (L − l x ) T γ + ηx γαexp 46 (x, a)− as desired. The rest follows from applying Lemma 3.1. 3.5.4 A bound that holds with high probability In this section, we propose a method based on Exp3.P, as described in Section 6.8 of CesaBianchi and Lugosi [25] to control the (random) regret LT = VT∗ (x 0 ) − T L−1 X X t =1 l =0 r t (xl(t ) , al(t ) ) . The new method is a variation of Algorithm 3.3, where the weight update at (x, a) ∈ X × A uses the following slightly biased estimate of the rewards: r̃t (x, a) = r̂t (x, a) + ω , πt (a|x)µt (x) (3.30) where r̂t is as defined in (3.12) and the value of ω > 0 will be specified later. The estimates for q̃t are then defined as q̃t (x, a) = r̃t (x, a) + X P (x 0 |x, a)ṽt (x 0 ), (x, a) ∈ X × A; x0 ṽt (x) = X πt (a|x)q̃t (x, a), a x ∈ X \ {x L }; (3.31) ṽt (x L ) = 0. Introducing the notation ct (x, a) = X X X πt (a 0 |x 0 )µt (x 0 |x, a) 1 + µt (x)πt (a|x) k=l +1 x 0 ∈Xl a 0 µt (x 0 ) (3.32) for all (x, a) ∈ X × A, we can simply write q̃t as q̃t (x, a) = q̂t (x, a) + ωct (x, a). The main result of this section is the following theorem: Theorem 3.7. Let δ ∈ (0, 1) and consider Algorithm 3.3 where in line 2d q̃t (x, a), as defined in (3.28), is used in place of q̂t (x, a). Assume that the parameters of the algorithm are set to s γx = γ = 3|A| ln(3|X||A|2 ln T /δ) 1 , ηx = Tα L − lx s α ln(3|X||A| ln T /δ) , ω=2 T | A| s 3α ln(3|X||A|2 ln T /δ) , T |A| and T is big enough so that γx ≤ 1/3 and ω ≤ 1. Then, with probability at least 1 − δ, the regret, 47 LT , of the algorithm satisfies s LT ≤ 4L(L + 1) T |A| 3|X||A|2 log2 T L(L + 1) 3|X||A| log2 T ln + ln +L α δ 2α δ r 8T ln 3 . δ For a state-action pair (x, a) ∈ X × A, define e T (x, a) = Q T X q̃t (x, a) and e T (x) = V T X ṽt (x). t =1 t =1 We prove the statement through a series of lemmata. The plan is to use Lemma 3.1 to reduce the problem of bounding the regret to that of bounding the state-wise regrets, maxa QT (x, a) − VT (x). We will see that Lemma 3.7 allows one to upper bound maxa QT (x, a) − VT (x) in terms e T (x, a). Our first lemma shows that this last difference cannot be too large no of QT (x, a) − Q matter the choice of γx . This is in contrast to the difference between Q̂T and QT , that scales in the worst case. That the difference is upper bounded independently of the value with γ−1/2 x p of γx is crucial: It is this property which makes it possible to prove a T regret bound. The following Lemma, taken from Bartlett et al. [14], is the key element to our high probability bound: Lemma 3.4. Assume z1 , z2 , . . . , zT is a martingale difference sequence with |zt | ≤ b. Let σ2t = Var [ zt | z1 , z2 , . . . , zt −1 ] . Furthermore, let v u t uX Σt = t σ2t . s=1 Then the following holds for any δ < 1/e and T ≥ 4: " P T X t =1 # n op p ln(1/δ) ≤ δ log2 T. zt > 2 max 2ΣT , b ln(1/δ) Using this tool, we can prove the following result: Lemma 3.5. Fix some (x, a) ∈ X × A, x ∈ Xl and let δ0 = 3|X||Aδ| log T . Then, with probability at 2 least 1 − δ0 , µ ¶ 4 | A| 1 e T (x, a) ≤ QT (x, a) − Q + (L − l ) ln 0 . 
(3.33) ω γα δ δ , Furthermore, letting δ00 = 3|X| log 2T e T (x) − VT (x) ≤ V µ ¶ 2T ω(L − l )|A| 4 1 1 + + (L − l ) ln 00 α ω α δ holds with probability at least 1 − δ00 . 48 (3.34) Proof. Let us use Lemma 3.4 for zt = qt (x, a) − q̂t (x, a). First, we have h¡ i ¢2 ¯¯ q̂t (x, a) ¯ Ut −1 "à !2 ¯ # ¯ X X X ¯ 0 0 0 0 0 = E r̂t (x, a) + µt (x |x, a) πt (a |x )r̃t (x , a ) ¯ Ut −1 ¯ a0 k=l +1 x 0 ∈Xl " à !¯ # ¯ X X X 2 0 0 0 0 0 2 ¯ ≤ E (L − l ) r̂t (x, a) + µt (x |x, a) πt (a |x )r̂t (x , a ) ¯ Ut −1 ¯ a0 k=l +1 x 0 ∈Xl ! à X X X µt (x 0 |x, a)πt (a 0 |x 0 )r t (x 0 , a 0 ) r t (x, a) + ≤ (L − l ) µt (x)πt (a|x) k=l +1 x 0 ∈Xl a 0 µt (x 0 )πt (a 0 |x 0 ) ! à X X X µt (x 0 |x, a)πt (a 0 |x 0 ) 1 + ≤ (L − l ) µt (x)πt (a|x) k=l +1 x 0 ∈Xl a 0 µt (x 0 )πt (a 0 |x 0 ) σ2t ≤ E = (L − l )ct (x, a), P P where we have used ( Kk=1 a k )2 ≤ K Kk=1 a k2 and Jensen’s inequality in the third line, r̂t (x 0 , a 0 ) ≤ ¯ £ ¤ 1 0 0 ¯ 0 0 0 0 µt (x 0 )πt (a 0 |x 0 ) and E r̂t (x , a ) Ut −1 = r t (x , a ) in the fourth line, r t (x , a ) < 1 in the fifth line and the definition of ct (3.32) in the last line. Thus, v à ! u T T u X X (L − l ) 1 ΣT = t(L − l ) ct (x, a) ≤ ω0 ct (x, a) + 0 , 2 ω t =1 t =1 for some ω0 > 0 by the relationship between the arithmetic and the geometric means. Using (L−l )|A| γα Lemma 3.4 and b = (à b t + 2 max QT ≤ Q bt + ≤Q T X (see (3.18)), we get that with probability at least 1 − δ0 log2 T ! ) p (L − l ) (L − l )|A| p , ln(1/δ0 ) ln(1/δ0 ) (L − l ) ω ct (x, a) + 0 ω γα t =1 2(L − l )ω0 T X p 0 ln(1/δ0 )ct (x, a) + t =1 2(L − l ) p (L − l )|A| ln(1/δ0 ) + ln(1/δ0 ). 0 ω γα p Setting ω = 2(L − l ) ln(1/δ0 )ω0 , we obtain ¶ 4 (L − l )|A| bt + QT ≤ Q ωct (x, a) + + (L − l ) ln(1/δ0 ) ω γα t =1 µ ¶ 4 | A| e + (L − l ) ln(1/δ0 ), = Qt + ω γα T X µ yielding the first statement of the lemma. Now for the second statement, set zt = v̂t (x) − vt (x) and b = (L−l ) α for Lemma 3.4, we get b T (x) − VT (x) ≤ ω V e T (x) = V b T (x) + ω thus using V PT T X X t =1 a t =1 P µ ¶ 4 1 πt (a|x)ct (x, a) + + (L − l ) ln(1/δ00 ) ω α a πt (a|x)ct (x, a) 49 and P a πt (a|x)ct (x, a) ≤ (L−l )|A| , α we get that e T (x) − VT (x) ≤ V ¶ µ 4 1 2T ω(L − l )|A| (L − l ) ln(1/δ00 ) + + α ω α holds with probability at least 1 − δ00 as required. Lemma 3.6. Let δ ∈ (0, 1). Assume that the parameters of the algorithm are set such that 0 ≤ γ ≤ 1/3, (3.35) α γ 0 ≤ ηx ≤ , (L − l x ) |A| 1 0≤ω≤ . | A| (3.36) (3.37) Then, (L − l x )|A| 3|X||A|2 log2 T ln γα δ µ ¶ 3|X||A| log2 T 2T ω(L − l x )|A| 8 1 + + + . (L − l x ) ln α ω α δ QT (x, a) − Vt (x) ≤ 3γ(L − l x )T + with probability at least 1 − 2δ/3, simultaneously for all (x, a) ∈ X × A. Remark 3.12. Setting the parameters as s γ= 3|A| ln(3|X||A|2 ln T /δ) , Tα s ω=2 1 ηx = L − lx s 3α ln(3|X||A|2 ln T /δ) , T | A| α ln(3|X||A| ln T /δ) , T | A| gives s max QT (x, a) − VT (x) ≤ 8(L − l x ) a T |A| 3|X||A|2 log2 T (L − l x ) 3|X||A| log2 T ln + ln . α δ α δ Proof. The proof essentially follows that of Theorem 6.10 in Cesa-Bianchi and Lugosi [25]. Fix a state-action pair (x, a) ∈ X × A, x ∈ Xl and let δ0 and δ00 be as defined in Lemma 3.5. Consider the decomposition ¡ ¢ ¡ ¢ e t (x) + V e T (x) − VT (x) . QT (x, b) − Vt (x) = QT (x, b) − V (3.38) e t (x). We claim that from Lemma 3.7 it follows that We first construct an upper bound on −V e T (x, a) − V e T (x) ≤ max Q a X ln |A| e T (x, a) + γx max Q e T (x, a) , + cη x M 2 Q a ηx a 50 (3.39) where c = e − 2 and M 2 = (1 + ω|A|) L −l . α (3.40) To verify this, we need to check the conditions of the Lemma 3.7. 
Using the identifications q t (·) = q̃t (x, ·), πt (·) = πt (·|x), w t (a) = wt (x, a), we see that (3.46) follows from the choice of πt (·|x). Further, we have ct (x, a) ≤ (L − l )|A| , γα which gives q̃t (x, a) = q̂t (x, a) + ωct (x, a) ≤ (1 + ω) (L − l )|A| γα when combined with (3.18). Furthermore, since it is easy to see that by the definition of ct πt (a|x)ct (x, a) ≤ (L − l )|A| , α holds for all (x, a) ∈ X × A, we also have X πt (b|x)q̃t (x, b)2 = b X¡ ¢ πt (b|x)q̂t (x, b) + πt (b|x)ct (x, b) q̃t (x, b) b ≤ (1 + ω|A|) L −l X q̃t (x, b) . α b Now, (3.47) follows from the first inequality and (3.37), (3.36), while the second inequality yields (3.48). We also see that we can indeed choose M 2 as shown in (3.40). Finally, (3.39) follows from the claim of the lemma. From (3.39), upper bounding P a QT (x, a) gives e T (x) ≤ −V e e T (x, a) and reordering the terms by |A| maxa Q ln |A| e T (x, a) , + (ζ − 1) max Q a ηx (3.41) where ζ = γx + cη x M 2 |A|. Since, from (3.35)–(3.37) it follows that ζ ≤ 3γx ≤ 1, therefore, in e T (x, a). order to get an upper bound on the³right-hand side, we need a lower bound on maxa Q ´ A| (L−l ) ln δ10 is a suitable lower bound. Using a union According to Lemma 3.5, QT (x, a)− ω4 + |γα bound and (3.41) we thus get that apart from an event of probability of at most δ0 , µ µ ¶ ¶ ln |A| 4 | A| 1 e T (x) ≤ −V + (ζ − 1) max QT (x, a) − + (L − l ) ln 0 a ηx ω γα δ 51 holds. Plugging this into (3.38), we get ¶ ¶ µ µ 1 4 | A| (L − l ) ln 0 QT (x, b) − Vt (x) ≤ QT (x, b) + (ζ − 1) max QT (x, a) − + a ω γα δ ¢ ln |A| ¡ e T (x) − VT (x) + + V ηx µ ¶ 4 | A| 1 = QT (x, b) − max QT (x, a) + ζ max QT (x, a) + (1 − ζ) (L − l ) ln 0 + a a ω γα δ ¢ ln |A| ¡ e T (x) − VT (x) . + + V ηx Here, QT (x, b) − maxa QT (x, a) ≤ 0. Further, we can upper bound QT (x, a) by (L − l )T and 1 − ζ by 1, to get QT (x, b) − Vt (x) ≤ ζ(L − l )T + µ ¶ ¢ 1 ln |A| ¡ 4 |A| e T (x) − VT (x) . (L − l ) ln 0 + + V + ω γα δ ηx Using Lemma 3.5 and a union bound for all x ∈ X, we get µ ¶ 2T ω(L − l )|A| 4 1 1 e T (x) − VT (x) ≤ V + + (L − l ) ln 00 . α ω α δ Putting together the bounds, using δ00 < δ0 and one last union bound, we thus obtain that with probability at least 1 − 2δ/3, simultaneously for all x ∈ X, it holds that ¶ 8 | A| 1 1 ln |A| QT (x, b) − Vt (x) ≤ ζ(L − l )T + (L − l ) ln 0 + + + ω γα α δ ηx 2T ω(L − l )|A| + . α µ Using ζ ≤ 3γx , plugging in the definitions of δ0 and δ00 and setting η x to its upper bound gives the desired bound. Proof of Theorem 3.7. We have ! à T L−1 X X ¡ ∗ ¢ (t ) (t ) r t (xl , al ) . LT = VT (x 0 ) − VT (x 0 ) + VT (x 0 ) − t =1 l =0 The last term can be bounded from applying the Hoeffding–Azuma inequality, once we note ³ ´T P (t ) (t ) that vt (x 0 ) − L−1 r (x , a ), U is a martingale difference sequence with differences bounded t t l =0 l l t =1 in [−L, L]. Hence, with probability at least 1 − δ/3, VT (x 0 ) − T L−1 X X t =1 l =0 r t (xl(t ) , al(t ) ) ≤ L r 3 8T ln . δ Let us now deal with the first term. According to Lemma 3.1, VT∗ (x 0 ) − VT (x 0 ) = L−1 X X l =0 x∈Xl ³ ´ µ∗T (x) QT (x, π∗T (x)) − VT (x) . 52 ´ ³ Now, QT (x, π∗T (x)) − VT (x) can be bounded using Lemma 3.6 (and Remark 3.12). Summing up for all layers, we get s VT∗ (x 0 ) − VT (x 0 ) ≤ 4L(L + 1) T |A| 3|X||A|2 log2 T L(L + 1) 3|X||A| log2 T ln + ln . α δ 2α δ Putting everything together and using the union bound gives the result. 
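In implementation terms, the high-probability method changes only the estimate fed to the weight update: line 2d of Algorithm 3.3 now uses q̃_t defined via (3.30)–(3.31) instead of q̂_t, while the mixed policy with exploration parameter γ_x is formed as before. The fragment below is a minimal, numerically stable sketch of this update step; it is an illustration only, with hypothetical data structures: log_w is assumed to hold log w_t(x, a) for every state-action pair (initialized to zero, so that w_1 ≡ 1), and eta, gamma hold the per-state parameters η_x and γ_x of Theorem 3.7.

```python
import math
from collections import defaultdict

def exp3_update(log_w, q_tilde, eta, gamma, actions):
    """One step of the weight update w_{t+1}(x,a) = w_t(x,a) * exp(eta[x] * q_tilde[(x,a)]),
    followed by pi_{t+1}(a|x) = (1 - gamma[x]) * w(x,a) / sum_b w(x,b) + gamma[x] / |A|.
    Working with log-weights and subtracting the per-state maximum keeps the
    exponentials finite even after many episodes."""
    pi = defaultdict(dict)
    states = {x for (x, _) in log_w}
    for x in states:
        for a in actions:
            log_w[(x, a)] += eta[x] * q_tilde.get((x, a), 0.0)
        m = max(log_w[(x, a)] for a in actions)
        w = {a: math.exp(log_w[(x, a)] - m) for a in actions}
        total = sum(w.values())
        for a in actions:
            pi[x][a] = (1.0 - gamma[x]) * w[a] / total + gamma[x] / len(actions)
    return pi
```

Subtracting the per-state maximum before exponentiating is the same device as computing the action-selection probabilities from relative value differences, as suggested for Algorithm 4.2 in Chapter 4.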
3.6 Simulations

Figure 3.1: The two problem settings considered for the experiments.

We have conducted two sets of experiments to (i) further illustrate the differences between the algorithms, (ii) show that using tools designed for traditional MDPs can lead to poor performance in our setting, and (iii) show how the randomness of the MDP influences the performance of our algorithms. We consider two examples as the underlying MDP. The first one, illustrated in Figure 3.1(a), is a project-scheduling problem in which the learner repeatedly has to allocate resources to two projects. Both projects consist of a number of milestones (K_1 and K_2, respectively); a learning episode ends when both projects reach their respective final milestones. In each time step of the episode, the learner is in one of the states x = (i, j) described by the status of each project (i ∈ {1, 2, . . . , K_1}, j ∈ {1, 2, . . . , K_2}) and has to decide whether to focus on project 1 or project 2. These decisions are represented by actions a_1 and a_2, respectively. If the learner chooses a_1 in state (i, j) (where i < K_1 and j < K_2), then the probability of moving to the next state (i', j') is given by the following table:

                 i' = i       i' = i + 1    i' = min{i + 2, K_1}
    j' = j       0            0.9 · 0.2     0.9 · 0.8
    j' = j + 1   0.1 · 0.1    0.1 · 0.1     0.1 · 0.8

The roles of i' and j' are reversed for action a_2. When i = K_1, both actions yield the same transitions: j' = j + 1 with probability 0.2 and j' = min{j + 2, K_2} with probability 0.8. The transitions of i' follow the same rule when j = K_2. We assume that after each decision the learner is subject to some non-negative cost that depends on (i, j) and the chosen action; the goal of the learner is to minimize its total cost. Assuming that the costs can change after each episode, we can apply our algorithms to guarantee that the learner does nearly as well as the best state-dependent allocation policy, given the complete sequence of costs. Note that our methods need some adjustments to cope with this setting. First, costs should be represented as negative rewards, that is, we need to assume r_t : X × A → [−1, 0]. It is very easy to show that all statements of this chapter continue to hold under this assumption. Second, observe that the MDP structure described above does not conform to the assumption that all paths have a fixed length. This problem can be overcome either by the method described in Appendix A of György et al. [47], at the cost of slightly inflating the state space, or by redefining µ_t as µ_t(x) = P[x ∈ u_t | U_{t−1}]. Again, it is easy to check that our analysis carries through with this slight modification.

Our first goal with the experiments was to show that while algorithms assuming i.i.d. rewards can fail when the rewards are allowed to change over time, the algorithms presented and analyzed in this chapter continue to have good empirical performance. In particular, we have implemented Algorithms 3.1, 3.2, 3.3 and 3.5, and also a simplified version of UCRL [59] that maintains confidence intervals for the rewards only. We have set K_1 = K_2 = 10 and designed the reward sequence as follows: we set r_t(x, a) = −1 for all t = 1, 2, . . . , T/2 and all (x, a) ∈ X × A. For the remaining episodes, we used a function φ : {1, 2, . . . , 10} × {1, 2, . . . , 10} × A → [0, 1]
describing cost advantages that depend on the current status of each project. Using this function, we set the reward for state x = (i, j) as r_t((i, j), a) = φ((i, j), a)/8 − 1 for t = T/2 + 1, . . . , T/2 + T/16, and r_t((i, j), a) = φ((i, j), a)/8 + φ((j, i), a) − 1 for the remaining episodes. In other words, after introducing a small state-dependent cost advantage in episode T/2 + 1, we add a larger state-dependent advantage that rewards exactly the opposite actions to the previously introduced one.

The regret of our algorithms and of UCRL is plotted in Figure 3.2 as a function of the number of episodes. As expected, UCRL performs very well as long as the reward function remains unchanged, but fails to discover the second change in the rewards, since it becomes overly confident once it has found out about the first change. This results in linear regret for this algorithm. Notice that Algorithm 3.2 also fails on this learning problem, which can be attributed to the fact that this algorithm uses very little information about the underlying MDP structure when constructing its action-value estimates. Algorithm 3.3 and Algorithm 3.5 both perform well on this learning problem, since their regrets grow proportionally to √t. The explicit exploration episodes used by Algorithm 3.5 enable it to quickly discover both changes in the rewards, and thus help this method outperform all other methods on this problem.

Figure 3.2: The regret of the discussed algorithms in the project scheduling problem plotted against the number of episodes. Each curve is the average of 20 independent experiments; shaded areas around each curve represent adjusted empirical standard deviations of these runs. The switches between reward functions are marked with vertical lines.

The second set of experiments was conducted on a grid world of size 10 × 10, where in each episode the agent has to find the shortest path from the lower left corner to the upper right corner. The agent has two actions, both of which make it move either right or up: the “right” (“up”) action moves the agent right (respectively, up) with probability p, and up (respectively, right) with probability 1 − p. That is, we have L = 20, |X| = 100, 1/α = (1/(1 − p))^10, and κ = (p²/(1 − p))^5. Our goal with these experiments was to show how the performance of our algorithms changes as α gets close to zero. The experiment is run with T = 200,000, and the rewards are constructed in a similar fashion as in the first experiment: for all x = (i, j) ∈ {1, 2, . . . , 10}², we set r_t((i, j), a) = 0 for t = 1, . . . , T/2, r_t((i, j), a) = φ((i, j), a)/8 for t = T/2 + 1, . . . , T/2 + T/16, and r_t((i, j), a) = φ((i, j), a)/8 + φ((j, i), a) for the remaining episodes. To examine the dependence of the performance of each algorithm on α, we set φ(i, j) = I_{i=10, j∈{1,2,3,4,5}}, so that the learner has to discover the positive rewards hidden in the hard-to-reach corners. We have simulated the policies generated by Algorithms 3.1, 3.2, 3.3, 3.5 and UCRL for a number of values of p between 0.55 and 1. The attained regrets for each of these methods are plotted in Figure 3.3 against the resulting values of ln(1/α). We see that the performance of each algorithm deteriorates as 1/α grows, indicating that the learning problem does become harder as the problem gets closer to being deterministic.
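To make the role of p concrete, the following small simulation sketch (not part of the thesis; the clipping of moves at the grid boundary and the "always up" comparison policy are assumptions made purely for illustration) implements the grid-world dynamics described above and estimates how often a policy that commits to moving up early ever visits the rewarded states with i = 10 and j ≤ 5. The estimated probability collapses as p → 1, mirroring the value 1/α = (1/(1 − p))^10 quoted above.

```python
import random

def run_episode(policy, p, size=10):
    """Simulate one episode of the 10x10 grid world: start at (1, 1), stop at (size, size).
    Action 'right' moves right w.p. p and up w.p. 1 - p; action 'up' is the mirror image.
    Boundary handling is an assumption of this sketch: moves off the grid are clipped."""
    i, j, visited = 1, 1, set()
    while (i, j) != (size, size):
        visited.add((i, j))
        a = policy(i, j)
        go_right = (random.random() < p) if a == "right" else (random.random() >= p)
        if go_right and i < size:
            i += 1
        elif j < size:
            j += 1
        else:               # pressed against the top edge: the only remaining direction is right
            i += 1
    visited.add((i, j))
    return visited

def hit_probability(policy, p, runs=20000):
    """Fraction of episodes that visit a rewarded state (i = 10, j <= 5)."""
    hits = sum(any(i == 10 and j <= 5 for (i, j) in run_episode(policy, p))
               for _ in range(runs))
    return hits / runs

if __name__ == "__main__":
    always_up = lambda i, j: "up"      # a hypothetical policy that has locked onto moving up early
    for p in (0.55, 0.75, 0.95):
        print(p, hit_probability(always_up, p))
```

For p = 0.95 the empirical frequency is indistinguishable from zero, which matches the behaviour seen in Figure 3.3.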
The intuitive reason for this deterioration is that the closer the success probability p gets to 1, the less likely it becomes to accidentally discover new rewards hidden in distant parts of the state space. Since Algorithm 3.5 discovers hidden rewards with exponentially higher probability than our other algorithms do, its performance stays much closer to the optimal performance. Similarly to the previous experiments, we see that once UCRL discovers the first batch of non-zero rewards, it only changes its mind when it accidentally stumbles upon the second batch, and the probability of this event goes to zero rapidly as p increases. In summary, we conclude that the randomness of the MDP does influence the performance of our algorithms; however, this dependence seems to be much milder than 1/√α.

Figure 3.3: The regret of the discussed algorithms in the grid world problem plotted against ln(1/α). Each curve is the average of 20 independent experiments; shaded areas around each curve represent adjusted empirical standard deviations of these runs.

3.7 The proof of Lemma 3.2

For (x, a) ∈ X × A and t ∈ {1, . . . , T}, define x̃^(t)(x, a) ∈ X to be a random state whose distribution (given U_{t−1}) is P(·|x, a). We assume that the random variables

    (ã^(t)(x))_{x∈X},   (x̃^(t)(x, a))_{(x,a)∈X×A}        (3.42)

are conditionally independent of each other given U_{t−1}. Without loss of generality, we may assume that the state-action pairs along the random path u_t are obtained using

    x^(t)_{l+1} = x̃^(t)(x^(t)_l, a^(t)_l),    a^(t)_l = ã^(t)(x^(t)_l),    l = 0, 1, . . . , L − 1.

Now, given (x, a) ∈ X × A, define the random variables (x̂^(t)_k(x, a))_{l_x ≤ k ≤ L} and (â^(t)_k(x, a))_{l_x ≤ k ≤ L−1} as follows:

    x̂^(t)_k(x, a) = x                                                 if k = l_x,
    x̂^(t)_k(x, a) = x̃^(t)(x̂^(t)_{k−1}(x, a), â^(t)_{k−1}(x, a))        if l_x + 1 ≤ k ≤ L,

and

    â^(t)_k(x, a) = a                        if k = l_x,
    â^(t)_k(x, a) = ã^(t)(x̂^(t)_k(x, a))      if l_x + 1 ≤ k ≤ L − 1.

Thus, (x̂^(t)_{l_x}(x, a), â^(t)_{l_x}(x, a)) = (x, a) holds for any (x, a) ∈ X × A. Further,

    x̂^(t)_k(x^(t)_l, a^(t)_l) = x^(t)_k    and    â^(t)_k(x^(t)_l, a^(t)_l) = a^(t)_k,    0 ≤ l ≤ k ≤ L,        (3.43)

as can be seen by induction on k. Now, define the reward sequence (r^B_t)_{1≤t≤T} by

    r^B_t(x, a) = ( I_{x^(t)_{l_x} = x} / µ_t(x) ) Σ_{k=l_x}^{L−1} r_t(x̂^(t)_k(x, a), â^(t)_k(x, a)),    (x, a) ∈ X × A.        (3.44)

First notice that since µ_t(x) ≥ α and r_t(x, a) ∈ [0, 1] for all (x, a) ∈ X × A, we trivially have 0 ≤ r^B_t(x, a) ≤ (L − l_x)/α.

Next we show that the reward received by b(x) in episode t, denoted by r̄_t(x), is r^B_t(x, ã^(t)(x)). Consider first the case when x was not visited in episode t. In this case r̄_t(x) = 0. However, in this case r^B_t(x, ·) = 0 holds, too, and thus r̄_t(x) = r^B_t(x, ã^(t)(x)). Now, assume that x was visited in episode t. In this case, ã^(t)(x) = a^(t)_{l_x} and the reward fed to b(x) is

    r̄_t(x) = ( Σ_{k=l_x}^{L−1} r_t(x^(t)_k, a^(t)_k) ) / µ_t(x)
            = ( Σ_{k=l_x}^{L−1} r_t(x̂^(t)_k(x^(t)_{l_x}, a^(t)_{l_x}), â^(t)_k(x^(t)_{l_x}, a^(t)_{l_x})) ) / µ_t(x),

where we used (3.43). Since, under the assumption that x was visited in episode t, the second expression equals r^B_t(x, ã^(t)(x)), the proof of the claim r̄_t(x) = r^B_t(x, ã^(t)(x)) is finished.

Next we show that r^B_t(x, ·) and ã^(t)(x) are conditionally independent of each other, given U_{t−1}. To see why this holds, note that by its definition, r^B_t(x, a) can be written as a deterministic function of (x̃^(t)(x', a'); (x', a') ∈ X × A), (ã^(t)(x'); x' ∈ X \ {x}) and U_{t−1}.
Thus, thanks to (3.42), ã(t ) (x) and rBt (x, a) are indeed conditionally independent given Ut −1 because both of them are deterministic functions of some random variables, the respective sets of random variables are disjoint and the random variables they depend on are all conditionally independent given Ut −1 . This conditional independence and the definition (3.44) imply that one can assume that the partial reward functions rBt (x, ·) are selected by some suitable non-oblivious adversary interacting with b(x) for all x ∈ X. 57 It remains to show that (3.4) holds. Let l = l x . We have " # ¯ ¯ h i I{x=x(t ) } L−1 X ¯ (t ) ¯ (t ) l (t ) (t ) B E r t (x̂k (x, a), âk (x, a))¯xl , Ut −1 E rt (x, a)¯xl , Ut −1 = µt (x) k=l = I{x=x(t ) } l µt (x) (3.45) qt (x, a) , where the first equality follows from the definition of rBt and because µt is a function of Ut −1 . L−1 The second equality uses that the random state xl(t ) and the random path (x̂k(t ) (x, a), âk(t ) (x, a))k=l generated from (x, a) are conditionally independent given Ut −1 . Therefore, " E L−1 X k=l # ¯ ¯ ) ) r t (x̂(t (x, a), â(t (x, a))¯xl(t ) , Ut −1 k k " =E L−1 X k=l ¯ ¯ r t (x̂k(t ) (x, a), âk(t ) (x, a))¯Ut −1 # . Now, this last expectation equals qt (x, a), thanks to the definition of qt and (x̂k(t ) (x, a), âk(t ) (x, a))L−1 . k=l By taking the conditional expectation of both sides of (3.45) w.r.t. Ut −1 , using again that qt and h i µt are functions of Ut −1 , together with P x = xl(t ) |Ut −1 = µt (x) gives that ¯ h i ¯ E rBt (x, a)¯Ut −1 = qt (x, a) . Now, by the conditional independence of rBt (x, ·) and ã(t ) (x) given Ut −1 , we also get that £ ¤ £ ¤ E rBt (x, ã(t ) (x))|ã(t ) (x), Ut −1 = E qt (x, ã(t ) (x))|ã(t ) (x), Ut −1 . Taking the conditional expectation of both sides w.r.t. Ut −1 , we thus obtain that ¯ i X h ¯ E rBt (x, ã(t ) (x))¯Ut −1 = πt (a|x)qt (x, a) = vt (x) . a Taking expectations of both sides again, we get that (3.4) indeed holds. This finishes the proof of the lemma. 3.8 A technical lemma Lemma 3.7. Fix a finite set A, an integer T > 0, the reals 0 < γ < 1, 0 < η and the sequence of functions q 1 , . . . , q T : A → R. Define the sequence of functions, (w t )0≤t ≤T , (π̃t )1≤t ≤T (w t : A → R, π̃t : A → R) by w 0 = 1, à w t (a) = exp η tX −1 ! q s (a) , s=1 and π̃t (a) = P w t (a) . a∈A w t (a) 58 a ∈ A, Let (πt )1≤t ≤T be a sequence such that πt (a) ≥ (1 − γ)π̃t (a), Define Q T = PT t =1 q t , VT = PT t =1 v t , where vt = P 1 ≤ t ≤ T, a ∈ A . a∈A πt (a)q t (a). ηq t (a) ≤ 1, X a∈A πt (a)q t (a)2 ≤ M 2 X a∈A (3.46) Assume that 1 ≤ t ≤ T, a ∈ A , (3.47) 1 ≤ t ≤ T, (3.48) q t (a), for some appropriate constant M 2 > 0. Then, for any b ∈ A, Q T (b) − VT ≤ Proof. Let Wt = P a w t (a) and let c X ln |A| Q T (a) + γQ T (b) . + (e − 2)ηM 2 η a∈A = e −2. From the update rule and some elementary inequal- ities, we get the following upper bound on the ratio Wt +1 /Wt : Wt +1 X w t +1 (a) X w t (a) ηq t (a) = = e Wt Wt Wt a a X πt (a) ηq (a) ≤ e t a 1−γ X π (a) ³ ¡ ¢2 ´ t ≤ 1 + ηq (a) + c ηq (a) t t 1−γ by (3.46) e y ≤ 1 + y + c y 2 if y ≤ 1 and (3.47) a η = 1 + 1−γ ≤ 1+ X a cη2 πt (a)q t (a) + 1−γ X πt (a)(q t (a))2 a X η cη2 vt + M 2 q t (a). 1−γ 1−γ a by (3.48) Here, the second inequality uses (3.47). Taking logarithms and using ln(1 + x) ≤ x gives ln X Wt +1 η cη2 vt + M 2 q t (a). ≤ Wt 1−γ 1−γ a Summing over t we get ln X η cη2 WT +1 ≤ VT + M 2 Q T (a). W1 1−γ 1−γ a On the other hand, we have for any action b ∈ A that ln WT +1 w T +1 (b) ≥ ln = ηQ T (b) − ln |A| . 
Combining with (3.49), we get

    V_T ≥ (1 − γ) Q_T(b) − ln|A|/η − cηM_2 Σ_{a∈A} Q_T(a).

Reordering gives the desired inequality.

Chapter 4

Online learning in known unichain Markov decision processes

The topic of this chapter is learning in continuing MDPs where the rewards are allowed to change in every time step of the decision process. We assume that the MDP M = {X, A, P} satisfies Assumption M1 and that the transition function P is known to the learner. After defining the learning problem in Section 4.1, we show in Section 4.2 that it is possible to decompose the learning problem into smaller learning problems. To ease the presentation of our ideas, we first present the algorithm MDP-E of Even-Dar et al. [33] in Section 4.3. We then present and analyze our algorithm, called MDP-Exp3, in Section 4.4. A preliminary version of the main result of this chapter was published as [81], while the full version to be presented here is covered by [82]. The results of this chapter are summarized as follows:

Thesis 2. Proposed an efficient algorithm for online learning in known unichain MDPs. Proved performance guarantees for Settings 2a and 2b. The proved bounds are optimal in terms of the dependence on the number of time steps, up to logarithmic factors. [81, 82]

4.1 Problem setup

The purpose of this section is to provide the formal definition of our problem and to set the goals. Building on the formalism of unichain MDPs introduced in Section 2.3.2, we introduce the online version of Markov decision processes, called online MDP or O-MDP. The difference from the protocol described in Figure 2.4 is that the reward function is allowed to change arbitrarily in every time step, that is, the single reward function r is replaced by a sequence of reward functions (r_t)_{t=1}^T. This sequence is assumed to be fixed ahead of time by an oblivious adversary, and, for simplicity, we assume that r_t(x, a) ∈ [0, 1] for all (x, a) ∈ X × A and t ∈ {1, 2, . . .}. As in the online prediction framework described in Section 2.1, no other assumptions are made about this sequence. The learning agent is assumed to know the transition probabilities P, but is not given the sequence (r_t)_{t=1}^T. The protocol of interaction with the environment is unchanged: at time step t the agent selects an action a_t based on the information available to it, which is sent to the environment. In response, the reward r_t(x_t, a_t) and the next state x_{t+1} are communicated to the agent. The initial state x_1 is generated from a fixed (unknown) distribution µ_1.

Let the total expected reward collected by the agent up to time T be denoted by

    R̂_T = E[ Σ_{t=1}^T r_t(x_t, a_t) | P ].

As before, the goal of the agent is to make this sum as large as possible or, equivalently, to minimize the total expected regret against the pool of all stationary policies. The concept of regret for this case is defined as follows. Fix a stochastic stationary policy π : X × A → [0, 1] and let (x'_t, a'_t)_{t=1}^T be the trajectory that results from following policy π from x'_1 ∼ P_0 (in particular, a'_t ∼ π(·|x'_t)). The total expected reward of π over the first T time steps is defined as

    R^π_T = E[ Σ_{t=1}^T r_t(x'_t, a'_t) | P, π ].

Now, the total expected regret of the learning agent relative to the class of stationary policies is defined as

    L̂_T = sup_π R^π_T − R̂_T,

where the supremum is taken over all stochastic stationary policies in M.
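The algorithms of this chapter repeatedly need the average reward ρ^π and the differential value functions q^π, v^π of a fixed stationary policy, defined through the evaluation equations of Section 2.3.2 (see (2.5) and (2.6)). The sketch below shows one standard way to carry out this policy evaluation for a known unichain MDP. It is an illustration only and not taken from the thesis: the array-based encoding of P, r and π is hypothetical, and the solution of the evaluation equations is pinned down here by the usual normalization Σ_x µ^π_st(x) v^π(x) = 0.

```python
import numpy as np

def evaluate_policy(P, r, pi):
    """Average-reward evaluation of a stationary policy in a known unichain MDP.

    P  -- array of shape (X, A, X), P[x, a, y] = P(y | x, a)
    r  -- array of shape (X, A), the reward function
    pi -- array of shape (X, A), pi[x, a] = pi(a | x)
    Returns (rho, v, q, mu): gain, differential values, action values, stationary distribution."""
    X, A = r.shape
    P_pi = np.einsum('xay,xa->xy', P, pi)        # transition matrix of the induced Markov chain
    r_pi = np.einsum('xa,xa->x', r, pi)          # expected one-step reward under pi

    # Stationary distribution: mu P_pi = mu with sum(mu) = 1.
    A_mu = np.vstack([P_pi.T - np.eye(X), np.ones((1, X))])
    b_mu = np.concatenate([np.zeros(X), [1.0]])
    mu = np.linalg.lstsq(A_mu, b_mu, rcond=None)[0]

    rho = float(mu @ r_pi)                       # average reward

    # Differential values: (I - P_pi) v = r_pi - rho, normalized so that mu @ v = 0.
    A_v = np.vstack([np.eye(X) - P_pi, mu[None, :]])
    b_v = np.concatenate([r_pi - rho, [0.0]])
    v = np.linalg.lstsq(A_v, b_v, rcond=None)[0]

    q = r - rho + np.einsum('xay,y->xa', P, v)   # action-value Bellman equation
    return rho, v, q, mu
```

As a sanity check, Σ_a π(a|x) q(x, a) recovers v(x), that is, the second equation of the pair that reappears as (4.2) below.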
4.2 A decomposition of the regret

For the rest of the chapter, we suppose that P satisfies Assumption M1. As we have done before in Section 3.2, we introduce some additional concepts needed to break down the global learning problem into smaller individual learning problems: we reformulate the task of the learner from having to select actions a_t to having to select policies π_t in each time step t. Using this twist, we state a performance difference lemma in the vein of Lemma 3.1 to show a relationship between the average rewards of any fixed policy π and the chosen policy π_t.¹ Unlike in the O-SSP setting, this lemma alone is not sufficient for proving upper bounds on the regret, since the actual reward r_t(x_t, a_t) accumulated by the algorithm at time t can differ from the long-term average reward of the currently followed policy. We show that if the sequence of policies (π_t)_{t=1}^T satisfies a slow-change criterion, the expectation of r_t(x_t, a_t) stays close to the long-term average reward of π_t given r_t.

¹ That these concepts are well-defined follows from Assumption M1 and standard results to be found, for example, in the book by Puterman [86].

For defining the learner's policy, consider the trajectory {(x_t, a_t)} followed by a learning agent with x_1 ∼ µ_1. Define

    u_t = ( x_1, a_1, r_1(x_1, a_1), x_2, a_2, r_2(x_2, a_2), . . . , x_t, a_t, r_t(x_t, a_t) )        (4.1)

and introduce the policy followed in time step t, π_t(a|x) = P[a_t = a | u_{t−1}, x_t = x]. Note that π_t is computed based on past information and is therefore random. The average reward ρ^π_t of policy π is defined by replacing r by r_t in Equation (2.5). The action-value and value functions q^π_t and v^π_t are then defined as the action-value and value functions satisfying the Bellman equations (2.6), with the roles of r and ρ^π taken by r_t and ρ^π_t. We will use the following notation:

    q_t = q^{π_t}_t,    v_t = v^{π_t}_t,    ρ_t = ρ^{π_t}_t.

Thus, the following equations hold simultaneously for all (x, a) ∈ X × A:

    q_t(x, a) = r_t(x, a) − ρ_t + Σ_{x'} P(x'|x, a) v_t(x'),        v_t(x) = Σ_a π_t(a|x) q_t(x, a).        (4.2)

Following Even-Dar et al. [33], fix some policy π and consider the decomposition of the regret relative to π:

    R^π_T − R̂_T = ( R^π_T − Σ_{t=1}^T ρ^π_t ) + ( Σ_{t=1}^T ρ^π_t − Σ_{t=1}^T ρ_t ) + ( Σ_{t=1}^T ρ_t − R̂_T ).        (4.3)

The second term can be treated with the following performance difference lemma:

Lemma 4.1 (Performance difference lemma for unichain MDPs).² Let π, π̂ be two (stochastic stationary) policies. Assume that ρ^π, ρ^π̂ and the action-value function of π̂, q^π̂, are well-defined. Then, for any t ≥ 1,

    ρ^π_t − ρ^π̂_t = Σ_{x,a} µ^π_st(x) π(a|x) [ q^π̂_t(x, a) − v^π̂_t(x) ].

In particular, for any T > 1 and deterministic policy π,

    Σ_{t=1}^T ρ^π_t − Σ_{t=1}^T ρ_t = Σ_x µ^π_st(x) [ Q_T(x, π(x)) − V_T(x) ].

² This lemma does not need Assumption M1.

Remark 4.1. This lemma appeared as Lemma 4.1 in [33], but similar statements have been known for a while. For example, the book of Cao [21] also puts performance difference statements at the center of the theory of MDPs. For the sake of completeness, we include the easy proof:

Proof. We have

    Σ_{x,a} µ^π_st(x) π(a|x) q^π̂(x, a) = Σ_{x,a} µ^π_st(x) π(a|x) [ r(x, a) − ρ^π̂ + Σ_{x'} P(x'|x, a) v^π̂(x') ]
                                        = ρ^π − ρ^π̂ + Σ_x µ^π_st(x) v^π̂(x),

where the second equality uses that µ^π_st is stationary under π, that is, Σ_{x,a} µ^π_st(x) π(a|x) P(x'|x, a) = µ^π_st(x'). Reordering the terms gives the result.

The first and the last terms of (4.3) measure the difference between the sum of average rewards and the actual expected reward. The existence of a unique stationary distribution for any π (which follows from Assumption M1) ensures that these differences are not large in the case of a fixed policy.
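To see concretely why the difference between R^π_T and Σ_t ρ^π_t stays small for a fixed policy π, one can track how quickly the state distribution ν^π_t approaches the stationary distribution µ^π_st. The short sketch below is a purely illustrative numerical check, not taken from the thesis: it builds a random MDP with strictly positive transition probabilities (so that Assumption M1 certainly holds) together with a random stationary policy, and prints ‖ν^π_t − µ^π_st‖₁ for the first few time steps.

```python
import numpy as np

rng = np.random.default_rng(0)
X, A = 5, 3

# A random MDP with strictly positive transition probabilities and a random stationary
# policy; both are purely illustrative stand-ins for the unichain MDPs of this chapter.
P = rng.random((X, A, X)); P /= P.sum(axis=2, keepdims=True)
pi = rng.random((X, A));   pi /= pi.sum(axis=1, keepdims=True)
P_pi = np.einsum('xay,xa->xy', P, pi)

# Stationary distribution by power iteration, then the decay of ||nu_t - mu_st||_1.
mu = np.full(X, 1.0 / X)
for _ in range(10000):
    mu = mu @ P_pi
nu = np.zeros(X); nu[0] = 1.0          # x_1 drawn from a point mass here, for simplicity
for t in range(1, 21):
    print(t, np.abs(nu - mu).sum())    # decreases geometrically, at a rate governed by tau
    nu = nu @ P_pi
```

The printed distances decay geometrically, so their sum over t is bounded by a constant independent of T; this is exactly what Lemma 4.2 below makes precise in terms of the mixing time τ.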
Accordingly, the third term is bounded by a constant, O(τ): Lemma 4.2. For any T ≥ 1 and any policy π it holds that R Tπ − T X ρ πt ≤ 2τ + 2 . (4.4) t =1 We give the proof for completeness and because our proof corrects slight inaccuracies in [33]. Proof. Let {(xt , at )} be the trajectory when π is followed. Note that the difference R Tπ − π t =1 ρ t PT is due to the initial distribution of x1 being different from the stationary distribution of π. To quantify the difference, write R Tπ − T X t =1 ρ πt = T X X X (νπt (x) − µπst (x)) π(a|x)r t (x, a), t =1 x a where νπt (x) = P[xt = x] is the state distribution at time step t . Viewing νπt is a row vector, we have νπt = νπt −1 P π . Consider the t th term of the above difference. Then, using r t (x, a) ∈ [0, 1] we have X π X (νt (x) − µπst (x)) π(a|x)r t (x, a) ≤ kνπt − µπst k1 x a = kνπt −1 P π − µπst P π k1 ≤ e −1/τ kνπt −1 − µπst k1 ≤ . . . ≤ e −(t −1)/τ kνπ1 − µπst k1 ≤ 2e −(t −1)/τ . This, together with the elementary inequality PT t =1 e −(t −1)/τ ≤ 1+ R∞ 0 e −t /τ d t = 1 + τ gives the desired bound.3 Treating the first term of (4.3) requires more careful treatment since the policies πt change over time. The techniques needed to control the change rate of the policies and the translation 3 Even-Dar et al. [33] mistakenly uses kνπ − µπ k ≤ e −t /τ kνπ − µπ k in their paper (t = 1 immediately shows that t st 1 st 1 1 this can be false). See, e.g., the proofs of their Lemmas 2.2 and 5.2. 63 of this rate to the remaining term of the regret are different in the full information and the bandit settings, so we are going to treat these issues in the next sections separately. 4.3 Full information O-MDP In this section, we present the algorithm proposed by Even-Dar et al. [33] for the full-information version of the O-MDP problem. The algorithm is called MDP-E , and is shown as Algorithm 4.1. The idea of the algorithm is to use an instance e(x) of expert algorithm E with action set A in ¡ ¢T each state x, fed with reward sequence q t (x, ·) t =1 . The same idea used for proving Proposition 3.1 can be used to prove an upper bound on the main term of the regret, that is, the second term in Equation (4.3). However, as noted in the previous section, we still need to ensure that the policy sequence satisfies a slow-change condition. For the sake of completeness and the better understandability of our contributions, we state a performance guarantee for MDP-E for the case when EWA is used as the expert algorithm E . Algorithm 4.1 MDP-E: an algorithm for online learning in MDPs with full information 1. Set η > 0 and w 1 (x, a) = 1 for all (x, a) ∈ X × A. 2. For t = 1, 2, . . . , T , repeat: (a) For all (x, a) ∈ X × A, let w t (x, a) πt (a|x) = P . b w t (x, b) (b) Draw an action at ∼ πt (·|xt ). (c) Receive reward r t (xt , at ), observe r t and xt +1 . (d) Compute q t using the Bellman equations (2.6) with r t . (e) For all (x, a) ∈ X × A, set w t +1 (x, a) = w t (x, a)e ηq t (x,a) . Theorem 4.1. Let the transition probability kernel P satisfy Assumptions M1. Fix T > 0. Then for an appropriate choice of parameter η (which depends on |A|, T, τ), for any sequence of reward functions (r t )Tt=1 taking values in [0, 1], the regret of the algorithm MDP-E can be bounded as s µ b T ≤ 4τ + 2 + 2 T ln |A| 2τ3 + L ¶ 13 2 τ + 6τ + 2 . 2 (4.5) Remark 4.2. Note that the constants in this bound are different from those presented in Thep p orem 5.1 of Even-Dar et al. [33]. 
In particular, the leading term here is 8τ3/2 T ln |A|, while p their leading term is 4τ2 T ln |A|. The above bound both corrects some small mistakes in their calculations and improves them at the same time.4 4 One of the mistakes is in the proof of Theorem 4.1 of Even-Dar et al. [33] where, they failed to notice that q πt t 64 The proof of this result is presented partly for the sake of completeness and partly so that we can be more specific about the corrections required to fix the main result (Theorem 5.2) of Even-Dar et al. [33]. Further, the proof will also serve as a starting point for the proof of our main result, Theorem 4.2. Nevertheless, the inpatient reader may skip this next section and jump immediately to the proof of Theorem 4.2, which apart from referring to some general lemmas developed in the next subsection, is entirely self-contained. First note that since the algorithm gets to observe the complete reward function r t in each time step t = 1, 2, . . . , T , the policies (πt )Tt=1 do not depend on the trajectory traversed by the learner. Accordingly, we use the notations πt , q t , v t and ρ t to emphasize that all these variables are now non-random. Now consider the regret decomposition of Equation 4.3: R Tπ − RbT à = R Tπ − T X ρ πt ! à + t =1 T X ρ πt − t =1 T X ! à ρt + T X ! ρ t − RbT . t =1 t =1 As suggested by Lemma 4.1, the second term can be controlled by upper bounding the individual regrets maxa Q T (x, a) − VT (x) of the instances of EWA in each state x ∈ X. Assume for a π second that K > 0 is such that kq t t k∞ ≤ K holds for 1 ≤ t ≤ T . By Theorem 2.2 in [25], each individual regret can be bounded by ln |A| K 2 ηT + . η 8 (4.6) ¡ π ¢T Notice that q t t t =1 is a sequence that is sequentially generated from (r t )Tt=1 . It is Lemma 4.1 of [25] that shows that the bound of Corollary 2.2 of [25] continues to hold for such sequentially generated functions. Putting the inequalities together, we obtain T X ρ πt − t =1 T X t =1 ρt ≤ ln |A| K 2 ηT + . η 8 (4.7) According to the next lemma an appropriate value for K is 2τ + 4. The lemma is stated in a greater generality than what is needed here because the more general form shown will be useful later. Lemma 4.3. Pick any policy π in an MDP (P, r ). Assume that the mixing time of π is τ in the P sense of (2.4). If | a π(a|x)r (x, a)| ≤ R holds for any x ∈ X, then |v π (x)| ≤ 2R(τ + 1) holds for all x ∈ X. Furthermore, for any (x, a) ∈ X × A, |q π (x, a)| ≤ 2R(τ + 1) + R + |r (x, a)| ≤ (2τ + 4) · kr k∞ . Proof. As it is well known and is easy to see from the definitions, the (differential) value of policy π at state x can be written as v π (x) = ∞ X X s=1 x0 (νπs,x (x 0 ) − µπst (x 0 )) X π(a|x 0 )r (x 0 , a), a π can take on negative values. Thus, their Assumption 3.1 is not met by {q t t } (one needs to extend the upper bound given in their Lemma 2.2 with a lower bound and change Assumption 3.1). As a result, Assumption 3.1 cannot be used to show that the inequality in the proof of Theorem 4.1 holds. This mistake, as well as the others, can be easily corrected, as we show it here. 65 where νπs,x = e x (P π )s−1 is the state distribution when following π for s − 1 steps starting from P state x. Using the bound on a π(a|x 0 )r (x 0 , a) and the triangle inequality gives ∞ X X ¯ π ¯ ¯v (x)¯ ≤ R |νπs,x (x 0 ) − µπst (x 0 )| ≤ 2R (τ + 1) , s=1 x 0 where in the second inequality we used kνπs,x − µπst k1 ≤ 2e −(s−1)/τ and that P∞ s=1 e −(s−1)/τ ≤ τ+1 (cf. the proof of Lemma 4.2). This proves the first inequality. 
The second inequality follows from the first part and the Bellman equation: |q π (x, a)| ≤ |r (x, a)| + |ρ π | + X P (x 0 |x, a)|v π (x 0 )| ≤ |r (x, a)| + R + 2R(τ + 1). x0 Here, we used that ρ π = P P P π x µst (x) a π(a|x)r (x, a) and the bound on a π(a|x)r (x, a). Let us now consider the third term of (4.3), PT t =1 ρ t − RbT . The t th term of this difference is the difference between the average reward of πt and the expected reward obtained in step πt . If νt (x) is the distribution of states in time step t , T X ρ t − RbT = t =1 Thus, PT t =1 ρ t − RbT ≤ π µstt πt t =1 kµst PT T X X X (µπst (x) − νt (x)) π(a|x)r t (x, a). t =1 x a − νt k1 and so remains to bound the L 1 distances between the and νt . For this, we will use two general lemmas that will again come useful P later. Introduce the mixed norm kπk1,∞ = maxx a kπ(a|x)k. Clearly, kνP π −νP π̂ k1 ≤ kπ− π̂k1,∞ distributions holds for any two policies π, π̂ and any distribution ν (cf. Lemma 5.1 in [33]). The first lemma shows that the map π 7→ µπst as a map from the space of stationary policies equipped with the mixed norm k · k1,∞ to the space of distributions equipped with the L 1 -norm is τ-Lipschitz: Lemma 4.4. Let P be a transition probability kernel over X × A such that the mixing time of P is τ < ∞. For any two policies, π, π̂, it holds that kµπst − µπ̂st k1 ≤ τkπ − π̂k1,∞ . Proof. The statement follows from solving kµπst − µπ̂st k1 ≤ kµπst P π − µπ̂st P π k1 + kµπ̂st P π − µπ̂st P π̂ k1 ≤ e −1/τ kµπst − µπ̂st k1 + kπ − π̂k1,∞ for kµπst − µπ̂st k1 . The next lemma allows us to compare an n-step distribution under a policy sequence with the stationary distribution of the sequence’s last policy: Lemma 4.5. Let P be a transition probability kernel over X × A such that the mixing time of P is τ < ∞. Take any probability distribution µ0 over X , integer n > 0 and policies π1 , . . . , πn . 66 Consider the distribution µn = µ0 P π1 · · · P πn . Then, it holds that π kµn − µstn k1 ≤ 2e −n/τ + τ(τ + 1) max kπt − πt −1 k1,∞ . 2≤t ≤n Proof. Introduce c = max2≤t ≤n kπt − πt −1 k1,∞ . Now, π π π π π kµn − µstn k1 ≤ kµn − µstn−1 k1 + kµstn−1 − µstn k1 ≤ e −1/τ kµn−1 − µstn−1 k1 + τc , π π where we used that by the previous lemma kµstn−1 − µstn k1 ≤ τkπn−1 − πn k1,∞ ≤ τc. Continuing recursively we get π π kµn − µstn k1 ≤ e −1/τ {e −1/τ kµn−2 − µstn−2 k1 + τc} + τc .. . π ≤ e −(n−1)/τ kµ1 − µst1 k1 + τc{1 + e −1/τ + . . . + e −(n−2)/τ } π ≤ e −n/τ kµ0 − µst1 k1 + τ(τ + 1)c ≤ 2e −n/τ + τ(τ + 1)c , thus, finishing the proof. We are now ready to prove Theorem 4.1. π Proof of Theorem 4.1. Applying Lemma 4.5 to kνt − µstt k1 we get π kνt − µstt k1 ≤ 2e −t /τ + τ(τ + 1)K 0 , where K 0 is a bound on max2≤t ≤n kπt − πt −1 k1,∞ .5 Therefore, t X ρ t − RbT ≤ 2{e −1/τ + e −2/τ + . . .} + τ(τ + 1)K 0 T ≤ 2τ + τ(τ + 1)K 0 T . t =1 Thus, it remains to find an appropriate value for K 0 . It is a well known property of the EWA that π kπt (·|x) − πt −1 (·|x)k1 ≤ ηkq t t (·, x)k∞ (see, e.g, Theorem 8.13 and Exercise 8.1 in Bartók et al. π π [16]). Thus, kπt − πt −1 k1 ≤ ηkq t t k∞ . Now, by Lemma 4.3, kq t t k∞ ≤ 2τ + 4, showing that K 0 = 5 Lemma 5.2 of Even-Dar et al. [33] gives a bound on kν − µπt k . However, there are two mistakes in the proof t st 1 p of Lemma 5.2. One of the mistakes is that Assumption 3.1 states that K 0 = ln |A|/T , whereas since the range of the action-value functions scales with τ, K 0 should also scale with τ. (Unfortunately, in [81] we committed the same mistake.) 
This results in that the multiplier 2τ2 in the first term of the bound of Lemma 5.2 should be replaced by 2τ2 K 0 T , where K 0 is a bound on how fast the policies change (which does depend on τ). The second small mistake is that in the third displayed equation, the multiplier of kd 1 − d πt k1 should be e −(t −1)/τ as can be easily checked by considering t = 1. Finally, we note that the proof in Lemma 5.2 uses a different decomposition than our bound, leading to a small improvement compared to our bound in terms of the leading constant of the bound. We choose to present this alternate decomposition as we find it somewhat cleaner and it also gave us the opportunity to present the interesting Lemma 4.4. 67 η(2τ + 4) is suitable. Putting together the inequalities obtained, we get T X ρ t − RbT ≤ 2τ + (2τ + 4)τ(τ + 1)ηT . t =1 Combining (4.4), (4.7) and this last bound, we obtain R Tπ − RbT µ ¶ 13 2 ln |A| 3 + ηT 2τ + τ + 6τ + 2 . = 4τ + 2 + η 2 Setting s η= ln |A| ¡ ¢, 3 2 T 2τ + 13 2 τ + 6τ + 2 we get the bound stated in the theorem. 4.4 Bandit O-MDP Our algorithm, MDP-Exp3 , shown as Algorithm 4.2, is inspired by MDP-E, while also borrowing ideas from the Exp3 algorithm of Auer et al. [10]. The main idea of the algorithm is Algorithm 4.2 MDP-Exp3 : an algorithm for online learning in MDPs with bandit information 1. Set γ ∈ (0, 1], η > 0, N ≥ 1, and w1 (x, a) = w2 (x, a) = · · · = w2N (x, a) = 1 for all (x, a) ∈ X × A. 2. For t = 1, 2, . . . , T , repeat: (a) For all (x, a) ∈ X × A, let wt (x, a) γ πt (a|x) = (1 − γ) P + . w (x, b) | A | b t (b) Draw an action at ∼ πt (·|xt ). (c) Receive reward r t (xt , at ) and observe xt +1 . (d) If t ≥ N + 1 i. Compute µN t (x) for all x ∈ X using (4.10). ii. Construct estimates r̂t based on (xt , at ) and r t (xt , at ) using (4.8) and compute q̂t using (4.13). iii. For all (x, a) ∈ X × A, set wt +N (x, a) = wt +N −1 (x, a)e ηq̂t (x,a) . to construct estimates q̂t of the action-value functions qt , which are then used to determine the action-selection probabilities πt (·|x) in each state x in each time step t . In particular, the probability of selecting action a in state x at time step t is computed as the mixture of the uniform distribution (which encourages exploring actions irrespective of what the algorithm has learned about the action-values) and a Gibbs distribution, the mixture parameter being γ > 0. Given a state x, the Gibbs distribution defines the probability of choosing action a to be propor68 tional to exp(η Pt −N s=N q̂t (x, a)). 6 Here, η > 0, N > 0 are further parameters of the algorithm. Note that for the single-state setting with N = 1, MDP-Exp3 is equivalent to the Exp3 algorithm of Auer et al. [10]. It is interesting to discuss how the Gibbs exploration policy is related to what is known as the Boltzmann-exploration policy in the reinforcement learning literature [e.g., 92]. Remember that given a state x, the Boltzmann-exploration policy would select action a with probability proportional to exp(ηq̂∗t (x, a)) for some estimate q̂∗t of the optimal action-value function in the MDP (P, r t ). Thus, we can see a couple of differences between the Boltzmann exploration and our Gibbs policy. The first difference is that the Gibbs policy in our algorithm uses the cumulated sum of the estimates of action-values, while the Boltzmann policy uses only one estimate, the latest one. By relying on the sum, the Gibbs policy will rely less on the last estimate. This reduces how fast the policies can change, making the learning “smoother”. 
Another difference is that in our Gibbs policy the sum of previous action-values runs only up to step t − N instead of using the sum that runs up to the last step t . The reasons for doing this will be explained below. Finally, the Gibbs policy uses the action-value function estimates of the policies {πt } followed by the algorithm, as opposed to using an estimate of the optimal action value function. This makes our algorithm closer in spirit to (modified) policy iteration than to value iteration and is again expected to reduce the variance of the learning process. The reason the Gibbs policy does not use the last N estimates is to allow the construction of a reasonable estimate q̂t of the action-value function qt as this will be explained in more details shortly. If r t was available, one could compute qt based on r t (cf. (4.2)) and the sum could then run up to t , resulting in the algorithm MDP-E. Since in our problem r t is not available, we estimate it using an importance sampling estimator. To define this estimator define µN t (x) as the probability of visiting state x at time step t , conditioned on the history ut −N up to time step t − N , including xt −N and at −N (cf. (4.1) for the definition of {ut }): def µN t (x) = P [xt = x | ut −N ] , x ∈ X, . Then, the estimate of r t is constructed using r̂t (x, a) = r t (x,a) , πt (a|x)µN t (x) 0, if (x, a) = (xt , at ) ; (4.8) otherwise. Note that r̂t is defined only for t ≥ N + 1 (to allow using xt −N ), which explains why the sum in the definition of the Gibbs policy runs from time N + 1. The importance sampling estimator (4.8) is well-defined only if for x = xt , µN t (x) > 0 (4.9) 6 In the algorithm the Gibbs action-selection probabilities are computed in an incremental fashion with the help of the “weights” wt (x, a). Note that a numerically stable implementation would calculate the action-selection probP −N P −N abilities based on the relative value differences, ts=N q̂t (x, ·) − maxa∈A ts=N q̂ (x, ·). These relative value differ+1 t ences can also be updated incrementally. The form shown in Algorithm 4.2 is preferred for mathematical clarity. 69 holds almost surely (by construction πt (·|xt ) > γ/|X | > 0). An essential step of our proof is to show that inequality (4.9) indeed holds. In fact, we will show that this inequality holds almost surely (a.s.) for all x ∈ X provided that N is large enough. This will be done by first showing that the policies πt change “slowly” (this is where it becomes useful that the Gibbs policy is defined using a sum of previous action values), from which inequality (4.9) will follow thanks to Assumptions M1 and M2. This will be shown in Lemma 4.12. To see the intuitive reason of why (4.9) holds, it is instructive to look into how the distribution µN t can be computed. Denote by P a the transition probability matrix of the policy that selects action a in every state and recall that e x denotes the x th unit row vector of the canonical basis of the |X|dimensional Euclidean space. Viewing µN t as a row vector, we may write at −N πt −N +1 µN P · · · P πt −1 . t = e xt −N P (4.10) This holds because it follows from the form of the algorithm that xt −N , at −N , πt ∈ σ(ut −N ). (4.11) Thus, we also have that πt , πt −1 , . . . , πt −N ∈ σ(ut −N ) and therefore (4.10) follows from the law of total probability. Let us return to the discussion of why µN t (x) is bounded away from zero. 
If the policies during the last N steps change sufficiently slowly then they will all be “quite close” to the policy of the last time step. Then, the expression on the right-hand side of (4.10) can be seen to be close to the N -step state distribution of πt when starting from (x, a), which, if N is large enough, will be close to the stationary distribution of πt as it was explained in Section 2.3.2. π Since by Assumption M2, minx∈X µstt (x) ≥ α0 > 0 then, by choosing the algorithm’s parameters 0 appropriately, we can show that µN t (x) ≥ α /2 > 0 holds for all x ∈ X (specifically, this is shown in Lemma 4.12). It remains to be seen that the estimate r̂t is meaningful. In this regard, we claim that E [ r̂t (x, a)| ut −N ] = r t (x, a) a.s. (4.12) holds for all (x, a) ∈ X × A.7 First note that E [r̂t (x, a) | ut −N ] = r t (x, a) πt (a|x)µN t (x) £ ¤ E I{(x,a)=(xt ,at )} | ut −N , where we have exploited that πt , µN t , xt −N , at −N ∈ σ(ut −N ). Now, £ ¤ E I{(x,a)=(xt ,at )} | ut −N = P [at = a | xt = x, ut −N ] P [xt = x | ut −N ] 7 In what follows, for the sake of brevity, unless otherwise stated, we will omit “a.s.” from probabilistic statements. 70 and since πt ∈ σ(ut −N ), P [at = a | xt = x, ut −N ] = P [at = a | xt = x, ut −1 ] = πt (a|x), where the last equality follows from the definition of πt and at . Finally, πt −N +1 , . . . , πt −1 ∈ σ(ut −N ) and so P [xt = x | ut −N ] = P [xt = x | ut −N , πt −N +1 , . . . , πt −1 ] = µN t (x). Putting together the equalities obtained, we get (4.12). Given r̂t , the estimate q̂t of the action-value function qt is defined as the action-value function underlying policy πt in the the average-reward MDP given by the transition probability kernel P and reward function r̂t . Thus, q̂t can be computed as the solution to the Bellman equations q̂t (x, a) = r̂t (x, a) − ρ̂ t + X P (x 0 |x, a)v̂t (x 0 ) , v̂t (x) = πt (a|x)q̂t (x, a) , a x0 ρ̂ t = X X x,a π µstt (x)πt (a|x)r̂t (x, a) , (4.13) which hold simultaneously for all (x, a) ∈ X × A. By linearity of expectation, it then follows that E[ρ̂ t |ut −N ] = ρ t , and, hence, by the linearity of the Bellman equations and the uniqueness of the solutions of the equations defining qt and vt (see, e.g., Corollary 8.2.7 in [86]), we have, for all (x, a) ∈ X × A, E[q̂t (x, a)|ut −N ] = qt (x, a), and E[v̂t (x)|ut −N ] = vt (x). (4.14) As a consequence, we also have, for all (x, a) ∈ X × A, t ≥ N + 1, £ ¤ E[ρ̂ t ] = E ρ t , £ ¤ E[q̂t (x, a)] = E qt (x, a) , and E[v̂t (x)] = E [vt (x)] . (4.15) The main result of the chapter is the following bound concerning the performance of MDPExp3 . Theorem 4.2. Let the transition probability kernel P satisfy Assumptions M1 and M2. Fix T > 0 and let N = dτ ln T e+1. Then for an appropriate choice of the parameters η and γ (which depend on |A|, T, α0 , τ), for any sequence of reward functions {r t } taking values in [0, 1], the regret of the algorithm MDP-Exp3 can be bounded as s bT ≤ L T ln |A|(2τ + 2) µ ¶ p 4|A| (e − 2) + (e − 2)(τ + 2)) + (4τ + 5)| A |N + + o( T ). (2N α0 4N Remark 4.3. The above bound as well as the bound (4.5) do not depend directly on the number of states, |X|, but the dependence appears implicitly through τ only. Even-Dar et al. [33] also note that a tighter bound, where only the mixing times of the actual policies chosen appear, can be derived. However, it is unclear whether in the worst-case this could be used to improve the bound. Similarly to (4.5), our bound depends on |X| through other constants. In our case, these 71 are α0 and τ. 
Comparing the theorems it seems that the main price of not seeing the rewards seems to be the appearance of |A| instead of ln |A| (a typical difference between the bandit and full observation cases) and the appearance of a 1/α0 term in the bound. Now we present the proof of Theorem 4.2. Once again, consider the decomposition 4.3: à R Tπ − RbT = R Tπ − T X ρ πt ! à t =1 + T X ρ πt − T X ! à ρt + ! ρ t − RbT . t =1 t =1 t =1 T X Again, the first term is handled by Lemma 4.2. Thus, it remains to bound the expectation of the other two terms. This is done in the following two propositions whose proofs are deferred to the next sections: Proposition 4.1. Let ¡ ¢ 4τ + 5 + γ(2τ + 3) 4(e − 1) ¡ ¢ ³ ´, c = η max 2τ + 4 + γ(2τ + 2) 0 1 − 2η|A| 1 + 2τ + 2 α α0 γ 4 α0 and assume that γ ∈ (0, 1) (4.16) |A|(N − 1/2) ≥ 1, η≤ (4.17) α ¢³ 1 1 0 4|A| N − 2 ¡ c τ(τ + 2) < α0 /2, ½ µ N ≥ max τ ln γ + 2τ + 2 ´, (4.18) (4.19) ¶ ¾ 4 , τ ln T + 1. α0 − 2cτ(τ + 2) (4.20) Then, for any policy π, we have µ ¶ T X £ π ¤ 40 ln |A| E ρ t − ρ t ≤ N 2τ + 10 + T η c |A|(τ + 2)(e − 2)(2τ + 3) + α η t =1 µ µ ¶¶ 0 2 c + 2(τ + 2) T γ + c N |A| + η (e − 2) |A| 2τ + 4 + N . α γ Proposition 4.2. Assume that (4.19) and (4.20) hold. Then, T X £ ¤ E ρ t − RbT ≤ T c τ(τ + 2) + 2Te −(N −1)/τ + 2N . (4.21) t =1 Remark 4.4. Note that setting N > dτ ln T e, the second term in (4.21) becomes O(1). Also, if T is large enough, the choices of N , η and γ in Theorem 4.2 will satisfy the conditions of Proposition 4.1. Putting together the bounds obtained so far allows us to finish the proof of the theorem: Proof of Theorem 4.2. Set γ = K η for some positive constant K > 0. The remaining work is to 72 choose K so that all the assumptions in Propositions 4.1 and 4.2 stated about the parameters hold. First we have to ensure that the upper bound η≤ α0 ¢³ 1 1 ¡ 4|A| N − 2 K η + 2τ + 2 ´ ¡ ¢ 8|A| N − 1 2 is positive. Setting K = takes care of this problem. Now note that by our assumption α0 p ¡ ¢ 0 on η, we have c ≤ η α8 4τ + 5 + γ (2τ + 2) Collecting all o( T ) terms, we get µ µ ¶¶ 80 20 N b LT ≤ ηT (2τ + 2) K + (4τ + 5) (|A|N + τ(τ + 2)) + (e − 2)|A| 2τ + 4 + α α K p ln |A| + + o( T ) . η Plugging in the definition of K yields 4|A|N + (16τ + 20) (|A|N + τ(τ + 2)) + (e − 2)|A| (2τ + 4) e − 2 + α0 4N p ln |A| + + o( T ) . η b T ≤ ηT (2τ + 2) L µ ¶ Thus choosing v u u η=t ln |A| T (2τ + 2) ³ 4|A|N +(16τ+20)(|A|N +τ(τ+2))+(e−2)|A|(2τ+4) α0 + e−2 4N ´ gives the result as s 4|A|N + (16τ + 20) (|A|N + τ(τ + 2)) + (e − 2)|A| (2τ + 4) e − 2 T ln |A|(8τ + 8) + α0 4N p + o( T ) . µ ¶ bT ≤ L 4.5 The proofs of Propositions 4.1 and 4.2 This section provides proofs for the key propositions of the previous section. Throughout the section we suppose that both Assumptions M1 and M2 hold for P . The first half of the section provides technical tools needed to prove both propositions. 4.5.1 The change rate of the learner’s policies We proceed with a series of lemmas to control the rate of change of the policies generated by MDP-Exp3 . 73 0 all states x. Then, for any Lemma 4.6. Let N < t ≤ T and assume that µN t (x) ≥ α /2 holds for ³ ´ 2 2 1 (x, a) ∈ X × A, we have |v̂t (x)| ≤ 4τ+4 and − (2τ + 3) ≤ q̂ (x, a) ≤ + 2τ + 2 . 0 0 0 t α α α γ Proof. Since | P N 0 a πt (a|x)r̂t (x, a)| ≤ 1/µt (x) ≤ 2/α by assumption and v̂t = v πt , the first state- ment of the lemma follows from Lemma 4.3. To prove the bounds on q̂t , also notice that 0 ≤ ρ̂ t = X x,a π µstt (x) πt (a|x)r̂t (x, a) = X x π µstt (x) X πt (a|x)r̂t (x, a) ≤ a 2 . 
α0 Applying the above inequalities to the Bellman equations (4.13) we obtain q̂t (x, a) = r̂t (x, a) − P 2 2 ρ̂ t + x 0 P (x 0 |x, a)v̂t (x 0 ) ≥ − α20 − 4τ+4 α0 ´ = − α0 (2τ + 3). Using rt (x, a) ≤ γα0 we get the upper bound ³ 4τ+4 2 = α20 γ1 + 2τ + 2 . q̂t (x, a) ≤ γα 0 + α0 The previous result can be strengthened if one is interested in a bound on E [ |v̂t (x)| | ut −N ]: Lemma 4.7. Let N < t ≤ T and assume that µN t (x) > 0 holds for all states x. Then, for any x ∈ X, we have E [ |v̂t (x)| | ut −N ] ≤ 2(τ + 1). Proof. Proceeding as in the proof of Lemma 4.3 and then taking expectations, we get E [ |v̂t (x)| | ut −N ] ≤ ∞ X X s=1 x 0 |νπs,x (x 0 ) − µπst (x 0 )| E · X a ¯ ¸ ¯ ¯ πt (a|x )r̂t (x , a) ¯ ut −N , 0 0 where we have exploited that r̂t takes only nonnegative values. Now, by (4.11) and (4.12), E · X a ¯ ¸ X ¯ ¯ £ ¤ πt (a|x 0 )r̂t (x 0 , a) ¯¯ ut −N = πt (a|x 0 )E r̂t (x 0 , a) ¯ ut −N a = X πt (a|x 0 )r t (x, a), a which is bounded between 0 and 1. Hence, E [ |v̂t (x)| | ut −N ] ≤ P∞ P s=1 π 0 π 0 x 0 |νs,x (x ) − µst (x )|. Now, finishing as before we get the statement. ¯ £ ¤ Similarly, we will also need a bound on the expected value of E |q̂t (x, a)| ¯ ut −N . This is bounded as follows: Lemma 4.8. Let N < t ≤ T and assume that µN t (x) > 0 holds for all states x. Then, for any ¯¯ £¯ ¤ (x, a) ∈ X × A, we have E ¯q̂t (x, a)¯ ¯ ut −N ≤ 2(τ + 2). Proof. By the Bellman equations (4.13), ¯ ¯ ¯ ¤ X £ ¤ £ ¤ £ E |q̂t (x, a)| ¯ ut −N ≤ E [ |r̂t (x, a)| | ut −N ] + E |ρ̂ t | ¯ ut −N + P (x 0 |x, a)E |v̂t (x 0 )| ¯ ut −N . x0 ¯ £ ¤ As before, E [ |r̂t (x, a)| | ut −N ] ≤ 1, and also E |ρ̂ t | ¯ ut −N ≤ 1. Combining these with the result of the previous Lemma, we get the desired statement. 74 The quantity πt (x, a)|q̂t (x, a)| also enjoys a bound which is independent of the exploration rate γ: 0 Lemma 4.9. Let N < t ≤ T and assume that µN t (x) > α /2 holds for all states x. Then, for any ¯ ¯ 4 (x, a) ∈ X × A, it holds that πt (x, a) ¯q̂t (x, a)¯ ≤ 0 (τ + 2). α Proof. By assumption and the construction of r̂t (x, a), πt (x, a)|r̂t (x, a)| ≤ 2 . α0 Thus, we can apply Lemma 4.3 with R = 2/α0 to obtain |q̂t (x, a)| ≤ (4.22) 2 α0 (2(τ + 1) + 1) + |r̂t (x, a)|. Multiplying both sides by πt (x, a) and using (4.22) again finishes the proof. Now we show that if the policies that we follow up to time step t change slowly, µN t is “close” π to µstt : Lemma 4.10. Let 1 ≤ N < t ≤ T and c > 0 be such that kπs+1 − πs k1,∞ ≤ c holds for 1 ≤ s ≤ t − 1. ° π ° Then we have °µN − µ t ° ≤ c τ(τ + 2) + 2e −(N −1)/τ t st 1 Proof. This follows directly from Lemmas 4.4, 4.5 and the recursive form (4.10) for µN t . Indeed, at −N πt −N +1 this recursive form shows that µN P · · · P πt −1 . Thus, we write t = e xt −N P X¯ N ¯ ¯µ (x) − µπt (x)¯ = kµN − µπt k1 ≤ kµN − µπt −1 k1 + kµπt −1 − µπt k1 . t t t st st st st st x Now, according to Lemma 4.5, the first term can be bounded by 2e −(N −1)/τ + τ(τ + 1)c, while by Lemma 4.4, the second is bounded by τc. Adding these up gives the claimed bound. In the next two lemmas we compute the rate of change of the policies produced by MDPExp3 and show that for a large enough value of N , µN t can be uniformly bounded from below by α0 /2. 0 0 0 Lemma 4.11. Assume that for all N + 1 ≤ t ≤ T , µN t (x ) ≥ α /2 holds for all states x , γ ≤ 1, |A|(N − 1/2) ≥ 1, and η ≤ α³ ´ ¢ ¡ 4|A| N − 12 γ1 +2τ+2 0 Set ¡ ¢ 4τ + 5 + γ(2τ + 3) 4(e − 1) ¡ ¢ ´, ³ c = η max 2τ + 4 + γ(2τ + 2) . 0 1 − 2η|A| 1 + 2τ + 2 α α0 γ 4 α0 Then kπt +N −1 − πt +N k1,∞ ≤ c. (4.23) Proof. 
Fix some state-action pair (x, a) ∈ X × A. We will prove the statement by induction on t . To prove the bound for time step t assume that kπs+N −1 −πs+N k1,∞ holds for all s = 1, 2, . . . , t −1. As πs+N −1 = πs+N for all s = 1, 2, . . . , N , the assumption holds for t = 1, . . . , N and we are left with proving the induction step for t > N . 75 We start with the simple bound ¯ ¯ ¯ wt +N −1 (x, a) wt +N (x, a) ¯ ¯ |πt +N −1 (a|x) − πt +N (a|x)| =(1 − γ) ¯¯ − Wt +N −1 Wt +N (x) ¯ ¯ ¯ wt +N (x, a) Wt +N −1 (x) ¯¯ wt +N −1 (x, a) ¯¯ 1 − =(1 − γ) Wt +N −1 (x) ¯ wt +N −1 (x, a) Wt +N (x) ¯ ¯ ¯ ¯ wt +N −1 (x, a) ¯¯ ηq̂t (x,a) Wt +N −1 (x) ¯ =(1 − γ) 1−e ¯ Wt +N −1 (x) Wt +N (x) ¯ ¯ ¯ ¯ ¯ ηq̂t (x,a) Wt +N −1 (x) ¯ ¯ . ≤πt +N −1 (a|x) ¯1 − e Wt +N (x) ¯ (4.24) From now on, we examine two separate cases depending on the sign of the expression in the absolute value on the right-hand side. −1 (x) Case a) 1 − e ηq̂t (x,a) WWt +N ≥ 0: Using 1 − e z ≤ −z (that holds for all z ∈ R), we get t +N (x) ¯ ¯ ¯ ¯ ¯1 − e ηq̂t (x,a) Wt +N −1 (x) ¯ ≤ − ηq̂t (x, a) + ln Wt +N (x) ≤ 2η (2τ + 3) − ln Wt +N −1 (x) , ¯ Wt +N (x) ¯ Wt +N −1 (x) α0 Wt +N (x) where the second step holds by the lower bound on q̂t (x, a) given in Lemma 4.6. Applying Jensen’s inequality, the second term can be bounded as ! à ! à γ X wt +N (x, b) −ηq̂ (x,b) X πt +N (b|x) − |A| Wt +N −1 (x) t ln q̂t (x, b) = ln e ≥ −η Wt +N (x) 1−γ b Wt +N (x) b X η X ηγ =− q̂t (x, b) πt +N (b|x)q̂t (x, b) + 1−γ b |A|(1 − γ) b η X 2ηγ ≥− (2τ + 3) , πt +N (b|x)q̂t (x, b) − 0 1−γ b α (1 − γ) where the last step follows again by Lemma 4.6. Combining with (4.24) we have |πt +N −1 (a|x) − πt +N (a|x)| ≤ X 2η 2ηγ (2τ + 3) + η πt +N (b|x)q̂t (x, b) + 0 (2τ + 3) 0 α α b t +N X X−1 X¡ ¢ 2η (1 + γ)(2τ + 3) + η π (b|x)q̂ (x, b) + η πs+1 (b|x) − πs (b|x) q̂t (x, b) t t 0 α s=t b b µ ¶ 4η 2η 2η(N − 1)|A|c 1 ≤ 0 (1 + γ)(2τ + 3) + 0 (τ + 1) + + 2τ + 2 α α α0 γ µ ¶ X 2η 1 |πt +N −1 (b|x) − πt +N (b|x)| , + 0 + 2τ + 2 α γ b = b πt (b|x)q̂t (x, b) P and q̂t (x, b) from P Lemma 4.6 and the induction hypothesis. Taking maximum over a and using the bound b |πt +N −1 (b|x)− where the last inequality holds by the bounds on v̂t (x) = πt +N (b|x)| ≤ |A| maxa |πt +N −1 (a|x) − πt +N (a|x)|, we get after reordering max|πt +N −1 (a|x) − πt +N (a|x)| ≤ c 2η(N −1)|A| 1 α0 γ ³ a 76 ´ ¢ 2η ¡ + 2τ + 2 + α0 4τ + 5 + γ(2τ + 3) ³ ´ . 2η|A| 1 − α0 γ1 + 2τ + 2 From here the desired maxa |πt +N −1 (a|x) − πt +N (a|x)| ≤ c follows thanks to 2η(N −1)|A| 1 α0 γ ³ ´ + 2τ + 2 1 ³ ´ ≤ , 2η|A| 1 2 1 − α0 γ + 2τ + 2 ¢ 2η ¡ α0 4τ + 5 + γ(2τ + 3) ³ ´ 2η|A| 1 − α0 γ1 + 2τ + 2 and ≤ c , 2 where the first inequality follows by our assumption on η, while the second follows from the definition of c. This finishes the proof of case (a). −1 (x) ≤ 0: First notice that the logarithm of the second term is positive by Case b) 1 − e ηq̂t (x,a) WWt +N t +N (x) −1 (x) the condition, that is, ηq̂t (x, a) + ln WWt +N ≥ 0. Furthermore, it is bounded from above by 1 t +N (x) since ln Wt +N −1 (x) Wt +N (x) = − ln X wt +N −1 (x, a) b Wt +N −1 (x) e ηq̂t (x,b) ≤ −η X wt +N −1 (x, b) b Wt +N −1 (x) q̂t (x, b) (4.25) by Jensen’s inequality, and so ηq̂t (x, a) + ln X wt +N −1 (x, b) Wt +N −1 (x) q̂t (x, b) ≤ ηq̂t (x, a) − η Wt +N (x) b Wt +N −1 (x) µ ¶ µ ¶ 2η 1 2η 1 ≤ 0 2τ + 3 + + 2τ + 2 = 0 4τ + 5 + ≤ 1, α γ α γ where the second inequality holds by Lemma 4.6, and the third one by our assumption on η and |A|(N − 1/2) ≥ 1. 
Thus, using e z − 1 ≤ (e − 1)z, which holds for any 0 ≤ z ≤ 1, we get ¯ ¯ µ ¶ ¯ ¯ ¯1 − e ηq̂t (x,a) Wt +N −1 (x) ¯ ≤ (e − 1) ηq̂t (x, a) + ln Wt +N −1 (x) . ¯ W (x) ¯ W (x) t +N t +N Combining with (4.24), (4.25) and the bounds on q̂t provided in Lemma 4.6, we get |πt +N −1 (a|x) − πt +N (a|x)| ≤ η(e − 1)πt +N −1 (a|x)q̂t (x, a) µ ¶ X 2ηγ(e − 1) 1 − η(e − 1) πt +N −1 (b|x)q̂t (x, b) + + 2τ + 2 α0 γ b µ ¶ X 2ηγ(e − 1) 1 = −η(e − 1) + 2τ + 2 πt +N −1 (b|x)q̂t (x, b) + α0 γ b6=a = −η(e − 1) X πt (b|x)q̂t (x, b) − η(e − 1) t +N X−2 s=t b6=a X¡ ¢ πs+1 (b|x) − πs (b|x) q̂t (x, b) b6=a µ ¶ 2ηγ(e − 1) 1 + 2τ + 2 (telescoping) + α0 γ µ ¶ 2η(e − 1) 2η(e − 1)(N − 1)(|A| − 1) 1 ≤ (2τ + 3) + c + 2τ + 2 + α0 α0 γ µ ¶ 2ηγ(e − 1) 1 + 2τ + 2 (induction) α0 γ µ ¶ ¢ 2η(e − 1) ¡ 2η(e − 1)(N − 1)(|A| − 1) 1 = 2τ + 4 + γ(2τ + 2) + c + 2τ + 2 . α0 α0 γ The definition of c and the assumption on η imply that both the first and the second terms are 77 at most c/2, respectively, finishing the proof of the lemma. Lemma 4.12. Let c be as in Lemma 4.11. Assume that cτ(τ + 2) < α0 /2, and let » µ 4 N > τ ln 0 α − 2cτ(τ + 2) ¶¼ . (4.26) 0 Then, for all N < t ≤ T , x ∈ X, we have µN t (x) ≥ α /2 and kπt +1 − πt k1,∞ ≤ c. Proof. We prove the lemma by induction on t . The induction hypothesis is that for N + 1 ≤ t ≤ P 0 T , minx µN s (x) ≥ α /2 and maxx a |πs+1 (a|x) − πs (a|x)| ≤ c hold for all N + 1 ≤ s ≤ t . Let us first show that this hypothesis holds when N + 1 ≤ t ≤ 2N − 1. By the construction P of the policies, we have maxx a |πt +1 (a|x) − πt (a|x)| = 0 ≤ c for all 1 ≤ t ≤ 2N − 1. Thus, by π −(N −1)/τ t Lemma 4.10, we get that kµN holds for all N + 1 ≤ t ≤ 2N − 1. By t − µst k1 ≤ cτ(τ + 2) + 2e our assumption about N , we have cτ(τ + 2) + 2e −(N −1)/τ ≤ α0 /2, (4.27) thus for any N + 1 ≤ t ≤ 2N − 1, π π N 0 t t kµN t − µst k∞ ≤ kµt − µst k1 ≤ α /2. (4.28) π Since, by assumption, µπst (x 0 ) ≥ α0 holds for any stationary policy π, we also have µstt (x 0 ) ≥ α0 . 0 This, together with (4.28) gives that µN t (x) ≥ α /2 holds for any x ∈ X. Now, fix a time index 2N ≤ t ≤ T and assume that the induction hypothesis holds for time 0 t − 1. Then, thanks to minx µN t −N +1 (x) ≥ α /2, Lemma 4.11 implies kπt +1 − πt k1,∞ ≤ c. Now, by π −(N −1)/τ t Lemma 4.10, we have kµN Using the same reasoning as above, t − µst k1 ≤ cτ(τ + 2) + 2e we finish the inductive step and thus the proof. 4.5.2 Proof of Proposition 4.1 The statement is trivial for T ≤ N . Therefore, in what follows assume that T > N . For every x, a P P define QT (x, a) = Tt=N +1 qt (x, a) and VT (x) = Tt=N +1 vt (x). The preceding lemma shows that in order to prove Proposition 4.1, it suffices to prove an upper bound on E [QT (x, a) − VT (x)]. l ³ ´m 4 Lemma 4.13. Let c be as in Lemma 4.11. Assume that γ ∈ (0, 1), cτ(τ+2) < α0 /2, N ≥ τ ln α0 −2cτ(τ+2) , 0<η≤ α0 2(1/γ +2τ+3) , and T > N hold. Then, for all (x, a) ∈ X × A, µ ¶ 40 ln |A| E [QT (x, b) − VT (x)] ≤ N 4(τ + 2) + T η c |A|(τ + 2)(e − 2)(2τ + 3) + α η µ µ ¶¶ 0 2 c + 2(τ + 2) T γ + c N |A| + η (e − 2) |A| 2τ + 4 + N . α γ Proof. We follow the steps of the proof in Auer et al. [10]. Fix (x, a) ∈ X × A. For any 1 ≤ t ≤ T P define Wt (x) = a wt (x, a). First, note that since the conditions of Lemma 4.12 are satisfied, 78 hence, the conclusions of this lemma, as well as those of Lemmas 4.6–4.9 hold. In particular, by def 4 α0 Lemma 4.9, πt (a|x)|q̂t (x, a)| ≤ B q = (τ + 2) holds for any s ≥ N + 1. 
Now, using Lemma 4.3 and the Bellman equations, we get that q̂t (x, a) ≤ 20 α (1/γ + 2τ + 3), thus by the constraint on η, ηq̂t (x, a) ≤ 1. Fix 2N ≤ t ≤ T − 1. We have the following inequalities: Wt +1 (x) X wt +1 (x, a) X wt (x, a) ηq̂t −N +1 (x,a) X πt (a|x) − γ/|A| ηq̂t −N +1 (x,a) = = e = e Wt (x) Wt (x) 1−γ a a Wt (x) a X πt (a|x) − γ/|A| ³ ¡ ¢2 ´ ≤ 1 + ηq̂t −N +1 (x, a) + (e − 2) ηq̂t −N +1 (x, a) 1−γ a (as ηq̂t −N +1 (x, a) ≤ 1) ≤ 1+ η2 (e − 2) X η X πt (a|x)q̂t −N +1 (x, a) + πt (a|x)(q̂t −N +1 (x, a))2 . 1−γ a 1−γ a Now rewrite the last term as X a πt (a|x)(q̂t −N +1 (x, a))2 = X a + πt −N +1 (a|x)(q̂t −N +1 (x, a))2 X a (πt (a|x) − πt −N +1 ) (q̂t −N +1 (x, a))2 By Lemma 4.9, the first term can be bounded as follows: X a πt −N +1 (a|x)(q̂t −N +1 (x, a))2 ≤ X¯ ¯ 40 (τ + 2) ¯q̂t −N +1 (x, a)¯ , α a while the second one is bounded by X a µ ¶ X¯ ¯ 2 1 ¯q̂t −N +1 (x, a)¯ , + 2τ + 3 (πt (a|x) − πt −N +1 ) (q̂t −N +1 (x, a)) ≤ c N 0 α γ a 2 where we have used the bound on the change rate of the policies and the upper bound on |q̂t (x, a)|. Thus, using the notation B q0 µ ¶ 40 2 1 = (τ + 2) + c N 0 + 2τ + 3 , α α γ we get X a Defining v̂N t (x) = P πt (a|x)(q̂t −N +1 (x, a))2 ≤ B q0 X¯ ¯ ¯q̂t −N +1 (x, a)¯ . a a πt (a|x)q̂t −N +1 (x, a), we obtain B q0 η2 (e − 2) X ¯ ¯ Wt +1 (x) η N ¯q̂t −N +1 (x, a)¯ . ≤ 1+ v̂t (x) + Wt (x) 1−γ (1 − γ) a 79 Using 1 + x ≤ e x and then taking logarithms gives ln B q0 η2 (e − 2) X ¯ ¯ Wt +1 (x) η N ¯q̂t −N +1 (x, a)¯ . ≤ v̂t (x) + Wt (x) 1−γ (1 − γ) a Summing over t = 2N , 2N + 1, . . . , T we get ln with V̂TN (x) = PT t =2N B q0 η2 (e − 2) T −N X+1 X ¯ ¯ WT +1 (x) η ¯q̂t (x, a)¯ ≤ V̂TN (x) + W2N (x) 1−γ (1 − γ) t =N +1 a (4.29) v̂N t (x). On the other hand, for any action b we have ln T −N X+1 wT +1 (x, b) WT +1 (x) ≥ ln =η q̂t (x, b) − ln |A|, W2N (x) W2N (x) t =N +1 where we used that w2N (x, a) = 1 holds for all a ∈ A. Combining with (4.29), we get V̂TN (x) ≥ (1 − γ)Q̂TN (x, b) − where Q̂TN (x, b) = PT −N +1 t =N +1 T −N X+1 X ¯ ¯ ln |A| ¯q̂t (x, a)¯ , − B q0 η(e − 2) η t =N +1 a (4.30) q̂t (x, b). Let us now bound the difference of V̂TN (x) and V̂T (x) = T X V̂TN (x) = πt (a|x)q̂t (x, a). t =N +1 a t =N +1 Note that T X X v̂t (x) = T −N X+1 X t =N +1 a πt +N −1 (a|x)q̂t (x, a). Therefore, V̂TN (x) − V̂T (x) ≤ ≤ T −N X+1 X t =N +1 a T −N X+1 ¯ ¯ ¯ ¯ |q̂t (x, a)|¯πt +N −1 (a|x) − πt (a|x)¯ + T X ≤Nc t =N +1 ° ° q̂t (x, ·) ° + ∞ T X X πt (a|x)|q̂t (x, a)| t =T −N +2 a T X X ° ° k πt +N −1 (·|x) − πt (·|x) k1 ° q̂t (x, ·) °∞ + t =N +1 T −N X+1 ° X πt (a|x)|q̂t (x, a)| t =T −N +2 a πt (a|x)|q̂t (x, a)| , t =T −N +2 a where we have used that by Lemma 4.11, k πt +N −1 − πt k1,∞ ≤ N c, that is, the policies change slowly. Taking the expectation of both sides, we get T −N X+1 £° ° ¤ £ ¤ £ ¤ E V̂TN (x) − E V̂T (x) ≤ N c E ° q̂t (x, ·) °∞ + t =N +1 80 T X t =T −N +2 E · X a ¸ πt (a|x)|q̂t (x, a)| . By Lemma 4.8, £ ¤ def E |q̂t (x, a)| ≤ B q00 = 2(τ + 2) (4.31) ° ¤ £° holds for any (x, a) ∈ X × A. Therefore, E ° q̂t (x, ·) °∞ ≤ |A|B q00 . Using again (4.31), E · X a ¯ ¸¸ ¸ · · X ¯ πt (a|x)|q̂t (x, a)| = E E πt (a|x)|q̂t (x, a)|¯¯ ut −N a · ¸ X ¯ £ ¤ ¯ =E πt (a|x) E |q̂t (x, a)| ut −N a · ¸ X 00 ≤E πt (a|x) B q ≤ B q00 . a ¤ £ ¤ £ Hence, E V̂TN (x) ≤ E V̂T (x) + N B q00 (c T |A| + 1). This, together with (4.30) gives ¤ ln |A| £ ¤ £ E V̂T (x) + N B q00 (c T |A| + 1) ≥(1 − γ)E Q̂TN (x, b) − η T −N +1 X X £¯ ¯¤ − B q0 η(e − 2) E ¯q̂t (x, a)¯ . 
(4.32) t =N +1 a ¤ £ By equation (4.15), we have E V̂T (x) = E [VT (x)]and with the definition QTN (x, b) = T −N X+1 qt (x, b), t =N +1 £ ¤ £ ¤ we also have E Q̂TN (x, b) = E QTN (x, b) . Thus, using (4.31) again, we get £ ¤ ln |A| E [VT (x)] + N B q00 (c T |A| + 1) ≥ (1 − γ)E QTN (x, b) − − η(e − 2) B q0 B q00 T |A|. η By reordering the terms, this becomes £ ¤ £ ¤ ln |A| E QTN (x, b) − VT (x) ≤γE QTN (x, b) + N B q00 (c T |A| + 1) + η (4.33) + η(e − 2) B q0 B q00 T |A|. We now lower bound QTN (x, a) by QT (x, a): QT (x, b) − QTN (x, b) = T X qt (x, b) ≤ B q00 N , (4.34) t =T −N +2 where we used that by Lemma 4.3, qt (x, b) ≤ B q00 = 2(τ + 2) 81 (4.35) since the rewards are bounded between 0 and 1. Combining (4.34) with (4.33), we obtain £ ¤ E [QT (x, b) − VT (x)] ≤ γE QTN (x, b) + N B q00 (c T |A| + 1) ³ ´ ln |A| + B q00 η(e − 2) B q0 T |A| + N . + η £ ¤ Using (4.35) again, we get E QTN (x, b) ≤ B q0 T . Thus, E [QT (x, b) − VT (x)] ≤ 2B q00 N + ³ ´ ln |A| + T B q00 γ + N |A|c + η(e − 2) B q0 |A| . η Using the definitions of B q0 and B q00 , we arrive at the final result: ¶ ln |A| 40 E [QT (x, b) − VT (x)] ≤ N 4(τ + 2) + T η c |A|(τ + 2)(e − 2)(2τ + 3) + α η µ µ ¶¶ 0 2 c + 2(τ + 2) T γ + c N |A| + η (e − 2) |A| 2τ + 4 + N . α γ µ The proof of Proposition 4.1 is now easy: Proof of Proposition 4.1. Under the conditions of the proposition, combining Lemmas 4.1-4.13 yields T X X £ ¤ E ρ πt − ρ t ≤ 2N + µπst (x)π(a|x) E [QT (x, a) − VT (x)] t =1 x,a ¶ µ ln |A| 40 ≤ N 2τ + 10 + T η c |A|(τ + 2)(e − 2)(2τ + 3) + α η µ µ ¶¶ 20 c + 2(τ + 2) T γ + c N |A| + η (e − 2) |A| 2τ + 4 + N , α γ proving Proposition 4.1. 4.5.3 Proof of Proposition 4.2 ¤ £ ¤ £P π Let t > N . First, since πt is σ(ut −N )-measurable, E ρ t = E x µstt (x)E [ r t (x, at )| ut −N ] . We also have E [r t (xt , at )] = E [E [ r t (xt , at )| ut −N ]] = E 82 · X x µN t (x) E [ r t (x, at )| ut −N ] ¸ . Hence, · ¸ X π £ ¤ N t E ρ t − r t (xt , at ) = E (µst (x) − µt (x)) E [ r t (x, at )| ut −N ] x · ¯ ¯¸ X¯ π ¯ ≤E (x) ¯µstt (x) − µN ¯ , t x where we have used that r t (x, a) ∈ [0, 1]. ¯ P ¯ π ¯ Thanks to Lemma 4.12, Lemma 4.10 is applicable. Hence, x ¯µstt (x) − µN t (x) ≤ cτ(τ + 2) + £ ¤ 2e −(N −1)/τ , and thus E ρ t − r t (xt , at ) ≤ cτ(τ + 2) + 2e −(N −1)/τ . Summing up these inequalities ¡P £ ¤¢ for t = N +1, . . . , T , we get Tt=N +1 E ρ t −RbT ≤ T cτ(τ+2)+2Te −(N −1)/τ Using the trivial bound £ ¤ E ρ t − r t (xt , at ) ≤ 2 for the first N terms, we get the desired result. 83 Chapter 5 Online learning in unknown stochastic shortest path problems In this chapter, we consider the setting when the learner aims to minimize its regret against an arbitrary sequence of reward functions but for the case when the Markovian dynamics is unknown. As it turns out, this problem is considerably more challenging than the one studied in the previous chapters, when the Markovian dynamics is known a priori. That said, there exists an alternate thread of previous theoretical works that assume an unknown Markovian dynamics over finite state and action spaces [59, 15]. In these works, however, the reward function is fixed (known or unknown). That in the problem we study both the Markovian dynamics is unknown and the reward function is allowed to be an arbitrary sequence forces us to explore an algorithm design principle that remained relatively less-explored so far. The results presented in this chapter were published in Neu et al. [79]. © ª The assumptions we make about the underlying MDP\r X, A, P are the same as in Chap- ter 3. 
In particular, we assume that P satisfies Assumption S1. Besides supposing that the learner does not know P, we have to assume that the reward function r_t is completely revealed after episode t. The results of this chapter are summarized below.

Thesis 3. Proposed an efficient algorithm for online learning in unknown stochastic shortest path problems. Proved performance guarantees for Setting 1c. The proved bounds are optimal in terms of the dependence on the number of episodes. [79]

The algorithm that we present combines previous work on online learning in Markov decision processes with work on online prediction in adversarial settings in a novel way. In particular, we combine ideas from the UCRL-2 algorithm of Jaksch et al. [59], developed for learning in finite MDPs, with the Follow-the-Perturbed-Leader (FPL) prediction method for arbitrary sequences [51, 64], while for the analysis we also borrow some ideas from the algorithms described in the previous chapters (which in turn build on the fundamental ideas of Even-Dar et al. [32, 33]). However, in contrast to our previously presented work, where we rely on one exponentially weighted average forecaster (EWA) run locally in each state, we use FPL to compute the policy to be followed globally. The main reason for this change is twofold. First, we need a Follow-the-Leader-Be-the-Leader-type inequality for the policies we use, and we could not prove such an inequality when the policies are obtained implicitly by placing experts into the individual states. Second, we could not find a computationally efficient way of implementing EWA over the space of policies, hence we turned to the FPL approach, which admits a computationally efficient implementation. The FPL technique has been explored by Yu et al. [105] and Even-Dar et al. [32, 33] for the case when P is known. It is also interesting to note that even when the reward function is fixed, no previous work has analyzed the FPL technique under unknown MDP dynamics. Our analysis also sheds some light on the role of the confidence intervals and the principle of "optimism in the face of uncertainty" in online learning, as presented in Section 2.2.

After setting the goals of this chapter in Section 5.1, we introduce a more convenient form of the regret in Section 5.2. We present our algorithm in Section 5.3 and analyze it in Section 5.4.

5.1 Problem setup

The problem setup is very similar to the one presented in Section 3.1: we assume that the Markovian dynamics {X, A, P} satisfies Assumption S1, and the learning protocol described on Figure 2.3 uses time-dependent reward functions r_t instead of r in each episode t. As before, we make no statistical assumptions about the sequence (r_t)_{t=1}^{T} other than that its elements have to be chosen in advance from the set of mappings X × A → [0, 1]. Contrary to the previously presented setting, we assume that P is unknown to the learner, which makes the problem fundamentally more difficult, as the learning procedure now has to deal with both the stochastic and the adversarial components of the environment. We assume that the reward functions r_t are completely revealed after each episode. The learner has to minimize the same regret criterion that was defined in Section 3.1: let
\[
\widehat{R}_T = \mathbb{E}\left[ \sum_{t=1}^{T} \sum_{l=0}^{L-1} r_t\bigl(x_l^{(t)}, a_l^{(t)}\bigr) \,\middle|\, P \right]
\]
denote the total expected reward of the learner and let
\[
R_T^{\pi} = \mathbb{E}\left[ \sum_{t=1}^{T} \sum_{l=0}^{L-1} r_t\bigl(x_l', a_l'\bigr) \,\middle|\, P, \pi \right]
\]
stand for the total expected reward of the policy π.
Then, the total expected regret can be written using b T = sup R π − RbT . L T π 85 5.2 Rewriting the regret The techniques presented in this chapter are fundamentally different from the previously presented ones. Instead of using the decomposition presented in Section 3.2, we treat the learning problem as a global sequential decision problem where the learner has to select a sequence of deterministic policies (πt )Tt=1 . We still make use of value functions as defined in Section 2.3.1, but in our setting, we also have to compute value functions defined for transition functions other than P . Furthermore, the global decision-making only uses values of the initial state: all decisions are based on v π (x 0 ). Below, we introduce a formalism that can accommodate for different transition functions. A deterministic policy π (or, in short: a policy) is a mapping π : X → A. We say that a policy π is followed in an SSP problem if the action in state x ∈ X is set to be π(x), independently of previous states and actions. A random path u = (x0 , a0 , . . . , xL−1 , aL−1 , xL ) is said to be generated by policy π under the transition model P if x0 = x 0 and xl +1 ∈ Xl +1 is drawn from P (·|xl , π(xl )) for all l = 0, 1, . . . , L − 1. We denote this relation by u ∼ (π, P ). Using the above notation we can define the value of a policy π, given a fixed reward function r and a transition model P as ¯ # ¯ ¯ W (r, π, P ) = E r (xl , π(xl ))¯ u ∼ (π, P ) , ¯ l =0 " L−1 X that is, the expected sum of rewards gained when following π in the MDP defined by r and P . In our problem, we are given a sequence of reward functions (r t )Tt=1 . We will refer to the partial P sum of these rewards as R t = ts=1 r t . Using the notations of Section 2.3.1, we have v tπ (x 0 ) = W (r t , π, P ) Vtπ (x 0 ) = W (R t , π, P ). and In what follows, we will omit explicit references to the initial state x 0 , since these are the only values that we consider. We can phrase our goal as coming up with an algorithm that generates a sequence of policies (πt )Tt=1 that minimizes the total expected regret b T = max V π − E L T π " T X t =1 π vt t # . At this point, we mention that regret minimization demands that we continuously incorporate each observation as it is made available during the learning process. Assume that we decide to use the following simple algorithm: run an arbitrary exploration policy πexp for 0 < K < T episodes, estimate a transition function P̄, then, in the remaining episodes, run the algorithm described in Chapter 3 using P̄. It is easy to see that this method attains a regret of p O(K + T / K ), which becomes O(T 2/3 ) when setting K = O(T 2/3 ). Also, the regret of this algorithm would scale very poorly with the parameters of the SSP. 86 5.3 Follow the perturbed optimistic policy In this section we present our FPL-based method that we call “follow the perturbed optimistic policy” (FPOP). The algorithm is shown as Algorithm 5.1. One of the key ideas of our approach, borrowed from Jaksch et al. [59], is to maintain a confidence set of transition functions that contains the true transition function with high probability. The confidence set is constructed in such a way that it remains constant over random time periods, called epochs, and the number of epochs (KT ) is guaranteed to be small relative to the time horizon (details on how to select the confidence set are presented in Section 5.3.1). We denote the i -th epoch by E i , while the corresponding confidence set is denoted by Pi . 
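Both the naive explore-then-exploit scheme of the previous section and the optimistic selections made by FPOP rest on evaluating W(r, π, P̄), the total expected reward of a fixed policy under a fixed (possibly estimated) transition model, as introduced in Section 5.2. The sketch below illustrates how this quantity can be evaluated by backward induction over the layers of the loop-free SSP; it is our own illustration rather than code from the thesis, and the data structures layers, P_bar, r and policy are hypothetical names.

```python
def policy_value(layers, P_bar, r, policy):
    """Evaluate W(r, pi, P_bar): the expected total reward of a deterministic
    policy pi in a loop-free SSP with transition model P_bar and reward r.

    layers  -- list [X_0, ..., X_L] of the disjoint layers; X_0 = [x0] holds the
               initial state and X_L holds the terminal state
    P_bar   -- P_bar[(x, a)] maps next-layer states x' to their probabilities
    r       -- r[(x, a)] is the reward of the pair (x, a), a value in [0, 1]
    policy  -- policy[x] is the action the deterministic policy takes in x
    """
    v = {x: 0.0 for x in layers[-1]}            # terminal layer has value zero
    for l in range(len(layers) - 2, -1, -1):    # backward induction over layers
        for x in layers[l]:
            a = policy[x]
            v[x] = r[(x, a)] + sum(p * v[x_next]
                                   for x_next, p in P_bar[(x, a)].items())
    (x0,) = layers[0]
    return v[x0]                                # W(r, pi, P_bar) = v(x0)
```

The selection rule of FPOP, presented next, maximizes this quantity jointly over policies and over transition models in the current confidence set; the extended dynamic programming routine of Section 5.3.2 carries out this joint maximization without enumerating the policies.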
Now consider the simpler problem where the reward function r is assumed to be constant and known throughout all episodes t = 1, 2, . . . , T . One can directly apply UCRL-2 of Jaksch et al. [59] and select policy and transition-function estimate as ¡ ¢ © ª πi , P̃i = arg max W (r, π, P̄ ) π∈Π,P̄ ∈Pi in epoch E i , and follow πi through that epoch. In other words, we optimistically select the model from the confidence set that maximizes the optimal value of the MDP (defined as the value of the optimal policy in the MDP) and follow the optimal policy for this model and the fixed reward function. However, it is well known that for the case of arbitrarily changing reward functions, optimistic “follow the leader”-style algorithms like the one described above are bound to fail even in the simple (stateless) expert setting. 1 Thus, we need rely on the formalism of online learning of arbitrary sequences, presented in Section 2.1. In particular, our approach is to make this optimistic method more conservative by adding some carefully designed noise to the cumulative sum of reward functions that we have observed so far – much in the vein of FPL. To this end, introduce the perturbation function Yi : X × A → R with Yi (x, a) being i.i.d. random variables for all (x, a) ∈ X × A and all epochs i = 1, 2, . . . , KT ; in our algorithm Yi will be selected according to Exp(η, |X||A|), the |X||A|-dimensional distribution whose components are independent having exponential distribution with expectation η. Using these perturbations, the key idea of our algorithm is to choose our policy and transition-function estimate as ¡ ¢ © ª πt , P̃t = arg max W (R t −1 + Yi(t ) , π, P̄ ) (5.1) π∈Π,P̄ ∈Pi(t ) where i(t ) is the epoch containing episode t . That is, after fixing a biased reward function based on our previous observations, we still act optimistically when taking stochastic uncertainty into consideration. We call this method “follow the perturbed optimistic policy”, or, in short, FPOP. 1 This phenomenon was also illustrated by experiments presented in Section 3.6. 87 Algorithm 5.1 The FPOP algorithm for the online loop-free SSP problem with unknown transition probabilities. Input: State space X, action space A, perturbation parameter η ≥ 0, confidence parameter 0 < δ < 1, time horizon T > 0. Initialization: Let R 0 (x, a) = 0, n1 (x, a) = 0, N1 (x, a) = 0, and M1 (x, a) = 0 for all state-action pairs (x, a) ∈ X × A, let P1 be the set of all transition functions, and let the episode index i(1) = 1. Choose Y1 ∈ R|X||A| ∼ Exp(η, |X||A|) randomly. For t = 1, 2, . . . , T 1. Compute πt and P̃t according to (5.1). ¡ ) (t ) ¢ (t ) (t ) 2. Traverse a path ut = x(t , a0 , . . . , xL−1 , aL−1 , xL(t ) following the policy πt , where xl(t ) ∈ Xl 0 ¡ )¢ ) and a(t = πt x(t ∈ A. l l 3. Observe reward function r t and receive rewards PL−1 l =0 ¡ ) (t ) ¢ r t x(t , al . l ¡ ) (t ) ¢ ¡ ) (t ) ¢ 4. Set ni(t ) x(t , al = ni(t ) x(t , al + 1 for all l = 0, . . . , L − 1. l l 5. If ni(t ) (x, a) ≥ Ni(t ) (x, a) for some (x, a) ∈ X × A then start a new epoch: (a) Set i(t + 1) = i(t ) + 1, ti(t +1) = t + 1 and compute Ni(t +1) (x, a) and Mi(t +1) (x, a) for all (x, a) by (5.2). (b) Construct Pi(t +1) according to (5.3) and (5.4). (c) Reset ni(t +1) (x, a) = 0 for all (x, a). (d) Choose Yi(t +1) ∈ R|X||A| ∼ Exp(η, |X||A|) independently. Else set i(t + 1) = i(t ). 88 5.3.1 The confidence set for the transition function In this section, following Jaksch et al. 
[59], we describe how the confidence set is maintained to ensure that it contains the real transition function with high probability yet does not change too often. To maintain simplicity, we will assume that the layer decomposition of the SSP is known in advance, however the algorithm can be easily extended to cope with unknown layer structure. The algorithm proceeds in epochs of random length: the first epoch E 1 starts at episode t = 1, and each epoch E i ends when any state-action pair (x, a) has been chosen at least as many times in the epoch as before the epoch (e.g., the first epoch E 1 consists of the single episode t = 1). Let ti denote the index of the first episode in epoch E i , i(t ) be the index of the epoch that includes t , and let Ni (x, a) and Mi (x 0 |x, a) denote the number of times state-action pair (x, a) has been visited and the number of times this event has been followed by a transition to x 0 up to episode ti , respectively. That is tX i −1 Ni (x l , a l )= s=1 tX i −1 Mi (x l +1 |x l , a l )= s=1 I©x(s)=xl ,a(s)=al ª l l (5.2) I©x(s) =xl +1 ,x(s)=xl ,a(s)=al ª , l +1 l l where l = 0, 1, . . . , L − 1, x l ∈ Xl , a l ∈ A and x l +1 ∈ X (clearly, Mi (x l +1 |x l , a l ) can be non-zero only if x l +1 ∈ Xl +1 ). Our estimate of the transition function in epoch E i will be based on the relative frequencies P̄i (x l +1 |x l , a l ) = Mi (x l +1 |x l , a l ) . max {1, Ni (x l , a l )} Note that P̄i (·|x, a) belongs to the set of probability distributions ∆(Xl x +1 , X) defined over X with support contained in Xl x +1 . Define a confidence set of transition functions for epoch E i as the following L 1 -ball around P̄i : v u ½ T |X||A| ° ° u t 2|Xl x+1 | ln δ ° ° P̂i = P̄ : P̄ (·|x, a)−P̄i (·|x, a) 1 ≤ max{1, Ni (x, a)} ¾ and P̄ (·|x, a)∈∆(Xl x+1 , X) for all (x, a)∈ X×A . (5.3) The following lemma ensures that the true transition function lies in these confidence sets with high probability. We present the proof for completeness. Lemma 5.1 (59, Lemma 17). For any 0 < δ < 1 v u T |X||A| u 2|X ° ° l x +1 | ln δ °P̄i (·|x, a) − P (·|x, a)° ≤ t 1 max{1, Ni (x, a)} holds with probability at least 1 − δ simultaneously for all (x, a) ∈ X × A and all epochs. Proof. Let us fix an arbitrary x ∈ X and let l = l x . The statement follows from the following 89 inequality due to Weissman et al. [101] concerning the distance of a true discrete distribution p and the empirical distribution p̂ over m distinct events from n samples: ¶ µ ¤ ¡ ¢ £ nε2 . P kp − p̂k1 ≥ ε ≤ 2m − 2 exp − 2 As now we have |Xl +1 | distinct events, we get that setting s ε= 4|Xl +1 | ln T |Xδ||A| n for some fixed n ∈ [1, 2, . . . , t ] yields s ° ° P °P̄i (·|x, a) − P (·|x, a)°1 ≥ 2|Xl +1 | ln T |Xδ||A| n ¯ ¯ ¯ δ ¯ Ni (x, a) = n ≤ . ¯ 2 T |X||A| ¯ Using the union bound for all possible values of Ni (x, a), all (x, a) ∈ X × A, all i = 1, 2, . . . , KT (note that for the bound, we have used the very crude upper bound T > KT for simplicity) and the fact that the confidence intervals trivially hold when there are no observations with probability 1, we get the statement of the lemma. Since by the above result P ∈ P̂i holds with high probability for all epochs, defining our true confidence set Pi as Pi = ∩ij =1 P̂ j , (5.4) we also have P ∈ Pi for all epochs with probability at least 1 − δ. This way we ensure that our confidence sets cannot increase between consecutive episodes with probability 1. Note that this is a delicate difference from the construction of Jaksch et al. [59] that plays an important role in our proof. 
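To make the bookkeeping of this section concrete, the following sketch computes the empirical estimates P̄_i and the confidence radii of (5.3) from the visit counts. It is only an illustration under assumed data structures (the dictionaries N and M, and the map size_next_layer giving |X_{l_x+1}|, are hypothetical names), not the implementation used in the thesis.

```python
import math

def epoch_confidence_set(N, M, size_next_layer, T, n_states, n_actions, delta):
    """Empirical transition estimates and L1 confidence radii for one epoch,
    following the definition in (5.3).

    N[(x, a)]          -- number of visits to (x, a) before the current epoch
    M[(x, a)][x2]      -- number of observed transitions (x, a) -> x2
    size_next_layer[x] -- |X_{l_x + 1}|, the size of the layer after x's layer
    """
    P_hat, radius = {}, {}
    for (x, a), n_visits in N.items():
        n = max(1, n_visits)
        counts = M.get((x, a), {})
        P_hat[(x, a)] = {x2: m / n for x2, m in counts.items()}
        radius[(x, a)] = math.sqrt(
            2.0 * size_next_layer[x]
            * math.log(T * n_states * n_actions / delta) / n)
    return P_hat, radius
```

The set P̂_i then consists of all transition functions whose conditional distributions lie within radius[(x, a)] of P_hat[(x, a)] in L1 distance and are supported on the next layer, and the set P_i actually used by FPOP is the intersection of P̂_1, ..., P̂_i as in (5.4), so that the confidence sets can only shrink from one epoch to the next.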
5.3.2 Extended dynamic programming FPOP needs to compute an optimistic transition function and an optimistic policy in each episode with respect to some reward function r and some confidence set P of transition functions. That is, we need to solve the problem ¡ ∗ ∗¢ π , P = arg max W (r, π, P̄ ). (5.5) π,P̄ ∈P We will use an approach called extended dynamic programming to solve this problem, a simple adaptation of the extended value iteration method proposed by Jaksch et al. [59]. The method is presented as Algorithm 5.2 in Section 5.5. Computing in a backward manner on the states (that is, going from layer Xl to X0 ), the algorithm maximizes the transition probabilities to the direction of the largest reward-to-go. This is possible since the L 1 -balls allow to select the optimistic transition functions independently for all state-action pairs (x, a) ∈ X × A. Following 90 the proof of Theorem 7 of Jaksch et al. [59], Lemma 5.6 shows that Algorithm 5.2 indeed solves the required minimization problem, and can be implemented with O(|A||X|2 ) time and space complexity. 5.4 Regret bound The main result of this chapter is presented below. The performance bound is detailed in Theorem 5.1 in the appendix. Theorem 5.1. Assume η ≤ (|X||A|)−1 and T ≥ |X||A|. Then the expected regret of FPOP can be bounded as b T ≤ L|X||A| log L 2 µ 8T |X||A| s ³p ´ + 2 + 1 L|X| ´ ³ ¶ ln |X||A| + 1 L η + ηT L (e − 1)|X||A| T |X||A| T |A| ln + L|X| δL s 2 T ln L + 3δ T L. δ In particular, assuming T ≥ (|X||A|)2 , setting v ³ ´ u µ ¶ ln |X||A| + 1 u t 8T L η = log2 |X||A| T (e − 1) and δ = 1/T gives s b T ≤ 2 L|X||A| L + µ 8T |X||A| s ¶µ T (e − 1) log2 ³p ´ 2 + 1 L|X| T |A| ln µ ln ¶ ¶ |X||A| +1 L p T 2 |X||A| + L|X| 2 T ln(LT ) + 3 L. L Remark 5.1. The above regret bound can be simply stated as p ´ ³ b T = Õ L|X||A| T . L Using the techniques of Jaksch et al. [59], one can show a regret bound of Õ(L|X| p T |A|) for the easier problem when the reward function is fixed and known. Thus, the price we pay for p playing against an arbitrary reward sequence is an O( |A|) factor in the upper bound. The proof of Theorem 5.1 mainly follows the regret analysis of FPL combined with ideas from the regret analysis of UCRL-2. First, let us consider the policy-transition model pair ¡ ¢ © ª π̂t , P̂t = arg max W (R t + Yi(t ) , π, P̄ ) . π∈Π,P̄ ∈Pi(t ) In other words, π̂t is an optimistic policy that knows the reward function before the episode 91 would begin. Define ṽt = W (r t , πt , P̃t ) and v̂t = W (r t , π̂t , P̂t ). Furthermore, let © ª π∗t = arg max W (R t + Yi(t ) , π, P ) , π∈Π the optimal policy with respect to the perturbed rewards and the true transition function. The perturbation added to the value of policy π in episode t will be denoted by Zt (π) = W (Yi(t ) , π, P ). The true optimal value up to episode t and the optimal policy attaining this value will be denoted by Vt∗ = max Vt (π) π∈Π σ∗t = arg max Vt (π). and π∈Π We proceed by a series of lemmas to prove our main result. The first one shows that our optimistic choice of the estimates of the transition model enables us to upper bound the optimal value Vt∗ with a reasonable quantity. Lemma 5.2. VT∗ ≤ T X t =1 E [v̂t ] + δ T L + |X||A| log2 µ 8T |X||A| ¶ PL−1 l =0 ln (|Xl ||A|) + L η . Note that while the proof of this result might seem a simple reproduction of some arguments from the standard FPL-analysis, it contains subtle details about the role of our optimistic estimates that are of crucial importance for the analysis. Proof. 
Assume that P ∈ Pi(T ) , which holds with probability at least 1 − δ, by Lemma 5.1. First, we have VT∗ + ZT (σ∗T ) ≤ VT (π∗T ) + ZT (π∗T ) = W (R T + Yi(T ) , π∗T , P ) (5.6) ≤ W (R T + Yi(T ) , π̂T , P̂T ) where the first inequality follows from the definition of π∗T and the second from the optimistic choice of π̂T and P̂T . Let dYi(s) = Yi(s) − Yi(s−1) for s = 1, . . . , t . Next we show that, given P ∈ Pi(T ) , W (R t +Yi(t ) , π̂t , P̂t ) ≤ t X W (r s +dYi(s) , π̂s , P̂s ) (5.7) s=1 where we define Y0 = 0. The proof is done by induction on t . Equation (5.7) holds trivially for t = 1. For t > 1, assuming P ∈ Pi(T ) and (5.7) holds for t − 1, we have W (R t + Yi(t ) , π̂t , P̂t ) = W (R t −1+Yi(t −1) , π̂t , P̂t )+W (r t +dYi(t ) , π̂t , P̂t ) ≤ W (R t −1+Yi(t −1) , π̂t −1 , P̂t −1 )+W (r t +dYi(t ) , π̂t , P̂t ) ≤ t X W (r s + dYi(s) , π̂s , P̂s ), s=1 92 ¡ ¢ where the first inequality follows from the fact that π̂t −1 , P̂t −1 is selected from a wider class2 ¢ ¡ than π̂t , P̂t and is optimistic with respect to rewards R t −1 +Yi(t −1) , while the second inequality holds by the induction hypothesis for t − 1. This proves (5.7). Now the non-negativity of ZT (σ∗T ), (5.6) and (5.7) imply that, given P ∈ Pi(T ) , VT∗ ≤ = T X t =1 T X W (r t + dYi(t ) , π̂t , P̂t ) v̂t + T X W (dYi(t ) , π̂t , P̂t ). t =1 t =1 Since P ∈ Pi(T ) holds with probability at least 1 − δ, VT∗ ≤ T L trivially and the right hand side of the above inequality is non-negative, we have VT∗ ≤ T X " E [v̂t ] + E t =1 T X # W (dYi(t ) , π̂t j , P̂t j ) + δT L (5.8) t =1 The elements in the second sum above may be non-zero only if i(t ) 6= i(t − 1). Furthermore, by Proposition 18 of Jaksch et al. [59], the number of epochs KT up to episode T is bounded from above as T X def KT = t =1 I{i(t )6=i(t −1)} ≤ |X||A| log2 µ ¶ 8T . |X||A| Therefore, " E T X # " W (dYi(t ) , π̂t j , P̂t j ) ≤ E t =1 m X # W (Y j , π̂t j , P̂t j ) j =1 ≤ |X||A| log2 µ ¶ L−1 · ¸ X 8T E max Y1 (x, a) . |X||A| l =0 (x,a)∈Xl ×A Using the upper bound on the expectation of the maximum of a number of exponentially distributed variables (see, e.g., the proof of Corollary 4.5 in Cesa-Bianchi and Lugosi 25), a combination of the above inequality with (5.8) gives the desired result. Next, we show that peeking one episode into the future does not change the performance too much. The following lemma is a standard result used for the analysis of FPL and we include the proof only for completeness. Lemma 5.3. Assume that η ≤ (|X||A|)−1 . Then, " E T X t =1 # v̂t ≤ E " T X # ṽt + ηT L (e − 1)|X||A|. t =1 Proof. Let © ª (σt (Y), Γt (Y)) = arg max W (R t −1 + Y, π, P̄ ) π∈Π,P̄ ∈Pi(t ) 2 This follows from the definition of the confidence sets in Equation (5.4). 93 and Ft (Y) = W (r t , σt (Y), Γt (Y)). Clearly, ṽt = Ft (Yi(t ) ) and v̂t = Ft (Yi(t ) + r t ). Now let f be the density function of Yi(t ) and Fi(t ) denote the σ-algebra generated by all random variables before epoch E i(T ) .3 We have E v̂t | Fi(t −1) = £ ¤ Z Z Ft (y + r t ) f (y)d y = Ft (y) f (y − r t )d y R|X||A| Z ¤ f (y − r t ) f (y − r t ) £ ≤ sup E ṽt | Fi(t −1) . Ft (y) f (y)d y ≤ sup | X || A | f (y) f (y) y,t y,t R R|X||A| ¡ P ¢ Since f (y) = η exp −η x,a y(x, a) for all y º 0, we get à ! X ¡ ¢ f (y − r t ) sup = exp η r t (x, a) ≤ exp η|X||A| . f (y) y x,a Using e x ≤ 1 + (e − 1)x for x ∈ [0, 1], which holds by our assumption on η, we get ¡ ¢ E [v̂t ] ≤ E [ṽt ] 1 + η(e − 1)|X||A| . Noticing that ṽt ≤ L gives the result. 
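For completeness, the bound on the expected maximum of exponentially distributed variables invoked in the proof of Lemma 5.2 can be obtained by a short standard calculation; the derivation below is our own addition (it does not appear in the thesis) and is written for perturbation components with tail \(\mathbb{P}[Y_i > y] = e^{-\eta y}\), the parameterization used in the proof of Lemma 5.3 above. For i.i.d. variables \(Y_1, \dots, Y_m\) of this form and any \(u \ge 0\),
\[
\mathbb{E}\Bigl[\max_{1 \le i \le m} Y_i\Bigr]
  = \int_0^{\infty} \mathbb{P}\Bigl[\max_{1 \le i \le m} Y_i > y\Bigr]\,\mathrm{d}y
  \le u + \int_u^{\infty} m\, e^{-\eta y}\,\mathrm{d}y
  = u + \frac{m}{\eta}\, e^{-\eta u} ,
\]
and the choice \(u = (\ln m)/\eta\) yields \(\mathbb{E}[\max_{1 \le i \le m} Y_i] \le (\ln m + 1)/\eta\). Applying this bound with \(m = |X_l||A|\) separately for every layer \(l\) is how the factor \(\sum_{l=0}^{L-1}\ln(|X_l||A|) + L\) in the statement of Lemma 5.2 arises.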
¯ £ ¤ Now consider µt (x) = P xl x = x ¯ u ∼ (πt , P ) , that is, the probability that a trajectory gen- erated by πt and P includes x. Note that given a layer Xl , the restriction µt ,l : Xl → [0, 1] is a £ ¤ distribution. Define an estimate of µt as µ̂t (x) = P xl = x| u ∼ (πt , P̃t ) . Note that this estimate can be efficiently computed using the recursion µ̃t (x l +1 ) = X x l ,a l P̃t (x l +1 |x l , a l )πt (a l |x l )µ̃t (x l ), for l = 0, 1, 2, . . . , L −1, with µ̃t (x 0 ) = 1. The following result will ensure that if our estimate of the transition function is close enough to the true transition function in the L 1 sense, than these estimates of the visitation probabilities are also close to the true values that they estimate. Lemma 5.4. Assume that there exists some function dt : X × A → R+ such that ° ° °P̃t (·|x, a) − P (·|x, a)° ≤ dt (x, a) 1 holds for all (x, a) ∈ X × A. Then X x l ∈Xl |µ̃t (x l ) − µt (x l )| ≤ lX −1 X k=0 x k ∈Xk µt (x k ) dt (x k , πt (x k )) for all l = 1, 2, . . . , L − 1. 3 Note that Y i(t ) is generated independently from the history up to epoch i(t ). 94 Proof. We prove the statement by induction on l . For l = 1 we have X¯ ¯ X¯ ¯ ¯µ̃t (x 1 ) − µt (x 1 )¯ = ¯P̃t (x 1 |x 0 , πt (x 0 )) − P (x 1 |x 0 , πt (x 0 ))¯ ≤ dt (x 0 , , πt (x 0 )), x1 x1 proving the statement for this case. Now assume that the statement holds for some l − 1. We have µ̃t (x l ) − µt (x l ) = X¡ ¢ P̃t (x l |x l −1 , πt (x l −1 ))µ̃t (x l −1 ) − P (x l |x l −1 , πt (x l −1 ))µt (x l −1 ) x l −1 = X³ x l −1 ¢ ¡ P̃t (x l |x l −1 , πt (x l −1 )) µ̃t (x l −1 ) − µt (x l −1 ) ´ ¡ ¢ + P̃t (x l |x l −1 , πt (x l −1 )) − P (x l |x l −1 , πt (x l −1 )) µt (x l −1 ) , and thus X |µ̃t (x l ) − µt (x l )| ≤ xl X³ x l ,x l −1 ¯ ¯ P̃t (x l |x l −1 , πt (x l −1 )) ¯µ̃t (x l −1 ) − µt (x l −1 )¯ ´ ¯ ¯ + ¯P̃t (x l |x l −1 , πt (x l −1 )) − P (x l |x l −1 , πt (x l −1 ))¯ µt (x l −1 ) X¯ ¯ ¯µ̃t (x l −1 ) − µt (x l −1 )¯ = x l −1 + X x l −1 ≤ lX −2 X k=0 x k ∈Xk µt (x l −1 ) X¯ ¯ ¯P̃t (x l |x l −1 , πt (x l −1 )) − P (x l |x l −1 , πt (x l −1 ))¯ xl µt (x k ) dt (x k , πt (x k )) + X x l −1 µt (x l −1 ) X xl dt (x l −1 , πt (x l −1 )), proving the statement. Finally, we use the above result to relate the estimated policy values ṽt to their true values v t (πt ). The following lemma, largely based on Lemma 19 of Jaksch et al. [59], asserts this relationship. Lemma 5.5. Assume T ≥ |X||A|. Then " E T X t =1 # ṽt ≤E " T X s # v t (πt ) + L|X| 2 T ln t =1 ³p ´ + 2 + 1 L|X| L + 2δ T L δ s T |A| ln T |X||A| . δL Proof. We start by some arguments borrowed from Jaksch et al. [59]. Let ni (x, a) be the number of times state-action pair (x, a) has been visited in epoch E i . We have Ni (x, a) = iX −1 ni (x, a). j =1 For simplicity, let KT = m be the number of epochs. By Appendix C.3 of Jaksch et al. [59], we 95 have ³p ´p m n (x, a) X i ≤ 2+1 Nm (x, a), p i =1 Ni (x, a) and by Jensen’s inequality, ³p ´p m n (x, a) XX i ≤ 2+1 |X||A|T . p x,a i =1 Ni (x, a) Now fix an arbitrary t : 1 ≤ t ≤ T . We have ṽt = L−1 X l =0 x∈Xl and v t (πt ) = thus ṽt − v t (πt ) = L−1 X X L−1 X µ̃t (x)r t (x, πt (x)) X l =0 x∈Xl µt (x)r t (x, πt (x)), L−1 X ¡ X X ¯ ¯ ¢ ¯µ̃t (x) − µt (x)¯ . µ̃t (x) − µt (x) r t (x, πt (x)) ≤ l =0 x∈Xl l =0 x∈Xl PT P ¯ ¯ That is, we need to bound t =1 x∈Xl ¯µ̃t (x) − µt (x)¯. 
° ° Setting dt (x, a) = °P̃t (·|x, a) − P (·|x, a)°1 for all (x, a) ∈ X × A, the conditions of Lemma 5.4 are clearly satisfied, and so −1 X X ¯ ¯ lX ¯µ̃t (x) − µt (x)¯ ≤ µt (x k ) dt (x k , πt (x k )) x∈Xl k=0 x k ∈Xk = lX −1 k=0 ³ ) (t ) dt x(t , ak k ´ + lX −1 (5.9) X ³ k=0 x k ∈Xk µt (x k ) − I© ´ ) x(t =x k k ª dt (x k , πt (x k )) . Now, by Lemma 5.1, we have with probability at least 1 − δ simultaneously for all k that T X t =1 ³ ´ v u u t 2|Xk+1 | ln T |Xδ||A| n ³ ´o ) (t ) t =1 max 1, Ni(t ) x(t , a k k v u u 2|Xk+1 | ln T |X||A| m X X t δ © ª ≤ ni (x k , a k ) max 1, Ni(t ) (x k , a k ) x k ,a k i =1 s ³p ´ T |X||A| ≤ 2+1 2 T |Xk ||Xk+1 ||A| ln . δ ) (t ) dt x(t , ak ≤ k T u X ³ ´ For the second term on the right hand side of (5.9), notice that µt (x k ) − I©x(t ) =xk ª form a mark tingale difference sequence with respect to {Ut }Tt=1 and thus by the Hoeffding–Azuma inequality and dt ≤ 2, we have T ³ X t =1 s ´ µt (x k ) − I©x(t ) =xk ª dt (x k , πt (x k )) ≤ k 96 2 T ln L δ with probability at least 1 − δ/L. Putting everything together, the union bound implies that we have, with probability at least 1 − 2δ simultaneously for all l = 1, . . . , L, t =1 x∈Xl s s −1 T |X||A| lX L |Xk | 2 T ln 2+1 T |Xk ||Xk+1 ||A| ln + δ δ k=0 k=0 s s ´ L−1 ³p −1 X 1 T |X||A| lX L |Xk | 2 T ln 2+1 L T |Xk ||Xk+1 ||A| ln + ≤ δ δ k=0 L k=0 s s µ ¶ ³p ´ |X| 2 T |X||A| L ≤ 2 + 1 L T | A| ln + |X| 2 T ln L δ δ s s ³p ´ T |X||A| L = + |X| 2 T ln (5.10) 2 + 1 |X| T |A| ln δ δ T X ¡ X ¢ µ̃t (x) − µt (x) ≤ lX −1 ³p ´ where in the last step we used Jensen’s inequality for the concave function f (x, y) = PL−1 with parameter a > 0 and the fact that k=0 |Xk | = |X| − 1 < |X|. p x y(a + ln x) Summing up for all l = 0, 1, . . . , L − 1 and taking expectation, using that v t (πt ) − ṽt ≤ L and (5.10) holds with probability at least 1 − 2δ, finishes the proof. The theorem can be obtained by a trivial combination of Lemmas 5.2, 5.3, and 5.5 and applying the bound |X||A| ln (|Xl ||A|) ≤ L ln L l =0 L−1 X µ ¶ in the last term of Lemma 5.2. 5.5 Extended dynamic programming: technical details The extended dynamic programming algorithm is given by Algorithm 5.2. The next lemma, which can be obtained by a straightforward modification of the proof of Theorem 7 of Jaksch et al. [59], shows that Algorithm 5.2 efficiently solves the desired minimization problem. Lemma 5.6. Algorithm 5.2 solves the maximization problem (5.5) for P = {P̄ : kP̄ − P̂ k1 ≤ b}. P Let S = lL−1 |Xl ||X l +1 | denote the maximum number of possible transitions in the given model. =0 The time and space complexity of Algorithm 5.2 is the number of possible non-zero elements of P̄ allowed by the given structure, and so it is O(S|A|), which, in turn, is O(|A||X|2 ). 97 Algorithm 5.2 Extended dynamic programming for finding an optimistic policy and transition model for a given confidence set of transition functions and given rewards. Input: empirical estimate P̂ of transition functions, L 1 bound b ∈ (0, 1]|X||A| , reward function r ∈ [0, 1]|X||A| . Initialization: Set w(x L ) = 0. For l = L − 1, L − 2, . . . , 0 ¡ ¢ 1. Let k = |Xl +1 | and x 1∗ , x 2∗ , . . . , x k∗ be a sorting of the states in Xl +1 such that w(x 1∗ ) ≥ w(x 2∗ ) ≥ · · · ≥ w(x k∗ ). 2. For all (x, a) ∈ Xl × A © ª (a) P ∗ (x 1∗ |x, a) = min P̂ (x 1∗ |x, a) + b(x, a)/2, 1 (b) P ∗ (x i∗ |x, a) = P̂ (x i∗ |x, a) for all i = 2, 3, . . . , k. (c) Set j = k. P (d) While i P ∗ (x i∗ |x, a) > 1 do © ª P i. Set P ∗ (x ∗j |x, a) = max 0, 1 − i 6= j P ∗ (x i∗ |x, a) ii. Set j = j − 1. 3. 
For all x ∈ X_l:
   (a) Let \(w(x) = \max_a \bigl\{ r(x, a) + \sum_{x'} P^*(x'|x, a)\, w(x') \bigr\}\).
   (b) Let \(\pi^*(x) = \arg\max_a \bigl\{ r(x, a) + \sum_{x'} P^*(x'|x, a)\, w(x') \bigr\}\).
Return: optimistic transition function P*, optimistic policy π*.

Chapter 6
Online learning with switching costs

In this chapter we study a special case of online learning in reactive environments where switching between actions is subject to an additional cost. The precise protocol of the prediction problem is identical to the online prediction protocol shown on Figure 2.1, with the additional assumption that every time the learner selects an action a_t ≠ a_{t−1}, a known cost of K > 0 is deducted from the earnings of the learner. We are interested in algorithms that can minimize the regret under this additional assumption, or, equivalently, minimize the regret while keeping the number of action switches low. However, the usual forecasters with small regret, such as the exponentially weighted average forecaster or the FPL forecaster with i.i.d. perturbations, may switch actions a large number of times, typically Θ(T). Therefore, the design of special forecasters with small regret and a small number of action switches is called for. In this chapter, we consider this problem in the full information setting, where the reward functions are revealed after each time step. Our results are summarized in the following thesis.

Thesis 4. Proposed an efficient algorithm for online learning with switching costs. Proved performance guarantees for Settings 3a and 3b. The proved bounds for Setting 3a are optimal in all problem parameters. Our algorithm is the first known efficient algorithm for Setting 3b. [30, 31]

While this learning problem is very interesting in its own right, one can also imagine numerous applications where one would like to define forecasters that do not change their prediction too often. Examples of such problems include the online buffering problem described by Geulen et al. [39] and the online lossy source coding problem to be discussed in Chapter 7. Furthermore, as seen in Chapter 4, abrupt policy switches can also be harmful in the online MDP problem. While the core idea of the analysis presented in Chapter 4 is to guarantee that the learner's policies change slowly over time, one can achieve similar results by ensuring that the learner changes its policies rarely. The main reason we took the first route in our analysis is that there is currently no known algorithm that can guarantee that the number of action switches is O(√T). Preliminary results [24] suggest that even the existence of such prediction algorithms is nontrivial.

As mentioned before in Chapter 1, learning in this problem can be regarded as a special case of the online MDP problem. The Markovian environment describing the current setting is deterministic, and it is easy to find policies that induce periodic state transitions. In fact, the only policies that admit stationary distributions are the ones that satisfy π(x) = x for some x ∈ A. This implies that {X, A, P} does not satisfy Assumption M1, so we cannot guarantee that the MDP-E algorithm (Algorithm 4.1) performs well in this problem by a straightforward application of Theorem 4.1. Recently, Arora et al. [3] proposed an algorithm for regret minimization in deterministic MDPs with non-stationary rewards.
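As an aside, the reduction mentioned above can be made concrete: the switching-cost problem induces a deterministic MDP with non-stationary rewards, exactly the class studied by Arora et al. [3], in which the state is simply the action played in the previous round. The sketch below is our own illustration, not a construction spelled out in the thesis.

```python
def switching_cost_mdp(actions, K):
    """Deterministic MDP encoding of prediction with switching costs.

    The state space equals the action set: the state is the previously played
    action.  Playing action a in state x moves the process deterministically to
    state a, and the switching cost K is folded into the (time-varying) reward.
    """
    states = list(actions)

    def transition(x_next, x, a):
        # P(x'|x, a): the next state is exactly the action just played
        return 1.0 if x_next == a else 0.0

    def reward(r_t, x, a):
        # r_t maps actions to rewards in [0, 1]; a switch costs K
        # (shift and rescale afterwards if rewards must stay in [0, 1])
        return r_t(a) - (K if a != x else 0.0)

    return states, transition, reward
```

As noted above, these deterministic dynamics admit policies whose induced state sequence is periodic and never mixes, so the uniform mixing condition of Assumption M1 fails and the guarantees of Chapter 4 do not transfer directly.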
Since they consider a significantly more complicated problem where the optimal policy is allowed to induce periodic behavior, they can only guarantee an expected average regret of Õ(T −1/4 ). Since our general tools introduced in the previous chapters are not directly applicable for this specific setting, we turn to algorithms that are directly tailored to deal with the above problem of switch-constrained online prediction. The first paper to explicitly attack this problem is by Geulen et al. [39], who propose a variant of the exponentially weighted average forecaster called the “Shrinking Dartboard” (SD) algorithm and prove that it provides an expected p regret of O( T ln |A|), while guaranteeing that the expected number of switches is at most p O( T ln |A|). A less conscious attempt to solve the problem is due to Kalai and Vempala [63] (see also 64); they show that the simplified version of the Follow-the-Perturbed-Leader (FPL) p algorithm with identical perturbations (as described above) guarantees an O( T ln |A|) bound on both the expected regret and the expected number of switches. In the first half of this chapter, we present a modified version of the SD algorithm that enjoys optimal bounds on both the standard regret and the number of switches. Our contribution, presented in György and Neu [48] and György and Neu [49], is a minor modification of the SD algorithm that allows using adaptive step-size parameters. More importantly, in the second half of the chapter representing our works Devroye et al. [30, 31], we propose a method based on FPL in which perturbations are defined by independent symmetric random walks. We show that this intuitively appealing forecaster has similar regret and switch-number guarantees as SD and FPL with identical perturbations. A further important advantage of the new forecaster is that it may be used simply in the more general problem of online combinatorial— or, more generally, linear—optimization. We postpone the definitions and the statement of the results to Section 6.2.3 below. Before presenting our algorithms, we set the stage for Chapter 7 by slightly changing our notations. For a number of practical problems, including the online lossy source coding problem, it is more suitable to regard the learner’s task as having to minimize losses instead of having to maximize rewards. In accordance with the notations used by Cesa-Bianchi and Lugosi [25], we identify the elements of the action set A with the natural numbers {1, 2, . . . , N } and denote the loss given for choosing action i ∈ {1, 2, . . . , N } at time t by `i ,t . We assume that `i ,t ∈ [0, 1]. The goal of the learner to choose its actions I1 , I2 , . . . , IT so as to minimize its total expected regret 100 Algorithm 6.1 The modified Shrinking Dartboard algorithm 1. Set η t > 0 with η t +1 ≤ η t for all t = 1, 2, . . ., η 0 = η 1 , and L i ,0 = 0 and w i ,0 = 1/N for all actions i ∈ {1, 2, . . . , N }. 2. for t = 1, . . . , T do (a) Set w i ,t = (b) Set p i ,t = 1 −η t L i ,t −1 for all i ∈ {1, 2, . . . , N }. Ne w i ,t PN for all i ∈ {1, 2, . . . , N }. j =1 w j ,t (c) Set c t = e (η t −η t −1 )(t −2) . w It −1 ,t (d) With probability c t w I , set It = It −1 if t ≥ 2, that is, do not change expert; other© ª wise choose It randomly according to the distribution p 1,t , . . . , p N ,t . t −1 ,t −1 (e) Observe the losses `i ,t and set L i ,t = L i ,t −1 + `i ,t for all i ∈ {1, 2, . . . , N }. 
end for defined as bT = L T X t =1 `It ,t − min T X i ∈{1,...,N } t =1 `i ,t .1 Further, define the number of action switches up to time n by Cn = |{1 < t ≤ n : It −1 6= It }| . b T of the order We are interested in defining randomized forecasters that achieve a regret L p O( T ln N ) while keeping the number of action switches C n as small as possible. 6.1 The Shrinking Dartboard algorithm revisited In this section, we present a modified version of the Shrinking Dartboard (SD) algorithm of Geulen et al. [39]. A modified version of this prediction method, called the modified SD (mSD) algorithm, is shown as Algorithm 6.1. The difference between the SD and the mSD algorithms is that mSD is horizon independent, which is achieved by introducing the constant c t in the algorithm (setting η t = η the mSD algorithm reduces to SD). w To see that the mSD algorithm is well-defined we have to show that c t w i ,ti ,t−1 ≤ 1 for all t and i . For t = 1, the statement follows from the definitions, since c 1 = 1. For t ≥ 2 it follows since w i ,t w i ,t −1 ¡ ¢ = exp η t −1 L i ,t −2 − η t D i ,t −1 ¡¡ ¢ ¢ ≤ exp η t −1 − η t L i ,t −2 − η t `i ,t −1 ¡¡ ¢ ¢ ≤ exp η t −1 − η t (t − 2) = 1/c t . 1 One can easily go back and forth between this notation and the one used in previous chapters by using the transformation `i ,t = 1 − r t (i ) for all time steps and actions. Note that the regret is invariant under this transformation. 101 Note that the only difference between the mSD and the EWA prediction algorithms is the presence of the first random choice in step 2d of mSD: while the EWA algorithm chooses a new action in each time step t according to the distribution {p 1,t , . . . , p N ,t }, the mSD algorithm sticks with the previously chosen action with some probability. By precise tuning of this probp ability, the method guarantees that actions are changed over time only at most O( T ) times in T time steps, while maintaining the same marginal distributions over the actions as the EWA algorithm. The latter fact guarantees that the expected regret of the two algorithms are the same. In the following we formalize the above statements concerning the mSD algorithm. The next lemma shows that the marginal distributions generated by the mSD and the EWA algorithms are the same. The lemma is obtained by a slight modification of the proof Lemma 1 in [39]. Lemma 6.1. Assume the mSD algorithm is run with η t +1 ≤ η t for all t = 1, 2, . . . , T . Then the probability of selecting action i at time t satisfies P [It = i ] = p i ,t for all t = 1, 2, . . . and i ∈ {1, 2, . . . , N }. Proof. We will use the notation Wt = PN i =1 w i ,t . We prove the lemma by induction on 1 ≤ t ≤ T . For t = 1, the statement follows from the definition of the algorithm. Now assume that t ≥ 2 and the hypothesis holds for t − 1. For all i ∈ {1, 2, . . . , N }, we have N X µ ¶ £ ¤ w j ,t P [It = i ] = P [It −1 = i ] c t + p i ,t P It −1 = j 1 − c t w i ,t −1 w j ,t −1 j =1 µ ¶ N X w j ,t w i ,t p j ,t −1 1 − c t + p i ,t = p i ,t −1 c t w i ,t −1 w j ,t −1 j =1 µ ¶ N w w j ,t w i ,t −1 w i ,t w i ,t X j ,t −1 = ct + 1 − ct Wt −1 w i ,t −1 Wt j =1 Wt −1 w j ,t −1 w i ,t = ct w i ,t Wt −1 + w i ,t Wt − ct w i ,t w i ,t Wt = = p i ,t . Wt Wt −1 Wt As a consequence of this result, the expected regret of mSD matches that of EWA, so the performance bound of EWA holds for the mSD algorithm as well [39, Lemma 2]). That is, the following result can be obtained by a slight modification of the proof of [41, Lemma 1]. 
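Before turning to the regret bound, the sampling mechanism analyzed in Lemma 6.1 (steps 2a-2d of Algorithm 6.1) can be summarized in a few lines. The sketch below is our own re-implementation for illustration, not the thesis's code; the caller is assumed to pass back, from one round to the next, the previously chosen action together with its weight from the previous round.

```python
import math
import random

def msd_step(L_prev, eta_t, eta_prev, t, prev_action, prev_weight):
    """One round of the modified Shrinking Dartboard (mSD) algorithm.

    L_prev      -- cumulative losses L_{i,t-1}, i = 0..N-1 (all zeros at t = 1)
    eta_t       -- current step size (non-increasing: eta_t <= eta_{t-1})
    eta_prev    -- previous step size eta_{t-1} (take eta_prev = eta_t at t = 1)
    t           -- round index, starting from 1
    prev_action -- I_{t-1}, the action played in the previous round (None at t = 1)
    prev_weight -- w_{I_{t-1}, t-1}, that action's weight in the previous round
    Returns (I_t, w_{I_t, t}); feed the pair back in as (prev_action, prev_weight).
    """
    N = len(L_prev)
    w = [math.exp(-eta_t * L) / N for L in L_prev]      # step 2a: w_{i,t}
    W = sum(w)
    p = [wi / W for wi in w]                            # step 2b: EWA distribution p_{i,t}
    if t >= 2 and prev_action is not None:
        c_t = math.exp((eta_t - eta_prev) * (t - 2))    # step 2c
        stay_prob = c_t * w[prev_action] / prev_weight  # step 2d; <= 1 since eta_t <= eta_{t-1}
        if random.random() < stay_prob:
            return prev_action, w[prev_action]          # keep the previous action
    choice = random.choices(range(N), weights=p)[0]     # otherwise redraw from p_t
    return choice, w[choice]
```

By Lemma 6.1, the marginal distribution of the returned action is exactly p_t, which is why the regret guarantee of EWA carries over, while the lazy branch is what keeps the expected number of switches small.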
We include the proof for completeness. Lemma 6.2. Assume η t +1 ≤ η t for all t = 1, 2, . . . , T . Then the total expected regret of the mSD algorithm can be bounded as T η £ ¤ X ln N t bT ≤ + . E L ηT t =1 8 Proof. Introduce the following notation: w i0 ,t = 1 −η t −1 L i ,t −1 e , N 102 Pt −1 Note that the difference between w i ,t and w i0 ,t is that η t is replaced by P η t −1 in the latter. We will also use Wt0 = iN=1 w i0 ,t . First, we have where L i ,t −1 = s=1 `i ,s . W0 1 1 ln t +1 = ln ηt Wt ηt ≤− N X PN i =1 w i ,t e −η t `i ,t Wt p i ,t `i ,t + i =1 = N X 1 p i ,t e −η t `i ,t ln η t i =1 £ ¤ ηt ηt ≤ −E `It ,t + 8 8 where the second to last step follows from Hoeffding’s inequality (see, e.g., [25, Lemma A.1]) and the fact that `i ,t ∈ [0, 1]. After rearranging, we get £ ¤ W0 ηt 1 E `It ,t ≤ − ln t +1 + . ηt Wt 8 (6.1) Rewriting the first term on the right hand side, we obtain lnWt − lnWt0+1 ηt µ = ¶ µ ¶ lnWt lnWt +1 lnWt +1 lnWt0+1 − + − . ηt η t +1 η t +1 ηt The first term can be telescoped as ¶ T µ lnW X ln w i ,T +1 lnWt +1 lnW1 lnWT +1 t − = lnW1 − ≤− ηt η t +1 η1 η T +1 η T +1 t =1 =− 1 1 ln N ln e −ηT +1 L i ,T = L i ,T + , η T +1 N η T +1 for any 1 ≤ i ≤ N . Now to deal with the second term, observe that N 1 ¡ N 1 X X ¢ ηt +1 e −η t +1 L i ,t = e −η t L i ,t ηt i =1 N i =1 N ! ηt +1 à ηt N 1 X ¡ ¢ ηt +1 ≤ e −η t L i ,t = Wt0+1 ηt , i =1 N Wt +1 = where we have used that η t +1 ≤ η t and thus we can apply Jensen’s inequality for the concave function x η t +1 ηt , x ∈ R. Thus we have lnWt +1 lnWt0+1 η t +1 lnWt0+1 lnWt0+1 − ≤ − = 0. η t +1 ηt ηt η t +1 ηt Substituting into (6.1) and summing for all t = 1, 2, . . . , T , we obtain T T X £ ¤ 1X ln N E `It ,t ≤ L i ,T + ηt + . 8 η T +1 t =1 t =1 Finally, since the losses `i ,t , i ∈ {1, 2, . . . , N } and `It ,t do not depend on η T +1 for t ≤ T , we can choose, without loss of generality η T +1 = η T , and the statement of the lemma follows. 103 p ln N /T optimally (as a function of the time horizon T ), the bound p £ ¤ p T ln N b T ≤ T ln N . , while setting η = 2 ln N /t independent of T , we have E L t 2 Remark 6.1. q Setting η t = becomes Now let us consider the total number of action switches CT . The next lemma, which is a slightly improved and generalized version of Lemma 2 from [39] gives an upper bound on the expectation of CT . Lemma 6.3. Let CT denote the number of times the mSD algorithm switches between different actions. Then E [CT ] ≤ η T L ∗T −1 + ln N + TX −1 (η t − η T ) t =1 where L ∗T −1 = mini ∈{1,2,...,N } L i ,T −1 . Proof. The probability of switching experts in step t ≥ 2 is N X µ ¶ ¢ w i ,t ¡ P [It −1 = i ] 1 − c t αt = P [It −1 6= It ] = 1 − p i ,t w i ,t −1 i =1 µ ¶ N N w X X w i ,t w i ,t Wt i ,t −1 P [It −1 = i ] 1 − c t ≤ ct = 1 − ct = 1− w i ,t −1 w i ,t −1 Wt −1 i =1 i =1 W t −1 def where in the last line we used Lemma 6.1. Reordering gives Wt ≤ WT ≤ W1 1−αt c t W t −1 and thus T 1−α T 1−α Y Y t t = . c c t t t =2 t =2 On the other hand, WT ≥ w T, j ≥ 1 −ηT L j ,T −1 e . N for arbitrary j ∈ {1, 2, . . . , N }. Taking logarithms of both inequalities and putting them together, we get − ln N − η T L ∗T −1 = −η T min j ∈{1,2,...,N } L j ,T −1 ≤ T X ln(1 − αt ) − T X ln c t . t =2 t =2 Now using ln(1 − x) ≤ −x for all x ∈ [0, 1), we obtain T X E [CT ] = t =2 αt ≤ η T L ∗T −1 + ln N − T X ln c t . t =2 Now the statement of the lemma follows since − T X t =2 ln c t = T X (η t −1 − η t )(t − 2) ≤ t =2 TX −1 (η t − η T ). 
t =1 p p p In particular, for η t = ln N /T , we have E [CT ] ≤ T ln N +ln N , while setting η t = 2 ln N /t , p we obtain E [CT ] ≤ 4 T ln N + ln N . 104 6.2 Prediction by random-walk perturbation In this section we propose a variant of the Follow-the-Perturbed-Leader (FPL) algorithm that p switches actions at most O( T ln N ) times in expectation. The proposed forecaster perturbs the loss of each action at every time instant by a symmetric coin flip and chooses an action with minimal cumulative perturbed loss. More precisely, the algorithm draws independent random variables Xi ,t that take values ±1/2 with equal probabilities and Xi ,t is added to each loss `i ,t −1 . ¢ P ¡ At time t action i is chosen that minimizes ts=1 `i ,t −1 + Xi ,t (where we define `i ,0 = 0). Algorithm 6.2 Prediction by random-walk perturbation. Initialization: set L i ,0 = 0 and Zi ,0 = 0 for all i = 1, 2, . . . , N . For all t = 1, 2, . . . , T , repeat 1. Draw Xi ,t for all i = 1, 2, . . . , N such that ( Xi ,t = 1 2 with probability − 21 with probability 1 2 1 2. 2. Let Zi ,t = Zi ,t −1 + Xi ,t for all i = 1, 2, . . . , N . 3. Choose action ¡ ¢ It = arg min L i ,t −1 + Zi ,t . i 4. Observe losses `i ,t for all i = 1, 2, . . . , N , suffer loss `I t ,t . 5. Set L i ,t = L i ,t −1 + `i ,t for all i = 1, 2, . . . , N . Equivalently, the forecaster may be thought of as an FPL algorithm in which the cumulative P losses L i ,t −1 are perturbed by Zi ,t = it =1 Xi ,t . Since for each fixed i , Zi ,1 , Zi ,2 , . . . is a symmetric random walk, cumulative losses of the N actions are perturbed by N independent symmetric random walks. This is the way the algorithm is presented in Algorithm 6.2. A simple variation is when one replaces random coin flips by independent standard normal random variables. Both have similar performance guarantees and we choose ±(1/2)-valued perturbations for mathematical convenience. In Section 6.2.3 we switch to normally distributed perturbations—again driven by mathematical simplicity. In practice both versions are expected to have a similar behavior. Conceptually, the difference between standard FPL and the proposed version is the way the perturbations are generated: while common versions of FPL use perturbations that are generated in an i.i.d. fashion, the perturbations of the algorithm proposed here are dependent. This will enable us to control the number of action switches during the learning process. Note p that the standard deviation of these perturbations is still of order t just like for the standard FPL forecaster with optimal parameter settings. To obtain intuition why this approach will solve our problem, first consider a problem with N = 2 actions and an environment that generates equal losses, say `i ,t = 0 for all i and t , for all actions. When using i.i.d. perturbations, FPL switches actions with probability 1/2 in each 105 p round, thus yielding Ct = t /2 + O( t ) with overwhelming probability. The same holds for the exponentially weighted average forecaster. On the other hand, when using the random-walk perturbations described above, we only switch between the actions when the leading random walk is changed, that is, when the difference of the two random walks—which is also a symmetric random walk—hits zero. It is a well known that the number of occurrences of this event p up to time t is O p ( t ), see, [35]. As we show below, this is the worst case for the number of switches. The next theorem summarizes our performance bounds for the proposed forecaster. Theorem 6.1. 
The next theorem summarizes our performance bounds for the proposed forecaster.

Theorem 6.1. The expected regret and the expected number of action switches of the forecaster of Algorithm 6.2 satisfy, for all possible loss sequences (under the oblivious-adversary model),
$$\mathbb{E}\bigl[\widehat{L}_T\bigr] \le 2\,\mathbb{E}[C_T] \le 8\sqrt{2T \ln N} + 16 \ln T + 16.$$

Remark. Even though we only prove bounds for the expected regret and the expected number of switches, it is of great interest to understand upper tail probabilities. However, this is a highly nontrivial problem. One may get an intuition by considering the case when $N = 2$ and all losses are equal to zero. In this case the algorithm switches actions whenever a symmetric random walk returns to zero. This distribution is well understood, and the probability that this occurs more than $x\sqrt{T}$ times during the first $T$ steps is roughly $2\,\mathbb{P}\{\mathcal{N} > 2x\} \le 2e^{-2x^2}$, where $\mathcal{N}$ is a standard normal random variable (see [35, Section III.4]). Thus, in this case we see that the number of switches is bounded by $O\bigl(\sqrt{T \ln(1/\delta)}\bigr)$ with probability at least $1 - \delta$. However, proving analogous bounds for the general case remains a challenge.

To prove the theorem, we first show that the regret can be bounded in terms of the number of action switches. Then we turn to analyzing the expected number of action switches.

6.2.1 Regret and number of switches

The next simple lemma shows that the regret of the forecaster may be bounded in terms of the number of times the forecaster switches actions.

Lemma 6.4. Fix any $i \in \{1, 2, \ldots, N\}$. Then
$$\sum_{t=1}^{T} \ell_{I_t,t} - L_{i,T} \le 2C_T + Z_{i,T+1} - \sum_{t=1}^{T+1} X_{I_{t-1},t}.$$

Proof. We apply Lemma 3.1 of [25] (sometimes referred to as the "be-the-leader" lemma) to the sequence $(\ell_{\cdot,t-1} + X_{\cdot,t})_{t=1}^{\infty}$ with $\ell_{j,0} = 0$ for all $j \in \{1, 2, \ldots, N\}$, obtaining
$$\sum_{t=1}^{T+1} \bigl( \ell_{I_t,t-1} + X_{I_t,t} \bigr) \le \sum_{t=1}^{T+1} \bigl( \ell_{i,t-1} + X_{i,t} \bigr) = L_{i,T} + Z_{i,T+1}.$$
Reordering terms, we get
$$\sum_{t=1}^{T} \ell_{I_t,t} \le L_{i,T} + \sum_{t=1}^{T+1} \bigl( \ell_{I_{t-1},t-1} - \ell_{I_t,t-1} \bigr) + Z_{i,T+1} - \sum_{t=1}^{T+1} X_{I_t,t}. \tag{6.2}$$
The last term can be rewritten as
$$-\sum_{t=1}^{T+1} X_{I_t,t} = -\sum_{t=1}^{T+1} X_{I_{t-1},t} + \sum_{t=1}^{T+1} \bigl( X_{I_{t-1},t} - X_{I_t,t} \bigr).$$
Now notice that $X_{I_{t-1},t} - X_{I_t,t}$ and $\ell_{I_{t-1},t-1} - \ell_{I_t,t-1}$ are both zero when $I_t = I_{t-1}$ and are upper bounded by $1$ otherwise. That is, we get that
$$\sum_{t=1}^{T+1} \bigl( \ell_{I_{t-1},t-1} - \ell_{I_t,t-1} \bigr) + \sum_{t=1}^{T+1} \bigl( X_{I_{t-1},t} - X_{I_t,t} \bigr) \le 2 \sum_{t=1}^{T+1} \mathbb{I}_{\{I_{t-1} \ne I_t\}} = 2C_T.$$
Putting everything together gives the statement of the lemma.

6.2.2 Bounding the number of switches

Next we analyze the number of switches $C_T$. In particular, we upper bound the marginal probability $\mathbb{P}[I_{t+1} \ne I_t]$ for each $t \ge 1$. We define the lead pack $A_t$ as the set of actions that, at time $t$, have a positive probability of taking the lead at time $t+1$:
$$A_t = \Bigl\{ i \in \{1, 2, \ldots, N\} : L_{i,t-1} + Z_{i,t} \le \min_{j} \bigl( L_{j,t-1} + Z_{j,t} \bigr) + 2 \Bigr\}.$$
We bound the probability of a lead change as
$$\mathbb{P}[I_t \ne I_{t+1}] \le \frac{1}{2}\,\mathbb{P}\bigl[ |A_t| > 1 \bigr].$$
The key to the proof of the theorem is the following lemma, which gives an upper bound on the probability that the lead pack contains more than one action. It implies, in particular, that
$$\mathbb{E}[C_T] \le 4\sqrt{2T \ln N} + 4 \ln T + 4,$$
which is what we need to prove the expected-value bounds of Theorem 6.1.

Lemma 6.5.
$$\mathbb{P}\bigl[ |A_t| > 1 \bigr] \le 4\sqrt{\frac{2 \ln N}{t}} + \frac{8}{t}.$$

Proof. Define $p_t(k) = \mathbb{P}\bigl[ Z_{i,t} = \tfrac{k}{2} \bigr]$ for all $k = -t, \ldots, t$, and let $S_t$ denote the set of leaders at time $t$ (so that the forecaster picks $I_t \in S_t$ arbitrarily):
$$S_t = \Bigl\{ j \in \{1, 2, \ldots, N\} : L_{j,t-1} + Z_{j,t} = \min_{i} \bigl( L_{i,t-1} + Z_{i,t} \bigr) \Bigr\}.$$
Let us start with analyzing P [|At | = 1]: N t X X ¸ © ª k L i ,t −1 + Zi ,t ≥ L j ,t −1 + + 2 i ∈{1,2,...,N }\ j 2 k=−t j =1 ¸ · tX −4 X N © ª p t (k) k +4 ≥ p t (k + 4)P min L i ,t −1 + Zi ,t ≥ L j ,t −1 + i ∈{1,2,...,N }\ j 2 p t (k + 4) k=−t j =1 ¸ · t N X X © ª k p t (k − 4) = p t (k)P min L i ,t −1 + Zi ,t ≥ L j ,t −1 + . i ∈{1,2,...,N j }\ 2 p t (k) k=−t +4 j =1 P [|At | = 1] = p t (k)P · min Before proceeding, we need to make two observations. First of all, N X j =1 p t (k)P ¸ · ¸ © ª k k ≥ P ∃ j ∈ St : Z j ,t = L i ,t −1 + Zi ,t ≥ L j ,t −1 + i ∈{1,2,...,N }\ j 2 2 · ¸ k ≥ P min Z j ,t = , j ∈St 2 · min where the first inequality follows from the union bound and the second from the fact that the latter event implies the former. Also notice that Zi ,t + 2t is binomially distributed with parame¡ t ¢ ters t and 1/2 and therefore p t (k) = t +k 21t . Hence 2 ³ p t (k − 4) =³ t +k p t (k) 2 = 1+ ´³ ´ ! t −k 2 ! ´³ ´ − 2 ! t −k + 2 ! 2 t +k 2 4(t + 1)(k − 2) . (t − k + 2)(t − k + 4) It can be easily verified that 4(t + 1)(k − 2) 4(t + 1)(k − 2) ≥ (t − k + 2)(t − k + 4) (t + 2)(t + 4) holds for all k ∈ [−t , t ]. Using our first observation, we get t X X ¸ © ª k p t (k − 4) P [|At | = 1] ≥ p t (k)P min L i ,t −1 + Zi ,t ≥ L j ,t −1 + i ∈{1,2,...,N }\ j 2 p t (k) j k=−t +4 · ¸ t X k p t (k − 4) ≥ P min Z j ,t = . j ∈S 2 p t (k) t k=−t +4 · 108 Along with our second observation, this implies t X · ¸ k p t (k − 4) P min Z j ,t = P [|At | > 1] ≤1 − j ∈St 2 p t (k) k=−t +4 · ¸µ ¶ t X k 4(t + 1)(k − 2) P min Z j ,t = 1+ ≤1 − j ∈St 2 (t + 2)(t + 4) k=−t +4 · ¸ µ ¶ t X k 4(2 − k)(t + 1) P min Z j ,t = ≤ j ∈St 2 (t + 2)(t + 4) k=−t ¸ · 8(t + 1) t +1 = −8 E min Z j ,t j ∈St (t + 2)(t + 4) (t + 2)(t + 4) ¸ · 8 8 ≤ + E max Z j ,t . j ∈{1,2,...,N } t t £ ¤ q Now using E max j Z j ,t ≤ t ln2 N implies s P [|At | > 1] ≤ 4 2 ln N 8 + t t as desired. 6.2.3 Online combinatorial optimization In this section we study the case of online linear optimization (see, among others, [38], [66], [40], [94], [64], [99], [56], [53], [68], [27], [5]). This is a similar prediction problem as the one described on Figure 2.1 but here each action i is represented by a vector v i ∈ Rd . The loss corresponding to action i at time t equals v i> `t where `t ∈ [0, 1]d is the so-called loss vector. Thus, given a set of actions S = {v i : i = 1, 2, . . . , N } ⊆ Rd , at every time instant t , the forecaster chooses, in a possibly randomized way, a vector Vt ∈ S and suffers loss V> t `t . Using this notation, the regret becomes T X t =1 where L t = Pt s=1 `s > V> t `t − min v L T v∈S is the cumulative loss vector. While the results of the previous section still hold when treating each v i ∈ S as a separate action, one may gain important computational advantage by taking the structure of the action set into account. In particular, as [64] emphasize, FPL-type forecasters may often be computed efficiently. In this section we propose such a forecaster which adds independent random-walk perturbations to the individual components of the loss vector. To gain simplicity in the presentation, we restrict our attention to the case of online combinatorial optimization in which S ⊂ {0, 1}d , that is, each action is represented a binary vector. This special case arguably contains most important applications such as the (non-stochastic) online shortest path problem. In this example, a fixed directed acyclic graph of d edges is given with two distinguished vertices u and w. 
The forecaster, at every time instant t , chooses a directed path from u to w. Such a path is represented by it binary incidence vector v ∈ {0, 1}d . The components of the loss vector `t ∈ [0, 1]d represent losses assigned to the 109 d edges and v > `t is the total loss assigned to the path v. Another (non-essential) simplifying assumption is that every action v ∈ S has the same number of 1’s: kvk1 = m for all v ∈ S. The value of m plays an important role in the bounds below. The proposed prediction algorithm is defined as follows. Let X1 , . . . , Xn be independent Gaussian random vectors taking values in Rd such that the components of each Xt are i.i.d. normal Xi ,t ∼ N(0, η2 ) for some fixed η > 0 whose value will be specified later. Denote Zt = t X Xt . s=1 The forecaster at time t , chooses the action © ª Vt = arg min v > (L t −1 + Zt ) , v∈S where L t = Pt s=1 `t for t ≥ 1 and L 0 = (0, . . . , 0)> . The next theorem bounds the performance of the proposed forecaster. Again, we are not P only interested in the regret but also the number of switches nt=1 I{Vt +1 6=Vt } . The regret is of simp ilar order—roughly m d T —as that of the standard FPL forecaster, up to a logarithmic factor. p ¢ ¡ Moreover, the expected number of switches is O m(ln d )5/2 T . Remarkably, the dependence on d is only polylogarithmic and it is the weight m of the actions that plays an important role. We note in passing that the SD algorithm presented in Section 6.1 can be used for simulp taneously guaranteeing that the expected regret is O(m 3/2 T ln d ) and the expected number p of switches is mT ln d . However, as this algorithm requires explicit computation of the exponential weighted forecaster, it can only be efficiently implemented for some special decision sets S—see [68] and [27] for some examples. On the other hand, our algorithm can be efficiently implemented whenever there exists an efficient implementation of the static optimization problem of finding arg minv∈S v > ` for any ` ∈ Rd . Theorem 6.2. The expected regret and the expected number of action switches satisfy (under the oblivious adversary model) µ ¶ p p 2d md (ln T + 1) b LT ≤ m T + η 2 ln d + η η2 and " E T X t =1 # I{Vt +1 6=Vt } ≤ µ ³ ³ ´ ´2 ¶ p p 2 m 1 + 2η 2 ln d + 2 ln d + 2 ln d + 1 + η 2 ln d + 1 T X 4η2 t ´´ p p 2 ln d T m 1 + η 2 ln d + 2 ln d + 1 X + . p η t t =1 t =1 ³ ³ 110 In particular, setting η = q p 2d 2 ln d yields p p 4 p b T ≤ 4m d T ln d + m(ln T + 1) ln d . L and E · n X t =1 ¸ ³ p ´ I{Vt +1 6=Vt } = O m(ln d )5/2 T . The proof of the regret bound is quite standard, similar to the proof of Theorem 3 in [6], and is deferred to the end of this section. The more interesting part is the bound for the ex£P ¤ P pected number of action switches E nt=1 I{Vt +1 6=Vt } = nt=1 P [Vt +1 6= Vt ]. The upper bound on this quantity follows from the lemma below and the well-known fact that the expected value of the maximum of the square of d independent standard normal random variables is at most p 2 ln d + 2 ln d + 1 (see, e.g., [18]). Thus, it suffices to prove the following: Lemma 6.6. For each t = 1, 2, . . . , T , P [Vt +1 6= Vt |Xt +1 ] ≤ m k`t + Xt +1 k2∞ 2η2 t p m k`t + Xt +1 k∞ 2 ln d + p η t Proof. We use the notation Pt [·] = P [· |Xt +1 ] and Et [·] = E [· |Xt +1 ]. Also, let ht = `t + Xt +1 and Ht = tX −1 ht . s=0 Furthermore, we will use the shorthand notation c = kht k∞ . Define the set At as the lead pack: © ª At = w ∈ S : (w − Vt )> Ht ≤ kw − Vt k1 c . 
Observe that the choice of c guarantees that no action outside At can take the lead at time t +1, since if w 6∈ At , then ¯ ¯ (w − Vt )> Ht ≥ ¯(w − Vt )> ht ¯ so (w − Vt )> Ht +1 ≥ 0 and w cannot be the new leader. It follows that we can upper bound the probability of switching as Pt [Vt +1 6= Vt ] ≤ Pt [|At | > 1] , which leaves us with the problem of upper bounding Pt [|At | > 1]. Similarly to the proof of Lemma 6.5, we start analyzing Pt [|At | = 1]: Pt [|At | = 1] = = X v∈S £ ¤ Pt ∀w 6= v : (w − v)> Ht ≥ kw − vk1 c X Z v∈S y∈R ¯ £ ¤ f v (y)Pt ∀w 6= v : w > Ht ≥ y + kw − vk1 c ¯v > Ht = y d y, 111 (6.3) where f v is the distribution of v > Ht . Next we crucially use the fact that the conditional distributions of correlated Gaussian random variables are also Gaussian. In particular, defining k(w, v) = (m − kw − vk1 ), the covariances are given as ¡ ¢ cov w > Ht , v > Ht = η2 (m − kw − vk1 )t = η2 k(w, v)t . Let us organize all actions w ∈ S \ v into a matrix W = (w 1 , w 2 , . . . , w N −1 ). The conditional distribution of W > Ht is an (N − 1)-variate Gaussian distribution with mean µ ¶ k(w 1 , v) > k(w 2 , v) k(w N −1 , v) > > µv (y) = w 1> L t −1 + y , w 2 L t −1 + y , . . . , wN L + y −1 t −1 m m m and covariance matrix Σv , given that v > Ht = y. Defining K = (k(w 1 , v), . . . , k(w N −1 , v))> and 2 1 exp(− x2 ), we get that (2π)N −1 |Σv | using the notation ϕ(x) = p ¯ £ ¤ Pt ∀w 6= v : w > Ht ≥ y + kw − vk1 c ¯v > Ht = y Z = ∞ Z φ ··· ´ ³q (z − µv (y))> Σ−1 (z − µ (y)) dz v y z i =y+(m−k(w i ,v))c Z ∞ Z = ··· φ µq ¡ z − µv (y) − cK ¢> Σ−1 y ¡ z − µv (y) − cK ¶ ¢ dz z i =y+(m−k(w i ,v))c+k(w i ,v)c Z ∞ Z µq = φ ··· ¡ ¶ ¢> −1 ¡ ¢ z − µv (y + mc) Σ y z − µv (y + mc) d z z i =y+mc ¯ £ ¤ = Pt ∀w 6= v : w > Ht ≥ y + mc ¯ v > Ht = y + mc , where we used µ y+mc = µ y + cK . Using this, we rewrite (6.3) as Pt [|At | = 1] = X Z v∈S y∈R − ¯ £ ¤ f v (y)Pt ∀w 6= v : w > Ht ≥ y ¯ v > Ht = y d y X Z ¡ v∈S y∈R =1 − ¯ ¢ £ ¤ f v (y) − f v (y − mc) Pt ∀w 6= v : w > Ht ≥ y ¯ v > Ht = y d y X Z ¡ v∈S y∈R ¯ ¢ £ ¤ f v (y) − f v (y − mc) Pt ∀w 6= v : w > Ht ≥ y ¯ v > Ht = y d y. To treat the remaining term, we use that v > Ht is Gaussian with mean v > L t −1 and standard p deviation η mt and obtain µ ¶ f v (y − mc) f v (y) − f v (y − mc) = f v (y) 1 − f v (y) µ ¶ 2 mc c(y − v > L t −1 ) ≤ f v (y) − . 2η2 t η2 t 112 Thus, Pt [|At | > 1] ≤ ≤ X Z ¡ v∈S y∈R ¯ ¢ £ ¤ f v (y) − f v (y − mc) Pt ∀w 6= v : w > Ht ≥ y ¯ v > Ht = y d y ¶ ¯ £ ¤ mc 2 c(y − v > L t −1 ) − Pt ∀w 6= v : w > Ht ≥ y ¯ v > Ht = y d y f v (y) 2 2 2η t η t µ X Z v∈S y∈R £ ¤ £ ¤ mc 2 c E V> mc 2 mc E kZt k∞ t Zt = 2 − ≤ 2 + 2η t η2 t 2η t η2 t p 2 m kht k∞ m kht k∞ 2 ln d + = , p 2η2 t η t p ¤ £ where we used the definition of c and E kZt k∞ ≤ η 2t ln d in the last step. Proof of the first statement of Theorem 6.2. The proof is based on the proof of Theorem 4.2 of [25] and Theorem 3 of [6]. The main difference from those proofs is that the standard deviation of our perturbations changes over time, however, this issue is very easy to treat. First, we define p b t = t X1 : an infeasible “forecaster” that peeks one step into the future and uses perturbation Z ¡ ¢ b t = arg min w > L t + Z bt . V w∈S Using Lemma 3.1 of [25], we get for any fixed v ∈ S that T X > b b b b> V t (`t + (Zt − Zt −1 )) ≤ v (L n + ZT ). 
t =1 After reordering, we obtain T X T X > >b V> t ` t ≤ v L T + v ZT + t =1 bT + = v >LT + v >Z t =1 T X b t )> ` t − (Vt − V b t )> ` t + (Vt − V t =1 T X t =1 T X b b b> V t (Zt − Zt −1 ) p p > b t X1 ( t − 1 − t )V t =1 The last term can be bounded as T p T p X X p ¯ > ¯ p > b t X1 ¯ b t X1 ≤ ( t − t − 1) ¯V ( t − 1 − t )V t =1 t =1 ≤m T p X p ( t − t − 1) kX1 k∞ t =1 p ≤m T kX1 k∞ . Taking expectations, we obtain the bound " E T X t =1 # V> t `t − v >LT ≤ T p X £ ¤ b t )> `t + ηm 2T ln d , E (Vt − V t =1 113 p £ ¤ £ ¤ b t )> `t where we used E kX1 k∞ ≤ η 2 ln d . That is, we are left with the problem of bounding E (Vt − V for each t ≥ 1. To this end, let v(z) = arg min w > z w∈S for all z ∈ Rd , and also F t (z) = v(z)> `t . b t . We have Further, let f t (z) be the density of Zt , which coincides with the density of Z £ ¤ E V> t `t =E [F t (L t −1 + Zt )] Z = f t (z)F t (L t −1 + z) d z d Zz∈R = f t (z)F t (L t − `t + z) d z z∈Rd Z = f t (z + `t )F t (L t + z) d z z∈Rd Z £ ¤ ¡ ¢ b =E F t (L t + Zt ) + f t (z + `t ) − f t (z) F (L t + z) d z d z∈R Z £ > ¤ ¡ ¢ b t `t + =E V f t (z) − f t (z − `t ) F (L t −1 + z) d z . z∈Rd The last term can be upper bounded as Z z∈Rd ¶¶ µ µ (z − `t )> `t F t (L t −1 + z) d z f t (z) 1 − exp η2 t ¶ µ Z (z − `t )> `t F (L t −1 + z) d z ≤− f t (z) η2 t z∈Rd £ ¤ Z 2 ¯ ¯ E V> m t `t k`t k2 ≤ + f t (z) ¯z > `t ¯ d z 2 2 η t η t z∈Rd Z md m f t (z) kzk1 d z ≤ 2 + 2 η t η t z∈Rd r md 2 md = 2 + · p , η t π η t p £ ¤ where we used E kZt k1 = ηd 2t /π in the last step. Putting everything together, we obtain the statement of the theorem as E · n X t =1 V> t `t ¸ r T md T p X X 2 md − v LT ≤ + · + ηm 2T ln d p 2 π η t t =1 η t t =1 p p 2md T md (ln T + 1) ≤ + ηm 2T ln d + . η η2 > 114 Chapter 7 Online lossy source coding In this chapter we consider the problem of fixed-rate sequential lossy source coding of individual sequences with limited delay. Here a source sequence z 1 , z 2 , . . . taking values from the source alphabet Z has to be transformed into a sequence y 1 , y 2 , . . . of channel symbols taking values in the finite channel alphabet {1, . . . , M }, and these channel symbols are then used to produce the reproduction sequence ẑ 1 , ẑ 2 , . . .. The rate of the scheme is defined as ln M nats (where ln denotes the natural logarithm), and the scheme is said to have δ1 encoding and δ2 decoding delay if, for any t = 1, 2, . . ., the channel symbol y t depends on z t +δ1 = (z 1 , z 2 , . . . , z t +δ1 ) and ẑ t depends on y t +δ2 = (y 1 , . . . , y t +δ2 ). The goal of the coding scheme is to minimize the distortion between the source sequence and the reproduction sequence. We concentrate on the individual sequence setting and aim to find methods that work uniformly well with respect to a reference coder class on every individual (deterministic) sequence. Thus, no probabilistic assumption is made on the source sequence, and the performance of a scheme is measured by the distortion redundancy defined as the maximal difference, over all source sequences of a given length, between the normalized distortion of the given coding scheme and that of the best reference coding scheme matched to the underlying source sequence. As will be shown later, this problem is an instance of online learning with switching costs, where taking actions corresponds to selecting coding schemes, distortions correspond to negb T /T . 
ative rewards and the distortion redundancy corresponds to the average expected regret L The switching cost naturally arises from the fact that every time the coding scheme is changed on the source side, the receiver has to be informed of the new decoding scheme via the same channel that transmits the useful information. This chapter presents our works György and Neu [48] and György and Neu [49]. Our results are summarized below. Thesis 5. Proposed an efficient algorithm for the problem of online lossy source coding. Proved performance guarantees for Setting 4. The proved bounds are optimal in the number of time steps, up to logarithmic factors. [48, 49] 115 7.1 Related work The study of limited-delay (zero-delay) lossy source coding in the individual sequence setting was initiated by Linder and Lugosi [71], who showed the existence of randomized coding schemes that perform, on any bounded source sequence, essentially as well as the best scalar quantizer matched to the underlying sequence. More precisely, it was shown that the normalized squared error distortion of their scheme on any source sequence z T of length T is at most O(T −1/5 ln T ) larger than the normalized distortion of the best scalar quantizer matched to the source sequence in hindsight. The method of [71] is based on the exponentially weighted average (EWA) prediction method [96, 97, 72]: at each time instant a coding scheme (a scalar quantizer) is selected based on its “estimated” performance. A major problem in this approach is that the prediction, and hence the choice of the quantizer at each time instant, is performed based on the source sequence which is not known exactly at the decoder. Therefore, in [71] information about the source sequence that is used in the random choice of the quantizers is also transmitted over the channel, reducing the available capacity for actually encoding the source symbols. The coding scheme of [71] was improved and generalized by Weissman and Merhav [100]. They considered the more general case when the reference class F is a finite set of limited-delay and limited-memory coding schemes. To reduce the communication about the actual decoder to be used at the receiver, Weissman and Merhav introduced a coding scheme where the source sequence is split into blocks of equal length, and in each block a fixed encoder-decoder pair is used with its identity communicated at the beginning of each block. Similarly to [71], the code for each block is chosen using the EWA prediction method. The resulting scheme achieves an O(T −1/3 ln2/3 |F|) distortion redundancy, or, in the case of the infinite class of scalar quantizers, the distortion redundancy becomes O(T −1/3 ln T ). The results of [100] have been extended in various ways, but all of these works are based on the block-coding procedure described above. A disadvantage of this method is that the EWA prediction algorithm keeps one weight for each code in the reference class, and so the computational complexity of the method becomes prohibitive even for relatively simple and small reference classes. Computationally efficient solutions to the zero-delay case were given by György, Linder and Lugosi using dynamic programming [95] and EWA prediction in [43] and based on the Follow-the-Perturbed-Leader prediction method (see [51, 64]) in [44]. 
The first method achieves the O(T −1/3 ln T ) redundancy of Weissman and Merhav with O(M T 4/3 ) p computational and O(T 2/3 ) space complexity and a somewhat worse O(T −1/4 ln T ) distortion redundancy with linear O(M T ) time and O(T 1/2 ) space complexity, while the second method achieves O(T −1/4 ln T ) distortion redundancy with the same O(M T ) linear time complexity and O(M T 1/4 ) space complexity. Matloub and Weissman [75] extended the problem to allow a stochastic discrete memoryless channel between the encoder and the decoder, while Reani and Merhav [88] extended the model to the Wyner-Ziv case (i.e., when side information is also available at the decoder). The performance bound in both cases are based on [100] while low-complexity solutions for 116 the zero-delay case are provided based on [44] and [43], respectively. Finally, the case when the reference class is a set of time-varying limited-delay limited-memory coding scheme was analyzed in [45], and efficient solutions were given for the zero-delay case for both traditional and network (multiple-description and multi-resolution) quantization. Since most of the above coding schemes are based on the block-coding scheme of [100], they cannot achieve better distortion redundancy than O(T −1/3 ) up to some logarithmic factors. On the other hand, the distortion redundancy is known to be bounded from below by a constant multiple of T −1/2 in the zero-delay case [43], leaving a gap between the best known upper and lower bounds. Furthermore, if the identity of the used coding scheme were communicated as side information (before the encoded symbol is revealed) the employed EWA prediction method would guarantee an O(T −1/2 ln |F|) distortion redundancy for any finite reference coder class F (of limited delay and limited memory), in agreement with the lower bound. Thus, to improve upon the existing coding schemes, the communication overhead (describing the actually used coding schemes) between the encoder and the decoder has to be reduced, which is achievable by controlling the number of times the coding scheme changes in a better way then blockwise coding. This goal can be achieved by employing the techniques presented in Chapter 6. In this chapter we construct a randomized coding strategy, which uses a the mSD algorithm p described in Section 6.1 as the prediction component, that achieves an O( ln T /T ) average distortion redundancy with respect to a finite reference class of limited-delay and limited-memory source codes. The method can also be applied to compete with the (infinite) reference class p of scalar quantizers, where it achieves an O(ln T / T ) distortion redundancy. Note that these bounds are only logarithmic factors larger than the corresponding lower bound. After presenting the formalism of sequential source coding in Section 7.2, we present our randomized coding strategy in Section 7.3. The strategy is applied to the problem of adaptive zero-delay lossy source coding in Section 7.4 and further extensions are given in Section 7.5 7.2 Limited-delay limited-memory sequential source codes A fixed-rate delay-δ (randomized) sequential source code of rate ln M is defined by an encoderdecoder pair connected via a discrete noiseless channel of capacity ln M . Here δ is a nonnegative integer and M ≥ 2 is a positive integer. The input to the encoder is a sequence z 1 , z 2 , . . . taking values in some source alphabet Z. At each time instant t = 1, 2, . . 
., the encoder observes z t and a random number Ut , where the randomizing sequence U1 , U2 , . . . is assumed to be independent with its elements uniformly distributed over the interval [0, 1]. At each time instant t + δ, t = 1, 2, . . ., based on the source sequence z t +δ = (z 1 , . . . , z t +δ ) and the randomizing sequence Ut = (U1 , . . . , Ut ) received so far, the encoder produces a channel symbol yt ∈ {1, 2, . . . , M } which is then transmitted to the decoder. After receiving yt , the decoder outputs the reconstruction value ẑt ∈ Ẑ based on the channel symbols yt = (y1 , . . . , yt ) received so far, where Ẑ is the reconstruction alphabet. Formally, a code is given by a sequence of encoder-decoder functions ( f , g ) = { f t , g t }∞ t =1 , 117 where f t : Zt +δ × [0, 1]t → {1, 2, . . . , M } and g t : {1, 2, . . . , M }t → Ẑ so that yt = f t (z t +δ , Ut ) and ẑt = g t (yt ), t = 1, 2, . . .. Note that the total delay of the encoding and decoding process is δ.1 To simplify the notation we will omit the randomizing sequence from f t (·, Ut ) and write f t (·) instead. Now let F be a finite set of reference codes with |F| = N . The cumulative distortion of the sequential scheme after reproducing the first T symbols is given by D̂T (z T +δ ) = T X d (z t , ẑt ), t =1 where d : Z × Ẑ → [0, 1] is some distortion measure,2 while the minimal cumulative distortion achievable by codes from F is ∗ T +δ DF (z ) = min T X ( f ,g )∈F t =1 ¡ ¢ d z t , g t (y t ) ¡ ¢ where the sequence y T is generated sequentially by ( f , g ), that is, y t = f t z t +δ , Ut . Of course, in general it is impossible to come up with a coding scheme that attains this distortion without knowing the whole input sequence beforehand. Thus, our goal is to construct a coding scheme that asymptotically achieves the performance of the above encoder-decoder pair. Formally this means that we want to obtain a randomized coding scheme that minimizes the worst-case expected normalized distortion redundancy ³ ´i ³ ´o 1n h ∗ E D̂T z T +δ − D F z T +δ , z T ∈ZT T b T = max R where the expectation is taken with respect to the randomizing sequence UT of our coding scheme. The decoder {g t } is said to be of memory s ≥ 0 if g t ( ŷ t ) = g t ( ỹ t ) for all t and ŷ t , ỹ t ∈ {0, . . . , M }t such that ŷ tt −s = ỹ tt −s , where ŷ tt −s = ( ŷ t −s , ŷ t −s+1 , . . . , ŷ t ) and ỹ tt −s = ( ỹ t −s , ỹ t −s+1 , . . . , ỹ t ). Let Fδ denote the collection of all (randomized) delay-δ sequential source codes of rate ln M , and let Fsδ denote the class of codes in Fδ with memory s. Weissman and Merhav [100] proved that there exists a randomized coding scheme such that, for any δ ≥ 0 and s ≥ 0 and for any finite class F ⊂ Fsδ of reference codes, the normalized distortion redundancy with respect to F is of order T −1/3 ln2/3 |F|. This coding scheme splits the source sequence into blocks of length O(T 1/3 ). At the beginning of each block a code is se1 Although we require the decoder to operate with zero delay, this requirement introduces no loss in generality, as any finite-delay coding system with δ1 encoding and δ2 decoding delay can be represented equivalently in this way with δ1 + δ2 encoding and zero decoding delay [100]. 2 All results may be extended trivially for arbitrary bounded distortion measures 118 lected from F using EWA prediction and the identity of the selected reference decoder is communicated to the decoder. 
During these first steps, the decoder emits arbitrary reproduction symbols, while the chosen code is used in the rest of the block. The formation of the blocks ensures that only a limited fraction of the available channel capacity is used for describing codes, while the limited memory property ensures that not transmitting real data at the beginning of each block has only a limited effect on decoding the rest of the block. 7.3 The Algorithm Next we describe a coding scheme based on the observation that the problem of sequential lossy source coding can be regarded as an online prediction problem where switching between actions is subject to some positive cost. While both algorithms presented in Chapter 6 can be directly applied for this learning problem, we propose an algorithm based on the mSD prediction algorithm presented in Section 6.1. The proposed algorithm adaptively creates blocks of p variable length such that on the average O( T ) blocks are created, and so the overhead used to p transmit code descriptions scales with T instead of T 2/3 in [100]. The coding scheme, given in Algorithm 7.1, works as follows. At each time instant t the mSD algorithm selects one code (f(t ) , g(t ) ) from the finite reference class F; the loss in the mSD algorithm associated with ( f , g ) ∈ F is defined by ¡ ¡ ¢¢ `( f ,g ),t (z t +δ ) = d z t , g t y 0t (7.1) where y 10 , y 20 , . . . , y t0 is the sequence obtained by using the coding scheme ( f , g ) to encode z t , that is, y t0 = f t (z t +δ , Ut ) (note that `( f ,g ),t can be computed at the encoder at time t + δ). The mSD algorithm splits the time into blocks [1, t 1 ], [t 1 + 1, t 2 ], [t 2 + 1, t 3 ], . . . in a natural way such that the decoder of the reference code chosen by the algorithm is constant over each block, that is, g(ti +1) = g(ti +2) = · · · = g(ti +1 ) and g(ti ) 6= g(ti +1) for all i (here we used the convention t 0 = 0). Since the beginning of a new block can only be noticed at the encoder, this event has to be communicated to the decoder. In order to do so, we select randomly a new-block signal v of length A (that is, v ∈ {1, . . . , M } A ), and v is transmitted over the channel in the first A time steps of each block. In the next B time steps of the block the identity of the decoder chosen by the mSD algorithm is communicated, where ln |{g : ( f , g ) ∈ F}| B= ln M » ¼ (7.2) is the number of channel symbols required to describe uniquely all possible decoder functions. In the remainder of the block the selected encoder (or, possibly, more encoders) is used to encode the source symbols. On the other hand, whenever the decoder observes v in the received channel symbol sequence yt , it starts a new block. In this block the decoder first receives the index of the reference decoder to be used in the block, and the received reference decoder is used in the remainder of 119 the block to generate the reproduction symbols. One slight problem here is that the new-block signal may be obtained by encoding the input sequence; in this case, to synchronize with the decoder, a new block is started at the encoder. We can keep the loss introduced by these unnecessary new blocks low by a careful choice of the new-block signal. Clearly, if v is selected uniformly at random from {1, 2, . . . , M } A then for any fixed string u ∈ {1, 2, . . . , M } A , P [v = u] = 1/M A . Thus, setting A = O(ln T ) makes P [v = u] = O(1/T ), and so the expected number of unnecessary new blocks is at most a constant in T time steps. 
In summary, the algorithm works in blocks of variable length as follows: At the beginning of the block an algorithm is selected using the mSD prediction algorithm and a new-block signal and the identity of the chosen decoder is communicated to the receiver. In the next time steps, as long as the mSD algorithm selects the same decoder, the chosen code is used to encode the source symbols at the sender and used for decoding at the receiver. When the mSD method selects a different decoder, or a new block signal is transmitted by chance, a new block is started. The next result shows that the normalized distortion redundancy of the proposed scheme p is O( ln(T )/T ). Theorem 7.1. Let F ⊂ Fsδ be a finite reference class of delay-δ memory-s codes, and, for some T > 0, set A = dln T / ln M e and s ηt = ln |F| . T (17/8 + ln (T |F|) / ln M + s) (7.3) Then the expected normalized distortion redundancy of Algorithm 7.1 can be bounded as s bT ≤2 R µ ¶ ¶ µ ln |F| 17 ln(T |F|) ln T . + +s +O T 8 ln M T l m ln T Remark 7.1. In the above result concerning our algorithm, the parameters A = ln M and η t = p η = O(1/ T ln T ) are set as a function of the time horizon T . The proposed algorithm can be modified to be strongly sequential in the sense that it becomes horizon independent. The main difference is that the new-block signal will be time-variant: at time instants M k−1 the kth symbol vk of v is transmitted, and at each time instant t the so far received new-block signal l m p ln t v A t of length A t = ln M is used. Setting η t = O(1/ t ln t ), it can be shown that the modified algorithm has only a constant time larger regret than the original, horizon-dependent one. The above theorem follows from the following result taking into consideration that B ≤ dln |F|/ ln M e and setting A = dln T / ln M e and η t according to (7.3). Theorem 7.2. The expected normalized distortion redundancy of Algorithm 7.1 for any finite reference class F ⊂ Fsδ can be bounded as µ ¶ T X η t (A + B + s)(ln |F| + 1 + TM−A ln |F| 1 A ) b RT ≤ + + A +B +s + T ηT 8 T t =1 T 120 Algorithm 7.1 An efficient algorithm for adaptive sequential lossy source coding Encoder: 1. Input: A finite reference class F ⊂ Fsδ , positive integer A and time horizon T . 2. Initialization (a) Draw a new-block signal v uniformly at random from {1, . . . , M } A , the set of channel symbol sequences of length A. (b) Initialize the mSD algorithm for F. (c) Set B according to (7.2) and set g0 to an invalid value (not contained in F). 3. For each block do (a) Observe z t +δ . (b) For all time instants (t + δ) run the mSD algorithm: i. Feed the mSD algorithm with losses `( f ,g ),t (z t +δ ) for each code ( f , g ) ∈ F. ii. Let (f(t ) , g(t ) ) denote the choice of the mSD algorithm. (c) In the first A time steps of the block transmit v. (d) After the first A time steps set (f, g) = (f(t ) , g(t ) ), the output of the mSD algorithm in this time step. (e) In time steps A + 1, . . . , A + B of the block send the index describing g. (f) If (t + δ) belongs to steps A + B + 1, A + B + 2, . . . of the block then i. if g(t ) = g(t −1) then transmit yt = ft (z t +δ ) ii. else start a new block with the same time index; iii. if (yt −A+1 , . . . , yt ) = v then start a new block at the next time instant. Decoder: 1. Input: A finite reference class F ⊂ Fsδ , positive integers A, B , time horizon T . 2. For t = 1, . . . , A (a) Observe yt . (b) At time t = A set v = y A and declare a new block. 3. For each block do (a) Observe yt . 
(b) In the first B time steps of the block receive the index of the decoder to be used. At time step B of the block set the decoder g according to the symbols received so far. (c) In time steps B, B + 1, . . . of the block output ẑt = g(yt ) = g(ytt −s+1 ). (d) In time steps B + A, B + A+1, . . . of the block declare a new block if (yt −A+1 , . . . , yt ) = v. 121 Proof. Let ẑ ( f ,g ),1 , . . . , ẑ ( f ,g ),T denote the reproduction sequence generated by the reference code ( f , g ) ∈ F when applied to the source sequence z T , and let z̃t = ẑ (f(t ) ,g(t ) ),t . That is, z̃t is the reproduction sequence our coding scheme would generate if it did not have to transmit the identity of the chosen reference decoder, and the correct past s symbols were also available at the decoder (in the current setting when the reference decoder changes we have to wait s channel symbols to have the decoder operating correctly, as it may require s past symbols due to its memory). Decomposing the cumulative distortion we get T X X d t (z t , ẑt ) = t =1 t :1≤t ≤T,ẑt =z̃t ≤ T X t =1 X d t (z t , z̃t ) + d t (z t , ẑt ) t :1≤t ≤T,ẑt 6=z̃t `(f(t ) ,g(t ) ),t (z t +δ (7.4) ) + |{t : ẑt 6= z̃t , 1 ≤ t ≤ T }| . The expectation of the first term can be bounded using Lemma 6.2 as " E T X t =1 # ∗ T +δ `(f(t ) ,g(t ) ),t (z t +δ ) ≤ D F (z )+ T η X ln |F| t + . ηT t =1 8 (7.5) It is easy to see that ẑt 6= z̃t may happen only at the first A + B + s steps of each block. New blocks are started when either the mSD algorithm decides to start one, or when a new-block signal is transmitted by chance. Letting ST and NT = {t : (yt −A+1 , . . . , yt ) = v, A ≤ t ≤ T } denote the number of new blocks, up to time T , started “intentionally” by the mSD algorithm and, respectively, “unintentionally” by chance, we have ¯© ª¯ ¯ t : 1 ≤ t ≤ T, ẑt 6= z0 ¯ ≤ (ST + 1 + NT ) (A + B + s). t (7.6) ST can be bounded by Lemma 6.3. On the other hand, letting nt = I{v=(yt −A+1 ,...,yt )} , we have P NT = Tt=A nt . Since v is independent of yt −1 , £ ¤ P v = (yt −A+1 , . . . , yt ) = 1/M A for any A ≤ t ≤ T , and so E [NT ] = (T − A)/M A . (7.7) Now taking expectations in (7.6), Lemma 6.3 and (7.7) yield à E [|{t : 1 ≤ t ≤ T, ẑt 6= z̃t }|] ≤ (A + B + s) η T (T − 1) + TX −1 ¡ η t − η T + ln |F| + 1 + ¢ t =1 à = (A + B + s) T X η t + ln |F| + 1 + t =1 T −A MA ! . Combining the above with (7.4) and (7.5) proves the statement of the theorem. 122 T −A MA ! 7.4 Sequential zero-delay lossy source coding An important and widely studied special case of the source coding problem considered is the case of on-line scalar quantization, that is, the problem of zero-delay lossy source coding with memoryless encoders and decoders [71, 100, 43, 44]. Here we assume for simplicity Z = [0, 1] and d (z, ẑ) = (z − ẑ)2 . An M -level scalar quantizer Q (defined on [0, 1]) is a measurable mapping [0, 1] → C , where the codebook C is a finite subset of [0, 1] with cardinality |C | = M . The elements of C are called the code points. The instantaneous squared distortion of Q for input z is (z − Q(z))2 . Without loss of generality we will only consider nearest neighbor quantizers Q satisfying (z −Q(z))2 = minẑ∈C (z − ẑ)2 . Let Q denote the collection of all M -level nearest neighbor quantizers. In this section our goal is to design a sequential coding scheme that asymptotically achieves the performance of the best scalar quantizer (from Q) for all source sequences z T . 
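To make the objects of this section concrete, the following minimal Python sketch (with hypothetical function names) implements an $M$-level nearest-neighbor scalar quantizer with squared-error distortion and a brute-force search over the finite grid class $Q_K$ introduced below; the thesis itself relies on the efficient implementations of [43] rather than on such exhaustive search.

```python
import numpy as np
from itertools import combinations

def nn_quantize(z, codebook):
    """Nearest-neighbor scalar quantizer: map z in [0, 1] to the closest code point."""
    codebook = np.asarray(codebook)
    return codebook[np.argmin((codebook - z) ** 2)]

def squared_distortion(zs, codebook):
    """Cumulative squared-error distortion d(z, z_hat) = (z - z_hat)^2 on a source sequence."""
    return sum((z - nn_quantize(z, codebook)) ** 2 for z in zs)

def best_quantizer_in_QK(zs, M, K):
    """Brute-force search for the best M-level codebook on the grid {1/2K, 3/2K, ..., (2K-1)/2K}.

    Illustration only: the number of candidate codebooks is C(K, M), so this search is
    exponential in M and is not how the coding scheme of this chapter is implemented.
    """
    grid = (2 * np.arange(1, K + 1) - 1) / (2 * K)
    return min(combinations(grid, M), key=lambda cb: squared_distortion(zs, cb))
```

For example, best_quantizer_in_QK(zs, M=2, K=8) returns the pair of grid code points minimizing the cumulative squared error on the source sequence zs, which is the benchmark quantity appearing in the distortion redundancy below.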
Note that the expected normalized distortion redundancy in this special case is defined as " # T T 1 X 1 X 2 max (z t −Q(z t ))2. E (z t −ẑt ) − min Q∈ Q T z T ∈[0,1]T T t =1 t =1 To be able to apply the results of the previous section, we approximate the infinite class Q with QK ⊂ Q, the set of M -level nearest neighbor scalar quantizers whose code points all belong ª © 1 3 , 2K , . . . , 2K2K−1 . It is shown in [43] that the distortion redundancy of any sequential to the set 2K coding scheme relative to Q is at least on the order of T −1/2 . The next theorem shows that the slightly larger O(T −1/2 ln T ) normalized distortion redundancy is achievable. Theorem 7.3. Relative to the reference class Q, the expected normalized distortion redundancy of Algorithm 7.1 applied to QbpT c satisfies, for any T ≥ 2, s bT ≤ R µ ¶ µ 2 ¶ 2M ln T 25 (M +2) ln T ln T + +s +O T 8 2 ln M T and the algorithm can be implemented with O(T 2 ) time and O(T ) space complexity. Proof. The proof is based on results developed in [43]. It is easy to see that for any quantizer Q ∈ Q there exists a quantizer Q K ∈ QK such that max |(z −Q(z))2 − (z −Q K (z))2 | ≤ 1/K . z∈[0,1] Thus, in this sense, the class Q is well approximated by QK . Thus, for any sequence z T ∈ [0, 1]T , T T 1 X 1 X 1 (z t −Q(z t ))2 − min (z t −Q(z t ))2 ≤ . Q∈QK T t =1 Q∈Q T t =1 K min Applying Algorithm 7.1 to the reference class F = QK we obtain by Theorem 7.1 that the nor- 123 malized distortion redundancy relative to the class QK can be bounded as # " T T X X 1 2 2 (z t − ẑ t ) − min (z t −Q(z t )) max E Q∈QK t =1 z T ∈[0,1]T T t =1 v ¡ ¢Ã à à ! ¡K ¢ ! ¡K ¢ ! u u ln K 17 ln T M K ln T + M M t ≤2 + s + O ln T 8 ln M M T Note that since in this case the size of the reference class |QK | = ¡K ¢ M depends on T , such terms also need to be taken into account in the last O(·) term. Combining the above results and p substituting K = b T c gives the performance bound of the theorem, taking into account that p ¡bpT c¢ M /2 p1 < 2M ln T /T for all T ≥ 2 since M ≥ 2 by assumption. ≤ T and M b Tc It is shown in [43] that the random choice of a quantizer according to the EWA prediction algorithm in one time step can be performed with O(K 2 ) time and space complexity. Applying the same method in our algorithm we obtain the desired complexity results (it is easy to see that the extra random choice in mSD does not change these complexities). 7.5 Extensions In the previous sections we assumed that the encoder and the decoder communicate over a noiseless channel. Following Matloub and Weissman [75], we can extend the results to the case of stochastic channels with positive error exponents. We assume that the communication channel has finite memory r for some integer r ≥ 0, and its output also depends on some stationary noise process . . . , X−1 , X0 , X1 , . . . with known distribution such that if the channel input up to time t is y t for some t ≥ r , then the output of the channel is a function of y tt −r +1 and Xt . Moreover, it is assumed that for some rate R > 0 there exists a constant σ > 0 such that for any block length b there exists a channel code Cb that can discriminate 2bR messages with maximum error probability e −σb in b channel uses. These assumptions are not restrictive and hold for all channels with positive capacity and error exponent. c , a delayFormally, denoting the channel input and output alphabet by M = {1, . . . 
, M } and M δ sequential joint source-channel code is given by a sequence of encoder-decoder functions t +δ ( f , g ) = { f t , g t }∞ × [0, 1]t → M and g t : Mt → Ẑ. Matloub and Weissman [75] t =1 with f t : Z used a channel code Cb (minimizing the maximum error probability) to communicate the de- coding function at the beginning of each block, as well as replaced the distortion `( f ,g ),t (z t +δ ) £ ¤ with its expectation `¯( f ,g ),t (z t +δ ) = EXt `( f ,g ),t (z t +δ ) with respect to the channel noise Xt (note that `¯( f ,g ),t (z t +δ ) is still a random variable that depends on the internal randomization of the encoder f ). In our case a further modification is needed, as the new block signal also has to be communicated using channel coding. Formally, in step 3(b)i of the encoder in Algorithm 7.1, `( f ,g ),t (x t +δ ) has to be replaced with `¯( f ,g ),t (z t +δ ). Furthermore, during the whole communication process, the new block signal v and the indices of the decoder functions g are transmitted using channel coding, with codes C Ab and CBb , respectively. These codes are used at the decoder to identify the beginning of a 124 new block and determining the decoding function. Note that before each use of these channel codes, the encoder uses r symbols to reset the memory of the channel. To simplify the encoder, we do not check whether the decoder would receive a new block signal by chance, that is, we omit step 3(f)iii of the encoding algorithm (this step would become problematic due to the channel noise). While this decision makes the algorithm simpler, it can also ruin its performance if such an accident occurs (the scheme has no built-in method to recover from such an error). However, by a careful selection of the new block signal, we can guarantee that this disaster only happens with a very low probability. The analysis of the above procedure can be done following the proof of Theorem 7.1. We can obtain a variant of (7.5) in the same way where the distortion values are replaced by their expectations with respect to the channel noise. The overhead in the communication is analyzed in a slightly different way. We declare failure and consider the maximum distortion during the whole communication process if the new block symbol is received incorrectly at the beginning, or if the new block symbol is transmitted unintentionally. Since the latter can happen at each time instant, and results in an O(T ) cumulative distortion, we select v from T 2 options, making A = d2 ln T / ln M e. Furthermore, we want to transmit v at the beginning with probability of error O(1/T 2 ). By our assumptions on C Ab , to achieve an error probability ² in recovering v at the b b decoder, we need ² ≤ 2−σ A and 2R A ≥ M A . Thus, when A = O(ln T ), a choice Ab = O(ln T ) is sufficient to ensure an O(1/T 2 ) error probability. Then the expected cumulative error due to the incorrect decoding of v is O(1). Furthermore, communicating the index of the decoding function in Bb = O(ln F ln T ) channel uses can be decoded with maximum error probability O(1/T 2 ) (here, as before, F denotes the reference class of codes), which again results in a cumulative O(1) error. Summarizing, for each block we have an extra O(ln T ) overhead plus a constant error term (in expectation). Thus, we obtain the following result. Corollary 7.1. 
Given any finite reference class F of delay-δ sequential joint source-channel codes with bounded memory, under our assumptions on the communication channel, there exists a p sequential joint source-channel coding scheme with O( ln(T )/T ) normalized distortion redundancy. In the Wyner-Ziv setting considered by Reani and Merhav [89], there is a noiseless communication channel between the encoder and the decoder, and the decoder also has access to a side information signal that is a noisy observation of the current source symbol z t through a memoryless channel. This setup can be treated as a special case of the above joint source channel coding problem with a restricted set of encoders and a special channel (the channel is composed of a noiseless part and a noisy side information channel, and each encoder has to transmit the actual source symbol uncoded over the side information channel). In fact, this setup is simpler, as there is no need to use error protection for communicating the indices of the decoders and the new block signals; however, replacing `t by `¯t is still necessary. Thus, the p above O( ln(T )/T ) normalized distortion redundancy is also achievable in this case. Moreover, Reani and Merhav also gave an efficient implementation for the zero-delay case based on an efficient implementation of the EWA algorithm. This efficient algorithm can easily be 125 incorporated in our method in the same way as the efficient algorithms for scalar quantization (provided in [43, 45]) were used for the zero-delay lossy coding case considered in Section 7.4. 126 Chapter 8 Conclusions In this work, we addressed the problem of online learning in non-stationary Markov decision processes, first defined by Even-Dar et al. [33]. We proposed algorithms that are guaranteed to perform well in a number of different settings. In this chapter, we discuss how our results compare to other known results concerning regret minimization in MDPs. After discussing some general insights revealed by our work, we comment on the possibility of combining the results of the thesis for solving the most complicated version of the online MDP problem, where the transition functions are unknown and the learner is only informed about the reward it actually gathers (instead of being informed about the entire reward function). We conclude by commenting on the complexity of our algorithms and summarizing our results presented in Chapter 5. 8.1 The gap between the lower and upper bounds Jaksch et al. [59] considered the case when the transition function P is unknown to the learner and the rewards are generated in an i.i.d. fashion. The assumptions they make about the transition function are less stringent than our Assumptions M1 and M2: they assume that there exists a finite constant D > 0 such that for any two states x, x 0 ∈ X, there exists a policy πx,x 0 that takes the learner from state x to state x 0 in at most D steps (in expectation). Following Puterman [86], such MDPs are called communicating MDPs with diameter D. Jaksch et al. aim to minimize a regret criterion identical to the one defined in Equation (1.1), and prove that for any learning algorithm there exists an MDP with diameter D, state and action cardinality |X| and |A| such that b T ≥ c D|X||A|T , L p where c > 0 is some universal constant. Their algorithm UCRL-2 guarantees a regret of order p D|X| |A|T that holds for all possible reward distributions in all MDPs with diameter D. 
In what follows, we compare the performance guarantees of our algorithms to these bounds: The bounds presented in Chapters 3 and 4 have the same dependence on the number of actions |A| and the number of time steps T . For the O-SSP setting considered in Chapters 3 and 5, the 127 diameter is essentially equivalent to the number of layers L. The regret guarantee proven for p our algorithm in the bandit case when P is known is O(L 2 |A|T /α), which does not directly p depend on |X|. The factor L 2 is clearly suboptimal. Depending on the structure of P , 1/ α can be either much smaller or much bigger than |X|, however, it tends to be large in practical problems. Even though our experiments in Section 3.6 suggest that learning does become more difficult as α approaches zero, we believe that the reasons underlying these gaps are to be found in our decomposition of the learning problem: this approach does not exploit that the individual bandit problems defined in each state x ∈ X are not independent of each other. Essentially, the same comments hold for our algorithm presented in Chapter 4. Treating the learning problem globally as in Chapter 5 seems to give much better results: the gap between p our bound and the bound of Jaksch et al. [59] is only of order |A|. Thus, the price we pay for p playing against an arbitrary reward sequence with full feedback is at most an O( |A|) factor in the upper bound. This observation immediately leads to the following question: why don’t we use the approach described in Chapter 5 for solving the bandit problems described in Chapters 3 and 4? The reason is very simple: the main limitation of the FPL algorithm used for solving the global decision problem is that it cannot work efficiently under bandit feedback. For constructing unbiased reward estimates of the form (3.12) or (4.8), we need to explicitly compute the stochastic policy πt . Unfortunately, the relationship of the policy πt and the perturbations is very complicated and thus the former can be only approximated by excessive sampling from the latter. As shown by Poland [84], producing sufficiently precise reward estimates needs drawing up to O(T 2 ) samples, which is unacceptable. As discussed in Section 1.3, it is nontrivial whether there exists an efficient implementation of a global decision making algorithm that can be used to minimize regret in the bandit O-MDP setting. 8.2 Outlook We can conclude that a good learning algorithm in a continuing MDP task has to either change its policies either slowly (as quantified in Chapter 4) or rarely (as in the approach taken in Chapter 7). The fundamental difference between the results proved in Chapter 3 and the one proved in Chapter 4 is that in the latter case, the learner has to guarantee one of the above criteria on its policy sequence in addition to being able to minimize the main term of the expected regret. Accordingly, the insights used in Chapter 3 to prove extensions of the basic problem can be directly used in the O-MDP setting as well without any fundamental modification. Note that while the “slow-change” requirement seems to conflict the intuition that an algorithm with good tracking regret should change its policies more often, our proof concerning the change rate of Exp3 carries through to the case of Exp3.S. In the process of proving the results of Chapter 5, much structure (of both the standard MDP learning problem and the online learning problem) was uncovered that was not transparent in previous works. 
In particular, it has become apparent that the size of confidence intervals for the unknown model parameters only influences the quality of the estimate of the 128 achieved performance (since this factor only shows up in the proof of Lemma 5.5), while selecting our models optimistically helps only in estimating the best possible performance (since optimism is only exploited in the proof of Lemma 5.2). These issues are not so clearly separable in a purely stochastic learning problem where the achieved performance has to stay close to the fixed target behavior. It remains an interesting and important open problem to solve the online MDP problem with bandit information in unknown stochastic environments. Besides the problem of not being able to explicitly computing the occupation measures generated by FPL, the unknown transition function P also adds to the uncertainty. Constructing reward estimates with appropriately controlled bias seems to a be very challenging problem that needs fundamentally new techniques or highly nontrivial combination of ideas presented in this thesis. 8.3 The complexity of our algorithms Let us finally comment on the computational complexity of our algorithms. It is easy to see that of all our algorithms, the algorithm MDP-E of Chapter 4 has the largest computational and memory complexity: Due to the delay that the algorithm needs, the algorithm needs to store N policies. Thus, the memory requirement of MDP-Exp3 scales with N |A||X|. The computational complexity of the algorithm is dominated by the cost of computing r̂t and, in particular, ¡ ¢ 3 by the cost of computing µN t . The cost of this is O N |A||X| in the worst case, however, it can be much smaller for specific practical cases. For instance, if all states have at most J possible preceding states and K succeeding states, the cost becomes O (N |A||X|J K ). The computation cost of the other algorithms is smaller by a factor of N . Setting the parameters of the algorithms presented in Chapters 3 and 4 require known lower bounds on the minimal visitation probabilities α and α0 , and also the knowledge of the mixing time τ. While these quantities can be determined in principle from the transition probability kernel P , it is not clear how to compute efficiently the minimum over all policies. 8.4 Online learning with switching costs In Chapter 6, we proposed a novel prediction method that works by perturbing the cumulative losses with symmetric random walks. Analyzing the performance of this forecaster in the finite-expert setting required bounding the number of times that the leading random walk is switched. While the guarantees obtained for the expected regret and the expected number of switches are essentially equivalent to the best previously known guarantees, we hope that our technique can be used to provide bounds of the said quantities that hold with high probability. At the moment, the mere existence of a prediction algorithm guaranteeing high-confidence bounds for this setting is nontrivial. We were the first to consider the problem of online combinatorial optimization under switching constraints. While it is possible to prove performance bounds for a version of the Shrinking Dartboard algorithm in this setting, the resulting algorithm can only be implemented for 129 a handful of decision sets. On the other hand, our algorithm can be efficiently implemented whenever there exist an efficient solution to static optimization problems on the decision set. 
However, this efficiency comes at the price of slightly worse performance bounds. Finding out if this trade-off between low computational complexity and optimal regret bounds is inherent to the problem of online combinatorial optimization is an interesting open question of learning theory. 8.5 Closing the gap for online lossy source coding In Chapter 7, we provided a sequential lossy source coding scheme that achieves a normalp ized distortion redundancy of O( ln(T )/T ) relative to any finite reference class of limited-delay limited-memory codes, improving the earlier results of O(T −1/3 ). Applied to the case when the reference class is the (infinite) set of scalar quantizers, we showed that the algorithm achieves p O(ln(T )/ T ) normalized distortion redundancy, which is almost optimal in view that the norp malized distortion redundancy is known to be at least of order 1/ T . The existence of a coding scheme with optimal high-confidence performance guarantees depends on the existence of an online prediction algorithm with high-confidence guarantees on its regret and switch-number. As discussed in the previous section, this remains an open problem. 130 Bibliography [1] Abbasi-Yadkori, Y. and Szepesvári, Cs. (2011). Regret bounds for the adaptive control of linear quadratic systems. In Proceedings of the Twenty-Fourth Conference on Computational Learning Theory. [2] Allenberg, C., Auer, P., Györfi, L., and Ottucsák, Gy. (2006). Hannan consistency in on-line learning in case of unbounded losses under partial monitoring. In ALT, pages 229–243. [3] Arora, R., Dekel, O., and Tewari, A. (2012). Deterministic MDPs with adversarial rewards and bandit feedback. CoRR, abs/1210.4843. [4] Audibert, J.-Y. and Bubeck, S. (2010). Regret bounds and minimax policies under partial monitoring. Journal of Machine Learning Research, 11:2785–2836. [5] Audibert, J. Y., Bubeck, S., and Lugosi, G. (2011). Minimax policies for combinatorial prediction games. In Conference on Learning Theory. [6] Audibert, J. Y., Bubeck, S., and Lugosi, G. (2012). Regret in online combinatorial optimization. Manuscript. [7] Auer, P. (2003). Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3:397–422. [8] Auer, P., Cesa-Bianchi, N., and Fischer, P. (2002a). Finite-time analysis of the multiarmed bandit problem. Mach. Learn., 47(2-3):235–256. [9] Auer, P., Cesa-Bianchi, N., Freund, Y., and Schapire, R. (1995). Gambling in a rigged casino: The adversarial multi-armed bandit problem. In Proceedings of the 36th Annual Symposium on the Foundations of Computer Science, pages 322–331. IEEE press. [10] Auer, P., Cesa-Bianchi, N., Freund, Y., and Schapire, R. E. (2002b). The nonstochastic multiarmed bandit problem. SIAM J. Comput., 32(1):48–77. [11] Auer, P. and Ortner, R. (2006). Logarithmic online regret bounds for undiscounted reinforcement learning. In NIPS’06, pages 49–56. [12] Awerbuch, B. and Kleinberg, R. D. (2004). Adaptive routing with end-to-end feedback: distributed learning and geometric approaches. In Proceedings of the 36th ACM Symposium on Theory of Computing, pages 45–53. 131 [13] Balluchi, A., Benvenuti, L., Di Benedetto, M. D., Pinello, C., and Sangiovanni-Vincentelli, A. L. (2000). Automotive engine control and hybrid systems: challenges and opportunities. In Proceedings of the IEEE, pages 888–912. [14] Bartlett, P. L., Dani, V., Hayes, T. P., Kakade, S., Rakhlin, A., and Tewari, A. (2008). Highprobability regret bounds for bandit online linear optimization. 