Online learning in non-stationary Markov decision processes

Gergely Neu
[email protected]

Thesis submitted to the Budapest University of Technology and Economics in partial fulfilment for the award of the degree of Doctor of Philosophy in Informatics.

Supervised by Dr. László Györfi, Dr. András György, and Dr. Csaba Szepesvári
Department of Computing Science and Information Technology
Magyar tudósok körútja 2., Budapest, HUNGARY
March 2013

Abstract

This thesis studies the problem of online learning in non-stationary Markov decision processes where the reward function is allowed to change over time. In every time step of this sequential decision problem, a learner has to choose one of its available actions after observing some part of the current state of the environment. The chosen action influences the observable state of the environment in a stochastic fashion and earns the learner some reward; however, the entire state (be it observed or not) also influences the reward. The goal of the learner is to maximize the total (non-discounted) reward that it receives. In this work, we assume that the unobserved part of the state evolves autonomously, independently of the observed part of the state and of the actions chosen by the learner, thus corresponding to a state sequence generated by an oblivious adversary such as nature. Otherwise, absolutely no statistical assumption is made about the mechanism generating the unobserved state variables. This setting fuses two important paradigms of learning theory: online learning and reinforcement learning. In this thesis, we propose and analyze a number of algorithms designed to work under various assumptions about the dynamics of the stochastic process characterizing the evolution of the observable states. For all of these algorithms, we provide bounds on the regret, defined as the performance gap between the total reward gathered by the learner and the total reward of the best available fixed policy.

Acknowledgments

First and foremost, I would like to thank my supervisors Csaba Szepesvári and András György. I cannot be thankful enough to Csaba, who introduced me to the exciting field of reinforcement learning and guided my first steps as a researcher. Later on, Andris initiated me into the subject of online prediction, which became my main interest through the years. Working with the two of them greatly inspired me both professionally and personally, and I'm deeply grateful for all their help. I would also like to thank László Györfi, who efficiently supported my work by encouraging me whenever I was uncertain about the next step.

I would also like to thank all my colleagues, ex-colleagues and related folks from the Department of Computer Science and Information Theory at the Budapest University of Technology and Economics (especially András Temesváry, Márk Horváth, Márta Pintér, László Ketskeméty and András Telcs for teaming up with me during the Great Frankfurt Snowstorm of 2010), the Reinforcement Learning and Artificial Intelligence Lab at the University of Alberta (especially Gábor Bartók along with his lovely family, Gábor Balázs, István Szita and Réka Mihalka, Dávid Pál, and Yasin Abbasi-Yadkori) and MTA SZTAKI (especially András Antos, the coauthor of the results presented in Chapter 4). Once again, I am very grateful to Csaba and his amazing family for welcoming me several times in Edmonton. I am also thankful to the unstoppable statistician/cyclists Gábor Lugosi and Luc Devroye, also known as the coauthors of the results presented in Chapter 6.
Finally, I would like to express my gratitude to the people who obviously had the greatest impact on my entire life: my parents, brother and grandmother. Most of all, I am eternally grateful to Éva Kóczi for her endless love and support through the ups and downs of the past years.

In loving memory of József Lancz

Contents

1 Introduction
  1.1 The learning model
  1.2 Contributions of the thesis
  1.3 Related work
  1.4 Applications
2 Background
  2.1 Online prediction of arbitrary sequences
  2.2 Stochastic multi-armed bandits
  2.3 Markov decision processes
    2.3.1 Loop-free stochastic shortest path problems
    2.3.2 Unichain MDPs
3 Online learning in known stochastic shortest path problems
  3.1 Problem setup
  3.2 A decomposition of the regret
  3.3 Full Information O-SSP
  3.4 Bandit O-SSP using black-box algorithms
    3.4.1 Expected regret against stationary policies
    3.4.2 Tracking
  3.5 Bandit O-SSP using Exp3
    3.5.1 Expected regret against stationary policies
    3.5.2 Tracking
    3.5.3 The case of α = 0
    3.5.4 A bound that holds with high probability
  3.6 Simulations
  3.7 The proof of Lemma 3.2
  3.8 A technical lemma
4 Online learning in known unichain Markov decision processes
  4.1 Problem setup
  4.2 A decomposition of the regret
  4.3 Full information O-MDP
  4.4 Bandit O-MDP
  4.5 The proofs of Propositions 4.1 and 4.2
    4.5.1 The change rate of the learner's policies
    4.5.2 Proof of Proposition 4.1
    4.5.3 Proof of Proposition 4.2
5 Online learning in unknown stochastic shortest path problems
  5.1 Problem setup
  5.2 Rewriting the regret
  5.3 Follow the perturbed optimistic policy
    5.3.1 The confidence set for the transition function
    5.3.2 Extended dynamic programming
  5.4 Regret bound
  5.5 Extended dynamic programming: technical details
6 Online learning with switching costs
  6.1 The Shrinking Dartboard algorithm revisited
  6.2 Prediction by random-walk perturbation
    6.2.1 Regret and number of switches
    6.2.2 Bounding the number of switches
    6.2.3 Online combinatorial optimization
7 Online lossy source coding
  7.1 Related work
  7.2 Limited-delay limited-memory sequential source codes
  7.3 The Algorithm
  7.4 Sequential zero-delay lossy source coding
  7.5 Extensions
8 Conclusions
  8.1 The gap between the lower and upper bounds
  8.2 Outlook
  8.3 The complexity of our algorithms
  8.4 Online learning with switching costs
  8.5 Closing the gap for online lossy source coding

Chapter 1
Introduction

The work in this thesis generalizes two major fields of sequential machine learning theory: online learning [25, 20] and reinforcement learning [86, 17, 92, 85]. The characteristics of these learning models can be summarized as follows:

Online learning: In each time step t = 1, 2, ..., T of the standard online learning (or online prediction) problem, the learner selects an action $a_t$ from a finite action space A, and consequently earns some reward $r_t(a_t)$. The goal of the learner is to maximize its total expected reward. This problem can be easily treated by standard statistical machinery if the sequence of reward functions is generated in an i.i.d. fashion (that is, if the rewards $(r_t(a))_{t=1}^T$ are independent and identically distributed). However, this assumption does not account for dynamic data, let alone acting in a reactive environment. The power of the online learning framework lies in the fact that it does not require any statistical assumptions to be made about the data generation process: it is assumed that the sequence of reward functions $(r_t)_{t=1}^T$, $r_t : A \to [0,1]$, is an arbitrary fixed sequence chosen by an external mechanism referred to as the environment or the adversary.
Of course, by dropping the strong statistical assumptions on the reward sequence, we can no longer hope to explicitly maximize the total cumulative reward $\sum_{t=1}^T r_t(a_t)$ and thus have to settle for a less ambitious goal. This goal is to minimize the performance gap between our algorithm and the strategy that selects the action that is best in hindsight. This performance gap is called the regret and is defined formally as
$$ L_T = \max_{a\in A} \sum_{t=1}^T r_t(a) - \sum_{t=1}^T r_t(a_t). $$
It is important to note that the best fixed action in the above expression can only be computed in full knowledge of the sequence of reward functions. While minimizing the regret intuitively seems to be very difficult, it is by now a very well understood problem, even in the significantly more challenging bandit setting where the learner only observes $r_t(a_t)$ after making its decision. In recent years, numerous algorithms have been proposed for different versions of the online learning problem under different assumptions on the action space A and on the amount of information revealed to the learner. The main shortcoming of this problem formulation is that it does not adequately account for the influence of the previous actions $a_1, \ldots, a_{t-1}$ on the reward function $r_t$; that is, it assumes that the decisions of the learner do not influence the mechanism generating the rewards. The formalism presented in this thesis provides a way of modeling and coping with such effects.

Reinforcement learning: In every time step t = 1, 2, ..., T of the standard reinforcement learning (RL) problem, the learner (or agent) observes the state $x_t$ of the environment, selects an action $a_t$ from a finite action space A, and consequently earns some reward $r(x_t, a_t)$. Finally, the next state $x_{t+1}$ of the environment is drawn from the distribution $P(\cdot|x_t, a_t)$. It is assumed that the state space X of the environment is finite, and that the reward function $r : X \times A \to [0,1]$ and the transition function $P : X \times X \times A \to [0,1]$ are fixed but unknown functions. The goal of the learner is to maximize its total reward in the Markov decision process (MDP) described above by the tuple (X, A, P, r). It is commonplace to consider learning algorithms that map the states of the environment to actions with a stationary state-feedback policy (or, in short, a policy) $\pi : X \to A$. A policy can be evaluated in terms of its total expected reward after T time steps in the MDP:
$$ R_T^\pi = \mathbb{E}\left[ \sum_{t=1}^T r\bigl(x'_t, \pi(x'_t)\bigr) \right], $$
where $x'_{t+1} \sim P(\cdot|x'_t, \pi(x'_t))$. Writing the total expected reward of the learner as
$$ \widehat{R}_T = \mathbb{E}\left[ \sum_{t=1}^T r(x_t, a_t) \right], $$
we can then define a good performance measure of a reinforcement learning algorithm using the following notion of regret:
$$ \widehat{L}_T = \max_\pi R_T^\pi - \widehat{R}_T. $$
The regret measures the amount of "lost reward" during the learning process, that is, the price of not knowing r and P before the learning process starts. The main limitation of the MDP formalism is that it does not account for non-random state dynamics that cannot be captured by the Markovian model P. In this thesis, we present a formalism that goes beyond the standard stochastic assumptions made in the reinforcement learning literature and provide algorithms with good theoretical performance guarantees. We study complex reinforcement learning problems where the performance of learning algorithms is measured by the total reward they collect during learning, and where the assumption that the state dynamics are completely Markovian is relaxed.
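Before turning to our formalism, the following minimal sketch (not part of the thesis, only an illustration) makes the first of the two regret notions above concrete: it computes $L_T$ from a full table of rewards and the actions actually played.

```python
def online_regret(rewards, actions):
    """Compute the regret L_T = max_a sum_t r_t(a) - sum_t r_t(a_t).

    rewards[t][a] holds r_t(a) for rounds t = 0, ..., T-1;
    actions[t] is the index of the action a_t chosen in round t.
    """
    num_actions = len(rewards[0])
    # Total reward of the best action fixed in hindsight.
    best_fixed = max(sum(r[a] for r in rewards) for a in range(num_actions))
    # Total reward actually earned by the learner.
    earned = sum(r[a_t] for r, a_t in zip(rewards, actions))
    return best_fixed - earned
```

For instance, `online_regret([[1, 0], [1, 0]], [1, 1])` returns 2: the best fixed action earns 2 while the learner earns 0.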
Our formalism is based on principles of reinforcement learning and online learning: we regard these complex problems as Markov decision problems where the reward functions are allowed to change over time, and we propose algorithms with theoretically guaranteed bounds on their performance. The performance criterion that we address is the worst-case regret, which is the one typically considered in the online learning literature. Learning in this model is called the online MDP or O-MDP problem. The main idea of our approach relies on the observation that in a number of practical problems, the hard-to-model, complex part of the environment influences only the rewards that the learner receives.

In the rest of this chapter, we describe our general model and show precisely how it relates to the two learning paradigms described above. In Section 1.2, we summarize the contributions of the present thesis. The most important related results from the literature are discussed in Section 1.3. We conclude this chapter with Section 1.4, where we briefly describe some practical problems where our formalism can be applied.

The rest of the thesis is organized as follows. In Chapter 2, we review some relevant concepts from online learning and reinforcement learning; the purpose of the chapter is to set up the formal definitions used later in the thesis. In Chapters 3–5, we discuss different versions of the online MDP problem and propose algorithms for the specified learning problems, providing a rigorous theoretical analysis for each of the proposed algorithms. In Chapter 6, we describe a special case of our setting called online learning with switching costs; for this problem, we provide two algorithms with optimal performance guarantees. In Chapter 7, we apply the methods described in Chapter 6 to construct adaptive coding schemes for the problem of online lossy source coding, and show that the proposed algorithm enjoys near-optimal performance guarantees. Since it is difficult to evaluate the contributions of Chapters 3–5 separately, we present conclusions for all chapters in Chapter 8.

1.1 The learning model

The interaction between the learner and the environment is shown in Figure 1.1. The environment is split into two parts: one part that has Markovian dynamics and another with unrestricted, autonomous dynamics. In each discrete time step t, the agent receives the state $x_t$ of the Markovian environment and possibly the previous state $y_{t-1}$ of the autonomous dynamics. The learner then makes a decision about the next action $a_t$, which is sent to the environment. The environment then makes a transition: the next state of the Markovian environment depends stochastically on the current state and the chosen action as $x_{t+1} \sim P(\cdot|x_t, a_t)$, while $y_{t+1}$ is generated by an autonomous dynamic that is not influenced by the learner's actions or by the state of the Markovian environment. After this transition, the agent receives a reward depending on the complete state of the environment and the chosen action, and then the process continues. The goal of the learner is to collect as much reward as possible. The modeling philosophy is that whatever can be modeled about the environment's dynamics should be modeled in the Markovian part, and the remaining "unmodeled dynamics" is what constitutes the autonomous part of the environment. A large number of practical operations research and control problems have the structure outlined above.
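To make the interaction protocol just described concrete, here is a minimal simulation sketch; the names `learner`, `autonomous_step`, and `reward` are illustrative placeholders rather than objects defined in the thesis.

```python
import random

def run_interaction(learner, P, reward, autonomous_step, x0, y0, T):
    """Simulate T steps of the learner-environment interaction of Section 1.1.

    P(x, a) returns a dict of next-state probabilities (the Markovian part),
    autonomous_step(t) returns the uncontrolled state y_t (chosen by "nature"),
    reward(x, y, a) returns the reward for the complete state (x, y) and action a.
    All of these names are placeholders.
    """
    x, y_prev = x0, y0
    total_reward = 0.0
    for t in range(1, T + 1):
        # The learner observes x_t and possibly the previous uncontrolled state.
        a = learner.act(x, y_prev)
        # The uncontrolled state evolves autonomously, unaffected by x_t or a_t.
        y = autonomous_step(t)
        # The reward depends on the complete state (x_t, y_t) and the action a_t.
        total_reward += reward(x, y, a)
        # Markovian transition: x_{t+1} ~ P(. | x_t, a_t).
        next_dist = P(x, a)
        x = random.choices(list(next_dist), weights=list(next_dist.values()))[0]
        y_prev = y
    return total_reward
```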
Examples of such problems include production and resource allocation problems, where the major source of difficulty is modeling prices; various problems in computer science, such as the k-server problem and paging problems; and web-optimization problems, such as ad-allocation problems with delayed information [see, e.g., 33, 105].

Figure 1.1: The interaction between the learner and the environment. At time t, the agent's action is $a_t$, the state of the Markovian dynamics is $x_t$, and the state of the uncontrolled dynamics is $y_t$; $q^{-1}$ is a one-step delay operator.

In the rest of the thesis, for simplicity and by slightly abusing terminology, we call the state $x_t$ of the Markovian part "the state" and regard dependency on $y_t$ as dependency on t by letting $r_t(\cdot,\cdot) = r(\cdot,\cdot,y_t)$. The goal of the learner is to maximize its total expected reward
$$ \widehat{R}_T = \mathbb{E}\left[\left.\sum_{t=1}^T r_t(x_t, a_t)\,\right| P\right], $$
where the notation $\mathbb{E}[\,\cdot\,|P]$ is used to emphasize that the state sequence $(x_t)_{t=1}^T$ is generated by the transition function P. Controllers of the form $\pi : X \to A$ are called stationary deterministic policies, where X is the state space of the Markovian part of the environment and A is the set of actions. The performance of a policy $\pi$ is measured by its total expected reward
$$ R_T^\pi = \mathbb{E}\left[\left.\sum_{t=1}^T r_t\bigl(x'_t, \pi(x'_t)\bigr)\,\right| P, \pi\right], $$
where $(x'_t)_{t=1}^T$ is the state sequence obtained by following policy $\pi$ in the MDP described by P. The learner's goal is to perform nearly as well as the best fixed stationary policy in hindsight in terms of the total reward collected, that is, to minimize the quantity
$$ \widehat{L}_T = \max_\pi R_T^\pi - \widehat{R}_T. \qquad (1.1) $$
In other words, we are interested in constructing algorithms that minimize the total expected regret, defined as the gap between the total accumulated reward of the learner and that of the best fixed controller. Naturally, no assumptions can be made about the autonomous part of the environment, as it is assumed that modeling this part of the environment lies outside the capabilities of the learner. Guaranteeing low regret is equivalent to a robust control guarantee: the guarantee on the performance must hold no matter how the autonomous state sequence $(y_t)$, or equivalently, the reward sequence $(r_t)$, is chosen. The potential benefit is that the results will be more generally applicable and the algorithms will enjoy added robustness, while, generalizing from results available for supervised learning [23, 62, 87], the algorithms can also avoid being too pessimistic despite the strong worst-case guarantees.¹

¹ Sometimes, robustness is associated with conservative choices and thus poor "average" performance. Although we do not study this question here, we note in passing that the algorithms we build upon have "adaptive variants" that are known to adapt to the environment in the sense that their performance improves when the environment is "less adversarial".

1.2 Contributions of the thesis

We have studied the above problem under various assumptions on the structure of the underlying MDP and on the feedback provided to the learner. To be able to present our contributions, we state our assumptions informally in the following list; the precise assumptions are presented in Chapter 2.

1. Loop-free episodic environments are episodic MDPs where transitions are only possible in a "forward" manner. Episodic MDPs capture learning problems where the learner has to repeatedly perform similar tasks consisting of multiple state transitions. At the beginning of each episode, the learner starts from a fixed state $x_0$, and the episode ends when the goal state $x_L$ is reached. We assume that all other states in the state space X can be visited at most once; thus, the transition structure does not allow loops.
The reward function $r_t : X \times A \to [0,1]$ remains fixed during each episode t = 1, 2, ..., T, but can change arbitrarily between consecutive episodes. In each time step l = 0, 1, ..., L−1 of episode t, the learner observes its state $x_l^{(t)}$ and has to decide about its action $a_l^{(t)}$. The total expected reward of the learner is defined as
$$ \widehat{R}_T = \mathbb{E}\left[\left. \sum_{t=1}^T \sum_{l=0}^{L-1} r_t\bigl(x_l^{(t)}, a_l^{(t)}\bigr) \,\right| P \right], $$
and the total expected reward of policy $\pi$ is defined as
$$ R_T^\pi = \mathbb{E}\left[\left. \sum_{t=1}^T \sum_{l=0}^{L-1} r_t(x'_l, a'_l) \,\right| P, \pi \right], $$
where we used the notation $\mathbb{E}[\,\cdot\,|P,\pi]$ to emphasize that the trajectory $(x'_l, a'_l)_{l=0}^{L-1}$ is generated by the transition model P and policy $\pi$. The minimal visitation probability α is defined as
$$ \alpha = \min_{x\in X}\, \min_{\pi\in A^X} \mathbb{P}\bigl[\exists l : x'_l = x \,\big|\, P, \pi\bigr]. $$
This problem will often be referred to as the online stochastic shortest path (O-SSP) problem.

(a) Full feedback with known transitions: We assume that the transition function P is fully known before the first episode and that the reward function $r_t$ is entirely revealed after episode t.

(b) Bandit feedback with known transitions: We assume that the transition function P is fully known before the first episode, but the reward function $r_t$ is only revealed along the trajectory traversed by the learner in episode t. In other words, the feedback provided to the learner after episode t is
$$ \Bigl(x_l^{(t)},\, a_l^{(t)},\, r_t\bigl(x_l^{(t)}, a_l^{(t)}\bigr)\Bigr)_{l=0}^{L-1}. $$

(c) Full feedback with unknown transitions: We assume that P is unknown to the learner, but the reward function $r_t$ is entirely revealed after episode t. The layer structure of the state space and the action space are assumed to be known, and the traversed trajectory is also revealed to the learner. In other words, the feedback provided to the learner after episode t is
$$ \Bigl(\bigl(x_l^{(t)}, a_l^{(t)}\bigr)_{l=0}^{L-1},\, r_t\Bigr). $$

2. Unichain environments are continuing MDPs where no episodes are specified. In each time step t = 1, 2, ..., T, the learner observes the state $x_t$ and has to decide about its action $a_t$, while the reward function $r_t : X \times A \to [0,1]$ is also allowed to change after each time step. For any stationary policy $\pi : X \to A$, we define the elements of the transition kernel $P^\pi$ as $P^\pi(x|y) = P(x|y, \pi(y))$ for all x, y ∈ X. We assume that for each policy $\pi$, there exists a unique probability distribution $\mu^\pi$ over the state space that satisfies
$$ \mu^\pi(x) = \sum_{y\in X} \mu^\pi(y)\, P^\pi(x|y) \qquad \text{for all } x \in X. $$
The distribution $\mu^\pi$ is called the stationary distribution corresponding to policy $\pi$. The minimal stationary visitation probability $\alpha'$ is defined as
$$ \alpha' = \min_{x\in X}\, \min_{\pi\in A^X} \mu^\pi(x). $$
We assume that every policy $\pi$ has a finite mixing time $\tau^\pi > 0$ that specifies the speed of convergence to the stationary distribution $\mu^\pi$. The total expected reward of the learner is defined as
$$ \widehat{R}_T = \mathbb{E}\left[\left. \sum_{t=1}^T r_t(x_t, a_t) \,\right| P \right], $$
and the total expected reward of any policy $\pi$ is defined as
$$ R_T^\pi = \mathbb{E}\left[\left. \sum_{t=1}^T r_t(x'_t, a'_t) \,\right| P, \pi \right], $$
where the trajectory $(x'_t, a'_t)$ is generated by following policy $\pi$ in the MDP specified by P.
(a) Full feedback with known transitions: We assume that the transition function P is fully known before the first time step and that the reward function $r_t$ is entirely revealed after time step t.

(b) Bandit feedback with known transitions: We assume that the transition function P is fully known before the first time step, but the reward function $r_t$ is only revealed at the state–action pair visited by the learner in time step t. In other words, the feedback provided to the learner after time step t is $\bigl(r_t(x_t, a_t), x_{t+1}\bigr)$.

3. Online learning with switching costs is a special version of the online prediction problem where switching between actions is subject to some cost K > 0. Alternatively, this problem can be regarded as a special case of our general setting, as we can construct a simple online MDP that models every online learning problem where switching between experts is expensive. The online MDP $(X, A, P, (r_t)_{t=1}^T)$ in question is specified as follows: the state $x_{t+1}$ of the environment is identical to the previously selected action $a_t$. In other words, X = A and the transition function P is such that for all x, y, z ∈ A, $P(y|x, y) = 1$ and $P(z|x, y) = 0$ if z ≠ y. The reward function in this online MDP is defined using the original reward function $g_t : A \to [0,1]$ of the prediction problem and the switching cost K as
$$ r_t(x, a) = g_t(a) - K\,\mathbb{I}_{\{a\neq x\}} $$
for all $(x, a) \in A^2$. (A concrete sketch of this construction is given after Table 1.1 below.) Note that K is allowed to be much larger than the maximal reward of 1. We consider two subclasses of online learning problems with switching costs.

(a) Online prediction with expert advice: We assume that the action set A is relatively small and the environment is free to choose the rewards for each different action. Actions in this setting are often referred to as "experts".

(b) Online combinatorial optimization: We assume that each action can be represented by a d-dimensional binary vector and the environment can only choose the rewards given for selecting each of the d components. Formally, the learner has access to the action space $A \subseteq \{0,1\}^d$, and in each round t the environment specifies a vector of rewards $g_t \in \mathbb{R}^d$. The reward given for selecting action a ∈ A is the inner product $g_t(a) = g_t^\top a$.

4. The online lossy source coding problem is a special case of Setting 3 in which a learner has to encode a sequence of source symbols $z_1, z_2, \ldots, z_T$ over a noiseless channel and produce a sequence of reproduction symbols $\hat{z}_1, \hat{z}_2, \ldots, \hat{z}_T$. A coding scheme consists of an encoder f mapping source symbols $(z_t)_{t=1}^T$ to channel symbols $(y_t)_{t=1}^T$ and a decoder g mapping channel symbols $(y_t)_{t=1}^T$ to reconstruction symbols $(\hat{z}_t)_{t=1}^T$. We assume that the learner has access to a fixed pool of coding schemes F. The goal of the learner is to select coding schemes $(f_t, g_t) \in F$ that minimize the cumulative distortion between the source sequence and the reproduction sequence, defined as
$$ \widehat{D}_T = \sum_{t=1}^T d(z_t, \hat{z}_t), $$
where the sequence $(\hat{z}_t)_{t=1}^T$ is produced by the sequence of applied coding schemes and d is a given distortion measure. We make no statistical assumptions about the sequence of source symbols. Denoting the cumulative distortion of a fixed coding scheme (f, g) ∈ F by $D_T(f, g)$, the goal of the learner can be formulated as minimizing the expected normalized distortion redundancy
$$ \widehat{R}_T = \frac{1}{T}\left( \mathbb{E}\bigl[\widehat{D}_T\bigr] - \min_{(f,g)\in F} D_T(f, g) \right), $$
which is in turn equivalent to regret minimization in the online learning problem where rewards correspond to negative distortions.
Additionally, in each time step t, the learning algorithm has to ensure that the receiving entity is informed of the identity of the decoder $g_t$ to be used for decoding the t-th channel symbol. We assume that transmitting the decoder $g_t$ is only possible on the same channel that is used for transmitting the source sequence. This gives rise to a cost for switching between coding schemes, making this problem an instance of the problems described in Setting 3a.

The contributions of this thesis for each setting are listed in Figure 1.2. All performance guarantees proved for Settings 1a through 2b concern learning schemes that have not been covered by previous works, with the exception of Setting 2a, where our contribution is an improvement of the regret bounds given by Even-Dar et al. [33]. Our main results concerning online MDPs are also presented in Table 1.1, along with the most important other results in the literature. For Setting 3a, we propose a new prediction algorithm with optimal performance guarantees. The same approach can be used for learning in Setting 3b, a problem that has not yet been addressed in the literature. The results for Setting 4 significantly improve a number of previous performance guarantees known for the problem: our guarantees on the expected regret match the best known lower bound for the problem up to a logarithmic factor.

Setting 1a
• Guarantee on the expected regret against the pool of all stationary policies (Proposition 3.1).
Setting 1b
• Guarantees on the expected regret against the pool of all stationary policies assuming α > 0 (Theorems 3.1, 3.2, 3.4).
• Guarantees on the expected regret against the pool of all non-stationary policies assuming α > 0 (Theorems 3.3, 3.5).
• Guarantee on the expected regret against the pool of all stationary policies allowing α = 0 (Theorem 3.6).
• High-confidence guarantee on the regret against the pool of all stationary policies assuming α > 0 (Theorem 3.7).
Setting 1c
• Guarantee on the expected regret against the pool of all stationary policies (Theorem 5.1).
Setting 2a
• Improved guarantee on the expected regret against the pool of all stationary policies (Theorem 4.1).
Setting 2b
• Guarantee on the expected regret against the pool of all stationary policies assuming α′ > 0 (Theorem 4.2).
Setting 3a
• Optimal guarantees on both the expected regret against the pool of the fixed actions in A and the number of action switches (Theorem 6.1).
Setting 3b
• Near-optimal guarantees on both the expected regret against the pool of the fixed actions in $A \subseteq \{0,1\}^d$ and the number of action switches (Theorem 6.2).
Setting 4
• Optimal guarantee on the expected regret against any finite pool of reference classes F (Theorem 7.1).
• Optimal guarantee on the expected regret against the pool of all quantizers with the quadratic distortion measure (Theorem 7.3).

Figure 1.2: The contributions of the thesis.

P known, $r_t$ observed:
  Even-Dar et al. [33], unichain environment: $\widehat{L}_T = O(\tau^2\sqrt{T\log|A|})$
P known, $r_t(x_t, a_t)$ observed:
  Neu et al. [78], SSP environment: $\widehat{L}_T = O(L^2\sqrt{T|A|}/\alpha)$
  Neu et al. [81], unichain environment: $\widehat{L}_T = O(\tau^{3/2}\sqrt{T|A|}/\alpha')$
P unknown, $r_t$ observed:
  Neu et al. [79], SSP environment: $\widehat{L}_T = O(L|X|\sqrt{|A|T})$
P unknown, $r_t(x_t, a_t)$ observed:
  Jaksch et al. [59], stochastic rewards, connected environment: $\widehat{L}_T = \tilde{O}(D|X|\sqrt{T|A|})$
  Future work

Table 1.1: Upper bounds on the regret for different feedback assumptions. Our results are the entries of Neu et al. [78, 79, 81]. Rewards are assumed to be adversarial unless stated otherwise explicitly.
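As a concrete illustration of the construction used in Setting 3 above, the following minimal sketch (illustrative only, not an algorithm from the thesis) builds the transition and reward functions of the online MDP that encodes a switching-cost prediction problem.

```python
def switching_cost_mdp(K):
    """Online MDP encoding of prediction with switching cost K (Setting 3).

    States coincide with actions: the next state is the action just chosen,
    so P(y | x, a) = 1 if y == a and 0 otherwise, and the per-round reward is
    r_t(x, a) = g_t(a) - K * 1{a != x}.
    """
    def P(y, x, a):
        # Deterministic transition to the chosen action.
        return 1.0 if y == a else 0.0

    def reward(g_t, x, a):
        # g_t maps actions to rewards in [0, 1]; switching from x to a costs K.
        return g_t[a] - (K if a != x else 0.0)

    return P, reward
```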
1.3 Related work

As noted earlier, our work is closely related to the field of reinforcement learning. Most works in the field consider the case when the learner controls a finite Markov decision process (MDP; see [19, 60, 15, 59], or Section 4.2.4 of [93] for a summary of available results and the references therein). While there exist a few works that extend the theoretical analysis beyond finite MDPs, these come with strong assumptions on the MDP (e.g., [61, 91, 1]).

The first work to address the theoretical aspects of online learning in non-stationary MDPs is due to Even-Dar et al. [32, 33], who consider the case when the reward function is fully observable. They propose an algorithm, MDP-E, which uses some (optimized) experts algorithm in every state, fed with the action values of the policy used in the last round. Assuming that the MDP is unichain and that the worst-case mixing time τ over all policies is uniformly small, the regret of their algorithm is shown to be $\tilde{O}(\tau^2\sqrt{T})$. Part of our work recycles the core idea underlying MDP-E: it uses black-box bandit algorithms at every state. An alternative Follow-the-Perturbed-Leader-type algorithm was introduced by Yu et al. [105] for the same full-information unichain MDP problem. The algorithm comes with improved computational complexity but an increased $\tilde{O}(T^{3/4+\epsilon})$ regret bound. Concerning bandit information and unichain MDPs, Yu et al. [105] introduced an algorithm with vanishing regret (i.e., the algorithm is Hannan consistent). Yu and Mannor [103, 104] considered the problem of online learning in MDPs where the transition probabilities may also change arbitrarily after each transition. This problem is significantly more difficult than the case where only the reward function changes arbitrarily. Accordingly, the algorithms proposed in these papers fail to achieve sublinear regret. Yu and Mannor [104] also considered the case when rewards are only observed along the trajectory traversed by the agent. However, this paper seems to have gaps: if the state space consists of a single state, the problem becomes identical to the non-stochastic multi-armed bandit problem. Yet, from Theorem IV.1 of Yu and Mannor [104] it follows that the expected regret of their algorithm is $O(\sqrt{\log|A|\, T})$, which contradicts the known $\Omega(\sqrt{|A|T})$ lower bound on the regret (see Auer et al. [10]).²

² To show this contradiction, note that the condition T > N in the bound of Theorem IV.1 of Yu and Mannor [104] can be traded for an extra O(1/T) term in the regret bound. Then the said contradiction can be arrived at by letting ε and δ converge to zero such that ε/δ³ → 0.

Some parts of our work can be viewed as stochastic extensions of works that considered online shortest path problems in deterministic settings. Here, the closest to our ideas and algorithms is the paper by György et al. [47]. They implement a modified version of the Exp3 algorithm of Auer et al. [9] over all paths using dynamic programming, estimating the reward-to-go via estimates of the immediate rewards. The resulting algorithm is shown to achieve $O(\sqrt{T})$ regret that scales polynomially with the size of the problem, and it can be implemented with linear complexity. A conceptually harder version of the shortest path problem is when only the rewards of whole paths are received, and the rewards corresponding to the individual edges are not revealed. Dani et al. [28] showed that this problem is actually not harder, by proposing a generalization of Exp3 to linear bandit problems, which can be applied to this setting and which gives an expected regret of $O(\sqrt{T})$ (again, scaling polynomially with the size of the problem), improving earlier results of Awerbuch and Kleinberg [12], McMahan and Blum [76], and György et al. [47]. Bartlett et al.
[14] showed that the algorithm can be extended so that the bound holds with high probability, while Cesa-Bianchi and Lugosi [26] improved the method of [28] for bandit problems with some special "combinatorial" structure. While the above, very similar approaches (an efficient implementation of a centralized Exp3 variant) are also appealing for our problem, they cannot be applied directly: the random transitions in the MDP structure disable the dynamic-programming-based approach of György et al. [47]. Furthermore, although Dani et al. [28] suggest that their algorithm can be implemented efficiently in the MDP setting, this does not seem to be straightforward at all. This is because the application of their approach requires representing policies via the distributions (or occupancy measures) they induce over the state space, but the non-linearity of the latter dependence makes dynamic programming highly non-trivial and causes difficulties in describing linear combinations of policies. The contextual bandit setting considered by Lazaric and Munos [70] can also be regarded as a simplified version of our model, with the restriction that the states are generated in an i.i.d. fashion.

More recently, Arora et al. [3] gave an algorithm for MDPs with deterministic transitions, arbitrary reward sequences and bandit information. Following the work of Ortner [83], who studied the same problem with i.i.d. rewards, they note that following any policy in a deterministic MDP leads to periodic behavior, and thus finding an optimal policy is equivalent to finding an "optimal cycle" in the transition graph. While this optimal cycle is well defined for stationary rewards, Arora et al. observe that it can be ill-defined for non-stationary reward sequences: in particular, it is easy to construct an example where the same policy can incur an average reward of either 0 or 1, depending on the state where we start to run the policy. A meaningful goal in this setting is to compete with the best meta-policy that can run any stationary policy starting from any state as its initial state. Arora et al. give an algorithm that enjoys a regret bound of $O(T^{3/4})$ against the pool of such meta-policies.

1.4 Applications

In this section, we outline how some real-world problems fit into our framework. The common feature of the examples to be presented is that the state space of the environment is a product of two parts: a controlled part X and an uncontrolled part Y. In all of these problems, the evolution of the controlled part of the state can be modeled as a Markov decision process described by {X, A, P}, while no statistical assumptions can be made about the sequence of uncontrolled state variables $(y_t)_{t=1}^T$. We assume that interaction between the controlled and uncontrolled parts of the environment is impossible, that is, the stochastic transitions of $(x_t)_{t=1}^T$ cannot be influenced by the irregular transitions of $(y_t)_{t=1}^T$, and vice versa.

Inventory management. Consider the problem of controlling an inventory so as to maximize revenue. This is an optimal control problem, where the state of the controlled system is the stock $x_t$ and the action $a_t$ is the amount of stock ordered.
The evolution of the stock is also influenced by the demand, which is assumed to be stochastic. Further, the revenue depends on the prices at which products are bought and sold. By assumption, the prices are not available at the time when the decisions are made. Since the prices can depend on many external, often unobserved factors $y_t$, their evolution is often hard to model. We assume that the influence of our purchases on the prices is negligible. Since $y_t$ is unobserved, this problem is covered by our Settings 1b and 2b.

Controlling the engine of a hybrid car. Consider the problem of switching between the electric motor and the internal combustion engine of a hybrid car so as to optimize fuel consumption (see, e.g., [13]). In this setting, the favorable engine depends on the road conditions and the intentions of the driver: for example, driving downhill can be used to recharge the batteries of the electric motor, while picking up higher speeds under normal road conditions can be more efficient with the internal combustion engine. We assume that we can observe the partial state $x_t$ of the engines and can decide to initiate a switching procedure from one engine to the other. The execution of the switching procedure depends on the state $x_t$ of the engines in a well-understood stochastic fashion. Other parts of the engine state, $y_t$, may or may not be observed. External conditions only influence this state variable and do not interfere with $x_t$. That is, the state $y_t$ can be seen as the uncontrolled state variable influencing the rewards given for high fuel efficiency. Since $y_t$ is not entirely observed, this problem is covered by our Settings 1b and 2b.

Storage control of wind plants. Hungarian regulations require wind plants to produce schedules of their actual production on a 15-minute basis, one month in advance. If production deviates from the schedule by more than a certain margin, the producer has to pay a penalty tariff depending on the deviation. As discussed by Hartmann and Dán [52], a possible way of meeting the schedule under adverse wind conditions is to use energy storage units: the excess energy accumulated when production would exceed the schedule can be fed into the power system when wind energy stays below the desired level. In this setting, we can assume that the state $x_t$ of the energy storage unit can be captured by a possibly unknown Markovian model, while the evolution of the wind speed over time, $(y_t)_{t=1}^T$, is clearly uncontrolled. Since the schedules are fixed well in advance, their influence can be incorporated into the Markovian model as well. This way, the uncontrolled state variable only influences the rewards (or negative penalties), and thus this problem also fits into our framework. Since $y_t$ is observed, this problem is covered by our Settings 1a and 2a. The case when the exact dynamics of the energy storage unit are unknown is covered by Setting 1c.

Adaptive routing in computer networks. Consider the problem of routing in a computer network where the goal is to transfer packets from a source node $x_0$ to a designated drain node $x_L$ with minimal delay (see, e.g., [50]). The delays can be influenced by external events such as the malfunction of some of the internal nodes. These external events are captured by the uncontrolled state $y_t$. Assume that in each node $x_l^{(t)}$, we can choose the next node using some interface $a_l^{(t)} \in A$ of the network layer.
Assuming that this interface decides about the actual next state $x_{l+1}^{(t)}$ using a simple randomized algorithm, we can cast our problem as an online learning problem in SSPs, covered by our Setting 1. If the delays are only observed on the actual path traversed by the packet and the algorithm implemented by the interfaces a ∈ A is known, the problem is covered by Setting 1b. Assuming that $y_t$ is revealed by some oracle after sending each packet, our algorithms for Setting 1c can be used even if the randomized algorithms used in the network layer are unknown.

Growth-optimal portfolio selection with transaction costs. Consider the problem of constructing sequential investment strategies for financial markets, where at each time step the investor distributes its capital among d assets (see, e.g., Györfi and Walk [42]). Formally, the investor's decision at time t is to select a portfolio vector $a_t \in [0,1]^d$ such that $\sum_{i=1}^d (a_t)_i = 1$. The i-th component of $a_t$ gives the proportion of the investor's capital $N_t$ invested in asset i at time t. The evolution of the market in time is represented by the sequence of market vectors $s_1, s_2, \ldots, s_T \in [0,\infty)^d$, where the i-th component of $s_t$ gives the price of the i-th asset at time t. It is practical to define the return vector $y_t \in [0,\infty)^d$ at time t with components $(y_t)_i = (s_t)_i / (s_{t-1})_i$. Furthermore, we assume that switching between portfolios is subject to some additional cost proportional to the price of the assets being bought or sold. The goal of the investor is to maximize its capital $N_T$, or, equivalently, to maximize its average growth rate $\frac{1}{T}\log N_T$. The problem of maximizing the growth rate under transaction costs can be formalized as an online MDP where the state at time t is given by the previous portfolio vector $a_{t-1}$ and the reward given for choosing action $a_t$ in state $x_t = a_{t-1}$ is
$$ r(x_t, y_t, a_t) = \log\left(\frac{N_t - c_t}{N_t}\right) + \log\bigl(a_t^\top y_t\bigr), $$
where $c_t$ is the transaction cost arising at time t. Since the relation between $c_t$ and the state $x_t$, the action $a_t$ and the capital $N_t$ is well defined, and the rewards are influenced by the uncontrolled sequence $(y_t)_{t=1}^T$ in a transparent way, we can assume full feedback. After discretizing the space of portfolios and prices, we can directly apply learning algorithms devised for Setting 2a to construct sequential investment strategies. The problem can also be approximately modeled by Setting 3a when assuming that $c_t \le \alpha N_t$ holds for some constant $\alpha \in (0,1)$: upper bounding the regret in the online learning problem with rewards $g_t(a) = \log(a^\top y_t)$ and switching cost $K = -\log(1-\alpha)$, we obtain a crude upper bound on the regret of the resulting sequential investment strategy.

Chapter 2
Background

In this chapter, we precisely describe the learning models outlined in Chapter 1. In the first half of the chapter, we review some concepts of online learning that are relevant for our work. The rest of the chapter discusses important tools for Markov decision processes that will be useful in the later chapters.

Throughout the thesis, we will use boldface letters (such as x, a, u, ...) to denote random variables. We use $\|v\|_p$ to denote the $L_p$-norm of a function or a vector. In particular, for $p = \infty$ the maximum norm of a function $v : S \to \mathbb{R}$ is defined as $\|v\|_\infty = \sup_{s\in S} |v(s)|$, and for $0 \le p < \infty$ and any vector $v = (v_1, \ldots, v_d) \in \mathbb{R}^d$, $\|v\|_p = \bigl(\sum_{i=1}^d |v_i|^p\bigr)^{1/p}$. We will use ln to denote the natural logarithm function.
For a logical expression A, the indicator function $\mathbb{I}_{\{A\}}$ is defined as
$$ \mathbb{I}_{\{A\}} = \begin{cases} 1, & \text{if } A \text{ is true}, \\ 0, & \text{if } A \text{ is false}. \end{cases} $$

2.1 Online prediction of arbitrary sequences

The general protocol of online prediction (also called online learning) is shown in Figure 2.1.

Figure 2.1: The online prediction protocol.
Parameters: finite set of actions A, upper bound H on the rewards, feedback alphabet Σ, feedback function $f_t : [0, H]^A \to \Sigma$.
For all t = 1, 2, ..., T, repeat:
1. The environment chooses rewards $r_t(a) \in [0, H]$ for all a ∈ A.
2. The learner chooses action $a_t$.
3. The environment gives feedback $f_t(r_t, a_t) \in \Sigma$ to the learner.
4. The learner earns reward $r_t(a_t)$.

Learning algorithms for online prediction problems are often referred to as experts algorithms. Formally, an experts algorithm E with action set $A_E$ and an adversary interact in the following way: at each round t, algorithm E chooses a distribution $p_t$ over the actions $A_E$ and picks an action $a_t$ according to this distribution. The adversary selects a reward function $r_t : A_E \to [0, H]$, where H > 0 is known to the learner. Then, the adversary receives the action $a_t$ and gives a reward of $r_t(a_t)$ to E. In the most general model, E only gets to observe some feedback $f_t(r_t, a_t)$, where $f_t$ is a fixed and known mapping from reward functions and actions to some alphabet Σ. Under full-information feedback, the algorithm E receives $f_t(r_t, a_t) = r_t$, that is, it gets to observe the entire reward function. When only $f_t(r_t, a_t) = r_t(a_t)$ is sent to the algorithm, we say that the algorithm works under bandit feedback. The goal of the learner in both settings is to minimize the total expected regret defined as
$$ \widehat{L}_T = \max_{a\in A_E} \mathbb{E}\left[\sum_{t=1}^T r_t(a)\right] - \mathbb{E}\left[\sum_{t=1}^T r_t(a_t)\right], $$
where the expectation is taken over the internal randomization of the learner. It will be customary to consider two types of adversaries: an oblivious adversary chooses a fixed sequence of reward functions $r_1, \ldots, r_T$, while a non-oblivious, or adaptive, adversary is defined by T possibly random functions such that the t-th function acts on the past actions $a_1, \ldots, a_{t-1}$ and on the history of the adversary (including the previously selected reward functions, as well as any variable describing the earlier evolution of the inner state of the adversary) and returns the random reward function $r_t$.

So far we have considered experts algorithms for a given horizon and action set. However, experts algorithms in fact come as meta-algorithms that can be specialized to a given horizon and action set (i.e., the meta-algorithms take as input T and the action set $A_E$). In what follows, at the price of abusing terminology, we will call such meta-experts algorithms experts algorithms, too.

Definition 2.1. Let C be a class of adversaries and consider an experts algorithm E. The function $B_E : \{1, 2, \ldots\} \times \{1, 2, \ldots\} \to [0, \infty)$ is said to be a regret bound for E against C if the following hold: (i) $B_E$ is non-decreasing in its second argument; and (ii) for any adversary from C, time horizon T and action set $A_E$, the inequality
$$ \widehat{L}_T = \max_{a\in A_E} \mathbb{E}\left[\sum_{t=1}^T r_t(a) - \sum_{t=1}^T r_t(a_t)\right] \le B_E(T, |A_E|) \qquad (2.1) $$
holds true, where $(a_1, r_1), \ldots, (a_T, r_T)$ is the sequence of actions and reward functions that arise from the interaction of E and the adversary when E is initialized with T and $A_E$.

Note that both terms of the regret involve the same sequence of rewards.
Although this may look unnatural when the opponent is not oblivious, the definition is standard and will be useful in this form for our purposes.¹ In what follows, to simplify the notation, we will also use $B_E(T, A_E)$ to mean $B_E(T, |A_E|)$. As usual, for the regret bound to make sense it has to be a sublinear function of T. Some algorithms do not need T as their input (all algorithms need to know the action set, of course). Such algorithms are called universal, whereas in the opposite case the algorithm is called non-universal. Oftentimes, a non-universal method can be turned into a universal one, at the price of deteriorating the bound $B_E$ by at most a constant factor, by either adaptively changing the algorithm's internal parameters or by resorting to the so-called "doubling trick" (cf. Exercises 2.8 and 2.9 in [25]).

¹ In Chapters 3 and 4, we decompose the online MDP problem into smaller online prediction problems where the reward sequences are generated by a non-oblivious adversary. The global decision problem then reduces to the problem of controlling the above notion of regret against this non-oblivious adversary.

Typically, algorithms are developed for the case when the adversaries generate rewards belonging to the [0, 1] interval. Given such an algorithm E, if the adversaries generate rewards in [0, H] instead of [0, 1] for some H > 0, one can still use E such that in each round, the reward that is fed to E is first divided by H. We shall refer to such a "composite" algorithm as E tuned to the maximum reward H. Clearly, if E enjoyed a regret bound $B_E$ against the class of all oblivious (respectively, adaptive) adversaries with rewards in [0, 1], then E tuned to the maximum reward H will enjoy the regret bound $H B_E$ against the class of all oblivious (respectively, adaptive) adversaries that generate rewards in [0, H]. Obviously, this tuning requires prior knowledge of H.

Note that typical full-information experts algorithms enjoy the same regret bounds for oblivious and non-oblivious adversaries. The reason for this is that if the distribution $p_t$ selected by the experts algorithm E is fully determined by the previous reward functions $r_1, \ldots, r_{t-1}$, then any regret bound that holds against an oblivious adversary also holds against a non-oblivious one by Lemma 4.1 of Cesa-Bianchi and Lugosi [25]. To our knowledge, all best-experts algorithms that enjoy a regret bound of optimal order against an adaptive adversary satisfy this condition on $p_t$.

An optimized best-experts algorithm in the full-information case is an algorithm whose expected regret can be bounded by $B_{OE}(T, A) = O(\sqrt{T \ln |A|})$, and similarly, an optimized |A|-armed bandit algorithm is one with a bound $B_{OB}(T, A) = O(\sqrt{T |A|})$ on its expected regret (in the case of bandit algorithms, actions are also called arms). Optimized best-experts algorithms include the exponentially weighted average forecaster (EWA) (a variant of the weighted majority algorithm of Littlestone and Warmuth [72] and of the aggregating strategies of Vovk [96], also known as Hedge by Freund and Schapire [36]) and the Follow-the-Perturbed-Leader (FPL) algorithm of Kalai and Vempala [63] (see also Hannan [51]). There exist a number of algorithms for the bandit case that attain regret of $O(\sqrt{T |A| \ln |A|})$, such as Exp3 by Auer et al. [10] and Green by Allenberg et al. [2], while the algorithm presented by Audibert and Bubeck [4] achieves the optimal rate $O(\sqrt{T |A|})$.
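To illustrate the kind of bandit algorithm referred to above, here is a minimal sketch of an Exp3-style forecaster in its reward-based form, assuming rewards in [0, 1] and a fixed exploration parameter gamma; tuning and refinements are discussed in the cited papers, and the function `get_reward` is a placeholder for the bandit feedback.

```python
import math
import random

def exp3(K, T, gamma, get_reward):
    """Sketch of an Exp3-style bandit forecaster for K arms and horizon T."""
    weights = [1.0] * K
    for t in range(1, T + 1):
        total = sum(weights)
        # Mix the exponential weights with uniform exploration.
        probs = [(1 - gamma) * w / total + gamma / K for w in weights]
        arm = random.choices(range(K), weights=probs)[0]
        reward = get_reward(t, arm)       # only r_t(a_t) is observed
        # Importance-weighted (unbiased) estimate of the reward of the chosen arm.
        estimate = reward / probs[arm]
        weights[arm] *= math.exp(gamma * estimate / K)
    return weights
```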
Although these papers prove the regret bound (2.1) only in the case of oblivious adversaries, that is, when the actions taken by the algorithm have no effect on the next rewards chosen by the adversary, it is not hard to see that the bounds proved in these papers continue to hold in the non-oblivious case, too. This is because the only non-algebraic step in the regret bound proofs uses the fact that the reward estimates constructed by the respective algorithms are unbiased estimates of the actual rewards, and this property continues to hold true even in the non-oblivious setting. In addition to the above algorithms, Poland [84] describes Follow-the-Perturbed-Leader-type algorithms that also satisfy these requirements.

All the algorithms discussed above were developed to deal with rewards in the [0, 1] interval. For rewards in [0, H], using the tuning method described above would yield a regret bound that scales linearly with H. While in general this dependence is inevitable, there are some specific cases where the problem structure can be exploited to obtain much tighter bounds. For a more careful treatment of this scaling issue, see [68].

2.2 Stochastic multi-armed bandits

A closely related line of work concerns multi-armed bandit problems where the environment generates rewards in an i.i.d. fashion, that is, it is assumed that for a fixed action a, the random rewards $r_t(a)$ are drawn from a fixed distribution d(a) with mean m(a). The goal of the learner in this setting is to select its actions $a_1, a_2, \ldots, a_T$ so as to minimize the pseudo-regret
$$ \tilde{L}_T = T \max_a m(a) - \sum_{t=1}^T m(a_t) = \sum_{t=1}^T \Bigl( \max_a m(a) - m(a_t) \Bigr). $$
Obviously, this learning problem is significantly easier than the case when the i.i.d. assumption is dropped: the lower bound on the regret for the bandit case² has been known since the seminal work of Lai and Robbins [69]. The distribution $d^*$ with the largest mean $m^*$ and the distribution $d_*$ with the second largest mean $m_*$ play an important role in characterizing the complexity of the problem: for any action selection strategy, the regret satisfies
$$ \tilde{L}_T \ge c\, \frac{\ln T}{\mathrm{KL}(d_* \,\|\, d^*)}, $$
where $\mathrm{KL}(d_* \| d^*)$ denotes the Kullback–Leibler divergence between $d_*$ and $d^*$, and c > 0 is some universal constant.³

² Clearly, regret minimization under full feedback is a trivial problem under these assumptions.
³ Note that this lower bound depends on the specific problem instance that the learner has to face. The best problem-independent lower bound on the regret is of order $\sqrt{T|A|}$, which is exactly the order of the best known upper bound on the regret in the non-i.i.d. case.

Since then, a number of efficient algorithms have been proposed that attain this optimal regret; see, e.g., [37, 73, 65]. A large family of methods implements the principle of "optimism in the face of uncertainty" (OFU) by computing an upper confidence bound for each action a ∈ A and selecting the action with the maximal upper bound. Using the notation $K_t(a) = \sum_{s=1}^{t-1} \mathbb{I}_{\{a_s = a\}}$, the upper confidence bound for action a is defined as
$$ \hat{r}_t(a) = \frac{\sum_{s=1}^{t-1} r_s(a)\,\mathbb{I}_{\{a_s = a\}}}{K_t(a)} + B_t\bigl(K_t(a)\bigr), $$
where the function $B_t : \mathbb{N} \to \mathbb{R}_+$ is chosen carefully such that the probability $\mathbb{P}[m(a) \ge \hat{r}_t(a)]$ vanishes quickly as $K_t(a)$ grows. For example, the seminal algorithm of Auer et al. [8], UCB1, uses $B_t(K_t) = \sqrt{\frac{2\ln t}{K_t}}$. In other words, the upper confidence bound $\hat{r}_t(a)$ is chosen to strategically overestimate the mean m(a). Selecting $a_t = \arg\max_a \hat{r}_t(a)$ is then equivalent to setting up confidence intervals on the means and acting according to the most optimistic model in this confidence set.
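A minimal sketch of the UCB1 index rule just described (assuming i.i.d. rewards in [0, 1]; `pull` is a placeholder for drawing a reward of the chosen arm):

```python
import math

def ucb1(K, T, pull):
    """Sketch of UCB1: play each arm once, then pick the arm maximizing
    the empirical mean plus the exploration bonus sqrt(2 ln t / K_t(a))."""
    counts = [0] * K          # K_t(a): number of pulls of each arm so far
    sums = [0.0] * K          # cumulative reward of each arm
    for t in range(1, T + 1):
        if t <= K:
            arm = t - 1       # initialization: pull every arm once
        else:
            indices = [sums[a] / counts[a]
                       + math.sqrt(2 * math.log(t) / counts[a])
                       for a in range(K)]
            arm = max(range(K), key=lambda a: indices[a])
        reward = pull(arm)    # i.i.d. reward draw, placeholder
        counts[arm] += 1
        sums[arm] += reward
    return sums, counts
```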
This idea of jointly selecting a model and an action in an optimistic fashion can be successfully applied to more complex action sets A. Some examples from the huge body of literature on optimistic algorithms are applications to stopping problems [77], bandit problems [7, 8, 29], variants of the pick-the-winner problem [34, 74, 77], and active learning [34, 22]. The principle has also been used to provide regret bounds for reinforcement learning problems by Auer and Ortner [11], Jaksch et al. [59] and Bartlett and Tewari [15]. The immense popularity of the optimistic approach is due to two factors: (i) it allows a relatively simple, clean and easy-to-interpret theoretical analysis of the resulting algorithms, yet (ii) optimistic methods have excellent empirical performance in a number of notoriously difficult-to-solve practical problems. For example, one of the greatest success stories of the online learning community is that of the optimistic tree-search algorithm underlying all the current top 10 computer Go players, UCT, first proposed by Kocsis and Szepesvári [67].

Despite their success in a number of practical problems, optimistic algorithms tend to be overly aggressive when the process generating the rewards does not conform to any stochastic assumptions, that is, in the case discussed in the previous section. In such cases, optimistic algorithms discard actions with initially low rewards too quickly to be able to discover possible higher rewards associated with the same actions in later stages of the learning process. On the other hand, universal bandit algorithms such as Exp3 by Auer et al. [10] and Green by Allenberg et al. [2] can be overly conservative when the reward sequences have nice statistical properties.

2.3 Markov decision processes

In this section, we formalize the two classes of finite Markov decision processes that we described in Chapter 1: loop-free stochastic shortest path problems (Settings 1a–1c) and unichain MDPs (Setting 2b). A finite MDP is formally defined by the tuple M = (X, A, P, r), where X, the state space, is a finite set of states; A, the action space, is a finite set of actions; $P : X \times X \times A \to [0,1]$ is the transition function describing the dynamics of the MDP; and $r : X \times A \to [0,1]$, the reward function, allocates rewards in [0, 1] to states and actions. An MDP with no reward specified will be referred to as an MDP\r, which is defined by the tuple (X, A, P).

The literature distinguishes between two main types of finite MDPs: episodic and continuing environments. While the first notion can be used to formulate control tasks that can be naturally broken down into a sequence of individual but similar tasks, the second notion stands closer to the traditional formulation of control problems where a controller has to govern the behavior of a plant over a possibly infinite time horizon. The two specific classes we consider are chosen to represent one of these types each. The assumptions on the transition function P and the precise interaction between the learner and the environment in M are described below for each model class.

In order to define our performance criteria, we need the definition of stochastic stationary policies. A stochastic stationary policy is a mapping $\pi : X \times A \to [0,1]$ from the set of admissible state–action pairs to the [0, 1] interval such that $\sum_{a\in A} \pi(x, a) = 1$ for every x ∈ X.
Following a policy π means that upon reaching state x ∈ X, the next action a is chosen according to the distribution π(x, ·), that is, a ∼ π(x, ·). To emphasize that π(x, ·) specifies a probability distribution over the actions for each state x ∈ X, in what follows we will use π(a|x) to denote π(x, a). We use e 1 , . . . , e d to denote the row vectors in the canonical basis of the Euclidean space Rd . Since we will identify the state space X with the integers {1, . . . , |X|}, we will also use the notation e x for x ∈ X. A convention that we will use is that the symbols x, x 0 , . . . will be reserved to denote a state in X, while a, a 0 , b will be reserved to denote an action in A. In expressions involving sums over X, the domain of x, x 0 , . . . will thus be suppressed to avoid clutter. The same holds for sums involving actions. 2.3.1 Loop-free stochastic shortest path problems Loop-free stochastic shortest path environments (or in short, SSPs) are episodic MDPs where transitions are only possible in a “forward” manner. An episodic MDP is an MDP with a few special states, the starting state and some terminal states: In an episodic MDP, the process is segmented into episodes. Each episode starts from the designated starting state and ends when it reaches a terminal state. In an episodic MDP the goal of the agent is to maximize the total expected reward collected in an episode. By switching from rewards to costs, the goal is turned into minimizing the total expected cost. In this case the problem of finding a minimizing policy is also called the stochastic shortest path (SSP) problem. By slightly abusing terminology, we will also call the problem of maximizing the total reward an SSP problem. (Note that the same terminology is used in Bertsekas and Tsitsiklis [17].) When, in addition, the state space has a layered structure with respect to the transitions, we get the so-called loop-free variant of the SSP problem. That the state space has a layered structure means that X = ∪Ll=0 Xl , where Xl is called the l th layer of the state space, Xl ∩ Xk = ; for all l 6= k, and the agent can only move between consecutive layers. In particular, each episode starts at layer 0, from state x 0 , and ends when it reaches any state x L belonging to the last layer XL (see Figure 2.2). Further, for any x ∈ Xl and a ∈ A, P (y|x, a) = 0 if y 6∈ Xl +1 , l = 0, . . . , L − 1. This assumption is equivalent to assuming that each path in the graph is of equal length.4 For any state x ∈ X we will use l x to denote the index of the layer x belongs to, that is, l x = l if x ∈ Xl . The protocol of traditional fixed-reward SSPs is shown on Figure 2.3, Figure 2.2 shows a representation of an example of an SSP. The feedback function is f t (r, x, a) = r in the fullinformation case and f t (r, x, a) = r (x, a) in the bandit case. The transition function P can be either known or unknown to the learner, depending on the nature of the problem. The assumption that we make on the transition function of SSPs are summarized for later reference below. 4 Note that all loop-free state spaces can be transformed to one that satisfies our assumptions with no significant increase in the size of the problem. A simple transformation algorithm is given in Appendix A of György et al. [47]. 20 X0 X1 X2 X3 X4 X0 X1 X2 X3 X4 l=0 l=1 l=2 l=3 l=4 l=0 l=1 l=2 l=3 l=4 a1 a2 Figure 2.2: An example of a stochastic shortest path problem when two actions a 1 (“up”) and a 2 (“down”) are available in all states. 
Nonzero transition probabilities under each action are indicated with arrows between circles representing states. In the case of two successor states, the successor states with the intended direction with larger probabilities are connected with solid arrows, dashed arrows indicate less probable transitions. Parameters: state space X, action space A, feedback alphabet Σ, feedback function f t : [0, H ]A → Σ. For all episodes t = 1, 2, . . . , T , repeat ) 1. Set x(t 0 = x 0 . For all layer indices l = 0, 1, . . . , L − 1, repeat (a) The learner observes xl(t ) and chooses al(t ) . (b) The environment gives feedback f t (r, xl(t ) , al(t ) ) ∈ Σ to the learner. (c) The learner earns reward r (xl(t ) , al(t ) ). ) (d) The environment draws the next state as xl(t+1 = P (·|xl(t ) , al(t ) ). Figure 2.3: The protocol of stochastic shortest path problems. Assumption S1. The state space be decomposed into L layers: X = X0 ∪ X1 ∪ · · · ∪ XL where Xi ∩ X j = ; if i 6= j . The transition function is such that transitions are only possible between consecutive layers, that is, there only exists a ∈ A such that P (x|y, a) > 0 if x ∈ Xl and y ∈ Xl +1 , l = 0, 1, . . . , L − 1. The first and last layers are singleton layers, that is, X0 = {x 0 } and XL = {x L }. Value functions are useful tools for addressing learning problems in an SSP. Fix a policy π and let (x0 , a0 ), (x1 , a1 ), . . . , (xL−1 , aL−1 ) be a random trajectory generated by following π. The value function, v π : X → R, and the action-value function, q tπ : X × A → R, of policy π are defined 21 for an SSP with transition function P and reward function r , respectively, by " π v (x) = E L−1 X k=l x π " q (x, a) = E L−1 X k=l x # ¯ ¯ r (xk , ak )¯xl = x, π, P , x ∈ X, # ¯ ¯ r (xk , ak )¯xl = x, al = a, π, P , (x, a) ∈ X × A, where we used the notation E [ ·| π, P ] to emphasize that the random trajectories are generated by following policy π and the transition model P . The function-pair (v π , q π ) is known to be the unique solution to the Bellman equations (see, e.g., Puterman [86]): q π (x, a) = r (x, a) + X P (x 0 |x, a)v π (x 0 ), (x, a) ∈ X × A; x0 v π (x) = X π(a|x)q π (x, a), a x ∈ X \ {x L }; (2.2) v π (x L ) = 0. A policy π is deterministic when for each state x ∈ X, π(·|x) is concentrated on a single state. For a deterministic policy π, we will use π(x) to denote the action for which π(a|x) = 1. Theorem 4.4.2 of [86] shows that there exists a stationary and deterministic policy π∗ satisfying v π = max v π (x 0 ). ∗ π Such a policy is called an optimal policy in the SSP M . Further, define the occupation measure, µπ : X → R, by ¯ # ¯ ¯ £ ¤ ¯ µπ (x) = E I{xl =x} ¯ P, π = P xl x = x ¯ P, π , ¯ l =0 " L X x ∈ X. Note that the restriction of µπ to any layer Xl defines a probability mass function. It is easy to see that one can compute the occupation measure µπ in a recursive fashion for any x ∈ X \ {x 0 } as µπ (x) = µπ (x 0 )π(a 0 |x 0 )P (x|x 0 , a 0 ), X (2.3) x 0 ∈Xl −1 x (x 0 ,a 0 )∈X×A where µπ (x 0 ) = 1. For some of the results presented in the thesis, we will need the following additional assumption on the SSPs underlying the considered decision problems: Assumption S2. For any policy π, the occupation measure at state x ∈ X is bounded from below as µπ (x) ≥ α(x). The minimal occupation measure is defined as def α = min min µπ (x). π x∈X 22 Parameters: state space X, action space A, initial state distribution µ1 , feedback alphabet Σ, feedback function f t : [0, H ]A → Σ. 
Initialization: The environment draws the initial state as x1 ∼ µ1.
For all time steps t = 1, 2, . . . , T, repeat
1. The learner observes xt and chooses at.
2. The environment gives feedback f t(r, xt, at) ∈ Σ to the learner.
3. The learner earns reward r(xt, at).
4. The environment draws the next state as xt+1 ∼ P(·|xt, at).
Figure 2.4: The protocol of continuing Markov decision processes.

2.3.2 Unichain MDPs

Unichain MDPs form a subclass of continuing Markov decision processes that is widely used in the reinforcement learning literature. Roughly speaking, the unichain assumption ensures that for any pair of states x, y ∈ X, it is possible to find a policy that takes the learner from state x to state y in finite time, thus making it possible to fix any mistakes made by the learning algorithm in early stages of the learning procedure. The interaction between the learner and the environment is shown in Figure 2.4. Once again, the feedback function is f t(r, x, a) = r in the full-information case and f t(r, x, a) = r(x, a) in the bandit case; the transition function P can be either known or unknown to the learner, depending on the nature of the learning problem.

Before describing our assumptions, a few more definitions are needed. Without loss of generality, we shall identify the states with the first |X| integers and assume that X = {1, 2, . . . , |X|}. Now, take a policy π and define the Markov kernel $P^\pi(x'|x) = \sum_a \pi(a|x) P(x'|x, a)$. The identification of X with the first |X| integers makes it possible to view $P^\pi$ as a matrix: $(P^\pi)_{x,x'} = P^\pi(x'|x)$. In what follows, we will also take this view when convenient. In general, distributions will also be treated as row vectors. Hence, for a distribution µ, $\mu P^\pi$ is the distribution over X that results from using policy π for one step after a state is sampled from µ (i.e., the "next-state distribution" under π). Finally, remember that the stationary distribution of a policy π is a distribution $\mu_{st}$ that satisfies $\mu_{st} P^\pi = \mu_{st}$. We assume that every (stochastic stationary) policy π has a well-defined unique stationary distribution $\mu^\pi_{st}$. This assumption, usually referred to as the unichain assumption, ensures that the average reward underlying any stationary policy is a well-defined single real number. It is well known that in this case the convergence to the stationary distribution is exponentially fast. Following Even-Dar et al. [33], we consider the following stronger "uniform mixing condition" (which implies the existence of the stationary distributions):

Assumption M1. There exists a number τ > 0 such that for any two arbitrary distributions µ and µ′ over X,
$$\sup_\pi \bigl\|(\mu - \mu') P^\pi\bigr\|_1 \le e^{-1/\tau} \|\mu - \mu'\|_1. \qquad (2.4)$$

As in Even-Dar et al. [33], we call the smallest τ such that (2.4) holds the mixing time of the transition probability kernel P. The next assumption ensures that every state is visited eventually no matter what policy is chosen:

Assumption M2. The stationary distributions are uniformly bounded away from zero:
$$\inf_{\pi, x} \mu^\pi_{st}(x) \ge \alpha' > 0$$
for some α′ ∈ ℝ.

Note that $e^{-1/\tau}$ is the supremum over all policies π of the Markov-Dobrushin coefficient of [58].
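As a quick numerical illustration (not part of the thesis), the contraction factor in (2.4) can be checked for a single made-up kernel $P^\pi$ by evaluating the ratio on pairs of point-mass distributions, where the supremum for a fixed π is attained; Assumption M1 itself additionally takes the supremum over all policies, and by the remark below it suffices to check deterministic ones.

```python
import numpy as np
from itertools import combinations

# Toy check of the mixing condition (2.4) for one fixed, made-up kernel P^pi.
rng = np.random.default_rng(1)
n = 4
P_pi = rng.dirichlet(np.ones(n), size=n)     # row x is P^pi(.|x)

coef = max(
    0.5 * np.abs(P_pi[x] - P_pi[y]).sum()    # ||(e_x - e_y) P^pi||_1 / ||e_x - e_y||_1
    for x, y in combinations(range(n), 2)
)
tau = -1.0 / np.log(coef)                    # the tau for which coef = e^{-1/tau}
print(f"contraction = {coef:.3f}, mixing time = {tau:.3f}")
```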
It is also known that the ergodicity coefficient, defined as
$$m_{P^\pi} = \sup_{\mu \ne \mu'} \frac{\|(\mu - \mu') P^\pi\|_1}{\|\mu - \mu'\|_1}$$
for the transition probability kernel $P^\pi$, satisfies
$$m_{P^\pi} = 1 - \min_{x, x' \in X} \sum_{y \in X} \min\bigl\{P^\pi(y|x),\, P^\pi(y|x')\bigr\}$$
(see, e.g., [58]), which implies that Assumption M1 is satisfied, that is, $m_{P^\pi} < 1$ for every π, if and only if $P^\pi$ is a scrambling matrix for every π ($P^\pi$ is a scrambling matrix if any two rows of $P^\pi$ share some column in which they both have a positive element). Since $m_{P^\pi}$ is a continuous function of π and the set of policies is compact, there is a policy $\pi_0$ with $m_{P^{\pi_0}} = \sup_\pi m_{P^\pi}$. Furthermore, if $P^\pi$ is a scrambling matrix for any deterministic policy π then it is also a scrambling matrix for any stochastic policy. Thus, to guarantee Assumption M1 it is enough to verify mixing for deterministic policies only.

The concept of value functions will be once again useful when addressing learning problems in unichain MDPs. Fix an arbitrary policy π and t ≥ 1. Let $\{(x'_s, a'_s)\}$ be a random trajectory generated by π and the transition probability kernel P. We will use $q^\pi$ to denote the action-value function underlying π and the reward function r, while we will use $v^\pi$ to denote the corresponding value function.⁵ That is, for (x, a) ∈ X × A,
$$q^\pi(x, a) = \mathbb{E}\left[\left.\sum_{s=1}^{\infty} \bigl\{r(x'_s, a'_s) - \rho^\pi\bigr\} \,\right|\, x'_1 = x,\ a'_1 = a,\ \pi,\ P\right],$$
$$v^\pi(x) = \mathbb{E}\left[\left.\sum_{s=1}^{\infty} \bigl\{r(x'_s, a'_s) - \rho^\pi\bigr\} \,\right|\, x'_1 = x,\ \pi,\ P\right],$$
where $\rho^\pi$ is the long-term average reward corresponding to π:
$$\rho^\pi = \lim_{S \to \infty} \frac{1}{S} \sum_{s=1}^{S} \mathbb{E}\bigl[r(x'_s, a'_s)\bigr].$$
The long-term average reward can be expressed as
$$\rho^\pi = \sum_x \mu^\pi_{st}(x) \sum_a \pi(a|x)\, r(x, a), \qquad (2.5)$$
where $\mu^\pi_{st}$ is the stationary distribution underlying policy π. The value functions $q^\pi$, $v^\pi$ are equivalently given by the Bellman equations
$$q^\pi(x, a) = r(x, a) - \rho^\pi + \sum_{x'} P(x'|x, a)\, v^\pi(x'), \qquad v^\pi(x) = \sum_a \pi(a|x)\, q^\pi(x, a), \qquad (2.6)$$
which hold simultaneously for all (x, a) ∈ X × A.

⁵ Most sources would call these functions differential action- and state-value functions. We omit this adjective for brevity.

Chapter 3
Online learning in known stochastic shortest path problems

This chapter discusses the problem of learning in loop-free episodic environments with non-stationary rewards. Throughout this chapter, we suppose that the MDP\r S = {X, A, P} satisfies Assumption S1 and the transition function P is known to the learner. After defining the learning problem in Sections 3.1 and 3.2, we present and analyze two families of algorithms: the first set of methods, presented in Sections 3.3 and 3.4, use black-box bandit algorithms in each state x ∈ X, while in Section 3.5 we open up these black boxes and directly analyze the special case when Exp3 is used to select actions in each state. We provide upper bounds on various notions of regret for both types of algorithms and notice that directly analyzing the methods based on Exp3 leads to a significant improvement in the bounds. Furthermore, we see that the latter approach helps in providing high probability bounds on the regret and relaxing some assumptions when only expected regret is concerned. Parts of this work were published in [78] and [80]. The results of this chapter are summarized by the following thesis:

Thesis 1. Proposed a family of efficient algorithms for online learning in known stochastic shortest path problems. Proved performance guarantees for Settings 1a and 1b. The proved bounds are optimal in terms of the dependence on the number of episodes.
[78, 80] 3.1 Problem setup In this section, we define the online version of the stochastic shortest path problem, based on the definition of the original SSP problem previously presented in Section 2.3.1. In the online version, referred to as the online SSP or O-SSP problem, the reward function is allowed to change between episodes, that is, instead of a single reward function r , at every episode a new reward function, r t : X × A → [0, 1], is used. Accordingly, the difference from the protocol described on Figure 2.3 is that r t takes the place of r in t -th episode. Note that the constraint 26 that r t depends only on the current state and action is assumed only for simplicity: the results of the chapter can easily be extended to the situation where r t is allowed to depend on the next state as well (i.e., when the reward function is of the form r t (x l , a l , x l +1 )). The goal is changed to maximizing the expected sum of the rewards received over the episodes (instead of considering the reward per episode, which is ill-defined when the reward function changes from episode to episode). We will assume that the agent knows the transition probabilities, but it does not know in advance the sequence (r t )Tt=1 . The agent can, however, learn about r t through interaction. In particular, the agent learns about reward r t (xl(t ) , al(t ) ) in time step l of episode t (after action al(t ) has actually been performed), where xl(t ) is the state ) visited and a(t is the action chosen at that time (in the easier, “full-information” version of the l problem, the agent learns the whole reward function r t ). As in the online prediction problem presented in Section 2.1, the sequence, (r t )t =1,2,... , does not have to satisfy any stochastic assumptions; we assume only that it is chosen in advance (possibly, in an adversarial manner). The performance of the agent at the end of episode T will be measured by how much reward it lost on expectation as compared to using the best fixed stationary policy in hindsight. In order to formally define this quantity, let ¯ # ¯ ¯ RbT = E r t (xl(t ) , al(t ) )¯ P , ¯ t =1 l =0 " T L−1 X X denote the total expected reward of the learner and ¯ # ¯ π 0 0 ¯ RT = E r t (xl , al )¯ P, π ¯ t =1 l =0 " T L−1 X X stand for the total expected reward of the policy π. Then, the performance difference described above, also known as the total expected regret, can be written using b T = sup R π − RbT . L T π 3.2 A decomposition of the regret In this section, we introduce some more notation to be able to treat the O-SSP problem more effectively. In particular, we slightly change our point of view on how the learner acts in the environment by assuming that the learner picks its actions from a stochastic policy πt fixed at the beginning of the episode. Using this formalism enables us to prove the key result of this chapter that allows for a simple decomposition of the learning problem into smaller individual learning problems. Let ³ ´ ) (t ) (t ) (t ) (t ) (t ) (t ) (t ) (t ) (t ) (t ) (t ) (t ) ut = x(t 0 , a0 , r t (x0 , a0 ), x1 , a1 , r t (x1 , a1 ), . . . , xL−1 , aL−1 , r t (xL−1 , aL−1 ), xL denote the information that becomes available to the learning algorithm by the end of episode 27 t . The information available to the algorithm at time l of episode t is the information available from previous episodes, def Ut −1 = (u1 , u2 , . . . 
, ut −1 ) , (with the understanding that U0 is the empty sequence) and also the information from the current episode: ´ ³ ) (t ) (t ) (t ) (t ) (t ) (t ) (t ) (t ) (t ) (t ) (t ) ut (l − 1) = x(t , a , r (x , a ), x , a , r (x , a ), . . . , x , a , r (x , a ) . t 0 t 1 0 0 0 1 1 1 l −1 l −1 t l −1 l −1 As mentioned above, we shall only deal with algorithms that for each episode follow a policy they come up with just before the episode starts. This means that before episode t begins all our algorithms will compute the policy πt to be followed in that episode based on Ut −1 . Thus, for any t ≥ 1 and 0 ≤ l ≤ L − 1, (x, a) ∈ X × A, ¯ ¯ h i h i ¯ (t ) ¯ (t ) ) (t ) P a(t = a x = x, U , u (l − 1) = P a = a x = x, U , = πt (a|x) . ¯ ¯ t −1 t t −1 l l l l This is not restrictive since for large t the information “thrown away” is negligible. On the other hand, if r t is chosen such that its components are correlated (e.g., r t (x 1 , a 1 ) = r t (x 2 , a 2 ) for all t for some (x 1 , a 1 ), (x 2 , a 2 ) ∈ X × A, (x 1 , a 1 ) 6= (x 2 , a 2 )) then we certainly may lose a lot with this assumption, as the “effective” size of the state space becomes smaller. However, such a situation should rather be modelled in the known dynamics part by reducing the state space accordingly, or model state similarity. In what follows, we will make use of the value functions corresponding to the reward functions r t . In particular, the value function v tπ and the action-value function q tπ are defined as the solution of the Bellman equations (2.2) with the role of r taken by r t . We also define the cumulative action-value, Q tπ : X × A → R, and the cumulative value function, Vtπ : X → R by the respective equations, Q tπ = t X q sπ and Vtπ = s=1 t X v sπ . s=1 With the help of value functions, the total expected reward, a.k.a. the total return of policy π for a period of T > 0 episodes can be written as RbTπ = T X t =1 v tπ (x 0 ) = VTπ (x 0 ). Thus, the total expected reward of the best policy in hindsight is given by R T∗ = sup T X π t =1 v tπ (x 0 ) = sup VTπ (x 0 ). π Note that by the linearity of the Bellman equations, VTπ corresponds to the value function comP puted with reward function Tt=1 r t . Thus, the finding the above optimal value is equivalent to finding the optimal policy in a traditional SSP. As noted in Section 2.3.1, there always exists a deterministic policy attaining this supremum. In what follows, we will simply refer to 28 π∗ def any deterministic policy satisfying VT T (x 0 ) = VT∗ (x 0 ) = maxπ VTπ (x 0 ) as an optimal policy.1 The def occupation measure generated by π∗T will be denoted by µ∗T = µπ∗T . To further simplify presentation, we introduce the following abbreviations: π vt = v t t , π qt = q t t , Vt = t X s=1 vt , Qt = t X qt , µt = µπt . s=1 With this notation, the total expected reward in the first T episodes becomes PT t =1 E [vt (x 0 )] = E [VT (x 0 )], and the regret can be written as b T = R ∗ − Rbπ = V ∗ (x 0 ) − E [VT (x 0 )] . L T T T The following performance difference lemma is the key to our main results. Note that a similar argument is used by Even-Dar et al. [33] to prove their main result about online learning in unichain MDPs in the full-information case (cf. Lemma 4.1). The benefit of this lemma is that the problem of bounding the regret is essentially reduced to the problem of bounding the difference between action-values of the policy followed by the agent. 
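Since the algorithms of this chapter repeatedly solve the Bellman equations (2.2) for the current policy and the current reward function, the computation can be made concrete with the following hedged sketch: $(v^\pi, q^\pi)$ are obtained by backward induction over the layers of the loop-free SSP. The argument names and array shapes (`layers`, `P`, `r`, `pi`) are illustrative and not taken from the thesis.

```python
import numpy as np

# Hedged sketch: solve the Bellman equations (2.2) of a loop-free SSP by
# backward induction over the layers. Illustrative shapes:
#   layers   - list of lists of state indices, layers[0] = [x_0], layers[-1] = [x_L]
#   P[x,a,y] = P(y|x,a),  r[x,a] in [0,1],  pi[x,a] = pi(a|x)
def policy_evaluation(layers, P, r, pi):
    n_states, n_actions = r.shape
    v = np.zeros(n_states)                  # v(x_L) = 0 by convention
    q = np.zeros((n_states, n_actions))
    for layer in reversed(layers[:-1]):     # layers L-1, ..., 0
        for x in layer:
            q[x] = r[x] + P[x] @ v          # q(x,a) = r(x,a) + sum_y P(y|x,a) v(y)
            v[x] = pi[x] @ q[x]             # v(x)   = sum_a pi(a|x) q(x,a)
    return v, q
```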
Similar statements also appeared much earlier in the control literature: The book by Cao [21], which puts performance difference statements in the center of the theory of MDPs, reviews many of them. Lemma 3.1 (Performance difference lemma for SSPs). For any deterministic policy π, arbitrary policy π̂ and any t ≥ 1, v tπ (x 0 ) − v tπ̂ (x 0 ) = L−1 X X l =0 x∈Xl ³ ´ µπ (x) q tπ̂ (x, π(x)) − v tπ̂ (x) . (3.1) In particular, for any T > 0, VT∗ (x 0 ) − VT (x 0 ) = L−1 X X l =0 x∈Xl ³ ´ µ∗T (x) QT (x, π∗T (x)) − VT (x) , where µ∗T is the occupation measure underlying the best policy π∗T in hindsight. Proof. The second part of the lemma follows from applying (3.1) to π = π∗T and π̂ = πt for t = 1, . . . , T , and summing up the resulting equalities. Hence, it remains to prove (3.1). Repeatedly 1 The existence of this maximum is a standard result of reinforcement learning, see [86]. 29 using the Bellman equations and reordering, we get v tπ (x 0 ) − v tπ̂ (x 0 ) = v tπ (x 0 ) − q tπ̂ (x 0 , π(x 0 )) + q tπ̂ (x 0 , π(x 0 )) − v tπ̂ (x 0 ) ³ ´ ³ ´ X = q tπ̂ (x 0 , π(x 0 )) − v tπ̂ (x 0 ) + P (x 1 |x 0 , π(x 0 )) v tπ (x 1 ) − v tπ̂ (x 1 ) x 1 ∈X1 ³ ´ ³ ´ X = µπ (x 0 ) q tπ̂ (x 0 , π(x 0 )) − v tπ̂ (x 0 ) + µπ (x 1 ) v tπ (x 1 ) − v tπ̂ (x 1 ) x 1 ∈X1 = µπ (x 0 ) ³ q tπ̂ (x 0 , π(x 0 )) − v tπ̂ (x 0 ) à + X x 1 ∈X1 = ··· = L−1 X µπ (x 1 ) X l =0 x l ∈Xl ³ ´ q tπ̂ (x 1 , π(x 1 )) − v tπ̂ (x 1 ) ´ + X x 2 ∈X2 P (x 2 |x 1 , π(x 1 ))µπ (x 1 ) ³ v tπ (x 2 ) − v tπ̂ (x 2 ) ´ ! ³ ´ µπ (x l ) q tπ̂ (x l , π(x l )) − v tπ̂ (x l ) , which proves the statement. Remark 3.1. Note that the lemma can be generalized easily to allow π to be a non-deterministic policy. 3.3 Full Information O-SSP In this short section we give an algorithm for the full-information case and a proof that bounds the algorithm’s regret, the purpose being to fix some ideas that will be useful later on, as well as to obtain a baseline result for loop-free SSPs. The algorithm presented in this section is a straightforward adaptation of the MDP-E algorithm of Even-Dar et al. [33] to our setting. The idea of the algorithm, shown as Algorithm 3.1, is to place an instance e(x) of a fullinformation experts algorithm E into each state x ∈ X, initialized with the action set A and tuned to maximum reward L − l x . Then, the policy to be followed in episode t is just πt , where πt (·|x) is the distribution returned by e(x) in episode t , while at the end of each episode e(x) is fed with the action-values q t (x, ·) as “rewards”. Based on Lemma 3.1, we immediately obtain a performance bound for Algorithm 3.1: Proposition 3.1. Let E be a full-information experts algorithm with a regret bound BE against the class of oblivious adversaries and tuned for rewards in [0, 1]. Then the regret of Algorithm 3.1 against an oblivious adversary satisfies bT ≤ L L(L + 1) BE (T, A) . 2 The above bound also holds against a non-oblivous adversary if the distribution selected by E in a round is fully determined by the previous reward functions observed by the algorithm. Proof. We prove the statement for the case of an oblivous adversary. The statement for nonoblivious adversaries follows directly from Lemma 4.1 of Cesa-Bianchi and Lugosi [25]. Note 30 Algorithm 3.1 Algorithm for the full-information O-SSP. 1. For all states x ∈ X, initialize e(x), an instance of the full-information experts algorithm E with action set A, tuned to maximum reward L − l x . 2. For t = 1, 2, . . . 
, T , repeat: (a) For all x ∈ X, let πt (·|x) denote the distribution chosen by e(x) at episode t . ) (b) Traverse a path ut following the policy πt , that is, use action a(t ∼ πt (·|x) upon lx reaching state x. (c) Observe the reward function r t . π (d) Compute q t = q t t from applying the Bellman equations (2.2) to πt and r t . (e) For all states x ∈ X, feed e(x) with the function q t (x, ·). that πt is a function of the previous deterministic rewards r 1 , . . . , r t −1 . In particular, πt is a deterministic function of these rewards. As a result, q t (x, ·) = qt (x, ·), the reward fed to e(x) is a deterministic sequence: q t (x, ·) = F t (r 1 , . . . , r t ) for some functions F t , t = 1, 2, . . .. As a consequence of this, VT and QT are also non-random. This should be kept in mind, though we will continue to use the boldface characters to refer to these quantities to avoid further clutter. b T = V ∗ (x 0 ) − VT (x 0 ). By Lemma 3.1, this latter Because VT is non-random, we have that L T expression can be written as VT∗ (x 0 ) − VT (x 0 ) = L−1 X X l =0 x∈Xl n o µ∗T (x) QT (x, π∗T (x)) − VT (x) . P Now, QT (x, π∗T (x)) − VT (x) ≤ maxa {QT (x, a) − VT (x)}. Here, QT (x, a) = Tt=1 qt (x, a) is the total P P “reward” for action a from the point of view of e(x), while VT (x) = Tt=1 a πt (a|x)qt (x, a) is the total expected reward incurred by e(x). In other words, QT (x, a) − VT (x) is the expected regret against action a. By our previous remark, the sequence qt (x, ·) = q t (x, ·) is a deterministic sequence. Since the regret bound BE is assumed to hold when E is used on an arbitrary deterministic reward sequence in [0, L − l x ], it must also hold when E is fed with sequence q t (x, ·). Hence, QT (x, a) − VT (x) ≤ (L − l x )BE (T, A). Finally, the desired bound is obtained by simple algebra. Remark 3.2. Applying EWA with (time-horizon dependent) optimized parameters as the experts algorithm E , the above bound becomes (even for non-oblivious adversaries)2 b T ≤ L(L + 1) L 2 2 See Theorem 2.2 in Cesa-Bianchi and Lugosi [25]. 31 s T ln |A| . 2 3.4 Bandit O-SSP using black-box algorithms In the bandit case, the rewards are observed only along the paths that the agent traverses. We start with an algorithm along the lines of the algorithm constructed in the previous section: the algorithm uses a bandit experts algorithm in each state. We show two complementary regret bounds for this algorithm that scale proportionally with the regret of the underlying bandit algorithm (Theorems 3.1 and 3.2). We also consider the case when the goal is to compete with sequences of policies, i.e., the case of “tracking” the best policy in hindsight (Theorem 3.3). Next, we study the case when the Exp3 algorithm of Auer et al. [10] is used as the base bandit algorithm. By exploiting the special properties of this algorithm, we show some improved regret bounds (Theorems 3.4 and 3.5). We also propose a variant of our algorithm that is also applicable beyond the scope of the results obtained previously, and prove an O(T 2/3 ) bound on its regret (Theorem 3.6). We finish with a regret bound that holds with high-probability (Theorem 3.7). The algorithms discussed in this section are direct descendants of the algorithm of Section 3.3. Similarly to Algorithm 3.1, they place a separate experts algorithm in each state x. However, in these case, a bandit algorithm B is needed. 
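The shared skeleton behind these constructions can be sketched as follows (a hedged illustration, not the thesis' pseudocode): one learner object per state is queried for π_t(·|x) at the start of each episode and is afterwards fed a vector of (possibly estimated) action values; the exponential-weights update shown is just one concrete full-information choice, and all names are hypothetical.

```python
import numpy as np

# Hedged sketch of the per-state construction shared by Algorithms 3.1 and 3.2:
# one no-regret learner per state, queried for pi_t(.|x) before the episode and
# fed q_t(x, .) (full information) or an estimate of it (bandit) afterwards.
class PerStateLearner:
    def __init__(self, n_actions, eta=0.1):
        self.w = np.ones(n_actions)
        self.eta = eta

    def policy(self):                 # the distribution pi_t(.|x) for this state
        return self.w / self.w.sum()

    def update(self, q_x):            # q_x is q_t(x, .) or an estimate of it
        self.w *= np.exp(self.eta * q_x)

n_states, n_actions = 6, 2
learners = [PerStateLearner(n_actions) for _ in range(n_states)]
pi_t = np.stack([b.policy() for b in learners])   # pi_t[x] = pi_t(.|x)
```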
The algorithms below will use the following estimate of $q_t$:
$$\hat{q}_t(x, a) = \begin{cases} \dfrac{\sum_{k=l_x}^{L-1} r_t\bigl(x^{(t)}_k, a^{(t)}_k\bigr)}{\pi_t(a|x)\,\mu_t(x)}, & \text{if } (x, a) = \bigl(x^{(t)}_{l_x}, a^{(t)}_{l_x}\bigr); \\[1mm] 0, & \text{otherwise.} \end{cases} \qquad (3.2)$$
Assume that the denominators in the definition of $\hat{q}_t$ are non-zero with probability one. In this case, it follows easily that $\hat{q}_t$ is a conditionally unbiased estimate of $q_t$ given $U_{t-1}$, i.e.,
$$\mathbb{E}\bigl[\hat{q}_t(x, a)\,\big|\,U_{t-1}\bigr] = q_t(x, a). \qquad (3.3)$$
Note that the estimates $\hat{q}_t$ can be computed at the end of episode t only, as the estimates need all the rewards up to the end of the episode. To tune the instances of the bandit algorithm B to be used in each state correctly, we have to suppose that Assumption S2 holds with α > 0:
$$\alpha = \min_{x \in X} \alpha(x) = \min_{x \in X} \min_\pi \mu^\pi(x) > 0.$$
A version of B tuned to work with maximum reward $(L - l_x)/\alpha(x)$ is used at state x. As we will see later, the quantity α plays an important role in describing the problem structure, since if α > 0, any algorithm automatically visits each state with a minimum (positive) frequency, so no resources have to be spent on exploring the state space. In particular, the algorithm below works only for α > 0, and a different algorithm is needed for α = 0 (see Algorithm 3.5).

3.4.1 Expected regret against stationary policies

Our bandit O-SSP algorithm for competing with the best stationary policy is shown as Algorithm 3.2.

Algorithm 3.2 Algorithm for the bandit O-SSP.
1. For all states x ∈ X, initialize b(x), an instance of the bandit algorithm B with action set A, tuned to maximum reward $(L - l_x)/\alpha(x)$.
2. For t = 1, 2, . . . , T, repeat:
(a) For all x ∈ X: let $\pi_t(\cdot|x)$ denote the distribution chosen by b(x) at episode t and choose $\tilde{a}^{(t)}(x)$ (independently) according to $\pi_t(\cdot|x)$.
(b) Follow the policy $\pi_t$ by using $\tilde{a}^{(t)}(x)$ upon reaching state x to obtain the path $u_t$.
(c) Observe the rewards $r_t\bigl(x^{(t)}_0, a^{(t)}_0\bigr), \ldots, r_t\bigl(x^{(t)}_{L-1}, a^{(t)}_{L-1}\bigr)$.
(d) Compute $\mu_t(x)$ for all x ∈ X using (2.3) recursively, and construct the estimates $\hat{q}_t$, based on $u_t$, using Equation (3.2).
(e) For all states x ∈ X, feed b(x) with the reward $\pi_t\bigl(\tilde{a}^{(t)}(x)\,\big|\,x\bigr)\,\hat{q}_t\bigl(x, \tilde{a}^{(t)}(x)\bigr)$.

The next lemma shows the relation between the regret of Algorithm 3.2 and that of the bandit algorithm B. As in the proof of Proposition 3.1, the plan is to reduce the regret analysis of the algorithm to bounding the regret of the individual bandit algorithms b(x) via the help of Lemma 3.1. To be able to do this, we need well-defined bandit problems for each x ∈ X. The difficulty here is that in reality the reward corresponding to a state-action pair (x, a) ∈ X × A is only defined if the agent's actual path $u_t$ contains (x, a). The bandit problem b(x) faces is well defined if one can specify reward functions $r^B_t : X \times A \to \mathbb{R}$ for each episode t = 1, 2, . . . , T such that the reward fed to b(x) in episode t can be shown to be equal to $r^B_t(x, \tilde{a}^{(t)}(x))$. The next technical lemma shows that it is possible to define such reward functions. In essence, for each state-action pair (x, a) ∈ X × A and each time instant t we define a random variable whose value equals the reward the bandit algorithm b(x) would have received had it chosen action a in state x (and, as a prerequisite, the agent had got to state x in the actual episode). More specifically, $r^B_t(x, a)$ follows the distribution of the cumulative reward obtained along a trajectory generated by policy $\pi_t$ (and the dynamics of the MDP) that starts in state x with action a.
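Before the formal statement of this lemma, the computational part of Algorithm 3.2 (steps (c)-(e)) can be made concrete with the following hedged sketch: the occupation measure $\mu_t$ comes from the forward recursion (2.3) and the estimate (3.2) is built from the observed path. Names and array shapes follow the earlier sketches and are not part of the thesis.

```python
import numpy as np

# Hedged sketch of steps (c)-(e) of Algorithm 3.2. Illustrative shapes:
# P[x,a,y] = P(y|x,a), pi_t[x,a] = pi_t(a|x), layers as in Section 2.3.1.
def occupation_measure(layers, P, pi_t):
    """Forward recursion (2.3): mu_t(x) for every state x."""
    mu = np.zeros(P.shape[0])
    mu[layers[0][0]] = 1.0                                # mu(x_0) = 1
    for l in range(len(layers) - 1):
        for x in layers[l]:
            for y in layers[l + 1]:
                mu[y] += mu[x] * (pi_t[x] @ P[x, :, y])   # sum_a pi(a|x) P(y|x,a)
    return mu

def estimate_q(path, path_rewards, pi_t, mu, n_states, n_actions):
    """Importance-weighted estimate (3.2), path = [(x_0,a_0), ..., (x_{L-1},a_{L-1})]."""
    q_hat = np.zeros((n_states, n_actions))
    for k, (x, a) in enumerate(path):
        # reward-to-go from layer l_x, divided by the probability of observing (x, a)
        q_hat[x, a] = sum(path_rewards[k:]) / (pi_t[x, a] * mu[x])
    return q_hat
```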
While conceptually straightforward, the proof of the lemma contains some complicated notation to correctly capture the necessary dependence among the reward functions, and therefore it is delegated to the end of this chapter, Section 3.7. Lemma 3.2. Assume α > 0. Fix T > 0 and consider an arbitrary reward sequence r 1 , . . . , r T : X × A → [0, 1].hThen there exists a sequence of reward functions rBt : X × A → R, t = 1, . . . , T such i B (t ) x that rBt (x, a) ∈ 0, L−l α(x) for all (x, a) ∈ X × A, b(x) receives reward rt (x, ã (x)) in episode t , £ ¤ £ ¤ E qt (x, a) − vt (x) = E rBt (x, a) − rBt (x, ã(t ) (x)) , 33 (3.4) rBt (·, ·) and ã(t ) (·) are independent given Ut −1 , and for each x ∈ X, rBt (x, ·) is selected by a suitable adaptive adversary interacting with the bandit algorithm b(x). Remark 3.3. Note that the adversaries in the different states are not independent, since their selected reward functions rBt (x, ·) have to satisfy (3.4) simultaneously for all x ∈ X. Using the reward functions defined in the above lemma, it is not hard to analyze the performance of Algorithm 3.2, leading to our first regret bound for the bandit case: Theorem 3.1. Assume α > 0. Fix T > 0 and consider an arbitrary reward sequence r 1 , . . . , r T : X × A → [0, 1]. Let B be a multi-armed bandit algorithm that enjoys a regret bound BB against the class of adaptive adversaries that generate rewards in the [0, 1] interval. Then, the regret of Algorithm 3.2 satisfies L(L + 1) BB (T, A). 2α Remark 3.4. One has α > 0 if, for example, bT ≤ L min min min min P (x 0 |x, a) > 0. 0≤l ≤L−1 x∈Xl a∈A x 0 ∈Xl +1 In fact, the assumption of α > 0 is closely related to assuming unichain MDPs without transient states, as both guarantee a minimal probability for visiting a state. Remark 3.5. The regret bound above is clearly loose in some cases, such as the case when we face L consecutive, independent bandit problems in each round (i.e., Xl = {x l } and P (x l +1 |x l , a) = 1 for all actions a and all l = 1, 2, . . . , L − 1) In this case, a reasonable bound would scale linearly with L, not quadratically as the above bound. However, when actions influence future rewards, one may have a heavier dependence on L. Remark 3.6. If one uses the algorithm of Audibert and Bubeck [4] with appropriate parameters as the base bandit algorithm B , the above bound gives bT ≤ L 15L(L + 1) p T |A|. 2α Proof. Proof of Theorem 3.1. As in the proof of Proposition 3.1, the plan is to reduce the problem to bounding the regret of the individual bandit algorithms b(x) via the help of Lemma 3.1, where the bandit problems at each state will be defined via Lemma 3.2. By Lemma 3.1, the expected regret of the algorithm can be written as bT = L L−1 X X l =0 x∈Xl £ ¤ µ∗T (x)E QT (x, π∗T (x)) − VT (x) . (3.5) Thus, it suffices to bound the individual terms of the sum on the right hand side. To do this fix some x ∈ Xl and consider T £ ¤ X £ ¤ E QT (x, π∗T (x)) − VT (x) = E qt (x, π∗T (x)) − vt (x) . t =1 34 (3.6) Now let rBt (x, ·), t = 1, . . . , T defined by Lemma 3.2. By the lemma, and especially by (3.4), the difference in (3.6) is the expected regret of algorithm b(x) against action a = π∗T (x) and a nonoblivous adversary that returns the reward function rBt (x, ·) in round t . 
Thus, thanks to 0 ≤ rBt (x, ·) ≤ L−l x α(x) , the assumption that B enjoys regret BB against the class of adaptive adversaries C that choose rewards in [0, 1] and that b(x) is tuned to the maximum reward L−l x α(x) , we get T T X ¤ L − lx £ £ ¤ X E rBt (x, a) − rBt (x, ã(t ) (x)) ≤ E qt (x, π∗T (x)) − vt (x) = BB (T, A) . α(x) t =1 t =1 (3.7) Combining (3.5), (3.6), and (3.7), and using α ≤ α(x) we get bT ≤ L L−1 X X l =0 x∈Xl µ∗T (x) L −l L(L + 1) BB (T, A) ≤ BB (T, A) , α(x) 2α which finishes the proof. According to this theorem, as long as 1/α is “reasonable”, the size of the state space, |X|, does not play a role in the regret. However, oftentimes one finds that 1/α can be exponentially large (even when α is positive) as a function of the size of X. In the following theorem, the factor (L + 1)/2α is traded for a factor κ|X|, where κ = max x∈X β(x) , α(x) β(x) = max µπ (x), π x ∈ X. (3.8) When β(x) scales proportionally to α(x) (if a state is rare to be visited under some policy then it is rare to be visited under all policies) then κ|X| can be smaller than (L + 1)/(2α). This is the case, for example, in the grid-world example considered in the simulations in Section 3.6. Theorem 3.2. Let B be a multi-armed bandit algorithm that enjoys a regret bound BB against the class of adaptive adversaries that generate rewards in the [0, 1] interval. Assume α > 0 and let κ be defined by (3.8). Then the regret of Algorithm 3.2 satisfies bT ≤ κ L ©PL−1 l =0 ª (L − l )|Xl | BB (T, A) ≤ κ |X| L BB (T, A) . Proof. Following the proof of Theorem 3.1 we obtain, for any 0 ≤ l ≤ L − 1, £ ¤ X ∗ L −l µ∗T (x)E QT (x, π∗T (x)) − VT (x) ≤ µT (x) BB (T, A) α(x) x∈Xl x∈Xl X ≤ X x∈Xl β(x) L −l BB (T, A) α(x) ≤ |Xl |κ(L − l )BB (T, A). Summing up for all l , we get b T ≤ κBB (T, A) L L−1 X (L − l )|Xl | ≤ κBB (T, A) L l =0 L−1 X l =0 35 |Xl | ≤ κBB (T, A) L |X|. Remark 3.7. The first bound in the theorem reveals that the cardinality of the layers for smaller indices has a bigger impact on the regret bound than the cardinality of layers with larger indices. The explanation, as it can be read out from the proof, is that for earlier layers, the range of the total reward-to-go is larger. 3.4.2 Tracking So far the regret of our algorithm was measured relative to the single best policy in hindsight. However, depending on the environment, the performance of the single best policy in hindsight might not be satisfactory, for example, if the environment “drifts”. A natural extension then is to design algorithms that can compete with the performance of any sequence of policies. This is called the tracking problem in the online learning literature (cf. Section 5.2 in [25]). To formalize the problem let π1:T = (π1 , π2 , . . . , πT ) be some sequence of policies. IntroducP π π ing VT 1:T = Tt=1 v t t , the total expected reward of the policy sequence π1:T can be written as π VT 1:T (x 0 ). The question in the case of tracking is how does the total expected reward of a learning algorithm fare compared to the above total reward, that is, we are interested in the expected tracking regret b T (π1:T ) = V π1:T (x 0 ) − E [VT (x 0 )] . L T (3.9) If the policies in the sequence change frequently, we expect a potentially larger gap: In particular, the regret compared to π1:T must clearly be linear if in every episode t , πt is the optimal policy for the reward function r t . However, if a good sequence exists that does not change frequently, a smaller regret must be possible. 
Thus, similarly to standard related results in online learning, we expect to be able to design algorithms whose regret scales with the complexity, C (π1:T ) = 1 + |{t : πt 6= πt +1 , 1 ≤ t ≤ T − 1}| of the competitor sequence π1:T . In particular, if the C = C (π1:T ) switches in π1:T occur evenly distributed in time, one can spend T /C episodes for learning about each policy πt . The rep gret suffered during T /C episodes can be bounded by O( T /C ), hence altogether the regret p suffered during the T = C (T /C ) episodes will be bounded by O( T C ). Thus, this quick, backp of-the-envelope calculation shows that the best we can expect is a regret of size O( T C ) against a sequence π1:T with complexity C = C (π1:T ). Again, the idea is to employ one bandit algorithm in each state x ∈ X. However, in this case, the bandit algorithms should be tuned to achieve a good tracking regret in the underlying (stateless) prediction problems. In fact, we will need bandit algorithms that achieve a sublinear strongly uniform regret bound BBT (T, A), when used with an action set A, where, following Hazan and Seshadhri [54, 55], the concept of strongly uniform regret is defined as the maximal regret of the algorithm over any contiguous time interval against a constant policy:3 a bandit 3 Note that Hazan and Seshadhri [54, 55] used the term adaptive regret instead of strongly uniform regret. 36 algorithm predicting the action at ∈ A in time step t is said to achieve the strongly uniform regret bound BBT on time horizon T > 0 if ( max 1≤t ≤t 0 ≤T " max E a∈A t0 X # " rs (a) − E s=t t0 X #) rs (as ) ≤ BBT (T, A), (3.10) s=t where rt : A → [0, 1] is the reward function for time step t . Several algorithms are known for the full-information case with vanishing tracking regret under various conditions and with different reward functions, see, for example, Willems [102], Herbster and Warmuth [57], Shamir and Merhav [90], Vovk [98], György et al. [45]. These methods can be extended to the bandit case as well, see, e.g., Auer et al. [10]. Further, these same methods can be shown to enjoy a sublinear strongly uniform regret bound BBT [54, 55, 46]. Similarly to Definition 2.1, we say that BBT is a strongly uniform regret bound for algorithm E against a class of adversaries C if the conditions of the definition are satisfied with the strongly uniform regret (3.10) in place of the regret in (2.1) . Later in this chapter, in Theorem 3.5, we will show that the Exp3.S algorithm of Auer et al. [10] enjoys a near-optimal strongly uniform regret bound, as given by (3.24) for optimal parameters when setting L − l x = 1 and α(x) = 1. BBT (T, A) = 2 T |A|(ln(T |A|) + 1)(e − 1), t ≥ 1. p (3.11) The following theorem shows that with a bandit algorithm that enjoys a good strongly uniform regret, Algorithm 3.2 achieves good tracking performance: Theorem 3.3. Assume that in Algorithm 3.2 a bandit algorithm B = BT is used that enjoys a strongly uniform regret bound BBT against the class of non-oblivious adversaries generating rewards in [0, 1]. Then the regret of this algorithm relative to any fixed sequence of policies π1:T can be bounded as ½ b T (π1:T ) ≤ C (π1:T ) min L ¾ L(L + 1) , κL|X| BBT (T, A). 2α Remark 3.8. In particular, if the Exp3.S algorithm of Auer et al. [10] is used with appropriate parameters, (3.11) gives the bound ¾p ½ L(L + 1) b , 2κL|X| T |A| (ln(T |A|) + 1) (e − 1) LT (π1:T ) ≤ C (π1:T ) min α on the tracking regret relative to π1:T . 
The bound depends on C and T on the typical almost optimal order, see, e.g., [57] for lower bounds, and [46] for an overview of related results. Proof. Fix any sequence π1:T = (π1 , . . . , πT ) of policies, and let t 1 = 1 < t 2 < t 3 < · · · < tC < tC +1 = T + 1 denote the change points of policy π1:T , that is πtc = πtc +1 = · · · = πtc+1 −1 for c = 1, . . . ,C . 37 Following the proof of Theorem 3.1 we obtain, for any l , T X X x∈Xl t =1 = ³ £ ´ ¤ µπt (x) E qt (x, πt (x)) − E [vt (x)] C X X x∈Xl c=1 C X X ³ £ ¤ £ ¤´ µπtc (x) E Qtc :tc+1 −1 (x, πt (x)) − E Vtc :tc+1 −1 (x) L −l BBT (T, A) α(x) x∈Xl c=1 ½ ¾ C X L −l ≤ min , Lκ|Xl | BBT (T, A), α c=1 ≤ µπtc (x) where the first inequality follows since BBT is a strongly uniform regret bound for BT , while the second inequality can be obtained by bounding µπt as in Theorems 3.1 and 3.2. Summing up for all l finishes the proof. 3.5 Bandit O-SSP using Exp3 In this section we demonstrate that by “opening up” the bandit algorithms that are used at the individual states, and in particular, by tuning the bandit algorithms in a state-dependent manner, better regret bounds can be achieved than showed in the previous section. In this section we will use the Exp3 algorithm of Auer et al. [10] (as described in Section 6.8 of CesaBianchi and Lugosi [25]), as it is probably the most well-known and simple bandit algorithm to date. Furthermore, we employ a more sophisticated estimate of qt , defined as follows: first, define r̂t (x, a) = r t (x,a) πt (a|x)µt (x) , 0, ³ ´ if (x, a) = xl(t ) , al(t ) ; x x (3.12) otherwise. Then define q̂t and v̂t as the solution to the Bellman-equations q̂t (x, a) = r̂t (x, a) + X P (x 0 |x, a)v̂t (x 0 ), (x, a) ∈ X × A; x0 v̂t (x) = X πt (a|x)q̂t (x, a), a x ∈ X \ {x L }; (3.13) v̂t (x L ) = 0. If we assume πt (a|x)µt (x) > 0 for all (x, a) ∈ X × A, we have E [ r̂t (x, a)| Ut −1 ] = rt (x, a) and E [ πt (a|x)| Ut −1 ] = πt (a|x), ¯ £ ¤ E q̂t (x, a)¯ Ut −1 = qt (x, a) (3.14) follows from the linearity of the Bellman-equations. In contrast to the estimates defined in Equation (3.2), these estimates use the information made available about the reward function r t for updating the policy in all the states, not only in the states contained in ut . Accordingly, 38 using the above estimates in Algorithm 3.2 leads to a much more robust performance4 . 3.5.1 Expected regret against stationary policies The method for regret minimization against the set of all stationary policies is given as Algorithm 3.3. The next theorem gives a performance bound on the algorithm. Algorithm 3.3 Algorithm for the bandit O-SSP based on Exp3 for α(x) > 0 for all x ∈ X. 1. Set γx ∈ (0, 1], η x > 0 for all x ∈ X, and w1 (x, a) = 1 for all (x, a) ∈ X × A. 2. For t = 1, 2, . . . , T , repeat: (a) For all (x, a) ∈ X × A let wt (x, a) γx πt (a|x) = (1 − γx ) P . + | A| w (x, b) b t (b) Obtain a path ut following the policy πt . (c) Compute µt (x) for all x ∈ X using (2.3) recursively, and construct estimates q̂t using (3.12) and (3.13) based on ut . (d) For all (x, a) ∈ X × A, set wt +1 (x, a) = wt (x, a)e η x q̂t (x,a) . Theorem 3.4. If Algorithm 3.3 is run with η x = then for all T |A| ln |A| ≥ maxx∈X (e−1)α(x) , 1 L−l x q α(x) ln |A| |A|(e−1)T and γx = q |A| ln |A| (e−1)α(x)T (x ∈ X), ¾p ½ L(L + 1) 0 b , 2κ |X|L T |A| ln |A|(e − 1), LT ≤ min p α β(x) . α(x) where κ0 = maxx∈X p Remark 3.9. 
The advantage of the above results over those of Theorem 3.1 and Theorem 3.2 is p p the better dependency on α and α(x) ( α instead of α, and α(x) in the definition of κ0 instead of α(x) in the definition of κ). £ ¤ b T = PL−1 Px∈X µ∗ (x)E QT (x, π∗ (x)) − VT (x) . Combining Proof. According to Lemma 3.1, L l T T l =0 £ ¤ this with the bounds on E QT (x, π∗T (x)) − VT (x) stated in Lemma 3.3 below and then finishing as in the proofs of Theorems 3.1 and Theorem 3.2 gives the result. Thus, it remains to prove the following result: 4 Note that all following results can be also proven for the estimates defined in Equation (3.2). Even though the proofs for the simpler estimates are somewhat more straightforward, we focus on these more refined estimates since they intuitively convey more information to the learner. Our experimental results presented in Section 3.6 support this intuition very spectacularly. 39 Lemma 3.3. Fix any x ∈ X. If Algorithm 3.3 is run with parameters γx ∈ (0, 1] and ηx ≤ α(x) γx (L − l x ) |A| (3.15) then, for any a ∈ A, E [QT (x, a) − VT (x)] ≤ γx (e −1)(L −l x )T + lnη|xA| . In particular, setting η x to its upper bound gives ½ ¾ |A| ln |A| E [QT (x, a) − VT (x)] ≤ (L − l x ) γx (e − 1)T + . γx α(x) Proof. Fix the state x ∈ X. We claim that by Lemma 3.7, for any b ∈ A, b T (x, b) − V b T (x) ≤ Q bT = where Q PT t =1 q̂t , M2 = L−l x α(x) . X ln |A| b T (x, a) + γx Q b T (x, b) , + (e − 2)η x M 2 Q ηx a (3.16) Indeed, the conditions of the lemma are satisfied (use q t (·) = q̂t (x, ·), πt (·) = πt (·|x), γ = γx , η = η x , A = A). Then, (3.46) is satisfied by the definition of πt (x, ·). To proceed, we need to provide an upper bound for q̂t (x, a) that holds for all a ∈ A. To this end, let l = l x and define for all x 0 ∈ Xk , (k > l ), x̃ ∈ Xl , ã ∈ A ¯ £ ¤ µt (x 0 |x̃, ã) = P xk = x 0 ¯ xl = x̃, al = ã, ut −1 , (3.17) where (x0 , a0 , x1 , a1 , . . . , xL ) is a random trajectory generated by πt , but is otherwise independent of ut and Ut −1 . Using this notation and the definition of q̂t and v̂t , we bound q̂t (x, a) as q̂t (x, a) = r̂t (x, a) + L−1 X X k=l +1 x 0 ∈Xk µt (x 0 |x, a) X πt (a 0 |x 0 )r̂t (x 0 , a 0 ) a0 0 ≤ L−1 X X µt (x |x, a)I{x 0 =x(tk ) } 1 + πt (a|x)µt (x) k=l +1 x 0 ∈Xk µt (x 0 ) ≤ L−1 X µt (xk(t ) |x, a) 1 + πt (a|x)µt (x) k=l +1 µt (x(t ) ) k Now notice that µt (x 0 ) = X X x̃∈Xl ã µt (x 0 |x̃, ã)µt (x̃)πt (ã|x̃) ≥ µt (x 0 |x, a)µt (x)πt (a|x) holds for all x 0 ∈ Xk , (k > l ), x ∈ Xl , a ∈ A. Using the above inequality for the case when µt (xk(t ) |x, a) > 0 gives µt (xk(t ) |x, a) µt (xk(t ) ) ≤ 1 , πt (a|x)µt (x) 40 thus we obtain an upper bound on q̂t (x, a) as q̂t (x, a) ≤ L −l (L − l )|A| ≤ πt (a|x)µt (x) γx α(x) (a ∈ A) , (3.18) γ γ x where in the last step we have used that µt (x) ≥ α(x) and πt (a|x) ≥ |A|(x) ≥ |Ax| for all (x, a) ∈ X × A. Furthermore, we have X πt (a|x)q̂t (x, a)2 = a X a L − lx X πt (a|x)q̂t (x, a) q̂t (x, a) ≤ q̂t (x, a) . | {z } α(x) a (3.19) x ≤ L−l α(x) Using (3.18) together with (3.15) gives (3.47), while by (3.19), we can choose M 2 = L−l x α(x) as re- quired. Now, taking expectations of both sides of (3.16), using (3.14) and Q T (x, a) ≤ (L − l x ) gives the desired result once η x is replaced with its upper bound from (3.15). 3.5.2 Tracking Next we consider how the results of the previous subsection can be improved for the tracking problem if we directly analyze the case when the tracking algorithm Exp3.S of Auer et al. [10] is plugged in Algorithm 3.2 as the bandit algorithm B . 
The resulting Algorithm 3.4 is given for completness. Algorithm 3.4 Algorithm for tracking the best dynamic policy in the bandit case. 1. Set γx ∈ (0, 1], η x > 0, δ > 0 and w1 (x, a) = 1 for all (x, a) ∈ X × A. 2. For t = 1, 2, . . . , T , repeat (a) Set γx wt (x, a) + πt (a|x) = (1 − γx ) P | A| b wt (x, b) for all (x, a) ∈ X × A. (b) Compute µt (x) for all x ∈ X using (2.3) recursively. (c) Draw a path ut randomly, according to the transition probabilities P and the policy πt . ³ ³ ´ ³ ´´ ) (t ) (t ) (t ) (d) Receive rewards r t (ut ) = r t x(t , a , . . . , r x , a . t 0 0 L−1 L−1 (e) Compute µt (x) for all x ∈ X using (2.3) recursively, and construct estimates q̂t using 3.12 and (3.13) based on ut . (f) For all (x, a) ∈ X × A, set wt +1 (x, a) = wt (x, a)e η x q̂t (x,a) + δ X wt (x, a). | A| a Theorem 3.5. Define α(x) = min µπ (x), β(x) = max µπ (x) π π 41 and set δ > 0, η x > 0, γx ≥ (L−l x )|A| α(x) η x for all x ∈ X. Assume that κ0 = maxx β(x) p α(x) < ∞. Then for any fixed sequence of policies π1:T = (π1 , . . . , πT ) the regret of Algorithm 3.4 satisfies ½ ¾ p L(L + 1) 0 b LT (π1:T ) ≤ C (π1:T ) min , 2κ |X|L 2 T |A| (ln(T |A|) + 1) (e − 1). p α Remark 3.10. This result improves Theorem 3.3 similarly to how Theorem 3.4 improves the black-box results of Theorem 3.1. Proof. We prove the statement by proving a bound on the strongly uniform regret defined as ½ ¾ max max E [Qt :t 0 (x, a)] − E [Vt :t 0 (x)] . 1≤t ≤t 0 ≤T x∈X a∈A To this end, fix an interval [r, s] ⊆ [1, T ] and consider def b s:s 0 = max V π 0 (x 0 ) − Vs:s 0 (x 0 ). L s:s π Fix a policy π. Applying Lemma 3.1 and summing for t = s, s + 1, . . . , s 0 we obtain π Vs:s 0 (x 0 ) − E [Vs:s 0 (x 0 )] = L−1 X X l =0 x l ∈Xl s0 X £ ¤ µ(x l ) E qt (x l , π(x l )) − vt (x l ) (3.20) t =s From now on, we follow the proof of Theorem 8.1 in Auer et al. [10]. Fix a state x ∈ X, and define P Wt (x) = a wt (x, a), 1 ≤ t ≤ T . Now, fix 1 ≤ t ≤ T − 1. Then, Wt +1 (x) X wt (x, a) η x q̂t (x,a) = e +δ. Wt (x) a Wt (x) We claim that η x q̂(x, a) ≤ 1. Indeed, thanks to the choice of η x and γx , η x ≤ αγx (L−l x )|A| holds, which, together with (3.18), implies η x q̂(x, a) ≤ 1. Now, the first term can be bounded identically to how the same expression was bounded in the proof of Theorem 3.4, giving (e − 2)η2x L − l x X Wt +1 (x) ηx ≤ 1+ v̂t (x) + q̂t (x, a) + δ . Wt (x) 1 − γx 1 − γx α(x) a Now let b s:s 0 (x) = V s0 X (3.21) v̂t (x) t =s Taking logarithms on both sides of (3.21) and summing over t = s, s + 1, . . . , s 0 , we get ln s0 X (e − 2)η2x L − l x X Ws (x) ηx b s:s 0 (x) + ≤ V q̂t (x, a) + δ(s 0 − s). Wr (x) 1 − γx 1 − γx α(x) t =s a Now let b = π(x). Since à ws (x, b) ≥ wr (x, b) exp η x s0 X ! q̂t (x, b) ≥ t =s 42 δ |A| à Wr exp η x s0 X t =s ! q̂t (x, b) (3.22) we have ln µ ¶ s0 X ws (x, b) δ Ws (x) ≥ ln ≥ ln + ηx q̂t (x, b). Wr (x) Wr (x) | A| t =s (3.23) Putting (3.22) and (3.23) together we get b s:s 0 (x) ≥ (1 − γx ) V s0 X s0 X ln (|A|/δ) L − lx X δ(s 0 − s) q̂t (x, b) − − (e − 2)η x q̂t (x, a) − ηx α(x) t =s a ηx t =s After taking expectations and using " E [Vs:s 0 (x)] ≥ (1 − γx )E s0 X P a Qs:s 0 (x, a) ≤ (L − l x )|A|(s # qt (x, π(x)) − (e − 2)η x t =s 0 − s) we get ln(|A|/δ) + δ(r − s) (L − l x )2 |A|(s 0 − s) − . α(x) ηx Using s 0 − s ≤ T , we get that E [Q s:s 0 (x, π(x)) − V Substituting δ = s:s 0 µ ¶ ln(|A|/δ) + δT L − lx (x)] ≤ + γx + η x (e − 2) |A| (L − l x )T. 
ηx α(x) 1 T, 1 ηx = L − lx and s γx = s (ln(T |A|) + 1)α(x) T (e − 1)|A| (ln(T |A|) + 1)|A| , T (e − 1)α(x) we obtain s E [Qs:s 0 (x, πt (x)) − Vs:s 0 (x)] ≤ 2(L − l x ) T |A|(ln(T |A|) + 1)(e − 1) . α(x) (3.24) Combining with (3.20) and using the arguments used for proving Theorem 3.3, we obtain the result. 3.5.3 The case of α = 0 In some problems the minimum state occupancy measure, α might be zero. In such cases the previous results are vacuous. We now show that by appropriately adjusting the previous algorithm, we can obtain a sublinear guarantee on the regret under the less restrictive assumption that there exists some policy πexp that visits each state with positive probability – in fact, all meaningful MDPs should satisfy this criterion with setting πexp to be, for example, the uniform policy. The algorithm below, which achieves this regret, is similar to Algorithm 3.3 in that the policy πt to be followed at time t is obtained by mixing the “Boltzmann policy”, π0t (a|x) = P wt (x, a)/ b wt (x, b) with πexp instead of the policy that explores all actions in all states with equal probability. However, the mixing happens at a global level: instead of deciding about whether to use πexp or π0t in each state independently of other states, a global randomized choice is made at the beginning of episode t between πexp and π0t . Provided that πexp is cho- 43 sen with probability γ, this ensures that the probability of visiting any state x ∈ X is at least γµπexp (x). Potentially, this is much larger than what we would get had we chosen between these policies at every state. Indeed, if zt ∈ {0, 1} is the indicator that πexp is followed in episode t (i.e., P [zt = 1|Ut −1 ] = γ) then i h ) µt (x) = P x(t = x|U t −1 lx h i h i ) (t ) + P x = x|z = 0, U P [zt = 0|Ut −1 ] = P x(t = x|z = 1, U P = 1|U ] [z t t −1 t t −1 t t −1 l l x x = γµπexp (x) + (1 − γ)µπ0t (x) (3.25) ≥ γµπexp (x) , whereas if one used the mixture policy πt (a|x) = γπexp (a|x) + (1 − γ)π0t (a|x) ((x, a) ∈ X × A), then from (2.3), by induction on l x , we would only get h i ) µ0t (x) = P x(t = x|U t −1 lx X X =γ µ0t (x 0 ) πexp (a 0 |x 0 )P (x|x 0 , a 0 )+ x 0 ∈Xl x −1 (1 − γ) ≥γ ≥γ a X x 0 ∈Xl x −1 X x 0 ∈Xl x −1 X x 0 ∈Xl x −1 µ0t (x 0 ) µ0t (x 0 ) X X a π0t (a 0 |x 0 )P (x|x 0 , a 0 ) πexp (a 0 |x 0 )P (x|x 0 , a 0 ) a0 (γl x −1 )µπexp (x 0 ) X πexp (a 0 |x 0 )P (x|x 0 , a 0 ) a0 = γl x µπexp (x) . Since the policy used in an episode is a mixture of stationary policies (and is not a stationary policy on its own), we need to modify the definitions of qt and vt . In particular, we let π0 πexp π0 πexp qt = (1 − γ)q t t + γq t vt = (1 − γ)v t t + γv t and use QT = PT t =1 qt and VT = PT t =1 vt (3.26) , as before. It can be easily verified that Lemma 3.1 stays of QT and VT as well. Introduce νt : X × A → [0, 1], νt (x, a) = h true for this definition i (t ) (t ) P xl = x, al = a|Ut −1 , and (similarly to the estimates defined in (3.12)) define x x r0t (x, a) = r t (x,a) , ³ ´ ) if (x, a) = xl(t ) , a(t ; l 0, otherwise. νt (x,a) x 44 x (3.27) Finally, define q0t and v0t as the solution to the Bellman-equations q0t (x, a) = r0t (x, a) + v0t (x) = X a X x0 P (x 0 |x, a)v0t (x 0 ), π0t (a|x)q0t (x, a), (x, a) ∈ X × A; x ∈ X \ {x L }; (3.28) v0t (x L ) = 0. £ ¤ £ ¤ Then, since by construction E r0t (x, a)|Ut −1 = r t (x, a) and E π0t (a|x)|Ut −1 = π0t (a|x) holds for £ ¤ π0 all (x, a) ∈ X × A, we also have E q0t (x, a)|Ut −1 = q t t (x, a). 
It is easy to see that the occupation measure νt can be efficiently calculated using νt (x, a) = γµπexp (x)πexp (a|x) + (1 − γ)µπ0t (x)π0t (a|x), (x, a) ∈ U . (3.29) The new algorithm is shown in Algorithm 3.5. Note that if the exploration policy πexp is the uniform policy πE (πE chooses each action with equal probability in each state, i.e., πE (a|x) = 1/|A| for all x, a) then the marginal distribution of actions in each state remains the same as if Exp3 was used (although the joint distributions are different, resulting in different histories and behaviors generated by the algorithm). Algorithm 3.5 Algorithm for the bandit O-SSP based on Exp3 using “global” exploration. 1. Set γ ∈ (0, 1], η x > 0 for all x ∈ X and w1 (x, a) = 1 for all (x, a) ∈ X × A. Initialize the exploration policy πexp and compute µπexp . 2. For t = 1, 2, . . . , T , repeat: (a) Draw a Bernoulli random variable zt ∈ {0, 1} with parameter γ. i. If zt = 1, then follow the exploration policy πexp throughout the episode. ii. If zt = 0, then follow the policy π0t given by wt (x, a) π0t (a|x) = P , b wt (x, b) (x, a) ∈ U . (b) Observe the path ut following the policy computed in the previous step. (c) Compute νt : X × A → [0, 1] using (3.29) (µπ0t can be computed from (2.3)) and then construct the estimates q0t via (3.27) and (3.28) by using the data ut . (d) For all (x, a) ∈ X × A, set wt +1 (x, a) = wt (x, a)e η x qt (x,a) . 0 The advantage of Algorithm 3.5 becomes apparent when α = minx∈X minπ µπ (x) = 0, that is, when certain policies do not visit some states: the next result shows that the algorithm achieves an O(T 2/3 ) regret even in this case. Theorem 3.6. Suppose that αexp = minx∈X µπexp (x) mina πexp (a|x) > 0. If Algorithm 3.5 is used 45 with parameters γ ∈ (0, 1] and any η x ≤ γαexp L−l x then µ ¶ η x (e − 2)(L − l x ) ln |A| E [QT (x, a) − VT (x)] ≤ γ + (L − l x )T + γαexp ηx for all x ∈ X. In particular, setting γ = q 3 r (e−2) ln |A| αexp T and η x = 2 3 s E [QT (x, a) − VT (x)] ≤ 3 T (L − l x ) 3 for all T ≥ ln |A| . αexp (e−2)2 1 L−l x 3 αexp ln2 |A| (e−2)T 2 yields (e − 2) ln |A| αexp Furthermore, if the latter condition on T holds for all x ∈ X, then 2 3 b T ≤ 3L(L + 1)T L 2 s 3 (e − 2) ln |A| . αexp Remark 3.11. (i) The theorem shows the advantage of allowing an arbitrary exploration policy πexp in Algorithm 3.5 that visits hardly accessible states with higher probability than, for example, the uniform policy πE ; that is, αexp may be substantially larger for a well-chosen exploration policy than for uniform exploration. (ii) Since γ does not depend on x, we cannot modify the theorem to obtain a bound where the constant is proportional to something similar to κ or κ0 as in Theorems 3.2 and 3.4. Proof. We will use the notations Q0T = PT 0 t =1 qt and V0T = PT 0 t =1 vt . First, let us fix an arbitrary x ∈ X. We will use a slightly modified version of Lemma 3.7 with A = A, π̃t = π0 (·|x), q t = q0t (x, ·), η = η x , γ = 0. Since r̂t (x, a) ≤ 1 γαexp and accordingly, q0t (x, a) ≤ L−l x γαexp holds for all (x, a) ∈ X × A, our condition on η x ensures that (3.47) holds. Since our estimates do not fulfill condition (3.48), we need to use X a π0t (a|x)(q0t (x, a))2 ≤ L − lx X 0 π (a|x)q0t (x, a) γαexp a t instead, where we have used the same upper bound for q0t as above. 
Applying this small modification, one can easily prove Q0T (x, b) − V0T (x) ≤ ln |A| η x (e − 2)(L − l x ) 0 + VT (x), ηx γαexp which becomes " E T ³ X t =1 π0 π0 q t t (x, b) − v t t (x) ´ # ≤ ln |A| η x (e − 2)(L − l x )2 + T ηx γαexp πexp after taking expectations of both sides. Using the definitions (3.26) and the trivial bound q t πexp v t (x) ≤ L − l x , we finally get µ ¶ ln |A| η x (e − 2)(L − l x ) E [QT (x, b) − VT (x)] ≤ + (L − l x ) T γ + ηx γαexp 46 (x, a)− as desired. The rest follows from applying Lemma 3.1. 3.5.4 A bound that holds with high probability In this section, we propose a method based on Exp3.P, as described in Section 6.8 of CesaBianchi and Lugosi [25] to control the (random) regret LT = VT∗ (x 0 ) − T L−1 X X t =1 l =0 r t (xl(t ) , al(t ) ) . The new method is a variation of Algorithm 3.3, where the weight update at (x, a) ∈ X × A uses the following slightly biased estimate of the rewards: r̃t (x, a) = r̂t (x, a) + ω , πt (a|x)µt (x) (3.30) where r̂t is as defined in (3.12) and the value of ω > 0 will be specified later. The estimates for q̃t are then defined as q̃t (x, a) = r̃t (x, a) + X P (x 0 |x, a)ṽt (x 0 ), (x, a) ∈ X × A; x0 ṽt (x) = X πt (a|x)q̃t (x, a), a x ∈ X \ {x L }; (3.31) ṽt (x L ) = 0. Introducing the notation ct (x, a) = X X X πt (a 0 |x 0 )µt (x 0 |x, a) 1 + µt (x)πt (a|x) k=l +1 x 0 ∈Xl a 0 µt (x 0 ) (3.32) for all (x, a) ∈ X × A, we can simply write q̃t as q̃t (x, a) = q̂t (x, a) + ωct (x, a). The main result of this section is the following theorem: Theorem 3.7. Let δ ∈ (0, 1) and consider Algorithm 3.3 where in line 2d q̃t (x, a), as defined in (3.28), is used in place of q̂t (x, a). Assume that the parameters of the algorithm are set to s γx = γ = 3|A| ln(3|X||A|2 ln T /δ) 1 , ηx = Tα L − lx s α ln(3|X||A| ln T /δ) , ω=2 T | A| s 3α ln(3|X||A|2 ln T /δ) , T |A| and T is big enough so that γx ≤ 1/3 and ω ≤ 1. Then, with probability at least 1 − δ, the regret, 47 LT , of the algorithm satisfies s LT ≤ 4L(L + 1) T |A| 3|X||A|2 log2 T L(L + 1) 3|X||A| log2 T ln + ln +L α δ 2α δ r 8T ln 3 . δ For a state-action pair (x, a) ∈ X × A, define e T (x, a) = Q T X q̃t (x, a) and e T (x) = V T X ṽt (x). t =1 t =1 We prove the statement through a series of lemmata. The plan is to use Lemma 3.1 to reduce the problem of bounding the regret to that of bounding the state-wise regrets, maxa QT (x, a) − VT (x). We will see that Lemma 3.7 allows one to upper bound maxa QT (x, a) − VT (x) in terms e T (x, a). Our first lemma shows that this last difference cannot be too large no of QT (x, a) − Q matter the choice of γx . This is in contrast to the difference between Q̂T and QT , that scales in the worst case. That the difference is upper bounded independently of the value with γ−1/2 x p of γx is crucial: It is this property which makes it possible to prove a T regret bound. The following Lemma, taken from Bartlett et al. [14], is the key element to our high probability bound: Lemma 3.4. Assume z1 , z2 , . . . , zT is a martingale difference sequence with |zt | ≤ b. Let σ2t = Var [ zt | z1 , z2 , . . . , zt −1 ] . Furthermore, let v u t uX Σt = t σ2t . s=1 Then the following holds for any δ < 1/e and T ≥ 4: " P T X t =1 # n op p ln(1/δ) ≤ δ log2 T. zt > 2 max 2ΣT , b ln(1/δ) Using this tool, we can prove the following result: Lemma 3.5. Fix some (x, a) ∈ X × A, x ∈ Xl and let δ0 = 3|X||Aδ| log T . Then, with probability at 2 least 1 − δ0 , µ ¶ 4 | A| 1 e T (x, a) ≤ QT (x, a) − Q + (L − l ) ln 0 . 
(3.33) ω γα δ δ , Furthermore, letting δ00 = 3|X| log 2T e T (x) − VT (x) ≤ V µ ¶ 2T ω(L − l )|A| 4 1 1 + + (L − l ) ln 00 α ω α δ holds with probability at least 1 − δ00 . 48 (3.34) Proof. Let us use Lemma 3.4 for zt = qt (x, a) − q̂t (x, a). First, we have h¡ i ¢2 ¯¯ q̂t (x, a) ¯ Ut −1 "à !2 ¯ # ¯ X X X ¯ 0 0 0 0 0 = E r̂t (x, a) + µt (x |x, a) πt (a |x )r̃t (x , a ) ¯ Ut −1 ¯ a0 k=l +1 x 0 ∈Xl " à !¯ # ¯ X X X 2 0 0 0 0 0 2 ¯ ≤ E (L − l ) r̂t (x, a) + µt (x |x, a) πt (a |x )r̂t (x , a ) ¯ Ut −1 ¯ a0 k=l +1 x 0 ∈Xl ! à X X X µt (x 0 |x, a)πt (a 0 |x 0 )r t (x 0 , a 0 ) r t (x, a) + ≤ (L − l ) µt (x)πt (a|x) k=l +1 x 0 ∈Xl a 0 µt (x 0 )πt (a 0 |x 0 ) ! à X X X µt (x 0 |x, a)πt (a 0 |x 0 ) 1 + ≤ (L − l ) µt (x)πt (a|x) k=l +1 x 0 ∈Xl a 0 µt (x 0 )πt (a 0 |x 0 ) σ2t ≤ E = (L − l )ct (x, a), P P where we have used ( Kk=1 a k )2 ≤ K Kk=1 a k2 and Jensen’s inequality in the third line, r̂t (x 0 , a 0 ) ≤ ¯ £ ¤ 1 0 0 ¯ 0 0 0 0 µt (x 0 )πt (a 0 |x 0 ) and E r̂t (x , a ) Ut −1 = r t (x , a ) in the fourth line, r t (x , a ) < 1 in the fifth line and the definition of ct (3.32) in the last line. Thus, v à ! u T T u X X (L − l ) 1 ΣT = t(L − l ) ct (x, a) ≤ ω0 ct (x, a) + 0 , 2 ω t =1 t =1 for some ω0 > 0 by the relationship between the arithmetic and the geometric means. Using (L−l )|A| γα Lemma 3.4 and b = (à b t + 2 max QT ≤ Q bt + ≤Q T X (see (3.18)), we get that with probability at least 1 − δ0 log2 T ! ) p (L − l ) (L − l )|A| p , ln(1/δ0 ) ln(1/δ0 ) (L − l ) ω ct (x, a) + 0 ω γα t =1 2(L − l )ω0 T X p 0 ln(1/δ0 )ct (x, a) + t =1 2(L − l ) p (L − l )|A| ln(1/δ0 ) + ln(1/δ0 ). 0 ω γα p Setting ω = 2(L − l ) ln(1/δ0 )ω0 , we obtain ¶ 4 (L − l )|A| bt + QT ≤ Q ωct (x, a) + + (L − l ) ln(1/δ0 ) ω γα t =1 µ ¶ 4 | A| e + (L − l ) ln(1/δ0 ), = Qt + ω γα T X µ yielding the first statement of the lemma. Now for the second statement, set zt = v̂t (x) − vt (x) and b = (L−l ) α for Lemma 3.4, we get b T (x) − VT (x) ≤ ω V e T (x) = V b T (x) + ω thus using V PT T X X t =1 a t =1 P µ ¶ 4 1 πt (a|x)ct (x, a) + + (L − l ) ln(1/δ00 ) ω α a πt (a|x)ct (x, a) 49 and P a πt (a|x)ct (x, a) ≤ (L−l )|A| , α we get that e T (x) − VT (x) ≤ V ¶ µ 4 1 2T ω(L − l )|A| (L − l ) ln(1/δ00 ) + + α ω α holds with probability at least 1 − δ00 as required. Lemma 3.6. Let δ ∈ (0, 1). Assume that the parameters of the algorithm are set such that 0 ≤ γ ≤ 1/3, (3.35) α γ 0 ≤ ηx ≤ , (L − l x ) |A| 1 0≤ω≤ . | A| (3.36) (3.37) Then, (L − l x )|A| 3|X||A|2 log2 T ln γα δ µ ¶ 3|X||A| log2 T 2T ω(L − l x )|A| 8 1 + + + . (L − l x ) ln α ω α δ QT (x, a) − Vt (x) ≤ 3γ(L − l x )T + with probability at least 1 − 2δ/3, simultaneously for all (x, a) ∈ X × A. Remark 3.12. Setting the parameters as s γ= 3|A| ln(3|X||A|2 ln T /δ) , Tα s ω=2 1 ηx = L − lx s 3α ln(3|X||A|2 ln T /δ) , T | A| α ln(3|X||A| ln T /δ) , T | A| gives s max QT (x, a) − VT (x) ≤ 8(L − l x ) a T |A| 3|X||A|2 log2 T (L − l x ) 3|X||A| log2 T ln + ln . α δ α δ Proof. The proof essentially follows that of Theorem 6.10 in Cesa-Bianchi and Lugosi [25]. Fix a state-action pair (x, a) ∈ X × A, x ∈ Xl and let δ0 and δ00 be as defined in Lemma 3.5. Consider the decomposition ¡ ¢ ¡ ¢ e t (x) + V e T (x) − VT (x) . QT (x, b) − Vt (x) = QT (x, b) − V (3.38) e t (x). We claim that from Lemma 3.7 it follows that We first construct an upper bound on −V e T (x, a) − V e T (x) ≤ max Q a X ln |A| e T (x, a) + γx max Q e T (x, a) , + cη x M 2 Q a ηx a 50 (3.39) where c = e − 2 and M 2 = (1 + ω|A|) L −l . α (3.40) To verify this, we need to check the conditions of the Lemma 3.7. 
Using the identifications q t (·) = q̃t (x, ·), πt (·) = πt (·|x), w t (a) = wt (x, a), we see that (3.46) follows from the choice of πt (·|x). Further, we have ct (x, a) ≤ (L − l )|A| , γα which gives q̃t (x, a) = q̂t (x, a) + ωct (x, a) ≤ (1 + ω) (L − l )|A| γα when combined with (3.18). Furthermore, since it is easy to see that by the definition of ct πt (a|x)ct (x, a) ≤ (L − l )|A| , α holds for all (x, a) ∈ X × A, we also have X πt (b|x)q̃t (x, b)2 = b X¡ ¢ πt (b|x)q̂t (x, b) + πt (b|x)ct (x, b) q̃t (x, b) b ≤ (1 + ω|A|) L −l X q̃t (x, b) . α b Now, (3.47) follows from the first inequality and (3.37), (3.36), while the second inequality yields (3.48). We also see that we can indeed choose M 2 as shown in (3.40). Finally, (3.39) follows from the claim of the lemma. From (3.39), upper bounding P a QT (x, a) gives e T (x) ≤ −V e e T (x, a) and reordering the terms by |A| maxa Q ln |A| e T (x, a) , + (ζ − 1) max Q a ηx (3.41) where ζ = γx + cη x M 2 |A|. Since, from (3.35)–(3.37) it follows that ζ ≤ 3γx ≤ 1, therefore, in e T (x, a). order to get an upper bound on the³right-hand side, we need a lower bound on maxa Q ´ A| (L−l ) ln δ10 is a suitable lower bound. Using a union According to Lemma 3.5, QT (x, a)− ω4 + |γα bound and (3.41) we thus get that apart from an event of probability of at most δ0 , µ µ ¶ ¶ ln |A| 4 | A| 1 e T (x) ≤ −V + (ζ − 1) max QT (x, a) − + (L − l ) ln 0 a ηx ω γα δ 51 holds. Plugging this into (3.38), we get ¶ ¶ µ µ 1 4 | A| (L − l ) ln 0 QT (x, b) − Vt (x) ≤ QT (x, b) + (ζ − 1) max QT (x, a) − + a ω γα δ ¢ ln |A| ¡ e T (x) − VT (x) + + V ηx µ ¶ 4 | A| 1 = QT (x, b) − max QT (x, a) + ζ max QT (x, a) + (1 − ζ) (L − l ) ln 0 + a a ω γα δ ¢ ln |A| ¡ e T (x) − VT (x) . + + V ηx Here, QT (x, b) − maxa QT (x, a) ≤ 0. Further, we can upper bound QT (x, a) by (L − l )T and 1 − ζ by 1, to get QT (x, b) − Vt (x) ≤ ζ(L − l )T + µ ¶ ¢ 1 ln |A| ¡ 4 |A| e T (x) − VT (x) . (L − l ) ln 0 + + V + ω γα δ ηx Using Lemma 3.5 and a union bound for all x ∈ X, we get µ ¶ 2T ω(L − l )|A| 4 1 1 e T (x) − VT (x) ≤ V + + (L − l ) ln 00 . α ω α δ Putting together the bounds, using δ00 < δ0 and one last union bound, we thus obtain that with probability at least 1 − 2δ/3, simultaneously for all x ∈ X, it holds that ¶ 8 | A| 1 1 ln |A| QT (x, b) − Vt (x) ≤ ζ(L − l )T + (L − l ) ln 0 + + + ω γα α δ ηx 2T ω(L − l )|A| + . α µ Using ζ ≤ 3γx , plugging in the definitions of δ0 and δ00 and setting η x to its upper bound gives the desired bound. Proof of Theorem 3.7. We have ! à T L−1 X X ¡ ∗ ¢ (t ) (t ) r t (xl , al ) . LT = VT (x 0 ) − VT (x 0 ) + VT (x 0 ) − t =1 l =0 The last term can be bounded from applying the Hoeffding–Azuma inequality, once we note ³ ´T P (t ) (t ) that vt (x 0 ) − L−1 r (x , a ), U is a martingale difference sequence with differences bounded t t l =0 l l t =1 in [−L, L]. Hence, with probability at least 1 − δ/3, VT (x 0 ) − T L−1 X X t =1 l =0 r t (xl(t ) , al(t ) ) ≤ L r 3 8T ln . δ Let us now deal with the first term. According to Lemma 3.1, VT∗ (x 0 ) − VT (x 0 ) = L−1 X X l =0 x∈Xl ³ ´ µ∗T (x) QT (x, π∗T (x)) − VT (x) . 52 ´ ³ Now, QT (x, π∗T (x)) − VT (x) can be bounded using Lemma 3.6 (and Remark 3.12). Summing up for all layers, we get s VT∗ (x 0 ) − VT (x 0 ) ≤ 4L(L + 1) T |A| 3|X||A|2 log2 T L(L + 1) 3|X||A| log2 T ln + ln . α δ 2α δ Putting everything together and using the union bound gives the result. 
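In implementation terms, the high-probability method changes only the estimate fed to the weight update: line 2d of Algorithm 3.3 now uses q̃_t defined via (3.30)–(3.31) instead of q̂_t, while the mixed policy with exploration parameter γ_x is formed as before. The fragment below is a minimal, numerically stable sketch of this update step; it is an illustration only, with hypothetical data structures: log_w is assumed to hold log w_t(x, a) for every state-action pair (initialized to zero, so that w_1 ≡ 1), and eta, gamma hold the per-state parameters η_x and γ_x of Theorem 3.7.

```python
import math
from collections import defaultdict

def exp3_update(log_w, q_tilde, eta, gamma, actions):
    """One step of the weight update w_{t+1}(x,a) = w_t(x,a) * exp(eta[x] * q_tilde[(x,a)]),
    followed by pi_{t+1}(a|x) = (1 - gamma[x]) * w(x,a) / sum_b w(x,b) + gamma[x] / |A|.
    Working with log-weights and subtracting the per-state maximum keeps the
    exponentials finite even after many episodes."""
    pi = defaultdict(dict)
    states = {x for (x, _) in log_w}
    for x in states:
        for a in actions:
            log_w[(x, a)] += eta[x] * q_tilde.get((x, a), 0.0)
        m = max(log_w[(x, a)] for a in actions)
        w = {a: math.exp(log_w[(x, a)] - m) for a in actions}
        total = sum(w.values())
        for a in actions:
            pi[x][a] = (1.0 - gamma[x]) * w[a] / total + gamma[x] / len(actions)
    return pi
```

Subtracting the per-state maximum before exponentiating is the same device as computing the action-selection probabilities from relative value differences, as suggested for Algorithm 4.2 in Chapter 4.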
3.6 Simulations

Figure 3.1: The two problem settings considered for the experiments.

We have conducted two sets of experiments to (i) further illustrate the differences between the algorithms, (ii) show that using tools designed for traditional MDPs can lead to poor performance in our setting, and (iii) show how the randomness of the MDP influences the performance of our algorithms. We consider two examples as the underlying MDP. The first one, illustrated in Figure 3.1(a), is a project-scheduling problem in which the learner repeatedly has to allocate resources to two projects. Both projects consist of a number of milestones (K_1 and K_2, respectively); a learning episode ends when both projects reach their respective final milestones. In each time step of the episode, the learner is in one of the states x = (i, j) described by the status of each project (i ∈ {1, 2, . . . , K_1}, j ∈ {1, 2, . . . , K_2}) and has to decide whether to focus on project 1 or project 2. These decisions are represented by actions a_1 and a_2, respectively. If the learner chooses a_1 in state (i, j) (where i < K_1 and j < K_2), then the probability of moving to the next state (i', j') is given by the following table:

                 i' = i       i' = i + 1    i' = min{i + 2, K_1}
    j' = j       0            0.9 · 0.2     0.9 · 0.8
    j' = j + 1   0.1 · 0.1    0.1 · 0.1     0.1 · 0.8

The roles of i' and j' are reversed for action a_2. When i = K_1, both actions yield the same transitions: j' = j + 1 with probability 0.2 and j' = min{j + 2, K_2} with probability 0.8. The transitions of i' follow the same rule when j = K_2. We assume that after each decision the learner is subject to some non-negative cost that depends on (i, j) and the chosen action; the goal of the learner is to minimize its total cost. Assuming that the costs can change after each episode, we can apply our algorithms to guarantee that the learner does nearly as well as the best state-dependent allocation policy, given the complete sequence of costs. Note that our methods need some adjustments to cope with this setting. First, costs should be represented as negative rewards, that is, we need to assume r_t : X × A → [−1, 0]. It is very easy to show that all statements of this chapter continue to hold under this assumption. Second, observe that the MDP structure described above does not conform to the assumption that all paths have a fixed length. This problem can be overcome either by the method described in Appendix A of György et al. [47], at the cost of slightly inflating the state space, or by redefining µ_t as µ_t(x) = P[x ∈ u_t | U_{t−1}]. Again, it is easy to check that our analysis carries through with this slight modification.

Our first goal with the experiments was to show that while algorithms assuming i.i.d. rewards can fail when the rewards are allowed to change over time, the algorithms presented and analyzed in this chapter continue to have good empirical performance. In particular, we have implemented Algorithms 3.1, 3.2, 3.3 and 3.5, and also a simplified version of UCRL [59] that maintains confidence intervals for the rewards only. We have set K_1 = K_2 = 10 and designed the reward sequence as follows: we set r_t(x, a) = −1 for all t = 1, 2, . . . , T/2 and all (x, a) ∈ X × A. For the remaining episodes, we used a function φ : {1, 2, . . . , 10} × {1, 2, . . . , 10} × A → [0, 1]
describing cost advantages that depend on the current status of each project. Using this function, we set the reward for state x = (i, j) as r_t((i, j), a) = φ((i, j), a)/8 − 1 for t = T/2 + 1, . . . , T/2 + T/16, and r_t((i, j), a) = φ((i, j), a)/8 + φ((j, i), a) − 1 for the remaining episodes. In other words, after introducing a small state-dependent cost advantage in episode T/2 + 1, we add a larger state-dependent advantage that rewards exactly the opposite actions to the previously introduced one.

The regret of our algorithms and of UCRL is plotted in Figure 3.2 as a function of the number of episodes. As expected, UCRL performs very well as long as the reward function remains unchanged, but fails to discover the second change in the rewards, since it becomes overly confident once it has found out about the first change. This results in linear regret for this algorithm. Notice that Algorithm 3.2 also fails on this learning problem, which can be attributed to the fact that this algorithm uses very little information about the underlying MDP structure when constructing its action-value estimates. Algorithm 3.3 and Algorithm 3.5 both perform well on this learning problem, since their regrets grow proportionally to √t. The explicit exploration episodes used by Algorithm 3.5 enable it to quickly discover both changes in the rewards, and thus help this method outperform all other methods on this problem.

Figure 3.2: The regret of the discussed algorithms in the project scheduling problem plotted against the number of episodes. Each curve is the average of 20 independent experiments; shaded areas around each curve represent adjusted empirical standard deviations of these runs. The switches between reward functions are marked with vertical lines.

The second set of experiments was conducted on a grid world of size 10 × 10, where in each episode the agent has to find the shortest path from the lower left corner to the upper right corner. The agent has two actions, both of which make it move either right or up: the “right” (“up”) action moves the agent right (respectively, up) with probability p, and up (respectively, right) with probability 1 − p. That is, we have L = 20, |X| = 100, 1/α = (1/(1 − p))^10, and κ = (p²/(1 − p))^5. Our goal with these experiments was to show how the performance of our algorithms changes as α gets close to zero. The experiment is run with T = 200,000, and the rewards are constructed in a similar fashion as in the first experiment: for all x = (i, j) ∈ {1, 2, . . . , 10}², we set r_t((i, j), a) = 0 for t = 1, . . . , T/2, r_t((i, j), a) = φ((i, j), a)/8 for t = T/2 + 1, . . . , T/2 + T/16, and r_t((i, j), a) = φ((i, j), a)/8 + φ((j, i), a) for the remaining episodes. To examine the dependence of the performance of each algorithm on α, we set φ(i, j) = I_{i=10, j∈{1,2,3,4,5}}, so that the learner has to discover the positive rewards hidden in the hard-to-reach corners. We have simulated the policies generated by Algorithms 3.1, 3.2, 3.3, 3.5 and UCRL for a number of values of p between 0.55 and 1. The attained regrets for each of these methods are plotted in Figure 3.3 against the resulting values of ln(1/α). We see that the performance of each algorithm deteriorates as 1/α grows, indicating that the learning problem does become harder as the problem gets closer to being deterministic.
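To make the role of p concrete, the following small simulation sketch (not part of the thesis; the clipping of moves at the grid boundary and the "always up" comparison policy are assumptions made purely for illustration) implements the grid-world dynamics described above and estimates how often a policy that commits to moving up early ever visits the rewarded states with i = 10 and j ≤ 5. The estimated probability collapses as p → 1, mirroring the value 1/α = (1/(1 − p))^10 quoted above.

```python
import random

def run_episode(policy, p, size=10):
    """Simulate one episode of the 10x10 grid world: start at (1, 1), stop at (size, size).
    Action 'right' moves right w.p. p and up w.p. 1 - p; action 'up' is the mirror image.
    Boundary handling is an assumption of this sketch: moves off the grid are clipped."""
    i, j, visited = 1, 1, set()
    while (i, j) != (size, size):
        visited.add((i, j))
        a = policy(i, j)
        go_right = (random.random() < p) if a == "right" else (random.random() >= p)
        if go_right and i < size:
            i += 1
        elif j < size:
            j += 1
        else:               # pressed against the top edge: the only remaining direction is right
            i += 1
    visited.add((i, j))
    return visited

def hit_probability(policy, p, runs=20000):
    """Fraction of episodes that visit a rewarded state (i = 10, j <= 5)."""
    hits = sum(any(i == 10 and j <= 5 for (i, j) in run_episode(policy, p))
               for _ in range(runs))
    return hits / runs

if __name__ == "__main__":
    always_up = lambda i, j: "up"      # a hypothetical policy that has locked onto moving up early
    for p in (0.55, 0.75, 0.95):
        print(p, hit_probability(always_up, p))
```

For p = 0.95 the empirical frequency is indistinguishable from zero, which matches the behaviour seen in Figure 3.3.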
The intuitive reason for this deterioration is that the closer the success probability p gets to 1, the less likely it becomes to accidentally discover new rewards hidden in distant parts of the state space. Since Algorithm 3.5 discovers hidden rewards with exponentially higher probability than our other algorithms do, its performance stays much closer to the optimal performance. Similarly to the previous experiments, we see that once UCRL discovers the first batch of non-zero rewards, it only changes its mind when it accidentally stumbles upon the second batch, and the probability of this event goes to zero rapidly as p increases. In summary, we conclude that the randomness of the MDP does influence the performance of our algorithms; however, this dependence seems to be much milder than 1/√α.

Figure 3.3: The regret of the discussed algorithms in the grid world problem plotted against ln(1/α). Each curve is the average of 20 independent experiments; shaded areas around each curve represent adjusted empirical standard deviations of these runs.

3.7 The proof of Lemma 3.2

For (x, a) ∈ X × A and t ∈ {1, . . . , T}, define x̃^(t)(x, a) ∈ X to be a random state whose distribution (given U_{t−1}) is P(·|x, a). We assume that the random variables

    (ã^(t)(x))_{x∈X},   (x̃^(t)(x, a))_{(x,a)∈X×A}        (3.42)

are conditionally independent of each other given U_{t−1}. Without loss of generality, we may assume that the state-action pairs along the random path u_t are obtained using

    x^(t)_{l+1} = x̃^(t)(x^(t)_l, a^(t)_l),    a^(t)_l = ã^(t)(x^(t)_l),    l = 0, 1, . . . , L − 1.

Now, given (x, a) ∈ X × A, define the random variables (x̂^(t)_k(x, a))_{l_x ≤ k ≤ L} and (â^(t)_k(x, a))_{l_x ≤ k ≤ L−1} as follows:

    x̂^(t)_k(x, a) = x                                                 if k = l_x,
    x̂^(t)_k(x, a) = x̃^(t)(x̂^(t)_{k−1}(x, a), â^(t)_{k−1}(x, a))        if l_x + 1 ≤ k ≤ L,

and

    â^(t)_k(x, a) = a                        if k = l_x,
    â^(t)_k(x, a) = ã^(t)(x̂^(t)_k(x, a))      if l_x + 1 ≤ k ≤ L − 1.

Thus, (x̂^(t)_{l_x}(x, a), â^(t)_{l_x}(x, a)) = (x, a) holds for any (x, a) ∈ X × A. Further,

    x̂^(t)_k(x^(t)_l, a^(t)_l) = x^(t)_k    and    â^(t)_k(x^(t)_l, a^(t)_l) = a^(t)_k,    0 ≤ l ≤ k ≤ L,        (3.43)

as can be seen by induction on k. Now, define the reward sequence (r^B_t)_{1≤t≤T} by

    r^B_t(x, a) = ( I_{x^(t)_{l_x} = x} / µ_t(x) ) Σ_{k=l_x}^{L−1} r_t(x̂^(t)_k(x, a), â^(t)_k(x, a)),    (x, a) ∈ X × A.        (3.44)

First notice that since µ_t(x) ≥ α and r_t(x, a) ∈ [0, 1] for all (x, a) ∈ X × A, we trivially have 0 ≤ r^B_t(x, a) ≤ (L − l_x)/α.

Next we show that the reward received by b(x) in episode t, denoted by r̄_t(x), is r^B_t(x, ã^(t)(x)). Consider first the case when x was not visited in episode t. In this case r̄_t(x) = 0. However, in this case r^B_t(x, ·) = 0 holds, too, and thus r̄_t(x) = r^B_t(x, ã^(t)(x)). Now, assume that x was visited in episode t. In this case, ã^(t)(x) = a^(t)_{l_x} and the reward fed to b(x) is

    r̄_t(x) = ( Σ_{k=l_x}^{L−1} r_t(x^(t)_k, a^(t)_k) ) / µ_t(x)
            = ( Σ_{k=l_x}^{L−1} r_t(x̂^(t)_k(x^(t)_{l_x}, a^(t)_{l_x}), â^(t)_k(x^(t)_{l_x}, a^(t)_{l_x})) ) / µ_t(x),

where we used (3.43). Since, under the assumption that x was visited in episode t, the second expression equals r^B_t(x, ã^(t)(x)), the proof of the claim r̄_t(x) = r^B_t(x, ã^(t)(x)) is finished.

Next we show that r^B_t(x, ·) and ã^(t)(x) are conditionally independent of each other, given U_{t−1}. To see why this holds, note that by its definition, r^B_t(x, a) can be written as a deterministic function of (x̃^(t)(x', a'); (x', a') ∈ X × A), (ã^(t)(x'); x' ∈ X \ {x}) and U_{t−1}.
Thus, thanks to (3.42), ã(t ) (x) and rBt (x, a) are indeed conditionally independent given Ut −1 because both of them are deterministic functions of some random variables, the respective sets of random variables are disjoint and the random variables they depend on are all conditionally independent given Ut −1 . This conditional independence and the definition (3.44) imply that one can assume that the partial reward functions rBt (x, ·) are selected by some suitable non-oblivious adversary interacting with b(x) for all x ∈ X. 57 It remains to show that (3.4) holds. Let l = l x . We have " # ¯ ¯ h i I{x=x(t ) } L−1 X ¯ (t ) ¯ (t ) l (t ) (t ) B E r t (x̂k (x, a), âk (x, a))¯xl , Ut −1 E rt (x, a)¯xl , Ut −1 = µt (x) k=l = I{x=x(t ) } l µt (x) (3.45) qt (x, a) , where the first equality follows from the definition of rBt and because µt is a function of Ut −1 . L−1 The second equality uses that the random state xl(t ) and the random path (x̂k(t ) (x, a), âk(t ) (x, a))k=l generated from (x, a) are conditionally independent given Ut −1 . Therefore, " E L−1 X k=l # ¯ ¯ ) ) r t (x̂(t (x, a), â(t (x, a))¯xl(t ) , Ut −1 k k " =E L−1 X k=l ¯ ¯ r t (x̂k(t ) (x, a), âk(t ) (x, a))¯Ut −1 # . Now, this last expectation equals qt (x, a), thanks to the definition of qt and (x̂k(t ) (x, a), âk(t ) (x, a))L−1 . k=l By taking the conditional expectation of both sides of (3.45) w.r.t. Ut −1 , using again that qt and h i µt are functions of Ut −1 , together with P x = xl(t ) |Ut −1 = µt (x) gives that ¯ h i ¯ E rBt (x, a)¯Ut −1 = qt (x, a) . Now, by the conditional independence of rBt (x, ·) and ã(t ) (x) given Ut −1 , we also get that £ ¤ £ ¤ E rBt (x, ã(t ) (x))|ã(t ) (x), Ut −1 = E qt (x, ã(t ) (x))|ã(t ) (x), Ut −1 . Taking the conditional expectation of both sides w.r.t. Ut −1 , we thus obtain that ¯ i X h ¯ E rBt (x, ã(t ) (x))¯Ut −1 = πt (a|x)qt (x, a) = vt (x) . a Taking expectations of both sides again, we get that (3.4) indeed holds. This finishes the proof of the lemma. 3.8 A technical lemma Lemma 3.7. Fix a finite set A, an integer T > 0, the reals 0 < γ < 1, 0 < η and the sequence of functions q 1 , . . . , q T : A → R. Define the sequence of functions, (w t )0≤t ≤T , (π̃t )1≤t ≤T (w t : A → R, π̃t : A → R) by w 0 = 1, à w t (a) = exp η tX −1 ! q s (a) , s=1 and π̃t (a) = P w t (a) . a∈A w t (a) 58 a ∈ A, Let (πt )1≤t ≤T be a sequence such that πt (a) ≥ (1 − γ)π̃t (a), Define Q T = PT t =1 q t , VT = PT t =1 v t , where vt = P 1 ≤ t ≤ T, a ∈ A . a∈A πt (a)q t (a). ηq t (a) ≤ 1, X a∈A πt (a)q t (a)2 ≤ M 2 X a∈A (3.46) Assume that 1 ≤ t ≤ T, a ∈ A , (3.47) 1 ≤ t ≤ T, (3.48) q t (a), for some appropriate constant M 2 > 0. Then, for any b ∈ A, Q T (b) − VT ≤ Proof. Let Wt = P a w t (a) and let c X ln |A| Q T (a) + γQ T (b) . + (e − 2)ηM 2 η a∈A = e −2. From the update rule and some elementary inequal- ities, we get the following upper bound on the ratio Wt +1 /Wt : Wt +1 X w t +1 (a) X w t (a) ηq t (a) = = e Wt Wt Wt a a X πt (a) ηq (a) ≤ e t a 1−γ X π (a) ³ ¡ ¢2 ´ t ≤ 1 + ηq (a) + c ηq (a) t t 1−γ by (3.46) e y ≤ 1 + y + c y 2 if y ≤ 1 and (3.47) a η = 1 + 1−γ ≤ 1+ X a cη2 πt (a)q t (a) + 1−γ X πt (a)(q t (a))2 a X η cη2 vt + M 2 q t (a). 1−γ 1−γ a by (3.48) Here, the second inequality uses (3.47). Taking logarithms and using ln(1 + x) ≤ x gives ln X Wt +1 η cη2 vt + M 2 q t (a). ≤ Wt 1−γ 1−γ a Summing over t we get ln X η cη2 WT +1 ≤ VT + M 2 Q T (a). W1 1−γ 1−γ a On the other hand, we have for any action b ∈ A that ln WT +1 w T +1 (b) ≥ ln = ηQ T (b) − ln |A| . 
Combining with (3.49), we get

    V_T ≥ (1 − γ) Q_T(b) − ln|A|/η − cηM_2 Σ_{a∈A} Q_T(a).

Reordering gives the desired inequality.

Chapter 4

Online learning in known unichain Markov decision processes

The topic of this chapter is learning in continuing MDPs where the rewards are allowed to change in every time step of the decision process. We assume that the MDP M = {X, A, P} satisfies Assumption M1 and that the transition function P is known to the learner. After defining the learning problem in Section 4.1, we show in Section 4.2 that it is possible to decompose the learning problem into smaller learning problems. To ease the presentation of our ideas, we first present the algorithm MDP-E of Even-Dar et al. [33] in Section 4.3. We then present and analyze our algorithm, called MDP-Exp3, in Section 4.4. A preliminary version of the main result of this chapter was published as [81], while the full version to be presented here is covered by [82]. The results of this chapter are summarized as follows:

Thesis 2. Proposed an efficient algorithm for online learning in known unichain MDPs. Proved performance guarantees for Settings 2a and 2b. The proved bounds are optimal in terms of the dependence on the number of time steps, up to logarithmic factors. [81, 82]

4.1 Problem setup

The purpose of this section is to provide the formal definition of our problem and to set the goals. Building on the formalism of unichain MDPs introduced in Section 2.3.2, we introduce the online version of Markov decision processes, called online MDP or O-MDP. The difference from the protocol described in Figure 2.4 is that the reward function is allowed to change arbitrarily in every time step, that is, the single reward function r is replaced by a sequence of reward functions (r_t)_{t=1}^T. This sequence is assumed to be fixed ahead of time by an oblivious adversary, and, for simplicity, we assume that r_t(x, a) ∈ [0, 1] for all (x, a) ∈ X × A and t ∈ {1, 2, . . .}. As in the online prediction framework described in Section 2.1, no other assumptions are made about this sequence. The learning agent is assumed to know the transition probabilities P, but is not given the sequence (r_t)_{t=1}^T. The protocol of interaction with the environment is unchanged: at time step t the agent selects an action a_t based on the information available to it, which is sent to the environment. In response, the reward r_t(x_t, a_t) and the next state x_{t+1} are communicated to the agent. The initial state x_1 is generated from a fixed (unknown) distribution µ_1.

Let the total expected reward collected by the agent up to time T be denoted by

    R̂_T = E[ Σ_{t=1}^T r_t(x_t, a_t) | P ].

As before, the goal of the agent is to make this sum as large as possible or, equivalently, to minimize the total expected regret against the pool of all stationary policies. The concept of regret for this case is defined as follows. Fix a stochastic stationary policy π : X × A → [0, 1] and let (x'_t, a'_t)_{t=1}^T be the trajectory that results from following policy π from x'_1 ∼ P_0 (in particular, a'_t ∼ π(·|x'_t)). The total expected reward of π over the first T time steps is defined as

    R^π_T = E[ Σ_{t=1}^T r_t(x'_t, a'_t) | P, π ].

Now, the total expected regret of the learning agent relative to the class of stationary policies is defined as

    L̂_T = sup_π R^π_T − R̂_T,

where the supremum is taken over all stochastic stationary policies in M.
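The algorithms of this chapter repeatedly need the average reward ρ^π and the differential value functions q^π, v^π of a fixed stationary policy, defined through the evaluation equations of Section 2.3.2 (see (2.5) and (2.6)). The sketch below shows one standard way to carry out this policy evaluation for a known unichain MDP. It is an illustration only and not taken from the thesis: the array-based encoding of P, r and π is hypothetical, and the solution of the evaluation equations is pinned down here by the usual normalization Σ_x µ^π_st(x) v^π(x) = 0.

```python
import numpy as np

def evaluate_policy(P, r, pi):
    """Average-reward evaluation of a stationary policy in a known unichain MDP.

    P  -- array of shape (X, A, X), P[x, a, y] = P(y | x, a)
    r  -- array of shape (X, A), the reward function
    pi -- array of shape (X, A), pi[x, a] = pi(a | x)
    Returns (rho, v, q, mu): gain, differential values, action values, stationary distribution."""
    X, A = r.shape
    P_pi = np.einsum('xay,xa->xy', P, pi)        # transition matrix of the induced Markov chain
    r_pi = np.einsum('xa,xa->x', r, pi)          # expected one-step reward under pi

    # Stationary distribution: mu P_pi = mu with sum(mu) = 1.
    A_mu = np.vstack([P_pi.T - np.eye(X), np.ones((1, X))])
    b_mu = np.concatenate([np.zeros(X), [1.0]])
    mu = np.linalg.lstsq(A_mu, b_mu, rcond=None)[0]

    rho = float(mu @ r_pi)                       # average reward

    # Differential values: (I - P_pi) v = r_pi - rho, normalized so that mu @ v = 0.
    A_v = np.vstack([np.eye(X) - P_pi, mu[None, :]])
    b_v = np.concatenate([r_pi - rho, [0.0]])
    v = np.linalg.lstsq(A_v, b_v, rcond=None)[0]

    q = r - rho + np.einsum('xay,y->xa', P, v)   # action-value Bellman equation
    return rho, v, q, mu
```

As a sanity check, Σ_a π(a|x) q(x, a) recovers v(x), that is, the second equation of the pair that reappears as (4.2) below.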
4.2 A decomposition of the regret

For the rest of the chapter, we suppose that P satisfies Assumption M1. As we have done before in Section 3.2, we introduce some additional concepts needed to break down the global learning problem into smaller individual learning problems: we reformulate the task of the learner from having to select actions a_t to having to select policies π_t in each time step t. Using this twist, we state a performance difference lemma in the vein of Lemma 3.1 to show a relationship between the average rewards of any fixed policy π and the chosen policy π_t.¹ Unlike in the O-SSP setting, this lemma alone is not sufficient for proving upper bounds on the regret, since the actual reward r_t(x_t, a_t) accumulated by the algorithm at time t can differ from the long-term average reward of the currently followed policy. We show that if the sequence of policies (π_t)_{t=1}^T satisfies a slow-change criterion, the expectation of r_t(x_t, a_t) stays close to the long-term average reward of π_t given r_t.

¹ That these concepts are well-defined follows from Assumption M1 and standard results to be found, for example, in the book by Puterman [86].

For defining the learner's policy, consider the trajectory {(x_t, a_t)} followed by a learning agent with x_1 ∼ µ_1. Define

    u_t = ( x_1, a_1, r_1(x_1, a_1), x_2, a_2, r_2(x_2, a_2), . . . , x_t, a_t, r_t(x_t, a_t) )        (4.1)

and introduce the policy followed in time step t, π_t(a|x) = P[a_t = a | u_{t−1}, x_t = x]. Note that π_t is computed based on past information and is therefore random. The average reward ρ^π_t of policy π is defined by replacing r by r_t in Equation (2.5). The action-value and value functions q^π_t and v^π_t are then defined as the action-value and value functions satisfying the Bellman equations (2.6), with the roles of r and ρ^π taken by r_t and ρ^π_t. We will use the following notation:

    q_t = q^{π_t}_t,    v_t = v^{π_t}_t,    ρ_t = ρ^{π_t}_t.

Thus, the following equations hold simultaneously for all (x, a) ∈ X × A:

    q_t(x, a) = r_t(x, a) − ρ_t + Σ_{x'} P(x'|x, a) v_t(x'),        v_t(x) = Σ_a π_t(a|x) q_t(x, a).        (4.2)

Following Even-Dar et al. [33], fix some policy π and consider the decomposition of the regret relative to π:

    R^π_T − R̂_T = ( R^π_T − Σ_{t=1}^T ρ^π_t ) + ( Σ_{t=1}^T ρ^π_t − Σ_{t=1}^T ρ_t ) + ( Σ_{t=1}^T ρ_t − R̂_T ).        (4.3)

The second term can be treated with the following performance difference lemma:

Lemma 4.1 (Performance difference lemma for unichain MDPs).² Let π, π̂ be two (stochastic stationary) policies. Assume that ρ^π, ρ^π̂ and the action-value function of π̂, q^π̂, are well-defined. Then, for any t ≥ 1,

    ρ^π_t − ρ^π̂_t = Σ_{x,a} µ^π_st(x) π(a|x) [ q^π̂_t(x, a) − v^π̂_t(x) ].

In particular, for any T > 1 and deterministic policy π,

    Σ_{t=1}^T ρ^π_t − Σ_{t=1}^T ρ_t = Σ_x µ^π_st(x) [ Q_T(x, π(x)) − V_T(x) ].

² This lemma does not need Assumption M1.

Remark 4.1. This lemma appeared as Lemma 4.1 in [33], but similar statements have been known for a while. For example, the book of Cao [21] also puts performance difference statements at the center of the theory of MDPs. For the sake of completeness, we include the easy proof:

Proof. We have

    Σ_{x,a} µ^π_st(x) π(a|x) q^π̂(x, a) = Σ_{x,a} µ^π_st(x) π(a|x) [ r(x, a) − ρ^π̂ + Σ_{x'} P(x'|x, a) v^π̂(x') ]
                                        = ρ^π − ρ^π̂ + Σ_x µ^π_st(x) v^π̂(x),

where the second equality uses that µ^π_st is stationary under π, that is, Σ_{x,a} µ^π_st(x) π(a|x) P(x'|x, a) = µ^π_st(x'). Reordering the terms gives the result.

The first and the last terms of (4.3) measure the difference between the sum of average rewards and the actual expected reward. The existence of a unique stationary distribution for any π (which follows from Assumption M1) ensures that these differences are not large in the case of a fixed policy.
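To see concretely why the difference between R^π_T and Σ_t ρ^π_t stays small for a fixed policy π, one can track how quickly the state distribution ν^π_t approaches the stationary distribution µ^π_st. The short sketch below is a purely illustrative numerical check, not taken from the thesis: it builds a random MDP with strictly positive transition probabilities (so that Assumption M1 certainly holds) together with a random stationary policy, and prints ‖ν^π_t − µ^π_st‖₁ for the first few time steps.

```python
import numpy as np

rng = np.random.default_rng(0)
X, A = 5, 3

# A random MDP with strictly positive transition probabilities and a random stationary
# policy; both are purely illustrative stand-ins for the unichain MDPs of this chapter.
P = rng.random((X, A, X)); P /= P.sum(axis=2, keepdims=True)
pi = rng.random((X, A));   pi /= pi.sum(axis=1, keepdims=True)
P_pi = np.einsum('xay,xa->xy', P, pi)

# Stationary distribution by power iteration, then the decay of ||nu_t - mu_st||_1.
mu = np.full(X, 1.0 / X)
for _ in range(10000):
    mu = mu @ P_pi
nu = np.zeros(X); nu[0] = 1.0          # x_1 drawn from a point mass here, for simplicity
for t in range(1, 21):
    print(t, np.abs(nu - mu).sum())    # decreases geometrically, at a rate governed by tau
    nu = nu @ P_pi
```

The printed distances decay geometrically, so their sum over t is bounded by a constant independent of T; this is exactly what Lemma 4.2 below makes precise in terms of the mixing time τ.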
Accordingly, the third term is bounded by a constant, O(τ): Lemma 4.2. For any T ≥ 1 and any policy π it holds that R Tπ − T X ρ πt ≤ 2τ + 2 . (4.4) t =1 We give the proof for completeness and because our proof corrects slight inaccuracies in [33]. Proof. Let {(xt , at )} be the trajectory when π is followed. Note that the difference R Tπ − π t =1 ρ t PT is due to the initial distribution of x1 being different from the stationary distribution of π. To quantify the difference, write R Tπ − T X t =1 ρ πt = T X X X (νπt (x) − µπst (x)) π(a|x)r t (x, a), t =1 x a where νπt (x) = P[xt = x] is the state distribution at time step t . Viewing νπt is a row vector, we have νπt = νπt −1 P π . Consider the t th term of the above difference. Then, using r t (x, a) ∈ [0, 1] we have X π X (νt (x) − µπst (x)) π(a|x)r t (x, a) ≤ kνπt − µπst k1 x a = kνπt −1 P π − µπst P π k1 ≤ e −1/τ kνπt −1 − µπst k1 ≤ . . . ≤ e −(t −1)/τ kνπ1 − µπst k1 ≤ 2e −(t −1)/τ . This, together with the elementary inequality PT t =1 e −(t −1)/τ ≤ 1+ R∞ 0 e −t /τ d t = 1 + τ gives the desired bound.3 Treating the first term of (4.3) requires more careful treatment since the policies πt change over time. The techniques needed to control the change rate of the policies and the translation 3 Even-Dar et al. [33] mistakenly uses kνπ − µπ k ≤ e −t /τ kνπ − µπ k in their paper (t = 1 immediately shows that t st 1 st 1 1 this can be false). See, e.g., the proofs of their Lemmas 2.2 and 5.2. 63 of this rate to the remaining term of the regret are different in the full information and the bandit settings, so we are going to treat these issues in the next sections separately. 4.3 Full information O-MDP In this section, we present the algorithm proposed by Even-Dar et al. [33] for the full-information version of the O-MDP problem. The algorithm is called MDP-E , and is shown as Algorithm 4.1. The idea of the algorithm is to use an instance e(x) of expert algorithm E with action set A in ¡ ¢T each state x, fed with reward sequence q t (x, ·) t =1 . The same idea used for proving Proposition 3.1 can be used to prove an upper bound on the main term of the regret, that is, the second term in Equation (4.3). However, as noted in the previous section, we still need to ensure that the policy sequence satisfies a slow-change condition. For the sake of completeness and the better understandability of our contributions, we state a performance guarantee for MDP-E for the case when EWA is used as the expert algorithm E . Algorithm 4.1 MDP-E: an algorithm for online learning in MDPs with full information 1. Set η > 0 and w 1 (x, a) = 1 for all (x, a) ∈ X × A. 2. For t = 1, 2, . . . , T , repeat: (a) For all (x, a) ∈ X × A, let w t (x, a) πt (a|x) = P . b w t (x, b) (b) Draw an action at ∼ πt (·|xt ). (c) Receive reward r t (xt , at ), observe r t and xt +1 . (d) Compute q t using the Bellman equations (2.6) with r t . (e) For all (x, a) ∈ X × A, set w t +1 (x, a) = w t (x, a)e ηq t (x,a) . Theorem 4.1. Let the transition probability kernel P satisfy Assumptions M1. Fix T > 0. Then for an appropriate choice of parameter η (which depends on |A|, T, τ), for any sequence of reward functions (r t )Tt=1 taking values in [0, 1], the regret of the algorithm MDP-E can be bounded as s µ b T ≤ 4τ + 2 + 2 T ln |A| 2τ3 + L ¶ 13 2 τ + 6τ + 2 . 2 (4.5) Remark 4.2. Note that the constants in this bound are different from those presented in Thep p orem 5.1 of Even-Dar et al. [33]. 
In particular, the leading term here is 8τ3/2 T ln |A|, while p their leading term is 4τ2 T ln |A|. The above bound both corrects some small mistakes in their calculations and improves them at the same time.4 4 One of the mistakes is in the proof of Theorem 4.1 of Even-Dar et al. [33] where, they failed to notice that q πt t 64 The proof of this result is presented partly for the sake of completeness and partly so that we can be more specific about the corrections required to fix the main result (Theorem 5.2) of Even-Dar et al. [33]. Further, the proof will also serve as a starting point for the proof of our main result, Theorem 4.2. Nevertheless, the inpatient reader may skip this next section and jump immediately to the proof of Theorem 4.2, which apart from referring to some general lemmas developed in the next subsection, is entirely self-contained. First note that since the algorithm gets to observe the complete reward function r t in each time step t = 1, 2, . . . , T , the policies (πt )Tt=1 do not depend on the trajectory traversed by the learner. Accordingly, we use the notations πt , q t , v t and ρ t to emphasize that all these variables are now non-random. Now consider the regret decomposition of Equation 4.3: R Tπ − RbT à = R Tπ − T X ρ πt ! à + t =1 T X ρ πt − t =1 T X ! à ρt + T X ! ρ t − RbT . t =1 t =1 As suggested by Lemma 4.1, the second term can be controlled by upper bounding the individual regrets maxa Q T (x, a) − VT (x) of the instances of EWA in each state x ∈ X. Assume for a π second that K > 0 is such that kq t t k∞ ≤ K holds for 1 ≤ t ≤ T . By Theorem 2.2 in [25], each individual regret can be bounded by ln |A| K 2 ηT + . η 8 (4.6) ¡ π ¢T Notice that q t t t =1 is a sequence that is sequentially generated from (r t )Tt=1 . It is Lemma 4.1 of [25] that shows that the bound of Corollary 2.2 of [25] continues to hold for such sequentially generated functions. Putting the inequalities together, we obtain T X ρ πt − t =1 T X t =1 ρt ≤ ln |A| K 2 ηT + . η 8 (4.7) According to the next lemma an appropriate value for K is 2τ + 4. The lemma is stated in a greater generality than what is needed here because the more general form shown will be useful later. Lemma 4.3. Pick any policy π in an MDP (P, r ). Assume that the mixing time of π is τ in the P sense of (2.4). If | a π(a|x)r (x, a)| ≤ R holds for any x ∈ X, then |v π (x)| ≤ 2R(τ + 1) holds for all x ∈ X. Furthermore, for any (x, a) ∈ X × A, |q π (x, a)| ≤ 2R(τ + 1) + R + |r (x, a)| ≤ (2τ + 4) · kr k∞ . Proof. As it is well known and is easy to see from the definitions, the (differential) value of policy π at state x can be written as v π (x) = ∞ X X s=1 x0 (νπs,x (x 0 ) − µπst (x 0 )) X π(a|x 0 )r (x 0 , a), a π can take on negative values. Thus, their Assumption 3.1 is not met by {q t t } (one needs to extend the upper bound given in their Lemma 2.2 with a lower bound and change Assumption 3.1). As a result, Assumption 3.1 cannot be used to show that the inequality in the proof of Theorem 4.1 holds. This mistake, as well as the others, can be easily corrected, as we show it here. 65 where νπs,x = e x (P π )s−1 is the state distribution when following π for s − 1 steps starting from P state x. Using the bound on a π(a|x 0 )r (x 0 , a) and the triangle inequality gives ∞ X X ¯ π ¯ ¯v (x)¯ ≤ R |νπs,x (x 0 ) − µπst (x 0 )| ≤ 2R (τ + 1) , s=1 x 0 where in the second inequality we used kνπs,x − µπst k1 ≤ 2e −(s−1)/τ and that P∞ s=1 e −(s−1)/τ ≤ τ+1 (cf. the proof of Lemma 4.2). This proves the first inequality. 
The second inequality follows from the first part and the Bellman equation: |q π (x, a)| ≤ |r (x, a)| + |ρ π | + X P (x 0 |x, a)|v π (x 0 )| ≤ |r (x, a)| + R + 2R(τ + 1). x0 Here, we used that ρ π = P P P π x µst (x) a π(a|x)r (x, a) and the bound on a π(a|x)r (x, a). Let us now consider the third term of (4.3), PT t =1 ρ t − RbT . The t th term of this difference is the difference between the average reward of πt and the expected reward obtained in step πt . If νt (x) is the distribution of states in time step t , T X ρ t − RbT = t =1 Thus, PT t =1 ρ t − RbT ≤ π µstt πt t =1 kµst PT T X X X (µπst (x) − νt (x)) π(a|x)r t (x, a). t =1 x a − νt k1 and so remains to bound the L 1 distances between the and νt . For this, we will use two general lemmas that will again come useful P later. Introduce the mixed norm kπk1,∞ = maxx a kπ(a|x)k. Clearly, kνP π −νP π̂ k1 ≤ kπ− π̂k1,∞ distributions holds for any two policies π, π̂ and any distribution ν (cf. Lemma 5.1 in [33]). The first lemma shows that the map π 7→ µπst as a map from the space of stationary policies equipped with the mixed norm k · k1,∞ to the space of distributions equipped with the L 1 -norm is τ-Lipschitz: Lemma 4.4. Let P be a transition probability kernel over X × A such that the mixing time of P is τ < ∞. For any two policies, π, π̂, it holds that kµπst − µπ̂st k1 ≤ τkπ − π̂k1,∞ . Proof. The statement follows from solving kµπst − µπ̂st k1 ≤ kµπst P π − µπ̂st P π k1 + kµπ̂st P π − µπ̂st P π̂ k1 ≤ e −1/τ kµπst − µπ̂st k1 + kπ − π̂k1,∞ for kµπst − µπ̂st k1 . The next lemma allows us to compare an n-step distribution under a policy sequence with the stationary distribution of the sequence’s last policy: Lemma 4.5. Let P be a transition probability kernel over X × A such that the mixing time of P is τ < ∞. Take any probability distribution µ0 over X , integer n > 0 and policies π1 , . . . , πn . 66 Consider the distribution µn = µ0 P π1 · · · P πn . Then, it holds that π kµn − µstn k1 ≤ 2e −n/τ + τ(τ + 1) max kπt − πt −1 k1,∞ . 2≤t ≤n Proof. Introduce c = max2≤t ≤n kπt − πt −1 k1,∞ . Now, π π π π π kµn − µstn k1 ≤ kµn − µstn−1 k1 + kµstn−1 − µstn k1 ≤ e −1/τ kµn−1 − µstn−1 k1 + τc , π π where we used that by the previous lemma kµstn−1 − µstn k1 ≤ τkπn−1 − πn k1,∞ ≤ τc. Continuing recursively we get π π kµn − µstn k1 ≤ e −1/τ {e −1/τ kµn−2 − µstn−2 k1 + τc} + τc .. . π ≤ e −(n−1)/τ kµ1 − µst1 k1 + τc{1 + e −1/τ + . . . + e −(n−2)/τ } π ≤ e −n/τ kµ0 − µst1 k1 + τ(τ + 1)c ≤ 2e −n/τ + τ(τ + 1)c , thus, finishing the proof. We are now ready to prove Theorem 4.1. π Proof of Theorem 4.1. Applying Lemma 4.5 to kνt − µstt k1 we get π kνt − µstt k1 ≤ 2e −t /τ + τ(τ + 1)K 0 , where K 0 is a bound on max2≤t ≤n kπt − πt −1 k1,∞ .5 Therefore, t X ρ t − RbT ≤ 2{e −1/τ + e −2/τ + . . .} + τ(τ + 1)K 0 T ≤ 2τ + τ(τ + 1)K 0 T . t =1 Thus, it remains to find an appropriate value for K 0 . It is a well known property of the EWA that π kπt (·|x) − πt −1 (·|x)k1 ≤ ηkq t t (·, x)k∞ (see, e.g, Theorem 8.13 and Exercise 8.1 in Bartók et al. π π [16]). Thus, kπt − πt −1 k1 ≤ ηkq t t k∞ . Now, by Lemma 4.3, kq t t k∞ ≤ 2τ + 4, showing that K 0 = 5 Lemma 5.2 of Even-Dar et al. [33] gives a bound on kν − µπt k . However, there are two mistakes in the proof t st 1 p of Lemma 5.2. One of the mistakes is that Assumption 3.1 states that K 0 = ln |A|/T , whereas since the range of the action-value functions scales with τ, K 0 should also scale with τ. (Unfortunately, in [81] we committed the same mistake.) 
This results in that the multiplier 2τ2 in the first term of the bound of Lemma 5.2 should be replaced by 2τ2 K 0 T , where K 0 is a bound on how fast the policies change (which does depend on τ). The second small mistake is that in the third displayed equation, the multiplier of kd 1 − d πt k1 should be e −(t −1)/τ as can be easily checked by considering t = 1. Finally, we note that the proof in Lemma 5.2 uses a different decomposition than our bound, leading to a small improvement compared to our bound in terms of the leading constant of the bound. We choose to present this alternate decomposition as we find it somewhat cleaner and it also gave us the opportunity to present the interesting Lemma 4.4. 67 η(2τ + 4) is suitable. Putting together the inequalities obtained, we get T X ρ t − RbT ≤ 2τ + (2τ + 4)τ(τ + 1)ηT . t =1 Combining (4.4), (4.7) and this last bound, we obtain R Tπ − RbT µ ¶ 13 2 ln |A| 3 + ηT 2τ + τ + 6τ + 2 . = 4τ + 2 + η 2 Setting s η= ln |A| ¡ ¢, 3 2 T 2τ + 13 2 τ + 6τ + 2 we get the bound stated in the theorem. 4.4 Bandit O-MDP Our algorithm, MDP-Exp3 , shown as Algorithm 4.2, is inspired by MDP-E, while also borrowing ideas from the Exp3 algorithm of Auer et al. [10]. The main idea of the algorithm is Algorithm 4.2 MDP-Exp3 : an algorithm for online learning in MDPs with bandit information 1. Set γ ∈ (0, 1], η > 0, N ≥ 1, and w1 (x, a) = w2 (x, a) = · · · = w2N (x, a) = 1 for all (x, a) ∈ X × A. 2. For t = 1, 2, . . . , T , repeat: (a) For all (x, a) ∈ X × A, let wt (x, a) γ πt (a|x) = (1 − γ) P + . w (x, b) | A | b t (b) Draw an action at ∼ πt (·|xt ). (c) Receive reward r t (xt , at ) and observe xt +1 . (d) If t ≥ N + 1 i. Compute µN t (x) for all x ∈ X using (4.10). ii. Construct estimates r̂t based on (xt , at ) and r t (xt , at ) using (4.8) and compute q̂t using (4.13). iii. For all (x, a) ∈ X × A, set wt +N (x, a) = wt +N −1 (x, a)e ηq̂t (x,a) . to construct estimates q̂t of the action-value functions qt , which are then used to determine the action-selection probabilities πt (·|x) in each state x in each time step t . In particular, the probability of selecting action a in state x at time step t is computed as the mixture of the uniform distribution (which encourages exploring actions irrespective of what the algorithm has learned about the action-values) and a Gibbs distribution, the mixture parameter being γ > 0. Given a state x, the Gibbs distribution defines the probability of choosing action a to be propor68 tional to exp(η Pt −N s=N q̂t (x, a)). 6 Here, η > 0, N > 0 are further parameters of the algorithm. Note that for the single-state setting with N = 1, MDP-Exp3 is equivalent to the Exp3 algorithm of Auer et al. [10]. It is interesting to discuss how the Gibbs exploration policy is related to what is known as the Boltzmann-exploration policy in the reinforcement learning literature [e.g., 92]. Remember that given a state x, the Boltzmann-exploration policy would select action a with probability proportional to exp(ηq̂∗t (x, a)) for some estimate q̂∗t of the optimal action-value function in the MDP (P, r t ). Thus, we can see a couple of differences between the Boltzmann exploration and our Gibbs policy. The first difference is that the Gibbs policy in our algorithm uses the cumulated sum of the estimates of action-values, while the Boltzmann policy uses only one estimate, the latest one. By relying on the sum, the Gibbs policy will rely less on the last estimate. This reduces how fast the policies can change, making the learning “smoother”. 
Another difference is that in our Gibbs policy the sum of previous action-values runs only up to step t − N instead of using the sum that runs up to the last step t . The reasons for doing this will be explained below. Finally, the Gibbs policy uses the action-value function estimates of the policies {πt } followed by the algorithm, as opposed to using an estimate of the optimal action value function. This makes our algorithm closer in spirit to (modified) policy iteration than to value iteration and is again expected to reduce the variance of the learning process. The reason the Gibbs policy does not use the last N estimates is to allow the construction of a reasonable estimate q̂t of the action-value function qt as this will be explained in more details shortly. If r t was available, one could compute qt based on r t (cf. (4.2)) and the sum could then run up to t , resulting in the algorithm MDP-E. Since in our problem r t is not available, we estimate it using an importance sampling estimator. To define this estimator define µN t (x) as the probability of visiting state x at time step t , conditioned on the history ut −N up to time step t − N , including xt −N and at −N (cf. (4.1) for the definition of {ut }): def µN t (x) = P [xt = x | ut −N ] , x ∈ X, . Then, the estimate of r t is constructed using r̂t (x, a) = r t (x,a) , πt (a|x)µN t (x) 0, if (x, a) = (xt , at ) ; (4.8) otherwise. Note that r̂t is defined only for t ≥ N + 1 (to allow using xt −N ), which explains why the sum in the definition of the Gibbs policy runs from time N + 1. The importance sampling estimator (4.8) is well-defined only if for x = xt , µN t (x) > 0 (4.9) 6 In the algorithm the Gibbs action-selection probabilities are computed in an incremental fashion with the help of the “weights” wt (x, a). Note that a numerically stable implementation would calculate the action-selection probP −N P −N abilities based on the relative value differences, ts=N q̂t (x, ·) − maxa∈A ts=N q̂ (x, ·). These relative value differ+1 t ences can also be updated incrementally. The form shown in Algorithm 4.2 is preferred for mathematical clarity. 69 holds almost surely (by construction πt (·|xt ) > γ/|X | > 0). An essential step of our proof is to show that inequality (4.9) indeed holds. In fact, we will show that this inequality holds almost surely (a.s.) for all x ∈ X provided that N is large enough. This will be done by first showing that the policies πt change “slowly” (this is where it becomes useful that the Gibbs policy is defined using a sum of previous action values), from which inequality (4.9) will follow thanks to Assumptions M1 and M2. This will be shown in Lemma 4.12. To see the intuitive reason of why (4.9) holds, it is instructive to look into how the distribution µN t can be computed. Denote by P a the transition probability matrix of the policy that selects action a in every state and recall that e x denotes the x th unit row vector of the canonical basis of the |X|dimensional Euclidean space. Viewing µN t as a row vector, we may write at −N πt −N +1 µN P · · · P πt −1 . t = e xt −N P (4.10) This holds because it follows from the form of the algorithm that xt −N , at −N , πt ∈ σ(ut −N ). (4.11) Thus, we also have that πt , πt −1 , . . . , πt −N ∈ σ(ut −N ) and therefore (4.10) follows from the law of total probability. Let us return to the discussion of why µN t (x) is bounded away from zero. 
If the policies during the last N steps change sufficiently slowly then they will all be “quite close” to the policy of the last time step. Then, the expression on the right-hand side of (4.10) can be seen to be close to the N -step state distribution of πt when starting from (x, a), which, if N is large enough, will be close to the stationary distribution of πt as it was explained in Section 2.3.2. π Since by Assumption M2, minx∈X µstt (x) ≥ α0 > 0 then, by choosing the algorithm’s parameters 0 appropriately, we can show that µN t (x) ≥ α /2 > 0 holds for all x ∈ X (specifically, this is shown in Lemma 4.12). It remains to be seen that the estimate r̂t is meaningful. In this regard, we claim that E [ r̂t (x, a)| ut −N ] = r t (x, a) a.s. (4.12) holds for all (x, a) ∈ X × A.7 First note that E [r̂t (x, a) | ut −N ] = r t (x, a) πt (a|x)µN t (x) £ ¤ E I{(x,a)=(xt ,at )} | ut −N , where we have exploited that πt , µN t , xt −N , at −N ∈ σ(ut −N ). Now, £ ¤ E I{(x,a)=(xt ,at )} | ut −N = P [at = a | xt = x, ut −N ] P [xt = x | ut −N ] 7 In what follows, for the sake of brevity, unless otherwise stated, we will omit “a.s.” from probabilistic statements. 70 and since πt ∈ σ(ut −N ), P [at = a | xt = x, ut −N ] = P [at = a | xt = x, ut −1 ] = πt (a|x), where the last equality follows from the definition of πt and at . Finally, πt −N +1 , . . . , πt −1 ∈ σ(ut −N ) and so P [xt = x | ut −N ] = P [xt = x | ut −N , πt −N +1 , . . . , πt −1 ] = µN t (x). Putting together the equalities obtained, we get (4.12). Given r̂t , the estimate q̂t of the action-value function qt is defined as the action-value function underlying policy πt in the the average-reward MDP given by the transition probability kernel P and reward function r̂t . Thus, q̂t can be computed as the solution to the Bellman equations q̂t (x, a) = r̂t (x, a) − ρ̂ t + X P (x 0 |x, a)v̂t (x 0 ) , v̂t (x) = πt (a|x)q̂t (x, a) , a x0 ρ̂ t = X X x,a π µstt (x)πt (a|x)r̂t (x, a) , (4.13) which hold simultaneously for all (x, a) ∈ X × A. By linearity of expectation, it then follows that E[ρ̂ t |ut −N ] = ρ t , and, hence, by the linearity of the Bellman equations and the uniqueness of the solutions of the equations defining qt and vt (see, e.g., Corollary 8.2.7 in [86]), we have, for all (x, a) ∈ X × A, E[q̂t (x, a)|ut −N ] = qt (x, a), and E[v̂t (x)|ut −N ] = vt (x). (4.14) As a consequence, we also have, for all (x, a) ∈ X × A, t ≥ N + 1, £ ¤ E[ρ̂ t ] = E ρ t , £ ¤ E[q̂t (x, a)] = E qt (x, a) , and E[v̂t (x)] = E [vt (x)] . (4.15) The main result of the chapter is the following bound concerning the performance of MDPExp3 . Theorem 4.2. Let the transition probability kernel P satisfy Assumptions M1 and M2. Fix T > 0 and let N = dτ ln T e+1. Then for an appropriate choice of the parameters η and γ (which depend on |A|, T, α0 , τ), for any sequence of reward functions {r t } taking values in [0, 1], the regret of the algorithm MDP-Exp3 can be bounded as s bT ≤ L T ln |A|(2τ + 2) µ ¶ p 4|A| (e − 2) + (e − 2)(τ + 2)) + (4τ + 5)| A |N + + o( T ). (2N α0 4N Remark 4.3. The above bound as well as the bound (4.5) do not depend directly on the number of states, |X|, but the dependence appears implicitly through τ only. Even-Dar et al. [33] also note that a tighter bound, where only the mixing times of the actual policies chosen appear, can be derived. However, it is unclear whether in the worst-case this could be used to improve the bound. Similarly to (4.5), our bound depends on |X| through other constants. In our case, these 71 are α0 and τ. 
Comparing the theorems it seems that the main price of not seeing the rewards seems to be the appearance of |A| instead of ln |A| (a typical difference between the bandit and full observation cases) and the appearance of a 1/α0 term in the bound. Now we present the proof of Theorem 4.2. Once again, consider the decomposition 4.3: à R Tπ − RbT = R Tπ − T X ρ πt ! à t =1 + T X ρ πt − T X ! à ρt + ! ρ t − RbT . t =1 t =1 t =1 T X Again, the first term is handled by Lemma 4.2. Thus, it remains to bound the expectation of the other two terms. This is done in the following two propositions whose proofs are deferred to the next sections: Proposition 4.1. Let ¡ ¢ 4τ + 5 + γ(2τ + 3) 4(e − 1) ¡ ¢ ³ ´, c = η max 2τ + 4 + γ(2τ + 2) 0 1 − 2η|A| 1 + 2τ + 2 α α0 γ 4 α0 and assume that γ ∈ (0, 1) (4.16) |A|(N − 1/2) ≥ 1, η≤ (4.17) α ¢³ 1 1 0 4|A| N − 2 ¡ c τ(τ + 2) < α0 /2, ½ µ N ≥ max τ ln γ + 2τ + 2 ´, (4.18) (4.19) ¶ ¾ 4 , τ ln T + 1. α0 − 2cτ(τ + 2) (4.20) Then, for any policy π, we have µ ¶ T X £ π ¤ 40 ln |A| E ρ t − ρ t ≤ N 2τ + 10 + T η c |A|(τ + 2)(e − 2)(2τ + 3) + α η t =1 µ µ ¶¶ 0 2 c + 2(τ + 2) T γ + c N |A| + η (e − 2) |A| 2τ + 4 + N . α γ Proposition 4.2. Assume that (4.19) and (4.20) hold. Then, T X £ ¤ E ρ t − RbT ≤ T c τ(τ + 2) + 2Te −(N −1)/τ + 2N . (4.21) t =1 Remark 4.4. Note that setting N > dτ ln T e, the second term in (4.21) becomes O(1). Also, if T is large enough, the choices of N , η and γ in Theorem 4.2 will satisfy the conditions of Proposition 4.1. Putting together the bounds obtained so far allows us to finish the proof of the theorem: Proof of Theorem 4.2. Set γ = K η for some positive constant K > 0. The remaining work is to 72 choose K so that all the assumptions in Propositions 4.1 and 4.2 stated about the parameters hold. First we have to ensure that the upper bound η≤ α0 ¢³ 1 1 ¡ 4|A| N − 2 K η + 2τ + 2 ´ ¡ ¢ 8|A| N − 1 2 is positive. Setting K = takes care of this problem. Now note that by our assumption α0 p ¡ ¢ 0 on η, we have c ≤ η α8 4τ + 5 + γ (2τ + 2) Collecting all o( T ) terms, we get µ µ ¶¶ 80 20 N b LT ≤ ηT (2τ + 2) K + (4τ + 5) (|A|N + τ(τ + 2)) + (e − 2)|A| 2τ + 4 + α α K p ln |A| + + o( T ) . η Plugging in the definition of K yields 4|A|N + (16τ + 20) (|A|N + τ(τ + 2)) + (e − 2)|A| (2τ + 4) e − 2 + α0 4N p ln |A| + + o( T ) . η b T ≤ ηT (2τ + 2) L µ ¶ Thus choosing v u u η=t ln |A| T (2τ + 2) ³ 4|A|N +(16τ+20)(|A|N +τ(τ+2))+(e−2)|A|(2τ+4) α0 + e−2 4N ´ gives the result as s 4|A|N + (16τ + 20) (|A|N + τ(τ + 2)) + (e − 2)|A| (2τ + 4) e − 2 T ln |A|(8τ + 8) + α0 4N p + o( T ) . µ ¶ bT ≤ L 4.5 The proofs of Propositions 4.1 and 4.2 This section provides proofs for the key propositions of the previous section. Throughout the section we suppose that both Assumptions M1 and M2 hold for P . The first half of the section provides technical tools needed to prove both propositions. 4.5.1 The change rate of the learner’s policies We proceed with a series of lemmas to control the rate of change of the policies generated by MDP-Exp3 . 73 0 all states x. Then, for any Lemma 4.6. Let N < t ≤ T and assume that µN t (x) ≥ α /2 holds for ³ ´ 2 2 1 (x, a) ∈ X × A, we have |v̂t (x)| ≤ 4τ+4 and − (2τ + 3) ≤ q̂ (x, a) ≤ + 2τ + 2 . 0 0 0 t α α α γ Proof. Since | P N 0 a πt (a|x)r̂t (x, a)| ≤ 1/µt (x) ≤ 2/α by assumption and v̂t = v πt , the first state- ment of the lemma follows from Lemma 4.3. To prove the bounds on q̂t , also notice that 0 ≤ ρ̂ t = X x,a π µstt (x) πt (a|x)r̂t (x, a) = X x π µstt (x) X πt (a|x)r̂t (x, a) ≤ a 2 . 
α0 Applying the above inequalities to the Bellman equations (4.13) we obtain q̂t (x, a) = r̂t (x, a) − P 2 2 ρ̂ t + x 0 P (x 0 |x, a)v̂t (x 0 ) ≥ − α20 − 4τ+4 α0 ´ = − α0 (2τ + 3). Using rt (x, a) ≤ γα0 we get the upper bound ³ 4τ+4 2 = α20 γ1 + 2τ + 2 . q̂t (x, a) ≤ γα 0 + α0 The previous result can be strengthened if one is interested in a bound on E [ |v̂t (x)| | ut −N ]: Lemma 4.7. Let N < t ≤ T and assume that µN t (x) > 0 holds for all states x. Then, for any x ∈ X, we have E [ |v̂t (x)| | ut −N ] ≤ 2(τ + 1). Proof. Proceeding as in the proof of Lemma 4.3 and then taking expectations, we get E [ |v̂t (x)| | ut −N ] ≤ ∞ X X s=1 x 0 |νπs,x (x 0 ) − µπst (x 0 )| E · X a ¯ ¸ ¯ ¯ πt (a|x )r̂t (x , a) ¯ ut −N , 0 0 where we have exploited that r̂t takes only nonnegative values. Now, by (4.11) and (4.12), E · X a ¯ ¸ X ¯ ¯ £ ¤ πt (a|x 0 )r̂t (x 0 , a) ¯¯ ut −N = πt (a|x 0 )E r̂t (x 0 , a) ¯ ut −N a = X πt (a|x 0 )r t (x, a), a which is bounded between 0 and 1. Hence, E [ |v̂t (x)| | ut −N ] ≤ P∞ P s=1 π 0 π 0 x 0 |νs,x (x ) − µst (x )|. Now, finishing as before we get the statement. ¯ £ ¤ Similarly, we will also need a bound on the expected value of E |q̂t (x, a)| ¯ ut −N . This is bounded as follows: Lemma 4.8. Let N < t ≤ T and assume that µN t (x) > 0 holds for all states x. Then, for any ¯¯ £¯ ¤ (x, a) ∈ X × A, we have E ¯q̂t (x, a)¯ ¯ ut −N ≤ 2(τ + 2). Proof. By the Bellman equations (4.13), ¯ ¯ ¯ ¤ X £ ¤ £ ¤ £ E |q̂t (x, a)| ¯ ut −N ≤ E [ |r̂t (x, a)| | ut −N ] + E |ρ̂ t | ¯ ut −N + P (x 0 |x, a)E |v̂t (x 0 )| ¯ ut −N . x0 ¯ £ ¤ As before, E [ |r̂t (x, a)| | ut −N ] ≤ 1, and also E |ρ̂ t | ¯ ut −N ≤ 1. Combining these with the result of the previous Lemma, we get the desired statement. 74 The quantity πt (x, a)|q̂t (x, a)| also enjoys a bound which is independent of the exploration rate γ: 0 Lemma 4.9. Let N < t ≤ T and assume that µN t (x) > α /2 holds for all states x. Then, for any ¯ ¯ 4 (x, a) ∈ X × A, it holds that πt (x, a) ¯q̂t (x, a)¯ ≤ 0 (τ + 2). α Proof. By assumption and the construction of r̂t (x, a), πt (x, a)|r̂t (x, a)| ≤ 2 . α0 Thus, we can apply Lemma 4.3 with R = 2/α0 to obtain |q̂t (x, a)| ≤ (4.22) 2 α0 (2(τ + 1) + 1) + |r̂t (x, a)|. Multiplying both sides by πt (x, a) and using (4.22) again finishes the proof. Now we show that if the policies that we follow up to time step t change slowly, µN t is “close” π to µstt : Lemma 4.10. Let 1 ≤ N < t ≤ T and c > 0 be such that kπs+1 − πs k1,∞ ≤ c holds for 1 ≤ s ≤ t − 1. ° π ° Then we have °µN − µ t ° ≤ c τ(τ + 2) + 2e −(N −1)/τ t st 1 Proof. This follows directly from Lemmas 4.4, 4.5 and the recursive form (4.10) for µN t . Indeed, at −N πt −N +1 this recursive form shows that µN P · · · P πt −1 . Thus, we write t = e xt −N P X¯ N ¯ ¯µ (x) − µπt (x)¯ = kµN − µπt k1 ≤ kµN − µπt −1 k1 + kµπt −1 − µπt k1 . t t t st st st st st x Now, according to Lemma 4.5, the first term can be bounded by 2e −(N −1)/τ + τ(τ + 1)c, while by Lemma 4.4, the second is bounded by τc. Adding these up gives the claimed bound. In the next two lemmas we compute the rate of change of the policies produced by MDPExp3 and show that for a large enough value of N , µN t can be uniformly bounded from below by α0 /2. 0 0 0 Lemma 4.11. Assume that for all N + 1 ≤ t ≤ T , µN t (x ) ≥ α /2 holds for all states x , γ ≤ 1, |A|(N − 1/2) ≥ 1, and η ≤ α³ ´ ¢ ¡ 4|A| N − 12 γ1 +2τ+2 0 Set ¡ ¢ 4τ + 5 + γ(2τ + 3) 4(e − 1) ¡ ¢ ´, ³ c = η max 2τ + 4 + γ(2τ + 2) . 0 1 − 2η|A| 1 + 2τ + 2 α α0 γ 4 α0 Then kπt +N −1 − πt +N k1,∞ ≤ c. (4.23) Proof. 
Fix some state-action pair (x, a) ∈ X × A. We will prove the statement by induction on t . To prove the bound for time step t assume that kπs+N −1 −πs+N k1,∞ holds for all s = 1, 2, . . . , t −1. As πs+N −1 = πs+N for all s = 1, 2, . . . , N , the assumption holds for t = 1, . . . , N and we are left with proving the induction step for t > N . 75 We start with the simple bound ¯ ¯ ¯ wt +N −1 (x, a) wt +N (x, a) ¯ ¯ |πt +N −1 (a|x) − πt +N (a|x)| =(1 − γ) ¯¯ − Wt +N −1 Wt +N (x) ¯ ¯ ¯ wt +N (x, a) Wt +N −1 (x) ¯¯ wt +N −1 (x, a) ¯¯ 1 − =(1 − γ) Wt +N −1 (x) ¯ wt +N −1 (x, a) Wt +N (x) ¯ ¯ ¯ ¯ wt +N −1 (x, a) ¯¯ ηq̂t (x,a) Wt +N −1 (x) ¯ =(1 − γ) 1−e ¯ Wt +N −1 (x) Wt +N (x) ¯ ¯ ¯ ¯ ¯ ηq̂t (x,a) Wt +N −1 (x) ¯ ¯ . ≤πt +N −1 (a|x) ¯1 − e Wt +N (x) ¯ (4.24) From now on, we examine two separate cases depending on the sign of the expression in the absolute value on the right-hand side. −1 (x) Case a) 1 − e ηq̂t (x,a) WWt +N ≥ 0: Using 1 − e z ≤ −z (that holds for all z ∈ R), we get t +N (x) ¯ ¯ ¯ ¯ ¯1 − e ηq̂t (x,a) Wt +N −1 (x) ¯ ≤ − ηq̂t (x, a) + ln Wt +N (x) ≤ 2η (2τ + 3) − ln Wt +N −1 (x) , ¯ Wt +N (x) ¯ Wt +N −1 (x) α0 Wt +N (x) where the second step holds by the lower bound on q̂t (x, a) given in Lemma 4.6. Applying Jensen’s inequality, the second term can be bounded as ! à ! à γ X wt +N (x, b) −ηq̂ (x,b) X πt +N (b|x) − |A| Wt +N −1 (x) t ln q̂t (x, b) = ln e ≥ −η Wt +N (x) 1−γ b Wt +N (x) b X η X ηγ =− q̂t (x, b) πt +N (b|x)q̂t (x, b) + 1−γ b |A|(1 − γ) b η X 2ηγ ≥− (2τ + 3) , πt +N (b|x)q̂t (x, b) − 0 1−γ b α (1 − γ) where the last step follows again by Lemma 4.6. Combining with (4.24) we have |πt +N −1 (a|x) − πt +N (a|x)| ≤ X 2η 2ηγ (2τ + 3) + η πt +N (b|x)q̂t (x, b) + 0 (2τ + 3) 0 α α b t +N X X−1 X¡ ¢ 2η (1 + γ)(2τ + 3) + η π (b|x)q̂ (x, b) + η πs+1 (b|x) − πs (b|x) q̂t (x, b) t t 0 α s=t b b µ ¶ 4η 2η 2η(N − 1)|A|c 1 ≤ 0 (1 + γ)(2τ + 3) + 0 (τ + 1) + + 2τ + 2 α α α0 γ µ ¶ X 2η 1 |πt +N −1 (b|x) − πt +N (b|x)| , + 0 + 2τ + 2 α γ b = b πt (b|x)q̂t (x, b) P and q̂t (x, b) from P Lemma 4.6 and the induction hypothesis. Taking maximum over a and using the bound b |πt +N −1 (b|x)− where the last inequality holds by the bounds on v̂t (x) = πt +N (b|x)| ≤ |A| maxa |πt +N −1 (a|x) − πt +N (a|x)|, we get after reordering max|πt +N −1 (a|x) − πt +N (a|x)| ≤ c 2η(N −1)|A| 1 α0 γ ³ a 76 ´ ¢ 2η ¡ + 2τ + 2 + α0 4τ + 5 + γ(2τ + 3) ³ ´ . 2η|A| 1 − α0 γ1 + 2τ + 2 From here the desired maxa |πt +N −1 (a|x) − πt +N (a|x)| ≤ c follows thanks to 2η(N −1)|A| 1 α0 γ ³ ´ + 2τ + 2 1 ³ ´ ≤ , 2η|A| 1 2 1 − α0 γ + 2τ + 2 ¢ 2η ¡ α0 4τ + 5 + γ(2τ + 3) ³ ´ 2η|A| 1 − α0 γ1 + 2τ + 2 and ≤ c , 2 where the first inequality follows by our assumption on η, while the second follows from the definition of c. This finishes the proof of case (a). −1 (x) ≤ 0: First notice that the logarithm of the second term is positive by Case b) 1 − e ηq̂t (x,a) WWt +N t +N (x) −1 (x) the condition, that is, ηq̂t (x, a) + ln WWt +N ≥ 0. Furthermore, it is bounded from above by 1 t +N (x) since ln Wt +N −1 (x) Wt +N (x) = − ln X wt +N −1 (x, a) b Wt +N −1 (x) e ηq̂t (x,b) ≤ −η X wt +N −1 (x, b) b Wt +N −1 (x) q̂t (x, b) (4.25) by Jensen’s inequality, and so ηq̂t (x, a) + ln X wt +N −1 (x, b) Wt +N −1 (x) q̂t (x, b) ≤ ηq̂t (x, a) − η Wt +N (x) b Wt +N −1 (x) µ ¶ µ ¶ 2η 1 2η 1 ≤ 0 2τ + 3 + + 2τ + 2 = 0 4τ + 5 + ≤ 1, α γ α γ where the second inequality holds by Lemma 4.6, and the third one by our assumption on η and |A|(N − 1/2) ≥ 1. 
Thus, using e z − 1 ≤ (e − 1)z, which holds for any 0 ≤ z ≤ 1, we get ¯ ¯ µ ¶ ¯ ¯ ¯1 − e ηq̂t (x,a) Wt +N −1 (x) ¯ ≤ (e − 1) ηq̂t (x, a) + ln Wt +N −1 (x) . ¯ W (x) ¯ W (x) t +N t +N Combining with (4.24), (4.25) and the bounds on q̂t provided in Lemma 4.6, we get |πt +N −1 (a|x) − πt +N (a|x)| ≤ η(e − 1)πt +N −1 (a|x)q̂t (x, a) µ ¶ X 2ηγ(e − 1) 1 − η(e − 1) πt +N −1 (b|x)q̂t (x, b) + + 2τ + 2 α0 γ b µ ¶ X 2ηγ(e − 1) 1 = −η(e − 1) + 2τ + 2 πt +N −1 (b|x)q̂t (x, b) + α0 γ b6=a = −η(e − 1) X πt (b|x)q̂t (x, b) − η(e − 1) t +N X−2 s=t b6=a X¡ ¢ πs+1 (b|x) − πs (b|x) q̂t (x, b) b6=a µ ¶ 2ηγ(e − 1) 1 + 2τ + 2 (telescoping) + α0 γ µ ¶ 2η(e − 1) 2η(e − 1)(N − 1)(|A| − 1) 1 ≤ (2τ + 3) + c + 2τ + 2 + α0 α0 γ µ ¶ 2ηγ(e − 1) 1 + 2τ + 2 (induction) α0 γ µ ¶ ¢ 2η(e − 1) ¡ 2η(e − 1)(N − 1)(|A| − 1) 1 = 2τ + 4 + γ(2τ + 2) + c + 2τ + 2 . α0 α0 γ The definition of c and the assumption on η imply that both the first and the second terms are 77 at most c/2, respectively, finishing the proof of the lemma. Lemma 4.12. Let c be as in Lemma 4.11. Assume that cτ(τ + 2) < α0 /2, and let » µ 4 N > τ ln 0 α − 2cτ(τ + 2) ¶¼ . (4.26) 0 Then, for all N < t ≤ T , x ∈ X, we have µN t (x) ≥ α /2 and kπt +1 − πt k1,∞ ≤ c. Proof. We prove the lemma by induction on t . The induction hypothesis is that for N + 1 ≤ t ≤ P 0 T , minx µN s (x) ≥ α /2 and maxx a |πs+1 (a|x) − πs (a|x)| ≤ c hold for all N + 1 ≤ s ≤ t . Let us first show that this hypothesis holds when N + 1 ≤ t ≤ 2N − 1. By the construction P of the policies, we have maxx a |πt +1 (a|x) − πt (a|x)| = 0 ≤ c for all 1 ≤ t ≤ 2N − 1. Thus, by π −(N −1)/τ t Lemma 4.10, we get that kµN holds for all N + 1 ≤ t ≤ 2N − 1. By t − µst k1 ≤ cτ(τ + 2) + 2e our assumption about N , we have cτ(τ + 2) + 2e −(N −1)/τ ≤ α0 /2, (4.27) thus for any N + 1 ≤ t ≤ 2N − 1, π π N 0 t t kµN t − µst k∞ ≤ kµt − µst k1 ≤ α /2. (4.28) π Since, by assumption, µπst (x 0 ) ≥ α0 holds for any stationary policy π, we also have µstt (x 0 ) ≥ α0 . 0 This, together with (4.28) gives that µN t (x) ≥ α /2 holds for any x ∈ X. Now, fix a time index 2N ≤ t ≤ T and assume that the induction hypothesis holds for time 0 t − 1. Then, thanks to minx µN t −N +1 (x) ≥ α /2, Lemma 4.11 implies kπt +1 − πt k1,∞ ≤ c. Now, by π −(N −1)/τ t Lemma 4.10, we have kµN Using the same reasoning as above, t − µst k1 ≤ cτ(τ + 2) + 2e we finish the inductive step and thus the proof. 4.5.2 Proof of Proposition 4.1 The statement is trivial for T ≤ N . Therefore, in what follows assume that T > N . For every x, a P P define QT (x, a) = Tt=N +1 qt (x, a) and VT (x) = Tt=N +1 vt (x). The preceding lemma shows that in order to prove Proposition 4.1, it suffices to prove an upper bound on E [QT (x, a) − VT (x)]. l ³ ´m 4 Lemma 4.13. Let c be as in Lemma 4.11. Assume that γ ∈ (0, 1), cτ(τ+2) < α0 /2, N ≥ τ ln α0 −2cτ(τ+2) , 0<η≤ α0 2(1/γ +2τ+3) , and T > N hold. Then, for all (x, a) ∈ X × A, µ ¶ 40 ln |A| E [QT (x, b) − VT (x)] ≤ N 4(τ + 2) + T η c |A|(τ + 2)(e − 2)(2τ + 3) + α η µ µ ¶¶ 0 2 c + 2(τ + 2) T γ + c N |A| + η (e − 2) |A| 2τ + 4 + N . α γ Proof. We follow the steps of the proof in Auer et al. [10]. Fix (x, a) ∈ X × A. For any 1 ≤ t ≤ T P define Wt (x) = a wt (x, a). First, note that since the conditions of Lemma 4.12 are satisfied, 78 hence, the conclusions of this lemma, as well as those of Lemmas 4.6–4.9 hold. In particular, by def 4 α0 Lemma 4.9, πt (a|x)|q̂t (x, a)| ≤ B q = (τ + 2) holds for any s ≥ N + 1. 
Now, using Lemma 4.3 and the Bellman equations, we get that q̂t (x, a) ≤ 20 α (1/γ + 2τ + 3), thus by the constraint on η, ηq̂t (x, a) ≤ 1. Fix 2N ≤ t ≤ T − 1. We have the following inequalities: Wt +1 (x) X wt +1 (x, a) X wt (x, a) ηq̂t −N +1 (x,a) X πt (a|x) − γ/|A| ηq̂t −N +1 (x,a) = = e = e Wt (x) Wt (x) 1−γ a a Wt (x) a X πt (a|x) − γ/|A| ³ ¡ ¢2 ´ ≤ 1 + ηq̂t −N +1 (x, a) + (e − 2) ηq̂t −N +1 (x, a) 1−γ a (as ηq̂t −N +1 (x, a) ≤ 1) ≤ 1+ η2 (e − 2) X η X πt (a|x)q̂t −N +1 (x, a) + πt (a|x)(q̂t −N +1 (x, a))2 . 1−γ a 1−γ a Now rewrite the last term as X a πt (a|x)(q̂t −N +1 (x, a))2 = X a + πt −N +1 (a|x)(q̂t −N +1 (x, a))2 X a (πt (a|x) − πt −N +1 ) (q̂t −N +1 (x, a))2 By Lemma 4.9, the first term can be bounded as follows: X a πt −N +1 (a|x)(q̂t −N +1 (x, a))2 ≤ X¯ ¯ 40 (τ + 2) ¯q̂t −N +1 (x, a)¯ , α a while the second one is bounded by X a µ ¶ X¯ ¯ 2 1 ¯q̂t −N +1 (x, a)¯ , + 2τ + 3 (πt (a|x) − πt −N +1 ) (q̂t −N +1 (x, a)) ≤ c N 0 α γ a 2 where we have used the bound on the change rate of the policies and the upper bound on |q̂t (x, a)|. Thus, using the notation B q0 µ ¶ 40 2 1 = (τ + 2) + c N 0 + 2τ + 3 , α α γ we get X a Defining v̂N t (x) = P πt (a|x)(q̂t −N +1 (x, a))2 ≤ B q0 X¯ ¯ ¯q̂t −N +1 (x, a)¯ . a a πt (a|x)q̂t −N +1 (x, a), we obtain B q0 η2 (e − 2) X ¯ ¯ Wt +1 (x) η N ¯q̂t −N +1 (x, a)¯ . ≤ 1+ v̂t (x) + Wt (x) 1−γ (1 − γ) a 79 Using 1 + x ≤ e x and then taking logarithms gives ln B q0 η2 (e − 2) X ¯ ¯ Wt +1 (x) η N ¯q̂t −N +1 (x, a)¯ . ≤ v̂t (x) + Wt (x) 1−γ (1 − γ) a Summing over t = 2N , 2N + 1, . . . , T we get ln with V̂TN (x) = PT t =2N B q0 η2 (e − 2) T −N X+1 X ¯ ¯ WT +1 (x) η ¯q̂t (x, a)¯ ≤ V̂TN (x) + W2N (x) 1−γ (1 − γ) t =N +1 a (4.29) v̂N t (x). On the other hand, for any action b we have ln T −N X+1 wT +1 (x, b) WT +1 (x) ≥ ln =η q̂t (x, b) − ln |A|, W2N (x) W2N (x) t =N +1 where we used that w2N (x, a) = 1 holds for all a ∈ A. Combining with (4.29), we get V̂TN (x) ≥ (1 − γ)Q̂TN (x, b) − where Q̂TN (x, b) = PT −N +1 t =N +1 T −N X+1 X ¯ ¯ ln |A| ¯q̂t (x, a)¯ , − B q0 η(e − 2) η t =N +1 a (4.30) q̂t (x, b). Let us now bound the difference of V̂TN (x) and V̂T (x) = T X V̂TN (x) = πt (a|x)q̂t (x, a). t =N +1 a t =N +1 Note that T X X v̂t (x) = T −N X+1 X t =N +1 a πt +N −1 (a|x)q̂t (x, a). Therefore, V̂TN (x) − V̂T (x) ≤ ≤ T −N X+1 X t =N +1 a T −N X+1 ¯ ¯ ¯ ¯ |q̂t (x, a)|¯πt +N −1 (a|x) − πt (a|x)¯ + T X ≤Nc t =N +1 ° ° q̂t (x, ·) ° + ∞ T X X πt (a|x)|q̂t (x, a)| t =T −N +2 a T X X ° ° k πt +N −1 (·|x) − πt (·|x) k1 ° q̂t (x, ·) °∞ + t =N +1 T −N X+1 ° X πt (a|x)|q̂t (x, a)| t =T −N +2 a πt (a|x)|q̂t (x, a)| , t =T −N +2 a where we have used that by Lemma 4.11, k πt +N −1 − πt k1,∞ ≤ N c, that is, the policies change slowly. Taking the expectation of both sides, we get T −N X+1 £° ° ¤ £ ¤ £ ¤ E V̂TN (x) − E V̂T (x) ≤ N c E ° q̂t (x, ·) °∞ + t =N +1 80 T X t =T −N +2 E · X a ¸ πt (a|x)|q̂t (x, a)| . By Lemma 4.8, £ ¤ def E |q̂t (x, a)| ≤ B q00 = 2(τ + 2) (4.31) ° ¤ £° holds for any (x, a) ∈ X × A. Therefore, E ° q̂t (x, ·) °∞ ≤ |A|B q00 . Using again (4.31), E · X a ¯ ¸¸ ¸ · · X ¯ πt (a|x)|q̂t (x, a)| = E E πt (a|x)|q̂t (x, a)|¯¯ ut −N a · ¸ X ¯ £ ¤ ¯ =E πt (a|x) E |q̂t (x, a)| ut −N a · ¸ X 00 ≤E πt (a|x) B q ≤ B q00 . a ¤ £ ¤ £ Hence, E V̂TN (x) ≤ E V̂T (x) + N B q00 (c T |A| + 1). This, together with (4.30) gives ¤ ln |A| £ ¤ £ E V̂T (x) + N B q00 (c T |A| + 1) ≥(1 − γ)E Q̂TN (x, b) − η T −N +1 X X £¯ ¯¤ − B q0 η(e − 2) E ¯q̂t (x, a)¯ . 
(4.32) t =N +1 a ¤ £ By equation (4.15), we have E V̂T (x) = E [VT (x)]and with the definition QTN (x, b) = T −N X+1 qt (x, b), t =N +1 £ ¤ £ ¤ we also have E Q̂TN (x, b) = E QTN (x, b) . Thus, using (4.31) again, we get £ ¤ ln |A| E [VT (x)] + N B q00 (c T |A| + 1) ≥ (1 − γ)E QTN (x, b) − − η(e − 2) B q0 B q00 T |A|. η By reordering the terms, this becomes £ ¤ £ ¤ ln |A| E QTN (x, b) − VT (x) ≤γE QTN (x, b) + N B q00 (c T |A| + 1) + η (4.33) + η(e − 2) B q0 B q00 T |A|. We now lower bound QTN (x, a) by QT (x, a): QT (x, b) − QTN (x, b) = T X qt (x, b) ≤ B q00 N , (4.34) t =T −N +2 where we used that by Lemma 4.3, qt (x, b) ≤ B q00 = 2(τ + 2) 81 (4.35) since the rewards are bounded between 0 and 1. Combining (4.34) with (4.33), we obtain £ ¤ E [QT (x, b) − VT (x)] ≤ γE QTN (x, b) + N B q00 (c T |A| + 1) ³ ´ ln |A| + B q00 η(e − 2) B q0 T |A| + N . + η £ ¤ Using (4.35) again, we get E QTN (x, b) ≤ B q0 T . Thus, E [QT (x, b) − VT (x)] ≤ 2B q00 N + ³ ´ ln |A| + T B q00 γ + N |A|c + η(e − 2) B q0 |A| . η Using the definitions of B q0 and B q00 , we arrive at the final result: ¶ ln |A| 40 E [QT (x, b) − VT (x)] ≤ N 4(τ + 2) + T η c |A|(τ + 2)(e − 2)(2τ + 3) + α η µ µ ¶¶ 0 2 c + 2(τ + 2) T γ + c N |A| + η (e − 2) |A| 2τ + 4 + N . α γ µ The proof of Proposition 4.1 is now easy: Proof of Proposition 4.1. Under the conditions of the proposition, combining Lemmas 4.1-4.13 yields T X X £ ¤ E ρ πt − ρ t ≤ 2N + µπst (x)π(a|x) E [QT (x, a) − VT (x)] t =1 x,a ¶ µ ln |A| 40 ≤ N 2τ + 10 + T η c |A|(τ + 2)(e − 2)(2τ + 3) + α η µ µ ¶¶ 20 c + 2(τ + 2) T γ + c N |A| + η (e − 2) |A| 2τ + 4 + N , α γ proving Proposition 4.1. 4.5.3 Proof of Proposition 4.2 ¤ £ ¤ £P π Let t > N . First, since πt is σ(ut −N )-measurable, E ρ t = E x µstt (x)E [ r t (x, at )| ut −N ] . We also have E [r t (xt , at )] = E [E [ r t (xt , at )| ut −N ]] = E 82 · X x µN t (x) E [ r t (x, at )| ut −N ] ¸ . Hence, · ¸ X π £ ¤ N t E ρ t − r t (xt , at ) = E (µst (x) − µt (x)) E [ r t (x, at )| ut −N ] x · ¯ ¯¸ X¯ π ¯ ≤E (x) ¯µstt (x) − µN ¯ , t x where we have used that r t (x, a) ∈ [0, 1]. ¯ P ¯ π ¯ Thanks to Lemma 4.12, Lemma 4.10 is applicable. Hence, x ¯µstt (x) − µN t (x) ≤ cτ(τ + 2) + £ ¤ 2e −(N −1)/τ , and thus E ρ t − r t (xt , at ) ≤ cτ(τ + 2) + 2e −(N −1)/τ . Summing up these inequalities ¡P £ ¤¢ for t = N +1, . . . , T , we get Tt=N +1 E ρ t −RbT ≤ T cτ(τ+2)+2Te −(N −1)/τ Using the trivial bound £ ¤ E ρ t − r t (xt , at ) ≤ 2 for the first N terms, we get the desired result. 83 Chapter 5 Online learning in unknown stochastic shortest path problems In this chapter, we consider the setting when the learner aims to minimize its regret against an arbitrary sequence of reward functions but for the case when the Markovian dynamics is unknown. As it turns out, this problem is considerably more challenging than the one studied in the previous chapters, when the Markovian dynamics is known a priori. That said, there exists an alternate thread of previous theoretical works that assume an unknown Markovian dynamics over finite state and action spaces [59, 15]. In these works, however, the reward function is fixed (known or unknown). That in the problem we study both the Markovian dynamics is unknown and the reward function is allowed to be an arbitrary sequence forces us to explore an algorithm design principle that remained relatively less-explored so far. The results presented in this chapter were published in Neu et al. [79]. © ª The assumptions we make about the underlying MDP\r X, A, P are the same as in Chap- ter 3. 
In particular, we assume that P satisfies Assumption S1. Besides supposing that the learner does not know P, we have to assume that the reward function r_t is completely revealed after episode t. The results of this chapter are summarized below.

Thesis 3. Proposed an efficient algorithm for online learning in unknown stochastic shortest path problems. Proved performance guarantees for Setting 1c. The proved bounds are optimal in terms of the dependence on the number of episodes. [79]

The algorithm that we present combines previous work on online learning in Markov decision processes with work on online prediction in adversarial settings in a novel way. In particular, we combine ideas from the UCRL-2 algorithm of Jaksch et al. [59], developed for learning in finite MDPs, with the Follow-the-Perturbed-Leader (FPL) prediction method for arbitrary sequences [51, 64], while for the analysis we also borrow some ideas from the algorithms described in the previous chapters (which in turn build on the fundamental ideas of Even-Dar et al. [32, 33]). However, in contrast to our previously presented work, where we rely on one exponentially weighted average forecaster (EWA) run locally in each state, we use FPL to compute the policy to be followed globally. The main reason for this change is twofold. First, we need a Follow-the-Leader-Be-the-Leader-type inequality for the policies we use, and we could not prove such an inequality when the policies are obtained implicitly by placing experts into the individual states. Second, we could not find a computationally efficient way of implementing EWA over the space of policies, hence we turned to the FPL approach, which admits a computationally efficient implementation. The FPL technique has been explored by Yu et al. [105] and Even-Dar et al. [32, 33] for the case when P is known. It is also interesting to note that even when the reward function is fixed, no previous work has analyzed the FPL technique under unknown MDP dynamics. Our analysis also sheds some light on the role of the confidence intervals and the principle of "optimism in the face of uncertainty" in online learning, as presented in Section 2.2.

After setting the goals of this chapter in Section 5.1, we introduce a more convenient form of the regret in Section 5.2. We present our algorithm in Section 5.3 and analyze it in Section 5.4.

5.1 Problem setup

The problem setup is very similar to the one presented in Section 3.1: we assume that the Markovian dynamics {X, A, P} satisfies Assumption S1, and the learning protocol described on Figure 2.3 uses time-dependent reward functions r_t instead of r in each episode t. As before, we make no statistical assumptions about the sequence (r_t)_{t=1}^{T} other than that its elements have to be chosen in advance from the set of mappings X × A → [0, 1]. Contrary to the previously presented setting, we assume that P is unknown to the learner, which makes the problem fundamentally more difficult, as the learning procedure now has to deal with both the stochastic and the adversarial components of the environment. We assume that the reward functions r_t are completely revealed after each episode. The learner has to minimize the same regret criterion that was defined in Section 3.1: let
\[
\widehat{R}_T = \mathbb{E}\left[ \sum_{t=1}^{T} \sum_{l=0}^{L-1} r_t\bigl(x_l^{(t)}, a_l^{(t)}\bigr) \,\middle|\, P \right]
\]
denote the total expected reward of the learner and let
\[
R_T^{\pi} = \mathbb{E}\left[ \sum_{t=1}^{T} \sum_{l=0}^{L-1} r_t\bigl(x_l', a_l'\bigr) \,\middle|\, P, \pi \right]
\]
stand for the total expected reward of the policy π.
Then, the total expected regret can be written using b T = sup R π − RbT . L T π 85 5.2 Rewriting the regret The techniques presented in this chapter are fundamentally different from the previously presented ones. Instead of using the decomposition presented in Section 3.2, we treat the learning problem as a global sequential decision problem where the learner has to select a sequence of deterministic policies (πt )Tt=1 . We still make use of value functions as defined in Section 2.3.1, but in our setting, we also have to compute value functions defined for transition functions other than P . Furthermore, the global decision-making only uses values of the initial state: all decisions are based on v π (x 0 ). Below, we introduce a formalism that can accommodate for different transition functions. A deterministic policy π (or, in short: a policy) is a mapping π : X → A. We say that a policy π is followed in an SSP problem if the action in state x ∈ X is set to be π(x), independently of previous states and actions. A random path u = (x0 , a0 , . . . , xL−1 , aL−1 , xL ) is said to be generated by policy π under the transition model P if x0 = x 0 and xl +1 ∈ Xl +1 is drawn from P (·|xl , π(xl )) for all l = 0, 1, . . . , L − 1. We denote this relation by u ∼ (π, P ). Using the above notation we can define the value of a policy π, given a fixed reward function r and a transition model P as ¯ # ¯ ¯ W (r, π, P ) = E r (xl , π(xl ))¯ u ∼ (π, P ) , ¯ l =0 " L−1 X that is, the expected sum of rewards gained when following π in the MDP defined by r and P . In our problem, we are given a sequence of reward functions (r t )Tt=1 . We will refer to the partial P sum of these rewards as R t = ts=1 r t . Using the notations of Section 2.3.1, we have v tπ (x 0 ) = W (r t , π, P ) Vtπ (x 0 ) = W (R t , π, P ). and In what follows, we will omit explicit references to the initial state x 0 , since these are the only values that we consider. We can phrase our goal as coming up with an algorithm that generates a sequence of policies (πt )Tt=1 that minimizes the total expected regret b T = max V π − E L T π " T X t =1 π vt t # . At this point, we mention that regret minimization demands that we continuously incorporate each observation as it is made available during the learning process. Assume that we decide to use the following simple algorithm: run an arbitrary exploration policy πexp for 0 < K < T episodes, estimate a transition function P̄, then, in the remaining episodes, run the algorithm described in Chapter 3 using P̄. It is easy to see that this method attains a regret of p O(K + T / K ), which becomes O(T 2/3 ) when setting K = O(T 2/3 ). Also, the regret of this algorithm would scale very poorly with the parameters of the SSP. 86 5.3 Follow the perturbed optimistic policy In this section we present our FPL-based method that we call “follow the perturbed optimistic policy” (FPOP). The algorithm is shown as Algorithm 5.1. One of the key ideas of our approach, borrowed from Jaksch et al. [59], is to maintain a confidence set of transition functions that contains the true transition function with high probability. The confidence set is constructed in such a way that it remains constant over random time periods, called epochs, and the number of epochs (KT ) is guaranteed to be small relative to the time horizon (details on how to select the confidence set are presented in Section 5.3.1). We denote the i -th epoch by E i , while the corresponding confidence set is denoted by Pi . 
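Both the naive explore-then-exploit scheme of the previous section and the optimistic selections made by FPOP rest on evaluating W(r, π, P̄), the total expected reward of a fixed policy under a fixed (possibly estimated) transition model, as introduced in Section 5.2. The sketch below illustrates how this quantity can be evaluated by backward induction over the layers of the loop-free SSP; it is our own illustration rather than code from the thesis, and the data structures layers, P_bar, r and policy are hypothetical names.

```python
def policy_value(layers, P_bar, r, policy):
    """Evaluate W(r, pi, P_bar): the expected total reward of a deterministic
    policy pi in a loop-free SSP with transition model P_bar and reward r.

    layers  -- list [X_0, ..., X_L] of the disjoint layers; X_0 = [x0] holds the
               initial state and X_L holds the terminal state
    P_bar   -- P_bar[(x, a)] maps next-layer states x' to their probabilities
    r       -- r[(x, a)] is the reward of the pair (x, a), a value in [0, 1]
    policy  -- policy[x] is the action the deterministic policy takes in x
    """
    v = {x: 0.0 for x in layers[-1]}            # terminal layer has value zero
    for l in range(len(layers) - 2, -1, -1):    # backward induction over layers
        for x in layers[l]:
            a = policy[x]
            v[x] = r[(x, a)] + sum(p * v[x_next]
                                   for x_next, p in P_bar[(x, a)].items())
    (x0,) = layers[0]
    return v[x0]                                # W(r, pi, P_bar) = v(x0)
```

The selection rule of FPOP, presented next, maximizes this quantity jointly over policies and over transition models in the current confidence set; the extended dynamic programming routine of Section 5.3.2 carries out this joint maximization without enumerating the policies.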
Now consider the simpler problem where the reward function r is assumed to be constant and known throughout all episodes t = 1, 2, . . . , T . One can directly apply UCRL-2 of Jaksch et al. [59] and select policy and transition-function estimate as ¡ ¢ © ª πi , P̃i = arg max W (r, π, P̄ ) π∈Π,P̄ ∈Pi in epoch E i , and follow πi through that epoch. In other words, we optimistically select the model from the confidence set that maximizes the optimal value of the MDP (defined as the value of the optimal policy in the MDP) and follow the optimal policy for this model and the fixed reward function. However, it is well known that for the case of arbitrarily changing reward functions, optimistic “follow the leader”-style algorithms like the one described above are bound to fail even in the simple (stateless) expert setting. 1 Thus, we need rely on the formalism of online learning of arbitrary sequences, presented in Section 2.1. In particular, our approach is to make this optimistic method more conservative by adding some carefully designed noise to the cumulative sum of reward functions that we have observed so far – much in the vein of FPL. To this end, introduce the perturbation function Yi : X × A → R with Yi (x, a) being i.i.d. random variables for all (x, a) ∈ X × A and all epochs i = 1, 2, . . . , KT ; in our algorithm Yi will be selected according to Exp(η, |X||A|), the |X||A|-dimensional distribution whose components are independent having exponential distribution with expectation η. Using these perturbations, the key idea of our algorithm is to choose our policy and transition-function estimate as ¡ ¢ © ª πt , P̃t = arg max W (R t −1 + Yi(t ) , π, P̄ ) (5.1) π∈Π,P̄ ∈Pi(t ) where i(t ) is the epoch containing episode t . That is, after fixing a biased reward function based on our previous observations, we still act optimistically when taking stochastic uncertainty into consideration. We call this method “follow the perturbed optimistic policy”, or, in short, FPOP. 1 This phenomenon was also illustrated by experiments presented in Section 3.6. 87 Algorithm 5.1 The FPOP algorithm for the online loop-free SSP problem with unknown transition probabilities. Input: State space X, action space A, perturbation parameter η ≥ 0, confidence parameter 0 < δ < 1, time horizon T > 0. Initialization: Let R 0 (x, a) = 0, n1 (x, a) = 0, N1 (x, a) = 0, and M1 (x, a) = 0 for all state-action pairs (x, a) ∈ X × A, let P1 be the set of all transition functions, and let the episode index i(1) = 1. Choose Y1 ∈ R|X||A| ∼ Exp(η, |X||A|) randomly. For t = 1, 2, . . . , T 1. Compute πt and P̃t according to (5.1). ¡ ) (t ) ¢ (t ) (t ) 2. Traverse a path ut = x(t , a0 , . . . , xL−1 , aL−1 , xL(t ) following the policy πt , where xl(t ) ∈ Xl 0 ¡ )¢ ) and a(t = πt x(t ∈ A. l l 3. Observe reward function r t and receive rewards PL−1 l =0 ¡ ) (t ) ¢ r t x(t , al . l ¡ ) (t ) ¢ ¡ ) (t ) ¢ 4. Set ni(t ) x(t , al = ni(t ) x(t , al + 1 for all l = 0, . . . , L − 1. l l 5. If ni(t ) (x, a) ≥ Ni(t ) (x, a) for some (x, a) ∈ X × A then start a new epoch: (a) Set i(t + 1) = i(t ) + 1, ti(t +1) = t + 1 and compute Ni(t +1) (x, a) and Mi(t +1) (x, a) for all (x, a) by (5.2). (b) Construct Pi(t +1) according to (5.3) and (5.4). (c) Reset ni(t +1) (x, a) = 0 for all (x, a). (d) Choose Yi(t +1) ∈ R|X||A| ∼ Exp(η, |X||A|) independently. Else set i(t + 1) = i(t ). 88 5.3.1 The confidence set for the transition function In this section, following Jaksch et al. 
[59], we describe how the confidence set is maintained to ensure that it contains the real transition function with high probability yet does not change too often. To maintain simplicity, we will assume that the layer decomposition of the SSP is known in advance, however the algorithm can be easily extended to cope with unknown layer structure. The algorithm proceeds in epochs of random length: the first epoch E 1 starts at episode t = 1, and each epoch E i ends when any state-action pair (x, a) has been chosen at least as many times in the epoch as before the epoch (e.g., the first epoch E 1 consists of the single episode t = 1). Let ti denote the index of the first episode in epoch E i , i(t ) be the index of the epoch that includes t , and let Ni (x, a) and Mi (x 0 |x, a) denote the number of times state-action pair (x, a) has been visited and the number of times this event has been followed by a transition to x 0 up to episode ti , respectively. That is tX i −1 Ni (x l , a l )= s=1 tX i −1 Mi (x l +1 |x l , a l )= s=1 I©x(s)=xl ,a(s)=al ª l l (5.2) I©x(s) =xl +1 ,x(s)=xl ,a(s)=al ª , l +1 l l where l = 0, 1, . . . , L − 1, x l ∈ Xl , a l ∈ A and x l +1 ∈ X (clearly, Mi (x l +1 |x l , a l ) can be non-zero only if x l +1 ∈ Xl +1 ). Our estimate of the transition function in epoch E i will be based on the relative frequencies P̄i (x l +1 |x l , a l ) = Mi (x l +1 |x l , a l ) . max {1, Ni (x l , a l )} Note that P̄i (·|x, a) belongs to the set of probability distributions ∆(Xl x +1 , X) defined over X with support contained in Xl x +1 . Define a confidence set of transition functions for epoch E i as the following L 1 -ball around P̄i : v u ½ T |X||A| ° ° u t 2|Xl x+1 | ln δ ° ° P̂i = P̄ : P̄ (·|x, a)−P̄i (·|x, a) 1 ≤ max{1, Ni (x, a)} ¾ and P̄ (·|x, a)∈∆(Xl x+1 , X) for all (x, a)∈ X×A . (5.3) The following lemma ensures that the true transition function lies in these confidence sets with high probability. We present the proof for completeness. Lemma 5.1 (59, Lemma 17). For any 0 < δ < 1 v u T |X||A| u 2|X ° ° l x +1 | ln δ °P̄i (·|x, a) − P (·|x, a)° ≤ t 1 max{1, Ni (x, a)} holds with probability at least 1 − δ simultaneously for all (x, a) ∈ X × A and all epochs. Proof. Let us fix an arbitrary x ∈ X and let l = l x . The statement follows from the following 89 inequality due to Weissman et al. [101] concerning the distance of a true discrete distribution p and the empirical distribution p̂ over m distinct events from n samples: ¶ µ ¤ ¡ ¢ £ nε2 . P kp − p̂k1 ≥ ε ≤ 2m − 2 exp − 2 As now we have |Xl +1 | distinct events, we get that setting s ε= 4|Xl +1 | ln T |Xδ||A| n for some fixed n ∈ [1, 2, . . . , t ] yields s ° ° P °P̄i (·|x, a) − P (·|x, a)°1 ≥ 2|Xl +1 | ln T |Xδ||A| n ¯ ¯ ¯ δ ¯ Ni (x, a) = n ≤ . ¯ 2 T |X||A| ¯ Using the union bound for all possible values of Ni (x, a), all (x, a) ∈ X × A, all i = 1, 2, . . . , KT (note that for the bound, we have used the very crude upper bound T > KT for simplicity) and the fact that the confidence intervals trivially hold when there are no observations with probability 1, we get the statement of the lemma. Since by the above result P ∈ P̂i holds with high probability for all epochs, defining our true confidence set Pi as Pi = ∩ij =1 P̂ j , (5.4) we also have P ∈ Pi for all epochs with probability at least 1 − δ. This way we ensure that our confidence sets cannot increase between consecutive episodes with probability 1. Note that this is a delicate difference from the construction of Jaksch et al. [59] that plays an important role in our proof. 
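To make the bookkeeping of this section concrete, the following sketch computes the empirical estimates P̄_i and the confidence radii of (5.3) from the visit counts. It is only an illustration under assumed data structures (the dictionaries N and M, and the map size_next_layer giving |X_{l_x+1}|, are hypothetical names), not the implementation used in the thesis.

```python
import math

def epoch_confidence_set(N, M, size_next_layer, T, n_states, n_actions, delta):
    """Empirical transition estimates and L1 confidence radii for one epoch,
    following the definition in (5.3).

    N[(x, a)]          -- number of visits to (x, a) before the current epoch
    M[(x, a)][x2]      -- number of observed transitions (x, a) -> x2
    size_next_layer[x] -- |X_{l_x + 1}|, the size of the layer after x's layer
    """
    P_hat, radius = {}, {}
    for (x, a), n_visits in N.items():
        n = max(1, n_visits)
        counts = M.get((x, a), {})
        P_hat[(x, a)] = {x2: m / n for x2, m in counts.items()}
        radius[(x, a)] = math.sqrt(
            2.0 * size_next_layer[x]
            * math.log(T * n_states * n_actions / delta) / n)
    return P_hat, radius
```

The set P̂_i then consists of all transition functions whose conditional distributions lie within radius[(x, a)] of P_hat[(x, a)] in L1 distance and are supported on the next layer, and the set P_i actually used by FPOP is the intersection of P̂_1, ..., P̂_i as in (5.4), so that the confidence sets can only shrink from one epoch to the next.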
5.3.2 Extended dynamic programming FPOP needs to compute an optimistic transition function and an optimistic policy in each episode with respect to some reward function r and some confidence set P of transition functions. That is, we need to solve the problem ¡ ∗ ∗¢ π , P = arg max W (r, π, P̄ ). (5.5) π,P̄ ∈P We will use an approach called extended dynamic programming to solve this problem, a simple adaptation of the extended value iteration method proposed by Jaksch et al. [59]. The method is presented as Algorithm 5.2 in Section 5.5. Computing in a backward manner on the states (that is, going from layer Xl to X0 ), the algorithm maximizes the transition probabilities to the direction of the largest reward-to-go. This is possible since the L 1 -balls allow to select the optimistic transition functions independently for all state-action pairs (x, a) ∈ X × A. Following 90 the proof of Theorem 7 of Jaksch et al. [59], Lemma 5.6 shows that Algorithm 5.2 indeed solves the required minimization problem, and can be implemented with O(|A||X|2 ) time and space complexity. 5.4 Regret bound The main result of this chapter is presented below. The performance bound is detailed in Theorem 5.1 in the appendix. Theorem 5.1. Assume η ≤ (|X||A|)−1 and T ≥ |X||A|. Then the expected regret of FPOP can be bounded as b T ≤ L|X||A| log L 2 µ 8T |X||A| s ³p ´ + 2 + 1 L|X| ´ ³ ¶ ln |X||A| + 1 L η + ηT L (e − 1)|X||A| T |X||A| T |A| ln + L|X| δL s 2 T ln L + 3δ T L. δ In particular, assuming T ≥ (|X||A|)2 , setting v ³ ´ u µ ¶ ln |X||A| + 1 u t 8T L η = log2 |X||A| T (e − 1) and δ = 1/T gives s b T ≤ 2 L|X||A| L + µ 8T |X||A| s ¶µ T (e − 1) log2 ³p ´ 2 + 1 L|X| T |A| ln µ ln ¶ ¶ |X||A| +1 L p T 2 |X||A| + L|X| 2 T ln(LT ) + 3 L. L Remark 5.1. The above regret bound can be simply stated as p ´ ³ b T = Õ L|X||A| T . L Using the techniques of Jaksch et al. [59], one can show a regret bound of Õ(L|X| p T |A|) for the easier problem when the reward function is fixed and known. Thus, the price we pay for p playing against an arbitrary reward sequence is an O( |A|) factor in the upper bound. The proof of Theorem 5.1 mainly follows the regret analysis of FPL combined with ideas from the regret analysis of UCRL-2. First, let us consider the policy-transition model pair ¡ ¢ © ª π̂t , P̂t = arg max W (R t + Yi(t ) , π, P̄ ) . π∈Π,P̄ ∈Pi(t ) In other words, π̂t is an optimistic policy that knows the reward function before the episode 91 would begin. Define ṽt = W (r t , πt , P̃t ) and v̂t = W (r t , π̂t , P̂t ). Furthermore, let © ª π∗t = arg max W (R t + Yi(t ) , π, P ) , π∈Π the optimal policy with respect to the perturbed rewards and the true transition function. The perturbation added to the value of policy π in episode t will be denoted by Zt (π) = W (Yi(t ) , π, P ). The true optimal value up to episode t and the optimal policy attaining this value will be denoted by Vt∗ = max Vt (π) π∈Π σ∗t = arg max Vt (π). and π∈Π We proceed by a series of lemmas to prove our main result. The first one shows that our optimistic choice of the estimates of the transition model enables us to upper bound the optimal value Vt∗ with a reasonable quantity. Lemma 5.2. VT∗ ≤ T X t =1 E [v̂t ] + δ T L + |X||A| log2 µ 8T |X||A| ¶ PL−1 l =0 ln (|Xl ||A|) + L η . Note that while the proof of this result might seem a simple reproduction of some arguments from the standard FPL-analysis, it contains subtle details about the role of our optimistic estimates that are of crucial importance for the analysis. Proof. 
Assume that P ∈ Pi(T ) , which holds with probability at least 1 − δ, by Lemma 5.1. First, we have VT∗ + ZT (σ∗T ) ≤ VT (π∗T ) + ZT (π∗T ) = W (R T + Yi(T ) , π∗T , P ) (5.6) ≤ W (R T + Yi(T ) , π̂T , P̂T ) where the first inequality follows from the definition of π∗T and the second from the optimistic choice of π̂T and P̂T . Let dYi(s) = Yi(s) − Yi(s−1) for s = 1, . . . , t . Next we show that, given P ∈ Pi(T ) , W (R t +Yi(t ) , π̂t , P̂t ) ≤ t X W (r s +dYi(s) , π̂s , P̂s ) (5.7) s=1 where we define Y0 = 0. The proof is done by induction on t . Equation (5.7) holds trivially for t = 1. For t > 1, assuming P ∈ Pi(T ) and (5.7) holds for t − 1, we have W (R t + Yi(t ) , π̂t , P̂t ) = W (R t −1+Yi(t −1) , π̂t , P̂t )+W (r t +dYi(t ) , π̂t , P̂t ) ≤ W (R t −1+Yi(t −1) , π̂t −1 , P̂t −1 )+W (r t +dYi(t ) , π̂t , P̂t ) ≤ t X W (r s + dYi(s) , π̂s , P̂s ), s=1 92 ¡ ¢ where the first inequality follows from the fact that π̂t −1 , P̂t −1 is selected from a wider class2 ¢ ¡ than π̂t , P̂t and is optimistic with respect to rewards R t −1 +Yi(t −1) , while the second inequality holds by the induction hypothesis for t − 1. This proves (5.7). Now the non-negativity of ZT (σ∗T ), (5.6) and (5.7) imply that, given P ∈ Pi(T ) , VT∗ ≤ = T X t =1 T X W (r t + dYi(t ) , π̂t , P̂t ) v̂t + T X W (dYi(t ) , π̂t , P̂t ). t =1 t =1 Since P ∈ Pi(T ) holds with probability at least 1 − δ, VT∗ ≤ T L trivially and the right hand side of the above inequality is non-negative, we have VT∗ ≤ T X " E [v̂t ] + E t =1 T X # W (dYi(t ) , π̂t j , P̂t j ) + δT L (5.8) t =1 The elements in the second sum above may be non-zero only if i(t ) 6= i(t − 1). Furthermore, by Proposition 18 of Jaksch et al. [59], the number of epochs KT up to episode T is bounded from above as T X def KT = t =1 I{i(t )6=i(t −1)} ≤ |X||A| log2 µ ¶ 8T . |X||A| Therefore, " E T X # " W (dYi(t ) , π̂t j , P̂t j ) ≤ E t =1 m X # W (Y j , π̂t j , P̂t j ) j =1 ≤ |X||A| log2 µ ¶ L−1 · ¸ X 8T E max Y1 (x, a) . |X||A| l =0 (x,a)∈Xl ×A Using the upper bound on the expectation of the maximum of a number of exponentially distributed variables (see, e.g., the proof of Corollary 4.5 in Cesa-Bianchi and Lugosi 25), a combination of the above inequality with (5.8) gives the desired result. Next, we show that peeking one episode into the future does not change the performance too much. The following lemma is a standard result used for the analysis of FPL and we include the proof only for completeness. Lemma 5.3. Assume that η ≤ (|X||A|)−1 . Then, " E T X t =1 # v̂t ≤ E " T X # ṽt + ηT L (e − 1)|X||A|. t =1 Proof. Let © ª (σt (Y), Γt (Y)) = arg max W (R t −1 + Y, π, P̄ ) π∈Π,P̄ ∈Pi(t ) 2 This follows from the definition of the confidence sets in Equation (5.4). 93 and Ft (Y) = W (r t , σt (Y), Γt (Y)). Clearly, ṽt = Ft (Yi(t ) ) and v̂t = Ft (Yi(t ) + r t ). Now let f be the density function of Yi(t ) and Fi(t ) denote the σ-algebra generated by all random variables before epoch E i(T ) .3 We have E v̂t | Fi(t −1) = £ ¤ Z Z Ft (y + r t ) f (y)d y = Ft (y) f (y − r t )d y R|X||A| Z ¤ f (y − r t ) f (y − r t ) £ ≤ sup E ṽt | Fi(t −1) . Ft (y) f (y)d y ≤ sup | X || A | f (y) f (y) y,t y,t R R|X||A| ¡ P ¢ Since f (y) = η exp −η x,a y(x, a) for all y º 0, we get à ! X ¡ ¢ f (y − r t ) sup = exp η r t (x, a) ≤ exp η|X||A| . f (y) y x,a Using e x ≤ 1 + (e − 1)x for x ∈ [0, 1], which holds by our assumption on η, we get ¡ ¢ E [v̂t ] ≤ E [ṽt ] 1 + η(e − 1)|X||A| . Noticing that ṽt ≤ L gives the result. 
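For completeness, the bound on the expected maximum of exponentially distributed variables invoked in the proof of Lemma 5.2 can be obtained by a short standard calculation; the derivation below is our own addition (it does not appear in the thesis) and is written for perturbation components with tail \(\mathbb{P}[Y_i > y] = e^{-\eta y}\), the parameterization used in the proof of Lemma 5.3 above. For i.i.d. variables \(Y_1, \dots, Y_m\) of this form and any \(u \ge 0\),
\[
\mathbb{E}\Bigl[\max_{1 \le i \le m} Y_i\Bigr]
  = \int_0^{\infty} \mathbb{P}\Bigl[\max_{1 \le i \le m} Y_i > y\Bigr]\,\mathrm{d}y
  \le u + \int_u^{\infty} m\, e^{-\eta y}\,\mathrm{d}y
  = u + \frac{m}{\eta}\, e^{-\eta u} ,
\]
and the choice \(u = (\ln m)/\eta\) yields \(\mathbb{E}[\max_{1 \le i \le m} Y_i] \le (\ln m + 1)/\eta\). Applying this bound with \(m = |X_l||A|\) separately for every layer \(l\) is how the factor \(\sum_{l=0}^{L-1}\ln(|X_l||A|) + L\) in the statement of Lemma 5.2 arises.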
¯ £ ¤ Now consider µt (x) = P xl x = x ¯ u ∼ (πt , P ) , that is, the probability that a trajectory gen- erated by πt and P includes x. Note that given a layer Xl , the restriction µt ,l : Xl → [0, 1] is a £ ¤ distribution. Define an estimate of µt as µ̂t (x) = P xl = x| u ∼ (πt , P̃t ) . Note that this estimate can be efficiently computed using the recursion µ̃t (x l +1 ) = X x l ,a l P̃t (x l +1 |x l , a l )πt (a l |x l )µ̃t (x l ), for l = 0, 1, 2, . . . , L −1, with µ̃t (x 0 ) = 1. The following result will ensure that if our estimate of the transition function is close enough to the true transition function in the L 1 sense, than these estimates of the visitation probabilities are also close to the true values that they estimate. Lemma 5.4. Assume that there exists some function dt : X × A → R+ such that ° ° °P̃t (·|x, a) − P (·|x, a)° ≤ dt (x, a) 1 holds for all (x, a) ∈ X × A. Then X x l ∈Xl |µ̃t (x l ) − µt (x l )| ≤ lX −1 X k=0 x k ∈Xk µt (x k ) dt (x k , πt (x k )) for all l = 1, 2, . . . , L − 1. 3 Note that Y i(t ) is generated independently from the history up to epoch i(t ). 94 Proof. We prove the statement by induction on l . For l = 1 we have X¯ ¯ X¯ ¯ ¯µ̃t (x 1 ) − µt (x 1 )¯ = ¯P̃t (x 1 |x 0 , πt (x 0 )) − P (x 1 |x 0 , πt (x 0 ))¯ ≤ dt (x 0 , , πt (x 0 )), x1 x1 proving the statement for this case. Now assume that the statement holds for some l − 1. We have µ̃t (x l ) − µt (x l ) = X¡ ¢ P̃t (x l |x l −1 , πt (x l −1 ))µ̃t (x l −1 ) − P (x l |x l −1 , πt (x l −1 ))µt (x l −1 ) x l −1 = X³ x l −1 ¢ ¡ P̃t (x l |x l −1 , πt (x l −1 )) µ̃t (x l −1 ) − µt (x l −1 ) ´ ¡ ¢ + P̃t (x l |x l −1 , πt (x l −1 )) − P (x l |x l −1 , πt (x l −1 )) µt (x l −1 ) , and thus X |µ̃t (x l ) − µt (x l )| ≤ xl X³ x l ,x l −1 ¯ ¯ P̃t (x l |x l −1 , πt (x l −1 )) ¯µ̃t (x l −1 ) − µt (x l −1 )¯ ´ ¯ ¯ + ¯P̃t (x l |x l −1 , πt (x l −1 )) − P (x l |x l −1 , πt (x l −1 ))¯ µt (x l −1 ) X¯ ¯ ¯µ̃t (x l −1 ) − µt (x l −1 )¯ = x l −1 + X x l −1 ≤ lX −2 X k=0 x k ∈Xk µt (x l −1 ) X¯ ¯ ¯P̃t (x l |x l −1 , πt (x l −1 )) − P (x l |x l −1 , πt (x l −1 ))¯ xl µt (x k ) dt (x k , πt (x k )) + X x l −1 µt (x l −1 ) X xl dt (x l −1 , πt (x l −1 )), proving the statement. Finally, we use the above result to relate the estimated policy values ṽt to their true values v t (πt ). The following lemma, largely based on Lemma 19 of Jaksch et al. [59], asserts this relationship. Lemma 5.5. Assume T ≥ |X||A|. Then " E T X t =1 # ṽt ≤E " T X s # v t (πt ) + L|X| 2 T ln t =1 ³p ´ + 2 + 1 L|X| L + 2δ T L δ s T |A| ln T |X||A| . δL Proof. We start by some arguments borrowed from Jaksch et al. [59]. Let ni (x, a) be the number of times state-action pair (x, a) has been visited in epoch E i . We have Ni (x, a) = iX −1 ni (x, a). j =1 For simplicity, let KT = m be the number of epochs. By Appendix C.3 of Jaksch et al. [59], we 95 have ³p ´p m n (x, a) X i ≤ 2+1 Nm (x, a), p i =1 Ni (x, a) and by Jensen’s inequality, ³p ´p m n (x, a) XX i ≤ 2+1 |X||A|T . p x,a i =1 Ni (x, a) Now fix an arbitrary t : 1 ≤ t ≤ T . We have ṽt = L−1 X l =0 x∈Xl and v t (πt ) = thus ṽt − v t (πt ) = L−1 X X L−1 X µ̃t (x)r t (x, πt (x)) X l =0 x∈Xl µt (x)r t (x, πt (x)), L−1 X ¡ X X ¯ ¯ ¢ ¯µ̃t (x) − µt (x)¯ . µ̃t (x) − µt (x) r t (x, πt (x)) ≤ l =0 x∈Xl l =0 x∈Xl PT P ¯ ¯ That is, we need to bound t =1 x∈Xl ¯µ̃t (x) − µt (x)¯. 
° ° Setting dt (x, a) = °P̃t (·|x, a) − P (·|x, a)°1 for all (x, a) ∈ X × A, the conditions of Lemma 5.4 are clearly satisfied, and so −1 X X ¯ ¯ lX ¯µ̃t (x) − µt (x)¯ ≤ µt (x k ) dt (x k , πt (x k )) x∈Xl k=0 x k ∈Xk = lX −1 k=0 ³ ) (t ) dt x(t , ak k ´ + lX −1 (5.9) X ³ k=0 x k ∈Xk µt (x k ) − I© ´ ) x(t =x k k ª dt (x k , πt (x k )) . Now, by Lemma 5.1, we have with probability at least 1 − δ simultaneously for all k that T X t =1 ³ ´ v u u t 2|Xk+1 | ln T |Xδ||A| n ³ ´o ) (t ) t =1 max 1, Ni(t ) x(t , a k k v u u 2|Xk+1 | ln T |X||A| m X X t δ © ª ≤ ni (x k , a k ) max 1, Ni(t ) (x k , a k ) x k ,a k i =1 s ³p ´ T |X||A| ≤ 2+1 2 T |Xk ||Xk+1 ||A| ln . δ ) (t ) dt x(t , ak ≤ k T u X ³ ´ For the second term on the right hand side of (5.9), notice that µt (x k ) − I©x(t ) =xk ª form a mark tingale difference sequence with respect to {Ut }Tt=1 and thus by the Hoeffding–Azuma inequality and dt ≤ 2, we have T ³ X t =1 s ´ µt (x k ) − I©x(t ) =xk ª dt (x k , πt (x k )) ≤ k 96 2 T ln L δ with probability at least 1 − δ/L. Putting everything together, the union bound implies that we have, with probability at least 1 − 2δ simultaneously for all l = 1, . . . , L, t =1 x∈Xl s s −1 T |X||A| lX L |Xk | 2 T ln 2+1 T |Xk ||Xk+1 ||A| ln + δ δ k=0 k=0 s s ´ L−1 ³p −1 X 1 T |X||A| lX L |Xk | 2 T ln 2+1 L T |Xk ||Xk+1 ||A| ln + ≤ δ δ k=0 L k=0 s s µ ¶ ³p ´ |X| 2 T |X||A| L ≤ 2 + 1 L T | A| ln + |X| 2 T ln L δ δ s s ³p ´ T |X||A| L = + |X| 2 T ln (5.10) 2 + 1 |X| T |A| ln δ δ T X ¡ X ¢ µ̃t (x) − µt (x) ≤ lX −1 ³p ´ where in the last step we used Jensen’s inequality for the concave function f (x, y) = PL−1 with parameter a > 0 and the fact that k=0 |Xk | = |X| − 1 < |X|. p x y(a + ln x) Summing up for all l = 0, 1, . . . , L − 1 and taking expectation, using that v t (πt ) − ṽt ≤ L and (5.10) holds with probability at least 1 − 2δ, finishes the proof. The theorem can be obtained by a trivial combination of Lemmas 5.2, 5.3, and 5.5 and applying the bound |X||A| ln (|Xl ||A|) ≤ L ln L l =0 L−1 X µ ¶ in the last term of Lemma 5.2. 5.5 Extended dynamic programming: technical details The extended dynamic programming algorithm is given by Algorithm 5.2. The next lemma, which can be obtained by a straightforward modification of the proof of Theorem 7 of Jaksch et al. [59], shows that Algorithm 5.2 efficiently solves the desired minimization problem. Lemma 5.6. Algorithm 5.2 solves the maximization problem (5.5) for P = {P̄ : kP̄ − P̂ k1 ≤ b}. P Let S = lL−1 |Xl ||X l +1 | denote the maximum number of possible transitions in the given model. =0 The time and space complexity of Algorithm 5.2 is the number of possible non-zero elements of P̄ allowed by the given structure, and so it is O(S|A|), which, in turn, is O(|A||X|2 ). 97 Algorithm 5.2 Extended dynamic programming for finding an optimistic policy and transition model for a given confidence set of transition functions and given rewards. Input: empirical estimate P̂ of transition functions, L 1 bound b ∈ (0, 1]|X||A| , reward function r ∈ [0, 1]|X||A| . Initialization: Set w(x L ) = 0. For l = L − 1, L − 2, . . . , 0 ¡ ¢ 1. Let k = |Xl +1 | and x 1∗ , x 2∗ , . . . , x k∗ be a sorting of the states in Xl +1 such that w(x 1∗ ) ≥ w(x 2∗ ) ≥ · · · ≥ w(x k∗ ). 2. For all (x, a) ∈ Xl × A © ª (a) P ∗ (x 1∗ |x, a) = min P̂ (x 1∗ |x, a) + b(x, a)/2, 1 (b) P ∗ (x i∗ |x, a) = P̂ (x i∗ |x, a) for all i = 2, 3, . . . , k. (c) Set j = k. P (d) While i P ∗ (x i∗ |x, a) > 1 do © ª P i. Set P ∗ (x ∗j |x, a) = max 0, 1 − i 6= j P ∗ (x i∗ |x, a) ii. Set j = j − 1. 3. 
For all x ∈ X_l:
   (a) Let \(w(x) = \max_a \bigl\{ r(x, a) + \sum_{x'} P^*(x'|x, a)\, w(x') \bigr\}\).
   (b) Let \(\pi^*(x) = \arg\max_a \bigl\{ r(x, a) + \sum_{x'} P^*(x'|x, a)\, w(x') \bigr\}\).
Return: optimistic transition function P*, optimistic policy π*.

Chapter 6
Online learning with switching costs

In this chapter we study a special case of online learning in reactive environments where switching between actions is subject to an additional cost. The precise protocol of the prediction problem is identical to the online prediction protocol shown on Figure 2.1, with the additional assumption that every time the learner selects an action a_t ≠ a_{t−1}, a known cost of K > 0 is deducted from the earnings of the learner. We are interested in algorithms that can minimize the regret under this additional assumption, or, equivalently, minimize the regret while keeping the number of action switches low. However, the usual forecasters with small regret, such as the exponentially weighted average forecaster or the FPL forecaster with i.i.d. perturbations, may switch actions a large number of times, typically Θ(T). Therefore, the design of special forecasters with small regret and a small number of action switches is called for. In this chapter, we consider this problem in the full information setting, where the reward functions are revealed after each time step. Our results are summarized in the following thesis.

Thesis 4. Proposed an efficient algorithm for online learning with switching costs. Proved performance guarantees for Settings 3a and 3b. The proved bounds for Setting 3a are optimal in all problem parameters. Our algorithm is the first known efficient algorithm for Setting 3b. [30, 31]

While this learning problem is very interesting in its own right, one can also imagine numerous applications where one would like to define forecasters that do not change their prediction too often. Examples of such problems include the online buffering problem described by Geulen et al. [39] and the online lossy source coding problem to be discussed in Chapter 7. Furthermore, as seen in Chapter 4, abrupt policy switches can also be harmful in the online MDP problem. While the core idea of the analysis presented in Chapter 4 is to guarantee that the learner's policies change slowly over time, one can achieve similar results by ensuring that the learner changes its policies rarely. The main reason we took the first route in our analysis is that there is currently no known algorithm that can guarantee that the number of action switches is O(√T). Preliminary results [24] suggest that even the existence of such prediction algorithms is nontrivial.

As mentioned before in Chapter 1, learning in this problem can be regarded as a special case of the online MDP problem. The Markovian environment describing the current setting is deterministic, and it is easy to find policies that induce periodic state transitions. In fact, the only policies that admit stationary distributions are the ones that satisfy π(x) = x for some x ∈ A. This implies that {X, A, P} does not satisfy Assumption M1, so we cannot guarantee that the MDP-E algorithm (Algorithm 4.1) performs well in this problem by a straightforward application of Theorem 4.1. Recently, Arora et al. [3] proposed an algorithm for regret minimization in deterministic MDPs with non-stationary rewards.
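As an aside, the reduction mentioned above can be made concrete: the switching-cost problem induces a deterministic MDP with non-stationary rewards, exactly the class studied by Arora et al. [3], in which the state is simply the action played in the previous round. The sketch below is our own illustration, not a construction spelled out in the thesis.

```python
def switching_cost_mdp(actions, K):
    """Deterministic MDP encoding of prediction with switching costs.

    The state space equals the action set: the state is the previously played
    action.  Playing action a in state x moves the process deterministically to
    state a, and the switching cost K is folded into the (time-varying) reward.
    """
    states = list(actions)

    def transition(x_next, x, a):
        # P(x'|x, a): the next state is exactly the action just played
        return 1.0 if x_next == a else 0.0

    def reward(r_t, x, a):
        # r_t maps actions to rewards in [0, 1]; a switch costs K
        # (shift and rescale afterwards if rewards must stay in [0, 1])
        return r_t(a) - (K if a != x else 0.0)

    return states, transition, reward
```

As noted above, these deterministic dynamics admit policies whose induced state sequence is periodic and never mixes, so the uniform mixing condition of Assumption M1 fails and the guarantees of Chapter 4 do not transfer directly.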
Since they consider a significantly more complicated problem where the optimal policy is allowed to induce periodic behavior, they can only guarantee an expected average regret of Õ(T −1/4 ). Since our general tools introduced in the previous chapters are not directly applicable for this specific setting, we turn to algorithms that are directly tailored to deal with the above problem of switch-constrained online prediction. The first paper to explicitly attack this problem is by Geulen et al. [39], who propose a variant of the exponentially weighted average forecaster called the “Shrinking Dartboard” (SD) algorithm and prove that it provides an expected p regret of O( T ln |A|), while guaranteeing that the expected number of switches is at most p O( T ln |A|). A less conscious attempt to solve the problem is due to Kalai and Vempala [63] (see also 64); they show that the simplified version of the Follow-the-Perturbed-Leader (FPL) p algorithm with identical perturbations (as described above) guarantees an O( T ln |A|) bound on both the expected regret and the expected number of switches. In the first half of this chapter, we present a modified version of the SD algorithm that enjoys optimal bounds on both the standard regret and the number of switches. Our contribution, presented in György and Neu [48] and György and Neu [49], is a minor modification of the SD algorithm that allows using adaptive step-size parameters. More importantly, in the second half of the chapter representing our works Devroye et al. [30, 31], we propose a method based on FPL in which perturbations are defined by independent symmetric random walks. We show that this intuitively appealing forecaster has similar regret and switch-number guarantees as SD and FPL with identical perturbations. A further important advantage of the new forecaster is that it may be used simply in the more general problem of online combinatorial— or, more generally, linear—optimization. We postpone the definitions and the statement of the results to Section 6.2.3 below. Before presenting our algorithms, we set the stage for Chapter 7 by slightly changing our notations. For a number of practical problems, including the online lossy source coding problem, it is more suitable to regard the learner’s task as having to minimize losses instead of having to maximize rewards. In accordance with the notations used by Cesa-Bianchi and Lugosi [25], we identify the elements of the action set A with the natural numbers {1, 2, . . . , N } and denote the loss given for choosing action i ∈ {1, 2, . . . , N } at time t by `i ,t . We assume that `i ,t ∈ [0, 1]. The goal of the learner to choose its actions I1 , I2 , . . . , IT so as to minimize its total expected regret 100 Algorithm 6.1 The modified Shrinking Dartboard algorithm 1. Set η t > 0 with η t +1 ≤ η t for all t = 1, 2, . . ., η 0 = η 1 , and L i ,0 = 0 and w i ,0 = 1/N for all actions i ∈ {1, 2, . . . , N }. 2. for t = 1, . . . , T do (a) Set w i ,t = (b) Set p i ,t = 1 −η t L i ,t −1 for all i ∈ {1, 2, . . . , N }. Ne w i ,t PN for all i ∈ {1, 2, . . . , N }. j =1 w j ,t (c) Set c t = e (η t −η t −1 )(t −2) . w It −1 ,t (d) With probability c t w I , set It = It −1 if t ≥ 2, that is, do not change expert; other© ª wise choose It randomly according to the distribution p 1,t , . . . , p N ,t . t −1 ,t −1 (e) Observe the losses `i ,t and set L i ,t = L i ,t −1 + `i ,t for all i ∈ {1, 2, . . . , N }. 
end for defined as bT = L T X t =1 `It ,t − min T X i ∈{1,...,N } t =1 `i ,t .1 Further, define the number of action switches up to time n by Cn = |{1 < t ≤ n : It −1 6= It }| . b T of the order We are interested in defining randomized forecasters that achieve a regret L p O( T ln N ) while keeping the number of action switches C n as small as possible. 6.1 The Shrinking Dartboard algorithm revisited In this section, we present a modified version of the Shrinking Dartboard (SD) algorithm of Geulen et al. [39]. A modified version of this prediction method, called the modified SD (mSD) algorithm, is shown as Algorithm 6.1. The difference between the SD and the mSD algorithms is that mSD is horizon independent, which is achieved by introducing the constant c t in the algorithm (setting η t = η the mSD algorithm reduces to SD). w To see that the mSD algorithm is well-defined we have to show that c t w i ,ti ,t−1 ≤ 1 for all t and i . For t = 1, the statement follows from the definitions, since c 1 = 1. For t ≥ 2 it follows since w i ,t w i ,t −1 ¡ ¢ = exp η t −1 L i ,t −2 − η t D i ,t −1 ¡¡ ¢ ¢ ≤ exp η t −1 − η t L i ,t −2 − η t `i ,t −1 ¡¡ ¢ ¢ ≤ exp η t −1 − η t (t − 2) = 1/c t . 1 One can easily go back and forth between this notation and the one used in previous chapters by using the transformation `i ,t = 1 − r t (i ) for all time steps and actions. Note that the regret is invariant under this transformation. 101 Note that the only difference between the mSD and the EWA prediction algorithms is the presence of the first random choice in step 2d of mSD: while the EWA algorithm chooses a new action in each time step t according to the distribution {p 1,t , . . . , p N ,t }, the mSD algorithm sticks with the previously chosen action with some probability. By precise tuning of this probp ability, the method guarantees that actions are changed over time only at most O( T ) times in T time steps, while maintaining the same marginal distributions over the actions as the EWA algorithm. The latter fact guarantees that the expected regret of the two algorithms are the same. In the following we formalize the above statements concerning the mSD algorithm. The next lemma shows that the marginal distributions generated by the mSD and the EWA algorithms are the same. The lemma is obtained by a slight modification of the proof Lemma 1 in [39]. Lemma 6.1. Assume the mSD algorithm is run with η t +1 ≤ η t for all t = 1, 2, . . . , T . Then the probability of selecting action i at time t satisfies P [It = i ] = p i ,t for all t = 1, 2, . . . and i ∈ {1, 2, . . . , N }. Proof. We will use the notation Wt = PN i =1 w i ,t . We prove the lemma by induction on 1 ≤ t ≤ T . For t = 1, the statement follows from the definition of the algorithm. Now assume that t ≥ 2 and the hypothesis holds for t − 1. For all i ∈ {1, 2, . . . , N }, we have N X µ ¶ £ ¤ w j ,t P [It = i ] = P [It −1 = i ] c t + p i ,t P It −1 = j 1 − c t w i ,t −1 w j ,t −1 j =1 µ ¶ N X w j ,t w i ,t p j ,t −1 1 − c t + p i ,t = p i ,t −1 c t w i ,t −1 w j ,t −1 j =1 µ ¶ N w w j ,t w i ,t −1 w i ,t w i ,t X j ,t −1 = ct + 1 − ct Wt −1 w i ,t −1 Wt j =1 Wt −1 w j ,t −1 w i ,t = ct w i ,t Wt −1 + w i ,t Wt − ct w i ,t w i ,t Wt = = p i ,t . Wt Wt −1 Wt As a consequence of this result, the expected regret of mSD matches that of EWA, so the performance bound of EWA holds for the mSD algorithm as well [39, Lemma 2]). That is, the following result can be obtained by a slight modification of the proof of [41, Lemma 1]. 
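Before turning to the regret bound, the sampling mechanism analyzed in Lemma 6.1 (steps 2a-2d of Algorithm 6.1) can be summarized in a few lines. The sketch below is our own re-implementation for illustration, not the thesis's code; the caller is assumed to pass back, from one round to the next, the previously chosen action together with its weight from the previous round.

```python
import math
import random

def msd_step(L_prev, eta_t, eta_prev, t, prev_action, prev_weight):
    """One round of the modified Shrinking Dartboard (mSD) algorithm.

    L_prev      -- cumulative losses L_{i,t-1}, i = 0..N-1 (all zeros at t = 1)
    eta_t       -- current step size (non-increasing: eta_t <= eta_{t-1})
    eta_prev    -- previous step size eta_{t-1} (take eta_prev = eta_t at t = 1)
    t           -- round index, starting from 1
    prev_action -- I_{t-1}, the action played in the previous round (None at t = 1)
    prev_weight -- w_{I_{t-1}, t-1}, that action's weight in the previous round
    Returns (I_t, w_{I_t, t}); feed the pair back in as (prev_action, prev_weight).
    """
    N = len(L_prev)
    w = [math.exp(-eta_t * L) / N for L in L_prev]      # step 2a: w_{i,t}
    W = sum(w)
    p = [wi / W for wi in w]                            # step 2b: EWA distribution p_{i,t}
    if t >= 2 and prev_action is not None:
        c_t = math.exp((eta_t - eta_prev) * (t - 2))    # step 2c
        stay_prob = c_t * w[prev_action] / prev_weight  # step 2d; <= 1 since eta_t <= eta_{t-1}
        if random.random() < stay_prob:
            return prev_action, w[prev_action]          # keep the previous action
    choice = random.choices(range(N), weights=p)[0]     # otherwise redraw from p_t
    return choice, w[choice]
```

By Lemma 6.1, the marginal distribution of the returned action is exactly p_t, which is why the regret guarantee of EWA carries over, while the lazy branch is what keeps the expected number of switches small.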
We include the proof for completeness. Lemma 6.2. Assume η t +1 ≤ η t for all t = 1, 2, . . . , T . Then the total expected regret of the mSD algorithm can be bounded as T η £ ¤ X ln N t bT ≤ + . E L ηT t =1 8 Proof. Introduce the following notation: w i0 ,t = 1 −η t −1 L i ,t −1 e , N 102 Pt −1 Note that the difference between w i ,t and w i0 ,t is that η t is replaced by P η t −1 in the latter. We will also use Wt0 = iN=1 w i0 ,t . First, we have where L i ,t −1 = s=1 `i ,s . W0 1 1 ln t +1 = ln ηt Wt ηt ≤− N X PN i =1 w i ,t e −η t `i ,t Wt p i ,t `i ,t + i =1 = N X 1 p i ,t e −η t `i ,t ln η t i =1 £ ¤ ηt ηt ≤ −E `It ,t + 8 8 where the second to last step follows from Hoeffding’s inequality (see, e.g., [25, Lemma A.1]) and the fact that `i ,t ∈ [0, 1]. After rearranging, we get £ ¤ W0 ηt 1 E `It ,t ≤ − ln t +1 + . ηt Wt 8 (6.1) Rewriting the first term on the right hand side, we obtain lnWt − lnWt0+1 ηt µ = ¶ µ ¶ lnWt lnWt +1 lnWt +1 lnWt0+1 − + − . ηt η t +1 η t +1 ηt The first term can be telescoped as ¶ T µ lnW X ln w i ,T +1 lnWt +1 lnW1 lnWT +1 t − = lnW1 − ≤− ηt η t +1 η1 η T +1 η T +1 t =1 =− 1 1 ln N ln e −ηT +1 L i ,T = L i ,T + , η T +1 N η T +1 for any 1 ≤ i ≤ N . Now to deal with the second term, observe that N 1 ¡ N 1 X X ¢ ηt +1 e −η t +1 L i ,t = e −η t L i ,t ηt i =1 N i =1 N ! ηt +1 à ηt N 1 X ¡ ¢ ηt +1 ≤ e −η t L i ,t = Wt0+1 ηt , i =1 N Wt +1 = where we have used that η t +1 ≤ η t and thus we can apply Jensen’s inequality for the concave function x η t +1 ηt , x ∈ R. Thus we have lnWt +1 lnWt0+1 η t +1 lnWt0+1 lnWt0+1 − ≤ − = 0. η t +1 ηt ηt η t +1 ηt Substituting into (6.1) and summing for all t = 1, 2, . . . , T , we obtain T T X £ ¤ 1X ln N E `It ,t ≤ L i ,T + ηt + . 8 η T +1 t =1 t =1 Finally, since the losses `i ,t , i ∈ {1, 2, . . . , N } and `It ,t do not depend on η T +1 for t ≤ T , we can choose, without loss of generality η T +1 = η T , and the statement of the lemma follows. 103 p ln N /T optimally (as a function of the time horizon T ), the bound p £ ¤ p T ln N b T ≤ T ln N . , while setting η = 2 ln N /t independent of T , we have E L t 2 Remark 6.1. q Setting η t = becomes Now let us consider the total number of action switches CT . The next lemma, which is a slightly improved and generalized version of Lemma 2 from [39] gives an upper bound on the expectation of CT . Lemma 6.3. Let CT denote the number of times the mSD algorithm switches between different actions. Then E [CT ] ≤ η T L ∗T −1 + ln N + TX −1 (η t − η T ) t =1 where L ∗T −1 = mini ∈{1,2,...,N } L i ,T −1 . Proof. The probability of switching experts in step t ≥ 2 is N X µ ¶ ¢ w i ,t ¡ P [It −1 = i ] 1 − c t αt = P [It −1 6= It ] = 1 − p i ,t w i ,t −1 i =1 µ ¶ N N w X X w i ,t w i ,t Wt i ,t −1 P [It −1 = i ] 1 − c t ≤ ct = 1 − ct = 1− w i ,t −1 w i ,t −1 Wt −1 i =1 i =1 W t −1 def where in the last line we used Lemma 6.1. Reordering gives Wt ≤ WT ≤ W1 1−αt c t W t −1 and thus T 1−α T 1−α Y Y t t = . c c t t t =2 t =2 On the other hand, WT ≥ w T, j ≥ 1 −ηT L j ,T −1 e . N for arbitrary j ∈ {1, 2, . . . , N }. Taking logarithms of both inequalities and putting them together, we get − ln N − η T L ∗T −1 = −η T min j ∈{1,2,...,N } L j ,T −1 ≤ T X ln(1 − αt ) − T X ln c t . t =2 t =2 Now using ln(1 − x) ≤ −x for all x ∈ [0, 1), we obtain T X E [CT ] = t =2 αt ≤ η T L ∗T −1 + ln N − T X ln c t . t =2 Now the statement of the lemma follows since − T X t =2 ln c t = T X (η t −1 − η t )(t − 2) ≤ t =2 TX −1 (η t − η T ). 
t =1 p p p In particular, for η t = ln N /T , we have E [CT ] ≤ T ln N +ln N , while setting η t = 2 ln N /t , p we obtain E [CT ] ≤ 4 T ln N + ln N . 104 6.2 Prediction by random-walk perturbation In this section we propose a variant of the Follow-the-Perturbed-Leader (FPL) algorithm that p switches actions at most O( T ln N ) times in expectation. The proposed forecaster perturbs the loss of each action at every time instant by a symmetric coin flip and chooses an action with minimal cumulative perturbed loss. More precisely, the algorithm draws independent random variables Xi ,t that take values ±1/2 with equal probabilities and Xi ,t is added to each loss `i ,t −1 . ¢ P ¡ At time t action i is chosen that minimizes ts=1 `i ,t −1 + Xi ,t (where we define `i ,0 = 0). Algorithm 6.2 Prediction by random-walk perturbation. Initialization: set L i ,0 = 0 and Zi ,0 = 0 for all i = 1, 2, . . . , N . For all t = 1, 2, . . . , T , repeat 1. Draw Xi ,t for all i = 1, 2, . . . , N such that ( Xi ,t = 1 2 with probability − 21 with probability 1 2 1 2. 2. Let Zi ,t = Zi ,t −1 + Xi ,t for all i = 1, 2, . . . , N . 3. Choose action ¡ ¢ It = arg min L i ,t −1 + Zi ,t . i 4. Observe losses `i ,t for all i = 1, 2, . . . , N , suffer loss `I t ,t . 5. Set L i ,t = L i ,t −1 + `i ,t for all i = 1, 2, . . . , N . Equivalently, the forecaster may be thought of as an FPL algorithm in which the cumulative P losses L i ,t −1 are perturbed by Zi ,t = it =1 Xi ,t . Since for each fixed i , Zi ,1 , Zi ,2 , . . . is a symmetric random walk, cumulative losses of the N actions are perturbed by N independent symmetric random walks. This is the way the algorithm is presented in Algorithm 6.2. A simple variation is when one replaces random coin flips by independent standard normal random variables. Both have similar performance guarantees and we choose ±(1/2)-valued perturbations for mathematical convenience. In Section 6.2.3 we switch to normally distributed perturbations—again driven by mathematical simplicity. In practice both versions are expected to have a similar behavior. Conceptually, the difference between standard FPL and the proposed version is the way the perturbations are generated: while common versions of FPL use perturbations that are generated in an i.i.d. fashion, the perturbations of the algorithm proposed here are dependent. This will enable us to control the number of action switches during the learning process. Note p that the standard deviation of these perturbations is still of order t just like for the standard FPL forecaster with optimal parameter settings. To obtain intuition why this approach will solve our problem, first consider a problem with N = 2 actions and an environment that generates equal losses, say `i ,t = 0 for all i and t , for all actions. When using i.i.d. perturbations, FPL switches actions with probability 1/2 in each 105 p round, thus yielding Ct = t /2 + O( t ) with overwhelming probability. The same holds for the exponentially weighted average forecaster. On the other hand, when using the random-walk perturbations described above, we only switch between the actions when the leading random walk is changed, that is, when the difference of the two random walks—which is also a symmetric random walk—hits zero. It is a well known that the number of occurrences of this event p up to time t is O p ( t ), see, [35]. As we show below, this is the worst case for the number of switches. The next theorem summarizes our performance bounds for the proposed forecaster. Theorem 6.1. 
The next theorem summarizes our performance bounds for the proposed forecaster.

Theorem 6.1. The expected regret and the expected number of action switches of the forecaster of Algorithm 6.2 satisfy, for all possible loss sequences (under the oblivious-adversary model),
$$\mathbb{E}\bigl[\widehat{L}_T\bigr] \le 2\,\mathbb{E}[C_T] \le 8\sqrt{2T \ln N} + 16 \ln T + 16.$$

Remark. Even though we only prove bounds for the expected regret and the expected number of switches, it is of great interest to understand upper tail probabilities. However, this is a highly nontrivial problem. One may get an intuition by considering the case when $N = 2$ and all losses are equal to zero. In this case the algorithm switches actions whenever a symmetric random walk returns to zero. This distribution is well understood, and the probability that this occurs more than $x\sqrt{T}$ times during the first $T$ steps is roughly $2\,\mathbb{P}\{\mathcal{N} > 2x\} \le 2e^{-2x^2}$, where $\mathcal{N}$ is a standard normal random variable (see [35, Section III.4]). Thus, in this case we see that the number of switches is bounded by $O\bigl(\sqrt{T \ln(1/\delta)}\bigr)$ with probability at least $1 - \delta$. However, proving analogous bounds for the general case remains a challenge.

To prove the theorem, we first show that the regret can be bounded in terms of the number of action switches. Then we turn to analyzing the expected number of action switches.

6.2.1 Regret and number of switches

The next simple lemma shows that the regret of the forecaster may be bounded in terms of the number of times the forecaster switches actions.

Lemma 6.4. Fix any $i \in \{1, 2, \ldots, N\}$. Then
$$\sum_{t=1}^{T} \ell_{I_t,t} - L_{i,T} \le 2C_T + Z_{i,T+1} - \sum_{t=1}^{T+1} X_{I_{t-1},t}.$$

Proof. We apply Lemma 3.1 of [25] (sometimes referred to as the "be-the-leader" lemma) to the sequence $(\ell_{\cdot,t-1} + X_{\cdot,t})_{t=1}^{\infty}$ with $\ell_{j,0} = 0$ for all $j \in \{1, 2, \ldots, N\}$, obtaining
$$\sum_{t=1}^{T+1} \bigl( \ell_{I_t,t-1} + X_{I_t,t} \bigr) \le \sum_{t=1}^{T+1} \bigl( \ell_{i,t-1} + X_{i,t} \bigr) = L_{i,T} + Z_{i,T+1}.$$
Reordering terms, we get
$$\sum_{t=1}^{T} \ell_{I_t,t} \le L_{i,T} + \sum_{t=1}^{T+1} \bigl( \ell_{I_{t-1},t-1} - \ell_{I_t,t-1} \bigr) + Z_{i,T+1} - \sum_{t=1}^{T+1} X_{I_t,t}. \tag{6.2}$$
The last term can be rewritten as
$$-\sum_{t=1}^{T+1} X_{I_t,t} = -\sum_{t=1}^{T+1} X_{I_{t-1},t} + \sum_{t=1}^{T+1} \bigl( X_{I_{t-1},t} - X_{I_t,t} \bigr).$$
Now notice that $X_{I_{t-1},t} - X_{I_t,t}$ and $\ell_{I_{t-1},t-1} - \ell_{I_t,t-1}$ are both zero when $I_t = I_{t-1}$ and are upper bounded by $1$ otherwise. That is, we get that
$$\sum_{t=1}^{T+1} \bigl( \ell_{I_{t-1},t-1} - \ell_{I_t,t-1} \bigr) + \sum_{t=1}^{T+1} \bigl( X_{I_{t-1},t} - X_{I_t,t} \bigr) \le 2 \sum_{t=1}^{T+1} \mathbb{I}_{\{I_{t-1} \ne I_t\}} = 2C_T.$$
Putting everything together gives the statement of the lemma.

6.2.2 Bounding the number of switches

Next we analyze the number of switches $C_T$. In particular, we upper bound the marginal probability $\mathbb{P}[I_{t+1} \ne I_t]$ for each $t \ge 1$. We define the lead pack $A_t$ as the set of actions that, at time $t$, have a positive probability of taking the lead at time $t+1$:
$$A_t = \Bigl\{ i \in \{1, 2, \ldots, N\} : L_{i,t-1} + Z_{i,t} \le \min_{j} \bigl( L_{j,t-1} + Z_{j,t} \bigr) + 2 \Bigr\}.$$
We bound the probability of a lead change as
$$\mathbb{P}[I_t \ne I_{t+1}] \le \frac{1}{2}\,\mathbb{P}\bigl[ |A_t| > 1 \bigr].$$
The key to the proof of the theorem is the following lemma, which gives an upper bound on the probability that the lead pack contains more than one action. It implies, in particular, that
$$\mathbb{E}[C_T] \le 4\sqrt{2T \ln N} + 4 \ln T + 4,$$
which is what we need to prove the expected-value bounds of Theorem 6.1.

Lemma 6.5.
$$\mathbb{P}\bigl[ |A_t| > 1 \bigr] \le 4\sqrt{\frac{2 \ln N}{t}} + \frac{8}{t}.$$

Proof. Define $p_t(k) = \mathbb{P}\bigl[ Z_{i,t} = \tfrac{k}{2} \bigr]$ for all $k = -t, \ldots, t$, and let $S_t$ denote the set of leaders at time $t$ (so that the forecaster picks $I_t \in S_t$ arbitrarily):
$$S_t = \Bigl\{ j \in \{1, 2, \ldots, N\} : L_{j,t-1} + Z_{j,t} = \min_{i} \bigl( L_{i,t-1} + Z_{i,t} \bigr) \Bigr\}.$$
Let us start with analyzing P [|At | = 1]: N t X X ¸ © ª k L i ,t −1 + Zi ,t ≥ L j ,t −1 + + 2 i ∈{1,2,...,N }\ j 2 k=−t j =1 ¸ · tX −4 X N © ª p t (k) k +4 ≥ p t (k + 4)P min L i ,t −1 + Zi ,t ≥ L j ,t −1 + i ∈{1,2,...,N }\ j 2 p t (k + 4) k=−t j =1 ¸ · t N X X © ª k p t (k − 4) = p t (k)P min L i ,t −1 + Zi ,t ≥ L j ,t −1 + . i ∈{1,2,...,N j }\ 2 p t (k) k=−t +4 j =1 P [|At | = 1] = p t (k)P · min Before proceeding, we need to make two observations. First of all, N X j =1 p t (k)P ¸ · ¸ © ª k k ≥ P ∃ j ∈ St : Z j ,t = L i ,t −1 + Zi ,t ≥ L j ,t −1 + i ∈{1,2,...,N }\ j 2 2 · ¸ k ≥ P min Z j ,t = , j ∈St 2 · min where the first inequality follows from the union bound and the second from the fact that the latter event implies the former. Also notice that Zi ,t + 2t is binomially distributed with parame¡ t ¢ ters t and 1/2 and therefore p t (k) = t +k 21t . Hence 2 ³ p t (k − 4) =³ t +k p t (k) 2 = 1+ ´³ ´ ! t −k 2 ! ´³ ´ − 2 ! t −k + 2 ! 2 t +k 2 4(t + 1)(k − 2) . (t − k + 2)(t − k + 4) It can be easily verified that 4(t + 1)(k − 2) 4(t + 1)(k − 2) ≥ (t − k + 2)(t − k + 4) (t + 2)(t + 4) holds for all k ∈ [−t , t ]. Using our first observation, we get t X X ¸ © ª k p t (k − 4) P [|At | = 1] ≥ p t (k)P min L i ,t −1 + Zi ,t ≥ L j ,t −1 + i ∈{1,2,...,N }\ j 2 p t (k) j k=−t +4 · ¸ t X k p t (k − 4) ≥ P min Z j ,t = . j ∈S 2 p t (k) t k=−t +4 · 108 Along with our second observation, this implies t X · ¸ k p t (k − 4) P min Z j ,t = P [|At | > 1] ≤1 − j ∈St 2 p t (k) k=−t +4 · ¸µ ¶ t X k 4(t + 1)(k − 2) P min Z j ,t = 1+ ≤1 − j ∈St 2 (t + 2)(t + 4) k=−t +4 · ¸ µ ¶ t X k 4(2 − k)(t + 1) P min Z j ,t = ≤ j ∈St 2 (t + 2)(t + 4) k=−t ¸ · 8(t + 1) t +1 = −8 E min Z j ,t j ∈St (t + 2)(t + 4) (t + 2)(t + 4) ¸ · 8 8 ≤ + E max Z j ,t . j ∈{1,2,...,N } t t £ ¤ q Now using E max j Z j ,t ≤ t ln2 N implies s P [|At | > 1] ≤ 4 2 ln N 8 + t t as desired. 6.2.3 Online combinatorial optimization In this section we study the case of online linear optimization (see, among others, [38], [66], [40], [94], [64], [99], [56], [53], [68], [27], [5]). This is a similar prediction problem as the one described on Figure 2.1 but here each action i is represented by a vector v i ∈ Rd . The loss corresponding to action i at time t equals v i> `t where `t ∈ [0, 1]d is the so-called loss vector. Thus, given a set of actions S = {v i : i = 1, 2, . . . , N } ⊆ Rd , at every time instant t , the forecaster chooses, in a possibly randomized way, a vector Vt ∈ S and suffers loss V> t `t . Using this notation, the regret becomes T X t =1 where L t = Pt s=1 `s > V> t `t − min v L T v∈S is the cumulative loss vector. While the results of the previous section still hold when treating each v i ∈ S as a separate action, one may gain important computational advantage by taking the structure of the action set into account. In particular, as [64] emphasize, FPL-type forecasters may often be computed efficiently. In this section we propose such a forecaster which adds independent random-walk perturbations to the individual components of the loss vector. To gain simplicity in the presentation, we restrict our attention to the case of online combinatorial optimization in which S ⊂ {0, 1}d , that is, each action is represented a binary vector. This special case arguably contains most important applications such as the (non-stochastic) online shortest path problem. In this example, a fixed directed acyclic graph of d edges is given with two distinguished vertices u and w. 
The forecaster, at every time instant t , chooses a directed path from u to w. Such a path is represented by it binary incidence vector v ∈ {0, 1}d . The components of the loss vector `t ∈ [0, 1]d represent losses assigned to the 109 d edges and v > `t is the total loss assigned to the path v. Another (non-essential) simplifying assumption is that every action v ∈ S has the same number of 1’s: kvk1 = m for all v ∈ S. The value of m plays an important role in the bounds below. The proposed prediction algorithm is defined as follows. Let X1 , . . . , Xn be independent Gaussian random vectors taking values in Rd such that the components of each Xt are i.i.d. normal Xi ,t ∼ N(0, η2 ) for some fixed η > 0 whose value will be specified later. Denote Zt = t X Xt . s=1 The forecaster at time t , chooses the action © ª Vt = arg min v > (L t −1 + Zt ) , v∈S where L t = Pt s=1 `t for t ≥ 1 and L 0 = (0, . . . , 0)> . The next theorem bounds the performance of the proposed forecaster. Again, we are not P only interested in the regret but also the number of switches nt=1 I{Vt +1 6=Vt } . The regret is of simp ilar order—roughly m d T —as that of the standard FPL forecaster, up to a logarithmic factor. p ¢ ¡ Moreover, the expected number of switches is O m(ln d )5/2 T . Remarkably, the dependence on d is only polylogarithmic and it is the weight m of the actions that plays an important role. We note in passing that the SD algorithm presented in Section 6.1 can be used for simulp taneously guaranteeing that the expected regret is O(m 3/2 T ln d ) and the expected number p of switches is mT ln d . However, as this algorithm requires explicit computation of the exponential weighted forecaster, it can only be efficiently implemented for some special decision sets S—see [68] and [27] for some examples. On the other hand, our algorithm can be efficiently implemented whenever there exists an efficient implementation of the static optimization problem of finding arg minv∈S v > ` for any ` ∈ Rd . Theorem 6.2. The expected regret and the expected number of action switches satisfy (under the oblivious adversary model) µ ¶ p p 2d md (ln T + 1) b LT ≤ m T + η 2 ln d + η η2 and " E T X t =1 # I{Vt +1 6=Vt } ≤ µ ³ ³ ´ ´2 ¶ p p 2 m 1 + 2η 2 ln d + 2 ln d + 2 ln d + 1 + η 2 ln d + 1 T X 4η2 t ´´ p p 2 ln d T m 1 + η 2 ln d + 2 ln d + 1 X + . p η t t =1 t =1 ³ ³ 110 In particular, setting η = q p 2d 2 ln d yields p p 4 p b T ≤ 4m d T ln d + m(ln T + 1) ln d . L and E · n X t =1 ¸ ³ p ´ I{Vt +1 6=Vt } = O m(ln d )5/2 T . The proof of the regret bound is quite standard, similar to the proof of Theorem 3 in [6], and is deferred to the end of this section. The more interesting part is the bound for the ex£P ¤ P pected number of action switches E nt=1 I{Vt +1 6=Vt } = nt=1 P [Vt +1 6= Vt ]. The upper bound on this quantity follows from the lemma below and the well-known fact that the expected value of the maximum of the square of d independent standard normal random variables is at most p 2 ln d + 2 ln d + 1 (see, e.g., [18]). Thus, it suffices to prove the following: Lemma 6.6. For each t = 1, 2, . . . , T , P [Vt +1 6= Vt |Xt +1 ] ≤ m k`t + Xt +1 k2∞ 2η2 t p m k`t + Xt +1 k∞ 2 ln d + p η t Proof. We use the notation Pt [·] = P [· |Xt +1 ] and Et [·] = E [· |Xt +1 ]. Also, let ht = `t + Xt +1 and Ht = tX −1 ht . s=0 Furthermore, we will use the shorthand notation c = kht k∞ . Define the set At as the lead pack: © ª At = w ∈ S : (w − Vt )> Ht ≤ kw − Vt k1 c . 
Observe that the choice of c guarantees that no action outside At can take the lead at time t +1, since if w 6∈ At , then ¯ ¯ (w − Vt )> Ht ≥ ¯(w − Vt )> ht ¯ so (w − Vt )> Ht +1 ≥ 0 and w cannot be the new leader. It follows that we can upper bound the probability of switching as Pt [Vt +1 6= Vt ] ≤ Pt [|At | > 1] , which leaves us with the problem of upper bounding Pt [|At | > 1]. Similarly to the proof of Lemma 6.5, we start analyzing Pt [|At | = 1]: Pt [|At | = 1] = = X v∈S £ ¤ Pt ∀w 6= v : (w − v)> Ht ≥ kw − vk1 c X Z v∈S y∈R ¯ £ ¤ f v (y)Pt ∀w 6= v : w > Ht ≥ y + kw − vk1 c ¯v > Ht = y d y, 111 (6.3) where f v is the distribution of v > Ht . Next we crucially use the fact that the conditional distributions of correlated Gaussian random variables are also Gaussian. In particular, defining k(w, v) = (m − kw − vk1 ), the covariances are given as ¡ ¢ cov w > Ht , v > Ht = η2 (m − kw − vk1 )t = η2 k(w, v)t . Let us organize all actions w ∈ S \ v into a matrix W = (w 1 , w 2 , . . . , w N −1 ). The conditional distribution of W > Ht is an (N − 1)-variate Gaussian distribution with mean µ ¶ k(w 1 , v) > k(w 2 , v) k(w N −1 , v) > > µv (y) = w 1> L t −1 + y , w 2 L t −1 + y , . . . , wN L + y −1 t −1 m m m and covariance matrix Σv , given that v > Ht = y. Defining K = (k(w 1 , v), . . . , k(w N −1 , v))> and 2 1 exp(− x2 ), we get that (2π)N −1 |Σv | using the notation ϕ(x) = p ¯ £ ¤ Pt ∀w 6= v : w > Ht ≥ y + kw − vk1 c ¯v > Ht = y Z = ∞ Z φ ··· ´ ³q (z − µv (y))> Σ−1 (z − µ (y)) dz v y z i =y+(m−k(w i ,v))c Z ∞ Z = ··· φ µq ¡ z − µv (y) − cK ¢> Σ−1 y ¡ z − µv (y) − cK ¶ ¢ dz z i =y+(m−k(w i ,v))c+k(w i ,v)c Z ∞ Z µq = φ ··· ¡ ¶ ¢> −1 ¡ ¢ z − µv (y + mc) Σ y z − µv (y + mc) d z z i =y+mc ¯ £ ¤ = Pt ∀w 6= v : w > Ht ≥ y + mc ¯ v > Ht = y + mc , where we used µ y+mc = µ y + cK . Using this, we rewrite (6.3) as Pt [|At | = 1] = X Z v∈S y∈R − ¯ £ ¤ f v (y)Pt ∀w 6= v : w > Ht ≥ y ¯ v > Ht = y d y X Z ¡ v∈S y∈R =1 − ¯ ¢ £ ¤ f v (y) − f v (y − mc) Pt ∀w 6= v : w > Ht ≥ y ¯ v > Ht = y d y X Z ¡ v∈S y∈R ¯ ¢ £ ¤ f v (y) − f v (y − mc) Pt ∀w 6= v : w > Ht ≥ y ¯ v > Ht = y d y. To treat the remaining term, we use that v > Ht is Gaussian with mean v > L t −1 and standard p deviation η mt and obtain µ ¶ f v (y − mc) f v (y) − f v (y − mc) = f v (y) 1 − f v (y) µ ¶ 2 mc c(y − v > L t −1 ) ≤ f v (y) − . 2η2 t η2 t 112 Thus, Pt [|At | > 1] ≤ ≤ X Z ¡ v∈S y∈R ¯ ¢ £ ¤ f v (y) − f v (y − mc) Pt ∀w 6= v : w > Ht ≥ y ¯ v > Ht = y d y ¶ ¯ £ ¤ mc 2 c(y − v > L t −1 ) − Pt ∀w 6= v : w > Ht ≥ y ¯ v > Ht = y d y f v (y) 2 2 2η t η t µ X Z v∈S y∈R £ ¤ £ ¤ mc 2 c E V> mc 2 mc E kZt k∞ t Zt = 2 − ≤ 2 + 2η t η2 t 2η t η2 t p 2 m kht k∞ m kht k∞ 2 ln d + = , p 2η2 t η t p ¤ £ where we used the definition of c and E kZt k∞ ≤ η 2t ln d in the last step. Proof of the first statement of Theorem 6.2. The proof is based on the proof of Theorem 4.2 of [25] and Theorem 3 of [6]. The main difference from those proofs is that the standard deviation of our perturbations changes over time, however, this issue is very easy to treat. First, we define p b t = t X1 : an infeasible “forecaster” that peeks one step into the future and uses perturbation Z ¡ ¢ b t = arg min w > L t + Z bt . V w∈S Using Lemma 3.1 of [25], we get for any fixed v ∈ S that T X > b b b b> V t (`t + (Zt − Zt −1 )) ≤ v (L n + ZT ). 
t =1 After reordering, we obtain T X T X > >b V> t ` t ≤ v L T + v ZT + t =1 bT + = v >LT + v >Z t =1 T X b t )> ` t − (Vt − V b t )> ` t + (Vt − V t =1 T X t =1 T X b b b> V t (Zt − Zt −1 ) p p > b t X1 ( t − 1 − t )V t =1 The last term can be bounded as T p T p X X p ¯ > ¯ p > b t X1 ¯ b t X1 ≤ ( t − t − 1) ¯V ( t − 1 − t )V t =1 t =1 ≤m T p X p ( t − t − 1) kX1 k∞ t =1 p ≤m T kX1 k∞ . Taking expectations, we obtain the bound " E T X t =1 # V> t `t − v >LT ≤ T p X £ ¤ b t )> `t + ηm 2T ln d , E (Vt − V t =1 113 p £ ¤ £ ¤ b t )> `t where we used E kX1 k∞ ≤ η 2 ln d . That is, we are left with the problem of bounding E (Vt − V for each t ≥ 1. To this end, let v(z) = arg min w > z w∈S for all z ∈ Rd , and also F t (z) = v(z)> `t . b t . We have Further, let f t (z) be the density of Zt , which coincides with the density of Z £ ¤ E V> t `t =E [F t (L t −1 + Zt )] Z = f t (z)F t (L t −1 + z) d z d Zz∈R = f t (z)F t (L t − `t + z) d z z∈Rd Z = f t (z + `t )F t (L t + z) d z z∈Rd Z £ ¤ ¡ ¢ b =E F t (L t + Zt ) + f t (z + `t ) − f t (z) F (L t + z) d z d z∈R Z £ > ¤ ¡ ¢ b t `t + =E V f t (z) − f t (z − `t ) F (L t −1 + z) d z . z∈Rd The last term can be upper bounded as Z z∈Rd ¶¶ µ µ (z − `t )> `t F t (L t −1 + z) d z f t (z) 1 − exp η2 t ¶ µ Z (z − `t )> `t F (L t −1 + z) d z ≤− f t (z) η2 t z∈Rd £ ¤ Z 2 ¯ ¯ E V> m t `t k`t k2 ≤ + f t (z) ¯z > `t ¯ d z 2 2 η t η t z∈Rd Z md m f t (z) kzk1 d z ≤ 2 + 2 η t η t z∈Rd r md 2 md = 2 + · p , η t π η t p £ ¤ where we used E kZt k1 = ηd 2t /π in the last step. Putting everything together, we obtain the statement of the theorem as E · n X t =1 V> t `t ¸ r T md T p X X 2 md − v LT ≤ + · + ηm 2T ln d p 2 π η t t =1 η t t =1 p p 2md T md (ln T + 1) ≤ + ηm 2T ln d + . η η2 > 114 Chapter 7 Online lossy source coding In this chapter we consider the problem of fixed-rate sequential lossy source coding of individual sequences with limited delay. Here a source sequence z 1 , z 2 , . . . taking values from the source alphabet Z has to be transformed into a sequence y 1 , y 2 , . . . of channel symbols taking values in the finite channel alphabet {1, . . . , M }, and these channel symbols are then used to produce the reproduction sequence ẑ 1 , ẑ 2 , . . .. The rate of the scheme is defined as ln M nats (where ln denotes the natural logarithm), and the scheme is said to have δ1 encoding and δ2 decoding delay if, for any t = 1, 2, . . ., the channel symbol y t depends on z t +δ1 = (z 1 , z 2 , . . . , z t +δ1 ) and ẑ t depends on y t +δ2 = (y 1 , . . . , y t +δ2 ). The goal of the coding scheme is to minimize the distortion between the source sequence and the reproduction sequence. We concentrate on the individual sequence setting and aim to find methods that work uniformly well with respect to a reference coder class on every individual (deterministic) sequence. Thus, no probabilistic assumption is made on the source sequence, and the performance of a scheme is measured by the distortion redundancy defined as the maximal difference, over all source sequences of a given length, between the normalized distortion of the given coding scheme and that of the best reference coding scheme matched to the underlying source sequence. As will be shown later, this problem is an instance of online learning with switching costs, where taking actions corresponds to selecting coding schemes, distortions correspond to negb T /T . 
ative rewards and the distortion redundancy corresponds to the average expected regret L The switching cost naturally arises from the fact that every time the coding scheme is changed on the source side, the receiver has to be informed of the new decoding scheme via the same channel that transmits the useful information. This chapter presents our works György and Neu [48] and György and Neu [49]. Our results are summarized below. Thesis 5. Proposed an efficient algorithm for the problem of online lossy source coding. Proved performance guarantees for Setting 4. The proved bounds are optimal in the number of time steps, up to logarithmic factors. [48, 49] 115 7.1 Related work The study of limited-delay (zero-delay) lossy source coding in the individual sequence setting was initiated by Linder and Lugosi [71], who showed the existence of randomized coding schemes that perform, on any bounded source sequence, essentially as well as the best scalar quantizer matched to the underlying sequence. More precisely, it was shown that the normalized squared error distortion of their scheme on any source sequence z T of length T is at most O(T −1/5 ln T ) larger than the normalized distortion of the best scalar quantizer matched to the source sequence in hindsight. The method of [71] is based on the exponentially weighted average (EWA) prediction method [96, 97, 72]: at each time instant a coding scheme (a scalar quantizer) is selected based on its “estimated” performance. A major problem in this approach is that the prediction, and hence the choice of the quantizer at each time instant, is performed based on the source sequence which is not known exactly at the decoder. Therefore, in [71] information about the source sequence that is used in the random choice of the quantizers is also transmitted over the channel, reducing the available capacity for actually encoding the source symbols. The coding scheme of [71] was improved and generalized by Weissman and Merhav [100]. They considered the more general case when the reference class F is a finite set of limited-delay and limited-memory coding schemes. To reduce the communication about the actual decoder to be used at the receiver, Weissman and Merhav introduced a coding scheme where the source sequence is split into blocks of equal length, and in each block a fixed encoder-decoder pair is used with its identity communicated at the beginning of each block. Similarly to [71], the code for each block is chosen using the EWA prediction method. The resulting scheme achieves an O(T −1/3 ln2/3 |F|) distortion redundancy, or, in the case of the infinite class of scalar quantizers, the distortion redundancy becomes O(T −1/3 ln T ). The results of [100] have been extended in various ways, but all of these works are based on the block-coding procedure described above. A disadvantage of this method is that the EWA prediction algorithm keeps one weight for each code in the reference class, and so the computational complexity of the method becomes prohibitive even for relatively simple and small reference classes. Computationally efficient solutions to the zero-delay case were given by György, Linder and Lugosi using dynamic programming [95] and EWA prediction in [43] and based on the Follow-the-Perturbed-Leader prediction method (see [51, 64]) in [44]. 
The first method achieves the O(T −1/3 ln T ) redundancy of Weissman and Merhav with O(M T 4/3 ) p computational and O(T 2/3 ) space complexity and a somewhat worse O(T −1/4 ln T ) distortion redundancy with linear O(M T ) time and O(T 1/2 ) space complexity, while the second method achieves O(T −1/4 ln T ) distortion redundancy with the same O(M T ) linear time complexity and O(M T 1/4 ) space complexity. Matloub and Weissman [75] extended the problem to allow a stochastic discrete memoryless channel between the encoder and the decoder, while Reani and Merhav [88] extended the model to the Wyner-Ziv case (i.e., when side information is also available at the decoder). The performance bound in both cases are based on [100] while low-complexity solutions for 116 the zero-delay case are provided based on [44] and [43], respectively. Finally, the case when the reference class is a set of time-varying limited-delay limited-memory coding scheme was analyzed in [45], and efficient solutions were given for the zero-delay case for both traditional and network (multiple-description and multi-resolution) quantization. Since most of the above coding schemes are based on the block-coding scheme of [100], they cannot achieve better distortion redundancy than O(T −1/3 ) up to some logarithmic factors. On the other hand, the distortion redundancy is known to be bounded from below by a constant multiple of T −1/2 in the zero-delay case [43], leaving a gap between the best known upper and lower bounds. Furthermore, if the identity of the used coding scheme were communicated as side information (before the encoded symbol is revealed) the employed EWA prediction method would guarantee an O(T −1/2 ln |F|) distortion redundancy for any finite reference coder class F (of limited delay and limited memory), in agreement with the lower bound. Thus, to improve upon the existing coding schemes, the communication overhead (describing the actually used coding schemes) between the encoder and the decoder has to be reduced, which is achievable by controlling the number of times the coding scheme changes in a better way then blockwise coding. This goal can be achieved by employing the techniques presented in Chapter 6. In this chapter we construct a randomized coding strategy, which uses a the mSD algorithm p described in Section 6.1 as the prediction component, that achieves an O( ln T /T ) average distortion redundancy with respect to a finite reference class of limited-delay and limited-memory source codes. The method can also be applied to compete with the (infinite) reference class p of scalar quantizers, where it achieves an O(ln T / T ) distortion redundancy. Note that these bounds are only logarithmic factors larger than the corresponding lower bound. After presenting the formalism of sequential source coding in Section 7.2, we present our randomized coding strategy in Section 7.3. The strategy is applied to the problem of adaptive zero-delay lossy source coding in Section 7.4 and further extensions are given in Section 7.5 7.2 Limited-delay limited-memory sequential source codes A fixed-rate delay-δ (randomized) sequential source code of rate ln M is defined by an encoderdecoder pair connected via a discrete noiseless channel of capacity ln M . Here δ is a nonnegative integer and M ≥ 2 is a positive integer. The input to the encoder is a sequence z 1 , z 2 , . . . taking values in some source alphabet Z. At each time instant t = 1, 2, . . 
., the encoder observes z t and a random number Ut , where the randomizing sequence U1 , U2 , . . . is assumed to be independent with its elements uniformly distributed over the interval [0, 1]. At each time instant t + δ, t = 1, 2, . . ., based on the source sequence z t +δ = (z 1 , . . . , z t +δ ) and the randomizing sequence Ut = (U1 , . . . , Ut ) received so far, the encoder produces a channel symbol yt ∈ {1, 2, . . . , M } which is then transmitted to the decoder. After receiving yt , the decoder outputs the reconstruction value ẑt ∈ Ẑ based on the channel symbols yt = (y1 , . . . , yt ) received so far, where Ẑ is the reconstruction alphabet. Formally, a code is given by a sequence of encoder-decoder functions ( f , g ) = { f t , g t }∞ t =1 , 117 where f t : Zt +δ × [0, 1]t → {1, 2, . . . , M } and g t : {1, 2, . . . , M }t → Ẑ so that yt = f t (z t +δ , Ut ) and ẑt = g t (yt ), t = 1, 2, . . .. Note that the total delay of the encoding and decoding process is δ.1 To simplify the notation we will omit the randomizing sequence from f t (·, Ut ) and write f t (·) instead. Now let F be a finite set of reference codes with |F| = N . The cumulative distortion of the sequential scheme after reproducing the first T symbols is given by D̂T (z T +δ ) = T X d (z t , ẑt ), t =1 where d : Z × Ẑ → [0, 1] is some distortion measure,2 while the minimal cumulative distortion achievable by codes from F is ∗ T +δ DF (z ) = min T X ( f ,g )∈F t =1 ¡ ¢ d z t , g t (y t ) ¡ ¢ where the sequence y T is generated sequentially by ( f , g ), that is, y t = f t z t +δ , Ut . Of course, in general it is impossible to come up with a coding scheme that attains this distortion without knowing the whole input sequence beforehand. Thus, our goal is to construct a coding scheme that asymptotically achieves the performance of the above encoder-decoder pair. Formally this means that we want to obtain a randomized coding scheme that minimizes the worst-case expected normalized distortion redundancy ³ ´i ³ ´o 1n h ∗ E D̂T z T +δ − D F z T +δ , z T ∈ZT T b T = max R where the expectation is taken with respect to the randomizing sequence UT of our coding scheme. The decoder {g t } is said to be of memory s ≥ 0 if g t ( ŷ t ) = g t ( ỹ t ) for all t and ŷ t , ỹ t ∈ {0, . . . , M }t such that ŷ tt −s = ỹ tt −s , where ŷ tt −s = ( ŷ t −s , ŷ t −s+1 , . . . , ŷ t ) and ỹ tt −s = ( ỹ t −s , ỹ t −s+1 , . . . , ỹ t ). Let Fδ denote the collection of all (randomized) delay-δ sequential source codes of rate ln M , and let Fsδ denote the class of codes in Fδ with memory s. Weissman and Merhav [100] proved that there exists a randomized coding scheme such that, for any δ ≥ 0 and s ≥ 0 and for any finite class F ⊂ Fsδ of reference codes, the normalized distortion redundancy with respect to F is of order T −1/3 ln2/3 |F|. This coding scheme splits the source sequence into blocks of length O(T 1/3 ). At the beginning of each block a code is se1 Although we require the decoder to operate with zero delay, this requirement introduces no loss in generality, as any finite-delay coding system with δ1 encoding and δ2 decoding delay can be represented equivalently in this way with δ1 + δ2 encoding and zero decoding delay [100]. 2 All results may be extended trivially for arbitrary bounded distortion measures 118 lected from F using EWA prediction and the identity of the selected reference decoder is communicated to the decoder. 
During these first steps, the decoder emits arbitrary reproduction symbols, while the chosen code is used in the rest of the block. The formation of the blocks ensures that only a limited fraction of the available channel capacity is used for describing codes, while the limited memory property ensures that not transmitting real data at the beginning of each block has only a limited effect on decoding the rest of the block. 7.3 The Algorithm Next we describe a coding scheme based on the observation that the problem of sequential lossy source coding can be regarded as an online prediction problem where switching between actions is subject to some positive cost. While both algorithms presented in Chapter 6 can be directly applied for this learning problem, we propose an algorithm based on the mSD prediction algorithm presented in Section 6.1. The proposed algorithm adaptively creates blocks of p variable length such that on the average O( T ) blocks are created, and so the overhead used to p transmit code descriptions scales with T instead of T 2/3 in [100]. The coding scheme, given in Algorithm 7.1, works as follows. At each time instant t the mSD algorithm selects one code (f(t ) , g(t ) ) from the finite reference class F; the loss in the mSD algorithm associated with ( f , g ) ∈ F is defined by ¡ ¡ ¢¢ `( f ,g ),t (z t +δ ) = d z t , g t y 0t (7.1) where y 10 , y 20 , . . . , y t0 is the sequence obtained by using the coding scheme ( f , g ) to encode z t , that is, y t0 = f t (z t +δ , Ut ) (note that `( f ,g ),t can be computed at the encoder at time t + δ). The mSD algorithm splits the time into blocks [1, t 1 ], [t 1 + 1, t 2 ], [t 2 + 1, t 3 ], . . . in a natural way such that the decoder of the reference code chosen by the algorithm is constant over each block, that is, g(ti +1) = g(ti +2) = · · · = g(ti +1 ) and g(ti ) 6= g(ti +1) for all i (here we used the convention t 0 = 0). Since the beginning of a new block can only be noticed at the encoder, this event has to be communicated to the decoder. In order to do so, we select randomly a new-block signal v of length A (that is, v ∈ {1, . . . , M } A ), and v is transmitted over the channel in the first A time steps of each block. In the next B time steps of the block the identity of the decoder chosen by the mSD algorithm is communicated, where ln |{g : ( f , g ) ∈ F}| B= ln M » ¼ (7.2) is the number of channel symbols required to describe uniquely all possible decoder functions. In the remainder of the block the selected encoder (or, possibly, more encoders) is used to encode the source symbols. On the other hand, whenever the decoder observes v in the received channel symbol sequence yt , it starts a new block. In this block the decoder first receives the index of the reference decoder to be used in the block, and the received reference decoder is used in the remainder of 119 the block to generate the reproduction symbols. One slight problem here is that the new-block signal may be obtained by encoding the input sequence; in this case, to synchronize with the decoder, a new block is started at the encoder. We can keep the loss introduced by these unnecessary new blocks low by a careful choice of the new-block signal. Clearly, if v is selected uniformly at random from {1, 2, . . . , M } A then for any fixed string u ∈ {1, 2, . . . , M } A , P [v = u] = 1/M A . Thus, setting A = O(ln T ) makes P [v = u] = O(1/T ), and so the expected number of unnecessary new blocks is at most a constant in T time steps. 
In summary, the algorithm works in blocks of variable length as follows: At the beginning of the block an algorithm is selected using the mSD prediction algorithm and a new-block signal and the identity of the chosen decoder is communicated to the receiver. In the next time steps, as long as the mSD algorithm selects the same decoder, the chosen code is used to encode the source symbols at the sender and used for decoding at the receiver. When the mSD method selects a different decoder, or a new block signal is transmitted by chance, a new block is started. The next result shows that the normalized distortion redundancy of the proposed scheme p is O( ln(T )/T ). Theorem 7.1. Let F ⊂ Fsδ be a finite reference class of delay-δ memory-s codes, and, for some T > 0, set A = dln T / ln M e and s ηt = ln |F| . T (17/8 + ln (T |F|) / ln M + s) (7.3) Then the expected normalized distortion redundancy of Algorithm 7.1 can be bounded as s bT ≤2 R µ ¶ ¶ µ ln |F| 17 ln(T |F|) ln T . + +s +O T 8 ln M T l m ln T Remark 7.1. In the above result concerning our algorithm, the parameters A = ln M and η t = p η = O(1/ T ln T ) are set as a function of the time horizon T . The proposed algorithm can be modified to be strongly sequential in the sense that it becomes horizon independent. The main difference is that the new-block signal will be time-variant: at time instants M k−1 the kth symbol vk of v is transmitted, and at each time instant t the so far received new-block signal l m p ln t v A t of length A t = ln M is used. Setting η t = O(1/ t ln t ), it can be shown that the modified algorithm has only a constant time larger regret than the original, horizon-dependent one. The above theorem follows from the following result taking into consideration that B ≤ dln |F|/ ln M e and setting A = dln T / ln M e and η t according to (7.3). Theorem 7.2. The expected normalized distortion redundancy of Algorithm 7.1 for any finite reference class F ⊂ Fsδ can be bounded as µ ¶ T X η t (A + B + s)(ln |F| + 1 + TM−A ln |F| 1 A ) b RT ≤ + + A +B +s + T ηT 8 T t =1 T 120 Algorithm 7.1 An efficient algorithm for adaptive sequential lossy source coding Encoder: 1. Input: A finite reference class F ⊂ Fsδ , positive integer A and time horizon T . 2. Initialization (a) Draw a new-block signal v uniformly at random from {1, . . . , M } A , the set of channel symbol sequences of length A. (b) Initialize the mSD algorithm for F. (c) Set B according to (7.2) and set g0 to an invalid value (not contained in F). 3. For each block do (a) Observe z t +δ . (b) For all time instants (t + δ) run the mSD algorithm: i. Feed the mSD algorithm with losses `( f ,g ),t (z t +δ ) for each code ( f , g ) ∈ F. ii. Let (f(t ) , g(t ) ) denote the choice of the mSD algorithm. (c) In the first A time steps of the block transmit v. (d) After the first A time steps set (f, g) = (f(t ) , g(t ) ), the output of the mSD algorithm in this time step. (e) In time steps A + 1, . . . , A + B of the block send the index describing g. (f) If (t + δ) belongs to steps A + B + 1, A + B + 2, . . . of the block then i. if g(t ) = g(t −1) then transmit yt = ft (z t +δ ) ii. else start a new block with the same time index; iii. if (yt −A+1 , . . . , yt ) = v then start a new block at the next time instant. Decoder: 1. Input: A finite reference class F ⊂ Fsδ , positive integers A, B , time horizon T . 2. For t = 1, . . . , A (a) Observe yt . (b) At time t = A set v = y A and declare a new block. 3. For each block do (a) Observe yt . 
(b) In the first B time steps of the block receive the index of the decoder to be used. At time step B of the block set the decoder g according to the symbols received so far. (c) In time steps B, B + 1, . . . of the block output ẑt = g(yt ) = g(ytt −s+1 ). (d) In time steps B + A, B + A+1, . . . of the block declare a new block if (yt −A+1 , . . . , yt ) = v. 121 Proof. Let ẑ ( f ,g ),1 , . . . , ẑ ( f ,g ),T denote the reproduction sequence generated by the reference code ( f , g ) ∈ F when applied to the source sequence z T , and let z̃t = ẑ (f(t ) ,g(t ) ),t . That is, z̃t is the reproduction sequence our coding scheme would generate if it did not have to transmit the identity of the chosen reference decoder, and the correct past s symbols were also available at the decoder (in the current setting when the reference decoder changes we have to wait s channel symbols to have the decoder operating correctly, as it may require s past symbols due to its memory). Decomposing the cumulative distortion we get T X X d t (z t , ẑt ) = t =1 t :1≤t ≤T,ẑt =z̃t ≤ T X t =1 X d t (z t , z̃t ) + d t (z t , ẑt ) t :1≤t ≤T,ẑt 6=z̃t `(f(t ) ,g(t ) ),t (z t +δ (7.4) ) + |{t : ẑt 6= z̃t , 1 ≤ t ≤ T }| . The expectation of the first term can be bounded using Lemma 6.2 as " E T X t =1 # ∗ T +δ `(f(t ) ,g(t ) ),t (z t +δ ) ≤ D F (z )+ T η X ln |F| t + . ηT t =1 8 (7.5) It is easy to see that ẑt 6= z̃t may happen only at the first A + B + s steps of each block. New blocks are started when either the mSD algorithm decides to start one, or when a new-block signal is transmitted by chance. Letting ST and NT = {t : (yt −A+1 , . . . , yt ) = v, A ≤ t ≤ T } denote the number of new blocks, up to time T , started “intentionally” by the mSD algorithm and, respectively, “unintentionally” by chance, we have ¯© ª¯ ¯ t : 1 ≤ t ≤ T, ẑt 6= z0 ¯ ≤ (ST + 1 + NT ) (A + B + s). t (7.6) ST can be bounded by Lemma 6.3. On the other hand, letting nt = I{v=(yt −A+1 ,...,yt )} , we have P NT = Tt=A nt . Since v is independent of yt −1 , £ ¤ P v = (yt −A+1 , . . . , yt ) = 1/M A for any A ≤ t ≤ T , and so E [NT ] = (T − A)/M A . (7.7) Now taking expectations in (7.6), Lemma 6.3 and (7.7) yield à E [|{t : 1 ≤ t ≤ T, ẑt 6= z̃t }|] ≤ (A + B + s) η T (T − 1) + TX −1 ¡ η t − η T + ln |F| + 1 + ¢ t =1 à = (A + B + s) T X η t + ln |F| + 1 + t =1 T −A MA ! . Combining the above with (7.4) and (7.5) proves the statement of the theorem. 122 T −A MA ! 7.4 Sequential zero-delay lossy source coding An important and widely studied special case of the source coding problem considered is the case of on-line scalar quantization, that is, the problem of zero-delay lossy source coding with memoryless encoders and decoders [71, 100, 43, 44]. Here we assume for simplicity Z = [0, 1] and d (z, ẑ) = (z − ẑ)2 . An M -level scalar quantizer Q (defined on [0, 1]) is a measurable mapping [0, 1] → C , where the codebook C is a finite subset of [0, 1] with cardinality |C | = M . The elements of C are called the code points. The instantaneous squared distortion of Q for input z is (z − Q(z))2 . Without loss of generality we will only consider nearest neighbor quantizers Q satisfying (z −Q(z))2 = minẑ∈C (z − ẑ)2 . Let Q denote the collection of all M -level nearest neighbor quantizers. In this section our goal is to design a sequential coding scheme that asymptotically achieves the performance of the best scalar quantizer (from Q) for all source sequences z T . 
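To make the objects of this section concrete, the following minimal Python sketch (with hypothetical function names) implements an $M$-level nearest-neighbor scalar quantizer with squared-error distortion and a brute-force search over the finite grid class $Q_K$ introduced below; the thesis itself relies on the efficient implementations of [43] rather than on such exhaustive search.

```python
import numpy as np
from itertools import combinations

def nn_quantize(z, codebook):
    """Nearest-neighbor scalar quantizer: map z in [0, 1] to the closest code point."""
    codebook = np.asarray(codebook)
    return codebook[np.argmin((codebook - z) ** 2)]

def squared_distortion(zs, codebook):
    """Cumulative squared-error distortion d(z, z_hat) = (z - z_hat)^2 on a source sequence."""
    return sum((z - nn_quantize(z, codebook)) ** 2 for z in zs)

def best_quantizer_in_QK(zs, M, K):
    """Brute-force search for the best M-level codebook on the grid {1/2K, 3/2K, ..., (2K-1)/2K}.

    Illustration only: the number of candidate codebooks is C(K, M), so this search is
    exponential in M and is not how the coding scheme of this chapter is implemented.
    """
    grid = (2 * np.arange(1, K + 1) - 1) / (2 * K)
    return min(combinations(grid, M), key=lambda cb: squared_distortion(zs, cb))
```

For example, best_quantizer_in_QK(zs, M=2, K=8) returns the pair of grid code points minimizing the cumulative squared error on the source sequence zs, which is the benchmark quantity appearing in the distortion redundancy below.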
Note that the expected normalized distortion redundancy in this special case is defined as " # T T 1 X 1 X 2 max (z t −Q(z t ))2. E (z t −ẑt ) − min Q∈ Q T z T ∈[0,1]T T t =1 t =1 To be able to apply the results of the previous section, we approximate the infinite class Q with QK ⊂ Q, the set of M -level nearest neighbor scalar quantizers whose code points all belong ª © 1 3 , 2K , . . . , 2K2K−1 . It is shown in [43] that the distortion redundancy of any sequential to the set 2K coding scheme relative to Q is at least on the order of T −1/2 . The next theorem shows that the slightly larger O(T −1/2 ln T ) normalized distortion redundancy is achievable. Theorem 7.3. Relative to the reference class Q, the expected normalized distortion redundancy of Algorithm 7.1 applied to QbpT c satisfies, for any T ≥ 2, s bT ≤ R µ ¶ µ 2 ¶ 2M ln T 25 (M +2) ln T ln T + +s +O T 8 2 ln M T and the algorithm can be implemented with O(T 2 ) time and O(T ) space complexity. Proof. The proof is based on results developed in [43]. It is easy to see that for any quantizer Q ∈ Q there exists a quantizer Q K ∈ QK such that max |(z −Q(z))2 − (z −Q K (z))2 | ≤ 1/K . z∈[0,1] Thus, in this sense, the class Q is well approximated by QK . Thus, for any sequence z T ∈ [0, 1]T , T T 1 X 1 X 1 (z t −Q(z t ))2 − min (z t −Q(z t ))2 ≤ . Q∈QK T t =1 Q∈Q T t =1 K min Applying Algorithm 7.1 to the reference class F = QK we obtain by Theorem 7.1 that the nor- 123 malized distortion redundancy relative to the class QK can be bounded as # " T T X X 1 2 2 (z t − ẑ t ) − min (z t −Q(z t )) max E Q∈QK t =1 z T ∈[0,1]T T t =1 v ¡ ¢Ã à à ! ¡K ¢ ! ¡K ¢ ! u u ln K 17 ln T M K ln T + M M t ≤2 + s + O ln T 8 ln M M T Note that since in this case the size of the reference class |QK | = ¡K ¢ M depends on T , such terms also need to be taken into account in the last O(·) term. Combining the above results and p substituting K = b T c gives the performance bound of the theorem, taking into account that p ¡bpT c¢ M /2 p1 < 2M ln T /T for all T ≥ 2 since M ≥ 2 by assumption. ≤ T and M b Tc It is shown in [43] that the random choice of a quantizer according to the EWA prediction algorithm in one time step can be performed with O(K 2 ) time and space complexity. Applying the same method in our algorithm we obtain the desired complexity results (it is easy to see that the extra random choice in mSD does not change these complexities). 7.5 Extensions In the previous sections we assumed that the encoder and the decoder communicate over a noiseless channel. Following Matloub and Weissman [75], we can extend the results to the case of stochastic channels with positive error exponents. We assume that the communication channel has finite memory r for some integer r ≥ 0, and its output also depends on some stationary noise process . . . , X−1 , X0 , X1 , . . . with known distribution such that if the channel input up to time t is y t for some t ≥ r , then the output of the channel is a function of y tt −r +1 and Xt . Moreover, it is assumed that for some rate R > 0 there exists a constant σ > 0 such that for any block length b there exists a channel code Cb that can discriminate 2bR messages with maximum error probability e −σb in b channel uses. These assumptions are not restrictive and hold for all channels with positive capacity and error exponent. c , a delayFormally, denoting the channel input and output alphabet by M = {1, . . . 
, M } and M δ sequential joint source-channel code is given by a sequence of encoder-decoder functions t +δ ( f , g ) = { f t , g t }∞ × [0, 1]t → M and g t : Mt → Ẑ. Matloub and Weissman [75] t =1 with f t : Z used a channel code Cb (minimizing the maximum error probability) to communicate the de- coding function at the beginning of each block, as well as replaced the distortion `( f ,g ),t (z t +δ ) £ ¤ with its expectation `¯( f ,g ),t (z t +δ ) = EXt `( f ,g ),t (z t +δ ) with respect to the channel noise Xt (note that `¯( f ,g ),t (z t +δ ) is still a random variable that depends on the internal randomization of the encoder f ). In our case a further modification is needed, as the new block signal also has to be communicated using channel coding. Formally, in step 3(b)i of the encoder in Algorithm 7.1, `( f ,g ),t (x t +δ ) has to be replaced with `¯( f ,g ),t (z t +δ ). Furthermore, during the whole communication process, the new block signal v and the indices of the decoder functions g are transmitted using channel coding, with codes C Ab and CBb , respectively. These codes are used at the decoder to identify the beginning of a 124 new block and determining the decoding function. Note that before each use of these channel codes, the encoder uses r symbols to reset the memory of the channel. To simplify the encoder, we do not check whether the decoder would receive a new block signal by chance, that is, we omit step 3(f)iii of the encoding algorithm (this step would become problematic due to the channel noise). While this decision makes the algorithm simpler, it can also ruin its performance if such an accident occurs (the scheme has no built-in method to recover from such an error). However, by a careful selection of the new block signal, we can guarantee that this disaster only happens with a very low probability. The analysis of the above procedure can be done following the proof of Theorem 7.1. We can obtain a variant of (7.5) in the same way where the distortion values are replaced by their expectations with respect to the channel noise. The overhead in the communication is analyzed in a slightly different way. We declare failure and consider the maximum distortion during the whole communication process if the new block symbol is received incorrectly at the beginning, or if the new block symbol is transmitted unintentionally. Since the latter can happen at each time instant, and results in an O(T ) cumulative distortion, we select v from T 2 options, making A = d2 ln T / ln M e. Furthermore, we want to transmit v at the beginning with probability of error O(1/T 2 ). By our assumptions on C Ab , to achieve an error probability ² in recovering v at the b b decoder, we need ² ≤ 2−σ A and 2R A ≥ M A . Thus, when A = O(ln T ), a choice Ab = O(ln T ) is sufficient to ensure an O(1/T 2 ) error probability. Then the expected cumulative error due to the incorrect decoding of v is O(1). Furthermore, communicating the index of the decoding function in Bb = O(ln F ln T ) channel uses can be decoded with maximum error probability O(1/T 2 ) (here, as before, F denotes the reference class of codes), which again results in a cumulative O(1) error. Summarizing, for each block we have an extra O(ln T ) overhead plus a constant error term (in expectation). Thus, we obtain the following result. Corollary 7.1. 
Given any finite reference class F of delay-δ sequential joint source-channel codes with bounded memory, under our assumptions on the communication channel, there exists a p sequential joint source-channel coding scheme with O( ln(T )/T ) normalized distortion redundancy. In the Wyner-Ziv setting considered by Reani and Merhav [89], there is a noiseless communication channel between the encoder and the decoder, and the decoder also has access to a side information signal that is a noisy observation of the current source symbol z t through a memoryless channel. This setup can be treated as a special case of the above joint source channel coding problem with a restricted set of encoders and a special channel (the channel is composed of a noiseless part and a noisy side information channel, and each encoder has to transmit the actual source symbol uncoded over the side information channel). In fact, this setup is simpler, as there is no need to use error protection for communicating the indices of the decoders and the new block signals; however, replacing `t by `¯t is still necessary. Thus, the p above O( ln(T )/T ) normalized distortion redundancy is also achievable in this case. Moreover, Reani and Merhav also gave an efficient implementation for the zero-delay case based on an efficient implementation of the EWA algorithm. This efficient algorithm can easily be 125 incorporated in our method in the same way as the efficient algorithms for scalar quantization (provided in [43, 45]) were used for the zero-delay lossy coding case considered in Section 7.4. 126 Chapter 8 Conclusions In this work, we addressed the problem of online learning in non-stationary Markov decision processes, first defined by Even-Dar et al. [33]. We proposed algorithms that are guaranteed to perform well in a number of different settings. In this chapter, we discuss how our results compare to other known results concerning regret minimization in MDPs. After discussing some general insights revealed by our work, we comment on the possibility of combining the results of the thesis for solving the most complicated version of the online MDP problem, where the transition functions are unknown and the learner is only informed about the reward it actually gathers (instead of being informed about the entire reward function). We conclude by commenting on the complexity of our algorithms and summarizing our results presented in Chapter 5. 8.1 The gap between the lower and upper bounds Jaksch et al. [59] considered the case when the transition function P is unknown to the learner and the rewards are generated in an i.i.d. fashion. The assumptions they make about the transition function are less stringent than our Assumptions M1 and M2: they assume that there exists a finite constant D > 0 such that for any two states x, x 0 ∈ X, there exists a policy πx,x 0 that takes the learner from state x to state x 0 in at most D steps (in expectation). Following Puterman [86], such MDPs are called communicating MDPs with diameter D. Jaksch et al. aim to minimize a regret criterion identical to the one defined in Equation (1.1), and prove that for any learning algorithm there exists an MDP with diameter D, state and action cardinality |X| and |A| such that b T ≥ c D|X||A|T , L p where c > 0 is some universal constant. Their algorithm UCRL-2 guarantees a regret of order p D|X| |A|T that holds for all possible reward distributions in all MDPs with diameter D. 
In what follows, we compare the performance guarantees of our algorithms to these bounds: The bounds presented in Chapters 3 and 4 have the same dependence on the number of actions |A| and the number of time steps T . For the O-SSP setting considered in Chapters 3 and 5, the 127 diameter is essentially equivalent to the number of layers L. The regret guarantee proven for p our algorithm in the bandit case when P is known is O(L 2 |A|T /α), which does not directly p depend on |X|. The factor L 2 is clearly suboptimal. Depending on the structure of P , 1/ α can be either much smaller or much bigger than |X|, however, it tends to be large in practical problems. Even though our experiments in Section 3.6 suggest that learning does become more difficult as α approaches zero, we believe that the reasons underlying these gaps are to be found in our decomposition of the learning problem: this approach does not exploit that the individual bandit problems defined in each state x ∈ X are not independent of each other. Essentially, the same comments hold for our algorithm presented in Chapter 4. Treating the learning problem globally as in Chapter 5 seems to give much better results: the gap between p our bound and the bound of Jaksch et al. [59] is only of order |A|. Thus, the price we pay for p playing against an arbitrary reward sequence with full feedback is at most an O( |A|) factor in the upper bound. This observation immediately leads to the following question: why don’t we use the approach described in Chapter 5 for solving the bandit problems described in Chapters 3 and 4? The reason is very simple: the main limitation of the FPL algorithm used for solving the global decision problem is that it cannot work efficiently under bandit feedback. For constructing unbiased reward estimates of the form (3.12) or (4.8), we need to explicitly compute the stochastic policy πt . Unfortunately, the relationship of the policy πt and the perturbations is very complicated and thus the former can be only approximated by excessive sampling from the latter. As shown by Poland [84], producing sufficiently precise reward estimates needs drawing up to O(T 2 ) samples, which is unacceptable. As discussed in Section 1.3, it is nontrivial whether there exists an efficient implementation of a global decision making algorithm that can be used to minimize regret in the bandit O-MDP setting. 8.2 Outlook We can conclude that a good learning algorithm in a continuing MDP task has to either change its policies either slowly (as quantified in Chapter 4) or rarely (as in the approach taken in Chapter 7). The fundamental difference between the results proved in Chapter 3 and the one proved in Chapter 4 is that in the latter case, the learner has to guarantee one of the above criteria on its policy sequence in addition to being able to minimize the main term of the expected regret. Accordingly, the insights used in Chapter 3 to prove extensions of the basic problem can be directly used in the O-MDP setting as well without any fundamental modification. Note that while the “slow-change” requirement seems to conflict the intuition that an algorithm with good tracking regret should change its policies more often, our proof concerning the change rate of Exp3 carries through to the case of Exp3.S. In the process of proving the results of Chapter 5, much structure (of both the standard MDP learning problem and the online learning problem) was uncovered that was not transparent in previous works. 
In particular, it has become apparent that the size of confidence intervals for the unknown model parameters only influences the quality of the estimate of the 128 achieved performance (since this factor only shows up in the proof of Lemma 5.5), while selecting our models optimistically helps only in estimating the best possible performance (since optimism is only exploited in the proof of Lemma 5.2). These issues are not so clearly separable in a purely stochastic learning problem where the achieved performance has to stay close to the fixed target behavior. It remains an interesting and important open problem to solve the online MDP problem with bandit information in unknown stochastic environments. Besides the problem of not being able to explicitly computing the occupation measures generated by FPL, the unknown transition function P also adds to the uncertainty. Constructing reward estimates with appropriately controlled bias seems to a be very challenging problem that needs fundamentally new techniques or highly nontrivial combination of ideas presented in this thesis. 8.3 The complexity of our algorithms Let us finally comment on the computational complexity of our algorithms. It is easy to see that of all our algorithms, the algorithm MDP-E of Chapter 4 has the largest computational and memory complexity: Due to the delay that the algorithm needs, the algorithm needs to store N policies. Thus, the memory requirement of MDP-Exp3 scales with N |A||X|. The computational complexity of the algorithm is dominated by the cost of computing r̂t and, in particular, ¡ ¢ 3 by the cost of computing µN t . The cost of this is O N |A||X| in the worst case, however, it can be much smaller for specific practical cases. For instance, if all states have at most J possible preceding states and K succeeding states, the cost becomes O (N |A||X|J K ). The computation cost of the other algorithms is smaller by a factor of N . Setting the parameters of the algorithms presented in Chapters 3 and 4 require known lower bounds on the minimal visitation probabilities α and α0 , and also the knowledge of the mixing time τ. While these quantities can be determined in principle from the transition probability kernel P , it is not clear how to compute efficiently the minimum over all policies. 8.4 Online learning with switching costs In Chapter 6, we proposed a novel prediction method that works by perturbing the cumulative losses with symmetric random walks. Analyzing the performance of this forecaster in the finite-expert setting required bounding the number of times that the leading random walk is switched. While the guarantees obtained for the expected regret and the expected number of switches are essentially equivalent to the best previously known guarantees, we hope that our technique can be used to provide bounds of the said quantities that hold with high probability. At the moment, the mere existence of a prediction algorithm guaranteeing high-confidence bounds for this setting is nontrivial. We were the first to consider the problem of online combinatorial optimization under switching constraints. While it is possible to prove performance bounds for a version of the Shrinking Dartboard algorithm in this setting, the resulting algorithm can only be implemented for 129 a handful of decision sets. On the other hand, our algorithm can be efficiently implemented whenever there exist an efficient solution to static optimization problems on the decision set. 
However, this efficiency comes at the price of slightly worse performance bounds. Finding out if this trade-off between low computational complexity and optimal regret bounds is inherent to the problem of online combinatorial optimization is an interesting open question of learning theory. 8.5 Closing the gap for online lossy source coding In Chapter 7, we provided a sequential lossy source coding scheme that achieves a normalp ized distortion redundancy of O( ln(T )/T ) relative to any finite reference class of limited-delay limited-memory codes, improving the earlier results of O(T −1/3 ). Applied to the case when the reference class is the (infinite) set of scalar quantizers, we showed that the algorithm achieves p O(ln(T )/ T ) normalized distortion redundancy, which is almost optimal in view that the norp malized distortion redundancy is known to be at least of order 1/ T . The existence of a coding scheme with optimal high-confidence performance guarantees depends on the existence of an online prediction algorithm with high-confidence guarantees on its regret and switch-number. As discussed in the previous section, this remains an open problem. 130 Bibliography [1] Abbasi-Yadkori, Y. and Szepesvári, Cs. (2011). Regret bounds for the adaptive control of linear quadratic systems. In Proceedings of the Twenty-Fourth Conference on Computational Learning Theory. [2] Allenberg, C., Auer, P., Györfi, L., and Ottucsák, Gy. (2006). Hannan consistency in on-line learning in case of unbounded losses under partial monitoring. In ALT, pages 229–243. [3] Arora, R., Dekel, O., and Tewari, A. (2012). Deterministic MDPs with adversarial rewards and bandit feedback. CoRR, abs/1210.4843. [4] Audibert, J.-Y. and Bubeck, S. (2010). Regret bounds and minimax policies under partial monitoring. Journal of Machine Learning Research, 11:2785–2836. [5] Audibert, J. Y., Bubeck, S., and Lugosi, G. (2011). Minimax policies for combinatorial prediction games. In Conference on Learning Theory. [6] Audibert, J. Y., Bubeck, S., and Lugosi, G. (2012). Regret in online combinatorial optimization. Manuscript. [7] Auer, P. (2003). Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3:397–422. [8] Auer, P., Cesa-Bianchi, N., and Fischer, P. (2002a). Finite-time analysis of the multiarmed bandit problem. Mach. Learn., 47(2-3):235–256. [9] Auer, P., Cesa-Bianchi, N., Freund, Y., and Schapire, R. (1995). Gambling in a rigged casino: The adversarial multi-armed bandit problem. In Proceedings of the 36th Annual Symposium on the Foundations of Computer Science, pages 322–331. IEEE press. [10] Auer, P., Cesa-Bianchi, N., Freund, Y., and Schapire, R. E. (2002b). The nonstochastic multiarmed bandit problem. SIAM J. Comput., 32(1):48–77. [11] Auer, P. and Ortner, R. (2006). Logarithmic online regret bounds for undiscounted reinforcement learning. In NIPS’06, pages 49–56. [12] Awerbuch, B. and Kleinberg, R. D. (2004). Adaptive routing with end-to-end feedback: distributed learning and geometric approaches. In Proceedings of the 36th ACM Symposium on Theory of Computing, pages 45–53. 131 [13] Balluchi, A., Benvenuti, L., Di Benedetto, M. D., Pinello, C., and Sangiovanni-Vincentelli, A. L. (2000). Automotive engine control and hybrid systems: challenges and opportunities. In Proceedings of the IEEE, pages 888–912. [14] Bartlett, P. L., Dani, V., Hayes, T. P., Kakade, S., Rakhlin, A., and Tewari, A. (2008). Highprobability regret bounds for bandit online linear optimization. 