Making Complex Decisions – Markov Decision Processes
Vasant Honavar
Artificial Intelligence Research Laboratory, Department of Computer Science
Bioinformatics and Computational Biology Program
Center for Computational Intelligence, Learning, & Discovery
Iowa State University
[email protected]
www.cs.iastate.edu/~honavar/  www.cild.iastate.edu/  www.bcb.iastate.edu/  www.igert.iastate.edu
Vasant Honavar, 2006.

Making Complex Decisions: The Markov Decision Problem
How can an agent use knowledge about the world to make decisions when
• there is uncertainty about the consequences of actions, and
• rewards are delayed?

The Solution
Sequential decision problems in uncertain environments can be solved by computing a policy that associates an optimal decision with every environmental state. The formal model is the Markov Decision Process (MDP).

Example
The world: a 4×3 grid with columns 1–4 and rows 1–3. The agent starts at (1,1), the cell (2,2) is blocked, (4,3) is a terminal state with reward +1, and (4,2) is a terminal state with reward −1.
Actions have uncertain consequences: each action moves the agent in the intended direction with probability 0.8, and at right angles to the intended direction with probability 0.1 each.

Cumulative Discounted Reward
Suppose rewards are bounded by M. Then the cumulative discounted reward is bounded by
  M + γM + γ²M + … + γⁿM = M (1 − γⁿ⁺¹) / (1 − γ) ≤ M / (1 − γ)
Note: for the geometric series to converge, 0 ≤ γ < 1.

Utility of a State Sequence
• Additive rewards: U_h([s₀, s₁, s₂, …]) = R(s₀) + R(s₁) + R(s₂) + …
• Discounted rewards: U_h([s₀, s₁, s₂, …]) = R(s₀) + γ R(s₁) + γ² R(s₂) + …

Utility of a State
• The utility of a state is the expected sum of discounted rewards obtained if the agent executes the policy π:
  U^π(s) = E[ Σ_{t=0}^∞ γ^t R(s_t) | π, s₀ = s ]
• The true utility of a state corresponds to the optimal policy π*.

Calculating the Optimal Policy
• Value iteration
• Policy iteration
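Before turning to these two algorithms, here is a minimal Python sketch of how the 4×3 grid world from the example above might be encoded as an MDP. The coordinate convention, the helper names, and the step reward of −0.04 for non-terminal states are illustrative assumptions, not values given on the slides.

    # Minimal sketch of the 4x3 grid world as an MDP (illustrative names/values).
    # States are (x, y) with (1, 1) at the bottom-left; (2, 2) is the blocked cell.
    ACTIONS = {'up': (0, 1), 'down': (0, -1), 'left': (-1, 0), 'right': (1, 0)}
    WALL, TERMINALS = {(2, 2)}, {(4, 3): +1.0, (4, 2): -1.0}
    STATES = [(x, y) for x in range(1, 5) for y in range(1, 4) if (x, y) not in WALL]
    R_STEP = -0.04  # assumed living reward for non-terminal states

    def R(s):
        return TERMINALS.get(s, R_STEP)

    def _move(s, d):
        nxt = (s[0] + d[0], s[1] + d[1])
        return nxt if nxt in STATES else s  # bumping into a wall leaves the state unchanged

    def _perpendicular(a):
        return ('left', 'right') if a in ('up', 'down') else ('up', 'down')

    def T(s, a):
        """Return (probability, next_state) pairs: 0.8 intended move, 0.1 each side."""
        if s in TERMINALS:
            return [(1.0, s)]
        left, right = _perpendicular(a)
        return [(0.8, _move(s, ACTIONS[a])),
                (0.1, _move(s, ACTIONS[left])),
                (0.1, _move(s, ACTIONS[right]))]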
Value Iteration
• Calculate the utility of each state.
• Then use the state utilities to select an optimal action in each state:
  π*(s) = argmax_a Σ_{s′} T(s, a, s′) U(s′)

Value Iteration Algorithm
function VALUE-ITERATION(MDP) returns a utility function
  local variables: U, U′, initially identical to R
  repeat
    U ← U′
    for each state s do
      U′(s) ← R(s) + γ max_a Σ_{s′} T(s, a, s′) U(s′)      (the Bellman update)
    end
  until CLOSE-ENOUGH(U, U′)
  return U

Value Iteration Algorithm: Example
The utilities of the states of the 4×3 world obtained after value iteration:
  row 3:  0.812   0.868   0.912   +1
  row 2:  0.762   (wall)  0.660   −1
  row 1:  0.705   0.655   0.611   0.388

Policy Iteration
• Pick a policy, then calculate the utility of each state given that policy (the value determination step).
• Update the policy at each state using the utilities of the successor states.
• Repeat until the policy stabilizes.

Policy Iteration Algorithm
function POLICY-ITERATION(MDP) returns a policy
  local variables: U, a utility function; π, a policy
  repeat
    U ← VALUE-DETERMINATION(π, U, MDP, R)
    unchanged? ← true
    for each state s do
      if max_a Σ_{s′} T(s, a, s′) U(s′) > Σ_{s′} T(s, π(s), s′) U(s′) then
        π(s) ← argmax_a Σ_{s′} T(s, a, s′) U(s′)
        unchanged? ← false
    end
  until unchanged?
  return π

Value Determination
• A simplification of the value iteration algorithm, because the policy is fixed.
• The equations are linear, because the max() operator has been removed.
• Solve exactly for the utilities using standard linear algebra.

Optimal Policy (policy iteration with 11 linear equations)
For the 4×3 world, fixing the policy yields a system of 11 linear equations in the 11 state utilities, e.g.
  u(1,1) = 0.8 u(1,2) + 0.1 u(2,1) + 0.1 u(1,1)
  u(1,2) = 0.8 u(1,3) + 0.2 u(1,2)
  …

Partially Observable MDP (POMDP)
• In an inaccessible environment, the percept does not provide enough information to determine the state or the transition probabilities.
• A POMDP is specified by
  – a state transition function P(s_{t+1} | s_t, a_t)
  – an observation function P(o_t | s_t, a_t)
  – a reward function E(r_t | s_t, a_t)
• Approach: maintain a probability distribution over the possible states given all previous percepts, and base decisions on this distribution.
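Returning to the fully observable case, here is a minimal sketch of the value iteration algorithm and greedy policy extraction described above, written against the grid-world encoding sketched earlier; GAMMA, EPS, and the convergence test are illustrative choices, not values from the slides.

    # Minimal sketch of value iteration and greedy policy extraction for the grid
    # world encoded above (GAMMA and EPS are illustrative choices).
    GAMMA, EPS = 0.9, 1e-4

    def value_iteration():
        U = {s: 0.0 for s in STATES}
        while True:
            U_new, delta = {}, 0.0
            for s in STATES:
                if s in TERMINALS:
                    U_new[s] = R(s)          # a terminal state's utility is its reward
                else:
                    U_new[s] = R(s) + GAMMA * max(
                        sum(p * U[s2] for p, s2 in T(s, a)) for a in ACTIONS)
                delta = max(delta, abs(U_new[s] - U[s]))
            U = U_new
            if delta < EPS:                  # "close enough" test
                return U

    def greedy_policy(U):
        """pi*(s) = argmax_a sum_{s'} T(s, a, s') U(s')."""
        return {s: max(ACTIONS, key=lambda a: sum(p * U[s2] for p, s2 in T(s, a)))
                for s in STATES if s not in TERMINALS}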
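Policy iteration, with the value determination step solved exactly by linear algebra as the slides suggest, might be sketched as follows; the helper names, the use of NumPy, and the arbitrary initial policy are again assumptions made for illustration.

    import numpy as np

    def value_determination(policy, gamma=0.9):
        """Solve U(s) = R(s) + gamma * sum_{s'} T(s, pi(s), s') U(s') exactly."""
        idx = {s: i for i, s in enumerate(STATES)}
        A, b = np.eye(len(STATES)), np.zeros(len(STATES))
        for s in STATES:
            b[idx[s]] = R(s)
            if s not in TERMINALS:
                for p, s2 in T(s, policy[s]):
                    A[idx[s], idx[s2]] -= gamma * p
        u = np.linalg.solve(A, b)
        return {s: u[idx[s]] for s in STATES}

    def policy_iteration(gamma=0.9):
        policy = {s: 'up' for s in STATES if s not in TERMINALS}   # arbitrary initial policy
        while True:
            U = value_determination(policy, gamma)
            unchanged = True
            for s in policy:
                q = {a: sum(p * U[s2] for p, s2 in T(s, a)) for a in ACTIONS}
                best = max(q, key=q.get)
                if q[best] > q[policy[s]]:   # improve the policy where a better action exists
                    policy[s] = best
                    unchanged = False
            if unchanged:
                return policy, U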
Learning from Interaction with the World
• An agent receives sensations or percepts from the environment through its sensors, acts on the environment through its effectors, and occasionally receives rewards or punishments from the environment.
• The goal of the agent is to maximize its reward (pleasure) or minimize its punishment (pain) as it stumbles along in an a priori unknown, uncertain environment.

Supervised Learning
Experience = labeled examples: inputs → supervised learning system → outputs.
Objective: minimize the error between desired and actual outputs.

Reinforcement Learning
Experience = action-induced state transitions and rewards: inputs → reinforcement learning system → outputs = actions.
Objective: maximize reward.

Reinforcement Learning
• The learner is not told which actions to take.
• Rewards and punishments may be delayed
  – it may pay to sacrifice short-term gains for greater long-term gains.
• There is a need to trade off exploration and exploitation.
• The environment may not be fully observable – it may be only partially observable.
• The environment may be deterministic or stochastic.

Reinforcement Learning
(Diagram: the agent–environment loop – the agent selects an action, and the environment returns a new state and a reward.)

Key Elements of an RL System
• Policy – what to do
• Reward – what is good
• Value – what is good because it predicts reward
• Model of the environment – what follows what

An Extended Example: Tic-Tac-Toe
(Diagram: the tic-tac-toe game tree – X moves, then O moves, alternating from the empty board.)
Assume an imperfect opponent: he/she sometimes makes mistakes.

A Simple RL Approach to Tic-Tac-Toe
• Make a table with one entry per state; V(s) is the estimated probability of winning from state s (e.g., 1 for a won position, 0 for a lost or drawn position, and 0.5 initially for all other states).
• Now play lots of games. To pick our moves, look ahead one step from the current state to the possible next states.
• Pick the next state with the highest estimated probability of winning – the largest V(s) – a greedy move; occasionally pick a move at random – an exploratory move.

RL Learning Rule for Tic-Tac-Toe
(Diagram: the sequence of positions in one game, alternating the opponent's moves and our moves from the starting position; s denotes the state before our greedy move and s′ the state after our greedy move; starred states mark exploratory moves.)
We increment each V(s) toward V(s′) – a backup:
  V(s) ← V(s) + α [ V(s′) − V(s) ]

Why Is Tic-Tac-Toe Too Easy?
• The number of states is small and finite.
• One-step look-ahead is always possible.
• The state is completely observable.

Some Notable RL Applications
• TD-Gammon – the world's best backgammon program (Tesauro)
• Elevator control – Crites & Barto
• Inventory management – 10–15% improvement over industry-standard methods – Van Roy, Bertsekas, Lee and Tsitsiklis
• Dynamic channel assignment – high-performance assignment of radio channels to mobile telephone calls – Singh and Bertsekas

The n-Armed Bandit Problem
• Choose repeatedly from one of n actions; each choice is called a play.
• After each play a_t, you get a reward r_t, where E{r_t | a_t} = Q*(a_t); the distribution of r_t depends only on a_t.
• The objective is to maximize the reward in the long term, e.g., over 1000 plays.

The Exploration–Exploitation Dilemma
• Suppose you form action-value estimates Q_t(a) ≈ Q*(a).
• The greedy action at time t is a_t* = argmax_a Q_t(a).
  – a_t = a_t* ⇒ exploitation
  – a_t ≠ a_t* ⇒ exploration
• You can't exploit all the time; you can't explore all the time.
• You can never stop exploring, but you can reduce the amount of exploring.

Action-Value Methods
• Stateless: adapt action-value estimates and nothing else.
• Suppose that by the t-th play, action a has been chosen k_a times, producing rewards r₁, r₂, …, r_{k_a}. Then
  Q_t(a) = (r₁ + r₂ + … + r_{k_a}) / k_a
  and lim_{k_a→∞} Q_t(a) = Q*(a).

ε-Greedy Action Selection
• Greedy: a_t = a_t* = argmax_a Q_t(a)
• ε-greedy: a_t = a_t* with probability 1 − ε; a random action with probability ε
• Boltzmann (softmax): Pr(choosing action a at time t) = e^{Q_t(a)/τ} / Σ_{b=1}^n e^{Q_t(b)/τ}, where τ is the computational temperature.

Incremental Implementation
Recall the sample-average estimation method: the average of the first k rewards is
  Q_k = (r₁ + r₂ + … + r_k) / k
Incremental update rule – does not require storing past rewards:
  Q_{k+1} = Q_k + (1/(k+1)) [ r_{k+1} − Q_k ]

Tracking a Nonstationary Environment
Choosing Q_k to be a sample average is appropriate in a stationary environment, in which the dependence of rewards on actions is time-invariant, i.e., none of the Q*(a) change over time. In a nonstationary environment, it is better to use an exponential, recency-weighted average:
  Q_{k+1} = Q_k + α [ r_{k+1} − Q_k ], for constant α, 0 < α ≤ 1,
  which gives Q_k = (1 − α)^k Q₀ + Σ_{i=1}^k α (1 − α)^{k−i} r_i.
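A minimal sketch of an n-armed bandit agent combining ε-greedy action selection with the incremental sample-average and constant-α update rules above; the Gaussian reward distributions at the end and all names are illustrative assumptions.

    import random

    class EpsilonGreedyBandit:
        """Action-value bandit agent: epsilon-greedy selection, incremental updates."""
        def __init__(self, n_actions, epsilon=0.1, alpha=None):
            self.epsilon = epsilon        # exploration probability
            self.alpha = alpha            # None => sample average; constant => recency-weighted
            self.Q = [0.0] * n_actions    # action-value estimates Q_t(a)
            self.counts = [0] * n_actions # k_a, number of times each action was chosen

        def select_action(self):
            if random.random() < self.epsilon:
                return random.randrange(len(self.Q))                     # exploratory play
            return max(range(len(self.Q)), key=lambda a: self.Q[a])      # greedy play

        def update(self, a, r):
            self.counts[a] += 1
            step = self.alpha if self.alpha is not None else 1.0 / self.counts[a]
            self.Q[a] += step * (r - self.Q[a])   # Q <- Q + step * (r - Q)

    # Illustrative use: 10 arms with assumed Gaussian rewards around hidden means.
    true_means = [random.gauss(0, 1) for _ in range(10)]
    agent = EpsilonGreedyBandit(10, epsilon=0.1)
    for t in range(1000):
        a = agent.select_action()
        agent.update(a, random.gauss(true_means[a], 1.0))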
Reinforcement Learning When the Agent Can Sense and Respond to Environmental States
Agent and environment interact at discrete time steps t = 0, 1, 2, …
• The agent observes the state at step t: s_t ∈ S,
• produces an action at step t: a_t ∈ A(s_t),
• gets the resulting reward r_{t+1} ∈ ℝ,
• and the resulting next state s_{t+1},
yielding a trajectory s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}, r_{t+2}, s_{t+2}, a_{t+2}, r_{t+3}, s_{t+3}, …

The Agent Learns a Policy
The policy at step t, π_t, is a mapping from states to action probabilities:
  π_t(s, a) = probability that a_t = a when s_t = s
• Reinforcement learning methods specify how the agent changes its policy as a result of experience.
• Roughly, the agent's goal is to get as much reward as it can over the long run.

Agent–Environment Interface: Goals and Rewards
• Is a scalar reward signal an adequate notion of a goal? Maybe not, but it is surprisingly flexible.
• A goal should specify what we want to achieve, not how we want to achieve it.
• A goal is typically outside the agent's direct control.
• The agent must be able to measure success explicitly and frequently during its lifespan.

Rewards
Suppose the sequence of rewards after step t is r_{t+1}, r_{t+2}, r_{t+3}, … What do we want to maximize?
In general, we want to maximize the expected return, E{R_t}, for each step t.
Episodic tasks – interaction breaks naturally into episodes, e.g., plays of a game or trips through a maze:
  R_t = r_{t+1} + r_{t+2} + … + r_T,
where T is a final time step at which a terminal state is reached, ending an episode.

Rewards for Continuing Tasks
Continuing tasks: interaction does not break into natural episodes. Discounted return:
  R_t = r_{t+1} + γ r_{t+2} + γ² r_{t+3} + … = Σ_{k=0}^∞ γ^k r_{t+k+1},
where γ, 0 ≤ γ ≤ 1, is the discount rate: γ near 0 is shortsighted, γ near 1 is farsighted.

Example – Pole-Balancing Task
Avoid failure: the pole falling beyond a critical angle, or the cart hitting the end of the track.
• As an episodic task, where an episode ends upon failure: reward = +1 for each step before failure ⇒ return = number of steps before failure.
• As a continuing task with discounted return: reward = −1 upon failure, 0 otherwise ⇒ return = −γ^k, for k steps before failure.
In either case, the return is maximized by avoiding failure for as long as possible.

Example – Driving Task
Get to the top of the hill as quickly as possible: reward = −1 for each step when not at the top of the hill ⇒ return = −(number of steps before reaching the top of the hill).
The return is maximized by minimizing the number of steps taken to reach the top of the hill.
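As a small illustration of the return definitions above, the following sketch (function names are illustrative) computes undiscounted episodic and discounted returns; the last lines check the pole-balancing case, where failing after k steps gives return −γ^k.

    def episodic_return(rewards):
        # R_t = r_{t+1} + r_{t+2} + ... + r_T
        return sum(rewards)

    def discounted_return(rewards, gamma):
        # R_t = sum_k gamma^k * r_{t+k+1}
        return sum((gamma ** k) * r for k, r in enumerate(rewards))

    # Pole balancing as a continuing task: reward -1 on failure, 0 otherwise.
    # Failing after k = 10 steps gives return -gamma^10.
    rewards = [0.0] * 10 + [-1.0]
    print(discounted_return(rewards, 0.9))   # -0.9**10, about -0.3487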
The Markov Property
• By the state at step t, we mean whatever information is available to the agent at step t about its environment.
• The state can include immediate sensations, highly processed sensations, and structures built up over time from sequences of sensations.
• Ideally, a state should summarize past sensations so as to retain all essential information – it should have the Markov property:
  Pr{ s_{t+1} = s′, r_{t+1} = r | s_t, a_t, r_t, s_{t−1}, a_{t−1}, …, r₁, s₀, a₀ } = Pr{ s_{t+1} = s′, r_{t+1} = r | s_t, a_t }
  for all s′, r, and histories s_t, a_t, r_t, s_{t−1}, a_{t−1}, …, r₁, s₀, a₀.

Markov Decision Processes
• If a reinforcement learning task has the Markov property, it is called a Markov Decision Process (MDP).
• If the state and action sets are finite, it is a finite MDP.
• To define a finite MDP, you need to specify:
  – the state and action sets
  – the one-step dynamics, defined by the transition probabilities
    P^a_{ss′} = Pr{ s_{t+1} = s′ | s_t = s, a_t = a } for all s, s′ ∈ S, a ∈ A(s)
  – the expected rewards
    R^a_{ss′} = E{ r_{t+1} | s_t = s, a_t = a, s_{t+1} = s′ } for all s, s′ ∈ S, a ∈ A(s)

Finite MDP Example: Recycling Robot
• At each step, the robot has to decide whether to (a) actively search for a can, (b) wait for someone to bring it a can, or (c) go to home base and recharge.
• Searching is better but runs down the battery; if the robot runs out of power while searching, it has to be rescued (which is bad).
• Decisions are made on the basis of the current energy level: high or low.
• Reward = number of cans collected.

Value Functions
• The value of a state is the expected return starting from that state; it depends on the agent's policy.
  State-value function for policy π:
  V^π(s) = E_π{ R_t | s_t = s } = E_π{ Σ_{k=0}^∞ γ^k r_{t+k+1} | s_t = s }
• The value of taking an action in a state under policy π is the expected return starting from that state, taking that action, and thereafter following π.
  Action-value function for policy π:
  Q^π(s, a) = E_π{ R_t | s_t = s, a_t = a } = E_π{ Σ_{k=0}^∞ γ^k r_{t+k+1} | s_t = s, a_t = a }

Bellman Equation for a Policy π
The basic idea:
  R_t = r_{t+1} + γ r_{t+2} + γ² r_{t+3} + γ³ r_{t+4} + …
      = r_{t+1} + γ ( r_{t+2} + γ r_{t+3} + γ² r_{t+4} + … )
      = r_{t+1} + γ R_{t+1}
So:
  V^π(s) = E_π{ R_t | s_t = s } = E_π{ r_{t+1} + γ V^π(s_{t+1}) | s_t = s }
Or, without the expectation operator:
  V^π(s) = Σ_a π(s, a) Σ_{s′} P^a_{ss′} [ R^a_{ss′} + γ V^π(s′) ]
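The Bellman equation for a fixed policy can be solved by simple iteration; below is a minimal sketch of such iterative policy evaluation. The mdp and policy dictionary interfaces are assumptions made for illustration, not notation taken from the slides.

    # Minimal sketch of iterative policy evaluation for the Bellman equation above.
    # `mdp[s][a]` is an assumed list of (prob, next_state, reward) tuples giving
    # P^a_{ss'} and R^a_{ss'}; `policy[s][a]` gives pi(s, a).

    def policy_evaluation(mdp, policy, gamma=0.9, tol=1e-6):
        V = {s: 0.0 for s in mdp}
        while True:
            delta = 0.0
            for s in mdp:
                v = sum(policy[s][a] * sum(p * (r + gamma * V[s2])
                                           for p, s2, r in transitions)
                        for a, transitions in mdp[s].items())
                delta = max(delta, abs(v - V[s]))
                V[s] = v                 # in-place (Gauss-Seidel style) sweep
            if delta < tol:
                return V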
Optimal Value Functions
• For finite MDPs, policies can be partially ordered: π ≥ π′ if and only if V^π(s) ≥ V^π′(s) for all s ∈ S.
• There is always at least one policy (and possibly many) that is better than or equal to all the others. This is an optimal policy; we denote the optimal policies by π*.
• Optimal policies share the same optimal state-value function:
  V*(s) = max_π V^π(s) for all s ∈ S
• Optimal policies also share the same optimal action-value function:
  Q*(s, a) = max_π Q^π(s, a) for all s ∈ S and a ∈ A(s)
  This is the expected return for taking action a in state s and thereafter following an optimal policy.

Bellman Optimality Equation for V*
The value of a state under an optimal policy must equal the expected return for the best action from that state:
  V*(s) = max_{a∈A(s)} Q^{π*}(s, a)
        = max_{a∈A(s)} E{ r_{t+1} + γ V*(s_{t+1}) | s_t = s, a_t = a }
        = max_{a∈A(s)} Σ_{s′} P^a_{ss′} [ R^a_{ss′} + γ V*(s′) ]
(Backup diagram: from state s, a max over actions a, then an expectation over rewards r and successor states s′.)
V* is the unique solution of this system of nonlinear equations.

Bellman Optimality Equation for Q*
  Q*(s, a) = E{ r_{t+1} + γ max_{a′} Q*(s_{t+1}, a′) | s_t = s, a_t = a }
           = Σ_{s′} P^a_{ss′} [ R^a_{ss′} + γ max_{a′} Q*(s′, a′) ]
(Backup diagram: from the pair (s, a), an expectation over rewards r and successor states s′, then a max over actions a′.)
Q* is the unique solution of this system of nonlinear equations.

Why Optimal State-Value Functions Are Useful
Any policy that is greedy with respect to V* is an optimal policy. Therefore, given V*, one-step-ahead search produces the long-term optimal actions.

What About Optimal Action-Value Functions?
Given Q*, the agent does not even have to do a one-step-ahead search:
  π*(s) = argmax_{a∈A(s)} Q*(s, a)

Solving the Bellman Optimality Equation
• Finding an optimal policy by solving the Bellman optimality equation requires:
  – accurate knowledge of the environment dynamics;
  – enough space and time to do the computation;
  – the Markov property.
• How much space and time do we need?
  – polynomial in the number of states (via dynamic programming methods),
  – BUT the number of states is often huge.
• We usually have to settle for approximations.
• Many RL methods can be understood as approximately solving the Bellman optimality equation.

Efficiency of DP
• Finding an optimal policy is polynomial in the number of states…
• BUT the number of states often grows exponentially with the number of state variables.
• In practice, classical DP can be applied to problems with a few million states.
• Asynchronous DP can be applied to larger problems and is well suited to parallel computation.
• It is surprisingly easy to come up with MDPs for which DP methods are not practical.
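A small sketch contrasting the two forms of greedy action selection just described: acting from V* requires the one-step model for lookahead, while acting from Q* is a model-free argmax. The mdp dictionary follows the interface assumed in the policy-evaluation sketch above; all names are illustrative.

    # Greedy action selection from V* (needs the one-step model) versus from Q*
    # (model-free argmax). `mdp[s][a]` is the assumed (prob, next_state, reward) model.

    def greedy_from_V(mdp, V, s, gamma=0.9):
        return max(mdp[s], key=lambda a: sum(p * (r + gamma * V[s2])
                                             for p, s2, r in mdp[s][a]))

    def greedy_from_Q(Q, s):
        # pi*(s) = argmax_a Q*(s, a); no model or lookahead required
        return max(Q[s], key=Q[s].get)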
Markov Decision Processes
• Assume a finite set of states S and a set of actions A.
• At each discrete time step, the agent observes the state s_t ∈ S and chooses an action a_t ∈ A,
• then receives an immediate reward r_t,
• and the state changes to s_{t+1}.
• Markov assumption: s_{t+1} = δ(s_t, a_t) and r_t = r(s_t, a_t)
  – i.e., r_t and s_{t+1} depend only on the current state and action
  – the functions δ and r may be nondeterministic
  – the functions δ and r are not necessarily known to the agent.

The Agent's Learning Task
• Execute actions in the environment, observe the results, and
• learn an action policy π : S → A that maximizes
  E[ r_t + γ r_{t+1} + γ² r_{t+2} + … ]
  from any starting state in S;
• here 0 ≤ γ < 1 is the discount factor for future rewards.
• Note something new: the target function is π : S → A, but we have no training examples of the form ⟨s, a⟩; training examples are of the form ⟨⟨s, a⟩, r⟩.

The Reinforcement Learning Problem
• Goal: learn to choose actions that maximize r₀ + γ r₁ + γ² r₂ + …, where 0 ≤ γ < 1.

Learning an Action-Value Function
Estimate Q^π for the current behavior policy π. After every transition from a nonterminal state s_t, do:
  Q(s_t, a_t) ← Q(s_t, a_t) + α [ r_{t+1} + γ Q(s_{t+1}, a_{t+1}) − Q(s_t, a_t) ]
If s_{t+1} is terminal, then Q(s_{t+1}, a_{t+1}) = 0.

Value Function
• To begin, consider deterministic worlds…
• For each possible policy π the agent might adopt, we can define an evaluation function over states:
  V^π(s) ≡ r_t + γ r_{t+1} + γ² r_{t+2} + … ≡ Σ_{i=0}^∞ γ^i r_{t+i}
  where r_t, r_{t+1}, … are generated by following policy π starting at state s.
• Restated, the task is to learn the optimal policy π*:
  π* ≡ argmax_π V^π(s), (∀s)

What to Learn
• We might try to have the agent learn the evaluation function V^{π*} (which we write as V*).
• It could then do a look-ahead search to choose the best action from any state s, because
  π*(s) ≡ argmax_a [ r(s, a) + γ V*(δ(s, a)) ]
• A problem: this works well if the agent knows δ : S × A → S and r : S × A → ℝ; but when it doesn't, it can't choose actions this way.

Action-Value Function – the Q Function
• Define a new function very similar to V*:
  Q(s, a) ≡ r(s, a) + γ V*(δ(s, a))
• If the agent learns Q, it can choose the optimal action even without knowing δ:
  π*(s) ≡ argmax_a [ r(s, a) + γ V*(δ(s, a)) ] = argmax_a Q(s, a)
• Q is the evaluation function the agent will learn.
Training Rule to Learn Q
• Note that Q and V* are closely related:
  V*(s) = max_{a′} Q(s, a′)
• This allows us to write Q recursively as
  Q(s_t, a_t) = r(s_t, a_t) + γ V*(δ(s_t, a_t)) = r(s_t, a_t) + γ max_{a′} Q(s_{t+1}, a′)
• Let Q̂ denote the learner's current approximation to Q. Consider the training rule
  Q̂(s, a) ← r + γ max_{a′} Q̂(s′, a′)
  where s′ is the state resulting from applying action a in state s.

Q-Learning
  Q(s_t, a_t) ← Q(s_t, a_t) + α [ r_{t+1} + γ max_a Q(s_{t+1}, a) − Q(s_t, a_t) ]

Q-Learning for Deterministic Worlds
• For each s, a, initialize the table entry Q̂(s, a) ← 0.
• Observe the current state s.
• Do forever:
  – Select an action a and execute it.
  – Receive the immediate reward r.
  – Observe the new state s′.
  – Update the table entry for Q̂(s, a) as follows:
    Q̂(s, a) ← r + γ max_{a′} Q̂(s′, a′)
  – s ← s′.

Updating Q̂: Example
  Q̂(s₁, a_right) ← r + γ max_{a′} Q̂(s₂, a′)
                  ← 0 + 0.9 max{63, 81, 100}
                  ← 90
Notice that if rewards are non-negative, then
  (∀ s, a, n) Q̂_{n+1}(s, a) ≥ Q̂_n(s, a) and (∀ s, a, n) 0 ≤ Q̂_n(s, a) ≤ Q(s, a)

Convergence Theorem
• Theorem: In a deterministic world with bounded immediate rewards, if each pair ⟨s, a⟩ is visited infinitely often, then Q̂ converges to Q.
• Proof sketch: Define a full interval to be an interval during which each ⟨s, a⟩ is visited. During each full interval, the largest error in the Q̂ table is reduced by a factor of γ.
• Let Q̂_n be the table after n updates, and Δ_n the maximum error in Q̂_n:
  Δ_n = max_{s,a} | Q̂_n(s, a) − Q(s, a) |
• For any table entry Q̂_n(s, a) updated on iteration n + 1, the error in the revised estimate Q̂_{n+1}(s, a) is
  | Q̂_{n+1}(s, a) − Q(s, a) | = | (r + γ max_{a′} Q̂_n(s′, a′)) − (r + γ max_{a′} Q(s′, a′)) |
                              = γ | max_{a′} Q̂_n(s′, a′) − max_{a′} Q(s′, a′) |
                              ≤ γ max_{a′} | Q̂_n(s′, a′) − Q(s′, a′) |
                              ≤ γ max_{s″, a′} | Q̂_n(s″, a′) − Q(s″, a′) |
  so | Q̂_{n+1}(s, a) − Q(s, a) | ≤ γ Δ_n.
  Note that we used the general fact that | max_a f₁(a) − max_a f₂(a) | ≤ max_a | f₁(a) − f₂(a) |.
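A minimal sketch of the tabular Q-learning loop just described, using the general update Q ← Q + α[r + γ max_{a′} Q(s′, a′) − Q]; the environment interface (reset/step/actions), the ε-greedy action selection, and all parameter values are illustrative assumptions, since the slides leave action selection unspecified.

    import random
    from collections import defaultdict

    def q_learning(env, n_episodes=500, gamma=0.9, alpha=0.5, epsilon=0.1):
        """Tabular Q-learning. `env` is an assumed interface with env.actions,
        env.reset() -> s, and env.step(a) -> (s2, r, done)."""
        Q = defaultdict(float)                      # Q[(s, a)], initialized to 0
        for _ in range(n_episodes):
            s, done = env.reset(), False
            while not done:
                if random.random() < epsilon:       # exploratory move
                    a = random.choice(env.actions)
                else:                               # greedy move
                    a = max(env.actions, key=lambda act: Q[(s, act)])
                s2, r, done = env.step(a)
                target = r if done else r + gamma * max(Q[(s2, a2)] for a2 in env.actions)
                Q[(s, a)] += alpha * (target - Q[(s, a)])   # Q-learning update
                s = s2
        return Q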
Non-Deterministic Case
• What if the reward and next state are non-deterministic?
• We redefine V and Q by taking expected values:
  V^π(s) ≡ E[ r_t + γ r_{t+1} + γ² r_{t+2} + … ] ≡ E[ Σ_{i=0}^∞ γ^i r_{t+i} ]
  Q(s, a) ≡ E[ r(s, a) + γ V*(δ(s, a)) ]

Non-Deterministic Case (continued)
Q-learning generalizes to nondeterministic worlds. Alter the training rule to
  Q̂_n(s, a) ← (1 − α_n) Q̂_{n−1}(s, a) + α_n [ r + γ max_{a′} Q̂_{n−1}(s′, a′) ]
where
  α_n = 1 / (1 + visits_n(s, a))
Convergence of Q̂ to Q can be proved (Watkins and Dayan, 1992).

Temporal Difference Learning
• Temporal difference (TD) learning methods
  – can be used when accurate models of the environment are unavailable – neither the state transition function nor the reward function is known;
  – can be extended to work with implicit representations of action-value functions;
  – are among the most useful reinforcement learning methods.

Example – TD-Gammon
• Learns to play backgammon (Tesauro, 1995).
• Immediate reward: +100 if win, −100 if lose, 0 for all other states.
• Trained by playing 1.5 million games against itself.
• Now comparable to the best human player.

Temporal Difference Learning
Q-learning reduces the discrepancy between successive Q estimates.
One-step time difference:
  Q^(1)(s_t, a_t) ≡ r_t + γ max_a Q̂(s_{t+1}, a)
Why not two steps? Or n?
  Q^(2)(s_t, a_t) ≡ r_t + γ r_{t+1} + γ² max_a Q̂(s_{t+2}, a)
  Q^(n)(s_t, a_t) ≡ r_t + γ r_{t+1} + … + γ^{n−1} r_{t+n−1} + γ^n max_a Q̂(s_{t+n}, a)
Blend all of these:
  Q^λ(s_t, a_t) ≡ (1 − λ) [ Q^(1)(s_t, a_t) + λ Q^(2)(s_t, a_t) + λ² Q^(3)(s_t, a_t) + … ]

Temporal Difference Learning (continued)
  Q^λ(s_t, a_t) ≡ (1 − λ) [ Q^(1)(s_t, a_t) + λ Q^(2)(s_t, a_t) + λ² Q^(3)(s_t, a_t) + … ]
Equivalent expression:
  Q^λ(s_t, a_t) = r_t + γ [ (1 − λ) max_a Q̂(s_{t+1}, a) + λ Q^λ(s_{t+1}, a_{t+1}) ]
• The TD(λ) algorithm uses the above training rule.
• It sometimes converges faster than Q-learning.
• It converges for learning V* for any 0 ≤ λ ≤ 1 (Dayan, 1992).
• Tesauro's TD-Gammon uses this algorithm.

Handling Large State Spaces
• Replace the Q̂ table with a neural net or other function approximator.
• Virtually any function approximator will work, provided it can be updated in an online fashion.

Learning State-Action Values
• Training examples are of the form ⟨ description of (s_t, a_t), v_t ⟩.
• The general gradient-descent rule:
  θ_{t+1} = θ_t + α [ v_t − Q_t(s_t, a_t) ] ∇_θ Q_t(s_t, a_t)

Linear Gradient-Descent Watkins' Q(λ)
(The slide shows the algorithm box for linear, gradient-descent Watkins' Q(λ); a sketch of the basic gradient-descent Q update follows below.)
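To make the gradient-descent rule above concrete, here is a minimal sketch of a linear function approximator for Q trained with a one-step Q-learning target. This is not the full linear gradient-descent Watkins' Q(λ) algorithm (it has no eligibility traces); the feature function phi and all names are assumptions made for illustration.

    import numpy as np

    class LinearQ:
        """Q_t(s, a) = theta . phi(s, a), updated by
           theta <- theta + alpha * (v_t - Q_t(s, a)) * grad_theta Q_t(s, a),
           where grad_theta Q_t(s, a) = phi(s, a) for a linear approximator."""
        def __init__(self, phi, n_features, actions, alpha=0.01, gamma=0.9):
            self.phi, self.actions = phi, actions    # phi(s, a) -> feature vector
            self.theta = np.zeros(n_features)
            self.alpha, self.gamma = alpha, gamma

        def q(self, s, a):
            return float(self.theta @ self.phi(s, a))

        def update(self, s, a, r, s2, done):
            # One-step Q-learning target: v_t = r + gamma * max_a' Q(s', a')
            target = r if done else r + self.gamma * max(self.q(s2, a2) for a2 in self.actions)
            grad = self.phi(s, a)                    # gradient of a linear Q in theta
            self.theta += self.alpha * (target - self.q(s, a)) * grad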