Today's Topics (1/19/17, Spring 2017, Shavlik)
• Reinforcement Learning (RL)
• Q Learning
• Exploration vs. Exploitation
• Generalizing Across State
• Used in Clinical Trials Recently
• Increased Emphasis in IT Companies
• Inverse RL

Reinforcement Learning vs. Supervised Learning
RL requires much less of the teacher:
– The teacher must set up a 'reward structure'
– The learner 'works out the details', ie, writes a program to maximize the rewards received

Sequential Decision Problems (courtesy of Andy Barto, pictured)
• Decisions are made in stages
• The outcome of each decision is not fully predictable, but can be observed before the next decision is made
• The objective is to maximize a numerical measure of total reward (or, equivalently, to minimize a measure of total cost)
• Decisions cannot be viewed in isolation: need to balance the desire for immediate reward against the possibility of high future reward

RL Systems: Formalization
S_E = the set of states of the world, eg, an N-dimensional vector of 'sensors' (and memory of past sensations)
A_E = the set of possible actions an agent can perform ('effectors')
W = the world
R = the immediate reward structure
W and R are the environment and can be stochastic functions (usually most states have R = 0; ie, rewards are sparse)

Embedded Learning Systems: Formalization (cont.)
W: S_E × A_E → S_E   [here the arrow means 'maps to']
  The world maps a state and an action and produces a new state
R: S_E → reals
  Provides rewards (a number; often 0) as a function of the current state
Note: we can instead use R: S_E × A_E → reals (ie, rewards depend on how we ENTER a state)

A Graphical View of RL
[Figure: the agent sends an action to the real world W; the world returns sensory info and a scalar reward R, the 'indirect teacher']
• Note that both the world and the agent can be probabilistic, so W and R could produce probability distributions
• We'll assume deterministic problems

Common Confusion
State need not be solely the current sensor readings
– The Markov Assumption, commonly used in RL: the value of a state is independent of the path taken to reach that state
– But we can store memory of the past in the current state
– We can always create a Markovian task by remembering the entire past history

Need for Memory: Simple Example
'Out of sight, but not out of mind'
[Figure: at Time = 1 the learning agent sees an opponent; at Time = 2 the opponent is hidden behind a WALL]
It seems reasonable to remember that the opponent was recently seen

State vs. Current Sensor Readings
Remember: state is what is in one's head (past memories, etc), not ONLY what one currently sees/hears/smells/etc

Policies
The agent needs to learn a policy π_E: Ŝ_E → A_E
  Given a world state Ŝ_E, which action A_E should be chosen?
  Ŝ_E is our learner's APPROXIMATION to the true S_E
Remember: the agent's task is to maximize the total reward received during its lifetime

True World States vs. the Learner's Representation of the World State
• From here forward, S will be our learner's approximation of the true world state
• Exceptions: W: S × A → S and R: S → reals
  These are our notations for how the true world behaves when we act upon it
  You can think of W and R as taking the learner's representation of the world state as an argument and internally converting it to the 'true' world state(s)
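To make the W / R / policy notation above concrete, here is a minimal sketch; the three-state world, its actions, and its rewards are hypothetical (they are not from the lecture):

```python
# A minimal sketch (hypothetical world) of the W / R / policy notation above.

STATES = ["s0", "s1", "s2"]          # S_E: the set of world states
ACTIONS = ["left", "right"]          # A_E: the set of possible actions

def W(s, a):
    """World transition function W: S x A -> S (deterministic here)."""
    table = {("s0", "right"): "s1", ("s1", "right"): "s2",
             ("s1", "left"): "s0", ("s2", "left"): "s1"}
    return table.get((s, a), s)      # illegal moves leave the state unchanged

def R(s):
    """Reward function R: S -> reals; most states give 0 (sparse rewards)."""
    return 10.0 if s == "s2" else 0.0

def policy(s):
    """A policy pi: S -> A; here a fixed (not yet learned) choice."""
    return "right"

# One interaction step: the agent acts, the world returns a new state and a reward.
s = "s0"
a = policy(s)
s_new = W(s, a)
print(s, a, "->", s_new, "reward:", R(s_new))
```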
Policies (cont.)
To construct π_E, we will assign a utility U (a number) to each state:
  $U^{\pi_E}(s) = \sum_{t=1}^{\infty} \gamma^{t-1} R(s, \pi_E, t)$
• γ is a positive constant ≤ 1
• R(s, π_E, t) is the reward received at time t, assuming the agent follows policy π_E and starts in state s at t = 0
• Note: future rewards are discounted by γ^{t-1}

Why Have a Decay on Rewards?
• Getting 'money' in the future is worth less than money right now
  – Inflation
  – More time to enjoy what it buys
  – Risk of death before collecting
• Allows convergence proofs of the functions we're learning

The Action-Value Function
We want to choose the 'best' action in the current state, so pick the one that leads to the best next state (and include any immediate reward). Let
  $Q^{\pi_E}(s, a) = R(W(s, a)) + \gamma \, U^{\pi_E}(W(s, a))$
The first term is the immediate reward received for going to state W(s, a) [alternatively, R(s, a)]; the second is the future reward from further actions (discounted due to the 1-step delay)

The Action-Value Function (cont.)
If we can accurately learn Q (the action-value function), choosing actions is easy: choose action a, where
  $a = \arg\max_{a' \in \text{actions}} Q(s, a')$
Note: x = argmax f(x) sets x to the value that leads to a max value for f(x)

Q vs. U Visually
[Figure: a small graph of states connected by action arcs; U values, eg U(1) … U(6), are 'stored' on states, while Q values, eg Q(1, ii), are 'stored' on arcs]

Q's vs. U's
Assume we're in state S; which action do we choose?
• U's (model-based)
  – Need a 'next state' function to generate all possible next states (eg, chess)
  – Choose the next state with the highest U value
• Q's (model-free, though one can also do model-based Q learning)
  – Need only know which actions are legal (eg, the web)
  – Choose the arc with the highest Q value

Q-Learning (Watkins PhD, 1989)
Let Q_t be our current estimate of the optimal Q
Our current policy is
  $\pi_t(s) = a$ such that $Q_t(s, a) = \max_{b \in \text{known actions}} Q_t(s, b)$
Our current utility-function estimate is
  $U_t(s) = Q_t(s, \pi_t(s))$
Hence, the U table is embedded in the Q table and we don't need to store both

Q-Learning (cont.)
Assume we are in state S_t
• 'Run the program'* for a while (N steps)
• Determine the actual reward and compare it to the predicted reward
• Adjust the prediction to reduce the error
* Ie, follow the current policy

Updating Q_t
Let r_t^(N) be the N-step estimate of future rewards:
  $r_t^{(N)} = \sum_{k=1}^{N} \gamma^{k-1} R_{t+k} + \gamma^{N} U_t(S_{t+N})$
The sum is the actual (discounted) reward received during the N time steps; the last term is the estimate of the future reward if we continued to t = ∞

Changing the Q Function (ie, learn a better approximation)
  $Q_{t+N}(S_t, a_t) = Q_t(S_t, a_t) + \alpha \, [\, r_t^{(N)} - Q_t(S_t, a_t) \,]$
The left side is the new estimate (at time t + N); Q_t(S_t, a_t) is the old estimate; α is the learning rate (for deterministic worlds, set α = 1); the bracketed term is the error

Pictorially (here rewards are on arcs, rather than states)
[Figure: the actual moves made (in red) from S_1, collecting rewards r_1, r_2, r_3 and reaching S_N, which has several potential next states]
  $Q^{est}(s_1, a) \approx r_1 + \gamma r_2 + \gamma^2 r_3 + \langle\text{estimate of remainder of infinite sum}\rangle$
  $= r_1 + \gamma r_2 + \gamma^2 r_3 + \gamma^3 U(S_N)$
  $= r_1 + \gamma r_2 + \gamma^2 r_3 + \gamma^3 \max_{b \in \text{actions}} Q(S_N, b)$

How Many Actions Should We Take Before Updating Q?
Why not do so after each action?
– One-step Q learning
– The most common approach
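A small numeric sketch of the N-step estimate and update above; γ, the rewards, the tail estimate U_t(S_{t+N}), and the old Q value are all made-up numbers used only to show the arithmetic:

```python
# Sketch of: r_t^(N) = sum_{k=1..N} gamma^(k-1) R_{t+k} + gamma^N * U_t(S_{t+N}),
# then Q(S_t, a_t) <- Q(S_t, a_t) + alpha * (r_t^(N) - Q(S_t, a_t)).

gamma = 0.9
alpha = 1.0                        # deterministic world, per the slide above

rewards = [0.0, 0.0, 5.0]          # R_{t+1}, R_{t+2}, R_{t+3}  (N = 3, made up)
U_tail = 2.0                       # U_t(S_{t+N}): current estimate of further reward
N = len(rewards)

r_N = sum(gamma ** (k - 1) * r for k, r in enumerate(rewards, start=1)) \
      + gamma ** N * U_tail        # actual discounted reward + tail estimate

Q_old = 3.0                        # old estimate Q_t(S_t, a_t), made up
Q_new = Q_old + alpha * (r_N - Q_old)
print(round(r_N, 3), round(Q_new, 3))   # 5.508 (= 0 + 0 + 0.81*5 + 0.729*2), 5.508
```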
Exploration vs. Exploitation
In order to learn about better alternatives, we can't always follow the current policy ('exploitation')
Sometimes we need to try random moves ('exploration')

Exploration vs. Exploitation (cont.)
Approaches
1) p percent of the time, make a random move; could let p decay as 1 / (# moves made)
2) $\Pr(\text{picking action } A \text{ in state } S) = \dfrac{e^{Q(S,A)/\text{const}}}{\sum_{i \in \text{actions}} e^{Q(S,i)/\text{const}}}$
   Exponentiating gets rid of negative values

One-Step Q-Learning Algo
0. S ← initial state
1. If random# < P then a ← random choice   // Occasionally 'explore'
   Else a ← π_t(S)                         // Else 'exploit'
2. S_new ← W(S, a); R_immed ← R(S_new)     // Act on the world and get the reward
3. Error ← R_immed + γ U(S_new) - Q(S, a)  // Use Q to compute U
4. Q(S, a) ← Q(S, a) + α × Error           // Should also decay α
5. S ← S_new
6. Go to 1
(A runnable sketch of this loop appears after the simple example below.)

Visualizing Q-Learning (1-step 'lookahead')
[Figure: taking action a in state I (and receiving reward R) leads to state J, which has actions a … z]
The estimate Q(I, a) should equal R + γ max_x Q(J, x)
– Train the ML system to learn a consistent set of Q values

Bellman Optimality Equation (from 1957, though for the U function back then)
IF  ∀ s, a:  $Q(s, a) = R_N + \gamma \max_{a' \in \text{actions}} Q(S_N, a')$,  where S_N = W(s, a) is the next state and R_N its reward,
THEN the resulting policy, π(s) = argmax_a Q(s, a), is optimal; ie, it leads to the highest discounted total rewards
(Also, any optimal policy satisfies the Bellman Equation)

A Simple Example (of Q-learning, with updates after each step, ie N = 1)
[Figure: a small deterministic graph with γ = 2/3. From S_0 (R = 0) an upper path goes through S_1 (R = 1) to S_3 (R = 0), and a lower path goes through S_2 (R = -1) to S_4 (R = 3). Every arc's Q value starts at 0.]
Update rule: Q_new ← R + γ max Q_next state   (deterministic world, so α = 1)

A Simple Example (Step 1): move S_0 → S_2
The arc S_0 → S_2 gets Q = -1 + (2/3) × 0 = -1

A Simple Example (Step 2): move S_2 → S_4
The arc S_2 → S_4 gets Q = 3 + (2/3) × 0 = 3

A Simple Example (Steps i and i+1): move S_0 → S_2 again
Assume we get to the end of the game and are 'magically' restarted in S_0. The update now uses the improved estimate at S_2:
  Q(S_0 → S_2) ← -1 + (2/3) × 3 = 1

A Simple Example (Step ∞), ie, the Bellman optimum
What would the final Q values be if we explored and exploited for a long time, always returning to S_0 after 5 actions?

A Simple Example (Step ∞)
With γ = 2/3: Q(S_0 → S_1) = 1, Q(S_0 → S_2) = -1 + (2/3) × 3 = 1, Q(S_2 → S_4) = 3, and the remaining arcs stay at 0
• What would happen if γ > 2/3? The lower path is better
• What would happen if γ < 2/3? The upper path is better
This shows the need for EXPLORATION, since the first action ever taken out of S_0 may or may not be the optimal one
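A runnable sketch of the one-step Q-learning loop above, with ε-greedy explore/exploit; the four-state chain world, its rewards, and the constants are hypothetical, not the lecture's example:

```python
import random

ACTIONS = ["left", "right"]
GAMMA, ALPHA, EPSILON = 0.9, 1.0, 0.2      # alpha = 1 since this toy world is deterministic

def W(s, a):                               # deterministic world: states 0..3, 3 is a goal
    return min(s + 1, 3) if a == "right" else max(s - 1, 0)

def R(s):                                  # sparse rewards: only the goal state pays off
    return 1.0 if s == 3 else 0.0

Q = {(s, a): 0.0 for s in range(4) for a in ACTIONS}

s = 0
for step in range(1000):
    # 1. Occasionally 'explore', otherwise 'exploit' the current policy.
    if random.random() < EPSILON:
        a = random.choice(ACTIONS)
    else:
        a = max(ACTIONS, key=lambda act: Q[(s, act)])
    # 2. Act on the world and get the reward.
    s_new = W(s, a)
    r_immed = R(s_new)
    # 3. Use Q to compute U(S_new) = max_b Q(S_new, b), then the error.
    u_new = max(Q[(s_new, b)] for b in ACTIONS)
    error = r_immed + GAMMA * u_new - Q[(s, a)]
    # 4. Move the estimate toward the one-step target.
    Q[(s, a)] += ALPHA * error
    # 5. Continue (restart from the start state when the goal is reached).
    s = 0 if s_new == 3 else s_new

print({k: round(v, 2) for k, v in Q.items() if k[1] == "right"})
```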
An "On Your Own" RL HW
Consider the deterministic reinforcement environment drawn below. Let γ = 0.5. Immediate rewards are indicated inside nodes. Once the agent reaches the 'end' state, the current episode ends and the agent is magically transported to the 'start' state.
[Figure: a graph with nodes Start (r = 0), A (r = 2), B (r = 5), C (r = 3), and End (r = 5); every arc shown has an initial Q value of 4]
(a) A one-step, Q-table learner follows the path Start → B → C → End. On the graph below, show the Q values that have changed, and show your work. Assume that for all legal actions (ie, for all the arcs on the graph) the initial values in the Q table are 4, as shown above (feel free to copy the above 4's below, but somehow highlight the changed values).
(b) Starting with the Q table you produced in Part (a), again follow the path Start → B → C → End and show the Q values below that have changed from Part (a). Show your work.
(c) What would the final Q values be in the limit of trying all possible arcs 'infinitely' often? Ie, what is the Bellman-optimal Q table? Explain your answer.
(d) What is the optimal path between Start and End? Explain.

Q-Learning: The Need to 'Generalize Across State'
Remember, conceptually we are filling in a huge table
[Table: rows are the actions a, b, c, …, z; columns are the states S_0, S_1, S_2, …, S_n; eg, the cell in row c, column S_2 holds Q(S_2, c)]
Tables are a very verbose representation of a function

Representing Q Functions More Compactly
We can use some other function representation (eg, a neural net) to compactly encode this big table
[Figure: a network whose inputs are an encoding of the state S, where each input unit encodes a property of the state (eg, a sensor value), and whose outputs are Q(S, a), Q(S, b), …, Q(S, z); each output's second argument is a constant action]
Or we could have one net for each possible action

Q Tables vs. Q Nets
[Figure: a Q net with outputs Q(S, 0), Q(S, 1), …, Q(S, 9)]
Given: 100 Boolean-valued features and 10 possible actions
• Size of the Q table: 10 × 2^100
• Size of the Q net (100 HUs): 100 × 100 + 100 × 10 = 11,000 (weights between inputs and HUs, plus weights between HUs and outputs)
Similar idea as Full Joint Probability Tables vs. Bayes Nets (called 'factored' representations)

Why Use a Compact Q-Function?
1. The full Q table may not fit in memory for realistic problems
2. We can generalize across states, thereby speeding up convergence; ie, one example 'fills' many cells in the Q table
Notes
1. When generalizing across states, we cannot use α = 1
2. The convergence proofs only apply to Q tables

Three Forward Props and a Backprop
[Figure: the Q net is run on the encoded states S_0 and S_1, producing Q(S_0, A) … Q(S_0, Z) and Q(S_1, A) … Q(S_1, Z)]
1. Choose an action in state S_0; execute the chosen action in the world, then 'read' the new sensors and the reward
2. Estimate U(S_1) = max_{X ∈ actions} Q(S_1, X)
3. Calculate the "teacher's" output: compare Q(S_0, A) with the new estimate, assume Q is 'correct' for the other actions, and backprop to reduce the error at Q(S_0, A)
Aside: could save some forward props by caching information
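One way the 'three forward props and a backprop' step could look in code; this is a sketch under my own assumptions (PyTorch, the layer sizes, and the sample numbers are mine, not the lecture's, which used a generic ANN):

```python
import torch
import torch.nn as nn

N_FEATURES, N_ACTIONS, GAMMA = 8, 4, 0.9   # hypothetical sizes and discount

q_net = nn.Sequential(nn.Linear(N_FEATURES, 16), nn.ReLU(),
                      nn.Linear(16, N_ACTIONS))        # one output per action
opt = torch.optim.SGD(q_net.parameters(), lr=0.01)
loss_fn = nn.MSELoss()

s0 = torch.randn(N_FEATURES)       # encoded state S0 (made-up sensor values)
s1 = torch.randn(N_FEATURES)       # encoded next state S1 after acting in the world
reward = 1.0                       # immediate reward observed (made up)

# Forward prop 1: Q(S0, .) -> choose the action (greedy here, for brevity).
q_s0 = q_net(s0)
a = int(torch.argmax(q_s0))

# Forward prop 2: Q(S1, .) -> U(S1) = max_x Q(S1, x).
with torch.no_grad():
    u_s1 = q_net(s1).max()

# Build the "teacher's" output: copy Q(S0, .) and overwrite only the chosen
# action's value with the one-step target (assume Q is 'correct' elsewhere).
target = q_s0.detach().clone()
target[a] = reward + GAMMA * u_s1

# Forward prop 3 + backprop: reduce the error at Q(S0, a) only.
opt.zero_grad()
loss = loss_fn(q_net(s0), target)
loss.backward()
opt.step()
```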
The Agent World (rough sketch, implemented in Java [by me], linked to the cs540 home page)
[Figure: a 2-D world containing pushable ice cubes, opponents, food, and the RL agent]

Some (Ancient) Agent World Results
[Figure: mean (discounted) score on the test-set suite vs. training-set steps (in K, up to 2000), comparing a Q-table learner, Q-nets with 5, 15, 25, and 50 hidden units, perceptrons trained on 600 examples (supervised learning), and a hand-coded policy; the runs took roughly 2 weeks of CPU time 10-20 years ago, on CPUs about 1000x slower than today's]

Estimating Values 'In Place' (see Sec. 2.6 and 2.7 of the Sutton & Barto RL textbook)
Let r_i be our i-th estimate of some Q
Note: r_i is not the immediate reward R_i; rather, r_i = R_i + γ U(next state_i)
Assume we have k + 1 such measurements

Estimating Values (cont.)
Estimate based on k + 1 trials:
  $Q_{k+1} = \frac{1}{k+1} \sum_{i=1}^{k+1} r_i$   (average of the k + 1 measurements)
  $= \frac{1}{k+1} \Big[ r_{k+1} + \sum_{i=1}^{k} r_i \Big]$   (pull out the last term)
  $= \frac{1}{k+1} \big[ r_{k+1} + k\,Q_k \big]$   (stick in the definition of Q_k)

'In Place' Estimates (cont.)
  $= \frac{1}{k+1} \big[ r_{k+1} + (k+1)Q_k - Q_k \big]$   (add and subtract Q_k)
  $= Q_k + \frac{1}{k+1} \big[ r_{k+1} - Q_k \big]$   (the current 'running' average plus a step toward the latest estimate)
Notice that α needs to decay over time. Repeating:
  $Q_{k+1} = Q_k + \alpha \big[ r_{k+1} - Q_k \big]$

Note
• The 'running average' analysis is for Q tables
• When 'generalizing across state', the Q values are coupled together, so we can't simply divide by the number of times an arc was traversed
• Also, even if the world is DETERMINISTIC, we still need to do a running average
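A quick numeric check of the 'in place' running average derived above, showing that the incremental update with α = 1/(k+1) reproduces the plain average; the sample returns are made-up numbers:

```python
# Check that Q_{k+1} = Q_k + (1/(k+1)) * (r_{k+1} - Q_k) equals the plain average.

returns = [4.0, 2.0, 7.0, 5.0, 6.0]    # r_1 .. r_5, each r_i = R_i + gamma*U(next state_i)

q = 0.0
for k, r in enumerate(returns):        # k = 0, 1, 2, ... so the step size is 1/(k+1)
    q = q + (1.0 / (k + 1)) * (r - q)  # alpha = 1/(k+1) decays over time

print(q, sum(returns) / len(returns))  # both print 4.8 (up to float rounding)
```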
Q-Learning Convergence
• Only applies to Q tables and deterministic, Markovian worlds
• Theorem: if every state-action pair is visited infinitely often, 0 ≤ γ < 1, and |rewards| ≤ C (some constant), then
  $\forall s, a: \; \lim_{t \to \infty} \hat{Q}_t(s, a) = Q_{\text{actual}}(s, a)$
  where Q̂ is the approximate Q table and Q is the true Q table

An RL Video
https://m.youtube.com/watch?v=iqXKQf2BOSE

Inverse RL
• Inverse RL: learn the reward function of an agent by observing its behavior
• Some early papers:
  A. Ng and S. Russell, "Algorithms for inverse reinforcement learning," in ICML, 2000
  P. Abbeel and A. Ng, "Apprenticeship learning via inverse reinforcement learning," in ICML, 2004

Recap: Supervised Learners Helping the RL Learner
• Note that Q learning automatically creates I/O pairs for a supervised ML algorithm when 'generalizing across state'
• We can also learn a model of the world (W) and of the reward function (R)
  – Simulations via learned models reduce the need for 'acting in the physical world'

Challenges in RL
• Q tables are too big, so use function approximation
  – Can 'generalize across state' (eg, via ANNs)
  – The convergence proofs no longer apply, though
• Hidden state ('perceptual aliasing')
  – Two different states might look the same (eg, due to 'local' sensors)
  – Can use the theory of Partially Observable Markov Decision Problems (POMDPs)
• Multi-agent learning (the world is no longer stationary)

Could Use GAs for the RL Task
• Another approach is to use GAs to evolve good policies (a minimal sketch of this loop appears at the end of these notes)
  – Create N 'agents'
  – Measure each one's rewards over some time period
  – Discard the worst, cross over the best, do some mutation
  – Repeat 'forever' (a model of biology)
• Both 'predator' and 'prey' evolve/learn, ie, co-evolution

Summary of Non-GA Reinforcement Learning
Positives
– Requires much less 'teacher feedback'
– An appealing approach to learning to predict and control (eg, robotics, softbots); see the demo of Google's Q learning
– Solid mathematical foundations
  • Dynamic programming
  • Markov decision processes
  • Convergence proofs (in the limit)
– Core of a solution to the general AI problem?

Summary of Non-GA Reinforcement Learning (cont.)
Negatives
– Need to deal with huge state-action spaces (so convergence is very slow)
– Hard to design the R function?
– Learns a specific environment rather than general concepts; depends on the state representation?
– Dealing with multiple learning agents?
– Hard to learn at multiple 'grain sizes' (hierarchical RL)
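The GA-for-RL loop from the "Could Use GAs for the RL Task" slide, written as a minimal sketch; the per-state action-list policy encoding, the toy chain environment, and all constants are hypothetical:

```python
import random

# Create N agents, score each, discard the worst, cross over the best, mutate, repeat.

N_STATES, N_AGENTS, GENERATIONS = 5, 20, 50
ACTIONS = [0, 1]                                   # eg, 0 = left, 1 = right

def random_policy():
    return [random.choice(ACTIONS) for _ in range(N_STATES)]   # one action per state

def fitness(policy):
    """Total reward over one short episode in a toy chain world (made up)."""
    s, total = 0, 0.0
    for _ in range(10):
        s = min(s + 1, N_STATES - 1) if policy[s] == 1 else max(s - 1, 0)
        total += 1.0 if s == N_STATES - 1 else 0.0
    return total

population = [random_policy() for _ in range(N_AGENTS)]
for _ in range(GENERATIONS):
    population.sort(key=fitness, reverse=True)
    survivors = population[: N_AGENTS // 2]        # discard the worst half
    children = []
    while len(survivors) + len(children) < N_AGENTS:
        p1, p2 = random.sample(survivors, 2)
        cut = random.randrange(1, N_STATES)        # one-point crossover of the best
        child = p1[:cut] + p2[cut:]
        if random.random() < 0.1:                  # occasional mutation
            child[random.randrange(N_STATES)] = random.choice(ACTIONS)
        children.append(child)
    population = survivors + children

print(population[0], fitness(population[0]))       # best evolved policy and its score
```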