Reinforcement Learning (RL)
• Consider an “agent” embedded in an environment
• Task of the agent. Repeat forever:
  1) sense world
  2) reason
  3) choose an action to perform

Definition of RL
• Assume the world (i.e., environment) periodically provides rewards or punishments (“reinforcements”)
• Based on reinforcements received, learn how to better choose actions

Sequential Decision Problems (courtesy of A.G. Barto, April 2000)
• Decisions are made in stages
• The outcome of each decision is not fully predictable but can be observed before the next decision is made
• The objective is to maximize a numerical measure of total reward (or equivalently, to minimize a measure of total cost)
• Decisions cannot be viewed in isolation: need to balance desire for immediate reward with possibility of high future reward

Reinforcement Learning vs Supervised Learning
• How would we use SL to train an agent in an environment?
  – Show action to choose in sample of world states (“I/O pairs”)
• RL requires much less of teacher
  – Must set up “reward structure”
  – Learner “works out the details”, i.e., writes a program to maximize rewards received

Embedded Learning Systems: Formalization
• $S_E$ = the set of states of the world
  – e.g., an N-dimensional vector
  – “sensors”
• $A_E$ = the set of possible actions an agent can perform
  – “effectors”
• W = the world
• R = the immediate reward structure
W and R are the environment; both can be probabilistic functions.

Embedded Learning Systems (formalization)
$W: S_E \times A_E \rightarrow S_E$
  The world maps a state and an action to a new state.
$R: S_E \times A_E \rightarrow$ “reals”
  Provides rewards (a number) as a function of state and action (as in textbook). Can equivalently formalize as a function of state (next state) alone.

A Graphical View of RL
• Note that both the world and the agent can be probabilistic, so W and R could produce probability distributions.
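The formalization of W and R above can be sketched, for the deterministic case, as two plain functions plus the sense/reason/act loop. This is only an illustrative sketch: the toy world (five positions on a line, actions "left" and "right", reward 1 for arriving at the last position) and the names `run_agent` and `always_right` are assumptions made for the example, not part of the slides. The loop runs for a fixed number of steps rather than forever so that it terminates.

```python
from typing import Callable, Tuple

State = int
Action = str
ACTIONS: Tuple[Action, ...] = ("left", "right")
N_STATES = 5  # hypothetical toy world: positions 0..4 on a line


def W(s: State, a: Action) -> State:
    """The world: maps a (state, action) pair to the next state (deterministic)."""
    step = 1 if a == "right" else -1
    return min(max(s + step, 0), N_STATES - 1)


def R(s: State, a: Action) -> float:
    """Immediate reward as a function of state and action.

    Assumed reward structure: 1 for arriving at the last position, else 0.
    """
    return 1.0 if W(s, a) == N_STATES - 1 else 0.0


def run_agent(policy: Callable[[State], Action], s: State, steps: int) -> float:
    """Sense the state, consult the policy, act; return the total reward."""
    total = 0.0
    for _ in range(steps):
        a = policy(s)      # reason: choose an action for the sensed state
        total += R(s, a)   # the environment hands back a reinforcement
        s = W(s, a)        # the world moves to its new state
    return total


def always_right(s: State) -> Action:
    """A fixed, hand-written policy used only to exercise the loop."""
    return "right"


print(run_agent(always_right, 0, 6))
```

Here the policy is fixed by hand; the rest of the material is about learning it from the rewards instead.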
• For now, assume deterministic problems
[Figure: the real world, W, sends sensory info and R, a reward (a scalar; an indirect teacher), to the agent; the agent sends an action back to the world.]

Common Confusion
State need not be solely the current sensor readings
• Markov Assumption: value of state is independent of path taken to reach that state
• Can have memory of the past: can always create Markovian task by remembering entire past history

Need for Memory: Simple Example
“Out of sight, but not out of mind”
[Figure: two snapshots, T=1 and T=2, showing the learning agent, the opponent, and a WALL.]
Seems reasonable to remember opponent recently seen

State vs. Current Sensor Readings
Remember: state is what is in one’s head (past memories, etc.), not ONLY what one currently sees/hears/smells/etc.

Policies
The agent needs to learn a policy $\pi_E : S_E \rightarrow A_E$
The policy function $\pi_E$: given a world state in $S_E$, which action in $A_E$ should be chosen?
Remember: the agent’s task is to maximize the total reward received during its lifetime

Policies (cont.)
To construct $\pi_E$, we will assign a utility (U) (a number) to each state.
$U^{\pi_E}(s) = \sum_{t=1}^{\infty} \gamma^{t-1} R(s, \pi_E, t)$
- $\gamma$ is a positive constant < 1
- $R(s, \pi_E, t)$ is the reward received at time t, assuming the agent follows policy $\pi_E$ and starts in state s at t=0
- Note: future rewards are discounted by $\gamma^{t-1}$

The Action-Value Function
We want to choose the “best” action in the current state. So, pick the one that leads to the best next state (and include any immediate reward).
Let $Q^{\pi_E}(s, a) = R(W(s, a)) + \gamma \, U^{\pi_E}(W(s, a))$
- First term: immediate reward received for going to state W(s,a)
- Second term: future reward from further actions (discounted due to 1-step delay)

The Action-Value Function (cont.)
If we can accurately learn Q (the action-value function), choosing actions is easy.
Choose a, where $a = \arg\max_{a' \in actions} Q(s, a')$

Q vs. U Visually
[Figure: a graph with states as nodes and actions as arcs; nodes are labeled U(1) through U(6), and one arc is labeled Q(1,ii).]
U’s “stored” on states; Q’s “stored” on arcs

Q-Learning (Watkins PhD, 1989)
Let $Q_t$ be our current estimate of the correct Q.
Our current policy is $\pi_t(s) = a$, such that $Q_t(s, a) = \max_{b \in known\ actions} [Q_t(s, b)]$
Our current utility-function estimate is $U_t(s) = Q_t(s, \pi_t(s))$
- hence, the U table is embedded in the Q table and we don’t need to store both

Q-Learning (cont.)
Assume we are in state $S_t$
• “Run the program” (1) for a while (n steps)
• Determine actual reward and compare to predicted reward
• Adjust prediction to reduce error
(1) I.e., follow the current policy

How Many Actions Should We Take Before Updating Q?
Why not do so after each action?
• “1-Step Q learning”
• Most common approach

Exploration vs. Exploitation
In order to learn about better alternatives, we can’t always follow the current policy (“exploitation”). Sometimes, need to try “random” moves (“exploration”).

Exploration vs. Exploitation (cont.)
Approaches
1) p percent of the time, make a random move; could let $p = \frac{1}{\#\,moves\ made}$
2) $Prob(\text{picking action } A \text{ in state } S) = \frac{const^{\,Q(S,A)}}{\sum_{i \in actions} const^{\,Q(S,i)}}$
   Exponentiating gets rid of negative values

One-Step Q-Learning Algo
0. $S \leftarrow$ initial state
1. If random # < P then a = random choice, else $a = \pi_t(S)$
2. $S_{new} \leftarrow W(S, a)$; $R_{immed} \leftarrow R(S_{new})$   (act on world and get reward)
3. $Q(S, a) \leftarrow R_{immed} + \gamma \max_{a'} Q(S_{new}, a')$
4. $S \leftarrow S_{new}$
• Go to 1

In Stochastic World, Don’t Trash Current Q Entirely…
Change Line 3:
3. $Q(S, a) \leftarrow R_{immed} + \gamma \max_{a'} Q(S_{new}, a')$
To:
3. $Q(S, a) \leftarrow \alpha \, [R_{immed} + \gamma \max_{a'} Q(S_{new}, a')] + (1-\alpha) \, Q(S, a)$

A Simple Example (of Q-learning, with updates after each step, i.e., N = 1)
[Figure: five states with rewards S0 (R=0), S1 (R=1), S2 (R=-1), S3 (R=0), S4 (R=3), connected by six arcs; every arc starts with Q=0.]
Let $\gamma$ = 2/3
Algo: pick state + action; $Q_{new} \leftarrow R + \gamma \max Q_{next\ state}$; repeat (deterministic world, so α=1)

A Simple Example (Step 1)
S0 → S2: the Q on that arc becomes -1 (R of S2 is -1, and every Q leaving S2 is still 0). All other arcs stay at Q=0.

A Simple Example (Step 2)
S2 → S4: the Q on that arc becomes 3 (R of S4 is 3). The S0 → S2 arc stays at Q=-1; all other arcs stay at Q=0.

A Simple Example (Step )
[Figure: the same graph; the six arc values now read Q=1, Q=0, Q=1, Q=3, Q=0, Q=0.]

Q-Learning: Implementation Details
Remember, conceptually we are filling in a huge table

                 States
                 S0    S1    S2          ...   Sn
  Actions   a
            b
            c                Q(S2, c)
            ...
            z

Tables are a very verbose representation of a function

Q-Learning: Convergence Proof
• Applies to Q tables and deterministic, Markovian worlds. Initialize Q’s to 0 or to random finite values.
• Theorem: if every state-action pair is visited infinitely often, $0 \le \gamma < 1$, and |rewards| ≤ C (some constant), then
  $\forall s, a: \ \lim_{t \to \infty} \hat{Q}_t(s, a) = Q_{actual}(s, a)$
  where $\hat{Q}$ is the approximate Q table and $Q$ (also written $Q_{actual}$) is the true Q table.

Q-Learning Convergence Proof (cont.)
• Consider the max error in the approx. Q-table at step t:
  $\Delta_t \equiv \max_{s,a} | \hat{Q}_t(s, a) - Q_{actual}(s, a) |$
• The max $Q_{actual}(s, a)$ is finite since |r| ≤ C, so
  $\max |Q_{actual}| \le \sum_{i=0}^{\infty} \gamma^i C = \frac{C}{1-\gamma}$   (that is, $C + C\gamma + C\gamma^2 + \dots$)
• Since $|\hat{Q}_0|$ is finite, we have $\Delta_0$ finite, i.e. initial max error is finite

Q-Learning Convergence Proof (cont.)
Let s’ be the state that results from doing action a in state s.
Consider what happens when we visit s and do a at step t + 1:
$| \hat{Q}_{t+1}(s, a) - Q(s, a) | = | (R + \gamma \max_{a'} \hat{Q}_t(s', a')) - (R + \gamma \max_{a''} Q(s', a'')) |$
- The left side concerns the current state s; the right side concerns the next state s’
- First parenthesized term: by Q-learning rule (one step)
- Second parenthesized term: by def’n of Q (notice best a in s’ might be different)

Q-Learning Convergence Proof (cont.)
$= \gamma \, | \max_{a'} \hat{Q}_t(s', a') - \max_{a''} Q(s', a'') |$   (by algebra)
$\le \gamma \max_{a'''} | \hat{Q}_t(s', a''') - Q(s', a''') |$   (since max of all differences is as big as a particular one)
$\le \gamma \max_{s'', a'''} | \hat{Q}_t(s'', a''') - Q(s'', a''') |$   (max at s’ ≤ max at any s)
$= \gamma \Delta_t$   (plugging in def’n of $\Delta_t$)

Q-Learning Convergence Proof (cont.)
• Hence, every time, after t, we visit an <s, a>, its Q value differs from the correct answer by no more than $\gamma \Delta_t$
• Let $T_0 = t_0$ (i.e. the start) and $T_N$ be the first time since $T_{N-1}$ where every <s, a> has been visited at least once
• Call the time between $T_{N-1}$ and $T_N$ a complete interval
Clearly $\Delta_{T_N} \le \gamma \Delta_{T_{N-1}}$

Q-Learning Convergence Proof (concluded)
• That is, every complete interval, $\Delta_t$ is reduced by at least a factor of $\gamma$
• Since we assumed every <s, a> pair is visited infinitely often, we will have an infinite number of complete intervals
Hence, $\lim_{t \to \infty} \Delta_t = 0$

Representing Q Functions More Compactly
We can use some other function representation (e.g., neural net) to compactly encode this big table
[Figure: a neural net whose inputs are an encoding of the state (S); each input unit encodes a property of the state (e.g., a sensor value). The outputs are Q(S, a), Q(S, b), ..., Q(S, z); the second argument of each output is a constant.]
Or could have one net for each possible action

Q Tables vs Q Nets
Given: 100 Boolean-valued features, 10 possible actions
• Size of Q table: $10 \times 2^{100}$   ($2^{100}$ = # of possible states)
• Size of Q net (100 HU’s): 100 × 100 + 100 × 10 = 11,000   (weights between inputs and HU’s, plus weights between HU’s and outputs)

Why Use a Compact Q-Function?
1. Full Q table may not fit in memory for realistic problems
2. Can generalize across states, thereby speeding up convergence; i.e., one example “fills” many cells in the Q table

SARSA (1994, 1996) vs. Q-Learning (1989)
Exploring can be hazardous!
Should we learn to consider its impact?

The Cliff-Walking Task (pg 150 of Sutton + Barto RL Text)
[Figure: a grid with Start and Goal (R=0) separated by The Cliff (R = -100). The path along the cliff edge is labeled “Optimal path - if no exploration!”; a path farther from the edge is labeled “Safe route”. R=-1 for all the other moves.]
What would Q-Learning learn?

SARSA = State Action Reward State Action
SARSA:
  $Q(s, a) \leftarrow Q(s, a) + \alpha \, [ R + \gamma \, Q(s', a') - Q(s, a) ]$
Standard Q-Learning:
  $Q(s, a) \leftarrow Q(s, a) + \alpha \, [ R + \gamma \max_{a''} Q(s', a'') - Q(s, a) ]$
• SARSA uses the actual next action a’ (still chosen via explore-exploit strategy, e.g. soft-max or “coin-flip”)
• SARSA also converges
• Notice that in Q learning, (currently) non-optimal moves do not impact the Q function
[Figure: s, action a, reward R, next state s’; from s’, a’ is the actual next action and a’’ is the best next action.]

Sample Results: Cliff-Walking Task
[Plot: total reward until goal (y-axis) vs. episodes, i.e. from Start to Goal (x-axis); curves labeled optimal, SARSA, and Q-Learning; prob of random move = 0.1.]
SARSA learns to avoid ‘pitfalls’ while in exploration phase (always need to explore if in a real-world situation?)
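The two update rules above can be compared side by side on a small cliff-walking grid. The following is a minimal sketch, not the setup behind the plotted results: the 4 × 12 grid, α = 0.5, γ = 1, the episode count, and the helper names `step`, `choose`, and `train` are all assumptions made for illustration. Only the rewards (-1 per move, -100 for the cliff, 0 at the goal) and the 0.1 probability of a random move come from the slides. Exploration uses the “coin-flip” rule.

```python
import random

ROWS, COLS = 4, 12                          # assumed grid size
START, GOAL = (3, 0), (3, 11)
MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right


def step(state, action):
    """Deterministic world: returns (next_state, reward, done)."""
    r = min(max(state[0] + MOVES[action][0], 0), ROWS - 1)
    c = min(max(state[1] + MOVES[action][1], 0), COLS - 1)
    if (r, c) == GOAL:
        return (r, c), 0.0, True
    if r == ROWS - 1 and 0 < c < COLS - 1:  # stepped off the cliff
        return START, -100.0, False         # sent back to Start
    return (r, c), -1.0, False


def choose(Q, state, eps, rng):
    """Coin-flip exploration: random move with probability eps, else greedy."""
    if rng.random() < eps:
        return rng.randrange(len(MOVES))
    values = Q[state]
    return values.index(max(values))


def train(method, episodes=500, alpha=0.5, gamma=1.0, eps=0.1, seed=0):
    """Tabular learning; method is "sarsa" or anything else for Q-learning."""
    rng = random.Random(seed)
    Q = {(r, c): [0.0] * len(MOVES) for r in range(ROWS) for c in range(COLS)}
    for _ in range(episodes):
        s = START
        a = choose(Q, s, eps, rng)
        done = False
        while not done:
            s2, reward, done = step(s, a)
            a2 = choose(Q, s2, eps, rng)    # the action actually taken next
            if done:
                target = reward             # nothing follows the goal
            elif method == "sarsa":
                target = reward + gamma * Q[s2][a2]   # actual next action a'
            else:
                target = reward + gamma * max(Q[s2])  # best next action a''
            Q[s][a] += alpha * (target - Q[s][a])
            s, a = s2, a2
    return Q


for name in ("sarsa", "q-learning"):
    table = train(name)
    print(name, [round(v, 1) for v in table[START]])
```

The only difference between the two learners is the `target` line: SARSA backs up the value of the exploratory action it will really take, so the cost of occasionally falling off the cliff leaks into the values of states near the edge, while Q-learning backs up the greedy value regardless of what is executed next.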