REINFORCEMENT LEARNING I
November 23: Lecture 21

Reinforcement learning: a very general technique
Agents, robots, or computer programs that:
1. Observe the environment
2. Understand what the current state is
3. Take an action based on the state
4. Receive some type of reward, which can be immediate or delayed
(A minimal code sketch of this agent loop appears at the end of these notes.)

Example applications from the slides: life-saving systems; pocket-emptying ones (which also learn how to be creative in cursing); cab dispatching (get a cab fast); traffic adjustment (yet to be implemented in PR).

Reinforcement learning: the general learning problem
What we want to learn is the policy π, which maps the current state s to an action a:
    a = π(s)

Issues with reinforcement learning
1. Delayed reward
   – Observing the results of an action can take a while
   – Future reward may be significant
2. Exploration vs. exploitation
3. Partially observable states
4. Life-long learning

Taking future rewards into consideration
• We can consider future rewards more or less important when taking an action
• One idea is to define a discount factor γ: a reward received t steps into the future is weighted by a factor γ^t (see the discounted-return sketch at the end of these notes)
• Another idea is to consider a finite horizon
• Or an average reward
We focus on discounted rewards.

Markov Decision Process: a schematic view
At time t the agent observes state s_t and takes action a_t; the environment responds with
    s_{t+1} = δ(s_t, a_t)    (state transition)
    r_t = r(s_t, a_t)        (reward)
The goal is to maximize the discounted cumulative reward
    V^π(s_t) = r_t + γ r_{t+1} + γ² r_{t+2} + … = Σ_{i≥0} γ^i r_{t+i}
MDP: the functions δ and r depend only on the current state and action, not on the history.

So, what are we looking to learn?
The optimal policy is characterized by its value function: π* is the policy whose value function is largest in every state,
    π* = argmax_π V^π(s), for all s,
and we write V* for V^{π*}, the value function of the optimal policy.

A simple example
[Slide figure: a small grid world with an absorbing goal state, γ = 0.9.]
Why is the shown policy optimal? Can you find a suboptimal one?
(A value-iteration sketch on a similar toy problem appears at the end of these notes.)

Which evaluation function should we learn?
• It is difficult to learn the function π*: S → A directly
  – There are no training examples of the form <state, action>
  – Only the reward is available
• Learn V* instead?
• Maybe, because we can prefer state s1 over s2 whenever V*(s1) > V*(s2), and so find the optimal action:
    π*(s) = argmax_a [ r(s, a) + γ V*(δ(s, a)) ]
• Problem: we have no knowledge of r and δ!

The Q function
Rewrite things: define
    Q(s, a) = r(s, a) + γ V*(δ(s, a))
so that
    π*(s) = argmax_a Q(s, a)
Unlike V*, the function Q has two arguments, state and action. It hides the environment response: choosing the best action requires only Q, not r or δ (a sketch appears at the end of these notes).

A simple example, continued
[Slide figure: the same grid world with the absorbing state, γ = 0.9, now annotated with values.]
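Code sketches

To make the agent loop above concrete, here is a minimal Python sketch of the observe/act/reward cycle. The CoinFlipEnv environment, its reset/step interface, and the random_policy function are invented for illustration and are not part of the lecture.

    import random

    class CoinFlipEnv:
        """Toy environment: guess a coin flip; reward 1 if correct, 0 otherwise."""
        def reset(self):
            self.truth = random.choice([0, 1])
            return 0  # a single dummy state

        def step(self, action):
            reward = 1.0 if action == self.truth else 0.0
            next_state = self.reset()          # start a new flip
            return next_state, reward

    def random_policy(state):
        # Placeholder policy pi(s): pick an action uniformly at random.
        return random.choice([0, 1])

    env = CoinFlipEnv()
    state = env.reset()
    total_reward = 0.0
    for t in range(1000):
        action = random_policy(state)          # take an action based on the state
        state, reward = env.step(action)       # observe the new state
        total_reward += reward                 # receive a (possibly delayed) reward
    print("average reward:", total_reward / 1000)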
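A small sketch of the discounted-reward idea: a reward received t steps into the future is weighted by γ^t. The reward sequence below is made up for illustration; γ = 0.9 matches the value used in the lecture's example.

    gamma = 0.9
    rewards = [0, 0, 0, 100]   # e.g. a delayed reward of 100 after three empty steps

    discounted_return = sum(gamma**t * r for t, r in enumerate(rewards))
    print(discounted_return)   # 100 * 0.9**3 = 72.9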
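Assuming the transition function δ(s, a) and the reward function r(s, a) are known, V* and the optimal policy can be computed by value iteration. The 4-state chain with an absorbing goal state below is an invented stand-in for the grid world on the slides, not the slide's actual grid.

    gamma = 0.9
    states = [0, 1, 2, 3]                 # state 3 is absorbing
    actions = ["left", "right"]

    def delta(s, a):
        if s == 3:                        # absorbing state: stays put
            return 3
        return min(s + 1, 3) if a == "right" else max(s - 1, 0)

    def r(s, a):
        # reward 100 for the move that enters the absorbing state, 0 elsewhere
        return 100.0 if s != 3 and delta(s, a) == 3 else 0.0

    V = {s: 0.0 for s in states}
    for _ in range(100):                  # value iteration to a fixed point
        V = {s: max(r(s, a) + gamma * V[delta(s, a)] for a in actions)
             for s in states}

    policy = {s: max(actions, key=lambda a: r(s, a) + gamma * V[delta(s, a)])
              for s in states}
    print(V)       # V*(2) = 100, V*(1) = 90, V*(0) = 81, V*(3) = 0
    print(policy)  # "right" in every non-absorbing state: head for the goal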
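Finally, a sketch of the Q rewrite. Combining Q(s, a) = r(s, a) + γ V*(δ(s, a)) with V*(s) = max_a Q(s, a) gives a fixed-point equation in Q alone; iterating it on the same invented toy chain shows that, once Q is known, the greedy policy needs neither r nor δ at decision time.

    gamma = 0.9
    states = [0, 1, 2, 3]                                  # state 3 is absorbing
    actions = ["left", "right"]

    def delta(s, a):
        if s == 3:
            return 3
        return min(s + 1, 3) if a == "right" else max(s - 1, 0)

    def r(s, a):
        return 100.0 if s != 3 and delta(s, a) == 3 else 0.0

    Q = {(s, a): 0.0 for s in states for a in actions}
    for _ in range(100):
        # Q(s, a) = r(s, a) + gamma * max_a' Q(delta(s, a), a')
        Q = {(s, a): r(s, a) + gamma * max(Q[(delta(s, a), b)] for b in actions)
             for s in states for a in actions}

    def greedy(s):
        # pi*(s) = argmax_a Q(s, a): only the stored Q-values are consulted here.
        return max(actions, key=lambda a: Q[(s, a)])

    print(Q[(1, "right")], Q[(1, "left")])   # 90.0 vs 72.9
    print(greedy(1))                          # "right"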