REINFORCEMENT LEARNING I

November 23: Lecture 21
Reinforcement Learning: a very general technique
Agents, robots, or computer programs that:
1. Observe the environment
2. Understand what the current state is
3. Take an action based on the state
4. Receive some type of reward, which can be immediate or delayed
A minimal sketch of this loop is given below.
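The sketch below is an illustrative toy, not part of the lecture: the Environment class, its states, actions and rewards are all made up, and the agent simply acts at random, but it walks through the four steps above.

```python
import random

class Environment:
    """Hypothetical toy environment with two states and two actions."""
    def __init__(self):
        self.state = 0

    def step(self, action):
        # Deterministic transition: the chosen action becomes the next state.
        self.state = action
        # The reward here is immediate; in general it may be delayed.
        reward = 1.0 if action == 1 else 0.0
        return self.state, reward

env = Environment()
state = env.state                        # 1-2. observe the environment / current state
for t in range(5):
    action = random.choice([0, 1])       # 3. take an action based on the state
    state, reward = env.step(action)     # 4. receive some type of reward
    print(t, state, action, reward)
```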
Examples:
• Life saving!
• Pocket emptying (also learns how to be creative in cursing)
• Dispatching: get a cab fast
• Traffic adjustment (yet to be implemented in PR)
Reinforcement Learning
The general learning problem: learn a policy π that maps the current state to an action,
a = π(s)
(s: state, a: action, π: policy). That mapping is what we want to learn.
Issues with reinforcement learning
1. Delayed reward
– Observing the results of an action can take a while
– Future rewards may be significant
2. Exploration vs. exploitation (see the ε-greedy sketch below)
3. Partially observable states
4. Life-long learning
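The slide only names the exploration-vs.-exploitation trade-off; one common way to handle it (an addition for illustration, not from the lecture) is an ε-greedy action choice, sketched here with made-up value estimates.

```python
import random

def epsilon_greedy(q_values, epsilon=0.1):
    """With probability epsilon explore (random action);
    otherwise exploit (pick the action with the highest estimated value)."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))                        # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])       # exploit

# Hypothetical value estimates for three actions in some state
print(epsilon_greedy([0.2, 0.9, 0.4]))   # usually 1, occasionally a random action
```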
Taking future rewards into consideration
• We can consider future rewards more or less important when taking an action
• One idea is to define a discount factor γ
• A reward received t steps into the future is weighted by a factor γ^t (a numeric illustration follows below)
• Another idea is to consider a finite horizon
• Or an average reward
Focus on discounted rewards
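As a quick numerical illustration (an added example, not from the slides): the discounted return of a reward sequence r_0, r_1, r_2, … is Σ_t γ^t r_t, so the same reward counts for less the later it arrives.

```python
def discounted_return(rewards, gamma=0.9):
    """Sum of rewards, each weighted by gamma**t for its delay t."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# A reward of 1 received now is worth more than the same reward received two steps later:
print(discounted_return([1, 0, 0]))   # 1.0
print(discounted_return([0, 0, 1]))   # 0.81
```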
Markov Decision Process: a schematic view
At each time step t the agent observes state s_t and takes action a_t; the environment responds with
s_{t+1} = δ(s_t, a_t)
r_t = r(s_t, a_t)
Goal is to maximize the discounted cumulative reward
V^π(s_t) = r_t + γ r_{t+1} + γ² r_{t+2} + … = Σ_{i≥0} γ^i r_{t+i}
MDP: the transition function δ and the reward function r depend only on the current state (and action); a toy example is sketched below.
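A minimal sketch of such a deterministic MDP, with δ and r stored as lookup tables; the two states, two actions and reward numbers are all hypothetical.

```python
GAMMA = 0.9

# delta[(s, a)] -> next state; r[(s, a)] -> immediate reward (made-up values)
delta = {(0, 'stay'): 0, (0, 'go'): 1, (1, 'stay'): 1, (1, 'go'): 0}
r     = {(0, 'stay'): 0, (0, 'go'): 1, (1, 'stay'): 2, (1, 'go'): 0}

def rollout_return(s, actions):
    """Discounted return of following a fixed action sequence from state s."""
    total, discount = 0.0, 1.0
    for a in actions:
        total += discount * r[(s, a)]
        s = delta[(s, a)]
        discount *= GAMMA
    return total

print(rollout_return(0, ['go', 'stay', 'stay']))   # 1 + 0.9*2 + 0.81*2 = 4.42
```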
So, what are we looking to learn?
The optimal policy is the one that maximizes the cumulative discounted reward from every state:
π* = argmax_π V^π(s), for all s
and its value function is written V*(s) ≡ V^{π*}(s).
A simple example
[Grid-world figure with an absorbing state; γ = 0.9]
Why is the policy shown optimal? Can you give a suboptimal one?
Which evaluation function should we learn?
• It is difficult to learn the function π*: S → A directly
• There are no training examples of the form <state, action>
• Instead, only a reward signal is available
• Learn V*?
• Maybe, because we can choose s1 over s2 if V*(s1) > V*(s2). So we can find the optimal action (sketched below):
π*(s) = argmax_a [ r(s, a) + γ V*(δ(s, a)) ]
• But we have no knowledge of r and δ!
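To make the argmax above concrete, here is a sketch that picks the greedy action when r, δ and V* happen to be known (the tables are the same made-up two-state example as before; in the real RL setting r and δ are exactly what we do not know).

```python
GAMMA = 0.9
# Hypothetical deterministic model and optimal value function (made-up numbers)
delta  = {(0, 'stay'): 0, (0, 'go'): 1, (1, 'stay'): 1, (1, 'go'): 0}
r      = {(0, 'stay'): 0, (0, 'go'): 1, (1, 'stay'): 2, (1, 'go'): 0}
V_star = {0: 9.0, 1: 11.0}

def greedy_action(s, actions):
    """pi*(s) = argmax_a [ r(s, a) + gamma * V*(delta(s, a)) ]"""
    return max(actions, key=lambda a: r[(s, a)] + GAMMA * V_star[delta[(s, a)]])

print(greedy_action(0, ['stay', 'go']))   # 'go': 1 + 0.9*11 = 10.9 beats 0 + 0.9*9 = 8.1
```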
The Q function
Rewrite the action choice in terms of a function of both state and action:
Q(s, a) ≡ r(s, a) + γ V*(δ(s, a)), so that π*(s) = argmax_a Q(s, a)
Unlike V*, the function Q has two arguments, state and action.
It hides the environment's response: once Q is known, choosing the best action no longer requires r and δ (a small sketch follows).
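A sketch of that last point (the Q values below are made up): once a Q table is available, action selection is a lookup and an argmax, with no model of r or δ needed.

```python
# Hypothetical learned Q values: Q[(state, action)]
Q = {(0, 'stay'): 8.1, (0, 'go'): 10.9,
     (1, 'stay'): 11.9, (1, 'go'): 9.0}

def best_action(s, actions):
    """pi*(s) = argmax_a Q(s, a); no knowledge of r or delta required."""
    return max(actions, key=lambda a: Q[(s, a)])

print(best_action(0, ['stay', 'go']))   # 'go'
```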
A simple example
[Grid-world figure with an absorbing state; γ = 0.9]