CSC 411: Lecture 19: Reinforcement Learning
Richard Zemel, Raquel Urtasun and Sanja Fidler
University of Toronto
November 29, 2016
Today
Learn to play games
Reinforcement Learning
Playing Games: Atari
https://www.youtube.com/watch?v=V1eYniJ0Rnk
Playing Games: Super Mario
https://www.youtube.com/watch?v=wfL4L_l4U9A
Making Pancakes!
https://www.youtube.com/watch?v=W_gxLKSsSIE
Reinforcement Learning Resources
RL tutorial – on course website
Reinforcement Learning: An Introduction, by Sutton & Barto (1998)
Reinforcement Learning

- Learning algorithms differ in the information available to the learner:
  - Supervised: correct outputs are given
  - Unsupervised: no feedback; must construct a measure of good output
  - Reinforcement learning: occasional rewards
- More realistic learning scenario:
  - Continuous stream of input information, and actions
  - Effects of an action depend on the state of the world
  - Obtain a reward that depends on the world state and actions
    - not the correct response, just some feedback
Reinforcement Learning

[pic from: Peter Abbeel]
Example: Tic Tac Toe, Notation
Formulating Reinforcement Learning

- World described by a discrete, finite set of states and actions
- At every time step t, we are in a state s_t, and we:
  - take an action a_t (possibly the null action)
  - receive some reward r_{t+1}
  - move into a new state s_{t+1}
- An RL agent may include one or more of these components:
  - Policy π: the agent's behaviour function
  - Value function: how good is each state and/or action
  - Model: the agent's representation of the environment
Policy

- A policy is the agent's behaviour
- It is a selection of which action to take, based on the current state
- Deterministic policy: a = π(s)
- Stochastic policy: π(a|s) = P[a_t = a | s_t = s]

[Slide credit: D. Silver]
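As a concrete sketch of the two policy types above (not from the slides), a deterministic policy can be stored as a lookup table from states to actions, and a stochastic policy as one action distribution per state; the states, actions, and probabilities below are made-up placeholders.

```python
import random

# Hypothetical toy state/action sets, for illustration only.
states = ["s0", "s1"]
actions = ["left", "right"]

# Deterministic policy: a = pi(s), stored as a lookup table.
pi_det = {"s0": "right", "s1": "left"}

# Stochastic policy: pi(a|s) = P[a_t = a | s_t = s], one distribution per state.
pi_stoch = {
    "s0": {"left": 0.2, "right": 0.8},
    "s1": {"left": 0.5, "right": 0.5},
}

def sample_action(state):
    """Draw an action from the stochastic policy in the given state."""
    dist = pi_stoch[state]
    return random.choices(list(dist.keys()), weights=list(dist.values()))[0]

print(pi_det["s0"], sample_action("s0"))
```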
Value Function

- The value function is a prediction of future reward
- Used to evaluate the goodness/badness of states
- Our aim will be to maximize the value function (the total reward we receive over time): find the policy with the highest expected reward
- By following a policy π, the value function is defined as:

      V^π(s_t) = r_t + γ r_{t+1} + γ^2 r_{t+2} + ...

- γ is called the discount rate, and it always satisfies 0 ≤ γ ≤ 1
- If γ is close to 1, rewards further in the future count more, and we say that the agent is "farsighted"
- γ is kept less than 1 because there is usually a time limit to the sequence of actions needed to solve a task (we prefer rewards sooner rather than later)

[Slide credit: D. Silver]
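A minimal sketch of the discounted return above (not from the slides), assuming a finite list of rewards observed from time t onward, so the infinite sum is truncated:

```python
def discounted_return(rewards, gamma=0.9):
    """Compute r_t + gamma*r_{t+1} + gamma^2*r_{t+2} + ... for a finite reward sequence."""
    return sum((gamma ** i) * r for i, r in enumerate(rewards))

# e.g. rewards collected along one trajectory: 0, 0, then 1
print(discounted_return([0.0, 0.0, 1.0], gamma=0.9))   # 0 + 0 + 0.81 = 0.81
```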
Model

- The model describes the environment by a distribution over rewards and state transitions:

      P(s_{t+1} = s', r_{t+1} = r' | s_t = s, a_t = a)

- We assume the Markov property: the future depends on the past only through the current state
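One way to picture such a model (purely illustrative, with invented numbers) is as a table mapping each (state, action) pair to a distribution over (next state, reward) outcomes:

```python
# Hypothetical tabular model: (s, a) -> {(s_next, reward): probability}
model = {
    ("s0", "right"): {("s1", 0.0): 0.9, ("s0", 0.0): 0.1},
    ("s1", "right"): {("s2", 1.0): 1.0},
}

# Markov property: the outcome distribution depends only on the current (s, a),
# not on how the agent arrived at s.
print(model[("s0", "right")])
```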
Maze Example

- Rewards: −1 per time-step
- Actions: N, E, S, W
- States: agent's location

[Slide credit: D. Silver]
Maze Example

- Arrows represent the policy π(s) for each state s

[Slide credit: D. Silver]
Maze Example

- Numbers represent the value V^π(s) of each state s

[Slide credit: D. Silver]
Example: Tic-Tac-Toe

- Consider the game tic-tac-toe:
  - reward: win/lose/tie the game (+1/−1/0) [only at the final move in a given game]
  - state: positions of X's and O's on the board
  - policy: mapping from states to actions
    - based on the rules of the game: choice of one open position
  - value function: prediction of future reward, based on the current state
- In tic-tac-toe, since the state space is tractable, we can use a table to represent the value function
RL & Tic-Tac-Toe

- Each board position (taking into account symmetry) has some probability of winning
- Simple learning process:
  - start with all values = 0.5
  - policy: choose the move with the highest probability of winning, given the legal moves from the current state
  - update entries in the table based on the outcome of each game
  - after many games, the value function will represent the true probability of winning from each state
- Can try an alternative policy: sometimes select moves randomly (exploration)
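A rough sketch of this table-based learning process. The board encoding, the step size, and the "nudge toward the game outcome" update are assumptions for illustration; the slide only says to update table entries based on the outcome of each game.

```python
import random

values = {}     # board state (e.g. a string like "X.O......") -> estimated probability of winning
ALPHA = 0.1     # assumed step size for the update

def value(state):
    return values.setdefault(state, 0.5)        # all values start at 0.5

def choose_next_state(candidate_states, explore=0.1):
    """Greedy over the states reachable by legal moves, with occasional random exploration."""
    if random.random() < explore:
        return random.choice(candidate_states)
    return max(candidate_states, key=value)

def update_from_game(visited_states, outcome):
    """After a game, nudge each visited state's value toward the outcome (1 = win, 0 = loss/tie)."""
    for s in visited_states:
        values[s] = value(s) + ALPHA * (outcome - value(s))

# toy usage with made-up board strings
s = choose_next_state(["X........", ".X.......", "..X......"])
update_from_game([s], outcome=1)
```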
Basic Problems

- Markov Decision Problem (MDP): tuple (S, A, P, γ) where P is

      P(s_{t+1} = s', r_{t+1} = r' | s_t = s, a_t = a)

- Standard MDP problems:
  1. Planning: given the complete Markov decision problem as input, compute the policy with optimal expected return
  2. Learning: we don't know which states are good or what the actions do; we must try out actions and states to learn what to do

[Pic: P. Abbeel]
Example of Standard MDP Problem

1. Planning: given the complete Markov decision problem as input, compute the policy with optimal expected return
2. Learning: only have access to experience in the MDP; learn a near-optimal strategy

We will focus on learning, but discuss planning along the way.
Exploration vs. Exploitation

- If we knew how the world works (embodied in P), then the policy should be deterministic:
  - just select the optimal action in each state
- Reinforcement learning is like trial-and-error learning
- The agent should discover a good policy from its experiences of the environment, without losing too much reward along the way
- Since we do not have complete knowledge of the world, taking what appears to be the optimal action may prevent us from finding better states/actions
- Interesting trade-off:
  - immediate reward (exploitation) vs. gaining knowledge that might enable higher future reward (exploration)
Examples

- Restaurant Selection
  - Exploitation: Go to your favourite restaurant
  - Exploration: Try a new restaurant
- Online Banner Advertisements
  - Exploitation: Show the most successful advert
  - Exploration: Show a different advert
- Oil Drilling
  - Exploitation: Drill at the best known location
  - Exploration: Drill at a new location
- Game Playing
  - Exploitation: Play the move you believe is best
  - Exploration: Play an experimental move

[Slide credit: D. Silver]
MDP Formulation

- Goal: find the policy π that maximizes the expected accumulated future rewards V^π(s_t), obtained by following π from state s_t:

      V^π(s_t) = r_t + γ r_{t+1} + γ^2 r_{t+2} + ...
               = Σ_{i=0}^∞ γ^i r_{t+i}

- Game show example:
  - assume a series of questions, increasingly difficult, but with increasing payoff
  - choice: accept accumulated earnings and quit, or continue and risk losing everything
- Notice that the return decomposes recursively (the tail of the sum is γ times the return from s_{t+1}):

      V^π(s_t) = r_t + γ V^π(s_{t+1})
What to Learn

- We might try to learn the function V (which we write as V*):

      V*(s) = max_a [r(s, a) + γ V*(δ(s, a))]

  Here δ(s, a) gives the next state if we perform action a in current state s
- We could then do a lookahead search to choose the best action from any state s:

      π*(s) = argmax_a [r(s, a) + γ V*(δ(s, a))]

- But there's a problem:
  - this works well if we know δ() and r()
  - but when we don't, we cannot choose actions this way
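If δ and r are known (the planning setting), V* can be computed by repeatedly applying the max-equation above, and π* then follows from the lookahead. A minimal sketch on a made-up two-state deterministic world:

```python
# Hypothetical deterministic world: delta[(s, a)] = next state, r[(s, a)] = immediate reward.
delta = {("s0", "go"): "s1", ("s0", "stay"): "s0",
         ("s1", "go"): "s1", ("s1", "stay"): "s1"}
r = {("s0", "go"): 0.0, ("s0", "stay"): 0.0,
     ("s1", "go"): 1.0, ("s1", "stay"): 1.0}
states, actions, gamma = ["s0", "s1"], ["go", "stay"], 0.9

# Repeatedly apply V*(s) = max_a [r(s,a) + gamma * V*(delta(s,a))] (value iteration).
V = {s: 0.0 for s in states}
for _ in range(100):
    V = {s: max(r[(s, a)] + gamma * V[delta[(s, a)]] for a in actions) for s in states}

# One-step lookahead then gives pi*(s) = argmax_a [r(s,a) + gamma * V*(delta(s,a))].
pi = {s: max(actions, key=lambda a, s=s: r[(s, a)] + gamma * V[delta[(s, a)]]) for s in states}
print(V, pi)
```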
Q Learning

- Define a new function, very similar to V*:

      Q(s, a) = r(s, a) + γ V*(δ(s, a))

- If we learn Q, we can choose the optimal action even without knowing δ:

      π*(s) = argmax_a [r(s, a) + γ V*(δ(s, a))]
            = argmax_a Q(s, a)

- Q is then the evaluation function we will learn
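A small sketch of the point being made: given a learned Q table, action selection is just an argmax over actions, with no need to evaluate δ or r; the Q-values below are made up.

```python
# Hypothetical learned Q-values for one state and three actions.
Q = {("s1", "left"): 63.0, ("s1", "up"): 81.0, ("s1", "right"): 100.0}

def greedy_action(state, actions):
    """pi*(s) = argmax_a Q(s, a): no model of delta or r is required."""
    return max(actions, key=lambda a: Q[(state, a)])

print(greedy_action("s1", ["left", "up", "right"]))   # -> "right"
```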
Training Rule to Learn Q

- Q and V* are closely related:

      V*(s) = max_a Q(s, a)

- So we can write Q recursively:

      Q(s_t, a_t) = r(s_t, a_t) + γ V*(δ(s_t, a_t))
                  = r(s_t, a_t) + γ max_{a'} Q(s_{t+1}, a')

- Let Q̂ denote the learner's current approximation to Q
- Consider the training rule

      Q̂(s, a) ← r(s, a) + γ max_{a'} Q̂(s', a')

  where s' is the state resulting from applying action a in state s
Q Learning for Deterministic World

- For each s, a initialize the table entry Q̂(s, a) ← 0
- Start in some initial state s
- Do forever:
  - select an action a and execute it
  - receive immediate reward r
  - observe the new state s'
  - update the table entry for Q̂(s, a) using the Q learning rule:

        Q̂(s, a) ← r(s, a) + γ max_{a'} Q̂(s', a')

  - s ← s'
- If we reach an absorbing state, restart in the initial state and run through the "Do forever" loop until we again reach an absorbing state
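A minimal tabular implementation of this loop. The environment interface (env.reset() returning a start state, env.step(a) returning the next state, reward, and an absorbing-state flag) and the random action selection are assumptions; the slides treat action selection as a separate question.

```python
import random
from collections import defaultdict

def q_learning_deterministic(env, actions, gamma=0.9, episodes=1000):
    """Tabular Q-learning for a deterministic world: Q(s,a) <- r + gamma * max_a' Q(s',a')."""
    Q = defaultdict(float)                       # every Q_hat(s, a) starts at 0
    for _ in range(episodes):                    # restart after each absorbing state
        s = env.reset()                          # assumed: returns an initial state
        done = False
        while not done:
            a = random.choice(actions)           # placeholder action selection during learning
            s_next, reward, done = env.step(a)   # assumed: executes a, returns (s', r, absorbing?)
            Q[(s, a)] = reward + gamma * max(Q[(s_next, a2)] for a2 in actions)
            s = s_next
    return Q
```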
Updating Estimated Q

- Assume the robot is in state s_1; some of its current estimates of Q are as shown; it executes the rightward move:

      Q̂(s_1, a_right) ← r + γ max_{a'} Q̂(s_2, a')
                       ← r + 0.9 max{63, 81, 100} = 90    (here r = 0)

- Important observation: at each time step (taking action a in state s), only one entry of Q̂ will change (the entry Q̂(s, a))
- Notice that if rewards are non-negative, then the Q̂ values only increase from 0 and approach the true Q
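A two-line check of the arithmetic above, assuming (as the result of 90 implies) that the immediate reward r is 0:

```python
gamma, r = 0.9, 0.0              # r = 0 is assumed, consistent with the result of 90
q_s2 = [63.0, 81.0, 100.0]       # current estimates Q_hat(s2, a') for the three actions
print(r + gamma * max(q_s2))     # -> 90.0
```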
Q Learning: Summary

- Training set consists of a series of intervals (episodes): sequences of (state, action, reward) triples, each ending at an absorbing state
- Each executed action a results in a transition from state s_i to s_j; the algorithm updates Q̂(s_i, a) using the learning rule
- Intuition for a simple grid world with reward only upon entering the goal state → Q estimates improve backwards from the goal state:
  1. All Q̂(s, a) start at 0
  2. First episode: only the Q̂(s, a) for the transition leading into the goal state is updated
  3. Next episode: if we go through this next-to-last transition, we will update Q̂(s, a) another step back
  4. Eventually, information from transitions with non-zero reward is propagated throughout the state-action space
Q Learning: Exploration/Exploitation

- We have not specified how actions are chosen (during learning)
- Can choose actions to maximize Q̂(s, a). Good idea?
- Can instead employ stochastic action selection (policy):

      P(a_i | s) = exp(k Q̂(s, a_i)) / Σ_j exp(k Q̂(s, a_j))

- Can vary k during learning
  - more exploration early on, shifting towards exploitation later
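A sketch of this stochastic selection rule. Here k plays the role of an inverse temperature (larger k means greedier behaviour); subtracting the max before exponentiating is only for numerical stability and is not part of the slide.

```python
import math
import random

def softmax_action(Q, state, actions, k=1.0):
    """Sample a_i with probability proportional to exp(k * Q_hat(s, a_i))."""
    scores = [k * Q.get((state, a), 0.0) for a in actions]
    m = max(scores)                                   # shift for numerical stability
    weights = [math.exp(x - m) for x in scores]
    return random.choices(actions, weights=weights)[0]

# Varying k during learning: small k early (near-uniform, more exploration),
# larger k later (concentrates on the highest-value actions, more exploitation).
```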
Non-deterministic Case

- What if the reward and next state are non-deterministic?
- We redefine V and Q in terms of expected values:

      V^π(s) = E_π[r_t + γ r_{t+1} + γ^2 r_{t+2} + ...]
             = E_π[ Σ_{i=0}^∞ γ^i r_{t+i} ]

  and

      Q(s, a) = E[r(s, a) + γ V*(δ(s, a))]
              = E[r(s, a)] + γ Σ_{s'} p(s' | s, a) max_{a'} Q(s', a')
Non-deterministic Case: Learning Q

- The previous training rule does not converge in this setting (it can keep changing Q̂ even if Q̂ is initialized to the true Q values)
- So modify the training rule to change more slowly:

      Q̂_n(s, a) ← (1 − α_n) Q̂_{n−1}(s, a) + α_n [r + γ max_{a'} Q̂_{n−1}(s', a')]

  where s' is the state we land in after taking action a in state s, and a' ranges over the actions available in s', with

      α_n = 1 / (1 + visits_n(s, a))

  where visits_n(s, a) is the number of times action a has been taken in state s
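A sketch of the slowed-down update with the visit-count learning rate α_n; the environment interface and the action selection are assumptions, as in the deterministic sketch earlier.

```python
import random
from collections import defaultdict

def q_learning_nondeterministic(env, actions, gamma=0.9, episodes=1000):
    """Q-learning with alpha_n = 1 / (1 + visits_n(s, a)) for a non-deterministic world."""
    Q = defaultdict(float)
    visits = defaultdict(int)
    for _ in range(episodes):
        s = env.reset()                              # assumed environment interface
        done = False
        while not done:
            a = random.choice(actions)               # placeholder action selection
            s_next, reward, done = env.step(a)
            visits[(s, a)] += 1
            alpha = 1.0 / (1.0 + visits[(s, a)])
            target = reward + gamma * max(Q[(s_next, a2)] for a2 in actions)
            Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target
            s = s_next
    return Q
```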