Reinforcement Learning
• Control learning
• Control policies that choose optimal actions
• Q learning
• Convergence
CS 5751 Machine Learning, Chapter 13: Reinforcement Learning
Control Learning
Consider learning to choose actions, e.g.,
• Robot learning to dock on battery charger
• Learning to choose actions to optimize factory
output
• Learning to play Backgammon
Note several problem characteristics
• Delayed reward
• Opportunity for active exploration
• Possibility that state only partially observable
• Possible need to learn multiple tasks with same
sensors/effectors
One Example: TD-Gammon
Tesauro, 1995
Learn to play Backgammon
Immediate reward
• +100 if win
• -100 if lose
• 0 for all other states
Trained by playing 1.5 million games against itself
Now approximately equal to best human player
Reinforcement Learning Problem
[Figure: the agent repeatedly observes the environment's state, chooses an action, and receives a reward, producing the sequence $s_0, a_0, r_0, s_1, a_1, r_1, s_2, a_2, r_2, \ldots$]
Goal: learn to choose actions that maximize
$r_0 + \gamma r_1 + \gamma^2 r_2 + \ldots$, where $0 \le \gamma < 1$
Markov Decision Process
Assume
• finite set of states S
• set of actions A
• at each discrete time, agent observes state $s_t \in S$
and chooses action $a_t \in A$
• then receives immediate reward $r_t$
• and state changes to $s_{t+1}$
• Markov assumption: $s_{t+1} = \delta(s_t, a_t)$ and $r_t = r(s_t, a_t)$
– i.e., $r_t$ and $s_{t+1}$ depend only on current state and action
– functions $\delta$ and $r$ may be nondeterministic
– functions $\delta$ and $r$ not necessarily known to agent
Agent’s Learning Task
Execute action in environment, observe results, and
• learn action policy $\pi : S \rightarrow A$ that maximizes
$E[r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \ldots]$
from any starting state in S
• here $0 \le \gamma < 1$ is the discount factor for future
rewards
Note something new:
• target function is $\pi : S \rightarrow A$
• but we have no training examples of form <s,a>
• training examples are of form <<s,a>,r>
Value Function
To begin, consider deterministic worlds …
For each possible policy $\pi$ the agent might adopt, we
can define an evaluation function over states
$$V^{\pi}(s) \equiv r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \ldots = \sum_{i=0}^{\infty} \gamma^i r_{t+i}$$
where $r_t, r_{t+1}, \ldots$ are generated by following policy $\pi$
starting at state s
Restated, the task is to learn the optimal policy $\pi^*$
$$\pi^* \equiv \arg\max_{\pi} V^{\pi}(s), \ (\forall s)$$
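As a concrete illustration of the discounted sum above, here is a minimal Python sketch that computes the return for a finite reward sequence; the example rewards and the discount factor 0.9 are illustrative choices, not taken from the slides.

```python
# Minimal sketch: the discounted return that V^pi(s) sums up, evaluated on an
# illustrative finite reward sequence generated by following some policy pi.
def discounted_return(rewards, gamma=0.9):
    """Compute sum_i gamma^i * r_{t+i} over the given reward sequence."""
    return sum((gamma ** i) * r for i, r in enumerate(rewards))

# Example: three zero-reward steps followed by reaching a goal worth 100.
print(discounted_return([0, 0, 0, 100]))   # 0.9**3 * 100 = 72.9
```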
[Figure: a simple deterministic grid world with absorbing goal state G, showing the r(s,a) (immediate reward) values, the Q(s,a) values, the V*(s) values, and one optimal policy]
What to Learn
We might try to have the agent learn the evaluation
function $V^{\pi^*}$ (which we write as $V^*$)
We could then do a lookahead search to choose the best
action from any state s because
$$\pi^*(s) = \arg\max_{a} \left[ r(s,a) + \gamma V^*(\delta(s,a)) \right]$$
A problem:
• This works well if agent knows $\delta : S \times A \rightarrow S$,
and $r : S \times A \rightarrow \Re$
• But when it doesn’t, we can’t choose actions this
way
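For contrast, a minimal sketch of the lookahead choice above, assuming the agent is handed the reward function, the transition function, and V*; the parameter names (reward_fn, delta_fn, v_star) are illustrative, not from the slides.

```python
# Lookahead action selection: pi*(s) = argmax_a [ r(s,a) + gamma * V*(delta(s,a)) ].
# This only works because reward_fn (r) and delta_fn (delta) are assumed known.
def lookahead_policy(s, actions, reward_fn, delta_fn, v_star, gamma=0.9):
    return max(actions, key=lambda a: reward_fn(s, a) + gamma * v_star[delta_fn(s, a)])
```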
Q Function
Define new function very similar to V*
$$Q(s, a) \equiv r(s,a) + \gamma V^*(\delta(s,a))$$
If agent learns Q, it can choose optimal action even
without knowing $\delta$!
$$\pi^*(s) = \arg\max_{a} \left[ r(s,a) + \gamma V^*(\delta(s,a)) \right] = \arg\max_{a} Q(s, a)$$
Q is the evaluation function the agent will learn
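A matching sketch of action selection from a learned Q table alone; representing the table as a Python dict keyed by (state, action) pairs is an illustrative implementation choice, not something the slides prescribe.

```python
# Action selection from Q alone: pi*(s) = argmax_a Q(s, a).
# No reward function or transition function is needed here.
def greedy_action(q_table, s, actions):
    return max(actions, key=lambda a: q_table[(s, a)])
```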
Training Rule to Learn Q
Note Q and V* closely related:
$$V^*(s) = \max_{a'} Q(s, a')$$
Which allows us to write Q recursively as
$$Q(s_t, a_t) = r(s_t, a_t) + \gamma V^*(\delta(s_t, a_t)) = r(s_t, a_t) + \gamma \max_{a'} Q(s_{t+1}, a')$$
Let $\hat{Q}$ denote learner's current approximation to Q.
Consider training rule
$$\hat{Q}(s, a) \leftarrow r + \gamma \max_{a'} \hat{Q}(s', a')$$
where s' is the state resulting from applying action a
in state s
Q Learning for Deterministic Worlds
For each s,a initialize table entry $\hat{Q}(s,a) \leftarrow 0$
Observe current state s
Do forever:
• Select an action a and execute it
• Receive immediate reward r
• Observe the new state s'
• Update the table entry for Qˆ ( s,a ) as follows:
$$\hat{Q}(s, a) \leftarrow r + \gamma \max_{a'} \hat{Q}(s', a')$$
• $s \leftarrow s'$
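Below is a minimal Python sketch of this loop. The environment interface (env.reset, env.step, env.actions) and the episodic structure are assumptions made for illustration; the slide's version simply runs forever.

```python
import random
from collections import defaultdict

def q_learning(env, gamma=0.9, episodes=1000):
    """Tabular Q learning for a deterministic world, following the loop above."""
    q = defaultdict(float)                        # Q-hat(s, a), initialized to 0
    for _ in range(episodes):
        s = env.reset()                           # observe current state s
        done = False
        while not done:
            a = random.choice(env.actions(s))     # select an action a and execute it
            s_next, r, done = env.step(a)         # receive reward r, observe new state s'
            # Q-hat(s, a) <- r + gamma * max_a' Q-hat(s', a')
            q[(s, a)] = r + gamma * max(
                (q[(s_next, a2)] for a2 in env.actions(s_next)), default=0.0)
            s = s_next                            # s <- s'
    return q
```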
Updating
[Figure: one step in the grid world, moving right from initial state $s_1$ to next state $s_2$; the current $\hat{Q}$ values for the actions available in $s_2$ are 63, 81, and 100]
$$\hat{Q}(s_1, a_{right}) \leftarrow r + \gamma \max_{a'} \hat{Q}(s_2, a') = 0 + 0.9 \max\{63, 81, 100\} = 90$$
notice if rewards non-negative, then
$$(\forall s, a, n) \quad \hat{Q}_{n+1}(s, a) \ge \hat{Q}_n(s, a)$$
and
$$(\forall s, a, n) \quad 0 \le \hat{Q}_n(s, a) \le Q(s, a)$$
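A quick check of the single worked update above, with the values taken from the figure:

```python
# One Q-learning update for the transition s1 --a_right--> s2 shown above:
# immediate reward r = 0, gamma = 0.9, current Q-hat values at s2 are 63, 81, 100.
r, gamma = 0, 0.9
q_hat_s2 = [63, 81, 100]
new_q = r + gamma * max(q_hat_s2)
print(new_q)   # 90.0, the revised Q-hat(s1, a_right)
```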
Convergence
Q̂ converges to Q. Consider case of deterministic world
where each <s,a> is visited infinitely often.
Proof: define a full interval to be an interval during which
each <s,a> is visited. During each full interval the largest
error in the $\hat{Q}$ table is reduced by a factor of $\gamma$.
Let $\hat{Q}_n$ be the table after n updates, and $\Delta_n$ be the maximum error
in $\hat{Q}_n$; that is
$$\Delta_n = \max_{s,a} \left| \hat{Q}_n(s, a) - Q(s, a) \right|$$
Convergence (cont)
For any table entry $\hat{Q}_n(s, a)$ updated on iteration n+1, the
error in the revised estimate $\hat{Q}_{n+1}(s, a)$ is
$$\left| \hat{Q}_{n+1}(s, a) - Q(s, a) \right| = \left| \left( r + \gamma \max_{a'} \hat{Q}_n(s', a') \right) - \left( r + \gamma \max_{a'} Q(s', a') \right) \right|$$
$$= \gamma \left| \max_{a'} \hat{Q}_n(s', a') - \max_{a'} Q(s', a') \right|$$
$$\le \gamma \max_{a'} \left| \hat{Q}_n(s', a') - Q(s', a') \right|$$
$$\le \gamma \max_{s'', a'} \left| \hat{Q}_n(s'', a') - Q(s'', a') \right|$$
$$\left| \hat{Q}_{n+1}(s, a) - Q(s, a) \right| \le \gamma \Delta_n$$
Note we used general fact that
$$\left| \max_{a} f_1(a) - \max_{a} f_2(a) \right| \le \max_{a} \left| f_1(a) - f_2(a) \right|$$
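A small numerical illustration of this fact; the values of f1 and f2 are arbitrary example numbers.

```python
# |max_a f1(a) - max_a f2(a)| <= max_a |f1(a) - f2(a)| on arbitrary example values.
f1 = {'a1': 5.0, 'a2': 9.0, 'a3': 1.0}
f2 = {'a1': 7.0, 'a2': 6.0, 'a3': 2.0}
lhs = abs(max(f1.values()) - max(f2.values()))   # |9 - 7| = 2
rhs = max(abs(f1[a] - f2[a]) for a in f1)        # max(2, 3, 1) = 3
assert lhs <= rhs
```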
Nondeterministic Case
What if reward and next state are non-deterministic?
We redefine V,Q by taking expected values
$$V^{\pi}(s) \equiv E[r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \ldots] = E\left[ \sum_{i=0}^{\infty} \gamma^i r_{t+i} \right]$$
$$Q(s, a) \equiv E[r(s, a) + \gamma V^*(\delta(s, a))]$$
Nondeterministic Case
Q learning generalizes to nondeterministic worlds
Alter training rule to
$$\hat{Q}_n(s, a) \leftarrow (1 - \alpha_n) \hat{Q}_{n-1}(s, a) + \alpha_n \left[ r + \gamma \max_{a'} \hat{Q}_{n-1}(s', a') \right]$$
where
$$\alpha_n = \frac{1}{1 + visits_n(s, a)}$$
Can still prove convergence of Q̂ to Q [Watkins and
Dayan, 1992]
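A minimal sketch of this rule as a single table update with the decaying learning rate; the table and visit-count bookkeeping (dicts keyed by (state, action) pairs) are illustrative implementation choices.

```python
from collections import defaultdict

q = defaultdict(float)       # Q-hat table
visits = defaultdict(int)    # visits_n(s, a): times each pair has been updated

def nondeterministic_update(s, a, r, s_next, actions, gamma=0.9):
    """One update: move Q-hat(s, a) part way toward the sampled target."""
    visits[(s, a)] += 1
    alpha = 1.0 / (1.0 + visits[(s, a)])
    target = r + gamma * max((q[(s_next, a2)] for a2 in actions), default=0.0)
    # Q-hat_n(s,a) <- (1 - alpha_n) Q-hat_{n-1}(s,a) + alpha_n * target
    q[(s, a)] = (1 - alpha) * q[(s, a)] + alpha * target
```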
Temporal Difference Learning
Q learning: reduce discrepancy between successive
Q estimates
One step time difference:
$$Q^{(1)}(s_t, a_t) \equiv r_t + \gamma \max_{a} \hat{Q}(s_{t+1}, a)$$
Why not two steps?
$$Q^{(2)}(s_t, a_t) \equiv r_t + \gamma r_{t+1} + \gamma^2 \max_{a} \hat{Q}(s_{t+2}, a)$$
Or n?
$$Q^{(n)}(s_t, a_t) \equiv r_t + \gamma r_{t+1} + \ldots + \gamma^{n-1} r_{t+n-1} + \gamma^n \max_{a} \hat{Q}(s_{t+n}, a)$$
Blend all of these:
$$Q^{\lambda}(s_t, a_t) \equiv (1 - \lambda) \left[ Q^{(1)}(s_t, a_t) + \lambda Q^{(2)}(s_t, a_t) + \lambda^2 Q^{(3)}(s_t, a_t) + \ldots \right]$$
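A sketch of these estimates, assuming we already hold a short window of observed rewards and, for each horizon n, the bootstrap value max_a Q-hat(s_{t+n}, a); the specific numbers and the truncation of the infinite lambda-series are illustrative simplifications.

```python
def n_step_estimate(rewards, bootstrap, gamma=0.9):
    """Q^(n) with n = len(rewards): discounted observed rewards plus gamma^n * bootstrap."""
    n = len(rewards)
    return sum((gamma ** i) * r for i, r in enumerate(rewards)) + (gamma ** n) * bootstrap

def lambda_blend(estimates, lam=0.5):
    """Q^lambda = (1 - lambda) * [Q^(1) + lambda*Q^(2) + lambda^2*Q^(3) + ...], truncated."""
    return (1 - lam) * sum((lam ** i) * q_n for i, q_n in enumerate(estimates))

# Example: Q^(1), Q^(2), Q^(3) built from observed rewards [0], [0, 0], [0, 0, 100]
# and bootstrap values 81, 90, 0 respectively (illustrative numbers).
q_estimates = [n_step_estimate([0], 81),
               n_step_estimate([0, 0], 90),
               n_step_estimate([0, 0, 100], 0)]
print(lambda_blend(q_estimates))
```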
Temporal Difference Learning
$$Q^{\lambda}(s_t, a_t) \equiv (1 - \lambda) \left[ Q^{(1)}(s_t, a_t) + \lambda Q^{(2)}(s_t, a_t) + \lambda^2 Q^{(3)}(s_t, a_t) + \ldots \right]$$
Equivalent expression:
$$Q^{\lambda}(s_t, a_t) = r_t + \gamma \left[ (1 - \lambda) \max_{a} \hat{Q}(s_{t+1}, a) + \lambda Q^{\lambda}(s_{t+1}, a_{t+1}) \right]$$
TD($\lambda$) algorithm uses above training rule
• Sometimes converges faster than Q learning
• converges for learning V* for any $0 \le \lambda \le 1$
(Dayan, 1992)
• Tesauro’s TD-Gammon uses this algorithm
Subtleties and Ongoing Research
• Replace Q̂ table with neural network or other
generalizer
• Handle case where state only partially observable
• Design optimal exploration strategies
• Extend to continuous action, state
• Learn and use $\hat{\delta} : S \times A \rightarrow S$, $\hat{\delta}$ approximation to $\delta$
• Relationship to dynamic programming
RL Summary
• Reinforcement learning (RL)
– control learning
– delayed reward
– possible that the state is only partially observable
– possible that the relationship between states and actions is unknown
• Temporal Difference Learning
– learn discrepancies between successive estimates
– used in TD-Gammon
• V(s) - state value function
– needs known reward/state transition functions
RL Summary
• Q(s,a) - state/action value function
– related to V
– does not need reward/state transition functions
– training rule
– related to dynamic programming
– measure actual reward received for action and future
value using current Q function
– deterministic - replace existing estimate
– nondeterministic - move table estimate towards
measured estimate
– convergence - can be shown in both cases