Reinforcement Learning
• Control learning
• Control policies that choose optimal actions
• Q learning
• Convergence
CS 5751 Machine Learning, Chapter 13: Reinforcement Learning
Control Learning
Consider learning to choose actions, e.g.,
• Robot learning to dock on battery charger
• Learning to choose actions to optimize factory
output
• Learning to play Backgammon
Note several problem characteristics
• Delayed reward
• Opportunity for active exploration
• Possibility that state only partially observable
• Possible need to learn multiple tasks with same
sensors/effectors
One Example: TD-Gammon
Tesauro, 1995
Learn to play Backgammon
Immediate reward
• +100 if win
• -100 if lose
• 0 for all other states
Trained by playing 1.5 million games against itself
Now approximately equal to best human player
Reinforcement Learning Problem
[Figure: the agent repeatedly observes the environment's state, chooses an action, and receives a reward, producing the sequence $s_0, a_0, r_0, s_1, a_1, r_1, s_2, a_2, r_2, \ldots$]
Goal: learn to choose actions that maximize
$r_0 + \gamma r_1 + \gamma^2 r_2 + \ldots$, where $0 \le \gamma < 1$
Markov Decision Process
Assume
• finite set of states S
• set of actions A
• at each discrete time, agent observes state $s_t \in S$
and chooses action $a_t \in A$
• then receives immediate reward $r_t$
• and state changes to $s_{t+1}$
• Markov assumption: $s_{t+1} = \delta(s_t, a_t)$ and $r_t = r(s_t, a_t)$
– i.e., $r_t$ and $s_{t+1}$ depend only on current state and action
– functions $\delta$ and $r$ may be nondeterministic
– functions $\delta$ and $r$ not necessarily known to agent
Agent’s Learning Task
Execute action in environment, observe results, and
• learn action policy $\pi : S \rightarrow A$ that maximizes
$E[r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \ldots]$
from any starting state in S
• here $0 \le \gamma < 1$ is the discount factor for future
rewards
Note something new:
• target function is $\pi : S \rightarrow A$
• but we have no training examples of form <s,a>
• training examples are of form <<s,a>,r>
Value Function
To begin, consider deterministic worlds …
For each possible policy $\pi$ the agent might adopt, we
can define an evaluation function over states
$$V^{\pi}(s) \equiv r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \ldots = \sum_{i=0}^{\infty} \gamma^i r_{t+i}$$
where $r_t, r_{t+1}, \ldots$ are generated by following policy $\pi$
starting at state s
Restated, the task is to learn the optimal policy $\pi^*$
$$\pi^* \equiv \arg\max_{\pi} V^{\pi}(s), \ (\forall s)$$
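As a concrete illustration of the discounted sum above, here is a minimal Python sketch that computes the return for a finite reward sequence; the example rewards and the discount factor 0.9 are illustrative choices, not taken from the slides.

```python
# Minimal sketch: the discounted return that V^pi(s) sums up, evaluated on an
# illustrative finite reward sequence generated by following some policy pi.
def discounted_return(rewards, gamma=0.9):
    """Compute sum_i gamma^i * r_{t+i} over the given reward sequence."""
    return sum((gamma ** i) * r for i, r in enumerate(rewards))

# Example: three zero-reward steps followed by reaching a goal worth 100.
print(discounted_return([0, 0, 0, 100]))   # 0.9**3 * 100 = 72.9
```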
[Figure: a simple deterministic grid world with absorbing goal state G, showing the r(s,a) (immediate reward) values, the Q(s,a) values, the V*(s) values, and one optimal policy]
What to Learn
We might try to have the agent learn the evaluation
function $V^{\pi^*}$ (which we write as $V^*$)
We could then do a lookahead search to choose the best
action from any state s because
$$\pi^*(s) = \arg\max_{a} \left[ r(s,a) + \gamma V^*(\delta(s,a)) \right]$$
A problem:
• This works well if agent knows $\delta : S \times A \rightarrow S$,
and $r : S \times A \rightarrow \Re$
• But when it doesn’t, we can’t choose actions this
way
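For contrast, a minimal sketch of the lookahead choice above, assuming the agent is handed the reward function, the transition function, and V*; the parameter names (reward_fn, delta_fn, v_star) are illustrative, not from the slides.

```python
# Lookahead action selection: pi*(s) = argmax_a [ r(s,a) + gamma * V*(delta(s,a)) ].
# This only works because reward_fn (r) and delta_fn (delta) are assumed known.
def lookahead_policy(s, actions, reward_fn, delta_fn, v_star, gamma=0.9):
    return max(actions, key=lambda a: reward_fn(s, a) + gamma * v_star[delta_fn(s, a)])
```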
Q Function
Define new function very similar to V*
$$Q(s, a) \equiv r(s,a) + \gamma V^*(\delta(s,a))$$
If agent learns Q, it can choose optimal action even
without knowing $\delta$!
$$\pi^*(s) = \arg\max_{a} \left[ r(s,a) + \gamma V^*(\delta(s,a)) \right] = \arg\max_{a} Q(s, a)$$
Q is the evaluation function the agent will learn
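A matching sketch of action selection from a learned Q table alone; representing the table as a Python dict keyed by (state, action) pairs is an illustrative implementation choice, not something the slides prescribe.

```python
# Action selection from Q alone: pi*(s) = argmax_a Q(s, a).
# No reward function or transition function is needed here.
def greedy_action(q_table, s, actions):
    return max(actions, key=lambda a: q_table[(s, a)])
```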
Training Rule to Learn Q
Note Q and V* closely related:
$$V^*(s) = \max_{a'} Q(s, a')$$
Which allows us to write Q recursively as
$$Q(s_t, a_t) = r(s_t, a_t) + \gamma V^*(\delta(s_t, a_t)) = r(s_t, a_t) + \gamma \max_{a'} Q(s_{t+1}, a')$$
Let $\hat{Q}$ denote learner's current approximation to Q.
Consider training rule
$$\hat{Q}(s, a) \leftarrow r + \gamma \max_{a'} \hat{Q}(s', a')$$
where s' is the state resulting from applying action a
in state s
Q Learning for Deterministic Worlds
For each s,a initialize table entry $\hat{Q}(s,a) \leftarrow 0$
Observe current state s
Do forever:
• Select an action a and execute it
• Receive immediate reward r
• Observe the new state s'
• Update the table entry for Qˆ ( s,a ) as follows:
$$\hat{Q}(s, a) \leftarrow r + \gamma \max_{a'} \hat{Q}(s', a')$$
• $s \leftarrow s'$
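Below is a minimal Python sketch of this loop. The environment interface (env.reset, env.step, env.actions) and the episodic structure are assumptions made for illustration; the slide's version simply runs forever.

```python
import random
from collections import defaultdict

def q_learning(env, gamma=0.9, episodes=1000):
    """Tabular Q learning for a deterministic world, following the loop above."""
    q = defaultdict(float)                        # Q-hat(s, a), initialized to 0
    for _ in range(episodes):
        s = env.reset()                           # observe current state s
        done = False
        while not done:
            a = random.choice(env.actions(s))     # select an action a and execute it
            s_next, r, done = env.step(a)         # receive reward r, observe new state s'
            # Q-hat(s, a) <- r + gamma * max_a' Q-hat(s', a')
            q[(s, a)] = r + gamma * max(
                (q[(s_next, a2)] for a2 in env.actions(s_next)), default=0.0)
            s = s_next                            # s <- s'
    return q
```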
Updating
[Figure: one step in the grid world, moving right from initial state $s_1$ to next state $s_2$; the current $\hat{Q}$ values for the actions available in $s_2$ are 63, 81, and 100]
$$\hat{Q}(s_1, a_{right}) \leftarrow r + \gamma \max_{a'} \hat{Q}(s_2, a') = 0 + 0.9 \max\{63, 81, 100\} = 90$$
notice if rewards non-negative, then
$$(\forall s, a, n) \quad \hat{Q}_{n+1}(s, a) \ge \hat{Q}_n(s, a)$$
and
$$(\forall s, a, n) \quad 0 \le \hat{Q}_n(s, a) \le Q(s, a)$$
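A quick check of the single worked update above, with the values taken from the figure:

```python
# One Q-learning update for the transition s1 --a_right--> s2 shown above:
# immediate reward r = 0, gamma = 0.9, current Q-hat values at s2 are 63, 81, 100.
r, gamma = 0, 0.9
q_hat_s2 = [63, 81, 100]
new_q = r + gamma * max(q_hat_s2)
print(new_q)   # 90.0, the revised Q-hat(s1, a_right)
```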
Convergence
Q̂ converges to Q. Consider case of deterministic world
where each <s,a> is visited infinitely often.
Proof: define a full interval to be an interval during which
each <s,a> is visited. During each full interval the largest
error in the $\hat{Q}$ table is reduced by a factor of $\gamma$.
Let $\hat{Q}_n$ be the table after n updates, and $\Delta_n$ be the maximum error
in $\hat{Q}_n$; that is
$$\Delta_n = \max_{s,a} \left| \hat{Q}_n(s, a) - Q(s, a) \right|$$
Convergence (cont)
For any table entry $\hat{Q}_n(s, a)$ updated on iteration n+1, the
error in the revised estimate $\hat{Q}_{n+1}(s, a)$ is
$$\left| \hat{Q}_{n+1}(s, a) - Q(s, a) \right| = \left| \left( r + \gamma \max_{a'} \hat{Q}_n(s', a') \right) - \left( r + \gamma \max_{a'} Q(s', a') \right) \right|$$
$$= \gamma \left| \max_{a'} \hat{Q}_n(s', a') - \max_{a'} Q(s', a') \right|$$
$$\le \gamma \max_{a'} \left| \hat{Q}_n(s', a') - Q(s', a') \right|$$
$$\le \gamma \max_{s'', a'} \left| \hat{Q}_n(s'', a') - Q(s'', a') \right|$$
$$\left| \hat{Q}_{n+1}(s, a) - Q(s, a) \right| \le \gamma \Delta_n$$
Note we used general fact that
$$\left| \max_{a} f_1(a) - \max_{a} f_2(a) \right| \le \max_{a} \left| f_1(a) - f_2(a) \right|$$
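A small numerical illustration of this fact; the values of f1 and f2 are arbitrary example numbers.

```python
# |max_a f1(a) - max_a f2(a)| <= max_a |f1(a) - f2(a)| on arbitrary example values.
f1 = {'a1': 5.0, 'a2': 9.0, 'a3': 1.0}
f2 = {'a1': 7.0, 'a2': 6.0, 'a3': 2.0}
lhs = abs(max(f1.values()) - max(f2.values()))   # |9 - 7| = 2
rhs = max(abs(f1[a] - f2[a]) for a in f1)        # max(2, 3, 1) = 3
assert lhs <= rhs
```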
Nondeterministic Case
What if reward and next state are non-deterministic?
We redefine V,Q by taking expected values
$$V^{\pi}(s) \equiv E[r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \ldots] = E\left[ \sum_{i=0}^{\infty} \gamma^i r_{t+i} \right]$$
$$Q(s, a) \equiv E[r(s, a) + \gamma V^*(\delta(s, a))]$$
Nondeterministic Case
Q learning generalizes to nondeterministic worlds
Alter training rule to
$$\hat{Q}_n(s, a) \leftarrow (1 - \alpha_n) \hat{Q}_{n-1}(s, a) + \alpha_n \left[ r + \gamma \max_{a'} \hat{Q}_{n-1}(s', a') \right]$$
where
$$\alpha_n = \frac{1}{1 + visits_n(s, a)}$$
Can still prove convergence of Q̂ to Q [Watkins and
Dayan, 1992]
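A minimal sketch of this rule as a single table update with the decaying learning rate; the table and visit-count bookkeeping (dicts keyed by (state, action) pairs) are illustrative implementation choices.

```python
from collections import defaultdict

q = defaultdict(float)       # Q-hat table
visits = defaultdict(int)    # visits_n(s, a): times each pair has been updated

def nondeterministic_update(s, a, r, s_next, actions, gamma=0.9):
    """One update: move Q-hat(s, a) part way toward the sampled target."""
    visits[(s, a)] += 1
    alpha = 1.0 / (1.0 + visits[(s, a)])
    target = r + gamma * max((q[(s_next, a2)] for a2 in actions), default=0.0)
    # Q-hat_n(s,a) <- (1 - alpha_n) Q-hat_{n-1}(s,a) + alpha_n * target
    q[(s, a)] = (1 - alpha) * q[(s, a)] + alpha * target
```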
Temporal Difference Learning
Q learning: reduce discrepancy between successive
Q estimates
One step time difference:
$$Q^{(1)}(s_t, a_t) \equiv r_t + \gamma \max_{a} \hat{Q}(s_{t+1}, a)$$
Why not two steps?
$$Q^{(2)}(s_t, a_t) \equiv r_t + \gamma r_{t+1} + \gamma^2 \max_{a} \hat{Q}(s_{t+2}, a)$$
Or n?
$$Q^{(n)}(s_t, a_t) \equiv r_t + \gamma r_{t+1} + \ldots + \gamma^{n-1} r_{t+n-1} + \gamma^n \max_{a} \hat{Q}(s_{t+n}, a)$$
Blend all of these:
$$Q^{\lambda}(s_t, a_t) \equiv (1 - \lambda) \left[ Q^{(1)}(s_t, a_t) + \lambda Q^{(2)}(s_t, a_t) + \lambda^2 Q^{(3)}(s_t, a_t) + \ldots \right]$$
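A sketch of these estimates, assuming we already hold a short window of observed rewards and, for each horizon n, the bootstrap value max_a Q-hat(s_{t+n}, a); the specific numbers and the truncation of the infinite lambda-series are illustrative simplifications.

```python
def n_step_estimate(rewards, bootstrap, gamma=0.9):
    """Q^(n) with n = len(rewards): discounted observed rewards plus gamma^n * bootstrap."""
    n = len(rewards)
    return sum((gamma ** i) * r for i, r in enumerate(rewards)) + (gamma ** n) * bootstrap

def lambda_blend(estimates, lam=0.5):
    """Q^lambda = (1 - lambda) * [Q^(1) + lambda*Q^(2) + lambda^2*Q^(3) + ...], truncated."""
    return (1 - lam) * sum((lam ** i) * q_n for i, q_n in enumerate(estimates))

# Example: Q^(1), Q^(2), Q^(3) built from observed rewards [0], [0, 0], [0, 0, 100]
# and bootstrap values 81, 90, 0 respectively (illustrative numbers).
q_estimates = [n_step_estimate([0], 81),
               n_step_estimate([0, 0], 90),
               n_step_estimate([0, 0, 100], 0)]
print(lambda_blend(q_estimates))
```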
Temporal Difference Learning
$$Q^{\lambda}(s_t, a_t) \equiv (1 - \lambda) \left[ Q^{(1)}(s_t, a_t) + \lambda Q^{(2)}(s_t, a_t) + \lambda^2 Q^{(3)}(s_t, a_t) + \ldots \right]$$
Equivalent expression:
$$Q^{\lambda}(s_t, a_t) = r_t + \gamma \left[ (1 - \lambda) \max_{a} \hat{Q}(s_{t+1}, a) + \lambda Q^{\lambda}(s_{t+1}, a_{t+1}) \right]$$
TD($\lambda$) algorithm uses above training rule
• Sometimes converges faster than Q learning
• converges for learning V* for any $0 \le \lambda \le 1$
(Dayan, 1992)
• Tesauro’s TD-Gammon uses this algorithm
Subtleties and Ongoing Research
• Replace Q̂ table with neural network or other
generalizer
• Handle case where state only partially observable
• Design optimal exploration strategies
• Extend to continuous action, state
• Learn and use $\hat{\delta} : S \times A \rightarrow S$, $\hat{\delta}$ approximation to $\delta$
• Relationship to dynamic programming
RL Summary
• Reinforcement learning (RL)
– control learning
– delayed reward
– possible that the state is only partially observable
– possible that the relationship between states and actions is unknown
• Temporal Difference Learning
– learn discrepancies between successive estimates
– used in TD-Gammon
• V(s) - state value function
– needs known reward/state transition functions
RL Summary
• Q(s,a) - state/action value function
– related to V
– does not need reward/state transition functions
– training rule
– related to dynamic programming
– measure actual reward received for action and future
value using current Q function
– deterministic - replace existing estimate
– nondeterministic - move table estimate towards
measured estimate
– convergence - can be shown in both cases