
Today’s Topics
Reinforcement Learning (RL)
• Q learning
• Exploration vs Exploitation
• Generalizing Across State
• Used in Clinical Trials Recently
• Increased Emphasis in IT Companies
• Inverse RL
Reinforcement Learning
vs Supervised Learning
RL requires much less of the teacher
– Teacher must set up the ‘reward structure’
– Learner ‘works out the details’,
ie, writes a program to maximize
the rewards received
Sequential Decision Problems
Courtesy of Andy Barto
• Decisions are made in stages
• The outcome of each decision is not fully predictable,
but can be observed before the next decision is made
• The objective is to maximize a numerical measure of total
reward (or equivalently, to minimize a measure of total cost)
• Decisions cannot be viewed in isolation:
need to balance the desire for immediate reward
with the possibility of high future reward
RL Systems: Formalization
S_E = the set of states of the world
  eg, an N-dimensional vector of ‘sensors’ (and memory of past sensations)
A_E = the set of possible actions an agent can perform (‘effectors’)
W = the world
R = the immediate reward structure
W and R are the environment; they can be stochastic functions
(usually most states have R = 0, ie, rewards are sparse)
Embedded Learning Systems:
Formalization (cont.)
W: S_E x A_E → S_E
[here the arrow means ‘maps to’]
The world takes a state and an action and produces a new state
R: S_E → reals
Provides a reward (a number; often 0) as a function of the current state
Note: we could instead use R: S_E x A_E → reals
(ie, rewards depend on how we ENTER a state)
A Graphical View of RL
• Note that both the world and the agent can be probabilistic,
so W and R could produce probability distributions
• We’ll assume deterministic problems
[Diagram: The Agent sends an action to the real world, W; the world returns sensory info plus a reward R (a scalar), which serves as an indirect teacher]
Common Confusion
State need not be solely the current sensor readings
– Markov Assumption commonly used in RL:
the value of a state is independent of the path taken to reach that state
– But can store memory of the past in the current state
(can always create a Markovian task by remembering the entire past history)
Need for Memory: Simple Example
‘Out of sight, but not out of mind’
[Diagram: at Time = 1 the learning agent can see an opponent; at Time = 2 the opponent has moved behind a WALL and is out of sight. It seems reasonable to remember that the opponent was recently seen.]
State vs.
Current Sensor Readings
Remember:
state is what is in one’s head (past memories, etc),
not ONLY what one currently sees/hears/smells/etc
Policies
The agent needs to learn a policy

π_E : Ŝ_E → A_E

Given a world state, Ŝ_E, which action, A_E, should be chosen?
Ŝ_E is our learner’s APPROXIMATION to the true S_E
Remember: the agent’s task is to maximize the total reward received during its lifetime
True World States vs. the Learner’s
Representation of the World State
• From here forward, S will be our learner’s approximation of the true world state
• Exceptions:
  W: S x A → S
  R: S → reals
These are our notations for how the true world behaves when we act upon it.
You can think of W and R as taking the learner’s representation of the world state as an argument and internally converting it to the ‘true’ world state(s).
Policies (cont.)
To construct π_E, we will assign a utility, U (a number), to each state:

U^{π_E}(s) = Σ_{t=1}^{∞} γ^{t-1} R(s, π_E, t)

• γ is a positive constant ≤ 1
• R(s, π_E, t) is the reward received at time t,
assuming the agent follows policy π_E and starts in state s at t = 0
• Note: future rewards are discounted by γ^{t-1}
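As a quick illustration (not from the slides), here is the discounted sum computed for a short, made-up reward sequence, truncating the infinite sum at the end of the sequence; the function name and the numbers are hypothetical.

```python
# Hypothetical illustration: discounted utility of a fixed reward sequence.
# U = sum over t >= 1 of gamma^(t-1) * R_t, truncated at the sequence length.

def discounted_utility(rewards, gamma):
    """Return sum_{t=1..T} gamma^(t-1) * rewards[t-1]."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# Example: rewards of 0, 0, 5 with gamma = 0.9 give 0 + 0 + 0.81*5 ≈ 4.05
print(discounted_utility([0, 0, 5], 0.9))
```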
Why have a Decay on Rewards?
• Getting ‘money’ in the future is worth less than money right now
– Inflation
– More time to enjoy what it buys
– Risk of death before collecting
• Allows convergence proofs of the functions we’re learning
The Action-Value Function
We want to choose the ‘best’ action in the current state,
so pick the one that leads to the best next state
(and include any immediate reward)
Let

Q^{π_E}(s, a) = R(W(s, a)) + γ U^{π_E}(W(s, a))

R(W(s, a)) is the immediate reward received for going to state W(s, a)
[alternatively, we could use R(s, a)]
γ U^{π_E}(W(s, a)) is the future reward from further actions
(discounted due to the 1-step delay)
The Action-Value Function (cont.)
If we can accurately learn Q (the action-value function), choosing actions is easy:
choose action a, where

a = argmax_{a' ∈ actions} Q(s, a')

Note: x = argmax f(x) sets x to the value that leads to the max value of f(x)
Q vs. U Visually
[Diagram: a small graph in which U values (eg, U(1) through U(6)) are attached to the states and Q values (eg, Q(1, ii)) are attached to the state-action arcs]
U’s are ‘stored’ on states
Q’s are ‘stored’ on arcs
Q’s vs. U’s
[Diagram: from state S, U values sit on the possible next states and Q values sit on the outgoing arcs]
• Assume we’re in state S. Which action do we choose?
• U’s (model-based)
– Need a ‘next state’ function to generate all possible next states (eg, chess)
– Choose the next state with the highest U value
• Q’s (model-free, though can also do model-based Q learning)
– Need only know which actions are legal (eg, the web)
– Choose the arc with the highest Q value
Q-Learning
(Watkins PhD, 1989)
Let Q_t be our current estimate of the optimal Q.
Our current policy is

π_t(s) = a such that Q_t(s, a) = max_{b ∈ known actions} Q_t(s, b)

Our current utility-function estimate is

U_t(s) = Q_t(s, π_t(s))

- hence, the U table is embedded in the Q table and we don’t need to store both
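A small sketch of this point, using a Python dict as a hypothetical Q table: once Q_t is stored, both the greedy policy π_t and the utility estimate U_t come for free, so no separate U table is needed. The state and action names are made up.

```python
# Hypothetical sketch: with a tabular Q, the greedy policy and U fall out directly.
Q = {('s0', 'up'): 1.0, ('s0', 'down'): 2.5}

def policy(s, actions):
    """pi_t(s): the action with the highest current Q estimate."""
    return max(actions, key=lambda a: Q[(s, a)])

def utility(s, actions):
    """U_t(s) = Q_t(s, pi_t(s)); no separate U table is needed."""
    return Q[(s, policy(s, actions))]

print(policy('s0', ['up', 'down']), utility('s0', ['up', 'down']))  # down 2.5
```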
Q-Learning (cont.)
Assume we are in state S_t
‘Run the program’ * for a while (n steps)
Determine the actual reward and compare it to the predicted reward
Adjust the prediction to reduce the error
* ie, follow the current policy
Updating Qt
Let r_t^(N) be the N-step estimate of future rewards:

r_t^(N) = ( Σ_{k=1}^{N} γ^{k-1} R_{t+k} ) + γ^N U_t(S_{t+N})

The first term is the actual (discounted) reward received during the N time steps;
the second term is the estimate of future reward if we continued to t = ∞.
Changing the Q Function
(ie, learn a better approximation)

Q_{t+N}(S_t, a_t) = Q_t(S_t, a_t) + α [ r_t^(N) − Q_t(S_t, a_t) ]

Here Q_{t+N}(S_t, a_t) is the new estimate (at time t + N),
Q_t(S_t, a_t) is the old estimate,
α is the learning rate (for deterministic worlds, set α = 1),
and the bracketed term is the error.
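Below is a minimal sketch, assuming tabular values, of the N-step target r_t^(N) and the corresponding update; the function names and the example numbers are made up for illustration.

```python
# Hypothetical sketch of the N-step update:
#   r_t^(N) = sum_{k=1..N} gamma^(k-1) * R_{t+k}  +  gamma^N * U_t(S_{t+N})
#   Q_{t+N}(S_t, a_t) = Q_t(S_t, a_t) + alpha * (r_t^(N) - Q_t(S_t, a_t))

def n_step_target(rewards, gamma, tail_utility):
    """rewards = [R_{t+1}, ..., R_{t+N}]; tail_utility = U_t(S_{t+N})."""
    n = len(rewards)
    discounted = sum((gamma ** k) * r for k, r in enumerate(rewards))
    return discounted + (gamma ** n) * tail_utility

def n_step_update(q_old, rewards, gamma, tail_utility, alpha=1.0):
    error = n_step_target(rewards, gamma, tail_utility) - q_old
    return q_old + alpha * error

# Three steps with rewards 1, 0, 3, then bootstrap from U_t(S_{t+3}) = 2:
print(n_step_update(0.0, [1, 0, 3], 0.9, 2.0))   # ≈ 4.888
```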
Pictorially (here rewards are on arcs, rather than states)
[Diagram: the actual moves made (in red) start at S1, collect rewards r1, r2, r3, and reach S_N, which has several potential next states]

Q_est(s1, a) = r1 + γ r2 + γ² r3 + <estimate of remainder of infinite sum>
             = r1 + γ r2 + γ² r3 + γ³ U(S_N)
             = r1 + γ r2 + γ² r3 + γ³ max_{b ∈ actions} Q(S_N, b)
How Many Actions Should We
Take Before Updating Q ?
Why not do so after each action?
– One–step Q learning
– Most common approach
Exploration vs.
Exploitation
In order to learn about better alternatives,
we can’t always follow the
current policy (‘exploitation’)
Sometimes, need to try
random moves (‘exploration’)
Exploration vs. Exploitation (cont)
Approaches
1) p percent of the time, make a random move; could let
   p = 1 / (# moves made)
2) Prob(picking action A in state S)
   = e^{Q(S, A)/const} / Σ_{i ∈ actions} e^{Q(S, i)/const}
   (exponentiating gets rid of negative values)
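A minimal sketch of both exploration schemes, assuming Q values are kept in a Python dict keyed by action; the divisor 'const' plays the role of the constant in the formula above, and the action names are made up.

```python
import math
import random

# Hypothetical sketch of the two exploration schemes above.
def epsilon_greedy(q_values, p):
    """With probability p pick a random action, else the greedy one.
    q_values: dict mapping action -> current Q estimate."""
    if random.random() < p:
        return random.choice(list(q_values))
    return max(q_values, key=q_values.get)

def boltzmann(q_values, const=1.0):
    """Pick an action with probability proportional to exp(Q / const);
    exponentiation keeps all the (possibly negative) Q's positive."""
    weights = {a: math.exp(q / const) for a, q in q_values.items()}
    total = sum(weights.values())
    r, cumulative = random.random() * total, 0.0
    for a, w in weights.items():
        cumulative += w
        if r <= cumulative:
            return a
    return a  # numerical edge case

print(epsilon_greedy({'up': 1.0, 'down': -2.0}, p=0.1))
print(boltzmann({'up': 1.0, 'down': -2.0}, const=0.5))
```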
One-Step Q-Learning Algo
0. S ← initial state
1. If random # ≤ P, then a = random choice   // occasionally ‘explore’
   Else a = π_t(S)                           // else ‘exploit’
2. S_new ← W(S, a)                           // act on the world and get the reward
   R_immed ← R(S_new)
3. Error ← R_immed + γ U(S_new) − Q(S, a)    // use Q to compute U
4. Q(S, a) ← Q(S, a) + α × Error             // should also decay α
5. S ← S_new
6. Go to 1
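Here is a self-contained sketch of the one-step algorithm on a tiny, made-up deterministic chain world (not the Agent World used later in these slides); the restart-at-the-goal step is an addition so the loop runs indefinitely.

```python
import random

# Minimal sketch of one-step Q learning on a hypothetical 4-state chain.
GAMMA, ALPHA, P_EXPLORE = 0.9, 0.5, 0.1
STATES = [0, 1, 2, 3]          # reaching state 3 pays off; we then restart at 0
ACTIONS = ['left', 'right']

def world(s, a):               # W: S x A -> S
    return min(s + 1, 3) if a == 'right' else max(s - 1, 0)

def reward(s):                 # R: S -> reals (sparse: only state 3 pays off)
    return 10.0 if s == 3 else 0.0

Q = {(s, a): 0.0 for s in STATES for a in ACTIONS}

s = 0                                               # step 0
for _ in range(5000):
    if random.random() < P_EXPLORE:                 # step 1: explore ...
        a = random.choice(ACTIONS)
    else:                                           # ... or exploit
        a = max(ACTIONS, key=lambda b: Q[(s, b)])
    s_new = world(s, a)                             # step 2: act, observe
    r_immed = reward(s_new)
    u_new = max(Q[(s_new, b)] for b in ACTIONS)     # step 3: U from Q
    error = r_immed + GAMMA * u_new - Q[(s, a)]
    Q[(s, a)] += ALPHA * error                      # step 4: update
    s = 0 if s_new == 3 else s_new                  # step 5 (restart at the goal)

print(Q[(0, 'right')], Q[(0, 'left')])   # 'right' should end up larger (≈ 8.1)
```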
Visualizing Q -Learning
(1-step ‘lookahead’)
[Diagram: taking action a from state I (and getting reward R) leads to state J, which has actions a, b, ..., z available]
The estimate Q(I, a) should equal R + γ max_x Q(J, x)
- train the ML system to learn a consistent set of Q values
Bellman Optimality Equation
(from 1957, though for the U function back then)
IF
  ∀ s, a:  Q(s, a) = R(S_N) + γ max_{a' ∈ actions} Q(S_N, a'),
  where S_N = W(s, a), ie, the next state,
THEN
  the resulting policy, π(s) = argmax_a Q(s, a), is optimal
  - ie, it leads to the highest discounted total reward
  (also, any optimal policy satisfies the Bellman equation)
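For contrast with the model-free learner above, here is a hedged sketch of Q-value iteration: when W and R are known, repeatedly applying the Bellman equation as an assignment converges to the optimal Q table. The chain world is the same made-up example as in the previous sketch, with state 3 treated as terminal.

```python
# Hypothetical sketch (not from the slides): repeatedly enforcing
#   Q(s, a) = R(W(s, a)) + gamma * max_a' Q(W(s, a), a')
# on a known model yields the optimal Q table (Q-value iteration).
GAMMA = 0.9
STATES, ACTIONS = [0, 1, 2, 3], ['left', 'right']

def world(s, a):                     # W: S x A -> S
    return min(s + 1, 3) if a == 'right' else max(s - 1, 0)

def reward(s):                       # R: S -> reals
    return 10.0 if s == 3 else 0.0

Q = {(s, a): 0.0 for s in STATES for a in ACTIONS}
for _ in range(100):                 # repeat the Bellman backup until it settles
    new_Q = {}
    for s in STATES:
        for a in ACTIONS:
            s_next = world(s, a)
            future = 0.0 if s_next == 3 else max(Q[(s_next, b)] for b in ACTIONS)
            new_Q[(s, a)] = reward(s_next) + GAMMA * future
    Q = new_Q

print(Q[(0, 'right')])               # ≈ 8.1, ie, 10 * 0.9^2
```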
A Simple Example
(of Q-learning, with updates after each step, ie N = 1)
Let γ = 2/3
[Diagram: five states S0 (R=0), S1 (R=1), S2 (R=-1), S3 (R=0), and S4 (R=3); an upper path out of S0 goes through S1, and a lower path goes through S2 and on to S4. Every arc’s Q value starts at 0.]
Update rule: Q_new ← R + γ max Q_next state
(deterministic world, so α = 1)
A Simple Example (Step 1)
The agent moves S0 → S2:
Q(S0 → S2) ← R(S2) + γ max_b Q(S2, b) = -1 + (2/3)(0) = -1
All other Q values remain 0
Update rule: Q_new ← R + γ max Q_next state
A Simple Example (Step 2)
The agent moves S2 → S4:
Q(S2 → S4) ← R(S4) + γ max_b Q(S4, b) = 3 + (2/3)(0) = 3
(Q(S0 → S2) is still -1; all other Q values remain 0)
Update rule: Q_new ← R + γ max Q_next state
A Simple Example (Step i)
Assume we get to the end of the game and are ‘magically’ restarted in S0.
Later on, the agent again moves S0 → S2.
(At this point Q(S0 → S2) = -1 and Q(S2 → S4) = 3; all other Q values are 0.)
Update rule: Q_new ← R + γ max Q_next state
A Simple Example (Step i+1)
Taking S0 → S2 again now gives
Q(S0 → S2) ← R(S2) + γ max_b Q(S2, b) = -1 + (2/3)(3) = 1
(Q(S2 → S4) is still 3; all other Q values remain 0)
Update rule: Q_new ← R + γ max Q_next state
A Simple Example (Step ∞)
- ie, the Bellman optimum
What would the final Q values be if we explored + exploited for a long time, always returning to S0 after 5 actions?
[Diagram: the same five states, with every arc’s Q value replaced by a ‘?’]
Update rule: Q_new ← R + γ max Q_next state
A Simple Example (Step ∞)
Let γ = 2/3
Final Q values: Q(S0 → S1) = 1, Q(S0 → S2) = -1 + (2/3)(3) = 1, Q(S2 → S4) = 3,
and the remaining arcs keep Q = 0
What would happen if γ > 2/3?  The lower path is better
What would happen if γ < 2/3?  The upper path is better
Shows the need for EXPLORATION, since the first-ever action out of S0
may or may not be the optimal one
Update rule: Q_new ← R + γ max Q_next state
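The lower-path updates shown in Steps 1, 2, and i+1 can be checked mechanically; the snippet below encodes only the two arcs whose values the slides display and applies the update rule with γ = 2/3.

```python
# Checking the lower-path updates from the example, using
# Q_new = R(next state) + gamma * max_b Q(next state, b) with gamma = 2/3.
GAMMA = 2.0 / 3.0
R = {'S1': 1, 'S2': -1, 'S3': 0, 'S4': 3}
Q = {('S0', 'S2'): 0.0, ('S2', 'S4'): 0.0}        # only the arcs updated here

def update(arc, max_q_next):
    Q[arc] = R[arc[1]] + GAMMA * max_q_next

update(('S0', 'S2'), max_q_next=0.0)    # step 1:   -1 + (2/3)*0 = -1
update(('S2', 'S4'), max_q_next=0.0)    # step 2:    3 + (2/3)*0 =  3
update(('S0', 'S2'), max_q_next=Q[('S2', 'S4')])
                                        # step i+1: -1 + (2/3)*3 =  1
print(Q)   # Q(S0->S2) ends near 1 (up to float rounding), Q(S2->S4) at 3
```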
An “On Your Own” RL HW
Consider the deterministic reinforcement environment drawn below. Let γ = 0.5. Immediate rewards are indicated inside the nodes. Once the agent reaches the ‘end’ state, the current episode ends and the agent is magically transported to the ‘start’ state.
[Graph: nodes Start (r=0), A (r=2), B (r=5), C (r=3), and End (r=5); every arc is labeled with its initial Q value of 4]
(a) A one-step, Q-table learner follows the path Start → B → C → End. On the graph below, show the Q values that have changed, and show your work. Assume that for all legal actions (ie, for all the arcs on the graph), the initial values in the Q table are 4, as shown above (feel free to copy the above 4’s below, but somehow highlight the changed values).
[Blank copy of the graph for your answer]
An “On Your Own” RL HW
(b) Starting with the Q table you produced in Part (a), again follow the path Start → B → C → End and show the Q values below that have changed from Part (a). Show your work.
[Blank copy of the graph for your answer]
(c) What would the final Q values be in the limit of trying all possible arcs ‘infinitely’ often? Ie, what is the Bellman-optimal Q table? Explain your answer.
[Blank copy of the graph for your answer]
(d) What is the optimal path between Start and End? Explain.
Q-Learning:
The Need to ‘Generalize Across State’
Remember, conceptually we are filling in a huge table
[Table: one column per state (S0, S1, S2, ..., Sn) and one row per action (a, b, c, ..., z); each cell holds one value, eg the cell in column S2 and row c holds Q(S2, c)]
Tables are a very verbose representation of a function
Representing Q Functions
More Compactly
We can use some other function representation (eg, a neural net) to compactly encode this big table
[Diagram: a neural network whose input units each encode a property of the state, S (eg, a sensor value), and whose output units give Q(S, a), Q(S, b), ..., Q(S, z); the second argument of each output is a constant]
Or we could have one net for each possible action
Q Tables vs Q Nets
[Diagram: a Q net whose outputs are Q(S, 0), Q(S, 1), ..., Q(S, 9)]
Given: 100 Boolean-valued features and 10 possible actions
Size of the Q table: 10 × 2^100 cells
Size of the Q net (100 HUs): 100 × 100 + 100 × 10 = 11,000 weights
(weights between the inputs and the HUs, plus weights between the HUs and the outputs)
Similar idea as Full Joint Prob Tables and Bayes Nets (called ‘factored’ representations)
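The size comparison is simple arithmetic; the short check below reproduces the two numbers (the variable names are mine).

```python
# Reproducing the size comparison above.
num_features, num_actions, num_hidden = 100, 10, 100
q_table_cells = num_actions * 2 ** num_features          # 10 * 2^100 entries
q_net_weights = (num_features * num_hidden +              # inputs -> hidden units
                 num_hidden * num_actions)                 # hidden units -> outputs
print(q_table_cells)   # 12676506002282294014967032053760
print(q_net_weights)   # 11000
```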
Why Use a Compact
Q-Function?
1. The full Q table may not fit in memory for realistic problems
2. Can generalize across states, thereby speeding up convergence
   (ie, one example ‘fills’ many cells in the Q table)
Notes
1. When generalizing across states, cannot use α = 1
2. Convergence proofs only apply to Q tables
Three Forward Props
and a BackProp
[Diagram: a Q net is run three times; each run takes an encoded state as input and produces Q(state, A), ..., Q(state, Z) at its outputs]
1. Forward prop on S0 to get Q(S0, A), ..., Q(S0, Z) and choose an action in state S0
   - execute the chosen action in the world, then ‘read’ the new sensors and the reward
2. Forward prop on S1 to estimate U(S1) = max Q(S1, X), where X ∈ actions
3. Forward prop on S0 again and calculate the “teacher’s” output:
   compare Q(S0, A) vs the new estimate, assuming Q is ‘correct’ for the other actions
Then backprop to reduce the error at Q(S0, A)
Aside: could save some forward props by caching information
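A hypothetical numpy sketch of this recipe for a one-hidden-layer Q net (this is not the actual Agent World code): forward-prop S0 to pick an action, forward-prop S1 to get U(S1), form the teacher's output R + γ·U(S1) for the chosen action only, and backprop that single error. Caching the first forward prop's activations plays the role of the 'save some forward props' aside; all sizes and constants are made up.

```python
import numpy as np

# Hypothetical sketch of the "three forward props and a backprop" recipe.
rng = np.random.default_rng(0)
n_in, n_hidden, n_actions = 8, 16, 4
W1 = rng.normal(scale=0.1, size=(n_in, n_hidden))
W2 = rng.normal(scale=0.1, size=(n_hidden, n_actions))
GAMMA, LEARNING_RATE = 0.9, 0.01

def forward(state):
    hidden = np.tanh(state @ W1)
    return hidden, hidden @ W2           # Q(state, a) for every action a

def q_net_step(s0, s1, reward, action):
    global W1, W2
    hidden0, q0 = forward(s0)            # forward prop 1: Q(S0, .) (pick action)
    _, q1 = forward(s1)                  # forward prop 2: U(S1) = max_x Q(S1, x)
    target = reward + GAMMA * q1.max()   # the "teacher's" output for the chosen action
    # (the slides' third forward prop recomputes Q(S0, .); here q0 is cached instead)
    delta_out = np.zeros(n_actions)
    delta_out[action] = target - q0[action]   # other outputs assumed 'correct'
    grad_hidden = (W2 @ delta_out) * (1 - hidden0 ** 2)  # backprop through tanh
    W2 += LEARNING_RATE * np.outer(hidden0, delta_out)
    W1 += LEARNING_RATE * np.outer(s0, grad_hidden)

q_net_step(rng.normal(size=n_in), rng.normal(size=n_in), reward=1.0, action=2)
```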
The Agent World
(Rough sketch, implemented in Java [by me], linked to the cs540 home page)
[Diagram: a grid world containing The RL Agent, pushable ice cubes, opponents, and food (marked with *’s)]
Some (Ancient) Agent World Results
[Plot: mean (discounted) score on the test-set suite vs training-set steps (in thousands, 0 to 2000), comparing Q-nets with 5, 15, 25, and 50 hidden units, a Q-table, perceptrons trained by supervised learning on 600 examples, and a hand-coded policy. The runs took about two weeks, 10-20 years ago, on a CPU roughly 1000x slower than today’s.]
Estimating Values ‘In Place’
(see Sections 2.6 and 2.7 of the Sutton & Barto RL textbook)
Let r_i be our i-th estimate of some Q
Note: r_i is not the immediate reward, R_i; rather,
  r_i = R_i + γ U(next state_i)
Assume we have k + 1 such measurements
Estimating Values (cont.)
The estimate based on k + 1 trials:

Q_{k+1} = (1 / (k+1)) Σ_{i=1}^{k+1} r_i                  (average of the k + 1 measurements)

        = (1 / (k+1)) [ r_{k+1} + Σ_{i=1}^{k} r_i ]       (pull out the last term)

        = (1 / (k+1)) [ r_{k+1} + k Q_k ]                 (stick in the definition of Q_k)

(cont.)
‘In Place’ Estimates (cont.)

Q_{k+1} = (1 / (k+1)) [ r_{k+1} + (k+1) Q_k − Q_k ]       (add and subtract Q_k)

        = Q_k + (1 / (k+1)) [ r_{k+1} − Q_k ]

where r_{k+1} is the latest estimate and Q_k is the current ‘running’ average.
Notice that α = 1 / (k+1) needs to decay over time.

Repeating:
Q_{k+1} = Q_k + α [ r_{k+1} − Q_k ]
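A small sketch of the resulting incremental update, with α = 1/(k+1); the numbers are made up, and the loop just confirms that the running average equals the ordinary mean.

```python
# Sketch of the 'in place' running average: storing only Q_k and k, the
# incremental form reproduces the mean of all estimates seen so far.
def update_running_average(q_k, r_next, k):
    """Q_{k+1} = Q_k + (1/(k+1)) * (r_{k+1} - Q_k)."""
    alpha = 1.0 / (k + 1)          # note: the step size decays over time
    return q_k + alpha * (r_next - q_k)

estimates = [4.0, 6.0, 5.0, 9.0]
q, k = 0.0, 0
for r in estimates:
    q = update_running_average(q, r, k)
    k += 1
print(q, sum(estimates) / len(estimates))   # both 6.0
```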
Note
• The ‘running average’ analysis is for Q tables
• When ‘generalizing across state,’ the Q values are coupled together
• So when generalizing across state, we can’t simply divide by the number of times an arc is traversed
• Also, even if the world is DETERMINISTIC, we still need to do a running average
Q-Learning Convergence
• Only applies to Q tables and deterministic, Markovian worlds
• Theorem: if every state-action pair is visited infinitely often, 0 ≤ γ < 1, and |rewards| ≤ C (some constant), then

  ∀ s, a:  lim_{t→∞} Q̂_t(s, a) = Q_actual(s, a)

  where Q̂ is the approximate Q table and Q_actual is the true Q table
An RL Video
https://m.youtube.com/watch?v=iqXKQf2BOSE
Inverse RL
• Inverse RL: Learn the reward function of an
agent by observing its behavior
• Some early papers
A. Ng and S. Russell, “Algorithms for inverse reinforcement learning,”
in ICML, 2000
P. Abbeel and A. Ng, “Apprenticeship learning via inverse
reinforcement learning,” in ICML, 2004
Recap: Supervised Learners
Helping the RL Learner
• Note that Q learning automatically
creates I/O pairs for a supervised ML
algo when ‘generalizing across state’
• Can also learn a model of the world (W)
and the reward function (R)
– Simulations via learned models reduce
need for ‘acting in the physical world’
Challenges in RL
• Q tables too big, so use function approximation
– can ‘generalize across state’ (eg, via ANNs)
– convergence proofs no longer apply, though
• Hidden state (‘perceptual aliasing’)
– two different states might look the same
(eg, due to ‘local sensors’)
– can use theory of ‘Partially Observable
Markov Decision Problems’ (POMDP’s)
• Multi-agent learning (world no longer stationary)
Could use GAs for RL Task
• Another approach is to use GAs
to evolve good policies
– Create N ‘agents’
– Measure each one’s rewards over some time period
– Discard worst, cross over best, do some mutation
– Repeat ‘forever’ (a model of biology)
• Both ‘predator’ and ‘prey’ evolve/learn,
ie co-evolution
Summary of Non-GA
Reinforcement Learning
Positives
– Requires much less ‘teacher feedback’
– Appealing approach to learning to predict and
control (eg, robotics, softbots)
Demo of Google’s Q Learning
– Solid mathematical foundations
• Dynamic programming
• Markov decision processes
• Convergence proofs (in the limit)
– Core of solution to general AI problem ?
Summary of Non-GA
Reinforcement Learning (cont.)
Negatives
– Need to deal with huge state-action spaces
(so convergence very slow)
– Hard to design R function ?
– Learns specific environment rather than general
concepts – depends on state representation ?
– Dealing with multiple learning agents?
– Hard to learn at multiple ‘grain sizes’ (hierarchical RL)