
ECE-517: Reinforcement Learning
in Artificial Intelligence
Lecture 12: Generalization and Function Approximation
October 23, 2012
Dr. Itamar Arel
College of Engineering
Department of Electrical Engineering and Computer Science
The University of Tennessee
Fall 2012
Outline
Introduction
Value Prediction with function approximation
Gradient Descent framework
On-Line Gradient-Descent TD(λ)
Linear methods
Control with Function Approximation
ECE 517: Reinforcement Learning in AI
Introduction
We have so far assumed a tabular view of value or state-value functions
This inherently limits our problem space to small state/action sets:
Space requirements – storage of the values
Computational complexity – sweeping/updating the values
Communication constraints – getting the data where it needs to go
Reality is very different – high-dimensional state representations are common
We will next look at generalizations – an attempt by the
agent to learn about a large state set while visiting/
experiencing only a small subset of it
People do it – how can machines achieve the same goal?
General Approach
Luckily, many approximation techniques have been developed,
e.g. multivariate function approximation schemes
We will utilize such techniques in an RL context
Value Prediction with FA
As usual, let's start with prediction of V^π
Instead of using a table for V_t, the latter will be represented in a parameterized functional form:

V_t(s) ≈ f(s, θ_t),  where  θ_t = (θ_t(1), θ_t(2), …, θ_t(n))^T  (T denotes transpose)

We'll assume that V_t is a sufficiently smooth differentiable function of θ_t, for all s.
For example, a neural network can be trained to predict V, where θ_t are the connection weights
We will require that the number of parameters n is much smaller than the size of the state set
When a single state is backed up, the change generalizes to
affect the values of many other states
Adapt Supervised Learning Algorithms
Training Info = desired (target) outputs

Inputs → Supervised Learning System → Outputs

Training example = {input, target output}
Error = (target output – actual output)
Performance Measures
Let us assume that training examples all take the form

⟨description of s_t, V^π(s_t)⟩

A common performance metric is the mean-squared error (MSE) over a distribution P:

MSE(θ_t) = Σ_{s∈S} P(s) [V^π(s) − V_t(s)]²
Q: Why use P? Is MSE the best metric?
Let us assume that P is always the distribution of states
at which backups are done
On-policy distribution: the distribution created while
following the policy being evaluated

Stronger results are available for this distribution.
Gradient Descent
Let f be any function of the parameter space.
Its gradient at any point θ_t in this space is:

∇_θ f(θ_t) = ( ∂f(θ_t)/∂θ(1), ∂f(θ_t)/∂θ(2), …, ∂f(θ_t)/∂θ(n) )^T

We iteratively move down the gradient:

θ_{t+1} = θ_t − α ∇_θ f(θ_t)
Gradient Descent in RL
Let's now consider the case where the target output, v_t, for sample t is not the true value (which is unavailable)
In such cases we perform an approximate update: the exact update

θ_{t+1} = θ_t − (1/2) α ∇_θ [V^π(s_t) − V_t(s_t)]²

becomes, with v_t in place of the unavailable V^π(s_t),

θ_{t+1} = θ_t − (1/2) α ∇_θ [v_t − V_t(s_t)]² = θ_t + α [v_t − V_t(s_t)] ∇_θ V_t(s_t)

where v_t is an unbiased estimate of the target output.
Examples of v_t are:
Monte Carlo methods: v_t = R_t
TD(λ): v_t = R^λ_t
The general gradient-descent method is guaranteed to converge to a local minimum
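As a concrete illustration, here is a minimal sketch of this update with Monte Carlo targets (v_t = R_t) and a linear approximator; the function and feature names are illustrative, not from the lecture:

```python
import numpy as np

def gradient_mc_update(theta, features, episode, alpha=0.01, gamma=1.0):
    """One gradient-descent pass over a finished episode (Monte Carlo targets).

    theta    -- parameter vector of a linear approximator (illustrative setup)
    features -- function mapping a state to its feature vector phi(s)
    episode  -- list of (state, reward received after leaving that state) pairs
    """
    G = 0.0
    for state, reward in reversed(episode):
        G = reward + gamma * G                      # Monte Carlo return v_t = R_t
        phi = features(state)
        v_hat = theta @ phi                         # V_t(s) = theta^T phi(s)
        theta = theta + alpha * (G - v_hat) * phi   # grad_theta V_t(s) = phi(s)
    return theta
```

With one-hot features this reduces exactly to the tabular every-visit MC update.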
On-Line Gradient-Descent TD(λ)
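A minimal sketch of one on-line gradient-descent TD(λ) step, assuming a linear approximator with accumulating eligibility traces (function and parameter names are illustrative):

```python
import numpy as np

def td_lambda_step(theta, e, phi_s, phi_s_next, reward, done,
                   alpha=0.1, gamma=0.99, lam=0.9):
    """One on-line TD(lambda) update for linear V(s) = theta^T phi(s)."""
    v_s = theta @ phi_s
    v_next = 0.0 if done else theta @ phi_s_next
    delta = reward + gamma * v_next - v_s    # TD error delta_t
    e = gamma * lam * e + phi_s              # accumulating trace: decays, then adds grad V
    theta = theta + alpha * delta * e        # backup generalizes to all traced features
    return theta, e
```

The trace vector e is reset to zeros at the start of each episode.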
Residual Gradient Descent
The following statement is not completely accurate:

θ_{t+1} = θ_t − (1/2) α ∇_θ [v_t − V_t(s_t)]² = θ_t + α [v_t − V_t(s_t)] ∇_θ V_t(s_t)

since it suggests that ∇_θ v_t = 0, which is not true; e.g., for

v_t = r_{t+1} + γ V_{t+1}(s_{t+1})  we have  ∇_θ v_t = γ ∇_θ V_{t+1}(s_{t+1})

so we should be writing (residual GD):

θ_{t+1} = θ_t − α [v_t − V_t(s_t)] [γ ∇_θ V_{t+1}(s_{t+1}) − ∇_θ V_t(s_t)]
Comment: the whole scheme is no longer supervised
learning based!
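The residual-gradient update can be sketched as follows for a linear approximator (an illustrative sketch, not the lecture's code):

```python
import numpy as np

def residual_gd_step(theta, phi_s, phi_s_next, reward, done,
                     alpha=0.1, gamma=0.99):
    """Residual gradient descent on the squared TD error.

    Unlike plain gradient TD, the gradient of the target
    v_t = r + gamma * V(s') is NOT treated as zero: both V(s)
    and V(s') contribute to the update.
    """
    v_s = theta @ phi_s
    v_next = 0.0 if done else theta @ phi_s_next
    delta = reward + gamma * v_next - v_s
    # grad of (v_t - V(s))^2 is 2 * delta * (gamma*phi(s') - phi(s))
    grad_target = np.zeros_like(phi_s) if done else gamma * phi_s_next
    theta = theta - alpha * delta * (grad_target - phi_s)
    return theta
```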
Linear Methods
One of the most important special cases of gradient-descent FA
V_t becomes a linear function of the parameter vector
For every state, there is a (real-valued) column vector of features:

φ_s = (φ_s(1), φ_s(2), …, φ_s(n))^T

The features can be constructed from the states in many ways
The linear approximate state-value function is given by

V_t(s) = θ_t^T φ_s = Σ_{i=1}^{n} θ_t(i) φ_s(i)

Q: ∇_θ V_t(s) = ?
Nice Properties of Linear FA Methods
The gradient is very simple: ∇_θ V_t(s) = φ_s

For MSE, the error surface is simple: a quadratic surface with a single (global) minimum
Linear gradient-descent TD(λ) converges, given:
an appropriately decreasing step size α
on-line sampling (states sampled from the on-policy distribution)
It converges to a parameter vector θ_∞ with the property

MSE(θ_∞) ≤ [(1 − γλ) / (1 − γ)] MSE(θ*)

where θ* is the best parameter vector (Tsitsiklis & Van Roy, 1997)
Limitations of Pure Linear Methods
Many applications require a mixture (e.g. product) of the
different feature components
Linear form prohibits direct representation of interactions between features
Intuition: feature i is good only in the absence of feature j
Example: the pole-balancing task
High angular velocity can be good or bad:
If the angle is high → imminent danger of falling (bad state)
If the angle is low → the pole is righting itself (good state)
In such cases we need to introduce features that express
a mixture of other features
Coarse Coding – Feature Composition/Extraction
Shaping Generalization in Coarse Coding
• If we train at one point (state), X, the parameters of all circles
intersecting X will be affected
• Consequence: the value function of all points within the
union of the circles will be affected
• Greater effect for points that have more circles “in
common” with X
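One way to sketch such coarse-coded binary features, assuming circular receptive fields over a 2-D state space (the centers and radius below are illustrative choices, not from the slides):

```python
import numpy as np

def coarse_code(state, centers, radius):
    """Binary coarse-coding features for a 2-D state.

    Feature i is 1 iff `state` falls inside circle i.  Training at a
    point X changes the weights of every circle containing X, so the
    value estimate generalizes to all states sharing those circles.
    """
    state = np.asarray(state, dtype=float)
    return np.array([1.0 if np.linalg.norm(state - np.asarray(c)) <= radius else 0.0
                     for c in centers])
```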
Learning and Coarse Coding
All three cases have the same number of features (50); the learning
rate is 0.2/m, where m is the number of features present in each example
Tile Coding
Binary feature for each tile
The number of features present at any one time is constant
Binary features make the weighted sum easy to compute
Easy to compute the indices of the features present
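A hedged sketch of computing the active-tile indices, assuming uniformly offset square tilings over a bounded 2-D state space (the particular offset scheme is an assumption, not taken from the slides):

```python
def tile_indices(x, y, n_tilings=5, tiles_per_dim=9, lo=0.0, hi=1.0):
    """Active tile index in each of several offset grid tilings.

    Each tiling partitions [lo, hi) x [lo, hi) into square tiles;
    tilings are shifted relative to one another, so exactly one binary
    feature per tiling is active -- a constant number of features.
    """
    width = (hi - lo) / tiles_per_dim
    grid = tiles_per_dim + 1              # one extra tile per dim covers the offsets
    active = []
    for t in range(n_tilings):
        offset = t * width / n_tilings    # uniform shift for tiling t
        col = int((x - lo + offset) / width)
        row = int((y - lo + offset) / width)
        active.append(t * grid * grid + row * grid + col)
    return active
```

The returned indices can address a single weight vector, so the value estimate is just a sum of one weight per tiling.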
Tile Coding Cont.
Irregular tilings
Hashing
Control with Function Approximation
Learning state-action values
Training examples of the form: ⟨description of (s_t, a_t), v_t⟩
The general gradient-descent rule:

θ_{t+1} = θ_t + α [v_t − Q_t(s_t, a_t)] ∇_θ Q_t(s_t, a_t)

Gradient-descent Sarsa(λ) (backward view):

θ_{t+1} = θ_t + α δ_t e_t

where
δ_t = r_{t+1} + γ Q_t(s_{t+1}, a_{t+1}) − Q_t(s_t, a_t)
e_t = γλ e_{t−1} + ∇_θ Q_t(s_t, a_t)
GPI with Linear Gradient-Descent Sarsa(λ)
GPI with Linear Gradient-Descent Watkins’ Q(λ)
Mountain-Car Task Example
Challenge: driving an underpowered car up a steep
mountain road
Gravity is stronger than its engine
Solution approach: build up enough inertia on the back slope to carry the car up the opposite (goal) slope
Example of a task where things can get worse in a sense (farther from the goal) before they get better
Hard to solve using classic control schemes
Reward is −1 for all steps until the episode terminates
Actions: full throttle forward (+1), full throttle reverse (−1), and zero throttle (0)
Two overlapping 9×9 tilings were used to represent the continuous state space
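A sketch of one step of the mountain-car dynamics, using the standard constants from Sutton & Barto's version of the task (treat the exact bounds as assumptions if your edition differs):

```python
import math

def mountain_car_step(position, velocity, action):
    """One step of the classic mountain-car dynamics.

    action is -1 (full reverse), 0 (zero throttle), or +1 (full forward).
    Reward is -1 on every step; the episode ends when the car reaches
    the goal at the right summit (position >= 0.5).
    """
    velocity += 0.001 * action - 0.0025 * math.cos(3 * position)
    velocity = max(-0.07, min(0.07, velocity))       # clip velocity
    position += velocity
    if position < -1.2:                              # inelastic wall on the left
        position, velocity = -1.2, 0.0
    done = position >= 0.5
    return position, velocity, -1.0, done
```

Note that the engine term (0.001) is weaker than gravity's pull near the valley, which is exactly why the agent must first back away from the goal.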
Mountain-Car Task
Mountain-Car Results (five 9 by 9 tilings were used)
Summary
Generalization is an important RL attribute
Adapting supervised-learning function approximation
methods
 Each backup is treated as a learning example
Gradient-descent methods
Linear gradient-descent methods
 Radial basis functions
 Tile coding
Nonlinear gradient-descent methods?
 NN Backpropagation?
Subtleties involving function approximation,
bootstrapping and the on-policy/off-policy
distinction