Q-Function

Reinforcement Learning
Rafy Michaeli
Assaf Naor
Supervisor:
Yaakov Engel
FOR MORE INFO...
Visit the project's home page at:
http://www.technion.ac.il/~smily/rl/index.html
Project Goals
• Study the field of Reinforcement Learning (RL)
• Gain practical experience with implementing RL algorithms
• Examine the influence of various parameters on the performance of RL algorithms

Overview
Reinforcement Learning
In RL problems, an agent (a decision-maker) attempts to control a dynamic system by choosing an action at every time interval.
Overview, cont.
Reinforcement Learning
The agent receives feedback with every action it executes.
Overview, cont.
Reinforcement Learning
The ultimate goal of the agent is to learn a strategy for selecting actions such that the overall performance is optimized according to a given criterion.
Overview, cont.
The Value function
Given a fixed policy π, which determines the action to be performed at a given state, this function assigns a value to every state in the state space (all the possible states the system can have).
Overview, cont.
The Value function
The value of a state is defined as the weighted sum of the reinforcements received when starting at that state and following the given policy to a final state (short-term reinforcements are weighted more strongly than long-term ones).
Overview, cont.
The Value function
Or, mathematically:

V^\pi(s_t) = \sum_{k=0}^{\infty} \gamma^k r_{t+k}
Overview, cont.
The Action Value Function or Q-Function
Given a fixed policy π, this function assigns a value to every (state, action) pair in the (state, action) space.
Overview, cont.
The Action Value Function or Q-Function
The value of a pair (state s, action a) is defined as the weighted sum of the reinforcements received when executing action a at state s and then following the given policy for selecting actions at subsequent states.
Overview, cont.
The Action Value Function or Q-Function
Or, mathematically:

Q^\pi(s_t, a_t) = r_{t+1} + \gamma Q^\pi(s_{t+1}, \pi(s_{t+1}))
Overview, cont.
The learning algorithm
Uses experience to progressively learn the optimal value function, which is the function that predicts the best long-term outcome an agent can obtain from a given state.
Overview, cont.
The learning algorithm
The agent learns the optimal value function by continually exercising its current, non-optimal estimate of the value function and improving this estimate after every experience.
Overview, cont.
The learning algorithm
Given the optimal value function, the agent can then derive the optimal policy by computing:

a_t^* = \pi^*(s_t) = \arg\max_{a_t \in A} \left\{ R(s_t, a_t) + \gamma V^*(T(s_t, a_t)) \right\}
Description
• Surveyed the field of learning in general and focused on RL algorithms
• Implemented various RL algorithms on a chosen task, aiming to teach the agent the best way to perform the task
Description, cont.
The task of the agent
Given the car's initial location and velocity, bring it to a desired location with zero velocity, as quickly as possible!
Description, cont.
The task of the agent
System description:
• The car can move either forward or backward
• The agent can control the car's acceleration at every time interval
Description, cont.
The task of the agent
System description:
• Walls are placed on both sides of the track
• When the car hits a wall, it bounces back at the same speed it had prior to the collision (a minimal simulation sketch follows)
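A possible MATLAB sketch of one simulated time step of such a system; the time step, track limits and exact dynamics are assumptions, not the project's actual simulation:

% One simulated time step of the car (illustrative; dt and the wall
% positions x_min, x_max are assumed parameters).
function [x, v] = car_step(x, v, a, dt, x_min, x_max)
    v = v + a * dt;                    % the agent controls the acceleration a
    x = x + v * dt;
    if x < x_min || x > x_max          % the car hit a wall
        x = min(max(x, x_min), x_max); % keep the car on the track
        v = -v;                        % bounce back at the same speed
    end
end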
Description, cont.
A sketch of the system
Description, cont.
• The code was written using MATLAB
• Performed experiments to determine the influence of different parameters on the learning algorithm (mainly on convergence and on how fast the system learns the optimal policy)
• Tested the performance of CMAC as a function approximator (tested for both 1D and 2D functions)
Implementation issues
Function approximators – Representing the Value/Q-Function
– Lookup Tables
• A finite ordered set of elements (a possible implementation would be an array). Each element is uniquely associated with an index, and an element is accessed through its index.
• Each region of a continuous state space is mapped into an element of the lookup table. Thus, all states within a region are aggregated into one table element and are therefore assigned the same value.
Implementation issues, cont.
Function approximators – Representing the Value/Q-Function
– Lookup Tables
• This mapping from the state space to the Lookup Table can be uniform or non-uniform (a minimal mapping sketch follows the figure below).
An example of a uniform mapping of the state space to cells
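A minimal MATLAB sketch of a uniform mapping from a continuous (position, velocity) state to a cell index; the ranges and grid sizes are assumptions:

% Uniform mapping of a continuous (position, velocity) state to a
% lookup-table cell (sketch; x_rng, v_rng, nx, nv are assumed parameters).
function idx = state_to_cell(x, v, x_rng, v_rng, nx, nv)
    ix = floor((x - x_rng(1)) / diff(x_rng) * nx) + 1;  % position bin
    iv = floor((v - v_rng(1)) / diff(v_rng) * nv) + 1;  % velocity bin
    ix = min(max(ix, 1), nx);
    iv = min(max(iv, 1), nv);
    idx = sub2ind([nx nv], ix, iv);    % linear index into the table
end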
Implementation issues, cont.
Function approximators – Representing the Value/Q-Function
– Cerebellar Model Articulation Controller (CMAC)
• Each state activates a specific set of memory locations (features). The arithmetic sum of their values is the value of the stored function (sketched below).
A CMAC structure realization
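A minimal MATLAB sketch of this idea for a 1-D state: several offset tilings each activate one weight, and the stored value is their sum. The tile width, number of tilings and weight table w are assumptions:

% CMAC value of a 1-D state (sketch; w is an n_tilings-by-n_tiles weight
% table, and tile_width, x_min are assumed parameters).
function val = cmac_value(x, w, n_tilings, tile_width, x_min)
    val = 0;
    for t = 1:n_tilings
        offset = (t - 1) * tile_width / n_tilings;        % shifted tiling
        tile = floor((x - x_min + offset) / tile_width) + 1;
        val = val + w(t, tile);                           % sum of active features
    end
end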
Implementation issues, cont.
Learning the optimal Value Function
We wish to learn the optimal Value Function, from which we can deduce the optimal action policy.
Implementation issues, cont.
Learning the optimal Value Function
Our learning algorithm was based on Temporal Difference (TD) methods.
Implementation issues, cont.
Learning the optimal Value Function
We define the temporal difference as:

\delta(s_t) = R(s_t, \pi(s_t)) + \gamma \hat{V}_t(s_{t+1}) - \hat{V}_t(s_t)
Implementation issues, cont.
Learning the optimal Value Function
At each time step we update the estimated Value Function by calculating:

\hat{V}_{t+1}(s_t) = \hat{V}_t(s_t) + \alpha \cdot \delta(s_t)
Implementation issues, cont.
Learning the optimal Value Function
By definition, the optimal policy satisfies:

a_t^* = \pi^*(s_t) = \arg\max_{a_t \in A} \left\{ R(s_t, a_t) + \gamma V^*(T(s_t, a_t)) \right\}
Implementation issues, cont.
Learning the optimal Value Function
– TD(λ) and Eligibility Traces
The TD rule presented above is really an instance of a more general class of algorithms called TD(λ), corresponding to λ = 0.
Implementation issues, cont.
Learning the optimal Value Function
– TD(λ) and Eligibility Traces
The general TD(λ) rule is similar to the TD rule given above:

\hat{V}_{t+1}(s) = \hat{V}_t(s) + \alpha \cdot \delta(s_t) \cdot e_{t+1}(s)

e_{t+1}(s) = \begin{cases} \lambda \gamma \, e_t(s) & \text{if } s \neq s_t \\ \lambda \gamma \, e_t(s) + 1 & \text{if } s = s_t \end{cases}

λ is taken to be 0 ≤ λ ≤ 1.
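A minimal MATLAB sketch of one TD(λ) step with accumulating eligibility traces over a lookup table; V and e are vectors over table cells, and the variable names are assumptions rather than the project's code:

% One TD(lambda) update with accumulating eligibility traces (sketch).
delta = r + gamma * V(s_next) - V(s);  % temporal difference at s_t
e = lambda * gamma * e;                % decay all traces
e(s) = e(s) + 1;                       % accumulate the trace of the visited cell
V = V + alpha * delta * e;             % update every cell according to its trace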
Implementation issues, cont.
Look-Up Table Implementation
– We used a Look-Up Table to represent the Value Function, and acquired the optimal policy by applying the TD(λ) algorithm.
– We used a non-uniform mapping of the state space to cells in the Look-Up Table, which enabled us to keep a rather small number of cells while still having a fine quantization around the origin (one possible scheme is sketched below).
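One possible way to obtain such a non-uniform quantization is to squash uniformly spaced bin edges toward the origin; this MATLAB sketch is only an assumption about the scheme, not the project's actual mapping:

% Non-uniform quantization of position: finer cells near the origin (sketch).
n_cells = 15;
edges_u = linspace(-1, 1, n_cells + 1);   % uniform edges on [-1, 1]
edges = sign(edges_u) .* edges_u.^2;      % squashing concentrates edges near 0
x = -0.13;                                % example position
cell_idx = find(x >= edges, 1, 'last');   % cell containing x
cell_idx = min(cell_idx, n_cells);        % clamp the topmost edge case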
Implementation issues, cont.
CMAC Implementation - 1
– CMAC is used to represent the Value Function and TD(λ) is the learning algorithm.

CMAC Implementation - 2
– CMAC is used to represent the Q-Function and TD(λ) is the learning algorithm.
Implementation issues, cont.
System simulation description
We simulated each of the three implementations for different values of λ = 0, 0.5, 0.9. For each value of λ we tested the system for different values of α = 0.2, 0.5, 0.8. γ was taken to be 1 throughout the simulations.
Simulation Results
We define:
– Success rate:
The percentage of all tries in which the agent successfully brought the car to its destination with zero velocity.
Simulation Results
We define:
– Average way:
The average number of time intervals it took the agent to bring the car to its destination.
Simulation Results
Look-Up Table results
– A common result for all parameter variants is the improvement of the success rate and the shortening of the average way to the goal as learning progresses.
– For a given α, it is hard to observe any differences between the results for different values of λ.
– As α increases, the learning process improves, i.e. for a given try number, the success rate and average way results are better.
Simulation Results
Look-Up Table results
– Note that eventually, in all cases, the success rate reaches 100%, i.e. the agent successfully brought the car to its goal.
Look-Up Table performance summary
Simulation Results
CMAC Q-Function results
– A common result for all parameter variants is the improvement of the success rate and the shortening of the average way to the goal as learning progresses.
– For a given λ, better results were obtained for a bigger α.
– As λ increases, the learning process is generally better.
Simulation Results
CMAC Q-Function results
– In most cases a 100% success rate is not reached, though it is reached in some cases.
– In some cases the success rate decreases along the tries and then increases again.
CMAC Q-Learning performance summary
Simulation Results
CMAC Value Iteration results
– The following figure shows the results obtained by the CMAC Value Iteration implementation compared to the results already obtained for the CMAC Q-Learning implementation. The results are for the best (α, λ) pair found in the previous results.
Simulation Results
CMAC Value Iteration results
A comparison between CMAC Q-Learning and CMAC Value Iteration performance
Simulation Results
Learning process examples
– In Figure 1 we show the learning process for a specific starting state and set of learning parameters. The figure shows the movement of the car after every few tries, for 150 consecutive time intervals.
– In Figure 2 we demonstrate the system's ability (at the end of learning) to direct the car to its goal starting from different states.
Figure 1: The progress of learning for α = 0.8, λ = 0.9, s0 = (0.2, 0.2)
Figure 2: The system's performance from different starting states after try 20, with the parameters α = 0.8, λ = 0.9
Conclusions
In this project we implemented a family of RL algorithms, TD(λ), with two different function approximators: CMAC and a Look-Up Table.
Conclusions
We examined the effect of the learning parameters α and λ on the overall performance of the system.
– In the Look-Up Table implementation: λ does not have a significant impact on the results; as α increases, the success rate increases more rapidly.
Conclusions
We examined the effect of the learning parameters α and λ on the overall performance of the system.
– In the CMAC implementation: as α or λ increases, the success rate increases and the average way decreases more rapidly.