Incremental Multi-Step Q-learning

Fuzzy Inference System Learning by Reinforcement
Presented by Alp Sardağ

A Comparison of Fuzzy & Classical Controllers
• Fuzzy Controller: expert systems based on if-then rules whose premises and conclusions are expressed by means of linguistic terms.
  • Rules close to natural language
  • A priori knowledge
• Classical Controller: needs an analytical model of the task.
Design Problem of FC
• A priori knowledge extraction is not easy:
  • Disagreement between experts
  • Great number of variables necessary to solve the control task
Self-Tuning FIS

• A direct teacher: based on an input-output set of training data.
• A distal teacher: does not give the correct actions, but the desired effect on the process.
• A performance measure: EA.
• A critic: gives rewards and punishments with respect to the state reached by the learner (RL methods).
• No more than two fuzzy sets are activated for an input value.
Goal

To overcome the limitations of classical reinforcement learning methods: discrete state perception and discrete actions.

NOTE: In this paper a MISO FIS is used.
A MIMO FIS
A FIS is made of N rules of the following form (reconstructed below):
• Ri: the ith rule of the rule base
• Si: the input variables
• Lij: linguistic term of an input variable; μLij its membership function
• y1, ..., yNo: the output variables
• Oij: linguistic term of an output variable
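The rule template itself did not survive the slide extraction; a hedged reconstruction consistent with the notation above (the exact indexing in the paper may differ) is:

\[
R_i:\ \text{IF } S_1 \text{ is } L_{i1} \text{ and } \dots \text{ and } S_n \text{ is } L_{in}
\ \text{THEN } y_1 \text{ is } O_{i1} \text{ and } \dots \text{ and } y_{N_o} \text{ is } O_{iN_o}
\]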
Rule Preconditions

• Membership functions are triangles and trapezoids (although not differentiable):
  • because they are simple
  • sufficient in a number of applications
• A strong fuzzy partition is used:
  • All values activate at least one fuzzy set; the input universe is completely covered.
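As a reminder (this formula is not on the slide), a strong fuzzy partition is usually characterized by membership degrees that sum to one everywhere on the input universe U:

\[ \sum_{j} \mu_{L_j}(x) = 1 \qquad \forall x \in U \]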
Strong Fuzzy Partition Example
Rule Conclusions

• Each rule Ri has No corresponding conclusions.
• For each rule, the truth value with respect to the input S is computed with a T-norm, implemented here by a product.
• The FIS outputs are computed from these truth values and the rule conclusions (see the reconstruction below).
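The formulas on this slide were lost in extraction; a hedged reconstruction of the standard zero-order Takagi-Sugeno forms (the symbols \alpha_{R_i} and o_{ik} are assumed notation) is:

\[ \alpha_{R_i}(S) = \prod_{j} \mu_{L_{ij}}(S_j) \]

\[ y_k(S) = \frac{\sum_{i=1}^{N} \alpha_{R_i}(S)\, o_{ik}}{\sum_{i=1}^{N} \alpha_{R_i}(S)} = \sum_{i=1}^{N} \alpha_{R_i}(S)\, o_{ik} \]

where o_{ik} is the conclusion of rule R_i for output y_k; the simplification on the right holds because, with a strong fuzzy partition and a complete rule base, the truth values sum to one.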
Learning

• The number and positions of the input fuzzy labels are set using a priori knowledge.
• Structural learning: consists in tuning the number of rules.
• FACL and FQL: reinforcement learning methods that deal only with the conclusion part.
Reinforcement Learning
NOTE: state observability is total.
Markovian Decision Problem
• S: a finite discrete state set
• U: a finite discrete action set
• R: primary reinforcements, R: S × U → ℝ
• P: transition probabilities, P: S × U × S → [0, 1]
• State evaluation function (reconstructed below):
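The evaluation-function formula is missing from the extracted slide; the standard discounted form (\gamma is the discount factor, an assumed notation) is:

\[ V^{\pi}(s) = E_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, r_{t+1} \,\middle|\, s_0 = s\right] \]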
The Curse of Dimensionality
• Some form of generalization must be incorporated in the state representation.
• Various function approximators have been used:
  • CMAC
  • Neural networks
  • FIS: the state-space encoding is based on a vector corresponding to the current state.
Adaptive Heuristic Critic
• AHC is made of two components:
  • Adaptive Critic Element: the critic, developed in an adaptive way from primary reinforcements; it represents an evaluation function more informative than the one given by the environment through rewards and punishments (the V(S) values).
  • Associative Search Element: selects actions which lead to better critic values.
FACL Scheme
The Critic
• At time step t, the critic value is computed with the conclusion vector:
• The TD error is given by:
• The TD-learning update rule:
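The three formulas did not survive extraction; hedged reconstructions of the usual FACL critic equations (v_i as the ith conclusion, \beta as the critic learning rate, and \alpha_{R_i} as the rule truth value are assumed symbols) are:

\[ V_t(S_t) = \sum_{i} \alpha_{R_i}(S_t)\, v_i[t] \]

\[ \tilde{\varepsilon}_{t+1} = r_{t+1} + \gamma\, V_t(S_{t+1}) - V_t(S_t) \]

\[ v_i[t+1] = v_i[t] + \beta\, \tilde{\varepsilon}_{t+1}\, \alpha_{R_i}(S_t) \]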
The Actor

• When rule Ri is activated, one of Ri's local actions is elected to participate in the global action, based on its quality.
• The triggered global action is given below, where ε-greedy is a function implementing a mixed exploration-exploitation strategy.
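The global-action formula is missing; a hedged reconstruction of the usual form (a_i is the local action elected in rule R_i according to its qualities w_i; the notation is assumed) is:

\[ U_t(S_t) = \sum_{i} \alpha_{R_i}(S_t)\, a_i, \qquad a_i = \varepsilon\text{-greedy}\big(w_i[t]\big) \]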
Tuning Vector w
• The TD error, used as the improvement measure, is (except at the beginning of learning) a good approximator of the optimal evaluation function.
• The actor learning rule:
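The actor update formula is missing; a hedged reconstruction of the usual rule (only the quality of the locally elected action a_i is reinforced; symbols assumed) is:

\[ w_i[a_i, t+1] = w_i[a_i, t] + \beta\, \tilde{\varepsilon}_{t+1}\, \alpha_{R_i}(S_t) \]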
Meta Learning Rule

• Update strategy for the learning rates:
  • Every parameter should have its own learning rate.
  • Every learning rate should be allowed to vary over time (in order for the V values to converge).
  • When the derivative of a parameter has the same sign for several consecutive time steps, its learning rate should be increased.
  • When the sign of a parameter's derivative alternates for several consecutive time steps, its learning rate should be decreased.
• These heuristics are implemented by the Delta-Bar-Delta rule:
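The formula is missing from the slide; the standard Delta-Bar-Delta rule (Jacobs, 1988), which matches the heuristics above, adapts each learning rate \beta_i from the current gradient \delta_i and an exponentially averaged gradient \bar{\delta}_i (\kappa, \varphi, \theta are the rule's meta-parameters; the slide's exact notation may differ):

\[
\Delta\beta_i[t] =
\begin{cases}
\kappa & \text{if } \bar{\delta}_i[t-1]\,\delta_i[t] > 0,\\
-\varphi\,\beta_i[t] & \text{if } \bar{\delta}_i[t-1]\,\delta_i[t] < 0,\\
0 & \text{otherwise,}
\end{cases}
\qquad
\bar{\delta}_i[t] = (1-\theta)\,\delta_i[t] + \theta\,\bar{\delta}_i[t-1]
\]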
Execution Procedure
1. Estimation of the evaluation function corresponding to the current state.
2. Computation of the TD error.
3. Tuning of the parameter vectors v and w.
4. Estimation of the new evaluation function for the current state with the new conclusion vector vt+1.
5. Learning-rate update with the Delta-Bar-Delta rule.
6. For each activated rule, election of the local action; computation and triggering of the global action Ut+1.
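To make the procedure concrete, here is a minimal Python sketch of one learning step. The environment interface, the random placeholder for the truth values, the fixed learning rate, and all names are illustrative assumptions, not the paper's implementation; steps 4-5 (re-evaluation with vt+1 and the Delta-Bar-Delta update) are noted but omitted to keep the sketch short.

import numpy as np

rng = np.random.default_rng(0)

N_RULES, N_ACTIONS = 25, 5           # hypothetical sizes (25 rules, as in the example)
GAMMA, BETA, EPS = 0.95, 0.05, 0.1   # assumed discount, learning rate, exploration rate

v = np.zeros(N_RULES)                        # critic conclusion vector
w = np.zeros((N_RULES, N_ACTIONS))           # actor: quality of each local action per rule
actions = np.linspace(-1.0, 1.0, N_ACTIONS)  # hypothetical discrete action set

def truth_values(state):
    """Placeholder for the rule activations alpha_Ri(state); with a strong
    fuzzy partition they sum to one."""
    alpha = rng.random(N_RULES)
    return alpha / alpha.sum()

def elect_local_actions(eps=EPS):
    """Step 6: epsilon-greedy election of one local action per rule."""
    greedy = w.argmax(axis=1)
    explore = rng.integers(N_ACTIONS, size=N_RULES)
    return np.where(rng.random(N_RULES) < eps, explore, greedy)

def learning_step(alpha_prev, elected_prev, reward, state):
    alpha = truth_values(state)
    # Steps 1-2: evaluate the states with the critic and compute the TD error.
    td_error = reward + GAMMA * (alpha @ v) - (alpha_prev @ v)
    # Step 3: tune the critic (v) and the actor (w) for the previously elected actions.
    v[:] += BETA * td_error * alpha_prev
    w[np.arange(N_RULES), elected_prev] += BETA * td_error * alpha_prev
    # Steps 4-5 (re-evaluation with the updated conclusions, Delta-Bar-Delta) omitted;
    # this sketch keeps a fixed learning rate BETA instead.
    # Step 6: elect new local actions and compute the global action.
    elected = elect_local_actions()
    global_action = alpha @ actions[elected]
    return alpha, elected, global_action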
Example
Example Cont.


• The number of rules is twenty-five.
• For the sake of simplicity, the discrete actions available are the same for all rules.
• The discrete action set:
• The reinforcement function:
Results

Performance measure for distance:

Results: