Asian Journal of Control, Vol. 10, No. 5, pp. 535–549, September 2008
Published online in Wiley InterScience (www.interscience.wiley.com) DOI: 10.1002/asjc.054
FUZZY SARSA LEARNING AND THE PROOF OF EXISTENCE OF ITS
STATIONARY POINTS
Vali Derhami, Vahid Johari Majd, and Majid Nili Ahmadabadi
ABSTRACT
This paper provides a new Fuzzy Reinforcement Learning (FRL) algorithm based on critic-only architecture. The proposed algorithm, called Fuzzy
Sarsa Learning (FSL), tunes the parameters of conclusion parts of the Fuzzy
Inference System (FIS) online. Our FSL is based on Sarsa, which approximates the Action Value Function (AVF) and is an on-policy method. In each
rule, actions are selected according to the proposed modified Softmax action
selection so that the final inferred action selection probability in FSL is equivalent to the standard Softmax formula. We prove the existence of fixed points
for the proposed Approximate Action Value Iteration (AAVI). Then, we show
that FSL satisfies the necessary conditions that guarantee the existence of stationary points for it, which coincide with the fixed points of the AAVI. We
prove that the weight vector of FSL with stationary action selection policy
converges to a unique value. We also compare by simulation the performance
of FSL and Fuzzy Q-Learning (FQL) in terms of learning speed and action
quality. Moreover, we show by another example the convergence of FSL and
the divergence of FQL when both algorithms use a stationary policy.
Key Words: Learning systems, fuzzy systems, reinforcement learning,
Sarsa, stationary point.
I. INTRODUCTION
Designing an optimal controller for a complex system that interacts with an uncertain and nondeterministic environment is extremely difficult or even impossible. Hence, controller design in such environments is an area of interest for researchers [1]. Since there is usually little information about the desired output of the controller, an unsupervised learning algorithm is generally used in such systems. Reinforcement Learning (RL) is a modern and powerful approach for learning a control strategy online, which uses only a scalar performance index without any direct supervisor [2, 3].

Manuscript received August 24, 2006; revised March 27, 2007; accepted September 8, 2007.
Vali Derhami is with the Electrical Engineering Department, Tarbiat Modares University, P.O. Box 14115-143, Tehran, Iran (e-mail: [email protected]).
Vahid Johari Majd is the corresponding author and is with the Electrical Engineering Department, Tarbiat Modares University, P.O. Box 14115-143, Tehran, Iran (e-mail: [email protected]).
Majid Nili Ahmadabadi is with the Control and Intelligent Processing Center of Excellence, University of Tehran, and the School of Cognitive Science, Institute for Studies on Theoretical Physics and Mathematics, Tehran, Iran (e-mail: [email protected]).
This research was supported in part by I.T.R.C. under contract No. T500/8211.
Due to the curse of dimensionality, application
of RL in control problems with large spaces requires
the use of Function Approximators (FAs) [4, 5]. Fuzzy
Inference System (FIS) is a universal approximator,
which offers powerful capabilities such as knowledge
representation by if–then rules, modeling, and control
of nonlinear uncertain systems with the desired accuracy [6, 7]. As such, some authors [8–12] have offered
Fuzzy Reinforcement Learning (FRL), and have employed FRL to tune a fuzzy controller.
The two most well-known architectures used in
FRL are actor-critic [12, 13] and critic-only methods
[3]. Actor-critic based FRL uses a FIS for the approximation of state value function and another FIS for the
actor module to generate actions. A major drawback
in most actor-critic implementations is lack of suitable exploration. In order to solve this problem, some
have proposed to add noise to the selected final action [14, 15]. Although this solution yields some performance enhancement, it suffers from symmetrical exploration around the action with the highest value, and
lack of relation between action selection probability and
action value.
In contrast with the above architecture, a critic-only based FRL uses an FIS only for approximating the
Action Value Function (AVF), and the action selection
probability of the output depends on this approximation. Such an action selection strategy improves balance
between exploration and exploitation [3, 8]. For these
reasons, we focus on FRL with critic-only architecture
in this paper.
Some authors have implemented the Q-learning
method (which is an off-policy method) with linear
FAs using fuzzy systems [8, 9]. This critic-only based
FRL algorithm, called Fuzzy Q-Learning (FQL), was
employed in some problems [8, 16]. However, FQL is
a heuristic method and lacks mathematical analysis.
Moreover, divergence of the Q-learning algorithm utilized with linear FAs has been shown in [4, 17] by examples. The possibility of divergence is because the algorithm updates action values according to a distribution
different from that of the Markov chain dynamic [18].
In addition, unlike standard Q-learning, the Q-learning
method with FAs introduced in [8] does not preserve
the off-policy property, and the final approximate action
values in the limit will depend on the policy used.
In contrast to Q-learning with linear FAs, there
have been some analytical achievements for on-policy
RL methods with linear FAs. In [19], proof of the existence of stationary points for the linear Temporal Difference (TD) method that uses Softmax action selection
was offered. The proof only addresses the approximation of state value function. Thus, in order to use Softmax action selection, approximation of AVF is computed from the state value function assuming that the
model of the environment is known.
In [20, 21], approximations of the AVF were given using a combination of Sarsa with linear FAs, which we
call linear Sarsa. In [20], the convergence of the weight
parameters of linear Sarsa into a fixed region was presented only when the policies in all episodes were stationary. In [21], the authors presented an approximate
policy iteration algorithm. It has been shown that, if
the policy improvement operator is a Lipschitz function, then the approximate policy iteration algorithm
converges to a unique solution from any initial policy.
However, the result has two drawbacks: First, the theorem offers no guarantee of the quality of the converging policy. Second, the new policy is generated only
after the weight parameters converge under the current
policy. This problem significantly decreases the learning speed, and makes it inappropriate for control problems with online learning, where it is desirable to update action policy at each step. In addition, selection
of a suitable linear FA and action selection method are
challenging in these works.
Contrary to standard Sarsa and Q-learning, no analytical proof is available to date for the convergence of critic-only based FRL when the policy is
not stationary and/or when it changes according to the
action value at each time step.
In this paper, we first present a new FRL algorithm, called Fuzzy Sarsa Learning (FSL), and show
that it approximates AVF of discrete Sarsa. As a necessary condition for the convergence of the algorithm, we
prove the existence of stationary points for FSL, which
coincide with the fixed points of our proposed Approximate Action Value Iteration (AAVI). To make the final inferred action selection probability in FSL equivalent to a Boltzmann distribution, we introduce a modified
Softmax action selection for selecting an action among
the candidate actions in each fuzzy rule. We also prove
the convergence of FSL weight vector to a unique value
under stationary policy.
Similar to [10], although our proofs are carried
out only for discrete action-state spaces, the FSL algorithm
works well for continuous spaces as shown by simulations. We compare the performance of FSL and FQL
in the boat problem. Moreover, we show convergence
of FSL and divergence of FQL under a stationary
policy.
The organization of this paper is as follows: In
Section II, the RL algorithms are described. Fuzzy Sarsa
learning is presented in Section III. In Section IV, the
theoretical analysis of FSL is proposed. Simulation results are given in Section V. Finally, the discussion and
conclusion of the paper are given in Section VI.
II. REINFORCEMENT LEARNING
In an agent-based system with reinforcement
learning, at each time step t, the agent observes the current state of the environment and takes an action from the finite discrete set of actions A under its current policy. Consequently, the environment goes to state
st+1 with transition probability p(st , at , st+1 ), and the
agent receives reinforcement signal rt+1 = r (st , at ) [3].
The policy, denoted by π(s, a), is the probability of selecting action a in state s. The agent learns to take actions to reach the states with greater values. The value of state s under policy π is defined by the following function [3, 22]:

$$V^{\pi}(s) = E_{\pi}\Big\{\sum_{k=0}^{\infty}\gamma^{k}\, r_{t+k+1} \,\Big|\, s_t = s\Big\}, \qquad 0 \le \gamma \le 1 \qquad (1)$$

where γ is a discount factor, and $E_{\pi}\{\cdot\}$ denotes the expected value. Similarly, the AVF is the value of action a in state s under policy π, and is defined by [3]:

$$Q^{\pi}(s, a) = E_{\pi}\Big\{\sum_{k=0}^{\infty}\gamma^{k}\, r_{t+k+1} \,\Big|\, s_t = s,\ a_t = a\Big\} \qquad (2)$$
In the following, we briefly describe the TD method and its two widely-used extensions for estimating AVFs.

2.1 Temporal difference learning

The simplest TD learning method, known as TD(0), is described as [3]:

$$V(s_t) \leftarrow V(s_t) + \alpha_t\big[r_{t+1} + \gamma V(s_{t+1}) - V(s_t)\big] \qquad (3)$$

where α is the learning rate and r is the immediate reward. In the above formula, the term inside the brackets is called the TD error.

2.1.1 Sarsa

The Sarsa method estimates the AVF for the current policy according to the following update formula [3]:

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha_t\big[r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t)\big] \qquad (4)$$

All elements of the quintuple of events $(s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1})$ are used in the above update law. The name 'Sarsa' for this algorithm has been derived from these five letters.

2.1.2 Q-learning

Similar to Sarsa, the Q-learning method estimates the AVF, with the difference that the goal is to estimate the maximum AVF over all possible policies. The AVF in this algorithm is updated as follows [23]:

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha_t\Big[r_{t+1} + \gamma \max_{b \in A} Q(s_{t+1}, b) - Q(s_t, a_t)\Big] \qquad (5)$$

If both of the following assumptions hold, all three methods explained above converge to their optimal values [3, 23–25]:

Assumption 1. The problem environment is a Markov Decision Process (MDP), aperiodic, and irreducible with limited reinforcement signals.

Assumption 2. The learning rate $\alpha_t$ is positive, nonincreasing, and satisfies:

$$\sum_{t=0}^{\infty} \alpha_t = \infty, \qquad \sum_{t=0}^{\infty} \alpha_t^{2} < \infty \qquad (6)$$
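For concreteness, the tabular updates (3)–(5) can be written in a few lines of Python. This is a minimal sketch; the dictionary-based tables, argument names, and function names are our own illustrative choices rather than anything prescribed by the paper.

```python
from collections import defaultdict

# Value tables default to zero; states and actions may be any hashable objects.
V = defaultdict(float)
Q = defaultdict(float)

def td0_update(V, s, r_next, s_next, alpha, gamma):
    # TD(0) update (3); the bracketed term is the TD error.
    V[s] += alpha * (r_next + gamma * V[s_next] - V[s])

def sarsa_update(Q, s, a, r_next, s_next, a_next, alpha, gamma):
    # Sarsa update (4); uses the full quintuple (s, a, r, s', a').
    Q[(s, a)] += alpha * (r_next + gamma * Q[(s_next, a_next)] - Q[(s, a)])

def q_learning_update(Q, s, a, r_next, s_next, actions, alpha, gamma):
    # Q-learning update (5); bootstraps on max_b Q(s', b) regardless of the behavior policy.
    best_next = max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += alpha * (r_next + gamma * best_next - Q[(s, a)])
```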
III. FUZZY SARSA LEARNING
In this section, we introduce Fuzzy Sarsa Learning
(FSL). Consider an n-input and one-output zero-order
TSK fuzzy system [6, 7] with R rules of the following
form:
$R_i$: If $x_1$ is $L_{i1}$ and … and $x_n$ is $L_{in}$,
then ($a_{i1}$ with value $w^{i1}$) or … or ($a_{im}$ with value $w^{im}$)

where $s = x_1 \times \cdots \times x_n$ is the n-dimensional input state vector, $L_i = L_{i1} \times \cdots \times L_{in}$ is the n-dimensional strictly convex and normal fuzzy set of the i-th rule with a unique center, m is the number of possible discrete actions for each rule, $a_{ij}$ is the j-th candidate action, and the weight $w^{ij}$ is the approximated value of the j-th action in the i-th rule. The goal of FSL is to adapt the values $w^{ij}$ online, to be used to obtain the best policy.
The firing strength of each rule is computed by the
product of the antecedent fuzzy sets. The normalized
firing strength functions of the rules are considered as
state basis functions. For the finite discrete set of environment states S, we can write the state space matrix
$\Phi_s$ as:

$$\Phi_s = \begin{bmatrix} \varphi_1^{1} & \varphi_2^{1} & \cdots & \varphi_R^{1} \\ \vdots & \vdots & & \vdots \\ \varphi_1^{N} & \varphi_2^{N} & \cdots & \varphi_R^{N} \end{bmatrix} \qquad (7)$$

where $\varphi_i^{j}$ is the normalized firing strength of the i-th rule for state $s_j$ ($s_1, \ldots, s_N \in S$), and $N = |S|$. Since the centers of the fuzzy sets are different, the matrix $\Phi_s$ is full
rank [10]. The system output a and the corresponding approximate AVF $\hat Q(s, a)$ are computed as follows:

$$a_t(s_t) = \sum_{i=1}^{R} \varphi_i(s_t)\, a^{i i^{+}} \qquad (8)$$

$$\hat Q_t(s_t, a_t) = \sum_{i=1}^{R} \varphi_i(s_t)\, w_t^{i i^{+}} \qquad (9)$$

where $i^{+}$ is the index of the action selected in the i-th rule according to the following proposed modified Softmax policy:

$$p(a_{ij}) = \frac{\exp(\varphi_i\, w^{ij}/\tau)}{\sum_{k=1}^{m} \exp(\varphi_i\, w^{ik}/\tau)} \qquad (10)$$

In this formula, τ is the temperature parameter. The difference between this and the conventional Softmax formula is that we have multiplied the firing strength of each rule by the value of each possible action for that rule. As we will show in Theorem 1, this modification results in a final action selection whose probability complies with the standard Softmax formula.
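As an illustration of the per-rule selection (10), the following short Python sketch draws the candidate-action index $i^{+}$ of one rule; the max-subtraction for numerical stability and the function interface are our own additions, not part of the paper.

```python
import math
import random

def select_rule_action(phi_i, rule_weights, tau):
    # Modified Softmax (10): the firing strength phi_i multiplies each action value w^{ij}.
    logits = [phi_i * w / tau for w in rule_weights]
    m = max(logits)                          # shift for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Sample the index i+ of the selected candidate action of this rule.
    return random.choices(range(len(rule_weights)), weights=probs, k=1)[0]
```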
The weight parameters of the i-th rule are updated by:

$$\Delta w_{t+1}^{ij} = \begin{cases} \alpha_t\,\Delta\hat Q_t(s_t, a_t)\,\varphi_i(s_t) & \text{if } j = i^{+} \\ 0 & \text{otherwise} \end{cases} \qquad (11)$$

where $\Delta\hat Q$ is the approximate action value error determined by

$$\Delta\hat Q_t(s_t, a_t) = r_{t+1} + \gamma\,\hat Q_t(s_{t+1}, a_{t+1}) - \hat Q_t(s_t, a_t) \qquad (12)$$

and γ and α are the discount factor and the learning rate, respectively. The algorithm procedure of FSL is summarized below:

1. Observe state $s_{t+1}$ and receive the reinforcement signal $r_{t+1}$.
2. Select a suitable action in each rule using the modified Softmax action selection (10).
3. Compute the final action $a_{t+1}$ and the approximate AVF $\hat Q_t(s_{t+1}, a_{t+1})$ using (8) and (9), respectively.
4. Compute $\Delta\hat Q$ and update w by (12) and (11), respectively.
5. Compute the new approximate AVF $\hat Q_{t+1}(s_{t+1}, a_{t+1})$ using (9).
6. Apply the final action.
7. t ← t + 1 and return to step 1.
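One learning step of the above procedure (steps 3 and 4, i.e., equations (8), (9), (11), and (12)) can be sketched as follows; the array-based interface and variable names are illustrative assumptions, not the paper's notation.

```python
import numpy as np

def fsl_step(phi_t, idx_t, phi_next, idx_next, r_next, W, A, alpha, gamma):
    # phi_t, phi_next : normalized firing strengths of the R rules at s_t and s_{t+1}
    # idx_t, idx_next : index i+ of the action selected in each rule (length R)
    # W, A            : R x m arrays of action values w^{ij} and candidate actions a^{ij}
    R = len(phi_t)
    q_t    = sum(phi_t[i]    * W[i, idx_t[i]]    for i in range(R))  # (9) at (s_t, a_t)
    q_next = sum(phi_next[i] * W[i, idx_next[i]] for i in range(R))  # (9) at (s_{t+1}, a_{t+1})
    delta = r_next + gamma * q_next - q_t                            # TD error (12)
    for i in range(R):                                               # update law (11)
        W[i, idx_t[i]] += alpha * delta * phi_t[i]
    a_final = sum(phi_next[i] * A[i, idx_next[i]] for i in range(R)) # final action (8)
    return a_final, delta
```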
IV. THEORETICAL ANALYSIS OF FSL

In this section, we provide some theoretical results concerning the existence of stationary points for FSL, and the convergence of its weight vector.

4.1 The existence of stationary points for the FSL algorithm

As we will show, FSL is an implementation of linear Sarsa with a Softmax action selection policy in each time step. Hence, we first prove the existence of stationary points for linear Sarsa where the actions are selected according to the Softmax formula. In the special case of linear FAs with R basis functions, we can approximate the AVF by:

$$\hat Q_t(s, a) = \sum_{i=1}^{R} \varphi_i(s, a)\, w_t(i) \qquad (13)$$

where w is the adjustable weight vector and $\varphi_1, \ldots, \varphi_R$ are the fixed state-action basis functions on the state-action space [21]. The above equation can be written in the following vector form:

$$\hat Q_t(s, a) = \varphi^T(s, a)\, w_t \qquad (14)$$

where $\varphi^T(s, a) = [\varphi_1(s, a), \ldots, \varphi_R(s, a)]$. Defining the state-action space matrix Φ with dimension $|S|\cdot|A| \times R$ as:
$$\Phi = \begin{bmatrix}
\varphi_1(s_1, a_1) & \cdots & \varphi_R(s_1, a_1) \\
\vdots & & \vdots \\
\varphi_1(s_1, a_M) & \cdots & \varphi_R(s_1, a_M) \\
\vdots & & \vdots \\
\varphi_1(s_N, a_1) & \cdots & \varphi_R(s_N, a_1) \\
\vdots & & \vdots \\
\varphi_1(s_N, a_M) & \cdots & \varphi_R(s_N, a_M)
\end{bmatrix}
= [\Phi_1 \mid \cdots \mid \Phi_R]
= \begin{bmatrix} \varphi^T(s_1, a_1) \\ \vdots \\ \varphi^T(s_N, a_M) \end{bmatrix} \qquad (15)$$
where M = |A|, we have

$$\hat Q_t = \Phi\, w_t \qquad (16)$$

Without loss of generality, we assume the following:

Assumption 3. The matrix Φ is full rank.

Let $D_{\pi}$ be a diagonal matrix with dimension $|S|\cdot|A| \times |S|\cdot|A|$ whose diagonal elements comprise the steady state probability of each state-action pair under policy π. Using a stochastic gradient descent method to decrease the sum squared error $\|Q - \hat Q\|_{D_{\pi}}^{2}$ by adjusting the weight vector w in linear Sarsa, the weight vector update law after executing action $a_t$ turns out to be [17]:

$$w_{t+1} = w_t + \alpha_t\,\varphi(s_t, a_t)\big[r_{t+1} + \gamma\,\varphi^T(s_{t+1}, a_{t+1})\,w_t - \varphi^T(s_t, a_t)\,w_t\big] \qquad (17)$$

Assumption 4. The action selection policy in each state is obtained using the following Softmax formula [3]:

$$\pi_{\tau,w}(s, a) = \frac{\exp(\hat Q(s, a)/\tau)}{\sum_{b \in A}\exp(\hat Q(s, b)/\tau)} \qquad (18)$$

Since the action is generated using the approximate AVF, the action selection policy depends on τ and w. To emphasize this, we placed τ and w in the index of the policy π in equation (18). Notice that the Softmax formula (18) is a continuous probability distribution function. This is an important feature when discussing convergence in continuous RL [21]. Moreover, equation (18) depends on all action values, which provides suitable exploration for the algorithm [3].

Lemma 1. In the linear Sarsa algorithm (17), the stochastic variable $w_t$ under Assumptions 1–4 asymptotically follows the trajectory w of the following Ordinary Differential Equation (ODE):

$$\dot w = \Phi^T D_{\pi_w}\big(r + \gamma P_{\pi_w}\Phi w - \Phi w\big) \qquad (19)$$

where $P_{\pi_w}$ is the transition probability matrix moving from one state-action pair to another under policy $\pi_w$.

Proof. The update law in (17) is a stochastic recursive algorithm. Thus, based on [19, 26], under the four conditions (C1–C4) given below, we can apply ODE methods for analyzing a recursive stochastic algorithm:

(C1) The algorithm should be of the form:

$$w_{t+1} = w_t + \alpha_t\, H(w_t, X_{t+1}) \qquad (20)$$

where w lies in $\mathbb{R}^d$ and the state $X_t$ lies in $\mathbb{R}^k$.

(C2) The learning rate sequence should be positive, decreasing, and should satisfy $\sum_{t=0}^{\infty}\alpha_t = \infty$ and $\sum_{t=0}^{\infty}\alpha_t^{\beta} < \infty$ for some $\beta > 1$.

(C3) $X_t$ should be of the form:

$$X_t = f(\xi_t) \qquad (21)$$

where, for a fixed w, the extended state $\xi_t$ is a Markov chain with transition probability $p_w(\xi_t)$ as a function of w. The row vector $p_w(\xi_t)$ is equal to the $\xi_t$-th row of the transition probability matrix $P_{\pi_w}$. Moreover, for all w, the Markov chain $\{\xi_t\}$ should have a unique stationary asymptotic behavior.

(C4) The mean vector field defined by $h(w) = \lim_{t\to\infty} E_w\big(H(w, X_t)\big)$ should exist and should be regular, which constructs the ODE of the form $\dot w = h(w)$.

Now, we verify the above conditions for linear Sarsa (17). Let:

$$z_t = \begin{bmatrix} s_t \\ a_t \end{bmatrix}, \qquad \xi_t = \begin{bmatrix} z_t \\ z_{t-1} \\ r_t \end{bmatrix} \qquad (22)$$

$$X_t = f(\xi_t) = \begin{bmatrix} \varphi(z_t) \\ \varphi(z_{t-1}) \\ r_t \end{bmatrix} \qquad (23)$$

$$H(w_t, X_{t+1}) = \varphi(z_t)\big(r_{t+1} + \gamma\,\varphi^T(z_{t+1})\,w_t - \varphi^T(z_t)\,w_t\big) \qquad (24)$$

Then, substituting (24) into (20) yields (17). Thus, the algorithm satisfies condition C1. Moreover, Assumption 2 satisfies condition C2 with β = 2. To verify condition C3, one can see that the extended state $\xi_t$ includes three parts: the reward signal $r_t$, which is an MDP according to Assumption 1; $z_{t-1}$, which is known at time t; and $z_t$, which has transition probability:

$$P_{\pi_w}(z_{t-1}, z_t) = p\big((s_{t-1}, a_{t-1}), s_t\big)\,\pi_w(s_t, a_t) \qquad (25)$$

The elements of matrix $P_{\pi_w}$ are calculated for any fixed w over the entire space. According to Assumption 1, P is irreducible and aperiodic. Moreover, $\pi_w$ follows a Boltzmann distribution, and thus it is an analytic function which satisfies $\pi_w(s, a) > 0$ and $\sum_{a \in A}\pi_w(s, a) = 1$. Therefore, for a fixed w, $P_{\pi_w}$ is irreducible and aperiodic, and $\{\xi_t\}$ is a Markov chain with a unique stationary asymptotic behavior. Hence, condition C3 is satisfied.
As mentioned, $\pi_w$ follows a Boltzmann distribution that is an analytic function; hence, considering Assumption 1, the steady state probability distribution $d_{\pi_w}(z)$ under policy $\pi_w$ is invariant and a regular function. Therefore, the expectation in condition C4 can be calculated as:

$$h(w) = \sum_{z \in |S|\cdot|A|} d_{\pi_w}(z)\,\varphi(z)\big[r(z) + \gamma\,(p_w(z)\,\Phi\,w) - \varphi^T(z)\,w\big] \qquad (26)$$

where $p_w(z)$ is the z-th row of the transition probability matrix $P_{\pi_w}$. Equation (26) in vector form becomes:

$$h(w) = \Phi^T D_{\pi_w}\big(r + \gamma P_{\pi_w}\Phi w - \Phi w\big) \qquad (27)$$

where the diagonal matrix $D_{\pi_w}$ has dimension $|S|\cdot|A| \times |S|\cdot|A|$ with elements $d_{\pi_w}(z)$ for the different values of z. Using the operators $\Pi_{\pi_w}$ and $T_{\pi_w}$ and policy (18), the expression in (27) is exactly the right-hand side of the ODE (19); hence, condition C4 is satisfied, which completes the proof.

Lemma 2. The linear Sarsa algorithm (17) under Assumptions 1–4 has stationary points, which coincide with the fixed points of the proposed AAVI (equation (A8) in the Appendix).

Proof. According to Lemma 1, the sequence of estimates w obtained from equation (17) asymptotically follows the ODE trajectory of equation (19). Vector w is a stationary point of this ODE if and only if $\dot w = 0$, which implies that:

$$\Phi^T D_{\pi_w}\Phi\, w = \Phi^T D_{\pi_w}\big(r + \gamma P_{\pi_w}\Phi\, w\big) \qquad (28)$$

With reference to Lemma A1 (see Appendix), the proposed AAVI defined in (A8) has fixed points; hence, there is a weight vector $w^*$ such that:

$$\Phi\, w^* = \Pi_{\pi_w} T_{\pi_w}\,\Phi\, w^* \qquad (29)$$

Considering the projection operator $\Pi_{\pi_w}$ and the dynamic programming operator $T_{\pi_w}$ defined in (A8) and (A6), respectively, and using the Softmax policy (18), equation (29) can be written as:

$$\Phi\, w^* = \Phi\,(\Phi^T D_{\pi_w}\Phi)^{-1}\Phi^T D_{\pi_w}\big(r + \gamma P_{\pi_w}\Phi\, w^*\big) \qquad (30)$$

Multiplying both sides of equation (30) by $\Phi^T D_{\pi_w}$ yields:

$$\Phi^T D_{\pi_w}\Phi\, w^* = \Phi^T D_{\pi_w}\Phi\,(\Phi^T D_{\pi_w}\Phi)^{-1}\Phi^T D_{\pi_w}\big(r + \gamma P_{\pi_w}\Phi\, w^*\big) \qquad (31)$$

Simplifying the right-hand side, the above equation reduces to equation (28). This result shows that the fixed points of the proposed AAVI are stationary points of linear Sarsa.

Next, we show that the stationary points of linear Sarsa are fixed points of the AAVI as well. Assume vector $w^*$ is a stationary point of linear Sarsa (17). Thus, we have:

$$\Phi^T D_{\pi_w}\Phi\, w^* = \Phi^T D_{\pi_w}\big(r + \gamma P_{\pi_w}\Phi\, w^*\big) \qquad (32)$$

Multiplying both sides of the above equation by $(\Phi^T D_{\pi_w}\Phi)^{-1}$ results in:

$$w^* = (\Phi^T D_{\pi_w}\Phi)^{-1}\Phi^T D_{\pi_w}\big(r + \gamma P_{\pi_w}\Phi\, w^*\big) \qquad (33)$$

which, after multiplying both sides by Φ and using the definitions of $\Pi_{\pi_w}$ and $T_{\pi_w}$, is equivalent to:

$$\Phi\, w^* = \Pi_{\pi_w} T_{\pi_w}\,\Phi\, w^* \qquad (34)$$

Therefore, the stationary points of the linear Sarsa algorithm (17) coincide with the fixed points of the AAVI in (A8).
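For a fixed policy, the stationarity condition (28) is a linear system in w. The following toy numerical check (a random chain, distribution, and feature matrix of our own choosing, not an example from the paper) solves it directly and verifies (28).

```python
import numpy as np

rng = np.random.default_rng(0)
n, R, gamma = 12, 4, 0.9                     # n = |S|*|A| state-action pairs, R features

P = rng.random((n, n)); P /= P.sum(axis=1, keepdims=True)  # row-stochastic chain P_pi
d = rng.random(n); d /= d.sum()                            # steady-state-like weights
D = np.diag(d)
r = rng.random(n)                                          # reward vector
Phi = rng.random((n, R))                                   # feature matrix, full column rank

# Solve Phi^T D (Phi w - r - gamma P Phi w) = 0 for the stationary point w*.
A = Phi.T @ D @ (Phi - gamma * P @ Phi)
b = Phi.T @ D @ r
w_star = np.linalg.solve(A, b)

# Check condition (28): Phi^T D Phi w* == Phi^T D (r + gamma P Phi w*).
lhs = Phi.T @ D @ Phi @ w_star
rhs = Phi.T @ D @ (r + gamma * P @ Phi @ w_star)
print(np.allclose(lhs, rhs))                               # True
```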
Theorem 1. The proposed FSL algorithm under Assumptions 1 and 2 has stationary points, which coincide
with fixed points of the proposed AAVI.
Proof. We first show that the FSL algorithm is an implementation of the linear Sarsa (17). We may expand the i-th rule into m separate rules as follows:

If s is $L_i$ then $a_{i1}$ with $w^{i1}$
If s is $L_i$ then $a_{i2}$ with $w^{i2}$
⋮
If s is $L_i$ then $a_{im}$ with $w^{im}$

Thus, we have R series of m rules. In each action selection, only one rule is selected from each series, and the combination of the R selected rules generates the final action. Hence, we can define the state-action basis vector with length m · R as follows:

$$\varphi(s, a) = \big[\,\underbrace{0 \cdots \varphi_1(s) \cdots 0}_{m}\ \ \underbrace{0 \cdots \varphi_2(s) \cdots 0}_{m}\ \ \cdots\ \ \underbrace{0 \cdots \varphi_R(s) \cdots 0}_{m}\,\big]^{T} \qquad (35)$$

where m is the number of candidate actions in each rule. As we can see in (35), in each series of m elements only one element is nonzero, depending on which rule is chosen in that series. For each input state, it is possible to have $m^R$ different vectors $\varphi$. Therefore, the state-action space matrix of the problem has dimension $|S|\cdot m^R \times m\cdot R$. The basis vectors $\varphi^T(s, a)$ constitute the rows of the state-action space matrix Φ, with m · R adjustable weight parameters:

$$w^T = [\,w^{11}, \ldots, w^{1m},\ w^{21}, \ldots, w^{2m},\ \ldots,\ w^{R1}, \ldots, w^{Rm}\,] \qquad (36)$$
Using (35) and (36), equation (9) reduces to equation (14).
It can be easily seen that substituting equations (35), (36), and (14) into equations (11) and (12) yields equation (17). Therefore, the update law of the FSL algorithm is equivalent to that of the linear Sarsa algorithm.
Considering the above results and the assumptions
of this theorem, we can conclude that, if we prove that
Assumptions 3 and 4 hold for FSL, then the result of
Lemma 2 holds for it as well.
a) To show that the columns of the matrix Φ are linearly independent, let ΦB = 0 for some vector B, $B^T = [B_1, \ldots, B_{mR}]$. Using (35), each row of Φ corresponds to one state $s_j$ and one combination $(k_1, \ldots, k_R)$ of selected candidate actions (one index $k_i \in \{1, \ldots, m\}$ per rule), so that ΦB = 0 expands into N blocks of $m^R$ homogeneous equations of the form

$$\varphi_1^{j} B_{k_1} + \varphi_2^{j} B_{m+k_2} + \cdots + \varphi_R^{j} B_{m(R-1)+k_R} = 0, \qquad j = 1, \ldots, N \qquad (37)$$

with one equation for every combination $(k_1, \ldots, k_R) \in \{1, \ldots, m\}^{R}$ in the j-th block. Choosing the first row from each of the $m^R$-equation sets on the right-hand side of (37), i.e., the rows with $k_1 = \cdots = k_R = 1$, gives:

$$\begin{cases} \varphi_1^{1} B_1 + \varphi_2^{1} B_{m+1} + \cdots + \varphi_R^{1} B_{m(R-1)+1} = 0 \\ \varphi_1^{2} B_1 + \varphi_2^{2} B_{m+1} + \cdots + \varphi_R^{2} B_{m(R-1)+1} = 0 \\ \quad\vdots \\ \varphi_1^{N} B_1 + \varphi_2^{N} B_{m+1} + \cdots + \varphi_R^{N} B_{m(R-1)+1} = 0 \end{cases} \qquad (38)$$

Considering the state space matrix $\Phi_s$ defined in (7), equation (38) can be written as:

$$\Phi_s \begin{bmatrix} B_1 \\ B_{m+1} \\ \vdots \\ B_{m(R-1)+1} \end{bmatrix} = 0 \qquad (39)$$

Since $\Phi_s$ is full rank, we have $B_{1+km} = 0$ for $k = 0, 1, \ldots, R-1$. A similar procedure for the i-th rows of each of the $m^R$-equation sets on the right-hand side of (37) results in $B_i = 0$ for all i. This yields $B = \bar 0$.
b) To show that the probability of the final action selection p(a) complies with the standard Softmax formula, we write p(a) as the product of the action selection probabilities of all rules:

$$p(a) = \prod_{i=1}^{R} \frac{\exp(\varphi_i\, w^{i i^{+}}/\tau)}{\sum_{j=1}^{m}\exp(\varphi_i\, w^{ij}/\tau)}
= \frac{\exp\!\Big(\big(\sum_{i=1}^{R}\varphi_i\, w^{i i^{+}}\big)/\tau\Big)}{\prod_{i=1}^{R}\sum_{j=1}^{m}\exp(\varphi_i\, w^{ij}/\tau)} \qquad (40)$$

Considering (9), the above equation simplifies to equation (18).
4.2 Convergence of the FSL algorithm under stationary policy

As proven in Theorem 1, FSL is an implementation of linear Sarsa(λ) with λ = 0, where λ is the eligibility trace decay parameter used in the TD(λ) method [3]. If the weight vector in FSL(λ) with λ ∈ [0, 1] is updated according to:

$$\Delta w_{t+1}^{ij} = \alpha_t\,\Delta\hat Q_t(s_t, a_t)\, e_t^{ij}, \qquad i = 1, 2, \ldots, R,\quad j = 1, 2, \ldots, m \qquad (41)$$

where $e$ is the eligibility trace, updated as follows [5]:

$$e_t^{ij} = \begin{cases} \gamma\lambda\, e_{t-1}^{ij} + \varphi_i(s_t) & \text{if } j = i^{+} \\ \gamma\lambda\, e_{t-1}^{ij} & \text{otherwise} \end{cases} \qquad (42)$$

initialized with $e_{-1} = 0$, then FSL(λ) is an implementation of linear Sarsa(λ).
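A compact sketch of the trace-based update (41)–(42), as reconstructed above, is given below; the in-place NumPy interface and the argument names are our own choices.

```python
import numpy as np

def fsl_lambda_update(W, E, phi_t, idx_t, delta, alpha, gamma, lam):
    # E is the R x m array of eligibility traces e^{ij}, initialized to zero (e_{-1} = 0).
    E *= gamma * lam                    # decay all traces, per (42)
    for i in range(len(phi_t)):
        E[i, idx_t[i]] += phi_t[i]      # accumulate the trace of the selected action i+
    W += alpha * delta * E              # weight update (41); delta is the TD error (12)
```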
Lemma 3. In the special case where the action selection policy is stationary, the FSL(λ) algorithm under Assumptions 1 and 2 converges to a unique value with probability 1. In addition, the final weight parameter vector $w^*$ satisfies the following inequality:

$$\|\Phi w^* - Q^*\|_{D} \le \frac{1 - \lambda\gamma}{1 - \gamma}\,\|\Pi Q^* - Q^*\|_{D} \qquad (43)$$

Proof. According to Theorem 1, it is obvious that the update law of the FSL(λ) algorithm is equivalent to that of the linear Sarsa(λ) algorithm and that the columns of matrix Φ are linearly independent. The convergence of the state value function in the linear TD(λ) method to a unique value under Assumptions 1–3 and a constant state transition probability matrix has been proven in Theorem 1 of [5]. If we consider the transition from one state-action pair to the next state-action pair and use state-action basis functions instead of state basis functions, then equation (41) is equivalent to the update law given in Theorem 1 of [5]. Under a stationary policy, all the elements of the transition matrix are constant; hence, in this case, the results of the mentioned theorem (convergence, uniqueness, and error bound) hold for FSL(λ) as well.
V. SIMULATION

In this section, we first demonstrate the performance of FSL versus FQL on the boat problem, which was used in [8] for the evaluation of FQL. Then, both algorithms are compared in terms of convergence under a stationary policy in a multi-goal environment.

Fig. 1. Boat Problem [8].
5.1 Boat problem
We use FSL and FQL to tune a fuzzy controller to
drive a boat from the left bank to the right bank quay
in a river with strong nonlinear current. The goal is to
reach the quay from any position on the left bank (see
Fig. 1).
5.1.1 System dynamics
This problem has two continuous state variables,
namely, x and y position of the boat’s bow ranging from
0 to 200. The quay center is located at (200,100) and
has a width of five. The river current is [8]:
$$E(x) = f_c\big[x/50 - (x/100)^{2}\big] \qquad (44)$$

where $f_c$ is the current force ($f_c = 1.25$). The new bow position of the boat is computed by:

$$\begin{aligned} x_{t+1} &= \min\!\big(200,\ \max(0,\ x_t + s_{t+1}\cos(\theta_{t+1}))\big) \\ y_{t+1} &= \min\!\big(200,\ \max(0,\ y_t - s_{t+1}\sin(\theta_{t+1}) - E(x_{t+1}))\big) \end{aligned} \qquad (45)$$
The boat angle $\theta_t$ and speed $s_t$ are computed by the following difference equations:

$$\begin{aligned} \bar\omega_{t+1} &= \bar\omega_t + (I\,\omega_{t+1}) \\ \theta_{t+1} &= \theta_t + \big((\bar\omega_{t+1} - \theta_t)(s_{t+1}/S_{Max})\big) \\ s_{t+1} &= s_t + (S_{des} - s_t)\,I \\ \omega_{t+1} &= \min\!\big(\max(p\,(a_{t+1} - \theta_{t+1}),\ -45^{\circ}),\ 45^{\circ}\big) \end{aligned} \qquad (46)$$
where $I = 0.1$ is the system inertia, $S_{Max} = 2.5$ is the maximum possible speed of the boat, $S_{des} = 1.75$ is the speed goal, $\omega_t$ is the rudder angle, and $p = 0.9$ is the proportional coefficient used to compute the rudder angle needed to reach the fuzzy controller's desired direction $a_t$ [8]. The computations are performed every 0.1 second. However, in order to observe the effect of the applied control, the controller and the learning mechanisms are triggered every one second.

Fig. 2. Input membership functions: five fuzzy sets (VLow, Low, Med, High, VHigh) partition each of the inputs x and y over [0, 200].
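The kinematic part of the model, i.e., the current (44) and the bow-position update (45), can be sketched as follows; the degree-to-radian conversion and the function names are our own assumptions.

```python
import math

F_C = 1.25  # current force f_c

def current(x):
    # River current E(x) from (44).
    return F_C * (x / 50.0 - (x / 100.0) ** 2)

def next_bow_position(x, y, speed_next, theta_next_deg):
    # Bow-position update (45); positions are clipped to [0, 200].
    th = math.radians(theta_next_deg)
    x_new = min(200.0, max(0.0, x + speed_next * math.cos(th)))
    y_new = min(200.0, max(0.0, y - speed_next * math.sin(th) - current(x_new)))
    return x_new, y_new
```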
5.1.2 Learning details
As shown in Fig. 2, five fuzzy sets have been
used to partition each input variable resulting in
25 rules [8]. For the sake of simplicity, the possible discrete actions are the same for all the rules,
though it is possible to assign different sets of actions to each rule. The discrete action sets are made
of 12 directions (in degrees), ranging from the southwest to the north: U = {−100, −90, −75, −60, −45, −35, −15, 0, 15, 45, 75, 90}. The controller generates continuous actions by combining these discrete actions. To
keep the same conditions between FQL and FSL, our
modified Softmax action selection policy (10) is used
for both methods.
The reinforcement function is equal to zero
during the traverse of the boat, and is a nonzero value based on the y position of the boat after reaching one of the three zones of the right
bank: The success zone Z s corresponds to the quay
(x = 200; y ∈ [97.5, 102.5]), the viability zone Z v
where x = 200 and y ∈ [92.5, 97.5] ∪ [102.5, 107.5],
and the failure zone Z_f, which includes all other points of the right bank. The reinforcement function is
defined by [8]:

$$R(x, y) = \begin{cases} +1 & (x, y) \in Z_s \\ D(x, y) & (x, y) \in Z_v \\ -1 & (x, y) \in Z_f \\ 0 & \text{otherwise} \end{cases} \qquad (47)$$
where D(x, y) is a function that decreases linearly from
1 to −1, relative to the distance from the success zone.
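A sketch of the terminal reinforcement (47) is given below. The exact form of D(x, y) is not specified beyond "decreasing linearly from 1 to −1"; the version used here, which reaches −1 at the outer edge of the viability zone, is our own assumption.

```python
def reinforcement(x, y):
    if x < 200.0:
        return 0.0                                  # boat still traversing the river
    if 97.5 <= y <= 102.5:
        return 1.0                                  # success zone Z_s (the quay)
    if 92.5 <= y < 97.5 or 102.5 < y <= 107.5:
        dist = abs(y - 100.0) - 2.5                 # distance beyond the quay edge, in (0, 5]
        return 1.0 - 2.0 * dist / 5.0               # assumed D(x, y): linear from +1 down to -1
    return -1.0                                     # failure zone Z_f
```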
The evaluation of each experiment is the average
of performance measures introduced in [8] over 100
runs. Every run includes a “learning phase” and a “testing phase”. For the learning phase, we generate 100 sets
of random positions. Every set comprises 5000 random
values for the y-axis of the starting point of the boat
movement (x = 10; y = random). The learning phase
finishes when either 40 successive episodes end outside the failure zone or the number of episodes exceeds 5000. We
call the episode number at the end of learning phase the
Learning Duration Index (LDI), which is a measure of
learning time period.
The testing phase is made of 40 episodes, whose
initial points are specified by x = 10 and y distributed
with equal distance in the range of 0 to 200. We define
the distance error of the reached bank position relative
to the quay center by:
$$d(x, y) = \begin{cases} |y - 100| & \text{if the right bank is reached} \\ 100 + (200 - x) & \text{otherwise} \end{cases} \qquad (48)$$
We then define Distance Error Index (DEI) as the average of d over the 40 episodes in testing phase, which
is a measure of the learning quality. The learning rate α and the temperature parameter τ decrease during the learning phase as follows [3]:

$$\alpha_t = \begin{cases} \alpha_{t-1}/1.001 & \text{if } t = 4k \\ \alpha_{t-1} & \text{otherwise} \end{cases} \qquad (49)$$

$$\tau_t = \begin{cases} \tau_{t-1} - (0.99)^{t}\times 0.4\times\tau_{t-1} & \text{if } t = 4k \text{ and } t < 100 \\ \tau_{t-1} - (0.99)^{t}\times 0.2\times\tau_{t-1} & \text{if } t = 4k \text{ and } t \ge 100 \\ \tau_{t-1} & \text{otherwise} \end{cases} \qquad (50)$$
where k is a positive integer. To increase exploration and to escape from local extrema, if the learning phase of an experiment does not end after 1500 episodes, τ is reset to its initial value and α is set to 0.25 of its initial value.
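The two schedules (49)–(50) translate directly into code; reading "t = 4k" as "every fourth episode" is our interpretation, and the function names are our own.

```python
def decay_alpha(alpha_prev, t):
    # Learning-rate schedule (49).
    return alpha_prev / 1.001 if (t > 0 and t % 4 == 0) else alpha_prev

def decay_tau(tau_prev, t):
    # Temperature schedule (50): faster decay before episode 100, slower afterwards.
    if t > 0 and t % 4 == 0:
        rate = 0.4 if t < 100 else 0.2
        return tau_prev - (0.99 ** t) * rate * tau_prev
    return tau_prev
```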
5.1.3 Simulation results
Table I shows the results of four experiments with FSL and FQL for different initial learning rates and temperatures.
Table I. Simulation results for different initial learning rates and temperatures.

Initial Parameters      Method   Avg. DEI   Avg. LDI   Std (LDI)   Success Rate   Viability Rate   Failure Rate
τ0 = 1,    α0 = 0.1     FQL      7.94       1309       1184        49.175         43.575           7.25
                        FSL      4.45       1010       1065        56.85          40.325           2.825
τ0 = 0.01, α0 = 0.1     FQL      12.26      1227       1486        46.85          41.875           11.275
                        FSL      7.09       962        1050        50.10          44.50            5.40
τ0 = 0.01, α0 = 0.01    FQL      8.69       733        1084        49.55          43.75            6.7
                        FSL      8.68       698        1013        51.275         42.475           6.25
τ0 = 5,    α0 = 0.1     FQL      8.71       1456       975         51.35          41.675           6.975
                        FSL      8.38       1290       900         51.425         41.85            6.725

DEI, Distance error index; FQL, Fuzzy Q-Learning; FSL, Fuzzy Sarsa Learning; LDI, Learning duration index.
The average DEI, average LDI, success rate, viability rate, and failure rate (the corresponding zones are defined in equation (47)) were computed by taking the average over 100 runs for each of the four experiments.

As the results show, the training time and the action quality of FSL are considerably better than those of FQL. Moreover, the initial value of the temperature parameter greatly influences the learning quality and the learning speed, and a medium initial temperature value seems to be more suitable. It is noteworthy that, in the Softmax formula, the exploration degree depends on the temperature factor as well as on the differences of the action values Q̂(s, a). If the differences are small, then selecting a medium temperature factor would cause high exploration. On the other hand, if the differences are large, a small temperature factor would cause high exploitation.

Figs. 3 and 4 show the histograms of the LDIs for both algorithms for τ0 = 1 and α0 = 0.1. As we see from the figures, compared to those of FQL, not only are the LDIs in FSL closer together, but there are also fewer instances of reaching the upper bound (5000 episodes). This means that FSL learns much faster and with much better accuracy.

Fig. 3. Histogram of learning duration indexes for Fuzzy Sarsa Learning.
Fig. 4. Histogram of learning duration indexes for Fuzzy Q-Learning.

5.2 Multi-goal obstacle-free environment

To demonstrate the convergence of our algorithm (FSL) under a stationary policy and to show the possibility of divergence of FQL, we use both algorithms to approximate the AVF in an 11 × 11 multi-goal obstacle-free environment, as shown in Fig. 5. This environment is similar to the grid world environment presented in [18], except that the states ∈ [0, 11] × [0, 11] and the actions ∈ [0°, 360°] are continuous. The action is the movement angle of the agent with respect to the horizontal axis; hence, the agent with a zero angle moves right, with a 90° angle moves down, etc. The agent moves with unit constant steps in the direction of the final inferred action angle.
Fig. 5. A multi-goal obstacle-free environment (terminal goal states G with rewards +1 or −1 at the four corners; the start state S is at the center).
Fig. 6. Input membership functions (two fuzzy sets, Low and High, over each normalized input in [0, 1]).
Fig. 7. Weight values of Rules 1 and 2 in Fuzzy Sarsa Learning.
If the agent bumps into a wall, it will remain in the same
state. The four corners are terminal states. The agent
receives a reward of +1 for entering the bottom-right
and upper-left corners, and −1 for entering the other two corners. All other actions have zero reward.
5.2.1 Learning details
The inputs of the FIS are the normalized position of the agent (x, y). As shown in Fig. 6, two Gaussian fuzzy
sets are used to partition each dimension, resulting in
the following four rules:
Rule 1: If x is Low and y is Low, then R with w 1R
or D with w 1D or L with w 1L or U with w 1U .
Rule 2: If x is Low and y is High, then R with
w 2R or D with w 2D or L with w 2L or U with w 2U .
Rule 3: If x is High and y is Low, then R with
w 3R or D with w 3D or L with w 3L or U with w 3U .
Rule 4: If x is High and y is High, then R with
w 4R or D with w 4D or L with w 4L or U with w 4U .
In the above rules, R, D, L, and U denote right,
down, left, and up candidate actions, respectively, and
the weight vector w is defined according to equation
(36). The continuous final action is inferred by the directional combination of these discrete actions in the two-dimensional action space. In each rule, the action selection probability is equal for all candidate actions; hence,
the policy is stationary.
We apply both the FQL(λ) and FSL(λ) algorithms to this problem with λ = 0.9, w_0 = 0̄, and α_0 = 0.3. The learning rate decreases through the run. Each run includes 10,000 episodes. An episode starts from the center (5.5, 5.5) and finishes if the agent reaches one of the corners or the number of steps exceeds 1000. We execute 10 independent runs.
Figs 7 and 8 show the weight values corresponding to the right and left candidate actions of all rules.
We only showed two weight values for each rule, because in Rules 1 and 4 the trajectories of the right action
weight and left action weight are fairly similar to those
of the down action weight and up action weight, respectively. Likewise, in Rules 2 and 3, the trajectories of the
right action weight and left action weight are similar to
those of the up action weight and down action weight,
respectively.
As predicted by Lemma 3, the weight values in FSL(λ) converged to a unique value. The final weight values in different runs are very close to each other, with a maximum standard deviation of 0.007 and an average weight vector over the 10 runs of

w = [0.2153, 0.2168, 0.3204, 0.3204, −0.2130, −0.3176, −0.3198, −0.2153, −0.3162, −0.2119, −0.2144, −0.3185, 0.3155, 0.3172, 0.2112, 0.2108].

Fig. 8. Weight values of Rules 3 and 4 in Fuzzy Sarsa Learning.
Fig. 9. The trajectory of the right action weight of Rule 1 in Fuzzy Q-Learning (the vertical axis is scaled by 10^28).
As is clear from the figures, the weight parameters have converged to their final values in about 10,000 time steps. Notice that, due to the symmetric environment in this specific example, the corresponding weight values of the symmetric rules are equal. In contrast, the weight parameters in FQL(λ) diverged. Fig. 9 shows one of the diverging weights in FQL(λ).
VI. DISCUSSION AND CONCLUSION
In this paper, we concentrated on critic-only based
FRL algorithms. Employing fuzzy systems enables us
to embed human experience in the learning system. We
presented a new on-policy method, called FSL, based on
Sarsa. FSL has two algorithmic differences with FQL.
The first difference is in the method of computation
of ΔQ̂, similar to the difference between discrete Q-learning and discrete Sarsa. However, due to the use of
FA for the approximation of the action value, updating
the weights of both algorithms in every step affects all
approximated values of state-action as well as the selected action. Hence, both algorithms, unlike discrete
Q-learning and Sarsa, do not behave similarly to each
other, even under greedy policy. The second difference
lies in the strategy of action selection. In the FQL presented in [8] and [16], a mixed exploration-exploitation
strategy has been employed for action selection in each
rule. This type of action selection does not guarantee
generation of final action according to a continuous
probability function. On the other hand, in continuous
reinforcement learning, actions should be generated by a
continuous probability function so that small changes in
the value of an action do not create substantial changes
in the control behavior [21]. To maintain this feature,
we employed a modified Softmax action selection in
each rule of the proposed FSL so that the probability
function of final action selection became continuous.
The main advantage of FSL compared to FQL is
the existence of theoretical analysis concerning the existence of stationary points. Moreover, we proved the
convergence of FSL under stationary policy. On the
other hand, there is no proof for the existence of the stationary points or convergence in any situation for FQL,
and besides, some examples of divergence in linear Qlearning were reported in the literature.
It should be noted that the existence of stationary
points for FSL does not necessarily imply convergence
for the algorithm. However, this result is important from
different aspects. First, to the best of our knowledge, the
presented theoretical results for FSL are the first analytical results for critic-only based FRL in control applications where the policy is adapted at each step. Second, FSL
is an implementation of Linear Sarsa with FIS. With this
implementation, we coped with two challenges mentioned in the introduction: defining state-action basis
functions and action selection policy. Third, the proof
of the existence of stationary points that coincide with
fixed points of proposed AAVI signifies a good chance
to reach the desired solution as it was seen in the simulation. Fourth, since the action selection policy in FSL
is updated in each time step, the learning speed in this
algorithm is high.
Experimental results verified that FSL outperformed FQL. They also showed that action quality
strongly depends on the temperature factor (i.e., exploration). In discrete RL, selecting a greater temperature
value usually improves quality, but results in slower
learning. However, as seen in the example, this is not
necessarily true in continuous RL. The results in Section 5.2 verified the convergence of FSL under stationary policy and divergence of FQL. Since we employed
a uniform probability distribution for policy in that
example, updating the action value according to FQL
strongly differs from a Markov chain distribution, and
this difference causes FQL to quickly diverge. These
results indicate that, although FQL is known as an off-policy method, it strongly depends on the policy and
does not work well under a high exploration policy.
VII. APPENDIX: APPROXIMATE ACTION VALUE ITERATION (AAVI)

An improved version of approximate value iteration (not action value iteration) was presented in [19] and was guaranteed to have at least one fixed point. In each iteration, the approximate value of each state-action pair is computed using the approximate values of all states by:

$$Q(s, a) = r(s, a) + \gamma \sum_{s' \in S} p(s, a, s')\, V(s') \qquad (A1)$$

Then, these values are used in the Softmax formula to obtain the probability of action selection for each state. Since, in RL algorithms, the environment is assumed unknown, the quantities $r(s, a)$ and $p(s, a, s')$ are not available for computing the AVF in equation (A1). This leads us to present an AAVI by extending the improved approximate value iteration given in [19]. Our method approximates the AVF directly, and thus computing it by equation (A1) in each iteration is not needed. AAVI can be used for analyzing AVF approximation algorithms in RL such as linear Sarsa.

In what follows, we offer some new definitions to formulate the AAVI method. Because of the similarity of AAVI and the method in [19], only the equations necessary for the linear Sarsa analysis are discussed.

Consider an MDP environment [27]. A policy π is optimal if it maximizes the state-action values, or equivalently, if it maximizes the vector:

$$Q_{\pi} = \sum_{k=0}^{\infty} (\gamma P_{\pi})^{k}\, r \qquad (A2)$$

where r is the vector of the immediate rewards with dimension $|S|\cdot|A| \times 1$ for every state-action pair, and $P_{\pi}$ is the transition probability matrix from one state-action pair (s, a) to the next with dimension $|S|\cdot|A| \times |S|\cdot|A|$, whose elements are defined as follows:

$$p_{\pi}\big((s_i, a), (s_j, b)\big) = p\big((s_i, a), s_j\big)\,\pi(s_j, b) \qquad (A3)$$

where $s_i, s_j \in S$, $a, b \in A$, and $p((s_i, a), s_j)$ is the transition probability from $(s_i, a)$ to $s_j$ obtained from the environment model. The optimal AVF $Q^*$ uniquely solves Bellman's equation [3]:

$$Q^* = \max_{\pi}\{r + \gamma P_{\pi} Q^*\} \qquad (A4)$$

where the maximization is performed element-wise. If we define the dynamic programming operator T by:

$$T Q = \max_{\pi}\{r + \gamma P_{\pi} Q\} \qquad (A5)$$

then the optimal AVF can be characterized as the fixed point of the operator T. Moreover, we define an operator $T_{\pi}$ for every policy π as:

$$T_{\pi} Q = r + \gamma P_{\pi} Q \qquad (A6)$$

The action value iteration defined by $Q_{l+1} = T Q_l$ improves the estimate $Q_l$ in each iteration. Similar to equation (14), we can approximate the action value Q(s, a) in the l-th iteration using linear FAs with R basis functions by:

$$\hat Q_l(s, a) = \sum_{i=1}^{R} \varphi_i(s, a)\, w_l(i) \qquad (A7)$$

where w is the adjustable weight vector. Similar to the update law given in [19] for the approximate value iteration, we propose a new approximate action value iteration method with the following update law:

$$\hat Q_{l+1} = \Pi_{\pi_{w_l}} T_{\pi_{w_l}} \hat Q_l \qquad (A8)$$

where the projection operator $\Pi_{\pi_{w_l}} = \Phi(\Phi^T D_{\pi_{w_l}}\Phi)^{-1}\Phi^T D_{\pi_{w_l}}$ projects onto the space spanned by the basis functions, minimizing the weighted norm $\|Q - \Phi w\|_{D_{\pi_{w_l}}}$, and $D_{\pi_{w_l}}$ is a diagonal matrix with dimension $|S|\cdot|A| \times |S|\cdot|A|$ whose diagonal elements comprise the steady state probability of each state-action pair under policy $\pi_{w_l}$. Moreover, $T_{\pi_{w_l}}$ is the dynamic programming operator using the Softmax formula (18).
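One sweep of the AAVI update (A8) can be sketched as follows. To keep the sketch short, the steady-state weights d are passed in as an argument instead of being recomputed from $P_{\pi}$ at every iteration, and the dense-matrix construction, argument names, and sizes are our own illustrative assumptions.

```python
import numpy as np

def softmax_policy(q_row, tau):
    # Softmax (18) over the action values of one state.
    z = q_row / tau
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def aavi_iteration(w, Phi, p_env, r, S, A, gamma, tau, d):
    # Q_hat = Phi w is the current approximation (A7); rows are ordered (s0,a0),(s0,a1),...
    n = S * A
    Q_hat = Phi @ w
    pi = np.vstack([softmax_policy(Q_hat[s * A:(s + 1) * A], tau) for s in range(S)])
    # Build the state-action transition matrix P_pi with elements (A3).
    P_pi = np.zeros((n, n))
    for s in range(S):
        for a in range(A):
            for s2 in range(S):
                P_pi[s * A + a, s2 * A:(s2 + 1) * A] = p_env[s, a, s2] * pi[s2]
    TQ = r + gamma * P_pi @ Q_hat                       # dynamic programming operator (A6)
    D = np.diag(d)
    # D-weighted least-squares projection of T_pi Q_hat onto the span of Phi, as in (A8).
    w_next = np.linalg.solve(Phi.T @ D @ Phi, Phi.T @ D @ TQ)
    return w_next                                       # Q_hat_{l+1} = Phi @ w_next
```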
Lemma A1. The proposed AAVI (A8) has at least one fixed point for any τ > 0.
Proof. Defining the operator $H_{\tau}$ as:

$$H_{\tau}\hat Q = \Pi_{\pi_w} T_{\pi_w} \hat Q \qquad (A9)$$

we have $\hat Q_{l+1} = H_{\tau}\hat Q_l$. Theorem 5.1 in [19] proves the existence of a fixed point, for any τ > 0, of the corresponding operator (defined by $\hat V_{l+1} = H_{\tau}\hat V_l$) used in approximate value iteration, where $\hat V$ denotes the state value function.

Replacing the state space matrix, the state transition matrix, and V in the equations given in Section V of [19] with the state-action space matrix Φ (given in (15)), the state-action transition matrix $P_{\pi}$ (with elements defined in (A3)), and Q, respectively, and also using the projection operator of (A8) and the Softmax action selection (18), the existence of a fixed point for $H_{\tau}$ defined in (A9) is proven by a procedure similar to that presented in that paper.
REFERENCES
1. Juang, C. F., “An automatic building approach to
special Takagi–Sugeno fuzzy network for unknown
plant modeling and stable control,” Asian J. Control,
Vol. 5, No. 2, pp. 176–186 (2003).
2. Hasegawa, Y., T. Fukuda, and K. Shimojima,
“Self-scaling reinforcement learning for fuzzy logic
controller-applications to motion control of two-link
brachiation robot,” IEEE Trans. Ind. Electron., Vol.
46, No. 6, pp. 1123–1131 (1999).
3. Sutton, R. S. and A. G. Barto, Reinforcement
Learning: An Introduction, MIT Press, Cambridge,
MA, (1998).
4. Wiering, M. A., “Convergence and divergence in
standard and averaging reinforcement learning,”
Proc. Europ. Conf. Mach. Learn., Italy, pp. 477–
488 (2004).
5. Tsitsiklis, J. N. and B. Van Roy, “An analysis
of temporal-difference learning with function
approximation,” IEEE Trans. Autom. Control, Vol.
42, No. 5, pp. 674–690 (1997).
6. Baranyi, P., P. Korondi, R. J. Patton, and H.
Hashimoto, “Trade-Off between approximation
accuracy and complexity for TS fuzzy models,”
Asian J. Control, Vol. 6, No. 1, pp. 21–33
(2004).
7. Al-Gallaf, E. A., “Clustered based Takagi-Sugeno
neuro-fuzzy modeling of a multivariable nonlinear
dynamic system,” Asian J. Control, Vol. 7, No. 2,
pp. 163–176 (2005).
8. Jouffe, L., “Fuzzy inference system learning by
reinforcement methods,” IEEE Trans. Syst., Man,
Cybern. C, Vol. 28, No. 3, pp. 338–355 (1998).
9. Kim, M. S., G. G. Hong, and J. J. Lee, “Online fuzzy
Q-learning with extended rule and interpolation
technique,” Proc. IEEE Int. Conf. Intell. Robot.
Syst., Korea, Vol. 2, pp. 757–762 (1999).
10. Vengerov, D., N. Bambos, and H. R. Berenji, “A
fuzzy reinforcement learning approach to power
control in wireless transmitters,” IEEE Trans. Syst.,
Man, Cybern. B, Vol. 35, No. 4, pp. 768–778
(2005).
11. Su, S. F. and S. H. Hsieh, “Embedding
fuzzy mechanisms and knowledge in box-type
reinforcement learning controllers,” IEEE Trans.
Syst., Man, Cybern. B, Vol. 32, No. 5, pp. 645–653
(2002).
12. Beom, H. R. and H. S. Cho, “A sensor-based
navigation for a mobile robot using fuzzy logic and
reinforcement learning,” IEEE Trans. Syst., Man,
Cybern., Vol. 25, No. 3, pp. 464–477 (1995).
13. Barto, A. G., R. S. Sutton, and C. W. Anderson,
“Neuron-like adaptive elements that can solve
difficult learning control problems,” IEEE Trans.
Syst., Man, and Cybern., Vol. 13, pp. 834–846
(1983).
14. Ye, C., N. H. C. Yung, and D. Wang, “A
fuzzy controller with supervised learning assisted
reinforcement learning algorithm for obstacle
avoidance,” IEEE Trans. Syst., Man, Cybern. B, Vol.
33, No. 1, pp. 17–27 (2003).
15. Boada, M. J. L., V. Egido, R. Barber, and M.
A. Salichs, “Continuous reinforcement learning
algorithm for skills learning in an autonomous
mobile robot,” Proc. IEEE Int. Conf. Ind. Electron.
Soc., Spain, pp. 2611–2616 (2003).
16. Er, M. J. and C. Deng, “Online tuning of fuzzy
inference systems using dynamic fuzzy Q-learning,”
IEEE Trans. Syst., Man, Cybern. B, Vol. 34, No. 3,
pp. 1478–1489 (2004).
17. Baird, L. C., “Residual algorithms: Reinforcement
learning with function approximation,” Proc. 12th
Int. Conf. Mach. Learn., California, pp. 30–37
(1995).
18. Precup, D., R. S. Sutton, and S. Dasgupta, “Offpolicy temporal-difference learning with function
approximation,” Proc. 18th Int. Conf. Mach. Learn.,
Massachusetts, pp. 417–424 (2001).
19. De Farias, D. P. and B. Van Roy, “On the existence
of fixed points for approximate value iteration
and temporal-difference learning,” J. Optim. Theory
Appl., Vol. 105, No.3, pp. 25–36 (2000).
20. Gordon, G. J., “Reinforcement learning with
function approximation converges to a region,”
Proc. 8th Int. Conf. Neural Inf. Process. Syst.,
Colorado, pp. 1040–1046 (2000).
21. Perkins, T. J. and D. Precup, “A convergent form of
approximate policy iteration,” Proc. 9th Int. Conf.
Neural Inf. Process. Syst., Singapore, pp. 1595–
1602 (2000).
22. Kaelbling, L. P., M. L. Littman, and A. W. Moore,
“Reinforcement learning: A survey,” J. Artif. Intell.
Res., Vol. 4, pp. 237–285 (1996).
23. Watkins, C. and P. Dayan, “Q-Learning,” Mach.
Learn., Vol. 8, pp. 279–292 (1992).
24. Singh, S., T. Jaakkola, M. L. Littman, and C.
Szepesvari, “Convergence results for single-step onpolicy reinforcement learning algorithms,” Mach.
Learn., Vol. 39, pp. 287–308. (2000).
25. Tsitsiklis, J. N., “Asynchronous stochastic
approximation and Q-learning,” Mach. Learn., Vol.
16, pp. 185–202 (1994).
26. Benveniste, A., M. Métivier, and P. Priouret,
Adaptive Algorithms and Stochastic Approximations,
Springer, Berlin (1990).
27. Kalyanasundaram, S., E. K. P. Chong, and N. B.
Shroff, “Markov decision processes with uncertain
transition rates: sensitivity and max-min
control,” Asian J. Control, Vol. 6, No. 2, pp.
253–269 (2004).
Vali Derhami received his B.Sc.
and M.Sc. degrees from the control
engineering departments of Isfahan University of Technology, and
Tarbiat Modares University, Iran,
in 1996 and 1998, respectively.
Currently, he is a Ph.D. student in
Tarbiat Modares University. His
research interests are neural fuzzy
systems, intelligent control, reinforcement learning, and robotics.
Vahid Johari Majd was born in
1965 in Tehran, Iran. He received
his B.Sc. degree in 1989 from the
E.E. department of the University
of Tehran, Iran. He then received
his M.Sc. and Ph.D. degrees from
the E.E. department of the University of Pittsburgh, PA, U.S.A. in
1991 and 1995, respectively. He
is currently an associate professor
in the E.E. department of Tarbiat
Modares University, Tehran, Iran,
and is the director of intelligent
control systems laboratory. His areas of interest include
intelligent identification and control, agent based systems, neuro-fuzzy control systems, and chaotic systems.
Majid Nili Ahmadabadi
was
born in 1967. He received the B.S.
degree in mechanical engineering from the Sharif University of
Technology in 1990, and the M.Sc.
and Ph.D. degrees in information
sciences from Tohoku University,
Japan, in 1994 and 1997, respectively. In 1997, he joined the Advanced Robotics Laboratory at Tohoku University. Later, he moved
to the ECE Department of University of Tehran, where he is an
Associate Professor and the Head of Robotics and AI
Laboratory. He is also a Senior Researcher with the
School of Cognitive Sciences, Institute for Studies
on Theoretical Physics and Mathematics, Tehran. His
main research interests are multiagent learning, biologically inspired cognitive systems, distributed robotics,
and mobile robots.