Asian Journal of Control, Vol. 10, No. 5, pp. 535–549, September 2008. Published online in Wiley InterScience (www.interscience.wiley.com). DOI: 10.1002/asjc.054

FUZZY SARSA LEARNING AND THE PROOF OF EXISTENCE OF ITS STATIONARY POINTS

Vali Derhami, Vahid Johari Majd, and Majid Nili Ahmadabadi

ABSTRACT

This paper provides a new Fuzzy Reinforcement Learning (FRL) algorithm based on the critic-only architecture. The proposed algorithm, called Fuzzy Sarsa Learning (FSL), tunes the parameters of the conclusion parts of a Fuzzy Inference System (FIS) online. FSL is based on Sarsa, an on-policy method that approximates the Action Value Function (AVF). In each rule, actions are selected according to the proposed modified Softmax action selection, so that the final inferred action selection probability in FSL is equivalent to the standard Softmax formula. We prove the existence of fixed points for the proposed Approximate Action Value Iteration (AAVI). We then show that FSL satisfies the necessary conditions guaranteeing the existence of its stationary points, which coincide with the fixed points of the AAVI. We prove that the weight vector of FSL with a stationary action selection policy converges to a unique value. We also compare by simulation the performance of FSL and Fuzzy Q-Learning (FQL) in terms of learning speed and action quality. Moreover, we show by another example the convergence of FSL and the divergence of FQL when both algorithms use a stationary policy.

Key Words: Learning systems, fuzzy systems, reinforcement learning, Sarsa, stationary point.

I. INTRODUCTION

Designing an optimal controller for a complex system that interacts with an uncertain and nondeterministic environment is extremely difficult or even impossible. Hence, controller design in such environments is an area of interest for researchers [1]. Since there is usually little information about the desired output of the controller, an unsupervised learning algorithm is generally used in such systems. Reinforcement Learning (RL) is a modern and powerful approach for learning a control strategy online, using only a scalar performance index without any direct supervisor [2, 3]. Due to the curse of dimensionality, application of RL to control problems with large spaces requires the use of Function Approximators (FAs) [4, 5]. A Fuzzy Inference System (FIS) is a universal approximator, which offers powerful capabilities such as knowledge representation by if-then rules, modeling, and control of nonlinear uncertain systems with the desired accuracy [6, 7]. As such, some authors [8–12] have proposed Fuzzy Reinforcement Learning (FRL) and have employed it to tune fuzzy controllers.

Manuscript received August 24, 2006; revised March 27, 2007; accepted September 8, 2007. Vali Derhami is with the Electrical Engineering Department, Tarbiat Modares University, P.O. Box 14115-143, Tehran, Iran (email: [email protected]). Vahid Johari Majd is the corresponding author and is with the Electrical Engineering Department, Tarbiat Modares University, P.O. Box 14115-143, Tehran, Iran (e-mail: [email protected]). Majid Nili Ahmadabadi is with the Control and Intelligent Processing Center of Excellence, University of Tehran, School of Cognitive Science, Institute for Studies on Theoretical Physics and Mathematics, Tehran, Iran (e-mail: [email protected]). This research was supported in part by I.T.R.C. under contract No. T500/8211.

© 2008 John Wiley and Sons Asia Pte Ltd and Chinese Automatic Control Society

The two most well-known architectures used in FRL are actor-critic [12, 13] and critic-only methods [3]. Actor-critic based FRL uses one FIS to approximate the state value function and another FIS, the actor module, to generate actions. A major drawback of most actor-critic implementations is the lack of suitable exploration.
In order to solve this problem, some authors have proposed adding noise to the selected final action [14, 15]. Although this solution yields some performance enhancement, it suffers from exploration that is symmetrical around the action with the highest value, and from the lack of any relation between the action selection probability and the action value. In contrast with the above architecture, a critic-only based FRL uses an FIS only for approximating the Action Value Function (AVF), and the action selection probability of the output depends on this approximation. Such an action selection strategy improves the balance between exploration and exploitation [3, 8]. For these reasons, we focus on FRL with the critic-only architecture in this paper. Some authors have implemented the Q-learning method (which is an off-policy method) with linear FAs using fuzzy systems [8, 9]. This critic-only based FRL algorithm, called Fuzzy Q-Learning (FQL), has been employed in several problems [8, 16]. However, FQL is a heuristic method and lacks mathematical analysis. Moreover, divergence of the Q-learning algorithm combined with linear FAs has been shown by examples in [4, 17]. The possibility of divergence arises because the algorithm updates action values according to a distribution different from that of the Markov chain dynamics [18]. In addition, unlike standard Q-learning, the Q-learning method with FAs introduced in [8] does not preserve the off-policy property, and the final approximate action values in the limit will depend on the policy used. In contrast to Q-learning with linear FAs, there have been some analytical achievements for on-policy RL methods with linear FAs. In [19], a proof of the existence of stationary points was offered for the linear Temporal Difference (TD) method with Softmax action selection. The proof, however, only addresses the approximation of the state value function.
Thus, in order to use Softmax action selection, the approximation of the AVF is computed from the state value function, assuming that the model of the environment is known. In [20, 21], approximation of AVFs was carried out using a combination of Sarsa with linear FAs, which we call linear Sarsa. In [20], convergence of the weight parameters of linear Sarsa into a fixed region was presented only for the case where the policies in all episodes are stationary. In [21], the authors presented an approximate policy iteration algorithm. It has been shown that, if the policy improvement operator is a Lipschitz function, then the approximate policy iteration algorithm converges to a unique solution from any initial policy. However, the result has two drawbacks: First, the theorem offers no guarantee on the quality of the converged policy. Second, a new policy is generated only after the weight parameters have converged under the current policy. This significantly decreases the learning speed and makes the method inappropriate for control problems with online learning, where it is desirable to update the action policy at each step. In addition, the selection of a suitable linear FA and of an action selection method are challenging issues in these works. Contrary to standard Sarsa and Q-learning, no analytical proof is available to date for the convergence of critic-only based FRL when the policy is not stationary and/or when it changes according to the action values at each time step. In this paper, we first present a new FRL algorithm, called Fuzzy Sarsa Learning (FSL), and show that it approximates the AVF of discrete Sarsa. As a necessary condition for the convergence of the algorithm, we prove the existence of stationary points for FSL, which coincide with the fixed points of our proposed Approximate Action Value Iteration (AAVI).
To make the final inferred action selection probability in FSL equivalent to a Boltzmann distribution, we introduce a modified Softmax action selection for choosing an action among the candidate actions in each fuzzy rule. We also prove the convergence of the FSL weight vector to a unique value under a stationary policy. Similar to [10], although our proofs are carried out only for discrete action-state spaces, the FSL algorithm works well for continuous spaces, as shown by simulations. We compare the performance of FSL and FQL on the boat problem. Moreover, we show the convergence of FSL and the divergence of FQL under a stationary policy. The organization of this paper is as follows: In Section II, the RL algorithms are described. Fuzzy Sarsa learning is presented in Section III. In Section IV, the theoretical analysis of FSL is given. Simulation results are presented in Section V. Finally, the discussion and conclusion are given in Section VI.

II. REINFORCEMENT LEARNING

In an agent-based system with reinforcement learning, at each time step t the agent observes the current state of the environment and takes an action from the finite discrete set of actions A under the decided policy. Consequently, the environment goes to state s_{t+1} with transition probability p(s_t, a_t, s_{t+1}), and the agent receives the reinforcement signal r_{t+1} = r(s_t, a_t) [3]. The policy, denoted by π(s, a), is the probability of selecting action a in state s. The agent learns to take actions that reach the states with greater values. The value of state s under policy π is defined by the following function [3, 22]:

V^π(s) = E_π{ Σ_{k=0}^∞ γ^k r_{t+k+1} | s_t = s }   (1)

where γ ∈ [0, 1) is a discount factor and E_π{·} denotes the expected value under policy π.
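For concreteness, the discounted sum inside the expectation in (1) can be evaluated for a sampled episode and averaged over episodes. This is an illustrative sketch only: the episode generator and the choice γ = 0.9 are our assumptions, not part of the paper.

```python
GAMMA = 0.9  # illustrative discount factor; the paper leaves gamma generic

def discounted_return(rewards, gamma=GAMMA):
    """One realization of the sum in (1): sum_k gamma^k * r_{t+k+1}."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

def mc_value_estimate(sample_episode, n_episodes=1000):
    """Monte Carlo estimate of V^pi(s): average the discounted return over
    episodes started in s under policy pi. `sample_episode` is a hypothetical
    callable returning the reward sequence of one episode."""
    total = sum(discounted_return(sample_episode()) for _ in range(n_episodes))
    return total / n_episodes
```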
Similarly, the AVF is the value of action a in state s under policy π, and is defined by [3]:

Q^π(s, a) = E_π{ Σ_{k=0}^∞ γ^k r_{t+k+1} | s_t = s, a_t = a }   (2)

In the following, we briefly describe the TD method and its two widely used extensions for estimating AVFs.

2.1 Temporal difference learning

The simplest TD learning method, known as TD(0), is described as [3]:

V(s_t) ← V(s_t) + α_t [r_{t+1} + γ V(s_{t+1}) − V(s_t)]   (3)

where α is the learning rate and r is the immediate reward. The term inside the brackets is called the TD error.

2.1.1 Sarsa

The Sarsa method estimates the AVF for the current policy according to the following update formula [3]:

Q(s_t, a_t) ← Q(s_t, a_t) + α_t [r_{t+1} + γ Q(s_{t+1}, a_{t+1}) − Q(s_t, a_t)]   (4)

All elements of the quintuple of events (s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}) are used in the above update law; the name 'Sarsa' has been derived from these five letters.

2.1.2 Q-learning

Similar to Sarsa, the Q-learning method estimates the AVF, with the difference that the goal is to estimate the maximum AVF over all possible policies. The AVF in this algorithm is updated as follows [23]:

Q(s_t, a_t) ← Q(s_t, a_t) + α_t [r_{t+1} + γ max_{b∈A} Q(s_{t+1}, b) − Q(s_t, a_t)]   (5)

If both of the following assumptions hold, all three methods explained above converge to their optimal values [3, 23–25]:

Assumption 1. The problem environment is a Markov Decision Process (MDP), aperiodic and irreducible, with bounded reinforcement signals.

Assumption 2. The learning rate α_t is positive, nonincreasing, and satisfies:

Σ_{t=0}^∞ α_t = ∞,   Σ_{t=0}^∞ α_t² < ∞   (6)

III. FUZZY SARSA LEARNING

In this section, we introduce Fuzzy Sarsa Learning (FSL). Consider an n-input, one-output zero-order TSK fuzzy system [6, 7] with R rules of the following form:

R_i: If x_1 is L_{i1} and ... and x_n is L_{in}, then (a_{i1} with value w_{i1}) or ...
or (a_{im} with value w_{im})

where s = x_1 × ... × x_n is the n-dimensional input state vector, L_i = L_{i1} × ... × L_{in} is the n-dimensional strictly convex and normal fuzzy set of the i-th rule with a unique center, m is the number of possible discrete actions for each rule, a_{ij} is the j-th candidate action, and the weight w_{ij} is the approximated value of the j-th action in the i-th rule. The goal of FSL is to adapt the values w_{ij} online, to be used to obtain the best policy. The firing strength of each rule is computed by the product of the antecedent fuzzy sets. The normalized firing strength functions of the rules are considered as state basis functions. For the finite discrete set of environment states S, we can write the state space matrix Φ_s as:

Φ_s = [ φ_1^1  φ_2^1  ...  φ_R^1
        ...               ...
        φ_1^N  φ_2^N  ...  φ_R^N ]   (7)

where φ_i^j is the normalized firing strength of the i-th rule for state s_j (s_1, ..., s_N ∈ S), and N = |S|. Since the centers of the fuzzy sets are different, matrix Φ_s is full rank [10]. The system output a and the corresponding approximate AVF Q̂(s, a) are computed as follows:

a_t(s_t) = Σ_{i=1}^R φ_i(s_t) a_{i i⁺}   (8)

Q̂_t(s_t, a_t) = Σ_{i=1}^R φ_i(s_t) w_t^{i i⁺}   (9)

where i⁺ is the index of the action selected in the i-th rule according to the following proposed modified Softmax policy:

p(a_{ij}) = exp(φ_i w^{ij}/τ) / Σ_{k=1}^m exp(φ_i w^{ik}/τ)   (10)

In this formula, τ is the temperature parameter. The difference between this and the conventional Softmax formula is that the firing strength of each rule multiplies the value of each of that rule's candidate actions. As we will show in Theorem 1, this modification results in a final action selection whose probability complies with the standard Softmax formula.
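The per-rule selection (10) and the blending in (8) and (9) can be sketched as follows. This is a minimal illustration under our notation: the weight table and action list are hypothetical.

```python
import math
import random

def modified_softmax_probs(phi_i, w_i, tau):
    """Eq. (10): per-rule action probabilities. The firing strength phi_i
    multiplies each candidate-action value before the usual Softmax."""
    scores = [math.exp(phi_i * w_ij / tau) for w_ij in w_i]
    z = sum(scores)
    return [s / z for s in scores]

def fsl_inference(phi, w, actions, tau, rng=random):
    """Eqs. (8)-(9): pick one candidate action per rule via (10), then blend.
    phi: normalized firing strengths (length R); w: R x m weight table;
    actions: the m shared candidate actions."""
    a, q, selected = 0.0, 0.0, []
    for i in range(len(phi)):
        probs = modified_softmax_probs(phi[i], w[i], tau)
        j = rng.choices(range(len(actions)), weights=probs)[0]
        selected.append(j)
        a += phi[i] * actions[j]   # final inferred action, eq. (8)
        q += phi[i] * w[i][j]      # approximate AVF Q-hat, eq. (9)
    return a, q, selected
```

Note that when φ_i → 0 the rule's probabilities become uniform, so weakly fired rules explore more, which is the intent of multiplying by the firing strength in (10).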
The weight parameters of the i-th rule are updated by:

Δw_t^{ij} = { α_t ΔQ̂_t(s_t, a_t) φ_i(s_t)   if j = i⁺
            { 0                             otherwise   (11)

where ΔQ̂ is the approximate action value error, determined by:

ΔQ̂_t(s_t, a_t) = r_{t+1} + γ Q̂_t(s_{t+1}, a_{t+1}) − Q̂_t(s_t, a_t)   (12)

and γ and α are the discount factor and the learning rate, respectively. The FSL algorithm is summarized below:

1. Observe state s_{t+1} and receive the reinforcement signal r_{t+1}.
2. Select an action in each rule using the modified Softmax action selection (10).
3. Compute the final action a_{t+1} and the approximate AVF Q̂_t(s_{t+1}, a_{t+1}) using (8) and (9), respectively.
4. Compute ΔQ̂ and update w by (12) and (11), respectively.
5. Compute the new approximate AVF Q̂_{t+1}(s_{t+1}, a_{t+1}) using (9).
6. Apply the final action.
7. t ← t + 1 and return to step 1.

IV. THEORETICAL ANALYSIS OF FSL

In this section, we provide theoretical results concerning the existence of stationary points for FSL and the convergence of its weight vector.

4.1 The existence of stationary points for the FSL algorithm

As we will show, FSL is an implementation of linear Sarsa with a Softmax action selection policy at each time step. Hence, we first prove the existence of stationary points for linear Sarsa when the actions are selected according to the Softmax formula. In the special case of linear FAs with R basis functions, we can approximate the AVF by:

Q̂_t(s, a) = Σ_{i=1}^R φ_i(s, a) w_t(i)   (13)

where w is the adjustable weight vector and φ_1, ..., φ_R are the fixed state-action basis functions on the state-action space [21]. The above equation can be written in the following vector form:

Q̂_t(s, a) = φ^T(s, a) w_t   (14)

where φ^T(s, a) = [φ_1(s, a), ..., φ_R(s, a)]. Defining the state-action space matrix Φ of dimension |S|·|A| × R as:

Φ = [ φ_1(s_1, a_1)  ...  φ_R(s_1, a_1)
      ...
      φ_1(s_1, a_M)  ...  φ_R(s_1, a_M)
      ...
      φ_1(s_N, a_1)  ...  φ_R(s_N, a_1)
      ...
      φ_1(s_N, a_M)  ...  φ_R(s_N, a_M) ]
  = [Φ_1 | ... | Φ_R] = [ φ^T(s_1, a_1); ... ; φ^T(s_N, a_M) ]   (15)

where M = |A|, we have:

Q̂_t = Φ w_t   (16)

Without loss of generality, we assume the following:

Assumption 3. The matrix Φ is full rank.

Let D_π be a diagonal matrix of dimension |S|·|A| × |S|·|A| whose diagonal elements comprise the steady-state probabilities of the state-action pairs under policy π. Using a stochastic gradient descent method to decrease the sum squared error ‖Q − Q̂‖²_{D_π} by adjusting the weight vector w in linear Sarsa, the weight vector update law after executing action a_t turns out to be [17]:

w_{t+1} = w_t + α_t φ(s_t, a_t)[r_{t+1} + γ φ^T(s_{t+1}, a_{t+1}) w_t − φ^T(s_t, a_t) w_t]   (17)

Assumption 4. The action selection policy in each state is obtained using the following Softmax formula [3]:

π_{τ,w}(s, a) = exp(Q̂(s, a)/τ) / Σ_{b∈A} exp(Q̂(s, b)/τ)   (18)

Since the action is generated using the approximate AVF, the action selection policy depends on τ and w; to emphasize this, we place τ and w in the subscript of the policy in equation (18). Notice that the Softmax formula (18) is a continuous probability distribution function. This is an important feature when discussing convergence in continuous RL [21]. Moreover, equation (18) depends on all action values, which provides suitable exploration for the algorithm [3].

Lemma 1. In the linear Sarsa algorithm (17), the stochastic variable w_t under Assumptions 1–4 asymptotically follows the trajectory w of the following Ordinary Differential Equation (ODE):

ẇ = Φ^T D_{π_w} (r + γ P_{π_w} Φ w − Φ w)   (19)

where P_{π_w} is the transition probability matrix for moving from one state-action pair to another under policy π_w.

Proof. The update law in (17) is a stochastic recursive algorithm.
Thus, based on [19, 26], under the four conditions (C1–C4) given below, we can apply ODE methods to analyze a recursive stochastic algorithm:

(C1) The algorithm should be of the form:

w_{t+1} = w_t + α_t H(w_t, X_{t+1})   (20)

where w lies in ℝ^d and the state X_t lies in ℝ^k.

(C2) The learning rate sequence should be positive, decreasing, and satisfy Σ_{t=0}^∞ α_t = ∞ and Σ_{t=0}^∞ α_t^β < ∞ for some β > 1.

(C3) X_t should be of the form:

X_t = f(ξ_t)   (21)

where, for a fixed w, the extended state ξ_t is a Markov chain with transition probability p_w(ξ_t) as a function of w; the row vector p_w(ξ_t) is equal to the ξ_t-th row of the transition probability matrix P_w. Moreover, for all w, the Markov chain {ξ_t} should have a unique stationary asymptotic behavior.

(C4) The mean vector field defined by h(w) = lim_{t→∞} E_w(H(w, X_t)) should exist and be regular, which constructs the ODE ẇ = h(w).

Now, we verify the above conditions for linear Sarsa (17). Let:

z_t = (s_t, a_t),   ξ_t = [z_t, z_{t−1}, r_t]^T   (22)

X_t = f(ξ_t) = [φ(z_t), φ(z_{t−1}), r_t]^T   (23)

H(w_t, X_{t+1}) = φ(z_t)(r_{t+1} + γ φ^T(z_{t+1}) w_t − φ^T(z_t) w_t)   (24)

Then, substituting (24) into (20) yields (17); thus, the algorithm satisfies condition C1. Moreover, Assumption 2 satisfies condition C2 with β = 2. To verify condition C3, note that the extended state ξ_t includes three parts: the reward signal r_t, which is given by the MDP of Assumption 1; z_{t−1}, which is known at time t; and z_t, which evolves with transition probability:

P_{π_w}(z_{t−1}, z_t) = P((s_{t−1}, a_{t−1}), s_t) π_w(s_t, a_t)   (25)

The elements of matrix P_{π_w} can be calculated for any fixed w over the entire space. According to Assumption 1, P is irreducible and aperiodic. Moreover, π_w follows a Boltzmann distribution and is thus an analytic function satisfying π_w(s, a) > 0 and Σ_{a∈A} π_w(s, a) = 1. Therefore, for a fixed w, P_{π_w} is irreducible and aperiodic, and {ξ_t} is a Markov chain with a unique stationary asymptotic behavior. Hence, condition C3 is satisfied.
As mentioned, π_w follows a Boltzmann distribution, which is an analytic function; hence, considering Assumption 1, the steady-state probability distribution d_{π_w}(z) under policy π_w is invariant and a regular function. Therefore, the expectation in condition C4 can be calculated as:

h(w) = Σ_{z ∈ S×A} d_{π_w}(z) φ(z)[r(z) + γ p_{π_w}(z) Φ w − φ^T(z) w]   (26)

where p_{π_w}(z) is the z-th row of the transition probability matrix P_{π_w}. In vector form, equation (26) becomes:

h(w) = Φ^T D_{π_w} (r + γ P_{π_w} Φ w − Φ w)   (27)

where the diagonal matrix D_{π_w} has dimension |S|·|A| × |S|·|A| with elements d_{π_w}(z) for the different values of z. Thus condition C4 holds, and the resulting ODE ẇ = h(w) is exactly (19). ∎

Lemma 2. The linear Sarsa algorithm (17) under Assumptions 1–4 has stationary points, which coincide with the fixed points of the proposed AAVI (equation (A8) in the Appendix).

Proof. According to Lemma 1, the sequence of estimates w obtained from equation (17) asymptotically follows the ODE trajectory of equation (19). Vector w is a stationary point of this ODE if and only if ẇ = 0, which implies:

Φ^T D_{π_w} Φ w = Φ^T D_{π_w} (r + γ P_{π_w} Φ w)   (28)

With reference to Lemma A1 (see Appendix), the proposed AAVI defined in (A8) has fixed points; hence, there is a weight vector w* such that:

Φ w* = Π_{π_w} T_{π_w} Φ w*   (29)

Considering the projection operator Π_{π_w} and the dynamic programming operator T_{π_w} defined in (A8) and (A6), respectively, and using the Softmax policy (18), equation (29) can be written as:

w* = (Φ^T D_{π_w} Φ)^{−1} Φ^T D_{π_w} (r + γ P_{π_w} Φ w*)   (30)

Multiplying both sides of equation (30) by Φ^T D_{π_w} Φ yields:

Φ^T D_{π_w} Φ w* = Φ^T D_{π_w} Φ (Φ^T D_{π_w} Φ)^{−1} Φ^T D_{π_w} (r + γ P_{π_w} Φ w*)   (31)

Simplifying the right-hand side, the above equation reduces to equation (28). This result shows that the fixed points of the proposed AAVI are stationary points of linear Sarsa. Next, we show that the stationary points of linear Sarsa are fixed points of the AAVI as well. Assume vector w* is a stationary point of linear Sarsa (17). Thus, we have:

Φ^T D_{π_w} Φ w* = Φ^T D_{π_w} (r + γ P_{π_w} Φ w*)   (32)

Multiplying both sides of the above equation by (Φ^T D_{π_w} Φ)^{−1} results in:

w* = (Φ^T D_{π_w} Φ)^{−1} Φ^T D_{π_w} (r + γ P_{π_w} Φ w*)   (33)

Using the operators Π_{π_w} and T_{π_w} and policy (18), the above equation reduces to:

Φ w* = Π_{π_w} T_{π_w} Φ w*   (34)

Therefore, the stationary points of the linear Sarsa algorithm (17) coincide with the fixed points of the AAVI in (A8). ∎

Theorem 1. The proposed FSL algorithm under Assumptions 1 and 2 has stationary points, which coincide with the fixed points of the proposed AAVI.

Proof. We first show that the FSL algorithm is an implementation of linear Sarsa (17). We may expand the i-th rule into m separate rules as follows:

If s is L_i then a_{i1} with w_{i1}
If s is L_i then a_{i2} with w_{i2}
...
If s is L_i then a_{im} with w_{im}

Thus, we have R series of m rules each. In each action selection, exactly one rule is selected from each series, and the combination of the R selected rules generates the final action. Hence, we can define the state-action basis vector of length m·R as follows:

φ(s, a) = [0 ... φ_1(s) ... 0 | 0 ... φ_2(s) ... 0 | ... | 0 ... φ_R(s) ... 0]^T   (35)

where each block has length m, the number of candidate actions in each rule. As seen in (35), in each block of elements only one element is nonzero, depending on which rule is chosen in that series. For each input state, m^R distinct vectors φ are possible. Therefore, the state-action space matrix of the problem has dimension |S|·m^R × m·R.
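The block structure of (35), and how it recovers (9) as the inner product (14), can be made concrete with a small helper (an illustrative sketch; the function names are ours):

```python
def state_action_basis(phi, selected, m):
    """Eq. (35): a vector of length m*R built from R blocks of length m.
    Block i carries phi_i at the slot of the action chosen in rule i
    (selected[i] plays the role of i-plus) and zeros elsewhere."""
    vec = []
    for phi_i, j in zip(phi, selected):
        block = [0.0] * m
        block[j] = phi_i
        vec.extend(block)
    return vec

def q_hat(basis, w_flat):
    """Eq. (14): Q-hat = phi(s, a)^T w, with w flattened as in (36)."""
    return sum(b * wi for b, wi in zip(basis, w_flat))
```

For R = 2 rules and m = 2 actions, with firing strengths (0.7, 0.3) and the second and first candidate actions selected, the basis is [0, 0.7, 0.3, 0]; its inner product with the flattened weight vector reproduces the sum in (9).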
The basis vectors φ^T(s, a) constitute the rows of the state-action space matrix Φ, with m·R adjustable weight parameters:

w^T = [w^{11}, ..., w^{1m}, w^{21}, ..., w^{2m}, ..., w^{R1}, ..., w^{Rm}]   (36)

Using (35) and (36), equation (9) reduces to equation (14). It can easily be seen that substituting equations (35), (36), and (14) into equations (11) and (12) yields equation (17). Therefore, the update law of the FSL algorithm is equivalent to that of the linear Sarsa algorithm. Considering the above results and the assumptions of this theorem, if we prove that Assumptions 3 and 4 hold for FSL, then the result of Lemma 2 holds for FSL as well.

a) To show that the columns of matrix Φ are linearly independent, let Φ B = 0 for some vector B. By the block structure in (35), each row of Φ B = 0 is a homogeneous equation of the form:

φ_1^j B_{k_1} + φ_2^j B_{m+k_2} + ... + φ_R^j B_{m(R−1)+k_R} = 0   (37)

where j = 1, ..., N indexes the state and k_i ∈ {1, ..., m} indexes the action chosen in rule i. Choosing, for every state, the rows in which the first candidate action is selected in every rule (k_1 = ... = k_R = 1) gives the homogeneous system:

φ_1^j B_1 + φ_2^j B_{m+1} + ... + φ_R^j B_{m(R−1)+1} = 0,   j = 1, ..., N   (38)

Considering the state space matrix Φ_s defined in (7), equation (38) can be written as:

Φ_s × [B_1, B_{m+1}, ..., B_{m(R−1)+1}]^T = 0   (39)

Since Φ_s is full rank, we have B_{1+km} = 0 for k = 0, 1, ..., R−1. A similar procedure for the rows corresponding to the i-th candidate action in each rule results in B_{i+km} = 0. This yields B = 0, so Φ is full rank and Assumption 3 holds.

b) To show that the probability of the final action selection p(a) complies with the standard Softmax formula, we write p(a) as the product of the action selection probabilities of all rules:

p(a) = Π_{i=1}^R [ exp(φ_i w^{i i⁺}/τ) / Σ_{j=1}^m exp(φ_i w^{ij}/τ) ]
     = exp( (Σ_{i=1}^R φ_i w^{i i⁺}) / τ ) / Π_{i=1}^R Σ_{j=1}^m exp(φ_i w^{ij}/τ)   (40)

Considering (9), the numerator equals exp(Q̂(s, a)/τ); expanding the product of sums in the denominator over all m^R composite actions shows that the above equation simplifies to equation (18). ∎

4.2 Convergence of the FSL algorithm under a stationary policy

As proven in Theorem 1, FSL is an implementation of linear Sarsa(λ) with λ = 0, where λ is the eligibility trace decay parameter used in the TD(λ) method [3]. If the weight vector in FSL(λ) with λ ∈ [0, 1] is updated according to:

Δw_t^{ij} = α_t ΔQ̂_t(s_t, a_t) e_t^{ij},   i = 1, 2, ..., R,   j = 1, 2, ..., m   (41)

where e is the eligibility trace, updated as follows [5]:

e_t^{ij} = { γλ e_{t−1}^{ij} + φ_i(s_t)   if j = i⁺
           { γλ e_{t−1}^{ij}             otherwise   (42)

and initialized with e_{−1} = 0, then FSL(λ) is an implementation of linear Sarsa(λ).

Lemma 3.
In the special case where the action selection policy is stationary, the FSL(λ) algorithm under Assumptions 1 and 2 converges to a unique value with probability 1. In addition, the final weight parameter vector w* satisfies the following inequality:

‖Φ w* − Q^π‖_{D_π} ≤ ((1 − λγ)/(1 − γ)) ‖Π Q^π − Q^π‖_{D_π}   (43)

Proof. According to Theorem 1, the update law of the FSL(λ) algorithm is equivalent to that of the linear Sarsa(λ) algorithm, and the columns of matrix Φ are linearly independent. The convergence of the state value function in the linear TD(λ) method to a unique value under Assumptions 1–3 and a constant state transition probability matrix has been proven in Theorem 1 of [5]. If we consider transitions from one state-action pair to the next and use the state-action basis functions instead of the state basis functions, then equation (41) is equivalent to the update law given in Theorem 1 of [5]. Under a stationary policy, all the elements of the transition matrix are constant; hence, in this case, the results of the mentioned theorem (convergence, uniqueness, and error bound) hold for FSL(λ) as well. ∎

V. SIMULATION

In this section, we first demonstrate the performance of FSL versus FQL on the boat problem, which was used in [8] for the evaluation of FQL. Then, both algorithms are compared in terms of convergence under a stationary policy in a multi-goal environment.

5.1 Boat problem

We use FSL and FQL to tune a fuzzy controller that drives a boat from the left bank of a river with a strong nonlinear current to a quay on the right bank. The goal is to reach the quay from any position on the left bank (see Fig. 1).

Fig. 1. The boat problem [8].

5.1.1 System dynamics

This problem has two continuous state variables, namely the x and y positions of the boat's bow, each ranging from 0 to 200. The quay center is located at (200, 100), and the quay has a width of five. The river current is [8]:

E(x) = f_c [x/50 − (x/100)²]   (44)

where f_c = 1.25 is the current force.
The new bow position of the boat is computed by:

x_{t+1} = min(200, max(0, x_t + s_{t+1} cos(θ_{t+1})))
y_{t+1} = min(200, max(0, y_t − s_{t+1} sin(θ_{t+1}) − E(x_{t+1})))   (45)

The boat angle θ_t and speed s_t are computed by the following difference equations:

ω_{t+1} = ω_t + I δ_{t+1}
θ_{t+1} = θ_t + (ω_{t+1} − θ_t)(s_{t+1}/S_Max)
s_{t+1} = s_t + (S_des − s_t) I
δ_{t+1} = min(max(p(a_{t+1} − θ_{t+1}), −45°), 45°)   (46)

where I = 0.1 is the system inertia, S_Max = 2.5 is the maximum possible speed of the boat, S_des = 1.75 is the desired speed, δ_t is the rudder angle, and p = 0.9 is the proportional coefficient used to compute the rudder angle needed to reach the fuzzy controller's desired direction a_t [8]. The computations are performed every 0.1 second. However, in order to observe the effect of the applied control, the controller and the learning mechanisms are triggered every one second.

Fig. 2. Input membership functions: five fuzzy sets (VLow, Low, Med, High, VHigh) over each of x and y, ranging from 0 to 200.

5.1.2 Learning details

As shown in Fig. 2, five fuzzy sets are used to partition each input variable, resulting in 25 rules [8]. For the sake of simplicity, the possible discrete actions are the same for all rules, though it is possible to assign a different action set to each rule. The discrete action set is made of 12 directions, ranging from southwest to north: U = {−100, −90, −75, −60, −45, −35, −15, 0, 15, 45, 75, 90}. The controller generates continuous actions by combining these discrete actions. To keep the conditions the same for FQL and FSL, our modified Softmax action selection policy (10) is used for both methods.
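The system dynamics (44)-(46) can be combined into one simulation step as sketched below. This is our reading of the extracted equations; in particular, the printed order of the updates in (46) is ambiguous, so here the rudder angle is computed from the previous boat angle.

```python
import math

F_C, I, S_MAX, S_DES, P = 1.25, 0.1, 2.5, 1.75, 0.9  # constants from the text

def current(x):
    """Eq. (44): nonlinear river current profile E(x)."""
    return F_C * (x / 50.0 - (x / 100.0) ** 2)

def clamp(v, lo, hi):
    return min(hi, max(lo, v))

def boat_step(x, y, theta, s, omega, a):
    """One update of eqs. (45)-(46); angles in degrees, `a` is the
    fuzzy controller's desired direction."""
    delta = clamp(P * (a - theta), -45.0, 45.0)        # rudder angle
    omega = omega + I * delta
    s_new = s + (S_DES - s) * I
    theta = theta + (omega - theta) * (s_new / S_MAX)
    rad = math.radians(theta)
    x_new = clamp(x + s_new * math.cos(rad), 0.0, 200.0)  # eq. (45)
    y_new = clamp(y - s_new * math.sin(rad) - current(x_new), 0.0, 200.0)
    return x_new, y_new, theta, s_new, omega
```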
The reinforcement signal is zero during the traverse of the boat, and takes a nonzero value based on the y position of the boat upon reaching one of three zones of the right bank: the success zone Z_s, which corresponds to the quay (x = 200, y ∈ [97.5, 102.5]); the viability zone Z_v, where x = 200 and y ∈ [92.5, 97.5] ∪ [102.5, 107.5]; and the failure zone Z_f, which includes all other points of the right bank. The reinforcement function is defined by [8]:

R(x, y) = { +1        (x, y) ∈ Z_s
          { D(x, y)   (x, y) ∈ Z_v
          { −1        (x, y) ∈ Z_f
          { 0         otherwise   (47)

where D(x, y) is a function that decreases linearly from 1 to −1 with the distance from the success zone. The evaluation of each experiment is the average of the performance measures introduced in [8] over 100 runs. Every run includes a "learning phase" and a "testing phase". For the learning phase, we generate 100 sets of random positions; every set comprises 5000 random values for the y-axis of the starting point of the boat (x = 10, y random). The learning phase finishes when either 40 successive non-failure episodes are reached or the number of episodes exceeds 5000. We call the episode number at the end of the learning phase the Learning Duration Index (LDI), which is a measure of the learning time. The testing phase is made of 40 episodes whose initial points are specified by x = 10 and y distributed at equal distances in the range 0 to 200. We define the distance error of the reached bank position relative to the quay center by:

d(x, y) = { |y − 100|         if the right bank is reached
          { 100 + (200 − x)   otherwise   (48)

We then define the Distance Error Index (DEI) as the average of d over the 40 episodes of the testing phase, which is a measure of the learning quality.
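The zone-based reinforcement (47) and the distance error (48) can be sketched as follows. The exact shape of D(x, y) is not fully specified in the text, so the linear profile here (1 at the edge of Z_s down to −1 at the outer edge of Z_v) is our assumption:

```python
def reward(x, y):
    """Eq. (47): nonzero only on the right bank (x = 200)."""
    if x < 200:
        return 0.0
    if 97.5 <= y <= 102.5:                       # success zone Z_s
        return 1.0
    if 92.5 <= y < 97.5 or 102.5 < y <= 107.5:   # viability zone Z_v
        # D(x, y): assumed linear, 1 at the quay edge, -1 at the Z_v edge.
        dist = abs(y - 100.0) - 2.5              # 0 at quay edge, 5 at Z_v edge
        return 1.0 - 2.0 * (dist / 5.0)
    return -1.0                                  # failure zone Z_f

def distance_error(x, y):
    """Eq. (48): distance of the reached position from the quay center."""
    return abs(y - 100.0) if x >= 200 else 100.0 + (200.0 - x)
```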
The learning rate and the temperature parameter decrease during the learning phase as follows [3]:

α_t = { α_{t−1}/1.001   if t = 4k
      { α_{t−1}         otherwise   (49)

τ_t = { τ_{t−1} − (0.99)^t × 0.4 × τ_{t−1}   if t = 4k and t < 100
      { τ_{t−1} − (0.99)^t × 0.2 × τ_{t−1}   if t = 4k and t ≥ 100
      { τ_{t−1}                              otherwise   (50)

where k is a positive integer. To increase exploration and to escape from local extrema, if the learning phase of an experiment does not end after 1500 episodes, τ is reset to its initial value and α is set to 0.25 of its initial value.

5.1.3 Simulation results

Table I shows the results of four experiments with FSL and FQL for different initial learning rates and temperatures.

Table I. Simulation results for different initial learning rates and temperatures.

Initial parameters    | Method | Avg. DEI | Avg. LDI | Std (LDI) | Success rate | Viability rate | Failure rate
τ_0 = 1, α_0 = 0.1    | FQL    | 7.94     | 1309     | 1184      | 49.175       | 43.575         | 7.25
                      | FSL    | 4.45     | 1010     | 1065      | 56.85        | 40.325         | 2.825
τ_0 = 0.01, α_0 = 0.1 | FQL    | 12.26    | 1227     | 1486      | 46.85        | 41.875         | 11.275
                      | FSL    | 7.09     | 962      | 1050      | 50.10        | 44.50          | 5.40
τ_0 = 0.01, α_0 = 0.01| FQL    | 8.69     | 733      | 1084      | 49.55        | 43.75          | 6.7
                      | FSL    | 8.68     | 698      | 1013      | 51.275       | 42.475         | 6.25
τ_0 = 5, α_0 = 0.1    | FQL    | 8.71     | 1456     | 975       | 51.35        | 41.675         | 6.975
                      | FSL    | 8.38     | 1290     | 900       | 51.425       | 41.85          | 6.725

DEI, Distance Error Index; FQL, Fuzzy Q-Learning; FSL, Fuzzy Sarsa Learning; LDI, Learning Duration Index.

The average DEI, average LDI, success rate, viability rate, and failure rate (zones defined in (47)) were computed by averaging over 100 runs for each of the four experiments. As the results show, the training time and the action quality of FSL are considerably better than those of FQL. Moreover, the initial value of the temperature parameter greatly influences the learning quality and the learning speed, and a medium initial temperature value appears to be the most suitable.
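The decay schedules (49) and (50) can be sketched directly; we read "t = 4k" as "t is a positive multiple of 4":

```python
def next_alpha(alpha_prev, t):
    """Eq. (49): learning-rate decay, applied every 4th step."""
    return alpha_prev / 1.001 if t % 4 == 0 else alpha_prev

def next_tau(tau_prev, t):
    """Eq. (50): temperature decay; a faster rate early on (t < 100),
    a slower rate afterwards."""
    if t % 4 == 0:
        rate = 0.4 if t < 100 else 0.2
        return tau_prev - (0.99 ** t) * rate * tau_prev
    return tau_prev
```

Note that the (0.99)^t factor makes the per-step decrement itself shrink over time, so both parameters decay quickly at first and then flatten out, which matches the intended shift from exploration to exploitation.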
It is noteworthy that, in the Softmax formula, the degree of exploration depends on the temperature factor as well as on the spread of the action values $\Delta\hat{Q}(s, a)$. If this spread is small, then even a medium temperature factor causes high exploration; on the other hand, if the spread is large, even a small temperature factor causes high exploitation.

Figs 3 and 4 show the histograms of the LDIs of both algorithms for $\tau_0 = 1$, $\alpha_0 = 0.1$. As the figures show, compared to those of FQL, not only are the LDIs in FSL closer together, but there are also fewer instances of reaching the upper bound (5000 episodes). This means that FSL learns much faster and with much better accuracy.

Fig. 3. Histogram of learning duration indexes for Fuzzy Sarsa Learning.

Fig. 4. Histogram of learning duration indexes for Fuzzy Q-Learning.

5.2 Multi-goal obstacle-free environment

To demonstrate the convergence of our algorithm (FSL) under a stationary policy, and to show the possibility of divergence of FQL, we use both algorithms to approximate the AVF in an 11 × 11 multi-goal obstacle-free environment, shown in Fig. 5. This environment is similar to the grid-world environment presented in [18], except that the states ∈ [0, 11] × [0, 11] and the actions ∈ [0°, 360°] are continuous. The action is the movement angle of the agent with respect to the horizontal axis; hence, the agent moves right with a 0° angle, down with a 90° angle, etc. The agent moves with unit constant steps in the direction of the final inferred action angle.
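The agent kinematics just described admit a direct sketch. The orientation of the y axis (growing downward) is inferred from the convention that a 90° action moves the agent down; the function name is ours.

```python
import math

def step(x, y, angle_deg):
    """One unit-length move at the given angle: 0 deg moves right,
    90 deg moves down (y grows downward, per the text's convention)."""
    nx = x + math.cos(math.radians(angle_deg))
    ny = y + math.sin(math.radians(angle_deg))
    # bumping into a wall leaves the agent in the same state
    if not (0.0 <= nx <= 11.0 and 0.0 <= ny <= 11.0):
        return x, y
    return nx, ny
```

An arbitrary continuous angle thus yields a continuous successor state, which is what makes a continuous action-selection probability (Section 5.2.1) meaningful here.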
Fig. 5. A multi-goal obstacle-free environment.

Fig. 6. Input membership functions.

Fig. 7. Weight values of Rules 1 and 2 in Fuzzy Sarsa Learning.

If the agent bumps into a wall, it remains in the same state. The four corners are terminal states. The agent receives a reward of +1 for entering the bottom-right and upper-left corners, and −1 for entering the other two corners. All other actions have zero reward.

5.2.1 Learning details

The inputs of the FIS are the normalized position (x, y) of the agent. As shown in Fig. 6, two Gaussian fuzzy sets (Low and High) partition each dimension, resulting in the following four rules:

Rule 1: If x is Low and y is Low, then R with w_1R or D with w_1D or L with w_1L or U with w_1U.
Rule 2: If x is Low and y is High, then R with w_2R or D with w_2D or L with w_2L or U with w_2U.
Rule 3: If x is High and y is Low, then R with w_3R or D with w_3D or L with w_3L or U with w_3U.
Rule 4: If x is High and y is High, then R with w_4R or D with w_4D or L with w_4L or U with w_4U.

In the above rules, R, D, L, and U denote the right, down, left, and up candidate actions, respectively, and the weight vector w is defined according to equation (36). The continuous final action is inferred by directional combination of these discrete actions in the two-dimensional action space. In each rule, the action selection probability is equal for all candidate actions; hence, the policy is stationary. We apply both FQL(λ) and FSL(λ) to this problem with λ = 0.9, w_0 = 0̄, and α_0 = 0.3. The learning rate decreases throughout the run. Each run includes 10,000 episodes.
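The two-Gaussian partition of each input can be sketched as follows. The centers 0 (Low) and 1 (High) follow Fig. 6, while the width `sigma` is an illustrative guess, since the figure's exact parameters are not recoverable from the text:

```python
import numpy as np

def gaussian(u, c, sigma):
    """Gaussian membership value of input u for a set centered at c."""
    return np.exp(-((u - c) ** 2) / (2 * sigma ** 2))

def firing_strengths(x, y, sigma=0.35):
    """Normalized firing strengths of the four rules for normalized
    inputs x, y in [0, 1]; product t-norm combines the antecedents."""
    low_x, high_x = gaussian(x, 0.0, sigma), gaussian(x, 1.0, sigma)
    low_y, high_y = gaussian(y, 0.0, sigma), gaussian(y, 1.0, sigma)
    phi = np.array([low_x * low_y,    # Rule 1: x Low,  y Low
                    low_x * high_y,   # Rule 2: x Low,  y High
                    high_x * low_y,   # Rule 3: x High, y Low
                    high_x * high_y]) # Rule 4: x High, y High
    return phi / phi.sum()
```

At the center of the environment all four rules fire equally, which is consistent with the symmetric weight values reported below for this example.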
An episode starts from the center (5.5, 5.5) and finishes when the agent reaches one of the corners or the number of steps exceeds 1000. We executed 10 independent runs. Figs 7 and 8 show the weight values corresponding to the right and left candidate actions of all rules. We show only two weight values per rule because, in Rules 1 and 4, the trajectories of the right-action and left-action weights are fairly similar to those of the down-action and up-action weights, respectively; likewise, in Rules 2 and 3, the trajectories of the right-action and left-action weights are similar to those of the up-action and down-action weights, respectively.

As predicted by Lemma 3, the weight values in FSL(λ) converged to a unique value. The final weight values in different runs are very close to each other, with a maximum standard deviation of 0.007 and an average weight vector over the 10 runs of

w = [0.2153, 0.2168, 0.3204, 0.3204, −0.2130, −0.3176, −0.3198, −0.2153, −0.3162, −0.2119, −0.2144, −0.3185, 0.3155, 0.3172, 0.2112, 0.2108].

As is clear from the figures, the weight parameters converged to their final values in about 10,000 time steps. Notice that, due to the symmetry of the environment in this specific example, the corresponding weight values of symmetric rules are equal. In contrast, the weight parameters in FQL(λ) diverged; Fig. 9 shows one of the diverging weights in FQL(λ).

Fig. 8. Weight values of Rules 3 and 4 in Fuzzy Sarsa Learning.

Fig. 9. The trajectory of the right action weight of Rule 1 in Fuzzy Q-Learning.

VI. DISCUSSION AND CONCLUSION

In this paper, we concentrated on critic-only based FRL algorithms.
Employing fuzzy systems enables us to embed human experience in the learning system. We presented a new on-policy method, called FSL, based on Sarsa. FSL has two algorithmic differences from FQL. The first difference is in the method of computation of Q̂, analogous to the difference between discrete Q-learning and discrete Sarsa. However, because an FA is used to approximate the action value, updating the weights of either algorithm at every step affects all approximated state-action values, not only that of the selected action. Hence, unlike discrete Q-learning and Sarsa, the two algorithms do not behave identically even under a greedy policy.

The second difference lies in the action selection strategy. In the FQL presented in [8] and [16], a mixed exploration-exploitation strategy is employed for action selection in each rule. This type of action selection does not guarantee that the final action is generated according to a continuous probability function. In continuous reinforcement learning, however, actions should be generated by a continuous probability function so that small changes in the value of an action do not create substantial changes in the control behavior [21]. To maintain this feature, we employed a modified Softmax action selection in each rule of the proposed FSL, so that the probability function of final action selection becomes continuous.

The main advantage of FSL over FQL is the existence of theoretical analysis concerning the existence of stationary points. Moreover, we proved the convergence of FSL under a stationary policy. In contrast, there is no proof of the existence of stationary points, or of convergence in any situation, for FQL; besides, examples of divergence in linear Q-learning have been reported in the literature. It should be noted that the existence of stationary points for FSL does not by itself imply convergence of the algorithm.
However, this result is important from several aspects. First, to the best of our knowledge, the presented theoretical results for FSL are the first analytical results for critic-only based FRL in control applications where the policy is adapted at each step. Second, FSL is an implementation of linear Sarsa with a FIS. With this implementation, we coped with the two challenges mentioned in the introduction: defining state-action basis functions and the action selection policy. Third, the proof of the existence of stationary points that coincide with the fixed points of the proposed AAVI signifies a good chance of reaching the desired solution, as was seen in the simulation. Fourth, since the action selection policy in FSL is updated at each time step, the learning speed of the algorithm is high.

Experimental results verified that FSL outperformed FQL. They also showed that action quality strongly depends on the temperature factor (i.e., on exploration). In discrete RL, selecting a greater temperature value usually improves quality but results in slower learning; however, as seen in the example, this is not necessarily true in continuous RL. The results in Section 5.2 verified the convergence of FSL under a stationary policy and the divergence of FQL. Since we employed a uniform probability distribution for the policy in that example, the action value update of FQL deviates strongly from the distribution of the underlying Markov chain, and this deviation causes FQL to diverge quickly. These results indicate that, although FQL is known as an off-policy method, it strongly depends on the policy and does not work well under a highly exploratory policy.
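The contrast between FSL's on-policy backup and FQL's greedy backup, discussed in this section, can be illustrated with a minimal sketch; the function names and the discount value are ours, not from the paper:

```python
import numpy as np

def sarsa_target(r, q_next, a_next, gamma=0.95):
    """On-policy (Sarsa/FSL-style) target: bootstraps on the value
    of the action the behavior policy actually selected."""
    return r + gamma * float(q_next[a_next])

def q_learning_target(r, q_next, gamma=0.95):
    """Off-policy (Q-learning/FQL-style) target: bootstraps on the
    greedy action value regardless of the behavior policy."""
    return r + gamma * float(np.max(q_next))
```

Under a uniform, highly exploratory policy the two targets differ whenever the selected action is not the greedy one, which is exactly the regime in which FQL diverged in Section 5.2.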
VII. APPENDIX: APPROXIMATE ACTION VALUE ITERATION (AAVI)

An improved version of approximate value iteration (not action value iteration) was presented in [19] and was guaranteed to have at least one fixed point. In each iteration, the approximate value of each state-action pair is computed from the approximate values of all states by:

$$
Q(s, a) = r(s, a) + \gamma \sum_{s' \in S} p(s, a, s') V(s')
\qquad (A1)
$$

These values are then used in the Softmax formula to obtain the probability of action selection in each state. Since, in RL algorithms, the environment is assumed unknown, the quantities $r(s, a)$ and $p(s, a, s')$ are not available for computing the AVF in equation (A1). This leads us to present an AAVI by extending the improved approximate value iteration given in [19]. Our method approximates the AVF directly, so computing it by equation (A1) in each iteration is not needed. AAVI can be used for analyzing AVF approximation algorithms in RL, such as linear Sarsa. In what follows, we offer some new definitions to formulate the AAVI method. Because of the similarity of AAVI to the method in [19], only the equations necessary for the linear Sarsa analysis are discussed.

Consider an MDP environment [27]. A policy $\pi$ is optimal if it maximizes the state-action values, or equivalently, if it maximizes the vector:

$$
Q_\pi = \sum_{k=0}^{\infty} (\gamma P_\pi)^k r
\qquad (A2)
$$

where $r$ is the vector of the immediate rewards of every state-action pair, with dimension $|S| \cdot |A| \times 1$, and $P_\pi$ is the transition probability matrix from one state-action pair $(s, a)$ to the next, with dimension $|S| \cdot |A| \times |S| \cdot |A|$, whose elements are defined as follows:

$$
P_\pi((s_i, a), (s_j, b)) = p((s_i, a), s_j) \times \pi(s_j, b)
\qquad (A3)
$$

where $s_i, s_j \in S$, $a, b \in A$, and $p((s_i, a), s_j)$ is the transition probability from $(s_i, a)$ to $s_j$, obtained from the environment model. The optimal AVF $Q^*$ uniquely solves Bellman's equation [3]:

$$
Q^* = \max_\pi \{r + \gamma P_\pi Q^*\}
\qquad (A4)
$$

where the maximization is performed element-wise. If we define the dynamic programming operator $T$ by:

$$
T Q = \max_\pi \{r + \gamma P_\pi Q\}
\qquad (A5)
$$

then the optimal AVF can be characterized as the fixed point of the operator $T$. Moreover, we define an operator $T_\pi$ for every policy $\pi$ as:

$$
T_\pi Q = r + \gamma P_\pi Q
\qquad (A6)
$$

The action value iteration defined by $Q_{l+1} = T Q_l$ improves the estimate $Q_l$ in each iteration. Similar to equation (14), we can approximate the action value $Q(s, a)$ in the $l$-th iteration using a linear FA with $R$ basis functions by:

$$
\hat{Q}_l(s, a) = \sum_{i=1}^{R} \phi_i(s, a) w_l(i)
\qquad (A7)
$$

where $w$ is the adjustable weight vector. Similar to the update law given in [19] for approximate value iteration, we propose a new approximate action value iteration method with the following update law:

$$
\hat{Q}_{l+1} = \Pi_{w_l} T_{w_l} \hat{Q}_l
\qquad (A8)
$$

where the projection operator $\Pi_{w_l} = \Phi (\Phi^T D_{w_l} \Phi)^{-1} \Phi^T D_{w_l}$ projects onto the space spanned by the basis functions, minimizing the weighted norm $\|Q - \Phi w\|_{D_{w_l}}$, and $D_{w_l}$ is a diagonal matrix of dimension $|S| \cdot |A| \times |S| \cdot |A|$ whose diagonal elements comprise the steady-state probability of each state-action pair under the policy. Moreover, $T_{w_l}$ is the dynamic programming operator using the Softmax formula (18).

Lemma A1. The proposed AAVI (A8) has at least one fixed point for any $\tau > 0$.

Proof. Defining the operator $H$ as:

$$
H Q = \Pi_w T_w Q
\qquad (A9)
$$

we have $\hat{Q}_{l+1} = H \hat{Q}_l$. Theorem 5.1 in [19] proves the existence of a fixed point for any $\tau > 0$ for the operator $H$ (defined by $\hat{V}_{l+1} = H \hat{V}_l$) used in approximate value iteration, where $\hat{V}$ denotes the state value function.
Replacing the state space matrix, the state transition matrix, and $V$ in the equations given in Section V of [19] with the state-action space matrix (given in (15)), the state-action transition matrix $P_\pi$ (with elements defined in (A3)), and $Q$, respectively, and also using the projection operator of (A8) and the Softmax action selection (18), the existence of a fixed point for $H$ defined in (A9) follows by the same procedure presented in that paper.

REFERENCES

1. Juang, C. F., "An automatic building approach to special Takagi–Sugeno fuzzy network for unknown plant modeling and stable control," Asian J. Control, Vol. 5, No. 2, pp. 176–186 (2003).
2. Hasegawa, Y., T. Fukuda, and K. Shimojima, "Self-scaling reinforcement learning for fuzzy logic controller-applications to motion control of two-link brachiation robot," IEEE Trans. Ind. Electron., Vol. 46, No. 6, pp. 1123–1131 (1999).
3. Sutton, R. S. and A. G. Barto, Reinforcement Learning: An Introduction, MIT Press, Cambridge, MA (1998).
4. Wiering, M. A., "Convergence and divergence in standard and averaging reinforcement learning," Proc. Europ. Conf. Mach. Learn., Italy, pp. 477–488 (2004).
5. Tsitsiklis, J. N. and B. Van Roy, "An analysis of temporal-difference learning with function approximation," IEEE Trans. Autom. Control, Vol. 42, No. 5, pp. 674–690 (1997).
6. Baranyi, P., P. Korondi, R. J. Patton, and H. Hashimoto, "Trade-off between approximation accuracy and complexity for TS fuzzy models," Asian J. Control, Vol. 6, No. 1, pp. 21–33 (2004).
7. Al-Gallaf, E. A., "Clustered based Takagi–Sugeno neuro-fuzzy modeling of a multivariable nonlinear dynamic system," Asian J. Control, Vol. 7, No. 2, pp. 163–176 (2005).
8. Jouffe, L., "Fuzzy inference system learning by reinforcement methods," IEEE Trans. Syst., Man, Cybern. C, Vol. 28, No. 3, pp. 338–355 (1998).
9. Kim, M. S., G. G. Hong, and J. J. Lee, "Online fuzzy Q-learning with extended rule and interpolation technique," Proc. IEEE Int. Conf. Intell. Robot.
Syst., Korea, Vol. 2, pp. 757–762 (1999).
10. Vengerov, D., N. Bambos, and H. R. Berenji, "A fuzzy reinforcement learning approach to power control in wireless transmitters," IEEE Trans. Syst., Man, Cybern. B, Vol. 35, No. 4, pp. 768–778 (2005).
11. Su, S. F. and S. H. Hsieh, "Embedding fuzzy mechanisms and knowledge in box-type reinforcement learning controllers," IEEE Trans. Syst., Man, Cybern. B, Vol. 32, No. 5, pp. 645–653 (2002).
12. Beom, H. R. and H. S. Cho, "A sensor-based navigation for a mobile robot using fuzzy logic and reinforcement learning," IEEE Trans. Syst., Man, Cybern., Vol. 25, No. 3, pp. 464–477 (1995).
13. Barto, A. G., R. S. Sutton, and C. W. Anderson, "Neuron-like adaptive elements that can solve difficult learning control problems," IEEE Trans. Syst., Man, Cybern., Vol. 13, pp. 834–846 (1983).
14. Ye, C., N. H. C. Yung, and D. Wang, "A fuzzy controller with supervised learning assisted reinforcement learning algorithm for obstacle avoidance," IEEE Trans. Syst., Man, Cybern. B, Vol. 33, No. 1, pp. 17–27 (2003).
15. Boada, M. J. L., V. Egido, R. Barber, and M. A. Salichs, "Continuous reinforcement learning algorithm for skills learning in an autonomous mobile robot," Proc. IEEE Int. Conf. Ind. Electron. Soc., Spain, pp. 2611–2616 (2003).
16. Er, M. J. and C. Deng, "Online tuning of fuzzy inference systems using dynamic fuzzy Q-learning," IEEE Trans. Syst., Man, Cybern. B, Vol. 34, No. 3, pp. 478–489 (2004).
17. Baird, L. C., "Residual algorithms: Reinforcement learning with function approximation," Proc. 12th Int. Conf. Mach. Learn., California, pp. 30–37 (1995).
18. Precup, D., R. S. Sutton, and S. Dasgupta, "Off-policy temporal-difference learning with function approximation," Proc. 18th Int. Conf. Mach. Learn., Massachusetts, pp. 417–424 (2001).
19. De Farias, D. P. and B. Van Roy, "On the existence of fixed points for approximate value iteration and temporal-difference learning," J. Optim. Theory Appl., Vol. 105, No. 3, pp.
25–36 (2000).
20. Gordon, G. J., "Reinforcement learning with function approximation converges to a region," Proc. 8th Int. Conf. Neural Inf. Process. Syst., Colorado, pp. 1040–1046 (2000).
21. Perkins, T. J. and D. Precup, "A convergent form of approximate policy iteration," Proc. 9th Int. Conf. Neural Inf. Process. Syst., Singapore, pp. 1595–1602 (2000).
22. Kaelbling, L. P., M. L. Littman, and A. W. Moore, "Reinforcement learning: A survey," J. Artif. Intell. Res., Vol. 4, pp. 237–285 (1996).
23. Watkins, C. and P. Dayan, "Q-Learning," Mach. Learn., Vol. 8, pp. 279–292 (1992).
24. Singh, S., T. Jaakkola, M. L. Littman, and C. Szepesvari, "Convergence results for single-step on-policy reinforcement learning algorithms," Mach. Learn., Vol. 39, pp. 287–308 (2000).
25. Tsitsiklis, J. N., "Asynchronous stochastic approximation and Q-learning," Mach. Learn., Vol. 16, pp. 185–202 (1994).
26. Benveniste, A., M. Métivier, and P. Priouret, Adaptive Algorithms and Stochastic Approximations, Springer, Berlin (1990).
27. Kalyanasundaram, S., E. K. P. Chong, and N. B. Shroff, "Markov decision processes with uncertain transition rates: sensitivity and max-min control," Asian J. Control, Vol. 6, No. 2, pp. 253–269 (2004).

Vali Derhami received his B.Sc. and M.Sc. degrees from the control engineering departments of Isfahan University of Technology and Tarbiat Modares University, Iran, in 1996 and 1998, respectively. He is currently a Ph.D. student at Tarbiat Modares University. His research interests are neural fuzzy systems, intelligent control, reinforcement learning, and robotics.

Vahid Johari Majd was born in 1965 in Tehran, Iran. He received his B.Sc. degree in 1989 from the E.E. department of the University of Tehran, Iran. He then received his M.Sc. and Ph.D. degrees from the E.E.
department of the University of Pittsburgh, PA, U.S.A., in 1991 and 1995, respectively. He is currently an associate professor in the E.E. department of Tarbiat Modares University, Tehran, Iran, and the director of the Intelligent Control Systems Laboratory. His areas of interest include intelligent identification and control, agent-based systems, neuro-fuzzy control systems, and chaotic systems.

Majid Nili Ahmadabadi was born in 1967. He received the B.S. degree in mechanical engineering from the Sharif University of Technology in 1990, and the M.Sc. and Ph.D. degrees in information sciences from Tohoku University, Japan, in 1994 and 1997, respectively. In 1997, he joined the Advanced Robotics Laboratory at Tohoku University. He later moved to the ECE Department of the University of Tehran, where he is an Associate Professor and the Head of the Robotics and AI Laboratory. He is also a Senior Researcher with the School of Cognitive Sciences, Institute for Studies on Theoretical Physics and Mathematics, Tehran. His main research interests are multiagent learning, biologically inspired cognitive systems, distributed robotics, and mobile robots.