2012 IEEE International Conference on Systems, Man, and Cybernetics, October 14-17, 2012, COEX, Seoul, Korea

Multi-Objective Reinforcement Learning Method for Acquiring All Pareto Optimal Policies Simultaneously

Yusuke Mukai, Department of Advanced Fibro Science, Kyoto Institute of Technology, Kyoto, Japan, [email protected]
Yasuaki Kuroe, Department of Information Science, Kyoto Institute of Technology, Kyoto, Japan, [email protected]
Hitoshi Iima, Department of Information Science, Kyoto Institute of Technology, Kyoto, Japan, [email protected]

Abstract—This paper studies multi-objective reinforcement learning problems in which an agent gains multiple rewards. In ordinary multi-objective reinforcement learning methods, only a single Pareto optimal policy is acquired by a scalarizing method that uses the weighted sum of the reward vector; different Pareto optimal policies are therefore obtained only by changing the weight vector and performing the method again. On the other hand, a method that acquires all Pareto optimal policies simultaneously has been proposed for problems whose environment model is known. By using the idea of that method, we propose a method that acquires all Pareto optimal policies simultaneously for multi-objective reinforcement learning problems whose environment model is unknown. Furthermore, we show theoretically and experimentally that the proposed method can find the Pareto optimal policies.

Index Terms—Reinforcement learning, Multi-objective problem, Pareto optimal policy

I. INTRODUCTION

Recently, optimizing not only a single objective function but multiple objective functions has attracted attention in the field of optimization. Similarly, multi-objective reinforcement learning [1][2], in which an agent acquires policies that achieve multiple objectives, has attracted attention in the field of reinforcement learning [3]. In multi-objective reinforcement learning problems, the agent gains multiple rewards corresponding to the objectives, and there is a tradeoff among the rewards. Therefore, there are multiple policies each of which is not dominated by any other policy from the viewpoint of how much reward the agent gains. These policies are called Pareto optimal policies, and acquiring them is the aim of multi-objective reinforcement learning problems.

For solving multi-objective reinforcement learning problems, usual single-objective reinforcement learning methods can be used by scalarizing the multiple rewards into a single reward. One such scalarizing method is to multiply each reward by a specified weight and to sum up the weighted rewards [4]. However, this approach cannot acquire more than one Pareto optimal policy for a given weight vector, whereas it is often required to acquire multiple Pareto optimal policies in real problems. To do so, it is inefficient to perform the method many times while changing the weight vector.

In order to find all the Pareto optimal policies, a method has been proposed in which the vertices of the convex hull of state-action value vectors (Q-vectors) are updated by the value iteration method [4]. The optimal policies are acquired by giving weight vectors to the Q-vectors after the Q-vectors converge. Furthermore, it is proven that the update equation of the Q-vectors is equal to the update equation of the Q-value in the linearly scalarized method. However, the model of the environment must be given, because the method is based on value iteration.
In this paper, we propose a reinforcement learning method that can learn all Pareto optimal policies simultaneously for multi-objective problems whose environment model is unknown. Because existing action selection methods are designed for single-objective problems, they cannot be applied directly to multi-objective problems; we therefore also propose an action selection method that is based on the vertices of the convex hull of the Q-vectors. Furthermore, we theoretically prove that the proposed method can acquire Pareto optimal policies, and we experimentally evaluate its performance by applying it to the Deep Sea Treasure problem.

II. MULTI-OBJECTIVE REINFORCEMENT LEARNING PROBLEM

In this paper, we consider reinforcement learning problems of the multi-objective Markov decision process.

A. Multi-Objective Markov Decision Process

By extending the single-objective Markov decision process, the multi-objective Markov decision process is defined as follows.
• There are an environment and an agent which takes an action at discrete time t = 1, 2, 3, \cdots.
• The agent receives a state s \in S from the environment, where S is the finite set of states.
• The agent takes an action a \in A at state s, where A is the finite set of actions that the agent can select.
• The environment gives the agent the next state s' \in S. The next state is determined according to the state transition probability P(s, a, s') for state s, action a and next state s', and the state transition probability is defined by the mapping

  P : S \times A \times S \to [0, 1].   (1)

• There are M (> 1) objectives which the agent wants to achieve, and the agent gains the reward vector

  r(s, a, s') = \big( r^1(s, a, s'), r^2(s, a, s'), \cdots, r^M(s, a, s') \big)^T   (2)

from the environment when it receives the next state.

The mapping from states to probabilities of selecting each action is called a policy, and a policy \pi is defined by

  \pi : S \times A \to [0, 1].   (3)

B. Multi-Objective Reinforcement Learning Problem

The goal of multi-objective reinforcement learning is to acquire Pareto optimal policies of the multi-objective Markov decision process. The set \Pi_p of the Pareto optimal policies is defined by

  \Pi_p = \{ \pi^p \in \Pi \mid \nexists\, \pi \in \Pi \ \text{s.t.}\ V^{\pi}(s) >_p V^{\pi^p}(s),\ \forall s \in S \}   (4)

where \Pi is the set of all policies and >_p is the dominance relation. For two vectors a = (a_1, a_2, \cdots, a_n) and b = (b_1, b_2, \cdots, b_n), a >_p b means that a_i \ge b_i is satisfied for all i and a_i > b_i is satisfied for at least one i. Moreover, V^{\pi}(s) = (V_1^{\pi}(s), V_2^{\pi}(s), \cdots, V_M^{\pi}(s)) is the value vector of state s under policy \pi, defined by

  V^{\pi}(s) = E_{\pi}\Big\{ \sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \,\Big|\, s_t = s \Big\}   (5)

where E_{\pi} denotes the expected value given that the agent follows policy \pi, s_t is the state at time t, r_t is the reward vector at time t, and \gamma is the discount rate parameter. We also define the Q-vector by

  Q^{\pi}(s, a) = E_{\pi}\Big\{ \sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \,\Big|\, s_t = s, a_t = a \Big\}   (6)

where a_t is the action at time t.

The multi-objective reinforcement learning problem is to find Pareto optimal policies under the condition that the agent knows neither the state transition probability P(s, a, s') nor the expected reward vector E\{r(s, a, s')\}.
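The dominance relation >_p and the Pareto filtering it induces are easy to state in code. The following is a minimal sketch, assuming return vectors are stored as NumPy arrays; the function names dominates and pareto_front are our own and are not taken from the paper.

```python
import numpy as np

def dominates(a, b):
    """Return True if a >_p b: a_i >= b_i for all i and a_i > b_i for at least one i."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return bool(np.all(a >= b) and np.any(a > b))

def pareto_front(vectors):
    """Keep only the vectors that are not dominated by any other vector in the list."""
    return [v for v in vectors
            if not any(dominates(u, v) for u in vectors if u is not v)]

# Example with (step penalty, treasure value) return vectors of three candidate policies.
candidates = [np.array([-1.0, 1.0]), np.array([-3.0, 2.0]), np.array([-3.0, 1.0])]
print(pareto_front(candidates))   # [-3, 1] is dominated by [-1, 1]; the other two remain
```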
III. PROPOSED METHOD

We propose a method which can solve the above-mentioned problem.

A. Basic Concept

For the multi-objective problem, there is a method in which the problem is modified into a single-objective problem by scalarizing the reward vector. The scalarized reward r_s is given by the following equation using a weight vector w = (w_1, w_2, \cdots, w_M)^T [2]:

  r_s = w^T r   (7)
      = w_1 r^1 + w_2 r^2 + \cdots + w_M r^M.   (8)

By using this scalarization, the Q-learning method can be applied to the multi-objective problem. We call this method the weighted sum Q-learning method. However, this method requires the weight vector to be specified in advance, and it cannot acquire more than one Pareto optimal policy for the specified weight vector.

In order to acquire the Pareto optimal policies for all weight vectors, a method has been proposed in which the set of vertices of the convex hull of multiple Q-vectors is updated by the value iteration method [4]. The reason why the method updates only the vertices of the convex hull is that the optimal Q-vectors belong to the converged set of vertices. It is proven that the update equation of the Q-vectors in this method is equal to the update equation of the Q-value in the weighted sum Q-learning method. However, because this method treats only problems in which the state transition probability and the expected reward vector are given, it cannot be applied to reinforcement learning problems in which they are not given.

In our proposed method, multi-objective reinforcement learning problems are solved by introducing the idea of [4] into the Q-learning method. In order to introduce it, it is required to develop an update equation of the Q-vectors by extending the update equation of the Q-value in the Q-learning method. It is also required to develop an action selection method, because existing action selection methods are designed for single-objective problems and cannot be applied directly to multi-objective problems. Thus, we propose a method in which an action is determined based on the dominance relation between Q-vectors by extending the ε-greedy method [3] used in single-objective problems.

B. Update Equation of Q-Vectors

In this subsection, we give the update equation of the Q-vectors which is used in the proposed method. Let \mathring{Q}(s, a) be the set of vertices of the convex hull of the Q-vectors for taking action a at state s, and define the following operations on \mathring{Q} [4].

Translation and scaling:

  u + b\mathring{Q} \equiv \{ u + bq \mid q \in \mathring{Q} \}.   (9)

Addition of two convex hulls:

  \mathring{Q} + \mathring{U} \equiv \mathrm{hull}\{ q + u \mid q \in \mathring{Q},\ u \in \mathring{U} \}.   (10)

The operator "hull" extracts the set of vertices of the convex hull from a set of vectors.

Extracting the Q-value:

  Q_w(s, a) \equiv \max_{q \in \mathring{Q}(s, a)} w^T q.   (11)

Under the definition of these operations, the update equation of \mathring{Q} in the proposed method is given by

  \mathring{Q}(s, a) = (1 - \alpha)\mathring{Q}(s, a) + \alpha\Big[ r(s, a, s') + \gamma\, \mathrm{hull} \bigcup_{a'} \mathring{Q}(s', a') \Big]   (12)

where \alpha is the learning rate and \gamma is the discount rate.
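To make the operations (9)-(12) concrete, the following is a rough sketch in Python, assuming each vertex set \mathring{Q}(s, a) is kept as a list of NumPy arrays and assuming SciPy 1.8 or later for the convex hull routine. The use of scipy.spatial.ConvexHull, the fallback for degenerate (too few or collinear) point sets, and all function names are our implementation choices, not details taken from the paper.

```python
import numpy as np
from scipy.spatial import ConvexHull, QhullError

def hull(points):
    """Return the vertices of the convex hull of a set of M-dimensional vectors."""
    pts = np.unique(np.asarray(points, dtype=float), axis=0)
    if len(pts) <= pts.shape[1]:          # too few points for a full-dimensional hull
        return [p for p in pts]
    try:
        return [pts[i] for i in ConvexHull(pts).vertices]
    except QhullError:                    # degenerate (e.g. collinear) point sets
        return [p for p in pts]

def translate_scale(u, b, Q):
    """Operation (9): u + b*Q applied element-wise to the vertex set Q."""
    return [u + b * q for q in Q]

def add_hulls(Q, U):
    """Operation (10): vertices of the convex hull of all pairwise sums."""
    return hull([q + u for q in Q for u in U])

def q_value(w, Q):
    """Operation (11): the scalar Q-value max_{q in Q} w^T q for weight vector w."""
    return max(float(np.dot(w, q)) for q in Q)

def update(Q_sa, reward, next_Q_sets, alpha, gamma):
    """Update (12): Q(s,a) <- (1-alpha)Q(s,a) + alpha[r + gamma*hull(U_a' Q(s',a'))]."""
    next_hull = hull([q for Qs in next_Q_sets for q in Qs])
    target = translate_scale(np.asarray(reward, dtype=float), gamma, next_hull)
    return add_hulls(translate_scale(0.0, 1.0 - alpha, Q_sa),
                     translate_scale(0.0, alpha, target))
```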
C. Action Selection Method

In single-objective reinforcement learning methods, the ε-greedy method is often used as an action selection method. In the ε-greedy method, the agent takes a random action with probability ε and the greedy action with probability 1 − ε. The greedy action a* at state s is given by

  a^* = \arg\max_a Q(s, a).   (13)

This method cannot be applied in the proposed method because the Q-vector, not the Q-value, is treated. Therefore, we propose a new method in which the agent selects an action at random from the set of actions \{ a^* \mid a^* = \arg \mathrm{nd}_a\, \mathring{Q}(s, a) \}. The operator "nd" extracts the Q-vectors each of which is not dominated by any other Q-vector, and is defined by

  \mathrm{nd}(Q) = \{ q \in Q \mid \nexists\, q' \in Q \ \text{s.t.}\ q' >_p q \}.   (14)

Because an action whose Q-vectors are all dominated by Q-vectors of other actions is not good, such an action is not selected. An action a* is selected with probability 1 − ε, and a random action is selected with probability ε.

D. Algorithm

The flow of the proposed method is as follows.
Step 1  Initialize \mathring{Q}(s, a) for all states and all actions, and set n = 0. The variable n indicates the number of episodes.
Step 2  Set t = 0. The agent receives the initial state s_0.
Step 3  The agent selects action a_t at state s_t according to the action selection method in Subsection III-C, and it receives the next state s_{t+1} and a reward vector r_{t+1}.
Step 4  Update \mathring{Q}(s_t, a_t) according to (12).
Step 5  If the agent reaches a terminal state or t = T, go to Step 6. Otherwise, set t ← t + 1 and return to Step 3. T is the maximum number of actions.
Step 6  Set n ← n + 1. If n = N, go to Step 7. Otherwise, return to Step 2. N is the maximum number of episodes.
Step 7  Give various weight vectors w, and calculate Q_w(s, a) by (11) for each w. The policy for each w is obtained by taking action a^* = \arg\max_a Q_w(s, a) at each state.
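The following sketch ties Steps 1-7 together with the ε-greedy selection over non-dominated Q-vectors of Subsection III-C. It reuses dominates, hull, update and q_value from the earlier sketches; the environment interface (env.reset, env.step, env.n_objectives) and the dictionary layout of \mathring{Q} are our own assumptions, not part of the paper.

```python
import random
import numpy as np

def nd(vectors):
    """Operator "nd" in (14): keep Q-vectors not dominated by any other Q-vector."""
    return [q for q in vectors
            if not any(dominates(p, q) for p in vectors if p is not q)]

def select_action(Q, s, actions, epsilon):
    """Epsilon-greedy over actions whose Q(s,a) contributes a non-dominated vector."""
    if random.random() < epsilon:
        return random.choice(actions)
    front = nd([q for a in actions for q in Q[(s, a)]])
    greedy = [a for a in actions
              if any(any(np.array_equal(q, f) for f in front) for q in Q[(s, a)])]
    return random.choice(greedy)

def learn(env, states, actions, alpha, gamma, epsilon, n_episodes, max_steps):
    # Step 1: initialise every Q(s,a); here, with a single zero vector of M objectives.
    Q = {(s, a): [np.zeros(env.n_objectives)] for s in states for a in actions}
    for _ in range(n_episodes):                       # Step 6: episode counter
        s = env.reset()                               # Step 2: initial state
        for _ in range(max_steps):                    # Steps 3-5
            a = select_action(Q, s, actions, epsilon)
            s_next, r, done = env.step(a)             # reward r is an M-dimensional vector
            next_sets = [Q[(s_next, a2)] for a2 in actions]
            Q[(s, a)] = update(Q[(s, a)], r, next_sets, alpha, gamma)   # Step 4, eq. (12)
            s = s_next
            if done:
                break
    return Q

def greedy_policy(Q, states, actions, w):
    # Step 7: for a given weight vector w, act greedily on Q_w(s,a) from (11).
    return {s: max(actions, key=lambda a: q_value(w, Q[(s, a)])) for s in states}
```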
IV. THEORETICAL INVESTIGATION ON ACQUISITION OF PARETO OPTIMAL POLICIES BY THE PROPOSED METHOD

In this section, we prove that Pareto optimal policies are found by the proposed method. First, we show that the update equation (12) of the proposed method is equivalent to the update equation of the weighted sum Q-learning method [4]. Second, we mention that the Q-values acquired by the weighted sum Q-learning method are optimal in the single-objective problem obtained from the original multi-objective problem. Third, we show that the policy obtained from the optimal Q-values is Pareto optimal.

A. Relationship between the proposed method and the weighted sum Q-learning method

Lemma 1  Given a weight vector w, (12) is equivalent to the following update equation of the weighted sum Q-learning method:

  Q_w(s, a) = (1 - \alpha) Q_w(s, a) + \alpha\big[ r_s + \gamma \max_{a'} Q_w(s', a') \big]   (15)

where the scalar reward r_s is given by (7).

Proof  Equation (11) is the same as

  Q_w(s, a) = \max\{ w^T q \mid q \in \mathring{Q}(s, a) \}.   (16)

By substituting (12), we obtain

  Q_w(s, a) = \max\Big\{ w^T q \,\Big|\, q \in (1-\alpha)\mathring{Q}(s, a) + \alpha\big[ r(s, a, s') + \gamma\, \mathrm{hull}\bigcup_{a'} \mathring{Q}(s', a') \big] \Big\}.   (17)

Furthermore, by using (9) and (10), we obtain

  Q_w(s, a) = \max\Big\{ w^T q \,\Big|\, q \in \mathrm{hull}\big\{ (1-\alpha)q_s + \alpha\big( r(s, a, s') + \gamma q_{s'} \big) \mid q_s \in \mathring{Q}(s, a),\ q_{s'} \in \mathrm{hull}\bigcup_{a'}\mathring{Q}(s', a') \big\} \Big\}.   (18)

Because the maximum of the inner product of w and q is attained at the set of vertices of the convex hull, the operator "hull" can be removed:

  Q_w(s, a) = \max\big\{ w^T\big[(1-\alpha)q_s + \alpha\big( r(s, a, s') + \gamma q_{s'} \big)\big] \mid q_s \in \mathring{Q}(s, a),\ q_{s'} \in \mathrm{hull}\bigcup_{a'}\mathring{Q}(s', a') \big\}   (19)
  = \max\big\{ (1-\alpha)\, w^T q_s + \alpha\big( w^T r(s, a, s') + \gamma\, w^T q_{s'} \big) \mid q_s \in \mathring{Q}(s, a),\ q_{s'} \in \mathrm{hull}\bigcup_{a'}\mathring{Q}(s', a') \big\}   (20)
  = \max\big\{ (1-\alpha)\, w^T q_s + \alpha\big( w^T r(s, a, s') + \gamma\, w^T q_{s'} \big) \mid q_s \in \mathring{Q}(s, a),\ q_{s'} \in \mathring{Q}(s', a'),\ a' \in A(s') \big\},   (21)

where the last step uses the fact that the maximum of a linear function over \bigcup_{a'}\mathring{Q}(s', a') is attained at a vertex of its convex hull. Because w^T r(s, a, s') depends on neither q_s nor q_{s'}, the maximization separates:

  Q_w(s, a) = (1-\alpha)\max_{q_s \in \mathring{Q}(s, a)} w^T q_s + \alpha\, w^T r(s, a, s') + \alpha\gamma \max_{a'} \max_{q_{s'} \in \mathring{Q}(s', a')} w^T q_{s'}   (22)
  = (1-\alpha)\max_{q_s \in \mathring{Q}(s, a)} w^T q_s + \alpha\Big[ r_s + \gamma \max_{a'} \max_{q_{s'} \in \mathring{Q}(s', a')} w^T q_{s'} \Big].   (23)

Finally, (15) is derived by using (11).  □

B. Optimality of Q-values obtained by the weighted sum Q-learning method

The following theorem holds from the convergence theorem of the Q-learning method for single-objective problems [5].

Theorem 1  Assume that all state-action pairs are visited and continue to be updated. If the learning rate α(t) at time t satisfies the conditions

  \sum_{t=1}^{\infty} \alpha(t) = \infty, \qquad \sum_{t=1}^{\infty} \alpha^2(t) < \infty,   (24)

then the Q-values converge to the optimal values with probability 1 by the weighted sum Q-learning method.

From Lemma 1 and Theorem 1, the Q-values obtained by the proposed method are equal to the optimal Q-values obtained by the weighted sum Q-learning method.

C. Relationship between Q-values gained by the weighted sum Q-learning method and the Pareto optimal policy

Lemma 2  The policy obtained by the weighted sum Q-learning method is Pareto optimal.

Proof  The weighted sum \hat{V}_w^{\pi}(s) of the values in the proposed method is given by

  \hat{V}_w^{\pi}(s) = \sum_{i=1}^{M} w_i V_i^{\pi}(s).   (25)

For the policy \pi^p = \arg\max_{\pi} \hat{V}_w^{\pi}(s), the following clearly holds for any \pi \in \Pi and s \in S:

  \sum_{i=1}^{M} w_i V_i^{\pi^p}(s) \ge \sum_{i=1}^{M} w_i V_i^{\pi}(s).   (26)

Assume that \pi^p is not a Pareto optimal policy. From this assumption, there is a policy \bar{\pi} \in \Pi such that

  V_j^{\bar{\pi}}(s) > V_j^{\pi^p}(s) \quad (\exists\, j, s)   (27)

and

  V_k^{\bar{\pi}}(s) \ge V_k^{\pi^p}(s) \quad (\forall\, k;\ k \ne j).   (28)

Hence, the following holds:

  \sum_{i=1}^{M} w_i V_i^{\pi^p}(s) < \sum_{i=1}^{M} w_i V_i^{\bar{\pi}}(s).   (29)

This contradicts (26). Hence, \pi^p is a Pareto optimal policy.

The state value function \bar{V}_w^{\pi}(s) and the state-action value function Q_w^{\pi}(s, a) of the weighted sum Q-learning method are given by

  \bar{V}_w^{\pi}(s) = E_{\pi}\Big\{ \sum_{i=1}^{M} w_i r_{t+1}^i + \gamma \sum_{i=1}^{M} w_i r_{t+2}^i + \gamma^2 \sum_{i=1}^{M} w_i r_{t+3}^i + \cdots \,\Big|\, s_t = s \Big\},   (30)

  Q_w^{\pi}(s, a) = E_{\pi}\Big\{ \sum_{i=1}^{M} w_i r_{t+1}^i + \gamma \sum_{i=1}^{M} w_i r_{t+2}^i + \gamma^2 \sum_{i=1}^{M} w_i r_{t+3}^i + \cdots \,\Big|\, s_t = s,\ a_t = a \Big\}.   (31)

For \bar{V}_w^{\pi}(s),

  \bar{V}_w^{\pi}(s) = E_{\pi}\Big\{ \sum_{i=1}^{M} w_i r_{t+1}^i \,\Big|\, s_t = s \Big\} + E_{\pi}\Big\{ \gamma \sum_{i=1}^{M} w_i r_{t+2}^i \,\Big|\, s_t = s \Big\} + E_{\pi}\Big\{ \gamma^2 \sum_{i=1}^{M} w_i r_{t+3}^i \,\Big|\, s_t = s \Big\} + \cdots   (32)
  = \sum_{i=1}^{M} w_i E_{\pi}\{ r_{t+1}^i \mid s_t = s \} + \sum_{i=1}^{M} w_i E_{\pi}\{ \gamma r_{t+2}^i \mid s_t = s \} + \sum_{i=1}^{M} w_i E_{\pi}\{ \gamma^2 r_{t+3}^i \mid s_t = s \} + \cdots   (33)
  = \sum_{i=1}^{M} w_i E_{\pi}\{ r_{t+1}^i + \gamma r_{t+2}^i + \gamma^2 r_{t+3}^i + \cdots \mid s_t = s \}   (34)
  = \hat{V}_w^{\pi}(s).   (35)

Hence,

  \pi^p = \arg\max_{\pi} \hat{V}_w^{\pi}(s)   (36)
  = \arg\max_{\pi} \bar{V}_w^{\pi}(s)   (37)
  = \arg\max_{\pi} \max_{a} Q_w^{\pi}(s, a).   (38)

The last equation shows that \pi^p is the policy obtained by the weighted sum Q-learning method. Since \pi^p is a Pareto optimal policy, the lemma is proved.  □

D. Pareto optimality of the policy acquired by the proposed method

The following theorem holds from Theorem 1, Lemma 1 and Lemma 2.

Theorem 2  Assume that all state-action pairs are visited and continue to be updated. If the learning rate α(t) at time t satisfies the conditions (24), then the policies obtained by the proposed method with weight vectors are Pareto optimal policies.

V. REDUCTION OF COMPUTATIONAL CONSUMPTION

As mentioned above, the proposed method can acquire all Pareto optimal policies. However, when the proposed method is performed, the number of elements of \mathring{Q} becomes enormous, which makes the computation time much longer. To resolve this problem, we introduce a method to reduce the number of Q-vectors after the update by (12). In this method, a Q-vector q_2 of \mathring{Q} is removed from \mathring{Q} if

  \| q_1 - q_2 \| < \rho   (39)

where q_1 is a basis Q-vector and \rho is a threshold parameter. The basis vector is chosen randomly from \mathring{Q}, and this operation is repeated while there exist removed Q-vectors.
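A hedged sketch of the reduction rule (39) follows, assuming the vertex set is a list of NumPy arrays; how the survivors are re-examined after each removal pass is our own reading of the "repeat while Q-vectors are removed" loop.

```python
import random
import numpy as np

def reduce_vectors(Q_sa, rho):
    """Remove Q-vectors closer than rho to a randomly chosen basis vector, as in (39)."""
    vectors = list(Q_sa)
    removed = True
    while removed and len(vectors) > 1:
        removed = False
        q1 = random.choice(vectors)                      # basis Q-vector
        kept = [q for q in vectors
                if q is q1 or np.linalg.norm(q1 - q) >= rho]
        if len(kept) < len(vectors):
            vectors = kept
            removed = True
    return vectors
```

With ρ = 0.0 no vector is ever removed, which matches the setting of experiment 1 below, where the reduction method is switched off.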
VI. NUMERICAL EXPERIMENTS

In this section, we show the results of numerical experiments. The objectives of the experiments are as follows:
1) Can Pareto optimal policies be acquired according to the theory of the proposed method in Section IV?
2) Can Pareto optimal policies be acquired when the reduction method of Section V is introduced?
The Deep Sea Treasure problem [2] is used as the experimental study.

A. Deep Sea Treasure Problem

We consider the following Deep Sea Treasure problem.
• It is the 11 × 10 grid world problem shown in Fig. 1.
• The upper left cell in Fig. 1 is the agent's initial state, and the cells with numbers are the treasures.
• The agent can move up, down, left or right at each time t. However, if the destination is a black cell or outside the grid world, the agent does not move.
• There are two objectives. The first is reaching a treasure with the smallest number of steps, and the second is getting the most valuable treasure (the treasure with the largest number).
• An episode ends when the agent reaches a treasure or the number of steps reaches the maximum.
• For the first objective, the agent gains the reward (penalty) −1 whenever it takes an action.
• For the second objective, when the agent reaches a treasure cell, it gains the number in the cell as the reward. Otherwise, it gains no reward.
The values of the treasures are modified from the original setting of [2]. Fig. 2 shows the returns which the agent gains under all the Pareto optimal policies of this problem. For example, the point at (−19, 124) shows the returns for the policy under which the agent moves 19 times and gains the treasure whose value is 124.

[Fig. 1. Deep Sea Treasure]
[Fig. 2. Returns for all the Pareto optimal policies in the Deep Sea Treasure problem]

B. Numerical experiment 1

The aim of this experiment is to confirm that Pareto optimal policies are acquired according to the theory in Section IV. The values of the parameters of the proposed method are as follows:
• Number of episodes: N = 40000
• Learning rate: α = 1.0
• Discount rate: γ = 1.0
• Random action selection probability: ε = 0.2
• Threshold of distance: ρ = 0.0
This value of ρ means that the reduction method in Section V is not used. These values of α and γ are set so that the proposed method without the method in Section V can be performed in a short time.

Figs. 3 and 4 show \mathring{Q} obtained by the proposed method. This is \mathring{Q} of the initial state and the action "right". From this learning result, all Pareto optimal policies are acquired through Q_w, which are extracted from \mathring{Q} by giving various weight vectors w.

[Fig. 3. \mathring{Q}([0, 0], right) obtained in experiment 1 (step penalty > −30)]
[Fig. 4. \mathring{Q}([0, 0], right) obtained in experiment 1 (step penalty < −30)]

C. Numerical experiment 2

If the problem becomes more complicated and its size becomes larger, Pareto optimal policies may not be acquired with the parameter setting of Subsection VI-B. Thus, we show the experimental result for a typical parameter setting of Q-learning. The values of the parameters are as follows:
• Number of episodes: N = 10000
• Learning rate: α = 0.7
• Discount rate: γ = 0.999
• Random action selection probability: ε = 0.2
• Threshold of distance: ρ = 2.5
In this experiment, the reduction method in Section V is used. These values of α, γ, ε and ρ are determined through preliminary experiments so that the proposed method works as well as possible.

Figs. 5 and 6 show \mathring{Q} obtained by the proposed method. This is \mathring{Q} of the initial state and the action "right". From this learning result, all Pareto optimal policies are acquired through Q_w, which are extracted from \mathring{Q} by giving various weight vectors w.

[Fig. 5. \mathring{Q}([0, 0], right) obtained in experiment 2 (step penalty > −30)]
[Fig. 6. \mathring{Q}([0, 0], right) obtained in experiment 2 (step penalty < −30)]
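For concreteness, the following sketch shows one way such a read-out could be implemented: sweep weight vectors w, extract the greedy policy for each via (11) (Step 7), and keep the policies whose return vectors are mutually non-dominated. It reuses greedy_policy and dominates from the earlier sketches; evaluate_returns is a hypothetical helper that rolls out a policy and accumulates its return vector, and the weight grid is an arbitrary choice, not a setting reported in the paper.

```python
import numpy as np

def pareto_policies(Q, states, actions, evaluate_returns, n_weights=101):
    """Sweep weight vectors, extract greedy policies, and keep non-dominated returns."""
    found = {}
    for w1 in np.linspace(0.0, 1.0, n_weights):
        w = np.array([w1, 1.0 - w1])                   # two objectives in Deep Sea Treasure
        policy = greedy_policy(Q, states, actions, w)  # Step 7, sketched earlier
        ret = tuple(evaluate_returns(policy))          # e.g. (-19.0, 124.0)
        found[ret] = policy
    returns = list(found.keys())
    # keep only the returns that are not dominated by another extracted return
    return {r: found[r] for r in returns
            if not any(dominates(np.array(o), np.array(r)) for o in returns if o != r)}
```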
D. Discussion

From the results, we showed empirically that all Pareto optimal policies can be acquired simultaneously by the proposed method. However, the computation time of experiment 1 was 1112 seconds and that of experiment 2 was 121143 seconds (the CPU of the PC used in the experiments is an Intel Core i3, 3.20 GHz), so it took a long time to learn the Pareto optimal policies. This is because many vertices of the convex hull of the Q-vectors are updated by the convex hull operators. Whereas some of the vertices contribute to Pareto optimal policies, the others do not, as shown in Fig. 4 and Fig. 6. For such unpromising Q-vectors, it takes the agent a long time to reach a treasure, which also makes the computation time longer. Moreover, when the reduction method was not used with the parameter setting of experiment 2, the learning process had proceeded through only 88 episodes and had not finished after 206 hours (8 days and 14 hours). Thus, it can be said that the reduction method is effective, since with it the learning process of experiment 2 could be finished.

VII. CONCLUSION

We have proposed a reinforcement learning method that can acquire all Pareto optimal policies simultaneously for multi-objective problems whose environment model is unknown.
We have also theoretically proved that the proposed method can acquire Pareto optimal policies, by showing that the update equation of the proposed method is equivalent to that of the weighted sum Q-learning method, that the Q-values converge to the optimal values in the weighted sum Q-learning method, and that the policy obtained by the weighted sum Q-learning method is Pareto optimal. Furthermore, we showed this experimentally through the results of two numerical experiments.

However, our method has the problem of high computational consumption. In order to mitigate this problem, we have introduced a method that reduces the number of Q-vectors. By using this method, we could reduce the computational consumption; however, many unpromising Q-vectors still remain, and the computation time of the proposed method is still long. Therefore, we need to consider ways of further reducing the computation time.

REFERENCES

[1] T. Kamioka, E. Uchibe, and K. Doya, "Max-Min Actor-Critic for Multiple Reward Reinforcement Learning," IEICE Transactions on Information and Systems, Vol. J90-D, No. 9, pp. 2510-2521, 2007 (in Japanese).
[2] P. Vamplew, J. Yearwood, R. Dazeley, and A. Berry, "On the Limitations of Scalarisation for Multi-Objective Reinforcement Learning of Pareto Fronts," Artificial Intelligence 2008, LNAI 5360, pp. 372-378, 2008.
[3] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, MIT Press, 1998.
[4] L. Barrett and S. Narayanan, "Learning All Optimal Policies with Multiple Criteria," Proceedings of the 25th International Conference on Machine Learning, 2008.
[5] C. J. C. H. Watkins and P. Dayan, "Q-learning," Machine Learning, Vol. 8, No. 3, pp. 279-292, 1992.