2012 IEEE International Conference on Systems, Man, and Cybernetics
October 14-17, 2012, COEX, Seoul, Korea
Multi-Objective Reinforcement Learning Method for
Acquiring All Pareto Optimal Policies
Simultaneously
Yusuke Mukai
Department of Advanced Fibro Science
Kyoto Institute of Technology
Kyoto, Japan
[email protected]

Yasuaki Kuroe
Department of Information Science
Kyoto Institute of Technology
Kyoto, Japan
[email protected]

Hitoshi Iima
Department of Information Science
Kyoto Institute of Technology
Kyoto, Japan
[email protected]
Abstract—This paper studies multi-objective reinforcement learning problems in which an agent gains multiple rewards. In ordinary multi-objective reinforcement learning methods, only a single Pareto optimal policy is acquired by a scalarizing method that uses the weighted sum of the reward vector; different Pareto optimal policies can therefore be obtained only by changing the weight vector and performing the method again. On the other hand, a method that acquires all Pareto optimal policies simultaneously has been proposed for problems whose environment model is known. By using the idea of that method, we propose a method that acquires all Pareto optimal policies simultaneously for multi-objective reinforcement learning problems whose environment model is unknown. Furthermore, we show theoretically and experimentally that the proposed method can find the Pareto optimal policies.
Index Terms—Reinforcement learning, Multi-objective problem, Pareto optimal policy
I. INTRODUCTION
Recently, optimizing not only a single objective function but multiple objective functions has attracted attention in the field of optimization. Similarly, multi-objective reinforcement learning [1][2], in which an agent acquires policies that achieve multiple objectives, has attracted attention in the field of reinforcement learning [3]. In multi-objective reinforcement learning problems, the agent gains multiple rewards corresponding to the objectives, and there are tradeoffs between the rewards. Therefore, there are multiple policies each of which is not dominated by any other policy in terms of how much of each reward the agent gains. These policies are called Pareto optimal policies, and acquiring them is the aim of multi-objective reinforcement learning problems.
For solving multi-objective reinforcement learning problems, usual single-objective reinforcement learning methods can be used by scalarizing the multiple rewards into a single reward. One such scalarizing method is to multiply each reward by a specified weight and to sum up the weighted rewards [4]. However, this method cannot acquire more than one Pareto optimal policy for a given weight vector, whereas real problems often require multiple Pareto optimal policies. Performing the method many times while changing the weight vector is inefficient.
In order to find all the Pareto optimal policies, a method has been proposed in which the vertices of the convex hull of the state-action value vectors (Q-vectors) are updated by the value iteration method [4]. The optimal policies are acquired by applying weight vectors to the Q-vectors after the Q-vectors converge. Furthermore, it is proven that the update equation of the Q-vectors is equal to the update equation of the Q-value in the linearly scalarized method. However, the model of the environment must be given because the method is based on value iteration.
In this paper, we propose a reinforcement learning method that can learn all Pareto optimal policies simultaneously for multi-objective problems whose environment model is unknown. Because existing action selection methods are designed for single-objective problems, they cannot be applied directly to multi-objective problems. In this paper, we also propose an action selection method that is based on the vertices of the convex hull of the Q-vectors. Furthermore, we theoretically prove that our proposed method can acquire Pareto optimal policies, and experimentally evaluate its performance by applying it to the Deep Sea Treasure problem.
II. MULTI-OBJECTIVE REINFORCEMENT LEARNING PROBLEM
In this paper, we consider reinforcement learning problems on a multi-objective Markov decision process.
A. Multi-Objective Markov Decision Process
By extending the single-objective Markov decision process,
the multi-objective Markov decision process is defined as
follows.
• There are an environment and an agent which takes an
action at discrete time t = 1, 2, 3, · · · .
• The agent receives a state s ∈ S from the environment.
S is the finite set of states.
• The agent takes an action a ∈ A at state s. A is the finite
set of actions that the agent can select.
• The environment gives the agent the next state s′ ∈ S. The next state is determined according to the state transition probability P(s, a, s′) for state s, action a and next state s′, where the state transition probability is defined by the mapping

  P : S × A × S → [0, 1].    (1)

• There are M (> 1) objectives which the agent wants to achieve, and the agent gains the reward vector

  r(s, a, s′) = ( r_1(s, a, s′), r_2(s, a, s′), ..., r_M(s, a, s′) )^T    (2)

from the environment when it receives the next state.

The mapping from states to probabilities of selecting each action is called a policy, and policy π is defined by

  π : S × A → [0, 1].    (3)
B. Multi-Objective Reinforcement Learning Problem
The goal of multi-objective reinforcement learning is to acquire Pareto optimal policies in the multi-objective Markov decision process. The set Π_p of the Pareto optimal policies is defined by

  Π_p = { π^p ∈ Π | ∄ π ∈ Π s.t. V^π(s) >_p V^{π^p}(s), ∀ s ∈ S }    (4)

where Π is the set of all policies and >_p is the dominance relation. For two vectors a = (a_1, a_2, ..., a_n) and b = (b_1, b_2, ..., b_n), a >_p b means that a_i ≥ b_i is satisfied for all i and a_i > b_i is satisfied for at least one i. Moreover, V^π(s) = (V_1^π(s), V_2^π(s), ..., V_M^π(s)) is the value vector of state s under policy π, and it is defined by

  V^π(s) = E_π[ Σ_{k=0}^{∞} γ^k r_{t+k+1} | s_t = s ]    (5)

where E_π is the expected value given that the agent follows policy π, s_t is the state at time t, r_t is the reward vector at time t and γ is the discount rate parameter.

We also define the Q-vector by

  Q^π(s, a) = E_π[ Σ_{k=0}^{∞} γ^k r_{t+k+1} | s_t = s, a_t = a ]    (6)

where a_t is the action at time t.

The multi-objective reinforcement learning problem is to find Pareto optimal policies under the condition that the agent does not know the state transition probability P(s, a, s′) and the expected reward vector E{ r(s, a, s′) }.
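To make the dominance relation >_p used in (4) concrete, the following minimal Python check is our own illustration (the function name and NumPy representation are not part of the paper); it tests whether one return vector dominates another.

```python
import numpy as np

def pareto_dominates(a, b):
    """True if a >_p b: a_i >= b_i for every objective i and a_i > b_i for at least one i."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return bool(np.all(a >= b) and np.any(a > b))

# (-3, 10) dominates (-5, 10); (-1, 5) and (-3, 10) are incomparable (a trade-off).
print(pareto_dominates([-3, 10], [-5, 10]))  # True
print(pareto_dominates([-1, 5], [-3, 10]))   # False
```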
III. PROPOSED METHOD

We propose a method which can solve the above-mentioned problem.

A. Basic Concept
For the multi-objective problem, there is a method in which the problem is converted into a single-objective problem by scalarizing the reward vector. The scalarized reward r_s is given by the following equation using a weight vector w = (w_1, w_2, ..., w_M)^T [2]:

  r_s = w^T r    (7)
      = w_1 r^1 + w_2 r^2 + ... + w_M r^M.    (8)
By using this scalarization, the Q-learning method can be applied to the multi-objective problem. We call this method the weighted sum Q-learning method. However, this method requires the weight vector to be specified in advance and is not able to acquire more than one Pareto optimal policy for the specified weight vector.
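As an illustration of this scalarized approach, the following Python sketch performs one weighted sum Q-learning update; the function name, the tabular representation of Q and the parameter values are our own assumptions, not the paper's implementation.

```python
import numpy as np

def weighted_sum_q_update(Q, s, a, r_vec, s_next, w, alpha=0.1, gamma=0.95):
    """One weighted sum Q-learning step: the reward vector is scalarized as
    r_s = w^T r (Eq. (7)) and fed to the ordinary Q-learning rule."""
    r_s = float(np.dot(w, r_vec))                 # scalarized reward
    td_target = r_s + gamma * np.max(Q[s_next])   # r_s + gamma * max_a' Q(s', a')
    Q[s, a] += alpha * (td_target - Q[s, a])
    return Q

# Q is a |S| x |A| table of scalar Q-values for one fixed weight vector w.
Q = np.zeros((4, 2))
Q = weighted_sum_q_update(Q, s=0, a=1, r_vec=[-1.0, 0.0], s_next=2, w=[0.5, 0.5])
```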
In order to acquire the Pareto optimal policies for all the weight vectors, a method has been proposed in which the set of the vertices of the convex hull of multiple Q-vectors is updated by the value iteration method [4]. The method updates only the vertices of the convex hull because the optimal Q-vectors belong to the converged set of the vertices. It is proven that the update equation of the Q-vectors in this method is equal to the update equation of the Q-value in the weighted sum Q-learning method. However, because this method treats only problems in which the state transition probability and the expected reward vector are given, it cannot be applied to reinforcement learning problems in which they are not given.
In our proposed method, the multi-objective reinforcement learning problems are solved by introducing the idea of [4] into the Q-learning method. In order to introduce it, the update equation of the Q-vectors must be developed by extending the update equation of the Q-value in the Q-learning method. An action selection method must also be developed, because existing action selection methods are designed for single-objective problems and cannot be applied directly to multi-objective problems. Thus, we propose a method in which an action is determined based on the dominance relation between Q-vectors by extending the ε-greedy method [3] used in single-objective problems.
B. Update Equation of Q-Vectors
In this subsection, we give the update equation of the Q-vectors used in the proposed method. Let Q°(s, a) be the set of the vertices of the convex hull of the Q-vectors for taking action a at state s. The following operations on Q° are defined [4].

Translation and scaling operations:

  u + b Q° ≡ { u + b q | q ∈ Q° }.    (9)

Addition of two convex hulls:

  Q° + U° ≡ hull{ q + u | q ∈ Q°, u ∈ U° }.    (10)

The operator "hull" means to extract the set of the vertices of the convex hull from a set of vectors.

Extracting the Q-value:

  Q_w(s, a) ≡ max_{q ∈ Q°(s, a)} w^T q.    (11)

Under the definition of these operations, the update equation of Q° in the proposed method is given by

  Q°(s, a) = (1 − α) Q°(s, a) + α ( r(s, a, s′) + γ hull ∪_{a′} Q°(s′, a′) )    (12)

where α is the learning rate and γ is the discount rate.
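The following Python sketch illustrates the operations (9)-(11) and the update (12) for the two-objective case; the data layout (a dictionary from state-action pairs to lists of Q-vectors) and the 2-D hull routine are our own assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def hull(points):
    """Vertices of the convex hull of a set of 2-D Q-vectors (Andrew's monotone
    chain); a 2-D routine suffices here because the experiments use M = 2."""
    pts = sorted(set(map(tuple, np.round(np.asarray(points, float), 10))))
    if len(pts) <= 2:
        return [np.array(p) for p in pts]
    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return [np.array(p) for p in lower[:-1] + upper[:-1]]

def scale_translate(u, b, Q):      # Eq. (9): u + b Q
    return [np.asarray(u, float) + b * q for q in Q]

def hull_sum(Q, U):                # Eq. (10): Q + U = hull{q + u}
    return hull([q + u for q in Q for u in U])

def q_w(Q, w):                     # Eq. (11): max over Q of w^T q
    return max(float(np.dot(w, q)) for q in Q)

def update_q_vectors(Qo, s, a, r_vec, s_next, actions, alpha, gamma):
    """Eq. (12): Qo(s,a) = (1-alpha) Qo(s,a) + alpha (r + gamma hull of union over a')."""
    union_next = [q for a2 in actions for q in Qo[(s_next, a2)]]
    future = scale_translate(np.asarray(r_vec, float), gamma, hull(union_next))
    Qo[(s, a)] = hull_sum(scale_translate(0.0, 1.0 - alpha, Qo[(s, a)]),
                          scale_translate(0.0, alpha, future))
    return Qo

# Tiny usage example with two states, two actions and Q-vectors initialised to 0.
Qo = {(s, a): [np.zeros(2)] for s in range(2) for a in range(2)}
update_q_vectors(Qo, s=0, a=0, r_vec=[-1.0, 0.0], s_next=1, actions=[0, 1],
                 alpha=0.5, gamma=1.0)
```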
C. Action Selection Method
In single-objective reinforcement learning methods, the ε-greedy method is often used as an action selection method. In the ε-greedy method, the agent takes an action randomly with probability ε and the greedy action with probability 1 − ε. The greedy action a* at state s is given by

  a* = arg max_a Q(s, a).    (13)

This method cannot be applied directly in the proposed method because not a Q-value but a Q-vector is treated. Therefore, we propose a new method in which the agent selects an action at random from the set of actions { a* | a* = arg nd_a Q°(s, a) }. The operator "nd" means to extract the Q-vectors each of which is not dominated by any other Q-vector, and is defined by

  nd Q = { q ∈ Q | ∄ q′ ∈ Q s.t. q′ >_p q }.    (14)

Because an action all of whose Q-vectors are dominated by the Q-vectors of other actions is not good, such an action is not selected. The action a* is selected with probability 1 − ε, and a random action is selected with probability ε.
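A minimal sketch of this action selection rule, under the same assumed data layout as above, could look as follows; the helper names are ours.

```python
import random
import numpy as np

def dominates(p, q):               # the relation >_p from Eq. (4)
    p, q = np.asarray(p, float), np.asarray(q, float)
    return bool(np.all(p >= q) and np.any(p > q))

def nd(vectors):
    """Eq. (14): keep only the vectors that are not dominated by any other vector."""
    return [q for q in vectors
            if not any(dominates(p, q) for p in vectors if p is not q)]

def select_action(Qo, s, actions, eps):
    """Epsilon-greedy over non-dominated Q-vectors: with probability eps act randomly;
    otherwise choose uniformly among the actions that own at least one vector of
    nd(union over a of Qo(s, a))."""
    if random.random() < eps:
        return random.choice(list(actions))
    tagged = [(a, q) for a in actions for q in Qo[(s, a)]]
    frontier = nd([q for _, q in tagged])
    candidates = [a for a, q in tagged if any(q is f for f in frontier)]
    return random.choice(list(dict.fromkeys(candidates)))
```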
D. Algorithm

The flow of our proposed method is as follows.

Step 1: Initialize Q°(s, a) for all states and all actions, and set n = 0. The variable n indicates the number of episodes.
Step 2: Set t = 0. The agent receives the initial state s_0.
Step 3: The agent selects action a_t at state s_t according to the proposed action selection method in Subsection III-C, and it receives the next state s_{t+1} and a reward vector r_{t+1}.
Step 4: Update Q°(s_t, a_t) according to (12).
Step 5: If the agent reaches a terminal state or t = T, go to Step 6. Otherwise, set t ← t + 1 and return to Step 3. T is the maximum number of actions.
Step 6: Set n ← n + 1. If n = N, go to Step 7. Otherwise, return to Step 2. N is the maximum number of episodes.
Step 7: Give various weight vectors w, and calculate Q_w(s, a) by (11) for each w. The policy for each w is obtained by taking action a* = arg max_a Q_w(s, a) in each state.
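The following sketch assembles Steps 1-7 into one training loop. It assumes a hypothetical environment interface (env.reset() returning a state, env.step(a) returning the next state, the reward vector and a termination flag) and reuses the hull(), q_w(), update_q_vectors() and select_action() helpers sketched above.

```python
import numpy as np
from collections import defaultdict

def train(env, actions, n_episodes, t_max, alpha, gamma, eps, weights):
    """Sketch of Steps 1-7 under the assumed environment API described above."""
    Qo = defaultdict(lambda: [np.zeros(2)])             # Step 1 (M = 2 assumed)
    for _ in range(n_episodes):                         # Step 6: episode counter
        s, t, done = env.reset(), 0, False              # Step 2
        while not done and t < t_max:                   # Step 5: episode termination
            a = select_action(Qo, s, actions, eps)      # Step 3
            s_next, r_vec, done = env.step(a)
            update_q_vectors(Qo, s, a, r_vec, s_next, actions, alpha, gamma)  # Step 4: Eq. (12)
            s, t = s_next, t + 1
    # Step 7: one greedy policy per weight vector w, via Q_w of Eq. (11)
    states = {key[0] for key in Qo}
    policies = {tuple(w): {s: max(actions, key=lambda a: q_w(Qo[(s, a)], w))
                           for s in states}
                for w in weights}
    return Qo, policies
```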
IV. THEORETICAL INVESTIGATION ON ACQUISITION OF PARETO OPTIMAL POLICIES BY THE PROPOSED METHOD

In this section, we prove that Pareto optimal policies are found by the proposed method. First, we show that the update equation (12) in the proposed method is equivalent to the update equation in the weighted sum Q-learning method [4]. Second, we mention that the Q-values acquired by the weighted sum Q-learning method are optimal in the single-objective problem which is modified from the original multi-objective problem. Third, we show that the policy obtained from the optimal Q-values is Pareto optimal.

A. Relationship between the proposed method and the weighted sum Q-learning method

Lemma 1: Given a weight vector w, (12) is equivalent to the following update equation in the weighted sum Q-learning method:

  Q_w(s, a) = (1 − α) Q_w(s, a) + α ( r_s + γ max_{a′} Q_w(s′, a′) ).    (15)

The scalar reward r_s is given by (7).

Proof: Equation (11) is the same as

  Q_w(s, a) = max{ w^T q | q ∈ Q°(s, a) }.    (16)
By substituting (12), we obtain

  Q_w(s, a) = max{ w^T q | q ∈ (1 − α) Q°(s, a) + α ( r(s, a, s′) + γ hull ∪_{a′} Q°(s′, a′) ) }.    (17)

Furthermore, by using (9) and (10), we obtain

  Q_w(s, a) = max{ w^T q | q ∈ hull{ (1 − α) q_s + α ( r(s, a, s′) + γ q_{s′} ) | q_s ∈ Q°(s, a), q_{s′} ∈ hull ∪_{a′} Q°(s′, a′) } }.    (18)

Because the maximum of the inner product of w and q is attained only in the set of the vertices of the convex hull, the operator "hull" can be removed:

  Q_w(s, a) = max{ w^T ( (1 − α) q_s + α ( r(s, a, s′) + γ q_{s′} ) ) | q_s ∈ Q°(s, a), q_{s′} ∈ hull ∪_{a′} Q°(s′, a′) }    (19)

            = max{ (1 − α) w^T q_s + α ( w^T r(s, a, s′) + γ w^T q_{s′} ) | q_s ∈ Q°(s, a), q_{s′} ∈ hull ∪_{a′} Q°(s′, a′) }.    (20)

For the same reason, the inner operator "hull" can also be removed:

  Q_w(s, a) = max{ (1 − α) w^T q_s + α ( w^T r(s, a, s′) + γ w^T q_{s′} ) | q_s ∈ Q°(s, a), q_{s′} ∈ Q°(s′, a′), a′ ∈ A(s′) }.    (21)

Because w^T r(s, a, s′) is not related to q_s or q_{s′}, we obtain

  Q_w(s, a) = max{ (1 − α) w^T q_s | q_s ∈ Q°(s, a) } + α w^T r(s, a, s′) + αγ max{ w^T q_{s′} | q_{s′} ∈ Q°(s′, a′), a′ ∈ A(s′) }    (22)

            = (1 − α) max_{q_s ∈ Q°(s, a)} w^T q_s + α r_s + αγ max_{a′} max_{q_{s′} ∈ Q°(s′, a′)} w^T q_{s′}.    (23)

Finally, (15) is derived by using (11).
B. Optimality of Q-values obtained by the weighted sum Q-learning method

The following theorem holds from the convergence theorem of the Q-learning method for single-objective problems [5].

Theorem 1: We assume that all state-action pairs are visited and continue to be updated. If the learning rate α(t) for time t satisfies the following conditions, the Q-values converge to the optimal values with probability 1 by the weighted sum Q-learning method:

  Σ_{t=1}^{∞} α(t) = ∞,   Σ_{t=1}^{∞} α²(t) < ∞.    (24)
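As a standard illustration (not taken from the paper), the learning rate schedule α(t) = 1/t satisfies the conditions in (24):

```latex
% A classical schedule that meets (24); 1/t is only an illustrative choice,
% not the schedule used in the paper's experiments.
\sum_{t=1}^{\infty} \alpha(t) = \sum_{t=1}^{\infty} \frac{1}{t} = \infty,
\qquad
\sum_{t=1}^{\infty} \alpha^{2}(t) = \sum_{t=1}^{\infty} \frac{1}{t^{2}} = \frac{\pi^{2}}{6} < \infty .
```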
C. Relationship between Q-values gained by the weighted sum Q-learning method and the Pareto optimal policy

From Lemma 1 and Theorem 1, the Q-values gained by the proposed method are equal to the optimal Q-values gained by the weighted sum Q-learning method.

Lemma 2: The policy gained by the weighted sum Q-learning method is Pareto optimal.

Proof: The weighted sum V̂_w^π(s) of the values in the proposed method is given by

  V̂_w^π(s) = Σ_{i=1}^{M} w_i V_i^π(s).    (25)

For the policy π^p = arg max_π V̂_w^π(s), the following equation holds clearly for π ∈ Π and s ∈ S:

  Σ_{i=1}^{M} w_i V_i^{π^p}(s) ≥ Σ_{i=1}^{M} w_i V_i^π(s)   (∀ π, s).    (26)

We assume that π^p is not a Pareto optimal policy. From this assumption, there is a policy π̄ ∈ Π such that

  V_j^{π̄}(s) > V_j^{π^p}(s)   (∃ j, s)    (27)

and

  V_k^{π̄}(s) ≥ V_k^{π^p}(s)   (∀ k; k ≠ j).    (28)

Hence, the following equation holds:

  Σ_{i=1}^{M} w_i V_i^{π^p}(s) < Σ_{i=1}^{M} w_i V_i^{π̄}(s).    (29)

This equation conflicts with (26). Hence, π^p is a Pareto optimal policy.
The state value function V̄_w^π(s) and the state-action value function Q_w^π(s, a) in the weighted sum Q-learning method are given by

  V̄_w^π(s) = E_π[ Σ_{i=1}^{M} w_i r^i_{t+1} + γ Σ_{i=1}^{M} w_i r^i_{t+2} + γ² Σ_{i=1}^{M} w_i r^i_{t+3} + ··· | s_t = s ],    (30)

  Q_w^π(s, a) = E_π[ Σ_{i=1}^{M} w_i r^i_{t+1} + γ Σ_{i=1}^{M} w_i r^i_{t+2} + γ² Σ_{i=1}^{M} w_i r^i_{t+3} + ··· | s_t = s, a_t = a ].    (31)

For V̄_w^π(s),

  V̄_w^π(s) = E_π[ Σ_{i=1}^{M} w_i r^i_{t+1} | s_t = s ] + E_π[ γ Σ_{i=1}^{M} w_i r^i_{t+2} | s_t = s ] + E_π[ γ² Σ_{i=1}^{M} w_i r^i_{t+3} | s_t = s ] + ···    (32)

           = Σ_{i=1}^{M} w_i E_π[ r^i_{t+1} | s_t = s ] + Σ_{i=1}^{M} w_i E_π[ γ r^i_{t+2} | s_t = s ] + Σ_{i=1}^{M} w_i E_π[ γ² r^i_{t+3} | s_t = s ] + ···    (33)

           = Σ_{i=1}^{M} w_i E_π[ r^i_{t+1} + γ r^i_{t+2} + γ² r^i_{t+3} + ··· | s_t = s ]    (34)

           = V̂_w^π(s).    (35)

Hence,

  π^p = arg max_π V̂_w^π(s)    (36)
      = arg max_π V̄_w^π(s)    (37)
      = arg max_π max_a Q_w^π(s, a).    (38)

The last equation shows that π^p is the policy obtained by the weighted sum Q-learning method. Since π^p is a Pareto optimal policy, the lemma is proved.
D. Pareto optimality of the policy acquired by the proposed method

The following theorem holds from Theorem 1, Lemma 1 and Lemma 2.

Theorem 2: We assume that all state-action pairs are visited and continue to be updated. If the learning rate α(t) for time t satisfies the conditions (24), the policies obtained by the proposed method with weight vectors are Pareto optimal policies.
V. REDUCTION OF COMPUTATIONAL CONSUMPTION

As mentioned above, the proposed method can acquire all Pareto optimal policies. However, when the proposed method is performed, the number of elements of Q° becomes enormous, which makes the computation time much longer. To resolve this problem, we introduce a method that reduces the number of Q-vectors after the update by (12). In the method, a Q-vector q_2 of Q° is removed from Q° if

  || q_1 − q_2 || < ρ    (39)

where q_1 is a basis Q-vector and ρ is a threshold parameter. The basis vector is determined randomly from Q°, and this operation is repeated as long as some Q-vectors are removed.
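A minimal sketch of this reduction step, under our reading of the procedure above, is given below; the function name and the loop structure are our own assumptions.

```python
import random
import numpy as np

def reduce_q_vectors(Qo_sa, rho):
    """Distance-based thinning after Eq. (39): repeatedly pick a random basis
    vector q1 and drop every other vector q2 with ||q1 - q2|| < rho, until a
    pass removes nothing. With rho = 0 no vector is ever removed."""
    kept = [np.asarray(q, float) for q in Qo_sa]
    removed = True
    while removed and len(kept) > 1:
        removed = False
        q1 = random.choice(kept)
        survivors = [q for q in kept
                     if q is q1 or np.linalg.norm(q1 - q) >= rho]
        if len(survivors) < len(kept):
            kept, removed = survivors, True
    return kept
```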
VI. NUMERICAL EXPERIMENTS

In this section, we show the results of numerical experiments. The objectives of the experiments are as follows:
1) Can Pareto optimal policies be acquired according to the theory of the proposed method in Section IV?
2) Can Pareto optimal policies be acquired in the case where the method in Section V is introduced?
The Deep Sea Treasure problem [2] is used as the experimental study.

A. Deep Sea Treasure Problem

We consider the following Deep Sea Treasure problem.
• It is the 11 × 10 grid world problem shown in Fig. 1.
• The upper left cell in Fig. 1 is the agent's initial state, and the cells with numbers show the treasures.
• The agent can move up, down, left or right at each time t. However, if the destination is a black cell or outside the grid world, the agent does not move.
• There are two objectives in this problem. The first is reaching a treasure in the smallest number of steps, and the second is getting the most valuable treasure (the treasure which has the largest number).
• An episode ends when the agent reaches a treasure or the number of steps reaches the maximum.
• For the first objective, the agent gains reward (penalty) −1 whenever it takes an action.
• For the second objective, when the agent reaches a treasure cell, it gains the number in the cell as the reward. Otherwise, it gains no reward.

The values of the treasures are modified from the original setting of [2]. Fig. 2 shows the returns which the agent gains under all the Pareto optimal policies in this problem. For example, the point at (−19, 124) shows the returns for the policy under which the agent moves 19 times and gains the treasure whose value is 124.

Fig. 1. Deep Sea Treasure.

Fig. 2. Returns for all the Pareto optimal policies in the Deep Sea Treasure problem (step penalty vs. treasure value).
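The reward structure described above can be sketched as follows; the treasure positions and values in this snippet are placeholders for illustration, not the modified values used in the paper, and black (blocked) cells are omitted.

```python
import numpy as np

class DeepSeaTreasureSketch:
    """Minimal 2-objective grid sketch of the dynamics described above; the real
    11 x 10 layout and treasure values of the paper are not reproduced here."""
    def __init__(self, treasures=None, shape=(11, 10)):
        self.shape = shape
        self.treasures = treasures or {(1, 0): 1.0, (10, 9): 124.0}  # placeholder layout
        self.moves = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

    def reset(self):
        self.pos = (0, 0)                        # upper-left initial state
        return self.pos

    def step(self, action):
        dr, dc = self.moves[action]
        r, c = self.pos[0] + dr, self.pos[1] + dc
        if 0 <= r < self.shape[0] and 0 <= c < self.shape[1]:
            self.pos = (r, c)                    # otherwise the agent does not move
        treasure = self.treasures.get(self.pos, 0.0)
        reward = np.array([-1.0, treasure])      # [step penalty, treasure value]
        done = treasure > 0.0
        return self.pos, reward, done
```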
B. Numerical experiment 1

The aim of this experiment is to confirm that Pareto optimal policies are acquired according to the theory in Section IV. The values of the parameters in the proposed method are as follows:
• Number of episodes: N = 40000
• Learning rate: α = 1.0
• Discount rate: γ = 1.0
• Random action selection probability: ε = 0.2
• Threshold of distance: ρ = 0.0
This value of ρ means that the method in Section V is not used. These values of α and γ are set so that the proposed method, without the method in Section V, is performed in a short time.

Fig. 3 and Fig. 4 show Q° obtained by the proposed method. This is the Q° of the initial state and the action "right". For this learning result, all Pareto optimal policies are acquired through Q_w, which are extracted from Q° by giving various weight vectors w.

Fig. 3. Q°([0, 0], right) obtained in experiment 1 (Step penalty > −30).

Fig. 4. Q°([0, 0], right) obtained in experiment 1 (Step penalty < −30).
C. Numerical experiment 2

If the problem becomes more complicated and its size becomes larger, Pareto optimal policies may not be acquired with the parameter setting of Subsection VI-B. Thus, we show the experimental result for a typical parameter setting of Q-learning. The values of the parameters are as follows:
• Number of episodes: N = 10000
• Learning rate: α = 0.7
• Discount rate: γ = 0.999
• Random action selection probability: ε = 0.2
• Threshold of distance: ρ = 2.5
In this experiment, the method in Section V is used. These values of α, γ, ε and ρ are determined through preliminary experiments in such a way that the proposed method works as well as possible.

Fig. 5 and Fig. 6 show Q° obtained by the proposed method. This is the Q° of the initial state and the action "right". For this learning result, all Pareto optimal policies are acquired through Q_w, which are extracted from Q° by giving various weight vectors w.

Fig. 5. Q°([0, 0], right) obtained in experiment 2 (Step penalty > −30).

Fig. 6. Q°([0, 0], right) obtained in experiment 2 (Step penalty < −30).
D. Discussion

From the results, we showed empirically that all Pareto optimal policies can be acquired simultaneously by the proposed method.

However, the computation time in experiment 1 was 1112 seconds and that of experiment 2 was 121143 seconds (the CPU of the PC used in the experiments is an Intel Core i3, 3.20 GHz), so it took a long time to learn the Pareto optimal policies. This is because many vertices of the convex hull of the Q-vectors are updated by the operators of the convex hull.

Whereas some of the vertices contribute toward Pareto optimal policies, the others do not, as shown in Fig. 4 and Fig. 6. For such unpromising Q-vectors, it takes the agent a long time to reach a treasure, which also makes the computation time longer. However, when we did not use the method which reduces computational consumption under the parameter setting of experiment 2, the learning process had proceeded only 88 episodes and had not finished after 206 hours (= 8 days and 14 hours). Thus, it can be said that the method which reduces computational consumption is effective, from the result that we could finish the learning process with the method as in experiment 2.

VII. CONCLUSION

We have proposed a reinforcement learning method that can acquire all Pareto optimal policies simultaneously for multi-objective problems whose environment model is unknown. We have also proved theoretically that our proposed method can acquire Pareto optimal policies, by showing that the update equation of the proposed method is equivalent to that of the weighted sum Q-learning method, that the Q-values converge to the optimal values in the weighted sum Q-learning method, and that the policy gained by the weighted sum Q-learning method is Pareto optimal. Furthermore, we showed it experimentally through the results of two numerical experiments. However, our method has a problem of high computational consumption. In order to mitigate this problem, we have proposed a method that reduces the number of Q-vectors, and by using this method we could reduce the computational consumption. Nevertheless, many unpromising Q-vectors still remain, and the computation time of our proposed method is still long. Therefore, we need to consider ways of further reducing the computation time.
REFERENCES

[1] T. Kamioka, E. Uchibe, and K. Doya, "Max-Min Actor-Critic for Multiple Reward Reinforcement Learning," IEICE Transactions on Information and Systems, vol. J90-D, no. 9, pp. 2510-2521, 2007 (in Japanese).
[2] P. Vamplew, J. Yearwood, R. Dazeley, and A. Berry, "On the Limitations of Scalarisation for Multi-Objective Reinforcement Learning of Pareto Fronts," Artificial Intelligence 2008, LNAI 5360, pp. 372-378, 2008.
[3] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, MIT Press, 1998.
[4] L. Barrett and S. Narayanan, "Learning All Optimal Policies with Multiple Criteria," Proceedings of the 25th International Conference on Machine Learning, 2008.
[5] C. J. C. H. Watkins and P. Dayan, "Q-learning," Machine Learning, vol. 8, no. 3, pp. 279-292, 1992.