AIML 05 Conference, 19-21 December 2005, CICC, Cairo, Egypt
Reinforcement Learning Based on Actions and Opposite Actions
Hamid R. Tizhoosh
Systems Design Engineering
University of Waterloo
Waterloo, ON N2L 3G1, Canada
[email protected],
http://pami.uwaterloo.ca/tizhoosh/
Abstract
Reinforcement learning is a machine intelligence scheme for learning in highly dynamic and probabilistic environments. The methodology, however, suffers from a major drawback: convergence to an optimal solution usually requires high computational expense, since all states must be visited frequently in order to guarantee a reliable policy. In this paper, a new reinforcement learning algorithm is introduced to achieve faster convergence by taking opposite actions into account. By considering the opposite action simultaneously, multiple updates can be made for each state observation. This leads to a shorter exploration period and, hence, expedites the convergence. Experimental results for grid worlds of different sizes are provided to verify the performance of the proposed approach.
Keywords: Reinforcement Learning, Opposition
Learning, Q-Learning, Opposite Actions, Opposite
States
1 Introduction
The field of machine intelligence contains many examples of ideas borrowed from nature, simplified and
modified for implementation in a numerical framework. Reinforcement learning (RL) [1, 12, 7] is one
example of such an idea, building on the concepts of
reward and punishment central to human and animal
learning. Animals, for instance dogs, learn how to
behave by receiving rewards for “good” actions in a
particular situation. Similarly, an RL agent learns by
accumulating knowledge (reinforcement signals) to select the action leading to the highest expected reward.
Compared to other methodologies in machine
learning, RL algorithms have distinct advantages.
The most significant one is the ability of RL agents to
autonomously explore highly dynamic and stochastic
environments and develop, by accumulating evalua-
tive feedback from the environment, an optimal control policy. This characteristic is especially desirable if a priori knowledge is not available and a large training set from the past is missing as well. One of the
problems faced by RL algorithms, however, is the long
time needed for exploring the unknown environment.
Generally, the convergence of RL algorithms is guaranteed only if all states can be visited infinitely often [12, 2, 10]. Of course, in practice we work with limited computational resources and, in addition, application-based time constraints usually do not allow lengthy processing. Therefore, a relatively large number of research works in the RL field has been devoted to state reduction or to the experience generalization of RL agents.
Ribeiro [9] has argued that RL agents are rarely a tabula rasa and, hence, that intensive exploration may not be required; he has proposed embedding a priori knowledge about the characteristics of the environment. Combining reinforcement agents with other techniques is another approach; for instance, decision trees can be used to decrease the sample space [4]. Other methods make use of prior knowledge to improve future reinforcement learning agents [5]. Modified RL agents, for instance relational reinforcement agents [6], have been proposed as well.
2 Opposite Actions
Reinforcement learning is based on the interaction of an intelligent agent with its environment through reward and punishment [12]. The agent takes actions to change the state of the environment and receives reward or punishment in return. In this sense, reinforcement learning is a type of weakly supervised learning. In order to explain how the concept of opposition-based learning can be used to extend reinforcement agents, we focus on the simplest and most popular reinforcement learning algorithm, namely Q-learning [14, 15, 8, 3]. General investigations of RL agents based on temporal differences [11, 12] (λ ≠ 0) remain a subject for future work.
For Q-learning in particular, the amount of time
needed for convergence is proportional to the size of
the Q-matrix. A larger Q-matrix, resulting from a
larger number of states and/or a greater number of
actions, requires more time to be filled.
Generally, RL agents begin from scratch: they make stochastic decisions, explore the environment, find rewarding actions and exploit them. Especially at the very beginning, the performance of RL agents is poor due to a lack of knowledge about which actions can steer the environment in the desired direction. In [13]
the idea of opposition-based learning was introduced
and defined as follows:
Opposition-Based Learning - Let f(x) be a function to be solved and g(·) a proper evaluation
function. If x is an initial (random) guess and x̆
is its opposite value, then in every iteration we
calculate f (x) and f (x̆). The learning continues with x if g(f (x)) > g(f (x̆)), otherwise with
x̆. The evaluation function g(·), as a measure
of optimality, compares the suitability of results
(e.g. fitness function, reward and punishment,
error function etc.).
Examples are provided to demonstrate how the concept of opposition can be employed to extend genetic algorithms, neural networks, etc. [13].
With respect to reinforcement learning, opposition-based learning stipulates that whenever the RL agent takes an action, it should also consider the opposite action and/or the opposite state. This shortens the state-space traversal and should consequently accelerate the convergence.
Of course, the concept of opposition can only be applied if opposite actions and opposite states are meaningful in the context of the problem at hand. With regard to action a and state s and the existence of their opposites ă and s̆, the following cases can be distinguished:
1. Opposite action ă and opposite state s̆ are given:
at least four cases can be updated per state observation.
2. Only ă can be defined: two cases can be updated
per state observation.
3. Only s̆ can be defined: two cases can be updated
per state observation.
4. Neither ă nor s̆ can be given: the application of the opposition concept is not straightforward.
Assuming that opposite actions and opposite states both exist, at least four state-action pairs can be updated in each iteration. In general, if action a is rewarded for the state s, then a is punished for the opposite state s̆, and the opposite action ă is punished for s and rewarded for s̆ (Figure 1).

Figure 1: Time saving in RL: the action a is rewarded for the state s (the gray cell). The opposite cases are updated simultaneously without explicit action-taking.
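To make this bookkeeping concrete, the following sketch (a minimal illustration, not the paper's implementation) shows how a tabular Q-matrix could be updated for all four state-action combinations after a single observation; the assumptions that the opposite cases simply receive the negated reward and that they share the observed successor state are simplifications introduced here.

import numpy as np

def four_way_update(Q, s, a, r, s_next, s_opp, a_opp, alpha=0.1, gamma=0.7):
    """Update Q for the observed pair (s, a) and, speculatively, for the
    three opposite combinations described above.  Using the negated reward
    and the same successor state for the opposite cases is an assumption."""
    def td(state, action, reward):
        Q[state, action] += alpha * (reward + gamma * Q[s_next].max() - Q[state, action])

    td(s, a, r)            # observed case: a rewarded (or punished) in s
    td(s, a_opp, -r)       # opposite action, same state
    td(s_opp, a, -r)       # same action, opposite state
    td(s_opp, a_opp, r)    # opposite action, opposite state
    return Q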
The Degree of Opposition - In order to make the additional updates described above, the RL agent has to know how to find opposite actions and opposite states. Clearly, this depends on the application at hand. Whereas for applications such as elevator control opposite actions are straightforward, this may not be the case for other applications. Nonetheless, general procedures may be defined to facilitate this. Here, a degree of opposition ϕ̆ can be defined to measure to what extent two actions a1 and a2 are the opposite of each other:
ϕ̆(a1|si, a2|sj) = η × [1 − exp(−|Q(si, a1) − Q(sj, a2)| / max_k(Q(si, ak), Q(sj, ak)))],

where η is the state similarity and can be calculated based on state clustering or by a simple measure such as

η(si, sj) = 1 − [Σ_k |Q(si, ak) − Q(sj, ak)|] / [Σ_k max(Q(si, ak), Q(sj, ak))].          (1)
Example - The following excerpt of a Q-matrix is given:

         a1    a2    a3    a4
  s1      1    20     4    10
  s2      1     2     5     8                                                             (2)
  ...    ...   ...   ...   ...
The state similarities are

  η(s1, s1) = 1,
  η(s2, s2) = 1,                                                                          (3)
  η(s1, s2) = 1 − (0 + 18 + 1 + 2)/(1 + 20 + 5 + 10) = 0.40.
Sample degrees of opposition can be calculated:

  ϕ̆(a1|s1, a1|s1) = 1 × (1 − 1)          = 0,
  ϕ̆(a1|s1, a2|s1) = 1 × (1 − e^(−19/20)) = 0.62,
  ϕ̆(a1|s1, a3|s1) = 1 × (1 − e^(−3/20))  = 0.14,
  ϕ̆(a1|s1, a4|s1) = 1 × (1 − e^(−9/20))  = 0.37,
                                                                                          (4)
  ϕ̆(a1|s1, a1|s2) = 0.4 × (1 − e^(−0/20))  = 0,
  ϕ̆(a2|s1, a2|s2) = 0.4 × (1 − e^(−18/20)) = 0.24,
  ϕ̆(a2|s1, a4|s2) = 0.4 × (1 − e^(−12/20)) = 0.18,
  ϕ̆(a3|s1, a3|s2) = 0.4 × (1 − e^(−1/20))  = 0.02.
Apparently, the degree of opposition ϕ̆ delivers intuitively correct values, with the exception that two actions in two different states cannot be regarded as opposite actions if their Q-values are the same. If this is not desired in the context of the problem at hand, other ways of defining the degree of opposition may be required.
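As a concrete reading of Eq. (1) and the definition of ϕ̆, the following sketch computes the state similarity and the degree of opposition directly from a Q-matrix; the array layout (rows = states, columns = actions) and the function names are illustrative assumptions, not notation from the paper.

import numpy as np

def state_similarity(Q, si, sj):
    """Eq. (1): similarity of two states based on their rows of Q-values."""
    diff = np.abs(Q[si] - Q[sj]).sum()
    denom = np.maximum(Q[si], Q[sj]).sum()
    return 1.0 - diff / denom

def degree_of_opposition(Q, si, a1, sj, a2):
    """Degree of opposition of action a1 in state si and action a2 in state sj."""
    eta = state_similarity(Q, si, sj)
    scale = np.maximum(Q[si], Q[sj]).max()   # max over all actions of both states
    return eta * (1.0 - np.exp(-abs(Q[si, a1] - Q[sj, a2]) / scale))

# The Q-matrix excerpt of the example (rows s1, s2; columns a1..a4):
Q = np.array([[1.0, 20.0, 4.0, 10.0],
              [1.0,  2.0, 5.0,  8.0]])

print(state_similarity(Q, 0, 1))            # roughly 0.4, as in Eq. (3)
print(degree_of_opposition(Q, 0, 0, 0, 1))  # phi(a1|s1, a2|s1), roughly 0.6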
3 Algorithm Design
Using opposite states and opposite actions in order
to make additional updates in each episode can be
explicitly implemented if the original reinforcement
algorithm is given. In this work, Q-learning is selected
to demonstrate the design and implementation of an
opposition-based RL agent. The Q-learning algorithm
is given in Table 1.
In order to develop the opposition-based version of Q-learning, different strategies can be followed, among which the following has been investigated in this work:

• A second learn step ᾰ is defined as a decreasing function of the number of episodes nE. Additional updates for the opposite action ă are made in a diminishing manner (see algorithm OQL in Table 2). In each iteration/step i the opposition learn step ᾰ can be updated using an equation such as

      ᾰ = 1 − √(i / nE).                                                                  (5)
In the next section experimental results will be provided to compare the Q-learning (QL) with the proposed algorithm OQL.
4 Experimental Verification

In order to verify the performance of the opposition-based reinforcement learning, the grid world problem was selected (Fig. 2). Experiments were conducted for 50 × 50 and 100 × 100 grid worlds for Q-learning (QL) and the proposed algorithm OQL.

Figure 2: Grid world problem: The agent can move in four directions to find the goal (marked with a star).

Since the goal position was known, the reward was calculated based on the distance d of the agent, at position (xA, yA), from the goal located at (xG, yG):

  r =  +1  if di ≤ di−1,
       −1  if di > di−1,                                                                  (6)

where the distance was calculated as

  d = √((xG − xA)² + (yG − yA)²).                                                         (7)
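A short sketch of this distance-based reward, assuming the agent and goal coordinates are available at every step (the function names and signatures are illustrative):

import math

def distance(agent, goal):
    """Eq. (7): Euclidean distance between the agent (xA, yA) and the goal (xG, yG)."""
    (x_a, y_a), (x_g, y_g) = agent, goal
    return math.hypot(x_g - x_a, y_g - y_a)

def reward(agent_now, agent_before, goal):
    """Eq. (6): +1 if the agent did not move farther from the goal, -1 otherwise."""
    return 1.0 if distance(agent_now, goal) <= distance(agent_before, goal) else -1.0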
Two experiment series were designed and conducted. In the first series the policy accuracy generated by the agents was measured for different numbers of learn steps and episodes. In the second series a desired policy accuracy was determined to verify how many learn steps and episodes each algorithm would need to reach that accuracy.

The two algorithms QL and OQL were run for different grid world sizes. For each grid world two optimal policies π1 and π2 were generated manually (Figure 3). A performance measure Aπ* was defined to measure the accuracy of the policy π* generated by the agent:

  Aπ* = ‖(π* ∩ π1) ∪ (π* ∩ π2)‖ / (m × m),                                                (8)

where ‖·‖ denotes the cardinality of a two-dimensional set and the grid world is of size m × m. For K trials the total average accuracy Ām×m is given by

  Ām×m = (1/K) Σ_k A^(k)_π*.                                                              (9)
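A possible reading of the accuracy measures in Eqs. (8) and (9), assuming each policy is stored as an m × m array of action indices (an illustrative encoding):

import numpy as np

def policy_accuracy(pi_star, pi1, pi2):
    """Eq. (8): fraction of grid cells in which the learned policy agrees
    with at least one of the two manually created optimal policies."""
    agrees = (pi_star == pi1) | (pi_star == pi2)
    return agrees.mean()

def average_accuracy(accuracies):
    """Eq. (9): total average accuracy over K trials."""
    return sum(accuracies) / len(accuracies)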
Every experiment was repeated K = 10 times to generate the average accuracies Ām×m along with the corresponding standard deviations σ. The summary of the results is presented in Table 3.
Table 1: QL: Q-Learning algorithm [12]

  Initialize Q(s, a) randomly
  Repeat (for each episode)
    Initialize s
    Repeat (for each iteration of the episode)
      Choose a from s using policy π derived from Q
      Take action a, observe r, s′
      Q(s, a) ← Q(s, a) + α [r + γ max_a′ Q(s′, a′) − Q(s, a)]
      s ← s′
    until s is terminal
Table 2: OQL: The proposed Q-learning extension using opposite actions

  Initialize Q(s, a) randomly
  Set ᾰ = 1
  Repeat (for each episode)
    Initialize s
    Repeat (for each iteration of the episode)
      Choose a from s using policy π derived from Q
      Take action a, observe r, s′
      Q(s, a) ← Q(s, a) + α [r + γ max_a′ Q(s′, a′) − Q(s, a)]
      Determine the opposite action ă
      Calculate the opposite reward r̆
      Update the learn step ᾰ, e.g. using Eq. 5
      Q(s, ă) ← Q(s, ă) + ᾰ [r̆ + γ max_a′ Q(s′, a′) − Q(s, ă)]
      s ← s′
    until s is terminal
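For illustration, the OQL loop of Table 2 could be rendered in Python roughly as follows; the environment interface (reset/step), the softmax action selection, the opposite-action lookup table and the choice r̆ = −r are assumptions made for this sketch, and Eq. (5) is applied here with i interpreted as the episode index.

import numpy as np

def softmax_choice(q_row, tau=1.0, rng=np.random):
    """Softmax (Boltzmann) action selection over one row of the Q-matrix."""
    p = np.exp((q_row - q_row.max()) / tau)
    return rng.choice(len(q_row), p=p / p.sum())

def oql(env, n_states, n_actions, opposite_action,
        n_episodes=50, n_iters=1000, alpha=0.1, gamma=0.7):
    """Sketch of the OQL algorithm of Table 2.  env is assumed to expose
    reset() -> s and step(a) -> (s_next, r, done)."""
    Q = np.random.rand(n_states, n_actions)               # initialize Q(s, a) randomly
    for episode in range(n_episodes):
        alpha_opp = 1.0 - np.sqrt(episode / n_episodes)   # Eq. (5), diminishing learn step
        s = env.reset()
        for _ in range(n_iters):
            a = softmax_choice(Q[s])
            s_next, r, done = env.step(a)
            # standard Q-learning update
            Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
            # additional update for the opposite action with the opposite reward
            a_opp = opposite_action[a]
            Q[s, a_opp] += alpha_opp * (-r + gamma * Q[s_next].max() - Q[s, a_opp])
            s = s_next
            if done:
                break
    return Q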
Interpretation - The following statements can be made based on the results:

• 50 × 50 case: OQL has the highest accuracy, with a margin of about 8% over QL.

• 100 × 100 case: OQL is the best alternative, with a difference of almost 10% to QL.

Experiments were also run for a higher learn step (α = 0.9). The benefit of the OQL algorithm was still observable there, but not as distinct as for the lower learn step (α = 0.1).
5 Conclusion
In this paper a new class of reinforcement algorithms
was proposed. Based on the idea of opposition-based
learning, reinforcement learning can be extended
to consider not only the states and actions but also
opposite states and opposite actions. The motivation
was to reduce the exploration time and, hence, accelerate the convergence of reinforcement agents.
Q-learning, as the most popular RL algorithm, was selected to demonstrate the opposition-based extension. A new algorithm derived from the standard Q-learning algorithm was developed and implemented. In order
to verify the potential benefits of the opposition-based
learning, experiments with 50 × 50 and 100 × 100 grid
worlds were conducted. Optimal policies were created
manually upon which a performance measure could
be established. The results demonstrate the faster convergence of the opposition-based extension of Q-learning.
Acknowledgements
The author would like to thank Maryam Shokri for
valuable discussions, and Kenneth Lee and Jendrik
Schmidt for conducting the experiments.
References
[1] Barto, A.G., Sutton, R.S. , Brouwer, P.S., Associative search network: A reinforcement learning
associative memory, Biological Cybernetics, Vol.
40, No. 3, pp. 201-211, May 1981.
[2] Beggs, A.W., On the convergence of reinforcement learning, Journal of Economic Theory Volume: 122, Issue: 1, May, 2005, pp. 1-36.
Figure 3: Two sample optimal policies for a 10 × 10 grid world with the goal at a fixed position. Manually created optimal policies are used to measure the performance of each algorithm.
Table 3: Results of experiments: average policy accuracies for Q-learning (QL) and the proposed opposition-based algorithm (OQL) for different grid world sizes. A softmax policy was used for all algorithms (max. 50 episodes and 1000 iterations per episode, α = 0.1, γ = 0.7 for both algorithms).

  Algorithms    Ā50×50     σ      Ā100×100     σ
  QL             80.5      2.1      66.3       1.0
  OQL            88.1      2.6      75.6       1.4
[3] Caironi, P.V. C., Dorigo, M., Training and delayed reinforcements in Q-learning agents, International Journal of Intelligent Systems Volume:
12, Issue: 10, October 1997, pp. 695 - 724
[4] Driessens, K., Ramon, J., Blockeel, H., Speeding
Up Relational Reinforcement Learning through
the Use of an Incremental First Order Decision Tree Learner, Proc. 12th European Conference on Machine Learning, Freiburg, Germany,
September 2001.
[5] Drummond, C., Composing functions to speed up
reinforcement learning in a changing world, Proc.
10th European Conference on Machine Learning.
Springer-Verlag, 1998.
[6] Dzeroski, S., De Raedt, L., Driessens, K, Relational Reinforcement Learning, Machine Learning Volume: 43, Issue: 1-2, April 2001 - May
2001, pp. 7-52.
[7] Kaelbling, L.P., Littman, M.L., Moore, A.W.,
Reinforcement Learning: A Survey, Journal of
Artificial Intelligence Research, Volume 4, 1996.
[8] Peng, J., Williams, R. J., Incremental multi-step
q-learning, Machine Learning, 22(1/2/3), 1996.
[9] Ribeiro, C. H. C., Embedding a Priori Knowledge in Reinforcement Learning, Journal of Intelligent and Robotic Systems, 21:51-71, 1998.
[10] Singh, S., Jaakkola, T., Littman, M.L., Szepesvári, C., Convergence Results for Single-Step On-Policy Reinforcement-Learning Algorithms, Machine Learning, Volume 38, Issue 3, March 2000, pp. 287-308.
[11] Sutton, R. S., Temporal Credit Assignment in Reinforcement Learning, PhD thesis, University of Massachusetts, Amherst, MA, 1984.
[12] Sutton, R.S., Barto, A.G., Reinforcement Learning: An Introduction, MIT Press, 1998.
[13] Tizhoosh, H.R., Opposition-Based Learning: A
New Scheme for Machine Intelligence, International Conference on Computational Intelligence for Modelling, Control and Automation CIMCA'2005, 28-30 November 2005, Vienna, Austria.
[14] Watkins, C. J. C. H., Learning from Delayed Rewards, PhD thesis, Cambridge University, Cambridge, England, 1989.
[15] Watkins, C. J. C. H., Dayan, P., Q-learning, Machine Learning, 8:279-292, 1992.