A Novel Artificial Hydrocarbon Networks Based Value Function Approximation in Hierarchical Reinforcement Learning

Hiram Ponce
Faculty of Engineering, Universidad Panamericana, 03920 Mexico City, Mexico
[email protected]

Abstract. Reinforcement learning aims to solve the problem of learning optimal or near-optimal decision-making policies for a given domain problem. However, it is known that increasing the dimensionality of the input space (i.e. environment) increases the complexity of the learning algorithms, leading to the curse of dimensionality. Value function approximation and hierarchical reinforcement learning are two different approaches proposed to alleviate this problem. In that sense, this paper proposes a new value function approximation using artificial hydrocarbon networks (a supervised learning method inspired by chemical carbon networks) with regularization at each subtask in a hierarchical reinforcement learning framework. Comparative results against a greedy sparse value function approximation over the MAXQ hierarchical method were computed, showing that artificial hydrocarbon networks improve the accuracy and efficiency of the value function approximation.

Keywords: reinforcement learning, machine learning, artificial organic networks, regularization

1 Introduction

Reinforcement learning (RL) aims to solve the problem of learning optimal or near-optimal decision-making policies for a given domain problem [3]. The input space (i.e. environment) is represented using states that are in turn composed of a set of features. Depending on the domain problem, the state space can be continuous or discrete; but in any case, the number of features (dimensions) that represents the environment impacts the dimensionality of the overall domain. However, it is known that increasing the dimensionality of the state space increases the complexity of the learning algorithms, leading to the curse of dimensionality [8]. Due to table structures over states or state-action pairs, and large state spaces, reinforcement learning becomes inefficient and impractical to apply successfully in complex domains [7].

In that sense, hierarchical reinforcement learning (HRL) deals with large domains by dividing the global goal into smaller subgoals with related subtasks, and attempts to tackle each subgoal separately [7,10]. Several hierarchical strategies have been proposed in the literature, for example: the options method [21] defines each subtask in terms of a fixed policy provided in advance, the hierarchy of abstract machines (HAM) method [10] defines each subtask in terms of a nondeterministic finite-state machine, and the MAXQ method [3] defines each subtask in terms of a terminal predicate (i.e. the terminal state of the subtask) and a local reward function. Focusing on the latter, there are three key factors that the MAXQ method supports, providing better learning performance in high-dimensional spaces [3]: (i) temporal abstraction, which assumes that a certain task might take several steps to accomplish and is modeled as a semi-Markov decision process; (ii) state abstraction, which defines a relevant subset of features to represent the state space for each subtask; and (iii) task sharing, which means that a certain task can be shared by other subtasks if needed. In particular, Dietterich [3] proved that optimal or near-optimal state abstraction, i.e. safe state abstraction, improves the efficiency of the MAXQ method when dealing with high-dimensional spaces.
Looking closely, this means that a minimal subset of features will optimize the value (or state-action) functions. In fact, value function approximation [11,18] tends to reduce the complexity of reinforcement learning algorithms due to its compact representation, instead of tables. It allows generalization over the state space, so that the learning complexity is further reduced [3]. Different approaches have been proposed for value function approximation in non-hierarchical reinforcement learning, such as: linear function approximators [4,18], greedy algorithms [9], machine learning approaches [11,19], sparse approximators [9,5], non-parametric approximators [9], and others.

From the above, this paper proposes a new value function approximation based on artificial hydrocarbon networks at each subtask in the hierarchy of the MAXQ method, to improve the accuracy and efficiency of that type of knowledge representation. In fact, the artificial hydrocarbon networks (AHN) method has proved to be efficient in other function approximation tasks [16,14,15,13]. The present work is part of ongoing research on hierarchical reinforcement learning and the automation of safe state abstraction in a given problem domain. To this end, two key contributions are presented in this paper. First, it proposes an alternative value function approximation that uses the notion of packaging information in molecules through the AHN method, which also exploits the machine learning paradigm for modeling nonlinear and sparse spaces. Second, it presents a modified version of the AHN method using L1 regularization to improve the efficiency of the approximator in terms of generalization and avoidance of overfitting.

The rest of the paper is organized as follows. Section 2 presents related work on value function approximation in RL and trends related to the HRL framework. Section 3 presents the description of the proposal, including a brief overview of artificial hydrocarbon networks and the modified version based on L1 regularization. Section 4 describes the experimentation based on the well-known Dietterich taxi problem domain and the comparative results obtained so far, and Sect. 5 concludes the paper.

2 Fundamentals and Related Work

Value (state-action) function approximation has been widely studied in reinforcement learning. In this section, the fundamentals of RL and the value function are considered. Then, the state of the art in value function approximation is described. Lastly, the MAXQ method is briefly introduced.

2.1 Preliminaries

A Markov decision process (MDP) provides a model M in which an agent interacts with its fully-observable environment, denoted as M = ⟨S, A, P, R, P_0⟩; where S is a finite set of states of the environment, A is a finite set of actions that the agent can perform, P(s'|s, a) is the probability that the environment transitions from its current state s to the resulting state s' when an action a ∈ A is performed, R(s'|s, a) is the reward that the agent receives when it performs action a and the environment makes a transition from s to s', and P_0(s) is the probability that the MDP is initialized in state s [4,3]. Then, the aim of reinforcement learning is to produce (i.e. learn) a policy π, a mapping from states to actions that tells the agent which action a to perform when it is in state s, such that a = π(s).
In that sense, the value function V^π provides a mechanism to find a policy π, and is defined as the expected cumulative reward that an agent will obtain by executing π starting in state s. The value function is expressed as (1), where r_t is the reward that the agent receives at time t and γ is a discount factor [3]. In fact, the value function satisfies the Bellman equation [1] for a fixed policy, as in (2).

$$V^\pi(s) = E\left\{ \sum_{i=0}^{\infty} \gamma^i r_{t+i} \;\Big|\; s_t = s, \pi \right\} \qquad (1)$$

$$V^\pi(s) = \sum_{s'} P(s'|s, \pi(s)) \left\{ R(s'|s, \pi(s)) + \gamma V^\pi(s') \right\} \qquad (2)$$

In addition, there is the state-action function Q^π, or Q-function (3), that computes the expected cumulative reward of performing an action a in state s and thereafter following the policy π.

$$Q^\pi(s, a) = \sum_{s'} P(s'|s, a) \left\{ R(s'|s, a) + \gamma Q^\pi(s', \pi(s')) \right\} \qquad (3)$$

2.2 Overview of the MAXQ Method

Formally, MAXQ decomposes a given problem M, modeled as an MDP, into a finite set of subtasks. Each subtask M_i is defined as a semi-Markov decision process (SMDP) tuple M_i = ⟨T_i, A_i, R̂_i⟩ that is completed in N time steps; where T_i is a termination state of the current subtask, A_i is a set of actions that can be performed to achieve that subtask, and R̂_i is the pseudo-reward function that specifies a pseudo-reward for each transition to a terminal state. Associated with each subtask there is a policy π_i, and the set of all policies in the problem, π = {π_0, ..., π_n}, is called the hierarchical policy [3].

In order to find an optimal or near-optimal hierarchical policy, the MAXQ method decomposes the hierarchical value function V^π (i.e. the expected cumulative reward of following π starting in state s) into a subset of so-called projected value functions V^π(i, s), denoted as (4), or alternatively as (5) if the first action a chosen from π_i is a subroutine and it terminates in N steps. Notice that the expected cumulative reward after executing a is V^π(π_i(s), s) = V^π(a, s).

$$V^\pi(i, s) = E\left\{ \sum_{u=0}^{\infty} \gamma^u r_{t+u} \;\Big|\; s_t = s, \pi \right\} \qquad (4)$$

$$V^\pi(i, s) = E\left\{ \sum_{u=0}^{N-1} \gamma^u r_{t+u} + \sum_{u=N}^{\infty} \gamma^u r_{t+u} \;\Big|\; s_t = s, \pi \right\} = V^\pi(\pi_i(s), s) + \sum_{s', N} P_i^\pi(s', N | s, \pi_i(s)) \, \gamma^N V^\pi(i, s') \qquad (5)$$

Moreover, Dietterich [3] proved that the related projected state-action function can be expressed as (6), with C^π being the completion function (7), resulting in a re-expression of V^π(i, s) as (8)^1.

$$Q^\pi(i, s, a) = V^\pi(a, s) + C^\pi(i, s, a) \qquad (6)$$

$$C^\pi(i, s, a) = \sum_{s', N} P_i^\pi(s', N | s, a) \, \gamma^N Q^\pi(i, s', \pi(s')) \qquad (7)$$

$$V^\pi(i, s) = \begin{cases} Q^\pi(i, s, \pi_i(s)) & \text{if } i \text{ is composite} \\ \sum_{s'} P(s'|s, i) R(s'|s, i) & \text{if } i \text{ is primitive} \end{cases} \qquad (8)$$

To this end, MAXQ requires a hierarchy of the problem defined with a set of nodes [3]: Max nodes (representing primitive actions or subtasks) and Q nodes (actions of subtasks). In particular, Max nodes store V^π(i, s) values, and Q nodes store completion functions C^π(i, s, a) that represent the cumulative reward of completing subtask i after invoking action a ∈ A_i in state s. The MaxQ-Q algorithm [3] implements the MAXQ method and works with arbitrary R̂_i functions. In [3], it is shown that R̂_i functions contaminate the completion functions C^π. To solve this problem, MaxQ-Q uses two completion functions: the contaminated Ĉ_i function using R̂_i, and the clean C_i function without using R̂_i.
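To make the fixed-policy Bellman equation (2) concrete, the following is a minimal sketch of tabular policy evaluation on a small synthetic MDP. It is an illustration under assumed toy dynamics (the state/action sizes and the random model are made up for the example), not an implementation from the paper.

```python
import numpy as np

# Toy MDP sizes and discount factor; these values are illustrative assumptions.
n_states, n_actions, gamma = 4, 2, 0.95
rng = np.random.default_rng(0)

# P[a, s, s'] : transition probabilities, R[a, s, s'] : rewards (random toy model).
P = rng.random((n_actions, n_states, n_states))
P /= P.sum(axis=2, keepdims=True)
R = rng.normal(size=(n_actions, n_states, n_states))

# A fixed deterministic policy pi(s), chosen at random for the example.
policy = rng.integers(n_actions, size=n_states)

# Iterate the Bellman equation (2) for the fixed policy until convergence.
V = np.zeros(n_states)
for _ in range(10_000):
    V_new = np.array([
        np.dot(P[policy[s], s], R[policy[s], s] + gamma * V)
        for s in range(n_states)
    ])
    if np.max(np.abs(V_new - V)) < 1e-10:
        break
    V = V_new

print(V)  # V^pi(s) for each state under the fixed policy
```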
^1 Primitive subtasks are those that are performed and terminated by actions, while composite subtasks are those that perform other subtasks.

2.3 Value Function Approximation in Reinforcement Learning

Value or state-action function approximation has been proposed in reinforcement learning to relax the curse of dimensionality, to avoid table structures of knowledge representation, and to handle continuous state spaces [11,9,18]. In that sense, value function approximation defines states s ∈ S using a finite number of features f_i, for all i = 1, ..., k, where k is the number of features in state s, and approximates the value function as a function of a finite-dimensional feature vector as in (9).

$$V^\pi(s) \simeq V^\pi(f_1, \ldots, f_k) \qquad (9)$$

Different approaches have been proposed for value function approximation in non-hierarchical reinforcement learning, as described above. In fact, linear approximation is the most widely used technique, for example: least-squares temporal difference (LSTD) [2] or gradient-based temporal difference (TD) learning [20]. Although linear mappings of nonlinear features have been considered [4], linear approximation does not offer good accuracy in discrete domains and tends to overfit the data. As an alternative, machine learning based approaches for value function approximation have been proposed in the literature. For example, Riedmiller [19] proposed the neural fitted Q (NFQ) algorithm, which is based on multilayer perceptrons and stores and reuses experiences at each step of the retraining process. However, artificial neural networks and linear approaches tend to smooth the value function globally [11]. This behavior is not suitable for the expected cumulative reward function because it contaminates the overall performance of the RL algorithm. Thus, the authors in [11] proposed to use model trees as state-action function approximators in a hierarchical reinforcement learning framework, allowing an agent to learn strategies for the Settlers of Catan game. They proved that value function approximation in combination with HRL improves the performance of reinforcement learning in complex games. Kolter and Ng [5] proposed to learn a sparse representation of the value function in order to improve generalization and reduce overfitting, by adapting supervised learning models with regularization. Similarly, Painter-Wakefield and Parr [9] proposed a sparse value function approximation to overcome overfitting in some approaches. They considered a regularization term in the optimization process, obtaining sparse, compact and more understandable value function approximations. They proposed two different algorithms based on orthogonal matching pursuit (OMP), combined with Bellman residual minimization (BRM) and temporal difference (TD) learning, respectively.

In this work, machine learning and sparse approximation techniques are used together in order to inherit the advantages of both for value function approximation. In addition, the resulting approximator is proposed to be used in the MAXQ hierarchical reinforcement learning framework.

3 Description of the Proposal

In this section, a brief overview of artificial hydrocarbon networks is presented. Then, a modified version of this method using L1 regularization is proposed, and lastly it is described how to use this proposal for value function approximation.
3.1 Artificial Hydrocarbon Networks

The artificial hydrocarbon networks technique is a supervised learning method inspired by chemical hydrocarbon compounds [14,16], and it belongs to the class of learning algorithms called artificial organic networks [14]. Similarly to chemical compounds, artificial hydrocarbon networks package information in modules called molecules that can be related to one another through chemistry-like mechanisms, i.e. heuristic rules, resulting in graph models that are organized and optimized in terms of chemical energy. In [16], it is proved that the AHN method preserves chemistry-based characteristics such as modularity, inheritance, organization and structural stability. For readability, Table 1 summarizes the relationship between the chemical terms of the artificial organic networks framework and their computational meanings in the AHN technique [16].

Table 1. Description of the chemical terms used in artificial hydrocarbon networks.

Chemical Terminology          | Symbols                       | Meaning
environment                   | x (features)                  | data inputs
behavior                      | y (target)                    | data output, solution of mixtures
atoms                         | h_i, v_C (parameters)         | basic structural units or properties
molecules                     | ϕ(x) (functions)              | basic units of information
compounds                     | ψ(x) (composite functions)    | complex units of information
mixtures                      | AHN(x) (linear combinations)  | combination of compounds
stoichiometric coefficients   | α_i (weights)                 | definite ratios in mixtures
bounds                        | L_0, L_t (parameters)         | boundaries in the inputs
energy                        | E_0, E_t (loss function)      | error between target and estimated values

In AHN, the basic unit of information is the CH-molecule. It is made of two or more atoms related among themselves in order to define a behavior function ϕ(x) of the input vector x = {x_1, ..., x_k}. This molecule is made of a carbon atom with value v_C surrounded by hydrogen atoms with values h_i ∈ ℂ, as expressed in (10), where 1 ≤ d ≤ 4 represents the number of hydrogen atoms in the molecule.

$$\varphi(x) = v_C \sum_{r=1}^{k} \prod_{i=1}^{d} (x_r - h_{i,r}) \qquad (10)$$

Unsaturated CH-molecules, i.e. those with d < 4, can be joined together with other CH-molecules, forming chains of molecules, the so-called artificial hydrocarbon compounds. In this work, saturated and linear chains of molecules will be used as compounds like in (11), in which a compound is made of n molecules: (n − 2) CH2 molecules and two CH3 molecules at the ends. The CH_d symbol represents a molecule with d hydrogen atoms [16]. A compound behavior function ψ is also defined. The simplest compound behavior is a piecewise function, denoted as (12), where L_t = {L_{t,1}, ..., L_{t,k}} for all t = 0, ..., n is the set of boundaries within which a CH-molecule acts over the input space.

$$CH_3 - CH_2 - \cdots - CH_2 - CH_3 \qquad (11)$$

$$\psi(x) = \begin{cases} \varphi_1(x) & L_{0,r} \le x_r < L_{1,r} \\ \cdots & \cdots \\ \varphi_n(x) & L_{n-1,r} \le x_r \le L_{n,r} \end{cases}, \quad \forall r = 1, \ldots, k \qquad (12)$$

To obtain the boundaries L_t, a distance δ between two adjacent bounds, i.e. {L_{t−1}, L_t}, is computed as in (13). In addition, Δδ is computed using a gradient descent method based on the energy of the adjacent molecules (E_{t−1} and E_t), as in (14), where 0 < η < 1 is a learning rate parameter. For implementation purposes, the energy of a molecule can be computed using the least squares error (LSE) as the loss function [16].

$$\delta = \delta + \Delta\delta \qquad (13)$$

$$\Delta\delta = -\eta (E_{t-1} - E_t), \quad E_0 \leftarrow 0 \qquad (14)$$

Also, the AHN method defines a mixture AHN that is a linear combination of c compounds in definite ratios, the so-called stoichiometric coefficients {α_1, ..., α_c}, as shown in (15).

$$AHN(x) = \sum_{i=1}^{c} \alpha_i \psi_i(x) \qquad (15)$$

The simplest training algorithm is the so-called AHN-algorithm [17].
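To illustrate (10) and (12), the following is a minimal sketch, not code from the paper, of a CH-molecule behavior and the piecewise compound behavior over a one-dimensional input. The molecule parameters, the bounds, and the sum-over-features reading of (10) are assumptions made for the example.

```python
import numpy as np

def molecule(x, v_c, hydrogens):
    """CH-molecule behavior, eq. (10): for each feature, a polynomial with roots
    at the hydrogen values, summed over features and scaled by the carbon value."""
    x = np.atleast_2d(x)                      # shape (samples, k)
    total = np.zeros(x.shape[0])
    for r in range(x.shape[1]):
        poly = np.ones(x.shape[0])
        for h in hydrogens[r]:                # up to d hydrogen values per feature
            poly *= (x[:, r] - h)
        total += poly
    return v_c * total

def compound(x, molecules, bounds):
    """Piecewise compound behavior, eq. (12): each molecule acts on its own
    interval of a one-dimensional input space."""
    x = np.asarray(x, dtype=float)
    y = np.empty_like(x)
    n = len(molecules)
    for t, (v_c, hydrogens) in enumerate(molecules):
        upper = (x <= bounds[t + 1]) if t == n - 1 else (x < bounds[t + 1])
        mask = (x >= bounds[t]) & upper
        y[mask] = molecule(x[mask].reshape(-1, 1), v_c, hydrogens)
    return y

# Hypothetical two-molecule compound acting on x in [0, 2]:
mols = [(0.8, [[0.1, 0.5]]), (-1.2, [[1.3, 1.9]])]   # (v_C, hydrogen values per feature)
bounds = [0.0, 1.0, 2.0]                             # L_0, L_1, L_2
print(compound(np.linspace(0.0, 2.0, 5), mols, bounds))
```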
For this work, an artificial hydrocarbon network will be considered to be a single hydrocarbon compound; thus, the simple AHN-algorithm was adapted for a unique hydrocarbon compound, as shown in Algorithm 1. For a detailed description of this machine learning technique, see [14,16,17].

Algorithm 1. Simple AHN-Algorithm for a saturated and linear compound.
Input: a training data set Σ = (x, y), the number of molecules in the compound n ≥ 2, the learning rate parameter 0 < η < 1, and a tolerance value ε > 0.
Output: the saturated and linear hydrocarbon compound AHN.
  Initialize an empty compound AHN = {}.
  Create a new saturated linear compound C like (11).
  Randomly initialize intermolecular distances δ.
  while |y − ψ| > ε do
    Determine all bounds L_t of C using δ.
    for each j-th molecule in C do
      Determine all molecular parameters of ϕ_j in (10) using LSE.
    end-for
    Build the compound behavior ψ of C using (12).
    Update intermolecular distances using (13) and (14) with η.
  end-while
  Update AHN with C and ψ.
  return AHN

3.2 Modified AHN-Algorithm Using L1 Regularization

As noted in Algorithm 1, the AHN-algorithm computes the hydrogen h_{i,r} and carbon v_C values using the least squares estimation method. Although this method is the most widely used in regression, several drawbacks have to be considered. In particular, the LSE method does not guarantee generalization of the trained model when dealing with new and unobserved data, and it easily falls into overfitting [9]. For instance, a modified version of the AHN-algorithm using natural computing techniques, such as simulated annealing and particle swarm optimization, can be found in [12]. Some improvements in stability and generalization were found, but the computational time was increased.

In that sense, this work proposes to use L1 regularization in the LSE-based optimization process to find suitable molecular parameters in AHN, providing generalization, interpretability and sparseness in the regression solution [9]. Let (10) be rewritten in its polynomial form as (16), where the set of coefficients a_i is computed from the hydrogen values h_{i,r}, i.e. the root values of the polynomials.

$$\varphi(x) = v_C \sum_{r=1}^{k} \sum_{i=0}^{d} a_i x_r^i = v_C \left\{ \sum_{i=0}^{d} a_i x_1^i + \cdots + \sum_{i=0}^{d} a_i x_k^i \right\} \qquad (16)$$

Assuming that the input vector x is composed of independent and identically distributed features, (16) can be restated as (17),

$$\varphi(x) = \sum_{i=1}^{\sigma} w_i \phi_i(x) \qquad (17)$$

where the set of coefficients w_i is computed from the a_i and v_C, the limit is calculated as σ = k(d + 1), and the set of basis functions φ_i is defined as:

$$\phi_i(x) = x_s^{\mathrm{mod}(i-1,\, d+1)}, \quad s = \mathrm{quotient}((d + i)/(d + 1))$$

Then, the set of molecular parameters w_i can be found in terms of L1 regularization, which stands for minimizing the least squares error subject to an L1-norm constraint on the w_i, as expressed in (18), where y_j is the j-th response observation, q is the number of samples in the data, and T_w is a positive tuning parameter. In fact, the least absolute shrinkage and selection operator (LASSO) method [22] solves this optimization process by proposing the alternative problem (19) with some positive value β (related to T_w).

$$w_i = \arg\min \sum_{j=1}^{q} (y_j - \varphi(x_j))^2, \quad \text{s.t.} \;\; \sum_i |w_i| \le T_w \qquad (18)$$

$$w_i = \arg\min \sum_{j=1}^{q} (y_j - \varphi(x_j))^2 + \beta \|w\|_1 \qquad (19)$$

3.3 Value Function Approximation Using the AHN-Algorithm

From the above, this work proposes to use the modified version of the AHN-algorithm with L1 regularization for value function approximation in the hierarchy of the MAXQ method.
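As a concrete illustration of (17) and (19), the following sketch expands the inputs into the polynomial basis φ_i(x) and fits the weights w_i of one molecule with an off-the-shelf LASSO solver. This is an assumption for illustration (scikit-learn, the degree d and the β value are choices made here), not the paper's implementation.

```python
import numpy as np
from sklearn.linear_model import Lasso

def polynomial_basis(X, d):
    """Basis of eq. (17): for each of the k features, the powers x_r^0 .. x_r^d,
    giving sigma = k * (d + 1) columns."""
    X = np.atleast_2d(X)
    cols = [X[:, r:r + 1] ** i for r in range(X.shape[1]) for i in range(d + 1)]
    return np.hstack(cols)

def fit_molecule_l1(X, y, d=2, beta=1e-10):
    """Solve eq. (19) for one molecule: least squares plus an L1 penalty beta
    on the weights w_i, i.e. the LASSO [22]."""
    Phi = polynomial_basis(X, d)
    model = Lasso(alpha=beta, fit_intercept=False, max_iter=10000)
    model.fit(Phi, y)
    return model

# Toy usage with made-up data (not from the taxi domain):
rng = np.random.default_rng(1)
X = rng.uniform(-1.0, 1.0, size=(200, 2))
y = 0.5 * X[:, 0] ** 2 - X[:, 1] + rng.normal(scale=0.05, size=200)
mol = fit_molecule_l1(X, y)
print(mol.coef_)  # sparse molecular parameters w_i over the sigma basis functions
```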
Let w_j be the set of all molecular parameters v_C and h_{i,r} of each CH-molecule ϕ_j(x), for all j = 1, ..., n, of the form (10). Then, consider the projected value function V^π(u, s) of (8) for a subtask u. It is easy to see that when u is primitive, or when the subtask terminates after executing an action a, the value function is a cumulative reward V^π(a, s). Then, the projected value function approximation can be reduced to the completion function approximation. Thus, it can be represented as an artificial hydrocarbon network model AHN, subject to L1 regularization in its molecular parameters w_j, over a finite set of features F_u = {f_{1,u}, ..., f_{k,u}} that characterizes u, as expressed in (20).

$$C^\pi(u, s, a) \simeq AHN(F_u), \quad \text{s.t.} \;\; \|w_j\|_1 \le T_w, \;\; \forall j = 1, \ldots, n \qquad (20)$$

In [6], the authors applied the LASSO to Bellman residual minimization (BRM), stated as the minimization of R + γV^π(s') − V^π(s), obtaining (21) when V^π(s) is represented as a linear combination of a finite set of features F_u in definite ratios w_u.

$$w_u = \arg\min \sum_{j=1}^{q} \left( R_j + \gamma \sum_u w_u F'_u - \sum_u w_u F_u \right)^2 + \beta \|w\|_1 \qquad (21)$$

Thus, the completion function approximation problem (20) can be solved using (19), with a solution equivalent to C^π(u, s, a) as in (21) for a positive value β, if x = F_u − γF'_u and y = V^π(u, s) when u is primitive. Algorithm 2 summarizes the proposed AHN-based value function approximation in the MAXQ method.

Algorithm 2. AHN-based value function approximation in MAXQ.
Input: the set of features F_u, the set of expected cumulative rewards V^π(a, s), the discount factor γ, the regularization term β > 0, the number of molecules in the compound n ≥ 2, the learning rate parameter 0 < η < 1, and a tolerance value ε > 0.
Output: the value function approximation using AHN.
  Set x = F_u − γF'_u and y = V^π(a, s).
  Initialize an empty compound AHN = {}.
  Create a new saturated linear compound C like (11).
  Randomly initialize intermolecular distances δ.
  while |y − ψ| > ε do
    Determine all bounds L_t of C using δ.
    for each j-th molecule in C do
      Determine all molecular parameters w_j of ϕ_j in (10) using (19) with β.
    end-for
    Build the compound behavior ψ of C using (12).
    Update intermolecular distances using (13) and (14) with η.
  end-while
  Update AHN with C and ψ.
  return AHN

4 Performance Evaluation

In order to assess the accuracy and efficiency of the value function approximation based on artificial hydrocarbon networks with L1 regularization, the well-known Dietterich taxi domain was used as a case study. In particular, the experimentation considers a comparative analysis evaluating the proposed method, the simple AHN-algorithm without regularization, and the OMP-BRM approach of [9]. In this section, a brief description of the taxi domain is presented. Then, some practical issues in the implementation of the proposal are described. Lastly, comparative results are summarized and discussed.

4.1 Description of the Taxi Domain

The taxi domain was formulated by Dietterich [3]. It consists of a 5-by-5 grid environment in which there is a taxi agent. The world has four locations marked as red (R), blue (B), green (G) and yellow (Y). This domain is episodic: an episode starts when the taxi is required to pick up a passenger and stops when the passenger is put down after a trip. At each episode, the taxi starts in a random square of the world. Then, a passenger placed in one of the four specific locations wishes to be transported to another specific location.
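To make the data preparation step of Algorithm 2 concrete, the following is a minimal sketch, not the paper's code, that builds the regression inputs x = F_u − γF'_u and targets y from logged transitions of one subtask and feeds them to the hypothetical fit_molecule_l1 helper defined in the previous sketch; the feature values are invented for the example.

```python
import numpy as np

def prepare_brm_data(features, next_features, targets, gamma=0.95):
    """Build the regression problem behind (20)-(21): inputs are the discounted
    feature differences F_u - gamma * F'_u, targets are the observed cumulative
    rewards V^pi(a, s) collected for the subtask."""
    F = np.asarray(features, dtype=float)            # shape (q, k): features of s
    F_next = np.asarray(next_features, dtype=float)  # shape (q, k): features of s'
    x = F - gamma * F_next
    y = np.asarray(targets, dtype=float)
    return x, y

# Hypothetical logged transitions for one node (values are illustrative only):
F = np.array([[0.0, 1.0, 2.0, 3.0],
              [1.0, 1.0, 2.0, 3.0]])
F_next = np.array([[1.0, 1.0, 2.0, 3.0],
                   [2.0, 1.0, 2.0, 3.0]])
targets = np.array([-1.0, -1.0])

x, y = prepare_brm_data(F, F_next, targets)
approximator = fit_molecule_l1(x, y, d=2, beta=1e-10)  # helper from the earlier sketch
```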
Both the starting and goal locations are randomly chosen at the beginning of the episode. In that sense, the taxi must travel to the passenger's location, pick up the passenger, go to the goal location, and put down the passenger. Figure 1 shows the taxi domain and its hierarchical decomposition based on [3].

Fig. 1. Taxi domain and its hierarchical decomposition, based on [3].

4.2 Implementation of the AHN-Based Value Function Approximation

The taxi domain was used to evaluate the performance of the proposed value function approximation. The experiment consisted of running 30,000 episodes using the MAXQ method. During the execution, the clean C values, which are equal to V^π(u, s), were collected. Also, a set of features F_u was identified as: f_{1,u} the current cab position, f_{2,u} the source position, f_{3,u} the goal position, and f_{4,u} the current subtask's goal position. Five repetitions of this experiment were done, and the collected data was partitioned into training (70%) and testing (30%) sets. Finally, these data sets were standardized.

Using the above information, the AHN-based value function approximation with L1 regularization (Algorithm 2) was run. A cross-validation approach (using only the training dataset) was applied to set the free parameters of the model, obtaining: the number of molecules n = 4 and the learning rate η = 0.1. The remaining parameters were set manually: the discount factor γ = 0.95 (as used in the MAXQ method) and the regularization parameter β = 1 × 10^−10. To this end, the AHN-based approximator was evaluated using the testing set (i.e. data unobserved by the learner). Also, the comparative analysis considered the simple AHN-algorithm without regularization as another value function approximator, with the same parameter values.

In addition, the proposal was compared with the OMP-BRM based approximator. This is a recent approach that represents the value function in RL as a linear approximation, with weight parameters obtained from the Bellman residual minimization (21) and computed as a regression problem via the orthogonal matching pursuit algorithm [9]. The same training and testing sets were used for the OMP-BRM method. Its free parameters were also set manually: γ = 0.95 and β = 1 × 10^−10.

The root-mean-square error (RMSE) metric was employed to evaluate the performance of the approximators. Since the Bellman residual is computed using the L2-norm between the predicted value function V̂ and the true value function V, ‖V̂ − V‖, the RMSE metric gives an explicit and sensitive evaluation of the approximation accuracy.
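The evaluation protocol of this section (70/30 split, standardization, RMSE on the testing set) can be sketched as follows. This is an illustrative assumption using scikit-learn with made-up logged data, reusing the hypothetical fit_molecule_l1 and polynomial_basis helpers from the earlier sketches; it is not the paper's actual experiment code.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error

# Made-up logged data for one node: features f1..f4 and clean C targets.
rng = np.random.default_rng(2)
features = rng.integers(0, 25, size=(5000, 4)).astype(float)
clean_C = rng.normal(size=5000)

# 70% training / 30% testing split, standardization fitted on the training set only.
X_train, X_test, y_train, y_test = train_test_split(
    features, clean_C, test_size=0.3, random_state=0)
sx, sy = StandardScaler(), StandardScaler()
X_train, X_test = sx.fit_transform(X_train), sx.transform(X_test)
y_train = sy.fit_transform(y_train.reshape(-1, 1)).ravel()
y_test = sy.transform(y_test.reshape(-1, 1)).ravel()

# Fit the regularized approximator with the reported beta = 1e-10, then compute
# the RMSE on unobserved data, as in Tables 2 and 3.
model = fit_molecule_l1(X_train, y_train, d=2, beta=1e-10)
pred = model.predict(polynomial_basis(X_test, 2))
rmse = np.sqrt(mean_squared_error(y_test, pred))
print(f"testing RMSE: {rmse:.3f}")
```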
4.3 Comparative Results and Discussion

The clean C value functions and their approximations from seven nodes in the hierarchy of the taxi domain were analyzed. In particular, the approximations were performed for four primitive nodes (North, South, East, West) and three composite nodes (Get, NavigateForGet, NavigateForPut). The Put node was not evaluated since it always has a zero cumulative reward value [3]. Table 2 summarizes the RMSE performance evaluation for the proposed AHN-L1, simple AHN and OMP-BRM based value function approximators on the testing dataset. In addition, Table 3 summarizes the RMSE performance of the three approximators on the training dataset.

Table 2. RMSE performance of the value function approximators (testing set).

Node            | AHN-L1 | simple AHN | OMP-BRM
North           | 0.719  | 0.726      | 0.989
South           | 0.759  | 0.765      | 0.995
East            | 0.768  | 0.835      | 0.991
West            | 0.824  | 0.884      | 0.994
Get             | 0.947  | 9.768      | 0.991
NavigateForGet  | 1.133  | 0.991      | 1.000
NavigateForPut  | 0.974  | 1.012      | 0.975

Table 3. RMSE performance of the value function approximators (training set).

Node            | AHN-L1 | simple AHN | OMP-BRM
North           | 0.783  | 0.750      | 0.989
South           | 0.749  | 0.744      | 0.993
East            | 0.813  | 0.780      | 0.995
West            | 0.823  | 0.740      | 0.993
Get             | 0.931  | 0.840      | 0.995
NavigateForGet  | 1.087  | 0.947      | 0.976
NavigateForPut  | 0.988  | 0.779      | 0.987

From Table 2, it can be seen that the AHN-L1 based method obtained the lowest RMSE values (except for the NavigateForGet node) in comparison with the simple AHN and OMP-BRM based approximators, when validating the performance on the testing data set. Also, the simple AHN based method was able to approximate the clean C value functions better than the OMP-BRM method, except in the composite nodes Get and NavigateForPut. However, the results show that composite nodes could be more difficult to approximate for any of these approaches.

On the other hand, Table 3 shows that the RMSE values from the simple AHN approximator are better than those from the AHN-L1 based method when validating the performance on the training data set. This can be explained because regularization tends to generalize rather than overfit the response of the approximator. In that sense, the AHN-L1 based method tends to produce larger RMSE values on the training data than the simple AHN based method. This is also confirmed when comparing the simple AHN approximator with the OMP-BRM based method: the RMSE values from the OMP-BRM method, which is based on regularization, are larger than those from the simple AHN based method. Nevertheless, it can be seen in Tables 2 and 3 that L1 regularization offers better accuracy and efficiency after the training procedure, because the RMSE values of the regularization-based AHN-L1 and OMP-BRM methods are smaller when using the testing set (i.e. unobserved data), in contrast with the simple AHN approximator. Lastly, it is noted that the AHN-L1 based method obtained better results when testing than when training. This behavior might be explained by the fact that L1 regularization tends to avoid overfitting, so underfitting then becomes a possibility.

To this end, Fig. 2 shows the function approximation of the clean C values in the East node using the AHN-L1, simple AHN and OMP-BRM based methods. As noticed, the OMP-BRM approximator is not well tuned, but it can make a distinction between negative and positive standardized C values. In addition, the proposed AHN-L1 approximator can deal with the nonlinearities of the true C values better than the simple AHN approximator.

Fig. 2. Value function approximation in the East node.

5 Conclusions and Future Work

This paper presents a new value function approximation based on the artificial hydrocarbon networks algorithm with L1 regularization for hierarchical reinforcement learning. In particular, the proposed approach was implemented in the hierarchy of the MAXQ method. The methodology of the proposed approximator consisted of using machine learning and regularization approaches in order to inherit characteristics from both, such as: defining nonlinear models to prevent overfitting, gaining interpretability and exploiting sparseness.
However, this work limited its attention to nonlinear models that prevent overfitting. In order to validate the proposal, Dietterich's taxi domain was used as a case study. The proposed AHN-L1, the simple AHN and the OMP-BRM based methods were compared. The proposed AHN-L1 based method was shown to be accurate and efficient in terms of the RMSE metric. The results confirm that the proposal obtains a nonlinear model for value function approximation in the hierarchy of the MAXQ method, preventing overfitting.

For future work, this proposal will be analyzed in terms of interpretability and sparseness, and it will also be mathematically formalized. Additionally, underfitting in the AHN-L1 based method will be analyzed in order to determine its influence on the value function approximation. As mentioned, this work is part of ongoing research, so its implementation for automatic state abstraction will be considered too.

References

1. R. Bellman. Dynamic Programming. Princeton University Press, 1957.
2. S. Bradtke and A. Barto. Linear least-squares algorithms for temporal difference learning. Machine Learning, 22(1):33–57, 1996.
3. T. Dietterich. Hierarchical reinforcement learning with MAXQ value function decomposition. Journal of Artificial Intelligence Research, 13:227–303, 2000.
4. H. Jakab and L. Csato. Sparse Approximations to Value Functions in Reinforcement Learning, volume 4 of Springer Series in Bio-/Neuroinformatics, pages 295–314. Springer, 2015.
5. J. Kolter and A. Ng. Regularization and feature selection in least-squares temporal difference learning. In 26th Annual International Conference on Machine Learning, pages 521–528, 2009.
6. M. Loth, M. Davy, and P. Preux. Sparse temporal difference learning using LASSO. In IEEE International Symposium on Approximate Dynamic Programming and Reinforcement Learning, pages 352–359, Honolulu, Hawaii, United States of America, April 2007. IEEE.
7. S. Mahajan. Hierarchical reinforcement learning in complex learning problems: A survey. International Journal of Computer Science and Engineering, 2(5):72–78, 2014.
8. T. Mitchell. Machine Learning. McGraw Hill, 1997.
9. C. Painter-Wakefield and R. Parr. Greedy algorithms for sparse reinforcement learning. In 29th International Conference on Machine Learning, pages 1391–1398, 2012.
10. R. Parr and S. Russell. Reinforcement learning with hierarchies of machines. In Conference on Advances in Neural Information Processing Systems, pages 1043–1049. MIT Press, 1997.
11. M. Pfeiffer. Reinforcement learning of strategies for Settlers of Catan. In International Conference on Computer Games: Artificial Intelligence, 2004.
12. H. Ponce. Bio-inspired training algorithms for artificial hydrocarbon networks: A comparative study. In 13th Mexican International Conference on Artificial Intelligence (MICAI), pages 162–166, Tuxtla Gutierrez, Chiapas, Mexico, November 2014. IEEE.
13. H. Ponce, L. Miralles-Pechuan, and L. Martinez-Villasenor. Artificial hydrocarbon networks for online sales prediction. In O. Pichardo, O. Herrera, and G. Arroyo, editors, Advances in Artificial Intelligence and Its Applications, volume 9414 of Lecture Notes in Computer Science, pages 498–508. Springer, 2015.
14. H. Ponce and P. Ponce. Artificial organic networks. In 2011 IEEE Conference on Electronics, Robotics and Automotive Mechanics, pages 29–34, Cuernavaca, Morelos, Mexico, 2011. IEEE.
15. H. Ponce, P. Ponce, and A. Molina. Adaptive noise filtering based on artificial hydrocarbon networks: An application to audio signals. Expert Systems With Applications, 41(14):6512–6523, 2014.
16. H. Ponce, P. Ponce, and A. Molina. Artificial Organic Networks: Artificial Intelligence Based on Carbon Networks, volume 521 of Studies in Computational Intelligence. Springer, 2014.
17. H. Ponce, P. Ponce, and A. Molina. The development of an artificial organic networks toolkit for LabVIEW. Journal of Computational Chemistry, 36(7):478–492, 2015.
18. Z. Qin, W. Li, and F. Janoos. Sparse reinforcement learning via convex optimization. In 31st International Conference on Machine Learning, volume 32, pages 424–432, 2014.
19. M. Riedmiller. Neural fitted Q iteration - first experiences with a data efficient neural reinforcement learning method. In J. Gama, R. Camacho, P. Brazdil, A. Jorge, and L. Torgo, editors, Machine Learning: ECML 2005, volume 3720 of Lecture Notes in Computer Science, pages 317–328. Springer, 2005.
20. R. Sutton, H. Maei, D. Precup, S. Bhatnagar, D. Silver, C. Szepesvari, and E. Wiewiora. Fast gradient-descent methods for temporal-difference learning with linear function approximation. In 26th Annual International Conference on Machine Learning, pages 993–1000. ACM, 2009.
21. R. Sutton, S. Singh, D. Precup, and B. Ravindran. Improved switching among temporally abstract actions. In Conference on Advances in Neural Information Processing Systems, volume 11, pages 1066–1072. MIT Press, 1999.
22. R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58(1):267–288, 1996.