A Novel Artificial Hydrocarbon Networks Based Value Function Approximation in Hierarchical Reinforcement Learning

Hiram Ponce
Faculty of Engineering, Universidad Panamericana, 03920 Mexico City, Mexico
[email protected]

Abstract. Reinforcement learning aims to solve the problem of learning optimal or near-optimal decision-making policies for a given domain problem. However, it is known that increasing the dimensionality of the input space (i.e. environment) increases the complexity of the learning algorithms, leading to the curse of dimensionality. Value function approximation and hierarchical reinforcement learning are two different approaches proposed to alleviate this problem. In that sense, this paper proposes a new value function approximation using artificial hydrocarbon networks (a supervised learning method inspired by chemical carbon networks) with regularization at each subtask in a hierarchical reinforcement learning framework. Comparative results against a greedy sparse value function approximation over the MAXQ hierarchical method were computed, showing that artificial hydrocarbon networks improve the accuracy and efficiency of the value function approximation.

Keywords: reinforcement learning, machine learning, artificial organic networks, regularization

1 Introduction

Reinforcement learning (RL) aims to solve the problem of learning optimal or near-optimal decision-making policies for a given domain problem [3]. The input space (i.e. environment) is represented using states that are in turn composed of a set of features. Depending on the domain problem, the state space can be continuous or discrete; but in any case, the number of features (dimensions) that represents the environment impacts the dimensionality of the overall domain. However, it is known that increasing the dimensionality of the state space increases the complexity of the learning algorithms, leading to the curse of dimensionality [8]. Due to table structures over states or state-action pairs, and large state spaces, reinforcement learning becomes inefficient and impractical to apply successfully in complex domains [7].

In that sense, hierarchical reinforcement learning (HRL) deals with large domains by dividing the global goal into smaller subgoals with related subtasks, and attempts to tackle each subgoal separately [7,10]. Several hierarchical strategies have been proposed in the literature, for example: the options method [21] defines each subtask in terms of a fixed policy provided in advance, the hierarchy of abstract machines (HAM) method [10] defines each subtask in terms of a nondeterministic finite-state machine, and the MAXQ method [3] defines each subtask in terms of a terminal predicate (i.e. the terminal state of the subtask) and a local reward function. Focusing on the latter, there are three key factors that the MAXQ method supports, providing better learning performance in high-dimensional spaces [3]: (i) temporal abstraction, which assumes that a certain task might take several steps to accomplish and is modeled as a semi-Markov decision process; (ii) state abstraction, which defines a relevant subset of features to represent the state space for each subtask; and (iii) task sharing, which means that a certain task can be shared by other subtasks if needed. In particular, Dietterich [3] proved that optimal or near-optimal state abstraction, i.e. safe state abstraction, improves the efficiency of the MAXQ method when dealing with high-dimensional spaces.
Looking closely, this means that a minimal subset of features will optimize the value (or state-action) functions. In fact, value function approximation [11,18] tends to reduce the complexity of reinforcement learning algorithms due to its compact representation, instead of tables. It allows generalization over the state space, so that the learning complexity is further reduced [3]. Different approaches have been proposed for value function approximation in non-hierarchical reinforcement learning, such as: linear function approximators [4,18], greedy algorithms [9], machine learning approaches [11,19], sparse approximators [9,5], non-parametric approximators [9], and others.

From the above, this paper proposes a new value function approximation based on artificial hydrocarbon networks at each subtask in the hierarchy of the MAXQ method, to improve the accuracy and efficiency of that type of knowledge representation. In fact, the artificial hydrocarbon networks (AHN) method has proved to be efficient in other function approximation tasks [16,14,15,13]. The present work is part of ongoing research on hierarchical reinforcement learning and the automation of safe state abstraction in a given problem domain. To this end, two key contributions are presented in this paper. First, it proposes an alternative value function approximation that uses the notion of packaging information in molecules through the AHN method, which also exploits the machine learning paradigm for modeling nonlinear and sparse spaces. Second, it presents a modified version of the AHN method using L1 regularization to improve the efficiency of the approximator in terms of generalization and avoidance of overfitting.

The rest of the paper is organized as follows. Section 2 presents related work on value function approximation in RL and trends related to the HRL framework. Section 3 presents the description of the proposal, including a brief overview of artificial hydrocarbon networks and the modified version based on L1 regularization. Section 4 describes the experimentation based on the well-known Dietterich taxi problem domain and the comparative results obtained so far, and Sect. 5 concludes the paper.

2 Fundamentals and Related Work

Value (state-action) function approximation has been widely studied in reinforcement learning. In this section, the fundamentals of RL and the value function are considered. Then, the state of the art in value function approximation is described. Lastly, the MAXQ method is briefly introduced.

2.1 Preliminaries

A Markov decision process (MDP) provides a model M in which an agent interacts with its fully-observable environment, denoted as M = ⟨S, A, P, R, P_0⟩; where S is a finite set of states of the environment, A is a finite set of actions that the agent can perform, P(s'|s, a) is the probability that the environment transitions from its current state s to the resulting state s' when an action a ∈ A is performed, R(s'|s, a) is the reward that the agent receives when it performs action a and the environment makes a transition from s to s', and P_0(s) is the probability that the MDP is initialized in state s [4,3]. Then, the aim of reinforcement learning is to produce (i.e. learn) a policy π, a mapping from states to actions that tells the agent which action a to perform when it is in state s, such that a = π(s).
In that sense, the value function V^π provides a mechanism to find a policy π, and is defined as the expected cumulative reward that an agent will obtain by executing π starting in state s. The value function is expressed as (1), where r_t is the reward that the agent receives at time t and γ is a discount factor [3]. In fact, the value function satisfies the Bellman equation [1] for a fixed policy, as in (2).

$$V^\pi(s) = E\left\{ \sum_{i=0}^{\infty} \gamma^i r_{t+i} \;\Big|\; s_t = s, \pi \right\} \qquad (1)$$

$$V^\pi(s) = \sum_{s'} P(s'|s, \pi(s)) \left\{ R(s'|s, \pi(s)) + \gamma V^\pi(s') \right\} \qquad (2)$$

In addition, there is the state-action function Q^π, or Q-function (3), that computes the expected cumulative reward of performing an action a in state s and thereafter following the policy π.

$$Q^\pi(s, a) = \sum_{s'} P(s'|s, a) \left\{ R(s'|s, a) + \gamma Q^\pi(s', \pi(s')) \right\} \qquad (3)$$

2.2 Overview of the MAXQ Method

Formally, MAXQ decomposes a given problem M, modeled as an MDP, into a finite set of subtasks. Each subtask M_i is defined as a semi-Markov decision process (SMDP) tuple M_i = ⟨T_i, A_i, R̂_i⟩ that is completed in N time steps; where T_i is a termination state of the current subtask, A_i is a set of actions that can be performed to achieve that subtask, and R̂_i is the pseudo-reward function that specifies a pseudo-reward for each transition to a terminal state. Associated with each subtask there is a policy π_i, and the set of all policies in the problem, π = {π_0, ..., π_n}, is called the hierarchical policy [3].

In order to find an optimal or near-optimal hierarchical policy, the MAXQ method decomposes the hierarchical value function V^π (i.e. the expected cumulative reward of following π starting in state s) into a subset of so-called projected value functions V^π(i, s), denoted as (4), or alternatively as (5) if the first action a chosen from π_i is a subroutine and it terminates in N steps. Notice that the expected cumulative reward after executing a is V^π(π_i(s), s) = V^π(a, s).

$$V^\pi(i, s) = E\left\{ \sum_{u=0}^{\infty} \gamma^u r_{t+u} \;\Big|\; s_t = s, \pi \right\} \qquad (4)$$

$$V^\pi(i, s) = E\left\{ \sum_{u=0}^{N-1} \gamma^u r_{t+u} + \sum_{u=N}^{\infty} \gamma^u r_{t+u} \;\Big|\; s_t = s, \pi \right\} = V^\pi(\pi_i(s), s) + \sum_{s', N} P_i^\pi(s', N | s, \pi_i(s)) \, \gamma^N V^\pi(i, s') \qquad (5)$$

Moreover, Dietterich [3] proved that the related projected state-action function can be expressed as (6), with C^π being the completion function (7), resulting in a re-expression of V^π(i, s) as (8)^1.

$$Q^\pi(i, s, a) = V^\pi(a, s) + C^\pi(i, s, a) \qquad (6)$$

$$C^\pi(i, s, a) = \sum_{s', N} P_i^\pi(s', N | s, a) \, \gamma^N Q^\pi(i, s', \pi(s')) \qquad (7)$$

$$V^\pi(i, s) = \begin{cases} Q^\pi(i, s, \pi_i(s)) & \text{if } i \text{ is composite} \\ \sum_{s'} P(s'|s, i) R(s'|s, i) & \text{if } i \text{ is primitive} \end{cases} \qquad (8)$$

To this end, MAXQ requires a hierarchy of the problem defined with a set of nodes [3]: Max nodes (representing primitive actions or subtasks) and Q nodes (actions of subtasks). In particular, Max nodes store V^π(i, s) values, and Q nodes store completion functions C^π(i, s, a) that represent the cumulative reward of completing subtask i after invoking action a ∈ A_i in state s. The MaxQ-Q algorithm [3] implements the MAXQ method and works with arbitrary R̂_i functions. In [3], it is shown that R̂_i functions contaminate the completion functions C^π. To solve this problem, MaxQ-Q uses two completion functions: the contaminated Ĉ_i function using R̂_i, and the clean C_i function without using R̂_i.
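To make the fixed-policy Bellman equation (2) concrete, the following is a minimal sketch of tabular policy evaluation on a small synthetic MDP. It is an illustration under assumed toy dynamics (the state/action sizes and the random model are made up for the example), not an implementation from the paper.

```python
import numpy as np

# Toy MDP sizes and discount factor; these values are illustrative assumptions.
n_states, n_actions, gamma = 4, 2, 0.95
rng = np.random.default_rng(0)

# P[a, s, s'] : transition probabilities, R[a, s, s'] : rewards (random toy model).
P = rng.random((n_actions, n_states, n_states))
P /= P.sum(axis=2, keepdims=True)
R = rng.normal(size=(n_actions, n_states, n_states))

# A fixed deterministic policy pi(s), chosen at random for the example.
policy = rng.integers(n_actions, size=n_states)

# Iterate the Bellman equation (2) for the fixed policy until convergence.
V = np.zeros(n_states)
for _ in range(10_000):
    V_new = np.array([
        np.dot(P[policy[s], s], R[policy[s], s] + gamma * V)
        for s in range(n_states)
    ])
    if np.max(np.abs(V_new - V)) < 1e-10:
        break
    V = V_new

print(V)  # V^pi(s) for each state under the fixed policy
```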
^1 Primitive subtasks are those that are performed and terminated by actions, while composite subtasks are those that perform other subtasks.

2.3 Value Function Approximation in Reinforcement Learning

Value or state-action function approximation has been proposed in reinforcement learning to relax the curse of dimensionality, to avoid table structures of knowledge representation, and to handle continuous state spaces [11,9,18]. In that sense, value function approximation defines states s ∈ S using a finite number of features f_i, for all i = 1, ..., k, where k is the number of features in state s, and approximates the value function as a function of a finite-dimensional feature vector as in (9).

$$V^\pi(s) \simeq V^\pi(f_1, \ldots, f_k) \qquad (9)$$

Different approaches have been proposed for value function approximation in non-hierarchical reinforcement learning, as described above. In fact, linear approximation is the most widely used technique, for example: least-squares temporal difference (LSTD) [2] or gradient-based temporal difference (TD) learning [20]. Although linear mappings of nonlinear features have been considered [4], linear approximation does not offer good accuracy in discrete domains and tends to overfit the data. As an alternative, machine learning based approaches for value function approximation have been proposed in the literature. For example, Riedmiller [19] proposed the neural fitted Q (NFQ) algorithm, which is based on multilayer perceptrons and stores and reuses experiences at each step of the retraining process. However, artificial neural networks and linear approaches tend to smooth the value function globally [11]. This behavior is not suitable for the expected cumulative reward function because it contaminates the overall performance of the RL algorithm. Thus, the authors in [11] proposed to use model trees as state-action function approximators in a hierarchical reinforcement learning framework, allowing an agent to learn strategies for the Settlers of Catan game. They proved that value function approximation in combination with HRL improves the performance of reinforcement learning in complex games. Kolter and Ng [5] proposed to learn a sparse representation of the value function in order to improve generalization and reduce overfitting, by adapting supervised learning models with regularization. Similarly, Painter-Wakefield and Parr [9] proposed a sparse value function approximation to overcome overfitting in some approaches. They considered a regularization term in the optimization process, obtaining sparse, compact and more understandable value function approximations. They proposed two different algorithms based on orthogonal matching pursuit (OMP), combined with Bellman residual minimization (BRM) and temporal difference (TD) learning, respectively.

In this work, machine learning and sparse approximation techniques are used together in order to inherit the advantages of both for value function approximation. In addition, the resulting approximator is proposed to be used in the MAXQ hierarchical reinforcement learning framework.

3 Description of the Proposal

In this section, a brief overview of artificial hydrocarbon networks is presented. Then, a modified version of this method using L1 regularization is proposed, and lastly it is described how to use this proposal for value function approximation.
3.1 Artificial Hydrocarbon Networks

The artificial hydrocarbon networks technique is a supervised learning method inspired by chemical hydrocarbon compounds [14,16], and it belongs to the class of learning algorithms called artificial organic networks [14]. Similarly to chemical compounds, artificial hydrocarbon networks package information in modules called molecules that can be related to one another through chemistry-like mechanisms, i.e. heuristic rules, resulting in graph models that are organized and optimized in terms of chemical energy. In [16], it is proved that the AHN method preserves chemistry-based characteristics such as modularity, inheritance, organization and structural stability. For readability, Table 1 summarizes the relationship between the chemical terms of the artificial organic networks framework and their computational meanings in the AHN technique [16].

Table 1. Description of the chemical terms used in artificial hydrocarbon networks.

Chemical Terminology          | Symbols                       | Meaning
environment                   | x (features)                  | data inputs
behavior                      | y (target)                    | data output, solution of mixtures
atoms                         | h_i, v_C (parameters)         | basic structural units or properties
molecules                     | ϕ(x) (functions)              | basic units of information
compounds                     | ψ(x) (composite functions)    | complex units of information
mixtures                      | AHN(x) (linear combinations)  | combination of compounds
stoichiometric coefficients   | α_i (weights)                 | definite ratios in mixtures
bounds                        | L_0, L_t (parameters)         | boundaries in the inputs
energy                        | E_0, E_t (loss function)      | error between target and estimated values

In AHN, the basic unit of information is the CH-molecule. It is made of two or more atoms related among themselves in order to define a behavior function ϕ(x) of the input vector x = {x_1, ..., x_k}. This molecule is made of a carbon atom with value v_C surrounded by hydrogen atoms with values h_i ∈ ℂ, as expressed in (10), where 1 ≤ d ≤ 4 represents the number of hydrogen atoms in the molecule.

$$\varphi(x) = v_C \sum_{r=1}^{k} \prod_{i=1}^{d} (x_r - h_{i,r}) \qquad (10)$$

Unsaturated CH-molecules, i.e. those with d < 4, can be joined together with other CH-molecules, forming chains of molecules, the so-called artificial hydrocarbon compounds. In this work, saturated and linear chains of molecules will be used as compounds like in (11), in which a compound is made of n molecules: (n − 2) CH2 molecules and two CH3 molecules at the ends. The CH_d symbol represents a molecule with d hydrogen atoms [16]. A compound behavior function ψ is also defined. The simplest compound behavior is a piecewise function, denoted as (12), where L_t = {L_{t,1}, ..., L_{t,k}} for all t = 0, ..., n is the set of boundaries within which a CH-molecule acts over the input space.

$$CH_3 - CH_2 - \cdots - CH_2 - CH_3 \qquad (11)$$

$$\psi(x) = \begin{cases} \varphi_1(x) & L_{0,r} \le x_r < L_{1,r} \\ \cdots & \cdots \\ \varphi_n(x) & L_{n-1,r} \le x_r \le L_{n,r} \end{cases}, \quad \forall r = 1, \ldots, k \qquad (12)$$

To obtain the boundaries L_t, a distance δ between two adjacent bounds, i.e. {L_{t−1}, L_t}, is computed as in (13). In addition, Δδ is computed using a gradient descent method based on the energy of the adjacent molecules (E_{t−1} and E_t), as in (14), where 0 < η < 1 is a learning rate parameter. For implementation purposes, the energy of a molecule can be computed using the least squares error (LSE) as the loss function [16].

$$\delta = \delta + \Delta\delta \qquad (13)$$

$$\Delta\delta = -\eta (E_{t-1} - E_t), \quad E_0 \leftarrow 0 \qquad (14)$$

Also, the AHN method defines a mixture AHN that is a linear combination of c compounds in definite ratios, the so-called stoichiometric coefficients {α_1, ..., α_c}, as shown in (15).

$$AHN(x) = \sum_{i=1}^{c} \alpha_i \psi_i(x) \qquad (15)$$

The simplest training algorithm is the so-called AHN-algorithm [17].
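To illustrate (10) and (12), the following is a minimal sketch, not code from the paper, of a CH-molecule behavior and the piecewise compound behavior over a one-dimensional input. The molecule parameters, the bounds, and the sum-over-features reading of (10) are assumptions made for the example.

```python
import numpy as np

def molecule(x, v_c, hydrogens):
    """CH-molecule behavior, eq. (10): for each feature, a polynomial with roots
    at the hydrogen values, summed over features and scaled by the carbon value."""
    x = np.atleast_2d(x)                      # shape (samples, k)
    total = np.zeros(x.shape[0])
    for r in range(x.shape[1]):
        poly = np.ones(x.shape[0])
        for h in hydrogens[r]:                # up to d hydrogen values per feature
            poly *= (x[:, r] - h)
        total += poly
    return v_c * total

def compound(x, molecules, bounds):
    """Piecewise compound behavior, eq. (12): each molecule acts on its own
    interval of a one-dimensional input space."""
    x = np.asarray(x, dtype=float)
    y = np.empty_like(x)
    n = len(molecules)
    for t, (v_c, hydrogens) in enumerate(molecules):
        upper = (x <= bounds[t + 1]) if t == n - 1 else (x < bounds[t + 1])
        mask = (x >= bounds[t]) & upper
        y[mask] = molecule(x[mask].reshape(-1, 1), v_c, hydrogens)
    return y

# Hypothetical two-molecule compound acting on x in [0, 2]:
mols = [(0.8, [[0.1, 0.5]]), (-1.2, [[1.3, 1.9]])]   # (v_C, hydrogen values per feature)
bounds = [0.0, 1.0, 2.0]                             # L_0, L_1, L_2
print(compound(np.linspace(0.0, 2.0, 5), mols, bounds))
```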
For this work, an artificial hydrocarbon network will be considered to be a single hydrocarbon compound; thus, the simple AHN-algorithm was adapted for a unique hydrocarbon compound, as shown in Algorithm 1. For a detailed description of this machine learning technique, see [14,16,17].

Algorithm 1. Simple AHN-Algorithm for a saturated and linear compound.
Input: a training data set Σ = (x, y), the number of molecules in the compound n ≥ 2, the learning rate parameter 0 < η < 1, and a tolerance value ε > 0.
Output: the saturated and linear hydrocarbon compound AHN.
  Initialize an empty compound AHN = {}.
  Create a new saturated linear compound C like (11).
  Randomly initialize intermolecular distances δ.
  while |y − ψ| > ε do
    Determine all bounds L_t of C using δ.
    for each j-th molecule in C do
      Determine all molecular parameters of ϕ_j in (10) using LSE.
    end-for
    Build the compound behavior ψ of C using (12).
    Update intermolecular distances using (13) and (14) with η.
  end-while
  Update AHN with C and ψ.
  return AHN

3.2 Modified AHN-Algorithm Using L1 Regularization

As noted in Algorithm 1, the AHN-algorithm computes the hydrogen h_{i,r} and carbon v_C values using the least squares estimation method. Although this method is the most widely used in regression, several drawbacks have to be considered. In particular, the LSE method does not guarantee generalization of the trained model when dealing with new and unobserved data, and it easily falls into overfitting [9]. For instance, a modified version of the AHN-algorithm using natural computing techniques, such as simulated annealing and particle swarm optimization, can be found in [12]. Some improvements in stability and generalization were found, but the computational time was increased.

In that sense, this work proposes to use L1 regularization in the LSE-based optimization process to find suitable molecular parameters in AHN, providing generalization, interpretability and sparseness in the regression solution [9]. Let (10) be rewritten in its polynomial form as (16), where the set of coefficients a_i is computed from the hydrogen values h_{i,r}, i.e. the root values of the polynomials.

$$\varphi(x) = v_C \sum_{r=1}^{k} \sum_{i=0}^{d} a_i x_r^i = v_C \left\{ \sum_{i=0}^{d} a_i x_1^i + \cdots + \sum_{i=0}^{d} a_i x_k^i \right\} \qquad (16)$$

Assuming that the input vector x is composed of independent and identically distributed features, (16) can be restated as (17),

$$\varphi(x) = \sum_{i=1}^{\sigma} w_i \phi_i(x) \qquad (17)$$

where the set of coefficients w_i is computed from the a_i and v_C, the limit is calculated as σ = k(d + 1), and the set of basis functions φ_i is defined as:

$$\phi_i(x) = x_s^{\mathrm{mod}(i-1,\, d+1)}, \quad s = \mathrm{quotient}((d + i)/(d + 1))$$

Then, the set of molecular parameters w_i can be found in terms of L1 regularization, which stands for minimizing the least squares error subject to an L1-norm constraint on the w_i, as expressed in (18), where y_j is the j-th response observation, q is the number of samples in the data, and T_w is a positive tuning parameter. In fact, the least absolute shrinkage and selection operator (LASSO) method [22] solves this optimization process by proposing the alternative problem (19) with some positive value β (related to T_w).

$$w_i = \arg\min \sum_{j=1}^{q} (y_j - \varphi(x_j))^2, \quad \text{s.t.} \;\; \sum_i |w_i| \le T_w \qquad (18)$$

$$w_i = \arg\min \sum_{j=1}^{q} (y_j - \varphi(x_j))^2 + \beta \|w\|_1 \qquad (19)$$

3.3 Value Function Approximation Using the AHN-Algorithm

From the above, this work proposes to use the modified version of the AHN-algorithm with L1 regularization for value function approximation in the hierarchy of the MAXQ method.
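As a concrete illustration of (17) and (19), the following sketch expands the inputs into the polynomial basis φ_i(x) and fits the weights w_i of one molecule with an off-the-shelf LASSO solver. This is an assumption for illustration (scikit-learn, the degree d and the β value are choices made here), not the paper's implementation.

```python
import numpy as np
from sklearn.linear_model import Lasso

def polynomial_basis(X, d):
    """Basis of eq. (17): for each of the k features, the powers x_r^0 .. x_r^d,
    giving sigma = k * (d + 1) columns."""
    X = np.atleast_2d(X)
    cols = [X[:, r:r + 1] ** i for r in range(X.shape[1]) for i in range(d + 1)]
    return np.hstack(cols)

def fit_molecule_l1(X, y, d=2, beta=1e-10):
    """Solve eq. (19) for one molecule: least squares plus an L1 penalty beta
    on the weights w_i, i.e. the LASSO [22]."""
    Phi = polynomial_basis(X, d)
    model = Lasso(alpha=beta, fit_intercept=False, max_iter=10000)
    model.fit(Phi, y)
    return model

# Toy usage with made-up data (not from the taxi domain):
rng = np.random.default_rng(1)
X = rng.uniform(-1.0, 1.0, size=(200, 2))
y = 0.5 * X[:, 0] ** 2 - X[:, 1] + rng.normal(scale=0.05, size=200)
mol = fit_molecule_l1(X, y)
print(mol.coef_)  # sparse molecular parameters w_i over the sigma basis functions
```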
Let w_j be the set of all molecular parameters v_C and h_{i,r} of each CH-molecule ϕ_j(x), for all j = 1, ..., n, of the form (10). Then, consider the projected value function V^π(u, s) of (8) for a subtask u. It is easy to see that when u is primitive, or when the subtask terminates after executing an action a, the value function is a cumulative reward V^π(a, s). Then, the projected value function approximation can be reduced to the completion function approximation. Thus, it can be represented as an artificial hydrocarbon network model AHN, subject to L1 regularization in its molecular parameters w_j, over a finite set of features F_u = {f_{1,u}, ..., f_{k,u}} that characterizes u, as expressed in (20).

$$C^\pi(u, s, a) \simeq AHN(F_u), \quad \text{s.t.} \;\; \|w_j\|_1 \le T_w, \;\; \forall j = 1, \ldots, n \qquad (20)$$

In [6], the authors applied the LASSO to Bellman residual minimization (BRM), stated as the minimization of R + γV^π(s') − V^π(s), obtaining (21) when V^π(s) is represented as a linear combination of a finite set of features F_u in definite ratios w_u.

$$w_u = \arg\min \sum_{j=1}^{q} \left( R_j + \gamma \sum_u w_u F'_u - \sum_u w_u F_u \right)^2 + \beta \|w\|_1 \qquad (21)$$

Thus, the completion function approximation problem (20) can be solved using (19), with a solution equivalent to C^π(u, s, a) as in (21) for a positive value β, if x = F_u − γF'_u and y = V^π(u, s) when u is primitive. Algorithm 2 summarizes the proposed AHN-based value function approximation in the MAXQ method.

Algorithm 2. AHN-based value function approximation in MAXQ.
Input: the set of features F_u, the set of expected cumulative rewards V^π(a, s), the discount factor γ, the regularization term β > 0, the number of molecules in the compound n ≥ 2, the learning rate parameter 0 < η < 1, and a tolerance value ε > 0.
Output: the value function approximation using AHN.
  Set x = F_u − γF'_u and y = V^π(a, s).
  Initialize an empty compound AHN = {}.
  Create a new saturated linear compound C like (11).
  Randomly initialize intermolecular distances δ.
  while |y − ψ| > ε do
    Determine all bounds L_t of C using δ.
    for each j-th molecule in C do
      Determine all molecular parameters w_j of ϕ_j in (10) using (19) with β.
    end-for
    Build the compound behavior ψ of C using (12).
    Update intermolecular distances using (13) and (14) with η.
  end-while
  Update AHN with C and ψ.
  return AHN

4 Performance Evaluation

In order to assess the accuracy and efficiency of the value function approximation based on artificial hydrocarbon networks with L1 regularization, the well-known Dietterich taxi domain was used as a case study. In particular, the experimentation considers a comparative analysis evaluating the proposed method, the simple AHN-algorithm without regularization, and the OMP-BRM approach of [9]. In this section, a brief description of the taxi domain is presented. Then, some practical issues in the implementation of the proposal are described. Lastly, comparative results are summarized and discussed.

4.1 Description of the Taxi Domain

The taxi domain was formulated by Dietterich [3]. It consists of a 5-by-5 grid environment in which there is a taxi agent. The world has four locations marked as red (R), blue (B), green (G) and yellow (Y). This domain is episodic: an episode starts when the taxi is required to pick up a passenger and stops when the passenger is put down after a trip. At each episode, the taxi starts in a random square of the world. Then, a passenger placed in one of the four specific locations wishes to be transported to another specific location.
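To make the data preparation step of Algorithm 2 concrete, the following is a minimal sketch, not the paper's code, that builds the regression inputs x = F_u − γF'_u and targets y from logged transitions of one subtask and feeds them to the hypothetical fit_molecule_l1 helper defined in the previous sketch; the feature values are invented for the example.

```python
import numpy as np

def prepare_brm_data(features, next_features, targets, gamma=0.95):
    """Build the regression problem behind (20)-(21): inputs are the discounted
    feature differences F_u - gamma * F'_u, targets are the observed cumulative
    rewards V^pi(a, s) collected for the subtask."""
    F = np.asarray(features, dtype=float)            # shape (q, k): features of s
    F_next = np.asarray(next_features, dtype=float)  # shape (q, k): features of s'
    x = F - gamma * F_next
    y = np.asarray(targets, dtype=float)
    return x, y

# Hypothetical logged transitions for one node (values are illustrative only):
F = np.array([[0.0, 1.0, 2.0, 3.0],
              [1.0, 1.0, 2.0, 3.0]])
F_next = np.array([[1.0, 1.0, 2.0, 3.0],
                   [2.0, 1.0, 2.0, 3.0]])
targets = np.array([-1.0, -1.0])

x, y = prepare_brm_data(F, F_next, targets)
approximator = fit_molecule_l1(x, y, d=2, beta=1e-10)  # helper from the earlier sketch
```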
Both the starting and goal locations are randomly chosen at the beginning of the episode. In that sense, the taxi must travel to the passenger's location, pick up the passenger, go to the goal location, and put down the passenger. Figure 1 shows the taxi domain and its hierarchical decomposition based on [3].

Fig. 1. Taxi domain and its hierarchical decomposition, based on [3].

4.2 Implementation of the AHN-Based Value Function Approximation

The taxi domain was used to evaluate the performance of the proposed value function approximation. The experiment consisted of running 30,000 episodes using the MAXQ method. During the execution, the clean C values, which are equal to V^π(u, s), were collected. Also, a set of features F_u was identified as: f_{1,u} the current cab position, f_{2,u} the source position, f_{3,u} the goal position, and f_{4,u} the current subtask's goal position. Five repetitions of this experiment were done, and the collected data was partitioned into training (70%) and testing (30%) sets. Finally, these data sets were standardized.

Using the above information, the AHN-based value function approximation with L1 regularization (Algorithm 2) was run. A cross-validation approach (using only the training dataset) was applied to set the free parameters of the model, obtaining: the number of molecules n = 4 and the learning rate η = 0.1. The remaining parameters were set manually: the discount factor γ = 0.95 (as used in the MAXQ method) and the regularization parameter β = 1 × 10^−10. To this end, the AHN-based approximator was evaluated using the testing set (i.e. data unobserved by the learner). Also, the comparative analysis considered the simple AHN-algorithm without regularization as another value function approximator, with the same parameter values.

In addition, the proposal was compared with the OMP-BRM based approximator. This is a recent approach that represents the value function in RL as a linear approximation, with weight parameters obtained from the Bellman residual minimization (21) and computed as a regression problem via the orthogonal matching pursuit algorithm [9]. The same training and testing sets were used for the OMP-BRM method. Its free parameters were also set manually: γ = 0.95 and β = 1 × 10^−10.

The root-mean-square error (RMSE) metric was employed to evaluate the performance of the approximators. Since the Bellman residual is computed using the L2-norm between the predicted value function V̂ and the true value function V, ‖V̂ − V‖, the RMSE metric gives an explicit and sensitive evaluation of the approximation accuracy.
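The evaluation protocol of this section (70/30 split, standardization, RMSE on the testing set) can be sketched as follows. This is an illustrative assumption using scikit-learn with made-up logged data, reusing the hypothetical fit_molecule_l1 and polynomial_basis helpers from the earlier sketches; it is not the paper's actual experiment code.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error

# Made-up logged data for one node: features f1..f4 and clean C targets.
rng = np.random.default_rng(2)
features = rng.integers(0, 25, size=(5000, 4)).astype(float)
clean_C = rng.normal(size=5000)

# 70% training / 30% testing split, standardization fitted on the training set only.
X_train, X_test, y_train, y_test = train_test_split(
    features, clean_C, test_size=0.3, random_state=0)
sx, sy = StandardScaler(), StandardScaler()
X_train, X_test = sx.fit_transform(X_train), sx.transform(X_test)
y_train = sy.fit_transform(y_train.reshape(-1, 1)).ravel()
y_test = sy.transform(y_test.reshape(-1, 1)).ravel()

# Fit the regularized approximator with the reported beta = 1e-10, then compute
# the RMSE on unobserved data, as in Tables 2 and 3.
model = fit_molecule_l1(X_train, y_train, d=2, beta=1e-10)
pred = model.predict(polynomial_basis(X_test, 2))
rmse = np.sqrt(mean_squared_error(y_test, pred))
print(f"testing RMSE: {rmse:.3f}")
```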
4.3 Comparative Results and Discussion

The clean C value functions and their approximations from seven nodes in the hierarchy of the taxi domain were analyzed. In particular, the approximations were performed for four primitive nodes (North, South, East, West) and three composite nodes (Get, NavigateForGet, NavigateForPut). The Put node was not evaluated since it always has a zero cumulative reward value [3]. Table 2 summarizes the RMSE performance evaluation for the proposed AHN-L1, simple AHN and OMP-BRM based value function approximators on the testing dataset. In addition, Table 3 summarizes the RMSE performance of the three approximators on the training dataset.

Table 2. RMSE performance of the value function approximators (testing set).

Node            | AHN-L1 | simple AHN | OMP-BRM
North           | 0.719  | 0.726      | 0.989
South           | 0.759  | 0.765      | 0.995
East            | 0.768  | 0.835      | 0.991
West            | 0.824  | 0.884      | 0.994
Get             | 0.947  | 9.768      | 0.991
NavigateForGet  | 1.133  | 0.991      | 1.000
NavigateForPut  | 0.974  | 1.012      | 0.975

Table 3. RMSE performance of the value function approximators (training set).

Node            | AHN-L1 | simple AHN | OMP-BRM
North           | 0.783  | 0.750      | 0.989
South           | 0.749  | 0.744      | 0.993
East            | 0.813  | 0.780      | 0.995
West            | 0.823  | 0.740      | 0.993
Get             | 0.931  | 0.840      | 0.995
NavigateForGet  | 1.087  | 0.947      | 0.976
NavigateForPut  | 0.988  | 0.779      | 0.987

From Table 2, it can be seen that the AHN-L1 based method obtained the lowest RMSE values (except for the NavigateForGet node) in comparison with the simple AHN and OMP-BRM based approximators, when validating the performance on the testing data set. Also, the simple AHN based method was able to approximate the clean C value functions better than the OMP-BRM method, except in the composite nodes Get and NavigateForPut. However, the results show that composite nodes could be more difficult to approximate for any of these approaches.

On the other hand, Table 3 shows that the RMSE values from the simple AHN approximator are better than those from the AHN-L1 based method when validating the performance on the training data set. This can be explained because regularization tends to generalize rather than overfit the response of the approximator. In that sense, the AHN-L1 based method tends to produce larger RMSE values on the training data than the simple AHN based method. This is also confirmed when comparing the simple AHN approximator with the OMP-BRM based method: the RMSE values from the OMP-BRM method, which is based on regularization, are larger than those from the simple AHN based method. Nevertheless, it can be seen in Tables 2 and 3 that L1 regularization offers better accuracy and efficiency after the training procedure, because the RMSE values of the regularization-based AHN-L1 and OMP-BRM methods are smaller when using the testing set (i.e. unobserved data), in contrast with the simple AHN approximator. Lastly, it is noted that the AHN-L1 based method obtained better results when testing than when training. This behavior might be explained by the fact that L1 regularization tends to avoid overfitting, so underfitting then becomes a possibility.

To this end, Fig. 2 shows the function approximation of the clean C values in the East node using the AHN-L1, simple AHN and OMP-BRM based methods. As noticed, the OMP-BRM approximator is not well tuned, but it can make a distinction between negative and positive standardized C values. In addition, the proposed AHN-L1 approximator can deal with the nonlinearities of the true C values better than the simple AHN approximator.

Fig. 2. Value function approximation in the East node.

5 Conclusions and Future Work

This paper presents a new value function approximation based on the artificial hydrocarbon networks algorithm with L1 regularization for hierarchical reinforcement learning. In particular, the proposed approach was implemented in the hierarchy of the MAXQ method. The methodology of the proposed approximator consisted of using machine learning and regularization approaches in order to inherit characteristics from both, such as: defining nonlinear models to prevent overfitting, gaining interpretability and exploiting sparseness.
However, this work limited its attention to nonlinear models that prevent overfitting. In order to validate the proposal, Dietterich's taxi domain was used as a case study. The proposed AHN-L1, the simple AHN and the OMP-BRM based methods were compared. The proposed AHN-L1 based method was shown to be accurate and efficient in terms of the RMSE metric. The results confirm that the proposal obtains a nonlinear model for value function approximation in the hierarchy of the MAXQ method, preventing overfitting.

For future work, this proposal will be analyzed in terms of interpretability and sparseness, and it will also be mathematically formalized. Additionally, underfitting in the AHN-L1 based method will be analyzed in order to determine its influence on the value function approximation. As mentioned, this work is part of ongoing research, so its implementation for automatic state abstraction will be considered too.

References

1. R. Bellman. Dynamic Programming. Princeton University Press, 1957.
2. S. Bradtke and A. Barto. Linear least-squares algorithms for temporal difference learning. Machine Learning, 22(1):33–57, 1996.
3. T. Dietterich. Hierarchical reinforcement learning with MAXQ value function decomposition. Journal of Artificial Intelligence Research, 13:227–303, 2000.
4. H. Jakab and L. Csato. Sparse Approximations to Value Functions in Reinforcement Learning, volume 4 of Springer Series in Bio-/Neuroinformatics, pages 295–314. Springer, 2015.
5. J. Kolter and A. Ng. Regularization and feature selection in least-squares temporal difference learning. In 26th Annual International Conference on Machine Learning, pages 521–528, 2009.
6. M. Loth, M. Davy, and P. Preux. Sparse temporal difference learning using LASSO. In IEEE International Symposium on Approximate Dynamic Programming and Reinforcement Learning, pages 352–359, Honolulu, Hawaii, United States of America, April 2007. IEEE.
7. S. Mahajan. Hierarchical reinforcement learning in complex learning problems: A survey. International Journal of Computer Science and Engineering, 2(5):72–78, 2014.
8. T. Mitchell. Machine Learning. McGraw Hill, 1997.
9. C. Painter-Wakefield and R. Parr. Greedy algorithms for sparse reinforcement learning. In 29th International Conference on Machine Learning, pages 1391–1398, 2012.
10. R. Parr and S. Russell. Reinforcement learning with hierarchies of machines. In Conference on Advances in Neural Information Processing Systems, pages 1043–1049. MIT Press, 1997.
11. M. Pfeiffer. Reinforcement learning of strategies for Settlers of Catan. In International Conference on Computer Games: Artificial Intelligence, 2004.
12. H. Ponce. Bio-inspired training algorithms for artificial hydrocarbon networks: A comparative study. In 13th Mexican International Conference on Artificial Intelligence (MICAI), pages 162–166, Tuxtla Gutierrez, Chiapas, Mexico, November 2014. IEEE.
13. H. Ponce, L. Miralles-Pechuan, and L. Martinez-Villasenor. Artificial hydrocarbon networks for online sales prediction. In O. Pichardo, O. Herrera, and G. Arroyo, editors, Advances in Artificial Intelligence and Its Applications, volume 9414 of Lecture Notes in Computer Science, pages 498–508. Springer, 2015.
14. H. Ponce and P. Ponce. Artificial organic networks. In 2011 IEEE Conference on Electronics, Robotics and Automotive Mechanics, pages 29–34, Cuernavaca, Morelos, Mexico, 2011. IEEE.
15. H. Ponce, P. Ponce, and A. Molina. Adaptive noise filtering based on artificial hydrocarbon networks: An application to audio signals. Expert Systems With Applications, 41(14):6512–6523, 2014.
16. H. Ponce, P. Ponce, and A. Molina. Artificial Organic Networks: Artificial Intelligence Based on Carbon Networks, volume 521 of Studies in Computational Intelligence. Springer, 2014.
17. H. Ponce, P. Ponce, and A. Molina. The development of an artificial organic networks toolkit for LabVIEW. Journal of Computational Chemistry, 36(7):478–492, 2015.
18. Z. Qin, W. Li, and F. Janoos. Sparse reinforcement learning via convex optimization. In 31st International Conference on Machine Learning, volume 32, pages 424–432, 2014.
19. M. Riedmiller. Neural fitted Q iteration - first experiences with a data efficient neural reinforcement learning method. In J. Gama, R. Camacho, P. Brazdil, A. Jorge, and L. Torgo, editors, Machine Learning: ECML 2005, volume 3720 of Lecture Notes in Computer Science, pages 317–328. Springer, 2005.
20. R. Sutton, H. Maei, D. Precup, S. Bhatnagar, D. Silver, C. Szepesvari, and E. Wiewiora. Fast gradient-descent methods for temporal-difference learning with linear function approximation. In 26th Annual International Conference on Machine Learning, pages 993–1000. ACM, 2009.
21. R. Sutton, S. Singh, D. Precup, and B. Ravindran. Improved switching among temporally abstract actions. In Conference on Advances in Neural Information Processing Systems, volume 11, pages 1066–1072. MIT Press, 1999.
22. R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58(1):267–288, 1996.