Learning heuristic policies – a reinforcement learning problem

Thomas Philip Runarsson
School of Engineering and Natural Sciences, University of Iceland
[email protected]

Abstract. How learning heuristic policies may be formulated as a reinforcement learning problem is discussed. Reinforcement learning algorithms are commonly centred around estimating value functions. Here a value function represents the average performance of the learned heuristic algorithm over a problem domain. Heuristics correspond to actions and states to solution instances. The problem of bin packing is used to illustrate the key concepts. Experimental studies show that the reinforcement learning approach is compatible with the current techniques used for learning heuristics. The framework opens up further possibilities for learning heuristics by exploring the numerous techniques available in the reinforcement learning literature.

1 Introduction

The current state of the art in search techniques concentrates on problem-specific systems. There are many examples of effective and innovative search methodologies which have been adapted for specific applications. Over the last few decades there has been considerable scientific progress in building search methodologies and customizing them, usually through hybridization with problem-specific techniques for a broad scope of applications. This approach has resulted in effective methods for intricate real-world problem-solving environments and is commonly referred to as heuristic search. At the other extreme, an exhaustive search could be applied without a great deal of proficiency. However, the search space for many real-world problems is too large for an exhaustive search, making it too costly. Even when an effective search method exists, for example mixed integer programming, real-world problems frequently do not scale well; see e.g. [6] for a compendium of so-called NP optimization problems. In such cases heuristics offer an alternative to complete search.

In optimization the goal is to search for instances x, from a set of instances X, which maximize a payoff function f(x) while satisfying a number of constraints. A typical search method starts from an initial set of instances. Then, iteratively, search operators are applied, locating new instances, until instances with the highest payoff are reached. The key ingredient of any search methodology is thus the structure or representation of the instances x and the search operators that manipulate them. The aim of developing automated systems for designing and selecting search operators or heuristics is a challenging research objective. Even when a number of search heuristics have been designed for a particular problem domain, the task still remains of selecting those heuristics which are most likely to succeed in generating instances with higher payoff. Furthermore, the success of a heuristic will depend on the particular case in point and, when local search heuristics are applied, on the current instance. For this reason additional heuristics may be needed to guide and modify the search heuristics in order to produce instances that might otherwise not be created. These additional heuristics are the so-called meta-heuristics. Hyper-heuristics are an even more general approach, where the space of the heuristics themselves is searched [4]. A recent overview of methods for automating the heuristic design process is given in [2,5].
In general we can split the heuristic design process into two parts: the first is the actual heuristic h, or operator, used to modify or create an instance x ∈ X (so-called local search and construction heuristics, respectively); the second is the heuristic policy π(φ(x), h), the probability of selecting h, where φ(x) are features of instance x – in the simplest form φ(x) = x. Learning a heuristic h can be quite tricky for many applications. For example, in a designed heuristic space h ∈ H there may exist heuristics that create instances x ∉ X, or instances for which the constraints are not satisfied. For this reason most of the literature on automating the heuristic design process is focused on learning heuristic policies [15,12,3], although this is sometimes not explicitly stated.

The main contribution of this paper is to show how learning heuristics can be put in a reinforcement learning framework. The approach is illustrated for the bin packing problem. The focus is on learning a heuristic policy, and the actual heuristics will be kept as simple and intuitive as possible. In reinforcement learning, policies are found directly, or indirectly via a value function, using a scheme of reward and punishment. To date only a handful of examples [15,11,1,10] exist of applying reinforcement learning to learning heuristics. However, ant systems also have many similarities to reinforcement learning and can be thought of as learning a heuristic policy, see [7,8]. Commonly researchers apply reinforcement learning only to a particular problem instance, not to the entire problem domain as will be attempted here. The literature on reinforcement learning is rich in applications which can be posed as Markov decision processes, even partially observable ones. Reinforcement learning methods are also commonly referred to as approximate dynamic programming [13], since approximation techniques are commonly used to model policies. Posing the task of learning heuristics within this framework opens up a wealth of techniques for this research domain. It may also help formalize open research questions, such as how much human expertise is required to design a satisfactory heuristic search method for a given problem domain f ∈ F.

The following section illustrates how learning heuristics may be formulated as a reinforcement learning problem. This is followed by a description of the bin packing problem and a discussion of commonly used heuristics for this task. Section 4 illustrates how temporal difference learning can be applied to learning heuristic policies for bin packing, and the results are compared with classical heuristics as well as those learned using genetic programming in [12]. Both off-line and on-line bin packing are considered. The paper concludes with a summary of the main results.

2 Learning heuristics – a reinforcement learning problem

In heuristic search the goal is to search for instances x which maximize some payoff or objective f(x) while satisfying a number of constraints set by the problem. A typical search method starts from an initial set of instances. Then, iteratively, heuristic operators h are applied, locating new instances, until instances with the highest payoff are reached. The key ingredients of any heuristic search methodology are thus: the structure or representation of the instances x, the heuristics h ∈ H, the heuristic policy π, and the payoff f(x). Analogously, it is possible to conceptualise heuristic search in the reinforcement learning framework [14], as pictured in Fig. 1.

[Fig. 1. Learning heuristic search as a reinforcement learning problem: the search agent observes the state φ(x_t) and reward c(x_t) of the current problem instance (the environment), applies the heuristic h_t = π(φ(x_t)), and receives φ(x_{t+1}) and c(x_{t+1}) in return.]

Here the characteristic features of our instance, φ(x), are synonymous with a state in the reinforcement learning literature, and likewise the heuristic h with an action. Each iteration of the search heuristic is denoted by t.
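To make this correspondence concrete, the following minimal sketch (our own illustration, not code from the paper; all names such as sample_instance, phi, policy, apply_heuristic, cost and terminated are hypothetical placeholders) runs one learning episode of heuristic search as an agent-environment loop: a problem instance is sampled from the domain, and the policy repeatedly selects a heuristic from the features of the current solution instance until termination.

```python
# A minimal sketch of one learning episode in the framework of Fig. 1.
# All callables passed in are hypothetical placeholders, not part of the paper.
def run_episode(sample_instance, phi, policy, apply_heuristic, cost, terminated):
    x = sample_instance()              # a new problem instance f from F starts a new episode
    payoff, t = 0.0, 0
    while not terminated(x, t):
        h = policy(phi(x))             # action: heuristic chosen from the instance features
        x = apply_heuristic(x, h)      # environment step: heuristic produces the next instance
        payoff += cost(x)              # reward signal c(x_t); its sum over the episode is f(x)
        t += 1
    return x, payoff
```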
In the reinforcement learning formulation the payoff is written as a sum of rewards,

    f(x) = \sum_{t=0}^{T} c(x_t)                                                  (1)

where T denotes the final iteration, determined by some termination criterion of the heuristic search. For many problems one would set c(x_t) = 0 for all t < T and then c(x_T) = f(x_T). For construction heuristics, T denotes the iteration at which the instance has been completely constructed. For some problems the objective f(x) can be broken down into a sum as shown in (1). One such example is the bin packing problem: each time a new bin needs to be opened a reward of c(x) = −1 is given, otherwise 0.

It is the search agent's responsibility to update its heuristic policy based on the feedback from the particular problem instance f ∈ F being searched. Once the search has terminated, the environment is updated with a new problem instance sampled from F, and a new learning episode is initiated. This makes the heuristic learning problem noisy. The resulting policy learned, however, is one that maximizes the average performance over the problem domain, that is

    \max_{\pi} \frac{1}{|F|} \sum_{f \in F} f\big(x_T^{(f)}\big)                  (2)

where x_T^{(f)} is the solution found by the learned heuristic policy for problem f. The average performance over the problem domain corresponds to the so-called value function in reinforcement learning. Reinforcement learning algorithms are commonly centred around estimating value functions.

3 Bin Packing

Given a bin of size W̄ and a list of items of sizes w_1, w_2, ..., w_n, each item must be in exactly one bin,

    \sum_{j=1}^{m} z_{i,j} = 1, \quad i = 1, \ldots, n                            (3)

where z_{i,j} is 1 if item i is in bin j, and 0 otherwise. The bins should not overflow, that is

    \sum_{i=1}^{n} w_i z_{i,j} \le \bar{W} x_j, \quad j = 1, \ldots, m            (4)

where x_j is 1 when bin j is used, else 0. The objective is to minimize the number of bins used,

    \min_{z,x} \sum_{j=1}^{m} x_j                                                 (5)

There are therefore (n+1)m binary decision variables, where m is an upper estimate of the number of bins needed. The bin packing problem is a combinatorial NP-hard problem. Problem instances can be generated quite easily from this problem domain by randomly sampling the weights w from some known distribution. Previous studies, see for example [12], have used a discrete uniform distribution U(w_min, w_max) and kept the number of items n fixed. Clearly one would expect different distributions of w to result in different heuristic policies. However, hand-crafted heuristics found in the literature often do not take the weight distribution into account.

There are two heuristic approaches to solving the bin packing problem, one on-line in nature and the other off-line. In the on-line case one must pack each weight in the order in which it arrives, that is w_1 first, then w_2 and so on. In the off-line case the order does not matter; in essence all the weights are given at the same time. Common on-line heuristics include first-fit (FF) and best-fit. First-fit places the next item to be packed into the first bin j with sufficient residual capacity, or gap. Best-fit searches for the bin with the smallest but still sufficient residual capacity.
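To make these two on-line rules concrete, a short sketch is given below (our own illustration, not code from the paper; the function and variable names are hypothetical). Both heuristics maintain the current content of each open bin and open a new bin only when no open bin can accommodate the arriving item.

```python
def first_fit(items, capacity):
    """On-line first-fit: place each arriving item in the first open bin with enough residual capacity."""
    bins = []                      # current content of each open bin
    for w in items:
        for j, load in enumerate(bins):
            if load + w <= capacity:
                bins[j] += w
                break
        else:
            bins.append(w)         # open a new bin when no open bin has room
    return bins

def best_fit(items, capacity):
    """On-line best-fit: place each item in the open bin with the smallest sufficient residual capacity."""
    bins = []
    for w in items:
        feasible = [j for j in range(len(bins)) if bins[j] + w <= capacity]
        if feasible:
            j = min(feasible, key=lambda k: capacity - bins[k] - w)
            bins[j] += w
        else:
            bins.append(w)
    return bins
```

For example, first_fit([60, 50, 70, 30], 150) packs the items into two bins with contents [140, 70].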
Both methods need at most (17/10)·OPT(w) + 2 bins in the worst case, where OPT(w) is the optimal number of bins. An off-line version of FF is first-fit decreasing (FFD), where the weights are first placed in non-increasing order. Using this new order the largest unpacked item is always packed into the first possible bin; a new bin is opened when needed and all bins stay open. The number of bins used by FFD is at most (11/9)·OPT(w) + 6/9. A modification of FFD [9] also exists, and numerous other variations may be found in the literature.

4 Illustrative example using bin packing

Now the techniques described above are illustrated for the bin packing problem. We consider the problem domain F where n = 100 items w ∼ U(w_min, w_max) = U(20, 80) are to be packed into bins of size W̄ = 150. Both on-line and off-line approaches to bin packing will be studied. As with most reinforcement learning methods, a value function will be approximated. The value function approximates the expected value of the solutions found, f(x_T), over the domain F. For example, in bin packing the aim is to minimize the number of bins used, and so the value function approximates the mean number of bins used by the heuristic search algorithm over the entire problem domain. The optimal policy π is the one that is greedy with respect to this value function. There are in principle two types of value functions, the so-called state value function V^π and the heuristic-state value function Q^π. A policy greedy with respect to the heuristic-state value function is the optimal policy, defined as follows:

    h_t^* = \arg\max_{h \in H} Q^{\pi}(\phi(x_t), h)                              (6)

however, for a state value function a one-step lookahead must be performed,

    h_t^* = \arg\max_{h \in H} V^{\pi}\big(\phi(x_{t+1}^{(h)})\big)               (7)

where φ(x_{t+1}^{(h)}) is the resulting (post-heuristic) state when heuristic h is applied to solution instance x_t. The reinforcement learning algorithm applied here is known as temporal difference learning. The learned policy is one that minimizes the mean number of bins used over the problem domain. The temporal difference learning formula is simply

    V(\phi(x_t)) = V(\phi(x_t)) + \alpha \big[ c(x_{t+1}) + V(\phi(x_{t+1})) - V(\phi(x_t)) \big]                     (8)

and

    Q(\phi(x_t), h_t) = Q(\phi(x_t), h_t) + \alpha \big[ c(x_{t+1}) + Q(\phi(x_{t+1}), h_{t+1}) - Q(\phi(x_t), h_t) \big]   (9)

where 0 < α < 1 is a step size parameter which needs to be tuned.

The heuristic h for bin packing will usually be simply the assignment of a weight w_i to a particular bin j, given that there is sufficient capacity. Two approaches are illustrated below. In the first, the heuristic chooses in what order the weights should be assigned, while the learned heuristic policy decides in which bin each item should be placed. In the second, the learned heuristic policy selects the weight to be assigned, while the heuristic selects the bin to place the weight in. Both of these are so-called construction heuristics. Each iteration step t corresponds to a weight being assigned, and so T = n. The cost incurred at each iteration t can be −1 when a new bin is opened, else 0. Alternatively, the cost can be zero at all times, with the terminal cost being the negative number of bins opened, or even the negative mean gap created.
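Before turning to the experiments, the generic updates above can be sketched in code (our own tabular illustration; the feature function, heuristics and cost are hypothetical and problem dependent). The first function is a single temporal difference step as in Eq. (8); the second is the greedy one-step-lookahead choice of Eq. (7).

```python
from collections import defaultdict

V = defaultdict(float)      # tabular state-value function indexed by features phi(x)
ALPHA = 0.01                # step size, 0 < alpha < 1, to be tuned

def td0_update(phi_t, phi_next, c_next):
    # Eq. (8): move V(phi(x_t)) towards the one-step target c(x_{t+1}) + V(phi(x_{t+1})).
    V[phi_t] += ALPHA * (c_next + V[phi_next] - V[phi_t])

def greedy_heuristic(x_t, heuristics, apply_heuristic, phi):
    # Eq. (7): one-step lookahead over the post-heuristic states phi(x_{t+1}^{(h)}).
    return max(heuristics, key=lambda h: V[phi(apply_heuristic(x_t, h))])
```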
4.1 On-line bin packing

In on-line bin packing the items have independent and identically distributed (IID) weights sampled from the distribution U(20, 80). The heuristic simply selects one item after the other. If the item will not fit in any open bin, the heuristic opens a new bin and places the item there. If the item fits in more than one open bin, the learned heuristic policy is used to select the most appropriate bin.

Having decided on the heuristic, one must select an appropriate state description for the solution instance x. In this case the current total weight W_j of the bin j under consideration seems appropriate. The post-heuristic state would then simply be W_j + w_i, where we are considering placing the next weight w_i in bin j. This state description, which is simply the content of a single bin, is clearly not rich enough to predict how many bins will be opened in the future. Nevertheless, one may be able to predict the final gap, i.e. (W̄ − W_j), of each bin. The cost in this case would be the mean gap created. The temporal difference (TD) learning scheme is then

    V(W_j) = V(W_j) + \alpha \big[ V(W_j + w_i) - V(W_j) \big]                    (10)

and, once all items have been packed, a final update is performed for every opened bin,

    V(W_j) = V(W_j) + \alpha \big[ (\bar{W} - W_j) - V(W_j) \big]                 (11)

The heuristic policy is the one which is greedy with respect to V, i.e. the bin chosen for item w_i is

    j = \arg\min_{j,\; W_j + w_i \le \bar{W}} V(W_j + w_i)                        (12)

The noise needed to drive the learning process is created by generating a new problem instance, i.e. new items to be packed, within each new episode of the temporal difference learning algorithm (learning parameter α = 0.001). Figure 2 shows the value function learned using the TD algorithm, the expected gap versus bin weight for on-line packing.

[Fig. 2. Expected gap as a function of bin weight W for on-line packing (IID items).]

There we can see that one should avoid leaving gaps of around size 10 to 20. This seems reasonable, since the smallest item weight is 20 and it therefore becomes impossible to fill such a gap. The policy learned is thus very specific to the distribution of weights being packed, as one might expect: the learned heuristic policy is specialized for the problem domain in question.
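A complete, minimal sketch of this on-line experiment is given below (our own code, not the author's; the tabular representation, episode count and variable names are assumptions, while the parameters W̄ = 150, n = 100, U(20, 80) and α = 0.001 are taken from the text). It learns V(W), the expected final gap of a bin with content W, using updates (10) and (11) and the greedy bin choice (12).

```python
import random

W_BAR, N_ITEMS, W_MIN, W_MAX = 150, 100, 20, 80   # problem domain from the text
ALPHA, EPISODES = 0.001, 10000                     # step size from the text; episode count assumed

# Tabular value function: V[W] estimates the expected final gap of a bin with content W.
V = [0.0] * (W_BAR + 1)

for episode in range(EPISODES):
    # A new problem instance is sampled every episode; this provides the exploration noise.
    items = [random.randint(W_MIN, W_MAX) for _ in range(N_ITEMS)]
    bins = []                                      # current content W_j of each open bin
    for w in items:
        feasible = [j for j in range(len(bins)) if bins[j] + w <= W_BAR]
        if not feasible:
            bins.append(w)                         # heuristic: open a new bin when nothing fits
        else:
            j = min(feasible, key=lambda k: V[bins[k] + w])       # greedy policy, Eq. (12)
            V[bins[j]] += ALPHA * (V[bins[j] + w] - V[bins[j]])   # TD update, Eq. (10)
            bins[j] += w
    for W in bins:                                 # terminal update, Eq. (11): observed gaps
        V[W] += ALPHA * ((W_BAR - W) - V[W])
```

Plotting V against W after training should reproduce the qualitative shape described for Fig. 2, with gaps of roughly 10 to 20 penalised because no item smaller than 20 exists to fill them.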
4.2 Off-line bin packing

The simplest off-line heuristic approach to bin packing is FFD. We can repeat the exercise of the last section by ordering the items to be packed in decreasing order. Now the distribution is no longer IID and the predicted gap, shown in Figure 3, is completely different.

[Fig. 3. Expected gap as a function of bin weight W for off-line packing (decreasing items).]

The regions of smaller expected gaps are now narrower than those in Figure 2 and, as a result, the performance of the learned heuristic is better. The mean number of bins packed is now 35.01, a saving of one bin on average.

We now illustrate how the reinforcement learning framework is able to implement the histogram-matching approach of [12] for off-line packing. In [12] part of the heuristic value function is found using genetic programming (GP). The problem instance domain is extended in their study to include W̄ = 150, 75, 300 and, when W̄ = 150 for example, the weight distributions U(20, 80), U(1, 150), U(30, 70) and U(1, 80). In this case the heuristic selects the bin with the smallest gap and the learned heuristic policy is defined as follows:

    w^* = \arg\min_{w} \big\{ GP(w)\,\big(g^t(w) - o^t(w)\big) \big\}             (13)

where g^t(w) is the number of gaps of size w and o^t(w) the number of unpacked items of size w. Initially there are no gaps (g^0 = 0) and o^0 is simply a histogram of the item weights to be packed. Integer weights are assumed, so that the number of histogram bins needed is W̄ for the gaps (g) and w_max for the items (o). New gaps are created once bins are filled and when new ones are opened. The GP(w) function in the above formulation is the part of the heuristic value function discovered by a genetic program, and

    GP(w) = \frac{w_{\max} + w_{\min} + w}{\bar{W}} + 10^{-4}                     (14)

was found to be very robust [12].

The same state description will now be used to learn the decision value function using temporal difference learning. Here Q values are used, where the decisions made are the weights assigned to a bin. In [12] the bins are selected in such a manner that the bin with the smallest gap is chosen, i.e. best fit. However, this gap is only available as long as g^t(w_i) > o^t(w_i). The same heuristic strategy for selecting a bin is used here. The temporal difference (TD) formulation is as follows:

    Q(s_t, w_t) = Q(s_t, w_t) + \alpha \big[ c_{t+1} + Q(s_{t+1}, w_{t+1}) - Q(s_t, w_t) \big]     (15)

where the state is s_t = g^t(w_t) − o^t(w_t) and c_{t+1} is 1 if a new bin was opened, else 0. The value of a terminal state is zero as usual, i.e. Q(s_{n+1}, ·) = 0. The weight selected then follows the policy

    w_t = \arg\min_{w,\; o^t(w) > 0} Q(s_t, w)                                    (16)

The value function now tells us the expected number of bins that will be opened given the current state s_t and decision w_t at iteration t, while following the heuristic policy π. The number of bins used is therefore \sum_{t=1}^{n} c_t. However, for more general problems the cost of a solution is not known until the complete solution has been built. An alternative formulation is therefore to have no cost during search (c_t = 0, t = 1, ..., n) and only at the final iteration give a terminal cost equal to the number of bins used, i.e. c_{n+1} = m.

Figure 4 shows the moving average number of bins used as a function of learning episodes (α = 0.01, slightly larger than before). The noise results from generating a new problem instance at each episode. When the performance of this value function is compared with that of [12] on 100 test problems, no statistical difference in the mean number of bins used is observed: µ_GP = 34.27 and µ_TD = 34.53. These results improve on the off-line first-fit approach above.

[Fig. 4. Moving average of the number of bins used, 0.95·#Bins_{e−1} + 0.05·#Bins_e, versus learning episodes e. Each new episode generates a new problem instance, hence the noise.]
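The weight-selection policy (16) and the TD update (15) used in this off-line experiment can be sketched as follows (our own illustration; the histogram dictionaries and all names are assumptions, and the surrounding best-fit bin assignment and histogram bookkeeping are omitted).

```python
from collections import defaultdict

Q = defaultdict(float)     # Q[(s, w)]: expected number of bins still to be opened
ALPHA = 0.01               # step size used in the off-line experiment

def select_weight(gap_hist, item_hist):
    # Policy of Eq. (16): among sizes with unpacked items left, pick the weight w
    # minimising Q(s_t, w), where the state is the surplus g^t(w) - o^t(w).
    candidates = [w for w, o in item_hist.items() if o > 0]
    return min(candidates,
               key=lambda w: Q[(gap_hist.get(w, 0) - item_hist.get(w, 0), w)])

def q_update(s, w, s_next, w_next, c_next):
    # TD update of Eq. (15); c_next is 1 if the last placement opened a new bin, else 0.
    # At the terminal step Q[(s_next, w_next)] is taken to be 0, i.e. Q(s_{n+1}, .) = 0.
    Q[(s, w)] += ALPHA * (c_next + Q[(s_next, w_next)] - Q[(s, w)])
```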
5 Summary and conclusions

The challenging task of heuristic learning was put forth as a reinforcement learning problem. The heuristics are assumed to be designed by a human designer and correspond to the actions in the usual reinforcement learning setting. The states represent the solution instances, and the heuristic policies learned decide when these heuristics should be used. The simpler the heuristic, the more challenging it becomes to learn a policy. At the other extreme a single powerful heuristic may be used, in which case no heuristic policy need be learned (only one action is possible). The heuristic policy is found indirectly by approximating a value function, whose value is the expected performance of the algorithm over the problem domain as a function of specific features of a solution instance and the applied heuristics. It is clear that problem domain knowledge will be reflected in the careful design of features, such as we have seen in the histogram-matching approach to bin packing, and in the design of heuristics. The machine learning side is devoted to learning heuristic policies.

The exploratory noise needed to drive the reinforcement learning is introduced indirectly by generating a completely new problem instance at each learning episode. This is very different from the reinforcement learning approaches commonly seen in the literature for learning heuristic search, where usually only a single problem (benchmark) instance is considered. One immediate concern, which needs to be addressed, is the level of noise encountered during learning when a new problem instance is generated at each new episode. Although the bin packing problems tackled in this paper could be solved, Figure 4 shows that convergence may also be an issue. One possible remedy may be to correlate the instances generated.

References

1. R. Bai, E.K. Burke, M. Gendreau, G. Kendall, and B. McCollum. Memory length in hyper-heuristics: An empirical study. In Computational Intelligence in Scheduling (SCIS'07), IEEE Symposium on, pages 173–178. IEEE, 2007.
2. E.K. Burke, M.R. Hyde, G. Kendall, G. Ochoa, E. Özcan, and J.R. Woodward. Exploring hyper-heuristic methodologies with genetic programming. Computational Intelligence, pages 177–201, 2009.
3. E.K. Burke, M.R. Hyde, G. Kendall, and J. Woodward. A genetic programming hyper-heuristic approach for evolving two dimensional strip packing heuristics. IEEE Transactions on Evolutionary Computation (to appear), 2010.
4. E.K. Burke and G. Kendall. Search Methodologies: Introductory Tutorials in Optimization and Decision Support Techniques. Springer Verlag, 2005.
5. E.K. Burke, M. Hyde, G. Kendall, G. Ochoa, E. Özcan, and J.R. Woodward. A classification of hyper-heuristic approaches. Handbook of Metaheuristics, pages 449–468, 2010.
6. P. Crescenzi and V. Kann. A compendium of NP optimization problems. http://www.nada.kth.se/~viggo/problemlist/compendium.html. Technical report, accessed September 2010.
7. M. Dorigo and L. Gambardella. A study of some properties of Ant-Q. Parallel Problem Solving from Nature – PPSN IV, pages 656–665, 1996.
8. M. Dorigo and L.M. Gambardella. Ant colony system: A cooperative learning approach to the traveling salesman problem. IEEE Transactions on Evolutionary Computation, 1(1):53–66, 1997.
9. S. Floyd and R.M. Karp. FFD bin packing for item sizes with uniform distributions on [0, 1/2]. Algorithmica, 6(1):222–240, 1991.
10. D. Meignan, A. Koukam, and J.C. Créput. Coalition-based metaheuristic: a self-adaptive metaheuristic using reinforcement learning and mimetism. Journal of Heuristics, pages 1–21.
11. A. Nareyek. Choosing search heuristics by non-stationary reinforcement learning. Applied Optimization, 86:523–544, 2003.
12. R. Poli, J. Woodward, and E.K. Burke. A histogram-matching approach to the evolution of bin-packing strategies. In IEEE Congress on Evolutionary Computation (CEC 2007), pages 3500–3507, September 2007.
13. W.B. Powell. Approximate Dynamic Programming: Solving the Curses of Dimensionality. Wiley-Interscience, 2007.
14. R.S. Sutton and A.G. Barto. Reinforcement Learning: An Introduction. The MIT Press, 1998.
15. W. Zhang and T.G. Dietterich. A reinforcement learning approach to job-shop scheduling. In Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence, pages 1114–1120, San Francisco, California, 1995. Morgan Kaufmann.