Learning heuristic policies – a reinforcement learning problem
Thomas Philip Runarsson
School of Engineering and Natural Sciences
University of Iceland
[email protected]
Abstract. How learning heuristic policies may be formulated as a reinforcement learning problem is discussed. Reinforcement learning algorithms are commonly centred around estimating value functions. Here a
value function represents the average performance of the learned heuristic
algorithm over a problem domain. Heuristics correspond to actions and
states to solution instances. The problem of bin packing is used to
illustrate the key concepts. Experimental studies show that the reinforcement learning approach is compatible with the current techniques
used for learning heuristics. The framework opens up further possibilities
for learning heuristics by exploring the numerous techniques available in
the reinforcement learning literature.
1 Introduction
The current state of the art in search techniques concentrates on problem-specific
systems. There are many examples of effective and innovative search methodologies which have been adapted for specific applications. Over the last few decades,
there has been considerable scientific progress in building search methodologies
and customizing them, usually through hybridization with problem-specific
techniques, for a broad scope of applications. This approach has resulted in
effective methods for intricate real-world problem solving environments and is
commonly referred to as heuristic search. At the other extreme, an exhaustive
search could be applied without requiring a great deal of expertise. However,
the search space for many real-world problems is too large for an exhaustive
search, making it prohibitively costly. Even when an effective search method
exists, for example mixed integer programming, it frequently does not scale well
to real-world problems; see e.g. [6] for a compendium of so-called NP optimization
problems. In such cases heuristics offer an alternative to complete search.
In optimization the goal is to search for instances x, from a set of instances X,
which maximize a payoff function f(x) while satisfying a number of constraints.
A typical search method starts from an initial set of instances. Then, iteratively,
search operators are applied to locate new instances until instances with the
highest payoff are reached. The key ingredient of any search methodology is thus
the structure or representation of the instances x and the search operators that
manipulate them. Developing automated systems for designing and selecting
search operators or heuristics is a challenging research objective. Even when a
number of search heuristics have been designed for a particular problem domain,
the task remains of selecting those heuristics most likely to generate instances
with higher payoff. Furthermore, the success of a heuristic will depend on the
particular case in point and, when local search heuristics are applied, on the
current instance. For this reason additional heuristics may be needed to guide
and modify the search heuristics in order to produce instances that might
otherwise not be created. These additional heuristics are the so-called
meta-heuristics. Hyper-heuristics are an even more general approach where the
space of the heuristics themselves is searched [4].
A recent overview of methods for automating the heuristic design process
is given in [2,5]. In general we can split the heuristic design process into two
parts: the first is the actual heuristic h, or operator, used to modify or create an
instance¹ x ∈ X; the second is the heuristic policy π(φ(x), h), the probability of
selecting h, where φ(x) are features of instance x (in the simplest form φ(x) = x).
Learning a heuristic h can be quite tricky for many applications. For example,
for a designed heuristic space h ∈ H there may exist heuristics that create
instances x ∉ X or instances for which the constraints are not satisfied. For this
reason most of the literature on automating the heuristic design process is focused
on learning heuristic policies [15,12,3], although this is sometimes not explicitly
stated.
The main contribution of this paper is to show how learning heuristics can be
put in a reinforcement learning framework. The approach is illustrated for the
bin packing problem. The focus is on learning a heuristic policy, and the actual
heuristics are kept as simple and intuitive as possible. In reinforcement learning,
policies are found directly, or indirectly via a value function, using a scheme
of reward and punishment. To date only a handful of examples [15,11,1,10] exist
of applying reinforcement learning to learning heuristics. However, ant systems
also have many similarities to reinforcement learning and can be thought of as
learning a heuristic policy, see [7,8]. Commonly researchers apply reinforcement
learning only to a particular problem instance, not to the entire problem domain
as will be attempted here.
The reinforcement learning literature is rich in applications which can be
posed as Markov decision processes, even partially observable ones. Reinforcement
learning methods are also commonly referred to as approximate dynamic
programming [13], since approximation techniques are commonly used to model
policies. Posing the task of learning heuristics within this framework opens up a
wealth of techniques for this research domain. It may also help to better formalize
open research questions, such as how much human expertise is required to design
a satisfactory heuristic search method for a given problem domain f ∈ F.
The following section illustrates how learning heuristics may be formulated
as a reinforcement learning problem. This is followed by a description of the
bin-packing problem and a discussion of commonly used heuristics for this task.
Section 4 illustrates how temporal difference learning can be applied to learning
heuristic policies for bin packing, and the results are compared with classical
heuristics as well as those learned using genetic programming in [12]. Both off-line
and on-line bin packing are considered. The paper concludes with a summary of
the main results.

¹ So-called construction heuristics versus local search heuristics.
2 Learning heuristics – a reinforcement learning problem
In heuristic search the goal is to search for instances x which maximize some
payoff or objective f(x) while satisfying a number of constraints set by the
problem. A typical search method starts from an initial set of instances. Then,
iteratively, heuristic operators h are applied to locate new instances until
instances with the highest payoff are reached. The key ingredients of any heuristic
search methodology are thus: the structure or representation of the instances x,
the heuristics h ∈ H, the heuristic policy π, and the payoff f(x). Analogously,
it is possible to conceptualise heuristic search in the reinforcement learning
framework [14], as pictured in Fig. 1. Here the characteristic features of our
instance, φ(x), are synonymous with a state in the reinforcement learning
literature, and likewise the heuristic h with an action. Each iteration of the
search heuristic is denoted by t. The payoff is then written as a sum of step-wise
rewards:
f(x) = \sum_{t=0}^{T} c(x_t)    (1)
where T denotes the final iteration, determined by some termination criterion for
the heuristic search. For many problems one would set c(x_t) = 0 for all t < T and
then c(x_T) = f(x_T). For construction heuristics T would denote the iteration
at which the instance has been constructed completely. For some problems the
objective f(x) can be broken down into a sum as shown in (1). One such example
is the bin packing problem: each time a new bin needs to be opened a reward of
c(x) = −1 is given, otherwise the reward is 0.
It is the search agent’s responsibility to update its heuristic policy based on
the feedback from the particular problem instance f ∈ F being searched. Once
the search has terminated the environment is updated with a new problem
instance sampled from F, and in this way a new learning episode is initiated.
This makes the heuristic learning problem noisy. The resulting policy learned,
however, is one that maximizes the average performance over the problem domain,
that is
\max_{\pi} \frac{1}{|F|} \sum_{f \in F} f\bigl(x_T^{(f)}\bigr)    (2)

where x_T^{(f)} is the solution found by the learned heuristic policy for problem f.
The average performance over the problem domain corresponds to the so-called
value function in reinforcement learning. Reinforcement learning algorithms are
commonly centred around estimating value functions.
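The episodic structure just described can be sketched as follows. This is a minimal illustration assuming a hypothetical problem-instance interface (sample_instance, initial_solution, features, apply); these names are chosen here for exposition and are not part of the original formulation.

```python
def run_episode(policy, sample_instance, heuristics, max_steps=1000):
    """One learning episode: a fresh problem instance f is sampled from the
    domain F, heuristics are applied until termination, and the payoff f(x)
    is accumulated as the sum of per-step rewards c(x_t), as in Eq. (1)."""
    f = sample_instance()                        # new problem instance (environment)
    x = f.initial_solution()                     # initial solution instance x_0
    total_payoff = 0.0
    for t in range(max_steps):
        h = policy(f.features(x), heuristics)    # h_t = pi(phi(x_t))
        x, reward, done = f.apply(h, x)          # next solution and reward c(x_{t+1})
        total_payoff += reward
        if done:
            break
    return total_payoff


def average_performance(policy, sample_instance, heuristics, episodes=100):
    """Monte-Carlo estimate of the objective in Eq. (2): the mean payoff of
    the learned policy over problem instances drawn from the domain F."""
    payoffs = [run_episode(policy, sample_instance, heuristics)
               for _ in range(episodes)]
    return sum(payoffs) / len(payoffs)
```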
3 Bin Packing
Given a bin of size W̄ and a list of items of sizes w_1, w_2, . . . , w_n, each item must
be placed in exactly one bin,

\sum_{j=1}^{m} z_{i,j} = 1, \quad i = 1, \ldots, n    (3)
where z_{i,j} is 1 if item i is in bin j. The bins should not overflow, that is

\sum_{i=1}^{n} w_i z_{i,j} \le \bar{W} x_j, \quad j = 1, \ldots, m    (4)
where x_j is 1 when bin j is used and 0 otherwise. The objective is to minimize
the number of bins used,

\min_{z,x} \sum_{j=1}^{m} x_j    (5)
There are therefore (n+1)m binary decision variables, where m is an upper
estimate of the number of bins needed. The bin packing problem is a combinatorial
NP-hard problem. Problem instances can be generated quite easily from this
problem domain by randomly sampling the weights w from some known
distribution. Previous studies, see for example [12], have used a discrete uniform
distribution U(w_min, w_max) and kept the number of items n fixed. Clearly one
would expect different distributions of w to result in different heuristic policies.
However, hand-crafted heuristics found in the literature often do not take the
weight distribution into account.
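As a concrete reading of constraints (3)–(4) and objective (5), the following sketch verifies a candidate assignment; the helper name check_packing and its argument layout are illustrative, not taken from the paper.

```python
def check_packing(z, x, w, W_bar):
    """Verify constraints (3)-(4) and return objective (5) for a candidate
    packing. z[i][j] = 1 if item i is in bin j, x[j] = 1 if bin j is used,
    w[i] are the item sizes and W_bar is the bin capacity."""
    n, m = len(w), len(x)
    # (3): each item is placed in exactly one bin
    for i in range(n):
        assert sum(z[i][j] for j in range(m)) == 1
    # (4): no bin overflows, and items only go into bins that are in use
    for j in range(m):
        assert sum(w[i] * z[i][j] for i in range(n)) <= W_bar * x[j]
    # (5): the number of bins used
    return sum(x)
```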
There are two heuristic approaches to solving the bin packing problem: one
is on-line in nature and the other off-line. In the on-line case one must pack each
weight in the order in which it arrives, that is w_1 first, then w_2 and so on.
Fig. 1. Learning heuristic search as a reinforcement learning problem: the search agent selects a heuristic h_t = π(φ(x_t)) and applies it to the problem instance (the environment), which returns the next state φ(x_{t+1}) and reward c(x_{t+1}).
In the off-line case the order does not matter; in essence all the weights are given
at the same time. Common on-line heuristics include first-fit (FF) and best-fit
(BF). First-fit places the next item to be packed into the first bin j with sufficient
residual capacity, or gap. Best-fit searches for the bin with the smallest sufficient
residual capacity. Both methods need at most 17/10 OPT(w) + 2 bins in the worst
case, where OPT is the optimal number of bins. An off-line version of FF is
first-fit decreasing (FFD), where the weights are first placed in non-increasing
order. Using this new order the largest unpacked item is always packed into the
first possible bin. A new bin is opened when needed and all bins stay open. The
number of bins used by FFD is at most 11/9 OPT(w) + 6/9. A modification of
FFD [9] also exists and numerous other variations may be found in the literature.
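For reference, the classical heuristics just described can be sketched as follows; this is a minimal illustration, with function names chosen here for exposition.

```python
def first_fit(weights, W_bar):
    """First-fit (FF): place each item into the first open bin with enough
    residual capacity, opening a new bin when none fits."""
    bins = []                               # current load of each open bin
    for w in weights:
        for j, load in enumerate(bins):
            if load + w <= W_bar:
                bins[j] += w
                break
        else:
            bins.append(w)                  # no bin fits: open a new bin
    return len(bins)


def best_fit(weights, W_bar):
    """Best-fit (BF): place each item into the feasible bin with the
    smallest remaining gap."""
    bins = []
    for w in weights:
        feasible = [j for j, load in enumerate(bins) if load + w <= W_bar]
        if feasible:
            j = min(feasible, key=lambda k: W_bar - bins[k] - w)
            bins[j] += w
        else:
            bins.append(w)
    return len(bins)


def first_fit_decreasing(weights, W_bar):
    """First-fit decreasing (FFD): sort the items in non-increasing order
    and then apply first-fit."""
    return first_fit(sorted(weights, reverse=True), W_bar)
```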
4 Illustrative example using bin-packing
Now the techniques described above are illustrated for the bin-packing problem.
We consider the problem domain F where n = 100 and items w ∼ U(w_min, w_max) =
U(20, 80) are to be packed into bins of size W̄ = 150. Both on-line and off-line
approaches to bin packing will be studied. As with most reinforcement learning
methods a value function will be approximated. The value function approximates
the expected value of the solutions found, f(x_T), over the domain F. For example,
in bin packing the aim is to minimize the number of bins used, and so the value
function approximates the mean number of bins used by the heuristic search
algorithm over the entire problem domain. The optimal policy π is the one that
is greedy with respect to this value function. There are in principle two types of
value functions, the state value function V^π and the heuristic-state value function
Q^π. A policy greedy with respect to the heuristic-state value function is the
optimal policy, defined as follows,
h_t^* = \arg\max_{h \in H} Q^\pi(\phi(x_t), h),    (6)
however, for a state value function a one-step lookahead must be performed,

h_t^* = \arg\max_{h \in H} V^\pi\bigl(\phi(x_{t+1}^{(h)})\bigr)    (7)

where φ(x_{t+1}^{(h)}) is the resulting (post-heuristic) state when heuristic h is applied
to solution instance x_t. The reinforcement learning algorithm applied here is
known as temporal difference learning. The learned policy is one that minimizes
the mean number of bins used for the problem domain. The temporal difference
learning formula is simply
V(\phi(x_t)) = V(\phi(x_t)) + \alpha \bigl[ V(\phi(x_{t+1})) + c(x_{t+1}) - V(\phi(x_t)) \bigr]    (8)

and

Q(\phi(x_t), h_t) = Q(\phi(x_t), h_t) + \alpha \bigl[ Q(\phi(x_{t+1}), h_{t+1}) + c(x_{t+1}) - Q(\phi(x_t), h_t) \bigr]    (9)

where 0 < α < 1 is a step-size parameter which needs to be tuned.
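As a minimal sketch of the tabular updates (8) and (9), assuming dictionary-based value tables keyed by hashable feature vectors; the function names are chosen here for illustration.

```python
def td_update_v(V, phi_t, phi_next, cost, alpha):
    """State-value TD(0) update of Eq. (8); V maps features phi(x) to values
    and alpha is the step size."""
    V.setdefault(phi_t, 0.0)
    V.setdefault(phi_next, 0.0)
    V[phi_t] += alpha * (V[phi_next] + cost - V[phi_t])


def td_update_q(Q, phi_t, h_t, phi_next, h_next, cost, alpha):
    """Heuristic-state (SARSA-style) TD update of Eq. (9); Q maps
    (features, heuristic) pairs to values."""
    Q.setdefault((phi_t, h_t), 0.0)
    Q.setdefault((phi_next, h_next), 0.0)
    Q[(phi_t, h_t)] += alpha * (Q[(phi_next, h_next)] + cost - Q[(phi_t, h_t)])
```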
The heuristic h for bin packing will usually simply be the assignment of
a weight w_i to a particular bin j, given that there is sufficient capacity. Two
such approaches are illustrated below. In one approach the heuristic chooses in
what order the weights should be assigned to a bin, while the learned heuristic
policy decides in which bin the item should be placed. The other approach uses
the learned heuristic policy to select the weight to be assigned, while the heuristic
selects the bin to place the weight in. Both of these are so-called construction
heuristics. Each iteration step t corresponds to a weight being assigned and
so T = n. The cost incurred at each iteration t can be −1 when a new bin is
opened and 0 otherwise. Alternatively the cost can be zero at all times, with the
terminal iteration receiving the negative number of bins opened or even the
negative mean gap created.
4.1 On-line bin packing
In on-line bin packing the items have independent and identically distributed
(IID) weights sampled from the distribution U (20, 80). The heuristic simply
selects one item after the other. If the item will not fit in any open bin the
heuristic will open a new bin and place the item there. If the item fits in more
than one open bin then the learned heuristic policy is used to select the most
appropriate bin. Having decided on the heuristic one must select an appropriate
state description for the solution instance x. In this case the current total weight
Wj of a bin j under consideration seems appropriate. The post-heuristic state
would then simply be Wj + wi , where we are considering placing the next weight
wi in bin j. This state description, which is simply the content of a single bin, is
clearly not rich enough to predict how many bins will be opened in the future.
Nevertheless, one may be able to predict the final gap, i.e. (W̄ − Wj ), for the
bins. The cost function in this case would be the mean gap created. The temporal
difference (TD) learning scheme would then be as follows:
V(W_j) = V(W_j) + \alpha \bigl[ V(W_j + w_i) - V(W_j) \bigr]    (10)
and once all the items have been packed a final update is performed for every
opened bin as

V(W_j) = V(W_j) + \alpha \bigl[ (\bar{W} - W_j) - V(W_j) \bigr]    (11)
The heuristic policy is one which is greedy with respect to V, i.e. the bin chosen
for item w_i is

j = \arg\min_{j \,:\, W_j + w_i \le \bar{W}} V(W_j + w_i)    (12)
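A sketch of this on-line TD scheme, using the problem parameters stated above (n = 100, U(20, 80), W̄ = 150, α = 0.001) and a tabular value function indexed by bin content; this is an illustrative reconstruction, not the author's original implementation.

```python
import random


def learn_online_policy(episodes=10000, n=100, W_bar=150,
                        w_min=20, w_max=80, alpha=0.001):
    """Tabular TD learning of the expected final gap V(W_j) for on-line bin
    packing, following Eqs. (10)-(12); a sketch under the stated problem
    parameters, not necessarily the exact scheme used in the experiments."""
    V = [0.0] * (W_bar + 1)                 # value of a bin with current load W
    for _ in range(episodes):
        bins = []                           # current loads of the open bins
        for _ in range(n):
            w = random.randint(w_min, w_max)
            feasible = [j for j, load in enumerate(bins) if load + w <= W_bar]
            if feasible:
                # greedy policy of Eq. (12): pick the bin minimising V(W_j + w_i)
                j = min(feasible, key=lambda k: V[bins[k] + w])
                # TD update of Eq. (10) for the chosen bin
                V[bins[j]] += alpha * (V[bins[j] + w] - V[bins[j]])
                bins[j] += w
            else:
                bins.append(w)              # no bin fits: open a new bin
        # terminal update of Eq. (11): the observed final gap of every bin
        for load in bins:
            V[load] += alpha * ((W_bar - load) - V[load])
    return V
```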
The noise needed to drive the learning process is created by generating new
problem instances, i.e. new items to be packed, within each new episode of the
temporal difference learning algorithm². Figure 2 shows the value function learned
using the TD algorithm, the expected gap versus bin weight for on-line packing.
² Learning parameter α = 0.001.
Fig. 2. Expected gap as a function of bin weight W for on-line packing (IID items).
Fig. 3. Expected gap as a function of bin weight W for off-line packing (decreasing items).
In Figure 2 we can see that one should avoid leaving gaps of around size 10 to 20;
this seems reasonable since the smallest item weight is 20, making it impossible
to fill such a gap. The policy learned is, therefore, very specific to the distribution
of weights being packed, as one may expect. The learned heuristic policy is
specialized for the problem domain in question.
4.2 Off-line bin packing
The simplest off-line heuristic approach to bin-packing is FFD. We can repeat
the exercise of the last section by ordering the items to be packed in decreasing
order. The item sequence is then no longer IID and the predicted gap, shown in
Figure 3, is completely different. The regions of small expected gap are now
smaller than those in Figure 2 and as a result the performance of the learned
heuristic is better. The mean number of bins packed is now 35.01, a saving of
one bin on average.
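Under the same assumptions as the sketch in Section 4.1, this decreasing-items variant only changes how each episode's item stream is generated, for example:

```python
import random


def decreasing_items(n=100, w_min=20, w_max=80):
    """Sample one episode's items and present them in non-increasing order;
    feeding this stream to the on-line learner sketched earlier yields the
    decreasing-items (FFD-ordered) variant described here."""
    return sorted((random.randint(w_min, w_max) for _ in range(n)),
                  reverse=True)
```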
We now illustrate how the reinforcement learning framework is able to implement
the histogram-matching approach of [12] for off-line packing. In [12] a part
of the heuristic value function is found using genetic programming (GP). The
problem instance domain is extended in their study to include W̄ = 150, 75, 300
and, for W̄ = 150 for example, weight distributions U(20, 80), U(1, 150), U(30, 70),
U(1, 80). In this case the heuristic selects the bin with the smallest gap and
the learned heuristic policy is defined as follows:
w^* = \arg\min_{w} \, GP(w) \bigl( g^t(w) - o^t(w) \bigr)    (13)
where g^t(w) is the number of gaps of size w and o^t(w) the number of unpacked
items of size w. Initially there are no gaps (g^0 = 0) and o^0 is simply a histogram
of the item weights to be packed. Integer weights are assumed, so that the
maximum number of histogram bins needed is W̄ for the gaps (g) and w_max for
the items (o). New gaps are created once bins are filled and when new ones are
opened. The GP(w) function in the above formulation is part of the heuristic
value function discovered by a genetic program and
GP(w) = \frac{w_{\max} + w_{\min} + w}{\bar{W}} + 10^{-4}    (14)
was found to be very robust [12]. The same state description will now be used
to learn the decision value function using temporal difference learning. Here Q
values are used, where the decisions made are the weights assigned to a bin. In
[12] the bins are selected in such a manner that the bin with the smallest gap
is selected, i.e. best fit. However, this gap is only available as long as g^t(w_i) >
o^t(w_i). The same heuristic strategy for selecting a bin is used here. The temporal
difference (TD) formulation is as follows:
Q(s_t, w_t) = Q(s_t, w_t) + \alpha \bigl[ Q(s_{t+1}, w_{t+1}) + c_{t+1} - Q(s_t, w_t) \bigr]    (15)
where the state s_t = g^t(w_t) − o^t(w_t) and c_{t+1} is 1 if a new bin was opened and
0 otherwise. The value of a terminal state is zero as usual, i.e. Q(s_{n+1}, ·) = 0.
The weight selected then follows the policy

w_t = \arg\min_{w \,:\, o^t(w) > 0} Q(s_t, w)    (16)
The value function now tells us the expected number of bins that will be
opened given the current state s_t and decision w_t at iteration t, while following
the heuristic policy π. The number of bins used is, therefore, \sum_{i=1}^{n} c_i.
However, for more general problems the cost of a solution is not known until
the complete solution has been built. So an alternative formulation is to have
no cost during the search (c_t = 0, t = 1, . . . , n) and only at the final iteration give
a terminal cost equal to the number of bins used, i.e. c_{n+1} = m.
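A sketch of one possible implementation of this off-line scheme, Eqs. (13)–(16), under the problem parameters used above. The tie-breaking, the handling of the g^t(w_i) > o^t(w_i) condition, and the purely greedy weight selection are simplifications made here for illustration, not details taken from the paper.

```python
import random
from collections import defaultdict


def gap_histogram(bins, W_bar):
    """g^t(w): the number of open bins whose residual capacity (gap) equals w."""
    g = defaultdict(int)
    for load in bins:
        g[W_bar - load] += 1
    return g


def learn_offline_policy(episodes=10000, n=100, W_bar=150,
                         w_min=20, w_max=80, alpha=0.01):
    """SARSA-style tabular learning of Q(s_t, w_t) with the histogram state
    s_t = g^t(w) - o^t(w) of Eqs. (13)-(16); a hedged sketch of the scheme
    described in the text, not the author's exact implementation."""
    Q = defaultdict(float)
    for _ in range(episodes):
        o = defaultdict(int)                 # o^t(w): unpacked items of size w
        for w in (random.randint(w_min, w_max) for _ in range(n)):
            o[w] += 1
        bins = []                            # current loads of the open bins
        prev = None                          # previous (state, weight, cost)
        for _ in range(n):
            g = gap_histogram(bins, W_bar)
            # policy of Eq. (16): among available weights pick the one with minimal Q
            w = min((w for w in o if o[w] > 0),
                    key=lambda w: Q[(g[w] - o[w], w)])
            s = g[w] - o[w]
            # TD update of Eq. (15): credit the previous decision with its cost
            # plus the value of the current state-action pair
            if prev is not None:
                ps, pw, pc = prev
                Q[(ps, pw)] += alpha * (pc + Q[(s, w)] - Q[(ps, pw)])
            # heuristic: best fit -- place w in the feasible bin with the smallest gap
            feasible = [j for j, load in enumerate(bins) if load + w <= W_bar]
            if feasible:
                j = min(feasible, key=lambda k: W_bar - bins[k])
                bins[j] += w
                cost = 0.0
            else:
                bins.append(w)               # no bin fits: open a new bin
                cost = 1.0
            o[w] -= 1
            prev = (s, w, cost)
        # terminal update, using Q(s_{n+1}, .) = 0
        ps, pw, pc = prev
        Q[(ps, pw)] += alpha * (pc + 0.0 - Q[(ps, pw)])
    return Q
```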
Figure 4 below shows the moving average of the number of bins used as a function
of learning episodes³. The noise results from generating a new problem instance
at each episode. When the performance of this value function is compared with
the one in [12] on 100 test problems, no statistical difference in the mean number
of bins used is observed, µ_GP = 34.27 and µ_TD = 34.53. These results improve
on the FFD-ordered off-line approach above.
³ α = 0.01, slightly larger than before.
Fig. 4. Moving average of the number of bins used, 0.95 #Bins_{e−1} + 0.05 #Bins_e, as a function of learning episodes e. Each new episode generates a new problem instance, hence the noise.
5 Summary and conclusions
The challenging task of heuristic learning was put forth as a reinforcement
learning problem. The heuristics are assumed to be designed by a human designer
and correspond to the actions in the usual reinforcement learning setting. The
states represent the solution instances and the learned heuristic policies decide
when these heuristics should be used. The simpler the heuristics, the more
challenging it becomes to learn a policy. At the other extreme a single powerful
heuristic may be used, in which case no heuristic policy need be learned (only
one action is possible).
The heuristic policy is found indirectly by approximating a value function,
whose value is the expected performance of the algorithm over the problem
domain as a function of specific features of a solution instance and the applied
heuristics. It is clear that problem domain knowledge will be reflected in the
careful design of features, as seen in the histogram-matching approach to bin
packing, and in the design of the heuristics themselves. The machine learning
side is devoted to learning heuristic policies.
The exploratory noise needed to drive the reinforcement learning is introduced
indirectly by generating a completely new problem instance at each learning
episode. This is very different from the reinforcement learning approaches
commonly seen in the literature for learning heuristic search, where usually only
a single problem (benchmark) instance is considered. One immediate concern,
which needs to be addressed, is the level of noise encountered during learning
when a new problem instance is generated at each new episode. Although
the bin packing problems tackled in this paper could be solved, Figure 4 shows
that convergence may also be an issue. One possible solution to this may be to
correlate the instances generated.
References
1. R. Bai, E.K. Burke, M. Gendreau, G. Kendall, and B. McCollum. Memory length in
hyper-heuristics: An empirical study. In Computational Intelligence in Scheduling,
2007. SCIS’07. IEEE Symposium on, pages 173–178. IEEE, 2007.
2. E.K. Burke, M.R. Hyde, G. Kendall, G. Ochoa, E. Özcan, and J.R. Woodward. Exploring hyper-heuristic methodologies with genetic programming. Computational
Intelligence, pages 177–201, 2009.
3. E.K. Burke, M.R. Hyde, G. Kendall, and J. Woodward. A genetic programming
hyper-heuristic approach for evolving two dimensional strip packing heuristics.
IEEE Transactions on Evolutionary Computation (to appear), 2010.
4. E.K. Burke and G. Kendall. Search methodologies: introductory tutorials in
optimization and decision support techniques. Springer Verlag, 2005.
5. E.K. Burke, M. Hyde, G. Kendall, G. Ochoa, E. Özcan, and J.R. Woodward.
A classification of hyper-heuristic approaches. Handbook of Metaheuristics,
pages 449–468, 2010.
6. P. Crescenzi and V. Kann. A compendium of NP optimization problems.
http://www.nada.kth.se/~viggo/problemlist/compendium.html. Technical report,
accessed September 2010.
7. M. Dorigo and L. Gambardella. A study of some properties of Ant-Q. Parallel
Problem Solving from Nature – PPSN IV, pages 656–665, 1996.
8. M. Dorigo and L.M. Gambardella. Ant colony system: A cooperative learning
approach to the traveling salesman problem. IEEE Transactions on Evolutionary
Computation, 1(1):53–66, 1997.
9. S. Floyd and R.M. Karp. FFD bin packing for item sizes with uniform distributions
on [0, 1/2]. Algorithmica, 6(1):222–240, 1991.
10. D. Meignan, A. Koukam, and J.C. Créput. Coalition-based metaheuristic: a
self-adaptive metaheuristic using reinforcement learning and mimetism. Journal of
Heuristics, pages 1–21.
11. Alexander Nareyek. Choosing search heuristics by non-stationary reinforcement
learning. Applied Optimization, 86:523–544, 2003.
12. R. Poli, J. Woodward, and E.K. Burke. A histogram-matching approach to the
evolution of bin-packing strategies. Evolutionary Computation, 2007. CEC 2007.
IEEE Congress on, pages 3500–3507, Sept. 2007.
13. W.B. Powell. Approximate Dynamic Programming: Solving the curses of dimensionality. Wiley-Interscience, 2007.
14. R.S. Sutton and A.G. Barto. Reinforcement learning: An introduction. The MIT
press, 1998.
15. Wei Zhang and Thomas G. Dietterich. A Reinforcement Learning Approach
to Job-shop Scheduling. In Proceedings of the Fourteenth International Joint
Conference on Artificial Intelligence, pages 1114–1120, San Francisco, California,
1995. Morgan-Kaufmann.