Model-Selection for Non-Parametric Function
Approximation in Continuous Control Problems: A
Case Study in a Smart Energy System
Daniel Urieli and Peter Stone
Dept. of Computer Science
The University of Texas at Austin
Austin, TX, 78712 USA
{urieli,pstone}@cs.utexas.edu
Abstract. This paper investigates the application of value-function-based reinforcement learning to a smart energy control system, specifically the task of controlling an HVAC system to minimize energy while satisfying residents’ comfort
requirements. In theory, value-function-based reinforcement learning methods
can solve control problems such as this one optimally. However, since choosing an appropriate parametric representation of the value function turns out to be
difficult, we develop an alternative method, which results in a practical algorithm
for value function approximation in continuous state-spaces. To avoid the need
to carefully design a parametric representation for the value function, we use
a smooth non-parametric function approximator, specifically Locally Weighted
Linear Regression (LWR). LWR is used within Fitted Value Iteration (FVI), which
has met with several practical successes. However, for efficiency reasons, LWR is
used with a limited sample-size, which leads to poor performance without careful
tuning of LWR’s parameters. We therefore develop an efficient meta-learning procedure that performs online model-selection and tunes LWR’s parameters based
on the Bellman error. Our algorithm is fully implemented and tested in a realistic
simulation of the HVAC control domain, and results in significant energy savings.
1 Introduction
This paper is motivated by a real-world discrete-time continuous control problem in
which the state space is continuous and the action space is discrete. Specifically, we
focus on the task of controlling an HVAC system’s thermostat1 in a house with ‘heat’,
‘cool’, ‘auxiliary-heat’ or ‘off’ actions, with the goal of reducing yearly energy consumption while satisfying temperature comfort requirements for the occupants. Such
discrete-time continuous control problems commonly arise when a digital controller
controls a physical system, and when the possible control actions constitute either a finite set, or a low dimensional space that can be discretized without losing much control
capability. Other examples of this class of problems are robot control, autonomous helicopter control, and autonomous car control. When the controlled system’s dynamics
is unknown in advance, model-based Reinforcement Learning can be used to efficiently
learn the dynamics first, and then solve the control problem, possibly by computing or approximating a value function [13].

1 HVAC: Heating, Ventilation, and Air-conditioning.
When using such a method to learn fine-grained control actions, one of the most
crucial choices is how to represent the value function. One reason that this choice is
so crucial is that the cycle time between consecutive control actions is typically short,
compared to the time-range, or the horizon, over which the overall system behavior is
optimized. Therefore, a single action often has a relatively minor effect on both the state
of the system and on the immediate cost/reward, so that the overall performance is a sum
of a large number of minor contributions. Consequently, the value of a state that results
from a suboptimal action is typically close to the value of the state that results from
an optimal action. To induce the optimal policy, a value function approximator must
be able to capture these fine differences between state values. Note that taking a suboptimal action may not seem like a problem when action effects are minor. However,
since the problem’s horizon can be orders of magnitude longer than the length of an
action, the number of actions taken within the horizon is typically large, and repeatedly
choosing suboptimal actions can accumulate to large losses.
Value function approximation is an active area of research: it is often unclear how to
approximate the value function well enough so as to distinguish an optimal action from
a suboptimal action. The three most common methods to approximate a value-function
are lookup-tables, parametric methods, and non-parametric methods [16]. Lookup tables often suffer from Bellman’s curse of dimensionality at the resolution levels that are
required for continuous control problems. Parametric methods are typically computationally efficient, but assume that the value function takes some global, parametric form.
Non-parametric methods make much weaker assumptions about the value-function’s
form, and therefore can, in principle, approximate any function. However, they typically require more data and computation than parametric methods.
One way to avoid the difficulties in approximating a value function in continuous
spaces is to use direct policy-search methods, which directly optimize the parameters of some parametrized policy. Policy search methods have recently achieved several notable successes, e.g. [13,9,3], and have been gaining increased popularity for
real-world control problems, perhaps due to the difficulties in approximating the value
function in continuous spaces. However, if we could address the challenge of approximating the optimal value function well enough, we could gain some of the advantages
of value-function-based methods over direct policy search methods, for instance aiming
for a global rather than a local optimum, and requiring fewer interactions with the real world due to bootstrapping.
To address our HVAC control problem, we develop a general, practical algorithm
for approximating the value function in continuous state spaces. To avoid the need to
carefully design a parametric representation for the value function, we use a smooth
non-parametric function approximator, specifically Locally Weighted Linear Regression (LWR) (e.g. [2]). To compute the value function we use LWR within Fitted Value
Iteration (FVI), an algorithm that has proven convergence properties and often performs
well in practice [6,12]. However, being limited by a small sample size due to a run-time
efficiency requirement on the system, we must tune LWR's parameters carefully; otherwise the system performs poorly. We therefore develop an efficient meta-learning procedure that performs online model-selection and tunes LWR's parameters. The model
selection procedure is based on two main ideas, substantiated empirically through a
large number of simulations. The first idea is that minimizing the empirical L1 or L∞
Bellman error of the approximate value function is correlated with optimizing performance on our task. It is shown that the same statement is not true for the L2 Bellman
error. The second idea is that minimizing the Bellman error by tuning LWR’s parameters can be done efficiently. We note that while the Bellman error was used as a criterion
for optimization by algorithms implementing generalized policy iteration using a fixed
representation for the value function, and for tuning and generating basis functions in
linear architectures, to the best of our knowledge it has not been used as an optimization
criterion for tuning a non-parametric representation (see Sec. 6).
We apply our algorithm to the realistic control task of controlling a thermostat to
optimize energy consumption while satisfying comfort requirements in a realistically
simulated home. We build a complete Reinforcement Learning agent that uses our algorithm and show that (1) our agent outperforms the thermostat strategy that is deployed in
practice, (2) our function approximation scheme leads to better performance than when
using popular methods of value function discretization, linear function approximation
with reasonable features, and non-parametric function approximation using equivalent
computation with a much denser sample and without model selection; and (3) our online model selection leads to performance that is close to that of an empirical upper bound achieved using a state-of-the-art optimization method (CMA-ES [7]) combined
with a clairvoyant model evaluator that returns the actual future performance of a model.
The result is an adaptive value-function approximation algorithm for continuous state spaces, which uses a non-parametric representation to minimize the assumptions about
representation, and tunes it online to the specific environment in which it is deployed.
2 Preliminaries
2.1 Reinforcement Learning
In this paper we focus on solving control problems through Reinforcement Learning
(RL) [19]. Reinforcement learning problems are often modeled as Markov Decision
Processes (MDPs). An (episodic) Markov Decision Process (MDP) [18] is a tuple
(S, A, P, R, T ), where S is the set of states; A is a set of actions; P : S ×A×S → [0, 1]
is a state transition probability function where P (s, a, s′ ) denotes the probability of
transitioning to state s′ when taking action a from state s; R : S × A → R is a reward
function; and T ⊆ S is a set of terminal states, entering one of which terminates
an episode. In the context of MDPs, the goal of RL is to learn an optimal policy, when
the model (namely P and/or R) is initially unknown. A policy is a mapping π : S → A
from states to actions. A policy π induces a value for each state s ∈ S, denoted V^π(s) and defined as the expected sum of rewards obtained by the agent when starting in state s and following policy π: $V^\pi(s) = E\left[\sum_{t=0}^{N} R(s_t, a_t) \mid s_0 = s,\, s_N \in T,\, \pi\right]$. V^π : S → R is called a value function. For a given MDP, there exists an optimal policy π* such that V^{π*}(s) ≥ V^π(s) for every s.
While V^{π*}(s) is induced by the policy π*, it also induces π*. It can be shown that $\pi^*(s) = \arg\max_{a \in A} \left[ R(s,a) + \sum_{s' \in S} P(s,a,s') \cdot V^{\pi^*}(s') \right]$. Therefore, given V^{π*}(s) (and R, P), an agent can act optimally using a one-step look-ahead from any given state. This is the premise of value-function-based RL. For a finite S, there are algorithms that provably find V^{π*}(s) and therefore the optimal policy. When S is infinite, in general we can only compute an approximation V̂^{π*}(s) of the optimal value function, and the
best methodology to do so is still an open research problem. As RL does not need to
know the system dynamics (model) in advance, it is an appropriate approach to control
problems when the system dynamics are either unknown or partially known, and when
a system is controlled in an uncertain environment to which it needs to adapt.
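To make the one-step look-ahead concrete, here is a minimal Python sketch of greedy action selection from an approximate value function; the helper names (`reward_fn`, `sample_next_states`, `value_fn`) are illustrative placeholders standing in for the learned R and P and for V̂, not interfaces defined in the paper.

```python
import numpy as np

def greedy_action(state, actions, reward_fn, sample_next_states, value_fn, gamma=1.0):
    """One-step look-ahead: choose argmax_a R(s, a) + gamma * E[V_hat(s')],
    approximating the expectation over next states with Monte-Carlo samples."""
    best_action, best_q = None, -np.inf
    for a in actions:
        next_states = sample_next_states(state, a)      # sampled s' from the (learned) model P
        expected_v = np.mean([value_fn(s2) for s2 in next_states])
        q = reward_fn(state, a) + gamma * expected_v
        if q > best_q:
            best_action, best_q = a, q
    return best_action
```

With a small discrete action set this look-ahead is cheap; the difficulty, as the next subsection argues, is making V̂ accurate enough that the argmax picks the optimal action.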
2.2 The Challenge of Function Approximation
The choice of function approximator can be crucial to determining an RL algorithm’s
performance in control problems of the type we consider. We start by defining a sufficient condition for a function approximator to induce the optimal policy. Denote $E_{[s'|sa]}[V^{\pi^*}(s')] := \sum_{s' \in S} P(s,a,s') \cdot V^{\pi^*}(s')$. In a given state s, let ε_s be the smallest absolute difference between the expected values of states resulting from two different actions taken from s, where one action is optimal and the other is sub-optimal:

$$\epsilon_s := \min_{\substack{a^*, a \in A \\ a^* = \pi^*(s) \text{ is optimal} \\ a \text{ is sub-optimal}}} \left\{ \left| E_{[s'|sa^*]}[V^{\pi^*}(s')] - E_{[s'|sa]}[V^{\pi^*}(s')] \right| \right\} \qquad (1)$$
Suppose that the system is in state s0 and an action needs to be chosen.2 In general, if the function approximator is able to approximate V^{π*}(s) to within ε_{s0}/2, meaning

$$\max_{s \in S} \left| \hat{V}^{\pi^*}(s) - V^{\pi^*}(s) \right| < \frac{\epsilon_{s_0}}{2} \qquad (2)$$

then greedy action selection based on V̂^{π*}(s) is guaranteed to induce the optimal action from s0, since:3

$$E_{[s'|sa^*]}[\hat{V}^{\pi^*}(s')] - E_{[s'|sa]}[\hat{V}^{\pi^*}(s')] > E_{[s'|sa^*]}\!\left[V^{\pi^*}(s') - \tfrac{\epsilon_{s_0}}{2}\right] - E_{[s'|sa]}\!\left[V^{\pi^*}(s') + \tfrac{\epsilon_{s_0}}{2}\right] \ge \epsilon_{s_0} - \tfrac{\epsilon_{s_0}}{2} - \tfrac{\epsilon_{s_0}}{2} = 0.$$

When condition (2) holds for every state
s0 ∈ S, the function approximator induces the optimal action in every state, and therefore the optimal policy. When it does not hold in every state, the function approximator
may not induce the optimal policy.
Clearly, the smaller ε_s is, the harder it is to achieve the desired ε_s/2 function-approximation accuracy. Unfortunately, for the type of problems this paper is concerned with, namely real-world, discrete-time continuous control problems, ε_s can in fact be small for many states. This happens when the following two properties hold (for S := R^n):
– Actions have “small” effect: there exists some (relatively small) δ such that for
every s′ , s′′ ∈ S that can result from taking actions at some state s ∈ S, it holds
that ||s′ − s′′|| < δ. This can happen, for instance, when the cycle time between subsequent control actions is short compared to the problem's horizon.
– The optimal value function is Lipschitz continuous: there exists some K > 0 such that ∀ s1, s2 ∈ S : ||V(s1) − V(s2)|| < K||s1 − s2||. This holds, for instance, when the reward and the transition functions are Lipschitz continuous.

2 Note that we assume the existence of a sub-optimal action, since otherwise there is no decision of any consequence to be made, as all actions are optimal. Therefore ε_s > 0.
3 For simplicity of presentation, we neglect the 1-step reward, which could be incorporated into Equation (1) in a straightforward way.
Combining the two conditions, we get that ∀s ∈ S: $\epsilon_s < \|E[V(s')] - E[V(s'')]\| \le E[\|V(s') - V(s'')\|] < E[K\|s' - s''\|] < K\delta$, so that the smaller the action effect δ, the smaller ε_s is for any state s, and so the harder it is to achieve the desired approximation
accuracy. In the results section we demonstrate how this results in repeated suboptimal
actions, degrading performance on our thermostat control task.
3 Approximating the Value Function
RL algorithms that are based on value function approximation can roughly be divided
into model-free algorithms, which are usually more computationally efficient, and model-based algorithms, which are usually more data efficient. As we are motivated by real-world problems, where gathering experience is often an expensive operation, we focus
here on model-based RL. In model-based RL, an RL agent first explores the environment and learns an approximate model of it (namely P and R). Using this model, the
agent simulates experiences and computes V̂^{π*}(s). Since in this paper we focus on value-function approximation, we assume that an approximate model is either given, or was already learned by the agent. For instance, in the results section, our RL agent first learns an approximate model, and then uses it to compute V̂^{π*}(s).
3.1 Approximate Dynamic Programming
For computing V̂^{π*}(s), we start by using sampling-based Fitted Value Iteration (FVI) [6] (a detailed overview of its roots can be found at [12]). FVI is an approximate dynamic programming algorithm that computes V̂^{π*}(s) by repeatedly scanning a finite sample of states $S_{FVI} := \{s^{(1)}, s^{(2)}, \ldots, s^{(m)}\}$, applying the following two steps:

$$y^{(i)} := \max_a \left( R(s^{(i)}, a) + \gamma E_{[s'|s^{(i)}a]}[\hat{V}^{\pi^*}(s')] \right), \quad \forall i \in 1, \ldots, m \qquad (3)$$

$$\hat{V}^{\pi^*}(s) := SL\left( \left\{ \langle s^{(i)}, y^{(i)} \rangle \mid i \in 1, \ldots, m \right\} \right) \qquad (4)$$
where V̂^{π*}(s^{(i)}) is initialized arbitrarily; the expectation over the resulting state is approximated by Monte-Carlo sampling; and after each update scan, a supervised learning algorithm SL is used as a function approximator that approximates the value function over the complete state-space, based on the "labeled" examples ⟨s^{(i)}, y^{(i)}⟩. While FVI is not guaranteed to converge to V^{π*}(s), it often performs well in practice, and is theoretically well-behaved [12]. In addition, FVI is an off-policy algorithm, which means it
can approximate an optimal policy before ever executing it.
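As a rough illustration of the two FVI steps (Eqs. 3 and 4), the sketch below alternates Bellman backups over the sample with a refit of a generic supervised learner standing in for SL; `sample_next_states` and `fit_supervised` are assumed helpers, and the fixed number of sweeps is a simplification of a convergence test.

```python
import numpy as np

def fitted_value_iteration(S_fvi, actions, reward_fn, sample_next_states,
                           fit_supervised, gamma=1.0, n_sweeps=50, n_mc=10):
    """Sampling-based FVI: back up targets y_i over the sample S_FVI (Eq. 3),
    then refit the function approximator on the pairs <s_i, y_i> (Eq. 4)."""
    V = lambda s: 0.0                                    # arbitrary initialization of V_hat
    for _ in range(n_sweeps):
        targets = []
        for s in S_fvi:
            backups = []
            for a in actions:
                next_states = sample_next_states(s, a, n_mc)   # Monte-Carlo sample of s'
                expected_v = np.mean([V(s2) for s2 in next_states])
                backups.append(reward_fn(s, a) + gamma * expected_v)
            targets.append(max(backups))                 # y_i
        V = fit_supervised(S_fvi, targets)               # SL returns a callable V_hat over all of S
    return V
```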
3.2 Function Approximator
Inside FVI, the function approximator we use as SL is Locally Weighted Linear Regression (LWR). LWR is a non-parametric, smooth function approximator that uses
only minimal representation assumptions, and that has been used successfully to model
complex real-world dynamics [13]. Given a set of m labeled examples (x^(i), y^(i)) and
a query point x for which we want to predict a value, our version of LWR does the
following:
1. $w^{(i)} := \exp\left(-\frac{(x^{(i)} - x)^2}{2\tau}\right)$ for i = 1, ..., m  [compute a weight for each training example]
2. Fit θ that minimizes $\sum_{i=1}^{m} w^{(i)} \left( y^{(i)} - \theta^T x^{(i)} \right)^2$  [use the weights for weighted regression]
3. Output $\theta^T x$
Here τ is a “bandwidth” parameter that determines how quickly the weights decay.
Small weights are typically truncated for computational efficiency reasons. Since the
weights w(i) depend on the specific query point, LWR builds a local model around the
query for every prediction it makes. Note that when used with FVI, x^(i) := s^(i), where s^(i) ∈ S_FVI. In general, LWR results in smoother function approximation than simpler non-parametric methods such as nearest neighbors, which is a desirable property for continuous control tasks. However, in general, LWR can extrapolate, and this can prevent FVI from converging [6], so to ensure convergence we trim LWR's predicted value to be within the range defined by its neighbors' values.
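The LWR prediction just described, including the truncation of small weights to the k nearest neighbors and the trimming of the output to the neighbors' value range, might look roughly as follows; the weighting and trimming follow the description above, while the data layout and the use of NumPy's least squares are assumptions.

```python
import numpy as np

def lwr_predict(query, X, y, tau, k=15):
    """Locally weighted linear regression at `query`.
    X: (m, n) array of training inputs, y: (m,) array of targets."""
    sq_dists = np.sum((X - query) ** 2, axis=1)
    nn = np.argsort(sq_dists)[:k]                    # truncate small weights: keep k nearest neighbors
    w = np.exp(-sq_dists[nn] / (2.0 * tau))          # Gaussian weights with bandwidth tau
    Xb = np.hstack([np.ones((k, 1)), X[nn]])         # add an intercept column
    sw = np.sqrt(w)
    # weighted least squares: minimize sum_i w_i (y_i - theta^T x_i)^2
    theta, *_ = np.linalg.lstsq(Xb * sw[:, None], y[nn] * sw, rcond=None)
    pred = np.hstack([1.0, np.atleast_1d(query)]) @ theta
    # trim the prediction to the neighbors' value range so LWR cannot extrapolate
    return float(np.clip(pred, y[nn].min(), y[nn].max()))
```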
4 Efficient Model Selection
Like most learning algorithms, LWR usually needs to be tuned to work well for a particular problem. LWR is typically tuned by adjusting the values of the bandwidth parameter τ and of distance-metric-related parameters c1 , . . . , cn , where ci is a scalar that
scales si , the i’th state attribute in a state s. Adjusting the values of c1 , . . . , cn effectively changes the distance metric based on the relative importance of state attributes.
While it is possible to make ci a general function ci : S → R rather than a scalar, we
take an approach of global tuning [2], in which ci is a scalar. This keeps the number
of LWR parameters at a total of n + 1, so that tuning is more computationally efficient
and less susceptible to overfitting. A given set of parameter values is said to define a
model to be used by the LWR function approximator, and the process of tuning these
parameters is a form of model selection.4 The goal of our model-selection process is
to find a set of parameters, that when used by LWR inside FVI, results in a function
approximation that is close to the optimal value function.
When model-selection is done in a supervised learning setup, each candidate model
is typically evaluated using cross-validation. In our setup, using cross-validation by
holding out subsets of S_FVI is problematic since (1) we don't have the actual values of states s ∈ S_FVI as we have in supervised learning, but only the values that FVI converges to, and (2) as S_FVI is typically sparse (to keep the run-time of FVI acceptable), having good cross-validation accuracy on S_FVI does not necessarily imply good prediction accuracy over the rest of the state space S \ S_FVI. Therefore, we seek an alternative model evaluation measure. The ideal way of evaluating a model is by measuring the agent's performance when acting based on a value function that uses this model. However, evaluating the agent in the real world with different models is often prohibitively time-consuming and expensive. Therefore, we only use it as an empirical upper bound in our simulated domain.

4 Note that LWR's model is different from, and should not be confused with, the MDP model.
Instead, we use a theoretically-founded model evaluation measure that is efficiently
computed in practice: the value-function’s empirical max Bellman error. In a given state
s, the absolute Bellman (optimality) error of a function V̂ : S → R is defined as:
$$BE_{\hat{V}}(s) := \left| \hat{V}(s) - \max_a \left( R(s,a) + \gamma E_{[s'|sa]}[\hat{V}(s')] \right) \right|$$

Next, for a function V̂ : S → R the following holds:

$$\hat{V} \equiv V^{\pi^*} \iff \forall s \in S : BE_{\hat{V}}(s) = 0 \qquad (5)$$

Furthermore, [21] (resp. [12]) establishes that for a full (resp. sample) Bellman backup:

$$|V^{\pi^*}(s) - \hat{V}(s)| \propto BE_{\hat{V}}(s) \qquad (6)$$

Equations (5), (6) imply that ideally the Bellman error would be 0, or as close as possible to 0, for every state. Note that while the convergence of FVI means that ∀s ∈ S_FVI: BE_V̂(s) ≈ 0, the Bellman error might still be large for states s ∈ S \ S_FVI. In order to address that, we create a random sample of test states T := {t^(1), ..., t^(m′)}, and define a vector of Bellman errors (overloading notation):

$$BE_{\hat{V}}(T) := \left( BE_{\hat{V}}(t^{(1)}), \ldots, BE_{\hat{V}}(t^{(m')}) \right) \qquad (7)$$
Motivated by Equations (5), (6), we use the max Bellman error ||BE_V̂(T)||_∞ as a model-evaluation measure when tuning LWR's parameters. This model-evaluation measure is computed solely based on a value function computed by FVI, without needing more data or interactions with the environment. Our model-selection process then becomes a continuous optimization problem: finding a set of LWR parameters ψ ∈ R^{n+1} that minimizes ||BE_V̂(T)||_∞, where V̂ is the resulting value function after running FVI with LWR using ψ. Ideally, the max Bellman error should be computed over all states in the state space; however, since the state space is non-enumerable, we take a practical approach and set T to be a (as dense as computationally possible) random sample of states. In the results section we show that (a) minimizing ||BE_V̂(T)||_∞ is correlated with good actual performance, and that (b) minimizing it can be done efficiently.
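A sketch of this model-evaluation measure, the empirical max Bellman error ||BE_V̂(T)||_∞ of a candidate value function over the test sample T; the helpers mirror those in the FVI sketch above and are assumptions rather than interfaces from the paper.

```python
import numpy as np

def max_bellman_error(V, T_states, actions, reward_fn, sample_next_states,
                      gamma=1.0, n_mc=10):
    """||BE_V(T)||_inf: the largest absolute Bellman error over the test states T."""
    errors = []
    for s in T_states:
        backups = []
        for a in actions:
            next_states = sample_next_states(s, a, n_mc)
            expected_v = np.mean([V(s2) for s2 in next_states])
            backups.append(reward_fn(s, a) + gamma * expected_v)
        errors.append(abs(V(s) - max(backups)))      # |V(s) - max_a (R + gamma * E[V(s')])|
    return max(errors)
```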
Putting all of these components together, the main general contribution of the paper
beyond the domain-specific results is the MSNP algorithm (Model Selection for Non-Parametric function approximation). As summarized in Algorithm 1, it executes the following steps. The algorithm's input is a learned MDP model and an iterative continuous optimization algorithm that finds a minimum of a function F : R^{n+1} → R. Since we generally do not have the gradient of the Bellman error as a function of the LWR parameter set, we use gradient-free optimization algorithms in our experiments. An interesting future extension would be comparing them with subgradient methods. MSNP
starts by generating a sample S of states for FVI (step 1) and a sample of test states T
over which it would compute the max Bellman Error (step 2). It then initializes a vector
v of Bellman Errors (step 3) and a vector of LWR parameters ψ (step 5). In the main
loop, it repeatedly runs the following steps until the max Bellman Error converges: run FVI (step 8), compute the resulting max Bellman Error (steps 9-12), send it back to the optimization algorithm as the evaluation of the current parameter set (step 16), and get from the optimization algorithm a new set of LWR parameters to evaluate (steps 17-18).
Algorithm 1 MSNP(MDP-Model, OptimizationAlgorithm)
1: S ← {s^(1), s^(2), ..., s^(m)}, the set of points used by FittedValueIteration
2: T ← {t^(1), t^(2), ..., t^(m′)}, test points sampled within the boundaries of S
3: v ← {∞, ..., ∞} ∈ R^{m′}
4: PreviousError ← ∞
5: ψ ← OptimizationAlgorithm.initializeParameters()
6: done ← False
7: while not done do
8:   V̂ ← FittedValueIteration(MDP-Model, S, ψ) // approximate value function
9:   for i = 1 → m′ do
10:     v_i ← BE_V̂(t^(i)) // the Bellman error in t^(i)
11:   end for
12:   MaxBellmanError ← ||v||_∞
13:   if |MaxBellmanError − PreviousError| < ε then
14:     done ← True
15:   else
16:     OptimizationAlgorithm.observe(MaxBellmanError)
17:     ∆ψ ← OptimizationAlgorithm.step()
18:     ψ ← ψ + ∆ψ
19:     PreviousError ← MaxBellmanError
20:   end if
21: end while
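For illustration, MSNP's outer loop can also be driven by an off-the-shelf gradient-free optimizer. The sketch below wraps FVI-with-LWR as an objective returning the max Bellman error and hands it to SciPy's Nelder-Mead; it uses SciPy's callback-style interface rather than the observe/step interface of Algorithm 1, and `fvi_with_lwr` and `max_bellman_error` are assumed helpers in the spirit of the earlier sketches.

```python
import numpy as np
from scipy.optimize import minimize

def msnp(mdp_model, S_fvi, T_states, psi0, max_evals=40):
    """Tune psi = (c_1, ..., c_n, tau) to minimize the empirical max Bellman
    error of the value function produced by FVI with LWR."""

    def objective(psi):
        V = fvi_with_lwr(mdp_model, S_fvi, psi)           # Algorithm 1, step 8 (assumed helper)
        return max_bellman_error(V, T_states, mdp_model)  # Algorithm 1, steps 9-12 (assumed helper)

    result = minimize(objective, np.asarray(psi0, dtype=float), method="Nelder-Mead",
                      options={"maxfev": max_evals, "fatol": 1e-2, "xatol": 1e-3})
    return result.x                                       # tuned LWR parameters
```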
MSNP is an efficient algorithm for approximating the value function in continuous state spaces; it uses our model-selection procedure to tune Fitted Value Iteration with Locally Weighted Linear Regression, and has the following properties:
1. General: It uses only minimal assumptions about a value function's representation, by using a non-parametric function approximator (LWR).
2. Practical: It tunes the representation of the LWR function approximator without
the need to evaluate the agent in the environment but rather based on an internal
property of the value function, namely the Max Bellman Error. In that sense, it is
data efficient.
3. Adaptive: It tunes online the function approximator’s representation to the environment it is deployed in.
4. Effective use of computation: Its model-selection procedure achieves better performance than using the same amount of computation on a larger S_FVI sample without any model selection.
5 Results
In this section we investigate the application of MSNP to the task of controlling an
HVAC’s thermostat in a realistically simulated home.
5.1 Experimental Setup
Our experiments are run using GridLAB-D,5 an open-source smart-grid simulator that
was developed for the U.S. Dept. of Energy. It models a residential home, including
heat gains and losses and the effects of thermal mass, as a function of weather (temperature and solar radiation), occupant behavior (thermostat settings and internal heat
gains from appliances), and heating/cooling system efficiencies. It uses meteorological
data collected by the National Renewable Energy Laboratory6 in cities across the USA.
In our experiments, GridLAB-D simulates a residential home with a heat-pump based
HVAC system, which is widely used due to its high efficiency.
We assume that occupants are at home between 6pm and 7am of the next day, and
that the house is empty between 7am and 6pm (referred to as the don’t-care period). Our
goal is to (1) minimize the energy consumed by the HVAC system, while (2) keeping a
desired temperature range of 69–75°F when the occupants are at home, while being indifferent to the temperature otherwise (Figure 1). Due to uncertainty in future weather and in the house's environment, simple strategies fail to satisfy at least one of the requirements. For instance, turning the system off at 7am and turning it back on at a fixed time, such as 6pm, or even earlier, can fail to satisfy both requirements on cold winter days, since the temperature gets significantly out of range at 6pm and restoring it might take several hours, during which comfort is violated. Moreover, while doing so, an energy-expensive auxiliary heater is used (since the heat-pump becomes inefficient), and the resulting energy consumption is higher than when just keeping the temperature between 69–75°F throughout the day. We model the problem as an episodic MDP,7 as follows:
Fig. 1. Temperature Requirements Specification.
5 http://www.gridlabd.org
6 http://www.nrel.gov
– S: {⟨Tin, Tout, Time⟩ | Tin and Tout are the indoor and outdoor air temperatures (in Fahrenheit), and Time is a 24-hour clock time (in minutes)}
– A: {COOL, OFF, HEAT, AUX}. Namely, there are four possible actions for cooling, off, (heat-pump-)heating, and auxiliary heating, respectively.
– P: computed by the GridLAB-D simulator and is initially unknown to the agent.
– R: −(the energy consumed by the last action) − C_6pm. Here, C_6pm is a quadratic cost applied when missing the temperature spec at 6pm (a code sketch follows this list).
– T: {s ∈ S | s.time == 23:59pm}
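A minimal sketch of the reward described in the R item above; the quadratic penalty coefficient and the exact per-action energy accounting are assumptions, since the paper does not specify them.

```python
def reward(energy_kwh, time_minutes, indoor_temp_f, penalty_coef=1.0):
    """R = -(energy consumed by the last action) - C_6pm, where C_6pm is a
    quadratic penalty applied only if the 6pm temperature misses [69, 75] F."""
    r = -energy_kwh
    if time_minutes == 18 * 60:                      # the 6pm comfort check
        if indoor_temp_f < 69.0:
            r -= penalty_coef * (69.0 - indoor_temp_f) ** 2
        elif indoor_temp_f > 75.0:
            r -= penalty_coef * (indoor_temp_f - 75.0) ** 2
    return r
```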
For testing the MSNP algorithm, we build a full RL agent that controls the thermostat. Since our focus is on value function approximation, we leave the problem of
sample-efficient exploration and model-learning outside the scope of this work. Instead, to cover diverse weather conditions, the agent explores during don't-care periods of one simulated year, where the OFF action is chosen with probability (1 − (currentTime − 7am)/(6pm − 7am)). Otherwise, cooling or heating is chosen, depending on whether the indoor temperature is high or low, respectively. If heating is chosen, then HEAT or AUX are chosen with probabilities 0.9 and 0.1, respectively. While exploring, the agent collects tuples of the form ⟨s, a, r, s′⟩, which are the current state, action, reward, and next state. The agent uses these tuples as labeled examples ⟨s, a⟩ → r and ⟨s, a⟩ → s′ for fitting the functions R and P with linear regression using state features and their squares, where s is represented as ⟨1, Tin, Tout, Time, Tin², Tout², Time²⟩. This representation was chosen based on a small amount of trial and error. Adding the squares was intuitively aimed at addressing non-linearity in the transition, to some extent. Using P and R, the agent runs MSNP and acts greedily based on V̂^{π*} for an additional year.8 Inside MSNP, FVI uses a state sample S_FVI arranged as a grid inside the three-dimensional state space, of size 20×10×20 = 4,000. For running LWR inside FVI we use the 15 nearest neighbors, and the grid structure allows us to find them in constant time.
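The model-learning step just described, sketched in Python as one linear least-squares fit per action over the quadratic feature vector ⟨1, Tin, Tout, Time, Tin², Tout², Time²⟩; the per-action split and the use of NumPy's least squares are implementation assumptions, as the paper only specifies the feature representation.

```python
import numpy as np

def features(state):
    """state = (Tin, Tout, Time); quadratic features used to fit R and P."""
    t_in, t_out, time = state
    return np.array([1.0, t_in, t_out, time, t_in ** 2, t_out ** 2, time ** 2])

def fit_model(transitions):
    """transitions: list of (s, a, r, s_next) tuples gathered while exploring.
    Returns, for each action, linear coefficients for the reward and next state."""
    models = {}
    for a in set(a for _, a, _, _ in transitions):
        data = [t for t in transitions if t[1] == a]
        X = np.array([features(s) for s, _, _, _ in data])
        r = np.array([rew for _, _, rew, _ in data])
        S_next = np.array([s_next for _, _, _, s_next in data])
        theta_r, *_ = np.linalg.lstsq(X, r, rcond=None)       # reward model for action a
        theta_p, *_ = np.linalg.lstsq(X, S_next, rcond=None)  # (deterministic) next-state model for action a
        models[a] = (theta_r, theta_p)
    return models
```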
5.2 Sensitivity to Errors in Function Approximation
Figure 2 demonstrates the difficulty of value function approximation in continuous domains with short actions, using the thermostat control task. The x-axis is the 24-hour
time of day and the y-axis is the indoor temperature controlled by the actions of the
agent, who acts greedily based on an approximate value function. The agent turns the
system off during the don’t-care period, letting the temperature rise, and eventually
cools in advance to return the temperature back to range by 6pm. However, before starting to cool, there are several heating actions that are physically wrong, chosen due to
small approximation errors in the value function, in this case due to using LWR without
tuning its parameters. Each suboptimal 2-minute action increases the daily consumption
by only about 0.1%, but repeatedly taking them can increase consumption by 10% and
more. These suboptimalities, and more severe ones, happened when using discretized representations as well as linear value-function representations with reasonable features.

7 An action is taken every 2 minutes, as the simulator models a realistic lockout of the system.
8 In practice, GridLAB-D only has one year of "average" weather data. We therefore used 9 of every 10 days during the training year, and the remaining days during the testing year, so as to have separate training and testing data. Our reported results reflect the average of repeating this experiment 10 times with each different possible subset of "held-out" days.
Fig. 2. Suboptimal policy due to func. approximation errors.
5.3 Using the Max Bellman Error for Model-Selection
Next, we investigate using the Bellman error as a criterion for model-selection of our
LWR function approximator. The plots in Figure 3 show the correlation between the empirical Bellman error in the approximate value function V̂^{π*} and the agent's performance when acting based on V̂^{π*}. The plots summarize 10,000 experiments, each represented
as a point. Each experiment tests a set of n + 1 LWR parameters, which defines a model
used by LWR, as was discussed in Section 4. In the thermostat domain there are three
state attributes, and therefore four model parameters. We sweep the parameter space by
setting each parameter to one of 10 possible values, and this gives 10 × 10 × 10 × 10 =
10,000 possible parameter sets. The empirical Bellman errors were measured as the L1 ,
∗
L2 or L∞ norms of the vector of Bellman errors in V̂ π over a uniformly random sample of |T | = 256, 000 states. It can be seen that when the L1 and L∞ errors are smallest,
performance is expected to be close to the best possible (lowest energy consumption).
The same does not hold for the L2 error, as minimizing the L2 error results in consuming 4% more energy than the best result. Note that in general these plots clearly
highlight the need for model-parameters tuning, as untuned parameters can consume
about 25% more energy than the best possible parameters.
5.4 Efficiently Optimizing the Bellman Error
The previous section tested the first of two steps for creating an efficient model-selection
algorithm. We saw that model-selection, or representation tuning, of the LWR function
approximator can be done without the need to evaluate an agent in the environment, but
Fig. 3. Bellman errors (x-axes) vs. actual performance (y-axes, lower is better). Top row: full
plots. Bottom row: zoom into the bottom-left corner (best performance) of each top row plot.
rather based on an internal property of the value function, namely the L∞ or L1 Bellman
errors, so in that sense it is data efficient. Next, we show that tuning the model parameters based on the max Bellman error can be done computationally efficiently. Note
that once using the Bellman error for model evaluation, our model-selection problem
becomes minimizing an objective function that maps an LWR parameter set to the max
Bellman error in V̂^{π*} computed using this parameter set by FVI with LWR. We compare several efficient local-search, derivative-free algorithms for finding the minimum of
a continuous function.9 The algorithms we compare are Powell’s method, Nelder–Mead
method, also known as Amoeba, and a coordinate-descent algorithm in which we hold all other parameters fixed and optimize one parameter at a time using Brent's method, using implementations from [17]. Results are shown in Figure 4. The horizontal gray line in the figure is the best max Bellman error that was achieved when using an offline, state-of-the-art, parallel optimization algorithm, CMA-ES [7], when running it with 10,000 function evaluations (100 generations with a population size of 100). It can be seen that after about 30 function evaluations all three methods get close to CMA-ES's value, and that the Brent's-method-based coordinate descent gets there after about 15 function evaluations. Our FVI implementation converges in less than 2 minutes on a standard desktop machine, so that 15–30 function evaluations take 30–60 minutes.
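For reference, a rough sketch of the best-performing variant, coordinate descent with Brent's method: each pass holds all other parameters fixed and minimizes the max-Bellman-error objective along one coordinate. The paper uses the routines from [17]; this sketch substitutes SciPy's bounded Brent minimizer, and the bounds and pass count are assumptions.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def coordinate_descent_brent(objective, psi0, bounds, n_passes=2):
    """Minimize objective(psi), e.g. the max Bellman error of FVI+LWR as a
    function of the LWR parameter vector psi, one coordinate at a time."""
    psi = np.array(psi0, dtype=float)
    for _ in range(n_passes):
        for i in range(len(psi)):
            def along_coordinate(v, i=i):
                trial = psi.copy()
                trial[i] = v
                return objective(trial)
            res = minimize_scalar(along_coordinate, bounds=bounds[i], method="bounded")
            psi[i] = res.x                           # keep the best value found for coordinate i
    return psi
```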
How robust is running local optimization for finding a global minimum of the max
Bellman Error? To try to answer this question, we fixed the parameter values at c1 =
c2 = c3 = 0.5, τ = 0.0005, and then changed one parameter at a time across its range
(ci ∈ [0.05, 1], τ ∈ [0.00005, 0.0010]) measuring the max Bellman error as a function
of this parameter, where as usual, the max Bellman error was computed over the value
function computed using FVI with LWR. Results are in Figure 5, and show that while
the max Bellman error is not a convex function of representation parameters, it still has
a relatively large basin of convergence.
9 In general we do not have derivative information for the optimized function.
Fig. 4. Comparing different optimization algorithms on the task of finding a parameter set that
minimizes the max Bellman error.
Fig. 5. Bellman Error basin of convergence: Bellman Error (y-axis) as a function of each LWR parameter, holding all the other parameters fixed. The x-axes, from left to right: τ, c_Tin, c_Tout, c_Time.
5.5 Performance of MSNP
Finally, Table 1 demonstrates the advantage of using MSNP, by testing it on our thermostat control problem. In these experiments, we use our RL agent, changing only the
way the value function is approximated. The table shows the energy consumed by the
HVAC system over one simulated year, in which the agent acts greedily based on each
of the different approximate value functions. Simulations were run using real weather
files from three different cities in the US. As a reference, the “Default” column shows
the results of using a default heat-pump thermostat strategy that is used in real-world
deployments, which just keeps the temperature between 69–75°F throughout the day.
This strategy does not shut down the system during the don’t-care period, since doing so without knowing how long in advance to turn the system back on can result
in violating either or both requirements (1) and (2), as discussed above. The "LargeSample" column was generated when using FVI with LWR to approximate the value function, but instead of performing model selection as MSNP does, the agent spends the same amount of computation on just running FVI with LWR on a larger state sample S_FVI of 160×80×160 = 2,048,000 states without any model selection, using default values for the LWR parameters similar to the values used in Section 5.4: ci = 0.5 for i = 1, ..., n and τ = 0.0005.10 MSNP was run using a state sample S_FVI of 20×10×20 states and used |T| = 256,000 states for computing the Bellman error, so that its computation time, dominated by the number of LWR predictions, was no larger than that of LargeSample's. The "CMA-ES" column serves as an empirical upper bound on performance in our simulated domain. It was generated by running the state-of-the-art CMA-ES optimization method to perform model selection on top of FVI with LWR, using (1) 10,000 model evaluations (100 generations, each with a population size of 100), and (2) a clairvoyant model evaluator that returns the agent's actual future performance using a given model, by running a one-year simulation using this model. It can be seen that MSNP performs better than "LargeSample", which demonstrates that online model selection has an advantage over just increasing the density of the sample. MSNP's performance is close to CMA-ES's, despite the fact that it uses only 40 function evaluations (instead of 10,000) and does not have access to the "real" model-evaluation measure of unknown future performance that CMA-ES has. Note that while the MSNP and CMA-ES agents satisfied the temperature comfort requirements, the LargeSample agent frequently did not satisfy them. In Figure 6 we demonstrate how the RL agent controls the temperature on mild and extreme winter/summer days.

10 The "LargeSample" results are actually slightly better than they should be because they did not use the 9-of-every-10-days methodology referenced in footnote 8; it was trained on the full year of data.
Table 1. Performance of an agent using MSNP
City            Default (kWh)   LargeSample (kWh)   MSNP (kWh)   CMA-ES (kWh)   % Energy-Savings
New York City   11084.8         10923.5             9859.3       9816.3         11.0%
Boston          12277.1         12480.7             11433.6      11052.8        6.9%
Chicago         15172.5         14778.2             14186        13778.4        6.5%
Fig. 6. Our agent controlling the temperature in a house in the New York City area, on mild and hot summer days (top-left and top-right, respectively), and on mild and extreme winter days (bottom-left and bottom-right, respectively).
6 Related Work
RL has been applied to realistic control tasks; however, recent successes frequently used policy-search methods rather than value-function-based methods (e.g. [13,9,3]). Value
function based RL has had success on some robotic tasks [15], but there the assumption
was that the value function can be represented as a predetermined set of basis functions,
an assumption that does not necessarily hold in the general case. Non-parametric value
function approximation methods have been suggested, e.g. in [4]. The idea of using the
Bellman error as a criterion for optimization has been used by algorithms implementing generalized policy iteration, e.g. in [10,1]. The Bellman error has also been used for
tuning and generating basis functions in linear function approximation architectures [11,8,14], while here we use it to tune a non-parametric representation. The model selection proposed here is different from the model selection done by [13]. There, the setup was offline supervised learning for learning the transition function, while ours is an online reinforcement learning setup for approximating the value function, where there are no labels over the data, but only the values to which FVI converges, which could be different from the real state values. A paper that is closely related to ours is [5],
which designs an abstract model-selection algorithm and proves theoretical guarantees
about it. Similarly to here, they consider batch RL, in which a data set D of sampled
transitions from the MDP is given, and is used for selecting a candidate value function
by minimizing a Bellman error. In their case they abstract the way value function candidates are generated and assume they are independent of D, while here we actually
use D to approximate the model and generate candidates using MSNP. Their theoretical
guarantees are proved under a slightly different setup, and it would be interesting to
explore whether they can be extended to our setup. The problem of thermostat control
was addressed in [20], but there the focus was on solving the complete RL problem,
including exploration, model learning and planning, and no value function approximation was used, while here the focus is on investigating the application of value-function
based RL to the continuous, realistic domain of HVAC thermostat control.
7 Conclusion
This paper presents the application of value-function-based RL to the real-world smart-energy task of controlling an HVAC thermostat to minimize energy consumption
while satisfying temperature comfort requirements, along with detailed empirical results and analysis. In addition, the paper introduces MSNP, which is a general, practical
algorithm for approximating the value function for continuous control problems, using
an efficient model-selection procedure based on the Bellman error.
This paper opens up several interesting directions for future work. For example, it is
worth investigating the Bellman error's basin of convergence as a function of the model parameters. Another interesting direction is exploring the use of subgradient methods for minimizing the Bellman error, and comparing them with the gradient-free methods we used. Finally, an important future direction is to expand MSNP's empirical analysis by including more domains and competing methods, and to evaluate it in higher-dimensional state spaces.
References
1. Antos, A., Szepesvári, C., Munos, R.: Learning near-optimal policies with bellman-residual
minimization based fitted policy iteration and a single sample path. Mach. Learn. 71(1), 89–
129 (Apr 2008)
2. Atkeson, C.G., Moore, A.W., Schaal, S.: Locally weighted learning (1997)
3. Deisenroth, M.P., Rasmussen, C.E.: PILCO: A Model-Based and Data-Efficient Approach
to Policy Search. In: Getoor, L., Scheffer, T. (eds.) Proceedings of the 28th International
Conference on Machine Learning. Bellevue, WA, USA (June 2011)
4. Engel, Y., Mannor, S., Meir, R.: Reinforcement learning with gaussian processes. In: In Proc.
of the 22nd International Conference on Machine Learning. pp. 201–208. ACM Press (2005)
5. Farahmand, A.M., Szepesvári, C.: Model selection in reinforcement learning. Mach. Learn.
85(3), 299–332 (Dec 2011)
6. Gordon, G.J.: Stable function approximation in dynamic programming. In: Machine Learning: Proceedings of the Twelfth International Conference. Morgan Kaufmann (1995)
7. Hansen, N.: The CMA Evolution Strategy: A Tutorial (January 2009)
8. Keller, P.W., Mannor, S., Precup, D.: Automatic basis function construction for approximate
dynamic programming and reinforcement learning. In: Proceedings of the 23rd international
conference on Machine learning. pp. 449–456. ICML ’06, ACM, New York, NY, USA (2006)
9. Kohl, N., Stone, P.: Machine learning for fast quadrupedal locomotion. In: The Nineteenth
National Conference on Artificial Intelligence. pp. 611–616 (July 2004)
10. Lagoudakis, M.G., Parr, R.: Least-squares policy iteration. J. Mach. Learn. Res. 4, 1107–
1149 (Dec 2003)
11. Menache, I., Mannor, S., Shimkin, N.: Basis function adaptation in temporal difference reinforcement learning. Annals of Operations Research 134, 215–238 (2005)
12. Munos, R., Szepesvári, C.: Finite time bounds for sampling-based fitted value iteration. In: ICML. pp. 881–886 (2005)
13. Ng, A.Y., Kim, H.J., Jordan, M.I., Sastry, S.: Autonomous helicopter flight via reinforcement learning. In: Thrun, S., Saul, L., Schölkopf, B. (eds.) Advances in Neural Information
Processing Systems 16. MIT Press, Cambridge, MA (2004)
14. Parr, R., Painter-Wakefield, C., Li, L., Littman, M.: Analyzing feature generation for value-function approximation. In: Proceedings of the 24th international conference on Machine
learning. pp. 737–744. ICML ’07, ACM, New York, NY, USA (2007)
15. Peters, J., Schaal, S.: Natural actor-critic. Neurocomputing 71(7–9), 1180–1190 (2008)
16. Powell, W.B.: Approximate Dynamic Programming: Solving the Curses of Dimensionality,
2nd Edition. Wiley (2011)
17. Press, W.H., Teukolsky, S.A., Vetterling, W.T., Flannery, B.P.: Numerical Recipes 3rd Edition: The Art of Scientific Computing. Cambridge University Press, New York, NY, USA, 3
edn. (2007)
18. Puterman, M.L.: Markov Decision Processes: Discrete Stochastic Dynamic Programming.
John Wiley & Sons, Inc., New York, NY, USA, 1st edn. (1994)
19. Sutton, R.S., Barto, A.G.: Reinforcement Learning: An Introduction. MIT Press, Cambridge,
MA (1998)
20. Urieli, D., Stone, P.: A learning agent for heat-pump thermostat control. In: Proceedings of
the 12th International Conference on Autonomous Agents and Multiagent Systems (AAMAS) (May 2013)
21. Williams, R.J., Baird, III, L.C.: Tight performance bounds on greedy policies based on imperfect value functions. In: Proceedings of the Tenth Yale Workshop on Adaptive and Learning
Systems (1994), available at http://leemon.com/papers/1994wb.pdf