
Bayesian Query Construction for Neural Network Models
Gerhard Paass
Jörg Kindermann
German National Research Center for Computer Science (GMD)
D-53757 Sankt Augustin, Germany
[email protected]
[email protected]
Abstract
If data collection is costly, there is much to be gained by actively selecting particularly informative data points in a sequential way. In a Bayesian decision-theoretic framework we develop a query selection criterion which explicitly takes into account the intended use of the model predictions. By Markov Chain Monte Carlo methods the necessary quantities can be approximated to a desired precision. As the number of data points grows, the model complexity is modified by a Bayesian model selection strategy. The properties of two versions of the criterion are demonstrated in numerical experiments.
1 INTRODUCTION
In this paper we consider the situation where data collection is costly, as when, for example, real measurements or technical experiments have to be performed. In this situation the approach of query learning ('active data selection', 'sequential experimental design', etc.) has a potential benefit. Depending on the previously seen examples, a new input value ('query') is selected in a systematic way and the corresponding output is obtained. The motivation for query learning is that random examples often contain redundant information, so concentrating on non-redundant examples should improve generalization performance.
We use a Bayesian decision-theoretic framework to derive a criterion for query construction. The criterion reflects the intended use of the predictions by an appropriate
loss function. We limit our analysis to the selection of the next data point, given a
set of data already sampled. The proposed procedure derives the expected loss for
candidate inputs and selects a query with minimal expected loss.
There are several published surveys of query construction methods [Ford et al. 89, Plutowski White 93, Sollich 94]. Most current approaches, e.g. [Cohn 94], rely on the information matrix of the parameters. Then, however, all parameters receive equal attention regardless of their influence on the intended use of the model [Pronzato Walter 92]. In addition, the estimates are valid only asymptotically. Bayesian approaches have been advocated by [Berger 80] and applied to neural networks [MacKay 92]; their relation to maximum information gain is discussed in [Sollich Saad 95]. In this paper we show that by using Markov Chain Monte Carlo methods it is possible to determine all quantities necessary for the selection of a query. This approach is valid in small sample situations, and the procedure's precision can be increased with additional computational effort. With the square loss function, the criterion reduces to a variant of the familiar integrated mean square error [Plutowski White 93].
In the next section we develop the query selection criterion from a decision-theoretic
point of view. In the third section we show how the criterion can be calculated using
Markov Chain Monte Carlo methods and we discuss a strategy for model selection.
In the last section, the results of two experiments with MLPs are described.
2 A DECISION-THEORETIC FRAMEWORK
Assume we have an input vector $x$ and a scalar output $y$ distributed as $y \sim p(y \mid x, w)$, where $w$ is a vector of parameters. The conditional expected value is a deterministic function $f(x, w) := E(y \mid x, w)$, so that $y = f(x, w) + \epsilon$ where $\epsilon$ is a zero-mean error term. Suppose we have iteratively collected observations $D_{(n)} := ((x_1, y_1), \ldots, (x_n, y_n))$. With prior distribution $p(w)$ we get the Bayesian posterior $p(w \mid D_{(n)}) = p(D_{(n)} \mid w)\, p(w) / \int p(D_{(n)} \mid w)\, p(w)\, dw$ and the predictive distribution $p(y \mid x, D_{(n)}) = \int p(y \mid x, w)\, p(w \mid D_{(n)})\, dw$.
We consider the situation where, based on some data $x$, we have to perform an action $a$ whose result depends on the unknown output $y$. Some decisions may have more severe effects than others. The loss function $L(y, a) \in [0, \infty)$ measures the loss if $y$ is the true value and we have taken the action $a \in A$. In this paper we consider real-valued actions, e.g. setting the temperature $a$ in a chemical process. We have to select an $a \in A$ knowing only the input $x$. According to the Bayes Principle [Berger 80, p.14] we should follow a decision rule $d: x \to a$ such that the average risk $\int R(w, d)\, p(w \mid D_{(n)})\, dw$ is minimal, where the risk is defined as $R(w, d) := \int L(y, d(x))\, p(y \mid x, w)\, p(x)\, dy\, dx$. Here $p(x)$ is the distribution of future inputs, which is assumed to be known.
For the square loss function $L(y, a) = (y - a)^2$, the conditional expectation $d(x) := E(y \mid x, D_{(n)})$ is the optimal decision rule. In a control problem the loss may be larger at specific critical points. This can be addressed with a weighted square loss function $L(y, a) := h(y)\,(y - a)^2$, where $h(y) \ge 0$ [Berger 80]. The expected loss for an action is $\int (y - a)^2\, h(y)\, p(y \mid x, D_{(n)})\, dy$. Replacing the predictive density $p(y \mid x, D_{(n)})$ with the weighted predictive density $\tilde p(y \mid x, D_{(n)}) := h(y)\, p(y \mid x, D_{(n)}) / G(x)$, where $G(x) := \int h(y)\, p(y \mid x, D_{(n)})\, dy$, we get the optimal decision rule $d(x) := \int y\, \tilde p(y \mid x, D_{(n)})\, dy$ and the average loss $G(x) \int (y - d(x))^2\, \tilde p(y \mid x, D_{(n)})\, dy$ for a given input $x$. With these modifications, all later derivations for the square loss function may be applied to the weighted square loss.
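To make the weighted-loss modification concrete, here is a minimal Monte Carlo sketch (our illustration, not code from the paper): given samples from the predictive distribution $p(y \mid x, D_{(n)})$ and a user-supplied weight function `h`, it returns the optimal action $d(x)$ and its average loss.

```python
# A minimal sketch (an assumption, not the authors' code): Monte Carlo
# estimates of the optimal action d(x) and its average loss under the weighted
# square loss L(y, a) = h(y) (y - a)^2. `y_samples` are assumed to be draws
# from the predictive p(y | x, D(n)); `h` is the user-supplied weight function.
import numpy as np

def weighted_action_and_loss(y_samples, h):
    weights = h(y_samples)                              # h(y) >= 0
    d = np.sum(weights * y_samples) / np.sum(weights)   # weighted predictive mean
    avg_loss = np.mean(weights * (y_samples - d) ** 2)  # = G(x) * weighted variance
    return d, avg_loss
```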
The aim of query sampling is the selection of a new observation $\tilde x$ in such a way that the average risk will be maximally reduced. Together with its still unknown $y$-value, $\tilde x$ defines a new observation $(\tilde x, \tilde y)$ and new data $D_{(n)} \cup (\tilde x, \tilde y)$. To determine this risk for a candidate query $\tilde x$ we have to perform the following conceptual steps:

1. Future data: Construct the possible sets of 'future' observations $D_{(n)} \cup (\tilde x, \tilde y)$, where $\tilde y \sim p(\tilde y \mid \tilde x, D_{(n)})$.

2. Future posterior: Determine a 'future' posterior distribution of parameters $p(w \mid D_{(n)} \cup (\tilde x, \tilde y))$ that depends on $\tilde y$ in the same way as though it had actually been observed.

3. Future loss: Assuming $d^*_{\tilde y, \tilde x}(x)$ is the optimal decision rule for given values of $\tilde x$, $\tilde y$, and $x$, compute the resulting loss as

$$r^*_{\tilde y, \tilde x}(x) := \int L(y, d^*_{\tilde y, \tilde x}(x))\, p(y \mid x, w)\, p(w \mid D_{(n)} \cup (\tilde x, \tilde y))\, dy\, dw \qquad (1)$$

4. Averaging: Integrate this quantity over the future trial inputs $x$ distributed as $p(x)$ and the different possible future outputs $\tilde y$, yielding

$$\bar r^*_{\tilde x} := \int r^*_{\tilde y, \tilde x}(x)\, p(x)\, p(\tilde y \mid \tilde x, D_{(n)})\, dx\, d\tilde y.$$
This procedure is repeated until an $\tilde x$ with minimal average risk is found. Since local optima are typical, a global optimization method is required. Subsequently we try to determine whether the current model is still adequate or whether we have to increase its complexity (e.g. by adding more hidden units).
3 COMPUTATIONAL PROCEDURE
Let us assume that the real data $D_{(n)}$ was generated according to a regression model $y = f(x, w) + \epsilon$ with i.i.d. Gaussian noise $\epsilon \sim N(0, \sigma^2(w))$. For example, $f(x, w)$ may be a multilayer perceptron or a radial basis function network. Since the error terms are independent, the posterior density is $p(w \mid D_{(n)}) \propto p(w) \prod_{i=1}^{n} p(y_i \mid x_i, w)$ even in the case of query sampling [Ford et al. 89].

As the analytic derivation of the posterior is infeasible except in trivial cases, we have to use approximations. One approach is to employ a normal approximation [MacKay 92], but this is unreliable if the number of observations is small compared to the number of parameters. We use Markov Chain Monte Carlo procedures [Paaß 91, Neal 93] to generate a sample $W_{(B)} := \{w_1, \ldots, w_B\}$ of parameters distributed according to $p(w \mid D_{(n)})$. If the number of sampling steps approaches infinity, the distribution of the simulated $w_b$ approximates the posterior arbitrarily well.
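The paper does not spell out the sampler beyond counting Metropolis steps; the following random-walk Metropolis sketch is one simple possibility, under the simplifying assumption of a fixed noise variance (the model above allows $\sigma^2(w)$ to depend on $w$). The names `f`, `log_prior`, `X`, and `Y` are placeholders for caller-supplied objects.

```python
# A minimal random-walk Metropolis sketch for drawing W(B) from p(w | D(n)).
# Simplifying assumption: fixed noise variance sigma2. `f(x, w)` is the network
# function, `log_prior(w)` the log prior density, `X`/`Y` the observed data.
import numpy as np

def log_posterior(w, X, Y, f, sigma2, log_prior):
    resid = Y - np.array([f(x, w) for x in X])
    return log_prior(w) - 0.5 * np.sum(resid ** 2) / sigma2

def metropolis_sample(w0, X, Y, f, sigma2, log_prior,
                      n_steps=200_000, step=0.05, thin=4_000, seed=0):
    rng = np.random.default_rng(seed)
    w, lp = w0.copy(), log_posterior(w0, X, Y, f, sigma2, log_prior)
    samples = []
    for t in range(n_steps):
        w_prop = w + step * rng.standard_normal(w.shape)   # symmetric proposal
        lp_prop = log_posterior(w_prop, X, Y, f, sigma2, log_prior)
        if np.log(rng.random()) < lp_prop - lp:            # Metropolis accept step
            w, lp = w_prop, lp_prop
        if (t + 1) % thin == 0:
            samples.append(w.copy())                       # keep thinned draws
    return np.array(samples)                               # approximates W(B)
```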
To take into account the range of future $\tilde y$-values, we create a set of them by simulation. For each $w_b \in W_{(B)}$ a number of $\tilde y \sim p(\tilde y \mid \tilde x, w_b)$ is generated. Let $Y_{(\tilde x, R)} := \{\tilde y_1, \ldots, \tilde y_R\}$ be the resulting set. Instead of performing a new Markov Chain Monte Carlo run to generate a new sample according to $p(w \mid D_{(n)} \cup (\tilde x, \tilde y))$, we use the old set $W_{(B)}$ of parameters and reweight them (importance sampling). In this way we may approximate integrals of some function $g(w)$ with respect to $p(w \mid D_{(n)} \cup (\tilde x, \tilde y))$ [Kalos Whitlock 86, p.92]:

$$\int g(w)\, p(w \mid D_{(n)} \cup (\tilde x, \tilde y))\, dw \approx \frac{\sum_{b=1}^{B} g(w_b)\, p(\tilde y \mid \tilde x, w_b)}{\sum_{b=1}^{B} p(\tilde y \mid \tilde x, w_b)} \qquad (2)$$

The approximation error approaches zero as the size of $W_{(B)}$ increases.
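For the Gaussian regression model above, the reweighting (2) might be implemented as follows. This is a sketch under our own naming conventions; `f` and `sigma2` are assumed caller-supplied functions returning $f(x, w)$ and $\sigma^2(w)$.

```python
# Sketch of the importance-sampling estimate (2): integrals under the 'future'
# posterior p(w | D(n) ∪ (x~, y~)) are approximated by reweighting the existing
# sample W(B) with the likelihood of the hypothetical observation (x~, y~).
import numpy as np

def candidate_likelihood(y_tilde, x_tilde, W, f, sigma2):
    """p(y~ | x~, w_b) for every w_b, under the Gaussian noise model."""
    mu = np.array([f(x_tilde, w) for w in W])   # predicted means f(x~, w_b)
    s2 = np.array([sigma2(w) for w in W])       # noise variances sigma^2(w_b)
    return np.exp(-0.5 * (y_tilde - mu) ** 2 / s2) / np.sqrt(2 * np.pi * s2)

def reweighted_integral(g_values, lik):
    """Right-hand side of (2); g_values[b] = g(w_b), lik[b] = p(y~ | x~, w_b)."""
    return np.sum(g_values * lik) / np.sum(lik)
```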
3.1 APPROXIMATION OF FUTURE LOSS
Consider the future loss $r^*_{\tilde y, \tilde x}(x_t)$ given the new observation $(\tilde x, \tilde y)$ and a trial input $x_t$. In the case of the square loss function, (1) can be transformed to

$$r^*_{\tilde y, \tilde x}(x_t) = \int [f(x_t, w) - E(y \mid x_t, D_{(n)} \cup (\tilde x, \tilde y))]^2\, p(w \mid D_{(n)} \cup (\tilde x, \tilde y))\, dw + \int \sigma^2(w)\, p(w \mid D_{(n)} \cup (\tilde x, \tilde y))\, dw \qquad (3)$$

where $\sigma^2(w) := \mathrm{Var}(y \mid x, w)$ is independent of $x$. Assume a set $X_T = \{x_1, \ldots, x_T\}$ is given, which is representative of trial inputs for the distribution $p(x)$. Define $S(\tilde x, \tilde y) := \sum_{b=1}^{B} p(\tilde y \mid \tilde x, w_b)$ for $\tilde y \in Y_{(\tilde x, R)}$. Then from equations (2) and (3) we get $E(y \mid x_t, D_{(n)} \cup (\tilde x, \tilde y)) \approx \frac{1}{S(\tilde x, \tilde y)} \sum_{b=1}^{B} f(x_t, w_b)\, p(\tilde y \mid \tilde x, w_b)$ and

$$r^*_{\tilde y, \tilde x}(x_t) \approx \frac{1}{S(\tilde x, \tilde y)} \sum_{b=1}^{B} \sigma^2(w_b)\, p(\tilde y \mid \tilde x, w_b) + \frac{1}{S(\tilde x, \tilde y)} \sum_{b=1}^{B} [f(x_t, w_b) - E(y \mid x_t, D_{(n)} \cup (\tilde x, \tilde y))]^2\, p(\tilde y \mid \tilde x, w_b) \qquad (4)$$
The final value of $\bar r^*_{\tilde x}$ is obtained by averaging over the different $\tilde y \in Y_{(\tilde x, R)}$ and the different trial inputs $x_t \in X_T$. To reduce the variance, the trial inputs $x_t$ should be selected by importance sampling (2) to concentrate them on regions with high current loss (see (5) below). To facilitate the search for an $\tilde x$ with minimal $\bar r^*_{\tilde x}$ we reduce the extent of random fluctuations of the $\tilde y$ values. Let $(v_1, \ldots, v_R)$ be a vector of random numbers $v_r \sim N(0, 1)$, and let $j_r$ be randomly selected from $\{1, \ldots, B\}$. Then for each $\tilde x$ the possible observations $\tilde y_r \in Y_{(\tilde x, R)}$ are defined as $\tilde y_r := f(\tilde x, w_{j_r}) + v_r\, \sigma(w_{j_r})$. In this way the difference between neighboring inputs is not affected by noise, and search procedures can exploit gradients.
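Combining (4) with this common-random-number construction, one possible implementation of the future-loss estimate for a single candidate query is sketched below; the fixed seed ensures that the same $v_r$ and $j_r$ are reused for every candidate $\tilde x$, which is exactly what makes neighboring candidates comparable. `W`, `f`, `sigma2`, and the trial inputs `x_trials` are assumed to be supplied.

```python
# Sketch of the future-loss estimate (4) for one candidate query x~. The
# hypothetical outputs y~_r use common random numbers v_r, j_r (fixed seed)
# so that neighbouring candidates can be compared without simulation noise.
import numpy as np

def future_loss(x_tilde, x_trials, W, f, sigma2, R=20, seed=0):
    rng = np.random.default_rng(seed)              # same v_r, j_r for every x~
    B = len(W)
    v = rng.standard_normal(R)                     # v_r ~ N(0, 1)
    j = rng.integers(0, B, size=R)                 # j_r drawn from {1, ..., B}
    s2 = np.array([sigma2(w) for w in W])
    mu = np.array([f(x_tilde, w) for w in W])      # f(x~, w_b)
    F = np.array([[f(x_t, w) for w in W] for x_t in x_trials])  # f(x_t, w_b)
    total = 0.0
    for r in range(R):
        y_r = mu[j[r]] + v[r] * np.sqrt(s2[j[r]])                # y~_r
        lik = np.exp(-0.5 * (y_r - mu) ** 2 / s2) / np.sqrt(2 * np.pi * s2)
        S = lik.sum()                                            # S(x~, y~)
        noise = np.sum(s2 * lik) / S                             # 1st term of (4)
        m = F @ lik / S                                          # E(y | x_t, ...) per trial
        var = ((F - m[:, None]) ** 2) @ lik / S                  # 2nd term of (4)
        total += noise + var.mean()                              # average over X_T
    return total / R                                             # average over y~_r
```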
3.2 CURRENT LOSS
As a proxy for the future loss, we may use the current loss at $x$,

$$r_{\mathrm{curr}}(x) = p(x) \int L(y, d^*(x))\, p(y \mid x, D_{(n)})\, dy \qquad (5)$$

where $p(x)$ weights the inputs according to their relevance. For the square loss function the average loss at $x$ is the conditional variance $\mathrm{Var}(y \mid x, D_{(n)})$. We get

$$r_{\mathrm{curr}}(x) = p(x) \int [f(x, w) - E(y \mid x, D_{(n)})]^2\, p(w \mid D_{(n)})\, dw + p(x) \int \sigma^2(w)\, p(w \mid D_{(n)})\, dw \qquad (6)$$

If $\hat E(y \mid x, D_{(n)}) := \frac{1}{B} \sum_{b=1}^{B} f(x, w_b)$ and the sample $W_{(B)} := \{w_1, \ldots, w_B\}$ is representative of $p(w \mid D_{(n)})$, we can approximate the current loss with

$$\hat r_{\mathrm{curr}}(x) \approx p(x)\, \frac{1}{B} \sum_{b=1}^{B} [f(x, w_b) - \hat E(y \mid x, D_{(n)})]^2 + p(x)\, \frac{1}{B} \sum_{b=1}^{B} \sigma^2(w_b) \qquad (7)$$

If the input distribution $p(x)$ is uniform, the second term is independent of $x$.
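The estimate (7) is straightforward to implement; a minimal sketch, with the same assumed helpers as above:

```python
# Sketch of the current-loss estimate (7): posterior variance of the mean
# prediction plus the average noise variance, weighted by the input density.
# `W`, `f`, `sigma2`, and the density `p_x` are assumed to be supplied.
import numpy as np

def current_loss(x, W, f, sigma2, p_x=lambda x: 1.0):
    preds = np.array([f(x, w) for w in W])    # f(x, w_b) over the sample W(B)
    s2 = np.array([sigma2(w) for w in W])
    mean = preds.mean()                       # Ê(y | x, D(n))
    return p_x(x) * (np.mean((preds - mean) ** 2) + s2.mean())
```

A query is then found by evaluating this quantity over a grid or candidate set of inputs and taking the maximizer, as in the experiments of section 4.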
3.3 COMPLEXITY REGULARIZATION
Neural network models can represent arbitrary mappings between finite-dimensional
spaces if the number of hidden units is sufficiently large [Hornik Stinchcombe 89].
As the number of observations grows, more and more hidden units are necessary to capture the details of the mapping. Therefore we use a sequential procedure to increase the capacity of our networks during query learning. White and Wooldridge call this approach the "method of sieves" and provide some asymptotic results on its consistency [White Wooldridge 91]. Gelfand and Dey compare Bayesian approaches for model selection and prove that, in the case of nested models $M_1$ and $M_2$, model choice by the ratio of the quantities $p(D_{(n)} \mid M_i) := \int p(D_{(n)} \mid w, M_i)\, p(w \mid M_i)\, dw$ (the popular Bayes factor) will always choose the full model regardless of the data as $n \to \infty$ [Gelfand Dey 94]. They show that the pseudo-Bayes factor, a Bayesian variant of cross-validation, is not affected by this paradox:

$$A(M_1, M_2) := \prod_{j=1}^{n} p(y_j \mid x_j, D_{(n,j)}, M_1) \Big/ \prod_{j=1}^{n} p(y_j \mid x_j, D_{(n,j)}, M_2) \qquad (8)$$

Here $D_{(n,j)} := D_{(n)} \setminus (x_j, y_j)$. As the difference between $p(w \mid D_{(n)})$ and $p(w \mid D_{(n,j)})$ is usually small, we use the full posterior as the importance function (2) and get

$$p(y_j \mid x_j, D_{(n,j)}, M_i) = \int p(y_j \mid x_j, w, M_i)\, p(w \mid D_{(n,j)}, M_i)\, dw \approx B \Big/ \sum_{b=1}^{B} \frac{1}{p(y_j \mid x_j, w_b, M_i)} \qquad (9)$$
4 NUMERICAL DEMONSTRATION
In a first experiment we tested the approach on a small 1-2-1 MLP target function with Gaussian noise $N(0, 0.05^2)$. We assumed the square loss function and a
uniform input distribution p(x) over [-5,5]. Using the "true" architecture for the
approximating model we started with a single randomly generated observation.

Figure 1: Future loss exploration: predicted posterior mean, future loss and current loss for 12 observations (left), and root mean square error of prediction (right).

We estimated the future loss by (4) for 100 different inputs and selected the input with
smallest future loss as the next query. B = 50 parameter vectors were generated requiring 200,000 Metropolis steps. Simultaneously we approximated the current loss
criterion by (7). The left side of figure 1 shows the typical relation of both measures.
In most situations the future loss is low in the same regions where the current loss
(posterior standard deviation of mean prediction) is high. The queries are concentrated in areas of high variation and the estimated posterior mean approximates
the target function quite well.
In the right part of figure 1 the RMSE of prediction averaged over 12 independent
experiments is shown. After a few observations the RMSE drops sharply. In our
example there is no marked difference between the prediction errors resulting from
the future loss and the current loss criterion (also averaged over 12 experiments).
Considering the substantial computing effort, this favors the current loss criterion. The dots indicate the RMSE for randomly generated data (averaged over 8 experiments) using the same Bayesian prediction procedure. Because only a few data points were located in the critical region of high variation, the RMSE is much larger.
In the second experiment, a 2-3-1 MLP defined the target function $f(x, w_0)$, to which Gaussian noise of standard deviation 0.05 was added. $f(x, w_0)$ is shown in the left part of figure 2. We used five MLPs with 2-6 hidden units as candidate models $M_1, \ldots, M_5$ and generated $B = 45$ samples $W_{(B)}$ of the posterior $p(w \mid D_{(n)}, M_i)$, where $D_{(n)}$ is the current data. We started with 30,000 Metropolis steps for small values of $n$ and increased this to 90,000 Metropolis steps for larger values of $n$. For a network with 6 hidden units and $n = 50$ observations, 10,000 Metropolis steps took about 30 seconds on a Sparc10 workstation. Next, we used equation (9) to compare the different models, and then used the optimal model to calculate the current loss (7) on a regular grid of $41 \times 41 = 1681$ query points $x$. Here we assumed the square loss function and a uniform input distribution $p(x)$ over $[-5, 5] \times [-5, 5]$. We selected the query point with maximal current loss and determined the final query point with a hillclimbing algorithm. In this way we were rather sure to get close to the true global optimum.
Figure 2: Current loss exploration: MLP target function and root mean square error.

The main result of the experiment is summarized in the right part of figure 2. It shows, averaged over 3 experiments, the root mean square error between the true
mean value and the posterior mean $E(y \mid x)$ on the grid of 1681 inputs in relation to the sample size. Three phases of the exploration can be distinguished (see figure 3). In the beginning a search is performed with many queries on the border of the input area. After about 20 observations the algorithm knows enough detail about the true function to concentrate on the relevant parts of the input space. This leads to a marked reduction of the mean square error. After 40 observations the systematic part of the true function has been captured nearly perfectly. In the last phase of the experiment the algorithm merely reduces the uncertainty caused by the random noise. In contrast, the randomly generated data does not have sufficient information on the details of $f(x, w)$, and therefore the error decreases only gradually. Because of space constraints we cannot report experiments with radial basis functions, which led to similar results.
Figure 3: Square root of current loss (upper row) and absolute deviation from the true function (lower row) for 10, 25, and 40 observations (indicated by dots).
Acknowledgements
This work is part of the joint project 'REFLEX' of the German Federal Department of Science and Technology (BMFT), grant number 01 IN 111A/4. We would like to thank Alexander Linden, Mark Ring, and Frank Weber for many fruitful discussions.
References
[Berger 80] Berger, J. (1980): Statistical Decision Theory, Foundations, Concepts, and
Methods. Springer Verlag, New York.
[Cohn 94] Cohn, D. (1994): Neural Network Exploration Using Optimal Experimental
Design. In J. Cowan et al. (eds.): NIPS 5. Morgan Kaufmann, San Mateo.
[Ford et al. 89] Ford, I., Titterington, D.M., Kitsos, C.P. (1989): Recent Advances in Nonlinear Design. Technometrics, 31, p.49-60.
[Gelfand Dey 94] Gelfand, A.E., Dey, D.K. (1994): Bayesian Model Choice: Asymptotics
and Exact Calculations. J. Royal Statistical Society B, 56, pp.501-514.
[Hornik Stinchcombe 89] Hornik, K., Stinchcombe, M. (1989): Multilayer Feedforward
Networks are Universal Approximators. Neural Networks 2, p.359-366.
[Kalos Whitlock 86] Kalos, M.H., Whitlock, P.A. (1986): Monte Carlo Methods, Wiley,
New York.
[MacKay 92] MacKay, D. (1992): Information-Based Objective Functions for Active Data
Selection. Neural Computation 4, p.590-604.
[Neal 93] Neal, R.M. (1993): Probabilistic Inference using Markov Chain Monte Carlo
Methods. Tech. Report CRG-TR-93-1, Dep. of Computer Science, Univ. of Toronto.
[Paaß 91] Paaß, G. (1991): Second Order Probabilities for Uncertain and Conflicting Evidence. In: P.P. Bonissone et al. (eds.) Uncertainty in Artificial Intelligence 6. Elsevier,
Amsterdam, pp. 447-456.
[Plutowski White 93] Plutowski, M., White, H. (1993): Selecting Concise Training Sets
from Clean Data. IEEE Tr. on Neural Networks, 4, p.305-318.
[Pronzato Walter 92] Pronzato, L., Walter, E. (1992): Nonsequential Bayesian Experimental Design for Response Optimization. In V. Fedorov, W.G. Müller, I.N. Vuchkov (eds.): Model Oriented Data-Analysis. Physica Verlag, Heidelberg, p.89-102.
[Sollich 94] Sollich, P. (1994): Query Construction, Entropy and Generalization in Neural
Network Models. To appear in Physical Review E.
[Sollich Saad 95] Sollich, P., Saad, D. (1995): Learning from Queries for Maximum Information Gain in Unlearnable Problems. This volume.
[White Wooldridge 91] White, H., Wooldridge, J. (1991): Some Results for Sieve Estimation with Dependent Observations. In W. Barnett et al. (eds.): Nonparametric and
Semiparametric Methods in Econometrics and Statistics, New York, Cambridge Univ.
Press.