Modelling transcriptional regulation using Gaussian processes
Neil D. Lawrence
School of Computer Science
University of Manchester, U.K.
[email protected]
Guido Sanguinetti
Department of Computer Science
University of Sheffield, U.K.
[email protected]
Magnus Rattray
School of Computer Science
University of Manchester, U.K.
[email protected]
Abstract
Modelling the dynamics of transcriptional processes in the cell requires the knowledge of a number of key biological quantities. While some of them are relatively
easy to measure, such as mRNA decay rates and mRNA abundance levels, it is
still very hard to measure the active concentration levels of the transcription factor
proteins that drive the process and the sensitivity of target genes to these concentrations. In this paper we show how these quantities for a given transcription factor
can be inferred from gene expression levels of a set of known target genes. We
treat the protein concentration as a latent function with a Gaussian process prior,
and include the sensitivities, mRNA decay rates and baseline expression levels as
hyperparameters. We apply this procedure to a human leukemia dataset, focusing
on the tumour suppressor p53 and obtaining results in good agreement with recent
biological studies.
Introduction
Recent advances in molecular biology have brought about a revolution in our understanding of cellular processes. Microarray technology now allows measurement of mRNA abundance on a genomewide scale, and techniques such as chromatin immunoprecipitation (ChIP) have largely unveiled
the wiring of the cellular transcriptional regulatory network, identifying which genes are bound by
which transcription factors. However, a full quantitative description of the regulatory mechanism of
transcription requires the knowledge of a number of other biological quantities: first of all the concentration levels of active transcription factor proteins, but also a number of gene-specific constants
such as the baseline expression level for a gene, the rate of decay of its mRNA and the sensitivity
with which target genes react to a given transcription factor protein concentration. While some of
these quantities can be measured (e.g. mRNA decay rates), most of them are very hard to measure
with current techniques, and have therefore to be inferred from the available data. This is often
done following one of two complementary approaches. One can formulate a large scale simplified
model of regulation (for example assuming a linear response to protein concentrations) and then
combine network architecture data and gene expression data to infer transcription factors’ protein
concentrations on a genome-wide scale. This line of research was started in [3] and then extended
further to include gene-specific effects in [10, 11]. Alternatively, one can formulate a realistic model
of a small subnetwork in which a few transcription factors regulate a small number of established target genes, trying to include the finer points of the dynamics of transcriptional regulation. In this
paper we follow the second approach, focussing on the simplest subnetwork consisting of one tran-
scription factor regulating its target genes, but using a detailed model of the interaction dynamics
to infer the transcription factor concentrations and the gene specific constants. This problem was
recently studied by Barenco et al. [1] and by Rogers et al. [9]. In these studies, parametric models
were developed describing the rate of production of certain genes as a function of the concentration
of transcription factor protein at some specified time points. Markov chain Monte Carlo (MCMC)
methods were then used to carry out Bayesian inference of the protein concentrations, requiring
substantial computational resources and limiting the inference to the discrete time-points where the
data was collected.
We show here how a Gaussian process model provides a simple and computationally efficient
method for Bayesian inference of continuous transcription factor concentration profiles and associated model parameters. Gaussian processes have been used effectively in a number of machine
learning and statistical applications [8] (see also [2, 6] for the work that is most closely related to
ours). Their use in this context is novel, as far as we know, and leads to several advantages. Firstly,
it allows for the inference of continuous quantities (concentration profiles) without discretization,
therefore accounting naturally for the temporal structure of the data. Secondly, it avoids the use of
cumbersome interpolation techniques to estimate mRNA production rates from mRNA abundance
data, and it allows us to deal naturally with the noise inherent in the measurements. Finally, it greatly
outstrips MCMC techniques in terms of computational efficiency, which we expect to be crucial in
future extensions to more complex (and realistic) regulatory networks.
The paper is organised as follows: in the first section we discuss linear response models. These
are simplified models in which the mRNA production rate depends linearly on the transcription
factor protein concentration. Although the linearity assumption does not hold in practice, it has the
advantage of giving rise to an exactly tractable inference problem. We then discuss how to extend the
formalism to model cases where the dependence of mRNA production rate on transcription factor
protein concentration is not linear, and propose a MAP-Laplace approach to carry out Bayesian
inference. In the third section we test our model on the leukemia data set studied in [1]. Finally,
we discuss further extensions of our work. MATLAB code to recreate the experiments is available
on-line.
1 Linear Response Model
Let the data set under consideration consist of T measurements of the mRNA abundance of N genes.
We consider a linear differential equation that relates a given gene j's expression level x_j(t) at time
t to the concentration of the regulating transcription factor protein f(t),

  dx_j/dt = B_j + S_j f(t) − D_j x_j(t).    (1)
Here, B_j is the basal transcription rate of gene j, S_j is the sensitivity of gene j to the transcription
factor and D_j is the decay rate of the mRNA. Crucially, the dependence of the mRNA transcription
rate on the protein concentration (the response) is linear. Assuming a linear response is a crude simplification, but it can still lead to interesting results in certain modelling situations. Equation (1) was used
by Barenco et al. [1] to model a simple network consisting of the tumour suppressor transcription
factor p53 and five of its target genes. We will consider more general models in section 2.
The equation given in (1) can be solved to recover

  x_j(t) = B_j/D_j + k_j exp(−D_j t) + S_j exp(−D_j t) ∫_0^t f(u) exp(D_j u) du,    (2)

where k_j arises from the initial conditions, and is zero if we assume an initial baseline expression
level x_j(0) = B_j/D_j.
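Since the paper's MATLAB code is not reproduced here, a minimal Python sketch can illustrate equation (2): it evaluates the closed-form solution by trapezoidal quadrature and checks it against direct numerical integration of the ODE (1). The profile f(t) and the constants B, S and D are made-up illustrative values, not quantities from the paper.

```python
import numpy as np

# Illustrative constants for one gene (not taken from the paper's data).
B, S, D = 0.5, 1.0, 0.8
f = lambda t: np.sin(t) + 1.0            # a made-up protein profile

def x_closed_form(t, n=2000):
    # equation (2) with k_j = 0; the integral is evaluated by the trapezoidal rule
    u = np.linspace(0.0, t, n)
    g = f(u) * np.exp(D * u)
    integral = np.sum((g[1:] + g[:-1]) / 2) * (u[1] - u[0])
    return B / D + S * np.exp(-D * t) * integral

def x_heun(t, n=20000):
    # independent check: Heun's method on the ODE (1) from x(0) = B/D
    dt = t / n
    x = B / D
    for k in range(n):
        tk = k * dt
        k1 = B + S * f(tk) - D * x
        k2 = B + S * f(tk + dt) - D * (x + dt * k1)
        x += dt * (k1 + k2) / 2.0
    return x

print(abs(x_closed_form(3.0) - x_heun(3.0)) < 1e-4)
```

Agreement to within the quadrature error confirms that (2) solves (1) under the initial condition x(0) = B/D.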
We will model the protein concentration f as a latent function drawn from a Gaussian process
prior distribution. It is important to notice that equation (2) involves only linear operations on the
function f(t). This implies immediately that the mRNA abundance levels will also be modelled as
a Gaussian process, and the covariance function of the marginal distribution p(x_1, …, x_N) can be
worked out explicitly from the covariance function of the latent function f.
Let us rewrite equation (2) as

  x_j(t) = B_j/D_j + L_j[f](t),    (3)

where we have set the initial conditions such that k_j in equation (2) is equal to zero and

  L_j[f](t) = S_j exp(−D_j t) ∫_0^t f(u) exp(D_j u) du
is the linear operator relating the latent function f to the mRNA abundance of gene j, x_j(t). If the
covariance function associated with f(t) is given by k_ff(t, t′), then elementary functional analysis
yields

  cov(L_j[f](t), L_k[f](t′)) = L_j ⊗ L_k [k_ff](t, t′).

Explicitly, this is given by the following formula:

  k_{x_j x_k}(t, t′) = S_j S_k exp(−D_j t − D_k t′) ∫_0^t exp(D_j u) ∫_0^{t′} exp(D_k u′) k_ff(u, u′) du′ du.    (4)
If the process prior over f(t) is taken to be a squared exponential kernel,

  k_ff(t, t′) = exp(−(t − t′)²/l²),

where l controls the width of the basis functions¹, the integrals in equation (4) can be computed
analytically. The resulting covariances are obtained as

  k_{x_j x_k}(t, t′) = S_j S_k (√π l / 2) [h_kj(t′, t) + h_jk(t, t′)],    (5)

where

  h_kj(t′, t) = [exp(γ_k²)/(D_j + D_k)] { exp[−D_k(t′ − t)] [erf((t′ − t)/l − γ_k) + erf(t/l + γ_k)]
                − exp[−(D_k t′ + D_j t)] [erf(t′/l − γ_k) + erf(γ_k)] }.
Here erf(x) = (2/√π) ∫_0^x exp(−y²) dy and γ_k = D_k l/2. We can therefore compute a likelihood which
relates instantiations from all the observed genes, {x_j(t)}_{j=1}^N, through dependencies on the
parameters {B_j, S_j, D_j}_{j=1}^N. The effect of f(t) has been marginalised.
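As a concrete check of the derivation, the closed form (5) can be compared against brute-force quadrature of the double integral in (4). In the sketch below all parameter values are illustrative; `h` and `k_xx` implement the formulas above, while `k_xx_quad` evaluates (4) on a grid.

```python
import numpy as np
from math import erf, exp, pi, sqrt

# Illustrative parameters for two genes (hypothetical values).
l = 1.0
S = [1.0, 0.7]          # sensitivities S_j
D = [0.5, 0.9]          # decay rates D_j

def h(k, j, t2, t1):
    # h_kj(t', t) from equation (5), with gamma_k = D_k l / 2
    g = D[k] * l / 2.0
    return (exp(g * g) / (D[j] + D[k])) * (
        exp(-D[k] * (t2 - t1)) * (erf((t2 - t1) / l - g) + erf(t1 / l + g))
        - exp(-(D[k] * t2 + D[j] * t1)) * (erf(t2 / l - g) + erf(g)))

def k_xx(j, k, t, t2):
    # k_{x_j x_k}(t, t') in closed form, equation (5)
    return S[j] * S[k] * sqrt(pi) * l / 2.0 * (h(k, j, t2, t) + h(j, k, t, t2))

def k_xx_quad(j, k, t, t2, n=400):
    # direct trapezoidal evaluation of the double integral in (4)
    u = np.linspace(0.0, t, n)
    v = np.linspace(0.0, t2, n)
    U, V = np.meshgrid(u, v, indexing="ij")
    integrand = np.exp(D[j] * U + D[k] * V - (U - V) ** 2 / l ** 2)
    inner = np.sum((integrand[:, 1:] + integrand[:, :-1]) / 2, axis=1) * (v[1] - v[0])
    outer = np.sum((inner[1:] + inner[:-1]) / 2) * (u[1] - u[0])
    return S[j] * S[k] * exp(-D[j] * t - D[k] * t2) * outer

print(abs(k_xx(0, 1, 1.5, 2.0) - k_xx_quad(0, 1, 1.5, 2.0)) < 1e-3)
```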
To infer the protein concentration levels, one also needs the "cross-covariance" terms between x_j(t)
and f(t′), which are obtained as

  k_{x_j f}(t, t′) = S_j exp(−D_j t) ∫_0^t exp(D_j u) k_ff(u, t′) du.    (6)

Again, this can be obtained explicitly for squared exponential priors on the latent function f as

  k_{x_j f}(t′, t) = (√π l S_j / 2) exp(γ_j²) exp[−D_j(t′ − t)] [erf((t′ − t)/l − γ_j) + erf(t/l + γ_j)].
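The same style of check works for the cross-covariance: the sketch below evaluates the explicit expression above and compares it with direct quadrature of equation (6), again with illustrative parameter values.

```python
import numpy as np
from math import erf, exp, pi, sqrt

# Illustrative parameters (hypothetical values, not from the paper's data).
l, S, D = 1.0, 0.8, 0.6
gamma = D * l / 2.0

def k_xf(t2, t1):
    # closed-form cross-covariance k_{x_j f}(t', t)
    return (sqrt(pi) * l * S / 2.0) * exp(gamma ** 2) * exp(-D * (t2 - t1)) * (
        erf((t2 - t1) / l - gamma) + erf(t1 / l + gamma))

def k_xf_quad(t2, t1, n=5000):
    # direct trapezoidal evaluation of the integral in equation (6)
    u = np.linspace(0.0, t2, n)
    g = np.exp(D * u) * np.exp(-((u - t1) ** 2) / l ** 2)
    integral = np.sum((g[1:] + g[:-1]) / 2) * (u[1] - u[0])
    return S * exp(-D * t2) * integral

print(abs(k_xf(2.0, 1.0) - k_xf_quad(2.0, 1.0)) < 1e-4)
```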
Standard Gaussian process regression techniques [see e.g. 8] then yield the mean and covariance
function of the posterior process on f as

  ⟨f⟩_post = K_fx K_xx^{−1} x,
  K_ff^{post} = K_ff − K_fx K_xx^{−1} K_xf,    (7)

where x denotes collectively the observed variables x_j(t) and capital K denotes the matrix obtained
by evaluating the covariance function of the processes on every pair of observed time points. The
model parameters B_j, D_j and S_j can be estimated by type II maximum likelihood. Alternatively,
they can be assigned vague gamma prior distributions and estimated a posteriori using MCMC
sampling.

¹The scale of the process is ignored to avoid a parameterisation ambiguity with the sensitivities.
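Equation (7) is the standard Gaussian conditional written with the model's covariance blocks. The sketch below illustrates just the linear algebra, substituting a generic RBF covariance for the model-derived K_xx, K_fx and K_ff of equations (4)-(6), and toy observations for x.

```python
import numpy as np

rng = np.random.default_rng(0)
t_obs = np.linspace(0, 10, 7)      # observed time points
t_star = np.linspace(0, 10, 50)    # prediction grid for f

def rbf(a, b, l=2.0):
    # generic squared exponential covariance, used as a stand-in kernel
    return np.exp(-((a[:, None] - b[None, :]) ** 2) / l ** 2)

# Stand-ins for K_xx (including the noise term), K_fx and K_ff.
K_xx = rbf(t_obs, t_obs) + 0.1 * np.eye(len(t_obs))
K_fx = rbf(t_star, t_obs)
K_ff = rbf(t_star, t_star)

x = rng.standard_normal(len(t_obs))                    # toy observations
f_mean = K_fx @ np.linalg.solve(K_xx, x)               # K_fx K_xx^{-1} x
f_cov = K_ff - K_fx @ np.linalg.solve(K_xx, K_fx.T)    # posterior covariance

print(f_mean.shape, np.min(np.diag(f_cov)) > -1e-9)
```

In the actual model, K_xx would be built from (5) plus the noise covariance of section 1, and K_fx from (6).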
In practice, we will allow the mRNA abundance of each gene at each time point to be corrupted by
some noise, so that we can model the observations at times t_i for i = 1, …, T as

  y_j(t_i) = x_j(t_i) + ε_j(t_i),    (8)

with ε_j(t_i) ∼ N(0, σ_ji²). Estimates of the confidence levels associated with each mRNA measurement can be obtained for Affymetrix microarrays using probe-level processing techniques
such as the mmgMOS model of [4]. The covariance of the noisy process is simply obtained as
K_yy = Σ + K_xx, with Σ = diag(σ_11², …, σ_1T², …, σ_N1², …, σ_NT²).
2 Non-linear Response Model
While the linear response model presents the advantage of being exactly tractable in the important
squared exponential case, a realistic model of transcription should account for effects such as saturation and ultrasensitivity which cannot be captured by a linear function. Also, all the quantities in
equation (1) are positive, but one cannot constrain samples from a Gaussian process to be positive.
Modelling the response of the transcription rate to protein concentration using a positive nonlinear
function is an elegant way to enforce this constraint.
2.1 Formalism
Let the response of the mRNA transcription rate to transcription factor protein concentration levels
be modelled by a nonlinear function g with a target-specific vector θ_j of parameters, so that

  dx_j/dt = B_j + g(f(t), θ_j) − D_j x_j,
  x_j(t) = B_j/D_j + exp(−D_j t) ∫_0^t g(f(u), θ_j) exp(D_j u) du,    (9)
where we again set x_j(0) = B_j/D_j and assign a Gaussian process prior distribution to f(t). In
this case the induced distribution of x_j(t) is no longer a Gaussian process. However, we can derive
the functional gradient of the likelihood and prior, and use this to learn the maximum a posteriori
(MAP) solution for f(t) and the parameters by (functional) gradient descent. Given noise-corrupted
data y_j(t_i) as above, the log-likelihood of the data Y = {y_j(t_i)} is given by

  log p(Y | f, {B_j, θ_j, D_j, Ξ}) = −(1/2) Σ_{i=1}^T Σ_{j=1}^N [(x_j(t_i) − y_j(t_i))²/σ_ji² + log σ_ji²] − (NT/2) log(2π),    (10)
where Ξ denotes collectively the parameters of the prior covariance on f (in the squared exponential
case, Ξ = l²). The functional derivative of the log-likelihood with respect to f is then obtained as

  δ log p(Y|f)/δf(t) = −Σ_{i=1}^T Σ_{j=1}^N [(x_j(t_i) − y_j(t_i))/σ_ji²] Θ(t_i − t) g′(f(t)) e^{−D_j(t_i − t)},    (11)

where Θ(x) is the Heaviside step function and we have omitted the model parameters for brevity.
The negative Hessian of the log-likelihood with respect to f is given by

  w(t, t′) = −δ² log p(Y|f)/δf(t)δf(t′)
           = Σ_{i=1}^T Σ_{j=1}^N [(x_j(t_i) − y_j(t_i))/σ_ji²] Θ(t_i − t) δ(t − t′) g″(f(t)) e^{−D_j(t_i − t)}
             + Σ_{i=1}^T Θ(t_i − t) Θ(t_i − t′) Σ_{j=1}^N σ_ji^{−2} g′(f(t)) g′(f(t′)) e^{−D_j(2t_i − t − t′)},    (12)

where g′(f) = ∂g/∂f and g″(f) = ∂²g/∂f².
2.2 Implementation
We discretise in time t and compute the gradient and Hessian on a grid using approximate Riemann
quadrature. In the simplest case, we choose a uniform grid [t_p], p = 1, …, M, so that ∆ =
t_p − t_{p−1} is constant. We write f = [f_p] for the vector realisation of the function f at the grid
points. The gradient of the log-likelihood is then given by

  ∂ log p(Y|f)/∂f_p = −∆ Σ_{i=1}^T Σ_{j=1}^N Θ(t_i − t_p) [(x_j(t_i) − y_j(t_i))/σ_ji²] g′(f_p) e^{−D_j(t_i − t_p)},    (13)
and the negative Hessian of the log-likelihood is

  W_pq = −∂² log p(Y|f)/∂f_p ∂f_q
       = δ_pq ∆ Σ_{i=1}^T Σ_{j=1}^N Θ(t_i − t_q) [(x_j(t_i) − y_j(t_i))/σ_ji²] g″(f_q) e^{−D_j(t_i − t_q)}
         + ∆² Σ_{i=1}^T Θ(t_i − t_p) Θ(t_i − t_q) Σ_{j=1}^N σ_ji^{−2} g′(f_q) g′(f_p) e^{−D_j(2t_i − t_p − t_q)},    (14)

where δ_pq is the Kronecker delta. In these and the following formulae t_i is understood to mean the
index of the grid point corresponding to the ith data point, whereas t_p and t_q correspond to the grid
points themselves.
We can then compute the gradient and Hessian of the (discretised) un-normalised log posterior
Ψ(f) = log p(Y|f) + log p(f) [see 8, chapter 3]:

  ∇Ψ(f) = ∇ log p(Y|f) − K^{−1} f,
  ∇∇Ψ(f) = −(W + K^{−1}),    (15)

where K is the prior covariance matrix evaluated at the grid points. These can be used to find the
MAP solution f̂ using Newton's method. The Laplace approximation to the log-marginal likelihood
is then (ignoring terms that do not involve model parameters)

  log p(Y) ≈ log p(Y|f̂) − (1/2) f̂ᵀ K^{−1} f̂ − (1/2) log |I + KW|.    (16)
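The Newton iteration implied by (15) can be written in the numerically stable form f_new = (I + KW)^{−1} K (W f + ∇ log p(Y|f)) [8, chapter 3]. The sketch below uses a Gaussian likelihood as a stand-in for the non-linear response model, so that W is constant and the MAP solution can be verified against the exact posterior mean K(K + σ²I)^{−1}y; all numerical values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
M = 15
t = np.linspace(0.0, 10.0, M)
# prior covariance on the grid (illustrative squared exponential plus jitter)
K = np.exp(-((t[:, None] - t[None, :]) ** 2) / 4.0) + 1e-8 * np.eye(M)
sigma2 = 0.1
y = np.sin(t) + 0.1 * rng.standard_normal(M)           # toy observations

f = np.zeros(M)
W = np.eye(M) / sigma2          # negative Hessian of the Gaussian log-likelihood
for _ in range(5):
    grad_lik = (y - f) / sigma2                        # grad of log p(Y|f)
    # Newton step: f_new = (I + K W)^{-1} K (W f + grad_lik)
    f = np.linalg.solve(np.eye(M) + K @ W, K @ (W @ f + grad_lik))

# For a Gaussian likelihood the MAP equals the exact posterior mean.
exact = K @ np.linalg.solve(K + sigma2 * np.eye(M), y)
print(np.max(np.abs(f - exact)) < 1e-8)
```

With a genuinely non-linear response, W would be recomputed from (14) at each iteration, but the update takes the same form.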
We can also optimise the log-marginal with respect to the model and kernel parameters. The gradient
of the log-marginal with respect to the kernel parameters is [8]

  ∂ log p(Y|Ξ)/∂Ξ = (1/2) f̂ᵀ K^{−1} (∂K/∂Ξ) K^{−1} f̂ − (1/2) tr[(I + KW)^{−1} W (∂K/∂Ξ)] + Σ_p [∂ log p(Y|Ξ)/∂f̂_p] [∂f̂_p/∂Ξ],    (17)

where the final term is due to the implicit dependence of f̂ on Ξ.
2.3 Example: exponential response
As an example, we consider the case in which

  g(f(t), θ_j) = S_j exp(f(t)),    (18)

which provides a useful way of constraining the protein concentration to be positive. Substituting
equation (18) in equations (13) and (14), one obtains

  ∂ log p(Y|f)/∂f_p = −∆ Σ_{i=1}^T Σ_{j=1}^N Θ(t_i − t_p) [(x_j(t_i) − y_j(t_i))/σ_ji²] S_j e^{f_p − D_j(t_i − t_p)},

  W_pq = −δ_pq ∂ log p(Y|f)/∂f_p + ∆² Σ_{i=1}^T Σ_{j=1}^N Θ(t_i − t_p) Θ(t_i − t_q) σ_ji^{−2} S_j² e^{f_p + f_q − D_j(2t_i − t_p − t_q)}.
The terms required in equation (17) are

  ∂ log p(Y|Ξ)/∂f̂_p = −(AW)_pp − (1/2) Σ_q A_qq W_qp,

where A = (W + K^{−1})^{−1}, and

  ∂f̂/∂Ξ = A K^{−1} (∂K/∂Ξ) ∇ log p(Y|f̂).
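A useful implementation check is to compare the discretised gradient (13), with g′(f_p) = S_j e^{f_p}, against finite differences of the discretised log-likelihood. The sketch below does this for a small made-up problem; all sizes and parameter values are illustrative, and only the data-fit term of the log-likelihood is kept, since the remaining terms do not depend on f.

```python
import numpy as np

rng = np.random.default_rng(1)
M, T, N = 20, 5, 2
t_grid = np.linspace(0, 10, M)
dt = t_grid[1] - t_grid[0]
obs_idx = np.array([3, 7, 11, 15, 19])        # grid indices of the data points
B = np.array([0.2, 0.4]); S = np.array([1.0, 0.6]); D = np.array([0.5, 0.9])
sigma2 = 0.05 * np.ones((N, T))
f = 0.3 * rng.standard_normal(M)              # latent function on the grid
y = rng.standard_normal((N, T))               # toy observations

def x_of_f(f):
    # discretised solution: x_j(t_i) = B_j/D_j + dt * sum_p S_j e^{f_p} e^{-D_j(t_i - t_p)}
    x = np.zeros((N, T))
    for j in range(N):
        for i, gi in enumerate(obs_idx):
            p = np.arange(M) <= gi            # Heaviside step on the grid
            x[j, i] = B[j] / D[j] + dt * np.sum(
                S[j] * np.exp(f[p]) * np.exp(-D[j] * (t_grid[gi] - t_grid[p])))
    return x

def loglik(f):
    # data-fit term of equation (10)
    return -0.5 * np.sum((x_of_f(f) - y) ** 2 / sigma2)

def grad(f):
    # equation (13) with g'(f_p) = S_j e^{f_p}
    x = x_of_f(f)
    g = np.zeros(M)
    for p in range(M):
        for j in range(N):
            for i, gi in enumerate(obs_idx):
                if gi >= p:
                    g[p] -= dt * (x[j, i] - y[j, i]) / sigma2[j, i] * \
                        S[j] * np.exp(f[p]) * np.exp(-D[j] * (t_grid[gi] - t_grid[p]))
    return g

eps = 1e-6
num = np.array([(loglik(f + eps * np.eye(M)[p]) - loglik(f - eps * np.eye(M)[p]))
                / (2 * eps) for p in range(M)])
print(np.max(np.abs(num - grad(f))) < 1e-5)
```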
3 Results
To test the efficacy of our method, we used a recently published biological data set which was studied using a linear response model by Barenco et al. [1]. This study focused on the tumour suppressor
protein p53. mRNA abundance was measured at regular intervals in three independent human cell
lines using Affymetrix U133A oligonucleotide microarrays. The authors then restricted their interest to five known target genes of p53: DDB2, p21, SESN1/hPA26, BIK and TNFRSF10b. They
estimated the mRNA production rates by using quadratic interpolation between any three consecutive time points. They then discretised the model and used MCMC sampling (assuming a log-normal
noise model) to obtain estimates of the model parameters Bj , Sj , Dj and f (t). To make the model
identifiable, the value of the mRNA decay of one of the target genes, p21, was measured experimentally. Also, the scale of the sensitivities was fixed by choosing p21’s sensitivity to be equal to one,
and f (0) was constrained to be zero. Their predictions were then validated by doing explicit protein
concentration measurements and growing mutant cell lines where the p53 gene had been knocked
out.
3.1 Linear response analysis
We first analysed the data using the simple linear response model used by Barenco et al. [1]. Raw
data was processed using the mmgMOS model of [4], which also provides estimates of the credibility associated with each measurement. Data from the different cell lines were treated as independent
instantiations of f but sharing the model parameters {Bj , Sj , Dj , Ξ}. We used a squared exponential covariance function for the prior distribution on the latent function f . The inferred posterior
mean function for f , together with 95% confidence intervals, is shown in Figure 1(a). The pointwise
estimates inferred by Barenco et al. are shown as crosses in the plot. The posterior mean function
matches well the prediction obtained by Barenco et al.² Notice that the right hand tail of the inferred mean function shows an oscillatory behaviour. We believe that this is an artifact caused by
the squared exponential covariance; the steep rise between time zero and time two forces the length
scale of the function to be small, hence giving rise to wavy functions [see page 123 in 8]. To avoid
this, we repeated the experiment using the “MLP” covariance function for the prior distribution over
f [12]. Posterior estimation cannot be obtained analytically in this case, so we resorted to the MAP-Laplace approximation described in section 2. The MLP covariance is obtained as the limiting case
of a neural network with an infinite number of sigmoidal hidden units and has the following covariance function

  k(t, t′) = arcsin[ (w t t′ + b) / √((w t² + b + 1)(w t′² + b + 1)) ],    (19)
where w and b are parameters known as the weight and the bias variance. The results using this
covariance function are shown in Figure 1(b). The resulting profile does not show the unexpected
oscillatory behaviour and has tighter credibility intervals.
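For reference, the MLP covariance (19) is straightforward to evaluate; the sketch below computes it on a grid with illustrative values of w and b and checks that the resulting matrix is a valid (symmetric, positive semi-definite) covariance.

```python
import numpy as np

def mlp_kernel(t, t2, w=1.0, b=1.0):
    # the MLP covariance of equation (19); w, b are weight and bias variances
    num = w * np.outer(t, t2) + b
    den = np.sqrt((w * t ** 2 + b + 1)[:, None] * (w * t2 ** 2 + b + 1)[None, :])
    return np.arcsin(num / den)

t = np.linspace(0, 10, 6)
K = mlp_kernel(t, t)
# A valid covariance matrix must be symmetric positive semi-definite.
print(np.allclose(K, K.T), np.min(np.linalg.eigvalsh(K)) > -1e-9)
```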
Figure 2 shows the results of inference on the values of the hyperparameters Bj , Sj and Dj . The
columns on the left, shaded grey, show results from our model and the white columns are the
estimates obtained in [1]. The hyperparameters were assigned a vague gamma prior distribution
(a = b = 0.1, corresponding to a mean of 1 and a variance of 10). Samples from the posterior distribution were obtained using Hybrid Monte Carlo [see e.g. 7]. The results are in good accordance
with the results obtained by Barenco et al. Differences in the estimates of the basal transcription
rates are probably due to the different methods used for probe-level processing of the microarray
data.
3.2 Non-linear response analysis
We then used the non-linear response model of section 2 in order to constrain the inferred protein
concentrations to be positive. We achieved this by using an exponential response of the transcription
rate to the logged protein concentration. The inferred MAP solutions for the latent function f are
plotted in Figure 3 for the squared exponential prior (a) and for the MLP prior (b).

²Barenco et al. also constrained the latent function to be zero at time zero.
Figure 1: Predicted protein concentration for p53 using a linear response model: (a) squared exponential
prior on f ; (b) MLP prior on f . Solid line is mean prediction, dashed lines are 95% credibility intervals. The
prediction of Barenco et al. was pointwise and is shown as crosses.
Figure 2: Results of inference on the hyperparameters for the p53 data studied in [1]. The bar charts show (a)
basal transcription rates, (b) sensitivities and (c) decay rates. Grey bars are estimates obtained with our model,
white bars are the estimates obtained by Barenco et al.
4 Discussion
In this paper we showed how Gaussian processes can be used effectively in modelling the dynamics of a very simple regulatory network motif. This approach has many advantages over standard
parametric approaches: first of all, there is no need to restrict the inference to the observed time
points, and the temporal continuity of the inferred functions is accounted for naturally. Secondly,
Gaussian processes allow noise information to be accounted for in a natural way. It is well known
that biological data exhibits a large variability, partly because of technical noise (due, for example, to the
difficulty of measuring mRNA abundance for genes expressed at low levels), and partly because of the difference between different cell lines. Accounting for these sources of noise in a parametric model can
be difficult (particularly when estimates of the derivatives of the measured quantities are required),
while Gaussian Processes can incorporate this information naturally. Finally, MCMC parameter
estimation in a discretised model can be computationally expensive due to the high correlations between variables. This is a consequence of treating the protein concentrations as parameters, and
results in many MCMC iterations to obtain reliable samples. Parameter estimation can be achieved
easily in our framework by type II maximum likelihood or by using efficient Monte Carlo sampling
techniques only on the model hyperparameters.
While the results shown in the paper are encouraging, this is still a very simple modelling situation.
For example, it is well known that transcriptional delays can play a significant role in determining
the dynamics of many cellular processes [5]. These effects can be introduced naturally in a Gaussian
process model; however, the data must be sampled at a reasonably high frequency in order for delays
to become identifiable in a stochastic model, which is often not the case with microarray data sets.
Another natural extension of our work would be to consider more biologically meaningful nonlinearities, such as the popular Michaelis-Menten model of transcription used in [9]. Finally, networks
consisting of a single transcription factor are very useful to study small systems of particular interest
such as p53. However, our ultimate goal would be to describe regulatory pathways consisting of
Figure 3: Predicted protein concentration for p53 using an exponential response: (a) shows results of using a
squared exponential prior covariance on f ; (b) shows results of using an MLP prior covariance on f . Solid line
is mean prediction, dashed lines show 95% credibility intervals. The results shown are for exp(f ), hence the
asymmetry of the credibility intervals. The prediction of Barenco et al. was pointwise and is shown as crosses.
more genes. These can be dealt with in the general framework described in this paper, but careful
thought will be needed to overcome the greater computational difficulties.
Acknowledgements
We thank Martino Barenco for useful discussions and for providing the data. We gratefully acknowledge support from BBSRC Grant No BBS/B/0076X “Improved processing of microarray data with
probabilistic models”.
References
[1] M. Barenco, D. Tomescu, D. Brewer, R. Callard, J. Stark, and M. Hubank. Ranked prediction of p53
targets using hidden variable dynamic modeling. Genome Biology, 7(3):R25, 2006.
[2] T. Graepel. Solving noisy linear operator equations by Gaussian processes: Application to ordinary and
partial differential equations. In T. Fawcett and N. Mishra, editors, Proceedings of the International
Conference in Machine Learning, volume 20, pages 234–241. AAAI Press, 2003.
[3] J. C. Liao, R. Boscolo, Y.-L. Yang, L. M. Tran, C. Sabatti, and V. P. Roychowdhury. Network component analysis: Reconstruction of regulatory signals in biological systems. Proceedings of the National
Academy of Sciences USA, 100(26):15522–15527, 2003.
[4] X. Liu, M. Milo, N. D. Lawrence, and M. Rattray. A tractable probabilistic model for Affymetrix probe-level analysis across multiple chips. Bioinformatics, 21(18):3637–3644, 2005.
[5] N. A. Monk. Unravelling nature’s networks. Biochemical Society Transactions, 31:1457–1461, 2003.
[6] R. Murray-Smith and B. A. Pearlmutter. Transformations of Gaussian process priors. In J. Winkler, N. D.
Lawrence, and M. Niranjan, editors, Deterministic and Statistical Methods in Machine Learning, volume
3635 of Lecture Notes in Artificial Intelligence, pages 110–123, Berlin, 2005. Springer-Verlag.
[7] R. M. Neal. Bayesian Learning for Neural Networks. Springer, 1996. Lecture Notes in Statistics 118.
[8] C. E. Rasmussen and C. K. Williams. Gaussian Processes for Machine Learning. MIT press, 2005.
[9] S. Rogers, R. Khanin, and M. Girolami. Model based identification of transcription factor activity from
microarray data. In Probabilistic Modeling and Machine Learning in Structural and Systems Biology,
Tuusula, Finland, 17-18th June 2006.
[10] C. Sabatti and G. M. James. Bayesian sparse hidden components analysis for transcription regulation
networks. Bioinformatics, 22(6):739–746, 2006.
[11] G. Sanguinetti, M. Rattray, and N. D. Lawrence. A probabilistic dynamical model for quantitative inference of the regulatory mechanism of transcription. Bioinformatics, 22(14):1753–1759, 2006.
[12] C. K. I. Williams. Computation with infinite neural networks. Neural Computation, 10(5):1203–1216,
1998.