Monte Carlo simulation models:
Sampling from the joint distribution of
’State of Nature’-parameters
Erik Jørgensen∗
Biometry Research Unit. Danish Institute of Agricultural Sciences,
P.O.Box 50, DK-8830 Tjele, Denmark
Abstract
When using Monte Carlo simulation models for decision support it is important to represent the full uncertainty that faces the decision maker. This paper focuses on approaches
towards specifying the uncertainty in input parameters, also called "State-of-nature". Until recently, such specification has only been possible in practice under special conditions,
e.g., independent parameters or parameters following specific multivariate distributions,
such as the normal distribution. As a result of advances in Bayesian statistical methodology, it is now possible to specify much more complicated distributions, and the distributions
can even be found conditional on observations made prior to the simulations. The paper
presents cases to illustrate the potential.
1 Introduction
Models of animal production systems are widely used for investigating different production
strategies. Many of these models use Monte Carlo simulation techniques to calculate output
variables of interest.
Because such models are highly complex, even the correct specification of model input
parameters is difficult. Consequently, the uncertainty in the input parameters is usually
ignored. When the model is used for studying the ’behaviour’ of the system, this may not be
important. However, when the model is used for decision support, the full uncertainty facing
the decision maker needs to be considered. In the so-called Dina pig model (Jørgensen & Kristensen, 1995) the intention is to include the full uncertainty in the model, i.e., the uncertainty
in model parameters is included as well as the usual uncertainty in system development with
known parameters.
∗ E-mail: [email protected], WWW: http://www.jbs.agrsci.dk/~ejo/
Taking the Dina pig model as the starting point, this paper illustrates approaches toward correct
specification of input parameters.
1.1 Elements of Simulation models
Initially, we will present some elements of the Monte Carlo simulation method, mainly to introduce the notation used in the paper. Details of the methods can be found in textbooks
such as Fishmann (1996).
Essentially, the Monte Carlo simulation method is a method for evaluating an integral
\[
\Psi = E_\pi\{U(X)\} = \int U(x)\,\pi(x)\,dx \tag{1}
\]
where Eπ{·} is the expectation with respect to the probability density π and U(·) is some response
function, e.g., a utility function. It involves generating random draws X = x^(j) from the target
distribution π and then estimating Ψ by
\[
\hat{\Psi} = \frac{1}{k}\left\{ U(x^{(1)}) + \cdots + U(x^{(k)}) \right\} \tag{2}
\]
In our context, X = {Θ, Φ} is a vector consisting of decision parameters, Θ, and system
parameters and state variables, Φ. The Monte Carlo method is thus a numeric method for
evaluating the integral in Eq. (1). In addition, if the random draws x^(j) are independent, we can
easily obtain an estimate of the error of the approximation, using the Central Limit Theorem,
(see e.g., Fishmann, 1996, section 2.2).
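The estimator in Eq. (2) and its error estimate can be sketched in a few lines. The following is only an illustration under stated assumptions: the function name monte_carlo_estimate and the toy target (X standard normal, U(x) = x^2, so that Ψ = 1) are introduced here and are not part of the Dina pig model.

```python
import math
import random


def monte_carlo_estimate(utility, draw, k, seed=0):
    """Estimate Psi = E{U(X)} from k independent draws.

    Returns the estimate of Eq. (2) and its standard error, which relies on
    the Central Limit Theorem and therefore on independent draws.
    """
    rng = random.Random(seed)
    values = [utility(draw(rng)) for _ in range(k)]
    mean = sum(values) / k
    variance = sum((v - mean) ** 2 for v in values) / (k - 1)
    return mean, math.sqrt(variance / k)


# Toy target (assumption for illustration): X ~ N(0, 1), U(x) = x^2, Psi = 1.
psi_hat, std_err = monte_carlo_estimate(
    utility=lambda x: x * x,
    draw=lambda rng: rng.gauss(0.0, 1.0),
    k=20000,
)
```

An approximate 95% interval for Ψ is then psi_hat ± 2 · std_err.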
Often, it is an advantage to reformulate the integral in Eq. (1) by splitting Φ into the so-called
state of nature, ΦO, and the parameters and state variables Φ·s = {Φ1s, Φ2s, ···, ΦTs} that are
calculated by the model. (The additional index denotes the model step, e.g., model time.) A subset
Ω of Φ·s is called the output of the model. This splitting of the parameter vector leads to a
reformulation of Eq. (1)
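\[
\hat{\Psi} = E_{\pi_O}\left\{ E_{\pi_{S|O}}\{U(X)\} \right\}
= \int \left[ \int U(x)\,\frac{\pi(x)}{\pi_O(\Phi_O)}\,d\{\Theta, \Phi_{\cdot s}\} \right] \pi_O(\Phi_O)\,d\Phi_O \tag{3}
\]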
where EπS|O {U (X)} denotes the conditional expectation of U (X) for a given state of nature
ΦO .
The dimension of ΦO will in general be fixed by the model structure, while the number of elements in Φ·s will vary with different decisions and different combinations of the other elements
in Φ. Disregarding the problem of dimensionality, the integration with respect to ΦO is well behaved and lends itself to techniques other than simple Monte Carlo simulation. In contrast, the
integration with respect to Φ·s is so complex that it is only feasible using the Monte
Carlo method. (Note that the dimension of ΦO in such models is often in excess of one hundred, so
even though the integration is well behaved, the evaluation of the integral is complicated.)
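The nested structure of Eq. (3) can be illustrated with a small sketch: an outer loop draws a state of nature, and an inner loop runs the simulation model conditional on that draw. All names (nested_estimate, draw_state_of_nature, simulate) and the toy model below are hypothetical and only stand in for a real simulation model.

```python
import random


def nested_estimate(draw_state_of_nature, simulate, utility, n_outer, n_inner, seed=0):
    """Nested Monte Carlo estimate of the double expectation in Eq. (3)."""
    rng = random.Random(seed)
    outer_means = []
    for _ in range(n_outer):
        phi0 = draw_state_of_nature(rng)  # outer draw from pi_O
        # Inner draws: replicate the simulation model for this state of nature.
        inner = [utility(simulate(phi0, rng)) for _ in range(n_inner)]
        outer_means.append(sum(inner) / n_inner)  # estimate of E_{S|O}{U(X)}
    return sum(outer_means) / n_outer


# Toy model (assumption): state of nature ~ N(10, 1), output = state + N(0, 2) noise.
psi_nested = nested_estimate(
    draw_state_of_nature=lambda rng: rng.gauss(10.0, 1.0),
    simulate=lambda phi0, rng: phi0 + rng.gauss(0.0, 2.0),
    utility=lambda y: y,
    n_outer=2000,
    n_inner=5,
)
```

In this sketch both integrations are done by simple Monte Carlo; the outer loop is the part that could be replaced by another integration technique.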
1.2 Additional information concerning State-of-Nature
Often, we want to use the model in a specific context, e.g., to predict the effect of different
production strategies within a specific herd. In this case, we have additional information concerning the model parameters, i.e., registrations y related to the model parameters. We then
want to base our inference on the conditional distribution of the parameters given
the observations, π0 (Φ0 |y).
Note that this implies that we additionally specify a model of the joint distribution of the parameters Φ0 and the observations y. But this is exactly the purpose of our simulation model,
i.e., the output parameters Ω are usually observable and the observations y are a subset of Ω. In Jørgensen (2000a), calibration of model parameters with observations of model output parameters
is described. However, due to complexity issues this approach has limitations.
In the present context, we will therefore concentrate on the situation, where we are able to
specify an alternative model of the relation between the observations and the model parameters
π(Ω|Y = y, Φ0 ) = π(Ω|Φ0 )π(Φ0 |Y = y)
though there is an inherent inconsistency in the approach as y ⊆ Ω. Of course, we may argue
that y is independent of the features we explore in the model, i.e., decisions and capacity
restrictions.
The problem handled in the present paper is how to specify the joint probability distribution of
the parameters πO (ΦO ), in order to make it possible to draw pseudo-random instantiations Φ0^(i)
of the distribution. The presentation is structured as a description of three cases. The approach
described is used in the Dina pig simulation model (Jørgensen & Kristensen, 1995), and may
be combined with recent advances within Bayesian statistics.
1.3 The framework for specification
The specification of the prior distribution is similar to the specification needed in Bayesian
approaches to statistical analysis and learning in expert systems (Spiegelhalter et al., 1993,
1996). One widely used program is WinBUGS (Spiegelhalter et al.,
1999), which is intended for inference in graphical models using the Markov
Chain Monte Carlo approach. The original intention in the Dina pig model was to use the
WinBUGS language for the specification. However, in most cases the use of WinBUGS would
be too inconvenient. Under the assumption of independence between parameters, the graphical
model is simply a set of disconnected nodes. Therefore, the model specification in the Dina pig
model follows the specification language in WinBUGS, but is integrated into the general model
specification.
2 Case 1: Independent parameters
When specifying a probability distribution for the parameters in the state of nature, a simplifying
assumption is that the parameters are independent. In the Dina pig model this is the standard
assumption.
The independence assumption implies that the joint density of the parameters is simply the
product of the density of each individual parameter i.e.,
π(ΦO ) = π(Φ0,1 ) · π(Φ0,2 ) · · · · π(Φ0,n )
That is, for each state parameter Φ0,i in the model, instead of only specifying the expected value,
we have to select a probability distribution and the parameters describing this distribution.
Following standard practice, we call the latter hyper-parameters.
Very often it is most natural to specify the distribution of the parameter on a different scale
than the actual parameter. Parameters describing proportions may be specified on a logit scale,
and a log-normal distribution may be natural for some parameters. As an example, parameter
values describing time until an event (i.e., positive values) may often be described as following a
lognormal distribution. In that case, the normal distribution is selected as the distribution with
corresponding hyper-parameters, and the transformation is the exponential function exp(). The
available distributions and transformations closely follow the notation in the WinBUGS manual.
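A minimal sketch of such a specification is given below. It assumes each parameter is described by a normal distribution (mean and precision) on a working scale together with a back-transformation. The parameter names and hyper-parameter values are hypothetical and are not taken from the Dina pig model.

```python
import math
import random


def inv_logit(z):
    """Back-transformation from the logit scale to a proportion."""
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical specification: name -> (mean, precision, back-transformation).
PRIORS = {
    "days_to_event": (math.log(20.0), 1.0 / 0.2 ** 2, math.exp),   # lognormal
    "proportion": (0.0, 1.0 / 0.5 ** 2, inv_logit),                # logit-normal
    "level": (5.0, 1.0 / 0.1 ** 2, lambda z: z),                   # normal
}


def draw_state_of_nature(rng):
    """Draw all parameters independently, one normal draw per parameter."""
    state = {}
    for name, (mean, precision, transform) in PRIORS.items():
        sd = 1.0 / math.sqrt(precision)  # precision = 1 / sigma^2
        state[name] = transform(rng.gauss(mean, sd))
    return state


rng = random.Random(1)
draws = [draw_state_of_nature(rng) for _ in range(1000)]
```

Because the parameters are independent, the joint draw is simply the collection of the individual draws, in line with the product form of the joint density.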
2.1 Specification of growth related parameters
One of the available growth models in the Dina pig model, is an extension of a simple Gompertz
growth model, as described in Jørgensen (1998). We will use this model as an example of the
specification of the prior distribution. The Gompertz growth model in its standard format is
\[
\frac{dW}{dt} = k\,\{K - \ln(W_t)\}\,W_t \tag{4}
\]
where Wt is the weight at time t, k is the growth rate and K is the logarithm of asymptotic
maximum weight. This produces a sigmoid curve that closely corresponds to the growth of
the pig. Notice, that the description of K should not be taken literally. Extrapolation from
measurements during the slaughter pig growth phase to the age, when maximum weight is
approached, is not reliable. The basic formula in Eq. (4) is modified in the simulation model,
but the basic formula may still be recognised.
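As a worked step, the standard form in Eq. (4) can be solved in closed form, which is convenient when checking parameter values by hand. Writing y_t = ln(W_t) and W_0 for the weight at t = 0 (symbols introduced here for illustration), Eq. (4) becomes linear in y_t. The result applies only to the standard Gompertz form, not to the modified version used in the simulation model.

\[
\frac{dy_t}{dt} = k\,(K - y_t)
\quad\Longrightarrow\quad
y_t = K - (K - y_0)\,e^{-kt}
\quad\Longrightarrow\quad
W_t = \exp\left\{ K - (K - \ln W_0)\,e^{-kt} \right\}
\]

As t grows, W_t approaches exp(K), in agreement with the interpretation of K as the logarithm of the asymptotic maximum weight.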
A growth parameter called the current herd level at time t, kht, follows a first order stationary
autoregressive process with
\[
k_{h(t+\Delta_t)} = \mu_{kh} + \alpha(\Delta_t)\,(k_{ht} - \mu_{kh}) + \beta(\Delta_t)\,\varepsilon_h \tag{5}
\]
where µkh is the expected level (e.g., the population expectation) and εh is random noise with
εh ∼ N(0, σh²). α(∆t) = exp(−α0 ∆t) and β(∆t) = √(1 − α(∆t)²) are the autoregression parameters
with the varying length of the time steps taken into account. 1 − α0 corresponds approximately
to the usual autoregression parameter with time step 1.
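The process in Eq. (5) can be simulated directly for unequal time steps. The sketch below is an illustration only: the function names, the time points and the parameter values in the example call are assumptions and not the values used in the Dina pig model.

```python
import math
import random


def step_herd_level(k_now, dt, mu, alpha0, sigma_h, rng):
    """One step of Eq. (5) for a time step of length dt."""
    alpha = math.exp(-alpha0 * dt)
    beta = math.sqrt(1.0 - alpha ** 2)
    return mu + alpha * (k_now - mu) + beta * rng.gauss(0.0, sigma_h)


def simulate_herd_level(times, mu, alpha0, sigma_h, seed=0):
    """Simulate the herd level at the given (possibly irregular) time points."""
    rng = random.Random(seed)
    k = rng.gauss(mu, sigma_h)  # start in the stationary distribution
    path = [k]
    for t_prev, t_next in zip(times, times[1:]):
        k = step_herd_level(k, t_next - t_prev, mu, alpha0, sigma_h, rng)
        path.append(k)
    return path


# Illustrative values only (assumptions).
times = [0, 7, 10, 30, 31, 90, 180, 365]
path = simulate_herd_level(times, mu=0.0116, alpha0=0.0003, sigma_h=0.0008)
```

Since α(∆t)² + β(∆t)² = 1, the marginal variance of the herd level stays at σh² whatever the step length, which is what makes the process stationary under irregular time steps.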
The individual growth parameter for each pig is kpig. kpig is drawn from a normal distribution
with expectation equal to the herd level at the time of the pig’s introduction into the herd, i.e.,
kpig ∼ N(kht, σk²), where t is the time of the pig’s introduction into the herd.
The specification of the herd level of growth rate kh is based on estimates of daily gain from
production data bases. Typically, the herd level of daily gain varies between 700
and 1000 g, roughly corresponding to a standard deviation of 300/4 ≈ 75 g. As the daily gain is a
function of the k parameter as well as the K parameter, we use a first order Taylor approximation,
i.e.,
\[
dg \approx f(k_0, K_0) + f'_k(k_0, K_0)\,(k - k_0) + f'_K(k_0, K_0)\,(K - K_0)
\]
as basis for an approximate variance
\[
V(dg) \approx \{f'_k(k_0, K_0)\}^2\,V(k) + \{f'_K(k_0, K_0)\}^2\,V(K)
\]
Furthermore, we assume that 90% of the variation is due to variation in kh. From these assumptions we find that K ∼ N(5.40, 1/0.027²) and kh ∼ N(0.0116, 1/0.000843²). (The normal
distribution is parameterised with mean and precision (= 1/σ²) following WinBUGS.) With
the mean parameter values the average daily gain is 885 g, based on the growth of a single animal
from 77 to 175 days.
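The variance approximation can be checked numerically. The sketch below rests on several assumptions that are not part of the paper: it uses the closed-form solution of the standard (unmodified) Gompertz model, measures age in days from birth, takes a hypothetical birth weight of 1.5 kg, reads the standard deviations 0.000843 and 0.027 from the precisions quoted above, and obtains the partial derivatives by central differences.

```python
import math


def gompertz_weight(t, k, K, w0):
    """Weight (kg) at age t (days) from the standard Gompertz model."""
    return math.exp(K - (K - math.log(w0)) * math.exp(-k * t))


def daily_gain(k, K, w0=1.5, t_start=77.0, t_end=175.0):
    """Average daily gain (g/day); w0 = 1.5 kg is a hypothetical birth weight."""
    gain_kg = gompertz_weight(t_end, k, K, w0) - gompertz_weight(t_start, k, K, w0)
    return 1000.0 * gain_kg / (t_end - t_start)


def delta_method_variance(k0, K0, var_k, var_K, h_k=1e-6, h_K=1e-4):
    """First order Taylor (delta method) variance of daily gain.

    Returns the approximate variance and the share of it due to k.
    """
    f_k = (daily_gain(k0 + h_k, K0) - daily_gain(k0 - h_k, K0)) / (2.0 * h_k)
    f_K = (daily_gain(k0, K0 + h_K) - daily_gain(k0, K0 - h_K)) / (2.0 * h_K)
    part_k = f_k ** 2 * var_k
    part_K = f_K ** 2 * var_K
    return part_k + part_K, part_k / (part_k + part_K)


var_dg, share_k = delta_method_variance(0.0116, 5.40, 0.000843 ** 2, 0.027 ** 2)
sd_dg = math.sqrt(var_dg)
```

Under these assumptions the sketch gives a daily gain, a standard deviation and a share of variance due to k of the same order as the figures quoted in the text, although the simulation model itself uses a modified growth formula.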
α0 is selected to obtain a correlation between herd levels 3 months apart of between 0.95 and
0.99, i.e., α0 ∼ N(0.0003, 1/0.0001²). The variation in herd level, σh, is specified to reflect that the
variance within the herd is assumed to be between 0.25 and 0.5 of the total variance between
herds. A lognormal scale is assumed, i.e., log(σh²) ∼ N(−6.6, 176.56). Variation between
pigs consists of a ’genetic’ part and a ’random walk’ part. The specification is based on the
assumption that after 90 days of growth the width of the confidence interval for live weight is
30 kg, corresponding to a standard deviation of 30/4 kg. Between 1/3 and 2/3 of the variance is
’permanent’, corresponding to a σk in the interval [0.00057, 0.0008061]. Therefore we select the
distribution σk ∼ N(9.5 × 10⁻⁴, 2.4 × 10⁷). The random walk part corresponds
to an additional standard deviation in daily gain uniformly distributed on [0.25, 0.75].
Similar considerations are made for the specification of the other model parameters, i.e., parameters describing start weight, feed intake, feed waste, slaughter waste (killing out percentage),
and the relationship between live weight and meat percentage. The available space does not allow
us to present the data. Using the Dina pig model, the kernel density plots shown in Fig. 1 are
produced for each input parameter.
[Figure: kernel density plots with panels k, Std.k and Std.dailygain]
Figure 1: Prior distribution of variables. The ’rug’ indicates the values used in actual simulation runs.
3 Case 2: Using samples from Markov Chain Monte Carlo
The second case is taken from a study concerning the precision of clinical diagnosis (Bådsgaard
& Jørgensen, 2000). The results will be used in section 4 as well. The case has been selected
because it illustrates a situation that is almost standard when using simulation models for decision support. A hierarchical model is used for describing a population of subjects based on
empirical data. When using the simulation model, we want to refer to a subject (e.g., a herd)
from this population, either with no further information on the subject or with some additional
information (e.g., previous performance) on the subject. In this case, the prior distribution is
estimated using the Markov Chain Monte Carlo approach via WinBUGS.
For clinical diseases, estimation of herd prevalence relies on how precise the veterinarian is.
The precision is usually expressed as sensitivity, SE, the probability of correct identification of
a diseased animal, and specificity, SP, the probability of correct identification of a healthy
animal. SE and SP influence the observed prevalence in the herd.
Consider the case where a veterinarian inspects 10 animals. We want to estimate the probability
of observing nobs diseased animals, conditional on the true prevalence in the herd pdis, i.e.,
\[
\Pr(n_{obs} = i \mid p_{dis} = p) = \int_0^1\!\!\int_0^1 \Pr(n_{obs} = i \mid p_{dis} = p, SE = u, SP = v)\,\pi(u, v)\,du\,dv \tag{6}
\]
where π(u, v) is the joint probability density of sensitivity and specificity of the veterinarian.
In this context, the simulation model is very simple, i.e., with known parameters the observed
number of diseased animals simply follows the binomial distribution. However, the parameters
are not known.
Our knowledge concerning the specificity and sensitivity of the veterinarian may either arise
from the specific knowledge of the vet based on previous observations, or from our general
knowledge concerning the population of veterinarians.
This is illustrated in Fig. 2. In the study (Bådsgaard & Jørgensen, 2000) the distribution (Vetpop)
of the precision parameters (Veti = {SE, SP}) was quantified using an experimental setup
where 4 veterinarians (only two are shown in the figure) simultaneously assessed clinical symptoms of a
total of 155 animals. In the present context, we want to use the information from this study for
estimation of π(u, v) in Eq. (6). Two situations will be addressed: the veterinarian is either a specific veterinarian
who participated in the quantification study (Vet2) or a different veterinarian selected at random from
the population (Vet3).
[Figure: graphical model with a "Quantification Study" part and a "Simulation" model part; node labels Vetpop, Vet1, Vet2, Vet3, Sympi1, Sympi2, Sympi3, Sympi5, Statei, Herd0, Herd1, Herd2, Herdpop]
Figure 2: Schematic illustration of the clinical setup and our study.
The WinBUGS analysis produces a sample {(SE^(1), SP^(1)), (SE^(2), SP^(2)), ..., (SE^(n), SP^(n))}
from the relevant distributions, as illustrated by the kernel densities in Fig. 3 (n is the sample
size). Note that we may be able to approximate the joint distribution using some standard
probability distribution such as the multivariate normal distribution. However, it is not obvious
to what extent the parameters will follow such a distribution. Furthermore, the effort would be
wasted: for our purpose, we need exactly what the MCMC approach produces, a sample
from the correct joint distribution.
In the present case a sample of 10,000 was produced. To estimate the probability in Eq. (6)
we proceed as follows for a given true prevalence ph. First, the probability of observing disease
symptoms is calculated
\[
p_o^{(i)} = SE^{(i)}\,p_h + (1 - p_h)\,(1 - SP^{(i)})
\]
Then the number of animals with disease symptoms, no^(i), is drawn from the binomial distribution,
i.e., no^(i) ∼ Binomial(10, po^(i)). Finally, the distribution of no^(i) over all i is used for finding the
desired probabilities.
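The procedure can be sketched as follows. The MCMC sample from the study is not reproduced here, so the sketch uses a hypothetical stand-in sample in which logit(SE) and logit(SP) are drawn independently from normal distributions. The function name and all numerical values are assumptions for illustration.

```python
import math
import random


def inv_logit(z):
    return 1.0 / (1.0 + math.exp(-z))


def observed_count_distribution(precision_sample, p_h, n_animals=10, seed=0):
    """Estimate Pr(n_obs = i | p_dis = p_h) for i = 0..n_animals.

    precision_sample is a list of (SE, SP) pairs, e.g. MCMC output.
    """
    rng = random.Random(seed)
    counts = [0] * (n_animals + 1)
    for se, sp in precision_sample:
        # Probability that an inspected animal shows disease symptoms.
        p_o = se * p_h + (1.0 - p_h) * (1.0 - sp)
        # One Binomial(n_animals, p_o) draw as a sum of Bernoulli trials.
        n_o = sum(rng.random() < p_o for _ in range(n_animals))
        counts[n_o] += 1
    total = len(precision_sample)
    return [c / total for c in counts]


# Hypothetical stand-in for the MCMC sample of (SE, SP).
rng = random.Random(42)
sample = [
    (inv_logit(rng.gauss(1.5, 1.0)), inv_logit(rng.gauss(2.5, 1.0)))
    for _ in range(10000)
]
probs = observed_count_distribution(sample, p_h=0.5)
```

Repeating the call for a set of prevalence values gives one row of a probability table per prevalence level.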
In the present case the "simulation" model is so simple that calculation time and sample size are
of (almost) no concern. However, with more complicated simulation models this issue becomes
important. In contrast to the previous case, the samples produced by the MCMC approach are
not independent. Therefore, the precision of the output of the simulation model is not simply
σ̂/√n.
[Figure: two kernel density panels, a) Density against logit(SE) and b) Density against logit(SP)]
Figure 3: Kernel density estimates on logit scale of sensitivity (a) and specificity (b) for a random veterinarian and for veterinarian no. 1.
Table 1: Distribution (%) of number of diseased from clinical inspection with different herd health state.
Random veterinarian.

Herd                              Number of diseased (clinical)
Health       0      1      2      3      4      5      6      7      8      9     10
0        14·34  28·71  27·25  17·92   8·17   2·77   0·68   0·14   0·02   0·00   0·00
1-10      5·65  17·59  26·23  24·20  16·05   6·97   2·79   0·41   0·09   0·02   0·00
10-40     0·94   4·83  12·75  19·65  22·84  19·02  12·14   5·74   1·71   0·34   0·03
40-60     0·02   0·33   1·56   5·39  12·03  20·50  23·87  20·16  11·62   3·86   0·69
> 60      0·00   0·00   0·05   0·30   1·28   4·21   9·41  16·69  23·78  25·53  18·76
Table 2: Distribution (%) of number of diseased from clinical inspection with different herd health state. Vet.
no. 1 from experiment

Herd                              Number of diseased (clinical)
Health       0      1      2      3      4      5      6      7      8      9     10
0        43·32  21·48  11·78   7·24   5·25   3·40   2·60   1·91   1·19   1·17   0·66
1-10     18·12  25·70  21·13  13·26   7·89   4·69   3·23   2·12   1·83   1·28   0·75
10-40     3·64  10·46  16·77  19·46  17·41  13·00   8·78   4·87   3·05   1·63   0·93
40-60     0·52   1·45   4·07   8·95  14·65  19·69  20·07  15·20   9·29   4·65   1·48
> 60      0·30   0·43   0·86   1·56   3·09   6·60  11·09  16·94  21·00  21·62  16·53
3.1 Conclusion
The Markov Chain Monte Carlo methods seem ideally suited for the
specification of prior distributions for use in simulation modelling. Even if standard statistical
models such as generalized linear models may be more expedient for experimental analysis,
MCMC may still be relevant because a random sample from the population is automatically
produced. The only problem is that the samples are not drawn independently, but to a large
extent this can be remedied by thinning the sample.
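The effect of thinning can be illustrated with a small sketch. Since the actual MCMC output is not available here, a first order autoregressive chain with correlation 0.9 is used as a hypothetical stand-in. The function names and the thinning interval of 25 are assumptions.

```python
import random


def lag1_autocorrelation(xs):
    """Sample autocorrelation at lag 1."""
    n = len(xs)
    mean = sum(xs) / n
    num = sum((xs[i] - mean) * (xs[i + 1] - mean) for i in range(n - 1))
    den = sum((x - mean) ** 2 for x in xs)
    return num / den


def thin(chain, step):
    """Keep every step-th element of the chain and discard the rest."""
    return chain[::step]


# Stand-in for MCMC output (assumption): AR(1) chain with correlation 0.9.
rng = random.Random(3)
rho, x, chain = 0.9, 0.0, []
for _ in range(50000):
    x = rho * x + rng.gauss(0.0, 1.0)
    chain.append(x)

thinned = thin(chain, 25)
rho_raw = lag1_autocorrelation(chain)
rho_thinned = lag1_autocorrelation(thinned)
```

The price of thinning is a smaller sample, so the thinning interval is a trade-off between the remaining dependence and the number of simulation runs that can be afforded.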
4 Case 3: Sampling from Expert system (Bayesian network)
The third case is taken from a project concerning intervention strategies for respiratory diseases, as presented in Otto (2000). The system uses the HUGIN™ program to formulate
a probabilistic expert system for diagnosis and error detection concerning Mycoplasma. The
present example is a slight modification of an example described in detail in Jørgensen (2000b).
The final system will in addition to the diagnostic network include a module for Monte Carlo
assessment of cost-benefit of different control strategies.
The prevalence of the disease is expected to depend on management level and two risk factors.
The prevalence may be observed either by the farmer or by a veterinarian. The precision of
the farmer’s observation depends on his ability as a manager. Disease prevalence and quality of
management influence growth rate.
[Figure: Bayesian network with nodes Manage, Gain, Risk1, Risk2, Preval, Farmobs and VetFind]
Figure 4: Hugin Expert system
The quantification of the dependencies in Fig. 4 is based upon Stärk et al. (1998). Two of
the risk factors in Table 10 of that study have been selected with corresponding parameter estimates. Two
additions have been made: the overdispersion has been modelled by a random herd effect,
and an additional management factor not included in the study has been added for illustration. The
parameters from the logistic regression in Stärk et al. (1998) have thus been supplemented with
the effect of management and between-herd variation. Three factors influence herd prevalence:
management quality (Manage), manure removal in nursery (Risk1) and number of pigs in room
(nursery) (Risk2). Each factor is categorized into discrete
levels. The detailed model parameters are described in Jørgensen (2000b). The parameters are
used to specify the necessary probability distribution tables in the Bayesian network in Fig.
4. The Prevalence node is defined as a continuous variable, the prevalence of serologically
positive animals. For the purpose of the model the prevalence node is divided into 5 categories:
no disease, 1-10 percent disease, 10 to 40 percent disease, 40 to 60 percent
disease, and above 60 percent disease.
Based on the model we can calculate the probability of being in each of these different categories of disease level for each combination of the parent nodes (risk factors). To illustrate, the
probability distribution is shown for a selected part of the combinations of risk factors in Table
3.
Table 3: Distribution (%) of herd health level for average management and selected risk factors.

No. of pigs              1st quartile                  2nd quartile
Manure removal     < daily   daily  > daily      < daily   daily  > daily
Herd health
0                     0·43   11·08    12·79         0·01    1·22     1·52
1-10                 40·47   76·93    76·86        10·37   54·58    57·65
10-40                53·19   11·83    10·24        59·87   41·57    38·68
40-60                 5·03    0·14     0·11        20·78    2·33     1·92
> 60                  0·88    0·01     0·01         8·97    0·30     0·23
The next step in the modelling is the specification of the problem detection by the farmer. In
the present example Farm_obs is defined with two levels, No problem observed and Problem
observed. In his daily work, the farmer assesses the disease level continuously, but the measurement is not necessarily very precise. Furthermore, the observations may not lead to a problem
detection, because the farmer may suppose that he is looking at a normal disease level, i.e.,
his threshold for problem detection is high. A natural model of the farmer’s observation is that
good farmers are more precise in their observation, and that they tend to react to lower levels of disease. Based on these assumptions the probability table of Farm_obs conditioned on
management quality and health problem is specified.
The next node is the veterinarian’s diagnosis, i.e., he visits the farm, samples 10 animals at
random and makes a clinical inspection of each animal. The outcome of the clinical inspection
is the number of diseased animals, i.e., the states of the Vet_Find1 node are {0, . . . , 10}. If we
know the herd prevalence and the precision of the veterinarian, we can calculate the probability
distribution of the number of diseased based on the assumptions above. This is exactly the probability table that was specified in section 3 and Tables 1 and 2. Of course, the table needs to reflect
our knowledge concerning the veterinarian.
In the typical use of the expert system, we need to base our inference on evidence on a minimum
of two nodes. The farmer will have detected a problem in the herd, and we will have the
result of the veterinarian’s inspection of the sample. Conditional on this evidence we need to
advise the farmer whether he should change his production strategy and how it should be changed.
Our expectation towards future production strategies will of course depend on the risk factors
actually causing the problem. A high stocking rate might suggest an increase of herd size
combined with sectioned production. But if there is poor management as well, the full benefit of
sectioned production might not be obtained. The cost-benefit of control strategies thus depends
on the combination of risk factors present.
The Bayesian network contains the full joint probability of these combinations, and in the program Hugin a random sample from this joint distribution may readily be found, using existing
procedures in the application programming interface (simulate). In contrast to the output
from the MCMC method, the subsequent samples from Hugin are independent samples from
the distribution.
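To indicate what such sampling involves, the sketch below draws independent samples from a small discrete Bayesian network by forward sampling and conditions on evidence by simple rejection. It does not use or imitate the Hugin application programming interface. The three-node chain, the probability tables and all function names are hypothetical.

```python
import random

# Hypothetical conditional probability tables for the chain
# Manage -> Preval -> VetFind; the numbers are made up for illustration.
P_MANAGE = {"good": 0.5, "poor": 0.5}
P_PREVAL = {"good": {"low": 0.8, "high": 0.2}, "poor": {"low": 0.4, "high": 0.6}}
P_VETFIND = {"low": {"no": 0.9, "yes": 0.1}, "high": {"no": 0.3, "yes": 0.7}}


def draw_category(table, rng):
    """Draw one state of `table` with the probability given by its value."""
    u, cumulative = rng.random(), 0.0
    for state, prob in table.items():
        cumulative += prob
        if u < cumulative:
            return state
    return state  # guard against rounding in the cumulative sum


def forward_sample(rng):
    """Sample parents before children, so each draw is from the joint distribution."""
    manage = draw_category(P_MANAGE, rng)
    preval = draw_category(P_PREVAL[manage], rng)
    vetfind = draw_category(P_VETFIND[preval], rng)
    return {"Manage": manage, "Preval": preval, "VetFind": vetfind}


def sample_given_evidence(evidence, n, seed=0):
    """Rejection sampling: keep forward samples that agree with the evidence."""
    rng = random.Random(seed)
    kept = []
    while len(kept) < n:
        s = forward_sample(rng)
        if all(s[node] == value for node, value in evidence.items()):
            kept.append(s)
    return kept


samples = sample_given_evidence({"VetFind": "yes"}, n=5000)
share_poor = sum(s["Manage"] == "poor" for s in samples) / len(samples)
```

Each retained sample is an independent instantiation of the risk factors given the evidence, and can be passed on as a state of nature to the simulation model. Rejection is wasteful when the evidence is unlikely, which is one reason to rely on the procedures built into an expert system shell instead.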
5 Conclusion
In the present paper different approaches towards specification of the prior distribution of "state-of-nature" parameters have been presented. The conclusion is that such a specification can
be made readily using ’off-the-shelf’ methods, and that the possibilities for handling prior evidence
concerning these parameters are good. The only word of caution is that the sample produced
by the MCMC does not consist of independent instantiations from the distribution, but this can
be remedied by thinning, i.e., by simply discarding instantiations.
However, it should be noted that the techniques are restricted to relationships between model
parameters and evidence, where it is not important to use the full simulation model. This relates
especially to capacity restrictions and interactions between animals. The ideas in Jørgensen
(2000a) may be used in such cases.
Another important aspect not covered is the state of the individual animals currently present in the
herd. It is not possible to estimate the current state of an individual in the herd without taking
observations and decisions into account. The individual remains in the herd because it has been
decided not to cull it. We need techniques to calculate the probability distribution of the current
state given the evidence that the animal is alive.
If the use of simulation models is restricted to steady state results of production strategies, we
can avoid this problem.
References
Bådsgaard, N.P. & E. Jørgensen (2000). A Bayesian approach to estimating the reliability
of clinical observations – With an application to herd prevalence estimation. Preventive
Veterinary Medicine, in prep.
Fishmann, G.S. (1996). Monte Carlo. Concepts, Algorithms, and Applications. Springer-Verlag
New York, Inc.
Jørgensen, E. (1998). Stochastic modelling of pig production. Working Paper: Growth Models.
Dina Notat, 73 pp. 1–23. URL: http://www.jbs.agrsci.dk/~ejo/growth.eps.
Jørgensen, E. (2000a). Calibration of a Monte Carlo Simulation Model of Disease Spread in
Slaughter Pig Units. Computers and Electronics in Agriculture, 25 pp. 245–259. URL:
http://www.jbs.agrsci.dk/~ejo/dinapig/compefit.ps.
Jørgensen, E. (2000b). Elements of Bayesian network specification in an animal health research
project. Internal report, Biometry Research Unit, Danish Institute of Agricultural Sciences, 2000-2 pp. 1–24. URL: http://www.jbs.agrsci.dk/~ejo/dinapig/diag1504a.pdf.
Jørgensen, E. & A.R. Kristensen (1995). An object oriented simulation model of a pig herd
with emphasis on information flow. In FACTs 95 March 7, 8, 9, 1995, Orlando Florida,
Farm Animal Computer Technologies Conference, pp. 206–215.
Otto, L. (2000). Mycoplasma for pigs in a Bayesian Network: A decision support system. In
Proc. "Economic modelling of Animal Health and Farm Management". November 23-24,
2000 Wageningen.
Spiegelhalter, D.J., A.P. Dawid, S.L. Lauritzen, & R.G. Cowell (1993). Bayesian Analysis in
Expert Systems. Statistical Science, 8(3) pp. 219–283.
Spiegelhalter, D.J., A. Thomas, & N. Best (1996). Computation on Bayesian Graphical Models.
Bayesian Statistics, 5 pp. 407–425.
Spiegelhalter, D.J., A. Thomas, N. Best, & W. Gilks (1999). WinBUGS. Version 1.2 User
Manual. MRC Biostatistics Unit. URL: http://weinberger.mrc-bsu.cam.ac.uk/bugs/Welcome.html.
Stärk, K.D.C., D.U. Pfeiffer, & R.S. Morris (1998). Risk factors for respiratory disease in New
Zealand pig herds. New Zealand Veterinary Journal, pp. 1–10.