
Mathematical Institute
Master Thesis
Statistical Science for the Life and Behavioural Sciences
Modelling Measurement Errors in Linked
Administrative and Survey Data
Author:
Samuel Robinson
Supervisor:
Dr. Johannes Schmidt-Hieber
University of Leiden
External Supervisor:
drs. Sander Scholtus
Centraal Bureau voor de Statistiek
October 2016
Contents

1 Introduction
2 Structural Equation Model
3 Finite Mixture Models
4 Contamination-Error Model
   4.1 Contaminated Multi-Source Data Properties
   4.2 Mixing Weights
   4.3 Identifying Error Categories
5 EM Algorithm
6 Applying the EM Algorithm to Fit the Basic Model
   6.1 Initial Parameter Estimates
      6.1.1 Error Proportions
      6.1.2 Regression Coefficients
      6.1.3 True Score Variance
      6.1.4 Observed Score Error Variances
   6.2 Complete-data loglikelihood
   6.3 E Step
      6.3.1 Error Category Probabilities
      6.3.2 Conditional Expected Error Labels
      6.3.3 Conditional Expected True Scores
   6.4 Conditional Expected True Score Squared
      6.4.1 Incomplete-data loglikelihood
   6.5 Maximisation Step
      6.5.1 Updating the Error Proportions
      6.5.2 Updating the Regression Coefficients
      6.5.3 Updating the Variances
7 Combining the Sources with Gold Standard Data
8 Model 2: Including Slope and Intercept
   8.1 Initial Estimates
   8.2 Expectation Step
   8.3 Mahalanobis Distance
   8.4 Standard Errors
9 Simulation
10 Real Data Analysis
   10.1 Data Description
   10.2 Log Transformation
   10.3 Results
11 Comparison of Methods
12 Conclusion
   12.1 Future Considerations
Chapter 1
Introduction
Statistics Netherlands produces official statistics on many different topics. In some cases, data are available from multiple sources on (in theory) the same variable. For example, administrative data collected by a government organisation may be available alongside a statistical survey measuring some of the same information, and these data sources can be linked for individual persons or businesses. Historically the primary source of data for statistical agencies has been statistical surveys, but in recent years administrative data have become more popular. The observed values in these different sources often differ, either due to measurement errors or due to discrepancies between statistical and administrative concepts. As described in Laitila et al. (2012), statistical surveys are normally associated with sample surveys, where part of the population is selected at random and estimates are derived from the sample responses. There are, however, several other possible sources for data collection and information generation. Today there is an increasing interest among statistical agencies in using administrative data for the production of official statistics (Wallgren and Wallgren, 2007). With administrative data sources available, the researcher has the additional option of a register survey, a survey based on data from administrative sources, or of a combination of sample and register surveys. Since sample surveys and register surveys can be designed in different ways, the choice in survey design today is among a set of sample surveys, register surveys and combinations of the two. In traditional sample surveys the researcher is generally able to control the way the data are collected and is therefore better able to evaluate the quality of the resulting statistics. This is not the case with register surveys, as the researcher has no control over data collection. It is still possible to assess some quality aspects using registry information, such as variable and population definitions, but other aspects, such as measurement errors, are less easy to evaluate.
The purpose of this project is to produce a model, built on existing models, which will enable accurate estimation of the measurement errors in different data sources. We will attempt to estimate the proportion of errors and other distributional properties for three different sources, two of which are administrative and the third of which is a statistical survey. Even if none of the individual sources meets the requirements for a high-quality source, they can still be combined to produce a high-quality source, or can be used as a frame of reference against which the quality of survey sources can be more precisely ascertained. The process of checking and correcting data is known as statistical data editing; some methods to achieve this are outlined in de Waal et al. (2011a). A particular branch of statistical editing is known as selective editing, which involves identifying the observations containing the largest or most important errors. The editing treatment can then be focused
on these observations to reduce the cost of this phase. This application is described in more
depth in Guarnera and di Zio (2013). Our procedure includes the estimation of expected
scores for each unit, and comparing the expected scores to the observed scores will lead to
identification of the largest errors. Our algorithm can therefore also be used for selective
editing purposes. In addition to this, by estimating the error proportions for edited datasets
our model can be used to compare different editing procedures.
Statistics Netherlands and its equivalent institute in Italy, Istat, currently use different methods to assess measurement error. The Dutch measurement model uses a linear structural equation model, which can also be extended to assess systematic measurement bias, and is described extensively in Scholtus et al. (2015). We also give a brief introduction to SEMs in the next chapter of this paper. These models, however, contain the assumption of continuous error and therefore zero probability of the error equalling zero and of the observed value being equal to the true value. An alternative approach without this restriction is the Italian measurement model described in Guarnera and Varriale (2015), an approach based on the latent class model, containing the assumption of intermittent error: a proportion of observations are affected by errors, but not necessarily all of them, and the data from a given source essentially consist of a mixture of observations from two distributions, one with and one without error. In this model a latent variable dictates the proportion of observations from each source that contain errors, and if an error is not present then the unobserved true variable and the observed variable are equal. Observed data are therefore modelled through a latent class model in which the latent variable is the binary error indicator.
The aim of this research project was first to apply the Italian latent class method to our dataset and estimate the parameters using the EM algorithm, and then to adapt this model with the introduction of intercept and slope parameters so that it is able to capture systematic bias, in the way that the Dutch model is. This adapted model was then used to assess the measurement quality of two different administrative sources measuring the turnover of Dutch businesses, compared to the measurement quality of a linked survey that was also used to gauge the same variable. Hence this paper illustrates the approach in the case of a single target variable and three linked data sources. The results of this comparison were then compared with the previous results of the same comparison obtained with the Dutch model. We will refer to the restricted model (without intercept and slope parameters) as Model 1 and to the extended model as Model 2.
This paper is structured as follows: in Chapter 2 the previous methodology is introduced, with a general outline of the application of structural equation models in this context. In Chapter 3 we discuss finite mixture models in their general form. Chapter 4 introduces a specific form of finite mixture model, the contamination-error model, together with a description of our data properties under the contaminated-source assumption. Chapter 5 describes the EM algorithm, which will be used to obtain maximum likelihood estimates of our parameters. Chapter 6 forms the bulk of this paper and describes in depth how the EM algorithm is applied to our dataset (Model 1). Chapter 7 introduces a small potential adaptation that can be made to Model 1 given the availability of 'gold standard' data. Our expanded model, Model 2, is then described in Chapter 8. This model is applied to a simulated dataset and the results are discussed in Chapter 9. Some relevant background information about our real dataset is presented in Chapter 10, after which the parameter estimates obtained by Model 2 are shown. These results are then compared, in Chapter 11, with those obtained previously by the structural equation model on the same data. The conclusions of the project are then
discussed in Chapter 12.
All statistical analyses in this project are performed in the R-software environment and
the R code used for the final analysis has been provided in the appendices, along with further
mathematical detail.
Chapter 2
Structural Equation Model
One way of estimating the properties of this type of administrative data is by using a Structural Equation Model (SEM). A SEM is defined as a system of linear regression equations,
modelling a set of observed variables as a function of underlying unobserved variables, as
well as modelling the relationships between these unobserved variables. In the context of
questionnaire design, there is a well established tradition of using linear SEMs to assess the
measurement quality of survey variables (Scholtus et al., 2015). This model can be seen as
an extension of the Classical Test Theory used to predict outcomes of psychological testing.
Each observed variable is modelled as an imperfect measure of an underlying latent variable.
The outcome is then an estimated regression line for each observed variable in relation to
the true score, which can then be rearranged to estimate the true score given the observed
variable. The SEM can be used to calculate a measure of the validity of each of the sources,
in Scholtus et al. (2015) this is given by the Indicator Validity Coefficient (IVC): a measure
based on the correlation between the true score and the observed variable. This can therefore
be used to establish whether there is an argument for using cheaper administrative data, if
the fitted data indicates that it has validity that is at least as high as the more expensive
survey data.
To achieve model identification, and therefore estimates for all the parameters, it is necessary to collect 'gold standard' data in addition to the original three sources. This gold standard data consists of the true score for each unit and in this case can be obtained by auditing the observations from one of the original sources, placing them under further scrutiny to ascertain their accuracy. It is likely to be prohibitively expensive to obtain audited scores for the whole dataset, so the gold standard data are only available for a small subsample of units. Later on we consider a different model for which it is not necessary to obtain gold standard data to achieve identification, but we nonetheless posit an expansion of this model which accounts for the potential availability of an error-free data source.
In 2015 a SEM was fitted by Scholtus, Bakker and van Delden to a dataset consisting of two administrative sources and a survey source (the same data to which we will apply our own model in a later chapter), and estimates of the unknown variance, intercept and slope parameters were obtained. These estimates are contrasted with those produced by our model in Section 10.3. The structural equation model assumes that the errors follow a continuous distribution (e.g. a normal distribution) and are independent across the observed variables. Under this assumption the probability of two observed variables being equal is zero. As we will expand upon in the next chapter, this is inconsistent with the intermittent-error nature of the data, which implies that an initially unknown proportion of the observations from each source are error free. Therefore we will now consider a different model that allows
for the possibility of two observed variables being equal. In the next chapter we will introduce
the general form of the finite mixture model.
Chapter 3
Finite Mixture Models
A finite mixture model can be defined as a convex combination of two or more probability density functions (PDFs). By combining the properties of individual density functions, finite mixture models are capable of approximating many arbitrary distributions. This flexibility makes it possible to model, using mixture components, densities that cannot be produced by standard parametric distributions. The maximum likelihood method can be used to estimate the unknown parameters of the model, and this can be implemented efficiently with the EM algorithm or a variation thereof (McLachlan and Peel, 2000).
A mixture model can be formally understood as follows. Let
$$\mathbf{y} = (y_1, \ldots, y_n)$$
be a single realised set of observations of a given variable. We can say that each observation belongs to one of $K$ different mixing components, and that the prior probability that an observation belongs to the $k$th component is denoted by $\pi_k$. We can then introduce the component label set, an $n \times K$ matrix $\mathbf{z} = (z_{11}, \ldots, z_{Kn})$. The $i$th observed score $y_i$ has a corresponding label vector
$$\mathbf{z}_i = (z_{1i}, \ldots, z_{Ki}), \tag{3.1}$$
an unobserved vector of zero-one indicator variables defining the mixture component from which it is said to have arisen. Hence $\mathbf{z}_i$ follows a multinomial distribution consisting of one draw with probabilities $\pi_1, \pi_2, \ldots, \pi_K$. If the $z_{ki}$ values were available then the $\pi_k$ probabilities could be calculated simply by averaging them,
$$\pi_k = \frac{1}{n}\sum_{i=1}^{n} z_{ki}.$$
In general, the complete data vector would be of the form
$$\mathbf{y}_{\mathrm{complete}} = (\mathbf{y}^T, \mathbf{z}^T)^T.$$
Under this model the component labels $\mathbf{z}$ are generally unavailable and are therefore treated as missing data, and the mixture distribution must be estimated from the marginal distribution of the observed data only, rather than from the joint distribution of the feature vector $y_i$ and its corresponding component label vector $\mathbf{z}_i$.
A probability density function of a mixture model is then defined by a convex combination of $K$ component PDFs (McLachlan and Peel, 2000),
$$p(y \mid \boldsymbol\theta) = \sum_{k=1}^{K} \pi_k\, p_k(y \mid \boldsymbol\theta_k), \tag{3.2}$$
where $p_k(y \mid \boldsymbol\theta_k)$ is the PDF of the $k$th component, the $\pi_k$ are the mixing proportions and $\boldsymbol\theta = (\pi_1, \ldots, \pi_K, \boldsymbol\theta_1, \ldots, \boldsymbol\theta_K)$ is the set of parameters. It is assumed that the $\pi_k$ are non-negative quantities that sum to 1,
$$\pi_k \geq 0 \text{ for } k \in \{1, \ldots, K\}, \tag{3.3}$$
and
$$\sum_{k=1}^{K} \pi_k = 1. \tag{3.4}$$
The property of convexity dictates that, given that each $p_k(y \mid \boldsymbol\theta_k)$ defines a probability density function, $p(y \mid \boldsymbol\theta)$ will also be a probability density function.
Table 3.1: A sample of five observations drawn from a three-component mixture, and their corresponding error labels

   i   z_{1i}   z_{2i}   z_{3i}   mixture component    y_i
   1     0        1        0             2            345.1
   2     0        0        1             3            554.7
   3     1        0        0             1            115.4
   4     0        0        1             3            487.7
   5     0        1        0             2            382.6
These mixing components have equal variances and have means of 200, 350 and 500
respectively. Their mixing weights are 0.25, 0.25 and 0.5. A plot of their combined density
and the weighted densities of the individual mixing components is given by Figure 3.1.
Figure 3.1: Weighted density plot of a three-mixture distribution
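As a concrete illustration, the following R sketch simulates draws from this three-component mixture and plots the weighted component densities together with the combined density. The means (200, 350, 500) and weights (0.25, 0.25, 0.5) are those quoted above; the common standard deviation used here (50) is an assumed value, as it is not stated in the text.

## Minimal sketch: simulate and plot a three-component Gaussian mixture.
## Means and weights are taken from the text; the common SD is an assumption.
set.seed(1)
means   <- c(200, 350, 500)
weights <- c(0.25, 0.25, 0.5)
sigma   <- 50                                         # assumed common SD

n <- 5000
z <- sample(1:3, n, replace = TRUE, prob = weights)   # latent component labels
y <- rnorm(n, mean = means[z], sd = sigma)            # observed scores

## Weighted component densities and the combined mixture density
grid <- seq(0, 700, length.out = 500)
dens <- sapply(1:3, function(k) weights[k] * dnorm(grid, means[k], sigma))
plot(grid, rowSums(dens), type = "l", lwd = 2,
     xlab = "y", ylab = "density", main = "Three-component mixture")
for (k in 1:3) lines(grid, dens[, k], lty = 2)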
In the contamination-error model, however, the indicator label vector and error proportions are defined slightly differently.
Chapter 4
Contamination-Error Model
The contamination-error model is a particular form of finite mixture model which assumes an intermittent error mechanism, such that a proportion of a given set of observations are 'contaminated' by a (possibly additive) error, and the remaining observations reflect the true scores. Rather than being drawn from one of $K$ mixing components with probability $\pi_k$, we can now say that each observation in the $g$th set, $\mathbf{y}_g = (y_{g1}, \ldots, y_{gn})$, contains an error with probability $\pi_g$. Furthermore, our component labels are also slightly redefined: rather than denoting whether or not an observation is drawn from a particular distribution, they now indicate whether or not that particular observation contains an error. It is not possible to identify the error proportions from these data alone. For this application of the contamination-error model, however, the complete data for the $i$th unit consist of
$$(y_{1i}, y_{2i}, y_{3i}, \ldots, y_{ki}, y_i^*),$$
where $y_i^*$ denotes the true score for unit $i$. The indicators $z_{gi}$ can then be derived from these (e.g., $z_{1i} = 1$ if $y_{1i} \neq y_i^*$ and $z_{1i} = 0$ otherwise).
Table 4.1: An example of component labels, observed scores and true scores for a contaminated dataset of 2 sources

   i   z_{1i}   z_{2i}   y_{1i}   y_{2i}   y_i^*
   1     1        0      295.4    326      326
   2     0        1      401.6    432.1    401.6
   3     1        1      122      137.1    118.4
   4     0        1      487.7    513.2    487.7
   5     0        0      228.8    228.8    228.8
For our application we will assume that our data does indeed consist of contaminated
sources. In the next section we will discuss these data properties in more detail.
4.1 Contaminated Multi-Source Data Properties
We will now consider a set of data consisting of $n$ sets of observations of responses taken from three independent sources, hereby known as sources 1, 2 and 3. The $i$th observation from source $g$ can be denoted by $y_{gi}$:
$$\mathbf{Y}_1 = \begin{pmatrix} y_{11} \\ y_{12} \\ y_{13} \\ \vdots \\ y_{1n} \end{pmatrix}, \quad
\mathbf{Y}_2 = \begin{pmatrix} y_{21} \\ y_{22} \\ y_{23} \\ \vdots \\ y_{2n} \end{pmatrix}, \quad
\mathbf{Y}_3 = \begin{pmatrix} y_{31} \\ y_{32} \\ y_{33} \\ \vdots \\ y_{3n} \end{pmatrix}. \tag{4.1}$$
Each of these observed score vectors has a corresponding latent error indicator vector
$$\mathbf{Z}_g = (z_{g1}, z_{g2}, z_{g3}, \ldots, z_{gn})^T \tag{4.2}$$
such that every observed score has a corresponding Boolean value $z_{gi}$ that indicates whether or not the observation is erroneous. If an observation does not contain an error then it is equal to the true score, denoted by $y_i^*$ for the $i$th unit. It follows that the observed score for source $g$, where $g \in \{1, 2, 3\}$, conditional on $z_{gi}$, is defined by the equation
$$y_{gi} = \begin{cases} y_i^* & \text{if there is no error on source } g\ (z_{gi} = 0) \\ y_i^* + \epsilon_{gi} & \text{if there is an error on source } g\ (z_{gi} = 1) \end{cases} \tag{4.3}$$
where the $\epsilon_{gi}$ are independent normally distributed random variables with zero mean and variance $\sigma_g^2$.

The underlying true score for the $i$th unit, $y_i^*$, is assumed to come from a normal distribution with mean $\mathbf{x}_i^T\boldsymbol\beta$ and variance $\sigma^2$:
$$y_i^* \sim N(\mathbf{x}_i^T\boldsymbol\beta,\ \sigma^2),$$
where $\mathbf{x}_i$ is the set of $p$ covariates for the $i$th set of observations and $\boldsymbol\beta$ is the corresponding coefficient vector,
$$\mathbf{x}_i = \begin{pmatrix} x_{i1} \\ x_{i2} \\ \vdots \\ x_{ip} \end{pmatrix}, \qquad \boldsymbol\beta = \begin{pmatrix} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_p \end{pmatrix},$$
and $\sigma^2$ denotes the true score variance. Another way of expressing (4.3) is
$$y_{gi} = y_i^* + z_{gi}\epsilon_{gi}.$$
This application of the model assumes a Gaussian distribution for the true data, with an additive Gaussian error applied to the observations that are considered 'contaminated'. Therefore each source can either reflect the true score of a given observation, $y_i^*$, or can contain an error, with prior probabilities $\pi_1$, $\pi_2$ and $\pi_3$ respectively:
$$P(z_{1i} = 1) = \pi_1, \qquad P(z_{2i} = 1) = \pi_2, \qquad P(z_{3i} = 1) = \pi_3.$$
Hence the prior probabilities of each observation being error free are $1 - \pi_1$, $1 - \pi_2$ and $1 - \pi_3$ respectively.
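As an illustration of this data-generating process, the following R sketch simulates $n$ units from three contaminated sources according to (4.3). All parameter values in it (error proportions, error variances, regression coefficients and true score variance) are arbitrary assumptions chosen for the illustration, not estimates from the thesis.

## Minimal sketch of the contamination-error data-generating process (4.3),
## under assumed (illustrative) parameter values.
set.seed(2)
n     <- 1000
pi_g  <- c(0.2, 0.3, 0.4)      # assumed error proportions per source
sig_g <- c(30, 40, 50)         # assumed error SDs per source
beta  <- c(100, 2)             # assumed regression coefficients
sigma <- 25                    # assumed true-score SD

X      <- cbind(1, runif(n, 0, 200))                 # intercept + one regressor
y_star <- rnorm(n, mean = X %*% beta, sd = sigma)    # true scores

Z <- sapply(pi_g, function(p) rbinom(n, 1, p))       # n x 3 error indicators
Y <- sapply(1:3, function(g)                         # observed scores, as in (4.3)
  y_star + Z[, g] * rnorm(n, 0, sig_g[g]))
colnames(Y) <- paste0("y", 1:3)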
Each set of observations then consists of one of $2^3 = 8$ possible error permutations,
$$klm \in \{000, 001, 010, 100, 011, 101, 110, 111\},$$
where $k$, $l$ and $m$ are equal to 1 in the presence of an error and 0 in the absence of an error on sources 1, 2 and 3 respectively. In some cases the error permutation categorisation is unknown, but in other cases it can be identified from the data, as we will show in Section 4.3.
4.2 Mixing Weights
The probability distribution of the observed data is a trivariate mixture of eight Gaussian distributions, with each mixture component corresponding to a particular error pattern across the three sources. First the overall likelihood of the $i$th observation, consisting of eight separate terms, is computed from the mixing weights and the densities of the respective error permutations. The mixing weights of the eight components are given by combinations of the three error proportions $\pi_1$, $\pi_2$ and $\pi_3$:
$$\begin{aligned}
w_{111} &= \pi_1\pi_2\pi_3, & w_{110} &= \pi_1\pi_2(1-\pi_3), & w_{101} &= \pi_1(1-\pi_2)\pi_3, & w_{011} &= (1-\pi_1)\pi_2\pi_3, \\
w_{100} &= \pi_1(1-\pi_2)(1-\pi_3), & w_{010} &= (1-\pi_1)\pi_2(1-\pi_3), & w_{001} &= (1-\pi_1)(1-\pi_2)\pi_3, & w_{000} &= (1-\pi_1)(1-\pi_2)(1-\pi_3).
\end{aligned} \tag{4.4}$$
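A small R helper of the following form could be used to evaluate the eight weights in (4.4) for given error proportions; the function name and return layout are our own convention.

## Sketch: the eight mixing weights w_klm of (4.4) for error proportions (pi1, pi2, pi3),
## returned as a named vector indexed by the error pattern klm.
mixing_weights <- function(p) {
  stopifnot(length(p) == 3)
  combos <- expand.grid(k = 0:1, l = 0:1, m = 0:1)
  w <- apply(combos, 1, function(e) prod(ifelse(e == 1, p, 1 - p)))
  names(w) <- apply(combos, 1, paste0, collapse = "")
  w
}
mixing_weights(c(0.2, 0.3, 0.4))   # e.g. w["000"] = 0.8 * 0.7 * 0.6 = 0.336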
4.3 Identifying Error Categories
The errors are continuous and normally distributed under the assumptions of the model, so the probability of two independent error components taking exactly the same value is zero. Hence if the observations for a business from two different sources are identical, it can be assumed that this value is the true score, and by extension whether or not the third source is equal to the other two indicates whether it shows the true score or a score containing an error.

A unit with three equal observations would therefore have the error labels
$$\{y_{1i} = y_{2i} = y_{3i}\} \;\Rightarrow\; P(y_{1i} = y_i^*) = P(y_{2i} = y_i^*) = P(y_{3i} = y_i^*) = 1,$$
so
$$z_{1i} = z_{2i} = z_{3i} = 0,$$
and a unit with two equal observations and one with a different value
$$\{y_{1i} = y_{2i} \neq y_{3i}\} \;\Rightarrow\; P(y_{1i} = y_i^*) = P(y_{2i} = y_i^*) = 1,\ P(y_{3i} = y_i^*) = 0,$$
so
$$z_{1i} = z_{2i} = 0, \quad z_{3i} = 1.$$
Using this logic four of the categories can be identified here: 000, 001, 010 and 100, where 1 indicates an error on the respective source (e.g. 001 represents an error on source 3 and the true value shown by the other two sources). These sets of observations can be assigned to the sets $S_1$, $S_2$, $S_3$ and $S_4$. Observations belonging to the four remaining categories, $\{011, 101, 110, 111\}$, cannot be categorised based on the data and are assigned to the set $S_{\mathrm{rest}}$:
$$\begin{aligned}
S_1 &= \{i : z_{1i} = 0,\ z_{2i} = 0,\ z_{3i} = 0\} \\
S_2 &= \{i : z_{1i} = 0,\ z_{2i} = 0,\ z_{3i} = 1\} \\
S_3 &= \{i : z_{1i} = 0,\ z_{2i} = 1,\ z_{3i} = 0\} \\
S_4 &= \{i : z_{1i} = 1,\ z_{2i} = 0,\ z_{3i} = 0\} \\
S_{\mathrm{rest}} &= \{i\} \text{ otherwise.}
\end{aligned} \tag{4.5}$$
The number of units in each of the five sets can also be identified, denoted by $n_1$, $n_2$, $n_3$, $n_4$ and $n_{\mathrm{rest}}$ respectively. Notice that the five sets are disjoint.
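Under the model assumptions, the identifiable sets can be read off by checking which observed scores coincide, for instance with the following R sketch. It reuses the simulated matrix Y from the earlier illustration; with real data, exact equality of reported values plays the same role.

## Sketch: assign each unit to one of the identifiable sets S1-S4 of (4.5),
## or to S_rest, based purely on which observed scores coincide.
classify_unit <- function(y1, y2, y3) {
  if (y1 == y2 && y2 == y3) return("S1")       # 000: all three equal
  if (y1 == y2)             return("S2")       # 001: error on source 3
  if (y1 == y3)             return("S3")       # 010: error on source 2
  if (y2 == y3)             return("S4")       # 100: error on source 1
  "Srest"                                      # 011, 101, 110 or 111
}
sets <- mapply(classify_unit, Y[, 1], Y[, 2], Y[, 3])
table(sets)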
The data are incomplete, as the presence or absence of errors is unknown for some of the observations. This is hence a partially supervised dataset, with class membership known for some but not all of the sets of observations. It can also be viewed as a missing data problem. Only a proportion of the labels in $\mathbf{z}$ can be inferred directly from the data; the remainder are unavailable and must therefore be estimated by treating them as missing data and subsequently applying the EM algorithm. We will now introduce this algorithm and describe its constituent parts in the general case.
Chapter 5
EM Algorithm
The expectation-maximisation (EM) algorithm is a general iterative method for finding the maximum likelihood estimator of a parameter or a set of parameters, $\boldsymbol\theta$. It is particularly useful when there are missing values or latent variables, as in this application. It is a formalisation of the ad-hoc procedure of alternately estimating and updating the unknown parameters and missing values, increasing the (log)likelihood with each iteration until the estimates converge. It is a popular method of maximum likelihood estimation because it is computationally simple, relying only on complete-data computations: the E step of each iteration only involves taking expectations over complete-data conditional distributions, and the M step of each iteration only requires complete-data maximum likelihood estimation, which usually has a simple closed form. It is also numerically stable, with the observed-data likelihood increasing monotonically with each iteration, as proven by Dempster et al. (1977).
We will now follow Gupta and Chen (2011) to explain this in more depth. To use EM, you must be given some observed data $\mathbf{y}$, a parametric density $p(\mathbf{y} \mid \boldsymbol\theta)$, a description of some complete target data $\mathbf{x}$, and the parametric density $p(\mathbf{x} \mid \boldsymbol\theta)$. We can assume that the complete data can be modelled as a continuous random vector $X$ with density $p(\mathbf{x} \mid \boldsymbol\theta)$, where $\boldsymbol\theta \in \Omega$ for some set $\Omega$. You do not observe $X$ directly; instead, you observe a realisation $\mathbf{y}$ of the random vector $Y$ that depends on $X$. For example, $X$ might be a random vector and $Y$ the mean of its components, or if $X$ is a complex number then $Y$ might be only its magnitude, or $Y$ might be the first component of the vector $X$. Given only $\mathbf{y}$, the goal is to find the maximum likelihood estimate (MLE) of $\boldsymbol\theta$,
$$\hat{\boldsymbol\theta}_{MLE} = \operatorname*{arg\,max}_{\boldsymbol\theta \in \Omega} p(\mathbf{y} \mid \boldsymbol\theta). \tag{5.1}$$
However, for some problems this is difficult to solve directly. Then we can try EM: we make a guess about the complete data $X$ and solve for the $\boldsymbol\theta$ that maximises the (expected) log-likelihood of $X$. And once we have an estimate for $\boldsymbol\theta$, we can make a better guess about the complete data $X$, and iterate. Each iteration of the EM algorithm is usually described in two steps, an (expectation) E step followed by a (maximisation) M step, but for further clarity it can be broken down into five steps:
Step 1: Let $m = 0$ and make an initial estimate $\boldsymbol\theta^{(m)}$ for $\boldsymbol\theta$.

Step 2: Given the observed data $\mathbf{y}$, and pretending for the moment that the current guess $\boldsymbol\theta^{(m)}$ is correct, formulate the conditional probability distribution $p(\mathbf{x} \mid \mathbf{y}, \boldsymbol\theta^{(m)})$ for the complete data $\mathbf{x}$.

Step 3: Using the conditional probability distribution $p(\mathbf{x} \mid \mathbf{y}, \boldsymbol\theta^{(m)})$ calculated in Step 2, form the conditional expected log-likelihood, which is called the Q-function.

Step 4: Find the $\boldsymbol\theta$ that maximises this Q-function; the result is the new estimate $\boldsymbol\theta^{(m+1)}$.

Step 5: Let $m := m + 1$ and return to Step 2.

Steps 2 to 5 are then repeated until the chosen stopping criterion is satisfied.
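In R, these five steps amount to a simple loop of the following shape. This is only a generic skeleton with placeholder e_step() and m_step() functions, not the code used later for the contamination-error model.

## Generic EM skeleton: e_step() and m_step() stand for problem-specific
## functions returning the required conditional expectations and the
## maximising parameter values, respectively.
run_em <- function(theta0, y, e_step, m_step, tol = 1e-6, max_iter = 500) {
  theta <- theta0                               # Step 1: initial estimate
  for (m in seq_len(max_iter)) {
    expectations <- e_step(y, theta)            # Steps 2-3: conditional expectations / Q
    theta_new    <- m_step(y, expectations)     # Step 4: maximise Q
    if (max(abs(unlist(theta_new) - unlist(theta))) < tol) {
      theta <- theta_new
      break                                     # stopping criterion satisfied
    }
    theta <- theta_new                          # Step 5: iterate
  }
  theta
}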
In this application the missing data consists of the error labels, introduced in equation
(3.1), which indicate whether or not an error is present in a given observation for a particular
source.
In the next chapter we will detail in sequence how these five steps are implemented to
obtain parameter estimates for our application. We will begin by doing this for the basic
model, before expanding this to incorporate intercept and slope parameters.
Chapter 6
Applying the EM Algorithm to Fit the Basic Model
6.1 Initial Parameter Estimates
In the basic model form applied by Guarnera and Varriale the estimable parameters are
$$\boldsymbol\theta = (\pi_1, \pi_2, \pi_3, \boldsymbol\beta, \sigma^2, \sigma_1^2, \sigma_2^2, \sigma_3^2).$$
Definitions for each of the parameters are given in the following table:

   $\pi_g$              The proportion of errors present in source $g$.
   $\boldsymbol\beta$   The $p \times 1$ regression coefficient vector obtained when fitting a linear model of the observed true scores against the $n \times p$ matrix of regressors $\mathbf{X}$.
   $\sigma^2$           The variance of the true scores that is unexplained by $\mathbf{X}$, also referred to as the true score variance.
   $\sigma_g^2$         The variance of the observed scores from source $g$ that is unexplained by their corresponding true scores.
Initial estimates of these eight parameters can be obtained by looking purely at the data for which the error labels are observed. The initial parameter estimates are the value of $\boldsymbol\theta^{(t)}$ prior to the first iteration, i.e. when $t = 0$, and can be written as
$$\boldsymbol\theta^{(0)} = (\pi_1^{(0)}, \pi_2^{(0)}, \pi_3^{(0)}, \boldsymbol\beta^{(0)}, \sigma^{2\,(0)}, \sigma_1^{2\,(0)}, \sigma_2^{2\,(0)}, \sigma_3^{2\,(0)}).$$
We will now explain how each of these initial estimates is derived in turn, starting with the error proportions.
6.1.1 Error Proportions
Lemma 1. The initial estimates for the error proportions of the three sources are given by
$$\pi_1^{(0)} = 1 - \sqrt{\frac{(w_{000}+w_{001})(w_{000}+w_{010})}{w_{000}+w_{100}}}, \qquad
\pi_2^{(0)} = 1 - \sqrt{\frac{(w_{000}+w_{001})(w_{000}+w_{100})}{w_{000}+w_{010}}},$$
$$\text{and} \qquad
\pi_3^{(0)} = 1 - \sqrt{\frac{(w_{000}+w_{100})(w_{000}+w_{010})}{w_{000}+w_{001}}}.$$
Proof. As the error combinations of units belonging to four of the groups can be identified correctly based on their observed scores, it is possible to obtain initial estimates of their mixing weights $w_{000}$, $w_{001}$, $w_{010}$ and $w_{100}$. From these, simultaneous equations can be formed and initial estimates of the error proportions can be calculated. The mixing weights have previously been defined in equations (4.4), and can be combined to create quantities $A$, $B$ and $C$:
$$\begin{aligned}
w_{000} + w_{100} &= (1-\pi_2)(1-\pi_3) = A \\
w_{000} + w_{010} &= (1-\pi_1)(1-\pi_3) = B \\
w_{000} + w_{001} &= (1-\pi_1)(1-\pi_2) = C.
\end{aligned}$$
Then
$$\frac{A}{B} = \frac{(1-\pi_2)(1-\pi_3)}{(1-\pi_1)(1-\pi_3)} = \frac{1-\pi_2}{1-\pi_1},$$
so that
$$C = (1-\pi_1)(1-\pi_2) = (1-\pi_1)^2\,\frac{A}{B}
\quad\Longrightarrow\quad
(1-\pi_1)^2 = \frac{CB}{A}
\quad\Longrightarrow\quad
\pi_1 = 1 \pm \sqrt{\frac{CB}{A}} = 1 \pm \sqrt{\frac{(w_{000}+w_{001})(w_{000}+w_{010})}{w_{000}+w_{100}}}.$$
The $\pi_g$ are proportions that cannot exceed 1, so the square root of the positive quotient must be subtracted from 1; therefore
$$\pi_1^{(0)} = 1 - \sqrt{\frac{(w_{000}+w_{001})(w_{000}+w_{010})}{w_{000}+w_{100}}}.$$
Similarly,
$$\pi_2^{(0)} = 1 - \sqrt{\frac{(w_{000}+w_{001})(w_{000}+w_{100})}{w_{000}+w_{010}}}
\quad\text{and}\quad
\pi_3^{(0)} = 1 - \sqrt{\frac{(w_{000}+w_{100})(w_{000}+w_{010})}{w_{000}+w_{001}}}. \qquad\blacksquare$$
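These initial estimates can be computed directly from the observed category counts, for example as in the following R sketch. It reuses the sets classification from the earlier illustration; A, B and C are the quantities defined in the proof.

## Sketch of the Lemma 1 initial estimates from observed category counts.
counts <- table(factor(sets, levels = c("S1", "S2", "S3", "S4", "Srest")))
w <- counts[c("S1", "S2", "S3", "S4")] / sum(counts)   # estimates of w_000, w_001, w_010, w_100
A <- w["S1"] + w["S4"]   # w_000 + w_100
B <- w["S1"] + w["S3"]   # w_000 + w_010
C <- w["S1"] + w["S2"]   # w_000 + w_001
pi0 <- c(1 - sqrt(C * B / A),
         1 - sqrt(C * A / B),
         1 - sqrt(A * B / C))
pi0   # initial estimates of (pi_1, pi_2, pi_3)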
6.1.2 Regression Coefficients
An initial estimate of the regression coefficient vector was obtained by ordinary least squares regression of the true scores, in the cases in which they were observed from the data, against the corresponding covariates. The initial estimate of $\boldsymbol\beta$ is given by
$$\boldsymbol\beta^{(0)} = (\mathbf{X}_k^T\mathbf{X}_k)^{-1}\mathbf{X}_k^T\mathbf{y}_k^*, \tag{6.1}$$
where $\mathbf{y}_k^*$ is a vector collecting the true scores that could be obtained directly from the data,
$$\mathbf{y}_k^* = \{y_i^* : i \in S_1 \cup S_2 \cup S_3 \cup S_4\},$$
and $\mathbf{X}_k$ is the matrix of covariates for the same units.

With an estimated regression coefficient vector $\hat{\boldsymbol\beta}$, the fitted (or predicted) values from the regression are given by
$$\hat{\mathbf{y}} = \mathbf{X}\hat{\boldsymbol\beta}.$$
These fitted values are used to obtain our initial true score variance estimate.
6.1.3 True Score Variance
Our initial estimate for the true score variance is given by
$$\sigma^{2\,(0)} = \frac{1}{n_k - p}\sum_{i \in S_k}(y_i^* - \hat y_i)^2,$$
where $n_k$ is the total number of units with identifiable true scores,
$$n_k = n_1 + n_2 + n_3 + n_4,$$
$S_k$ is the union of the first four sets,
$$S_k = S_1 \cup S_2 \cup S_3 \cup S_4,$$
and $p$ is the number of covariates in the model. The $\hat y_i$ are the fitted values obtained by regressing the identifiable true scores against their corresponding covariates, and hence follow from the initial estimate given in (6.1),
$$\hat{\mathbf{y}} = \mathbf{X}_k\boldsymbol\beta^{(0)}.$$

6.1.4 Observed Score Error Variances
Our $\sigma_g^2$ parameters are the variances of the error components $\epsilon_{gi}$. These error components only exist for the units with an error on source $g$. For these units we already know that $y_{gi} = y_i^* + \epsilon_{gi}$; rearranging this gives
$$\epsilon_{gi} = y_{gi} - y_i^*.$$
Our initial estimates for $\sigma_g^2$ can then be obtained by taking the variance of this difference,
$$\operatorname{Var}(y_{gi} - y_i^*), \tag{6.2}$$
over the units for which there is an error on source $g$ and for which the true score is identifiable. Referring to (4.5), it is clear that for sources 1, 2 and 3 this is the case for the sets $S_4$, $S_3$ and $S_2$ respectively. Using the standard formula for the sample variance gives
$$\sigma_1^{2\,(0)} = \frac{1}{n_4 - 1}\sum_{i \in S_4}(\epsilon_{1i} - \bar\epsilon_{1i})^2. \tag{6.3}$$
The observations that are known to be error free (in this case $y_{2i}$) can be used as a proxy for the true scores $y_i^*$, so
$$\epsilon_{1i} = y_{1i} - y_{2i} \tag{6.4}$$
and
$$\bar\epsilon_{1i} = \frac{1}{n_4}\sum_{i \in S_4}(y_{1i} - y_{2i}). \tag{6.5}$$
Analogously, for the second and third error variances,
$$\sigma_2^{2\,(0)} = \frac{1}{n_3 - 1}\sum_{i \in S_3}(\epsilon_{2i} - \bar\epsilon_{2i})^2 \tag{6.6}$$
and
$$\sigma_3^{2\,(0)} = \frac{1}{n_2 - 1}\sum_{i \in S_2}(\epsilon_{3i} - \bar\epsilon_{3i})^2. \tag{6.7}$$
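The remaining initial estimates of Sections 6.1.2 to 6.1.4 can be sketched in R as follows, again using the simulated data and classification from the earlier illustrations. lm() is used for the OLS fit in (6.1) and var() for the sample variances in (6.3), (6.6) and (6.7).

## Sketch of the remaining initial estimates. For units in S1-S4 the true score
## equals the observed value of an error-free source.
idx      <- sets != "Srest"
ystar_k  <- ifelse(sets == "S4", Y[, 2], Y[, 1])[idx]   # identifiable true scores
Xk       <- X[idx, , drop = FALSE]

fit      <- lm(ystar_k ~ Xk - 1)                        # OLS as in (6.1); Xk already contains an intercept column
beta0    <- coef(fit)
sigma2_0 <- sum(resid(fit)^2) / (sum(idx) - ncol(Xk))   # true score variance, Section 6.1.3

## Error variances (6.3), (6.6), (6.7): the error-free source serves as a proxy for y*
sig1_0 <- var(Y[sets == "S4", 1] - Y[sets == "S4", 2])
sig2_0 <- var(Y[sets == "S3", 2] - Y[sets == "S3", 1])
sig3_0 <- var(Y[sets == "S2", 3] - Y[sets == "S2", 1])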
It is not possible to calculate explicit parameter estimates by factorising the likelihood, so an iterative method must be used, i.e. the EM algorithm. First we detail the derivation of the parameters of the basic model using this method, before extending the model to incorporate systematic bias by introducing intercept and slope parameters for each source.
6.2 Complete-data loglikelihood
According to McLachlan and Peel (2000), the general form of the complete-data loglikelihood function of a finite mixture model with $H$ components and $n$ i.i.d. observations is
$$\ell_c(\boldsymbol\theta) = \sum_{h=1}^{H}\sum_{i=1}^{n} z_{hi}\left\{\log w_h + \log f_h(y_i; \boldsymbol\theta)\right\},$$
with $w_h$ and $f_h(\cdot)$ denoting the mixing weight and density function of component $h$, respectively, and $z_{hi} = 1$ if observation $y_i$ belongs to component $h$ and $z_{hi} = 0$ otherwise. For the contamination-error model, with three observed values $y_{gi}$ per unit and a true value $y_i^*$ per unit, this can be expressed as follows:
$$\ell_c(\boldsymbol\theta) = \sum_{k=0}^{1}\sum_{l=0}^{1}\sum_{m=0}^{1}\sum_{i=1}^{n} z_{klmi}\left(\log w_{klm} + \log h_{klm}(y_{1i}, y_{2i}, y_{3i}, y_i^*)\right), \tag{6.8}$$
where $z_{klmi} = I\{z_{1i} = k, z_{2i} = l, z_{3i} = m\}$ is an indicator equal to 1 when the appropriate error combination $klm$ is selected, $w_{klm}$ is the mixing weight of error combination $klm$, and $h_{klm}(y_{1i}, y_{2i}, y_{3i}, y_i^*)$ is the joint normal density of the $i$th set of observations and the $i$th true score.

The latter can be factorised into the product of the marginal normal density of the $i$th unit's true score and the trivariate joint normal density of the $i$th observation of each of the three sources, conditional on the true score:
$$h_{klm}(y_{1i}, y_{2i}, y_{3i}, y_i^*) = h(y_i^*)\, h_{klm}(y_{1i}, y_{2i}, y_{3i} \mid y_i^*). \tag{6.9}$$
This can be further factorised to give
$$h_{klm}(y_{1i}, y_{2i}, y_{3i}, y_i^*) = h(y_i^*)\, h_k(y_{1i} \mid y_i^*)\, h_l(y_{2i} \mid y_i^*)\, h_m(y_{3i} \mid y_i^*),$$
as we will demonstrate with the following lemma and proof.

Lemma 2. The trivariate joint conditional density $h_{klm}(y_{1i}, y_{2i}, y_{3i} \mid y_i^*)$ is equal to the product of three marginal densities:
$$h_{klm}(y_{1i}, y_{2i}, y_{3i} \mid y_i^*) = h_k(y_{1i} \mid y_i^*)\, h_l(y_{2i} \mid y_i^*)\, h_m(y_{3i} \mid y_i^*). \tag{6.10}$$
Proof. It is known that a multivariate density function can be separated into a product of univariate density functions in the case of independence of the variables. It follows that if the observations are independent conditional on the true score, then the lemma is satisfied. Let $y_{ci}$ and $y_{di}$ denote the observed scores given by two of the sources, for unit $i$, where $c \neq d$. From (4.3), it follows that, conditional on a fixed value of $y_i^*$, the only sources of variation in $y_{ci}$ and $y_{di}$ are $\epsilon_{ci}$ and $\epsilon_{di}$, and these errors are independent if $c \neq d$. Thus, conditional on $y_i^*$, $y_{ci}$ and $y_{di}$ are independent. $\blacksquare$
We will now derive the covariance matrix of the observed and true scores, conditional on
the error labels.
Lemma 3. The 4-dimensional covariance matrix of the true score and the observed scores of the three sources, conditional on the error labels, is given by
$$\Sigma_{y_i^* y_{1i} y_{2i} y_{3i}} = \begin{pmatrix}
\sigma^2 & \sigma^2 & \sigma^2 & \sigma^2 \\
\sigma^2 & \sigma^2 + z_{1i}\sigma_1^2 & \sigma^2 & \sigma^2 \\
\sigma^2 & \sigma^2 & \sigma^2 + z_{2i}\sigma_2^2 & \sigma^2 \\
\sigma^2 & \sigma^2 & \sigma^2 & \sigma^2 + z_{3i}\sigma_3^2
\end{pmatrix}, \tag{6.11}$$
hence the covariance elements of the observed scores are accounted for entirely by the true score variance.
Proof. First, the top left element represents the true score variance and is already known to be $\sigma^2$. Next consider the remaining diagonal elements, which correspond to the variances of the three observed scores. Again, let $y_{ci}$ and $y_{di}$ denote the observed scores given by two of the sources, for unit $i$, where $c \neq d$. Then
$$\operatorname{Var}(y_{ci}) = \operatorname{Var}(y_i^* + z_{ci}\epsilon_{ci}) = \operatorname{Var}(y_i^*) + z_{ci}^2\operatorname{Var}(\epsilon_{ci}) = \sigma^2 + z_{ci}^2\sigma_c^2 = \sigma^2 + z_{ci}\sigma_c^2,$$
where the final equality holds because $z_{ci}$ is Boolean. For the off-diagonal covariances,
$$\begin{aligned}
\operatorname{Cov}(y_{ci}, y_{di}) &= \operatorname{Cov}(y_i^* + z_{ci}\epsilon_{ci},\ y_i^* + z_{di}\epsilon_{di}) \\
&= \operatorname{Cov}(y_i^*, y_i^*) + \operatorname{Cov}(y_i^*, z_{di}\epsilon_{di}) + \operatorname{Cov}(z_{ci}\epsilon_{ci}, y_i^*) + \operatorname{Cov}(z_{ci}\epsilon_{ci}, z_{di}\epsilon_{di}) \\
&= \operatorname{Cov}(y_i^*, y_i^*) + z_{di}\operatorname{Cov}(y_i^*, \epsilon_{di}) + z_{ci}\operatorname{Cov}(y_i^*, \epsilon_{ci}) + z_{ci}z_{di}\operatorname{Cov}(\epsilon_{ci}, \epsilon_{di}).
\end{aligned}$$
An assumption of the model is that the errors are independent of one another and of the true scores, hence the final three terms are equal to zero and the covariances of the observed scores are given by
$$\operatorname{Cov}(y_{ci}, y_{di}) = \operatorname{Cov}(y_i^*, y_i^*) = \operatorname{Var}(y_i^*) = \sigma^2,$$
regardless of the presence or absence of an error on an observation. The covariances of the true score with the observed scores are similarly given by
$$\operatorname{Cov}(y_{ci}, y_i^*) = \operatorname{Cov}(y_{di}, y_i^*) = \operatorname{Var}(y_i^*) = \sigma^2.$$
Hence all covariances between the sources are equal to $\sigma^2$, $\mathbf{Z}$ is independent of the true-score error $\epsilon$, and the three sources are therefore independent conditional on $y_i^*$, as their covariance comes from the variance of the true score alone. $\blacksquare$
Substituting (6.10) into (6.8) and expanding gives the loglikelihood expression of the data:
$$\ell_c(\boldsymbol\theta) = \sum_{k=0}^{1}\sum_{l=0}^{1}\sum_{m=0}^{1}\sum_{i=1}^{n} z_{klmi}\left\{\log w_{klm} + \log h(y_i^*) + \log h_k(y_{1i} \mid y_i^*) + \log h_l(y_{2i} \mid y_i^*) + \log h_m(y_{3i} \mid y_i^*)\right\}.$$
The errors are assumed to be normally distributed, and the distribution of the $i$th observed score from source $g$ conditional on the $i$th true score is given by
$$y_{gi} \mid y_i^* \sim \begin{cases} N(y_i^*, \sigma_g^2) & \text{if there is an error on source } g \\ \delta(y_{gi} - y_i^*) & \text{otherwise,} \end{cases} \tag{6.12}$$
where $\delta(\cdot)$ denotes the Dirac delta function with mass at zero.

The density components for observations containing an error, $h_1(y_{gi} \mid y_i^*)$, are therefore probability density functions of the normal distribution with mean $y_i^*$ and variance $\sigma_g^2$:
$$h_1(y_{gi} \mid y_i^*) = \frac{1}{\sqrt{2\pi\sigma_g^2}}\, e^{-\frac{(y_{gi} - y_i^*)^2}{2\sigma_g^2}}. \tag{6.13}$$
The true score is also normally distributed, with variance $\sigma^2$ and mean $\mathbf{x}_i^T\boldsymbol\beta$. Its density component is defined by
$$h(y_i^*) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(y_i^* - \mathbf{x}_i^T\boldsymbol\beta)^2}{2\sigma^2}}. \tag{6.14}$$
The appropriate density functions from (6.12) can then be substituted in to expand the expression. The delta functions do not involve any of the parameters; they would therefore differentiate to zero and can be removed from the likelihood expression, to give
$$\ell_c(\boldsymbol\theta) = C + \sum_{i=1}^{n}\left\{\sum_{g=1}^{3}\left[z_{gi}\log\pi_g + (1 - z_{gi})\log(1 - \pi_g)\right] - \log\sigma - \frac{(y_i^* - \mathbf{x}_i^T\boldsymbol\beta)^2}{2\sigma^2} - \sum_{g=1}^{3} z_{gi}\left[\log\sigma_g + \frac{(y_{gi} - y_i^*)^2}{2\sigma_g^2}\right]\right\}, \tag{6.15}$$
where the constant term $C$ collects all terms that do not depend on any unknown parameters.
This is the complete-data loglikelihood, and to calculate it the complete dataset $(y_{1i}, y_{2i}, y_{3i}, y_i^*)$ is needed. From this it would be possible to first derive the indicators $z_{gi}$ and then the parameters. In our application the $y_i^*$ value is not observed for all units and hence some of the $z_{gi}$ values are unknown, and the loglikelihood function must be modified to accommodate this by replacing them with their expectations. The first part of our algorithm consists of calculating these expectations.
6.3 E Step

6.3.1 Error Category Probabilities
In the EM framework, maximum likelihood estimation can be achieved with incomplete data by replacing each unobserved value of $y_i^*$ and $z_{gi}$ by its conditional expectation, given the observed values and the current parameter estimates. The E step of the EM algorithm consists of applying this strategy to expression (6.15), and the resulting conditional expectation of the loglikelihood function is denoted by $Q$. An abbreviated form of $Q$ is given in (6.38), but for this to be understood some further concepts first need to be introduced.

In (4.5) we partitioned the full dataset into five disjoint and exhaustive identifiable sets based on their observed errors. There are in fact eight error combinations and, although they are not identifiable from the data, the set $S_{\mathrm{rest}}$ consists of units with four distinct combinations. It is therefore possible to further partition $S_{\mathrm{rest}}$ into four subsets based on these combinations. We will now define these four subsets in terms of their error categories, and introduce notation for them:
$$\begin{aligned}
S_5 &= \{i : z_{1i} = 1,\ z_{2i} = 1,\ z_{3i} = 0\} \\
S_6 &= \{i : z_{1i} = 1,\ z_{2i} = 0,\ z_{3i} = 1\} \\
S_7 &= \{i : z_{1i} = 0,\ z_{2i} = 1,\ z_{3i} = 1\} \\
S_8 &= \{i : z_{1i} = 1,\ z_{2i} = 1,\ z_{3i} = 1\}.
\end{aligned} \tag{6.16}$$
These subsets are disjoint, and together they cover the entire set $S_{\mathrm{rest}}$:
$$S_j \cap S_k = \emptyset \text{ for } j, k \in \{1, \ldots, 8\},\ j \neq k, \qquad S_5 \cup S_6 \cup S_7 \cup S_8 = S_{\mathrm{rest}}.$$
Although it is unknown to which of these sets a unit of observations belongs, it is possible to estimate the probabilities of the unit belonging to each set, as we will now illustrate.
Lemma 4. The posterior probability estimate, at iteration $t$, that the $i$th set of observations has the error combination $klm$, given that it belongs to the set $S_{\mathrm{rest}}$ and conditional on the current parameter estimates,
$$\tau_{klmi} = P(z_{1i} = k, z_{2i} = l, z_{3i} = m \mid i \in S_{\mathrm{rest}}, y_{1i}, y_{2i}, y_{3i}, \boldsymbol\theta^{(t)}) = P(i \in S_j \mid y_{1i}, y_{2i}, y_{3i}, i \in S_{\mathrm{rest}}, \boldsymbol\theta^{(t)}),$$
where $j$ denotes the appropriate subset number from (6.16), is given by the following equation:
$$\tau_{klmi}^{(t)} = \frac{w_{klm}^{(t)}\, h_{klm}(y_{1i}, y_{2i}, y_{3i} \mid \boldsymbol\theta^{(t)})}{\sum_{klm \in \{110,101,011,111\}} w_{klm}^{(t)}\, h_{klm}(y_{1i}, y_{2i}, y_{3i} \mid \boldsymbol\theta^{(t)})}.$$
Here $w_{klm}$ is the mixing weight corresponding to error pattern $klm$, computed from the error proportion estimates at the $t$th iteration, $\pi_g^{(t)}$.
Proof. The result is obtained by applying Bayes' rule for posterior probabilities. In the case that $A$ is discrete and $B$ is continuous,
$$P(A_i \mid B = b) = \frac{P(A_i)\, f_B(b \mid A_i)}{f_B(b)}. \tag{6.17}$$
Or, in words,
$$\text{Posterior prob. of event } A_i \text{ given evidence } B = \frac{\text{Prior prob. of } A_i \times \text{Likelihood of } B \text{ given } A_i}{\text{Prior likelihood of } B}.$$
In our context $A$ is the event $i \in S_j$, $B$ is the observed values $(y_{1i}, y_{2i}, y_{3i})$ together with the knowledge that $i \in S_{\mathrm{rest}}$, and everything is conditioned on the current parameter estimates $\boldsymbol\theta^{(t)}$.
$$\tau_{klmi} = P(i \in S_j \mid \mathbf{y}_i, i \in S_{\mathrm{rest}}, \boldsymbol\theta^{(t)}) = \frac{P(i \in S_j \mid \boldsymbol\theta^{(t)})\, f_h(\mathbf{y}_i, i \in S_{\mathrm{rest}} \mid i \in S_j, \boldsymbol\theta^{(t)})}{f_h(\mathbf{y}_i, i \in S_{\mathrm{rest}} \mid \boldsymbol\theta^{(t)})}, \tag{6.18}$$
where $\mathbf{y}_i$ denotes the observed data for the $i$th unit, $(y_{1i}, y_{2i}, y_{3i})$, and $f_h(\cdot)$ denotes the relevant density function. $S_j$ is known to be a subset of $S_{\mathrm{rest}}$ and hence
$$P(i \in S_{\mathrm{rest}} \mid i \in S_j) = 1.$$
It is therefore superfluous to include the argument $i \in S_{\mathrm{rest}}$ in $f_h(\mathbf{y}_i, i \in S_{\mathrm{rest}} \mid i \in S_j, \boldsymbol\theta^{(t)})$. It follows that
$$\tau_{klmi} = \frac{P(i \in S_j \mid \boldsymbol\theta^{(t)})\, f_h(\mathbf{y}_i \mid i \in S_j, \boldsymbol\theta^{(t)})}{f_h(\mathbf{y}_i, i \in S_{\mathrm{rest}} \mid \boldsymbol\theta^{(t)})}. \tag{6.19}$$
The prior probabilities $P(i \in S_j \mid \boldsymbol\theta^{(t)})$ have already been defined as the weights given in equations (4.4):
$$P(i \in S_5 \mid \boldsymbol\theta^{(t)}) = w_{110}, \quad P(i \in S_6 \mid \boldsymbol\theta^{(t)}) = w_{101}, \quad P(i \in S_7 \mid \boldsymbol\theta^{(t)}) = w_{011}, \quad P(i \in S_8 \mid \boldsymbol\theta^{(t)}) = w_{111}. \tag{6.20}$$
The likelihood functions can now be considered. First the likelihood in the numerator, $f(\mathbf{y}_i \mid i \in S_j, \boldsymbol\theta^{(t)})$. In words, this can be regarded as the density function of three observed scores, drawn from a normal distribution, having the error pattern $klm$ corresponding to the subset $S_j$. The true scores cannot be derived from the data in these subsets and therefore, unlike the density component for the complete-data loglikelihood given in equation (6.9), this density is not conditioned on $y_i^*$. It is therefore a trivariate normal density:
$$\begin{aligned}
f(\mathbf{y}_i \mid i \in S_5, \boldsymbol\theta^{(t)}) &= h_{110}(y_{1i}, y_{2i}, y_{3i} \mid \boldsymbol\theta^{(t)}) \\
f(\mathbf{y}_i \mid i \in S_6, \boldsymbol\theta^{(t)}) &= h_{101}(y_{1i}, y_{2i}, y_{3i} \mid \boldsymbol\theta^{(t)}) \\
f(\mathbf{y}_i \mid i \in S_7, \boldsymbol\theta^{(t)}) &= h_{011}(y_{1i}, y_{2i}, y_{3i} \mid \boldsymbol\theta^{(t)}) \\
f(\mathbf{y}_i \mid i \in S_8, \boldsymbol\theta^{(t)}) &= h_{111}(y_{1i}, y_{2i}, y_{3i} \mid \boldsymbol\theta^{(t)}),
\end{aligned} \tag{6.21--6.22}$$
where $h_{klm}(y_{1i}, y_{2i}, y_{3i} \mid \boldsymbol\theta^{(t)})$ is the trivariate normal density, conditional on the current parameter estimates,
$$h_{klm}(y_{1i}, y_{2i}, y_{3i} \mid \boldsymbol\theta^{(t)}) = N_3(\mathbf{y}_i;\ \mathbf{x}_i^T\boldsymbol\beta,\ \Sigma_{klm}),$$
$\Sigma_{klm}$ is the joint covariance matrix for a unit with error combination $klm$,
$$\Sigma_{klm} = \begin{pmatrix}
\sigma^2 + z_{1i}\sigma_1^2 & \sigma^2 & \sigma^2 \\
\sigma^2 & \sigma^2 + z_{2i}\sigma_2^2 & \sigma^2 \\
\sigma^2 & \sigma^2 & \sigma^2 + z_{3i}\sigma_3^2
\end{pmatrix},$$
and $\mathbf{y}_i$ is the vector of observations from the three sources for unit $i$,
$$\mathbf{y}_i = \begin{pmatrix} y_{1i} \\ y_{2i} \\ y_{3i} \end{pmatrix}.$$
Finally we can define the density function in the denominator, $f(\mathbf{y}_i, i \in S_{\mathrm{rest}} \mid \boldsymbol\theta^{(t)})$. This can be regarded as the density of observed scores $\mathbf{y}_i$ corresponding to the entire set $S_{\mathrm{rest}}$. Given that the units belonging to $S_{\mathrm{rest}}$ can be viewed as a finite mixture consisting of four components, we can now recall the density function of a mixture model. As introduced in (3.2), a probability density function of a mixture model is defined by a convex combination of $K$ component PDFs,
$$p(y \mid \boldsymbol\theta) = \sum_{k=1}^{K} \pi_k\, p_k(y \mid \boldsymbol\theta_k),$$
where $p_k(y \mid \boldsymbol\theta_k)$ is the PDF of the $k$th component and the $\pi_k$ are the mixing proportions. In our application we have four components with mixing weights $w_{klm}$ and probability density functions $h_{klm}(y_{1i}, y_{2i}, y_{3i} \mid \boldsymbol\theta^{(t)})$. Therefore
$$f(\mathbf{y}_i, i \in S_{\mathrm{rest}} \mid \boldsymbol\theta^{(t)}) = \sum_{klm \in \{110,101,011,111\}} w_{klm}^{(t)}\, h_{klm}(y_{1i}, y_{2i}, y_{3i} \mid \boldsymbol\theta^{(t)}). \tag{6.23}$$
We can then substitute (6.20), (6.21) and (6.23) into equation (6.19), to give
$$\tau_{klmi} = \frac{w_{klm}^{(t)}\, h_{klm}(y_{1i}, y_{2i}, y_{3i} \mid \boldsymbol\theta^{(t)})}{\sum_{klm \in \{110,101,011,111\}} w_{klm}^{(t)}\, h_{klm}(y_{1i}, y_{2i}, y_{3i} \mid \boldsymbol\theta^{(t)})}. \tag{6.24}$$
This completes the proof of the lemma. $\blacksquare$

It is always the case that
$$\tau_{011i} + \tau_{101i} + \tau_{110i} + \tau_{111i} = 1. \tag{6.25}$$
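A direct R sketch of (6.24) is given below. It evaluates the four trivariate normal densities with dmvnorm() from the mvtnorm package (an external package, not part of the thesis code) and normalises the weighted densities row-wise; the list layout of theta is our own convention.

## Sketch of the posterior probabilities tau_klmi of (6.24) for the units in S_rest.
library(mvtnorm)

tau_rest <- function(Yrest, X_rest, theta) {
  # theta: list(pi = c(pi1, pi2, pi3), beta, sigma2, sigma2_g = c(s1, s2, s3))
  patterns <- rbind(c(1,1,0), c(1,0,1), c(0,1,1), c(1,1,1))
  rownames(patterns) <- c("110", "101", "011", "111")
  mu <- as.vector(X_rest %*% theta$beta)
  num <- sapply(rownames(patterns), function(p) {
    z <- patterns[p, ]
    w <- prod(ifelse(z == 1, theta$pi, 1 - theta$pi))          # mixing weight w_klm
    Sigma <- matrix(theta$sigma2, 3, 3) + diag(z * theta$sigma2_g)
    w * sapply(seq_len(nrow(Yrest)), function(i)
      dmvnorm(Yrest[i, ], mean = rep(mu[i], 3), sigma = Sigma))
  })
  num / rowSums(num)                                            # rows sum to 1, cf. (6.25)
}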
In the case of extreme outlying data this method of estimating the $\tau$ values can lead to a computation error, so that these posterior probability estimates cannot be calculated; in that case an alternative method is used to assign the most likely error category to each unit: the Mahalanobis distance. This is explained in more detail in Section 8.3.
6.3.2 Conditional Expected Error Labels
As the error labels are unknown in the case of incomplete data, it is necessary to calculate
their expected values conditional on the data that is observed and the current parameter
estimates. These conditional expectations are calculated for the unobserved values which
correspond to the units belonging to the set Srest .
The expectations are updated iteratively with every E step of the algorithm and are given
here for the tth iteration.
Lemma 5. The expected error labels for the $i$th unit are defined by the following:
$$E(z_{1i} \mid \text{observed data}, \boldsymbol\theta^{(t)}) = \begin{cases} z_{1i} & \text{if } z_{1i} \text{ is observed} \\ \tau_{110i}^{(t)} + \tau_{101i}^{(t)} + \tau_{111i}^{(t)} & \text{otherwise} \end{cases}$$
$$E(z_{2i} \mid \text{observed data}, \boldsymbol\theta^{(t)}) = \begin{cases} z_{2i} & \text{if } z_{2i} \text{ is observed} \\ \tau_{110i}^{(t)} + \tau_{011i}^{(t)} + \tau_{111i}^{(t)} & \text{otherwise} \end{cases}$$
$$E(z_{3i} \mid \text{observed data}, \boldsymbol\theta^{(t)}) = \begin{cases} z_{3i} & \text{if } z_{3i} \text{ is observed} \\ \tau_{101i}^{(t)} + \tau_{011i}^{(t)} + \tau_{111i}^{(t)} & \text{otherwise} \end{cases} \tag{6.26}$$
where $\tau_{klmi}$ is the posterior probability of the $i$th unit having error combination $klm$, as defined in (6.24).
The expectations for the cases where the $z_{gi}$ are observed from the data are tautologically true and do not need to be proven. We now provide a proof for the units for which they are unobserved, i.e. $E(z_{gi} \mid i \in S_{\mathrm{rest}})$.

Proof. Two properties of Bernoulli random variables help us here. Let $X$ denote a Bernoulli random variable; the following two properties of $X$ can be stated (Weiss, 2005):
$$E(X) = P(X = 1) \quad\text{and}\quad P(X = 1) = 1 - P(X = 0). \tag{6.27}$$
It follows that our expectation can also be written as
$$E(z_{gi} \mid i \in S_{\mathrm{rest}}) = 1 - P(z_{gi} = 0 \mid i \in S_{\mathrm{rest}}).$$
Looking again at the subset definitions (6.16), it is evident that for each source $g$ the event $z_{gi} = 0$ is unique to a particular subset. The probability of this event occurring is therefore equal to the probability of the unit belonging to that subset, which corresponds to our $\tau_{klmi}$ value. Hence the first three probabilities can be redefined as
$$\begin{aligned}
P(z_{1i} = 0 \mid i \in S_{\mathrm{rest}}) &= P(i \in S_7 \mid i \in S_{\mathrm{rest}}) = \tau_{011i} \\
P(z_{2i} = 0 \mid i \in S_{\mathrm{rest}}) &= P(i \in S_6 \mid i \in S_{\mathrm{rest}}) = \tau_{101i} \\
P(z_{3i} = 0 \mid i \in S_{\mathrm{rest}}) &= P(i \in S_5 \mid i \in S_{\mathrm{rest}}) = \tau_{110i},
\end{aligned}$$
and the expectations can be deduced:
$$E(z_{1i} \mid i \in S_{\mathrm{rest}}) = 1 - \tau_{011i}, \qquad E(z_{2i} \mid i \in S_{\mathrm{rest}}) = 1 - \tau_{101i}, \qquad E(z_{3i} \mid i \in S_{\mathrm{rest}}) = 1 - \tau_{110i}.$$
It has already been stated that
$$\tau_{011i} + \tau_{101i} + \tau_{110i} + \tau_{111i} = 1,$$
and these expressions can therefore be rewritten to satisfy (6.26). $\blacksquare$
6.3.3 Conditional Expected True Scores
As well as the expected error component labels, the expectation step also involves the estimation of the expected true scores in the cases where they are not deducible from the data.
We will now demonstrate how to derive these.
Lemma 6. The expected true score for a unit with an unknown error category ($i \in S_{\mathrm{rest}}$), conditional on the observations and the current parameter estimates, is given by the formula
$$E(y_i^* \mid y_{1i}, y_{2i}, y_{3i}, \boldsymbol\theta, i \in S_{\mathrm{rest}}) = y_{3i}\tau_{110i} + y_{2i}\tau_{101i} + y_{1i}\tau_{011i} + \mu_i^*\tau_{111i},$$
where $\mu_i^*$ is the value obtained when evaluating (6.30).

Proof. To prove this lemma we must first restate that the set $S_{\mathrm{rest}}$ consists of units that can belong to any of four unidentifiable, mutually exclusive subsets, as defined in (6.16), and is hence a four-component mixture. By the law of total expectation we can evaluate our expectation over the set $S_{\mathrm{rest}}$ by partitioning it:
$$E(y_i^* \mid y_{1i}, y_{2i}, y_{3i}, \boldsymbol\theta, i \in S_{\mathrm{rest}}) = \sum_{j=5}^{8} E(y_i^* \mid y_{1i}, y_{2i}, y_{3i}, \boldsymbol\theta, i \in S_j)\, P(i \in S_j \mid i \in S_{\mathrm{rest}}).$$
It is already known that these probabilities are given by our $\tau_{klmi}$ estimates. We can substitute these in and write the formula in expanded form before considering each expectation separately:
$$\begin{aligned}
E(y_i^* \mid y_{1i}, y_{2i}, y_{3i}, \boldsymbol\theta, i \in S_{\mathrm{rest}}) ={}& E(y_i^* \mid y_{1i}, y_{2i}, y_{3i}, \boldsymbol\theta, i \in S_5)\,\tau_{110i} \\
&+ E(y_i^* \mid y_{1i}, y_{2i}, y_{3i}, \boldsymbol\theta, i \in S_6)\,\tau_{101i} \\
&+ E(y_i^* \mid y_{1i}, y_{2i}, y_{3i}, \boldsymbol\theta, i \in S_7)\,\tau_{011i} \\
&+ E(y_i^* \mid y_{1i}, y_{2i}, y_{3i}, \boldsymbol\theta, i \in S_8)\,\tau_{111i}.
\end{aligned}$$
For the first three subsets we know from (6.16) that there is an error-free observation and hence $y_i^*$ is present in the data. These expected true scores can therefore be replaced with $y_{3i}$, $y_{2i}$ and $y_{1i}$ respectively. Units from $S_8$ have no error-free observations, so their contribution to the expected true score cannot be deduced in this way. We will show how their expectations are estimated below, but for now we have
$$E(y_i^* \mid y_{1i}, y_{2i}, y_{3i}, \boldsymbol\theta, i \in S_{\mathrm{rest}}) = y_{3i}\tau_{110i} + y_{2i}\tau_{101i} + y_{1i}\tau_{011i} + E(y_i^* \mid y_{1i}, y_{2i}, y_{3i}, \boldsymbol\theta, i \in S_8)\,\tau_{111i}.$$
Once the expected error labels have been derived it is possible to infer the expected true scores for nearly all units; however, there is one set for which this is still not the case: those that contain errors from all three sources. To evaluate these we consider the expectation formula for a conditional normal distribution. Let $U_1$ and $U_2$ be two multivariate normal variables of dimension $n$ and $m$ respectively, with expectations
$$E(U_1) = \boldsymbol\mu_1 \quad\text{and}\quad E(U_2) = \boldsymbol\mu_2,$$
such that when combined $U = \begin{pmatrix} U_1 \\ U_2 \end{pmatrix}$ is $(n+m)$-dimensional normal. Their covariance matrix can be partitioned to give
$$\Sigma = \operatorname{Cov}\begin{pmatrix} U_1 \\ U_2 \end{pmatrix} = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}.$$
The conditional expectation of $U_1$, given that $U_2 = u_2$, is given by
$$E(U_1 \mid U_2 = u_2) = \boldsymbol\mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(u_2 - \boldsymbol\mu_2). \tag{6.28}$$
In our application $U_1$ corresponds to the true score $y_i^*$ and $U_2$ is the trivariate normal observed score vector $\mathbf{y}_i = (y_{1i}, y_{2i}, y_{3i})$.
Their expected values are all $\mathbf{x}_i^T\boldsymbol\beta$, and $\Sigma$ is the four-dimensional covariance matrix introduced in (6.11), with all the $z_{gi}$ values equal to 1. This can be partitioned as follows:
$$\Sigma_{y_i^* y_{1i} y_{2i} y_{3i}} = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix} = \begin{pmatrix}
\sigma^2 & \sigma^2 & \sigma^2 & \sigma^2 \\
\sigma^2 & \sigma^2 + \sigma_1^2 & \sigma^2 & \sigma^2 \\
\sigma^2 & \sigma^2 & \sigma^2 + \sigma_2^2 & \sigma^2 \\
\sigma^2 & \sigma^2 & \sigma^2 & \sigma^2 + \sigma_3^2
\end{pmatrix}. \tag{6.29}$$
The expected true score for the units consisting of observations with errors on all three sources is given by
$$\mu_i^* = \mathbf{x}_i^T\boldsymbol\beta + \begin{pmatrix}\sigma^2 & \sigma^2 & \sigma^2\end{pmatrix}\begin{pmatrix}
\sigma^2 + \sigma_1^2 & \sigma^2 & \sigma^2 \\
\sigma^2 & \sigma^2 + \sigma_2^2 & \sigma^2 \\
\sigma^2 & \sigma^2 & \sigma^2 + \sigma_3^2
\end{pmatrix}^{-1}\begin{pmatrix} y_{1i} - \mathbf{x}_i^T\boldsymbol\beta \\ y_{2i} - \mathbf{x}_i^T\boldsymbol\beta \\ y_{3i} - \mathbf{x}_i^T\boldsymbol\beta \end{pmatrix}, \tag{6.30}$$
which can be substituted in to give the expected true score for a unit with unknown error combination:
$$E(y_i^* \mid y_{1i}, y_{2i}, y_{3i}, \boldsymbol\theta, i \in S_{\mathrm{rest}}) = y_{3i}\tau_{110i} + y_{2i}\tau_{101i} + y_{1i}\tau_{011i} + \mu_i^*\tau_{111i}. \tag{6.31} \qquad\blacksquare$$
6.4 Conditional Expected True Score Squared
The expanded form of the loglikelihood function given by equation (6.38) also contains terms
involving the expected true score squared. This cannot be calculated by simply squaring the
expected true score, but rather it involves an additive variance component.
The complete-data loglikelihood contains the expression $(y_i^* - y_{gi})^2$. The expectation of this whole expression is taken:
$$E((y_i^* - y_{gi})^2) = E^*(y_i^{*2}) - 2y_{gi}E^*(y_i^*) + y_{gi}^2, \tag{6.32}$$
where $E^*(y_i^*)$ is shorthand for $E(y_i^* \mid y_{1i}, y_{2i}, y_{3i}, \boldsymbol\theta^{(t)})$ and denotes the expected true score conditional on the data and the current parameter estimates at the $t$th iteration; it is used in the situation where there are three errors and therefore the true score is not present in the data. $E^*(y_i^{*2})$ denotes the expected squared true score under the same conditions: $E(y_i^{*2} \mid y_{1i}, y_{2i}, y_{3i}, \boldsymbol\theta)$.
Lemma 7. Let $\mu_i^*$ denote the expected true score calculated using equation (6.30). The expected true score squared is then given by
$$E(y_i^{*2} \mid y_{1i}, y_{2i}, y_{3i}, \boldsymbol\theta^{(t)}) = \mu_i^{*2} + \sigma^2 - \begin{pmatrix}\sigma^2 & \sigma^2 & \sigma^2\end{pmatrix}\begin{pmatrix}
\sigma^2 + \sigma_1^2 & \sigma^2 & \sigma^2 \\
\sigma^2 & \sigma^2 + \sigma_2^2 & \sigma^2 \\
\sigma^2 & \sigma^2 & \sigma^2 + \sigma_3^2
\end{pmatrix}^{-1}\begin{pmatrix}\sigma^2 \\ \sigma^2 \\ \sigma^2\end{pmatrix}.$$
Proof. It is well known that the variance of any variable $X$ can be calculated from the equation
$$\operatorname{Var}(X) = E(X^2) - [E(X)]^2. \tag{6.33}$$
The expectation of the squared variable is therefore given by
$$E(X^2) = \operatorname{Var}(X) + [E(X)]^2.$$
This equation can be used to derive our expected true score squared,
$$E(y_i^{*2} \mid y_{1i}, y_{2i}, y_{3i}, \boldsymbol\theta^{(t)}) = \mu_i^{*2} + \operatorname{Var}(y_i^* \mid y_{1i}, y_{2i}, y_{3i}, \boldsymbol\theta).$$
The conditional variance of the true score must therefore be calculated; from here on let this variance be denoted by $c_i^*$. Let $U_1$ and $U_2$ again be two multivariate normal variables whose covariance matrix can be partitioned to give
$$\Sigma = \operatorname{Cov}\begin{pmatrix} U_1 \\ U_2 \end{pmatrix} = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}.$$
The variance of $U_1$ conditional on $U_2$ is then given by
$$\operatorname{Var}(U_1 \mid U_2) = \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}.$$
We can now recall the partitioned covariance matrix derived for our application, equation (6.29), and substitute in the corresponding parts to obtain the conditional variance
$$c_i^* = \sigma^2 - \begin{pmatrix}\sigma^2 & \sigma^2 & \sigma^2\end{pmatrix}\begin{pmatrix}
\sigma^2 + \sigma_1^2 & \sigma^2 & \sigma^2 \\
\sigma^2 & \sigma^2 + \sigma_2^2 & \sigma^2 \\
\sigma^2 & \sigma^2 & \sigma^2 + \sigma_3^2
\end{pmatrix}^{-1}\begin{pmatrix}\sigma^2 \\ \sigma^2 \\ \sigma^2\end{pmatrix}. \tag{6.34}$$
Note that this conditional variance does not depend on the observed scores. Adding $c_i^*$ to $\mu_i^{*2}$ then gives the expected true score squared,
$$E(y_i^{*2} \mid y_{1i}, y_{2i}, y_{3i}, \boldsymbol\theta) = \mu_i^{*2} + c_i^*, \tag{6.35}$$
which is the expression stated in the lemma. $\blacksquare$
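The conditional moments $\mu_i^*$ and $E(y_i^{*2})$ of (6.30), (6.34) and (6.35) can be computed for a single unit with the following R sketch; the theta list layout is the same assumed convention as in the earlier E-step sketch.

## Sketch of (6.30), (6.34) and (6.35) for one unit with errors on all three sources.
true_score_moments <- function(y_i, x_i, theta) {
  mu      <- as.numeric(crossprod(x_i, theta$beta))              # x_i' beta
  S12     <- rep(theta$sigma2, 3)                                # Sigma_12 (1 x 3)
  S22     <- matrix(theta$sigma2, 3, 3) + diag(theta$sigma2_g)   # Sigma_22
  mu_star <- mu + as.numeric(S12 %*% solve(S22, y_i - mu))       # (6.30)
  c_star  <- theta$sigma2 - as.numeric(S12 %*% solve(S22, S12))  # (6.34)
  c(mean = mu_star, square = mu_star^2 + c_star)                 # (6.35)
}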
6.4.1 Incomplete-data loglikelihood
Once we have derived our expected true scores and error probabilities for the cases where
zgi is unobserved we can substitute them into the complete-data likelihood (6.15) to form
the incomplete-data loglikelihood function.
$$\begin{aligned}
\ell(\boldsymbol\theta) = C + \sum_{i=1}^{n}\Bigg\{&\sum_{g=1}^{3}\left[E(z_{gi})\log\pi_g + (1 - E(z_{gi}))\log(1 - \pi_g)\right] - \log\sigma - \frac{E\big((y_i^* - \mathbf{x}_i^T\boldsymbol\beta)^2\big)}{2\sigma^2} \\
&- \sum_{g=1}^{3} E(z_{gi})\left[\log\sigma_g + \frac{E\big((y_{gi} - y_i^*)^2\big)}{2\sigma_g^2}\right]\Bigg\}. \tag{6.36}
\end{aligned}$$
Then this loglikelihood can be rewritten so that the units with observed and unobserved
error combinations are separated.
$$\begin{aligned}
\ell(\boldsymbol\theta) = C &+ \sum_{j=1}^{4}\sum_{i \in S_j}\Bigg\{\sum_{g=1}^{3}\left[z_{gi}\log\pi_g + (1 - z_{gi})\log(1 - \pi_g)\right] - \log\sigma - \frac{(y_i^* - \mathbf{x}_i^T\boldsymbol\beta)^2}{2\sigma^2} - \sum_{g=1}^{3} z_{gi}\left[\log\sigma_g + \frac{(y_{gi} - y_i^*)^2}{2\sigma_g^2}\right]\Bigg\} \\
&+ \sum_{i \in S_{\mathrm{rest}}}\Bigg\{\sum_{g=1}^{3}\left[E(z_{gi})\log\pi_g + (1 - E(z_{gi}))\log(1 - \pi_g)\right] - \log\sigma - \frac{E(y_i^{*2}) - 2\,\mathbf{x}_i^T\boldsymbol\beta\,E(y_i^*) + (\mathbf{x}_i^T\boldsymbol\beta)^2}{2\sigma^2} \\
&\qquad\qquad - \sum_{g=1}^{3} E(z_{gi})\left[\log\sigma_g + \frac{E(y_i^{*2}) - 2y_{gi}E(y_i^*) + y_{gi}^2}{2\sigma_g^2}\right]\Bigg\}. \tag{6.37}
\end{aligned}$$
The relevant expectations introduced in the previous section can then be substituted in. Expanding this leaves us with a very lengthy equation, so for the sake of readability some shorthand notation is introduced. The constant term $C$ can also be removed, as it contains none of the parameters and would disappear upon differentiation. The full form of the $Q$ function, which must be differentiated with respect to the parameters in the maximisation step, can then be written as follows:
$$\begin{aligned}
Q ={}& \sum_{i \in S_1}\log w_{000} + \sum_{i \in S_2}\log w_{001} + \sum_{i \in S_3}\log w_{010} + \sum_{i \in S_4}\log w_{100} + \sum_{klm \in \mathrm{rest}}\;\sum_{i \in S_{\mathrm{rest}}}\tau_{klmi}\log w_{klm} \\
&- \frac{n}{2}\log\sigma^2 - \frac{1}{2\sigma^2}V
- \frac{n_{\epsilon 1}}{2}\log\sigma_1^2 - \frac{1}{2\sigma_1^2}V_{\epsilon 1}
- \frac{n_{\epsilon 2}}{2}\log\sigma_2^2 - \frac{1}{2\sigma_2^2}V_{\epsilon 2}
- \frac{n_{\epsilon 3}}{2}\log\sigma_3^2 - \frac{1}{2\sigma_3^2}V_{\epsilon 3},
\end{aligned} \tag{6.38}$$
where rest denotes the set of error combinations $\{011, 101, 110, 111\}$.
The maximum likelihood parameter estimates for each iteration of the algorithm are obtained by differentiating this function, and before this is possible it needs to be written in expanded form. As the weights are independent of $i$, they are the same for every unit within the same category. The summation can therefore be simplified,
$$\sum_{i \in S_1}\log w_{000} = n_1\log w_{000}, \tag{6.39}$$
and analogously for the sums over sets $S_2$ to $S_4$. This gives
$$\begin{aligned}
Q ={}& n_1\log w_{000} + n_2\log w_{001} + n_3\log w_{010} + n_4\log w_{100} + \sum_{klm \in \mathrm{rest}}\;\sum_{i \in S_{\mathrm{rest}}}\tau_{klmi}\log w_{klm} \\
&- \frac{n}{2}\log\sigma^2 - \frac{1}{2\sigma^2}V
- \frac{n_{\epsilon 1}}{2}\log\sigma_1^2 - \frac{1}{2\sigma_1^2}V_{\epsilon 1}
- \frac{n_{\epsilon 2}}{2}\log\sigma_2^2 - \frac{1}{2\sigma_2^2}V_{\epsilon 2}
- \frac{n_{\epsilon 3}}{2}\log\sigma_3^2 - \frac{1}{2\sigma_3^2}V_{\epsilon 3}.
\end{aligned}$$
Here n_j denotes the number of units in set S_j, which can be observed from the data, and n_{ε_g} denotes the expected number of units with an error on source g:
\[ n_{\epsilon_g} = \sum_{i=1}^{n} E(z_{gi}). \]
This value is known for the data for which the error labels are observed, but needs to be estimated for the remaining units:
\[ n_{\epsilon_1} = n_4 + \sum_{i\in S_{rest}} E(z_{1i}) = n_4 + \sum_{i\in S_{rest}} (\tau_{110i} + \tau_{101i} + \tau_{111i}), \]
and analogously
\[ n_{\epsilon_2} = n_3 + \sum_{i\in S_{rest}} (\tau_{110i} + \tau_{011i} + \tau_{111i}), \qquad
   n_{\epsilon_3} = n_2 + \sum_{i\in S_{rest}} (\tau_{011i} + \tau_{101i} + \tau_{111i}). \]
In our Q function the 'loss' parts of the normal probability density functions have been grouped and given the notations V and V_g. V can be understood as the overall estimated sum of squared errors around the regression line, and V_g as the sum of squared measurement errors for the observations with an error on source g:
\[ \begin{aligned} V = {}& \sum_{i\in S_1}(y_{1i}-x_i^T\beta)^2 + \sum_{i\in S_2}(y_{1i}-x_i^T\beta)^2 + \sum_{i\in S_3}(y_{1i}-x_i^T\beta)^2 + \sum_{i\in S_4}(y_{2i}-x_i^T\beta)^2 \\
 & + \sum_{i\in S_{rest}}\Big[\tau_{110i}(y_{3i}-x_i^T\beta)^2 + \tau_{101i}(y_{2i}-x_i^T\beta)^2 + \tau_{011i}(y_{1i}-x_i^T\beta)^2 \\
 & \qquad\quad + \tau_{111i}\big(E(y_i^{*2}) - 2x_i^T\beta\,E(y_i^*) + (x_i^T\beta)^2\big)\Big]    (6.40) \end{aligned} \]
\[ \begin{aligned} V_1 = {}& \sum_{i\in S_4}(y_{1i}-y_{2i})^2
 + \sum_{i\in S_{rest}}\Big[\tau_{110i}(y_{1i}-y_{3i})^2 + \tau_{101i}(y_{1i}-y_{2i})^2
 + \tau_{111i}\big(E(y_i^{*2}) - 2y_{1i}E(y_i^*) + y_{1i}^2\big)\Big]    (6.41) \\
 V_2 = {}& \sum_{i\in S_3}(y_{1i}-y_{2i})^2
 + \sum_{i\in S_{rest}}\Big[\tau_{110i}(y_{2i}-y_{3i})^2 + \tau_{011i}(y_{1i}-y_{2i})^2
 + \tau_{111i}\big(E(y_i^{*2}) - 2y_{2i}E(y_i^*) + y_{2i}^2\big)\Big] \\
 V_3 = {}& \sum_{i\in S_2}(y_{1i}-y_{3i})^2
 + \sum_{i\in S_{rest}}\Big[\tau_{101i}(y_{2i}-y_{3i})^2 + \tau_{011i}(y_{1i}-y_{3i})^2
 + \tau_{111i}\big(E(y_i^{*2}) - 2y_{3i}E(y_i^*) + y_{3i}^2\big)\Big]. \end{aligned} \]
These notations may seem unwieldy but lead to elegant solutions when deriving the
parameter estimates in the M step, as we will show in the next section.
6.5 Maximisation Step
In the maximisation step of the algorithm, the distributional properties of the overall dataset are estimated from the posterior probabilities of component membership for each datapoint. Estimates of each of the parameters can therefore be derived by taking the partial derivative of our Q function with respect to each of the relevant parameters in turn and finding the argument of the maximum of the result. As we will show, there are some dependencies between these derivation steps, and it is therefore necessary to establish a particular order in which the estimates are derived.
We will now introduce the equations that provide the updated parameter estimates for
each iteration of the algorithm and then show how these were derived, starting with the error
proportions.
6.5.1 Updating the Error Proportions
Theorem 1. The error proportion estimates for each of the three sources can be updated
within each iteration of the algorithm by the following three equations.
\[ \pi_1^{(t+1)} = \frac{n_4 + \sum_{i\in S_{rest}}\big(\tau_{110i}^{(t)} + \tau_{101i}^{(t)} + \tau_{111i}^{(t)}\big)}{n}    (6.42) \]
\[ \pi_2^{(t+1)} = \frac{n_3 + \sum_{i\in S_{rest}}\big(\tau_{110i}^{(t)} + \tau_{011i}^{(t)} + \tau_{111i}^{(t)}\big)}{n}    (6.43) \]
\[ \pi_3^{(t+1)} = \frac{n_2 + \sum_{i\in S_{rest}}\big(\tau_{011i}^{(t)} + \tau_{101i}^{(t)} + \tau_{111i}^{(t)}\big)}{n}    (6.44) \]
Proof. It has already been stated that the M step involves maximising Q with respect to
each of the eight estimable parameters.
Therefore, for the error proportions π1, π2 and π3, the (t+1)th parameter estimates can be computed by
\[ \pi_1^{(t+1)} = \operatorname*{argmax}_{\pi_1} Q(\theta \mid \theta^{(t)}), \qquad \pi_2^{(t+1)} = \operatorname*{argmax}_{\pi_2} Q(\theta \mid \theta^{(t)}), \qquad \pi_3^{(t+1)} = \operatorname*{argmax}_{\pi_3} Q(\theta \mid \theta^{(t)}). \]
These maxima can be found by differentiating the Q function with respect to the proportions, as the following illustrates for the error proportion of the first source:
\[ \begin{aligned} \frac{\partial Q}{\partial \pi_1} = \frac{\partial}{\partial \pi_1}\Big( & n_1\log w_{000} + n_2\log w_{001} + n_3\log w_{010} + n_4\log w_{100}
 + \sum_{klm\in rest}\;\sum_{i\in S_{rest}} \tau_{klmi}\log w_{klm} \\
 & - \frac{n}{2}\log\sigma^2 - \frac{1}{2\sigma^2}V
   - \frac{n_{\epsilon_1}}{2}\log\sigma_1^2 - \frac{1}{2\sigma_1^2}V_1
   - \frac{n_{\epsilon_2}}{2}\log\sigma_2^2 - \frac{1}{2\sigma_2^2}V_2
   - \frac{n_{\epsilon_3}}{2}\log\sigma_3^2 - \frac{1}{2\sigma_3^2}V_3 \Big). \end{aligned} \]
The density components can be removed, as they are independent of π1 and so disappear when the partial derivative is taken; only the mixing weights contain components of π1. We are left with:
\[ \frac{\partial Q}{\partial \pi_1} = \frac{\partial}{\partial \pi_1}\Big( n_1\log w_{000} + n_2\log w_{001} + n_3\log w_{010} + n_4\log w_{100} + \sum_{klm\in rest}\;\sum_{i\in S_{rest}} \tau_{klmi}\log w_{klm} \Big). \]
Once this derivative is taken the following equation remains:
\[ \frac{\partial Q}{\partial \pi_1} = \frac{n_4 + \sum_{i\in S_{rest}}(\tau_{110i} + \tau_{101i} + \tau_{111i})}{\pi_1} - \frac{n_1 + n_2 + n_3 + \sum_{i\in S_{rest}}\tau_{011i}}{1-\pi_1} = 0. \]
Solving this gives:
\[ \pi_1 = \frac{n_4 + \sum_{i\in S_{rest}}(\tau_{110i} + \tau_{101i} + \tau_{111i})}{n}. \]
Similarly:
\[ \pi_2 = \frac{n_3 + \sum_{i\in S_{rest}}(\tau_{110i} + \tau_{011i} + \tau_{111i})}{n}, \qquad
   \pi_3 = \frac{n_2 + \sum_{i\in S_{rest}}(\tau_{011i} + \tau_{101i} + \tau_{111i})}{n}. \]
The π values are then updated correspondingly with each iteration of the algorithm. These estimates can be understood intuitively as
\[ \frac{\text{estimated number of units with an error on source } g}{\text{total number of units}}. \]
∎
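As an illustration, the three updates in (6.42)-(6.44) amount to simple weighted counts. The following is a minimal R sketch, assuming n, n2, n3, n4 and the tau vectors over S_rest have already been produced by the data split and the preceding E step.

# Sketch of the error proportion updates (6.42)-(6.44).
update_pi <- function(n, n2, n3, n4, tau110, tau101, tau011, tau111) {
  pi1 <- (n4 + sum(tau110 + tau101 + tau111)) / n
  pi2 <- (n3 + sum(tau110 + tau011 + tau111)) / n
  pi3 <- (n2 + sum(tau011 + tau101 + tau111)) / n
  c(pi1 = pi1, pi2 = pi2, pi3 = pi3)
}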
6.5.2 Updating the Regression Coefficients
Subsequently the regression coefficient vector β can be updated. This updated estimate is given by the following equation:
\[ \begin{aligned} \beta^{(t+1)} = \Big(\sum_{i\in S} x_i x_i^T\Big)^{-1} \Bigg\{ & \sum_{i\in S_1} x_i y_{1i} + \sum_{i\in S_2} x_i y_{1i} + \sum_{i\in S_3} x_i y_{1i} + \sum_{i\in S_4} x_i y_{2i} \\
 & + \sum_{i\in S_{rest}}\Big(\tau_{110i}^{(t)} x_i y_{3i} + \tau_{101i}^{(t)} x_i y_{2i} + \tau_{011i}^{(t)} x_i y_{1i} + \tau_{111i}^{(t)} x_i E(y_i^*)\Big) \Bigg\}.    (6.45) \end{aligned} \]
To reach this equation, the argument of β that maximises the Q function must be found:
\[ \beta^{(t+1)} = \operatorname*{argmax}_{\beta} Q(\theta \mid \theta^{(t)}). \]
This was achieved by differentiating the Q function again, this time with respect to β:
\[ \begin{aligned} \frac{\partial Q}{\partial \beta} = \frac{\partial}{\partial \beta}\Big( & n_1\log w_{000} + n_2\log w_{001} + n_3\log w_{010} + n_4\log w_{100} + \sum_{klm\in rest}\;\sum_{i\in S_{rest}} \tau_{klmi}\log w_{klm} \\
 & - \frac{n}{2}\log\sigma^2 - \frac{1}{2\sigma^2}V - \frac{n_{\epsilon_1}}{2}\log\sigma_1^2 - \frac{1}{2\sigma_1^2}V_1 - \frac{n_{\epsilon_2}}{2}\log\sigma_2^2 - \frac{1}{2\sigma_2^2}V_2 - \frac{n_{\epsilon_3}}{2}\log\sigma_3^2 - \frac{1}{2\sigma_3^2}V_3 \Big). \end{aligned} \]
The mixing weights contain no β elements, so they can be removed from the derivative at this stage, leaving us with:
\[ \begin{aligned} \frac{\partial Q}{\partial \beta} &= \frac{\partial}{\partial \beta}\Big( -\frac{n}{2}\log\sigma^2 - \frac{1}{2\sigma^2}V - \frac{n_{\epsilon_1}}{2}\log\sigma_1^2 - \frac{1}{2\sigma_1^2}V_1 - \frac{n_{\epsilon_2}}{2}\log\sigma_2^2 - \frac{1}{2\sigma_2^2}V_2 - \frac{n_{\epsilon_3}}{2}\log\sigma_3^2 - \frac{1}{2\sigma_3^2}V_3 \Big) \\
 &= \frac{1}{2\sigma^2}\Bigg\{ 2\sum_{i\in S_1}(x_i y_{1i} - x_i x_i^T\beta) + 2\sum_{i\in S_2}(x_i y_{1i} - x_i x_i^T\beta) + 2\sum_{i\in S_3}(x_i y_{1i} - x_i x_i^T\beta) + 2\sum_{i\in S_4}(x_i y_{2i} - x_i x_i^T\beta) \\
 &\qquad + 2\sum_{i\in S_{rest}}\Big[\tau_{110i}(x_i y_{3i} - x_i x_i^T\beta) + \tau_{101i}(x_i y_{2i} - x_i x_i^T\beta) + \tau_{011i}(x_i y_{1i} - x_i x_i^T\beta) + \tau_{111i}\big(x_i E(y_i^*) - x_i x_i^T\beta\big)\Big] \Bigg\} = 0. \end{aligned} \]
Solving this for β gives us (6.45).
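A compact way to read (6.45) is as a least squares fit of the (expected) true scores on the covariates. The sketch below assumes that the expected true scores have already been assembled into a single vector ytrue (observed scores for S1-S4, τ-weighted combinations for S_rest), mirroring the estimator used in the R code in Appendix C.

# Sketch of the regression coefficient update (6.45).
# X is the n x p covariate matrix; ytrue is the vector of (expected) true scores.
update_beta <- function(X, ytrue) {
  solve(t(X) %*% X, t(X) %*% ytrue)
}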
6.5.3 Updating The Variances
The same procedure is followed to update the variance estimates σ², σ1², σ2² and σ3² in the M step. First, for the true score variance:
\[ \begin{aligned} \frac{\partial Q}{\partial \sigma^2} = \frac{\partial}{\partial \sigma^2}\Big( & n_1\log w_{000} + n_2\log w_{001} + n_3\log w_{010} + n_4\log w_{100} + \sum_{klm\in rest}\;\sum_{i\in S_{rest}} \tau_{klmi}\log w_{klm} \\
 & - \frac{n}{2}\log\sigma^2 - \frac{1}{2\sigma^2}V - \frac{n_{\epsilon_1}}{2}\log\sigma_1^2 - \frac{1}{2\sigma_1^2}V_1 - \frac{n_{\epsilon_2}}{2}\log\sigma_2^2 - \frac{1}{2\sigma_2^2}V_2 - \frac{n_{\epsilon_3}}{2}\log\sigma_3^2 - \frac{1}{2\sigma_3^2}V_3 \Big). \end{aligned} \]
Only two of these terms contain σ² components, so this can be simplified before taking the derivative:
\[ \frac{\partial Q}{\partial \sigma^2} = \frac{\partial}{\partial \sigma^2}\Big( -\frac{n}{2}\log\sigma^2 - \frac{1}{2\sigma^2}V \Big) = -\frac{n}{2\sigma^2} + \frac{V}{2\sigma^4} = 0. \]
Solving for σ² gives:
\[ \sigma^{2\,(t+1)} = \frac{V}{n}    (6.46) \]
and analogously for the observed score error variances:
\[ \sigma_1^{2\,(t+1)} = \frac{V_1}{n_{\epsilon_1}},    (6.47) \qquad
   \sigma_2^{2\,(t+1)} = \frac{V_2}{n_{\epsilon_2}},    (6.48) \qquad
   \sigma_3^{2\,(t+1)} = \frac{V_3}{n_{\epsilon_3}}.    (6.49) \]
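The variance updates are therefore simple ratios of the grouped sums of squares to the corresponding (expected) counts. A minimal R sketch, assuming V, V1-V3 and the expected error counts have been computed as above:

# Sketch of the variance updates (6.46)-(6.49).
update_variances <- function(V, V1, V2, V3, n, n_eps1, n_eps2, n_eps3) {
  c(sigma2   = V  / n,
    sigma2_1 = V1 / n_eps1,
    sigma2_2 = V2 / n_eps2,
    sigma2_3 = V3 / n_eps3)
}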
It should be noted that to obtain an estimate for V it is necessary to already have one for the regression coefficient vector β, and therefore the regression coefficients should be updated prior to the V terms within each iteration of the algorithm. This is the only dependency between parameters here, and it is one-way; if there were a two-way inter-dependency then maximising Q with respect to all the parameters simultaneously would be intractable. It would then not be possible to obtain parameter estimates with the standard EM algorithm, and a variation of it, the Expectation/Conditional Maximisation (ECM) algorithm, would have to be used. A general framework for the ECM algorithm is provided by Meng and Rubin (1993), and an example of a similar application in which it is used is detailed by Guarnera and di Zio (2013).
In the following chapter we will introduce a way in which the model can be extended to
account for the availability of a ’gold standard’ source.
Chapter 7
Combining the Sources With Gold Standard Data
Gold standard data may have been obtained by auditing an existing source, to increase the accuracy of the observations. It is assumed in this case that this auditing process is sufficiently effective to produce observations that are equal to the underlying true score values y*_i. Alternatively, a gold standard source may have been obtained independently of the existing three sources. Either way we are left with a vector of observations for which every unit is error free. Let the vector of observations given by this fourth source be denoted by y4, and let it be stated that
\[ y_{4i} = y_i^* \quad \text{for all } i\in S^*,    (7.1) \]
where S* denotes the set of units for which a gold standard observation has been obtained.
This set of audited observations, S*, is combined with the existing data. Obtaining a gold standard source is an expensive procedure, and hence it is generally only possible to produce these gold standard observations for a small subsample of the units. As the true score is already known for the sets of observations in the groups for which the error pattern has been identified, S1, S2, S3 and S4, it is not necessary to further audit units that belong to these groups. Therefore it would be prudent to randomly sample from the set of observations belonging to the remaining group, Srest, when determining which units to audit. Assuming that this has been done, it can then be stated that
\[ S^* \subseteq S_{rest}. \]
Once the underlying true score is known for the units with errors on two or more sources it is possible, by comparing the observation from each source with the true score, to infer the values of z1i, z2i and z3i for i ∈ S* and therefore the exact error pattern for each of these units. For example, if the equation y4i = y2i is satisfied but the equations y4i = y1i and y4i = y3i are not, then the unit has the error pattern 011. Subsequently our set Srest, previously defined as
\[ S_{rest} = S_5 \cup S_6 \cup S_7 \cup S_8, \]
can be further partitioned, based on whether or not the error combinations are now identifiable:
\[ S_{rest} = S'_{rest} \cup S^*_{rest},    (7.2) \]
where S'_rest denotes the group for which the error categories are still unknown and S*_rest is the group for which they can now be identified. These can be further partitioned into their constituent subsets based on error combination:
\[ S_{rest} = (S'_5 \cup S'_6 \cup S'_7 \cup S'_8) \cup (S^*_5 \cup S^*_6 \cup S^*_7 \cup S^*_8).    (7.3) \]
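Operationally, classifying an audited unit amounts to comparing each observed score with y4i. A minimal R sketch (y holds the three observed scores of a single unit and y4 its audited value; both are assumed to be available) is:

# Sketch: recover the error pattern z = (z1, z2, z3) for an audited unit
# by comparing each observed score with the gold standard value y4.
error_pattern <- function(y, y4) {
  as.integer(y != y4)   # 1 = error on that source, 0 = no error
}
error_pattern(c(10.2, 9.8, 11.1), 9.8)   # returns 1 0 1, i.e. pattern 101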
These final four subsets are now distinguishable from one another, and the expected true score
no longer needs to be estimated for their units. The incomplete-data likelihood expression
can be updated to incorporate this new information:
\[ \begin{aligned} Q = {}& \sum_{i\in S_1}\log w_{000} + \sum_{i\in S_2}\log w_{001} + \sum_{i\in S_3}\log w_{010} + \sum_{i\in S_4}\log w_{100}
 + \sum_{i\in S^*_5}\log w_{110} + \sum_{i\in S^*_6}\log w_{101} + \sum_{i\in S^*_7}\log w_{011} + \sum_{i\in S^*_8}\log w_{111} \\
 & + \sum_{i\in S'_{rest}}\;\sum_{klm\in rest} \tau_{klmi}\log w_{klm}
 - \frac{n}{2}\log\sigma^2 - \frac{1}{2\sigma^2}V^*
 - \frac{n_{\epsilon_1}}{2}\log\sigma_1^2 - \frac{1}{2\sigma_1^2}V^*_1
 - \frac{n_{\epsilon_2}}{2}\log\sigma_2^2 - \frac{1}{2\sigma_2^2}V^*_2
 - \frac{n_{\epsilon_3}}{2}\log\sigma_3^2 - \frac{1}{2\sigma_3^2}V^*_3, \end{aligned} \]
where V* and the V*_g components denote updated versions of the sum-of-squares expressions (6.40) and (6.41) and the corresponding expressions for V_2 and V_3; they are all provided in the appendix.
In the SEM, which assumed continuous non-zero errors, the gold standard data were necessary as a reference point once the slope and intercept parameters were included. This is not the case in our model with intermittent errors, as the observations without error can already act as a reference. However, the precision of the model is still improved by the inclusion of the 'gold standard' observations, as the proportion of units with unknown error combinations is reduced.
Chapter 8
Model 2: Including Slope and Intercept
Although the algorithm defined in the previous chapter is successful in producing estimates, at this stage it is simplistic and assumes measurement on each source is unbiased, i.e. all of the sources are assumed to have errors that are centred around zero (with mean 0 and variance σ_g²). There are no parameters that account for the possibility of systematic measurement bias leading to a skewed set of results coming from a particular source. This model would therefore produce inaccurate estimates in the case where there is bias on one or more of the sources. If, on the other hand, it were possible to estimate the systematic bias for a given source, then "deductive corrections" could be performed to correct a large number of observations simultaneously as part of the data editing process, as the underlying error mechanism is known. This increases the efficiency of the editing process, as less manual editing is necessary (de Waal et al., 2011b). We will now present a model which is able to capture these systematic errors.
To accommodate potential bias we introduce six further estimable parameters to the model: an intercept parameter a_g and a slope parameter b_g for each of the three sources. The parameters that now require estimating are:
\[ \theta = (\pi_1, \pi_2, \pi_3, \beta, \sigma^2, \sigma_1^2, \sigma_2^2, \sigma_3^2, a_1, b_1, a_2, b_2, a_3, b_3). \]
Introducing intercept and slope parameters does not affect the true score y*_i, only the observed scores y_gi from each source given the presence of an error on source g.
The ith observed score for source g, y_gi, can now be described by:
\[ y_{gi} = \begin{cases} y_i^* & \text{if there is no error on source } g\ (z_{gi}=0) \\ a_g + b_g y_i^* + \epsilon_{gi} & \text{if there is an error on source } g\ (z_{gi}=1) \end{cases}    (8.1) \]
Here a_g denotes the intercept coefficient and b_g the slope coefficient for an observation from source g. The equations (8.1) can also be written in the form:
\[ y_{gi} = a_g z_{gi} + b_g^{z_{gi}} y_i^* + \epsilon_{gi} z_{gi}. \]
The model described in the previous chapter can be thought of as this one under the restriction that ag = 0 and bg = 1.
As with the previous model, the first step in the EM algorithm is obtaining the initial
parameter estimates. We will now demonstrate how these are obtained in Model 2.
8.1 Initial Estimates
For the first five parameters these estimates are derived no differently from how they were in Model 1, as the intercept and slope parameters have no dependencies with any of the error proportions, the covariate regression coefficients or the true score variance.
Recall how least squares regression was used to obtain the estimated β coefficients in Model 1, and subsequently to obtain the fitted scores with which the initial true score variance estimate was calculated. In Model 1 it was not necessary to estimate fitted values during the estimation of the error variances, because under the restrictions a_g = 0 and b_g = 1 these 'fitted' values would be equal to the known true scores. As these restrictions have been removed in Model 2, the procedure to obtain initial estimates for the observed score error variances now similarly involves least squares regression; specifically, simple linear regression of the observed scores on the true scores. We have already stated that y_gi = a_g + b_g y*_i + ε_gi in the case of an error on source g. This can be written in terms of the residual component
\[ \epsilon_{gi} = y_{gi} - a_g - b_g y_i^*. \]
The aim then is to find the equation of the straight line
\[ y_{gi} = a_g + b_g y_i^*    (8.2) \]
which best fits the data and thus minimises the sum of squared residuals
\[ \sum \epsilon_{gi}^2 = \sum (y_{gi} - a_g - b_g y_i^*)^2. \]
The values of α and β that minimise the function
\[ \sum_{i=1}^{n} (y_i - \alpha - \beta x_i)^2 \]
are given by
\[ \hat\beta = \frac{\overline{xy} - \bar x\,\bar y}{\overline{x^2} - \bar x^2} = \frac{\mathrm{Cov}(x,y)}{\mathrm{Var}(x)} \qquad \text{and} \qquad \hat\alpha = \bar y - \hat\beta\,\bar x. \]
Adapting this to our application gives us the initial estimates
\[ b_1^{(0)} = \frac{\mathrm{Cov}(y_i^*, y_{1i})}{\mathrm{Var}(y_i^*)}    (8.3) \]
and
\[ a_1^{(0)} = \bar y_1 - b_1^{(0)} \bar y^*.    (8.4) \]
These are calculated over the units in which there is an error on Source 1 and the true score is observable, i.e. {i ∈ S_4}. The same technique can be used to obtain estimates for a_2^{(0)}, b_2^{(0)}, a_3^{(0)} and b_3^{(0)}.
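These starting values are ordinary simple-regression estimates computed on the units with an identified error on the relevant source. A minimal R sketch for Source 1 is given below (using the S4 units, where y2 equals the true score); lm() performs the same computation as (8.3)-(8.4), and its residuals also yield the starting value for the error variance introduced below.

# Sketch of the initial estimates for Source 1 in Model 2.
# y1_S4: erroneous Source 1 observations; ystar_S4: known true scores (= y2) in S4.
init_source1 <- function(y1_S4, ystar_S4) {
  fit <- lm(y1_S4 ~ ystar_S4)               # y1 = a1 + b1 * y* + eps
  list(a1   = unname(coef(fit)[1]),
       b1   = unname(coef(fit)[2]),
       var1 = sum(residuals(fit)^2) / (length(y1_S4) - 2))
}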
We can now derive the initial estimates for the observed score error variances. These parameters measure the variance of the observed score that is unexplained by the true score, and hence the observed scores are regressed on the true scores to obtain first the initial slope and intercept estimates and then the fitted values from which the residuals are calculated. For the first source this gives
\[ \sigma_1^{2\,(0)} = \frac{1}{n_4 - 2}\sum_{i\in S_4}(y_{1i} - \hat y_{1i})^2,    (8.5) \]
where ŷ_1i are the fitted values obtained when regressing the observed scores of the first source, in the cases where they are known to contain errors, on the true scores for the same units. Analogously for the second and third error variances,
\[ \sigma_2^{2\,(0)} = \frac{1}{n_3 - 2}\sum_{i\in S_3}(y_{2i} - \hat y_{2i})^2,    (8.6) \]
\[ \sigma_3^{2\,(0)} = \frac{1}{n_2 - 2}\sum_{i\in S_2}(y_{3i} - \hat y_{3i})^2.    (8.7) \]

8.2 Expectation Step
The introduction of these new parameters also means that the variance and covariance elements of the true and observed scores will change. Again, let y_ci and y_di denote the observed scores given by two of the sources for unit i, where c ≠ d. The intercept and slope parameters do not affect the true scores, so it is still the case that
\[ \mathrm{Var}(y_i^*) = \sigma^2. \]
However, in the case of an error on source c the observed score y_ci now has the variance
\[ \mathrm{Var}(y_{ci}) = \mathrm{Var}(a_c + b_c y_i^* + \epsilon_{ci}) = b_c^2\,\mathrm{Var}(y_i^*) + \mathrm{Var}(\epsilon_{ci}) = b_c^2\sigma^2 + \sigma_c^2. \]
The covariance between y_ci and y*_i is now given by
\[ \mathrm{Cov}(y_{ci}, y_i^*) = \mathrm{Cov}(a_c + b_c y_i^* + \epsilon_{ci},\, y_i^*) = \mathrm{Cov}(a_c, y_i^*) + \mathrm{Cov}(b_c y_i^*, y_i^*) + \mathrm{Cov}(\epsilon_{ci}, y_i^*). \]
The first term vanishes because a_c is a constant, and the third because y*_i and ε_ci are independent, so
\[ \mathrm{Cov}(y_{ci}, y_i^*) = \mathrm{Cov}(b_c y_i^*, y_i^*) = b_c\,\mathrm{Var}(y_i^*) = b_c\sigma^2. \]
The covariance between y_ci and y_di in the case of an error on both sources is
\[ \mathrm{Cov}(y_{ci}, y_{di}) = \mathrm{Cov}(a_c + b_c y_i^* + \epsilon_{ci},\, a_d + b_d y_i^* + \epsilon_{di}) = \mathrm{Cov}(b_c y_i^*, b_d y_i^*) = b_c b_d\,\mathrm{Var}(y_i^*) = b_c b_d\sigma^2. \]
As a result of this, the four-dimensional covariance matrix conditional on z_gi, defined in (6.11) for the basic model, is now given by
\[ \Sigma_{y_i^* y_{1i} y_{2i} y_{3i}} = \begin{pmatrix}
\sigma^2 & b_1^{z_{1i}}\sigma^2 & b_2^{z_{2i}}\sigma^2 & b_3^{z_{3i}}\sigma^2 \\
b_1^{z_{1i}}\sigma^2 & b_1^{2z_{1i}}\sigma^2 + z_{1i}\sigma_1^2 & b_1^{z_{1i}}b_2^{z_{2i}}\sigma^2 & b_1^{z_{1i}}b_3^{z_{3i}}\sigma^2 \\
b_2^{z_{2i}}\sigma^2 & b_1^{z_{1i}}b_2^{z_{2i}}\sigma^2 & b_2^{2z_{2i}}\sigma^2 + z_{2i}\sigma_2^2 & b_2^{z_{2i}}b_3^{z_{3i}}\sigma^2 \\
b_3^{z_{3i}}\sigma^2 & b_1^{z_{1i}}b_3^{z_{3i}}\sigma^2 & b_2^{z_{2i}}b_3^{z_{3i}}\sigma^2 & b_3^{2z_{3i}}\sigma^2 + z_{3i}\sigma_3^2
\end{pmatrix}.    (8.8) \]
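The covariance matrix (8.8) can be written down directly once z1i, z2i and z3i are fixed. The following R sketch builds the full 4x4 matrix for a given error pattern (the parameter values passed in the example are purely illustrative); its bottom-right 3x3 block is the Σ_klm used below.

# Sketch of the covariance matrix (8.8) for a fixed error pattern z = (z1, z2, z3).
# sigma2: true score variance; s: the three error variances; b: the three slopes.
make_sigma <- function(z, sigma2, s, b) {
  f <- b ^ z                                 # slope enters only when z_g = 1
  S <- sigma2 * outer(c(1, f), c(1, f))      # covariance part driven by y*
  S + diag(c(0, z * s))                      # add error variances on erroneous sources
}
Sigma     <- make_sigma(z = c(1, 1, 0), sigma2 = 25, s = c(0.3, 0.2, 0.05),
                        b = c(1.02, 1.03, 0.95))
Sigma_klm <- Sigma[2:4, 2:4]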
As previously, the posterior error combination probabilities τ_klmi are given by equation (6.24):
\[ \tau_{klmi} = P(z_{1i}=k, z_{2i}=l, z_{3i}=m \mid y_i \in S_{rest})
 = \frac{w_{klm}^{(t)}\, h_{klm}(y_{1i}, y_{2i}, y_{3i} \mid \theta^{(t)})}{\sum_{klm\in\{110,101,011,111\}} w_{klm}^{(t)}\, h_{klm}(y_{1i}, y_{2i}, y_{3i} \mid \theta^{(t)})}, \]
but in this case h_klm(y_1i, y_2i, y_3i | θ^(t)) is the trivariate normal density
\[ h_{klm}(y_{1i}, y_{2i}, y_{3i} \mid \theta^{(t)}) = N_3\left(\begin{pmatrix} y_{1i} \\ y_{2i} \\ y_{3i} \end{pmatrix};
\begin{pmatrix} z_{1i}a_1 + b_1^{z_{1i}} x_i^T\beta \\ z_{2i}a_2 + b_2^{z_{2i}} x_i^T\beta \\ z_{3i}a_3 + b_3^{z_{3i}} x_i^T\beta \end{pmatrix}, \Sigma_{klm}\right), \]
and Σ_klm is the joint covariance matrix for a unit with error combination klm, equivalent to the bottom-right 3x3 block of (8.8):
\[ \Sigma_{klm} = \begin{pmatrix}
b_1^{2z_{1i}}\sigma^2 + z_{1i}\sigma_1^2 & b_1^{z_{1i}}b_2^{z_{2i}}\sigma^2 & b_1^{z_{1i}}b_3^{z_{3i}}\sigma^2 \\
b_1^{z_{1i}}b_2^{z_{2i}}\sigma^2 & b_2^{2z_{2i}}\sigma^2 + z_{2i}\sigma_2^2 & b_2^{z_{2i}}b_3^{z_{3i}}\sigma^2 \\
b_1^{z_{1i}}b_3^{z_{3i}}\sigma^2 & b_2^{z_{2i}}b_3^{z_{3i}}\sigma^2 & b_3^{2z_{3i}}\sigma^2 + z_{3i}\sigma_3^2
\end{pmatrix}. \]
The expected true score μ*_i is now a function of a_g and b_g as well as the other parameters:
\[ \mu_i^* = E(y_i^* \mid y_{1i}, y_{2i}, y_{3i}, \theta^{(t)}) = x_i^T\beta + \begin{pmatrix} b_1^{z_{1i}}\sigma^2 & b_2^{z_{2i}}\sigma^2 & b_3^{z_{3i}}\sigma^2 \end{pmatrix} \Sigma_{klm}^{-1}
\begin{pmatrix} y_{1i} - z_{1i}a_1 - b_1^{z_{1i}} x_i^T\beta \\ y_{2i} - z_{2i}a_2 - b_2^{z_{2i}} x_i^T\beta \\ y_{3i} - z_{3i}a_3 - b_3^{z_{3i}} x_i^T\beta \end{pmatrix}.    (8.9) \]
The expected true score squared must also be updated. Recall from Chapter 6 that
\[ E(y_i^{*2} \mid y_{1i}, y_{2i}, y_{3i}, \theta) = \mu_i^{*2} + c_i^*, \]
where c*_i denotes the variance conditional on the data and the current parameter estimates, Var(y*_i | y_1i, y_2i, y_3i, θ). This is again the case with Model 2, but the variance, again obtained by partitioning the covariance matrix (8.8), is now given by
\[ c_i^* = \sigma^2 - \begin{pmatrix} b_1^{z_{1i}}\sigma^2 & b_2^{z_{2i}}\sigma^2 & b_3^{z_{3i}}\sigma^2 \end{pmatrix} \Sigma_{klm}^{-1}
\begin{pmatrix} b_1^{z_{1i}}\sigma^2 \\ b_2^{z_{2i}}\sigma^2 \\ b_3^{z_{3i}}\sigma^2 \end{pmatrix}.    (8.10) \]
The density components for the observations containing errors, previously given by equation (6.13), are then adjusted to accommodate this change in expected score:
\[ h_g(y_{gi} \mid y_i^*) = \frac{1}{\sqrt{2\pi\sigma_g^2}} \exp\left\{ -\frac{(y_{gi} - a_g - b_g y_i^*)^2}{2\sigma_g^2} \right\}. \]
The likelihood expression is therefore adjusted accordingly:
\[ \ell_c(\theta) = C + \sum_{i=1}^{n}\Bigg\{ \sum_{g=1}^{3}\big[z_{gi}\log\pi_g + (1-z_{gi})\log(1-\pi_g)\big]
 - \log\sigma - \frac{(y_i^* - x_i^T\beta)^2}{2\sigma^2}
 - \sum_{g=1}^{3} z_{gi}\bigg[\log\sigma_g + \frac{(y_{gi} - a_g - b_g y_i^*)^2}{2\sigma_g^2}\bigg] \Bigg\}.    (8.11) \]
The location parameters in the normal density components corresponding to these observations, previously given in (6.12), are replaced by a_g + b_g y*_i, and the V_g expressions given by (6.41) and the analogous equations for V_2 and V_3 are all subsequently updated:
\[ V_1 = \sum_{i\in S_4}\big(y_{1i} - [a_1 + b_1 y_{2i}]\big)^2 + \sum_{i\in S_{rest}}\Big[\tau_{110i}\big(y_{1i} - [a_1 + b_1 y_{3i}]\big)^2 + \tau_{101i}\big(y_{1i} - [a_1 + b_1 y_{2i}]\big)^2 + \tau_{111i}\,E\big\{(y_{1i} - (a_1 + b_1 y_i^*))^2\big\}\Big]. \]
Expanding the final part gives
\[ \begin{aligned} E\big\{(y_{1i} - (a_1 + b_1 y_i^*))^2\big\} &= y_{1i}^2 - 2y_{1i}\big(a_1 + b_1 E(y_i^*)\big) + E\big((a_1 + b_1 y_i^*)^2\big) \\
 &= y_{1i}^2 - 2y_{1i}(a_1 + b_1\mu_i^*) + a_1^2 + 2a_1 b_1\mu_i^* + b_1^2 E(y_i^{*2}) \\
 &= y_{1i}^2 - 2y_{1i}(a_1 + b_1\mu_i^*) + a_1^2 + 2a_1 b_1\mu_i^* + b_1^2(c_i^* + \mu_i^{*2}). \end{aligned} \]
Hence:
\[ V_1 = \sum_{i\in S_4}\big(y_{1i} - [a_1 + b_1 y_{2i}]\big)^2 + \sum_{i\in S_{rest}}\Big[\tau_{110i}\big(y_{1i} - [a_1 + b_1 y_{3i}]\big)^2 + \tau_{101i}\big(y_{1i} - [a_1 + b_1 y_{2i}]\big)^2 + \tau_{111i}\big(y_{1i}^2 - 2y_{1i}(a_1 + b_1\mu_i^*) + a_1^2 + 2a_1 b_1\mu_i^* + b_1^2(c_i^* + \mu_i^{*2})\big)\Big], \]
and similarly
\[ V_2 = \sum_{i\in S_3}\big(y_{2i} - [a_2 + b_2 y_{1i}]\big)^2 + \sum_{i\in S_{rest}}\Big[\tau_{110i}\big(y_{2i} - [a_2 + b_2 y_{3i}]\big)^2 + \tau_{011i}\big(y_{2i} - [a_2 + b_2 y_{1i}]\big)^2 + \tau_{111i}\big(y_{2i}^2 - 2y_{2i}(a_2 + b_2\mu_i^*) + a_2^2 + 2a_2 b_2\mu_i^* + b_2^2(c_i^* + \mu_i^{*2})\big)\Big] \]
and
\[ V_3 = \sum_{i\in S_2}\big(y_{3i} - [a_3 + b_3 y_{1i}]\big)^2 + \sum_{i\in S_{rest}}\Big[\tau_{101i}\big(y_{3i} - [a_3 + b_3 y_{2i}]\big)^2 + \tau_{011i}\big(y_{3i} - [a_3 + b_3 y_{1i}]\big)^2 + \tau_{111i}\big(y_{3i}^2 - 2y_{3i}(a_3 + b_3\mu_i^*) + a_3^2 + 2a_3 b_3\mu_i^* + b_3^2(c_i^* + \mu_i^{*2})\big)\Big]. \]
These updated V_g components are then substituted into the previously defined estimation equations for the observed score error variances, (6.47), (6.48) and (6.49). As V remains the same with the introduction of slope and intercept parameters, so does the estimation equation for the true score variance, σ².
As with the other parameters, estimates for the intercept and slope values for each source can be obtained by finding the maximising arguments, i.e. by differentiating the Q function and equating to 0:
\[ \begin{aligned} \frac{\partial Q}{\partial a_1} \propto \frac{\partial V_1}{\partial a_1}
 = {}& \sum_{i\in S_4} -2\big(y_{1i} - [a_1 + b_1 y_{2i}]\big) \\
 & + \sum_{i\in S_{rest}}\Big\{ -2\tau_{110i}\big(y_{1i} - [a_1 + b_1 y_{3i}]\big) - 2\tau_{101i}\big(y_{1i} - [a_1 + b_1 y_{2i}]\big) + \tau_{111i}\big(-2y_{1i} + 2a_1 + 2b_1\mu_i^*\big)\Big\} = 0,    (8.12) \end{aligned} \]
\[ \begin{aligned} \frac{\partial Q}{\partial b_1} \propto \frac{\partial V_1}{\partial b_1}
 = {}& \sum_{i\in S_4} -2\big(y_{1i} - [a_1 + b_1 y_{2i}]\big)y_{2i} \\
 & + \sum_{i\in S_{rest}}\Big\{ -2\tau_{110i}\big(y_{1i} - [a_1 + b_1 y_{3i}]\big)y_{3i} - 2\tau_{101i}\big(y_{1i} - [a_1 + b_1 y_{2i}]\big)y_{2i} + \tau_{111i}\big(-2y_{1i}\mu_i^* + 2a_1\mu_i^* + 2b_1(c_i^* + \mu_i^{*2})\big)\Big\} = 0.    (8.13) \end{aligned} \]
The corresponding equations for a_2, a_3, b_2 and b_3 can be defined analogously. Equation (8.12) can be rearranged to give:
\[ a_1\Big\{\sum_{i\in S_4} 1 + \sum_{i\in S_{rest}}\big[\tau_{110i} + \tau_{101i} + \tau_{111i}\big]\Big\}
 = \sum_{i\in S_4}(y_{1i} - b_1 y_{2i}) + \sum_{i\in S_{rest}}\big[\tau_{110i}(y_{1i} - b_1 y_{3i}) + \tau_{101i}(y_{1i} - b_1 y_{2i}) + \tau_{111i}(y_{1i} - b_1\mu_i^*)\big], \]
which can be simplified, as \(\sum_{i\in S_4} 1 = n_{100}\) and \(\sum_{i\in S_{rest}}[\tau_{110i} + \tau_{101i} + \tau_{111i}] = n_{rest} - \sum_{i\in S_{rest}}\tau_{011i}\), and separated into factors of a_1 and b_1:
\[ a_1\Big[n_{100} + n_{rest} - \sum_{i\in S_{rest}}\tau_{011i}\Big]
 = \sum_{i\in S_4}(y_{1i} - b_1 y_{2i}) + \sum_{i\in S_{rest}}\big[\tau_{110i}(y_{1i} - b_1 y_{3i}) + \tau_{101i}(y_{1i} - b_1 y_{2i}) + \tau_{111i}(y_{1i} - b_1\mu_i^*)\big]. \]
If b_1 were known then a_1 could be derived by rearranging this:
\[ a_1 = \frac{\sum_{i\in S_4}(y_{1i} - b_1 y_{2i}) + \sum_{i\in S_{rest}}\big[\tau_{110i}(y_{1i} - b_1 y_{3i}) + \tau_{101i}(y_{1i} - b_1 y_{2i}) + \tau_{111i}(y_{1i} - b_1\mu_i^*)\big]}{n_{100} + n_{rest} - \sum_{i\in S_{rest}}\tau_{011i}}.    (8.14) \]
However, b_1 is unknown; both estimates can be obtained by forming a pair of simultaneous equations based on (8.12) and (8.13):
\[ A_1 a_1 + B_1 b_1 = C_1,    (8.15) \]
\[ A_2 a_1 + B_2 b_1 = C_2,    (8.16) \]
where:
\[ \begin{aligned} A_1 &= n_{100} + n_{rest} - \sum_{i\in S_{rest}}\tau_{011i}, \\
 B_1 &= \sum_{i\in S_4} y_{2i} + \sum_{i\in S_{rest}}\big[\tau_{110i}y_{3i} + \tau_{101i}y_{2i} + \tau_{111i}\mu_i^*\big], \\
 C_1 &= \sum_{i\in S_4} y_{1i} + \sum_{i\in S_{rest}}(1 - \tau_{011i})y_{1i}, \end{aligned} \]
and
\[ \begin{aligned} A_2 &= \sum_{i\in S_4} y_{2i} + \sum_{i\in S_{rest}}\big[\tau_{110i}y_{3i} + \tau_{101i}y_{2i} + \tau_{111i}\mu_i^*\big], \\
 B_2 &= \sum_{i\in S_4} y_{2i}^2 + \sum_{i\in S_{rest}}\big[\tau_{110i}y_{3i}^2 + \tau_{101i}y_{2i}^2 + \tau_{111i}(c_i^* + \mu_i^{*2})\big], \\
 C_2 &= \sum_{i\in S_4} y_{1i}y_{2i} + \sum_{i\in S_{rest}}\big[\tau_{110i}y_{1i}y_{3i} + \tau_{101i}y_{1i}y_{2i} + \tau_{111i}\mu_i^* y_{1i}\big]. \end{aligned} \]
It follows that
\[ b_1^{(t+1)} = \frac{C_1^{(t)}A_2^{(t)} - C_2^{(t)}A_1^{(t)}}{B_1^{(t)}A_2^{(t)} - B_2^{(t)}A_1^{(t)}}
 \qquad\text{and}\qquad
 a_1^{(t+1)} = \frac{C_1^{(t)} - B_1^{(t)} b_1^{(t+1)}}{A_1^{(t)}}. \]
Analogous equations can be used to derive estimates for a_2, b_2, a_3 and b_3, but with the A, B and C formulae adjusted accordingly. These are given in the appendix.
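Numerically this is just a 2x2 linear system per source. A minimal R sketch (A1, B1, C1, A2, B2, C2 are assumed to have been computed from the current τ values as defined above) is:

# Sketch: solve the pair of linear equations (8.15)-(8.16) for (a1, b1).
update_a_b <- function(A1, B1, C1, A2, B2, C2) {
  ab <- solve(matrix(c(A1, A2, B1, B2), 2, 2), c(C1, C2))
  c(a1 = ab[1], b1 = ab[2])
}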
8.3 Mahalanobis Distance
The posterior error combination probabilities τ_klmi are proportions, with the sum of the products of the weights and density functions over the error combinations as a denominator. These densities h_klm(.) can also be thought of as a measure of the likelihood that an observed score vector is drawn from a particular distribution. In our R code they are obtained using the dmvnorm function from the mvtnorm package. If a particular datapoint is very deviant from all of the considered distributions then all of these likelihoods are extremely small, and in some cases the function returns a value of 0. The fraction (6.24) would therefore be 0/0, which has no defined value. To get around this, the Mahalanobis distance can be used to categorise datapoints with extremely small densities. This is the distance of the components from their expected values, taking into account the variability of the variables (unlike the Euclidean distance) and rescaling the components so that components with greater variability are given less weight than those with low variability.
The Mahalanobis distance between two points x = (x_1, x_2, ..., x_p)^t and y = (y_1, y_2, ..., y_p)^t in the p-dimensional space R^p is defined as
\[ d_S(x, y) = \sqrt{(x - y)^t S^{-1} (x - y)}, \]
where S denotes the covariance matrix, in our case the Σ_klm that corresponds to the appropriate error pattern.
In the case of this model the distance is calculated between the observed value vector (y_1i, y_2i, y_3i)^t and the expected value vector (a_1 z_1i + b_1^{z_1i} y*_i, a_2 z_2i + b_2^{z_2i} y*_i, a_3 z_3i + b_3^{z_3i} y*_i)^t. The unit is then categorised into the group with the shortest Mahalanobis distance. In most cases when this was applied, the unit was assigned to the group with three errors.
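A minimal sketch of this fallback, using the base R function mahalanobis() (equivalent to the explicit matrix products used in our code), could look as follows; 'means' and 'sigmas' stand for lists of the pattern-specific expected value vectors and Σ_klm matrices, which are assumptions of this sketch rather than objects defined in the thesis code.

# Sketch of the Mahalanobis fallback: when every component density underflows
# to zero, assign the unit to the error pattern whose expected value vector is
# closest in Mahalanobis distance.
closest_pattern <- function(obs, means, sigmas) {
  d2 <- mapply(function(m, S) mahalanobis(obs, m, S), means, sigmas)
  which.min(d2)   # index of the closest error pattern
}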
8.4 Standard Errors
Standard errors were calculated for each of the parameter estimates to give a measure of the accuracy of this maximum likelihood estimation. If we had a complete dataset, these standard errors could be obtained from the variance-covariance matrix produced by taking the inverse of the observed (Fisher) information matrix, the negative of the Hessian matrix of second order partial derivatives:
\[ I_{obs} = -\frac{\partial^2 \ell_c}{\partial\theta\,\partial\theta^T}. \]
Hence in the case of complete data the variance of the estimator is given by
\[ \mathrm{V}(\hat\theta) = I_{obs}^{-1}, \]
evaluated at our maximum likelihood estimates, i.e. at θ = θ̂. The standard errors are then given by the square roots of the diagonal elements of this variance matrix. The accuracy of this approximation is discussed in Efron and Hinkley (1978).
This application is an incomplete data problem, so this is not directly applicable, and our standard errors will be inflated relative to the complete-data case. However, by repeating the above procedure with unobserved values replaced by their conditional expectations, evaluated at the maximum likelihood estimate obtained with the EM algorithm, we can obtain a complete-data variance-covariance matrix V_com(θ̂). According to Formula (9.7) in Little and Rubin (2002), the variance-covariance matrix of the maximum likelihood estimates based on incomplete data (obtained by the EM algorithm) is then given by
\[ \mathrm{V}_{obs}(\hat\theta) = \mathrm{V}_{com}(\hat\theta)\,(I - DM)^{-1}. \]
Here I denotes an identity matrix and DM denotes the matrix that describes the loss of information due to having incomplete data, i.e. the derivative of the EM mapping. As discussed by Little and Rubin, this matrix can be estimated by applying a 'supplemented' EM (SEM) algorithm. After θ̂ has been obtained, a sequence of SEM iterations must be run to estimate DM. We will now follow Chapter 9 of Little and Rubin (2002), in which iteration (t+1) is broken down into five steps, defined as follows:
Input: θ̂ and θ^(t).
Step 1: Run the previously defined E and M steps to obtain θ^(t+1).
Step 2: Fix i = 1. Calculate
\[ \theta^{(t)}(i) = (\hat\theta_1, \ldots, \hat\theta_{i-1}, \theta_i^{(t)}, \hat\theta_{i+1}, \ldots, \hat\theta_d),    (8.17) \]
which is equal to our final estimates θ̂ with the ith component replaced by θ_i^(t).
Step 3: Treating θ^(t)(i) as the current estimate of θ, run one more EM iteration to obtain θ̃^(t+1)(i).
Step 4: Obtain the ratios
\[ r_{ij}^{(t)} = \frac{\tilde\theta_j^{(t+1)}(i) - \hat\theta_j}{\theta_i^{(t)} - \hat\theta_i} \qquad \text{for } j = 1, \ldots, d.    (8.18) \]
Step 5: Repeat Steps 2 to 4 for i = 2, ..., d.
DM is then the limiting matrix {r_ij} as t → ∞.
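A schematic implementation of one pass of these steps is sketched below. Here em_step() stands for one full E plus M iteration returning the updated parameter vector; it is an assumed placeholder rather than a function defined in this thesis.

# Sketch of one SEM pass (Steps 2-5) for estimating the DM matrix.
# theta_hat: converged estimates; theta_t: parameter values at iteration t.
sem_ratios <- function(theta_hat, theta_t, em_step) {
  d <- length(theta_hat)
  r <- matrix(NA_real_, d, d)
  for (i in seq_len(d)) {
    theta_i      <- theta_hat
    theta_i[i]   <- theta_t[i]            # replace the i-th component only
    theta_tilde  <- em_step(theta_i)      # one extra EM iteration
    r[i, ] <- (theta_tilde - theta_hat) / (theta_t[i] - theta_hat[i])
  }
  r                                        # converges to DM as t grows
}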
Applying this to the log or logit transformations of our parameters may satisfy the assumptions of asymptotic normality better. It can be derived from (8.11) that V(θ̂) is a block-diagonal matrix of the following form:
\[ \mathrm{V}(\hat\theta) = \begin{pmatrix}
\mathrm{V}(\hat\pi_1,\hat\pi_2,\hat\pi_3) & & & & \\
 & \mathrm{V}(\hat\beta,\hat\sigma^2) & & \mathbf{0} & \\
 & & \mathrm{V}(\hat a_1,\hat b_1,\hat\sigma_1^2) & & \\
 & \mathbf{0} & & \mathrm{V}(\hat a_2,\hat b_2,\hat\sigma_2^2) & \\
 & & & & \mathrm{V}(\hat a_3,\hat b_3,\hat\sigma_3^2)
\end{pmatrix}, \]
where
\[ \mathrm{V}(\hat\pi_1,\hat\pi_2,\hat\pi_3) = \begin{pmatrix}
\frac{\hat\pi_1(1-\hat\pi_1)}{n} & 0 & 0 \\ 0 & \frac{\hat\pi_2(1-\hat\pi_2)}{n} & 0 \\ 0 & 0 & \frac{\hat\pi_3(1-\hat\pi_3)}{n} \end{pmatrix}, \]
\[ \mathrm{V}(\hat\beta,\hat\sigma^2) = \begin{pmatrix} \hat\sigma^2\big(\sum_{i=1}^{n} x_i x_i^T\big)^{-1} & 0 \\ 0 & \frac{2\hat\sigma^4}{n} \end{pmatrix} \]
and
\[ \mathrm{V}(\hat a_g,\hat b_g,\hat\sigma_g^2) = \begin{pmatrix}
\frac{\hat\sigma_g^2}{n_{\epsilon_g}\Delta_g}\sum_{i=1}^{n} z_{gi}(y_i^*)^2 & -\frac{\hat\sigma_g^2}{n_{\epsilon_g}\Delta_g}\sum_{i=1}^{n} z_{gi}y_i^* & 0 \\
-\frac{\hat\sigma_g^2}{n_{\epsilon_g}\Delta_g}\sum_{i=1}^{n} z_{gi}y_i^* & \frac{\hat\sigma_g^2}{\Delta_g} & 0 \\
0 & 0 & \frac{2\hat\sigma_g^4}{n_{\epsilon_g}} \end{pmatrix}, \]
with \(\Delta_g = \sum_{i=1}^{n} z_{gi}(y_i^*)^2 - \frac{1}{n_{\epsilon_g}}\big(\sum_{i=1}^{n} z_{gi}y_i^*\big)^2\) [which is just the determinant of the 2x2 matrix in the top-left corner of the observed information matrix given by the inverse of V(â_g, b̂_g, σ̂_g²), multiplied by \(\hat\sigma_g^2/n_{\epsilon_g}\)].
The application of (S)EM to obtain asymptotic variance-covariance estimates is discussed in
more depth in Meng and Rubin (1991).
Chapter 9
Simulation
Prior to performing our real data analysis we tested the algorithm on a dataset that had been simulated with the same assumed properties, using R. The simulation code is shown in Appendix B, for the case with 2000 units, a single covariate x_i and randomly chosen true parameter values. We then used this simulated data to test the estimation accuracy of Model 2. The results are given in Table 9.1.
Table 9.1: Parameter estimates and 95% CIs for simulated data

parameter   true value   estimate   SE       95% CI
π1          0.45         0.4358     0.0112   0.4139 - 0.4577
π2          0.3          0.2966     0.0103   0.2764 - 0.3168
π3          0.6          0.5991     0.0110   0.5775 - 0.6207
β           5            5.0023     0.0017   4.9990 - 5.0057
σ²          25           24.3346    0.7697   22.8261 - 25.8431
a1          0.1          0.0382     0.0239   -0.0086 - 0.0857
b1          1.02         1.0201     0.0001   1.0199 - 1.0202
σ1²         0.3          0.2963     0.0147   0.2676 - 0.3251
a2          0.2          0.2073     0.0241   0.1601 - 0.2545
b2          1.03         1.0299     0.0001   1.0298 - 1.0301
σ2²         0.2          0.2001     0.0118   0.1770 - 0.2232
a3          0.005        0.0052     0.0108   -0.0162 - 0.0265
b3          0.95         0.9500     0.0000   0.9499 - 0.9501
σ3²         0.05         0.0460     0.0023   0.0415 - 0.0505
The 95% confidence intervals covered the underlying true values for all of the parameters other than a1. The reason for the failure to recover a1 could be that fairly small intercepts were chosen, so that their effect was almost imperceptible relative to the slopes and the large y values. When we repeated the estimation with different input values we were able to estimate this parameter as well. The standard errors for the slope coefficients are extremely small, i.e. the model is able to estimate these very precisely.
Chapter 10
Real Data Analysis
10.1 Data Description
Before we present the results of the estimation procedure we must give more information
about the dataset. The target variable in this application is total annual turnover achieved
by businesses in the Netherlands, in the year 2012. The businesses can be subdivided into
several different NACE (Nomenclature of Economic Activity) groups, according to their business activities, and assigned a corresponding five digit NACE code. This application focused on eight different NACE groups: four belonging to the "Trade" sector and four belonging to the "Transportation and storage" sector. An overview of these groups is given in Table 10.1 and a breakdown of the number of units in each group is given in Tables 10.2 and 10.3. The largest of these groups, NACE code 45112, pertaining to "sale and repair of passenger cars and light motor vehicles", consists of 818 units after the selection process, while the smallest, NACE code 45400 ("sale and repair of motorcycles and related parts"), only contains observations from all three sources for 58 units. There is information for 37462
businesses, of which 36311 contain turnover values from the VAT register. This is by far the
largest source; there are also 2480 units containing observations from the Profit Declaration
Register (PDR), which is also obtained from the tax authorities, and 2996 observations from
the Structural Business Survey (SBS) questionnaire. As only linked units (those containing
observations from all three sources) can be assessed by the algorithm, the result is a dataset
consisting of turnover scores for 2210 businesses.
The actual interpretation of what constitutes a 'business' is contentious, however, as there are two definitions: a legal one and a statistical one. A large business is likely to consist of several segments which represent different functions and are usually run separately by different managers, each of which is classified as a single 'legal business unit'. In this dataset, however, each observation belongs to a statistical business unit, which in some cases comprises multiple legal units. Larger businesses were originally overrepresented in the dataset, so to compensate for this the units have been weighted by size. Very large or
very complex units such as conglomerates have been excluded from the analysis. The SBS
sample survey data also provided information for the Number of employees at each business,
which was incorporated as an indicator. The model is applicable for any number of indicator
variables, and data for Cost of purchases and Total operating costs was also provided by the
sources.
Table 10.1: Overview of NACE groups contained in dataset

NACE code   Description
45112       Sale and repair of passenger cars and light motor vehicles
45190       Sale and repair of trucks, trailers, and caravans
45200       Specialised repair of motor vehicles
45400       Sale and repair of motorcycles and related parts
50100       Sea and coastal passenger water transport and ferry-services
50300       Inland passenger water transport and ferry-services
52100       Warehousing and storage
52290       Forwarding agencies, ship brokers and charterers; weighing and measuring
Table 10.2: Number of units in first four NACE groups

NACE group                             45112   45190   45200   45400
total units                            18680   1790    6054    1763
total linked units                     819     170     238     60
total linked units (w/o dummy obs.)    818     167     234     58
percentage selected                    4.37    9.33    3.87    3.29

Table 10.3: Number of units in second four NACE groups

NACE group                             50100   50300   52100   52290
total units                            943     4500    764     2968
total linked units                     98      323     145     375
total linked units (w/o dummy obs.)    96      321     142     374
percentage selected                    10.18   7.13    18.59   12.6
Some units had missing observations from the administrative datasets due to unit non-response. Scholtus et al. (2015) examined whether this non-response yielded selection bias for our target variable, and found no evidence for this.
10.2 Log Transformation
The model assumes that the true values and the measurement errors are both normally distributed. In reality both are very skewed for the response variable Turnover, so neither assumption holds. By looking at the Turnover values and examining the cases with identifiable error patterns, it can be seen that these residual distributions more closely resemble a log-normal distribution. Figure 10.1 shows the original density of turnover scores and the log-transformed density is given in Figure 10.2. A log-transformation of the data prior to applying the algorithm therefore leads to more accurate results, and the results we present in the next chapter have all variables measured on a log scale.

Figure 10.1: Turnover scores for group 45112
Figure 10.2: Log transformed turnover scores for group 45112

As with most administrative data, all of the observed values in our dataset are non-negative real numbers, so they are within the domain of the log function and the log transformation is suitable. However, we have some zero values, and the result
\[ \log(0) = -\infty \]
is not computationally manageable in our algorithm. To compensate for this we apply a small arbitrary additive constant to the entire dataset before the log transformation. The measurement model (8.1) in the case of an error is then replaced by
\[ \log(y_{gi} + 0.01) = a_g + b_g \log(y_i^* + 0.01) + \epsilon_{gi}, \]
which is equivalent to
\[ y_{gi} + 0.01 = \exp(a_g)\,(y_i^* + 0.01)^{b_g}\exp(\epsilon_{gi}). \]
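In R the transformation is a one-liner applied to the observed turnover columns before running the algorithm; a minimal sketch, with the 0.01 offset used here, is:

# Sketch: log-transform turnover values, adding a small constant so that
# zero turnover values remain finite.
log_transform <- function(y, offset = 0.01) {
  log(y + offset)
}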
10.3 Results
Our EM algorithm for Model 2 was run on all eight NACE groups independently, with
Turnover as our response and Number of Employees as the sole covariate. All of the subsequent parameter estimates and associated standard errors for the eight NACE groups are
provided in Table 10.4 and Table 10.5.
Table 10.4: Parameter estimates and standard errors for the first 4 NACE groups

                  45112             45190             45200             45400
Parameter    Estimate    SE    Estimate    SE    Estimate    SE    Estimate    SE
β              1.31     0.07     1.01     0.07     0.64     0.13     0.03*     -
σ²            42.30     2.09    44.58     2.21    35.72     3.30      -        -
π1             0.65     0.02     0.62     0.02     0.22     0.03      -        -
a1            -0.06     0.04    -0.13     0.04     2.72     0.90      -        -
b1             1.01     0.00     1.02     0.00     0.32     1.06      -        -
σ1²            0.04     0.00     0.01     0.00     5.03     1.06      -        -
π2             0.91     0.01     0.83     0.02     0.71     0.03     0.86*     -
a2            -0.11     0.05     2.33     0.05     1.15     0.22      -        -
b2             0.98     0.01     0.70     0.01     0.83     0.03      -        -
σ2²            0.30     0.02     1.14     0.06     0.67     0.08      -        -
π3             0.08     0.01     0.13     0.02     0.52     0.04     0.69*     -
a3             3.96     1.41     3.66     1.41    -0.02     0.04      -        -
b3             0.20     0.21     0.37     0.21     1.00     0.01      -        -
σ3²           14.4      2.54     7.62     1.35     0.00     0.00      -        -

Source 1 = SBS, Source 2 = VAT, Source 3 = PDR
* Initial estimates only
Table 10.5: Parameter estimates and standard errors for the second 4 NACE groups

                  50100             50300             52100             52290
Parameter    Estimate    SE    Estimate    SE    Estimate    SE    Estimate    SE
β              0.33     0.23    -0.26     0.11     1.88     0.13     2.45     0.12
σ²            53.76     7.76    39        3.08    25.77     3.06    31.40     3.29
π1             0.63     0.06     0.46     0.03     0.35     0.06     0.61     0.08
a1             0.14     0.10     2.48     0.45     4.27     0.56     6.76     0.12
b1             0.97     0.01     0.58     0.07     0.42     0.07     0.21     0.01
σ1²            0.03     0.13     2.36     0.28     2.49     0.53     0.99     0.11
π2             0.73     0.06     0.79     0.02     0.80     0.05     0.87     0.02
a2             0.06     0.02     0.03     0.06     3.70     0.40     6.65     0.13
b2             0.97     0.59     0.99     0.01     0.57     0.05     0.21     0.02
σ2²            1.17     0.08     0.04     0.00     1.45     0.20     1.54     0.13
π3             0.36     0.06     0.11     0.02     0.70     0.04     0.58     0.03
a3            -6.53     0.30     3.50     2.12     0.09     0.05     6.80     0.12
b3             0.72     3.91     0.01     0.34     0.99     0.01     0.21     0.01
σ3²           26.60     0.58    23.93     5.65     0.00     0.00     0.70     0.09

Source 1 = SBS, Source 2 = VAT, Source 3 = PDR
For most of the parameters it was not possible to obtain accurate estimates for NACE group 45400, due to the low number of available linked units: there are only 56 units overall, and a very small percentage of these initially appear to have an error on Source 1. The initial error proportion estimates have nonetheless been included.
Aside from this, several things are noticeable in these results:
- The intermittent error assumption seems to be reasonable: all of the estimated error proportions for each group are significantly different from 1.
- The error proportions are consistently highest in the VAT register: for all eight of the NACE groups π2 is the largest error proportion. These estimates range from 0.71-0.91, while the SBS error proportions range from 0.03-0.65 and the PDR error proportions range from 0.08-0.70. By multiplying each error proportion by the total number of linked units for that source, an estimate for the total number of errors on each source can be obtained; these are 1174 for SBS, 1862 for VAT and 635 for PDR. Hence the overall error proportions can be estimated at 0.53, 0.84 and 0.29 respectively, and the Profit Declaration Register is altogether the least erroneous source.
- Some of the a_g and b_g estimates appear to be inaccurate: several of them have large standard errors and large corresponding error variances. To investigate the cause of this we looked more closely at the PDR source for NACE group 50300. As Figure 10.3 shows, there is one outlying datapoint (the top left one) which skews the fitted red line. There is also a row of outliers along the bottom that are the result of zero values. It is likely that the imprecise parameter estimates for group 45112 Source 3, group 45190 Source 3, group 45200 Source 1 and group 50100 Sources 2 and 3 are caused by similar outliers. Notice that the sources with lower error proportions are more likely to have these inaccurate a_g and b_g estimates, probably because the outlying observations have more influence in these cases.
Figure 10.3: y3 vs y* for group 50300 with fitted line
Chapter 11
Comparison of Methods
A structural equation model was used by Scholtus (2016) on the first four NACE groups of
the same data, and the resultant slope and intercept parameters are given by Table 11.1.
Table 11.1: Parameter estimates and standard errors given by the SEM

                  45112             45190             45200             45400
Parameter    Estimate    SE    Estimate    SE    Estimate    SE    Estimate    SE
a1            -0.09     0.19    -0.02     0.01     0.17     0.19     0.17     0.29
b1             1.00     0.03     1.00     0.00     0.95     0.04     0.96     0.06
a2             0.08     0.15     0.54     0.16     0.13     0.16    -0.34     0.33
b2             0.93     0.03     0.89     0.03     0.98     0.03     1.01     0.07
a3            -0.12     0.09     0.40     0.14     0.25     0.16    -0.32     0.12
b3             1.02     0.02     0.94     0.02     0.96     0.03     1.06     0.03

Source 1 = SBS, Source 2 = VAT, Source 3 = PDR
The 'trustworthy' a_g and b_g estimates from Model 2 generally have smaller standard errors than their equivalent estimates from the SEM.
Because the data were log-transformed it is not easy to interpret the slope and intercept estimates, except in the trivial case where a_g = 0 and b_g = 1; then the interpretation would be the same as with (8.1). As the SEM carries the assumption that there are no error-free observations, it calculates slope and intercept estimates based on all of the data, rather than just the erroneous observations. It would therefore be expected that the line fitted by the SEM would be closer to the line y_gi = y*_i, and subsequently that the a_g and b_g estimates would be closer to 0 and 1 respectively. This is most evident in the respective b2 estimates for groups 45190 and 45200: 0.70 and 0.83 from our model versus 0.89 and 0.98 given by the SEM. This disparity is illustrated for fictitious intermittent-error datapoints in Figure 11.1.
Figure 11.1: Slope estimates for the SEM and Contamination models
Chapter 12
Conclusion
In this thesis we detailed the use of the EM algorithm to obtain maximum likelihood estimates for error proportions and other distribution parameters of contaminated multi-source data. This model had previously been discussed by Guarnera and Varriale (2015), but we provided a more in-depth, step-by-step guide to the procedure. We then worked out and detailed an extension of this model which accounts for systematic bias through the inclusion of an intercept and a slope parameter for each source. We produced R code to apply this algorithm, tested the model on a simulated dataset and found that it was able to estimate the parameters accurately, with the slope estimates being particularly precise.
We then used the model to derive estimates and asymptotic standard errors for our log-transformed real data: annual turnover for Dutch businesses given by three different sources (a statistical survey and two administrative sources), with number of employees as an independent variable. The data were divided into eight groups based on economic activity, and it was clear that the Profit Declaration Register, one of the administrative sources, contained the smallest proportion of errors for all groups. In some cases we were able to obtain precise estimates for the other parameters, but in others the estimates appeared to be skewed by outlying observations. Most of these outliers were the result of erroneous zero (as opposed to missing) values. These are likely to be missing observations that have been defaulted to zero in the data collection phase [see Dasu and Johnson (2003)], so in future they should be treated as missing and either imputed or removed in the data cleaning process, as was the case with missing values in this study.
In Chapter 11 we compared our results with those obtained using a structural equation model. In addition to our model being able to estimate error proportions, we found that the a and b estimates were more precise for our model in the groups without significant outliers. With this in mind, and considering the fact that the SEM requires expensive 'gold standard' data, we suggest that our model is more suitable for the data used in this application.
12.1 Future Considerations
For this application we focused on a single outcome variable but in future it will be necessary
to model several outcome variables simultaneously. In addition to this it could be interesting
to consider the case with two rather than three sources (increasing the number of sources to
model could also be considered but would only require a simple extension to the algorithm).
Another improvement that could be made is the extension of the estimation procedure to
allow for the inclusion of units with missing data.
Appendix A
1
We provide the expressions for the ’loss’ components of the Q function for Model 2, following
on from Section 7.1.
\[ \begin{aligned} V^* = {}& \sum_{i\in S_1}(y_{1i}-x_i^T\beta)^2 + \sum_{i\in S_2}(y_{1i}-x_i^T\beta)^2 + \sum_{i\in S_3}(y_{1i}-x_i^T\beta)^2 + \sum_{i\in S_4}(y_{2i}-x_i^T\beta)^2 + \sum_{i\in S^*_{rest}}(y_{4i}-x_i^T\beta)^2 \\
 & + \sum_{i\in S'_{rest}}\Big[\tau_{110i}(y_{3i}-x_i^T\beta)^2 + \tau_{101i}(y_{2i}-x_i^T\beta)^2 + \tau_{011i}(y_{1i}-x_i^T\beta)^2 + \tau_{111i}\big(E(y_i^{*2}) - 2x_i^T\beta\,E(y_i^*) + (x_i^T\beta)^2\big)\Big] \end{aligned} \]
\[ \begin{aligned} V^*_1 = {}& \sum_{i\in S_4}(y_{1i}-y_{2i})^2 + \sum_{i\in\{S^*_5\cup S^*_6\cup S^*_8\}}(y_{1i}-y_{4i})^2 \\
 & + \sum_{i\in S'_{rest}}\Big[\tau_{110i}(y_{1i}-y_{3i})^2 + \tau_{101i}(y_{1i}-y_{2i})^2 + \tau_{111i}\big(E(y_i^{*2}) - 2y_{1i}E(y_i^*) + y_{1i}^2\big)\Big] \end{aligned} \]
\[ \begin{aligned} V^*_2 = {}& \sum_{i\in S_3}(y_{2i}-y_{1i})^2 + \sum_{i\in\{S^*_5\cup S^*_7\cup S^*_8\}}(y_{2i}-y_{4i})^2 \\
 & + \sum_{i\in S'_{rest}}\Big[\tau_{110i}(y_{2i}-y_{3i})^2 + \tau_{011i}(y_{1i}-y_{2i})^2 + \tau_{111i}\big(E(y_i^{*2}) - 2y_{2i}E(y_i^*) + y_{2i}^2\big)\Big] \end{aligned} \]
\[ \begin{aligned} V^*_3 = {}& \sum_{i\in S_2}(y_{3i}-y_{1i})^2 + \sum_{i\in\{S^*_6\cup S^*_7\cup S^*_8\}}(y_{3i}-y_{4i})^2 \\
 & + \sum_{i\in S'_{rest}}\Big[\tau_{011i}(y_{3i}-y_{1i})^2 + \tau_{101i}(y_{3i}-y_{2i})^2 + \tau_{111i}\big(E(y_i^{*2}) - 2y_{3i}E(y_i^*) + y_{3i}^2\big)\Big] \end{aligned} \]
2
The following are the equations that provide the updated estimates for a2 , b2 , a3 and b3
within the algorithm, analogous to the equations for a1 and b1 in Section 8.2.
\[ \begin{aligned}
A_1^{[2]} &= n_{010} + n_{rest} - \sum_{i\in S_{rest}}\tau_{101i}, \\
B_1^{[2]} &= \sum_{i\in S_3} y_{1i} + \sum_{i\in S_{rest}}\big[\tau_{110i}y_{3i} + \tau_{011i}y_{1i} + \tau_{111i}\mu_i^*\big], \\
C_1^{[2]} &= \sum_{i\in S_3} y_{2i} + \sum_{i\in S_{rest}}(1-\tau_{101i})y_{2i}, \\
A_2^{[2]} &= \sum_{i\in S_3} y_{1i} + \sum_{i\in S_{rest}}\big[\tau_{110i}y_{3i} + \tau_{011i}y_{1i} + \tau_{111i}\mu_i^*\big], \\
B_2^{[2]} &= \sum_{i\in S_3} y_{1i}^2 + \sum_{i\in S_{rest}}\big[\tau_{110i}y_{3i}^2 + \tau_{011i}y_{1i}^2 + \tau_{111i}(c_i^* + \mu_i^{*2})\big], \\
C_2^{[2]} &= \sum_{i\in S_3} y_{1i}y_{2i} + \sum_{i\in S_{rest}}\big[\tau_{110i}y_{2i}y_{3i} + \tau_{011i}y_{1i}y_{2i} + \tau_{111i}\mu_i^* y_{2i}\big], \end{aligned} \]
and
\[ \begin{aligned}
A_1^{[3]} &= n_{001} + n_{rest} - \sum_{i\in S_{rest}}\tau_{110i}, \\
B_1^{[3]} &= \sum_{i\in S_2} y_{1i} + \sum_{i\in S_{rest}}\big[\tau_{011i}y_{1i} + \tau_{101i}y_{2i} + \tau_{111i}\mu_i^*\big], \\
C_1^{[3]} &= \sum_{i\in S_2} y_{3i} + \sum_{i\in S_{rest}}(1-\tau_{110i})y_{3i}, \\
A_2^{[3]} &= \sum_{i\in S_2} y_{1i} + \sum_{i\in S_{rest}}\big[\tau_{011i}y_{1i} + \tau_{101i}y_{2i} + \tau_{111i}\mu_i^*\big], \\
B_2^{[3]} &= \sum_{i\in S_2} y_{1i}^2 + \sum_{i\in S_{rest}}\big[\tau_{011i}y_{1i}^2 + \tau_{101i}y_{2i}^2 + \tau_{111i}(c_i^* + \mu_i^{*2})\big], \\
C_2^{[3]} &= \sum_{i\in S_2} y_{1i}y_{3i} + \sum_{i\in S_{rest}}\big[\tau_{011i}y_{1i}y_{3i} + \tau_{101i}y_{2i}y_{3i} + \tau_{111i}\mu_i^* y_{3i}\big]. \end{aligned} \]
The updates are then
\[ b_2^{(t+1)} = \frac{C_1^{[2](t)}A_2^{[2](t)} - C_2^{[2](t)}A_1^{[2](t)}}{B_1^{[2](t)}A_2^{[2](t)} - B_2^{[2](t)}A_1^{[2](t)}}, \qquad
   a_2^{(t+1)} = \frac{C_1^{[2](t)} - B_1^{[2](t)} b_2^{(t+1)}}{A_1^{[2](t)}}, \]
and
\[ b_3^{(t+1)} = \frac{C_1^{[3](t)}A_2^{[3](t)} - C_2^{[3](t)}A_1^{[3](t)}}{B_1^{[3](t)}A_2^{[3](t)} - B_2^{[3](t)}A_1^{[3](t)}}, \qquad
   a_3^{(t+1)} = \frac{C_1^{[3](t)} - B_1^{[3](t)} b_3^{(t+1)}}{A_1^{[3](t)}}. \]
Appendix B
Here we provide the R code used for simulation of data fitting Model 2 assumptions, used in
Chapter 9.
# SIMULATION OF THE DATA
library(mvtnorm)

## error proportions
pi1 <- 0.45
pi2 <- 0.3
pi3 <- 0.6
## variances
var  <- 25
var1 <- 0.3
var2 <- 0.2
var3 <- 0.05
## slope and intercept for each source
a1 <- 0.1;  a2 <- 0.2;  a3 <- 0.005
b1 <- 1.02; b2 <- 1.03; b3 <- 0.95
## number of observations
n <- 2000
## regression coefficients
betas <- 5
## error indicators
z1 <- rbinom(n, 1, pi1)
z2 <- rbinom(n, 1, pi2)
z3 <- rbinom(n, 1, pi3)

ysim <- matrix(0, n, 3)
xmat <- ytrue <- matrix(0, n, 1)

for (i in 1:n) {
  ## covariate
  x <- xmat[i] <- rnorm(1, 40, 15)
  ## intercept and slope parameters; equal to 0 and 1 when there is no error
  intercept <- c(a1 * z1[i], a2 * z2[i], a3 * z3[i])
  slope     <- c(b1 ^ z1[i], b2 ^ z2[i], b3 ^ z3[i])
  ## true score
  ytrue[i] <- x * betas + rnorm(1, 0, sqrt(var))
  ## measurement errors, only present on sources with an error
  errors <- c(z1[i], z2[i], z3[i]) *
    rnorm(3, mean = 0, sd = sqrt(c(var1, var2, var3)))
  ## observed scores
  ysim[i, ] <- intercept + slope * ytrue[i] + errors
}
# ys and xs in one data matrix
y <- cbind(ysim, xmat)
Appendix C
Here we provide the R code which we wrote to obtain the parameter estimates and standard
errors for Model 2.
# ecmalgorithm <- function(y) {
# ESTIMATE PARAMETERS

# FUNCTION TO SEPARATE DATA AND CALCULATE KNOWN PROPORTIONS
# This function takes a full dataset, y, consisting of 3 sources in the first
# three columns and the remaining columns of covariates.
datsep <- function(y) {
  n <- nrow(y)
  # separate data into subsets depending on the presence of errors:
  # indices of y rows in which error combinations can be identified
  s000 <- which(y[, 1] == y[, 2] & y[, 2] == y[, 3])
  s001 <- which(y[, 1] == y[, 2] & y[, 2] != y[, 3])
  s010 <- which(y[, 1] == y[, 3] & y[, 2] != y[, 3])
  s100 <- which(y[, 2] == y[, 3] & y[, 1] != y[, 3])
  srest <- which(y[, 1] != y[, 2] & y[, 2] != y[, 3] & y[, 1] != y[, 3])

  y000 <- y[s000, ]; y001 <- y[s001, ]; y010 <- y[s010, ]
  y100 <- y[s100, ]; yrest <- y[srest, ]

  n000 <- length(s000); n100 <- length(s100); n010 <- length(s010)
  n001 <- length(s001); nrest <- length(srest)

  # proportions of each error combination / mixing weights;
  # the pis can be estimated from these
  m000 <- n000 / n
  m001 <- n001 / n
  m010 <- n010 / n
  m100 <- n100 / n

  list(s000 = s000, s001 = s001, s010 = s010, s100 = s100, srest = srest,
       y000 = y000, y001 = y001, y010 = y010, y100 = y100, yrest = yrest,
       n000 = n000, n001 = n001, n010 = n010, n100 = n100, nrest = nrest,
       n = n, m000 = m000, m001 = m001, m010 = m010, m100 = m100)
}
# THE EXPECTATION STEP
Estep <- function(pi1, pi2, pi3, var, var1, var2, var3, betas,
                  a1, a2, a3, b1, b2, b3, y) {
  arguments <- datsep(y)
  s000 <- arguments$s000; s001 <- arguments$s001; s010 <- arguments$s010
  s100 <- arguments$s100; srest <- arguments$srest
  y000 <- arguments$y000; y001 <- arguments$y001; y010 <- arguments$y010
  y100 <- arguments$y100; yrest <- arguments$yrest
  n000 <- arguments$n000; n001 <- arguments$n001; n010 <- arguments$n010
  n100 <- arguments$n100; nrest <- arguments$nrest; n <- arguments$n

  intercepts <- c(a1, a2, a3)
  slopes <- c(b1, b2, b3)

  # unknown mixing weight estimates based on the pis
  m011 <- (1 - pi1) * pi2 * pi3
  m101 <- pi1 * (1 - pi2) * pi3
  m110 <- pi1 * pi2 * (1 - pi3)
  m111 <- pi1 * pi2 * pi3

  # sigmaklm = covariance matrices of (y*, y1, y2, y3) conditional on
  # error combination klm
  ystar <- ystar2 <- numeric(nrest)
  sigma110 <- matrix(c(var, b1 * var, b2 * var, var,
                       b1 * var, (b1^2) * var + var1, b1 * b2 * var, b1 * var,
                       b2 * var, b1 * b2 * var, (b2^2) * var + var2, b2 * var,
                       var, b1 * var, b2 * var, var), 4, 4)
  sigma101 <- matrix(c(var, b1 * var, var, b3 * var,
                       b1 * var, (b1^2) * var + var1, b1 * var, b1 * b3 * var,
                       var, b1 * var, var, b3 * var,
                       b3 * var, b1 * b3 * var, b3 * var, (b3^2) * var + var3), 4, 4)
  sigma011 <- matrix(c(var, var, b2 * var, b3 * var,
                       var, var, b2 * var, b3 * var,
                       b2 * var, b2 * var, (b2^2) * var + var2, b2 * b3 * var,
                       b3 * var, b3 * var, b2 * b3 * var, (b3^2) * var + var3), 4, 4)
  sigma111 <- matrix(c(var, b1 * var, b2 * var, b3 * var,
                       b1 * var, (b1^2) * var + var1, b1 * b2 * var, b1 * b3 * var,
                       b2 * var, b1 * b2 * var, (b2^2) * var + var2, b2 * b3 * var,
                       b3 * var, b1 * b3 * var, b2 * b3 * var, (b3^2) * var + var3), 4, 4)

  tau110 <- tau101 <- tau011 <- tau111 <- numeric(nrest)
  # conditional variance estimate when 3 errors are present
  var111 <- var - sigma111[1, 2:4] %*% solve(sigma111[2:4, 2:4]) %*% sigma111[2:4, 1]

  # posterior error combination probabilities
  for (i in 1:nrest) {
    xi <- yrest[i, 4:ncol(y)] * betas
    tau110[i] <- m110 * dmvnorm(yrest[i, 1:3], c(a1 + b1 * xi, a2 + b2 * xi, xi), sigma110[2:4, 2:4])
    tau101[i] <- m101 * dmvnorm(yrest[i, 1:3], c(a1 + b1 * xi, xi, a3 + b3 * xi), sigma101[2:4, 2:4])
    tau011[i] <- m011 * dmvnorm(yrest[i, 1:3], c(xi, a2 + b2 * xi, a3 + b3 * xi), sigma011[2:4, 2:4])
    tau111[i] <- m111 * dmvnorm(yrest[i, 1:3], c(a1 + b1 * xi, a2 + b2 * xi, a3 + b3 * xi), sigma111[2:4, 2:4])
    tausum <- tau111[i] + tau110[i] + tau101[i] + tau011[i]
    if (tausum != 0) {
      tau110[i] <- tau110[i] / tausum
      tau101[i] <- tau101[i] / tausum
      tau011[i] <- tau011[i] / tausum
      tau111[i] <- tau111[i] / tausum
    } else {
      # Mahalanobis distance fallback when all densities underflow to zero
      mahal110 <- as.numeric(t(yrest[i, 1:3] - c(a1 + b1 * xi, a2 + b2 * xi, xi))) %*%
        solve(sigma110[2:4, 2:4]) %*% as.numeric(yrest[i, 1:3] - c(a1 + b1 * xi, a2 + b2 * xi, xi))
      mahal101 <- as.numeric(t(yrest[i, 1:3] - c(a1 + b1 * xi, xi, a3 + b3 * xi))) %*%
        solve(sigma101[2:4, 2:4]) %*% as.numeric(yrest[i, 1:3] - c(a1 + b1 * xi, xi, a3 + b3 * xi))
      mahal011 <- as.numeric(t(yrest[i, 1:3] - c(xi, a2 + b2 * xi, a3 + b3 * xi))) %*%
        solve(sigma011[2:4, 2:4]) %*% as.numeric(yrest[i, 1:3] - c(xi, a2 + b2 * xi, a3 + b3 * xi))
      mahal111 <- as.numeric(t(yrest[i, 1:3] - c(a1 + b1 * xi, a2 + b2 * xi, a3 + b3 * xi))) %*%
        solve(sigma111[2:4, 2:4]) %*% as.numeric(yrest[i, 1:3] - c(a1 + b1 * xi, a2 + b2 * xi, a3 + b3 * xi))
      mahal <- c(mahal110, mahal101, mahal011, mahal111)
      mahalind <- which.min(mahal)
      tau110[i] <- tau101[i] <- tau011[i] <- tau111[i] <- 0
      if (mahalind == 1) tau110[i] <- 1
      if (mahalind == 2) tau101[i] <- 1
      if (mahalind == 3) tau011[i] <- 1
      if (mahalind == 4) tau111[i] <- 1
    }
    # expected y* and y*^2 for the situation with 3 errors:
    # ystar is the expected true score conditional on parameters and data,
    # ystar2 the estimate of the true score squared
    ystar[i] <- yrest[i, 4:ncol(y)] * betas + sigma111[1, 2:4] %*%
      solve(sigma111[2:4, 2:4]) %*%
      as.numeric(yrest[i, 1:3] - (intercepts + slopes * yrest[i, 4:ncol(y)] * betas))
    ystar2[i] <- var111 + (ystar[i])^2
  }

  # true values
  ytrue <- numeric(n)
  # observed: the true score is known to be yi1 in cases with no error on
  # source 1, and yi2 (or yi3) in the case with no error on sources 2 and 3
  ytrue[c(s000, s010, s001)] <- y[c(s000, s010, s001), 1]
  ytrue[s100] <- y[s100, 2]
  # estimated: in srest the true score is estimated from the combination of
  # the taus and the expected true score for each error combination
  ytrue[srest] <- yrest[, 3] * tau110 + yrest[, 2] * tau101 +
    yrest[, 1] * tau011 + ystar * tau111

  list(tau110 = tau110, tau101 = tau101, tau011 = tau011, tau111 = tau111,
       y000 = y000, y100 = y100, y010 = y010, y001 = y001, yrest = yrest,
       ystar = ystar, ystar2 = ystar2, ytrue = ytrue,
       s100 = s100, s010 = s010, s001 = s001, srest = srest, var111 = var111)
}
# end of Estep function
# Function to update betas
# ytrue is the true score when identifiable and the estimated true score
# when not identifiable
betafn <- function(y, ytrue) {
  n <- nrow(y)
  nx <- ncol(y) - 3   # number of X covariables
  sumxx <- matrix(0, nx, nx)
  for (i in 1:n) {
    xx <- y[i, 4:ncol(y)] %*% t(y[i, 4:ncol(y)])
    sumxx <- sumxx + xx
  }
  sumxy <- rep(0, nx)
  for (i in 1:n) {
    xy <- y[i, 4:ncol(y)] * ytrue[i]
    sumxy <- xy + sumxy
  }
  betas <- solve(sumxx) %*% sumxy
  return(betas)
}
#True score variance function
varfn <- function(y, ytrue, betas, yrest, srest, n, nrest, tau110, tau101,
                  tau011, tau111, ystar, ystar2) {
  vsigmaknown <- 0
  yknown <- y[-srest, ]
  ytrueknown <- ytrue[-srest]
  nknown <- n - nrest
  for (i in 1:nknown) {
    vloss <- (ytrueknown[i] - yknown[i, 4:ncol(y)] %*% betas)^2
    vsigmaknown <- vsigmaknown + vloss
  }
  vsigmarest <- 0
  for (i in 1:nrest) {
    vloss <- tau110[i] * (yrest[i, 3] - yrest[i, 4:ncol(y)] %*% betas)^2 +
      tau101[i] * (yrest[i, 2] - yrest[i, 4:ncol(y)] %*% betas)^2 +
      tau011[i] * (yrest[i, 1] - yrest[i, 4:ncol(y)] %*% betas)^2 +
      tau111[i] * (ystar2[i] - 2 * yrest[i, 4:ncol(y)] %*% betas * ystar[i] +
        (yrest[i, 4:ncol(y)] %*% betas)^2)
    vsigmarest <- vloss + vsigmarest
  }
  vsigma <- vsigmaknown + vsigmarest
  var <- vsigma / n
  return(var)
}
#Var1 function
var1fn <- function(y100, yrest, ytrue, betas, srest, n100, nrest, tau110,
                   tau101, tau011, tau111, pi1, ystar, ystar2, a1, b1) {
  vsigma100 <- 0
  for (i in 1:n100) {
    vloss <- (y100[i, 1] - (a1 + b1 * y100[i, 2]))^2
    vsigma100 <- vloss + vsigma100
  }
  vsigma1rest <- 0
  for (i in 1:nrest) {
    vloss <- tau110[i] * (yrest[i, 1] - (a1 + b1 * yrest[i, 3]))^2 +
      tau101[i] * (yrest[i, 1] - (a1 + b1 * yrest[i, 2]))^2 +
      tau111[i] * ((yrest[i, 1])^2 - 2 * yrest[i, 1] * (a1 + b1 * ystar[i]) +
        a1^2 + 2 * a1 * b1 * ystar[i] + b1^2 * ystar2[i])
    vsigma1rest <- vloss + vsigma1rest
  }
  vsigma1 <- vsigma100 + vsigma1rest
  var1 <- vsigma1 / (n100 + sum(tau101 + tau110 + tau111))
  return(var1)
}
#Var2 function
var2fn <- function(y010, yrest, ytrue, betas, srest, n010, nrest, tau110,
                   tau101, tau011, tau111, pi2, ystar, ystar2, a2, b2) {
  vsigma010 <- 0
  for (i in 1:n010) {
    vloss <- (y010[i, 2] - (a2 + b2 * y010[i, 1]))^2
    vsigma010 <- vloss + vsigma010
  }
  vsigma2rest <- 0
  for (i in 1:nrest) {
    vloss <- tau110[i] * (yrest[i, 2] - (a2 + b2 * yrest[i, 3]))^2 +
      tau011[i] * (yrest[i, 2] - (a2 + b2 * yrest[i, 1]))^2 +
      tau111[i] * ((yrest[i, 2])^2 - 2 * yrest[i, 2] * (a2 + b2 * ystar[i]) +
        a2^2 + 2 * a2 * b2 * ystar[i] + b2^2 * ystar2[i])
    vsigma2rest <- vloss + vsigma2rest
  }
  vsigma2 <- vsigma010 + vsigma2rest
  var2 <- vsigma2 / (n010 + sum(tau110 + tau011 + tau111))
  return(var2)
}
#Var3 function
var3fn <- function(y001, yrest, ytrue, betas, srest, n001, nrest, tau110,
                   tau101, tau011, tau111, pi3, ystar, ystar2, a3, b3) {
  vsigma001 <- 0
  for (i in 1:n001) {
    vloss <- (y001[i, 3] - (a3 + b3 * y001[i, 2]))^2
    vsigma001 <- vloss + vsigma001
  }
  vsigma3rest <- 0
  for (i in 1:nrest) {
    vloss <- tau011[i] * (yrest[i, 3] - (a3 + b3 * yrest[i, 1]))^2 +
      tau101[i] * (yrest[i, 3] - (a3 + b3 * yrest[i, 2]))^2 +
      tau111[i] * ((yrest[i, 3])^2 - 2 * yrest[i, 3] * (a3 + b3 * ystar[i]) +
        a3^2 + 2 * a3 * b3 * ystar[i] + b3^2 * ystar2[i])
    vsigma3rest <- vloss + vsigma3rest
  }
  vsigma3 <- vsigma001 + vsigma3rest
  var3 <- vsigma3 / (n001 + sum(tau101 + tau011 + tau111))
  return(var3)
}
#b1 function
b1fn <- function(n100, nrest, tau011, tau110, tau101, tau111, y100, yrest,
                 ystar, ystar2, var111) {
  A1.1 <- n100 + nrest - sum(tau011)
  B1.1 <- A2.1 <- sum(y100[, 2]) + sum(tau110 * yrest[, 3] +
    tau101 * yrest[, 2] + tau111 * ystar)
  C1.1 <- sum(y100[, 1]) + sum((1 - tau011) * yrest[, 1])
  B2.1 <- sum(y100[, 2]^2) + sum(tau110 * yrest[, 3]^2 + tau101 * yrest[, 2]^2 +
    tau111 * ystar2)
  C2.1 <- sum(y100[, 1] * y100[, 2]) + sum(tau110 * yrest[, 1] * yrest[, 3] +
    tau101 * yrest[, 1] * yrest[, 2] + tau111 * ystar * yrest[, 1])
  b1 <- (C1.1 * A2.1 - C2.1 * A1.1) / (B1.1 * A2.1 - B2.1 * A1.1)
  return(list(A1.1 = A1.1, B1.1 = B1.1, C1.1 = C1.1, b1 = b1))
}
#b2 function
b2fn <- function(n010, nrest, tau011, tau110, tau101, tau111, y010, yrest,
                 ystar, ystar2, var111) {
  A1.2 <- n010 + nrest - sum(tau101)
  B1.2 <- A2.2 <- sum(y010[, 1]) + sum(tau110 * yrest[, 3] +
    tau011 * yrest[, 1] + tau111 * ystar)
  C1.2 <- sum(y010[, 2]) + sum((1 - tau101) * yrest[, 2])
  B2.2 <- sum(y010[, 1]^2) + sum(tau110 * yrest[, 3]^2 + tau011 * yrest[, 1]^2 +
    tau111 * ystar2)
  C2.2 <- sum(y010[, 2] * y010[, 1]) + sum(tau110 * yrest[, 2] * yrest[, 3] +
    tau011 * yrest[, 2] * yrest[, 1] + tau111 * ystar * yrest[, 2])
  b2 <- (C1.2 * A2.2 - C2.2 * A1.2) / (B1.2 * A2.2 - B2.2 * A1.2)
  return(list(A1.2 = A1.2, B1.2 = B1.2, C1.2 = C1.2, b2 = b2))
}
#b3 function
b3fn <- function(n001, nrest, tau011, tau110, tau101, tau111, y001, yrest,
                 ystar, ystar2, var111) {
  A1.3 <- n001 + nrest - sum(tau110)
  B1.3 <- A2.3 <- sum(y001[, 1]) + sum(tau101 * yrest[, 2] +
    tau011 * yrest[, 1] + tau111 * ystar)
  C1.3 <- sum(y001[, 3]) + sum((1 - tau110) * yrest[, 3])
  B2.3 <- sum(y001[, 1]^2) + sum(tau101 * yrest[, 2]^2 + tau011 * yrest[, 1]^2 +
    tau111 * ystar2)
  C2.3 <- sum(y001[, 3] * y001[, 1]) + sum(tau101 * yrest[, 2] * yrest[, 3] +
    tau011 * yrest[, 1] * yrest[, 3] + tau111 * ystar * yrest[, 3])
  b3 <- (C1.3 * A2.3 - C2.3 * A1.3) / (B1.3 * A2.3 - B2.3 * A1.3)
  return(list(A1.3 = A1.3, B1.3 = B1.3, C1.3 = C1.3, b3 = b3))
}
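#The functions b1fn, b2fn and b3fn each return the pieces of the same 2x2
#weighted least-squares system. Writing A1, B1 (= A2), B2, C1, C2 for the sums
#computed above, the normal equations for the intercept a and slope b are
#   A1 * a + B1 * b = C1
#   A2 * a + B2 * b = C2,
#whose solution is
#   b = (C1 * A2 - C2 * A1) / (B1 * A2 - B2 * A1)  and  a = (C1 - B1 * b) / A1.
#This is how a1, a2 and a3 are recovered from the returned A1.j, B1.j and C1.j
#after the corresponding slope has been updated in the EM loop below.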
s000 <- datsep(y)$s000; s001 <- datsep(y)$s001; s010 <- datsep(y)$s010; s100 <- datsep(y)$s100
srest <- datsep(y)$srest; y000 <- datsep(y)$y000; y001 <- datsep(y)$y001; y010 <- datsep(y)$y010
y100 <- datsep(y)$y100; yrest <- datsep(y)$yrest; n000 <- datsep(y)$n000; n001 <- datsep(y)$n001
n010 <- datsep(y)$n010; n100 <- datsep(y)$n100; nrest <- datsep(y)$nrest; n <- datsep(y)$n
m000 <- datsep(y)$m000; m001 <- datsep(y)$m001; m010 <- datsep(y)$m010; m100 <- datsep(y)$m100
nx <- ncol(y) - 3
niter <- 200  #maximum number of iterations
m <- 1e-6  #convergence criterion
pi1 <- pi2 <- pi3 <- var <- var1 <- var2 <- var3 <-
  a1 <- a2 <- a3 <- b1 <- b2 <- b3 <- numeric(niter)
betas <- matrix(0, niter, nx)
#initial estimates for pi, derived from the known proportions m000, m001, m010, m100
pi1[1] <- 1 - sqrt(((m000 + m001) * (m000 + m010)) / (m000 + m100))
pi2[1] <- 1 - sqrt(((m000 + m001) * (m000 + m100)) / (m000 + m010))
pi3[1] <- 1 - sqrt(((m000 + m100) * (m000 + m010)) / (m000 + m001))
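#These starting values correspond to assuming that errors occur independently
#across the three sources: in that case m000 = (1-pi1)(1-pi2)(1-pi3),
#m100 = pi1(1-pi2)(1-pi3), m010 = (1-pi1)pi2(1-pi3) and m001 = (1-pi1)(1-pi2)pi3,
#so that
#   (m000 + m001)(m000 + m010) / (m000 + m100) = (1 - pi1)^2,
#which gives pi1 = 1 - sqrt(...) as above, and analogously for pi2 and pi3.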
##linear model of known error pattern groups to predict initial values
ytrueknown <- c(y[c(s000, s010, s001), 1], y[s100, 2])
xknown <- y[c(s000, s010, s001, s100), 4:ncol(y)]
reglm <- lm(ytrueknown ~ xknown - 1)  #with or without intercept?
summary(reglm)
#initial betas from the regression
betas[1, ] <- reglm$coefficients
y100[, ncol(y) + 1] <- y100[, 4:ncol(y)] %*% betas[1, ]
y010[, ncol(y) + 1] <- y010[, 4:ncol(y)] %*% betas[1, ]
y001[, ncol(y) + 1] <- y001[, 4:ncol(y)] %*% betas[1, ]
#initial intercepts and slopes from regressions within the single-error groups
a1[1] <- summary(lm(y100[, 1] ~ y100[, 2]))$coefficients[1, 1]
b1[1] <- summary(lm(y100[, 1] ~ y100[, 2]))$coefficients[2, 1]
a2[1] <- summary(lm(y010[, 2] ~ y010[, 1]))$coefficients[1, 1]
b2[1] <- summary(lm(y010[, 2] ~ y010[, 1]))$coefficients[2, 1]
a3[1] <- summary(lm(y001[, 3] ~ y001[, 1]))$coefficients[1, 1]
b3[1] <- summary(lm(y001[, 3] ~ y001[, 1]))$coefficients[2, 1]
#initial true score variance
var[1] <- (summary(reglm)$sigma)^2
#initial error variances
var1[1] <- (summary(lm(y100[, 1] ~ y100[, 2]))$sigma)^2
var2[1] <- (summary(lm(y010[, 2] ~ y010[, 1]))$sigma)^2
var3[1] <- (summary(lm(y001[, 3] ~ y001[, 1]))$sigma)^2
for (k in 2:niter) {
  params <- Estep(pi1[k-1], pi2[k-1], pi3[k-1], var[k-1], var1[k-1], var2[k-1],
                  var3[k-1], betas[k-1, ],
                  a1[k-1], a2[k-1], a3[k-1], b1[k-1], b2[k-1], b3[k-1], y)
  tau110 <- params$tau110
  tau101 <- params$tau101
  tau011 <- params$tau011
  tau111 <- params$tau111
  ytrue <- params$ytrue
  ystar <- params$ystar
  ystar2 <- params$ystar2
  #conditional variance estimate for case with 3 errors
  var111 <- params$var111
  #update pis
  pi1[k] <- (n100 + sum(tau101) + sum(tau110) + sum(tau111)) / n
  pi2[k] <- (n010 + sum(tau110) + sum(tau011) + sum(tau111)) / n
  pi3[k] <- (n001 + sum(tau101) + sum(tau011) + sum(tau111)) / n
  #update betas
  betas[k, ] <- betafn(y, ytrue)
  #update slopes and intercepts
  b1pars <- b1fn(n100, nrest, tau011, tau110, tau101, tau111, y100, yrest,
                 ystar, ystar2, var111)
  A1.1 <- b1pars$A1.1
  B1.1 <- b1pars$B1.1
  C1.1 <- b1pars$C1.1
  b1[k] <- b1pars$b1
  a1[k] <- (C1.1 - B1.1 * b1[k]) / A1.1
  b2pars <- b2fn(n010, nrest, tau011, tau110, tau101, tau111, y010, yrest,
                 ystar, ystar2, var111)
  A1.2 <- b2pars$A1.2
  B1.2 <- b2pars$B1.2
  C1.2 <- b2pars$C1.2
  b2[k] <- b2pars$b2
  a2[k] <- (C1.2 - B1.2 * b2[k]) / A1.2
  b3pars <- b3fn(n001, nrest, tau011, tau110, tau101, tau111, y001, yrest,
                 ystar, ystar2, var111)
  A1.3 <- b3pars$A1.3
  B1.3 <- b3pars$B1.3
  C1.3 <- b3pars$C1.3
  b3[k] <- b3pars$b3
  a3[k] <- (C1.3 - B1.3 * b3[k]) / A1.3
  #update variances
  var[k] <- varfn(y, ytrue, betas[k, ], yrest, srest, n, nrest,
                  tau110, tau101, tau011, tau111, ystar, ystar2)
  var1[k] <- var1fn(y100, yrest, ytrue, betas[k, ], srest, n100, nrest,
                    tau110, tau101, tau011, tau111, pi1[k], ystar, ystar2,
                    a1[k], b1[k])
  var2[k] <- var2fn(y010, yrest, ytrue, betas[k, ], srest, n010, nrest,
                    tau110, tau101, tau011, tau111, pi2[k], ystar, ystar2,
                    a2[k], b2[k])
  var3[k] <- var3fn(y001, yrest, ytrue, betas[k, ], srest, n001, nrest,
                    tau110, tau101, tau011, tau111, pi3[k], ystar, ystar2,
                    a3[k], b3[k])
  print(k)
  print(c(pi1[k], pi2[k], pi3[k], betas[k, ], var[k], var1[k], var2[k], var3[k],
          a1[k], a2[k], a3[k], b1[k], b2[k], b3[k]))
  if (abs(1 - (pi1[k] / pi1[k-1])) < m & abs(1 - (pi2[k] / pi2[k-1])) < m &
      abs(1 - (pi3[k] / pi3[k-1])) < m &
      abs(1 - (var[k] / var[k-1])) < m & abs(1 - (var1[k] / var1[k-1])) < m &
      abs(1 - (var2[k] / var2[k-1])) < m & abs(1 - (var3[k] / var3[k-1])) < m &
      abs(1 - (a1[k] / a1[k-1])) < m & abs(1 - (a2[k] / a2[k-1])) < m &
      abs(1 - (a3[k] / a3[k-1])) < m &
      abs(1 - (b1[k] / b1[k-1])) < m & abs(1 - (b2[k] / b2[k-1])) < m &
      abs(1 - (b3[k] / b3[k-1])) < m)  #converges when prop. change < m
    break
}
print(paste("converged after", k, "iterations"))
print(paste("final estimate for pi1 is", pi1[k]))
print(paste("final estimate for pi2 is", pi2[k]))
print(paste("final estimate for pi3 is", pi3[k]))
print(paste("final estimate for var is", var[k]))
print(paste("final estimate for var1 is", var1[k]))
print(paste("final estimate for var2 is", var2[k]))
print(paste("final estimate for var3 is", var3[k]))
print(paste("final estimate for the first regression coefficient is", betas[k, 1]))
print(paste("final estimate for the second regression coefficient is", betas[k, 2]))
print(paste("final estimate for the third regression coefficient is", betas[k, 3]))
print(paste("final estimate for the first intercept coefficient is", a1[k]))
print(paste("final estimate for the second intercept coefficient is", a2[k]))
print(paste("final estimate for the third intercept coefficient is", a3[k]))
print(paste("final estimate for the first slope coefficient is", b1[k]))
print(paste("final estimate for the second slope coefficient is", b2[k]))
print(paste("final estimate for the third slope coefficient is", b3[k]))
#plot the pi estimates
plot(pi1[1:k], ylim = c(0, 1), xlab = 'Iteration', ylab = 'Pi estimates')
points(pi2[1:k], col = 'red')
points(pi3[1:k], col = 'blue')
legend('topright', col = c('black', 'red', 'blue'),
       legend = c('Pi 1', 'Pi 2', 'Pi 3'), pch = 1)
#plot the variance estimates
plot(var[1:k], ylim = c(0, 450000000), xlab = 'Iteration',
     ylab = 'Variance estimates')
points(var1[1:k], col = 'red')
points(var2[1:k], col = 'blue')
points(var3[1:k], col = 'green')
legend('topright', col = c('black', 'red', 'blue', 'green'),
       legend = c('Var', 'Var 1', 'Var 2', 'Var 3'), pch = 1)
#plot of the intercept estimates
plot(a1[1:k], ylim = c(-900, 50), xlab = 'Iteration', ylab = 'Intercept estimates')
points(a2[1:k], col = 'red')
points(a3[1:k], col = 'blue')
legend('topright', col = c('black', 'red', 'blue'),
       legend = c('a1', 'a2', 'a3'), pch = 1)
plot(b1[1:k], ylim = c(0, 2), xlab = 'Iteration', ylab = 'Slope estimates')
points(b2[1:k], col = 'red')
points(b3[1:k], col = 'blue')
legend('topright', col = c('black', 'red', 'blue'),
       legend = c('b1', 'b2', 'b3'), pch = 1)
estimates <- c(pi1[k], pi2[k], pi3[k], betas[k], var[k], a1[k], b1[k], var1[k],
               a2[k], b2[k], var2[k], a3[k], b3[k], var3[k])
#STANDARD ERRORS
#DM Matrix
dmfun <- function(pi1, pi2, pi3, var, var1, var2, var3, betas, a1, a2, a3,
                  b1, b2, b3, y, k) {
  truethetas <- c(pi1[k], pi2[k], pi3[k], betas[k], var[k], a1[k], b1[k],
                  var1[k], a2[k], b2[k], var2[k], a3[k], b3[k], var3[k])
  truethetas.tf <- c(log(pi1[k] / (1 - pi1[k])), log(pi2[k] / (1 - pi2[k])),
                     log(pi3[k] / (1 - pi3[k])),
                     betas[k], log(var[k]), a1[k], b1[k], log(var1[k]),
                     a2[k], b2[k], log(var2[k]), a3[k], b3[k], log(var3[k]))
  dmr.tf <- dmr <- params <- list()
  ready <- rep(FALSE, 14)
  dmr[[1]] <- matrix(0, 14, 14)
  dmr.tf[[1]] <- matrix(0, 14, 14)
  for (t in 1:(k - 1)) {
    print(t)
    thetas <- c(pi1[t], pi2[t], pi3[t], betas[t], var[t], a1[t], b1[t],
                var1[t], a2[t], b2[t], var2[t], a3[t], b3[t], var3[t])
    thetas.tf <- c(log(pi1[t] / (1 - pi1[t])), log(pi2[t] / (1 - pi2[t])),
                   log(pi3[t] / (1 - pi3[t])),
                   betas[t], log(var[t]), a1[t], b1[t], log(var1[t]),
                   a2[t], b2[t], log(var2[t]), a3[t], b3[t], log(var3[t]))
    for (i in (1:14)[!ready]) {
      thetahat <- truethetas
      thetahat.tf <- truethetas.tf
      thetahat[i] <- thetas[i]
      thetahat.tf[i] <- thetas.tf[i]
      print(thetahat)
      params[[i]] <- Estep(pi1 = thetahat[1], pi2 = thetahat[2],
                           pi3 = thetahat[3], betas = thetahat[4],
                           var = thetahat[5],
                           a1 = thetahat[6], b1 = thetahat[7],
                           var1 = thetahat[8],
                           a2 = thetahat[9], b2 = thetahat[10],
                           var2 = thetahat[11],
                           a3 = thetahat[12], b3 = thetahat[13],
                           var3 = thetahat[14], y)
      tau110 <- params[[i]]$tau110
      tau101 <- params[[i]]$tau101
      tau011 <- params[[i]]$tau011
      tau111 <- params[[i]]$tau111
      ytrue <- params[[i]]$ytrue
      ystar <- params[[i]]$ystar
      ystar2 <- params[[i]]$ystar2
      var111 <- params[[i]]$var111
      pi1.0 <- (n100 + sum(tau101) + sum(tau110) + sum(tau111)) / n
      pi2.0 <- (n010 + sum(tau110) + sum(tau011) + sum(tau111)) / n
      pi3.0 <- (n001 + sum(tau101) + sum(tau011) + sum(tau111)) / n
      #update betas
      betas.0 <- betafn(y, ytrue)
      #update slopes and intercepts
      b1pars <- b1fn(n100, nrest, tau011, tau110, tau101, tau111, y100, yrest,
                     ystar, ystar2, var111)
      A1.1 <- b1pars$A1.1
      B1.1 <- b1pars$B1.1
      C1.1 <- b1pars$C1.1
      b1.0 <- b1pars$b1
      a1.0 <- (C1.1 - B1.1 * b1.0) / A1.1
      b2pars <- b2fn(n010, nrest, tau011, tau110, tau101, tau111, y010, yrest,
                     ystar, ystar2, var111)
      A1.2 <- b2pars$A1.2
      B1.2 <- b2pars$B1.2
      C1.2 <- b2pars$C1.2
      b2.0 <- b2pars$b2
      a2.0 <- (C1.2 - B1.2 * b2.0) / A1.2
      b3pars <- b3fn(n001, nrest, tau011, tau110, tau101, tau111, y001, yrest,
                     ystar, ystar2, var111)
      A1.3 <- b3pars$A1.3
      B1.3 <- b3pars$B1.3
      C1.3 <- b3pars$C1.3
      b3.0 <- b3pars$b3
      a3.0 <- (C1.3 - B1.3 * b3.0) / A1.3
      logitpi1.0 <- log(pi1.0 / (1 - pi1.0))
      logitpi2.0 <- log(pi2.0 / (1 - pi2.0))
      logitpi3.0 <- log(pi3.0 / (1 - pi3.0))
      #update variances
      var.0 <- varfn(y, ytrue, betas.0, yrest, srest, n, nrest,
                     tau110, tau101, tau011, tau111, ystar, ystar2)
      var1.0 <- var1fn(y100, yrest, ytrue, betas.0, srest, n100, nrest,
                       tau110, tau101, tau011, tau111, pi1.0, ystar, ystar2,
                       a1.0, b1.0)
      var2.0 <- var2fn(y010, yrest, ytrue, betas.0, srest, n010, nrest,
                       tau110, tau101, tau011, tau111, pi2.0, ystar, ystar2,
                       a2.0, b2.0)
      var3.0 <- var3fn(y001, yrest, ytrue, betas.0, srest, n001, nrest,
                       tau110, tau101, tau011, tau111, pi3.0, ystar, ystar2,
                       a3.0, b3.0)
      logvar.0 <- log(var.0)
      logvar1.0 <- log(var1.0)
      logvar2.0 <- log(var2.0)
      logvar3.0 <- log(var3.0)
      newthetas <- c(pi1.0, pi2.0, pi3.0, betas.0, var.0, a1.0, b1.0, var1.0,
                     a2.0, b2.0, var2.0, a3.0, b3.0, var3.0)
      newthetas.tf <- c(logitpi1.0, logitpi2.0, logitpi3.0, betas.0, logvar.0,
                        a1.0, b1.0, logvar1.0, a2.0, b2.0, logvar2.0,
                        a3.0, b3.0, logvar3.0)
      dmr[[t]][i, ] <- (newthetas - truethetas) / (thetahat[i] - truethetas[i])
      dmr.tf[[t]][i, ] <- (newthetas.tf - truethetas.tf) /
        (thetahat.tf[i] - truethetas.tf[i])
      if (t > 1) {
        dev <- (dmr[[t]][i, ] - dmr[[t - 1]][i, ]) / dmr[[t - 1]][i, ]
        sel <- abs(dmr[[t - 1]][i, ]) > 1e-2
        print(dev)
        if (max(abs(dev[sel])) < 5e-3) ready[i] <- TRUE
      }
    }
    dmr[[t + 1]] <- dmr[[t]]
    dmr.tf[[t + 1]] <- dmr.tf[[t]]
  }
  return(list(dmr = dmr, dmr.tf = dmr.tf, niter = t))
}
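#The standard errors below follow the SEM algorithm of Meng and Rubin (1991):
#with Vcom the complete-data variance matrix (vmat, and vmat.tf on the
#transformed scale) and DM the rate-of-convergence matrix estimated by dmfun,
#the observed-data variance matrix is
#   Vobs = Vcom %*% solve(diag(14) - DM),
#which is what v.obs and v.obs.tf compute; ginv() from the MASS package is used
#for a generalized-inverse version of the same computation.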
library(MASS)  #provides ginv(), used below
results <- dmfun(pi1, pi2, pi3, var, var1, var2, var3, betas, a1, a2, a3,
                 b1, b2, b3, y, k)
results$niter
params <- Estep(pi1[k], pi2[k], pi3[k], var[k], var1[k], var2[k], var3[k],
                betas[k], a1[k], a2[k], a3[k], b1[k], b2[k], b3[k], y)
tau110 <- params$tau110
tau101 <- params$tau101
tau011 <- params$tau011
tau111 <- params$tau111
ytrue <- params$ytrue
ystar <- params$ystar
ystar2 <- params$ystar2
x <- y[, -(1:3)]
#variance matrix: pis component (zeta is the logit-transformed pi)
v.pis <- v.zetas <- matrix(0, 3, 3)
diag(v.pis) <- c(pi1[k] * (1 - pi1[k]) / n, pi2[k] * (1 - pi2[k]) / n,
                 pi3[k] * (1 - pi3[k]) / n)
diag(v.zetas) <- c(1 / (n * pi1[k] * (1 - pi1[k])), 1 / (n * pi2[k] * (1 - pi2[k])),
                   1 / (n * pi3[k] * (1 - pi3[k])))
#variance matrix: betas and sigmas (phi is the log-transformed sigma)
v.beta.var <- v.beta.phi <- matrix(0, 2, 2)
diag(v.beta.var) <- c(var[k] * solve(sum(x * t(x))), 2 * (var[k]^2) / n)
diag(v.beta.phi) <- c(var[k] * solve(sum(x * t(x))), 2 / n)
v.a1.b1.var1 <- v.a2.b2.var2 <- v.a3.b3.var3 <- matrix(0, 3, 3)
n1 <- n100 + nrest - sum(tau011)
n2 <- n010 + nrest - sum(tau101)
n3 <- n001 + nrest - sum(tau110)
#source 1: y[s100, 2] equals the true score for cases in s100
Delta1 <- sum((y[s100, 2])^2) + sum(yrest[, 3]^2 * tau110 + yrest[, 2]^2 * tau101 +
  ystar2 * tau111) -
  (1/n1) * (sum(y[s100, 2]) + sum(yrest[, 3] * tau110 + yrest[, 2] * tau101 +
    ystar * tau111))^2
v.a1.b1.var1[1, 1] <- (var1[k] / (n1 * Delta1)) *
  (sum((y[s100, 2])^2) + sum(yrest[, 3]^2 * tau110 + yrest[, 2]^2 * tau101 +
    ystar2 * tau111))
v.a1.b1.var1[2, 1] <- v.a1.b1.var1[1, 2] <- -(var1[k] / (n1 * Delta1)) *
  (sum(y[s100, 2]) + sum(yrest[, 3] * tau110 + yrest[, 2] * tau101 + ystar * tau111))
v.a1.b1.var1[2, 2] <- (var1[k] / Delta1)
v.a1.b1.var1[3, 3] <- (2 * var1[k]^2) / n1
v.a1.b1.phi1 <- v.a1.b1.var1
v.a1.b1.phi1[3, 3] <- 2 / n1
#source 2: y[s010, 1] equals the true score for cases in s010
Delta2 <- sum((y[s010, 1])^2) + sum(yrest[, 3]^2 * tau110 + yrest[, 1]^2 * tau011 +
  ystar2 * tau111) -
  (1/n2) * (sum(y[s010, 1]) + sum(yrest[, 3] * tau110 + yrest[, 1] * tau011 +
    ystar * tau111))^2
v.a2.b2.var2[1, 1] <- (var2[k] / (n2 * Delta2)) *
  (sum((y[s010, 1])^2) + sum(yrest[, 3]^2 * tau110 + yrest[, 1]^2 * tau011 +
    ystar2 * tau111))
v.a2.b2.var2[2, 1] <- v.a2.b2.var2[1, 2] <- -(var2[k] / (n2 * Delta2)) *
  (sum(y[s010, 1]) + sum(yrest[, 3] * tau110 + yrest[, 1] * tau011 + ystar * tau111))
v.a2.b2.var2[2, 2] <- (var2[k] / Delta2)
v.a2.b2.var2[3, 3] <- (2 * var2[k]^2) / n2
v.a2.b2.phi2 <- v.a2.b2.var2
v.a2.b2.phi2[3, 3] <- 2 / n2
#source 3: y[s001, 1] equals the true score for cases in s001
Delta3 <- sum((y[s001, 1])^2) + sum(yrest[, 1]^2 * tau011 + yrest[, 2]^2 * tau101 +
  ystar2 * tau111) -
  (1/n3) * (sum(y[s001, 1]) + sum(yrest[, 1] * tau011 + yrest[, 2] * tau101 +
    ystar * tau111))^2
v.a3.b3.var3[1, 1] <- (var3[k] / (n3 * Delta3)) *
  (sum((y[s001, 1])^2) + sum(yrest[, 1]^2 * tau011 + yrest[, 2]^2 * tau101 +
    ystar2 * tau111))
v.a3.b3.var3[2, 1] <- v.a3.b3.var3[1, 2] <- -(var3[k] / (n3 * Delta3)) *
  (sum(y[s001, 1]) + sum(yrest[, 1] * tau011 + yrest[, 2] * tau101 + ystar * tau111))
v.a3.b3.var3[2, 2] <- (var3[k] / Delta3)
v.a3.b3.var3[3, 3] <- (2 * var3[k]^2) / n3
v.a3.b3.phi3 <- v.a3.b3.var3
v.a3.b3.phi3[3, 3] <- 2 / n3
vmat <- matrix(0, 14, 14)
vmat[1:3, 1:3] <- v.pis
vmat[4:5, 4:5] <- v.beta.var
vmat[6:8, 6:8] <- v.a1.b1.var1
vmat[9:11, 9:11] <- v.a2.b2.var2
vmat[12:14, 12:14] <- v.a3.b3.var3
v.obs <- vmat %*% (solve(diag(14) - results$dmr[[k]]))
v.obs1 <- vmat %*% (ginv(diag(14) - results$dmr[[k]]))
sqrt(diag(vmat))
sqrt(diag(v.obs))
sqrt(diag(v.obs1))
vmat.tf <- matrix(0, 14, 14)
vmat.tf[1:3, 1:3] <- v.zetas
vmat.tf[4:5, 4:5] <- v.beta.phi
vmat.tf[6:8, 6:8] <- v.a1.b1.phi1
vmat.tf[9:11, 9:11] <- v.a2.b2.phi2
vmat.tf[12:14, 12:14] <- v.a3.b3.phi3
v.obs.tf <- vmat.tf %*% (solve(diag(14) - results$dmr.tf[[k]]))
v.obs.tf1 <- vmat.tf %*% (ginv(diag(14) - results$dmr.tf[[k]]))
D <- diag(c(pi1[k] * (1 - pi1[k]), pi2[k] * (1 - pi2[k]), pi3[k] * (1 - pi3[k]),
            rep(1, length(betas[k])),
            var[k], 1, 1, var1[k], 1, 1, var2[k], 1, 1, var3[k]))
v.obs <- D %*% (v.obs.tf %*% D)
SEs <- sqrt(diag(v.obs))
#95% confidence intervals
(lowerl <- estimates - 1.96 * SEs)
(upperl <- estimates + 1.96 * SEs)
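Finally, the point estimates, SEM standard errors and confidence limits produced above can be collected in a single table. The sketch below is illustrative only: the objects estimates, SEs, lowerl and upperl come from the script above, while param.names is a hypothetical labelling that assumes the 14-parameter ordering (with a single regression coefficient) used throughout this code.

#Illustrative only: tabulate estimates, standard errors and 95% confidence limits
param.names <- c("pi1", "pi2", "pi3", "beta", "var", "a1", "b1", "var1",
                 "a2", "b2", "var2", "a3", "b3", "var3")
results.table <- data.frame(parameter = param.names, estimate = estimates,
                            se = SEs, lower95 = lowerl, upper95 = upperl)
print(results.table, digits = 4)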
Bibliography

Dasu, T. and T. Johnson (2003). Exploratory Data Mining and Data Cleaning. John Wiley and Sons Inc.

de Waal, T., J. Pannekoek, and S. Scholtus (2011a). The editing of statistical data: methods and techniques for the efficient detection and correction of errors and missing values. Discussion Paper 2011, Statistics Netherlands, The Hague.

de Waal, T., J. Pannekoek, and S. Scholtus (2011b). Handbook of Statistical Data Editing and Imputation. John Wiley and Sons Inc.

Dempster, A. P., N. M. Laird, and D. B. Rubin (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society 39.

Efron, B. and D. V. Hinkley (1978). Assessing the accuracy of the maximum likelihood estimator: Observed versus expected Fisher information. Biometrika 65 (3), 457–482.

Guarnera, U. and M. di Zio (2013). A contamination model for selective editing. Journal of Official Statistics 29, 539–555.

Guarnera, U. and R. Varriale (2015). Estimation and editing for data from different sources. An approach based on latent class model. Working Paper No. 32, UN/ECE Work Session on Statistical Data Editing, Budapest.

Gupta, M. R. and Y. Chen (2011). Theory and Use of the EM Algorithm. Foundations and Trends in Signal Processing 4.

Laitila, T., A. Wallgren, and B. Wallgren (2012). Quality assessment of administrative data. Research and Development Methodology Reports, Statistics Sweden.

Little, R. J. A. and D. Rubin (2002). Statistical Analysis with Missing Data (Second ed.). Wiley Interscience.

McLachlan, G. and D. Peel (2000). Finite Mixture Models. Wiley Interscience.

Meng, X. and D. B. Rubin (1991). Using EM to Obtain Asymptotic Variance-Covariance Matrices: The SEM Algorithm. Journal of the American Statistical Association 86 (416).

Meng, X. and D. B. Rubin (1993). Maximum Likelihood Estimation via the ECM Algorithm: A General Framework. Biometrika 80, 267–278.

Scholtus, S. (2016). Editing and Estimation of Measurement Errors in Statistical Data. Ph.D. thesis, VU University, Amsterdam.

Scholtus, S., B. F. M. Bakker, and A. Van Delden (2015). Modelling measurement error to estimate bias in administrative and survey variables. Discussion Paper 2015-17, Statistics Netherlands, The Hague.

Wallgren, A. and B. Wallgren (2007). Register-based Statistics: Administrative Data for Statistical Purposes. John Wiley and Sons Inc.

Weiss, N. (2005). A Course in Probability. Pearson Addison-Wesley.