STAT 8630, Mixed-Effect Models and Longitudinal Data Analysis — Lecture Notes
Introduction to Longitudinal Data
Terminology:
Longitudinal data consist of observations (i.e., measurements) taken repeatedly through time on a sample of experimental units (i.e., individuals,
subjects).
• The experimental units or subjects can be human patients, animals,
agricultural plots, etc.
• Typically, the terms “longitudinal data” and “longitudinal study”
refer to situations in which data are collected through time under
uncontrolled circumstances. E.g., subjects with torn ACLs in their
knees are assigned to one of two methods of surgical repair and then
followed through time (examined at 6, 12, 18, and 24 months for knee
stability, say).
• Longitudinal data are to be contrasted with cross-sectional data.
Cross-sectional data contain measurements on a sample of subjects
at only one point in time.
Repeated measures: The terms “repeated measurements” or, more simply, “repeated measures” are sometimes used as rough synonyms for “longitudinal data”; however, there are slight differences in meaning between these terms.
• Repeated measures are also multiple measurements on each of several
individuals, but they are not necessarily through time. E.g., measurements of chemical concentration in the leaves of a plant taken at
different locations (low, medium and high on the plant, say) can be
regarded as repeated measures.
1
• In addition, repeated measures may occur across the levels of some
controlled factor. E.g., crossover studies involve repeated measures. In a crossover study, subjects are assigned to multiple treatments (usually 2 or 3) sequentially. E.g., a two-period crossover experiment involves subjects who each get treatments A and B, some
in the order AB, and others in the order BA.
Another rough synonym for longitudinal data is panel data.
• The term panel data is more common in econometrics, the term
longitudinal data is most commonly used in biostatistics, and the
term repeated measures most often arises in an agricultural context.
In all cases, however, we are referring to multiple measurements of essentially the same variable(s) on a given subject or unit of observation. We’ll
often use the more generic term clustered data to refer to this situation.
Advantages and Disadvantages of Longitudinal Data:
Advantages:
1. Although time effects can be investigated in cross-sectional studies in
which different subjects are examined at different time points, only
longitudinal data give information on individual patterns of change.
2. Again, in contrast to cross-sectional studies involving multiple time
points, longitudinal studies economize on subjects.
3. In investigating time effects in a longitudinal design or treatment
effects in a crossover design, each subject can “serve as his or her own
control”. That is, comparisons can be made within a subject rather
than between subjects. This eliminates between-subjects sources of
variability from the experimental error and makes inferences more
efficient/powerful (think paired t-test versus two-sample t-test).
2
4. Since the same variables are measured repeatedly on the same subjects, the reliability of those measurements can be assessed, and
purely from a measurement standpoint, reliability is higher.
Disadvantages:
1. For longitudinal or, more generally, clustered data it is typically reasonable to assume independence across clusters, but repeated measures within a cluster are almost always correlated, which complicates the analysis.
2. Clustered data are often unbalanced or partially incomplete (involve
missing data), which also complicates the analysis. For longitudinal
data, this may be due to loss to follow-up (some subjects move away,
die, miss appointments, etc.). For other types of clustered data, the
cluster size may vary (e.g., familial data, where family size varies).
3. As a practical matter, methods and/or software may not exist or
may be complex, so obtaining results and interpreting them may be
difficult.
3
Data Structure for Clustered Data:
The general data structure of clustered data is given in the table below.
Here, yij represents the j th observation from the ith cluster, where i =
1, . . . , n, and j = 1, . . . , ti .*
• Associated with each observation yij we may have a p × 1 vector of
explanatory variables, or covariates, xij = (xij1 , xij2 , . . . , xijp )T .
• In addition, to indicate missing data, we may sometimes define a
missing value indicator:
$$\delta_{ij} = \begin{cases} 1, & \text{if } y_{ij} \text{ and } \mathbf{x}_{ij} \text{ are observed},\\ 0, & \text{otherwise.} \end{cases}$$
• We will often write the set of responses from the ith subject as a
vector: yi = (yi1 , . . . , yiti )T . In addition, let y = (y1T , . . . , ynT )T be
the combined response vector from all subjects and time points, and
let $N = \sum_{i=1}^n t_i$ be the total sample size.
* Note that our text uses slightly different notation in which ni is the
cluster size and N is the number of clusters.
4
Often subjects will be grouped somehow into treatment groups. In this
case, we will need additional subscripts to index the groups. For example,
the data layout below in Table 1.2 represents a one-way layout with s
groups with repeated measures over t time points. Here, yhij represents
the j th measurement on the ith subject from the hth group.
• Note that the s groups might correspond to s levels of a single treatment factor, or s combinations of the levels of two or more factors.
In the latter case, it may be convenient to introduce additional subscripts.
– E.g., for a two-way layout with repeated measures involving
factors A and B, we might index the data as yhkij to represent
the j th observation on the ith subject in the h, k th treatment
(at the hth level of A and the k th level of B).
5
For the single-group case we can drop the index h from Table 1.2 to represent the data as in Table 1.3.
For longitudinal data, the response variable can be laid out in a single
column as in Table 1.1, or with one column per time point as in Tables 1.2
and 1.3. The two layouts suggest that such data can be conceptualized
as univariate or multivariate data.
In fact, there are classical normal-theory approaches to analyzing continuous repeated measures data of each type:
— univariate methods (most notably the repeated measures analysis of
variance); and
— multivariate methods (profile and growth curve analysis).
6
Repeated Measures ANOVA
The classical repeated measures analysis of variance (RM-ANOVA) situation is a one-way layout with repeated measures over t time points. This
situation is displayed in Table 1.2.
Here, there are n1 , n2 , . . . , ns subjects, respectively, in the s treatment
groups. If measurements were taken at just a single time point, we would
have an (unbalanced) one-way layout. However, subjects are followed up
over t time points to yield a repeated measures design.
Example – Methemoglobin in Sheep:
An experiment was designed to study trends in methemoglobin (M ) in
sheep following treatment with 3 equally spaced levels of NO2 (factor A).
Four sheep were assigned to each level and each animal was measured at
6 sampling times (factor B), 5 of them following treatment. The response
was log(M + 5).
The data from this experiment are as follows:
NO2   Sheep                    Sampling Time
                    1      2      3      4      5      6
 1       1        2.197  2.442  2.542  2.241  1.960  1.988
 1       2        1.932  2.526  2.526  2.152  1.917  1.917
 1       3        1.946  2.251  2.501  1.988  1.686  1.841
 1       4        1.758  2.054  2.588  2.197  2.140  1.686
 2       5        2.230  3.086  3.357  3.219  2.827  2.534
 2       6        2.398  2.580  2.929  2.874  2.282  2.303
 2       7        2.054  3.243  3.653  3.811  3.816  3.227
 2       8        2.510  2.695  2.996  3.246  2.565  2.230
 3       9        2.140  3.896  4.246  4.461  4.418  4.331
 3      10        2.303  3.822  4.109  4.240  4.127  4.084
 3      11        2.175  2.907  3.086  2.827  2.493  2.230
 3      12        2.041  3.824  4.111  4.301  4.206  4.182
• Here s = 3, n1 = n2 = n3 = 4, and t = 6.
The basic RM-ANOVA approach is based upon a similarity between a
repeated measures design and a split-plot experimental design. The
RM-ANOVA approach uses the split-plot model with modifications to the
split-plot analysis, if necessary, to account for differences between the two
designs.
A split-plot experimental design is one in which (at least) two sizes of
experimental unit are used. The larger experimental unit is known as the
whole plot, and is randomized to some experimental design (a one-way
layout, say).
The whole plot is then subdivided into smaller units known as split plots,
which are then assigned to a second experimental design within each whole
plot.
Example – Chocolate Cake:
An experiment was conducted to determine the effect of baking temperature on quality for three recipes of chocolate cake. Recipes I
and II differed in that the chocolate was added at 40◦ C. and 60◦
C., respectively, while recipe III contained extra sugar. Six different
baking temperatures were considered: 175◦ C., 185◦ C., 195◦ C.,
205◦ C., 215◦ C., and 225◦ C. Forty-five batches of cake batter were prepared, using each of the 3 recipes 15 times, in a completely random
order. Each batch was large enough for 6 cakes, and the six baking
temperatures were randomly assigned to the 6 cakes within each
batch. One of several measurements of quality made on
each cake was the breaking angle of the cake.
8
The data from this experiment are as follows:
• Here, there are 45 batches of cake, which occur in a balanced one-way
layout. The batches are the whole plots, and recipe, with 3 levels, is
the whole plot factor.
• Each batch is then split into 6 cakes, which are randomly assigned to
one of 6 temperatures. The cakes are the split plots, and temperature
is the split plot factor.
9
Let yhij represent the response at the j th level of the split-plot factor for
the ith whole plot in the hth group. The model traditionally used for the
split-plot design exemplified by the chocolate cake example is
yhij = µ + αh + ei(h) + βj + (αβ)hj + εhij ,
(∗)
Here, µ is a grand mean, αh is an effect for the hth level of the whole plot
factor (e.g., recipe), βj is an effect for the j th level of the split-plot factor
(temperature), and (αβ)hj is an interaction term for the whole and split
plot factors. ei(h) is a random effect for the ith whole plot nested in the
hth level of the whole plot factor, and εhij is the overall error term.
One way to think about the split plot model is that it is the union of the
model appropriate for the whole plots and the model appropriate for the
split plots.
• The whole plots occur in a one-way layout, so the one-way layout
model
µ + αh + ei(h)
is appropriate for batches.
• The split plots occur in a randomized complete block design (with
whole plots playing the role of blocks), so the RCBD model
$$\mu + \mathrm{block}_{hi} + \beta_j + \varepsilon_{hij}, \qquad \text{where } \mathrm{block}_{hi} = \alpha_h + e_{i(h)},$$
is appropriate for cakes.
• Putting these portions of the models together and adding an interaction term (αβ)hj to capture interactions between the whole and
split plot factors, leads to model (*).
• The random whole plot effect ei(h) can be thought of as the whole
plot error term and εhij as the split plot error term. Since there are
two experimental units with two separate randomizations, there are
two error terms in the model.
10
In fact, the split plot model described above is an example of a linear
mixed-effects model (LMM).
• It includes fixed effects for the whole plot factor (the αh ’s for recipes),
split plot factor (the βj ’s for temperatures), and their interaction (the
(αβ)hj ’s). These are the regression parameters of the model.
• It also includes random effects: ei(h) , a whole plot (or batch) effect,
in addition to the overall error term εhij , which is always present in
any linear model, so is typically not categorized as a random effect,
even though it is.
• The term “mixed-effects model” or sometimes simply “mixed model”
refers to the fact that the linear predictor of the model (the right side
of the model equation, excluding the overall error term) includes both
fixed and random effects.
Fixed vs. random effects: The effects in the model account for variability
in the response across levels of treatment and design factors. The decision
as to whether fixed effects or random effects should be used depends upon
what the appropriate scope of generalization is.
• If it is appropriate to think of the levels of a factor as randomly
drawn from, or otherwise representative of, a population to which
we’d like to generalize, then random effects are suitable.
– Design or grouping factors are usually more appropriately modeled with random effects.
– E.g., blocks (sections of land) in an agricultural experiment,
days when an experiment is conducted over several days, lab
technician when measurements are taken by several technicians, subjects in a repeated measures design, locations or sites
along a river when we desire to generalize to the entire river,
etc.
11
• If, however, the specific levels of the factor are of interest in and of
themselves then fixed effects are more appropriate.
– Treatment factors are usually more appropriately modeled with
fixed effects.
– E.g., In experiments to compare drugs, amounts of fertilizer,
hybrids of corn, teaching techniques, and measurement devices,
all of these factors are most appropriately modeled with fixed
effects.
• A good litmus test for whether the level of some factor should be
treated as fixed is to ask whether it would be of broad interest to
report a mean for that level. For example, if I’m conducting an
experiment in which each of four different classes of third grade students are taught with each of three methods of instruction (e.g., in
a crossover design) then it will be of broad interest to report the
mean response (level of learning, say) for a particular method of
instruction, but not for a particular classroom of third graders.
– Here, fixed effects are appropriate for instruction method, random effects for class.
Since the whole plot error term represents random whole plot effects (batch
effects), the ei(h) ’s are random effects. Therefore, we must make some assumptions about their distribution (and the distribution of the overall error
term εhij ) to complete the split-plot model. The following assumptions are
typical:
yhij = µ + αh + βj + (αβ)hj + ei(h) + εhij = µhj + ei(h) + εhij ,
where
$$\{e_{i(h)}\} \overset{iid}{\sim} N(0, \sigma_e^2), \qquad \{\varepsilon_{hij}\} \overset{iid}{\sim} N(0, \sigma^2), \qquad \mathrm{cov}(\varepsilon_{hij},\, e_{i'(h')}) = 0,$$
for all h, i, j, h′ , i′ .
12
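To make this concrete, the following is a hedged sketch of how this split-plot model might be fit with PROC MIXED (anticipating the choccake.sas handout discussed below). The data set and variable names — choccake, recipe, temp, batch, angle — are assumptions for illustration and need not match the handout.

    proc mixed data=choccake method=type3;   /* method=type3 requests the classical ANOVA analysis */
      class recipe temp batch;
      model angle = recipe temp recipe*temp; /* fixed effects: alpha_h, beta_j, (alpha beta)_hj */
      random batch(recipe);                  /* random whole plot (batch) effect e_i(h) */
    run;

The RANDOM statement supplies the whole plot error term; the residual plays the role of the split plot error term εhij .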
The chocolate cake experiment is an example of a balanced split plot design
with whole plots arranged in a one-way layout. More complex split-plot
designs are possible. E.g., whole plots are often arranged in a RCBD, split
plots could be split once again to create split-split-plots, etc.
However, the classical analysis of all of these designs is relatively straightforward provided that the design is balanced.
• By “balanced” here, we mean that there is an equal number of replicates for each whole plot treatment (recipe), and the same set of
subplot treatments (temperatures) was observed within each whole
plot (batch).
The classical analysis of the balanced split plot model with whole plots
in a one-way layout is based on the model at the bottom of p.12 and the
following decomposition of the deviations yhij − ȳ··· of each observation
from the grand sample mean:
yhij − ȳ··· = (ȳh·· − ȳ··· ) + (ȳhi· − ȳh·· ) + (ȳ··j − ȳ··· )
+ (ȳh·j − ȳh·· − ȳ··j + ȳ··· ) + (yhij − ȳh·j − ȳhi· + ȳh·· ), (∗)
where $n = \sum_h n_h$, we assume nh is constant over h, and
$$\bar y_{\cdot\cdot\cdot} = (nt)^{-1}\sum_{h=1}^s\sum_{i=1}^{n_h}\sum_{j=1}^t y_{hij}, \qquad \bar y_{h\cdot\cdot} = (n_h t)^{-1}\sum_{i=1}^{n_h}\sum_{j=1}^t y_{hij},$$
$$\bar y_{\cdot\cdot j} = n^{-1}\sum_{h=1}^s\sum_{i=1}^{n_h} y_{hij}, \qquad \bar y_{h\cdot j} = n_h^{-1}\sum_{i=1}^{n_h} y_{hij}, \qquad \bar y_{hi\cdot} = t^{-1}\sum_{j=1}^t y_{hij}$$
are sample means over all observations (ȳ··· ), over observations in the hth
whole plot treatment group (ȳh·· ), etc.
13
This decomposition leads to the following analysis of variance:
Source of             Sum of
Variation             Squares         d.f.              E(MS)
Whole plot groups     SS_WPG          s − 1             σ² + tσe² + Q(α, αβ)
Whole plot error      SS_WPE          n − s             σ² + tσe²
Split plot groups     SS_SPG          t − 1             σ² + Q(β, αβ)
Interaction           SS_WPG×SPG      (s − 1)(t − 1)    σ² + Q(αβ)
Split plot error      SS_SPE          (n − s)(t − 1)    σ²
Total                 SS_T            nt − 1
The sums of squares in the ANOVA table are simply the sums, over all
observations, of the terms in decomposition (*). That is,
$$SS_{WPG} = \sum_{h=1}^s\sum_{i=1}^{n_h}\sum_{j=1}^t (\bar y_{h\cdot\cdot} - \bar y_{\cdot\cdot\cdot})^2, \qquad SS_{WPE} = \sum_{h=1}^s\sum_{i=1}^{n_h}\sum_{j=1}^t (\bar y_{hi\cdot} - \bar y_{h\cdot\cdot})^2,$$
$$SS_{SPG} = \sum_{h=1}^s\sum_{i=1}^{n_h}\sum_{j=1}^t (\bar y_{\cdot\cdot j} - \bar y_{\cdot\cdot\cdot})^2, \qquad SS_{WPG\times SPG} = \sum_{h=1}^s\sum_{i=1}^{n_h}\sum_{j=1}^t (\bar y_{h\cdot j} - \bar y_{h\cdot\cdot} - \bar y_{\cdot\cdot j} + \bar y_{\cdot\cdot\cdot})^2,$$
$$SS_{SPE} = \sum_{h=1}^s\sum_{i=1}^{n_h}\sum_{j=1}^t (y_{hij} - \bar y_{h\cdot j} - \bar y_{hi\cdot} + \bar y_{h\cdot\cdot})^2.$$
In addition, the quantities Q(α, αβ), Q(β, αβ), and Q(αβ) are quadratic
forms representing differences across whole plot groups, split plot groups,
and the whole plot group × split plot group interaction, respectively.
14
That is, Q(α, αβ) is a sum of squares in the αh ’s and (αβ)hj ’s that equals
zero under the null hypothesis of no differences across whole plot groups
(hypothesis (1) below). Similarly, Q(β, αβ) and Q(αβ) are sums of squares
that are zero under no differences across split plot groups (hypothesis (2)),
and under no interaction (hypothesis (3)), respectively.
• The Q(·) notation is often used because it is more convenient than
writing the term out exactly. The exact form of these terms is not
important; it only matters that these terms are zero under the null
hypothesis of no effect, and positive under the alternative.
F tests appropriate for the hypotheses of interest can be determined by
examination of the expected mean squares.
Let µhj = E(yhij ) = µ + αh + βj + (αβ)hj and let
$$\bar\mu_{h\cdot} = t^{-1}\sum_{j=1}^t \mu_{hj} = \mu + \alpha_h + \bar\beta_{\cdot} + \overline{(\alpha\beta)}_{h\cdot}$$
be the marginal mean for whole plot group h and let
$$\bar\mu_{\cdot j} = s^{-1}\sum_{h=1}^s \mu_{hj} = \mu + \bar\alpha_{\cdot} + \beta_j + \overline{(\alpha\beta)}_{\cdot j}$$
be the marginal mean for the j th split plot group. Then the hypotheses of
interest and their corresponding test statistics are as follows:
1. H1 : µ̄1· = · · · = µ̄s· (no main effect of the whole plot factor), which
is tested with
$$F = \frac{MS_{WPG}}{MS_{WPE}} \sim F(s-1,\; n-s);$$
15
2. H2 : µ̄·1 = · · · = µ̄·t (no main effect of the split plot factor), which
is tested with
$$F = \frac{MS_{SPG}}{MS_{SPE}} \sim F(t-1,\; (n-s)(t-1));$$
3. and H3 : (αβ)hj = 0 for all h, j (no interaction between whole plot
and split plot factors), which is tested with
$$F = \frac{MS_{WPG\times SPG}}{MS_{SPE}} \sim F((s-1)(t-1),\; (n-s)(t-1)).$$
Note that side conditions are often placed on the split plot model to avoid
the complications introduced by having an overparameterized model and
non-full-rank design matrix. The usual sum-to-zero side conditions are
$$\sum_h \alpha_h = \sum_j \beta_j = \sum_h (\alpha\beta)_{hj} = \sum_j (\alpha\beta)_{hj} = 0.$$
Such conditions are not strictly necessary to derive the F tests given above,
but they do simplify things somewhat.
Without the side conditions, H1 can be expressed in terms of the fixed
effects in the model as
$$H_1: \alpha_1 + \overline{(\alpha\beta)}_{1\cdot} = \cdots = \alpha_s + \overline{(\alpha\beta)}_{s\cdot},$$
which reduces to
H1 : α1 = · · · = αs = 0
under the sum-to-zero constraints. Note that under these constraints the
Q(α, αβ) term in E(MS_WPG) reduces to $Q(\alpha) = \sum_{h=1}^s \alpha_h^2/(s-1)$, which,
of course, is 0 under H1 and > 0 otherwise.
• Similar comments apply to H2 , which under the sum-to-zero constraints is equivalent to H2 : β1 = · · · = βt = 0, and Q(β, αβ) reduces to
$Q(\beta) = \sum_{j=1}^t \beta_j^2/(t-1)$.
16
Example — Chocolate Cake:
• See the handout labelled choccake.sas.
• In choccake.sas we use PROC MIXED to perform the analysis just
described. A call to PROC GLM is also included which reproduces
the basic PROC MIXED results (e.g., the ANOVA table and expected mean squares). However, PROC GLM is not designed for
mixed models, and cannot, in general, be trusted to produce correct
results for split plot models and other LMMs.
• Note that method=type3 requests the classical ANOVA-type analysis. This is not the default, which is a REML analysis.
• The basic results are that there are not significant interactions between recipe and baking temperature, there are not significant main
effects of recipe, but there are significant main effects of temperature.
– From the contrasts and profile plot, we see that mean breaking
angle increases linearly with baking temperature.
• Note that the expected mean squares are printed on the bottom of
p.1 and the top of p.2. These results agree with the ANOVA table
given previously in these notes.
• Method of moments estimators of the variance components σ 2 and σe2
are easily derived from the expressions for E(MS_SPE) and E(MS_WPE).
Equating MS_SPE with its expectation σ 2 yields the estimator
$$\hat\sigma^2 = MS_{SPE} = 20.4709.$$
Similarly, equating MS_WPE with its expectation σ 2 + tσe2 yields
$$\hat\sigma_e^2 = \frac{MS_{WPE} - MS_{SPE}}{t} = 41.8370.$$
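To see where this last number comes from, note that the same output reports MS_WPE = 271.493 (this value reappears in the Satterthwaite example later in these notes), so the arithmetic is
$$\hat\sigma_e^2 = \frac{271.493 - 20.4709}{6} = 41.837.$$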
17
Estimation and Inference on Means in the Split-plot Model:
According to the model on p.12,
$$\mathrm{var}(y_{hij}) = \mathrm{var}(\mu_{hj} + e_{i(h)} + \varepsilon_{hij}) = \mathrm{var}(e_{i(h)} + \varepsilon_{hij}) = \mathrm{var}(e_{i(h)}) + \mathrm{var}(\varepsilon_{hij}) + 2\underbrace{\mathrm{cov}(e_{i(h)}, \varepsilon_{hij})}_{=0} = \sigma_e^2 + \sigma^2.$$
• Because the variance of yhij is the sum of two components, σe2 and
σ 2 , these quantities are often called variance components.
In addition,
$$\begin{aligned}
\mathrm{cov}(y_{hij}, y_{hik}) &= \mathrm{cov}(\mu_{hj} + e_{i(h)} + \varepsilon_{hij},\; \mu_{hk} + e_{i(h)} + \varepsilon_{hik}) = \mathrm{cov}(e_{i(h)} + \varepsilon_{hij},\; e_{i(h)} + \varepsilon_{hik})\\
&= \mathrm{cov}(e_{i(h)}, e_{i(h)}) + \underbrace{\mathrm{cov}(e_{i(h)}, \varepsilon_{hik})}_{=0} + \underbrace{\mathrm{cov}(\varepsilon_{hij}, e_{i(h)})}_{=0} + \underbrace{\mathrm{cov}(\varepsilon_{hij}, \varepsilon_{hik})}_{=0}\\
&= \mathrm{var}(e_{i(h)}) = \sigma_e^2
\end{aligned}$$
for j ̸= k, and
$$\mathrm{cov}(y_{hij}, y_{h'i'j'}) = 0,$$
for h ̸= h′ or i ̸= i′ (i.e., the covariance is 0 between subjects).
From these results we see that the correlation is zero between observations
on different subjects, but
$$\mathrm{corr}(y_{hij}, y_{hik}) = \frac{\sigma_e^2}{\sigma^2 + \sigma_e^2} \equiv \rho, \qquad j \neq k.$$
• The within-subject correlation ρ here is called the intra-class correlation.
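For instance, plugging the chocolate cake variance component estimates from p.17 into this formula gives an estimated intra-class correlation of
$$\hat\rho = \frac{\hat\sigma_e^2}{\hat\sigma^2 + \hat\sigma_e^2} = \frac{41.837}{20.471 + 41.837} \approx 0.67,$$
so two cakes from the same batch are estimated to be fairly strongly correlated.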
18
The within-subject covariance structure we have just described can be
represented succinctly as
$$\mathrm{var}(\mathbf{y}_{hi}) = \begin{pmatrix}
\sigma^2 + \sigma_e^2 & \sigma_e^2 & \cdots & \sigma_e^2\\
\sigma_e^2 & \sigma^2 + \sigma_e^2 & \cdots & \sigma_e^2\\
\vdots & \vdots & \ddots & \vdots\\
\sigma_e^2 & \sigma_e^2 & \cdots & \sigma^2 + \sigma_e^2
\end{pmatrix} = (\sigma^2 + \sigma_e^2)[(1-\rho)\mathbf{I}_t + \rho \mathbf{J}_{tt}],$$
where It is the t × t identity matrix and Jtt is a t × t matrix of ones.
• This variance-covariance structure is called compound symmetry.
Recall that for a linear model of the form
y = Xβ + ε,
E(ε) = 0, var(ε) = σ 2 V
where V is a known positive definite matrix, the best linear unbiased
estimator (BLUE) of a vector of estimable functions Cβ is given by Cβ̂GLS
where β̂GLS is a generalized least squares (GLS) estimator of the form
β̂GLS = (XT V−1 X)− XT V−1 y.
• Under compound symmetry it is not hard to show that the GLS
estimator is equivalent to the ordinary least squares (OLS) estimator
Cβ̂OLS where
β̂OLS = (XT X)− XT y
(see Rencher, Example 7.8.1; or Graybill, Corollary 6.8.1.2).
19
For balanced split-plot experiments, it is easy to show that the marginal
means µ̄h· , µ̄·j and the joint means µhj are all estimable, and have BLUEs
$$\hat{\bar\mu}_{h\cdot} = \bar y_{h\cdot\cdot}, \qquad \hat{\bar\mu}_{\cdot j} = \bar y_{\cdot\cdot j}, \qquad \hat\mu_{hj} = \bar y_{h\cdot j}.$$
Standard errors for these quantities are defined as the estimated standard
deviations. Therefore, we need the variances of these estimators.
$$\mathrm{var}(\bar y_{h\cdot\cdot}) = \mathrm{var}\Big(\frac{1}{n_h t}\sum_i \mathbf{j}_t^T \mathbf{y}_{hi}\Big) = \frac{1}{(n_h t)^2}\sum_i \mathbf{j}_t^T \mathrm{var}(\mathbf{y}_{hi})\mathbf{j}_t = \frac{1}{(n_h t)^2}\sum_i [t(\sigma^2+\sigma_e^2) + (t^2 - t)\sigma_e^2]$$
$$= \frac{1}{(n_h t)^2}\sum_i [t\sigma^2 + t^2\sigma_e^2] = \frac{n_h t}{(n_h t)^2}(\sigma^2 + t\sigma_e^2) = \frac{1}{n_h t}(\sigma^2 + t\sigma_e^2).$$
In addition,
$$\mathrm{var}(\bar y_{\cdot\cdot j}) = \mathrm{var}\Big(\frac{1}{s n_h}\sum_h\sum_i y_{hij}\Big) = \frac{1}{(s n_h)^2}\sum_h\sum_i \mathrm{var}(y_{hij}) = \frac{s n_h}{(s n_h)^2}(\sigma^2 + \sigma_e^2) = \frac{\sigma^2 + \sigma_e^2}{s n_h},$$
and, similarly,
$$\mathrm{var}(\bar y_{h\cdot j}) = \mathrm{var}\Big(\frac{1}{n_h}\sum_i y_{hij}\Big) = \frac{n_h}{n_h^2}\,\mathrm{var}(y_{hij}) = \frac{1}{n_h}(\sigma^2 + \sigma_e^2).$$
In the case of ȳh·· , its variance is easy to estimate because E(M SW P E ) =
σ 2 + tσe2 . So,
$$s.e.(\bar y_{h\cdot\cdot}) = \sqrt{\widehat{\mathrm{var}}(\bar y_{h\cdot\cdot})} = \sqrt{\frac{MS_{WPE}}{n_h t}}.$$
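For example, for the chocolate cake data (nh = 15, t = 6, MS_WPE = 271.493) this gives
$$s.e.(\bar y_{h\cdot\cdot}) = \sqrt{\frac{271.493}{15(6)}} \approx 1.74.$$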
However, var(ȳ··j ) and var(ȳh·j ) both involve σ 2 + σe2 , which is not the
expected value of any mean square in the ANOVA.
We do have estimates of σ 2 and σe2 individually, though; namely, M SSP E ,
and (M SW P E − M SSP E )/t (bottom of p.17). So,
$$s.e.(\bar y_{\cdot\cdot j}) = \sqrt{\frac{\widehat{\sigma^2 + \sigma_e^2}}{s n_h}}, \qquad \text{and} \qquad s.e.(\bar y_{h\cdot j}) = \sqrt{\frac{\widehat{\sigma^2 + \sigma_e^2}}{n_h}},$$
where
$$\widehat{\sigma^2 + \sigma_e^2} = MS_{SPE} + (MS_{WPE} - MS_{SPE})/t = \frac{t-1}{t}MS_{SPE} + \frac{1}{t}MS_{WPE}.$$
21
CIs and contrasts for marginal means for whole plot factor:
Confidence intervals and hypothesis tests on µ̄h·· are based on the pivotal
quantity
$$t = \frac{\bar y_{h\cdot\cdot} - \bar\mu_{h\cdot\cdot}}{s.e.(\bar y_{h\cdot\cdot})} = \frac{\bar y_{h\cdot\cdot} - \bar\mu_{h\cdot\cdot}}{\sqrt{MS_{WPE}/(n_h t)}} \sim t(d.f._{WPE}) = t(n-s).$$
This leads to a 100(1 − α)% CI for µ̄h·· given by
$$\bar y_{h\cdot\cdot} \pm t_{1-\alpha/2}(d.f._{WPE})\, s.e.(\bar y_{h\cdot\cdot}) = \bar y_{h\cdot\cdot} \pm t_{1-\alpha/2}(n-s)\sqrt{\frac{MS_{WPE}}{n_h t}}.$$
For a contrast $\psi = \sum_h c_h \bar\mu_{h\cdot\cdot}$ with sample estimator $C = \sum_h c_h \bar y_{h\cdot\cdot}$, we
test H0 : ψ = 0 via t or F tests.
The appropriate t test statistic is
$$t = \frac{|C|}{s.e.(C)},$$
which we compare to $t_{1-\alpha/2}(d.f._{WPE})$ for an α-level test. Equivalently, we can use an F test: F = t2 , which we compare to $F_{1-\alpha}(1, d.f._{WPE})$.
The standard error for a contrast in the whole plot groups is given by
$$s.e.(C) = \sqrt{\widehat{\mathrm{var}}\Big(\sum_h c_h \bar y_{h\cdot\cdot}\Big)} = \sqrt{\sum_h c_h^2\, \widehat{\mathrm{var}}(\bar y_{h\cdot\cdot})} = \sqrt{\frac{MS_{WPE}}{n_h t}\sum_h c_h^2}.$$
A 100(1 − α)% CI for ψ is given by
$$C \pm t_{1-\alpha/2}(d.f._{WPE})\, s.e.(C).$$
CIs and contrasts for marginal means for split plot factor:
Confidence intervals and hypothesis tests on µ̄··j are based on the pivotal
quantity
$$t = \frac{\bar y_{\cdot\cdot j} - \bar\mu_{\cdot\cdot j}}{s.e.(\bar y_{\cdot\cdot j})} = \frac{\bar y_{\cdot\cdot j} - \bar\mu_{\cdot\cdot j}}{\sqrt{[(t-1)MS_{SPE} + MS_{WPE}]/(s n_h t)}}.$$
• However, now this quantity is not distributed exactly as a student’s
t!
Why?
Because the denominator doesn’t involve a single mean square (χ2 divided
by its d.f.), but instead involves a linear combination of mean squares.
What’s the distribution of a linear combination of independent mean
squares in normally distributed random variables?
An approximate answer is given by Satterthwaite’s formula. Satterthwaite showed that a linear combination of independent mean squares of the
form M S = a1 M S1 + · · · + ak M Sk is approximately χ2 with approximate
degrees of freedom given by
$$d.f. = \frac{(MS)^2}{\dfrac{(a_1 MS_1)^2}{d.f._1} + \cdots + \dfrac{(a_k MS_k)^2}{d.f._k}},$$
where here d.f.i is the d.f. associated with M Si and the ai ’s are constants.
In our case, M S is (t − 1)M SSP E + M SW P E so we have
$$\nu = \frac{[(t-1)MS_{SPE} + MS_{WPE}]^2}{\dfrac{[(t-1)MS_{SPE}]^2}{d.f._{SPE}} + \dfrac{(MS_{WPE})^2}{d.f._{WPE}}}.$$
23
Thus the pivotal quantity has an approximate t distribution:
$$t = \frac{\bar y_{\cdot\cdot j} - \bar\mu_{\cdot\cdot j}}{s.e.(\bar y_{\cdot\cdot j})} = \frac{\bar y_{\cdot\cdot j} - \bar\mu_{\cdot\cdot j}}{\sqrt{[(t-1)MS_{SPE} + MS_{WPE}]/(s n_h t)}} \;\overset{.}{\sim}\; t(\nu).$$
This leads to an approximate 100(1 − α)% CI for µ̄··j given by
ȳ··j ± t1−α/2 (ν)s.e.(ȳ··j ).
For a contrast $\psi = \sum_j c_j \bar\mu_{\cdot\cdot j}$, we have the sample estimator $C = \sum_j c_j \bar y_{\cdot\cdot j}$.
Despite the fact that s.e.(ȳ··j ) involves two mean squares, it turns out that
s.e.(C) involves only one, so no Satterthwaite approximation is necessary
for a contrast in the µ̄··j ’s. To see this, note
$$C = \sum_j c_j \bar y_{\cdot\cdot j} = \sum_j c_j \frac{1}{s n_h}\sum_h\sum_i y_{hij} = \sum_j c_j \frac{1}{s n_h}\sum_h\sum_i (\mu_{hj} + e_{i(h)} + \varepsilon_{hij})$$
$$= \sum_j c_j(\bar\mu_{\cdot j} + \bar e_{\cdot(\cdot)} + \bar\varepsilon_{\cdot\cdot j}) = \sum_j c_j \bar\mu_{\cdot j} + \bar e_{\cdot(\cdot)}\underbrace{\sum_j c_j}_{=0} + \sum_j c_j \bar\varepsilon_{\cdot\cdot j}.$$
So,
$$\mathrm{var}(C) = \mathrm{var}\Big(\sum_j c_j \bar\varepsilon_{\cdot\cdot j}\Big) = \sum_j c_j^2\, \mathrm{var}(\bar\varepsilon_{\cdot\cdot j}) = \sum_j c_j^2\,\frac{\sigma^2}{s n_h} = \frac{\sigma^2}{s n_h}\sum_j c_j^2.$$
Therefore, for this kind of contrast,
$$s.e.(C) = \sqrt{\frac{MS_{SPE}}{s n_h}\sum_j c_j^2}.$$
Based on this result, we can test H0 : ψ = 0 with an exact t or F test.
Our test statistic is
$$t = \frac{|C|}{s.e.(C)},$$
which we compare to $t_{1-\alpha/2}(d.f._{SPE})$ for an α-level test. Equivalently, we can compute F = t2 , which we compare to $F_{1-\alpha}(1, d.f._{SPE})$. A 100(1 − α)% CI for ψ is given by
$$C \pm t_{1-\alpha/2}(d.f._{SPE})\, s.e.(C).$$
24
Joint Means:
As for marginal means of the split plot groups, the variance of the joint
mean estimator ȳh·j involves σ 2 + σe2 , which we must estimate with [(t −
1)M SSP E + M SW P E ]/t, rather than just a single mean square.
Satterthwaite’s formula yields
$$t = \frac{\bar y_{h\cdot j} - \bar\mu_{h\cdot j}}{s.e.(\bar y_{h\cdot j})} = \frac{\bar y_{h\cdot j} - \bar\mu_{h\cdot j}}{\sqrt{[(t-1)MS_{SPE} + MS_{WPE}]/(n_h t)}} \;\overset{.}{\sim}\; t(\nu).$$
This leads to an approximate 100(1 − α)% CI for the joint mean µ̄h·j given
by
ȳh·j ± t1−α/2 (ν)s.e.(ȳh·j ).
Contrasts:
For contrasts in the joint means of the form $\psi = \sum_h\sum_j c_{hj}\bar\mu_{h\cdot j}$, we estimate the contrast with $C = \sum_h\sum_j c_{hj}\bar y_{h\cdot j}$, and we form test statistics in
the usual way. That is, t and F statistics for H0 : ψ = 0 are given by
$$t = \frac{|C|}{s.e.(C)}, \qquad \text{and} \qquad F = t^2,$$
respectively.
However, the formula for s.e.(C) and the distribution of the test statistics
depend on the nature of the contrast in the joint means. For certain
contrasts in the joint means, s.e.(C) involves only one mean square; for
others, s.e.(C) involves two mean squares.
The former case is easy and yields an exact distribution, the latter case is
harder because we need Satterthwaite’s formula again.
25
Case 1 (the easy case): Contrasts across split plot groups but within
a single whole plot group:
In this case, $C = \sum_h\sum_j c_{hj}\bar y_{h\cdot j}$ has standard error
$$s.e.(C) = \sqrt{\frac{MS_{SPE}}{n_h}\sum_h\sum_j c_{hj}^2}.$$
Exact tests and confidence intervals for these contrasts are then
based on the fact that the t and F test statistics are distributed
as
t ∼ t(d.f.SP E ),
F = t2 ∼ F (1, d.f.SP E ).
Case 2 (the harder case): Contrasts involving more than one level of
the whole plot factor:
In this case, $C = \sum_h\sum_j c_{hj}\bar y_{h\cdot j}$ has standard error
$$s.e.(C) = \sqrt{\frac{(t-1)MS_{SPE} + MS_{WPE}}{n_h t}\sum_h\sum_j c_{hj}^2}.$$
The t and F statistics for contrasts of this type do not have exact
t and F distributions. However, approximate tests and confidence
intervals for these contrasts can be based on the approximations
$$t \;\overset{.}{\sim}\; t(\nu), \qquad F \;\overset{.}{\sim}\; F(1, \nu),$$
where, again, ν is given by Satterthwaite’s formula (bottom of p.
23).
26
Example — Chocolate Cake (Continued)
• The DDFM=SATTERTH option on the MODEL statement tells
SAS to use the Satterthwaite formula to obtain the correct degrees
of freedom for F and t tests. These are denominator d.f. (hence
DDFM) for F tests and d.f. for t tests.
• By default, PROC MIXED will use the “containment” method for
computing these d.f. if a RANDOM statement is included in the
call to PROC MIXED. This is a method which attempts to guess
the right d.f. from the syntax of the MODEL statement (see the
SAS documentation for details). However, the containment method
can be wrong. Its advantage is that it takes fewer computational
resources than the Satterthwaite method.
• As an example of the Satterthwaite method, consider estimation
of µ11 − µ21 , the difference between the mean response for cakes
made with recipe 1, temperature 1 versus cakes made with recipe 2,
temperature 1.
• This is a “case 2” scenario, so we must use Satterthwaite’s formula.
From the formula on the bottom of p.23,
$$\nu = \frac{[(t-1)MS_{SPE} + MS_{WPE}]^2}{\dfrac{[(t-1)MS_{SPE}]^2}{d.f._{SPE}} + \dfrac{(MS_{WPE})^2}{d.f._{WPE}}} = \frac{[(6-1)20.471 + 271.493]^2}{\dfrac{[(6-1)20.471]^2}{210} + \dfrac{(271.493)^2}{42}} = 77.4.$$
The estimate is ȳ1·1 − ȳ2·1 = 29.1333 − 26.8667 = 2.2667 with a
standard error of
$$\sqrt{\frac{(t-1)MS_{SPE} + MS_{WPE}}{n_h t}\sum_h\sum_j c_{hj}^2} = \sqrt{\frac{(6-1)20.471 + 271.493}{15(6)}\,[1^2 + (-1)^2]} = 2.8823.$$
27
• So, an approximate 95% CI for µ11 − µ21 is
2.2667 ± t.975 (77.4)(2.8823) = (−3.47, 8.01)
and an approximate .05-level t test of H0 : µ11 = µ21 compares
t = 2.2667/(2.8823) = 0.79
to a t(77.4) distribution. Equivalently, an F test can be used where
we compare F = t2 = 0.792 = 0.62 to an F (1, 77.4) distribution.
Both of these tests give a p-value of .4340, so we fail to reject H0 at
level α = .05.
We have presented the simplest split plot design in which whole plots
occur in a one-way layout. Other more complicated split plot designs are
possible. In practice, the most common split plot design is one in which
the whole plots occur in a randomized complete block design.
• For example, suppose the chocolate cake experiment was conducted
over 15 days, and the batches were blocked by day. That is, on day
one 1 batch from each recipe was prepared and the resulting cakes
baked. On day two 1 more batch from each recipe was prepared, etc.
In such a situation, the appropriate model becomes
yhij = µ + αh + bi + ehi + βj + (αβ)hj + εhij ,
where yhij is the response on j th split-plot in the hth whole plot group in
the ith whole plot block.
Here, ehi and εhij are the whole plot and split plot error terms, respectively, with similar assumptions as in the previous model. In addition,
bi represents a whole plot block effect. If considered random, which is
typically most appropriate, the bi ’s are assumed i.i.d. N (0, σb2 ) and independent of the ehi ’s and the εhij ’s.
• Rather than treat this model in any detail, we move on to the repeated measures ANOVA (RM-ANOVA). This example will be covered by the general theory of LMMs, which we are working toward.
28
RM-ANOVA:
The repeated measures ANOVA (RM-ANOVA) is based upon the similarity between repeated measures designs and split-plot designs.
Consider again the Methemoglobin in Sheep example. This is a one-way
layout with repeated measures over 6 time points. It has much the same
structure as the chocolate cake example. This suggests using the same
analysis.
• However, there is one important difference: In the RM design, the
“split-plot” factor, time, is not randomized!
Instead, each experimental unit is measured under “time 1” first, “time 2”
second, etc. This means that observations are subject to serial correlation
as well as shared-characteristics, or clustering-type, correlation.
Recall that in the split-plot model, yhi , the vector of observations on the
h, ith whole plot, had the compound symmetry variance-covariance structure:
$$\mathrm{var}(\mathbf{y}_{hi}) = (\sigma_e^2 + \sigma^2)\begin{pmatrix}
1 & \rho & \cdots & \rho\\
\rho & 1 & \cdots & \rho\\
\vdots & \vdots & \ddots & \vdots\\
\rho & \rho & \cdots & 1
\end{pmatrix},$$
where ρ = σe2 /(σ 2 + σe2 ).
• Often, this seems an inappropriate variance-covariance structure for
repeated measures. Typically, we would expect observations taken
close together in time to be more highly correlated than observations
taken far apart.
• That is, we often expect a decaying correlation structure through
time, rather than constant correlation through time.
29
Sphericity:
It turns out that compound symmetry is a sufficient but not necessary
condition for the F tests from the split-plot analysis to be valid for the
RM-ANOVA design.
A more general condition, known as sphericity, is necessary and sufficient.
Sphericity can be expressed in several different, but equivalent ways. In
particular, sphericity is equivalent to
1. the variances of all pairwise differences between repeated measures
are equal; that is,
var(yhij − yhik ) is constant for all j, k.
2. ϵ = 1, where
$$\epsilon = \frac{t^2(\bar\sigma_{jj} - \bar\sigma_{\cdot\cdot})^2}{(t-1)\Big(\sum_{j=1}^t\sum_{j'=1}^t \sigma_{jj'}^2 \;-\; 2t\sum_{j=1}^t \bar\sigma_{j\cdot}^2 \;+\; t^2\bar\sigma_{\cdot\cdot}^2\Big)},$$
where
σ̄·· = mean of all elements of Σ ≡ var(yhi ),
σ̄jj = mean of the elements on the main diagonal of Σ,
σ̄j· = mean of the elements in row j of Σ.
• Since compound symmetry means that var(yhij ) and cov(yhij , yhij ′ )
are constant for all i, j, j ′ , it is clear that
var(yhij − yhij ′ ) = var(yhij ) + var(yhij ′ ) − 2cov(yhij , yhij ′ )
is constant for all j, j ′ . Therefore, compound symmetry is a special
case of sphericity.
• Sphericity is a more general (weaker) condition than compound symmetry, and mathematically, it is all that is needed. However, it is
difficult to envision a realistic form for Σ in which sphericity would
hold and compound symmetry would not.
30
Mauchly has proposed a test for sphericity. This test is of limited practical
use for several reasons:
• Low power in small samples.
• In large samples the test is likely to reject sphericity even when non-sphericity
has little effect on the validity of the split-plot F tests.
• Sensitive to departures from normality.
• Very sensitive to outliers.
It can be shown that sphericity holds when ϵ = 1 and maximum nonsphericity holds when ϵ = 1/(t − 1).
Under non-sphericity, it can be shown that the F test statistics for the
repeated measures factor (usually time) and interactions involving the repeated measures factor have approximate F distributions where the degrees of freedom are the usual degrees of freedom multiplied by ϵ.
• Therefore, the “fix” of the split-plot analysis is to multiply the numerator and denominator degrees of freedom by ϵ̂ for all F tests on
time and on interactions involving time.
Two estimators of ϵ are commonly used: Greenhouse-Geisser and Huynh-Feldt. ϵ̂GG is simply ϵ computed on the sample variance-covariance matrix
S rather than Σ, the population (true) variance-covariance matrix. ϵ̂HF
is defined as
$$\hat\epsilon_{HF} = \min\Bigg(1,\; \underbrace{\frac{n_\cdot(t-1)\hat\epsilon_{GG} - 2}{(t-1)\{d.f._{WPE} - (t-1)\hat\epsilon_{GG}\}}}_{\text{value given by SAS}}\Bigg),$$
where n· = the total number of “whole plots”.
• Note that ϵ ≤ 1, so the value given by SAS should always be rounded
down to 1, if it is > 1.
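As a numerical check of this formula, the sheep example analyzed below (sheep1.sas) has n· = 12 whole plots (sheep), t = 6, d.f._WPE = n − s = 9, and ϵ̂GG = .2610, so the quantity inside the min is
$$\frac{12(6-1)(.2610) - 2}{(6-1)\{9 - (6-1)(.2610)\}} = \frac{13.66}{38.475} \approx .355,$$
which agrees with the value ϵ̂HF = .3551 reported there.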
31
Which ϵ̂? For true ϵ ≤ 0.5 (greater non-sphericity) ϵ̂GG is better. For true
ϵ ≥ 0.75 (less non-sphericity) ϵ̂HF is better. In practice we don’t know the
true value of ϵ so often it is hard to say which is better.
• ϵ̂GG tends to give a larger adjustment (makes it harder to reject
H0 ) than ϵ̂HF , so if we desire to be conservative (slow to reject) then
we should use ϵ̂GG .
If we have a program like SAS that can compute ϵ̂GG and ϵ̂HF and corresponding adjusted p-values easily, then we should always go ahead and do
the adjustment for non-sphericity.
• There is no down-side here, because if the data are spherical, then
we should get ϵ̂GG = ϵ̂HF = 1, and the adjustment will end up not
altering the split-plot analysis at all (which is what we would want).
If the data are non-spherical, then an appropriate adjustment will
be done.
• If we want to avoid computing ϵ̂, then a very conservative approach is
to use the adjustment for maximum non-sphericity. That is, multiply
numerator and denominator d.f. of the F tests by (t − 1)−1 .
• Alternatively, use the Greenhouse-Geisser algorithm (see Davis, §5.3.2).
32
Example — Methemoglobin in Sheep:
• See sheep1.sas. In this SAS program PROC MIXED is used first
to perform the split-plot analysis exactly as in the chocolate cake
example, and then the REPEATED statement in PROC GLM is
used to reproduce this split-plot analysis as a RM-ANOVA analysis.
That is, both PROCs fit the model
yhij = µhj + ei(h) + εhij ,
where µhj = µ + αh + βj + (αβ)hj , $\{e_{i(h)}\} \overset{iid}{\sim} N(0, \sigma_e^2)$, $\{\varepsilon_{hij}\} \overset{iid}{\sim} N(0, \sigma^2)$, and the ei(h) ’s and εhij ’s are independent of each other. (Sketches of both calls are given at the end of this example.)
• The two sets of results are basically the same, but PROC GLM gives
Mauchly’s test and Greenhouse-Geisser and Huynh-Feldt adjusted p-values for tests involving time.
• For the basic ANOVA table and F tests, both procedures give the
correct RM-ANOVA results. However, PROC GLM will still not
give the correct inferences on means and contrasts.
• PROC MIXED would give the correct inferences under the assumption of sphericity if we were to add ESTIMATE and CONTRAST
statements to the program in sheep1.sas. In addition, PROC MIXED
can be made to “fix” the split-plot analysis for non-sphericity, but
with a more sophisticated fix than the one we’ve discussed and the
one implemented in PROC GLM with the REPEATED statement. We’ll get to this later.
• The basic results of the analysis are as follows:
– According to Mauchly’s test (the one labelled “Orthogonal
Components” on p.8), there is significant evidence of non-sphericity, so we should use adjusted p-values for tests on time
and no2×time. The estimates of ϵ are ϵ̂GG = .2610 and ϵ̂HF =
.3551, which are both pretty far from 1, indicating non-sphericity.
33
– At level α = .05, we reject the hypothesis of no interaction
with either adjustment (HF or GG, see p.10). The nature of
the interaction can be seen in the profile plot on p.5. From
the profile plot it appears that we should make inferences on
time separately within each of the no2 groups, but it does seem
meaningful to test for a main effect of no2.
– The main effect test of no2 is significant (p = .0026, see p.9)
and needs no adjustment. It is clear that the mean response is
higher (at least after time 1) for increasing levels of no2.
– The main effect of time is significant, but, as noted above, the
pattern over time seems to be different enough from one no2
group to the next that it really would be more appropriate to
compare times separately within each no2 group.
– A natural set of contrasts to examine here would be linear and
nonlinear contrasts over time separately in each no2 group.
E.g., for no2 group 1, assuming that the times were equally
spaced, these contrasts would look like this in SAS:
    contrast 'linear time, no2=1'
      time     -5 -3 -1  1  3  5
      no2*time -5 -3 -1  1  3  5   0 0 0 0 0 0   0 0 0 0 0 0;

    contrast 'nonlinear time, no2=1'
      time      5 -1 -4 -4 -1  5
      no2*time  5 -1 -4 -4 -1  5   0 0 0 0 0 0   0 0 0 0 0 0,
      time     -5  7  4 -4 -7  5
      no2*time -5  7  4 -4 -7  5   0 0 0 0 0 0   0 0 0 0 0 0,
      time      1 -3  2  2 -3  1
      no2*time  1 -3  2  2 -3  1   0 0 0 0 0 0   0 0 0 0 0 0,
      time     -1  5 -10 10 -5  1
      no2*time -1  5 -10 10 -5  1  0 0 0 0 0 0   0 0 0 0 0 0;
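To tie this example together, here are hedged sketches of the two calls described at the start of this example. The data set and variable names (sheep_long with no2, sheep, time, y in the “long” layout, and sheep_wide with columns t1–t6 in the “wide” layout) are assumptions and need not match sheep1.sas exactly.

    /* Split-plot style analysis in PROC MIXED, as in the chocolate cake example */
    proc mixed data=sheep_long method=type3;
      class no2 sheep time;
      model y = no2 time no2*time;
      random sheep(no2);
    run;

    /* RM-ANOVA in PROC GLM; the REPEATED statement gives Mauchly's test
       and the Greenhouse-Geisser and Huynh-Feldt adjusted p-values */
    proc glm data=sheep_wide;
      class no2;
      model t1-t6 = no2 / nouni;
      repeated time 6 / printe;
    run;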
Multivariate Methods for Repeated Measures
The Multivariate Linear Model:
Suppose we have a t component response vector yi = (yi1 , . . . , yit )T on the
ith of n subjects, and suppose that yij is generated from the linear model
yij = xTi βj + εij ,
i = 1, . . . , n, j = 1, . . . , t,
where xi = (xi1 , . . . , xip )T is a vector of p explanatory variables specific to
the ith subject (but constant over the t components of the response), and
βj = (β1j , . . . , βpj )T is a vector of unknown parameters specific to the j th
component of the response.
Let εi = (εi1 , . . . , εit )T denote the vector of error terms for the ith subject.
We assume that the t components of the response are correlated within a
given subject, so we assume
εi ∼ Nt (0, Σ).
• In applications to repeated measures data, the t components of
the response vector correspond to the response measured at t distinct time points. So, Σ describes the variance-covariance structure
through time.
• To ensure that Σ can be estimated by a positive definite matrix, we assume p ≤ n − t.
We also assume independence between subjects, so
$$\boldsymbol\varepsilon \equiv \begin{pmatrix}\boldsymbol\varepsilon_1\\ \vdots\\ \boldsymbol\varepsilon_n\end{pmatrix} \sim N_{nt}(\mathbf{0},\; \mathbf{I}_n \otimes \Sigma).$$
35
• Here, ⊗ denotes the Kronecker (aka direct) product. W ⊗ Z, the
Kronecker product of matrices Wa×b , Zc×d , results in the ac × bd
matrix
$$\mathbf{W}\otimes\mathbf{Z} = \begin{pmatrix}
w_{11}\mathbf{Z} & w_{12}\mathbf{Z} & \cdots & w_{1b}\mathbf{Z}\\
w_{21}\mathbf{Z} & w_{22}\mathbf{Z} & \cdots & w_{2b}\mathbf{Z}\\
\vdots & \vdots & \ddots & \vdots\\
w_{a1}\mathbf{Z} & w_{a2}\mathbf{Z} & \cdots & w_{ab}\mathbf{Z}
\end{pmatrix}.$$
Thus, In ⊗ Σ is the nt × nt block-diagonal matrix
$$\mathbf{I}_n \otimes \Sigma = \begin{pmatrix}
\Sigma & \mathbf{0} & \cdots & \mathbf{0}\\
\mathbf{0} & \Sigma & \cdots & \mathbf{0}\\
\vdots & \vdots & \ddots & \vdots\\
\mathbf{0} & \mathbf{0} & \cdots & \Sigma
\end{pmatrix}.$$
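For example, for small matrices,
$$\begin{pmatrix}1 & 2\\ 3 & 4\end{pmatrix} \otimes \begin{pmatrix}a & b\\ c & d\end{pmatrix} = \begin{pmatrix}a & b & 2a & 2b\\ c & d & 2c & 2d\\ 3a & 3b & 4a & 4b\\ 3c & 3d & 4c & 4d\end{pmatrix};$$
each element of the left-hand matrix simply scales a copy of the right-hand matrix.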
• Thus, according to this model, y1 , . . . , yn are independent random
vectors with yi ∼ Nt (µi , Σ) where µi = (µi1 , . . . , µit )T , µij = xTi βj .
To express this model in matrix terms, let
$$\mathbf{Y} = \begin{pmatrix}\mathbf{y}_1^T\\ \vdots\\ \mathbf{y}_n^T\end{pmatrix} = \begin{pmatrix} y_{11} & \cdots & y_{1t}\\ \vdots & \ddots & \vdots\\ y_{n1} & \cdots & y_{nt}\end{pmatrix}, \qquad
\mathbf{X} = \begin{pmatrix}\mathbf{x}_1^T\\ \vdots\\ \mathbf{x}_n^T\end{pmatrix} = \begin{pmatrix} x_{11} & \cdots & x_{1p}\\ \vdots & \ddots & \vdots\\ x_{n1} & \cdots & x_{np}\end{pmatrix},$$
where X is of rank p ≤ (n − t). Also, let
$$\mathbf{B} = (\boldsymbol\beta_1, \ldots, \boldsymbol\beta_t) = \begin{pmatrix} \beta_{11} & \cdots & \beta_{1t}\\ \vdots & \ddots & \vdots\\ \beta_{p1} & \cdots & \beta_{pt}\end{pmatrix}, \qquad
\mathbf{E} = \begin{pmatrix}\boldsymbol\varepsilon_1^T\\ \vdots\\ \boldsymbol\varepsilon_n^T\end{pmatrix} = \begin{pmatrix} \varepsilon_{11} & \cdots & \varepsilon_{1t}\\ \vdots & \ddots & \vdots\\ \varepsilon_{n1} & \cdots & \varepsilon_{nt}\end{pmatrix}.$$
Then the multivariate linear model takes the form
$$\mathbf{Y} = \mathbf{X}\mathbf{B} + \mathbf{E}, \qquad \mathrm{vec}(\mathbf{E}^T) \sim N_{nt}(\mathbf{0},\; \mathbf{I}_n \otimes \Sigma). \qquad (*)$$
Estimation:
The maximum likelihood and ordinary least squares estimator of B is
B̂ = (XT X)−1 XT Y
• Note that B̂ is equal to (β̂1 , . . . , β̂t ), where β̂j = (XT X)−1 XT uj is
the usual least squares estimator based just on uj , the j th column
of Y.
• B̂ is the BLUE.
That the OLS estimator is the BLUE can be seen by writing the multivariate model as a univariate linear model. Let vec(M) denote the vector formed
by stacking the columns of its matrix argument M. Then the multivariate
linear model (*) can be written in univariate form as
$$\mathrm{vec}(\mathbf{Y}) = (\mathbf{I}_t \otimes \mathbf{X})\,\mathrm{vec}(\mathbf{B}) + \mathrm{vec}(\mathbf{E}).$$
It is easily seen that the error term has moments E{vec(E)} = 0 and
var{vec(E)} = Σ ⊗ In . Therefore, this model has the form of a GLS
model, which implies that the GLS estimator would be BLUE.
However, the same theorem we alluded to in claiming OLS to yield BLUEs
in the split-plot model (see p.19) applies here. That theorem says that in
the univariate linear model
y = Xβ + ε,
E(ε) = 0, var(ε) = σ 2 V,
the OLS estimator of β is BLUE iff C(VX) ⊂ C(X) (see Graybill, Thm
6.8.1, or Christensen, Thm 10.4.5).
This condition is easy to show here because it is a property of Kronecker
products that (A ⊗ B)(C ⊗ D) = AC ⊗ BD for suitably conformable
matrices. Therefore,
$$(\Sigma \otimes \mathbf{I}_n)(\mathbf{I}_t \otimes \mathbf{X}) = \Sigma \otimes \mathbf{X} = (\mathbf{I}_t \otimes \mathbf{X})(\Sigma \otimes \mathbf{I}_p).$$
So,
$$C([\Sigma \otimes \mathbf{I}_n][\mathbf{I}_t \otimes \mathbf{X}]) = C([\mathbf{I}_t \otimes \mathbf{X}][\Sigma \otimes \mathbf{I}_p]) \subset C([\mathbf{I}_t \otimes \mathbf{X}]).$$
37
The MLE of Σ is
$$\hat\Sigma = \frac{1}{n}(\mathbf{Y} - \mathbf{X}\hat{\mathbf{B}})^T(\mathbf{Y} - \mathbf{X}\hat{\mathbf{B}}).$$
However, this estimator is biased. An unbiased estimator of Σ is
$$\mathbf{S} = \frac{1}{n-p}(\mathbf{Y} - \mathbf{X}\hat{\mathbf{B}})^T(\mathbf{Y} - \mathbf{X}\hat{\mathbf{B}}).$$
Estimation of linear functions of B is often of interest. Let ψ = aT Bc
where a and c are p × 1 and t × 1 vectors of constants, respectively.
• Note that a operates as a contrast within time points. c operates as
a contrast across time points.
The BLUE of ψ is
ψ̂ = aT B̂c
and has variance
var(ψ̂) = (cT Σc)[aT (XT X)−1 a].
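As an illustration (anticipating the profile-analysis setup used for the sheep data later, where X is the 12 × 3 matrix of group indicators, so the rows of B are the three group mean profiles): taking a = (1, −1, 0)T and c = (1, 0, 0, 0, 0, 0)T gives ψ = µ11 − µ21 , the group 1 vs. group 2 difference at time 1, with
$$\mathrm{var}(\hat\psi) = (\mathbf{c}^T\Sigma\mathbf{c})[\mathbf{a}^T(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{a}] = \sigma_{11}\Big(\tfrac14 + \tfrac14\Big) = \frac{\sigma_{11}}{2},$$
since X^T X = diag(4, 4, 4) when there are 4 subjects per group.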
Hypothesis Testing:
Most hypotheses of interest in the multivariate linear model can be expressed as
$$H_0: \mathbf{A}\mathbf{B}\mathbf{C} = \mathbf{D}, \qquad (\dagger)$$
where
Aa×p has rank a ≤ p, and operates across subjects (within time points);
Ct×c has rank c ≤ t, and operates across time (within subjects);
Da×c is a matrix of constants; often, D = 0a×c .
38
• This framework is very general. E.g., setting A = I, D = 0 yields
contrasts in time, C = I, D = 0 yields contrasts across subjects,
A = I, C = I, D = 0 yields the hypothesis B = 0, etc.
There are four tests commonly used to test hypotheses of this form, and
all of the test statistics are defined in terms of the hypothesis SSCP
matrix and the error, or residual, SSCP matrix
• Here, SSCP stands for sum of squares and cross-products. A SSCP
is the multivariate analog of a sum of squares in the univariate linear
model. Note that it is a matrix, not a scalar.
Recall that in the univariate linear model,
$$\mathbf{y} = \mathbf{X}\boldsymbol\beta + \boldsymbol\varepsilon, \qquad \boldsymbol\varepsilon \sim N_n(\mathbf{0}, \sigma^2\mathbf{I}_n),$$
an F test statistic for the hypothesis H0 : Aa×p βp×1 = da×1 was given by
$$F = \frac{(\mathbf{A}\hat{\boldsymbol\beta} - \mathbf{d})^T[\mathbf{A}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{A}^T]^{-1}(\mathbf{A}\hat{\boldsymbol\beta} - \mathbf{d})}{[\mathbf{y}^T\mathbf{y} - \hat{\boldsymbol\beta}^T\mathbf{X}^T\mathbf{X}\hat{\boldsymbol\beta}]}\cdot\frac{n-p}{a} = \frac{SSH}{SSE}\cdot\frac{n-p}{a}.$$
In the multivariate context, SSH becomes a matrix, the hypothesis SSCP,
given by
Qh = (AB̂C − D)T [A(XT X)−1 AT ]−1 (AB̂C − D),
and SSE becomes a matrix, the error SSCP, given by
Qe = CT [YT Y − B̂T (XT X)B̂]C.
• Since these quantities are matrices in the multivariate context, we
can no longer compare them as simply as in the univariate F test.
That is, we cannot simply take their ratio. Even computing $\mathbf{Q}_h\mathbf{Q}_e^{-1}$
is not an option, because the result is not a scalar test statistic.
39
How do we compare the “sizes” of Qh and Qe with a scalar quantity?
Several answers have been put forward, leading to several different test
statistics:
• Roy’s test: based on the largest eigenvalue of $\mathbf{Q}_h\mathbf{Q}_e^{-1}$.
• Lawley and Hotelling’s test: test statistic is $\mathrm{tr}(\mathbf{Q}_h\mathbf{Q}_e^{-1})$.
• Pillai’s test: test statistic is a function of $\mathrm{tr}[\mathbf{Q}_h(\mathbf{Q}_h + \mathbf{Q}_e)^{-1}]$.
• Wilks’ likelihood ratio test: Wilks’ test statistic depends on $\Lambda = |\mathbf{Q}_e|/|\mathbf{Q}_h + \mathbf{Q}_e|$ and is equivalent to the LRT.
Unfortunately, none of these tests is “best” in all situations. However,
all of these tests are approximately equivalent in large samples, and differ
very little in power for small samples.
• Because LRTs have good properties in general, we will confine
attention to Wilks’ test.
There is no general result that gives the exact distribution of Wilks’ test
statistic. That is, the exact reference distribution for Wilks’ test is not
known, in general.
However, for some multivariate ANOVA (MANOVA) models that arise in
profile analysis, there is an equivalence between Wilks’ test and an F test,
where the exact reference distribution is an F distribution.
40
In particular, in a one-way MANOVA model for comparing s groups
(treatments) based upon a t-variate response, exact distributions are available for the following cases:

Case t = 1, s ≥ 2:   $\left(\frac{n-s}{s-1}\right)\left(\frac{1-\Lambda}{\Lambda}\right) \sim F(s-1,\; n-s)$
Case t = 2, s ≥ 2:   $\left(\frac{n-s-1}{s-1}\right)\left(\frac{1-\sqrt{\Lambda}}{\sqrt{\Lambda}}\right) \sim F(2(s-1),\; 2(n-s-1))$
Case t ≥ 1, s = 2:   $\left(\frac{n-t-1}{t}\right)\left(\frac{1-\Lambda}{\Lambda}\right) \sim F(t,\; n-t-1)$
Case t ≥ 1, s = 3:   $\left(\frac{n-t-2}{t}\right)\left(\frac{1-\sqrt{\Lambda}}{\sqrt{\Lambda}}\right) \sim F(2t,\; 2(n-t-2))$

where $n = \sum_{h=1}^s n_h$ is the total number of subjects.
• In other cases we rely on approximations to obtain p−values for
Wilks’ Lambda.
• A large-sample approximation due to Bartlett gives the following
rejection rule for an approximate α-level test: Reject H0 if
$$-\left(n - 1 - \frac{t+s}{2}\right)\log(\Lambda) > \chi^2_\alpha\big(t(s-1)\big).$$
• Other approximations are available to obtain approximate p-values
when the total sample size n is small. These approximations are
implemented in SAS and other computer programs and are quite
good even for small sample sizes.
41
Profile Analysis:
We maintain the same notation and set-up: suppose we have repeated measures at t time points on subjects from s groups. Let nh = the number of subjects in group h, and let $n = \sum_{h=1}^s n_h$. Let yhij denote the observation at time j
for subject i in group h.
We assume the response vectors yhi = (yhi1 , . . . , yhit )T are independent,
with
$$\mathbf{y}_{hi} \sim N(\boldsymbol\mu_h, \Sigma), \qquad \text{where } \boldsymbol\mu_h = (\mu_{h1}, \ldots, \mu_{ht})^T$$
and µhj = E(yhij ). The profile analysis model is
$$y_{hij} = \mu_{hj} + \varepsilon_{hij}, \qquad \text{where } \boldsymbol\varepsilon_{hi} = (\varepsilon_{hi1}, \ldots, \varepsilon_{hit})^T \sim N(\mathbf{0}, \Sigma).$$
In terms of the multivariate general linear model, Y = XB + E, with
$$\mathbf{Y} = \begin{pmatrix}
\mathbf{y}_{11}^T\\ \vdots\\ \mathbf{y}_{1n_1}^T\\ \mathbf{y}_{21}^T\\ \vdots\\ \mathbf{y}_{2n_2}^T\\ \vdots\\ \mathbf{y}_{s1}^T\\ \vdots\\ \mathbf{y}_{sn_s}^T
\end{pmatrix}, \qquad
\mathbf{X} = \begin{pmatrix}
1 & 0 & \cdots & 0\\ \vdots & \vdots & & \vdots\\ 1 & 0 & \cdots & 0\\
0 & 1 & \cdots & 0\\ \vdots & \vdots & & \vdots\\ 0 & 1 & \cdots & 0\\
\vdots & \vdots & & \vdots\\
0 & 0 & \cdots & 1\\ \vdots & \vdots & & \vdots\\ 0 & 0 & \cdots & 1
\end{pmatrix}, \qquad
\mathbf{B} = \begin{pmatrix}
\mu_{11} & \cdots & \mu_{1t}\\
\mu_{21} & \cdots & \mu_{2t}\\
\vdots & \ddots & \vdots\\
\mu_{s1} & \cdots & \mu_{st}
\end{pmatrix}, \qquad
\mathbf{E} = \begin{pmatrix}
\boldsymbol\varepsilon_{11}^T\\ \vdots\\ \boldsymbol\varepsilon_{1n_1}^T\\ \vdots\\ \boldsymbol\varepsilon_{s1}^T\\ \vdots\\ \boldsymbol\varepsilon_{sn_s}^T
\end{pmatrix}.$$
That is, X is the n × s matrix of group-membership indicators (the hth column indicates membership in group h), and the hth row of B is the mean profile for group h.
Three general hypotheses are of interest in profile analysis:
H01 : the mean profiles (over time) for the s groups are parallel (i.e., no
group×time interaction);
H02 : no differences among groups;
H03 : no differences among time points.
• Note that H01 should be tested first, because the result of this test
affects what form the other two hypotheses should take (H02 and
H03 have been expressed in a purposely vague way here).
If H01 is accepted, then, under the assumption that no interaction is
present, it is appropriate to test for no difference between groups by comparing the mean response in each group averaged over all time points, and
it is appropriate to test no difference across time points by comparing the
mean response at each time, averaged over groups.
If, however, we reject H01 then it may be more appropriate to test hypotheses of the form
H04 : no difference among groups within some subset of the measurement
occasions;
H05 : no difference among time points in a particular group, or subset of
groups;
H06 : no difference within some subset of measurement occasions in a particular group or subset of groups.
43
Test of parallelism:
The hypothesis of parallelism is
$$H_{01}: \begin{pmatrix}\mu_{11} - \mu_{12}\\ \mu_{12} - \mu_{13}\\ \vdots\\ \mu_{1,t-1} - \mu_{1t}\end{pmatrix} = \begin{pmatrix}\mu_{21} - \mu_{22}\\ \mu_{22} - \mu_{23}\\ \vdots\\ \mu_{2,t-1} - \mu_{2t}\end{pmatrix} = \cdots = \begin{pmatrix}\mu_{s1} - \mu_{s2}\\ \mu_{s2} - \mu_{s3}\\ \vdots\\ \mu_{s,t-1} - \mu_{st}\end{pmatrix}.$$
• Testing this hypothesis is equivalent to conducting a one-way multivariate analysis of variance (MANOVA) on the t − 1 differences
between adjacent time points from each subject.
In terms of the general form of the hypothesis, H01 can be expressed as
ABC = D where
$$\mathbf{A}_{(s-1)\times s} = (\mathbf{I}_{s-1},\; -\mathbf{j}_{s-1}), \qquad
\mathbf{C}_{t\times(t-1)} = \begin{pmatrix}
1 & 0 & \cdots & 0\\
-1 & 1 & \cdots & 0\\
0 & -1 & \cdots & 0\\
\vdots & \vdots & \ddots & \vdots\\
0 & 0 & \cdots & 1\\
0 & 0 & \cdots & -1
\end{pmatrix}, \qquad
\mathbf{D}_{(s-1)\times(t-1)} = \mathbf{0}_{(s-1)\times(t-1)}.$$
Test of no difference among groups:
Depending on the result of the test of H01 , two tests of no difference among
groups are possible.
First, if H01 is accepted, then we would test for differences across groups
averaging over (or equivalently, summing over) time points. In this case
H02 takes the form H02a : ABC = D where
$$\mathbf{A}_{(s-1)\times s} = (\mathbf{I}_{s-1},\; -\mathbf{j}_{s-1}), \qquad \mathbf{C}_{t\times 1} = \mathbf{j}_t, \qquad \mathbf{D}_{(s-1)\times 1} = \mathbf{0}_{(s-1)\times 1}.$$
• This test is equivalent to doing a one-way ANOVA on the totals (or
means) across time, for each subject.
Second, if H01 is rejected, we would not want to assume parallelism in
testing across groups. In this case the null hypothesis is
$$H_{02b}: \begin{pmatrix}\mu_{11}\\ \mu_{12}\\ \vdots\\ \mu_{1t}\end{pmatrix} = \begin{pmatrix}\mu_{21}\\ \mu_{22}\\ \vdots\\ \mu_{2t}\end{pmatrix} = \cdots = \begin{pmatrix}\mu_{s1}\\ \mu_{s2}\\ \vdots\\ \mu_{st}\end{pmatrix},$$
or H02b : ABC = D, where
$$\mathbf{A}_{(s-1)\times s} = (\mathbf{I}_{s-1},\; -\mathbf{j}_{s-1}), \qquad \mathbf{C}_{t\times t} = \mathbf{I}_t, \qquad \mathbf{D}_{(s-1)\times t} = \mathbf{0}_{(s-1)\times t}.$$
• This is the one-way MANOVA test on the vector of means at each
time point.
45
Test of no difference among time points:
Similar to testing no difference among groups, the appropriate test here
depends upon the result of testing H01 . If H01 is accepted, we will typically
want to test no difference across time points, averaging (or equivalently,
summing) across groups. This hypothesis is H03a : ABC = D where
$$\mathbf{A}_{1\times s} = \mathbf{j}_s^T \text{ or } \tfrac{1}{s}\mathbf{j}_s^T, \qquad
\mathbf{C}_{t\times(t-1)} = \begin{pmatrix}\mathbf{I}_{t-1}\\ -\mathbf{j}_{t-1}^T\end{pmatrix}, \qquad
\mathbf{D}_{1\times(t-1)} = \mathbf{0}_{1\times(t-1)}.$$
If H01 is rejected, then we can compare time points without assuming
parallelism. That is, we can test the hypothesis

H03b
 



µ11
µ12
µ1t
 µ21   µ22 
 µ2t 





:  ..  =  ..  = · · · = 
 ..  .
.
.
.
µs1
µs2
µst
This hypothesis can also be written in the form ABC = D where
$$\mathbf{A}_{s\times s} = \mathbf{I}_s, \qquad
\mathbf{C}_{t\times(t-1)} = \begin{pmatrix}\mathbf{I}_{t-1}\\ -\mathbf{j}_{t-1}^T\end{pmatrix}, \qquad
\mathbf{D}_{s\times(t-1)} = \mathbf{0}_{s\times(t-1)}.$$
46
Example — Methemoglobin in Sheep (again):
Recall that there are t = 6 measurements through time on each of
n1 = n2 = n3 = 4 sheep in NO2 groups 1, 2 and 3 (s = 3). The
profile analysis model for these data is

T

y11
1
 .. 
 .   ...
 T  
 y14   1
 T  
 y21   0
 

 ..   ..
 . =.
 

 yT   0
 24  
 yT   0
 31   .
 .  .
.
 .. 
0
yT

34
0
..
.
0
1
..
.
1
0
..
.
0

εT11
 .. 
 . 
 T 
 ε14 

 
 εT21 
µ16


 .. 

µ26 +  .  ,


µ36
 εT 
 24 
 εT 
 31 
 . 
 .. 


0
.. 
.
0

0
 µ
..   11
µ21
.

µ31
0

1
.. 

.
1
µ12
µ22
µ32
···
···
···
εT34
or Y = XB + E where Y and E are 12 × 6 matrices, X is 12 × 3,
and B is 3 × 6.
The hypothesis of parallelism is H01 : ABC = 0 where
$$\mathbf{A} = \begin{pmatrix}1 & 0 & -1\\ 0 & 1 & -1\end{pmatrix}, \qquad
\mathbf{C} = \begin{pmatrix}
1 & 0 & 0 & 0 & 0\\
-1 & 1 & 0 & 0 & 0\\
0 & -1 & 1 & 0 & 0\\
0 & 0 & -1 & 1 & 0\\
0 & 0 & 0 & -1 & 1\\
0 & 0 & 0 & 0 & -1
\end{pmatrix}.$$
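A hedged sketch of how this parallelism test might be set up in PROC GLM follows (anticipating the sheep2.sas handout described below; the data set name sheep_wide and time-point variables t1–t6 are assumptions). Here the successive differences making up the columns of C are supplied through the M= specification of the MANOVA statement, and testing the no2 effect on those differences is equivalent to testing H01 ; the handout instead specifies the rows of A explicitly with a CONTRAST statement.

    proc glm data=sheep_wide;
      class no2;
      model t1-t6 = no2 / nouni;
      /* M-transformation: successive differences across the six time points (the columns of C) */
      manova h=no2 m=t1-t2, t2-t3, t3-t4, t4-t5, t5-t6 / summary;
    run;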
• See handout sheep2.sas. In this handout, PROC GLM is used to fit
the MANOVA model and to illustrate how to test the profile analysis
hypotheses.
47
• In the first call to PROC GLM, we test for parallelism. Note that
A is specified with the CONTRAST statement, the transpose of C
is specified with the MANOVA statement, and D is assumed to be
equal to 0.
• According to Wilks’ test, the hypothesis of parallelism is rejected at
α = .05 (p = .0164).
• Given that H01 is rejected, we should compare groups and times
without assuming parallelism, and/or compare times within each
group separately and compare groups within each time separately.
However, for illustration purposes, I’ve given the tests of H02a , H02b ,
H03a , and H03b in sheep2.sas.
– The hypothesis of no group effect assuming parallelism (H02a )
is rejected (p = .0026),
– the hypothesis of no group effect without assuming parallelism
(H02b ) is rejected (p = .0350),
– the hypothesis of no time effect assuming parallelism (H03a ) is
rejected (p < .0001), and
– the hypothesis of no time effect without assuming parallelism
(H03b ) is rejected (p < .0001).
• Finally, I tested for no time effect in group 1 only. This hypothesis
was also rejected (p = .0006).
• All of these results are consistent with the profile plot obtained in
sheep1.sas.
48
Growth Curve Analysis:
We have seen that repeated measures of a single variable (methemoglobin,
say) over time can be analyzed with multivariate methods (e.g., MANOVA)
by regarding each time-specific measurement of the variable as a distinct
variable.
• E.g., if we measure methemoglobin at 10, 20, 30, 40, 50, and 60
minutes after treatment for each subject in each of three treatment
groups, we can compare the groups with a MANOVA based on t = 6
variables: methemoglobin at 10 min., methemoglobin at 20 min., . . .,
methemoglobin at 60 min.
• Such an approach does not recognize any ordering of the repeated
measurements and fits no model to describe time trends or growth
curves.
• In fact, repeated measurements through time are naturally ordered.
In this case, it may be of interest to characterize trends over time using
low-order polynomials (e.g., linear or quadratic curves in time).
By modelling the time trend, we hope to summarize the mean response
at the t time points with q < t parameters, rather than allowing for t
separate time-specific means.
• The use of polynomials to describe time trend within the context of
a multivariate linear model is known as growth curve analysis,
and is usually attributed to Potthoff and Roy.
• Not to be confused with the use of nonlinear models of growth (e.g.,
Richards’ model, von Bertalanffy’s model, etc.).
• This approach is seldom used these days, so we will not discuss it further. The use of polynomials in time to describe patterns of change
is still common, but it is now more commonly done in the framework
of linear mixed models.
49
Linear Mixed Effects Models (LMMs)
There are several disadvantages/limitations to multivariate methods (profile analysis, growth curves) for longitudinal data analysis.
• These methods assume the same set of measurement times for each subject,
so they cannot easily handle missing data, varying measurement times, or varying
cluster sizes (numbers of repeated measures).
• Cannot handle time-varying covariates easily.
• Models make no assumptions on within-subject var-cov matrix. This
makes these methods broadly valid, but not powerful.
• Multivariate methods don’t model the sources of heterogeneity/correlation
in the design generating the data: there is no quantification of heterogeneity, and little flexibility to model multiple sources of heterogeneity and
correlation.
A much more flexible class of models is the class of linear mixed effects
models (LMMs).
We have already seen examples of LMMs: the split-plot model (chocolate
cake example), RM-ANOVA model (methemoglobin in sheep example).
• In these cases, a cluster-specific random effect (whole plot error term)
was included to model whole plot to whole plot or subject to subject
variability and to imply correlation within a whole-plot/subject.
In general, the inclusion of random effects into the linear model allows
for modeling (and quantification) of multiple sources of heterogeneity and
complex patterns of correlation.
Further flexibility is achieved in this class of models by also letting the
error term have a general, non-spherical variance-covariance matrix.
The result is a very rich, flexible and useful class of models.
50
Some Simple LMMs:
The one-way random effects model — Railway Rails:
(See Pinheiro and Bates, §1.1) The data displayed below are from an
experiment conducted to measure longitudinal (lengthwise) stress in
railway rails. Six rails were chosen at random and tested three times
each by measuring the time it took for a certain type of ultrasonic
wave to travel the length of the rail.
[Figure: dotplot of zero-force travel time (nanoseconds, roughly 40–100) for each of the six rails.]
Clearly, these data are grouped, or clustered, by rail. This clustering
has two closely related implications:
1. (within-cluster correlation) we should expect that observations
from the same rail will be more similar to one another than
observations from different rails; and
2. (between cluster heterogeneity) we should expect that the mean
response will vary from rail to rail in addition to varying from
one measurement to the next.
These ideas are really flip-sides of the same coin.
51
Although it is fairly obvious that clustering by rail must be incorporated in the modeling of these data somehow, we first consider a
naive approach.
The primary interest here is in measuring the mean travel time.
Therefore, we might naively consider the model
   yij = µ + εij,   i = 1, . . . , 6,  j = 1, . . . , 3,

where yij is the travel time for the jth trial on the ith rail, and we
assume ε11, . . . , ε63 are iid N(0, σ²).
Here, µ is the mean travel time which we wish to estimate. Its
ML/OLS estimate is ȳ·· = 66.5 and the MSE is s² = (23.645)².
However, an examination of the residuals from this model, plotted
separately by rail, reveals the inadequacy of the model:
[Figure: boxplots of raw residuals by rail, simple mean model; the residuals range from roughly −40 to 20.]
Clearly, the mean response is changing from rail to rail. Therefore,
we consider a one-way ANOVA model:
yij = µ + αi + εij .
(∗)
Here, µ is a grand mean across the rails included in the experiment,
and αi is an effect up or down from the grand mean specific to the ith
rail. Alternatively, we could define µi = µ + αi as the mean response
for the ith rail and reparameterize this model as
yij = µi + εij .
The OLS estimates of the parameters of this model are µ̂i = ȳi·, giving
(µ̂1, . . . , µ̂6) = (54.00, 31.67, 84.67, 96.00, 50.00, 82.67) and s² = (4.02)².
The residual plot looks much better:
[Figure: boxplots of raw residuals by rail, one-way fixed effects model; the residuals now range from roughly −6 to 6.]
However, there are still drawbacks to this one-way fixed effects model:
– It only models the specific sample of rails used in the experiment, while the main interest is in the population of rails from
which these rails were drawn.
– It does not produce an estimate of the rail-to-rail variability
in travel time, which is a quantity of significant interest in the
study.
– The number of parameters increases linearly with the number
of rails used in the experiment.
These deficiencies are overcome by the one-way random effects model.
To motivate this model, consider again the one-way fixed effects
model. Model (*) can be written as

   yij = µ + (µi − µ) + εij,

where, under the usual constraint ∑i αi = 0, (µi − µ) has mean 0
when averaged over the groups (rails).
The one-way random effects model replaces the fixed parameter
(µi − µ) with a random effect bi, a random variable specific to the
ith rail, which is assumed to have mean 0 and an unknown variance
σb². This yields the model

   yij = µ + bi + εij,

where b1, . . . , b6 are independent random variables, each with mean
0 and variance σb². Often the bi’s are assumed normal, and they are
usually assumed independent of the εij’s. Thus we have

   b1, . . . , bn iid N(0, σb²),  independent of  ε11, . . . , εn,tn iid N(0, σ²),

where n is the number of rails and ti is the number of observations on the
ith rail.
54
– Note that now the interpretation of µ changes from the mean
over the 6 rails included in the experiment (fixed effects model)
to the mean over the population of all rails from which the six
rails were sampled.
– In addition, we are not estimating µi, the mean response for a
single rail, which is not of interest. Instead, we are estimating
the population mean µ and the variance from rail to rail in the
population, σb².
– That is, now our scope of inference is the population of rails,
rather than the six rails included in the study.
– In addition, we can estimate rail to rail variability σb2 ; and
– The number of parameters no longer increases with the number
of rails tested in the experiment.
The one-way random effects model is really a simplified version of the
split-plot model, and it implies a similar variance-covariance structure.
It is easy to show that for the one-way random effects model

   var(yij) = σb² + σ²,
   cov(yij, yij′) = σb²,   j ≠ j′,
   corr(yij, yij′) = ρ ≡ σb²/(σb² + σ²),   j ≠ j′,   and
   cov(yij, yi′j′) = 0,   i ≠ i′.

That is, if yi = (yi1, . . . , yiti)ᵀ, then y1, . . . , yn are independent, with

                          [ 1  ρ  ⋯  ρ ]
   var(yi) = (σb² + σ²) × [ ρ  1  ⋯  ρ ]
                          [ ⋮  ⋮  ⋱  ⋮ ]
                          [ ρ  ρ  ⋯  1 ]

(cf. the split-plot var-cov structure on p.29).
55
– In the rails example, the one-way random effects model again
leads to a BLUE of ȳ·· = 66.5 for µ.
– The restricted maximum likelihood (REML) estimators of
σ² and σb² coincide with the method-of-moments type estimators
we derived in the split-plot model. These estimates are σ̂b² =
(24.805)² and σ̂² = (4.021)².
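To make the implied covariance structure concrete, here is a minimal numpy sketch (not part of the original analysis) that builds var(yi) for a single rail from the REML estimates just quoted and reports the implied intraclass correlation ρ. The variable names are my own.

    import numpy as np

    sigma_b, sigma = 24.805, 4.021      # REML estimates quoted above (standard deviation scale)
    t_i = 3                             # three measurements per rail

    # var(y_i) = sigma_b^2 * J + sigma^2 * I  (compound symmetry)
    V_i = sigma_b**2 * np.ones((t_i, t_i)) + sigma**2 * np.eye(t_i)

    rho = sigma_b**2 / (sigma_b**2 + sigma**2)   # intraclass correlation
    print(np.round(V_i, 1))
    print(f"intraclass correlation rho = {rho:.3f}")

With these estimates ρ is very close to 1, consistent with the strong rail-to-rail separation seen in the dotplot.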
The randomized complete block model — Stool Example:
In the last example, the data were grouped by rail and we were
interested in only one treatment (there was only one experimental
condition under which the travel time along the rail was measured).
Often, several treatments are of interest and the data are grouped.
In a randomized complete block design (RCBD), each of s treatments
is observed in each of n blocks.
As an example, consider the data displayed below. These data come
from an experiment to compare the ergonomics of four different stool
designs. n = 9 subjects were asked to sit in each of s = 4 stools. The
response measured was the amount of effort required to stand up.
[Figure: effort required to arise (Borg scale, roughly 8–14) for each of the nine subjects, one panel per stool type T1–T4.]
Let yij be the response for the jth stool type tested by the ith subject.
The classical fixed effects model for the RCBD assumes

   yij = µ + αj + βi + εij
       = µj + βi + εij,   i = 1, . . . , n,  j = 1, . . . , s,

where ε11, . . . , εns are iid N(0, σ²).
Here, µj is the mean response for the jth stool type, which can be
broken apart into a grand mean µ and a stool type effect αj. βi is a
fixed subject effect.
Again, the scope of inference for this model is the set of 9 subjects
used in this experiment. If we wish to generalize to the population
from which the 9 subjects in this experiment were drawn, a more
appropriate model would consider the subject effects to be random.
The RCBD model with random subject effects is

   yij = µj + bi + εij,

where b1, . . . , bn are iid N(0, σb²), independent of ε11, . . . , εns iid N(0, σ²).
An equivalent representation is

   yi = Xi β + Zi bi + εi,   i = 1, . . . , n,

where

   yi = (yi1, . . . , yis)ᵀ,  Xi = Is,  β = (µ1, . . . , µs)ᵀ,  Zi = js = (1, . . . , 1)ᵀ,  εi = (εi1, . . . , εis)ᵀ.
57
From this model representation it is clear that the variance-covariance
structure here is quite similar to that in the one-way random effects
and split-plot models. In particular,

   cov(yi, yi′) = cov(Xi β + Zi bi + εi, Xi′ β + Zi′ bi′ + εi′) = 0,   i ≠ i′,

and

   var(yi) = var(Xi β + Zi bi + εi) = var(Zi bi + εi)
           = Zi var(bi) Ziᵀ + var(εi) = σb² Js,s + σ² Is

             [ σ² + σb²   σb²        ⋯   σb²      ]
           = [ σb²        σ² + σb²   ⋯   σb²      ]
             [ ⋮          ⋮          ⋱   ⋮        ]
             [ σb²        σb²        ⋯   σ² + σb² ]
It is often stated that whether block effects are assumed random or
fixed does not affect the analysis of the RCBD. This is not completely
true. It is true that whether or not blocks are treated as random does
not affect the ANOVA F test for treatments. The ANOVA table for
the RCBD with random block effects is
Source of    Sum of                  d.f.              Mean                   E(MS)                             F
Variation    Squares                                   Squares
Treat’s      n ∑j (ȳ·j − ȳ··)²       s − 1             SSTrt/(s − 1)          σ² + n ∑j (µj − µ̄·)²/(s − 1)      MSTrt/MSE
Blocks       s ∑i (ȳi· − ȳ··)²       n − 1             SSBlocks/(n − 1)       σ² + s σb²
Error        SSE (by subtr.)         (s − 1)(n − 1)    SSE/{(s − 1)(n − 1)}   σ²
Total        ∑i ∑j (yij − ȳ··)²      sn − 1

– This table is identical to that with blocks fixed except for the
expected MS for blocks. The F tests for the two situations are
identical.
58
However, there are important differences in the analysis of the two
designs. These differences affect inferences on treatment means.
For instance, in the fixed block effects model, the variance of a treatment
mean is

   var(ȳ·j) = var{n⁻¹ ∑i (µj + βi + εij)} = var(ε̄·j) = σ²/n,

whereas in the random block effects model

   var(ȳ·j) = var{n⁻¹ ∑i (µj + bi + εij)} = var(b̄· + ε̄·j)
            = var(b̄·) + var(ε̄·j) = σb²/n + σ²/n = (σ² + σb²)/n.
From the expressions for expected MS, method of moments (aka
anova) estimators for σ² and σb² are easily derived (cf. p.17, the
analogous results for the split-plot model):

   σ̂² = MSE,   σ̂b² = (MSBlocks − MSE)/s.

This leads to a standard error of

   s.e.(ȳ·j) = √var̂(ȳ·j) = √[{MSBlocks + (s − 1)MSE}/(ns)]

in the random block effects model and a standard error of

   s.e.(ȳ·j) = √(MSE/n)

in the fixed block effects model.
– See stool1.sas. Note that the s.e.’s on LSMEANS are larger for
the random blocks model. This makes sense, since the scope
of inference for this model is broader.
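As a quick numerical check of the two standard-error formulas above, here is a small Python sketch; the mean-square values are placeholders I made up for illustration, not the actual stool-data results reported by stool1.sas.

    import numpy as np

    n, s = 9, 4                       # blocks (subjects) and treatments
    MSE, MSBlocks = 1.2, 11.6         # hypothetical mean squares, for illustration only

    se_fixed  = np.sqrt(MSE / n)                              # fixed block effects
    se_random = np.sqrt((MSBlocks + (s - 1) * MSE) / (n * s)) # random block effects
    print(se_fixed, se_random)        # the random-blocks s.e. is larger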
59
The General LMM — Theory:
In general, we can write the linear mixed model as
y = Xβ + Zb + ε,
(1)
where X and Z are known matrices (the model or design matrices for
the fixed and random effects, respectively), β is a vector of unknown fixed
effects (parameters), b is a vector of random effects, and ε is a vector of
random error terms.
We assume for the random vectors b and ε that
E(b) = 0,
var(b) = D,
E(ε) = 0,
var(ε) = R,
and cov(b, ε) = 0.
• For statistical inference and for likelihood-based estimation we must
add distributional assumptions on b and ε. We make the usual
assumptions that
b ∼ N (0, D),
ε ∼ N (0, R).
• Notice that the variance-covariance matrices D and R are not assumed to be known, and are of general form. In special cases we will
assume spherical errors (R = σ 2 In ) and/or special forms for D.
For example, in the RCBD model with random block effects, suppose there
are n = 3 random blocks and s = 2 treatments. Then

   yij = µj + bi + εij

can be written in the general form (1) as follows:

   y = (y11, y12, y21, y22, y31, y32)ᵀ,  β = (µ1, µ2)ᵀ,  b = (b1, b2, b3)ᵀ,
   ε = (ε11, ε12, ε21, ε22, ε31, ε32)ᵀ,

       [ 1 0 ]         [ 1 0 0 ]
       [ 0 1 ]         [ 1 0 0 ]
   X = [ 1 0 ] ,   Z = [ 0 1 0 ] ,
       [ 0 1 ]         [ 0 1 0 ]
       [ 1 0 ]         [ 0 0 1 ]
       [ 0 1 ]         [ 0 0 1 ]

so that y = Xβ + Zb + ε stacks the two observations from each block.
For estimation of the fixed effects β, the mixed model can be written as a
generalized least-squares model. Define

   V ≡ var(y) = var(Xβ + Zb + ε) = var(Zb + ε) = ZDZᵀ + R.

• We assume that V is nonsingular.
Then model (1) is equivalent to

   y = Xβ + ζ,   E(ζ) = 0,   var(ζ) = V,

or

   y ∼ Nn(Xβ, V).
If V were known (at least up to a multiplicative constant), then our results
on GLS estimation would apply here, and we would obtain β̂ as a solution
to the equation

   XᵀV⁻¹Xβ = XᵀV⁻¹y,

and then the BLUE of any estimable function cᵀβ would be given by cᵀβ̂.
However, V is typically unknown. In that case, suppose we have an estimator V̂ of V. Then a natural approach for estimating cᵀβ is to treat V̂
as the true value V and then use the (estimated) GLS estimator

   cᵀβ̂ = cᵀ(XᵀV̂⁻¹X)⁻XᵀV̂⁻¹y.    (∗)

• If V̂ is “close to” V, then cᵀβ̂ should be close to the BLUE of
cᵀβ. However, the corresponding standard errors based on var̂(cᵀβ̂) =
cᵀ(XᵀV̂⁻¹X)⁻c are known to be somewhat underestimated because
they don’t account for the error in estimating V by V̂.
• We will see that the estimator defined by (*) is the ML (REML)
estimator of β when V̂ is the ML (REML) estimator of V.
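The estimated GLS computation in (*) is only a few lines of linear algebra. The sketch below is illustrative, assuming X, y, and an estimate Vhat are already available as numpy arrays; it uses a pseudoinverse as one convenient choice of generalized inverse, and its variance estimate is the naive (uncorrected) one discussed above.

    import numpy as np

    def egls(X, y, Vhat):
        """Estimated GLS: beta_hat = (X' Vhat^{-1} X)^- X' Vhat^{-1} y."""
        Vinv = np.linalg.inv(Vhat)
        XtVinv = X.T @ Vinv
        beta_hat = np.linalg.pinv(XtVinv @ X) @ XtVinv @ y
        avar_naive = np.linalg.pinv(XtVinv @ X)   # ignores the error in Vhat
        return beta_hat, avar_naive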
61
Prediction of b:
Before considering our specific problem of predicting b in the LMM, we
need to know a little bit about prediction of random variables, in general.
Suppose we have random variables y1 , . . . , yn from which we’d like to predict the random variable y0 . What is the best predictor of y0 ?
If we use the mean squared error of prediction as our criterion of optimality,
then we can show that the best (minimum mse) predictor is
mbp (y) ≡ E(y0 |y),
where y = (y1 , . . . , yn )T .
• Here the mean squared error of prediction for a predictor t(y) is
defined to be E[{y0 − t(y)}2 ]. Note this is a criterion of optimality
for a predictor t(y). Don’t confuse this with the M SE of a fitted
regression model.
62
This result is stated in the following theorem:
Theorem: Let mbp (y) = E(y0 |y). Then, for any predictor t(y),
E[{y0 − t(y)}2 ] ≥ E[{y0 − mbp (y)}2 ].
Thus mbp (y) = E(y0 |y) is the best predictor of y0 in the sense of minimizing the mean squared error of prediction.
Proof:

   E[{y0 − t(y)}²] = E[{y0 − mbp(y) + mbp(y) − t(y)}²]
                   = E[{y0 − mbp(y)}²] + E[{mbp(y) − t(y)}²]
                     + 2E[{y0 − mbp(y)}{mbp(y) − t(y)}].

Since E[{mbp(y) − t(y)}²] ≥ 0, it suffices to show that the cross-product
term E[{y0 − mbp(y)}{mbp(y) − t(y)}] equals 0, for then the right-hand
side is at least E[{y0 − mbp(y)}²]. This is indeed the case because

   E[{y0 − mbp(y)}{mbp(y) − t(y)}] = E( E[{y0 − mbp(y)}{mbp(y) − t(y)} | y] )
                                   = E( E[{y0 − mbp(y)} | y] {mbp(y) − t(y)} )
                                   = E( {E(y0|y) − mbp(y)} {mbp(y) − t(y)} )
                                   = E( 0 · {mbp(y) − t(y)} ) = 0,

since E(y0|y) = mbp(y) by definition.
• To form the best predictor E(y0 |y) we, in general, require knowledge of the joint distribution of (y0 , y1 , . . . , yn )T which may not be
available.
• It requires substantially less information to form the best linear
predictor of y0 based on y. For the best predictor in the class of
linear predictors, we need only the means, variances and covariances
of y0 and y.
63
Limiting ourselves to the class of linear predictors, we seek a predictor of
the form γ0 + yᵀγ for some vector γ that minimizes E{(y0 − γ0 − yᵀγ)²}.
Let µy0 = E(y0), σ²y0 = var(y0), µy = E(y), Vyy = var(y), and vyy0 =
cov(y, y0).
Let γ* denote a solution to Vyy γ = vyy0, i.e., γ* = Vyy⁻ vyy0 (= Vyy⁻¹ vyy0
in the case that Vyy is nonsingular). Then the following theorem holds:
Theorem: The function

   mblp(y) ≡ µy0 + (γ*)ᵀ(y − µy)

is a best linear predictor of y0 based on y.
Proof: Denote an arbitrary linear predictor as t(y) = γ0 + yᵀγ. Then

   E[{y0 − t(y)}²] = E[{y0 − mblp(y) + mblp(y) − t(y)}²]
                   = E[{y0 − mblp(y)}²] + E[{mblp(y) − t(y)}²]
                     + 2E[{y0 − mblp(y)}{mblp(y) − t(y)}].

Again, it suffices to show that E[{y0 − mblp(y)}{mblp(y) − t(y)}] = 0,
because if this cross-product term is 0 then

   E[{y0 − t(y)}²] = E[{y0 − mblp(y)}²] + E[{mblp(y) − t(y)}²].

To find t(y) that minimizes the left hand side (the mse criterion), observe
that both terms on the right hand side are nonnegative, the first term
does not depend on t(y), and the second term is minimized when it is
zero, which happens when t(y) = mblp(y).
So, it remains to show that E[{y0 − mblp(y)}{mblp(y) − t(y)}] = 0, which
we leave as an exercise.
64
• In general, the best linear predictor and best predictor differ. However, in the special case in which (y0 , y1 , . . . , yn )T is multivariate
normal, the best linear predictor and best predictor coincide.
• It can also be shown that the BLP is essentially unique, so that it
makes sense to speak of the BLP.
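As a concrete illustration, here is a tiny numpy helper (illustrative only; the argument names are my own) that evaluates mblp(y) from known first and second moments. Under joint normality this is also E(y0 | y), the best predictor.

    import numpy as np

    def blp(mu_y0, mu_y, V_yy, v_yy0, y):
        """m_blp(y) = mu_y0 + v_yy0' V_yy^{-1} (y - mu_y)."""
        gamma = np.linalg.solve(V_yy, v_yy0)   # gamma* solving V_yy gamma = v_yy0
        return mu_y0 + gamma @ (y - mu_y)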
Now let’s return to the LMM context. Suppose y satisfies the LMM (1)
from p.69. Then
µy = Xβ,
Vyy = ZDZT + R (assumed nonsingular)
so that the BLP of y0 is
µy0 + vy0 y (ZDZT + R)−1 (y − Xβ).
• However, this predictor is typically not of much use, because β is
unknown. In addition, µy0 , D and R may be unknown as well.
For now, suppose that D and R and vy0 y (which is often a function of D
and R) are known, but µy0 and β are not. Then, since the BLP is not
available, a natural predictor to consider is
mblup (y) ≡ µ̂y0 + vy0 y (ZDZT + R)−1 (y − Xβ̂),
where µ̂y0 and Xβ̂ are BLUEs of µy0 and E(y) = Xβ, respectively.
• It can be shown that mblup (y) is the best linear unbiased predictor of y0 (see Christensen, Ch.12). That is, in the class of unbiased
predictors that are linear in y, mblup (y) has the minimum mse of
prediction.
Unbiasedness of a Predictor: A predictor t(y) of y0 is said to be
unbiased if
E{t(y)} = E(y0 ).
65
In a LMM context, it is typically of interest to predict cᵀb, a linear combination of the vector of random effects, based upon y, the observed data
vector.
• That is, we now let cᵀb play the role of y0 in our description of
BLUP above.
Since E(cᵀb) = cᵀE(b) = 0, µ̂y0 in mblup(y) becomes 0. In addition,

   vy0y = cov(cᵀb, y) = cᵀcov(b, Xβ + Zb + ε)
        = cᵀcov(b, Zb + ε) = cᵀ{cov(b, b)Zᵀ + cov(b, ε)} = cᵀvar(b)Zᵀ = cᵀDZᵀ,

using cov(b, ε) = 0. Therefore, the BLUP of cᵀb is given by

   cᵀDZᵀV⁻¹(y − Xβ̂),

where Xβ̂ is the BLUE of Xβ and V = var(y) = ZDZᵀ + R.
If we are interested in the BLUP of a vector of such functions, (c1ᵀb, . . . , crᵀb)ᵀ =
Cb, this result extends in the obvious way: the BLUP of Cb is given by

   CDZᵀV⁻¹(y − Xβ̂).

It is sometimes convenient to write this BLUP in the equivalent form

   BLUP(Cb) = CDZᵀV⁻¹(y − Xβ̂) = CDZᵀPy,    (†)

where

   P = V⁻¹ − V⁻¹X(XᵀV⁻¹X)⁻XᵀV⁻¹.
66
The “Mixed Model Equations”:
At this point we have seen that for D and R known (i.e., for var(y) =
V = ZDZᵀ + R known), a BLUE of β and the BLUP of b are

   β̂ = (XᵀV⁻¹X)⁻XᵀV⁻¹y,   b̂ = DZᵀV⁻¹(y − Xβ̂),

respectively.
In the classical linear model, the BLUE of β is obtained as the solution of
the normal equations. In the LMM there is an analogous set of equations
that yield the BLUE and BLUP of β and b. These equations are called
the mixed model equations or, sometimes, Henderson’s equations.
We now present the mixed model equations. We assume R and D are
nonsingular, known matrices.
Recall the LMM:

   y = Xβ + Zb + ε,   E(b) = E(ε) = 0,  var(b) = D,  var(ε) = R,  cov(b, ε) = 0.

If b were fixed instead of random, the normal equations (based on GLS)
for the model would be

   [ Xᵀ ]              ( β )     [ Xᵀ ]
   [ Zᵀ ] R⁻¹ (X, Z)   ( b )  =  [ Zᵀ ] R⁻¹ y,

which may be written equivalently as

   [ XᵀR⁻¹X   XᵀR⁻¹Z ] ( β )     ( XᵀR⁻¹y )
   [ ZᵀR⁻¹X   ZᵀR⁻¹Z ] ( b )  =  ( ZᵀR⁻¹y ).

Of course, in the mixed model, b is random, which leads to a slightly
different set of equations, known as the mixed model equations:

   [ XᵀR⁻¹X   XᵀR⁻¹Z        ] ( β )     ( XᵀR⁻¹y )
   [ ZᵀR⁻¹X   D⁻¹ + ZᵀR⁻¹Z  ] ( b )  =  ( ZᵀR⁻¹y ).    (∗)
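To see numerically that the mixed model equations reproduce the GLS estimator and the BLUP, here is a small self-contained Python sketch for a toy random-intercept layout. All dimensions and variance values are made up purely for illustration.

    import numpy as np

    rng = np.random.default_rng(0)
    n_clust, t = 4, 3
    sigma2_b, sigma2 = 2.0, 1.0

    Z = np.kron(np.eye(n_clust), np.ones((t, 1)))     # random-intercept design
    X = np.ones((n_clust * t, 1))                     # intercept only
    D = sigma2_b * np.eye(n_clust)
    R = sigma2 * np.eye(n_clust * t)
    y = rng.normal(size=n_clust * t)                  # toy response

    # Henderson's mixed model equations
    Rinv = np.linalg.inv(R)
    lhs = np.block([[X.T @ Rinv @ X, X.T @ Rinv @ Z],
                    [Z.T @ Rinv @ X, np.linalg.inv(D) + Z.T @ Rinv @ Z]])
    rhs = np.concatenate([X.T @ Rinv @ y, Z.T @ Rinv @ y])
    sol = np.linalg.solve(lhs, rhs)
    beta_mme, b_mme = sol[:1], sol[1:]

    # Direct GLS / BLUP formulas for comparison
    V = Z @ D @ Z.T + R
    Vinv = np.linalg.inv(V)
    beta = np.linalg.solve(X.T @ Vinv @ X, X.T @ Vinv @ y)
    b = D @ Z.T @ Vinv @ (y - X @ beta)
    print(np.allclose(beta_mme, beta), np.allclose(b_mme, b))   # True True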
67
Theorem: If (β̂ᵀ, b̂ᵀ)ᵀ is a solution to the mixed model equations, then
Xβ̂ is a BLUE of Xβ and b̂ is a BLUP of b.
Proof: Recall that the LMM is equivalent to the model

   y = Xβ + ζ,   E(ζ) = 0,   var(ζ) = ZDZᵀ + R ≡ V.

Therefore, Xβ̂ will be a BLUE of Xβ if β̂ is a solution to XᵀV⁻¹Xβ =
XᵀV⁻¹y. It can be shown (see Theorem B.56 in Christensen, for example)
that

   V⁻¹ = R⁻¹ − R⁻¹Z{D⁻¹ + ZᵀR⁻¹Z}⁻¹ZᵀR⁻¹.

If β̂ and b̂ are solutions, then the second row of (*) gives

   ZᵀR⁻¹Xβ̂ + {D⁻¹ + ZᵀR⁻¹Z}b̂ = ZᵀR⁻¹y
   ⇒  b̂ = {D⁻¹ + ZᵀR⁻¹Z}⁻¹ZᵀR⁻¹(y − Xβ̂).    (∗∗)

The first row of (*) is

   XᵀR⁻¹Xβ̂ + XᵀR⁻¹Zb̂ = XᵀR⁻¹y.

Substituting the expression for b̂ gives

   XᵀR⁻¹Xβ̂ + XᵀR⁻¹Z{D⁻¹ + ZᵀR⁻¹Z}⁻¹ZᵀR⁻¹(y − Xβ̂) = XᵀR⁻¹y,

or

   XᵀR⁻¹Xβ̂ − XᵀR⁻¹Z{D⁻¹ + ZᵀR⁻¹Z}⁻¹ZᵀR⁻¹Xβ̂
      = XᵀR⁻¹y − XᵀR⁻¹Z{D⁻¹ + ZᵀR⁻¹Z}⁻¹ZᵀR⁻¹y,

or

   Xᵀ(R⁻¹ − R⁻¹Z{D⁻¹ + ZᵀR⁻¹Z}⁻¹ZᵀR⁻¹)Xβ̂ = Xᵀ(R⁻¹ − R⁻¹Z{D⁻¹ + ZᵀR⁻¹Z}⁻¹ZᵀR⁻¹)y,

where the parenthesized matrix on each side is V⁻¹. Thus β̂ is a GLS
solution, so Xβ̂ is BLUE.
68
Now to show b̂ is a BLUP: b̂ in (**) can be rewritten as

   b̂ = (D{D⁻¹ + ZᵀR⁻¹Z} − DZᵀR⁻¹Z){D⁻¹ + ZᵀR⁻¹Z}⁻¹ZᵀR⁻¹(y − Xβ̂)
     = (DZᵀR⁻¹ − DZᵀR⁻¹Z{D⁻¹ + ZᵀR⁻¹Z}⁻¹ZᵀR⁻¹)(y − Xβ̂)
     = DZᵀ(R⁻¹ − R⁻¹Z{D⁻¹ + ZᵀR⁻¹Z}⁻¹ZᵀR⁻¹)(y − Xβ̂)
     = DZᵀV⁻¹(y − Xβ̂),

which is the BLUP of b by result (†) on p.75 (here I plays the role of
C since we’re interested in the BLUP of Ib = b).
Sampling Variance of BLUE and BLUP for V known:
Just as it is useful for inference on β to know the variance of our estimator
β̂, it is useful to know the prediction variance of the BLUP.
For V known and Cβ a vector of estimable functions, the estimator Cβ̂ =
C(XᵀV⁻¹X)⁻XᵀV⁻¹y has variance-covariance matrix

   var(Cβ̂) = C(XᵀV⁻¹X)⁻Cᵀ.
69
The analogous result for Cb̂ is as follows:

   var(Cb̂) = var(CDZᵀPy) = CDZᵀPVPᵀZDᵀCᵀ = CDZᵀPZDCᵀ.

Here, we have used the facts that PVP = P and that P and D are symmetric.
If we are interested in the variance of the prediction error Cb − Cb̂, then
we have

   var(Cb − Cb̂) = C var(b − b̂) Cᵀ = C{var(b) + var(b̂) − cov(b, b̂) − cov(b, b̂)ᵀ}Cᵀ,

where

   cov(b, b̂) = cov(b, DZᵀPy) = cov(b, y)PZD = cov(b, Xβ + Zb + ε)PZD
             = cov(b, Zb)PZD = cov(b, b)ZᵀPZD = DZᵀPZD = var(b̂).

Therefore,

   var(Cb − Cb̂) = C{var(b) + var(b̂) − var(b̂) − var(b̂)ᵀ}Cᵀ = C{D − DZᵀPZD}Cᵀ.

In addition, note that for C a matrix of constants such that Cβ is a vector
of estimable functions,

   cov(Cβ̂, b̂) = 0.
70
Maximum Likelihood Estimation:
• We have already seen that for known V, Cβ has BLUE Cβ̂ where
β̂ = (XT V−1 X)− XT V−1 y and b has BLUP DZT V−1 (y − Xβ̂).
• These results do not depend upon any distributional assumption on
b and ε.
• In addition, to this point we have concentrated on the case when V
is known. We now relax that assumption to consider the V unknown
case.
• Note that b is not a parameter of the model, so while we may be
interested in b̂, predicting b is not part of fitting the model.
• The unknown parameters of the LMM are β, D, and R.
• Typically, some structure is placed on D and R so that their forms
are known, and they are assumed to be matrix functions of a relatively small number of parameters.
Let θ be the q × 1 vector of unknown parameters describing D and R, and
hence V.
• We will often write these matrices as D(θ), R(θ), V(θ) to emphasize
this dependence.
So, fitting the LMM involves estimating β and θ. After the model has
been fit it may also be of interest to predict b.
A unified framework for estimation of these parameters is provided by
maximum likelihood, which requires that we make distributional assumptions
on b and ε.
• Such assumptions will be necessary anyway for inference, so there is
not much cost in making them at the estimation phase.
71
Suppose y = Xβ + Zb + ε, where b ∼ N(0, D(θ)), ε ∼ N(0, R(θ)), and
b and ε are independent. Then

   y ∼ N(Xβ, V(θ)),   where V = ZDZᵀ + R,

so the loglikelihood for (β, θ) is just the log of a multivariate normal density:

   ℓ(β, θ; y) = −(n/2) log(2π) − (1/2) log{|V(θ)|} − (1/2)(y − Xβ)ᵀ{V(θ)}⁻¹(y − Xβ).
The ML estimators of β and θ can be found by taking partial derivatives
of ℓ(β, θ; y) with respect to β and the components of θ and setting the
resulting functions equal to zero, and solving.
To take these partial derivatives, we need some results on matrix and
vector differentiation. The following four results appear in Christensen
(Plane Answers to Complex Questions) as Proposition 12.4.1, but can also
be found in McCulloch et al., 2008, Appendix M, and other standard
references.
1. ∂(Ax)/∂x = A.

2. ∂(xᵀAx)/∂x = 2xᵀA.

3. If A is a function of a scalar s,

   ∂A⁻¹/∂s = −A⁻¹ (∂A/∂s) A⁻¹.

4. If A is a function of a scalar s,

   ∂ log|A| / ∂s = tr( A⁻¹ ∂A/∂s ).
72
Back to our problem. Recall the loglikelihood:

   ℓ(β, θ; y) = −(n/2) log(2π) − (1/2) log{|V(θ)|} − (1/2)(y − Xβ)ᵀ{V(θ)}⁻¹(y − Xβ).

Using the matrix and vector differentiation results above, we obtain the
following partial derivatives:

   ∂ℓ/∂β  = −βᵀXᵀ{V(θ)}⁻¹X + yᵀ{V(θ)}⁻¹X,   and
   ∂ℓ/∂θj = −(1/2) tr( V⁻¹ ∂V/∂θj ) + (1/2)(y − Xβ)ᵀ{V(θ)}⁻¹ (∂V/∂θj) {V(θ)}⁻¹(y − Xβ),
            j = 1, . . . , q.

Setting these partials equal to zero, we get the following set of estimating
equations which can be solved to obtain the MLEs β̂ and θ̂:

   Xᵀ{V(θ)}⁻¹Xβ = Xᵀ{V(θ)}⁻¹y
   tr( V⁻¹ ∂V/∂θj ) = (y − Xβ)ᵀ{V(θ)}⁻¹ (∂V/∂θj) {V(θ)}⁻¹(y − Xβ),   j = 1, . . . , q.    (♡)

• Although these equations do not, in general, have simple closed-form
solutions, they can be solved simultaneously by any one of several
numerical techniques (e.g., Newton-Raphson, the EM algorithm).
73
Instead of solving the equations (♡) simultaneously, an alternative method
of maximizing ℓ(β, θ; y), which is often more convenient, is the method of
profile (log)likelihood.

i. First, treat θ as fixed and maximize the loglikelihood ℓ(β, θ; y) with
respect to β.
ii. Second, plug the estimator of β, call it βθ (a function of θ), back
into the loglikelihood. This yields pℓ(θ; y) ≡ ℓ(βθ, θ; y), which is a
function of θ only (called the profile loglikelihood for θ). Maximize
pℓ(θ; y) with respect to θ to obtain the MLE θ̂.
iii. Finally, the MLE of β is obtained by plugging θ̂ into our estimator
for β obtained in step 1. That is, the MLE of β is β̂ = βθ̂.

• Notice that for fixed θ, maximizing ℓ(β, θ; y) with respect to β is
equivalent to minimizing

   (y − Xβ)ᵀ{V(θ)}⁻¹(y − Xβ),

which is the GLS criterion. Therefore, step 1 gives

   βθ = [Xᵀ{V(θ)}⁻¹X]⁻Xᵀ{V(θ)}⁻¹y.

The real work is done in step 2, where we obtain θ̂ by maximizing

   pℓ(θ; y) = −(1/2)[ log{|V(θ)|} + (y − Xβθ)ᵀ{V(θ)}⁻¹(y − Xβθ) ].

Once this step is accomplished, it is clear that the MLE of β will then be

   β̂ = βθ̂ = (XᵀV̂⁻¹X)⁻XᵀV̂⁻¹y,

where V̂ = V(θ̂).
• For a presentation of efficient computational methods for maximizing
pℓ(θ; y), see Pinheiro and Bates (2000, §2.2) and McCulloch, Searle,
and Neuhaus (2008, Ch. 14).
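A bare-bones numerical sketch of steps (i)–(iii) for a variance component model with θ = (σb², σ²), using scipy for the maximization in step (ii). The simulated data, the random-intercept design, and the log-parameterization of θ are my own choices, purely for illustration.

    import numpy as np
    from scipy.optimize import minimize

    rng = np.random.default_rng(1)
    n_clust, t = 6, 3
    Z = np.kron(np.eye(n_clust), np.ones((t, 1)))   # random-intercept design
    X = np.ones((n_clust * t, 1))                   # intercept only
    y = rng.normal(loc=5.0, size=n_clust * t) + Z @ rng.normal(scale=2.0, size=n_clust)

    def neg_profile_loglik(log_theta):
        sb2, s2 = np.exp(log_theta)                 # enforce positivity
        V = sb2 * (Z @ Z.T) + s2 * np.eye(len(y))
        Vinv = np.linalg.inv(V)
        beta = np.linalg.solve(X.T @ Vinv @ X, X.T @ Vinv @ y)   # step (i): GLS for fixed theta
        r = y - X @ beta
        _, logdet = np.linalg.slogdet(V)
        return 0.5 * (logdet + r @ Vinv @ r)        # -pl(theta), constants dropped

    fit = minimize(neg_profile_loglik, x0=np.log([1.0, 1.0]), method="Nelder-Mead")
    sb2_hat, s2_hat = np.exp(fit.x)                 # step (ii): ML of the variance components

    # step (iii): plug theta_hat back in to get the ML estimate of beta
    V = sb2_hat * (Z @ Z.T) + s2_hat * np.eye(len(y))
    Vinv = np.linalg.inv(V)
    beta_hat = np.linalg.solve(X.T @ Vinv @ X, X.T @ Vinv @ y)
    print(sb2_hat, s2_hat, beta_hat)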
74
Variance Component Models:
While the parameterization of V through θ can, in the general LMM, take
on a wide variety of forms, in the important subclass of LMMs known as
variance component models, V(θ) has a specific simple form.
In variance component models, the levels of any particular random effect
are assumed to be independent with the same variance. Different random
effects are allowed different variances and are assumed independent. In
addition, the errors are assumed to be homoscedastic so that R = σ 2 In .
• The one-way random effects model, the RCB model, and the split-plot model are all examples of variance component models.
Another example is the model for an s × s Latin square design in which
both blocking factors (rows and columns in the Latin square) are thought
of as random. The appropriate model for yijk, the response for the ith
treatment in the jth row and kth column, is

   yijk = µ + αi + rj + ck + εijk,

where

   r1, . . . , rs iid N(0, σr²),   c1, . . . , cs iid N(0, σc²),   ε ∼ N(0, σ²I).
In variance component models, Z can be partitioned into q − 1 submatrices
(q = 2 in the one-way, RCB, and split-plot models; q = 3 in the LSD)
as Z = (Z1, Z2, . . . , Zq−1), and b can be partitioned accordingly as b =
(b1ᵀ, b2ᵀ, . . . , bq−1ᵀ)ᵀ.
Let m(i) denote the number of columns in Zi (= the number of elements in
bi). We assume var(bi) = σi² Im(i), i = 1, . . . , q − 1, and cov(bi, bj) = 0
for i ≠ j.
Then D takes on a block-diagonal structure as follows:

       [ σ1² Im(1)   0          ⋯   0              ]
   D = [ 0           σ2² Im(2)  ⋯   0              ]
       [ ⋮           ⋮          ⋱   ⋮              ]
       [ 0           0          ⋯   σq−1² Im(q−1)  ]

We also assume R = σq² In. Putting these assumptions together, the matrix
V is assumed to be of the form

   V = ∑_{i=1}^{q−1} σi² Zi Ziᵀ + σq² In = ∑_{i=1}^{q} σi² Zi Ziᵀ,    (♢)

where Zq ≡ In.
• E.g., in the LSD, suppose that the number of treatments = number
of rows = number of columns = 3. Suppose the design and data are
as follows (treatment letter with its observation in each row–column cell):

   Row 1:  A (y111)   B (y212)   C (y313)
   Row 2:  C (y321)   A (y122)   B (y223)
   Row 3:  B (y231)   C (y332)   A (y133)
76
Then the model can be written as

   y = Xβ + Z1 b1 + Z2 b2 + ε,

where y = (y111, y212, y313, y321, y122, y223, y231, y332, y133)ᵀ lists the
observations row by row, β = (µ, α1, α2, α3)ᵀ, b1 = (r1, r2, r3)ᵀ (row
effects), b2 = (c1, c2, c3)ᵀ (column effects), ε = (ε111, . . . , ε133)ᵀ, and

       [ 1 1 0 0 ]          [ 1 0 0 ]          [ 1 0 0 ]
       [ 1 0 1 0 ]          [ 1 0 0 ]          [ 0 1 0 ]
       [ 1 0 0 1 ]          [ 1 0 0 ]          [ 0 0 1 ]
       [ 1 0 0 1 ]          [ 0 1 0 ]          [ 1 0 0 ]
   X = [ 1 1 0 0 ] ,   Z1 = [ 0 1 0 ] ,   Z2 = [ 0 1 0 ] .
       [ 1 0 1 0 ]          [ 0 1 0 ]          [ 0 0 1 ]
       [ 1 0 1 0 ]          [ 0 0 1 ]          [ 1 0 0 ]
       [ 1 1 0 0 ]          [ 0 0 1 ]          [ 0 1 0 ]
       [ 1 0 0 1 ]          [ 0 0 1 ]          [ 0 0 1 ]

Here, R = σq² Zq with Zq = I9, and D has the block-diagonal form

   D = [ σr² I3   0       ]
       [ 0        σc² I3  ],

and

   V = var(y) = var(Z1 b1 + Z2 b2 + ε)
     = Z1 (σr² I3) Z1ᵀ + Z2 (σc² I3) Z2ᵀ + σq² I9
     = σr² Z1 Z1ᵀ + σc² Z2 Z2ᵀ + σq² Zq Zqᵀ.
In variance component models, θ = (σ1², σ2², . . . , σq²)ᵀ and V(θ) = ∑_{j=1}^{q} θj Zj Zjᵀ
(cf. (♢)).
This leads to some simplification in the likelihood equations (♡). In particular,

   ∂V/∂θj = Zj Zjᵀ.
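As a concrete check of the structure (♢) for the 3 × 3 Latin square above, here is a small numpy sketch; the variance component values are made up purely for illustration, and the observations are assumed ordered row by row as in the layout shown earlier.

    import numpy as np

    # 3x3 Latin square, observations ordered row by row
    Z1 = np.kron(np.eye(3), np.ones((3, 1)))   # row (blocking) effects
    Z2 = np.kron(np.ones((3, 1)), np.eye(3))   # column (blocking) effects
    sigma_r2, sigma_c2, sigma2 = 1.5, 0.8, 1.0 # hypothetical variance components

    V = sigma_r2 * Z1 @ Z1.T + sigma_c2 * Z2 @ Z2.T + sigma2 * np.eye(9)
    print(np.round(V, 2))                       # V = sum of sigma_i^2 * Z_i Z_i'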
77
Information Matrix:
Under the general theory of maximum likelihood estimation, ML estimators are consistent and asymptotically normal, under suitable regularity
conditions.
• For a discussion of the regularity conditions, see, for example, Cox
and Hinkley (1974, Theoretical Statistics, Ch.9), or Seber and Wild
(1989, Nonlinear Regression, Ch.12).
In particular, the asymptotic variance-covariance matrix of an MLE ϕ̂, defined as the maximizer of a loglikelihood function ℓ(ϕ), is I(ϕ)⁻¹, where

   I(ϕ) = −E( ∂²ℓ(ϕ) / ∂ϕ∂ϕᵀ )

is known as the Fisher information matrix.
• In practice, we replace ϕ by ϕ̂ and use I(ϕ̂).
• Alternatively, the observed information matrix, or negative Hessian matrix,

   −( ∂²ℓ(ϕ) / ∂ϕ∂ϕᵀ ) evaluated at ϕ = ϕ̂,

can be used in place of I(ϕ̂) without changing the asymptotics.
78
In the LMM context, ϕ = (βᵀ, θᵀ)ᵀ and the loglikelihood is given on the
top of p.73. The information matrix is given by

        [ ∂²ℓ/∂β∂βᵀ   ∂²ℓ/∂β∂θᵀ ]     [ XᵀV⁻¹X   0                                       ]
   −E   [ ∂²ℓ/∂θ∂βᵀ   ∂²ℓ/∂θ∂θᵀ ]  =  [ 0        (1/2){ tr(V⁻¹ ∂V/∂θj V⁻¹ ∂V/∂θk) }      ],

where { tr(V⁻¹ ∂V/∂θj V⁻¹ ∂V/∂θk) } denotes the q × q matrix with the quantity
inside the curly braces as its (j, k)th element (see §6.11.a.iii of McCulloch,
Searle, and Neuhaus, 2008, for the details).
Inverting this matrix leads to the following asymptotic variance-covariance
matrices for the MLE (β̂ᵀ, θ̂ᵀ)ᵀ:

   avar(Xβ̂) = X(XᵀV⁻¹X)⁻Xᵀ,
   avar(θ̂) = 2 [ { tr(V⁻¹ ∂V/∂θj V⁻¹ ∂V/∂θk) } ]⁻¹,
   acov(Xβ̂, θ̂) = 0.
• Inference for ML in the LMM can now be based on standard asymptotic likelihood-based methods — Wald, score and LR tests and CIs;
and model selection criteria such as AIC, BIC — all can be formed
in the usual way.
• E.g., the Wald test statistic for the hypothesis Cβ = d, where Cβ
is a vector of estimable functions (i.e., equals AX for some A), and
where rank(C) = nrows(C) ≤ rank(X), is given by

   (Cβ̂ − d)ᵀ [ C{XᵀV(θ̂)⁻¹X}⁻Cᵀ ]⁻¹ (Cβ̂ − d),

which is asymptotically χ²(nrows(C)) under the null hypothesis. (A small
computational sketch appears after this list.)
• However, we will see that we can improve upon these asymptotic
results with approximate F and t-based inference that work in small
and large samples.
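A minimal sketch of the Wald chi-square computation referenced above; beta_hat, avar_beta, C, and d are assumed to be numpy arrays supplied by the user (the names are mine), and scipy provides the χ² reference distribution.

    import numpy as np
    from scipy.stats import chi2

    def wald_test(beta_hat, avar_beta, C, d):
        """Wald statistic (Cb - d)' [C avar(b) C']^{-1} (Cb - d), df = rows of C."""
        diff = C @ beta_hat - d
        W = diff @ np.linalg.solve(C @ avar_beta @ C.T, diff)
        df = C.shape[0]
        return W, chi2.sf(W, df)   # statistic and asymptotic p-value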
79
Restricted Maximum Likelihood Estimation:
Recall that in simple problems, ML estimation of variances produces biased
estimators.
• E.g., in a one-sample problem from a N(µ, σ²) distribution, the MLE of σ² is
(1/n) ∑i (xi − x̄)² rather than the unbiased estimator s² = {1/(n − 1)} ∑i (xi − x̄)².
• Another example: in the CLM the MLE of the error variance is (1/n)SSE
rather than the unbiased estimator S² = SSE/{n − rank(X)}.
In estimating the variance, these ML estimators ignore the fact that parameters in the mean have been estimated.
• In the one-sample problem, one degree of freedom is used up in estimating µ with x̄, so the appropriate divisor is n − 1 (number of observations minus number of “non-redundant” parameters estimated)
rather than n.
• In the CLM, we use up rank(X) degrees of freedom in estimating β.
Therefore, the appropriate divisor is n − rank(X) rather than n.
We’d prefer to have a general “likelihood-based” method of estimation
that produces estimators of variances that account for the degrees of freedom lost (information used up) in estimating parameters of the mean.
Such a method is restricted maximum likelihood (REML) estimation (sometimes called residual maximum likelihood or marginal maximum
likelihood).
• Note that the goal here is to improve upon ML estimators of θ, not
of β and θ. That is, REML is a method of estimating the variance-covariance parameters, not a method of estimating all of the parameters of the model.
– However, given a REML estimator of θ it is obvious how the
ML estimator of β should be formed.
80
In REML, parameter estimates of θ are obtained by maximizing that part
of the likelihood which is invariant to Xβ.
• That is, we eliminate β from the log-likelihood by considering the
loglikelihood of a set of linear combinations of y, known as error
contrasts, whose distribution does not depend upon β, rather than
the density of y itself.
Error Contrasts: A linear combination kT y is said to be an error contrast if E(kT y) = 0 for all β.
• It follows that kT y is an error contrast if and only if XT k = 0.
Let C(X) denote the column space of X and C(X)⊥ its orthogonal complement. Let PC(X)⊥ = I − PC(X) = I − X(XT X)− XT and suppose
rank(X) = s. Then each element of the vector PC(X)⊥ y is an error contrast because
E(PC(X)⊥ y) = (I − PC(X) )E(y) = (I − PC(X) )Xβ = 0.
Note, however, that rank(PC(X)⊥ ) = n − s while the dimension of PC(X)⊥
is n × n.
• Therefore, there are some redundancies among the elements of PC(X)⊥ y.
A natural question arises:
How many essentially different (non-redundant) error contrasts can
be included in a single set?
81
Linearly Independent Error Contrasts: Error contrasts kT1 y, kT2 y, . . . , kTm y
are said to be linearly independent if k1 , . . . , km are linearly independent
vectors.
Theorem: Any set of error contrasts contains at most n−rank(X) = n−s
linearly independent error contrasts.
Theorem: Let K be an n × (n − s) matrix such that KᵀK = I and
KKᵀ = PC(X)⊥. The (n − s) × 1 vector w defined by

   w = Kᵀy

is a vector of n − s linearly independent error contrasts. (It is not the only
such vector, however.)
The REML approach consists of applying ML estimation to w rather than
y. If y ∼ Nn(Xβ, V(θ)), where V(θ) = ZD(θ)Zᵀ + R(θ), then

   w ∼ Nn−s(0, KᵀV(θ)K).

Therefore, the restricted loglikelihood for θ is just the log density of w:

   ℓR(θ; y) = −{(n − s)/2} log(2π) − (1/2) log|KᵀV(θ)K| − (1/2) wᵀ(KᵀV(θ)K)⁻¹w.
θ̂ is a REML estimate of θ if ℓR (θ; y) attains its maximum value at θ = θ̂.
• It can be shown that the maximizer of the restricted loglikelihood
does not depend upon which vector of n − s linearly independent
error contrasts is used to form w.
– That is, the REML estimator is well-defined.
82
It is preferable to express ℓR(θ; y) in terms of X and V (quantities that
define our model) rather than in terms of K and V. Hence the following
result:
Theorem: The loglikelihood function associated with any vector of n − s
linearly independent error contrasts is, apart from an additive constant
that doesn’t depend on θ,

   ℓR(θ; y) = −(1/2) log|V(θ)| − (1/2)(y − Xβθ)ᵀ{V(θ)}⁻¹(y − Xβθ)
              − (1/2) log|X̃ᵀ{V(θ)}⁻¹X̃|,    (♠)

where X̃ represents any n × s matrix such that C(X̃) = C(X) and βθ is
any solution to Xᵀ{V(θ)}⁻¹Xβ = Xᵀ{V(θ)}⁻¹y.
• Note that in ordinary ML, we obtain the profile likelihood for θ as

   pℓ(θ; y) = −(1/2) log|V(θ)| − (1/2)(y − Xβθ)ᵀ{V(θ)}⁻¹(y − Xβθ).

Notice that ℓR(θ; y) differs from pℓ(θ; y) only by the additional
term −(1/2) log|X̃ᵀ{V(θ)}⁻¹X̃|. This term serves as an adjustment, or
penalty, for the estimation of β. Hence REML estimation is sometimes called a penalized likelihood method.
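To make the penalty concrete, here is a sketch of the (negative) restricted loglikelihood for a variance component model with θ = (σb², σ²). It mirrors the profile loglikelihood sketch given earlier, with the extra log-determinant term; the function name, arguments, and log-parameterization are my own choices.

    import numpy as np

    def neg_restricted_loglik(log_theta, y, X, Z):
        """-l_R(theta): the (negative) profile loglikelihood plus the REML penalty term."""
        sb2, s2 = np.exp(log_theta)
        V = sb2 * (Z @ Z.T) + s2 * np.eye(len(y))
        Vinv = np.linalg.inv(V)
        beta = np.linalg.solve(X.T @ Vinv @ X, X.T @ Vinv @ y)
        r = y - X @ beta
        _, logdetV = np.linalg.slogdet(V)
        _, logdetXVX = np.linalg.slogdet(X.T @ Vinv @ X)
        #       -pl(theta)                REML penalty: +0.5 log|X' V^{-1} X|
        return 0.5 * (logdetV + r @ Vinv @ r) + 0.5 * logdetXVX

Minimizing this over log_theta (e.g., with scipy.optimize.minimize, exactly as in the earlier ML sketch) gives the REML estimate of θ.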
To obtain θ̂, the REML estimate of θ, we solve the estimating equations

   ∂ℓR(θ; y)/∂θi = 0,   i = 1, . . . , q.

In general LMMs, these estimating equations can be written

   (1/2)(y − Xβθ)ᵀ{V(θ)}⁻¹ (∂V(θ)/∂θi) {V(θ)}⁻¹(y − Xβθ) − (1/2) tr{ Q (∂V(θ)/∂θi) } = 0,
   i = 1, . . . , q, where

   Q = {V(θ)}⁻¹ [ I − X(Xᵀ{V(θ)}⁻¹X)⁻Xᵀ{V(θ)}⁻¹ ].
In variance component models, where θ = (σ1², σ2², . . . , σq²)ᵀ, these estimating equations simplify based on ∂V(θ)/∂σi² = Zi Ziᵀ, i = 1, . . . , q.
• REML estimators are not, in general, unbiased. However, they typically have less bias than ML estimators of variance components.
• While it is not possible to make completely general recommendations
concerning REML vs. ML estimation, it does appear that REML
estimators perform better than ML estimators for s large relative to
n. I would recommend REML over ML for s > 4 or so.
• As previously mentioned, REML provides an estimator of θ; it says
nothing about the estimation of β. However, ML says to estimate θ
as

   θ̂ML = arg maxθ pℓ(θ; y)

and then estimate β as

   β̂ML = βθ̂ML = [ Xᵀ{V(θ̂ML)}⁻¹X ]⁻¹ Xᵀ{V(θ̂ML)}⁻¹y.

In REML, we estimate θ by maximizing ℓR(θ; y) instead of pℓ(θ; y).
Therefore, the obvious “REML estimator of β” is

   β̂REML = βθ̂REML = [ Xᵀ{V(θ̂REML)}⁻¹X ]⁻¹ Xᵀ{V(θ̂REML)}⁻¹y,

where

   θ̂REML = arg maxθ ℓR(θ; y).
84
The asymptotic variance-covariance matrix of (β̂ᵀREML, θ̂ᵀREML)ᵀ is given
by

   [ (XᵀV⁻¹X)⁻    0                                       ]
   [ 0            2 [ { tr(P ∂V/∂θj P ∂V/∂θk) } ]⁻¹       ],

where

   P = V⁻¹ − V⁻¹X(XᵀV⁻¹X)⁻XᵀV⁻¹ = K(KᵀVK)⁻¹Kᵀ.
• As in ML estimation, this asymptotic var-cov matrix can be estimated by evaluating it at the REML estimates.
• Wald-based inference can then be done based on the estimated asymptotic var-cov matrix.
• Note however, that the restricted loglikelihood cannot be treated
as an ordinary loglikelihood. In particular, LRTs, AICs, and BICs
based on the restricted loglikelihood objective function should not
be used to select between models with different fixed-effects specifications.
• The restricted loglikelihood given by (♠) can be derived in a number
of different ways. Harville (1974) and Laird and Ware (1982) use
a Bayesian approach, while Patterson and Thompson (1971) use a
more traditional frequentist approach.
• It can also be derived as a modified profile likelihood function (see
Pawitan, §10.6 and Ch.17). See also McCullagh and Nelder, §7.2,
for connections to marginal and conditional likelihood.
85
Small-Sample Inference on the Fixed Effects:
As mentioned previously, ML/REML estimation provides a unified framework for estimation and inference in the LMM, and standard likelihood-based inference techniques for fixed effects are available (Wald tests, LR
tests, etc.).
In addition, for many special cases of the LMM, such as anova models,
exact (small and large sample) inference techniques are available for balanced data.
• E.g., in the one-way random effects model, or the RCB model, or
the split-plot model, it is possible to obtain exact F tests to test
treatment effects, do inference on treatment means, etc.
However, for unbalanced data, exact distributional results are not available, and in small samples, asymptotic variances can seriously underestimate the true variances of parameter estimators, compromising the validity
of the asymptotic inferences.
• Therefore, large-sample techniques (e.g., conventional Wald and LR
tests) are not recommended for inference on the fixed effects in
LMMs, except in very large samples.
We now consider small sample inference methods which attempt to compensate for the underestimation of the sampling variance of the REML
estimator of β in the LMM. The following presentation is based on Kenward and Roger (1997).
86
Recall that the REML estimator of β is

   β̂ = Φ(θ̂)XᵀV(θ̂)⁻¹y,   where Φ(θ) = {XᵀV(θ)⁻¹X}⁻¹,

and θ̂ = θ̂REML is the REML estimator of θ.
• For simplicity, we will assume that X is of full rank; the results
presented here extend to the non-full-rank case as well.
Recall that the matrix Φ(θ) is the asymptotic variance-covariance matrix
of β̂, and conventionally its estimator

   Φ̂ ≡ Φ(θ̂) = {XᵀV(θ̂)⁻¹X}⁻¹

is used to quantify the precision of β̂.
There are two sources of bias in Φ̂ as a measure of the precision of β̂ in
small samples:
1. Φ(θ) takes no account of the error introduced into β̂ by having to
estimate θ, so it is an underestimate of var(β̂).
– Another way to say this is as follows: when θ is known, we
know that var[{XᵀV(θ)⁻¹X}⁻¹XᵀV(θ)⁻¹y] = Φ(θ). Plugging in θ̂ in place of θ undoubtedly introduces some extra
variability (error), so var(β̂) must be greater than Φ(θ).
2. Φ̂ = Φ(θ̂) is a biased estimator of Φ(θ).
87
To correct these deficiencies, we write the variability in β̂ as the sum of
two components:

   var(β̂) = Φ + Λ,

where Λ represents the amount by which Φ = avar(β̂) underestimates
var(β̂).
Using Taylor series expansions, it can be shown that Λ can be approximated by

   Λ ≈ Φ { ∑_{i=1}^{q} ∑_{j=1}^{q} Wij (Sij − Ti Φ Tj) } Φ,

where

   Ti = Xᵀ (∂V⁻¹/∂θi) X,   Sij = Xᵀ (∂V⁻¹/∂θi) V (∂V⁻¹/∂θj) X,

and Wij is the (i, j)th element of W = var(θ̂).
In addition, a Taylor series expansion about θ can be used to show that
Φ̂ is biased as follows:

   E(Φ̂) ≈ Φ − Λ + (1/2) ∑i ∑j Wij Φ Rij Φ,

where

   Rij = XᵀV⁻¹ (∂²V/∂θi∂θj) V⁻¹X,

and we denote the last (double-sum) term above by (*). Since we want to
estimate Φ + Λ, Kenward and Roger suggest an adjusted small sample
var-cov matrix of β̂ given by

   Φ̂adj = Φ̂ + 2Λ̂ − (1/2) ∑i ∑j Ŵij Φ̂ R̂ij Φ̂,

where Ŵij is the (i, j)th element of avâr(θ̂) (top of p.85), and V(θ̂) is
substituted for V(θ) to form R̂ij and Λ̂.
• Note that in variance component models the term (*) equals 0, so
the adjusted estimator of var(β̂) simplifies to Φ̂adj = Φ̂ + 2Λ̂.
88
Inference and Degrees of Freedom:
For a testable hypothesis of the form H0: Cβ = d, where C is c × p of
full row rank, a reasonable test statistic for H0 is given by

   (1/c)(Cβ̂ − d)ᵀ [var̂(Cβ̂)]⁻¹(Cβ̂ − d).    (∗∗)

When V is known, we have

   var(Cβ̂) = C(XᵀV⁻¹X)⁻Cᵀ,

or, if V is known up to a multiplicative constant, i.e., if V = σ²W for W
known, then

   var(Cβ̂) = σ² C(XᵀW⁻¹X)⁻Cᵀ,

which is estimated by

   var̂(Cβ̂) = S² C(XᵀW⁻¹X)⁻Cᵀ,   where S² = (y − Xβ̂)ᵀW⁻¹(y − Xβ̂)/{n − rank(X)}

is the MSE from the fitted model. In that case, (**) becomes

   F ≡ (Cβ̂ − d)ᵀ[C(XᵀW⁻¹X)⁻Cᵀ]⁻¹(Cβ̂ − d) / (cS²)

and, under H0, we have F ∼ F(c, n − rank(X)).
89
However, in the LMM when V is unknown, the estimation of var(Cβ̂)
becomes more challenging and no longer necessarily leads to an exact F
statistic.
That is, when V is unknown it still makes sense to use (**) as a test statistic, but now we use the Kenward-Roger estimate Φ̂adj to obtain var̂(Cβ̂).
This leads to the test statistic

   F̂ ≡ (1/c)(Cβ̂ − d)ᵀ[CΦ̂adjCᵀ]⁻¹(Cβ̂ − d).

This quantity no longer has an exact F distribution but, similarly to the
Satterthwaite procedure, we can approximate the distribution of F̂ by
assuming that

   λF̂ ∼ F(c, d)  (approximately)    (†)

for some λ and d.
A λ and d that make this a good approximation can be found by equating
the first and second moments of both sides of (†). This approach leads to

   λ = d / {(d − 2)E(F̂)}   and   d = (1 + 2/c) / [ var(F̂)/{2E(F̂)²} − 1/c ].

Formulas for E(F̂) and var(F̂) are given in Kenward and Roger (1997).
• This procedure for estimating λ and d gives approximate F statistics
that typically perform much better than asymptotic inference
techniques.
• These F statistics are approximate in general, but they reduce to the
exact F statistics in those cases in which exact results are available
(e.g., balanced anova models).
90
• The Kenward-Roger approach to inference is implemented in PROC
MIXED with the DDFM=KENWARDROGER option on the MODEL
statement.
• Alternatively, the DDFM=SATTERTH option implements a closely
related Satterthwaite approximation to obtain denominator degrees
of freedom for approximate F tests. In the Satterthwaite procedure,
the unadjusted estimator Φ̂ is used instead of Kenward and Roger’s
Φ̂adj to form the F statistic. That is, var̂(Cβ̂) = C{XᵀV(θ̂)⁻¹X}⁻Cᵀ
is used in the Satterthwaite procedure.
• There is not yet consensus on which approach to small sample inference is, in general, preferable in LMMs. However, a recent simulation study (Schaalje et al., 2002) suggests that the K-R procedure
may be superior, and it is what I recommend.
Inference on θ:
Often, the mean structure is of more interest when fitting a LMM than the
variance-covariance structure. However, adequate modeling of the var-cov
structure is important for model-based inference on the mean (on β).
• Overparameterization of the var-cov structure leads to inefficient estimation of β, low power, and potentially poor estimation of standard errors for β̂.
• On the other hand, an over-simplified var-cov structure can invalidate inferences on β.
In addition, the var-cov structure is sometimes of interest in and of itself, and correct modeling of the var-cov structure can be important when
handling certain types of missing data.
91
The major obstacle to inference on θ is that θ parameterizes the variance-covariance matrices D and R, which must be positive definite.
Therefore, the parameter space for θ is constrained, which can strongly
affect the distributional results typically used for inference.
• E.g., in variance component models, the elements of θ have the interpretation as variances, so they are necessarily non-negative. This
strongly affects the adequacy of distributional approximations for θ̂.
Wald Tests:
Based on classical likelihood theory,

   θ̂ML ∼ N(θ, avar(θ̂ML))   and   θ̂REML ∼ N(θ, avar(θ̂REML))  (approximately),

where the asymptotic var-cov matrices are given at the top of pp. 79 and
85 for ML and REML, respectively.
In principle, therefore, Wald-based inference for a linear combination η =
cᵀθ could be based on the distributional result

   (cᵀθ̂ − η) / √(cᵀ avar(θ̂) c) ∼ N(0, 1)  (approximately).    (†)

However, the adequacy of this normal approximation depends strongly on
how close η is to the boundary of its parameter space.
92
E.g., suppose η = θj and θj is a variance component (or a diagonal element
of D). Then if θj is close to zero, (†) becomes a poor approximation. In
fact, (†) breaks down altogether if θj = 0, or, more generally, if η is on the
boundary of its parameter space.
• So, if η is far from the boundary of its parameter space, (†) can be
used to form Wald confidence intervals and hypothesis tests.
– Here, how far η must be from its boundary depends upon the
sample size.
• However, (†) is useless for testing whether variance components are
equal to zero (i.e., for testing whether a certain random effect is
necessary in the model).
Likelihood Ratio Tests:
Similar comments apply to LR tests.
Let Θ denote the parameter space for θ. Then, according to classical
likelihood theory, a hypothesis of the form

   H0: θ ∈ Θ0,

where Θ0 is a subspace of Θ, can be tested with the LR statistic

   2{ℓ(θ̂ML) − ℓ(θ̂⁰ML)} ∼ χ²(dim(Θ) − dim(Θ0))  (approximately),

where θ̂ML and θ̂⁰ML are the MLEs over Θ and Θ0, respectively.
• Similarly, if the null hypothesis does not involve β, then a restricted
LR test can be done based on

   2{ℓR(θ̂REML) − ℓR(θ̂⁰REML)} ∼ χ²(dim(Θ) − dim(Θ0))  (approximately).

However, the regularity conditions that establish these results (Wilks’ Theorem) assume that θ is not on the boundary of its parameter space. So,
once again, standard asymptotic theory does not apply to LR testing for
the significance of variance components.
93
In selecting an appropriate variance-covariance specification for our model,
such hypotheses often arise.
• E.g., testing whether a certain random effect is necessary in the
model is equivalent to testing whether its associated variance component is zero.
Therefore, how can we go about building the var-cov structure of our
model?
One possible answer is to use the same Wald and LR test statistics, but
use their proper reference distributions under the null hypothesis.
• That is, when the hypothesis places the parameter on the boundary
of its parameter space, the usual normal and chi-square
limiting distributions are no longer correct; it does not mean that the test statistics are no longer appropriate. So, use the same statistics but use
the right reference distribution.
Unfortunately, figuring out what the right reference distribution is can be
hard, and sometimes, even if the distribution is known, obtaining critical
values or p-values from that distribution can be hard.
• The only truly simple case is when the error variance covariance
matrix is of the form R = σ 2 I and we are testing a null model
that includes i.i.d. random effects that are each q-variate versus
an alternative model with i.i.d. random effects that are each q + 1
variate, where under both the null and alternative each random effect
vector has a general, unstructured variance-covariance matrix.
– In this case, the null distribution of the LRT and restricted
LRT is a 50:50 mixture of a χ2 (q) and a χ2 (q + 1) distribution.
94
• E.g., suppose we are choosing between a model with no random
effects versus a model with a cluster-specific random effect bi, where
b1, . . . , bn iid N(0, σb²).
Then to test H0: σb² = 0, the LR and restricted LR tests both
have a null distribution that is a 50:50 mixture of a χ²(0) and a χ²(1)
distribution, where χ²(0) denotes the chi-square distribution with 0
d.f., which is 0 with probability 1.
– In this case, the correct p-value for H0 is exactly one-half the
value based on a χ²(1) distribution (see the short sketch following this list).
• As a second example, suppose we are choosing between a model with
a cluster-specific random effect bi, where b1, . . . , bn iid N(0, σb²),
versus a model with a bivariate cluster-specific random effect bi,
where

   b1, . . . , bn iid N(0, D),   D = [ θ11  θ12 ]
                                     [ θ12  θ22 ].

Then the LR and restricted LR tests both have a null distribution that
is a 50:50 mixture of a χ²(1) and a χ²(2).
• A table of critical values of 50:50 mixtures of χ2 (q) and χ2 (q + 1) is
given in the back of our text, and can be used for testing hypotheses
on variance components in this special case.
• However, more generally (e.g., when R ̸= σ 2 I, or when comparing
more complex random effects structures) the null distribution of the
LRT can be difficult to find and use.
– In such cases, our textbook authors suggest testing the hypotheses using the standard asymptotics of the LRT, but using
α = 0.1 instead of α = 0.05.
95
• Alternatively, if we are not interested in inference on the variance-covariance structure per se, but simply want to choose an adequate
var-cov structure so that we can get valid conclusions in our inferences on the mean, then we may choose not to do formal hypothesis
tests on the variance-covariance structure of the model, but simply
select an adequate var-cov structure via a model selection criterion,
such as AIC.
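As a quick illustration of the first example above, the boundary-corrected p-value for H0: σb² = 0 is just half the usual χ²(1) tail area. The LRT value in this Python sketch is hypothetical.

    from scipy.stats import chi2

    lrt = 3.2                      # hypothetical value of 2*(l_A - l_0)
    p_naive = chi2.sf(lrt, df=1)   # ignores the boundary
    p_mix = 0.5 * p_naive          # 50:50 mixture of chi2(0) and chi2(1)
    print(p_naive, p_mix)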
Model Selection Criteria:
Consider testing some hypothesis H0 versus an alternative HA where these
hypotheses correspond to nested models.
Let ℓ0 and ℓA denote the loglikelihood function evaluated at the MLE
under the null and alternative models, respectively. Further let #ϕ0 and
#ϕA denote the number of free parameters under H0 and HA .
Then the LR test rejects H0 if ℓA − ℓ0 is large in comparison to the
difference in d.f. of the two models being compared. That is, it rejects if

   ℓA − ℓ0 > F(#ϕA) − F(#ϕ0),

or equivalently, if

   ℓA − F(#ϕA) > ℓ0 − F(#ϕ0),

for an appropriate function F.
• For example, for an α-level LR test under standard regularity conditions (i.e., when the null hypothesis doesn’t place the parameter on
the boundary of the parameter space), F is a function such that

   F(#ϕA) − F(#ϕ0) = (1/2) χ²₁₋α(#ϕA − #ϕ0).

• This procedure can only be considered a formal hypothesis test if the
null and alternative correspond to nested models and if F(#ϕA) −
F(#ϕ0) gives the appropriate critical value from the reference distribution of ℓA − ℓ0.
96
However, there is no reason why the above procedure could not be used
as a rule of thumb for comparing any two (not necessarily nested) models,
or why we couldn’t consider other functions F(·) for choosing between
models.
• Some commonly used functions for F are given in Table 6.7 of Verbeke & Molenberghs (2000, p.74).
which AIC and BIC are by far the most common.
• The basic idea in all of these criteria is to compare models based on
their maximized loglikelihood values, but to penalize for the use of
too many parameters (penalize model complexity).
• The model with the largest value of whichever criterion is chosen is
the winner, but note that sometimes AIC and BIC are defined as −2
times the definition given here so that smallest is best.
• Note that for discriminating between variance-covariance structures,
ℓR may be used in place of ℓ as long as we replace n by n∗ = n − p,
the number of linearly independent error contrasts used to form ℓR .
• Which criterion performs best is not an easy question to answer and
depends upon the nature of the data, models, and the purpose to
which the models are to be put.
97
• AIC tends to penalize less for model complexity than BIC, so it tends
to err on the side of overspecified models, whereas BIC errs on the
side of underspecification.
– Underspecification tends to lead to bias, overspecification to
inefficiency (increased variance).
• For choosing an adequate variance-covariance structure when interest centers on the mean, the consequences of underspecification are
more dire. Therefore I do not recommend BIC.
• For selecting the variance-covariance structure in mixed models, AIC
is commonly used. However, AIC is an estimator of the expected
Kullback discrepancy between the true model and a fitted model,
and as such it is known to be downwardly biased by an amount that
disappears asymptotically, but can be substantial in small samples
or whenever p is large relative to the total sample size n.
• Therefore, a variety of alternatives to AIC have been proposed that
attempt to correct this bias with the goal of better performance in
small samples. One of the most popular and effective alternatives is
the corrected AIC (AICc) of Sugiura (1978) and Hurvich and Tsai
(1989, 1993, 1995):

   AICc = −2ℓ(ϕ̂) + 2k n∗/(n∗ − k − 1)   vs.   AIC = −2ℓ(ϕ̂) + 2k,

where ϕ̂ is the MLE of the model parameter ϕ and k = #ϕ (a small
helper function is sketched after this list).
• AICc is recommended in many regression contexts (CLM, GLMs,
time series analysis, nonlinear regression, etc.), but has not been formally justified in a mixed model context, especially for the selection
of the variance-covariance/random effects structure. Therefore, it is
not recommended for this purpose. Development of bias-corrected
versions of the AIC for mixed models is an active area of research. At
present, most of the methods that have been proposed and validated
are computationally intensive and not implemented in standard software.
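For completeness, here are AIC and AICc in the “smaller is better” (−2ℓ) form quoted above as tiny Python helpers; n_eff denotes n for ML or n* = n − p (the number of error contrasts) for REML, as discussed in the list, and both functions assume n_eff > k + 1.

    def aic(loglik, k):
        """AIC = -2*loglik + 2k; smaller is better."""
        return -2.0 * loglik + 2.0 * k

    def aicc(loglik, k, n_eff):
        """Corrected AIC; n_eff = n (ML) or n - p (REML error contrasts)."""
        return -2.0 * loglik + 2.0 * k * n_eff / (n_eff - k - 1.0)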
98
The Linear Mixed Model for Clustered Data:
So far, we have presented the LMM in its general form. However, application to longitudinal or other clustered data is among the most common
uses of the LMM, so we now consider the LMM in that context.
Suppose we have data on n subjects, where yi = (yi1 , . . . , yiti )T is the ti ×1
vector of observations available on the ith subject, i = 1, . . . , n. Then the
LMM for longitudinal data is given by
yi = Xi β + Zi bi + εi ,
i = 1, . . . , n,
where Xi and Zi are ti × p and ti × g design matrices for fixed effects β
and random effects bi , respectively. εi is a vector of error terms.
As before, it is assumed that
bi ∼ N (0, D(θ)), εi ∼ N (0, Ri (θ))
and b1 , . . . , bn , ε1 , . . . , εn are independent.
• Note that var(εi ) = Ri depends upon i through the dimension ti of
yi (the cluster size), and may also depend on i by having a different
form, or at least different parameter values for different subsets of
subjects.
• D and the dimension of bi, however, are assumed to be the same for all
i.
99
The specification above can be equivalently expressed as

   yi | bi ∼ N(Xi β + Zi bi, Ri(θ)),   where bi ∼ N(0, D(θ)).    (♣)
• (♣) is known as the hierarchical formulation of the model.
Note that the hierarchical model implies a marginal model. That is, the
marginal density of yi can be obtained from the conditional density of
yi|bi and the marginal density of bi through the relationship

   f(yi) = ∫ f(yi|bi) f(bi) dbi.

Because both densities in the integrand are normal, the integral yields a
normal density, so that

   yi ∼ N(Xi β, Vi(θ)),   Vi(θ) = Zi D(θ) Ziᵀ + Ri(θ).
• That is, the hierarchical (conditionally specified) model implies a
corresponding marginal model.
• Note however, that for arbitrary Vi (θ) (Vi (θ) an arbitrary p.d. matrix), or even Vi (θ) p.d. and of the form Zi D(θ)ZTi + Ri (θ) where
D and Ri are not assumed p.d., the marginal model does not necessarily imply the hierarchical one.
– That is, there is a subtle distinction between the hierarchical and marginal formulations of the model. This distinction
will become more important when we study generalized linear
mixed models.
100