CHAPTER 5: LONGITUDINAL DATA ANALYSIS
Population-Averaged Linear Models for Continuous Response

5.1 Introduction
We begin our discussion of modern models and methods for longitudinal data analysis by considering a general class of models and associated methods for continuous response that arises from
taking a population-averaged perspective. This class of models addresses all of the drawbacks of
classical models and methods summarized in Section 4.2.
Namely, models of this type do not require the data set to be balanced; i.e., the elements of the
response vectors Y i need not be observations taken at the same n time points. In addition, the model framework allows a very general specification for the form of the overall aggregate
covariance matrix of a data vector and allows it to differ depending on, for instance, the values of
covariates.
The population mean response is represented by a linear model that allows among- and within-individual covariates to be incorporated straightforwardly and involves parameters that characterize
features of the population mean response, such as patterns of change exhibited over time, and how
these features might be associated with among-individual covariates.
Finally, although the model incorporates an assumption of multivariate normality of a response
vector conditional on covariates, using large sample (large m) arguments, as long as m is large
enough and the model for the population mean response is correctly specified , it is possible to
show that estimators of parameters in the models are consistent for the true values and to deduce
an approximate normal sampling distribution for them, even if the true distribution of the response
is not normal. The approximate sampling distribution then forms the basis for inferential goals such
as assessments of uncertainty and hypothesis testing procedures.
Moreover, as we demonstrate, even if the representation for the overall pattern of covariance is not
correctly specified , estimators for parameters in a correctly specified population mean response
model are still consistent , and an approximate sampling distribution can be derived.
5.2 Model specification
BASIC MODEL: Recall again that the observed data are
(Y i , z i , ai ) = (Y i , x i ),   i = 1, ... , m,

independent across i, where Y i = (Yi1 , ... , Yini )^T , with Yij recorded at time tij , j = 1, ... , ni (possibly different times for different individuals); z i = (z i1^T , ... , z ini^T )^T comprising within-individual covariate information u i and the tij ; ai is a vector of among-individual covariates; and x i = (z i^T , ai^T )^T .
The population-averaged linear model we study in this chapter is most relevant when the responses Yij are continuous. The model is written as

Y i = X i β + ε i ,   i = 1, ... , m.   (5.1)
• In (5.1), X i is a design matrix for individual i depending on individual i’s covariates x i , examples of which we present momentarily.
• The deviation ε i = (ε i1 , ... , ε ini )^T is such that

E(ε i |x i ) = 0,   var(ε i |x i ) = V i = V i (ξ, x i ),   (5.2)
where V i (ξ, x i ) (ni × ni ) can depend on the covariates x i and on a vector of covariance parameters ξ, which includes correlation parameters α (s × 1) and variance parameters θ (r × 1).
We discuss examples shortly. We sometimes suppress this dependence for brevity and simply
write V i .
• The form of V i is specified by the data analyst in accordance with the features of the given
situation. Because of the dependence of V i on covariates, there is no requirement, for example,
that the form of the covariance matrix be the same for all individuals. We elaborate on this point
in the examples below.
• Ordinarily, it is assumed that the conditional distribution of ε i given x i is multivariate normal,

ε i |x i ∼ N {0, V i (ξ, x i )},   (5.3)

sometimes written more briefly as ε i |x i ∼ N (0, V i ).
• β is a vector of parameters characterizing the population mean response; that is, with the assumption on ε i in (5.2), we have that

E(Y i |x i ) = X i β   (ni × 1),   (5.4)
representing the population mean response for individual i, or indeed any individual in the population with covariates x i .
• From (5.2), it follows that
var(Y i |x i ) = V i = V i (ξ, x i )   (ni × ni ),   (5.5)
the overall population covariance matrix for an individual with covariates x i , characterizing the
aggregate pattern of covariance combining among- and within-individual sources for such an
individual.
• With the normality assumption (5.3), the model can be written succinctly as
Y i |x i ∼ N {X i β, V i (ξ, x i )},   i = 1, ... , m,   (5.6)
which we often abbreviate as
Y i |x i ∼ N (X i β, V i ), i = 1, ... , m.
REPRESENTATION OF COVARIANCE MATRIX: To facilitate thinking about models V i (ξ, x i ), it is
sometimes convenient to represent this covariance matrix as a product of “standard deviation
matrices ” and a correlation matrix. Let T i (θ, x i ) be the (ni × ni ) diagonal matrix whose diagonal
elements are models for var(Yij |x i ), depending on a parameter θ as above. Let Γi (α, x i ) be a (ni × ni )
correlation matrix, depending on a parameter α. Then it is straightforward to deduce (try it) that a
model for the overall covariance structure can be obtained as
V i (ξ, x i ) = T i^{1/2}(θ, x i ) Γi (α, x i ) T i^{1/2}(θ, x i ),   ξ = (θ^T , α^T )^T ,   (5.7)

where T i^{1/2}(θ, x i ) is the diagonal matrix whose diagonal elements are the models for the standard deviations {var(Yij |x i )}^{1/2} . Clearly, T i^{1/2}(θ, x i ) T i^{1/2}(θ, x i ) = T i (θ, x i ). We sometimes write T i and Γi for brevity, suppressing dependence on θ, α, and x i .
The representation (5.7) allows features of overall variance and the overall pattern of correlation to be considered separately. That is, one can entertain models for the correlation structure and beliefs about variance separately and combine them to arrive at an overall specification. We demonstrate in the examples below.
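As a concrete numerical sketch of (5.7), the following assembles V i from a variance model and a correlation matrix; the heterogeneous variances and the AR(1) correlation parameter are illustrative values, not quantities from the text.

```python
import numpy as np

# Sketch of the decomposition (5.7): V_i = T_i^{1/2} Gamma_i T_i^{1/2}.
# Variances and the AR(1) parameter below are illustrative only.
variances = np.array([1.0, 4.0, 9.0])                 # models for var(Y_ij | x_i)
alpha = 0.5
# AR(1) correlation matrix: (j, k) entry alpha^|j - k|
Gamma = alpha ** np.abs(np.subtract.outer(np.arange(3), np.arange(3)))

T_half = np.diag(np.sqrt(variances))                  # T_i^{1/2}: standard deviations
V = T_half @ Gamma @ T_half                           # V_i = T_i^{1/2} Gamma_i T_i^{1/2}

assert np.allclose(np.diag(V), variances)             # diagonal reproduces the variances
assert np.isclose(V[0, 1], 1.0 * 2.0 * 0.5)           # cov = sd_1 * sd_2 * corr_12
```

Any valid correlation matrix can be substituted for the AR(1) choice; the variance model and correlation model are specified independently, which is the point of (5.7).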
MODEL SUMMARY: It is often convenient to summarize the model as follows. Recall that the total number of observations Yij is N = Σ_{i=1}^{m} ni . Define

Y = (Y 1^T , ... , Y m^T )^T (N × 1),   X = (X 1^T , ... , X m^T )^T (N × p),   ε = (ε 1^T , ... , ε m^T )^T (N × 1).   (5.8)
Then (5.1) can be expressed compactly as (try it)

Y = X β + ε.   (5.9)

It follows from (5.2) that

E(ε|x̃) = 0,

where x̃ is the collection of all covariates x i , i = 1, ... , m, for all m individuals, so that, from (5.4),

E(Y |x̃) = X β.
Define the block diagonal matrix

             ⎡ V 1 (ξ, x 1 )        0         ···        0        ⎤
V (ξ, x̃) =  ⎢       0        V 2 (ξ, x 2 )   ···        0        ⎥   (N × N).   (5.10)
             ⎢       ⋮              ⋮          ⋱         ⋮        ⎥
             ⎣       0              0         ···   V m (ξ, x m ) ⎦

We often write (5.10) for brevity as

      ⎡ V 1   0   ···   0   ⎤
V =   ⎢  0   V 2  ···   0   ⎥   (N × N).   (5.11)
      ⎢  ⋮    ⋮    ⋱    ⋮   ⎥
      ⎣  0    0   ···  V m  ⎦
Then, from (5.2) and (5.5), defining similarly T (θ, x̃) = T and Γ(α, x̃) = Γ (N × N),

var(ε|x̃) = var(Y |x̃) = V (ξ, x̃) = V = T^{1/2} Γ T^{1/2} ,

which follows by the independence of ε i (and Y i ) for i = 1, ... , m.

Note that V in (5.11) has a different definition from that in Chapter 3. Henceforth, we use the symbol V in this way to represent the covariance matrix of the “stacked” random vectors ε and Y (conditional on the x i ).
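The block diagonal structure of (5.10)–(5.11) is easy to build directly; the sketch below uses small illustrative blocks of unequal size ni to emphasize that balance is not required.

```python
import numpy as np

# Sketch of the stacked covariance matrix (5.10)-(5.11): V is block diagonal in
# the V_i because individuals are independent. Block sizes here are illustrative.
def stack_blockdiag(blocks):
    """Place the V_i down the diagonal of an N x N matrix of zeros."""
    N = sum(B.shape[0] for B in blocks)
    V = np.zeros((N, N))
    start = 0
    for B in blocks:
        n_i = B.shape[0]
        V[start:start + n_i, start:start + n_i] = B
        start += n_i
    return V

V1 = np.array([[2.0, 1.0], [1.0, 2.0]])     # individual 1: n_1 = 2
V2 = np.array([[3.0]])                       # individual 2: n_2 = 1
V = stack_blockdiag([V1, V2])                # N = 3

assert V.shape == (3, 3)
assert np.allclose(V[:2, :2], V1) and V[2, 2] == 3.0
assert V[0, 2] == 0.0                        # zero covariance across individuals
```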
The model (5.6) can then be summarized by

Y |x̃ ∼ N {X β, V (ξ, x̃)},   (5.12)

or, briefly,

Y |x̃ ∼ N (X β, V ).   (5.13)
• In the literature on longitudinal data analysis and in software documentation , it is common to
write the model incorporating the normality assumption using this “stacked” notation, suppressing dependence of the overall covariance matrix on parameters and covariate information; that
is, as in (5.13).
• We use the more detailed notation (5.12) when we wish to emphasize explicitly the dependence of the covariance matrix on parameters and covariates.
REMARK: Recall that x i for individual i includes the times tij , j = 1, ... , ni , at which i was observed,
which, technically, are not “covariates ” in the strict sense, although they often play the role of
“covariates ” as far as implementation is concerned. Thus, conditioning on x i is really meant to
imply conditioning on all among- and within-individual covariates.
We now demonstrate features of the population-averaged linear model and its interpretation by
considering its specification in several examples.
EXAMPLE 1, DENTAL STUDY: We have already considered a population-averaged model for these
data in Section 2.4. Recall that there is one among-individual covariate, gender, which we represented for child i as gi = 0 if i is a girl and gi = 1 if i is a boy, so that ai = gi ; there are
no within-individual covariates u i . The response was measured for all m = 27 children at ages
(t1 , ... , t4 ) = (8, 10, 12, 14). Thus, z ij = tj for all i, and x i contains gi (and the four time points), so that conditioning on covariates x i corresponds to conditioning on gender.
From a population-averaged perspective, the primary question of interest is whether or not the rate
of change of the population mean response profile for boys differs from that for girls. In (2.22), we
specified a model for the population mean at time tij for a child of gender gi as
E(Yij |x i ) = {β0,B gi + β0,G (1 − gi )} + {β1,B gi + β1,G (1 − gi )}tij ,
(5.14)
so that β1,G and β1,B are the slopes of the assumed straight line population mean profiles for girls
and boys, respectively. Interest is in comparing β1,G and β1,B .
Thus, β = (β0,G , β1,G , β0,B , β1,B )^T , p = 4, and, for child i,

      ⎡ (1 − gi )  (1 − gi )t1   gi   gi t1 ⎤
X i = ⎢     ⋮           ⋮        ⋮     ⋮   ⎥   (4 × 4),   (5.15)
      ⎣ (1 − gi )  (1 − gi )t4   gi   gi t4 ⎦

so that

      ⎡ 1  t1  0  0 ⎤              ⎡ 0  0  1  t1 ⎤
X i = ⎢ ⋮   ⋮  ⋮  ⋮ ⎥   or   X i = ⎢ ⋮  ⋮  ⋮   ⋮ ⎥ ,   (5.16)
      ⎣ 1  t4  0  0 ⎦              ⎣ 0  0  1  t4 ⎦

if i is a girl or boy, respectively.
Clearly, X i in (5.15) is not of full rank for any i. Intuitively, this reflects the fact that a boy does not provide information on parameters describing the population mean for girls, and vice versa. However, it is straightforward to observe that the “stacked” design matrix X in (5.8) has full column rank p = 4, as it comprises 11 matrices X i like that on the left hand side of (5.16) stacked on top of 16 like that on the right hand side; the p = 4 columns of X are clearly linearly independent (check). This demonstrates that making inference on β is feasible from data like those in the study, involving children of both genders.
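The rank argument can be checked numerically; the sketch below builds the two gender-specific design matrices of (5.16) and verifies that each has rank 2 while the stacked X over all 27 children has full column rank p = 4.

```python
import numpy as np

# Rank check for the dental study design matrices (5.15)-(5.16).
t = np.array([8.0, 10.0, 12.0, 14.0])    # ages at which distance was measured

def X_child(g):
    """Design matrix (5.15) for beta = (b0G, b1G, b0B, b1B)^T; g = 0 girl, 1 boy."""
    return np.column_stack([(1 - g) * np.ones(4), (1 - g) * t,
                            g * np.ones(4), g * t])

X_girl, X_boy = X_child(0), X_child(1)
assert np.linalg.matrix_rank(X_girl) == 2    # a girl informs only (b0G, b1G)
assert np.linalg.matrix_rank(X_boy) == 2     # a boy informs only (b0B, b1B)

# Stack 11 girls and 16 boys, as in the dental study (m = 27).
X = np.vstack([X_girl] * 11 + [X_boy] * 16)
assert np.linalg.matrix_rank(X) == 4         # full column rank: beta is estimable
```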
To complete the model, we specify a model V i (ξ, x i ) for the overall pattern of covariance var(Y i |x i ).
Because these data are balanced , it was straightforward to calculate the sample overall covariance
matrices and their associated correlation matrices for each gender in Section 2.6. Recall that the
numerical estimates in (2.33) and (2.34) suggest the following:
• Overall variance is likely constant over time for each gender, but the variance estimates are larger for boys than for girls. Formally, the data suggest that var(Yij |x i ) is the same for all j within each gender but larger for boys. Thus, recognizing that conditioning on x i is really conditioning on gi , a reasonable model is

var(Yij |gi = 0) = σG^2 ,   var(Yij |gi = 1) = σB^2 .   (5.17)

The specification in (5.17) can be represented by taking T i (θ, x i ) to be the diagonal matrix with diagonal elements all equal to σG^2 (1 − gi ) + gi σB^2 , or, equivalently,

T i (θ, x i ) = {σG^2 (1 − gi ) + gi σB^2 } I 4 ,   θ = (σG^2 , σB^2 )^T .
• The ages are equally-spaced in time, so any model that is reasonable under this condition is
possible. The empirical evidence suggests that, for each gender, the overall pattern of correlation is approximately compound symmetric with a different correlation parameter α in (2.25)
for each gender. That is,
Γi (α, x i ) = [1 − {(1 − gi )αG + gi αB }]I 4 + {(1 − gi )αG + gi αB }J 4 ,
where thus α = (αG , αB )T .
Combining the above, the suggested covariance model is

            ⎡ 1    αG   ···  αG ⎤
V i = σG^2  ⎢ αG   1    ···  αG ⎥ = σG^2 {(1 − αG )I 4 + αG J 4 }   (5.18)
            ⎢ ⋮    ⋮     ⋱    ⋮ ⎥
            ⎣ αG   αG   ···  1  ⎦

for girls, and

σB^2 {(1 − αB )I 4 + αB J 4 }   (5.19)

for boys. The covariance parameter ξ characterizing V i is then ξ = (σG^2 , σB^2 , αG , αB )^T .
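A minimal sketch of the gender-specific compound symmetric model (5.18)–(5.19); the numerical values of (σG^2 , σB^2 , αG , αB ) below are placeholders for illustration, not estimates from the dental data.

```python
import numpy as np

# Sketch of the dental covariance model (5.18)-(5.19): compound symmetric
# V_i = sigma2 * {(1 - alpha) I_4 + alpha J_4}, with (sigma2, alpha) by gender.
# All parameter values are illustrative placeholders.
def V_dental(g, sigma2_G=4.0, sigma2_B=6.0, alpha_G=0.6, alpha_B=0.5):
    sigma2 = sigma2_G if g == 0 else sigma2_B
    alpha = alpha_G if g == 0 else alpha_B
    return sigma2 * ((1 - alpha) * np.eye(4) + alpha * np.ones((4, 4)))

V_girl = V_dental(0)
assert np.allclose(np.diag(V_girl), 4.0)       # constant variance sigma2_G
assert np.isclose(V_girl[0, 3], 4.0 * 0.6)     # every off-diagonal is sigma2_G * alpha_G
```

Note that the same function form serves both genders; only the pair (σ^2 , α) selected by gi changes, exactly as the indicator parameterization above expresses.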
ALTERNATIVE PARAMETERIZATION: As with any linear model , it is possible to represent the population mean response model (5.14) using a different parameterization. Because interest focuses
on the difference in slopes characterizing the rates of change of population mean dental distance
for boys and girls, it is natural to express the population mean directly in terms of a parameter
representing this difference. Thus, an equivalent alternative to (5.14) is
E(Yij |x i ) = {β0,G + β0,B−G gi } + {β1,G + β1,B−G gi }tij .
(5.20)
In (5.20), β0,B−G and β1,B−G represent the differences in intercept and slope between boys and girls, respectively, and will be positive if that for boys exceeds that for girls. Moreover, for example, the slope of the population mean response for boys is then β1,G + β1,B−G , and similarly for the intercept.
REMARK: The population mean response model in (5.14) or (5.20) in no way requires the time points
tij to be the same for each child. Even if these data were not balanced , there would be no problem
specifying such a model. Specification of the covariance model when data are not balanced does
require some special consideration; we discuss this shortly.
[Figure: four panels of guinea pig body weight (g) versus weeks, one for each of the zero, low, and high dose groups (individual profiles labeled by pig number), plus a panel of sample averages by dose group.]
Figure 5.1: Growth of guinea pigs receiving different doses of vitamin E diet supplement.
EXAMPLE 2, GUINEA PIG DIET STUDY: The same considerations apply to specification of a
population-averaged model for these data, which are also balanced. We discuss specification of
a model for population mean response, which illustrates some key issues.
Recall from Section 1.2 that 15 guinea pigs were given a growth-inhibiting substance at baseline (time 0, beginning of the first week). At weeks 1, 3, and 4, body weight was measured. Immediately after the week 4 measurement (so at the start of week 5), the pigs were randomized to receive a zero, low, or high dose of vitamin E, 5 pigs per group, and body weight was subsequently recorded at weeks 5, 6, and 7. Thus all m = 15 pigs were observed at times (t1 , ... , t6 ) = (1, 3, 4, 5, 6, 7), so z ij = tj for all i, and ai is the among-individual covariate dose group, with three possible values, which can be represented as ai = (di1 , di2 , di3 )^T , where di` = 1 if pig i was randomized to dose group ` and di` = 0 otherwise, and ` = 1, 2, 3 correspond to zero, low, and high dose.
We reproduce Figure 1.3 from Chapter 1 for convenience as Figure 5.1.
Because the pigs were treated identically until the end of week 4, a reasonable model for population
mean response takes it to be identical for pigs in all three groups through week 4. Because pigs were
then randomized at this time to receive one of the three doses, a model should allow the population
mean response profile to be potentially different for each dose group henceforth.
That is, a plausible population mean model has two “phases ,” before and after introduction of vitamin
E, where the second “phase ” is different for each group.
From the plot of sample averages over time in Figure 5.1, a model that takes each of these phases
to be a straight line is reasonable, where the intercept and slope of the first phase is the same for
all groups. A model that incorporates these features is the linear spline model
E(Yij |x i ) = β0 + β1 tij + Σ_{`=1}^{3} β2` di` (tij − 4)+ ,   (5.21)

with a knot at week 4, where

x+ = x if x ≥ 0,   x+ = 0 if x < 0.

From (5.21), for any pig, population mean response follows the straight line β0 + β1 t through week t = 4. For t ≥ 4, for a pig in group `, population mean response is represented as

β0 + β1 t + β2` (t − 4) = {β0 + β1 (4)} + (β1 + β2` )(t − 4),

so that, with t = 4 as the “origin,” population mean weight follows a straight line for t ≥ 4 with “intercept” (value at t = 4 when the dose was administered) β0 + β1 (4) and slope β1 + β2` .
Differences in population mean response trajectory are reflected in (5.21) by differences among the
β2` , ` = 1, 2, 3. The model (5.21) could of course be parameterized in alternative ways. The model
allows the possibility that the population mean profile for the zero dose group changes after week
4, even though the pigs in this group did not receive vitamin E. If there were reason to believe that
the population mean trajectory for pigs not receiving vitamin E before week 4 should continue after
week 4, a modification of the model would be to take β21 = 0 in (5.21); however, the visual evidence
in Figure 1.3 does not support this. Perhaps the effect of the growth-inhibiting substance begins to
manifest at week 4, leading to a downward trend, but the addition of vitamin E mitigates this effect.
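The two-phase behavior of the spline model (5.21) can be verified numerically; the parameter values below are invented for illustration, not fitted estimates from the guinea pig data.

```python
import numpy as np

# Sketch of the linear spline mean model (5.21) with a knot at week 4.
# Parameter values are illustrative only.
def mean_weight(t, d, beta0, beta1, beta2):
    """E(Y | t, dose group): d = (d1, d2, d3) group indicators, beta2 = (b21, b22, b23)."""
    plus = np.maximum(t - 4.0, 0.0)                    # the truncated term (t - 4)_+
    return beta0 + beta1 * t + np.dot(d, beta2) * plus

beta0, beta1 = 400.0, 20.0
beta2 = np.array([-5.0, 0.0, 5.0])                     # one slope change per dose group

# Before the knot, all three groups share the same straight line...
for d in (np.array([1, 0, 0]), np.array([0, 1, 0]), np.array([0, 0, 1])):
    assert mean_weight(3.0, d, beta0, beta1, beta2) == 400.0 + 20.0 * 3.0
# ...after the knot, group l follows slope beta1 + beta2l.
d_high = np.array([0, 0, 1])
assert mean_weight(5.0, d_high, beta0, beta1, beta2) == 400.0 + 20.0 * 5.0 + 5.0
```

Setting β21 = 0 (zero dose group keeps its pre-knot slope) corresponds exactly to the modification discussed above.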
Figure 5.2: Hæmatocrit trajectories for hip replacement patients. The left hand panels show individual
profiles by gender; the right hand panels show a fitted quadratic model for the mean superimposed.
HIP REPLACEMENT STUDY: These data are adapted from Crowder and Hand (1990, Section 5.2). Thirty patients underwent hip replacement surgery, 13 males and 15 females. Hæmatocrit, the ratio of the volume of packed red blood cells to the volume of whole blood, recorded as a percentage, was planned to be measured on each patient at baseline (week 0), prior to surgery, and then at weeks 1, 2, and 3 post-surgery. In addition to gender, the age of each patient was also recorded. The data are
shown in Figure 5.2.
The primary objectives are to determine if there are differences in the population mean pattern of
change of hæmatocrit following surgery between genders and to characterize the patterns of change.
It is evident in the left hand panels of Figure 5.2 that several patients of both genders are missing the measurement at week 2; there is also one female who is missing both this and the baseline measurement. Crowder and Hand do not offer an explanation; because this is so systematic, occurring for about half of the male and half of the female patients, it is plausible that these observations are missing for reasons having nothing to do with the health status of the patients but rather might reflect, for example, failure of the equipment used to ascertain hæmatocrit values during week 2. We downplay this complication for now and return to the issue of missing responses later in this chapter.
These data exemplify the common situation where, although it was planned to record the response
at n = 4 prespecified times (0,1,2,3 weeks), not all individuals have all responses recorded, so that ni
varies with i, although those that are available are at the prespecified times. That is, ni = 4 for some
patients, for whom tij = 0, 1, 2, 3 for j = 1, ... , 4; ni = 3 for those missing the week 2 measurement,
so that tij = 0, 1, 3, j = 1, ... , 3; and ni = 2 for the female patient missing the baseline and week 2
responses, so that tij = 1, 3, j = 1, 2. For patient i, z ij = tij , j = 1, ... , ni , and ai = (gi , ai )T , where gender
gi = 0 for females and gi = 1 for males; and ai is the age of the patient (years), ranging from 47 to 79
for females (sample average 66.07) and 44 to 74 for males (65.52).
For both genders, Figure 5.2 shows that hæmatocrit drops from baseline after surgery and then
begins to rebound over the 3 weeks post-surgery. This suggests that the following quadratic model
for population mean is reasonable, which allows the pattern to differ between genders:
E(Yij |x i ) = {β0,F (1 − gi ) + β0,M gi } + {β1,F (1 − gi ) + β1,M gi }tij + {β2,F (1 − gi ) + β2,M gi }tij^2 .   (5.22)
The basic model (5.22) can be modified to incorporate the possibility that features of the mean response are age-dependent; for example,

E(Yij |x i ) = {β0,F (1 − gi ) + β0,M gi } + {β3,F (1 − gi ) + β3,M gi }ai
   + {β1,F (1 − gi ) + β1,M gi }tij + {β2,F (1 − gi ) + β2,M gi }tij^2   (5.23)

allows mean hæmatocrit at baseline to depend on patient age in a way that is different for each gender. The linear and quadratic effects that govern the pattern of change post-baseline could be modified similarly, and any of these models could be reparameterized in terms of parameters representing the differences in intercept and linear and quadratic effects between genders.
Plausible models V i (ξ, x i ) for the overall pattern of covariance include those that are suited to what
are ideally balanced data with equally-spaced time points; however, fitting of such models requires
that the missing values for some patients be taken into account appropriately. We discuss this
shortly.
HIV CLINICAL TRIAL: These data are reported in Fitzmaurice, Laird, and Ware (2011) and are from
a randomized, double-blind clinical trial, AIDS Clinical Trials Group (ACTG) Study 193A, in patients
infected with human immunodeficiency virus (HIV) exhibiting advanced immune suppression; i.e.,
CD4 T-cell counts ≤ 50 cells/mm3 . CD4 count is a standard measure reflecting the status of the
immune system , which is compromised in patients with HIV infection.
[Figure: log CD4 versus week (0 to 40), one panel per regimen: ZDV + alt ddI, ZDV+ZAL, ZDV+ddI, and ZDV+ddI+NVP.]
Figure 5.3: log(CD4+1) profiles for subjects in ACTG Study 193A. A loess smoother fitted to all the
data for each treatment is superimposed on the individual profiles in each panel.
1313 subjects were randomized to one of four daily treatment regimens consisting of dual or triple
combinations of drugs in the class of HIV-1 reverse transcriptase inhibitors: (1) 600 mg of zidovudine
(ZDV) alternating monthly with 400 mg of didanosine (ddI); (2) 600 mg ZDV plus 2.25 mg zalcitabine
(ZAL); (3) 600 mg ZDV plus 400 mg ddI; or (4) 600 mg ZDV plus 400 mg ddI and 400 mg nevirapine
(NVP) (triple therapy).
CD4 measurements were planned at baseline (week 0) and then at 8-week intervals during follow-up,
at weeks 8, 16, 24, 32, and 40. Figure 5.3 shows the individual log-transformed CD4 profiles for
subjects randomized to each treatment regimen; because CD4 count of zero is possible, it is customary to take the response variable to be log(CD4+1) (transformed CD4 counts appear approximately
normally distributed). As can be seen from the plots, actual visits did not necessarily take place
at exactly these times; moreover, some subjects skipped visits altogether or dropped out of the
study before 40 weeks.
For example, visit times for the first subject in the ZDV+ZAL group were tij = 0, 7.6, 15.6, 23.6, 32.6,
and 40 weeks; the first subject in the ZDV+ddI group had actual visits at tij = 0, 7.1, 16.1, and 32.4
weeks. The number of CD4 measurements per subject ranged from 1 to 9, with a median of 4.
An approximation to addressing this issue would be to “bin ” actual visit times to correspond to the
intended times, so that, for example, 7.6 and 7.1 weeks would be rounded to 8 weeks. However, as
discussed in Section 4.2, treating all responses within some interval of an intended visit time as if
they were all observed at that time is ad hoc , with unknown effects on inference. If the actual visit
times are available, clearly it is preferable to incorporate them in an analysis.
In addition to treatment regimen, also recorded for each subject is age (years) and gender; thus,
the among-individual covariates are ai = (gi , ai , δi1 , ... , δi4 )T , where gi = 0 (1) for a female (male)
subject; ai is age; and δi` = 1 if subject i was randomized to treatment regimen ` and 0 otherwise,
` = 1, ... , 4.
A local polynomial regression (loess) curve naïvely fitted to all the data for each treatment is
superimposed on each panel in Figure 5.3 as suggested in Section 2.6 to give a rough idea of the
overall population mean trend. The visual evidence suggests that a straight line might provide a
reasonable representation of the overall population mean response in each group, although the triple
therapy group shows a subtle rise followed by a decay, which might be better captured by a quadratic.
Downplaying this for now, a simple model that allows a separate, straight line mean trajectory for each
treatment is
E(Yij |x i ) = β0 + {β14 + β11 δi1 + β12 δi2 + β13 δi3 }tij .   (5.24)
In (5.24), the intercept is taken to be the same for all regimens; because subjects were randomized
to the four regimens, the mean response at baseline (week 0), prior to the start of treatment, should
be identical for all regimens, assuming that the randomization was carried out faithfully. Indeed, the
sample averages of log-transformed CD4 at baseline are 2.98, 2.93, 2.91, and 2.84 for subjects
randomized to regimens 1 – 4.
We have parameterized the slope term in braces so that the triple therapy regimen 4 is the reference
regimen. That is, β14 is the slope for the mean CD4 profile for regimen 4, and β14 + β1` is the slope
for regimen ` = 1, 2, 3, so that β1` , ` = 1, 2, 3 represents the difference in slope relative to triple
therapy. Of course, an alternative parameterization in terms of separate slopes for each regimen
is possible; likewise, allowing for separate intercepts would allow investigation of the integrity of the
randomization. Model (5.24) could also be modified to incorporate dependence of intercept and slope
on age and gender or to allow quadratic effects.
Specification of a covariance model V i (ξ, x i ) requires some care. Because individuals were seen at potentially different times, with different numbers of visits, models for balanced and equally-spaced data might not be suitable.
CONSIDERATIONS FOR COVARIANCE MODELS, BALANCED DATA: When the data are balanced, as for the dental study, inspection of sample covariance and correlation matrices, scatterplot matrices, autocorrelation functions, and lag plots, as discussed in Section 2.6, can assist the analyst in identifying plausible models.
In fact, these approaches can be refined to take into account a postulated population mean model so
as to take advantage of the belief that the mean follows a smooth trajectory. Instead of basing these
diagnostic aids on sample means at each time point, one can instead estimate those means by a
preliminary fit of the mean model using ordinary least squares , treating the observations from all
individuals as if they are all mutually independent. Although this sounds suspect, as we discuss in
Section 5.5, if the mean model is correctly specified in the sense defined in Section 4.3, then the
OLS estimator for β in the overall population mean model X β is consistent for the true value β 0 .
Thus, at least for m “large,” using the predicted values from the OLS fit to estimate the population
means should be reasonable.
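The preliminary OLS idea can be sketched on simulated data with deliberately correlated within-individual errors; the data-generating values below are invented for illustration. Despite ignoring the correlation, the pooled OLS fit recovers the true mean parameters when the mean model is correct.

```python
import numpy as np

# Sketch: fit the mean model by OLS, treating all N observations as independent.
# Simulated data: shared random intercept induces within-individual correlation.
rng = np.random.default_rng(0)
m, n = 50, 4
t = np.array([0.0, 1.0, 2.0, 3.0])
beta_true = np.array([10.0, 2.0])          # true intercept and slope (illustrative)

rows, ys = [], []
for i in range(m):
    X_i = np.column_stack([np.ones(n), t])
    e_i = rng.normal(0, 1) + rng.normal(0, 0.5, size=n)   # correlated errors
    rows.append(X_i)
    ys.append(X_i @ beta_true + e_i)

X = np.vstack(rows)                         # (N x p) stacked design, N = m * n
Y = np.concatenate(ys)
beta_ols, *_ = np.linalg.lstsq(X, Y, rcond=None)

# With a correctly specified mean model, OLS is consistent even though the
# within-individual correlation is ignored; for m = 50 the estimate is close.
assert np.allclose(beta_ols, beta_true, atol=1.0)
```

The fitted values X @ beta_ols then estimate the population means at each time, which is all the diagnostic plots require.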
CONSIDERATIONS FOR COVARIANCE MODELS, DATA NOT BALANCED: When a longitudinal data set is not balanced, it is more difficult to think about plausible covariance models. More ominously, if the intention was to record the response at the same prespecified times for all individuals, but some observations are missing for some individuals, then things become more complicated. In Section 5.6, we discuss the challenges associated with such missing data and the assumptions that must be fulfilled to enable valid inferences to be drawn using the models and methods in this and the next chapter.
For now, we limit our discussion to operational issues associated with specifying a covariance structure in this situation. Consider the hip replacement study, where the times of observation are the
same for all individuals except that some individuals are missing the response at some of these
times.
Recall that, as discussed in Section 1.3, our notational convention is that Y i is the (ni × 1) vector of
responses actually observed and recorded at times ti1 , ... , tini on individual i. Let
Z i = (Zi1 , ... , Zin )^T   (5.25)
be the (n × 1) vector of intended responses to be collected at times t1 , ... , tn , where n ≥ ni for all
i = 1, ... , m. In the literature on missing data methods, Z i is referred to as the full data for subject i;
see Section 5.6.
• Clearly, for an individual for whom all intended responses are observed, ni = n and Y i = Z i . Thus, V i for such an individual is a model for var(Y i |x i ) = var(Z i |x i ).
• For an individual with some components of Z i not observed (missing), we can make a correspondence as follows. Consider the hip replacement study, where n = 4,
Z i = (Zi1 , ... , Zi4 )T ,
(t1 , ... , t4 ) = (0, 1, 2, 3).
Consider an individual who is missing the intended observation at t3 = 2 weeks. Then
Y i = (Zi1 , Zi2 , Zi4 )T at times (ti1 , ti2 , ti3 ) = (t1 , t2 , t4 ) = (0, 1, 3).
• Here, V i is a model for var(Y i |x i ), namely, for

⎡ var(Zi1 |x i )          cov(Zi1 , Zi2 |x i )   cov(Zi1 , Zi4 |x i ) ⎤
⎢ cov(Zi2 , Zi1 |x i )    var(Zi2 |x i )         cov(Zi2 , Zi4 |x i ) ⎥ ,   (5.26)
⎣ cov(Zi4 , Zi1 |x i )    cov(Zi4 , Zi2 |x i )   var(Zi4 |x i )       ⎦

which we can write equivalently as in (5.7) as

V i = T i^{1/2} Γi T i^{1/2} ,   (5.27)

where T i = diag{var(Zi1 |x i ), var(Zi2 |x i ), var(Zi4 |x i )}, and

      ⎡ 1                      corr(Zi1 , Zi2 |x i )   corr(Zi1 , Zi4 |x i ) ⎤
Γi =  ⎢ corr(Zi2 , Zi1 |x i )  1                       corr(Zi2 , Zi4 |x i ) ⎥ .   (5.28)
      ⎣ corr(Zi4 , Zi1 |x i )  corr(Zi4 , Zi2 |x i )   1                     ⎦
It should be clear from (5.26) that there is no conceptual problem in positing an unstructured covariance matrix under these circumstances; the only caveat is that some bookkeeping is necessary to establish the correspondence between observed and intended time points. Similarly, specification of a compound symmetric correlation structure is not problematic, as the correlation between any two elements of Z i , and thus of Y i (given x i ), is the same under this model.
Here, the intended time points are equally-spaced, so that the one-dependent model in (2.27) and
the AR(1) model in (2.28) are also candidates.
It is straightforward to see that the one-dependent model for the situation in (5.26) takes Γi in (5.28) to be

⎡ 1  α  0 ⎤
⎢ α  1  0 ⎥ ,
⎣ 0  0  1 ⎦

and the corresponding AR(1) model is

⎡ 1    α    α^3 ⎤
⎢ α    1    α^2 ⎥
⎣ α^3  α^2  1   ⎦

(check). Software packages for fitting population-averaged linear models using the methods discussed in the next section incorporate the appropriate bookkeeping for this situation.
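The bookkeeping amounts to building the correlation matrix on the intended grid and extracting the submatrix for the observed visits; the sketch below does this for the AR(1) model and a patient missing the week 2 visit, with an illustrative value of α.

```python
import numpy as np

# Bookkeeping sketch for the hip replacement pattern: AR(1) correlation on the
# intended grid (t1,...,t4) = (0,1,2,3), then the submatrix for observed visits.
alpha = 0.5                                               # illustrative value
intended = np.arange(4)                                   # positions of weeks 0,1,2,3
Gamma_full = alpha ** np.abs(np.subtract.outer(intended, intended))

observed = [0, 1, 3]                                      # weeks 0, 1, 3 observed
Gamma_i = Gamma_full[np.ix_(observed, observed)]

# Matches the 3 x 3 AR(1) matrix in the text: entries alpha, alpha^3, alpha^2.
expected = np.array([[1, alpha, alpha**3],
                     [alpha, 1, alpha**2],
                     [alpha**3, alpha**2, 1]])
assert np.allclose(Gamma_i, expected)
```

The same indexing works for the one-dependent or unstructured models; only the full-grid matrix changes.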
In a situation like the HIV clinical trial in ACTG Study 193A, things are more complex. In this study,
we can still conceive of the full data that were intended to be collected on each subject; that is, the
vector of intended responses Z i as in (5.25) at prespecified times (t1 , ... , tn ).
Strictly speaking, however, each individual i is seen at potentially different time points , so that,
operationally , the covariance models that can be feasibly entertained are limited. For example,
it is not possible to take the covariance matrix to be completely unstructured , as individuals seen
at different time points cannot share the same covariance parameters, so that the vector ξ could be
potentially different for each i (and thus infeasible to estimate ).
Recognizing that the actual time points for most individuals target the intended, equally-spaced time points, models such as the compound symmetric, one-dependent, and AR(1) might be reasonable approximations to the true covariance structure. Alternatively, if within-individual sources of correlation are pronounced, correlation models such as the exponential (2.31) or Gaussian (2.32), which depend on the distances between actual time points, are also feasible.
In Chapter 6, we discuss subject-specific linear models, for which a model for V_i is induced through specification of separate models for the contributions to the overall covariance structure from within- and among-individual sources. This structure "automatically" addresses complications arising because of imbalance.
5.3 Maximum likelihood estimation under normality
Given a model specification

E(Y_i | x_i) = X_i β,   var(Y_i | x_i) = V_i = V_i(ξ, x_i) = T_i^{1/2}(θ, x_i) Γ_i(α, x_i) T_i^{1/2}(θ, x_i),

as in (5.4), (5.5), and (5.7), and using the independence of (Y_i, x_i), i = 1, ..., m, it is possible to formulate estimating equations that can be solved to yield estimators for the mean parameters β (p × 1) and covariance parameters ξ = (θ^T, α^T)^T ((r + s) × 1).
LOGLIKELIHOOD: Specifically, under the additional assumption that the conditional distribution of
Y i given x i is multivariate normal as in (5.6) and using the independence across i, we can appeal
to the principle of maximum likelihood to derive estimators for β and ξ as follows.
Writing the model succinctly as in (5.12), the joint density for Y conditional on x̃ is

p(y | x̃; β, ξ) = (2π)^{-N/2} |V(ξ, x̃)|^{-1/2} exp{-(y - Xβ)^T V^{-1}(ξ, x̃)(y - Xβ)/2}

  = ∏_{i=1}^m (2π)^{-n_i/2} |V_i(ξ, x_i)|^{-1/2} exp{-(y_i - X_i β)^T V_i^{-1}(ξ, x_i)(y_i - X_i β)/2}.   (5.29)
It follows from (5.29) that the loglikelihood has the form, ignoring constants,

l(β, ξ) = (-1/2){ log |V(ξ, x̃)| + (Y - Xβ)^T V^{-1}(ξ, x̃)(Y - Xβ) }   (5.30)

  = (-1/2) Σ_{i=1}^m { log |V_i(ξ, x_i)| + (Y_i - X_i β)^T V_i^{-1}(ξ, x_i)(Y_i - X_i β) }.   (5.31)
ESTIMATING EQUATIONS: We appeal to standard matrix differentiation results summarized in Appendix A to derive the estimating equations (score equations ) whose joint solution in β and ξ
leads to the maximum likelihood estimators for these parameters under the assumption of multivariate normality.
Differentiating (5.30), and equivalently (5.31), with respect to β (p × 1) yields the estimating equation

X^T V^{-1}(ξ, x̃)(Y - Xβ) = Σ_{i=1}^m X_i^T V_i^{-1}(ξ, x_i)(Y_i - X_i β) = 0,   (5.32)
which follows (verify) using the following results in Appendix A:
• For x (n × 1), symmetric (n × n) matrix A, and quadratic form Q = x T Ax, ∂Q/∂x = 2Ax (n × 1).
• If x depends on β (p × 1), the chain rule then gives ∂Q/∂β = (∂x/∂β)(∂Q/∂x), where (∂x/∂β)
is a (p × n) matrix.
It is straightforward to observe the following.

• The estimating equation (5.32) can be solved explicitly for β, yielding

β = {X^T V^{-1}(ξ, x̃) X}^{-1} X^T V^{-1}(ξ, x̃) Y = { Σ_{i=1}^m X_i^T V_i^{-1}(ξ, x_i) X_i }^{-1} Σ_{i=1}^m X_i^T V_i^{-1}(ξ, x_i) Y_i.   (5.33)

Thus, if V (equivalently V_i, i = 1, ..., m) were known; i.e., if ξ were known, then (5.33) defines explicitly an estimator for β.
Of course, the covariance parameter ξ is ordinarily not known and must be estimated , which
can be accomplished by solving another estimating equation discussed below, jointly with
(5.32).
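To fix ideas, here is a minimal numerical sketch of the GLS form (5.33) in the idealized case where ξ, and hence each V_i, is known; the compound symmetric V_i, sample sizes, and true β below are invented purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, p = 200, 4, 2
beta0 = np.array([1.0, -0.5])          # hypothetical true mean parameters

# Known common covariance matrix V_i (compound symmetric, for illustration)
V = 0.4 * np.ones((n, n)) + 0.6 * np.eye(n)
Vinv = np.linalg.inv(V)
L = np.linalg.cholesky(V)

# Accumulate the two sums in (5.33) across individuals
XtVX = np.zeros((p, p))
XtVY = np.zeros(p)
for i in range(m):
    Xi = np.column_stack([np.ones(n), np.arange(n)])   # intercept + time
    Yi = Xi @ beta0 + L @ rng.standard_normal(n)        # simulated response
    XtVX += Xi.T @ Vinv @ Xi
    XtVY += Xi.T @ Vinv @ Yi

beta_hat = np.linalg.solve(XtVX, XtVY)   # (5.33) with V_i known
print(beta_hat)
```

With m = 200 individuals the estimate lands close to the generating β, consistent with unbiasedness of (5.32).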
• If the model E(Y i |x i ) = X i β is correctly specified , then it is straightforward to observe that
(5.32) is an unbiased estimating equation.
In fact , even if the model V i (ξ, x i ) is not a correct specification for var(Y i |x i ), the estimating
equation is still unbiased. This suggests that, even if we have specified the covariance structure incorrectly, the maximum likelihood estimator for β under normality will be consistent for
the true value β 0 as long as the mean model is correctly specified.
• Moreover, the foregoing observations hold whether or not the distribution of Y i |x i is actually
multivariate normal.
Now consider differentiation of the loglikelihood (5.30), and equivalently (5.31), with respect to ξ ((r + s) × 1). This is again straightforward using the following matrix differentiation results from Appendix A. Let V(ξ) be an (n × n) nonsingular matrix depending on a vector ξ.

• If ξ_k is the kth element of ξ, then ∂V(ξ)/∂ξ_k is the (n × n) matrix whose (ℓ, p) element is the partial derivative of the (ℓ, p) element of V(ξ) with respect to ξ_k.

• ∂/∂ξ_k {log |V(ξ)|} = tr[ V^{-1}(ξ) {∂V(ξ)/∂ξ_k} ], where tr(A) is the trace of the square matrix A.

• ∂V^{-1}(ξ)/∂ξ_k = -V^{-1}(ξ) {∂V(ξ)/∂ξ_k} V^{-1}(ξ).

• For the quadratic form Q = x^T V(ξ) x, ∂Q/∂ξ_k = x^T {∂V(ξ)/∂ξ_k} x. Thus, from the previous result, ∂/∂ξ_k {x^T V^{-1}(ξ) x} = -x^T V^{-1}(ξ) {∂V(ξ)/∂ξ_k} V^{-1}(ξ) x.
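These identities can be spot-checked numerically. The sketch below uses a hypothetical two-parameter covariance model V(ξ) = ξ_1 I + ξ_2 J (J the matrix of ones), chosen only because the derivatives ∂V/∂ξ_k are known exactly, and compares the log-determinant and quadratic-form identities against central finite differences.

```python
import numpy as np

n = 3
I, J = np.eye(n), np.ones((n, n))
V = lambda xi: xi[0] * I + xi[1] * J   # V(xi), so dV/dxi_1 = I, dV/dxi_2 = J
dV = [I, J]

xi = np.array([2.0, 0.5])              # a point where V(xi) is positive definite
x = np.array([1.0, -1.0, 2.0])
h = 1e-6
err_logdet, err_quad = [], []
for k in range(2):
    e = np.zeros(2); e[k] = h
    # d/dxi_k log|V| versus tr(V^{-1} dV/dxi_k)
    fd_logdet = (np.linalg.slogdet(V(xi + e))[1]
                 - np.linalg.slogdet(V(xi - e))[1]) / (2 * h)
    an_logdet = np.trace(np.linalg.solve(V(xi), dV[k]))
    err_logdet.append(abs(fd_logdet - an_logdet))
    # d/dxi_k x^T V^{-1} x versus -x^T V^{-1} (dV/dxi_k) V^{-1} x
    fd_quad = (x @ np.linalg.solve(V(xi + e), x)
               - x @ np.linalg.solve(V(xi - e), x)) / (2 * h)
    w = np.linalg.solve(V(xi), x)
    err_quad.append(abs(fd_quad - (-w @ dV[k] @ w)))
print(err_logdet, err_quad)
```

Both discrepancies are at finite-difference roundoff level, as the identities predict.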
Let ξ_k, k = 1, ..., r + s, be the kth (scalar) component of ξ. Applying the foregoing results to differentiation of the loglikelihood (5.30), and equivalently (5.31), with respect to ξ_k, it can be verified (try it) that the result is the following set of (r + s) estimating equations:

(1/2)[ (Y - Xβ)^T V^{-1}(ξ, x̃) {∂V(ξ, x̃)/∂ξ_k} V^{-1}(ξ, x̃)(Y - Xβ) - tr[ V^{-1}(ξ, x̃) {∂V(ξ, x̃)/∂ξ_k} ] ] = 0,  k = 1, ..., r + s,   (5.34)

or, equivalently,

(1/2) Σ_{i=1}^m [ (Y_i - X_i β)^T V_i^{-1}(ξ, x_i) {∂V_i(ξ, x_i)/∂ξ_k} V_i^{-1}(ξ, x_i)(Y_i - X_i β) - tr[ V_i^{-1}(ξ, x_i) {∂V_i(ξ, x_i)/∂ξ_k} ] ] = 0,  k = 1, ..., r + s.   (5.35)
Stacked, the (r + s) estimating equations in (5.34) or (5.35) define implicitly the maximum likelihood
estimator for the covariance parameter ξ under the assumption of normality. In particular, the
estimator is obtained by solving these equations jointly with the equations in (5.32).
We now demonstrate that these estimating equations are unbiased if the mean model X_i β and the covariance model V_i(ξ, x_i) are correctly specified. Consider the form of the equations in (5.35), where they are written as a sum over i of independent quantities. It can be shown that the conditional (on x_i) expectation of a summand in (5.35) is equal to zero by appealing to the following result:
• If U is a random vector with mean zero and covariance matrix V , and A is a square matrix, then
E(U T AU) = tr{E(UU T )A} = tr(V A) = tr(AV ) (this is a special case of a more general result in
Appendix A).
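A quick Monte Carlo sanity check of this trace identity; the particular V (built from an invented Cholesky factor) and the arbitrary square A below are illustrative choices only.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4
# A mean-zero random vector with covariance V = Lc Lc^T, and an arbitrary A
Lc = np.array([[1.0, 0.0, 0.0, 0.0],
               [0.5, 1.0, 0.0, 0.0],
               [0.2, 0.3, 1.0, 0.0],
               [0.1, 0.1, 0.4, 1.0]])
V = Lc @ Lc.T
A = rng.standard_normal((n, n))

U = rng.standard_normal((200_000, n)) @ Lc.T       # rows ~ N(0, V)
mc = np.mean(np.einsum("ij,jk,ik->i", U, A, U))    # Monte Carlo E(U^T A U)
exact = np.trace(A @ V)                            # tr(AV) = tr(VA)
print(mc, exact)
```

The Monte Carlo average agrees with tr(AV) up to simulation error.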
Using this result, we have, assuming the expectation is taken under the parameter values η = (β^T, ξ^T)^T,

E_η[ (Y_i - X_i β)^T V_i^{-1}(ξ, x_i) {∂V_i(ξ, x_i)/∂ξ_k} V_i^{-1}(ξ, x_i)(Y_i - X_i β) | x_i ]

  = tr[ V_i^{-1}(ξ, x_i) {∂V_i(ξ, x_i)/∂ξ_k} V_i^{-1}(ξ, x_i) V_i(ξ, x_i) ]

  = tr[ V_i^{-1}(ξ, x_i) {∂V_i(ξ, x_i)/∂ξ_k} ],   (5.36)

whence unbiasedness of (5.35) follows. Of course, if V_i(ξ, x_i) were incorrectly specified, the equation is not necessarily unbiased.
As above, the argument to show that these estimating equations are unbiased does not require
multivariate normality to hold; all that is necessary is that the first two moments of the distribution
of Y i given x i are correctly specified.
SUMMARY: The estimators for β and ξ in a model of the form

E(Y_i | x_i) = X_i β,   var(Y_i | x_i) = V_i = V_i(ξ, x_i) = T_i^{1/2}(θ, x_i) Γ_i(α, x_i) T_i^{1/2}(θ, x_i),

under the assumption that the conditional distribution of Y_i given x_i is multivariate normal with these moments, are defined as the joint solution to the estimating equations

Σ_{i=1}^m X_i^T V_i^{-1}(ξ, x_i)(Y_i - X_i β) = 0,   (5.37)

(1/2) Σ_{i=1}^m [ (Y_i - X_i β)^T V_i^{-1}(ξ, x_i) {∂V_i(ξ, x_i)/∂ξ_k} V_i^{-1}(ξ, x_i)(Y_i - X_i β) - tr[ V_i^{-1}(ξ, x_i) {∂V_i(ξ, x_i)/∂ξ_k} ] ] = 0,  k = 1, ..., r + s,   (5.38)

where (5.37) implies

β = { Σ_{i=1}^m X_i^T V_i^{-1}(ξ, x_i) X_i }^{-1} Σ_{i=1}^m X_i^T V_i^{-1}(ξ, x_i) Y_i.   (5.39)
SPECIAL CASE: With V_i(ξ, x_i) = T_i^{1/2}(θ, x_i) Γ_i(α, x_i) T_i^{1/2}(θ, x_i) as in (5.7), a common assumption is that

var(Y_ij | x_i) = σ² for all i and j,

so that T_i(θ, x_i) = σ² I_{n_i}, r = 1, and thus

V_i = σ² Γ_i(α, x_i),   ξ = (σ², α^T)^T.   (5.40)

It can be verified (do it) under these conditions that the estimating equation of the form (5.38) corresponding to σ² (k = 1) reduces to

σ² = N^{-1} Σ_{i=1}^m (Y_i - X_i β)^T Γ_i^{-1}(α, x_i)(Y_i - X_i β).   (5.41)
We refer to this case further shortly.
IMPLEMENTATION: Solution of the estimating equations to obtain the maximum likelihood estimators (MLEs) for β and ξ, which we denote as β̂ and ξ̂, is of course equivalent to maximizing the loglikelihood (5.31) in β and ξ. This is the way it is usually implemented in software packages, using standard optimization techniques such as a Newton-Raphson algorithm.
The usual implementation takes advantage of the fact that (5.37) leads to the expression for β in
(5.39) in terms of ξ and, when V i is of the form in (5.40), with a multiplicative scale parameter σ 2 ,
(5.38) yields the expression for σ 2 in (5.41) in terms of β and α. Any β and σ 2 solving the estimating
equations, or, equivalently, maximizing the loglikelihood, must satisfy these expressions.
Thus, if the expressions for β in (5.39) and, in the case of (5.40), for σ² in (5.41) are substituted into the loglikelihood, the result is a function solely of the covariance or correlation parameters. This practice is referred to as profiling. The objective function so obtained can be maximized in the covariance or correlation parameters, an optimization problem of lower dimension and hence often more tractable than maximizing the loglikelihood in all parameters at once. Once the estimates of the covariance/correlation parameters are obtained, the estimates for β and, if relevant, σ², maximizing the objective function are obtained by substituting ξ̂ into their expressions.
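A minimal sketch of profiling in the special case (5.40), assuming balanced simulated data, a compound symmetric working correlation, and a crude grid search over α standing in for Newton-Raphson; all numerical settings are invented.

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 150, 4
beta0, sig2_0, alpha0 = np.array([2.0, 1.0]), 1.5, 0.3

def gamma(alpha):                       # compound symmetric correlation
    return (1 - alpha) * np.eye(n) + alpha * np.ones((n, n))

# Simulate balanced data with a common design matrix (intercept + time)
X = np.column_stack([np.ones(n), np.arange(n)])
Lc = np.linalg.cholesky(sig2_0 * gamma(alpha0))
Y = np.array([X @ beta0 + Lc @ rng.standard_normal(n) for _ in range(m)])

def profiled_loglik(alpha):
    G = gamma(alpha)
    Gi = np.linalg.inv(G)
    # Profile beta via (5.39); sigma^2 cancels from the GLS weights
    b = np.linalg.solve(m * X.T @ Gi @ X, X.T @ Gi @ Y.sum(axis=0))
    R = Y - X @ b                       # residuals, one row per subject
    N = m * n
    sig2 = np.einsum("ij,jk,ik->", R, Gi, R) / N        # (5.41)
    _, logdet = np.linalg.slogdet(G)
    # Profiled loglikelihood up to a constant
    return -0.5 * (N * np.log(sig2) + m * logdet + N), b, sig2

grid = np.linspace(0.01, 0.9, 90)
alpha_hat = grid[int(np.argmax([profiled_loglik(a)[0] for a in grid]))]
_, beta_hat, sig2_hat = profiled_loglik(alpha_hat)
print(alpha_hat, beta_hat, sig2_hat)
```

The search is now one-dimensional in α; real software replaces the grid with a derivative-based optimizer but follows the same logic.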
It is also possible to specify an iterative algorithm to solve the estimating equations that proceeds by
cycling between solving the equation for β holding ξ fixed at the current estimate and solving that for
ξ holding β fixed. This is more interesting and useful in the general nonlinear models we consider
in later chapters, so we defer discussion until then.
5.4 Restricted maximum likelihood
BIASED ESTIMATION IN FINITE SAMPLES: We have already observed that the MLEs for β and ξ
under the assumption of normality should be consistent estimators for their true values β 0 and ξ 0 ,
provided that the models E(Y i |x i ) = X i β and var(Y i |x i ) = V i (ξ, x i ) are correctly specified , under
general conditions, as they solve unbiased estimating equations. However, in finite samples , the
estimator for ξ can be subject to bias due to a phenomenon similar to that encountered in estimation
of variance of a scalar outcome Y from an iid sample or in ordinary linear regression.
In particular, if we have an iid sample Y_1, ..., Y_m from some distribution with mean μ and variance σ², it is well known that the MLE for σ² under the assumption of normality,

m^{-1} Σ_{i=1}^m (Y_i - Ȳ)²,

is a (downwardly) biased estimator for σ² for fixed m, as its expectation is σ²(m - 1)/m.
Accordingly, the usual sample variance estimator

s² = (m - 1)^{-1} Σ_{i=1}^m (Y_i - Ȳ)²
is unbiased and thus preferred. Evidently, this bias is a consequence of the need to estimate µ
rather than knowing it.
ELIMINATING THE EFFECT OF ESTIMATION OF MEAN PARAMETERS: Thus, although the rationale for s² is immediate from this calculation, s² can also be deduced by viewing it as the result of an approach that does not rely on estimation of μ. Let Y = (Y_1, ..., Y_m)^T, let 1 be an (m × 1) vector of 1s, and let A be an (m × (m - 1)) matrix of column rank m - 1 such that A^T 1 = 0. Defining the so-called vector of m - 1 error contrasts

U = A^T Y,

if we assume that the Y_i are N(μ, σ²), so that Y ∼ N(μ1, σ² I_m), then it is straightforward to deduce that U ∼ N(0, σ² A^T A) and that maximizing the corresponding loglikelihood in σ² yields the estimator

σ̂² = (m - 1)^{-1} Y^T A(A^T A)^{-1} A^T Y = s²

(try it). That is, the sample variance can be derived by effectively eliminating μ from consideration.
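This derivation is easy to check numerically; the sketch below uses successive-difference contrasts as one convenient choice of A satisfying A^T 1 = 0.

```python
import numpy as np

rng = np.random.default_rng(3)
m = 10
Y = 5.0 + 2.0 * rng.standard_normal(m)        # any sample works

# A: (m x (m-1)) with full column rank and A^T 1 = 0;
# here column j is e_j - e_{j+1} (a successive difference).
A = np.zeros((m, m - 1))
for j in range(m - 1):
    A[j, j], A[j + 1, j] = 1.0, -1.0

U = A.T @ Y                                    # the m - 1 error contrasts
s2_contrast = U @ np.linalg.solve(A.T @ A, U) / (m - 1)
s2 = Y.var(ddof=1)                             # usual sample variance
print(s2_contrast, s2)
```

The two quantities coincide because A(A^T A)^{-1}A^T projects onto the orthogonal complement of 1, so the quadratic form equals Σ(Y_i - Ȳ)².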
A similar result holds for linear regression. Here, with independent pairs (Y_i, x_i), i = 1, ..., m, and model Y_i = x_i^T β + ε_i with E(ε_i | x_i) = 0 and var(ε_i | x_i) = σ², let X be the (m × p) design matrix with rows x_i^T, and let β̂ be the OLS estimator. It is well known that the MLE for σ² under the assumption that ε_i given x_i is normally distributed is m^{-1} Σ_{i=1}^m (Y_i - x_i^T β̂)², which can be shown to be biased. Dividing instead by (m - p) yields the usual residual mean square, which is unbiased; a similar argument to that above based on suitably defined "error contrasts" can be made to justify this estimator, which is a simpler version of one we give shortly in the context of longitudinal data.
DEMONSTRATION: Given these observations, it is natural to be concerned that normal-theory maximum likelihood estimation of the covariance parameters ξ in our setting might be subject to similar bias. Clearly, it is not possible to carry out a similar explicit argument for a general covariance model V_i(ξ, x_i). To get a sense, however, consider the special case where
V i (ξ, x i ) = σ 2 Γi (x i ),
(5.42)
where the correlation matrix Γi (x i ) is a known function of covariates (so there is no parameter α).
Writing Γ_i and Γ for brevity, the MLEs for β and σ² in this case are (check)

β̂ = (X^T Γ^{-1} X)^{-1} X^T Γ^{-1} Y,   σ̂² = N^{-1}(Y - Xβ̂)^T Γ^{-1}(Y - Xβ̂).   (5.43)

It is straightforward (try it) to show that the quadratic form in σ̂² in (5.43) can be written as

Y^T {Γ^{-1} - Γ^{-1} X(X^T Γ^{-1} X)^{-1} X^T Γ^{-1}} Y,

which, letting Y∗ = Γ^{-1/2} Y and X∗ = Γ^{-1/2} X for Γ^{-1} = Γ^{-1/2} Γ^{-1/2}, can be reexpressed as

Y∗^T {I_N - X∗(X∗^T X∗)^{-1} X∗^T} Y∗ = Y∗^T (I_N - P∗) Y∗.
Here, E(Y∗ | x̃) = X∗ β, var(Y∗ | x̃) = σ² I_N (verify), and P∗ is a symmetric, idempotent matrix. By the result for the expectation of a quadratic form in Appendix A,

E{Y∗^T (I_N - P∗) Y∗ | x̃} = tr{σ² I_N (I_N - P∗)} + β^T X∗^T (I_N - P∗) X∗ β = σ²{N - tr(P∗)} + 0 = σ²(N - p),

using the facts (see Appendix A) that the trace of a symmetric, idempotent matrix is equal to its rank, that the rank of X (N × p), and thus of X∗ and P∗, is p, and that X∗^T (I_N - P∗) X∗ = 0 (check).
It follows that

E(σ̂² | x̃) = {(N - p)/N} σ²,

demonstrating that the MLE is biased in finite samples (m individuals, N total observations) and that the alternative estimator

σ̂² = (N - p)^{-1}(Y - Xβ̂)^T Γ^{-1}(Y - Xβ̂)   (5.44)

is preferred. Again, it is evident that the bias is a consequence of needing to estimate β rather than knowing it. We now consider a generalization of the error contrast approach above to estimation of ξ in a covariance model V_i(ξ, x_i) that eliminates estimation of β from the calculation.
RESTRICTED MAXIMUM LIKELIHOOD: Analogous to the previous argument, let A be an (N × (N - p)) matrix of column rank N - p such that A^T X = 0, where of course X is the (N × p) "stacked" design matrix for all m individuals. Define the vector of N - p error contrasts to be

U = A^T Y.

Then if Y ∼ N(Xβ, V), where we suppress dependence on ξ and x̃ for brevity, we can write

Y = Xβ + ε,   ε ∼ N(0, V),

and it is straightforward that

U = A^T Xβ + A^T ε = A^T ε ∼ N(0, A^T V A).   (5.45)
The loglikelihood corresponding to (5.45) is easily found to be, ignoring constants,

l_R(ξ) = (-1/2)[ log |A^T V(ξ, x̃) A| + Y^T A{A^T V(ξ, x̃) A}^{-1} A^T Y ],   (5.46)

which does not depend on β. The claim is that maximizing l_R(ξ) in ξ leads to an estimator that "corrects" for the finite-sample bias in the spirit of (5.44).
We first rewrite (5.46) in a form that makes it directly comparable to the usual normal loglikelihood (5.30). Note that one choice of A satisfying A^T X = 0 is, for an (N × (N - p)) matrix C chosen so that A has full column rank,

A = {I_N - X(X^T X)^{-1} X^T} C.
First, we show that the second term in (5.46) can be rewritten as

Y^T A(A^T V A)^{-1} A^T Y = (Y - Xβ̂)^T V^{-1}(Y - Xβ̂),   (5.47)

where β̂ = (X^T V^{-1} X)^{-1} X^T V^{-1} Y as in (5.33).
• We first demonstrate that

A(A^T V A)^{-1} A^T = P,   where P = V^{-1} - V^{-1} X(X^T V^{-1} X)^{-1} X^T V^{-1}.   (5.48)

Defining

T = I_N - X(X^T X)^{-1} X^T - A(A^T A)^{-1} A^T,

it is straightforward to observe that T is symmetric and idempotent, where idempotency can be verified by direct multiplication to show T T = T, using A^T X = 0. Thus,

tr(T^T T) = tr(T) = tr(I_N) - tr{X(X^T X)^{-1} X^T} - tr{A(A^T A)^{-1} A^T} = N - p - (N - p) = 0,

and tr(T^T T) = 0 implies T = 0 (check), whence it follows that

I_N - X(X^T X)^{-1} X^T = A(A^T A)^{-1} A^T.
Because

A^T X = A^T V^{1/2} V^{-1/2} X = (V^{1/2} A)^T (V^{-1/2} X) = 0,

the same result above holds with A replaced by V^{1/2} A and X replaced by V^{-1/2} X, yielding

I_N - V^{-1/2} X(X^T V^{-1} X)^{-1} X^T V^{-1/2} = V^{1/2} A(A^T V A)^{-1} A^T V^{1/2}.   (5.49)

Pre- and post-multiplying (5.49) by V^{-1/2} then gives (5.48).
• It can then be shown by brute-force multiplication (try it) that

P = P V P.   (5.50)

Using (5.48) and (5.50), with P = V^{-1} - V^{-1} X(X^T V^{-1} X)^{-1} X^T V^{-1}, we have

Y^T A(A^T V A)^{-1} A^T Y = Y^T P Y = Y^T P V P Y

  = {Y - X(X^T V^{-1} X)^{-1} X^T V^{-1} Y}^T (V^{-1} V V^{-1}) {Y - X(X^T V^{-1} X)^{-1} X^T V^{-1} Y}

  = (Y - Xβ̂)^T V^{-1}(Y - Xβ̂),

demonstrating (5.47).
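Both (5.48) and (5.50) can be verified numerically for an arbitrary positive definite V and an A built from the orthogonal complement of the columns of X; the dimensions and matrices below are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
N, p = 8, 2
X = np.column_stack([np.ones(N), rng.standard_normal(N)])
B = rng.standard_normal((N, N))
V = B @ B.T + N * np.eye(N)            # an arbitrary positive definite V

# A: N x (N - p) with A^T X = 0, from the complete QR decomposition of X
Q, _ = np.linalg.qr(X, mode="complete")
A = Q[:, p:]                            # columns orthogonal to those of X

Vinv = np.linalg.inv(V)
P = Vinv - Vinv @ X @ np.linalg.inv(X.T @ Vinv @ X) @ X.T @ Vinv
lhs = A @ np.linalg.inv(A.T @ V @ A) @ A.T
print(np.max(np.abs(lhs - P)))          # (5.48): should be ~0
print(np.max(np.abs(P @ V @ P - P)))    # (5.50): should be ~0
```

Note that the check does not depend on which full-rank A with A^T X = 0 is chosen, consistent with the invariance implicit in the derivation.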
We can thus rewrite the loglikelihood (5.46) as

l_R(ξ) = (-1/2)[ log |A^T V(ξ, x̃) A| + (Y - Xβ̂)^T V^{-1}(ξ, x̃)(Y - Xβ̂) ].   (5.51)

We now argue that the first term can be expressed, up to a constant not depending on ξ, as

log |A^T V(ξ, x̃) A| = log |V(ξ, x̃)| + log |X^T V^{-1}(ξ, x̃) X|.   (5.52)
Differentiate (5.51) with respect to the kth component of ξ to obtain

(1/2)[ (Y - Xβ̂)^T V^{-1}(ξ, x̃){∂V(ξ, x̃)/∂ξ_k}V^{-1}(ξ, x̃)(Y - Xβ̂) - tr[{A^T V(ξ, x̃) A}^{-1} A^T {∂V(ξ, x̃)/∂ξ_k} A] ].

The second term can be written, using shorthand and letting V_ξ = ∂V(ξ, x̃)/∂ξ_k, as

tr{(A^T V A)^{-1} A^T V_ξ A} = tr(P V_ξ)

  = tr[{V^{-1} - V^{-1} X(X^T V^{-1} X)^{-1} X^T V^{-1}} V_ξ]

  = tr(V^{-1} V_ξ) - tr{V^{-1} X(X^T V^{-1} X)^{-1} X^T V^{-1} V_ξ}

  = {∂/∂ξ_k log |V(ξ, x̃)|} - tr{(X^T V^{-1} X)^{-1} X^T V^{-1} V_ξ V^{-1} X}

  = {∂/∂ξ_k log |V(ξ, x̃)|} + {∂/∂ξ_k log |X^T V^{-1}(ξ, x̃) X|}.

Because this shows that the derivative of the left hand side of (5.52) is equal to the derivative of the right hand side, we conclude that the first term in (5.51) can be rewritten, up to a constant, as log |V(ξ, x̃)| + log |X^T V^{-1}(ξ, x̃) X|, as required.
Substituting in (5.51) yields what is usually referred to as the restricted maximum likelihood (REML) objective function

l_R(ξ) = (-1/2)[ log |V(ξ, x̃)| + (Y - Xβ̂)^T V^{-1}(ξ, x̃)(Y - Xβ̂) + log |X^T V^{-1}(ξ, x̃) X| ]   (5.53)

  = (-1/2)[ Σ_{i=1}^m { log |V_i(ξ, x_i)| + (Y_i - X_iβ̂)^T V_i^{-1}(ξ, x_i)(Y_i - X_iβ̂) } + log | Σ_{i=1}^m X_i^T V_i^{-1}(ξ, x_i) X_i | ],   (5.54)

where

β̂ = {X^T V^{-1}(ξ, x̃) X}^{-1} X^T V^{-1}(ξ, x̃) Y = { Σ_{i=1}^m X_i^T V_i^{-1}(ξ, x_i) X_i }^{-1} Σ_{i=1}^m X_i^T V_i^{-1}(ξ, x_i) Y_i.

Note that (5.53) and (5.54) are functions of ξ only, as β̂ depends on ξ. The suggestion is to maximize (5.53), or equivalently (5.54), in ξ and then substitute the resulting estimator in the expression for β̂.

Comparing (5.53) and (5.54) to (5.30) and (5.31), with the expression (5.33) for β̂ substituted for β, shows that they have the same form except for the third term on the right hand side of (5.53) and (5.54). It is this term that effects the "correction" for finite sample bias.
Differentiating with respect to the kth component of ξ, k = 1, ..., r + s, and setting equal to zero yields

(Y - Xβ̂)^T V^{-1}(ξ, x̃){∂V(ξ, x̃)/∂ξ_k}V^{-1}(ξ, x̃)(Y - Xβ̂)

  - tr[ V^{-1}(ξ, x̃){∂V(ξ, x̃)/∂ξ_k} ]

  + tr[ {X^T V^{-1}(ξ, x̃) X}^{-1} X^T V^{-1}(ξ, x̃){∂V(ξ, x̃)/∂ξ_k}V^{-1}(ξ, x̃) X ] = 0   (5.55)

or, equivalently,

Σ_{i=1}^m [ (Y_i - X_iβ̂)^T V_i^{-1}(ξ, x_i){∂V_i(ξ, x_i)/∂ξ_k}V_i^{-1}(ξ, x_i)(Y_i - X_iβ̂) - tr[ V_i^{-1}(ξ, x_i){∂V_i(ξ, x_i)/∂ξ_k} ] ]

  + tr[ { Σ_{i=1}^m X_i^T V_i^{-1}(ξ, x_i) X_i }^{-1} Σ_{i=1}^m X_i^T V_i^{-1}(ξ, x_i){∂V_i(ξ, x_i)/∂ξ_k}V_i^{-1}(ξ, x_i) X_i ] = 0.   (5.56)

By the manipulations leading to (5.52), (5.55) can be rewritten, in shorthand, as

Y^T P V_ξ P Y - tr(P V_ξ),

and it can be shown that E(Y^T P V_ξ P Y) = tr(P V_ξ), so that these estimating equations are unbiased; the details are left as an exercise for the diligent student.
As for the MLEs, implementation is via maximization of the objective function (5.53) using standard optimization algorithms such as Newton-Raphson.
140
CHAPTER 5
LONGITUDINAL DATA ANALYSIS
DEMONSTRATION, CONTINUED: We demonstrate that estimation of ξ via REML is expected to lead
to “correction ” for bias due to estimation of β in the special case in (5.42) where V i (ξ, x i ) = σ 2 Γi (x i )
and where the correlation matrix Γi (x i ) is known.
Writing Γ_i and Γ for brevity, we have as in (5.43) that

β̂ = (X^T Γ^{-1} X)^{-1} X^T Γ^{-1} Y,

and (5.55) becomes

(Y - Xβ̂)^T Γ^{-1}(Y - Xβ̂)/σ⁴ - tr(I_N)/σ² + tr{(X^T Γ^{-1} X)^{-1}(X^T Γ^{-1} X)}/σ² = 0.

Noting that the last term is equal to tr(I_p) = p, solving yields

σ̂²_R = (N - p)^{-1}(Y - Xβ̂)^T Γ^{-1}(Y - Xβ̂).   (5.57)

The REML estimator in (5.57) can be seen to be identical to that in (5.44).
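A small simulation, with invented settings and Γ known as in (5.42), illustrates the finite-sample bias of the ML scale estimator and the correction achieved by (5.57).

```python
import numpy as np

rng = np.random.default_rng(5)
m, n, p = 5, 3, 2                       # small m, so the bias is visible
N = m * n
sig2_0 = 2.0
Gamma = 0.5 * np.eye(n) + 0.5           # known exchangeable correlation
Gi_big = np.kron(np.eye(m), np.linalg.inv(Gamma))   # block-diagonal Gamma^{-1}
Lc = np.linalg.cholesky(Gamma)

X = np.column_stack([np.ones(N), np.tile(np.arange(n), m)])
H = np.linalg.solve(X.T @ Gi_big @ X, X.T @ Gi_big)  # maps Y to GLS beta-hat

ml, reml = [], []
for _ in range(4000):
    eps = np.sqrt(sig2_0) * (rng.standard_normal((m, n)) @ Lc.T).ravel()
    Y = X @ np.array([1.0, 0.5]) + eps
    r = Y - X @ (H @ Y)
    q = r @ Gi_big @ r                  # residual quadratic form
    ml.append(q / N)                    # ML divides by N       (5.43)
    reml.append(q / (N - p))            # REML divides by N - p (5.57)
print(np.mean(ml), np.mean(reml))
```

Averaged over replications, the ML estimate centers near σ²(N - p)/N while the REML estimate centers near σ², mirroring the expectation calculation above.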
REMARKS:
• Although not possible to demonstrate in general, similar “bias correction ” is achieved for covariance parameters other than scale parameters.
• The original justification for the REML approach is attributed to Patterson and Thompson (1971).
See Verbeke and Molenberghs (2000, Section 5.3) for details and other interpretations of the
approach.
• It is not possible to demonstrate theoretically that either the ML or the REML approach is uniformly preferable for estimation of the covariance parameters ξ in general. In the special case
of balanced data collected according to a design like that in Chapter 3, with population mean
model specified by the classical analysis of variance representation, it turns out that the estimators of the covariance parameters obtained using REML are the same as the classical
ANOVA estimators obtained by equating mean squares to their expectations; see Verbeke and
Molenberghs (2000, Section 5.3) for further references.
• In practice , REML is often used by default owing to its interpretation given here as providing
estimators that should exhibit less bias in finite samples. In fact, software implementing fitting of
models like the ones in this and the next chapter ordinarily uses REML as the default method
for estimation of covariance parameters.
5.5 Large sample inference
SAMPLING DISTRIBUTION FOR β̂: As we have seen, in the context of a particular model

E(Y_i | x_i) = X_i β,   var(Y_i | x_i) = V_i = V_i(ξ, x_i) = T_i^{1/2}(θ, x_i) Γ_i(α, x_i) T_i^{1/2}(θ, x_i),   i = 1, ..., m,   (5.58)
most questions of scientific interest can be represented as questions about the components of β in
(5.58). To make inference on β to address the questions formally, we require an estimator for β and
its sampling distribution.
The obvious estimator for β is that solving the estimating equation in (5.37), namely,

Σ_{i=1}^m X_i^T V_i^{-1}(ξ, x_i)(Y_i - X_i β) = 0,   (5.59)

jointly with an estimating equation for ξ such as the ML equation (5.38) or the REML equation (5.56).
• An estimating equation of the general form in (5.59) is often referred to as a linear estimating
equation because it depends on the response through a linear function of the response,
namely (Y i − X i β). This will be important shortly.
Regardless of which method, ML or REML, one uses to estimate the covariance parameter ξ, even if the model in (5.58) is correctly specified and the distribution of Y_i given x_i is exactly multivariate normal with these moments, it is not possible in general to derive the exact sampling distribution for the resulting estimator

β̂ = {X^T V^{-1}(ξ̂, x̃) X}^{-1} X^T V^{-1}(ξ̂, x̃) Y = { Σ_{i=1}^m X_i^T V_i^{-1}(ξ̂, x_i) X_i }^{-1} Σ_{i=1}^m X_i^T V_i^{-1}(ξ̂, x_i) Y_i,   (5.60)

where ξ̂ in (5.60) is either the ML or REML estimator for the covariance parameters in the covariance model in (5.58). Clearly, (5.60) is a complicated function of the data.
LARGE SAMPLE THEORY: Accordingly, we appeal to large sample theory to derive an approximate sampling distribution for β̂ using the general approach for estimating equations discussed in Section 4.3. As we discussed there, the argument does not require that the assumption of normality of the distribution of Y_i given x_i holds.
We assume that the model for E(Y i |x i ) in (5.58) is correctly specified. Recall that this means that
there is a value β 0 such that the true expectation of Y i given x i is X i β 0 ; that is, β 0 is a parameter of
the distribution that truly generated the data.
142
CHAPTER 5
LONGITUDINAL DATA ANALYSIS
• Clearly, if this is not the case, then we are in pretty serious trouble, as we are addressing the
questions of interest (which are questions about population mean response) in a framework
that may not be consistent with the truth.
As suggested by our development so far, specification of a model for the overall population-averaged
covariance matrix is admittedly more difficult than specifying a model for the mean. Accordingly,
it is reasonable to be concerned that the model we specify for var(Y i |x i ) might not be correctly
specified. That is, for example we might select a correlation model Γi (α, x i ) that does not faithfully
represent the true overall correlation structure, and/or we might make incorrect assumptions about
the overall variance.
Accordingly, we first consider the ideal situation in which the models for both overall mean and
covariance posited in (5.58) are correctly specified , and then consider the case where the latter
model might be incorrect.
COVARIANCE MODEL CORRECTLY SPECIFIED: If the model V_i(ξ, x_i) in (5.58) is correctly specified, then there is a value ξ_0 such that the true overall covariance matrix

var(Y_i | x_i) = V_{0i}   (5.61)

is V_i(ξ_0, x_i), i = 1, ..., m. That is, V_{0i} = V_i(ξ_0, x_i) is the covariance matrix of the conditional distribution of Y_i given x_i actually generating the data.

Rather than just substituting directly into the generic argument in Section 4.3, we carry out the argument from scratch so as to demonstrate a fundamental and well-known result that persists across all types of mean-covariance models. The estimator (5.60) satisfies

Σ_{i=1}^m X_i^T V_i^{-1}(ξ̂, x_i)(Y_i - X_iβ̂) = 0.   (5.62)

Collecting the parameters as η = (β^T, ξ^T)^T, let η̂ = (β̂^T, ξ̂^T)^T. Because both the mean and covariance models are correctly specified, the estimating equation (5.59) and that solved to estimate ξ (ML or REML) are unbiased estimating equations. Thus, we expect that η̂ is a consistent estimator for the true value η_0 = (β_0^T, ξ_0^T)^T.
Following the argument in Section 4.3, we multiply (5.62) by m^{-1/2} and take a linear Taylor series in η̂ about η_0. Here, as on the last page of Appendix B (review it), instead of writing the linear term of the series in terms of this "stacked" parameter vector, we write it as the sum of terms corresponding to each component of η. That is,

0 = m^{-1/2} Σ_{i=1}^m X_i^T V_i^{-1}(ξ̂, x_i)(Y_i - X_iβ̂)

  ≈ m^{-1/2} Σ_{i=1}^m X_i^T V_i^{-1}(ξ_0, x_i)(Y_i - X_iβ_0) + { -m^{-1} Σ_{i=1}^m X_i^T V_i^{-1}(ξ_0, x_i) X_i } m^{1/2}(β̂ - β_0)

  + [ m^{-1} Σ_{i=1}^m X_i^T {∂/∂ξ V_i^{-1}(ξ_0, x_i)}(Y_i - X_iβ_0) ] m^{1/2}(ξ̂ - ξ_0).   (5.63)
• In the third term on the right hand side in (5.63), we do not attempt to be more precise about
the form of the partial derivative of the covariance matrix V i (ξ, x i ) with respect to ξ (which this
notation is meant to indicate is evaluated at ξ 0 ). This derivative evidently is rather complicated.
As we see momentarily, we needn’t worry about this.
• We have used the consistency of β̂ and ξ̂ to approximate the sums in the second and third terms as evaluated at the true value η_0 rather than an intermediate value η∗ as in the argument in Section 4.3.
Write the expansion compactly as

0 ≈ C_m - A_m m^{1/2}(β̂ - β_0) + E_m m^{1/2}(ξ̂ - ξ_0),   (5.64)

where, using V_i(ξ_0, x_i) = V_{0i} as in (5.61),

C_m = m^{-1/2} Σ_{i=1}^m X_i^T V_{0i}^{-1}(Y_i - X_iβ_0),   A_m = m^{-1} Σ_{i=1}^m X_i^T V_{0i}^{-1} X_i,

E_m = m^{-1} Σ_{i=1}^m X_i^T {∂/∂ξ V_i^{-1}(ξ_0, x_i)}(Y_i - X_iβ_0).
• We in fact assume that

m^{1/2}(ξ̂ - ξ_0) = O_p(1);   (5.65)

i.e., that this quantity is bounded in probability (see Appendix C). Under regularity conditions, most estimators that are solutions to unbiased estimating equations satisfy (5.65). This says that m^{1/2}(ξ̂ - ξ_0) is "well-behaved" as m → ∞ and describes the rate at which ξ̂ converges in probability to ξ_0; i.e., (5.65) is equivalent to ξ̂ - ξ_0 = O_p(m^{-1/2}). This ensures that the rightmost term in (5.64) does not "blow up" as m → ∞.
If we view the argument conditional on x̃, then

A_m → A = lim_{m→∞} m^{-1} Σ_{i=1}^m X_i^T V_{0i}^{-1} X_i.

By the central limit theorem,

C_m →L N(0, B),

where

B = lim_{m→∞} m^{-1} Σ_{i=1}^m X_i^T V_{0i}^{-1} V_{0i} V_{0i}^{-1} X_i = lim_{m→∞} m^{-1} Σ_{i=1}^m X_i^T V_{0i}^{-1} X_i = A.

By the weak law of large numbers, using E(Y_i | x_i) = X_iβ_0, it is straightforward that

E_m →p 0.   (5.66)

Thus, rearranging and applying these results along with Slutsky's theorem, we are left with

m^{1/2}(β̂ - β_0) ≈ A^{-1} C_m →L N(0, A^{-1} B A^{-1}) = N(0, A^{-1}).   (5.67)
• Note that (5.66) effectively eliminates any effect of having to estimate ξ. That is, if ξ_0 were known and substituted in (5.62), we could have immediately concluded (5.67).

• This reflects the fundamental result that if we obtain an estimator β̂ for a parameter β in a model for a population mean by solving a linear estimating equation with an estimated "weight matrix," the large sample (normal) distribution of m^{1/2}(β̂ - β_0) is the same as that for the (ideal) estimator for β we could have obtained if the "weight matrix" were known.

• This says that there is no loss of precision suffered by the estimator for β due to having had to estimate covariance parameters versus knowing them. Intuitively, this seems like a pretty optimistic result.

• Indeed, in small samples (small number of individuals m), inference based on the result in (5.67) can be optimistic in the sense that, for example, standard errors for the components of β̂ derived from (5.67), as we discuss momentarily, will be too small and thus fail to reflect the true uncertainty associated with estimating β (which includes uncertainty due to estimating ξ). In "larger" samples, inferences are often fairly reliable. Of course, what constitutes "large enough" in any particular setting is not known.
To use the result (5.67) in practice, we approximate A by Â_m, where Â_m is A_m with ξ̂ substituted for ξ_0 in V_{0i} = V_i(ξ_0, x_i), exploiting the fact that ξ̂ is a consistent estimator for ξ_0 under the conditions here. This yields the approximate sampling distribution for β̂ given by

β̂ ∼· N(β_0, m^{-1} Â_m^{-1}) = N(β_0, Σ̂_M),   (5.68)

where

Σ̂_M = { Σ_{i=1}^m X_i^T V_i^{-1}(ξ̂, x_i) X_i }^{-1} = {X^T V^{-1}(ξ̂, x̃) X}^{-1}.   (5.69)

(Note that the m^{-1} on the left hand side of (5.68) "cancels" with that in Â_m.)

• In practice, standard errors for the estimators of the components of β, and associated confidence intervals and test statistics concerning the corresponding components of the true parameter β_0, can be constructed in the usual way based on (5.68) and (5.69).
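As a sketch, the model-based covariance matrix (5.69) and the resulting standard errors might be computed as follows; to keep the bookkeeping minimal, ξ̂ is idealized by plugging in the true covariance, and all numerical settings are invented.

```python
import numpy as np

rng = np.random.default_rng(6)
m, n = 300, 4
beta0 = np.array([1.0, -0.25])
alpha0, sig2_0 = 0.4, 1.0
V0 = sig2_0 * ((1 - alpha0) * np.eye(n) + alpha0)   # true V_i, same for all i

X = np.column_stack([np.ones(n), np.arange(n)])     # common design: intercept + time
Lc = np.linalg.cholesky(V0)
Y = np.array([X @ beta0 + Lc @ rng.standard_normal(n) for _ in range(m)])

# Pretend V0 is the fitted V_i(xi-hat); with a consistent xi-hat the
# same formulas apply with the estimate in place of the truth.
Vinv = np.linalg.inv(V0)
Sigma_M = np.linalg.inv(m * X.T @ Vinv @ X)          # (5.69)
beta_hat = Sigma_M @ (X.T @ Vinv @ Y.sum(axis=0))    # (5.60)
se = np.sqrt(np.diag(Sigma_M))                       # model-based standard errors
z = (beta_hat - beta0) / se                          # each approx N(0, 1)
print(beta_hat, se, z)
```

Wald confidence intervals and tests for components of β_0 follow in the usual way, e.g., beta_hat ± 1.96 · se for approximate 95% intervals.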
COVARIANCE MODEL POSSIBLY INCORRECTLY SPECIFIED: We can generalize the above argument to the case where the posited model V i (ξ, x i ) in (5.58) is not necessarily correctly specified.
That is, there is no value ξ 0 such that V i (ξ 0 , x i ) = V 0i , where, again, V 0i is the true covariance
matrix generating the data.
Of course, in practice, we would proceed unknowingly as if the model V_i(ξ, x_i) were correct and solve an estimating equation of the form (5.38) (ML) or (5.56) (REML) to obtain an estimator ξ̂. Because the model is incorrect, it is not even clear that ξ has meaning, as it does not represent a quantity relevant to the true mechanism generating the data. Accordingly, it is not clear exactly what ξ̂ is "estimating."
In the generic argument in Section 4.3, we started from the premise that the model underlying the estimating equations being solved for the parameter η is correctly specified, so that the estimating equations Σ_{i=1}^m Ψ_i(U_i, η) = 0 are unbiased; i.e.,

E{Ψ_i(U_i, η_0)} = 0,

where η_0 is the true value. Inspection of (5.38) or (5.56) makes clear that, in our problem, if the model V_i is not correct, then a summand of the estimating equations does not necessarily have expectation zero, so that the estimating equations are not unbiased. In this situation, we can still say something about the behavior of ξ̂ in our problem, as follows.
In the generic case of a correct model, under regularity conditions, it is possible to weaken the argument in Section 4.3. If instead we have only that

$$\sum_{i=1}^m E\{\Psi_i(U_i, \eta_0)\} = 0 \qquad (5.70)$$

(so that each summand does not necessarily have mean zero, but their sum does), then it still holds in general that $\hat{\eta} \xrightarrow{p} \eta_0$, and the argument leading to the asymptotic normality of the estimator for η goes through unchanged, except that the covariance matrix of $\Psi_i(U_i, \eta_0)$ is no longer equal to $E\{\Psi_i(U_i, \eta_0)\Psi_i^T(U_i, \eta_0)\}$, so that the definitions of the matrices B_m and B in the argument must be changed; e.g., $B_m = m^{-1}\sum_{i=1}^m \mathrm{var}\{\Psi_i(U_i, \eta_0)\}$ instead.
If the model on which the estimating equations are based is incorrect, under regularity conditions it is usually the case that there exists η* such that

$$\sum_{i=1}^m E\{\Psi_i(U_i, \eta^*)\} = 0, \qquad (5.71)$$

where this expectation is still with respect to the true distribution of U_i.

It turns out that, by analogy to (5.70), if (5.71) holds, solving the “incorrect” estimating equation will yield an “estimator” such that

$$\hat{\eta} \xrightarrow{p} \eta^*. \qquad (5.72)$$
Although η* does not have any meaning with respect to the true distribution generating the data, it is a fixed value dictated by (5.71). A value like η* can be thought of as the value that “gets closest” to representing the truth within the confines of an incorrect model, and consequently has been referred to as the least false parameter.
The key point is that, even with an incorrectly specified model , we can still deduce the behavior
of an “estimator” for a parameter in that model, even if the parameter has no real meaning.
Returning to our problem, we thus assume that, for incorrectly specified model V_i(ξ, x_i), if we solve estimating equations like those in (5.38) or (5.56), the solution ξ̂ satisfies (5.72) for some ξ*; namely,

$$\hat{\xi} \xrightarrow{p} \xi^*,$$

and, under regularity conditions and analogous to (5.65), $m^{1/2}(\hat{\xi} - \xi^*) = O_p(1)$.
Suppose then that V_i is incorrectly specified, let

$$V_i^* = V_i(\xi^*, x_i)$$

denote the “incorrect covariance matrix” implied by the choice of this incorrect model, and consider again solving (5.62), namely,

$$\sum_{i=1}^m X_i^T V_i^{-1}(\hat{\xi}, x_i)(Y_i - X_i\hat{\beta}) = 0.$$

First, note that for the estimating equation (5.59),

$$\sum_{i=1}^m X_i^T V_i^{-1}(\xi, x_i)(Y_i - X_i\beta) = 0,$$

we have

$$E\{X_i^T V_i^{-1}(\xi^*, x_i)(Y_i - X_i\beta_0) \mid x_i\} = E\{X_i^T V_i^{*\,-1}(Y_i - X_i\beta_0) \mid x_i\} = 0,$$

i = 1, ..., m, so that the expectation of each summand is zero even though the covariance model is incorrectly specified, and thus the estimating equation is still unbiased, analogous to the demonstration for univariate OLS in Section 4.3. We thus conclude that β̂ is a consistent estimator for β_0, despite the fact that the “weight matrix” used in the linear estimating equation is not the inverse of the true covariance matrix. In fact, this holds even if we take V_i = I_{n_i}, i = 1, ..., m; that is, even if we assume all N observations across all m individuals are mutually uncorrelated. The resulting estimator for β is effectively OLS, treating all N observations as if they were independent.
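As a quick numerical illustration of this consistency (not part of the original development; the model choices and parameter values are hypothetical), the following Python sketch generates data with an AR(1) true covariance but estimates β by solving the linear estimating equation with working covariance V_i = I_{n_i}, i.e., OLS:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 2000, 4
beta0 = np.array([1.0, 0.5])
t = np.arange(n, dtype=float)

# true AR(1) covariance, unknown to the analyst
rho, sig2 = 0.6, 2.0
V0 = sig2 * rho ** np.abs(np.subtract.outer(t, t))
Lc = np.linalg.cholesky(V0)

XtX = np.zeros((2, 2))
XtY = np.zeros(2)
for _ in range(m):
    Xi = np.column_stack([np.ones(n), t])           # straight-line mean model
    Yi = Xi @ beta0 + Lc @ rng.standard_normal(n)   # correlated errors
    # working covariance V_i = I: summands of the OLS estimating equation
    XtX += Xi.T @ Xi
    XtY += Xi.T @ Yi

beta_hat = np.linalg.solve(XtX, XtY)
print(beta_hat)    # close to beta0 = (1.0, 0.5) despite the wrong "weights"
```

The estimator remains consistent; what changes under the wrong weight matrix is its efficiency, which is the subject of the argument that follows.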
Expanding about $(\hat{\beta}^T, \hat{\xi}^T)^T = (\beta_0^T, \xi^{*T})^T$, analogous to (5.63),

$$0 = m^{-1/2} \sum_{i=1}^m X_i^T V_i^{-1}(\hat{\xi}, x_i)(Y_i - X_i\hat{\beta})$$
$$\approx m^{-1/2} \sum_{i=1}^m X_i^T V_i^{-1}(\xi^*, x_i)(Y_i - X_i\beta_0) + \left\{ -m^{-1} \sum_{i=1}^m X_i^T V_i^{-1}(\xi^*, x_i)\, X_i \right\} m^{1/2}(\hat{\beta} - \beta_0)$$
$$\qquad + \left[ m^{-1} \sum_{i=1}^m X_i^T \{\partial/\partial\xi\, V_i^{-1}(\xi^*, x_i)\}(Y_i - X_i\beta_0) \right] m^{1/2}(\hat{\xi} - \xi^*)$$
$$= C_m^* - A_m^*\, m^{1/2}(\hat{\beta} - \beta_0) + E_m^*\, m^{1/2}(\hat{\xi} - \xi^*). \qquad (5.73)$$
With V_i^* = V_i(ξ*, x_i) and V_{0i} = var(Y_i | x_i), the true covariance matrix, it is clear that

$$E_m^* \xrightarrow{p} 0, \qquad A_m^* \to A^* = \lim_{m\to\infty} m^{-1} \sum_{i=1}^m X_i^T V_i^{*\,-1} X_i,$$
$$C_m^* = m^{-1/2} \sum_{i=1}^m X_i^T V_i^{*\,-1}(Y_i - X_i\beta_0) \xrightarrow{L} \mathcal{N}(0, B^*)$$

by the central limit theorem, where

$$B^* = \lim_{m\to\infty} m^{-1} \sum_{i=1}^m X_i^T V_i^{*\,-1} V_{0i} V_i^{*\,-1} X_i.$$
Thus, rearranging and using Slutsky’s theorem as before, we have

$$m^{1/2}(\hat{\beta} - \beta_0) \xrightarrow{L} \mathcal{N}(0,\, A^{*\,-1} B^* A^{*\,-1}). \qquad (5.74)$$
• As in the case where the covariance model is correctly specified, because $E_m^* \xrightarrow{p} 0$, there is no effect of estimating ξ in the incorrect model V_i. If the matrix V_i^* had been known, it is straightforward to observe that (5.74) would still follow.

This reflects a generalization of the result we saw in the case of a correctly specified covariance model, namely that the large sample distribution of $m^{1/2}(\hat{\beta} - \beta_0)$ is the same whether the “weight matrix” used in the linear estimating equation for β is fixed or estimated.
• In fact, the argument leading to the result (5.67) in the case of a correctly specified model is a special case of this result, where the covariance model V_i is correct after all, so that ξ* = ξ_0, the value such that V_{0i} = V_i(ξ_0, x_i).
Note that (5.74), while informative about the behavior of the estimator for β when the posited covariance model is incorrect, cannot be used as-is in practice, as V_{0i} is of course unknown. We return to this point shortly.
OPTIMAL LINEAR ESTIMATING EQUATION: From (5.67), when the covariance model is correctly specified, the estimator solving the linear estimating equation satisfies

$$\hat{\beta}_C \,\dot{\sim}\, \mathcal{N}\{\beta_0,\, (X^T V_0^{-1} X)^{-1}\}, \qquad (5.75)$$

where the subscript C indicates “correct,” and V_0 = block diag(V_{01}, ..., V_{0m}). Likewise, when the covariance model is incorrectly specified, from (5.74), the estimator solving the linear estimating equation satisfies

$$\hat{\beta}_{IC} \,\dot{\sim}\, \mathcal{N}\{\beta_0,\, (X^T V^{*\,-1} X)^{-1}(X^T V^{*\,-1} V_0 V^{*\,-1} X)(X^T V^{*\,-1} X)^{-1}\}, \qquad (5.76)$$

where the subscript IC indicates “incorrect,” and V* = block diag(V_1^*, ..., V_m^*).
The covariance matrices of the approximate sampling distributions in (5.75) and (5.76) reflect, at least for m large, the precision with which β can be estimated by solving the linear estimating equation for β under correct and incorrect covariance models. Both β̂_C and β̂_IC are consistent estimators for β_0; thus, we can compare the covariance matrices of their approximate sampling distributions to examine the relative efficiency of β̂_IC to β̂_C.
To this end, consider the difference

$$(X^T V^{*\,-1} X)^{-1}(X^T V^{*\,-1} V_0 V^{*\,-1} X)(X^T V^{*\,-1} X)^{-1} - (X^T V_0^{-1} X)^{-1}. \qquad (5.77)$$

We now argue that the difference (5.77) is a nonnegative definite matrix; that is,

$$\lambda^T \{(X^T V^{*\,-1} X)^{-1}(X^T V^{*\,-1} V_0 V^{*\,-1} X)(X^T V^{*\,-1} X)^{-1} - (X^T V_0^{-1} X)^{-1}\}\lambda \ge 0 \qquad (5.78)$$

for all λ. It follows that, if (5.78) holds, the diagonal elements of (5.77) are all ≥ 0, so that the difference in the approximate sampling variances of the estimators for each component of β is ≥ 0 (check), implying that the components of β̂_C are more efficient than those of β̂_IC.
Letting

$$X_* = V^{*\,-1/2} X, \quad \text{where } V^{*\,-1/2} V^{*\,-1/2} = V^{*\,-1}; \qquad W = V^{*\,1/2} V_0^{-1} V^{*\,1/2}, \quad W = W^{1/2} W^{1/2};$$
$$c = W^{-1/2} X_* (X_*^T X_*)^{-1} \lambda,$$

rewrite (5.78) as (check)

$$c^T \{I_N - W^{1/2} X_* (X_*^T W X_*)^{-1} X_*^T W^{1/2}\}\, c. \qquad (5.79)$$

It is straightforward to verify (try it) that

$$I_N - W^{1/2} X_* (X_*^T W X_*)^{-1} X_*^T W^{1/2} = I_N - P_*$$

is symmetric and idempotent, so that (5.79) can be written as

$$c^T (I_N - P_*) c = c^T (I_N - P_*)(I_N - P_*) c = d^T d \ge 0,$$

demonstrating (5.78).
The result (5.78) shows that, at least approximately (for “large” m), the components of β̂_C are more precise estimators than those of β̂_IC. Formally, (5.78) demonstrates that, for a given population mean model Xβ, among all linear estimating equations, that formed using a correct covariance model will yield an (asymptotically) relatively more efficient estimator than any other based on an incorrect covariance model. That is, the linear estimating equation with “weight matrix” based on a correct covariance model is optimal among all linear estimating equations in this sense. Of course, this comes as no surprise.
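A minimal numerical check of the nonnegative definiteness of (5.77), under assumed (hypothetical) choices of true covariance V_0 (exchangeable) and incorrect working covariance V* (identity):

```python
import numpy as np

N = 12
X = np.column_stack([np.ones(N), np.arange(N, dtype=float)])

# true covariance V0 (exchangeable) and incorrect working covariance V* = I
V0 = 0.5 * np.ones((N, N)) + 1.5 * np.eye(N)
Vs_inv = np.eye(N)                    # V*^{-1}

A = np.linalg.inv(X.T @ Vs_inv @ X)
cov_IC = A @ (X.T @ Vs_inv @ V0 @ Vs_inv @ X) @ A    # sandwich form, (5.76)
cov_C = np.linalg.inv(X.T @ np.linalg.inv(V0) @ X)   # correct model, (5.75)

diff = cov_IC - cov_C                                # the difference (5.77)
eigs = np.linalg.eigvalsh(diff)
print(eigs)       # eigenvalues are all >= 0 (up to rounding error)
```

Any other positive definite choices of V_0 and V* would, by the argument above, give the same qualitative conclusion.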
The result does not provide insight into how much more precise in general. Evidently, the comparison of the large sample covariance matrices will depend on the particular situation: the population mean response model and covariates x_i on which it is based (assumed correct), the true covariance matrix, and the assumed covariance model.
We demonstrate a more general optimality result in the case of a nonlinear model for population-averaged mean response in Chapter 8, which subsumes the one here.
NORMALITY NOT REQUIRED: Note that nowhere in these arguments is anything assumed about
the true distribution of Y i given x i ; e.g., that it is multivariate normal. The only assumption on
this distribution required is that it possess sufficient moments so that application of the weak law
of large numbers, the central limit theorem, etc, is justified. Accordingly, even though we derived the
estimating equations for β and ξ in the assumed covariance model by starting with the normal
loglikelihood , the resulting estimator for β has desirable properties that hold much more generally.
“ROBUST” COVARIANCE MATRIX: In practice, it is natural to be concerned that a posited covariance model is not correctly specified. Identifying an appropriate model is admittedly challenging; the structure adopted must faithfully represent the aggregate effects of both among- and within-individual variance and correlation.
Accordingly, rather than carry out inference on β_0 based on the approximate sampling distribution in (5.68), which is based on the covariance model being correct, it is conventional to base it on the foregoing argument under the condition that the posited covariance model may not be correct and the result in (5.74), which we repeat here for convenience, dropping the subscript IC:

$$m^{1/2}(\hat{\beta} - \beta_0) \xrightarrow{L} \mathcal{N}(0,\, A^{*\,-1} B^* A^{*\,-1}),$$

where

$$A^* = \lim_{m\to\infty} m^{-1} \sum_{i=1}^m X_i^T V_i^{*\,-1} X_i, \qquad B^* = \lim_{m\to\infty} m^{-1} \sum_{i=1}^m X_i^T V_i^{*\,-1} V_{0i} V_i^{*\,-1} X_i. \qquad (5.80)$$
Of course, A* can be approximated by

$$m^{-1} \sum_{i=1}^m X_i^T V_i^{-1}(\hat{\xi}, x_i)\, X_i;$$

the difficulty is that B* depends on the true covariance matrix V_{0i}, which is not known.

However, from the argument in Section 4.3, B* can be approximated by

$$m^{-1} \sum_{i=1}^m X_i^T V_i^{-1}(\hat{\xi}, x_i)(Y_i - X_i\hat{\beta})(Y_i - X_i\hat{\beta})^T V_i^{-1}(\hat{\xi}, x_i)\, X_i.$$
The diligent student can verify that, using the consistency of β̂ and the weak law of large numbers, this expression converges in probability to B*.
Combining, it is thus common to base inference on the approximate sampling distribution

$$\hat{\beta} \,\dot{\sim}\, \mathcal{N}(\beta_0,\, \hat{\Sigma}_R), \qquad (5.81)$$

where

$$\hat{\Sigma}_R = \left\{ \sum_{i=1}^m X_i^T V_i^{-1}(\hat{\xi}, x_i) X_i \right\}^{-1} \left\{ \sum_{i=1}^m X_i^T V_i^{-1}(\hat{\xi}, x_i)(Y_i - X_i\hat{\beta})(Y_i - X_i\hat{\beta})^T V_i^{-1}(\hat{\xi}, x_i) X_i \right\} \left\{ \sum_{i=1}^m X_i^T V_i^{-1}(\hat{\xi}, x_i) X_i \right\}^{-1}. \qquad (5.82)$$
Σ̂_R is often referred to as the robust, sandwich, or empirical (sampling) covariance matrix, in contrast to Σ̂_M in (5.69), which is often called the model-based covariance matrix, being based on the assumption that the model for the overall covariance structure is correctly specified. “Robust” refers to the fact that m × (5.82) is a consistent estimator for the true sampling covariance matrix of $m^{1/2}(\hat{\beta} - \beta_0)$ when the covariance model is incorrectly specified (and even if it is correct). It is thus robust to possible misspecification of the covariance model V_i.
• It is conventional in practice to base inference on the robust covariance matrix Σ̂_R rather than the model-based version Σ̂_M to protect against the possibility of an incorrect covariance model.

• Software packages implementing these methods and those in the next chapter usually use Σ̂_R by default to compute approximate standard errors, confidence intervals, and so on.

• By the argument leading to (5.77), using Σ̂_R should result in a less optimistic assessment of the precision with which the components of β are estimated.
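A sketch of how the model-based matrix (5.69) and the robust sandwich matrix (5.82) might be computed side by side; the function name and data layout are illustrative assumptions, not a particular package’s API:

```python
import numpy as np

def beta_and_covs(X_list, Y_list, V_list):
    """GLS-type estimate of beta from the linear estimating equation,
    with model-based (5.69) and robust sandwich (5.82) covariance
    estimates. V_list holds the working covariances V_i(xi_hat, x_i)."""
    p = X_list[0].shape[1]
    bread = np.zeros((p, p))
    XtVY = np.zeros(p)
    for Xi, Yi, Vi in zip(X_list, Y_list, V_list):
        Wi = np.linalg.inv(Vi)
        bread += Xi.T @ Wi @ Xi
        XtVY += Xi.T @ Wi @ Yi
    bread_inv = np.linalg.inv(bread)
    beta = bread_inv @ XtVY
    meat = np.zeros((p, p))
    for Xi, Yi, Vi in zip(X_list, Y_list, V_list):
        Wi = np.linalg.inv(Vi)
        ri = Yi - Xi @ beta                  # residual for individual i
        meat += Xi.T @ Wi @ np.outer(ri, ri) @ Wi @ Xi
    Sigma_M = bread_inv                      # model-based, (5.69)
    Sigma_R = bread_inv @ meat @ bread_inv   # robust sandwich, (5.82)
    return beta, Sigma_M, Sigma_R
```

The “bread” and “meat” names reflect the sandwich structure of (5.82): the outer inverses surround the empirical middle term built from the residuals.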
QUESTIONS OF INTEREST: As discussed in the context of the examples in Section 5.2, questions of scientific interest are usually expressed in terms of linear functions of the components of β. For instance, in the population mean response model (5.14) for the dental study given by

$$E(Y_{ij} \mid x_i) = \{\beta_{0,B}\, g_i + \beta_{0,G}(1 - g_i)\} + \{\beta_{1,B}\, g_i + \beta_{1,G}(1 - g_i)\}\, t_{ij}, \qquad \beta = (\beta_{0,G}, \beta_{1,G}, \beta_{0,B}, \beta_{1,B})^T,$$

interest focuses on the difference in slopes between the genders, β_{1,B} − β_{1,G}, so that

$$L = (0, -1, 0, 1). \qquad (5.83)$$
If interest is in estimating the population mean response for boys at age t_0 = 11, then we focus on

$$L\beta = \beta_{0,B} + \beta_{1,B}\, t_0, \qquad L = (0, 0, 1, t_0).$$
Questions of interest can also involve more than one contrast of the components of β; for example, continuing with the dental study, whether or not the (assumed straight line) population mean response trajectories for boys and girls in fact coincide involves the two contrasts β_{0,B} − β_{0,G} and β_{1,B} − β_{1,G} (equal intercepts and slopes). The null hypothesis that both intercepts and slopes for boys and girls are the same, so that the lines coincide, can be expressed as Lβ = 0, where

$$L = \begin{pmatrix} -1 & 0 & 1 & 0 \\ 0 & -1 & 0 & 1 \end{pmatrix}. \qquad (5.84)$$
In general, questions can be expressed in terms of a (c × p) matrix L, c ≥ 1, corresponding to a set of contrasts of interest, where ordinarily rank(L) = c (full rank).
INFERENCE: Using the approximate sampling distribution for β̂, an estimator for Lβ is then Lβ̂, and, with Σ̂ either of Σ̂_M or Σ̂_R, Lβ̂ has approximate sampling distribution, from (5.68) and (5.81),

$$L\hat{\beta} \,\dot{\sim}\, \mathcal{N}(L\beta_0,\, L\hat{\Sigma}L^T). \qquad (5.85)$$
Thus, for example, if Lβ represents the difference in slopes in (5.83) (c = 1), a standard error for Lβ̂ is $(L\hat{\Sigma}L^T)^{1/2}$, and a conventional Wald-type 100(1 − α)% confidence interval for Lβ_0 is

$$L\hat{\beta} \pm c_{\alpha/2}\, (L\hat{\Sigma}L^T)^{1/2},$$

where c_{α/2} is an appropriate critical value, such as the 1 − α/2 quantile of the standard normal or t distribution with some degrees of freedom, discussed further below. A test of H_0: β_{1,B} − β_{1,G} = 0 versus H_1: β_{1,B} − β_{1,G} ≠ 0 would be based on comparing the test statistic $L\hat{\beta}/(L\hat{\Sigma}L^T)^{1/2}$ to the appropriate critical value from a normal or t distribution.
More generally, approximate test statistics for the hypotheses
H0 : Lβ = h vs. H1 : Lβ 6= h,
where L is (c × p) with (usually) rank(L) = c, and h is a specified (c × 1) vector (almost always h = 0),
can be constructed based on what is now the c-variate approximate sampling distribution (5.85).
• An approximate Wald test statistic is

$$T_L = (L\hat{\beta} - h)^T (L\hat{\Sigma}L^T)^{-1} (L\hat{\beta} - h), \qquad (5.86)$$

which has approximately a chi-squared distribution with rank(L) degrees of freedom, so that the test is carried out by comparing T_L to the appropriate χ² critical value. If L is a row vector (c = 1), then this test is equivalent to the usual “Z test” based on using a standard normal critical value.
• Wald-type tests can be optimistic in practice and reject H_0 more often than they should, because the large sample approximate sampling distributions (5.68) and (5.81) do not take into account variability associated with estimating ξ, so that the test statistic is too large. In finite samples (finite m), this is often addressed by instead using a statistic of the form

$$F_L = \frac{(L\hat{\beta} - h)^T (L\hat{\Sigma}L^T)^{-1} (L\hat{\beta} - h)}{\mathrm{rank}(L)}, \qquad (5.87)$$

which is compared to an F distribution with rank(L) numerator degrees of freedom and denominator degrees of freedom estimated from the data. When c = 1, this reduces to a t test, with degrees of freedom estimated similarly.
Several methods have been proposed to estimate the denominator degrees of freedom for
the test statistic (5.87), one of which is based on the so-called Satterthwaite approximation.
These are implemented in available software. We do not discuss these here; see Verbeke and
Molenberghs (1997, Section 3.5.2 and Appendix A) and the documentation for SAS proc mixed
for details. These methods usually lead to different results ; however, with large m, all yield
degrees of freedom that are sufficiently large that the associated p-values are very similar.
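A minimal sketch of the Wald statistic (5.86); the estimates and covariance matrix below are hypothetical placeholders, and the resulting statistic is compared to a χ² critical value with rank(L) degrees of freedom (e.g., 3.84 for one degree of freedom at level 0.05):

```python
import numpy as np

def wald_statistic(beta_hat, Sigma_hat, L, h=None):
    """Wald statistic T_L of (5.86) for H0: L beta = h; compare to a
    chi-squared critical value with rank(L) degrees of freedom."""
    L = np.atleast_2d(np.asarray(L, dtype=float))
    h = np.zeros(L.shape[0]) if h is None else np.asarray(h, dtype=float)
    d = L @ beta_hat - h
    T = float(d @ np.linalg.solve(L @ Sigma_hat @ L.T, d))
    return T, np.linalg.matrix_rank(L)

# difference in slopes in the dental model, beta = (b0G, b1G, b0B, b1B):
L_slope = np.array([0.0, -1.0, 0.0, 1.0])
beta_hat = np.array([17.4, 0.48, 16.3, 0.78])    # hypothetical estimates
Sigma_hat = 0.01 * np.eye(4)                     # hypothetical covariance
T, df = wald_statistic(beta_hat, Sigma_hat, L_slope)
print(T, df)     # T = 4.5 with df = 1 here
```

The same function handles c > 1 by passing a matrix L, in which case T_L is compared to a χ² critical value with rank(L) degrees of freedom.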
When the null hypothesis corresponds to a comparison of nested models, as for L in (5.83) with equal slopes or in (5.84) where the straight lines for boys and girls coincide under the null, an alternative approach is to carry out a classical likelihood ratio test based on the normal likelihood

$$\mathcal{L}_{ML}(\beta, \xi) = \prod_{i=1}^m (2\pi)^{-n_i/2}\, |V_i(\xi, x_i)|^{-1/2} \exp\{-(y_i - X_i\beta)^T V_i^{-1}(\xi, x_i)(y_i - X_i\beta)/2\}. \qquad (5.88)$$
Here, one fits the “full model” of interest (5.58) first by solving the estimating equations for β and ξ dictated by (5.88) [equivalently, maximizing (5.88)] to obtain β̂ and ξ̂. One then imposes the condition dictated by the null hypothesis Lβ = 0 and fits the resulting “reduced model” by maximizing the corresponding (5.88) (solving the estimating equations) to obtain β̂^{(0)} and ξ̂^{(0)}, say.
The likelihood ratio test statistic is then

$$T_{LRT} = -2\{\log \mathcal{L}_{ML}(\hat{\beta}^{(0)}, \hat{\xi}^{(0)}) - \log \mathcal{L}_{ML}(\hat{\beta}, \hat{\xi})\}. \qquad (5.89)$$

Under regularity conditions, the test statistic T_{LRT} in (5.89) has an approximate chi-squared distribution with degrees of freedom equal to the difference in the dimensions p of β in the “full” model and that for the “reduced” model; this difference is typically equal to c.
• Although the test statistic (5.89) comes about from assuming that the distribution of Y i given x i
is normal , large sample (large m) arguments show that the result that TLRT has an approximate
χ2 distribution holds even if this distribution is not normal.
• If one uses the REML objective function in place of the normal likelihood (5.88), a valid test
is not obtained. This is because the population mean parameter β is eliminated from consideration through the “error contrasts,” and this parameter is different under the “full” and
“reduced” models, so that the REML objective function is effectively based on different (mean
zero) responses under each model and thus the two REML “loglikelihoods” are not comparable.
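The likelihood ratio computation in (5.88)-(5.89) can be sketched as follows; the function evaluates the normal loglikelihood for given β and per-individual covariance matrices (taken here as already evaluated at ξ̂), and the actual model fitting is assumed to be done elsewhere:

```python
import numpy as np

def normal_loglik(beta, X_list, y_list, V_list):
    """Normal loglikelihood, the log of (5.88), for given beta and
    per-individual covariance matrices V_i (already evaluated at xi)."""
    ll = 0.0
    for Xi, yi, Vi in zip(X_list, y_list, V_list):
        ri = yi - Xi @ beta                       # residual vector
        _, logdet = np.linalg.slogdet(Vi)         # stable log |V_i|
        ll -= 0.5 * (len(yi) * np.log(2 * np.pi) + logdet
                     + ri @ np.linalg.solve(Vi, ri))
    return ll

# T_LRT of (5.89): twice the drop in maximized loglikelihood from the
# full fit (beta_hat, xi_hat) to the reduced fit (beta_hat0, xi_hat0),
# compared to a chi-squared critical value with c degrees of freedom.
```

Using `slogdet` rather than forming the determinant directly avoids overflow and underflow when the n_i are not small.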
Inference on components of ξ is also sometimes of interest. We defer discussion of this to Chapter 6. For now, we describe the use of so-called information criteria as a way of informally comparing competing models, and in particular competing covariance models.
INFORMATION CRITERIA: Although scientific questions are typically framed in terms of β in a model
of the form E(Y i |x i ) = X i β and can sometimes be cast as a comparison between nested models for
the population mean response of this form, other questions arise where the models to be compared
cannot be viewed as nested.
For example, in building a model of the form (5.58), while we have in mind a specific model E(Y_i|x_i) = X_i β in which to frame the scientific questions, we may wish to compare the support in the data for several different models V_i(ξ, x_i) for the overall covariance structure var(Y_i|x_i). Ordinarily, competing covariance models, e.g., compound symmetric versus AR(1), are not nested. Alternatively, we may wish to compare two competing models for E(Y_i|x_i) that involve different combinations of covariates and consequently are not nested.
Information criteria provide an informal approach to these challenges. As is well known, the more
parameters that are incorporated in a model, the larger the loglikelihood becomes; thus, if we
wish to compare competing models that are not nested based on the maximized loglikelihoods for
these models, we must take this into account. Simply comparing the maximized loglikelihoods directly
favors “larger” models. Accordingly, the idea behind information criteria is to incorporate a penalty for
using more parameters and compare instead penalized versions of the maximized loglikelihoods.
Let $\log \hat{\mathcal{L}}_{ML}$ denote generically the maximized loglikelihood for a specific mean-covariance model, and let P be the total number of parameters (mean and covariance) in the model (= p + r + s for us). Some popular information criteria are as follows; the definitions are such that smaller values are preferred.
• Akaike Information Criterion (AIC):

$$AIC = -2\log\hat{\mathcal{L}}_{ML} + 2P. \qquad (5.90)$$

• Schwarz’s Bayesian Information Criterion (BIC): With N the total number of observations,

$$BIC = -2\log\hat{\mathcal{L}}_{ML} + P\log N. \qquad (5.91)$$

• Hannan-Quinn Information Criterion (HQ):

$$HQ = -2\log\hat{\mathcal{L}}_{ML} + P\log(\log N). \qquad (5.92)$$
All but AIC involve penalties depending on both the number of model parameters P and the total
number of observations N, so that differences in loglikelihood are calibrated relative to both of
these factors.
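The criteria (5.90)-(5.92) are simple to compute from a maximized loglikelihood; a sketch using the penalty definitions exactly as given here (note that some references scale the HQ penalty by an additional factor of 2):

```python
import numpy as np

def info_criteria(loglik, P, N):
    """AIC, BIC, and HQ as defined in (5.90)-(5.92); smaller is preferred.
    loglik: maximized loglikelihood; P: number of parameters;
    N: total number of observations."""
    return {
        "AIC": -2.0 * loglik + 2.0 * P,
        "BIC": -2.0 * loglik + P * np.log(N),
        "HQ":  -2.0 * loglik + P * np.log(np.log(N)),
    }

# e.g., fit two covariance models by ML with the same mean model and
# prefer the one whose criterion values are smaller.
```

As noted below for REML, such comparisons are meaningful only when the fits being compared are on the same footing (same responses and, for REML, the same mean model).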
Analogous criteria can be defined based on the logarithm of the REML objective function. However ,
as noted above, REML “loglikelihoods” are comparable only if they involve the same mean model;
thus, information criteria based on REML should be used only to compare covariance models paired
with the same population mean response model. Some advocate here setting P equal to the number
of covariance parameters (P = r + s). In addition, because the REML objective function is formulated
based on N − p error contrasts, N in (5.91) and (5.92) should be replaced by N − p.
Inspection of information criteria should not be used to draw formal inferences; rather, they should
be viewed only as ad hoc rules of thumb. It is entirely possible in practice that different criteria will
prefer different models. AIC often prefers “larger” models relative to BIC, with HQ intermediate. It
is beyond our scope to offer a rigorous justification for the use of information criteria for this purpose.
5.6 Missing data
Longitudinal data analysis often involves dealing with missing data , most prominently because of
attrition of individuals over time; that is, dropout. This is, of course, a recurrent challenge when the
individuals are human subjects.
Here, although it is intended to ascertain the outcome of interest at specific time points, as in many
of the examples we have examined, some individuals fail to present for the outcome to be recorded
after a certain time point , leading to what is often called a monotone pattern of missingness.
More generally, it is the case in many longitudinal studies that individuals do not show up at the
intended times in a haphazard fashion , so that the pattern of missingness for any individual can be
nonmonotone.
We have already discussed in Section 5.2 the hip replacement study, in which several patients exhibit
a nonmonotone missingness pattern in which they are missing the intended response measurement at week 2 (with one patient also missing the baseline measurement). Recall that, because
this phenomenon seems systematic and occurs for about half of the patients of each gender, it is
reasonable to speculate that the fact that these observations are missing has nothing to do with the
health status of patients or their genders. We return to this point shortly.
EXAMPLE: AGE-RELATED MACULAR DEGENERATION CLINICAL TRIAL: Figure 5.4 shows data
reported by Molenberghs and Kenward (2007) from a multicenter clinical trial comparing an experimental (active) treatment, interferon-α, to placebo in m = 240 patients with age-related macular
degeneration (AMD), a leading cause of vision loss among people aged 50 and older. AMD causes
damage to the macula, a spot near the center of the retina and the part of the eye needed for sharp,
central vision. Patients with AMD progressively lose vision at varying rates. The response, visual
acuity , was assessed at baseline (week 0) and then at weeks 4, 12, 24, and 52, and measured the
total number of letters a patient read correctly on a standardized vision chart with lines of letters of
decreasing size.
Patients were randomized to the two treatments, and all have baseline responses observed; however,
only 188 of the 240 patients have observed responses at all five time points. Of those remaining,
24 dropped out before the final clinic visit at 52 weeks, 8 before the 24 week visit, 6 before the 12
week visit, and 6 before the 4 week visit, with the remaining 8 missing visits intermittently. These
data exemplify the very common situation in longitudinal studies in humans in which missingness is
almost entirely due to dropout.
[Figure 5.4 here: two panels, Placebo and Active Treatment, plotting visual acuity (0 to 80) against week (0 to 52).]

Figure 5.4: Visual acuity profiles for subjects in the age-related macular degeneration trial. Averages of observed responses at each time point are superimposed on the individual profiles in each panel.
A full account of the implications of such missing data on inference and of methods for valid analysis
in their presence is the subject of an entire course. Accordingly, we restrict our attention here to
implications of missingness for the analysis methods we have discussed; these will also be relevant
to those in the next chapter. Whether or not proceeding with an analysis using the observed data
as if they were the intended data leads to valid inferences on questions of interest depends on the
underlying mechanism responsible for the missingness, as we now discuss.
NOTATION: We first introduce notation in the context of our longitudinal data framework useful for formalizing study of missingness. We defined in (5.25) the full data

$$Z_i = (Z_{i1}, \ldots, Z_{in})^T; \qquad (5.93)$$

that is, the responses intended to be collected on individual i at prespecified times t_1, ..., t_n. We focus on the situation where the responses actually observed, which we denote as Y_i, have components that are a subset of those of Z_i, as in (5.26) for the hip replacement data and evidently for the AMD data.
Assume that the covariates planned to be recorded, x i , are observed for all individuals i = 1, ... , m.
Of course, in practice, this is also not always the case, but this is beyond the scope of our discussion
here. As is customary in this context, we consider the problem conditional on x i .
From this point of view, if we intend to collect the responses (5.93), then it is clear that the questions
of interest pertain to the population mean response for Z i given x i . Thus, when we adopt a model
for the population mean response for Y i given x i as we have discussed up to now, implicitly, we are
ordinarily actually specifying a model for the population mean response for Z i given x i .
Accordingly, the questions of interest pertain to the distribution of the full data. When data are
missing, our objective is thus to address those questions based on the observed data.
Define the missing data indicators corresponding to the n components of Z_i as

$$R_{ij} = \begin{cases} 1 & \text{if } Z_{ij} \text{ is observed}, \\ 0 & \text{otherwise}, \end{cases} \qquad j = 1, \ldots, n;$$

and let

$$R_i = (R_{i1}, \ldots, R_{in})^T, \qquad (5.94)$$

so that R_i records whether or not Z_{ij}, j = 1, ..., n, is observed. Then the ideal full data are

$$(Z_i, R_i),$$

which, unless R_i = 1 (the vector of all ones), can never be fully observed (convince yourself).
REMARK: Some authors refer to Z i as the complete data and (Z i , Ri ) as the full data.
Let r denote a possible missingness pattern; that is, a vector of zeroes and ones that is a possible value of R_i in (5.94). In general, there are 2^n possible missingness patterns. If the only missingness patterns observed are those corresponding to dropout, and all individuals are observed at baseline (time t_1), then there are n possible patterns:

$$(1, 0, \ldots, 0), \quad (1, 1, 0, \ldots, 0), \quad \ldots, \quad (1, 1, \ldots, 1).$$

For a specific pattern of missingness r, write Z_{(r)i} to denote the part of Z_i that is observed, and Z_{(r̄)i} to denote the part that is missing. Then (convince yourself), we can represent the data that we actually get to see as

$$(Z_{(R_i)i},\, R_i), \qquad i = 1, \ldots, m. \qquad (5.95)$$
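The bookkeeping above can be made concrete with a short sketch; the numeric values are hypothetical:

```python
from itertools import product

n = 5   # number of intended time points, as in the AMD trial

# all 2^n conceivable missingness patterns r for R_i
all_patterns = list(product([0, 1], repeat=n))

# dropout (monotone) patterns with baseline always observed:
# (1,0,...,0), (1,1,0,...,0), ..., (1,1,...,1)
dropout_patterns = [tuple([1] * k + [0] * (n - k)) for k in range(1, n + 1)]

def observed_part(z, r):
    """Z_(r)i: the components of the full data z observed under pattern r."""
    return [zj for zj, rj in zip(z, r) if rj == 1]

z = [55, 53, 52, 50, 48]        # hypothetical full-data vector Z_i
r = dropout_patterns[2]         # (1, 1, 1, 0, 0): dropout after three visits
print(len(all_patterns), observed_part(z, r))   # 32 [55, 53, 52]
```

With n = 5 there are 2^5 = 32 conceivable patterns but only 5 monotone dropout patterns with baseline observed, illustrating why dropout is the easiest missingness structure to work with.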
We have been referring to the observed data as Y_i, which we now identify with Z_{(r_i)i} when R_i = r_i; however, strictly speaking, the missing indicators are also part of the observable information. In the missing data literature, (5.95) is referred to as the observed data.

Write the density of R_i given Z_i and x_i as

$$p(r_i \mid z_i, x_i) = \mathrm{pr}(R_i = r_i \mid Z_i = z_i,\, x_i). \qquad (5.96)$$
MISSING DATA MECHANISMS: Rubin (1976) pioneered a hierarchical taxonomy of missing data mechanisms, which has become standard:

• Missing Completely at Random (MCAR): The data are said to be MCAR if

$$\mathrm{pr}(R_i = r_i \mid Z_i, x_i) \text{ does not depend on } Z_i; \qquad (5.97)$$

that is, R_i ⊥⊥ Z_i conditional on covariates x_i. Then

$$p(r_i \mid z_i, x_i) = p(r_i \mid x_i). \qquad (5.98)$$
The MCAR mechanism is plausible in situations where it is clear that missingness has nothing
to do with the issues under study; for example, human subjects drop out of a study because
they move away for work or family reasons. In the hip replacement study, if the missing values
at week 2 are due to faulty equipment, for example, it may be reasonable to assume that the
mechanism is MCAR.
Intuitively, under a MCAR mechanism, it should be possible to make valid inferences on the
questions of interest. The observed data are still representative of the information intended
to be collected; there are just fewer observations than originally planned. Thus, the main
consequence of proceeding with an analysis of the observed data will be loss of efficiency.
• Missing at Random (MAR): The data are said to be MAR if

$$\mathrm{pr}(R_i = r_i \mid Z_i, x_i) = \mathrm{pr}(R_i = r_i \mid Z_{(r_i)i}, x_i); \qquad (5.99)$$

that is, the probability of missingness pattern r_i as a function of Z_i depends only on the components of Z_i that are observed under r_i. Then

$$p(r_i \mid z_i, x_i) = p(r_i \mid z_{(r_i)i}, x_i). \qquad (5.100)$$
If subjects base their decisions to drop out on their observed response values to that point,
and these values are available to the data analyst, then the MAR mechanism is plausible.
Intuitively, if the missingness of intended data is associated with evolving responses, and those responses reflect, for example, evolving health status, so that sicker patients are more likely to drop out, then the observed data are probably not representative of the information intended to be collected. Patients who remain in the study may be the ones who are doing better on their assigned treatments; accordingly, proceeding with an analysis to address the questions of interest without taking this into account is likely to lead to misleading inferences.
If all data implicated in dropout decisions are available to the data analyst, as they are if the
mechanism is MAR, it should be possible to do something to “adjust ” for the missingness on
their basis in an analysis of the observed data.
• Missing Not at Random (MNAR): The data are said to be MNAR if pr(Ri = r i |Z i , x i ) depends
on components of Z i that are not observed when Ri = r i .
Intuitively, if a MNAR mechanism governs the missingness, again, the observed data are not
representative of what was intended. However, because the data that are implicated in dropout
decisions are not available , “adjusting ” the analysis for missingness seems hopeless.
AMD EXAMPLE, CONTINUED: In the AMD study, as in most clinical or observational studies of
humans with dropout , it is unlikely that the missingness has nothing to do with the health status of
the subjects. For example, it may well be that patients whose vision continues to deteriorate might
decide to leave the study on the advice of their physicians over concerns that they are achieving no
benefit. Here, MCAR is clearly implausible.
If these decisions are based solely on inspection of the visual acuity measures up to that point,
assuming a MAR mechanism would be reasonable. On the other hand, if the decisions are made
based on other, unrecorded factors that might be associated with patients’ future prognosis and
that would be reflected in future visual acuity measures , which are not observed , the mechanism
is MNAR.
FUNDAMENTAL CHALLENGE: Of course, we cannot determine from the available data which of
these two explanations reflects the true state of affairs. This conundrum exemplifies the fundamental
challenge of inference with missing data – the true missingness mechanism is not identifiable
from the observed data. Accordingly, whether it is plausible to assume that the mechanism is MCAR or MAR, assumptions under which methods for achieving valid inferences on questions of interest based on the observed data are fairly straightforward, cannot itself be determined from the data.
Ordinarily, subject-matter expertise and knowledge are incorporated to justify the assumption of MCAR or MAR; however, it remains an unverifiable assumption.
The upshot is that applying the longitudinal analysis methods we have discussed so far and will
discuss in the remainder of the course to the observed data when there is missingness without
acknowledgment of this complication can lead to misleading inferences.
A full course on analysis in the presence of missing data examines this issue in excruciating detail.
Here, we focus on one key result that speaks directly to the validity of carrying out an analysis of the
observed data using the methods in this and the next chapter under the assumption of MAR.
OBSERVED DATA LIKELIHOOD: Consider the joint density of the ideal full data (Z i , Ri ), which
we write as
\[
p(z_i, r_i \mid x_i) = p(r_i \mid z_i, x_i)\, p(z_i \mid x_i). \tag{5.101}
\]
In (5.101), we have factorized the density into the product of two terms.
• The first term on the right hand side of (5.101), p(r i |z i , x i ), is the density of the missingness
indicator Ri given the full data Z i and covariates x i . As above, depending on the missingness
mechanism, this density might simplify ; we discuss this momentarily.
• The second term on the right hand side, p(z i |x i ), is the density of the intended, full data given
covariates. As discussed above, we now see that, from the perspective of the missing data
framework, the models we have written for E(Y i |x i ) and var(Y i |x i ) in (5.58), and indeed for
the density p(y i |x i ) under the assumption of normality in (5.29), are really models implicitly
reflecting our beliefs about the density of the full data Z i given x i .
Thus, the ML methods derived in Section 5.3 correspond to assuming that p(z_i | x_i) in (5.101) is the n-variate normal density, depending on the population mean and covariance parameters η = (β^T, ξ^T)^T.
In principle, we could also adopt a model for the density of the missingness mechanism, involving a parameter ψ, say. Thus, write the assumed model for (5.101) as
\[
p(z_i, r_i \mid x_i; \psi, \eta) = p(r_i \mid z_i, x_i; \psi)\, p(z_i \mid x_i; \eta). \tag{5.102}
\]
For R_i = r_i, we can partition Z_i as above into observed and missing components as (Z_{(r_i)i}, Z_{(\bar{r}_i)i}). Accordingly, we can write (5.102) as
\[
p(z_{(r_i)i}, z_{(\bar{r}_i)i}, r_i \mid x_i; \psi, \eta) = p(r_i \mid z_{(r_i)i}, z_{(\bar{r}_i)i}, x_i; \psi)\, p(z_{(r_i)i}, z_{(\bar{r}_i)i} \mid x_i; \eta).
\]
It follows that we can obtain the joint density of the observed component and R_i as
\[
\begin{aligned}
p(z_{(r_i)i}, r_i \mid x_i; \psi, \eta) &= \int p(z_{(r_i)i}, z_{(\bar{r}_i)i}, r_i \mid x_i; \psi, \eta)\, dz_{(\bar{r}_i)i} \\
&= \int p(r_i \mid z_{(r_i)i}, z_{(\bar{r}_i)i}, x_i; \psi)\, p(z_{(r_i)i}, z_{(\bar{r}_i)i} \mid x_i; \eta)\, dz_{(\bar{r}_i)i}.
\end{aligned} \tag{5.103}
\]
This is the density of the observed data (Z_{(R_i)i}, R_i) in (5.95) as discussed above.
Now under MAR, from (5.100), the first term in the integrand of (5.103) satisfies
\[
p(r_i \mid z_i, x_i; \psi) = p(r_i \mid z_{(r_i)i}, z_{(\bar{r}_i)i}, x_i; \psi) = p(r_i \mid z_{(r_i)i}, x_i; \psi).
\]
Substituting in (5.103), we obtain
\[
\begin{aligned}
p(z_{(r_i)i}, r_i \mid x_i; \psi, \eta) &= \int p(r_i \mid z_{(r_i)i}, x_i; \psi)\, p(z_{(r_i)i}, z_{(\bar{r}_i)i} \mid x_i; \eta)\, dz_{(\bar{r}_i)i} \\
&= p(r_i \mid z_{(r_i)i}, x_i; \psi) \int p(z_{(r_i)i}, z_{(\bar{r}_i)i} \mid x_i; \eta)\, dz_{(\bar{r}_i)i} \\
&= p(r_i \mid z_{(r_i)i}, x_i; \psi)\, p(z_{(r_i)i} \mid x_i; \eta). \tag{5.104}
\end{aligned}
\]
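The final step of (5.104), in which integrating the full-data density over the missing components yields the density of the observed components, can be checked numerically in a toy case. The sketch below fixes an observed z_1 in an illustrative bivariate normal, integrates the joint density over the "missing" z_2 by the trapezoidal rule, and compares the result with the closed-form marginal normal density (all numerical values are arbitrary choices for illustration):

```python
import numpy as np

# Illustrative bivariate normal for (observed z1, missing z2)
mu = np.array([1.0, -0.5])
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])

def joint_density(z1, z2):
    """Bivariate normal density evaluated at (z1, z2); z2 may be a grid."""
    z = np.stack([np.broadcast_to(z1, np.shape(z2)), z2], axis=-1) - mu
    Sinv = np.linalg.inv(Sigma)
    quad = np.einsum('...i,ij,...j->...', z, Sinv, z)
    return np.exp(-0.5 * quad) / (2.0 * np.pi * np.sqrt(np.linalg.det(Sigma)))

z1 = 0.3                                 # fixed "observed" value
grid = np.linspace(-10.0, 10.0, 20001)   # grid over the "missing" z2
f = joint_density(z1, grid)
# Trapezoidal rule: integrate the joint density over z2
marginal_numeric = np.sum(0.5 * (f[:-1] + f[1:]) * np.diff(grid))

# Closed-form N(mu[0], Sigma[0, 0]) marginal density at z1
marginal_exact = (np.exp(-0.5 * (z1 - mu[0]) ** 2 / Sigma[0, 0])
                  / np.sqrt(2.0 * np.pi * Sigma[0, 0]))
```

The two values agree to numerical precision, mirroring the fact exploited later in this section: marginalizing a multivariate normal over a subset of its components leaves a normal density in the remaining components.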
Suppose now that we have a sample of observed data from m individuals, (Z_{(R_i)i}, R_i), i = 1, ..., m, as in (5.95). Consider the form of the likelihood for the parameters (ψ^T, η^T)^T based on the observed data, often called the observed data likelihood. From (5.104), the contribution to the likelihood for an individual i with R_i = r is
\[
p(Z_{(r)i}, r \mid x_i; \psi, \eta) = p(r \mid Z_{(r)i}, x_i; \psi)\, p(Z_{(r)i} \mid x_i; \eta); \tag{5.105}
\]
when r = 1, in fact Z_{(r)i} = Z_i. It follows that the contribution to the likelihood for the ith individual can be written
\[
\prod_{r} p(Z_{(r)i}, r \mid x_i; \psi, \eta)^{I(R_i = r)} = \prod_{r} p(r \mid Z_{(r)i}, x_i; \psi)^{I(R_i = r)}\, p(Z_{(r)i} \mid x_i; \eta)^{I(R_i = r)}, \tag{5.106}
\]
where the product is over all possible missingness patterns r. The observed data likelihood is then the product over i = 1, ..., m of the terms (5.106).
IGNORABILITY: Assume that the parameters ψ and η are variation independent in the sense that
their possible values lie in a rectangle, so that the range of η is the same for all possible values of
ψ, and vice versa. This is often called the separability condition. Similar assumptions are often
made in statistical modeling more generally without comment.
Under the separability condition, there is no information about the parameter of interest, η, in the
first term on the right hand side of (5.106). Thus, for the purpose of maximizing the likelihood to make
inference on η, we can ignore this term. Accordingly, we need only maximize in η
\[
\prod_{i=1}^{m} \prod_{r} p(Z_{(r)i} \mid x_i; \eta)^{I(R_i = r)} \quad \text{or, equivalently,} \quad \sum_{i=1}^{m} \sum_{r} I(R_i = r) \log p(Z_{(r)i} \mid x_i; \eta). \tag{5.107}
\]
In fact, under ignorability and separability , it is common to refer to (5.107) as the observed data
likelihood (loglikelihood).
Now consider (5.107) from the perspective of the ML approach in Section 5.3. As we have noted, in the context of intending to collect full data at prespecified time points, the spirit of the model for the observed response vector Y_i conditional on x_i we have discussed is that it really reflects a model for Z_i given x_i. That is, the questions of interest are formulated within a model for the data we intend to collect.
From this perspective, as noted above, we are thus assuming that the distribution of Z_i given x_i is n-variate normal. If no data were missing, the likelihood for η would be the product of the individual n-variate normal densities dictated by our assumptions on the conditional (on x_i) population mean and covariance structure.
When some of the intended observations are missing, and the missingness mechanism is assumed to be MAR (which, of course, we cannot verify from the data), the contribution
\[
p(Z_{(r)i} \mid x_i; \eta)
\]
for individual i with R_i = r is the density of the corresponding subvector of Z_i. As is well known, any subvector of a multivariate normal random vector is itself multivariate normal, with mean vector and covariance matrix corresponding to the components contained in the subvector; an example of the latter was demonstrated in (5.27) for the hip replacement study.
Accordingly, when some responses are missing, and we are willing to believe the assumption of MAR and the separability condition, we can ignore the first term on the right hand side of (5.106), and thus we can regard the likelihood in (5.30) and (5.31) as the observed data likelihood. That is, the usual analysis we carry out to estimate β and ξ under the assumption that the response vectors Y_i are n_i-variate normal conditional on x_i corresponds to the likelihood analysis we would perform both if the full data were observed (i.e., r = 1 for all m individuals) and with missing data (i.e., some individuals missing some components of Z_i) under MAR.
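In computational terms, the observed data loglikelihood (5.107) under MAR and normality is just a sum of multivariate normal log densities, each evaluated over the subvector an individual actually contributes. The sketch below assumes, for simplicity, a common full-data covariance matrix for all individuals; the function name and data layout are our own and do not correspond to any particular software package:

```python
import numpy as np

def observed_data_loglik(Y_list, R_list, X_list, beta, cov_full):
    """Observed data loglikelihood (5.107) under MAR, assuming the full
    data Z_i given x_i are n-variate normal with mean X_i beta and a
    common covariance matrix cov_full (an illustrative simplification).
    Each individual contributes the log density of the multivariate
    normal marginal over the components actually observed."""
    total = 0.0
    for Y, R, X in zip(Y_list, R_list, X_list):
        obs = np.flatnonzero(R)             # indices of observed components
        mu = (X @ beta)[obs]
        V = cov_full[np.ix_(obs, obs)]      # corresponding covariance subblock
        resid = Y[obs] - mu
        _, logdet = np.linalg.slogdet(V)
        total += -0.5 * (len(obs) * np.log(2.0 * np.pi) + logdet
                         + resid @ np.linalg.solve(V, resid))
    return total

# Toy check: one subject with only the first of two components observed
Y_list = [np.array([1.0, 2.0])]
R_list = [np.array([True, False])]
X_list = [np.eye(2)]
beta = np.array([0.5, 0.0])
cov_full = np.array([[2.0, 0.5],
                     [0.5, 1.0]])
ll = observed_data_loglik(Y_list, R_list, X_list, beta, cov_full)
```

With no missingness (r = 1 for all individuals), each individual contributes the full n-variate density, and the expression reduces to the usual loglikelihood described above.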
The first term on the right hand side of (5.106) represents the missing data mechanism under
MAR. Under these conditions, then, if interest is solely in the parameters β and ξ, there is no need
to model and fit the missing data mechanism.
KEY RESULT: The usual conclusion from these developments is thus that, under MAR , we expect
the usual analysis to yield valid inferences on β and ξ. However, we must be careful to qualify what
we mean by “valid inferences.”
• We emphasize that, for the usual analysis to yield valid inference, both (i) the assumption
that the distribution of Z i given x i is multivariate normal with mean and covariance structure
correctly specified and (ii) the assumption of MAR must hold. If either of these assumptions
is not true, then it is no longer the case that the inferences are necessarily valid.
• The estimators for β and ξ obtained by maximizing (5.107) in η are identical to those obtained
by maximizing (5.106) (under separability). The estimators so obtained will be consistent for
the true values of these parameters assuming, of course, that the full data model is correctly
specified.
• Likelihood ratio tests comparing nested models for the full data based on the statistic in
(5.89) will also be valid, as, under separability , the missingness mechanism in (5.106) would
have been estimated identically under the “full” and “reduced” full data models and thus cancels in the final test statistic.
WRINKLE: Although these results are pleasing, there is a catch: obtaining an appropriate approximate sampling distribution to use as the basis for standard errors and Wald confidence intervals and tests is not straightforward, as discussed in detail by Verbeke and Molenberghs (2000, Section 17.3 and Chapter 21) and Molenberghs and Kenward (2007, Chapter 12). We focus here, as we have previously, on inference on β.
• Recall the Taylor series argument in (5.63) and (5.64) to derive the approximate sampling distribution for the ML estimator β̂ when the models for the mean and covariance matrix are correctly specified. From the vantage point of missing data, this argument was made, and accordingly expectations of the quantities involved were taken, acting as if the lengths n_i of the Y_i were fixed by design. If, as here, normality holds and the mean and covariance models are correct, this argument yields the same large-m approximate sampling distribution for β̂ as does finding the expected information matrix for (β̂^T, ξ̂^T)^T and inverting it, where the expectation is taken from this perspective.
• Specifically, recall the definitions
\[
A_m = m^{-1} \sum_{i=1}^{m} X_i^T V_{0i}^{-1} X_i, \qquad
E_m = m^{-1} \sum_{i=1}^{m} X_i^T \{\partial/\partial\xi\, V_i^{-1}(\xi_0, x_i)\}(Y_i - X_i\beta_0).
\]
With ξ {(r + s) × 1}, using the results for matrix differentiation in Appendix A, E_m is in fact the {p × (r + s)} matrix with kth column
\[
-m^{-1} \sum_{i=1}^{m} X_i^T V_{0i}^{-1} \{\partial/\partial\xi_k\, V_i(\xi_0, x_i)\} V_{0i}^{-1} (Y_i - X_i\beta_0). \tag{5.108}
\]
Thus, the observed information matrix, that is, the negative of the matrix of second partial derivatives of the loglikelihood (5.31), is
\[
\begin{pmatrix} A_m & -E_m \\ -E_m^T & -G_m \end{pmatrix}, \tag{5.109}
\]
where G_m is the {(r + s) × (r + s)} matrix of second partial derivatives of the loglikelihood with respect to the elements of ξ. If we find the expected information matrix by taking the expectation of (5.109) (conditional on x̃), acting as if the lengths n_i of the Y_i were fixed by design, then
\[
E(E_m \mid \tilde{x}) = 0. \tag{5.110}
\]
The expected information matrix is then block diagonal, so that, taking the inverse of the conditional expectation of (5.109), by standard likelihood theory, we are led to the result in (5.67),
\[
m^{1/2}(\hat{\beta} - \beta_0) \xrightarrow{L} \mathcal{N}(0, A^{-1}), \qquad A = \lim_{m\to\infty} m^{-1} \sum_{i=1}^{m} X_i^T V_{0i}^{-1} X_i. \tag{5.111}
\]
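As a concrete illustration of (5.111), the model-based covariance matrix for β̂ implied by the "usual" analysis is obtained by inverting A_m. The sketch below assembles it from per-individual design matrices X_i and working covariance matrices V_i; all arrays are illustrative stand-ins, and the function name is our own:

```python
import numpy as np

def model_based_cov(X_list, V_list):
    """Model-based covariance for beta-hat implied by (5.111):
    form A_m = m^{-1} sum_i X_i' V_i^{-1} X_i and return A_m^{-1} / m,
    which approximates var(beta-hat) for finite m."""
    m = len(X_list)
    p = X_list[0].shape[1]
    A_m = np.zeros((p, p))
    for X, V in zip(X_list, V_list):
        A_m += X.T @ np.linalg.solve(V, X)   # X' V^{-1} X without explicit inverse
    A_m /= m
    return np.linalg.inv(A_m) / m

# Illustrative use: two individuals with different numbers of observations
X_list = [np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]),
          np.array([[1.0, 0.0], [1.0, 1.0]])]            # unbalanced n_i
V_list = [np.eye(3), np.eye(2)]
cov_beta = model_based_cov(X_list, V_list)
se = np.sqrt(np.diag(cov_beta))
```

This is the covariance whose diagonal yields the "usual" standard errors; the point of the discussion that follows is that, when the n_i are determined by MAR missingness rather than by design, these standard errors are too small.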
• The foregoing argument hinges critically on the fact that the expectation was taken as if the lengths n_i of the Y_i were fixed by design, leading to (5.111). However, from the perspective of missing data, these lengths are not fixed in advance; rather, they are a consequence of the realized pattern of missingness. Accordingly, calculation of the expected information must acknowledge this by placing the problem in the missing data framework we have just described.
• Calculation of E(E_m | x̃), or, more precisely, from (5.108), the expectation of a summand in the kth column of E_m,
\[
X_i^T V_{0i}^{-1} \{\partial/\partial\xi_k\, V_i(\xi_0, x_i)\} V_{0i}^{-1} (Y_i - X_i\beta_0), \tag{5.112}
\]
from this point of view can be accomplished via a conditioning argument in which the conditioning set involves the missingness pattern R_i. The details of this formulation and argument are beyond our scope here.
• It can then be shown that the expectation of (5.112) (conditional on x̃) is not equal to 0 in general, so that E(E_m | x̃) ≠ 0 and
\[
E_m \xrightarrow{p} E
\]
for some E ≠ 0 in general. Thus, the expected information is not block diagonal.
• Consequently, if we appeal to standard likelihood theory, using the formula for the inverse of a partitioned matrix in Appendix A, we obtain that, acknowledging that the n_i are the result of missingness,
\[
m^{1/2}(\hat{\beta} - \beta_0) \xrightarrow{L} \mathcal{N}[\,0, \{A - E(-G)^{-1}E^T\}^{-1}\,], \qquad G_m \xrightarrow{p} G, \tag{5.113}
\]
for G {(r + s) × (r + s)} with −G positive definite (G_m, a matrix of second partial derivatives of the loglikelihood, is negative definite near the maximum).
Comparing (5.113) to (5.111) shows that using the usual large sample approximation to the sampling distribution of β̂, when the Y_i are (n_i × 1) as a result of an MAR mechanism, leads to standard errors that are too small, and thus to Wald test statistics and confidence intervals that are too optimistic.
• As a way around this, it has been advocated that, instead of basing the approximate sampling distribution on the usual expected information matrix, one should base it on the observed information matrix (5.109) and obtain standard errors and other inferences by inverting this matrix. This preserves the nonzero off-diagonal blocks, providing an empirical approximation to (5.113). A practical difficulty is that most software packages do not offer this option and do not output this matrix as a by-product of the optimization of the loglikelihood.
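When software will not return the observed information matrix, one generic workaround is to approximate it numerically: evaluate the observed data loglikelihood near the ML estimate and take the negative of a finite-difference Hessian. The helper below is a plain central-difference sketch, with the step size h and all names being our own choices rather than features of any particular package:

```python
import numpy as np

def numerical_hessian(f, theta, h=1e-4):
    """Central finite-difference Hessian of a scalar function f at theta.
    Taking the negative of this Hessian of the observed data
    loglikelihood at the ML estimate approximates the observed
    information matrix (5.109); inverting it gives standard errors."""
    p = len(theta)
    H = np.zeros((p, p))
    for j in range(p):
        for k in range(p):
            e_j, e_k = np.zeros(p), np.zeros(p)
            e_j[j], e_k[k] = h, h
            H[j, k] = (f(theta + e_j + e_k) - f(theta + e_j - e_k)
                       - f(theta - e_j + e_k) + f(theta - e_j - e_k)) / (4.0 * h * h)
    return H

# Sanity check on a quadratic toy "loglikelihood": for f(t) = -0.5 t'At,
# the Hessian is -A, so the recovered observed information should equal A.
A = np.array([[2.0, 0.5],
              [0.5, 1.0]])
loglik_toy = lambda t: -0.5 * t @ A @ t
obs_info = -numerical_hessian(loglik_toy, np.array([0.3, -0.2]))
se = np.sqrt(np.diag(np.linalg.inv(obs_info)))
```

Central differences are exact (up to rounding) for quadratic functions, which is why the toy check recovers A; in practice, `loglik_toy` would be replaced by the observed data loglikelihood evaluated as a function of (β, ξ) at the ML estimate.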
• It has become common practice (which of course does not make it correct) to disregard this issue and to use the usual approximate sampling distribution for inference as if it were valid. Although this has the potential to yield misleadingly optimistic inferences, there are empirical examples where it does not seem to be too terrible. However, it is important to be aware of this problem. Ideally, inference should be based on the inverse of the full observed information matrix.
• It goes without saying that one should not use the robust or empirical covariance matrix (5.82) in this situation. Not only does it suffer from the same drawback, but it also allows for the possibility that the covariance model is incorrectly specified, in which case the justification for validity under MAR, which requires correct specification, no longer applies.
REMARK: Although in the particular case of a correctly specified model, an MAR mechanism, and a likelihood-based analysis it is possible to obtain valid inferences on the questions of interest (regarding aspects of the full data distribution), it is important to recognize that this is not the case in general. Proceeding with a standard analysis in the presence of missing data can lead to substantially biased results.
Accordingly, it is essential that the data analyst think critically and realistically about possible reasons
for missingness. An enormous body of literature exists on methods for achieving valid inferences in
the presence of missing data. Verbeke and Molenberghs (2000, Chapters 14-21) and Molenberghs
and Kenward (2007) offer extensive discussion of methods for handling missing data in longitudinal
data analysis, including alternative approaches under MAR and methods when it is not possible to
assume MAR (so that the mechanism is assumed to be MNAR).
REMARK: Contrary to widespread belief, analyses based on so-called Big Data are not somehow
exempt from the issues that arise because of missing data. For example, if we have data from
electronic health records on millions of subjects, the fact that some subjects have more observations on the outcome of interest might reflect that they are having encounters with the health system
more frequently because of poorer health status. Thus, subjects with fewer observations and thus
“missing data” by comparison might be healthier, so that inferences on the effects of treatments in
the population of all subjects will be compromised if this is not taken into account. With such large
m, bias (inconsistency) of standard estimators for population quantities of interest will swamp variance. The result will be estimators that are very precise but that are very far off from representing
the true quantities of interest.