Catellier, Diane J. (1998). Inference for the General Linear Multivariate Model with Missing Data in Small Samples.

INFERENCE FOR THE GENERAL LINEAR MULTIVARIATE
MODEL WITH MISSING DATA IN SMALL SAMPLES
by
Diane J. Catellier
Department of Biostatistics
University of North Carolina
Institute of Statistics
Mimeo Series No. 2192T
May 1998
Inference for the General Linear Multivariate
Model with Missing Data in Small Samples
Diane J. Catellier
A dissertation submitted to the faculty of the University of North Carolina at
Chapel Hill in partial fulfillment of the requirements for the degree of Doctor
of Public Health in the Department of Biostatistics, School of Public Health.
Chapel Hill
1998
Approved by:
Advisor
ABSTRACT
DIANE J. CATELLIER. Inference for the General Linear Multivariate Model
with Missing Data in Small Samples. (Under the direction of Dr. Keith E. Muller)
The General Linear Multivariate Model (GLMM) provides a convenient and statistically
well-behaved framework for estimation and testing with complete, multivariate Gaussian
data. The loss of any data in this setting substantially complicates matters. For the case in
which data are missing at random (MAR), reliable methods exist for estimation. In contrast,
little work has been done in the area of inference, especially in small samples.
I describe two new strategies for hypothesis testing for the GLMM in small samples with
MAR data. Both strategies use the EM algorithm for estimation, and focus on commonly
used test statistics: the Hotelling-Lawley trace (U), the Pillai-Bartlett trace (V), Wilk's
Lambda (W), and the Geisser-Greenhouse corrected univariate test (GG).
The first approach for providing accurate inference involves adjusting the sample size
(N) in the degrees of freedom for the F tests to reflect the actual amount of observed data.
Eleven sample size adjustments were examined for each test. Simulations suggest that the
preferred adjustment varies with the test statistic. The adjustment which works best for GG
is based on the mean number of non-missing responses. V requires a stronger adjustment
based on the harmonic mean number of non-missing pairs of responses. W and U work best
with an even more aggressive adjustment based on the minimum number of non-missing
pairs. The adjusted tests appear to control test size at or below the nominal rate with as few
as 12 observations and up to 10% missing data.
The second approach determines significance for the test statistics using permutation-based
methods. A Monte Carlo approximation to the permutation test is carried out by
tabulating F statistic values for the observed sample and a random sample of possible data
permutations. The p value is computed as the proportion of the data permutations with F
values that exceed the F for the original sample. Simulation results indicated that empirical
test sizes for the approximate permutation tests did not differ significantly from the target
rate, and empirical power was always equal to or greater than that for the adjusted F tests.
Versatility of permutation methods is limited, however, to a narrow range of hypotheses.
ACKNOWLEDGEMENTS
I gratefully acknowledge my dissertation advisor, Dr. Keith E. Muller, for his
encouragement, guidance, and support, and for the numerous hours he contributed to this
research. I will probably continue to write comma-spliced sentences, but they ought to be
fewer. I'd also like to thank my committee members, Dr. Paul Stewart, Dr. James Hosking,
Dr. Gary Koch, and Dr. David Weber, for their input, particularly in the early stages of this
research.
My doctoral program was financially supported by research assistantships with the
Collaborative Studies Coordinating Center (CSCC) and the Biometric Consulting
Laboratory. I'd like to thank both for contributing to my training. The University of Alberta
Hospitals provided me with opportunities to be part of cutting edge cardiovascular research,
and for this I am extremely grateful. I want to particularly thank Dr. Koon K. Teo from the
University of Alberta for his mentorship over my 8 years of graduate work.
The love and encouragement exhibited by my parents, siblings, in-laws, nephews and
friends was invaluable. They continually reminded me of the things that matter most in life.
Finally, I express my deep appreciation to my partner for her love and patience particularly
during the stressful times.
TABLE OF CONTENTS

LIST OF TABLES

LIST OF FIGURES

1 INTRODUCTION AND LITERATURE REVIEW
  1.1 Introduction
  1.2 Background for Paper #1 (Tests for Gaussian Repeated Measures with
      Missing Data in Small Samples)
    1.2.1 Complete Case
    1.2.2 Missing Data Case
    1.2.3 Mixed Model
  1.3 Background for Paper #2 (Comparison of Approximate Permutation Tests
      with Parametric Tests in MANOVA with Missing Data)
    1.3.1 Motivation
    1.3.2 Theory of Permutation Tests
    1.3.3 Literature Review Related to Permutation Tests for MANOVA
    1.3.4 Exact versus Approximate Permutation Tests
    1.3.5 Statement of the Problem
  1.4 Background for Paper #3 (LINMOD 4.0: Software for the General
      Linear Multivariate Model, Allowing Missing Data)

2 TESTS FOR GAUSSIAN REPEATED MEASURES WITH
  MISSING DATA IN SMALL SAMPLES
  2.1 Introduction
    2.1.1 Motivation
    2.1.2 General Strategies for the Analysis of Repeated Measures Designs
  2.2 Known Methods for Estimation and Inference
    2.2.1 Complete Data
    2.2.2 Data Missing at Random
  2.3 New Tests for Data Missing at Random
  2.4 Numerical Evaluations
    2.4.1 Methods
    2.4.2 Results
  2.5 Conclusions

3 COMPARISON OF APPROXIMATE PERMUTATION TESTS WITH
  PARAMETRIC TESTS FOR MANOVA WITH MISSING DATA
  3.1 Introduction
    3.1.1 Motivation
    3.1.2 Relevant Literature
  3.2 Statement of the Problem
  3.3 New Methods
    3.3.1 The Basic Method
    3.3.2 Exact versus Approximate Permutation Tests
  3.4 Numerical Evaluations
    3.4.1 Purpose of Studies and Overall Design
    3.4.2 Methods
    3.4.3 Results
  3.5 An Example: The Effect of Choline Deficiency on Humans
  3.6 Conclusions

4 LINMOD 4: A PROGRAM FOR GENERAL LINEAR MULTIVARIATE
  MODELS WITH MISSING DATA
  4.1 Motivation
  4.2 A New Approach
    4.2.1 Overview
    4.2.2 Comparison to Other Programs
    4.2.3 History of the Program
  4.3 LINMOD 4 Program
  4.4 Recommended Option Settings
  4.5 Example

5 CONCLUSIONS AND FUTURE RESEARCH
  5.1 Looking Backwards: Successes From This Research
  5.2 Looking Forward: Future Research
    5.2.1 Robustness
    5.2.2 Improving on the Adjusted Tests
    5.2.3 Power Approximation

APPENDIX A: DOCUMENTATION FOR FITML MODULE IN LINMOD 4

List of Tables

2.1 Sample Size Adjustments for Error Degrees of Freedom
2.2 Test Size for Mixed Model F (5000 Replications, ±0.006)
2.3 Test Size for GLMM F Tests (0% Missing, 5000 Replications, ±0.006)
2.4 Adjusted Degree of Freedom Test Size for Fw (5%, 10% Missing,
    5000 Replications, ±0.006)
2.5 Adjusted Degree of Freedom Test Size for Fu (5%, 10% Missing,
    5000 Replications, ±0.006)
2.6 Adjusted Degree of Freedom Test Size for Fv (5%, 10% Missing,
    5000 Replications, ±0.006)
2.7 Adjusted Degree of Freedom Test Size for FGG (5%, 10% Missing,
    5000 Replications, ±0.006)
2.8 Adjusted Degree of Freedom Test Size for Multivariate F Test
    When s = min(a, b) = 1 (5%, 10% Missing, 5000 Replications, ±0.006)
3.1 Approximate Permutation Test Size for F Tests (0% Missing,
    1000 Replications, ±0.014)
3.2 Approximate Permutation Test Size for F Tests (5%, 10% Missing,
    1000 Replications, ±0.014)
3.3 Approximate Permutation Power for F Tests (0% Missing,
    1000 Replications, ±0.031)
3.4 Approximate Permutation Power for F Tests (5%, 10% Missing,
    1000 Replications, ±0.031)
3.5 Choline Measurements Over 5-Week Period in Male Subjects
4.1 Choline Measurements Over 5-Week Period in Male Subjects

List of Figures

4.1 LINMOD Programming Statements for Choline Example
4.2 LINMOD Output for Choline Example
Chapter 1
1.1 INTRODUCTION
In repeated measures designs, the experimental unit is typically a human or animal
subject. Each subject is measured under several treatments or at different points in time.
Clinical trials research and laboratory studies routinely use repeated measures designs, as this
design allows the investigator to assess and describe the change both within and between
subjects as time and/or experimental conditions change. A variety of analysis strategies have
been proposed to evaluate the effects of treatments and covariates on the pattern of responses.
Linear models, in particular, are useful in situations in which the responses are approximately
Gaussian and can be explained by some linear function of predictors. When each subject is
observed at the same p times and no observations are missing, closed-form maximum
likelihood (ML) estimates of the model parameters are often available. More typically, the
response is not observed at all time points for every subject. An extensive literature exists for
dealing with missing data. Despite the advances made over the last 30 years in the analysis
of data with missing values, weaknesses remain, particularly in the area of hypothesis testing
in small, incomplete samples.
Studies with small samples commonly arise when the
experimental units are difficult or expensive to obtain. Simulation studies by Barton and
Cramer (1989) and Schluchter and Elashoff (1990) showed that various asymptotic test
statistics for hypotheses about fixed effects yield inflated type I error rates in small samples.
The main objective of this research is to develop missing-data hypothesis testing methods
for repeated measures designs that will yield accurate type I error rates and maintain adequate
power in small, incomplete data situations.
The class of designs of interest in this paper may be characterized as involving 1) small
samples, 2) continuous repeated measurements data, and 3) missing data that are assumed to
be "missing at random" or MAR (Rubin, 1976). In this setting, factors or covariates can be
classified as within-subject or between-subject factors. Within-subject factors represent
variables, such as time, or experimental conditions which vary within subjects. Between-subject
factors represent baseline characteristics or covariates which do not vary over time.
This research will center on the basic repeated measures design, with one within-subject
factor and one between-subject factor.
This dissertation contains three separate papers (Chapters 2, 3, and 4) intended for
publication, and as such, each contains its own literature review. In order to avoid repetition,
I will only provide a brief overview of the literature in this introductory chapter.
1.2 BACKGROUND FOR PAPER #1 (TESTS FOR GAUSSIAN REPEATED
MEASURES WITH MISSING DATA IN SMALL SAMPLES)
This paper focuses on likelihood-based theory for the statistical analysis of repeated
measures data with missing values. I begin by briefly discussing practical approaches to
missing data problems in multivariate analysis. The methods are based on three standard
statistical models: the general linear univariate and multivariate models and the general linear
mixed model. Before discussing the univariate and multivariate approaches to hypothesis
testing with missing data, an understanding of the models for complete-data is required. The
mixed model, which inherently allows for the possibility of missing data will be discussed
last.
1.2.1 Complete Case
The General Linear Multivariate Model (GLMM) subsumes a wide range of models for
multivariate Gaussian data. For the purposes of data analysis, the most important special
cases include Multivariate Analysis of Variance and Covariance (MANOVA, MANCOVA),
and the multivariate and univariate approaches to repeated measures ANOVA. Furthermore,
some forms of discriminant analysis and canonical correlation also represent special cases.
Suppose that the responses for subject i (i ∈ {1, ..., N}) are measured at p times (the same
for all subjects). The GLMM is

  E(Y) = XB,   (1.2.1.1)

where E denotes the expectation operator and Y (N × p), X (N × q, fixed and known,
conditional upon having chosen the subjects) and B (q × p, fixed and unknown) denote the
matrix of observations, design matrix, and parameter matrix, respectively. Indicate the ith
row of X as Xi = rowi(X). Assuming that rowi(Y) is Np(XiB, Σ), we can test the
general linear hypothesis (GLH)

  H0: Θ = CBU = Θ0.   (1.2.1.2)

Each row of the matrix C (a × q) defines a between-subject contrast and each column of the
matrix U (p × b) defines a within-subject contrast. Define νE = N − rank(X). All tests
of H0 are based on

  B̂ = (X'X)⁻X'Y,   (1.2.1.3)

  Θ̂ = C B̂ U,   (1.2.1.4)

  Ĥ = (Θ̂ − Θ0)'[C(X'X)⁻C']⁻¹(Θ̂ − Θ0),   (1.2.1.5)

and

  Ê = U'(Y − XB̂)'(Y − XB̂)U.   (1.2.1.6)

Ĥ and Ê are the "matrices of sums of squares and cross-products" for hypothesis and error.
Both have Wishart distributions, with Ĥ ~ W(a, Σ*, Ω) and Ê ~ W(νE, Σ*), where

  Ω = (Θ − Θ0)'[C(X'X)⁻C']⁻¹(Θ − Θ0)(Σ*)⁻¹   (1.2.1.7)

is the noncentrality parameter matrix. Let s = min(a, b) indicate the rank of Ĥ and hence
of ĤÊ⁻¹. Let s* indicate the rank of Ω, with s* ≤ s.
Under the assumption that rowi(Y) is Np(XiB, Σ), for i = 1, ..., N, three common
multivariate statistics for testing H0 are:

  Wilks' lambda:      W = |Ê(Ĥ + Ê)⁻¹| = ∏_{i=1}^{s} 1/(1 + ζi),

  Pillai-Bartlett:    V = tr[Ĥ(Ĥ + Ê)⁻¹] = ∑_{i=1}^{s} ζi/(1 + ζi),

  Hotelling-Lawley:   U = tr(ĤÊ⁻¹) = ∑_{i=1}^{s} ζi,

where ζi denotes the eigenvalues of ĤÊ⁻¹. The exact distributions of U and V are only
known when s = min(a, b) = 1, and when s ≤ 2 for W. For general values of a and b, it is
necessary to use an asymptotic approximation of the distributions. Rao (1973) derived the
following F approximation to W:

  Fw = [(1 − W^{1/t})/ν1(W)] / [(W^{1/t})/ν2(W)],   (1.2.1.8)

with

  t = { [(a²b² − 4)/(a² + b² − 5)]^{1/2}   if (a² + b² − 5) > 0
      { 1                                  otherwise.

Under H0, Fw is approximately distributed as a central F distribution with ν1(W) = ab and
ν2(W) = t[νE − (b − a + 1)/2] − (ab − 2)/2 numerator and denominator degrees of
freedom, respectively.
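As a concrete illustration, the three criteria and Rao's F for W can be computed directly from the hypothesis and error SSCP matrices. A sketch in Python with NumPy (not part of the dissertation's software), assuming H and E have already been formed as in (1.2.1.5)-(1.2.1.6):

```python
import numpy as np

def multivariate_tests(H, E, a, b, nu_e):
    """W, V, U from the hypothesis (H) and error (E) SSCP matrices,
    plus Rao's F approximation to Wilks' lambda and its two df."""
    # Eigenvalues of H E^{-1}; clip tiny negative rounding noise to zero.
    zeta = np.clip(np.linalg.eigvals(H @ np.linalg.inv(E)).real, 0.0, None)
    W = np.prod(1.0 / (1.0 + zeta))   # Wilks' lambda
    V = np.sum(zeta / (1.0 + zeta))   # Pillai-Bartlett trace
    U = np.sum(zeta)                  # Hotelling-Lawley trace
    if a**2 + b**2 - 5 > 0:           # Rao's (1973) approximation for W
        t = np.sqrt((a**2 * b**2 - 4) / (a**2 + b**2 - 5))
    else:
        t = 1.0
    nu1 = a * b
    nu2 = t * (nu_e - (b - a + 1) / 2) - (a * b - 2) / 2
    F_w = ((1 - W**(1 / t)) / nu1) / (W**(1 / t) / nu2)
    return W, V, U, F_w, (nu1, nu2)
```

A p value then follows by referring F_w to the central F distribution with (nu1, nu2) degrees of freedom.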
The asymptotic distribution of the Pillai-Bartlett trace criterion
obtained by Muller (1998) is

  Fv = [V/(κ · ab)] / [(s − V)/(κ · s(νE + s − b))],   (1.2.1.9)

which is approximately F with ν1, ν2 degrees of freedom under H0, where

  ν1(V) = κ ab,   ν2(V) = κ s(νE + s − b),

  κ = [s(νE + s − b)(νE + a + 2)(νE + a − 1)] / [s(νE + a) νE (νE + a − b)].
McKeon (1974) provided a slightly better F approximation than the more widely used
Pillai-Sampson approximation for the Hotelling-Lawley statistic. Write the McKeon
approximation

  F = [(U/h)/(ab)] / [1/ν2(U)],   (1.2.1.10)

with ν1(U) = ab, ν2(U) = 4 + (ab + 2)g,

  g = [νE² − νE(2b + 3) + b(b + 3)] / [νE(a + b + 1) − (a + 2b + b² − 1)],   (1.2.1.11)

and

  h = [ν2(U) − 2] / [νE − b − 1].   (1.2.1.12)
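A direct transcription of equations (1.2.1.10)-(1.2.1.12) as I read them (Python sketch; the helper name is hypothetical, not from the dissertation's software). A useful check: for a = 1 the formulas reduce exactly to the familiar Hotelling T² result F = U(νE − b + 1)/b with ν2 = νE − b + 1.

```python
def mckeon_f(U, a, b, nu_e):
    """McKeon's F approximation for the Hotelling-Lawley trace U.
    Returns the F value and its (nu1, nu2) degrees of freedom."""
    # g per (1.2.1.11)
    g = (nu_e**2 - nu_e * (2 * b + 3) + b * (b + 3)) / \
        (nu_e * (a + b + 1) - (a + 2 * b + b**2 - 1))
    nu1 = a * b
    nu2 = 4 + (a * b + 2) * g          # denominator degrees of freedom
    h = (nu2 - 2) / (nu_e - b - 1)     # scaling constant per (1.2.1.12)
    return (U / h) / nu1 * nu2, (nu1, nu2)
```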
The multivariate approach described above allows the p repeated measures to be
correlated in any pattern, since Σ is completely general. The univariate analysis of variance
procedure assumes that each row of Y is independent with a p-variate multivariate normal
distribution, having covariance matrix Σ such that Σ* = U'ΣU = σ²I (Huynh and Feldt,
1970). This condition is sometimes referred to as sphericity. The univariate approach to
analysis of repeated measures designs refers to an analysis procedure whereby the univariate
F test is adjusted for the amount of departure from sphericity. The traditional univariate F
statistic for testing H0 can be computed using the Ĥ and Ê matrices according to the
following equation:

  Fu = [tr(Ĥ)/ab] / [tr(Ê)/(b · νE)].   (1.2.1.13)

Box showed that when sphericity is not met, under the null hypothesis Fu is
approximately distributed as a central F distribution with ab·ε and (b · νE)·ε degrees of
freedom, where the value of ε is defined as

  ε = (∑_{k=1}^{b} λk)² / (b ∑_{k=1}^{b} λk²).   (1.2.1.14)

The λk (k = 1, 2, ..., b) in this equation are the eigenvalues of Σ*. Since Σ*, and thus ε, is
usually unknown, ε must be estimated from the sample covariance matrix. The univariate
approach with the Geisser-Greenhouse correction (Greenhouse and Geisser, 1959) uses the
maximum likelihood estimate of ε:

  ε̂ = (∑_{k=1}^{b} λ̂k)² / (b ∑_{k=1}^{b} λ̂k²),   (1.2.1.15)

where λ̂k are the eigenvalues of an estimate of Σ*.
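The Box correction reduces to a few lines given Σ* (or an estimate of it). A minimal sketch in Python with NumPy (my own helper, not part of the dissertation's software):

```python
import numpy as np

def box_epsilon(sigma_star):
    """Box's sphericity correction epsilon for a b x b symmetric Sigma*:
    (sum of eigenvalues)^2 / (b * sum of squared eigenvalues).
    Equals 1 under sphericity (Sigma* proportional to I); lower bound 1/b."""
    lam = np.linalg.eigvalsh(sigma_star)
    b = sigma_star.shape[0]
    return lam.sum() ** 2 / (b * (lam ** 2).sum())
```

Plugging the eigenvalues of an estimated Σ* into the same formula yields the Geisser-Greenhouse ε̂ of (1.2.1.15).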
1.2.2 Missing Data Case
When there are missing values among the responses, there exist some well-behaved ML
and REML estimation methods. The computational efficiency and simplicity of the EM
algorithm (Dempster, Laird and Rubin, 1977) make it an obvious choice for estimation in the
present setting.
With respect to inference, a distinguishing characteristic of each of the univariate and
multivariate test statistics is that they involve the error matrix Ê, which has error degrees of
freedom νE = N − rank(X). Barton and Cramer (1989) suggested adjusting νE in the
missing data setting to reflect the amount of incompleteness in the data. They compared the
performance of Rao's F approximation to W with various adjustments to νE. Among the
possible alternatives, approximating N with N* = N̄jj′ (the average number of non-missing data
pairs) was shown to provide the most stable and accurate test sizes for N as small as 40 and
up to 20% missing data.
The Barton and Cramer approach can be used to produce missing data analogs for the
three multivariate test statistics (W, V, U) as well as for the univariate test statistic using a
Geisser-Greenhouse correction, by simply replacing νE with νE* = N* − rank(X). This
paper examines the behavior of this hypothesis testing method at very small sample
sizes (i.e., 12, 24). In addition to looking at the performance of N* = N̄jj′, we also
considered the minimum Njj′ and the geometric and harmonic mean Njj′ as candidates for
N*. The empirical test sizes for N* = N (the complete data sample size) are provided for
comparison purposes since this is often what is used in practice.
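To make the candidate adjustments concrete, the sketch below computes the mean, geometric mean, harmonic mean, and minimum of the pairwise non-missing counts Njj′ from a response matrix with NaN marking missing values. This is one plausible reading of the candidates (Python/NumPy, hypothetical helper); the dissertation's exact definition of which pairs (j, j′) enter may differ, e.g. whether the diagonal counts Njj are included.

```python
import numpy as np

def n_star_candidates(Y):
    """Candidate N* values from an N x p response matrix Y (NaN = missing).
    N_jj' = number of subjects observed on both responses j and j'."""
    obs = (~np.isnan(Y)).astype(float)        # N x p indicator of observed data
    pairs = obs.T @ obs                       # p x p matrix of N_jj' counts
    njj = pairs[np.triu_indices_from(pairs)]  # upper triangle incl. diagonal
    return {
        "mean": njj.mean(),
        "geometric": float(np.exp(np.log(njj).mean())),
        "harmonic": njj.size / (1.0 / njj).sum(),
        "min": njj.min(),
    }
```

Each candidate N* then yields an adjusted error degrees of freedom νE* = N* − rank(X); with complete data all four candidates equal N.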
1.2.3 Mixed Model
Mixed models have long been used for the analysis of continuous data, especially in
ANOVA and MANOVA settings.
By virtue of modeling the subject as a random
component, the mixed model can encompass a broad range of covariance structures. In many
cases, the analysis results from the univariate and multivariate approaches can be obtained
from a mixed model analysis.
The general mixed model may be written as

  yi = XiB + Zibi + ei,   (1.3.2)

with yi a pi × 1 vector of measurements from the ith subject (i = 1, ..., N) and ei a pi × 1
random vector of within-subject error terms. Here Xi (pi × q) and Zi (pi × m) are fixed
and random effects design matrices, respectively, for the ith subject, and B (q × 1) and bi
(m × 1) are the unknown fixed and random effects parameters. The model assumptions
include:

  bi ~ Nm(0, Δ),  ei ~ Npi(0, σ²Vi),  with bi and ei independent.   (1.3.3)

Therefore, the expectation and variance for yi and the entire data vector, y, are

  E(yi) = XiB,   (1.3.4)

  V(yi) = Σi = ZiΔZi′ + σ²Vi,   (1.3.5)

  E(y) = XB,   (1.3.6)

and

  V(y) = Σ = Dg(Σ1, Σ2, ..., ΣN).   (1.3.7)

Here Dg(Σ1, Σ2, ..., ΣN) denotes a block diagonal matrix with Σi, i = 1, 2, ..., N, on the
diagonal and zeros elsewhere. Vi is a known covariance structure for within-subject
variation; σ² is an unknown within-subject variance parameter. An important special case is
the random-effects model, with Vi assumed to be an identity matrix, Ipi. Since repeated
measures models are a special case of random-effects models, I will assume Vi = Ipi
throughout this paper.
Because the mixed model does not require every subject to have the same number of
responses, it is a natural choice for analyzing data from repeated measures designs with
missing data. The multivariate model can be transformed to fit the mixed model framework
by stacking the columns of Y into an Np × 1 vector, creating a new design matrix equal to
(Ip ⊗ X), where ⊗ is the left Kronecker product. Rows with any missing responses are then
deleted. Note that there is no random effects matrix in this specification of the mixed model.
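The stacking transformation just described can be sketched as follows (Python/NumPy; illustrative only, not the LINMOD implementation). The columns of Y are stacked into vec(Y), the design becomes (Ip ⊗ X), and rows corresponding to missing responses are deleted:

```python
import numpy as np

def stack_for_mixed_model(Y, X):
    """Recast E(Y) = XB (Y: N x p, X: N x q) in stacked form:
    vec(Y) = (I_p kron X) vec(B), then drop rows with missing (NaN) responses."""
    N, p = Y.shape
    y = Y.reshape(-1, order="F")     # stack the columns of Y (Np x 1)
    X_big = np.kron(np.eye(p), X)    # Np x pq design matrix
    keep = ~np.isnan(y)              # delete rows with missing responses
    return y[keep], X_big[keep, :]
```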
In this paper, a series of simulations are used to compare the Barton and Cramer approach
to repeated measures analysis with missing data (described in Section 1.2.2) and the mixed
model analysis. Samples of size 12 and 24 are drawn from multivariate normal distributions
and missing data are introduced at random. The empirical rejection rates are calculated under
various conditions (low/moderate/high correlation, equal/unequal variances) for tests with a
nominal significance level of 0.05. The number of response variables considered is 3 and 6.
1.3 BACKGROUND FOR PAPER #2 (COMPARISON OF APPROXIMATE PERMUTATION
TESTS WITH PARAMETRIC TESTS IN MANOVA WITH MISSING DATA)
1.3.1 Motivation
The motivation for this paper parallels that of Paper #1. The problem of interest is briefly
summarized as follows. A number of convenient and statistically well-behaved estimation
and inference methods exist for complete, multivariate normal data. The presence of missing
response values, however, complicates matters. Researchers have primarily focused on the
problem of estimation, and have succeeded in producing accurate and efficient estimation
procedures in the presence of missing data. The EM algorithm (Orchard and Woodbury,
1972; Dempster, Laird and Rubin, 1977), is an example of one such procedure, which
appears to work well in both large and small samples for multivariate analysis of variance
models (MANOVA) with missing data. In the same setting, inference proves much more
difficult. For many parametric tests it is often the case that the distribution of the test statistic
is only known asymptotically. If the sample size is large and the proportion of data values
that are missing is small, these parametric tests may provide valid significance levels.
However, in small samples, they can be extremely inexact (see for example, Barton and
Cramer, 1989; Schluchter and Elashoff, 1990). For both large and small samples, analyses
based on the complete cases only, while accurate, can be extremely inefficient.
In Paper #1, we looked at ways of adjusting the sample size, N, in the error degrees of
freedom equation νE = N − rank(X) of approximate F tests to reflect the amount of
incompleteness in the data as a means of obtaining accurate p values. In this paper we
consider a distribution-free approach based upon the theory of permutation tests as an
alternative method for missing data inference in small multivariate Gaussian samples.
1.3.2 Theory of Permutation Tests
The logic of Fisher's permutation procedure, proposed in 1935, is that under the null
hypothesis of no population differences, observations are equally likely to be associated with
any of the populations. If all assignments of observations to populations are equally likely
under the null hypothesis, then a null distribution for any test statistic can be generated by
calculating the value of that test statistic for each data permutation (including the obtained
data permutation).
The exact significance level is simply the proportion of the data
permutations with test statistic values that exceed that which was originally obtained.
Fisher's rationale for the validity of the permutation procedure relied upon the assumption
that the data were randomly sampled from larger populations for which statistical inferences
were intended. Pitman (1937a, 1937b, 1938) demonstrated that the permutation principle can
be justified on the basis of random assignment of subjects to treatments alone, without regard
to how the sample was collected. His work provided the theoretical basis for the application
of permutation methods to the randomized experiment. Randomization tests therefore refer
to the use of permutation procedures in the context of random assignment. In this work, I
shall use the two names ("randomization" and "permutation") interchangeably.
Four factors make the randomization test procedure a promising technique for
determining significance in a small sample MANOVA setting. First and foremost, given
random assignment of observations to treatment groups, this procedure is guaranteed to
provide valid p values under the null hypothesis, even in small samples. Second, the validity
of permutation tests is independent of the type of test statistic chosen.
This flexibility
permits us to choose a test statistic that is most sensitive to departures from the null
hypothesis of interest. Third, the chosen test statistic becomes distribution-free, and thus
more robust, when significance is determined by a permutation procedure. The final reason
for considering permutation test procedures is their simplicity relative to attempting to
develop a new statistical test to handle the compound problem of small samples and missing
data.
1.3.3 Literature Review Related to Permutation Tests for MANOVA
Much of the theoretical work on permutation tests for MANOVA has dealt with methods
based on ranks. The asymptotic power properties and relative efficiencies (with respect to
their parametric competitors) of various rank permutation tests for MANOVA in the
complete data setting are presented in detail in Chapter 5 of Puri and Sen (1993). Servy and
Sen (1987) extended the theory of rank permutation tests for one-way MANOVA and
multivariate analysis of covariance (MANCOVA) models to allow for missing data. Their
approach involves replacing the observed variables with their ranks or other scores,
excluding the missing values, and imputing the missing data with the mean of the scores of
the non-missing values of the variable it belongs to. Further extensions of this approach to
two-way MANOVA designs are due to Jerdack and Sen (1990).
Permutation tests can also be based on test statistics which are explicit functions of the
actual values of the sample observations. Puri and Sen (1993, p. 76) refer to these tests as
component randomization tests. Wald and Wolfowitz (1944) proposed a randomization test
of whether two samples have come from identical multivariate normal populations (against
the alternative that there is a difference of "location" for some variables) based on the
permutation distribution of a modified Hotelling's T² statistic. Chung and Fraser (1958)
derive an alternative randomization test for the multivariate two-sample problem for the case
where the number of variables k is large (i.e., k + 2 > N). Their approach was to take a
statistic suitable for the univariate case, apply it to each of the k variables and add the
resulting expressions. Unlike Hotelling's T² statistic, a test statistic constructed in this
manner does not take account of the covariance of the responses. A randomization test is
then obtained by evaluating this test statistic for each possible reordering of the data.
Friedman and Rafsky (1979) provide a k-sample generalization of the Wald-Wolfowitz
two-sample test. The method involves constructing the minimum spanning tree for the combined
sample, then generating the permutation distribution of the runs statistic through a series of
random relabelings of the individual data points. The use of randomization tests based on the
conventional multivariate test statistics (the Pillai-Bartlett Trace, Wilks' Lambda,
Hotelling-Lawley Trace) has not been studied.
A weakness of component randomization tests is that the permutation distribution must
be computed for every new set of data. The permutation distribution of a test statistic based
on ranks, however, is invariant to changes in the actual values of the observations, and thus a
table with the "standardized" null distribution of the test statistic (for a particular sample
space) can be used repeatedly to determine significance for any new sample. One of the
strengths of component randomization tests lies in their superior power relative to the rank
randomization tests. For larger samples the discrepancy in power may not be substantial and
permutation tests based on statistics of either type would be appropriate. In the case of small,
possibly inadequately powered data sets, on the other hand, the component randomization
tests are preferred since any potential loss in power should be avoided.
1.3.4 Exact versus Approximate Permutation Tests
All permutation-based procedures for inference involve four steps. These may be loosely
described as follows. 1) Choose a test statistic, S. 2) Compute S for the original set of
observations. 3) Obtain the permutation distribution of S by either enumerating all possible
reassignments of observations to treatment groups, or by choosing a random sample of
reassignments (keeping the number of observations in each group constant), and recalculating
the test statistic for each reassignment. 4) Compute the p value as the proportion of test statistic
values greater than or equal to the value corresponding to the originally obtained data.
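The four steps translate directly into a Monte Carlo routine. A minimal sketch in Python with NumPy (`stat_fn` is any user-supplied statistic, a hypothetical stand-in for the multivariate criteria considered here):

```python
import numpy as np

def approx_permutation_p(values, groups, stat_fn, n_perm=999, seed=0):
    """Approximate permutation p value: the proportion of relabelings
    (group sizes held fixed, observed arrangement included) whose
    statistic is >= the observed statistic."""
    rng = np.random.default_rng(seed)
    s_obs = stat_fn(values, groups)        # steps 1-2: statistic on the data
    hits = 1                               # the observed permutation counts
    for _ in range(n_perm):                # step 3: random relabelings
        if stat_fn(values, rng.permutation(groups)) >= s_obs:
            hits += 1
    return hits / (n_perm + 1)             # step 4: the p value
```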
Exact permutation tests involve tabulating test statistic values for the complete set of data
permutations, including the observed test statistic. In all, there are

  M = N! / ∏_{g=1}^{q} Ng!

ways in which the N response vectors can be assigned to the q treatment groups. As the number of
observations increases, the number of values of the test statistic that need to be calculated to
obtain the exact p value increases very rapidly.
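The count M is a multinomial coefficient and is easy to compute exactly; for instance, it reproduces the figure of 369,600 permutations quoted below for N = 12 subjects assigned evenly to 4 treatments (Python sketch, hypothetical helper name):

```python
from math import factorial

def n_permutations(group_sizes):
    """M = N! / (N_1! * ... * N_q!): distinct assignments of the N
    response vectors to groups of the given fixed sizes."""
    M = factorial(sum(group_sizes))
    for n_g in group_sizes:
        M //= factorial(n_g)  # exact integer division; factorials divide evenly
    return M
```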
Pitman (1937a, 1937b, 1938) approached this problem by deriving the first four moments
of the exact permutation distribution of a test statistic and showing that they converge to the
moments of some well-known distribution (χ², normal, etc.) as N increases. Wald and
Wolfowitz (1944) improved on this by deriving a general theorem on the limiting
distributions of a class of test statistics called linear permutation statistics (see Puri and Sen,
1993, p. 73). As an application of the theorem, they derived the limiting distribution of a test
statistic, T′², which is a simple monotonic function of Hotelling's T². The statistic T′²
would be appropriate for testing the null hypothesis that two distributions Π1 and Π2 are
identical, assuming homogeneity of the covariance matrices (i.e., restricting the alternative to
the case where Π1 and Π2 differ only in the mean values). For the case in which two
variables are measured on each sampling unit, and each pair is jointly bivariate normally
distributed, the permutation distribution of T′² was shown to approach the χ² distribution
with two degrees of freedom as N → ∞.
In small samples, one can neither take advantage of the asymptotic approximations for
the exact permutation distributions of the multivariate test statistics, nor is it computationally
feasible to obtain the complete permutation distributions for these statistics. For example, the
total number of data permutations for an experiment with N = 12 subjects randomly
assigned to 4 treatments is 369,600. Alternatively, we can significantly reduce the amount of
time required to determine significance by drawing a random sample of the data
rearrangements, without replacement, to produce a close approximation to the exact p value.
Such methods are called "approximate" permutation tests. The closer the significance level
is to 5%, the more permutations will be needed. The bootstrap method, formulated by Efron
(1979), is similar to the permutation method in that it estimates the distribution of the test
statistic using the observed data. The main difference is that sampling is carried out without
replacement using the permutation approach and with replacement for the bootstrap
procedure.
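The sampling distinction can be shown in two lines (Python/NumPy; illustrative values): permutation reshuffles the labels without replacement, so group sizes are preserved, while the bootstrap resamples observations with replacement, so they generally are not.

```python
import numpy as np

rng = np.random.default_rng(1)
labels = np.array([0, 0, 0, 1, 1, 1])
data = np.array([2.0, 4.0, 6.0, 8.0, 10.0, 12.0])

# Permutation: reorder the existing labels (sampling without replacement).
perm_labels = rng.permutation(labels)
# Bootstrap: resample the observations with replacement.
boot_sample = rng.choice(data, size=data.size, replace=True)
```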
Software for computing both exact and approximate permutation tests is available
without charge on the internet. Currently, none of the programs are able to perform
permutation tests for the multivariate Gaussian k (≥ 2) sample problem. Although it is
possible to apply the principles of the widely used "shift" algorithm described in Edgington
(1995, pp. 393-398) to the MANOVA setting, such a program has not been produced. A
simpler, but less efficient, procedure for determining approximate permutation p values for
MANOVA is to choose a Monte Carlo sample of random permutations.
1.3.5 Statement of the Problem
In this paper, the impact of missing data on test size and power of approximate
permutation tests for three conventionally used multivariate test statistics (the Pillai-Bartlett
Trace, Wilks' Lambda, Hotelling-Lawley Trace) will be compared with that of their
parametric counterparts through simulation studies. The first series of simulations will focus
on the null case, and the second on power.
1.4 BACKGROUND FOR PAPER #3 (LINMOD 4.0: SOFTWARE
FOR THE GENERAL LINEAR MULTIVARIATE MODEL,
ALLOWING MISSING DATA)
This paper reviews the latest version of LINMOD, which incorporates the missing data
techniques outlined in Paper #1. The software will allow statisticians to apply the new
methods I have developed immediately, in real-world settings. The quality of the program is
greatly enhanced because it is based on three previous versions of the LINMOD program
(LINMOD version 3.2, Muller and Hunter, 1992; LINMOD version 2, Muller and Peterson;
LINMOD version 1, Helms, Hosking and Christiansen).
REFERENCES
Barton, C. N. and Cramer, E. C. (1989), "Hypothesis Testing in Multivariate Linear Models
with Randomly Missing Data," Communications in Statistics - Simulation and Computation, 18, 875-895.
Chung, J. H. and Fraser, D. A. S. (1958), "Randomization Tests for a Multivariate Two-Sample Problem," Journal of the American Statistical Association, 53, 729-735.
Dempster, A. P., Laird, N. M. and Rubin, D. B. (1977), "Maximum Likelihood from
Incomplete Data via the EM Algorithm (with Discussion)," Journal of the Royal
Statistical Society-B, 39, 1-38.
Edgington, E. S. (1995), Randomization Tests (3rd edition), New York: Marcel Dekker.
Efron, B. (1979), "Bootstrap Methods: Another Look at the Jackknife," The Annals of
Statistics, 7, 1-26.
Fisher, R. A. (1935), The Design of Experiments, Edinburgh: Oliver and Boyd.
Friedman, J. H. and Rafsky, L. C. (1979), "Multivariate Generalizations of the Wald-Wolfowitz and Smirnov Two-Sample Tests," The Annals of Statistics, 7, 697-717.
Greenhouse, S. W. and Geisser, S. (1959), "On Methods in the Analysis of Profile Data,"
Psychometrika, 24, 95-112.
Huynh, H. and Feldt, L. S. (1970), "Conditions Under Which Mean Square Ratios in
Repeated Measurement Designs Have Exact F Distributions," Journal of the American
Statistical Association, 65, 1582-1589.
Jerdack, G. R. and Sen, P. K. (1990), "Nonparametric Tests of Restricted Interchangeability,"
Annals of the Institute of Statistical Mathematics, 42, 99-114.
McKeon, J. J. (1974), "F Approximations to the Distribution of Hotelling's T₀²," Biometrika,
61, 381-383.
Morrison, D. F. (1973), "A Test for Equality of Means of Correlated Variates with Missing
Data on One Response," Biometrika, 60, 101-105.
Muller, K. E. (1998), "A New F Approximation for the Pillai-Bartlett Trace Under Ho,"
Journal of Computational and Graphical Statistics, 7, 131-137.
Orchard, T. and Woodbury, M. A. (1972), "A Missing Information Principle: Theory and
Applications," In Proceedings of the 6th Berkeley Symposium on Mathematical
Statistics and Probability, 1, 697-715. Berkeley, California: University of California
Press.
Pitman, E. J. G. (1937a), "Significance Tests which can be Applied to Samples from any
Populations," Journal of the Royal Statistical Society, B, 4, 119-130.
Pitman, E. J. G. (1937b), "Significance Tests which can be Applied to Samples from any
Populations. II. The Correlation Coefficient," Journal of the Royal Statistical Society, B,
4, 225-232.
Pitman, E. J. G. (1938), "Significance Tests which can be Applied to Samples from any
Populations. III. The Analysis of Variance Test," Biometrika, 29, 322-335.
Puri, M. L. and Sen, P. K. (1993), Nonparametric Methods in Multivariate Analysis, Florida:
Krieger Publishing Company.
Rao, C. R. (1973), Linear Statistical Inference and Its Applications, New York: John Wiley
(2nd ed.).
Rubin, D. B. (1976), "Inference and Missing Data," Biometrika, 63, 581-592.
Schluchter, M. D. and Elashoff, J. D. (1990), "Small-sample Adjustments to Tests with
Unbalanced Repeated Measures Assuming Several Covariance Structures," Journal of
Statistical Computation and Simulation, 37, 69-87.
Servy, E. C. and Sen, P. K. (1987), "Missing Variables in Multi-Sample Rank Permutation
Tests for MANOVA and MANCOVA," Sankhya, 49, 78-95.
Wald, A. and Wolfowitz, J. (1944), "Statistical Tests Based on Permutations of the
Observations," The Annals of Mathematical Statistics, 15, 358-372.
Chapter 2
TESTS FOR GAUSSIAN REPEATED MEASURES
WITH MISSING DATA IN SMALL SAMPLES
Diane J. Catellier
Department of Biostatistics CB#7400
University of North Carolina
Chapel Hill, North Carolina 27599-7400
D. Catellier: telephone 919-966-7283, email [email protected]
FAX 919-966-3804
Key words: test size, missing data, multivariate linear models, MANOVA
SUMMARY
For the analysis of continuous repeated measures data with missing data and small
sample size, Barton and Cramer (1989) recommended using the EM algorithm for estimation
and modifying Rao's F approximation to Wilks' test with adjusted error degrees of freedom.
Their adjustment replaces total sample size with the average number of non-missing pairs of
responses. Computer simulations led to the conclusion that the modified test was slightly
conservative for total sample size of N = 40. Here we extend the evaluation of Barton and
Cramer's method, as well as a number of others, to even smaller sample sizes
(N E {12, 24}). Although the Barton and Cramer method produces acceptable test size for
Wilks' test if N
= 24, this is not the case if N = 12 and 10% of the data are missing.
We describe a number of extensions of the Barton and Cramer method by creating
analogs of the Pillai-Bartlett trace, Hotelling-Lawley trace and Geisser-Greenhouse corrected
univariate tests. Eleven sample size adjustments were examined for each test. All involve
computing degrees of freedom by replacing the number of independent sampling units (N)
by some function of the numbers of non-missing pairs of responses.
The method preferred varies with the test statistic. Replacing N by the mean number of
non-missing responses works best for the Geisser-Greenhouse test. The Pillai-Bartlett test
requires the stronger adjustment of replacing N by the harmonic mean number of
non-missing pairs of responses. For Wilks' and Hotelling-Lawley, an even more aggressive
adjustment based on the minimum number of non-missing pairs must be used to control test
size at or below the nominal rate. Overall, simulation results allowed concluding that an
adjusted test can always control test size at or below the nominal rate, even with as few as 12
observations and up to 10% missing data.
2.1 INTRODUCTION
2.1.1 Motivation
Repeated measures data arise when more than one response is observed for each
experimental unit. The repeated measurements can be distinct variables, or a single variable
measured at several points in time, with the spacing consistent across subjects. For ease of
presentation, call the experimental unit a "subject" and the metameter on which the
measurements are indexed "time". Clinical trials research and laboratory studies routinely
use repeated measures designs because they allow the investigator to assess and describe the
change both within and between subjects as time or experimental conditions change. A
variety of analysis strategies have been proposed. Linear models, in particular, are useful in
situations in which the responses are at least approximately Gaussian and can be explained by
some linear function of predictors. When each subject is observed at the same p times and no
observations are missing, closed-form maximum likelihood (ML) estimates of the model
parameters are often available.
Often, especially in clinical trials, the response is not observed at all time points for every
subject.
A large amount of research has been directed at estimation for linear repeated measures
models with missing data. These methods appear to work well in both large and small
samples. In contrast, much less effort has been directed towards methods for inference.
Various asymptotic test statistics work well in large samples. However, in small samples the
available methods for inference may work very poorly. In particular, the methods produce
inflated type I error rates in small samples (Barton and Cramer, 1989; Schluchter and
Elashoff, 1990).
We seek to develop hypothesis tests for Gaussian repeated measures with missing data,
accurate in small samples. In doing so, we restrict attention to a particular range of studies.
We assume that the missing data elements are "missing at random" (MAR; Rubin, 1976).
2.1.2 General Strategies for the Analysis of Repeated Measures Designs
Models in which the expected value of the response vector equals a linear function of the
parameters have traditionally been described as (general) linear models. Most often, one of
three strategies is used for linear models with repeated measures: the multivariate analysis of
variance (MANOVA) approach, the univariate approach to repeated measures, or mixed
model analysis. All three models account for the dependencies among the repeated measures,
but differ in the special form assumed for the covariance matrix within subjects, Σ.
See Muller, LaVange, Ramey and Ramey (1992) for an overview of the multivariate and
univariate approaches to repeated measures ANOVA. They described the assumptions
behind the methods as well as the most widely used tests for both approaches. For the sake
of brevity, the information in that article will be assumed throughout. In general, the
MANOVA approach allows the responses to have any covariance pattern. In contrast, the
uncorrected univariate approach assumes sphericity (Σ* = U′ΣU = σ²I; Huynh and Feldt,
1970). Compound symmetry of Σ, coupled with choosing U orthonormal, for example,
guarantees sphericity and hence exact univariate tests. The "corrected" univariate approach
to repeated measures analysis of variance (ANOVA) involves adjusting the univariate F
statistic for the amount of departure from sphericity.
The mixed model has long been used for the analysis of continuous data, especially for
missing and mistimed data. By virtue of modeling the subject as a random component, the
mixed model can encompass a broad range of covariance structures. In this paper, we will
compare existing inference methods for all three approaches to new ones.
Using the terminology of Rubin (1976), missing responses are said to be missing at
random (MAR) if missingness of a particular response does not depend on its unobserved
value, but can depend on the covariates or any of the observed responses. Likelihood-based
estimation methods assume that the data are MAR.
Alternatively, one can use the quasi-likelihood approach of Liang and Zeger (1986) by
solving the generalized estimating equations (GEE) to obtain estimates of the regression
parameters. Park (1993) compared the GEE approach to the ML approach for multivariate
normal outcomes. He showed that with no missing data and an unstructured covariance
matrix that the GEE and ML score equations are equivalent and lead to the same estimates of
expected value and covariance parameters.
With missing observations, however, the
equivalence fails. For data missing completely at random (MCAR, Rubin, 1976), the GEE
solution produces consistent estimators.
Three weaknesses, however, make GEE less
desirable than ML estimation in the missing data setting. First, the GEE estimate of the
working covariance matrix may not always be positive definite, while the ML estimate from
the EM algorithm (Dempster, Laird and Rubin, 1977) is guaranteed to be positive definite
(Laird, Lange and Stram, 1987). Second, simulation studies by Stiger, Kosinski, Barnhart
and Kleinbaum (1997), Qu, Piedmonte, and Williams (1994), Park (1993), and Emrich and
Piedmonte (1992) allow concluding that ML estimators perform better in small samples than
GEE estimators. The former tend to have less bias, smaller mean squared errors and more
accurate test size. Third, under misspecification of the covariance, the GEE estimators will
only be consistent provided that the missing observations are MCAR. ML procedures give
unbiased estimates under the weaker MAR assumption. The limitations of the GEE
procedure in small samples, combined with the focus on Gaussian data, make ML
estimation more attractive and therefore the focus of this paper.
2.2 KNOWN METHODS FOR ESTIMATION AND INFERENCE
2.2.1 Complete Data
The repeated measures model can be represented as a special case of the general linear
multivariate model (GLMM). See Muller et al. (1992), Davidson (1972), and O'Brien and
Kaiser (1985) for further discussion.
Suppose that the responses for subject i (i ∈ {1, ..., N}) are measured at p times (the same for all subjects). The GLMM is

E(Y) = XB,   (2.2.1.1)

where E denotes the expectation operator and Y (N × p), X (N × q, fixed and known,
conditional upon having chosen the subjects) and B (q × p, fixed and unknown) denote the
matrix of observations, design matrix, and parameter matrix, respectively. Indicate the ith
row of X as Xᵢ = rowᵢ(X). Assuming that rowᵢ(Y) is Np(XᵢB, Σ), we can test the
general linear hypothesis (GLH)

H₀: Θ = CBU = Θ₀.   (2.2.1.2)

Each row of the matrix C (a × q) defines a between-subject contrast and each column of the
matrix U (p × b) defines a within-subject contrast. Define ν_E = N − rank(X). All tests
of H₀ are based on

B̂ = (X′X)⁻X′Y,   (2.2.1.3)

Θ̂ = CB̂U,   (2.2.1.4)

Ĥ = (Θ̂ − Θ₀)′[C(X′X)⁻C′]⁻¹(Θ̂ − Θ₀),   (2.2.1.5)

and

Ê = U′(Y − XB̂)′(Y − XB̂)U.   (2.2.1.6)

Ĥ and Ê have Wishart distributions W(a, Σ*, Ω) and W(ν_E, Σ*) respectively, with

Ω = (Θ − Θ₀)′[C(X′X)⁻C′]⁻¹(Θ − Θ₀)Σ*⁻¹   (2.2.1.7)

the noncentrality parameter matrix. Let s = min(a, b) indicate the rank of Ĥ and hence of
ĤÊ⁻¹. Let s* indicate the rank of Ω, with s* ≤ s.
The common multivariate tests may be constructed using the eigenvalues (l₁, ..., l_s) of
ĤÊ⁻¹. Specifically, define Wilks' Lambda W = ∏ᵢ₌₁ˢ (1 + lᵢ)⁻¹, the Pillai-Bartlett trace
V = Σᵢ₌₁ˢ lᵢ(1 + lᵢ)⁻¹, the Hotelling-Lawley trace U = Σᵢ₌₁ˢ lᵢ, and Roy's largest root
R = max(lᵢ). When s > 1 (or s > 2 for Wilks'), closed form expressions for the
distributions of these test statistics are not available and approximations are used (§2.2,
Muller et al., 1992). In general, none of the four multivariate tests is uniformly most
powerful among all hypothesis testing situations. Hence, the optimal choice depends on the
alternative hypothesis. See Olson (1974, 1976, 1979), Anderson (1984, pp. 330-333), and
Muller et al. (1992) for detailed discussions of relative test powers. Concerns about
robustness and power led to not considering Roy's test any further.
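The definitions above can be sketched numerically. The design below is hypothetical (null case, Θ₀ = 0, a cell-mean design with 4 groups of 3 subjects) and serves only to illustrate how the four statistics derive from the same eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical complete-data GLMM: N = 12 subjects, q = 4 groups, p = 3 times.
N, q, p = 12, 4, 3
X = np.kron(np.eye(q), np.ones((3, 1)))              # cell-mean design matrix
Y = rng.normal(size=(N, p))                          # null case: B = 0
C = np.hstack([np.ones((q - 1, 1)), -np.eye(q - 1)]) # between-subject contrasts (a = 3)
U = np.eye(p)                                        # within-subject contrasts (b = 3)

B_hat = np.linalg.pinv(X.T @ X) @ X.T @ Y            # (X'X)^- X'Y
Theta = C @ B_hat @ U                                # C B-hat U  (Theta_0 = 0)
M = C @ np.linalg.pinv(X.T @ X) @ C.T
H = Theta.T @ np.linalg.inv(M) @ Theta               # hypothesis SSCP
E = U.T @ (Y - X @ B_hat).T @ (Y - X @ B_hat) @ U    # error SSCP

lam = np.linalg.eigvals(H @ np.linalg.inv(E)).real   # eigenvalues l_1, ..., l_s
W = np.prod(1 / (1 + lam))                           # Wilks' Lambda
V = np.sum(lam / (1 + lam))                          # Pillai-Bartlett trace
U_stat = np.sum(lam)                                 # Hotelling-Lawley trace
R = lam.max()                                        # Roy's largest root
```

Note that W equals det(Ê)/det(Ĥ + Ê) and V equals tr[Ĥ(Ĥ + Ê)⁻¹], which the eigenvalue forms reproduce.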
All designs that can be analyzed with the multivariate approach can also be analyzed
using the univariate approach to repeated measures. The usual univariate F statistic is
defined by

F_U = [tr(Ĥ)/(ab)] / [tr(Ê)/(bν_E)].   (2.2.1.8)

If the sphericity condition is met, F_U follows a central F distribution with ab and bν_E
degrees of freedom under H₀. If sphericity is not met, Box (1954a, b) suggested that F_U
follows an approximate F distribution with degrees of freedom abε and bν_Eε under H₀, with
ε = tr²(Σ*)/[b · tr(Σ*²)]. Since Σ*, and thus ε, is usually unknown, ε must be estimated. In
general, 1/b ≤ ε ≤ 1, with the upper bound corresponding to sphericity. Assuming ε = 1/b
leads to a conservative test, while choosing ε = 1 (the uncorrected test) leads to a liberal test.
The Geisser-Greenhouse (1959) test (GG) uses the maximum likelihood estimate
ε̂ = tr²(Σ̂*)/[b · tr(Σ̂*²)], while the Huynh-Feldt (1976) test (HF) uses
ε̃ = (Nbε̂ − 2)/[b(ν_E − bε̂)]. It is common practice to trim improper estimates of ε (ε̂ > 1)
to 1. Muller and Barton (1989, 1991) suggested that the ε̂-adjusted F test provides the best
compromise in controlling type I error rate with excellent power. Hence only the GG test
statistic will be examined here.
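The behavior of ε at its bounds can be checked directly (helper name hypothetical; Σ* here is any b × b covariance matrix):

```python
import numpy as np

def box_epsilon(sigma_star):
    """Box's epsilon: tr(Sigma*)^2 / (b * tr(Sigma*^2)) for a b x b matrix."""
    b = sigma_star.shape[0]
    return np.trace(sigma_star) ** 2 / (b * np.trace(sigma_star @ sigma_star))

# Under sphericity (Sigma* = sigma^2 I), epsilon attains its upper bound of 1.
eps_sph = box_epsilon(2.5 * np.eye(4))

# With one dominant eigenvalue, epsilon approaches its lower bound of 1/b.
eps_low = box_epsilon(np.diag([100.0, 1e-6, 1e-6, 1e-6]))
```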
In the null case, each of the multivariate test statistics can be approximated by an F
random variable. If (a² + b² − 5) > 0 then let t = [(a²b² − 4)/(a² + b² − 5)]^(1/2); otherwise
t = 1. Rao (1973) suggested approximating W by

F[ν₁(W), ν₂(W)] = [(1 − W^(1/t))/ν₁(W)] / [W^(1/t)/ν₂(W)],

with ν₁(W) = ab and ν₂(W) = t[ν_E − (b − a + 1)/2] − (ab − 2)/2.
Although widely used in current statistical packages, Pillai's (1954) F approximation for V
may be very conservative in small samples. Muller (1998) developed an F approximation
for V that provides substantially better accuracy. Hence Muller's approximation will be
used, with ν₁(V) = κab and ν₂(V) = κ·s(ν_E + s − b), with

κ = [s(ν_E + s − b)(ν_E + a + 2)(ν_E + a − 1)] / [s(ν_E + a)·ν_E(ν_E + a − b)].   (2.2.1.9)
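Rao's approximation for W can be sketched as follows (function name hypothetical). For a = b = 1 the approximation reduces to the exact univariate F(1, ν_E) test, which the example exercises.

```python
import math

def rao_f_wilks(W, a, b, nu_e):
    """Rao's F approximation for Wilks' Lambda (null case)."""
    if a**2 + b**2 - 5 > 0:
        t = math.sqrt((a**2 * b**2 - 4) / (a**2 + b**2 - 5))
    else:
        t = 1.0
    nu1 = a * b
    nu2 = t * (nu_e - (b - a + 1) / 2) - (a * b - 2) / 2
    F = ((1 - W ** (1 / t)) / nu1) / (W ** (1 / t) / nu2)
    return F, nu1, nu2

# With a = b = 1 and W = 1/(1 + l), F equals l * nu_e, the exact univariate F.
F, nu1, nu2 = rao_f_wilks(W=0.5, a=1, b=1, nu_e=10)
```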
McKeon (1974) provided a slightly better F approximation than the more widely used
Pillai-Sampson approximation for the Hotelling-Lawley statistic. Write the McKeon
approximation

F = [(U/h)/(ab)] / [1/ν₂(U)],   (2.2.1.10)

with ν₁(U) = ab, ν₂(U) = 4 + (ab + 2)g,

g = [ν_E² − ν_E(2b + 3) + b(b + 3)] / [ν_E(a + b + 1) − (a + 2b + b² − 1)],   (2.2.1.11)

and

h = [ν₂(U) − 2] / (ν_E − b − 1).   (2.2.1.12)
2.2.2 Data Missing at Random
For the GLMM, both ML and REML estimation methods have been extensively
investigated for MAR data. For a general review, see Little and Rubin (1986, chapters 7-10).
For some special patterns of missing data, such as monotone missing data, the likelihood
can be factored into a product of terms, each containing distinct parameters. The solutions for the
special cases have closed forms (Rubin, 1974). For arbitrary missing data patterns, the
solution must be obtained iteratively. The computational efficiency and simplicity of the EM
algorithm (Orchard and Woodbury, 1972; Beale and Little, 1975; Dempster, Laird and
Rubin, 1977) make it an obvious choice for ML estimation in the setting of interest. Barton
and Cramer's experience with the method for the application at hand also strongly supports
the choice. The algorithm preferred for more general models is not as obvious. See for
example, Elswick and Chinchilli (1993) for a discussion of methods for estimating the
covariance matrix in the GMANOVA model with missing data, or Callahan and Harville
(1990) for a comparison of algorithms for the general mixed model.
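The EM algorithm referenced here can be sketched generically for ML estimation of a multivariate normal mean and covariance under MAR. This is a minimal illustration only, not the LINMOD or Barton-Cramer implementation; the function name and iteration count are hypothetical. With complete data the estimates reduce to the ordinary ML sample moments.

```python
import numpy as np

def em_mvn(Y, n_iter=200):
    """EM for the ML mean and covariance of a p-variate normal sample
    with arbitrary missingness coded as np.nan, assuming MAR."""
    N, p = Y.shape
    mu = np.nanmean(Y, axis=0)
    sigma = np.diag(np.nanvar(Y, axis=0)) + 1e-8 * np.eye(p)
    for _ in range(n_iter):
        M = np.zeros(p)
        S = np.zeros((p, p))
        for i in range(N):
            obs = ~np.isnan(Y[i])
            mis = ~obs
            yi = Y[i].copy()
            C = np.zeros((p, p))
            if mis.any():
                # E-step: conditional mean and covariance of the missing part
                coef = sigma[np.ix_(mis, obs)] @ np.linalg.inv(sigma[np.ix_(obs, obs)])
                yi[mis] = mu[mis] + coef @ (yi[obs] - mu[obs])
                C[np.ix_(mis, mis)] = (sigma[np.ix_(mis, mis)]
                                       - coef @ sigma[np.ix_(obs, mis)])
            M += yi
            S += np.outer(yi, yi) + C
        # M-step: complete-data ML updates
        mu = M / N
        sigma = S / N - np.outer(mu, mu)
    return mu, sigma

rng = np.random.default_rng(5)
Y = rng.normal(size=(12, 3))
Y[0, 2] = Y[3, 0] = np.nan           # a few missing responses
mu_hat, sigma_hat = em_mvn(Y)
```

The design here assumes, as in the simulations below, that no subject's data are completely missing.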
Except in special cases, no known method exists for providing accurate and efficient
inference in small multivariate normal samples with missing data.
Hypothesis tests
constructed from complete observations only, while accurate in small samples, are
inefficient. A number of approximate methods have been proposed for the problem of testing
equality of means for a bivariate normal sample with data missing on one variable (Morrison,
1973; and Little, 1976 and 1988).
Morrison (1973) and Little (1976) recommended
referring variously derived statistics to the t distribution with m − 1 degrees of
freedom, where m is the number of complete cases. Little (1988) considered a Bayesian
approach to making inferences about the difference in means.
Barton and Cramer (1989) suggested an appealing technique for testing the general linear
hypothesis in a GLMM with an arbitrary pattern of missing data. The approach involves
using the EM algorithm for ML estimation, and modifying Rao's F approximation to Wilks'
test, F_W, with adjusted error degrees of freedom. Let N_jj' indicate the number of
observations for which both Y_ij and Y_ij', for i ∈ {1, ..., N}, have non-missing values. Note
that N_jj equals the number of cases observed for the jth response. All adjustments
considered by Barton and Cramer, and in this paper, involve replacing N by N* in
ν_E = N − rank(X). In all cases N* equals a function only of {N_jj'}. For samples of size
40 and up to 20% missing data, test statistics with degrees of freedom based on the naive
estimate of N* = N produced inflated type I error rates ranging from 0.10 to 0.23 assuming a
nominal rate of 0.05, while those based on the number of complete cases resulted in rates
which were substantially lower than 0.05 (range: 0.004-0.014). The optimal choice for N*
was the average number of non-missing pairs of responses (N*6 in Table 2.1). Wilks' test
based on this sample size adjustment produced reasonably accurate test sizes across all
simulated conditions.
The mixed model is often used for multivariate data in which some of the response
measurements are missing. Let yᵢ be an (Nᵢ × 1) vector of measurements for the ith subject.
Let N₊ = Σᵢ₌₁ᴺ Nᵢ. In the mixed model, y₊ = [y₁′, ..., y_N′]′ is modeled as

y₊ = X₊β + Zb + e₊,   (2.2.2.1)

with X₊ and Z the known design matrices for the fixed and random effects respectively, and
b the vector of unknown random effects. The key assumptions for inference are that b
and e₊ are independent and multivariate Gaussian. Define vec(M) as the vector created by
stacking the columns of M. Also let A ⊗ B = {a_ij B} indicate the (left) Kronecker
product. For the cases of interest, the model may be written so that β = vec(B′) from the
GLMM. For complete data X₊ = X ⊗ I_p. For missing data merely delete each row
corresponding to a missing response. Let Σᵢ, of dimension pᵢ, be the sub-matrix of Σ with
rows and columns corresponding to data observed for subject i. Let Λ = Dg(Σ₁, ..., Σ_N)
indicate the block-diagonal matrix created by placing Σ₁ in the upper left diagonal, etc.
Then e₊ ∼ N(0, Λ).
In all but a few cases, likelihood-based estimation of β and Λ requires iterative methods
such as Newton-Raphson, the Method of Scoring or the EM algorithm. The software used in
this paper (PROC MIXED in SAS®) uses an implementation of the Newton-Raphson
algorithm developed by Lindstrom and Bates (1988) to compute the ML estimates of the
fixed effects and REML estimates of the covariance parameters.
Exact methods are not available to test hypotheses of the form H₀: θ = Cβ = 0. The
approximate, large-sample test statistics can be unreliable in small samples. Schluchter and
Elashoff (1990, §6) examined the test size of various ML Wald-type statistics (computed as
the parameter estimate divided by an asymptotic standard error) under the mixed model
formulation. Their small sample simulation results suggest approximating a modified Wald
statistic by an F distribution with denominator degrees of freedom based on the number of
complete cases. Another method of test construction for the mixed model uses the likelihood
ratio principle (see Hocking, 1985, §8.3.1). The statistic computed using this method is
approximately χ² in large samples. This approximation has been demonstrated to be
unreliable in small samples (Woolson, Leeper and Clarke, 1978; Woolson and Leeper, 1980;
and Leeper and Woolson, 1982). The empirical type I error rates were typically in the
0.10-0.25 range, far exceeding the nominal rate of 0.05. The version of PROC MIXED studied
here used the following approximate F statistic (SAS Institute, 1997, p. 644):

F = θ̂′[C(X₊′Λ̂⁻¹X₊)⁻C′]⁻¹θ̂ / rank(C),   (2.2.2.2)

with numerator degrees of freedom equal to rank(C). Although several approximations are
available for the denominator degrees of freedom (see SAS Institute Inc., 1997, p. 607),
only the Satterthwaite approximation was considered in this paper.
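A generic sketch of the Wald-type F statistic in (2.2.2.2), given a working estimate of Λ (hypothetical data; this is generalized least squares by hand, not PROC MIXED, and the denominator degrees of freedom are not computed here):

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical stacked GLS setup: N+ = 30 observations, 3 fixed effects.
X = rng.normal(size=(30, 3))
beta = np.zeros(3)                        # null case
Lam = np.diag(rng.uniform(0.5, 2.0, 30))  # stand-in for the estimated covariance
y = rng.multivariate_normal(X @ beta, Lam)

C = np.array([[1.0, -1.0, 0.0],
              [0.0, 1.0, -1.0]])          # rank(C) = 2 contrasts

Lam_inv = np.linalg.inv(Lam)
XtLX_inv = np.linalg.pinv(X.T @ Lam_inv @ X)
beta_hat = XtLX_inv @ X.T @ Lam_inv @ y   # GLS estimate of beta
theta = C @ beta_hat
F = (theta @ np.linalg.inv(C @ XtLX_inv @ C.T) @ theta) / np.linalg.matrix_rank(C)
```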
2.3 NEW TESTS FOR DATA MISSING AT RANDOM
The success of the basic strategy followed by Barton and Cramer leads to a number of
obvious generalizations. First, the approach will be applied to other test statistics. Second,
some additional functions of the sample sizes merit consideration.
Third, even smaller
sample sizes will be studied. In all cases, the EM algorithm will be used for estimation.
In addition to W, the U, V and GG tests may be modified to apply to missing data
settings. In all cases, this requires changing only the error degrees of freedom by replacing
N with some form of N*. Overall, 11 forms for N* will be examined for each of the four test
statistics. They are listed in Table 2.1 in rank order from smallest to largest, with the
exception that N*3 can be either less than or greater than N*6 (and hence N*4 and N*5).
In all cases N/N* ≥ 1. Consequently, in large samples (as N → ∞, with fixed p, q,
and proportion missing) the choice of N* has less and less effect. The form of the results of
Rothenberg (which assume a sequence of local alternatives), as cited in Anderson (1984,
§8.6.5), supports this position.
Table 2.1 Sample Size Adjustments for Error Degrees of Freedom

Name    Function of {N_jj'}
N*1     number of complete cases
N*2     min({N_jj'})
N*3     min({N_jj})
N*4     harmonic mean({N_jj'})
N*5     geometric mean({N_jj'})
N*6     arithmetic mean({N_jj'})
N*7     harmonic mean({N_jj})
N*8     geometric mean({N_jj})
N*9     arithmetic mean({N_jj})
N*10    max({N_jj})
N*11    N
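The candidate adjustments can be computed from a 0/1 response indicator matrix, as in this sketch (function and key names hypothetical); note that the harmonic ≤ geometric ≤ arithmetic mean ordering underlies the table's ranking.

```python
import numpy as np
from itertools import combinations

def hmean(v):
    v = np.asarray(v, dtype=float)
    return len(v) / np.sum(1.0 / v)

def gmean(v):
    v = np.asarray(v, dtype=float)
    return float(np.exp(np.mean(np.log(v))))

def sample_size_adjustments(R):
    """R: (N, p) indicator matrix, 1 where Y_ij is observed.
    Returns the candidate N* values of Table 2.1."""
    R = np.asarray(R)
    N, p = R.shape
    njj = R.sum(axis=0)                          # per-response counts N_jj
    pairs = [(R[:, j] * R[:, k]).sum()           # pairwise counts N_jj'
             for j, k in combinations(range(p), 2)]
    return {
        "N*1 complete cases": int((R.sum(axis=1) == p).sum()),
        "N*2 min pairs": int(min(pairs)),
        "N*3 min responses": int(njj.min()),
        "N*4 harmonic mean pairs": hmean(pairs),
        "N*5 geometric mean pairs": gmean(pairs),
        "N*6 arithmetic mean pairs": float(np.mean(pairs)),
        "N*7 harmonic mean responses": hmean(njj),
        "N*8 geometric mean responses": gmean(njj),
        "N*9 arithmetic mean responses": float(njj.mean()),
        "N*10 max responses": int(njj.max()),
        "N*11 N": N,
    }

R = np.ones((12, 3), dtype=int)
R[0, 1] = R[1, 2] = 0          # two missing responses
adj = sample_size_adjustments(R)
```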
2.4 NUMERICAL EVALUATIONS
2.4.1 Methods
All simulations involved a small range of research designs. In all cases, the designs
included 1) one within-subject factor with p = 3 or 6 levels, 2) one between-subject factor
with q = 4 levels, 3) N ∈ {12, 24}, and 4) 0%, 5% or 10% of the data missing. No
subject's data were allowed to be completely missing. The procedure for producing missing
data generated data that are MCAR. Other factors considered are the relative error variance
of the response variables (equal, unequal), and the error correlation structure (medium, high).
See Tables I-II in Barton and Cramer (1989) for details. In addition, a third level was added
to the correlation structure factor, which allowed assessing the effect of very low correlation
between the responses. The structure specified equal correlation (ρ = 0.1) for each pair of
responses. The overall test for the presence of a trend (linear, quadratic or cubic) with respect
to the between-subject factor for each of the response measures was of primary interest.
Under the null, of course, B = 0. For 5,000 replications and assuming a true type I error
rate of 0.05, the 95% confidence bounds around the type I error rate estimates are
approximately ±0.006.
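The ±0.006 half-width follows from the normal approximation to the binomial for the Monte Carlo rejection rate:

```python
import math

# 95% bound for an estimated rejection rate from 5,000 replications,
# assuming a true type I error rate of 0.05:
half_width = 1.96 * math.sqrt(0.05 * 0.95 / 5000)
# half_width is approximately 0.006
```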
2.4.2 Results
On average, higher levels of correlation within subjects were associated with modestly
higher type I error rates. Since this pattern was consistent for each of the test statistics, only
the results for the low and high correlation conditions will be presented.
The empirical type I error rates for the mixed model F statistic are given in Table 2.2.
The results indicate that the test has poor small sample properties, producing inflated type I
error rates even when none of the data were missing. For N = 24, test sizes increased from
slightly greater (0.07-0.10) to considerably greater (0.16-0.32) than the nominal α = 0.05
level as the number of repeated measures increased from 3 to 6.
Table 2.2 Test Size for Mixed Model F
(5000 Replications, ±0.006)

N   ρ_jj'  σ²_j  % Missing  p=3   p=6
12  Low    =      0         .126  .58
12  Low    ≠      0         .134  .60
12  High   ≠      0         .125  .59
24  Low    ≠      0         .069  .16
24  High   ≠      0         .074  .16
12  Low    =      5         .182
12  Low    ≠      5         .171
12  High   ≠      5         .172
24  Low    ≠      5         .080  .21
24  High   ≠      5         .081  .23
12  Low    =     10         .250
12  Low    ≠     10         .244
12  High   ≠     10         .262
24  Low    ≠     10         .086  .303
24  High   ≠     10         .095  .322
For the conditions with no missing data, all four univariate and multivariate test statistics
succeeded in controlling the type I error rate at or below the nominal rate (see Table 2.3).
This illustrates that the sample sizes, while small, are large enough that the approximate F
tests are essentially unbiased for complete data. Hence any discrepancy from the desired test
size may be attributed to the influence of missing responses, and not to any inaccuracy in test
approximations for complete data.
Table 2.3 Test Size for GLMM F Tests
(0% Missing, 5000 Replications, ±0.006)

                      F_W          F_U          F_V          F_GG
N   ρ_jj'  σ²_j   p=3   p=6   p=3   p=6   p=3   p=6   p=3   p=6
12  Low    =      .050  .046  .048  .051  .044  .046  .027  .013
12  Low    ≠      .052  .054  .050  .055  .049  .059  .040  .025
12  High   ≠      .053  .053  .048  .053  .042  .045  .054  .058
24  Low    ≠      .049  .051  .050  .050  .048  .049  .041  .039
24  High   ≠      .053  .049  .051  .048  .051  .049  .049  .053
Tables 2.4, 2.5, 2.6, and 2.7 summarize the empirical test sizes for W, U, V, and GG for
5% and 10% missing data. All tables give results for tests based on N*2 = min({N_jj'}) and
N*11 = N in order to define bounds on test size.
The EM algorithm failed roughly 90% of the time for the condition with p = 6,
N = 12 subjects and even 5% missing data. (Estimates are well defined for complete data.)
The table cells for these conditions were left blank. Not surprisingly, the results indicate that
the worst accuracy tends to occur with more repeated measures, fewer subjects, more missing
data and more correlation within subjects.
From Tables 2.4 and 2.5, it is evident that the adjusted F_W and F_U tests based on N*11
give inflated test sizes, and those based on N*4, while accurate for N = 24, were inflated for
N = 12. On the other hand, tests based on N*2 controlled test size at or below the nominal
rate under all simulated conditions, with the exception of the condition with p = 6, N = 12
and 10% missing data, in which case test size was as high as 0.09.
Table 2.4 Adjusted Degree of Freedom Test Size for Fw
(5%,10% Missing, 5000 Replications, ± 0.006)
N*2
% Missing
p=3
12
12
12
PH
Low
Low
High
0"2
-
i=
i=
5
5
5
.029
.033
.032
24
24
Low
High
i=
i=
5
5
.030
.031
12
12
12
Low
Low
High
-
i=
i=
10
10
10
.047
.042
.051
24
24
Low
High
i=
i=
10
10
.020
.025
N
J
N*4
6
3
N*l1
.073
.072
.074
.017
.022
.049
.053
.051
.062
6
.145
.134
.141
.067
.067
.148
.144
.152
.078
.094
3
6
.087
.089
.142
.146
.345
.335
.354
.171
.199
.133
.158
.379
.411
Table 2.5 Adjusted Degree of Freedom Test Size for Fu
(5%, 10% Missing, 5000 Replications, ± 0.006)
N*2
% Missing
p=3
12
12
12
PH
Low
Low
High
0"2
i=
i=
5
5
5
.032
.033
.034
24
24
Low
High
i=
i=
5
5
.028
.031
12
12
12
Low
Low
High
-
i=
i=
10
10
10
.052
.047
.055
24
24
Low
High
i=
i=
10
10
.021
.029
N
J
N*4
6
3
N*l1
6
.072
.067
.074
.021
.025
.048
.052
.052
.061
6
.142
.130
.134
.069
.068
.163
.153
.169
.093
.086
3
.084
.088
.144
.148
.341
.333
.352
.184
.217
.135
.158
.379
.417
Table 2.6 contains test sizes for the modified F_V tests. The test based on N*11 provided
inflated type I error rates, and the N*2-adjusted test was conservative. Test sizes for N*4
were extremely accurate, with the exception of the condition with p = 6, N = 12 and 10%
missing data, where the type I error rates were approximately 0.1.
12
12
12
Table 2.6 Adjusted Degree of Freedom Test Size for Fv
(5%,10% Missing, 5000 Replications, ± 0.006)
%
N*2
N*4
N*l1
(]'2
Missing
6
p=3
3
6
3
6
Pjj'
J
Low
5
.023
.052
.102
Low
5
.021
.051
.099
=1=
.022
High =1=
5
.050
.096
24
24
Low
High
12
12
12
Low
Low
High
24
24
Low
High
N
=1=
=1=
=1=
=1=
=1=
=1=
5
5
.029
.030
10
10
10
.011
.010
.013
10
10
.017
.020
.015
.017
.046
.049
.054
.052
.057
.051
.058
.019
.022
.044
.055
.081
.088
.125
.128
.206
.215
.214
.106
.120
.127
.148
.334
.354
The results given in Table 2.7 suggest that N*9 is a reasonable adjustment function for the
F_GG test. When N = 12, this test is slightly conservative; however, this corresponds to the
modest conservatism found when no data are missing (Table 1 in Muller and Barton, 1989).
When s = min(a, b) = 1, all of the multivariate test statistics are equivalent. This can
occur if the rank of the C contrast matrix is one (a = 1) or if the rank of U is one (b = 1).
The empirical test sizes shown in Table 2.8 suggest that when a = 1, the best adjustment for
the degrees of freedom is based on N*2, while N*4 appears to work well when b = 1.
Table 2.7 Adjusted Degree of Freedom Test Size for F_GG
(5%, 10% Missing Data, 5000 Replications, ±0.006)
N PH
12 Low
12 Low
12 High
(J"2
24
24
J
% Missing
i=
i=
5
5
5
.010
.019
.026
Low
High
i=
i=
5
5
.030
.049
12
12
12
Low
Low
High
-
i=
i=
10
10
10
.002
.008
.009
24
24
Low
High
i=
i=
10
10
.018
.017
N. ll
N. g
N· 2
p=3
6
3
6
.047
.049
.037
.041
.051
.048
.061
.061
.057
.053
.073
.078
.075
.030
.038
.040
.004
.009
6
.053
.059
.066
.035
.043
.050
.010
.018'
3
.044
.043
.091
.076
.095
.070
Table 2.8 Adjusted Degree of Freedom Test Size for the Multivariate F Test
When s = min(a, b) = 1
(5%, 10% Missing, 5000 Replications, ±0.006)
a = 1,b = 3
a = 3, b = 1
%
N. 4
N. 4
N. ll
N*ll
N· 2
N· 2
N
12
12
12
PH
Low
Low
High
24
24
Low
High
(J"~
Missing
p=3
3
3
p=3
3
3
-
i=
i=
5
5
5
.038
.034
.039
.072
.066
.074
.114
.105
.110
.031
.029
.027
.051
.048
.042
.078
.071
.065
i=
i=
5
5
.036
.037
.052
.050
.077
.078
.036
.030
.046
.043
.067
.064
-
.058
.054
.056
.132
.131
.137
.253
.251
.258
.018
.016
.014
.043
.040
.036
.096
.010
.088
.028
.028
.052
.048
.102
.104
.026
.018
.043
.036
.089
.073
J
12 Low
12 Low
12 High
i=
i=
10
10
10
24
24
i=
i=
10
10
Low
High
2.5 CONCLUSIONS
Conclusion 1. For all tests considered, accuracy decreases with more repeated measures,
fewer subjects, more missing data and more correlation within subjects.
Conclusion 2. The mixed model F statistic used by PROC MIXED in SAS® with Satterthwaite approximated denominator degrees of freedom gives liberal test size for N ≤ 24, even with complete data.
Conclusion 3. For 6 responses and 12 subjects, the EM algorithm failed roughly 90% of
the time, for the version used here.
Conclusion 4. A degree of freedom adjustment can always control test size at or below the nominal level, even for conditions as extreme as N = 12 and 10% missing data.
Conclusion 5. The choice of adjustment varies with the test.
5.1 N*2, the minimum Njj', is best for the Wilks' and Hotelling-Lawley tests.
5.2 N*4, the harmonic mean of Njj', is best for the Pillai-Bartlett test.
5.3 N*9, the mean Njj', is best for the Geisser-Greenhouse test.
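The three adjustment functions can be illustrated concretely. The sketch below is ours, not the dissertation's code; whether the diagonal counts Njj (subjects observed on variable j itself) are pooled with the off-diagonal pairwise counts is our assumption. It computes the pairwise non-missing counts Njj' from a 0/1 response-availability matrix and then the minimum, harmonic mean, and arithmetic mean:

```python
import numpy as np

def pairwise_counts(observed):
    """observed: (N x p) 0/1 matrix; 1 means response j is present for a subject.
    Returns the p x p matrix whose (j, j') entry counts subjects with both
    responses j and j' observed."""
    R = np.asarray(observed)
    return R.T @ R  # inner products of 0/1 columns count co-observed rows

def adjusted_sample_sizes(observed):
    """Return (N*2, N*4, N*9): min, harmonic mean, and mean of the Njj'."""
    counts = pairwise_counts(observed).astype(float)
    vals = counts[np.triu_indices_from(counts)]   # unique pairs j <= j'
    n_star_2 = vals.min()                         # N*2: minimum Njj'
    n_star_4 = len(vals) / np.sum(1.0 / vals)     # N*4: harmonic mean of Njj'
    n_star_9 = vals.mean()                        # N*9: arithmetic mean of Njj'
    return n_star_2, n_star_4, n_star_9

# toy pattern: 4 subjects, 3 responses; subject 3 is missing response 2
R = np.array([[1, 1, 1],
              [1, 1, 1],
              [1, 0, 1],
              [1, 1, 1]])
print(adjusted_sample_sizes(R))
```

The three summaries coincide with complete data and diverge as missingness grows, with N*2 always the most severe (most conservative) reduction.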
Note that the simulated data were generated in such a way as to create data that are MCAR. Hence, one area of future research which warrants consideration is conditions in which the data are MAR.
Techniques for power analysis, given that test size can be controlled, are also appealing.
Obviously, the approach taken here is more intuitive and heuristic than analytical.
Nevertheless, we believe we have succeeded, where other approaches have not, in suggesting what appears to be a method that ensures test size does not exceed the nominal rate in small samples with missing data for the GLMM. A more formal approach must necessarily involve a rather sophisticated attack, due to the complexity of the distributions for the multivariate test statistics, even with complete data. Such formal approaches clearly represent the most needed future research.
REFERENCES
Anderson, T. W. (1984), An Introduction To Multivariate Statistical Analysis, New York:
John Wiley (2nd ed.).
Barton, C. N. and Cramer, E. C. (1989), "Hypothesis Testing in Multivariate Linear Models with Randomly Missing Data," Communications in Statistics - Simulations, 18, 875-895.
Beale, E. M. L. and Little, R. J. A. (1975), "Missing Values in Multivariate Analysis,"
Journal of the Royal Statistical Society-B, 37, 129-145.
Box, G. E. P. (1954a), "Some Theorems on Quadratic Forms Applied in the Study of
Analysis of Variance Problems, I: Effects of Inequality of Variance in the One-way
Classification," The Annals of Mathematical Statistics, 25, 290-302.
Box, G. E. P. (1954b), "Some Theorems on Quadratic Forms Applied in the Study of
Analysis of Variance Problems, II: Effects of Inequality of Variance and of Correlation
Between Errors in the Two-way Classification," The Annals of Mathematical Statistics, 25, 484-498.
Callahan, T. P. and Harville, D. A. (1990), "Some New Algorithms for Computing
Maximum Likelihood Estimates of Variance Components," Journal of Statistical
Computation and Simulation, 38, 239-259.
Davidson, M. L. (1972), "Univariate Versus Multivariate Tests in Repeated Measures
Experiments," Psychological Bulletin, 77, 446-452.
Dempster, A. P., Laird, N. M. and Rubin, D. B. (1977), "Maximum Likelihood from
Incomplete Data via the EM Algorithm (with Discussion)," Journal of the Royal
Statistical Society-B, 39, 1-38.
Emrich, L. J. and Piedmonte, M. R. (1992), "On Some Small Sample Properties of
Generalized Estimating Equation Estimates for Multivariate Dichotomous Outcomes,"
Journal of Statistical Computation and Simulation, 41, 19-29.
Greenhouse, S. W. and Geisser, S. (1959), "On Methods in the Analysis of Profile Data,"
Psychometrika, 24, 95-112.
Hocking, R. R. (1985), The Analysis of Linear Models, Brooks/Cole Publishing Co.,
Monterey, CA.
Huynh, H. and Feldt, L. S. (1970), "Conditions Under Which Mean Square Ratios in
Repeated Measurement Designs Have Exact F Distributions," Journal of the American Statistical Association, 65, 1582-1589.
Huynh, H. and Feldt, L. S. (1976), "Estimation of the Box Correction for Degrees of Freedom From Sample Data in Randomized Block and Split-Plot Designs," Journal of Educational Statistics, 1(1), 69-82.
Laird, N. M., Lange, N. and Stram, D. (1987), "Maximum Likelihood Computations with
Repeated Measures: Application of the EM Algorithm," Journal of the American
Statistical Association, 82, 97-105.
Leeper, J. D. and Woolson, R. F. (1982), "Testing Hypotheses for the Growth Curve Model
when the Data are Incomplete," Journal of Statistical Computation and Simulation, 15,
97-107.
Liang, K. Y. and Zeger, S. L. (1986), "Longitudinal Data Analysis Using Generalized Linear
Models," Biometrika, 73, 13-22.
Lindstrom, M. J. and Bates, D. M. (1988), "Newton-Raphson and EM algorithms for Linear
Mixed-effects Models for Repeated Measures Data," Journal ofthe American Statistical
Association, 83, 1014-1022.
Little, R. J. A. (1976), "Inference About Means From Incomplete Multivariate Data,"
Biometrika, 63, 593-604.
Little, R. J. A. (1988), "Approximate Calibrated Small Sample Inference About Means From
Bivariate Normal Data with Missing Values," Computational Statistics and Data
Analysis, 7, 161-178.
Little, R. J. A. and Rubin, D. B. (1987), Statistical Analysis with Missing Data, New York:
John Wiley.
McKeon, J. J. (1974), "F Approximations to the Distribution of Hotelling's T₀²," Biometrika, 61, 381-383.
Morrison, D. F. (1973), "A Test for Equality of Means of Correlated Variates with Missing
Data on One Response," Biometrika, 60, 101-105.
Muller, K. E. (1998), "A New F Approximation for the Pillai-Bartlett Trace Under H0,"
Journal ofComputational and Graphical Statistics, 7, 131-137.
Muller, K. E. and Barton, C. N. (1989), "Approximate Power for Repeated Measures ANOVA Lacking Sphericity," Journal of the American Statistical Association, 84, 549-555.
Muller, K. E. and Barton, C. N. (1991), Correction to "Approximate Power for Repeated Measures ANOVA Lacking Sphericity," Journal of the American Statistical Association, 86, 255-256.
Muller, K. E., LaVange, L. M., Ramey, S. L. and Ramey, C. T. (1992), "Power Calculations for General Linear Multivariate Models Including Repeated Measures Applications," Journal of the American Statistical Association, 87, 1209-1226.
Muller, K. E. and Peterson, B. L. (1984), "Practical Methods for Computing Power in
Testing the Multivariate Linear Hypothesis," Computational Statistics and Data
Analysis, 2, 143-158.
O'Brien, R. G. and Kaiser, M. K. (1985), "MANOVA Method for Analyzing Repeated
Measures Designs: An Extensive Primer," Psychological Bulletin, 97, 316-333.
Olson, C. L. (1974), "Comparative Robustness of Six Tests in Multivariate Analysis of Variance," Journal of the American Statistical Association, 69, 894-908.
Olson, C. L. (1976), "Choosing a Test Statistic in Multivariate Analysis," Psychological Bulletin, 83, 579-586.
Olson, C. L. (1979), "Practical Considerations in Choosing a MANOVA Test Statistic: A
Rejoinder to Stevens," Psychological Bulletin, 86, 1350-1352.
Orchard, T. and Woodbury, M. A. (1972), "A Missing Information Principle: Theory and
Applications," In Proceedings of the 6th Berkeley Symposium on Mathematical
Statistics and Probability, 1, 697-715. Berkeley, California: University of California
Press.
Park, T. (1993), "A Comparison of the Generalized Estimating Equation Approach with the
Maximum Likelihood Approach for Repeated Measurements," Statistics in Medicine,
12, 1723-1732.
Pillai, K. C. S. (1954), "On Some Distribution Problems in Multivariate Analysis," Institute
ofStatistics Mimeo Series No. 88, University of North Carolina, Chapel Hill.
Qu, Y. S., Piedmonte, M. R. and Williams, G. W. (1994), "Small Sample Validity of Latent Variable Models for Correlated Binary Data," Communications in Statistics - Simulations, 23, 243-269.
Rao, C. R. (1973), Linear Statistical Inference and Its Applications, New York: John Wiley
(2nd ed.).
Rubin, D. B. (1974), "Characterization of Estimation of Parameters in Incomplete Data
Problems," Journal ofthe American Statistical Association, 69, 467-474.
Rubin, D. B. (1976), "Inference and Missing Data," Biometrika, 63, 581-592.
SAS Institute Inc. (1997), SAS/STAT Software: Changes and enhancements, Release 6.12,
SAS Institute Inc., Cary, NC.
Schluchter, M. D. and Elashoff, J. D. (1990), "Small-sample Adjustments to Tests with Unbalanced Repeated Measures Assuming Several Covariance Structures," Journal of Statistical Computation and Simulation, 37, 69-87.
Stiger, T. R., Kosinski, A. S., Barnhart, H. X. and Kleinbaum, D. G. (1997), "ANOVA for
Repeated Ordinal Data with Small Sample Size? A Comparison of ANOVA,
MANOVA, WLS and GEE Methods by Simulation," JSM Abstract Book: p. 246.
Woolson, R. F. and Leeper, J. D. (1980), "Hypothesis Testing in Multivariate Linear Models
with Randomly Missing Data," Communications in Statistics - Theory and Methods, A9,
1491-1513.
Woolson, R. F., Leeper, J. D. and Clarke, W. R. (1978), "Analysis of Incomplete Data From Longitudinal and Mixed Longitudinal Studies," Journal of the Royal Statistical Society-A, 141, 242-252.
Chapter 3
COMPARISON OF APPROXIMATE PERMUTATION
TESTS WITH PARAMETRIC TESTS IN MANOVA
WITH MISSING DATA
Diane J. Catellier
Department of Biostatistics CB#7400
University of North Carolina
Chapel Hill, North Carolina 27599-7400
D. Catellier: telephone 919-966-7283, email [email protected]
FAX 919-966-3804
Key words: test size, randomization test, multivariate linear models, MANOVA
SUMMARY
We describe how data permutation methods provide a new and powerful method for testing some classes of hypotheses in Multivariate Analysis of Variance (MANOVA) with missing data and small sample size. The first step involves using the EM algorithm to find maximum likelihood estimates of means and covariances using all of the available data, based on an assumption of Gaussian data. The second step is to compute the F approximations to Wilks' Lambda, the Pillai-Bartlett trace, and the Hotelling-Lawley trace test statistics. Next, a Monte Carlo approximation to the permutation test is carried out by tabulating F statistic values for a random sample of possible data permutations. The fourth step is to compute the significance level as the proportion of F values larger than or equal to the observed F.
We compared the approximate permutation tests to degree of freedom adjusted F tests
recommended by Catellier and Muller (1998) and the usual unadjusted F tests. The adjusted
tests replace the number of independent sampling units (N) by some function of the numbers
of non-missing pairs of responses. Simulation results confirm that unadjusted F tests give
substantially inflated test sizes except with complete data. Although the adjusted tests limit
test size to no more than the nominal level, they can be noticeably conservative. In contrast,
the approximate permutation tests apparently provide unbiased tests, even for very small
samples (N = 12) and up to 10% missing data. Overall, the permutation-based methods had
equal or higher power than the adjusted F tests in all cases, while still controlling test size.
3.1 INTRODUCTION
3.1.1 Motivation
We focus on a very common range of statistical models, with an important complication.
For complete and multivariate Gaussian data, the General Linear Multivariate Model
(GLMM) provides convenient and statistically well-behaved estimation and testing, even in small samples. Example designs include both repeated measures and MANOVA arrangements. The loss of any data substantially complicates the picture.
Missing data plague almost all experimental settings.
For Gaussian data with some
observations missing at random (MAR, Rubin, 1976), practical and effective estimation
procedures are available, even for small samples. A useful overview of this topic can be
found in Little and Rubin (1987). In particular, maximum likelihood (ML) estimation using
the EM algorithm seems to work well in both large and small samples (Beale and Little,
1975). In the same setting, hypothesis testing proves much more difficult. The work of
Barton and Cramer (1989) and Schluchter and Elashoff (1990) allows one to conclude that no
general method has been shown to control test size at or below the nominal level, except in
moderate to large samples. Analysis of the complete cases only does provide unbiased
results, but can be extremely inefficient.
Catellier and Muller (1998) recently introduced methods for Gaussian data which do
succeed in controlling test size in small sample GLMM analysis with missing data. Their
approximate methods often allow test size to fall somewhat below the target level, especially
at the smallest sample sizes. Hence their methods sacrifice some power. In the present
research we seek to regain the power lost by taking advantage of methods which increase in
appeal as sample sizes decrease.
Permutation-based inference methods, in which the only assumption is one of exchangeability of observations (Lehmann, 1986, p. 231), are guaranteed to be valid under the
null hypothesis, even with small samples sizes. Experimental studies with subjects randomly
assigned to treatments are an important example of data which meet the exchangeability
assumption. An exact permutation-based p value for an observed test statistic is computed as
the proportion of test statistic values computed from all possible reassignments of
observations to treatment groups which are as large as the observed value. Koch and Gillings (1983) refer to inferences based on permutation theory as design-based inferences, as opposed to model-based inferences which require assumptions external to the study design.
A permutation test is based on some chosen test statistic. Clearly, one prefers a test statistic that is likely to provide the most powerful test of the null hypothesis of interest. In practice, the distributional properties of the data are considered when choosing between parametric or nonparametric estimation of the population parameters. For example, the combination of nonparametric (e.g., rank-based) estimates with permutation-based (i.e., design-based) inference would seem to have the greatest appeal when lacking a convincing choice of distribution model for the data. In this paper, we consider an analysis approach based on parametric estimates with permutation-based inference.
We restrict attention to the GLMM setting meeting all the usual Gaussian assumptions,
except for the presence of MAR data. Barton and Cramer (1989) and Catellier and Muller
(1998) studied the null case properties of a method based on EM estimation and degree of
freedom adjusted approximate parametric tests in small to moderately sized samples. Sample sizes ranged from 12 to 40, and the proportion missing ranged from 0 to 10%. Simulation results showed that the unadjusted parametric F tests give substantially inflated test sizes in all but the complete data setting. The adjusted tests held the test size to no more than the nominal level, but were noticeably conservative under certain conditions. Hence a test which has actual test size equal to the nominal level holds promise, due to the opportunity for improved statistical power. The combination of this idea and the possibility of improved robustness motivated the current research. An additional enticement was that many authors (for example, see Ludbrook and Dudley, 1998) argue strongly that first principles dictate that exact permutation tests should be preferred in most biomedical research.
3.1.2 Relevant Literature
Catellier and Muller (1998) provided a brief overview of the literature pertaining to strategies for providing accurate and efficient parametric inference in small multivariate normal samples with missing data. The literature on permutation-based inference can be divided into two divergent streams. One relates to permutation tests for rank-based test statistics, and the other to permutation tests for test statistics which are explicit functions of the actual values of the sample observations. These latter tests are sometimes called component randomization tests (see Puri and Sen, 1993, p. 76). Since a permutation distribution of a rank-based test statistic is invariant to changes in the actual values of the observations, it is possible to prepare tables for various sample sizes which can be used repeatedly to determine significance for new samples. Component randomization tests require that the permutation distribution be computed for every new set of data.
Much of the theoretical work on permutation procedures is devoted to tests based on ranks. The asymptotic power properties and relative efficiencies (with respect to their parametric competitors) of various rank permutation tests for MANOVA in the complete data setting are presented in detail in Chapter 5 of Puri and Sen (1993). Servy and Sen (1987) extended the theory of rank permutation tests for one-way MANOVA and multivariate analysis of covariance models to allow for missing data. Their approach involves replacing the observed variables with ranks or other scores, excluding the missing values, and imputing the missing data with the mean of the scores of the non-missing values for that variable. Jerdack and Sen (1990) extended the approach to two-way MANOVA designs.
The first example of component permutation tests for MANOVA is due to Wald and Wolfowitz (1944). They proposed a permutation test based on a modified Hotelling's T² statistic to test the hypothesis that two samples arose from one multivariate normal population, assuming homogeneity of covariance. Chung and Fraser (1958) derived an alternative permutation test in the same setting for the case in which the number of response variables is large. Friedman and Rafsky (1979) provide a k-sample generalization of the Wald-Wolfowitz two-sample test. The method involves constructing the minimum spanning tree for the combined sample, then generating the permutation distribution of the runs
statistic through a series of random relabelings of the individual data points. Little is known
about the properties of permutation tests based on three commonly used multivariate test
statistics: the Pillai-Bartlett trace, Wilks' Lambda, and the Hotelling-Lawley trace.
The bootstrap resembles the permutation test in that it derives a distribution for the test
statistic using the observed data (Efron and Tibshirani, 1993). The bootstrap distribution is
obtained by repeatedly resampling the observations (with replacement) separately from each
sample and computing the test statistic for each resample. Romano (1989) showed that the power of the bootstrap and permutation tests are asymptotically equivalent. However, the permutation test may be preferable since it has exact level α for finite samples.
An important limitation of the permutation methods is that they are currently limited to fairly simple designs. For instance, there is considerable debate among statisticians as to whether an interaction in a factorial experiment can be tested using permutation methods. See Edgington (1995, p. 137-138) or Scheffe (1959, p. 318) for arguments which support the claim that tests for interaction are impossible. Welch (1990) provides justification for a method of constructing a permutation test of interaction based on invariance and sufficiency. Still and White (1981) proposed an approximate permutation test of interaction in which residuals, obtained by removal of estimates of the main effects, are permuted. Theoretical objections to permutation tests based upon estimated residuals have been raised by Bradbury (1987), Edgington (1995) and others. They argue that permutation tests which permute design-dependent functions of the observations, instead of the observations themselves, do not meet the requirement that all data permutations be equally likely under the null hypothesis. Given that no permutation test of interaction has been thoroughly justified, other methods such as the bootstrap are often recommended.
The scope of this literature review has been limited to permutation tests for testing equality of the treatment means, assuming homogeneity of covariance in one-way MANOVA models. See Edgington (1995) and Good (1994) for permutation tests pertaining to other experimental designs or for testing other hypotheses, such as those concerning correlation and trend.
3.2 STATEMENT OF THE PROBLEM
Notation and setting closely follow that of Catellier and Muller (1998), which should be consulted for details. We note only the following few points. We assume throughout that the usual MANOVA assumptions of Gaussian distributed errors and homogeneous covariance structure are realistic, and missing responses are MAR. Hence when the data are complete the F approximations to the multivariate tests provide valid results. Finally, we restrict attention to a limited class of tests (one between-subject factor and one within-subject factor).
To facilitate the discussion of permutation procedures we introduce some notation.
Suppose the N subjects are randomly assigned to one of q ≥ 2 treatment groups, with Ng subjects in each group (g ∈ {1, ..., q}), each of whom contribute p response measures. For i ∈ {1, ..., Ng}, let
Yi(g) = [Yi1(g), ..., Yip(g)]'
indicate one of a set of independent and identically distributed random vectors with continuous distribution function, Fg. The q sets are assumed to be mutually independent.
The most general null hypothesis specifies that
H0: F1 = ··· = Fq
against the alternative that {Fg} are not all the same. The GLMM is
E(Y) = XB,                                                    (3.2.1)
where E denotes the expectation operator and Y (N × p), X (N × q, fixed and known, conditional upon having chosen the subjects) and B (q × p, fixed and unknown) denote the matrix of observations, design matrix, and parameter matrix, respectively.
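As a toy illustration of equation (3.2.1) for the one-way setting, a cell-mean coding gives concrete X and B; the dimensions and values below are ours, not from the dissertation:

```python
import numpy as np

# q = 2 groups with 2 subjects each (N = 4), p = 3 responses
group = np.array([0, 0, 1, 1])      # group label for each subject
X = np.eye(2)[group]                # N x q cell-mean design matrix (one 1 per row)
B = np.array([[1.0, 2.0, 3.0],      # row g holds the p response means for group g
              [4.0, 5.0, 6.0]])
EY = X @ B                          # E(Y) = XB: each row is its group's mean vector
print(EY)
```

Each subject's expected response vector is simply the mean vector of the group to which that subject was assigned.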
Our main interest is in testing the null hypothesis of equality of the treatment location parameters, assuming homogeneity of covariance. This is equivalent to testing the null hypothesis of no interaction in X and Y. For this alternative, we let
Fg(y) = F(y + Ag)
for all g and test
H0: A1 = ··· = Aq = 0
against the alternative that {Ag} are not all zero.
The null hypothesis implies that the response vector for each subject is the same under one treatment assignment as under any alternative assignment, or equivalently, that the Yi are interchangeable for i ∈ {1, ..., N}, with N = ΣNg (summing over g ∈ {1, ..., q}). Thus, under H0, each data permutation represents the results that would have been obtained for a particular assignment of subjects to the treatment conditions. The alternative hypothesis is that there is a differential effect of the treatments for at least one of the subjects.
3.3 NEW METHODS
3.3.1 The Basic Method
All versions of the permutation procedure for inference in the GLMM with missing data
involve four steps. These may be loosely described as follows. 1) Compute ML estimates of the expected value and covariance parameters via the EM algorithm for the original set of observations. 2) Compute unadjusted F approximations for three multivariate test statistics: Wilks' Lambda (W), the Hotelling-Lawley trace (U), and the Pillai-Bartlett trace (V), using the estimates obtained in Step 1. The F approximations for W, U, and V coincide with those used by Catellier and Muller (1998). 3) Obtain the permutation distributions of Fw, Fu, and Fv by either enumerating all possible reassignments of observations to treatment groups, or by choosing a random sample of reassignments (keeping the number of observations in each group constant), and recalculating the test statistics for each reassignment. 4) Compute the p value as the proportion of test statistic values greater than or equal to the value corresponding to the originally obtained data.
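The resampling portion of these steps (Steps 3 and 4) can be sketched generically. This is a simplified sketch, not the authors' code: `statistic` is a placeholder callable standing in for the EM-based F approximations of Steps 1 and 2, and all names are ours.

```python
import numpy as np

def monte_carlo_permutation_p(y, groups, statistic, n_perm=999, seed=0):
    """Approximate permutation p value for an observed test statistic.

    y: (N x p) response matrix; groups: length-N treatment labels;
    statistic: callable (y, groups) -> scalar, larger meaning more extreme.
    """
    rng = np.random.default_rng(seed)
    observed = statistic(y, groups)
    count = 0
    for _ in range(n_perm):
        perm = rng.permutation(groups)        # reassign subjects to groups,
        if statistic(y, perm) >= observed:    # keeping the group sizes fixed
            count += 1
    # count the observed arrangement as one member of the reference set
    return (count + 1) / (n_perm + 1)

# toy use: squared difference in group means of the first response
stat = lambda y, g: (y[g == 0, 0].mean() - y[g == 1, 0].mean()) ** 2
y = np.vstack([np.zeros((6, 2)), np.ones((6, 2))])  # strong group separation
groups = np.repeat([0, 1], 6)
p = monte_carlo_permutation_p(y, groups, stat, n_perm=499)
print(p)  # small p: few relabelings separate the groups as well as the original
```

Adding one to both numerator and denominator treats the observed data as a member of its own permutation reference set, which keeps the reported p value strictly positive.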
3.3.2 Exact versus Approximate Permutation Tests
Exact permutation tests involve tabulating the test statistic value for every possible permutation. In all, there are
M = N! / (N1! ··· Nq!)
ways in which a sample of N response vectors can be combined into q samples of sizes N1, ..., Nq. As the number of observations increases, the number of values of the test statistic that need to be calculated to obtain the exact p value increases very rapidly.
Pitman (1937a, 1937b, 1938) approached the problem by deriving the first four moments of the exact permutation distribution of a test statistic and showing that they converge to the moments of some well-known distribution (χ², normal, etc.) as N increases. Wald and Wolfowitz (1944) provided a general theorem on the limiting distributions of the class of linear permutation statistics (Puri and Sen, 1993, p. 73). For example, the limiting distribution of the permutation distribution for a modified Hotelling T² statistic approaches the χ² distribution with two degrees of freedom as N → ∞.
In the MANOVA setting of interest, with a small to moderate sample size, one cannot take advantage of the asymptotic approximations for the permutation distributions. Neither can one obtain the complete permutation distributions for the statistics, except for a rather exorbitant amount of computing (at least with current equipment and methods). For example, the total number of data permutations for an experiment with N = 12 subjects randomly assigned to 4 treatments is 369,600. Alternatively, we can significantly reduce the amount of time required by drawing a random sample of the data rearrangements, without replacement, to produce a close approximation to the exact p value. Such methods are "approximate" or Monte-Carlo permutation tests.
A random draw from the set of complete data permutations can be obtained by assigning a random uniform [0,1] number to each row of the design matrix, X, and sorting this matrix according to the uniform values. This sorted matrix is then reassigned to the original response matrix, Y. This procedure is equivalent to random sampling from the N! possible permutations of N observations. Using the concept of a partition (Section 4.2, Johnson and Kotz, 1972), it is easy to ensure that only data permutations in which at least one subject's assignment is different than the original treatment assignment are sampled. For example, if the partition representing the original allocation of N = 12 subjects to q = 4 treatment groups (A, B, C, D) is

  Treatment Group:   A        B        C        D
  Subject:           2 11 6   3 1 10   8 9 4    12 5 7

and its corresponding canonical representation is

  Treatment Group:   A        B        C        D
  Subject:           2 6 11   1 3 10   4 8 9    5 7 12

then only random data permutations whose canonical partitions are different than the original belong to the set of M possible data permutations.
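The uniform-sort draw and the canonical-partition check can be sketched as follows; this is a minimal sketch under our own naming, not the authors' implementation:

```python
import numpy as np

def canonical_partition(groups, subjects):
    """Sort subject ids within each labeled group so equivalent assignments match."""
    return tuple(tuple(sorted(subjects[groups == g])) for g in np.unique(groups))

def random_reassignment(groups, subjects, rng):
    """Draw one data permutation by sorting on uniform [0,1] keys, redrawing any
    draw whose canonical partition reproduces the original assignment."""
    original = canonical_partition(groups, subjects)
    while True:
        keys = rng.uniform(size=len(groups))
        perm = groups[np.argsort(keys)]     # sorting on uniform keys gives a
        if canonical_partition(perm, subjects) != original:  # uniform draw from N!
            return perm

rng = np.random.default_rng(1)
subjects = np.arange(1, 13)
groups = np.repeat([0, 1, 2, 3], 3)         # original allocation: 4 groups of 3
new = random_reassignment(groups, subjects, rng)
print(canonical_partition(new, subjects))
```

Because group sizes are preserved by permuting labels, each accepted draw corresponds to one of the M admissible reassignments described above.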
3.4 NUMERICAL EVALUATIONS
3.4.1 Purpose of Studies and Overall Design
The impact of missing data on test size and power of permutation tests for three conventionally used multivariate test statistics is compared with that of their parametric counterparts through simulation studies. The first series of simulations focuses on the null case, and the second on power. When examining power, a diffuse noncentrality pattern (Olson, 1976) for the expected value matrix, B, was considered. The multivariate power estimation algorithm of Muller and Peterson (1984) was used to compute estimates of B corresponding to approximate power of 0.8.
3.4.2 Methods
Simulations involved the following factors: 1) one within-subject factor with p = 3 levels, 2) one between-subject factor with q = 4 levels, 3) N ∈ {12, 24}, 4) proportion missing of π ∈ {0, 0.05, 0.10}, and 5) three patterns for the error covariance matrix defined by either equal or unequal variances, and either low or high correlation (i.e., ρjj' ≈ 0.1 and ρjj' ≈ 0.7, respectively). No subject's data were allowed to be completely missing.
3.4.3 Results
Tables 3.1-3.4 give the empirical test size and power levels for both parametric and permutation-based tests for W, U, and V. Simulation results were based on 5,000 replications for the parametric tests and 1,000 replications for the permutation tests. Assuming a nominal significance level of 0.05, the approximate 95% confidence bounds for each entry in Tables 3.1 and 3.2 are no greater than ± 0.014. The maximum 95% confidence bounds for each power estimate in Tables 3.3 and 3.4 are no greater than ± 0.031, which occurs when the true power is 0.5.
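These confidence bounds follow from the usual normal-approximation standard error of a Monte Carlo proportion, 1.96·sqrt(π(1 − π)/n). A quick check under the stated replication counts (our own sketch):

```python
from math import sqrt

def mc_half_width(p, n_reps):
    """Approximate 95% confidence half-width for an estimated proportion."""
    return 1.96 * sqrt(p * (1 - p) / n_reps)

# test-size entries: nominal 0.05 with 1,000 permutation-test replications
print(round(mc_half_width(0.05, 1000), 3))  # 0.014
# power entries: worst case at true power 0.5 with 1,000 replications
print(round(mc_half_width(0.50, 1000), 3))  # 0.031
# Chapter 2 parametric tables: nominal 0.05 with 5,000 replications
print(round(mc_half_width(0.05, 5000), 3))  # 0.006
```

The half-width is maximized at p = 0.5, which is why the power tables carry the wider ± 0.031 bound.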
We describe first simulation results for the null situation. Table 3.1 gives the empirical test sizes for the parametric F tests and their permutation counterparts (P) in the complete data setting. All tests fall within a tolerable range of the target test size. Results for the missing data conditions are shown in Table 3.2. It is evident that all three unadjusted parametric tests Fw, Fu, and Fv give inflated test sizes. For each test statistic, increasing the sample size from N = 12 to N = 24 reduced, but did not overcome, the problems introduced by the missing data. The adjusted parametric tests Fw*, Fu*, and Fv* held test
size to no more than the nominal level, but were noticeably conservative under the worst simulated conditions. The empirical rejection rates for the permutation tests, on the other hand, never fell outside the 95% confidence interval for the true test size under any of the missing data conditions.
Table 3.1 Approximate Permutation Test Size for F Tests
(0% Missing, 1000 Replications, ± 0.014)

                       W              U              V
N   ρjj'  σ²j      Fw      P      Fu      P      Fv      P
12  Low    =      .050   .057   .056   .053   .061   .057
12  Low    ≠      .052   .048   .057   .058   .057   .048
12  High   ≠      .053   .050   .044   .051   .050   .051
24  Low    ≠      .053   .048   .049   .049   .051   .055
24  High   ≠      .054   .056   .042   .053   .062   .061
Table 3.2 Approximate Permutation Test Size for F Tests
(5%, 10% Missing, 1000 Replications, ± 0.014)

                                   W                     U                     V
N   ρjj'  σ²j  % Miss    Fw    Fw*    P       Fu    Fu*    P       Fv    Fv*    P
12  Low    =      5     .145   .029  .054    .142   .032  .059    .102   .052  .056
12  Low    ≠      5     .134   .033  .061    .130   .033  .059    .099   .051  .053
12  High   ≠      5     .141   .032  .048    .134   .034  .049    .096   .050  .052
24  Low    ≠      5     .087   .030  .046    .084   .028  .052    .081   .046  .046
24  High   ≠      5     .089   .031  .060    .088   .031  .058    .088   .049  .059
12  Low    =     10     .345   .047  .047    .341   .052  .044    .206   .057  .061
12  Low    ≠     10     .335   .042  .049    .333   .047  .044    .215   .051  .033
12  High   ≠     10     .354   .051  .039    .352   .055  .043    .214   .058  .048
24  Low    ≠     10     .133   .020  .060    .135   .021  .064    .127   .044  .059
24  High   ≠     10     .158   .025  .045    .158   .029  .043    .148   .055  .038
Tables 3.3 and 3.4 give empirical powers for the complete and missing data cases, respectively. Empirical power was not computed for the unadjusted parametric tests since they all produced inaccurate test sizes in the null case with missing data. For the complete data case, we found that the empirical powers of the parametric F tests essentially coincide with those of their permutation-based counterparts.
The adjusted F tests had equal or lower power than the permutation-based tests when data were missing. The Pillai-Bartlett test power was roughly the same using either inferential method. In contrast, the power for the Wilks' and Hotelling-Lawley adjusted tests was approximately 0.7-0.9 times the power of the permutation-based tests. These results are not surprising given that the test sizes for Fv* were quite close to the nominal level, while they were noticeably conservative for Fw* and Fu*. Consequently, the permutation test for V has a clear advantage over the permutation tests for either W or U, particularly when 10% of the data are missing. In the worst case condition with N = 12 and 10% missing data, the power for the permutation test for V was twice that for W, and four times that for U.
Table 3.3 Approximate Permutation Power for F Tests
(0% Missing, 1000 Replications, ± 0.031)

              W           U           V
 N   ρjj'   Fw    P    Fu    P    Fv    P
12   Low   .87   .87  .70   .75  .94   .95
12   High  .88   .88  .71   .72  .94   .95
24   Low   .82   .82  .79   .79  .84   .84
24   High  .83   .84  .79   .80  .85   .87
N
Table 3.4 Approximate Permutation Power for F Tests
(5%, 10% Missing, 1000 Replications, ± 0.031)

                           W            U            V
% Miss   N   ρjj'   Fw*    P    Fu*    P    Fv*    P
  5     12   Low    .62   .76   .33   .48   .88   .91
  5     12   High   .62   .76   .31   .47   .89   .91
  5     24   Low    .67   .77   .60   .72   .79   .81
  5     24   High   .68   .78   .61   .74   .79   .81
 10     12   Low    .37   .43   .18   .20   .82   .81
 10     12   High   .37   .43   .19   .18   .82   .82
 10     24   Low    .53   .71   .44   .65   .72   .76
 10     24   High   .54   .68   .46   .61   .75   .73
3.5 AN EXAMPLE: THE EFFECT OF CHOLINE DEFICIENCY ON HUMANS
In this section we illustrate the use of approximate permutation tests for MANOVA by
analyzing data from a study that examined the effects of choline deprivation on plasma
choline concentration over 35 days, in healthy male subjects (Zeisel, DaCosta, Franklin,
Alexander, Lamont, Sheard and Beiser, 1991). Subjects were given a standard diet which
included 500 mg/day of choline for one week, and then were randomly assigned into two diet
groups, one that contained choline and one that did not. During the 5th week of study, all
subjects again consumed a diet containing choline. Blood samples for choline analyses were
obtained before the start of the study (day 0) and on days 7, 14, 21, 28, and 35. Only one of
the 14 subjects with baseline data had any missing data, with the single missing response
being for day 35. The data are presented in Table 3.5.
A multivariate analysis of covariance model of difference scores allowed testing the
effects of diet on the plasma choline concentration over time, while controlling for treatment
group differences in baseline choline levels. Note that the permutation distribution is based
on reassignments of both the response vector and the baseline covariate to treatment groups.
Interpretation of treatment effects using this method should therefore be thought of as being
conditional upon the set of responses and covariates actually obtained. The null hypothesis
of interest was a test of no treatment group effect on the set of responses. For this particular
design with two treatment groups, all multivariate F tests coincide. Using all of the available
data, the value of the unadjusted F statistic is 3.48 with ν1 = 5 and ν2 = 7 numerator and
denominator degrees of freedom, respectively. The corresponding p value is 0.067. Given
the inflated test sizes reported in Table 3.2 for the unadjusted tests, this p value should be
viewed with caution. When significance is determined using a permutation approach, the p
value is 0.082. The adjusted F statistic based on replacing N by the harmonic mean number
of non-missing pairs of responses was 3.31 with ν1 = 5 and ν2 = 6.65 degrees of freedom,
leading to p = 0.080. Using an even more conservative adjustment based on the minimum
number of non-missing pairs led to a p value of 0.108, based on F = 2.98 with ν1 = 5 and
ν2 = 6 degrees of freedom. All missing data analysis methods led to lower p values than the
analysis based on complete cases only (F = 2.74, ν1 = 5, ν2 = 6, and p = 0.126).
Table 3.5 Choline Measurements Over 5-Week Period in Male Subjects

                                  Day
Treatment       0      7     14     21     28     35
Control      9.93  12.29   9.30   9.51  10.84   9.24
             9.77   8.14  11.43   9.44  11.10  10.56
            12.56  10.90  11.19  12.31   9.95  12.78
            10.15  10.32   8.86   9.23   8.56  12.39
            11.00   9.20   8.78   9.37   7.54   9.74
            10.46   8.72   8.13   8.14  11.76      .
Deficient   12.15   9.52   9.05   9.07   6.76   9.39
            12.88   9.66   7.71   7.29   6.37  10.61
             7.94   9.86   7.87   8.89   8.69  12.28
             9.42  12.82   7.17   8.18   8.30  12.61
             9.57  10.95   9.01   8.98   6.56   9.66
            11.54  10.43   8.66   8.60   7.87   9.69
            11.65  10.64   9.81   8.04   7.52   8.76
             8.73   8.08   7.70   6.44   6.42   8.93

(Each row is one subject; "." denotes the single missing value.)
3.6 CONCLUSIONS
Conclusion 1. When large sample test statistics cannot be justified due to inadequate
sample size, and the methods apply, an approximate permutation test can be used to ensure
validity.
Conclusion 2. Simulation results suggest that both the parametric multivariate F tests
and their corresponding approximate permutation tests control test size when the data are
complete.
Conclusion 3. With 5% or 10% missing data, 1) unadjusted parametric tests yield
inflated type I error rates, 2) adjusted tests work well, but can be conservative, and 3)
permutation-based methods provide test sizes which are close to the nominal rate.
Conclusion 4. Parametric tests and approximate permutation tests are equally powerful in
the complete case.
Conclusion 5. Under all missing data conditions, the permutation tests have power at
least as great as the adjusted parametric tests.
Conclusion 6. The permutation test for the Pillai-Bartlett trace has a clear advantage over
other permutation tests.
Conclusion 7. Much work needs to be done to determine the usefulness and limitations
of the permutation procedure. Currently, permutation tests are limited to fairly simple
designs and have no universally accepted test for an interaction in factorial designs.
REFERENCES
Barton, C. N. and Cramer, E. C. (1989), "Hypothesis Testing in Multivariate Linear Models
with Randomly Missing Data," Communications in Statistics - Simulation and
Computation, 18, 875-895.
Beale, E. M. L. and Little, R. J. A. (1975), "Missing Values in Multivariate Analysis,"
Journal of the Royal Statistical Society-B, 37, 129-145.
Bradbury, I. (1987), "Analysis of Variance Versus Randomization Tests - A Comparison,"
British Journal of Mathematical and Statistical Psychology, 40, 177-187.
Catellier, D. J. and Muller, K. E. (1998), "Inference for the General Linear Multivariate
Model with Missing Data in Small Samples," Institute of Statistics Mimeo Series No.
XXXX, University of North Carolina, Chapel Hill.
Chung, J. H. and Fraser, D. A. S. (1958), "Randomization Tests for a Multivariate Two-Sample
Problem," Journal of the American Statistical Association, 53, 729-735.
Edgington, E. S. (1995), Randomization Tests (3rd edition), New York: Marcel Dekker.
Efron, B. and Tibshirani, R. (1993), An Introduction to the Bootstrap, New York: Chapman
& Hall.
Friedman, J. H. and Rafsky, L. C. (1979), "Multivariate Generalizations of the Wald-Wolfowitz
and Smirnov Two-Sample Tests," The Annals of Statistics, 7, 697-717.
Good, P. I. (1994), Permutation Tests: A Practical Guide to Resampling Methods for Testing
Hypotheses, New York: Springer-Verlag.
Jerdack, G. R. and Sen, P. K. (1990), "Nonparametric Tests of Restricted Interchangeability,"
Annals of the Institute of Statistical Mathematics, 42, 99-114.
Johnson, N. L. and Kotz, S. (1972), Continuous Multivariate Distributions, New York: John
Wiley.
Koch, G. G. and Gillings, D. B. (1983), "Inference, design based vs. model based," In Kotz,
S. and Johnson, N. L., eds., Encyclopedia of Statistical Sciences, New York: John Wiley,
4: 84-88.
Lehmann, E. L. (1986), Testing Statistical Hypotheses, New York: John Wiley.
Little, R. J. A. and Rubin, D. B. (1987), Statistical Analysis with Missing Data, New York:
John Wiley.
Ludbrook, J. and Dudley, H. (1998), "Why Permutation Tests are Superior to t and F Tests
in Biomedical Research," The American Statistician, 52, 127-132.
Muller, K. E. and Peterson, B. L. (1984), "Practical Methods for Computing Power in
Testing the Multivariate Linear Hypothesis," Computational Statistics and Data
Analysis, 2, 143-158.
Olson, C. L. (1976), "Choosing a Test Statistic in Multivariate Analysis," Psychological
Bulletin, 86, 579-586.
Pitman, E. J. G. (1937a), "Significance Tests which can be Applied to Samples from any
Populations," Journal of the Royal Statistical Society, B, 4, 119-130.
Pitman, E. J. G. (1937b), "Significance Tests which can be Applied to Samples from any
Populations. II. The Correlation Coefficient," Journal of the Royal Statistical Society, B,
4, 225-232.
Pitman, E. J. G. (1938), "Significance Tests which can be Applied to Samples from any
Populations. III. The Analysis of Variance Test," Biometrika, 29, 322-335.
Puri, M. L. and Sen, P. K. (1993), Nonparametric Methods in Multivariate Analysis, Florida:
Krieger Publishing Company.
Romano, J. P. (1989), "Bootstrap and Randomization Tests of Some Nonparametric
Hypotheses," The Annals of Statistics, 17, 141-159.
Rubin, D. B. (1976), "Inference and Missing Data," Biometrika, 63, 581-592.
Scheffe, H. (1959), The Analysis of Variance, New York: John Wiley.
Schluchter, M. D. and Elashoff, J. D. (1990), "Small-sample Adjustments to Tests with
Unbalanced Repeated Measures Assuming Several Covariance Structures," Journal of
Statistical Computation and Simulation, 37, 69-87.
Servy, E. C. and Sen, P. K. (1987), "Missing Variables in Multi-Sample Rank Permutation
Tests for MANOVA and MANCOVA," Sankhya, 49, 78-95.
Still, A. W. and White, A. P. (1981), "The Approximate Randomization Test as an
Alternative to the F Test in Analysis of Variance," British Journal of Mathematical and
Statistical Psychology, 34, 243-252.
Wald, A. and Wolfowitz, J. (1944), "Statistical Tests Based on Permutations of the
Observations," The Annals of Mathematical Statistics, 15, 358-372.
Zeisel, S. H., DaCosta, K., Franklin, P. D., Alexander, E. A., Lamont, J. T., Sheard, N. F. and
Beiser, A. (1991), "Choline, an Essential Nutrient for Humans," FASEB Journal, 5, 2093-2098.
Chapter 4
LINMOD 4: A PROGRAM FOR
GENERAL LINEAR MULTIVARIATE MODELS
WITH MISSING DATA
Diane J. Catellier
Department of Biostatistics CB#7400
University of North Carolina
Chapel Hill, North Carolina 27599-7400
D. Catellier: telephone 919-966-7283, email [email protected]
FAX 919-966-3804
Key words: test size, randomization test, multivariate linear models, MANOVA
SUMMARY
LINMOD 4 computes estimates of the parameters of a General Linear Multivariate Model
(GLMM) and performs tests of the general linear hypothesis in the presence of data assumed
to be missing at random (MAR). The EM algorithm provides maximum likelihood estimates
of expected value and covariance parameters using all of the available data. The program
computes approximate tests described by Catellier and Muller (1998). With complete data
the tests reduce to standard multivariate tests, including Wilks, Hotelling-Lawley, and
Pillai-Bartlett. Approximate Geisser-Greenhouse corrected and uncorrected tests for the
"univariate" approach to repeated measures are also available.
The tests differ from the standard ones by reducing the error degrees of freedom,
replacing the number of independent sampling units by a function of the numbers of non-missing pairs
of responses. Simulation results of Catellier and Muller (1998) lead to the conclusion that
the approach provides the best currently available methods for controlling test size at or
below the nominal rate. The methods control test size even with as few as 12 observations
for 6 repeated measurements and 5% missing data. The source code, an extensive user's
guide, and example programs may be obtained free of charge via the Internet.
4.1 MOTIVATION
Many commercial vendors and shareware sources provide a great variety of flexible
programs for analyzing multivariate Gaussian data with no missing observations. We focus
here on estimation and testing in a class of models often described as the General Linear
Multivariate Model (GLMM).
For example, see Muller, LaVange, Ramey, and Ramey
(1992) for a detailed specification from the perspective of power analysis. For the purposes
of data analysis, the most important special cases include Multivariate Analysis of Variance
and Covariance (MANOVA, MANCOVA), and the multivariate and univariate approaches to
repeated measures ANOVA.
Furthermore, some forms of discriminant analysis and
canonical correlation also represent special cases.
Missing data present one of the most common and vexing problems in such analyses,
especially in small samples. For data otherwise meeting the assumptions of the model, and
data missing at random (MAR, Rubin, 1976), the EM algorithm (Orchard and Woodbury,
1972; Dempster, Laird, and Rubin, 1977) allows easily computing maximum likelihood
(ML) estimates of all model parameters (expected value and covariance matrices) using all of
the available data. As in the complete data setting, widely available commercial software and
shareware conveniently provide ML estimates.
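As a concrete illustration of this computation (a bare-bones sketch, not LINMOD's or any package's implementation), one EM formulation for the mean and covariance of multivariate Gaussian data with NaN-coded MAR values looks like the following:

```python
import numpy as np

def em_mvnorm(Y, n_iter=200):
    """EM (Orchard-Woodbury; Dempster, Laird, and Rubin) ML estimates of the
    mean and covariance of multivariate Gaussian data with NaN-coded values."""
    Y = np.asarray(Y, dtype=float)
    n, p = Y.shape
    mu = np.nanmean(Y, axis=0)            # starting values
    sigma = np.diag(np.nanvar(Y, axis=0))
    for _ in range(n_iter):
        M = np.zeros(p)                   # accumulates E[y_i]
        S = np.zeros((p, p))              # accumulates E[y_i y_i']
        for i in range(n):
            obs = ~np.isnan(Y[i])
            mis = ~obs
            yi = np.where(obs, Y[i], 0.0)
            C = np.zeros((p, p))
            if mis.any():
                # E-step: conditional mean and covariance of the missing part
                B = sigma[np.ix_(mis, obs)] @ np.linalg.inv(sigma[np.ix_(obs, obs)])
                yi[mis] = mu[mis] + B @ (Y[i][obs] - mu[obs])
                C[np.ix_(mis, mis)] = (sigma[np.ix_(mis, mis)]
                                       - B @ sigma[np.ix_(obs, mis)])
            M += yi
            S += np.outer(yi, yi) + C
        # M-step: update the mean and the (biased, ML) covariance
        mu = M / n
        sigma = S / n - np.outer(mu, mu)
    return mu, sigma
```

With no missing entries the loop reproduces the ordinary sample mean and the ML (divide-by-n) covariance in a single pass.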
As of this writing, no available software provides an efficient means of controlling test
size at or below the nominal level in small samples in the presence of missing data
(Schluchter and Elashoff, 1990; Barton and Cramer, 1989; Catellier and Muller, 1998).
Deletion of cases with missing values is the default option with many statistical software
packages. While this approach is unbiased, it can be extremely inefficient in small samples.
Recently Catellier and Muller (1998) described approximate tests for the setting of
interest. With complete data, the tests reduce to standard multivariate tests, including Wilks,
Hotelling-Lawley, and Pillai-Bartlett.
Approximate Geisser-Greenhouse corrected and
uncorrected tests for the "univariate" approach to repeated measures are also available. The
tests differ from standard by reducing the error degrees of freedom replacing the number of
independent sampling units by some function of the numbers of non-missing pairs of
responses. Simulation results of Catellier and Muller (1998) lead to the conclusion that the
approach provides the best currently available methods for controlling test size at or below
the nominal rate. The methods control test size even with as few as 12 observations for 6
repeated measurements and 5% missing data. Hence a computer program which implements
the new methods in a convenient fashion would likely prove extremely useful for the practice
of data analysis.
4.2 A NEW APPROACH
4.2.1 Overview
This paper introduces LINMOD 4, a SAS® PROC IML program which implements the
methods of Catellier and Muller (1998). For mathematical details consult that paper, which
closely follows the notation of Muller, LaVange, Ramey, and Ramey (1992). The parameters
of the specified models are estimated using the EM algorithm. The parameter estimates are
then used to compute analogs of the hypothesis (H) and error (E) sums of squares matrices.
In turn, analogs of the Hotelling-Lawley trace, the Pillai-Bartlett trace, Wilks' Lambda, the
Geisser-Greenhouse corrected univariate test, and the corresponding uncorrected test are
computed. For example, the Hotelling-Lawley analog is computed as tr(H E^-1). Next an adjusted
Next an adjusted
sample size (N*) is computed to replace N (the total sample size) in calculation of error
degrees of freedom for the F approximations commonly used. In all cases N* equals a
function of the number of non-missing pairs of responses. The best choice for N* depends on
the test statistic and on the hypothesis of interest. By default, the LINMOD program uses the N*
choices recommended by Catellier and Muller (1998). However, the user is free to override
the default with various other choices for N*. In all cases the methods reduce to standard
ones for complete data.
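To make the N* substitution concrete, here is a hedged Python sketch (my notation, not LINMOD's code) of the Hotelling-Lawley analog and one common F approximation, F = (U/s)(ν2/ν1) with ν1 = ab and ν2 = s(N* − rank(X) − b − 1) + 2, where a is the rank of the between-subject contrast, b the number of response contrasts, and s = min(a, b):

```python
import numpy as np

def hlt_adjusted_f(H, E, a, b, n_star, rank_x):
    """Approximate F for the Hotelling-Lawley trace U = tr(H E^{-1}),
    with the adjusted sample size N* replacing N in the error df.
    a = rank of the between-subject contrast C, b = number of response contrasts."""
    U = np.trace(H @ np.linalg.inv(E))
    s = min(a, b)
    nu_e = n_star - rank_x            # adjusted error degrees of freedom
    nu1 = a * b
    nu2 = s * (nu_e - b - 1) + 2
    F = (U / s) * (nu2 / nu1)
    return U, F, nu1, nu2
```

For values matching the Hotelling-Lawley line of the choline example later in this chapter (U = 1.3872, a = 1, b = 4, rank(X) = 3), taking N* = 13 reproduces F = 2.4276 with (4, 7) degrees of freedom; that N* value is inferred from the printed degrees of freedom under this approximation, not quoted from the source.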
4.2.2 Comparison to Other Programs
As of this writing, some widely-used commercial software packages which have
implemented the EM algorithm for estimation with missing data in the GLMM include
BMDP® 7.0 (BMDPAM and BMDP8D modules), SPSS® 7.5 (Missing Value Analysis), and
SAS® (PROC MIXED). The ML estimates are used to produce inferential statements (e.g.,
interval estimates and p values) that are based on a number of variations of commonly used
test statistics and large sample theory.
There are obviously other strategies for treating the problem of missing data.
Computational routines for multiple imputation of multivariate Gaussian data, for example,
have been written for use with S-PLUS 4.0®, and are also available in SOLAS® for Missing
Data Analysis. In multiple imputation, each missing value is replaced by m > 1 simulated
values. The resulting m versions of the complete data are analyzed by standard complete-data
methods. Issues related to inference using multiple imputation have not been
extensively investigated, especially in small samples.
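The combining step of multiple imputation can be illustrated with Rubin's rules for a scalar parameter; this generic sketch is mine and is not code from any of the packages named above:

```python
import numpy as np

def rubin_pool(estimates, variances):
    """Combine m complete-data estimates and their variances (Rubin's rules)."""
    estimates = np.asarray(estimates, dtype=float)
    variances = np.asarray(variances, dtype=float)
    m = len(estimates)
    qbar = estimates.mean()            # pooled point estimate
    ubar = variances.mean()            # within-imputation variance
    b = estimates.var(ddof=1)          # between-imputation variance
    t = ubar + (1 + 1 / m) * b         # total variance of the pooled estimate
    return qbar, t
```

The between-imputation term b is what distinguishes the pooled variance from a naive average, and it is the source of the small-sample inference questions noted above.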
Earlier versions of LINMOD were written to allow a sophisticated user complete control
over a GLMM analysis by giving easy computational access to all intermediate results and by
allowing the user to specify contrasts and control structure in a matrix language.
The
software proves very useful in helping students learn linear models analysis and theory.
Furthermore the software provides extremely useful modules for simulations involving the
GLMM.
Early versions of LINMOD pre-dated availability of any similarly powerful
commercial software. As noted earlier, many commercial packages now provide analysis for
a wide range of GLMM models. The trade-off in using LINMOD in this setting is increased
flexibility and control for the sophisticated user versus a friendly interface for the naive user.
For example, LINMOD does not add or in any way recognize an intercept term in any model.
The user must code one if desired, and may choose to test it, if present. The user should be
familiar with the material presented in such texts as Timm (1975).
The primary advantage of LINMOD 4 over previous versions lies in its ability to provide
accurate inferences in the presence of missing data. The importance of using adjusted tests
increases with decreasing sample size.
For example, consider a balanced design with
N = 12 observations, 3 in each of q = 4 treatment groups, p = 3 repeated measures, and a
target test size of α = .05. Using ML estimates, but not adjusting the sample size for error
degrees of freedom calculations, gives a Wilks test analog with a type I error rate of roughly .05
for no missing data, .14 for 5% missing data, and .34 for 10% missing data.
Other
multivariate tests (and the univariate approach to repeated measures tests) perform as badly.
Similar results also occur with currently available tests commonly used (which are based on
large sample theory).
4.2.3 History of the Program
LINMOD 4 constitutes at least the sixth generation of such software. In the 1970's, a
FORTRAN program (MGLM) was used for multivariate analysis by UNC-CH Biostatistics
faculty and students.
The second generation was written in PROC MATRIX to take
advantage of the functions in the language. The third generation (LINMOD 1.0) provided
many more functions, as well as a better user interface and error checking. Version 2 was a
modest revision allowing easier support and greater portability. The introduction of IML
(and the demise of PROC MATRIX) led, somewhat belatedly, to LINMOD 3.0 in IML.
Great effort was made to allow for efficient analysis of large data sets. Version 4 differs
from Version 3 mainly in its ability to handle missing data.
4.3 LINMOD 4 PROGRAM
LINMOD 4 contains sixteen modules. The modules must be executed in a particular
sequence of steps. For example, a secondary hypothesis cannot be tested without first fitting
a model, nor can a model be fitted without first creating the proper sums of squares and cross
products matrix. The steps required for data analysis in LINMOD are as follows: 1) invoke
PROC IML and read the data corresponding to independent and dependent variables into
matrices, 2) calculate expected value parameters (modules MAKESS, GETCORSS), 3)
compute the estimates of the parameters of a GLMM (module FITMODEL) and 4) perform
tests of a general linear hypothesis (module TESTGLH). When data are missing, steps 2 and
3 are combined into one module called FITML. Module TESTGLH computes multivariate
test statistics.
4.4 RECOMMENDED OPTION SETTINGS
In each application, the user must decide which degree of freedom adjustment to use.
Deciding among the 11 options depends on 1) the test statistic of interest, and 2) the hypothesis of
interest (most importantly, whether the matrix of secondary parameters has rank one). The
decision can be made using the tables in Catellier and Muller as a guide, or simply by using
the NBEST option.
4.5 EXAMPLE
Table 4.1 contains data from a study that examined the effects of choline deprivation on
plasma choline concentration over 35 days, in healthy male subjects (Zeisel, DaCosta,
Franklin, Alexander, Lamont, Sheard and Beiser, 1991). Subjects were given a standard diet
which included 500 mg/day of choline for one week, and then were randomly assigned into
two diet groups, one that contained choline and one that did not. During the fifth week of
study all subjects again consumed a diet containing choline. Blood samples for choline
analyses were obtained before the start of the study (day 0) and on days 7, 14, 21, 28, and 35.
One subject had data missing for day 35. Large amounts of missing data for a very small
experiment would seem very worrisome.
Table 4.1 Choline Measurements Over 5-Week Period
in Male Subjects

                                  Day
Treatment       0      7     14     21     28     35
Control      9.93  12.29   9.30   9.51  10.84   9.24
             9.77   8.14  11.43   9.44  11.10  10.56
            12.56  10.90  11.19  12.31   9.95  12.78
            10.15  10.32   8.86   9.23   8.56  12.39
            11.00   9.20   8.78   9.37   7.54   9.74
            10.46   8.72   8.13   8.14  11.76      .
Deficient   12.15   9.52   9.05   9.07   6.76   9.39
            12.88   9.66   7.71   7.29   6.37  10.61
             7.94   9.86   7.87   8.89   8.69  12.28
             9.42  12.82   7.17   8.18   8.30  12.61
             9.57  10.95   9.01   8.98   6.56   9.66
            11.54  10.43   8.66   8.60   7.87   9.69
            11.65  10.64   9.81   8.04   7.52   8.76
             8.73   8.08   7.70   6.44   6.42   8.93

(Each row is one subject; "." denotes the single missing value.)
A multivariate analysis of covariance model of difference scores allowed testing the
effects of diet on the plasma choline concentration over time, while controlling for treatment
group differences in baseline choline levels. The null hypothesis of interest is a test of the
trends by treatment interaction. For this particular design with two treatment groups, all
multivariate F tests coincide (using the same N*). The Geisser-Greenhouse corrected
univariate test is denoted by FGG. The value of the unadjusted F statistic (N* = N) is 2.77
with ν1 = 4 and ν2 = 8 numerator and denominator degrees of freedom, respectively. The
corresponding p value is 0.10. The Geisser-Greenhouse corrected univariate test results are
FGG = 2.67, ν1 = 2.95, ν2 = 32.4, and p = 0.065. Given the inflated test sizes reported in
Catellier and Muller (1998) for the unadjusted tests, this p value should be viewed with
caution. Figure 4.2 contains the analysis results obtained from LINMOD using the adjusted
degrees of freedom method. The adjusted F statistic based on replacing N by the harmonic
mean number of non-missing pairs of responses was 2.65 with ν1 = 4 and ν2 = 7.65 degrees
of freedom, leading to p = 0.12. The results using the univariate approach to repeated
measures are FGG = 2.58, ν1 = 2.95, ν2 = 31.4, and p = 0.072. Both missing data analysis
methods led to a smaller p value than the analysis based on 13 complete cases (F = 2.00,
ν1 = 4, ν2 = 7, and p = 0.20; FGG = 2.40, ν1 = 2.8, ν2 = 28.3, and p = 0.092).
Figure 4.1 LINMOD Programming Statements for Choline Example

LIBNAME LIBPATH "H:\DJC\DISSERT\CHOLINE\";
%LET LMDIRECT = H:\DJC\LINMOD4\SOURCE\;
%INCLUDE "&LMDIRECT.MACROLIB.MAC" / NOSOURCE2;
PROC IML WORKSIZE=5000 SYMSIZE=5000;
&LINMOD;
USE LIBPATH.CHOLINE;
READ ALL VAR{DAY7 DAY14 DAY21 DAY28 DAY35} INTO Y [COLNAME=DEPVARS];
READ ALL VAR{GROUP} INTO GROUP;
READ ALL VAR{DAY0} INTO DAY0;
CELLMEAN = DESIGN(GROUP);        ** CELL MEAN CODING;
X = CELLMEAN || DAY0;
INDVARS = {"CONTROL" "DEFICIENT" "DAY0"};
ZNAMES = INDVARS || DEPVARS;
Z = X || Y;
P = NCOL(Y);
*-------------------------------*;
* REPEATED MEASURES HYPOTHESIS  *;
*-------------------------------*;
C = {-1 1 0};
RUN UPOLY1((1:5), "TIME", U, UNAME);
THETARNM = {"DEFICIENT-CONTROL"};
THETACNM = {"LIN" "QUAD" "CUB" "QUAR"};
OPT_ON = {"NBEST"};
RUN SETOPT;
RUN FITML;
RUN TESTGLH;
QUIT;
Figure 4.2 LINMOD Output for Choline Example

Model Parameters:
 N  ncol(Y)  rank(X)  rank(W)  Tolerance
14        5        3        0  1.5872E-9

BETA - Matrix of Parameter Estimates
              DAY7   DAY14   DAY21   DAY28   DAY35
CONTROL     9.5664  7.0619  7.4080  12.555  12.842
DEFICIENT   9.8885  5.8578  5.9615  9.8688  11.961
DAY0        0.0340  0.2398  0.2122  -0.244  -0.164

Expanded Columns of BETA
*** NOTE: Degrees of freedom based on NSTAR = NJJP_MIN
              DAY7  Std Err  t Value  2 Tail p
CONTROL     9.5664   3.2441   2.9489    0.0146
DEFICIENT   9.8885   3.1810   3.1086    0.0111
DAY0        0.0340   0.2987   0.1138    0.9116
             DAY14  Std Err  t Value  2 Tail p
CONTROL     7.0619   2.3775   2.9703    0.0140
DEFICIENT   5.8578   2.3313   2.5127    0.0308
DAY0        0.2398   0.2189   1.0956    0.2989
             DAY21  Std Err  t Value  2 Tail p
CONTROL     7.4080   2.4668   3.0031    0.0133
DEFICIENT   5.9615   2.4188   2.4646    0.0334
DAY0        0.2122   0.2271   0.9342    0.3722
             DAY28  Std Err  t Value  2 Tail p
CONTROL     12.555   2.7032   4.6445    0.0009
DEFICIENT   9.8688   2.6506   3.7232    0.0040
DAY0        -0.244   0.2489   -0.980    0.3502
             DAY35  Std Err  t Value  2 Tail p
CONTROL     12.842   3.3264   3.8607    0.0032
DEFICIENT   11.961   3.2618   3.6670    0.0043
DAY0        -0.164   0.3063   -0.536    0.6040

Estimated SIGMA: ML_SIGMAHAT*{_N_/{_N_-_RANKX_}}
_SIGMA_    DAY7   DAY14   DAY21   DAY28   DAY35
DAY7     2.2593  -0.096  0.7498  0.3501  0.5722
DAY14    -0.096  1.2135  0.6967  0.2509  -0.444
DAY21    0.7498  0.6967  1.3064  0.1892  0.5934
DAY28    0.3501  0.2509  0.1892  1.5687  -0.458
DAY35    0.5722  -0.444  0.5934  -0.458  2.3755

Estimated Error Correlation Matrix of SIGMA
_SCORR_   DAY7   DAY14   DAY21   DAY28   DAY35
DAY7     1.000  -0.058   0.436   0.186   0.247
DAY14   -0.058   1.000   0.553   0.182  -0.262
DAY21    0.436   0.553   1.000   0.132   0.337
DAY28    0.186   0.182   0.132   1.000  -0.237
DAY35    0.247  -0.262   0.337  -0.237   1.000

C
                    CONTROL  DEFICIENT  DAY0
DEFICIENT-CONTROL        -1          1     0

U
           LIN     QUAD      CUB     QUAR
DAY7    -0.632   0.5345   -0.316   0.1195
DAY14   -0.316   -0.267   0.6325   -0.478
DAY21        0   -0.535        0   0.7171
DAY28   0.3162   -0.267   -0.632   -0.478
DAY35   0.6325   0.5345   0.3162   0.1195

THETA is the estimate of CBU
                      LIN   QUAD     CUB    QUAR
DEFICIENT-CONTROL   -1.23  1.514  0.5567  0.7557

Column of THETA with Associated Statistics
*** NOTE: Degrees of freedom based on NSTAR = NJJP_H
                      LIN  Std Err  t Value     DF  2tail p  R Sqrd
DEFICIENT-CONTROL   -1.23   0.6491   -1.894  10.65   0.0856   0.252
                     QUAD  Std Err  t Value     DF  2tail p  R Sqrd
DEFICIENT-CONTROL   1.514   0.7514   2.0149  10.65   0.0699   0.276
                      CUB  Std Err  t Value     DF  2tail p  R Sqrd
DEFICIENT-CONTROL  0.5567   0.6503   0.8562  10.65   0.4107  0.0644
                     QUAR  Std Err  t Value     DF  2tail p  R Sqrd
DEFICIENT-CONTROL  0.7557   0.5930   1.2744  10.65   0.2296  0.1323

Univariate Tests for columns of THETA=C*BETA*U - THETA0
*** NOTE: Degrees of freedom based on NSTAR = NJJP_H
       F Value  Num df  Den df  p Value  R Sqrd
LIN     3.5887       1   10.65   0.0856   0.252
QUAD    4.0597       1   10.65   0.0699   0.276
CUB     0.7331       1   10.65   0.4107  0.0644
QUAR    1.6242       1   10.65   0.2296  0.1323

Estimated Correlation Matrix based on E
_R_      LIN    QUAD     CUB    QUAR
LIN    1.000   0.250   0.263   0.102
QUAD   0.250   1.000  -0.068   0.630
CUB    0.263  -0.068   1.000   0.340
QUAR   0.102   0.630   0.340   1.000

Generalized squared canonical correlations: CanVar1
_STMAT1_    VALUE  APPROX F  NUM DF  DENOM DF  P VALUE  ASSOCITN
LROOT      0.5811    3.4680       4        10   0.0504    0.5811
LAMBDA     0.4189    2.4276       4         7   0.1443    0.5811
HLTRACE    1.3872    2.4276       4         7   0.1443    0.5811
PILTRACE   0.5811    2.4276       4         7   0.1443    0.5811
*** NOTE: Degrees of freedom for LAMBDA, HLTRACE: NSTAR = NJJP_MIN
*** NOTE: Degrees of freedom for PILTRACE:        NSTAR = NJJP_H

Univariate Repeated Measures Tests
*** NOTE: Degrees of freedom for Gsr Grnh based on NSTAR = NJJ_A
                F  DF Numer  DF Denom  epslnHat  p Value
Uncrrctd   2.6131         4    43.200    1.0000   0.0483
Gsr Grnh   2.6131    2.9495    31.855    0.7374   0.0692
4.6 ACQUIRING THE PROGRAM
Free copies of the program and accompanying documentation may be acquired either via
the World Wide Web or FTP. Connect to the web site http://www.bios.unc.edu/~muller/ and
then follow the directions presented there. Alternately, use anonymous FTP at
ftp://www.bios.unc.edu/pub/faculty/muller/linmod40/ to acquire the software and
documentation.
REFERENCES
Barton, C. N. and Cramer, E. C. (1989), "Hypothesis Testing in Multivariate Linear Models
with Randomly Missing Data," Communications in Statistics - Simulation and
Computation, 18, 875-895.
Catellier, D. J. and Muller, K. E. (1998), "Inference for the General Linear Multivariate
Model with Missing Data in Small Samples," Institute of Statistics Mimeo Series No.
XXXX, University of North Carolina, Chapel Hill.
Dempster, A. P., Laird, N. M. and Rubin, D. B. (1977), "Maximum Likelihood from
Incomplete Data via the EM Algorithm (with Discussion)," Journal of the Royal
Statistical Society-B, 39, 1-38.
Muller, K. E., LaVange, L. M., Ramey, S. L. and Ramey, C. T. (1992), "Power Calculations
for General Linear Multivariate Models Including Repeated Measures Applications,"
Journal of the American Statistical Association, 87, 1209-1226.
Orchard, T. and Woodbury, M. A. (1972), "A Missing Information Principle: Theory and
Applications," In Proceedings of the 6th Berkeley Symposium on Mathematical
Statistics and Probability, 1, 697-715. Berkeley, California: University of California
Press.
Rubin, D. B. (1976), "Inference and Missing Data," Biometrika, 63, 581-592.
Schluchter, M. D. and Elashoff, J. D. (1990), "Small-sample Adjustments to Tests with
Unbalanced Repeated Measures Assuming Several Covariance Structures," Journal of
Statistical Computation and Simulation, 37, 69-87.
Timm, N. H. (1975), Multivariate Analysis, Monterey, California: Brooks/Cole.
Zeisel, S. H., DaCosta, K., Franklin, P. D., Alexander, E. A., Lamont, J. T., Sheard, N. F. and
Beiser, A. (1991), "Choline, an Essential Nutrient for Humans," FASEB Journal, 5, 2093-2098.
Chapter 5
CONCLUSIONS AND FUTURE RESEARCH
5.1 LOOKING BACKWARDS; SUCCESSES FROM THIS RESEARCH
Recall that the motivation for this research was to provide defensible inference for
Gaussian data with missing data in small samples. Two approaches to estimation and testing
with missing data in the General Linear Multivariate Model (GLMM) are commonly used.
1) The EM algorithm is first used to provide maximum likelihood (ML) estimates of
expected value and covariance parameters using all of the available data. The ML estimates
are then used to compute analogs of the standard multivariate tests.
2) The GLMM can be
transformed into a mixed model framework by creating a separate record for each
observation. This modeling approach allows for missing data. Likelihood-based estimates
can readily be computed using iterative methods and then used to produce inferences that are
based on large sample theory.
I began by documenting the limitations of the commonly used inference approaches
described above in a series of null case simulations. For example, consider a balanced design
with N = 12 observations, 3 in each of q = 4 treatment groups, p = 3 repeated measures,
and a target test size of α = .05. Using ML estimates and unadjusted error degrees of
freedom gives a Wilks test analog with a type I error rate of roughly .05 for no missing data,
.14 for 5% missing data, and .34 for 10% missing data. Other multivariate tests (and the
univariate approach to repeated measures tests) perform as badly. Similar results also occur
using the mixed model approach, with approximate test sizes of .13, .17, and .25 for 0%, 5%,
and 10% missing data, respectively.
In the first paper, I examined the performance of an inference strategy that generalizes a
..
suggestion due to Barton and Cramer (1989). In all cases the EM algorithm provides ML
estimates. In tum, a function of the number of non-missing pairs of responses (N*) replaces
N (number of subjects) in calculating error degrees of freedom for approximate F tests.
Simulation results suggest that the best choice for N* varies with the test statistic. Replacing
N by the mean number of non-missing responses works best for the Geisser-Greenhouse test.
The Pillai-Bartlett test requires the stronger adjustment of replacing N by the harmonic mean
number of non-missing pairs of responses. For Wilks' and Rotelling-Lawley, an even more
aggressive adjustment based on the minimum number of non-missing pairs must be used to
control test size at or below the nominal rate. Overall, the simulation results support
concluding that an adjusted test can always control test size at or below the nominal rate,
even with as few as 12 observations and up to 10% missing data.
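The pairwise counts and the three adjustments just described can be computed directly from the missingness pattern. The sketch below is hypothetical Python (not the LINMOD implementation); NaN marks a missing response:

```python
import numpy as np

def nstar_adjustments(Y):
    """Compute candidate N* values from the missingness pattern of Y (NaN = missing).

    N_jj' = number of subjects with both response j and response j' observed;
    N_jj  = number of subjects with response j observed.
    """
    M = (~np.isnan(Y)).astype(int)          # indicator of observed entries
    counts = M.T @ M                        # counts[j, j'] = N_jj'
    p = Y.shape[1]
    n_jj = np.diag(counts)                  # per-response counts N_jj
    iu = np.triu_indices(p, k=1)
    n_pairs = counts[iu]                    # distinct pairs N_jj', j < j'
    return {
        "mean_njj": n_jj.mean(),                                      # for Geisser-Greenhouse
        "harmonic_mean_pairs": len(n_pairs) / np.sum(1.0 / n_pairs),  # for Pillai-Bartlett
        "min_pairs": n_pairs.min(),                                   # for Wilks, Hotelling-Lawley
    }
```

Each candidate N* then replaces N in the error degrees of freedom, ν_E = N* − rank(X).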
Although the adjusted tests described in the first paper limit test size to no more than the
nominal level, they can be noticeably conservative, especially at the smallest sample sizes.
In the second paper I sought to regain the power lost by using approximate permutation
methods. Exact permutation methods are guaranteed to provide accurate inferences for a
particular range of hypotheses, even in very small samples. Based on simulation results, the
approximate permutation tests apparently provide unbiased tests, even for very small samples
(N = 12) and up to 10% missing data. Overall, the permutation-based methods had equal or
higher power than the adjusted F tests in all cases, while still controlling test size.
Unfortunately, using permutation tests disallows considering a broad range of often used
designs and hypothesis tests. For example, any study with only within-subject factors cannot
be analyzed with the approach without making a highly restrictive (and usually incorrect)
assumption.
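The basic approximate (Monte Carlo) permutation machinery can be sketched as follows. This is an illustrative stand-in: the statistic here is a simple between-group dispersion measure rather than the likelihood-based statistics actually studied, and the code is hypothetical Python rather than the software used in the dissertation.

```python
import numpy as np

def permutation_pvalue(Y, groups, stat, n_perm=999, seed=0):
    """Approximate permutation p-value for a between-group test.

    Group labels are randomly reassigned to subjects; the p-value is the
    proportion of permuted statistics at least as extreme as the observed one.
    """
    rng = np.random.default_rng(seed)
    observed = stat(Y, groups)
    hits = sum(stat(Y, rng.permutation(groups)) >= observed for _ in range(n_perm))
    return (hits + 1) / (n_perm + 1)     # add-one correction keeps p in (0, 1]

def between_group_ss(Y, groups):
    """Sum over groups of n_g times squared distance from group mean to grand mean."""
    grand = Y.mean(axis=0)
    return sum(len(Y[groups == g]) * np.sum((Y[groups == g].mean(axis=0) - grand) ** 2)
               for g in np.unique(groups))
```

Because the reference distribution is built by shuffling subjects between groups, the approach needs at least one between-subject factor, which is why purely within-subject designs fall outside its reach.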
The successes of the first two papers led to the decision to create free software to
implement the adjusted degrees of freedom methods. The convenience of the software will
allow putting the methods into immediate practice in data analysis.
5.2 LOOKING FORWARD: FUTURE RESEARCH
5.2.1 Robustness
The approximate permutation tests appeal because of their natural applicability to small
samples. Due to the distribution-free nature of the permutation tests, they would also seem
natural for non-Gaussian samples with missing data. However, the particular permutation
tests which were examined in this work (using likelihood-based estimates and test statistics)
may not be the most effective for non-Gaussian data.
Naturally, finding alternative
permutation tests which can be applied to non-Gaussian data with missing values would
require extensive future research. The results here hold promise for such an approach.
5.2.2 Improving On the Adjusted Tests
A number of ideas seem worth pursuing in order to improve the adjusted tests. The use
of REML estimation defines one direction. Experience indicates that REML estimates tend
to have less bias. For example, for complete data the REML approach provides the unbiased
estimator of the covariance matrix, while the ML estimate is biased. With complete data the
two differ only by a factor of N/(N - r). The relationship is likely more complicated with
missing data. The difference between the two matrices may suggest a degrees of freedom
adjustment.
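For complete data the N/(N - r) relationship is easy to verify numerically. The following is my own illustrative sketch (the design matrix, coefficients, and dimensions are arbitrary simulated quantities), using the standard multivariate regression formulas:

```python
import numpy as np

rng = np.random.default_rng(0)
N, r, p = 20, 3, 4                           # subjects, rank of design, responses
X = np.hstack([np.ones((N, 1)), rng.standard_normal((N, r - 1))])
Y = X @ rng.standard_normal((r, p)) + rng.standard_normal((N, p))

B_hat = np.linalg.solve(X.T @ X, X.T @ Y)    # least squares = ML estimate of B
R = Y - X @ B_hat                            # residual matrix
Sigma_ml = R.T @ R / N                       # ML estimate (biased)
Sigma_reml = R.T @ R / (N - r)               # REML estimate (unbiased)

# With complete data the two differ exactly by the factor N / (N - r):
assert np.allclose(Sigma_reml, Sigma_ml * N / (N - r))
```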
A more systematic approach may emanate from a method of moments attack. Either
exact or approximate moments for parameter estimates or B̂ and Σ̂ analogs seem worth
pursuing. Note the success of Satterthwaite type approximations in mixed models in general
and the Geisser-Greenhouse corrected test in particular.
Little and Rubin (1987, §7.5) expressed the relationship between the observed and
complete information matrices as
observed information = complete information - missing information.
Louis (1982) noted that, with data in hand, one can estimate two of the three terms and then
solve for the third. In turn, the ratio of observed to complete has great appeal as an effective
sample size, in parallel to N* / N or [( N* - r) / (N - r)]. Some promising analytic results for
this approach have been derived for special cases of the GLMM.
5.2.3 Power Approximation
Many authors have studied power approximation for the GLMM. See Muller, LaVange,
Ramey and Ramey (1992) for a review. They also suggested a simple strategy for choosing
sample sizes when one expects missing data, by treating the complete data power as an upper
bound, and the power based on expected complete cases only as a lower bound.
This
assumes that one can use all of the data, with an effective sample size between the two
bounds.
An appealing strategy for approximating power with missing data arises from studying
the distribution of N*. In turn, consider power as the expected value of a set function. Note
that current power approximations do not work as well as desired for some alternative
hypotheses in very small samples. Hence such complete data limitations would be inherited
by missing data methods.
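The bounding strategy of Muller, LaVange, Ramey and Ramey (1992) can be illustrated with a univariate analog. This sketch is a hypothetical example using a one-sample t-test rather than the GLMM: power at the expected number of complete cases gives the lower bound, and power at the full N gives the upper bound.

```python
from scipy import stats

def t_test_power(n, effect_size, alpha=0.05):
    """Power of a two-sided one-sample t-test via the noncentral t distribution."""
    df = n - 1
    ncp = effect_size * n ** 0.5
    crit = stats.t.ppf(1 - alpha / 2, df)
    return stats.nct.sf(crit, df, ncp) + stats.nct.cdf(-crit, df, ncp)

def power_bounds(n_total, expected_complete, effect_size, alpha=0.05):
    """(lower, upper) power bounds: complete cases only vs. full sample."""
    return (t_test_power(expected_complete, effect_size, alpha),
            t_test_power(n_total, effect_size, alpha))
```

The true power with an effective sample size between the two bounds then falls between these values.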
REFERENCES
Barton, C. N. and Cramer, E. C. (1989), "Hypothesis Testing in Multivariate Linear Models
with Randomly Missing Data," Communications in Statistics - Simulation and Computation, 18, 875-895.
Little, R. J. A. and Rubin, D. B. (1987), Statistical Analysis with Missing Data, New York:
John Wiley.
Louis, T. A. (1982), "Finding the Observed Information Matrix when Using the EM
Algorithm," Journal of the Royal Statistical Society, Series B, 44, 226-233.
Muller, K. E., LaVange, L. M., Ramey, S. L. and Ramey, C. T. (1992), "Power Calculations
for General Linear Multivariate Models Including Repeated Measures Applications,"
Journal of the American Statistical Association, 87, 1209-1226.
APPENDIX A
DOCUMENTATION FOR FITML MODULE IN LINMOD 4
A.1 Function
The FITML module creates the same matrices that are created by MAKESS and
FITMODEL. The primary difference is that when data are missing, FITML uses all of the
available data, where other modules use only the complete cases. The FITML module can
also be used in place of the calls to the following set of modules:
&PROCSSCP,
GETCORSS and FITMODEL.
A.2 Algorithm
The EM algorithm is used to obtain the maximum likelihood estimates of the model
parameters, B, and the covariance matrix, Σ. These estimates, along with an estimate of
(X'X)^{-1}, are used to create the uncorrected sums of squares and cross products matrix,
_SS_. Next, the estimates of the parameters of a GLMM are computed (module FITMODEL).
The steps must occur in this order: for example, a secondary hypothesis cannot be tested
without first fitting a model, nor can a model be fitted without first creating the proper sums
of squares and cross products matrix. The final step, 4), performs tests of a general linear
hypothesis (module TESTGLH). When data are missing, steps 2 and 3 are combined into
one module called FITML. Module TESTGLH computes multivariate test statistics.
A.3 User Inputs
In addition to the inputs that are required for the MAKESS module (Z and ZNAMES),
the number of dependent variables must also be specified (P).
A.3.1 Z, the data matrix containing both the X's and Y's, with columns as variables.
A.3.2 ZNAMES, a character matrix of one row of names for the columns of Z must be
defined. It must conform to Z.
A.3.3 P, a scalar indicating the number of dependent variables.
A.4 System Input Matrices
The following system matrices are required by this module:
_OPT_ - the option status matrix.
_ECODE_ - the error condition matrix.
A.5 Options and Defaults
The option names used to change the type of degree of freedom adjustment used are the
following:
NLIST, NJJP_MIN, NJJ_MIN, NJJP_H, NJJP_G, NJJP_A, NJJ_H, NJJ_G,
NJJ_A, NJJ_MAX, NTOTAL, NBEST.
See the table below for definitions.
Let N_jj' indicate the number of observations for which both Y_ij and Y_ij', for
i ∈ {1, ..., N}, have non-missing values. Note that N_jj equals the number of cases observed
for the jth response. All adjustments involve replacing N by N* in ν_E = N - rank(X). In
all cases N* equals a function only of {N_jj'}. The default value is NBEST, which uses the
preferred adjustment for the test statistic and hypothesis at hand.
All of the options that are available for
MAKESS and FITMODEL work with FITML. See the documentation for those modules for
specific details. See Chapter 4 for an example program.
Sample Size Adjustments for Error Degrees of Freedom

    Name        Function of {N_jj'}
    --------    -----------------------------
    NLIST       number of complete cases
    NJJP_MIN    min({N_jj'})
    NJJ_MIN     min({N_jj})
    NJJP_H      harmonic mean({N_jj'})
    NJJP_G      geometric mean({N_jj'})
    NJJP_A      arithmetic mean({N_jj'})
    NJJ_H       harmonic mean({N_jj})
    NJJ_G       geometric mean({N_jj})
    NJJ_A       arithmetic mean({N_jj})
    NJJ_MAX     max({N_jj})
    NTOTAL      N
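The adjustments in the table can be expressed compactly as functions of the matrix of pairwise counts. The following is a hypothetical Python sketch (not the actual IML module), taking a p x p matrix with counts[j, j'] = N_jj' and counts[j, j] = N_jj:

```python
import numpy as np

def adjustment_options(counts, n_complete, n_total):
    """Map each option name to its N* value, given the pairwise count matrix."""
    n_jj = np.diag(counts).astype(float)            # per-response counts N_jj
    iu = np.triu_indices(counts.shape[0], k=1)
    pairs = counts[iu].astype(float)                # distinct N_jj', j < j'
    hmean = lambda x: len(x) / np.sum(1.0 / x)      # harmonic mean
    gmean = lambda x: np.exp(np.mean(np.log(x)))    # geometric mean
    return {
        "NLIST": n_complete,
        "NJJP_MIN": pairs.min(), "NJJ_MIN": n_jj.min(),
        "NJJP_H": hmean(pairs),  "NJJP_G": gmean(pairs), "NJJP_A": pairs.mean(),
        "NJJ_H": hmean(n_jj),    "NJJ_G": gmean(n_jj),   "NJJ_A": n_jj.mean(),
        "NJJ_MAX": n_jj.max(),   "NTOTAL": n_total,
    }
```

With complete data every option reduces to N, so all of the adjusted tests coincide with the usual complete-data tests.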