UNC Biostatistics Institute of Statistics Mimeo Series No. 2192T

INFERENCE FOR THE GENERAL LINEAR MULTIVARIATE MODEL WITH MISSING DATA IN SMALL SAMPLES

by Diane J. Catellier

Department of Biostatistics
University of North Carolina

May 1998

Inference for the General Linear Multivariate Model with Missing Data in Small Samples

Diane J. Catellier

A dissertation submitted to the faculty of the University of North Carolina at Chapel Hill in partial fulfillment of the requirements for the degree of Doctor of Public Health in the Department of Biostatistics, School of Public Health.

Chapel Hill 1998

ABSTRACT

DIANE J. CATELLIER. Inference for the General Linear Multivariate Model with Missing Data in Small Samples. (Under the direction of Dr. Keith E. Muller)

The General Linear Multivariate Model (GLMM) provides a convenient and statistically well-behaved framework for estimation and testing with complete, multivariate Gaussian data. The loss of any data in this setting substantially complicates matters. For the case in which data are missing at random (MAR), reliable methods exist for estimation. In contrast, little work has been done in the area of inference, especially in small samples. I describe two new strategies for hypothesis testing for the GLMM in small samples with MAR data. Both strategies use the EM algorithm for estimation, and focus on commonly used test statistics: the Hotelling-Lawley trace (U), the Pillai-Bartlett trace (V), Wilks' Lambda (W), and the Geisser-Greenhouse corrected univariate test (GG). The first approach for providing accurate inference involves adjusting the sample size (N) in the degrees of freedom for the F tests to reflect the actual amount of observed data.
Eleven sample size adjustments were examined for each test. Simulations suggest that the preferred adjustment varies with the test statistic. The adjustment which works best for GG is based on the mean number of non-missing responses. V requires a stronger adjustment based on the harmonic mean number of non-missing pairs of responses. W and U work best with an even more aggressive adjustment based on the minimum number of non-missing pairs. The adjusted tests appear to control test size at or below the nominal rate with as few as 12 observations and up to 10% missing data.

The second approach determines significance for the test statistics using permutation-based methods. A Monte Carlo approximation to the permutation test is carried out by tabulating F statistic values for the observed sample and a random sample of possible data permutations. The p value is computed as the proportion of the data permutations with F values that exceed the F for the original sample. Simulation results indicated that empirical test sizes for the approximate permutation tests did not differ significantly from the target rate, and empirical power was always equal to or greater than that for the adjusted F tests. The versatility of permutation methods is limited, however, to a narrow range of hypotheses.

ACKNOWLEDGEMENTS

I gratefully acknowledge my dissertation advisor, Dr. Keith E. Muller, for his encouragement, guidance, and support, and for the numerous hours he contributed to this research. I will probably continue to write comma-spliced sentences, but they ought to be fewer. I'd also like to thank my committee members, Dr. Paul Stewart, Dr. James Hosking, Dr. Gary Koch, and Dr. David Weber, for their input, particularly in the early stages of this research. My doctoral program was financially supported by research assistantships with the Collaborative Studies Coordinating Center (CSCC) and the Biometric Consulting Laboratory.
I'd like to thank both for contributing to my training. The University of Alberta Hospitals provided me with opportunities to be part of cutting-edge cardiovascular research, and for this I am extremely grateful. I want to particularly thank Dr. Koon K. Teo from the University of Alberta for his mentorship over my 8 years of graduate work. The love and encouragement exhibited by my parents, siblings, in-laws, nephews and friends was invaluable. They continually reminded me of the things that matter most in life. Finally, I express my deep appreciation to my partner for her love and patience, particularly during the stressful times.

TABLE OF CONTENTS

Page

LIST OF TABLES ...... vii
LIST OF FIGURES ...... viii
1 INTRODUCTION AND LITERATURE REVIEW ...... 1
1.1 Introduction ...... 1
1.2 Background for Paper #1 (Tests for Gaussian Repeated Measures with Missing Data in Small Samples) ...... 2
1.2.1 Complete Case ...... 2
1.2.2 Missing Data Case ...... 6
1.2.3 Mixed Model ...... 7
1.3 Background for Paper #2 (Comparison of Approximate Permutation Tests with Parametric Tests in MANOVA with Missing Data) ...... 9
1.3.1 Motivation ...... 9
1.3.2 Theory of Permutation Tests ...... 9
1.3.3 Literature Review Related to Permutation Tests for MANOVA ...... 11
1.3.4 Exact versus Approximate Permutation Tests ...... 12
1.3.5 Statement of the Problem ...... 14
1.4 Background for Paper #3 (LINMOD 4.0: Software for the General Linear Multivariate Model, Allowing Missing Data) ...... 14
2 TESTS FOR GAUSSIAN REPEATED MEASURES WITH MISSING DATA IN SMALL SAMPLES ...... 17
2.1 Introduction ...... 19
2.1.1 Motivation ...... 19
2.1.2 General Strategies for the Analysis of Repeated Measures Designs ...... 20
2.2 Known Methods for Estimation and Inference ...... 21
2.2.1 Complete Data ...... 21
2.2.2 Data Missing at Random ...... 24
2.3 New Tests for Data Missing at Random ...... 27
2.4 Numerical Evaluations ...... 28
2.4.1 Methods ...... 28
2.4.2 Results ...... 29
2.5 Conclusions ...... 35
3 COMPARISON OF APPROXIMATE PERMUTATION TESTS WITH PARAMETRIC TESTS FOR MANOVA WITH MISSING DATA ...... 40
3.1 Introduction ...... 41
3.1.1 Motivation ...... 41
3.1.2 Relevant Literature ...... 43
3.2 Statement of the Problem ...... 46
3.3 New Methods ...... 47
3.3.1 The Basic Method ...... 47
3.3.2 Exact versus Approximate Permutation Tests ...... 48
3.4 Numerical Evaluations ...... 49
3.4.1 Purpose of Studies and Overall Design ...... 49
3.4.2 Methods ...... 50
3.4.3 Results ...... 50
3.5 An Example: The Effect of Choline Deficiency on Humans ...... 53
3.6 Conclusions ...... 55
4 LINMOD 4: A PROGRAM FOR GENERAL LINEAR MULTIVARIATE MODELS WITH MISSING DATA ...... 59
4.1 Motivation ...... 60
4.2 A New Approach ...... 62
4.2.1 Overview ...... 62
4.2.2 Comparison to Other Programs ...... 62
4.2.3 History of the Program ...... 64
4.3 LINMOD 4 Program ...... 64
4.4 Recommended Option Settings ...... 65
4.5 Example ...... 65
5 CONCLUSIONS AND FUTURE RESEARCH ...... 73
5.1 Looking Backwards: Successes From This Research ...... 73
5.2 Looking Forward: Future Research ...... 75
5.2.1 Robustness ...... 75
5.2.2 Improving on the Adjusted Tests ...... 75
5.2.3 Power Approximation ...... 76
APPENDIX A: DOCUMENTATION FOR FITML MODULE IN LINMOD 4 ...... 77

LIST OF TABLES

2.1 Sample Size Adjustments for Error Degrees of Freedom ...... 28
2.2 Test Size for Mixed Model F (5000 Replications, ±0.006) ...... 30
2.3 Test Size for GLMM F Tests (0% Missing, 5000 Replications, ±0.006) ...... 31
2.4 Adjusted Degree of Freedom Test Size for F_W (5%, 10% Missing, 5000 Replications, ±0.006) ...... 32
2.5 Adjusted Degree of Freedom Test Size for F_U (5%, 10% Missing, 5000 Replications, ±0.006) ...... 32
2.6 Adjusted Degree of Freedom Test Size for F_V (5%, 10% Missing, 5000 Replications, ±0.006) ...... 33
2.7 Adjusted Degree of Freedom Test Size for F_GG (5%, 10% Missing, 5000 Replications, ±0.006) ...... 34
2.8 Adjusted Degree of Freedom Test Size for Multivariate F Test When s = min(a, b) = 1 (5%, 10% Missing, 5000 Replications, ±0.006) ...... 34
3.1 Approximate Permutation Test Size for F Tests (0% Missing, 1000 Replications, ±0.014) ...... 51
3.2 Approximate Permutation Test Size for F Tests (5%, 10% Missing, 1000 Replications, ±0.014) ...... 51
3.3 Approximate Permutation Power for F Tests (0% Missing, 1000 Replications, ±0.031) ...... 52
3.4 Approximate Permutation Power for F Tests (5%, 10% Missing, 1000 Replications, ±0.031) ...... 53
3.5 Choline Measurements Over 5-Week Period in Male Subjects ...... 55
4.1 Choline Measurements Over 5-Week Period in Male Subjects ...... 66

LIST OF FIGURES

4.1 LINMOD Programming Statements for Choline Example ...... 67
4.2 LINMOD Output for Choline Example ...... 67

Chapter 1
INTRODUCTION AND LITERATURE REVIEW

1.1 Introduction

In repeated measures designs, the experimental unit is typically a human or animal subject. Each subject is measured under several treatments or at different points in time. Clinical trials research and laboratory studies routinely use repeated measures designs, as this design allows the investigator to assess and describe the change both within and between subjects as time and/or experimental conditions change. A variety of analysis strategies have been proposed to evaluate the effects of treatments and covariates on the pattern of responses. Linear models, in particular, are useful in situations in which the responses are approximately Gaussian and can be explained by some linear function of predictors. When each subject is observed at the same p times and no observations are missing, closed-form maximum likelihood (ML) estimates of the model parameters are often available.
More typically, the response is not observed at all time points for every subject. An extensive literature exists for dealing with missing data. Despite the advances made over the last 30 years in the analysis of data with missing values, weaknesses remain, particularly in the area of hypothesis testing in small, incomplete samples. Studies with small samples commonly arise when the experimental units are difficult or expensive to obtain. Simulation studies by Barton and Cramer (1989) and Schluchter and Elashoff (1990) showed that various asymptotic test statistics for hypotheses about fixed effects yield inflated type I error rates in small samples. The main objective of this research is to develop missing-data hypothesis testing methods for repeated measures designs that will yield accurate type I error rates and maintain adequate power in small, incomplete data situations. The class of designs of interest in this paper may be characterized as involving 1) small samples, 2) continuous repeated measurements data, and 3) missing data that are assumed to be "missing at random" or MAR (Rubin, 1976). In this setting, factors or covariates can be classified as within-subject or between-subject factors. Within-subject factors represent variables, such as time, or experimental conditions which vary within subjects. Between-subject factors represent baseline characteristics or covariates which do not vary over time. This research will center on the basic repeated measures design, with one within-subject factor and one between-subject factor. This dissertation contains three separate papers (Chapters 2, 3, and 4) intended for publication, and as such, each contains its own literature review. In order to avoid repetition, I will only provide a brief overview of the literature in this introductory chapter.
1.2 BACKGROUND FOR PAPER #1 (TESTS FOR GAUSSIAN REPEATED MEASURES WITH MISSING DATA IN SMALL SAMPLES)

This paper focuses on likelihood-based theory for the statistical analysis of repeated measures data with missing values. I begin by briefly discussing practical approaches to missing data problems in multivariate analysis. The methods are based on three standard statistical models: the general linear univariate and multivariate models and the general linear mixed model. Before discussing the univariate and multivariate approaches to hypothesis testing with missing data, an understanding of the models for complete data is required. The mixed model, which inherently allows for the possibility of missing data, will be discussed last.

1.2.1 Complete Case

The General Linear Multivariate Model (GLMM) subsumes a wide range of models for multivariate Gaussian data. For the purposes of data analysis, the most important special cases include Multivariate Analysis of Variance and Covariance (MANOVA, MANCOVA), and the multivariate and univariate approaches to repeated measures ANOVA. Furthermore, some forms of discriminant analysis and canonical correlation also represent special cases. Suppose that the responses for subject i (i ∈ {1, ..., N}) are measured at p times (the same for all subjects). The GLMM is

E(Y) = XB,   (1.2.1.1)

where E denotes the expectation operator and Y (N × p), X (N × q, fixed and known, conditional upon having chosen the subjects) and B (q × p, fixed and unknown) denote the matrix of observations, design matrix, and parameter matrix, respectively. Indicate the ith row of X as X_i = row_i(X). Assuming that row_i(Y) is N_p(X_iB, Σ), we can test the general linear hypothesis (GLH)

H_0: Θ = CBU = Θ_0.   (1.2.1.2)

Each row of the matrix C (a × q) defines a between-subject contrast and each column of the matrix U (p × b) defines a within-subject contrast. Define ν_E = N − rank(X).
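To fix ideas, the design quantities just defined can be sketched numerically. The layout below (a hypothetical two-group, three-occasion design of my own, not an example from the text) shows X, C, and U, and computes ν_E = N − rank(X):

```python
import numpy as np

# Hypothetical layout (illustration only): N = 6 subjects in two groups of 3,
# each measured at p = 3 times; X has q = 2 cell-mean columns.
N, p = 6, 3
X = np.kron(np.eye(2), np.ones((3, 1)))   # N x q cell-mean design matrix
C = np.array([[1.0, -1.0]])               # a = 1 between-subject contrast row
U = np.array([[ 1.0,  0.0],
              [-1.0,  1.0],
              [ 0.0, -1.0]])              # p x b within-subject difference contrasts

nu_E = N - np.linalg.matrix_rank(X)       # error degrees of freedom
print(nu_E)                               # prints 4
```

Here rank(X) = 2, so ν_E = 6 − 2 = 4; with missing data, the adjustments discussed later replace N in this formula.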
All tests of H_0 are based on

B̂ = (X'X)⁻X'Y,   (1.2.1.3)

Θ̂ = CB̂U,   (1.2.1.4)

Ĥ = (Θ̂ − Θ_0)'[C(X'X)⁻C']⁻¹(Θ̂ − Θ_0),   (1.2.1.5)

and

Ê = U'(Y − XB̂)'(Y − XB̂)U.   (1.2.1.6)

Ĥ and Ê are the "matrices of sums of squares and cross-products" for hypothesis and error. Both have Wishart distributions, with Ĥ ~ W(a, Σ*, Ω) and Ê ~ W(ν_E, Σ*), where

Ω = (Σ*)⁻¹(Θ − Θ_0)'[C(X'X)⁻C']⁻¹(Θ − Θ_0)   (1.2.1.7)

is the noncentrality parameter matrix. Let s = min(a, b) indicate the rank of Ĥ and hence of ĤÊ⁻¹. Let s* indicate the rank of Ω, with s* ≤ s. Under the assumption that row_i(Y) is N_p(X_iB, Σ), for i = 1, ..., N, three common multivariate statistics for testing H_0 are:

Wilks' Lambda:  W = |Ê(Ĥ + Ê)⁻¹| = ∏_{i=1}^{s} 1/(1 + ζ_i),

Pillai-Bartlett trace:  V = tr[Ĥ(Ĥ + Ê)⁻¹] = Σ_{i=1}^{s} ζ_i/(1 + ζ_i),

Hotelling-Lawley trace:  U = tr(ĤÊ⁻¹) = Σ_{i=1}^{s} ζ_i,

where ζ_i denotes the eigenvalues of ĤÊ⁻¹. The exact distributions of U and V are known only when s = min(a, b) = 1, and that of W only when s ≤ 2. For general values of a and b, it is necessary to use an asymptotic approximation of the distributions. Rao (1973) derived the following F approximation to W:

F_W = [(1 − W^{1/t})/ν₁(W)] / [W^{1/t}/ν₂(W)],   (1.2.1.8)

with t = [(a²b² − 4)/(a² + b² − 5)]^{1/2} if (a² + b² − 5) > 0, and t = 1 otherwise. Under H_0, F_W is approximately distributed as a central F with ν₁(W) = ab and ν₂(W) = t[ν_E − (b − a + 1)/2] − (ab − 2)/2 numerator and denominator degrees of freedom, respectively. The F approximation for the Pillai-Bartlett trace criterion obtained by Muller (1998) is

F_V = [V/(κ·ab)] / [(s − V)/(κ·s(ν_E + s − b))],   (1.2.1.9)

which is approximately F_{ν₁,ν₂} under H_0, where ν₁(V) = κab, ν₂(V) = κ·s(ν_E + s − b), and

κ = [s(ν_E + s − b)(ν_E + a + 2)(ν_E + a − 1)] / [s(ν_E + a)·ν_E(ν_E + a − b)].

McKeon (1974) provided a slightly better F approximation than the more widely used Pillai-Sampson approximation for the Hotelling-Lawley statistic. Write the McKeon approximation as

F_U = [(U/h)/(ab)] / [1/ν₂(U)],   (1.2.1.10)

with ν₁(U) = ab, ν₂(U) = 4 + (ab + 2)g,

g = [ν_E² − ν_E(2b + 3) + b(b + 3)] / [ν_E(a + b + 1) − (a + 2b + b² − 1)],   (1.2.1.11)

and

h = [ν₂(U) − 2] / (ν_E − b − 1).   (1.2.1.12)

The multivariate approach described above allows the p repeated measures to be correlated in any pattern, since Σ is completely general. The univariate analysis of variance procedure assumes that each row of Y is independent with a p-variate multivariate normal distribution, having covariance matrix Σ such that Σ* = U'ΣU = σ²I (Huynh and Feldt, 1970). This condition is sometimes referred to as sphericity. The univariate approach to the analysis of repeated measures designs refers to an analysis procedure whereby the univariate F test is adjusted for the amount of departure from sphericity. The traditional univariate F statistic for testing H_0 can be computed from the Ĥ and Ê matrices according to the following equation:

F_u = [tr(Ĥ)/(ab)] / [tr(Ê)/(b·ν_E)].   (1.2.1.13)

Box showed that when sphericity is not met, under the null hypothesis F_u is approximately distributed as a central F with ab·ε and (b·ν_E)·ε degrees of freedom, where the value of ε is defined as

ε = (Σ_{k=1}^{b} λ_k)² / (b·Σ_{k=1}^{b} λ_k²).   (1.2.1.14)

The λ_k (k = 1, 2, ..., b) in this equation are the eigenvalues of Σ*. Since Σ*, and thus ε, is usually unknown, ε must be estimated from the sample covariance matrix. The univariate approach with the Geisser-Greenhouse correction (Greenhouse and Geisser, 1959) uses the maximum likelihood estimate of ε,

ε̂ = (Σ_{k=1}^{b} λ̂_k)² / (b·Σ_{k=1}^{b} λ̂_k²),   (1.2.1.15)

with λ̂_k the eigenvalues of the estimated Σ*.

1.2.2 Missing Data Case

When there are missing values among the responses, there exist some well-behaved ML and REML estimation methods. The computational efficiency and simplicity of the EM algorithm (Dempster, Laird and Rubin, 1977) make it an obvious choice for estimation in the present setting. With respect to inference, a distinguishing characteristic of each of the univariate and multivariate test statistics is that they involve the error matrix Ê, which has error degrees of freedom ν_E = N − rank(X).
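As a numerical check of the complete-data definitions in Section 1.2.1, the following sketch (hypothetical data; variable names and the one-contrast design are my own, not the dissertation's) computes Ĥ, Ê, the three multivariate statistics, and the Geisser-Greenhouse ε̂, along with the ν_E that the missing-data adjustments modify:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 12
X = np.kron(np.eye(2), np.ones((6, 1)))               # two groups of 6 (cell means)
C = np.array([[1.0, -1.0]])                           # a = 1
U = np.array([[1.0, 0.0], [-1.0, 1.0], [0.0, -1.0]])  # b = 2
Y = rng.normal(size=(N, 3))                           # null Gaussian data

Bhat = np.linalg.pinv(X) @ Y                          # (X'X)^- X'Y
Theta = C @ Bhat @ U                                  # estimated Theta (Theta0 = 0)
M = C @ np.linalg.pinv(X.T @ X) @ C.T                 # C (X'X)^- C'
H = Theta.T @ np.linalg.solve(M, Theta)               # hypothesis SSCP
resid = Y - X @ Bhat
E = U.T @ resid.T @ resid @ U                         # error SSCP

zeta = np.linalg.eigvals(np.linalg.solve(E, H)).real  # eigenvalues of H E^{-1}
W = np.prod(1.0 / (1.0 + zeta))                       # Wilks' Lambda
V = np.sum(zeta / (1.0 + zeta))                       # Pillai-Bartlett trace
Uhlt = np.sum(zeta)                                   # Hotelling-Lawley trace

nu_E = N - np.linalg.matrix_rank(X)                   # 10
lam = np.linalg.eigvalsh(E / nu_E)                    # eigenvalues of estimated Sigma*
eps_hat = lam.sum() ** 2 / (len(lam) * (lam ** 2).sum())
```

With s = min(a, b) = 1 here, only one ζ_i is nonzero, so W = 1/(1 + U) and V = U/(1 + U), a useful consistency check; ε̂ always lies between 1/b and 1.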
Barton and Cramer (1989) suggested adjusting ν_E in the missing data setting to reflect the amount of incompleteness in the data. They compared the performance of Rao's F approximation to W with various adjustments to ν_E. Among the possible alternatives, approximating N with N* = N̄_jj' (the average number of non-missing pairs of responses) was shown to provide the most stable and accurate test sizes for N as small as 40 and up to 20% missing data. The Barton and Cramer approach can be used to produce missing data analogs for the three multivariate test statistics (W, V, U) as well as for the univariate test statistic with a Geisser-Greenhouse correction, by simply replacing ν_E with ν_E* = N* − rank(X). This paper examines the behavior of this hypothesis testing method at very small sample sizes (i.e., 12, 24). In addition to examining the performance of N* = N̄_jj', we also considered the minimum N_jj', and the geometric and harmonic mean N_jj', as candidates for N*. The empirical test sizes for N* = N (the complete data sample size) are provided for comparison purposes since this is often what is used in practice.

1.2.3 Mixed Model

Mixed models have long been used for the analysis of continuous data, especially in ANOVA and MANOVA settings. By virtue of modeling the subject as a random component, the mixed model can encompass a broad range of covariance structures. In many cases, the analysis results from the univariate and multivariate approaches can be obtained from a mixed model analysis. The general mixed model may be written as

y_i = X_iB + Z_ib_i + e_i,   (1.3.2)

with y_i a p_i × 1 vector of measurements from the ith subject (i = 1, ..., N) and e_i a p_i × 1 random vector of within-subject error terms. Here X_i (p_i × q) and Z_i (p_i × m) are fixed and random effects design matrices, respectively, for the ith subject, and B (q × 1) and b_i (m × 1) are the unknown fixed and random effects parameters.
The model assumptions include

b_i ~ N_m(0, Δ),  e_i ~ N_{p_i}(0, σ²V_i),  with b_i and e_i independent.   (1.3.3)

Therefore, the expectation and variance for y_i and the entire data vector, y, are

E(y_i) = X_iB,   (1.3.4)

V(y_i) = Σ_i = Z_iΔZ_i' + σ²V_i,   (1.3.5)

E(y) = XB,   (1.3.6)

and

V(y) = Σ = Dg(Σ_1, Σ_2, ..., Σ_N).   (1.3.7)

Here Dg(Σ_1, Σ_2, ..., Σ_N) denotes a block diagonal matrix with Σ_i, i = 1, 2, ..., N, on the diagonal and zeros elsewhere. V_i is a known covariance structure for within-subject variation; σ² is an unknown within-subject variance parameter. An important special case is the random-effects model, with V_i assumed to be an identity matrix, I_i. Since repeated measures models are a special case of random-effects models, I will assume V_i = I_i throughout this paper. Because the mixed model does not require every subject to have the same number of responses, it is a natural choice for analyzing data from repeated measures designs with missing data. The multivariate model can be transformed to fit the mixed model framework by stacking the columns of Y into an Np × 1 vector and creating a new design matrix equal to (I_p ⊗ X), where ⊗ is the left Kronecker product. Rows with any missing responses are then deleted. Note that there is no random effects matrix in this specification of the mixed model. In this paper, a series of simulations is used to compare the Barton and Cramer approach to repeated measures analysis with missing data (described in Section 1.2.2) and the mixed model analysis. Samples of size 12 and 24 are drawn from multivariate normal distributions and missing data are introduced at random. The empirical rejection rates are calculated under various conditions (low/moderate/high correlation, equal/unequal variances) for tests with a nominal significance level of 0.05. The number of response variables considered is 3 and 6.

1.3 BACKGROUND FOR PAPER #2 (COMPARISON OF APPROXIMATE PERMUTATION TESTS WITH PARAMETRIC TESTS IN MANOVA WITH MISSING DATA)

1.3.1 Motivation

The motivation for this paper parallels that of Paper #1.
The problem of interest is briefly summarized as follows. A number of convenient and statistically well-behaved estimation and inference methods exist for complete, multivariate normal data. The presence of missing response values, however, complicates matters. Researchers have primarily focused on the problem of estimation, and have succeeded in producing accurate and efficient estimation procedures in the presence of missing data. The EM algorithm (Orchard and Woodbury, 1972; Dempster, Laird and Rubin, 1977) is an example of one such procedure, which appears to work well in both large and small samples for multivariate analysis of variance (MANOVA) models with missing data. In the same setting, inference proves much more difficult. For many parametric tests it is often the case that the distribution of the test statistic is only known asymptotically. If the sample size is large and the proportion of data values that are missing is small, these parametric tests may provide valid significance levels. However, in small samples, they can be extremely inexact (see, for example, Barton and Cramer, 1989; Schluchter and Elashoff, 1990). For both large and small samples, analyses based on the complete cases only, while accurate, can be extremely inefficient. In Paper #1, we looked at ways of adjusting the sample size, N, in the error degrees of freedom equation ν_E = N − rank(X) of approximate F tests to reflect the amount of incompleteness in the data as a means of obtaining accurate p values. In this paper we consider a distribution-free approach based upon the theory of permutation tests as an alternative method for missing data inference in small multivariate Gaussian samples.

1.3.2 Theory of Permutation Tests

The logic of Fisher's permutation procedure, proposed in 1935, is that under the null hypothesis of no population differences, observations are equally likely to be associated with any of the populations.
If all assignments of observations to populations are equally likely under the null hypothesis, then a null distribution for any test statistic can be generated by calculating the value of that test statistic for each data permutation (including the obtained data permutation). The exact significance level is simply the proportion of the data permutations with test statistic values that exceed that which was originally obtained. Fisher's rationale for the validity of the permutation procedure relied upon the assumption that the data were randomly sampled from larger populations for which statistical inferences were intended. Pitman (1937a, 1937b, 1938) demonstrated that the permutation principle can be justified on the basis of random assignment of subjects to treatments alone, without regard to how the sample was collected. His work provided the theoretical basis for the application of permutation methods to the randomized experiment. Randomization tests therefore refer to the use of permutation procedures in the context of random assignment. In this work, I shall use the two names ("randomization" and "permutation") interchangeably. Four factors make the randomization test procedure a promising technique for determining significance in a small sample MANOVA setting. First and foremost, given random assignment of observations to treatment groups, this procedure is guaranteed to provide valid p values under the null hypothesis, even in small samples. Second, the validity of permutation tests is independent of the type of test statistic chosen. This flexibility permits us to choose a test statistic that is most sensitive to departures from the null hypothesis of interest. Third, the chosen test statistic becomes distribution-free, and thus more robust, when significance is determined by a permutation procedure.
The final reason for considering permutation test procedures is their simplicity relative to attempting to develop a new statistical test to handle the compound problem of small samples and missing data.

1.3.3 Literature Review Related to Permutation Tests for MANOVA

Much of the theoretical work on permutation tests for MANOVA has dealt with methods based on ranks. The asymptotic power properties and relative efficiencies (with respect to their parametric competitors) of various rank permutation tests for MANOVA in the complete data setting are presented in detail in Chapter 5 of Puri and Sen (1993). Servy and Sen (1987) extended the theory of rank permutation tests for one-way MANOVA and multivariate analysis of covariance (MANCOVA) models to allow for missing data. Their approach involves replacing the observed variables with their ranks or other scores, excluding the missing values, and imputing the missing data with the mean of the scores of the non-missing values of the variable it belongs to. Further extensions of this approach to two-way MANOVA designs are due to Jerdack and Sen (1990). Permutation tests can also be based on test statistics which are explicit functions of the actual values of the sample observations. Puri and Sen (1993, p. 76) refer to these tests as component randomization tests. Wald and Wolfowitz (1944) proposed a randomization test of whether two samples have come from identical multivariate normal populations (against the alternative that there is a difference of "location" for some variables) based on the permutation distribution of a modified Hotelling's T² statistic. Chung and Fraser (1958) derived an alternative randomization test for the multivariate two-sample problem for the case where the number of variables k is large (i.e., k + 2 > N). Their approach was to take a statistic suitable for the univariate case, apply it to each of the k variables and add the resulting expressions.
Unlike Hotelling's T² statistic, a test statistic constructed in this manner does not take account of the covariance of the responses. A randomization test is then obtained by evaluating this test statistic for each possible reordering of the data. Friedman and Rafsky (1979) provide a k-sample generalization of the Wald-Wolfowitz two-sample test. The method involves constructing the minimum spanning tree for the combined sample, then generating the permutation distribution of the runs statistic through a series of random relabelings of the individual data points. The use of randomization tests based on the conventional multivariate test statistics (the Pillai-Bartlett Trace, Wilks' Lambda, Hotelling-Lawley Trace) has not been studied. A weakness of component randomization tests is that the permutation distribution must be computed for every new set of data. The permutation distribution of a test statistic based on ranks, however, is invariant to changes in the actual values of the observations, and thus a table with the "standardized" null distribution of the test statistic (for a particular sample space) can be used repeatedly to determine significance for any new sample. One of the strengths of component randomization tests lies in their superior power relative to the rank randomization tests. For larger samples the discrepancy in power may not be substantial, and permutation tests based on statistics of either type would be appropriate. In the case of small, possibly inadequately powered data sets, on the other hand, the component randomization tests are preferred since any potential loss in power should be avoided.

1.3.4 Exact versus Approximate Permutation Tests

All permutation-based procedures for inference involve four steps. These may be loosely described as follows. 1) Choose a test statistic, S. 2) Compute S for the original set of observations.
3) Obtain the permutation distribution of S by either enumerating all possible reassignments of observations to treatment groups, or by choosing a random sample of reassignments (keeping the number of observations in each group constant), and recalculating the test statistic for each reassignment. 4) Compute the p value as the proportion of test statistic values greater than or equal to the value corresponding to the originally obtained data. Exact permutation tests involve tabulating test statistic values for the complete set of data permutations, including the observed one. In all, there are M = N!/∏_{g=1}^{q} N_g! ways in which the N response vectors can be assigned to the q treatment groups. As the number of observations increases, the number of values of the test statistic that need to be calculated to obtain the exact p value increases very rapidly. Pitman (1937a, 1937b, 1938) approached this problem by deriving the first four moments of the exact permutation distribution of a test statistic and showing that they converge to the moments of some well-known distribution (χ², normal, etc.) as N increases. Wald and Wolfowitz (1944) improved on this by deriving a general theorem on the limiting distributions of a class of test statistics called linear permutation statistics (see Puri and Sen, 1993, p. 73). As an application of the theorem, they derived the limiting distribution of a test statistic, T′², which is a simple monotonic function of Hotelling's T². The statistic T′² would be appropriate for testing the null hypothesis that two distributions Π₁ and Π₂ are identical, assuming homogeneity of the covariance matrices (i.e., restricting the alternative to the case where Π₁ and Π₂ differ only in the mean values).
For the case in which two variables are measured on each sampling unit, and each pair is jointly bivariate normally distributed, the permutation distribution for T′² was shown to approach the χ² distribution with two degrees of freedom as N → ∞. In small samples, one can neither take advantage of the asymptotic approximations for the exact permutation distributions of the multivariate test statistics, nor is it computationally feasible to obtain the complete permutation distributions for these statistics. For example, the total number of data permutations for an experiment with N = 12 subjects randomly assigned to 4 treatments is 369,600. Alternatively, we can significantly reduce the amount of time required to determine significance by drawing a random sample of the data rearrangements, without replacement, to produce a close approximation to the exact p value. Such methods are called "approximate" permutation tests. The closer the significance level is to 5%, the more permutations will be needed. The bootstrap method, formulated by Efron (1979), is similar to the permutation method in that it estimates the distribution of the test statistic using the observed data. The main difference is that sampling is carried out without replacement using the permutation approach and with replacement for the bootstrap procedure. Software for computing both exact and approximate permutation tests is available without charge on the internet. Currently, none of the programs are able to perform permutation tests for the multivariate Gaussian k (≥ 2) sample problem. Although it is possible to apply the principles of the widely used "shift" algorithm described in Edgington (1995, pp. 393-398) to the MANOVA setting, such a program has not been produced. A simpler, but less efficient, procedure for determining approximate permutation p values for MANOVA is to choose a Monte Carlo sample of random permutations.
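The Monte Carlo approach just described can be sketched as follows. The data and the simple between-group trace statistic are my own illustration (any multivariate statistic could be substituted); the assignment count M = N!/∏ N_g! reproduces the 369,600 figure for 12 subjects in 4 treatments:

```python
import numpy as np
from math import factorial, prod

def n_permutations(group_sizes):
    # Number of distinct assignments of the N response vectors to groups
    # of fixed sizes: N! / prod(N_g!)
    return factorial(sum(group_sizes)) // prod(factorial(n) for n in group_sizes)

def between_group_trace(Y, labels):
    # Simple component statistic: weighted between-group sum of squares,
    # summed over the response variables (ignores response covariance).
    grand = Y.mean(axis=0)
    return sum((labels == g).sum() * ((Y[labels == g].mean(axis=0) - grand) ** 2).sum()
               for g in np.unique(labels))

def approx_perm_pvalue(Y, labels, stat, n_perm=999, seed=2):
    # Monte Carlo approximation: random relabelings, counting the observed
    # arrangement itself, so the p value can never be exactly zero.
    rng = np.random.default_rng(seed)
    s_obs = stat(Y, labels)
    exceed = sum(stat(Y, rng.permutation(labels)) >= s_obs for _ in range(n_perm))
    return (exceed + 1) / (n_perm + 1)

print(n_permutations([3, 3, 3, 3]))            # prints 369600
labels = np.repeat([0, 1, 2, 3], 3)            # 12 subjects in 4 groups of 3
Y = np.random.default_rng(1).normal(size=(12, 3))
p = approx_perm_pvalue(Y, labels, between_group_trace)
```

Counting the observed arrangement in both numerator and denominator keeps the approximate test valid at level α in the sense of the exact test it approximates.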
1.3.5 Statement of the Problem

In this paper, the impact of missing data on test size and power of approximate permutation tests for three conventionally used multivariate test statistics (the Pillai-Bartlett Trace, Wilks' Lambda, Hotelling-Lawley Trace) will be compared with that of their parametric counterparts through simulation studies. The first series of simulations will focus on the null case, and the second on power.

1.4 BACKGROUND FOR PAPER #3 (LINMOD 4.0: SOFTWARE FOR THE GENERAL LINEAR MULTIVARIATE MODEL, ALLOWING MISSING DATA)

This paper reviews the latest version of LINMOD, which incorporates the missing data techniques outlined in Paper #1. The software will allow statisticians to apply the new methods I have developed in the real world, immediately. The quality of the program is greatly enhanced because it is based on three previous versions of the LINMOD program (LINMOD version 3.2, Muller and Hunter, 1992; LINMOD version 2, Muller and Peterson; LINMOD version 1, Helms, Hosking and Christiansen).

REFERENCES

Barton, C. N. and Cramer, E. C. (1989), "Hypothesis Testing in Multivariate Linear Models with Randomly Missing Data," Communications in Statistics - Simulation and Computation, 18, 875-895.

Chung, J. H. and Fraser, D. A. S. (1958), "Randomization Tests for a Multivariate Two-Sample Problem," Journal of the American Statistical Association, 53, 729-735.

Dempster, A. P., Laird, N. M. and Rubin, D. B. (1977), "Maximum Likelihood from Incomplete Data via the EM Algorithm (with Discussion)," Journal of the Royal Statistical Society-B, 39, 1-38.

Edgington, E. S. (1995), Randomization Tests (3rd edition), New York: Marcel Dekker.

Efron, B. (1979), "Bootstrap Methods: Another Look at the Jackknife," The Annals of Statistics, 7, 1-26.

Fisher, R. A. (1935), The Design of Experiments, Edinburgh: Oliver and Boyd.

Friedman, J. H. and Rafsky, L. C. (1979), "Multivariate Generalizations of the Wald-Wolfowitz and Smirnov Two-Sample Tests," The Annals of Statistics, 7, 697-717.
Greenhouse, S. W. and Geisser, S. (1959), "On Methods in the Analysis of Profile Data," Psychometrika, 24, 95-112.

Huynh, H. and Feldt, L. S. (1970), "Conditions Under Which Mean Square Ratios in Repeated Measurement Designs Have Exact F Distributions," Journal of the American Statistical Association, 65, 1582-1589.

Jerdack, G. R. and Sen, P. K. (1990), "Nonparametric Tests of Restricted Interchangeability," Annals of the Institute of Statistical Mathematics, 42, 99-114.

McKeon, J. J. (1974), "F Approximations to the Distribution of Hotelling's T_0²," Biometrika, 61, 381-383.

Morrison, D. F. (1973), "A Test for Equality of Means of Correlated Variates with Missing Data on One Response," Biometrika, 60, 101-105.

Muller, K. E. (1998), "A New F Approximation for the Pillai-Bartlett Trace Under H_0," Journal of Computational and Graphical Statistics, 7, 131-137.

Orchard, T. and Woodbury, M. A. (1972), "A Missing Information Principle: Theory and Applications," in Proceedings of the 6th Berkeley Symposium on Mathematical Statistics and Probability, 1, 697-715. Berkeley, California: University of California Press.

Pitman, E. J. G. (1937a), "Significance Tests Which Can Be Applied to Samples from Any Populations," Journal of the Royal Statistical Society, B, 4, 119-130.

Pitman, E. J. G. (1937b), "Significance Tests Which Can Be Applied to Samples from Any Populations. II. The Correlation Coefficient," Journal of the Royal Statistical Society, B, 4, 225-232.

Pitman, E. J. G. (1938), "Significance Tests Which Can Be Applied to Samples from Any Populations. III. The Analysis of Variance Test," Biometrika, 29, 322-335.

Puri, M. L. and Sen, P. K. (1993), Nonparametric Methods in Multivariate Analysis, Florida: Krieger Publishing Company.

Rao, C. R. (1973), Linear Statistical Inference and Its Applications, New York: John Wiley (2nd ed.).

Rubin, D. B. (1976), "Inference and Missing Data," Biometrika, 63, 581-592.

Schluchter, M. D. and Elashoff, J. D.
(1990), "Small-sample Adjustments to Tests with Unbalanced Repeated Measures Assuming Several Covariance Structures," Journal of Statistical Computation and Simulation, 37, 69-87.

Servy, E. C. and Sen, P. K. (1987), "Missing Variables in Multi-Sample Rank Permutation Tests for MANOVA and MANCOVA," Sankhya, 49, 78-95.

Wald, A. and Wolfowitz, J. (1944), "Statistical Tests Based on Permutations of the Observations," The Annals of Mathematical Statistics, 15, 358-372.

Chapter 2

TESTS FOR GAUSSIAN REPEATED MEASURES WITH MISSING DATA IN SMALL SAMPLES

Diane J. Catellier
Department of Biostatistics, CB#7400
University of North Carolina
Chapel Hill, North Carolina 27599-7400
D. Catellier: telephone 919-966-7283, FAX 919-966-3804, email [email protected]

Key words: test size, missing data, multivariate linear models, MANOVA

SUMMARY

For the analysis of continuous repeated measures with missing data and a small sample size, Barton and Cramer (1989) recommended using the EM algorithm for estimation and modifying Rao's F approximation to Wilks' test with adjusted error degrees of freedom. Their adjustment replaces the total sample size with the average number of non-missing pairs of responses. Computer simulations led to the conclusion that the modified test was slightly conservative for a total sample size of N = 40. Here we extend the evaluation of Barton and Cramer's method, as well as a number of others, to even smaller sample sizes (N ∈ {12, 24}). Although the Barton and Cramer method produces acceptable test size for Wilks' test if N = 24, this is not the case if N = 12 and 10% of the data are missing. We describe a number of extensions of the Barton and Cramer method by creating analogs for the Pillai-Bartlett trace, Hotelling-Lawley trace and Geisser-Greenhouse corrected univariate tests. Eleven sample size adjustments were examined for each test.
All involve computing degrees of freedom by replacing the number of independent sampling units (N) by some function of the numbers of non-missing pairs of responses. The preferred adjustment varies with the test statistic. Replacing N by the mean number of non-missing responses works best for the Geisser-Greenhouse test. The Pillai-Bartlett test requires the stronger adjustment of replacing N by the harmonic mean number of non-missing pairs of responses. For Wilks' and Hotelling-Lawley, an even more aggressive adjustment based on the minimum number of non-missing pairs must be used to control test size at or below the nominal rate. Overall, the simulation results support the conclusion that an adjusted test can always control test size at or below the nominal rate, even with as few as 12 observations and up to 10% missing data.

2.1 INTRODUCTION

2.1.1 Motivation

Repeated measures data arise when more than one response is observed for each experimental unit. The repeated measurements can be distinct variables, or a single variable measured at several points in time, with the spacing consistent across subjects. For ease of presentation, call the experimental unit a "subject" and the metameter on which the measurements are indexed "time". Clinical trials research and laboratory studies routinely use repeated measures designs because they allow the investigator to assess and describe change both within and between subjects as time or experimental conditions change.

A variety of analysis strategies have been proposed. Linear models, in particular, are useful in situations in which the responses are at least approximately Gaussian and can be explained by some linear function of predictors. When each subject is observed at the same p times and no observations are missing, closed-form maximum likelihood (ML) estimates of the model parameters are often available. Often, however, especially in clinical trials, the response is not observed at all time points for every subject.
A large amount of research has been directed at estimation for linear repeated measures models with missing data. These methods appear to work well in both large and small samples. In contrast, much less effort has been directed toward methods for inference. Various asymptotic test statistics work well in large samples. However, in small samples the available methods for inference may work very poorly. In particular, the methods produce inflated type I error rates in small samples (Barton and Cramer, 1989; Schluchter and Elashoff, 1990). We seek to develop hypothesis tests for Gaussian repeated measures with missing data that are accurate in small samples. In doing so, we restrict attention to a particular range of studies. We assume that the missing data elements are "missing at random" (MAR; Rubin, 1976).

2.1.2 General Strategies for the Analysis of Repeated Measures Designs

Models in which the expected value of the response vector equals a linear function of the parameters have traditionally been described as (general) linear models. Most often, one of three strategies is used for linear models with repeated measures: the multivariate analysis of variance (MANOVA) approach, the univariate approach to repeated measures, or mixed model analysis. All three models account for the dependencies among the repeated measures, but differ in the special form assumed for the covariance matrix within subjects, Σ. See Muller, LaVange, Ramey and Ramey (1992) for an overview of the multivariate and univariate approaches to repeated measures ANOVA. They described the assumptions behind the methods as well as the most widely used tests for both approaches. For the sake of brevity, the information in that article will be assumed throughout. In general, the MANOVA approach allows the responses to have any covariance pattern. In contrast, the uncorrected univariate approach assumes sphericity (Σ* = U'ΣU = σ²I; Huynh and Feldt, 1970).
Compound symmetry of Σ, coupled with choosing U orthonormal, for example, guarantees sphericity and hence exact univariate tests. The "corrected" univariate approach to repeated measures analysis of variance (ANOVA) involves adjusting the univariate F statistic for the amount of departure from sphericity. The mixed model has long been used for the analysis of continuous data, especially for missing and mistimed data. By virtue of modeling the subject as a random component, the mixed model can encompass a broad range of covariance structures. In this paper, we will compare existing inference methods for all three approaches to new ones.

Using the terminology of Rubin (1976), missing responses are said to be missing at random (MAR) if missingness of a particular response does not depend on its unobserved value, but can depend on the covariates or any of the observed responses. Likelihood-based estimation methods assume that the data are MAR. Alternatively, one can use the quasi-likelihood approach of Liang and Zeger (1986) by solving the generalized estimating equations (GEE) to obtain estimates of the regression parameters. Park (1993) compared the GEE approach to the ML approach for multivariate normal outcomes. He showed that, with no missing data and an unstructured covariance matrix, the GEE and ML score equations are equivalent and lead to the same estimates of expected value and covariance parameters. With missing observations, however, the equivalence fails. For data missing completely at random (MCAR; Rubin, 1976), the GEE solution produces consistent estimators. Three weaknesses, however, make GEE less desirable than ML estimation in the missing data setting. First, the GEE estimate of the working covariance matrix may not always be positive definite, while the ML estimate from the EM algorithm (Dempster, Laird and Rubin, 1977) is guaranteed to be positive definite (Laird, Lange and Stram, 1987).
Second, simulation studies by Stiger, Kosinski, Barnhart and Kleinbaum (1997), Qu, Piedmonte, and Williams (1994), Park (1993), and Emrich and Piedmonte (1992) support the conclusion that ML estimators perform better in small samples than GEE estimators. The former tend to have less bias, smaller mean squared errors and more accurate test size. Third, under misspecification of the covariance, the GEE estimators will only be consistent provided that the missing observations are MCAR. ML procedures give unbiased estimates under the weaker MAR assumption. The limitations of the GEE procedure in small samples, combined with the focus on Gaussian data, make ML estimation more attractive and therefore the focus of this paper.

2.2 KNOWN METHODS FOR ESTIMATION AND INFERENCE

2.2.1 Complete Data

The repeated measures model can be represented as a special case of the general linear multivariate model (GLMM). See Muller et al. (1992), Davidson (1972), and O'Brien and Kaiser (1985) for further discussion. Suppose that the responses for subject i (i ∈ {1, ..., N}) are measured at p times (the same for all subjects). The GLMM is

E(Y) = XB,  (2.2.1.1)

where E denotes the expectation operator, and Y (N × p), X (N × q, fixed and known, conditional upon having chosen the subjects) and B (q × p, fixed and unknown) denote the matrix of observations, the design matrix, and the parameter matrix, respectively. Indicate the i-th row of X as X_i = row_i(X). Assuming that row_i(Y) is N_p(X_iB, Σ), we can test the general linear hypothesis (GLH)

H_0: Θ = CBU = Θ_0.  (2.2.1.2)

Each row of the matrix C (a × q) defines a between-subject contrast and each column of the matrix U (p × b) defines a within-subject contrast. Define ν_E = N − rank(X). All tests of H_0 are based on

B̂ = (X'X)^- X'Y,  (2.2.1.3)

Θ̂ = CB̂U,  (2.2.1.4)

Ĥ = (Θ̂ − Θ_0)'[C(X'X)^- C']^{-1}(Θ̂ − Θ_0),  (2.2.1.5)

and

Ê = U'(Y − XB̂)'(Y − XB̂)U.  (2.2.1.6)

Ĥ and Ê have Wishart distributions, W_b(a, Σ*, Ω) and W_b(ν_E, Σ*) respectively, where

Ω = (Θ − Θ_0)'[C(X'X)^- C']^{-1}(Θ − Θ_0)Σ*^{-1}  (2.2.1.7)
Ω is the noncentrality parameter matrix. Let s = min(a, b) indicate the maximum possible rank of ĤÊ^{-1}. Let s* indicate the rank of Ω, with s* ≤ s. The common multivariate tests may be constructed using the eigenvalues (l_1, ..., l_s) of ĤÊ^{-1}. Specifically, define Wilks' Lambda as W = ∏_{i=1}^{s} (1 + l_i)^{-1}, the Pillai-Bartlett trace as V = Σ_{i=1}^{s} l_i(1 + l_i)^{-1}, the Hotelling-Lawley trace as U = Σ_{i=1}^{s} l_i, and Roy's largest root as R = max(l_i). When s > 1 (or s > 2 for Wilks'), closed-form expressions for the distributions of these test statistics are not available and approximations are used (§2.2, Muller et al., 1992). In general, none of the four multivariate tests is uniformly most powerful among all hypothesis testing situations. Hence the optimal choice depends on the alternative hypothesis. See Olson (1974, 1976, 1979), Anderson (1984, pp. 330-333), and Muller et al. (1992) for detailed discussions of relative test powers. Concerns about robustness and power led to not considering Roy's test any further.

All designs that can be analyzed with the multivariate approach can also be analyzed using the univariate approach to repeated measures. The usual univariate F statistic is

F_u = [tr(Ĥ)/(ab)] / [tr(Ê)/(bν_E)].  (2.2.1.8)

If the sphericity condition is met, F_u follows a central F distribution with ab and bν_E degrees of freedom under H_0. If sphericity is not met, Box (1954a, b) suggested that F_u follows an approximate F distribution with abε and bν_Eε degrees of freedom under H_0, with ε = tr²(Σ*)/[b·tr(Σ*²)]. In general, 1/b ≤ ε ≤ 1, with the upper bound corresponding to sphericity. Assuming ε = 1/b leads to a conservative test, while choosing ε = 1 (the uncorrected test) leads to a liberal test. Since Σ, and thus ε, is usually unknown, ε must be estimated. The Geisser-Greenhouse (1959) test (GG) uses the maximum likelihood estimate ε̂ = tr²(Σ̂*)/[b·tr(Σ̂*²)], while the Huynh-Feldt (1976) test (HF) uses ε̃ = (Nbε̂ − 2)/[b(ν_E − bε̂)].
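As a concrete illustration of (2.2.1.3)-(2.2.1.8), the sketch below computes the hypothesis and error SSCP matrices, the eigenvalues of ĤÊ^{-1}, the three trace criteria, and the GG estimate ε̂ for a small cell-mean design. The design, contrast matrices, and random seed are arbitrary choices for this illustration, not the simulation settings of §2.4:

```python
import numpy as np

rng = np.random.default_rng(1)
N, q, p = 12, 4, 3
X = np.kron(np.eye(q), np.ones((N // q, 1)))     # cell-mean design: 4 groups of 3
Y = rng.standard_normal((N, p))                  # illustrative null-case responses
Cmat = np.array([[1., -1, 0, 0], [0, 1, -1, 0], [0, 0, 1, -1]])  # between, a = 3
Umat = np.array([[1., 0], [-1, 1], [0, -1]])     # within contrasts, b = 2
a, b = Cmat.shape[0], Umat.shape[1]
nu_e = N - np.linalg.matrix_rank(X)              # error degrees of freedom

XtX_g = np.linalg.pinv(X.T @ X)                  # (X'X)^-
Bhat = XtX_g @ X.T @ Y                           # (2.2.1.3)
Theta = Cmat @ Bhat @ Umat                       # (2.2.1.4), with Theta_0 = 0
Resid = Y - X @ Bhat
E = Umat.T @ Resid.T @ Resid @ Umat              # (2.2.1.6) error SSCP
H = Theta.T @ np.linalg.inv(Cmat @ XtX_g @ Cmat.T) @ Theta   # (2.2.1.5)

s = min(a, b)
l = np.sort(np.linalg.eigvals(H @ np.linalg.inv(E)).real)[::-1][:s]
W = np.prod(1 / (1 + l))                         # Wilks' Lambda
V = np.sum(l / (1 + l))                          # Pillai-Bartlett trace
Uhl = np.sum(l)                                  # Hotelling-Lawley trace
Fu = (np.trace(H) / (a * b)) / (np.trace(E) / (b * nu_e))    # (2.2.1.8)

Sigma_star = E / nu_e                            # estimate of Sigma* = U'Sigma U
eps_hat = np.trace(Sigma_star) ** 2 / (b * np.trace(Sigma_star @ Sigma_star))
```

The bounds 1/b ≤ ε̂ ≤ 1 follow from the Cauchy-Schwarz inequality applied to the eigenvalues of Σ̂*, which the sketch makes easy to verify numerically.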
It is common practice to trim improper estimates of ε (ε̂ > 1) to 1. Muller and Barton (1989, 1991) suggested that the ε̂-adjusted F test provides the best compromise, controlling the type I error rate while retaining excellent power. Hence only the GG test statistic will be examined here.

In the null case, each of the multivariate test statistics can be approximated by an F random variable. If (a² + b² − 5) > 0 then let t = [(a²b² − 4)/(a² + b² − 5)]^{1/2}; otherwise t = 1. Rao (1973) suggested approximating W by

F(ν_1, ν_2) = [(1 − W^{1/t})/ν_1(W)] / [W^{1/t}/ν_2(W)],

with ν_1(W) = ab and ν_2(W) = t[ν_E − (b − a + 1)/2] − (ab − 2)/2. Although widely used in current statistical packages, Pillai's (1954) F approximation for V may be very conservative in small samples. Muller (1998) developed an F approximation for V that provides substantially better accuracy. Hence Muller's approximation will be used, with ν_1(V) = κab and ν_2(V) = κ·s(ν_E + s − b), where

κ = [s(ν_E + s − b)(ν_E + a + 2)(ν_E + a − 1)] / [s(ν_E + a)·ν_E(ν_E + a − b)].  (2.2.1.9)

McKeon (1974) provided a slightly better F approximation than the more widely used Pillai-Sampson approximation for the Hotelling-Lawley statistic. Write the McKeon approximation as

F = [(U/h)/(ab)] / [1/ν_2(U)],  (2.2.1.10)

with ν_1(U) = ab, ν_2(U) = 4 + (ab + 2)/g,

g = [ν_E² − ν_E(2b + 3) + b(b + 3)] / [ν_E(a + b + 1) − (a + 2b + b² − 1)],  (2.2.1.11)

and

h = [ν_2(U) − 2] / (ν_E − b − 1).  (2.2.1.12)

2.2.2 Data Missing at Random

For the GLMM, both ML and REML estimation methods have been extensively investigated for MAR data. For a general review, see Little and Rubin (1987, chapters 7-10). For some special patterns of missing data, such as monotone missing data, the likelihood can be factored into a product of terms, each containing distinct parameters. The solutions for these special cases have closed forms (Rubin, 1974). For arbitrary missing data patterns, the solution must be obtained iteratively.
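Rao's approximation for W is simple to compute; a minimal sketch follows (the function name is illustrative). Under the adjusted tests of this chapter, only the error degrees of freedom change, with N* − rank(X) replacing N − rank(X):

```python
import numpy as np

def rao_wilks_f(W, a, b, nu_e):
    """Rao's F approximation for Wilks' Lambda.  W in (0, 1], a = rank(C),
    b = rank(U), nu_e = error degrees of freedom (N - rank(X), or
    N* - rank(X) under the missing-data adjustments of this chapter)."""
    if a ** 2 + b ** 2 - 5 > 0:
        t = np.sqrt((a ** 2 * b ** 2 - 4) / (a ** 2 + b ** 2 - 5))
    else:
        t = 1.0
    nu1 = a * b
    nu2 = t * (nu_e - (b - a + 1) / 2) - (a * b - 2) / 2
    Wt = W ** (1 / t)
    F = ((1 - Wt) / nu1) / (Wt / nu2)            # larger F => smaller p value
    return F, nu1, nu2
```

Given F and its degrees of freedom, the p value follows from the F survival function (e.g., scipy.stats.f.sf(F, nu1, nu2)).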
The computational efficiency and simplicity of the EM algorithm (Orchard and Woodbury, 1972; Beale and Little, 1975; Dempster, Laird and Rubin, 1977) make it an obvious choice for ML estimation in the setting of interest. Barton and Cramer's experience with the method for the application at hand also strongly supports the choice. The algorithm preferred for more general models is not as obvious. See, for example, Elswick and Chinchilli (1993) for a discussion of methods for estimating the covariance matrix in the GMANOVA model with missing data, or Callahan and Harville (1990) for a comparison of algorithms for the general mixed model.

Except in special cases, no known method exists for providing accurate and efficient inference in small multivariate normal samples with missing data. Hypothesis tests constructed from complete observations only, while accurate in small samples, are inefficient. A number of approximate methods have been proposed for the problem of testing equality of means for a bivariate normal sample with data missing on one variable (Morrison, 1973; Little, 1976 and 1988). Morrison (1973) and Little (1976) recommended referring variously derived statistics to the t distribution with m − 1 degrees of freedom, where m is the number of complete cases. Little (1988) considered a Bayesian approach to making inferences about the difference in means.

Barton and Cramer (1989) suggested an appealing technique for testing the general linear hypothesis in a GLMM with an arbitrary pattern of missing data. The approach involves using the EM algorithm for ML estimation, and modifying Rao's F approximation to Wilks' test, F_W, with adjusted error degrees of freedom. Let N_jj' indicate the number of observations for which both Y_ij and Y_ij', for i ∈ {1, ..., N}, have non-missing values. Note that N_jj equals the number of cases observed for the j-th response. All adjustments considered by Barton and Cramer, and in this paper, involve replacing N by N*
in ν_E = N − rank(X), so that the adjusted error degrees of freedom become ν_E = N* − rank(X). In all cases N* equals a function only of {N_jj'}. For samples of size 40 and up to 20% missing data, test statistics with degrees of freedom based on the naive choice N* = N produced inflated type I error rates ranging from 0.10 to 0.23 against a nominal rate of 0.05, while those based on the number of complete cases resulted in rates substantially lower than 0.05 (range: 0.004-0.014). The optimal choice for N* was the average number of non-missing pairs of responses (N*6 in Table 2.1). Wilks' test based on this sample size adjustment produced reasonably accurate test sizes across all simulated conditions.

The mixed model is often used for multivariate data in which some of the response measurements are missing. Let Y_i be an (N_i × 1) vector of measurements for the i-th subject, and let N_+ = Σ_{i=1}^{N} N_i. In the mixed model, Y_+ = [Y_1', ..., Y_N']' is modeled as

Y_+ = X_+β + Zb + e_+,  (2.2.2.1)

with X_+ and Z the known design matrices for the fixed and random effects respectively, and b the vector of unknown random effects. The key assumptions for inference are that b and e_+ are independent and multivariate Gaussian. Define vec(M) as the vector created by stacking the columns of M. Also let A ⊗ B = {a_ij·B} indicate the (left) Kronecker product. For the cases of interest, the model may be written so that β = vec(B') from the GLMM. For complete data X_+ = X ⊗ I_p. For missing data, merely delete each row corresponding to a missing response. Let Σ_i, of dimension p_i, be the sub-matrix of Σ with rows and columns corresponding to the data observed for subject i. Let Λ = Dg(Σ_1, ..., Σ_N) indicate the block-diagonal matrix created by placing Σ_1 in the upper left diagonal, etc. Then e_+ ~ N(0, Λ). In all but a few cases, likelihood-based estimation of β and Λ requires iterative methods such as Newton-Raphson, the method of scoring, or the EM algorithm.
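For intuition, the E- and M-steps of the EM algorithm for the mean and covariance of a multivariate normal sample can be sketched as below. This is a didactic sketch, not the LINMOD or Barton-Cramer implementation; NaN marks a missing value, and the fixed iteration count (rather than a convergence test) is an illustrative simplification:

```python
import numpy as np

def em_mvnorm(Y, n_iter=100):
    """EM estimates of a multivariate normal mean and covariance with values
    missing at random (NaN entries).  Didactic sketch: no convergence test
    and no safeguards for ill-conditioned covariance matrices."""
    Y = np.asarray(Y, dtype=float)
    n, p = Y.shape
    obs = ~np.isnan(Y)
    mu = np.nanmean(Y, axis=0)                      # available-case start
    sigma = np.diag(np.nanvar(Y, axis=0)) + 1e-6 * np.eye(p)
    for _ in range(n_iter):
        Yhat = Y.copy()
        Csum = np.zeros((p, p))                     # summed conditional covariances
        for i in range(n):
            o, m = obs[i], ~obs[i]
            if m.any():
                # E-step: E[y_mis | y_obs] and Cov[y_mis | y_obs]
                beta = sigma[np.ix_(m, o)] @ np.linalg.inv(sigma[np.ix_(o, o)])
                Yhat[i, m] = mu[m] + beta @ (Y[i, o] - mu[o])
                Csum[np.ix_(m, m)] += sigma[np.ix_(m, m)] - beta @ sigma[np.ix_(m, o)].T
        # M-step: complete-data ML updates from the filled-in sufficient statistics
        mu = Yhat.mean(axis=0)
        dev = Yhat - mu
        sigma = (dev.T @ dev + Csum) / n
    return mu, sigma
```

Because the M-step covariance update includes the conditional covariance terms, the estimate stays symmetric and positive definite, in line with the property of EM noted above.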
The software used in this paper (PROC MIXED in SAS®) uses an implementation of the Newton-Raphson algorithm developed by Lindstrom and Bates (1988) to compute the ML estimates of the fixed effects and REML estimates of the covariance parameters.

Exact methods are not available to test hypotheses of the form H_0: θ = Cβ = 0. The approximate, large-sample test statistics can be unreliable in small samples. Schluchter and Elashoff (1990, §6) examined the test size of various ML Wald-type statistics (computed as the parameter estimate divided by an asymptotic standard error) under the mixed model formulation. Their small-sample simulation results suggest referring a modified Wald statistic to an F distribution with denominator degrees of freedom based on the number of complete cases. Another method of test construction for the mixed model uses the likelihood ratio principle (see Hocking, 1985, §8.3.1). The statistic computed using this method is approximately χ² in large samples. This approximation has been demonstrated to be unreliable in small samples (Woolson, Leeper and Clarke, 1978; Woolson and Leeper, 1980; and Leeper and Woolson, 1982). The empirical type I error rates were typically in the 0.10-0.25 range, far exceeding the nominal rate of 0.05. The version of PROC MIXED studied here used the following approximate F statistic (SAS Institute, 1997, p. 644):

F = θ̂'[C(X_+'Λ̂^{-1}X_+)^- C']^{-1}θ̂ / rank(C),  (2.2.2.2)

with numerator degrees of freedom equal to rank(C). Although several approximations are available for the denominator degrees of freedom (see SAS Institute Inc., 1997, p. 607), only the Satterthwaite approximation was considered in this paper.

2.3 NEW TESTS FOR DATA MISSING AT RANDOM

The success of the basic strategy followed by Barton and Cramer leads to a number of obvious generalizations. First, the approach will be applied to other test statistics. Second, some additional functions of the sample sizes merit consideration.
Third, even smaller sample sizes will be studied. In all cases, the EM algorithm will be used for estimation.

In addition to W, the U, V and GG tests may be modified to apply to missing data settings. In all cases, this requires changing only the error degrees of freedom by replacing N with some form of N*. Overall, 11 forms for N* will be examined for each of the four test statistics. They are listed in Table 2.1 in rank order from smallest to largest, with the exception that N*3 can be either less than or greater than N*6 (and hence N*4 and N*5). In all cases N/N* ≥ 1. Consequently, in large samples (as N → ∞, with fixed p, q, and proportion missing) the choice of N* has less and less effect. The form of the results of Rothenberg (which assume a sequence of local alternatives), as cited in Anderson (1984, §8.6.5), supports this position.

Table 2.1 Sample Size Adjustments for Error Degrees of Freedom

Name   Function of {N_jj'}
N*1    number of complete cases
N*2    min({N_jj'})
N*3    min({N_jj})
N*4    harmonic mean({N_jj'})
N*5    geometric mean({N_jj'})
N*6    arithmetic mean({N_jj'})
N*7    harmonic mean({N_jj})
N*8    geometric mean({N_jj})
N*9    arithmetic mean({N_jj})
N*10   max({N_jj})
N*11   N

2.4 NUMERICAL EVALUATIONS

2.4.1 Methods

All simulations involved a small range of research designs. In all cases, the designs included 1) one within-subject factor with p = 3 or 6 levels, 2) one between-subject factor with q = 4 levels, 3) N ∈ {12, 24}, and 4) 0%, 5% or 10% of the data missing. No subject's data were allowed to be completely missing. The procedure for producing missing data generated data that are MCAR. Other factors considered were the relative error variance of the response variables (equal, unequal) and the error correlation structure (medium, high). See Tables I-II in Barton and Cramer (1989) for details.
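The Table 2.1 adjustments are all simple functions of the counts of non-missing responses and pairs of responses; a sketch follows. NaN marks a missing value, and taking the pair counts {N_jj'} over j ≠ j' is an assumption of this illustration:

```python
import numpy as np

def sample_size_adjustments(Y):
    """Candidate N* functions of Table 2.1, computed from the per-response
    counts N_jj and pairwise counts N_jj' of non-missing values (NaN =
    missing).  Assumes every response and pair is observed at least once."""
    R = ~np.isnan(Y)                                 # observed-value indicators
    npairs = R.T.astype(int) @ R.astype(int)         # N_jj' = units with both j, j'
    diag = np.diag(npairs).astype(float)             # N_jj = units observed on j
    off = npairs[~np.eye(npairs.shape[0], dtype=bool)].astype(float)  # j != j'
    hmean = lambda x: len(x) / np.sum(1.0 / x)
    gmean = lambda x: np.exp(np.mean(np.log(x)))
    return {
        "N*1": int(R.all(axis=1).sum()),             # complete cases
        "N*2": off.min(),
        "N*3": diag.min(),
        "N*4": hmean(off),
        "N*5": gmean(off),
        "N*6": off.mean(),
        "N*7": hmean(diag),
        "N*8": gmean(diag),
        "N*9": diag.mean(),
        "N*10": diag.max(),
        "N*11": Y.shape[0],                          # unadjusted N
    }
```

Each adjusted test then uses ν_E = N* − rank(X) in place of N − rank(X); since every N* ≤ N, the adjustments can only make a test more conservative.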
In addition, a third level was added to the correlation structure factor, which allowed assessing the effect of very low correlation between the responses. The structure specified equal correlation (ρ_jj' = 0.1) for each pair of responses. The overall test for the presence of a trend (linear, quadratic or cubic) with respect to the between-subject factor for each of the response measures was of primary interest. Under the null, of course, B = 0. For 5,000 replications and assuming a true type I error rate of 0.05, the 95% confidence bounds around the type I error rate estimates are approximately ± 0.006.

2.4.2 Results

On average, higher levels of correlation within subjects were associated with modestly higher type I error rates. Since this pattern was consistent for each of the test statistics, only the results for the low and high correlation conditions will be presented. The empirical type I error rates for the mixed model F statistic are given in Table 2.2. The results indicate that the test has poor small-sample properties, producing inflated type I error rates even when none of the data were missing. For N = 24, test sizes increased from slightly greater (0.07-0.10) to considerably greater (0.16-0.32) than the nominal level as the number of repeated measures increased from 3 to 6.

Table 2.2 Test Size for Mixed Model F (α = 0.05, 5000 Replications, ± 0.006)

% Missing  N   σ²_j  ρ_jj'   p = 3   p = 6
 0         12  =     Low     .126    .58
 0         12  ≠     Low     .134    .60
 0         12  ≠     High    .125    .59
 0         24  ≠     Low     .069    .16
 0         24  ≠     High    .074    .16
 5         12  =     Low     .182
 5         12  ≠     Low     .171
 5         12  ≠     High    .172
 5         24  ≠     Low     .080    .21
 5         24  ≠     High    .081    .23
10         12  =     Low     .250
10         12  ≠     Low     .244
10         12  ≠     High    .262
10         24  ≠     Low     .086    .303
10         24  ≠     High    .095    .322

For the conditions with no missing data, all four univariate and multivariate test statistics succeeded in controlling the type I error rate at or below the nominal rate (see Table 2.3).
This illustrates that the sample sizes, while small, are large enough that the approximate F tests are essentially unbiased for complete data. Hence any discrepancy from the desired test size may be attributed to the influence of missing responses, and not to any inaccuracy in the test approximations for complete data.

Table 2.3 Test Size for GLMM F Tests (0% Missing, 5000 Replications, ± 0.006)

                    F_W            F_U            F_V            F_GG
N   σ²_j  ρ_jj'  p = 3  p = 6   p = 3  p = 6   p = 3  p = 6   p = 3  p = 6
12  =     Low    .050   .046    .048   .051    .044   .046    .027   .013
12  ≠     Low    .052   .054    .050   .055    .049   .059    .040   .025
12  ≠     High   .053   .053    .048   .053    .042   .045    .054   .058
24  ≠     Low    .049   .051    .050   .050    .048   .049    .041   .039
24  ≠     High   .053   .049    .051   .048    .051   .049    .049   .053

Tables 2.4, 2.5, 2.6, and 2.7 summarize the empirical test sizes for W, U, V, and GG with 5% and 10% missing data. All tables give results for tests based on N*2 = min({N_jj'}) and N*11 = N in order to define bounds on test size. The EM algorithm failed roughly 90% of the time for the condition with p = 6 and N = 12 subjects and even 5% missing data. (Estimates are well defined for complete data.) The table cells for these conditions were left blank. Not surprisingly, the results indicate that the worst accuracy tends to occur with more repeated measures, fewer subjects, more missing data and more correlation within subjects. From Tables 2.4 and 2.5, it is evident that the adjusted F_W and F_U tests based on N*11 give inflated test sizes, and those based on N*4, while accurate for N = 24, were inflated for N = 12. On the other hand, tests based on N*2 controlled test size at or below the nominal rate under all simulated conditions, with the exception of the condition with p = 6, N = 12 and 10% missing data, in which case test size was as high as 0.09.
Table 2.4 Adjusted Degrees of Freedom Test Size for F_W (5%, 10% Missing, 5000 Replications, ± 0.006)

                                N*2           N*4           N*11
% Missing  N   σ²_j  ρ_jj'  p = 3  p = 6  p = 3  p = 6  p = 3  p = 6
 5         12  =     Low    .029          .073          .145
 5         12  ≠     Low    .033          .072          .134
 5         12  ≠     High   .032          .074          .141
 5         24  ≠     Low    .030   .017   .049   .142   .067   .171
 5         24  ≠     High   .031   .022   .053   .146   .067   .199
10         12  =     Low    .047          .148          .345
10         12  ≠     Low    .042          .144          .335
10         12  ≠     High   .051          .152          .354
10         24  ≠     Low    .020   .087   .051   .133   .078   .379
10         24  ≠     High   .025   .089   .062   .158   .094   .411

Table 2.5 Adjusted Degrees of Freedom Test Size for F_U (5%, 10% Missing, 5000 Replications, ± 0.006)

                                N*2           N*4           N*11
% Missing  N   σ²_j  ρ_jj'  p = 3  p = 6  p = 3  p = 6  p = 3  p = 6
 5         12  =     Low    .032          .072          .142
 5         12  ≠     Low    .033          .067          .130
 5         12  ≠     High   .034          .074          .134
 5         24  ≠     Low    .028   .021   .048   .144   .069   .184
 5         24  ≠     High   .031   .025   .052   .148   .068   .217
10         12  =     Low    .052          .163          .341
10         12  ≠     Low    .047          .153          .333
10         12  ≠     High   .055          .169          .352
10         24  ≠     Low    .021   .084   .052   .135   .093   .379
10         24  ≠     High   .029   .088   .061   .158   .086   .417

Table 2.6 contains test size for the modified F_V tests. The test based on N*11 provided inflated type I error rates, and the N*2-adjusted test was conservative. Test sizes for N*4 were extremely accurate, with the exception of the condition with p = 6, N = 12 and 10% missing data, where the type I error rates were approximately 0.1.

Table 2.6 Adjusted Degrees of Freedom Test Size for F_V (5%, 10% Missing, 5000 Replications, ± 0.006)

                                N*2           N*4           N*11
% Missing  N   σ²_j  ρ_jj'  p = 3  p = 6  p = 3  p = 6  p = 3  p = 6
 5         12  =     Low    .023          .052          .102
 5         12  ≠     Low    .021          .051          .099
 5         12  ≠     High   .022          .050          .096
 5         24  ≠     Low    .029   .015   .046   .081   .054   .125
 5         24  ≠     High   .030   .017   .049   .088   .052   .128
10         12  =     Low    .011          .057          .206
10         12  ≠     Low    .010          .051          .215
10         12  ≠     High   .013          .058          .214
10         24  ≠     Low    .017   .019   .044   .106   .127   .334
10         24  ≠     High   .020   .022   .055   .120   .148   .354

The results given in Table 2.7 suggest that N*9 is a reasonable adjustment function for the F_GG test.
When N = 12, this test is slightly conservative; however, this corresponds to the modest conservatism found when no data are missing (Table 1 in Muller and Barton, 1989). When s = min(a, b) = 1, all of the multivariate test statistics are equivalent. This can occur if the rank of the C contrast matrix is one (a = 1) or if the rank of U is one (b = 1). The empirical test sizes shown in Table 2.8 suggest that when a = 1, the best adjustment for the degrees of freedom is based on N*2, while N*4 appears to work well when b = 1.

Table 2.7 Adjusted Degrees of Freedom Test Size for F_GG (5%, 10% Missing Data, 5000 Replications, ± 0.006)

% Missing  N   σ²_j  ρ_jj'   N*2 (p = 3)
 5         12  =     Low     .010
 5         12  ≠     Low     .019
 5         12  ≠     High    .026
 5         24  ≠     Low     .030
 5         24  ≠     High    .049
10         12  =     Low     .002
10         12  ≠     Low     .008
10         12  ≠     High    .009
10         24  ≠     Low     .018
10         24  ≠     High    .017

Remaining entries (N*9 and N*11, at p = 3 and 6), in printed order: .047 .049 .037 .041 .051 .048 .061 .061 .057 .053 .073 .078 .075 .030 .038 .040 .004 .009 .053 .059 .066 .035 .043 .050 .010 .018 .044 .043 .091 .076 .095 .070

Table 2.8 Adjusted Degrees of Freedom Test Size for the Multivariate F Test When s = min(a, b) = 1 (5%, 10% Missing, 5000 Replications, ± 0.006)

                                a = 1, b = 3         a = 3, b = 1
% Missing  N   σ²_j  ρ_jj'   N*2   N*4   N*11    N*2   N*4   N*11
 5         12  =     Low     .038  .072  .114    .031  .051  .078
 5         12  ≠     Low     .034  .066  .105    .029  .048  .071
 5         12  ≠     High    .039  .074  .110    .027  .042  .065
 5         24  ≠     Low     .036  .052  .077    .036  .046  .067
 5         24  ≠     High    .037  .050  .078    .030  .043  .064
10         12  =     Low     .058  .132  .253    .018  .043  .096
10         12  ≠     Low     .054  .131  .251    .016  .040  .010
10         12  ≠     High    .056  .137  .258    .014  .036  .088
10         24  ≠     Low     .028  .052  .102    .026  .043  .089
10         24  ≠     High    .028  .048  .104    .018  .036  .073

2.5 CONCLUSIONS

Conclusion 1. For all tests considered, accuracy decreases with more repeated measures, fewer subjects, more missing data and more correlation within subjects.

Conclusion 2.
The mixed model F statistic used by PROC MIXED in SAS® with Satterthwaite-approximated denominator degrees of freedom gives liberal test size for N ≤ 24, even with complete data.

Conclusion 3. For 6 responses and 12 subjects, the EM algorithm (in the version used here) failed roughly 90% of the time.

Conclusion 4. A degree of freedom adjustment can always control test size at or below the nominal level, even for conditions as extreme as N = 12 and 10% missing data.

Conclusion 5. The choice of adjustment varies with the test.
5.1 N*2, the minimum N_jj', is best for the Wilks' and Hotelling-Lawley tests.
5.2 N*4, the harmonic mean of the N_jj', is best for the Pillai-Bartlett test.
5.3 N*9, the mean N_jj, is best for the Geisser-Greenhouse test.

Note that the simulated data were generated in such a way as to create data that are MCAR. Hence, one area of future research which warrants consideration is the behavior of the tests under conditions in which the data are MAR. Techniques for power analysis, given that test size can be controlled, are also appealing. Obviously, the approach taken here is more intuitive and heuristic than analytical. Nevertheless, we believe we have succeeded, where other approaches have not, in suggesting a method that appears to ensure that test size does not exceed the nominal rate in small samples with missing data for the GLMM. A more formal approach must necessarily involve a rather sophisticated attack, due to the complexity of the distributions of the multivariate test statistics, even with complete data. Such formal approaches clearly represent the most needed future research.

REFERENCES

Anderson, T. W. (1984), An Introduction to Multivariate Statistical Analysis, New York: John Wiley (2nd ed.).

Barton, C. N. and Cramer, E. C. (1989), "Hypothesis Testing in Multivariate Linear Models with Randomly Missing Data," Communications in Statistics - Simulations, 18, 875-895.

Beale, E. M. L. and Little, R. J. A.
(1975), "Missing Values in Multivariate Analysis," Journal of the Royal Statistical Society-B, 37, 129-145.

Box, G. E. P. (1954a), "Some Theorems on Quadratic Forms Applied in the Study of Analysis of Variance Problems, I: Effects of Inequality of Variance in the One-way Classification," The Annals of Mathematical Statistics, 25, 290-302.

Box, G. E. P. (1954b), "Some Theorems on Quadratic Forms Applied in the Study of Analysis of Variance Problems, II: Effects of Inequality of Variance and of Correlation Between Errors in the Two-way Classification," The Annals of Mathematical Statistics, 25, 484-498.

Callahan, T. P. and Harville, D. A. (1990), "Some New Algorithms for Computing Maximum Likelihood Estimates of Variance Components," Journal of Statistical Computation and Simulation, 38, 239-259.

Davidson, M. L. (1972), "Univariate Versus Multivariate Tests in Repeated Measures Experiments," Psychological Bulletin, 77, 446-452.

Dempster, A. P., Laird, N. M. and Rubin, D. B. (1977), "Maximum Likelihood from Incomplete Data via the EM Algorithm (with Discussion)," Journal of the Royal Statistical Society-B, 39, 1-38.

Emrich, L. J. and Piedmonte, M. R. (1992), "On Some Small Sample Properties of Generalized Estimating Equation Estimates for Multivariate Dichotomous Outcomes," Journal of Statistical Computation and Simulation, 41, 19-29.

Greenhouse, S. W. and Geisser, S. (1959), "On Methods in the Analysis of Profile Data," Psychometrika, 24, 95-112.

Hocking, R. R. (1985), The Analysis of Linear Models, Monterey, CA: Brooks/Cole Publishing Co.

Huynh, H. and Feldt, L. S. (1970), "Conditions Under Which Mean Square Ratios in Repeated Measurement Designs Have Exact F Distributions," Journal of the American Statistical Association, 65, 1582-1589.

Huynh, H. and Feldt, L. S. (1976), "Estimation of the Box Correction for Degrees of Freedom From Sample Data in Randomized Block and Split-Plot Designs," Journal of Educational Statistics, 1(1), 69-82.

Laird, N. M., Lange, N. and Stram, D.
(1987), "Maximum Likelihood Computations with Repeated Measures: Application of the EM Algorithm," Journal of the American Statistical Association, 82, 97-105.

Leeper, J. D. and Woolson, R. F. (1982), "Testing Hypotheses for the Growth Curve Model when the Data are Incomplete," Journal of Statistical Computation and Simulation, 15, 97-107.

Liang, K. Y. and Zeger, S. L. (1986), "Longitudinal Data Analysis Using Generalized Linear Models," Biometrika, 73, 13-22.

Lindstrom, M. J. and Bates, D. M. (1988), "Newton-Raphson and EM Algorithms for Linear Mixed-effects Models for Repeated Measures Data," Journal of the American Statistical Association, 83, 1014-1022.

Little, R. J. A. (1976), "Inference About Means From Incomplete Multivariate Data," Biometrika, 63, 593-604.

Little, R. J. A. (1988), "Approximate Calibrated Small Sample Inference About Means From Bivariate Normal Data with Missing Values," Computational Statistics and Data Analysis, 7, 161-178.

Little, R. J. A. and Rubin, D. B. (1987), Statistical Analysis with Missing Data, New York: John Wiley.

McKeon, J. J. (1974), "F Approximations to the Distribution of Hotelling's T₀²," Biometrika, 61, 381-383.

Morrison, D. F. (1973), "A Test for Equality of Means of Correlated Variates with Missing Data on One Response," Biometrika, 60, 101-105.

Muller, K. E. (1998), "A New F Approximation for the Pillai-Bartlett Trace Under H₀," Journal of Computational and Graphical Statistics, 7, 131-137.

Muller, K. E. and Barton, C. N. (1989), "Approximate Power for Repeated-Measures ANOVA Lacking Sphericity," Journal of the American Statistical Association, 84, 549-555.

Muller, K. E. and Barton, C. N. (1991), Correction to "Approximate Power for Repeated-Measures ANOVA Lacking Sphericity," Journal of the American Statistical Association, 86, 255-256.

Muller, K. E., LaVange, L. M., Ramey, S. L. and Ramey, C. T.
(1992), "Power Calculations for General Linear Multivariate Models Including Repeated Measures Applications," Journal of the American Statistical Association, 87, 1209-1226.

Muller, K. E. and Peterson, B. L. (1984), "Practical Methods for Computing Power in Testing the Multivariate Linear Hypothesis," Computational Statistics and Data Analysis, 2, 143-158.

O'Brien, R. G. and Kaiser, M. K. (1985), "MANOVA Method for Analyzing Repeated Measures Designs: An Extensive Primer," Psychological Bulletin, 97, 316-333.

Olson, C. L. (1974), "Comparative Robustness of Six Tests in Multivariate Analysis of Variance," Journal of the American Statistical Association, 69, 894-908.

Olson, C. L. (1976), "Choosing a Test Statistic in Multivariate Analysis," Psychological Bulletin, 86, 579-586.

Olson, C. L. (1979), "Practical Considerations in Choosing a MANOVA Test Statistic: A Rejoinder to Stevens," Psychological Bulletin, 86, 1350-1352.

Orchard, T. and Woodbury, M. A. (1972), "A Missing Information Principle: Theory and Applications," in Proceedings of the 6th Berkeley Symposium on Mathematical Statistics and Probability, 1, 697-715, Berkeley, CA: University of California Press.

Park, T. (1993), "A Comparison of the Generalized Estimating Equation Approach with the Maximum Likelihood Approach for Repeated Measurements," Statistics in Medicine, 12, 1723-1732.

Pillai, K. C. S. (1954), "On Some Distribution Problems in Multivariate Analysis," Institute of Statistics Mimeo Series No. 88, University of North Carolina, Chapel Hill.

Qu, Y. S., Piedmonte, M. R. and Williams, G. W. (1994), "Small Sample Validity of Latent Variable Models for Correlated Binary Data," Communications in Statistics - Simulations, 23, 243-269.

Rao, C. R. (1973), Linear Statistical Inference and Its Applications, New York: John Wiley (2nd ed.).

Rubin, D. B. (1974), "Characterizing the Estimation of Parameters in Incomplete Data Problems," Journal of the American Statistical Association, 69, 467-474.

Rubin, D.
B. (1976), "Inference and Missing Data," Biometrika, 63, 581-592.

SAS Institute Inc. (1997), SAS/STAT Software: Changes and Enhancements, Release 6.12, Cary, NC: SAS Institute Inc.

Schluchter, M. D. and Elashoff, J. D. (1990), "Small-sample Adjustments to Tests with Unbalanced Repeated Measures Assuming Several Covariance Structures," Journal of Statistical Computation and Simulation, 37, 69-87.

Stiger, T. R., Kosinski, A. S., Barnhart, H. X. and Kleinbaum, D. G. (1997), "ANOVA for Repeated Ordinal Data with Small Sample Size? A Comparison of ANOVA, MANOVA, WLS and GEE Methods by Simulation," JSM Abstract Book, p. 246.

Woolson, R. F. and Leeper, J. D. (1980), "Hypothesis Testing in Multivariate Linear Models with Randomly Missing Data," Communications in Statistics - Theory and Methods, A9, 1491-1513.

Woolson, R. F., Leeper, J. D. and Clarke, W. R. (1978), "Analysis of Incomplete Data From Longitudinal and Mixed Longitudinal Studies," Journal of the Royal Statistical Society-A, 141, 242-252.

Chapter 3

COMPARISON OF APPROXIMATE PERMUTATION TESTS WITH PARAMETRIC TESTS IN MANOVA WITH MISSING DATA

Diane J. Catellier
Department of Biostatistics CB#7400
University of North Carolina
Chapel Hill, North Carolina 27599-7400

D. Catellier: telephone 919-966-7283, email [email protected], FAX 919-966-3804

Key words: test size, randomization test, multivariate linear models, MANOVA

SUMMARY

We describe how data permutation methods provide a new and powerful method for testing some classes of hypotheses in Multivariate Analysis of Variance (MANOVA) with missing data and small sample size. The first step involves using the EM algorithm to find maximum likelihood estimates of means and covariances using all of the available data, based on an assumption of Gaussian data. The second step is to compute the F approximations to Wilks' Lambda, the Pillai-Bartlett trace, and the Hotelling-Lawley trace test statistics.
Next, a Monte Carlo approximation to the permutation test is carried out by tabulating F statistic values for a random sample of possible data permutations. The fourth step is to compute the significance level as the proportion of F values larger than or equal to the observed F. We compared the approximate permutation tests to the degree of freedom adjusted F tests recommended by Catellier and Muller (1998) and the usual unadjusted F tests. The adjusted tests replace the number of independent sampling units (N) by some function of the numbers of non-missing pairs of responses. Simulation results confirm that unadjusted F tests give substantially inflated test sizes except with complete data. Although the adjusted tests limit test size to no more than the nominal level, they can be noticeably conservative. In contrast, the approximate permutation tests apparently provide unbiased tests, even for very small samples (N = 12) and up to 10% missing data. Overall, the permutation-based methods had equal or higher power than the adjusted F tests in all cases, while still controlling test size.

3.1 INTRODUCTION

3.1.1 Motivation

We focus on a very common range of statistical models, with an important complication. For complete and multivariate Gaussian data, the General Linear Multivariate Model (GLMM) provides convenient and statistically well-behaved estimation and testing, even in small samples. Example designs include both repeated measures and MANOVA arrangements. The loss of any data substantially complicates the picture. Missing data plague almost all experimental settings. For Gaussian data with some observations missing at random (MAR, Rubin, 1976), practical and effective estimation procedures are available, even for small samples. A useful overview of this topic can be found in Little and Rubin (1987). In particular, maximum likelihood (ML) estimation using the EM algorithm seems to work well in both large and small samples (Beale and Little, 1975).
In the same setting, hypothesis testing proves much more difficult. The work of Barton and Cramer (1989) and Schluchter and Elashoff (1990) allows concluding that no general method has been shown to control test size at or below the nominal level, except in moderate to large samples. Analysis of the complete cases only does provide unbiased results, but can be extremely inefficient. Catellier and Muller (1998) recently introduced methods for Gaussian data which do succeed in controlling test size in small sample GLMM analysis with missing data. Their approximate methods often allow test size to fall somewhat below the target level, especially at the smallest sample sizes. Hence their methods sacrifice some power. In the present research we seek to regain the power lost by taking advantage of methods which increase in appeal as sample sizes decrease.

Permutation-based inference methods, in which the only assumption is one of exchangeability of observations (Lehmann, 1986, p. 231), are guaranteed to be valid under the null hypothesis, even with small sample sizes. Experimental studies with subjects randomly assigned to treatments are an important example of data which meet the exchangeability assumption. An exact permutation-based p value for an observed test statistic is computed as the proportion of test statistic values, computed from all possible reassignments of observations to treatment groups, which are as large as the observed value. Koch and Gillings (1983) refer to inferences based on permutation theory as design-based inferences, as opposed to model-based inferences, which require assumptions external to the study design.

A permutation test is based on some chosen test statistic. Clearly, one prefers a test statistic that is likely to provide the most powerful test of the null hypothesis of interest.
In practice, the distributional properties of the data are considered when choosing between parametric or nonparametric estimation of the population parameters. For example, the combination of nonparametric (e.g., rank-based) estimates with permutation-based (i.e., design-based) inference would seem to have the greatest appeal when lacking a convincing choice of distribution model for the data. In this paper, we consider an analysis approach based on parametric estimates with permutation-based inference. We restrict attention to the GLMM setting meeting all the usual Gaussian assumptions, except for the presence of MAR data.

Barton and Cramer (1989) and Catellier and Muller (1998) studied the null case properties of a method based on EM estimation and degree of freedom adjusted approximate parametric tests in small to moderately sized samples. Sample sizes ranged from 12 to 40, and the proportion missing ranged from 0 to 10%. Simulation results showed that the unadjusted parametric F tests give substantially inflated test sizes in all but the complete data setting. The adjusted tests held the test size to no more than the nominal level, but were noticeably conservative under certain conditions. Hence a test which has actual test size equal to the nominal level holds promise, due to the opportunity for improved statistical power. The combination of this idea and the possibility of improved robustness motivated the current research. An additional enticement was that many authors (for example, see Ludbrook and Dudley, 1998) argue strongly that first principles dictate that exact permutation tests should be preferred in most biomedical research.

3.1.2 Relevant Literature

Catellier and Muller (1998) provided a brief overview of the literature pertaining to strategies for providing accurate and efficient parametric inference in small multivariate normal samples with missing data.
The literature on permutation-based inference can be divided into two divergent streams. One relates to permutation tests for rank-based test statistics, and the other to permutation tests for test statistics which are explicit functions of the actual values of the sample observations. These latter tests are sometimes called component randomization tests (see Puri and Sen, 1993, p. 76). Since a permutation distribution of a rank-based test statistic is invariant to changes in the actual values of the observations, it is possible to prepare tables for various sample sizes which can be used repeatedly to determine significance for new samples. Component randomization tests require that the permutation distribution be computed for every new set of data.

Much of the theoretical work on permutation procedures is devoted to tests based on ranks. The asymptotic power properties and relative efficiencies (with respect to their parametric competitors) of various rank permutation tests for MANOVA in the complete data setting are presented in detail in Chapter 5 of Puri and Sen (1993). Servy and Sen (1987) extended the theory of rank permutation tests for one-way MANOVA and multivariate analysis of covariance models to allow for missing data. Their approach involves replacing the observed variables with ranks or other scores, excluding the missing values, and imputing the missing data with the mean of the scores of the non-missing values for that variable. Jerdack and Sen (1990) extended the approach to two-way MANOVA designs.

The first example of component permutation tests for MANOVA is due to Wald and Wolfowitz (1944). They proposed a permutation test based on a modified Hotelling's T² statistic to test the hypothesis that two samples arose from one multivariate normal population, assuming homogeneity of covariance. Chung and Fraser (1958) derived an alternative permutation test in the same setting for the case in which the number of response variables is large.
Friedman and Rafsky (1979) provide a k-sample generalization of the Wald-Wolfowitz two-sample test. The method involves constructing the minimum spanning tree for the combined sample, then generating the permutation distribution of the runs statistic through a series of random relabelings of the individual data points. Little is known about the properties of permutation tests based on three commonly used multivariate test statistics: the Pillai-Bartlett trace, Wilks' Lambda, and the Hotelling-Lawley trace.

The bootstrap resembles the permutation test in that it derives a distribution for the test statistic using the observed data (Efron and Tibshirani, 1993). The bootstrap distribution is obtained by repeatedly resampling the observations (with replacement) separately from each sample and computing the test statistic for each resample. Romano (1989) showed that the power of the bootstrap and permutation tests are asymptotically equivalent. However, the permutation test may be preferable since it has exact level α for finite samples.

An important limitation of permutation methods is that they are currently limited to fairly simple designs. For instance, there is considerable debate among statisticians as to whether an interaction in a factorial experiment can be tested using permutation methods. See Edgington (1995, p. 137-138) or Scheffé (1959, p. 318) for arguments which support the claim that tests for interaction are impossible. Welch (1990) provides justification for a method of constructing a permutation test of interaction based on invariance and sufficiency. Still and White (1981) proposed an approximate permutation test of interaction in which residuals, obtained by removal of estimates of the main effects, are permuted. Theoretical objections to permutation tests based upon estimated residuals have been raised by Bradbury (1987), Edgington (1995) and others.
They argue that permutation tests which permute design-dependent functions of the observations, instead of the observations themselves, do not meet the requirement that all data permutations be equally likely under the null hypothesis. Given that no permutation test of interaction has been thoroughly justified, other methods such as the bootstrap are often recommended.

The scope of this literature review has been limited to permutation tests for testing equality of the treatment means, assuming homogeneity of covariance in one-way MANOVA models. See Edgington (1995) and Good (1994) for permutation tests pertaining to other experimental designs or for testing other hypotheses, such as those concerning correlation and trend.

3.2 STATEMENT OF THE PROBLEM

Notation and setting closely follow that of Catellier and Muller (1998), which should be consulted for details. We note only the following few points. We assume throughout that the usual MANOVA assumptions of Gaussian distributed errors and homogeneous covariance structure are realistic, and missing responses are MAR. Hence when the data are complete the F approximations to the multivariate tests provide valid results. Finally, we restrict attention to a limited class of tests (one between-subject factor and one within-subject factor).

To facilitate the discussion of permutation procedures we introduce some notation. Suppose the N subjects are randomly assigned to one of q ≥ 2 treatment groups, with Ng subjects in each group (g ∈ {1, ..., q}), each of whom contributes p response measures. For i ∈ {1, ..., Ng}, let Yi(g) = [Yi1(g), ..., Yip(g)]' indicate one of a set of independent and identically distributed random vectors with continuous distribution function Fg. The q sets are assumed to be mutually independent. The most general null hypothesis specifies that F1 = F2 = ... = Fq, against the alternative that the {Fg} are not all the same.
The GLMM is

E(Y) = XB,   (3.2.1)

where E denotes the expectation operation and Y (N × p), X (N × q, fixed and known, conditional upon having chosen the subjects) and B (q × p, fixed and unknown) denote the matrix of observations, design matrix, and parameter matrix, respectively.

Our main interest is in testing the null hypothesis of equality of the treatment location parameters, assuming homogeneity of covariance. This is equivalent to testing the null hypothesis of no interaction in X and Y. For this alternative, we let Fg(y) = F(y + Ag) for all g and test H0: A1 = ... = Aq = 0 against the alternative that the {Ag} are not all zero. The null hypothesis implies that the response vector for each subject is the same under one treatment assignment as under any alternative assignment, or equivalently, that the Yi are interchangeable for i ∈ {1, ..., N}, with N = N1 + ... + Nq. Thus, under H0, each data permutation represents the results that would have been obtained for a particular assignment of subjects to the treatment conditions. The alternative hypothesis is that there is a differential effect of the treatments for at least one of the subjects.

3.3 NEW METHODS

3.3.1 The Basic Method

All versions of the permutation procedure for inference in the GLMM with missing data involve four steps. These may be loosely described as follows. 1) Compute ML estimates of the expected value and covariance parameters via the EM algorithm for the original set of observations. 2) Compute unadjusted F approximations for three multivariate test statistics: Wilks' Lambda (W), the Hotelling-Lawley trace (U), and the Pillai-Bartlett trace (V) using the estimates obtained in Step 1. The F approximations for W, U, and V coincide with those used by Catellier and Muller (1998).
3) Obtain the permutation distributions of Fw, Fv, and Fu by either enumerating all possible reassignments of observations to treatment groups, or by choosing a random sample of reassignments (keeping the number of observations in each group constant), and recalculating the test statistics for each reassignment. 4) Compute the p value as the proportion of test statistic values greater than or equal to the value corresponding to the originally obtained data.

3.3.2 Exact versus Approximate Permutation Tests

Exact permutation tests involve tabulating the test statistic value for every possible permutation. In all, there are M = N!/(N1! × ... × Nq!) ways in which a sample of N response vectors can be combined into q samples of sizes N1, ..., Nq. As the number of observations increases, the number of values of the test statistic that need to be calculated to obtain the exact p value increases very rapidly. Pitman (1937a, 1937b, 1938) approached the problem by deriving the first four moments of the exact permutation distribution of a test statistic and showing that they converge to the moments of some well-known distribution (χ², normal, etc.) as N increases. Wald and Wolfowitz (1944) provided a general theorem on the limiting distributions of the class of linear permutation statistics (Puri and Sen, 1993, p. 73). For example, the permutation distribution of a modified Hotelling T² statistic approaches the χ² distribution with two degrees of freedom as N → ∞.

In the MANOVA setting of interest, with a small to moderate sample size, one cannot take advantage of the asymptotic approximations for the permutation distributions. Neither can one obtain the complete permutation distributions for the statistics, except for a rather exorbitant amount of computing (at least with current equipment and methods). For example, the total number of data permutations for an experiment with N = 12 subjects randomly assigned to 4 treatments is 369,600.
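The count M and the exact p value it implies can be sketched as follows. This is a minimal illustration, not the multivariate implementation studied here: a simple two-sample mean difference stands in for the multivariate F statistics, and all function names are hypothetical.

```python
from itertools import combinations
from math import factorial

def n_permutations(group_sizes):
    # M = N! / (N_1! * ... * N_q!) distinct assignments of N subjects
    # to groups of the given fixed sizes.
    m = factorial(sum(group_sizes))
    for n_g in group_sizes:
        m //= factorial(n_g)
    return m

def exact_permutation_p(x, y):
    # Exact two-sample permutation p value: the proportion of all
    # reassignments of the pooled observations (the observed one
    # included) whose statistic is at least as large as the observed.
    pooled = x + y
    idx = range(len(pooled))
    observed = abs(sum(x) / len(x) - sum(y) / len(y))
    hits = total = 0
    for grp in combinations(idx, len(x)):
        g1 = [pooled[i] for i in grp]
        g2 = [pooled[i] for i in idx if i not in grp]
        stat = abs(sum(g1) / len(g1) - sum(g2) / len(g2))
        hits += stat >= observed - 1e-12
        total += 1
    return hits / total

# 12 subjects split evenly among 4 treatments reproduces the count
# quoted in the text:
m = n_permutations([3, 3, 3, 3])          # 369,600
p = exact_permutation_p([1.1, 2.3, 1.9], [3.4, 4.0, 2.9])
```

Because the observed assignment is itself one of the M reassignments, the exact p value can never be zero.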
Alternatively, we can significantly reduce the amount of time required by drawing a random sample of the data rearrangements, without replacement, to produce a close approximation to the exact p value. Such methods are called "approximate" or Monte Carlo permutation tests.

A random draw from the set of complete data permutations can be obtained by assigning a random uniform [0, 1] number to each row of the design matrix, X, and sorting this matrix according to the uniform values. This sorted matrix is then reassigned to the original response matrix, Y. This procedure is equivalent to random sampling from the N! possible permutations of N observations. Using the concept of a partition (Section 4.2, Johnson and Kotz, 1972), it is easy to ensure that only data permutations in which at least one subject's assignment is different than the original treatment assignment are sampled. For example, if the partition representing the original allocation of N = 12 subjects to q = 4 treatment groups (A, B, C, D) is

Treatment Group   Subject
A                 2, 11, 6
B                 3, 1, 10
C                 8, 9, 4
D                 12, 5, 7

and its corresponding canonical representation is

Treatment Group   Subject
A                 2, 6, 11
B                 1, 3, 10
C                 4, 8, 9
D                 5, 7, 12

then only random data permutations whose canonical partitions are different than the original belong to the set of M possible data permutations.

3.4 NUMERICAL EVALUATIONS

3.4.1 Purpose of Studies and Overall Design

The impact of missing data on test size and power of permutation tests for three conventionally used multivariate test statistics will be compared with that of their parametric counterparts through simulation studies. The first series of simulations will focus on the null case, and the second on power. When examining power, a diffuse noncentrality pattern (Olson, 1976) for the expected value matrix, B, was considered. The multivariate power estimation algorithm of Muller and Peterson (1984) was used to compute estimates of B corresponding to approximate power of 0.8.
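The uniform-sort draw and the canonical-partition screen described above can be sketched as follows (a sketch only; `random_reassignment` and `canonical_partition` are illustrative names, not from this chapter):

```python
import random

def random_reassignment(n):
    # Attach a Uniform[0,1] key to each design-matrix row index and
    # sort on the keys; this samples uniformly from the n! orderings.
    keyed = sorted((random.random(), i) for i in range(n))
    return [i for _, i in keyed]

def canonical_partition(order, group_sizes):
    # Canonical representation of a partition: subjects within each
    # group sorted, so two orderings that induce the same allocation
    # of subjects to groups compare equal.
    parts, start = [], 0
    for size in group_sizes:
        parts.append(tuple(sorted(order[start:start + size])))
        start += size
    return tuple(parts)

group_sizes = [3, 3, 3, 3]                      # N = 12, q = 4
original = canonical_partition(list(range(12)), group_sizes)

draw = random_reassignment(12)
# Keep only draws whose allocation differs from the original one.
is_new = canonical_partition(draw, group_sizes) != original
```

Reordering subjects within a group, or reordering entire groups of equal size in a way that preserves each subject's treatment label, leaves the canonical partition unchanged, which is exactly the screen the text describes.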
3.4.2 Methods

Simulations involved the following factors: 1) one within-subject factor with p = 3 levels, 2) one between-subject factor with q = 4 levels, 3) N ∈ {12, 24}, 4) proportion missing of π ∈ {0, 0.05, 0.10}, and 5) three patterns for the error covariance matrix defined by either equal or unequal variances, and either low or high correlation (i.e., ρjj' = 0.1 or 0.7, respectively). No subject's data were allowed to be completely missing.

3.4.3 Results

Tables 3.1-3.4 give the empirical test size and power levels for both parametric and permutation-based tests for W, U, and V. Simulation results were based on 5,000 replications for the parametric tests and 1,000 replications for the permutation tests. Assuming a nominal significance level of 0.05, the approximate 95% confidence bound for each entry in Tables 3.1 and 3.2 is no greater than ±0.014. The maximum 95% confidence bound for each power estimate in Tables 3.3 and 3.4 is no greater than ±0.031, which occurs when the true power is 0.5.

We describe first the simulation results for the null situation. Table 3.1 gives the empirical test sizes for the parametric F tests and their permutation counterparts (P) in the complete data setting. All tests fall within a tolerable range of the target test size. Results for the missing data conditions are shown in Table 3.2. It is evident that all three unadjusted parametric tests Fw, Fu, and Fv give inflated test sizes. For each test statistic, increasing the sample size from N = 12 to N = 24 reduced, but did not overcome, the problems introduced by the missing data. The adjusted parametric tests Fw*, Fu*, and Fv* held test size to no more than the nominal level, but were noticeably conservative under the worst simulated conditions. The empirical rejection rates for the permutation tests, on the other hand, never fell outside the 95% confidence interval for the true test size under any of the missing data conditions.
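One plausible implementation of the MCAR deletion scheme of Section 3.4.2 can be sketched as follows; the chapter does not spell out the exact mechanism, so this is an assumption: delete each response independently with probability π, redrawing any row that would lose all of its responses.

```python
import random

def impose_mcar(y, pi, rng=random.Random(0)):
    # Replace each entry of the response matrix with None independently
    # with probability pi (MCAR), but redraw any row that would become
    # completely missing, since no subject's data may be entirely lost.
    out = []
    for row in y:
        while True:
            new = [None if rng.random() < pi else v for v in row]
            if any(v is not None for v in new):
                out.append(new)
                break
    return out

# Toy 12 x 3 response matrix (N = 12 subjects, p = 3 responses):
y = [[float(i + j) for j in range(3)] for i in range(12)]
y_miss = impose_mcar(y, 0.10)
```

Note that the redraw step makes the realized missingness rate fall slightly below π; a fixed-count deletion scheme would be an equally defensible reading of the design.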
Table 3.1 Approximate Permutation Test Size for F Tests (0% Missing, 1000 Replications, ±0.014)

                        W             U             V
 N   ρjj'  σj²      Fw     P     Fu     P     Fv     P
 12  Low    =      .050  .057  .056  .053  .061  .057
 12  Low    ≠      .052  .048  .057  .058  .057  .048
 12  High   ≠      .053  .050  .044  .051  .050  .051
 24  Low    ≠      .053  .048  .049  .049  .051  .055
 24  High   ≠      .054  .056  .042  .053  .062  .061

Table 3.2 Approximate Permutation Test Size for F Tests (5%, 10% Missing, 1000 Replications, ±0.014)

                                  W                    U                    V
 N   ρjj'  σj²  % Miss     Fw   Fw*    P       Fu   Fu*    P       Fv   Fv*    P
 12  Low    =      5      .145  .029  .054    .142  .032  .059    .102  .052  .056
 12  Low    ≠      5      .134  .033  .061    .130  .033  .059    .099  .051  .053
 12  High   ≠      5      .141  .032  .048    .134  .034  .049    .096  .050  .052
 24  Low    ≠      5      .087  .030  .046    .084  .028  .052    .081  .046  .046
 24  High   ≠      5      .089  .031  .060    .088  .031  .058    .088  .049  .059
 12  Low    =     10      .345  .047  .047    .341  .052  .044    .206  .057  .061
 12  Low    ≠     10      .335  .042  .049    .333  .047  .044    .215  .051  .033
 12  High   ≠     10      .354  .051  .039    .352  .055  .043    .214  .058  .048
 24  Low    ≠     10      .133  .020  .060    .135  .021  .064    .127  .044  .059
 24  High   ≠     10      .158  .025  .045    .158  .029  .043    .148  .055  .038

Tables 3.3 and 3.4 give empirical powers for the complete and missing data cases, respectively. Empirical power was not computed for the unadjusted parametric tests since they all produced inaccurate test sizes in the null case with missing data. For the complete data case, we found that the empirical powers of the parametric F tests essentially coincide with those of their permutation-based counterparts. The adjusted F tests had equal or lower power than the permutation-based tests when data were missing. The Pillai-Bartlett test power was roughly the same using either inferential method. In contrast, the power for the Wilks' and Hotelling-Lawley adjusted tests was approximately 0.7-0.9 times the power of the permutation-based tests. These results are not surprising given that the test sizes for Fv* were quite close to the nominal level, while they were noticeably conservative for Fw* and Fu*.
Consequently, the permutation test for V has a clear advantage over the permutation tests for either W or U, particularly when 10% of the data are missing. In the worst case condition with N = 12 and 10% missing data, the power for the permutation test for V was twice that for W, and four times that for U.

Table 3.3 Approximate Permutation Power for F Tests (0% Missing, 1000 Replications, ±0.031)

                  W           U           V
 N   ρjj'     Fw    P     Fu    P     Fv    P
 12  Low     .87   .87   .70   .75   .94   .95
 12  High    .88   .88   .71   .72   .94   .95
 24  Low     .82   .83   .79   .79   .84   .85
 24  High    .82   .84   .79   .80   .84   .87

Table 3.4 Approximate Permutation Power for F Tests (5%, 10% Missing, 1000 Replications, ±0.031)

                                   W             U             V
 N   ρjj'  σj²  % Missing     Fw*    P      Fu*    P      Fv*    P
 12  Low    ≠       5         .62   .76    .33   .48    .88   .91
 12  High   ≠       5         .62   .76    .31   .47    .89   .91
 24  Low    ≠       5         .67   .77    .60   .72    .79   .81
 24  High   ≠       5         .68   .78    .61   .74    .79   .81
 12  Low    ≠      10         .37   .43    .18   .20    .82   .81
 12  High   ≠      10         .37   .43    .19   .18    .82   .82
 24  Low    ≠      10         .53   .71    .44   .65    .72   .76
 24  High   ≠      10         .54   .68    .46   .61    .75   .73

3.5 AN EXAMPLE: THE EFFECT OF CHOLINE DEFICIENCY ON HUMANS

In this section we illustrate the use of approximate permutation tests for MANOVA by analyzing data from a study that examined the effects of choline deprivation on plasma choline concentration over 35 days in healthy male subjects (Zeisel, DaCosta, Franklin, Alexander, Lamont, Sheard and Beiser, 1991). Subjects were given a standard diet which included 500 mg/day of choline for one week, and then were randomly assigned to two diet groups, one that contained choline and one that did not. During the 5th week of study, all subjects again consumed a diet containing choline. Blood samples for choline analyses were obtained before the start of the study (day 0) and on days 7, 14, 21, 28, and 35. Only one of the 14 subjects with baseline data had any missing data, with the single missing response being for day 35. The data are presented in Table 3.5.
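The adjusted F tests applied to these data replace N by a summary of the pairwise non-missing counts Njj'. A sketch of computing those counts and the two summaries used, with hypothetical helper names and toy data (whether any within-column counts Njj also enter the harmonic mean is our assumption; only the between-column pairs are used here):

```python
def pairwise_counts(y):
    # N_jj': number of subjects with both responses j and j' observed,
    # for each pair of columns j < j'; missing values are None.
    p = len(y[0])
    return [sum(row[j] is not None and row[k] is not None for row in y)
            for j in range(p) for k in range(j + 1, p)]

def harmonic_and_min(y):
    # The harmonic mean and the minimum of the pairwise counts: the
    # two replacements for N used by the adjusted F tests.
    c = pairwise_counts(y)
    return len(c) / sum(1.0 / n for n in c), min(c)

# Toy 5-subject, 3-response data with one missing value:
y = [[1.0, 2.0, 3.0],
     [0.5, 1.5, None],
     [2.0, 2.5, 3.5],
     [1.2, 2.2, 3.1],
     [0.8, 1.8, 2.9]]
n_harm, n_min = harmonic_and_min(y)   # pairwise counts: 5, 4, 4
```

For these toy data the harmonic mean is 3/(1/5 + 1/4 + 1/4) ≈ 4.29, which is between the minimum (4) and the full sample size (5), mirroring the ordering of the adjusted degrees of freedom in the example.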
A multivariate analysis of covariance model of difference scores allowed testing the effects of diet on the plasma choline concentration over time, while controlling for treatment group differences in baseline choline levels. Note that the permutation distribution is based on reassignments of both the response vector and the baseline covariate to treatment groups. Interpretation of treatment effects using this method should therefore be thought of as being conditional upon the set of responses and covariates actually obtained.

The null hypothesis of interest was a test of no treatment group effect on the set of responses. For this particular design with two treatment groups, all multivariate F tests coincide. Using all of the available data, the value of the unadjusted F statistic is 3.48 with ν1 = 5 and ν2 = 7 numerator and denominator degrees of freedom, respectively. The corresponding p value is 0.067. Given the inflated test sizes reported in Table 3.2 for the unadjusted tests, this p value should be viewed with caution. When significance is determined using a permutation approach the p value is 0.082. The adjusted F statistic based on replacing N by the harmonic mean number of non-missing pairs of responses was 3.31 with ν1 = 5 and ν2 = 6.65 degrees of freedom, leading to p = 0.080. Using an even more conservative adjustment based on the minimum number of non-missing pairs led to a p value of 0.108 based on F = 2.98 with ν1 = 5 and ν2 = 6 degrees of freedom. All missing data analysis methods led to lower significance values than the analysis based on complete cases only (F = 2.74, ν1 = 5, ν2 = 6, and p = 0.126).

Table 3.5 Choline Measurements Over 5-Week Period in Male Subjects. [Plasma choline concentrations for the 14 subjects in the Control and Deficient treatment groups on days 0, 7, 14, 21, 28, and 35; one subject's day 35 response is missing.]

3.6 CONCLUSIONS

Conclusion 1. When large sample test statistics cannot be justified due to inadequate sample size, and the methods apply, an approximate permutation test can be used to ensure validity.

Conclusion 2. Simulation results suggest that both the parametric multivariate F tests and their corresponding approximate permutation tests control test size when the data are complete.

Conclusion 3. With 5% or 10% missing data, 1) unadjusted parametric tests yield inflated type I error rates, 2) adjusted tests work well, but can be conservative, and 3) permutation-based methods provide test sizes which are close to the nominal rate.

Conclusion 4. Parametric tests and approximate permutation tests are equally powerful in the complete case.

Conclusion 5. Under all missing data conditions, the permutation tests have power at least as great as the adjusted parametric tests.

Conclusion 6. The permutation test for the Pillai-Bartlett trace has a clear advantage over other permutation tests.

Conclusion 7. Much work needs to be done to determine the usefulness and limitations of the permutation procedure. Currently, permutation tests are limited to fairly simple designs and have no universally accepted test for an interaction in factorial designs.

REFERENCES

Barton, C. N. and Cramer, E. C.
(1989), "Hypothesis Testing in Multivariate Linear Models with Randomly Missing Data," Communications in Statistics - Simulations, 18, 875-895.

Beale, E. M. L. and Little, R. J. A. (1975), "Missing Values in Multivariate Analysis," Journal of the Royal Statistical Society-B, 37, 129-145.

Bradbury, I. (1987), "Analysis of Variance Versus Randomization Tests - A Comparison," British Journal of Mathematical and Statistical Psychology, 40, 177-187.

Catellier, D. J. and Muller, K. E. (1998), "Inference for the General Linear Multivariate Model with Missing Data in Small Samples," Institute of Statistics Mimeo Series No. XXXX, University of North Carolina, Chapel Hill.

Chung, J. H. and Fraser, D. A. S. (1958), "Randomization Tests for a Multivariate Two-Sample Problem," Journal of the American Statistical Association, 53, 729-735.

Edgington, E. S. (1995), Randomization Tests (3rd edition), New York: Marcel Dekker.

Efron, B. and Tibshirani, R. (1993), An Introduction to the Bootstrap, New York: Chapman & Hall.

Friedman, J. H. and Rafsky, L. C. (1979), "Multivariate Generalizations of the Wald-Wolfowitz and Smirnov Two-Sample Test," The Annals of Statistics, 7, 697-717.

Good, P. I. (1994), Permutation Tests: A Practical Guide to Resampling Methods for Testing Hypotheses, New York: Springer-Verlag.

Jerdack, G. R. and Sen, P. K. (1990), "Nonparametric Tests of Restricted Interchangeability," Annals of the Institute of Statistical Mathematics, 42, 99-114.

Johnson, N. L. and Kotz, S. (1972), Continuous Multivariate Distributions, New York: John Wiley.

Koch, G. G. and Gillings, D. B. (1983), "Inference, design based vs. model based," in Kotz, S. and Johnson, N. L., eds., Encyclopedia of Statistical Sciences, New York: John Wiley, 4: 84-88.

Lehmann, E. L. (1986), Testing Statistical Hypotheses, New York: John Wiley.

Little, R. J. A. and Rubin, D. B. (1987), Statistical Analysis with Missing Data, New York: John Wiley.

Ludbrook, J. and Dudley, H.
(1998), "Why Permutation Tests are Superior to t and F Tests in Biomedical Research," The American Statistician, 52, 127-132.

Muller, K. E. and Peterson, B. L. (1984), "Practical Methods for Computing Power in Testing the Multivariate Linear Hypothesis," Computational Statistics and Data Analysis, 2, 143-158.

Olson, C. L. (1976), "Choosing a Test Statistic in Multivariate Analysis," Psychological Bulletin, 86, 579-586.

Pitman, E. J. G. (1937a), "Significance Tests which can be Applied to Samples from any Populations," Journal of the Royal Statistical Society, B, 4, 119-130.

Pitman, E. J. G. (1937b), "Significance Tests which can be Applied to Samples from any Populations. II. The Correlation Coefficient," Journal of the Royal Statistical Society, B, 4, 225-232.

Pitman, E. J. G. (1938), "Significance Tests which can be Applied to Samples from any Populations. III. The Analysis of Variance Test," Biometrika, 29, 322-335.

Puri, M. L. and Sen, P. K. (1993), Nonparametric Methods in Multivariate Analysis, Florida: Krieger Publishing Company.

Romano, J. P. (1989), "Bootstrap and Randomization Tests of Some Nonparametric Hypotheses," The Annals of Statistics, 17, 141-159.

Rubin, D. B. (1976), "Inference and Missing Data," Biometrika, 63, 581-592.

Scheffe, H. (1959), The Analysis of Variance, New York: John Wiley.

Schluchter, M. D. and Elashoff, J. D. (1990), "Small-sample Adjustments to Tests with Unbalanced Repeated Measures Assuming Several Covariance Structures," Journal of Statistical Computing - Simulations, 37, 69-87.

Servy, E. C. and Sen, P. K. (1987), "Missing Variables in Multi-Sample Rank Permutation Tests for MANOVA and MANCOVA," Sankhya, 49, 78-95.

Still, A. W. and White, A. P. (1981), "The Approximate Randomization Test as an Alternative to the F Test in Analysis of Variance," British Journal of Mathematical and Statistical Psychology, 34, 243-252.

Wald, A. and Wolfowitz, J.
(1944), "Statistical Tests Based on Permutations of the Observations," The Annals of Mathematical Statistics, 15, 358-372.

Zeisel, S. H., DaCosta, K., Franklin, P. D., Alexander, E. A., Lamont, J. T., Sheard, N. F. and Beiser, A. (1991), "Choline, an Essential Nutrient for Humans," FASEB, 5, 2093-2098.

Chapter 4

LINMOD 4: A PROGRAM FOR GENERAL LINEAR MULTIVARIATE MODELS WITH MISSING DATA

Diane J. Catellier
Department of Biostatistics
CB#7400
University of North Carolina
Chapel Hill, North Carolina 27599-7400

D. Catellier: telephone 919-966-7283, email [email protected], FAX 919-966-3804

Key words: test size, randomization test, multivariate linear models, MANOVA

SUMMARY

LINMOD 4 computes estimates of the parameters of a General Linear Multivariate Model (GLMM) and performs tests of the general linear hypothesis in the presence of data assumed to be missing at random (MAR). The EM algorithm provides maximum likelihood estimates of expected value and covariance parameters using all of the available data. The program computes approximate tests described by Catellier and Muller (1998). With complete data the tests reduce to standard multivariate tests, including Wilks, Hotelling-Lawley, and Pillai-Bartlett. Approximate Geisser-Greenhouse corrected and uncorrected tests for the "univariate" approach to repeated measures are also available. The tests differ from the standard ones by reducing the error degrees of freedom: the number of independent sampling units is replaced by a function of the numbers of non-missing pairs of responses. Simulation results of Catellier and Muller (1998) lead to the conclusion that the approach provides the best currently available methods for controlling test size at or below the nominal rate. The methods control test size even with as few as 12 observations for 6 repeated measurements and 5% missing data. The source code, an extensive user's guide, and example programs may be obtained free of charge via the Internet.
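The degrees of freedom adjustment described in the summary can be sketched as follows. This Python sketch is illustrative only, not LINMOD source: the option names (NJJ_A, NJJP_H, NJJP_MIN) follow the LINMOD documentation, but the exact rule definitions here are my reading of the text, with None marking a missing response.

```python
def pairwise_counts(y):
    """N_jj': the number of rows in which responses j and j' are both
    observed. The diagonal entry N_jj counts the rows observed for
    response j. y is a list of rows; a missing response is None."""
    p = len(y[0])
    return [[sum(1 for row in y if row[j] is not None and row[k] is not None)
             for k in range(p)] for j in range(p)]

def n_star(y, rule):
    """Candidate replacements for N in the error df, nu_E = N* - rank(X)."""
    njj = pairwise_counts(y)
    p = len(njj)
    diag = [njj[j][j] for j in range(p)]
    pairs = [njj[j][k] for j in range(p) for k in range(j + 1, p)]
    if rule == "NJJ_A":    # mean number of non-missing responses
        return sum(diag) / p
    if rule == "NJJP_H":   # harmonic mean number of non-missing pairs
        return len(pairs) / sum(1.0 / c for c in pairs)
    if rule == "NJJP_MIN": # minimum number of non-missing pairs
        return min(pairs)
    raise ValueError("unknown adjustment rule: " + rule)
```

Per the recommendations summarized above, the Geisser-Greenhouse test would use NJJ_A, Pillai-Bartlett the stronger NJJP_H, and Wilks and Hotelling-Lawley the most aggressive NJJP_MIN.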
4.1 MOTIVATION

Many commercial vendors and shareware sources provide a great variety of flexible programs for analyzing multivariate Gaussian data with no missing observations. We focus here on estimation and testing in a class of models often described as the General Linear Multivariate Model (GLMM). For example, see Muller, LaVange, Ramey, and Ramey (1992) for a detailed specification from the perspective of power analysis. For the purposes of data analysis, the most important special cases include Multivariate Analysis of Variance and Covariance (MANOVA, MANCOVA), and the multivariate and univariate approaches to repeated measures ANOVA. Furthermore, some forms of discriminant analysis and canonical correlation also represent special cases.

Missing data present one of the most common and vexing problems in such analyses, especially in small samples. For data otherwise meeting the assumptions of the model, and data missing at random (MAR; Rubin, 1976), the EM algorithm (Orchard and Woodbury, 1972; Dempster, Laird, and Rubin, 1977) allows easily computing maximum likelihood (ML) estimates of all model parameters (expected value and covariance matrices) using all of the available data. As in the complete data setting, widely available commercial software and shareware conveniently provide ML estimates.

As of this writing, no available software provides an efficient means of controlling test size at or below the nominal level in small samples in the presence of missing data (Schluchter and Elashoff, 1990; Barton and Cramer, 1989; Catellier and Muller, 1998). Deletion of cases with missing values is the default option in many statistical software packages. While this approach is unbiased, it can be extremely inefficient in small samples. Recently Catellier and Muller (1998) described approximate tests for the setting of interest. With complete data, the tests reduce to standard multivariate tests, including Wilks, Hotelling-Lawley, and Pillai-Bartlett.
Approximate Geisser-Greenhouse corrected and uncorrected tests for the "univariate" approach to repeated measures are also available. The tests differ from the standard ones by reducing the error degrees of freedom: the number of independent sampling units is replaced by a function of the numbers of non-missing pairs of responses. Simulation results of Catellier and Muller (1998) lead to the conclusion that the approach provides the best currently available methods for controlling test size at or below the nominal rate. The methods control test size even with as few as 12 observations for 6 repeated measurements and 5% missing data. Hence a computer program which implements the new methods in a convenient fashion would likely prove extremely useful for the practice of data analysis.

4.2 A NEW APPROACH

4.2.1 Overview

This paper introduces LINMOD 4, a SAS® PROC IML program which implements the methods of Catellier and Muller (1998). For mathematical details consult that paper, which closely follows the notation of Muller, LaVange, Ramey, and Ramey (1992). The parameters of the specified models are estimated using the EM algorithm. The parameter estimates are then used to compute analogs of the hypothesis (H) and error (E) sums of squares matrices. In turn, analogs of the Hotelling-Lawley trace, the Pillai-Bartlett trace, Wilks' Lambda, the Geisser-Greenhouse corrected univariate test, and the corresponding uncorrected test are computed. For example, for the Hotelling-Lawley analog, compute tr(H E^-1). Next an adjusted sample size (N*) is computed to replace N (the total sample size) in calculating the error degrees of freedom for the F approximations commonly used. In all cases N* equals a function of the number of non-missing pairs of responses. The best choice for N* depends on the test statistic and on the hypothesis of interest. By default, the LINMOD program uses the N* choices recommended by Catellier and Muller (1998).
However, the user is free to override the default with various other choices for N*. In all cases the methods reduce to standard ones for complete data.

4.2.2 Comparison to Other Programs

As of this writing, some widely-used commercial software packages which have implemented the EM algorithm for estimation with missing data in the GLMM include BMDP® 7.0 (BMDPAM and BMDP8D modules), SPSS® 7.5 (Missing Value Analysis), and SAS® (PROC MIXED). The ML estimates are used to produce inferential statements (e.g., interval estimates and p values) that are based on a number of variations of commonly used test statistics and large sample theory.

There are obviously other strategies for treating the problem of missing data. Computational routines for multiple imputation of multivariate Gaussian data, for example, have been written for use with S-PLUS 4.0®, and are also available in SOLAS® for Missing Data Analysis. In multiple imputation, each missing value is replaced by m > 1 simulated values. The resulting m versions of the complete data are analyzed by standard complete-data methods. Issues related to inference using multiple imputation have not been extensively investigated, especially in small samples.

Earlier versions of LINMOD were written to allow a sophisticated user complete control over a GLMM analysis by giving easy computational access to all intermediate results and by allowing the user to specify contrasts and control structure in a matrix language. The software proves very useful in helping students learn linear models analysis and theory. Furthermore the software provides extremely useful modules for simulations involving the GLMM. Early versions of LINMOD pre-dated availability of any similarly powerful commercial software.
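For the multiple-imputation alternative mentioned above, the m complete-data analyses are conventionally combined with Rubin's rules: the pooled estimate is the average of the m estimates, and the total variance adds the within-imputation variance to an inflated between-imputation component. A minimal sketch for a scalar parameter, illustrative only and not part of LINMOD or the packages named above:

```python
def pool_estimates(estimates, variances):
    """Combine m complete-data results (point estimates and their
    squared standard errors) using Rubin's rules."""
    m = len(estimates)
    q_bar = sum(estimates) / m                      # pooled point estimate
    w_bar = sum(variances) / m                      # within-imputation variance
    b = sum((q - q_bar) ** 2 for q in estimates) / (m - 1)  # between-imputation
    total = w_bar + (1 + 1 / m) * b                 # total variance
    return q_bar, total
```

The (1 + 1/m) factor accounts for using only finitely many imputations; the small-sample behavior of the resulting interval estimates is exactly the open question noted above.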
As noted earlier, many commercial packages now provide analysis for a wide range of GLMM models. The trade-off in using LINMOD in this setting is increased flexibility and control for the sophisticated user versus a friendly interface for the naive user. For example, LINMOD does not add or in any way recognize an intercept term in any model. The user must code one if desired, and may choose to test it, if present. The user should be familiar with the material presented in such texts as Timm (1975).

The primary advantage of LINMOD 4 over previous versions lies in its ability to provide accurate inferences in the presence of missing data. The importance of using adjusted tests increases with decreasing sample size. For example, consider a balanced design with N = 12 observations, 3 in each of q = 4 treatment groups, p = 3 repeated measures, and a target test size of α = .05. Using ML estimates, but not adjusting the sample size in error degrees of freedom calculations, gives a Wilks test analog with a type I error rate of roughly .05 for no missing data, .14 for 5% missing data, and .34 for 10% missing data. Other multivariate tests (and the univariate approach to repeated measures tests) perform as badly. Similar results also occur with currently available tests commonly used (which are based on large sample theory).

4.2.3 History of the Program

LINMOD 4 constitutes at least the sixth generation of such software. In the 1970's, a FORTRAN program (MGLM) was used for multivariate analysis by UNC-CH Biostatistics faculty and students. The second generation was written in PROC MATRIX to take advantage of the functions in the language. The third generation (LINMOD 1.0) provided many more functions, as well as a better user interface and error checking. Version 2 was a modest revision allowing easier support and greater portability. The introduction of IML (and the demise of PROC MATRIX) led, somewhat belatedly, to LINMOD 3.0 in IML.
Great effort was made to allow for efficient analysis of large data sets. Version 4 differs from Version 3 mainly in its ability to handle missing data.

4.3 LINMOD 4 PROGRAM

LINMOD 4 contains sixteen modules. The modules must be executed in a particular sequence of steps. For example, a secondary hypothesis cannot be tested without first fitting a model, nor can a model be fitted without first creating the proper sums of squares and cross products matrix. The steps required for data analysis in LINMOD are as follows: 1) invoke PROC IML and read the data corresponding to independent and dependent variables into matrices, 2) calculate expected value parameters (modules MAKESS, GETCORSS), 3) compute the estimates of the parameters of a GLMM (module FITMODEL), and 4) perform tests of a general linear hypothesis (module TESTGLH). When data are missing, steps 2 and 3 are combined into one module called FITML. Module TESTGLH computes multivariate test statistics.

4.4 RECOMMENDED OPTION SETTINGS

In each application, the user must decide which degree of freedom adjustment to use. Deciding among the 11 options depends on 1) the test statistic of interest, and 2) the hypothesis of interest (most importantly, whether the matrix of secondary parameters has rank one). The decision can be made using the tables in Catellier and Muller (1998) as a guide, or simply by using the NBEST option.

4.5 EXAMPLE

Table 4.1 contains data from a study that examined the effects of choline deprivation on plasma choline concentration over 35 days in healthy male subjects (Zeisel, DaCosta, Franklin, Alexander, Lamont, Sheard and Beiser, 1991). Subjects were given a standard diet which included 500 mg/day of choline for one week, and then were randomly assigned to two diet groups, one that contained choline and one that did not. During the fifth week of the study, all subjects again consumed a diet containing choline.
Blood samples for choline analyses were obtained before the start of the study (day 0) and on days 7, 14, 21, 28, and 35. One subject had data missing for day 35. Large amounts of missing data in a very small experiment would seem very worrisome.

Table 4.1 Choline Measurements Over 5-Week Period in Male Subjects

                                Day
Treatment       0       7      14      21      28      35
Control       9.93   12.29    9.30    9.51   10.84    9.24
              9.77    8.14   11.43    9.44   11.10   10.56
             12.56   10.90   11.19   12.31    9.95   12.78
             10.15   10.32    8.86    9.23    8.56   12.39
             11.00    9.20    8.78    9.37    7.54    9.74
             10.46    8.72    8.13    8.14   11.76      .
Deficient    12.15    9.52    9.05    9.07    6.76    9.39
             12.88    9.66    7.71    7.29    6.37   10.61
              7.94    9.86    7.87    8.89    8.69   12.28
              9.42   12.82    7.17    8.18    8.30   12.61
              9.57   10.95    9.01    8.98    6.56    9.66
             11.54   10.43    8.66    8.60    7.87    9.69
             11.65   10.64    9.81    8.04    7.52    8.76
              8.73    8.08    7.70    6.44    6.42    8.93

A multivariate analysis of covariance model of difference scores allowed testing the effects of diet on plasma choline concentration over time, while controlling for treatment group differences in baseline choline levels. The null hypothesis of interest is a test of the trends by treatment interaction. For this particular design with two treatment groups, all multivariate F tests coincide (using the same N*). The Geisser-Greenhouse corrected univariate test is denoted by FGG. The value of the unadjusted F statistic (N* = N) is 2.77 with ν1 = 4 and ν2 = 8 numerator and denominator degrees of freedom, respectively. The corresponding p value is 0.10. The Geisser-Greenhouse corrected univariate test results are FGG = 2.67, ν1 = 2.95, ν2 = 32.4 and p = 0.065. Given the inflated test sizes reported in Catellier and Muller (1998) for the unadjusted tests, this p value should be viewed with caution.

Figure 4.2 contains the analysis results obtained from LINMOD using the adjusted degree of freedom method. The adjusted F statistic based on replacing N by the harmonic mean number of non-missing pairs of responses was 2.65 with ν1 = 4 and ν2 = 7.65 degrees of freedom, leading to p = 0.12. The results using the univariate approach to repeated measures are FGG = 2.58, ν1 = 2.95, ν2 = 31.4 and p = 0.072. Both missing data analysis methods led to a smaller p value than the analysis based on 13 complete cases (F = 2.00, ν1 = 4, ν2 = 7 and p = 0.20; FGG = 2.40, ν1 = 2.8, ν2 = 28.3 and p = 0.092).

Figure 4.1 LINMOD Programming Statements for Choline Example

LIBNAME LIBPATH "H:\DJC\DISSERT\CHOLINE\";
%LET LMDIRECT = H:\DJC\LINMOD4\SOURCE\;
%INCLUDE "&LMDIRECT.MACROLIB.MAC" / NOSOURCE2;
PROC IML WORKSIZE=5000 SYMSIZE=5000;
&LINMOD;
USE LIBPATH.CHOLINE;
READ ALL VAR{DAY7 DAY14 DAY21 DAY28 DAY35} INTO Y [COLNAME=DEPVARS];
READ ALL VAR{GROUP} INTO GROUP;
READ ALL VAR{DAYO} INTO DAYO;
CELLMEAN = DESIGN(GROUP);
X = CELLMEAN || DAYO;              ** CELL MEAN CODING;
INDVARS = {"CONTROL" "DEFICIENT" "DAYO"};
ZNAMES = INDVARS || DEPVARS;
Z = X || Y;
P = NCOL(Y);
*------------------------------*;
* REPEATED MEASURES HYPOTHESIS *;
*------------------------------*;
C = {-1 1 0};
RUN UPOLY1((1:5), "TIME", U, UNAME);
THETARNM = {"DEFICIENT-CONTROL"};
THETACNM = {"LIN" "QUAD" "CUB" "QUAR"};
OPT_ON = {"NBEST"};
RUN SETOPT;
RUN FITML;
RUN TESTGLH;
QUIT;

Figure 4.2 LINMOD Output for Choline Example

Model Parameters:
     N   ncol(Y)   rank(X)   rank(W)   Tolerance
    14         5         3         0   1.5872E-9

BETA - Matrix of Parameter Estimates
               DAY7    DAY14    DAY21    DAY28    DAY35
CONTROL      9.5664   7.0619   7.408   12.555   12.842
DEFICIENT    9.8885   5.8578   5.9615   9.8688  11.961
DAYO         0.034    0.2398   0.2122  -0.244   -0.164

Expanded Columns of BETA
*** NOTE: Degrees of freedom based on NSTAR = NJJP_MIN

               DAY7   Std Err   t Value   2 Tail p
CONTROL      9.5664    3.2441    2.9489     0.0146
DEFICIENT    9.8885    3.181     3.1086     0.0111
DAYO         0.034     0.2987    0.1138     0.9116

              DAY14   Std Err   t Value   2 Tail p
CONTROL      7.0619    2.3775    2.9703     0.014
DEFICIENT    5.8578    2.3313    2.5127     0.0308
DAYO         0.2398    0.2189    1.0956     0.2989

              DAY21   Std Err   t Value   2 Tail p
CONTROL      7.408     2.4668    3.0031     0.0133
DEFICIENT    5.9615    2.4188    2.4646     0.0334
DAYO         0.2122    0.2271    0.9342     0.3722

              DAY28   Std Err   t Value   2 Tail p
CONTROL     12.555     2.7032    4.6445     0.0009
DEFICIENT    9.8688    2.6506    3.7232     0.004
DAYO        -0.244     0.2489   -0.98       0.3502

              DAY35   Std Err   t Value   2 Tail p
CONTROL     12.842     3.3264    3.8607     0.0032
DEFICIENT   11.961     3.2618    3.667      0.0043
DAYO        -0.164     0.3063   -0.536      0.604

Estimated SIGMA: ML_SIGMAHAT*(_N_/(_N_-_RANKX_))
_SIGMA_       DAY7    DAY14    DAY21    DAY28    DAY35
DAY7        2.2593   -0.096   0.7498   0.3501   0.5722
DAY14      -0.096     1.2135  0.6967   0.2509  -0.444
DAY21       0.7498    0.6967  1.3064   0.1892   0.5934
DAY28       0.3501    0.2509  0.1892   1.5687  -0.458
DAY35       0.5722   -0.444   0.5934  -0.458    2.3755

Estimated Error Correlation Matrix of SIGMA
_SCORR_       DAY7    DAY14    DAY21    DAY28    DAY35
DAY7         1.000   -0.058    0.436    0.186    0.247
DAY14       -0.058    1.000    0.553    0.182   -0.262
DAY21        0.436    0.553    1.000    0.132    0.337
DAY28        0.186    0.182    0.132    1.000   -0.237
DAY35        0.247   -0.262    0.337   -0.237    1.000

C
                     CONTROL   DEFICIENT   DAYO
DEFICIENT-CONTROL         -1           1      0

U
              LIN      QUAD      CUB      QUAR
DAY7       -0.632    0.5345   -0.316    0.1195
DAY14      -0.316   -0.267     0.6325  -0.478
DAY21       0       -0.535     0        0.7171
DAY28       0.3162  -0.267    -0.632   -0.478
DAY35       0.6325   0.5345    0.3162   0.1195

THETA is the estimate of CBU
                      LIN     QUAD      CUB     QUAR
DEFICIENT-CONTROL   -1.23    1.514   0.5567   0.7557

Column of THETA with Associated Statistics
*** NOTE: Degrees of freedom based on NSTAR = NJJP_H
                      LIN   Std Err   t Value      DF   2tail p   R Sqrd
DEFICIENT-CONTROL   -1.23    0.6491    -1.894   10.65    0.0856    0.252

Column of THETA with Associated Statistics
*** NOTE: Degrees of freedom based on NSTAR = NJJP_H
                     QUAD   Std Err   t Value      DF   2tail p   R Sqrd
DEFICIENT-CONTROL   1.514    0.7514    2.0149   10.65    0.0699    0.276

Column of THETA with Associated Statistics
*** NOTE: Degrees of freedom based on NSTAR = NJJP_H
                      CUB   Std Err   t Value      DF   2tail p   R Sqrd
DEFICIENT-CONTROL  0.5567    0.6503    0.8562   10.65    0.4107    0.0644

Column of THETA with Associated Statistics
*** NOTE: Degrees of freedom based on NSTAR = NJJP_H
                     QUAR   Std Err   t Value      DF   2tail p   R Sqrd
DEFICIENT-CONTROL  0.7557    0.593     1.2744   10.65    0.2296    0.1323

Univariate Tests for columns of THETA = C*BETA*U - THETA0
*** NOTE: Degrees of freedom based on NSTAR = NJJP_H
         F Value   Num df   Den df   p Value   R Sqrd
LIN       3.5887        1    10.65    0.0856    0.252
QUAD      4.0597        1    10.65    0.0699    0.276
CUB       0.7331        1    10.65    0.4107    0.0644
QUAR      1.6242        1    10.65    0.2296    0.1323

Estimated Correlation Matrix based on E
_R_         LIN     QUAD      CUB     QUAR
LIN       1.000    0.250    0.263    0.102
QUAD      0.250    1.000   -0.068    0.630
CUB       0.263   -0.068    1.000    0.340
QUAR      0.102    0.630    0.340    1.000

Generalized squared canonical correlations: CanVar1
_STMAT1_     VALUE   APPROX F   NUM DF   DENOM DF   P VALUE   ASSOCITN
LROOT       0.5811     3.468         4         10    0.0504     0.5811
LAMBDA      0.4189     2.4276        4          7    0.1443     0.5811
HLTRACE     1.3872     2.4276        4          7    0.1443     0.5811
PILTRACE    0.5811     2.4276        4          7    0.1443     0.5811
*** NOTE: Degrees of freedom for LAMBDA, HLTRACE: NSTAR = NJJP_MIN
*** NOTE: Degrees of freedom for PILTRACE: NSTAR = NJJP_H

Univariate Repeated Measures Tests
*** NOTE: Degrees of freedom for Gsr Grnh based on NSTAR = NJJ_A
                F   DF Numer   DF Denom   epslnHat   p Value
Uncrrctd   2.6131        4       43.2      1          0.0483
Gsr Grnh   2.6131   2.9495     31.855      0.7374     0.0692

4.6 ACQUIRING THE PROGRAM

Free copies of the program and accompanying documentation may be acquired either via the World Wide Web or FTP. Connect to the web site http://www.bios.unc.edu/~muller/ and then follow the directions presented there. Alternately, use anonymous FTP at ftp://www.bios.unc.edu/pub/faculty/muller/linmod40/ to acquire the software and documentation.

REFERENCES

Barton, C. N. and Cramer, E. C.
(1989), "Hypothesis Testing in Multivariate Linear Models with Randomly Missing Data," Communications in Statistics - Simulations, 18, 875-895.

Catellier, D. J. and Muller, K. E. (1998), "Inference for the General Linear Multivariate Model with Missing Data in Small Samples," Institute of Statistics Mimeo Series No. XXXX, University of North Carolina, Chapel Hill.

Dempster, A. P., Laird, N. M. and Rubin, D. B. (1977), "Maximum Likelihood from Incomplete Data via the EM Algorithm (with Discussion)," Journal of the Royal Statistical Society-B, 39, 1-38.

Muller, K. E., LaVange, L. M., Ramey, S. L. and Ramey, C. T. (1992), "Power Calculations for General Linear Multivariate Models Including Repeated Measures Applications," Journal of the American Statistical Association, 87, 1209-1226.

Orchard, T. and Woodbury, M. A. (1972), "A Missing Information Principle: Theory and Applications," in Proceedings of the 6th Berkeley Symposium on Mathematical Statistics and Probability, 1, 697-715, Berkeley, California: University of California Press.

Rubin, D. B. (1976), "Inference and Missing Data," Biometrika, 63, 581-592.

Schluchter, M. D. and Elashoff, J. D. (1990), "Small-sample Adjustments to Tests with Unbalanced Repeated Measures Assuming Several Covariance Structures," Journal of Statistical Computing - Simulations, 37, 69-87.

Timm, N. H. (1975), Multivariate Analysis, Monterey, California: Brooks/Cole.

Zeisel, S. H., DaCosta, K., Franklin, P. D., Alexander, E. A., Lamont, J. T., Sheard, N. F. and Beiser, A. (1991), "Choline, an Essential Nutrient for Humans," FASEB, 5, 2093-2098.

Chapter 5

CONCLUSIONS AND FUTURE RESEARCH

5.1 LOOKING BACKWARDS; SUCCESSES FROM THIS RESEARCH

Recall that the motivation for this research was to provide defensible inference for Gaussian data with missing data in small samples. Two approaches to estimation and testing with missing data in the General Linear Multivariate Model (GLMM) are commonly used.
1) The EM algorithm is first used to provide maximum likelihood (ML) estimates of expected value and covariance parameters using all of the available data. The ML estimates are then used to compute analogs of the standard multivariate tests. 2) The GLMM can be transformed into a mixed model framework by creating a separate record for each observation. This modeling approach allows for missing data. Likelihood-based estimates can readily be computed using iterative methods and then used to produce inferences that are based on large sample theory.

I began by first documenting the limitations of the commonly used inference approaches described above in a series of null case simulations. For example, consider a balanced design with N = 12 observations, 3 in each of q = 4 treatment groups, p = 3 repeated measures, and a target test size of α = .05. Using ML estimates and unadjusted error degrees of freedom gives a Wilks test analog with a type I error rate of roughly .05 for no missing data, .14 for 5% missing data, and .34 for 10% missing data. Other multivariate tests (and the univariate approach to repeated measures tests) perform as badly. Similar results also occur using the mixed model approach, with approximate test sizes of .13, .17, and .25 for 0%, 5%, and 10% missing data, respectively.

In the first paper, I examined the performance of an inference strategy that generalizes a suggestion due to Barton and Cramer (1989). In all cases the EM algorithm provides ML estimates. In turn, a function of the number of non-missing pairs of responses (N*) replaces N (the number of subjects) in calculating error degrees of freedom for approximate F tests. Simulation results suggest that the best choice for N* varies with the test statistic. Replacing N by the mean number of non-missing responses works best for the Geisser-Greenhouse test. The Pillai-Bartlett test requires the stronger adjustment of replacing N by the harmonic mean number of non-missing pairs of responses.
For Wilks' and Hotelling-Lawley, an even more aggressive adjustment based on the minimum number of non-missing pairs must be used to control test size at or below the nominal rate. Overall, simulation results allowed concluding that an adjusted test can always control test size at or below the nominal rate, even with as few as 12 observations and up to 10% missing data.

Although the adjusted tests described in the first paper limit test size to no more than the nominal level, they can be noticeably conservative, especially at the smallest sample sizes. In the second paper I sought to regain the lost power by using approximate permutation methods. Exact permutation methods are guaranteed to provide accurate inferences for a particular range of hypotheses, even in very small samples. Based on simulation results, the approximate permutation tests apparently provide unbiased tests, even for very small samples (N = 12) and up to 10% missing data. Overall, the permutation-based methods had equal or higher power than the adjusted F tests in all cases, while still controlling test size. Unfortunately, using permutation tests disallows considering a broad range of often used designs and hypothesis tests. For example, any study with only within-subject factors cannot be analyzed with the approach without making a highly restrictive (and usually incorrect) assumption.

The successes of the first two papers led to the decision to create free software to implement the adjusted degrees of freedom methods. The convenience of the software will allow putting the methods into immediate practice in data analysis.

5.2 LOOKING FORWARD; FUTURE RESEARCH

5.2.1 Robustness

The approximate permutation tests appeal because of their natural applicability to small samples. Due to the distribution-free nature of the permutation tests, they would also seem natural for non-Gaussian samples with missing data.
However, the particular permutation tests which were examined in this work (using likelihood-based estimates and test statistics) may not be the most effective for non-Gaussian data. Naturally, finding alternative permutation tests which can be applied to non-Gaussian data with missing values would require extensive future research. The results here hold promise for such an approach.

5.2.2 Improving On the Adjusted Tests

A number of ideas seem worth pursuing in order to improve the adjusted tests. The use of REML estimation defines one direction. Experience indicates that REML estimates tend to have less bias. For example, for complete data the REML approach provides the unbiased estimator of the covariance matrix, while the ML estimate is biased. With complete data the two differ only by a factor of N/(N - r). The relationship is likely more complicated with missing data. The difference between the two matrices may suggest a degrees of freedom adjustment.

A more systematic approach may emanate from a method of moments attack. Either exact or approximate moments for parameter estimates or for the H and E analogs seem worth pursuing. Note the success of Satterthwaite type approximations in mixed models in general, and in the Geisser-Greenhouse corrected test in particular. Little and Rubin (1987, §7.5) expressed the relationship between the observed and complete information matrices as

    observed information = complete information - missing information.

Louis (1982) noted that, with data in hand, one can estimate two of the three terms and then solve for the third. In turn, the ratio of observed to complete information has great appeal as an effective sample size, in parallel to N*/N or (N* - r)/(N - r). Some promising analytic results for this approach have been derived for special cases of the GLMM.

5.2.3 Power Approximation

Many authors have studied power approximation for the GLMM. See Muller, LaVange, Ramey and Ramey (1992) for a review.
They also suggested a simple strategy for choosing sample sizes when one expects missing data: treat the complete-data power as an upper bound, and the power based on the expected complete cases only as a lower bound. This assumes that one can use all of the data, with an effective sample size between the two bounds. An appealing strategy for approximating power with missing data arises from studying the distribution of N*. In turn, consider power as the expected value of a set function. Note that current power approximations do not work as well as desired for some alternative hypotheses in very small samples. Hence such complete-data limitations would be inherited by missing data methods.

REFERENCES

Barton, C. N. and Cramer, E. C. (1989), "Hypothesis Testing in Multivariate Linear Models with Randomly Missing Data," Communications in Statistics - Simulation and Computation, 18, 875-895.

Little, R. J. A. and Rubin, D. B. (1987), Statistical Analysis with Missing Data, New York: John Wiley.

Louis, T. A. (1982), "Finding the Observed Information Matrix when Using the EM Algorithm," Journal of the Royal Statistical Society, Series B, 44, 226-233.

Muller, K. E., LaVange, L. M., Ramey, S. L. and Ramey, C. T. (1992), "Power Calculations for General Linear Multivariate Models Including Repeated Measures Applications," Journal of the American Statistical Association, 87, 1209-1226.

APPENDIX A
DOCUMENTATION FOR FITML MODULE IN LINMOD 4

A.1 Function

The FITML module creates the same matrices that are created by MAKESS and FITMODEL. The primary difference is that when data are missing, FITML uses all of the available data, whereas the other modules use only the complete cases. The FITML module can also be used in place of the calls to the following set of modules: &PROCSSCP, GETCORSS and FITMODEL.

A.2 Algorithm

The EM algorithm is used to obtain the maximum likelihood estimates of the model parameters, β, and covariance matrix, Σ.
These estimates, along with an estimate of (X'X)^{-1}, are used to create the uncorrected sums of squares and cross products matrix, _SS_. Next, the estimates of the parameters of a GLMM are computed (module FITMODEL), and tests of a general linear hypothesis are performed (module TESTGLH). The steps must be taken in order: a secondary hypothesis cannot be tested without first fitting a model, nor can a model be fitted without first creating the proper sums of squares and cross products matrix. When data are missing, steps 2 and 3 are combined into one module called FITML. Module TESTGLH computes multivariate test statistics.

A.3 User Inputs

In addition to the inputs that are required for the MAKESS module (Z and ZNAMES), the number of dependent variables must also be specified (P).

A.3.1 Z, the data matrix containing both the X's and Y's, with columns as variables.

A.3.2 ZNAMES, a character matrix of one row of names for the columns of Z. It must conform to Z.

A.3.3 P, a scalar indicating the number of dependent variables.

A.4 System Input Matrices

The following system matrices are required by this module:

_OPT_ - the option status matrix.
_ECODE_ - the error condition matrix.

A.5 Options and Defaults

The option names used to change the type of degrees of freedom adjustment are the following: NLIST, NJJP_MIN, NJJ_MIN, NJJP_H, NJJP_G, NJJP_A, NJJ_H, NJJ_G, NJJ_A, NJJ_MAX, NTOTAL, NBEST. See the table below for definitions. Let Njj' indicate the number of observations for which both Yij and Yij', for i in {1, ..., N}, have non-missing values. Note that Njj equals the number of cases observed for the jth response. All adjustments involve replacing N by N* in νE = N - rank(X). In all cases N* equals a function only of {Njj'}. The default value is NBEST, which uses the preferred adjustment for the test statistic and hypothesis at hand. All of the options that are available for MAKESS and FITMODEL work with FITML. See the documentation for those modules for specific details.
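The counts {Njj'} defined above are determined entirely by the pattern of missing responses. A minimal Python sketch (illustrative only, not LINMOD syntax) computes the full P x P matrix of pairwise non-missing counts in one step:

```python
import numpy as np

def pairwise_counts(Y):
    """Njj': number of cases in which both response j and response j'
    are observed.  The diagonal entry Njj counts the cases observed
    for the jth response alone."""
    observed = (~np.isnan(Y)).astype(int)   # N x P indicator of non-missing
    return observed.T @ observed            # P x P matrix of pairwise counts
```

The diagonal of the returned matrix gives {Njj} and the off-diagonal entries give {Njj'}, so a single pass over the data supplies everything the sample size adjustments require.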
See Chapter 4 for an example program.

Sample Size Adjustments for Error Degrees of Freedom

Name       Function of {Njj'}
---------  ------------------------
NLIST      number of complete cases
NJJP_MIN   min({Njj'})
NJJ_MIN    min({Njj})
NJJP_H     harmonic mean({Njj'})
NJJP_G     geometric mean({Njj'})
NJJP_A     arithmetic mean({Njj'})
NJJ_H      harmonic mean({Njj})
NJJ_G      geometric mean({Njj})
NJJ_A      arithmetic mean({Njj})
NJJ_MAX    max({Njj})
NTOTAL     N
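Given the pairwise count matrix, most of the tabulated N* adjustments reduce to simple summaries of its diagonal ({Njj}) and off-diagonal ({Njj'}) entries. The following Python sketch is illustrative only; the option names mirror the table, but NLIST and NTOTAL are omitted because they require the raw data matrix rather than the counts alone.

```python
import numpy as np

def adjusted_error_df(njj, rank_x, option="NJJP_H"):
    """Adjusted error degrees of freedom, nu_E = N* - rank(X), with N*
    a summary of the pairwise non-missing counts.  Assumes at least two
    responses (P >= 2).  NLIST and NTOTAL (see table) are omitted because
    they need the raw data, not just the count matrix."""
    njj = np.asarray(njj, dtype=float)
    pairs = njj[np.triu_indices_from(njj, k=1)]   # off-diagonal {Njj'}, j < j'
    diag = np.diag(njj)                           # diagonal {Njj}
    hmean = lambda x: len(x) / np.sum(1.0 / x)
    gmean = lambda x: float(np.exp(np.mean(np.log(x))))
    n_star = {
        "NJJP_MIN": pairs.min(),
        "NJJ_MIN": diag.min(),
        "NJJP_H": hmean(pairs),
        "NJJP_G": gmean(pairs),
        "NJJP_A": pairs.mean(),
        "NJJ_H": hmean(diag),
        "NJJ_G": gmean(diag),
        "NJJ_A": diag.mean(),
        "NJJ_MAX": diag.max(),
    }[option]
    return n_star - rank_x
```

For a two-response design with count matrix [[12, 10], [10, 11]] and rank(X) = 2, the NJJP_MIN option yields νE = 10 - 2 = 8, while NJJ_MIN yields 11 - 2 = 9, illustrating how the pair-based adjustments are the more aggressive ones.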