"e AN INVESTIGATION OF SOME MULTIPLE REGRESSION METHODS FOR INCOMPLETE SAMPLES by • Co F o Federspiel, R. J. Monroe, and B. G. Greenberg ·This research was supported in part by U. S. Public Health Service Training Grant 2G-38 and by the Revolving Research Fund of the Institute of Statistics of the Consolidated U~versity of North Carolina through a grant from the General· Education Board for research in basic statistical methodology. •• Institute of· Statistics Mimeo Series No. 236 August, 1959 iv •e TABLE OF CONTENTS Page LIST OF TABLES vi LIST OF FIGURES ix CHAPTER I INTRODUCTION 1 CHAPTER II METHODS 5 2.1 2.2 2.3 2.4 Generation of Multivariate Samples Censoring . Empirical Measures of Efficiency • Description of Experiments • 2.5 The Computing Programs CHAPTER III • 5 7 8 10 14 16 THE DELETION OF INCOMPLETE OBSERVATIONS 3.1 Discussion • • • • • • • • • • • • 3.2 Bias and Variance of Regression Coefficients • 3.3 Efficiency of Prediction Equations CHAPTER IV 4.1 4.2 4.3 4.4 16 17 23 THE METHOD OF PAIRED CORRELATION COEFFICIENTS • 27 Definition • Bias and Variance of Regression Coefficients • Error Variance • • • Efficiency of Prediction Equations 27 29 29 32 CHAPTER V·' TEE GREENBERG PROCEDURE 5.1 5.2 5.3 5.4 5.5 49 Definition Bias and Variance of Regression Coefficients. Error Variance. • • • • • • Efficiency of Prediction Equations Convergence. • 49 52 56 61 71 CHAP TER VI MAJaMUM LIKELIHOOD • 6.1 6.2 6.3 6.4 Discussion • • • • • • • • • • • Bias and Variance of Regression Coefficients. Error Variance -. • ••• Efficiency of Prediction Equations CHAPTER VII AN EXAMPLE 7.1 The Data • 7.2 Solutions 7.3 Discussion • 79 • 79 81 81 ' 83 • 85 . ' 85 87 90 •e v TABLE OF CONTENTS (Continued) Page CRAnER IDI SUMMARY AND CONCLUSIONS 8.1 summarY. • 8.2 Conclusions LIST OF REFERENCES • • . . . . . 0 • • • 8· • 0 0 0 ft ~ • • 000 • • • o~ • 0 Q • 0 0 v • • 0 • Q 0 0 • Q b Q 0 • 0 0 • APPENDIX A. GENERATION OF MULTIVARIATE NORMAL DEVIATES • APPENDIX B. THE NORMAL DEVIATES USED IN SAMPLING • • • 94 94 103 106 • • • 109 • 0 113 vi LIST OF TABLES Page .1. Statistics of regression coefficients estimated by the deletion of incomplete observation vectors 2. Means and variances of errors of estimate obtained by the deletion of incomplete observations • 18 • 22 A 3. Means of V (b i ) estimated by the deletion of incomplete observatIons 4A. • • •. • • • • • • • • • •• Values of E and their standard errors estimated by the deletion of incomplete observations • 24 • 25 • 25 4B. Values of A and their standard errors estimated by the deletion of incomplete observations 5. Statistics of regression coefficients estimated by the Method of Paired Correlation Coefficients • 30 6. Means and variances of errors of estimate obta.ined by • the Method of Paired Correlation Coefficients 7A. Values of E and their standard errors estimated by the Method of Paired Correlation Coefficients 7B. Values of A and their standard errors estimated by the Method of Paired Correlation Coefficients 8. 9. 10. • 34 • 34 Analysis of variance of In E for the trivariate populations of Experiment 1 • Values of E and residuals from the response equation, "E + , of Experiment 2 • AB • 36 Analysis of variance of values of E • 37 1l. Valu~s of E and their standard errors at symmetric'al points in the trivariate population s p a c e . , 14. Statistics of regression coefficients estimated by the Greenberg Procedure; Experiment 1 15. Statistics of regression coefficients estimated by the Greenberg Procedure; Experiment 4 35 • 38 A 12. 
16. Statistics of regression coefficients from Experiment 7
17A. Values of Ē and their standard errors estimated by the Greenberg Procedure; Experiment 1
17B. Values of Δ̄ and their standard errors estimated by the Greenberg Procedure; Experiment 1
18. Values of Ē and their standard errors estimated by the Greenberg Procedure; Experiment 4
19. Analyses of variance for values of E from the data of Experiment 4
20. Analyses of variance for models fitted in Experiment 4
21. Values of E and Δ from the twenty samples of Experiment 8
22. Mean number of iterations required for convergence
23. Analysis of variance for number of iterations required for convergence in Experiment 4B
24. Iterations required for convergence using the Greenberg Procedure on Carver's data with different initial estimates of missing values
25. Statistics of regression coefficients estimated by maximum likelihood
26. Means and variances of σ̂²_y.x estimated by maximum likelihood
27A. Means of E and their standard errors estimated by maximum likelihood
27B. Means of Δ and their standard errors estimated by maximum likelihood
28. Biochemical determinations from Welt data
29. Sample correlation coefficients and their confidence limits based on the n_ij available pairs of X_i and X_j in Table 28
30. Some extreme muscle potassium values predicted from the Welt data by various methods
31A. Values of Ē from Experiment 1
31B. Values of Δ̄ from Experiment 1
32A. Values of Ē from Experiment 4B
32B. Values of Δ̄ from Experiment 4B
33A. Values of Ē and their standard errors, Experiment 6
33B. Values of Δ̄ and their standard errors, Experiment 6
34. Values of Ē and Δ̄ and their standard errors from Experiment 7
35A. Values of Ē and their standard errors, Experiment 8
35B. Values of Δ̄ and their standard errors, Experiment 8
Appendix Table B-1. Frequency tabulation of ten thousand N(0,1) deviates

LIST OF FIGURES

1. Histograms of regression coefficients estimated by deletion of incomplete observations
2. Histograms of regression coefficients estimated from complete observations
3. Histograms of regression coefficients estimated by the Paired Correlation Method
4. Contours for Ê_A in trivariate population space
5. Contours for Ê_B in trivariate population space
6. Associated values of Ē and avg(Ê_{A+B}) from the data of Table 13
Mean pseudo-variance of Greenberg Procedure plotted against true variance; from data of Experiment 4B
Mean variance estimated from uncensored data plotted against true variance; from data of Experiment 4B
Histograms of estimated error variances
Histograms of E; Experiment 7
Histograms of Δ; Experiment 7
Histogram of number of iterations required for convergence
Convergence curves applying the Greenberg Procedure to the five-variate sample, number 107
Convergence curves applying the Greenberg Procedure to the five-variate sample, number 106

CHAPTER I
INTRODUCTION

A common problem in statistics is that of obtaining a multiple regression equation from data which have missing values among the independent variates. Examples of this situation occur in attempting to get prediction equations in such diverse fields as psychology, economics, and biochemistry. In psychology, some subjects may have failed to take all the tests in a series.
In economics, production records of a given commodity may not have been kept consistently over a long period of time. In biochemistry, the analysis of some component of a serological sample may have been vitiated.

Several investigators have been concerned with the general problem of missing values in multivariate samples. They have, for the most part, dealt with maximum likelihood estimators for specific cases and were not directly concerned with prediction equations. Wilks (1932) considered cases where values of both x and y are missing in samples from a bivariate normal population and gave maximum likelihood estimates for some of the parameters, given others. He also evaluated the large-sample efficiency of some simpler alternative estimates. In considering an extension of Wilks' work to the trivariate case, Matthai (1951) derived maximum likelihood estimators for the three means given the other six parameters. He suggested a modified form of estimation for the case in which it is necessary to estimate the variances and covariances from the data. In addition, he discussed applications of this work to the design of sample surveys.

Lord (1955) solved maximum likelihood equations to estimate the parameters for the special case of three variables, x_1, x_2, and x_3, in which x_1 and x_2 are distributed bivariate normally, as are x_1 and x_3. With no loss of generality, the data could be arranged

  x_11, x_12, ..., x_1n, x_1(n+1), ..., x_1N
  x_21, x_22, ..., x_2n
                         x_3(n+1), ..., x_3N

where the primary subscript denotes the variable, and the secondary, the individual. He also showed that, although the estimates of some of the parameters are improved by using all the data, the estimates of the regression coefficients and errors of estimate, β_21, β_31, σ²_2.1, and σ²_3.1, are not. In considering this special case, Lord pointed out that "the general maximum likelihood equations have proved rather intractable and no simple formulae for the maximum likelihood estimators are available in the general case, even for a sample from a bivariate population."

Edgett (1956) dealt more directly with the prediction problem in considering the special case of the trivariate normal in which observations are missing from one variate only. Data would have the form

  x_11, x_12, ..., x_1n, x_1(n+1), ..., x_1N
  x_21, x_22, ..., x_2n, x_2(n+1), ..., x_2N
  x_31, x_32, ..., x_3n

After some rather tedious algebra, he obtained solutions in closed form for the maximum likelihood equations to estimate the parameters μ_1, μ_2, μ_3, σ²_1, σ²_2, σ²_3, σ_12, σ_13, and σ_23. From these, coefficients for the regression of x_1 on x_2 and x_3 were estimated, considering the latter two as independent variates.

The problems of Edgett and Lord were also considered by T. W. Anderson (1957), who indicated an approach which somewhat simplifies the mathematical manipulation necessary to get the maximum likelihood solutions. The procedure involves expressing the joint density of the variates as the product of a marginal density and the remaining conditional densities. Further reference will be made to the papers of Edgett and Anderson in the chapter on maximum likelihood.

In a paper dealing particularly with the prediction problem, Nicholson (1957) discussed the multivariate case for which observations are missing in the dependent variable.
He gave maximum likelihood estimates for the parameters based on the "incomplete sample" and for the complete portion of the sample, and concluded that, for prediction purposes, no use can be made of data which are incomplete in the dependent variable.

This thesis will deal primarily with methods for determining prediction equations when observations among the independent variates are not available. These methods include: ordinary least squares, using only those observation vectors which are complete in every variate; maximum likelihood for certain special cases; and two other procedures which have been proposed on the basis of their intuitive appeal and general applicability. One of these methods involves finding the matrix of paired correlation coefficients for all variables, then solving for the standardized regression coefficients, and finally converting these to the ordinary regression coefficients by making use of the means and variances calculated from all observations present for a given variable. This will be referred to as the Method of Paired Correlation Coefficients, or simply, the Paired Correlation Method. The other approach consists of an iterative computing scheme whereby missing values of a given variable are successively predicted from the other variables until the coefficients for the regression of y on the x_i converge simultaneously. This is called the Greenberg Procedure.

The objective of this study is to appraise these methods. An empirical sampling approach, or what is sometimes referred to as the Monte Carlo method, is used in this investigation. This approach has been found useful in problems which seem to defy strictly mathematical treatment, and has been particularly successful in studies of estimation procedures for a wide variety of models (Aitchison and Brown, 1957; Morrison, 1956; Neiswanger and Yancey, 1959; Orcutt and Cochrane, 1949). Its use here involves application of the estimation procedures described above to simulated fragmentary multivariate normal samples. Measures of prediction efficiency are calculated for each sample estimate and are appropriately summarized. The calculations of estimates and their efficiencies were carried out on an IBM 650 electronic computer.

In the following chapter, notation, definitions of efficiency, and the general method of approaching the problem are discussed. In Chapters III to VI, respectively, each method and the results of investigation using that method are described in detail. Chapter VII illustrates the use of the different procedures with an example and, finally, in Chapter VIII, comparisons of the methods and their applicability under different circumstances are considered.

CHAPTER II
METHODS

2.1 Generation of Multivariate Samples

All the estimation procedures to be considered assume samples from an underlying multivariate normal population. In order to carry out empirical studies, it was necessary to generate samples from multivariate normal populations with given characteristics. For p small, a convenient way of generating samples from a p-variate normal population with given parameters is to apply a series of transformations to p independent sets of normal deviates. Finding the transformations involves getting the conditional distribution for x_2, given x_1; then that for x_3, given x_1 and x_2; and so on, finally getting the conditional distribution for x_p, given the remaining p-1 variates. A short sketch of an equivalent construction in modern notation follows.
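The sketch below is a minimal illustration in Python with numpy, not the SOAP programs actually used in this work; the function name, the example population, and the use of a Cholesky factor (which reproduces the same chain of conditional regressions) are illustrative assumptions.

```python
import numpy as np

def generate_multivariate_normal(R, n_obs, rng):
    """Transform independent N(0,1) deviates ("original samples") into
    deviates whose population correlation matrix is R, with zero means
    and unit variances ("generated samples").

    The lower-triangular Cholesky factor L of R plays the same role as
    the chain of conditional-distribution transformations: row i of L
    expresses x_i as its conditional mean given x_1, ..., x_{i-1} plus
    an independently scaled N(0,1) deviate.
    """
    L = np.linalg.cholesky(np.asarray(R, dtype=float))   # R = L @ L.T
    z = rng.standard_normal((n_obs, len(R)))              # original sample
    return z @ L.T                                        # generated sample

# Example: the trivariate population [.25 .50, .25] of Experiment 1,
# i.e. rho_12 = .25, rho_1y = .50, rho_2y = .25 (variables x1, x2, y).
rng = np.random.default_rng(1959)
R = [[1.00, 0.25, 0.50],
     [0.25, 1.00, 0.25],
     [0.50, 0.25, 1.00]]
sample = generate_multivariate_normal(R, n_obs=20, rng=rng)
print(np.corrcoef(sample, rowvar=False).round(2))
```

Because the populations considered have zero means and unit variances, the correlation matrix alone determines the transformation, just as in the population notation introduced below.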
The transformation for the bivariate case is mentioned in the Rand Tables (1955), and that for the trivariate case is developed in Appendix A, section A.1. Actually, explicit expressions for the transformations are only of academic interest, since, for a given population, numerical values of the parameters may be substituted in the distribution function and the necessary conditional distributions found directly. This procedure, which has been described elsewhere by Moonan (1957), is illustrated in detail in Appendix A, section A.2, in finding the transformations necessary to generate samples from a four-variate normal population used in succeeding chapters.

Multivariate normal deviates having a mean of zero and variance equal to one have been generated, since we are not interested in trying to simulate any special population with regard to these parameters. In this way, the different populations to be considered are characterized completely by their correlation matrices. A convenient notation to indicate a given population consists of writing the correlation matrix for that population in a single line between brackets, omitting the diagonal and corresponding symmetrical elements, and setting off the rows of the correlation matrix by commas. For example, a population having the correlation matrix

  1      ρ_12   ρ_13   ρ_14
  ρ_12   1      ρ_23   ρ_24
  ρ_13   ρ_23   1      ρ_34
  ρ_14   ρ_24   ρ_34   1

would be designated [ρ_12 ρ_13 ρ_14, ρ_23 ρ_24, ρ_34].

Since the Rand Tables (1955) afforded a ready source of N(0,1) deviates, they were used in punching a deck of two thousand cards, five deviates per card, for input in generating the multivariate samples needed. A description of the way the deviates were selected and a frequency tabulation of their values are given in Appendix B. Henceforth, samples of these N(0,1) independent deviates will be referred to as "original samples". Their transformed counterparts will be called "generated samples". To demonstrate the efficacy of the generation program, the matrix of sums, sums of squares, and sums of cross products for four hundred observations of a four-variate population is included in Appendix A, section A.3.

2.2 Censoring

Patterns of missing values to be considered do not include cases of observations with characters missing from every independent variate, that is, cases where an observation vector consists of only the dependent variable. It is clear that such observations can contribute nothing to a prediction equation for y on the x's. A description of the notation used to designate patterns of missing values in general follows. For the trivariate case, (c_1 c_2) is used to denote the number of characters, c_1, missing from x_1 alone, and c_2, the number of characters missing from x_2 alone. For the four-variate case, (c_1 c_2 c_3, c_12 c_13 c_23) indicates c_i (i = 1, 2, 3) as the number of characters missing from x_i alone and c_ij (i ≠ j) as the number of observation vectors having characters missing from both x_i and x_j. For example, a sample of fifteen observations designated (303,211) might appear as follows, where a dash represents a missing value and an asterisk an observed value:

  x_1:  - - - * * * - - - * * * * * *
  x_2:  * * * * * * - - * - * * * * *
  x_3:  * * * - - - * * - - * * * * *
  y  :  * * * * * * * * * * * * * * *

The number of missing characters is given by Σ c_i + 2 Σ c_ij = 14, and the number of complete observations by n = N - Σ c_i - Σ c_ij = 5. The extension to higher-variate cases is obvious. A sketch that converts such a pattern into the kind of binary presence/absence code used later by the computing programs (section 2.5) follows.
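The sketch below is only an illustration of the notation, written in Python rather than the original SOAP; which particular observation vectors are censored is chosen at random here, and the function and variable names are assumptions.

```python
import numpy as np

def censoring_mask(pattern, n_obs, n_x, rng):
    """Build an n_obs x n_x presence/absence code (1 = observed, 0 = missing)
    for the independent variates, given a pattern such as (3, 0, 3, 2, 1, 1)
    for the four-variate case: the first n_x entries count vectors missing
    x_i alone, the remaining entries count vectors missing the pair
    (x_i, x_j), i < j, in lexicographic order."""
    mask = np.ones((n_obs, n_x), dtype=int)
    singles, pairs = pattern[:n_x], pattern[n_x:]
    pair_index = [(i, j) for i in range(n_x) for j in range(i + 1, n_x)]
    rows = iter(rng.permutation(n_obs))        # censor distinct vectors
    for i, count in enumerate(singles):
        for _ in range(count):
            mask[next(rows), i] = 0
    for (i, j), count in zip(pair_index, pairs):
        for _ in range(count):
            r = next(rows)
            mask[r, i] = 0
            mask[r, j] = 0
    return mask

rng = np.random.default_rng(0)
mask = censoring_mask((3, 0, 3, 2, 1, 1), n_obs=15, n_x=3, rng=rng)
n_missing = (mask == 0).sum()                  # sum c_i + 2 sum c_ij = 14
n_complete = (mask.min(axis=1) == 1).sum()     # N - sum c_i - sum c_ij = 5
print(n_missing, n_complete)
```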
In the sampling work which follows, the artificial deletion of certain values among the independent variates to simulate missing measurements will be referred to as censoring.

The censoring schemes considered fall into two classes. One of these is the special case where values are missing from one variable only. The other is the case in which missing values occur with equal frequency among all the independent variables, with only one value missing from each observation vector. By the general notation indicated above, this latter type of censoring would be written (c_1 c_2 c_3, 000) for the four-variate case and (c_1 c_2 c_3 c_4, 000000, 0000) for the five-variate case. For these four- and five-variate cases the general notation has been abbreviated to (c_1 c_2 c_3) and (c_1 c_2 c_3 c_4), respectively. It should be emphasized, however, that except for the method of maximum likelihood, the estimation procedures discussed are not restricted in any way by the pattern of missing values.

2.3 Empirical Measures of Efficiency

The usefulness of a regression equation is usually judged by the magnitude of the bias and the variance of the estimated regression coefficients. The t-statistic, t = (b̄_i - β_i)/s_{b̄_i}, has been taken as a criterion for bias, and in the tables of biases in succeeding chapters, values which exceed the appropriate tabulated values of t at the five per cent level have been designated with an asterisk. However, it should be noted that we are interested primarily in prediction, and that estimates of regression coefficients may be biased in opposite directions so as to be mutually compensatory as far as prediction is concerned. For this reason, as well as for general convenience, certain single measures of efficiency for a prediction equation have been proposed and will also be used to evaluate the methods discussed. One such measure, defined by Nicholson (1948), is

E_Z = \frac{\sum_{\alpha=1}^{N} (y_\alpha - Y_\alpha)^2}{\sum_{\alpha=1}^{N} (y_\alpha - Z_\alpha)^2}

where y_α represents the observed value of y for the α-th observation; Y_α, the estimate of y_α based on the least squares regression equation of y on the x_i for the N uncensored observations; and Z_α, the estimate of y_α based on the prediction equation, Z, whose efficiency is being tested. Both sums are taken over the N observation vectors which give rise to Y. Thus 0 ≤ E_Z ≤ 1, E_Z assuming its maximum value when Y = Z.

Since the parameters are known, another important measure of efficiency is

\Delta_Z^2 = \frac{1}{N} \sum_{\alpha=1}^{N} (\Gamma_\alpha - Z_\alpha)^2

where Γ = β_0 + Σ_{i=1}^{p-1} β_i x_i is the true regression equation, Z is defined as above, and the sum is taken over the same N observation vectors as indicated above.

The measures E_Z and Δ²_Z are of interest when we consider Z as a prediction equation estimated from incomplete data. Although both are measures of "prediction efficiency" and are usually referred to throughout the text by the word "efficiency", it is understood that they are merely convenient empirical devices for comparison, and, as such, are to be distinguished from orthodox theoretical efficiency, which can be defined for an estimate θ̂ as σ²_0/σ²_θ̂, where σ²_0 is the variance of the minimum variance unbiased estimator of θ. It may be noted that E is an index which measures the goodness of the equation in question relative to the "best" estimate from the uncensored sample. It should be pointed out in this connection that E may be very high although neither estimate is good in the sense of giving a prediction equation close to the true one. A short sketch of how both measures are computed for a single sample follows.
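The sketch is in Python with numpy and uses illustrative names; the original computations were carried out on the IBM 650 and this is not that program.

```python
import numpy as np

def efficiency_measures(X, y, beta_true, z_coef):
    """Compute E_Z and Delta_Z for one sample.

    X        : N x (p-1) matrix of the uncensored independent variates
    y        : N observed values of the dependent variable
    beta_true: true coefficients (beta_0, beta_1, ..., beta_{p-1})
    z_coef   : coefficients of the prediction equation Z under test,
               e.g. estimated from the censored version of the sample
    """
    N = len(y)
    X1 = np.column_stack([np.ones(N), X])

    # Y: least-squares fit of y on the x's from the N uncensored vectors
    b_ls, *_ = np.linalg.lstsq(X1, y, rcond=None)
    Y = X1 @ b_ls

    Z = X1 @ np.asarray(z_coef)          # predictions from the equation under test
    Gamma = X1 @ np.asarray(beta_true)   # predictions from the true equation

    E = np.sum((y - Y) ** 2) / np.sum((y - Z) ** 2)
    Delta2 = np.sum((Gamma - Z) ** 2) / N
    return E, np.sqrt(Delta2)            # Delta, not Delta^2, is tabulated
```

Averaging these quantities over the k samples at a given population-censoring combination gives the entries reported in the tables that follow.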
On the other hand, Δ²_Z, which is a measure of the average squared deviation of the Z_α from the values predicted by the true regression equation over a given sample, is dependent on the population from which the samples were generated. Hence, it is not useful in comparing methods applied to samples from different populations. In the tables which follow, the values of Δ²_Z are considered in terms of Δ, since the sampling distributions of Δ appear to be more symmetrical than those of Δ².

2.4 Description of Experiments

As was indicated in Chapter I, an empirical sampling approach was used. This approach took the form of a series of eight sampling experiments which were performed in sequence and were dictated in large part by successive results. The principal factors considered were the characteristics of the populations from which samples were taken and the intensity of censoring. Except for one experiment, sample size was confined to N = 20. Since we have some knowledge of how sample size affects estimates, this does not seem to be a very objectionable restriction. The populations, the intensity of censoring, and the number of samples on which calculations were made varied from experiment to experiment. The number of samples considered in a given experiment, or on which a set of means and variances was based, is denoted by k.

A brief description of the experiments conducted is set forth in the following paragraphs for the twofold purpose of (1) providing some perspective before the presentation of detailed results in the chapters concerned with the respective methods of estimation and (2) identifying the experiments so that they need not be described repeatedly in various places throughout the text.

Experiment 1 was the most comprehensive experiment undertaken. It consisted of estimating regression equations for an arrangement in which twenty original samples were transformed to produce samples from the trivariate populations [.25 .50, .25] and [.80 .60, .80], and the four-variate population [.25 .50 .75, .25 .50, .25]. Each of the resulting forty generated samples from the trivariate populations was censored in ten different ways: (11), (22), (33), (44), (55), (66), (04), (40), (08), (80). The twenty generated samples from the four-variate population were each censored (111), (222), (333), (444), and (555). Estimates of prediction equations were made from the samples at each of these population-censoring combinations using the uncensored data, complete observations, the Greenberg Procedure, and the Method of Paired Correlation Coefficients.

In Experiment 2, samples from thirty-nine different trivariate populations were censored (55) and regression statistics calculated using the Method of Paired Correlation Coefficients. The main purpose was to find out how different populations affected the efficiency of the prediction equations determined by this method. The experiment can be subdivided into two parts. In Experiment 2A, the same ten original samples were used to provide ten generated samples from each of the thirty-nine populations. In Experiment 2B, 390 different sets of original samples were used in generating ten samples from each of the thirty-nine populations.
The rationale dictating the use of this type of design is based on the premise that the mean efficiencies at each point in the population space, relative to other points, may be better determined by using the same set of original samples, but that the entire configuration in the space would be better determined by using different random samples at each point. This feature of the sampling plan represents a unique facet of the general empirical sampling approach. It is also used in Experiment 4.

Experiment 3 was performed to determine whether or not conclusions based on Experiment 2 also held in the four-variate case. Twenty original samples were used to generate twenty samples from each of six different four-variate populations. The samples were censored (555) and regression equations determined using the Method of Paired Correlation Coefficients.

In Experiment 4, the Greenberg Procedure was used on samples censored (55) from sixteen different trivariate populations. The populations selected were dictated by a response surface design in the space of admissible sets of correlation coefficients. Like Experiment 2, this experiment consisted of two parts: 4A, involving the use of ten original samples to generate ten samples from each population, and 4B, making use of different sets of original samples for each population.

Experiment 5 was designed to investigate the convergence of successive estimates made by the Greenberg Procedure. It was limited in scope and is described in detail in the chapter dealing with the Greenberg Procedure.

The Paired Correlation Method is applied to samples from two five-variate populations in Experiment 6. The populations were chosen to generate samples simulating those arising in two "live problem" situations. The purpose was to demonstrate cases where the Paired Correlation Method would or would not be appropriate. The method was used on two hundred samples censored (3333), one hundred from each of the populations examined.

Experiment 7 involved the use of complete observations, the Greenberg Procedure, and the Paired Correlation Method to find regression equations from one hundred four-variate samples censored (333). The population chosen was one for which the Method of Paired Correlation Coefficients was not expected to provide efficient results. The experiment aimed at determining, under such circumstances, the relative "goodness" of prediction with the use of complete observations, the Greenberg Procedure, and the Paired Correlation Method.
The transforming ooeffioients for a given population were computed manually and stored on the drum to be used repeatedly for a. series of samples. Although the data oonsisted of a complete table of N(N==20) p-dimensional vectors, they were aooompanied bya oensoring code comprised of Np binary digits, eaoh one denoting the presenoe or absence of its oorresponding deviate value. Censoring oodes, like the trans- forming coeffioients mentioned above, were punched on cards and e e stored on the drum to be used repeatedly for a series of samples. e e One device used to check the correctness of a program for d.i:f'ferent methods of est:iJna.tion consisted of using a censoring code which indicated no missing values. All methods of estimation should then yield identical results. Symbolic Optimum Assembly Programming (SOAP) was used in writ:ing all programs • • e e 16 e e CHAPTER III THE nELETION OF IMOOMPLETE OBSERVATIONS 3.1 Discussion The most obvious solution to the problem. of missing values is simply to discard observation vectors which are not complete in every variate. data. We will refer to this prooedure as estimation from oomplete Actually, such treatment hardly can be called a solution, and, at best, must be considered a trivial one. The main purposes of in- olud1ng a section devoted to this prooedure are to point out objeotions to it, and to include results given by it whioh will serve as standards of oomparison for results obtained by other methods. • The objeotion to suoh a prooedure is that the effioienoyof the prediction equation might be deoreased by oasting out important information. In deoiding whether to use only data oomplete in every variate or whether to employ some other method making use of all the information, the first factor to be oonsidered is the amount and proportion of complete data available. If a large amount of data is available, one usually would not hesitate to discard a part of it in order to apply a straightforward analysis. On the other hand, there might be enough oomplete observation vectors available so that the others could be discarded, but by so I doing, the variance of predicted values would be increased or estimates e e biased. Many times values missing from incomplete observation veotors occur near the extremes of the range of the v'arl.able in question. The occurrence of missing values is, in fact, often correlated with their extreme size for one reason or another. It is likely that observed 17 e values in an incomplete vector fall at very high or very low levels e also, and by discarding such values the range of tlB Xl s is decreased, resulting in diminished precision of the regression est:imate. For example, in the field of anthropometry, Miinter (1936) and his associates, in measuring skeletal remains, found that bones of smaller specimens were often broken and could not be measured. If a prediction equation for some specific characteristic were to be determined" or if a linear d:L scr1m:l.nator were to be established, then discarding all the measure" ments of the specimens with at least one broken bone would clearly be Another example applicable to the present discussion undesirable. arises in trying to predict earnings based on questionnaire infomation regarding an employee's experience, education, place of • other variables. emplo~ent, and Non-response on education is known to be correlated with low earnings. 
It follows from the above discussion that consideration should be given to the levels of the Xl s at which one expects to apply the pre- diction equation. In predicting y for values in the middle range of . the x's, there would be less concern about discarding incomplete observations haVing extreme x values than would otherwise be the case. 3.2 Bias and Variance of Regression Coefficients The data from Experiment 1, computed by using only complete observation vectors, is summarized in Table 1 which gives the biases and variances of the regression coefficients for different combinations of e e censoring and population. Since the estimates obtained by deleting incomplete observation vectors are ordinary least squares estimates based on N - f ci 111 n observations, they are expected to be unbiased - ee ee Table 1. Statistics of regression coefficients estimated by the deletion of incomplete observation vectors; Experiment 1, N· 20, k = 20 .- bo-llo V(bo) ~-1l1 , .25J V(b:!.) (00) .113* .055 .051 .058 (ll) .1291- .055 .069 (22) .1491- .09;> (33) .193* (44) .9;> [.25 Censoring [ .80 .60 , .80J V(b ) 2 bo-j30 V(b) ~-1l1 V(b:!.) -.033 .025 .078* .026 .060 .052 -.037 .032 .072 .0l.4 .042 .087* .027 .0)8 0076 .015 .053 .082 .093 .037 .046 .101* .024 .030 .101 .042 .058 .062 .056 .135 .064 .046 ,134* .030 -.007 .123 .072 .058 .19:>* ",055 .048 .150 .086 .075 .104* .026 -.029 .1;b .096 .096 (55) .124* .068 .009 .195 .053 .095 .086* .033 -.032 .197 .059 .120 (66) .194* .144 . .018 .243 .016 .208 .135* .070 .001 .312 .018 .262 (04) .098 .063 .0)4 .076 -.001 .041 .069 .030 .014 .071 -.001 .052 (40) .19:>* .055 .059 .121 .048 .040 .104 .017 .007 .109 .o;:h .09;> (9 8) .086 .082 -.029 .074 -.053 .053 .060 .040 .018 .078 -.060 .067 (80) .;1.25 .072 .117 .209 .021 .050 .087 .035 .067 .168 .023 .063 .9;> [.25 .75 , b 2-11 2 .9;> .25 , 0 b 2-11 2 V(b ) 2 b -11 3 3 V(b3) -.019 .030 .035 .020 .050 .021 .014 -.001 .037 .056 .018 .036 .028 .064 .032 .009 .058 .051 .038 .002 .040 (333) .042 .035 -.001 .092 .006 .048 .017 .055 (444) .024 .112 .069 .279 ,030 .100 ....022 .099 (555) -.023 .398 •090 1.008 2.763 .112 .866 V(b) (000) .028 .Ql6 (Ill) .068* (222) a Tbe symbol * indicates 'i\-1l1 0 .. - b - II t • ~ > t 05' 19, where St;2 • 1Ii). i •.37.1 . i t~i) V(b ) 2 .25J V(bl ) b o-ll 0 b2-11 2 ~ (» 19 e e with variance increasing as n decreases. Table 1 seems to bear out these e:xpectations. None of the biases of bl and b 2 are significant by the t criterion given in Chapter IT. The bias of bO which is indicated signifioant for most of the three variate oases here may be disnissed as a sampling phenomenon on the basis of similar unbiased statistios calculated f'rom other sets of' original sanples censored (SS) in Experiment 4. Reoall that the data of' Table 1 f'or each oensoring population combination is based on the ~ set of twenty original deviate samples. Further evidence confirming the expected nor.malcharacter of the regression' coef'ficients based on complete data can be seen in the histograms of Figures 1 and 2. • These data were taken from experiments con- cerned with higher variate cases. .Each figure is based on dif'f'erent sets of' one hundred original samples. For the data in Figure 1, the values of' t f'or testing bias in b ' bl , b 2, b , and b were - .818, O 3 4 '- .391, -~473, .212 and .700, respectively, while the-t values for bias in the data represented by Figure 2 were -"21, ::~886, - .•068, and .904 for b o' b l , b , and b , respectively. 
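The t criterion quoted above amounts to the following computation over the k between-sample estimates of each coefficient; the sketch (Python, with assumed names, not the original program) is purely illustrative.

```python
import numpy as np

def bias_t_statistic(b_estimates, beta_true):
    """t criterion for bias used in the tables: for each coefficient,
    t = (mean(b_i) - beta_i) / s_{mean(b_i)}, with the standard error of the
    mean computed from the k between-sample estimates.

    b_estimates : k x p array, one row of fitted coefficients per sample
    beta_true   : the p true coefficients
    """
    b = np.asarray(b_estimates, dtype=float)
    k = b.shape[0]
    mean_b = b.mean(axis=0)
    se_mean = b.std(axis=0, ddof=1) / np.sqrt(k)
    # compare the result against the tabulated t at the 5 per cent level
    # with k - 1 degrees of freedom (here k = 20, so t_.05,19)
    return (mean_b - np.asarray(beta_true)) / se_mean
```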
2 3 The mean values and variances of' the s2y.x .from Experiment 1 have been computed and are tabulated in Table 2. As in Table 1, the tendency f'or the variances of the estimates to increase with increased censoring may be noted. not evident. e e A similar tendency with regard to the bias is i Figure lOeB) shows the histogram of' values- of y. x .from one· hundred samples censored (333). The shape of the histogram is consistent with the theoretical distribution of suoh a statistic, which is proportional to x2 with seven degrees of freed'?J!1. 20, e e ~ - 2.0 It> - -,; -,5 -I ,2> 7 /,/ 'oS 1,6 2..0 2..'1 ').~ I.q 2.3 = .~l ~4 to 10 il -.Iot C .'8 ."1 ~3 • .. 1.2., 1.0~ ~c 10 -1,1 -.7 -.~ .1 ,S ,q ~2 •.0; ~ 1.1. -:v 10 ·4.~ -2.L\ -1.0 -, "1 . -.Lj -I,OC Ie -,q -:7 -.5 -:0 -.1 ·e e ., .3 . s .7 .tt ~O ·0 Figure 1. Histograms of regression ooefficients estimated by deletion of inoomplete observations; [0.80 .29 .40, .50 -.20 .,0, .20 .70 .30] c,ensoring (3333), N .. 20; k • 100 • 21 e e 10 -,I ,I .~ .5 .7 .q I.~ 1.5 n'.1 I.~ 1.5 1.7 I.If 1.'g'2.C ~3 ••61. 1.1 2.0 10 -.1 .1 .?J ,5 ,7 ~2· .~B· • ."f 30 10 .'1 .E. .'i! 1.0 1.2. I.b 2..?- 2..4 ~l=:I.lf ;;0 20 10 -,Li6 e e Figure 2. -.~5 -.l5 -.15 -:05 .05 .15 .7.5 .~5 .lt5 .55 .65 :7S • 13 o=0 Histograms of regression coefficients estimated from complete observations; [-.70 -.52 .32, .60 .26, .44], censoring (333), N = 20, k = 100 • ee ee Table 2. Means and variances ot errors of estimate: . obtained by the...deletioD of·ineomplete observation; EJq;>erimentl,N == 20,k • 20 [.25 .$l , 0251 [.80 (T2 == 073-33" ,;J.. CeDsoring ;,y .x [.22 .60 , .80J == 0'3556 .' _"",y~.x=-· _ V(si.x) s~.x V(s~.x> (00) ,742 .082 •.360 .019 (000) (11) .7.34 07.36 •.356 •.357 •.366 .022 .018 .026 (111) .756 .094 .076 .110 .740 ,14.3 .358 .Q.34 (66) .770 .724 (o4t. .7.37 .208 .106 .086 •.373 •.351 •.357 .049 .025 .020 {40~ .750 .118 •.364 .028 (O8~ .701 .76.3 .340 •.370 .0.32 (80~ .097 .136 " ~5~ --;; y.x (222) (.3.3.3) (444) (555) ,~22·:P 2 CT . . • .:3"2:39' ,~2>1 "Y.X Censoring s;.x (22) (.33) (44) .:P~i5 •.349 ...361 V(s2 ) y.x .007 .345 .370 .009 .015 .Q26 •.315 .064 •.351 •.31Q .023 I\) J\) 23 e e As mentioned above, the values in Tables 1 and 2 are inoluded primarily as sta.ndards for oomparison with similar tables based on sta.tistios oalculated by methods whioh utilize inoomplete observation veotors. The estimated varianoes of regression coeffioients referred to above are the ''between sample" or external estimates of varianoes, denoted V(b ).· Although the ~stimated varianoes in suooeeding disi cussions will be of this type, it seems appropriate to mention that "wi thin sample tt or :h1 Bliss' (1952) terms, the internal variances . " Vw(b ), are available here. The inconsistenoy between estimates of i these two types has long been of ooncern in bioassay applio ations, but as an experimental rather than a statistioal problem (Bliss, 1952; • Sheps and Munson, 1957). .An examination of the oonsistenoy of these two ways of estimating the variance of regression coeffioients is of irrlierest as a by-product of this researoh. Table 3 gives mean values A of the Vw(b i ). A comparison of these with their counterparts in Table 2 reveals no startling discrepanoies. 3.3 Efficiency of Prediction Equations Table 4A. gives the means of E and their standard errors for the censoring sohemes of Experiment 1. 
'When estimates are obtained by ¢l.eleting incomplete observation vectors, Eis invariant for samples representing different populations, but generated from the same original samples. e e The means of A and their standard errors are shown :h1 Table 4B. • ee ee A Table 3. Means of Vw(bi ) estimated by the deletion of incomplete observations; Experiment 1, N = 20, k • 20 [.25 Censoring .25] Vw(bo) - Vw(~) Vw(b ) 2 - (00) (11) .042 (22) .055 .069 .078 .108 (33) (44) (55) (66) .~, .047 .151 [ .80 .60, Vw(bo ) Vw(~) .80] Vw(b2) .052 .060 .048 .058 .020 .064 .060 .023 .075 .093 .107 .063 .081 .073 .080 .102 .095 .122 .027 .033 .038 .052 .075 .086 .106 .122 .166 .191 .073 .251 .241 .159 .229 [.25 Censoring .9:> .75, .25 Vw(b0 ) Vv(~) (000) (Ill) .021 .028 (222) .~, .251 Vw(b 2) Vw(b ) 3 .023 .031 .040 .- - .032 .040 .068 .120 (333) (444) .036 .061 .080 .034 .043 .058 .090 .128 .109 .048 .073 .117 .155 (555) .798 1.402 3.994 1.635 I\) ~ 25 e e 4A. Table Values of E and their standard errors estiniated by the deletion of incomplete observations; Experiment 1, N = 20, k = 20 Trivariate Censoring (11) (22) t33 ) 44) (55) (66~ (04 (40) (08) (80) • Table Four Variate E S.E.(E} .965 .939 .885 .855 .803 .722 .944 .939 .881 .859 .009 .011 .018 .023 .028 .039 .011 .Oll .020 .025 Censoring -E s.E.(i) (111) (222~ (333 .967 .888 .795 .625 .335 .007 .018 .033 .056 .045 (444) (555) 4B. Values of ! and theizl standard errors estimated by the deletion of incomplete observations; Experiment 1, N = 20, k • 20 .9), .25] [.80 .60, .80 ] [.25 .9) .75, .25 .9), .25] CensorCensor- [ .25 ing ing S.E.(!) t:. S.E. (i) t:. t:. S.E.(i) - (00) (ll) (22) (33) (44) (55) (66) (04\ (40 e e (08 (80) .303 .340 .375 .437 .460 .488 .607 .353 .385 .381 .468 - - .028 .031 .032 .039 .037 .039 .074 .030 .039 .030 .055 .211 .237 .261 .304 .320 .340 .423 .245 .268 .266 .326 .020 .021 .022 .027 .026 .027 .051 .021 .027 .021 .038 (000) (111) (222) (333) (444) .222 .252 .301 ·.348 .524. .012 .016 .019 .029 .074 26 e e These tables demonstrate an expected relationship, that greater censoring decreases efficiency of prediction. By so doing, the results lend support to the use of EZ and AZ as effective measures of efficiency. The histograms of Figures 7A, lIA, and 12B indicate distributions of E and Awhen using only complete observations for certain cases considered in Experiments 6 and 7. Chapter'VIII includes tables in which mean efficiencies and their standard errors based on estimates from complete observations, are compared with their counterparts based on other methods • • e e 27 e CHAPTER Dr e THE METHOD OF FAmED CORRELATION COEFFICIENTS 4.1 Definition Although the prooedure desoribed below seems to have been used widely to deal with the problem at hand, there is no apparent referenoe to it in the literature, either describing it or even referring to it as a technique employed in analyzing inoomplete data. It is not clear whether this is due to the rather simple nature of the scheme or whether it is a reflection of its limited usefulness. At any rate, it will be described below and some of its limitations will be brought out in connection with results given later in the chapter. • Suppose we have a set of data such as that schematically represented on page 6. From this we can fom the matrix of paired correlation coefficients: 1 r 1 Cpxp = l2 r r l3 23 ... ... 
I I rl(p-l) I I r ly r 2 (p-l) I I r 2y I r (p-l)y 1 I I C(p-l) (p-l) = (4.1) I - - - - - - i - - -----'----I I l!.:ty I .tl,y 1 I I I 1 I where the r ij are correlation coefficients estimated from the existing n .. pairs of observations in the sample of N observation vectors. The J.J e e standardized regression coefficients, b~ b; ••• b{P_l)' are then obtained as (4.2) e e 28 The b~ may then be oonverted to ordinary regression ooefficients by substituting the appropriate statistics for the parameters in the well-known formulae ~. J. 0' = ~~.foJ. .:.::l. 0'. J. and The sample values for the mean and variance of a variable are taken over all existing values of that variable; that is, (4.5) • and (4.6) Generally, n. > ni J.• J.- ny = N, As mentioned in Chapter I, the restriction, will be maintained here. .An estimate of the error varianoe O'~.x may also be obtained by substituting sample values given above for parameters in the fomula where! is the vector of regression ooefficients, ~, and V is the variance-covariance matrix of the xi. e e is its transpose, It is clear that this method of estimation reduces to ordinary least squares regression when nij = N; that is, where no values are missing. 29 e e 4.2 Bias and Varianoe of Regression Coeffioients The biases, h. - ~., and varianoes, V(b i ), for the population~ ~ oensoring sohemes of Experiment 1 are indioated in Table 5. There is no marked tendenoy for bias to inorease with inoreased oensoring, and only the biases in be appear to be signifioant by the t oriterion adopted to assess this bias. The remarks in Chapter III" pointing out that the signifioant bias of be is due to sampling" are also applicable There were no signifioant biases among the b and b in Table 5. l 2 In the four-variate situation, at higher levels of censoring, the values here. - of b i - ~i are very large" although not significantly so by the t tests. This draws our attention to the large va.riances at these levels. In oertain situations, whioh will be discussed in sucoeeding seotions" • the biases" I;i - ~i' beoome very large as do the variances of the hi • Consequently, one may be estimating the ooeffioients very badly in using this procedure" although the t oriterion for bias would not indicate this. Examples of these large varianoes are illustrated in the results of Experiments 6 and 7, whioh will be discussed in the section on effioienoies later in this chapter. The statistios relating to the bias and variances of regression ooefficients computed by the Paired Correlation Method in Experiment 7· are given in the second oolumn of Table 16 and serve to point up the statements made above. Histograms of these regression ooeffioients may be seen in Figure 3. e e 4.3 Error Varianoe The error varianoe, i y. x' was predioted for eaoh sample as desoribed in seotion 4.1, equation (4.7). The mean values and - ee 'lable ee 5. Statistics of regression coefficients esthJated by the Method of Paired Correlation Coefficients; EJcperilnent .. .... .;n [ .25 Censoring ~-/l o 0 , N .. 20, k .. 20 .25] V(b ) ~-fll V(~) o l~ [.80 b 2-fl2 .60 , .80] V(b2) b o-130 V(b) ~-131 V(~) 0 b2~fl2 V(b ) 2 (00) .llJl!"a .055 .0S1. 
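Before the sampling results are taken further, the following sketch restates the Method of Paired Correlation Coefficients of section 4.1 as a computation for a single sample. It is written in Python with numpy purely as an illustration; missing values are marked with NaN, and the function name is an assumption rather than the thesis's program.

```python
import numpy as np

def paired_correlation_fit(X, y):
    """Prediction equation by the Method of Paired Correlation Coefficients.

    X : N x (p-1) array of independent variates, with np.nan for missing values
    y : N fully observed values of the dependent variable
    Returns (b0, b, s2_yx): intercept, slopes, and estimated error variance.
    """
    data = np.column_stack([X, y])               # y is the last column
    p = data.shape[1]

    # Pairwise correlations, each from the n_ij cases observed on both variables
    C = np.eye(p)
    for i in range(p):
        for j in range(i + 1, p):
            ok = ~np.isnan(data[:, i]) & ~np.isnan(data[:, j])
            C[i, j] = C[j, i] = np.corrcoef(data[ok, i], data[ok, j])[0, 1]

    # Standardized coefficients, equation (4.2): solve C_xx b_std = r_xy
    b_std = np.linalg.solve(C[:-1, :-1], C[:-1, -1])

    # Means and standard deviations from all existing values of each variable
    means = np.nanmean(data, axis=0)
    sds = np.nanstd(data, axis=0, ddof=1)

    b = b_std * sds[-1] / sds[:-1]               # convert to ordinary slopes
    b0 = means[-1] - b @ means[:-1]              # intercept

    # Error variance s^2_{y.x} = s_y^2 - b' V b, with V assembled from the
    # pairwise moments; V is not guaranteed positive definite, so s2_yx can
    # come out negative, as noted in section 4.3.
    V = np.outer(sds[:-1], sds[:-1]) * C[:-1, :-1]
    s2_yx = sds[-1] ** 2 - b @ V @ b
    return b0, b, s2_yx
```

When no values are missing, every n_ij equals N and the result coincides with ordinary least squares, as stated at the end of section 4.1.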
.058 -.033 .025 .onl-* .027 .060 .052 -.037 .032 ·(ll) .121* .o;n .066 .064 -.001 .D34 .082 .031 .040 .082 -.002 .059 (22) .124* .o;n .072 .067 .017 .033 .078 .032 .b44 .108 .000 .083 (33) .lID* .055 .047 .084 .026 .031 .089 .042 -.020 .151 .039 .105 (44) .147* .058 .042 .102 .052 .029 .092 .046 -.048 .198 .060 .137 (55) .18'7* .083 .068 .136 .066 .067 .091 .O~ .027 .2;n .021 .152 (66) .202* .078 .l41 .166 .009 .132 .086 .046 .080 .485 -.027 .298 (04) .1291- .049 .045 .055 .017 .039 .096* .036 .031 .089 .006 .080 (40) .1391- .065 .048 .102 ·.002 .028 .079 .030 '-.091 .170 .041 .090 (08) .108 .058 .048 .077 -.108 .076 .085 .049 -.061 .182 -.079 .130 (80) .128 .083 .075 .225 -.046 .058 .092* .038 .038 .3~ -.047 .162 .;n [.25 =-lbo-po V(b ) (000) .028 .016 (lll) .0.51- (222) .O~ .75 , .25 , .;n .25] V(~) b 2-fl 2 V(b2) b3:-fl3 V(b ) -.019 .030 .035 .020 .049 .021 .020 .017 .041 .043 .034 .025 .033 .023 .041 .055 .076 .041 -.005 .042 .064 o ~-fll 3 (333) .0.51- .026 .032 .079 .087 .047 -.017 (444) .261 .646 -.270 2.244 .449 2.310 .613 9.190 .749 .1S'! .928 (555) a The symibol ~329 * .973 indicates .239 b. - fl. t=~ ~. l. .609 > t. 05, .475 2 19, where Si6... l. V(bi ) ~ w o 31 e e '-10 ~ ~ 10 ++++ -:;,"1 -1..~ -2.1. -I.e -1.0 "',4 ,1 = ~4 •'g ,51 20 +++ /0 ++ -Lj,~ -36 -aO -2.,L\ -I,t -1.2. 0 -,j, ,b 1.2. ~3. IZ I.Cl6 f-- • 30 20- (,<,~, , 10 ++ + -12, -I,t -1,1.\ -I,t) -,a 'Tl .'2.. ,6 1,0 1,1t \."l '2.,'1 ~O = 0 Figure 3. Histograms of regression coefficients estimated by the Paired Correlation Method; [.60 .80 .29' .40, .50 -.20 .50, .20 .70, .30) censoring (3333), N 20, k = 100 = The symbol + indicates a value outside the range of scale. i 32 e e varianoes of these estimates from Experiment 1 are given in Table 6. Table 6 reveals the negative bias in the est:1mates attendant with inoreased censoring, and suggests that the varianoe of the estimates inoreases as oensoring beoomes heavy. .An examination of the individual s~.x for the four-variate population of Experiment 7, oensored (33), point s up this bias in a rather startling manner by revealing negative estimates of varianoe. The distribution of s2y. x at this particular point is indioated in Figure lOD by a histogrant of the estimates from one hundred samples. It beoomes apparent that it is possible for the negativa estimates to arise when we realize that the oorrelation matrix, C(p_l)(p_l)' is not guaranteed positive definite, sinoe the r ij are based on different nij pairs • • 4.4 Effioienoy of Prediction Equations Table 7A gives the means of E and their standard errors for the various population-censoring combinations of Experiment 1. Table 7B shows the mean values of t:. and their associated standard errors. There appears to be a very definite population effect influencing the effioienoies. It is clearly demonstrated by testing the hypothesis that geometric mean values of E representing the two populations are equal, assuming the usual linear model. An analysis of varianoe of the logarithms of E for the three variate populations oensored (11) to (66) app ears in Table 8. e e ·- -Table 6. ee Means and variances of errors of estimate obtained by the Method of Paired Correlation Coefficients; Experiment 1, N == ·20, k == 20 .50 [ .25 Censoring O"~.x -Sy.x 2 , .25] [.80.60· • .80] 0"2 = .3556 . == .7333 . l.X ~(s:.~ . .-.......-. 
2 Sy.x (00) (11) .741 .715 .082 .360 .084 .349 (22) .093 .098 .340 (33) (44) .715 .710 .696 ·.096 (55) .(66) .652 .611 .095 .102 (04) (40) (08) (80) .718 .087 .723 .718 .628 .092 .082 .109 ·.332 .308 .308 .289 .341 .326 .333 .296 [.25 Censoring V(s2 .) y. - .50 s .75 . , .25 .50 2 0"l.X == .3239 2 y.x . (000) (lll) .349 .340 .007 .006 .012 .016 (222) .320 (333) (444) .319 .194 .064 .012 .008 (555) .25] ~(s2y. ~ .019 .017 .015 .015 .032 .015 • .219 .364 .015 .021 .034 w w • _,~.",,,,,,,,,,,,""" •.'.""·i'_._.• •.•. _......... ".,..""-.. .:...;.._~ 34 e Table 7A. Values of i and their standard errors estimated by the Method of Paired Correlation Coefficients; Experiment 1, N = 20, k 20 e = Censor- [.25 ing E - (11) (22) ~~~ (55) (66) (04) (40) (08) (80) • , .983 .972 .958 .948 .918 .857 .983 .967 .936 .906 .En, .25] [ .80 .60, .80 ] Censor- [.25 ing S.E.(E) E S.E.(E) - .005 .006 .009 .010 .021 .035 .005 .008 .024 .029 i .962 .943 .901 .874 .819 .769 .931 .939 .833 .880 .010 .013 .025 .028 .038 .046 .012 .019 .027 .042 (111) (222) gla~ ; , (555) .En, - .25] E S.E.(E} .967 .907 .870 .784 .606 .008 .021 .028 .053 .072 I - = 20 [080 060, .80 ] Censor- [ .25 Censor- [ .25 .En, 025J ! ing ing A S .E o(i) A S.Eo(A) (00) (000) 0020 .028 .211 .303 (111) (11) .020 .028 .246 .321 (222) (22) 0028 .024 0251 .329 .029 .034 (333) (33) .297 .352 (444) (44) .036 0317 .376 0033 (55) (555) .422 0049 .360 .039 (66) .408 .054 0057 .484 (04) .025 .023 .253 0329 (40) .035 .270 .030 .355 (08) .360 .029 .031 .335 (80) .063 .057 .430 .338 - e .75, .25 Table 7B. Values of A and their standard errors estimated by the Method of Paired Correlation Coefficients; Experiment 1, N = 20, k e .En - I 050- -A .222 .262 .296 .330 .807 .774 .75, .25 .50, .25] S .E. ('A) .012 .016 .019 .024 .454 .202 35 e e Table 8. Analysis of varianoe of In E for the trivariate populations of Experiment 1 Variation due to d.f. Censoring (C) Population (p) CxP Error 1 5 228 Total 5 - , S.S. M.S. F 1.2691 ·3902 .1162 6.9693 .2538 ·3902 .0232 .0306 8.29 12.75** 239 I I ! I / NS I J I In order to define more olearly the regions of differential effioienoy in the three-variate oase, Experiment ,2 was set up to evaluate effioienoyat a large number of points in the population spaoe at • ~ fixed level of oensoring (55). The aim of the experiment was to find a response surfaoe to desoribe effioienoy for different t~es of populations. Thirty-nine populations were ohosen to represent regions in the spaoe of ad'nissible ~2' Ply' P2y values for whioh Ply and ~2y were positive. The points ohosen are given in the first oolumn of Table 9. As mentioned in Chapter II, the experiment was divided into two parts; part A., in which the same set of ten original samples was used to generate the samples on whioh estimates were made at eaoh point, and part B, in whioh 390 different samples were used to generate ten samples at each of the thirty-nine points. choosing this t~e The purpose of of sampling scheme was discussed :in Chapter II. Table 10 shows that the use of the same set of ten original samples at eaoh point in Experiment 2A makes for an expected reduction in the e e estimated error variance ·of the efficienoies. 36 e e Table 9. 
I' equation, E +B, of Experiment 2 A A Population [ .80 .ro, [ -.70 .30, .30, .30, .30, .30, .30, .30, .30, .30, 030, .50, .50, .50, .50, .50, 050, .50, .70, 070, 070, 080, .20, .30, .50, .70, .70, .70, .70, .70, 070, .30, .21, 031, .67, .50, .20, .60, .30, [ -.60 • e e Values of i and residuals from the response [ -.30 [ -.30 [ 010 [-.10 [ .50 [ ·70 [ ·30 [ .50 [ 070 [ .00 [- 050 [- .20 [ .20 [ ·40 [ .60 [ 060 [ .90 [ -050 [ .30 [ .75 [ .69 [ .87 [ -020 [ .20 [ 090 [ 030 [ 000 [ .60 [ .18 [ -.55 [ -.32 [ 089 [ .30 [- 075 [ 020 [- .69 .80J .40] .201 .30J: .oo]! .201 .70]; .(0) .701 .80]; .20J .20] 0801 .201 .401. .10) 0301 .50} .801 .50] .10J .301 .301 0401 .551 .40] .201 .87] .70] .10] .50] .,0] 030] .50] .70J .631 030] .77] .40] EA+B IA IB .8047 .5454 .7735 09012 .7159 1.0164 .7663 .9368 .8776 .8575 .9219 .7670 .8556 .7434 .8265 ·9293 09217 .8883 .8030 .7006 .6468 .7400 08212 .8753 .7841 .7189 .8426 .7480 07995 06924 .8208 09665 07804 .7729 .7576 08947 .6186 .7799 .5551 .8046 .4835 .7937 .8387 .7884 .8977 .8090 .9159 .8057 .8825 .9026 .7428 .8694 .7806 .8245 .8863 .8936 .8908 .8734 .5468 .6998 .8218 .8829 .9003 08318 .7499 08460 .6889 08377 .6994 .8431 .8972 .80/.,.6 .8096 .8062 .8925 06533 07989 .5163 .8819 .3218 .8599 .89119 07988 .9612 .7382 ·9311 .8220 .8943 .9268 .8102 .8219 .8562 .8403 .9332 .9195 .9418 08342 .6907 .5719 ·7197 .8579 .8273 .8843 .6003 .8497. 06515 08007 07153 .9058 .9359 .8332 .8006 .7646 .8973 .7118 06877 04746 EA - /\ EA+B -.0001 - .0619 .0202 ~.0625 .0725 -.1187 .0421 -.0209 -.0719 .0250 -.0193 -.0242 .0138 .0372 - .0020 -.0430 -.0281 .0025 00704 -.1538 .0530 .0318 00617 .0250 .0477 .0310 .0034 - .0591 .0382 .0070 .0223 -.0693 .0242 .0367 .0484 -.0022 .0347 .0190 -.0388 A ~ - EA+B .•0772 -.2236 .0864 -.0063 .0829 -.0552 -.0281 -.0057 -.0556 .•0368 .00149 .0432 -.0337 .1128 .0138 .0039 -.0022 .o53S .0312 -.0099 -.07119 -.0703 .0367 -.0480 .1002 -.0586 00069 -.0965 00012 00229 .0850 -.0306 .0528 .0277 .0070 .0026 .0932 -·9022 -.0805 37 e e Table 10. Analysis of variance of values of E d.f. Variation due to S.S. M.S. Experiment 2A Population (p) Samples (S) PxS 38 9 342 -389 Total 4.2437 9.3076 9.0269 - Total 389 Sf2 = .0026 22 •.5782 Experiment 2B 38 6.7964 351 1304488 Population Samples in populations .1117 1.0342 .0264 .1789 .0383 20.2452 It is necessary to digress briefly at this point to expla.in what • is meant by admissible sets of the Pij and to justify choosing the points indicated in the first column of Table 9 instead of those given by some more orderly response SllT'face design. It is obvious that some sets of Pij cannot be considered, since they lead to values of the multiple correlation ooefficient outside the range zero to unity. The region of admissible values is defined by the restriction that any 2 2 2 2 1/2 the range ~2P2y ± (1 - P12 -P2y + P12P2y) • Geometrically, this region may be visualized as the space inside a value of Ply must lie , l.n circular cylinder with axis Ply' which has been pinched olosed at each en~ and twisted until the ends are at right angles to each other. An orthodox reeponse surface design did not seem to be desirable at this point, since it would not have fitted into such an irregularly shaped e e spaoe so as to provide points falling into extreme regions. 
It was mentioned that among the populations considered, there were no sets having negative values for Ply or P2y • Such a selection was 38 e e 2 based on the symmetry whioh exists in the relationship ofPij and R I the multiple oorrelation ooeffioient If' both Ply andP 2y are negativel 2 R is the same as i f they were positive. If Ply and P2y are of 2 opposite sign, the value of R is the same as if the sign of P12 were 0 ohanged and the signs of ,Ply and P2y were oonsidered the same. A similar sort of symmetry has been assumed to exist for the effioiencies. Some empirical evidence to indioate the validity of this assumption is offered in Table 11. Table 110 Values of E and their standard errors at symmetrioal points in the trivariate population space; oensoring (55), N "" 20, k "" 10 Population [ P12 Ply' 22y1 • ! S oE o(E) (-.20 [ .20 050, -0401 .~, ·40J .824 0869 0090 0021 [-075 [ .75 020, .20, 0301 - .301 0653 .635 .093 .095 [ .18 [ .18 ·30, -·30, ·501 -·501 .897 .940 .038 oOU [-·55 ·55 [- ·55 .21, .21, - 021, .30J -0301 -.30J .805 .815 .854 .094 .044 .054 r Differenoes in mean efficienoy of the magnitude indicated above can be explained on the basis of differential censoring. Reoa11that the dependent variable is not censored, whereas the independent vari- e e ables are censored (55). General second degree equations were fitted to the mean vaJ.ues of ~ . indicat~d in 'fable 9 and resulted in the following equations to describe the response surfaces. For Experiment 2A, 39 e e i A = 1.1019 - .•0047P12 - .3608p~y- and for Experiment 2B ~B = 1.0566 - .2630P12 + 2 - .7245P2y + Although the coefficients of these equations are quite different, the contours which they represent are very similar in the regions where the sampling was done. A 1t:s' respectively. Figures 4 and 5 show contaurs for "EA, and ~ The contour levels of E for .7, .8, and .9 are shown a.t three levels of Ply in each diagram. • The sub space 0 f admissible P12 and P2y is bounded by heavy lines and the shaded areas near these !' !' borders indicate that the quadratic models, E,A and EB fit very poorly in these peripheral regions. For eX8llq)le, an additional population considered, [.65 .30, .901, gave values of 11. = ·401 and !B = .451. For the th::lrty-nine points fitted, "E explained about eighty-two per A c~nt of the variation :in the values of lA' and ~ explained about . s~venty-eight per cent of the variation in the values of ~. In both cases the variation accounted for by the modal was highly sign:i.:ficant. The seventy-eight points of the two experiments were combined to fit the single general quadratic model e e A 2 ... 2 EA+B = 1.0793 - .133 8P12 - .083~P1Y + .02l4P2y - .530lP12 - .42,58Ply 2' . (4.10) - .5431 p2y + .21 88P12 Ply + .7506p12P2y + •2503P1yP2y· The analysis of variance for this model is given in Table 12 following. 40 e e "" :' ".,.. ",.: . 'll'_ / 7, " / / I Eats I " I I / .,.. / Je/ I I I r I I I I -.6 o -.2 I . I I I , I ~,(~.....~/~,- - I .4 .2 E=(l / "'I_-""_-!Io:;/ I I I " / I I I .~ I I I I I I r I I I -.s S I I ~ ( I I I I I .e.- / I / illj. l· " I 1\ 1"" ,e,u. / I " "" .s .6 P:L2 I / / • / I ,, I ~~i3 i..9 I I ,!2 I I \ -.4 ,.( I I I I I I -.6 I I I 'I. -.2 ~.8 I I I -- ", 1\ E-J.7 I I If I I I I / I I I I ... I / " I I '" / I "" " I .6 .8 ,. , .-._~:<: P:Ly = \ I I •7 , I I I I I I K I I I I I I " E="a " / '. -.s -.2 0 .2 " / " r I / Ib: .7/ ..,' / e e r I I I /~ " Figure 4. Contours for ~A in trivariate population space I lit 41 e e ,:AfIII""" - """. .... - - ........ 
" "It.\ -- - , \ ~.'8~ \ \ \ I I I I I I / I I I I • , I 1 /'" 1 1 ,.,1 I' ,.7 I ...., I I -.2 , \ ...... ~ ."". I I I I I I ... ' .".--- .... ... "\ E=,9,· '<.t I I I I I I I' -.2 o .2 ,. ,.' .... ' ."",. .... --- ..... , \ \ )l. \ \ ", ~... I I I I I I I' -.8 Figure ~ Contours for -.2 Es 1\ o .2 I / ,~=.7 -l. .4 in trivariate population space P:1.2 e e Table 12. Analysis of variance for the model "EA+B of Experiment 2 Variation due to ! d.f'. Model "EA+B Lack of fit Errora Total S.S. M.S. F .8339 .1887 .0927 .0065 14·24 3.10 39 .0813 .0021 77 1.1039 9 29 I - I i a Failur~ of duplicates at the same points in Experiment 4A and 4B to agree. Note the similarity between the estimates of e:xperimental error for E given in Table 11 and that determined by duplicates at each point as g:i.ven in Table 12. • The residuals from the single general ql1adratic are given in Table 9 with the values of E at each point., These data and the contour diagrams of Figures 4 and 5 might reasonably be interpreted to indicate that this method of estimation is least suitable whenapp~ied to_ populations in which zero order correlation coefficients are of unlike magnitude or when their absolute values are large. The approach used here to indicate how the type of population affects efficiency would be very cumbersane in considering more than three variates. In the four-variate case, six factors are required to characterize a population, hence, the fitting of a general second degree surface would require twenty-eight points, plus those needed to obtain an estimate of error • e e .An alternative approach to higher variate cases might consist of an extension of the results obtained for the trivariate case. In the four-variate case, for example, we could consider all possible subsets 43 e e of three correlation coefficients as representing three-variate populations and could get some appraisal of e:xpected efficiency for each of these. Then, if the results for the'triva:riate case held for four variates, some composite measure of these expected efficiencies would be highly correlated with the efficienoy obtained for the four-variate population. The simplifioation of this idea resorted to here consisted of oonP3yJ in terms of the three trivariate populations made up of all sets of sidering the four-variate population [,012 P13 Ply' ;P23 P2y' two (of the three) prediotors with the predictand; specifically, ?· [P1 2 Ply' P2;' [P13 Ply' P3yJ, and [P 23 P2y' P3 The contention is then. that the efficiency of the four-variate population will be • related to a oombined measure of the effioienoies of the three subsets as determined by the kind of relationship indicated in equation (4.10) above. Experiment 3 provided some data which, with this approach, confirms the extension of the results for the trivariate case to that for four variates. Table 13 gives the mean efficiencies for six four-variate populations, eaoh oensored (555) with the means based on twenty samples. The population correlation matrice,s for all possible subsets of two prediotors with tle predictand are indicated for each four-variate population. For these trivariate subset populations, the predioted /\ e e effioiencies based on E +B were oomputed and are given in the last A A /\ oolumn of Table 13. The means of these E +B, denoted avg (E +B), A A and the E for each four-variate population may be seen plotted against each other in Figure 6. e e Table 13. 
the Method of Paired Correlation Coefficients; Experiment 3, censoring (555) N- 20, k 20 - [ P:L2 P:J.3 P:J.y' 1023 ~y' P3Y] [ .20 .29 .30, .59 .40, .51] [ .10 .20 .40, .48 .9:>, .60] a EA+B A [P:ij Piy' Pjy] , .757 .20 .29 .59 .20, .30, .40, .40 1.008 .971 .913 .710 .10 .20 .48 .40, .40, .50 .930 .915 .906 .9:>, .51 .51 .60 .60 I [ .60 .80 .29, .9:> -.20, .20] .686 [ .20 .,31 .60, .25 .50, .77 ] .~5 [ .80 Possible subsets of two predictors and the predictand E Population • 44 E estimated by Values of - .90 .30, .70 .35, .53] .342 [-.70 -.52 .32, .60 .26, .44] .271 \ .60 .29, -.20: .20 .80 .29, .20; .9:> -.20, .778 .741 1.026 ; .60, .9:>: .20 .31 .25 .60, .9:>, .77: .77: .879 .814 .827 .80 .90 .70 .30, .30, .35, .35. .5.3; .53 .800 .788 .88'2 -.70 .32, .,32, .26, .4ll; .411; .2~ .647 .674 .923 ~.52 .60 a These-values were determined by suhstituting the values of. Pi" Piy' and P jy into equation (4·10) as P12' Ply' and~2y' respectJ.ve!y. e e 45 0 .95 0 ·90 .85 0 0 0 .80 .75 ·30 ·40 ·50 .00 ·70 .80 E Associated values of E and avg 2 :£rom the data of Table 13. R .. .78 Figure 6. • (EA+B) Although the evidence here is meager, it appears that the Paired Correlation Method in the four-variate case al so yields poor prediction equations for populations with heterogeneously correlated variables. This hypothesis is furtlE r supported by the data from the five-variate cases of Experiment 6, which will be discussed presently, and by the data of Experiment 7, which will be summarized in Chapter VIII. Experiment 6 imrolved use of the Paired Correlation Method with data simulated to represent two situations, one for which this method would be e:xpected to yield reliable results, and another for which it would not. A type of population which often arises in biological situations e e imrolving variates governed by feedback mechanisms is exemplified by [.60 .80 .29 .40, .50 -.20 .;0, .20 .70, 030Jo Because of the heterogeneous character of the correlation ooefficients it would not seem e e 46 advisable to apply the Paired Correlation Method to inoomplete samples from suoh a populat ion. .An example of the twe of population frequently enoountered in pre- dioting from batteries of tests is [.36 .24 .25, .22J. .41 .40 .32,~24 .24 .30, This oorrelation matrix is, in faot, based on results of tests administered to trainees in attempting to prediot their suocess in pilot training (Du Bois, 1958). Since, in this case, oorrelations between pairs of variates are similar, the Paired Correlation Coefficient Method would seem to be a sati Bf'aotory tool to use in ~ttiIlg a prediotion equa.tion where values among the independent variables are missing. Two hundred samples, oensored (3333), were generated to simulate • one hundred from each of the two populations' above • The Paired Correlation Method was then applied to each sample to determine a prediction eQl':ation. Histograms of the resulting values of E from these equations are shown in Figure 7, along with the efficiencies obtained by the ~se of complete observations. the expectations stated above. The results rather strikingly oonf1rm The mean values of E and A from Experiment 6 and their respeotive standard errors are given in Tables 33A and 33B which are inoluded an:ong the summary results of Chapter VIII. In the absence of .! priori knowledge of the population from which a fragmentary sample is taken, it may be worthwhile to set oonfidmoe l:i.mits on the sample oorrelation ooeffioients from all existing' pairs. 
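A standard way to obtain such limits is Fisher's z transformation applied to each coefficient with its own number n_ij of available pairs; the thesis does not spell out the computation, so the sketch below should be read only as one reasonable, modern illustration of the idea rather than as the calculation actually used.

```python
# Illustrative only: approximate confidence limits for a correlation r based on the
# n pairs of observations for which both variates are present (Fisher's z method).
import math

def fisher_ci(r, n, z_crit=1.96):
    """Approximate 95 per cent limits for a correlation r computed from n pairs."""
    z = 0.5 * math.log((1.0 + r) / (1.0 - r))     # Fisher's z transform
    se = 1.0 / math.sqrt(n - 3)
    lo, hi = z - z_crit * se, z + z_crit * se
    return math.tanh(lo), math.tanh(hi)           # back-transform to the r scale

if __name__ == "__main__":
    # r = .672 on 54 pairs gives limits of roughly .49 to .80, close to the
    # .50 to .79 interval quoted for the corresponding coefficient in Table 29.
    print(tuple(round(v, 2) for v in fisher_ci(0.672, 54)))
```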
Such limits may be of help in deciding whether or not to use the method of this chapter in forming a prediction equation.

Figure 7. Histograms of Ē; Experiment 6, censoring (3333), N = 20, k = 100. Panels: (A) complete observations and (B) Paired Correlation Method for population [.60 .80 .29 .40, .50 -.20 .50, .20 .70, .30]; (C) complete observations and (D) Paired Correlation Method for population [.36 .41 .40 .32, .24 .24 .30, .24 .25, .22]

4.5 Related Methods

It is well known that r is a biased estimate of ρ. Accordingly, one might substitute the approximately unbiased estimate,

ρ̃ = r [1 + (1 - r²)/(2(N-1))],

in the correlation matrix and then proceed to find the prediction equation as outlined in section 4.1. A type of maximum likelihood estimate of ρ is found by maximizing ln f(r) with respect to ρ. It is approximately

ρ̂ = r [1 - (1 - r²)/(2(N-1))].

This estimate of ρ may also be used in forming the correlation matrix to be used in estimating the regression equation.

Since neither of these schemes gives regression equations consistent with the least squares estimate of the regression equation when no observations are missing, they have not been investigated here. However, it is interesting to note that in two populations where the method of section 4.1 proved inefficient, neither of the alternatives provided a significant improvement. In the populations [-.69 .30, .40] and [.90 .70, .87] of Experiment 2A, the method of section 4.1 yielded mean efficiencies of .474 and .649, respectively. Use of the unbiased estimates of ρ gave mean efficiencies of .441 and .636, respectively, while substitution of the maximum likelihood estimates yielded efficiencies of .504 and .660, respectively.

CHAPTER V

THE GREENBERG PROCEDURE

5.1 Definition

The method to be discussed in this chapter was first suggested by Greenberg (Institute of Statistics, 1957) in connection with data from a perinatal mortality study in which hospital records were not complete in the detail desired. In a comprehensive analysis of these data, Wells (1959) defined and applied the method.

A more specific notation than that used in previous chapters will facilitate a description of the procedure. Suppose we assume the p variates, x_1, x_2, ..., x_(p-1), y, to be distributed multivariate normally. Consider a sample of N independent observation vectors, having a subset with characters missing among the independent variates. The prediction equation, equation (5.1), is determined as follows:

a. Estimate all missing characters by some reasonable initial value, such as the sample mean of the particular variate from which the character comes. Having done this, one can proceed to find the ordinary least squares solution for predicting y. It may be written

⁰ŷ = ⁰b_0 + ⁰b_1 ⁰x_1 + ⁰b_2 ⁰x_2 + ... + ⁰b_(p-1) ⁰x_(p-1),   (5.2)

where the ⁰x_i indicate that both observed values and initial estimates for missing characters were used to find ⁰ŷ by least squares.

b. Using the same data, determine

¹x̂_1 = ¹b_10 + ¹b_12 ⁰x_2 + ¹b_13 ⁰x_3 + ... + ¹b_1(p-1) ⁰x_(p-1) + ¹b_1y y,   (5.3)

the least squares equation obtained by taking x_1 as the dependent variable and the remaining x_i and y as independent variables. Replace the previous estimates for missing values of x_1 by those given using equation (5.3); for example, the value of x_1 missing from the αth observation is estimated by evaluating (5.3) at the αth values of the remaining variates.

c.
With the data now including revised estimates for missing values of x_1, determine

¹x̂_2 = ¹b_20 + ¹b_21 ¹x_1 + ¹b_23 ⁰x_3 + ... + ¹b_2(p-1) ⁰x_(p-1) + ¹b_2y y   (5.4)

by ordinary least squares, where x_2 has been regarded as the dependent variable and the other variables as independent. Replace the previous estimates of x_2 by those given using equation (5.4).

d. Continue to successively re-estimate the missing x_jα using

ᵏx̂_j = ᵏb_j0 + Σ_(i≠j) ᵏb_ji ʰx_i + ᵏb_jy y,

where ᵏb_ji represents the coefficient of ʰx_i used in predicting x_j for the kth iteration, and ʰx_i indicates either observed values of the x_i or those estimated on the hth iteration by continuing the procedure outlined above (h = k if j > i and h = k - 1 if j < i). The successive estimating equations follow this pattern for k = 1, 2, ..., m, where m is the number of iterations required for convergence; convergence being defined by the condition ᵐŷ = ᵐ⁻¹ŷ (equation 5.1).

If characters are missing from one variate only, say x_j, no iteration is required. In this case, missing entries among the x_j are initially estimated by

x̂_jα = b_j0 + Σ_(i≠j) b_ji x_iα + b_jy y_α,

where the b_ji and b_jy are determined by ordinary least squares regression methods using the n complete observations, regarding x_j as the dependent variate. This condition is made clear by considering that when x_j is the dependent variate, the criterion used for the best fitting hyperplane (in terms of the remaining variates) consists of minimizing the sum of squares Σ_α (x_jα - x̂_jα)². Since no x_jα exist for the N - n incomplete observation vectors, the hyperplane fitted to all N observations is identical to that for the n complete observations.

5.2 Bias and Variance of Regression Coefficients

Use of the Greenberg Procedure in Experiment 1 yielded biases, b̄_i - β_i, and variances, V(b_i), for the various population-censoring combinations as given in Table 14. These data illustrate the tendency for both the bias and the variance of the regression coefficients to increase with increased censoring.

The bias in b_0, which from Table 14 appears to be significant or nearly significant in most cases, can probably be regarded as a sampling phenomenon. This statement appears reasonable when we recall that the corresponding values of Table 1, based on complete observations, also yielded t tests indicating bias. It is known that those coefficients are theoretically unbiased. The remarks in Chapter III in regard to these apparently significant values of b_0 may be cited as applicable

Table 14. Statistics of regression coefficients estimated by the Greenberg Procedure; Experiment 1, N = 20, k =
20 -- '" .~ [.25 , r .80 .25] .60 , .80] Censoring 'iio-llo V(bo) '1\-131 V(~) V(b ) 2 bo-13 o V(b ) o '1\-1\ V(~) (00) '.113* .055 .051 .058 -.033 .025 .078* .026 .060 .052 (ll) .121* .049 .082 -.037 .032 .069 -.066 .036 .091* .027 .009 .073 .032 .053 (22) .124* -.047 .107 .081 .010 .041 .096* .025 -.044 .104 .084 .068 (33) 'ii2-13 2 b2-13 2 V(b ) 2 .143* .054 .102 .104 .029 .042 .112 .025 -.123 .133 .1591- .087 (44) .159!t- .058 .138 .126 .055 .093 .107* .026 -.211* .184 .240* .136 (55) .222* .085 .159 .233 .0'11 .181 .lll* .035 -.300 .264 .303 .192 (66) .286* .160 .238 .'113 .006 .539 .164* .081 -.377* .489 .3~ .359 (04) .122* .O~ .045 .055 .029 .056 .091* .027 -.062 .066 .ll5 .062 .124 -.005 .029 .084* .027 -.006 .155 .026 .077 .284* .086 (40) .132* .060 .126 (08) .095 .062 .022 .097 .016 .163 .080 .035 -.240* .098 (80) .120 .072 .230 .326 .059 .052 .078 .029 .049 .352 .~ [.25 'iio13 9 .75 , .~ .25 , '1\131 V(~) 'ii2-13 2 V(b ) 2 .026 .026 .Oll .026 -.095 .019 'ii -13 3 3 V(b ) 3 (000) -.104 .016 (lll) -.171* .014 -.037 .060 -.045 .034 -.054 .048 (222) -.177* .017 -.ll3 .073 -.073 .048 .040 .067 (333) -.206 .029 -.095 .063 -.210 .189 .205 .156 (444) -.166* .015 -.222 .057 ' -.264 .214 .330 .298 (555) -.491 .261 -.047 .568 -.756 1.348 .435 .774 * 'ii. - 13 i J.S'ij a TI:Je symbol indicates > t .05' i b For this tour-variate case, k .. 5. k - 1, .161 .25]b V(b) 0 -.037 V(b ) i where st... - k - ' and k is the number ot samples on which the b i " are based. J. VI \A) e e here also. In addition, it :may be noted that there are no significant t values for the biases, bo - 130 , in Table 15, which shows the biases and variances of the regression coefficients from Experiment 4 where samples from several different populations were censored (55). Significant biases in b and b are indicated in Table 14 at the l 2 higher levels of censoring in population [.80 .60, .80 J• From Table 15, it can be seen that, although the Greenberg procedure does not generally produce biased regression coefficients, more than the expected number of coefficients, bl and b2, in Experiment 4 fail to meet the criterion for unbiasedness. From the limited amount of data available, it is not possible to determine the exact nature of the bias. There does not seem to be a pronounced relationship between bias and • population, as is indicated by the disparity between results obtained for the same populations in Experiments 4A and 4B. In both Experiment 1 and Experiment 4, estimates of the regression coefficients were made using complete observations as well as by the Greenberg Procedure. It has been mentioned that coefficients determined in this way are unbiased and consequently provide a kind of standard with which one can compare the Greenberg Procedure. significant values of bo - 130 , Except for the corresponding to those in Table 14, none of the biases of regression coefficients calculated from complete observations were found to be significant. This suggests t~,t the Greenberg Procedure may produce biased estimates of 131 and 13 2 • Experiment 7 provides us with still more data which we may examine e e in commenting on the bias and variance of regression coefficients obtained by use of the Greenberg Procedure. from the population [~.70 The one hundred samples -.52 .32, .60 .26, .44J, censored (333), -- ee Table 15. 
ee Statistics of regression coefficients estimated by the Greenberg Procedure; Experiment 4, censoring (55) ,N- 20, k - 10 Population _ ~eriment4A A bo-(30 V (bo) [-.2 [-.2 [-.2 [-.2 [ .6 [ .6 [ .6 [ .6 [ .873 [-.473 [ .2 [ .2 [ .2 [ .2 [ .2 .6 , .6 , -.2 , -.2 , .6 , .6 , -.2 , -.2 , .2 , .2 , .873, -.473, .2 , .2 , .2 , .6 ] -.2 ] -.2 ] -.6 - ] .6 ] -.2 ] -.2 1 .6 ] .2 ] .2 ] .2 ] .2 J .873] - .473] .2 ] .069 .101 .090 .148 .146 .030 .128 .076 .165 .147 .072 .098 .102 .077 .171 .016 .068 .088 .081 .064 .008 .127 .023 .159 .114 .036 .091 .031 .045 .074 b1-(31 V(b:I)! 1)2-(32 j I V(b 2) I bo-(30 .440 I .043 Exoeriment 4B V(bo) br-(31 V(lJ.) 1)2-(32 V(b2) I I I .116 .015 .037 .107 .120 .072 •345 1 .133 -.251 .035 .457 .536 iI -.036 .180 1.303 : -.078 ....159 .314 -.2!:4 .449! .120 .528 :, .033 .002 .080 .107! -.19B*- .072 .665; .210' 1.037! -.068 -.299 -.259* .•032 i .233* .063 ; .043 , -.574', 2.763 .174 .610' 3 .221 .008 .300 .539 I .037 .~3 .028 .028 .052 .209 ; .007 -.366*- .179 .230 .229 ! -.033 -.140' .108 .10B*- .036 : .092 .219 . -.018 -.195 .374 -.164' .100 -.133 .413 ' .053 .369 I I I .014 .116 .060 .090 .036 .010 .165 .013 .239 .073 .017 .067 .022 .019 .040 .063 .403* -.116' .022 .002 .130* .029' -.138 .043 ..307 .042 -.112 -.068 .011 - .013 1 .179\ -.245 .128 , -.179 .255\ .262 .239\ .001 .014. -.166* .5621 - .071' I .0!:4 j .085 3.059' - .117 .380 .279 .034 -.062 .105 -.004 .106 .092 .246* .0411-·22~ .18S' ;,873 i I .380 .011 .496 .277 .142 .292 .021 .748 .079 3.677 .278 .074 .1!:4 .033 .063 .970 ... * Indicates iii ~ Pi ~. J. > t.0 5, 9 d.!. 2 where s-si == '" (bi ) V --w \Tl \Tl e e yielded estimates of regression coefficients, the biases and variances of which are shown in the third column of Table 16. Histograms of the coefficients obtained using the Greenberg Procedure are given in Figure 8. For this particular case, with evidence available from a large number of samples, all the regression coefficients except b are clearly O shown to be biased. The tabulated variances of the regression coefficients call for only brief comment 0 In Tables 1.4 and 15 marked differences may be noted in the variances for different populations, just as one would expect. For example, the variances in the regression coefficients are large for population [.8728 .2, 2J, which has a large covariance between the independent variables and large error of estimate. • On the other hand, the variances of regression coefficients are small for population [-.2 06, .6], where the covariance between the independent variables and the error of estimate are both small. A. The tendency for the V(b ) i to increase with increased censoring in Table 1.4 was also noted in discussing similar data calculated by other methods. 5.3 Estimate of Error 2' An estimate of error, SYox may be calculated in the usual way after estimating the missing values and determining the regression eqUation by the Greenberg Procedure. Such a statistic grossly underestimates the true variance, since the degrees of freedom used in its computation are based on some values estimated from the data. e e In similar cases it has been possible to estimate "effective degrees of freedom tt based, on the comparison of like moments 0 This approach is not applicable here, however, since the mathematical form e e 302..0 • 10 • -.1 .1 .7 ~3" I.~ 8 1.\ 1.5 .9 \,\ 13 1,5 1,7 IF! .Q~ 10 ·,1 ,I .'3 .5 .7 ~2" • .bO 2.0 10 I,t- ~ 1."2. 1,1+ 16 \:t' '1"C 1..1 :l...l+ = 1.11 30 '20 \0 ~ e e o .. 
Figure 8. Histograms of regression coefficients estimated by the Greenberg Procedure; population [-.70 -.52 .32, .60 .26, .44], censoring (333), N = 20, k = 100

of the distribution or its moments is not known. A simple rule of thumb sufficient to produce reasonable estimates of σ²_y.x is given by defining "effective degrees of freedom" for error as

f_e = N - p - c/(p-1),

where c = the total number of missing variate values among the p-1 independent variates. The resulting "pseudo-variance" then is

s²*_y.x = (N - p) s²_y.x / (N - p - c/(p-1)),   (5.6)

where s²_y.x is calculated in the usual way assuming that the estimated characters were observed values. The rationale for choosing N - p - c/(p-1) as "effective degrees of freedom" is that the proportion of the independent variate values estimated is c/[N(p-1)].

To illustrate the effectiveness of equation (5.6) as a measure of variance, we will consider the "pseudo-variances" arising from the use of the Greenberg Procedure in Experiment 4B. The mean of the "pseudo-variances" for each of the fifteen populations is plotted against the true variance, σ²_y.x, in Figure 9A. A weighted least squares line, assumed to go through the origin, was fitted to these points and gave σ²_y.x = 1.05 s²*_y.x. The slope is not significantly different from unity, although values of s²*_y.x still appear to be underestimates for low values of σ²_y.x. Differences between σ²_y.x and s²*_y.x can be accounted for to some extent by sampling variation. This is evinced by comparing Figure 9A with Figure 9B, where the values of s²_y.x, calculated from the uncensored samples, are plotted against the values of σ²_y.x.

Figure 9A. Mean pseudo-variance of Greenberg Procedure plotted against the true variance; from data of Experiment 4B

Figure 9B. Mean variance estimated from uncensored data plotted against the true variance; from data of Experiment 4B

Additional evidence concerning the use of s²*_y.x as a measure of residual error is given by considering its sampling distribution for the data from population [-.70 -.52 .32, .60 .26, .44] when samples are censored (333). Figure 10C shows a histogram for values of s²*_y.x based on one hundred such samples. Figure 10A represents the sampling distribution for errors of estimate from corresponding uncensored samples, and Figure 10B shows the sampling distribution for errors of estimate based on complete observations. Since f_e = 13 here, the histogram of Figure 10C would be expected to assume a form intermediate in shape between that given by Figure 10A (theoretically proportional to χ² with 16 degrees of freedom) and that shown in Figure 10B (theoretically proportional to χ² with 7 degrees of freedom). Thus it can be seen that s²*_y.x underestimates σ²_y.x here. This observation supports a statement made in the preceding paragraph, namely, that s²*_y.x appears to underestimate σ²_y.x for low values of σ²_y.x. For the four-variate population from which the one hundred samples were selected, σ²_y.x = .20.
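For readers who wish to reproduce calculations of this kind, the sketch below restates steps (a) through (d) of section 5.1 together with the pseudo-variance of equation (5.6) in modern matrix notation. It is only an illustrative reconstruction under the assumptions already made in this chapter (missing values confined to the independent variates, the dependent variate complete, initial estimates taken as the variate means, and convergence judged by agreement of the b_i to three decimal places); it is not the IBM 650 program of Chapter II, and the function names are chosen here for illustration.

```python
# Sketch of the Greenberg Procedure (section 5.1) and the pseudo-variance (5.6).
# Assumptions: y complete; missing values (NaN) occur only among the independent variates.
import numpy as np

def _ols(design, response):
    """Ordinary least squares with an intercept; returns [b0, b1, ..., b_m]."""
    A = np.column_stack([np.ones(len(response)), design])
    coef, *_ = np.linalg.lstsq(A, response, rcond=None)
    return coef

def greenberg(X, y, tol=1e-3, max_iter=200):
    X = np.array(X, dtype=float)          # N x (p-1) independent variates, NaN = missing
    y = np.asarray(y, dtype=float)
    N, m = X.shape                        # m = p - 1
    missing = np.isnan(X)
    c = int(missing.sum())                # total number of missing characters
    # (a) initial estimates: the sample mean of each variate
    X[missing] = np.take(np.nanmean(X, axis=0), np.where(missing)[1])
    b_old = _ols(X, y)                    # equation (5.2)
    for iteration in range(1, max_iter + 1):
        # (b)-(d): cycle through x_1, ..., x_(p-1), re-estimating each variate's
        # missing entries from the remaining variates and y
        for j in range(m):
            others = np.column_stack([np.delete(X, j, axis=1), y])
            coef = _ols(others, X[:, j])
            fitted = coef[0] + others @ coef[1:]
            X[missing[:, j], j] = fitted[missing[:, j]]
        b_new = _ols(X, y)                # current prediction equation for y
        if np.max(np.abs(b_new - b_old)) < tol:
            break                         # successive b_i agree: convergence
        b_old = b_new
    p = m + 1
    resid = y - (b_new[0] + X @ b_new[1:])
    s2 = resid @ resid / (N - p)                      # usual residual mean square
    s2_star = (N - p) * s2 / (N - p - c / (p - 1))    # pseudo-variance, equation (5.6)
    return b_new, s2_star, iteration
```

The returned coefficient vector corresponds to the prediction equation (5.1), and the iteration count can be compared with the convergence results discussed in section 5.5.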
It must be pointed out that this apparent underestimation may not occur for higher values of (J2 • y.x 5.4 Efficiency of Prediction Equations Table l7A gives the means of E and their standard errors for the various population-censoring schemes of Experiment 1. the corresponding means of t::. and their standard errors. Table l7B shows There does not appear to be a population effect influencing the efficiencies here, as e e was the case with the Paired Correlation Coefficient Method. A further evaluation of the Greenberg Procedure was considered desirable, however, so E:xperiment 4 was set up to evaluate the method over a wider range of -I- - e e Ie (A) r-- r-- rl No Censoring 10. (B) • e e Complete Observations 10 (C) Pseudo Variance Greenberg Procedure (D) Paired Correlation Coefficients Figure 10 Histograms of estimated error variances; Experiment 7 [-.70 -.;b .32, .60 .26, .44], censoring (333), N • 20, k • 100 a The symbol + indicates a value outside the range of scale. 62 63 e e Table 17A. Values of i and their standard errors estimated by the Greenberg Procedure; Experiment 1, N = 20, k Censor- ( .25 ing i (11) (22~ (33 (44) (55) (66) (04) (40) (08) (80) • .976 .95'6 .939 .894 .810 .614 .967 .956 .875 .825 .50, .25] (.80 .60, .80] -- Censoring ( .25 E S.E. (i) S.E.(E) .009 .011 .013 .023 .041 .047 .009 .012 .036 .032 = 20 .978 .960 .910 .836 .764 .664 .944 .953 .845 .860 Table 17B. Values of .007 .011 .021 .030 .041 .051 .014 .012 .027 .033 (111) (222) .50 i .75, .25 (333~ aand their standard errors .25] S .E.(E) .018 .032 .129 .136 .159 .949 .858 .736 .627 .445 (444 (555) .50, estimated by the Greenberg Prooedure; Experiment 1, N • 20, k • 20 Censor- ( .25 ing A. (00) (11) (22) (33) (44) (55) (66) e e t l 40 08) (80) .303 .330 ..351 .386 0447 .564 .833 .347 .379 .431 .430 .50, .25] (.80 .60, .80 ] Censor- [ .25 S.E. (!) S.E.(r> ing A. .028 .028 .028 .033 .036 .055 .099 .027 .036 .037 .069 - .211 .239 .256 .291 '1343 .401 .518 .253 .262 .332 .345 .020 .020 .022 0024 .024 0029 .062 .019 .026 .021 .044 (000) (111) (222) (333) (444) (555) .50 A. .222 ..324 .365 .482 .576 1.114 .75, .25 .25] .50, ...._._ ... S.E. (A) .012 .017 .019 .097 .117 .3.:b 64 e e populations and a larger set of random san:ples. Since the Greenberg Procedure is quite time consuming, only a limited number of populations could be sampled. And, just as in Chapter IV, the coordinates of the points were required to be admissible sets of Pij' The configuration of points chosen was a central composite type of response surface design selected so that the extreme points would fall within the region of admissible values. 4 was As mentioned in Chapter II, Experiment composed of two parts; part A, based on samples from different populations generated from the same set of original deviates, and part , :'.- B, based on different sets of original samples at each point •. The points chosen and the means of efficiencies with their standard errors at these points are shown·in Table 18. • Mean efficiency at each point was based on ten samples • - Table 18. Values of E and their standard errors esti.mated by the Greenberg Procedure; Experiment 4, censoring (55), N I : 20, k • 10 Experiment 4A E S .E. (E) Population e e [-.2 [-.2 [-.2 [-.2 [ .6 [ ".6 [ .6 [ ·.6 [ .8728 {-.4728 [ .2 [ .2 [ .2 [ .2 [ .2 .6 , .6 , -.2 , '-.2 , .6 , .6 " , -.2 , -.2 , .2 , .2 , .8728, -.4728, .2 , .2 , .2 , - Experiment 4B. E S.E.(E} . ~. 
•6 ] -.2 ] ] -.2 ] .6 .6 ] -.2 ] -.2 ] .6 ] ] .2 ] .2 ] .2 ] .2 .8728] -.4728] ] .2 .710 .772 .716 .720 .79J .716 .765 .714 .711 .712 .692 .774 .764 .•765 .777 .074. .067 .083 .081 .086 .056 .080 .069 .082 .084 .073 .064 .060 .061 .072 ., .810 .76b .808 .820 .820 .837" .783 .769 .727 ".812 .840 .830 .798 .822 .727 .044 .•081 .039 .053 .061 .032 .042 ".040 .085 .037 .039 .035 .059 .035 .085 65 e e The use of the same set of original samples at each point in Experiment 4A has the same effect, although not as marked, as that which was noted in Experiment 2. This effect is that the mean square error for part A becomes smaller than .that for part B after differences among samples have been accounted for in the analysis of variance of This outcome is seen in Table 19 below. E. Table 19. Analyses of variance for values of E from the data of Experiment 4., \ Variation due to I d.f. I S.S. Experiment Population (p) (s) Samples • p xs Total -149 M.S. 4A. .1246 4.1677 3.1720 14 9 126 '! .0089 .4631 .0252 sE • .0025 .0135 .0295 .0030 sg E • 2 7.4643 Eag:>eriment 4B Population Samples in populations Total 14 135 149 I .1886 3.9861 4.1747 The general second degree equation fitted to the data of Experiment 4A yielded the response surface described by 2 2 A = .7718 + .0658 P:t2 + o0268Pry'" .01l3P2y - .1458P:t2 -.0982P:!.y ~ .l!i 66 e e E:x;periment 4B gave . (5.6) The analyses of variance for these fitted surfaces are given in Table 20. Table 20. An~es of variance for models fitted in' Experiment Variation due to 1\ Model E A Error Total • d.f. S.S. M.S. Experiment 4A. .0065 9 .0007 .0059 .0012 5 -14 4. F <1 .0124 Experiment 4B Model ~ 9 .0156 .0017 Error 5 .0033 .0007 Total -14 2.65 N.S. .0189 The variation accounted for by the models EA and ~ is not significant. Thus, it does not appear that, in trivariatecases, the population sampled affects the efficiency of prediction equations estimated b.Y the Greenberg Procedure. These analyses confirm the findings from Experiment 1. E:x;periment 7, which was mentioned in section 5.2, lends some additional information on the efficiency of the Greenberg Procedure. e e From the four-variate population [-.70 -.52 .32, .60 .26, .44], one hundred samples, censored (333), were selected and regression equations estirnated by various methods. Histograms of the ' resulting sampling e e 67 distributions of E and 6. are shown in Figures llC and 120 respectively. This particular case e:xl9Illplifies the sort of missing data situation to which the Greenberg Procedure might be advantageously applied if there were some basis for not wanting to discard the fragmentary observations. The computing time required in using the Greenberg Procedure for higher variate cases precluded extensive sampling work in this area. However, it was .necessary to ascertain that such computations were possible. Inaddition to this, the results of the four-variate work in Experiment 7 were encouraging enough to warrant applying the Green- berg Procedure to some of the five-variate samples of Experiment 6. Consequently, Experiment 8 was carried out with twenty samples, the results of which gave the efficiencies E and b. shown in Table 21. • The values of the statistics given' by using the other methods are also presented for comparison. In population [.60 .80 .29 .40, 50 -.20 • .50, .20 .70, .30], the Greenberg Procedure yielded better results than were obtained with the other two methods. 
This was true for eight of the ten samples considered in comparing it with the use of complete observations, and for nine of the ten samples considered in comparing it w~'th the Paired Correlation Method. In population [.36 .40 .41 .32, .24 .24 .30, .24 .25, .22], there appeared to be little difference between the Greenberg Procedure and the use of complete observations, while the Paired Correlation Method yielded significantly better results than either of these. e e 68 - 25 ~ 2C /s - /0 ~ S 5 n (A) Uncensored Data ll) I~ \0 5 (B) Complete Observations • (C) Greenberg Procedure 20 15 - to L_-L--.L--.!_L+---L--+'--lL~::C:::I:---.!_.t=::::c=~---.l....:+++:!=I ++a I' - it ,2 (D) Figure 12. ~ ..£.4 .0 .6 Paired Correlation Method Histograms of t.; Experiment 7 population [-.70 -.52 .32; fJJ censoring (333), N = 20, k = 100 .26, .44] a The symbol + indicates a value outside the range of scale. 69 e It "-10 .. ~5 .---- I- .:so 15 2J) 15 /0 S o ,!.I (A) I I .5 .<0 Complete observations r- r--- • I .~ .1 (B) .~ .5 I .b :1 Greenberg Procedure 20 /5 10 o I I .1, .2. .3 r .q ;7 (C) , Paired Correlation Method Figure 11. Histograms 6f E; Experiment 7. population [-.70 -.52 .32, .60 censoring (333), N = 20, k = 100 .26, .44J 1 \,0 e ee e_ Table 21. Values of E and!:i from the twenty samples of Experilnent 8; censoring (3333), N = 20 Population [.60 .80 .29 .40, .~ -.20 .~, .20 .70, .30] E A ~~-(h.eenl::lerg Comp1et.eu _.u--·Paii'ed Complete Paired GreeJ;loorg Observations Correlation Procedure Observations Correlation Procedure .. Sample Number .681 .817 .634 .041 .168 .415 .704 .496 91 92 93 94 95 96 97 98 99 .~9 .612 100 Mean -- .. Std. EtTer , " 105 106 107 108 109 110 III ll2 113 Mean . t Std. Error .~8 .077 .865 .036 .773 .539 .878 .796 .270 .726 .710 • 2 .601 .089 .211 .381 .676 .358 .283 1.414 2.144 .464 .228 .940 .496 .721 .856 .405 .640 .388 .535 .303 . .870 .409 .1589 .397 .533 .618 .648 .596 .440 .086 .066 .181 .110 Population [.36 .40 .41.3 2, .24 .24 .30, .24 .25, .22] .802 .871 .897 .935 .282 .868 .936 3.297 .869 .448 .312 .746 1.096 .878 .981 .555 .848 .852 .794 .950 .!:46 .705 .326 .774 1.258 .386 .707 .357 .655 .398 .974 .353 .781 .644 .423 .953 8 1.22 .24 .~6 1.086 .664 .•889 .261 .028 .076 .077 .487 .319 .094 .676 .835 .165 .089 .587 .753 .391 . .898 .356 .756 .743 .317 .475 .835 . .158 .490 .2~ .405 .510 .379 .378 .446 .129 .410 .356 .042 1.093 .291 .4~ .625 1.009 1.424 1.173 1.ll2 .632 1.000 .881 .114 Uncensored nata .119 .325 .152 .227 .175 .195 .301 .199 .188 .179 .206 .020 .803 .331 .333 .538 .721 .244 .390 .331 .354 .214 .426 .063 -.J 0 71 e e 5.5 Convergence In describing the Greenberg Procedure, convergence was defined by the cond1tion that the respective b i of the prediction equation be equal on wo successive iterations. Except for differences caused by rounding, this is tantamount to the requirement that successive estimates of mis,sing values be equal. In view of the fact that several of these estimates must converge simultaneously, it is not surprising that a large number of iterations are required in certain cases. In Experiment 1, the Greenberg Procedure was applied to twenty random samples from each of two trivariate populations which were censored (11), (22), (33), (44), (55), (66), respectively. 
The mean number of iterations required for each of these censoriti'g· schemes in • the order indioated is 5.0, 6.7, 10.3, 13.6, 17.1, and 24.2, respectively• An analysis of variance for these data indicated that censoring was the outstanding facto:ri affeoting speed of convergence. The same sort of relationship between. intensity of censoring and number of iterations follows for. higher variate cases. For e:xample, in the four-variate case of Experiment 1 with censoring (222), which is comparable to (33) above, the :mean number of iterations required for convergence was 11.80 Likewise, in censoring (444), comparable to (66) above, an average of 28.4 iterations were required. In Experiment 8, five-variate samples were generated from each of two populations. e e Ten different original samples, censored (3333), were evaluated for each. The mean number of iterations per sample required for convergence was 2706 for one population, and 23.7 for the other. The standard errors were 8.9 and 8.2, respectively. 72 e e A reasonable question to ask at this point is whether or not the number of iterations required for convergence depends more on the number or the proportion of observations missing. An answer to this question is indicated by the results from Experiment 5,· a 2 x 3 factorial involving samples of varying size from the population [.6 -.2, .6]. Estimation using the Greenberg Procedure was carried out, using five randomly selected samples for each combination, where 6, 10, and 16 values were censored representing proportions of twenty and fifty percent of the total number of observation vectors. N, and the mean number of iterations, shown in Table 22 below. k, The sample size, required for convergence is The t:lme required per iteration is also given for the various sample sizes. • Table 22. Mean number of iterations required for convergence a in sampling from [.6 - .2, .6]; k = 5 - m Censorine: (33) (55) (88) 5.2 4.4 4.4 Proportion of Observation Vectors Censored .2 .5 seconds seconds N per sm m s N per m iteration iteration - 1.30 .55 .55 30 37 9J ~ 80 79 18.0 10.3 10.6 5.79 1.34 2.79 12 20 32 22 31 43 a In this experiment the criterion for convergence was taken as agreement of the respective regression coefficients to three decimal places on successive iterations. The most striking feature of Table 22 is that the number of. iterations required seems to depend almost entirely on the proportion of data missing, rather than the amount. It should be noted, however, that for both proportions., a trifle more iteration was neoessary at the lower 73 e e levels than at the higher levels. .Al though this phenomenon could easily be explained by sampling variation (note the sk)' one is led to speculate that a certain amount of complete data is necessary before the rate of convergence for a given proportion becomes constant. The data from Experiment 1 indicate that the difference in speed of convergence between populations is very small. Experiment 4B affords more illuminating data in this connection. In that experiment, the Greenberg Procedure was applied to l;b samples of independent normal deviates grouped into fifteen sets, each of which was used to generate ten multivariate normal samples representing a different population. The number of iterations required for convergence in these l;b cases was used to test the hypothesis that population affects the • number of iterations necessary for convergence. 
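The test just described is an ordinary one-way analysis of variance, with populations as the classification and samples nested within populations. A minimal sketch of the computation is given below; it is illustrative only, and the argument stands for the 15 x 10 array of iteration counts obtained in Experiment 4B.

```python
# Illustrative one-way (samples-within-populations) analysis of variance for the
# iteration counts; with 15 populations of 10 samples each this reproduces the
# 14 and 135 degrees of freedom shown in Table 23.
import numpy as np

def oneway_anova(counts):
    """counts: k x n array, k populations with n samples each; returns the F ratio."""
    counts = np.asarray(counts, dtype=float)
    k, n = counts.shape
    grand_mean = counts.mean()
    ss_between = n * ((counts.mean(axis=1) - grand_mean) ** 2).sum()
    ss_within = ((counts - counts.mean(axis=1, keepdims=True)) ** 2).sum()
    df_between, df_within = k - 1, k * (n - 1)
    ms_between = ss_between / df_between
    ms_within = ss_within / df_within
    return ms_between / ms_within         # F with (k-1, k(n-1)) degrees of freedom
```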
The analysis of variance shown in Table 23 reveals the unimportance of population as an effect influencing convergence. Table 23. Analysis of variance for number of iterations required for convergence in Experiment 4B; N. 20, k • 10 Variation. due to Population Samples in populations Total d.f. S.S. M.S. F 14 871.4 62.2 1.0 135 8421.4 62.4 149 9292.8 - The initial estimates of missing values also seem to be of e e relatively minor importance. In section 5.1, it was stated that missing characters would initially be estimated by some "reasonable" value. Two kinds of initial estimates were tried in a preliminary investigation 74 e e using trivariate samples of size twenty from Carver's Anthropometric Data (19;4). In each of seven samples, ten of the forty characters among the independent variates were censored. The Gre.~nberg Procedure was carried out, using as initial estimates the sample variate mean; zero; and a bivariate regression estimate, Xi ..< • a + by..<, where a and b were estimated by least squares, using all available pairs of Xi and y. The number of iterations required using these initial estimates is given in Table 24. Table 24. Iterations required for convergence using the Greenberg Procedure on Carver's data with different initial estimates of missing values. • Initial estimates by Sample Bivariate variate regression means 11 10 11 14 6 5 5 5 8 5 11 7 Sample Number 1 2 3 4 5 6 4 7 4 Zero .13 21 22 22 35 24 49 Zero as an initial estimate is clearly interior. Since the true means of the variates, ~ and x 2' were 139 and 282, respectively, the use of zero as an itlitial estimate could hardly be regarded as reasonable. There is little basis for preference between the sample variate mean and the bivariate regression estimate as initial values, except that e e the variate means are simpler to calculate. 75 e e In using the Greenberg Procedure with artificially generated samples, initial estimates were made using the true mean, the true regression equation, and the random character actually designated as missing. Obviously such estimates could not be used in practice, but their use shortened convergence time somewhat in the sampling work. In discussing convergence time, we have talked in terms of the . 'number of iterations required for convergence. It must also be remem- bered that the time per iteration increases as the number of variates increase. In adding an additional variate, another matrix must be inverted with each iteration, and the order of the matrices to be inverted increases by one. Inversion time is roughly proportional to the square of the dimension of the matrix involved. • The time required for one iteration by the IBM 650, with N = 20, is approximately thirty seconds for the three-variate case, seventy seconds for the four-variate case, and about one hundred and fifty seconds for the five-variate case. The distribution of the number of iterations is obviously skewed and is indicated by the histogram of Figure 13 for the work done using the Greenberg Procedure in Experiment 7. 30 .2.0 10 e e 5-6 Figure 13. ~ -10 \::'-14 \T-It ll-U ' 'l :!>~•• :?I7-~Yi Histogram of number of iterations required for convergence; [-.70 -.52 .32, .60 .26, .44], censoring (333), N = 20, k = 100 Iterations 76 e e The regression ooeffioients in the three-variate oase usually oonverge monotonioally, almost invariably so after the first two or three iterations. As the number of variates inorease, the pattern of oonver- genoe beoomes a bit more erratio. 
Examples of the oonvergenoe ourves for two five-variate oases from EJePerirnent 8 oan be seen in Figures 14 and 15. • e e 77 • e e Figure 14 Convergence curves applying the· Greenberg Procedure to the five-variate sample, number 107, censored (3333) 78 e e ~ 't ., )t ¥- ~ lie: IE 0( ,/ / .2 \\ • ~ a Iteration number e e Figure 15 Conve ce curves applying the Greenberg Procedure to the five-variate sample, number 106, censored (3333) a Converges after 33 iterations. * 79 e e CHAPTER VI M.A.XmJM LIKELIHOOD 6.1 Disoussion Applioation of the prinoiple of rnaximwn likelihood to the problem disoussed in this thesis is praotioal for only a few speoial oases. For the most part, the mathematios involved in the ma:x:l..mizing prooess beoomes extremely unwieldy. One oase in the literature whioh is dealt with in detail is that in whioh values are missing from only one of the variables in a sanple from a trivariate normal population (Edgett, 1956 and Anderson, 1957). Using the· notation indioated for data of the form given on page 2, • Edgett sets up the likelihood :funotion n TI f.,t. (:;.,X2, X3 1 lJol ,lJo2,lJo3; ai'~'a5; P12,P13,P23) ...1 ( 6.1) n IT f.t. (:;"X3 ...n+l I ~1,lJo3; lJol ' ....3; ai, 0';; P13) and differentiates with respeot to the ....i and the ;'j. Then, eXpressing the J-j in terms of aij , he prooeedsthrough some ourribersome algebra and indioates the solutions for the lJo i and a ij in a series of formulae. A thorough examination of Edgett's paper is most EI11ightening in that it points up the diffioulties of extending the maxLmum likelihood method to more general oases for the problem at hand. e e The approaoh used by Anderson in solving the same problem is worthy of mention here by virtue of its s:t.mplioity and the faot that it oan be used to get maximum likelihood estimates easily for oertain other 80 e e 5!ecial cases. Considering the same form of data referred to in the preceding paragraph, Anderson expressed the likelihood function (6.1) as n . TI f-< (:lC:L,x3 I ~1'~3; O"i,O"~; P13) -4=1 n n f -< (x2 -4=1 ( 6.2) I ~20 + ~21.3:lC:L + ~23 .lx3; O"~ .13) where ~20 .. ~2 - ~21.3~1 - ~23.1~1 2 0"1~21.3 + 010"3P13~23.1 .. 0"10"~12 (6.~) 2 0"10"2P1J1321.3 + 0"3~23.1 .. 0"30"2P23 • 2 2 2 2 2 2 0"2.13 .. 0"2 - (~21.30"1 + ~23.1a3 + 2 ~21.3j323.10'10"3P13) If any other values of the parameters give a higher value to (6.1), they would give a higher value to (6.2), with the ~t sand 2 0"2.13 given by (6.3). The max!m:um likelihood estimates of ~l' ~3' ~, <1, and Pl,3 are based on all N observations, whlle the estimates of the regression parameters. for x 2, gi. ven :lC:L an d x , are computed from the complete 3 observations only. The values of ~2' 2 0'2' P12 and P23 are then found from the relations (6.3). If we oonsider x .. y as the dependent 3 variate, estimates of the coe.f:f'icients for the prediction of y from :llJ.. e e and x 2 may be calculated by fo:mIUlae similar to 6.3. The results are identical with those of Edgett and, in addition to being more easily derived" are more simply e::llpressible. Another example of this approach will be considered in discussing the example of Chapter m. It should be noted, however, that this sort of manipulation does not provide a 81 e e ~eneraJ. solution. For exanple, it is not applioable to the trivariate oases :we have the form o~nsidered in previous ohapters in whioh oensoring was of 1 2). 
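To make Anderson's factorization concrete for the special case just described (x_1 and x_3 observed on all N vectors, x_2 observed only on the n complete vectors), the sketch below computes the estimates in the order suggested by equations (6.2) and (6.3). It is a modern illustration under those assumptions, not a transcription of the original computations; the prediction coefficients for the dependent variate are then obtained from the estimated mean vector and covariance matrix in the usual way.

```python
# Illustrative maximum likelihood estimation by Anderson's factorization for the case
# of section 6.1: x1 and x3 observed on all N vectors, x2 missing (NaN) on some.
import numpy as np

def anderson_ml(x1, x2, x3):
    x1, x2, x3 = (np.asarray(v, dtype=float) for v in (x1, x2, x3))
    obs = ~np.isnan(x2)                              # the n complete observation vectors
    # parameters of (x1, x3): maximum likelihood from all N observations (divide by N)
    mu1, mu3 = x1.mean(), x3.mean()
    s11, s13 = np.cov(x1, x3, bias=True)[0]
    s33 = x3.var()
    # regression of x2 on x1 and x3 from the n complete observations only
    A = np.column_stack([np.ones(obs.sum()), x1[obs], x3[obs]])
    b, *_ = np.linalg.lstsq(A, x2[obs], rcond=None)  # b = (b20, b21.3, b23.1)
    resid = x2[obs] - A @ b
    s2_13 = resid @ resid / obs.sum()                # maximum likelihood residual variance
    # recover the remaining parameters from the relations (6.3)
    mu2 = b[0] + b[1] * mu1 + b[2] * mu3
    s12 = b[1] * s11 + b[2] * s13
    s23 = b[1] * s13 + b[2] * s33
    s22 = s2_13 + b[1] ** 2 * s11 + b[2] ** 2 * s33 + 2.0 * b[1] * b[2] * s13
    mean = np.array([mu1, mu2, mu3])
    cov = np.array([[s11, s12, s13],
                    [s12, s22, s23],
                    [s13, s23, s33]])
    return mean, cov
```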
The trivariate oases of Experiment 1, in whioh observations were (0 0 missing :from one variable only, have been oonsidered using the maximum likelihood estimates desoribed above. 6.2 Bias and Varianoe of Regression Coeffioients Table 25 indioates the biases and varianoes of the regression ooeffioients for the oases of Experiment 1 where maximum likelihood estination was oarried out as desoribed above. As was the oase in Chapters • m, IV, and V, only the biases of the oonstant ooeffioient are signifioant by the t oriterion, and thLs has ~ee!l explained as a sampling phenomenon. As one would expeot, there is a tendenoy for tb!l varianoe of the ooeffioients to be higher when eight vaJ.ues a:re censored than when only four are absent. 6.,3 Estimate of Error It is well known that max:iJnum likelihood estimates of varianoe In the oase of the est:lmates of O'~.x above, three para- are biased. meters (the ~i) were estimated from twenty (fragmentary) observations, so that an estimate corrected for this bias would be obtained by using 20/17 &;.x. There will also be some bias due to the fragmentary oharacter of e e the data. A rough adjustment for both types of bias might consist of multiplying the maximum likelihood estimate by a factor N_ c k= P=!c N-p- p-l • ee Table 25. ee Statistics of regression coefficients estimated by maximum likelihood; Experiment 1, N = 20, k [.25 "V(b0 ) bl-~l V(b1 ) .049 .131* .096' .051 .063 .061 .042 .053 .103 .076 (80) .115 .073 .070 .208 a The symbol * indicates t .049 = b. - ~ J. as·i .00, _ A .118*a [.80 .25] _ censoring bo -~ 0 (04) (40) . (08) .5), = 20 i .80J A _ A -~ 0V(b) b2-(32 f(b 2 ) b0 bl-~l V(b1 ) b2-~2 V(b 2 ) 0 .002 .009 -.033 -.032 > t .037 .029 .096 .051. .029 .026 .062 .006 .057 .083* .027 .070· .036 .022 .107 -.001 -.039 .096 .027 .065 .186 .045 -.054 .057 .081 .087* .078* ,." 19, where .0.:;) as2 i = .085 V(b i ) 20 co ro 83 e e where 0 equals the total number of missing charaoters among the p-l independent variates. The mean values and variances of errors of estimate, oalculated .from the twenty samples for each case in Experiment 1, are presented in Table 26. The mean values multiplied by the correotion faotor, k, above are also given. Means and Variances of ~y.x estimated by ma:x::1.mum likelihood Table 26. (.25 ·50, .25.} 2 ay.x =- .7333 Censoring • .60, (iy.x =- .80] .35,,6 A2 a ~(?iy.x ) .'k~ ay.x ::t2 a y.x ~(~~.x) k1T" y.x .617 .622 .063 .069 .061 .080 •71.Jl .746 .745 .692 .296 .294 .284 .280 .012 .012 .009 .012 .355 y.x (04) (40) (08) (80) ( .80 ·$J5 ·562 ·353 .350 .345 6.4 Effioienoy of Prediction Equations Means of E and their standard errors are indicated in Table 271 and the means of !J and their assooiated standard errors are shown in Table 27B. Those tables do not indicate marked differences between the two populations and show an expected decrease in efficiency with increased censoring. e e A comparison of the efficiencies obtained using maximum l:ikelihood with those obtained by using other methods is disoussed in Chapter VIII. 84 e e Table 27-A. Means of E and their standard errors estimated by maximum likelihood; Experiment 1, N = 20, k = 20 - [.25 . Censoring ! (04) (40)(08) .981 .969 .918 (80' ~O17 ·50, .25] S.E.(I) .005 .008 .029 .027 [ .80 ! .$, .80] S.E. (I) .008 .008 .014 .024 .970 .973 .930 .927 ... Table 27B. Means of A and their standard errors estimated by maxi.mum likelihood; E:x:periJnent 1, N • 20, k • 20 [.25 • Censoring (04) (40) (08) ( 80). 
-A ·50, .251 S.E.en .325 .354 .369 .025 .034 .032 ·424 .053 [.80 -A ·~237 .241 .278 .280 .00, .801 S.E. ('X) .020 .023 .017 .•035 ',-~ e e 85 e e CHAPTER VII AN EXAMPLE 7.1 The Data Diffioulties involved in the determination of musole potassium severly limit the physioian in prescribing proper therapy for patients suffering a depletion of this ion. The aotual measurement of musole potassium requires the use of a radioaotive isotope, whioh is an expensive, exaoting, and time oonsuming prooedure. Working with laboratory animals, Welt and his assooiates [1958] reoently oarried out studies aimed at developing m3thods to facilitate • measurement of musole potassium in humans. Their approach involved predicting the oonoentration of musole potassium from a number of other more easily obtainable measurements. The ability to determine the oon- centration of serum po,tassium, serum carbon dioxide, and red cell potassium from the analYsis of a single blood sample provided suoh measurements. Data available for such determinations on a series of fifty-four rats is given in Table 28. The animals were sacrificed to obtain the assooiated values of muscle potassium. It will be noted that determinations for serum carbon dioxide, (X ), and red cell potassium, (X ), are not available for every animal. 2 3 Thus the question of methodology' in forming an appropriate prediotion e e equation arises. 86 e e Table 28;. Bioohemioal determinations from Wdt data Rat Number Serum K (mH./L.) 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 • e e JS. 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 a The 2.61 3.61 2.94 3.58 4.65 4.53 3.41 3.41 5.00 5.38 3.54 4.73 5.05 5.00 2.84 3.11 2.87 .3.32 2.79 3.10 2.61 2.94 2.32 2.74 2.70 2.43 2.60 2.60 2.85 1.84 2.05 1.61 1.92 2.01 1.66 2.02 1.50 ·1.72 1.73 1.97 1.67 1.68 1.72 1.84 1.74 2.20 1.50 2.16 2.81 4.81 1.62 2.73 2.65 2.38 X2 Serum CO 2 (roM./L.) X 3 Red Cell K (roM./L.) ••& 24.6 25.7 19.7 23.4 25.6 22.4 24.7 21.6 22.8 25.1 .. 17.4 25.4 19.3 23.4 23.6 21.5 26.9 25.1 22.0 26.2 29.8 25.1 30.6 29.5 33.3 31.7 33.8 34.4 38.7 42.2 39.4 27.4 --44.3 34.3 43.8 39.2 symbo1-- indicates a "missing datum. 101.4 102.0 100.8 99.6 98.8 101.3 106.0 94 •.3 96.8 100 •.3 108.0 101.5 105.9 107.3 102.5 106.0 101.2 100.1 100.1 96.8 96.7 100.9 100.1 104.2 96.9 100.3 100.2 106.3 103.2 106.8 96.0 111.0 98.8 91.1 89.4 92.8 106.3 104.1 99.3 109.6 86.5 66.5 103.0 105.2 83.2 98.0 90.4 X4 • y ]hsole K (mM/l00 gr.) 44.7 47.7 47.3 45.5 47.2 47.8 47.2 46.9 45~7 44.9 44.9 47.5 46.0 46.5 4.3.5 38.1 42.9 45.0 42.4 42.3 42.3 39.1 38.1 41.0 34.6 41.0 34.3 38.6 .38.6 35.5 .34.8 33.4 .34.5 38.8 31.5 33.0 30.6 31•.3 33.1 31.4 29.5 25.2 27.7 28.2 30.0 27.7 25.0 24.2 20.6 29.6 23.4 30.1 23.1 29.9 ':~.~ 87 e e 7.2 Solutions The methods indicated in previous chapters will be applied here and the results given. A discussion of these results will be given in seotion 7.3. Equation (7.1) below was dete::mined by ordinary lea.st squares, using the thirty-two complete observations. (7.1) It was mentioned in the previous ohapter that the approach of Anderson (1957) would be used in obtaining a maximum likelihood prediotion equation from the welt data. • To do so, we delete values of x2 from rats number two, three, fifteen, and seventeen, so that the remaining da.ta may be arranged in the form: .. · x • • • • • ~N • • 12 ~3 x21 x 22 x • • • • x 2n 23 2 x x x x 31 32 33 • • • 3nl x41 x42 x43 • • • • • • • • • • • xllN ~l 0 0 0 · . 
where the primary subscript denotes the variable; the seoondary subsoript, the observation; and x4 :: yis the dependent variable. particular oase, there are nl = 32 In our observations canplete on all vari- ates, n 2 = 47 oomplete on all except x , and N = 54 complete on :;, and 3 . y. The assumption of multivariate normality is reasonable here and the likelihood function oan be written: ~ n2 'IT 2 TI Jl f)xil~·,O"J.·'P .. ) II -< = 1 J. J.J -< = n i,j.= 1,2,3,4 N 2 J. f)xil~,'O"·'Pij) +1 i,j =11,3,4 J. . -< i IT = ~+n +1 = 1,4 2 2 J. J. J. (7.2) f-«x·I~"0",'P14) -e 88 and ma.y be rewritten n2 N IT f) xi lf.Li,O"i,P14r IT f)x31~30.0 = .,(=1 + .,(=1 ~31.4X:J. + ~34.lx4' i = 1,4 n l IT f )x4 ~20 .00 + ~21.34X:J. + ~23 .14x3 + ~24.l3x4) 1 ..(=1 where ~30.0 = 2 0"3.14 II f.L 3 - ~3l.~1' -~34.l~4 2 2 2 22. . ) 0"3 - (~3l.20"1 + ~34.10"4 + 2~3l.4~34.10"10"4P14 2 0"1~3l.4 + 0"10"4P14~34.l • (7.4) = 0"10"3 P13 2 0'10'4P14~31.4 + 0'4~34.l = 0'30'4P34 and ~20.00 II ~2 - ~2l.43f.Ll - ~23.l~3 - ~24.23~4 2 c{.134 = 0'2 v~ ~8V~ = (S (7.5) where rhl1 ~ V e e = a13 0'14 . 2 0'31 0'3 a43 0'34 2 0'4 r~2 l5 = 0'23 l ~2l.43 and ~ ~23.l4 ~24.l3 a24 Thus, maximum likelihood est:i.ma.tes of = ~l' f.L4' O'i, O'r, and P14 are obtained in the usual way from all N = 54 observations. Est:i.mates of f.L 3' O'~, P13' and P23 are made from these foregoing values combined -e 89 with estimates of the regression statistics from the n 2 observations using the relationships of (7.4). The remaining parameters are based on those already calculated plus regression statistics from the n l complete observations by substituting these in the parametric forms of (7.5) and solving. The estinates for the regression parameters ~40' 2 ~41.23' ~42.l3' ~43 .12' and 0'4.123 may then be derived from the maximum likelihood estimates of the l1i , O'i, and Pij. The resulting equation is All the data available were used in obtaining a prediction equation by the method of paired correlation coefficients. • matrix, the number of values upon which each ooefficient was based, and the oonfidence interval for each ooefficient is given in Table 29. Table 29. Sample correlation coefficients and their confidenoe limits based on the n.. available pairs of Xi and x j in Table J.J 28 ij e e The correlation 95% confidence interval 12 36 13 23 ly 2y 47 32 54 36 3y 47 -.625 -.020 -.79 to -.34 -.29 to .28 .008 .672 -.792 ...36 to .37 .50 to .79 -.89 to -.62 .342 .06 to .57 90 e e Further reference will be made to this in the discus sion section below. The equation determined by this method is Yp = 13.11 + 2 .2~ - .67x2 + .37X3 The Greenberg method also utilized all the data available. ':fhe sample variate means were used as initial estimates for the missing values and the iteration procedure described in Chapter V was carried out. Convergence was reached after fourteen iterations, each of which took two minutes and seventeen seconds. Slightly over a half hour of IBM 650 time was required for all the calculations which yielded the prediction equation • (7.8) 7.3 Discussion of the results The solutions given in section 7.2 will be discussed briefly here with the purpose of pointing out the advantages of and objections to applying the various methods to the set of data at hand. The broader aim of indicating factors, which would be of importance in extending these methods and selecting the appropriate ones for other practical situations, is implied. Equation (7.1) was determined by ordinary least squares using only the thirty-two complete observations. 
Although a sample of this size can be expected to give fairly reliable results, it seems more than a little wasteful to discard data from over forty percent (~) of the animals used; or, on a measurement basis, over one-third (1~~) of the total number of determinations available. A researcher would have 91 e e serious reservations about disregarding so great a portion of data even when observations are easy to obtain . I f the determinations were from human subjects rather than rats, the research worker would be less inclined to discard incomplete data. _ Another point to be considered here is the one discussed generally in paragraph 3.10 Inspection of Table 28 reveals that the data to be discarded would include many of the observations with very high and very low muscle potassium values. Ignoring such portions of the data would seem. unwise i f the equation were to be used to predict muscle potassiUm' values which were expected to be at the extremes of the sample • 0 Yith only the da~a of Table 28 it is not possible to test whether or not one of the other methods might be preferable in this regard. However, some of the results using each prediction equation are tabulated in Table 30 below· for the three complete observations having muscle potassium deteminations at either end of the muscle potassium scale (25.0> y> 47.0) 0 The sums of the absolute deviation.s and sums of the squared deviations from the true values for these six observations are given at the foot of the table. From this examination no dramatic differences in the methods are indicated as far as predictability at the extremes is concerned. In obtaining prediction equation (7.6) by the method of rna.xi.mum likelihood, only four values were discarded, two of which were included in observation vectors having low muscle potassium values. However, the serum and muscle potassium values (X:I.and x4) in these vectors were retained so that not all of the information on anyone 92 e e animal was ignored; and., on a measurement basis, only slightly over two percent (1~7~ of the available information was discarded. Considering the sample size and known desirable properties of maximum likelihood estimators, prediction equation (7.6) could be regarded as a reasonable first choice. It must be remembered, however, that this is a special case of m.a.xi.mum likelihood estimation, and that a slightly different pattern of missing observations 'WOuld have precluded its use here. Table 30. Some extreme muscle potassium values predicted from the Welt data by various methods Value predicted by • For rat number Y c Yt Yp yg 5 6 7 47 48 51 47.87 45.89 40.76 27.11 25.95 29.09 48.18 45.90 41.23 27.37 23.72 29.53 47.61 45.09 40.67 27.11 20.43 29.17 47.85 45.21 40.63 26.00 20.57 28.39 18.56· 17.83 21.31 19.43 83.62 102.20 89.38 Iy - yl (Y - yj 85.40 Actual value y 47.2 47.8 47.2 25.0 24.2 23.4 Comments on prediction equation (7.7), determined by the Method of Paired Correlation Coefficients, call for reference to Table 29. The sample correlation matrix given there with the confidence limits e e for each coefficient, indicate that this is a sample from a population for which this type of estimation is not particularly well suited; 93 e e that is, one in which the correlations between variables are of different magnitudes. 
Even though we are dealing with a different type of censoring, and have larger numbers of pairs than were considered in the empirical sampling work of Chapter IV, the Method of Paired Correlation Coefficients would appear to be the least appropriate of the four methods considered here.

The Greenberg Procedure, which also utilizes all the data available, yields the prediction equation (7.8). There are no a priori reasons for not applying the Greenberg Procedure to these data, nor would its use be contraindicated on the basis of the empirical sampling work discussed in previous chapters. One objection to employing it here might be the rather lengthy computations involved. These would seem to be of minor importance considering the availability of high-speed computing machinery and the accompanying program. In this case, the Greenberg Procedure permitted a solution which would certainly be preferable to that obtained by the Method of Paired Correlation Coefficients. If the missing characters had been more evenly distributed among the independent variables, so that the method of maximum likelihood could not have been easily applied to the bulk of the data, it would have presented a still more appealing alternative.

CHAPTER VIII

SUMMARY AND CONCLUSIONS

8.1 Summary

In the previous chapters, alternative methods have been described for dealing with the problem of prediction by multiple regression when values are missing from among the independent variates. The results of empirical studies designed to evaluate the efficiency of the respective methods were discussed. The objectives of this chapter are, first, to compare these methods by bringing together results in which all the methods were used with the same or similar data and, secondly, to make suggestions regarding the use of the methods. Finally, comments will be made on the need for further research to answer some of the questions arising from the results of this work.

The eight sampling experiments described in Chapter II were sequential in nature. Experiment 1 was the broadest in scope but perhaps the narrowest in depth. A summary of the statistics of prediction efficiency for all the sampling done in this experiment is given in Tables 31A and 31B. These tables indicate that estimation by maximum likelihood, where it can be easily employed, is superior to the other schemes. For the three-variate cases, where maximum likelihood estimation is not easily carried out, neither the Paired Correlation Method nor the Greenberg Procedure appears to yield results overwhelmingly better than those obtained by merely deleting incomplete observations. However, a population differential was noted in the results given by the Paired Correlation Method.

Table 31A. Values of E from Experiment 1; N = 20, k = 20. For each censoring pattern, (04) through (66) in the two trivariate populations and (111) through (555) in the four-variate population, the table gives the mean value of E obtained from the complete observations, by the Paired Correlation Method, by the Greenberg Procedure, and by maximum likelihood; a few of the means were based on k = 5 only.
Table 31B. Values of Δ from Experiment 1; N = 20, k = 20; arranged in the same way as Table 31A.

This population differential led to the second and third experiments, which were designed to investigate it further. In Experiment 2 it was found that a general quadratic relationship would describe the mean efficiencies of samples from trivariate populations with different correlation matrices. The fitted quadratic relationship, based on 780 samples from thirty-nine different populations, revealed that the Method of Paired Correlation Coefficients was grossly inefficient for samples from populations in which the variables were highly and heterogeneously correlated. Likewise, it was eminently satisfactory for cases in which the variables were not so related. Experiment 3 indicated in a rather cursory way that the results of Experiment 2 were also applicable to four-variate situations.

Experiment 4 was planned to confirm the results of Experiment 1 over a wider range of populations as far as the Greenberg Procedure was concerned. It was also intended to check on whether or not a population differential of the type indicated in Experiment 2 existed for this method of estimation. Tables 32A and 32B show that, although the efficiency of the Greenberg Procedure is insensitive to the trivariate population being sampled, it does not yield results superior to those obtained by deleting incomplete observations. In Experiment 4 the Greenberg Procedure also yielded more than the expected number of biased regression coefficients by the t criterion defined in Chapter II. The biases were compensating in nature, however, as far as prediction was concerned. These results, combined with the extensive computations necessary for solutions by the Greenberg Procedure, seem to dictate against using it in trivariate cases.

Table 32A. Values of E from Experiment 4B; censoring (55), N = 20, k = 10. For each of the fifteen trivariate populations sampled in Experiment 4B, the table gives the mean value of E obtained from the complete observations, by the Paired Correlation Method, and by the Greenberg Procedure.
Table 32B. Values of Δ from Experiment 4B; censoring (55), N = 20, k = 10; arranged in the same way as Table 32A, with an additional column for the uncensored data.

Experiment 5 showed that the number of iterations required for convergence in using the Greenberg Procedure was influenced more by the proportion than by the amount of missing data.

In Experiment 6, samples censored (3333) were selected from two different five-variate populations. Prediction equations were computed by deleting incomplete observations and by applying the Paired Correlation Method. The five-variate populations were chosen to represent cases in which the Paired Correlation Method was expected to yield diverse results in terms of prediction efficiency. The resulting means and standard errors of E and Δ are given in Tables 33A and 33B, respectively. These tables represent further evidence to confirm that, in higher variate cases, the Paired Correlation Method is suitable for use with fragmentary samples from populations with homogeneously correlated variables, but that it is definitely unsatisfactory for samples from populations with heterogeneously correlated variables. The histograms representing the distribution of E for both methods of estimation in the two populations are given in Figure 7.

Table 33A. Values of E and their standard errors; Experiment 6, censoring (3333), N = 20, k = 100

                                                      Complete observations      Paired Correlation
    Population                                           E        S.E.(E)           E       S.E.(E)
    [.60 .80 .29 .40, .56 -.20 .56, .20 .70, .30]       .515       .020            .357      .027
    [.36 .40 .41 .32, .24 .24 .30, .24 .25, .22]        .518       .023            .874      .016

Table 33B. Values of Δ and their standard errors; Experiment 6, censoring (3333), N = 20, k = 100

                                                      Uncensored data
    Population                                           Δ        S.E.(Δ)
    [.60 .80 .29 .40, .56 -.20 .56, .20 .70, .30]       .183       .006
    [.36 .40 .41 .32, .24 .24 .30, .24 .25, .22]        .444       .015

Although Experiments 1 and 4 did not indicate the Greenberg Procedure to be worthwhile in any of the trivariate cases considered, it was not shown to be clearly inferior to the other methods as far as prediction efficiency was concerned. An examination of the relative effectiveness of the Greenberg Procedure in higher variate cases seemed to be warranted and was undertaken in Experiment 7.
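Before turning to the Experiment 7 results, it may help to have a concrete picture of the kind of computation the Greenberg Procedure entails. The sketch below is only a rough analogue of the procedure defined in Chapter V: missing values start at the variate means, as in the example of Chapter VII, and are then repeatedly re-estimated from fitted regressions until they stop changing. The regression-based update rule, the convergence tolerance, and the NumPy implementation are assumptions made for illustration, not a transcription of the Chapter V algorithm.

```python
import numpy as np

def iterative_fill(data, tol=1e-4, max_iter=50):
    """Iteratively replace the missing values (NaN) in a data matrix.

    Missing entries are initialized at the column (variate) means; each pass
    then regresses every column containing missing values on the remaining
    columns of the currently filled-in matrix and replaces the missing
    entries with their predicted values.  Iteration stops when the fill-in
    values change by less than `tol`.
    """
    data = np.asarray(data, float)
    mask = np.isnan(data)                              # True where a value is missing
    filled = np.where(mask, np.nanmean(data, axis=0), data)

    for iteration in range(1, max_iter + 1):
        previous = filled.copy()
        for j in np.where(mask.any(axis=0))[0]:        # columns with missing values
            others = np.delete(np.arange(data.shape[1]), j)
            design = np.column_stack([np.ones(len(filled)), filled[:, others]])
            coef, *_ = np.linalg.lstsq(design, filled[:, j], rcond=None)
            filled[mask[:, j], j] = (design @ coef)[mask[:, j]]
        if np.max(np.abs(filled - previous)) < tol:    # fill-in values have stabilized
            break
    return filled, iteration
```

A prediction equation would then be fitted to the completed matrix in the usual way.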
Complete observations, the Paired Correlation Method, and the Greenberg Procedure were used to obtain prediction equations and measures of their efficiency for one hundred samples from a population for which the Paired Correlation Method was not thought to be especially well suited. The results, in terms of the sampling distributions of E and Δ, are shown in Figures 11 and 12, respectively. The mean values of E and Δ, with their standard errors, for the different methods of estimation are given in Table 34. According to these criteria, both the Greenberg Procedure and the deletion method are better than the Paired Correlation Method, and results using the Greenberg Procedure appear to be at least slightly better than those obtained by omitting the incomplete observations. It will be recalled from Chapter V, however, that the regression coefficients obtained by the Greenberg Procedure in this experiment showed significant bias by the t criterion.

Table 34. Values of E, Δ and their standard errors from Experiment 7; population [.70 -.51 .31, .60 .26, .44], censoring (333), N = 20, k = 100

                    Complete         Paired          Greenberg       Uncensored
                    observations     correlation     Procedure       data
    E                  .79              .603            .837
    S.E.(E)            .014             .02             .013
    Δ                  .266             .482            .245            .189
    S.E.(Δ)            .013             .063            .015            .006

The computing time involved precluded an extensive amount of sampling with the Greenberg Procedure for five-variate cases, but the results of its use with four variates, relative to that with three, suggested some evaluation at the five-variate level. Also, the need for some convergence information at higher variate levels demanded investigation with at least a few samples. Accordingly, ten samples, each censored (3333), were selected from each of the two populations used in Experiment 6, and the Greenberg Procedure was used to find prediction equations from them. A summary of the resulting efficiencies is given by the means and standard errors of E and Δ in Tables 35A and 35B, respectively. In population [.60 .80 .29 .40, .56 -.20 .56, .20 .70, .30], the Greenberg Procedure yielded better results than those obtained by the use of complete observations. In population [.36 .40 .41 .32, .24 .24 .30, .24 .25, .22], there appeared to be little difference between the Greenberg Procedure and the use of complete observations, while the Paired Correlation Method yielded significantly better results than either of these.

Table 35A. Values of E and their standard errors; Experiment 8, censoring (3333), N = 20, k = 10. For each of the two five-variate populations of Experiment 6, the table gives the mean value of E and its standard error for estimation from complete observations, by the Paired Correlation Method, and by the Greenberg Procedure.

Table 35B. Values of Δ and their standard errors; Experiment 8, censoring (3333), N = 20, k = 10; arranged in the same way as Table 35A, with an additional column for the uncensored data.

8.2 Conclusions

The findings summarized above support the following conclusions and suggestions regarding the use of the methods described here:

1. The use of the Paired Correlation Method is advisable in the trivariate case only if it is reasonably certain that the sample has been drawn from a population in which the variables are homogeneously correlated.
This provision also holds in higher variate cases; and it must be kept in mind that, as more variates are introduced, there is less possibility of homogeneity existing among their zero-order correlations.

2. If there are enough complete observations in three- and four-variate cases, and unless there is some a priori reason for not discarding incomplete observation vectors, the use of only the complete observations gives results which usually will not be much less efficient than those obtained by the Greenberg Procedure and the Paired Correlation Method.

3. In the trivariate case, the Greenberg Procedure generally is not worthwhile.

4. In addition to these conclusions, the research carried out provides some basis for the two suggestions below:

a) If all or nearly all the data can be used to obtain maximum likelihood estimates, in a manner similar to that indicated in Chapters VI and VII, that method is to be recommended. This suggestion is based on the limited amount of sampling done in Experiment 1 and the well-known desirable properties of maximum likelihood estimates.

b) The Greenberg Procedure seems to give best results in cases of four or more variates where the Paired Correlation Method would not be expected to perform efficiently. The former technique is to be recommended for higher variate cases in which most of the observations are incomplete, but in which the total number of missing values is small and evenly distributed among the different variates. These cases could be described in the notation of section 2.2 as ones in which the c_i are large and nearly equal, and in which the c_ij and other c's with multiple subscripts are small or equal to zero.

The example of Chapter VII illustrates the applicability of some of these suggestions to a biological problem.

The limited amount of investigation which led to the last two recommendations (4a and 4b) points up the need for further research to confirm and supplement that done here. Possibly the most obvious need for further investigation is with the use of the Greenberg Procedure in higher variate cases. Further studies appear to be in order to determine its prediction efficiency as well as the extent and nature of bias arising with its use. The apparent effectiveness in the four-variate case should be verified for similar cases, and the rather cursory examination of the two five-variate cases carried out here certainly invites further research. However, the sampling approach in any very large investigation of this kind would probably require computing facilities of capacity greater than the IBM 650.

Another suggestion for further research is that some sampling studies be done using maximum likelihood estimates obtained by iterative methods for the three-variate cases. It is also conceivable that maximum likelihood estimates could be obtained for some particular types of censoring in four-variate cases, either by explicit solutions, as pointed out in the example of Chapter VII, or by iterative solutions. Empirical investigations involving these estimates and comparing them with the ones discussed here would be enlightening.

LIST OF REFERENCES

Aitchison, J. and J. A. C. Brown. 1957. The Lognormal Distribution. Cambridge University Press, Cambridge.

Anderson, T. W. 1957. Maximum likelihood estimates for a multivariate normal distribution when some observations are missing. Journal of the American Statistical Association 52:200-203.

Bliss, C. I. 1952. The Statistics of Bioassay. The Academic Press, Inc., New York.
The Academic Press" Carver,H. C. 19l.,l.Anthropometric Data. Edwards Brothers. Arm Arbor. Du Bois, P. H. Brothers. 1957 • Multivariate Correlational Analysis. New York. Harper B. 1957. Introduction to Statistical Theory. Mimeographed classnotes. Unive;-sity of North Carolina, C~pel Hill. Dtu1can,D'~ Dwyer,"P. S. 1951. Linear Computations. New York • • John '1tlley and Sons, Edgett, G~ L. 1956. Multiple regression with missing observations among'the independent variables. Journal of the American Statistical Asso ciation ~: 122-131. --~ Foster, W. D. 1950. On the selection of predictors: two approaches. Unpublished Ph.D. Thesis. North Carolina State College, Raleigh. Institute of statistics. 1957. A Record of Research IV:17. Consolidated University of North Carolina. . Lord, t: M-. 1955. Estimation of parameters from incomplete data. Journal of the American Statistical Association 22: 870-876. Matthai, A'. 1951. Estimation of parameters from incomplete data with application to dem go of sample surveys. Sankhya 11: 145-152. ~ e e Moonan:'W. J. 1957. Linear transformation to a set of stochastically dependent normal variables. Journal of the American Statistical Ass:>ciation ~: 247-252. Morrison, R. D. 1956. Some studies on the estimates of the e:x;ponents in'models containing one and two exponentials. Unpublished Ph.D. Thesis. North Carolina State College, Raleigh. Munter, A. H. 1936., A ,study of the lengths of long bones of' the arms and legs in man,' with special reference to Anglo-Saxon skeletons. Biometrika ~: 25'8 • e e 107 Neiswanger"W. A.and T. A. Yancey. 1959. Parameter estimates and autonomous· growth • Journal of the American Statistical _Ass?ciation ~: 389-402. Nicholson" G. E." Jr. 1948. The application of a regression equation to a new samp-le. Unpublished Ph.D. Thesis. University of North Carolina" Chapel Hill. Nicholson" G. E." Jr. 1957. Estimation of parameters from inoomplete multiva.riatesamples. Journal of the Amerioan Statistical Association ~: 523-526. Orcutt" G. H. and D. Cochrane. 1949. A sampling study of the merits of autoregressive and reduced form transformations in regression analysis. Journal of the American Statistical Association _._ ~:356-372 •. Rand COi"porabion. 1955. A Million Random Digits with 100,,000 Nomal Deviates. The Free Press" Glencoe.. Illinois. Rao" C. R. 1956. Analysis of dispersion with incomplete observations on one of the characters. Journal of the Royal Statistical Society" Series B" 18:259-264. • - Sheps" M. C. and P. L. Munson. 1957. The -error of replicated potency estimates in a Biological Assay method of the parallel line t~e. Biometrics ,U= 131-132. Wells" H. B.1959. North Carolina perinatal mortality study: investigation of methods for analyzing data. Unpublished Ph.D. T'hesis. University of North Oarolina" Ohapel Hill. Welt" L. G. et ale 19513. The prediction of muscle potassium from blood electrolytes in potassium depleted rats. Transactions of the A.ssociation of American Physicians 10: 250-259 • - - 1f.i.lks;S. - 19320 Moments and distributions of estimates of population parameters from fragmentary samples. 
APPENDIX A

GENERATION OF MULTIVARIATE NORMAL DEVIATES

A.1 Transformations to Generate Trivariate Normal Deviates

Consider a p-variate normal distribution, x_p ~ N(μ_p, V_pp), μ_p being the vector of means (p-dimensional) and V_pp the p x p variance-covariance matrix. If x_r is taken as the vector of any r elements of x_p arranged in the last r positions, and if the remaining p - r = q elements are denoted x_q, then the conditional distribution is

    x_r | x_q ~ N(μ_r|q, V_rr|q),

where

    μ_r|q  = μ_r - S_rr^(-1) S_rq (x_q - μ_q),
    V_rr|q = S_rr^(-1) (S_rr^(-1))',

and S_pp is the lower triangular matrix, partitioned as

    S_pp = | S_qq   0   |
           | S_rq  S_rr |,

such that S_pp = V_pp^(-1/2), that is, S_pp' S_pp = V_pp^(-1). This theorem is stated and proved by Duncan (1957).

Given independent N(0,1) deviates x1, x2, x3, and wishing to find the transformations necessary to generate trivariate normal deviates x1', x2', x3' with μ1' = μ2' = μ3' = 0 and

    V = |  1    ρ12  ρ13 |
        | ρ21    1   ρ23 |
        | ρ31  ρ32    1  |,

we take

    x1' = x1,
    x2' = ρ12 x1' + (1 - ρ12²)^(1/2) x2,

and, in a manner similar to that described below, find the transformation for x3'. We want f(x3' | x1', x2'). Applying the theorem with q = 2 and r = 1 gives

    E(x3' | x1', x2') = [(ρ13 - ρ12 ρ23) x1' + (ρ23 - ρ12 ρ13) x2'] (1 - ρ12²)^(-1),
    V_3|12 = S_33^(-2) = 1 - (ρ13² + ρ23² - 2 ρ12 ρ13 ρ23)(1 - ρ12²)^(-1),

so therefore

    x3' = [(ρ13 - ρ12 ρ23) x1' + (ρ23 - ρ12 ρ13) x2'] (1 - ρ12²)^(-1) + [1 - (ρ13² + ρ23² - 2 ρ12 ρ13 ρ23)(1 - ρ12²)^(-1)]^(1/2) x3.

A.2 An Example of the Procedure Used to Generate Four-variate Normal Deviates

Given independent N(0,1) deviates x1, x2, x3, x4, we wish to generate multivariate normal deviates x1', x2', x3', x4' such that μ1' = μ2' = μ3' = μ4' = 0 and

    V = |  1    .25  .50  .75 |
        | .25    1   .25  .50 |
        | .50   .25   1   .25 |
        | .75   .50  .25   1  |.

Using computing methods described by Dwyer (1951) we find directly

    (S^(-1))' = | 1   .25        .50        .75       |
                | 0   .96824584  .12909944  .32274861 |
                | 0   0          .85634884  .09731237 |
                | 0   0          0          .56909018 |

and

    S = |   1            0            0           0          |
        |  -.25819889    1.03279556   0           0          |
        |  -.54494926   -.15560978    1.16774841  0          |
        | -1.07827614   -.55910614   -.19968077   1.75719074 |.

Take x1' = x1 and, using the relationship

    x_r' = -S_rr^(-1) S_rq x_q' + S_rr^(-1) x_r,    q < r,

successively for r = 2, 3, 4, we get

    x2' = .25 x1' + .96824584 x2
    x3' = .46666667 x1' + .13333333 x2' + .85634884 x3
    x4' = .61363636 x1' + .31818182 x2' + .11363636 x3' + .56909018 x4.

A.3 Some Results Obtained by Using the Transformations of A.2

Four hundred sets of multivariate deviates were generated using the transformations above and the first four hundred input cards. These deviates yielded the following sums and sums of squares and cross products (N = 400):

    Σx1' = 3.70     Σx2' = -26.21     Σx3' = 39.44     Σx4' = 4.79

    Σx1'² = 377.86    Σx1'x2' = 109.08    Σx1'x3' = 205.01    Σx1'x4' = 297.28
    Σx2'² = 408.86    Σx2'x3' = 105.29    Σx2'x4' = 219.09
    Σx3'² = 411.46    Σx3'x4' = 116.--
    Σx4'² = 418.19

which reduces to the correlation matrix

    | 1  .279  .521  .748 |
    |     1    .265  .532 |
    |           1    .281 |
    |                 1   |   (symmetric).

It may be noted that the S_rr^(-1) S_rq are vectors of partial regression coefficients.
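The transformations of A.2 are easy to check numerically. The sketch below is a modern rendering under stated assumptions: it uses np.linalg.cholesky in place of the Dwyer (1951) computing routine to obtain a lower triangular square root of the target correlation matrix, and NumPy's random generator (with an arbitrarily chosen seed and 400 sets, as in A.3) in place of the punched-card deviates.

```python
import numpy as np

# Target correlation matrix of A.2.
V = np.array([[1.00, 0.25, 0.50, 0.75],
              [0.25, 1.00, 0.25, 0.50],
              [0.50, 0.25, 1.00, 0.25],
              [0.75, 0.50, 0.25, 1.00]])

L = np.linalg.cholesky(V)           # lower triangular, L @ L.T equals V

rng = np.random.default_rng(236)    # seed chosen arbitrarily for the illustration
z = rng.standard_normal((400, 4))   # independent N(0,1) deviates, 400 sets

x = z @ L.T                         # each row is one multivariate normal deviate vector

# The sample correlation matrix should sit near V, much as the matrix
# quoted at the end of A.3 does.
print(np.corrcoef(x, rowvar=False).round(3))
```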
APPENDIX B

THE NORMAL DEVIATES USED IN SAMPLING

Ten thousand N(0,1) deviates were selected at random from the one hundred thousand listed in the Rand tables (1955). A full description of these deviates, their origin, and the tests for randomness which were applied to them is given with that listing. It will suffice here to note that the deviates were punched, five per card, in two non-overlapping groups. The starting point for each group was selected at random, one group being punched serially from rows, the other from columns. The cards were then shuffled and numbered in order.

A frequency tabulation of these N(0,1) deviates is given for the ten equal intervals from -5.000 to +5.000 in Table B-1 below.

Table B-1. Frequency tabulation of ten thousand N(0,1) deviates

     From        To        Frequency    Number expected
    -4.000    -4.999            1              .3
    -3.000    -3.999           13             13
    -2.000    -2.999          200            214
    -1.000    -1.999         1335           1359
      .000     -.999         3410           3413
      .000      .999         3164           3413
     1.000     1.999         1368           1359
     2.000     2.999          212            214
     3.000     3.999            7             13
     4.000     4.999            0              .3
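The expected numbers in Table B-1 follow directly from the normal integral. A small check of that arithmetic, assuming exact N(0,1) probabilities and the total of ten thousand deviates, can be written as follows; the use of Python's math.erf here is simply an illustrative choice.

```python
import math

def normal_cdf(x):
    """Standard normal distribution function via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

N = 10_000
edges = [-5 + i for i in range(11)]      # the ten unit intervals from -5.000 to +5.000

for lo, hi in zip(edges[:-1], edges[1:]):
    expected = N * (normal_cdf(hi) - normal_cdf(lo))
    print(f"{lo:5.1f} to {hi:5.1f}: expected {expected:7.1f}")
```

Rounded to the nearest unit, this reproduces the .3, 13, 214, 1359, 3413 pattern of the last column.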