C. F. Federspiel, R. J. Monroe, and B. G. Greenberg (1959). "An investigation of some multiple regression methods for incomplete samples."

AN INVESTIGATION OF SOME MULTIPLE
REGRESSION METHODS FOR INCOMPLETE SAMPLES

by

C. F. Federspiel, R. J. Monroe, and B. G. Greenberg

This research was supported in part by U. S. Public Health Service Training Grant 2G-38 and by the Revolving Research Fund of the Institute of Statistics of the Consolidated University of North Carolina through a grant from the General Education Board for research in basic statistical methodology.

Institute of Statistics
Mimeo Series No. 236
August, 1959
TABLE OF CONTENTS

                                                                     Page
LIST OF TABLES                                                         vi
LIST OF FIGURES                                                        ix
CHAPTER I     INTRODUCTION                                              1
CHAPTER II    METHODS                                                   5
    2.1  Generation of Multivariate Samples                             5
    2.2  Censoring                                                      7
    2.3  Empirical Measures of Efficiency                               8
    2.4  Description of Experiments                                    10
    2.5  The Computing Programs                                        14
CHAPTER III   THE DELETION OF INCOMPLETE OBSERVATIONS                  16
    3.1  Discussion                                                    16
    3.2  Bias and Variance of Regression Coefficients                  17
    3.3  Efficiency of Prediction Equations                            23
CHAPTER IV    THE METHOD OF PAIRED CORRELATION COEFFICIENTS            27
    4.1  Definition                                                    27
    4.2  Bias and Variance of Regression Coefficients                  29
    4.3  Error Variance                                                29
    4.4  Efficiency of Prediction Equations                            32
CHAPTER V     THE GREENBERG PROCEDURE                                  49
    5.1  Definition                                                    49
    5.2  Bias and Variance of Regression Coefficients                  52
    5.3  Error Variance                                                56
    5.4  Efficiency of Prediction Equations                            61
    5.5  Convergence                                                   71
CHAPTER VI    MAXIMUM LIKELIHOOD                                       79
    6.1  Discussion                                                    79
    6.2  Bias and Variance of Regression Coefficients                  81
    6.3  Error Variance                                                81
    6.4  Efficiency of Prediction Equations                            83
CHAPTER VII   AN EXAMPLE                                               85
    7.1  The Data                                                      85
    7.2  Solutions                                                     87
    7.3  Discussion                                                    90
CHAPTER VIII  SUMMARY AND CONCLUSIONS                                  94
    8.1  Summary                                                       94
    8.2  Conclusions                                                  103
LIST OF REFERENCES                                                    106
APPENDIX A.  GENERATION OF MULTIVARIATE NORMAL DEVIATES               109
APPENDIX B.  THE NORMAL DEVIATES USED IN SAMPLING                     113
LIST OF TABLES

                                                                     Page
 1.  Statistics of regression coefficients estimated by the
     deletion of incomplete observation vectors                        18
 2.  Means and variances of errors of estimate obtained by the
     deletion of incomplete observations                               22
 3.  Means of Vw(bi) estimated by the deletion of incomplete
     observations                                                      24
 4A. Values of Ē and their standard errors estimated by the
     deletion of incomplete observations                               25
 4B. Values of Δ̄ and their standard errors estimated by the
     deletion of incomplete observations                               25
 5.  Statistics of regression coefficients estimated by the
     Method of Paired Correlation Coefficients                         30
 6.  Means and variances of errors of estimate obtained by the
     Method of Paired Correlation Coefficients                         33
 7A. Values of Ē and their standard errors estimated by the
     Method of Paired Correlation Coefficients                         34
 7B. Values of Δ̄ and their standard errors estimated by the
     Method of Paired Correlation Coefficients                         34
 8.  Analysis of variance of ln E for the trivariate populations
     of Experiment 1                                                   35
 9.  Values of Ē and residuals from the response equation Ê(A+B)
     of Experiment 2                                                   36
10.  Analysis of variance of values of Ē                               37
11.  Values of Ē and their standard errors at symmetrical points
     in the trivariate population space                                38
12.  Analysis of variance for the model Ê(A+B) of Experiment 2         42
13.  Values of Δ̄ estimated by the Method of Paired Correlation
     Coefficients                                                      44
14.  Statistics of regression coefficients estimated by the
     Greenberg Procedure; Experiment 1                                 53
15.  Statistics of regression coefficients estimated by the
     Greenberg Procedure; Experiment 4                                 55
16.  Statistics of regression coefficients from Experiment 7           58
17A. Values of Ē and their standard errors estimated by the
     Greenberg Procedure; Experiment 1                                 63
17B. Values of Δ̄ and their standard errors estimated by the
     Greenberg Procedure; Experiment 1                                 63
18.  Values of Ē and their standard errors estimated by the
     Greenberg Procedure; Experiment 4                                 64
19.  Analyses of variance for values of Ē from the data of
     Experiment 4                                                      65
20.  Analyses of variance for models fitted in Experiment 4            66
21.  Values of Ē and Δ̄ from the twenty samples of Experiment 8         70
22.  Mean number of iterations required for convergence                72
23.  Analysis of variance for number of iterations required for
     convergence in Experiment 4B                                      73
24.  Iterations required for convergence using the Greenberg
     Procedure on Carver's data with different initial estimates
     of missing values                                                 74
25.  Statistics of regression coefficients estimated by maximum
     likelihood                                                        82
26.  Means and variances of σ̂²y.x estimated by maximum likelihood      83
27A. Means of Ē and their standard errors estimated by maximum
     likelihood                                                        84
27B. Means of Δ̄ and their standard errors estimated by maximum
     likelihood                                                        84
28.  Biochemical determinations from Welt data                         86
29.  Sample correlation coefficients and their confidence limits
     based on the nij available pairs of Xi and Xj in Table 28         89
30.  Some extreme muscle potassium values predicted from the Welt
     data by various methods                                           92
31A. Values of E from Experiment 1                                     95
31B. Values of Δ from Experiment 1                                     96
32A. Values of E from Experiment 4B                                    98
32B. Values of Δ from Experiment 4B                                    98
33A. Values of Ē and their standard errors, Experiment 6               99
33B. Values of Δ̄ and their standard errors, Experiment 6              100
34.  Values of Ē, Δ̄ and their standard errors from Experiment 7       101
35A. Values of Ē and their standard errors, Experiment 8              102
35B. Values of Δ̄ and their standard errors, Experiment 8              102
APPENDIX. Table B-1. Frequency tabulation of ten thousand N(0,1)
     deviates                                                         113
LIST OF FIGURES

                                                                     Page
 1.  Histograms of regression coefficients estimated by deletion
     of incomplete observations                                        20
 2.  Histograms of regression coefficients estimated from complete
     observations                                                      21
 3.  Histograms of regression coefficients estimated by the Paired
     Correlation Method                                                31
 4.  Contours for Ê_A in trivariate population space                   40
 5.  Contours for Ê_B in trivariate population space                   41
     . . .                                                             45
     . . .                                                             47
     . . .                                                             57
     Mean pseudo-variance of Greenberg Procedure plotted against
     true variance; from data of Experiment 4B                         60
     Mean variance estimated from uncensored data plotted against
     true variance; from data of Experiment 4B                         60
     Histograms of estimated error variances                           62
     Histograms of Δ; Experiment 7                                     68
     Histograms of E; Experiment 7                                     69
     Histogram of number of iterations required for convergence        75
     Convergence curves applying the Greenberg Procedure to the
     five-variate sample, number 107                                   77
     Convergence curves applying the Greenberg Procedure to the
     five-variate sample, number 106                                   78
CHAPTER I

INTRODUCTION
A common problem in statistics is that of obtaining a multiple regression equation from data which have missing values among the independent variates. Examples of this situation occur in attempting to get prediction equations in such diverse fields as psychology, economics, and biochemistry. In psychology, some subjects may have failed to take all the tests in a series. In economics, production records of a given commodity may not have been kept consistently over a long period of time. In biochemistry, the analysis of some component of a serological sample may have been vitiated.

Several investigators have been concerned with the general problem of missing values in multivariate samples. They have, for the most part, dealt with maximum likelihood estimators for specific cases and were not directly concerned with prediction equations.
Wilks (1932) considered cases where values of x and y are both missing in samples from a bivariate normal and gave maximum likelihood estimates for some of the parameters, given others. He also evaluated the large sample efficiency of some simpler alternative estimates. In considering an extension of Wilks' work to the trivariate case, Matthai (1951) derived maximum likelihood estimators for the three means given the other six parameters. He suggested a modified form of estimation for the case in which it is necessary to estimate the variances and covariances from the data. In addition, he discussed applications of this work to the design of sample surveys.
Lord (1955) solved maximum likelihood equations to estimate the parameters for the special case of three variables, x1, x2, and x3, in which x1 and x2 are distributed bivariate normally, as are x1 and x3. With no loss of generality, the data could be arranged:

    x11  x12  ...  x1n    x1(n+1)  ...  x1N
    x21  x22  ...  x2n
                          x3(n+1)  ...  x3N

where the primary subscript denotes the variable, and the secondary, the individual. He also showed that, although the estimates of some of the parameters are improved by using all the data, estimates of the regression coefficients and errors of estimate, β21 = σ12/σ1², β31 = σ13/σ1², σ²2.1, and σ²3.1, are not. In considering this special case, Lord pointed out that "the general maximum likelihood equations have proved rather intractable and no simple formulae for the maximum likelihood estimators are available in the general case, even for a sample from a bivariate population."
Edgett (1956) dealt more directly with the prediction problem in considering the special case of the trivariate normal in which observations are missing from one variate only. Data would have the form:

    x11  x12  ...  x1n    x1(n+1)  ...  x1N
    x21  x22  ...  x2n    x2(n+1)  ...  x2N
    x31  x32  ...  x3n

After some rather tedious algebra, he obtained solutions in closed form for the maximum likelihood equations to estimate the parameters μ1, μ2, μ3, σ1², σ2², σ3², σ12, σ13, σ23. From these, coefficients for the regression of x1 on x2 and x3 were estimated, considering the latter two as independent variates.
The problems of Edgett and Lord were also considered by T. W. Anderson (1957), who indicated an approach which somewhat simplifies the mathematical manipulation necessary to get the maximum likelihood solutions. The procedure involves expressing the joint densities of the variates as the product of a marginal density and the remaining conditional densities. Further reference will be made to the papers of Edgett and Anderson in the chapter on maximum likelihood.
In a paper dealing particularly with the prediction problem, Nicholson (1957) discussed the multivariate case for which observations are missing in the dependent variable. He gave maximum likelihood estimates for the parameters based on the "incomplete sample" and for the complete portion of the sample, and concluded that, for prediction purposes, no use can be made of data which are incomplete in the dependent variable.
This thesis will deal primarily with methods for determining prediction equations when observations among the independent variates are not available. These methods include: ordinary least squares, using only those observation vectors which are complete in every variate; maximum likelihood for certain special cases; and two other procedures which have been proposed on the basis of their intuitive appeal and general applicability.

One of these methods involves finding the matrix of paired correlation coefficients for all variables, then solving for the standardized regression coefficients, and finally, converting these to the ordinary regression coefficients by making use of the means and variances calculated from all observations present for a given variable. This will be referred to as the Method of Paired Correlation Coefficients, or simply, the Paired Correlation Method.

The other approach consists of an iterative computing scheme whereby missing values of a given variable are successively predicted from the other variables until the coefficients for the regression of y on the xi converge simultaneously. This is called the Greenberg Procedure.
The objective of this study is to appraise these methods. An empirical sampling approach, or what is sometimes referred to as the Monte Carlo method, is used in this investigation. This approach has been found useful in problems which seem to defy strictly mathematical treatment, and has been particularly successful in studies of estimation procedures for a wide variety of models (Aitchison and Brown, 1957; Morrison, 1956; Neiswanger and Yancey, 1959; Orcutt and Cochrane, 1949). Its use here involves application of the estimation procedures described above to simulated fragmentary multivariate normal samples. Measures of prediction efficiency are calculated for each sample estimate and are appropriately summarized. The calculations of estimates and their efficiencies were carried out on an IBM 650 electronic computer.
In the following chapter, notation, definitions of efficiency, and the general method of approaching the problem are discussed. In Chapters III to VI, respectively, each method and the results of investigation using that method are described in detail. Chapter VII illustrates the use of the different procedures with an example and finally, in Chapter VIII, comparisons of the methods and their applicability under different circumstances are considered.
CHAPTER II

METHODS
2.1 Generation of Multivariate Samples

All the estimation procedures to be considered will assume samples from an underlying multivariate normal population. In order to carry out empirical studies, it was necessary to generate samples from multivariate normal populations with given characteristics. For p small, a convenient way of generating samples from a p-variate normal population with given parameters is given by applying a series of transformations to p independent sets of normal deviates. Finding the transformations involves getting the conditional distribution for x2, given x1; then, that for x3, given x1 and x2; and so on, finally getting the conditional distribution for xp, given the remaining p−1 variates. The transformation for the bivariate case is mentioned in the Rand Tables (1955) and that for the trivariate case is developed in Appendix A, section A.1. Actually, explicit expressions for the transformations are only of academic interest, since, for a given population, numerical values of the parameters may be substituted in the distribution function and the necessary conditional distributions found directly. This procedure, which has been described elsewhere by Moonan (1957), is illustrated in detail in Appendix A, section A.2 in finding the transformations necessary to generate samples from a four-variate normal population used in succeeding chapters.
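In matrix terms, the transformation just described amounts to multiplying each vector of independent N(0,1) deviates by a triangular factor of the desired correlation matrix. The sketch below (numpy assumed) is only a modern illustration of that principle, not the program actually used; the random-number generator merely stands in for the Rand Table deviates.

    import numpy as np

    def generate_sample(R, z):
        """Transform independent N(0,1) deviates z (n x p) into deviates whose
        population correlation matrix is R, using the triangular
        (conditional-distribution) transformation, i.e. a Cholesky factor."""
        L = np.linalg.cholesky(R)      # lower-triangular factor, R = L L'
        return z @ L.T                 # each row is one p-variate observation

    # Example: twenty observations from the trivariate population [.25 .50, .25]
    R = np.array([[1.00, 0.25, 0.50],
                  [0.25, 1.00, 0.25],
                  [0.50, 0.25, 1.00]])
    rng = np.random.default_rng(0)     # stands in for the Rand Table deviates
    sample = generate_sample(R, rng.standard_normal((20, 3)))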
Multivariate normal deviates having a mean of zero and variance equal to one have been generated, since we are not interested in trying to simulate any special population with regard to these parameters. In this way, the different populations to be considered are characterized completely by their correlation matrices.

A convenient notation to indicate a given population consists of writing the correlation matrix for that population in a single line between brackets, omitting the diagonal and corresponding symmetrical elements, and setting off the rows of the correlation matrix by commas. For example, a population having the correlation matrix

    1     ρ12   ρ13   ρ14
    ρ12   1     ρ23   ρ24
    ρ13   ρ23   1     ρ34
    ρ14   ρ24   ρ34   1

would be designated [ρ12 ρ13 ρ14, ρ23 ρ24, ρ34].
Since the Rand Tables (1955) afforded a ready source of N(0,1) deviates, they were used in punching a deck of two thousand cards, five deviates per card, for input in generating the multivariate samples needed. A description of the way the deviates were selected and a frequency tabulation of their values is given in Appendix B. Henceforth, samples of these N(0,1) independent deviates will be referred to as "original samples". Their transformed counterparts will be called "generated samples".

To demonstrate the efficacy of the generation program, the matrix of sums, sums of squares, and sums of cross products for four hundred observations of a four-variate population is included in Appendix A, section A.3.
2.2 Censoring

Patterns of missing values to be considered do not include cases of observations with characters missing from every independent variate, that is, cases where an observation vector consists of only the dependent variable. It is clear that such observations can contribute nothing to a prediction equation for y on the x's.

A description of the notation used to designate patterns of missing values in general follows. For the trivariate case, (c1 c2) is used to denote the number of characters, c1, missing from x1 alone, and c2, the number of characters missing from x2 alone. For the four-variate case (c1c2c3, c12c13c23) indicates ci (i = 1, 2, 3) as the number of characters missing in xi alone and cij (i ≠ j) as the number of observation vectors having characters missing from both xi and xj. For example, a sample of fifteen observations designated (303,211) might appear as follows:

    x1:  - - - * * * - - - * * * * * *
    x2:  * * * * * * - - * - * * * * *
    x3:  * * * - - - * * - - * * * * *
    y :  * * * * * * * * * * * * * * *

where a dash represents a missing value and an asterisk represents an observed value.
The number of missing characters is given by Σ ci + 2 Σ cij = 14 and the number of complete observations is given by n = N − Σ ci − Σ cij = 5. The extension to higher variate cases is obvious.
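A censoring pattern of this sort is conveniently represented by a table of presence-absence indicators, much like the binary censoring codes described later in section 2.5. The following sketch (numpy assumed; the layout of the example pattern is our own illustration) counts the missing characters and complete observations for a (303,211) sample.

    import numpy as np

    def censoring_summary(mask):
        """mask: boolean array, shape (p_independent, N); True = value observed.
        Returns (number of missing characters, number of complete observations)."""
        missing = int((~mask).sum())
        complete = int(mask.all(axis=0).sum())
        return missing, complete

    # A (303,211) pattern for N = 15: x1 alone missing 3 times, x3 alone 3 times,
    # (x1,x2) together twice, (x1,x3) once, (x2,x3) once; y is never censored.
    mask = np.ones((3, 15), dtype=bool)
    mask[0, 0:3] = False                 # x1 missing alone
    mask[2, 3:6] = False                 # x3 missing alone
    mask[0, 6:8] = mask[1, 6:8] = False  # x1 and x2 missing
    mask[0, 8] = mask[2, 8] = False      # x1 and x3 missing
    mask[1, 9] = mask[2, 9] = False      # x2 and x3 missing
    print(censoring_summary(mask))       # (14, 5)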
In the sampling work which follows, the artificial deletion of certain values among the independent variates to simulate missing measurements will be referred to as censoring.
The censoring schemes considered fall into two classes. One of these is the special case where values are missing from one variable only. The other is the case in which missing values occur with equal frequency among all the independent variables, with only one value missing from each observation vector. By the general notation indicated above, this latter type of censoring would be written (c1c2c3, 000) for the four-variate case and (c1c2c3c4, 000000, 0000) for the five-variate case. For these four- and five-variate cases the general notation has been abbreviated to (c1c2c3) and (c1c2c3c4), respectively.

It should be emphasized, however, that except for the method of maximum likelihood, the estimation procedures discussed are not restricted in any way by the pattern of missing values.
2.3 Empirical Measures of Efficiency

The usefulness of a regression equation is usually judged by the magnitude of the bias and the variance of the estimated regression coefficients. The t-statistic, t = (b̄i − βi)/s(b̄i), has been taken as a criterion for bias, and in the tables of biases in succeeding chapters, values which exceed the appropriate tabulated values of t at the five per cent level have been designated with an asterisk. However, it should be noted that we are interested primarily in prediction, and that estimates of regression coefficients may be biased in opposite directions so as to be mutually compensatory in nature as far as prediction is concerned. For this reason, as well as for general convenience, certain single measures of efficiency for a prediction equation have been proposed and will also be used to evaluate the methods discussed. One such measure defined by Nicholson (1948) is

    E_Z = Σ (yα − Yα)² / Σ (yα − Zα)² ,     α = 1, ..., N,

where yα represents the observed value of y for the αth observation; Yα, the estimate of yα based on the least squares regression equation of y on the xi for the N uncensored observations; and Zα, the estimate of yα based on the prediction equation, Z, whose efficiency is being tested. Both sums are taken over the N observation vectors which give rise to Y. Thus 0 ≤ E ≤ 1, E assuming the maximum value when Y = Z.
Since the parameters are known, another important measure of efficiency is

    Δ²_Z = Σ (Ỹα − Zα)² / N ,     α = 1, ..., N,

where Ỹ = Σ βi xi (i = 1, ..., p−1) is the true regression equation, Z is defined as above, and the sum is taken over the same N observation vectors as indicated above.
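Given the uncensored least-squares fit, a candidate prediction equation, and the true coefficients, both measures are simple sums of squares. A minimal numpy sketch under those conventions (the function and argument names are ours, not the thesis's):

    import numpy as np

    def efficiency_measures(y, X, beta_true, b_hat):
        """E_Z and Delta_Z^2 for a prediction equation with coefficients b_hat,
        judged against the least-squares fit on the N uncensored observations
        and against the true regression equation with coefficients beta_true.
        X is N x (p-1); b_hat and beta_true carry the intercept first."""
        Xd = np.column_stack([np.ones(len(y)), X])
        b_ls, *_ = np.linalg.lstsq(Xd, y, rcond=None)   # "best" uncensored fit
        Y = Xd @ b_ls
        Z = Xd @ b_hat
        Y_true = Xd @ beta_true
        E = np.sum((y - Y) ** 2) / np.sum((y - Z) ** 2)
        delta_sq = np.mean((Y_true - Z) ** 2)
        return E, delta_sq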
The measures E_Z and Δ²_Z are of interest when we consider Z as a prediction equation estimated from incomplete data. Although both are measures of "prediction efficiency" and are usually referred to throughout the text by the word "efficiency", it is understood that they are merely convenient empirical devices for comparison, and, as such, are to be distinguished from orthodox theoretical efficiency, which can be defined for an estimate θ̂ as the ratio of the variance of the minimum variance unbiased estimator of θ to the variance of θ̂.

It may be noted that E is an index which measures the goodness of the equation in question relative to the "best" estimate from the uncensored sample. It should be pointed out in this connection that E may be very high although neither estimate is good in the sense of giving a prediction equation close to the true one. On the other hand, Δ²_Z, which is a measure of the average squared deviation of the Z from values predicted by the true regression equation over a given sample, is dependent on the population from which samples were generated. Hence, it is not useful in comparing methods applied to samples from different populations.

In the tables which follow, the values of Δ² are considered in terms of Δ, since sampling distributions of Δ appear to be more symmetrical than those of Δ².
2.4 Description of Experiments

As was indicated in Chapter I, an empirical sampling approach was used. This approach took the form of a series of eight sampling experiments which were performed in sequence and were dictated in large part by successive results. The principal factors considered were the characteristics of the populations from which samples were taken and the intensity of censoring. Except for one experiment, sample size was confined to N = 20. Since we have some knowledge of how sample size affects estimates, this does not seem to be a very objectionable restriction. The populations, intensity of censoring, and number of samples on which calculations were made varied from experiment to experiment. The number of samples considered in a given experiment or on which a set of means and variances was based is denoted by k.

A brief description of the experiments conducted is set forth in the following paragraphs for the twofold purpose of (1) providing some perspective before the presentation of detailed results in chapters concerned with the respective methods of estimation and (2) identifying the experiments so that they need not be described repeatedly in various places throughout the text.

Experiment 1 was the most comprehensive experiment undertaken. It consisted of estimating regression equations for an arrangement in which twenty original samples were transformed to produce samples from the trivariate populations [.25 .50, .25] and [.80 .60, .80], and the four-variate population [.25 .50 .75, .25 .50, .25]. Each of the resulting forty generated samples from the trivariate populations was censored in ten different ways: (11), (22), (33), (44), (55), (66), (04), (40), (08), (80). The twenty generated samples from the four-variate population were each censored (111), (222), (333), (444), and (555). Estimates of prediction equations were made from the samples at each of these population-censoring combinations using the uncensored data, complete observations, the Greenberg Procedure, and the Method of Paired Correlation Coefficients.
In Experiment 2, samples from thirty-nine different trivariate populations were censored (55) and regression statistics calculated using the Method of Paired Correlation Coefficients. The main purpose was to find out how different populations affected the efficiency of the prediction equations determined by using this method. The experiment can be subdivided into two parts. In Experiment 2A, the same ten original samples were used to provide ten generated samples from each of the thirty-nine populations. In Experiment 2B, 390 different sets of original samples were used in generating ten samples from each of the thirty-nine populations. The rationale dictating the use of this type of design is based on the premise that the mean efficiencies at each point in the population space, relative to other points, may be better determined by using the same set of original samples, but that the entire configuration in the space would be better determined by using different random samples at each point. This feature of the sampling plan represents a unique facet of the general empirical sampling approach. It is also used in Experiment 4.
Experiment 3 was performed to determine whether or not conclusions based on Experiment 2 also held in the four-variate case. Twenty original samples were used to generate twenty samples from each of six different four-variate populations. The samples were censored (555) and regression equations determined using the Method of Paired Correlation Coefficients.
In Experiment 4, the Greenberg Procedure was used on samples censored (55) from sixteen different trivariate populations. The populations selected were dictated by a response surface design in the space of admissible sets of correlation coefficients. Like Experiment 2, this experiment consisted of two parts: 4A, involving the use of ten original samples to generate ten samples from each population, and 4B, making use of different sets of original samples for each population.

Experiment 5 was designed to investigate the convergence of successive estimates made by the Greenberg Procedure. It was limited in scope and is described in detail in the chapter dealing with the Greenberg Procedure.
The Paired Correlation Method is applied to samples from two five-variate populations in Experiment 6. The populations were chosen to generate samples simulating those arising in two "live problem" situations. The purpose was to demonstrate cases where the Paired Correlation Method would or would not be appropriate. The method was used on two hundred samples censored (3333), one hundred from each of the populations examined.
Experiment 7 involved the use of complete observations, the Greenberg Procedure, and the Paired Correlation Method to find regression equations from one hundred four-variate samples censored (333). The population chosen was one for which the Method of Paired Correlation Coefficients was not expected to provide efficient results. The experiment aimed at determining, under such circumstances, the relative "goodness" of prediction with the use of complete observations, the Greenberg Procedure, and the Paired Correlation Method.

In Experiment 8, the Greenberg Procedure was used on twenty samples from the two five-variate populations considered in Experiment 6. The primary function of this work was to study convergence properties under the Greenberg Procedure; however, the results in terms of prediction efficiency were tabulated and discussed.
Values of Δ_Z and E_Z, as well as estimates of the regression parameters, were computed for each of the cases in the experiments described above.
2.5 The Computing Programs

There are a few points about the computing programs which merit discussion. The input for every program consisted of sets of p independent N(0,1) deviates which were transformed into multivariate normal deviates and stored in a data table on the drum. After twenty vectors had been generated, the computations were carried out and the process repeated, the transformed values never existing outside the machine. The transforming coefficients for a given population were computed manually and stored on the drum to be used repeatedly for a series of samples.

Although the data consisted of a complete table of N (N = 20) p-dimensional vectors, they were accompanied by a censoring code comprised of Np binary digits, each one denoting the presence or absence of its corresponding deviate value. Censoring codes, like the transforming coefficients mentioned above, were punched on cards and stored on the drum to be used repeatedly for a series of samples.
One device used to check the correctness of a program for different methods of estimation consisted of using a censoring code which indicated no missing values. All methods of estimation should then yield identical results.

Symbolic Optimum Assembly Programming (SOAP) was used in writing all programs.
CHAPTER III

THE DELETION OF INCOMPLETE OBSERVATIONS

3.1 Discussion

The most obvious solution to the problem of missing values is simply to discard observation vectors which are not complete in every variate. We will refer to this procedure as estimation from complete data. Actually, such treatment hardly can be called a solution, and, at best, must be considered a trivial one. The main purposes of including a section devoted to this procedure are to point out objections to it, and to include results given by it which will serve as standards of comparison for results obtained by other methods.

The objection to such a procedure is that the efficiency of the prediction equation might be decreased by casting out important information. In deciding whether to use only data complete in every variate or whether to employ some other method making use of all the information, the first factor to be considered is the amount and proportion of complete data available. If a large amount of data is available, one usually would not hesitate to discard a part of it in order to apply a straightforward analysis.
On the other hand, there might be enough complete observation vectors available so that the others could be discarded, but by so doing, the variance of predicted values would be increased or estimates biased. Many times values missing from incomplete observation vectors occur near the extremes of the range of the variable in question. The occurrence of missing values is, in fact, often correlated with their extreme size for one reason or another. It is likely that observed values in an incomplete vector fall at very high or very low levels also, and by discarding such values the range of the x's is decreased, resulting in diminished precision of the regression estimate. For example, in the field of anthropometry, Münter (1936) and his associates, in measuring skeletal remains, found that bones of smaller specimens were often broken and could not be measured. If a prediction equation for some specific characteristic were to be determined, or if a linear discriminator were to be established, then discarding all the measurements of the specimens with at least one broken bone would clearly be undesirable. Another example applicable to the present discussion arises in trying to predict earnings based on questionnaire information regarding an employee's experience, education, place of employment, and other variables. Non-response on education is known to be correlated with low earnings.

It follows from the above discussion that consideration should be given to the levels of the x's at which one expects to apply the prediction equation. In predicting y for values in the middle range of the x's, there would be less concern about discarding incomplete observations having extreme x values than would otherwise be the case.
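Computationally the procedure is simply listwise deletion followed by ordinary least squares on the n complete vectors. A minimal sketch (numpy assumed; coding missing values as NaN is our convention, not the thesis's):

    import numpy as np

    def fit_complete_cases(y, X):
        """Estimation from complete data: discard every observation vector with
        a missing value among the independent variates (coded as np.nan), then
        fit ordinary least squares to the n remaining observations."""
        keep = ~np.isnan(X).any(axis=1)
        Xc = np.column_stack([np.ones(keep.sum()), X[keep]])
        b, *_ = np.linalg.lstsq(Xc, y[keep], rcond=None)
        return b        # b[0] is the intercept, b[1:] the regression coefficients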
3.2 Bias and Variance of Regression Coefficients

The data from Experiment 1, computed by using only complete observation vectors, are summarized in Table 1, which gives the biases and variances of the regression coefficients for different combinations of censoring and population. Since the estimates obtained by deleting incomplete observation vectors are ordinary least squares estimates based on the n = N − Σ ci − Σ cij complete observations, they are expected to be unbiased, with variance increasing as n decreases.
Table 1. Statistics of regression coefficients estimated by the deletion of incomplete observation vectors; Experiment 1, N = 20, k = 20

                    [.25 .50, .25]                               [.80 .60, .80]
Censoring  b0−β0  V(b0)  b1−β1  V(b1)  b2−β2  V(b2)    b0−β0  V(b0)  b1−β1  V(b1)  b2−β2  V(b2)
(00)       .113*  .055   .051   .058  −.033   .025     .078*  .026   .060   .052  −.037   .032
(11)       .129*  .055   .069   .072   .014   .042     .087*  .027   .038   .076   .015   .053
(22)       .149*  .050   .082   .093   .037   .046     .101*  .024   .030   .101   .042   .058
(33)       .193*  .062   .056   .135   .064   .046     .134*  .030  −.007   .123   .072   .058
(44)       .190*  .055   .048   .150   .086   .075     .104*  .026  −.029   .150   .096   .096
(55)       .124*  .068   .009   .195   .053   .095     .086*  .033  −.032   .197   .059   .120
(66)       .194*  .144   .018   .243   .016   .208     .135*  .070   .001   .312   .018   .262
(04)       .098   .063   .034   .076  −.001   .041     .069   .030   .014   .071  −.001   .052
(40)       .190*  .055   .059   .121   .048   .040     .104   .017   .007   .109   .051   .050
(08)       .086   .082  −.029   .074  −.053   .053     .060   .040   .018   .078  −.060   .067
(80)       .125   .072   .117   .209   .021   .050     .087   .035   .067   .168   .023   .063

                    [.25 .50 .75, .25 .50, .25]
Censoring  b0−β0  V(b0)  b1−β1  V(b1)  b2−β2  V(b2)  b3−β3  V(b3)
(000)      .028   .016  −.019   .030   .035   .020   .050   .021
(111)      .068*  .014  −.001   .037   .056   .018   .036   .028
(222)      .064   .032   .009   .058   .051   .038   .002   .040
(333)      .042   .035  −.001   .092   .006   .048   .017   .055
(444)      .024   .112   .069   .279   .030   .100  −.022   .099
(555)     −.023   .398   .090  1.008    …    2.763   .112   .866

a. The symbol * indicates t = (b̄i − βi)/s(b̄i) > t.05,19, where s²(b̄i) = V(bi)/k.
Table 1 seems to bear out these expectations. None of the biases of b1 and b2 are significant by the t criterion given in Chapter II. The bias of b0 which is indicated significant for most of the three-variate cases here may be dismissed as a sampling phenomenon on the basis of similar unbiased statistics calculated from other sets of original samples censored (55) in Experiment 4. Recall that the data of Table 1 for each censoring-population combination is based on the same set of twenty original deviate samples.

Further evidence confirming the expected normal character of the regression coefficients based on complete data can be seen in the histograms of Figures 1 and 2. These data were taken from experiments concerned with higher variate cases. Each figure is based on different sets of one hundred original samples. For the data in Figure 1, the values of t for testing bias in b0, b1, b2, b3, and b4 were −.818, −.391, −.473, .212, and .700, respectively, while the t values for bias in the data represented by Figure 2 were −.21, −.886, −.068, and .904 for b0, b1, b2, and b3, respectively.

The mean values and variances of the s²y.x from Experiment 1 have been computed and are tabulated in Table 2. As in Table 1, the tendency for the variances of the estimates to increase with increased censoring may be noted. A similar tendency with regard to the bias is not evident. Figure 10(B) shows the histogram of values of s²y.x from one hundred samples censored (333). The shape of the histogram is consistent with the theoretical distribution of such a statistic, which is proportional to χ² with seven degrees of freedom.
Figure 1. Histograms of regression coefficients estimated by deletion of incomplete observations; [.60 .80 .29 .40, .50 −.20 .50, .20 .70, .30], censoring (3333), N = 20, k = 100
Figure 2. Histograms of regression coefficients estimated from complete observations; [−.70 −.52 .32, .60 .26, .44], censoring (333), N = 20, k = 100
Table 2. Means and variances of errors of estimate obtained by the deletion of incomplete observations; Experiment 1, N = 20, k = 20

              [.25 .50, .25], σ²y.x = .7333        [.80 .60, .80], σ²y.x = .3556
Censoring       s̄²y.x       V(s²y.x)                 s̄²y.x       V(s²y.x)
(00)
,742
.082
•.360
.019
(000)
(11)
.7.34
07.36
•.356
•.357
•.366
.022
.018
.026
(111)
.756
.094
.076
.110
.740
,14.3
.358
.Q.34
(66)
.770
.724
(o4t.
.7.37
.208
.106
.086
•.373
•.351
•.357
.049
.025
.020
{40~
.750
.118
•.364
.028
(O8~
.701
.76.3
.340
•.370
.0.32
(80~
.097
.136 "
~5~
--;;
y.x
(222)
(.3.3.3)
(444)
(555)
,~22·:P
2
CT . . • .:3"2:39'
,~2>1
"Y.X
Censoring
s;.x
(22)
(.33)
(44)
.:P~i5
•.349
...361
V(s2 )
y.x
.007
.345
.370
.009
.015
.Q26
•.315
.064
•.351
•.31Q
.023
I\)
J\)
As mentioned above, the values in Tables 1 and 2 are included primarily as standards for comparison with similar tables based on statistics calculated by methods which utilize incomplete observation vectors.

The estimated variances of regression coefficients referred to above are the "between sample" or external estimates of variances, denoted V(bi). Although the estimated variances in succeeding discussions will be of this type, it seems appropriate to mention that "within sample" or, in Bliss' (1952) terms, the internal variances, Vw(bi), are available here. The inconsistency between estimates of these two types has long been of concern in bioassay applications, but as an experimental rather than a statistical problem (Bliss, 1952; Sheps and Munson, 1957). An examination of the consistency of these two ways of estimating the variance of regression coefficients is of interest as a by-product of this research. Table 3 gives mean values of the Vw(bi). A comparison of these with their counterparts in Table 1 reveals no startling discrepancies.
3.3 Efficiency of Prediction Equations

Table 4A gives the means of E and their standard errors for the censoring schemes of Experiment 1. When estimates are obtained by deleting incomplete observation vectors, Ē is invariant for samples representing different populations, but generated from the same original samples. The means of Δ̄ and their standard errors are shown in Table 4B.
Table 3. Means of Vw(bi) estimated by the deletion of incomplete observations; Experiment 1, N = 20, k = 20

              [.25 .50, .25]
Censoring     Vw(b0)    Vw(b1)    Vw(b2)
-
(00)
(11)
.042
(22)
.055
.069
.078
.108
(33)
(44)
(55)
(66)
.~,
.047
.151
[ .80
.60,
Vw(bo )
Vw(~)
.80]
Vw(b2)
.052
.060
.048
.058
.020
.064
.060
.023
.075
.093
.107
.063
.081
.073
.080
.102
.095
.122
.027
.033
.038
.052
.075
.086
.106
.122
.166
.191
.073
.251
.241
.159
.229
[.25
Censoring
.9:> .75, .25
Vw(b0 ) Vv(~)
(000)
(Ill)
.021
.028
(222)
.~,
.251
Vw(b 2)
Vw(b )
3
.023
.031
.040
.-
-
.032
.040
.068
.120
(333)
(444)
.036
.061
.080
.034
.043
.058
.090
.128
.109
.048
.073
.117
.155
(555)
.798
1.402
3.994
1.635
I\)
~
Table 4A. Values of Ē and their standard errors estimated by the deletion of incomplete observations; Experiment 1, N = 20, k = 20

            Trivariate                    Four-variate
Censoring     Ē      S.E.(Ē)     Censoring     Ē      S.E.(Ē)
(11)        .965     .009        (111)        .967     .007
(22)        .939     .011        (222)        .888     .018
(33)        .885     .018        (333)        .795     .033
(44)        .855     .023        (444)        .625     .056
(55)        .803     .028        (555)        .335     .045
(66)        .722     .039
(04)        .944     .011
(40)        .939     .011
(08)        .881     .020
(80)        .859     .025
Table 4B. Values of Δ̄ and their standard errors estimated by the deletion of incomplete observations; Experiment 1, N = 20, k = 20

            [.25 .50, .25]      [.80 .60, .80]               [.25 .50 .75, .25 .50, .25]
Censoring    Δ̄     S.E.(Δ̄)       Δ̄     S.E.(Δ̄)    Censoring     Δ̄     S.E.(Δ̄)
(00)        .303    .028        .211    .020      (000)        .222    .012
(11)        .340    .031        .237    .021      (111)        .252    .016
(22)        .375    .032        .261    .022      (222)        .301    .019
(33)        .437    .039        .304    .027      (333)        .348    .029
(44)        .460    .037        .320    .026      (444)        .524    .074
(55)        .488    .039        .340    .027
(66)        .607    .074        .423    .051
(04)        .353    .030        .245    .021
(40)        .385    .039        .268    .027
(08)        .381    .030        .266    .021
(80)        .468    .055        .326    .038
These tables demonstrate an expected relationship, that greater censoring decreases efficiency of prediction. By so doing, the results lend support to the use of E_Z and Δ_Z as effective measures of efficiency. The histograms of Figures 7A, 11A, and 12B indicate distributions of E and Δ when using only complete observations for certain cases considered in Experiments 6 and 7.

Chapter VIII includes tables in which mean efficiencies and their standard errors based on estimates from complete observations are compared with their counterparts based on other methods.
CHAPTER IV

THE METHOD OF PAIRED CORRELATION COEFFICIENTS

4.1 Definition

Although the procedure described below seems to have been used widely to deal with the problem at hand, there is no apparent reference to it in the literature, either describing it or even referring to it as a technique employed in analyzing incomplete data. It is not clear whether this is due to the rather simple nature of the scheme or whether it is a reflection of its limited usefulness. At any rate, it will be described below and some of its limitations will be brought out in connection with results given later in the chapter.
Suppose we have a set of data such as that schematically represented on page 6. From this we can form the matrix of paired correlation coefficients:

              | 1    r12   r13   ...   r1(p−1) |  r1y     |
              |      1     r23   ...   r2(p−1) |  r2y     |
    C(p×p) =  |                  ...           |  ...     |      (4.1)
              |                        1       |  r(p−1)y |
              |--------------------------------+----------|
              |                                |  1       |

where the upper left partition is C(p−1)(p−1) and the rij are correlation coefficients estimated from the existing nij pairs of observations in the sample of N observation vectors. The standardized regression coefficients, b′1, b′2, ..., b′(p−1), are then obtained by solving

    C(p−1)(p−1) b′ = r_y ,     (4.2)

where r_y denotes the column (r1y, r2y, ..., r(p−1)y)′.
The b′i may then be converted to ordinary regression coefficients by substituting the appropriate statistics for the parameters in the well-known formulae

    βi = β′i (σy / σi)     (4.3)

and

    β0 = μy − Σ βi μi .     (4.4)

The sample values for the mean and variance of a variable are taken over all existing values of that variable; that is,

    x̄i = Σ xiα / ni     (4.5)

and

    s²i = Σ (xiα − x̄i)² / (ni − 1) .     (4.6)

Generally, ni ≥ nij. As mentioned in Chapter I, the restriction ny = N will be maintained here.

An estimate of the error variance σ²y.x may also be obtained by substituting sample values given above for parameters in the formula

    σ²y.x = σ²y − β′ V β ,     (4.7)

where β is the vector of regression coefficients, β′ is its transpose, and V is the variance-covariance matrix of the xi. It is clear that this method of estimation reduces to ordinary least squares regression when nij = N; that is, where no values are missing.
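Putting equations (4.1) through (4.7) together, the whole method can be sketched in a few lines. The function below is an illustration under stated conventions (missing x-values coded as NaN, y fully observed, numpy assumed), not the IBM 650 program of the thesis.

    import numpy as np

    def paired_correlation_fit(y, X):
        """Method of Paired Correlation Coefficients (a sketch).
        Returns (b0, b, s2_yx): intercept, coefficients, and the estimated
        error variance, following equations (4.1)-(4.7)."""
        data = np.column_stack([X, y])            # y last, as in C(pxp)
        p = data.shape[1]
        # pairwise correlations from whatever n_ij pairs are jointly present
        C = np.ones((p, p))
        for i in range(p):
            for j in range(i + 1, p):
                ok = ~np.isnan(data[:, i]) & ~np.isnan(data[:, j])
                C[i, j] = C[j, i] = np.corrcoef(data[ok, i], data[ok, j])[0, 1]
        means = np.nanmean(data, axis=0)          # over all existing values
        sds = np.nanstd(data, axis=0, ddof=1)
        b_std = np.linalg.solve(C[:-1, :-1], C[:-1, -1])   # standardized b'
        b = b_std * sds[-1] / sds[:-1]                     # ordinary coefficients
        b0 = means[-1] - b @ means[:-1]
        # error variance: s2_y - b' V b, with V the covariance matrix of the x's
        V = np.outer(sds[:-1], sds[:-1]) * C[:-1, :-1]
        s2_yx = sds[-1] ** 2 - b @ V @ b
        return b0, b, s2_yx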
4.2 Bias and Variance of Regression Coefficients

The biases, b̄i − βi, and variances, V(bi), for the population-censoring schemes of Experiment 1 are indicated in Table 5. There is no marked tendency for bias to increase with increased censoring, and only the biases in b0 appear to be significant by the t criterion adopted to assess this bias. The remarks in Chapter III, pointing out that the significant bias of b0 is due to sampling, are also applicable here. There were no significant biases among the b1 and b2 in Table 5. In the four-variate situation, at higher levels of censoring, the values of b̄i − βi are very large, although not significantly so by the t tests. This draws our attention to the large variances at these levels. In certain situations, which will be discussed in succeeding sections, the biases, b̄i − βi, become very large, as do the variances of the bi. Consequently, one may be estimating the coefficients very badly in using this procedure, although the t criterion for bias would not indicate this. Examples of these large variances are illustrated in the results of Experiments 6 and 7, which will be discussed in the section on efficiencies later in this chapter. The statistics relating to the bias and variances of regression coefficients computed by the Paired Correlation Method in Experiment 7 are given in the second column of Table 16 and serve to point up the statements made above. Histograms of these regression coefficients may be seen in Figure 3.
4.3 Error Variance

The error variance, s²y.x, was predicted for each sample as described in section 4.1, equation (4.7). The mean values and variances of these estimates from Experiment 1 are given in Table 6.
Table 5. Statistics of regression coefficients estimated by the Method of Paired Correlation Coefficients; Experiment 1, N = 20, k = 20

                    [.25 .50, .25]                               [.80 .60, .80]
Censoring  b0−β0  V(b0)  b1−β1  V(b1)  b2−β2  V(b2)    b0−β0  V(b0)  b1−β1  V(b1)  b2−β2  V(b2)
(00)       .113*  .055   .051   .058  −.033   .025     .078*  .027   .060   .052  −.037   .032
(11)       .121*  .050   .066   .064  −.001   .034     .082*  .031   .040   .082  −.002   .059
(22)       .124*  .050   .072   .067   .017   .033     .078   .032   .044   .108   .000   .083
(33)       .140*  .055   .047   .084   .026   .031     .089   .042  −.020   .151   .039   .105
(44)       .147*  .058   .042   .102   .052   .029     .092   .046  −.048   .198   .060   .137
(55)       .187*  .083   .068   .136   .066   .067     .091   .050   .027   .250   .021   .152
(66)       .202*  .078   .141   .166   .009   .132     .086   .046   .080   .485  −.027   .298
(04)       .129*  .049   .045   .055   .017   .039     .096*  .036   .031   .089   .006   .080
(40)       .139*  .065   .048   .102   .002   .028     .079   .030  −.091   .170   .041   .090
(08)       .108   .058   .048   .077  −.108   .076     .085   .049  −.061   .182  −.079   .130
(80)       .128   .083   .075   .225  −.046   .058     .092*  .038   .038   .350  −.047   .162
.;n
[.25
=-lbo-po
V(b )
(000)
.028
.016
(lll)
.0.51-
(222)
.O~
.75
,
.25
,
.;n
.25]
V(~)
b 2-fl 2
V(b2)
b3:-fl3
V(b )
-.019
.030
.035
.020
.049
.021
.020
.017
.041
.043
.034
.025
.033
.023
.041
.055
.076
.041
-.005
.042
.064
o
~-fll
3
(333)
.0.51-
.026
.032
.079
.087
.047
-.017
(444)
.261
.646
-.270
2.244
.449
2.310
.613
9.190
.749
.1S'!
.928
(555)
a The symibol
~329
*
.973
indicates
.239
b. -
fl.
t=~
~.
l.
.609
> t. 05,
.475
2
19,
where
Si6...
l.
V(bi )
~
w
o
Figure 3. Histograms of regression coefficients estimated by the Paired Correlation Method; [.60 .80 .29 .40, .50 −.20 .50, .20 .70, .30], censoring (3333), N = 20, k = 100. The symbol + indicates a value outside the range of the scale.
Table 6 reveals the negative bias in the estimates attendant with increased censoring, and suggests that the variance of the estimates increases as censoring becomes heavy. An examination of the individual s²y.x for the four-variate population of Experiment 7, censored (333), points up this bias in a rather startling manner by revealing negative estimates of variance. The distribution of s²y.x at this particular point is indicated in Figure 10D by a histogram of the estimates from one hundred samples. It becomes apparent that it is possible for the negative estimates to arise when we realize that the correlation matrix, C(p−1)(p−1), is not guaranteed positive definite, since the rij are based on different nij pairs.
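Because each rij is computed from its own nij pairs, the assembled matrix need not be a possible correlation matrix at all, and whenever it has a negative eigenvalue the estimate from (4.7) can turn negative. A small demonstration with invented correlation values (chosen purely for illustration, not taken from the thesis data):

    import numpy as np

    # Hypothetical pairwise correlations, each from a different set of n_ij
    # pairs; taken together they do not form a valid correlation matrix.
    C = np.array([[1.0,  0.9,  0.9],
                  [0.9,  1.0, -0.3],
                  [0.9, -0.3,  1.0]])        # order: x1, x2, y

    print(np.linalg.eigvalsh(C).min())       # negative => C not positive definite

    b_std = np.linalg.solve(C[:2, :2], C[:2, 2])
    s2_ratio = 1.0 - b_std @ C[:2, 2]        # s2_y.x / s2_y in standardized form
    print(s2_ratio)                          # a negative "variance" estimate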
4.4 Efficiency of Prediction Equations

Table 7A gives the means of E and their standard errors for the various population-censoring combinations of Experiment 1. Table 7B shows the mean values of Δ and their associated standard errors. There appears to be a very definite population effect influencing the efficiencies. It is clearly demonstrated by testing the hypothesis that geometric mean values of E representing the two populations are equal, assuming the usual linear model. An analysis of variance of the logarithms of E for the three-variate populations censored (11) to (66) appears in Table 8.
-Table 6.
ee
Means and variances of errors of estimate
obtained by the Method of Paired Correlation Coefficients;
Experiment 1, N == ·20, k == 20
.50
[ .25
Censoring
O"~.x
-Sy.x
2
,
.25] [.80.60· • .80]
0"2
= .3556 .
== .7333 .
l.X
~(s:.~ .
.-.......-.
2
Sy.x
(00)
(11)
.741
.715
.082
.360
.084
.349
(22)
.093
.098
.340
(33)
(44)
.715
.710
.696
·.096
(55)
.(66)
.652
.611
.095
.102
(04)
(40)
(08)
(80)
.718
.087
.723
.718
.628
.092
.082
.109
·.332
.308
.308
.289
.341
.326
.333
.296
[.25
Censoring
V(s2 .)
y.
-
.50
s
.75 . , .25 .50
2
0"l.X == .3239
2
y.x .
(000)
(lll)
.349
.340
.007
.006
.012
.016
(222)
.320
(333)
(444)
.319
.194
.064
.012
.008
(555)
.25]
~(s2y. ~
.019
.017
.015
.015
.032
.015
•
.219
.364
.015
.021
.034
w
w
Table 7A. Values of Ē and their standard errors estimated by the Method of Paired Correlation Coefficients; Experiment 1, N = 20, k = 20

            [.25 .50, .25]      [.80 .60, .80]              [.25 .50 .75, .25 .50, .25]
Censoring    Ē     S.E.(Ē)       Ē     S.E.(Ē)   Censoring     Ē     S.E.(Ē)
(11)        .983    .005        .962    .010     (111)        .967    .008
(22)        .972    .006        .943    .013     (222)        .907    .021
(33)        .958    .009        .901    .025     (333)        .870    .028
(44)        .948    .010        .874    .028     (444)        .784    .053
(55)        .918    .021        .819    .038     (555)        .606    .072
(66)        .857    .035        .769    .046
(04)        .983    .005        .931    .012
(40)        .967    .008        .939    .019
(08)        .936    .024        .833    .027
(80)        .906    .029        .880    .042
Table 7B. Values of Δ̄ and their standard errors estimated by the Method of Paired Correlation Coefficients; Experiment 1, N = 20, k = 20

            [.25 .50, .25]      [.80 .60, .80]              [.25 .50 .75, .25 .50, .25]
Censoring    Δ̄     S.E.(Δ̄)       Δ̄     S.E.(Δ̄)   Censoring     Δ̄     S.E.(Δ̄)
(00)        .303    .028        .211    .020     (000)        .222    .012
(11)        .321    .028        .246    .020     (111)        .262    .016
(22)        .329    .028        .251    .024     (222)        .296    .019
(33)        .352    .034        .297    .029     (333)        .330    .024
(44)        .376    .036        .317    .033     (444)        .807    .454
(55)        .422    .049        .360    .039     (555)        .774    .202
(66)        .484    .057        .408    .054
(04)        .329    .025        .253    .023
(40)        .355    .035        .270    .030
(08)        .360    .031        .335    .029
(80)        .430    .063        .338    .057
Table 8. Analysis of variance of ln E for the trivariate populations of Experiment 1

Variation due to      d.f.      S.S.      M.S.        F
Censoring (C)           5     1.2691     .2538      8.29
Population (P)          1      .3902     .3902     12.75**
C × P                   5      .1162     .0232       NS
Error                 228     6.9693     .0306
Total                 239
In order to define more clearly the regions of differential efficiency in the three-variate case, Experiment 2 was set up to evaluate efficiency at a large number of points in the population space at a fixed level of censoring (55). The aim of the experiment was to find a response surface to describe efficiency for different types of populations. Thirty-nine populations were chosen to represent regions in the space of admissible ρ12, ρ1y, ρ2y values for which ρ1y and ρ2y were positive. The points chosen are given in the first column of Table 9. As mentioned in Chapter II, the experiment was divided into two parts: part A, in which the same set of ten original samples was used to generate the samples on which estimates were made at each point, and part B, in which 390 different samples were used to generate ten samples at each of the thirty-nine points. The purpose of choosing this type of sampling scheme was discussed in Chapter II. Table 10 shows that the use of the same set of ten original samples at each point in Experiment 2A makes for an expected reduction in the estimated error variance of the efficiencies.
Table 9. Values of Ē and residuals from the response equation Ê(A+B) of Experiment 2

Population [ρ12 ρ1y, ρ2y]
[ .80
.ro,
[ -.70
.30,
.30,
.30,
.30,
.30,
.30,
.30,
.30,
.30,
030,
.50,
.50,
.50,
.50,
.50,
050,
.50,
.70,
070,
070,
080,
.20,
.30,
.50,
.70,
.70,
.70,
.70,
.70,
070,
.30,
.21,
031,
.67,
.50,
.20,
.60,
.30,
[ -.60
[ -.30
[ -.30
[ 010
[-.10
[ .50
[ ·70
[ ·30
[ .50
[ 070
[ .00
[- 050
[- .20
[ .20
[ ·40
[
.60
[ 060
[ .90
[ -050
[ .30
[ .75
[ .69
[ .87
[ -020
[ .20
[ 090
[ 030
[ 000
[ .60
[ .18
[ -.55
[ -.32
[ 089
[ .30
[- 075
[ 020
[- .69
.80J
.40]
.201
.30J:
.oo]!
.201
.70];
.(0)
.701
.80];
.20J
.20]
0801
.201
.401.
.10)
0301
.50}
.801
.50]
.10J
.301
.301
0401
.551
.40]
.201
.87]
.70]
.10]
.50]
.,0]
030]
.50]
.70J
.631
030]
.77]
.40]
EA+B
IA
IB
.8047
.5454
.7735
09012
.7159
1.0164
.7663
.9368
.8776
.8575
.9219
.7670
.8556
.7434
.8265
·9293
09217
.8883
.8030
.7006
.6468
.7400
08212
.8753
.7841
.7189
.8426
.7480
07995
06924
.8208
09665
07804
.7729
.7576
08947
.6186
.7799
.5551
.8046
.4835
.7937
.8387
.7884
.8977
.8090
.9159
.8057
.8825
.9026
.7428
.8694
.7806
.8245
.8863
.8936
.8908
.8734
.5468
.6998
.8218
.8829
.9003
08318
.7499
08460
.6889
08377
.6994
.8431
.8972
.80/.,.6
.8096
.8062
.8925
06533
07989
.5163
.8819
.3218
.8599
.89119
07988
.9612
.7382
·9311
.8220
.8943
.9268
.8102
.8219
.8562
.8403
.9332
.9195
.9418
08342
.6907
.5719
·7197
.8579
.8273
.8843
.6003
.8497.
06515
08007
07153
.9058
.9359
.8332
.8006
.7646
.8973
.7118
06877
04746
EA -
/\
EA+B
-.0001
- .0619
.0202
~.0625
.0725
-.1187
.0421
-.0209
-.0719
.0250
-.0193
-.0242
.0138
.0372
- .0020
-.0430
-.0281
.0025
00704
-.1538
.0530
.0318
00617
.0250
.0477
.0310
.0034
- .0591
.0382
.0070
.0223
-.0693
.0242
.0367
.0484
-.0022
.0347
.0190
-.0388
A
~ - EA+B
.•0772
-.2236
.0864
-.0063
.0829
-.0552
-.0281
-.0057
-.0556
.•0368
.00149
.0432
-.0337
.1128
.0138
.0039
-.0022
.o53S
.0312
-.0099
-.07119
-.0703
.0367
-.0480
.1002
-.0586
00069
-.0965
00012
00229
.0850
-.0306
.0528
.0277
.0070
.0026
.0932
-·9022
-.0805
Table 10. Analysis of variance of values of Ē

Experiment 2A
Variation due to            d.f.      S.S.       M.S.
Population (P)                38     4.2437     .1117
Samples (S)                    9     9.3076    1.0342
P × S                        342     9.0269     .0264
Total                        389    22.5782
(s²(Ē) = .0026)

Experiment 2B
Variation due to            d.f.      S.S.       M.S.
Population                    38     6.7964     .1789
Samples in populations       351    13.4488     .0383
Total                        389    20.2452
It is necessary to digress briefly at this point to explain what is meant by admissible sets of the ρij and to justify choosing the points indicated in the first column of Table 9 instead of those given by some more orderly response surface design. It is obvious that some sets of ρij cannot be considered, since they lead to values of the multiple correlation coefficient outside the range zero to unity. The region of admissible values is defined by the restriction that any value of ρ1y must lie in the range

    ρ12 ρ2y ± (1 − ρ12² − ρ2y² + ρ12² ρ2y²)^(1/2) .

Geometrically, this region may be visualized as the space inside a circular cylinder with axis ρ1y, which has been pinched closed at each end and twisted until the ends are at right angles to each other. An orthodox response surface design did not seem to be desirable at this point, since it would not have fitted into such an irregularly shaped space so as to provide points falling into extreme regions.
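The bound just stated is equivalent to requiring the implied 3 × 3 correlation matrix to be non-negative definite, so a candidate point can be screened with one line of arithmetic. A short sketch (numpy assumed; the two test points are our own examples):

    import numpy as np

    def admissible(r12, r1y, r2y):
        """True when rho_1y lies within r12*r2y +/- sqrt((1 - r12**2)(1 - r2y**2)),
        i.e. when the implied trivariate correlation matrix is non-negative definite."""
        half_width = np.sqrt((1.0 - r12 ** 2) * (1.0 - r2y ** 2))
        return abs(r1y - r12 * r2y) <= half_width

    print(admissible(0.25, 0.50, 0.25))   # True  -- the population [.25 .50, .25]
    print(admissible(0.90, 0.10, -0.80))  # False -- outside the "pinched cylinder"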
It was mentioned that among the populations considered, there were no sets having negative values for ρ1y or ρ2y. Such a selection was based on the symmetry which exists in the relationship of the ρij and R², the multiple correlation coefficient. If both ρ1y and ρ2y are negative, R² is the same as if they were positive. If ρ1y and ρ2y are of opposite sign, the value of R² is the same as if the sign of ρ12 were changed and the signs of ρ1y and ρ2y were considered the same. A similar sort of symmetry has been assumed to exist for the efficiencies. Some empirical evidence to indicate the validity of this assumption is offered in Table 11.
Table 11. Values of Ē and their standard errors at symmetrical points in the trivariate population space; censoring (55), N = 20, k = 10

Population [ρ12 ρ1y, ρ2y]        Ē       S.E.(Ē)
[−.20  .50, −.40]              .824      .090
[ .20  .50,  .40]              .869      .021
[−.75  .20,  .30]              .653      .093
[ .75  .20, −.30]              .635      .095
[ .18  .30,  .50]              .897      .038
[ .18 −.30, −.50]              .940      .011
[−.55  .21,  .30]              .805      .094
[ .55  .21, −.30]              .815      .044
[−.55 −.21, −.30]              .854      .054
Differences in mean efficiency of the magnitude indicated above can be explained on the basis of differential censoring. Recall that the dependent variable is not censored, whereas the independent variables are censored (55).

General second degree equations were fitted to the mean values of Ē indicated in Table 9 and resulted in the following equations to describe the response surfaces. For Experiment 2A,
    Ê_A = 1.1019 − .0047 ρ12 − ⋯ − .3608 ρ1y² − ⋯ ,     (4.8)

and for Experiment 2B,

    Ê_B = 1.0566 − .2630 ρ12 + ⋯ − .7245 ρ2y² + ⋯ .     (4.9)
Although the coefficients of these equations are quite different,
the contours which they represent are very similar in the regions
where the sampling was done.
Figures 4 and 5 show contours for Ê_A and Ê_B, respectively. The contour levels of Ê for .7, .8, and .9 are shown at three levels of ρ1y in each diagram. The subspace of admissible ρ12 and ρ2y is bounded by heavy lines, and the shaded areas near these borders indicate that the quadratic models Ê_A and Ê_B fit very poorly in these peripheral regions. For example, an additional population considered, [.65 .30, .90], gave values of Ê_A = .401 and Ê_B = .451. For the thirty-nine points fitted, Ê_A explained about eighty-two per cent of the variation in the values of Ē_A, and Ê_B explained about seventy-eight per cent of the variation in the values of Ē_B. In both cases the variation accounted for by the model was highly significant.
The seventy-eight points of the two experiments were combined to fit the single general quadratic model

    Ê_{A+B} = 1.0793 − .1338ρ12 − .083ρ1y + .0214ρ2y − .5301ρ12² − .4258ρ1y² − .5431ρ2y²
              + .2188ρ12ρ1y + .7506ρ12ρ2y + .2503ρ1yρ2y.                                (4.10)

The analysis of variance for this model is given in Table 12 following.
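The fitting of such a surface is ordinary least squares on a ten-term design matrix. A minimal sketch of the calculation follows (Python with numpy); the first eight points and efficiencies are those of Table 11, while the last four points and their efficiencies are made-up filler included only so that the ten-parameter model is identifiable in this small illustration.

```python
import numpy as np

def quadratic_design(P):
    """Rows of P are (rho12, rho1y, rho2y); columns of the result are the
    terms of the general second degree equation, in the order of (4.10)."""
    r12, r1y, r2y = P[:, 0], P[:, 1], P[:, 2]
    return np.column_stack([np.ones(len(P)), r12, r1y, r2y,
                            r12**2, r1y**2, r2y**2,
                            r12 * r1y, r12 * r2y, r1y * r2y])

# Table 11 points, padded with four hypothetical filler points.
P = np.array([[-.20, .50, -.40], [.20, .50, .40], [-.75, .20, .30],
              [.75, .20, -.30], [.18, .30, .50], [.18, -.30, -.50],
              [-.55, .21, .30], [.55, .21, -.30],
              [.20, .20, .20], [.60, .40, .50], [-.30, .30, .10], [.40, -.20, .30]])
E = np.array([.824, .869, .653, .635, .897, .940, .805, .815,
              .88, .80, .86, .84])          # last four values are illustrative only
coef, *_ = np.linalg.lstsq(quadratic_design(P), E, rcond=None)
E_hat = quadratic_design(P) @ coef           # fitted efficiencies at the design points
print(np.round(coef, 4))
```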
Figure 4. Contours for Ê_A in trivariate population space.
Figure 5. Contours for Ê_B in trivariate population space.
Table 12. Analysis of variance for the model Ê_{A+B} of Experiment 2

Variation due to      d.f.      S.S.      M.S.       F
Model Ê_{A+B}            9     .8339     .0927    14.24
Lack of fit             29     .1887     .0065     3.10
Error^a                 39     .0813     .0021
Total                   77    1.1039

a Failure of duplicates at the same points in Experiments 2A and 2B to agree.
Note the similarity between the estimate of experimental error for Ē given in Table 11 and that determined by duplicates at each point as given in Table 12. The residuals from the single general quadratic are given in Table 9 with the values of Ē at each point. These data and the contour diagrams of Figures 4 and 5 might reasonably be interpreted to indicate that this method of estimation is least suitable when applied to populations in which the zero order correlation coefficients are of unlike magnitude or when their absolute values are large.
The approach used here to indicate how the type of population affects efficiency would be very cumbersome in considering more than three variates. In the four-variate case, six factors are required to characterize a population; hence, the fitting of a general second degree surface would require twenty-eight points, plus those needed to obtain an estimate of error.

An alternative approach to higher variate cases might consist of an extension of the results obtained for the trivariate case. In the four-variate case, for example, we could consider all possible subsets of three correlation coefficients as representing three-variate populations and could get some appraisal of expected efficiency for each of these. Then, if the results for the trivariate case held for four variates, some composite measure of these expected efficiencies would be highly correlated with the efficiency obtained for the four-variate population.
The simplification of this idea resorted to here consisted of considering the four-variate population [ρ12 ρ13 ρ1y, ρ23 ρ2y, ρ3y] in terms of the three trivariate populations made up of all sets of two (of the three) predictors with the predictand; specifically, [ρ12 ρ1y, ρ2y], [ρ13 ρ1y, ρ3y], and [ρ23 ρ2y, ρ3y]. The contention is then that the efficiency of the four-variate population will be related to a combined measure of the efficiencies of the three subsets as determined by the kind of relationship indicated in equation (4.10) above.
Experiment 3 provided some data which, with this approach, confirm the extension of the results for the trivariate case to that for four variates. Table 13 gives the mean efficiencies for six four-variate populations, each censored (555), with the means based on twenty samples. The population correlation matrices for all possible subsets of two predictors with the predictand are indicated for each four-variate population. For these trivariate subset populations, the predicted efficiencies based on Ê_{A+B} were computed and are given in the last column of Table 13. The means of these Ê_{A+B}, denoted avg(Ê_{A+B}), and the Ē for each four-variate population may be seen plotted against each other in Figure 6.
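The combined measure can be written out directly from equation (4.10). The following sketch (Python with numpy) evaluates (4.10) at each trivariate subset of a four-variate population and averages the results; note that the ρ1y coefficient is entered here as −.083 because its last digit is not legible in the source, so the values produced will differ slightly from the Ê_{A+B} column of Table 13.

```python
import numpy as np

# Coefficients of equation (4.10): 1, r12, r1y, r2y, r12^2, r1y^2, r2y^2,
# r12*r1y, r12*r2y, r1y*r2y (the r1y coefficient is truncated in the source).
C410 = np.array([1.0793, -.1338, -.083, .0214, -.5301, -.4258,
                 -.5431, .2188, .7506, .2503])

def e_hat(r12, r1y, r2y):
    x = np.array([1, r12, r1y, r2y, r12**2, r1y**2, r2y**2,
                  r12 * r1y, r12 * r2y, r1y * r2y])
    return float(C410 @ x)

def avg_subset_efficiency(r12, r13, r1y, r23, r2y, r3y):
    """Average of (4.10) over the subsets [r12 r1y, r2y], [r13 r1y, r3y],
    [r23 r2y, r3y] of the four-variate population [r12 r13 r1y, r23 r2y, r3y]."""
    subsets = [(r12, r1y, r2y), (r13, r1y, r3y), (r23, r2y, r3y)]
    return float(np.mean([e_hat(*s) for s in subsets]))

# First population of Table 13: [.20 .29 .30, .59 .40, .51]
print(round(avg_subset_efficiency(.20, .29, .30, .59, .40, .51), 3))
```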
Table 13. Values of Ē estimated by the Method of Paired Correlation Coefficients; Experiment 3, censoring (555), N = 20, k = 20

Population                             Ē      Possible subsets of two       Ê_{A+B}^a
[ρ12 ρ13 ρ1y, ρ23 ρ2y, ρ3y]                   predictors and the
                                              predictand [ρij ρiy, ρjy]

[ .20  .29  .30,  .59  .40,  .51]    .757     [ .20  .30,  .40]              1.008
                                              [ .29  .30,  .51]               .971
                                              [ .59  .40,  .51]               .913

[ .10  .20  .40,  .48  .50,  .60]    .710     [ .10  .40,  .50]               .930
                                              [ .20  .40,  .60]               .915
                                              [ .48  .50,  .60]               .906

[ .60  .80  .29,  .50 -.20,  .20]    .686     [ .60  .29, -.20]               .778
                                              [ .80  .29,  .20]               .741
                                              [ .50 -.20,  .20]              1.026

[ .20  .31  .60,  .25  .50,  .77]     …       [ .20  .60,  .50]               .879
                                              [ .31  .60,  .77]               .814
                                              [ .25  .50,  .77]               .827

[ .80  .90  .30,  .70  .35,  .53]    .342     [ .80  .30,  .35]               .800
                                              [ .90  .30,  .53]               .788
                                              [ .70  .35,  .53]               .882

[-.70 -.52  .32,  .60  .26,  .44]    .271     [-.70  .32,  .26]               .647
                                              [-.52  .32,  .44]               .674
                                              [ .60  .26,  .44]               .923

a These values were determined by substituting the values of ρij, ρiy, and ρjy into equation (4.10) as ρ12, ρ1y, and ρ2y, respectively.
Figure 6. Associated values of Ē and avg(Ê_{A+B}) from the data of Table 13; R² = .78.
Although the evidence here is meager, it appears that the Paired Correlation Method in the four-variate case also yields poor prediction equations for populations with heterogeneously correlated variables. This hypothesis is further supported by the data from the five-variate cases of Experiment 6, which will be discussed presently, and by the data of Experiment 7, which will be summarized in Chapter VIII.

Experiment 6 involved use of the Paired Correlation Method with data simulated to represent two situations, one for which this method would be expected to yield reliable results, and another for which it would not.

A type of population which often arises in biological situations involving variates governed by feedback mechanisms is exemplified by [.60 .80 .29 .40, .50 -.20 .50, .20 .70, .30]. Because of the heterogeneous character of the correlation coefficients, it would not seem advisable to apply the Paired Correlation Method to incomplete samples from such a population.
An example of the type of population frequently encountered in predicting from batteries of tests is [.36 .41 .40 .32, .24 .24 .30, .24 .25, .22]. This correlation matrix is, in fact, based on results of tests administered to trainees in attempting to predict their success in pilot training (Du Bois, 1958). Since, in this case, the correlations between pairs of variates are similar, the Paired Correlation Coefficient Method would seem to be a satisfactory tool to use in fitting a prediction equation where values among the independent variables are missing.

Two hundred samples, censored (3333), were generated to simulate one hundred from each of the two populations above. The Paired Correlation Method was then applied to each sample to determine a prediction equation. Histograms of the resulting values of E from these equations are shown in Figure 7, along with the efficiencies obtained by the use of complete observations. The results rather strikingly confirm the expectations stated above. The mean values of E and Δ from Experiment 6 and their respective standard errors are given in Tables 33A and 33B, which are included among the summary results of Chapter VIII.

In the absence of a priori knowledge of the population from which a fragmentary sample is taken, it may be worthwhile to set confidence limits on the sample correlation coefficients from all existing pairs. This may be of help in deciding whether or not to use the method of this chapter in forming a prediction equation.
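One standard way of setting such limits, assumed here since the thesis does not specify which approximation it used, is Fisher's z-transformation of each pairwise coefficient based on its own number of available pairs. A minimal sketch (Python with numpy):

```python
import numpy as np

def corr_ci95(r, n):
    """Approximate 95% confidence limits for a correlation coefficient,
    using Fisher's z-transformation of r based on n available pairs."""
    z, se = np.arctanh(r), 1.0 / np.sqrt(n - 3)
    return np.tanh(z - 1.96 * se), np.tanh(z + 1.96 * se)

# e.g. the (x1, x2) pair of the Welt data of Chapter VII: r = -.625 on 36 pairs
print(tuple(round(x, 2) for x in corr_ci95(-.625, 36)))
```

The limits so obtained will not reproduce Table 29 of Chapter VII exactly, but they serve the same screening purpose: widely separated intervals of unlike magnitude are a warning against the Paired Correlation Method.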
Figure 7. Histograms of E; Experiment 6, censoring (3333), N = 20, k = 100. (A) Complete observations and (B) Paired Correlation Method for population [.60 .80 .29 .40, .50 -.20 .50, .20 .70, .30]; (C) complete observations and (D) Paired Correlation Method for population [.36 .41 .40 .32, .24 .24 .30, .24 .25, .22].
4.5 Related Methods

It is well known that r is a biased estimate of ρ. Accordingly, one might substitute the approximately unbiased estimate,

    ρ̌ = r [1 + (1 − r²) / (2(N − 1))],

in the correlation matrix and then proceed to find the prediction equation as outlined in section 4.1.

A type of maximum likelihood estimate of ρ is found by maximizing f(r) with respect to ρ. It is approximately

    ρ̂ = r [1 − (1 − r²) / (2(N − 1))].

This estimate of ρ may also be used in forming the correlation matrix to be used in estimating the regression equation.
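Both adjustments are direct transcriptions of the formulas above; a small sketch (Python), assuming r has been computed from n available pairs:

```python
def r_unbiased(r, n):
    """Approximately unbiased estimate of rho from a sample correlation r on n pairs."""
    return r * (1.0 + (1.0 - r * r) / (2.0 * (n - 1.0)))

def r_max_likelihood(r, n):
    """Approximate maximum likelihood estimate of rho obtained by maximizing f(r)."""
    return r * (1.0 - (1.0 - r * r) / (2.0 * (n - 1.0)))

print(r_unbiased(0.60, 20), r_max_likelihood(0.60, 20))
```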
Since neither of these schemes gives regression equations consistent with the least squares estimate of the regression equation when no observations are missing, they have not been investigated here. However, it is interesting to note that in two populations where the method of section 4.1 proved inefficient, neither of the alternatives provided a significant improvement. In the populations [-.69 .30, .40] and [.90 .70, .87] of Experiment 2A, the method of section 4.1 yielded mean efficiencies of .474 and .649, respectively. Use of the unbiased estimates of ρ gave mean efficiencies of .441 and .636, respectively, while substitution of the maximum likelihood estimates yielded efficiencies of .504 and .660, respectively.
CHAPTER V

THE GREENBERG PROCEDURE
5.1 Definition

The method to be discussed in this chapter was first suggested by Greenberg (Institute of Statistics, 1957) in connection with data from a perinatal mortality study in which hospital records were not complete in the detail desired. In a comprehensive analysis of these data, Wells (1959) defined and applied the method.

A more specific notation than that used in previous chapters will facilitate a description of the procedure. Suppose we assume the p variates, x1, x2, ..., x(p-1), y, to be distributed multivariate normally. Consider a sample of N independent observation vectors, having a subset of n complete vectors. A prediction equation for y,

    (5.1)

is determined as follows:

a.  Estimate all missing characters by some reasonable initial value, such as the sample mean of the particular variate from which the character comes. Having done this, one can proceed to find the ordinary least squares solution for predicting y. It may be written

    (5.2)

    where the ⁰xi indicate that both observed values and initial estimates for missing characters were used to find ⁰ŷ by least squares.
b.  Using the same data, determine

    ¹x̂1 = ¹b10 + ¹b12 ⁰x2 + ¹b13 ⁰x3 + ··· ,                              (5.3)

    the least squares equation obtained by taking x1 as the dependent variable and the remaining xi and y as independent variables. Replace the previous estimates for missing values of x1 by those given using equation (5.3), for example, to estimate the value of x1 missing from the ℓth observation.

c.  With the data now including revised estimates for missing values of x1, determine

    ¹x̂2 = ¹b20 + ¹b21 ¹x1 + ¹b23 ⁰x3 + ··· + ¹b2(p-1) ⁰x(p-1) + ¹b2y y      (5.4)

    by ordinary least squares, where x2 has been regarded as the dependent variable and the other variables as independent. Replace the previous estimates of x2 by those given using equation (5.4).
d.  Continue to successively re-estimate the missing xjℓ using

    (5.5)

    where ᵏbji represents the coefficient of ʰxi used in predicting xj for the kth iteration, and ʰxi indicates either observed values of the xi or those estimated on the hth iteration by continuing the procedure outlined above (h = k if j > i, and h = k − 1 if j < i). The successive estimating equations are continued in this way, where m is the number of iterations required for convergence; convergence being defined by the condition ᵐŷ = ᵐ⁻¹ŷ (equation 5.1).
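The logic of steps (a) through (d) can be compressed into a short computational sketch (Python with numpy; this is an illustration of the procedure only, not the original IBM 650 program). Missing characters are marked with np.nan, started at the variate means, and successively re-estimated by regressing each incomplete variate on the remaining variates and y until the coefficients of the prediction equation stabilize.

```python
import numpy as np

def greenberg(X, y, max_iter=200, tol=1e-6):
    """X is N x (p-1) with np.nan for missing characters; y is fully observed.
    Returns the converged prediction-equation coefficients and the completed data."""
    X = X.copy()
    miss = np.isnan(X)
    col_means = np.nanmean(X, axis=0)
    X[miss] = np.take(col_means, np.where(miss)[1])      # step (a): initial estimates

    def lstsq(D, t):                                     # ordinary least squares with intercept
        A = np.column_stack([np.ones(len(D)), D])
        return np.linalg.lstsq(A, t, rcond=None)[0]

    b_old = lstsq(X, y)
    for _ in range(max_iter):
        for j in range(X.shape[1]):                      # steps (b)-(d): re-estimate each variate
            if not miss[:, j].any():
                continue
            D = np.column_stack([np.delete(X, j, axis=1), y])   # remaining x's and y as predictors
            c = lstsq(D, X[:, j])
            pred = np.column_stack([np.ones(len(D)), D]) @ c
            X[miss[:, j], j] = pred[miss[:, j]]
        b_new = lstsq(X, y)                              # prediction equation for y
        if np.max(np.abs(b_new - b_old)) < tol:          # convergence of the b_i
            break
        b_old = b_new
    return b_new, X
```

A call such as greenberg(X, y) returns the final coefficients (b0, b1, ..., b(p-1)) together with the data matrix containing the converged estimates of the missing characters.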
If characters are missing from one variate only, say xj, no iteration is required. In this case, missing entries among the xj are initially estimated by the regression of xj on the remaining variates and y, where the bji and bjy are determined by ordinary least squares regression methods using the n complete observations, regarding xj as the dependent variate. This condition is made clear by considering that when xj is the dependent variate, the criterion used for the best fitting hyperplane (in terms of the remaining variates) consists of minimizing the sum of squares Σℓ (xjℓ − x̂jℓ)². Since no xj exist for the N − n incomplete observation vectors, the hyperplane fitted to all N observations is identical to that for the n complete observations.
5.2 Bias and Variance of Regression Coefficients

Use of the Greenberg Procedure in Experiment 1 yielded biases, b̄i − βi, and variances, V(bi), for the various population-censoring combinations as given in Table 14. These data illustrate the tendency for both bias and variance of the regression coefficients to increase with increased censoring.

The bias in b0 which, from Table 14, appears to be significant or nearly significant in most cases can probably be regarded as a sampling phenomenon. This statement appears reasonable when we recall that corresponding values of Table 1, based on complete observations, also yielded t tests indicating bias. It is known that those coefficients are theoretically unbiased. The remarks in Chapter III in regard to these apparently significant values of b0 may be cited as applicable here also.
Table 14. Statistics of regression coefficients estimated by the Greenberg Procedure; Experiment 1, N = 20, k = 20
--
'"
.~
[.25
,
r .80
.25]
.60
,
.80]
Censoring
'iio-llo
V(bo)
'1\-131
V(~)
V(b )
2
bo-13 o
V(b )
o
'1\-1\
V(~)
(00)
'.113*
.055
.051
.058
-.033
.025
.078*
.026
.060
.052
(ll)
.121*
.049
.082
-.037
.032
.069
-.066
.036
.091*
.027
.009
.073
.032
.053
(22)
.124*
-.047
.107
.081
.010
.041
.096*
.025
-.044
.104
.084
.068
(33)
'ii2-13 2
b2-13 2
V(b )
2
.143*
.054
.102
.104
.029
.042
.112
.025
-.123
.133
.1591-
.087
(44)
.159!t-
.058
.138
.126
.055
.093
.107*
.026
-.211*
.184
.240*
.136
(55)
.222*
.085
.159
.233
.0'11
.181
.lll*
.035
-.300
.264
.303
.192
(66)
.286*
.160
.238
.'113
.006
.539
.164*
.081
-.377*
.489
.3~
.359
(04)
.122*
.O~
.045
.055
.029
.056
.091*
.027
-.062
.066
.ll5
.062
.124
-.005
.029
.084*
.027
-.006
.155
.026
.077
.284*
.086
(40)
.132*
.060
.126
(08)
.095
.062
.022
.097
.016
.163
.080
.035
-.240*
.098
(80)
.120
.072
.230
.326
.059
.052
.078
.029
.049
.352
.~
[.25
'iio13
9
.75
,
.~
.25
,
'1\131
V(~)
'ii2-13 2
V(b )
2
.026
.026
.Oll
.026
-.095
.019
'ii -13
3 3
V(b )
3
(000)
-.104
.016
(lll)
-.171*
.014
-.037
.060
-.045
.034
-.054
.048
(222)
-.177*
.017
-.ll3
.073
-.073
.048
.040
.067
(333)
-.206
.029
-.095
.063
-.210
.189
.205
.156
(444)
-.166*
.015
-.222
.057
' -.264
.214
.330
.298
(555)
-.491
.261
-.047
.568
-.756
1.348
.435
.774
*
'ii. - 13 i
J.S'ij
a TI:Je symbol
indicates
> t .05'
i
b For this tour-variate case, k .. 5.
k - 1,
.161
.25]b
V(b)
0
-.037
V(b )
i
where st... - k - ' and k is the number ot samples on which the b i " are based.
J.
VI
In addition, it may be noted that there are no significant t values for the biases, b̄0 − β0, in Table 15, which shows the biases and variances of the regression coefficients from Experiment 4, where samples from several different populations were censored (55). Significant biases in b1 and b2 are indicated in Table 14 at the higher levels of censoring in population [.80 .60, .80]. From Table 15, it can be seen that, although the Greenberg Procedure does not generally produce biased regression coefficients, more than the expected number of coefficients, b1 and b2, in Experiment 4 fail to meet the criterion for unbiasedness. From the limited amount of data available, it is not possible to determine the exact nature of the bias. There does not seem to be a pronounced relationship between bias and population, as is indicated by the disparity between results obtained for the same populations in Experiments 4A and 4B.

In both Experiment 1 and Experiment 4, estimates of the regression coefficients were made using complete observations as well as by the Greenberg Procedure. It has been mentioned that coefficients determined in this way are unbiased and consequently provide a kind of standard with which one can compare the Greenberg Procedure. Except for the significant values of b̄0 − β0, corresponding to those in Table 14, none of the biases of regression coefficients calculated from complete observations were found to be significant. This suggests that the Greenberg Procedure may produce biased estimates of β1 and β2.

Experiment 7 provides us with still more data which we may examine in commenting on the bias and variance of regression coefficients obtained by use of the Greenberg Procedure. The one hundred samples from the population [-.70 -.52 .32, .60 .26, .44], censored (333), yielded estimates of regression coefficients whose biases and variances are shown in the third column of Table 16.
Table 15. Statistics of regression coefficients estimated by the Greenberg Procedure; Experiment 4, censoring (55), N = 20, k = 10
Population
_
~eriment4A
A
bo-(30 V (bo)
[-.2
[-.2
[-.2
[-.2
[ .6
[ .6
[ .6
[ .6
[ .873
[-.473
[ .2
[ .2
[ .2
[ .2
[ .2
.6 ,
.6 ,
-.2 ,
-.2 ,
.6 ,
.6 ,
-.2 ,
-.2 ,
.2 ,
.2 ,
.873,
-.473,
.2 ,
.2 ,
.2 ,
.6 ]
-.2 ]
-.2 ]
-.6 - ]
.6 ]
-.2 ]
-.2 1
.6 ]
.2 ]
.2 ]
.2 ]
.2 J
.873]
- .473]
.2 ]
.069
.101
.090
.148
.146
.030
.128
.076
.165
.147
.072
.098
.102
.077
.171
.016
.068
.088
.081
.064
.008
.127
.023
.159
.114
.036
.091
.031
.045
.074
b1-(31
V(b:I)!
1)2-(32
j
I
V(b 2) I bo-(30
.440 I .043
Exoeriment 4B
V(bo)
br-(31
V(lJ.)
1)2-(32 V(b2)
I
I
I
.116
.015
.037
.107
.120
.072
•345 1 .133
-.251
.035
.457
.536 iI -.036
.180 1.303 : -.078
....159
.314
-.2!:4
.449! .120
.528 :, .033
.002
.080
.107! -.19B*- .072
.665; .210' 1.037! -.068
-.299
-.259* .•032 i .233* .063 ; .043
,
-.574', 2.763
.174
.610' 3 .221
.008
.300
.539 I .037
.~3
.028
.028
.052
.209 ; .007
-.366*- .179
.230
.229 ! -.033
-.140' .108
.10B*- .036 : .092
.219 . -.018
-.195
.374 -.164'
.100
-.133
.413 ' .053
.369
I
I
I
.014
.116
.060
.090
.036
.010
.165
.013
.239
.073
.017
.067
.022
.019
.040
.063
.403*
-.116'
.022
.002
.130*
.029'
-.138
.043
..307
.042
-.112
-.068
.011
-
.013
1
.179\ -.245
.128 , -.179
.255\ .262
.239\ .001
.014. -.166*
.5621 - .071'
I
.0!:4 j .085
3.059' - .117
.380
.279
.034 -.062
.105 -.004
.106
.092
.246* .0411-·22~
.18S'
;,873
i
I
.380
.011
.496
.277
.142
.292
.021
.748
.079
3.677
.278
.074
.1!:4
.033
.063
.970
...
* Indicates
iii
~ Pi
~.
J.
>
t.0 5, 9 d.!.
2
where
s-si
==
'" (bi )
V
--w
\Tl
\Tl
Histograms of the coefficients obtained using the Greenberg Procedure are given in Figure 8. For this particular case, with evidence available from a large number of samples, all the regression coefficients except b0 are clearly shown to be biased.

The tabulated variances of the regression coefficients call for only brief comment. In Tables 14 and 15, marked differences may be noted in the variances for different populations, just as one would expect. For example, the variances in the regression coefficients are large for population [.8728 .2, .2], which has a large covariance between the independent variables and a large error of estimate. On the other hand, the variances of regression coefficients are small for population [-.2 .6, .6], where the covariance between the independent variables and the error of estimate are both small. The tendency for the V̂(bi) to increase with increased censoring in Table 14 was also noted in discussing similar data calculated by other methods.
5.3 Estimate of Error

An estimate of error, s²y.x, may be calculated in the usual way after estimating the missing values and determining the regression equation by the Greenberg Procedure. Such a statistic grossly underestimates the true variance, since the degrees of freedom used in its computation are based on some values estimated from the data.

In similar cases it has been possible to estimate "effective degrees of freedom" based on the comparison of like moments. This approach is not applicable here, however, since the mathematical form of the distribution or its moments is not known.
Figure 8. Histograms of regression coefficients estimated by the Greenberg Procedure; population [-.70 -.52 .32, .60 .26, .44], censoring (333), N = 20, k = 100.
A simple rule of thumb sufficient to produce reasonable estimates of σ²y.x is given by defining "effective degrees of freedom" for error as N − p − c/(p−1), where c is the total number of missing variate values among the p−1 independent variates. The resulting "pseudo-variance" then is

    s²*y.x = (N − p) s²y.x / (N − p − c/(p−1)),                              (5.6)

where s²y.x is calculated in the usual way assuming that the estimated characters were observed values. The rationale for choosing N − p − c/(p−1) as "effective degrees of freedom" is that the proportion of the independent variate values estimated is c/[N(p−1)].
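In computational form, (5.6) is simply the residual sum of squares from the completed data divided by the effective degrees of freedom. A minimal sketch follows (Python); the residual sum of squares shown in the example call is a made-up value used only to illustrate the arithmetic.

```python
def pseudo_variance(residual_ss, N, p, c):
    """Pseudo-variance of equation (5.6): the residual sum of squares from the
    completed data divided by the effective degrees of freedom N - p - c/(p-1),
    where c is the number of missing independent-variate values.  This equals
    (N - p) * s2 / (N - p - c/(p-1)) when s2 is computed as if the imputed
    characters had been observed."""
    eff_df = N - p - c / (p - 1)
    return residual_ss / eff_df

# e.g. a trivariate sample (p = 3) of N = 20 with c = 10 missing x-values
print(pseudo_variance(residual_ss=3.4, N=20, p=3, c=10))   # illustrative numbers
```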
To illustrate the effectiveness of equation (5.6) as a measure of variance, we will consider the "pseudo-variances" arising from the use of the Greenberg Procedure in Experiment 4B. The mean of the "pseudo-variances" for each of the fifteen populations is plotted against the true variance, σ²y.x, in Figure 9A. A weighted least squares line, assumed to go through the origin, was fitted to these points and gave σ²y.x = 1.05 s²*y.x. The slope is not significantly different from unity, although values of s²*y.x still appear to be underestimates for low values of σ²y.x. Differences between σ²y.x and s²*y.x can be accounted for to some extent by sampling variation. This is evinced by comparing Figure 9A with Figure 9B, where the values of s²y.x, calculated from the uncensored samples, are plotted against the values of σ²y.x.
Figure 9A. Mean pseudo-variance of the Greenberg Procedure plotted against the true variance; from the data of Experiment 4B.

Figure 9B. Mean variance estimated from uncensored data plotted against the true variance; from the data of Experiment 4B.
Additional evidence concerning the use of s²*y.x as a measure of residual error is given by considering its sampling distribution for the data from population [-.70 -.52 .32, .60 .26, .44] when samples are censored (333). Figure 10C shows a histogram for values of s²*y.x based on one hundred such samples. Figure 10A represents the sampling distribution for errors of estimate from corresponding uncensored samples, and Figure 10B shows the sampling distribution for errors of estimate based on complete observations. Since f_e = 13 here, the histogram of Figure 10C would be expected to assume a form intermediate in shape between that given by Figure 10A (theoretically proportional to χ² with 16 degrees of freedom) and that shown in Figure 10B (theoretically proportional to χ² with 7 degrees of freedom). Thus, it can be seen that s²*y.x underestimates σ²y.x here. This observation supports a statement made in the preceding paragraph, namely, that s²*y.x appears to underestimate σ²y.x for low values of σ²y.x. For the four-variate population from which the one hundred samples were selected, σ²y.x = .20. It must be pointed out that this apparent underestimation may not occur for higher values of σ²y.x.
5.4 Efficiency of Prediction Equations

Table 17A gives the means of E and their standard errors for the various population-censoring schemes of Experiment 1. Table 17B shows the corresponding means of Δ and their standard errors. There does not appear to be a population effect influencing the efficiencies here, as was the case with the Paired Correlation Coefficient Method. A further evaluation of the Greenberg Procedure was considered desirable, however, so Experiment 4 was set up to evaluate the method over a wider range of populations and a larger set of random samples.
Figure 10. Histograms of estimated error variances; Experiment 7, population [-.70 -.52 .32, .60 .26, .44], censoring (333), N = 20, k = 100. (A) No censoring; (B) complete observations; (C) pseudo-variance, Greenberg Procedure; (D) Paired Correlation Coefficients. The symbol + indicates a value outside the range of the scale.
Table 17A. Values of Ē and their standard errors estimated by the Greenberg Procedure; Experiment 1, N = 20, k = 20

              [.25 .50, .25]         [.80 .60, .80]
Censoring      Ē      S.E.(Ē)         Ē      S.E.(Ē)
(11)          .976     .009          .978     .007
(22)          .956     .011          .960     .011
(33)          .939     .013          .910     .021
(44)          .894     .023          .836     .030
(55)          .810     .041          .764     .041
(66)          .614     .047          .664     .051
(04)          .967     .009          .944     .014
(40)          .956     .012          .953     .012
(08)          .875     .036          .845     .027
(80)          .825     .032          .860     .033

              [.25 .75, .25 .50, .25]
Censoring      Ē      S.E.(Ē)
(111)         .949     .018
(222)         .858     .032
(333)         .736     .129
(444)         .627     .136
(555)         .445     .159

Table 17B. Values of Δ̄ and their standard errors estimated by the Greenberg Procedure; Experiment 1, N = 20, k = 20

              [.25 .50, .25]         [.80 .60, .80]
Censoring      Δ̄      S.E.(Δ̄)         Δ̄      S.E.(Δ̄)
(00)          .303     .028          .211     .020
(11)          .330     .028          .239     .020
(22)          .351     .028          .256     .022
(33)          .386     .033          .291     .024
(44)          .447     .036          .343     .024
(55)          .564     .055          .401     .029
(66)          .833     .099          .518     .062
(04)          .347     .027          .253     .019
(40)          .379     .036          .262     .026
(08)          .431     .037          .332     .021
(80)          .430     .069          .345     .044

              [.25 .75, .25 .50, .25]
Censoring      Δ̄      S.E.(Δ̄)
(000)         .222     .012
(111)         .324     .017
(222)         .365     .019
(333)         .482     .097
(444)         .576     .117
(555)        1.114      …
Since the Greenberg Procedure is quite time consuming, only a limited number of populations could be sampled. And, just as in Chapter IV, the coordinates of the points were required to be admissible sets of ρij. The configuration of points chosen was a central composite type of response surface design selected so that the extreme points would fall within the region of admissible values. As mentioned in Chapter II, Experiment 4 was composed of two parts: part A, based on samples from different populations generated from the same set of original deviates, and part B, based on different sets of original samples at each point. The points chosen and the means of efficiencies with their standard errors at these points are shown in Table 18. Mean efficiency at each point was based on ten samples.
Table 18. Values of Ē and their standard errors estimated by the Greenberg Procedure; Experiment 4, censoring (55), N = 20, k = 10

                                  Experiment 4A          Experiment 4B
Population                        Ē       S.E.(Ē)        Ē       S.E.(Ē)
[-.2      .6    ,  .6    ]       .710      .074         .810      .044
[-.2      .6    , -.2    ]       .772      .067         .766      .081
[-.2     -.2    , -.2    ]       .716      .083         .808      .039
[-.2     -.2    ,  .6    ]       .720      .081         .820      .053
[ .6      .6    ,  .6    ]       .790      .086         .820      .061
[ .6      .6    , -.2    ]       .716      .056         .837      .032
[ .6     -.2    , -.2    ]       .765      .080         .783      .042
[ .6     -.2    ,  .6    ]       .714      .069         .769      .040
[ .8728   .2    ,  .2    ]       .711      .082         .727      .085
[-.4728   .2    ,  .2    ]       .712      .084         .812      .037
[ .2      .8728 ,  .2    ]       .692      .073         .840      .039
[ .2     -.4728 ,  .2    ]       .774      .064         .830      .035
[ .2      .2    ,  .8728 ]       .764      .060         .798      .059
[ .2      .2    , -.4728 ]       .765      .061         .822      .035
[ .2      .2    ,  .2    ]       .777      .072         .727      .085
The use of the same set of original samples at each point in Experiment 4A has the same effect, although not as marked, as that which was noted in Experiment 2. This effect is that the mean square error for part A becomes smaller than that for part B after differences among samples have been accounted for in the analysis of variance of E. This outcome is seen in Table 19 below.
Table 19. Analyses of variance for values of E from the data of Experiment 4

Variation due to          d.f.      S.S.      M.S.

Experiment 4A
Population (P)              14      .1246     .0089
Samples (S)                  9     4.1677     .4631
P x S                      126     3.1720     .0252
Total                      149     7.4643
s²_Ē = .0025

Experiment 4B
Population                  14      .1886     .0135
Samples in populations     135     3.9861     .0295
Total                      149     4.1747
s²_Ē = .0030
The general second degree equation fitted to the data of Experiment 4A yielded the response surface described by

    Ê_A = .7718 + .0658ρ12 + .0268ρ1y − .0113ρ2y − .1458ρ12² − .0982ρ1y² − ··· ,

and Experiment 4B gave the corresponding fitted surface (5.6). The analyses of variance for these fitted surfaces are given in Table 20.
Table 20. Analyses of variance for models fitted in Experiment 4

Variation due to      d.f.      S.S.      M.S.       F

Experiment 4A
Model Ê_A                9     .0065     .0007      <1
Error                    5     .0059     .0012
Total                   14     .0124

Experiment 4B
Model Ê_B                9     .0156     .0017     2.65 N.S.
Error                    5     .0033     .0007
Total                   14     .0189
The variation accounted for by the models Ê_A and Ê_B is not significant. Thus, it does not appear that, in trivariate cases, the population sampled affects the efficiency of prediction equations estimated by the Greenberg Procedure. These analyses confirm the findings from Experiment 1.

Experiment 7, which was mentioned in section 5.2, lends some additional information on the efficiency of the Greenberg Procedure. From the four-variate population [-.70 -.52 .32, .60 .26, .44], one hundred samples, censored (333), were selected and regression equations estimated by various methods. Histograms of the resulting sampling distributions of E and Δ are shown in Figures 11C and 12C, respectively. This particular case exemplifies the sort of missing data situation to which the Greenberg Procedure might be advantageously applied if there were some basis for not wanting to discard the fragmentary observations.

The computing time required in using the Greenberg Procedure for higher variate cases precluded extensive sampling work in this area. However, it was necessary to ascertain that such computations were possible. In addition to this, the results of the four-variate work in Experiment 7 were encouraging enough to warrant applying the Greenberg Procedure to some of the five-variate samples of Experiment 6. Consequently, Experiment 8 was carried out with twenty samples, the results of which gave the efficiencies E and Δ shown in Table 21. The values of the statistics given by using the other methods are also presented for comparison. In population [.60 .80 .29 .40, .50 -.20 .50, .20 .70, .30], the Greenberg Procedure yielded better results than were obtained with the other two methods. This was true for eight of the ten samples considered in comparing it with the use of complete observations, and for nine of the ten samples considered in comparing it with the Paired Correlation Method. In population [.36 .40 .41 .32, .24 .24 .30, .24 .25, .22], there appeared to be little difference between the Greenberg Procedure and the use of complete observations, while the Paired Correlation Method yielded significantly better results than either of these.
Figure 12. Histograms of Δ; Experiment 7, population [-.70 -.52 .32, .60 .26, .44], censoring (333), N = 20, k = 100. (A) Uncensored data; (B) complete observations; (C) Greenberg Procedure; (D) Paired Correlation Method. The symbol + indicates a value outside the range of the scale.
Figure 11. Histograms of E; Experiment 7, population [-.70 -.52 .32, .60 .26, .44], censoring (333), N = 20, k = 100. (A) Complete observations; (B) Greenberg Procedure; (C) Paired Correlation Method.
Table 21. Values of E and Δ from the twenty samples of Experiment 8; censoring (3333), N = 20
Population [.60 .80 .29 .40, .50 -.20 .50, .20 .70, .30]
E
A
~~-(h.eenl::lerg
Comp1et.eu _.u--·Paii'ed
Complete
Paired
GreeJ;loorg
Observations Correlation Procedure
Observations Correlation Procedure
..
Sample
Number
.681
.817
.634
.041
.168
.415
.704
.496
91
92
93
94
95
96
97
98
99
.~9
.612
100
Mean -- ..
Std. EtTer ,
"
105
106
107
108
109
110
III
ll2
113
Mean .
t
Std. Error
.~8
.077
.865
.036
.773
.539
.878
.796
.270
.726
.710
• 2
.601
.089
.211
.381
.676
.358
.283
1.414
2.144
.464
.228
.940
.496
.721
.856
.405
.640
.388
.535
.303 .
.870
.409
.1589
.397
.533
.618
.648
.596
.440
.086
.066
.181
.110
Population [.36 .40 .41 .32, .24 .24 .30, .24 .25, .22]
.802
.871
.897
.935
.282
.868
.936
3.297
.869
.448
.312
.746
1.096
.878
.981
.555
.848
.852
.794
.950
.!:46
.705
.326
.774
1.258
.386
.707
.357
.655
.398
.974
.353
.781
.644
.423
.953
8
1.22
.24
.~6
1.086
.664
.•889
.261
.028
.076
.077
.487
.319
.094
.676
.835
.165
.089
.587
.753
.391
.
.898
.356
.756
.743
.317
.475
.835
.
.158
.490
.2~
.405
.510
.379
.378
.446
.129
.410
.356
.042
1.093
.291
.4~
.625
1.009
1.424
1.173
1.ll2
.632
1.000
.881
.114
Uncensored
nata
.119
.325
.152
.227
.175
.195
.301
.199
.188
.179
.206
.020
.803
.331
.333
.538
.721
.244
.390
.331
.354
.214
.426
.063
-.J
0
5.5 Convergence

In describing the Greenberg Procedure, convergence was defined by the condition that the respective bi of the prediction equation be equal on two successive iterations. Except for differences caused by rounding, this is tantamount to the requirement that successive estimates of missing values be equal. In view of the fact that several of these estimates must converge simultaneously, it is not surprising that a large number of iterations are required in certain cases.

In Experiment 1, the Greenberg Procedure was applied to twenty random samples from each of two trivariate populations which were censored (11), (22), (33), (44), (55), (66), respectively. The mean number of iterations required for each of these censoring schemes, in the order indicated, is 5.0, 6.7, 10.3, 13.6, 17.1, and 24.2, respectively. An analysis of variance for these data indicated that censoring was the outstanding factor affecting speed of convergence.

The same sort of relationship between intensity of censoring and number of iterations holds for higher variate cases. For example, in the four-variate case of Experiment 1 with censoring (222), which is comparable to (33) above, the mean number of iterations required for convergence was 11.8. Likewise, in censoring (444), comparable to (66) above, an average of 28.4 iterations was required.

In Experiment 8, five-variate samples were generated from each of two populations. Ten different original samples, censored (3333), were evaluated for each. The mean number of iterations per sample required for convergence was 27.6 for one population, and 23.7 for the other. The standard errors were 8.9 and 8.2, respectively.
A reasonable question to ask at this point is whether or not the number of iterations required for convergence depends more on the number or the proportion of observations missing. An answer to this question is indicated by the results from Experiment 5, a 2 x 3 factorial involving samples of varying size from the population [.6 -.2, .6]. Estimation using the Greenberg Procedure was carried out, using five randomly selected samples for each combination, where 6, 10, and 16 values were censored, representing proportions of twenty and fifty percent of the total number of observation vectors. The sample size, N, and the mean number of iterations, m̄, required for convergence are shown in Table 22 below. The time required per iteration is also given for the various sample sizes.
Table 22. Mean number of iterations required for convergence^a in sampling from [.6 -.2, .6]; k = 5

                         Proportion of observation vectors censored
                         .2                                .5
Censoring     N     m̄     s_m̄    seconds per     N      m̄      s_m̄    seconds per
                                  iteration                            iteration
(33)         30    5.2    1.30       37         12    18.0    5.79        22
(55)         50    4.4     .55        …         20    10.3    1.34        31
(88)         80    4.4     .55       79         32    10.6    2.79        43

a In this experiment the criterion for convergence was taken as agreement of the respective regression coefficients to three decimal places on successive iterations.
The most striking feature of Table 22 is that the number of iterations required seems to depend almost entirely on the proportion of data missing, rather than the amount. It should be noted, however, that for both proportions, a trifle more iteration was necessary at the lower levels than at the higher levels. Although this phenomenon could easily be explained by sampling variation (note the s_m̄), one is led to speculate that a certain amount of complete data is necessary before the rate of convergence for a given proportion becomes constant.

The data from Experiment 1 indicate that the difference in speed of convergence between populations is very small. Experiment 4B affords more illuminating data in this connection. In that experiment, the Greenberg Procedure was applied to 150 samples of independent normal deviates grouped into fifteen sets, each of which was used to generate ten multivariate normal samples representing a different population. The number of iterations required for convergence in these 150 cases was used to test the hypothesis that population affects the number of iterations necessary for convergence. The analysis of variance shown in Table 23 reveals the unimportance of population as an effect influencing convergence.
Table 23. Analysis of variance for number of iterations required for convergence in Experiment 4B; N = 20, k = 10

Variation due to            d.f.      S.S.      M.S.      F
Population                    14     871.4      62.2     1.0
Samples in populations       135    8421.4      62.4
Total                        149    9292.8
The initial estimates of missing values also seem to be of relatively minor importance. In section 5.1, it was stated that missing characters would initially be estimated by some "reasonable" value. Three kinds of initial estimates were tried in a preliminary investigation using trivariate samples of size twenty from Carver's Anthropometric Data (1954). In each of seven samples, ten of the forty characters among the independent variates were censored. The Greenberg Procedure was carried out, using as initial estimates the sample variate mean; zero; and a bivariate regression estimate, x̂iℓ = a + b yℓ, where a and b were estimated by least squares, using all available pairs of xi and y. The number of iterations required using these initial estimates is given in Table 24.
Table 24. Iterations required for convergence using the Greenberg Procedure on Carver's data with different initial estimates of missing values

                          Initial estimates by
Sample      Sample variate      Bivariate
Number          means           regression      Zero
  1              11                 5            13
  2              10                 8            21
  3              11                 5            22
  4              14                11            22
  5               6                 7            35
  6               5                 4            24
  7               5                 4            49
Zero as an initial estimate is clearly inferior. Since the true means of the variates, x1 and x2, were 139 and 282, respectively, the use of zero as an initial estimate could hardly be regarded as reasonable. There is little basis for preference between the sample variate mean and the bivariate regression estimate as initial values, except that the variate means are simpler to calculate.
In using the Greenberg Procedure with artificially generated samples, initial estimates were made using the true mean, the true regression equation, and the random character actually designated as missing. Obviously such estimates could not be used in practice, but their use shortened convergence time somewhat in the sampling work.

In discussing convergence time, we have talked in terms of the number of iterations required for convergence. It must also be remembered that the time per iteration increases as the number of variates increases. In adding an additional variate, another matrix must be inverted with each iteration, and the order of the matrices to be inverted increases by one. Inversion time is roughly proportional to the square of the dimension of the matrix involved. The time required for one iteration by the IBM 650, with N = 20, is approximately thirty seconds for the three-variate case, seventy seconds for the four-variate case, and about one hundred and fifty seconds for the five-variate case.

The distribution of the number of iterations is obviously skewed and is indicated by the histogram of Figure 13 for the work done using the Greenberg Procedure in Experiment 7.
Figure 13. Histogram of number of iterations required for convergence; population [-.70 -.52 .32, .60 .26, .44], censoring (333), N = 20, k = 100.
e
e
The regression ooeffioients in the three-variate oase usually oonverge monotonioally, almost invariably so after the first two or three
iterations.
As the number of variates inorease, the pattern of oonver-
genoe beoomes a bit more erratio. Examples of the oonvergenoe ourves
for two five-variate oases from EJePerirnent 8 oan be seen in Figures 14
and 15.
•
e
e
Figure 14. Convergence curves applying the Greenberg Procedure to the five-variate sample, number 107, censored (3333).

Figure 15. Convergence curves applying the Greenberg Procedure to the five-variate sample, number 106, censored (3333); converges after 33 iterations.
CHAPTER VI

MAXIMUM LIKELIHOOD

6.1 Discussion

Application of the principle of maximum likelihood to the problem discussed in this thesis is practical for only a few special cases. For the most part, the mathematics involved in the maximizing process becomes extremely unwieldy.

One case in the literature which is dealt with in detail is that in which values are missing from only one of the variables in a sample from a trivariate normal population (Edgett, 1956 and Anderson, 1957). Using the notation indicated for data of the form given on page 2, Edgett sets up the likelihood function

    ∏(ℓ=1..n) f_ℓ(x1, x2, x3 | μ1, μ2, μ3; σ1², σ2², σ3²; ρ12, ρ13, ρ23)
        × ∏(ℓ=n+1..N) f_ℓ(x1, x3 | μ1, μ3; σ1², σ3²; ρ13)                    (6.1)

and differentiates with respect to the μi and the ρij. Then, expressing the ρij in terms of the σij, he proceeds through some cumbersome algebra and indicates the solutions for the μi and σij in a series of formulae. A thorough examination of Edgett's paper is most enlightening in that it points up the difficulties of extending the maximum likelihood method to more general cases for the problem at hand.
The approach used by Anderson in solving the same problem is worthy of mention here by virtue of its simplicity and the fact that it can be used to get maximum likelihood estimates easily for certain other special cases. Considering the same form of data referred to in the preceding paragraph, Anderson expressed the likelihood function (6.1) as

    ∏(ℓ=1..N) f_ℓ(x1, x3 | μ1, μ3; σ1², σ3²; ρ13)
        × ∏(ℓ=1..n) f_ℓ(x2 | β20 + β21.3 x1 + β23.1 x3; σ²2.13)              (6.2)

where

    β20 = μ2 − β21.3 μ1 − β23.1 μ3
    σ1² β21.3 + σ1 σ3 ρ13 β23.1 = σ1 σ2 ρ12
    σ1 σ3 ρ13 β21.3 + σ3² β23.1 = σ3 σ2 ρ23                                  (6.3)
    σ²2.13 = σ2² − (β21.3² σ1² + β23.1² σ3² + 2 β21.3 β23.1 σ1 σ3 ρ13).

If any other values of the parameters give a higher value to (6.1), they would give a higher value to (6.2), with the β's and σ²2.13 given by (6.3). The maximum likelihood estimates of μ1, μ3, σ1², σ3², and ρ13 are based on all N observations, while the estimates of the regression parameters for x2, given x1 and x3, are computed from the complete observations only. The values of μ2, σ2², ρ12, and ρ23 are then found from the relations (6.3).
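The stages of this calculation are few enough to set out numerically. The sketch below (Python with numpy) assumes x2 is the incomplete variate, observed only on the first n cases; it follows the factorization just described and the relations (6.3), and is offered as an illustration rather than as any program used in the thesis.

```python
import numpy as np

def ml_trivariate(x1, x2, x3, n):
    """x1, x3 observed for all N cases; x2 observed only for the first n.
    Returns ML estimates of the means, variances and correlations by
    Anderson's factorization, using relations (6.3)."""
    m1, m3 = x1.mean(), x3.mean()
    v1, v3 = x1.var(), x3.var()                     # ML variances (divisor N)
    r13 = np.corrcoef(x1, x3)[0, 1]
    # Regression of x2 on x1, x3 from the n complete observations.
    A = np.column_stack([np.ones(n), x1[:n], x3[:n]])
    b0, b21_3, b23_1 = np.linalg.lstsq(A, x2[:n], rcond=None)[0]
    resid = x2[:n] - A @ np.array([b0, b21_3, b23_1])
    s2_2_13 = resid @ resid / n                     # ML residual variance
    # Relations (6.3): recover the x2 parameters.
    m2 = b0 + b21_3 * m1 + b23_1 * m3
    cov12 = b21_3 * v1 + b23_1 * np.sqrt(v1 * v3) * r13
    cov23 = b21_3 * np.sqrt(v1 * v3) * r13 + b23_1 * v3
    v2 = s2_2_13 + b21_3**2 * v1 + b23_1**2 * v3 + 2 * b21_3 * b23_1 * np.sqrt(v1 * v3) * r13
    return dict(m1=m1, m2=m2, m3=m3, v1=v1, v2=v2, v3=v3,
                r12=cov12 / np.sqrt(v1 * v2), r13=r13, r23=cov23 / np.sqrt(v2 * v3))
```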
If we consider x3 = y as the dependent variate, estimates of the coefficients for the prediction of y from x1 and x2 may be calculated by formulae similar to (6.3). The results are identical with those of Edgett and, in addition to being more easily derived, are more simply expressible. Another example of this approach will be considered in discussing the example of Chapter VII. It should be noted, however, that this sort of manipulation does not provide a general solution. For example, it is not applicable to the trivariate cases we have considered in previous chapters in which censoring was of the form (c1 c2). The trivariate cases of Experiment 1, in which observations were missing from one variable only, have been considered using the maximum likelihood estimates described above.
6.2 Bias and Variance of Regression Coefficients

Table 25 indicates the biases and variances of the regression coefficients for the cases of Experiment 1 where maximum likelihood estimation was carried out as described above. As was the case in Chapters III, IV, and V, only the biases of the constant coefficient are significant by the t criterion, and this has been explained as a sampling phenomenon. As one would expect, there is a tendency for the variance of the coefficients to be higher when eight values are censored than when only four are absent.

6.3 Estimate of Error

It is well known that maximum likelihood estimates of variance are biased. In the case of the estimates of σ²y.x above, three parameters (the βi) were estimated from twenty (fragmentary) observations, so that an estimate corrected for this bias would be obtained by using (20/17) σ̂²y.x. There will also be some bias due to the fragmentary character of the data. A rough adjustment for both types of bias might consist of multiplying the maximum likelihood estimate by a factor

    k = N / (N − p − c/(p−1)),

where c equals the total number of missing characters among the p−1 independent variates.
Table 25. Statistics of regression coefficients estimated by maximum likelihood; Experiment 1, N = 20, k = 20
[.25
"V(b0 )
bl-~l
V(b1 )
.049
.131*
.096'
.051
.063
.061
.042
.053
.103
.076
(80)
.115
.073
.070
.208
a The symbol
* indicates t
.049
=
b. - ~
J.
as·i
.00,
_
A
.118*a
[.80
.25]
_
censoring bo -~ 0
(04)
(40)
. (08)
.5),
= 20
i
.80J
A
_
A
-~ 0V(b)
b2-(32 f(b 2 ) b0
bl-~l V(b1 ) b2-~2 V(b 2 )
0
.002
.009
-.033
-.032
>
t
.037
.029
.096
.051.
.029
.026
.062
.006
.057
.083* .027
.070· .036
.022
.107
-.001
-.039
.096
.027
.065
.186
.045
-.054
.057
.081
.087*
.078*
,." 19, where
.0.:;)
as2
i
=
.085
V(b i )
20
co
ro
83
e
e
The mean values and variances of errors of estimate, calculated from the twenty samples for each case in Experiment 1, are presented in Table 26. The mean values multiplied by the correction factor, k, above are also given.
Table 26. Means and variances of σ̂²y.x estimated by maximum likelihood

              [.25 .50, .25], σ²y.x = .7333          [.80 .60, .80], σ²y.x = .3556
Censoring    σ̂̄²y.x    V(σ̂²y.x)    k σ̂̄²y.x            σ̂̄²y.x    V(σ̂²y.x)    k σ̂̄²y.x
(04)          .617      .063        .741              .296      .012        .355
(40)          .622      .069        .746              .294      .012        .353
(08)           …        .061        .745              .284      .009        .350
(80)          .562      .080        .692              .280      .012        .345
6.4 Efficiency of Prediction Equations

Means of E and their standard errors are indicated in Table 27A, and the means of Δ and their associated standard errors are shown in Table 27B. These tables do not indicate marked differences between the two populations and show an expected decrease in efficiency with increased censoring. A comparison of the efficiencies obtained using maximum likelihood with those obtained by using other methods is discussed in Chapter VIII.
Table 27A. Means of E and their standard errors estimated by maximum likelihood; Experiment 1, N = 20, k = 20

              [.25 .50, .25]         [.80 .60, .80]
Censoring      Ē      S.E.(Ē)         Ē      S.E.(Ē)
(04)          .981     .005          .970     .008
(40)          .969     .008          .973     .008
(08)          .918     .029          .930     .014
(80)           …       .027          .927     .024

Table 27B. Means of Δ and their standard errors estimated by maximum likelihood; Experiment 1, N = 20, k = 20

              [.25 .50, .25]         [.80 .60, .80]
Censoring      Δ̄      S.E.(Δ̄)         Δ̄      S.E.(Δ̄)
(04)          .325     .025          .237     .020
(40)          .354     .034          .241     .023
(08)          .369     .032          .278     .017
(80)          .424     .053          .280     .035
CHAPTER VII

AN EXAMPLE

7.1 The Data

Difficulties involved in the determination of muscle potassium severely limit the physician in prescribing proper therapy for patients suffering a depletion of this ion. The actual measurement of muscle potassium requires the use of a radioactive isotope, which is an expensive, exacting, and time consuming procedure.

Working with laboratory animals, Welt and his associates [1958] recently carried out studies aimed at developing methods to facilitate measurement of muscle potassium in humans. Their approach involved predicting the concentration of muscle potassium from a number of other more easily obtainable measurements. The ability to determine the concentration of serum potassium, serum carbon dioxide, and red cell potassium from the analysis of a single blood sample provided such measurements. Data available for such determinations on a series of fifty-four rats are given in Table 28. The animals were sacrificed to obtain the associated values of muscle potassium.

It will be noted that determinations for serum carbon dioxide (x2) and red cell potassium (x3) are not available for every animal. Thus the question of methodology in forming an appropriate prediction equation arises.
86
e
e
Table 28;. Bioohemioal determinations from Wdt data
Rat
Number
Serum K
(mH./L.)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
•
e
e
JS.
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
a
The
2.61
3.61
2.94
3.58
4.65
4.53
3.41
3.41
5.00
5.38
3.54
4.73
5.05
5.00
2.84
3.11
2.87
.3.32
2.79
3.10
2.61
2.94
2.32
2.74
2.70
2.43
2.60
2.60
2.85
1.84
2.05
1.61
1.92
2.01
1.66
2.02
1.50
·1.72
1.73
1.97
1.67
1.68
1.72
1.84
1.74
2.20
1.50
2.16
2.81
4.81
1.62
2.73
2.65
2.38
X2
Serum CO 2
(roM./L.)
X
3
Red Cell K
(roM./L.)
••&
24.6
25.7
19.7
23.4
25.6
22.4
24.7
21.6
22.8
25.1
..
17.4
25.4
19.3
23.4
23.6
21.5
26.9
25.1
22.0
26.2
29.8
25.1
30.6
29.5
33.3
31.7
33.8
34.4
38.7
42.2
39.4
27.4
--44.3
34.3
43.8
39.2
symbol -- indicates a missing datum.
101.4
102.0
100.8
99.6
98.8
101.3
106.0
94 •.3
96.8
100 •.3
108.0
101.5
105.9
107.3
102.5
106.0
101.2
100.1
100.1
96.8
96.7
100.9
100.1
104.2
96.9
100.3
100.2
106.3
103.2
106.8
96.0
111.0
98.8
91.1
89.4
92.8
106.3
104.1
99.3
109.6
86.5
66.5
103.0
105.2
83.2
98.0
90.4
X4 • y
]hsole K
(mM/l00 gr.)
44.7
47.7
47.3
45.5
47.2
47.8
47.2
46.9
45~7
44.9
44.9
47.5
46.0
46.5
4.3.5
38.1
42.9
45.0
42.4
42.3
42.3
39.1
38.1
41.0
34.6
41.0
34.3
38.6
.38.6
35.5
.34.8
33.4
.34.5
38.8
31.5
33.0
30.6
31•.3
33.1
31.4
29.5
25.2
27.7
28.2
30.0
27.7
25.0
24.2
20.6
29.6
23.4
30.1
23.1
29.9
':~.~
7.2 Solutions

The methods indicated in previous chapters will be applied here and the results given. A discussion of these results will be given in section 7.3.

Equation (7.1) below was determined by ordinary least squares, using the thirty-two complete observations.

    (7.1)

It was mentioned in the previous chapter that the approach of Anderson (1957) would be used in obtaining a maximum likelihood prediction equation from the Welt data. To do so, we delete the values of x2 from rats number two, three, fifteen, and seventeen, so that the remaining data may be arranged in the form

    x11  x12  x13  · · ·                            · · ·  x1N
    x21  x22  x23  · · ·  x2n1
    x31  x32  x33  · · ·            · · ·  x3n2
    x41  x42  x43  · · ·                            · · ·  x4N

where the primary subscript denotes the variable; the secondary subscript, the observation; and x4 ≡ y is the dependent variable. In our particular case, there are n1 = 32 observations complete on all variates, n2 = 47 complete on all except x2, and N = 54 complete on x1 and y. The assumption of multivariate normality is reasonable here and the likelihood function can be written
    ∏(ℓ=1..n1) f_ℓ(xi | μi, σi, ρij)   (i, j = 1, 2, 3, 4)
      × ∏(ℓ=n1+1..n2) f_ℓ(xi | μi, σi, ρij)   (i, j = 1, 3, 4)
      × ∏(ℓ=n2+1..N) f_ℓ(xi | μi, σi, ρ14)   (i = 1, 4)                      (7.2)
and may be rewritten

    ∏(ℓ=1..N) f_ℓ(xi | μi, σi, ρ14)   (i = 1, 4)
      × ∏(ℓ=1..n2) f_ℓ(x3 | β30.0 + β31.4 x1 + β34.1 x4; σ²3.14)
      × ∏(ℓ=1..n1) f_ℓ(x2 | β20.00 + β21.34 x1 + β23.14 x3 + β24.13 x4; σ²2.134)      (7.3)

where

    β30.0 = μ3 − β31.4 μ1 − β34.1 μ4
    σ²3.14 = σ3² − (β31.4² σ1² + β34.1² σ4² + 2 β31.4 β34.1 σ1 σ4 ρ14)
    σ1² β31.4 + σ1 σ4 ρ14 β34.1 = σ1 σ3 ρ13                                            (7.4)
    σ1 σ4 ρ14 β31.4 + σ4² β34.1 = σ3 σ4 ρ34

and

    β20.00 = μ2 − β21.34 μ1 − β23.14 μ3 − β24.13 μ4
    σ²2.134 = σ2² − β′Vβ,    where Vβ = δ                                              (7.5)

with

    V = | σ1²  σ13  σ14 |        δ′ = (σ12, σ23, σ24)        β′ = (β21.34, β23.14, β24.13).
        | σ31  σ3²  σ34 |
        | σ41  σ43  σ4² |
Thus, maximum likelihood estimates of μ1, μ4, σ1², σ4², and ρ14 are obtained in the usual way from all N = 54 observations. Estimates of μ3, σ3², ρ13, and ρ34 are made from these foregoing values combined with estimates of the regression statistics from the n2 observations, using the relationships of (7.4). The remaining parameters are based on those already calculated plus regression statistics from the n1 complete observations, by substituting these in the parametric forms of (7.5) and solving. The estimates for the regression parameters β40, β41.23, β42.13, β43.12, and σ²4.123 may then be derived from the maximum likelihood estimates of the μi, σi, and ρij. The resulting equation is (7.6).
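Once the stagewise estimates of the μi and the covariances are in hand, the final step is routine: the coefficients of the prediction equation for y = x4 come from the normal equations implicit in (7.5). A minimal sketch (Python with numpy), assuming the estimated mean vector and covariance matrix have already been assembled with x4 placed last:

```python
import numpy as np

def prediction_equation(mu, S):
    """mu is the length-4 vector of estimated means (x4 = y last) and S the
    estimated 4 x 4 covariance matrix.  Returns (b0, b1, b2, b3) of the
    least squares prediction equation for y, i.e. b = Sxx^{-1} Sxy."""
    Sxx, Sxy = S[:3, :3], S[:3, 3]
    b = np.linalg.solve(Sxx, Sxy)
    b0 = mu[3] - b @ mu[:3]
    return (b0,) + tuple(b)
```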
All the data available were used in obtaining a prediction equation by the Method of Paired Correlation Coefficients. The correlation matrix, the number of values upon which each coefficient was based, and the confidence interval for each coefficient are given in Table 29.

Table 29. Sample correlation coefficients and their confidence limits based on the n_ij available pairs of xi and xj in Table 28

ij      n_ij      r_ij       95% confidence interval
12       36      -.625       -.79 to -.34
13       47      -.020       -.29 to  .28
23       32       .008       -.36 to  .37
1y       54       .672        .50 to  .79
2y       36      -.792       -.89 to -.62
3y       47       .342        .06 to  .57
Further reference will be made to this in the discussion section below. The equation determined by this method is

    ŷ_p = 13.11 + 2.2 x1 − .67 x2 + .37 x3.                                  (7.7)

The Greenberg method also utilized all the data available. The sample variate means were used as initial estimates for the missing values, and the iteration procedure described in Chapter V was carried out. Convergence was reached after fourteen iterations, each of which took two minutes and seventeen seconds. Slightly over a half hour of IBM 650 time was required for all the calculations, which yielded the prediction equation

    (7.8)
7.3 Discussion of the Results

The solutions given in section 7.2 will be discussed briefly here with the purpose of pointing out the advantages of and objections to applying the various methods to the set of data at hand. The broader aim of indicating factors which would be of importance in extending these methods and selecting the appropriate ones for other practical situations is implied.

Equation (7.1) was determined by ordinary least squares using only the thirty-two complete observations. Although a sample of this size can be expected to give fairly reliable results, it seems more than a little wasteful to discard data from over forty percent (22/54) of the animals used; or, on a measurement basis, over one-third of the total number of determinations available. A researcher would have serious reservations about disregarding so great a portion of data even when observations are easy to obtain. If the determinations were from human subjects rather than rats, the research worker would be less inclined to discard incomplete data.

Another point to be considered here is the one discussed generally in paragraph 3.1. Inspection of Table 28 reveals that the data to be discarded would include many of the observations with very high and very low muscle potassium values. Ignoring such portions of the data would seem unwise if the equation were to be used to predict muscle potassium values which were expected to be at the extremes of the sample. With only the data of Table 28 it is not possible to test whether or not one of the other methods might be preferable in this regard. However, some of the results using each prediction equation are tabulated in Table 30 below for the three complete observations having muscle potassium determinations at either end of the muscle potassium scale (y < 25.0 or y > 47.0). The sums of the absolute deviations and the sums of the squared deviations from the true values for these six observations are given at the foot of the table. From this examination no dramatic differences in the methods are indicated as far as predictability at the extremes is concerned.
In obtaining prediction equation (7.6) by the method of maximum likelihood, only four values were discarded, two of which were included in observation vectors having low muscle potassium values. However, the serum and muscle potassium values (x_1 and x_4) in these vectors were retained, so that not all of the information on any one animal was ignored; and, on a measurement basis, only slightly over two percent of the available information was discarded. Considering the sample size and the known desirable properties of maximum likelihood estimators, prediction equation (7.6) could be regarded as a reasonable first choice. It must be remembered, however, that this is a special case of maximum likelihood estimation, and that a slightly different pattern of missing observations would have precluded its use here.
Table 30. Some extreme muscle potassium values predicted from the Welt data by various methods

                            Value predicted by                  Actual
    For rat number     y_c      y_ℓ      y_p      y_g          value y
          5           47.87    48.18    47.61    47.85           47.2
          6           45.89    45.90    45.09    45.21           47.8
          7           40.76    41.23    40.67    40.63           47.2
         47           27.11    27.37    27.11    26.00           25.0
         48           25.95    23.72    20.43    20.57           24.2
         51           29.09    29.53    29.17    28.39           23.4

    Σ|y - ŷ|           18.56    17.83    21.31    19.43
    Σ(y - ŷ)²          85.40    83.62   102.20    89.38
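The deviation totals at the foot of Table 30 can be checked directly from the tabulated values; the short sketch below does so for the complete-observation column (small discrepancies from the printed 18.56 and 85.40 reflect only the rounding of the tabulated predictions).

    y_actual = [47.2, 47.8, 47.2, 25.0, 24.2, 23.4]        # actual values, Table 30
    y_c      = [47.87, 45.89, 40.76, 27.11, 25.95, 29.09]  # complete-observation column

    abs_dev = sum(abs(a - p) for a, p in zip(y_actual, y_c))
    sq_dev  = sum((a - p) ** 2 for a, p in zip(y_actual, y_c))
    print(round(abs_dev, 2), round(sq_dev, 2))             # about 18.57 and 85.46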
Comments on prediction equation (7.7), determined by the Method of Paired Correlation Coefficients, call for reference to Table 29. The sample correlation matrix given there, with the confidence limits for each coefficient, indicates that this is a sample from a population for which this type of estimation is not particularly well suited; that is, one in which the correlations between variables are of different magnitudes. Even though we are dealing with a different type of censoring and have larger numbers of pairs than were considered in the empirical sampling work of Chapter IV, the Method of Paired Correlation Coefficients would appear to be the least appropriate of the four methods considered here.
The Greenberg Procedure, which also utilizes all the data available, yields prediction equation (7.8). There are no a priori reasons for not applying the Greenberg Procedure to these data, nor would its use be contraindicated on the basis of the empirical sampling work discussed in previous chapters. One objection to employing it here might be the rather lengthy computations involved. These would seem to be of minor importance considering the availability of high speed computing machinery and the accompanying program. In this case, the Greenberg Procedure permitted a solution which would certainly be preferable to that obtained by the Method of Paired Correlation Coefficients. If the missing characters had been more evenly distributed among the independent variables, so that the method of maximum likelihood could not have been easily applied to the bulk of the data, it would have presented a still more appealing alternative.
CHAPTER VIII

SUMMARY AND CONCLUSIONS

8.1 Summary

In the previous chapters, alternative methods have been described for dealing with the problem of prediction by multiple regression when values are missing from among the independent variates. The results of empirical studies designed to evaluate the efficiency of the respective methods were discussed. The objectives of this chapter are, first, to compare these methods by bringing together results in which all the methods were used with the same or similar data and, secondly, to make suggestions regarding the use of the methods. Finally, comments will be made on the need for further research to answer some of the questions arising from the results of this work.
The eight sampling experiments described in Chapter II were sequential in nature. Experiment 1 was the broadest in scope but perhaps the narrowest in depth. A summary of the statistics of prediction efficiency for all the sampling done in this experiment is given in Tables 31A and 31B. These tables indicate that estimation by maximum likelihood, where it can be easily employed, is superior to the other schemes. For the trivariate cases, where maximum likelihood estimation is not easily carried out, neither the Paired Correlation Method nor the Greenberg Procedure appears to yield results overwhelmingly better than those obtained by merely deleting incomplete observations. However, a population differential was noted in the results given by the Paired Correlation Method.
Table 31A. Values of E from Experiment 1; N = 20, k = 20
[.25
o~,
.23)
[.80 .60, .80]
Censoring | Complete Observations | Paired Correlation | Greenberg Procedure | Maximum Likelihood
(04)
(40)
(08)
(80)
(ll)
(22)
(33)
(44)
(55)
(66)
.944
.939
.881
.858
.964
.939
.885
.855
.803
.722
.983
.967
.936
.906
.983
.972
.958
.948
.918
.857
[.25
(111)
(222)
(333)
(444)
(555)
.967
.888
.795
.625
.335
.967
.956
.875
.825
.976
.956
.939
.894
.810
.614
Complete Observations | Paired Correlation | Greenberg Procedure | Maximum Likelihood
.981
.969
.918
.917
I
.944
.939
.881
.858
.964
.939
.885
.855
.803
.722
.931
.939
.833
.880
.962
.943
.901
.874
.819
.769
.944
.953
.845
.860
.978
.960 '
.910
.836
.764
.970
.973
.930
.927
.664
.~
075, .25 .~, .251
.967
•949a
•858a
.907
.870
•736a
.627a
.784
.606
.44s-a
a These means were based on k = 5.
Table 31B. Values of A from Experiment 1; N = 20, k = 20
[.25 .~, .25J
[.80 .60, .80]
Censoring | Complete Observations, Paired Correlation, Greenberg Procedure, Maximum Likelihood (repeated for each population)
(04)
(40)
(08)
(80)
(11)
(66)
.353
.385
.381
.468
.340
0375
0437
.459
.488
J:/J7
(111)
(222)
(333)
(444)
(555)
.252
.301
0348
.524
10292
(22)
(33)
(44)
(55)
.347
.325
.329
.3~
.379
.355
.431
.369
0360
.518
0430
.•424
.321
.330
.329
.351
0386
.352
.447
.376
.422
.564
.484
.833
[.25 03) .75, .25 03), 025J
.262
032~
.36 a
.296
0482
0330
.S76&
0807
10114a
0774
a These means were based on k = 5.
.246
.268
.266
.326
.237
.261
.304
.320
.340
.423
0253
.270
.335
0338
.246
.251
.292
.316
.360
0408
.253
.262
.322
.345
.239
0256
.291
.343
0401
.518
.237
.241
0278
.280
This population differential led to the second and third experiments, which were designed to investigate it further. In Experiment 2, it was found that a general quadratic relationship would describe the mean efficiencies of samples from trivariate populations with different correlation matrices. The fitted quadratic relationship, based on 780 samples from thirty-nine different populations, revealed that the Method of Paired Correlation Coefficients was grossly inefficient for samples from populations in which the variables were highly and heterogeneously correlated. Conversely, it was eminently satisfactory for cases in which the variables were not so related.
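The kind of fit involved can be illustrated with an ordinary least-squares quadratic response surface in the three population correlations; the sketch below is schematic only, with made-up efficiencies standing in for the Experiment 2 observations, and is not the computation actually reported there.

    import numpy as np

    def quadratic_design(rho):
        """Full quadratic model in (rho12, rho13, rho23): intercept,
        linear, squared, and cross-product terms."""
        r12, r13, r23 = rho.T
        return np.column_stack([np.ones(len(rho)), r12, r13, r23,
                                r12**2, r13**2, r23**2,
                                r12*r13, r12*r23, r13*r23])

    rng = np.random.default_rng(0)
    rho = rng.uniform(-.4, .8, size=(39, 3))            # 39 hypothetical populations
    e_bar = .9 - .5 * np.abs(rho[:, 0] - rho[:, 2])     # made-up mean efficiencies
    coef, *_ = np.linalg.lstsq(quadratic_design(rho), e_bar, rcond=None)
    print(coef.round(3))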
Experiment 3 indicated, in a rather cursory way, that the results of Experiment 2 were also applicable to four-variate situations.

Experiment 4 was planned to confirm the results of Experiment 1 over a wider range of populations as far as the Greenberg Procedure was concerned. It was also intended to check on whether or not a population differential of the type indicated in Experiment 2 existed for this method of estimation. Tables 32A and 32B show that although the efficiency of the Greenberg Procedure is insensitive to the trivariate population being sampled, it does not yield results superior to those obtained by deleting incomplete observations. In Experiment 4, the Greenberg Procedure also yielded more than the expected number of biased regression coefficients by the t criterion defined in Chapter II. The biases were compensating in nature, however, as far as prediction was concerned. These results, combined with the extensive computations necessary for solutions by the Greenberg Procedure, seem to dictate against using it in trivariate cases.
Table 32A. Values of E from Experiment 4B; censoring (55), N = 20, k = 10

Population | Complete Observations | Paired Correlation | Greenberg Procedure
.6 , .6 ]
.810
[-.2
.778
.558
.6 , -.2 }
.845
.867
.760
[-.2
]
.808
-.2 , -.2
.788
.891
[-.2
.812
.820
-.2 , .6 ]
.961
[-.2
[ .6
.6 , .6 ]
.871
.820
.876
.6 , -.2 ]
[ .6
.464
.837
.797
.866
-.2 , -.2 ]
[ .6
.783
.756
.804
[ .6
-.2 , .6 ]
.769
.474
,
.2
]
.758
[ .8728 .2
.727
.753
.908
.812
.863
[-.4728 .2 , .2 ]
.2
.8728,
.2
]
.858
.8~
.840
[
.761
-.4728, .2 ]
[ .2
.803
.830
.2 , .8728]
.819
.803
[ .2
.798
.822
.2 , -.4728]
.890
.882
[ .2
.2 , .2 ]
[ .2
.751
.906
.727
III
S§
E
•
i
.002601
.025107
Table 32B. Values of A from Experiment 4B; censoring (55), N = 20, k = 10

Population | Complete Observations | Paired Correlation | Greenberg Procedure | Uncensored Data
.6 , .6 ]
.180
[-.2
.199
.294
.095
.6 , -.2 ]
.~2
.687
[-.2
.383
.315
-.2 , -.2 ]
.492
[-.2
.485
.287
.356
-.2 , .6 ]
[-.2
.475
.59J
.322
.373
.6 , .6 ]
.275
[ .6
.258
.382
.346
.6 , -.2 ]
[ .6
.192
.191
.140
.512
-.2 , -.2 ]
.689
[ .6
.510
.689
.363
[ .6
-.2 , .6 ]
.242
.160
.245
.416
, .2 ]
.600
.766
[ .8728 .2
.523
.332
.401
[-.4728 .2 , .2 ]
.532
.315
.499
]
.219
.268
.269
.172
.8728, .2
[ .2
-.4728, .2 ]
.458
.426
.312
[ .2
.389
.288
.288
.306
.2 , .8727 ]
.196
[ .2
.2 , -.4728]
.320
.428
[ .2
.238
.344
.2 , .2 ]
.722
.452
.831
[ .2
.324
III
e
e
.001349
Experiment 5 showed that the number of iterations required for convergence in using the Greenberg Procedure was influenced more by the proportion than by the amount of missing data.

In Experiment 6, samples censored (3333) were selected from two different five-variate populations. Prediction equations were computed by deleting incomplete observations and by applying the Paired Correlation Method. The five-variate populations were chosen to represent cases in which the Paired Correlation Method was expected to yield diverse results in terms of prediction efficiency. The resulting means and standard errors of E and A are given in Tables 33A and 33B, respectively. These tables represent further evidence to confirm that, in higher variate cases, the Paired Correlation Method is suitable for use with fragmentary samples from populations with homogeneously correlated variables, but that it is definitely unsatisfactory for samples from populations with heterogeneously correlated variables. The histograms representing the distribution of E for both methods of estimation in the two populations are given in Figure 7.
Table 33A. Values of E and their standard errors; Experiment 6, censoring (3333), N = 20, k = 100

                                              Complete Observations     Paired Correlation
    Population                                  E        S.E.(E)          E       S.E.(E)
    [.60 .80 .29 .40, .56 -.20 .56,
     .20 .70, .30]                             .515       .020           .357      .027
    [.36 .40 .41 .32, .24 .24 .30,
     .24 .25, .22]                             .518       .023           .874      .016
Table 33B. Values of A and their standard errors; Experiment 6, censoring (3333), N = 20, k = 100

                                              Uncensored Data
    Population                                  A        S.E.(A)
    [.60 .80 .29 .40, .56 -.20 .56,
     .20 .70, .30]                             .183       .006
    [.36 .40 .41 .32, .24 .24 .30,
     .24 .25, .22]                             .444       .015
Although Experiments 1 and 4 did not indicate the Greenberg Procedure to be worthwhile in any of the trivariate cases considered, it was not shown to be clearly inferior to the other methods as far as prediction efficiency was concerned. An examination of the relative effectiveness of the Greenberg Procedure in higher variate cases seemed to be warranted and was undertaken in Experiment 7. Complete observations, the Paired Correlation Method, and the Greenberg Procedure were used to obtain prediction equations and measures of their efficiency for one hundred samples from a population for which the Paired Correlation Method was not thought to be especially well suited. The results in terms of the sampling distributions of E and A are shown in Figures 11 and 12, respectively. The mean values of E and A, with their standard errors, for the different methods of estimation are given in Table 34. According to these criteria, both the Greenberg Procedure and the deletion method are better than the Paired Correlation Method; and results using the Greenberg Procedure appear to be at least slightly better than those obtained by omitting the incomplete observations. It will be recalled from Chapter V, however, that the regression coefficients obtained by the Greenberg Procedure in this experiment showed significant bias by the t criterion.
Table 34. Values of E, A, and their standard errors from Experiment 7; population [.70 -.51 .31, .60 .26, .44], censoring (333), N = 20, k = 100

                         Estimation by
                 Complete        Paired         Greenberg      Uncensored
                 Observations    Correlation    Procedure      Data
    E               .79             .603           .837
    S.E.(E)         .014            .02            .013
    A               .266            .482           .245           .189
    S.E.(A)         .013            .063           .015           .006
The computing time involved precluded an extensive amount of sampling with the Greenberg Procedure for five-variate cases; but the results of its use with four variates, relative to that with three, suggested some evaluation at the five-variate level. Also, the need for some convergence information at higher variate levels demanded investigation with at least a few samples. Accordingly, ten samples, each censored (3333), were selected from each of the two populations used in Experiment 6, and the Greenberg Procedure was used to find prediction equations from them. A summary of the resulting efficiencies is given by the means and standard errors of E and A in Tables 35A and 35B, respectively. In population [.60 .80 .29 .40, .56 -.20 .56, .20 .70, .30], the Greenberg Procedure yielded better results than those obtained by the use of complete observations. In population [.36 .40 .41 .32, .24 .24 .30, .24 .25, .22],
Table 35A. Values of E and their standard errors; Experiment 8, censoring (3333), N = 20, k = 10

                                                  Complete        Paired         Greenberg
    Population                                    Observations    Correlation    Procedure
    [.60 .80 .29 .40, .56 -.20 .56,      E           .648            .440           .~8
     .20 .70, .30]                    S.E.(E)        .066            .077           .086
    [.36 .40 .41 .32, .24 .24 .30,       E           .664            .889           .601
     .24 .25, .22]                    S.E.(E)        .028            .089           .076

Table 35B. Values of A and their standard errors; Experiment 8, censoring (3333), N = 20, k = 10

    (entries for estimation by complete observations, by the Greenberg Procedure, and from the uncensored data)

    [.60 .80 .29 .40, .56 -.20 .56, .20 .70, .30]:    .350    .20    .042    .020
    [.36 .40 .41 .32, .24 .24 .30, .24 .25, .22]:     .881    .426   .114    .063
there appeared to be little difference between the Greenberg Procedure
and the use of complete observations, while the Paired Correlation
Method yielded significantly better results than either of these.
8.2 Conclusions

The findings summarized above support the following conclusions and suggestions regarding the use of the methods described here:

1. The use of the Paired Correlation Method is advisable in the trivariate case only if it is reasonably certain that the sample has been drawn from a population in which the variables are homogeneously correlated. This provision also holds in higher variate cases; and it must be kept in mind that, as more variates are introduced, there is less possibility of homogeneity existing among their zero order correlations.

2. If there are enough complete observations in three- and four-variate cases, and unless there is some a priori reason for not discarding incomplete observation vectors, the use of only the complete observations gives results which usually will not be much less efficient than those obtained by the Greenberg Procedure and the Paired Correlation Method.

3. In the trivariate case, the Greenberg Procedure generally is not worthwhile.

4. In addition to these conclusions, the research carried out provides some basis for the two suggestions below:

   a) If all or nearly all the data can be used to obtain maximum likelihood estimates, in a manner similar to that indicated in Chapters VI and VII, that method is to be recommended. This suggestion is based on the limited amount of sampling done in Experiment 1 and the well-known desirable properties of maximum likelihood estimates.

   b) The Greenberg Procedure seems to give best results in cases of four or more variates where the Paired Correlation Method would not be expected to perform efficiently. The former technique is to be recommended for higher variate cases in which most of the observations are incomplete, but in which the total number of missing values is small and evenly distributed among the different variates. These cases could be described in the notation of section 2.2 as ones in which the c_i are large and nearly equal, and in which the c_ij and other c's with multiple subscripts are small or equal to zero.

The example of Chapter VII illustrates the applicability of some of these suggestions to a biological problem.
The limited amount of investigation which led to the last two recommendations (4a and 4b) points up the need for further research to confirm and supplement that done here. Possibly the most obvious need for further investigation is with the use of the Greenberg Procedure in higher variate cases. Further studies appear to be in order to determine its prediction efficiency as well as the extent and nature of the bias arising with its use. The apparent effectiveness in the four-variate case should be verified for similar cases, and the rather cursory examination of the two five-variate cases carried out here certainly invites further research. However, the sampling approach in any very large investigation of this kind would probably require computing facilities of capacity greater than the IBM 650.

Another suggestion for further research is that some sampling studies be done using maximum likelihood estimates obtained by iterative methods for the three-variate cases. It is also conceivable that maximum likelihood estimates could be obtained for some particular types of censoring in four-variate cases, either by explicit solutions, as pointed out in the example of Chapter VII, or by iterative solutions. Empirical investigations involving these estimates and comparing them with the ones discussed here would be enlightening.
LIST OF REFERENCES

Aitchison, J. and J. A. C. Brown. 1957. The Lognormal Distribution. Cambridge University Press, Cambridge.

Anderson, T. W. 1957. Maximum likelihood estimates for a multivariate normal distribution when some observations are missing. Journal of the American Statistical Association 52:200-203.

Bliss, C. I. 1952. The Statistics of Bioassay. The Academic Press, Inc., New York.

Carver, H. C. 1941. Anthropometric Data. Edwards Brothers, Ann Arbor.

Du Bois, P. H. 1957. Multivariate Correlational Analysis. Harper Brothers, New York.

Duncan, D. B. 1957. Introduction to Statistical Theory. Mimeographed class notes. University of North Carolina, Chapel Hill.

Dwyer, P. S. 1951. Linear Computations. John Wiley and Sons, New York.

Edgett, G. L. 1956. Multiple regression with missing observations among the independent variables. Journal of the American Statistical Association 51:122-131.

Foster, W. D. 1950. On the selection of predictors: two approaches. Unpublished Ph.D. Thesis. North Carolina State College, Raleigh.

Institute of Statistics. 1957. A Record of Research IV:17. Consolidated University of North Carolina.

Lord, F. M. 1955. Estimation of parameters from incomplete data. Journal of the American Statistical Association 50:870-876.

Matthai, A. 1951. Estimation of parameters from incomplete data with application to design of sample surveys. Sankhya 11:145-152.

Moonan, W. J. 1957. Linear transformation to a set of stochastically dependent normal variables. Journal of the American Statistical Association 52:247-252.

Morrison, R. D. 1956. Some studies on the estimates of the exponents in models containing one and two exponentials. Unpublished Ph.D. Thesis. North Carolina State College, Raleigh.

Munter, A. H. 1936. A study of the lengths of long bones of the arms and legs in man, with special reference to Anglo-Saxon skeletons. Biometrika 28:258.

Neiswanger, W. A. and T. A. Yancey. 1959. Parameter estimates and autonomous growth. Journal of the American Statistical Association 54:389-402.

Nicholson, G. E., Jr. 1948. The application of a regression equation to a new sample. Unpublished Ph.D. Thesis. University of North Carolina, Chapel Hill.

Nicholson, G. E., Jr. 1957. Estimation of parameters from incomplete multivariate samples. Journal of the American Statistical Association 52:523-526.

Orcutt, G. H. and D. Cochrane. 1949. A sampling study of the merits of autoregressive and reduced form transformations in regression analysis. Journal of the American Statistical Association 44:356-372.

Rand Corporation. 1955. A Million Random Digits with 100,000 Normal Deviates. The Free Press, Glencoe, Illinois.

Rao, C. R. 1956. Analysis of dispersion with incomplete observations on one of the characters. Journal of the Royal Statistical Society, Series B, 18:259-264.

Sheps, M. C. and P. L. Munson. 1957. The error of replicated potency estimates in a biological assay method of the parallel line type. Biometrics 13:131-132.

Wells, H. B. 1959. North Carolina perinatal mortality study: investigation of methods for analyzing data. Unpublished Ph.D. Thesis. University of North Carolina, Chapel Hill.

Welt, L. G., et al. 1958. The prediction of muscle potassium from blood electrolytes in potassium depleted rats. Transactions of the Association of American Physicians 71:250-259.

Wilks, S. S. 1932. Moments and distributions of estimates of population parameters from fragmentary samples. Annals of Mathematical Statistics 3:163-203.
APPENDIX A
GENERATION OF MULTIVARIATE NORMAL DEVIATES
A.1 Transformations to Generate Trivariate Normal Deviates

Consider a p-variate normal distribution, x_p ~ N(μ_p, V_pp), μ_p being the vector of means (p-dimensional) and V_pp the p x p variance-covariance matrix. If x_r is taken as the vector of any r elements of x_p, arranged in the last r positions, and if the remaining p - r = q elements are denoted x_q, then the conditional distribution is

    x_r | x_q ~ N(μ_{r|q}, V_{rr|q}),

where

    μ_{r|q} = μ_r + s_rq S_qq^{-1} (x_q - μ_q)    and    V_{rr|q} = S_rr S_rr',

S_pp being a lower triangular matrix such that S_pp S_pp' = V_pp, partitioned as

    S_pp = [ S_qq   0
             s_rq   S_rr ].

This theorem is stated and proved by Duncan (1957).
Given independent N(0,1) deviates x_1, x_2, x_3, and wishing to find the transformations necessary to generate trivariate normal deviates x'_1, x'_2, x'_3 with μ'_1 = μ'_2 = μ'_3 = 0 and

    V = [ 1      ρ_12   ρ_13
          ρ_21   1      ρ_23
          ρ_31   ρ_32   1    ],

we take

    x'_1 = x_1    and    x'_2 = ρ_12 x_1 + (1 - ρ_12^2)^{1/2} x_2,

and, in a manner similar to that described below, find x'_3. We want f(x_3 | x_1, x_2), whose mean and variance are

    μ_{3|12} = (1 - ρ_12^2)^{-1} [ (ρ_13 - ρ_12 ρ_23) x'_1 + (ρ_23 - ρ_12 ρ_13) x'_2 ],

    V_{3|12} = S_33^2 = 1 - ρ_13^2 - (ρ_23 - ρ_12 ρ_13)^2 (1 - ρ_12^2)^{-1};

so that

    x'_3 = μ_{3|12} + S_33 x_3.
A.2 An Example of the Procedure Used to Generate Four-variate Normal Deviates

Given independent N(0,1) deviates x_1, x_2, x_3, x_4, we wish to generate multivariate normal deviates x'_1, x'_2, x'_3, x'_4 such that μ'_1 = μ'_2 = μ'_3 = μ'_4 = 0 and

    V = [ 1     .25   .50   .75
          .25   1     .25   .50
          .50   .25   1     .50
          .75   .50   .50   1   ].
Using computing methods described by Dwyer (1951) we find directly

    S' = [ 1   .25          .50          .75
           0   .96824584    .12909944    .32274861
           0   0            .85634884    .09731237
           0   0            0            .56909018 ]

and

    S^{-1} = [  1             0             0             0
               -.25819889     1.03279556    0             0
               -.54494926    -.15560978     1.16774841    0
              -1.07827614    -.55910614    -.19968077     1.75719074 ].
Take x'_1 = x_1 and, using the relationship

    x'_r = b_rq x'_q + S_rr x_r,      b_rq = s_rq S_qq^{-1},

successively for r = 2, 3, 4 with q < r, we get

    x'_2 = .25 x_1 + .96824584 x_2
    x'_3 = .46666667 x'_1 + .13333333 x'_2 + .85634884 x_3
    x'_4 = .61363636 x'_1 + .31818182 x'_2 + .11363636 x'_3 + .56909018 x_4
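The same construction can be written compactly in terms of a lower triangular factor of V; in the sketch below, numpy's Cholesky routine stands in for the desk computation above, V is entered as reconstructed in this section, and the empirical correlations of 400 generated sets are printed for comparison with section A.3.

    import numpy as np

    V = np.array([[1.00, 0.25, 0.50, 0.75],     # correlation matrix of section A.2
                  [0.25, 1.00, 0.25, 0.50],
                  [0.50, 0.25, 1.00, 0.50],
                  [0.75, 0.50, 0.50, 1.00]])

    S = np.linalg.cholesky(V)                   # lower triangular factor, S S' = V
    print(S.T.round(8))                         # compare with the tabulated S'

    rng = np.random.default_rng(1959)
    z = rng.standard_normal((400, 4))           # independent N(0,1) deviates
    x = z @ S.T                                 # each row is one correlated set
    print(np.corrcoef(x, rowvar=False).round(3))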
A.3 Some Results Obtained by Using the Transformations of A.2

Four hundred sets of multivariate deviates were generated using the transformations above and the first four hundred input cards. These deviates yielded the following sums, sums of squares, and sums of cross products (N = 400):

    Sums:   Σx_1 = 3.70    Σx_2 = -26.21    Σx_3 = 39.44    Σx_4 = 4.79

    Σx_1^2 = 377.86    Σx_1x_2 = 109.08    Σx_1x_3 = 205.01    Σx_1x_4 = 297.28
                       Σx_2^2  = 408.86    Σx_2x_3 = 105.29    Σx_2x_4 = 219.09
                                           Σx_3^2  = 411.46    Σx_3x_4 = [illegible]
                                                               Σx_4^2  = 418.19

which reduces to the correlation matrix

    [ 1    .279   .521   .748
           1      .265   .532
                  1      .281
                         1    ]    (symmetric).

It may be noted that the coefficients b_rq = s_rq S_qq^{-1} in these transformations are the vectors of partial regression coefficients.
APPENDIX B

THE NORMAL DEVIATES USED IN SAMPLING

Ten thousand N(0,1) deviates were selected at random from the one hundred thousand listed in the Rand tables (1955). A full description of these deviates, their origin, and the tests for randomness which were applied to them is given with that listing.

It will suffice here to note that the deviates were punched, five per card, in two non-overlapping groups. The starting point for each group was selected at random, one group being punched serially from rows, the other from columns. The cards were then shuffled and numbered in order. A frequency tabulation of these N(0,1) deviates is given for the ten equal intervals from -5.000 to +5.000 in Table B-1 below.
Table B-1. Frequency tabulation of ten thousand N(0,1) deviates

    From      To         Number observed    Number expected
    -4.000   -4.999              1                  .3
    -3.000   -3.999             13                13
    -2.000   -2.999            200               214
    -1.000   -1.999           1335              1359
      .000    -.999           3410              3413
      .000     .999           3164              3413
     1.000    1.999           1368              1359
     2.000    2.999            212               214
     3.000    3.999              7                13
     4.000    4.999              0                 .3
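The expected frequencies in Table B-1 are simply 10,000 times the standard normal probability of each interval; the short check below reproduces them with the standard library's error function standing in for the tables that would have been consulted at the time.

    from math import erf, sqrt

    def phi(x):                                  # standard normal CDF
        return 0.5 * (1 + erf(x / sqrt(2)))

    edges = [-5, -4, -3, -2, -1, 0, 1, 2, 3, 4, 5]
    expected = [10000 * (phi(b) - phi(a)) for a, b in zip(edges[:-1], edges[1:])]
    print([round(e, 1) for e in expected])
    # about .3, 13.2, 214.0, 1359.1, 3413.4, 3413.4, 1359.1, 214.0, 13.2, .3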