Svendsgaard, D. J. (1978). "Contributions to the Theory of Biased Estimation of Linear Models."

CONTRIBUTIONS TO THE THEORY OF
BIASED ESTIMATION OF LINEAR MODELS

by

David J. Svendsgaard
Department of Biostatistics
University of North Carolina at Chapel Hill

Institute of Statistics Mimeo Series No. 1162
JUNE 1978

DAVID JOHN SVENDSGAARD. Contributions to the Theory of Biased Estimation of Linear Models.
(Under the direction of LAWRENCE L. KUPPER.)
This study investigates estimators that may be used instead of the usual least squares estimator for the parameters of a linear model. These estimators are linear transforms of the least squares estimator, and they are usually biased.

The motivation to use a biased estimator is either a desire to obtain an improved estimator or to satisfy a constraint. We use a criterion called General Mean Square Error (GMSE). When there is a constraint we only consider the class of estimators that satisfy it. We then proceed to determine the estimator with the smallest GMSE within this class.

The linear transform of the least squares estimator, which defines this estimator with the smallest GMSE, is usually a function of unknown parameters. Therefore, such an estimator is not useable. However, consideration of the form of this optimal estimator together with the availability of prior information of bounds on the parameter space can result in the development of a useful biased estimator that is superior to many of the other estimators in the class for some values of the parameters.

Rather than base the choice of the transformation on prior knowledge, one can let the data choose the transform. For example, the Stein estimator, the ridge estimator and stepwise regression can be viewed as stochastic linear transforms of the full model least squares estimator. Some of these estimators are compared and applications are considered.
TABLE OF CONTENTS

ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES

CHAPTER

I   GENERAL CONSIDERATIONS
    1.1 Introduction
    1.2 Response Surface Methodology
    1.3 Review of the Literature
    1.4 An Outline of the Study

II  A REVIEW OF BIASED ESTIMATION
    2.1 Introduction
    2.2 Estimation of Regression Coefficients
        2.2.1 The IMSE of Some Biased Estimation Procedures for Linear Models Where Spurious Correlations Exist Among the Independent Variables
        2.2.2 An Aspect of the Problem of Spurious Correlations in the Independent Variables
        2.2.3 Ridge Regression
    2.3 The Case Where the Aim is Smoothing or Prediction
        2.3.1 Achieving Minimum Bias with Approximation Functions
              2.3.1.1 Correcting an Estimator so that its Bias Error is Minimized
    2.4 Identification of Explanatory Variables

III NEW RESULTS
    3.1 Introduction
    3.2 Parameter Estimation
        3.2.1 Prior Information About Exactly One of the Elements of β₂
    3.3 A Characterization of Some Biased Estimators
        3.3.1 Consideration of the Estimators of the Form K₁β̂₁ + K₂β̂₂
        3.3.2 The SK Estimator
        3.3.3 Estimation of γ
        3.3.4 A Confidence Interval for γ
        3.3.5 Application of the Ridge Estimator to Smoothing and Prediction Situations

IV  APPLICATIONS AND FURTHER COMPARISONS
    4.1 Application of Biased Estimation Theory to an Iterative Maximum Likelihood Search
        4.1.1 Background
        4.1.2 Theoretical Development
              4.1.2.1 Description of the Cumulative Logistic Adjusted for Natural Responsiveness Dose-Response Model
              4.1.2.2 Description of an Existing Iterative Method
              4.1.2.3 Development of an Alternative Estimator via the Consideration of Biased Estimation Theory
        4.1.3 Empirical Evaluation of Modification
        4.1.4 Conclusion
    4.2 A Comparison of Some Estimators of the Form Kβ̂
        4.2.1 Introduction
        4.2.2 The Model
        4.2.3 The Class of Designs
        4.2.4 The Criterion Used for Comparison
        4.2.5 Explanation of the Comparison
              4.2.5.1 Method of Obtaining Bound on the Ratio of Bias Errors
        4.2.6 Results
        4.2.7 Interpretation of the Results
    4.3 An Application of Biased Estimation to the Low Dose Extrapolation Problem
        4.3.1 Introduction
        4.3.2 An Example
              4.3.2.1 An Example where Bias is more Important than Variance
        4.3.3 Obtaining Bounds on the Dose Response Curve
              4.3.3.1 Results

V   FUTURE RESEARCH

LIST OF REFERENCES
ACKNOWLEDGMENTS

I wish to express particular gratitude to Dr. Lawrence L. Kupper and Dr. Ronald W. Helms for the guidance that made this study possible. I also wish to thank the other members of my Doctoral Committee, Dr. R. R. Kuebler, Dr. M. A. Shiffman, and Dr. H. A. Tyroler, for their various contributions. I also want to thank Susan Stapleton for typing the manuscript.

Finally, a special tribute must be given my wife, Sylvia, who encouraged me and shared the frustrations.

This research was supported by the Department of Biostatistics through the National Institute of Environmental Health Sciences Training Grant 3-T01ES00133, Biostatistics for Research and Environmental Health.
LIST OF TABLES

Los Angeles Student Nurse Data
Maximum Likelihood Estimates for each Symptom
Successful (S) and Failing (F) Iteration Attempts
Comparison of Biased Estimators
Sample Size n to reject η(ξ₂) ≥ η_L(ξ₂) and value of η_L(ξ₂) when ξ₀ = 0 and ξ₂* = 1
Sample Size n to reject η(ξ₂) ≥ η_L(ξ₂) and value of η_L(ξ₂) when ξ₀ = 0 and ξ₂* = 10
LIST OF FIGURES

Illustration of Mechanism for Choosing δ*
Locations of design points when λ = 0
Locations of design points when λ = 1/2
The overprediction of a concave upward response function at a dose equal to 2. The straight line passes through the response at dose levels of 1 and 3.
The bias error at ξ₁ resulting from the linear interpolation between (ξ₀, η_L(ξ₀)) and (ξ₂, η(ξ₂)) is smaller than the bias error at ξ₁ resulting from the linear interpolation between (ξ₀, η(ξ₀)) and (ξ₂, η(ξ₂)).
Chapter I

GENERAL CONSIDERATIONS

1.1 Introduction
When estimating the parameters of a linear model, one generally has available an unbiased estimator. The least squares estimator is an unbiased estimator when fitted to the true model. Under the assumption of uncorrelated random errors with zero means, the least squares estimator of any linear combination of the parameters has the smallest variance of any unbiased estimator that is a linear combination of the observations. However, smaller mean square error of an estimator for some linear combinations of the parameters is obtainable if a biased estimator is considered.
In this dissertation, biased estimators are explored.
These
estimators are later compared in terms of their mean square error.
The literature of response surface analysis has been motivated by
similar considerations.
We find the setting of response surface analysis
a convenient starting point to discuss this problem.
1.2 Response Surface Methodology

Response surface analysis is concerned with a functional relationship between a response variable η and q independent variables ξ₁, ξ₂, ..., ξ_q. The relationship involves the parameters θ₁, θ₂, ..., θ_r. These parameters are generally unknown, but can be estimated from experimental data. Typically, such data will consist of N measured values of η corresponding to N specified combinations of levels of the ξ's.

The response relationship is usually interpreted geometrically as representing a surface in a space whose coordinates are the q + 1 variables ξ₁, ..., ξ_q and η. The N experimental combinations of levels of the ξ's are represented by points in the space of the independent variables. The collection of these N points is called the experimental design.
In practice, the form of the response function f(ξ, θ) is generally unknown, and the usual procedure is to assume that the unknown function can be represented at all points ξ by a polynomial in ξ. Justification for considering the polynomial model comes from the fact that functions with certain smoothness properties can be approximated about a point by a partial sum of a Taylor series. Therefore, we shall consider that the true model can be written as

    η = x'β ,

where x is 1×p, β is p×1, β is the vector of coefficients of the polynomial, and the elements of x are standardized functions of the variables in ξ. Specifically, the ith element of the vector x_u corresponding to the point ξ_u is

    x_iu = ∏_{j=1}^{q} [(ξ_ju − ξ̄_j)/s_j]^{h_ij} ,

where ξ̄_j = (1/N) Σ_{u=1}^{N} ξ_ju, the s_j are scale factors, and the h_ij's are non-negative integers. For example, if ξ_u' = (ξ_1u, ξ_2u), ξ̄₁ = ξ̄₂ = 0, and s_j = 1, then for a second order polynomial (p = 6) we might have

    x_u' = (1, ξ_1u, ξ_2u, ξ²_1u, ξ²_2u, ξ_1u ξ_2u) .

In this case, the matrix whose ith, jth element is h_ij is

    [ 0  0
      1  0
      0  1
      2  0
      0  2
      1  1 ] .

Let X (N×p) be the matrix whose uth, ith element is x_iu, and let Y (N×1) be the vector of uncorrelated responses whose uth element is observed when the level of the factors is ξ_u. If ε is the vector of deviations of Y from Xβ, then

    Y = Xβ + ε .

We assume these deviations are random uncorrelated variables with mean zero and variance σ². We also assume X has full column rank, that is, (X'X)⁻¹ exists.
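The construction above can be made concrete with a short numerical sketch (not part of the original text); the factor levels, coefficients, and error standard deviation below are made-up values, and the exponent matrix is the one displayed above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up factor levels, already centered and scaled (xi_bar = 0, s_j = 1).
xi = rng.uniform(-1.0, 1.0, size=(30, 2))          # N = 30 runs, q = 2 factors

# Rows of the h matrix: exponents (h_i1, h_i2) for the p = 6 model terms
# x_u' = (1, xi1, xi2, xi1^2, xi2^2, xi1*xi2).
h = np.array([[0, 0], [1, 0], [0, 1], [2, 0], [0, 2], [1, 1]])

# Build X: the (u, i) element is prod_j xi_ju ** h_ij.
X = np.prod(xi[:, None, :] ** h[None, :, :], axis=2)  # shape (N, p)

beta_true = np.array([1.0, 2.0, -1.0, 0.5, 0.0, 0.3])  # made-up coefficients
y = X @ beta_true + rng.normal(scale=0.2, size=X.shape[0])

# Least squares estimator beta_hat = (X'X)^{-1} X'Y.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)
```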
Given this model, it is possible to derive the conclusion of the Gauss-Markov Theorem: "The variance of x'b is smaller when b = β̂ = (X'X)⁻¹X'Y than for any other unbiased estimator (of x'β) which is a linear combination of the elements of the vector Y."

Thus, the guarantees of x'β̂ provided by the Gauss-Markov Theorem are true regardless of whether or not ξ is in the sampled region and regardless of physical constraints that restrict the range of x'β. Since these specifications are not a necessary prerequisite to using the conclusions of the Gauss-Markov Theorem, there is no consideration that for some ξ, x'β̂ may be less reliable than for other ξ. Also there is no guarantee that β̂ will not take on nonsensical values.
Although the use of β̂ may be justified by means of various criteria, one overriding characteristic β̂ possesses is: "Of the estimators which are linear combinations of the Y, β̂ best fits the model to the data." That is, the sum of squared errors of fit

    φ(b) = (Y − Xb)'(Y − Xb)                                        (1.2.1)

is minimized when b = β̂. Since Y − Ŷ is unexplained error, φ(β̂) is a measure of precision.

When p is large due to fitting a high degree polynomial, experience has taught that unreliable estimates of η outside the region of experimentation can occur. Since high degree polynomials make φ(b) small, the unwarranted use of high degree polynomials treats the data as being more precise than the data usually are. That is, if β were known, then φ(β) would be larger than φ(β̂) on the average. That is,

    (N − p)σ² = E(φ(β̂)) < E(φ(β)) = Nσ² .

In summary, the sum of squared deviations of Y from the model fitted by the least squares estimator is smaller on the average than the actual deviations are. The more parameters that are estimated, the greater the discrepancy between these two sums of squares.
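This relation between the two expected sums of squares can be checked by simulation; the following sketch (not from the original text) uses made-up values of N, p, and σ and is only an illustration of E(φ(β̂)) = (N − p)σ² and E(φ(β)) = Nσ².

```python
import numpy as np

rng = np.random.default_rng(1)
N, p, sigma = 40, 6, 1.5
X = rng.normal(size=(N, p))
beta = rng.normal(size=p)

phi_hat, phi_true = [], []
for _ in range(2000):
    y = X @ beta + rng.normal(scale=sigma, size=N)
    b = np.linalg.solve(X.T @ X, X.T @ y)
    phi_hat.append(np.sum((y - X @ b) ** 2))      # phi(beta_hat)
    phi_true.append(np.sum((y - X @ beta) ** 2))  # phi(beta)

print(np.mean(phi_hat), (N - p) * sigma**2)   # close to (N - p) sigma^2
print(np.mean(phi_true), N * sigma**2)        # close to  N      sigma^2
```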
There are various methods that can be used to better balance the fitting and extrapolation errors. The judicious choice of a low order polynomial and/or variable selection routines often yields models whose extrapolation characteristics will be better behaved. The choice of the model used to fit the data may be based on considerations of regions of interest of the ξ variables. For example, the assumption of linearity might be plausible over certain regions but violate physical laws when the real line is considered. Regardless of the motivation, we assume that one has specified a region of interest R of the ξ variables over which it is desired to approximate η by some approximating function. We will consider the case where the approximating function is related to η so that η_a = x₁'β₁ is the form of the model to be fit.

A general form of criterion that has been used for variable selection, smoothing of the response observations, selection of the experimental design, and parameter estimation is general mean square error

    J(M, b) = tr M E(b − β)(b − β)' .

M (p×p) is invariably taken to be a positive definite symmetric matrix.
The integrated mean square error (IMSE) is a special case of the form J(M, b). To see this, let R be the region of interest. The IMSE of b is

    J = [N/(Ωσ²)] ∫_R E(x'b − x'β)² dξ ,   where Ω = ∫_R dξ .

If we denote the moment matrix as

    W = (N/Ω) ∫_R x x' dξ ,

then

    J = tr (W/σ²) E(b − β)(b − β)' = J(W/σ², b) .

It can be shown that

    J = V + B ,

where

    V = [N/(Ωσ²)] ∫_R Var(x'b) dξ   and   B = [N/(Ωσ²)] ∫_R (E(x'b) − x'β)² dξ .

V is the integrated variance of x'b and B is the integrated bias error of x'b. A similar partition exists for J(M, b). We shall denote these partitions by V and B and call them the variance error and the bias error (with respect to M).
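To illustrate the V and B partition numerically (this example is not from the text), the sketch below fits a first order model when the true response is quadratic, takes R = [−1, 1], and approximates the integrals over R by averaging over a fine grid; the design, coefficients, and σ are made up.

```python
import numpy as np

rng = np.random.default_rng(2)
sigma = 0.5
N = 12
xi_design = np.linspace(-1.0, 1.0, N)            # made-up one-factor design

# True model: eta = b0 + b1*xi + b2*xi^2 ; fitted model: first order only.
beta1 = np.array([1.0, 0.8])                     # (b0, b1)
beta2 = np.array([0.6])                          # b2 (omitted from the fit)

X1 = np.column_stack([np.ones(N), xi_design])    # fitted terms
X2 = (xi_design ** 2)[:, None]                   # omitted term

# Reduced-model least squares: E(b1_hat) = beta1 + A beta2, A = (X1'X1)^-1 X1'X2.
A = np.linalg.solve(X1.T @ X1, X1.T @ X2)
Eb = beta1 + (A @ beta2)
C = np.linalg.inv(X1.T @ X1)                     # Var(b1_hat) = sigma^2 * C

# Average over a fine grid on R = [-1, 1]; the mean approximates (1/Omega) * integral.
xi_grid = np.linspace(-1.0, 1.0, 2001)
x1 = np.column_stack([np.ones_like(xi_grid), xi_grid])
eta = beta1[0] + beta1[1] * xi_grid + beta2[0] * xi_grid ** 2

var_pred = sigma ** 2 * np.einsum("ui,ij,uj->u", x1, C, x1)  # Var(x1' b1_hat)
bias_pred = x1 @ Eb - eta                                    # E(x1' b1_hat) - eta

V = (N / sigma ** 2) * np.mean(var_pred)       # variance error
B = (N / sigma ** 2) * np.mean(bias_pred ** 2) # bias error
print(V, B, V + B)                             # J = V + B
```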
When the model is written in the form

    η = x₁'β₁ + x₂'β₂ ,                                             (1.2.2)

we will denote the submatrices of W corresponding to this partitioning of x as

    W = [ W₁    W₂
          W₂'   W₂₂ ] .

In a situation where the model is further partitioned as

    η = x₁'β₁ + x_2a'β_2a + x_2b'β_2b ,                             (1.2.3)

we denote the submatrices by

    W = [ W₁     W_2a    W_2b
          W_2a'  W_aa    W_ab
          W_2b'  W_ab'   W_bb ] .

The partitioning of X corresponding to the partitioning in (1.2.2) will be denoted by X = (X₁ : X₂), and corresponding to (1.2.3) by X = (X₁ : X_2a : X_2b).
1.3 Review of the Literature

Many estimators for the parameters of the linear model can be viewed as a linear transform of the least squares estimator of the full model. That is,

    b = K β̂ .

The motivation for considering such an estimator comes from one of two sources. One source is the desire to obtain a better estimator than the least squares estimator. The second source is a knowledge of constraints or limitations on the parameter space. There is also the possibility that both sources of motivation are acting.

These sources of motivation closely coincide with the three aims of regression analysis:

-Parameter estimation
-Prediction or smoothing
-Identification of explanatory variables.

Under parameter estimation, the full model least squares estimator is generally available. The motivation here is usually to obtain a better estimator. Under prediction or smoothing, there may be a constraint that makes the full model least squares estimator unacceptable. Under identification of explanatory variables, it is often the case that both sources of motivation are involved.

In our discussion, K will be either stochastic or deterministic. Usually, a stochastic K is used to obtain an estimator that is better than the least squares estimator. However, when prior information about the parameters is available, a deterministic K may be considered for purposes of obtaining an improved estimator. Classically, a deterministic K is used to satisfy a constraint.
First, let us consider a deterministic K when no prior information is available and one's goal is to obtain an estimator with better general mean square error characteristics than the full model least squares estimator. Using this criterion, Theil (1971) has shown that the optimal choice of K involves the unknown coefficients. Thus, it appears that knowledge of the unknown parameters is a prerequisite to obtaining an improved estimator under such conditions. Moreover, Barnard (1963) has shown that when the range of x'β is unbounded, then among estimators having bounded mean square error, the least squares estimator x'β̂ has minimum mean square error. Ignoring other considerations, then, it is only in those instances when x'β is bounded that a biased estimator can compete with the least squares estimator in terms of mean square error.
Let us now consider stochastic choices of K.

In the multivariate normal setting when there are 3 or more variables, Stein (1960) has exhibited a whole class of estimators for the mean vector that have mean square error characteristics smaller than the least squares estimator. Motivated by the results of Stein, Sclove (1968) has considered the univariate regression situation where p ≥ 3 and the error is normally distributed. When the design matrix is orthogonal, Sclove has shown there is a class of estimators of β having smaller general mean square error than the least squares estimator. He shows that this result holds for any form of general mean square error as long as M is a symmetric positive definite matrix.
Now let us return to considerations involved with using a deterministic K. For example, the estimator obtained by only fitting a subset of the variables of the true model using the least squares estimator is a transform of the full model least squares estimator. The IMSE of such an estimator depends on the design.

Using integrated mean square error (IMSE) criteria, Box and Draper (1959) (BD) considered choosing the design that would minimize the bias error portion of IMSE when a fixed subset of the variables of the true model was fit by least squares. To achieve minimum bias error they found that certain moments of the design had to agree with the corresponding moments of a uniform distribution on R. Whether or not the Box and Draper scheme achieves smaller IMSE than that obtained by the least squares estimator fitting the full model depends on the variance σ² and the magnitude of the coefficients associated with the variables that are not included in the selected subset.

Karson, Manson and Hader (KMH) (1969) minimized the bias error of IMSE by choice of estimator rather than choice of design. Any remaining design flexibility could be used to minimize the variance error portion of IMSE. When designs are used that make all the coefficients of the full model estimable, the KMH scheme always achieves smaller IMSE than does the BD scheme. However, the IMSE of the least squares estimator of the full model may be smaller depending on the coefficients associated with variables not included in the selected subset. This is to be expected in view of the work of Barnard which was referred to earlier.
Finally, we shall consider the use of a deterministic K when prior information is available. This problem has received relatively little attention. A reason that it has not received more attention is the subjectivity involved here. We assume that one has prior information that can be expressed as a bound on the magnitude of some combination of the parameters of the model.

Sometimes a natural bound is apparent. More often, the prior information is not so easy to specify. In such a situation, the easiest thing to do is to ignore prior information. When the purpose of the analysis is to report the data, ignoring prior information has an air of objectivism to it.

When the objective is to use the results of the analysis for prediction or parameter estimation, ignoring prior information can lead to failure. This failure can result from an intolerably large mean square error of the estimator of x'β over the range of reasonable values that x'β can assume.

Usually, prior information is incorporated into an estimation strategy using a Bayesian framework. This requires specification of a prior density on the parameters. It is felt that specification of a bound on certain combinations of the parameters is a more straightforward decision than specification of a prior density.

With this in mind, Kupper and Meydrech (KM) (1975) considered the problem of minimizing IMSE when a fixed subset of the variables of the true model is fit. Instead of using the least squares estimator of the parameter vector as Box and Draper did, they considered shrinking the least squares estimator. The amount of shrinkage was based on an a priori bound on parameters of the variables that were excluded from the fitted model. Design considerations were also incorporated into their scheme in order to reduce IMSE. They found that when their a priori bound was correct, their method resulted in smaller IMSE than was achieved by the Box and Draper method. Smaller IMSE was also achievable even when the parameters exceeded the presumed bound by a factor of two.
In a quite different context, Hoerl and Kennard (HK) (1970a) developed the ridge estimator. The use of the ridge estimator was developed to identify explanatory variables when the variables of the design matrix were highly correlated. They compared their estimator in terms of general mean square error. Specifically, they considered the mean of the squared length of the deviations of the estimator from the true parameter values. In terms of general mean square error, this is equivalent to using M = I.

They found that the ridge estimator was better (in terms of the criteria) than the least squares estimator when a certain bound on the parameters held.

The ridge trace is a plot of the elements of the ridge estimator versus a scalar that defines a class of possible ridge estimators. The ridge trace is used to choose the ridge estimator from this class. Since this choice involves the data, their ridge estimator is actually a stochastic transform of the least squares estimator. Since their comparison was based on the transform being deterministic, this comparison is open to question.
1.4 An Outline of the Study

The preceding literature review has briefly described the estimators which we shall study. Certain aspects of these estimators will be dealt with in more detail in Chapter 2. In that chapter, we shall be noting a general similarity of these estimators while keeping in mind the purpose of these estimators. We believe that it is important to distinguish these purposes, since our eventual goal is to compare estimators.

Consider the difficulties that can occur if these purposes are not distinguished when comparing estimators. The way that comparisons are made places certain restrictions on the general validity of results. When one considers the varied purposes of these estimators, it becomes clear that the results of comparisons are valid under only special circumstances. In the extreme case, when there are no circumstances under which one would consider either estimator appropriate, the result of comparing two estimators is of no interest.

To avoid states of affairs of this nature, we shall classify estimators by purpose in Chapter two. When we compare estimators, we shall be comparing only estimators that are within the same purpose group.

In Chapter three we shall describe the new results we have obtained about estimators that are transforms of the least squares estimator.

In Chapter four, we compare estimators of the form Kβ̂. This requires that we select criteria for comparison that take into account the circumstances for which the original author developed the estimator. Also in this chapter, we incorporate a biased estimator into an iterative method for finding the maximum likelihood estimator for a dose response model. We also consider the bias introduced by extrapolating.

Lastly, we describe what routes seem profitable for future research.
Chapter II

A REVIEW OF BIASED ESTIMATION

2.1 Introduction

In this chapter, our aim is to review certain estimators which have been proposed in the literature. These estimators have the characterization

    b = K β̂ ,                                                       (2.1.1)

where β̂ = (X'X)⁻¹X'Y is the least squares estimator. This characterization facilitates the comparison of these estimators in terms of their general mean square error. Define the class K where b is in K if there is a matrix K such that (2.1.1) is satisfied.

As will be seen, this characterization holds for many estimators. Some of these estimators have been developed for a very specific purpose. Comparisons involving such limited-purpose estimators can cause problems. Therefore, before comparing two estimators, we want to insure that the intersection of the two sets describing the purposes of each of the estimators is not empty. To further delineate the generality of comparison results, one must also take into account the restrictions placed on the comparison. For example, we may restrict comparisons to only estimators of the form b = Kβ̂, where K is a matrix of constants.

In summary, we have to consider the purposes of the estimators and the restrictions surrounding the comparison. The generality of the comparison will depend on the intersection of three sets:

-The set of restrictions placed on the comparison;
-The two sets describing the purposes of each of the estimators.

In the extreme case when the intersection of these three sets is empty, we would consider the comparison invalid.
Therefore, to insure valid comparisons we will consider three purpose-classification groups. In this chapter, each estimator will be classified into one of these groups, and comparisons will be made only among estimators in the same group. The three groups are:

1. The form of the true model is assumed to be known and one wishes to estimate the parameters of the model.
2. The form of the true model is not known, and one wishes to estimate certain relevant features of the true model. These features may be expressed in terms of a model which will be used to extrapolate (or predict) or to simply condense the data.
3. The form of the true model is not known and one wants to identify explanatory variables.

In this chapter we will introduce estimators according to the way the original authors introduced them. Also, generally known results about these estimators will be introduced at this stage. First, we will discuss estimators used for parameter estimation. Here, we take the full model least squares estimator as a viable alternative to other estimators mentioned in this section. These other estimators are:

-The minimum general mean square error estimator of the form Kβ̂, where K is a matrix of constants.
-Estimators satisfying the Stein restriction as applied to regression theory by Sclove and Baranchik.
-The ridge estimator.
Before the introduction of the ridge estimator, we shall discuss the concept of spurious correlations.

In the second section, we consider estimators that might be used when smoothing and/or prediction is the aim. In that section we only consider the case when the full model least squares estimator is not a viable alternative. Thus, the integrated bias of the estimators which we will consider will generally be nonzero. Here, we will compare (in terms of IMSE) those design and estimation requirements that result in minimizing the integrated bias. Also, the Kupper-Meydrech estimator and the reduced model least squares estimator are discussed.

In the third section, we discuss the estimation situation when identification of explanatory variables is the purpose of the analysis. The inconsistency arising from using percentiles of the F distribution for determining cutoff points in stepwise regression is mentioned. Other variable selection criteria are mentioned and the use of ridge regression is also mentioned.
2.2 Estimation of Regression Coefficients

When interest centers on estimating the coefficients β of a model, it is usually assumed that the true model is being fit. The usual least squares estimator β̂ = (X'X)⁻¹X'Y is unbiased for β under this assumption. Therefore, motivation for considering a transformed estimator

    b = K β̂   (K is p×p)

of β comes mainly from a desire to minimize mean square error in some form. That is, there is a willingness to accept non-zero bias in an estimator in order that the total error, bias plus variance, be reduced.
17
Various forms of mean square error criteria have been considered.
Many of these forms can be expressed as
For example, Mallows (1973) takes
M to be ~'~a2; l~er1
Kennard (1970) use the identity matrix for
M;
and
and Box and Draper's
(1959) integrated mean squared error can be viewed as letting
M
~
So
E(Q-~)(Q-~)'~
= Nn2
a
fR xx'Jx.
~
~
called the mean square error matrix of
Q,
matrix common to many forms of mean square error criteria.
is a
When
M is
a symmetric positive definite matrix, the mean square error matrix of b
is the only quantity that need be considered for the purpose of comparing
estimators with respect to the same mean square error criterion.
The
following theorem (Finkbeiner, 1960) illustrates the importance of this
matrix.
A real symmetric p×p matrix M represents a positive definite quadratic form if and only if M = PP' for some real non-singular p×p matrix P.

Since S_b = E(b − β)(b − β)' is non-negative definite and M is positive definite, we have that

    tr M S_b = tr PP'S_b = tr P'S_b P ≥ 0 ,

since all the diagonal elements of P'S_b P are non-negative.
Therefore, for the purpose of comparing J(M, b₁) and J(M, b₂) for arbitrary b₁ and b₂, we have that J(M, b₁) ≤ J(M, b₂) whenever the matrix S_b₂ − S_b₁ is non-negative definite and M is positive definite.
When K is stochastic, the mean square error matrix of b = Kβ̂ is

    S_b = E(b − β)(b − β)'
        = Var(Kβ̂) + (E(Kβ̂) − β)(E(Kβ̂) − β)'
        = E(E(K)β̂ − β)(E(K)β̂ − β)'
          + σ² E{(K − E(K))[(X'X)⁻¹ + ββ'/σ²](K − E(K))'}
          + Cov(K, β̂) β'(E(K) − I)' + (E(K) − I) β (Cov(K, β̂))'
          + Cov(K, β̂)(Cov(K, β̂))' .                                 (2.2.1)

This expression for the mean square error matrix is obtained by using the identities

    Var(Kβ̂) = E[Var(Kβ̂ | K)] + Var[E(Kβ̂ | K)]

and

    E(Kβ̂) = Cov(K, β̂) + E(K)β .

This expression for S_b allows one to gain insight into the difference between using a deterministic K and a stochastic K matrix. When K is deterministic (a matrix of constants), E(K) = K and Cov(K, β̂) = 0. Thus, only the first matrix to the right of the last equal sign in expression (2.2.1) is nonzero in this case.
Theil (1971) has considered estimators b = K*β̂ in which the transform is allowed to involve an arbitrary p×N matrix, and has shown that the general mean square error of any estimator that is a linear combination of Y can be expressed as that of a particular estimator b* plus a non-negative term. It follows that when a criterion of the form J(M, b) is used and when b is any linear combination of Y, then b* will be the linear combination that minimizes J(M, b).
Unfortunately, K* involves unknown parameters. Therefore, b* is not a useful estimator. A particular choice of K may achieve small J(M, b) for some σ² and β, but Barnard (1963) has shown that the least squares estimator of x'β is that linear combination of Y having minimum mean square error in the class of bounded mean square error estimators when the range of x'β is unbounded. This result is generalized to the relevant setting in Chapter 3.

Therefore, the justification for using the least squares estimator does not require that one be restricted to the class of unbiased estimators. If the range of x'β can not be restricted and prudence requires only consideration of bounded mean square error estimates, then the least squares estimator is the only resort. It is when the magnitude of the mean square error of the least squares estimator is unacceptably large that improved estimators in some larger class must be sought.
When K is stochastic, however, there are certain estimators of the form b = Kβ̂ that have the property that, when β̂ ~ N(β, σ²(X'X)⁻¹) and p ≥ 3,

    J(M, b) < J(M, β̂)                                               (2.2.2)

for all β. James and Stein (1961) proved that (2.2.2) holds when M = I and (X'X)⁻¹ = I. They found that (2.2.2) holds for the class of estimators characterized by K being a scalar matrix (a multiple of I). Bhattacharya (1966) found a class of estimators for which (2.2.2) holds when M is diagonal with diagonal elements d_i where

    d₁ ≥ d₂ ≥ ... ≥ d_p .                                           (2.2.3)

This class of estimators is characterized by K being diagonal. He also shows how to obtain estimators satisfying (2.2.2) when X'X is nonsingular and M is positive definite.
Since this is the only case of general interest, we shall show exactly what the K matrix looks like for this class of estimators. Bhattacharya (1967) has shown that there is a nonsingular matrix L so that, when we define γ = Lβ and γ̂ = Lβ̂, the variance-covariance matrix of γ̂ = Lβ̂ is σ²I and L will diagonalize M. That is, there is a diagonal matrix D so that

    M = L'DL .

The columns of L can be arranged so that (2.2.3) is satisfied. Bhattacharya has found the transform L so that, in the transformed space, we have the conditions that the appropriate M is diagonal and the elements of γ̂ are uncorrelated. By Bhattacharya's extension to the results of James and Stein we know that there is a K̃ matrix so that

    J(D, K̃γ̂) ≤ J(D, γ̂)   for all γ .                                (2.2.4)

Now, consider an arbitrary estimator b, and subscript the J(M, b) notation to clarify what parameter vector we mean. Then

    J_β(M, b) = E(b − β)'M(b − β) = E(b − β)'L'DL(b − β) = J_γ(D, Lb) .

We can use this last relationship to transform both sides of (2.2.4) to obtain

    J_β(M, L⁻¹K̃Lβ̂) ≤ J_β(M, β̂)   for all β .

Now let us present a matrix K which has this property. Let γ̂_(j) = (γ̂₁, γ̂₂, ..., γ̂_j)' and let d_i be the ith diagonal element of D, arranged so that (2.2.3) is satisfied. Now let f_j, j = 3, 4, ..., p, be the data-dependent shrinkage factors of the construction. K̃ is then the diagonal matrix whose ith diagonal element is

    Σ_{j=1}^{p} (d_j − d_{j+1}) f_j / d_i ,   where d_{p+1} = 0 .
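For the simplest case mentioned above (M = I and (X'X)⁻¹ = I), the scalar-matrix transform of James and Stein can be sketched directly; this illustration (not from the text) uses made-up values and is not the K̃ construction itself.

```python
import numpy as np

rng = np.random.default_rng(3)
p, sigma = 6, 1.0
beta = np.full(p, 0.5)                       # made-up true coefficients

mse_ls, mse_js = 0.0, 0.0
for _ in range(5000):
    beta_hat = beta + rng.normal(scale=sigma, size=p)   # beta_hat ~ N(beta, sigma^2 I)
    # Stein-type scalar shrinkage: K = c*I with c estimated from the data.
    c = 1.0 - (p - 2) * sigma ** 2 / np.sum(beta_hat ** 2)
    b = c * beta_hat
    mse_ls += np.sum((beta_hat - beta) ** 2)
    mse_js += np.sum((b - beta) ** 2)

print(mse_ls / 5000, mse_js / 5000)          # shrinkage beats least squares when p >= 3
```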
2.2.1 The IMSE of Some Biased Estimation Procedures for Linear Models Where Spurious Correlations Exist Among the Independent Variables

We now consider a particular difficulty that is sometimes faced in parameter estimation. This difficulty concerns the effect of spurious correlations. The effect of this problem may be so serious that the aims of the study can not be achieved. In this case, the purpose of the study may be downgraded from parameter estimation to identification of explanatory variables or prediction and smoothing. Thus, this difficulty may be experienced with any of the three purposes. We choose to consider its effect only when the purpose is parameter estimation.

Spurious correlations are apparent associations between unrelated quantities. When the data are observed rather than controlled, spurious correlations may arise. It is enlightening for the understanding of an aspect of this problem to consider a region of interest and the effect of spurious correlations on IMSE when the usual least squares estimator is used. An alternative estimator, the ridge estimator, has been proposed as a better estimator in such a situation. After discussing spurious correlations we shall compare these two estimators.
2.2.2 An Aspect of the Problem of Spurious Correlations in the Independent Variables

It is believed that an important aspect of the problem of spurious correlations can simply be viewed as the multidimensional analog of the one-dimensional extrapolation problem.
Consider a simple case where one is attempting to estimate the coefficients of a linear model of the form

    η = β₀ + β₁x₁ + β₂x₂ .

One might offhandedly believe that fitting such a function would provide a reasonable predictor for values of the independent variables (x₁, x₂) that are in the rectangle

    min x₁ ≤ x₁ ≤ max x₁ ,   min x₂ ≤ x₂ ≤ max x₂ ,

where these minima and maxima are taken over the N values of the independent variables observed. See Figure 2.1.

The figure shows a typical scatter plot of x₁ and x₂ for which the Pearson product moment coefficient would be near one. As can be seen, the regions near the top left and bottom right corners are sparsely sampled and relatively distant from the region of experimentation. Using simulated data, Marquardt (1975) considers such a situation and has shown that when there is high correlation, the models with coefficients determined by usual least squares predict well in the region of experimentation, but poorly near the corners of the square. One way to circumvent the problem is to let the distribution of points (x₁, x₂) dictate the region of interest. For example, one could choose an ellipse as shown in the figure as the region of interest. If x₁ and x₂ are associated, then it will be rare that a point (x₁, x₂) lies outside the ellipse, and so it is reasonable that there would be reduced interest in this region. On the other hand, if x₁ and x₂ are not associated, that is, the correlation is spurious, then there may still be a great deal of interest in the part of the rectangle outside the ellipse. In this situation, and at other times when the region of interest must remain fixed, Marquardt proposes the ridge estimator as an alternative to the least squares estimator for the parameters.

Thus, spurious correlations mean that the experimenter has interest in regions of the independent variables that are sparsely sampled. The one-dimensional extrapolation problem occurs because the experimenter has interest in a range of the independent variable that has been sparsely sampled. The extrapolation problem occurs because the experimenter who uses the least squares estimator must tacitly assume that the fitted function holds in this sparsely sampled region. Similarly, the experimenter who uses the least squares estimator in the multivariate case may be (unknowingly) making a similar assumption.
Figure 2.1. Scatter diagram of two highly correlated variables. The points were obtained from a table of random numbers. The underlying distribution is bivariate normal with zero means, unit variances and a correlation of 0.9. The ellipse is the 75th percentile of this distribution.
2.2.3 Ridge Regression

Hoerl and Kennard (HK) (1970a) have proposed a method of estimating the parameters of a linear model. This technique, called ridge regression, is designed to provide estimates with desirable properties when spurious correlations exist among the independent variables.

Hoerl and Kennard assume that the variables have been standardized so that X'X is in correlation form. Since they consider only the situation when X is of full column rank (even though it may have a "near singularity"), this implies that there is no constant term in the model. In general, we take this to mean that the constant term has been subtracted away. This may involve estimating it by the mean of the Y vector. When this is the case, we ignore the fact that this constant has been estimated when we compute the general mean square error.

The ridge estimator is

    β̂_k = (X'X + kI)⁻¹ X'Y ,

where k is a small positive quantity. In practice, k is determined from the ridge trace, the graph of the β̂_k coefficient vector versus k. In the presence of variables so correlated that X'X is ill conditioned, some elements of the β̂_k vector may vary greatly with a small change in k. It is recommended that one choose the smallest value of k, say k*, so that for values of k greater than k* there is little change in the elements of β̂_k. The ridge trace can also be used for variable selection.
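A minimal sketch (not from the text) of the ridge estimator and the ridge trace just described, using made-up, nearly collinear data; choosing k* from the trace remains a judgment made by inspecting how the coefficients stabilize.

```python
import numpy as np

rng = np.random.default_rng(4)
N = 50
x1 = rng.normal(size=N)
x2 = x1 + 0.05 * rng.normal(size=N)              # nearly collinear with x1
X = np.column_stack([x1, x2])
X = X - X.mean(axis=0)
X = X / np.sqrt((X ** 2).sum(axis=0))            # standardized: X'X in correlation form
y = 2.0 * x1 + 1.0 * x2 + 0.5 * rng.normal(size=N)
y = y - y.mean()                                 # constant term subtracted away

p = X.shape[1]
for k in [0.0, 0.001, 0.01, 0.05, 0.1, 0.5]:
    beta_k = np.linalg.solve(X.T @ X + k * np.eye(p), X.T @ y)
    print(k, beta_k)                             # the ridge trace: beta_k versus k
```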
The ridge estimator has not been considered in a formal region of interest framework. Therefore, the IMSE criterion could not be used for evaluating the properties of the ridge estimator. Most writers have chosen instead to use total MSE. We shall first compare the least squares estimator to the ridge estimator using the total MSE criterion, and then we shall show how this comparison is affected when a region of interest is defined and IMSE criteria are used.

HK have shown that there exists a range of k values such that smaller total MSE is achieved using the ridge estimator β̂_k with k chosen from this set of values than when k = 0 (the least squares estimator for the full model). To see this, let P be such that

    P'X'XP = Λ   and   P'P = I ,

where Λ = diag(λ₁, λ₂, ..., λ_p) and λᵢ ≤ λⱼ for i > j. Define α = P'β and α̂_k = P'β̂_k; then

    L₁ = (α̂_k − α)'(α̂_k − α) = (β̂_k − β)'(β̂_k − β) ,

so that E(L₁) is the total MSE of these coefficients. The total MSE of the least squares estimator (k = 0) is

    σ² Σ_{i=1}^{p} 1/λᵢ .

Now for any k,

    E(L₁) = σ² Σ_{i=1}^{p} λᵢ/(λᵢ + k)² + k² Σ_{i=1}^{p} αᵢ²/(λᵢ + k)²
          ≤ σ² Σ_{i=1}^{p} λᵢ/(λᵢ + k)² + k² φ²_max Σ_{i=1}^{p} 1/(λᵢ + k)² ,

where φ²_max is the maximum of the squared elements of α. Define k* = σ²/φ²_max. When 0 < k < k*, we have k²φ²_max < kσ², so that

    E(L₁) < Σ_{i=1}^{p} (σ²λᵢ + σ²k)/(λᵢ + k)² = σ² Σ_{i=1}^{p} 1/(λᵢ + k) ≤ σ² Σ_{i=1}^{p} 1/λᵢ .

That is, the total MSE of the coefficient vector is less than that of the full model least squares estimator when

    k < σ²/φ²_max .

Thus, if k is chosen sufficiently small one can obtain an improvement over the least squares estimator. However, to know that k has been chosen properly requires some prior knowledge of the true parameters.
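The argument above can be checked numerically; the eigenvalues λᵢ and rotated coefficients αᵢ below are made-up values (not from the text), and the sketch simply evaluates E(L₁) and the least squares total MSE.

```python
import numpy as np

sigma2 = 1.0
lam = np.array([3.0, 1.0, 0.2, 0.01])        # made-up eigenvalues of X'X
alpha = np.array([0.8, -0.5, 0.3, 0.2])      # made-up coefficients in the rotated frame

def total_mse(k):
    # E(L1) = sigma^2 sum lam_i/(lam_i+k)^2 + k^2 sum alpha_i^2/(lam_i+k)^2
    return (sigma2 * np.sum(lam / (lam + k) ** 2)
            + k ** 2 * np.sum(alpha ** 2 / (lam + k) ** 2))

mse_ls = sigma2 * np.sum(1.0 / lam)          # k = 0 (least squares)
k_star = sigma2 / np.max(alpha ** 2)         # sigma^2 / phi_max^2
for k in [0.0, 0.1 * k_star, 0.5 * k_star, k_star, 5 * k_star]:
    print(k, total_mse(k), mse_ls)           # E(L1) falls below mse_ls for small k > 0
```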
Now, let us see how this comparison is affected when IMSE criteria are used. First, a region of interest R has to be specified. Consider an R whose center is at the origin. Let

    W_s = Ω⁻¹ ∫_R x x' dx .

Then the IMSE of the ridge estimator involves the quadratic form (β̂_k − β)'W_s(β̂_k − β). Let H'H be the square root decomposition of W_s (i.e., W_s = H'H) and define ψ_k = Hβ̂_k and ψ = Hβ. Then

    (β̂_k − β)'W_s(β̂_k − β) = (β̂_k − β)'H'H(β̂_k − β) = (ψ_k − ψ)'(ψ_k − ψ) .

The same reasoning applies here that applied when using the total MSE criterion. It follows that when

    0 < k < σ²/ψ²_max ,

where ψ²_max is the maximum element of the squares of the elements of the parameter vector ψ, the ridge estimator will achieve smaller IMSE than the least squares estimator.
2.3 The Case Where the Aim is Smoothing or Prediction

There are numerous instances where it is best to consider a model that is much simpler than what might be considered the true model. This situation usually occurs when there are a large number of possibly important factors. In the interest of keeping the situation relatively simple, one might choose to work with an approximating function, and forego the complexity of finding the true model.

Usually, an adequately fitting approximation function will leave unexplained a sizeable amount of the deviation in the data. This variation will consist of bias and variance errors. The distinction between these two types of error is important to keep in mind: bias error will not average out in the long run. Thus confidence intervals based on the assumption that the true model has been fit will tend to paint a rosier situation than exists. Of course, the confidence intervals will be wider than those obtained using the true model, but this bias contribution will not "average out", so the usual interpretation of the confidence interval must be altered.
In order to more honestly portray the data, one might assume that the true model is only slightly more complicated than the fitted model. For example, when fitting a first order polynomial to the data, one might assume that the true model is a second degree polynomial.

This is the situation we consider. We assume that one is fitting a model that is not of the form of the true model. The reason one is doing this is because of smoothing or prediction considerations. These considerations rule out the alternative of using β̂ to estimate β. For example, in a dose-response situation one may be interested in low dose levels. The instrument used to measure response may be insensitive at low dose levels. For this reason the dose is applied at higher levels where the response can be measured, and the fitted model is used to extrapolate back to the low dose region of interest. Time and again, experiments have indicated that response is a concave function of dose in this region. One might assume that the true response is a concave function but fit a straight line. A fitted straight line in such a situation will usually overpredict response. Thus, one is able to achieve a safety factor that may not be achievable with a concave function. Also, if one fits a quadratic function there is no guarantee that the fitted function will not show an increase in response in low dose regions or negative values of response. For reasons such as these one might rule out the option of estimating β with no bias.
2.3.1 Achieving Minimum Bias with Approximating Functions

In both the BD and KMH approaches, minimum bias error was achieved. BD achieved minimum bias by choice of design and KMH achieved it by choice of estimator. Later, Karson (1970) showed that minimum bias could be achieved by a combination of design and estimation considerations.

A simple way of viewing minimum bias techniques will be presented along with a comparison of the resultant IMSE achieved by these techniques when the true model is misspecified.
2.3.1.1 Correcting an Estimator so that its Bias Error is Minimized

Suppose that the true model is

    η = x₁'β₁ + x₂'β₂

and it is desired to fit an approximating function of the form

    η_a = x₁'b₁ .

KMH have shown that any estimator b₁ with expected value

    E(b₁) = β₁ + W₁⁻¹W₂β₂

will achieve minimum integrated squared bias (B). For example, KMH suggest using the estimator β̂₁ + W₁⁻¹W₂β̂₂, where β̂₁ and β̂₂ are the least squares estimators using the full model. That is,

    E(β̂₁ + W₁⁻¹W₂β̂₂) = β₁ + W₁⁻¹W₂β₂ .

In order to use the KMH estimator, β₂ must be estimable. Satisfying this condition may require additional observations.

Probably the estimator that is most commonly employed is

    b̂₁ = (X₁'X₁)⁻¹X₁'Y .

Among other desired properties, it minimizes the sum of unexplained squared error (Y − X₁b₁)'(Y − X₁b₁). Since KMH have shown that minimum bias error imposes a constraint only on an estimator's expected value, one might consider correcting b̂₁ by adding a vector of correcting terms Δ₁ to b̂₁. What is the Δ₁ necessary to obtain minimum bias error in this case? That is, what Δ₁ satisfies

    E(b̂₁ + Δ₁) = β₁ + W₁⁻¹W₂β₂ ?

Since

    E(b̂₁) = β₁ + Aβ₂ ,   where A = (X₁'X₁)⁻¹X₁'X₂ ,

it follows that

    Δ₁ = (W₁⁻¹W₂ − A)β₂ .

Using this concept, it is easy to understand how estimation and design considerations have been employed to achieve minimum bias error. The BD approach is to use the estimator b̂₁ and set Δ₁ equal to zero by using the design condition A = W₁⁻¹W₂. This then precludes the necessity of having to worry about β₂. The KMH method is to estimate Δ₁ by Δ̂₁ = (W₁⁻¹W₂ − A)β̂₂. This follows since β̂₁ = b̂₁ − Aβ̂₂. They use design flexibility to minimize V.

It has been shown by KMH that, given complete choice of design, their approach will result in smaller IMSE than the BD approach. Whether, in a particular problem, one prefers to use one of these two approaches may depend on how much control over the independent variables one has, the cost of sampling in subregions of R, and other considerations. An alternative approach given by Karson that also results in minimum B is to use both design and estimation restrictions. How this is achieved can be seen by partitioning Δ̂₁ as follows:
    Δ̂₁ = (W₁⁻¹W_2a − A_a)β̂_2a + (W₁⁻¹W_2b − A_b)β̂_2b ,   where A = (A_a : A_b) .

When this partition is possible, one may choose the design restriction A_b = W₁⁻¹W_2b and estimate the non-zero term by (W₁⁻¹W_2a − A_a)β̂_2a.

Since all three approaches will achieve the same minimum value of bias error, comparison of their IMSE's requires us to consider only the variance error. The variance error of the KMH estimator can be written in terms of a matrix S which is conformable with β̂₂. If the dimension of S is greater than one, this matrix can be partitioned. Therefore denote the partition of S that agrees with the partition β̂₂' = (β̂_2a' : β̂_2b') as

    S = [ S₁₁   S₁₂
          S₁₂'  S₂₂ ] .

With the variance error of the BD estimator denoted by V_BD, the variance error of the KMH estimator is

    V_KMH = V_BD + N tr (W₁⁻¹W_2a − A_a)'W₁(W₁⁻¹W_2a − A_a)S₁₁
                 + {2N tr (W₁⁻¹W_2b − A_b)'W₁(W₁⁻¹W_2a − A_a)S₁₂'
                 +  N tr (W₁⁻¹W_2b − A_b)'W₁(W₁⁻¹W_2b − A_b)S₂₂} .

In the above expression, denote the second term on the right by ΔV_a and the bracketed term by ΔV_b. The above expression becomes

    V_KMH = V_BD + ΔV_a + ΔV_b .

The KMH method uses the design which minimizes V_KMH. Since ΔV_a and ΔV_b vanish under the BD design condition, it follows that the KMH approach always attains a variance error no larger than that attained by the BD method. Also, the variance error of the Karson estimator is V_K = V_BD + ΔV_a. By the same argument, it follows that the Karson method results in an IMSE that is between those attained by the BD and KMH methods.

In summary, it has been shown that the KMH method is preferable to the BD method, in terms of small IMSE, when the true model is correctly specified. The Karson method is better than the BD method, but not as good as the KMH method, when only IMSE criteria are considered.
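As a small worked example of the correction idea in this subsection (not the KMH design optimization itself), the sketch below fits a straight line to a quadratic response over R = [−1, 1] and adds the correcting term W₁⁻¹W₂β̂₂; the design and coefficients are made up, and with these region moments W₁⁻¹W₂ = (1/3, 0)'.

```python
import numpy as np

rng = np.random.default_rng(5)
N = 10
xi = np.linspace(-1.0, 1.0, N)                    # made-up design on R = [-1, 1]
X1 = np.column_stack([np.ones(N), xi])            # fitted (first order) terms
X2 = (xi ** 2)[:, None]                           # omitted (second order) term
y = 1.0 + 0.5 * xi + 0.7 * xi ** 2 + 0.1 * rng.normal(size=N)

# Full-model least squares gives beta1_hat and beta2_hat.
X = np.column_stack([X1, X2])
bhat = np.linalg.solve(X.T @ X, X.T @ y)
b1_hat, b2_hat = bhat[:2], bhat[2:]

# Uniform moments over R = [-1, 1] (the common N/Omega factor cancels in W1^{-1}W2):
# W1 = [[1, 0], [0, 1/3]], W2 = [[1/3], [0]].
W1 = np.array([[1.0, 0.0], [0.0, 1.0 / 3.0]])
W2 = np.array([[1.0 / 3.0], [0.0]])

# Minimum-bias (KMH-type) estimator of beta1: beta1_hat + W1^{-1} W2 beta2_hat.
b1_minbias = b1_hat + (np.linalg.solve(W1, W2) @ b2_hat)
print(b1_hat, b1_minbias)
```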
2.4 Identification of Explanatory Variables

When the aim of the analysis is identification of explanatory variables, and linear models are employed, it is common to focus attention on tests of significance rather than estimates of the parameters of the model.

If the experiment involves a small number of variables, the analysis is rather straightforward. In this case, it is possible to estimate the parameter vector β using β̂.

When the data are from an uncontrolled study and the dependent variable is observed at many different combinations of levels of the independent variables, the statistical power of these significance tests may be small. In this situation, there still remains a desire to identify the explanatory variables. In such a case, linear effects (plus possibly quadratic effects and crossproduct terms) may be considered the true model, while higher order terms are ignored.

In order to identify the explanatory variables at this stage, it is quite common that stepwise regression is employed. Even when the terms of the true model are included in the terms being analyzed, the usual F tests associated with stepwise regression are not valid. Pope and Webster (1972) have considered the forward selection procedure and the F statistic used at the ith step to decide whether or not to include one additional variable in the equation. Superficially, the F statistic has a central F distribution under the normality assumption and under the null hypothesis

    βᵢ = 0 ,

where βᵢ is the coefficient associated with a variable not already included in the equation. This test is done for all such variables, and if the maximum of these F statistics does not exceed some cutoff value the selection process stops. The cutoff value is usually chosen to match a certain significance level based on the central F distribution. Pope and Webster have listed sufficient conditions for this maximum F statistic to follow a central F distribution, and they note that these conditions must not hold due to the way the procedure works. So, although the procedure is clearcut, there is no probability available to measure the goodness of the procedure.
Having described an inconsistency with using F tests and stepwise regression, we feel that parameter estimation should be given more attention, even in this situation. Consider how the estimator resulting from a stepwise regression procedure relates to β̂, the least squares estimator when all the terms (linear, quadratic and crossproduct) are included in the model.

Assume that the selection procedure stops after p_s elements have been included in the prediction equation, so that b_s is a p_s × 1 vector. Then assume that the first p_s elements of β̂ are those terms included in the prediction equation. Then b_s = K_s β̂, where K_s is p_s by p and

    K_s = ( I_{p_s} : A_s ) ,

with A_s of dimension p_s × (p − p_s). Here A_s = (X_s'X_s)⁻¹X_s'X̄_s, where X_s is the matrix whose columns are the observed values of the variables in the equation and the columns of X̄_s are the observed values of variables not included in the equation. Note that K_s must be considered stochastic, since the assumed permutation of the elements of β̂ and the value of p_s depend on the Y observations.
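A sketch (not from the text) of the forward selection step just described: at each stage the maximum F statistic over candidate variables is compared with a fixed cutoff. The data and the cutoff value are made up, and, as noted above, the nominal F percentile does not actually control the error rate.

```python
import numpy as np

rng = np.random.default_rng(6)
N, p = 40, 5
X = rng.normal(size=(N, p))
y = X[:, 0] * 2.0 + X[:, 2] * 1.0 + rng.normal(size=N)   # only variables 0 and 2 matter

def rss(cols):
    # Residual sum of squares for a model with an intercept and the listed columns.
    if not cols:
        return np.sum((y - y.mean()) ** 2)
    Xc = np.column_stack([np.ones(N), X[:, cols]])
    b = np.linalg.lstsq(Xc, y, rcond=None)[0]
    return np.sum((y - Xc @ b) ** 2)

selected, cutoff = [], 4.0                                # made-up F cutoff
while len(selected) < p:
    candidates = [j for j in range(p) if j not in selected]
    rss_now = rss(selected)
    # F statistic for adding one more variable (1 numerator df).
    f_stats = {j: (rss_now - rss(selected + [j]))
                  / (rss(selected + [j]) / (N - len(selected) - 2))
               for j in candidates}
    j_best, f_best = max(f_stats.items(), key=lambda kv: kv[1])
    if f_best < cutoff:
        break
    selected.append(j_best)

print(selected)
```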
Criteria other than the use of cutoff points based on the F distribution have been used as the basis of variable selection in stepwise regression procedures.

The R² statistic has also been employed to judge whether an additional variable should be added to the equation. The square of the multiple correlation coefficient is defined to be

    R² = 1 − (Y − Xβ̂)'(Y − Xβ̂) / (Y − Ȳ)'(Y − Ȳ) ,

where Ȳ is the vector whose N elements are the average of the y's; R² is related to the statistic F₀ which tests β = 0 for the model η = β₀ + x'β. R² is usually expressed as a percentage. Including more variables in the equation will increase R². If including the ith variable would only result in a "small" increase in R², this variable is not included and the selection procedure stops. Although it is difficult to objectively decide how small is a "small" increase in R², the tabulation of the R² along with the variable selected at each step offers a way to report the essential features of the data.
The C_p statistic proposed by Mallows (1973) is

    C_p = (Y − X_s b_s)'(Y − X_s b_s)/σ̂² − N + 2p_s ,

where σ̂² is an estimator for σ². This statistic estimates the expected value of the scaled sum of the squared errors of prediction. Mallows' idea is that C_p should be plotted against p_s for as many of the regressions as possible. Such a plot gives an intuitive feel for the relative explanatory power of using different subsets of the possible regressions. The use of such plots is very much subjective, but can be seen as controlling for numbers of observations and numbers of variables in the equation. Since the R² statistic can usually be made one by including enough variables in the equation, R² doesn't control for the number of variables in the equation. The similarity of R² to a correlation coefficient is probably the reason for its stronger appeal.
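A sketch (not from the text) of the C_p computation for every subset of candidate variables, with made-up data; σ² is estimated here from the full model residuals, which is one common choice rather than a prescription from the original.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(7)
N, p = 40, 4
X = rng.normal(size=(N, p))
y = X @ np.array([1.5, 0.0, -1.0, 0.0]) + rng.normal(size=N)

def fit_rss(cols):
    Xc = X[:, cols]
    b = np.linalg.lstsq(Xc, y, rcond=None)[0]
    return np.sum((y - Xc @ b) ** 2)

sigma2_hat = fit_rss(tuple(range(p))) / (N - p)      # sigma^2 from the full model

for k in range(1, p + 1):
    for cols in combinations(range(p), k):
        cp = fit_rss(cols) / sigma2_hat - N + 2 * k  # Mallows' C_p for this subset
        print(cols, round(cp, 2))
```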
Helms (1974) considers average estimated variance (AEV) as a method of variable selection. This involves the choice of a region of interest and consideration of moments. The AEV automatically incorporates information about bias and variance when one enters or deletes a variable. When the data are collected and the moments are (X'X)/N, the AEV becomes

    AEV = s₁² p₁/N ,   where   s₁² = (Y − X₁b̂₁)'(Y − X₁b̂₁)/(N − p₁) .

When these moments are used, and when only submodels having the same rank of X₁ are considered, the AEV is a monotonic function of C_p. Consequently, in this case both statistics lead to the same conclusion.
The developers of ridge regression felt that omitting variables is an unwise procedure. They felt that the true model should be fit, and if it is desirable to identify important variables, then those variables whose associated estimated coefficients have large magnitudes should be considered most important. Since the variables are scaled by dividing each variable by the square root of its sum of squares, comparisons of these magnitudes are invariant to multiplication of the variable by a constant. In this sense, ridge analysis transforms the problem of variable selection back to a problem of parameter estimation. However, when one considers the objective of variable selection, it is not clear what criterion one would want to use to judge the goodness of different estimation methods.

Except for ridge regression, all of these procedures use the possibly biased least squares estimator b̂₁ that minimizes the sum of squares (Y − X₁b₁)'(Y − X₁b₁). A reason for preferring b̂₁ to other estimators of the form b₁ = Kβ̂ is that b̂₁ yields the optimal value for any of these statistics. That is, given the matrix X₁, R² is maximized and C_p and AEV are minimized using b̂₁.
Even so, the least squares estimator being used here is not the full model estimator β̂₁ but the subset estimator b̂₁. If this estimator is used, then the estimator that minimizes the squares of the residuals from this model is the stagewise estimator β̃_s.

Stagewise regression is a procedure that has been used in economic data analysis. If the model is

    η = x₁'β₁ + x₂'β₂ ,

one might estimate β₁ by b̂₁ = (X₁'X₁)⁻¹X₁'Y and then regress the residuals of this regression onto X₂ to get estimates for β₂. So the stagewise estimator β̃_s is

    β̃_s' = ( b̂₁' , [(X₂'X₂)⁻¹X₂'(Y − X₁b̂₁)]' ) ,

whereas the usual (full model) least squares estimator satisfies β̂₁ = b̂₁ − Aβ̂₂, where A = (X₁'X₁)⁻¹X₁'X₂.

Wallace (1964) has considered the case of two independent variables and has shown that the first component of the stagewise estimator has smaller mean square error than β̂₁, and that the second component attains smaller MSE than β̂₂, when β₂² is sufficiently small relative to Var(β̂₁); the bounds involve ρ, the coefficient of correlation between x₁ and x₂.
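A sketch (not from the text) contrasting the stagewise estimator with the usual full model least squares estimator for two correlated regressors; the data are made up, and with correlated x₁ and x₂ the two estimators generally differ.

```python
import numpy as np

rng = np.random.default_rng(8)
N = 60
x1 = rng.normal(size=N)
x2 = 0.8 * x1 + 0.6 * rng.normal(size=N)         # correlated with x1
y = 1.0 * x1 + 0.4 * x2 + rng.normal(size=N)

X1, X2 = x1[:, None], x2[:, None]

# Stagewise: regress y on x1, then the residuals on x2.
b1_stage = np.linalg.solve(X1.T @ X1, X1.T @ y)
resid = y - X1 @ b1_stage
b2_stage = np.linalg.solve(X2.T @ X2, X2.T @ resid)

# Usual (full model) least squares on (x1, x2) jointly.
X = np.column_stack([x1, x2])
b_full = np.linalg.solve(X.T @ X, X.T @ y)

print(np.concatenate([b1_stage, b2_stage]), b_full)
```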
Chapter III

NEW RESULTS

3.1 Introduction

In Chapter II we have reviewed some of the estimators we will be considering. An attempt was made to present these estimators in one of the three "purpose" classification categories. In this chapter we will not always follow the classification labels presented in the last chapter. We want to assess certain aspects of an estimator's robustness. That is, we want to assess certain characteristics of an estimator (or a slight modification of an estimator) when the estimator is used for a purpose other than the original intended purpose. Sometimes an estimator may perform well but the assumptions involving its implementation may be unwarranted. Problems of this nature will also be discussed.
3.2 Parameter Estimation

Consider the situation in which one desires to estimate the parameter vector β with small IMSE. We shall consider some estimators of β, say b, in the class K. First, consider transforms K which are deterministic; that is, b = Kβ̂ where K is a matrix of constants. We shall only consider those estimators b whose integrated bias error is bounded for all β.

Since W is positive definite, (K − I)'W(K − I) is non-negative definite. Thus, for K ≠ I there is a matrix P such that P'P = I and

    P'(K − I)'W(K − I)P = Λ ,

where Λ is the diagonal matrix of eigenvalues, of which at least one eigenvalue is strictly positive. Let λᵢ be the value of a strictly positive eigenvalue and let Pᵢ be an associated eigenvector with length 1. Then, by letting β be a sufficiently large multiple of Pᵢ, the integrated bias error can be made to exceed any given bound. This is inconsistent with our earlier assumption. Thus, the only estimator in K which is a deterministic transform of β̂ and has bounded integrated bias error for all β is the identity transform.
Now, let us consider cases where there is prior knowledge of upper bounds on some of the parameters, or prior knowledge of upper bounds on functions of the parameters.

First, consider when there is an upper bound on η²(x)/σ² for x in R. Specifically, say that

    η²(x)/σ² < U    for x in R .

We shall consider the shrunken (i.e., shrinking towards 0) estimator studied by Mayer and Willke. The transform K for this estimator is K = δ_MW·I, where δ_MW is a scalar, so that the estimator is a scalar multiple of the least squares estimator. The IMSE of δ_MW β̂ is

    J(W, δ_MW β̂) = δ_MW² V_FM + (1 − δ_MW)² B0 ,        (3.2.1)

where V_FM is the variance error of β̂ and B0 is the bias error of the zero estimator, b ≡ 0. Since the choice of δ_MW that has smallest IMSE involves the unknown parameters, as was the case for a general deterministic K in Chapter II, it is not a useable choice of scalar for an estimator of the form δ_MW β̂.

It is possible to obtain an estimator with smaller IMSE than β̂ has when the parameter space is bounded. To see this, first note that the bound we have assumed on η²(x)/σ² implies that B0 is bounded by U. Now,

    J(W, δ_MW β̂) < J(W, β̂)

when

    δ_MW² V_FM + (1 − δ_MW)² B0 < V_FM .

This is true when

    (B0 − V_FM)/(B0 + V_FM) < δ_MW < 1 .

Also,

    J(W, δ_MW β̂) < J(W, 0)

when

    δ_MW < 2B0/(B0 + V_FM) .

Combining, we have that

    J(W, δ_MW β̂) < min( J(W, 0), J(W, β̂) )

when

    max[ 0, (B0 − V_FM)/(B0 + V_FM) ] < δ_MW < min[ 1, 2B0/(B0 + V_FM) ] .

Based on the above inequalities, an estimator that attains small IMSE for a wide range of B0 is δ*_MW β̂, with

    δ*_MW = max[ 0, (U − V_FM)/(U + V_FM) ] .        (3.2.2)
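The following sketch shows how the bounded-parameter-space rule (3.2.2) might be applied. Taking the integrated variance of the full-model least squares estimator as V_FM = N·tr(W(X'X)⁻¹) is an assumption made for illustration, as are the variable names.

```python
import numpy as np

def shrunken_estimate(Y, X, W, U):
    """Shrunken (Mayer-Willke type) estimator using an a priori bound U.

    Shrinks the full-model least squares estimate by
    delta* = max(0, (U - V_FM) / (U + V_FM)), rule (3.2.2).
    """
    N = X.shape[0]
    XtX_inv = np.linalg.inv(X.T @ X)
    b_ls = XtX_inv @ X.T @ Y                      # full-model least squares
    V_FM = N * np.trace(W @ XtX_inv)              # integrated variance (assumed form)
    delta = max(0.0, (U - V_FM) / (U + V_FM))
    return delta * b_ls
```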
Second, we consider the case where K is stochastic. Consider the Stein estimator that was described in Section 2.2. Since IMSE is just a special case of general mean square error, the results in Section 2.2 about this estimator will apply. Therefore, we assume that the errors are normally distributed and we assume that p ≥ 3.

Now let us compare this stochastic estimator, say b_s, to δ*_MW β̂. Consider the set of parameter values β for which β'Wβ/σ² ≤ U. When β'Wβ/σ² equals U, the IMSE of δ*_MW β̂ equals the IMSE of β̂. Since the IMSE of b_s is always less than or equal to the IMSE of β̂, the IMSE of b_s is less than or equal to the IMSE of δ*_MW β̂ when β'Wβ/σ² = U.

Now let us consider the other extreme, where β = 0. From (3.2.1) and (3.2.2) we have (for U > V_FM)

    J(W, δ*_MW β̂) = [(U − V_FM)/(U + V_FM)]² V_FM .        (3.2.3)

Bhattacharyya has shown that the IMSE of b_s at β = 0 is bounded by a sum of terms involving quantities d_i, i = 3, ..., p. Collecting these terms, we have that the IMSE of b_s is bounded by h·V_FM, where h is the function of the d_i given in (3.2.4). Equating the multiples of V_FM in (3.2.3) and (3.2.4), we see that

    J(W, b_s) < J(W, δ*_MW β̂)    when    U > V_FM (1 + √h)/(1 − √h)        (3.2.5)

and B0 = 0.
If one is unwilling to assume a bound on β'Wβ/σ² smaller than the multiple of V_FM in (3.2.5), then b_s would be preferable no matter what the true value of β'Wβ/σ².

We will now consider the effect that spurious correlations have on the bound in (3.2.5). First, consider the orthogonal case (no spurious correlations), that is, (X'X)⁻¹ = I. We assume that M = I. Then the d_i's are all one, and from (3.2.4)

    h = [ 2 + 2(p − 2)/(N − p + 2) ] / p = 2N / [ p(N − p + 2) ] ,

so that the multiple of V_FM in (3.2.5) is

    (1 + √h)/(1 − √h) = [ √(p(N − p + 2)) + √(2N) ] / [ √(p(N − p + 2)) − √(2N) ] .

If p = 10 and N = 36, as in the 10-factor example of HK (1970b), then

    (1 + √h)/(1 − √h) = 3.06 .

If X'X is any nonsingular matrix and M = I, then the d_i's are the eigenvalues of (X'X). This would be the situation when the correlations between the independent variables are viewed as being spurious. When X'X is ill conditioned, the eigenvalues are very spread out. In this situation the factor (1 + √h)/(1 − √h) will be larger than when X'X = I. We substituted into (3.2.4) the eigenvalues from the 10-factor example of Hoerl and Kennard (1970) to obtain

    (1 + √h)/(1 − √h) = 6.89 .

Thus, we see that the multiple of V_FM in (3.2.5) is larger when spurious correlations are present than in the orthogonal situation. Apparently, in the ill-conditioned situation, the shrunken estimator requires less exact prior knowledge in order to be competitive with b_s.
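For the orthogonal case the threshold in (3.2.5) is easy to evaluate numerically; the short computation below reproduces the value 3.06 quoted for p = 10 and N = 36.

```python
import math

def stein_threshold_multiple(p, N):
    """Multiple of V_FM in (3.2.5) for the orthogonal case (all d_i = 1)."""
    h = 2 * N / (p * (N - p + 2))
    return (1 + math.sqrt(h)) / (1 - math.sqrt(h))

print(round(stein_threshold_multiple(10, 36), 2))   # 3.06
```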
3.2.1. Prior Information About Exactly One of the Elements of β/σ
Now let us consider the case where prior information is expressible in slightly different form. Instead of a bound on η²/σ², we assume an a priori bound exists on one of the elements of β/σ, say β2. We wish to incorporate this prior knowledge into our estimation procedure for β, because we feel that using β̂ will result in an intolerably large IMSE. That is, failure is likely to result if our IMSE is too large. We do not define "too large" and ask the reader to bear with us. An application of this estimator follows in 4.1.

Consider the estimator

    x'b_δ = x' K_δ β̂ ,    with    b_δ = [ β̂1 − δ θ β̂2 ;  δ β̂2 ] ,        (3.2.1.1)

where

    β̂1 = (X1'X1)⁻¹X1'Y ,
    β̂2 = [ x2'(I − X1(X1'X1)⁻¹X1')x2 ]⁻¹ x2'(I − X1(X1'X1)⁻¹X1')Y ,
    θ = (X1'X1)⁻¹X1'x2 ,

and δ is a scalar constant. Note that when δ = 0, b_0 is the least squares estimator for fitting the reduced model η = X1β1; when δ = 1, b_1 = β̂, the full model least squares estimator, and x'b_1 is unbiased for η.

The bias of x'b_δ is

    E(x'b_δ) − η = (1 − δ){ x1'θ β2 − x2 β2 } = (1 − δ)(x1'θ − x2) β2 ,

and the integrated bias is

    B_δ = (1 − δ)² (β2² G/σ²) V_Δ ,

where

    G = x2'(I − X1(X1'X1)⁻¹X1') x2 .

The integrated variance of x'b_δ is

    V_δ = V_RM + δ² V_Δ ,

where V_RM is the integrated variance of the reduced model estimator and V_Δ is the integrated variance of the full model estimator minus V_RM. The IMSE of b_δ is therefore

    J(W, b_δ) = V_RM + δ² V_Δ + (1 − δ)² (β2² G/σ²) V_Δ .

Note that J(W, b_0) and J(W, b_1) are the IMSE for the reduced model estimator and the full model estimator β̂, respectively. It can be shown that J(W, b_δ) is minimized when

    δ = β2² G / (σ² + β2² G) .

Now J(W, b_δ) − J(W, b_1) < 0 when (β2²G − σ²)/(β2²G + σ²) < δ < 1, and J(W, b_δ) − J(W, b_0) < 0 when

    0 < δ < 2 β2² G / (σ² + β2² G) ,

so we would prefer the estimator (3.2.1.1) whenever

    max[ 0, (β2²G − σ²)/(β2²G + σ²) ] < δ < min[ 1, 2β2²G/(β2²G + σ²) ] .        (3.2.1.2)

To proceed further, suppose that it is possible to specify a bound U so that β2²/σ² < U. If we agree to use

    δ* = max[ 0, (UG − 1)/(UG + 1) ] ,

then it follows that J(W, b_δ*) < min( J(W, b_0), J(W, b_1) ) if

    (U − G⁻¹)/(UG + 3) < β2²/σ² < U .
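A small sketch of how the rule δ* = max(0, (UG − 1)/(UG + 1)) might be applied follows; the data layout (a reduced design X1 plus a single extra column x2) and the variable names are illustrative assumptions.

```python
import numpy as np

def delta_estimator(Y, X1, x2, U):
    """Estimator b_delta of Section 3.2.1 with delta chosen from a prior bound U on beta2^2/sigma^2."""
    P1 = X1 @ np.linalg.solve(X1.T @ X1, X1.T)       # projection onto the columns of X1
    M1 = np.eye(len(Y)) - P1
    G = float(x2 @ M1 @ x2)
    b1 = np.linalg.solve(X1.T @ X1, X1.T @ Y)        # reduced-model coefficients
    b2 = float(x2 @ M1 @ Y) / G                      # full-model coefficient of x2
    theta = np.linalg.solve(X1.T @ X1, X1.T @ x2)
    delta = max(0.0, (U * G - 1.0) / (U * G + 1.0))  # rule based on the prior bound
    return np.concatenate([b1 - delta * theta * b2, [delta * b2]])
```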
3.3. A Characterization of Some Biased Estimators

In situations where one is not fitting the true model, there is a desire to achieve small IMSE for the model one has chosen to fit. In the last section we have seen one aspect pertaining to this problem. That aspect is the danger that such an estimator will have an unbounded IMSE. Specifically, if any β in p-dimensional space is possible, and one has chosen a bound U on the IMSE, then there are β and σ² combinations that will yield an IMSE larger than U for any estimator Kβ̂ where K is deterministic and not equal to I. In practice, this fact is not always considered. Our argument rests on the notion that whenever one fits a reduced model to data, he has prior convictions that certain values of β and σ² will not occur. We further assume that he is willing to state his prior convictions in the form of a bound on the parameter space. We will use these assumptions somewhat freely in what follows.

Consider the case where the true response is

    η = X1 β1 + X2 β2 .        (3.3.1)

Let Y (N×1) be a vector of uncorrelated responses and let

    X = ( X1  X2 ) ,    X1 of order N×p1, X2 of order N×p2,

so that E(Y) = X1β1 + X2β2.
The KM, BD, KMH, and ridge estimators can all be characterized as linear transforms of the least squares estimator of the full model. That is,

    b = K β̂ ,    K (p×p).

Define K1 to be the p1×p matrix formed from the first p1 rows of K. Then for the BD estimator, b_BD = (X1'X1)⁻¹X1'Y, so that the matrix K1 for this estimator takes on the value [ I  θ ]. The KMH estimator has a K1 matrix of the form [ I  W11⁻¹W12 ], where W11 and W12 are blocks of the moment matrix W partitioned conformably with (X1, X2). The ridge estimator has a K1 matrix of a form determined by a scalar k; the value of k to be used is based on Y, whence k is stochastic. The KM estimator has a K1 matrix of the form [ I  K_KM ], where K_KM is a matrix of constants. These constants are chosen based on prior bounds on the parameter space.
3.3.1. Consideration of the Estimators of the Form β̂1 + K2β̂2

At this point we will consider a particular class. To simplify the discussion, we shall refer to only the first p1 elements of b. That is,

    b = β̂1 + K2 β̂2 .        (3.3.2)

So that for the BD estimator

    b_BD = [ I  θ ] [ β̂1 ; β̂2 ] = β̂1 + θ β̂2 ,

and for the KMH estimator

    b_KMH = [ I  W11⁻¹W12 ] [ β̂1 ; β̂2 ] = β̂1 + W11⁻¹W12 β̂2 .

We shall consider estimators of the form (3.3.2). In this class of estimators, b_BD attains minimum integrated variance and b_KMH attains minimum integrated bias. Let us proceed to find the K2 that minimizes the IMSE, the sum of integrated variance and integrated bias.
For an arbitrary K2, the integrated mean square error of the estimator (3.3.2) when (3.3.1) holds is

    J(W, b) = N tr{ W11[ (X1'X1)⁻¹ + θM⁻¹θ' + K2M⁻¹K2' − θM⁻¹K2' − K2M⁻¹θ' ] }
              + (N/σ²) β2'{ K2'W11K2 − 2 K2'W12 + W22 } β2 .

Here M = X2'(I − X1(X1'X1)⁻¹X1')X2. This is what the symbol G was in the last section; we will be using the symbol G for another purpose. We want to differentiate J with respect to K2 and set the expression to zero. (Instead of directly differentiating J with respect to K2, one may write K2 = HQ with H square and differentiate with respect to H.) Setting the derivative to zero gives

    ∂J/∂H = 0 .        (3.3.3)

In order that (3.3.3) hold for all β2, it must be that

    W11 K2 [ M⁻¹ + β2β2'/σ² ] = W11 θ M⁻¹ + W12 β2β2'/σ² .

A solution that holds for all β2β2'/σ² is

    K2* = ( W11⁻¹W12 β2β2'/σ² + θ M⁻¹ ) [ β2β2'/σ² + M⁻¹ ]⁻¹ .

It appears that the minimum IMSE estimator of the form (3.3.2) is

    b* = β̂1 + ( W11⁻¹W12 β2β2'/σ² + θ M⁻¹ ) [ β2β2'/σ² + M⁻¹ ]⁻¹ β̂2 .        (3.3.4)

In order to prove that b* is the minimum IMSE estimator of the form (3.3.2), it is necessary to take an arbitrary K2 and show that
    J(W, b*) ≤ J(W, b) .

Let Q be defined so that

    Q = M⁻¹ + β2β2'/σ² ,

so that from (3.3.4), K2*Q = W11⁻¹W12 β2β2'/σ² + θM⁻¹. Then the terms of J(W, b) that involve K2 can be written

    N tr{ K2 Q K2' W11 } − 2N tr{ K2* Q K2' W11 } .        (3.3.5)

Using (3.3.5) for both b and b*, we have

    J(W, b*) − J(W, b) = N tr{ K2* Q K2*' W11 } − 2N tr{ K2* Q K2' W11 }
                         + 2N tr{ K2 Q K2' W11 } − N tr{ K2 Q K2' W11 } .        (3.3.6)

Completing the square in (3.3.6), we have

    J(W, b*) − J(W, b) = − N tr{ (K2* − K2) Q (K2* − K2)' W11 } .        (3.3.7)

Now W11 and Q are positive semidefinite, so (K2* − K2) Q (K2* − K2)' is also positive semidefinite. This means that the entire product is positive semidefinite. Since the trace of a positive semidefinite matrix is nonnegative, we have from (3.3.7) that the IMSE of b* is never larger than the IMSE of b. So b* attains minimum IMSE among all estimators of the form (3.3.2).
The form of b* is of interest for several reasons. First, it can be seen that generally b* involves unknown parameters, and therefore is not generally useable. However, if the design satisfies the BD condition, then K2* = θ and the BD estimator is the minimum IMSE estimator of this form.

Now let us see a way in which the KMH estimator is related to b*. Using the fact that M is positive definite and (see Lemma 1, Mayer and Willke (1973)) that

    [ β2β2'/σ² + M⁻¹ ]⁻¹ = M − M β2 β2' M / (σ² + β2'Mβ2) ,

it can be shown that the expected value of b* is

    E(b*) = β1 + θ β2 + δ_E ( W11⁻¹W12 − θ ) β2 ,        (3.3.8)

where

    δ_E = (β2'Mβ2/σ²) / (1 + β2'Mβ2/σ²) .

The correction term added in (3.3.8) is just a scalar multiple (δ_E) of the KMH minimum bias correction term. Clearly, δ_E is a function of unknown parameters, and δ_E is a monotonically increasing function of β2'Mβ2/σ², bounded above by one.

Equation (3.3.8) can be used to gain different views of the BD and KMH estimators. The BD estimator can be viewed as the minimum variance unbiased estimator of E(b*) when β2'Mβ2 is zero. The KMH estimator can be viewed as the minimum variance unbiased estimator of

    lim_{β2'Mβ2/σ² → ∞} E(b*) .
If the BD and KMH estimators had been developed in order to attain the same expected value as b*, then both estimators do so only when β2'Mβ2/σ² takes on an extreme value (β2'Mβ2/σ² = 0 for the BD estimator, or very large values of β2'Mβ2/σ² for the KMH estimator).

Since we considered E(b*) to gain a different view of the KMH estimator, we will confine our remarks to the KMH estimator. First, compare the relationship between the KMH estimator and E(b*) to the relationship between δ_MW β̂ and the expected value of the minimum IMSE estimator in K. We have not formally considered this last relationship, but certainly it must be similar to the first relationship. Consider δ*_MW β̂, an estimator discussed on page 43. Its expected value is δ*_MW β, and β̂ is unbiased for the limit of E(δ*_MW β̂) as the bound U grows large. Thus, it would appear that these two relationships are very similar. Since the relationships are similar, one might feel that the KMH estimator is similar to β̂ -- an often criticized estimator but also an often used estimator. However, this comparison ignores an important point. The KMH estimator is intended to be used in situations where the integrated bias error is nonzero. The full model least squares estimator β̂ is intended for use when the integrated bias is zero. The KMH estimator is the minimum bias estimator for estimators of the form (3.3.2). However, the increase in integrated variance of the KMH estimator may not be warranted in light of the fact that one is going to end up with a biased estimator anyway.

The KMH estimator has been praised by its authors because one can use it without having any prior knowledge of the parameters. We see this as misleading to potential users of this estimator. As we stated earlier, we assume users have prior convictions that certain values of β and σ² will not occur. Thus the fact that the KMH estimator doesn't use the prior knowledge is actually a hindrance. It is something one can use when he has prior knowledge but for some reason doesn't want to use the prior knowledge. In other words, the KMH estimator is a fancy way of camouflaging one's subjectivity in a maze of matrices. It would seem that if one had some prior knowledge, it would be more honest to try to quantify it instead of hiding it. When one considers that such quantification may lead to an improved estimator, it seems worthwhile to attempt quantification of prior knowledge.
3.3.2. The SK Estimator.

We now consider the Svendsgaard-Kupper (SK) estimator. Consider the deterministic K matrix

    K = [ I   K2 ] ,    K2 = θ + δ ( W11⁻¹W12 − θ ) ,

where δ is a constant, so that

    b_SK = K β̂ = β̂1 + θ β̂2 + δ ( W11⁻¹W12 − θ ) β̂2 .        (3.3.9)

The IMSE of b_SK is

    J = V_BD + B_KMH + δ² τ + (1 − δ)² γ ,

where

    γ = β2' M_δ β2 / σ² ,    with    M_δ = N ( W11⁻¹W12 − θ )' W11 ( W11⁻¹W12 − θ ) ,

and τ is the excess of the integrated variance of the KMH estimator over that of the BD estimator. Let us change the arguments of J(·) and write J(δ). Note that J(0) and J(1) are, respectively, the corresponding BD and KMH expressions for J. Now,

    J(δ) − J(0) < 0    when    0 < δ < 2γ/(γ + τ) ,

and

    J(δ) − J(1) < 0    when    (γ − τ)/(γ + τ) < δ < 1 ,

so we would prefer the estimator (3.3.9) whenever

    max[ 0, (γ − τ)/(γ + τ) ] < δ < min[ 1, 2γ/(γ + τ) ] .        (3.3.10)
62
To proceed further, suppose that it is possible to specify an upper
bound U for
This can be done, for example, in the following way.
y.
Because the BD integrated squared bias can be written as the sum of two
non-negative tenus, namely
and because this quantity shoUld be less than
NQ2
o
J
R
n2dx~,
which is the value of B for that estimator identically equal to zero,
it follows that
2
y <
U when n
o
2
<
U.
Thus, one can construct an upper bound for
If we assume that
max
U,
y <
[0, U-
T
]
< 0 <
min
contained in the interval (3.3.10).
if
<
[1, -l::L]
y+T
If we agree to use
it follows that
J(o*)
by considering
then the interval
U+T
1S
y
min[J(O), J(l)]
n
2 and o 2 .
63
(U--r) •
<
y <
U.
(U+3.)
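A sketch of how the SK rule might be applied, assuming the BD and KMH estimates and the quantities τ (the excess integrated variance of KMH over BD) and the bound U are already available; all names here are illustrative.

```python
import numpy as np

def sk_estimate(b_bd, b_kmh, tau, U):
    """SK estimator: interpolate between the BD (delta = 0) and KMH (delta = 1) estimates.

    delta* = max(0, (U - tau) / (U + tau)), where U is an a priori upper bound on gamma.
    """
    delta = max(0.0, (U - tau) / (U + tau))
    return np.asarray(b_bd) + delta * (np.asarray(b_kmh) - np.asarray(b_bd)), delta
```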
Figure (3.3.2.1) illustrates these concepts.

FIGURE 3.3.2.1. Illustration of the Mechanism for Choosing δ*. [The limits of the interval (3.3.10) are plotted against γ; in the example plotted δ* = 0.60, and δ* lies inside the interval for γ between τ(U − τ)/(U + 3τ) and U.]
3.3.3. Estimation of γ.

It is possible to find an unbiased estimator of γ by using the following easily demonstrated result: if E(t) = θ and Var(t) = V, then

    E(t'At) = θ'Aθ + tr(AV) .

In our case, E(β̂2) = β2 and Var(β̂2) = σ²M⁻¹, so that

    E(β̂2' M_δ β̂2) = β2' M_δ β2 + σ² tr(M_δ M⁻¹) .

Under the usual normality assumptions, (N − p)s²/σ², with s² = Y'[I − X(X'X)⁻¹X']Y/(N − p), is distributed as χ²(N − p), and s² and β̂2'M_δβ̂2 are independent. Since one can readily show that

    E(1/s²) = (N − p)/[(N − p − 2)σ²] ,

it follows that

    γ̂ = (N − p − 2) β̂2' M_δ β̂2 / [(N − p) s²] − τ

is an unbiased estimator of γ (note that tr(M_δ M⁻¹) = τ).
3.3.4. A Confidence Interval for γ.

When one has made a prior assumption, it is prudent to make an a posteriori check of the assumption. Since the use of the SK estimator requires the specification of an upper bound on η²/σ², it is wise to check this. Accordingly, a confidence bound on γ has been derived. This derivation is based on the assumption that the deviations from the true model are normally distributed.

Let M⁻¹ = LL', where L is lower triangular, and let z = L⁻¹β2. If λ_max is the largest eigenvalue of L'M_δL, then by a result in Rao [1965],

    β2'M_δβ2 / β2'Mβ2 = z'L'M_δLz / z'z ≤ λ_max ,

so

    γ ≤ λ_max β2'Mβ2/σ² .

Now

    F = β̂2' M β̂2 / (p2 s²)

is distributed as F'_{ν1,ν2}(λ), the noncentral F distribution with ν1 and ν2 degrees of freedom and noncentrality parameter λ; in this case ν1 = p2 = rank M, ν2 = N − p, and λ = β2'Mβ2/σ². Let F'_{p2,N−p,α}(λ) denote the αth percentile of this distribution. Since Pr[F'_{p2,N−p}(λ) < F] is a decreasing function of λ, the value of Λ, say, which satisfies

    F'_{p2,N−p,α}(Λ) = F

will be the upper bound for all values of λ for which F'_{p2,N−p,α}(λ) ≤ F. Thus

    Pr[ β2'Mβ2/σ² < Λ ] ≥ 1 − α .        (3.3.4.1)

Since

    γ/λ_max ≤ β2'Mβ2/σ² ,        (3.3.4.2)

substituting (3.3.4.2) into (3.3.4.1), it follows that

    Pr[ γ < λ_max Λ ] ≥ 1 − α .

Thus, one would expect γ < λ_max Λ with confidence (1 − α) × 100 percent. As long as U is greater than λ_max Λ, one could be confident that U had not been chosen too small for γ. If U is smaller than λ_max Λ, then it would appear that the data did not substantiate the small choice of U. Varying U based on λ_max Λ would make b stochastic; the IMSE of such a resulting estimator has not been investigated. If U had been carefully chosen, then one should not place a great deal of weight on this confidence interval.
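A numerical sketch of this a posteriori check follows, using SciPy's noncentral F distribution. Treating SciPy's noncentrality parameter as λ = β2'Mβ2/σ² is an assumption of this sketch, as are the function and variable names.

```python
from scipy import stats
from scipy.optimize import brentq

def gamma_upper_bound(F_obs, p2, N, p, lam_max, alpha=0.05):
    """Upper confidence bound for gamma based on the observed F statistic.

    Finds Lambda such that the alpha-th percentile of the noncentral
    F(p2, N-p; Lambda) distribution equals F_obs, then returns lam_max * Lambda.
    """
    def pctl_minus_F(nc):
        return stats.ncf.ppf(alpha, p2, N - p, nc) - F_obs

    if pctl_minus_F(0.0) >= 0:       # even the central F's alpha percentile exceeds F_obs
        return 0.0
    hi = 1.0
    while pctl_minus_F(hi) < 0:      # expand the bracket until the root is enclosed
        hi *= 2.0
    Lambda = brentq(pctl_minus_F, 0.0, hi)
    return lam_max * Lambda
```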
3.3.5. An Application of the Ridge Estimator to Smoothing and Prediction Situations.

Earlier we considered estimators for situations in which the least squares estimator β̂ was for some reason not an acceptable alternative estimator. In some situations this restriction may be made without thorough consideration. For example, a straight line fit to data may be considered smoother than a quadratic fit to these same data. The integrated variance is always smaller when the straight line is fit, and one may feel that he is achieving smaller IMSE by fitting a straight line. Nothing we have discussed so far contradicts this conception if the true quadratic coefficient is small. However, the ridge estimator offers a way to modify all the elements of β̂ for the purpose of achieving smaller IMSE. This method may be preferable in those situations where one's goal is to achieve an estimator with smaller IMSE than β̂.

When σ² is large and the columns of X are highly correlated, the IMSE of β̂ can be very large. In such situations there is not always a great deal of prior knowledge available on any of the elements of β. Therefore, the elimination of variables in order to reduce the integrated variance can be quite subjective and can appear arbitrary.
We shall develop an estimator for use in such a situation. Consider an arbitrary estimator b with IMSE

    J(W, b) = V_b + B_b .

The IMSE of b can be expressed as

    J(W, b) = V_b + (B_b/B0) B0 ,

where B0 is the bias error of the zero estimator. Now, if a priori we know a number φ_b such that B_b/B0 ≤ φ_b for all β, then

    J(W, b) ≤ V_b + φ_b B0 .

We may prefer b1 to b2 if

    V_{b1} ≤ V_{b2}        (3.3.5.1)

and

    φ_{b1} ≤ φ_{b2} .        (3.3.5.2)

When (3.3.5.1) and (3.3.5.2) hold, we would know that the bound on the IMSE of b1 was less than the bound on the IMSE of b2. There may be parameter values such that, even when (3.3.5.1) and (3.3.5.2) hold, the IMSE of b1 exceeds that of b2. We shall temporarily ignore this possibility, and choose the estimator b that attains minimum integrated variance given that the bound φ_b on the ratio is fixed.

The integrated bias of the zero estimator is

    B0 = (N/σ²) β'Wβ .

The ratio of the bias error of an estimator of the form b = Kβ̂ to this integrated bias is

    [ β'(K − I)'W(K − I)β ] / (β'Wβ) ,

and by an application of the Cauchy-Schwarz inequality this ratio can be bounded by a quantity that does not depend on β. We select an estimator of the form b = Kβ̂ where this bound is fixed (at an a priori value φ). As a Lagrangian problem, we want to minimize the integrated variance subject to this constraint. The derivative with respect to K is the sum of a term proportional to W K (X'X)⁻¹, arising from the integrated variance, and a term proportional to (1/k)(K − I), arising from the constraint,        (3.3.5.3)

where k is chosen so that the constraint is satisfied. Setting (3.3.5.3) to zero and solving for K, we obtain the solution matrix. This completes the derivation of this estimator.
We shall now make some comments about this estimator.

First, this estimator is of the general form of a ridge estimator: for a general ridge estimator the transform of the least squares coefficients is (X'X + kC)⁻¹X'X for a positive definite matrix C and a scalar k, so b+ is a generalized ridge estimator.

Second, we shall consider how k may be chosen. Say that a priori we know that η²/σ² < U on R. Then

    J(W, b+) < J(W, β̂)

when the bound V_{b+} + φ_{b+}U is smaller than V_FM. Solving for φ_{b+}, we have that the IMSE of b+ is smaller than the IMSE of β̂ when

    φ_{b+} ≤ (V_FM − V_{b+}) / U .        (3.3.5.4)

Therefore, when η²/σ² < U on R is true, any choice of k so that (3.3.5.4) holds will result in an estimator with smaller IMSE than the IMSE attained by β̂.

Obviously, when one is very conservative and will admit to only a large value for U, then one might as well use β̂. We say this since equation (3.3.5.4) indicates that only small choices of φ_{b+} will then be available. However, we see that equation (3.3.5.4) gives us a method to assess whether U is too large.
3.4 Identification of Explanatory Variables

We shall only discuss the ridge estimator in this section. The problem of how best to select k has received considerable attention. A number of methods for selecting k have been reviewed by R. L. Obenchain (1975). He lists six methods of selecting k besides the use of the least squares estimator.

We would like to list a seventh method. That is, choose k so that the residual sum of squares satisfies

    (Y − Xβ̂_k)'(Y − Xβ̂_k) = N s² ,    where    s² = (Y − Xβ̂)'(Y − Xβ̂)/(N − p) .

The IMSE of such an estimator is difficult to evaluate, since it is a stochastically shrunken estimator. Also, such an estimator is apparently not in S, the class of estimators that satisfy the Stein restriction. However, choosing k in this way does result in an estimator that has some nice properties.

  · If there is no error, then k is zero.
  · It has the same RSS that would be expected to result if β were substituted for β̂_k.

Using the 10-factor example from Hoerl-Kennard (page 71) we find that a k of about .15 satisfies this criterion. Since the other nonzero k's were .007, .02, .039, .11, (.2, .3) and .54, we find that .15 is near the median. Applying this technique to Hoerl and Kennard's second example, we obtain a k of about .08; Hoerl and Kennard suggest using a k between .2 and .4.

Many other characteristics of this method of choosing k can be studied. However, as an objective method, it appears reasonable and gives similar results to other methods. We suggest that its simplicity is a main reason to use it.
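The following sketch implements this residual-matching choice of k for ordinary ridge regression by simple bisection. The target RSS(β̂_k) = N·s² and the assumption that X is already centered and scaled are illustrative, as are the names.

```python
import numpy as np

def ridge_coef(X, Y, k):
    """Ordinary ridge coefficients (X assumed centered and scaled)."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + k * np.eye(p), X.T @ Y)

def choose_k(X, Y, k_hi=10.0, tol=1e-8):
    """Choose k so that RSS(beta_k) equals N * s^2, by bisection."""
    N, p = X.shape
    rss = lambda k: float(np.sum((Y - X @ ridge_coef(X, Y, k)) ** 2))
    target = N * rss(0.0) / (N - p)          # N * s^2 from the least squares fit
    lo, hi = 0.0, k_hi
    while rss(hi) < target:                   # RSS increases with k; expand if needed
        hi *= 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if rss(mid) < target else (lo, mid)
    return 0.5 * (lo + hi)
```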
Chapter IV

APPLICATIONS AND FURTHER COMPARISONS

4.1 An Application of Biased Estimation Theory to an Iterative Maximum Likelihood Search

4.1.1. Background
Parametric dose-response models have great utility in quantifying
the relationship between an agent and a health effect of that agent.
Knowledge of such relationships is key information for the determination
of air quality standards.
The parameters of such models are estimated from observed data.
The maximum likelihood estimator (MLE) of these parameters has good statistical attributes, and so the MLE is often calculated.
Even for simple
mathematical models, this calculation usually involves iterative methods.
Such iterative schemes do not always successfully converge to the MLE.
We have attempted to determine the MLE for the cumulative logistic function adjusted for natural responsiveness. The data were obtained in an epidemiological panel study. Generally, there are no control groups for such studies, and existing iterative methods can perform poorly under such circumstances. Difficulty in using one such method was experienced in this case.
The success of the iterative method that we used depended heavily on our
ability to start the iteration with good starting estimates of the parameters (starting values).
If one set of starting values doesn't result
in convergence, one tries another set.
Complex subjective judgments may
be required to obtain good starting values.
Since starting with the MLE
as a set of starting. values will usually result in convergence one can
74
always be accused of using poor starting values when the iterative scheme
fails.
There are a couple of ways that an iterative scheme can fail.
When
very poor starting values are used, convergence to a relative maximum
may take place, or the specified number of iterations may be used up
before convergence to the MLE has occurred.
Another type of failure
results when the revised estimates overshoot the MLE and get progressively
worse.
This type of failure occurs when a certain matrix requiring in-
version is ill-conditioned.
It is this type of failure that was
experienced in our case, and it is this type of problem that our modification is designed to handle.
This type of failure can be viewed as the fault of unbiased estimation.
At each iteration the revised estimates are derived from a
solution vector.
This solution vector can be considered the least squares
estimator of the parameters of a particular linear model.
Although the
least squares estimator attains minimum variance among unbiased estimators, this variance can be large.
The theory of biased estimation is based on the fact that mean
squared error (bias plus variance) can be smaller for a slightly biased
estimator with much smaller variance, than for an unbiased estimator.
This theory is applied to develop a class of estimators appropriate for
this problem.
The result is an iterative maximum likelihood search
strategy in which the user of the method simply specifies a bound on the
difference between the MLE and the starting value for the proportion of
natural responses.
The method using the least squares estimator can be
viewed as considering this difference to be unbounded.
Thus, the user
can specify a very crude bound and expect to do better by incorporating
this bound into the modified method.
75
First, we present the theory involved.
This includes a description
of the dose-response model, a description of the iterative method, the
development of a modification to this iterative method, and a discussion
on the subjective input required to use the modification.
Second, we do an empirical evaluation of this modification. The data used in the evaluation are described. Next, we describe the evaluation, list the results, and draw conclusions.
4.1.2. Theoretical Development

4.1.2.1. Description of the Cumulative Logistic Adjusted for Natural Responsiveness Dose-Response Model

Consider a dose-response experiment where dose levels Z_i at N different levels are applied to n_i subjects, and at each dose level r_i subjects respond. If the administration of a dose Z causes a proportion P(Z) of the test subjects to respond, and other independent factors acting on the test subjects during the experiment cause a proportion C to respond, then the total expected proportion responding will be

    P'(Z) = P(Z) + C − P(Z)C = C + (1 − C) P(Z) .        (4.1.2.1)

This equation is called Abbott's Formula. If P(Z) is the cumulative logistic function

    P(Z) = 1 / (1 + exp[−(A + BZ)]) ,

then P'(Z) is known as the cumulative logistic adjusted for natural responsiveness.
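A direct transcription of (4.1.2.1) with the logistic form of P(Z); the function and parameter names are illustrative.

```python
import math

def adjusted_logistic(Z, A, B, C):
    """Cumulative logistic adjusted for natural responsiveness (Abbott's formula)."""
    P = 1.0 / (1.0 + math.exp(-(A + B * Z)))   # cumulative logistic P(Z)
    return C + (1.0 - C) * P                    # P'(Z) = C + (1 - C) P(Z)
```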
This model is motivated by the concept of a tolerance. An individual's tolerance is defined as that level of dose Z0 such that doses higher than Z0 always cause a response. The purpose of the model is to make inferences about the distribution of tolerances in a target population.

In fitting such models, it is usual to assume that the n_i subjects exposed to a dose level Z_i were randomly selected from the target population. Assume in addition that the proportion of tolerances less than Z is P(Z) for each Z. When these two assumptions are correct, the probability of observing r_i responses from the n_i subjects dosed at level Z_i, for i = 1, 2, ..., N, is

    L(A, B, C) = ∏_{i=1}^{N} (n_i choose r_i) [P'(Z_i)]^{r_i} [1 − P'(Z_i)]^{n_i − r_i} .

In general, L(A, B, C) is called the likelihood function. Those choices of A, B and C (denoted by Â, B̂, and Ĉ) that maximize L(A, B, C) are the MLE.
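A numerical sketch of the log of this likelihood, suitable for handing to a general-purpose optimizer; the names and the use of scipy.special for the binomial term are illustrative assumptions.

```python
import numpy as np
from scipy.special import gammaln

def log_likelihood(params, Z, r, n):
    """Log of L(A, B, C) for the logistic model adjusted for natural responsiveness."""
    A, B, C = params
    Z, r, n = np.asarray(Z), np.asarray(r), np.asarray(n)
    P = 1.0 / (1.0 + np.exp(-(A + B * Z)))
    Pp = C + (1.0 - C) * P                                   # Abbott-adjusted probability
    binom = gammaln(n + 1) - gammaln(r + 1) - gammaln(n - r + 1)
    return float(np.sum(binom + r * np.log(Pp) + (n - r) * np.log(1.0 - Pp)))
```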
4.1.2.2. Description of an Existing Iterative Method.

Based on an approximation to the first derivative of the log likelihood evaluated at the MLE by the first order terms of the Taylor-Maclaurin expansion, Finney (1971) has derived the expressions used in iteratively solving for the MLE of the parameters of the probit dose-response model adjusted for natural responsiveness. These expressions can be easily applied to the logistic model described in (4.1.2.1). When A_s, B_s and C_s are the starting values for the parameters A, B, and C, respectively, at iteration number s, the revised estimates are

    A_{s+1} = a_0 ,    B_{s+1} = a_1 ,    C_{s+1} = C_s + a_2 .

The formulae for a_0, a_1 and a_2 are algebraically the same as the formulae for the least squares estimator of the parameters of the linear model

    y_i = α_0 x_{0i} + α_1 x_{1i} + α_2 x_{2i} + ε_i ,    i = 1, 2, ..., N,        (4.1.2.2.1)

where the ε_i are uncorrelated random variables with zero mean and common variance σ². The x_{ji} are all determined from A_s, B_s and C_s, and the formula for the y_i also involves r_i. The formulae below define the x_{ji} and y_i for i = 1, 2, ..., N and j = 0, 1, and 2.
    p_i = ( r_i/n_i − C_s ) / (1 − C_s) ,
    P_i = 1 / (1 + exp[−(A_s + B_s Z_i)]) ,
    Q_i = 1 − P_i ,
    w_i = P_i Q_i ,

together with a further weighting coefficient formed from Q_i, P_i and C_s/(1 − C_s). The x_{0i}, x_{1i}, x_{2i} and y_i are then built from these quantities, each carrying a factor (n_i w_i)^(1/2), with y_i also involving the difference p_i − P_i.

If C_1 is obtained from inspection of the data, then A_1 and B_1 can be obtained by iteration to maximize the likelihood for this value of C_1. The appropriate expressions involved in this first stage of iteration are algebraically equivalent to the least squares estimators for the parameters of the model

    y_i = α_0 x_{0i} + α_1 x_{1i} + ε_i ,    i = 1, 2, ..., N ,        (4.1.2.2.2)

which is obtained from (4.1.2.2.1) by setting α_2 to zero. Of course, starting values for A and B are required even for this first stage of iteration; for B one could use zero and set

    A_0 = ln[ (P̄ − C_1)/(1 − P̄) ] ,    where    P̄ = Σ r_i / Σ n_i .
This choice of starting value for A_0 maximizes the likelihood function for these values of B_0 and C_1. Having obtained revised estimates for A and B after a number of iterations using this first stage, one could proceed to the second stage of iteration using (4.1.2.2.1) and these revised estimates as starting values.
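A sketch of these starting values, assuming C_1 has already been picked from the data; the function and variable names are illustrative.

```python
import math

def starting_values(r, n, C1):
    """First-stage starting values: B = 0 and A chosen so the adjusted logistic
    reproduces the overall observed proportion of responses."""
    P_bar = sum(r) / sum(n)                       # overall observed proportion responding
    A0 = math.log((P_bar - C1) / (1.0 - P_bar))
    return A0, 0.0, C1
```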
Based on empirical results, it appears that the success of such a two stage scheme depends on how well C_1 is chosen and on what criterion is used to determine when to stop first stage iterations and transition to the second stage. Very stringent criteria would be wasteful of computer time if C_1 is poorly chosen, and their use does not guarantee success when C_1 is chosen close to C. When a very loose criterion is employed, e.g. omitting the first stage entirely, failure rates are high. The failures are usually the result of high correlations among the x_{ji}, resulting in a singularity in the matrix requiring inversion. Invariably, such a singularity is preceded by an unreasonably large overestimation of the magnitude of α_2.
When we regard (4.1.2.2.1) as the true model whose parameters are to be
estimated, the problems mentioned in the last paragraph are a familiar
weakness of least squares estimation.
That is, even though the least
squares estimator achieves minimum variance among unbiased estimators,
this variance can be very large when the independent variables are
highly correlated.
In those cases cited above, we view this variance
as being intolerably large.
Possibly a slightly biased estimator with
much smaller variance can achieve small enough mean squared error (bias
plus variance) to usually avoid that type of problem.
80
4.1.2.3. The Estimator via the Consideration of Biased Estimation

Consider the class of biased estimators (a_0, a_1, a_2), each obtained as a linear transform of the least squares solution vector determined by constants k_0, k_1 and k_2.        (4.1.2.3.1)

We shall consider using the a_i in place of the α̂_i in a single phase iteration scheme. Note that if the k_i are zero, then the a_i are the least squares estimators of the model described in (4.1.2.2.2), which are biased when α_2 is nonzero. It is also possible to obtain the α̂_i from (4.1.2.3.1) by letting the k_i take on certain values.

We shall use an error criterion that seems appropriate for this problem and show how the k_i can be chosen so that, for a wide range of α_2 values, better performance in terms of this criterion is attained by an estimator using the chosen k_i's than is attained by using either of the estimators mentioned in the above paragraph. The choice of the k_i is made by specifying an upper bound on α_2²/σ².

It turns out that the formula for the k_i that we suggest using involves the specified upper bound and the correlation between the x_{ji}. Generally, this correlation changes from one iteration to the next, so that even though only one bound is used, the k_i vary between iterations. In those cases where successful convergence takes place, the estimators are similar to those that were suggested for the two stage scheme. Initially, the x_{ji} are highly correlated and the k_i are zero. In later iterations, the correlations become smaller and the a_i's approach the α̂_i's. Thus, instead of selecting a criterion for deciding when to jump from the first to the second stage of iterations, the use of the proposed estimator eliminates the need to make this decision by automatically incorporating it into the computation process based on the value of the specified upper bound.
Let the true model at each iteration be

    η_i = α_0 x_{0i} + α_1 x_{1i} + α_2 x_{2i} ,

and consider approximating η over an interval of interest ξ_0 ≤ x ≤ ξ_1 with

    η̂ = a_0 x_0 + a_1 x_1 + a_2 x_2 ,

where (a_0, a_1, a_2) is a vector of estimated regression coefficients. A reasonable measure of the closeness of η̂ to η is the integrated mean square error

    J = (N/σ²) ∫_{ξ_0}^{ξ_1} E(η̂ − η)² dx .

This is the situation where one may consider using the estimator discussed in Section 3.2.1. Using that estimator suggests that we set the k_i in terms of a scalar δ and the least squares estimators μ̂_0 and μ̂_1 of μ_0 and μ_1 (respectively) for the model

    x_{2i} = μ_0 x_{0i} + μ_1 x_{1i} + error ,    i = 1, 2, ..., N .        (4.1.2.3.2)

The value of δ suggested in Section 3.2.1 is in this case

    δ = max[ 0, (U s² − 1)/(U s² + 1) ] ,

where s² is the sum of squared deviations from fitting (4.1.2.3.2) and U is our upper bound on α_2²/σ².

Now, let us consider how to choose U. One way to choose a value U that bounds α_2²/σ² is to find an upper bound for α_2² and a lower bound for σ². Therefore, first consider α_2² as being (C − C_s)², since C_{s+1} = α_2 + C_s. Partial justification for considering α_2 = C − C_s can be based on a result from large sample theory: "... if first approximations are of nonzero efficiency, one cycle of computation will yield fully efficient estimates" (Finney, 1971). By (4.1.2.1), C is a proportion, so that if we agree to set C_1 to some value between zero and 1, then (C − C_1)² = α_2² will be less than one.

Considering only the variability contributed to y_i by p_i, we have

    σ² = Var(y_i) = [ (1 − C)(C + (1 − C)P_i(1 − P_i)) ] / [ (1 − C_s)(C_s + (1 − C_s)P_i(1 − P_i)) ] .

When we assume that the starting values are equal to the respective true parameters of the model, we have that σ² is one. For these reasons we suggest using U = 1. Smaller values for U, such as P̂_1², might be used, but we have only investigated the case where the transform is deterministic; the use of any information gained from the data in fixing the k_i would require that the transform be treated as random.

In practice, the error may be too large even when U is carefully chosen. So if the iteration routine fails, one should consider varying U. However, it would seem unnecessary to choose a value of U larger than one.
4.1.3. Empirical Evaluation of the Modification

Hammer et al. (1974) reported on data from student nurses in Los Angeles who completed daily symptom diaries during the period of their training. The total numbers of yes/no responses for four symptom categories, pooled over days having the same range of maximum hourly oxidant levels, are listed in Table 4.1.1 along with the midpoints of these oxidant ranges. The symptom categories are Headache, Eye Discomfort, Cough and Chest Discomfort with no accompanying fever, chills or temperature. Due to these restrictions, the positive responses are indicators of only mild personal discomfort.

Assuming independence between days of responses from the same student nurse, the numbers of positive responses are taken to be binomially distributed with parameters P_i and n_i, where

    P_i = C + (1 − C)/(1 + exp[−(A + B Z_i)]) .

The MLE's for this model are listed in Table 4.1.2 for each symptom.

Iterations were run using three different starting values for C and four different bounds on α_2²/σ². The starting values for C were:

(a) The observed proportion of total responses corresponding to the lowest oxidant level. For all four symptom categories this proportion turned out to differ from the MLE for C by less than 0.01. The minimum observed proportion was used as a starting value for C in the case of Chest Discomfort in order to avoid the computation of the logarithm of a negative number in the process of getting a starting value for A.

(b) One half the proportion used in (a).

(c) Zero.

The starting value for B was zero, and the starting value for A was

    A_s = ln[ (P̄ − C_s)/(1 − P̄) ] ,    where    P̄ = (Σ r_i)/(Σ n_i) .

The four different values for U used were P̂_1², P̂_9², 1, and ∞. Infinity corresponds to using the least squares estimator for (4.1.2.2.1).

Iterations were continued until the change in the likelihood was zero (to the accuracy of the computer), or until 30 iterations, or until the program faulted. The maximum number of iterations used on any one run was 25.

Table 4.1.3 lists the occurrences of successful convergence. From this table it can be seen that the percent of successful convergences was higher in all cases when a finite bound on α_2²/σ² was used. When P̂_1² was specified as a bound, the failure was due to convergence to the MLE of A and B for the starting value C_1. The highest percent of successful convergences when a starting value for C of either 0 or ½P̂_1 was used occurred when U was set equal to one.
4.1.4. Conclusions

If good starting values are used, convergence can always take place using Finney's formula. However, how good these starting values must be depends on the data.

The modification of Finney's formula developed here yields successful convergence for starting values that are too poor to be used directly in Finney's formula. When very poor starting values for C are used, neither the modification nor Finney's formula works all the time. However, for some values of U, a higher success rate can be obtained using the modification. Moreover, when seemingly adequate C starting values are used, successful convergence is obtained using the modification in cases where the use of Finney's formula had failed.

The use of smaller U values seems to yield poorer success rates for poor choices of C starting values, but can achieve improved success rates for seemingly adequate C starting values.

It is recommended that a number of C starting values be used with this modification. When no other information is available, U should be taken to be one. If convergence fails due to a singularity, a tighter bound should be tried. A plot of the likelihood obtained at the final iteration versus the corresponding C value is helpful for deciding if more starting values should be tried and what values to use.
Table 4.1.1. Los Angeles Student Nurse Data

[Columns: daily maximum hourly oxidant level Z_i (range midpoints, in pphm); total exposed nurse-days n_i; and, for each of the four symptom categories (Headache, Eye Discomfort, Cough, Chest Discomfort), the number of yes responses r_i and the proportion of yeses p̂_i.]
Table 4.1.2. Maximum Likelihood Estimates for Each Symptom

    Symptom              A               B               C
    Headache             - 4.626920811   .04070218878    .09535299228
    Eye Discomfort       - 5.045747967   .09309640264    .04166289253
    Cough                - 9.949532964   .1690304341     .09442183274
    Chest Discomfort     -13.26019686    .2243214541     .01772926043
Table 4.1.3. Successful (S) and Failing (F) Iteration Attempts

[For each starting value of C (the observed proportion P_1, one half of P_1, and zero) and each bound U (P̂_1², P̂_9², 1, and ∞), the table records success (S) or failure (F) of the iteration for each of the four symptoms, together with the percent of successes; an asterisk indicates that failure occurred due to a program fault. The percent of successful convergences was higher with each finite bound than with U = ∞.]
4.2. A COMPARISON OF SOME ESTIMATORS OF THE FORM Kβ̂

4.2.1. Introduction

In this section we will compare some estimators that are of the form Kβ̂. The comparisons will be done in such a way that some light will be shed on answers to such questions as:

· For poor designs (highly correlated independent variables), is the BD estimator preferable to the KMH estimator? Does the SK estimator achieve a sizeable reduction in IMSE as compared to the KMH estimator when the scalar δ is chosen conservatively?

· If one's goal is the reduction of IMSE, are there full model estimators (i.e., estimators of the form Kβ̂ where K is nonsingular) that are competitive with reduced model estimators in achieving small IMSE?

· How does b+, the estimator developed in Section 3.3.5, compare to the KMH and BD estimators? Since b+ is a generalized ridge estimator, the results of such comparisons are indicative of comparisons between the ridge estimator and these two other estimators.

In the general setting of linear models, the answers to these questions are in terms of complex matrix expressions whose interpretations often depend on the values of unknown coefficients and on the coordinates of the design points. In general, these coefficients and these coordinates are free to vary in multidimensional space. For these reasons we shall restrict the scope of our comparisons. We shall consider a particular model, a simple class of designs, and a certain criterion. The model is discussed in Section 4.2.2. The class of designs is discussed in Section 4.2.3. The criterion and the motivation for considering it are discussed in Section 4.2.4. An explanation of the comparison is in Section 4.2.5. The results of the comparison are described in Section 4.2.6, and our interpretations of these results are given in Section 4.2.7.
4.2.2. The Model

We consider the situation where the true model is not known. We are interested in estimating the response surface over a particular region of interest. We want to use a model that is general enough to detect important irregularities in the response surface, and yet we don't want a model that is so complex that it has little chance of verification in subsequent experiments. In many situations, the most complex model that can be used is the second degree polynomial. We shall assume that the second degree polynomial in two variables is the true model. Specifically, we assume that the true model is of the form

    η(x1, x2) = β0 + β1 x1 + β2 x2 + β11 x1² + β22 x2² + β12 x1 x2 .

We assume that the β's are unknown coefficients. Later, we will assume that an a priori bound on η²/σ² can be specified. Such an assumption implies some knowledge of the β's. However, no additional specific knowledge of the β's will be assumed.
4.2.3. The Class of Designs

We take the region of interest to be the square bounded by −1 ≤ x1 ≤ 1 and −1 ≤ x2 ≤ 1. The designs we shall consider will be generated by the scalar ζ. These are nine-point designs. The coordinates of the nine points are

    (x1, x2) = (0, 0), (1, 1), (−1, −1), (ζ, −ζ), (−ζ, ζ),
               ((1+ζ)/2, (1−ζ)/2), ((1−ζ)/2, (1+ζ)/2),
               (−(1+ζ)/2, −(1−ζ)/2), (−(1−ζ)/2, −(1+ζ)/2) .

Figures 4.2.1 and 4.2.2 show the plots of these nine points when ζ is 1 and when ζ is 1/2, respectively. The correlation coefficient of x1 and x2 is (1 − ζ²)/(1 + ζ²). The mean of the design points for x1 and x2 is always zero, irrespective of what value ζ assumes. When ζ is one, we have a completely orthogonal design. As ζ approaches 0, x1 and x2 become increasingly correlated, until all points lie on the line x1 = x2 when ζ is zero. We will consider five designs in this class, specified by ζ = 1, .75, .5, .25 and .125.
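A short sketch that generates these nine points for a given ζ and checks the correlation between x1 and x2; purely illustrative.

```python
import numpy as np

def design_points(zeta):
    """Nine-point design generated by the scalar zeta on the square [-1, 1]^2."""
    h = (1 + zeta) / 2.0
    l = (1 - zeta) / 2.0
    return np.array([(0, 0), (1, 1), (-1, -1), (zeta, -zeta), (-zeta, zeta),
                     (h, l), (l, h), (-h, -l), (-l, -h)])

for zeta in (1.0, 0.75, 0.5, 0.25, 0.125):
    x1, x2 = design_points(zeta).T
    # empirical correlation versus the closed form (1 - zeta^2)/(1 + zeta^2)
    print(zeta, round(np.corrcoef(x1, x2)[0, 1], 3), round((1 - zeta**2) / (1 + zeta**2), 3))
```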
For this model and this region of interest, the moment matrix is

    W = Ω ∫_R x x' dx =
        [ 1     0     0     1/3   1/3   0   ]
        [ 0     1/3   0     0     0     0   ]
        [ 0     0     1/3   0     0     0   ]
        [ 1/3   0     0     1/5   1/9   0   ]
        [ 1/3   0     0     1/9   1/5   0   ]
        [ 0     0     0     0     0     1/9 ] ,

where x' = (1, x1, x2, x1², x2², x1x2) and Ω is the reciprocal of the area of R.
FIGURE 4.2.1. Locations of the design points when ζ = 1.
FIGURE 4.2.2. Locations of the design points when ζ = 1/2.
4.2.4. The Criterion Used for Comparison

We believe that IMSE is a reasonable choice for comparison of estimators. However, the computation of the IMSE of biased estimators generally requires the specification of the coefficients of the model. Such specification would limit the generality of the results of the comparisons.

We shall instead compare estimators by equating their integrated variance (when possible) and observing a bound on the ratio between the integrated bias of each estimator and the integrated bias of the zero estimator. The two ratios, for the two estimators being compared, are then examined; we would prefer the estimator whose ratio is smaller. Let us see what inference can be made about the IMSE of the two estimators after comparing these ratios.

Consider two estimators, b1 and b2, whose respective IMSE is

    J_i = V_i + B_i ,    i = 1, 2 .

If the bias error of the zero estimator is B0, then let φ_i, for i = 1 and i = 2, be the respective bounds on the ratio of B_i to B0. Then we have

    J_i = V_i + (B_i/B0) B0 ≤ V_i + φ_i B0 .

When we have equated the integrated variances so that V1 = V2, then φ1 < φ2 implies that the bound on the IMSE of b1 is less than the bound on the IMSE of b2 for all B0. For certain values of the coefficients, it may be that the IMSE of b2 is smaller than the IMSE of b1. However, with little specific knowledge of the coefficients, this seems an objective method for comparing estimators. We shall call b1 a more efficient estimator than b2 when φ1 is less than φ2, given that V1 = V2.
When the two V_i's cannot be equated, we consider the difference of the IMSE bounds, set this difference equal to zero, and then solve for B0. When a bound M on B0 satisfies

    B0 ≤ M < (V2 − V1)/(φ1 − φ2) ,

then we know that b1 is more "efficient" than b2.

We interject a word of caution at this point. For many criteria, it is possible to find optimal estimators. An optimal estimator may perform well for a particular criterion, but it may be poor when the criterion is modified slightly. The optimal estimator for the criterion used here is the b+ estimator. Since this estimator is similar to the ridge estimator, it is tempting to infer that the results of these comparisons apply to the ridge estimator. However, the fact that the scalar k of the ridge estimator is not generally chosen in the manner that we will choose it weakens this inference.
4.2.5. Explanation of the Comparison

In this section we are interested in comparing four estimators:

· The full model estimator.
· The b+ estimator.
· The BD estimator.
· The KMH estimator.

Each of these estimators is a deterministic transform of the least squares estimator β̂; a typical estimator is b = Kβ̂. The full model estimator, the BD estimator and the KMH estimator have an integrated variance that depends on the design only. The only one of the four estimators whose integrated variance can be modified once the design is fixed is the b+ estimator. We shall vary the scalar k of the b+ estimator so as to equate the integrated variance error of b+ with that of either the BD or KMH estimator. Only choosing k equal to zero will equate the integrated variance error of b+ with that of the full model estimator. Therefore, the estimator b+ will be compared with the full model estimator when k is fixed so that the integrated variance of the KMH estimator and the integrated variance of b+ are equal.
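A sketch of how the integrated variance of an estimator b = Kβ̂ might be evaluated for these designs. Taking V = N·tr(W K (X'X)⁻¹ K') is an assumption consistent with the way the variance error has been used here, and all names are illustrative.

```python
import numpy as np

def integrated_variance(K, X, W):
    """Integrated variance (in units of sigma^2) of the estimator b = K beta_hat."""
    N = X.shape[0]
    XtX_inv = np.linalg.inv(X.T @ X)
    return N * np.trace(W @ K @ XtX_inv @ K.T)
```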
4.2.5.1. Method of Obtaining the Bound on the Ratio of Bias Errors

Consider an estimator of the form b = Kβ̂. The bias error of this estimator is

    B_b = (N/σ²) β'(K − I)'W(K − I)β .

The bias error of the zero estimator is obtained by setting K to zero and is

    B0 = (N/σ²) β'Wβ .

The ratio of these two bias errors is

    [ β'(K − I)'W(K − I)β ] / (β'Wβ) .        (4.2.5.1.1)

A version of the Cauchy-Schwarz inequality states that "Let A be a p.d. m × m matrix and u ∈ E_m (i.e., u is an m-vector); then

    sup_{x ∈ E_m} (u'x)² / (x'Ax) = u'A⁻¹u ,

and the supremum is attained at x* = A⁻¹u." Rao (1965). We shall apply this fact to the integrand of (4.2.5.1.1). In this case, letting A = W and u = (K − I)'x for a point x in R, we have

    (β'(K − I)'x)² ≤ (β'Wβ) · x'(K − I)W⁻¹(K − I)'x .

By the monotone convergence theorem the bound may be integrated over R, so that the ratio (4.2.5.1.1) is bounded by

    φ = Ω ∫_R x'(K − I)W⁻¹(K − I)'x dx = tr[ (K − I)W⁻¹(K − I)'W ] .
4.2.6. Results

Table 4.2.1 lists the results of our comparisons.

First, let us consider the comparison between the BD and the b+ estimator. We see that for all choices of ζ, the bound φ of the b+ estimator is smaller than the bound φ of the BD estimator. This means that, for all values of the integrated bias of the zero estimator, the b+ estimator has a smaller upper bound on its IMSE. Thus, the b+ estimator with this choice of k is more efficient than the BD estimator.

Secondly, comparing the KMH and b+ estimators, we see this same relationship between the bounds on the ratio. Thus b+ with this choice of k is more efficient than the KMH estimator.

Thirdly, we compare the full model estimator with b+ when k has been chosen to equate the integrated variance of b+ to the integrated variance of the KMH estimator. For ζ = 1, we have

    (V_FM − V_{b+}) / (φ_{b+} − φ_FM) = (4.25 − 3)/(.0857 − 0) ≈ 14.6 ,

so if one could specify a bound smaller than 14.6 on η²/σ², one would achieve smaller IMSE using b+ with k fixed as above rather than using the full model estimator. This result holds with ζ set at one. As ζ was decreased toward zero, this result held for larger values of the integrated bias of the zero estimator. When ζ is .125,

    (V_FM − V_{b+}) / (φ_{b+} − φ_FM) = (4020 − 1210)/(.212 − 0) ≈ 13000 .
Fourthly, we compare the KMH and full model estimators. When ζ = 1, then

    (V_FM − V_KMH)/(φ_KMH − φ_FM) = (4.25 − 3)/(3 − 0) ≈ .4 ,

and when ζ = .125,

    (V_FM − V_KMH)/(φ_KMH − φ_FM) = (4020 − 1210)/(3 − 0) ≈ 1000 .

We see that the KMH estimator is preferable to the full model estimator only for choices of bounds that are smaller than the bounds that were true for the MVFR estimator.

Fifthly, we compare the BD and full model estimators. When ζ = 1, then

    (V_FM − V_BD)/(φ_BD − φ_FM) = (4.25 − 2.15)/(7.25 − 0) ≈ .29 ,

and when ζ = .125,

    (V_FM − V_BD)/(φ_BD − φ_FM) = (4020 − 34.6)/(3.99 − 0) ≈ 1000 .

So the BD and KMH estimators are very similar in their comparison to the full model estimator.

Lastly, we compare the BD and KMH estimators. When ζ = 1, then

    (V_KMH − V_BD)/(φ_BD − φ_KMH) = (3 − 2.15)/(7.25 − 3) = .2 .

When ζ is 3/4, 1/2, 1/4 and 1/8, this bound is .14, .67, 67.4, and 1187.3, respectively. Thus, the BD estimator seems preferable to the KMH estimator only when the independent variables are highly correlated.
4.2.7. Interpretations of the Results

Let us see what light has been shed on the answers to the questions posed in Section 4.2.1. For the first question, it would appear that these comparisons indicate that as the independent variables become more correlated, the BD estimator becomes more clearly preferable. We must qualify this statement, since the answer depends on the values of the unknown parameters. Here it would seem that the SK estimator could be used. However, the SK estimator can do no better than the BD estimator in achieving small integrated variance. The KM estimator may be preferable when the integrated variance of the BD estimator is too large.

It would seem that we have answered the second question in the affirmative. It appears that b+ is more efficient than either the BD or KMH estimators. Since this estimator is similar to the ridge estimator, one might expect that the ridge estimator is more efficient than either the BD or KMH estimators. However, it may be that smaller values of k are used in practice. If this is the case, then b+ will have characteristics that are similar to the full model estimator.
Table 4.2.1. Comparison of Biased Estimators

[For each design ζ = 1, 3/4, 1/2, 1/4, 1/8 (with the corresponding correlation r between x1 and x2), the table lists: the variance error of the BD estimator and of b+ with k chosen to match it, the value of k used, and the bounds on the bias error ratio for the BD estimator and for b+; then the variance errors of the full model and KMH estimators and of b+ with k chosen to match the KMH variance, the value of k used, and the bounds on the bias error ratio for the KMH estimator and for b+. Representative values cited in Section 4.2.6 include, for ζ = 1: V_BD = 2.15, φ_BD = 7.25, V_FM = 4.25, V_KMH = 3, φ_KMH = 3, φ_{b+} = .0857; and for ζ = 1/8: V_FM = 4020, V_KMH = 1210, V_BD = 34.6, φ_BD = 3.99.]
4.3. AN APPLICATION OF BIASED ESTIMATION TO THE LOW DOSE EXTRAPOLATION PROBLEM

4.3.1. Introduction

In screening environmental contaminants for their effects on the health of a large population, often the goal is to obtain information about the effects when the exposure dose of the contaminant is low. The exposure to low levels of a contaminant may result in a serious health effect (death or presence of tumors) in only a small proportion of the population. However, even a proportion that might normally be considered small may be of concern to public health officials when the target population is large (say, over a million).

When the response rate is low, say less than one in a hundred, there is a fair chance that a reasonable size sample (say, 50 individuals) will contain no affected individuals. Thus, there is a real danger that hazardous contaminants can pass undetected through such a screening.

When all outcomes are the same (no detections), the estimated variance is zero. However, the variance must have been nonzero if we have missed detection due to using too small a sample size. Therefore, when we do make a detection we must realize that it is unlikely that the observed proportions will coincide with proportions that would be obtained in repeated experiments. In order that we are not misled by such random fluctuation, we may plan at the outset to limit the amount of detailed information to be extracted from the data. Although using large numbers of experimental units can improve precision, there is always a limit on the number of experimental units available.
always a lir.lit on the number of experimental units available.
When one conducts an experiment with a small munber of experimental
units, one is taking a risk that the infotJRation to be derived from the
experimental data lacks sufficient precision.
Since we have ruled out
increasing the number of experimental units, we shall confine our
remarks to two alternatives that can also improve precision.
First, one might decide to allocate the experimental units so that
the barest minDlllUTI of detailed infonnation can be extracted from the data.
Sudl a sacrifice would be made in order that this minimum amount of
information is as precise as possible.
For example, when the shape
of the response curve is unknown and the number of experimental units
is
fixe~
then a two-point design can more precisely estimate the
differential resprnlse between the two points than is possible when same
units are allocated to other dose levels.
llowever, confining the design
to two levels sacrifices information about the gradient.
Second, one might decide to extrapolate. That is, choose experimental doses higher than the interval of interest, fit a curve to these points, and use the points lying on the curve in the interval of interest as estimates of the response. In practice, a few units are usually tested at the background dose as a control. Thus, one might prefer to call this method an interpolation rather than an extrapolation. However, the term extrapolation is usually used here. Perhaps this term persists because the number of control units is usually small; that is, the resulting precision is only able to detect background response levels of, say, five percent.
Usually a straight line is chosen as the curve fit to the dose levels when interpolation is used. There are a number of reasons to choose a linear function. First, the parameters of a linear function can be estimated more precisely (less variance) than the parameters of a more complex function. Second, most dose-response curves are sigmoid in shape. Thus, intervening points on a line (connecting points on the lower concave portion of the dose-response curve) will lie above (on the average) the true response curve. See Figure 4.3.1.

Here, we shall discuss some methods that can be employed to increase the precision of estimates of the dose-response curve. We assume that the experimenter is able to dose at any reasonable level that he desires.
First, we assume that the true function is unknown, that the goal of the experiment is the estimation of Δ (the differential response between dose levels ξ₀ and ξ₁), that a two-point design at dose levels ξ₀ and ξ₂ > ξ₁ is employed, that the estimator for Δ (say Δ̂₂) is based on linear interpolation between ξ₀ and ξ₂, and that only the variance of Δ̂₂ is the criterion used. In this setting we find that the variance of Δ̂₂ is smallest when ξ₂ is as large as possible.

Second, we assume that the true model is a quadratic function. Since Δ̂₂ will be biased, we use the mean square error (MSE) criterion rather than the variance criterion. We find that the square of the quadratic coefficient divided by the variance of the experimental error must be bounded in order to guarantee that the MSE of Δ̂₂ is smaller using ξ₂ > ξ₁ rather than ξ₂ = ξ₁.
FIGURE 4.3.1

[Figure: response plotted against dose, showing a concave-upward response function and the straight line joining its values at doses 1 and 3.] The overprediction of a concave upward response function at a dose equal to 2. The straight line passes through the response function at dose levels of 1 and 3.
Third, we consider a particular quadratic, namely β₂x². Since the true model is unknown, we cannot deny that this may be the true form of the function. We find under special circumstances that we would want to choose ξ₂ only slightly larger than ξ₁ in order to minimize MSE. We cannot recommend this as a solution, but only consider it as information to motivate making further constraining assumptions about the form of the dose-response curve.

Fourth, we assume that ξ₂ has been chosen arbitrarily larger than ξ₁. We assume that the true dose-response curve is a quadratic function with positive coefficients. We then find functions that bound this class of functions, and estimate them. Finally, we show what level lower than ξ₂ to use in a subsequent experiment to best refute the upper bound on this class. We also recommend how the sample size should be chosen for this subsequent experiment.
4.3.2 An Example

Consider a population of experimental units (humans, mice, etc.). Conceptually, it is possible to imagine the average response, η(x), that would be obtained if the whole population were administered a dose x. We can further imagine that the function η could be defined for dose levels other than x. Although it would be impossible to verify by experimentation, it is conceptually possible to view η(x) as a function defined for all dose levels x in some interval. We call η(x) the true response.

Let us assume that the experimenter has interest in determining information about the response on an interval R = {x | ξ₀ < x < ξ₁}, which is a subinterval of the interval where η(x) is defined. For example, ξ₀ might be the background concentration of an agent, and ξ₁ might represent a very small increase over ξ₀ that could be realized if no environmental control were exercised with regard to this agent. The term "very small" means that the concentration would be much smaller than typical concentrations that laboratories are designed to handle.
Let y be the vector of observed responses from N experimental units which have been exposed to doses (x₁, x₂, ..., x_N). Some or all of these dose levels may be the same. Let η(x) = (η(x₁), η(x₂), ..., η(x_N))′ be the true response vector. We assume that

    y = η(x) + ε,

where ε is a vector of random uncorrelated errors with zero means and common variance σ². Biological differences, errors in applying the dose, sampling error, and instrument error are some of the sources of error that make the vector ε nonzero.
So far we have made no assumptions about the functional form of η(x). A minimal amount of useful information about η(x) is provided by the difference

    Δ = η(ξ₁) − η(ξ₀).

The parameter Δ is most meaningful when η is a monotonic increasing function of dose. This population parameter can be estimated rather precisely as compared to trying to estimate a more complex aspect of η(x) on the interval R.

For reasons such as these, the experimenter may be content with an experimental design that can provide information only about Δ. This is especially true when the number of experimental units is small.
We shall discuss only the case where two dose levels are used. If one obtains n₀ response measurements at ξ₀ and n₁ response measurements at ξ₁, then an unbiased estimator for Δ is the difference of the two means,

    Δ̂₁ = Ȳ_ξ₁ − Ȳ_ξ₀.

Here, Ȳ_ξ₁ and Ȳ_ξ₀ are the averages of the responses made at dose levels ξ₁ and ξ₀, respectively. The variance of Δ̂₁ is

    Var Δ̂₁ = σ²(1/n₀ + 1/n₁).
It may be judged that this variance is too large. Such a judgment is usually based on the sizes of the type I and type II errors. In order to compute the type II error, one must specify how small a difference one desires to detect.

When these computations indicate that the variance is too large, one could decide to increase n₀ and n₁. We assume that this alternative is not available. We will consider extrapolating. That is, instead of using ξ₁ as a dose level, we will dose at a level higher than ξ₁ and, using linear interpolation, estimate Δ. In this case, our estimator for Δ is

    Δ̂₂ = (ξ₁ − ξ₀)β̂₁,

where β̂₁ is the estimated slope from fitting the straight line by least squares. If we make n₂ and n₀ response measurements at ξ₂ (ξ₂ > ξ₁) and ξ₀, respectively, then the variance of Δ̂₂ is

    Var Δ̂₂ = σ²(ξ₁ − ξ₀)²(1/n₀ + 1/n₂)/(ξ₂ − ξ₀)².

If n₂ = n₁, then the variance of Δ̂₂ will be less than the variance of Δ̂₁ whenever

    (ξ₁ − ξ₀)/(ξ₂ − ξ₀) < 1.

Thus, based on only the variance, we would prefer that ξ₂ be greater than ξ₁. This result comes about because we are ignoring the fact that η(x) may not be a straight line.

From Figure 4.3.1, we see that on the average Δ̂₂ will tend to overestimate Δ. This tendency is not taken into account when only variance is considered.
In order to partially account for this tendency, let us assume that η(x) is only slightly more complex than a straight line. We will assume that η(x) is a quadratic in x, namely

    η(x) = β₀ + β₁x + β₂x².

Under this assumption, Δ̂₂ is a biased estimator for Δ, while Δ̂₁ remains unbiased. The expected value of Δ̂₂ is

    E(Δ̂₂) = (ξ₁ − ξ₀)E(β̂₁) = (ξ₁ − ξ₀)(β₁ + aβ₂),

where a = ξ₂ + ξ₀. Since

    Δ = (ξ₁ − ξ₀)β₁ + (ξ₁² − ξ₀²)β₂,

the mean square error (MSE) of Δ̂₂ is then

    MSE(Δ̂₂) = σ²(ξ₁ − ξ₀)²(1/n₀ + 1/n₂)/(ξ₂ − ξ₀)² + β₂²(ξ₁ − ξ₀)²(ξ₂ − ξ₁)²,    (4.3.2.1)

and Δ̂₂ will attain smaller MSE than Δ̂₁ when (assuming n₁ = n₂)

    β₂²/σ² < (1/n₀ + 1/n₁)[1 − (ξ₁ − ξ₀)²/(ξ₂ − ξ₀)²]/[(ξ₁ − ξ₀)²(ξ₂ − ξ₁)²].    (4.3.2.2)

Thus, Δ̂₂ is preferable to Δ̂₁ only when (4.3.2.2) holds.
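As a purely illustrative check of the comparison above, the following Python sketch evaluates (4.3.2.1) and the bound (4.3.2.2) as reconstructed here; the numerical values of the doses, error variance, group sizes, and curvature are hypothetical assumptions, not values from the study.

    # Hypothetical numerical check of (4.3.2.1)-(4.3.2.2); every parameter value
    # below is an illustrative assumption, not a value taken from the study.

    def mse_delta1(sigma2, n0, n1):
        # The direct estimator is unbiased, so its MSE equals its variance.
        return sigma2 * (1.0 / n0 + 1.0 / n1)

    def mse_delta2(sigma2, n0, n2, xi0, xi1, xi2, beta2):
        # Variance of the interpolation estimator plus its squared bias under
        # the quadratic truth eta(x) = beta0 + beta1*x + beta2*x**2.
        var = sigma2 * (xi1 - xi0) ** 2 * (1.0 / n0 + 1.0 / n2) / (xi2 - xi0) ** 2
        bias = beta2 * (xi1 - xi0) * (xi2 - xi1)
        return var + bias ** 2

    sigma2, n0, n1, n2 = 0.04, 25, 25, 25       # illustrative error variance and group sizes
    xi0, xi1, xi2, beta2 = 0.0, 1.0, 5.0, 0.01  # illustrative doses and curvature

    print("MSE of Delta1 :", mse_delta1(sigma2, n0, n1))
    print("MSE of Delta2 :", mse_delta2(sigma2, n0, n2, xi0, xi1, xi2, beta2))

    # Bound (4.3.2.2) on beta2**2 / sigma2 for Delta2 to beat Delta1 when n1 = n2:
    bound = ((1.0 / n0 + 1.0 / n1) * (1.0 - (xi1 - xi0) ** 2 / (xi2 - xi0) ** 2)
             / ((xi1 - xi0) ** 2 * (xi2 - xi1) ** 2))
    print("beta2^2/sigma2 =", beta2 ** 2 / sigma2, "  bound =", bound)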
4.3.2.1 An Example where Bias Is More Important than Variance

In the last section we considered that the true model is a quadratic. From (4.3.2.1) we see that when β₂ is zero, one attains smaller MSE for larger values of ξ₂. We have also shown how β₂²/σ² must be bounded to attain smaller MSE using ξ₂ rather than ξ₁. We shall now consider a situation where the optimum choice for ξ₂ is only slightly larger than ξ₁.

We assume that the response η(x) is an incidence rate. Therefore, η(x) is always less than one. We assume that the true response is a second degree polynomial in x, namely

    η(x) = β₂x².

We have set β₀ and β₁ to zero for our convenience. However, in a sense, this choice of β₀ and β₁ yields a response model whose curvature is somewhat extreme compared to the case where, in fact, β₀ and β₁ are positive.

It is often assumed when working with proportions that the variance is due solely to sampling error. We shall also make this assumption. However, it should be kept in mind that this is probably a lower bound on the true variance in many practical applications. If the variance is actually larger, then our assumption will yield a value of ξ₂ which is smaller than the optimum value. However, the realization that this assumption may be wrong does justify using the least squares estimator Δ̂₂ instead of some weighted least squares estimator.
With our assumptions about the model and the form of the variance, we have that

    MSE(Δ̂₂) = (ξ₁ − ξ₀)²β₂ξ₂²(1 − β₂ξ₂²)/[n₂(ξ₂ − ξ₀)²] + β₂²(ξ₁ − ξ₀)²(ξ₂ − ξ₁)²    (4.3.2.1.1)

(the sampling-error contribution from the background dose ξ₀ is omitted here; it vanishes when ξ₀ = 0, the case considered below). Taking the derivative of MSE(Δ̂₂) with respect to ξ₂, setting it to zero, and multiplying it by (ξ₂ − ξ₀)³/[2β₂(ξ₁ − ξ₀)²], we have

    (ξ₂ − 2β₂ξ₂³)(ξ₂ − ξ₀)/n₂ − ξ₂²(1 − β₂ξ₂²)/n₂ + β₂(ξ₂ − ξ₁)(ξ₂ − ξ₀)³ = 0.    (4.3.2.1.2)

If we multiplied this expression out we would find that the only term that does not have a β₂ in it is −ξ₂ξ₀/n₂. Consider the situation where ξ₀ is zero. Then we can divide (4.3.2.1.2) by ξ₂³β₂. We are left with

    −ξ₂/n₂ + ξ₂ − ξ₁ = 0.

Solving for ξ₂ we have

    ξ₂ = ξ₁n₂/(n₂ − 1).

Using this value for ξ₂ we have that

    MSE(Δ̂₂) = β₂ξ₁²/n₂ − β₂²ξ₁⁴/(n₂ − 1).

The ratio of MSE(Δ̂₂) to the MSE of Δ̂₁ is

    1 − β₂ξ₁²/[(n₂ − 1)(1 − β₂ξ₁²)].

We see that for typically small values of β₂ξ₁² there will be only slight improvement.
We have derived the optimal choice for ξ₂ using a number of assumptions that seem weighted in favor of not interpolating at all. In practice, larger choices of ξ₂ may be justifiable on other grounds. In view of the fact that ξ₁²β₂ will be small in this application, we do not recommend that ξ₂ be selected in this way.
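To make the preceding result concrete, the following Python sketch evaluates the optimal ξ₂ and the MSE ratio derived above; the values chosen for ξ₁, β₂, and n₂ are illustrative assumptions only.

    # Illustrative evaluation of the optimum derived above; xi1, beta2 and n2
    # are assumed values chosen only to show the computation.

    def optimal_xi2(xi1, n2):
        # Optimal extrapolation dose when eta(x) = beta2*x**2, xi0 = 0, and the
        # variance is binomial sampling error: xi2 = xi1 * n2 / (n2 - 1).
        return xi1 * n2 / (n2 - 1)

    def mse_ratio(xi1, beta2, n2):
        # Ratio MSE(Delta2) / MSE(Delta1) at that optimal xi2.
        return 1.0 - beta2 * xi1 ** 2 / ((n2 - 1) * (1.0 - beta2 * xi1 ** 2))

    xi1, beta2, n2 = 1.0, 0.05, 20          # beta2*xi1**2 = 0.05, a low incidence rate
    print("optimal xi2:", optimal_xi2(xi1, n2))       # only slightly larger than xi1
    print("MSE ratio  :", mse_ratio(xi1, beta2, n2))  # close to 1, i.e., only slight gain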
4.3.3 Obtaining Bounds on the Dose-Response Curve

In the last section we have seen that one would not optimally choose ξ₂ much larger than ξ₁ when a straight line is used for extrapolation and the true model is a very restrictive quadratic. Let us say one does not know the true model. Then he cannot rule out that the true model may behave similarly to the quadratic that was used in the previous section. That is, the MSE may be minimized for the true model only when ξ₂ is quite close to ξ₁. Now MSE consists of both bias and variance, and variance can be controlled by choosing ξ₂ large. Therefore, it seems that when ξ₂ is large and one fits a straight line, one needs to be most concerned about the bias.

When one fits a straight line to dose-response data on the grounds that it is conservative, one is making some assumptions about the form of the true dose-response curve. He is assuming that the true dose-response curve is both monotonically increasing and concave upward over the interval from ξ₀ to ξ₂. When these assumptions hold, then one can infer that the straight line will lie above the true response on the average. This is often considered a rather weak inference. It could be that Δ is zero, and yet the estimate of Δ that is computed from the straight line fit suggests that the environmental contaminant poses a real danger to the health of some large target population. Rather than take action to control the contaminant, it may be decided that it is preferable to obtain a better bound on Δ.
If the assumptions about the shape of the dose-response curve are true, then doing a new experiment using a dose level ξ₂′, where ξ₂′ is less than ξ₂, will result in an estimate for Δ which is still conservative, and yet this new estimate will have a smaller expected value than the previous estimate for Δ (see Figure 4.3.3.1). However, many animals might be needed before one could safely reject the old estimate for Δ and accept the new estimate.

Thus, there seems to be a need to decide how many animals should be used for such a subsequent experiment. There is also a need to determine ξ₂′.

It would seem that this problem is solvable by viewing it in the light of classical hypothesis testing. Viewed in this light, we might want to test whether η(ξ₂′) is as large as η̂_L(ξ₂′), the estimate of η(ξ₂′) based on linear interpolation between ξ₀ and ξ₂, or whether η(ξ₂′) is zero. The problem is not that simple. Consider choosing the point ξ₂′, which is in the interval [ξ₁, ξ₂], that necessitates the use of the least number of animals to reject η(ξ₂′) = η̂_L(ξ₂′) (and accept η(ξ₂′) = 0) when η(ξ₂′) = η̂_L(ξ₂′) is false. Now, if we select ξ₂′ = ξ₂ from the interval [ξ₁, ξ₂], then η̂_L(ξ₂′) − 0 is maximized among choices of ξ₂′. Thus, it would seem that this criterion would indicate that a second experiment be repeated at the two doses ξ₀ and ξ₂.
Figure 4.3.3.1

[Figure: a concave-upward response curve with two straight lines drawn from (ξ₀, η(ξ₀)), one to (ξ₂′, η(ξ₂′)) and one to (ξ₂, η(ξ₂)).] The figure shows that the bias error at ξ₁ resulting from the linear interpolation between (ξ₀, η(ξ₀)) and (ξ₂′, η(ξ₂′)) is smaller than the bias error at ξ₁ resulting from the linear interpolation between (ξ₀, η(ξ₀)) and (ξ₂, η(ξ₂)). This result holds for any function that is monotone concave upward on (ξ₀, ξ₂) whenever ξ₁ < ξ₂′ < ξ₂.
Such a second experiment does not really seem to relevantly address the problem of obtaining a better bound on Δ. On the other hand, if we choose ξ₂′ to minimize the bias error of η̂_L(ξ₁), we are led to choose ξ₂′ equal to ξ₁. Presumably, we do not have sufficient animals to test at ξ₁ and obtain sufficient precision.

In order to objectively select a "reasonable" point ξ₂′ that lies between ξ₁ and ξ₂, we make an additional assumption. We tentatively assume that

    η(x) = β₀ + β₁x + β₂x²,

where β₀, β₁ and β₂ are non-negative. This assumption about the signs of β₀, β₁ and β₂ is not so constraining that we cannot define β₀, β₁ and β₂ to be those values that make η(x) coincide with the true unknown response at ξ₀ and ξ₂ when the true response is monotonically increasing. Before we consider the inferences that can be made when this assumption holds, let us first consider how constraining the assumptions about β₀, β₁, and β₂ are.
First, consider the function

    η_A(x) = α₀ + α₁x + α₂x²,

where this function is equal to the true unknown function at the points ξ₀, ξ₁ and ξ₂. In order for η_A(x) to satisfy these requirements it need not follow that η_A(x) is monotone or concave upward on (ξ₀, ξ₂). However, if the true unknown function is both monotone and concave upward, then the parameters of η_A(x) must satisfy the conditions

    α₂ > 0   and   α₁ > −α₂(ξ₁ + ξ₀).

The condition α₂ > 0 arises from the fact that η_A(ξ₁) must be less than the coordinate corresponding to ξ₁ on the straight line passing through (ξ₀, η_A(ξ₀)) and (ξ₂, η_A(ξ₂)). The condition α₁ > −α₂(ξ₁ + ξ₀) arises since η_A(ξ₀) < η_A(ξ₁).
Since α₁ need not be positive, it does not follow that our assumption is general enough to allow η(x) to coincide with the true dose-response model at ξ₀, ξ₁ and ξ₂. However, α₂ is a measure of how close the true responses at ξ₀, ξ₁ and ξ₂ lie to a straight line. In this sense, then, when α₂ is small, the true unknown response will be approximately linear. When α₂ is small, α₁ cannot be too negative. This is because α₁ must be larger than −α₂(ξ₁ + ξ₀). Now let us consider the error of η(x) at x = ξ₁ when α₁ is negative.

Since η(x) coincides with the true unknown response (and with η_A(x)) at ξ₀ and ξ₂, it follows that

    β₀ + β₁ξ₀ + β₂ξ₀² = α₀ + α₁ξ₀ + α₂ξ₀²    (4.3.3.1)

and

    β₀ + β₁ξ₂ + β₂ξ₂² = α₀ + α₁ξ₂ + α₂ξ₂².    (4.3.3.2)

The error of η(x) at x = ξ₁ is

    η(ξ₁) − η_A(ξ₁) = (β₀ − α₀) + (β₁ − α₁)ξ₁ + (β₂ − α₂)ξ₁².    (4.3.3.3)

Substituting the value of β₀ − α₀ defined by (4.3.3.1) into (4.3.3.3), and substituting the value of β₂ − α₂ defined by the difference of (4.3.3.1) and (4.3.3.2) into (4.3.3.3), we have that

    η(ξ₁) − η_A(ξ₁) = (β₁ − α₁)[(ξ₁ − ξ₀) − (ξ₁² − ξ₀²)/(ξ₂ + ξ₀)].    (4.3.3.4)

Equation (4.3.3.4) shows that η(ξ₁) − η_A(ξ₁) is always positive. Also, the equation shows that η(ξ₁) − η_A(ξ₁) is minimized if β₁ is taken to be zero when α₁ is negative. It appears that η(ξ₁) − η_A(ξ₁) decreases as β₂ gets larger. This is true if the true response is a quadratic, so that α₁ is fixed. However, α₁ can be expected to change with β₂ when the true unknown response is more complex than a quadratic.
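A quick numerical check of the identity (4.3.3.4), as reconstructed above, can be run in Python; the coefficient values below are arbitrary illustrations, with β₂ and β₀ determined by forcing η and η_A to agree at ξ₀ and ξ₂.

    # Numerical check of (4.3.3.4) as stated above. The alpha's, xi's and beta1
    # are arbitrary illustrative choices; beta2 and beta0 are then fixed by
    # requiring eta and eta_A to agree at xi0 and xi2.
    xi0, xi1, xi2 = 0.0, 1.0, 4.0
    a0, a1, a2 = 0.02, -0.01, 0.03           # eta_A(x) = a0 + a1*x + a2*x**2 (a1 negative)
    b1 = 0.05                                # any beta1 >= 0

    b2 = a2 - (b1 - a1) / (xi2 + xi0)        # from the difference of (4.3.3.1) and (4.3.3.2)
    b0 = (a0 + a1 * xi0 + a2 * xi0**2) - b1 * xi0 - b2 * xi0**2   # from (4.3.3.1)

    eta = lambda x: b0 + b1 * x + b2 * x**2
    eta_A = lambda x: a0 + a1 * x + a2 * x**2

    lhs = eta(xi1) - eta_A(xi1)
    rhs = (b1 - a1) * ((xi1 - xi0) - (xi1**2 - xi0**2) / (xi2 + xi0))
    print(lhs, rhs)   # the two sides agree, and both are positive here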
Thus, we consider

    η(x) = β₀ + β₁x + β₂x²

to be the quadratic that coincides with the true response at ξ₀, ξ₁ and ξ₂ if the coefficients are positive. When this is not possible, we assume that the coefficients are non-negative and that η(x) coincides with the true response at ξ₀ and ξ₂. In this case, our assumption is conservative since η(ξ₁) − η_A(ξ₁) is positive.
Now, consider the class of functions Q, where η(x) is in Q if

    η(x) = β₀ + β₁x + β₂x²

and β₀, β₁ and β₂ are non-negative. A relevant property of functions in Q is the following: for any η(x) in Q, there are two corresponding functions η_L(x) and η_Q(x) such that these functions are in Q,

    η_Q(x) ≤ η(x) ≤ η_L(x)   for ξ₀ ≤ x ≤ ξ₂,

and η_L(ξ_i) = η_Q(ξ_i) = η(ξ_i) for i = 0 or 2. We can show that this holds by considering

    η_L(x) = δ₀ + δ₁x   and   η_Q(x) = δ₀ + δ₂x²,

where δ₀, δ₁ and δ₂ are determined by the values of η(x) at ξ₀ and ξ₂. For these particular functions, δ₁ > 0, δ₂ > 0, and the inequalities above hold whenever x < ξ₀ + ξ₂. Since 0 ≤ ξ₀ < ξ₂, we have that {x | ξ₀ < x < ξ₂} is a subset of the set {x | x < ξ₀ + ξ₂}. We require β₂ > 0 so that η(x) ≤ η_L(x), and we require β₁ > 0 so that η_Q(x) ≤ η(x). Under our assumptions about β₁ and β₂ we know that, since η_Q(x) ≤ η(x) ≤ η_L(x) for ξ₀ ≤ x ≤ ξ₂, then η_Q(ξ₁) − η_Q(ξ₀) ≤ Δ ≤ η_L(ξ₁) − η_L(ξ₀). Hence, when our assumptions are true about η(x), we know that obtaining bounds on Δ is straightforward.
From a two-point design and for any η(x) in Q, we can find unbiased estimators for the functions η_L(x) and η_Q(x) corresponding to η(x). These estimators are

    η̂_L(x) = {ξ₂Ȳ_ξ₀ − ξ₀Ȳ_ξ₂ + (Ȳ_ξ₂ − Ȳ_ξ₀)x}/(ξ₂ − ξ₀)    (4.3.3.5)

and

    η̂_Q(x) = {ξ₂²Ȳ_ξ₀ − ξ₀²Ȳ_ξ₂ + (Ȳ_ξ₂ − Ȳ_ξ₀)x²}/(ξ₂² − ξ₀²).    (4.3.3.6)

We have that

    E(Ȳ_ξ₀) = η(ξ₀)   and   E(Ȳ_ξ₂) = η(ξ₂).

It follows that η̂_L(x) and η̂_Q(x) are unbiased for η_L(x) and η_Q(x), respectively. Further, when Var Ȳ_ξ₀ = Var Ȳ_ξ₂ and Ȳ_ξ₀ and Ȳ_ξ₂ are uncorrelated, then by the Gauss-Markov theorem we have, at the point x, that η̂_L(x) and η̂_Q(x) are best linear unbiased estimates of η_L(x) and η_Q(x), respectively. When Var Ȳ_ξ₀ ≠ Var Ȳ_ξ₂, then since we are using a two-point design we know that Ȳ_ξ₀ and Ȳ_ξ₂ are the maximum likelihood estimators for η(ξ₀) and η(ξ₂), respectively, so η̂_L(x) and η̂_Q(x) are the maximum likelihood estimators for η_L(x) and η_Q(x) because of the invariance property of maximum likelihood estimators.

We see that if the true dose-response function is in Q, then

    E[η̂_Q(x)] ≤ η(x) ≤ E[η̂_L(x)]   for ξ₀ ≤ x ≤ ξ₂.

Thus, by making a rather nonrestrictive assumption, we find that we can bound the true response function. However nonrestrictive the assumption, we must admit to a degree of arbitrariness and thus the need to validate this assumption.
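To illustrate how the bound estimates in (4.3.3.5) and (4.3.3.6) might be computed from a two-point design, here is a minimal Python sketch; the doses and observed incidence rates are hypothetical values, not data from the study.

    # Hypothetical two-point experiment: doses xi0, xi2 and observed mean
    # responses ybar0, ybar2 are illustrative values only.

    def eta_L_hat(x, xi0, xi2, ybar0, ybar2):
        # Linear bound estimate (4.3.3.5): the line through (xi0, ybar0) and (xi2, ybar2).
        return (xi2 * ybar0 - xi0 * ybar2 + (ybar2 - ybar0) * x) / (xi2 - xi0)

    def eta_Q_hat(x, xi0, xi2, ybar0, ybar2):
        # Quadratic bound estimate (4.3.3.6): a curve of the form c0 + c2*x**2
        # passing through the same two points.
        return (xi2**2 * ybar0 - xi0**2 * ybar2 + (ybar2 - ybar0) * x**2) / (xi2**2 - xi0**2)

    xi0, xi2 = 0.0, 1.0            # background dose and experimental dose
    ybar0, ybar2 = 0.0, 0.50       # observed incidence rates at the two doses
    x = 0.1                        # a dose in the low-dose interval of interest

    lower = eta_Q_hat(x, xi0, xi2, ybar0, ybar2)
    upper = eta_L_hat(x, xi0, xi2, ybar0, ybar2)
    print("estimated bounds on eta(x):", lower, "<= eta(x) <=", upper)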
In the validation we assume that the response is an incidence rate and that the frequency is binomially distributed. Consider a subsequent two-point experiment at x = ξ₀ and x = ξ₂′. Since our goal is to make inferences about η, we clearly only wish to consider those values of ξ₂′ such that ξ₂′ < ξ₂. The estimates of η_L(ξ₂′) and η_Q(ξ₂′) can be computed from the data of the first experiment. These are

    η̂_L(ξ₂′) = {ξ₂Ȳ_ξ₀ − ξ₀Ȳ_ξ₂ + (Ȳ_ξ₂ − Ȳ_ξ₀)ξ₂′}/(ξ₂ − ξ₀)

and

    η̂_Q(ξ₂′) = {ξ₂²Ȳ_ξ₀ − ξ₀²Ȳ_ξ₂ + (Ȳ_ξ₂ − Ȳ_ξ₀)ξ₂′²}/(ξ₂² − ξ₀²).

We want to choose ξ₂′ and the number of experimental units so that it is possible to "optimally" test whether there is a significant difference between η_L(ξ₂′) and η_Q(ξ₂′). Since we don't know these values, we do the test on η̂_L(ξ₂′) and η̂_Q(ξ₂′). We quantify this test as follows:
Let Ȳ_ξ₂′ = r₂′/n₂′ be the observed incidence rate at ξ₂′. The likelihood ratio test of the hypothesis

    H₀: r₂′ is distributed binomially with parameters n₂′ and η̂_L(ξ₂′)

versus the alternative hypothesis

    H₁: r₂′ is distributed binomially with parameters n₂′ and η̂_Q(ξ₂′)

is to reject H₀ when (using the approximate large-sample distribution)

    −2 logₑ L₁ > C₁,

where

    L₁ = [η̂_L(ξ₂′)/Ȳ_ξ₂′]^(n₂′Ȳ_ξ₂′) [(1 − η̂_L(ξ₂′))/(1 − Ȳ_ξ₂′)]^(n₂′(1 − Ȳ_ξ₂′)).

The constant C₁ is determined by selecting the significance level of the test. Let us now consider how the required sample size could be determined.
One way to select the sample size is to determine a proportion so that if the observed proportion is less than this value, then H₀ is rejected at the selected level. It would seem that if Ȳ_ξ₂′ ≤ η̂_Q(ξ₂′), then one would expect to reject H₀, since E(η̂_Q(ξ₂′)) = η_Q(ξ₂′) < η_L(ξ₂′). When Ȳ_ξ₂′ = η̂_Q(ξ₂′), then L₁ becomes

    L₁ = [η̂_L(ξ₂′)/η̂_Q(ξ₂′)]^(n₂′η̂_Q(ξ₂′)) [(1 − η̂_L(ξ₂′))/(1 − η̂_Q(ξ₂′))]^(n₂′(1 − η̂_Q(ξ₂′))).    (4.3.3.7)

Since L₁ = L₂^(n₂′), where L₂ is the factor of (4.3.3.7) corresponding to a single observation, we would reject H₀ at the 5% significance level (one-sided) when

    −2 logₑ L₁ ≥ χ²₁,₀.₁₀.

The required number of observations to reject H₀ when Ȳ_ξ₂′ ≤ η̂_Q(ξ₂′) is therefore

    n₂′ ≥ χ²₁,₀.₁₀/(−2 logₑ L₂).    (4.3.3.8)

By choosing ξ₂′ to maximize −logₑ L₂, equation (4.3.3.8) shows that we will minimize our sample size requirements.
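The sample-size rule in (4.3.3.8) can be carried out numerically. The Python sketch below searches a grid of candidate doses ξ₂′ for the one that minimizes the required group size; the first-experiment doses and incidence rates used here are purely illustrative.

    import math

    # Illustrative first-experiment inputs (not values from the study): doses and
    # observed incidence rates used to form the bound estimates at a trial dose.
    xi0, xi2 = 0.0, 1.0
    ybar0, ybar2 = 0.0, 0.50
    CHI2_1_010 = 2.706   # upper 10% point of chi-square(1), i.e., a one-sided 5% test

    def eta_L_hat(x):
        return (xi2 * ybar0 - xi0 * ybar2 + (ybar2 - ybar0) * x) / (xi2 - xi0)

    def eta_Q_hat(x):
        return (xi2**2 * ybar0 - xi0**2 * ybar2 + (ybar2 - ybar0) * x**2) / (xi2**2 - xi0**2)

    def required_n(x):
        # Per-observation log likelihood ratio when the observed rate equals
        # eta_Q_hat(x); sample size from (4.3.3.8): n >= chi2 / (-2 log L2).
        pL, pQ = eta_L_hat(x), eta_Q_hat(x)
        log_L2 = pQ * math.log(pL / pQ) + (1.0 - pQ) * math.log((1.0 - pL) / (1.0 - pQ))
        return CHI2_1_010 / (-2.0 * log_L2)

    # Search candidate doses xi2' strictly between the background dose and xi2.
    candidates = [0.05 + 0.01 * k for k in range(95)]
    best = min(candidates, key=required_n)
    print("best xi2' is about", round(best, 2),
          "requiring roughly", math.ceil(required_n(best)), "units per group")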
4.3.3.1 Results

In this section we present tables listing our prescribed values of ξ₂′ and n₂′ for various choices of Ȳ_ξ₀, Ȳ_ξ₂, ξ₀ and ξ₂. First, we will describe how these tables were constructed, and then we will comment on the implications of this analysis.

For fixed values of Ȳ_ξ₀, Ȳ_ξ₂, ξ₀ and ξ₂, we used (4.3.3.5) and (4.3.3.6) to compute η̂_L(ξ₂′) and η̂_Q(ξ₂′), respectively. Then n₂′ was set to 1, expression (4.3.3.7) was used to compute L₁, and from this we used (4.3.3.8) to compute n₂′. We varied ξ₂′ until n₂′ was minimized. These values of ξ₂′ that minimized n₂′, say ξ₂*, are correct to the second decimal place. Table 4.3.3.1.1 shows n₂* (the minimized value of n₂′) and ξ₂* for various values of Ȳ_ξ₀ and Ȳ_ξ₂ when ξ₀ = 0 and ξ₂ = 1. Table 4.3.3.1.2 shows n₂* and ξ₂* when ξ₀ = 1 and ξ₂ = 10. Both tables show that n₂* decreases as either Ȳ_ξ₂ increases or Ȳ_ξ₀ decreases. These tables also show that, except when Ȳ_ξ₀ is zero, ξ₂* decreases as either Ȳ_ξ₂ increases or Ȳ_ξ₀ decreases. When Ȳ_ξ₀ is zero, then ξ₂* increases with Ȳ_ξ₂.
Now let us consider the implications of these tables. Assume the experimenter has used a two-point design and estimated Δ based on η̂_L(x). He considers conducting a subsequent experiment to check on the conservativeness of his estimate for Δ. He is willing to reject η(ξ₂*) ≥ η̂_L(ξ₂*) at the 5 percent significance level when Ȳ_ξ₂* ≤ η̂_Q(ξ₂*). That is, he is willing to re-estimate Δ based only on Ȳ_ξ₂*, ignoring Ȳ_ξ₂, when the above hypothesis is rejected.

For example, Table 4.3.3.1.1 shows that if Ȳ_ξ₀ = 0 and Ȳ_ξ₂ = 0.5, then he should conduct the experiment at ξ₂* = 0.30, and he needs 24 animals. If he chooses a value of ξ₂′ other than 0.30 he will require more animals. Also, if he desires to reject for some other value of Ȳ_ξ₂* between η̂_Q(ξ₂*) and η̂_L(ξ₂*), he will need more animals. If he finds that he has insufficient resources to meet the sample size requirements, this analysis indicates that he should not conduct a subsequent experiment for this purpose. His conclusion should be that Δ lies somewhere between the estimates based on η̂_L(x) and η̂_Q(x).
Table 4.3.3.1.1

Sample size (per group) n₂* and value of ξ₂* needed to reject η(ξ₂*) ≥ η̂_L(ξ₂*) when Ȳ_ξ₂* ≤ η̂_Q(ξ₂*). (ξ₂* is the value of ξ₂′ that requires the smallest n₂′ for this test.) ξ₀ = 0, ξ₂ = 1; significance probability = 0.05 (one-sided test). Cell entries are n₂* and ξ₂*, tabulated over a grid of values of Ȳ_ξ₀ and Ȳ_ξ₂.

[The tabulated entries are not reliably recoverable from the source.]
Table 4.3.3.1.2

Sample size (per group) n₂* and value of ξ₂* needed to reject η(ξ₂*) ≥ η̂_L(ξ₂*) when Ȳ_ξ₂* ≤ η̂_Q(ξ₂*). (ξ₂* is the value of ξ₂′ that requires the smallest n₂′ for this test.) ξ₀ = 1, ξ₂ = 10; significance probability = 0.05 (one-sided test). Cell entries are n₂* and ξ₂*, tabulated over a grid of values of Ȳ_ξ₀ and Ȳ_ξ₂.

[The tabulated entries are not reliably recoverable from the source.]
4.3.4 Summary

We have considered some of the errors in extrapolation. Some of these errors are due to the variability of the data. A separate error component arises when we incorrectly specify the true model. We have shown that even when we specify the wrong model, we can control random variability by dosing at high levels and using linear interpolation to estimate the differential response between doses at lower levels.

We have seen that an important error in linear interpolation is model misspecification. For a particular model, we have seen that the best choice for controlling variability and bias is to interpolate only slightly. When the model is correctly specified we need only control variability. In this case we are led to dosing at as high a level as possible.

The class of models that contains these two extreme examples must be considered quite flexible in its ability to contain a function that approximates the true function. For any function in this class we can use a two-point design to obtain unbiased estimates of functions that bound the first function.

When these bounds are too wide for decision-making purposes, we have shown how to design an experiment to distinguish between these bounds. This design suggests what dose level to use and the sample size requirements.

The second experiment may still involve some interpolation. In view of the fact that our assumption about the form of the model may be too narrow, we suggest computing similar bounds from this second experiment. If one believes that the true response is approximately linear at low dose, then there is reason to believe that at a small enough dose the linear model will not be rejected.

Consideration of the fact that many observations are required to distinguish between the bounds when Ȳ_ξ₂ − Ȳ_ξ₀ is about 0.01 indicates that attempting to obtain experimental data at lower response levels may not be worthwhile.
CHAPTER V

FUTURE RESEARCH

One of the aims of this dissertation was to develop estimators that could attain smaller general mean square error than the least squares estimator.

Many of the estimators that were studied can attain smaller mean square error than the least squares estimator, but only for a subset of the parameter space. Thus, using these estimators because they have small mean square error requires that one know that the true parameters lie in this subset. Thus, prior knowledge is required in this case.

In many instances, only crude knowledge of certain bounds on the parameters is necessary to insure that the parameters lie in the required subspace. It was felt that there would be many instances where such knowledge would exist for many data sets. However, in the cases we have investigated, when such bounds exist, they are usually very large, and the incorporation of this information into the estimator would probably not be worth the effort of having to explain what one was doing.

In those instances where a smaller bound is known, one often finds that a transformation is used. For example, proportions are bounded by one, but one might employ linear model methods by transforming to the logits of these proportions. Since the logit is not bounded, there is no obvious choice for a bound on the logit in general.

This suggests that the use of transformations, which map a finite interval into an unbounded interval, may be viewed as a way of incorporating prior knowledge into the estimator.
Transformations also have an ability to reduce the terms in a linear model. For example, the log transformation can remove multiplicative interaction terms. Thus, transformations may be able to reduce the number of parameters to be estimated. However, the use of transformations has some disadvantages.

In the first place, using transformations of the dependent variable makes the interpretation of the parameters more complex. To reduce the complexity, one might try to make inferences about the untransformed variable from the estimates obtained in the transformed space. There are theoretical reasons why such inferences are invalid.
For example, if Y is lognormally distributed, that is,

    Y = exp Z,   where Z ~ N(μ, σ²),

then

    E(Y) = exp{μ + σ²/2},

whereas

    exp{E ln Y} = exp{E Z} = exp μ.

So, in general,

    E(Y) ≠ f⁻¹(E(f(Y))).

That is, one cannot talk about how Y will behave on the average if one determines how f(Y) behaves and then transforms back using f⁻¹. In the case of the log transformation we see that such a procedure introduces bias. It might be of interest to conduct research to determine the mean square error of such a procedure, and compare it to some biased estimator.
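The retransformation bias just described is easy to see numerically; the following Python sketch, using illustrative values of μ and σ, compares the exact lognormal mean with the naive back-transformed mean of log Y.

    import math
    import random

    # Illustrative parameters for Z ~ N(mu, sigma^2); Y = exp(Z) is lognormal.
    mu, sigma = 1.0, 0.8
    random.seed(1)

    z = [random.gauss(mu, sigma) for _ in range(200000)]
    y = [math.exp(zi) for zi in z]

    mean_y = sum(y) / len(y)              # estimates E(Y) = exp(mu + sigma^2/2)
    naive = math.exp(sum(z) / len(z))     # exp(mean of log Y), which targets exp(mu)

    print("E(Y) exact      :", math.exp(mu + sigma**2 / 2))
    print("E(Y) simulated  :", mean_y)
    print("exp(mean log Y) :", naive, "  (compare exp(mu) =", math.exp(mu), ")")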
We have investigated linear transformations of the independent variables. For example, if the true model is

    η = β′x

and one fits

    η = γ′z,   where z = Mx,

then the integrated bias error is minimized when

    γ = (M′)⁻¹β.

One might consider a z defined as above when using principal component regression.
We have considered only estimators that are linear transformations of the least squares estimator. There is at least one application involving a fixed nonlinear transformation of the least squares estimator. This is the regression problem where there are errors in the measurement of the independent variables. When the variances of the measurement error in both variables are known, one solution for estimating the coefficients is to rotate the axes a certain amount, obtain the least squares estimator using the transformed variables, and obtain the required estimator by transforming back. This required estimator is thus a nonlinear transformation of a least squares estimator using the transformed variables.

For this reason, it may be of interest to consider nonlinear transforms of the least squares estimator. If such a study were done, the nonlinear transform might be approximated by a linear transform. In this case, it may be possible to apply some of the theory developed here.
LIST OF REFERENCES

Barnard, G.A. 1960. "The Logic of Least Squares." Journal of the Royal Statistical Society, Series B 25: 124-127.

Bhattacharya, P.K. 1966. "Estimating the Mean of a Multivariate Normal Population with General Quadratic Loss Function." Annals of Mathematical Statistics 37: 1819-1824.

Box, G.E.P. and N.R. Draper. 1959. "A Basis for the Selection of a Response Surface Design." Journal of the American Statistical Association 54: 622-654.

Finkbeiner, D.T. 1960. Introduction to Matrices and Linear Transformations. W.H. Freeman and Company, San Francisco.

Finney, D.J. 1971. Probit Analysis. Cambridge University Press, Cambridge.

Hammer, D.I., V. Hasselblad, B. Portnoy, and P.F. Wehrle. 1974. "Los Angeles Student Nurse Study." Archives of Environmental Health 28: 255-260.

Helms, R.W. 1974. "The Average Estimated Variance Criterion for the Selection-of-Variables Problem in General Linear Models." Technometrics 16: 261-274.

Hoerl, A.E. and R.W. Kennard. 1970a. "Ridge Regression: Biased Estimation for Nonorthogonal Problems." Technometrics 12: 55-67.

Hoerl, A.E. and R.W. Kennard. 1970b. "Ridge Regression: Applications to Non-orthogonal Problems." Technometrics 12: 69-82.

James, W. and C. Stein. 1961. "Estimation with Quadratic Loss." Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability 1: 361-379. University of California Press, Berkeley and Los Angeles.

Karson, M.J. 1970. "Design Criterion for Minimum Bias Estimation of Response Surfaces." Journal of the American Statistical Association 65: 1565-1572.

Karson, M.J., A.R. Manson and R.J. Hader. 1969. "Minimum Bias Estimation and Experimental Design for Response Surfaces." Technometrics 11: 461-475.

Kupper, L.L. and E.F. Meydrech. 1974. "Experimental Design Considerations Based on a New Approach to Mean Square Error Estimation of Response Surfaces." Journal of the American Statistical Association 69: 461-463.

Mallows, C.L. 1973. "Some Comments on C_p." Technometrics 15: 661-675.

Marquardt, D.W. and R.D. Snee. 1975. "Ridge Regression in Practice." The American Statistician 29: 3-20.

Mayer, L.S. and T.A. Willke. 1973. "On Biased Estimation in Linear Models." Technometrics 15: 497-508.

Pope, P.T. and J.T. Webster. 1972. "The Use of an F-Statistic in Stepwise Regression Procedures." Technometrics 14: 327-340.

Rao, C.R. 1965. Linear Statistical Inference and Its Applications. John Wiley and Sons, New York.

Sclove, S.L. 1968. "Improved Estimators for Coefficients in Linear Regression." Journal of the American Statistical Association 63: 597-606.

Theil, H. 1971. Principles of Econometrics. North Holland Publishing Company, Amsterdam.

Wallace, T.D. 1964. "Efficiencies for Stepwise Regressions." Journal of the American Statistical Association 59: 1179-1182.