ON THE ESTIMATION OF REGRESSION COEFFICIENTS WITH A
COORDINATEWISE MEAN SQUARE ERROR CRITERION OF GOODNESS

Burt S. Holland
Institute of Statistics
Mimeograph Series No. 693
July, 1970
TABLE OF CONTENTS

LIST OF TABLES

1. INTRODUCTION
   1.1 The Model
   1.2 Multicollinearity
   1.3 Estimation with a Mean Square Error Criterion of Goodness
   1.4 Multivariate Generalizations of Mean Square Error

2. REVIEW OF LITERATURE
   2.1 The Stein-James Estimator
   2.2 Test-Estimation Procedures
   2.3 Ridge Regression

3. ALTERNATIVE ESTIMATORS OF β
   3.1 Construction of the Estimators
       3.1.1 b_2   3.1.2 b_3   3.1.3 b_4   3.1.4 b_5   3.1.5 b_6
   3.2 Asymptotic Distribution Theory of the Estimators
   3.3 Relative Mean Square Efficiency of the Estimators: A Simulation Experiment
   3.4 Discussion of the Estimators

4. SUMMARY, CONCLUSIONS AND RECOMMENDATIONS
   4.1 Summary
   4.2 Conclusions and Recommendations

5. LIST OF REFERENCES

6. APPENDIX: THE SIMULATION DESIGN AND PROGRAM

LIST OF TABLES

3.1 Estimated relative efficiencies Ê_2, Ê_3, and Ê_5, based on N = 500 iterations

3.2 Relative efficiency Ê_6 for various values of q_j
1. INTRODUCTION

1.1 The Model

Consider the linear model

    Y = Xβ + e ,                                                  (1.1)

where X = [x_tj] is an n × p matrix of known fixed quantities with rank p (p ≤ n), β = (β_1, ..., β_p)' a p × 1 vector of unknown parameters to be estimated, and e an n × 1 vector of random variables distributed with mean vector 0 and dispersion matrix σ²I, σ² unknown.
The Gauss-Markov Theorem (Graybill, 1961, pp. 115-116) states that for this model the minimum variance linear unbiased estimator of β is given by

    b_1 = (X'X)^(-1) X'Y ,                                        (1.2)

with dispersion matrix σ²(X'X)^(-1). By "minimum variance" of the vector estimator b_1 we mean here that var(b_1j) ≤ var(b*_j) for each j = 1, 2, ..., p, where b* = (b*_1, ..., b*_p)' is any other linear unbiased estimator of β.
The minimum variance quadratic unbiased estimator of σ² is given by

    σ̂² = (Y - Xb_1)'(Y - Xb_1) / (n - p) .                        (1.3)
When interest lies in constructing confidence regions or tests of hypotheses concerning functions of β and σ², further specification of the distribution of e is necessary. If, as is often the case, we can assume that e follows a multivariate normal probability law, b_1 and σ̂² are the jointly sufficient maximum likelihood estimators of their expectations. Also under this assumption, (n - p)σ̂²/σ² ~ χ²(n - p) and var(σ̂²) = 2σ⁴/(n - p).
In this thesis we are concerned with the estimation of β where, in order to elucidate insofar as possible the mechanism underlying the model, primary interest lies in efficient structural estimation. We are less interested here in improved prediction of the response corresponding to a given p-tuple (x_t1, x_t2, ..., x_tp) or in judiciously choosing such a p-tuple to optimize the response y.
1.2 Multicollinearity

Despite its desirable properties, the estimator b_1 occasionally leaves a lot to be desired. If X (and hence X'X) is nearly singular so that |X'X| is "small," the variances of the {b_1j} become very large and the estimators themselves become quite sensitive to changes in specification of the model. This ill-conditioning of the model, often referred to as multicollinearity, has long been a prickly problem for investigators in all fields of application.
There are several avenues of approach to the multicollinearity problem. The most satisfactory one is to insist upon additional information (e.g., more sample data or elaboration of the model). If, as is often the case, such information is unobtainable, some investigators would consider the possibility of reducing the dimensionality of the problem, i.e., discarding some of the regressors.¹ However, if the theoretical considerations underlying the construction of the model are not to be neglected, this approach is inappropriate when the statistician's aim is structural estimation. The incorrect omission of an important though multicollinear variable from the list of independent variables introduces a perceptible bias in the estimation of the remaining coefficients.²

This thesis follows a third route in considering some slightly biased estimators of β. In particular, we consider alternative estimators obtained by modifying the criterion of goodness from linear minimum variance unbiasedness to smallness of mean square error (m.s.e.). It is felt that in practice, few people would seriously object to this minor change in loss structure.
1.3 Estimation with a Mean Square Error Criterion of Goodness

For estimating the scalar parameter θ, the m.s.e. of the estimator θ̂ (having finite variance) is given by

    m.s.e.(θ̂; θ) = E(θ̂ - θ)² = E[θ̂ - E(θ̂)]² + [E(θ̂) - θ]² = var(θ̂) + bias²(θ̂) .

Implicit in the adoption of this risk function is the acceptance of bias in the estimation if the reduction in variance surpasses the newly introduced squared bias term.

¹See Draper and Smith, 1969.

²See Farrar and Glauber (1967) for an excellent account of the multicollinearity problem and some proposed remedies.
When there is substantial multicollinearity in the model (1.1), the large variance of the estimators means that even a small percentage reduction in variance can be appreciable in absolute terms.

Whereas minimum variance unbiasedness stipulates that θ̂ be made close to E(θ̂) subject to the condition E(θ̂) = θ, the mean square error criterion requires that θ̂ be close to θ itself. Both (θ̂ - E(θ̂))² and (θ̂ - θ)² are attractive loss functions in that they possess the property of convexity. But in practice, minimum variance unbiased (m.v.u.) estimators are usually easy to construct while minimum m.s.e. estimators are not. For unbiased estimation with a strictly convex loss function, the Rao-Blackwell Theorem (Fraser, 1966, p. 57) gives an explicit procedure for construction of the unique m.v.u. estimator when there exists a complete sufficient statistic for the family of densities {P_θ : θ ∈ Ω}. There is no analogous result for minimum m.s.e. estimation.
As an illustration of the difficulty in obtaining minimum m.s.e. estimators, suppose μ̂ is the m.v.u. estimator of a scalar parameter μ and one asks what constant "c" will minimize

    m.s.e.(cμ̂; μ) = c² var(μ̂) + (c - 1)² μ² .                    (1.4)

The optimum c is clearly

    c* = μ² / [μ² + var(μ̂)] ,                                    (1.5)

so that c*μ̂ has smaller m.s.e. than μ̂.
However, c* cannot be computed without prior knowledge of the value of the ratio μ²/var(μ̂). This dilemma occurs when we wish to use a modification of the arithmetic mean of a random sample of size n to estimate the unknown mean of a population having unknown variance. Such interference of "nuisance parameters" has led Kendall and Stuart (1967, p. 22) to state,

    Minimum m.s.e. estimators are not much used, but it is as well
    to recognize that the objection to them is a practical, rather
    than theoretical one.
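The trade-off in (1.4)-(1.5) is easy to verify numerically. The short sketch below is not part of the thesis; the values μ = 2 and var(μ̂) = 1 are hypothetical. It evaluates (1.4) at the optimal c of (1.5) and at c = 1, the unbiased choice.

```python
# Numerical check of the shrinkage constant in (1.5). The values of mu and
# var_hat are hypothetical; in practice the ratio mu^2/var_hat is unknown,
# which is exactly the dilemma described in the text.

def mse_scaled(c, mu, var_hat):
    """m.s.e. of c*mu_hat for an unbiased mu_hat with variance var_hat, eq. (1.4)."""
    return c ** 2 * var_hat + (c - 1) ** 2 * mu ** 2

def c_star(mu, var_hat):
    """The m.s.e.-minimizing constant, eq. (1.5)."""
    return mu ** 2 / (mu ** 2 + var_hat)

mu, var_hat = 2.0, 1.0
c = c_star(mu, var_hat)                  # = 4/5 here
print(c)                                 # 0.8
print(mse_scaled(c, mu, var_hat))        # 0.8 -- smaller than
print(mse_scaled(1.0, mu, var_hat))      # 1.0 -- the m.s.e. of mu_hat itself
```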
If it should happen that var(μ̂) = λμ² with λ known, we can get a minimum m.s.e. estimator for μ. As an example, consider the estimation of σ² in the model (1.1) with normal errors:

    σ̄² = σ̂² (σ²)² / [(σ²)² + 2σ⁴/(n - p)] = σ̂² (n - p)/(n - p + 2)

and

    m.s.e.(σ̄²; σ²) = 2σ⁴/(n - p + 2) < 2σ⁴/(n - p) = m.s.e.(σ̂²; σ²) .
Where minimum m.s.e. estimators do not exist, it may still be possible to construct estimators that have smaller m.s.e. than traditional estimators over a wide range of values for various unknown parameters. We shall see that this is the case when estimating β in (1.1).
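The σ² example above can be checked exactly, since under normal errors the residual sum of squares SS = (n - p)σ̂² behaves as σ² times a chi-square variable with n - p degrees of freedom. The sketch below (illustrative, with n - p = 22; not from the thesis) computes the exact m.s.e. of SS/d from the chi-square moments and confirms that the best divisor is n - p + 2.

```python
# Exact m.s.e. of SS/d as an estimator of sigma^2, where SS ~ sigma^2 * chi2(df):
# E(SS) = df*sigma^2 and var(SS) = 2*df*sigma^4, so
# E(SS/d - sigma^2)^2 = (2*df/d^2 + (df/d - 1)^2) * sigma^4.

def mse_divisor(d, df, sigma2=1.0):
    """Exact m.s.e. of SS/d, from the chi-square moments above."""
    return (2.0 * df / d ** 2 + (df / d - 1.0) ** 2) * sigma2 ** 2

df = 22                                    # n - p, illustrative
best_d = min(range(df - 5, df + 6), key=lambda d: mse_divisor(d, df))
print(best_d)                              # 24, i.e. n - p + 2
print(mse_divisor(df + 2, df) < mse_divisor(df, df))   # True
```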
1.4 Multivariate Generalizations of Mean Square Error

Thus far the discussion has been confined to the estimation of a single parameter. Upon moving to the multivariate problem, we must adopt a suitable generalization of m.s.e. to the case of a vector estimator θ̂ = (θ̂_1, θ̂_2, ..., θ̂_p)' of a vector parameter θ = (θ_1, θ_2, ..., θ_p)'.

One attractive generalization that has been proposed (Bhattacharya, 1966) is

    m.s.e.(θ̂; θ) = E[(θ̂ - θ)'W(θ̂ - θ)] ,                        (1.6)

where W is a p × p symmetric positive definite matrix of known weights. (This guarantees that the corresponding loss function is nonnegative and convex.) W is often taken to be D, a diagonal matrix of positive elements, or more particularly, the identity matrix:

    m.s.e.(θ̂; θ) = E[(θ̂ - θ)'(θ̂ - θ)] = Σ_j E(θ̂_j - θ_j)² .    (1.7)

Geometrically inclined readers will refer to (1.6) as the expected (squared) distance between θ̂ and θ in the metric of W.
~ing1e
criterion such as (1.6) we shall,
in this thesis, Gonstrvct vector estimators by simu1taneQus1y attending
to the p univariate proo1ems of rendering (m.s.e. (e.;
J
2, ... , p} as small as possible.
E(6. _ 6.)2
J
J
<
Thus we would call
e.),
J
j = 1,
e IIbetter
fl
than
(1. 8)
7
for every j
= I,
2, ... , p, and at least one of the inequalities is
strict.
If θ̂ is better than θ̃ according to (1.8), it is better also according to (1.6) if W = D (but not necessarily if W ≠ D). The converse of this statement is clearly false.

The criterion (1.8) lacks the geometric appeal of (1.6). It fails, moreover, to take account of the cross product terms {E[(θ̂_j - θ_j)(θ̂_ℓ - θ_ℓ)]}, an omission that may not be warranted. The difficulty with the employment of (1.6) lies in the dilemma of choosing a satisfactory W. Unless W is judiciously chosen, (1.8) may be preferable to (1.6) in that if the latter is used:

(i) Some of the {θ_j} may be poorly estimated (although others are not).

(ii) Unjustified emphasis may be placed on the efficient estimation of some subset of the {θ_j} relative to another subset.

(iii) The criterion is not scale invariant.

In view of our declared aim of estimation rather than efficient prediction or optimization, it seems unwise, in the absence of further information, to consider a means of estimation that may perform well for one part of the model to the detriment of another.³ Our criterion attends to the estimation of each β_j apart from that for the remaining elements of β. Recall too that the Gauss-Markov Theorem chooses β̂ = b_1 to minimize each E[β̂_j - E(β̂_j)]² rather than E[β̂ - E(β̂)]'W[β̂ - E(β̂)] for some W.

³The m.s.e. of prediction of a "future" response Y_t corresponding to a stochastic p-tuple (x_t1, x_t2, ..., x_tp) with dispersion matrix W is given by σ² + E[(β̂ - β)'W(β̂ - β)].
An even more restrictive criterion than (1.8) is to call θ̂ "better in m.s.e." than θ̃ iff m.s.e.(h'θ̂) ≤ m.s.e.(h'θ̃) for every h (p × 1), or, equivalently, iff

    E[(θ̃ - θ)(θ̃ - θ)'] - E[(θ̂ - θ)(θ̂ - θ)']

is a positive semi-definite matrix.⁴ Fraser (1966, p. 60) states a multivariate generalization of the Rao-Blackwell Theorem which employs the notion of "ellipsoid of concentration" as the multivariate generalization of variance.

⁴See Toro-Vizcarrondo and Wallace (1968, p. 560).
2. REVIEW OF LITERATURE

2.1 The Stein-James Estimator

Like most investigators, James and Stein (1961) have preferred to use criterion (1.7). For the special case where X'X = I (e.g., orthogonal polynomial regression), e multivariate normal, and p ≥ 3, they have shown that

    β̂(γ) = [1 - γσ̂² / Y'XX'Y] X'Y                                (2.1)

is uniformly (in β and σ²) better than b_1 = X'Y for γ any positive number less than 2(p-2)(n-p)/(n-p+2), and that m.s.e.[β̂(γ); β] is minimized by taking γ = (p - 2)(n - p)/(n - p + 2).
The coefficient
[1 - y~2! Y'XX 'Y] will be between 0 and l. for all admisl;lib1e y
whenever
Thus the Stein_James estimator (2.1) is
si~ply
multiple of b .
l
P(y)
They merely prQve that
a (scalar) constant
is better than b )
1
omitting the motivation behind its construction.
necessary to render b
l
inadmissible)
~ut
Normality is not
in its absence no alternative
estimator has been proposed.
Baranchik (1970) has generalized (2.1) to allow γ to be a certain bounded function of F_{p, n-p}.
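A direct implementation of (2.1) is straightforward. The sketch below is illustrative only (it is not from the thesis): the data are simulated, the orthonormal design is manufactured with a QR factorization, and the m.s.e.-minimizing γ is used.

```python
import numpy as np

def stein_james(X, Y):
    """Stein-James estimator (2.1) for the orthonormal case X'X = I,
    with gamma = (p - 2)(n - p)/(n - p + 2)."""
    n, p = X.shape
    b1 = X.T @ Y                                   # Gauss-Markov estimator when X'X = I
    s2 = (Y - X @ b1) @ (Y - X @ b1) / (n - p)     # error mean square, eq. (1.3)
    gamma = (p - 2) * (n - p) / (n - p + 2)
    return (1.0 - gamma * s2 / (b1 @ b1)) * b1     # note Y'XX'Y = b1'b1 here

rng = np.random.default_rng(0)
n, p = 25, 4
X, _ = np.linalg.qr(rng.standard_normal((n, p)))   # X'X = I by construction
beta = np.array([1.0, 0.5, 0.0, -0.5])
Y = X @ beta + rng.standard_normal(n)
print(stein_james(X, Y))
```

As the text observes, the result is a scalar multiple of b_1 = X'Y: every coordinate is shrunk by the same factor.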
Stein (1956) has also shown that b_1 is admissible when p ≤ 2 [the risk function still being (1.7)]. This means that in the two-regressors case with criterion (1.8), we cannot simultaneously improve upon b_11 and b_12 for all possible values of β and σ².

Sclove (1966, 1968) has pointed out that β̂(γ) generalizes to the case X'X ≠ I provided that we take W = X'X. Thus (1.6) now appears as

    m.s.e.(β̂; β) = E[(β̂ - β)'X'X(β̂ - β)] .                      (2.2)

Since X'X is proportional to the inverse of the variance-covariance matrix of b_1, this choice of W removes objection (iii) in Section 1.4, and goes a long way toward the withdrawal of (i). Bhattacharya (1966) has indicated that for W ≠ I and arbitrary X'X, we can at least transform the problem to the W = D case.

Sclove's paper (1968) surveys all the literature discussed in this section in order to interpret some highly mathematical results for the benefit of applied statisticians.
2.2 Test-Estimation Procedures

Consider the model

    y_t - ȳ = β_1(x_1t - x̄_1) + β_2(x_2t - x̄_2) + e_t ,          (2.3)

which differs from (1.1) with p = 2 only in that the variables are now corrected for their means.⁵ Bancroft (1944) has discussed the estimator β̃_1 of β_1 specified as follows. Perform the level α Student's t test of the hypothesis H_0: β_2 = 0 vs H_a: β_2 ≠ 0. Then let

    β̃_1 = b_11                                              if H_0 is rejected;

    β̃_1 = Σ_t (x_1t - x̄_1)(y_t - ȳ) / Σ_t (x_1t - x̄_1)²    if H_0 is accepted.

Toro-Vizcarrondo (1968) has considered the same estimator where a m.s.e. test⁶ is used in place of Student's t. Baranchik (1964) investigated a modification of the Stein-James estimator where β̂(γ) is taken to be the null vector when a preliminary F test of H_0: β = 0 vs H_a: β ≠ 0 is accepted. These so-called "test-estimation" procedures are actually more akin to the "discarding regressors" approach to multicollinearity than they are to the "new estimator" procedure to be investigated in the next chapter.

⁵The distinction between (2.3) and (1.1) will be further discussed in Section 3.3 below.

⁶See Toro-Vizcarrondo and Wallace (1968).
2.3 Ridge Regression

Hoerl and Kennard (1970a) present the estimator

    β̂(k) = (X'X + kI)^(-1) X'Y ,                                  (2.4)

where the scalar k is chosen so as to stabilize the system by making the estimator less sensitive than b_1 to small changes in model specification. It is demonstrated that with risk as in (1.7) there are admissible values of k such that β̂(k) is better than b_1. However, no explicit expression for k is presented. The authors suggest that its choice be based on a graph of the {β̂_j(k)} vs k (called a "ridge trace"): k is to be the smallest value such that for k* > k, the {β̂_j(k*)} are nearly independent of k*. They claim that the k one will employ in practice will only slightly increase the error sum of squares [Y - Xβ̂(k)]'[Y - Xβ̂(k)]. It is also noted that (2.4) is the Bayes estimator of β when the parameter vector has a prior distribution with mean 0 (p × 1) and dispersion (σ²/k)I (p × p).
A second paper by the same authors (Hoerl and Kennard, 1970b) contains illustrations of the performance of the ridge-trace procedure.
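A minimal modern sketch of (2.4), with a crude tabular ridge trace, can be written as follows. The nearly collinear data and the grid of k values are illustrative, not taken from Hoerl and Kennard.

```python
import numpy as np

def ridge(X, Y, k):
    """Hoerl-Kennard ridge estimator (2.4): (X'X + kI)^{-1} X'Y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + k * np.eye(p), X.T @ Y)

rng = np.random.default_rng(1)
x1 = rng.standard_normal(30)
x2 = x1 + 0.05 * rng.standard_normal(30)          # nearly collinear column
X = np.column_stack([x1, x2])
Y = X @ np.array([1.0, 1.0]) + rng.standard_normal(30)

# Tabulated ridge trace: the suggestion in the text is to pick the smallest k
# beyond which the coefficients are nearly flat.
for k in [0.0, 0.01, 0.1, 1.0, 10.0]:
    print(k, ridge(X, Y, k))
```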
13
ALTERNATIVE ESTIMATORS OF /3
3.
3.1
For lack of
,
otper
~ny
here are essentially
Constru~tion
Analogous to the
approach, all estimatora considered
me~ns ~f
modifie~fions
situa~ion
of the Estimators
of b .
1
for scalar estimation discussed in
Section 1.3, we find that our modified estimators contain
parame t ers
Q
t"
and Cf2
That is,
Pis
In the fopm (3.1),
considered for
th~
th~
clearly of no use.
Two procedures were considered for the construction of employable estimators:

(i) In (3.1) set σ² = σ̂² and let our estimator of β be the solution b of the equation

    b = β̂(b, σ̂²) ,                                               (3.2)

to be determined by iteration or otherwise;

(ii) In (3.1) set σ² = σ̂² and β = b_1 to obtain

    b = β̂(b_1, σ̂²) .                                             (3.3)

Procedure (i) was quickly dismissed. For all of the new estimators, good starting values to prime the iteration were hard to come by. Often the iteration diverged, or converged too slowly to be of practical use.
Procedure (ii) was the one chosen to be employed here. This has the advantage of rendering an estimator that is a function of the sufficient statistics b_1 and σ̂². It seemed intuitively desirable not to depart from the use of sufficient statistics.
14
The "raw form" estimators E(3.1) as oppo$ed to (3.3)J to be
computed are
mini~um
m.s.e. estimators) but the employable
of form (3.3) are not.
the
ques~ion
es~imators
to be answered was} "Does there
exist a ~ that is lappreciably' better than h pver a 'wide' range of
1
possible values of p and (J'2?"
We shall denote the
~ive
new
estim,to~s
by
When in the "r~w form ll Pliior to making the substitutions p = b 1 and
2
A2
0
(J
:;:: (J )
we shall wri te the e~tim~tors as b . The j.,; th component of
i
any vector h will be wl'it;ten hjor (h)j'
superscript denotes the optimal value of
The reader is
rem~nded th~~
In this section)
~ny
~n
asterisk
variable.""
the objective used to compute the
"raw form" estimator~ is sep~rCJte minimiZation of
E<P j
_
Pj )2
fo1,"
each j :;:: 1} 2} "'} p.
3.1.1 b_2

To construct b°_2, we attempt to find that p × p matrix K which optimizes (in the indicated sense) the estimator Kb_1. Let k_j denote the j-th row of K. Then b°_2j = k*_j b_1, where k*_j is chosen to minimize E(k_j b_1 - β_j)². We compute:

    bias(k_j b_1; β_j) = k_j β - β_j ,

    var(k_j b_1) = σ² k_j (X'X)^(-1) k'_j ,

so that

    m.s.e.(k_j b_1; β_j) = σ² k_j (X'X)^(-1) k'_j + (k_j β - β_j)² .   (3.4)
Differentiating (3.4) with respect to the vector k_j and setting the result equal to the null vector, we obtain

    2σ²(X'X)^(-1) k'_j + 2(k_j β - β_j)β = 0 .                    (3.5)

Notice that the second derivative with respect to k_j is a positive definite matrix. Continuing from (3.5),

    [σ²(X'X)^(-1) + ββ'] k'_j = β_j β ,

whence

    k'_j = [σ²(X'X)^(-1) + ββ']^(-1) β β_j

and

    K* = ββ'[σ²(X'X)^(-1) + ββ']^(-1) .                           (3.6)

Employing the identity

    (A + uv')^(-1) = A^(-1) - A^(-1)uv'A^(-1) / (1 + v'A^(-1)u) ,   (3.7)
where A is a square nonsingular matrix, u and v are column vectors, and (A + uv') is nonsingular, we can write

    [σ²(X'X)^(-1) + ββ']^(-1) = (1/σ²) [I - X'Xββ' / (σ² + β'X'Xβ)] X'X .

Then

    b°_2 = K* b_1 = (1/σ²) ββ' [I - X'Xββ' / (σ² + β'X'Xβ)] X'Y
         = [β'X'Y / (σ² + β'X'Xβ)] β

and, making the substitutions β = b_1 and σ² = σ̂²,

    b_2 = [b'_1X'Y / (σ̂² + b'_1X'Xb_1)] b_1
        = [b'_1X'Y / (σ̂² + b'_1X'Y)] b_1 ,                        (3.8)

since b'_1X'Xb_1 = b'_1X'Y.
17
Recall that biX'Y is the regression sum of squares and
the error mean
in the standard Analysis of Variance
squ~re
At first glance it
.. 2
seem~
.. 2
~
is
t~b1e.
as though we have found K* to be a
+ biX'Y)] multiple of the identity matrix.
constant [b'X'Y /
(~
is not the qase)
how~ver) b~cause
This
the end result (3.8) is a consequence
~.
of having substituted,bl for
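In code, (3.8) amounts to scaling b_1 by regression SS / (error mean square + regression SS). A sketch with simulated data (not the thesis program):

```python
import numpy as np

def b2_estimator(X, Y):
    """The estimator b2 of (3.8): b1 scaled by b1'X'Y / (sigma2_hat + b1'X'Y)."""
    n, p = X.shape
    b1 = np.linalg.solve(X.T @ X, X.T @ Y)
    s2 = (Y - X @ b1) @ (Y - X @ b1) / (n - p)     # error mean square, eq. (1.3)
    reg_ss = b1 @ (X.T @ Y)                        # regression sum of squares
    return (reg_ss / (s2 + reg_ss)) * b1

rng = np.random.default_rng(2)
X = rng.standard_normal((25, 2))
Y = X @ np.array([0.8, -0.3]) + rng.standard_normal(25)
print(b2_estimator(X, Y))
```

Since the scaling factor lies in (0, 1), every coordinate of b_2 is slightly smaller than the corresponding coordinate of b_1 in absolute value, as noted in Section 3.3.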
3.1.2 b_3

This estimator constrains the K in Section 3.1.1 to be a diagonal matrix with j-th entry m_j. Thus b°_3j = m*_j b_1j, and from (1.5) we find that

    m*_j = β_j² / (β_j² + σ² s^jj) ,

where s^jj denotes the j-th diagonal element of (X'X)^(-1), so that

    b_3j = [b_1j² / (b_1j² + σ̂² s^jj)] b_1j .

The estimators b_2 and b_3 are multiplicative modifications of b_1.
3.1.3 b_4

We now consider an additive modification, b_4, whose raw form is written

    b°_4 = b_1 + DX'Y ,

where D is a diagonal matrix to be determined.
Like the Hoerl and Kennard procedure discussed in Section 2.3, this estimator attempts to reduce the p variances by altering the diagonal elements of (X'X)^(-1).

The estimator b°_4j is found by computing the minimum with respect to D of the j-th diagonal element of the "m.s.e. matrix"

    E[(b_1 + DX'Y - β)(b_1 + DX'Y - β)']
        = E{[(b_1 - β) + DX'Xβ + DX'e][(b_1 - β) + DX'Xβ + DX'e]'} .

Apart from the σ²(X'X)^(-1) term, which is free of D, the j-th diagonal element is found to be

    2d_j σ² + d_j² [σ² s_jj + (z_jβ)²] ,

where z_j is the j-th row of X'X, d_j is the j-th diagonal element of D, and [s_jℓ] = X'X. Thus the j-th diagonal element of D* is

    d*_j = -σ² / [σ² s_jj + (z_jβ)²] .

Therefore,

    b°_4j = b_1j - σ²(X'Y)_j / [σ² s_jj + (z_jβ)²]

and

    b_4j = b_1j - σ̂² Σ_t x_tj Y_t / [σ̂² s_jj + (Σ_t x_tj Y_t)²] .   (3.9)
Attempts to combine the p equations (3.9) into a compact expression for the b_4 vector were unsuccessful.
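Though no compact vector expression was found, (3.9) is immediate to compute one coordinate at a time. A sketch with simulated data (not the thesis program):

```python
import numpy as np

def b4_estimator(X, Y):
    """The estimator b4 of (3.9), computed coordinatewise:
    b4j = b1j - s2*(X'Y)_j / [s2*s_jj + ((X'Y)_j)^2]."""
    n, p = X.shape
    S = X.T @ X
    XtY = X.T @ Y
    b1 = np.linalg.solve(S, XtY)
    s2 = (Y - X @ b1) @ (Y - X @ b1) / (n - p)
    return b1 - s2 * XtY / (s2 * np.diag(S) + XtY ** 2)

rng = np.random.default_rng(5)
X = rng.standard_normal((25, 2))
Y = X @ np.array([1.0, 0.2]) + rng.standard_normal(25)
print(b4_estimator(X, Y))
```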
3.1.4 b_5

The fourth proposed estimator is radically different from the first three in both appearance and conception. We consider first the case p = 2:

    b°_5j = {[I - S(I - SX'XS)^(2w_j+1) S^(-1)] b_1}_j ,   w_j ≥ 0 ,   (3.10)

where S is the diagonal matrix with j-th entry s_jj^(-1/2). Abbreviating r_12 as r, we see that SX'XS is simply the correlation matrix

    ( 1  r )
    ( r  1 ) ,

while
    (I - SX'XS)^(2w_j+1) = (     0         -r^(2w_j+1) )
                           ( -r^(2w_j+1)       0       ) .        (3.11)

Since rank(X) = 2, we have |r| < 1. Thus (3.11) tends to 0 (2 × 2) as w_j → ∞, and b°_5j tends to b_1j, the Gauss-Markov estimator of β_j, as w_j → ∞. For w_j = 0, b°_5j = (S²X'Y)_j = (X'Y)_j / s_jj, the Gauss-Markov estimator of β_j when s_12 = 0. It was hoped that with a judicious choice of w*_j between these extremes, b°_5j would be an effective compromise between b_1j and the estimator obtained when the (3-j)-th column of X is ignored.

To find w*_j, the value of w_j that makes the j-th diagonal element of the m.s.e. matrix E[(b°_5 - β)(b°_5 - β)'] as small as possible, we begin by computing the dispersion matrix of b°_5:

    σ² [I - S(I - SX'XS)^(2w_j+1) S^(-1)] (X'X)^(-1) [I - S^(-1)(I - SX'XS)^(2w_j+1) S] .
One finds that the j-th diagonal element of this matrix reduces to

    var(b°_5j) = σ² (1 - 2r^(2w_j+2) + r^(4w_j+2)) / [s_jj (1 - r²)] ,

where ℓ = 3 - j, j = 1, 2. The "bias-cobias" matrix is

    {E[(b_1 - β) - S(I - SX'XS)^(2w_j+1) S^(-1) b_1]}{E[(b_1 - β) - S(I - SX'XS)^(2w_j+1) S^(-1) b_1]}' ,

the j-th diagonal element of which is found to be

    bias²(b°_5j) = r^(4w_j+2) (s_ℓℓ / s_jj) β_ℓ² .

The squared bias of b°_5j is clearly at a maximum when w*_j = 0 and decreases monotonically to zero as w*_j → ∞. Likewise, it can be shown that var(b°_5j) is at a minimum (σ²/s_jj) when w*_j = 0 and increases monotonically to σ²/[(1 - r²)s_jj] as w*_j → ∞. Thus it is not surprising that m.s.e.(b°_5j; β_j) attains a minimum for some positive and finite w*_j. Further calculation reveals this optimal w_j to be the solution of

    r^(2w*_j+1) = σ²r / [σ² + (1 - r²) s_ℓℓ β_ℓ²] ,               (3.12)

that is, w*_j = (1/2){ln[σ²|r| / (σ² + (1 - r²) s_ℓℓ β_ℓ²)] / ln|r| - 1}.
In practice, w*_j is rounded off to the nearest integer. We then obtain b_5j from b°_5j by replacing β_ℓ and σ² in (3.12) with b_1ℓ and σ̂², respectively.
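The raw form (3.10) is easy to exercise numerically for p = 2. The sketch below (simulated data; for simplicity it applies one common integer w to both coordinates, whereas the text allows a separate w*_j per coordinate) checks the two limits noted above: w = 0 gives the one-regressor-at-a-time estimates (X'Y)_j/s_jj, and large w recovers b_1.

```python
import numpy as np

def b5_raw(X, Y, w):
    """Raw-form estimator (3.10) for p = 2 with a common integer w >= 0:
    [I - S (I - S X'X S)^(2w+1) S^{-1}] b1, where S = diag(s_jj^{-1/2})."""
    XtX = X.T @ X
    b1 = np.linalg.solve(XtX, X.T @ Y)
    d = np.sqrt(np.diag(XtX))
    S, S_inv = np.diag(1.0 / d), np.diag(d)
    M = np.linalg.matrix_power(np.eye(2) - S @ XtX @ S, 2 * w + 1)
    return (np.eye(2) - S @ M @ S_inv) @ b1

rng = np.random.default_rng(6)
X = rng.standard_normal((25, 2))
Y = X @ np.array([0.7, -0.4]) + rng.standard_normal(25)

b1 = np.linalg.solve(X.T @ X, X.T @ Y)
simple = (X.T @ Y) / np.diag(X.T @ X)     # each column regressed on its own
print(b5_raw(X, Y, 0))                    # equals `simple`
print(b5_raw(X, Y, 50))                   # essentially b1, since |r| < 1
```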
Unfortunately, it was found that, except in a very special case, b_5 does not generalize to p > 2 because

    (I - SX'XS)^w → 0 (p × p)   as   w → ∞                        (3.13)

fails to hold in general. Furthermore, (3.13) is valid with decreasing frequency as p increases. Consider for example

    SX'XS = ( 1  a  ...  a )
            ( a  1  ...  a )
            ( :  :       : )
            ( a  a  ...  1 )   (p × p) ,

which is an admissible correlation matrix for -1/(p-1) < a ≤ 1. It can be shown that (3.13) holds here iff |a| < 1/(p-1). A general necessary and sufficient condition for the validity of (3.13) is that the largest eigenvalue of SX'XS be less than 2.

Also, the computation of w*_j involves the solution of a (p-1)-st order difference equation. Its solution is intractable for p > 2.
b_5 does generalize when p > 2 if X'X is "block diagonal" with all blocks 2 × 2 or scalars. Then all β_j corresponding to a column of X that is one of a correlated pair are estimated just as in the p = 2 case, by ignoring the remaining p - 2 columns of X. The remaining β_j are estimated as in (3.10) for the p = 1 case -- b_5j = (X'Y)_j / s_jj -- for here I - SX'XS = 1 - 1 = 0.
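The necessary and sufficient condition above -- largest eigenvalue of SX'XS less than 2 -- is easy to check numerically. The sketch below (not from the thesis) does so for the equicorrelated example, whose largest eigenvalue is 1 + (p - 1)a.

```python
import numpy as np

def powers_vanish(C):
    """Does (I - C)^w -> 0 as w -> infinity?  For a symmetric positive
    definite C this holds iff every eigenvalue of C is strictly less than 2."""
    eig = np.linalg.eigvalsh(C)
    return bool(eig.min() > 0 and eig.max() < 2)

def equicorrelation(p, a):
    """Correlation matrix with every off-diagonal entry equal to a."""
    return (1 - a) * np.eye(p) + a * np.ones((p, p))

# For p = 4 the bound in the text is |a| < 1/(p - 1) = 1/3:
print(powers_vanish(equicorrelation(4, 0.30)))   # True  (largest eigenvalue 1.9)
print(powers_vanish(equicorrelation(4, 0.40)))   # False (largest eigenvalue 2.2)
```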
3.2 Asymptotic Distribution Theory of the Estimators

It was found that the normality assumption was necessary in order to obtain any results concerning b_4 and b_5. We shall assume its presence in what follows. Also, let 0 < ξ_1n ≤ ξ_2n ≤ ... ≤ ξ_pn be the ordered eigenvalues of X'X.
Definition. The sequence of p × p matrices {A_n} = {[a_ij(n)]} is said to converge to the matrix A = [a_ij] if a_ij(n) → a_ij for each i, j = 1, 2, ..., p. (Notation: A_n → A, or lim_{n→∞} A_n = A.)
Lemma 1. If

    (X'X)^(-1) → 0 (p × p)   as   n → ∞ ,                         (3.15)

then

    β'X'Xβ → ∞   as   n → ∞                                       (3.16)

for all non-null β.

Proof. Note that the eigenvalues of the p.d. matrix (X'X)^(-1) are the reciprocals of the {ξ_jn}, and that (3.15) says s^ij(n) → 0 as n → ∞ for all i, j = 1, 2, ..., p. Using a theorem from Bodewig (1956), we have for every n:

    0 < ξ_1n^(-1) ≤ [Σ_{i,j} (s^ij(n))²]^(1/2) ,

so that ξ_1n^(-1) → 0, i.e., ξ_1n → ∞, as n → ∞. It follows from a result stated by Rao [1965, p. 50, (1f.2.1)] that

    β'X'Xβ ≥ ξ_1n β'β                                             (3.17)

for all non-null β. Then (3.16) must hold, for otherwise we reach a contradiction of (3.17).
Lemma 2. If

    X'X / n → A   as   n → ∞ ,                                    (3.18)

where A is a finite p × p positive definite matrix, then (3.15) holds.

Proof. Let X'X/n = A_n. The determinant of a square matrix is a continuous function of its elements, hence A_n → A implies det(A_n) → det(A) > 0. It follows immediately from this and from the well-known formula for the inverse of a nonsingular matrix A_n in terms of its cofactors and determinant (i.e., the "method of adjoints") that

    (X'X)^(-1) = n^(-1) A_n^(-1) → 0 (p × p) .

(3.18) is the usual regularity condition assumed in discussions of large sample properties of estimators in linear models.⁷
Lemma 3. The sequence of random variables U_n converges in probability to a random variable U iff for some g > 0,

    E[ |U_n - U|^g / (1 + |U_n - U|^g) ] → 0   as   n → ∞ .

For a proof of this Lemma, see Loève (1963, p. 158).
Lemma 4. If any of the three conditions (3.15), (3.16), or (3.18) holds, then plim(σ̂² / b'_1X'Y) = 0.

Proof. Since b'_1X'Y / pσ̂² = F'_n ~ F'(p, n-p; λ'_n), where λ'_n = β'X'Xβ/σ², we have, for arbitrary δ > 0,

    Pr(σ̂² / b'_1X'Y ≥ δ) = Pr[F'_n ≤ (pδ)^(-1)] .                 (3.19)

From Lemmas 1 and 2 we see that any of the three conditions mentioned above implies that λ'_n → ∞ as n → ∞. It is a well-known property of the noncentral F distribution that its mass escapes to infinity as the noncentrality parameter grows; this enables us to conclude that the right-hand side of (3.19) converges to zero and hence that plim(σ̂² / b'_1X'Y) = 0. The fact that the denominator d.f. of F' is an increasing function of n considerably speeds this convergence.
⁷See for example Malinvaud (1966, p. 174).
Corollary 1. If any of the three conditions (3.15), (3.16), or (3.18) holds, then

    E[ σ̂² / (σ̂² + b'_1X'Y) ] → 0   as   n → ∞ .

Proof. Applying Lemmas 3 and 4 with U_n = σ̂²/b'_1X'Y, U = 0, and g = 1, we have that

    E[ (σ̂² / b'_1X'Y) / (1 + σ̂² / b'_1X'Y) ] → 0   as   n → ∞ .

The result follows immediately.
Lemma 5. If U_n is a sequence of random variables and W a constant, then a sufficient condition for the equalities

    plim U_n = lim_{n→∞} E(U_n) = W

is that E(U_n - W)² → 0 as n → ∞.

Proof. The consistency of U_n as an estimator of W is a consequence of Tchebycheff's Inequality, while the asymptotic unbiasedness follows from the inequalities

    [E(U_n) - W]² ≤ E(U_n - W)² → 0 .
Theorem 1. If either of the conditions (3.15) or (3.18) holds, then for each j = 1, 2, ..., p, b_2j is a consistent and asymptotically unbiased estimator of β_j.

Proof. First suppose β ≠ 0 (p × 1). Then

    0 ≤ E(b_2j - β_j)²
      = E{ [ (b_1j - β_j)(b'_1X'Y / (σ̂² + b'_1X'Y)) - β_j(σ̂² / (σ̂² + b'_1X'Y)) ]² }
      ≤ E(b_1j - β_j)² + 2E{ |β_j| |b_1j - β_j| σ̂² / (σ̂² + b'_1X'Y) }
        + β_j² E{ [σ̂² / (σ̂² + b'_1X'Y)]² } .                      (3.20)

Since E(b_1j - β_j)² = σ² s^jj, (3.15) (or, by Lemma 2, (3.18)) implies that the first two terms of (3.20) tend to zero as n → ∞. Corollary 1 establishes the convergence to zero of the third term of (3.20). Hence, applying Lemma 5, we have the desired result.

If, on the other hand, β = 0 (p × 1), then the third term of (3.20) is identically zero and Lemma 4 and Corollary 1 are no longer required for the proof.
Theorem 2. If lim_{n→∞} s^jj = 0, then b_3j is a consistent and asymptotically unbiased estimator of β_j.

The proof of this theorem is almost identical to that of Theorem 1.
Turning now to b_4j, it is seen that to establish its asymptotic unbiasedness and consistency as an estimator of β_j, we merely need to show that the second term on the right hand side of (3.9) has expectation zero and probability limit zero, respectively. We can rewrite this term as

    σ̂² Σ_t x_tj Y_t / [σ̂² s_jj + (Σ_t x_tj Y_t)²] = b̄*_1j / [1 + F*(1, n-p; λ*_n)] ,   (3.21)

where b̄*_1j = Σ_t x_tj Y_t / s_jj is the simple linear regression coefficient obtained when all elements of β other than β_j are ignored, and F* = F*(1, n-p; λ*_n) is a noncentral F random variable with noncentrality parameter

    λ*_n = (s_j1β_1 + s_j2β_2 + ... + s_jpβ_p)² / (σ² s_jj) .     (3.22)
The author was unable to show that the expectation of (3.21) tended to zero under certain conditions; however, he is fairly certain that this is the case. The difficulty arises from an inability to separate this random variable into two components whose expectations can be taken separately. That is, we cannot (for example) claim that

    |E[ b̄*_1j (1 + F*)^(-1) ]| ≤ E|b̄*_1j| E(1 + F*)^(-1) .

Although it seems plausible that |b̄*_1j| and (1 + F*)^(-1) should be negatively correlated (since the size of |b̄*_1j| and that of (1 + F*) are both directly related to |Σ_t x_tj Y_t|), attempts to formally establish this result were unsuccessful. However, we can show that under a certain regularity condition the probability limit of (3.21) is zero.
Lemma 6. If λ*_n → ∞ as n → ∞, then plim(1 + F*)^(-1) = 0.

Proof. For arbitrary δ > 0, Pr[(1 + F*)^(-1) ≥ δ] → 0 as n → ∞, in the same fashion as the right-hand side of (3.19) discussed in Lemma 4.
Theorem 3. If s_jj → ∞ and λ*_n → ∞ as n → ∞, then plim b_4j = β_j.

Proof. As noted above, we need only show that the probability limit of (3.21) is zero. Since var(b̄*_1j) = σ² s_jj^(-1), s_jj → ∞ implies, via Tchebycheff's Inequality, that plim[b̄*_1j - E(b̄*_1j)] = 0. From Lemma 6, we have that plim(1 + F*)^(-1) = 0. Then, using a result in Cramér (1963, p. 255), it follows that

    plim{ b̄*_1j / (1 + F*) } = 0 .                                (3.23)

As for b_6j = q_j b_1j, it clearly has none of the desirable asymptotic properties except in the trivial case β_j = 0.
3.3 Relative Mean Square Efficiency of the Estimators: A Simulation Experiment

Since b_2, b_3, b_4 and b_5 are nonlinear in e, obtaining their p.d.f.'s or even first and second moments seemed a near impossible task. Thus, in order to assess the quality of these estimators, it was necessary to resort to a simulation experiment.

A detailed account of the simulation is deferred until the Appendix. The presentation in this section will encompass the salient results of the study.

Practical limitations on the size of the experiment made it necessary to confine attention almost exclusively to the two-regressors case. Conjectures concerning the nature of generalizations to p > 2 will be made in the following Section and in Chapter 4.
We wish to ascertain how the m.s.e. of b_ij (i > 1) compares with that of b_1j. Therefore we are interested in the relative mean square error efficiencies

    E_i = m.s.e.(b_ij; β_j) / m.s.e.(b_1j; β_j) ,   i > 1 .

From considerations of symmetry, it is clear that E_i is independent of j. The simulation computes estimates {Ê_i} for a range of values of the quantities upon which the {E_i} depend.
It uses n = 25 throughout, but from ancillary experiments it was determined that the values of the Ê_i are virtually stable in the range 10 ≤ n ≤ 50. For j = 1, 2, let λ_j = β_j² / (σ² s^jj) denote the noncentrality parameter of the Student's t test of H_0: β_j = 0 vs H_a: β_j ≠ 0. Then it was determined that for estimating β_j, the Ê_i depend only on r, λ_j, and λ_ℓ as follows, where, as before, ℓ = 3 - j, j = 1, 2:

    Ê_2 = Ê_2(λ_j, λ_ℓ, r) ,
    Ê_3 = Ê_3(λ_j) .                                              (3.24)

It was discovered that Ê_2 depends more heavily on λ_j than on λ_ℓ.
The estimator b_4 was discarded at an early stage of the investigation because it was found that Ê_2 is less than Ê_4 for virtually all λ_j, λ_ℓ, and r, and that whenever the contrary is true, it is by a very small margin and, moreover, Ê_4 exceeds Ê_3.
The experiment assumed that e is multivariate normal. It remains to be seen whether departures from this assumption seriously affect the conclusions we shall draw.
The Ê_i are presented in Table 3.1 (page 33) for 4 values of r, 7 or 8 values of λ (= λ_j or λ_ℓ, whichever is a more important determinant of the Ê_i in question), and 2 or 3 values of λ_ℓ for Ê_2. No formal investigation was made of the reliability of the entries; to have done so would have entailed a large simulation experiment in itself. The author is confident, however, that all entries are accurate to within ±.02 and that the accuracy increases to as fine as ±.0005 as the entries get very close to 1.⁸

⁸A discussion of this statement is given in the Appendix.
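A miniature version of the Monte Carlo computation of the Ê_i can be sketched as follows (replication count, seed, and parameter values are illustrative; the thesis itself used n = 25 and N = 500 with the design described in the Appendix). It estimates E_2 for the first coordinate by repeatedly simulating normal errors.

```python
import numpy as np

def estimate_E2(beta, sigma, X, n_reps=2000, seed=3):
    """Monte Carlo estimate of E_2 = m.s.e.(b_2j)/m.s.e.(b_1j) for j = 0,
    under normal errors; b_2 is the estimator (3.8)."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    se1 = se2 = 0.0
    for _ in range(n_reps):
        Y = X @ beta + sigma * rng.standard_normal(n)
        b1 = XtX_inv @ (X.T @ Y)
        s2 = (Y - X @ b1) @ (Y - X @ b1) / (n - p)
        reg_ss = b1 @ (X.T @ Y)
        b2 = (reg_ss / (s2 + reg_ss)) * b1         # estimator (3.8)
        se1 += (b1[0] - beta[0]) ** 2
        se2 += (b2[0] - beta[0]) ** 2
    return se2 / se1

rng = np.random.default_rng(4)
X = rng.standard_normal((25, 2))
print(estimate_E2(np.array([0.5, 0.5]), 1.0, X))
print(estimate_E2(np.zeros(2), 1.0, X))    # < 1: pure shrinkage toward the truth
```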
In addition to the Ê_i, the actual estimators themselves were computed. b_2j and b_3j were always slightly smaller than b_1j in absolute value, with b_2j usually being very close to b_1j -- these observations are actually deducible from inspection of the estimators themselves. b_5j is in a sense the most ambitious estimator because it was occasionally greater than b_1j in absolute value. Estimates were made of the proportion of m.s.e. attributable to squared bias, and that for b_5 was often as large as .25. The proportion for b_2 and b_3 rarely exceeded .15.
Some people would criticize the present model,

                                                            (3.25)

on the grounds that the fitted regression plane is constrained to pass through the origin rather than the point (ȳ, x̄_1, x̄_2); that we should consider instead the model

                                                            (3.26)

The simulation program was incapable of handling this setup, but the similar model

                                                            (3.27)

was subjected to examination. This is (1.1) for p = 3 with s_12 = s_13 = 0, and differs from (3.26) in that b_21 and b_31 are both unequal to ȳ.
The result of the simulation of (3.27) was that the performance of the b_2j and b_3j were almost indistinguishable from that of the b_2 and b_3 estimators in (3.25).⁹

Finally, the rounding of w̃_j to the nearest integer was found to have a negligible effect on the relative efficiency of b_5j as compared with its employment exactly as in (3.12).
3.4  Discussion of the Estimators

Prior to the commencement of the simulation, it was conjectured that b_2 would be the "worst" of the proposed estimators because of the basic simplicity of its modification of b_1. The modifying factor is a function only of b'X'Y/σ̂², a summary statistic that indicates the departure of the entire vector β from the null vector. For our coordinatewise criterion of goodness, an estimator that separately modified each b_1j in a way peculiar to that coordinate was thought to have a greater chance of success. Moreover, it seemed reasonable to guess that the quality of b_2 would decrease as p increased because b'X'Y/σ̂² contains a relatively decreasing amount of information about each individual b_1j as the number of coordinates grows.

The results of the simulation show that at least the first of these conjectures was far from the truth.
Ê_2 is less than either Ê_3 or Ê_5 far more often than the contrary is true. For r > 0, Ê_2 is only rarely greater than 1 (i.e., b_2j less efficient than b_1j). As r approaches +1, one is increasingly unlikely to encounter a (λ_1, λ_2) for which Ê_2 > 1, but on the other hand, the increases in efficiency over b_1j become increasingly negligible.

⁹This is a rather loose statement because for p > 2, it is not quite certain which r's and λ's are the relevant ones for Ê_2 and Ê_3.
The asymmetry of Ê_2 with respect to changes in the sign of r is a rather puzzling finding. In Table 3.1, the discrepancy between Ê_2 in the two cases r = .7 and r = -.7 is far greater than can be accounted for by the stated error inherent in the simulation. The Gauss-Markov estimator is symmetric in r and, intuitively, the estimation problem seems to be of equal difficulty when the sign of r is changed. No adequate analytical explanation has been conceived for this phenomenon.
As previously mentioned, the values b_2j were very close to the corresponding b_1j. The modifying factor b'X'Y/(b'X'Y + σ̂²) can be written as pF/(pF + 1), where F follows the variance-ratio distribution with p and n - p degrees of freedom and the corresponding noncentrality parameter. Since F is likely to vary directly with λ_j, we see that the modification can be expected to be considerable for β_j near zero, but only slight for β_j distant from the origin. For small |β_j|, the modification is in the "correct direction;" thus Ê_2 is small for λ_j near zero.
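The identity behind this rewriting can be checked numerically. The following is a minimal sketch (the function names are mine, not the thesis's), taking F = b'X'Y/(p·σ̂²) so that b'X'Y/(b'X'Y + σ̂²) equals pF/(pF + 1):

```python
# Check that the b2 modifying factor b'X'Y / (b'X'Y + s2)
# equals pF / (pF + 1) when F is taken as b'X'Y / (p * s2).
# The numerical values below are arbitrary illustrative inputs.

def modifying_factor(bxy, s2):
    """b'X'Y / (b'X'Y + sigma-hat^2)."""
    return bxy / (bxy + s2)

def via_f_ratio(bxy, s2, p):
    """Same factor written as pF / (pF + 1), F = b'X'Y / (p * sigma-hat^2)."""
    F = bxy / (p * s2)
    return p * F / (p * F + 1.0)

for bxy in (0.5, 3.0, 40.0):
    for p in (2, 3, 5):
        assert abs(modifying_factor(bxy, 1.3) - via_f_ratio(bxy, 1.3, p)) < 1e-12
```

The agreement holds for every p, which is why the factor's behavior is governed entirely by the variance-ratio statistic.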
Let us compare b_1j with b_6j, the family of estimators consisting of constant fractions of b_1j, j = 1, 2. From (3.14) we have

    E_6 = q_j² + (1 - q_j)² λ_j ;

i.e., the relative m.s.e. efficiency of b_1j to b_6j is actually linear in λ_j. For purposes of comparison, we present in Table 3.2 a listing of E_6 for several values of q_j and the same set of λ_j values appearing in Table 3.1.
Table 3.2 shows that b_6j is an extremely attractive alternative to b_1j for some range of q_j as long as one is quite confident that λ_j does not exceed a certain value. As q_j increases to 1, the relative efficiency gets very close to 1 even for small values of λ_j, but at the same time, the value of λ_j below which E_6 is less than 1, (1 + q_j)/(1 - q_j), increases markedly. E_6 is minimized at λ_j = λ_j0 by choosing q_j = λ_j0/(1 + λ_j0). Therefore, unless one is an intransigent minimaxer or knows that λ_j is quite likely to be large, there is probably some q_j for which b_6j is preferable to both b_1j and b_2j. If the contrary is true, the choice of some other estimator is indicated. The evaluation of b_6 can obviously be carried much further if one is willing to attribute a prior probability distribution to λ_j.
Table 3.2  Relative efficiency E_6 for various values of q_j

                                λ_j
    q_j     .5     1.0     1.5      2       5      10     100
    .30   .335    .580    .825    1.07    2.54    4.99    49.1
    .50   .375    .500    .625    .750    1.50    2.75    25.3
    .70   .535    .580    .625    .670    .940    1.39    9.49
    .90   .815    .820    .825    .830    .860    .910    1.81
    .95   .904    .905    .906    .908    .915    .928    1.15
    .99   .980    .980    .980    .980    .981    .981    .990
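Every entry of Table 3.2 follows directly from E_6 = q_j² + (1 - q_j)²λ_j, as do the break-even value (1 + q_j)/(1 - q_j) and the minimizing choice q_j = λ_j0/(1 + λ_j0). A short Python sketch (function names are mine) reproduces a few entries:

```python
def e6(q, lam):
    """Relative m.s.e. efficiency of b6j: E6 = q^2 + (1 - q)^2 * lambda."""
    return q * q + (1.0 - q) ** 2 * lam

def breakeven(q):
    """Value of lambda_j at which E6 = 1, namely (1 + q) / (1 - q)."""
    return (1.0 + q) / (1.0 - q)

def best_q(lam0):
    """Choice of q_j minimizing E6 at lambda_j = lam0."""
    return lam0 / (1.0 + lam0)

# A few entries of Table 3.2:
assert round(e6(0.30, 0.5), 3) == 0.335
assert round(e6(0.50, 2.0), 3) == 0.750
assert round(e6(0.70, 100.0), 2) == 9.49
# E6 equals 1 exactly at the break-even lambda:
assert abs(e6(0.5, breakeven(0.5)) - 1.0) < 1e-9
# best_q does better at lam0 = 2 than nearby choices of q:
assert e6(best_q(2.0), 2.0) <= min(e6(0.66, 2.0), e6(0.68, 2.0))
```

Since E_6 is linear in λ_j, one can also read the table row-wise: each row is a straight line in λ_j with intercept q_j² and slope (1 - q_j)².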
b_2 is similar in form to the Stein-James estimator, which is applicable when X'X = I and p > 2. Using the "optimal" γ = (p - 2)(n - p)/(n - p + 2), the modifying constant is seen from (2.1) to be

    1 - (p - 2)(n - p) σ̂² / [(n - p + 2) Y'XX'Y] ,

while that for b_2 is Y'XX'Y/(Y'XX'Y + σ̂²). The two are quite similar, and one is led to speculate that the decrease in relative efficiency due to the employment of (2.1) rather than b_1 is often negligible.
It is fairly safe to conclude from Table 3.1 that Ê_2 decreases to 1 as λ_j → ∞. This observation is easy to explain. Unless β_j = 0, the increase in λ_j corresponds to σ² → 0; i.e., b'X'Y/(b'X'Y + σ̂²) → 1, or b_2 → b_1. Similar explanations can be given for Ê_3 and Ê_5. (Recall that E_6 increases without bound as λ_j → ∞.)
Theorem 1 concerning the asymptotic behavior of b_2 embodies a regularity condition, (3.15) or (3.18). These conditions require that the sequences {x_tj}, t ≥ 1, do not dwell near the origin. The hypothesis of Theorem 2 includes the weaker statement "lim_{n→∞} s^jj = 0." It seems improbable that any of these requirements will often fail to be met in practice (especially when one is working with time series data).
Since for p = 2 the relative efficiency of b_3j depends only on λ_j, the quality of this estimator is unlikely to undergo substantial change as p increases so long as p doesn't get too close to n. But the increase in p should have some effect on the precision of the estimators b_1j and σ̂², and the value of s^jj; hence on b_1j²/(b_1j² + σ̂²s^jj) also.
39
The estimator b
3j
also was not
~
priori expected to be successful
because whereas
contains too little information concerning the
individual blj'
ignores the behavior of all components in b
aside
l
A
from b
lj
.
The fact that E is unaltered by changes in r leads to the
3
conclusion that m.s.e. (b ,; ~,) depends on r only to the sam~ extent
3J
J
sjj.
as var(b ,); i.e., through
Note also that E is symmetric in r.
S
lJ
The restriction of b_5 to the two regressors case severely limits its applicability. Certain facts that are always true when p = 2 are occasionally false for p ≥ 3--two of these were mentioned in Section 3.1.4--and another is the nonexistence of partial correlations among the columns of X until p > 2. It is intriguing that the Stein-James estimator is valid only for p > 2; i.e., when and only when b_5 is (in general) not.
The estimators b_2, b_3, and b_6, as well as the Stein-James estimator, alter b_1 by shifting each of its components closer to the origin. This type of modification is the most obvious way to decrease the variance of an estimator:

    var(cθ̂) = c² var(θ̂) < var(θ̂) ,   0 < c < 1 ,          (3.28)

where c is a constant, and one would expect (3.28) to hold even if c is a random variable unless the choice of c and θ̂ is rather bizarre. This is one explanation for the relative success of b_2, b_3, and b_6 as contrasted with that of b_4 and b_5.
In addition to their relatively poor performance in "improving" upon b_1, the latter two estimators permit an occasional difference in arithmetic sign between b_4j and b_1j. In many contexts, incorrect estimation of the sign of β_j can be a serious error. The first three estimators listed above perform no worse on this score than b_1.¹⁰

At the outset of this investigation we noted that the need for alternatives to b_1 becomes especially acute as |r| ↑ 1. Since r is a known quantity, an alternative to b_1 that performs well only for |r| near 1 would have been just as welcome as one that is relatively efficient for all r. Unfortunately, none of the new estimators display any dramatic overall improvement in relative efficiency as |r| ↑ 1.

¹⁰Of course, if the investigator incurs a special loss from incorrect estimation of sign, this information should be included in the specification of his risk function. In practice, however, this is infrequently done.
4.  SUMMARY, CONCLUSIONS AND RECOMMENDATIONS

4.1  Summary
This thesis is an attempt to provide alternatives to best linear unbiased (Gauss-Markov) estimation in the general linear hypothesis model of full rank (Graybill, 1961). Alternatives are especially desirable in the presence of multicollinearity because the variances of the Gauss-Markov estimators may then be excessively large. By changing the criterion of goodness to mean square error in each separate coordinate of the vector estimator, it is occasionally possible to construct slightly biased estimators having far smaller variances than those of the usual estimator. It is felt that when the statistician's aim is efficient structural estimation (rather than prediction), in practice few people would have serious reservations about this minor change in loss structure.

Five new estimators (b_i, i = 2, ..., 6) are constructed and presented as prospective applicable alternatives to the Gauss-Markov estimator (called b_1 herein). Each of the proposed estimators takes the form of a modification of b_1.

The direct determination of the quality of the new estimators was possible only for b_6. That of the remaining four estimators was disclosed in the two regressors case by a computer simulation experiment. Statements are made concerning the prospects for generalization of the results of the simulation to situations where there are three or more independent variables. The results of the simulation are presented in a table of estimated relative mean square error efficiencies of b_1 to the {b_i}. The entries in the table are found to be dependent upon various combinations of (r, λ_1, λ_2), where r denotes the correlation between the two regressors and the {λ_j} are noncentrality parameters of conventional statistical tests relating to the model.
Results concerning the asymptotic properties of the new estimators are given for b_2, b_3, b_4 and b_6. The three estimators b_2, b_3 and b_6 are all of the form

    b_ij = g_i b_1j ,   i = 2, 3, 6,                        (4.1)

where b_ij and b_1j refer to the j-th coordinate of the vector b_i or b_1, and g_i is a random variable bounded by 0 and 1. As a general rule, these three estimators are found to be preferable by far to either b_4 or b_5. Their relative efficiencies are less than 1 for a surprisingly wide range of (r, λ_1, λ_2).
The estimator b_6 is in many cases an attractive alternative to b_1, but at other times it has some extremely unfavorable properties that premonish against its use. The use of b_4 or b_5 is not advised under any circumstances.

An estimator that has frequently appeared in the statistical literature, originally conceived by James and Stein (1961), also happens to be of the form (4.1). The conclusions which emerge from the simulation herein lead to some tentative notions about the behavior of this estimator. The results in this thesis are appraised in the light of the findings in James and Stein (1961) and recent research along similar lines by other investigators.

A detailed account of the simulation study, with particular emphasis on its design aspect, is presented in the Appendix.
4.2  Conclusions and Recommendations

One of the vital (though too often underemphasized) properties of b_1 is its robustness to departures from the distributional assumptions made for e. Before certifying any of the new estimators other than b_6 for actual use in practice, a study must be made of their robustness. Another aspect of the {b_i} that needs to be examined is their sensitivity to minor changes in X. When multicollinearity is present to a serious extent, b_1 is overly responsive to such changes. It seems unlikely, however, that the proposed estimators will be much less sensitive than b_1 because of their heavy functional dependence on it.
Among the many virtues of b_1 is the ease in obtaining a best quadratic unbiased estimate of its variance. No means of obtaining "good" estimates of measures of reliability of the new estimators has been presented herein. In view of the fact that the exact small-sample moments of b_2, b_3, b_4, and b_5 are unknown, the construction of such estimates is likely to be a difficult analytical problem.
Since the results of the simulation are almost exclusively limited to the p = 2 case, it is necessary to consider the probable effects of the relaxation of this assumption. My guess is that so long as p does not get "too close" to n, the relative efficiencies Ê_2 and Ê_3 will behave similarly to what has been discovered when there are only two regressors in the model.
The absence of any knowledge of the values of the {β_j} has been an explicit assumption throughout this thesis because β and σ² are unknown in (1.1). (If there exists such prior knowledge and it is formally considered to be an inherent part of the model, b_1 may lose many of its optimum properties.) Hence it is impossible to recommend any single b_i over all others because none of them is uniformly best over all {λ_j}.
More often, however, the investigator has some idea of the range of the {λ_j}, though it may be difficult to incorporate such vague information into the estimation procedure. Depending upon his willingness to risk using an inefficient estimator in order to have the opportunity to use a possibly efficient one, he may wish to consider b_2 or b_6 as an alternative to b_1. b_2j might be considered if λ_j is known to be rather small, and b_6j merits attention if he is sure that λ_j is not very large. With the use of b_2, one has little to gain but little to lose, while the employment of b_6 can lead to appreciable gain or extreme regret. It is difficult to conceive of circumstances where one would wish to use b_3, b_4 or b_5.
A new line of research that this brings to mind is a two-stage estimation procedure wherein we first estimate the {λ_j} (say with {λ̂_j}), and based on these estimates choose some estimator (possibly b_1) that is relatively efficient for the {λ_j} in some neighborhood of the {λ̂_j}. A convenient estimator is λ̂_j = b_1j² / σ̂²s^jj, which is distributed as a noncentral F variate with 1 and n - p degrees of freedom and noncentrality parameter λ_j.

Consider, for instance, the following outline of a two-stage procedure utilizing b_6, which supposes that our objective is to use b_6j rather than b_1j subject to the guarantee that
E_6j ≤ 1 with probability at least 1 - α, for some preassigned α ∈ (0, 1). First choose λ̂_j0 such that

    Pr[λ_j ≤ λ̂_j0] ≥ 1 - α .¹¹                             (4.2)

Then choose q_j just large enough so that E_6j(q_j) = 1 when λ_j = λ̂_j0; viz.: q_j = (λ̂_j0 - 1)/(λ̂_j0 + 1).
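The second-stage choice of q_j is a one-line computation. A minimal sketch (function names are mine; it assumes a bound lam0 = λ̂_j0 > 1 has already been obtained via (4.2)):

```python
def e6(q, lam):
    """E6 = q^2 + (1 - q)^2 * lambda, the relative efficiency of b6j."""
    return q * q + (1.0 - q) ** 2 * lam

def q_for_bound(lam0):
    """The q_j for which E6 = 1 exactly at lambda_j = lam0,
    i.e. (lam0 - 1) / (lam0 + 1).  Requires lam0 > 1; for lam0 <= 1
    every q in (0, 1) already gives E6 < 1 at lam0."""
    return (lam0 - 1.0) / (lam0 + 1.0)

# With this q, b6j breaks even exactly at the bound and is strictly
# better than b1j for any smaller lambda_j:
lam0 = 4.0
q = q_for_bound(lam0)                 # 0.6
assert abs(e6(q, lam0) - 1.0) < 1e-12
assert e6(q, 2.0) < 1.0
```

Since E_6 is increasing in λ_j for fixed q_j, the break-even point at λ̂_j0 translates the confidence statement (4.2) directly into the guarantee on E_6j.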
The prior literature dealing with the subject of this thesis is hardly more encouraging than the results reported here. Of the references surveyed in Chapter 2, only two give one much over which to be optimistic concerning the likelihood of significant future progress in the study of biased estimation of regression coefficients. I am impressed with the finding by James and Stein (1961) that (with their loss function) b_1 is an inadmissible estimator for p > 2. But admissibility is not often a crucial property of estimators for the applied statistician because it is so rare that he cannot (with the knowledge of theoretical considerations underlying the model) place some sort of bounds on the likely ranges of parameters to be estimated. Conversely, inadmissible estimators are not to be hastily abandoned. James and Stein have made no mention of the probable quality of their estimator. As indicated earlier, our results concerning the quality
¹¹There exists a uniformly most powerful test of H_0: λ_j ≤ λ_j0 vs H_a: λ_j > λ_j0 based on the noncentral beta distribution. See Toro-Vizcarrondo and Wallace (1968) and Toro-Vizcarrondo (1968) for a full discussion. It follows from Lehmann (1966, pp. 68, 80) that there is a uniformly most accurate confidence bound for λ_j of the form indicated in (4.2).
of b_2 give rise to an educated conjecture that the improvement of (2.1) over b_1 will often be insignificant. Moreover, for the reasons given in Section 1.4, the applicability of a weighted sum of mean square errors loss function is often highly doubtful. It is hoped that future work along these lines by mathematical statisticians will be somewhat more considerate of the needs of experimental researchers, not the least of which is a loss structure of form (1.8) rather than (1.6). While employing the loss structure (1.7), Hoerl and Kennard (1970a, b) have taken a fresh, novel approach to the whole problem, which for several examples they present has been an unqualified success. The question of the stability of b_1 in the face of small changes in the data is in this context equivalent to the problem of large variances. It remains to be seen how much more stability can be achieved without adding large biases to the individual estimators.
The prospects for future major improvements upon Gauss-Markov estimation are not particularly promising. Aside from the ridge regression procedure, the few successes to date are of limited applicability because they either presuppose much prior knowledge about the {λ_j} or are improvements to only a negligible extent. I think there is some chance that two-stage estimation procedures of the sort discussed above may yield slightly better estimators than those examined herein, but it should be recognized that ease of computation is one of the virtues of b_1, and as we proceed to explore increasingly complex estimators, we must begin to consider whether the extra computational effort is justified by the prospective gain in efficiency.
The Gauss-Markov and Rao-Blackwell Theorems are results of remarkable conceptual simplicity. If one must rule out the possibility of bringing additional information to bear, I intuitively feel that the absence of similarly appealing theorems for estimation with a mean square error criterion of goodness (as cast in this thesis) signifies that a truly satisfying solution to the problem will never be attained.
5.  LIST OF REFERENCES

Bancroft, T. A. 1944. On biases in estimation due to the use of preliminary tests of significance. Annals of Mathematical Statistics 15:190-204.

Baranchik, A. J. 1964. Multiple regression and estimation of the mean of a multivariate normal distribution. Technical Report No. 51, Department of Statistics, Stanford University, Stanford, California.

Baranchik, A. J. 1970. A family of minimax estimators of the mean of a multivariate normal distribution. Annals of Mathematical Statistics 41:642-645.

Bhattacharya, P. K. 1966. Estimating the mean of a multivariate normal population with general quadratic loss function. Annals of Mathematical Statistics 37:1819-1824.

Bodewig, E. 1956. Matrix Calculus. North Holland Publishing Co., Amsterdam.

Cramér, H. 1963. Mathematical Methods of Statistics. Princeton University Press, Princeton, New Jersey.

Farrar, D. E., and Glauber, R. R. 1967. Multicollinearity in regression analysis: the problem revisited. Review of Economics and Statistics 49:92-107.

Fraser, D. A. S. 1966. Nonparametric Methods in Statistics. John Wiley and Sons, Inc., New York, New York.

Graybill, F. A. 1961. An Introduction to Linear Statistical Models, Vol. 1. McGraw-Hill Book Co., Inc., New York, New York.

Hoerl, A. E., and Kennard, R. W. 1970a. Ridge regression: biased estimation for nonorthogonal problems. Technometrics 12:55-67.

Hoerl, A. E., and Kennard, R. W. 1970b. Ridge regression: applications to nonorthogonal problems. Technometrics 12:69-82.

James, W., and Stein, C. 1961. Estimation with quadratic loss. Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability 1:361-379. University of California Press, Berkeley and Los Angeles.

Kendall, M. G., and Stuart, A. 1967. The Advanced Theory of Statistics, Vol. II. Hafner Publishing Co., New York, New York.

Lehmann, E. L. 1966. Testing Statistical Hypotheses. John Wiley and Sons, Inc., New York, New York.

Loève, M. 1963. Probability Theory. D. Van Nostrand Co., Inc., Princeton, New Jersey.

Malinvaud, E. 1966. Statistical Methods of Econometrics. Rand McNally and Co., Inc., Chicago, Illinois.

Rao, C. R. 1965. Linear Statistical Inference and Its Applications. John Wiley and Sons, Inc., New York, New York.

Sclove, S. L. 1966. Improved estimation of regression parameters. Technical Report No. 125, Department of Statistics, Stanford University, Stanford, California.

Sclove, S. L. 1968. Improved estimation for coefficients in linear regression. Journal of the American Statistical Association 63:596-606.

Stein, C. 1956. Inadmissibility of the usual estimator for the mean of a multivariate normal distribution. Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability 1:197-206. University of California Press, Berkeley and Los Angeles.

Toro-Vizcarrondo, C. 1968. Multicollinearity and the mean square error criterion in multiple regression: a test and some sequential estimator comparisons. Unpublished Ph.D. thesis, Department of Experimental Statistics, North Carolina State University at Raleigh. University Microfilms, Ann Arbor, Michigan.

Toro-Vizcarrondo, C., and Wallace, T. D. 1968. A test of the mean square criterion for restrictions in linear regression. Journal of the American Statistical Association 63:558-572.
6.  APPENDIX:  THE SIMULATION DESIGN AND PROGRAM

A simulation experiment was used to compute the estimated relative efficiencies appearing in Table 3.1.¹² As explained in Section 3.3, the computations in the table are based on the assumptions that the random errors are normally distributed and n = 25.

The input for the simulation consists of the full rank matrix X (n x p), the parameter vector β, and σ². The program generates the n random N_1(0, σ²) disturbances which comprise e, and computes the vector Y = Xβ + e. Then, pretending that we do not know β and σ², it calculates from X and Y the values of the estimators b_1, b_2, b_3, b_4 and b_5 (with the exception that the calculation of b_5 is omitted if p > 2). This operation is repeated with a new random e for a total of N iterations, and estimates

    m.s.e.(b_ij; β_j) = ave (b_ij - β_j)²                   (6.1)

are computed for i = 1, 2, 3, 4, 5 and j = 1, 2, ..., p. In (6.1), "ave" refers to the average value over iterations. Finally, the relative efficiencies are estimated according to

    Ê_i = m.s.e.(b_ij; β_j) / m.s.e.(b_1j; β_j) .           (6.2)

While m.s.e.(b_1j) is known to be equal to σ²s^jj, the estimate rather than the population value was used in the denominator of (6.2) to check the effect of any systematicality that might have been present in the nN generated errors.

¹²I am grateful to Mr. James Goodnight, Department of Experimental Statistics, North Carolina State University at Raleigh, for writing the computer program used in this study.
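The loop just described is straightforward to re-create on a modern machine. The following Python sketch is mine, not a transcription of the original program: for brevity it uses a single regressor (so the least-squares algebra stays scalar) and lets b_6j = q·b_1j stand in for the thesis estimators, since its exact efficiency E_6 = q² + (1 - q)²λ is available as a check on the Monte Carlo estimates (6.1) and (6.2).

```python
# Re-creation of the simulation loop: fixed design, N iterations of
# fresh normal errors, accumulated squared errors, then the ratio (6.2).
import math
import random

random.seed(1970)

n, N = 25, 2000                           # sample size and iterations
x = [0.5 + 0.1 * t for t in range(n)]     # fixed design column (arbitrary)
sxx = sum(v * v for v in x)
sigma2 = 1.0
beta = 1.0 / math.sqrt(sxx)               # chosen so lam = beta^2 * sxx / sigma2 = 1
lam = beta * beta * sxx / sigma2
q = 0.5                                   # constant fraction defining b6

sq_err_b1 = sq_err_b6 = 0.0
for _ in range(N):
    e = [random.gauss(0.0, math.sqrt(sigma2)) for _ in range(n)]
    y = [beta * x[t] + e[t] for t in range(n)]
    b1 = sum(x[t] * y[t] for t in range(n)) / sxx   # least squares
    b6 = q * b1                                     # constant-fraction estimator
    sq_err_b1 += (b1 - beta) ** 2                   # accumulates (6.1)
    sq_err_b6 += (b6 - beta) ** 2

e_hat = sq_err_b6 / sq_err_b1                  # (6.2), estimated efficiency
e_exact = q * q + (1.0 - q) ** 2 * lam         # exact E6 for comparison
assert abs(e_hat - e_exact) < 0.1
```

With N = 2000 the estimated ratio settles within a few hundredths of the exact value, which is consistent with the ±.02 accuracy the thesis claims for N = 500.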
Clearly, the input quantities were not chosen haphazardly. The major task in the design of the simulation was to answer the question, "In what respects can X, β, and σ² be selected without loss of generality?" It was determined that, for p = 2, they can be taken arbitrarily subject to their leading to the desired values of the variables r, λ_1, and λ_2 defined in Section 3.3. To describe the behavior of an Ê_i, we estimate it for a number of configurations of the quantities upon which the estimate depends.

Equations (3.24) were arrived at through what was essentially a trial and error procedure. For example, Ê_3 was unaffected by a change in r while Ê_2 was not, but doubling each of β_j, σ², and s^jj (keeping r constant) left Ê_2 invariant.

Given a finite amount of available computer time, it was necessary to choose a rather limited number of the r, λ_1, and λ_2. Four r's were chosen: .3, .7, .98, and -.7. These are, roughly speaking, a low, average, high and average negative correlation, respectively. The 7 or 8 values of the "more important λ" were deemed sufficient to give a good indication of the functional relationship under consideration.

In practice, σ² was set equal to 1. Next, X was conveniently chosen subject to its yielding the desired r. The choice of X fixed the s^jj. Then the β_j were selected so as to give the desired λ_j = β_j² / σ²s^jj, j = 1, 2.
52
Another major problem that had to be tackled was the method of
choice of the number of iterations) N.
A large number of iterations
was needed to stabilize the sample estimates (6.2») but the computer
time involved was roughly in proportion to N.
To check on their
stabilization) the cumulative efficiencies E. were printed out at
~
intervals of 100 iterations.
It was found that by taking N = 500)
Table 3.1 could be constructed to the degree of accuracy indicated in
Section 3.3.
This was thought to be adequate in view of the goal of
the simulation) which was merely to make a comparison between estiA
mators and not the formal tabulation of moments.
The E. in the table
~
were informally (but not casually) obtained from a careful examination
of the results at the end of 300) 400) and 500 iterations.
As an
illustration of the procedure employed) we consider two examples.
For computing E with
3
Aj
= 1 and r = .7)
the estimates after 300) 400
and 500 iterations were .786) .768) and .773 respectively.
was employed in Table 3.1.
Thus. 77
In Section 3.3) an accuracy of ±.02 was
claimed for the CEil; ~'!') that E3 lies between .75 and .79.
This
assumption seems fairly safe in view of the stepwise estimates
A
obtained above.
and r
=
Next consider the computation of E with
2
Aj = At = 10
A
.98.
Here the values of E after 300) 400 and 500 iterations
2
were .9979) .9983) and .9982 respectively.
used for Table 3.1.
Thus the value .998 was
It is even clearer in this case that we have ±.02
accuracy for our estimate; it is not unlikely that the true accuracy
is as fine as ±.OOI.
It would have been preferable to choose N according to some stopping procedure built into the program; this would have assured that the Ê_i are measured with approximately equal precision. But it was felt that such an inordinate complication of the program would not significantly enhance the quality of the study.

The same nN = 12,500 random N_1(0, 1) numbers were used (in the same order) for each entry in Table 3.1 in order to insure the ceteris paribus nature of the measurement of the effect of a change in estimator, r, or λ's.

For each entry in the table, the corresponding estimates were made of "the proportion of m.s.e. attributable to squared bias" by computing the ratios bias²(b_ij; β_j) / m.s.e.(b_ij; β_j), i = 1, 2, 3, 4, 5, where bias(b_ij) = ave (b_ij) - β_j.

The construction of Table 3.1 utilized approximately 30 minutes of time on an I.B.M. 360-75 Computer. This excludes all time consumed in the design of the simulation and ancillary experiments.

Generalization of these results to p > 2 is likely to present the investigator with formidable problems of simulation design, for it is conceivable that some Ê_i might depend upon as many as p(p - 1)/2 r's and 2p - 2 λ's.