Sawyer, J.W., Jr. and Read, K.L.Q. (1982). "Estimating Nonlinear Functional Relations by Maximum Likelihood and Least Squares."

ESTIMATING NONLINEAR FUNCTIONAL RELATIONS
BY MAXIMUM LIKELIHOOD AND LEAST SQUARES

by

J.W. Sawyer, Jr. and K.L.Q. Read

Department of Biostatistics
University of North Carolina at Chapel Hill

Institute of Statistics Mimeo Series No. 1374
January 1982
Estimating Nonlinear Functional Relations
by Maximum Likelihood and Least Squares

J. W. Sawyer and K.L.Q. Read
Institute of Statistics Mimeo Series No. 1374

Errata
Page 1:
  eq. (1.1) should read:  x⁰_{i,p+1} = f(x⁰_{i1}, ..., x⁰_{ip}, α⁰).
  Line 4 from bottom:  i = 1, 2, ..., t.
  Asterisk V as:  V*.
  Insert footnote at bottom of page:
    * If V is taken to refer to arbitrarily chosen vectors X_{1k₁}, ..., X_{tk_t},
      concatenated as X' = (X'_{1k₁}, ..., X'_{tk_t}), the m_i's need not be equal.

Page 2:
  Line 2:  V, not Σ.
  Para 2 line 6:  insert after "... of the data":  "(that is, one".
  Para 2 line 8:  "parameters)."  (Insert right parenthesis.)

Page 4:

Page 6:
  6 lines above eq. (2.1):  the pr + r functions.
  eq. (2.1):  ...
  2 lines below eq. (2.5):  Σ_{i=1}^t.

Page 8:
  S_N should read:  Σ m_i (X̄_i − x_i)' Σ̃⁻¹ (X̄_i − x_i).

Page 12:
  2nd eq.:  ∂ℒ_N/∂... = ...;  ν = 1, 2, ..., r;  "any subset i₁, ..., i_r of the ...".
Estimating Nonlinear Functional Relations
by Maximum Likelihood and Least Squares

by

J. W. Sawyer, Jr.¹ and K. L. Q. Read²

¹ Department of Biostatistics, School of Hygiene and Public Health,
The Johns Hopkins University, Baltimore, Maryland 21205.

² Department of Mathematical Statistics and Operations Research,
University of Exeter, Exeter EX4 4PU.
SUMMARY
It is assumed that p + 1 unknown variables at t points are such that the (p + 1)st variable is given by a nonlinear function, f, of the other p variables and a vector α of structural parameters. Replicate observations on the p + 1 variables for each of the t points are assumed to be mutually independent and multivariate normal, with common covariance matrix Σ. For known Σ, obtaining the maximum likelihood estimate α̂ of α is equivalent to minimizing a weighted sum of squares with respect to α and pt nuisance parameters. If Σ is replaced by a consistent estimate Σ̃ in the weighted sum of squares, the weighted least squares estimate α̃ is no longer equal to α̂. However, for N total observations, √N(α̂ − α⁰) and √N(α̃ − α⁰) have the same asymptotic distribution. This result implies that, even if Σ must be estimated, a close relationship exists between maximum likelihood and the generalized least squares procedure which Dolby (1972) has already applied to nonlinear functional relations with Σ known.
KEYWORDS:
Nonlinear functional relation, maximum likelihood,
weighted least squares, generalized least squares
1. INTRODUCTION

For (p + 1)-dimensional vectors x⁰_i, i = 1, 2, ..., t, for an r-dimensional vector α⁰, and for known f, let

    x⁰_{i,p+1} = f(x⁰_{i1}, ..., x⁰_{ip}, α⁰),    i = 1, 2, ..., t,    (1.1)

hold for some unknown nonrandom vectors x⁰₁, ..., x⁰_t and α⁰. Let m_i observations

    X_{i1}, ..., X_{i,m_i},    i = 1, 2, ..., t,    (1.2)

exist on each x⁰_i. Then the observations (1.2) are said to have a functional relationship.
Most treatments of the problem of estimating α⁰ have assumed that each X_ik is multivariate normal. If t is fixed while each m_i becomes large, then maximum likelihood is an appropriate method of estimation. Dolby and Freeman (1975) discuss maximum likelihood estimation for linear or nonlinear f, with all m_i = m > 1. They allow for correlation between X_ik and X_jk for i ≠ j, but require that X_ik and X_jh be independent for k ≠ h, even if i = j.
Let X_k = (X'_{1k}, ..., X'_{tk})' and let all X_k have variance-covariance matrix V.* Then another method of estimating α⁰ is to calculate α̃ minimizing the weighted sum of squares

    Σ_{k=1}^m (X_k − x)' V⁻¹ (X_k − x),    (1.3)

where x' = (x₁', ..., x_t') satisfies the equations (1.1).

* If V is taken to refer to arbitrarily chosen vectors X_{1k₁}, ..., X_{tk_t}, concatenated as X' = (X'_{1k₁}, ..., X'_{tk_t}), the m_i's need not be equal.
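As a concrete illustration of minimizing (1.3), the following sketch fits a small simulated example with V block diagonal (each block Σ), so the criterion reduces to a double sum over points and replicates. The choice f(x, α) = e^{αx}, the sample sizes, the covariance Σ, and the use of a Nelder-Mead optimizer are all illustrative assumptions added here, not prescriptions from the paper.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

# Illustrative setup: t = 3 points, p = 1, so each x_i = (x_i1, x_i2) with
# the constraint (1.1) taken to be x_i2 = f(x_i1, alpha) = exp(alpha * x_i1).
t, m, alpha0 = 3, 25, 0.7
x1_true = np.array([0.5, 1.0, 1.5])
x_true = np.column_stack([x1_true, np.exp(alpha0 * x1_true)])

Sigma = np.array([[0.04, 0.01], [0.01, 0.04]])   # common error covariance
Sigma_inv = np.linalg.inv(Sigma)
X = x_true[:, None, :] + rng.multivariate_normal(np.zeros(2), Sigma, size=(t, m))

def wss(theta):
    """Weighted sum of squares (1.3) with the x_i constrained by (1.1)."""
    alpha, x1 = theta[0], theta[1:]
    x = np.column_stack([x1, np.exp(alpha * x1)])
    d = X - x[:, None, :]                        # residuals X_ik - x_i
    return np.einsum('ikp,pq,ikq->', d, Sigma_inv, d)

theta0 = np.concatenate([[alpha0], x1_true])     # start at the true values
res = minimize(wss, theta0, method='Nelder-Mead',
               options={'xatol': 1e-10, 'fatol': 1e-10, 'maxiter': 20000})
alpha_tilde = res.x[0]
```

The nuisance parameters x_{11}, ..., x_{t1} are optimized jointly with α, exactly as in the text; only the first p coordinates of each x_i are free, the last being determined by f.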
For known V and multivariate normal observations, α̃ is clearly the maximum likelihood estimate. The need to estimate the pt nuisance parameters x⁰_{11}, ..., x⁰_{1p}, ..., x⁰_{tp} can be eliminated by a modification of (1.3) which involves first order Taylor series approximations of the constraints (1.1). For p = 1 and known V, Dolby (1972) demonstrates that this approach, known as generalized least squares, is equivalent to maximum likelihood, except for the error arising from the first order linear approximation to a nonlinear f.
For any p and for unknown V, Dolby and Freeman (1975) establish that the maximum likelihood and generalized least squares estimates are equivalent if f is linear. The equivalence to within an approximation of generalized least squares and maximum likelihood for nonlinear f and known V, which Dolby (1972) establishes for p = 1, can easily be extended to the case of arbitrary p by a similar development.
Now suppose that a consistent estimate Ṽ of V replaces V in (1.3), and further suppose that Ṽ is a fixed, never-updated function of the data (that is, one which does not involve iterated estimates of α⁰ or the nuisance parameters). The estimate α̃ obtained by minimizing (1.3) will still be equivalent to a generalized least squares estimate based on Ṽ instead of V, to within a first order approximation. This can be established by viewing the negative of (1.3) as a "likelihood" to be maximized, and employing the same arguments as Dolby applies to the likelihood when V is known.
The primary purpose of this paper is to demonstrate that, even if V must be estimated, minimization of (1.3) is still as precise a procedure as maximum likelihood, in the sense that α̃ and the maximum likelihood estimate have the same asymptotic distribution. In the light of the discussion above, this means that any asymptotic inefficiency involved in using generalized least squares as opposed to maximum likelihood for nonlinear f will be due only to the approximation inherent in the generalized least squares estimate.

For normal data the equality of the asymptotic distributions of a weighted least squares estimate α̃, employing random weights, and the maximum likelihood estimate, will be established in Section 4. Some preliminary results necessary for Section 4 will be presented in Section 2, along with some results necessary to Section 3, which establishes the consistency of α̃ without assuming that the data are normal.
2. PRELIMINARIES

The formal model to be treated is as follows: Let (1.1) and (1.2) hold, and further require that X_ik and X_jh be independent unless i = j and k = h. Let all X_ik have non-singular covariance matrix Σ. Thus V becomes block diagonal, with all diagonal blocks equal to Σ. (This model is intuitively appealing in that it allows sums of squares and cross products to be pooled in order to estimate Σ. With minor modifications, the arguments which follow will work just as well for the model with general V treated in Dolby and Freeman (1975), or for a model which allows a separate matrix Σ_i for each set of replications X_{i1}, ..., X_{i,m_i}, i = 1, 2, ..., t.) Let all m_i increase in fixed proportions; that is, for N = Σ_i m_i, let m_i = π_i N for fixed π_i.
Let f possess first partials with respect to all its arguments which are continuous in a neighborhood of the true values. For every choice i₁, ..., i_r of the integers from 1 to t, let the determinant of the matrix with elements ∂f_{i_ν}/∂α_μ, μ, ν = 1, 2, ..., r, be different from 0. It will follow that, in a neighborhood of the true values, the pr + r functions

    x_{i_ν,p+1} = f(x_{i_ν 1}, ..., x_{i_ν p}, α),    ν = 1, 2, ..., r,

and

    x_{i_ν j},    ν = 1, 2, ..., r,  j = 1, 2, ..., p,

will have an inverse possessing continuous first partials, and in particular,

    α = g(x_{i₁}, ..., x_{i_r})    (2.1)

for some vector function g which possesses continuous first partials. It follows that, within some neighborhood of the true values, any vectors x₁, ..., x_t satisfying (1.1) are such that the corresponding α may be obtained by any one of the (t choose r) possible functions g.
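In the simplest case the function g of (2.1) can be written down explicitly. The following worked example, with p = r = 1 and a hypothetical f, is an illustration added here, not taken from the original:

```latex
% Hypothetical illustration: p = r = 1, f(x_{i1}, \alpha) = e^{\alpha x_{i1}},
% so (1.1) reads x^0_{i2} = e^{\alpha^0 x^0_{i1}}.
% For a single chosen index i with x_{i1} \neq 0 and x_{i2} > 0:
\alpha = g(x_{i1}, x_{i2}) = \frac{\ln x_{i2}}{x_{i1}},
\qquad
\frac{\partial g}{\partial x_{i1}} = -\frac{\ln x_{i2}}{x_{i1}^{2}},
\qquad
\frac{\partial g}{\partial x_{i2}} = \frac{1}{x_{i1}\, x_{i2}},
% both continuous on this region.  The determinant condition above reduces to
% \partial f / \partial \alpha = x_{i1} e^{\alpha x_{i1}} \neq 0, i.e. x_{i1} \neq 0.
```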
This result can be used to establish that if some estimates x̂₁, ..., x̂_t of the true population means satisfy (1.1), and if the differences x̂₁ − x⁰₁, ..., x̂_t − x⁰_t are O_p(N^{-1/2}), then α̂ − α⁰ is O_p(N^{-1/2}) also. This result in turn will be necessary in order to apply a result of Weiss (1971).
Let a data vector Z_n have log-likelihood ℓ(Z_n; θ⁰) for a q-dimensional vector θ⁰. Let θ̃_n be an estimate of θ⁰ satisfying

    lim_{n→∞} P{ √n |θ̃_in − θ⁰_i| ≥ L(n) } = 0    (2.2)

for i = 1, 2, ..., q and all non-random sequences {L(n)} with L(n) → ∞. (Condition (2.2) holds if and only if θ̃_in − θ⁰_i is O_p(n^{-1/2}).) Let Λ_n(θ̃_n) be the vector ∂ℓ/∂θ evaluated at θ̃_n, and let I_n(θ̃_n) be the q × q information matrix evaluated at θ̃_n. Under mild regularity conditions, Weiss demonstrates that

    θ*_n = θ̃_n + [I_n(θ̃_n)]⁻¹ Λ_n(θ̃_n)    (2.3)

is such that √n(θ*_n − θ⁰) has the same asymptotic distribution as √n(θ̂_n − θ⁰), where θ̂_n is the maximum likelihood estimate.
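The one-step construction (2.3) can be made concrete with a toy model. The Poisson example below is an assumed illustration, not from the paper; for this model, one Newton step from any root-n-consistent start lands exactly on the ML estimate, so θ* and θ̂ coincide rather than merely sharing an asymptotic distribution.

```python
import numpy as np

# Toy model: X_1, ..., X_n iid Poisson(theta0).  The sample variance is a
# root-n-consistent (but non-ML) estimate of theta0, playing the role of
# theta-tilde in (2.2); the ML estimate is the sample mean.
rng = np.random.default_rng(2)
n, theta0 = 500, 3.0
x = rng.poisson(theta0, size=n)

theta_tilde = x.var(ddof=1)               # root-n-consistent starting estimate
score = n * (x.mean() / theta_tilde - 1)  # Lambda_n evaluated at theta-tilde
info = n / theta_tilde                    # I_n(theta-tilde), Fisher information
theta_star = theta_tilde + score / info   # the one-step estimate (2.3)
```

Algebraically, θ* = θ̃ + (θ̃/n)·n(x̄/θ̃ − 1) = x̄, the MLE, whatever the starting value.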
Now if n is equated with N for the model described at the beginning of this section, and the data are multivariate normal, θ⁰ will be the vector of dimension pt + r + (p+1)(p+2)/2 comprising the pt free parameters x⁰_{11}, ..., x⁰_{1p}, ..., x⁰_{tp}, the parameters α⁰₁, ..., α⁰_r, and the free parameters of Σ. Further, I_N(θ⁰)/N will be continuous and non-random for all permissible N, and I_N(θ̃_N)/N will converge to I_N(θ⁰)/N in probability. Now (2.3) can be rewritten

    √N(θ*_N − θ⁰) = √N(θ̃_N − θ⁰) + [I_N(θ̃_N)/N]⁻¹ Λ_N(θ̃_N)/√N.    (2.4)

It follows by Slutsky's theorem and a theorem of Cramér (1946) that if Λ_N(θ̃_N)/√N converges in probability to 0, then √N(θ̃_N − θ⁰), √N(θ*_N − θ⁰), and √N(θ̂_N − θ⁰) will all have the same asymptotic distribution.
The following simple result establishes that α̂ meets the criteria (2.2) if the corresponding vectors x̂₁, ..., x̂_t satisfying (1.1) do. Let h₁, ..., h_s be non-negative functions of random variables Y₁, ..., Y_s, respectively. Then for nonrandom y,

    P{ h₁(Y₁) + ... + h_s(Y_s) ≥ y } ≤ Σ_{i=1}^s P{ h_i(Y_i) ≥ y/s },    (2.5)

since the sum can reach y only if at least one term reaches y/s. A simple proof of (2.5) for s = 2 is given in Tucker (1967).
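Inequality (2.5) can be checked numerically; the distributions and functions h_i below are arbitrary illustrations. Because the indicator inequality 1{Σ h_i ≥ y} ≤ Σ 1{h_i ≥ y/s} holds pointwise, the empirical frequencies satisfy (2.5) exactly, not just approximately.

```python
import numpy as np

# Monte Carlo check of (2.5):
#   P{ sum_i h_i(Y_i) >= y }  <=  sum_i P{ h_i(Y_i) >= y/s }.
rng = np.random.default_rng(1)
s, n, y = 3, 100_000, 4.0

Y = rng.normal(size=(s, n))                          # Y_1, ..., Y_s (iid N(0,1) here)
h = np.array([Y[0]**2, np.abs(Y[1]), np.exp(Y[2])])  # non-negative h_i(Y_i)

lhs = np.mean(h.sum(axis=0) >= y)                    # empirical P{ sum h_i >= y }
rhs = sum(np.mean(h[i] >= y / s) for i in range(s))  # empirical bound
```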
Now choose any subset i₁, ..., i_r of the subscripts 1 to t. Then with probability approaching 1, x̂_{i₁}, ..., x̂_{i_r} will lie in the neighborhood for which the associated function g, as defined by (2.1), exists. Thus, with probability approaching 1,

    α̂_μ − α⁰_μ = Σ_{ν=1}^r Σ_{j=1}^{p+1} (∂g_μ/∂x*_{i_ν j}) (x̂_{i_ν j} − x⁰_{i_ν j}),    μ = 1, 2, ..., r,

where the vector (x*'_{i₁}, ..., x*'_{i_r}) lies on the line segment joining (x̂'_{i₁}, ..., x̂'_{i_r}) and (x⁰'_{i₁}, ..., x⁰'_{i_r}). Since each vector of partials ∂g/∂x*_{i_ν} is continuous, it will converge in probability to ∂g/∂x⁰_{i_ν}. Thus, with probability approaching 1, the absolute values of the partials ∂g_μ/∂x*_{i_ν j} are bounded by some constant M which does not depend on the data.
It follows that, for any nonrandom sequence {L(N)} with L(N) → ∞,

    P( √N |α̂_μ − α⁰_μ| ≥ L(N) ) ≤ P( M Σ_{ν=1}^r Σ_{j=1}^{p+1} √N |x̂_{i_ν j} − x⁰_{i_ν j}| ≥ L(N) )

with probability approaching 1 as N → ∞. Hence, if the differences x̂₁ − x⁰₁, ..., x̂_t − x⁰_t are O_p(N^{-1/2}), we have

    lim_{N→∞} P{ √N |α̂_μ − α⁰_μ| ≥ L(N) } = 0    (2.6)

by virtue of (2.5).
From the results of this section, it follows that, for Section 3, it suffices to show that the x̃₁, ..., x̃_t minimizing

    Σ_{i=1}^t Σ_{k=1}^{m_i} (X_ik − x_i)' Σ̃⁻¹ (X_ik − x_i)    (2.7)

subject to (1.1) are such that x̃₁ − x⁰₁, ..., x̃_t − x⁰_t are O_p(N^{-1/2}) for suitably chosen Σ̃. In Section 4, it will suffice to show that the functions Λ_N(·)/√N appearing in (2.4), associated with the likelihood of the observations X_ik, i = 1, 2, ..., t, k = 1, 2, ..., m_i, converge to 0 in probability when evaluated at the weighted least squares solutions and Σ̃.
3. CONSISTENCY OF THE LEAST SQUARES ESTIMATORS

For normal data, (2.7) is equivalent to a likelihood conditional on an estimated Σ. Consistency and possibly efficiency of α̃ might be argued from that viewpoint. Postulating a distribution is not required in order to demonstrate that α̃ − α⁰ is O_p(N^{-1/2}), however. It is only required that Σ̃ − Σ be elementwise O_p(N^{-1/2}), that it be positive definite with probability approaching 1, and that its definition not be dependent on the minimization of (2.7).
Now (2.7) is equal to

    Σ_{i=1}^t Σ_{k=1}^{m_i} (X_ik − X̄_i)' Σ̃⁻¹ (X_ik − X̄_i) + S_N(x₁, ..., x_t),

where X̄_i is the mean of the m_i replications and

    S_N(x₁, ..., x_t) = Σ_{i=1}^t m_i (X̄_i − x_i)' Σ̃⁻¹ (X̄_i − x_i).

Thus it is sufficient to determine x̃₁, ..., x̃_t minimizing S_N under the constraints (1.1).
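The decomposition above is an algebraic identity (the cross terms vanish when summed over k), which the following sketch verifies numerically with arbitrary data and an arbitrary positive definite weight matrix standing in for Σ̃⁻¹:

```python
import numpy as np

# Check:  sum_i sum_k (X_ik - x_i)' W (X_ik - x_i)
#           = sum_i sum_k (X_ik - Xbar_i)' W (X_ik - Xbar_i)
#             + sum_i m_i (Xbar_i - x_i)' W (Xbar_i - x_i).
rng = np.random.default_rng(3)
t, m, q = 4, 6, 3                        # q = p + 1 components per vector
X = rng.normal(size=(t, m, q))
x = rng.normal(size=(t, q))              # arbitrary trial values x_i
A = rng.normal(size=(q, q))
W = A @ A.T + q * np.eye(q)              # positive definite weight matrix

Xbar = X.mean(axis=1)

def quad(d):
    """Sum of d_n' W d_n over all leading indices of d."""
    d2 = d.reshape(-1, d.shape[-1])
    return np.einsum('np,pq,nq->', d2, W, d2)

lhs = quad(X - x[:, None, :])
within = quad(X - Xbar[:, None, :])
S_N = m * quad(Xbar - x)                 # S_N of the text (all m_i = m here)
```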
Let Σ = (σ_jh) and Σ̃⁻¹ = (σ̃^{jh}). Then S_N(x⁰₁, ..., x⁰_t) is the sum of terms of the form

    m_i σ̃^{jj} (X̄_ij − x⁰_ij)²  or  m_i σ̃^{jh} (X̄_ij − x⁰_ij)(X̄_ih − x⁰_ih)

for j ≠ h. Since each σ̃^{jh} is a continuous function of the elements of Σ̃, σ̃^{jh} → σ^{jh} in probability for j, h = 1, 2, ..., p + 1.
Thus, with probability approaching 1, all elements of Σ̃⁻¹ have a nonrandom upper bound M. Since m_i = π_i N, the mean X̄_ij is such that X̄_ij − x⁰_ij is O_p(N^{-1/2}). It follows that, for any nonrandom sequence {L(N)} with L(N) → ∞,

    P( m_i σ̃^{jj} (X̄_ij − x⁰_ij)² ≥ L(N) )    (3.1)

approaches 0 as N → ∞. Since a cross-product term is bounded in absolute value by a sum of squared terms, a result similar to (3.1),

    lim_{N→∞} P( |m_i σ̃^{jh} (X̄_ij − x⁰_ij)(X̄_ih − x⁰_ih)| ≥ L(N) ) = 0,    (3.2)

applies to the cross-product terms.
In the manner in which we established (2.6), we apply (2.5), (3.1), (3.2), and the boundedness of the elements of Σ̃⁻¹ with probability approaching 1 to obtain

    lim_{N→∞} P( S_N(x⁰₁, ..., x⁰_t) ≥ L(N) ) = 0

for all nonrandom sequences {L(N)} with L(N) → ∞. Since S_N(x̃₁, ..., x̃_t) ≤ S_N(x⁰₁, ..., x⁰_t), it follows that

    lim_{N→∞} P( S_N(x̃₁, ..., x̃_t) ≥ L(N) ) = 0.    (3.3)
Now let T'T = Σ̃ be a square root decomposition obtained by some well defined algorithm (such as a Cholesky decomposition), such that the elements of T are continuous functions of the elements of Σ̃. Let Ȳ_i = (T⁻¹)' X̄_i and ỹ_i = (T⁻¹)' x̃_i. Then

    S_N(x̃₁, ..., x̃_t) = Σ_{i=1}^t m_i (Ȳ_i − ỹ_i)'(Ȳ_i − ỹ_i).
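A quick numerical check of this whitening identity, with arbitrary data and an arbitrary positive definite matrix standing in for Σ̃:

```python
import numpy as np

# With T'T = Sigma-tilde from a Cholesky factorization, and
# Ybar_i = (T^{-1})' Xbar_i, y_i = (T^{-1})' x_i, the quadratic form S_N
# becomes an ordinary (weighted) sum of squares.
rng = np.random.default_rng(4)
t, q = 5, 3
m_i = np.array([4, 6, 8, 10, 12])        # unequal replication counts
Xbar = rng.normal(size=(t, q))
x = rng.normal(size=(t, q))
B = rng.normal(size=(q, q))
Sigma_t = B @ B.T + q * np.eye(q)        # stand-in for Sigma-tilde

L = np.linalg.cholesky(Sigma_t)          # Sigma_t = L L', so T = L' has T'T = Sigma_t
Tinv_t = np.linalg.inv(L)                # (T^{-1})' = L^{-1}
Ybar = Xbar @ Tinv_t.T                   # rows are (T^{-1})' Xbar_i
y = x @ Tinv_t.T

Sigma_t_inv = np.linalg.inv(Sigma_t)
S_N = sum(m_i[i] * (Xbar[i] - x[i]) @ Sigma_t_inv @ (Xbar[i] - x[i])
          for i in range(t))
S_N_white = sum(m_i[i] * np.sum((Ybar[i] - y[i])**2) for i in range(t))
```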
From (3.3) and the fact that m_i = π_i N, it follows that, for all nonrandom {L(N)} with L(N) → ∞, and for all i and j,

    lim_{N→∞} P( √N |Ȳ_ij − ỹ_ij| ≥ L(N) ) = 0

of necessity. Now suppose that τ'τ = Σ is a square root decomposition of Σ obtained by the same algorithm. It follows that every element T_jh of T converges in probability to the corresponding element τ_jh of τ. Since

    x̃_i − X̄_i = T'(ỹ_i − Ȳ_i),

and since the values |T_jh| are bounded by some nonrandom constant with probability approaching 1, it follows, using (2.5), that

    lim_{N→∞} P( √N |x̃_ij − X̄_ij| ≥ L(N) ) = 0.    (3.4)

But since x̃_ij − x⁰_ij = (x̃_ij − X̄_ij) + (X̄_ij − x⁰_ij), the assertion

    lim_{N→∞} P( √N |x̃_ij − x⁰_ij| ≥ L(N) ) = 0

is established by virtue of (2.5).
4. THE MULTIVARIATE NORMAL CASE

Now let the observations be multivariate normal. A natural estimate for Σ is

    Σ̃ = (1/N) Σ_{i=1}^t Σ_{k=1}^{m_i} (X_ik − X̄_i)(X_ik − X̄_i)'.
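In code, Σ̃ is the pooled within-group sum of squares and cross products divided by N. The sketch below, with made-up group means, sizes, and covariance, computes it directly:

```python
import numpy as np

# Pooled within-group covariance estimate Sigma-tilde (all numbers arbitrary).
rng = np.random.default_rng(5)
t, q = 3, 2
m_i = [5, 7, 9]
N = sum(m_i)
Sigma = np.array([[1.0, 0.3], [0.3, 2.0]])
groups = [rng.multivariate_normal(mu, Sigma, size=m)
          for mu, m in zip(rng.normal(size=(t, q)) * 5, m_i)]

# Centering each group by its own mean removes the unknown x_i, so Sigma-tilde
# does not depend on the functional relation at all.
Sigma_tilde = sum((Xi - Xi.mean(axis=0)).T @ (Xi - Xi.mean(axis=0))
                  for Xi in groups) / N
```

Note that this Σ̃ is a fixed function of the data in the sense of the Introduction: it involves no iterated estimates of α⁰ or the nuisance parameters.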
The log-likelihood of the observations is

    ℒ_N = −[(p+1)N/2] ln(2π) − (N/2) ln|Σ| − (1/2) Σ_{i=1}^t Σ_{k=1}^{m_i} (X_ik − x_i)' Σ⁻¹ (X_ik − x_i).
Now clearly

    −2 ∂ℒ_N/∂α_ν = ∂S_N/∂α_ν,    ν = 1, 2, ..., r,    (4.1)

and

    −2 ∂ℒ_N/∂x_ij = ∂S_N/∂x_ij,    i = 1, 2, ..., t,  j = 1, 2, ..., p,    (4.2)

when Σ is evaluated at Σ̃. When (4.1) and (4.2) are evaluated at α̃ and the pt estimated nuisance parameters x̃_{11}, ..., x̃_{1p}, ..., x̃_{tp}, all these partials are identically 0.
Thus the conditions that

    (1/√N) ∂ℒ_N/∂x_ij  and  (1/√N) ∂ℒ_N/∂α_ν

approach 0 in probability, when evaluated at the least squares estimates and Σ̃, are trivially fulfilled. It remains to show that (1/√N) ∂ℒ_N/∂σ_ij → 0 in probability when evaluated at these estimates.
For convenience, write ℒ_N as

    ℒ_N = −[(p+1)N/2] ln(2π) − (N/2) ln|Σ| − (1/2) Σ_{i=1}^t Σ_{k=1}^{m_i} tr[ Σ⁻¹ (X_ik − x_i)(X_ik − x_i)' ],

and let E_ij = {e_kh} be defined by e_kh = 1 for k = i and h = j, and e_kh = 0 otherwise. Then for i < j, ∂Σ/∂σ_ij = E_ij + E_ji. Let

    A = (a_jh) = Σ⁻¹ [ Σ_{i=1}^t Σ_{k=1}^{m_i} (X_ik − x_i)(X_ik − x_i)' ] Σ⁻¹.

Then ∂ℒ_N/∂σ_ij is given by

    ∂ℒ_N/∂σ_ij = −N σ^{ij} + (1/2)(a_ij + a_ji),    i < j.    (4.3)
For i = j, a similar derivation gives

    ∂ℒ_N/∂σ_ii = −(N/2) σ^{ii} + (1/2) a_ii.    (4.4)

[The derivations of (4.3) and (4.4) follow closely the usual line for the derivation of the maximum likelihood estimates for the underlying parameters of multivariate normal data. Compare Rao (1973).] For a square matrix C, let diag(C) be a diagonal matrix with diagonal elements equal to the diagonal elements of C. Then (4.3) and (4.4) imply that the matrix of partials (∂ℒ_N/∂σ_ij) is

    ∂ℒ_N/∂Σ = −N Σ⁻¹ + (N/2) diag(Σ⁻¹) + A − (1/2) diag(A).    (4.5)
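The derivative formulas (4.3)-(4.5) can be verified against finite differences of the log-likelihood. Everything below is arbitrary test data, with the matrix S standing in for the residual sum Σ_i Σ_k (X_ik − x_i)(X_ik − x_i)'; the perturbation moves σ_ij and σ_ji together, since they are a single free parameter of the symmetric Σ.

```python
import numpy as np

rng = np.random.default_rng(6)
N, q = 40, 3                              # q = p + 1
D = rng.normal(size=(N, q))               # residuals X_ik - x_i, flattened
S = D.T @ D                               # sum of outer products of residuals
B = rng.normal(size=(q, q))
Sigma = B @ B.T + q * np.eye(q)           # an arbitrary positive definite Sigma

def loglik(Sig):
    """Multivariate normal log-likelihood as a function of Sigma alone."""
    sign, logdet = np.linalg.slogdet(Sig)
    return (-0.5 * q * N * np.log(2 * np.pi) - 0.5 * N * logdet
            - 0.5 * np.trace(np.linalg.solve(Sig, S)))

Sinv = np.linalg.inv(Sigma)
A = Sinv @ S @ Sinv
# The matrix of partials per (4.5):
G = -N * Sinv + 0.5 * N * np.diag(np.diag(Sinv)) + A - 0.5 * np.diag(np.diag(A))

eps = 1e-6
G_fd = np.zeros((q, q))
for i in range(q):
    for j in range(q):
        E = np.zeros((q, q))
        E[i, j] = E[j, i] = 1.0           # sigma_ij and sigma_ji move together
        G_fd[i, j] = (loglik(Sigma + eps * E) - loglik(Sigma - eps * E)) / (2 * eps)
```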
But A may be expressed as

    A = N Σ⁻¹ Σ̃ Σ⁻¹ + Σ⁻¹ [ Σ_{i=1}^t m_i (X̄_i − x_i)(X̄_i − x_i)' ] Σ⁻¹,

so that, when (4.5) is evaluated at Σ̃ and the least squares solutions, the terms in Σ̃⁻¹ cancel and

    ∂ℒ_N/∂Σ = B − (1/2) diag(B),  where  B = Σ̃⁻¹ [ Σ_{i=1}^t m_i (X̄_i − x̃_i)(X̄_i − x̃_i)' ] Σ̃⁻¹.

Now since the elements of Σ̃⁻¹ converge to those of Σ⁻¹ in probability, it follows that (1/√N) ∂ℒ_N/∂Σ converges to 0 elementwise if

    (1/√N) Σ_{i=1}^t m_i (X̄_i − x̃_i)(X̄_i − x̃_i)' → 0

in probability, elementwise.
That is, since m_i = π_i N, it is sufficient to demonstrate that

    √N (X̄_ij − x̃_ij)(X̄_ih − x̃_ih) → 0 in probability,    j, h = 1, 2, ..., p + 1,    (4.6)

for each i. Now for any prescribed ε > 0, choosing L(N) = √ε N^{1/4} in (3.4) yields

    lim_{N→∞} P( N^{1/4} |x̃_ij − X̄_ij| ≥ √ε ) = 0,

so that N^{1/4}(x̃_ij − X̄_ij) → 0 in probability and (4.6) must be valid. Hence the asymptotic distributions of √N(α̃ − α⁰) and √N(α̂ − α⁰), where α̂ is the ML estimate, are equal.
REFERENCES

Cramér, H. (1946). Mathematical Methods of Statistics. Princeton: Princeton University Press.

Dolby, G. R. (1972). Generalized least squares and maximum likelihood estimation of non-linear functional relationships. J. Roy. Statist. Soc. B, 34, 393-400.

Dolby, G. R. and Freeman, T. G. (1975). Functional relationships having many independent variables and errors with multivariate normal distribution. J. Multivariate Anal., 5, 466-479.

Rao, C. R. (1973). Linear Statistical Inference and Its Applications. New York: John Wiley and Sons.

Tucker, H. G. (1967). A Graduate Course in Probability. New York: Academic Press.

Weiss, L. (1971). Asymptotic properties of maximum likelihood estimators in some nonstandard cases. J. Amer. Statist. Assoc., 66, 345-350.