da Silva e Souza (1979). "Statistical Inference in Nonlinear Models: A Pseudo Likelihood Approach."
STATISTICAL INFERENCE IN NONLINEAR MODELS:
A PSEUDO LIKELIHOOD APPROACH

by
GERALDO DA SILVA E SOUZA

A thesis submitted to the Graduate Faculty of
North Carolina State University
in partial fulfillment of the
requirements for the Degree of
Doctor of Philosophy

DEPARTMENT OF STATISTICS

RALEIGH
1979

APPROVED BY:

Chairman of Advisory Committee
BIOGRAPHY

Geraldo da Silva e Souza was born September 19, 1946, in Rio de Janeiro, Brasil. He received his elementary and secondary education in his home town, graduating from Colegio Estadual Prof. Clovis Monteiro in 1968. He entered Universidade Federal do Rio de Janeiro, Rio de Janeiro, and received a Bachelor of Science degree in Mathematics in 1971. He also entered Universidade do Estado do Rio de Janeiro, Rio de Janeiro, and received a Bachelor of Science degree in Economics in 1972.

In January, 1973, the author was granted a fellowship from the Conselho Nacional de Pesquisas, Rio de Janeiro, to pursue graduate studies at Universidade Federal do Rio de Janeiro. He received a Master of Science degree in System Engineering from this university in 1974. In August, 1974, the author accepted employment with Empresa Brasileira de Correios e Telegrafos, Brasilia, as an economist. In December, 1974, he was named Chairman of the Division of Economic Studies. In July, 1975, he accepted part-time employment with the Department of Statistics at Universidade de Brasilia, Brasilia, as an Assistant Professor. In August, 1976, he was granted financial support from Empresa Brasileira de Pesquisa Agropecuaria, Brasilia, and enrolled in the Department of Statistics at North Carolina State University to pursue a program of studies leading to the Doctoral degree in Statistics.

The author is married to the former Wanda Fritsch. Mrs. Souza graduated from Escola Nacional de Ciencias Estatisticas, Rio de Janeiro, and from Universidade do Estado do Rio de Janeiro, Rio de Janeiro, in 1972, receiving the degree of Bachelor of Science in Statistics and Bachelor of Science in Economics, respectively. They have two children: Alexandre, born February 9, 1973; and Marcos, born October 1, 1974.
TABLE OF CONTENTS

1. INTRODUCTION
2. LITERATURE REVIEW
3. THE STATISTICAL MODEL
   3.1 Description
   3.2 Probability Space
   3.3 Estimation
   3.4 Hypothesis Testing
4. CESARO SUMMABILITY
5. CONSISTENCY
6. ASYMPTOTIC NORMALITY
7. TEST STATISTICS
   7.1 Wald's Test
   7.2 The Lagrange Multiplier Test
   7.3 The Likelihood Ratio Test
   7.4 Applications
8. THE MAXIMUM LIKELIHOOD ESTIMATOR
9. THE SINGLE EQUATION EXPLICIT MODEL
10. THE SEEMINGLY UNRELATED NONLINEAR MODEL
11. LIST OF REFERENCES
1. INTRODUCTION

The general regression context is that of an observed multivariate response to an observed multivariate input. Based on empirical evidence or some underlying theory the investigator decides on the kind of structure (model) linking input variables to response variables.
Frequently, the nature of the problem under study gives rise to a statistical model with additive errors,

    G(y_t, x_t, γ) = e_t ,    t = 1, 2, ..., n .

This representation defines responses (the y's), implicitly, as a function of exogenous variables (inputs, the x's), unknown parameters (γ), and random components (the e's). This thesis considers estimation and hypothesis testing for such models.
The approach adopted is the inclusion, in the model's description, of an optimization criterion which defines an estimator. This criterion is called the pseudo likelihood. A theory of large sample statistical inference is then developed for the pseudo likelihood estimator, including its strong consistency, asymptotic normality, and the null and non-null distributions of the Wald test statistic, an analog of the likelihood ratio test statistic, and an analog of the Lagrange multiplier test statistic (Rao's efficient score test). It turns out that the proposed theory encompasses a variety of statistical procedures in use for nonlinear models. In this class we may include Zellner-type nonlinear multivariate procedures, and the more standard least squares and maximum likelihood procedures.
2. LITERATURE REVIEW

Jennrich (1969) and Malinvaud (1970) were the precursors of what is known as the large sample statistical theory of nonlinear regression models. Jennrich's paper, dealing with the single equation explicit model

    y_t = f(x_t, γ) + e_t ,    t = 1, 2, ..., n ,

sets forth conditions for the strong consistency and asymptotic normality of the least squares estimator and shows that the Gauss-Newton iteration method of estimation is asymptotically stable. Based on assumptions of uniform convergence Jennrich establishes a method of proving strong consistency, and a method of proving asymptotic normality without requiring third order derivatives.
Malinvaud's paper develops a description of the requisite behavior of exogenous variables in terms of weak convergence of measures so as to achieve the uniform convergence required by Jennrich. Malinvaud applies his methods both to single equation explicit models and to multivariate regressions.

After the papers of Jennrich and Malinvaud the field developed rapidly. A variety of models have been considered and rigorously studied. In the context of univariate explicit models, Gallant (1973) combines the techniques of Malinvaud and Jennrich to obtain strong consistency and asymptotic normality of the nonlinear least squares estimator without assuming the existence of second order derivatives of the response function. In the discussion of the regression problem compactness of the parameter space is not required. In another series of papers Gallant (1975a, 1975b, 1977b) develops a complete theory of inference for all practical purposes, including Monte Carlo simulations. These papers, together with Blattburg, McGuire, and Weber (1971), reveal that the convergence (in distribution) of the likelihood ratio test statistic is faster than that of the Wald test statistic. Still in the context of the single equation explicit model, a good summary of results may be found in Gallant (1975d). This article is an expository presentation of the theory and methods of nonlinear regression which exploits the analogies with the theory and methods of linear regression.
In the context of multivariate explicit models, i.e., models of the form

    y_{αt} = f_α(x_t, γ) + e_{αt} ,    t = 1, ..., n ;  α = 1, 2, ..., m ,

Gallant (1975c) proves strong consistency and asymptotic normality of the three step procedure estimator. Barnett (1976), working with the same model, approaches the regression problem by means of maximum likelihood estimation.

Single equation explicit models with time series errors have been studied by Robinson (1972), Hannan (1971), and Gallant and Goebel (1976). These models are of the form

    y_t = f(x_t, γ) + u_t ,    t = 1, ..., n ,

where the u_t's are assumed to define a covariance stationary time series. The main approach to these models has been through an initial approximation of the variance covariance matrix of the u's by means of nonlinear least squares residuals and subsequent use of the Aitken procedure.
Ballet Lawrence (1975), dealing with the single equation implicit model

    g(y_t, x_t, γ) = e_t ,    t = 1, ..., n ,

introduces a distance function, the absolute value of g to the pth power (p ≥ 1), to define an estimator. Strong consistency is proved for p ≥ 1. Asymptotic normality is obtained for p > 1. An analytic solution for y is not required in applications. Asymptotic properties of a predictor are also studied. Her work is a precursor of many of the ideas which appear in this thesis.
The more general case of simultaneous systems of nonlinear, implicit equations has been the most recent area of development. Amemiya (1974) proves weak consistency and asymptotic normality of the two stage nonlinear least squares estimator. Jorgenson and Laffont (1974) derive a Cramer-Rao lower bound and show that maximum likelihood theory can be used to generate an estimator that attains this bound. Amemiya (1977) develops the maximum likelihood approach. He proves weak consistency and asymptotic normality of the maximum likelihood estimator based on an assumption of normality for the error distribution. He also proves that the maximum likelihood estimator is more efficient than the nonlinear three stage least squares estimator if the model specification is correct. Gallant (1977a) describes the nonlinear three stage least squares estimator. Strong consistency and asymptotic normality are proved. His methods allow parametric estimation subject to nonlinear restrictions across equations. Gallant and Holly (1978) refine the concept of Cesaro summability (a uniform strong law of large numbers) previously introduced in Gallant (1977a). They obtain, without normality assumptions, the asymptotic law of maximum likelihood estimators (which is normal). A complete large sample theory of statistical inference is developed in this paper.
This thesis is concerned with methods and ideas introduced by Gallant and Holly (1978). Thus it is worthwhile to describe in more detail the framework upon which their results are based. The residual sequence is generated according to the density p(e|δ*), where δ* is an unknown parameter. The model, in implicit form, is

    q(y_t, x_t, θ°) = e_t ,    t = 1, 2, ..., n .

The conditions of the implicit function theorem are satisfied so that the model induces a conditional density function for y of the form p(y|x, θ, δ). The only structure exploited is that of an estimator γ̂ = (θ̂, δ̂) defined by maximizing

    (1/n) Σ_{t=1}^n ln p(y_t | x_t, θ, δ) .

Assuming that the limiting criterion, computed with respect to the error distribution and a convenient measure P_x on X, has a unique maximum over the parameter space at γ° = (θ°, δ*), strong consistency of γ̂ = (θ̂, δ̂) can be proved. The limiting normal law of γ̂ is obtained by means of the asymptotic normality of the scores evaluated at γ° = (θ°, δ*).

Here we follow precisely along these lines. Instead of the likelihood a general function is considered and an optimization procedure is defined. The properties of the resulting estimator are studied based on the unique maximum assumption. A score is defined and its asymptotic law is obtained. The resulting theory covers not only the Gallant and Holly (1978) results but also most of the methods and models described above. Thus, this thesis contributes to a unification of this literature.
3. THE STATISTICAL MODEL

3.1 Description

The basic model is a relation of the form

    G(y_t, x_t, γ°) = e_t ,    t = 1, 2, ..., n .

The y's are sample values of endogenous variables, vectors in Y, a Borel subset of ℝ^m. The x's are sample values of exogenous variables, vectors in X, a Borel subset of ℝ^k. The e's are sample values of residual variables, vectors in E, a Borel subset of ℝ^m. The true parameter γ° is a vector in Γ, a subset of ℝ^p.

Assumption 1
For each (x, γ) ∈ X × Γ the structural equation

    G(y, x, γ) = e

defines a one to one mapping from Y onto E. This implies the existence of a reduced form y = f(e, x, γ) which maps E onto Y and satisfies

    G(f(e, x, γ), x, γ) = e .

The function f defining the reduced form is continuous on E × X × Γ.

We remark that in any application of the model proposed here, in order to use the statistical methods we will set forth, it will not be necessary to have a closed form expression for the reduced form, or even to be able to compute it using numerical methods.
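As a concrete instance of Assumption 1 (my own toy illustration, not a model from the thesis), the sketch below takes G(y, x, γ) = log y − γ₁ − γ₂x on Y = (0, ∞), whose reduced form happens to be available in closed form, and checks the defining identity G(f(e, x, γ), x, γ) = e numerically.

```python
import numpy as np

# Hypothetical structural equation G(y, x, gamma) = log(y) - gamma1 - gamma2*x,
# chosen only to illustrate Assumption 1.
def G(y, x, gamma):
    return np.log(y) - gamma[0] - gamma[1] * x

# Its reduced form y = f(e, x, gamma) maps E = R one to one onto Y = (0, inf).
def f(e, x, gamma):
    return np.exp(gamma[0] + gamma[1] * x + e)

gamma = (0.5, 2.0)
e, x = 0.3, 1.7
y = f(e, x, gamma)
# The defining identity G(f(e, x, gamma), x, gamma) = e holds:
assert abs(G(y, x, gamma) - e) < 1e-9
```

In more realistic simultaneous-equation models no such closed form exists, which is why, as the remark above notes, the theory never requires one.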
Assumption 2
The residual sequence (e_t)_{t≥1} is a realization of a sequence of identically and independently distributed random vectors. Their common probability measure, defined on the Borel subsets of E, is denoted by P_E.

We emphasize that the distribution determined by P_E is parameter free. In many instances the case envisaged is when residual variables have mean zero and identity variance covariance matrix. In applications, if any parametrization is present in the error distribution the model should be transformed so that all parameters are absorbed in γ. The models we study in Chapters 8, 9, and 10 are typical examples of this case.
Assumption 3
The set X is provided with a probability measure P_x. The sequence of exogenous variables may be fixed or random. No matter how exogenous variables are treated (random or not), the sequence (x_t)_{t≥1} is held fixed.

For the moment we only want the reader to note that (x_t)_{t≥1} will be treated as a fixed sequence in a probability space. A formal link between P_x and (x_t)_{t≥1} will be established in Chapter 4.

3.2 Probability Space

Let Assumptions 2 and 3 hold. Let 𝓔 denote the set of all sequences of elements in E. By the infinite dimensional product measure theorem (Ash (1972)), P_E determines a probability measure P on the class B(𝓔) of the Borel subsets of 𝓔. The probability space (𝓔, B(𝓔), P) provides the basic probability space for the discussion of our statistical problems. Thus, in what follows, probability statements will refer to P, and the random variables, for the particular sequence (x_t)_{t≥1} fixed, will be functions defined on 𝓔 and measurable with respect to B(𝓔).
3.3 Estimation

For estimation purposes interest is focused on an ℓ-vector of parameters λ°. In a typical application, λ° will equal some easily computed function of γ°. The vector λ° is a point in Λ, a subset of ℝ^ℓ.

The estimation process will be carried out with the use of a real valued function s(y, x, τ, λ) defined on Y × X × T × Λ, where T is a subset of ℝ^q. The variable τ corresponds to the presence of a parameter τ*. In a typical application, τ* will equal some easily computed function of γ° and will be regarded as a nuisance parameter.

Assumption 4
The function s(y, x, τ, λ) is continuous. A strongly consistent estimator τ̂_n is available for estimating τ*. The sequence (√n (τ̂_n − τ*))_{n≥1} is bounded in probability, i.e., given ε > 0 there exists a bound M_ε such that for n sufficiently large

    P{ √n |τ̂_n − τ*| < M_ε } > 1 − ε .

Definition 3.3.1
Let Assumptions 1 through 4 hold. Any sequence of random variables (λ̂_n)_{n≥1} satisfying

    s_n(λ̂_n) = sup_{λ∈Λ} s_n(λ) ,

where

    s_n(λ) = (1/n) Σ_{t=1}^n s(y_t, x_t, τ̂_n, λ) ,

is called a sequence of pseudo likelihood estimators for λ°. We refer to the function s_n as the pseudo likelihood.

In Chapter 5 we show that under certain conditions there exists a sequence of pseudo likelihood estimators strongly consistent for λ°. The formulation considered here for the estimation problem is motivated by a consideration of the statistical methods presently in use in nonlinear regression analysis and some others one might wish to employ. In Chapters 8, 9, and 10 we present examples which suggest possible uses for the pseudo likelihood and the unifying character of the theory we are to develop.
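To connect Definition 3.3.1 with familiar practice, the sketch below (my own toy example; the response function, noise level, and sample size are assumptions) takes s(y, x, τ, λ) = −(y − λ₁ exp(λ₂x))² with no nuisance parameter τ, so that maximizing the pseudo likelihood s_n reproduces the nonlinear least squares estimator.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
lam_true = np.array([1.0, 0.5])                 # hypothetical true parameter
x = rng.uniform(0.0, 2.0, size=200)
y = lam_true[0] * np.exp(lam_true[1] * x) + 0.1 * rng.standard_normal(200)

# Pseudo likelihood s_n(lambda) = (1/n) sum_t s(y_t, x_t, lambda), with
# s(y, x, lambda) = -(y - lambda1 * exp(lambda2 * x))**2.
def s_n(lam):
    return -np.mean((y - lam[0] * np.exp(lam[1] * x)) ** 2)

# A pseudo likelihood estimator maximizes s_n, i.e. minimizes -s_n.
lam_hat = minimize(lambda lam: -s_n(lam), x0=np.array([0.5, 0.0])).x
print(lam_hat)
```

Choosing s to be a log density instead recovers maximum likelihood, which is the sense in which the definition unifies the procedures listed in Chapter 2.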
3.4 Hypothesis Testing

In this thesis we address the problem of testing the hypothesis H₀: h(λ°) = 0 against the alternative H_A: h(λ°) ≠ 0, where h maps ℝ^ℓ into ℝ^r. Test statistics will be defined in Chapter 7. The theorems on the asymptotic distributions of those random variables will permit not only the statistical test of H₀ against H_A but also the evaluation of the power of the corresponding test for local alternatives.
4. CESARO SUMMABILITY

In this chapter we discuss the notion of Cesaro summable sequences. The concept refines Malinvaud's (1970) use of the weak convergence of measures in nonlinear statistical analysis, and has been used by Gallant (1977a) and Gallant and Holly (1978) to describe the requisite asymptotic behavior of the joint sequence (e_t, x_t)_{t≥1}. The main consequences of its use here are:

i) A conclusion similar to that of the Helly-Bray theorem is feasible without the restrictive boundedness condition.

ii) Regularity conditions that insure the existence of the almost sure uniform limit of sequences of the form

    (1/n) Σ_{t=1}^n φ(f(e_t, x_t, γ), x_t, ρ)

can be stated in terms of properties that are intrinsic to the statistical model.
Definition 4.1
A sequence of points (v_t)_{t≥1} in a Borel subset V of a Euclidean space is said to generate Cesaro summable sequences with respect to the probability measure ν, defined on the Borel subsets of V, and the function b(v), nonnegative, ν-integrable, if

    lim_{n→∞} (1/n) Σ_{t=1}^n a(v_t) = ∫_V a(v) ν(dv)

for every function a(v), real valued, continuous, and dominated by b(v).
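A quick numerical illustration of Definition 4.1 (my own; the distribution and functions are assumed choices): for v_t drawn iid from ν = N(0, 1), Theorem 4.1 below guarantees that almost every realization generates Cesaro summable sequences, and for a(v) = v², which is continuous and dominated by b(v) = 1 + v², the Cesaro averages settle at ∫_V a(v) ν(dv) = 1.

```python
import numpy as np

rng = np.random.default_rng(1)
v = rng.standard_normal(200_000)        # iid draws from nu = N(0, 1)

# a(v) = v**2 is continuous and dominated by b(v) = 1 + v**2 (nu-integrable).
cesaro_avg = np.cumsum(v ** 2) / np.arange(1, v.size + 1)

# The running Cesaro averages approach the integral of a with respect to nu,
# namely E v**2 = 1.
print(cesaro_avg[99], cesaro_avg[9_999], cesaro_avg[-1])
```

Note that a(v) = v² is unbounded, which is exactly the situation the Helly-Bray theorem does not cover and domination by an integrable b(v) does.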
The notion of Cesaro summability is not void. We present an elementary proof of this fact here using Stute's (1976) generalization of the Glivenko-Cantelli theorem. This result, as pointed out by Wesler (1979), can be obtained with weaker hypotheses using more advanced methods of proof.
Theorem 4.1
Let (Z_t)_{t≥1} be a sequence of identically and independently distributed M-dimensional random variables defined on the probability space (Ω, 𝒜, P*). Let ν be the common induced probability measure defined on the Borel subsets of ℝ^M. Assume ν is absolutely continuous with respect to some product measure on ℝ^M. Let b(v) be nonnegative and ν-integrable. There exists a set A ∈ 𝒜 such that P*(A) = 0 and if ω ∉ A

    lim_{n→∞} (1/n) Σ_{t=1}^n a(Z_t(ω)) = ∫_{ℝ^M} a(v) ν(dv)

for every function a(v), real valued, continuous, and dominated by b(v).
Proof:
Let 𝒱 be the collection of all bounded cubes in ℝ^M. By Stute's (1976) generalization of the Glivenko-Cantelli theorem there exists a P*-null set A₀ ∈ 𝒜 such that if ω ∉ A₀

    lim_{n→∞} sup_{V∈𝒱} | (1/n) Σ_{t=1}^n I_V(Z_t(ω)) − ν(V) | = 0 ,    (4.1)

where I_V is the notation for the indicator function of a set V.

Since b(v) is an integrable function one may choose a sequence (V_j)_{j≥1} of cubes such that

    ∫_{V_j^c} b(v) ν(dv) < 1/j .

By the strong law of large numbers, for each cube V_j there exists a set A_j with P*(A_j) = 0 and with the property that for ω ∉ A_j there is a positive integer n_j such that

    (1/n) Σ_{t=1}^n I_{V_j^c}(Z_t(ω)) b(Z_t(ω)) < 1/j    for n ≥ n_j .

Let A = ∪_{j=0}^∞ A_j; then P*(A) = 0. Let ω ∉ A and write z_t = Z_t(ω).

Let ε > 0 be given. There exist j₀ and n_{j₀} such that

    ∫_{V_{j₀}^c} b(v) ν(dv) < ε/5    and    (1/n) Σ_{t=1}^n I_{V_{j₀}^c}(z_t) b(z_t) < ε/5  for n ≥ n_{j₀} .    (4.2)

Let V₀ = V_{j₀} and let a(v) be continuous and dominated by b(v). It is possible to divide V₀ into disjoint cubes V_i, i = 1, 2, ..., i₀, and construct a simple function g of the form

    g(v) = Σ_{i=1}^{i₀} r_i I_{V_i}(v)

such that

    sup_{v∈V₀} |a(v) − g(v)| < ε/5 .    (4.3)

By (4.1) there exists n_a such that for n ≥ n_a

    sup_{V∈𝒱} | (1/n) Σ_{t=1}^n I_V(z_t) − ν(V) | < ε / (5 Σ_{i=1}^{i₀} |r_i|) .    (4.4)

If n ≥ max(n_{j₀}, n_a), then

    | (1/n) Σ_{t=1}^n a(z_t) − ∫_{ℝ^M} a(v) ν(dv) |
        ≤ | (1/n) Σ_{t=1}^n I_{V₀}(z_t) a(z_t) − ∫_{V₀} a(v) ν(dv) |
          + (1/n) Σ_{t=1}^n I_{V₀^c}(z_t) b(z_t) + ∫_{V₀^c} b(v) ν(dv)
        ≤ | (1/n) Σ_{t=1}^n I_{V₀}(z_t) a(z_t) − ∫_{V₀} a(v) ν(dv) | + 2ε/5

by (4.2). Also,

    | (1/n) Σ_{t=1}^n I_{V₀}(z_t) a(z_t) − ∫_{V₀} a(v) ν(dv) |
        ≤ (1/n) Σ_{t=1}^n I_{V₀}(z_t) |a(z_t) − g(z_t)|
          + | (1/n) Σ_{t=1}^n I_{V₀}(z_t) g(z_t) − ∫_{V₀} g(v) ν(dv) |
          + ∫_{V₀} |a(v) − g(v)| ν(dv)
        ≤ | (1/n) Σ_{t=1}^n I_{V₀}(z_t) g(z_t) − ∫_{V₀} g(v) ν(dv) | + 2ε/5

by (4.3). Now

    | (1/n) Σ_{t=1}^n I_{V₀}(z_t) g(z_t) − ∫_{V₀} g(v) ν(dv) |
        = | Σ_{i=1}^{i₀} r_i [ (1/n) Σ_{t=1}^n I_{V_i}(z_t) − ν(V_i) ] |
        ≤ Σ_{i=1}^{i₀} |r_i| · ε / (5 Σ_{i=1}^{i₀} |r_i|) = ε/5

by (4.4). Hence

    | (1/n) Σ_{t=1}^n a(z_t) − ∫_{ℝ^M} a(v) ν(dv) | < ε    for n ≥ max(n_{j₀}, n_a) . □
Let Assumptions 2 and 3 hold. We now motivate a crucial assumption regarding Cesaro summability and the model introduced in Chapter 3 by means of two examples which follow directly from Theorem 4.1. The first example shows that the property of Cesaro summability holds for the joint sequence (e_t, x_t)_{t≥1} (and also for the components) when exogenous variables are fixed in repeated samples, P_E is absolutely continuous, and P_x is chosen to be the limiting measure introduced by Malinvaud (1970). The second example deals with the random case, i.e., when exogenous variables are random. It shows that the same conclusion holds if independence and absolute continuity are imposed.
Example 1:
Let M be a positive integer and a₀, ..., a_{M−1} a finite collection of vectors in ℝ^k. Assume that (x_t)_{t≥0} is generated according to the replication scheme

    x_t = a_j    for t = j, M + j, 2M + j, ... ;  j = 0, 1, ..., M − 1 ,

and that P_E is absolutely continuous with respect to some product measure on ℝ^m. Then (x_t)_{t≥0} generates Cesaro summable sequences with respect to P_x, the probability measure placing mass 1/M at each of the design points a₀, ..., a_{M−1}. The dominating function in this case is irrelevant (in the sense that domination is not necessary) as well as the continuity requirement. Moreover, if μ = P_E × P_x and b(e,x) is nonnegative and μ-integrable, almost every sequence (e_t)_{t≥1} is such that the joint sequence (e_t, x_t)_{t≥1} generates Cesaro summable sequences with respect to μ and b(e,x).

Indeed, let a(x) be a measurable function on X and n = αM + c, where 0 ≤ c ≤ M − 1. We can write

    Σ_{t=0}^{n−1} a(x_t) = Σ_{t=0}^{M−1} a(x_t) + Σ_{t=M}^{2M−1} a(x_t) + ... + Σ_{t=(α−1)M}^{αM−1} a(x_t) + Σ_{t=αM}^{n−1} a(x_t) ,

where each of the first α blocks equals Σ_{j=0}^{M−1} a(a_j) and the last block contains at most M − 1 terms. Thus,

    lim_{n→∞} (1/n) Σ_{t=0}^{n−1} a(x_t) = (1/M) Σ_{t=0}^{M−1} a(a_t) = ∫_X a(x) P_x(dx) .

On the other hand let φ(e,x) be continuous on E × X and dominated by b(e,x). For n = αM + c, 0 ≤ c ≤ M − 1, we can group the terms of (1/n) Σ_{t=1}^n φ(e_t, x_t) by residue class of t modulo M. For each j = 0, 1, ..., M − 1 the variables e_{iM+j}, i ≥ 0, are independently and identically distributed, so by Theorem 4.1 there exist null sets A(a₀), ..., A(a_{M−1}) such that if (e_t)_{t≥1} is not a realization in ∪_{j=0}^{M−1} A(a_j), then

    lim_{n→∞} (1/n) Σ_{t=1}^n φ(e_t, x_t) = Σ_{j=0}^{M−1} (1/M) ∫_E φ(e, a_j) P_E(de) = ∫∫_{E×X} φ(e, x) μ(de, dx)

by Fubini's theorem (Ash (1972)). □
Example 2:
Assume that the sequence of exogenous variables is independent of the sequence of residual variables and defines an independently and identically distributed stochastic process with common distribution P_x. Assume also that P_x and P_E are absolutely continuous with respect to product measures on ℝ^k and ℝ^m, respectively. Let 𝒳 denote the set of all sequences of elements in X. A sequence in 𝒳 will be denoted by x̄ and a sequence in 𝓔 by ē. Let P' be the probability measure determined by P_x on the Borel subsets of 𝒳, let μ = P_E × P_x on E × X, and let ν be the product probability on 𝓔 × 𝒳 generated by P and P'. Let b(e,x) be a measurable function on E × X, nonnegative, μ-integrable. Then the following conditions hold:

i) There exists a P'-null set B such that every realization x̄ not in B generates Cesaro summable sequences with respect to P_x and

    b̄(x) = ∫_E b(e, x) P_E(de) .

ii) For every realization x̄ not in B there exists a P-null set C in B(𝓔) such that for every ē not in C the joint sequence (e_t, x_t)_{t≥1} generates Cesaro summable sequences with respect to μ and b(e,x).

Indeed, by Theorem 4.1 there exists a ν-null set A such that Cesaro summability with respect to μ and b(e,x) holds for every joint sequence (ē, x̄) not in A. For each x̄ consider the measurable section A(x̄) = {ē : (ē, x̄) ∈ A}. Since

    ν(A) = ∫_𝒳 P(A(x̄)) P'(dx̄) = 0 ,

there exists a P'-null set B₁ such that if x̄ ∉ B₁ then P(A(x̄)) = 0, and every sequence ē in [A(x̄)]^c will be such that (e_t, x_t)_{t≥1} generates Cesaro summable sequences with respect to μ and b(e,x). By Theorem 4.1 there exists a P'-null set B₂ such that on B₂^c every sequence x̄ generates Cesaro summable sequences with respect to P_x and b̄(x). The conclusion follows with B = B₁ ∪ B₂. □
Thus, it is plausible to impose

Assumption 5
Let μ be the product probability on the Borel sets of E × X determined by P_E and P_x (Assumptions 2 and 3). There exists a function b(e,x), measurable, nonnegative, μ-integrable, such that:

i) For every x₀ ∈ X there exists a neighborhood N(x₀) such that sup_{x∈N(x₀)} b(e,x) is measurable and P_E-integrable.

ii) The sequence (x_t)_{t≥1} (Assumption 3) generates Cesaro summable sequences with respect to P_x and

    b̄(x) = ∫_E b(e, x) P_E(de) .

iii) There exists a set A in B(𝓔) (depending on (x_t)_{t≥1}) such that P(A) = 0 and for every sequence (e_t)_{t≥1} in A^c the joint sequence (e_t, x_t)_{t≥1} generates Cesaro summable sequences with respect to μ and b(e,x).

Condition (i) of Assumption 5 is a technical requisite that will imply continuity of certain functions defined through processes of integration. Examples 1 and 2 show that conditions (ii) and (iii) are verified in instances that come readily to mind.
The proof of the following lemma is in Gallant (1977a).

Lemma 4.1
Let a(v,θ) be real valued, continuous, defined on V × G, where V is a Borel set and G is compact (V and G are both subsets of Euclidean spaces). Let (v_t)_{t≥1} be a sequence in V generating Cesaro summable sequences with respect to the probability measure ν on V and a dominating function b(v). If

    sup_{θ∈G} |a(v,θ)| ≤ b(v)

for every v in V, then:

i) lim_{n→∞} sup_{θ∈G} | (1/n) Σ_{t=1}^n a(v_t, θ) − ∫_V a(v, θ) ν(dv) | = 0 ;

ii) ∫_V a(v, θ) ν(dv) is continuous on G.
Theorem 4.2
Let Assumptions 1, 2, 3, and 5 hold. Let Γ' be a compact subset of Γ. Let φ(y,x,ρ) be real valued and continuous on Y × X × K, where K is a compact subset of a Euclidean space. If

    |φ(y, x, ρ)| ≤ b(G(y, x, γ), x)

for all (y, x, γ, ρ) in Y × X × Γ' × K, then for almost every realization (e_t)_{t≥1} in 𝓔 the functions

    (1/n) Σ_{t=1}^n φ(f(e_t, x_t, γ), x_t, ρ)

converge uniformly on Γ' × K to

    ∫∫_{E×X} φ(f(e, x, γ), x, ρ) μ(de, dx) .

The limit function is a continuous function of (γ, ρ).

Proof:
Given (e, x, γ, ρ), let y = f(e, x, γ). Then

    |φ(f(e, x, γ), x, ρ)| = |φ(y, x, ρ)| ≤ b(G(y, x, γ), x) = b(e, x) .

Hence

    sup_{(γ,ρ)∈Γ'×K} |φ(f(e, x, γ), x, ρ)| ≤ b(e, x) ,

    sup_{(γ,ρ)∈Γ'×K} | ∫_E φ(f(e, x, γ), x, ρ) P_E(de) | ≤ b̄(x) .

By condition (i) of Assumption 5 and the dominated convergence theorem

    ∫_E φ(f(e, x, γ), x, ρ) P_E(de)

is continuous. The conclusion then follows from Lemma 4.1. □
5. CONSISTENCY

In this chapter we establish conditions under which we may guarantee the existence and strong consistency of a sequence of pseudo likelihood estimators. In order to use our results on estimation in the derivation of the asymptotic non-null distribution of test statistics it is necessary to introduce a dependence of γ° (and therefore of λ°) on the sample size.
Assumption 6
The true parameter γ°_n is indexed by n and

    lim_{n→∞} γ°_n = γ* ∈ Γ .

There are unique points λ*, λ°₁, λ°₂, ..., corresponding to γ = γ*, γ°₁, γ°₂, ..., which maximize

    s̄(γ, τ*, λ) = ∫∫_{E×X} s(f(e, x, γ), x, τ*, λ) μ(de, dx)

over Λ. The function f is the reduced form of Assumption 1 and μ is the probability measure of Assumption 5. Also

    lim_{n→∞} √n (λ°_n − λ*) = Δ ,    Δ ∈ ℝ^ℓ .
Notice that if Assumption 1 holds and we are given a sequence (γ°_n)_{n≥1} in Γ and a joint sequence (e_t, x_t)_{t≥1}, we may use the reduced form f(e, x, γ) to define a triangular array in Y as follows. For each n ≥ 1, let

    y_{tn} = f(e_t, x_t, γ°_n)    for 1 ≤ t ≤ n .

This is indeed the structure we have in mind when Assumption 6 is present. In this context the sequence (y_t)_{t≥1} defined in Chapter 3 may be viewed as a particular case. It is the array corresponding to the constant sequence γ°_n = γ°. In this case Assumption 6 also defines the conceptual link between γ° and λ°. The vector λ°, on which we focus our interest for estimation purposes, must be the unique point producing the maximum of s̄(γ°, τ*, λ) over Λ. As we are going to see, only a parameter satisfying this condition can be estimated consistently. This is a conclusion from Theorem 5.1 below, which states that λ̂_n (Definition 3.3.1) converges almost surely to λ* when Assumptions 1 through 6 (and other technical conditions) hold. Thus, any result proved in the context of Assumption 6 may be translated in terms of the model proposed in Chapter 3. Another point to be explained is the presence of τ* when Assumption 6 holds. In that general setting, τ* is to be viewed as a nuisance parameter defined as the almost sure limit of τ̂_n.

From this point on, whenever Assumption 6 is present, we will use the notation suggested above. In this context s_n(λ) of Definition 3.3.1, for example, becomes

    s_n(λ) = (1/n) Σ_{t=1}^n s(y_{tn}, x_t, τ̂_n, λ) .
Definition 5.1
Let ε > 0 be given and θ a point in a Euclidean space. The open ball with center at θ and radius ε is denoted N_ε(θ). The corresponding closed ball is represented by N̄_ε(θ).
Assumption 7
The set Λ is closed and unbounded. There exists a positive function w(λ) for which:

i) sup_{λ∈Λ} |s(y, x, τ, λ)| / w(λ) ≤ b(G(y, x, γ), x) for all (y, x, γ, τ) in Y × X × N̄_{ε₀}(γ*) × N̄_{ε₀}(τ*), where ε₀ > 0, N̄_{ε₀}(γ*) ⊂ Γ, and N̄_{ε₀}(τ*) ⊂ T.

ii) s(y, x, τ, λ) / w(λ) is continuous.

iii) lim inf_{‖λ‖→∞} w(λ) > −d, where d = s̄(γ*, τ*, λ*).
The conditions stated in Assumption 7 represent a modification of those presented by Huber (1967) and are necessary in the proof of Lemma 5.1 below, which establishes the existence of pseudo likelihood estimators. If compactness of Λ is assumed, Lemma 5.1 is not necessary and domination is all that is needed to prove strong consistency.
Lemma 5.1
Let Assumptions 1 through 7 hold. There is a compact set Λ' containing λ* such that for almost every realization (e_t)_{t≥1} there exists a positive integer n₀ for which n ≥ n₀ implies

    sup_{λ∈Λ} s_n(λ) = sup_{λ∈Λ'} s_n(λ) .
Proof:
By condition (iii) of Assumption 7 there exist a compact set Λ' and 0 < ε < 1 such that

    inf_{λ∉Λ'} w(λ) ≥ (−d + ε) / (1 − ε) .

Let (e_t)_{t≥1} be a realization in A^c (Assumption 5) such that (τ̂_n)_{n≥1} is convergent to τ*. Almost every error realization is such. By conditions (i), (ii), and (iii) of Assumption 7 and Theorem 4.2, there exists n₁ such that for n ≥ n₁ and λ ∉ Λ'

    (1/n) Σ_{t=1}^n s(y_{tn}, x_t, τ̂_n, λ) < d − ε .

By condition (ii) of Assumption 7 and Theorem 4.2,

    lim_{n→∞} (1/n) Σ_{t=1}^n s(y_{tn}, x_t, τ̂_n, λ*) = s̄(γ*, τ*, λ*) = d .

Thus, there exists n₂ such that for n ≥ max(n₁, n₂) the maximum of s_n(λ) must occur at some point of Λ'. □
Lemma 5.1 suggests that an assumption that it eventually suffices to maximize s_n(λ) over a compact set is not as restrictive as might appear at first glance. As an alternative to Assumption 7, one may verify this conclusion directly as in Gallant and Holly (1978).

Let λ̂_n be such that s_n(λ̂_n) = sup_{λ∈Λ'} s_n(λ). Appealing to Lemma 2 of Jennrich (1969), stated below, we may consider λ̂_n a random variable.
Lemma 5.2
Let Q be a real valued function on Θ × Y, where Θ is a compact set of a Euclidean space and Y is a measurable space. For each θ in Θ let Q(θ, y) be a measurable function of y in Y and for each y in Y a continuous function of θ in Θ. Then there exists a measurable function θ̂ from Y into Θ such that for all y in Y

    Q(θ̂(y), y) = inf_{θ∈Θ} Q(θ, y) .
Theorem 5.1 (Strong consistency)
Let Assumptions 1 through 7 hold. Let (λ̂_n)_{n≥1} be a sequence of random variables (which exists by Lemmas 5.1 and 5.2) such that for sufficiently large n

    s_n(λ̂_n) = sup_{λ∈Λ'} s_n(λ) .

Then lim_{n→∞} λ̂_n = λ* almost surely.

Proof:
Let (e_t)_{t≥1} be a realization in A^c (Assumption 5) such that (τ̂_n)_{n≥1} is convergent. Almost every error realization is such. By condition (ii) of Assumption 7 and Theorem 4.2, (s_n(λ))_{n≥1} converges uniformly on Λ' to s̄(γ*, τ*, λ). Let λ̄ be a limit point of (λ̂_n). There exists a subsequence (λ̂_{n_k})_{k≥1} converging to λ̄. Then

    s̄(γ*, τ*, λ̄) = lim_{k→∞} s_{n_k}(λ̂_{n_k}) ≥ lim_{k→∞} s_{n_k}(λ*) = s̄(γ*, τ*, λ*) .

By Assumption 6, s̄(γ*, τ*, λ) has a unique maximum at λ = λ*. Hence λ̄ = λ*. Since for n large λ̂_n is confined to a compact set, lim_{n→∞} λ̂_n = λ*. □
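Theorem 5.1 can be watched in a toy simulation (my own construction; the model, noise level, and sample sizes are assumptions): a scalar pseudo likelihood estimator, here least squares for y_t = exp(λ* x_t) + e_t maximized over the compact set Λ' = [−3, 3], drifts toward λ* as n grows.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(2)
lam_star = 0.7                      # hypothetical true parameter

def lam_hat(n):
    x = rng.uniform(0.0, 1.0, size=n)
    y = np.exp(lam_star * x) + 0.2 * rng.standard_normal(n)
    # s_n(lambda) = -(1/n) sum_t (y_t - exp(lambda * x_t))**2, maximized
    # over the compact set Lambda' = [-3, 3] as in Lemma 5.1.
    neg_s_n = lambda lam: np.mean((y - np.exp(lam * x)) ** 2)
    return minimize_scalar(neg_s_n, bounds=(-3.0, 3.0), method="bounded").x

for n in (100, 1_000, 10_000):
    print(n, abs(lam_hat(n) - lam_star))
```

The printed errors shrink roughly like 1/√n, which anticipates the √n rate formalized in Chapter 6.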
6. ASYMPTOTIC NORMALITY

Frequently we will make use of the matrix representation of certain concepts in differential calculus. The notation employed is as follows. Let φ(x, y) be a real valued function of the k-vector x and the m-vector y. We denote

    ∂φ/∂x (x, y) = ( ∂φ/∂x₁ (x, y), ..., ∂φ/∂x_k (x, y) )' ,

    ∂φ/∂x' (x, y) = [ ∂φ/∂x (x, y) ]' .

If ψ(x) = (ψ₁(x), ..., ψ_c(x))' is a vector valued function of x, then ∂ψ/∂x' (x) is the c × k matrix whose rows are

    ∂ψ₁/∂x' (x), ..., ∂ψ_c/∂x' (x) .

In particular,

    ∂²φ/∂x'∂x (x, y) = ∂/∂x' [ ∂φ/∂x (x, y) ] .

The basic rules for manipulation of derivatives given in matrix form may be found in Theil (1971).
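The Jacobian convention above (rows indexed by the components of ψ, columns by the components of x) can be checked numerically; this sketch is my own and the particular ψ is an arbitrary choice.

```python
import numpy as np

# psi maps a k-vector to a c-vector; here k = 3 and c = 2 (illustrative only).
def psi(x):
    return np.array([x[0] * x[1], np.sin(x[2])])

def jacobian(fun, x, h=1e-6):
    """Finite-difference d(psi)/dx': rows index components of psi,
    columns index components of x, matching the text's convention."""
    fx = fun(x)
    J = np.empty((fx.size, x.size))
    for j in range(x.size):
        step = np.zeros_like(x)
        step[j] = h
        J[:, j] = (fun(x + step) - fx) / h
    return J

x0 = np.array([1.0, 2.0, 0.5])
J = jacobian(psi, x0)
# Analytic Jacobian at x0: [[x2, x1, 0], [0, 0, cos(x3)]]
print(J)
```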
Let Assumptions 1 through 7 hold so that there exists a sequence (λ̂_n)_{n≥1} of random variables strongly consistent for λ* satisfying s_n(λ̂_n) = sup_{λ∈Λ'} s_n(λ) for n sufficiently large. The framework necessary to establish asymptotic normality for √n (λ̂_n − λ*) is determined by Assumption 8 below. The notation is the same considered in Assumptions 1 through 7.
Assumption 8
The elements of

    ∂s/∂λ (y, x, τ, λ) ,  ∂²s/∂λ'∂λ (y, x, τ, λ) ,  ∂²s/∂τ'∂λ (y, x, τ, λ) ,  and  ∂s/∂λ (y, x, τ, λ) ∂s/∂λ' (y, x, τ, λ)

are continuous and dominated by b(G(y, x, γ), x) for all (y, x, γ, τ, λ) in Y × X × N̄_{ε₀}(γ*) × N̄_{ε₀}(τ*) × N̄_{ε₀}(λ*), where N̄_{ε₀}(λ*) ⊂ Λ'. Moreover:

i) For every x ∈ X and n ≥ 1

    ∫_E ∂s/∂λ (f(e, x, γ°_n), x, τ*, λ°_n) P_E(de) = 0 .

ii) ∫∫_{E×X} ∂²s/∂τ'∂λ (f(e, x, γ*), x, τ*, λ*) μ(de, dx) = 0 .

The two integral conditions imposed in Assumption 8 do not appear to impede application of the results in instances which come readily to mind. Apparently they are intrinsic properties of reasonable estimation procedures. Our example in Chapter 10 serves to illustrate both conditions.
Definition 6.1

    J = − ∫∫_{E×X} ∂²s/∂λ'∂λ (f(e, x, γ*), x, τ*, λ*) μ(de, dx) ,

    I = ∫∫_{E×X} ∂s/∂λ (f(e, x, γ*), x, τ*, λ*) ∂s/∂λ' (f(e, x, γ*), x, τ*, λ*) μ(de, dx) .
Proposition 6.1

Let Assumptions 1 through 8 hold.  Then $J$ is positive definite.

Proof:  By the mean value theorem and the dominated convergence theorem
$$ J = -\frac{\partial^2 S}{\partial\lambda'\partial\lambda}(\gamma^*,\tau^*,\lambda^*) $$
where $S$ is defined in Assumption 6.  Since $S(\gamma^*,\tau^*,\lambda)$ has a unique maximum at $\lambda^*$, $\dfrac{\partial^2 S}{\partial\lambda'\partial\lambda}(\gamma^*,\tau^*,\lambda^*)$ must be negative definite.  Then $J$ must be positive definite.  $\square$
At this point we remark that at the level of generality considered here the matrix $I$ (information matrix) may not be equal to $J$ or positive definite.  These facts do not affect our proofs of asymptotic normality in Theorems 6.1 and 6.2.  However, the singularity of $I$ is a nuisance for hypothesis testing and does not seem to be of practical importance.  In Chapter 7 we will impose nonsingularity for the information matrix.  As shown in Gallant and Holly (1978), $I$ is nonsingular (indeed $I = J$) when one restricts attention to maximum likelihood estimation.

Lemma 6.1 below is proved in Jenrich (1969).  It is a version of the mean value theorem from advanced calculus for random functions.  The proof presented by Jenrich permits an immediate extension to Taylor's expansions.
Lemma 6.1

Let $\varphi(y,\theta)$ be a real valued function on $\mathcal{Y}\times\Theta$ where $\mathcal{Y}$ is a measurable space and $\Theta$ is a convex compact subset of a Euclidean space.  For each $\theta$ in $\Theta$ let $\varphi(y,\theta)$ be a measurable function of $y$, and for each $y$ in $\mathcal{Y}$ a continuously differentiable function of $\theta$.  Let $\theta_1$ and $\theta_2$ be measurable functions from $\mathcal{Y}$ into $\Theta$.  Then there exists a measurable function $\bar\theta$ from $\mathcal{Y}$ into $\Theta$ such that

i)
$$ \varphi\bigl(y,\theta_2(y)\bigr) - \varphi\bigl(y,\theta_1(y)\bigr) = \frac{\partial\varphi}{\partial\theta'}\bigl(y,\bar\theta(y)\bigr)\,\bigl(\theta_2(y)-\theta_1(y)\bigr)\ , $$

ii)  $\bar\theta(y)$ lies on the segment joining $\theta_1(y)$ and $\theta_2(y)$, for all $y$ in $\mathcal{Y}$.
We now introduce the analog of the scores defined by Gallant and Holly (1978).

Theorem 6.1

Let Assumptions 1 through 8 hold.  Then
$$ \sqrt{n}\,\frac{\partial S_n}{\partial\lambda}(\lambda_n^0) \xrightarrow{\ L\ } N(0,\,I)\ . $$
Proof:  The proof is done in two steps.  First we show that
$$ \frac{1}{\sqrt{n}}\sum_{t=1}^{n}\frac{\partial s}{\partial\lambda}\bigl(y_{tn},x_t,\tau^*,\lambda_n^0\bigr) \xrightarrow{\ L\ } N(0,\,I) $$
by verifying the Lindeberg-Feller condition.  Notice that this random variable differs from $\sqrt{n}\,\dfrac{\partial S_n}{\partial\lambda}(\lambda_n^0)$ only in the $\tau$ component.  The latter has $\hat\tau_n$ instead of $\tau^*$.  With this result on hand we then show that the conclusion of the theorem is implied by Assumption 4 and the second integral condition of Assumption 8.

We may assume without any loss of generality that $\hat\tau_n$ takes its values in $N_{\delta_0}(\tau^*)$ for every $n$ (it is possible to construct a tail equivalent sequence with this property).  Let $u$ be a vector with $|u| = 1$ and consider
$$ Z_{tn} = u'\,\frac{\partial s}{\partial\lambda}\bigl(y_{tn},x_t,\tau^*,\lambda_n^0\bigr)\ , \qquad 1\le t\le n,\ n = 1,2,\ldots\ . $$
Each $Z_{tn}$ has mean zero by the integral condition (i) of Assumption 8, and variance
$$ \sigma_{tn}^2 = u'\int_E \frac{\partial s}{\partial\lambda}\bigl(f(e,x_t,\gamma_n^0),x_t,\tau^*,\lambda_n^0\bigr)\,\frac{\partial s}{\partial\lambda'}\bigl(f(e,x_t,\gamma_n^0),x_t,\tau^*,\lambda_n^0\bigr)\,P_E(de)\ u\ . $$
By Theorem 4.2,
$$ \lim_{n\to\infty}\frac{1}{n}\sum_{t=1}^{n}\sigma_{tn}^2 = u'Iu\ . \tag{6.1} $$
If $u'Iu = 0$, $\dfrac{1}{\sqrt{n}}\sum_{t=1}^{n} Z_{tn}$ converges in distribution to zero by Chebychev's inequality.  Assume $u'Iu > 0$.  Let $\varepsilon > 0$ be given.
Let
$$ A_n = \{\,z\in\mathbb{R} : |z| > \varepsilon\sqrt{V_n}\,\}\ , \qquad V_n = \sum_{t=1}^{n}\sigma_{tn}^2\ . $$
By the transformation theorem (Ash (1972)),
$$ B_n = \frac{1}{V_n}\sum_{t=1}^{n}\int_{A_n} z^2\,dF_{Z_{tn}}(z) = \frac{n}{V_n}\cdot\frac{1}{n}\sum_{t=1}^{n}\int_{A_n} z^2\,dF_{Z_{tn}}(z)\ . $$
We now show that the Lindeberg-Feller condition for Chung's (1974) Central Limit Theorem is satisfied, i.e., $\lim_{n\to\infty} B_n = 0$.  By (6.1),
$$ \lim_{n\to\infty}\frac{n}{V_n} = (u'Iu)^{-1} > 0\ . $$
Thus it is enough to show that
$$ \lim_{n\to\infty}\frac{1}{n}\sum_{t=1}^{n}\int_{A_n} z^2\,dF_{Z_{tn}}(z) = 0\ . $$
For this purpose let, for $\rho > 0$, $A(\rho) = \{\,z\in\mathbb{R} : |z| > \varepsilon\rho\,\}$ and
$$ B^*(\rho) = \int_E\!\!\int_X I_{A(\rho)}\Bigl(u'\frac{\partial s}{\partial\lambda}\bigl(f(e,x,\gamma^*),x,\tau^*,\lambda^*\bigr)\Bigr)\Bigl(u'\frac{\partial s}{\partial\lambda}\bigl(f(e,x,\gamma^*),x,\tau^*,\lambda^*\bigr)\Bigr)^2\,\mu(de,dx) $$
where $I_{A(\rho)}$ denotes the indicator function of $A(\rho)$.  Since $B^*(\rho)$ is finite and decreases to zero as $\rho\to\infty$ (Assumption 8), given $\eta > 0$ there exists a $\rho^*$ such that $B^*(\rho^*)\le\eta/2$.  Choose a continuous function $\phi$ on the line with $I_{A(2\rho^*)}\le\phi\le I_{A(\rho^*)}$, and a positive integer $n_1$ such that $\sqrt{V_n}\ge 2\rho^*$ for every $n\ge n_1$.  Let
$$ B_n^{\phi}(\rho^*) = \frac{1}{n}\sum_{t=1}^{n}\int_E \phi\Bigl(u'\frac{\partial s}{\partial\lambda}\bigl(f(e,x_t,\gamma_n^0),x_t,\tau^*,\lambda_n^0\bigr)\Bigr)\Bigl(u'\frac{\partial s}{\partial\lambda}\bigl(f(e,x_t,\gamma_n^0),x_t,\tau^*,\lambda_n^0\bigr)\Bigr)^2\,P_E(de)\ . $$
By Theorem 4.2, $\lim_{n\to\infty} B_n^{\phi}(\rho^*) = B^{\phi}(\rho^*)$, where $B^{\phi}(\rho^*)$ is the corresponding integral with respect to $\mu$ evaluated at $(\gamma^*,\tau^*,\lambda^*)$; by the choice of $\phi$, $B^{\phi}(\rho^*)\le B^*(\rho^*)\le\eta/2$.  Hence there exists a positive integer $n_2$ such that for all $n\ge\max(n_1,n_2)$
$$ \frac{1}{n}\sum_{t=1}^{n}\int_{A_n} z^2\,dF_{Z_{tn}}(z)\ \le\ B_n^{\phi}(\rho^*)\ \le\ B^{\phi}(\rho^*) + \frac{\eta}{2}\ \le\ \eta\ . $$
Thus $\lim_{n\to\infty} B_n = 0$.  By Chung's (1974) Central Limit Theorem,
$$ \frac{1}{\sqrt{V_n}}\sum_{t=1}^{n} Z_{tn} \xrightarrow{\ L\ } N(0,1) $$
and hence, by (6.1) and the Cramér-Wold device,
$$ \frac{1}{\sqrt{n}}\sum_{t=1}^{n}\frac{\partial s}{\partial\lambda}\bigl(y_{tn},x_t,\tau^*,\lambda_n^0\bigr) \xrightarrow{\ L\ } N(0,\,I)\ . \tag{6.2} $$
Let
$$ g_n(\tau) = \frac{1}{n}\sum_{t=1}^{n}\frac{\partial s}{\partial\lambda}\bigl(y_{tn},x_t,\tau,\lambda_n^0\bigr) $$
and denote by $g_{ni}(\tau)$ the $i$-th component of $g_n(\tau)$, $i = 1,2,\ldots,\ell$.  By Lemma 6.1, for each $i$ there exists a sequence of random variables $(\bar\tau_{ni})_{n\ge 1}$ converging to $\tau^*$ for which
$$ g_{ni}(\hat\tau_n) - g_{ni}(\tau^*) = \frac{\partial g_{ni}}{\partial\tau'}(\bar\tau_{ni})\,(\hat\tau_n - \tau^*)\ . $$
By Theorem 4.2 and the integral condition (ii) of Assumption 8,
$$ \lim_{n\to\infty}\frac{\partial g_{ni}}{\partial\tau'}(\bar\tau_{ni}) = 0 $$
almost surely.  Then we can write
$$ g_n(\hat\tau_n) - g_n(\tau^*) = C_n\,(\hat\tau_n - \tau^*) $$
where the elements of $C_n$ go to zero almost surely.  Since $\sqrt{n}\,(\hat\tau_n - \tau^*)$ is bounded in probability (Assumption 4), $\sqrt{n}\,g_n(\hat\tau_n)$ and $\sqrt{n}\,g_n(\tau^*)$ have the same asymptotic distribution.  But
$$ \sqrt{n}\,g_n(\hat\tau_n) = \sqrt{n}\,\frac{\partial S_n}{\partial\lambda}(\lambda_n^0) \quad\text{and}\quad \sqrt{n}\,g_n(\tau^*) = \frac{1}{\sqrt{n}}\sum_{t=1}^{n}\frac{\partial s}{\partial\lambda}\bigl(y_{tn},x_t,\tau^*,\lambda_n^0\bigr)\ . $$
The result then follows from (6.2).  $\square$
Theorem 6.2 (Asymptotic normality)

Let Assumptions 1 through 8 hold.  Then
$$ \sqrt{n}\,\frac{\partial S_n}{\partial\lambda}(\lambda^*) \xrightarrow{\ L\ } N(J\delta,\,I) \quad\text{and}\quad \sqrt{n}\,(\hat\lambda_n - \lambda^*) \xrightarrow{\ L\ } N(\delta,\,J^{-1}IJ^{-1})\ . $$
Let
$$ \hat J_n = -\frac{\partial^2 S_n}{\partial\lambda'\partial\lambda}(\hat\lambda_n)\ , \qquad \hat I_n = \frac{1}{n}\sum_{t=1}^{n}\frac{\partial s}{\partial\lambda}\bigl(y_t,x_t,\hat\tau_n,\hat\lambda_n\bigr)\,\frac{\partial s}{\partial\lambda'}\bigl(y_t,x_t,\hat\tau_n,\hat\lambda_n\bigr)\ . $$
Then, almost surely, $\lim_{n\to\infty}\hat J_n = J$ and $\lim_{n\to\infty}\hat I_n = I$.
Proof:  Without any loss of generality we may assume that $\hat\tau_n$ takes its values in $N_{\delta_0}(\tau^*)$ and $\hat\lambda_n$ in $N_{\delta_0}(\lambda^*)$ for every $n\ge 1$ (it is possible to construct tail equivalent sequences with this property).  The convergence of $\hat J_n$ and $\hat I_n$ follows from Theorem 4.2.

By Lemma 6.1, for each $i = 1,2,\ldots,\ell$ there exists a sequence of random variables $(\bar\lambda_{ni})_{n\ge 1}$ converging to $\lambda^*$ for which
$$ \frac{\partial S_n}{\partial\lambda_i}(\hat\lambda_n) - \frac{\partial S_n}{\partial\lambda_i}(\lambda^*) = \frac{\partial^2 S_n}{\partial\lambda'\partial\lambda_i}(\bar\lambda_{ni})\,(\hat\lambda_n - \lambda^*)\ . $$
Since $\hat\lambda_n$ maximizes $S_n(\lambda)$ for $n$ large,
$$ \frac{\partial S_n}{\partial\lambda}(\hat\lambda_n) = 0\ . $$
Thus
$$ -\frac{\partial S_n}{\partial\lambda_i}(\lambda^*) = \frac{\partial^2 S_n}{\partial\lambda'\partial\lambda_i}(\bar\lambda_{ni})\,(\hat\lambda_n - \lambda^*) $$
and by Theorem 4.2 we can write
$$ \frac{\partial S_n}{\partial\lambda}(\lambda^*) = (J + \alpha_n)\,(\hat\lambda_n - \lambda^*) \tag{6.3} $$
where the elements of $\alpha_n$ go to zero almost surely.  A similar argument yields
$$ \frac{\partial S_n}{\partial\lambda}(\lambda^*) - \frac{\partial S_n}{\partial\lambda}(\lambda_n^0) = (J + \beta_n)\,(\lambda_n^0 - \lambda^*) \tag{6.4} $$
where the elements of $\beta_n$ go to zero.  By (6.4), Theorem 6.1, and Assumption 6,
$$ \sqrt{n}\,\frac{\partial S_n}{\partial\lambda}(\lambda^*) \xrightarrow{\ L\ } N(J\delta,\,I)\ . $$
From this fact and (6.3),
$$ \sqrt{n}\,(\hat\lambda_n - \lambda^*) \xrightarrow{\ L\ } N(\delta,\,J^{-1}IJ^{-1})\ . \qquad\square $$
We translate these results to the language of Chapter 3 as follows.  Suppose the vector $\lambda^0$ satisfies the unique maximum condition.  Then, under the assumptions we have stated so far, $\hat\lambda_n$ is strongly consistent for $\lambda^0$ and $\sqrt{n}\,(\hat\lambda_n - \lambda^0) \xrightarrow{\ L\ } N(0,\,J^{-1}IJ^{-1})$.  The asymptotic variance covariance matrix is estimated consistently by $\hat J_n^{-1}\hat I_n\hat J_n^{-1}$.
7.  TEST STATISTICS

Test statistics are defined and the asymptotic distributions of these random variables are derived in the context of Assumptions 1 through 8 and certain conditions imposed on $h$.  These results, as seen in Section 7.4, dictate the procedures to be followed in any application of the statistical model proposed in Chapter 3.

Assumption 9

The matrix $I$ of Definition 6.1 has full rank.

Assumption 10

The $r$-vector valued function $h$ is continuously differentiable, $h(\lambda^*) = 0$, and
$$ H(\lambda^*) = \frac{\partial h}{\partial\lambda'}(\lambda^*) $$
has full row rank.  The statement "under the null hypothesis" means that $h(\lambda_n^0) = 0$ for every $n$.

In what follows we use Searle's (1971) definition of a noncentral chi-square.
7.1  Wald's Test

The Wald's test statistic is
$$ W_n = n\,\bigl(h(\hat\lambda_n)\bigr)'\,\bigl(\hat H_n\hat V_n\hat H_n'\bigr)^{-1}\,\bigl(h(\hat\lambda_n)\bigr) $$
where $\hat H_n = H(\hat\lambda_n)$ and $\hat V_n = \hat J_n^{-1}\hat I_n\hat J_n^{-1}$.

Theorem 7.1.1

Let Assumptions 1 through 10 hold.  Then $W_n$ converges in distribution to a noncentral chi-square with $r$ degrees of freedom and noncentrality parameter
$$ \frac{\delta'H'(HVH')^{-1}H\delta}{2} $$
where $H = H(\lambda^*)$ and $V = J^{-1}IJ^{-1}$.  Under the null hypothesis the convergence is to a chi-square with $r$ degrees of freedom.
Proof:  Without any loss of generality we may assume that $\hat\lambda_n$ takes its values in $N_{\delta_0}(\lambda^*)$ for every $n$ (we can construct a tail equivalent sequence with this property).  By Lemma 6.1, for each $i = 1,2,\ldots,r$ there exists a sequence of random variables $(\bar\lambda_{ni})_{n\ge 1}$ converging to $\lambda^*$ almost surely for which
$$ h_i(\hat\lambda_n) - h_i(\lambda^*) = \frac{\partial h_i}{\partial\lambda'}(\bar\lambda_{ni})\,(\hat\lambda_n - \lambda^*)\ . $$
Since the elements of $H(\lambda)$ are continuous functions we may write
$$ h(\hat\lambda_n) - h(\lambda^*) = (H + \alpha_n)\,(\hat\lambda_n - \lambda^*) $$
where the elements of $\alpha_n$ go to zero almost surely.  Since $h(\lambda^*) = 0$, $\sqrt{n}\,h(\hat\lambda_n)$ and $H\sqrt{n}\,(\hat\lambda_n - \lambda^*)$ have the same asymptotic distribution.  $(\hat H_n\hat V_n\hat H_n')^{-1/2}$ exists for $n$ sufficiently large and converges almost surely to $(HVH')^{-1/2}$.  It follows that $(\hat H_n\hat V_n\hat H_n')^{-1/2}\sqrt{n}\,h(\hat\lambda_n)$ and $(HVH')^{-1/2}H\sqrt{n}\,(\hat\lambda_n - \lambda^*)$ have the same asymptotic distribution.  Since
$$ (HVH')^{-1/2}H\sqrt{n}\,(\hat\lambda_n - \lambda^*) \xrightarrow{\ L\ } N\bigl((HVH')^{-1/2}H\delta,\ I_r\bigr) \tag{7.1} $$
the first claim of the theorem follows.

By the mean value theorem, for each $i = 1,2,\ldots,r$ there exists a sequence $(\bar\lambda_{ni})_{n\ge 1}$ converging to $\lambda^*$ for which
$$ h_i(\lambda_n^0) - h_i(\lambda^*) = \frac{\partial h_i}{\partial\lambda'}(\bar\lambda_{ni})\,(\lambda_n^0 - \lambda^*)\ . $$
Thus we can write
$$ \sqrt{n}\,h(\lambda_n^0) = (H + \beta_n)\,\sqrt{n}\,(\lambda_n^0 - \lambda^*) $$
where the elements of $\beta_n$ are convergent to zero.  Under the null hypothesis $h(\lambda_n^0) = 0$ for every $n$.  It follows that $H\delta = 0$.  The convergence of $W_n$ to a chi-square with $r$ degrees of freedom then follows from (7.1).  $\square$
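As a purely computational aside (not part of the original development), the Wald statistic above is straightforward to evaluate once $\hat\lambda_n$, $\hat J_n$, and $\hat I_n$ are in hand.  The sketch below assumes hypothetical inputs; the function names and the toy restriction $h(\lambda) = \lambda_1$ are illustrative only.

```python
import numpy as np

def wald_statistic(lam_hat, h, H, J_hat, I_hat, n):
    """W_n = n h(lam)' (H V H')^{-1} h(lam) with V = J^{-1} I J^{-1}."""
    hv = h(lam_hat)                 # r-vector of restrictions at lambda-hat
    Hm = H(lam_hat)                 # r x l Jacobian H(lambda-hat)
    Jinv = np.linalg.inv(J_hat)
    V = Jinv @ I_hat @ Jinv         # sandwich variance J^{-1} I J^{-1}
    M = Hm @ V @ Hm.T               # H V H'
    return float(n * hv @ np.linalg.solve(M, hv))

# Hypothetical example: test H0: lambda_1 = 0 for a 2-vector lambda.
lam_hat = np.array([0.5, 1.0])
W = wald_statistic(lam_hat,
                   h=lambda l: l[:1],
                   H=lambda l: np.array([[1.0, 0.0]]),
                   J_hat=np.eye(2), I_hat=np.eye(2), n=100)
```

Under the null hypothesis $W$ would be compared with a chi-square critical value with $r = 1$ degree of freedom.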
7.2  The Lagrange Multiplier Test

The Lagrange multiplier test (an analog of it) requires the existence of an estimator subject to the constraint $h(\lambda) = 0$.  If Assumptions 1 through 10 hold there exists a sequence of random variables $(\tilde\lambda_n)_{n\ge 1}$ such that for $n$ sufficiently large $\tilde\lambda_n$ maximizes $S_n(\lambda)$ subject to $h(\lambda) = 0$.  An argument similar to that of Theorem 5.1 shows that, except for outcomes in an unusual set, $\lim_{n\to\infty}\tilde\lambda_n = \lambda^*$.

The Lagrange multiplier test statistic is
$$ R_n = n\,\frac{\partial S_n}{\partial\lambda'}(\tilde\lambda_n)\,\tilde J_n^{-1}\tilde H_n'\,\bigl(\tilde H_n\tilde V_n\tilde H_n'\bigr)^{-1}\tilde H_n\tilde J_n^{-1}\,\frac{\partial S_n}{\partial\lambda}(\tilde\lambda_n) $$
where $\tilde J_n$, $\tilde H_n$, and $\tilde V_n$ are the analogs of $\hat J_n$, $\hat H_n$, and $\hat V_n$, respectively, when the constrained estimator $\tilde\lambda_n$ is considered.
Theorem 7.2.1

Let Assumptions 1 through 10 hold.  Then $R_n$ converges in distribution to a noncentral chi-square with $r$ degrees of freedom and noncentrality parameter
$$ \frac{\delta'H'(HVH')^{-1}H\delta}{2}\ . $$
Under the null hypothesis, $R_n$ converges in distribution to a chi-square with $r$ degrees of freedom.
Proof:  Without any loss of generality we may assume that $\tilde\lambda_n$ takes its values in $N_{\delta_0}(\lambda^*)$ for every $n$ (we can construct a tail equivalent sequence with this property).

By Lemma 6.1, Theorem 4.2, and almost sure convergence of $\hat\tau_n$ and $\tilde\lambda_n$, we can write
$$ \frac{\partial S_n}{\partial\lambda}(\lambda^*) - \frac{\partial S_n}{\partial\lambda}(\tilde\lambda_n) = (J + \alpha_n)\,(\tilde\lambda_n - \lambda^*) $$
where the elements of $\alpha_n$ go to zero almost surely.  By Lemma 6.1, continuity of the elements of $H(\lambda)$, and almost sure convergence of $\tilde\lambda_n$, we can write
$$ h(\tilde\lambda_n) - h(\lambda^*) = (H + \beta_n)\,(\tilde\lambda_n - \lambda^*) $$
where the elements of $\beta_n$ go to zero almost surely.  Hence, for $n$ large, since $h(\tilde\lambda_n) = 0$ and $h(\lambda^*) = 0$,
$$ \sqrt{n}\,(H + \beta_n)\,(\tilde\lambda_n - \lambda^*) = 0\ . $$
Thus
$$ \sqrt{n}\,HJ^{-1}\frac{\partial S_n}{\partial\lambda}(\tilde\lambda_n) \xrightarrow{\ L\ } N(H\delta,\ HVH')\ . \tag{7.2} $$
Since $\tilde J_n$, $\tilde V_n$, and $\tilde H_n$ converge almost surely to $J$, $V$, and $H$, respectively,
$$ \bigl(\tilde H_n\tilde V_n\tilde H_n'\bigr)^{-1/2}\tilde H_n\tilde J_n^{-1}\sqrt{n}\,\frac{\partial S_n}{\partial\lambda}(\tilde\lambda_n) \xrightarrow{\ L\ } N\bigl[(HVH')^{-1/2}H\delta,\ I_r\bigr] $$
and the first claim of the theorem follows.  Under the null hypothesis, $H\delta = 0$ as in the proof of Theorem 7.1.1.  $\square$
7.3  The Likelihood Ratio Test

The likelihood ratio test (an analog of it) is based on the statistic
$$ L_n = 2n\,\bigl(S_n(\hat\lambda_n) - S_n(\tilde\lambda_n)\bigr)\ . $$

Theorem 7.3.1

Let Assumptions 1 through 10 hold.  Assume also $I = J$.  Then $L_n$ converges in distribution to a noncentral chi-square with $r$ degrees of freedom and noncentrality parameter
$$ \frac{\delta'H'(HVH')^{-1}H\delta}{2}\ . $$
Under the null hypothesis, $L_n$ converges in distribution to a chi-square with $r$ degrees of freedom.
Proof:  Without any loss of generality we may assume that $\hat\lambda_n$ and $\tilde\lambda_n$ take their values in $N_{\delta_0}(\lambda^*)$ for every $n$ (we can construct tail equivalent sequences with this property).

By the Taylor's expansion version of Lemma 6.1 there exists a sequence of random variables $(\bar\lambda_n)_{n\ge 1}$ converging to $\lambda^*$ almost surely for which
$$ S_n(\tilde\lambda_n) - S_n(\hat\lambda_n) = \frac{\partial S_n}{\partial\lambda'}(\hat\lambda_n)\,(\tilde\lambda_n - \hat\lambda_n) + \tfrac{1}{2}\,(\tilde\lambda_n - \hat\lambda_n)'\,\frac{\partial^2 S_n}{\partial\lambda'\partial\lambda}(\bar\lambda_n)\,(\tilde\lambda_n - \hat\lambda_n)\ . $$
Since $\dfrac{\partial S_n}{\partial\lambda}(\hat\lambda_n) = 0$ for $n$ large, by Theorem 4.2 we can write
$$ 2n\,\bigl(S_n(\hat\lambda_n) - S_n(\tilde\lambda_n)\bigr) = n\,(\tilde\lambda_n - \hat\lambda_n)'\,(J + \alpha_n)\,(\tilde\lambda_n - \hat\lambda_n) $$
where the elements of $\alpha_n$ go to zero almost surely.

There are Lagrange multipliers $\hat\theta_n$ satisfying
$$ \frac{\partial S_n}{\partial\lambda}(\tilde\lambda_n) - \tilde H_n'\hat\theta_n = 0\ . \tag{7.3} $$
By continuity of the elements of $H(\lambda)$ and almost sure convergence of $\tilde\lambda_n$, we can write $\tilde H_n = H + \beta_n$, where the elements of $\beta_n$ go to zero almost surely.  Then
$$ \sqrt{n}\,\frac{\partial S_n}{\partial\lambda}(\tilde\lambda_n) = (H + \beta_n)'\,\sqrt{n}\,\hat\theta_n\ . $$
From this expression and (7.2) in the proof of Theorem 7.2.1, we conclude
$$ \sqrt{n}\,\hat\theta_n \xrightarrow{\ L\ } N\bigl[(HJ^{-1}H')^{-1}H\delta,\ (HJ^{-1}H')^{-1}(HVH')(HJ^{-1}H')^{-1}\bigr]\ . \tag{7.4} $$
By Lemma 6.1, Theorem 4.2, and almost sure convergence of $\hat\lambda_n$ and $\tilde\lambda_n$ we can write
$$ \frac{\partial S_n}{\partial\lambda}(\tilde\lambda_n) - \frac{\partial S_n}{\partial\lambda}(\hat\lambda_n) = -(J + \xi_n)\,(\tilde\lambda_n - \hat\lambda_n) $$
where the elements of $\xi_n$ go to zero almost surely.  Then, for $n$ large, $\dfrac{\partial S_n}{\partial\lambda}(\hat\lambda_n) = 0$ and, by definition of $\hat\theta_n$,
$$ \sqrt{n}\,(\hat\lambda_n - \tilde\lambda_n) = (J + \xi_n)^{-1}(H + \beta_n)'\,\sqrt{n}\,\hat\theta_n\ . $$
Thus, by (7.4),
$$ \sqrt{n}\,(\hat\lambda_n - \tilde\lambda_n) \xrightarrow{\ L\ } N\bigl[J^{-1}H'(HJ^{-1}H')^{-1}H\delta,\ J^{-1}H'(HJ^{-1}H')^{-1}(HVH')(HJ^{-1}H')^{-1}HJ^{-1}\bigr]\ . \tag{7.5} $$
Hence
$$ L_n = 2n\,\bigl[S_n(\hat\lambda_n) - S_n(\tilde\lambda_n)\bigr] \xrightarrow{\ L\ } Z'JZ $$
where $Z$ is the multivariate normal with mean and variance covariance matrix as in (7.5).  If $J = I$ then $V = J^{-1}$, and the variance covariance matrix of $Z$, $\Sigma$ say, is such that $\Sigma J$ is idempotent.  The first claim of the theorem then follows from Searle (1971).  Under the null hypothesis, as before, $H\delta = 0$.  $\square$
7.4  Applications

As we have done in previous chapters, we now show how the theory fits in the statistical model proposed in Chapter 3.  If $\lambda^0$ satisfies the unique maximum requirement, we have seen that our assumptions imply the almost sure convergence of $\hat\lambda_n$ to $\lambda^0$ and asymptotic normality for $\sqrt{n}\,(\hat\lambda_n - \lambda^0)$.  Suppose we want to perform a test of $H_0$ against $H_A$.  For large samples, under $H_0$, the statistics $W_n$, $R_n$, and $L_n$ (if $I = J$) have distributions that can be approximated by a chi-square with $r$ degrees of freedom and therefore may be used for the test.  Suppose we also want to investigate the behavior of the particular test statistic we choose in the presence of small deviations from the null hypothesis.  Our theory says that a noncentrality parameter is obtained if we specify a (convergent) sequence $(\gamma_n^0)_{n\ge 1}$ such that the corresponding sequence $(\lambda_n^0)_{n\ge 1}$ (Assumption 6) converges to a point satisfying the null hypothesis.  To use this result in practice the argument is as follows.  As mentioned earlier, typically, $\tau_n^0$ and $\lambda_n^0$ are easily computed functions of $\gamma_n^0$, so that if $\gamma_n^0$ is specified then $(\gamma_n^0,\tau_n^0,\lambda_n^0)$ becomes specified, and $\lambda_n^0$ will satisfy the unique maximum condition.  Thus, in a typical application, $(\gamma_n^0,\tau_n^0,\lambda_n^0)$ is specified and the noncentrality parameter
$$ \alpha = \frac{\delta'H'(HVH')^{-1}H\delta}{2} $$
is to be computed.  The use of Theorem 7.4.1 below avoids the problem of having to specify $(\gamma^*,\tau^*,\lambda^*)$ to make this computation.  The interesting feature of this result is that integration with respect to $P_X$ is not necessary to approximate $\alpha$.
Theorem 7.4.1

Let Assumptions 1 through 10 hold.  Let $(\gamma_n^0,\tau_n^0,\lambda_n^0)$ be any sequence converging to $(\gamma^*,\tau^*,\lambda^*)$ with $\lim_{n\to\infty}\sqrt{n}\,(\lambda_n^0 - \lambda^*) = \delta$.  Let
$$ J_n^0 = -\frac{1}{n}\sum_{t=1}^{n}\int_E \frac{\partial^2 s}{\partial\lambda'\partial\lambda}\bigl(f(e,x_t,\gamma_n^0),x_t,\tau_n^0,\lambda_n^0\bigr)\,P_E(de)\ , $$
$H_n^0 = H(\lambda_n^0)$, $V_n^0$ the analog of $V$ computed from $J_n^0$ and the corresponding outer product matrix $I_n^0$, and
$$ \alpha_n^0 = \frac{n\,\bigl(h(\lambda_n^0)\bigr)'\,\bigl(H_n^0 V_n^0 H_n^{0\,\prime}\bigr)^{-1}\,\bigl(h(\lambda_n^0)\bigr)}{2}\ . $$
Then $\lim_{n\to\infty}\alpha_n^0 = \alpha$.
Proof:  By the mean value theorem, for each $i = 1,\ldots,r$ there exists a sequence $(\bar\lambda_{ni})_{n\ge 1}$ converging to $\lambda^*$ for which
$$ h_i(\lambda_n^0) - h_i(\lambda^*) = \frac{\partial h_i}{\partial\lambda'}(\bar\lambda_{ni})\,(\lambda_n^0 - \lambda^*)\ . $$
By continuity of the elements of $H(\lambda)$ we can write
$$ \sqrt{n}\,h(\lambda_n^0) = (H + \beta_n)\,\sqrt{n}\,(\lambda_n^0 - \lambda^*) $$
where the elements of $\beta_n$ converge to zero.  For $n$ sufficiently large,
$$ \bigl(H_n^0 V_n^0 H_n^{0\,\prime}\bigr)^{-1/2}\,(H + \beta_n)\,\sqrt{n}\,(\lambda_n^0 - \lambda^*) \longrightarrow (HVH')^{-1/2}H\delta\ . $$
Hence $\lim_{n\to\infty}\alpha_n^0 = \alpha$ and the result follows.  $\square$
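The computation sanctioned by Theorem 7.4.1 is a finite matrix calculation at the drifting parameter.  The sketch below assumes hypothetical inputs ($\lambda_n^0$, $V_n^0$, and the toy restriction are all illustrative, not taken from the thesis).

```python
import numpy as np

def noncentrality(lam_n0, h, H, V_n0, n):
    """alpha_n^0 = n h(lam_n0)' (H_n^0 V_n^0 H_n^0')^{-1} h(lam_n0) / 2."""
    hv = h(lam_n0)                  # restrictions evaluated at the drifting parameter
    Hm = H(lam_n0)                  # Jacobian H(lambda_n^0)
    M = Hm @ V_n0 @ Hm.T            # H_n^0 V_n^0 H_n^0'
    return float(n * hv @ np.linalg.solve(M, hv) / 2.0)

# Hypothetical local alternative lambda_n^0 = (0.1, 1.0)' with n = 400
# and restriction h(lambda) = lambda_1.
alpha = noncentrality(np.array([0.1, 1.0]),
                      h=lambda l: l[:1],
                      H=lambda l: np.array([[1.0, 0.0]]),
                      V_n0=np.eye(2), n=400)
```

No integration with respect to $P_X$ enters the computation, which is the point of the theorem.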
8.  THE MAXIMUM LIKELIHOOD ESTIMATOR

Leaving out the details, we now show how the Gallant and Holly (1978) results can be derived from the theory we have presented.  The basic relation is
$$ g(y_t,x_t,\theta^0) = u_t\ , \quad t = 1,2,\ldots,n \tag{8.1} $$
where the residual sequence $(u_t)_{t\ge 1}$ is the realization of an independently and identically distributed stochastic process.  The common probability law for this process has mean zero, variance covariance matrix $\Sigma^*$ positive definite, and is absolutely continuous with respect to the Lebesgue measure on $\mathbb{R}^m$.  The parameter $\theta^0$ is a vector in $\Theta$.

Definition 8.1

Let $\Sigma = (a_{ij})$ be a symmetric $m\times m$ matrix and
$$ \sigma = (a_{11},a_{12},a_{22},\ldots,a_{1m},a_{2m},\ldots,a_{mm})'\ . $$
The notation $\Sigma(\sigma) = \Sigma$ will be used.

Let $\sigma^*$ satisfy $\Sigma(\sigma^*) = \Sigma^*$.  We can write
$$ g(y_t,x_t,\theta^0) = \bigl(\Sigma(\sigma^*)\bigr)^{1/2} e_t\ , \quad t = 1,2,\ldots,n\ . $$
Let $e_t = \bigl(\Sigma(\sigma^*)\bigr)^{-1/2} u_t$ and
$$ G(y,x,\gamma^0) = \bigl(\Sigma(\sigma^*)\bigr)^{-1/2} g(y,x,\theta^0)\ , \qquad \gamma^0 = (\theta^0,\sigma^*)\ . $$
Then (8.1) is equivalent to
$$ G(y_t,x_t,\gamma^0) = e_t\ , \quad t = 1,2,\ldots,n $$
where the $e$'s have mean zero and identity variance covariance matrix.  Also $\gamma^0\in\Gamma$, where $\Gamma$ is the cartesian product of $\Theta$ with the set of all $\sigma$'s such that $\Sigma(\sigma)$ is positive definite.  Denote by $p(e)$ the common density for the $e$-sequence.
If the conditions of the implicit function theorem are satisfied for $G$, and Assumption 1 holds, for every pair $(x,\gamma)$ in $\mathcal{X}\times\Gamma$ we can define a density
$$ p(y\,|\,x,\gamma) = \Bigl|\det\frac{\partial G}{\partial y'}(y,x,\gamma)\Bigr|\;p\bigl(G(y,x,\gamma)\bigr)\ . $$
Let $s(y,x,\lambda) = \ln p(y\,|\,x,\lambda)$, $\lambda\in\Gamma$.  Then
$$ S_n(\lambda) \to \int_E\!\!\int_X s\bigl(f(e,x,\gamma^0),x,\lambda\bigr)\,\mu(de,dx) = \int_Y\!\!\int_X \ln p(y\,|\,x,\lambda)\,p(y\,|\,x,\gamma^0)\,dy\,P_X(dx)\ . \tag{8.2} $$
Assume that (8.2) has a unique maximum at $\lambda = \gamma^0$.  Under Assumptions 1 through 8, if $\hat\lambda_n = (\hat\theta_n,\hat\sigma_n)$ maximizes $S_n(\lambda)$, then $\hat\theta_n$ is strongly consistent for $\theta^0$, $\hat\sigma_n$ is strongly consistent for $\sigma^*$, and
$$ \sqrt{n}\begin{pmatrix} \hat\theta_n - \theta^0 \\ \hat\sigma_n - \sigma^* \end{pmatrix} \xrightarrow{\ L\ } N(0,\,J^{-1}) $$
since $J = I$ in the maximum likelihood context.  The matrix $J^{-1}$ is estimated by $\hat J_n^{-1}$ (or $\hat I_n^{-1}$).

Suppose that $h(\lambda)$ is continuously differentiable and its Jacobian has full row rank at $\lambda = \gamma^0$.  We may test $H_0 : h(\gamma^0) = 0$ against $H_A : h(\gamma^0) \ne 0$.
Since $\sigma^*$ is considered as a nuisance parameter for hypothesis testing, the only functions $h$ of interest are those of the form $h(\lambda) = \bar h(\theta)$.  Then, with the notation of Chapter 7,
$$ H(\lambda) = \bigl(\bar H(\theta),\ 0\bigr)\ , \qquad \bar H(\theta) = \frac{\partial\bar h}{\partial\theta'}(\theta)\ . $$
The Wald's test statistic, for example, becomes
$$ W_n = n\,\bigl(\bar h(\hat\theta_n)\bigr)'\,\bigl(\bar H_n\hat J_n^{\,\theta\theta}\bar H_n'\bigr)^{-1}\,\bigl(\bar h(\hat\theta_n)\bigr) $$
where
$$ \hat J_n^{-1} = \begin{pmatrix} \hat J_n^{\,\theta\theta} & \hat J_n^{\,\theta\sigma} \\ \hat J_n^{\,\sigma\theta} & \hat J_n^{\,\sigma\sigma} \end{pmatrix} $$
the partition being conformable to the dimensions of $\theta$ and $\sigma$ (clearly the matrix $\hat I_n$ could also be used).  Under the null hypothesis, for $n$ large, $W_n$ has a distribution that can be approximated by a chi-square with degrees of freedom determined by the dimension of the vector function $\bar h$.  To approximate power, only $\gamma_n^0 = (\theta_n^0,\sigma_n^0)$ must be specified.  By (8.2), $\lambda_n^0 = \gamma_n^0$, so that Theorem 7.4.1 may be used to approximate the noncentrality parameter $\alpha$ of the asymptotic non-null distribution of $W_n$ (also of $R_n$ and $L_n$).

At this point we remark that we have diverged a little from Gallant and Holly (1978).  The tests derived in their paper are based on the concentrated likelihood technique.  We have not done this here, although it is possible and straightforward along their lines.  In this respect it is interesting to observe that explicit knowledge of the limiting law of $\sqrt{n}\,(\hat\theta_n - \theta^0)$ is not necessary to derive the asymptotic distribution of any of the three test statistics considered in Chapter 7.
9.  THE SINGLE EQUATION EXPLICIT MODEL

Assume that responses $y_t$ to inputs $x_t$ are generated according to the structure
$$ y_t = g(x_t,\theta^0) + u_t\ , \quad t = 1,2,\ldots,n\ . \tag{9.1} $$
The parameter $\theta^0$ is an interior point of $\Theta$, which is a compact subset of $\mathbb{R}^{\ell}$.  The function $g$ is continuous and, for each $x$ fixed, twice continuously differentiable in $\theta$.  The residual sequence $(u_t)_{t\ge 1}$ is the realization of an independently and identically distributed sequence of random variables, each with mean zero and variance $(\sigma^*)^2$, $\sigma^* > 0$.

This model can be transformed to the standard form of Chapter 3 as follows.  Let
$$ G(y,x,\gamma^0) = \frac{y - g(x,\theta^0)}{\sigma^*}\ , \qquad \gamma^0 = (\theta^0,\sigma^*)\ . $$
Then (9.1) is equivalent to
$$ G(y_t,x_t,\gamma^0) = e_t\ , \quad t = 1,2,\ldots,n $$
where $e_t = u_t/\sigma^*$.  Here the reduced form is
$$ y = f(e,x,\gamma^0) = \sigma^* e + g(x,\theta^0)\ . $$
Define
$$ s(y,x,\theta) = -\tfrac{1}{2}\bigl(y - g(x,\theta)\bigr)^2\ . $$
Then
$$ S_n(\theta) = -\frac{1}{2n}\sum_{t=1}^{n}\bigl(y_t - g(x_t,\theta)\bigr)^2\ . $$
Notice that the definition of $s$ above implies that $\sigma^*$ is being treated as a nuisance parameter, i.e., no interest is being focused on $\sigma^*$ for estimation purposes.

Let Assumptions 3 and 5 hold.  We have:
$$ \int_E\!\!\int_X s\bigl(f(e,x,\gamma^0),x,\theta\bigr)\,\mu(de,dx) = -\frac{(\sigma^*)^2}{2} - \frac{1}{2}\int_X \bigl(g(x,\theta^0) - g(x,\theta)\bigr)^2\,P_X(dx)\ . $$
Assumption 6 will be satisfied if
$$ \int_X \bigl(g(x,\theta^0) - g(x,\theta)\bigr)^2\,P_X(dx) $$
has a unique minimum at $\theta = \theta^0$.  Assume that this is the case.  If $\hat\theta_n$ maximizes $S_n(\theta)$ and the integrability conditions of Chapters 5 and 6 are satisfied, then $\hat\theta_n$ is strongly consistent for $\theta^0$ and $\sqrt{n}\,(\hat\theta_n - \theta^0) \xrightarrow{\ L\ } N(0,\,J^{-1}IJ^{-1})$.  For illustration purposes we will compute $J$ and $I$.  Clearly
$$ \frac{\partial s}{\partial\theta}(y,x,\theta) = \bigl(y - g(x,\theta)\bigr)\,\frac{\partial g}{\partial\theta}(x,\theta)\ , $$
$$ \frac{\partial^2 s}{\partial\theta'\partial\theta}(y,x,\theta) = -\frac{\partial g}{\partial\theta}(x,\theta)\,\frac{\partial g}{\partial\theta'}(x,\theta) + \bigl(y - g(x,\theta)\bigr)\,\frac{\partial^2 g}{\partial\theta'\partial\theta}(x,\theta)\ . $$
It follows that
$$ J = \int_X \frac{\partial g}{\partial\theta}(x,\theta^0)\,\frac{\partial g}{\partial\theta'}(x,\theta^0)\,P_X(dx)\ , $$
$$ I = (\sigma^*)^2\int_X \frac{\partial g}{\partial\theta}(x,\theta^0)\,\frac{\partial g}{\partial\theta'}(x,\theta^0)\,P_X(dx)\ . $$
Thus, the asymptotic variance covariance matrix of $\sqrt{n}\,(\hat\theta_n - \theta^0)$ is
$$ J^{-1}IJ^{-1} = (\sigma^*)^2\Bigl(\int_X \frac{\partial g}{\partial\theta}(x,\theta^0)\,\frac{\partial g}{\partial\theta'}(x,\theta^0)\,P_X(dx)\Bigr)^{-1} $$
consistently estimated by $\hat J_n^{-1}\hat I_n\hat J_n^{-1}$ (Theorem 6.2).  Since $J$ is not equal to $I$, the likelihood ratio test cannot be performed.  Of course the reason for this difference lies in the choice of the objective function.  If the common distribution of the residual variables were absolutely continuous with respect to some measure on the real line, and $s$ defined via the likelihood, $I$ would be equal to $J$, as mentioned in Chapters 6 and 8.  Consider the normal case, for example.  Let
$$ s(y,x,\lambda) = -\log\sigma - \frac{1}{2\sigma^2}\bigl(y - g(x,\theta)\bigr)^2\ , \qquad \lambda = (\theta,\sigma) $$
and assume that the identification condition (8.2) of Chapter 8 is satisfied.
Here
$$ \frac{\partial s}{\partial\lambda}(y,x,\theta,\sigma) = \begin{pmatrix} \dfrac{1}{\sigma^2}\bigl(y - g(x,\theta)\bigr)\dfrac{\partial g}{\partial\theta}(x,\theta) \\[2ex] -\dfrac{1}{\sigma} + \dfrac{1}{\sigma^3}\bigl(y - g(x,\theta)\bigr)^2 \end{pmatrix} $$
and
$$ \frac{\partial^2 s}{\partial\lambda'\partial\lambda} = \begin{pmatrix} -\dfrac{1}{\sigma^2}\dfrac{\partial g}{\partial\theta}\dfrac{\partial g}{\partial\theta'} + \dfrac{1}{\sigma^2}(y-g)\dfrac{\partial^2 g}{\partial\theta'\partial\theta} & -\dfrac{2}{\sigma^3}(y-g)\dfrac{\partial g}{\partial\theta} \\[2ex] -\dfrac{2}{\sigma^3}(y-g)\dfrac{\partial g}{\partial\theta'} & \dfrac{1}{\sigma^2} - \dfrac{3}{\sigma^4}(y-g)^2 \end{pmatrix} $$
where arguments were deleted in the latter expression for purposes of simplicity.  Then
$$ J = \begin{pmatrix} (\sigma^*)^{-2}\displaystyle\int_X \frac{\partial g}{\partial\theta}(x,\theta^0)\,\frac{\partial g}{\partial\theta'}(x,\theta^0)\,P_X(dx) & 0 \\ 0 & 2(\sigma^*)^{-2} \end{pmatrix} $$
and
$$ \sqrt{n}\begin{pmatrix} \hat\theta_n - \theta^0 \\ \hat\sigma_n - \sigma^* \end{pmatrix} \xrightarrow{\ L\ } N(0,\,J^{-1}) $$
where $\hat\lambda_n = (\hat\theta_n,\hat\sigma_n)$ maximizes $S_n(\lambda)$.  The estimators $\hat\theta_n$ and $\hat\sigma_n$ are asymptotically independent since $J^{-1}$ is block diagonal.  These results do not hold, for the same function $s$, if the distribution generating the random components is not normal.

Specification of parameters for power purposes should be made, in both examples, as in Chapter 8.  Notice that for the first example only hypotheses involving $\theta$ are permitted.
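A small numerical sketch of the first example may help fix ideas.  The response function $g(x,\theta) = e^{\theta x}$, the sample size, and the noise level below are hypothetical choices, and the Gauss-Newton loop stands in for whatever maximizer of $S_n(\theta)$ one prefers.

```python
import numpy as np

def g(x, th):    return np.exp(th * x)          # hypothetical response function
def g_th(x, th): return x * np.exp(th * x)      # dg/dtheta

def fit_nls(y, x, th0=0.0, iters=50):
    """Gauss-Newton maximization of S_n(theta) = -(1/2n) sum (y_t - g(x_t,theta))^2."""
    th = th0
    for _ in range(iters):
        r = y - g(x, th)                        # residuals
        d = g_th(x, th)                         # score direction
        th += (d @ r) / (d @ d)                 # Gauss-Newton step
    return th

def sandwich_avar(y, x, th):
    """Estimate of J^{-1} I J^{-1}: J-hat = mean(g_th^2), I-hat = mean(r^2 g_th^2)."""
    r = y - g(x, th)
    d = g_th(x, th)
    return np.mean(r * r * d * d) / np.mean(d * d) ** 2

rng = np.random.default_rng(0)
n = 200
x = np.linspace(0.0, 1.0, n)
y = g(x, 0.5) + 0.1 * rng.standard_normal(n)    # theta^0 = 0.5, sigma* = 0.1
th_hat = fit_nls(y, x)
avar = sandwich_avar(y, x, th_hat)              # estimates (sigma*)^2 / E(g_th^2)
```

Here `avar / n` approximates the variance of $\hat\theta_n$, mirroring the formula $J^{-1}IJ^{-1} = (\sigma^*)^2\bigl(\int g_\theta g_\theta'\,P_X(dx)\bigr)^{-1}$ derived above.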
10.  THE SEEMINGLY UNRELATED NONLINEAR MODEL

The seemingly unrelated nonlinear model is defined by a set of $m$ univariate nonlinear regressions
$$ y_{\alpha t} = g_{\alpha}(x_{\alpha t},\theta_{\alpha}^0) + u_{\alpha t}\ , \quad t = 1,2,\ldots,n;\ \alpha = 1,2,\ldots,m\ . $$
We may write
$$ y_t = g(x_t,\theta^0) + u_t\ , \quad t = 1,2,\ldots,n \tag{10.1} $$
where
$$ y_t = (y_{1t},\ldots,y_{mt})'\ , \quad x_t = (x_{1t},\ldots,x_{mt})'\ , \quad u_t = (u_{1t},\ldots,u_{mt})'\ , \quad \theta^0 = (\theta_1^{0\,\prime},\ldots,\theta_m^{0\,\prime})'\ . $$
The parameter $\theta^0$ is an interior point of $\Theta$, which is a compact subset of $\mathbb{R}^{\ell}$.  The function $g$ is continuous and, for each $x$ fixed, twice continuously differentiable in $\theta$.  The residual sequence $(u_t)_{t\ge 1}$ is a realization of an independently and identically distributed sequence of random variables each with mean zero and variance covariance matrix $\Sigma^*$ positive definite.
In the context of Definition 8.1, let $\sigma^*$ and $\tau^*$ be such that
$$ \Sigma(\sigma^*) = \Sigma^* \quad\text{and}\quad \Sigma(\tau^*) = (\Sigma^*)^{-1}\ . $$
Let
$$ G(y,x,\gamma^0) = \bigl(\Sigma(\sigma^*)\bigr)^{-1/2}\bigl(y - g(x,\theta^0)\bigr)\ , \qquad \gamma^0 = (\theta^0,\sigma^*)\ . $$
Then (10.1) is equivalent to
$$ G(y_t,x_t,\gamma^0) = e_t\ , \quad t = 1,2,\ldots,n $$
where $e_t = \bigl(\Sigma(\sigma^*)\bigr)^{-1/2}u_t$.  The reduced form is
$$ y = f(e,x,\gamma^0) = \bigl(\Sigma(\sigma^*)\bigr)^{1/2}e + g(x,\theta^0)\ . $$
Define
$$ s(y,x,\tau,\theta) = -\tfrac{1}{2}\bigl(y - g(x,\theta)\bigr)'\,\Sigma(\tau)\,\bigl(y - g(x,\theta)\bigr)\ . $$
For the seemingly unrelated nonlinear model, an initial strongly consistent estimator $\hat\tau_n$ is available (Gallant (1975c)).  It is obtained as follows.  For each of the $m$ regressions compute the residual vector $\hat u_{\alpha}$ of the corresponding nonlinear least squares fit.  Form the elements
$$ \hat\sigma_{\alpha\beta} = \frac{1}{n}\,\hat u_{\alpha}'\hat u_{\beta} $$
of $\hat\Sigma$ and let $\hat\tau_n$ be such that $\Sigma(\hat\tau_n) = (\hat\Sigma)^{-1}$.  This estimator also satisfies Assumption 4, i.e., $\sqrt{n}\,(\hat\tau_n - \tau^*)$ is bounded in probability.  Thus $\hat\tau_n$ induces
$$ S_n(\theta) = \frac{1}{n}\sum_{t=1}^{n} s\bigl(y_t,x_t,\hat\tau_n,\theta\bigr)\ . $$
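The first-stage construction of $\hat\Sigma$ and $\hat\tau_n$ can be sketched numerically.  The two equations below are hypothetical stand-ins, taken linear in $\theta$ so that each per-equation least squares fit is closed form; nothing in the sketch is specific to the thesis's examples.

```python
import numpy as np

def sigma_hat(Y, fitted):
    """Elements sigma_ab = (1/n) u_a' u_b formed from first-stage residuals."""
    U = Y - fitted                          # n x m matrix of stacked residuals
    return U.T @ U / U.shape[0]

rng = np.random.default_rng(1)
n = 300
x = rng.standard_normal(n)
A = np.array([[1.0, 0.5], [0.0, 1.0]])      # error covariance will be A'A
E = rng.standard_normal((n, 2)) @ A         # contemporaneously correlated errors
Y = np.column_stack([1.0 + 2.0 * x, -1.0 + 0.5 * x]) + E

# First stage: equation-by-equation least squares fits.
X = np.column_stack([np.ones(n), x])
B = np.linalg.lstsq(X, Y, rcond=None)[0]
Sigma = sigma_hat(Y, X @ B)
Tau = np.linalg.inv(Sigma)                  # Sigma(tau-hat) = Sigma-hat^{-1}
# The second stage would maximize
# S_n(theta) = -(1/2n) sum (y_t - g(x_t,theta))' Tau (y_t - g(x_t,theta)).
```

The off-diagonal element of `Sigma` estimates the cross-equation error covariance, which is what the weighting matrix $\Sigma(\hat\tau_n)$ exploits.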
Let Assumptions 3 and 5 hold.  We have
$$ \int_E\!\!\int_X s\bigl(f(e,x,\gamma^0),x,\tau^*,\theta\bigr)\,\mu(de,dx) = -\frac{m}{2} - \frac{1}{2}\int_X \bigl(g(x,\theta)-g(x,\theta^0)\bigr)'\,(\Sigma^*)^{-1}\,\bigl(g(x,\theta)-g(x,\theta^0)\bigr)\,P_X(dx)\ . $$
Assumption 6 will be satisfied if
$$ \int_X \bigl(g(x,\theta)-g(x,\theta^0)\bigr)'\,(\Sigma^*)^{-1}\,\bigl(g(x,\theta)-g(x,\theta^0)\bigr)\,P_X(dx) $$
has a unique minimum at $\theta = \theta^0$.  Assume that this is the case.

The following derivatives are useful to obtain the matrices $J$ and $I$ as well as to verify the two integral conditions of Assumption 8.
Clearly,
$$ \frac{\partial s}{\partial\theta}(y,x,\tau,\theta) = \Bigl[\frac{\partial g}{\partial\theta'}(x,\theta)\Bigr]'\,\Sigma(\tau)\,\bigl(y - g(x,\theta)\bigr)\ . $$
By definition,
$$ \frac{\partial^2 s}{\partial\theta'\partial\theta}(y,x,\tau,\theta) = -\Bigl[\frac{\partial g}{\partial\theta'}(x,\theta)\Bigr]'\,\Sigma(\tau)\,\frac{\partial g}{\partial\theta'}(x,\theta) + \sum_{j=1}^{m}\Bigl[\Sigma(\tau)\bigl(y-g(x,\theta)\bigr)\Bigr]_j\,\frac{\partial^2 g_j}{\partial\theta'\partial\theta}(x,\theta)\ . $$
Also, the typical element in the $i$-th row of $\dfrac{\partial^2 s}{\partial\tau'\partial\theta}$ is the sum of two elements of the form
$$ g_{ik}(x,\theta)\,\bigl(y - g(x,\theta)\bigr)_j $$
where $g_{ik}(x,\theta)$ denotes $\dfrac{\partial g_k}{\partial\theta_i}(x,\theta)$ and $\bigl(y - g(x,\theta)\bigr)_j$ is the $j$-th component of $y - g(x,\theta)$.  It follows that $\dfrac{\partial s}{\partial\theta}$ and $\dfrac{\partial^2 s}{\partial\tau'\partial\theta}$ integrate to zero with respect to $P_E$ and $\mu$, respectively, and also
$$ J = I = \int_X \Bigl[\frac{\partial g}{\partial\theta'}(x,\theta^0)\Bigr]'\,(\Sigma^*)^{-1}\,\frac{\partial g}{\partial\theta'}(x,\theta^0)\,P_X(dx)\ . $$
If $\hat\theta_n$ maximizes $S_n(\theta)$ and the integrability conditions of Chapters 5 and 6 are satisfied, then $\hat\theta_n$ is strongly consistent for $\theta^0$ and
$$ \sqrt{n}\,(\hat\theta_n - \theta^0) \xrightarrow{\ L\ } N(0,\,J^{-1})\ . $$
Either $\hat J_n^{-1}$ or $\hat I_n^{-1}$ can be used to estimate the asymptotic variance covariance matrix.

Consider the problem of testing $H_0 : h(\theta^0) = 0$ against $H_A : h(\theta^0) \ne 0$, where $h$ is continuously differentiable with Jacobian of full row rank at $\theta^0$.  Since $J = I$, any of the test statistics presented in Chapter 7 can be used.  For power purposes a specification of $\gamma_n^0$ implies a specification of $\tau_n^0$ as well as of $\lambda_n^0$ ($= \theta_n^0$ in this case), so that Theorem 7.4.1 may be used.  Notice that here, to specify the second component of $\gamma_n^0$, we may well use the elements $\hat\sigma_{\alpha\beta} = \frac{1}{n}\hat u_{\alpha}'\hat u_{\beta}$ of $\hat\Sigma$, so that the specification of $\tau_n^0$ follows from that of $\gamma_n^0$.
11.  LIST OF REFERENCES

Amemiya, T. (1974), The nonlinear two-stage least squares estimator.  Journal of Econometrics 2, 105-110.

Amemiya, T. (1977), The maximum likelihood estimator and the nonlinear three stage least squares estimator in the general nonlinear simultaneous equation model.  Econometrica 45, 955-968.

Ash, R. B. (1972), Real Analysis and Probability.  Academic Press, Inc., New York.

Ballet Lawrence, S. (1975), Estimation of the parameters in an implicit model by minimizing the sum of absolute values of order p.  Unpublished Ph.D. dissertation, North Carolina State University, Raleigh, N.C.

Barnett, W. A. (1976), Maximum likelihood and iterated Aitken estimation of nonlinear systems of equations.  Journal of the American Statistical Association 71, 354-360.

Blattberg, C., McGuire, T., Weber, W. (1971), A Monte Carlo investigation of the power of alternative techniques for testing hypotheses in nonlinear models.  Unpublished manuscript.

Chung, K. L. (1974), A Course in Probability Theory.  Academic Press, Inc., New York.

Gallant, A. R. (1973), Inference for nonlinear models.  Institute of Statistics Mimeograph Series No. 875, North Carolina State University, Raleigh, N.C.

Gallant, A. R. (1975a), The power of the likelihood ratio test of location in nonlinear regression models.  Journal of the American Statistical Association 70, 199-203.

Gallant, A. R. (1975b), Testing a subset of the parameters of a nonlinear regression model.  Journal of the American Statistical Association 70, 927-932.

Gallant, A. R. (1975c), Seemingly unrelated nonlinear regressions.  Journal of Econometrics 3, 35-50.

Gallant, A. R. (1977a), Three stage least squares estimation for a system of simultaneous, nonlinear, implicit equations.  Journal of Econometrics 5, 71-88.

Gallant, A. R. (1977b), Testing a nonlinear regression specification: a nonregular case.  Journal of the American Statistical Association 72, 523-530.

Gallant, A. R., Goebel, J. (1976), Nonlinear regression with autocorrelated errors.  Journal of the American Statistical Association 71, 961-967.

Gallant, A. R., Holly, A. (1978), Statistical inference in an implicit, nonlinear, simultaneous equation model in the context of maximum likelihood estimation.  Cahiers du Laboratoire d'Econometrie, Ecole Polytechnique, Paris.

Hannan, E. J. (1971), Non-linear time series regression.  Journal of Applied Probability 8, 767-780.

Holly, A. (1978), Tests of nonlinear statistical hypotheses in multiple equation nonlinear models.  Cahiers du Laboratoire d'Econometrie, Ecole Polytechnique, Paris.

Huber, P. J. (1967), The behavior of maximum likelihood estimates under nonstandard conditions.  Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability 1, 73-101.

Jenrich, R. I. (1969), Asymptotic properties of non-linear least squares estimators.  The Annals of Mathematical Statistics 40, 633-643.

Jorgenson, D. W., Laffont, J. (1974), Efficient estimation of nonlinear simultaneous equations with additive disturbances.  Annals of Economic and Social Measurement 3, 615-640.

Malinvaud, E. (1970), The consistency of nonlinear regressions.  The Annals of Mathematical Statistics 41, 956-969.

Robinson, P. M. (1972), Non-linear regression for multiple time series.  Journal of Applied Probability 9, 758-768.

Searle, S. R. (1971), Linear Models.  John Wiley and Sons, Inc., New York.

Stute, W. (1976), On a generalization of the Glivenko-Cantelli theorem.  Zeitschrift fur Wahrscheinlichkeitstheorie und Verwandte Gebiete 35, 167-175.

Theil, H. (1971), Principles of Econometrics.  John Wiley and Sons, Inc., New York.

Wesler, O. (1979), Personal communication.