Rodriguez, German; (1975)A multivariate linear model with laten factor structure."

A MJJLTIVARIATE LINEAR MODEL WITH
LATENT FACTOR STRUCTURE
by
German Rodriguez
Institute of Statistics Mimeo Series #1014
University of North Carolina at Chapel Hill
Chapel Hill, North Carolina
June 1975
RODRIGUEZ,
GE~~.
Factor Structure.
A Multivariate Linear Model with Latent
(Under the direction of NORMAN LLOYD JOHNSON and
HENRY BRADLEY WELLS.)
In the social and behavioral sciences, the variables of interest
are frequently unobservable constructs, factors, or latent variables.
The statistical analysis, on the other hand, must be based on observable indicators, responses, or manifest variables.
In this dissertat-
ion we propose a "general statistical model for the analysis of this
type of data, designed to permit estimation of parameters and tests of
hypotheses pertaining to unobservable constructs on the basis of
observable indicators.
The model results from combining a multi-
variate linear model for the latent variables with a factor mlalysis
model relating these to the manifest variables, and is termed the
latent lineax' model.
The model is stated in general form as a linear model where the
regression and dispersion parameters are structured, and is shown to
be analogous in its method of construction to growth curve and covariance structure models.
The problem of identifying the structural
parameters is discussed, and solutions are given along the same lines
as in factor analysis.
Some special cases of the model are considered,
and used to illustrate its range of application and some relationships
with other models, such as factor analysis in several popUlations, path
analysis, and variance components.
The likelihood equations for estimating the parameters are derived,
using some recent results on matrix differentiation.
An
iterative
procedure based on the Fletcher-Powell method of function minimiza-
tion is proposed for their solution.
The numerical procedure has
been found to possess excellent convergence properties.
A detailed
treatment of large sample theory is given, including proofs of the
consistency and asymptotic normality of the estimators, for a family
of structural linear models which includes the latent linear model.
In this process the concept of the limiting Fisher information matrix
is introduced, and expressions for its elements are derived.
These
arc used to obtain formulae for the asymptotic variances and covariances of the estimators of the structural parameters in the latent
linear model.
The likelihood ratio
and Wald techniques are used to
construct large sample tests of the goodness of fit of the model,
and of a variety of hypotheses about the structural parameters; and
the asymptotic distributions of the test statistics are derived.
A nump-rical example using simulated data is given, illustrating
the proposed estimation and testing procedures.
The results of a
simulation study conducted to evaluate several large sample approximations are reported.
These results indicate that the asymptotic results
developed provide reasonable approximations for moderate size sanples.
Finally two extensions of the model are suggested as topics for
further research.
ACKNOWLEDGEMENTS
It is a pleasure to express my appreciation to my co-advisors
Dr. N.L. Johnson, for his patience, advice and encouragement during
the course of this research; and Dr. H.B. Wells, for his continued
guidance and support throughout my graduate studies.
I am also indebted to the other members of my advisory committee,
Drs. J.E. Grizzle, D. Quade and P. Uhlenberg - as well as Dr. P.K. Sen,
who participated in the committee before taking a sabbatical leave for their assistance and interest in my work.
Special thanks go to
Dr. D. Quade for his many detailed comments and suggestions.
I would also like to thank Drs. W. Hoeffding and D.J. de Waa1,
for useful discussions on maximum likelihood estimation and on matrix
derivatives.
During my stay at Chapel Hill I received continued financial support from the Population Council and the Ford Foundation, and I would
like to express my debt of gratitude to these institutions.
I am also
grateful to the Carolina Population Center for providing a tuition
grant in 1972.
I must warmly thank my family, and in particular my wife Pat, for the
enduring love and support that made possible my graduate studies.
Finally, I wish to thank Mrs. J. Maxwell, for typing the manuscript with great patience, speed and accuracy.
ii
CONTENTS
Acknowledgements----------------------------------------~--
1.
2.
3.
ii
INTRODUCTION AND LITERATURE REVIEW
1.1
Introduction--------------------------------------- 1
1.2
The Factor Analysis Model-------------------------- 2
1.3
Exploratory Factor Analysis------------------------ll
1.4
Confirmatory Factor Analysis-----------------------20
1.5
Related Models-------------------------------------24
THE MULTIVARIATE LATENT LINEAR MODEL
2.1
Introduction---------------------------------------28
2.2
Statement of the Model------------------------ ---·--28
2.3
The Identification Problem-------------------------32
2.4
Applications of the Model--------------------------37
MAXIMUM LIKELIHOOD ESTIMATION
3.1
Introduction---------------------------------------43
3.2 The Likelihood Equations---------------------------43
4.
3.3
Estimation of B-----------------------------------48
3.4
Estimation of g-----------------------------------sO
3.5
Estimation of
3.6
The Iterative Procedure----------------------------54
A
t""oJ'
~
('"oJ
and
~
("OJ
-------------------------51
LARGE SAMPLE PROPERTIES OF THE ESTIMATORS
4.1
Introduction---------------------------------------s7
4.2
Asymptotic Results for Linear Models---------------s8
4.3
Consistency of
e
""'I1
in Structural Linear Models-----66
iii
iv
4.4
The Information Matrix for Structural Linear Models----- 69
4.5
Asymptotic Normality of
4.6
Large Sample Theory for the Latent Linear
e in Structural Linear
"'I1
Models~-
77
Model~--------
83
,...
4.7
5.
Approximate Second Derivatives of
F-------------------- 91
HYPOTHESIS TESTING
5.1
Introduction-------------------------------------------- 95
5.2 Testing Goodness of Fit--------------------------------- 95
II
·
5 . 3 Test~ng
Hypot h eses about I,...,,'
I
~ and
~
5.4 Testing Linear Hypotheses about ~
6.
7.
.------------------ 102
104
A NUMERICAL EXAMPLE
6.1
6.2
6.3
Introduction------------------------------------------- III
Simulation of Data------------------------------------- III
Maximum Likelihood Estimation------------------·_------- 113
6.4
Hypothesis Testing------------------------------------- 119
A SIMULATION STUDY
7.1
7.2
7.3
8.
~
,..."
Introduction--------------------------- r - - - - - - - - - - - - - - - 123
Simulation of Data----.-------------------------------- 123
Empirical Distributions------------------·-------------- 126
SUGGESTIONS FOR FURTHER RESEARCH
8.1 Introduction------------------------------------------- 133
8.2 The Latent Growth Curve Model-------------------------- 133
8.3 The Latent Covariance Structure Model--.--------------~ 134
APPENDIX: ON MATRIX DERIVATIVES
A.1
A.2
A.3
A.4
Introduction------------------------------------------Definition of Matrix Derivatives----------------------Rules for Matrix Differentiation----------------------Maximum Likelihood Estimation in the Linear Mode1-.----
138
139
141
148
BIBLIOGRAPHY-------------------------------------------------.-. 151
I.
1.1
INTRODUCTION AND LITERATIJRE REVIEW
Introduction
In the social and behavioral sciences the variables of interest
are frequently unobservable constructs, factors (Thurstone, 1947)
or latent vaY'iables (Lazarsfeld, 1950), such as attitudes, intelligence
or socio-economic status.
The statistical analysis, on the other hand,
must be based on observable indicators, responses or manifest variables,
such as verbal expressions of attitudes, performance on an intelligence
test, or education and income.
Furthermore, since measurement in the
social sciences is usually inexact, a multiple indicator approach must
frequently be employed in data collection, using several measurements
of a few underlying factors of interest.
In this work we propose and study a general statistical model for
the analysis of this type of data, designed to permit estimation of
parameters and tests of hypotheses pertaining to unobservable constructs
or latent variables, on the basis of observable indicators.
The model
results from combining a multivariate linear model for the latent
variables with a factor analysis model relating these to the manifest
variables.
In this context the relationships of interest are represent-
ed in the underlying linear model and factor analysis is used to model
the measurement process.
In Chapter 2 we state the general model and discuss its relationship
with other models proposed in the literature.
In Chapter 3 we derive
4It
2
maximum likelihood equations for estimating the parameters and describe
an iterative procedure for their solution.
In Chapter 4 we study large
sample properties of the estimators and obtain formulae for their asymptotic standard errors.
In Chapter 5 we discuss hypothesis.testing using
likelihood ratio and Wald techniques.
examples to illustrate the procedures.
In Chapter 6 we provide numerical
In Chapter 7 we describe a
simulation study conducted to evaluate some large sample approximations
and to assess the power of the proposed test procedures.
Finally, in
Chapter 8 we discuss some extensions and suggestions for further work.
In the derivation of our results we make extensive use of matrix
derivatives.
very
Since some of the techniques used have been developed
recently, we provide a brief review of matrix differentiation in
the appendix.
The proposed model is an extension of the factor analysis model.
In order to develop a proper background for its study, a review of
factor analysis is provided in the rest of this introduction.
The main
results and developments are discussed, but proofs are omitted whenever
they can be found in the literature or when more general results are
proved later in this work.
1.2 The Factor Analysis MOdel
1.2.1
Historical Remarks
Factor analysis originated with Spearman (1904), who proposed
that correlations among several tests of intelligence could be accounted
for by a common factor underlying all tests (intelligence) plus factors
specific to each test (errors).
The theory
was extended to multiple
3
factors by Thurstone (1931, 1947), who gave considerable impetus to
the development of the field.
Factor analysis has been developed
mostly by psychologists and has always been a controversial subject,
see for example the discussion in Kendall and Babington Smith (1950).
Harman (1967) provides an authoritative account of the different
schools of thought and methods of analysis that have evolved.
The statistical approach to factor analysis was initiated by
Lawley (1940, 1942, 1943), who derived maximum likelihood equations for
estimating the parameters of factor models.
For several years the dev-
elopment of this approach was hindered by a lack of satisfactory numerical methods for obtaining the estimates.
Howe (1955) and Bargmann (1957)
proposed a Gauss-Seidel iterative method, and Rao (1955) proposed an estimation procedure based on canonical correlation analysis, which can be
shown to be equivalent to maximum likelihood estimation; but experience
with these methods was not satisfactory, see for example Maxwell (1961)
and Browne (1965).
More recently, however, Joreskog (1967, 1969) dev-
. eloped a procedure based on the Fletcher and Powell (1963) method of
function minimization which proved superior to previous developments
and made efficient and accurate numerical solution of the likelihood
equations possible, see also Lawley (1967) and Joreskog and Lawley
(1968).
Other statisticians who have contributed to the development of
factor analysis are Bartlett (1937, 1938, 1950, 1953) and Anderson and
Rubin (1956).
An excellent discussion cf the maximum likelihood approach
appears in Lawley and Maxwell (1971).
4
1.2.2
The Factor Model
The general factor analysis model is
(1.2.1)
\
where
ll:
x: pXl
~
pXl
is a stochastic vector of manifest variables or responses,
is a vector of means,
is a matrix of parameters called
~:pxq
~
factor loadings of full column rank
q<p, l.,: gXl is a stochastic vector
of latent variables or factors with
E (~) =
(positive definite)
E(~) =
Q,
Var(~)
=
and
!
pX 1
~:
Var(x)
= ~
p.d.
~
is a vector of random errors with
= diag(~l""'~p)
=~
E(~)
The model implies that
Q and
p.d. and
Cov(l,~) =
Var(~)
and
=y ,
Q.
where the
variance-covariance matrix has the following structure:
y = ~'
~'
The diagonal elements of
diagonal elements of
!
+
!
(1.2.2)
are called corrununalities, and the
are called specificities.
These names refer to
those parts of the variance uf each response that can be attributed to
the common factors and to the specific errors, respectively.
The essence of the model is to write the manifest variables as
linear combinations of a smaller number of latent variables plus uncorrelated errors.
Since from (1.2.1)
Var(xl ~v )
f"'oJ
= ~ = diag(~"
.&
""'J
... ,~)
P ,
(1.2.3)
the factors may be said to explain the covariance structure of the
responses.
1.2.3
The Identification Problem
In (1.2.1) we assumed that
= II
+ .....,..,
lit,
and clearly
and
E(l)
= Q.
If
E(~)
=~
then
cannot be separately identified.
5
The assumption that
sense that if
=Q
ECx)
=~
ECr)
implies no loss of generality, in the
l* = l - 5 and H* = ~ +
we can redefine
thus reducing the general case to (1.2.1).
~
,
Since interest centers on
the covariance structure, most authors write the model with both
E(r)
=Q
example
and
=Q '
E(~)
effectively working with
~*
= ~-B
' see for
Joreskog (1967) or Lawley and Maxwell (1971, eh. 2).
adopt this practice hereafter.
We shall
The preceding remarks, however, will be
important in the context of the general model proposed in §2.2.
Turning our attention to the covariance structure, note that the
model is not identified unless restrictions are imposed on
Y satisfies
for if
L
~
is replaced by
~
!*
= 1!b'.
~
¢,
qxq non-singular matrix,
is any
y would still satisfy (1.2.2) if
then
and
(1.2.2) and
A and
is replaced by
/\*
~
= ....,..,
AL- l
This indeterminacy of the model
corresponds to a non-singular linear transformation of the factors
'1.*
= ,br.
Since
has
1::
q
2
elements, at least
constraints need be
imposed upon the parameters.
There are two approaches or types of solutions to this problem,
known as exploratory or unrestricted factor analysis and confirmatory
or restricted factor analysis.
In expZoratory factor anaZysis
by
constraints are imposed
defining the factors to be uncorrelated and have unit variances.
In this case
I
~q
and (1.2.2) becomes
y
The assumption
Var(r)
=!
=~
+!
Var(r)
I
= "'<l
we can redefine
any square root of
~
~q(q+1)
¢
~
=
M'
+
(1.2.2) ,
implies no loss of generali ty , for if
= !-\
r*
!
1
and
A*
~
= M,Yz
(sYmmetric or lower triangular).
can be written as
~
= ~*r*
+!
and
where
Y-
¢2
is
Then
y = ~'
+! can be
6
y = ~*~*'
written as
to (1.2.2)'.
If
q
>
+~,
thus reducing the general case (1.2.2)
Hence the name unrestricted factor analysis.
1
the model is still not identified, however, for if y
satisfies (1.2.2)' and
M is
any orthogcnal matrix of order
y also satisfies (1.2.2)' with
~
replaced by
~* = ~'.
This in-
determinacy corresponds to a (rigid) rotation of the factors
Since
M has
~q(q-l)
free elements, an additional
q, then
~q(q-l)
l* = ~.
constraints
This is the problem of selecting a basis of the common
are needed.
factor space.
A set of restrictions that turns out to be quite convenient and
interesting is to require
~,!-l~ to be diagonal with its elements
arranged in decreasing order of magnitude.
This restriction will be
related to canonical correlation analysis in §1.2.4 and to principal
components in §1.2.5.
The resulting basis will be called the canonical
basis of the factor space, following Rao (1955).
In many behavioral science applications the canonical basis is used
in estimation but the resulting estimates are then rotated to obtain a
structure that is easier to interpret.
Several criteria for rotation
have been proposed, but these will not be reviewed here.
reader is referred to Kaiser (1958, 1959), Hendrickson and
The interested
Whi~e
(1964),
Jennrich and Sampson (1966), Crawford and Ferguson (1970), Browne (1974a)
and Lawley and Maxwell (1971, Ch. 6).
In confirmatory factor analysis
q
constraints are usually imposed
by defining the factors as having unit variances, and at least an additional
ents of
q(q-l) constraints are imposed by requiring
~
to be fixed, usually zero.
q(q-l) or more elem-
The pattern of fixed values must
be such that it would be destroyed by any variance-preserving non-singular
7
linear transformation of the factors other than the identity matrix.
A simple way to achieve identifiability is to set to 0 at least (q-l)
elements in each column of
and Rubin (1956).
~.
Other conditions are given by Anderson
This approach usually entails a restriction on the
common factor space and rests on a hypothesis regarding the structure
of
~.
1.2.4.
Hence the names restricted or confirmatory factor analysis.
A Canonical Reduction of the Factor Model
Consider the unrestricted factor model with
~
=
and note that
I
~q
(1.2.4)
From canonical correlation theory, see Hote11ing (1935, 1936), we
~* = 1'~
know that there exist linear transformations
l*
and
~ ~'x
such that
(1.2.5)
where
£:
pxq
=
[~J
and
The
p.
1.
~*,
onica1 correlations between factors and responses, and
corresponding canonical variates.
Furthennore
i-th largest characteristic root of
vectors of
y-l~1
= ch.
x*
(A'V
-1
are the
1\.)
1"'''',-..,J
is the
is a matrix of eigen-
standardized so that
of orthonormal eigenvectors of
2
p.1
are the can-
I
and
.- .- -- ~'
~'y.'~
~
M
is a matrix
~
~'y-l~.
The canonical variates have the interesting property that
~*
= ,D:*
+ 1:.*
(1.2.6)
8
where
1:.*
say, and
= 1'l:,
with
Cov(Z*,~*)
for
Q.
= ............
L''¥L
= diag(1- p 21 , ... ,1- pq2,1, ... ,1)
Thus (1.2.6) is a factor model.
r this model has the property that
the structure of
only on factor
=
VaTC.~.*)
y~
1
for i=l, ... ,q
i = q+l, ... ,p.
X~
1
-
In view of
is loaded
and is independent of the
factors
,
Also the loading of
x~
1
on
y'!<
1
canonical correlation between factors and responses.
y~
1
is the i-th largest
Thus we have re-
duced the general model (1.2.1) to a particularly simple structure
(1.2.6).
This is called the canonical reduction of the factor model,
see Rodriguez (1975).
Rao (1955) has proposed defining the factors
l
as the canonical
variates of the factor space with respect to the response space.
terms of our analysis this implies that
In
~'y-l~ must be diagonal and
have its elements arranged in decreasing order of magnitude, for then
is ~,,{l~ itse.1f and
M = ....q
I .
....
~(!+Q)-l where Q = ~,!-l~
for
It can be
sho~~
that
(see §3.4), and hence a sufficient condition
t:.'y-1~ to be diagonal and have its elements ordered is that ~,!-l~
be diagonal and have its elements ordered.
This shows that the canonical
restrictions do indeed lead to Rao's canonical basis of the factor space.
Let
x* ~ (x*l""'x*)'
be the first
....1
.
q
response space.
If
~
q\ canonical variates of the
is diagonal it can be shown that
(1.2.7)
This result is related to methods for estimating factor scores proposed
by Bartlett (1937, 1938) and Thompson (1951).
For further details and
a derivation of these results see Rodriguez (1975).
9
1.2.5
Factor Analysis and
Prin~ipal Comp~nent~
The unrestricted factor model is similar to. but should be distinguished from, principal components. introduced by Hotelling (1933).
We now compare
Let
and let
~'~
thes~
E(~) =
~:
pxp
models.
Q, Var(x)
=
y, Q ;
y
1
1
y.
be the matrix of orthonormal eigenvectors of
is the vector of principal components of
since
d. = ch. (V)
diag(d1, ...• d ) where
p
= ~' =
II'
~.
Let
.[ =
~
~
Then
k
Then
2.
we can wrIte
(1.2.8)
E(~)
where
=Q
and
Var(v)
,(.,
=I
considered a factor mode 1 with
A'x
_
_
D~
=
f"tJ
~
.
thus the p
,
.
~-p
p
Thus principal components may be
factors and no errors.
From 0.2.8)
factors are the principal components of
standardized to unit variance.
So far we assumed that
Q = diag(d l , ... ,dq ) , let
take
of
y is of full rank.
k
r = ~2: pxq. Then
and let
V
~:
combination of
q
If
be the first
(1.2.8) holds with
orthogonal factors.
standardized principal components of
The
q
.[
k
= ~2
~
=
q
<
P we
eigenvectors
q
x
~
being a linear
factors are the first
q
x.
are not zero but small, the the first
components approximate
the choice
pxq
y
If rank
rather well.
q
principal
Okamoto (1969) has shown that
minimizes the eigenvalues of
k = E(~-rx)(~-£l)'
and thus its trace and norm, which are reasonable measures of information
loss.
The matrix
k
may be interpreted as the error variance in the
model
(1.2.9)
10
E(~)
where
tr
= Q1
= ~1
Var(~)
The covariance structure of
~.
given by the first
fundamental
q
= Q and r
COV(r,~)
~
is then
is chosen to minimize
Y=rr'
+ ~
standardized principal components of
is that
We have thus shown that principal components
may be interpreted in terms of factor models with
errors, or with
is
The
~-~.
difference between this model and factor analysis
E need not be diagonal.
r
and
p
factors and no
factors and correlated errors,
q
Let us now see in what sense factor analysis may be interpreted in
terms of principal components.
model (1.2.1) - (1.2.2)'
Since
Var (~:)
= -I
Let
x
satisfy the unrestricted factor
and consider the random vector
~-~
= ~.
'
(1.2.10)
a matrix of rank
rank of
q.
Thus a basic feature of the factor model is that the
y can be reduced by subtracting a matrix !
of specific var-
iances.
We now show that under the canonical restrictions that
Q = ~,!-1~
be diagonal and have its elements ordered the factors are simply the first
1
q
standardized principal components of !-~(~-~).
1
The scaling by
has the effect of making the specific variances unity.
Now
Var['¥.... -~ (x-z)]
.... ....
by (1. 2. 10) .
ch. (ll) = c5.
1 -
1
If
Q
(1.2.11)
is diagonal and ordered then
(i=l, ... ,q), and it may be verified that the first
eigenvectors of (1.2.11) are given by
!-~
~ = '¥-~All-~
The first
1
standardized principal components of !-~(~-~)
are then
q
q
11
the canonicaZ factors.
For another discussion of this relationship see
Lawley and Maxwell (1971, pp. 7-9).
t
!-~
The scaling by
may be omitted and a principal component anal-
ysis may be conducted on the matrix (1.2.10) instead of (1.2.11).
Since principal components are not scale-invariant, however, this leads
to a different basis of the factor space.
q
standardized
are called principaZ factors, see Rao
~-~
principal components of
The first
(1955) and Harman (1967, Ch. 8).
This completes our review of the factor analysis model.
We now con-
sider estimation of the parameters and hypothesis testing in the exploratory and confirmatory cases.
assumptions that
x: ' so that
For this purpose we introduce the additional
and
v - Nq (0,4»
__
,J..,
~
independently of
z - NP (0,...",..'1')
..
~
N (, I > V) •
p.t:-
1.3 Exploratory Factor Analysis
·1.3.1
Maximum Likelihood Estimation
Let
be a random sample from
-xl""'x
-n+ 1
fies (1.2.2)'.
We now consider estimation of
onical restrictions.
~
and note that
p-
logL
where
c
and!
=
V
satis-
under the can-
y
1 n+l
= -n L\"
(1.3.1)
(x -x)(x -x)'
<:\=1
nS - W (V,n).
-
~
Define the unbiased estimator of
~.
where
N (1.l,V)
p - -
-a -
-a-
The log-likelihood function is then
c -.!.n
2
10giVI
-
-.!.n
is a constant including terms on
2
§
tr v-IS
but not on
(1.3.2)
-V.
12
Maximizing log L is equivalent to minimizing
(1.3.3)
F with respect to
The derivatives of
~
and
'II
~
are
\
and
(1.3.4)
aF
.
-1
-1
-a'll = dlag V (V-S)V
~
~
~
(1.3.5)
~
~d
It can be shown that given!
~
likelihood estimator) of
the conditional m.1.e. (maximum
is
(1.3.6)
where
D
~P
.
= dlag(Pl,
... ,pq ),
p.1
= ch.1 ('II -~S'¥ -!z ), g =
~
~
(~l"" ,~)
and
~i is the orthonormal eigenvector of !-!z§!-~ corresponding to
p.
1
(i=l, ... ,q).
F may be written as
The minimized value of
r
j=q+l
(p.-log p.-l)
J
J
and the derivatives of this function with respect to
evaluating (1.3.5) at
aF
~
= ~,
",I
alj) i = 't'i
!'
(1.3.7)
obtained
can be written as
[r (P.-l)w~.+l
j =1
J
1J
_
S~lll'.
't'
]
•
(1.3.8)
A
The m.l.e.
!
~
F(!), and the m.1.e.
of!
~ of
is computed by numerical minimization of
~
is obtained by evaluating (1.3.6) at'll.
Iterative procedures for the numerical minimization are described in
~
13
§1.3.2 and §1.3.3 below.
Joreskog (1967)
OT
For a derivation of these results see
Lawley and Maxwell (1971. Ch. 4).
A nice property of the m.l.e.'s
~
invariant, see Morrison (1967, p. 268).
special case
~
~
= ~I''1>'
and
!
is that they are scale-
Also we remark that in the
closed-form expressions for the estimates exist,
see Lawley and Maxwell (1971, p. 47).
Joreskog and Goldberger (1972) have considered generalized least
squares estimation in factor analysis.
Computation of the estimates
requires again iterative procedures.
1.3.2
The Fletcher-Powell Method
To obtain the estimates in maximum likelihood factor analysis
Joreskog (1967) has proposed using a numerical method of function minimization due to Davidon (1959) and further developed by Fletcher and
Powell (1963).
Given a function
first derivatives
f
(~)
~(~),
depending on a parameter vector
~
wi th
the method uses a symmetric p.d. matrix
E
~
which is improved on each iteration and eventually converges to the
inverse of the matrix of second derivatives
minimum.
and
Let
f).(s), ._
~(s)
._
an d
E(s)
~
re f er
E at the start of the s-th iteration.
~(Q)
+~O
evaluated at the
th e va 1ues
0f
g (8)
~
The method uses a simple
_£(S)~(S) to determine a point with
linear search along the direction
positive gradient, and a cubic interpolation procedure to find
Then the matrix
8
~,
._
f).(S+l).
£(5) is improved using
= E(s)
I
+ 8(s)
~
(s) (5)'
~
1
+ y(s))!
(s) (s)'
)!
(1.3.9)
14
.!d(s)
where
= ~(S+l)
u(s)'w ls )
with
and
....
~
_
y
(5)' (s)
~
.
~
(s) _
-
~
(s+l)
-
~
(s)
,
For further details and proofs
of the convergence properties of the method see Fletcher and Powell (1963).
e = (t/JI, ... ,t/Jp )'
....
In our case
as t/J.(IL- s . . .
1
11
estim~te
may be taken
A better estimate recommended by Joreskog (1963) is
t/J~l) = (l_q/2p)/sii
1
initial matrix
and the initial
where
s
ii
is the (i,i)-th element of
~
-1
.
The
may be taken simply as the identity matrix, in which
E
case the first iteration is in the direction of steepest descent.
Lawley (1967) has given the following approximate second derivatives
of
F(~)
based on large sample considerations
a2p
B = [ at/J.ot/J.
1
where
s.·
1)
§.
=
g
2
= (Sij)
(1.3.10)
)
1
is the (i,j)-th element of
is p.d. for all (p-q)
for
J.
2
>
p+q
-1
and
Q
!-~(l-
The matrix
provides a geod initial value
In practice, to speed i.Ip convergence it is recommended to start
\
with two steepest descent iterations and then compute
on
f)
G
~
and switch to the Fletcher-Powell method with
g
(which depends
.£ (1)
= Q-l.
A difficulty that arises in applying this method is that elements
of
!
may become negative during iteration.
zation is done under the restriction that
To avoid this the minimi-
t/J. > £
1
(i=l, ... ,p)
is a small positive constant, usually .005 or .001.
E
t/J.
1
becomes
< E , it is changed to
If its derivative at
E
E
where
If a value of
before evaluating the function.
turns out to be positive,
tV.1
is fixed at
E
and minimization is continued with respect to the remaining elements of
Sets of estimates where
'" =
t/J.
1
E
for some
i
are termed improper
15
A
or Heywood cases.
In these situations the variates with
~.
1
=£
are partialled-out and a new model is fitted to the partial covariance
matrix for the remaining variables.
For further details see Joreskog
(1967) .
The Newton-Raphson
1.3.3
Metho~
Clarke (1970) has recently derived exact expressions for the second
derivatives of
F, which enabled him to use the Newton-Raphson method
of function minimization.
In the notation of §1.3.1,
his result is
h ..
1J
(1.3.11)
where
0..
1J
If
is the Kronecker delta.
g(s)
is the s-th approximation to the minimum point, then the
Newton-Raphson method takes
(1.3.12)
The exact matrix of second derivatives
ions.
Lawley's (1967) approximation
ation, (b) if
Q
U
is used in most iterat-
is used (a) for the first iter-
maxl~~s+l)_~~s)1
> .1 , or (c) if B[~(s)] is not p.d.
.
1
1
1
The latter condition is required because
minate neighborhood of the minimum.
improve convergence.
tl is p.d. only in an indeter-
The first two conditions serve to
Improper values of
!
are handled in the same
manner as described above.
The Newton-Raphson method usually requires fewer iterations than
the Fletcher-Powell method, but necessitates a somewhat greater amount
of computation on each iteration.
Curren·::: experience indicates that it
is more efficient unless
1.3.4
p
is fairly large.
Large Sample Standard Errors
Anderson and R~bin (1956) have proved that
are asymptotically normally distributed when
~
"
In(~~~j
p
-+
"
.and
In(!-!)
y and mC§'-Y) is
asymptotically normal, a condition that is satisfied when
x
,. .",
~
N (p, V) •
P
"'-J
""
Derivation of asymptotic variances and covariances, however, is very
involved.
Lawley (1953) has shown that if !
large
is known, for sufficiently
n
(1.3.13)
where
~
is a matrix with element
a. .
1r,J $
in row
[p(r~l)+i]
and
column [p(s-l)+j] and
9L (~. y. A. A. ) ] ,r=s
m:-::l
a.
.
1T,J 5
m 1m 1m JIB
Ir
=
(1.3.14)
with
~r
= Pr/(Pr-l),
Yrm
2
= [(pr -l)j(pr -p)]
m
1, and
as defined
in (1.3.6).
Several years later, Lawley (1967) showed that if
then for large
2
-1
Yare!)
~
-G
n "'"
"
YarrA)
."'"
~
+ 2,!!g-1.§) , and
leA
n
(1.3.16)
. ~n G- 1B
(1. 3.17)
" "
Cov(!,~)
(1.3.15)
~
G is as defined in (1.3.10),
"'"
is estimated
n
"
where
!
~
"'"
is a matrix with
b.
""'lr
in
17
column [p(r-l)+i], ~lr
bo = (b ,1r
l
0
b ..
),lr
,
•••
,b p, lr
0
= - A.)r (p r -1)" \jJ )~ 2 [ <5 1))
o1JJ -!zA
A
lr)r
0
0
0
0
/
) ' ,
and
(p -1) + P
r
I
A A
0
0
/
rm=l lm)m
ir
(p - p )].
r
m
(1.3.18)
\
For a derivation of these results see Lawley and Maxwell (1971, Ch. 5),
except for expression (1.3.18) which is in error in the original and has
been corrected by
Jennrich and Thayer (1973).
If the loadings are rotated further complications arise.
Lawley
as known and
and Maxwell suggest treating the transformation matrix
thus the rotated loadings as linear functions of the original loadings.
In practice, however,
~
is usually derived from the data.
Archer and
Jennrich (1973) and Jennrich (1973) have obtained results for rotated
loadings.
Recently, Jennrich (1974) has obtained simplified results for the
asymptotic variances and covariances by approaching the problem as one
in constrained maximum likelihood estimation.
m.l.e. of a parameter
~(~)
=
Q,
and let
l(~)
generally be singular).
~
Let
e :
""11
mxl
be the
which is assumed to satisfy constraints
be the Fisher information matrix (which will
Define the augmented information matrix
*
I (e)
,...,...
=
[l(~)
(e~;a~) ,
be the matrix in the upper left mxm block of
1* -1 (~).
Then under certain regularity conditions
For a proof of this result see Silvey (1971, p. 81).
We require that
18
!C.~)
and
a.8/Cl~
that
L*-l(~)
exist and be continuous in a neighborhood of
~
~n be consistent.
exist. and that
In exploratory factor analysis the elements of the information
matrix are given by
I ( \ j ' \R,)
-1
= (V -1 )ok(A'V
A).n
1
'"
JX"
t"'o.I
I(A .. ,I/Jk) = (V
1J
'
""'I
-1
/'OJ
)'k(V
1
,.....,
-1
-1
(V'" .....,A)'n(V
1;r" -
+
,....,
and
A)k"J
'"
-1
A)k"J
(1.3.21)
roJ
(1.3.22)
1('~i'~k
",) = !(V-l)~
2 ~
Ik
(1.3.23)
The constraint functions associated with the canonical restrictions are
(1. 3.24)
for
1
~
u < v
~
q, and have dcr-ivatives
ag
uv
~
OA. .
1J
= (8.:\.
JU IV
+ O. A. )/lJi.
JV IU
I
, and
(1.3.25)
(1.3.26)
These derivatives can be arranged into a vector by ordering the
subscripts in lexicographical fashion.
The augmented information matrix
is then
!(fj,!)
dg/afj
....
I ('i' , ....
'¥)
~
~
(1.3.27)
sym.
and
l-(~,!) is obtained inverting (1.3.27).
Consistent estimators of
the asymptotic variances and covariances are obtained substituting the
m.l.e.'s
" "
A,'¥
....11 "11
for
~
.-
and
~
.-
in (1.3.27).
The results have been
19
found to
agree with results obtained using Lawley's formulae (after
correction).
Although the method is computationally less efficient,
the formulae are pleasingly simple.
Furtherm;:>re, the method can easily
Ii
be applied to analytically rotated loadings by modifying the constraint
functions (1.3.24).
For further details see Jennrich (1974).
The results discussed so far apply to estimates derived from a
variance-covariance matrix
§.
The modifications required when m.l.e.'s
are obtained from a correlation matrix may be found in Lawley and Maxwell (1971, Ch. 5) and Jennrich (1974).
1.3.3
Hypothesis
Let
~
Testi~
denote the set of all symmetric p.d. matrices of order
and let
W
testing
HO: yEw
denote the subset for which (1.2.2)' holds.
vs.
HI:
YE~-W
~
The m.1. e. of y
script
q
We consider
, the goodness of fit of the model.
It is well known that the unrestricted m.1. e. of y
max log L
p
= - }(log!§\
+
A
under (1.2.2)'
is
is
§
and
(1.3.28)
p)
AA
"'q = M'
V
emphasizes the number of factors fitted.
+
!, where the subIt can be shown
that
max log L
=-
1
~(logIYql + p) ,
A
(1.3.29)
W
and hence the goodness of fit likelihood ratio test statistic is
(1.3.30)
which can be shown to be simply
'"
A
nF(!) , hence the choice of
F
in
(1.3.3).
The asymptotic distribution of
-2 log A.
is
Xv2 with degrees of
e
20
freedom
v
1
= zP(p-1)
1
- [pq+p - zq(q-1)]
the number of parameters in y
in
! '
A and
If
q
=0
1
2
= I[(p-q)
(1.3.31)
- (p+q)] ,
minus the number of free parameters
see Lawley and Maxwell (1971, Ch. 4).
" = diag
YO
then
§
and (1.3.30) reduces to the well-
known likelihood ratio test of independence, see Anderson (1958, Ch.
9).
2
Box (1949) has shown that the X approximation is
In this case
improved if n
is replaced by
corrections for
q
>
n - (2p+5)/6 in (1.3.30).
0 have not been established, but Bartlett (1951)
has suggested using Box's correction with
n-q
and
Similar
nand
p
replaced by
p-q.
From (1.3.29) it can be seen that the likelihood ratio statistic
for testing
q versus
q+l
factors is
(1.3.32)
. and
-2 logA ~ ~
v
=
with degrees of freedom
[p(q+l) + P - t(q+l)] - [pq + p -
~(q-l)] = p-q
(1.3.33)
the difference between the number of free parameters in (q+l)- and
q-factor models.
1.4 Confirmatory Factor Analysis
1.4.1
Maximum Likelihood Estimation
We now consider estimation of
thesis.
~,2
and! under a structural hypo-
Parameters will be of three types:
(1)
free or unconstrained
parameters, (2) parameters that are constrained to be equal to other
21
parameters in the model, and (3) fixed paOLameters that have been assignin a one-factor model, we
For ex.amp1e, if
ed given values.
may treat
Al
AI'
set-up is more general than the set-up of Joreskog (1969)
This
as fixed and
A2 , ... ,A
p
as constrained to be equal to
or Lawley and Maxwell (1971, eh. 7), who consider only free and fixed
parameters, and is in the spirit of more general models considered by
Joreskog (1970a).
As in §1.3.1
we will minimize the function
F(~,2,!) =
where
log\y!
+
tr y
-1
§ - logl§1 - p ,
(1.4.1)
satisfies (1.2.2) and a structural hypothesis.
V
~
The derivatives of
~,2 and!
with respect to
F
are
cF
(1.4.2)
'dA = 22M
~
of
- - = 2fjl~ - diag AIM,
a~
~
and
....".,.
(1.4.3)
'~s
aF
a!
n
- = diag
(1.4.4)
where
£
= y-l(y_§)y-l
(1.4.5)
2 and 1.
Let the elements of fj,
be arranged into a vector
x = (A~l,· .. ,Apl;"·;·\q, .. ·,Apq;<P1l;<Pl2,cj>22;
... ;cj>lq, ... ,cj>qq;~I"."~p)I
and let
ft: mxl contain the free parameters in X. Then
aF L .. aF
ae. = a 1.J
aYe
j
1
where
(1.4.6)
,
a ..
1J
=1
if
e ..- y.
1
J
and
J
a ..
1.J
=0
(1.4.7)
'
otherwise, and
aF/dY.
J
22
may be obtained from (1.4.2) - (1.4.4).
The function
F
~
is now considered a function of
and may be
minimized using the Fletcher-Powell method described in §1.3.2.
initial matrix
li may be taken simply as the identity matrix.
The
A
better choice is provided by half the inverse of the information matrix,
as explained in §1.4.2.
Since initial estimates may be far from the
minimum point and the information matrix depends on
i t is con-
~,
venient to start with several steepest descent iterations and then
switch to the Fletcher-Powell method.
For further details and a deri-
vation of these results see Joreskog (1969) or Lawley and Maxwell
(1971, Ch. 7).
1.4.2
Large Sample Standard Errors
By standard asymptotic theory, see Kendall and Stuart (1961,
rn(~
pp. 54-55), it can be shown that
l(~)
-g)
is the Fisher information matrix.
~, ~
matrix with respect to
I CA ij , AkR,)
I
CA ij , <PkR,)
=
and
%
~ Nm [Q,,!-l (~J]
The elements of the information
are given by
(V-I). (¢AlV- 1A¢).
....
~~
1k"""""""
where
•
J R, + (V
I"<tJ
-1
A¢)·oC"V
11..1
,..,....."
1
-1
·,1
= 2(2-o
R,)[()£
~)ik(~l)£
M)R,j
k
#"IJ
+
-1
A~)k'J
(1.4.8)
~
-1]
(V -1 A)'9.(AlV·
A¢)k'J ,
1"
~
t"'.J
I"'oJ
I"'>J
~
(1.4.9)
I (A •. ,1jJk)
1J
-1
-1
= (V,...., )'k(V
AO)k'J •
'" "l
,.....,
(1.4.10)
~
1
.
-1
-1
= -4(2-0
.. )(2-0 ko )[(AlV 1"'oJ.l.
A)'k(A'V
A)·n
1J
,....,,....,
,...., J.'<J
AI
#"IJ
+
ICcP··,1jJk)
1J
t"J
-1
-1
(A'V
A)'o(A'V -A) J'k] ,
~
~
1~
~
~
(1.4.11J
1
-1
-1
= -2(2-0
.. )(V A)k'1 (V A)k'J
1J
(1.4.12)
= l(V-1)~
2""
1k
(1.4.13)
~
~
~
~
23
These results were derived independently by Lawley (1967) and
Lockhart (1967).
diag (2) = lq
These authors, however, assume that
and hence do not give the Kronecker factors in (1.4.9), (1.4.11) and
(1.4.13).
See also Joreskog (1969) and Lawley and Maxwell (1971,
Ch. 7).
The information matrix with respect to the free parameters is
simply
= kI
1(8.,8.)
J
1
where
a
ik
is 1
if
8
i
It
(1.4.14)
a·ka·nI(yk,Yt) ,
J;v
1
= Yj
are
and 0 otherwise, and the I(yk,y )
t
obtained from (1.4.8) - (1. 4 .13) .
Since log L is, except for a constant, -
IUD
=
-E
rl
Ln
2
a
log L
a&,a&, ,
J -_¥
1 [
"21 nF,
2
F ]
a&,o&,:
a
Expected values of the second derivatives of
by
21(~).
If
&,
is a preliminary estimate of &,
non-singular, then the matrix
value for the matrix
§
'"'
21 I -1 (&,)
we have
(1.4.15)
F are thus given
I(&')
such that
is
can be used as a good initial
in the Fletcher-Powell method.
Mulaik (1971) has derived exact expressions for the second derivatives of
F with respect to the elements of
~,
2
and
!, which
could be used in a Newton-Raphson algorithm for minimizing
expressions; however, are too complicated to give here.
F.
The
Furthermore,
the large number of parameters usually involved in confirmatory factor
analysis makes the Fletcher-Powell method generally more efficient than
Newton-Raphson.
~
24
1.4.3
~othesis
Testing
y has the structure (1.2.2) with the speci-
The hypothesis that
fied restrictions regarding free, fixed and constrained parameters may
\
be tested in large samples using the likelihood ratio goodness of fit
statistic
-2 log A = n[loglQI + tr(Q-l~) - 10gl~1 - p] ,
A
where
V = ~'
+
!
which is asymptotically distributed as
v =
where
21
A
n times
Multipliers other than
in law to
p. 93) take
n
~ with
(1.4.17)
p(p-l) - m ,
m is the number of free parameters.
statistic is simply
(1.4.16)
A
As in §1.3.5, the test
A
F(~,~,!),
in (1.4.1.6)
the minimum value of
~lich
2
Xv have not been established.
F.
may improve convergence
Lawley and Maxwell (1971,
n-(2p+5)/6, which is appropriate if
q = O.
1.5 Related Models
We now briefly consider two extensions of factor analysis which are
relevant to our work.
1.5.1
Factor Analysis in Several Populations
Several authors have proposed extensions of factor analysis to sev-
eral populations, see for example Rash (1953), Lawley and Maxwell (1971,
eh. 9), Joreskog (1971a) and more recently Please (1973).
In the gener-
al case one can write for the j-th population (j=l,: .. ,k)
(1.5.1)
where
~(j)
is a vector of responses,
~(j) is a vector of means, ~(j)
25
is a matrix of loadings,
E[!(j)] =
E[y(j)] = ~(j), Var[x(j)l = !(j) p.d.,
Q , Var [!(j)] = ~(j) with ~(j) a diagonal p.d. matrix
'"
and Cov[X(j) ,!(j)] =
'"
Q. Clearly this general model is not identified.
Lawley and Maxwell (1971, Ch. 9) consider the case
= Q,
r,;(j)
'"
~(j) = ~
B(j) = ",'
0
onal elements of
~(l)
~(j) =!
(j=1,2), set the diag-
to unity, and derive maximum likelihood equa-
'"
.
f or es t'1mat'1ng
t10ns
and
k = 2 with
A
I1J
~,~,
,(I)
..
~
and
~(2)
~
un d er norma l't
1 y assump t -
ions.
Joreskog (l97Ia) considers the case of general
and requires that
~
(j) for J-I,
'- ... ,k.
q
2
k
=Q
with
independent constraints be imposed on
and
Actually his model is more general in that the
factors and responses need not be the same in all populations.
(1971a) also studies in greater detail the case where
Joreskog
A(j) = ~
(j=l, ..• ,k).
Please (1973) considers the case of general k with
~(j) = Q and
~(j) = ~ (j=l, ... ,k), and derives the likelihood equations under the
set of restrictions
structure ignoring
AlA = ¥"Vq
I.
~
I"V
While Joreskog emphasizes the covariance
the means, Please considers the factor means also.
Note that the models are heteroscedastic in that
var[~(j)] need not
be the same in all populations.
In all' these models the Fletcher-Powell method is used to compute
the estimates.
1.5.2
Analysis of Covariance Structures
Joreskog (1970a) has proposed a general model for the analysis
of covariance structures which can be written as
(1.5.2)
e
26
~:
where
~:
pxg
rank
pxn
~:
and
g
$
is a matrix of
P
hxn
and
=Q
and
observations on
p
variates,
are design matrices of known constants of
h
<
known parameters and
E(~)
n
Var(~)
~:
n
respectively,
g
is a stochastic matrix of errors with
=y
gXh
is a matrix of un\
0
ln,
where the variance-covariance matrix
y has the structure
y = ~(~tt'
where
~:
pxq
and
[: qxr
+ I)~' +
f
(1.5.3)
are matrices of factor loadings,
is a p.d. matrix of factor covariances and
1:
diagonal p.d. matrices of specific variances.
out the parametric structure of y
qxq
and
~:
~:
pxp
rxr
are
The model (1.5.2) with-
is the growth curve model of
Potthoff and Roy (1964).
There is a great deal of indeterminacy in the model, for if
II
is a
non-sin~Jlar
and
12 is any non-singular matrix of order q, then ~ may be re-
placed by
-1
Ul
(1.5.3) leaving
matrix of order
,[
by
-1
I l £I 2 ,
y unaffected.
p
~
such that
by
IIT!i
is diagonal
I 2[IZ and 1 by Il!!i in
In order to achieve identifiability
some restrictions must be imposed on the parameters.
Joreskog (1970a) allows the parameters in
~,~,
I, k, 1,
and!
to be (1) fixed at given values, (2) constrained to be equal to other
parameters, or (3) free, unconstrained.
this generates a variety of
interesting special cases, including models for the analysis of congeneric measurements, factor analysis, Wiener and Markov process models
for repeated measurements, and some special cases of path analysis.
The likelihood equations for estimating the parameters when
27
u
~
~ N
V 0 ~
I)
pxn (0
~,~
are given in Joreskog (l970a).
The estimates are
again obtained using the Fletcher-Powell method.
Several applications of the model are considered in Joreskog
(1970b, 1971b).
The review paper by Mukherjee (1973) compares the
factor analysis and covariance structure models.
Recently, Browne
(l974b)has considered generalized least squares estimation in the
analysis of linear and non-linear covariance strJctures, and has
proved several optimality properties of the estimators.
II.
2.1
TIlE MULTIVARIATE LATENT LINEAR
~t)DEL
Introduction
In this chapter we propose a generalization of the factor analysis
model which obtains by letting the factors satisfy a multivariate linear
model.
The resulting model is termed the multivariate latent linear
model, and provides a general framework for the statistical analysis of
unobservable constructs of latent variables on the basis of observable
indicators or manifest variables.
In §2.2 we provide a formal statement
of the model and note that it is analogous in its method of construction
to growth curve and covariance structure models.
In §2.3 we
consid~r
the identification problem and the types of restrictions that must be
imposed on the parameters to achieve identification.
In §2.4 we consid-
er some special cases and applications of the model, and discuss its
relationship with factor analysis in several populations, path analysis,
and variance components models.
2.2 Statement of the Model
Consider the multivariate general linear model
x = ~.A + f
where
~:
y:
qxn
(2.2.1)
'
is a stochastic matrix of
n
observations on
qxr is a matrix of unknown regression parameters,
~:
rxn
q
variates,
is a design
29
matrix of full row Tank
r
and
errors with
f
nand
<
is a stochastic matrix of
= "'"'~
Var(f.)
.-
3 "'"'n
I , where
of unknown variances and covarianccs.
For estimation and testing pur-
Nqn"'"''''"'
x (O,~ 3 I ).
"'"'n
We are interested in estimating the parameters
poses we will assume that
is a p.d. matrix
€ "'"'
"'"'
.
"'"'
~
and
, and in
testing the multivariate general linear hypothesis
(2.2.2)
£:
where
txq
and
B: rxs
are matrices of fixed, known constants.
We suppose, however, that the matrix
!: pxn with p
we observe a data matrix
Y
"'"'
>
q
is not observable.
related to
Instead
Y by the
general factor analysis modeZ
x
"'"'
~:
where
of full
with
rank
Q'
=
(2.2.3)
... ....
Z ,
is a matrix of unknown parameters called factor Zoadings,
pxq
coll~n
E(f)
= ...,.,
AY
q
p
<
Var(~) =
and
! Q
In'
~
is a stochastic matrix of errors
where!
= diag(~l""'~p)
is a diag-
onal p.d. matrix of unknown parameters called specificities, and
Cov(y,~) =
~
Q.
For estimation and testing purposes we will assume that
Npxn(Q,! @
In)
independently of
y.
In this formulation
y
represents unobservable co rstructs, factors.
or latent variables, and
!
represents observable indicators. responses
or manifest variables.
parameters
data
!.
~
and
~
The problem
is to estimate the linear model
and to test linear hypotheses about
using the
We will also be interested in estimating and testing hypotheses
about the factor model parameters
~
and
!.
Combining (2.2.1) and (2.2.3) we can write the model in general form
30
as
(2.2.4)
where
is a stochastic matrix with
and
Var(lD
\
.§
= /::1.,
= ,..V @ "'I1
I
and
(2.2.5)
(2.2.6)
Under the normality assumptions on the distributions of
and ,t,
U ,.. Npxn (O,V @ "'I1
I ) .
,..
~
(2.2.7)
~
The proposed model is thus seen to be a multivariate general linear
model where both the regression and dispersion parameters are structured.
We will call
.§
the structural
and
V the reduaed
parameters.
fo~
parameters and
~,~, ~
and
f
The structure of .§ is analogous to the
types of structures found in econometric models, see for example Theil
(1971), and the structure of
y
is of the factor analysis type discussed
in §1.2, with the important feature that the parameter
A is common to
both structures.
Since the basis of the proposed model is a multivariate general
linear model on unobservable or latent variables, it will be termed the
multivariate latent linear model.
The model is analogous in its method of construction to growth curve
and covariance structure models.
In the growth curve model we assume
that
E(~)
= ffi.
with within-subjects design matrix
~:
(2.2.8)
£:
pxq
and subject parameters
qxn, and we assume that the subject parameters in turn satisfy the
31
linear model
e
with structural parameters
~:
= ~
~:qxr
(2.2.9)
and across-subjects design matrix
Thus the growth curve is a second-order or compounded linear
rxn.
model, see Grizzle and Allen (1969).
Suppose now that
~
is given by a factor model
(2.2.10)
where
,f}: pxq
is
a matrix
E(~) =
of errors with
Q'
is a stochastic matrix
of factor loadings,
Var(~) =
matrix of specifities, and
!
where 1
@ In
Cov(r,~) =
is a diagonal p.d.
Q ' and suppose that
in turn
is given by a factor model
(2.2.11)
where
[: qxr
factors with
f
is a matrix of loadings,
E(f)
=
Q and Var(F)
~
=
E
r @I
~
Then
T
E(~)
a
stochas~ic
where
~
is a stochastic matrix of errors with
where
i~
E(f) = Q
r
is
and Var(X)
~
=
V @I
~
~
,....,,....,.....,,....,
-
~
@
Cov(E,f)
I
"""n
=
Q.
where
v = A(rrr' +T)A' + ~
ro.J
p. d., and
Var(s) = T
is a diagonal p.d. matrix of specificities and
=Q
matrix of
~,...,
,....,'
(2.2.12)
which is the type of covariance structure assumed in Joreskog's (1970a)
general model described in 91.5.2.
Thus the covariance structure model
is a second-order or compounded factor model.
The latent linear model, on the other hand, may be described as a
compounded linear-factor model.
It may be noted, incidentally, that the covariance structure (2.2.6)
of the latent linear model may be obtained as a special case of Joreskog's
32
by setting
r =I
~
however, that
E(~)
and
I =Q
=~
in (2,2.12).
where
(2.2.4) and (2.2.5) we assume
P
is a known matrix, while from
E(~) = ~ ,
appears also in the structure of
Joreskog (1970a) assumes,
is unknown and
where
y.
2.3 The Identification Problem.
We now discuss the identification problem for the model described
in §2.2 and consider restrictions on the structural parameters and on
the design matrix.
2.3.1
Restrictions on the Structural Parameters,
Let
~
and
X satisfy the linear model (2.2.4) with reduced form parameters
V given by (2.2.5) and (2.2.6), respectively.
model is not identified let
1
To see that the
be any non-singular matrix of order
q
and let
z~*
-A*
~*
Then
],
~
and
~
L~ ,
= ......
AL- I
= ......
L~L .
= .......
may be replaced by
(2.3.1)
(2.3.2)
and
(2.3.3)
~*, ~*
and
~*
in (2.2.5) and
(2.2.6) without modifying the structure of the model nor the reduced
form parameters.
This indeterminacy of the model corresponds to a non-
singular linear transformation of the factors X*= 11. Since
2
2
q
elements, at least q
constraints need be imposed upon
Except for the additional parameter matrix
~,
L has
~,
A and!.
the problem is the
same as encountered in factor analysis, and can be solved along the same
lines as in §l.2.2.
We therefore consider unrestricted and restricted
versions of the model.
33
~q(q+l)
In the unrestricted case
defining
¢ = I.
~
~q
This implies no loss of generality because if
~~ is any square root of
~
~-~X
=
,
where
, and redefine
= I
¢*
~q
~
If
.
q > I, however, we could still replace
';;*
= ~'2
~* =
and
~
A
~
by
(2.3.4)
and
M:j ,
(2.3.5)
M is any orthonormal matrix of order q.
where
X*
we can introduce the linear transformation
~ ~ !q
and
constraints will be imposed by
In this event,
~q (q-l)
additional constraints are needed to achieve identification.
~,!-l~ to
Proceeding as in factor analysis, we may require
be diagonal with its elements arranged in decreasing order of magnitude,
thus obtaining the canonical basis of the factor space, see §1.2.3.
A is any matrix satisfying (2.2.5) and (2.2.6), then the choice of
If
~
that transforms
A
to satisfy the canonical restrictions is given
by the matrix of orthonormal eigenvectors of
~'!
-1
~
,
for then
(2.3.6)
and
-1
o.1 = ch.1 (A'~
A).
,..." '"
Unfortunately,
t"'W
in our case these restrictions do not turn out to be very convenient in
maximum likelihood estimation.
An
alternative set of constraints is obtained by setting to 0 the
first (j-l) elements in the j-th column of
case
A has a triangular pattern of zeros
~
for j=2, ... ,q.
In this
34
0
o
A A
21 22
o
All
A A
ql q2
A =
...
t"
Let
(2.3.7)
A
qq
A
pq
A be any matrix satisfying (2.2.5) and (2.2.6), and let
denote the first
q
~.
rows of
Then the choice of
M that transforms
~
to satisfy (2.3.7) is given by the Gram-Schmidt factorization
A
= TM'
-q
~
matrix
is lower triangular and
, where I
A
~q
The
is orthonormal.
M is a product of Householder transformations and may be
comput~
ed using the modified Gram-Schmidt algorithm as described in Golub (1969).
The resulting basis of the factor space will be called the G2'am-Schmid-t
basis,
and has the property that the i-th response (i=l, ... ,q-l) is
not loaded on factors
i+l, ... ,q, a feature that may be helpful in factor
interpretation.
Statistically, one basis is as good as another so long as it allows
identification of the model, and may be rotated to satisfy any criteria
that may simplify interpretation of the resulting factors.
In particular,
the Gram-Schmidt basis may be rotated to canonical form.
It may be noted that we have achieved identification by imposing
constraints on
2
and
~
but not directly on
is of course affected by the choice of
~,
although the latter
~.
In the restricted case we will consider a structural hypothesis
that will usually impose
at least an additional
q
constraints by requiring diag(2)
=
!q' and
q(q-l) constraints by setting to zero (q-l)
elements in each column of
A
In the simplest case where only
q
2
35
constraints are imposed in this manner, the common factor space is not
truly restricted and may be transformed to an equivalent canonical or
Gram-Schmidt form.
!
and
In a more general case, all parameters in
~,~
will be allowed to be either: (1) fixed, known parqmeters,
(2) parameters constrained to be equal to other parameters in the model,
or (3) free, unconstrained parameters.
In this case special care must
be exercised in setting up the restrictions to ensure that they are
sufficient to identify the model.
It may be noted that again we have
not imposed constraints directly on
In this regard we feel that the
constraints should be based directly on the nature of the measurement
process and hence should refer to
2.3.2
! .
and
Restrictions on the Design Matrix.
In (2.2.3) we have written the factor analysis part of the model
=~
!
as
umns of
+ ~ ,
~
without a separate location parameter
!' as we had in (1.2.1).
for the col-
In many applications it may be desir-
able to introduce such a location parameter.
In order to do this, how-
ever, it is necessary to introduce some restrictions on the design
~.
matrix
To fix ideas, consider a one-sample problem with ,...
A
x
,...
- = ~= E (l) .
=~
E(~)
=0
and
y,
Let
and write
Then if a location parameter U is introduced we have
+ ~
ely identified.
r
!
X denote generically the columns of
and
= ,...
I' n .
and, as we noted in §1.2.3,
In this case we assume
~
B
=Q
and
~
are not separat-
or, equivalently, set
in the linear model.
Consider now a two-sample problem with
n
observations in each
36
~n
l' ]
-1'
~n
sample, let
Then if a location
parameter is introduced we have
E(~o.)
=
~
+
~ -~.
(0.=1,, ... , n)
separately identified, for we could replace
~*
~
=
+
we assume
~
=
[~,
~(~*-~)
~
-~]
=
k
(o.=n+l, ... ,2n), and clearly
E(~o.)'
without affecting
~
and
~
and
~
by
(o.=I, ... ,2n).
and
are not
~*
and
In this case
Q , or equivalently set the linear model with r=l and
~
, and estimate only
§..
and the location difference
From these remarks it is clear that
should not contain a row of ones;
~
i.e. the model should not specify an overall mean of l.
Note now that an alternative parameterization for the two sample
l'
problem with r=2 is obtained by setting
g=
(~1'~2)'
E,
~l
and
~2
~l
r
LQ
and
' but we can estimate
Although no overall mean of l
plicit in
=
Since this is equivalent to the previous model, it is not
possible to identify
~1-~2'
~
~n
and
li
and
is specified, such a mean is im-
~2'
In the general case, adding a location vector
)d
amounts to con-
sidering the model
(2.3.8)
The reduced form parameters
~
and
~
and only if the augmented design matrix
Since
in this model are identified if
lr 1'1
~
is of full row rank.
A itself is of full row rank, the only additional requirement is
that all rows of
~
be linearly independent of
In.
37
Estimation of
B'
discussed in §3.2, is considerably simplified
~
by introducing the stronger requirement that
be orthogonal to
ln,
Le.
&n = Q
This entails no loss of generality, for if
replace
by
this means that
A - aI'
~*~
~
~
and
I::!.
by 1:: *
(2.3.9)
rQ
=
~
= !:
+
Ba.
we can always
In practical terms
should contain only deviations of dummy variates or
regressors from their means.
The above requirement has been considered
in a different context by Puri and Sen (1969).
2.4
Applications of the Model.
We now consider some special cases of the model to illustrate its
range of application and some relationships with other models.
2.4.1
Factor Analysis in Several Populations.
In the k-sample problem the latent linear model may be written as
(2.4.1)
where
~(j)
is an observation in the j-th population,
of factors with
E[x(j)] = ~(j)
vector of errors with
and
Var[x(j)] =~,
E [~ (j)] = Q ' Var [~ (j )] =
Cov[x(j) ,~(j)] = Q , 0=1, ... ,k).
X(j)
and
is a vector
~(j)
is a
! diagonal and
We thus obtain a model of the kind
described in §l.S.l.
The models proposed by Joreskog (1971a) and Please (1973) for this
situation emphasize differences in the covarianc2 structure among populations, and thus aTe more in the spirit of factor analysis, while the
present model emphasizes differences in location, and thus is more in the
spirit of linear models.
4It
38
In a sense the model is more restrictive than Please's, for Λ and Ψ are assumed invariant over populations. It is more general, however, in that the parameters of the present model may be estimated under quite general identifying restrictions or structural hypotheses. Also, Please (1973) assumes μ = 0 to obtain m.l.e.'s while, as indicated in §2.3.2, we can estimate μ and location differences among populations.
An example where the present model could be applied is if a scale
containing 40 questions believed to measure 5 types of attitudes is
applied to 4 groups and it is desired to test whether the groups differ
on their attitudes.
We conjecture that the multivariate latent linear model proposed here is more powerful than a multivariate analysis of variance based on all 40 questions, or an analysis based on scores obtained for each factor by averaging the questions that best measure it.
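The design matrix and loading pattern for an example of this kind might be set up as follows; the sketch is purely illustrative, with hypothetical dimensions (40 items, 5 factors, 4 groups of 50 subjects) and a simple-structure pattern assumed for Λ:

    import numpy as np

    p, q, k, n_per = 40, 5, 4, 50          # items, factors, groups, subjects per group
    n = k * n_per

    # Design matrix: group dummies expressed as deviations from their means,
    # so that A 1_n = 0 as required in section 2.3.2 (here with r = k-1 contrasts).
    G = np.kron(np.eye(k), np.ones((1, n_per)))       # k x n group indicators
    A = (G - G.mean(axis=1, keepdims=True))[:k - 1]   # drop one redundant row

    # Loading pattern: each block of 8 items is assumed to measure one attitude.
    Lambda_pattern = np.zeros((p, q), dtype=bool)
    for j in range(q):
        Lambda_pattern[8 * j:8 * (j + 1), j] = True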
2.4.2
A Multiple-Cause Multiple-Indicator Causal Model.
Path analysis, introduced by Wright (1918), is a technique frequent-
ly used in the social and behavioral sciences as well as in biometry to
study causal relationships among variables in a non-experimental situation.
See for example Blalock (1961), Boudon (1965), Duncan (1966) and
Land (1969)
on the social science side, and Turner and Stevens (1959)
on the biometry side.
The technique is closely related to structural
equation models studied by econometricians, see for example Wold (1964)
or Theil (1971).
The model proposed here may be regarded as a path analysis model involving multiple causes of unobservable variates which are assessed using a multiple indicator approach. The causal structure for the case of two unobservable variates is illustrated in Figure 2.4.1.
Fig. 2.4.1  A Multiple-Cause Multiple-Indicator Causal Model

The A_i represent causes of the unobservable variates Y_i, the X_i represent observable indicators, and the Z_i and E_i are uncorrelated error terms.
Curved arrows represent correlations among independent
variables and straight arrows represent causal relationships.
By the
nature of the linear model all independent variates (A) are assumed to
affect the unobservable dependent variates (Y).
By the nature of the
factor model, however, some of the paths from unobservable variates (Y)
to indicators (X) may be assumed non-existent.
The model allows esti-
mation of all regression coefficients and error variances under suitable
identification restrictions.
Hauser and Goldberger (1971) consider causal models involving unobservable variates and give a one-factor version of the above model.
They state incorrectly, however, that the estimates may be obtained using
Joreskog's (1970a) method for the analysis of covariance structures;
see §1.5.2 and the discussion at the end of §2.2. The present interpretation of the proposed model, however, was inspired by their work.

An example where the model could be applied is when one wishes to study the determinants of a latent variable such as risk of ischemic heart disease using indicators of risk such as blood pressure, serum cholesterol and blood glucose.
Which variables are used as determinants
and which as indicators would depend, of course, on the purpose of the
study.
2.4.3
Variance Components and Mixed Models.
Since most applications of the proposed model are likely to involve
multiple measurements of a few underlying characteristics of interest,
it is convenient to consider in some detail the analysis of multiple
indicator data.
Since our interest centers on the measurement process we
will ignore for simplicity the underlying linear model.
A model frequently employed when a set of subjects is measured independently by p observers is the variance components model

    x_ij = α + β_j + s_i + z_ij ,    (2.4.2)

where x_ij is the score assigned to the i-th subject by the j-th observer, α is an overall mean, β_j is a fixed effect due to the j-th observer with Σ_j β_j = 0, s_i is a random effect due to the i-th subject with mean 0 and variance σ_s², and z_ij is an error term with mean 0 and variance σ_e². It is assumed that all random effects and errors are mutually uncorrelated. The correlation between any two measurements on the same subject is the intraclass correlation σ_s²/(σ_s² + σ_e²). For a discussion of this and related models see Landis and Koch (1974).
The model (2.4.2) may be rewritten within our framework and notation as

    x_ij = μ_j + λy_i + z_ij ,    (2.4.3)

where μ_j = α + β_j, λ = σ_s, y_i = s_i/σ_s and Var(z_ij) = ψ = σ_e². The variate y may be interpreted as the characteristic being measured, standardized to have mean zero and variance one. The parameters μ_j and λ may be interpreted as the intercept and slope of the regression of the j-th measurement on y, and ψ as the corresponding error variance. The squared correlation between the j-th measurement and y is ρ² = λ²/(λ² + ψ), and is the same as the intraclass correlation between any two measurements.
Note that (2.4.3) may be written in vector form as

    x = μ + Λy + z ,

with Λ = λ1_p, E(y) = 0, Var(y) = 1, E(z) = 0, Var(z) = ψI_p and Cov(y, z) = 0; thus E(x) = μ and Var(x) = λ²1_p1'_p + ψI_p. Thus (2.4.2) or (2.4.3) is seen to be a special case of a one-factor model with Λ = λ1_p and Ψ = ψI_p, where the regressions of all measurements on the characteristic of interest have the same slope and error variance, and differ only in their intercepts (biases). More general models may be obtained by relaxing the equal slope assumption, the homoscedasticity assumption, or both. The resulting structures are listed in Table 2.4.1.
In psychological test theory, tests that measure the same characteristic are called parallel if they satisfy Model 1 and tau-equivalent if they satisfy Model 2; see Lord and Novick (1968) and Joreskog (1971b).
We have thus seen that the one-factor model provides a quite general model for the multiple indicator case, by allowing not only the intercepts μ_j but also the slopes λ_j and error variances ψ_j (j = 1,...,p) in the regression of each measurement on the characteristic being measured to be different.

    Table 2.4.1  Four Measurement Models

    Model                                       Slopes        Error variances
    1. Variance Components                      λ_j = λ       ψ_j = ψ
    2. Heteroscedastic Variance Components      λ_j = λ       ψ_j free
    3. Homoscedastic Factor Analysis            λ_j free      ψ_j = ψ
    4. Factor Analysis                          λ_j free      ψ_j free
In the general case the squared correlation between the i-th and j-th measurements is

    ρ²_ij = λ_i²λ_j² / [(λ_i² + ψ_i)(λ_j² + ψ_j)] ,

while that between the i-th measurement and y is ρ²_i = λ_i²/(λ_i² + ψ_i), and these need not be the same for all i, j.
Furthermore, the basic assumption that all
observers measure the same characteristic and the more restrictive
assumptions leading to Models 1-3 may be formally tested.
These remarks
may clearly be extended to multiple factor models.
The model proposed in §2.2 provides a further generalization by
allowing the characteristics being measured to satisfy a linear model,
so that other variables affecting them may be considered and studied.
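A small numerical sketch (in Python, with hypothetical values of σ_s and σ_e) of the reparameterization (2.4.3) and of the correlations discussed above is the following:

    import numpy as np

    p = 4                     # observers (measurements per subject)
    sigma_s, sigma_e = 1.2, 0.8

    # One-factor reparameterization (2.4.3): equal slopes and error variances.
    lam = sigma_s * np.ones(p)              # slopes lambda_j
    psi = sigma_e**2 * np.ones(p)           # error variances psi_j
    V = np.outer(lam, lam) + np.diag(psi)   # V = Lambda Lambda' + Psi

    rho = sigma_s**2 / (sigma_s**2 + sigma_e**2)   # intraclass correlation
    assert np.isclose(V[0, 1] / V[0, 0], rho)

    # General case (Model 4): squared correlation between measurements i and j.
    def rho2(i, j, lam, psi):
        return lam[i]**2 * lam[j]**2 / ((lam[i]**2 + psi[i]) * (lam[j]**2 + psi[j]))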
III. MAXIMUM LIKELIHOOD ESTIMATION
3.1 Introduction.

We now consider maximum likelihood estimation of the parameters of the latent linear model under normality assumptions. The discussion is organized in five parts. In §3.2 we derive the likelihood equations using the matrix differentiation rules given in the appendix. In §3.3 we discuss the necessary modifications when a separate location parameter μ is introduced. In §3.4 we show that when B is unrestricted its conditional m.l.e. for fixed Λ, Φ and Ψ may be obtained analytically. In §3.5 we consider estimation of Λ, Φ and Ψ under certain identifying restrictions. Finally, in §3.6 we describe an iterative procedure for computing the estimates.
3.2 The Likelihood Equations.
Let us write the latent linear model as

    X = βA + U ,    (3.2.1)

where A is of full row rank r < n,

    U ~ N_{p×n}(0, V ⊗ I_n) ,    (3.2.2)

and

    β = ΛB ,   V = ΛΦΛ' + Ψ ,    (3.2.3)

where Λ is of full column rank q < p, Φ is symmetric p.d. and Ψ is diagonal p.d.

The logarithm of the likelihood function may be written as

    log L = c - (n/2) log|V| - (n/2) tr V^{-1}T ,    (3.2.4)

where c is a constant. Let us use the notation

    T = (1/n)(X - βA)(X - βA)' .    (3.2.5)

Maximizing log L is then equivalent to minimizing the function

    F = log|V| + tr V^{-1}T ,    (3.2.6)

which will be used to obtain results directly comparable with factor analysis, see (1.3.3) and (1.4.1).

It may be noted that the likelihood function depends on the structural parameters B, Λ, Φ and Ψ only through the reduced form parameters β and V, and thus remains invariant under transformations of the type discussed in §2.3.1. This shows that the problems of identifiability and estimability are equivalent. Henceforth we will assume that sufficient restrictions have been imposed on the model to identify the parameters and thus achieve estimability.
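For given values of the structural parameters, the function F of (3.2.6) can be evaluated directly; the following Python sketch does so without exploiting the structure of V (more economical forms are discussed in §3.5):

    import numpy as np

    def latent_linear_F(X, A, B, Lam, Phi, Psi):
        """Evaluate F = log|V| + tr(V^{-1} T) of (3.2.6) for given parameter values.
        X: p x n data, A: r x n design, B: q x r, Lam: p x q, Phi: q x q, Psi: length-p
        vector of diagonal error variances."""
        n = X.shape[1]
        beta = Lam @ B                                   # reduced form regression (3.2.3)
        V = Lam @ Phi @ Lam.T + np.diag(Psi)             # reduced form dispersion (3.2.3)
        R = X - beta @ A
        T = R @ R.T / n                                  # (3.2.5)
        sign, logdet = np.linalg.slogdet(V)
        return logdet + np.trace(np.linalg.solve(V, T))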
The first result we need is

Lemma 3.2.1. The derivatives of the function F defined in (3.2.6) with respect to the reduced form parameters β and V are given by

    ∂F/∂β = -(2/n) V^{-1}(X - βA)A'    (3.2.7)

and

    ∂F/∂V = V^{-1}(V - T)V^{-1} .    (3.2.8)

For a proof of this result see §A.4, in particular (A.4.10) and (A.4.16).
Next we prove the following:

Lemma 3.2.2. The derivatives of the reduced form parameters β and V defined in (3.2.2) and (3.2.3) with respect to the structural parameters B, Λ, Φ and Ψ are given by

    ∂β/∂B = (Λ ⊗ I_q)E_(q,r) ,    (3.2.9)
    ∂β/∂Λ = E_(p,q)(B ⊗ I_q) ,    (3.2.10)
    ∂V/∂Λ = E_(p,q)(ΦΛ' ⊗ I_q) + (ΛΦ ⊗ I_p)I_(p,q) ,    (3.2.11)
    ∂V/∂Φ = (Λ ⊗ I_q)E_(q,q)(Λ' ⊗ I_q) ,    (3.2.12)

and

    ∂V/∂Ψ_d = diag(E_11, ..., E_pp) .    (3.2.13)

The concept of matrix derivative and the notations E_ij, E_(m,n) and I_(m,n) are defined in the appendix, see Definitions A.2.2 and A.2.3.

Proof: The proof is a straightforward application of the matrix differentiation rules given in §A.3. In particular (3.2.9) and (3.2.10) follow from (A.3.3); (3.2.11) follows from the sum rule (A.3.1) and (A.3.6); (3.2.12) may be obtained from the sum rule and (A.3.3), and (3.2.13) follows from the sum rule and the definition of matrix derivative (A.2.5). □

Note that in (3.2.12) we have ignored the symmetry of Φ, but in (3.2.13) we have taken into account the fact that Ψ is diagonal, as indicated by the use of the subscript d.

We are now ready to prove the following result:
Theorem 3.2.3. The derivatives of the function F defined in (3.2.6) with respect to the structural parameters B, Λ, Φ and Ψ are given by

    ∂F/∂B = -(2/n) Λ'V^{-1}(X - βA)A' ,    (3.2.14)
    ∂F/∂Λ = -(2/n) V^{-1}(X - βA)A'B' + 2ΩΛΦ ,    (3.2.15)
    ∂F/∂Φ_s = 2Λ'ΩΛ - diag Λ'ΩΛ ,    (3.2.16)

and

    ∂F/∂Ψ_d = diag Ω ,    (3.2.17)

where

    Ω = V^{-1}(V - T)V^{-1} .    (3.2.18)

Proof: The proof is based on a series of applications of the chain rule (A.3.15), together with Lemmas 3.2.1 and 3.2.2. Differentiating with respect to B we have from (A.3.15)

    ∂F/∂B = (∂F/∂β) * (∂β/∂B)    (3.2.19)
          = -(2/n) V^{-1}(X - βA)A' * (Λ ⊗ I_q)E_(q,r)    (3.2.20)
          = -(2/n) [Λ'V^{-1}(X - βA)A']' ,    (3.2.21)

where (3.2.20) follows from (3.2.7) and (3.2.9), and (3.2.21) is obtained using identity (A.3.14) for star products. Then (3.2.14) follows by computing the transpose in (3.2.21).

Differentiating with respect to Λ we have, from the chain rule,

    ∂F/∂Λ = (∂F/∂β) * (∂β/∂Λ) + (∂F/∂V) * (∂V/∂Λ) .    (3.2.22)

The first term in (3.2.22) is

    (∂F/∂β) * (∂β/∂Λ) = -(2/n) V^{-1}(X - βA)A' * E_(p,q)(B ⊗ I_q)    (3.2.23)
                       = -(2/n) V^{-1}(X - βA)A'B' ,    (3.2.24)

where (3.2.23) follows from (3.2.7) and (3.2.10), and (3.2.24) follows from identity (A.3.14) and (ΛB)' = B'Λ'. The second term in (3.2.22) is

    (∂F/∂V) * (∂V/∂Λ) = Ω * [E_(p,q)(ΦΛ' ⊗ I_q) + (ΛΦ ⊗ I_p)I_(p,q)]    (3.2.25)
                       = Ω * E_(p,q)(ΦΛ' ⊗ I_q) + Ω * (ΛΦ ⊗ I_p)I_(p,q)    (3.2.26)
                       = ΩΛΦ + Ω'ΛΦ    (3.2.27)
                       = 2ΩΛΦ ,    (3.2.28)

where (3.2.25) follows from (3.2.8), notation (3.2.18) and (3.2.11); (3.2.26) follows from the distributivity of star products, (3.2.27) may be obtained using identities (A.3.13) and (A.3.14), and (3.2.28) follows from the symmetry of Ω. This proves (3.2.15).

Differentiating with respect to Φ we have

    ∂F/∂Φ = (∂F/∂V) * (∂V/∂Φ)    (3.2.29)
          = Ω * (Λ ⊗ I_q)E_(q,q)(Λ' ⊗ I_q)    (3.2.30)
          = Λ'ΩΛ ,    (3.2.31)

where (3.2.29) follows from the chain rule, (3.2.30) follows from (3.2.8), (3.2.18) and (3.2.12), and (3.2.31) follows from identity (A.3.14). From (3.2.31) and the remarks in §A.3.3, the partial derivatives of F with respect to the distinct elements of Φ may be written as in (3.2.16).

Finally, differentiating with respect to Ψ,

    ∂F/∂Ψ_d = (∂F/∂V) * (∂V/∂Ψ_d)    (3.2.32)
            = Ω * diag(E_11, ..., E_pp)    (3.2.33)
            = diag Ω ,    (3.2.34)

where (3.2.32) follows from the chain rule, and (3.2.33) follows from (3.2.8), (3.2.18) and (3.2.13). To obtain (3.2.34) recall the definition of star product (A.3.11), note that ω_ij E_ij = 0 when i ≠ j, and that ω_ii E_ii is a matrix with ω_ii in the i-th diagonal position and zeros elsewhere; thus Ω * diag(E_11, ..., E_pp) = diag Ω. This completes the proof of the theorem. □

In the special case where the linear model part of the structure is ignored, the results in Theorem 3.2.3 reduce to the well-known results of factor analysis; compare (3.2.15)-(3.2.17) with (1.4.2)-(1.4.4).
3.3 Estimation of μ.

The results obtained so far apply to the case where no location parameter μ is specified. We now show that if μ is introduced its m.l.e. is μ̂ = (1/n)X1_n = x̄, and that the results in §3.2 hold with X replaced by X - x̄1'_n, provided A1_n = 0.

In the general case the function F has the same form (3.2.6) but with T defined by

    T = (1/n)(X - μ1'_n - βA)(X - μ1'_n - βA)' .    (3.3.1)

Proceeding in the same fashion as in (A.4.5)-(A.4.10) we have

    ∂F/∂μ = -(2/n) V^{-1}(X - μ1'_n - βA)1_n .    (3.3.2)

Setting (3.3.2) to zero and noting that 1'_n1_n = n, a scalar, we obtain the equation

    μ = (1/n)(X - βA)1_n ,    (3.3.3)

which under the restriction A1_n = 0 leads to the estimator

    μ̂ = (1/n)X1_n = x̄ .    (3.3.4)

Substituting this result into (3.3.1) we see that the function F minimized over μ has the usual form (3.2.6) with

    T = (1/n)(X - x̄1'_n - βA)(X - x̄1'_n - βA)' .    (3.3.5)

Thus to minimize this function with respect to B, Λ, Φ and Ψ we can use Theorem 3.2.3 with X replaced by X - x̄1'_n.

This result is analogous to factor analysis, where one uses S = (1/n)XX' if μ = 0 and S = (1/(n-1))(X - x̄1'_n)(X - x̄1'_n)' in the general case, the difference in the denominator being due to the fact that the Wishart rather than the Normal likelihood is used.

It may also be noted that the restriction A1_n = 0 discussed in §2.3.2 permits considerable simplification of the results, for we can see from (3.3.3) that if we only required the rows of A to be linearly independent of 1'_n, then μ would have to be estimated iteratively.
3.4 Estimation of B.

In the most general case, where all the parameters B, Λ, Φ and Ψ may be restricted, the likelihood equations given in §3.2 must be solved numerically. If B is unrestricted, however, the value B̂ that minimizes F conditional on fixed values Λ, Φ and Ψ of the other structural parameters may be obtained analytically.

The following result will be useful.

Lemma 3.4.1. Let V be any matrix satisfying (3.2.3). Then

    V^{-1} = Ψ^{-1} - Ψ^{-1}Λ(Φ^{-1} + Γ)^{-1}Λ'Ψ^{-1} ,    (3.4.1)
    V^{-1}Λ = Ψ^{-1}Λ(I + ΦΓ)^{-1} ,    (3.4.2)

and

    Λ'V^{-1}Λ = Γ(I + ΦΓ)^{-1} ,    (3.4.3)

where

    Γ = Λ'Ψ^{-1}Λ .    (3.4.4)

This result is well-known, see for example Lawley and Maxwell (1971, p. 89). Result (3.4.1) may be verified post-multiplying by V in the form ΛΦΛ' + Ψ. Results (3.4.2) and (3.4.3) follow immediately.

We can now prove the following.

Theorem 3.4.2. Let F be defined by (3.2.6). If B is unrestricted, the value B̂ of B that minimizes F conditional on fixed values of Λ, Φ and Ψ is given by

    B̂ = Γ^{-1}Λ'Ψ^{-1}XA'(AA')^{-1} .    (3.4.5)

Proof: On setting to zero the partial derivatives of F with respect to B given in (3.2.14), and since β = ΛB, we obtain

    Λ'V^{-1}(X - ΛBA)A' = 0 ,    (3.4.6)

which under the full rank assumptions of the model gives

    B̂ = (Λ'V^{-1}Λ)^{-1}Λ'V^{-1}XA'(AA')^{-1} .    (3.4.7)

This result is sufficient to compute B̂, but can be simplified as follows. Inverting (3.4.3) we obtain

    (Λ'V^{-1}Λ)^{-1} = (I + ΦΓ)Γ^{-1} = Γ^{-1} + Φ = Γ^{-1}(I + ΓΦ) .    (3.4.8)

On the other hand, transposing (3.4.2),

    Λ'V^{-1} = (I + ΓΦ)^{-1}Λ'Ψ^{-1} .    (3.4.9)

Now using (3.4.8) and (3.4.9),

    (Λ'V^{-1}Λ)^{-1}Λ'V^{-1} = Γ^{-1}(I + ΓΦ)(I + ΓΦ)^{-1}Λ'Ψ^{-1} = Γ^{-1}Λ'Ψ^{-1} .    (3.4.10)

Hence (3.4.7) may be written as in (3.4.5). This completes the proof. □

Note that B̂ does not depend on Φ. A computational advantage of (3.4.5) over (3.4.7) is that inversion of V is replaced by inversion of Ψ, a diagonal matrix.
3.5 Estimation of Λ, Φ and Ψ.

We consider now estimation of the remaining structural parameters, which will be subject to a set of identifying restrictions as discussed in §2.3.1. Unfortunately, there is no analytic solution of the likelihood equations (3.2.15)-(3.2.17), and the estimates must be obtained numerically.
The function to be optimized is F minimized over B for fixed Λ, Φ and Ψ, and will be written as

    F̃ = F̃(Λ, Φ, Ψ) = min_B F(B, Λ, Φ, Ψ) = F(B̂, Λ, Φ, Ψ) .    (3.5.1)

We now give some results which are useful in evaluating F̃ and its derivatives. First we note that

    F̃ = log|V| + tr V^{-1}T̃ ,    (3.5.2)

where T̃ denotes T evaluated at B = B̂.

Computation of |V| is simplified by the following result given by Lawley and Maxwell (1971, p. 89).

Lemma 3.5.1. Let V be given by (3.2.3). Then

    |V| = |Ψ| |I_q + ΓΦ| ,    (3.5.3)

where Γ = Λ'Ψ^{-1}Λ.

Proof: The proof is based on the following result from linear algebra. If A is p×q and B is q×p where q < p, then

    |I_p + AB| = |I_q + BA| .    (3.5.4)

Premultiplying (3.2.3) by Ψ^{-1}, taking determinants on both sides and multiplying by |Ψ| we obtain

    |V| = |Ψ| |I_p + Ψ^{-1}ΛΦΛ'| .    (3.5.5)

Then (3.5.3) follows using (3.5.4) with A = Ψ^{-1}Λ and B = ΦΛ'. □

Lemmas 3.4.1 and 3.5.1 permit computation of the determinant and inverse of V in terms of the determinants and inverses of (I + ΓΦ) and Ψ. Note that V is p×p, while (I + ΓΦ) is q×q, with q usually considerably smaller than p, and Ψ is diagonal.
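A Python sketch of these computations, using the diagonality of Ψ and only q×q inversions, and including the conditional estimator (3.4.5), is the following (the routine names are of course arbitrary):

    import numpy as np

    def factor_inverse_logdet(Lam, Phi, Psi):
        """V^{-1} and log|V| for V = Lam Phi Lam' + diag(Psi), using only q x q
        inversions, as in Lemmas 3.4.1 and 3.5.1."""
        Psi_inv = 1.0 / Psi                               # Psi is diagonal
        Gamma = (Lam.T * Psi_inv) @ Lam                   # Gamma = Lam' Psi^{-1} Lam (3.4.4)
        q = Lam.shape[1]
        M = np.linalg.inv(np.linalg.inv(Phi) + Gamma)     # (Phi^{-1} + Gamma)^{-1}
        PsiL = Psi_inv[:, None] * Lam                     # Psi^{-1} Lam
        V_inv = np.diag(Psi_inv) - PsiL @ M @ PsiL.T      # (3.4.1)
        logdetV = np.sum(np.log(Psi)) \
                  + np.linalg.slogdet(np.eye(q) + Gamma @ Phi)[1]   # (3.5.3)
        return V_inv, logdetV

    def conditional_B_hat(X, A, Lam, Psi):
        """Conditional m.l.e. of B for fixed Lam, Phi, Psi, as in (3.4.5);
        note that it involves neither Phi nor V^{-1}."""
        Psi_inv = 1.0 / Psi
        Gamma = (Lam.T * Psi_inv) @ Lam
        B_reduced = X @ A.T @ np.linalg.inv(A @ A.T)      # reduced-form least squares fit
        return np.linalg.solve(Gamma, (Lam.T * Psi_inv) @ B_reduced)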
To evaluate T̃ it is convenient to introduce the following notations. Let

    S = (1/n)XX'    (3.5.6)

and

    G = (1/n)XA'(AA')^{-1}AX' .    (3.5.7)

Note that S and G are symmetric and do not depend on the parameters. Then T̃ may be written as

    T̃ = S - KG - GK' + KGK' ,   where K = ΛΓ^{-1}Λ'Ψ^{-1} .    (3.5.8)

This result follows from expanding (3.2.5) and using (3.4.5) and the above notation, and has a minor computational advantage over (3.2.5) in that the product (AA')^{-1}AA' = I drops out.

We now have the following result.

Lemma 3.5.2. The derivatives of F̃ defined in (3.5.1) with respect to Λ, Φ and Ψ are given by

    ∂F̃/∂Λ = 2Ω̃ΛΦ - 2V^{-1}(I - K)GΨ^{-1}ΛΓ^{-1} ,    (3.5.9)
    ∂F̃/∂Φ_s = 2Λ'Ω̃Λ - diag Λ'Ω̃Λ ,    (3.5.10)

and

    ∂F̃/∂Ψ_d = diag Ω̃ ,    (3.5.11)

where Ω̃ denotes Ω evaluated at T = T̃ and K, G are as defined in (3.5.6)-(3.5.8).

Proof: Derivatives of F̃ with respect to Λ, Φ and Ψ are obtained by evaluating (3.2.15)-(3.2.17) of Theorem 3.2.3 at B = B̂, for by (3.5.1) and the chain rule

    ∂F̃(Λ,Φ,Ψ)/∂Λ = ∂F(B̂,Λ,Φ,Ψ)/∂Λ + (∂F(B̂,Λ,Φ,Ψ)/∂B) * (∂B̂/∂Λ) = ∂F(B̂,Λ,Φ,Ψ)/∂Λ ,

since B̂ is a root of ∂F/∂B = 0. A similar result holds for Φ and Ψ. This gives immediately (3.5.10) and (3.5.11), while (3.5.9) is obtained evaluating (3.2.15) and using the notation in (3.5.6) and (3.5.7). □

Unfortunately, no convenient simplification of Ω̃ has been found.
3.6 The Iterative Procedure

We are now ready to consider estimation of Λ, Φ and Ψ by numerical minimization of F̃ under the Gram-Schmidt restrictions or under a structural hypothesis. In these cases parameters may be (1) fixed a priori, (2) constrained to be equal to other parameters in the model, or (3) free, unconstrained parameters.

Let the elements of Λ, the upper triangular elements of Φ and the diagonal elements of Ψ be arranged into a vector

    γ = (λ_11,...,λ_p1; ...; λ_1q,...,λ_pq; φ_11; φ_12, φ_22; ...; φ_1q,...,φ_qq; ψ_1,...,ψ_p)' ,    (3.6.1)

and let the free parameters in Λ, Φ and Ψ form a vector θ.    (3.6.2)

We now regard F̃ as a function f(θ) of the free parameters θ with partial derivatives

    g_i(θ) = ∂f/∂θ_i = Σ_j a_ij ∂F̃/∂γ_j ,    (3.6.3)

where a_ij = 1 if θ_i = γ_j and a_ij = 0 otherwise, and the partial derivatives ∂F̃/∂γ are obtained from Lemma 3.5.2. (The function f(θ) and its derivatives g(θ) also depend, of course, on the fixed parameters and the data, which are treated as constants.)
The function f(θ) may now be minimized using the method of Davidon (1959) and Fletcher and Powell (1963) described in §1.3.2.

The initial matrix E may be taken as the identity matrix, in which case the first iteration is in the direction of steepest descent. A better choice of E^(1), however, is given by the inverse of

    H = ((h_ij)) ,    (3.6.4)

where the second-order derivatives of f(θ) are given by

    h_ij = Σ_k Σ_l a_ik a_jl E[∂²F̃/∂γ_k∂γ_l]    (3.6.5)

and the expected values of the second-order derivatives of F̃ are derived in §4.7.

Since initial estimates may be far from the minimum point and H depends on θ, it is convenient to start with several steepest descent iterations, which work better in the first stages of the minimization process. The results from this stage may then be used to compute E^(1) using (3.6.4), and the process may be continued using Fletcher-Powell iterations until a convergence criterion such as

    max_{1≤i≤m} |g_i| < δ ,

for δ = .001, say, is satisfied.

Until experience in using this procedure accumulates, an optimum changeover point cannot be determined. On the basis of Joreskog's (1970a) experience on the analysis of covariance structures, however, it is recommended to change over when f decreases by less than 5% between two consecutive steepest-descent iterations, see also §6.2.3.
Since during iteration the elements of Ψ may become negative, we proceed as in factor analysis, introducing the restriction ψ_i > ε (i = 1,...,p) for small ε > 0, and handle the problem in exactly the same manner as described in §1.3.2.

It may be noted that in factor analysis the canonical restrictions are quite convenient in maximum likelihood estimation, for they lead to an analytical solution for the m.l.e. of Λ for fixed Ψ, and thus reduce the numerical problem to minimization of a function of Ψ. Unfortunately, in our case the structure of ∂F̃/∂Λ in (3.5.9) makes such an analytic solution impossible. To use these restrictions in estimation would considerably complicate the numerical procedure. We have thus preferred to use the Gram-Schmidt restrictions in estimation, leaving the option of rotating to canonical estimates afterwards.

In the case where B is restricted, the estimates may be computed minimizing F using the same procedure described here, except that the partial derivatives are as given in Theorem 3.2.3, and the number of parameters to be estimated iteratively may increase considerably.
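A schematic Python version of the procedure just described (steepest descent followed by Davidon-Fletcher-Powell updates, with a crude backtracking line search in place of the interpolation of §1.3.2) is the following; it is intended only to fix ideas, not to reproduce the program actually used:

    import numpy as np

    def minimize_dfp(f, grad, theta0, n_steepest=5, max_iter=200, delta=1e-3):
        """Minimize f(theta) along the lines of section 3.6: a few steepest-descent
        iterations followed by Davidon-Fletcher-Powell updates of an inverse-Hessian
        approximation E, stopping when max |g_i| < delta."""
        def line_search(theta, d):
            step, f0 = 1.0, f(theta)
            while f(theta + step * d) > f0 and step > 1e-12:
                step *= 0.5
            return theta + step * d

        theta = np.asarray(theta0, dtype=float)
        E = np.eye(theta.size)                        # inverse-Hessian approximation
        g = grad(theta)
        for it in range(max_iter):
            if np.max(np.abs(g)) < delta:
                break
            d = -g if it < n_steepest else -E @ g     # steepest descent, then DFP direction
            theta_new = line_search(theta, d)
            g_new = grad(theta_new)
            s, y = theta_new - theta, g_new - g
            if it >= n_steepest and s @ y > 1e-12:    # DFP update
                E = E + np.outer(s, s) / (s @ y) - (E @ np.outer(y, y) @ E) / (y @ E @ y)
            theta, g = theta_new, g_new
        return theta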
IV. LARGE SAMPLE PROPERTIES OF THE ESTIMATORS

4.1 Introduction

The main purpose of this chapter is to establish the consistency and asymptotic normality of the maximum likelihood estimators of the free parameters in the latent linear model. It is well known that if θ̂ is the m.l.e. of a parameter θ based on n i.i.d. observations then, under certain regularity conditions, as n → ∞

    √n(θ̂ - θ) →d N[0, I^{-1}(θ)] ,    (4.1.1)

where I(θ) is the Fisher information matrix, see for example Kendall and Stuart (1961, pp. 54-55) or Zacks (1971, pp. 246-257). Unfortunately, in linear models the columns of X, though independent, are not identically distributed random vectors. Thus the standard results mentioned above cannot be applied to our problem; additional definitions and assumptions must be introduced and existing proofs must be adapted.

The approach adopted is the following. In §4.2 we state a basic assumption regarding the nature of the limiting process, and establish conditions for the consistency and asymptotic normality of least squares estimators in linear models. In §4.3 we consider a family of structural linear models where the reduced form parameters β and V are regular functions of a structural parameter θ, and use the results of the previous section to establish the consistency of an estimator θ̂_n of θ obtained by minimizing the function F of Chapter 3. In §4.4 we introduce the concept of the limiting Fisher information matrix I(θ), which plays a central role in our treatment of asymptotic distribution theory; obtain the second derivatives of the function F; use these to derive expressions for the elements of I(θ) for the family of structural linear models introduced in the previous section, and show that I(θ̂_n) is a consistent estimator of I(θ). In §4.5 we establish some additional convergence results and show that the asymptotic distribution of √n(θ̂_n - θ) in the family of structural linear models under consideration is as given in (4.1.1), but with I(θ) denoting the limiting (rather than the ordinary) Fisher information matrix. In §4.6 we proceed to specialize the results obtained to the case of the latent linear model. The main task here is to obtain expressions for the elements of I(θ) for the latent linear model. Finally, in §4.7 we prove an asymptotic result which provides large sample approximations to the second derivatives of the function F̃ of §3.5. These approximations are used in the Fletcher-Powell procedure for minimizing F̃ in §3.6.
4.2 Asymptotic Results for Linear Models
Consider the multivariate general linear model

    X_n = βA_n + U_n ,    (4.2.1)

where A_n is of full row rank r < n and the columns of U_n are i.i.d. random vectors with mean vector 0 and p.d. variance-covariance matrix V.

Define the least squares estimators of β and V,

    β̂_n = X_nA_n'(A_nA_n')^{-1}    (4.2.2)

and

    V̂_n = (1/n)(X_n - β̂_nA_n)(X_n - β̂_nA_n)' ,    (4.2.3)

which are also m.l.e.'s if U_n ~ N_{p×n}(0, V ⊗ I_n), see §A.4. We now consider the large sample behavior of β̂_n and V̂_n.

4.2.1 Conditions on the Limiting Process

The linear model (4.2.1) is conditional on a design matrix A_n of fixed, known constants. As n
increases, however, this matrix is
modified by the addition of new columns.
Clearly some stability con-
ditions must be imposed on this process.
In this regard we introduce
Assumption 4.2.1. Let {A_n} be a sequence of r×n matrices of fixed, known constants, and define Q_n = (1/n)A_nA_n'. We assume that

    Q = lim_{n→∞} Q_n    (4.2.4)

exists and is positive definite.

This assumption (or an equivalent one) has been used by Eicker (1963) and Sen and Puri (1970) in studies of asymptotic properties of least squares estimators in linear models. Also, it is frequently considered in the econometric literature, see for example Theil (1971, Ch. 8). We now consider some implications of this assumption in applied situations.
Consider a multivariate one-way analysis of variance problem for which the design matrix A_n consists of dummy variates such that a_ij = 1 if observation j is on treatment i and a_ij = 0 otherwise. Let n_i denote the number of observations on the i-th treatment (i = 1,...,r) and n = Σn_i. Then

    Q_n = (1/n)A_nA_n' = diag(n_1/n, ..., n_r/n)    (4.2.5)

and Assumption 4.2.1 is equivalent to requiring that as n → ∞, n_i/n → π_i with 0 < π_i < 1 (i = 1,...,r), so that the sample size increases for all treatments.
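The behaviour of Q_n in this case is easily illustrated numerically (Python, with hypothetical limiting proportions π):

    import numpy as np

    # Q_n = (1/n) A_n A_n' for a one-way layout converges to diag(pi_1,...,pi_r)
    # when the group proportions stabilize (Assumption 4.2.1).
    rng = np.random.default_rng(0)
    pi = np.array([0.2, 0.3, 0.5])                    # hypothetical limiting proportions
    for n in (30, 300, 30000):
        labels = rng.choice(len(pi), size=n, p=pi)
        A = np.eye(len(pi))[labels].T                 # r x n matrix of dummy variates
        Qn = A @ A.T / n
        print(n, np.round(np.diag(Qn), 3))            # approaches diag(pi) as n grows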
Or consider a multivariate multiple regression problem where the design matrix A_n consists of regressors or explanatory variables. Although the analysis is conditional on fixed values of the regressors, the columns of A_n, say a_α (α = 1,...,n), may be regarded as values taken by a stochastic vector. Suppose that E(a_α) = μ_a and Var(a_α) = Σ_a, so that E(a_αa_α') = Σ_a + μ_aμ_a'. Then by Kolmogorov's strong law of large numbers, see e.g. Rao (1965, p. 94), as n → ∞

    Q_n = (1/n) Σ_{α=1}^n a_αa_α'  →  Σ_a + μ_aμ_a'   a.s.    (4.2.6)

Hence if Σ_a is p.d., Assumption 4.2.1 will be satisfied for almost all sample sequences {A_n}.
An alternative way to conceptualize the limiting process is to consider a sequence of replications of a basic experiment with design matrix A: r×k, such that the n-th experiment in the sequence has design matrix

    A_n = [A, A, ..., A] ,    (4.2.7)

with n matrices on the right hand side. If A is of full row rank r, then this sequence satisfies Assumption 4.2.1 trivially, for Q_n = (1/k)AA' = Q for all n. While this more restrictive assumption leads to somewhat simpler proofs of asymptotic results, it becomes difficult to justify the use of those results as approximations in applied situations, such as analysis of covariance problems, where the columns of A_n may be all different, and hence A_n may not be considered a member of a sequence of form (4.2.7). In this regard it is important to bear in mind that asymptotic results are useful to the extent that they can be used as approximations in large samples, and that this may well depend on the conceptual framework used in the derivations.
4.2.2 Consistency of β̂_n and V̂_n

We are now ready to prove the following result.

Lemma 4.2.1. Let X_n satisfy the linear model (4.2.1) with design matrix A_n (n = r+1, r+2, ...). If the sequence of design matrices {A_n} satisfies Assumption 4.2.1, then as n → ∞

    β̂_n →P β   and   V̂_n →P V ,    (4.2.8)

i.e. β̂_n and V̂_n are consistent estimators of β and V.

Proof. Consider first β̂_n. From (4.2.2), noting that E(X_n) = βA_n,

    E(β̂_n) = β .    (4.2.9)

On the other hand, from Anderson (1958, p. 182),

    Var(β̂_n) = V ⊗ (A_nA_n')^{-1} = (1/n) V ⊗ Q_n^{-1} .    (4.2.10)

From Assumption 4.2.1 we have, as n → ∞,

    Var(β̂_n) → 0 ,    (4.2.11)

which in view of (4.2.9) implies that β̂_n converges to β in quadratic mean, and hence also in probability, see Rao (1965, p. 90).

Consider now V̂_n. From (4.2.2) and (4.2.3),

    V̂_n = (1/n)(X_n - β̂_nA_n)(X_n - β̂_nA_n)' .    (4.2.12)

Let us define

    T_n = (1/n)(X_n - βA_n)(X_n - βA_n)' .    (4.2.13)

Expanding (4.2.13) and using (4.2.2) we obtain

    nT_n = X_nX_n' - β̂_nA_nA_n'β̂_n' + (β̂_n - β)A_nA_n'(β̂_n - β)' ,    (4.2.14)

and on substituting this into (4.2.12) we find, after simplification,

    V̂_n = T_n - (β̂_n - β)Q_n(β̂_n - β)' .    (4.2.15)

Now from (4.2.13) and (4.2.1), T_n may be written as

    T_n = (1/n)U_nU_n' = (1/n) Σ_{α=1}^n u_αu_α' ,    (4.2.16)

where the u_αu_α' are i.i.d. random matrices with E(u_αu_α') = V for all α. Hence by Khinchin's law of large numbers, see Rao (1965, p. 92), as n → ∞

    T_n →P V .    (4.2.17)

On the other hand, since β̂_n →P β and, for each n, the quadratic form in the second term of (4.2.15) is a continuous function of β̂_n, we have, using Assumption 4.2.1 and result (xiii) in Rao (1965, p. 104), that as n → ∞

    (β̂_n - β)Q_n(β̂_n - β)' →P 0 .    (4.2.18)

In view of (4.2.15), the lemma follows from (4.2.17) and (4.2.18). □

This lemma generalizes Theorem 1 in Eicker (1963), who proves consistency of β̂_n in the univariate case. For related results see Jennrich (1969) and Sen and Puri (1970).
4.2.3 Asymptotic Normality of β̂_n and V̂_n

It is well known that if U_n ~ N_{p×n}(0, V ⊗ I_n) then β̂_n ~ N_{p×r}[β, V ⊗ (A_nA_n')^{-1}] and nV̂_n ~ W_p(V, n-r), independently of β̂_n; see for example Anderson (1958, p. 183). Asymptotically we have

Lemma 4.2.2. Let X_n satisfy the linear model (4.2.1) with design matrix A_n and U_n ~ N_{p×n}(0, V ⊗ I_n), (n = r+1, r+2, ...). If the sequence of design matrices {A_n} satisfies Assumption 4.2.1, then as n → ∞

    √n(β̂_n - β) →d N_{p×r}[0, V ⊗ Q^{-1}]    (4.2.19)

and, independently of β̂_n,

    √n(V̂_n - V) →d N[0, ((v_ik v_jl + v_il v_jk))] ,    (4.2.20)

where the notation used in (4.2.20) indicates that the asymptotic covariance of the (i,j)-th and (k,l)-th elements of √n V̂_n is v_ik v_jl + v_il v_jk. Thus β̂_n and V̂_n have asymptotically a joint normal distribution.

Proof: The first part of the lemma follows from the fact that for every fixed n

    √n(β̂_n - β) ~ N_{p×r}[0, V ⊗ Q_n^{-1}] ,    (4.2.21)

and from Assumption 4.2.1.

To prove the second part note that from (4.2.15)

    √n(V̂_n - V) = √n(T_n - V) - √n(β̂_n - β)Q_n(β̂_n - β)' .    (4.2.22)

From (4.2.16), T_n is the average of n i.i.d. random matrices u_αu_α' with E(u_αu_α') = V and covariances given by

    Cov[(u_αu_α')_ij, (u_αu_α')_kl] = v_ik v_jl + v_il v_jk   (α = 1,...,n) ,    (4.2.23)

see Anderson (1958, p. 39). Thus by the multivariate version of the Lindeberg-Levy central limit theorem, see Rao (1965, pp. 107-109), as n → ∞

    √n(T_n - V) →d N[0, ((v_ik v_jl + v_il v_jk))] .    (4.2.24)

Consider now the quadratic form on the right hand side of (4.2.22). In Lemma 4.2.1 we proved that β̂_n →P β. This result, however, can be strengthened to

    n^{1/4}(β̂_n - β) →P 0 ,    (4.2.25)

because from (4.2.9) the expectation of the left hand side is 0, and from (4.2.10) its variance is

    n^{1/2} Var(β̂_n - β) = n^{-1/2} V ⊗ Q_n^{-1} ,    (4.2.26)

which converges to 0 as n → ∞. Since the quadratic form of interest is a continuous function of n^{1/4}(β̂_n - β) for each n, from Assumption 4.2.1 and (4.2.25) we obtain that as n → ∞

    √n(β̂_n - β)Q_n(β̂_n - β)' →P 0 .    (4.2.27)

In view of (4.2.22), the rest of the proof follows from (4.2.24), (4.2.27) and Slutzky's theorem, see for example result (ix) in Rao (1965, p. 101). □
The second part of Lemma 4.2.2 is analogous to Theorem 4.2.4 in Anderson (1958, p. 75), establishing the asymptotic normality of the sample covariance matrix.

The results obtained in §4.5 below, regarding the asymptotic normality of m.l.e.'s in structural linear models, depend only on β̂_n and V̂_n having the asymptotic joint distribution specified in Lemma 4.2.2. It may be of interest to investigate whether the assumption of normality in the lemma may be weakened, so that estimates obtained minimizing the function F of (3.2.6) may be asymptotically normally distributed even if the distribution of U_n is not normal. In this regard we note the following.

Sen and Puri (1970) have shown that if the columns of U_n in (4.2.1) are i.i.d. with mean vector 0 and variance-covariance matrix V, and if the sequence of design matrices {A_n} satisfies Assumption 4.2.1 and the generalized Noether condition

    max_{1≤i≤r, 1≤j≤n} { a_ij² / Σ_{α=1}^n a_iα² } → 0   as n → ∞ ,    (4.2.28)

then √n(β̂_n - β) has the asymptotic distribution (4.2.19).

From the proof of the second part of the lemma, on the other hand, it is seen that if Assumption 4.2.1 is satisfied and if the fourth moments of u_α exist, then √n(V̂_n - V) is asymptotically normal with mean 0 and covariances depending on the fourth moments. If the fourth cumulants of u_α are zero then the covariances are as given in (4.2.20); see Cramer (1946, pp. 365-366) or Kendall and Stuart (1969, p. 321).

Finally, proceeding as in the proof of Theorem 4.3.2 in Anderson (1958, p. 83), it can be shown that Cov(β̂_n, V̂_n) = 0 for all n. Thus β̂_n and V̂_n are uncorrelated, and their asymptotic marginal distributions are as given in Lemma 4.2.2 under assumptions weaker than normality. These weaker conditions, however, do not appear to be sufficient for the asymptotic joint distribution of β̂_n and V̂_n to be as given in the lemma.
4.3 Consistency of θ̂_n in Structural Linear Models

We now consider a family of structural linear models where X_n satisfies (4.2.1) and the reduced form parameters β and V are continuous differentiable functions of a structural parameter vector θ. Consider an estimator θ̂_n of θ obtained by minimizing the function F first defined in (3.2.6),

    F = log|V| + tr V^{-1}T_n .    (4.3.1)

Note that θ̂_n is the m.l.e. of θ if U_n ~ N_{p×n}(0, V ⊗ I_n).

Using (4.2.2), (4.2.3) and proceeding as in (4.2.15), we can write

    F = log|V| + tr V^{-1}[V̂_n + (β̂_n - β)Q_n(β̂_n - β)'] .    (4.3.2)

Thus F depends on the observations X_n only through β̂_n and V̂_n. Under normality assumptions this means that β̂_n and V̂_n are sufficient, though not necessarily minimal sufficient, statistics for the structural parameter θ. We now take advantage of this fact in proving consistency of θ̂_n.

Theorem 4.3.1. Let X_n satisfy (4.2.1) with design matrix A_n (n = r+1, r+2, ...), let β and V be continuous differentiable functions of a structural parameter θ, and let θ̂_n minimize the function (4.3.1). If the sequence of design matrices {A_n} satisfies Assumption 4.2.1 and if θ is identified, then as n → ∞

    θ̂_n →P θ ,    (4.3.3)

i.e. θ̂_n is a consistent estimator of θ.

Proof: Let us use the notations θ*, β* and V* for arbitrary values assigned to the parameters θ, β and V. Since by (4.3.2) F is a continuous function of β̂_n and V̂_n, and since by Lemma 4.2.1 β̂_n →P β and V̂_n →P V, we have, using Assumption 4.2.1 and result (xiii) in Rao (1965, p. 104), that as n → ∞

    p lim F(θ*) = log|V*| + tr V*^{-1}[V + (β - β*)Q(β - β*)'] .    (4.3.4)

We now prove that if θ* ≠ θ then

    p lim F(θ*) > p lim F(θ) ,    (4.3.5)

which implies that in the probability limit F has a unique minimum at θ* = θ. Now

    p lim F(θ*) - p lim F(θ)
        = log|V*| + tr V*^{-1}[V + (β - β*)Q(β - β*)'] - log|V| - p    (4.3.6)
        ≥ log|V*| - log|V| + tr V*^{-1}V - p    (4.3.7)
        = Σ_{i=1}^p (λ_i - log λ_i - 1)    (4.3.8)
        ≥ 0 ,    (4.3.9)

where (4.3.6) follows from (4.3.4), (4.3.7) follows from the fact that (β - β*)Q(β - β*)' is a non-negative definite quadratic form, (4.3.8) follows by letting λ_i = ch_i(V*^{-1}V) and using the relations of traces and determinants to latent roots, and (4.3.9) follows from the fact that if x is non-negative then x - log x - 1 ≥ 0. This part of the proof is based on an argument of Watson (1964) reported by Rao (1965, p. 449) in connection with maximum likelihood estimation of the parameters of a multinormal distribution.

If θ is identified, θ* ≠ θ implies that β* ≠ β or V* ≠ V (or both). If β* ≠ β then, Q being positive definite, the quadratic form (β - β*)Q(β - β*)' is non-zero and the inequality in (4.3.7) is strict. If V* ≠ V then the latent roots of V*^{-1}V are not all unity and thus the inequality in (4.3.9) is strict. Hence the strict inequality in (4.3.5).

Since F is a continuous differentiable function of the structural parameter, and from (4.3.5) in the probability limit it has a unique minimum at θ* = θ, its partial derivatives must vanish there. Thus there exists a root of ∂F/∂θ = 0 which is consistent for θ. If there are multiple roots, care must be exercised to avoid local minima. To this end we define θ̂_n as a root satisfying

    F(θ̂_n) ≤ F(θ*)   for all permissible θ* .    (4.3.10)

By (4.3.5), a root so determined is consistent, and for sufficiently large n is unique. This part of the proof is adapted from Rao (1965, p. 300) and Kendall and Stuart (1961, pp. 40-41). Note that the proof does not depend on distributional assumptions. □
In our experience with the latent linear model, roots of ∂F/∂θ = 0 found so far have been unique. On the basis of experience accumulated in factor analysis, see Joreskog (1967), we conjecture that estimates such that ψ̂_i > ε (i = 1,...,p) are unique, while in improper cases where ψ̂_i = ε for some i there may be multiple solutions giving equal minimum values of F.
4.4 The Information Matrix for Structural Linear Models

We now introduce the concept of the limiting Fisher information matrix I(θ), and derive expressions for the elements of I(θ) for the family of structural linear models considered in the previous section.

4.4.1 The Limiting Fisher Information Matrix

Let x_1, ..., x_n be independent random vectors and let x_α have density f_α(x; θ) depending on a parameter θ, (α = 1,...,n). It is not necessary that each density depend on all the elements of θ, but each θ_i should appear in some of the densities.

The Fisher information matrix relative to θ associated with the α-th observation is

    I_α(θ) = E[(∂ log f_α/∂θ)(∂ log f_α/∂θ)'] ,    (4.4.1)

which under certain regularity conditions, see Zacks (1971, pp. 182-183), may be written as

    I_α(θ) = -E[∂² log f_α/∂θ∂θ'] .    (4.4.2)

Since x_1, ..., x_n are independent, the Fisher information matrix associated with the first n observations is Σ_{α=1}^n I_α(θ). We now introduce

Definition 4.4.1. The limiting Fisher information matrix relative to θ associated with the sequence of independent random vectors {x_α} is defined as

    I(θ) = lim_{n→∞} (1/n) Σ_{α=1}^n I_α(θ) ,    (4.4.3)

provided the indicated limit exists.

If the x_α are i.i.d., then I(θ) reduces to the ordinary Fisher information matrix. This definition was motivated by the work of Wald (1948).

By analogy with the standard result (4.1.1) for the case of i.i.d. observations, it becomes natural to ask under what conditions on the sequence of densities {f_α}

    √n(θ̂_n - θ) →d N[0, I^{-1}(θ)]   as n → ∞ ,    (4.4.4)

where θ̂_n is the m.l.e. of θ based on n independent observations (obtained maximizing log L = Σ_α log f_α) and I(θ) is the limiting Fisher information matrix. This problem has been considered by Bradley and Gart (1962), but the general conditions given there are rather difficult to verify in specific cases. In §4.5 we therefore give a direct proof of (4.4.4) for the case of structural linear models. First, however, we obtain the second derivatives of F and derive expressions for the elements of I(θ).

4.4.2 Second Derivatives of F.
Lemma 4.4.1. Let F be the function defined in (3.2.6) or (4.3.1), and let the reduced form parameters be twice differentiable functions of a structural parameter θ. Then

    ∂F/∂θ_i = -(2/n) tr[A_n(X_n - βA_n)'V^{-1}(∂β/∂θ_i)] + tr[V^{-1}(V - T_n)V^{-1}(∂V/∂θ_i)]    (4.4.5)

and

    ∂²F/∂θ_i∂θ_j = (2/n) tr[A_nA_n'(∂β'/∂θ_i)V^{-1}(∂β/∂θ_j)]
                 - (2/n) tr[A_n(X_n - βA_n)'V^{-1}(∂²β/∂θ_i∂θ_j)]
                 + (2/n) tr[A_n(X_n - βA_n)'V^{-1}(∂V/∂θ_j)V^{-1}(∂β/∂θ_i)]
                 + (2/n) tr[A_n(X_n - βA_n)'V^{-1}(∂V/∂θ_i)V^{-1}(∂β/∂θ_j)]
                 - tr[V^{-1}(∂V/∂θ_i)V^{-1}(∂V/∂θ_j)]
                 + tr[V^{-1}(∂V/∂θ_i)V^{-1}T_nV^{-1}(∂V/∂θ_j)] + tr[V^{-1}T_nV^{-1}(∂V/∂θ_i)V^{-1}(∂V/∂θ_j)]
                 + tr[V^{-1}(V - T_n)V^{-1}(∂²V/∂θ_i∂θ_j)] .    (4.4.6)

Proof: In our proof we use repeatedly several results on matrix differentiation given in the appendix, particularly (A.3.16), (A.3.20), (A.3.21) and (A.3.22).

Differentiating F with respect to θ_i using (A.3.16) gives

    ∂F/∂θ_i = tr[(∂F/∂β)'(∂β/∂θ_i)] + tr[(∂F/∂V)'(∂V/∂θ_i)] ,    (4.4.7)

and using Lemma 3.2.1 for the derivatives of F with respect to β and V we obtain

    ∂F/∂θ_i = -(2/n) tr[A_n(X_n - βA_n)'V^{-1}(∂β/∂θ_i)] + tr[V^{-1}(V - T_n)V^{-1}(∂V/∂θ_i)] ,    (4.4.8)

where T_n is defined in (4.2.13). This proves (4.4.5).

Let us introduce the notations

    C_i = (1/n) A_n(X_n - βA_n)'V^{-1}(∂β/∂θ_i)    (4.4.9)

and

    D_i = V^{-1}(V - T_n)V^{-1}(∂V/∂θ_i) ,    (4.4.10)

so that on differentiating (4.4.8) with respect to θ_j we obtain

    ∂²F/∂θ_i∂θ_j = -2 ∂tr C_i/∂θ_j + ∂tr D_i/∂θ_j .    (4.4.11)

We now differentiate tr C_i with respect to θ_j. Using (A.3.16),

    ∂tr C_i/∂θ_j = tr[(∂tr C_i/∂β)'(∂β/∂θ_j)] + tr[(∂tr C_i/∂V)'(∂V/∂θ_j)] + tr[(∂tr C_i/∂(∂β/∂θ_i))'(∂²β/∂θ_i∂θ_j)] .    (4.4.12)

The derivatives needed in (4.4.12) are given in (4.4.13)-(4.4.15) and follow from (A.3.21). Substituting them into (4.4.12), taking care to use the transposes where required, and using the result in conjunction with (4.4.11) gives the first three terms of (4.4.6).

We now differentiate tr D_i with respect to θ_j. Using (A.3.16),

    ∂tr D_i/∂θ_j = tr[(∂tr D_i/∂β)'(∂β/∂θ_j)] + tr[(∂tr D_i/∂V)'(∂V/∂θ_j)] + tr[(∂tr D_i/∂(∂V/∂θ_i))'(∂²V/∂θ_i∂θ_j)] .    (4.4.16)

The derivatives needed in (4.4.16) are given in (4.4.17)-(4.4.19) and follow from (A.3.20)-(A.3.22), by an argument similar to that used in (A.4.5)-(A.4.10). Substituting them into (4.4.16), taking the transposes as required, and using the result in (4.4.11) gives the last four terms in (4.4.6). This completes the proof of the lemma. □
4.4.3 The Limiting Fisher Information Matrix of Structural Linear Models

We can now prove the following result.

Theorem 4.4.2. Let X_n satisfy (4.2.1) with design matrix A_n and U_n ~ N_{p×n}(0, V ⊗ I_n) (n = r+1, r+2, ...), and let the reduced form parameters β and V be twice differentiable functions of the structural parameter θ. If the sequence of design matrices {A_n} satisfies Assumption 4.2.1, then the elements of the limiting Fisher information matrix of Definition 4.4.1 are given by

    I(θ_i, θ_j) = tr[Q(∂β'/∂θ_i)V^{-1}(∂β/∂θ_j)] + (1/2) tr[V^{-1}(∂V/∂θ_i)V^{-1}(∂V/∂θ_j)] .    (4.4.20)

Further, if θ is identified then I(θ) is positive definite.

Proof: Under the assumed regularity conditions,

    I(θ) = -lim_{n→∞} (1/n) E[∂² log L/∂θ∂θ'] ,    (4.4.21)

and if θ is identified then I(θ) is non-singular, see Silvey (1959, pp. 81-82). Since log L = c - (n/2)F, where c is a constant, we have

    -(1/n) E[∂² log L/∂θ∂θ'] = (1/2) E[∂²F/∂θ∂θ'] .    (4.4.22)

Consider the second derivatives of F given in Lemma 4.4.1. By Assumption 4.2.1, as n → ∞ the first term in (4.4.6), which contains no random variables, converges to

    2 tr[Q(∂β'/∂θ_i)V^{-1}(∂β/∂θ_j)] .    (4.4.23)

Under the assumptions of the model, for each n,

    E(X_n) = βA_n .    (4.4.24)

Using this result in (4.4.6) we see that on taking expectations the second, third and fourth terms vanish.

Also under the assumptions of the model, for each n,

    E(T_n) = (1/n) Σ_α E(u_αu_α') = V .    (4.4.25)

Using this result in (4.4.6) we see that on taking expectations the seventh term vanishes, while the sixth becomes

    2 tr[V^{-1}(∂V/∂θ_i)V^{-1}(∂V/∂θ_j)] .    (4.4.26)

On subtracting from this the fifth term in (4.4.6), which is non-stochastic and does not depend on n, and recalling expression (4.4.23) for the first term, we obtain

    lim_{n→∞} E[∂²F/∂θ_i∂θ_j] = 2 tr[Q(∂β'/∂θ_i)V^{-1}(∂β/∂θ_j)] + tr[V^{-1}(∂V/∂θ_i)V^{-1}(∂V/∂θ_j)] .    (4.4.27)

In view of (4.4.22), the theorem follows from (4.4.27) after dividing by 2. □
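Expression (4.4.20) lends itself to direct computation once ∂β/∂θ_i and ∂V/∂θ_i are available; the following Python sketch approximates these derivatives by central differences, so that it applies to any structural linear model supplied as a pair of functions β(θ) and V(θ):

    import numpy as np

    def limiting_information(theta, beta_fn, V_fn, Q, eps=1e-6):
        """Elements (4.4.20) of the limiting Fisher information matrix,
        I(i,j) = tr[Q dbeta'/dtheta_i V^{-1} dbeta/dtheta_j]
                 + (1/2) tr[V^{-1} dV/dtheta_i V^{-1} dV/dtheta_j],
        with the derivatives approximated by central differences.
        beta_fn(theta) returns the p x r reduced-form regression matrix,
        V_fn(theta) the p x p dispersion matrix, and Q is the r x r limit of Q_n."""
        theta = np.asarray(theta, dtype=float)
        m = theta.size
        V_inv = np.linalg.inv(V_fn(theta))

        def diff(fn, i):
            e = np.zeros(m); e[i] = eps
            return (fn(theta + e) - fn(theta - e)) / (2 * eps)

        dbeta = [diff(beta_fn, i) for i in range(m)]
        dV = [diff(V_fn, i) for i in range(m)]
        I = np.zeros((m, m))
        for i in range(m):
            for j in range(m):
                I[i, j] = (np.trace(Q @ dbeta[i].T @ V_inv @ dbeta[j])
                           + 0.5 * np.trace(V_inv @ dV[i] @ V_inv @ dV[j]))
        return I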
Jennrich (1970) asserted that if x_1, ..., x_n are i.i.d. N_p(μ, V) where μ and V depend on a structural parameter θ, then the elements of the (ordinary) Fisher information matrix are given by

    I(θ_i, θ_j) = (∂μ'/∂θ_i)V^{-1}(∂μ/∂θ_j) + (1/2) tr[V^{-1}(∂V/∂θ_i)V^{-1}(∂V/∂θ_j)] ,    (4.4.28)

and indicated that the result can be proved by appealing to a result concerning the expected value of the product of four jointly normally distributed random variables. In a later paper, Jennrich (1974) used this result to derive simplified formulae for asymptotic standard errors in factor analysis, in which case μ does not depend on structural parameters and thus the first term on the right hand side of (4.4.28) drops out. See in this regard §1.3.4.

We remark that (4.4.28) may be obtained as a special case of Theorem 4.4.2 by letting r = 1 and A_n = 1'_n for all n, and noting that Q_n = (1/n)1'_n1_n = 1 and β = μ is p×1, so that the quantity inside the brackets in the first term of (4.4.20) is a scalar. Note that by taking second derivatives we do not need to appeal in the proof to results other than the first two moments of X_n.

We now consider estimation of I(θ).
Lemma 4.4.3. Let θ̂_n be the m.l.e. of θ. Then under the assumptions of Theorem 4.4.2, as n → ∞

    I(θ̂_n) →P I(θ) ,    (4.4.29)

i.e. I(θ) is estimated consistently by I(θ̂_n).

Proof: By Theorem 4.3.1, θ̂_n →P θ. Since ∂β/∂θ_i and ∂V/∂θ_i are continuous functions of θ, their values at θ̂_n will converge stochastically to their values at θ. Since the elements of I(θ) are in turn continuous functions of ∂β/∂θ_i, ∂V/∂θ_i and V^{-1}, they will converge stochastically to I(θ). This completes the proof. □

In actual practice, the matrix Q appearing in I(θ̂_n) would be replaced by Q_n, which by Assumption 4.2.1 converges to Q as n → ∞.
4.5 Asymptotic Normality of θ̂_n in Structural Linear Models

We are now ready to prove asymptotic normality of the m.l.e. θ̂_n in structural linear models. We first note that since by (4.3.2) the function F depends on the observations only through β̂_n and V̂_n, the estimator θ̂_n is an implicit function of β̂_n and V̂_n defined by ∂F(β̂_n, V̂_n; θ)/∂θ = 0. If first and second derivatives of this implicit function exist in a neighborhood of θ, and if the asymptotic joint distribution of β̂_n and V̂_n is multivariate normal, then the limiting distribution of √n(θ̂_n - θ) is normal; see for example Theorem 4.2.5 in Anderson (1958, pp. 76-77). While this approach shows rather clearly the nature of the problem, derivation of the asymptotic distribution is rather difficult, for it involves the derivatives of the implicit function defining θ̂_n in terms of β̂_n and V̂_n. We shall therefore adopt a more indirect approach. For clarity in the argument we prove first

Lemma 4.5.1. Let F be the function defined in (3.2.6) or (4.3.1), and let the reduced form parameters β and V be twice differentiable functions of a structural parameter θ which is identified. If the sequence of design matrices {A_n} satisfies Assumption 4.2.1, then as n → ∞

    (1/2) ∂²F(θ)/∂θ∂θ' →P I(θ) ,    (4.5.1)

where I(θ) is the limiting Fisher information matrix given in Theorem 4.4.2 and θ is any permissible value of the structural parameter vector.
Consider the second derivatives of
Lemma 4.4.1.
By
4.2.1, as
Assun~tion
n
+
F given in (4.4.6) of
00
the first term in
(4.4.6) converges to
(4.5.2)
Using (4.2.2) we can write
(4.5.3)
~
Since by Lemma 4.2.1
tions of the lemma, as
n
-+-
g~
and since
-+-
Q'
under the assump-
00
ln A (X -BA
)'
"""11
~
2n
~
£ ,..0
.
(4.5.4)
79
This result implies that as
n
~
the second, third and fourth
00
terms in (4.4.6) converge stochastically to zero.
Recall now from (4.2.17) that as
P
T
~
V
This result implies that as
n
~
"1l
n
~
00
(4.5.5)
~
00
the seventh term in (4.4.6)
converges stochastically to zero, while the sixth term converges
stochastically to
~
2tr y-
1
ay
-ae.
1
V
~
-1
-
a~]
ae.1 .
(4.5.6)
On subtracting from this the fifth term, which does not depend on
n, and recalling expression (4.5.2) for the first term, we find that
as n
~
00
(4.5.7)
This lemma then follows by comparing (4.5.7) with (4.4.20) in
Theorem 4.4.2, which gives
o
leg).
Suppose now that we evaluate the second partial derivatives of
at a value
e
"1l
which converges stochastically to
~
V
"1l
g.
Let
~ and
~
correspond to
we have
~ ~~
~
e.
Since these are continuous functions of
"1l
and
F
V
"1l
~ V , and thus as
e
~
n~oo
~
(~-~) P 0
~
(V -V )
"1l
n
g0 .
(4.5.8)
(4.5.9)
Using these results and proceeding as in the proof of Lemma 4.5.1
we obtain
80
Corollary 4.5.2. Under the assumptions of Lemma 4.5.1, if θ̃_n is a vector converging stochastically to θ, then as n → ∞

    (1/2) ∂²F(θ̃_n)/∂θ∂θ' →P I(θ) ,    (4.5.10)

where I(θ) is the limiting Fisher information matrix of Theorem 4.4.2.

We are now ready to establish the main result of this section.

Theorem 4.5.3. Let X_n satisfy (4.2.1) with design matrix A_n and U_n ~ N_{p×n}(0, V ⊗ I_n), (n = r+1, r+2, ...), and let the reduced form parameters β and V be twice differentiable functions of a structural parameter θ which is identified. If the sequence of design matrices {A_n} satisfies Assumption 4.2.1, then as n → ∞

    √n(θ̂_n - θ) →d N[0, I^{-1}(θ)] ,    (4.5.11)

where θ̂_n is the m.l.e. of θ and I(θ) is the limiting Fisher information matrix given in Theorem 4.4.2.
Proof: Consider the Taylor series expansion of ∂F(θ̂_n)/∂θ about the true parameter value θ,

    ∂F(θ̂_n)/∂θ = ∂F(θ)/∂θ + [∂²F(θ̄_n)/∂θ∂θ'](θ̂_n - θ) ,    (4.5.12)

where θ̄_n = vθ̂_n + (1 - v)θ, for some 0 ≤ v ≤ 1, is a point in the line segment connecting θ̂_n and θ. Since θ̂_n minimizes F, the left hand side of (4.5.12) vanishes; on inserting a factor of (1/2)√n we obtain

    -(√n/2) ∂F(θ)/∂θ = [(1/2) ∂²F(θ̄_n)/∂θ∂θ'] √n(θ̂_n - θ) .    (4.5.13)

Since by Theorem 4.3.1 θ̂_n →P θ as n → ∞, it follows that θ̄_n →P θ as n → ∞. Then by Corollary 4.5.2, as n → ∞

    (1/2) ∂²F(θ̄_n)/∂θ∂θ' →P I(θ) .    (4.5.14)

Using this result in (4.5.13) and noting that by identifiability I(θ) is non-singular, we obtain that

    √n(θ̂_n - θ) ≍ -I^{-1}(θ) (√n/2) ∂F(θ)/∂θ ,    (4.5.15)

where the symbol ≍ in (4.5.15) indicates that the vectors on the left and right hand side are asymptotically stochastically equivalent, i.e. as n → ∞ their difference converges in probability to zero.

Consider now the random vector -(√n/2)∂F(θ)/∂θ. From (4.4.5) and (4.5.3),

    -(√n/2) ∂F(θ)/∂θ_i = √n tr[Q_n(β̂_n - β)'V^{-1}(∂β/∂θ_i)] - (√n/2) tr[V^{-1}(V - T_n)V^{-1}(∂V/∂θ_i)] .    (4.5.16)

Since E(β̂_n) = β and E(T_n) = V, we have from (4.5.16)

    E[-(√n/2) ∂F(θ)/∂θ] = 0 .    (4.5.17)

Consider now the variance-covariance matrix of -(√n/2)∂F(θ)/∂θ. This depends on the variances and covariances of β̂_n and T_n, but it is easier to derive from the relationship between F and log L. Since log L = c - (n/2)F, we have ∂F(θ)/∂θ = -(2/n)∂ log L/∂θ and thus

    Var[-(√n/2) ∂F(θ)/∂θ] = (1/n) Var[∂ log L/∂θ] .    (4.5.18)

But under the assumed regularity conditions

    (1/n) E[(∂ log L/∂θ)(∂ log L/∂θ)'] = -(1/n) E[∂² log L/∂θ∂θ'] = (1/n) Σ_{α=1}^n I_α(θ) ,    (4.5.19)

where I_α(θ) is the information matrix associated with the α-th observation, see (4.4.2). Using (4.5.19) in (4.5.18) we find that the variance-covariance matrix is

    Var[-(√n/2) ∂F(θ)/∂θ] = (1/n) Σ_{α=1}^n I_α(θ) ,    (4.5.20)

which converges to I(θ), the limiting Fisher information, as n → ∞.

Having obtained the means and covariances, note from (4.5.16) that the vector ∂F(θ)/∂θ is a function of β̂_n and T_n with first and second derivatives existing in a neighborhood of β and V. Furthermore, by Lemma 4.2.2 the asymptotic joint distribution of √n(β̂_n - β) and √n(V̂_n - V) is multivariate normal under the conditions of the present theorem. Therefore, by Theorem 4.2.5 in Anderson (1958, p. 77), the limiting distribution of -(√n/2)∂F(θ)/∂θ is multivariate normal. In view of (4.5.17) and (4.5.20),

    -(√n/2) ∂F(θ)/∂θ →d N[0, I(θ)] ,    (4.5.21)

where I(θ) is the limiting Fisher information matrix. From this result we obtain that

    -I^{-1}(θ) (√n/2) ∂F(θ)/∂θ →d N[0, I^{-1}(θ)] .    (4.5.22)

Then the theorem follows by recalling (4.5.15) and by Slutzky's theorem, see for example result (ix) in Rao (1965, p. 101). The basic argument in the proof is adapted from Kendall and Stuart (1961, pp. 54-55). □
In this theorem we have assumed that U_n ~ N_{p×n}(0, V ⊗ I_n) and have taken advantage of this condition in obtaining the asymptotic variance-covariance matrix. From (4.5.16), however, it is clear that √n(θ̂_n - θ) will have the asymptotic distribution (4.5.11) whenever the joint asymptotic distribution of √n(β̂_n - β) and √n(V̂_n - V) is as given in Lemma 4.2.2. If interest centers on structural parameters that appear only in β or only in V, as is the case in factor analysis, then it is only required that the asymptotic marginal distributions of √n(β̂_n - β) and √n(V̂_n - V) be as given in Lemma 4.2.2. See in this regard the discussion in §4.2.3.
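In practice the result (4.5.11) is used to attach large-sample standard errors to the estimates: the asymptotic covariance matrix of θ̂_n is estimated by I(θ̂_n)^{-1}/n, as in the following small Python sketch:

    import numpy as np

    def asymptotic_standard_errors(I_theta_hat, n):
        """Large-sample standard errors implied by (4.5.11): the asymptotic
        covariance matrix of theta_hat is I(theta)^{-1}/n, estimated at theta_hat."""
        cov = np.linalg.inv(I_theta_hat) / n
        return np.sqrt(np.diag(cov))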
4.6 Large Sample Theory for the Latent Linear Model
Let us now specialize the results obtained in the previous section
to the case of the latent linear model.
We summarize our results in
the following
Corollary 4.6.1. Let X_n satisfy the latent linear model (2.2.4)-(2.2.6) with design matrix A_n and U_n ~ N_{p×n}(0, V ⊗ I_n) (n = r+1, r+2, ...). Assume that the model has been identified in the manner described in §2.3.1, let θ denote a vector containing the free parameters in B, Λ, Φ and Ψ, and let θ̂_n be the m.l.e. of θ discussed in Chapter 3. If the sequence of design matrices {A_n} satisfies Assumption 4.2.1, then as n → ∞

    θ̂_n →P θ ,    (4.6.1)

i.e. θ̂_n is a consistent estimator of θ;

    √n(θ̂_n - θ) →d N[0, I^{-1}(θ)] ,    (4.6.2)

where I(θ) is the limiting Fisher information matrix of Definition 4.4.1; and furthermore

    I(θ̂_n) →P I(θ) ,    (4.6.3)

i.e. I(θ) is estimated consistently by I(θ̂_n).

Proof: The proof follows from Theorem 4.3.1, Lemma 4.4.3 and Theorem 4.5.3, and the fact that in (2.2.5) and (2.2.6) β and V are twice differentiable functions of B, Λ, Φ and Ψ, and hence of θ. □
It remains only to obtain expressions for the elements of the
limiting Fisher information matrix.
To this end we give two auxiliary
lemmas.
Lemma 4.6.2. The derivatives of the reduced form parameters β and V with respect to the elements of the structural parameter matrices B, Λ, Φ and Ψ of (2.2.5) and (2.2.6) are given by

    ∂β/∂b_ij = ΛE_ij ,    (4.6.4)
    ∂β/∂λ_ij = E_ijB ,    (4.6.5)
    ∂V/∂λ_ij = E_ijΦΛ' + ΛΦE_ji ,    (4.6.6)
    ∂V/∂φ_ij = (1/2)(2 - δ_ij) Λ(E_ij + E_ji)Λ' = { ΛE_iiΛ'              if i = j
                                                  { ΛE_ijΛ' + ΛE_jiΛ'    if i ≠ j ,    (4.6.7)

and

    ∂V/∂ψ_i = E_ii .    (4.6.8)

Proof: This result is a corollary to Lemma 3.2.2, which gives the analogous matrix derivatives. Results (4.6.4), (4.6.5) and (4.6.7) follow from (A.3.4), (4.6.6) follows from (A.3.7), and (4.6.8) follows from first principles. □

Lemma 4.6.3. Let E_ij: p×q be a matrix with 1 in the (i,j)-th position and 0 elsewhere, and let E_kl: r×s be similarly defined. Also let A = (a_ij) be q×r and B = (b_ij) be s×p. Then

    tr[E_ijAE_klB] = a_jk b_li .    (4.6.9)

Proof: By direct computation, E_ijA is a p×r matrix whose i-th row is the j-th row of A, with zeros elsewhere; a similar result holds for E_klB. Thus E_ijAE_klB is p×p and its only non-zero diagonal element is a_jk b_li, in the i-th diagonal position. Hence the lemma. □
We are now ready to prove

Theorem 4.6.4. Let X_n satisfy the latent linear model (2.2.4)-(2.2.6) with design matrix A_n and U_n ~ N_{p×n}(0, V ⊗ I_n) (n = r+1, r+2, ...). If the sequence of design matrices {A_n} satisfies Assumption 4.2.1, then the elements of the limiting Fisher information matrix with respect to B, Λ, Φ and Ψ are given by

    I(b_ij, b_kl) = (Λ'V^{-1}Λ)_{ik}(Q)_{jl} ,    (4.6.12)
    I(b_ij, λ_kl) = (V^{-1}Λ)_{ki}(QB')_{jl} ,    (4.6.13)
    I(λ_ij, λ_kl) = (V^{-1})_{ik}(BQB' + ΦΛ'V^{-1}ΛΦ)_{jl} + (V^{-1}ΛΦ)_{il}(V^{-1}ΛΦ)_{kj} ,    (4.6.14)
    I(λ_ij, φ_kl) = (1/2)(2 - δ_kl)[(V^{-1}Λ)_{ik}(ΦΛ'V^{-1}Λ)_{jl} + (V^{-1}Λ)_{il}(ΦΛ'V^{-1}Λ)_{jk}] ,    (4.6.15)
    I(λ_ij, ψ_k) = (V^{-1})_{ik}(V^{-1}ΛΦ)_{kj} ,    (4.6.16)
    I(φ_ij, φ_kl) = (1/4)(2 - δ_ij)(2 - δ_kl)[(Λ'V^{-1}Λ)_{ik}(Λ'V^{-1}Λ)_{jl} + (Λ'V^{-1}Λ)_{il}(Λ'V^{-1}Λ)_{jk}] ,    (4.6.17)
    I(φ_ij, ψ_k) = (1/2)(2 - δ_ij)(V^{-1}Λ)_{ki}(V^{-1}Λ)_{kj} ,    (4.6.18)

and

    I(ψ_i, ψ_k) = (1/2)[(V^{-1})_{ik}]² ,    (4.6.19)

the remaining elements, such as I(b_ij, φ_kl) and I(b_ij, ψ_k), being all zero.
Proof: The proof follows from Theorem 4.4.2 and repeated application of Lemmas 4.6.2 and 4.6.3, as follows.

For two elements of B only the first term of (4.4.20) contributes, since V does not depend on B. Using (4.6.4) and (4.6.9),

    I(b_ij, b_kl) = tr[QE_jiΛ'V^{-1}ΛE_kl] = (Λ'V^{-1}Λ)_{ik}(Q)_{jl} ,

which is (4.6.12). Similarly, using (4.6.4), (4.6.5) and (4.6.9),

    I(b_ij, λ_kl) = tr[QE_jiΛ'V^{-1}E_klB] = (V^{-1}Λ)_{ki}(QB')_{jl} ,

which is (4.6.13).

For two elements of Λ both terms of (4.4.20) contribute. The first term is, by (4.6.5) and (4.6.9),

    tr[QB'E_jiV^{-1}E_klB] = (V^{-1})_{ik}(BQB')_{jl} ,

and the second term is, by (4.6.6),

    (1/2) tr[V^{-1}(E_ijΦΛ' + ΛΦE_ji)V^{-1}(E_klΦΛ' + ΛΦE_lk)]
        = (V^{-1})_{ik}(ΦΛ'V^{-1}ΛΦ)_{jl} + (V^{-1}ΛΦ)_{il}(V^{-1}ΛΦ)_{kj} ,

after expanding the product into four traces and applying (4.6.9) to each. Adding the two contributions gives (4.6.14).

The elements involving Φ and Ψ depend only on the second term of (4.4.20), since β does not depend on these parameters. Using (4.6.6) and (4.6.7),

    I(λ_ij, φ_kl) = (1/4)(2 - δ_kl) tr[V^{-1}(E_ijΦΛ' + ΛΦE_ji)V^{-1}Λ(E_kl + E_lk)Λ'] ,

which reduces to (4.6.15) on expanding and applying (4.6.9). Using (4.6.6) and (4.6.8),

    I(λ_ij, ψ_k) = (1/2) tr[V^{-1}(E_ijΦΛ' + ΛΦE_ji)V^{-1}E_kk] = (V^{-1})_{ik}(V^{-1}ΛΦ)_{kj} ,

which is (4.6.16). Using (4.6.7) twice,

    I(φ_ij, φ_kl) = (1/8)(2 - δ_ij)(2 - δ_kl) tr[V^{-1}Λ(E_ij + E_ji)Λ'V^{-1}Λ(E_kl + E_lk)Λ'] ,

which reduces to (4.6.17) on expanding into four traces and applying (4.6.9). Using (4.6.7) and (4.6.8),

    I(φ_ij, ψ_k) = (1/4)(2 - δ_ij) tr[V^{-1}Λ(E_ij + E_ji)Λ'V^{-1}E_kk] = (1/2)(2 - δ_ij)(V^{-1}Λ)_{ki}(V^{-1}Λ)_{kj} ,

which is (4.6.18). Finally, using (4.6.8) twice,

    I(ψ_i, ψ_k) = (1/2) tr[V^{-1}E_iiV^{-1}E_kk] = (1/2)[(V^{-1})_{ik}]² ,

which is (4.6.19). The elements relating B to Φ or Ψ vanish because ∂β/∂φ_kl = ∂β/∂ψ_k = 0 and ∂V/∂b_ij = 0. This completes the proof. □
It may be noted that all terms not involving
g
in the above expres-
sions agree with the corresponding results for confirmatory factor
analysis given in §1.4.2.
Compare (4.6.14) - (4.6.19) with (1.4.8)-
(1.4.13) which have been obtained by a different method.
90
The theorem gives the elements of the limiting information matrix
in terms of
~,~, ~
the free parameters
of
;:;,~,~,
and
~
if
To obtain the elements with respect to
be a vector containing all the elements
ordered in lexicographical fashion .• Then
1.J
a ik = 1
X
e, let
I(e.~.)
where
~.
and
= k,~]·
L a·ka.oI(yk,yo)
J ......
0i = Yk
(4.6.20)
,
and 0 otherwise, and the
I(Yk'Y~)
are
obtained from Theorem 4.6.4.
The same method of proof of Theorem 4.6.4 could be used to obtain
the second derivatives of
F with respect to
~,
using Lemma 4.4.1
in combination with the auxiliary Lemmas 4.6.2 and 4.6.3.
These deri-
vatives could then be used in a Newton-Raphson algorithm for computing
the estimates.
The results, however, are too complicated to be of much
practical use, and the usually large number of parameters to be estimated
iteratively would generally make the Fletcher-Powell method more efficient than Newton-Raphson.
An alternative approach that could be used to obtain second derivatives is to differentiate the first derivatives given in Theorem 3.2.3.
In all but one case, however, this is an arduous task, even with the help
of the matrix differentiation rules given in the Appendix, primarily
because of the lack of a convenient chain rule for matrix functions of
matrices.
We now give one second derivative that serves to check the
results obtained in this section.
From (3.2.14)
(4.6.21)
Differentiating this with respect to
:
we obtain
91
a raF}
a~la~
=
(4.6.22)
1
= 2(A'VA @ ....q....
I )E( q, r )(0Nn
............
where (4.6.23) follows from (A.3.3).
@ "'I'
I ) ,
(4.6.23)
Collecting elements in different
blocks of (4.6.23), or using (A.3.4), we obtain
a
(aF) _
a~kR, a~ - 2 ~
,-1
Y M.kR,2n
But the (i,j)-thelement of ~kR,~ is
2(A'V
-
.,
.......
-1
("OJ
Since under Assumption 4.2.1
(4.6.24)
.
aikbR,j
Thus
(4.6.25)
A)'k(O
)'0
1
~ JAI
-
~ ~
Q as n
~
00
,
we find, recalling
(4.4.22) and noting that (4.6.25) is non-stochastic, that
I(~,1J"~ko)
]V
-1
= (A'V
A)'k(n)
............ 1
~ J]V
(4.6.26)
'0'
which agrees with (4.6.12) of Theorem 4.6.4.
We remark that this
result could not be verified by comparison with analogous results for
factor analysis, because it involves
Q.
4.7 ~oximate Second Derivatives of
Recall from Lemma 4.5.1 that as
n
F
~
00
~ l(Q)
Thus for sufficiently large
derivatives of
(4.7.1)
n, approximate vaues of the second
F with respect to the free parameters are given by
twice the limiting Fisher information matrix.
Hence
1 -1
'2!
(Q)
92
g in the Fletcher-
provides a good initial value for the matrix
Powell iterative procedure for minimizing
When
~
F discussed in §1.3.2.
is unrestricted, however, estimates of the parameters
~
are computed by minimizing the function
over
~
~
for fixed
~
!,
and
F, which is
F
~inimized
as indicated in §3.5 and §3.6. We
~
now show how approximate vaues of the second derivatives of
be obtained from
Lemma 4.7.1.
1(8).
Let
For this purpose we prove
F be the function defined in (3.2.6) or (4.3.1),
'i be twice differentiable
and let the reduced form parameters
~
and
functions of a structural parameter
~
which is identified.
be partitioned into components
be the function
{~}
~l'
1(2)
~
1
for i,j=l,2.
J
a2 p
The function
F
Suppose the
exists, and let
1(~)
Then as n
-+
00
(4.7.2)
2" a~la~i
Proof:
~
~
satisfies Assumption 4.2.1, so that
1..(8): m.xm.
~1J
Let
m2x l, and let
~2:
for a fixed
~2
the limiting Fisher information matrix
be partitioned into blocks
and
~l : mlxl
F minimized over
sequence of design matrices
1
F may
F
is given by
(4.7.3)
~
where
~2(~1)
~2
is the value of
that minimizes
F
for given
~l'
~
Interpret
Q2(§1) as the root of
aF [§l
,:22 (~l)]
a~2
=
Q
Differentiating (4.7.4) with respect to
(4.7.4)
8'
~l
using the chain rule
93
for vector derivatives (A.3.23), we find
(4.7.5)
+
Differentiating (4.7.3) in the same fashion we obtain
ai\~'l)
ae'
~l
=
=
aF[~1,ft2(ftl)]
~
aft2(~1)
as'
~2
ae-'-
ae' - - +
~l
aF[~1'~2(ftl)]
ae'
~l
~
aF[~1'~2(~1)]
(4.7.6)
~l
by (4.7.4).
(4,7.7)
Transposing (4.7.7) and differentiating with respect to
~i
using (A.3.23) gives
a2tr(~1 )
a~la~i
=
(4.7.8)
But from (4.7.5),
(4.7.9)
provided the indicated inverse exists.
~
and writing
Substituting this into (4.7.8)
~
~2
for
~2(~1)
we have
a2F(~I)
a~la~i
provided the indicated inverse exists.
From Theorem 4.3.1,
4.5.2 , as n
+
00
~2(~1)
&~2
as
n
+
00.
Thus from Corollary
94
~ 1. . (8)
~lJ '"
ft
Since
is identified,
(i,j=l,2) .
1(8)
'"
~
(4.7.11)
is non-singular.
In view of
(4.7.11) this implies that the inverse required in (4.7.9) - (4.7.10)
will exist for sufficiently large
we find that as
1
n
~
n.
On using (4.7.11) on (4.7.10)
00
a2 p(ft 1 )
(4.7.12)
2" aft1 ,afti
This completes the proof.
(We are grateful to Professor Hoeffding
for outlining the proof of a similar result for the case of
8 2 scalars.
81 and
The above proof is a straightforward extension of his
o
argument to the rnu1tiparameter case.)
In view of the theorem, a good initial value for the matrix
the Fletcher-Powell iterative procedure of §3.5
for minimizing
E
.....
in
'"
F is
given by
(4.7.13)
By a well-known result concerning the inverses of partitioned matrices,
see Morrison (1967, p. 66); the matrix (4.7.13) is half the matrix in
the upper
left
m x m block
1
1
of
I-l(ft).
V.
5.1
HYParHESIS TESTING
Introduction
In this chapter we consider large sample tests of hypotheses in
the latent linear model.
The discussion is organized in three parts.
In §5.2 we develop a likelihood ratio goodness of fit test for a family of structural linear models, obtain the asymptotic distribution of
the test statistic if the model fits, and discuss application of the
procedure in the latent linear model.
In §5.3 we consider briefly
testing hypotheses about structural parameters after the structure
has been established, with special reference to hypotheses involving
the parameters
~, ~
and
%
in the latent linear model.
discuss testing linear hypotheses about the parameter
In §5.4 we
in the latent
linear model and give a procedure for computing the restricted m.l.e.
of
N
'
which is needed to construct the likelihood ratio test statistic.
We also consider an alternative test procedure using Wald's statistic,
which does not require computation of the restricted m.l.e., and derive
the large sample distributions of the test statistic under the null
hypothesis and under a sequence of alternative hypotheses.
5.2 Testing Goodness of Fit
Let the matrix
X
~n
be given by
(5.2.1)
where
A
~
is of full row rank
r < n
and
u
~
~
N
V
pxn (0
~,~
@ I
~
)
with
96
'1. p.d.
Let
n
be the Cartesian product of the set of all
n where
the subset of
§
pxp symmetric p.d. matrices
and the set of all
.@
.@
and
pxr
matrices
w be
'1., and let
are specified functions of a vector
V
~
.
We consider testing
HI: {'§,'1.}dl-w ,
vs.
H : {'§,'1.h:w
O
(5.2.2)
i.e., the goodness of fit of a structural linear model where the reduced form parameters
~
and
'1.
are functions of a structural para-
meter §.
Let
~
= pr
+
denote the number of parameters in .§
~(p+l)
y , and let m denote the number of parameters in
imposes
~-m
constraints upon
~
and
'1.
~
~.
and
Then H
and we require
O
m<
~
for
HO to be non-trivial.
Let
~
and
'in
be the unrestricted m.l.e.'s
in (4.2.2) and (4.2.3), and let
distinct elements in
S1
~
:tn:
and V
•
~
n
where
c
Let
Y.. ,
and
y
given
be a vector containing all
Then the maximum value of log L in
is
max log L(.§,y)
let
of .§
'"
~
is a constant and
'"e :mxl
""11
and
and let
'"
V
~
F
=c
-~F(y )
=c
-~[logIV I+p]
(5.2.3)
~n
""11
(5.2.4)
is the function defined in (3.2.6).
denote the m.l.e. of the structural parameter
be the corresponding restricted m.l.e.'s of ~
'" = l(X -~A )(X -~A ) I .
T
n ~
""11
~n
~
~
~
,
and
Then the maximum value of log L
97
in
w
is
max log L(~,y) = c -~F(~)
(5.2.5)
w
(5.2.6)
where we have written
F as a function of
g.
The likelihood ratio statistic for testing (5.2.2) is then given
by
(5.2.7)
-2 log An = -2[max log L - max log L]
n
w
'"
= n[F(~)
(5.2.8)
F(in)]
'"
'" 1'"
= n[loglYnI+tr ~ In - 10gIYnI-p]
.
We now obtain the asymptotic null distribution of
Let
Theorem 5.2.1.
X
of a structural parameter
{A }
"'11
~
~
and
-2 log A
n
y
be twice differentiable functions
which is identified.
satisfies Assumption 4.2.1
-2 log An
where
-2 log An'
satisfy (5.2.1) for (n=r+l,r+2, ... ), and let
"'11
the reduced form parameters
design matrices
(5.2.9)
d
-+
If the sequence of
then as
n
-+
00
Xv2 '
(5.2.10)
is as defined in (5.2.7)
(5.2.9) and
v =
,
~-m
the difference between the number of reduced form and structural parameters.
Proof:
The proof is adapted from the argument used by Rao (1965,
pp. 347-351) to obtain the asymptotic null distribution of likelihood
ratio test statistics based on i.i.d. observations.
!
Let
and let
r
n
expansion of
be a vector containing the distinct elements of
be its unrestricted m.l.e.
F(r)
about
r
=
m:
~
Consider a Taylor series
and
y,
~
98
(5.2.11)
where
!n
is a point on the line segment connecting
Xn
Since
is a root of
3F/3r
=Q •
2:n
and y .
the second term\of the right
hand side of (5.2.11) is zero.
'"
Since ",n
y
4.2.1
-
p
Y
",n
-+
= VXn
Y as
'"
+
n -+
(I-V)! for some 0
<Xl
•
fore by Corollary 4.5.2. as
1
"2
ley)
where
we have that
n
-+
~
1'"n
V ~ 1. and since by Lernma
p
-+
1 as n
-+
<Xl
•
There-
<Xl
g
-
(5.2.12)
I (y) •
'"
denotes the limiting Fisher information matrix with respect
'"
to
y. which may be obtained from Theorem 4.2.2.
'"
Using these results in (5.2.11). we obtain the following asymptotic
stochastic equivalence
(5.2.13)
On the other hand. using expression (4.5.15) in the proof of
Theorem 4.5.3. and noting that by Lemma 4.2.2
when
I(y)
'" '"
is non-singular
y is p.d .• we have
(5.2.14)
On using this in (5.2.13) we obtain
(5.2.15)
Furthermore. from expression (4.5.21) in the proof of Theorem
4.5.3, as
n -+
<Xl
99
rn
aF(:O
-2-
Under
H '
O
r
d
-+- N[0, I (y)]
"' ........
ay
....
(5.2.16)
.
is a function of a vector
~
of structural para-
meters. admitting first derivatives
(5.2.17)
5I.xm •
Let
'"
~
denote the m.1.e. of ft
(5.2.11) - (5.2.15)
under
H •
O
Proceeding as in
we obtain the asymptotic stochastic equivalence
(5.2.18)
left)
where
to
~.
denotes the limiting Fisher information matrix with respect
which is given in Theorem 4.2.2 and is non-singular if
ft is
identified.
Using the chain rule for vector derivatives (A.3.23),
aF(8)
'"
a~'
--,;:--=-:--
=
(5.2.19)
and substituting this into (5.2.18) we obtain
(5.2.20)
In view of (5.2.15) and (5.2.20), the likelihood ratio statistic
(5.2.8) may be written as
P n aF(2)
-21ogA n
since
F(r)
Let
Q
and
=
"4
-'-a-y""-
F(~)
aF(y)
'"
ay-
(5.2.21)
are the same.
denote the matrix in brackets in (5.2.21).
In view of
(5.2.16), the right hand side of (5.2.21) is a quadratic form in
100
asymptotically normally distributed random variables, and will thereX~
fore have an asymptotic
\l
distribution with degrees of freedom
= tr Q lex) if and only if Q !(r)Q = Q.
To verify this condition note that by (4.4.2) and (4.4.4)
•
1(8) = 1" 1:- E[alogL dlogLl
~ ~
1m n
ae
ae'
;}-+OO
_ 1.
-
1m
n~
I'"'>J
J'
t'J
1 E[ art alogL
dY
ae-
n
~
alogL a! ]
by (5.2.19)
ay' ae'
~
~
~
(5.2.22)
Now
=!
-1
ar
(X) - 2ae ,!
ar
-1
Hence as
n,+
\l
00
a!'
(g)aa-
a!
ae,l
+
-1
(g)!(g)I
ay'
(~) a~
= tr Q l(x)= tr[l
-1
tr
ar
(X) - ae'!
1-1 (~) [ar'
a~
= £
tr[l-l(ft)l(g)]
=~ -
tr ~
I
=~ - m .
-
by (5.2.22)
(5.2.23)
-1
~ X~ with
ay'
(g)a8 ]I(r)
~
-
a!'
(~)a8'
=Q.
the right hand side of (5.2.21)
= tr l~
-1
-,....,
-1
(r) - ag' I
=I
-1
~
ar ]
I(r) a~'
by (5.2.22)
(5.2.24)
101
The theorem then follows by (5.2.21) and Slutzky's theorem.
o
Thus for sufficiently large
n, an approximate size
a
test
of (5.2.2) has critical region
2
-2 log An > Xv,l-a
(5.2.25)
The theorem provides a large sample goodness of fit test for the
latent linear model, where
~
and
some structural parameter matrices
y
satisfy (2.2.5) and (2.2.6) for
~,~,
2
and
~,
which are subject
to a set of identifying restrictions as discussed in §2.3.l.
In exploratory studies the investigator will usually fit a sequence
of unrestricted nlodels with
q
latent variables,
~
~
=I
~
and
~q
satisfying the Gram-Schmidt pattern (2.3.7), for different values of
q.
The goodness of fit statistic may then be used as an informal crit-
erion to determine an appropriate value of
as in polynomial regression.
q, much in the same spirit
The number of free structural parameters
in a q-factor unrestricted latent linear model is
~q(q-l),
m
and the number of reduced form parameters is
= qr
t
+
pq + p
= pr
+
~
~(p+l).
Hence the number of degrees of freedom is
v
= (p-q)r
+
~[(p_q)2 _ (p+q)] •
(5.2.26)
This expression may be used to determine the maximum number of factors
that may be fitted for given
p
and
r.
Note that on setting
r
=
0,
(5.2.26) reduces to the corresponding result (1.3.31) for exploratory
factor analysis.
In confirmatory studies the investigator will usually be able to
specify in advance the number of latent variables, and a set of identi-
~
102
fying restrictions that is better suited to the nature of the problem
at hand.
As indicated in §2.3.l, these restrictions may take the form
of a structural hypothesis specifying free, fixed and constrained
elements in
~, ~
!.
and possibly
The goodness of fit of the
model may be tested using (5.2.25) with degrees of freedom
V
where
= pr
1
(5.2.27)
+"2 p (p+ 1) - m
m is the number of free parameters.
On setting
r
=
0,
(5.2.27) reduces to the corresponding result (1.4.17) for confirmatory
factor analysis.
5.3 Testing Hypotheses about
and
--
'¥ •
~
So far we have discussed testing the goodness of fit of a structural
linear model where
~.
~
and
yare functions of a structural parameter
Having found a model that fits, we may use the likelihood ratio
technique to test a variety of hypotheses about the structural parameter
~.
Let
g
be an sxl vector function of
Q.
Then we consider
testing
vis
Let
a"
""Il
(5.3.1)
be the unrestricted m.l.e. of
under the restrictions imposed by
HO•
g, and let
"
~
be the m.l.e.
Then the likelihood ratio test
statistic is
-2 log An
"
= n[F(ftn)
(5.3.2)
The hypothesis (5.3.1) is clearly equivalent to a hypothesis
specifying that
p. 350).
~
is a function of an (m-s)xl vectorl, see Rao (1965,
By the same argument used in the proof of Theorem 5.2.1,
103
it then follows that as
n
+
00
-2 log A
n
d
+
2
(5.3.3)
X .
s
From (5.2.8) it can be seen that the statistic (5.3.2) is actually
the difference between the goodness of fit statistics for two structural linear models, specifying
~
and
y
as functions of 1
~,
and
respectively.
From a theoretical point of view the testing problem considered here
is essentially the same as that considered in the previous section.
From a practical point of view, however, it is convenient to distinguish
between the problems of finding a structure that fits and studying characteristics of the parameters of that structure.
In the context of the latent linear model, this technique may
readily be applied to test hypotheses about the structural parameters
~, ~
and
! , of the type considered in §2.3.1. where the elements of
~, ~
and
!
are specified to be (1) free, (2) fixed at given values,
or (3) constrained to be equal to other parameters in the model.
In
this case the procedures discussed in Chapter 3 may be used to obtain
both the unrestricted and restricted m.l.e.'s
of the structural para-
meters, by fitting in effect two models under two sets of restrictions.
Two examples of hypotheses that may be of interest are
(in a one-factor model) and
H : ~
O ....
H:A=A1
0....
"'P
= WI"'P , which have been considered
in §2.4.3.
The same approach may be used to test hypotheses about the structural
parameter
estimating
~,
:
but we have not yet discussed in detail the problem of
under a set of restrictions.
problem for the case of a linear hypothesis.
We now consider this
e
104
5.4 Testing Linear Hypotheses About g .
Consider testing the general linear hypotheses
vis
where
~:
row rank
5.4.1
txq
t
~
and
B: rxs
(5.4.1)
are matrices of known constants of full
q and full column rank
s
~
r
respectively.
The Likelihood Ratio Test
To construct a likelihood ratio test of HO we require the m.l.e.
~
of
under the restrictions
Theorem 5. 4. 1.
Let
= Q.
In this regard we prove
F be defined by (3.2.6).
the linear restrictions
~
If
~
= Q ' then the value
F conditional on fixed values of
....
where :
~
~
2
and !
~
=
is subject to
that minimizes
is given by
(5.4.2)
is the value that minimizes
and!
when:
Proof:
Let
1:
F conditional on fixed
~,2
is unrestricted, given in (3.4.5).
sxt
be a matrix of Lagrange multipliers.
We minimize
the function
(5.4.3)
with respect to
:
and
1.
Differentiating (5.4.3) with respect to
:
using (3.2.14) and
(A.3.19) gives
(5.4.4)
105
and differentiating (5.4.3) with respect to
l' using (A.3.19) and
transposing gives
d1 = C--~~~
df
(5.4.5)
Q we obtain, from (5.4.4)
On setting these equations to
(5.4.6)
which under the full rank assumptions of the model gives
(5.4.7)
~
where (5.4.7) follows from expression (3.4.5) giving
On
~
the other hand, from (5.4.5) we obtain
~
~
C~B
--..
On using (5.4.8) on (5.4.7)
= ...,0
(5.4.8)
•
we obtain
(5.4.9)
and hence
(5.4.10)
Substituting this into (5.4.7) gives (5.4.2).
This completes the
0
proof.
~
Note that if
pect.
H : ~
O
=Q
~
then:
=Q
by (5.4.2), as we would ex-
Also computation of (~'y-l~)-l may be simplified using (3.4.3)
in Lemma 3.4.1.
106
of ;; ,
To obtain the unconditional restricted m.l.e.
~
F(~
/\ ~ \1/)'
;;,~,~,~
.. ,
mlnlmlze
wl"th respect to
~,:.
and
._
~
~
1'1.
~
<1>
,
~
given by (3.4.5), and then evaluate (5.4.2) at the, m.l.e. 's
;;:;
/\
proceeding in the
given by (5.4.2) instead
same manner as in §3.5 and §3.6, but using
of
~
._
~
and
~
'¥
~
The likelihood ratio test statistic is then
A
~
n[F(§ ,/\
,<1>
A
-2 log A
n
=
A
,~
~~~~
A
)
-
F(~ ,/\
5.4.2
(5.4.11)
,'¥ )]
~~~~
and is asymptotically distributed as X~
restrictions imposed by
A
,<1>
with
V
= ts
, the number of
HO'
The Wald Statistic
Note that computation of the likelihood ratio statistic requires
fitting the model with
unrestricted, and with
restricted, and
at this later stage (5.4.2) must be evaluated in each iteration.
We
now consider an alternative test procedure suggested by Wald (1943),
which does not require calculation of restricted m.l.e.'s
and thus
is computationally more convenient.
Let us introduce the following definition.
let
~
vec
denote an
one below another.
For any matrix
mnxl vector containing the columns of
If the matrices
~,~
and
£
~
~:
mxn
packed
are conformable, then it
can be shown that
vec(~)
= (£' e ~)vec
~
,
(5.4.12)
see for example result (2.10) in Neudecker (1969).
Let
.s; = vec g and
m.l.e. of
~
vec
~
~
,where
is the unrestricted
Using (5.4.12) the linear hypothesis (5.4.1) may be
107
written in terms of
as
~
(5.4.13)
vis
Let
~
denote the vector of (free) structural parameters in the
model, and suppose the elements of
~.
Let the limiting Fisher information matrix
be partitioned into four blocks
is
are the first qr
~
~
of Theorem 4.6.3
(i,j=1,2) such that
1. . (8)
~1J
1(8)
~
elements of
~
III (~J
q r'G r , and define
(5.4.14)
the upper-left
qrXqr block of
! -1 (~).
Let
e
evaluated at the unrestricted m.l.e.
'"
r(8)
of~.
~
denote (5.4.14)
~~
Then the Wald (1943)
statistic for testing (5.4.13) is
We now obtain the asymptotic distribution of
W
n
under the null
hypothesis and under a sequence of alternative hypotheses.
X
Let
Theorem 5.4.2.
satisfy the latent linear model (2.2.4) -
~
(2.2.6) with design matrix
... );
A
and
~
U
~
and with structural parameters
satisfies
~
=
Q
identified and let
for fixed matrices
~
~
Npxn (O,V @ I ) , (n=r+l,r+2,
~,~, ~
£ and
~
~
~n
and
!,
~.
where
Suppose the model is
denote the vector of free parameters.
sequence of design matrices
{A }
~
~
If the
satisfies Assumption 4.2.1 then as
n-+- oo
(5.4.16)
with degrees of freedom
v = ts
the number of restrictions imposed by
108
Proof:
By Theorem 4.5.3, as
n
+
00
(5.4.17)
where
"-
8
"1l
is the m.l.e. of
8
'"
and
is the limiting Fisher in-
I(8)
'" '"
formation matrix.
If
"-
~
= vec g
is given by the first qr
is its m.1.e., then as
~
n
+
elements of
e
and
00
(5.4.18)
for
k(§) , defined at (5.4.14), is the variance-covariance matrix of
the asymptotic marginal distribution of the first qr elements of
rn(~-§).
Since a linear function of asymptotically normally distributed random
variables is asymptotically normal, as
n
+
00
(5.4.19)
Now by Lemma 4.4.3, as
n
+
00
(5.4.20)
Using this in (5.4.15), we obtain the following asymptotic
stochastic equivalence
In view of (5.4.19), the right-hand side of (5.4.21) is a quadratic
form in asymptotically normally distributed random variables with mean
Q
and variance-covariance matrix equal to the inverse of the discrim-
inant matrix in the quadratic form; and therefore converges in distribution to a central Xv2
distribution with
v
= ts.
The theorem then follows by Slutzky's Theorem.
o
109
In view of this result, a large sample test of the hypothesis
(5.4.1) of approximate size
a
has critical region
2
(5.4.22)
Wn>XV, 1-a .
,
Consider now a Pitman sequence of alternative hypotheses
(5.4.23)
where
n is a fixed vector.
bution of
W
under this sequence of alternatives.
n
Theorem 5.4.3.
Let
X
"'I1
satisfy the latent linear model (2.2.4) -
(2.2.6) with design matrix
Parameters
~
We now obtain the asymptotic distri-
,A,
"Tl..-..l
~
t'"tJ
and
A,
U - N
"'I1
~,
f"'W
where
(O,V @ I ) , and structural
"'P x n - -
"'I1
~
"11
"'I1
satisfies (5.4.23), for
(n=r+l,r+2, ... ) .
If the sequence of design matrices
Assumption 4.2.1
then as
n
~
{A}
-n
satisfies
00
d X2 (u)
~
Wn ~
,
V
with degrees of freedom
V
= ts
(5.4.24)
and non-centrality parameter
(5.4.25)
Proof:
Proceeding as in the proof of Theorem 5.4.2, but taking into
account the fact that
~n
satisfies (5.4.23), we obtain that as
n~oo
(5.4.26)
where
~(~)
is defined at (5.4.14).
Note that since
~(~)
is
a function of the limiting information matrix, and since the sequence
of alternatives (5.4.23) converges to the null hypothesis,
evaluated under the null hypothesis.
k(~)
is
e
110
Consider now the asymptotic stochastic equivalence (5.4.21).
In view of (5.4.26), the right-hand side of (5.4.21) is now a quadratic form in asymptotically normally distributed random variables
with mean
D
and variance-covariance matrix equal to the inverse of
the discrilninant matrix in the quadratic form, and therefore converges
in distribution to a non-central ~(O)
and
0
distribution with
v = ts
as given in (5.4.25).
0
The theorem then follows by Slutzky's Theorem.
In view of this result, for sufficiently large
n, the Wald
test of the null hypothesis (5.4.1) with critical region
(5.4.22)
has approximate power against the (fixed) alternative HI:
(~'
x
£)1 = D
given by
Pr{x~(o) > ~,l-a} ,
where the non-centrality parameter
o
= nn'[(~'
0
(5.4.27)
is given by
@ £)k(§)(~ @ £,)]-In
(5.4.28)
This completes our discussion of hypothesis testing in latent
linear models.
VI • A NUMERICAL EXAMPLE
6.1
Introduction
In this chapter we use simulated data to illustrate estimation
and testing in latent linear models.
In §6.2
we describe the pop-
ulation model and the procedure used to simulate the data.
In §6.3
we illustrate maximum likelihood estimation of the parameters in the
model, and in §6.4
we consider testing hypotheses about the struct-
ural parameters.
6.2 Simulation of Data
Let us consider a two sample latent linear model with p
manifest variables and q
=2
latent variables.
the design matrix be given by
the i-th population and
~ =
parameters be
H
~
[02
= .8
12
a ..
1)
=1
a..
1)
=0
=5
Let the elements of
if observation
j
is from
otherwise, and let the structural
'
1.2] ,
~
2.8
=
.9
0
.3
.7
.2
.4
.5
.4
.3
.6
.6
.1
.8
.7
and
%= diag
.5
(6.2.1)
In order to achieve identification, the parameters
~
are fixed at
parameters.
0
and
A
and
12
1 , respectively, while the others are free
2
As indicated in §2.3.l, this implies no loss of general-
ity, for any set of parameter values may be transformed to conform
to these restrictions.
112
The reduced form parameters, computed from (6.2.1) using (2.2.5)
and (2.2.6),are
.@' =
[ .18
.30
.42
.54
1.08
1.40
1. 72
2.04
.66 ]
2.36
,
and
1.11
)l
=
.63
.93
.45
.43
.91
.27
.33
.39
1.05
.09
.23
.37
.51
(6.2.2)
1.35
Note that there are 25 distinct elements in
free parameters in
g,
~
and
!.
~
and
y,
and 18
Hence the number of degrees
of freedom associated with the model is 7.
To generate data satisfying this model, recall (2.2.1) and
(2.2.3), and proceed as follows:
(1) generate two standard normal
the first
as given in (6.2.1), to obtain an observation
[second] column of :
Xi
add to
(2)
random variates and form a vector
from the first [second] population; (3) generate five standard
normal variates and multiply each by the square root of one of the
diagonal elements of
!i
!
as given in (6.2.1), to obtain a vector
of random errors; (4) premu1tiply
and add
z.
~1
ponding to
to obtain a vector
Xi;
x.
~1
ri
by
~
as given in (6,2,1)
of manifest variables corres-
and (5) repeat the above steps for i=l, ... ,n.
This method was used to generate a sample of 100 observations
from each population.
The standard normal random variables required
in steps (1) and (3) above were generated using subroutine VARGEN,
available at the University of North Carolina Computation Center.
The data, given in terms of the sum of products matrices, are
113
as follows:
M'-=
M,'=
e~O
I~OJ
[ 31.2911
58.5773
63.6883
59.1566
110.0359
137.2624
174.0489
213.5502
76 1187]
0
,
233.827'1
and
335.0615
280.9319
390.6948
XX' = 295.2279
362.0408
525.2303
304.7186
410.8762
506.5956
762.9457
318.3725
·430.5187
540.4315
683.0306
....,."
(6.2.3)
881.1164
The least-squares estimates of the reduced form parameters
obtained from these data are
[ .3129
~' =
1.1004
.5858
.6369
.5916
1.3726
1. 7405
2.1355
o
7612J
,
2.3383
and
1.0210
V
.... =
.5578
.8399
.4198
.4292
.2561
.1863
(6.2.4)
.4155
.9087
.4862
1.3596
.3249
.4249
.6933
1.3821
which may be compared with (6.2.2) .
6.3 Maximum Likelihood Estimation
A computer program has been written to do all necessary calcu1ations in fitting latent linear models.
The program is written in
Fortran IV-G for the IBM System/360, and has been tested extensively
at the University of North Carolina Computation Center.
114
A choice of four descent methods for minimizing
in the program:
matrix
F
is provided
(1) steepest descent, (2) Fletcher-Powell with initial
g equal to the identity matrix, (3) Fletcher-Powell with ini-
tial matrix
E obtained from the information matrix in
a~cordance
with
the results of §4.7, and (4) a combination of steepest descent for the
initial iterations, followed by Fletcher-Powell using the information
matrix for the remaining iterations, the change-over point being when
F
fails to decrease by more than 5% between two consecutive steepest
descent iterations.
These methods will be referred to hereafter as
SO, FP-I, FP-II and SO/FP-II, respectively.
SO/FP-II combination will be preferred.
In most applications the
When good initial estimates
of the parameters are available, however, the FP-II method may be
used right from the start.
The program was used to compare the SO, FP-I, and SO/FP-II
methods for the data given in (6.2.3).
To start the iterative proce-
dure each free parameter was set to 1, thus providing arbitrary initial
estimates.
Iteration was stopped when all partial derivatives of
F
were less than .001 in absolute value.
Table 6.3.1 shows the behavior of the function and its derivatives
during the iterative procedure.
For each method the column
F shows
the value of the function, and the column G-max shows the largest
derivative in absOlute value, at selected stages of the process.
At
the bottom of the table we give the CPU time in seconds required by
each method on the IBM System/360-7S.
115
Table 6.4.1
Steepest Descent and Fletcher Powell
Function Minimization
FP-I
SD
Iteration
G-max
SD/FP-II
G-max
F\
G-max
o
5.717378
.895047
5.717378
.895047
5.717378
.895047
1
5.089637
.446847
5.089637
.446847
5.089637
.446847
2
4.691579
.681971
4.486940
.628014
4.691579
.681971
3
4.440242
.352651
4.390288
.548307
4.440242
.352651
4
4.318538
.590315
4.258196
.481806
4.318538
.590315
5
4.246603
.298247
4.211948
.646492
4.068019
.212787
6
4.210801
.325221
4.131595
.400649
4.055455
.411462
8
4.164438
.206044
4.092626
.272744
4.044735
.027394
10
4.128719
.182090
4.068557
.084633
4.044586
.003847
12
4.099288
.154941
4.057798
.167238
4.044574
.000889
15
4.066047
.291028
4.045418
.052400
18
4.049712
.037323
4.044612
.007396
21
4.046359
.033712
4.044574
.000458
182
4.044577
.000985
Time
(seconds)
32.44
4.89
1.63
The steepest descent method works well in the initial stages of
the procedure, but requires 182 iterations and 32.44 seconds to con~
verge, indicating that the function
in a neighborhood of its minimum.
F
is probably relatively flat
The Fletcher-Powell method is
clearly superior, requiring only 21 iterations and 4.89 seconds to
116
converge.
Although SO is usually
faster,th~n
FP-I for the first
iterations, it would appear that in this example the initial estimates are sufficiently close to the m.l:e.'s to offset this
advantage.
The combined method SO/FP-II is the best one for this example,
requiring only 12 iterations and 1.63 seconds to obtain the esti~).
mates of all 14 parameters (not counting the 4 elements of
The
change-over from SO to FP-II occurs after the 4th iteration, and the
large sample approximations to the second derivatives of
F appear
to work well, for the process converges very rapidly after that.
To
obtain some feeling for the accuracy of the approximation we let the
FP-I method continue for several additional iterations, thus obtaining a very close approximation to the inverse of the matrix of second
derivatives evaluated at the minimum.
This was compared with the
large sample approximation of §4.7, and found very close.
The maximum likelihood estimates of the parameters are
'"
-
=
[.5078
1. 3459]
.7566
2.4067
r·
'
7956
o1
.6900
.1986
'"
A= .5325
.4233
.3265
L·2315
..8392
7Ol~J
A
3875
r..3237
l
and! = diag .4831
(6.3.1)
l·7088
.6469
J
and are reasonably close to the true parameter values (6.2.1).
These estimates were verified using different starting points for
the iterative procedure, in particular starting from the true parameter values, and were found to be correct to within .001.
The goodness of fit statistic is 9.51, and is approximately
distributed as a chi-square variate with 7 degrees of freedom.
This
gives an approximate p-value of .2181, which would lead to accepting
117
the hypothesis that the model fits.
The maximum likelihood estimates of the reduced form parameters,
obtained from (6.3.1) using (2.2.5) and (2.2.6), are
"-
.§' = [ .4040
1.0708
.5006
.5907
.6967
1.4066
1.7354
2.1282
. 7525l ,
2.33l3J
and
1. 0205
.5490
.8392
.4237
.4515
.9458
.2598
.3646
.4709
1.3078
.1842
.3664
.4785
.6645
"-
V
~
=
(6.3.2)
1.4047
These estimates take into account the latent factor structure
of the model, while those of (6.2.2) do not.
The residuals, or
differences between these two sets of estimates, may provide valuable
insight in fitting latent linear models.
For this example the resid-
uals are all small, as would be expected.
The average and largest
absolute differences between
~
~
and
respectively, and those between V and
are .0419 and .1051,
yare .0205 and .0518,
respectively.
The estimated large sample standard errors of the estimates of
the structural parameters, obtained from the information matrix
evaluated at the m.l.e.'s (6. 3 . 1 ), are
·
r
0923
o
r·1205l
.0814 .0717
1\
s.e.
::
(~)
=
.1285 .1992J
[ .1493 .2657
"-
,s~e(~) =
l~~::~
.0597
.0708
J
.1038 .0808
1. 0708
1\
, s. e.
"-
Cf) = diag
l~~:~~J
.1198
'
(6.3.3)_
118
A
where
s. e.
"
(~)
is a matrix whose (i,j)-th element is the estimated
standard error of the (i,j)-th element of
A
s.e.
~
, and
A
S • e.
"
(!!) and
are similarly defined.
On the other hand, the true large sample standard errors, ob-
tained from the information matrix evaluaLed at the parameter values
(6 . 2 . 1), are
r·
s.e.
c:~)
,...
=
[.1155 .17001
.1440 . 2849J '
s . e.
(~) =
l
~ ~~~~
l
o
0937
.0539
1332
.
.0733
.0495
.0601
s. e. (2) =
.0825 .0583
.0791
J
.1216
.0938 .0759J
(6.3.4)
Note that the estimated standard errors are very close to the
true values.
Although in actual applications of the model the latent variates
are not observable,
example
the method of data simulation used in this
is such that the values of the latent variates are known.
Using those values we obtain the following estimate of
~,
from
ordinary linear model analysis,
H
,..,
=
.3344
[ .9395
1.15311
2.9l70J ' with s.e. (g)
=
[.0990
.1002
1
. 0990
.1002J. (6.3.5)
These estimates are somewhat closer to the true parameter values
(6.2.1) than the estimates
of (6.3.1) based on the latent linear
model analysis, and their estimated standard errors are smaller than
the estimated standard errors of :
given in (6.3.3).
These results
illustrate the fact that some information is lost when the latent
variates can not be observed, but they also indicate that parameters
119
pertaining to the latent variates can still be estimated reliably
on the basis of observable indicators.
6.4
Hypothesis Testing
Let us now consider testing hypotheses about the structural
parameters.
We consider first testing
vis
(6.4.1)
the hypothesis that all specificities are equal.
To test (6.4.1) we fit a model where the first element of
say
~,
f '
is a free parameter, and all other elements are constrained
to be equal to
The total number of free parameters in the model
~.
is now 14 instead of 18, and the number of degrees of freedom associated with
H is 4.
O
To compute the estimates we used the FP-II method, with initial
estimates equal to the true parameter values for
value .5 for
~.
A and to the
The procedure converged in 5 iterations to a function
value of 4.157385 and a largest absolute derivative of .000S06.
The
restricted m.l.e.'s of the structural parameters are
.7384
""
M
;:;;:,
[5489 1. 4695J
=
.6740 2.2232
,
~
A
.639S
= . S742
.
20~91
.3967 , and
.330S
. 74SS
.2280
.8981
""
'1'
~
= .S0381s .
(6.4.2)
J
The goodness of fit statistic for the restricted model is 32.07,
and would be approximately distributed as a chi-square variate with
11 degrees of freedom if the model fitted.
The approximate p-value
120
is .0007, indicating that the more restricted model does not fit
the data.
To test (6.4.1) we form the difference of the goodness of
fit statistics, in accordance with the procedure of §5.3, obtaining
a test statistic of 22.56.
Under H the test statistic is approxO
imately distributed as a chi-square variate with 4 degrees of freedom.
The approximate p-value is thus .0002 and H would be rejected at the
O
.0002 significance level.
Since we know HO to be false, this result
gives us some assurance with respect to the power of the test procedure.
Let us now consider testing a simple linear hypothesis about
:
Let
using the Wald statistic of §5.4.
column of :
~.
~J
denote the j-th
(j=1,2), and consider testing
vis
(6.4.3)
the hypothesis of no location difference between the populations.
Note that
H imposes two constraints upon
O
This hypothesis may be put in the framework of §5.4 by letting
~
= 12
and
£' = [-1,1]. Note that
o
-1
The unrestricted m.l.e. of
~
1
o
~]
(6.4.4)
given in (6.3.1) may be written
as the vector
~
=
[.5078, .7566,1.3459,2.4067]' .
The estimated large sample variance covariance matrix of
(6.4.5)
~
obtained from the information matrix evaluated at the m.l.e.'s (6.3.1)
is
121
r .0165
"-
E
=
-.0041
.0223
.0090
-.0054
.0397
L-·0047
.0186
-.0163
I
e
(6.4.6)
.0706J
Premultiplying (6.4.5) by (6.4.4) we find the estimated differences in location
(£' 8
~)~n
= [.9381,
(6.4.7)
1.6501]' ,
The estimated large sample variance-covariance matrix of (6. 4 .7),
obtained on premultiplying (6.4.6) by (6.4.4) and post-multiplying
by the transpose of (6.4.4), is
.0383
(£' 8
~)f(£ 8 ~')
=
[ -.0103
1
-. 0103
. 0558J '
(6.4.8)
and we see immediately that the differences shown in (6.4.7) are
too large relative to their variances for HO to be true.
Inverting (6.4.8), premultiplying by the transpose of (6.4.7)
and post-multiplying by (6.4.7), we obtain a value of 84.67 for the
Wald statistic.
Since the approximate distribution of the statistic
under H is a central chi-square with 2 degrees of freedom, we reject
o
The reason that the statistic is so large is that the Mahalanobis
distance between the populations is very large.
The true differences,
obtained from (6.2.1), are
(£' 8
~)~
=
[1.0,
2.0]' ,
(6 A ,9)
and the true large sample variance-covariance matrix of (6.4.7),
obtained from the information matrix evaluated at the true parameter
122
values (6.2.1), is
=
r
.0352
L-·0080
-.0080]
.0682 .
(6.4.10)
Inverting (6.4.10), pre-multiplying by the transpose of (6.4.9)
and post-multiplying by
meter to be
0
= 103.16.
(6.4.9), we find the non-centrality paraAccording to the results of §5.4, the
true large sample distribution of the Wald statistic is a non-central
chi-square with 2 degrees of freedom and non-centrality 103.16.
The
probability that a X~(103.l6) variate exceeds 84.67 is .8430; thus
the results obtained are not at all surprising, given the actual
difference between the populations.
Since the method used to simulate the data gives the values of
the latent variates, a direct test of
H is possible.
O
From ordinary
linear models analysis we obtain a value of 151.14 for the likelihood
ratio statistic, and 228.44 for the Hotelling trace statistic, which
is equivalent to the Wald statistic for this problem.
Under H these
O
statistics are each distributed as chi-square with 2 degrees of freedom in large samples, and lead to rejection of HO.
The fact that
they are larger than the statistic arising from the latent linear
model analysis illustrates the fact that some power is lost when the
latent variates cannot be observed.
The fact that the latter stat-
istic is nonetheless highly significant, however, gives some reassurance with respect to the power of tests of hypotheses pertaining to
latent variables based on observable indicators.
VII.
7.1
A SIMJLATIClJ S1UDY
Introduction
We now report the results of a simulation study conducted to
evaluate the large sample approximations of Chapters 4 and 5.
§7.2
In
we describe the population model and the characteristics of
the study, and in §7.3 we consider the distributions of the goodness
of fit statistic, of the estimators of the structural parameters,
and of the Wald statistic.
7.2 Simulation of Data
Let us consider a two-sample latent linear model with 3 measurements and 1 factor.
if observation
j
is from the i-th population and
and let the structural parameters be
f
=
[1,2],
a .... 1
1J
a .. = 0 otherwise,
1J
Let the design matrix have elements
~=U] ,
and
¢ = 1,
! = diag
.91]
(.51
.75.
In order to achieve identification, the parameter
(7.2.1)
¢
is fixed
at 1, while all others are free parameters.
The reduced form parameters are
[3
.§ = .5
.7
.~]
1.0
1.4
,
and
....V =
ps
.21
I
.35
The structural parameters have been chosen so that
for all i.
J
(7 • 2 • 2)
Var(x.)
1
=1
Thus the correlation between the i-th and j-th measurements
e
124
of the latent variable
y
Cov(x. ,x.) = A.A.
is simply
1
J
1
J
as given in
(7.2.2) , and the correlation between the i-th measurement and
Cov(x.y) = A. , (see in this regard §2.4.3) .
simply
1
1
y
is
Thus we use 3
measurements which have correlations of .3, .5 and .7 with the underlying factor.
Note that there are 12 elements in
meters in
and
f.
~
and
y,
and 8 free para-
Hence the number of degrees of freedom
associated with the model is 4.
Let the elements of
~, ~
and
!
as
be ordered in a vector
follows:
(7.2.3)
The information matrix evaluated at the true parameter values
using (4.6.12) - (4.6.19)
1(8)
,..
....
is
r~·lOO7
... 2323
3.6127
-.4782
-1.0202
3.9411
.1310
-.0114
-.0285
.5008
-.0244
.3203
-,1070
.0038
,6580
-.1032
-.2203
.6736
.0161
.0734
l·O654
.1309
.1396
.2874
0
0
.2792
.5748
0
0
=
1
,6866
o
o
,2906
o
(7.2.4)
and its inverse is
125
l
.4459
•
. 2187
.6228
.3407
.6567
1.1235
-.0893
-.0053
.0045
2.0225
!-l(~)= -.0124
-.1689
.0166
-.0043
1.6316
-.1936
-.3051
-.8710
-.0666
-.2467
-.5424
-.9088
-1.4600
.0182
.0675
1. 0515 5.4436
L-1.0848
-1. 8177
-2.9200
.0364
.1350
2.1030 4.0051
I
I
2.2119
(7.2.5)
We take samples of size 50 from each population, so that n = 100.
The data were generated using the same procedure as in §6.2
The sum
of products were accumulated as the observations were generated, and
punched out for analysis.
A total of 1000 sets of data were generated
in this manner.
To estimate the parameters efficiently a special computer program
~
was written to evaluate the function
F
and its derivatives using
all simplifications that obtain in a one-factor model.
For example
from (3.4.1),
(V
~
-1
l/~.
{
1J
-
1
) .. =
6 =
1
1
-A. A. /[ ~. ~.
1 J
where
A./[~.
2
2 (1+6)],
3
2
i=l
1
1
L A./~.
J
(l +6) ]
i=j
(7.2.6)
i#j
(7.2.7)
1
and from (3.5.3),
(7.2.8)
After experimenting with different numerical procedures to compute
the estimates for ten sets of data, it was decided to use the FP-II
method with initial estimates given by the true parameter values, and
with initial
£
matrix obtained by dividing by 2 the elements of the
I
11.45l3J
126
inverse of the information matrix evaluated at the true parameter
values.
The method was found to require between 3 and 5 iterations
to converge for the test samples, the convergence criterion being
F be less than .001 in absolute
that all partial derivatives of
value.
Note that the number of parameters being estimated iteratively
is 6.
The efficiency of the method is further illustrated by the fact
that computation of the estimates for 1000 sets of data took only
23.9 seconds of CPU time in the IBM System/360-75.
The iterative pro-
cedure converged in all but 3 cases (out of 1000) for which the estimate of the smallest specificity,
~3'
tended to become negative.
As
mentioned in §1.3.2 this is quite common in factor analysis, particularly
in small samples when a specificity is not very large relative to the
standard error of its estimate.
For the sake of simplicity, these sets
of data were replaced.
7.3 Empirical Distributions
7.3.1
The Goodness of Fit Statistic
According to the results of §5.2, we expect the goodness of fit
statistic to be approximately distributed as a central chi-square
variate with 4 degrees of freedom.
Table 7.3.1 gives the number of statistics falling below selected
percentile values of the
X~
distribution.
Note that the fit is
remarkably good, particularly in the upper tail, which is used in
setting up critical regions.
A test of the goodness of fit of the
model at the 95% significance level based on the asymptotic theory
would accept the null hypothesis in 95.1% of the samples.
127
Table 7.3.1
Number of Goodness of Fit Statistics
Falling Below Selected Percentiles of X~
Number of
Statis tics
Percentile
20
185
40
375
60
589
80
790
90
903
95
951
99
989
===:=s_=:...?=~========
The Kolmogorov-Smirnov goodness of fit statistic is 0.0256, and
the associated p-value is .5305.
The chi-square goodness of fit
test us ing 5 categories with expected frequencies of 20% gives
x~ = 3.11, with associated p-value of .5396.
These results indicate that the
X~
distribution provides a
very good approximation to the distribution of the goodness of fit
statistic for moderate size samples.
7.3.2
Estimates of Structural Parameters
According to the results of Chapter 4, we expect the maximum like-
lihood estimators of the structural parameters to be approximately
normally distributed with means equal to the true parameter values
(7.2.1) and variances and covariances given by the inverse of the
information matrix (7.2.5) divided by the sample size, which is n=lOO.
Table 7.3.2 shows the mean, variance, skewness, kurtosis, and the
Kolmogorov-Smirnov goodness of fit statistic together with its p-value
for each estimate.
~
128
Table 7.3.2
Descriptive Statistics for Empirical
Distribution of the Maximum Likelihood Estimators
Estimator
Mean
Variance
_.===::;-...-=----~
Skewness
,...
=
"-
==
~~-::::a::
-=-~
Kurtosis
D
p-va1ue
~
.2945
.0045
.212
.176
.0502
.0129
.4897
.0068
-.023
-.068
.0599
.0015
A3
.6870
.0125
-.023
-.133
.0549
.0048
"-
.8963
.0180
.305
.174
.0704
.0001
.7376
.0165
.455
.710
.0656
.0004
.4979
.0252
.142
.081
.0625
.0008
t;;1
1.0572
.0666
.888
2.261
.0661
.0003
'"t;;2
2.0901
.1660
1.134
2.376
.0772
.0000
A1
A
A2
"-
1jJ1
"-
1jJ2
"-
1jJ3
A
=--=---::r.
-
===
~
-==e
Note that the empirical means are very close to the true parameter
values, and that the empirical variances are very close to the asymptotic variances given by the inverse of the information matrix.
The
empirical variance-covariance matrix of the estimates is
.0045
1\
A
Var(Q)
=
.0024
.0068
.0039
.0064
.0125
-.0005
.0003
-.0115
.0180
.0001
-.0014
.0008
.0000
.0165
-.0024
-.0031
-.0100
-.0001
-.0042
.0252
-.0063
-.0109
-.0175
.0002
.0003
.0130
.0666
-.0145
-.0240
-.0368
.0006
.0022
.0263
.0593
o166j
(7.3.1)
Comparison of (7.3.1) with the inverse of the information matrix
(7.2.5) indicates that all empirical variances and covariances are very
close to the asymptotic values.
This result gives us assurance as to
129
the correctness of the formulae for the elements of the information
matrix given in §4.6, and indicates a fast rate of convergence of
the first two moments.
Convergence of higher order moments, however, appears to be
slower, judging from the values of the skewness and kurtosis coefficients.
The Kolmogorov-Smirnov statistics given in the sixth column
of Table 7.3.2, on the other hand, range between .0502 and .0772, and
seven of them indicate a lack of fit significant at the .001 level
or better.
estimate
Thus for a sample of size 100 the distributions of the
appears not to be normal.
The actual values of, the statistics are not very large, though,
and the question of most practical import is to what extent the
asymptotic distribution can serve as a reasonable approximation for
a sample of this size.
To investigate this matter the information
matrix evaluated at the m.l.e.'s was computed and inverted for each
sample, and an approximate 95% confidence interval of the form
e.
1
where
-1
1. . (8)
-11 -
+ 1.96/I~~(8)/100
-11 -
,
(7.3.2)
denotes the i-th diagonal element of
constructed for each of the parameters.
Table 7.3.3 gives the number
of such intervals that actually contained the parameter.
130
Table 7.3.3 Number of 95% Confidence Intervals Found
to Contain True Parameter Value
Parameter
Number
======s.o=-rz==-=
)\1
937
A2
937
A3
942
W1
W2
927
W3
940
~1
963
~2
964
931
Note that the empirical confidence coefficients range from 92.7%
to 96.4%, and thus are close to the asymptotic value.
Therefore,
confidence intervals computed from moderate size samples on the
basis of the asymptotic theory would be acceptable for most practical
purposes.
7.3.3
The Wa1d Test for Linear Hypotheses
Let us now consider testing the hypothesis of no location differ-
ence between the populations using the Wa1d statistic of §5.4.
The
true difference between the populations, obtained from (7.2.1) is
~2 - ~1
=1
,
(7.3.3)
and the large sample variance of the estimated difference obtained
from (7.2.5), is
"
Var(~2
From the results of §5.4
"
- ~1)
= .0888.
(7.3.4)
we expect the Wald statistic to be
131
approximately distributed as a non-central chi-square variate with
1 degree of freedom and non-centrality parameter
6
= 8.88.
For each of the 1000 samples, the Wald statistic was computed
using the inverse of the information matrix evaluated at the m.l.e.'s.
Table 7.3.4 shows the number of statistics falling below selected
percentile values of the
X~ (8.88) distribution.
Number
Percentile
==-==========-===-=========
5
45
10
101
20
238
40
426
60
567
80
715
The chi-square goodness of fit statistic using 5 categories with
expected frequencies of 20% is
all fit.
74.99, and indicates a very poor over-
Inspection of the data shows that the empirical distribution
of the Wald statistic has a very long upper tail. However, it is the
lower tail that would be used in computing approximate powers,
and the fit is considerably better there.
We now assess the
practical significance of this result.
An approximate test of
H at the 95% significance level based
O
on the asymptotic central distribution of
3.8415.
W rejects if
2
W'>'- X1,.95-
The approximate power of the test computed from the asymptotic
132
non-central distribution of
W is therefore given by
Pr{X~(8.88) ~ 3.8415}
=
.8461 .
(7.3.5)
.
On the other hand, the empirical power, given by the number of
statistics exceeding 3.8415 in the simulation study, was found to be
.825.
In conclusion, the results of this section indicate that the
asymptotic distribution theory provides quite reasonable approximations
for moderate sample sizes, particulary for the goodness of fit statistic
and for the standard errors of the estimates.
More extensive simulation
studies would be needed, however, to compare results for alternative
sets of parameter values and to establish empirically the rates of
convergence of the distributions.
VI I I. SUGGESTIONS FOR FUR1HER RESEARCH
8.1
Introduction
In this chapter we provide some suggestions for future research
by proposing two extensions of the latent linear model: the latent
growth curve model, stated in §8.2; and the latent covariance
structure model, stated in §8.3.
These models are similar to the
latent linear model in that observable indicators are used to estimate parameters and test hypotheses pertaining to unobservable
constructs, and belong to the general family of structural linear
models considered in Chapters 4 and 5.
8.2 The Latent Growth Curve Model
Consider the growth curve model
r =~
y:
where
£:
qxn
q xq'
q'~q, ~:
r<n,
g:
is a matrix of
+
n
(8.2.1)
f"
observations on
variates,
is a within-subjects design matrix of full column rank
rxn
q'xr
is an across-subjects design matrix of full row rank
is a matrix of unknown
regression parameters, and
f,:qxn is a stochastic matrix of errors with
Var(n
.-
q
= <I>
~
@
=Q
and
I .
~
Suppose now that
data matrix
E(f,)
~:
pxn
y is not observable.
related to
Instead we observe a
y by the factor analysis model
134
AY
....x = ........
where
~:
pxq
(8.2.2)
....Z
+
is a matrix of factor loadings and
E(~) =
of errors with
Q'
Var(~) =
matrix of specificities, and
f
@
Cov(y,~) =
In '
where
~:
pxn
is a matrix
is' a diagonal
O.
Combining (8.2.1) and (8.2.2) we can write the model as
(8.2.3)
where
EC!D
=
0
Var(U)
....
=
V @ "'n
I
....
.§ = lli,
V = ,..,....,.....
A¢lA'
and
(8.2.4)
+ If .
(8.2.5)
The model is thus seen to be a multivariate linear model where
the regression and dispersion parameters are structure and related
to each other.
Since its basis is a growth curve model on unobserv-
able variates, it will be termed the latent growth curve model.
The model differs from the latent linear model only in the
addition of the new design matrix
....p
and hence the modifications
required in the results of Chapter 3 to obtain the maximum likelihood
estimators of the structural parameters are relatively simple.
Also,
since the model belongs to the family of structural linear models
considered in Chapters 4 and 5, the general asymptotic results given
there can be applied.
8.3 The Latent Covariance Structure Model
Consider now the covariance structure model
y =~
+
f,
(8.3.1)
135
y, £'
where
~
~
and
E(f) = Q and Var
errors with
[:qxs
(£)
= nT'
~
~
where
f
are as defined in §8.2.
~
=
where
(8.3.2)
+ ~,
T
is a matrix of loadings,
f:sxs
is a symmetric p.d.
matrix of factor variances and covariances, and
matrix of specificities.
is a matrix of
!:qxq is a diagonal
This part of the model is similar to, but
less general than, Joreskog's (1970a) covariance structure model,
considered in §1.5.
Suppose, as before, that
Y is not observable but instead we
observe
(8.3.3)
where
~:
pxq
~
is a matrix of loadings and
of errors with
E(~)
=Q,
Var(~)
=! e
diagonal matrix of specificities, and
In
is a stochastic matrix
,where
Cov(y,~)
!:
pxp
is a
= O.
Combining (8.3.1) and (8.3.3) we obtain
(8.3.4)
where
E OD
=Q
Var(l!)
= Yx
In
(8.3.5)
and
y = ~(£k['
+ !)~' +
! .
(8.3.6)
The resulting model is thus a structural linear model of the
general type considered in Chapters 4 and 5.
The structure of
is identical with the covariance structure (1.5.3) in Joreskog's
model, but the structure of
~
does not appear in (1.5.2).
Since the basis of the model is a
involves the parameter
~
, which
Y
136
covariance structure model
on unobservable variates it will be
termed the tatent covariance structure modet.
The relationship between this model and Joreskog's (1970a)
model for the analysis of covariance structures can best
~e
cussed in terms of an example given in Joreskog (1970a).
dis-
Consider
the Wiener stochastic process
(8.3.7)
(t=l, ... ,q) where
crements.
~t
= E(y ), and
t
~
are independent in-
This may be written in matrix form as
l
where
u l '·· .,ut
=~
(8.3,8)
+ ~ ,
is a lower triangular matrix whose nonzero elements are all
The variance-covariance matrix of l
unity.
~ =~'
is
,
(8.3.9)
k is a diagonal matrix whose elements are the variances of
where
the independent increments.
l
Joreskog (1970a) assumes that instead of
~
of
q
we observe a vector
measurements, given by
(8.3.10)
where
~
Var(~)
=!
is a q-vector of measurement errors with
a diagonal matrix, and
- -
E(x) =
~,
Cov(l,~)
=
Q.
and
E(~)
=
Q,
Thus
(8.3.11)
(8.3.12)
which is a special case of (1,5,2) - (1.5,3) with
= -n
l'
137
r
k restricted
=~,
to be diagonal,
Suppose now that instead of
! =
where
!
is a vector of
factor loadings, and
z
I
=
Q and
~
=
1.
X we observe
~ + ~ ,
(8.3.13)
~:
p>q measurements,
pxq
is a matrix of
is a p-vector of measurement errors with
and
(8.3.14)
(8.3.15)
which is a special case of (8.3.4) - (8.3.6)
[
= ~,
k restricted to be diagona, and T
=
Q.
with
P
t""oJ
=I
-.,J
A = ~'
I'
"""J
This model will be termed
the Latent Wiener Process.
While Joreskog (1970a) considers the case of a single fallible
measurement
variance
~t'
indicators
variance ft.
of
Y , with regression coefficient
t
1
and error
the more general model given here considers a set of
of
Xt ' with regression coefficients
~t
and error
APPENDIX
ON MATRIX DERIVATIVES
A.I
Introduction
Matrix deriva7-ives were introduced in the statistical literature by
Dwyer and MacPhail (1948).
A number of useful results for matrix dif-
ferentiation have been given by Deemer and Olkin (1951), Olkin (1953),
Dwyer (1967), Neudecker (1969), Tracy and Dwyer (1969), Vetter (1970)
and more recently MacRae (1974).
Some elementary results appear in the
text by Graybill (1970, pp. 260-70).
The derivation of these results, for the most part, has not been
based on a unified matrix differential calculus.
As MacRae (1974) has
noted, either a typical element of a matrix array is examined in hopes
of inferring a matrix expression for the entire array, or matrices of
total differentials are obtained and transformed into arrays of derivatives using special theorems.
Compare Anderson (1958) and Neudecker (1969).
Furthermore, most authors consider derivatives of a scalar with respect to a matrix and of a matrix with respect to a scalar, but do not consider explicitly the derivative of a matrix with respect to a matrix. The latter is usually described in terms of the derivatives of a matrix with respect to each of the elements of another matrix. See for example Dwyer (1967).
Recently, MacRae (1974) has proposed a unified approach to matrix
differentiation and has introduced new matrix operations and identities.
Her approach simplifies the notation and facilitates application by
providing a formal matrix differential calculus, i.e., a set of general
rules for dealing with matrix differentiation.
A similar approach may
be found in the paper by Vetter (1970).
In this appendix we review some basic definitions and theorems and
collect a number of results that are used throughout the dissertation.
A.2 Definition of Matrix Derivatives
We consider first derivatives of scalar functions of matrices.

Definition A.2.1. Let f be a scalar function of a matrix X: m×n. Then we define the derivative of f with respect to X as the m×n matrix

    ∂f/∂X = (∂f/∂x_ij).                                         (A.2.1)

Example A.2.1. The following two results are well known. If X is an m×m matrix, then

    ∂ tr X/∂X = I_m ,                                           (A.2.2)

and

    ∂ log|X|/∂X = (X⁻¹)' ,                                      (A.2.3)

for X non-singular. Result (A.2.2) follows directly from the definition. Result (A.2.3) may be obtained by writing the determinant in terms of cofactors, see for example Graybill (1970, pp. 266-267).
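Both results are easy to check numerically. The short sketch below (written in NumPy purely as an added illustration; the test matrix is arbitrary) compares (A.2.2) and (A.2.3) with central finite differences.

    import numpy as np

    def num_grad(f, X, h=1e-6):
        """Finite-difference approximation to the matrix (df/dx_ij)."""
        G = np.zeros_like(X)
        for i in range(X.shape[0]):
            for j in range(X.shape[1]):
                E = np.zeros_like(X); E[i, j] = h
                G[i, j] = (f(X + E) - f(X - E)) / (2 * h)
        return G

    rng = np.random.default_rng(0)
    X = 0.3 * rng.normal(size=(3, 3)) + 3.0 * np.eye(3)      # arbitrary non-singular matrix

    print(np.allclose(num_grad(np.trace, X), np.eye(3)))                           # (A.2.2)
    print(np.allclose(num_grad(lambda M: np.log(np.linalg.det(M)), X),
                      np.linalg.inv(X).T, atol=1e-5))                              # (A.2.3)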
We now consider the derivative of a matrix function of a matrix X.

Definition A.2.2. Let Y: p×q be a matrix whose elements are functions of a matrix X: m×n. Let ∂/∂X = (∂/∂x_ij) be a matrix of partial derivative operators. Then we define the matrix derivative of Y with respect to X as

    ∂Y/∂X = Y ⊗ ∂/∂X ,                                          (A.2.4)

where multiplication by a partial derivative operator corresponds to the operation of partial differentiation, i.e., y_kl (∂/∂x) = ∂y_kl/∂x, and thus

    ∂Y/∂X = ( ∂y_11/∂X  ...  ∂y_1q/∂X )
            (    ...            ...   )  :  pm×qn.              (A.2.5)
            ( ∂y_p1/∂X  ...  ∂y_pq/∂X )

This definition is due to MacRae (1974). Neudecker (1969) and Vetter (1970) arrange matrix derivatives in a pattern that corresponds to ∂/∂X ⊗ Y, but do not use the concept of a partial derivative operator.

Definition A.2.3. Let E_ij be an m×n matrix with 1 in the (i,j)-th position and 0 elsewhere, and define

    E_(m,n) = ( E_11  ...  E_1n )
              (  ...        ... )  :  m²×n² ,                   (A.2.6)
              ( E_m1  ...  E_mn )

    I_(m,n) = ( E'_11  ...  E'_1n )
              (  ...         ... )  :  mn×mn.                   (A.2.7)
              ( E'_m1  ...  E'_mn )

The matrix I_(m,n) is called the permuted identity matrix, see Tracy and Dwyer (1969) and MacRae (1974). The notation E_(m,n) has been introduced by de Waal (1974).

Theorem A.2.1. Let X be an m×n matrix. Then

    ∂X/∂X = E_(m,n) ,                                           (A.2.8)
    ∂X'/∂X = I_(m,n).                                           (A.2.9)

This theorem is given by MacRae (1974) and Vetter (1970) using a different notation. The proof follows from Definitions A.2.2 and A.2.3.
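As an added numerical illustration (not part of the original text), the sketch below transcribes (A.2.6) and (A.2.7) into NumPy and checks the familiar property that I_(m,n) carries vec(X) into vec(X'); the dimensions chosen are arbitrary.

    import numpy as np

    def E(i, j, m, n):
        """Elementary m x n matrix with a one in the (i, j)-th position."""
        M = np.zeros((m, n)); M[i, j] = 1.0
        return M

    def E_block(m, n):
        """E_(m,n) of (A.2.6): block array whose (i, j) block is E_ij."""
        return np.block([[E(i, j, m, n) for j in range(n)] for i in range(m)])

    def I_perm(m, n):
        """I_(m,n) of (A.2.7): the permuted identity matrix, blocks E_ij'."""
        return np.block([[E(i, j, m, n).T for j in range(n)] for i in range(m)])

    m, n = 2, 3
    X = np.arange(1.0, m * n + 1).reshape(m, n)
    K = I_perm(m, n)
    # I_(m,n) acts as a commutation matrix: it carries vec(X) into vec(X').
    print(np.allclose(K @ X.flatten(order="F"), X.T.flatten(order="F")))
    print(E_block(m, n).shape)      # (m*m, n*n), the dimensions stated in (A.2.6)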
A.3 Rules for Matrix Differentiation
We now review some rules for matrix differentiation and illustrate their application.

A.3.1 The Sum, Product and Inverse Rules

Theorem A.3.1. (The Sum Rule). Let Y: p×q and Z: p×q be matrices whose elements are functions of a matrix X: m×n. Then

    ∂(Y + Z)/∂X = ∂Y/∂X + ∂Z/∂X.                                (A.3.1)

The proof follows directly from Definition A.2.2.

Theorem A.3.2. (The Product Rule). Let Y and Z be conformable matrices whose elements are functions of X: m×n. Then

    ∂YZ/∂X = (∂Y/∂X)(Z ⊗ I_n) + (Y ⊗ I_m)(∂Z/∂X).               (A.3.2)

For a proof see MacRae (1974). An analogous result is given by Vetter (1970) using a different notation.
Example A.3.1. (Linear Forms). Let A be p×m, X be m×n and B be n×q, where A and B do not depend on X. Then

    ∂AXB/∂X = (A ⊗ I_m) E_(m,n) (B ⊗ I_n).                      (A.3.3)

Proof: Using the product rule, since ∂A/∂X = 0,

    ∂AXB/∂X = (A ⊗ I_m) ∂XB/∂X ,

and using the product rule again, since ∂B/∂X = 0,

    ∂XB/∂X = (∂X/∂X)(B ⊗ I_n) ,

then (A.3.3) follows using (A.2.8).

Collecting elements in different blocks of (A.3.3) we obtain

    ∂AXB/∂x_ij = A E_ij B.                                      (A.3.4)

This result is well known, see for example Tracy and Dwyer (1969).
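The block form (A.3.4) is easily verified numerically; the following sketch (NumPy, added for illustration, with arbitrary matrices and index) compares it with a central difference.

    import numpy as np

    rng = np.random.default_rng(2)
    p, m, n, q = 2, 3, 4, 2
    A, X, B = rng.normal(size=(p, m)), rng.normal(size=(m, n)), rng.normal(size=(n, q))

    i, j, h = 1, 2, 1e-6
    Eij = np.zeros((m, n)); Eij[i, j] = 1.0

    numeric = ((A @ (X + h * Eij) @ B) - (A @ (X - h * Eij) @ B)) / (2 * h)
    print(np.allclose(numeric, A @ Eij @ B))    # agrees with (A.3.4)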
Example A.3.2. (Quadratic Forms). Let X be m×n, B be n×n and C be m×p, where B and C do not depend on X. Then

    ∂XBX'C/∂X = E_(m,n)(BX'C ⊗ I_n) + (XB ⊗ I_m) I_(m,n) (C ⊗ I_n).     (A.3.5)

Proof: Using the product rule,

    ∂XBX'C/∂X = (∂XB/∂X)(X'C ⊗ I_n) + (XB ⊗ I_m)(∂X'C/∂X) ,

and using the product rule again, since ∂B/∂X = 0 and ∂C/∂X = 0,

    ∂XB/∂X = E_(m,n)(B ⊗ I_n) ,    ∂X'C/∂X = (∂X'/∂X)(C ⊗ I_n).

Then (A.3.5) follows from the well-known identity for Kronecker products

    (A ⊗ B)(C ⊗ D) = AC ⊗ BD ,

and Theorem A.2.1.

An important special case of (A.3.5) is obtained setting C = I_m and noting that I_m ⊗ I_n = I_mn. Then

    ∂XBX'/∂X = E_(m,n)(BX' ⊗ I_n) + (XB ⊗ I_m) I_(m,n).         (A.3.6)

Collecting terms in different blocks of (A.3.6) we obtain

    ∂XBX'/∂x_ij = E_ij BX' + XB E'_ij ,                         (A.3.7)

a well-known result, see Tracy and Dwyer (1969).
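Again a finite-difference check is immediate; the sketch below (NumPy, added for illustration, arbitrary matrices) verifies (A.3.7).

    import numpy as np

    rng = np.random.default_rng(3)
    m, n = 3, 4
    X, B = rng.normal(size=(m, n)), rng.normal(size=(n, n))

    i, j, h = 0, 2, 1e-6
    Eij = np.zeros((m, n)); Eij[i, j] = 1.0

    numeric = ((X + h * Eij) @ B @ (X + h * Eij).T
               - (X - h * Eij) @ B @ (X - h * Eij).T) / (2 * h)
    print(np.allclose(numeric, Eij @ B @ X.T + X @ B @ Eij.T))    # agrees with (A.3.7)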
Theorem A.3.3. (The Inverse Rule). Let Y be a non-singular matrix function of X: m×n. Then

    ∂Y⁻¹/∂X = -(Y⁻¹ ⊗ I_m)(∂Y/∂X)(Y⁻¹ ⊗ I_n).                   (A.3.8)

The proof follows from the product rule writing ∂YY⁻¹/∂X = 0.

An important special case of (A.3.8) is obtained when X = Y for X non-singular. Then

    ∂X⁻¹/∂X = -(X⁻¹ ⊗ I_m) E_(m,m) (X⁻¹ ⊗ I_m).                 (A.3.9)

Collecting terms in different blocks of (A.3.9) we obtain

    ∂X⁻¹/∂x_ij = -X⁻¹ E_ij X⁻¹ ,                                 (A.3.10)

a result given in Tracy and Dwyer (1969).
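The special case (A.3.10) may also be checked numerically; the sketch below (NumPy, added for illustration, arbitrary non-singular matrix) does so with a central difference.

    import numpy as np

    rng = np.random.default_rng(4)
    m = 3
    X = 0.3 * rng.normal(size=(m, m)) + 2.0 * np.eye(m)    # arbitrary non-singular matrix

    i, j, h = 2, 0, 1e-6
    Eij = np.zeros((m, m)); Eij[i, j] = 1.0

    numeric = (np.linalg.inv(X + h * Eij) - np.linalg.inv(X - h * Eij)) / (2 * h)
    Xinv = np.linalg.inv(X)
    print(np.allclose(numeric, -Xinv @ Eij @ Xinv, atol=1e-6))    # agrees with (A.3.10)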
A.3.2 The Star Product and the Chain Rule

MacRae (1974) has introduced the following operation.

Definition A.3.1. (Star Product). Let A be p×q and let B: pm×qn be partitioned into pq blocks B_ij: m×n. Then the star product of A and B is defined as the m×n matrix

    A * B = Σ_ij a_ij B_ij.                                     (A.3.11)

If A and B have the same dimensions then

    A * B = tr A'B.                                             (A.3.12)

Theorem A.3.4. Let A be m×p, B be p×q and C be q×n. Then

    A B C = B' * (C ⊗ I_m) I_(m,n) (A ⊗ I_n)                    (A.3.13)
          = B * (A' ⊗ I_m) E_(m,n) (C' ⊗ I_n).                  (A.3.14)

Identity (A.3.13) is given by MacRae (1974) and (A.3.14) in the present notation is given by de Waal (1974). The proofs can be established by direct computation. These results are very useful in applications. Other identities that we have not used in our work may be found in MacRae (1974) and de Waal (1974).
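A direct transcription of Definition A.3.1 may help fix ideas. The sketch below (NumPy, added for illustration with arbitrary matrices) forms A * B by summing the weighted blocks, checks the same-dimension case (A.3.12), and checks the product representation (A.3.14) as stated above.

    import numpy as np

    def E(i, j, m, n):
        M = np.zeros((m, n)); M[i, j] = 1.0
        return M

    def E_block(m, n):
        """E_(m,n): block array whose (i, j) block is E_ij, as in (A.2.6)."""
        return np.block([[E(i, j, m, n) for j in range(n)] for i in range(m)])

    def star(A, B, m, n):
        """Star product (A.3.11): A is p x q, B is pm x qn in p*q blocks of size m x n."""
        p, q = A.shape
        return sum(A[i, j] * B[i * m:(i + 1) * m, j * n:(j + 1) * n]
                   for i in range(p) for j in range(q))

    rng = np.random.default_rng(5)
    # Same dimensions: the star product reduces to a trace, identity (A.3.12).
    A1, B1 = rng.normal(size=(2, 3)), rng.normal(size=(2, 3))
    print(np.allclose(star(A1, B1, 1, 1), np.trace(A1.T @ B1)))

    # Check of (A.3.14) with A: m x p, B: p x q, C: q x n.
    m, p, q, n = 2, 3, 4, 5
    A, B, C = rng.normal(size=(m, p)), rng.normal(size=(p, q)), rng.normal(size=(q, n))
    big = np.kron(A.T, np.eye(m)) @ E_block(m, n) @ np.kron(C.T, np.eye(n))
    print(np.allclose(A @ B @ C, star(B, big, m, n)))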
The importance of the star product is given by the following result due to MacRae (1974).

Theorem A.3.5. (The Chain Rule). Let f be a scalar function of a matrix Y: p×q whose elements are functions of a matrix X: m×n. Then

    ∂f/∂X = ∂f/∂Y * ∂Y/∂X.                                      (A.3.15)

The proof follows directly from the ordinary chain rule and the definition of the star product.

If x is a scalar then, by the simplification of star products of matrices of the same dimension noted in (A.3.12),

    ∂f/∂x = tr[(∂f/∂Y)' ∂Y/∂x].                                 (A.3.16)

We also note that by repeated use the theorem applies to the case where f is a scalar function of Z which is a matrix function of Y which in turn depends on X. In this case

    ∂f/∂X = (∂f/∂Z * ∂Z/∂Y) * ∂Y/∂X ,                           (A.3.17)

where the star products must be evaluated as indicated, for the star product ∂Z/∂Y * ∂Y/∂X on the right-hand side of this expression is not defined unless Z is a scalar.

In the examples below let f be a scalar function of a matrix Y with matrix derivative ∂f/∂Y.
Example A.3.3. Let A be p×m, X be m×n and B be n×q. Then

    ∂f(AXB)/∂X = A' [∂f/∂Y] B' ,                                (A.3.18)

where ∂f/∂Y is evaluated at Y = AXB. The proof follows from the chain rule, Example A.3.1 and identity (A.3.14).
Example A.3.4. Let X be m×n, B be n×n and C be m×p. Then

    ∂f(XBX'C)/∂X = [∂f/∂Y] C'XB' + C [∂f/∂Y]' XB ,              (A.3.19)

where ∂f/∂Y is evaluated at Y = XBX'C. The proof follows from the chain rule, Example A.3.2 and identities (A.3.13) and (A.3.14).
Example A.3.5. Let X be non-singular. Then

    ∂f(X⁻¹)/∂X = -X⁻ᵀ [∂f/∂Y] X⁻ᵀ ,                             (A.3.20)

where X⁻ᵀ = (X⁻¹)' and ∂f/∂Y is evaluated at Y = X⁻¹. The proof follows from the chain and inverse rules and identity (A.3.14).

A number of important special cases are obtained when f(Y) = tr Y for square Y. In particular, note that from Example A.3.3

    ∂ tr AXB/∂X = A'B' ,                                        (A.3.21)

and from Example A.3.4

    ∂ tr XBX'C/∂X = C'XB' + CXB.                                (A.3.22)
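These two trace derivatives are used repeatedly in §A.4. As an added numerical check (NumPy, arbitrary matrices, not part of the original derivation) the sketch below compares them with central differences.

    import numpy as np

    def num_grad(f, X, h=1e-6):
        """Finite-difference approximation to (df/dx_ij)."""
        G = np.zeros_like(X)
        for i in range(X.shape[0]):
            for j in range(X.shape[1]):
                E = np.zeros_like(X); E[i, j] = h
                G[i, j] = (f(X + E) - f(X - E)) / (2 * h)
        return G

    rng = np.random.default_rng(6)
    m, n, p = 3, 4, 2
    A = rng.normal(size=(p, m))
    B = rng.normal(size=(n, p))            # AXB is then p x p, so its trace is defined
    X = rng.normal(size=(m, n))
    B2 = rng.normal(size=(n, n))
    C = rng.normal(size=(m, m))

    print(np.allclose(num_grad(lambda M: np.trace(A @ M @ B), X), A.T @ B.T))       # (A.3.21)
    print(np.allclose(num_grad(lambda M: np.trace(M @ B2 @ M.T @ C), X),
                      C.T @ X @ B2.T + C @ X @ B2, atol=1e-5))                      # (A.3.22)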
Vetter (1970) has given a chain rule for matrix functions of
matrices.
His result, however, involves packing the columns of the
matrices one below another in vector form, and it appears unlikely
to be very useful in applications until further identities are established to simplify the results.
In the case of vector functions of vectors, however, the following result is easily established.
Theorem A.3.6. Let z: t×1 be a vector function of a vector y: s×1 whose elements depend on a vector x: r×1. Then

    ∂z/∂x' = (∂z/∂y')(∂y/∂x') ,                                 (A.3.23)

where ∂z/∂x' = (∂z_i/∂x_j): t×r, by Definition A.2.2.
A.3.3 A Note on Symmetric and Diagonal Matrices

So far in our discussion we have assumed that the elements of X are functionally independent. An important case of functional dependence that arises in our work is that of X symmetric. In this case the best strategy is to apply the formal matrix differentiation rules ignoring the symmetry and to bring it in at the last step by appropriately modifying the final result. For our purposes it will suffice to discuss the case of derivatives of scalar functions of symmetric matrices.

According to Definition A.2.1, the (i,j)-th element of ∂f/∂X is the partial derivative of f with respect to x_ij treating all other elements of X as constants. If X is symmetric, however, it may be desirable to compute partial derivatives of f with respect to the distinct elements x_ij (i ≤ j) treating all other elements except x_ji as constants. These derivatives may be arranged in convenient matrix form by defining

    ∂f/∂X_s = 2 ∂f/∂X - diag ∂f/∂X.                             (A.3.24)

In numerical optimization problems involving f(X) the required partial derivatives may be obtained from (A.3.24). Note, however, that (A.3.24) is not a matrix derivative in the sense of Definition A.2.1, just a convenient notation. This distinction is important in some applications, see for example Travinsky and Bargmann (1964).

Another important case where the elements of X are restricted is when X is diagonal. In the case of a scalar function f of a diagonal matrix X one may apply the formal matrix differentiation rules ignoring the fact that X is diagonal, and take account of this in the last step by defining

    ∂f/∂X_d = diag ∂f/∂X.                                       (A.3.25)

Alternatively, in many applications involving diagonal matrices it is possible to proceed directly from first principles.
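In numerical work the adjustment (A.3.24) amounts to doubling the off-diagonal elements of the unrestricted gradient. The small sketch below (NumPy, added for illustration, using f(X) = log|X| for a symmetric positive definite X chosen arbitrarily) checks this against a finite difference that perturbs x_ij and x_ji together.

    import numpy as np

    rng = np.random.default_rng(7)
    A = rng.normal(size=(3, 3))
    X = A @ A.T + 3 * np.eye(3)             # arbitrary symmetric positive definite matrix

    G = np.linalg.inv(X)                    # unrestricted gradient of log|X|, from (A.2.3)
    G_s = 2 * G - np.diag(np.diag(G))       # the adjustment (A.3.24)

    i, j, h = 0, 2, 1e-6
    E = np.zeros((3, 3)); E[i, j] = E[j, i] = h
    numeric = (np.log(np.linalg.det(X + E)) - np.log(np.linalg.det(X - E))) / (2 * h)
    print(np.isclose(numeric, G_s[i, j]))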
A.4 Maximum Likelihood Estimation in the Linear Model

We illustrate the application of matrix derivatives in maximum likelihood estimation in the multivariate linear model

    X = Ξ A + U ,                                               (A.4.1)

where X is p×n, Ξ is p×r, A is r×n of full row rank r < n, and

    U ~ N_{p×n}(0, V ⊗ I_n) ,                                   (A.4.2)

with V positive definite. The log-likelihood function may be written as

    log L = -(np/2) log 2π - (n/2) log|V| - (1/2) tr V⁻¹(X - ΞA)(X - ΞA)'.

Let us use the notation

    T = (X - ΞA)(X - ΞA)'.                                      (A.4.3)

Maximizing log L is equivalent to minimizing

    F(Ξ, V) = n log|V| + tr V⁻¹T.                               (A.4.4)

To differentiate this function with respect to Ξ use the chain rule (A.3.17), obtaining

    ∂F/∂Ξ = (∂ tr V⁻¹T/∂T * ∂T/∂(X - ΞA)) * ∂(X - ΞA)/∂Ξ.       (A.4.5)

Using (A.3.21),

    ∂ tr V⁻¹T/∂T = V⁻¹ ,                                        (A.4.6)

and using (A.3.6),

    ∂T/∂(X - ΞA) = E_(p,n)[(X - ΞA)' ⊗ I_n] + [(X - ΞA) ⊗ I_p] I_(p,n).     (A.4.7)

Substituting these results into (A.4.5) gives

    ∂ tr V⁻¹T/∂(X - ΞA) = 2 V⁻¹(X - ΞA) ,                       (A.4.8)

by identities (A.3.13) and (A.3.14). On the other hand,

    ∂(X - ΞA)/∂Ξ = -E_(p,r)(A ⊗ I_r) ,                          (A.4.9)

by the sum rule and (A.3.3). Using (A.4.8) and (A.4.9) in (A.4.5),

    ∂F/∂Ξ = -2 V⁻¹(X - ΞA)A' ,                                  (A.4.10)

by identity (A.3.14). On setting (A.4.10) to zero we obtain the well known result

    Ξ̂ = XA'(AA')⁻¹.                                             (A.4.11)
To differentiate with respect to V, use the sum and product rules, obtaining

    ∂F/∂V = n ∂ log|V|/∂V + (∂ tr V⁻¹T/∂V⁻¹) * ∂V⁻¹/∂V.         (A.4.12)

Using (A.2.3),

    ∂ log|V|/∂V = V⁻¹.                                          (A.4.13)

On the other hand,

    (∂ tr V⁻¹T/∂V⁻¹) * ∂V⁻¹/∂V = -T * (V⁻¹ ⊗ I_p) E_(p,p) (V⁻¹ ⊗ I_p)    (A.4.14)
                               = -V⁻¹ T V⁻¹ ,                            (A.4.15)

where (A.4.14) follows from (A.3.21) and (A.3.9), and (A.4.15) follows from identity (A.3.14). Combining (A.4.13) and (A.4.15) we obtain

    ∂F/∂V = V⁻¹[nV - T]V⁻¹.                                     (A.4.16)

So far we have ignored the symmetry of V. Setting to 0 each of the elements of ∂F/∂V, however, is the same as setting to 0 each of the elements of ∂F/∂V_s = 2 ∂F/∂V - diag ∂F/∂V, and gives V = n⁻¹T. To obtain V̂ note that T is a function of Ξ and evaluate it at Ξ = Ξ̂. Using the fact that [I_n - A'(AA')⁻¹A] is symmetric and idempotent gives the well known result

    V̂ = n⁻¹ X[I_n - A'(AA')⁻¹A]X'.                              (A.4.17)

An alternative derivation of these results, without using matrix derivatives, may be found in Anderson (1958, pp. 44-49 and 179-181).
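The closed forms (A.4.11) and (A.4.17) are easily computed. The sketch below (NumPy, added as an illustration with simulated data; all numerical values are hypothetical) forms both estimates and confirms that V̂ equals the averaged cross product of the residuals from Ξ̂.

    import numpy as np

    rng = np.random.default_rng(8)
    p, r, n = 3, 2, 50
    Xi = rng.normal(size=(p, r))                          # true regression parameters
    A = rng.normal(size=(r, n))                           # design matrix of full row rank
    V = np.array([[1.0, 0.3, 0.0], [0.3, 1.0, 0.2], [0.0, 0.2, 1.0]])
    U = np.linalg.cholesky(V) @ rng.normal(size=(p, n))   # errors with Var = V for each column
    X = Xi @ A + U                                        # the model (A.4.1)

    Xi_hat = X @ A.T @ np.linalg.inv(A @ A.T)             # (A.4.11)
    M = np.eye(n) - A.T @ np.linalg.inv(A @ A.T) @ A      # symmetric idempotent projector
    V_hat = X @ M @ X.T / n                               # (A.4.17)

    R = X - Xi_hat @ A                                    # residuals
    print(np.allclose(V_hat, R @ R.T / n))                # the two expressions for V agree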
BIBLIOGRAPHY

Anderson, T.W. (1958). An Introduction to Multivariate Statistical Analysis. New York: John Wiley and Sons.

Anderson, T.W. and Rubin, H. (1956). Statistical inference in factor analysis. In Neyman, J. (Editor). Proceedings Third Berkeley Symposium on Mathematical Statistics and Probability, 5: 111-150. Berkeley: University of California Press.

Archer, C.O. and Jennrich, R.I. (1973). Standard errors for rotated factor loadings. Research Bulletin 73-77. Princeton, New Jersey: Educational Testing Service.

Bargmann, R.E. (1957). A study of independence and dependence in multivariate normal analysis. Ph.D. Thesis, University of North Carolina. (Also Institute of Statistics Mimeo Series No. 186.)

Bartlett, M.S. (1937). The statistical conception of mental factors. British Journal of Psychology, 28: 97-104.

Bartlett, M.S. (1938). Methods of estimating mental factors. Nature, London, 141: 609-610.

Bartlett, M.S. (1950). Tests of significance in factor analysis. British Journal of Psychology - Statistics Section, 3: 77-85.

Bartlett, M.S. (1951). A further note on tests of significance in factor analysis. British Journal of Psychology - Statistics Section, 4: 1-2.

Bartlett, M.S. (1953). Factor analysis as a statistician sees it. Uppsala Symposium on Psychological Factor Analysis. Stockholm: Almquist and Wiksell.

Blalock, H.M. (1961). Causal Inferences in Nonexperimental Research. Chapel Hill: University of North Carolina Press.

Boudon, R. (1965). A method of linear causal analysis: dependence analysis. American Sociological Review, 35: 101-111.

Box, G.E.P. (1949). A general distribution theory for a class of likelihood ratio criteria. Biometrika, 36: 317-346.

Bradley, R.A. and Gart, J.A. (1969). The asymptotic properties of ML estimators when sampling from associated populations. Biometrika, 49: 205-214.

Browne, M.W. (1965). A comparison of factor analytic techniques. Master's Thesis, University of Witwatersrand, South Africa.

Browne, M.W. (1974a). Gradient methods for analytic rotation. British Journal of Mathematical and Statistical Psychology, 27: 115-121.

Browne, M.W. (1974b). Generalized least squares estimators in the analysis of covariance structures. South African Statistical Journal, 8: 1-24.

Clarke, M.R.B. (1970). A rapidly convergent method for maximum likelihood factor analysis. British Journal of Mathematical and Statistical Psychology, 23: 43-52.

Cramer, H. (1946). Mathematical Methods of Statistics. Princeton: Princeton University Press.

Crawford, C.B. and Ferguson, G.A. (1970). A general rotation criterion and its use in orthogonal rotation. Psychometrika, 35: 321-332.

Davidon, W.C. (1959). A variable metric method for minimization. AEC Research and Development Report ANL-5990.

Deemer, W.L. and Olkin, I. (1951). Jacobians of matrix transformations useful in multivariate analysis. Biometrika, 38: 345-367.

de Waal, D.J. (1974). Parametric multivariate analysis - Part 2. Supplement B: matrix derivatives. Mimeographed notes, Department of Statistics, University of North Carolina.

Duncan, O.D. (1966). Path analysis: sociological examples. American Journal of Sociology, 72: 3-16.

Dwyer, P.S. (1967). Some applications of matrix derivatives in multivariate analysis. Journal of the American Statistical Association, 62: 607-625.

Dwyer, P.S. and MacPhail, M.S. (1948). Symbolic matrix derivatives. Annals of Mathematical Statistics, 19: 517-534.
Eicker, F. (1963). Asymptotic normality and consistency of the least squares estimators for families of linear regressions. Annals of Mathematical Statistics, 34: 447-456.

Fletcher, R. and Powell, M.J.D. (1963). A rapidly convergent descent method for minimization. Computer Journal, 6: 163-168.

Golub, G.A. (1969). Matrix decompositions and statistical calculations. In Milton, R.C. and Nelder, A. (Editors). Statistical Computation. New York: Academic Press.

Graybill, F.A. (1970). Introduction to Matrices with Applications in Statistics. Belmont, California: Wadsworth.

Grizzle, J.E. and Allen, D.M. (1969). Analysis of growth and dose response curves. Biometrics, 25: 307-318.

Harman, H.H. (1967). Modern Factor Analysis. Chicago: University of Chicago Press.

Hauser, R.M. and Goldberger, A.S. (1971). The treatment of unobservable variables in path analysis. In Costner, H.L. (Editor). Sociological Methodology: 1971. San Francisco: Jossey-Bass.

Hendrickson, A.E. and White, P.O. (1964). Promax: a quick method for rotation to oblique simple structure. British Journal of Statistical Psychology, 17: 65-70.

Hotelling, H. (1933). Analysis of a complex of statistical variables into principal components. Journal of Educational Psychology, 24: 417-441, 498-520.

Hotelling, H. (1935). The most predictable criterion. Journal of Educational Psychology, 26: 139-142.

Hotelling, H. (1936). Relations between two sets of variates. Biometrika, 28: 321-377.

Howe, W.G. (1955). Some contributions to factor analysis. Ph.D. Thesis, University of North Carolina. (Also Report No. ORNL-1919, Oak Ridge, Tennessee: Oak Ridge National Laboratory.)

Jennrich, R.I. (1969). Asymptotic properties of non-linear least squares estimators. Annals of Mathematical Statistics, 40: 633-643.

Jennrich, R.I. (1970). An asymptotic χ² test for the equality of two correlation matrices. Journal of the American Statistical Association, 65: 904-912.

Jennrich, R.I. (1973). Standard errors for obliquely rotated factor loadings. Research Bulletin 73-28. Princeton, New Jersey: Educational Testing Service.

Jennrich, R.I. (1974). Simplified formulae for standard errors in maximum-likelihood factor analysis. British Journal of Mathematical and Statistical Psychology, 27: 122-131.

Jennrich, R.I. and Sampson, P.F. (1966). Rotation for simple loadings. Psychometrika, 31: 313-323.

Jennrich, R.I. and Thayer, D.I. (1973). A note on Lawley's formulae for standard errors in maximum likelihood factor analysis. Research Bulletin 73-31. Princeton, New Jersey: Educational Testing Service.

Joreskog, K.G. (1963). Statistical Estimation in Factor Analysis. Stockholm: Almquist and Wiksell.

Joreskog, K.G. (1967). Some contributions to maximum likelihood factor analysis. Psychometrika, 32: 443-482.

Joreskog, K.G. (1969). A general approach to confirmatory maximum likelihood factor analysis. Psychometrika, 34: 183-202.

Joreskog, K.G. (1970a). A general method for analysis of covariance structures. Biometrika, 57: 239-251.

Joreskog, K.G. (1970b). Estimation and testing of simplex models. British Journal of Mathematical and Statistical Psychology, 23: 121-145.

Joreskog, K.G. (1971a). Simultaneous factor analysis in several populations. Psychometrika, 36: 409-426.

Joreskog, K.G. (1971b). Statistical analysis of sets of congeneric tests. Psychometrika, 36: 109-133.

Joreskog, K.G. and Goldberger, A.S. (1972). Factor analysis by generalized least squares. Psychometrika, 37: 243-260.

Joreskog, K.G. and Lawley, D.N. (1968). New methods in maximum likelihood factor analysis. British Journal of Mathematical and Statistical Psychology, 21: 85-96.

Kaiser, H.F. (1958). The varimax criterion for analytic rotation in factor analysis. Psychometrika, 23: 187-200.

Kaiser, H.F. (1959). Computer program for varimax rotation in factor analysis. Educational and Psychological Measurement, 19: 413-420.

Kendall, M.G. and Babington-Smith, B. (1950). Factor analysis. Journal of the Royal Statistical Society - Series B, 12: 60-94.

Kendall, M.G. and Stuart, A. (1961). The Advanced Theory of Statistics - Vol. II: Inference and Relationship. London: Charles Griffin and Co.

Kendall, M.G. and Stuart, A. (1969). The Advanced Theory of Statistics - Vol. I: Distribution Theory. (3rd Edition). London: Charles Griffin and Co.

Land, K.C. (1969). Principles of path analysis. In Borgatta, E.F. (Editor). Sociological Methodology: 1969. San Francisco: Jossey-Bass.

Landis, J.R. and Koch, G.G. (1974). A review of statistical methods in the analysis of data arising from observer reliability studies. University of North Carolina: Institute of Statistics Mimeo Series No. 956.
Lawley, D.N. (1940). The estimation of factor loadings by the method of maximum likelihood. Proceedings of the Royal Society of Edinburgh - Series A, 60: 64-82.

Lawley, D.N. (1942). Further investigations in factor estimation. Proceedings of the Royal Society of Edinburgh - Series A, 62: 176-185.

Lawley, D.N. (1943). The application of the maximum likelihood method to factor analysis. British Journal of Psychology, 33: 172-175.

Lawley, D.N. (1953). A modified method of estimation in factor analysis and some large sample results. Uppsala Symposium on Psychological Factor Analysis. Stockholm: Almquist and Wiksell.

Lawley, D.N. (1967). Some new results in maximum likelihood factor analysis. Proceedings of the Royal Society of Edinburgh - Series A, 67: 256-264.

Lawley, D.N. and Maxwell, A.E. (1971). Factor Analysis as a Statistical Method. (2nd Edition). New York: American Elsevier.

Lazarsfeld, P.F. (1950). The logic and mathematical foundation of latent structure analysis. In Stouffer et al. (Editors). Measurement and Prediction. Princeton, New Jersey: Princeton University Press.

Lockhart, R.S. (1967). Asymptotic sampling variances for factor analytic models identified by specified zero parameters. Psychometrika, 32: 265-277.

Lord, F.M. and Novick, M.R. (1968). Statistical Theories of Mental Test Scores. Reading, Massachusetts: Addison-Wesley.

MacRae, E.C. (1974). Matrix derivatives with an application to an adaptive linear decision problem. Annals of Statistics, 2: 337-346.

Maxwell, D.N. (1961). Recent trends in factor analysis. Journal of the Royal Statistical Society - Series A, 124: 49-59.

Morrison, D.F. (1967). Multivariate Statistical Methods. New York: McGraw Hill.

Mukherjee, B.N. (1973). Analysis of covariance structures and exploratory factor analysis. British Journal of Mathematical and Statistical Psychology, 26: 125-154.

Mulaik, S.A. (1971). A note on some equations of confirmatory factor analysis. Psychometrika, 36: 63-70.

Neudecker, H. (1969). Some theorems on matrix differentiation with special reference to Kronecker matrix products. Journal of the American Statistical Association, 64: 953-963.

Okamoto, M. (1969). Optimality of principal components. In Krishnaiah, P.R. (Editor). Multivariate Analysis II: Proceedings of the Second International Symposium on Multivariate Analysis. New York: Academic Press.

Olkin, I. (1953). Note on "Jacobians of matrix transformations useful in multivariate analysis." Biometrika, 40: 43-46.

Please, N.W. (1973). Comparison of factor loadings in different populations. British Journal of Mathematical and Statistical Psychology, 26: 61-89.

Potthoff, R.R. and Roy, S.N. (1964). A generalized multivariate analysis of variance model useful especially for growth curves. Biometrika, 51: 313-326.

Puri, M.L. and Sen, P.K. (1969). A class of rank order tests for a general linear hypothesis. Annals of Mathematical Statistics, 40: 1325-1343.

Rao, C.R. (1955). Estimation and tests of significance in factor analysis. Psychometrika, 20: 93-111.

Rao, C.R. (1965). Linear Statistical Inference and its Applications. New York: John Wiley and Sons.

Rasch, G. (1953). On simultaneous factor analysis in several populations. Uppsala Symposium on Psychological Factor Analysis. Stockholm: Almquist and Wiksell.

Rodriguez, G. (1975). A canonical reduction of the factor analysis model. University of North Carolina: Institute of Statistics Mimeo Series No. 992.

Sen, P.K. and Puri, M.L. (1970). Asymptotic theory of likelihood ratio and rank order tests in some multivariate linear models. Annals of Mathematical Statistics, 41: 87-100.

Silvey, S.D. (1971). Statistical Inference. Baltimore: Penguin Books.

Spearman, C. (1904). General intelligence, objectively determined and measured. American Journal of Psychology, 15: 201-293.

Theil, H. (1971). Principles of Econometrics. New York: John Wiley and Sons.

Thompson, G.H. (1951). The Factorial Analysis of Human Ability. London: London University Press.

Thurstone, L.L. (1931). Multiple factor analysis. Psychological Review, 38: 406-427.

Thurstone, L.L. (1947). Multiple Factor Analysis. Chicago: University of Chicago Press.

Tracy, D.S. and Dwyer, P.S. (1969). Multivariate maxima and minima with matrix derivatives. Journal of the American Statistical Association, 64: 1576-1594.

Travinsky, I. and Bargmann, R.E. (1964). Maximum likelihood estimation with incomplete data. Annals of Mathematical Statistics, 35: 647-657.

Turner, M.E. and Stevens, C.D. (1959). The regression analysis of causal paths. Biometrics, 15: 236-258.

Vetter, W.J. (1970). Derivative operations on matrices. IEEE Transactions on Automatic Control, AC-15: 241-244.

Wald, A. (1943). Tests of statistical hypotheses concerning several parameters when the number of observations is large. Transactions of the American Mathematical Society, 54: 426-483.

Wald, A. (1948). Estimation of a parameter when the number of unknown parameters increases indefinitely with the number of observations. Annals of Mathematical Statistics, 19: 220-227.

Watson, G.S. (1964). A note on maximum likelihood. Sankhya - Series A, 26: 303-304.

Wold, H. (Editor) (1964). Econometric Model Building: Essays on the Causal Chain Approach. Amsterdam: North Holland.

Wright, S. (1918). On the nature of size factors. Genetics, 3: 367-374.

Zacks, S. (1971). The Theory of Statistical Inference. New York: John Wiley and Sons.