PLS’ Janus face
Response to Professor Rigdon’s ‘Rethinking Partial Least
Squares Modeling: in praise of simple methods’
Theo K. Dijkstra
University of Groningen
Faculty of Economics and Business
PO Box 800
Groningen, The Netherlands
[email protected]
24 April 2013
Abstract
I respond to two of the theses put forward by Professor Rigdon: 1.
PLS should sever every tie with factor modeling, and 2. the fundamental problem of factor indeterminacy makes factor-based methods
fundamentally unsuited to prediction-oriented research. With respect
to the first thesis my response is that adherence to its advice would
waste a potentially very useful method: there is a version of PLS that
appears to be a valuable alternative to the mainstream approaches
in factor modelling, both linear and non-linear. The response to the
second thesis is that one can generate predictions with factor models,
and that all models and techniques are ultimately instruments that
transform data into predictions about behaviour. Our task is to find
out which approach works best in which circumstances. I would not
support an a priori exclusion of tools. In an addendum I attempt to
sketch the historical path-dependent development of Herman Wold’s
PLS, in order to elucidate some of its characteristic features.
In a thought-provoking, carefully crafted and worded paper, Professor Rigdon
has put forward a number of strong statements. They have as their core
an unreserved choice for PLS as an extension of principal components and
canonical variables analysis. He strongly rejects PLS as a tool for estimating
structural relationships between latent variables, ‘factors’. Instead he calls
for a full-blown development of path modeling in terms of linear composites,
with tailor-made inferential tools and proper means to assess measurement
validity.
Of the statements put forward I choose two to comment on, the ones I feel
most strongly about. In essence I wish to maintain the double-sided nature
of PLS that characterized it from the very start. In the family of structural equation estimators, PLS, when properly adjusted, can be a valuable
member as well.
1. PLS should sever all ties with factor-based SEM
Response: This will be a somewhat elaborate reply, but the topic seems
to demand it. First I will indicate why application of traditional PLS to
factor models could indeed be problematic. Next I will specify a simple way
to correct for the ‘distortions’ that the use of proxies as stand-ins for the
latent variables generates. This covers linear as well as nonlinear models.
I will report some recent results that allow one, in my view at least, to be
optimistic about the feasibility and usefulness of the new approach. Its core
is still PLS and therefore it maintains its numerical efficiency. Standard PLS
software produces essentially all the input one needs. The high complexity
of alternative methods in mainstream latent variable modeling, see (Schumacker & Marcoulides 1998) or (Kelava et al. 2011) e.g., makes this variation of
PLS a valuable option. In a nutshell: I contend that when Professor Rigdon’s
advice is adhered to, we may waste a potentially very useful tool.
Our starting point is Wold’s ‘basic design’, a system of structural equations between latent variables, each of which is measured by a unique set of
indicators (at least two indicators), and where all measurement errors are
mutually independent, and independent of all latent variables. We assume
we have a random sample from a population with finite moments of appropriate order. It is well-known that in this setting PLS, when applied to the
population (probability limits), will yield loadings that are too large in absolute value, and correlations between the latent variables that are too small in
absolute value (Dijkstra, 1981, 2010). Also, the squared multiple correlation coefficients, the R²'s, are too small, see (Dijkstra, 2010). These properties
are reflected as tendencies in finite samples. If we focus on the correlations
between the latent variables, where a typical variable is denoted by ηi, we have:

ρ̄ij = ρij · qi · qj .    (1)

Here ρ̄ij is the correlation between the PLS proxies η̄i and η̄j for ηi and ηj; the correlation between the latter is ρij. The quality qi of the PLS proxy η̄i for ηi is by definition its positive correlation with ηi, and qj is defined similarly.
The quality depends on the ‘mode’. With less than maximal qualities (and
they can never equal one in regular situations) we have a ‘distorted’ picture of
the relationships between the latent variables. To give an extreme randomly
generated example, suppose the true correlation matrix between three latent
variables η1, η2 and η3 equals:

| 1    ρ12  ρ13 |   |  1    −.97   .69 |
| ρ12  1    ρ23 | = | −.97   1    −.58 |    (2)
| ρ13  ρ23  1   |   |  .69  −.58   1   |
We assume here as elsewhere that the latent variables have zero mean and
unit variance. Let
η3 = β1 η1 + β2 η2 + ζ    (3)
be the regression equation of interest. That is, the coefficients β1 and β2 are chosen such that the implied residual ζ is uncorrelated with η1 and η2. Every other choice for the β's produces a residual whose average squared value is higher than E(ζ²). So the β's must satisfy the so-called normal equations, 0 = E(ζη1) = E((η3 − β1η1 − β2η2)η1) = ρ13 − β1 − β2ρ12 and 0 = E(ζη2).
In matrix notation:

| 1    ρ12 | | β1 |   | ρ13 |
| ρ12  1   | | β2 | = | ρ23 |    (4)
Solving for β2 yields:

β2 = (ρ23 − ρ12 · ρ13) / (1 − ρ12²)    (5)
with an analogous expression for β1. For the values at hand we get (the numbers are rounded to two decimals):

η3 = 2.16 · η1 + 1.51 · η2 + ζ  and  R² = .61.    (6)
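For concreteness, the arithmetic of the normal equations (4)-(6) can be checked with a few lines of code (a sketch; the function name is mine, the numbers are those of the example above):

```python
import numpy as np

# correlations (2) between the three latent variables
rho12, rho13, rho23 = -.97, .69, -.58

def regression_on_two(r12, r13, r23):
    """Solve the normal equations (4) for (beta1, beta2) and return
    them with the resulting R^2 = beta1*r13 + beta2*r23 (all variables
    standardized to zero mean and unit variance)."""
    A = np.array([[1.0, r12], [r12, 1.0]])
    b = np.array([r13, r23])
    beta = np.linalg.solve(A, b)
    return beta, beta @ b

beta, r2 = regression_on_two(rho12, rho13, rho23)
# beta is approximately (2.16, 1.51) and r2 approximately .61, as in (6)
```

Feeding the same function the attenuated correlations of matrix (7), i.e. each ρ multiplied by q² = .81, reproduces the distorted values of equation (8).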
If qi = .90 for i = 1, 2, 3, PLS will yield for the correlation matrix between the proxies:

| 1    ρ̄12  ρ̄13 |   |  1    −.79   .56 |
| ρ̄12  1    ρ̄23 | = | −.79   1    −.47 |    (7)
| ρ̄13  ρ̄23  1   |   |  .56  −.47   1   |

and a regression produces

η̄3 = .50 · η̄1 − .08 · η̄2 + residual  and  R² = .31.    (8)
The coefficients are way too small in absolute value, one of them is of the wrong sign as well, and the R² is close to half the true size. Note that

β̄2 = [(ρ23 − ρ12 · ρ13 · q1²) / (1 − ρ12² · q1² · q2²)] · q2 · q3    (9)
so, as is easily verified, the coefficient of the second proxy will be negative as
long as q1 < .93 with the ρij ’s as specified. I emphasize that the regression
numbers as reported are not estimates, they are ‘population values’, around
which sample values will fluctuate. Clearly, the dramatic distortion of the
true relationship is mainly due to the very low value of ρ12 , close to minus
one.
A more systematic analysis can be obtained by varying across all possible and permissible values of the correlations¹ ρ12, ρ13 and ρ23. For each set of correlations we calculated the corresponding regression coefficients, βi, and the R². For PLS we assumed that all proxies have the same quality q. We have the following table:
Table 1: Range of PLS-values as a function of proxy-quality

  q²     PLS range(βi)   PLS range(R²)
  .50       ±.50           [0, .33]
  .75       ±.76           [0, .64]
  .90       ±1.16          [0, .85]
  .95       ±1.61          [0, .93]
 1.00       ±4.40          [0, 1]
The last row gives the undistorted, true picture: what we would get with the latent variables instead of with the proxies; R² is uniformly distributed in this case. The true R² is always larger than the value for PLS; the true regression coefficients are in absolute value larger than those of PLS in 94% of the cases for q² = .50, and in 92% of the cases for each of the other values of q².

¹ I used H. Joe (2006) to generate a million correlations, uniformly from the ellipsoid that describes all permissible correlations (the ensuing correlation matrix has to be positive definite). I selected those outcomes where ρ12² ≤ .95, effectively discarding five from every thousand outcomes. Table 1 is based on 995,155 correlation matrices.
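For the record, such a uniform sample of permissible correlations is easy to reproduce. The sketch below uses plain rejection sampling from the cube, which yields the same uniform distribution on the set of valid correlation matrices as Joe's (2006) partial-correlation construction, only less efficiently; the function name and screening threshold are mine:

```python
import numpy as np

def sample_correlations(n, rho12_max_sq=0.95, seed=0):
    """Draw n triples (rho12, rho13, rho23) uniformly over the set of
    valid 3x3 correlation matrices, by rejection from [-1, 1]^3.
    Validity means positive definiteness of the correlation matrix:
    det = 1 - r12^2 - r13^2 - r23^2 + 2*r12*r13*r23 > 0."""
    rng = np.random.default_rng(seed)
    out = []
    while len(out) < n:
        r12, r13, r23 = rng.uniform(-1, 1, size=3)
        det = 1 - r12**2 - r13**2 - r23**2 + 2 * r12 * r13 * r23
        if det > 0 and r12**2 <= rho12_max_sq:   # the paper's extra screen
            out.append((r12, r13, r23))
    return np.array(out)

draws = sample_correlations(2_000)
```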
Irrespective of one’s interpretation of this table, I would maintain that
there are grounds to try an old idea, first put forward by me in print in 1981,
but never tested until 2011, see (Dijkstra 1981, 2011). Depending on the PLS mode, I suggested adjusting the correlation estimates using easy estimates of
the quality of the proxies. For mode A we have a consistent estimator for qi
by the formula:

q̂i := (ŵi′ŵi) · ĉi    (10)

where ŵi is the estimated vector of weights for the ith proxy and

ĉi := [ ŵi′(Sii − diag(Sii))ŵi  /  ŵi′(ŵiŵi′ − diag(ŵiŵi′))ŵi ]^(1/2) .    (11)
The matrix Sii is the covariance/correlation matrix of the ith block of indicators. The use of diag means that only the off-diagonal elements of the matrices are used. So the numerator within brackets is just Σ_{a≠b} ŵi,a ŵi,b Sii,ab, et cetera. We get consistent estimates for the ρij's by dividing PLS' estimates by q̂i and q̂j. Loadings can be estimated consistently by ĉi · ŵi, see (Dijkstra 2011), or (Dijkstra & Henseler 2012) and (Dijkstra & Schermelleh-Engel 2012) e.g. for derivations and additional details. It is worth pointing out that standard PLS-software yields all the required input.
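Equations (10)-(11) translate directly into code. A minimal sketch (the function name is mine; the weight vector and the block covariance matrix would come from standard PLS output):

```python
import numpy as np

def quality_estimate(w, S):
    """Estimate the quality q_i of a PLS proxy from its weight vector w
    and the sample covariance/correlation matrix S of its block of
    indicators, following equations (10)-(11): only the off-diagonal
    elements of S and of w w' enter the correction factor c_i."""
    S_off = S - np.diag(np.diag(S))            # S_ii with diagonal removed
    ww = np.outer(w, w)
    ww_off = ww - np.diag(np.diag(ww))
    c = np.sqrt((w @ S_off @ w) / (w @ ww_off @ w))   # equation (11)
    return (w @ w) * c                                # equation (10)
```

As a sanity check: when the block follows a one-factor model with loadings proportional to the weights, the estimator returns exactly w′λ, the covariance between proxy and factor.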
We now have consistent estimators for the loadings and for the correlations between the latent variables; they are in fact asymptotically normal as well. The same is true for the standard estimators for the structural relationships, and also for the theoretical, structured correlation matrix of the vector of indicators. The latter fact opens up the possibility to test the overall goodness-of-fit. See (Dijkstra & Henseler 2012) for some encouraging results, for a linear structural model with feedback. Incidentally, goodness-of-fit tests are a conditio sine qua non in mainstream factor modelling, but appear to be alien to the 'PLS-scene'. In the final section of this paper, the addendum, I try to shed some light on this issue by revisiting the past, the path-dependent genesis of PLS.
An exciting new development, a PhD-project at UCLA, exploits an old
idea of (Bentler & Dijkstra 1985): if we update a consistent estimator once,
using Newton-Raphson or Gauss-Newton on the first-order conditions of a
suitable fitting function, we get an asymptotically efficient estimator. So
PLS can be made to be as efficient as GLS- or ML-estimators, by a simple
adjustment followed by one iteration of a standard optimization routine.
The approach as outlined here can be extended to recursive systems of regression equations with interaction between the latent variables (modelled by products ηiηj). No distributional assumptions beyond the existence of moments are needed; see (Dijkstra 2011) and (Dijkstra & Schermelleh-Engel 2012). To spell it out for the simplest case, consider
η3 = γ1 η1 + γ2 η2 + γ12 (η1η2 − ρ12) + residual.    (12)
The γ's satisfy the following normal equations, as induced by the demand that the residual is uncorrelated with the three 'explanatory' variables:

| Eη1η3   |   | 1        Eη1η2    Eη1²η2         | | γ1  |
| Eη2η3   | = | Eη1η2    1        Eη1η2²         | | γ2  |    (13)
| Eη1η2η3 |   | Eη1²η2   Eη1η2²   Eη1²η2² − ρ12² | | γ12 |

(recall that the latent variables are standardized to have zero mean and unit variance). The moments can be found from the 'PLS-moments' for the proxies and the q's:
Eη̄1η̄2 = q1 q2 · Eη1η2    (14)
Eη̄1η̄3 = q1 q3 · Eη1η3    (15)
Eη̄2η̄3 = q2 q3 · Eη2η3    (16)
Eη̄1²η̄2 = q1² q2 · Eη1²η2    (17)
Eη̄1η̄2² = q1 q2² · Eη1η2²    (18)
Eη̄1η̄2η̄3 = q1 q2 q3 · Eη1η2η3    (19)
Eη̄1²η̄2² − 1 = q1² q2² · (Eη1²η2² − 1)    (20)
Only the last equation deviates from the easy pattern. The moments for the proxies are estimated simply by their corresponding sample equivalents. For example, Eη̄1²η̄2² is estimated by

(1/n) · Σt=1..n (ŵ1′ y1t)² · (ŵ2′ y2t)²    (21)
where n is the sample size and y1t is the vector of scores on the indicators of
the first block for the tth individual or entity, with an analogous definition for
y2t . Note that for this to work we need to have the raw data, not just the moments of the sample. Inserting the estimators for the moments of the latent
variables, as adjusted for the imperfect correlations with the proxies, in (13)
and solving for the γ’s yields consistent and (under appropriate conditions)
asymptotically normal estimators for the coefficients.
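The recipe can be sketched compactly: correct the sample moments of the proxies via (14)-(20), plug them into (13), and solve. In the sketch below the proxy scores and the qualities are assumed given (e.g. from standard PLS output plus equation (10)); the function name is mine:

```python
import numpy as np

def gamma_estimates(p1, p2, p3, q1, q2, q3):
    """Solve the normal equations (13) for (gamma1, gamma2, gamma12),
    after correcting the sample moments of the standardized PLS proxy
    scores p1, p2, p3 for their qualities q1..q3, as in (14)-(20)."""
    # corrected moments of the latent variables
    m12   = np.mean(p1 * p2) / (q1 * q2)                        # (14)
    m13   = np.mean(p1 * p3) / (q1 * q3)                        # (15)
    m23   = np.mean(p2 * p3) / (q2 * q3)                        # (16)
    m112  = np.mean(p1**2 * p2) / (q1**2 * q2)                  # (17)
    m122  = np.mean(p1 * p2**2) / (q1 * q2**2)                  # (18)
    m123  = np.mean(p1 * p2 * p3) / (q1 * q2 * q3)              # (19)
    m1122 = (np.mean(p1**2 * p2**2) - 1) / (q1**2 * q2**2) + 1  # (20)
    A = np.array([[1.0,  m12,  m112],
                  [m12,  1.0,  m122],
                  [m112, m122, m1122 - m12**2]])
    b = np.array([m13, m23, m123])
    return np.linalg.solve(A, b)
```

In a simulation with normal latent variables and proxies of quality .9, the routine recovers the γ's of (12) up to sampling noise.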
With squares and higher order terms the same approach can be followed,
except that now some distributional assumptions appear to be needed. With
normality (for the exogenous latent variables) recursive systems of structural
equations with polynomial terms are in principle estimable. In this way
one keeps the high numerical efficiency and stability of PLS, as well as its
robustness against misspecification, and gets consistent and asymptotically
normal estimators for non-trivial structural models. And as stated before,
in terms of complexity the new approach appears to compare rather favorably with mainstream alternatives, see (Schumacker & Marcoulides 1998)
and (Kelava et al 2011) e.g. A Monte Carlo study for a model with an
interaction term and squares where the new approach was compared with
the most efficient maximum likelihood method around (LMS, ‘Latent Moderated Structural Equations’), produced quite acceptable results: unbiased
estimates with standard errors comparable with those of LMS and computationally way faster than LMS (Dijkstra & Schermelleh-Engel 2012).
2. Factor indeterminacy and (factor) prediction
Professor Rigdon stresses the fact that factor scores cannot be determined
unambiguously, and argues for the exclusive use of composites instead. He
also pleads for new ways to validate the model, by forecasting theoretical concepts using the observations on the indicators. Determining the gap between proxy and concept would be the main means of evaluating validity.
Response: I found the argument hard to follow, and I may well have
misconstrued it, but here is my reply. It is indeed undeniable that factor
scores cannot be determined unambiguously. At best, we can get some estimates of aspects of the conditional distribution of the latent variable given
its indicators. But how does selecting one number as representative for this
distribution solve the issue? A composite as generated by mode B e.g. is
proportional to the mode of the conditional distribution in case of normality,
or proportional to the best linear approximation to the conditional expectation of the latent variable given its indicators in more general situations. A
plethora of other choices could be made. The use of a specific proxy cannot
take away the inherent and real uncertainty.
I find it also difficult to understand how the gap between the concept and
its proxy can be measured, if we can have numbers for the proxy only. To
think of a concrete example: in (Chin et al 2003) the relationship between the concept 'intention to use' (of a particular IT-application) and the concepts 'perceived usefulness' and 'enjoyment' was investigated. See also (Dijkstra & Henseler 2011) and (Dijkstra & Schermelleh-Engel 2012). One of the models employed was the interaction model specified above. One can of course
predict the proxy for ‘intention to use’ via the indicators for the other concepts/proxies, but I do not see the number with which to compare it. If in
a follow-up study one can ascertain actual use as well, then we can test the
quality of the predicted ‘intention to use’-proxy as a forecast for actual use,
via a logistic regression e.g. I would like to emphasize that the latent variable
model could be employed for this purpose as well: E (η3 |y1 , y2 ) , where y1 and
y2 are the indicator vectors for 'perceived usefulness' and 'enjoyment' resp.,
can be shown to be a linear expression in terms of linear composites and their
products, using normality as a working hypothesis. At the end of the day, we
have algorithms that transform observations on indicators into predictions of
actual behaviour. We can pit the algorithms against each other, and find out
which works best in which circumstances. To me, techniques with proxies and
techniques with latent variables are instruments, tools. I am not inclined to
endorse a categorical ban on either of them.
3. Addendum, a sketch of the genesis of PLS
Herman O. A. Wold (1908-1992) had a lifelong fascination for ‘least squares’
and its role in modeling behavioral relationships, in prediction and estimation. The evolution of the least squares (LS) body of methods is the common thread in his research (Wold, 1980)². It started with his Doctoral Thesis
(1938), in which he established by LS methods the ‘decomposition theorem
for stationary time series’ that brought him fame. His treatise on Demand
Analysis (Wold, 1953, reprinted in 1982), a synthesis of Paretoan utility theory, theory of stationary processes, and regression theory, was entirely based
on LS analysis. In the late fifties he saw that the rationale of LS analysis
could be developed on the basis of ‘predictor specification’, by defining a
causal predictive relation as the conditional expectation of the predictand.
This would also reflect its intended use: prediction in the case of single equations, and prediction by the 'chain principle', i.e. repeated substitutions, in
case of recursive systems of equations. For a long time Wold insisted on a
conditional expectation specification for behavioral equations. Systems of
equations would qualify only as causal-predictive when each separate equation had that property, hence his strong preference for recursive systems,
or 'causal chains' as he would call them (Wold, 1964). The causal interpretation of simultaneous equations, in which there is feedback between the variables, has long been a very controversial topic in econometrics, and appears to be only recently resolved, see Pearl (2009).

² I refer here and below to a number of mimeographed documents, mostly written by Herman Wold, which may or may not be available in print. Naturally, I will send a copy on request.

Anyway, Wold saw in
the sixties a way to reformulate non-recursive systems that would allow of a
causal-predictive interpretation of the separate equations, and estimation by
(iterative) least squares. Consider the following model
η1 = β12 η2 + γ11 ξ1 + γ12 ξ2 + ζ1    (22)
η2 = β21 η1 + γ23 ξ3 + γ24 ξ4 + ζ2    (23)
where all variables, the 'endogenous' η1 and η2, and the 'exogenous' ξ1, ξ2, ξ3 and ξ4, are scalar³. This is known as Summers' model (Summers 1965); it has been used often in econometric studies, see also (Mosbaek & Wold 1970).
In the classical setting the residuals ζ1 and ζ2 are assumed to be uncorrelated with the exogenous variables ξ. The feedback between the endogenous
variables then entails that the coefficients of the equations are not regression
coefficients: ζ1 cannot be uncorrelated with η2 , nor can that be the case for
ζ2 and η1 . This is most easily seen by solving for the endogenous variables.
We can write η1 = η1* + residual1, where η1* is a linear combination of the four exogenous variables, in fact it is the regression of η1 on ξ, and residual1 combines ζ1 and ζ2 linearly; similarly for η2, η2 = η2* + residual2. So ζ1 is correlated with residual2 and therefore with η2. It was observed in the fifties
that systems like (22, 23) could be rewritten as:

η1 = β12 η2* + γ11 ξ1 + γ12 ξ2 + residual1    (24)
η2 = β21 η1* + γ23 ξ3 + γ24 ξ4 + residual2    (25)
In each equation the residual is uncorrelated with the right-hand side explanatory variables. So now the coefficients are regression coefficients. Estimation could proceed in two rounds: first regress each endogenous variable on the four exogenous variables, this yields estimates for η1* and η2*, and then use regressions on (24) and (25). This is known as 'two-stage LS', see (Theil, 1958) and (Basmann, 1957) for the original development⁴. Note that
η1* = β12 η2* + γ11 ξ1 + γ12 ξ2    (26)
η2* = β21 η1* + γ23 ξ3 + γ24 ξ4    (27)
Wold suggested iterating the procedure. That is, use the freshly obtained estimates for the β's and the γ's to get new values for the η*'s, based on (26) and (27). Then update the coefficient estimates by new regressions, use them to get new η*'s, followed by yet another set of regressions, et cetera, and continue iterating until the values no longer change.

³ For ease of readability I am deliberately 'sloppy' in the specification of the model, ignoring constraints on the coefficients et cetera, but this will not obscure the main point I want to make. Also, very special parameter values may contradict the general statements, but I will not specify them. Finally, I will freely confuse zero correlation with independence.

⁴ Theil had in fact discovered two-stage LS in 1953, and described it in an internal report of the Central Planning Bureau of the Netherlands.

This was known as
the Fix-Point method, see (Mosbaek & Wold 1970). Later, see the same reference, Wold noted that the algorithm would also work for the model where
the residuals are not uncorrelated with all four exogenous variables, but only
with the relevant variables at the same side of the equation. Previously we
specified eight zero correlations for six parameters, now only six. In the
parlance of econometrics we lose the 'over-identifying' restrictions, used for
testing and estimation, but in the parlance of PLS, we now adhere to the
‘parity principle’, making the model more modest, and possibly more robust.
See (Mosbaek & Wold 1970) for an extensive analysis of the Generalized
Interdependent (GEID)-systems.
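The alternation described in (24)-(27) can be sketched in a few lines (an illustration with simulated data; the function and variable names are mine, and the stopping rule is a simple change-in-values criterion):

```python
import numpy as np

def fix_point(y1, y2, X1, X2, tol=1e-10, max_iter=500):
    """Sketch of the Fix-Point method for Summers' model, (22)-(27):
    alternate between (a) forming the systematic parts eta1*, eta2*
    from the current coefficients via (26)-(27) and (b) re-estimating
    the coefficients by OLS on the rewritten equations (24)-(25).
    X1 holds (xi1, xi2) and X2 holds (xi3, xi4) as column matrices."""
    # initialize with the two-stage LS step: regress each endogenous
    # variable on all four exogenous variables
    X = np.column_stack([X1, X2])
    eta1s = X @ np.linalg.lstsq(X, y1, rcond=None)[0]
    eta2s = X @ np.linalg.lstsq(X, y2, rcond=None)[0]
    for _ in range(max_iter):
        # (b) OLS on (24)-(25) with the current systematic parts
        b1 = np.linalg.lstsq(np.column_stack([eta2s, X1]), y1, rcond=None)[0]
        b2 = np.linalg.lstsq(np.column_stack([eta1s, X2]), y2, rcond=None)[0]
        # (a) new systematic parts from (26)-(27)
        new1 = b1[0] * eta2s + X1 @ b1[1:]
        new2 = b2[0] * eta1s + X2 @ b2[1:]
        if max(np.max(np.abs(new1 - eta1s)), np.max(np.abs(new2 - eta2s))) < tol:
            break
        eta1s, eta2s = new1, new2
    return b1, b2   # (beta12, gamma11, gamma12), (beta21, gamma23, gamma24)
```

In the classical setting (residuals uncorrelated with all four ξ's) the fixed point recovers the structural coefficients up to sampling noise.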
As I see it, Wold tended to use the word ‘model’ not in the conventional
way, as a restriction on moments or on distributions, but more as a rule for
constructing variables. He often used the conditional expectation specification formally but as far as I know never actually tested its validity. With
e.g. η1 = f (ξ) + ζ1 where E (η1 |ξ) = f (ξ) for a function f of a specified
form, we have E (ζ1 |ξ) = 0. The zero conditional expectation is a very strong
property. It says essentially that ζ1 is uncorrelated with all functions of ξ,
not just linear functions. An informal test would be to plot the estimated
residual versus the elements of ξ and check for nonlinear patterns. But this
was never done. Instead, the implication of zero-correlation between ζ1 and
the elements of ξ was used to justify LS. The ensuing estimate for the LS-residual is uncorrelated with the elements of ξ by construction. Constraints
as imposed on factor models, see below, were not used for testing either.
So in the early sixties an important feature that would turn out to be
the characteristic of Wold’s future work had emerged: the estimation of both
parameters and variables by means of ‘alternating least squares’, and the implicit definition of those entities as a fixed point of an algorithm. Generally,
there was no overall fitting criterion to be optimized, instead one performed
a sequence of local optimizations where the outputs of the previous LS programs temporarily set part of the parameters of the next program. A lecture
tour in 1964 in the USA turned out to be pivotal for PLS’ genesis: when he
presented the Fix-Point method it was pointed out to Wold that principal
components could be calculated in an analogous way⁵. To spell out, consider a set of vectors {yt}, t = 1, …, T, where for each (individual or entity) t the vector yt contains the observations on a fixed set of indicators, possibly standardized.

⁵ Wold acknowledges Professor G. S. Tolley, University of North Carolina, and Dr. R. A. Porter, Spindletop Research Center, Lexington, Kentucky. See Wold (1966).
Write

yt = λ̃ · η̃t + ε̃t    (28)

where the tildes indicate that these are not factor model parameters and variables. Here the scalar variable η̃t is treated as a parameter as well and is to be estimated alongside the loading λ̃. Since doubling, say, λ̃ and halving η̃t yields the same product, we need to set the scale. One convention among others is to impose unit sample variance on the η̃t's. If we now define the loadings and the 'latent variable' as those values that minimize the sum of the squared residuals Σt=1..T ε̃t′ε̃t, we find that for a given λ̃ the corresponding η̃t is a linear combination of yt, obtainable by a regression of yt on λ̃, and for given {η̃t}, the kth element of λ̃ is obtainable by a regression of the yk,t's on the η̃t's. Switching back and forth between these regressions until they settle down leads to a fixed point, also known as the first principal component. Note that, again, by construction the residuals are uncorrelated with the 'latent variable' (but they cannot be mutually uncorrelated since λ̃′ε̃t must be zero). Also note that (28) is not a model in the classical sense, since there is no restriction at all: every sample covariance matrix has a largest eigenvalue and a corresponding real eigenvector. A one-factor model with uncorrelated residuals does impose a non-trivial restriction, even for the 'saturated model' with three indicators (Dijkstra, 1992), that can be used for testing via a goodness-of-fit test. A principal component 'model' on the other hand can 'only' be tested by means of its usefulness, in condensing and summarizing high-dimensional data e.g.
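The alternating regressions behind (28) can be sketched in a few lines (an illustration; the function name, starting value and tolerance are mine):

```python
import numpy as np

def first_pc_als(Y, tol=1e-12, max_iter=1000):
    """Alternating least squares for the first principal component, as
    in equation (28): treat both the loading vector and the scores as
    parameters, and switch between the two regressions until a fixed
    point is reached. Y is an n x k data matrix with centered columns;
    the scores are scaled to unit sample variance at each step."""
    n, k = Y.shape
    lam = Y[0].copy() + 1e-9                      # arbitrary nonzero start
    for _ in range(max_iter):
        scores = Y @ lam / (lam @ lam)            # regress each y_t on lambda
        scores *= np.sqrt(n / (scores @ scores))  # unit sample variance
        new_lam = Y.T @ scores / (scores @ scores)  # regress y_k,t on scores
        if np.max(np.abs(new_lam - lam)) < tol:
            lam = new_lam
            break
        lam = new_lam
    return lam, scores
```

At the fixed point the loading vector is proportional to the leading eigenvector of the sample covariance matrix, which is the point of the historical anecdote above.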
Continuing with the historical sketch, it is clear that extensions first to
two blocks, later to more blocks and more than one interrelationship between
the blocks, were not immediate. Attempts to build models for two blocks are
described in (Wold, 1975). These include of course the canonical variables,
also present in (Wold, 1966). There were others that did not use two latent
variables, each one combining the indicators of one block only, but one latent
variable combining indicators of both blocks. Eventually Wold decided on
the form as we know it (Wold, 1976). But the extension to more blocks and
their relationships was still not settled. See (Wold, 1976) for some of the
complicated schemes he devised. Some of my attempts to solve the issue,
as a member of Wold’s research group at the Wharton School in 1977, built
upon extensions of the principal components approach. One could perhaps,
I thought, minimize the sum of the squared residuals of both blocks plus
the squared residuals of the inner equation, as a function of the loadings,
the structural parameters, and all values of the latent variables, and similarly for more blocks. The ensuing latent variables typically combined all
indicators, and for larger models things seemed to get hopelessly messy and
the calculations time consuming and unstable. In 1977 Wold opted for a
radical simplification, maybe, if I may flatter myself, stimulated partly by
noting that approaches such as mine were going nowhere: he introduced
the sign-weighted sum of ‘adjacent’ latent variables, and weights for a latent
variable were simply to be determined by a regression of its corresponding
sign-weighted sum on the indicators of its block, or in reverse order, depending on the ‘mode’. I vividly remember how relieved he was when PLS
seemed to have settled down and attained a form on which one could build.
He may have liked the quotation attributed to Winston Churchill: out of
intense complexities, intense simplicities emerge. And he would have been
enormously pleased to learn that at this time of writing (2013) the main
software program, SmartPLS, has 40,000 registered users! The Handbook of
PLS (Esposito Vinzi et al, 2010) and its many references is an impressive
tribute to the fruitfulness and intellectual stimulus of Wold’s approach.
The double-sided nature of PLS, as a descriptive/predictive approach as well as an alternative to SEM in high-dimensional data in a low-structure environment, always seemed to be a natural thing to Wold, as I think one
can read him. The conceptual structure of factor models was the backdrop
against which Wold developed and tested his methods. This is particularly
clear in his discussion of the ‘basic design’, see (Wold, 1982) e.g. If he had
just wanted to extend principal components and canonical variables analysis,
there would have been no need to develop the concept of consistency-at-large, which allows one to say when the difference between a factor model and one of PLS' modes would be small. In fact, he insisted that a fundamental principle
of soft modeling is that all interaction between the blocks of observables is
conveyed by the latent variables (Wold 1981, 1982b). What is still surprising to me is that the implication of this principle, namely that the covariance matrix has an explicit and clear structure (the covariance matrices between the blocks all have rank one), is neither tested nor exploited. This is the more
surprising since the (probability limits of the) weights of mode A and the
loadings of mode B are then proportional to the true loadings. In principle
one could recover all factor model parameters consistently as described above.
Of course, it would be absurd to say that Wold shunned testing. He strongly preferred 'local tests', if I may call them that way. He emphasized the need to check whether, say, signs and sizes of particular estimates were as expected and in agreement with theory. He strongly favoured distribution-free sample-reuse methods like the Jackknife for tests of significance, and the Stone-Geisser test for predictive quality. But the implied structural constraints of a model were never exploited⁶. A pragmatic, instrumental approach to science may call for a neutral integration of the two approaches.

⁶ A PLS-reviewer of a paper that I submitted, in which I showed that an adjustment of PLS would allow of goodness-of-fit tests, informed me that 'we have no need for this'. My subjective impression, unsupported by serious interviews, is that this lack of appreciation of the possibility to check for the presence of structure in the data could be widespread in the PLS-community.
PLS is one of those families of ideas that evokes emotions, and apparently calls upon people to make a choice. I have been on both sides of the
fence. In the time I worked on my PhD thesis my sympathies were clearly
with the structural maximum likelihood approach, (Dijkstra, 1981, 1983).
Later, inspired by Kettenring's wonderful paper on canonical analysis (Kettenring, 1971), I regained my appreciation and enthusiasm for methods like PLS (Dijkstra, 2009, 2010) and Dijkstra & Henseler (2011). Now I take a decidedly pragmatic stance, and refuse to condemn one and praise only the other. Let us establish empirically where each works best. For problems
in well-established fields highly structured approaches like mainstream SEM
may be appropriate, other fields will be well served by highly efficient means
of extracting information from high-dimensional data. And methods that
blend the two, as I am working on, may well occupy their own niche.
I thank the editors for inviting me to respond to Professor Rigdon's paper. And I am grateful for his stimulating paper, which helped improve my thinking on PLS, again.
References
[1] Basmann, R. L. (1957). A generalized classical method of linear estimation of coefficients in a structural equation. Econometrica, 25, pp.
77-83.
[2] Bentler, P. M. and Dijkstra, T. K. (1985). Efficient Estimation via Linearization in Structural Models. In: P. R. Krishnaiah (ed.), Multivariate
Analysis-VI, chapter 2, pp. 9-43, North-Holland, Amsterdam.
[3] Chin, W. W., Marcolin, B. L., Newsted, P. R. (2003). A partial least squares latent variable modeling approach for measuring interaction effects. Information Systems Research, 14, 189-217.
[4] Dijkstra, T. K. (1981). Latent variables in linear stochastic models. PhD
thesis. (second edition,1985, Amsterdam: Sociometric Research Foundation).
[5] Dijkstra, T. K. (1983). Some comments on maximum likelihood and
partial least squares methods. Journal of Econometrics, 22, pp. 67-90.
[6] Dijkstra, T. K. (1992). On statistical inference with parameter estimates
on the boundary of the parameter space. British Journal of Mathematical and Statistical Psychology, 45, pp. 289-300.
[7] Dijkstra, T. K. (2009). PLS for path diagrams revisited, and extended.
Paper published in the Proceedings of the 6th International Conference on Partial Least Squares, Sept 4-7, Beijing, China. See also
http://www.rug.nl/staff/t.k.dijkstra/research
[8] Dijkstra, T. K. (2010). Latent variables and indices. In: Esposito Vinzi,
V., Chin, W. W., Henseler, J., Wang, H. (eds.), Handbook of Partial
Least Squares, chapter 1, pp. 23-46. Springer-Verlag, Heidelberg.
[9] Dijkstra, T. K. (2011). Consistent Partial Least Squares estimators for linear and polynomial factor models. Working paper.
(http://www.rug.nl/staff/t.k.dijkstra/research).
[10] Dijkstra, T. K. and Henseler, J. (2011). Linear indices in nonlinear structural equation models: best fitting proper indices and other composites.
Quality & Quantity, 45, 1505-1518.
[11] Dijkstra, T. K. and Henseler, J. (2012). Consistent and asymptotically
normal PLS-estimators for linear structural equations. Working paper.
(http://www.rug.nl/staff/t.k.dijkstra/research).
[12] Dijkstra, T. K. and Schermelleh-Engel, K. (2012). Consistent Partial
Least Squares for Nonlinear Structural Equation Models (submitted),
see also http://www.rug.nl/staff/t.k.dijkstra/research.
[13] Esposito Vinzi, V., Chin, W. W., Henseler, J., Wang, H. (eds.) (2010).
Handbook of Partial Least Squares. Concepts, Methods and Applications.
Springer-Verlag, Heidelberg.
[14] Joe, H. (2006). Generating random correlation matrices based on partial
correlations. Journal of Multivariate Analysis 97, 2177-2189.
[15] Jöreskog, K. G. & Wold, H. (1979). The ML and PLS Techniques for
Modelling with Latent Variables: Comparative Aspects. Third draft,
mimeographed document.
[16] Jöreskog, K. G. & Wold, H. (eds.) (1982). Systems under Indirect Observation, part I and part II. North-Holland, Amsterdam.
[17] Kelava, A. et al. (2011). Advanced nonlinear latent variable modeling: distribution analytic LMS and QML estimators of interaction and
quadratic effects. Structural Equation Modeling, 18(3), 465-491.
[18] Kettenring, J. R. (1971). Canonical analysis of several sets of variables.
Biometrika, 58(3), pp.433-451.
[19] Mosbaek, E. J. & Wold, H. (1970). Interdependent Systems, structure
and estimation. North-Holland, Amsterdam.
[20] Schumacker, R. E. & Marcoulides, G. A. (eds.) (1998). Interaction and
Nonlinear Effects in Structural Equation Modeling. Lawrence Erlbaum
Associates, Mahwah, NJ.
[21] Pearl, J. (2009). Causality. Models, Reasoning and Inference. Cambridge
University Press, second edition, Cambridge.
[22] Rigdon, E. E. (2013). Rethinking Partial Least Squares path modeling:
in praise of simple methods. Long Range Planning (this issue).
[23] Summers, R. (1965). A capital intensive approach to the small sample
properties of various simultaneous equation estimators. Econometrica,
33, pp. 1-41.
[24] Theil, H. (1958). Economic Forecasts and Policy. North-Holland, Amsterdam.
[25] Wold, H. (ed.) (1964). Econometric Model Building. Essays on the
Causal Chain Approach. North-Holland, Amsterdam.
[26] Wold, H. (1966). Nonlinear Estimation by Iterative Least Squares Procedures, pp. 411-444 in David, F. N. (ed.), Research Papers in Statistics.
Festschrift for J. Neyman, Wiley, New York.
[27] Wold, H. (1975). Path Models with Latent Variables: The NIPALS Approach. Pp. 307-359 in Blalock, H. M. et al. (eds.), Quantitative Sociology, International Perspectives on Mathematical and Statistical Modeling, Academic Press, New York.
[28] Wold, H. (1976). On the transition from pattern recognition to model
building. European Meeting of the Econometric Society, Helsinki, Finland, 23-27 August.
[29] Wold, H. (1980). Contributions to the rationale of LS, PLS and soft
modelling in my published work, Geneva, May 1980, mimeographed
document.
[30] Wold, H. (1981). Comments on the papers by J. B. Grubman et al.,
comments integrated with a briefing of PLS (Partial Least Squares)
soft modeling. ASA-CENSUS-NBER Conference on Applied Time Series Analysis of Economic Data, October 13-15.
[31] Wold, H. (1982a). Demand Analysis, a Study in Econometrics. Greenwood Press, Westport; reprint of the 1953 edition, Wiley, New York.
[32] Wold, H. (1982b). The Basic Design and Some Extensions, pp. 1-54 in
Jöreskog, K. G. & Wold, H. (eds.), Systems under Indirect Observation,
part II, North-Holland, Amsterdam.