Statistical Correlation and the Theory of Cluster Types
Author(s): Ragnar Frisch and Bruce D. Mudgett
Source: Journal of the American Statistical Association, Vol. 26, No. 176 (Dec., 1931), pp. 375-392
Published by: American Statistical Association
Stable URL: http://www.jstor.org/stable/2277939
NEW SERIES, NO. 176 (VOL. XXVI), DECEMBER, 1931

JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION
Formerly the Quarterly Publication of the American Statistical Association

STATISTICAL CORRELATION AND THE THEORY OF CLUSTER TYPES

BY RAGNAR FRISCH, University of Oslo, AND BRUCE D. MUDGETT, University of Minnesota
1. THE NOTION OF CLUSTER TYPES AND THEIR GEOMETRIC REPRESENTATION IN THREE DIMENSIONS
Let there be given a statistical population composed of N observations each characterized by n variable attributes, x1, x2 . . . xn. Let n be called the dimensionality of the observations. It is proposed to discuss the types of systematic variation that may take place between the several variables: What is the degree of freedom of the system? Which variables can be considered independent? And so on. To simplify, we shall consider only linear relationships between the variables. This will be sufficient to indicate the nature of the possibilities it is proposed to analyze. Most of the argument in the present paper is built upon the theory of the cluster types developed by Ragnar Frisch in his paper, "Correlation and Scatter in Statistical Variables," in the Nordic Statistical Journal, Vol. 1, 1928, pp. 36-102. To simplify the discussion and to make it possible to build upon a direct geometric intuition of the situation, the case of three variables, x1, x2 and x3, will first be discussed. That is to say, we deal with a population having three attributes measured along three rectilinear axes. In this case, the dimensionality of the population, n, is equal to three. A point in three-dimensional space then represents a given observation, that is, a given individual in the population of N. Each such observation is characterized by a given value of each of the three variables x1, x2 and x3. By assuming that the variables x1, x2 and x3 are measured as deviations from their means, the mean of each xi will lie in the origin of the coördinate axes. The swarm of N observation points thus
obtained forms a three-dimensional scatter diagram. This scatter diagram may exhibit any one of the following cluster types.
A. The disorganized swarm. In this case the observations are distributed in a disorganized way in space with no restrictions upon their positions, or no loss of freedom. The number of degrees of freedom in the population will be called its rank, or its unfolding capacity, and will be designated by ρ, so that in the present case ρ = n = 3. The rank is now equal to the dimensionality. The case is illustrated by a swarm of bees distributed widely in a given space without concentration at any point, or again by the raindrops as they fall through space in a storm.
B. The plane (the "pancake"). If the swarm is pressed together in a given direction it assumes the shape of a "pancake." This pancake has as its ideal representation a plane passing through the origin. This plane represents the systematic variations while the deviations from the plane, the thickness of the pancake, represent the accidental variations. It may now be said that the observations come close to lying in a plane, the accidental variations producing a thin slab in place of a plane. From the point of view of systematic (as distinguished from accidental) variations, the population has now been subjected to the loss of one degree of freedom. The representative point may move freely within the slab but cannot go beyond it except by "accident." It may be said that the swarm has received a one-dimensional flattening, or, again, that it has been simply flattened. The population may also be called simply collinear in this instance. If the number of degrees of freedom lost by the population be designated by p, the result now is: p = 1, ρ = 2, p + ρ = n. The rank is now exactly one less than the dimensionality. The case may be illustrated perhaps by a swarm of bees that has alighted on a board which passes through the origin of the coördinate axes, the bees forming a cluster on this board.
C. The rod. If the population discussed under B be pressed together in the plane, so that it come to cluster along a line, the swarm of scatter points loses one further degree of freedom. The observations become concentrated around a rod through the origin. This case is referred to as multiply flattened, multiply collinear or (more precisely, two-fold flattened). The rank of the population is now ρ = 1; the flattening is p = 2. The sum of the rank and of the flattening is, of course, still equal to 3.
D. The point. Subject the rod to one further degree of flattening, and the observations, with the loss of the one remaining degree of freedom, become concentrated around a point or in a tiny ball at the origin. The meaning of this is that from the point of view of systematic variations,
the observed population does not show any variation at all. Whatever variation there has been is in the nature of accidental errors of observation. Now the flattening is p = 3. That is, the flattening is equal to the dimensionality. All degrees of freedom are lost and the unfolding capacity, ρ, is equal to zero.
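The geometry of these four cluster types can be imitated numerically. The following sketch (not from the paper; it assumes Python with numpy, invented data, and an arbitrary tolerance of 0.01 standing in for "accidental" variance) counts the directions of systematic variation, i.e. the rank ρ, from the eigenvalues of the covariance matrix of the swarm:

```python
# A numerical sketch (invented data, numpy assumed) of cluster types
# A-D: the rank rho is read off as the number of covariance eigenvalues
# exceeding a tolerance that stands in for the accidental variance.
import numpy as np

rng = np.random.default_rng(0)
N = 2000
noise = lambda: 0.05 * rng.standard_normal(N)
t1, t2 = rng.standard_normal(N), rng.standard_normal(N)

swarms = {
    "A": np.column_stack([rng.standard_normal(N) for _ in range(3)]),  # swarm
    "B": np.column_stack([t1, t2, t1 + t2 + noise()]),                 # pancake
    "C": np.column_stack([t1, 2 * t1 + noise(), -t1 + noise()]),       # rod
    "D": np.column_stack([noise(), noise(), noise()]),                 # tiny ball
}

def rank_rho(x, tol=0.01):
    x = x - x.mean(axis=0)             # measure as deviations from the means
    eigvals = np.linalg.eigvalsh(np.cov(x.T))
    return int((eigvals > tol).sum())  # count the systematic directions only

ranks = {name: rank_rho(points) for name, points in swarms.items()}
print(ranks)
```

The flattening p is then n minus the reported rank, so the four swarms lose 0, 1, 2 and 3 degrees of freedom respectively.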
These four cases represent the main types of relationships between the variables x1, x2 and x3 of the population when the systematic variation in the x's is linear in character. Proceeding now to a more detailed discussion of these main types, it is necessary to distinguish between certain subtypes. In case (A) there is no systematic relationship existing between the variables. The selection of an arbitrary value of x1 and an arbitrary value of x2 leads to no particular expectation with respect to x3. Geometrically, if a line x1 = constant = a1, x2 = constant = a2 is drawn in the coördinate space (x1, x2, x3) there will be no concentration of the observations around any particular value of x3 on this line. And similarly for comparisons of x2 with x1 and x3, etc.
Case (B) is not so simple. The slab, or plane, may assume various positions so long as it passes through the origin (this requirement being, of course, a matter only of the selection of the origin of the x's). The following subcases are to be distinguished:
B 1. The plane may contain none of the coördinate axes. As it passes through the origin it makes an angle not equal to zero with each of the three coördinate axes.
B 2. The plane may containone axis, say the x3 axis. It will then
be perpendicularto the co6rdinateplane (xl, x2),and willappear
as a door hinged to the x3 axis. It might,of course,equally
well have been hinged to any other of the co6rdinate axes.
This is a veryimportantcase forwhat follows.
B 3. The plane ofthe observationsmay containtwooftheco6rdinate
axes, say x2and x,. It now coincideswiththe (x2,x3) co6rdinate plane and the variable xi shows no systematicvariation,
its variationbeing only withinthe thicknessof the slab; , = 0
approximately,that is o would have been zero ifit werenot for
the accidental variations. In the case (B) the plane can, of
course, never contain all three axes, so that (B 1-2-3) cover
all possible cases.
The rod through the origin, i.e., the main type (C), likewise presents three sub-types.
C 1. The rod does not lie in any of the coördinate planes.
C 2. It lies in one coördinate plane, say (x2, x3); then σ1 = 0, approximately.
C 3. It lies in two of the coördinate planes, that is, it coincides with
one axis. For example, lying in the coördinate planes (x1, x3) and (x2, x3), it coincides with the x3 axis and σ1 = σ2 = 0, approximately.
The case (D) involves only one situation, the observations lying within a very limited distance (very tiny ball) around the origin. Here σ1, σ2, and σ3 are all approximately equal to zero.
The term "approximately" in the above analysis is used to indicate that the parameters in question, because of the accidental variations, may deviate somewhat from their systematic values. It is because of the accidental variations that the points lie, in one case, in a slab rather than in a plane; again in a rod rather than on a perfect line, and finally in a tiny ball rather than rigorously in a point. For brevity hereafter the term "approximately" will as a rule be omitted in the discussion of the various cases.
2. ALGEBRAIC INTERPRETATION OF CLUSTER TYPES IN THREE VARIABLES
The cluster types that have been discussed are basic to an understanding of the nature of the systematic relationships between the variables x1, x2 and x3 and for interpreting the linear regression of any of the variables upon one of, or all, the others. This will be recognized when the algebraic expressions for the various cluster types are brought to mind. We proceed now to discuss the various types in algebraic terms, that is, in terms of the nature of the linear relationships that exist between the variables.
In case (A), the disorganized swarm, or the raindrops, there is complete lack of system in the spatial distribution of the variables and a regression equation between them has no meaning. The only thing to do in this case is to leave the data alone. This case, therefore, may be dismissed from further consideration once a criterion has been established by which the lack of systematic organization can be recognized.
In case (B), the plane, where the observations have lost one degree of freedom, there exists one and only one systematic relationship between the variables, of the form:

(2.1)  a1x1 + a2x2 + a3x3 = 0.

But this relationship may or may not actually contain all three variables. The case where it does contain them all is the case where the plane contains none of the axes (Case B 1). Here the coefficients a1, a2 and a3 are all different from zero. Then the equation (2.1) may be solved for any xi in terms of the other x's. That is to say, equation (2.1) may be written in any one of the following three ways:
(2.2)  x1 = a12.3 x2 + a13.2 x3
       x2 = a21.3 x1 + a23.1 x3
       x3 = a31.2 x1 + a32.1 x2
The notation a12.3, etc., is here used instead of the usual b12.3 for the regression coefficients in order to indicate that it is here only a question of solving the equation (2.1) in three different ways. The notation b12.3, etc., for the regression coefficients involves something more than merely different ways of writing the same equation. It refers to different statistical procedures for determining the coefficients, namely different directions in which to make the least squared minimalization. In the case (B 1) where it is possible to express each variable in terms of the others the variables are said to form a closed set.
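The distinction between merely re-solving (2.1) and the statistical coefficients b12.3 can be made concrete. In the sketch below (invented data and numpy least squares, not the authors' computation), fitting x1 on (x2, x3) and fitting x3 on (x1, x2) minimize residuals along different axes; re-solving the second fitted plane for x1 therefore does not reproduce the first fit:

```python
# A sketch of why the least-squares direction matters: the three
# regressions of (2.2) estimated statistically need not describe one
# and the same plane unless the set is exactly collinear.
import numpy as np

rng = np.random.default_rng(1)
N = 500
x2, x3 = rng.standard_normal(N), rng.standard_normal(N)
x1 = x2 + x3 + 0.5 * rng.standard_normal(N)   # a noisy plane (case B)
X = np.column_stack([x1, x2, x3])
X -= X.mean(axis=0)                           # deviations from means

def regress(i, X):
    # Least squares of variable i on the two others; residuals are
    # measured along the x_i axis only.
    others = [j for j in range(3) if j != i]
    coef, *_ = np.linalg.lstsq(X[:, others], X[:, i], rcond=None)
    return coef

b1 = regress(0, X)       # x1 on (x2, x3): one fitting direction
c = regress(2, X)        # x3 on (x1, x2): another fitting direction
# Re-solving the second fitted plane x3 = c[0]*x1 + c[1]*x2 for x1
# gives x1 = -(c[1]/c[0])*x2 + (1/c[0])*x3:
implied = np.array([-c[1] / c[0], 1.0 / c[0]])
print(b1, implied)       # similar, but systematically different
```

The two coefficient vectors agree only in the limit of an exact linear dependency, which is precisely the point of distinguishing a12.3 from b12.3.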
Case (B 2) is the situation where the observational plane contains the x3 axis. The set is still collinear but is no longer a closed set; x3 has now become a superfluous variable. It has no place in the regression. This can be seen easily in the geometric figure. The plane of the observations is perpendicular to the (x1, x2) coördinate plane and intersects the latter in a line through the origin. This line is called the trace of the regression plane in the (x1, x2) plane. This trace is evidently a locus of points with (x1, x2) coördinates. If an arbitrary (x1, x2) point be selected, then one of the following two things will happen. Either this (x1, x2) point falls on the trace, and then any magnitude of x3 may correspond to the selected (x1, x2) point; or the (x1, x2) point falls outside the trace, in which case no x3 magnitude will correspond to it. It, therefore, has no meaning now to express x3 in terms of x1 and x2. That which does have a meaning is to express a relation between x1 and x2. Selection of any value of x1 therefore immediately specifies a corresponding value of x2 as defined by the trace but does not specify any particular value of x3. Inversely, selection of a particular value of x3 does not locate a particular value of either x1 or x2. The variable x3 is not a partner in the systematic relationship. This is equivalent to saying that in the equation (2.1) a3 = 0, so that the relationship between the variables is now of the form:

a1x1 + a2x2 = 0

where a1 and a2 are not equal to zero. In other words, x1 can be expressed in terms of x2, thus: x1 = a12 x2; or inversely, x2 = a21 x1. The relationship x3 = a31.2 x1 + a32.1 x2 does not exist in the present case; x1 and x2 taken by themselves form a closed set, and x3 is superfluous.
Case (B 3) where the regression plane coincides with one of the coördinate planes also represents a single relationship of the form (2.1) between the variables. But in this relationship there are now two coefficients that are zero. If the plane lies in the (x2, x3) coördinate plane, a2 = a3 = 0. The relationship is then represented by

a1x1 = 0;  a1 ≠ 0;  σ1 = 0.

That is, the observations are scattered freely in the (x2, x3) coördinate plane, there being no systematic relationship between them in this plane. And x1 is an ineffective variable. So far as the systematic variation is concerned, x1 = 0 for all values of x2 and x3.
Where the observations are clustered in a rod, there exist two independent relationships between the variables. There even exist three relationships, represented by the three equations:

(2.3)  a1x1 + a2x2 = 0
       a'1x1 + a'3x3 = 0
       a''2x2 + a''3x3 = 0
But only two of these three relations will be independent. By knowing any two of them, the third can be derived. Any set of two variables taken by themselves now constitutes a two-dimensional collinear set.
In the case (C 1), each of the three two-dimensional collinear sets (2.3) is closed. That is to say, all the coefficients in (2.3) are ≠ 0. Any of the variables may now be expressed in terms of any one of the other variables.
In the case (C 2) the rod lies in one, and only one, of the coördinate planes, say in (x2, x3). There is now one closed set of two variables, represented by the equation a2x2 + a3x3 = 0, where a2 and a3 are both ≠ 0. Furthermore, there is one ineffective set of one variable, a1x1 = 0, a1 ≠ 0, σ1 = 0. But x1 cannot be expressed in terms either of x2 or of x3 (except in the trivial form x1 = 0·x2 + 0·x3).
The case (C 3), where the rod is lying in the x3 axis, involves two ineffective sets of one variable each, namely:

(2.4)  a1x1 = 0,  a1 ≠ 0,  σ1 = 0
       a2x2 = 0,  a2 ≠ 0,  σ2 = 0

that is, there is variation along the x3 axis but no variation along the x1 and x2 axes.
In the case of the point (or tiny ball), (D), there exist three independent relations between the variables. There is no freedom left now in the three variables, all three being ineffective. That is to say, the three independent relations are:

(2.5)  a1x1 = 0,  a1 ≠ 0,  σ1 = 0
       a2x2 = 0,  a2 ≠ 0,  σ2 = 0
       a3x3 = 0,  a3 ≠ 0,  σ3 = 0
3. ALGEBRAIC INTERPRETATION OF CLUSTER TYPES IN n VARIABLES
A set of n variables (x1, x2 . . . xn) is called a linearly dependent set, or collinear, when there exists at least one relation of the form:

(3.1)  a1x1 + a2x2 + . . . + anxn = 0

where the coefficients, ai, are not all equal to zero. A collinear set is said to be flattened exactly p times, or to be p-fold flattened, when there exist exactly p independent relationships of the form (3.1) between the n variables. A necessary and sufficient condition for a set to be p-fold flattened is that there exist at least one ρ-dimensional subset which is non-collinear, where ρ = n − p, while all (ρ + 1)- and higher dimensional subsets are collinear. In this case there exist exactly p independent regressions, each involving not more than (ρ + 1) variables, and further being such that the set of these (ρ + 1) variables is a simply collinear set.
When p = 0 the set of n variables is not flattened; there exists no systematic linear relationship between the variables and no regression equation is possible.
When p = 1 the set is once flattened or is simply collinear. The rank, ρ, of the set is now (n − 1) and there exists one (n − 1)-dimensional regression plane.
This regression plane may or may not contain all the variables. The plane will not contain those variables the coördinate axes of which in n-dimensional space lie in the regression plane, and the corresponding coefficients in the regression equation (3.1) will be equal to zero. These variables are superfluous variables in the regression system. If the regression coefficients ai (i = 1, 2, . . . n) are all different from zero, all of the n variables are present in the regression equation. The set is then a closed set. The (n − 1)-dimensional regression plane now contains none of the coördinate axes and each xi can be expressed linearly in terms of the others. In this case there cannot exist any relationship of the form (3.1) involving fewer than n variables.
When p > 1 the set is multiply collinear or multiply flattened and there exists a regression manifold of less than (n − 1) dimensions.
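As a modern numerical stand-in for these definitions (an assumption of this sketch, not the authors' procedure; numpy and invented data), the flattening p can be read off as n minus the rank of the centered observation matrix, and the p independent relations (3.1) span its null space:

```python
# A sketch of p-fold flattening for n = 4 variables built from
# rho = 2 free factors, so that p = n - rho = 2 relations (3.1) exist.
import numpy as np

rng = np.random.default_rng(2)
N = 200
t1, t2 = rng.standard_normal(N), rng.standard_normal(N)
X = np.column_stack([t1, t2, t1 + t2, t1 - t2])
X -= X.mean(axis=0)                            # deviations from means

n = X.shape[1]
rho = int(np.linalg.matrix_rank(X, tol=1e-8))  # unfolding capacity
p = n - rho                                    # flattening
print(rho, p)

# Each row of `relations` holds coefficients a1 ... an of one relation:
_, _, vt = np.linalg.svd(X)
relations = vt[rho:]
assert np.allclose(X @ relations.T, 0.0, atol=1e-8)
```

With accidental variation added, the exact rank test would be replaced by the statistical criteria of the next section.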
4. STATISTICAL CRITERIA FOR CLUSTER TYPES
As statistical criteria for these several types of clustering there are here introduced the coefficient of collective alienation and its correlative, the coefficient of collective correlation. Proceeding now to the exact definition of the collective alienation and correlation coefficients, let
(R) = ( r11  r12  . . .  r1n
        r21  r22  . . .  r2n
        . . . . . . . . . .
        rn1  rn2  . . .  rnn )

be the correlation matrix for the n variables x1, x2 . . . xn; rij is the simple (total) correlation coefficient between xi and xj. The determinant value of this matrix is denoted by

R = R(12 . . . n) = | r11  r12  . . .  r1n
                     r21  r22  . . .  r2n
                     . . . . . . . . . .
                     rn1  rn2  . . .  rnn |

Further, Rij = Rij(12 . . . n) denotes the element in the i-th row and the j-th column of the adjoint correlation matrix (R). Each Rij is determined by calculating the determinant value of (R) after crossing out the i-th row and the j-th column and multiplying by (−1)^(i+j).
We shall not here enter into a theoretical discussion as to why the collective alienation and correlation coefficients offer plausible criteria of how far a given set of statistical variables deviates from being linearly dependent. This interpretation is discussed at length in the paper by Ragnar Frisch already referred to.¹ Here mention will be made only of the formal hierarchic order that exists between the partial, the multiple and the collective coefficients. Consider the set of variables, x1, x2 . . . xn, and denote the classical partial correlation and alienation coefficients by rij(12 . . . n) and sij(12 . . . n). For the sake of symmetry, all the subscripts, 1, 2 . . . n, are written as secondary subscripts, without omitting i and j. In order not to give rise to confusion with the usual notation, where i and j are omitted from the list of secondary subscripts, the secondary subscripts are here enclosed in a parenthesis. Similarly the classical multiple correlation and alienation coefficients are designated by ri(12 . . . n) and si(12 . . . n) respectively. As is well known, these parameters satisfy the equation:

(4.1)  r² + s² = 1

where the same set of subscripts is attached to r and to s. The collective correlation and alienation coefficients are also defined so as to satisfy (4.1). But while the partial coefficients depend on two primary subscripts and the multiple coefficients depend on one primary subscript, the collective coefficients have no primary subscripts at all. They are defined with regard to the set of variables as such. More precisely, they are defined thus:

¹ In this paper the term coefficient of scatter was used instead of coefficient of alienation.
(4.2)  s = s(12 . . . n) = √R(12 . . . n) = coefficient of collective alienation in the set (x1, x2 . . . xn)

       r = r(12 . . . n) = √(1 − R(12 . . . n)) = coefficient of collective correlation in the set (x1, x2 . . . xn)
It is easy to prove that these coefficients satisfy the relations:

(4.3)  0 ≤ s ≤ 1,  0 ≤ r ≤ 1
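Definition (4.2) translates directly into code. A minimal sketch, assuming numpy and invented data:

```python
# The collective alienation s is the square root of the correlation
# determinant R(12...n), and the collective correlation r satisfies
# r**2 + s**2 = 1, as in (4.1)-(4.2).
import numpy as np

def collective_coefficients(X):
    R = np.corrcoef(X.T)            # correlation matrix of the columns
    det = np.linalg.det(R)
    return np.sqrt(det), np.sqrt(1.0 - det)   # (alienation, correlation)

rng = np.random.default_rng(3)
N = 5000
# Nearly orthogonal variables: s should lie near its maximum, unity.
s_orth, _ = collective_coefficients(rng.standard_normal((N, 3)))
# A nearly collinear set (case B): s should lie near zero.
t1, t2 = rng.standard_normal(N), rng.standard_normal(N)
flat = np.column_stack([t1, t2, t1 + t2 + 0.05 * rng.standard_normal(N)])
s_flat, r_flat = collective_coefficients(flat)
print(round(s_orth, 3), round(s_flat, 3), round(r_flat, 3))
```

The two extremes of (4.3) correspond to the perfectly unfolded swarm and to an exact linear dependency, as the text goes on to explain.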
If n = 2 the collective alienation coefficient reduces to the simple (total) alienation coefficient, and the correlation coefficient (apart from its sign) reduces to the simple (total) correlation coefficient. From the definitions here given it follows that s² and r² are polynomials in the simple correlation coefficients rij; s² and r², therefore, never can become of the indeterminate form 0/0 (unless one or more of the simple correlation coefficients become of this form). This is a fundamental property of the collective alienation and correlation coefficients, which distinguishes these parameters from the corresponding partial and multiple coefficients. The fact that the partial and multiple coefficients may become of the indeterminate form 0/0 is easily seen from the well known formulae:
(4.4)  ri(12 . . . n) = √(1 − R/Rii) = coefficient of multiple correlation

(4.5)  rij(12 . . . n) = −Rij/√(Rii Rjj) = coefficient of partial correlation

It is indeed easy to prove that

(4.6)  Rij² ≤ Rii Rjj

and

(4.7)  0 ≤ R ≤ Rii ≤ 1

so that if Rii = 0, (4.4) and (4.5) must give rise to indeterminate expressions of the type 0/0.
The general property of the collective alienation coefficient which makes this parameter a useful tool in studying cluster types is the following. The collective alienation is equal to zero when, and only when, there exists an exact linear dependency in the set for which the collective alienation is computed, and furthermore this coefficient increases as the swarm of scatter points takes on a shape that deviates more and more from the shape where a linear dependency exists. When the collective alienation has become equal to 1 (which is its largest possible value) the swarm of scatter points has reached a shape which may be characterized as perfect unfolding. The variables have now become orthogonal (uncorrelated), that is to say, all the simple correlation coefficients rij are equal to zero. The collective alienation coefficient being equal to unity is the necessary and sufficient condition for orthogonality in the above defined sense.
The three variables x1, x2 and x3 may be taken to illustrate the use of the collective alienation as a tool in determining cluster types. The various cases to be discussed have their ideal pattern in the elementary, well-known algebraic propositions regarding linear forms. What is done here is primarily to translate these propositions from the language of perfect linear dependency to the language of "nearly" linear dependency.
A. The disorganized swarm is characterized by s(123) being near to unity.
B. The plane. One flattening; p = 1, ρ = 2, simply collinear. The criterion for this case is that s(123) is near to zero, and furthermore at least one of the three magnitudes, s(23) = √R11(123), s(13) = √R22(123), s(12) = √R33(123), is significantly different from zero.
B 1. Plane contains no coördinate axis. Each of the three magnitudes, s(13), s(23) and s(12), is significantly different from zero. In this case (x1, x2, x3) form a closed set.
B 2. Plane contains one coördinate axis (x3). Now s(12) = √R33(123) is close to zero while s(23) and s(13) are significantly different from zero. In this case the two variables (x1, x2) taken by themselves form a closed set and x3 is a superfluous variable.
B 3. Plane contains two coördinate axes, (x2, x3): Here s(12) and s(13) are both close to zero, while s(23) is significantly different from zero. In this case x2 and x3 form a set of disorganized variables while x1 is ineffective.
C. The rod. Two-dimensional flattening; p = 2, ρ = 1; multiply collinear. The criterion for this case is that s(123) shall be near to zero and, at the same time, that s(12), s(13) and s(23) shall also be close to zero. There now exist two independent relationships between the variables.¹ These two relationships may be written

¹ Strictly speaking, the criteria here considered show only that there exist at least two linear relationships. If it be assumed that not all the three variables are ineffective (i.e., that we do not have case D) then the criteria considered show that there are exactly two independent relationships.
in different ways. In particular they may by elimination be written in such a way that each of them contains at most two variables. If it be assumed that all the variables are effective then each of the two relations thus obtained must contain exactly two variables. In this case any set of two variables constitutes a closed two-dimensional set.
Criteria for the sub-types (C 2) and (C 3) cannot be discussed in terms of the collective alienation coefficients only. Or more precisely expressed: The very fact of computing the collective alienation coefficients (which are based on the simple correlation coefficients) involves the assumption that all the variables are effective. But this is also the only assumption made in using the collective alienation coefficients. If none of the variables are ineffective, all the simple (total) correlation coefficients are determinative.
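The three-variable criteria can be collected into a rough classifier. The data and the threshold 0.2 (standing in for "near to zero") are invented for this sketch, not values given by the authors:

```python
# Classify a three-variable swarm as A (disorganized), B (plane) or
# C (rod) from s(123) and the two-dimensional alienations s(12),
# s(13), s(23), following the criteria above.
import numpy as np

def alienation(X, cols):
    R = np.corrcoef(X[:, cols].T)
    return np.sqrt(max(np.linalg.det(R), 0.0))

def cluster_type(X, near_zero=0.2):
    s123 = alienation(X, [0, 1, 2])
    if s123 > near_zero:
        return "A"                       # disorganized swarm
    pairs = [alienation(X, c) for c in ([0, 1], [0, 2], [1, 2])]
    if all(s < near_zero for s in pairs):
        return "C"                       # rod: multiply collinear
    return "B"                           # pancake: simply collinear

rng = np.random.default_rng(4)
N = 3000
t1, t2 = rng.standard_normal(N), rng.standard_normal(N)
e = lambda: 0.05 * rng.standard_normal(N)
A = np.column_stack([rng.standard_normal(N) for _ in range(3)])
B = np.column_stack([t1, t2, t1 + t2 + e()])
C = np.column_stack([t1, 2 * t1 + e(), -t1 + e()])
print(cluster_type(A), cluster_type(B), cluster_type(C))
```

As the text warns, the scheme presumes all three variables effective; an ineffective variable (case D, or the sub-types C 2 and C 3) would have to be screened out first.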
The situation for three variables is now easily generalized to the case of n variables. If a great number of variables (x1, x2 . . . xn) are observed, and criteria for cluster types are wanted, compute first the collective alienation coefficient s(12 . . . n) for the whole set. If this coefficient is not close to zero, any attempt at studying the variables by means of linear relationships should be abandoned.
If s(12 . . . n) is close to zero, consider all the (n − 1)-dimensional subsets (23 . . . n), (13 . . . n) . . . (12 . . . (n − 1)) that can be formed by leaving out one variable at a time. There are in all n such subsets. Compute the collective alienation coefficient for each such subset. Such a collective alienation coefficient we call an (n − 1)-dimensional collective alienation coefficient. If there exists at least one such (n − 1)-dimensional coefficient that is significantly different from zero, the set is simply collinear, that is, there exists exactly one linear relation of the form:

(4.8)  a1x1 + a2x2 + . . . + anxn = 0.
This relation should be looked upon as actually containing only those variables xk that are such that the collective alienation obtained by leaving out xk, namely¹ s(12 . . . )k( . . . n), is significantly different from zero.
If all the (n − 1)-dimensional collective alienation coefficients s(12 . . . )k( . . . n) (k = 1, 2 . . . n) are close to zero, the set is at least 2-dimensionally flattened, that is, there exist at least two independent relations of the form (4.8). In order to find out if there exist exactly two or possibly even more than two such relations, consider all the

¹ The inverse parenthesis ) ( is used to denote "exclusion of."
(n − 2)-dimensional subsets that can be formed by leaving out in all possible ways two of the variables, and compute the collective alienation coefficient for each such subset. Such a coefficient will be called an (n − 2)-dimensional collective alienation coefficient. If there exists at least one such (n − 2)-dimensional coefficient that is significantly different from zero, then the flattening is exactly two, that is, there exist exactly two independent relations of the form (4.8).
If all the (n − 2)-dimensional collective alienation coefficients are close to zero, consider the (n − 3)-dimensional coefficients. If all these should also be close to zero, consider the (n − 4)-dimensional coefficients, etc. Suppose it be necessary to continue to the ρ-dimensional coefficients before a level is reached where at least one of the collective alienation coefficients is significantly different from zero. That is to say, all the (ρ + 1)- and higher dimensional collective alienation coefficients are close to zero, while there exists at least one ρ-dimensional coefficient that is significantly different from zero. In this case the rank (i.e., the unfolding capacity) of the set is ρ, and the flattening is p = n − ρ. There now exist exactly p independent systematic relations of the form (4.8). By combination and elimination, using these p relations, we may arrive at many sorts of linear relations between the variables. There is in particular one set of relations that is interesting: We may select a certain ρ-dimensional subset which is not a collinear set (at least one such set exists by the very definition of ρ). And then we may express each of the other p variables linearly in terms of the selected ρ-dimensional subset.
5. MEANINGLESS RESULTS WHEN LINEAR DEPENDENCIES EXIST
Among the various possible cluster types discussed in the preceding sections there is only one very particular type in which all the orthodox correlation and regression parameters have a sense, namely, the case of a set which is not only collinear but also closed. If the set is not closed a great number of the orthodox correlation parameters lose their meaning. And if the set becomes multiply collinear, each of the orthodox correlation and regression parameters loses its meaning. Consider first the set which is simply collinear but is not a closed set. This is the case where s(12 . . . n) is close to zero, and one or more, but not all, of the (n − 1)-dimensional coefficients s(12 . . . )k( . . . n) are also close to zero. Those variables xk for which s(12 . . . )k( . . . n) is close to zero should simply be looked upon as superfluous variables from the point of view of linear regressions. Under no circumstances must the coefficients of (4.8) be determined by computing the regression coefficients bkj.12 . . . n of xk on the other variables. In fact the collective alienation coefficient s(12 . . . )k( . . . n) occurs as a denominator in bkj.12 . . . n. Determining bkj.12 . . . n would, therefore, mean forcing a magnitude whose deviation from zero is non-significant into the denominator. The system of regression coefficients thus obtained would consequently be of the form: accidental error divided by accidental error, and would have no sense. This situation is illustrated by the three-dimensional case (B 2), that is the case where there exists exactly one systematic regression plane in (x1, x2, x3), but where this plane contains the x3-axis. In this case it would obviously have no sense to express x3 as a linear combination of x1 and x2. In this case neither the multiple coefficient of correlation of xk on the set of the other variables nor the partial coefficient of correlation between xk and any particular one of the other variables should be computed. All of these parameters will now be without a meaning. In the rigorous case they depend on an expression of the indeterminate form 0/0, and in the statistical case their values are determined by the ratio between two small quantities due to accidental errors of observation. It is even easy to construct cases where the limiting process s(12 . . . )k( . . . n) → 0 is carried out in such a way that the value of the partial r may have any value between plus one and minus one (these extreme values included). Or the value of the multiple r may be made to assume any value between zero and one.
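The instability described here is easy to reproduce. In the sketch below (synthetic case (B 2)-like data, not the authors' computation; the tiny noise level 0.001 is an arbitrary choice), the regression of the free variable x3 on the nearly dependent pair (x1, x2) yields large coefficients that change drastically between samples, while the two-dimensional alienation s(12) near zero gives the warning that x1 and x2 form the closed set:

```python
# Accidental error divided by accidental error: regressing a free
# variable on a nearly collinear pair gives meaningless coefficients,
# but s(12) flags the dependency.
import numpy as np

def fit_and_s12(seed):
    rng = np.random.default_rng(seed)
    N = 400
    x1 = rng.standard_normal(N)
    x2 = 2 * x1 + 0.001 * rng.standard_normal(N)  # trace: x2 = 2 x1
    x3 = rng.standard_normal(N)                   # free along its own axis
    X = np.column_stack([x1, x2])
    coef, *_ = np.linalg.lstsq(X - X.mean(0), x3 - x3.mean(), rcond=None)
    s12 = np.sqrt(1.0 - np.corrcoef(x1, x2)[0, 1] ** 2)
    return coef, s12

c_a, s12 = fit_and_s12(10)   # regression of x3 on (x1, x2), one sample
c_b, _ = fit_and_s12(11)     # the same regression on a fresh sample
# The fitted coefficients are huge and unstable from sample to sample,
# while s(12) is nearly zero.
print(c_a, c_b, s12)
```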
And that is not all. The standard error of these correlation parameters will also become meaningless for a similar reason. Neither the multiple nor the partial correlation parameters considered nor the standard errors of these will consequently furnish the slightest indication of the cluster type or of the fact that xk is a superfluous variable. But in the collective alienation coefficients we have a system of criteria that can never be subject to this kind of meaninglessness, because the collective alienation coefficients, as already mentioned, are polynomials in the simple correlation coefficients. An inspection of the (n − 1)-dimensional collective alienation coefficients would immediately tell which of the variables should be ousted from the regression system.
Now consider the case where none of the (n − 1)-dimensional collective alienation coefficients is significantly different from zero, that is, the case where the set is multiply collinear. There now exist at least two independent regressions of the form (4.8). The situation is now such that whatever variable in the set (x1, x2 . . . xn) be selected, the result would always be meaningless if the regression were determined of this variable on the others. Still worse, the standard errors of the regression coefficients computed by the orthodox formulae would also lose their meaning, so that there would be no warning signal telling us to keep away from this sort of regression.
Lookingback on the variouscases discussedit is clearthat the
troublealwayscomesin thosecases wherethereexists(rigorously
or
approximately) a linear dependencybetweenthose variables that are
writtenin therightmemberof theorthodox
regressionequation,that is,
intheleast
as independent
betweenthosevariablesthatareconsidered
squarefitting
procedure.
In order to get back to a basis where the regressions have a meaning, it is necessary to find those subsets that are simply collinear, and then to treat each of these subsets separately by reducing it to a closed set and to determine the linear regression in it. A general scheme for performing this analysis by means of the collective alienation coefficients is the following: Determine the rank p as explained above. Then select that p-dimensional subset for which the p-dimensional collective alienation coefficient is largest. Call this set the basis set. This is the p-dimensional subset that comes closest to being an uncorrelated (orthogonal) set. Then form subsets by combining the basis set with each of the remaining variables. Each of these (p+1)-dimensional sets may be considered as simply collinear and treated separately. That is to say, each such set should be reduced to a closed set by omitting those variables xk that are superfluous in the set according to the collective alienation coefficient criterion, as applied to this subset.
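The scheme just described can be sketched in modern computational form. In the sketch below (Python/NumPy), the collective alienation coefficient S of a subset is taken as the square root of the determinant of its correlation submatrix; this identification, like the function names, is an assumption of the illustration, chosen because it reproduces the numerical values S(234) = 0, S(23) = .98, S(24) = .72 and S(34) = .51 quoted in the example that follows.

```python
import numpy as np
from itertools import combinations

def S(R, subset):
    """Collective alienation coefficient of a subset of variables,
    taken here (an assumption of this sketch) as the square root of
    the determinant of the corresponding correlation submatrix."""
    sub = R[np.ix_(subset, subset)]
    return np.sqrt(max(np.linalg.det(sub), 0.0))

def basis_set(R, p):
    """Select the p-dimensional subset whose collective alienation
    coefficient is largest, i.e. the subset closest to orthogonal."""
    n = R.shape[0]
    return max(combinations(range(n), p), key=lambda s: S(R, s))

def candidate_sets(R, basis):
    """Combine the basis set with each remaining variable, giving the
    (p+1)-dimensional sets to be reduced to closed sets separately."""
    n = R.shape[0]
    return [tuple(basis) + (k,) for k in range(n) if k not in basis]

# Correlation matrix R(1234) of the illustrative example below
# (indices 0..3 stand for x1..x4):
R = np.array([
    [ 1.000000,  0.169678,  0.408675, -0.392790],
    [ 0.169678,  1.000000,  0.218931, -0.689427],
    [ 0.408675,  0.218931,  1.000000, -0.857719],
    [-0.392790, -0.689427, -0.857719,  1.000000],
])

basis = basis_set(R, 3)        # here the rank is p = 3: one linear relation
sets = candidate_sets(R, basis)
```

With this matrix the basis set comes out as (x1, x2, x3); combining it with x4 gives the single (p+1)-dimensional candidate set, which the S criterion would then reduce to a closed set.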
The above analysis may be illustrated by an artificially constructed problem. Ten observations were selected on four variables, X1, X2, X3 and X4, the values being written down arbitrarily except for the requirement that the sum of the variables X2, X3 and X4 for each observation should equal one hundred. (When measured as deviations from their means, therefore, x2 + x3 + x4 = 0.) For convenience the set of ten values for each variable was made to total to even hundreds so that means and deviations from means would not involve decimal values. Using capital X to denote the absolute value of a variable and small x to denote the corresponding deviation from the mean, the data, as above defined, are:
     X1    X2    X3    X4
     25    18    29    53
     14    27    43    30
     37    31    51    18
     17    19    34    47
     24    12    46    42
     29    15    35    50
     41    21    28    51
     39    28    41    31
     52    17    57    26
     22    12    36    52
    ---   ---   ---   ---
    300   200   400   400
The correlation coefficients for these observations which enter into the correlation matrix (R(1234)) are:

    r12 = +.169678    r13 = +.408675    r14 = -.392790
    r23 = +.218931    r24 = -.689427    r34 = -.857719
    r11 = r22 = r33 = r44 = 1.00
That x2, x3 and x4 form a closed set is shown by the values of the coefficients S(234), S(23), S(24) and S(34), for S(234) = 0, S(23) = .98, S(24) = .72 and S(34) = .51. None of the last three quantities is close to zero, hence the conclusion that (x2, x3, x4) form a closed set. The partial correlation coefficients in this set are all equal to minus one and the multiple correlation coefficients are all equal to plus one. This is easily checked, as has been done, by actual computation. So far everything is all right. When, however, variable x1 is included in the system trouble arises. If one tries to compute the partial correlations r12.34, r13.24 or r14.23 it is found that they are all of the form 0/0, and similar results hold for the multiple coefficient r1.234 and for the regression coefficients b12.34, b13.24 and b14.23.
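The figures above are easy to verify by direct computation; a sketch (Python/NumPy) follows, again taking S of a subset as the square root of the determinant of its correlation submatrix, an assumed definition that reproduces the quoted values.

```python
import numpy as np

# The ten observations of the illustrative problem; each row is one
# observation of (X1, X2, X3, X4), with X2 + X3 + X4 = 100 throughout.
X = np.array([
    [25, 18, 29, 53],
    [14, 27, 43, 30],
    [37, 31, 51, 18],
    [17, 19, 34, 47],
    [24, 12, 46, 42],
    [29, 15, 35, 50],
    [41, 21, 28, 51],
    [39, 28, 41, 31],
    [52, 17, 57, 26],
    [22, 12, 36, 52],
], dtype=float)

R = np.corrcoef(X, rowvar=False)   # simple correlation matrix R(1234)

def S(subset):
    """sqrt of the determinant of the correlation submatrix
    (assumed definition of the collective alienation coefficient)."""
    sub = R[np.ix_(subset, subset)]
    return np.sqrt(max(np.linalg.det(sub), 0.0))

# S(234) vanishes, so (x2, x3, x4) form a closed set, while none of
# S(23), S(24), S(34) is near zero.
S_234 = S((1, 2, 3))
S_23, S_24, S_34 = S((1, 2)), S((1, 3)), S((2, 3))
```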
Mr. H. I. Richards, writing in the March, 1931, number of this JOURNAL on "Analysis of the Spurious Effect of High Intercorrelation of Independent Variables on Regression and Correlation Coefficients," says "that accurate coefficients of multiple correlation and regression can be obtained when the independent variables are perfectly intercorrelated, if there are no errors in the calculations." This is fundamentally wrong and is due to certain slips in Mr. Richards' mathematics.
He quotes the formulae for multiple and partial correlation, as given by Yule, as:

(7.1)  1 - r²1.234...n = (1 - r²12)(1 - r²13.2)(1 - r²14.23) . . . (1 - r²1n.23...(n-1))

(7.2)  r14.23 = (r14 - r1.23 r4.23) / [(1 - r²1.23)^(1/2) (1 - r²4.23)^(1/2)]

The latter is not correct, the correct formula being:

(7.3)  r14.23 = (r14.3 - r12.3 r24.3) / [(1 - r²12.3)^(1/2) (1 - r²24.3)^(1/2)]
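That (7.3) and not (7.2) is the correct recursion can be checked on any non-degenerate data set: the partial correlation r14.23 obtained from (7.3) must agree with the correlation between the residuals of x1 and of x4 after both have been regressed on x2 and x3. A sketch (the data here are arbitrary illustrative values, not those of the example above):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))          # arbitrary non-degenerate data
R = np.corrcoef(X, rowvar=False)

def partial(a, b, c):
    """First-order partial correlation r_ab.c from simple correlations."""
    return (R[a, b] - R[a, c] * R[b, c]) / np.sqrt(
        (1 - R[a, c] ** 2) * (1 - R[b, c] ** 2))

# Formula (7.3): r14.23 built from first-order partials holding x3 fixed.
r14_3, r12_3, r24_3 = partial(0, 3, 2), partial(0, 1, 2), partial(1, 3, 2)
r14_23 = (r14_3 - r12_3 * r24_3) / np.sqrt(
    (1 - r12_3 ** 2) * (1 - r24_3 ** 2))

def residual(y, Z):
    """Least-squares residual of y on the columns of Z (plus a constant)."""
    A = np.column_stack([np.ones(len(y)), Z])
    return y - A @ np.linalg.lstsq(A, y, rcond=None)[0]

# Direct definition of the partial correlation, via residuals.
e1 = residual(X[:, 0], X[:, [1, 2]])
e4 = residual(X[:, 3], X[:, [1, 2]])
r14_23_direct = np.corrcoef(e1, e4)[0, 1]
```

Formula (7.2), by contrast, mixes the multiple correlations r1.23 and r4.23 into the same pattern and does not in general reproduce this value.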
However, this is of minor importance in this connection. Mr. Richards' fundamental error is of a different nature and independent of whether we start from (7.2) or (7.3). He maintains that if X2, X3 and X4 are perfectly correlated, for instance, by the fact that their sum is equal to one hundred, the result of computing R1.234 by (7.1) would be the same whether all three variables X2, X3 and X4 are included or one of them omitted; for example, X4. This must be so, he claims, because

A. Factors (1 - r²12) and (1 - r²13.2) are the same in each case.
B. Factor (1 - r²14.23) = 1 whenever r4.23 = 1.
His attempt to prove proposition B is as follows:

1. Factor (1 - r²4.23) in the denominator of (7.2) equals zero whenever there is perfect correlation between X2, X3 and X4. This makes the denominator of (7.2) equal zero.
2. The numerator in (7.2) also equals zero since:
   (a) r4.23 = 1
   (b) r1.23 = r14
3. The form r14.23 = 0/0 must however be equal to zero since the limit of r14.23 is zero when r4.23 tends toward unity.
Proposition (A) is obviously correct, and (B 1), which is the same as (B 2 a), is also correct, as is seen from the formula

    r²4.23 = 1 - R(234)/R(23)

since, in the case under consideration, R(234) = 0, and it may be assumed that R(23) ≠ 0, inasmuch as the case where x2 and x3 are perfectly correlated is of no interest in the present connection.
But proposition (B 2 b) is not correct. The correlation r1.23 is the correlation between x1 and the value of the variable x1 calculated from the regression:

(7.4)  b12.3 x2 + b13.2 x3

where the coefficients are determined so as to make the correlation a maximum. On the other hand, if x2 + x3 + x4 = 0 (the x's here being deviations from means) then r14 can be looked upon as the correlation between x1 and the variable:

(7.5)  -x2 - x3.

It is quite obvious that the correlation between x1 and (7.4) need not be the same as the correlation between x1 and (7.5). The only thing that can be said is that

    r²14 ≤ r²1.23
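On the data of the illustrative problem this can be checked directly: since x4 = -x2 - x3 in deviations, the correlation of x1 with (7.5) is exactly r14, while the correlation of x1 with the fitted value (7.4) is r1.23, and here the latter squared correlation is strictly the larger. A sketch (Python/NumPy):

```python
import numpy as np

# The ten observations of the illustrative problem (X2 + X3 + X4 = 100).
X = np.array([
    [25, 18, 29, 53], [14, 27, 43, 30], [37, 31, 51, 18],
    [17, 19, 34, 47], [24, 12, 46, 42], [29, 15, 35, 50],
    [41, 21, 28, 51], [39, 28, 41, 31], [52, 17, 57, 26],
    [22, 12, 36, 52],
], dtype=float)
x = X - X.mean(axis=0)               # deviations; x2 + x3 + x4 = 0 rowwise

r14 = np.corrcoef(x[:, 0], x[:, 3])[0, 1]
# Because x4 = -x2 - x3, the correlation of x1 with (7.5) is exactly r14:
r_75 = np.corrcoef(x[:, 0], -x[:, 1] - x[:, 2])[0, 1]

# r1.23: correlation of x1 with its least-squares value (7.4) from x2, x3.
A = x[:, [1, 2]]
b = np.linalg.lstsq(A, x[:, 0], rcond=None)[0]
r1_23 = np.corrcoef(x[:, 0], A @ b)[0, 1]
```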
The proposition (B 3), namely that in the case considered r14.23 = 0, is not correct either. There is even a double reason for its falsity. First of all, where x2 + x3 + x4 = 0 there is no question of a limiting process at all. The value of r4.23 is not approaching as a limit the value unity, but is exactly equal to unity all the time and can equal nothing else. In the second place, even if there were a valid case for considering a limiting process at all, it is possible to show that this limiting process can be carried out in such a way as to obtain for the limiting value of r14.23 any result whatever between +1 and -1. The error which Mr. Richards makes on this point is that he lets r4.23 tend toward unity while all the other parameters involved, r14, r1.23, etc., are kept constant. This is, however, only a very special way of performing the limiting process that has as its final result a situation where x2 + x3 + x4 = 0 (or where there exists some other exact linear relation between these variables). In general, such a limiting process can be performed by varying certain observational values in x2 or x3 or x4. These observational values must be considered as the independent variables during the limiting process. If that is done, we see that the numerator and the denominator in the ratio defined by (7.2) (or by the correct formula (7.3)) are not functions of a single variable but of several, and the limiting value of a ratio whose numerator and denominator depend on more than one variable depends not only on the final situation toward which the system tends but also on the path followed in order to reach the final situation.

We can illustrate this by the case of a ratio f(x,y)/g(x,y) between two functions of two independent variables x and y. Suppose that both f and g vanish at the origin; i.e., f(0,0) = g(0,0) = 0. What is the limiting value of the ratio f(x,y)/g(x,y) when x and y tend toward zero? That will depend on the path which the representative point in (x,y) coördinates follows on its way toward the origin. For simplicity it may be assumed that f and g have continuous partial derivatives in the vicinity of the origin. The ratio considered will, therefore, in the vicinity of the origin be equal to

(7.6)  (fx dx + fy dy) / (gx dx + gy dy)

where fx, fy, gx, gy denote the partial derivatives at the origin and dx, dy denote the coördinates of the path along which the representative point tends toward the origin. It is quite obvious that in general the limiting value of (7.6) will depend on the ratio dy/dx, that is, it will depend on the direction from which the point (x,y) tends toward the origin.
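The path-dependence is easy to exhibit with a concrete pair of functions vanishing at the origin, here the arbitrarily chosen linear pair f = x + 2y, g = x - y, for which the ratio of differentials above reduces to (dx + 2dy)/(dx - dy):

```python
f = lambda x, y: x + 2 * y     # f(0, 0) = 0
g = lambda x, y: x - y         # g(0, 0) = 0

def limit_along(dx, dy, t=1e-8):
    """Value of f/g close to the origin along the ray (t*dx, t*dy)."""
    return f(t * dx, t * dy) / g(t * dx, t * dy)

# The 0/0 form takes different limits along different directions:
#   dy/dx = 0  ->  (1 + 0)/(1 - 0) =  1
#   dx = 0     ->  (0 + 2)/(0 - 1) = -2
#   dy/dx = 2  ->  (1 + 4)/(1 - 2) = -5
```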
What Mr. Richards has proved by his limiting process is, therefore, only that it is possible to approach a linear dependency between x2, x3 and x4 in such a particular way that the limiting value for r14.23 becomes 0. But he has by no means proved that this limiting value must be zero because there exists a linear dependency between x2, x3 and x4. In Frisch's paper, already referred to, there are given examples of limiting processes that will bring the partial correlation coefficients in three variables as close as may be desired to any magnitude between -1 and +1 (where the limiting process tends toward representing a linear dependency between two of the variables), and it would not be difficult to give similar examples with one more variable, which is the situation here discussed.
One further comment might be to the point: Mr. Richards gives some numerical computations intended to verify the theoretical proof of his contention that one will get the same value of R1.234 whether X4 is included or not. However, these numerical computations cannot contain any such "verification." Either the computations must be wrong, or he must at some point or another in the computations have made use of that fact which should be proved, namely that r14.23 = 0.