NEW SERIES, NO. 176 (VOL. XXVI)                                DECEMBER, 1931

JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION
Formerly the Quarterly Publication of the American Statistical Association

STATISTICAL CORRELATION AND THE THEORY OF CLUSTER TYPES

BY RAGNAR FRISCH, University of Oslo, AND BRUCE D. MUDGETT, University of Minnesota

1. THE NOTION OF CLUSTER TYPES AND THEIR GEOMETRIC REPRESENTATION IN THREE DIMENSIONS

Let there be given a statistical population composed of N observations, each characterized by n variable attributes, x1, x2 ... xn. Let n be called the dimensionality of the observations. It is proposed to discuss the types of systematic variation that may take place between the several variables: What is the degree of freedom of the system? Which variables can be considered independent? And so on.

To simplify, we shall consider only linear relationships between the variables. This will be sufficient to indicate the nature of the possibilities it is proposed to analyze. Most of the argument in the present paper is built upon the theory of cluster types developed by Ragnar Frisch in his paper, "Correlation and Scatter in Statistical Variables," in the Nordic Statistical Journal, Vol. 1, 1928, pp. 36-102. To simplify the discussion and to make it possible to build upon a direct geometric intuition of the situation, the case of three variables, x1, x2 and x3, will first be discussed. That is to say, we deal with a population having three attributes measured along three rectilinear axes. In this case the dimensionality of the population, n, is equal to three.

A point in three-dimensional space then represents a given observation, that is, a given individual in the population of N. Each such observation is characterized by a given value of each of the three variables x1, x2 and x3. By assuming that the variables x1, x2 and x3 are measured as deviations from their means, the mean of each xi will lie in the origin of the coordinate axes. The swarm of N observation points thus obtained forms a three-dimensional scatter diagram. This scatter diagram may exhibit any one of the following cluster types.

A. The disorganized swarm. In this case the observations are distributed in a disorganized way in space with no restrictions upon their positions, or no loss of freedom. The number of degrees of freedom in the population will be called its rank, or its unfolding capacity, and will be designated by ρ, so that in the present case ρ = n = 3. The rank is now equal to the dimensionality.
The case is illustrated by a swarm of bees distributed widely in a given space without concentration at any point, or again by the raindrops as they fall through space in a storm.

B. The plane (the "pancake"). If the swarm is pressed together in a given direction it assumes the shape of a "pancake." This pancake has as its ideal representation a plane passing through the origin. This plane represents the systematic variations, while the deviations from the plane, the thickness of the pancake, represent the accidental variations. It may now be said that the observations come close to lying in a plane, the accidental variations producing a thin slab in place of a plane. From the point of view of systematic (as distinguished from accidental) variations, the population has now been subjected to the loss of one degree of freedom. The representative point may move freely within the slab but cannot go beyond it except by "accident." It may be said that the swarm has received a one-dimensional flattening, or, again, that it has been simply flattened. The population may also be called simply collinear in this instance. If the number of degrees of freedom lost by the population be designated by p, the result now is: p = 1, ρ = 2, p + ρ = n. The rank is now exactly one less than the dimensionality. The case may be illustrated perhaps by a swarm of bees that has alighted on a board which passes through the origin of the coordinate axes, the bees forming a cluster on this board.

C. The rod. If the population discussed under B be pressed together along a line in the plane, so that the swarm of scatter points comes to cluster around a rod through the origin, the observations lose one further degree of freedom. This case is referred to as multiply flattened (more precisely, two-fold flattened), or multiply collinear. The rank of the population is now ρ = 1; the flattening is p = 2. The sum of the rank and the flattening is, of course, still equal to 3.

D. The point. Subject the rod to one further degree of flattening, the loss of its one remaining degree of freedom, and the observations become concentrated around a point, or in a tiny ball, at the origin. The meaning of this is that from the point of view of systematic variations the observed population does not show any variation at all. Whatever variation there has been is in the nature of accidental errors of observation. Now the flattening is p = 3. That is, the flattening is equal to the dimensionality. All degrees of freedom are lost and the unfolding capacity, ρ, is equal to zero.

These four cases represent the main types of relationships between the variables x1, x2 and x3 of the population when the systematic variation in the x's is linear in character.
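The four types are easy to reproduce numerically. The following sketch (an illustration; the constructions, the noise level, and the 0.2 cutoff are arbitrary choices, not taken from the paper) generates one swarm of each type and reads the rank ρ off the singular values of the centered swarm:

```python
# Sketch: generate one swarm of each cluster type and read the rank
# (unfolding capacity) rho off the singular values of the centered swarm.
# The 0.2 cutoff separating systematic from accidental singular values
# is an arbitrary choice for this demonstration.
import numpy as np

rng = np.random.default_rng(0)
N = 500
noise = lambda: 0.05 * rng.standard_normal(N)     # accidental variation

t, u = rng.standard_normal(N), rng.standard_normal(N)
swarms = {
    "A: disorganized swarm": np.column_stack([rng.standard_normal(N) for _ in range(3)]),
    "B: pancake (plane)":    np.column_stack([t, u, t + 2 * u + noise()]),
    "C: rod (line)":         np.column_stack([t, 2 * t + noise(), -t + noise()]),
    "D: tiny ball (point)":  np.column_stack([noise(), noise(), noise()]),
}

for name, X in swarms.items():
    X = X - X.mean(axis=0)                        # deviations from means
    sv = np.linalg.svd(X / np.sqrt(N), compute_uv=False)
    rho = int(np.sum(sv > 0.2))                   # systematic degrees of freedom
    print(f"{name}: rank rho = {rho}, flattening p = {3 - rho}")
```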
Proceeding now to a more detailed discussion of these main types, it is necessary to distinguish between certain subtypes. In case (A) there is no systematic relationship existing between the variables. The selection of an arbitrary value of x1 and an arbitrary value of x2 leads to no particular expectation with respect to x3. Geometrically, if a line x1 = constant = a1, x2 = constant = a2 is drawn in the coordinate space (x1, x2, x3), there will be no concentration of the observations around any particular value of x3 on this line. And similarly for comparisons of x2 with x1 and x3, etc.

Case (B) is not so simple. The slab, or plane, may assume various positions so long as it passes through the origin (this requirement being, of course, a matter only of the selection of the origin of the x's). The following subcases are to be distinguished:

B 1. The plane may contain none of the coordinate axes. As it passes through the origin it makes an angle not equal to zero with each of the three coordinate axes.

B 2. The plane may contain one axis, say the x3 axis. It will then be perpendicular to the coordinate plane (x1, x2), and will appear as a door hinged to the x3 axis. It might, of course, equally well have been hinged to any other of the coordinate axes. This is a very important case for what follows.

B 3. The plane of the observations may contain two of the coordinate axes, say x2 and x3. It now coincides with the (x2, x3) coordinate plane and the variable x1 shows no systematic variation, its variation being only within the thickness of the slab; σ1 = 0 approximately, that is, σ1 would have been zero if it were not for the accidental variations.

In the case (B) the plane can, of course, never contain all three axes, so that (B 1-2-3) cover all possible cases.

The rod through the origin, i.e., the main type (C), likewise presents three sub-types.

C 1. The rod does not lie in any of the coordinate planes.

C 2. It lies in one coordinate plane, say (x2, x3); then σ1 = 0, approximately.

C 3. It lies in two of the coordinate planes, that is, it coincides with one axis. For example, lying in the coordinate planes (x1, x3) and (x2, x3), it coincides with the x3 axis and σ1 = σ2 = 0, approximately.

The case (D) involves only one situation, the observations lying within a very limited distance (very tiny ball) around the origin. Here σ1, σ2 and σ3 are all approximately equal to zero.

The term "approximately" in the above analysis is used to indicate that the parameters in question, because of the accidental variations, may deviate somewhat from their systematic values. It is because of the accidental variations that the points lie, in one case, in a slab rather than in a plane; again, around a rod rather than on a perfect line; and finally in a tiny ball rather than rigorously in a point. For brevity hereafter the term "approximately" will as a rule be omitted in the discussion of the various cases.

2. ALGEBRAIC INTERPRETATION OF CLUSTER TYPES IN THREE VARIABLES

The cluster types that have been discussed are basic to an understanding of the nature of the systematic relationships between the variables x1, x2 and x3, and for interpreting the linear regression of any of the variables upon one of, or all, the others. This will be recognized when the algebraic expressions for the various cluster types are brought to mind. We proceed now to discuss the various types in algebraic terms, that is, in terms of the nature of the linear relationships that exist between the variables.

In case (A), the disorganized swarm, or the raindrops, there is complete lack of system in the spatial distribution of the variables, and a regression equation between them has no meaning. The only thing to do in this case is to leave the data alone. This case, therefore, may be dismissed from further consideration once a criterion has been established by which the lack of systematic organization can be recognized.

In case (B), the plane, where the observations have lost one degree of freedom, there exists one and only one systematic relationship between the variables, of the form:

(2.1)  a1x1 + a2x2 + a3x3 = 0.

But this relationship may or may not actually contain all three variables. The case where it does contain them all is the case where the plane contains none of the axes (Case B 1). Here the coefficients a1, a2 and a3 are all different from zero. Then the equation (2.1) may be solved for any xi in terms of the other x's.
That is to say, equation (2.1) may be written in any one of the following three ways:

(2.2)  x1 = a12.3 x2 + a13.2 x3
       x2 = a21.3 x1 + a23.1 x3
       x3 = a31.2 x1 + a32.1 x2

The notation a12.3, etc., is here used instead of the usual b12.3 for the regression coefficients in order to indicate that it is here only a question of solving the equation (2.1) in three different ways. The notation b12.3, etc., for the regression coefficients involves something more than merely different ways of writing the same equation. It refers to different statistical procedures for determining the coefficients, namely different directions in which to make the least-square minimization.

In the case (B 1), where it is possible to express each variable in terms of the others, the variables are said to form a closed set.

Case (B 2) is the situation where the observational plane contains the x3 axis. The set is still collinear but is no longer a closed set; x3 has now become a superfluous variable. It has no place in the regression. This can be seen easily in the geometric figure. The plane of the observations is perpendicular to the (x1, x2) coordinate plane and intersects the latter in a line through the origin. This line is called the trace of the regression plane in the (x1, x2) plane. This trace is evidently a locus of points with (x1, x2) coordinates. If an arbitrary (x1, x2) point be selected, then one of the following two things will happen. Either this (x1, x2) point falls on the trace, and then any magnitude of x3 may correspond to the selected (x1, x2) point; or the (x1, x2) point falls outside the trace, in which case no x3 magnitude will correspond to it. It therefore has no meaning now to express x3 in terms of x1 and x2. That which does have a meaning is to express a relation between x1 and x2. Selection of any value of x1 therefore immediately specifies a corresponding value of x2 as defined by the trace, but does not specify any particular value of x3. Inversely, selection of a particular value of x3 does not locate a particular value of either x1 or x2. The variable x3 is not a partner in the systematic relationship. This is equivalent to saying that in the equation (2.1) a3 = 0, so that the relationship between the variables is now of the form:

a1x1 + a2x2 = 0

where a1 and a2 are not equal to zero. In other words, x1 can be expressed in terms of x2, thus: x1 = a12 x2; or, inversely, x2 = a21 x1. The relationship x3 = a31.2 x1 + a32.1 x2 does not exist in the present case; x1 and x2 taken by themselves form a closed set, and x3 is superfluous.

Case (B 3), where the regression plane coincides with one of the coordinate planes, also represents a single relationship of the form (2.1) between the variables. But in this relationship there are now two coefficients that are zero. If the plane lies in the (x2, x3) coordinate plane, a2 = a3 = 0. The relationship is then represented by a1x1 = 0, a1 ≠ 0, σ1 = 0. That is, the observations are scattered freely in the (x2, x3) coordinate plane, there being no systematic relationship between them in this plane. And x1 is an ineffective variable. So far as the systematic variation is concerned, x1 = 0 for all values of x2 and x3.

Where the observations are clustered in a rod, there exist two independent relationships between the variables. There even exist three relationships, represented by the three equations:

(2.3)  a1x1 + a2x2 = 0
       a'1x1 + a'3x3 = 0
       a''2x2 + a''3x3 = 0

But only two of these three relations will be independent. By knowing any two of them, the third can be derived.
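The derivation is a one-line elimination (a reconstruction of the obvious step, assuming a1 and a'1 are different from zero): the first two relations of (2.3) give x1 = -(a2/a1)x2 and x1 = -(a'3/a'1)x3; equating the two expressions for x1 gives a'1a2 x2 - a1a'3 x3 = 0, which is the third relation with a''2 = a'1a2 and a''3 = -a1a'3.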
Any set of two variables taken by themselves now constitutes a two-dimensional collinear set. In the case (C 1), each of the three two-dimensional collinear sets (2.3) is closed. That is to say, all the coefficients in (2.3) are ≠ 0. Any of the variables may now be expressed in terms of any one of the other variables.

In the case (C 2) the rod lies in one, and only one, of the coordinate planes, say in (x2, x3). There is now one closed set of two variables, represented by the equation a2x2 + a3x3 = 0, where a2 and a3 are both ≠ 0. Furthermore, there is one ineffective set of one variable, a1x1 = 0, a1 ≠ 0, σ1 = 0. But x1 cannot be expressed in terms either of x2 or of x3 (except in the trivial form x1 = 0·x2 + 0·x3).

The case (C 3), where the rod is lying in the x3 axis, involves two ineffective sets of one variable each, namely:

(2.4)  a1x1 = 0,  a1 ≠ 0,  σ1 = 0
       a2x2 = 0,  a2 ≠ 0,  σ2 = 0

that is, there is variation along the x3 axis but no variation along the x1 and x2 axes.

In the case of the point (or tiny ball), (D), there exist three independent relations between the variables. There is no freedom left now in the three variables, all three being ineffective. That is to say, the three independent relations are:

(2.5)  a1x1 = 0,  a1 ≠ 0,  σ1 = 0
       a2x2 = 0,  a2 ≠ 0,  σ2 = 0
       a3x3 = 0,  a3 ≠ 0,  σ3 = 0

3. ALGEBRAIC INTERPRETATION OF CLUSTER TYPES IN n VARIABLES

A set of n variables (x1, x2 ... xn) is called a linearly dependent set, or collinear, when there exists at least one relation of the form:

(3.1)  a1x1 + a2x2 + ... + anxn = 0

where the coefficients ai are not all equal to zero. A collinear set is said to be flattened exactly p times, or to be p-fold flattened, when there exist exactly p independent relationships of the form (3.1) between the n variables. A necessary and sufficient condition for a set to be p-fold flattened is that there exist at least one ρ-dimensional subset which is non-collinear, where ρ = n - p, while all (ρ+1)- and higher-dimensional subsets are collinear. In this case there exist exactly p independent regressions, each involving not more than (ρ+1) variables, and further being such that the set of these (ρ+1) variables is a simply collinear set.

When p = 0 the set of n variables is not flattened; there exists no systematic linear relationship between the variables and no regression equation is possible.

When p = 1 the set is once flattened, or is simply collinear. The rank, ρ, of the set is now (n-1) and there exists one (n-1)-dimensional regression plane. This regression plane may or may not contain all the variables. The plane will not contain those variables the coordinate axes of which in n-dimensional space lie in the regression plane, and the corresponding coefficients in the regression equation (3.1) will be equal to zero. These variables are superfluous variables in the regression system. If the regression coefficients ai (i = 1, 2, ... n) are all different from zero, all of the n variables are present in the regression equation. The set is then a closed set. The (n-1)-dimensional regression plane now contains none of the coordinate axes and each xi can be expressed linearly in terms of the others. In this case there cannot exist any relationship of the form (3.1) involving fewer than n variables.

When p > 1 the set is multiply collinear, or multiply flattened, and there exists a regression manifold of less than (n-1) dimensions.
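A small numerical sketch of this condition for n = 4 (the construction, the tolerance, and the use of the numerical rank of the data matrix as the collinearity test are illustrative choices, not the paper's): impose the two independent relations x1 + x2 - x3 = 0 and x2 - x4 = 0, so that p = 2 and ρ = n - p = 2.

```python
# Sketch for n = 4: with two imposed relations, every 3-dimensional
# subset should be collinear while at least one 2-dimensional subset
# is not, so the set is exactly 2-fold flattened.
import numpy as np
from itertools import combinations

rng = np.random.default_rng(1)
t, u = rng.standard_normal(200), rng.standard_normal(200)
X = np.column_stack([t, u, t + u, u])             # columns x1, x2, x3, x4

def is_collinear(cols):
    """A subset is collinear when its data matrix is rank-deficient."""
    sub = X[:, list(cols)]
    return np.linalg.matrix_rank(sub, tol=1e-8) < len(cols)

print(all(is_collinear(c) for c in combinations(range(4), 3)))   # True
print([c for c in combinations(range(4), 2) if not is_collinear(c)])
# several pairs, e.g. (0, 1), are non-collinear: p = 2 and rho = 2
```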
4. STATISTICAL CRITERIA FOR CLUSTER TYPES

As statistical criteria for these several types of clustering there are here introduced the coefficient of collective alienation and its correlative, the coefficient of collective correlation. Proceeding now to the exact definition of the collective alienation and correlation coefficients, let

       | r11  r12 ... r1n |
(R) =  | r21  r22 ... r2n |
       | ................ |
       | rn1  rn2 ... rnn |

be the correlation matrix for the n variables x1, x2 ... xn; rij is the simple (total) correlation coefficient between xi and xj. The determinant value of this matrix is denoted by R = R(12 ... n). Further, Rij = Rij(12 ... n) denotes the element in the i-th row and the j-th column of the adjoint correlation matrix (R). Each Rij is determined by calculating the determinant value of (R) after crossing out the i-th row and the j-th column and multiplying by (-1)^(i+j).

We shall not here enter into a theoretical discussion as to why the collective alienation and correlation coefficients offer plausible criteria of how far a given set of statistical variables deviates from being linearly dependent. This interpretation is discussed at length in the paper by Ragnar Frisch already referred to.¹

   ¹ In this paper the term coefficient of scatter was used instead of coefficient of alienation.

Here mention will be made only of the formal hierarchic order that exists between the partial, the multiple and the collective coefficients. Consider the set of variables x1, x2 ... xn, and denote the classical partial correlation and alienation coefficients by rij(12 ... n) and sij(12 ... n). For the sake of symmetry, all the subscripts 1, 2 ... n are written as secondary subscripts, without omitting i and j. In order not to give rise to confusion with the usual notation, where i and j are omitted from the list of secondary subscripts, the secondary subscripts are here enclosed in a parenthesis. Similarly the classical multiple correlation and alienation coefficients are designated by ri(12 ... n) and si(12 ... n) respectively. As is well known, these parameters satisfy the equation:

(4.1)  r² + s² = 1

where the same set of subscripts is attached to r and to s. The collective correlation and alienation coefficients are also defined so as to satisfy (4.1). But while the partial coefficients depend on two primary subscripts and the multiple coefficients depend on one primary subscript, the collective coefficients have no primary subscripts at all. They are defined with regard to the set of variables as such. More precisely, they are defined thus:

(4.2)  s = s(12 ... n) = √R(12 ... n) = coefficient of collective alienation in the set (x1, x2 ... xn)
       r = r(12 ... n) = √(1 - R(12 ... n)) = coefficient of collective correlation in the set (x1, x2 ... xn)
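Definition (4.2) translates directly into a few lines of code (a sketch; the function name is an invented convenience, and the determinant is clipped into [0, 1] only to guard against rounding):

```python
# Collective alienation s = sqrt(R) and collective correlation
# r = sqrt(1 - R), where R is the determinant of the correlation matrix.
import numpy as np

def collective_coefficients(data):
    """data: N x n array of observations, one column per variable."""
    R = np.corrcoef(data, rowvar=False)   # the correlation matrix (R)
    det = np.linalg.det(R)                # the determinant R(12...n)
    det = min(max(det, 0.0), 1.0)         # guard against rounding error
    return np.sqrt(det), np.sqrt(1.0 - det)   # (s, r)
```

For n = 2 the determinant is 1 - r12², so the function reduces to the simple alienation coefficient, as the next paragraph notes.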
It is easy to prove that these coefficients satisfy the relations:

(4.3)  0 ≤ s ≤ 1,  0 ≤ r ≤ 1.

If n = 2 the collective alienation coefficient reduces to the simple (total) alienation coefficient, and the collective correlation coefficient (apart from its sign) reduces to the simple (total) correlation coefficient. From the definitions here given it follows that s² and r² are polynomials in the simple correlation coefficients rij; s² and r², therefore, never can become of the indeterminate form 0/0 (unless one or more of the simple correlation coefficients become of this form). This is a fundamental property of the collective alienation and correlation coefficients, which distinguishes these parameters from the corresponding partial and multiple coefficients. The fact that the partial and multiple coefficients may become of the indeterminate form 0/0 is easily seen from the well-known formulae:

(4.4)  r²i(12 ... n) = 1 - R/Rii            (coefficient of multiple correlation)

(4.5)  rij(12 ... n) = -Rij/√(Rii Rjj)      (coefficient of partial correlation)

It is indeed easy to prove that

(4.6)  R²ij ≤ Rii Rjj

and

(4.7)  0 ≤ R ≤ Rii ≤ 1,

so that if Rii = 0, (4.4) and (4.5) must give rise to indeterminate expressions of the type 0/0.

The general property of the collective alienation coefficient which makes this parameter a useful tool in studying cluster types is the following. The collective alienation is equal to zero when, and only when, there exists an exact linear dependency in the set for which the collective alienation is computed; and furthermore this coefficient increases as the swarm of scatter points takes on a shape that deviates more and more from the shape where a linear dependency exists. When the collective alienation has become equal to 1 (which is its largest possible value) the swarm of scatter points has reached a shape which may be characterized as perfect unfolding. The variables have now become orthogonal (uncorrelated); that is to say, all the simple correlation coefficients rij are equal to zero. The collective alienation coefficient being equal to unity is the necessary and sufficient condition for orthogonality in the above defined sense.

The three variables x1, x2 and x3 may be taken to illustrate the use of the collective alienation as a tool in determining cluster types. The various cases to be discussed have their ideal pattern in the elementary, well-known algebraic propositions regarding linear forms. What is done here is primarily to translate these propositions from the language of perfect linear dependency to the language of "nearly" linear dependency.

A. The disorganized swarm is characterized by s(123) being near to unity.

B. The plane. One flattening; p = 1, ρ = 2, simply collinear. The criterion for this case is that s(123) is near to zero, and furthermore that at least one of the three magnitudes

s(23) = √R11(123),  s(13) = √R22(123),  s(12) = √R33(123)

is significantly different from zero.

B 1. Plane contains no coordinate axis. Each of the three magnitudes s(23), s(13) and s(12) is significantly different from zero. In this case (x1, x2, x3) form a closed set.

B 2. Plane contains one coordinate axis (x3). Now s(12) = √R33(123) is close to zero while s(23) and s(13) are significantly different from zero. In this case the two variables (x1, x2) taken by themselves form a closed set and x3 is a superfluous variable.

B 3. Plane contains two coordinate axes, (x2, x3). Here s(12) and s(13) are both close to zero, while s(23) is significantly different from zero. In this case x2 and x3 form a set of disorganized variables while x1 is ineffective.

C. The rod. Two-dimensional flattening; p = 2, ρ = 1; multiply collinear. The criterion for this case is that s(123) shall be near to zero and, at the same time, that s(12), s(13) and s(23) shall also be close to zero. There now exist two independent relationships between the variables.¹ These two relationships may be written in different ways. In particular they may by elimination be written in such a way that each of them contains at most two variables. If it be assumed that all the variables are effective, then each of the two relations thus obtained must contain exactly two variables. In this case any set of two variables constitutes a closed two-dimensional set.

   ¹ Strictly speaking, the criteria here considered show only that there exist at least two linear relationships. If it be assumed that not all the three variables are ineffective (i.e., that we do not have case D), then the criteria considered show that there are exactly two independent relationships.
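These criteria can be mechanized as follows (a sketch; the fixed cutoff near_zero stands in for a genuine significance test, and, as the next paragraph explains, the sub-types of C and the case D cannot be seen from the correlations alone):

```python
# Classify a three-variable swarm by the collective alienation criteria.
import numpy as np

def s_of(data, cols):
    """Collective alienation of the subset of columns cols."""
    R = np.corrcoef(data[:, cols], rowvar=False)
    return np.sqrt(max(np.linalg.det(R), 0.0))

def classify(data, near_zero=0.1):
    s123 = s_of(data, [0, 1, 2])
    pair = [s_of(data, [0, 1]), s_of(data, [0, 2]), s_of(data, [1, 2])]
    if s123 > near_zero:
        return "A: disorganized swarm (no linear structure to exploit)"
    if max(pair) > near_zero:
        return "B: plane, simply collinear (p = 1, rho = 2)"
    return "C: rod, multiply collinear (p = 2, rho = 1)"
```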
Criteria for the sub-types (C 2) and (C 3) cannot be discussed in terms of the collective alienation coefficients only. Or, more precisely expressed: the very fact of computing the collective alienation coefficients (which are based on the simple correlation coefficients) involves the assumption that all the variables are effective. But this is also the only assumption made in using the collective alienation coefficients. If none of the variables are ineffective, all the simple (total) correlation coefficients are determinative.

The situation for three variables is now easily generalized to the case of n variables. If a great number of variables (x1, x2 ... xn) are observed, and criteria for cluster types are wanted, compute first the collective alienation coefficient s(12 ... n) for the whole set. If this coefficient is not close to zero, any attempt at studying the variables by means of linear relationships should be abandoned.

If s(12 ... n) is close to zero, consider all the (n-1)-dimensional subsets (23 ... n), (13 ... n) ... (12 ... (n-1)) that can be formed by leaving out one variable at a time. There are in all n such subsets. Compute the collective alienation coefficient for each such subset. Such a collective alienation coefficient we call an (n-1)-dimensional collective alienation coefficient. If there exists at least one such (n-1)-dimensional coefficient that is significantly different from zero, the set is simply collinear; that is, there exists exactly one linear relation of the form

(4.8)  a1x1 + a2x2 + ... + anxn = 0.

This relation should be looked upon as actually containing only those variables xk that are such that the collective alienation obtained by leaving out xk, namely s(12 ...)k(... n),¹ is significantly different from zero.

   ¹ The inverse parenthesis ) ( is used to denote "exclusion of."

If all the (n-1)-dimensional collective alienation coefficients s(12 ...)k(... n) (k = 1, 2 ... n) are close to zero, the set is at least two-dimensionally flattened; that is, there exist at least two independent relations of the form (4.8). In order to find out if there exist exactly two, or possibly even more than two, such relations, consider all the (n-2)-dimensional subsets that can be formed by leaving out in all possible ways two of the variables, and compute the collective alienation coefficient for each such subset. Such a coefficient will be called an (n-2)-dimensional collective alienation coefficient. If there exists at least one such (n-2)-dimensional coefficient that is significantly different from zero, then the flattening is exactly two; that is, there exist exactly two independent relations of the form (4.8).

If all the (n-2)-dimensional collective alienation coefficients are close to zero, consider the (n-3)-dimensional coefficients. If all these should also be close to zero, consider the (n-4)-dimensional coefficients, etc. Suppose it be necessary to continue to the ρ-dimensional coefficients before a level is reached where at least one of the collective alienation coefficients is significantly different from zero. That is to say, all the (ρ+1)- and higher-dimensional collective alienation coefficients are close to zero, while there exists at least one ρ-dimensional coefficient that is significantly different from zero. In this case the rank (i.e., the unfolding capacity) of the set is ρ, and the flattening is p = n - ρ. There now exist exactly p independent systematic relations of the form (4.8).
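The scan just described is a descending loop over subset dimensions. A sketch (the function names and the hard cutoff are invented conveniences; the paper would test significance against sampling error instead):

```python
# Determine the rank rho by scanning the subset collective alienation
# coefficients from dimension n downward.
import numpy as np
from itertools import combinations

def subset_alienation(R, cols):
    """Collective alienation of the variables indexed by cols, taken
    from the full correlation matrix R."""
    sub = R[np.ix_(cols, cols)]
    return np.sqrt(max(np.linalg.det(sub), 0.0))

def unfolding_capacity(R, near_zero=0.1):
    n = R.shape[0]
    for k in range(n, 1, -1):                      # k = n, n-1, ..., 2
        best = max(subset_alienation(R, list(c))
                   for c in combinations(range(n), k))
        if best > near_zero:                       # a non-collinear k-set exists
            return k                               # rank rho; flattening p = n - k
    return 1    # single variables always pass this test; detecting the
                # point case (D) requires the standard deviations instead
```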
By combination and elimination, using these p relations, we may arrive at many sorts of linear relations between the variables. There is in particular one set of relations that is interesting: We may select a certain ρ-dimensional subset which is not a collinear set (at least one such set exists by the very definition of ρ). And then we may express each of the other p variables linearly in terms of the selected ρ-dimensional subset.

5. MEANINGLESS RESULTS WHEN LINEAR DEPENDENCIES EXIST

Among the various possible cluster types discussed in the preceding sections there is only one very particular type in which all the orthodox correlation and regression parameters have a sense, namely, the case of a set which is not only collinear but also closed. If the set is not closed, a great number of the orthodox correlation parameters lose their meaning. And if the set becomes multiply collinear, each of the orthodox correlation and regression parameters loses its meaning.

Consider first the set which is simply collinear but is not a closed set. This is the case where s(12 ... n) is close to zero, and one or more, but not all, of the (n-1)-dimensional coefficients s(12 ...)k(... n) are also close to zero. Those variables xk for which s(12 ...)k(... n) is close to zero should simply be looked upon as superfluous variables from the point of view of linear regressions. Under no circumstances must the coefficients of (4.8) be determined by computing the regression coefficients bkj.12...n of xk on the other variables. In fact the collective alienation coefficient s(12 ...)k(... n) occurs as a denominator in bkj.12...n. Determining bkj.12...n would, therefore, mean forcing into the denominator a magnitude whose deviation from zero is non-significant. The system of regression coefficients thus obtained would consequently be of the form: accidental error divided by accidental error, and would have no sense. This situation is illustrated by the three-dimensional case (B 2), that is, the case where there exists exactly one systematic regression plane in (x1, x2, x3), but where this plane contains the x3 axis. In this case it would obviously have no sense to express x3 as a linear combination of x1 and x2.

In this case neither the multiple coefficient of correlation of xk on the set of the other variables nor the partial coefficient of correlation between xk and any particular one of the other variables should be computed. All of these parameters will now be without a meaning. In the rigorous case they depend on an expression of the indeterminate form 0/0, and in the statistical case their values are determined by the ratio between two small quantities due to accidental errors of observation. It is even easy to construct cases where the limiting process s(12 ...)k(... n) → 0 is carried out in such a way that the value of the partial r may have any value between plus one and minus one (these extreme values included). Or the value of the multiple r may be made to assume any value between zero and one.

And that is not all. The standard error of these correlation parameters will also become meaningless for a similar reason. Neither the multiple nor the partial correlation parameters considered, nor the standard error of these, will consequently furnish the slightest indication of the cluster type or of the fact that xk is a superfluous variable.
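The breakdown is easily illustrated numerically for case (B 2) (an invented construction: the systematic relation is x1 - 2x2 = 0, and x3 is free, hence superfluous):

```python
# Case (B 2): regressing the superfluous x3 on the nearly dependent pair
# (x1, x2) gives coefficients of the form "accidental error divided by
# accidental error"; they drift from sample to sample, while the
# near-zero s(12) is the stable warning signal.
import numpy as np

rng = np.random.default_rng(2)
for trial in range(3):
    t = rng.standard_normal(300)
    x1 = t + 0.01 * rng.standard_normal(300)
    x2 = 0.5 * t + 0.01 * rng.standard_normal(300)    # x1 - 2*x2 ~ 0
    x3 = rng.standard_normal(300)                     # free, superfluous
    A = np.column_stack([x1, x2])
    b, *_ = np.linalg.lstsq(A - A.mean(axis=0), x3 - x3.mean(), rcond=None)
    s12 = np.sqrt(1 - np.corrcoef(x1, x2)[0, 1] ** 2)
    print(f"trial {trial}: b(x3 on x1, x2) = {np.round(b, 2)}, s(12) = {s12:.4f}")
```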
But in the collective alienation coefficients we have a system of criteria that can never be subject to this kind of meaninglessness, because the collective alienation coefficients, as already mentioned, are polynomials in the simple correlation coefficients. An inspection of the (n-1)-dimensional collective alienation coefficients would immediately tell which of the variables should be ousted from the regression system.

Now consider the case where none of the (n-1)-dimensional collective alienation coefficients is significantly different from zero, that is, the case where the set is multiply collinear. There now exist at least two independent regressions of the form (4.8). The situation is now such that whatever variable in the set (x1, x2 ... xn) be selected, the result would always be meaningless if the regression were determined of this variable on the others. Still worse, the standard errors of the regression coefficients computed by the orthodox formulae would also lose their meaning, so that there would be no warning signal telling us to keep away from this sort of regression.

Looking back on the various cases discussed, it is clear that the trouble always comes in those cases where there exists (rigorously or approximately) a linear dependency between those variables that are written in the right member of the orthodox regression equation, that is, between those variables that are considered as independent in the least-square fitting procedure.

In order to get back to a basis where the regressions have a meaning, it is necessary to find those subsets that are simply collinear, and then to treat each of these subsets separately by reducing it to a closed set and determining the linear regression in it. A general scheme for performing this analysis by means of the collective alienation coefficients is the following: Determine the rank ρ as explained above. Then select that ρ-dimensional subset for which the ρ-dimensional collective alienation coefficient is largest. Call this set the basis set. This is the ρ-dimensional subset that comes closest to being an uncorrelated (orthogonal) set. Then form p subsets by combining the basis set with each of the remaining variables. Each of these (ρ+1)-dimensional sets may be considered as simply collinear and treated separately. That is to say, each such (ρ+1)-dimensional set should be reduced to a closed set by omitting those variables xk that are superfluous in the set according to the collective alienation coefficient criterion, as applied to this subset.

The above analysis may be illustrated by an artificially constructed problem. Ten observations were selected on four variables, x1, x2, x3 and x4, the values being written down arbitrarily except for the requirement that the sum of the variables X2, X3 and X4 for each observation should equal one hundred. (When measured as deviations from their means, therefore, x2 + x3 + x4 = 0.) For convenience the set of ten values for each variable was made to total to even hundreds, so that means and deviations from means would not involve decimal values. Using capital X to denote the absolute value of a variable and small x to denote the corresponding deviation from the mean, the data, as above defined, are:

    X1    X2    X3    X4
    25    18    29    53
    14    27    43    30
    37    31    51    18
    17    19    34    47
    24    12    46    42
    29    15    35    50
    41    21    28    51
    39    28    41    31
    52    17    57    26
    22    12    36    52
   ----  ----  ----  ----
   300   200   400   400
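The example can be checked by machine; the following sketch reproduces the correlation coefficients and the s-values quoted in the next paragraph:

```python
# Verify the worked example: ten observations on X1 ... X4.
import numpy as np

X = np.array([
    [25, 18, 29, 53], [14, 27, 43, 30], [37, 31, 51, 18], [17, 19, 34, 47],
    [24, 12, 46, 42], [29, 15, 35, 50], [41, 21, 28, 51], [39, 28, 41, 31],
    [52, 17, 57, 26], [22, 12, 36, 52]], dtype=float)
assert np.all(X[:, 1:].sum(axis=1) == 100)    # X2 + X3 + X4 = 100 throughout

R = np.corrcoef(X, rowvar=False)
print(np.round(R, 6))                         # r12 = .169678 ... r34 = -.857719

def s(cols):
    """Collective alienation of a subset of the four variables."""
    return np.sqrt(max(np.linalg.det(R[np.ix_(cols, cols)]), 0.0))

print(round(s([1, 2, 3]), 6))                             # s(234) = 0
print([round(s(c), 2) for c in ([1, 2], [1, 3], [2, 3])])   # [.98, .72, .51]
```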
UsingcapitalX to denotetheabsolutevalueofa variableand smallx deviationfromthe mean,the data, as to denotethe corresponding are: above defined, Xl 25 14 37 17 24 29 41 X2 18 27 31 19 12 15 21 X3 29 43 51 34 46 35 28 X4 53 30 18 47 42 50 51 15] StatisticalCorrelationand theTheoryof ClusterTypes 39 52 22 28 17 12 41 57 36 31 26 52 300 200 400 400 389 The correlationcoefficients forthese observationswhichenterinto the matrix(R(1234)) are: correlation r13= +.408675 r12=-+.169678 r23=+.218931 r24=-.689427 ril = r14= - .392790 r34=-.857719 r22= r33= r44= 1.00 That x2,x3and X4 forma closed set is shownby the values of the coefficients S(234), S(23), S(24) and 8(34) for 5(234) = 0, S(23) = .98, S(24) = .72 and S(34) = .51. None ofthe last threequantitiesare close to zero,hencethe conclusion that (X2, X3, X4) forma closedset. The partialcorrelation coefficients in thisset are all equal to minusone and multiplecorrelationcoefficients are all equal to plus one. This is easily checked,as has been done, by actual computation. So far everythingis all right. When, however, variable xi is included in the systemtrouble arises. If one tries to computethe partialcorrelations,r12.34, or r14.23 it is foundthat they r13.24 are all of the form0 and similarresults hold for the multiplecoeffi0 cient r1.234 and forthe regressioncoefficients and b14.23. b12.34, b13.24 Mr. H. I. Richards, writingin the March, 1931, number of this JOURNAL on "Analysis of the Spurious Effectof High Intercorrelation of IndependentVariables on Regressionand CorrelationCoefficients," of multiplecorrelationand regression says "that accurate coefficients can be obtained whenthe independentvariables are perfectlyintercorrelated,if thereare no errorsin the calculations." This is fundamentally wrongand is due to certainslips in Mr. Richards' mathematics. He quotes the formulaeformultipleand partial correlation,as given by Yule, as: . (-21n.23 (7.1) 1- 21.234. . . n=1(-r 212)(1 - '13.2)(7'214.23) .1) n(7.2) r14.23 = l- r1.23r4.23 r14 r2l23r423 The latter is not correct,the correctformulabeing: (7.3) r14.23= (1-r21243)J2(1-r224.3)H However, this is of minorimportancein this connection. Mr. Rich- AmericanStatisticalAssociation 390 [16 nature and independentof ards' fundamentalerroris of a different that ifx2,x3and X4 He maintains whetherwe startfrom(7.2) or (7.3). that theirsum is equal by the fact are perfectlycorrelated,forinstance, wouldbe the by (7.1) R1.234 result of computing to one hundred,the or one ofthem are included and variables X2, all three X3 X4 same whether omitted;forexample,X4. This must be so, he claims,because A. Factors (1- r212)and (1- r213.2)are the same in each case. B. Factor (1-r214.23)= 1 wheneverr4.23= 1. His attemptto provepropositionB is as follows: 1. Factor (1 -r24.23)in the denominatorof (7.2) equals zero whenever thereis perfectcorrelationbetweenX2,X3 and X4. This makes the denominatorof (7.2) equal zero. 2. The numeratorin (7.2) also equals zero since: (a) r4.23= 1 (b) r1.23=rl4 3. The formri4.23=? must howeverbe equal to zero since the limit of r14.23is zero when r4.23tends toward unity. Proposition(A) is obviously correct,and (B 1), whichis the same as (B 2 a) is also correctas is seen fromthe formula r4.23? = 1 R(23) since,in the case underconsideration,R(234)= 0, and it may be assumed that R(23),40 inasmuch as the case where x2 and x3 are perfectlycorrelated is of no interestin the presentconnection. But proposition(B 2 b) is not correct. 
The correlation r1.23 is the correlation between x1 and the value of the variable x1 calculated from the regression:

(7.4)  b12.3 x2 + b13.2 x3

where the coefficients are determined so as to make the correlation a maximum. On the other hand, if x2 + x3 + x4 = 0 (the x's here being deviations from means), then r14 can be looked upon as the correlation between x1 and the variable:

(7.5)  -x2 - x3.

It is quite obvious that the correlation between x1 and (7.4) need not be the same as the correlation between x1 and (7.5). The only thing that can be said is that r²14 ≤ r²1.23.

The proposition (B 3), namely that in the case considered r14.23 = 0, is not correct either. There is even a double reason for its falsity. First of all, where x2 + x3 + x4 = 0 there is no question of a limiting process at all. The value of r4.23 is not approaching as a limit the value unity, but is exactly equal to unity all the time and can equal nothing else. In the second place, even if there were a valid case for considering a limiting process at all, it is possible to show that this limiting process can be carried out in such a way as to obtain for the limiting value of r14.23 any result whatever between +1 and -1. The error which Mr. Richards makes on this point is that he lets r4.23 tend toward unity while all the other parameters involved, r14, r1.23, etc., are kept constant. This is, however, only a very special way of performing the limiting process that has as its final result a situation where x2 + x3 + x4 = 0 (or where there exists some other exact linear relation between these variables). In general, such a limiting process can be performed by varying certain observational values in x2 or x3 or x4. These observational values must be considered as the independent variables during the limiting process. If that is done, we see that the numerator and the denominator in the ratio defined by (7.2) (or by the correct formula (7.3)) are not functions of a single variable but of several; and the limiting value of a ratio whose numerator and denominator depend on more than one variable depends not only on the final situation toward which the system tends but also on the path followed in order to reach the final situation.

We can illustrate this by the case of a ratio f(x,y)/g(x,y) between two functions of two independent variables x and y. Suppose that both f and g vanish at the origin; i.e., f(0,0) = g(0,0) = 0. What is the limiting value of the ratio f(x,y)/g(x,y) when x and y tend toward zero? That will depend on the path which the representative point in (x,y) coordinates follows on its way toward the origin. For simplicity it may be assumed that f and g have continuous partial derivatives in the vicinity of the origin. The ratio considered will, therefore, in the vicinity of the origin be equal to

(7.6)  (fx dx + fy dy) / (gx dx + gy dy)

where fx, fy, gx, gy denote the partial derivatives at the origin and dx, dy denote the coordinates of the path along which the representative point tends toward the origin. It is quite obvious that in general the limiting value of (7.6) will depend on the ratio dy/dx, that is, it will depend on the direction from which the point (x,y) tends toward the origin.

What Mr. Richards has proved by his limiting process is, therefore, only that it is possible to approach a linear dependency between x2, x3 and x4 in such a particular way that the limiting value for r14.23 becomes 0. But he has by no means proved that this limiting value must be zero because there exists a linear dependency between x2, x3 and x4.
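The path-dependence can be exhibited on the worked example itself (a sketch, an invented construction: the dependency x2 + x3 + x4 = 0 is broken by perturbing X4 along two fixed directions d1, d2, and r14.23 is computed from the cofactors as in (4.5); along each path the value converges, but to different limits):

```python
# Approach the dependency x2 + x3 + x4 = 0 along two different paths
# and watch r14.23 settle on two different limiting values.
import numpy as np

base = np.array([
    [25, 18, 29, 53], [14, 27, 43, 30], [37, 31, 51, 18], [17, 19, 34, 47],
    [24, 12, 46, 42], [29, 15, 35, 50], [41, 21, 28, 51], [39, 28, 41, 31],
    [52, 17, 57, 26], [22, 12, 36, 52]], dtype=float)

def r14_23(X):
    """Partial correlation r14.23 = -C14 / sqrt(C11 C44), with Cij the
    cofactors of the 4 x 4 correlation matrix, as in (4.5)."""
    R = np.corrcoef(X, rowvar=False)
    cof = lambda i, j: (-1) ** (i + j) * np.linalg.det(
        np.delete(np.delete(R, i, axis=0), j, axis=1))
    return -cof(0, 3) / np.sqrt(cof(0, 0) * cof(3, 3))

rng = np.random.default_rng(3)
d1, d2 = rng.standard_normal(10), rng.standard_normal(10)
for eps in (1e-1, 1e-2, 1e-3):
    for name, d in (("d1", d1), ("d2", d2)):
        X = base.copy()
        X[:, 3] += eps * d            # the dependency holds only at eps = 0
        print(f"eps = {eps:g}, path {name}: r14.23 = {r14_23(X):+.4f}")
```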
In Frisch's paper, already referred to, there are given examples of limiting processes that will bring the partial correlation coefficients in three variables as close as may be desired to any magnitude between -1 and +1 (where the limiting process tends toward representing a linear dependency between two of the variables), and it would not be difficult to give similar examples with one more variable, which is the situation here discussed.

One further comment might be to the point: Mr. Richards gives some numerical computations intended to verify the theoretical proof of his contention that one will get the same value of R1.234 whether x4 is included or not. However, these numerical computations cannot contain any such "verification." Either the computations must be wrong, or he must in some point or another in the computations have made use of that fact which should be proved, namely that r14.23 = 0.