The Relationship between a Correlation Coefficient and its Associated Slope Estimates in Multiple Linear Regression

Rudy A. Gideon

Abstract. This article takes correlation coefficients as the starting point to obtain inferential results in linear regression. Under certain conditions, the population correlation coefficient and the sampling correlation coefficient can be related via a Taylor series expansion to allow inference on the coefficients in simple and multiple regression. This general method includes nonparametric correlation coefficients (NPCCs) and so gives a universal way to develop regression methods. This work is part of a correlation estimation system (CES) that uses correlation coefficients to perform many types of estimation, including time series, nonlinear, and generalized linear models, as well as estimation on individual distributions.

AMS (2000) subject classification. Primary 62J05, 62G05, 62G08.

Keywords and phrases. Correlation, rank statistics, linear regression, nonparametric, Kendall, greatest deviation correlation coefficient.

0. Preliminaries and Notation

Capital $R$ denotes any population correlation coefficient and lower case $r$ the corresponding sample correlation coefficient. $R(X,Y)$ then stands for the unknown but fixed population correlation coefficient for the bivariate random variable $(X,Y)$. Of course, if $X$ and $Y$ are independent, $R(X,Y) = 0$. When needed, subscripts are used to denote a particular correlation coefficient. For the bivariate normal and the usual population definition of correlation, $R(X,Y) = \rho$, but for GDCC and Kendall's $\tau$, $R_{gd}(X,Y) = R_\tau(X,Y) = (2/\pi)\sin^{-1}(\rho)$. In addition, $\tau$ and GDCC retain the same value over the class of bivariate $t$ distributions.

For the bivariate normal, $\rho\sigma_y/\sigma_x$ is the regression parameter of $Y$ on $X$, so the random variables $X$ and $Y - \rho(\sigma_y/\sigma_x)X$ are independent; hence $R(X, Y - \rho(\sigma_y/\sigma_x)X)$ is zero for any $R$.

For an arbitrary random variable $(X,Y)$, let $\beta$ be the regression parameter when the parameters of the distribution are unknown. So now $R(X, Y - \beta X)$ is a function of the variable $\beta$, which is sometimes written as $f(\beta)$ to emphasize this fact. This function is zero only at $\beta_0$, the true value of the regression parameter. For the specific example above with bivariate normal random variables, $\beta_0 = \rho\sigma_y/\sigma_x$.

The multivariate version, with $\beta' = (\beta_1, \beta_2, \ldots, \beta_p)$, $l' = (l_1, l_2, \ldots, l_p)$, and $X' = (X_1, X_2, \ldots, X_p)$, yields

$$R(X'l, Y - X'\beta) = \frac{E(X'l\,(Y - X'\beta))}{\sqrt{V(X'l)\,V(Y - X'\beta)}}$$

as a function of the $p$-dimensional $\beta$. (Prime denotes transpose.) Note that when $\beta_0$ is substituted for $\beta$, $R(X'l, Y - X'\beta_0) = 0$ because $X'l$ and $Y - X'\beta_0$ are assumed to be independent.

For the $X$ variable the data are written $X_{n\times p} = (x_1, x_2, \ldots, x_p)$, where each $x_i$ is a column vector of length $n$, and $y$ is the data from random variable $Y$. Then for any corresponding $R$ and $r$, with the correct $\beta_0$, $r(X_{n\times p}l, y - X_{n\times p}\beta_0)$ has a distribution centered about zero, i.e. a null distribution, because $R(X'l, Y - X'\beta_0) = 0$. So if an expression for $R(X'l, Y - X'\beta)$ is known and is a continuous function of $\beta$, it can be expanded about $\beta_0$ via a Taylor series whose first term is zero.

It is known that if continuous bivariate random variables are independent, then the asymptotic (as $n \to \infty$) null distributions for GDCC and Kendall's $\tau$ are as follows: $\sqrt{n}\,r_{gd}$ is N(0,1) and $\sqrt{n-1}\,\tau$ is N(0, 4/9), where $r_{gd}$ and $\tau$ (or $r_\tau$) are the sample versions of $R_{gd}$ and $R_\tau$, respectively.
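These null distributions are easy to check by simulation. The following minimal sketch is in R, a close relative of the S-Plus used for the computations in this paper; the sample size, number of replications, and seed are illustrative choices, not values from the paper. (GDCC is omitted here because it would need its own rank-based routine; Kendall's $\tau$ is built into R.)

```r
## Check the asymptotic null distribution of Kendall's tau: for independent
## continuous X and Y, sqrt(n - 1) * tau is approximately N(0, 4/9).
set.seed(1)
n    <- 50
reps <- 2000
tau  <- replicate(reps, cor(rnorm(n), rnorm(n), method = "kendall"))
z    <- sqrt(n - 1) * tau
c(mean = mean(z), var = var(z))   # variance should be near 4/9 = 0.444
qqnorm(z); qqline(z)              # an approximately straight plot
```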
Thus, for example, $\sqrt{n-1}\,\tau(X_{n\times p}l, y - X_{n\times p}\beta_0)$ has an asymptotic N(0, 4/9) distribution because $R_\tau(X'l, Y - X'\beta_0) = 0$. Also, $R_\tau(X'l, Y - X'\beta) = (2/\pi)\sin^{-1}\rho_{X'l,\,Y-X'\beta}$, which is a continuous function of $\beta$ for the class of bivariate $t$ distributions. Note that even though multivariate distributions are the sampling distributions, they are used only through linear combinations, so that correlation coefficients can be employed.

For tied values within the $X$ data or within the $Y$ data, this paper uses the max-min method given in Gideon and Hollister (1987), as it is more general than the often-used local averaging method: it can be applied to all rank correlation coefficients and allows computer programs to run in all circumstances. This sets the stage for the developments in this paper.

1. Introduction

In the research on GDCC it was found that the only way to perform inference on $\beta$, the regression parameter, was to relate it to the asymptotic distribution of $r_{gd}$ through a Taylor series for $R_{gd}$. The generality of the procedure was then discovered and is given in this paper, providing a unification not previously known. The result in the case of Pearson's correlation coefficient is not new; only the derivation is. The method is applied to Pearson's as a check on its validity and as the first step towards the actual new result.

Let $R(X_i, Y)$ be the population correlation coefficient for the $i$th component of the $p$-dimensional column random variable $X$ with $Y$. While any correlation coefficient (CC) can be used, full examples are given for just two, Pearson's $r_p$ and the Greatest Deviation CC (Gideon and Hollister, 1987); Kendall's Tau is also illustrated to some extent. The continuous and discrete cases are both treated, purposefully, to show the comprehensiveness of the CES. Familiarity with GDCC is not necessary to follow the arguments, but a complete description of this easy-to-use but complicated-to-explain correlation coefficient is given in the references. The key point is that a nonparametric correlation coefficient can be implemented in a cohesive fashion in multiple regression.

This work is part of a correlation estimation system (CES) using correlation coefficients as the starting point to perform many types of estimation, including time series and nonlinear and generalized linear models. Parameter estimation on individual distributions can also be done. The results open up a vast set of possibilities; enough has been done to be very optimistic about the direction and usefulness of this work.

If $\beta_0$ is the vector of population regression parameters for an assumed linear system, then the population values $R(X_i, Y - X'\beta_0)$, $i = 1, 2, \ldots, p$, are zero. For $l$ a $p$-dimensional vector of constants, $R(X'l, Y - X'\beta_0)$ is the population correlation parameter of a linear combination of the components of $X$ with $Y - X'\beta_0$. The model assumption in this work is that the error random variable, $Y - X'\beta_0$, is uncorrelated with $X'l$; that is, $R(X'l, Y - X'\beta_0) = 0$. Now form a truncated Taylor series in $\beta$ about $\beta_0$ for the function of $\beta$, $R(X'l, Y - X'\beta)$. Once this is done, the population variables in the series are replaced by their sample equivalents to obtain a further approximation, which leads to the determination of the asymptotic distribution of linear combinations of the corresponding slope estimates in a multiple linear regression. This is valid for both continuous and nonparametric sample correlation coefficients.
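To preview the argument in its simplest case, take $p = 1$, $l = 1$, and a bivariate normal $(X, Y)$ with covariance $\sigma_{xy}$; the following is a worked special case of the expansion derived in full in Section 2, not additional theory. Here

$$f(\beta) = R(X, Y - \beta X) = \frac{\sigma_{xy} - \beta\sigma_x^2}{\sigma_x\sqrt{\sigma_y^2 - 2\beta\sigma_{xy} + \beta^2\sigma_x^2}},$$

which is zero exactly at $\beta_0 = \sigma_{xy}/\sigma_x^2 = \rho\sigma_y/\sigma_x$, and differentiation at $\beta_0$ gives $f'(\beta_0) = -\sigma_x/\sigma_{res}$, where $\sigma_{res}^2 = \sigma_y^2(1-\rho^2)$ is the residual variance. The truncated series $f(\beta) \approx -(\sigma_x/\sigma_{res})(\beta - \beta_0)$, combined with the asymptotic N(0,1) distribution of $\sqrt{n}\,r_p$ under independence, already yields the familiar simple-regression result that $\hat\beta - \beta_0$ is approximately $N(0, \sigma_{res}^2/(n\sigma_x^2))$.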
While the normal distribution is typically used (because, in general, it is most familiar), asymptotic distributions for correlation coefficients have been developed for distributions with finite second moments. The process in this paper is general and could be used whenever asymptotic distributions of correlation coefficients have been developed for continuous bivariate distributions, or for finite cases in which permutation arguments apply. For example, the limiting distributions for nonparametric correlation coefficients hold over a class of distributions. It is known, for example, that GDCC has the same population value and limiting distribution for the bivariate Cauchy as for the bivariate normal, that is, over the whole class of bivariate $t$ distributions (Gideon and Hollister, 1987; Gideon, Prentice, and Pyke, 1989). So the technique is robust.

2. Derivation of the Classical Method through Correlation

As mentioned above, the method is first developed for Pearson's correlation coefficient, both to show that it is an alternate way to derive the same results as those that come from existing least squares (LS) and classical normal methods and because the results are needed in the next step. It is then extended to NPCCs using GDCC as an example. Because Kendall's Tau and GDCC have the same population form for the bivariate $t$ distributions, the result for GDCC is valid for Tau.

The full rank multivariate normal model is used, with the covariance matrix partitioned as follows into the response and regressor variables. As usual, let

$$\Sigma_{p+1,p+1} = \begin{pmatrix} \sigma_1^2 & \sigma_{12}' \\ \sigma_{12} & \Sigma_{22} \end{pmatrix},$$

where $\sigma_1^2$ is the variance of the response variate, $\sigma_{12}$ is the column vector of covariances of the response variable with the regressor variables, and $\Sigma_{22}$ is the $p$ by $p$ covariance matrix of the regressor variates. Let $Y$ be the response variate and $X$ the column vector of regressor variates. As above, let $\beta_0$ be the vector of population regression parameters. Then it is known that $\beta_0 = \Sigma_{22}^{-1}\sigma_{12}$. Let $\mu = E(Y)$ and $\mu_x = E(X)$, a $p$-dimensional vector. The regression model is $E(Y \mid X = x) = \mu + (x - \mu_x)'\beta_0$.

The parameter $R(X_i, Y)$ is the correlation that corresponds to the $i$th element of $\sigma_{12}$. The diagonal values of $\Sigma_{22}$ are variances and the off-diagonal elements are covariances. When these are not known, they can be estimated by a method that uses the same correlation coefficient; scale estimates using $r_{gd}$ are given in Gideon and Rothan (2007). Examples of both known and unknown covariance structure appear below. The classical variance estimation method is well known, but in this paper an approach via correlation coefficients, as given in Gideon and Rothan (2007), is used. This maintains a consistent level of robustness between slope and variation estimation.

With $l$ as above, a vector of constants, consider the correlation parameter as a function of $\beta$, $f(\beta) = R(X'l, Y - X'\beta)$. For the correlations and continuous random variables under consideration, $f$ is a differentiable function of $\beta$. In order to relate the null distribution of any correlation coefficient to linear combinations of the estimated slopes, $f(\beta)$ is expanded into a truncated multivariate Taylor series about $\beta_0$. Then the random variables are replaced by data and $\beta$ by $\hat\beta$, the estimated slopes, so that an asymptotic distribution can be used. The estimated slopes are computed using the chosen correlation coefficient, as noted below.
Finally, using the asymptotic null distribution of $r$, the chosen correlation coefficient, the asymptotic distribution of any linear combination of $\hat\beta$ is found.

For convenience and without loss of generality, let $\mu = 0$ and $\mu_x = 0$. We start by determining $f(\beta)$ in an explicit form and then taking its partial derivatives with respect to $\beta$. The goal is to write

$$f(\beta) \approx f(\beta_0) + \left.\frac{\partial f(\beta)}{\partial\beta}\right|_{\beta=\beta_0}(\beta - \beta_0),$$

a truncated Taylor series. Start with $R$ the usual population definition of correlation,

$$R(X'l, Y - X'\beta) = \frac{E(X'l\,(Y - X'\beta))}{\sqrt{V(X'l)\,V(Y - X'\beta)}}.$$

Further, $E(X'l\,(Y - X'\beta)) = l'E(XY) - l'E(XX')\beta = l'\sigma_{12} - l'\Sigma_{22}\beta$, and $V(X'l) = l'\Sigma_{22}l$, or $a(l)$ for short. Also observe that $V(Y - X'\beta) = \sigma_1^2 + \beta'\Sigma_{22}\beta - 2\beta'\sigma_{12}$, or $b(\beta)$ for short. Now $\beta_0 = \Sigma_{22}^{-1}\sigma_{12}$, so that $R(X'l, Y - X'\beta_0) = (l'\sigma_{12} - l'\Sigma_{22}\Sigma_{22}^{-1}\sigma_{12})/\sqrt{a(l)b(\beta_0)} = 0$, and $b(\beta_0) = \sigma_1^2 - \sigma_{12}'\Sigma_{22}^{-1}\sigma_{12} = \sigma_{res}^2$, where res stands for residuals.

We now expand $f(\beta) = R(X'l, Y - X'\beta) = (l'\sigma_{12} - l'\Sigma_{22}\beta)/\sqrt{a(l)b(\beta)}$ into a Taylor series. Because

$$\frac{\partial f(\beta)}{\partial\beta} = \frac{(-l'\Sigma_{22})\sqrt{a(l)b(\beta)} - (l'\sigma_{12} - l'\Sigma_{22}\beta)\,\dfrac{\partial}{\partial\beta}\sqrt{a(l)b(\beta)}}{a(l)b(\beta)},$$

and the second term of the numerator vanishes at $\beta_0$,

$$\left.\frac{\partial f(\beta)}{\partial\beta}\right|_{\beta=\beta_0} = \frac{-l'\Sigma_{22}}{\sqrt{a(l)b(\beta_0)}} = \frac{-l'\Sigma_{22}}{\sqrt{l'\Sigma_{22}l}\;\sigma_{res}},$$

the truncated Taylor series is

$$f(\beta) = R(X'l, Y - X'\beta) \approx R(X'l, Y - X'\beta_0) - \frac{l'\Sigma_{22}(\beta - \beta_0)}{\sqrt{l'\Sigma_{22}l}\;\sigma_{res}}.$$

Now we have arrived at the place to put in the sample equivalents. In this series, $X$ is replaced by the $n$ by $p$ data matrix $X_{n\times p} = (x_1, x_2, \ldots, x_p)$, $Y$ by $y$, and $f$ is evaluated at $\hat\beta$, where $r_p(x_i, y - X_{n\times p}\hat\beta) = 0$, $i = 1, 2, \ldots, p$. In other words, the parameters have been replaced by estimates and the random variables by data. The equation remains approximately true within sampling variation; i.e., the two remaining quantities are approximately equal in distribution. For Pearson's $r_p$ these $p$ simultaneous equations are equivalent to the usual least squares normal equations, and the solution vector gives the standard least squares results. Both this and the GDCC formulation and technique for solving the set of equations appear in Rummel (1991); this is a generalized formulation of the normal equations, which for correlation coefficients other than $r_p$ are herein called regression equations. They are valid for any correlation coefficient. Some results are illustrated in Section 4.

Because of the linearity of the covariance function, these $p$ equations imply that $r_p(X_{n\times p}l, y - X_{n\times p}\hat\beta) = 0$. Thus,

$$f(\hat\beta) = r_p(X_{n\times p}l, y - X_{n\times p}\hat\beta) = 0 \approx r_p(X_{n\times p}l, y - X_{n\times p}\beta_0) - \frac{l'\Sigma_{22}(\hat\beta - \beta_0)}{\sqrt{l'\Sigma_{22}l}\;\sigma_{res}}.$$

Because this difference is approximately zero, $r_p(X_{n\times p}l, y - X_{n\times p}\beta_0)$ and $l'\Sigma_{22}(\hat\beta - \beta_0)/(\sqrt{l'\Sigma_{22}l}\,\sigma_{res})$ are approximately equal in distribution. Now the former term has a null distribution, since the outcomes $X_{n\times p}l$ and $y - X_{n\times p}\beta_0$ come from the independent random variables $X'l$ and $Y - X'\beta_0$. It can be inferred from Anderson (1958) and Burg (1975) that $\sqrt{n}\,r_p(X_{n\times p}l, y - X_{n\times p}\beta_0)$ has an asymptotic N(0,1) distribution. Consequently,

$$\frac{\sqrt{n}\;l'\Sigma_{22}(\hat\beta - \beta_0)}{\sqrt{l'\Sigma_{22}l}\;\sigma_{res}}$$

has that same asymptotic distribution.

To relate this result to standard methods, start by transforming from vector $l$ to vector $k$, where $l = \Sigma_{22}^{-1}k$; thus, $\Sigma_{22}l = k$ and $l'\Sigma_{22} = k'$. Then the quadratic form equality is $l'\Sigma_{22}l = k'\Sigma_{22}^{-1}k$. Hence,

$$\frac{\sqrt{n}\;k'(\hat\beta - \beta_0)}{\sqrt{k'\Sigma_{22}^{-1}k}\;\sigma_{res}}$$

has an asymptotic N(0,1) distribution.
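Before this is restated as result (1) below, the conclusion can be checked numerically. The following is a minimal R sketch under an assumed two-regressor normal model; the covariance matrix, slopes, and contrast vector are illustrative choices (they happen to match Model 1 of Section 4), not prescribed by the derivation.

```r
## Check that sqrt(n) k'(bhat - beta0) / (sqrt(k' Sigma22^{-1} k) * s_res)
## is approximately N(0,1) for least squares (Pearson) slopes.
set.seed(2)
n <- 100; reps <- 1000
Sigma22 <- matrix(c(2, 1, 1, 2), 2, 2)
beta0   <- c(2/3, 2/3); s_res <- sqrt(1/3); k <- c(1, -1)
denom   <- sqrt(drop(t(k) %*% solve(Sigma22) %*% k)) * s_res
A <- chol(Sigma22)                 # X = Z %*% A has covariance A'A = Sigma22
z <- replicate(reps, {
  X    <- matrix(rnorm(2 * n), n, 2) %*% A
  y    <- drop(X %*% beta0) + rnorm(n, sd = s_res)
  bhat <- coef(lm(y ~ X))[-1]      # LS slopes; the intercept is dropped
  sqrt(n) * sum(k * (bhat - beta0)) / denom
})
c(mean = mean(z), sd = sd(z))      # should be near 0 and 1
```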
Thus, from the connection with the Pearson correlation coefficient, we have the following result:

$$k'(\hat\beta - \beta_0) \text{ is approximately } N\!\left(0,\; \frac{(k'\Sigma_{22}^{-1}k)\,\sigma_{res}^2}{n}\right) \qquad (1)$$

where $\hat\beta$ solves the normal equations $r_p(x_i, y - X_{n\times p}\hat\beta) = 0$, $i = 1, 2, \ldots, p$.

The equivalent of result (1) in the classical least squares or normal-theory fixed-$x$ multiple linear regression model is introduced to show that the CES has produced the standard result. Assume now that the $n \times p$ data matrix has been chosen: $y = X_{n\times p}\beta + \varepsilon$ where $\varepsilon \sim N(0, \sigma^2 I)$, with $I$ the $n \times n$ identity matrix. Let $X^* = (x_1^*, x_2^*, \ldots, x_p^*)$, where the $*$ indicates that the data have been centered at the means. Then, as is well known, the sum of squares matrix is $X^{*\prime}X^*$, with $\hat\beta = (X^{*\prime}X^*)^{-1}X^{*\prime}y$ and $V(\hat\beta) = \sigma^2(X^{*\prime}X^*)^{-1}$. The distribution of $\hat\beta - \beta$ is multivariate normal, $MN(0, V(\hat\beta))$. Thus, the distribution of $k'(\hat\beta - \beta_0)$ is $N(0, \sigma^2 k'(X^{*\prime}X^*)^{-1}k)$. The two notations for the two systems, CES and classical, are now readily related by equating the variances:

$$\frac{(k'\Sigma_{22}^{-1}k)\,\sigma_{res}^2}{n} = k'(X^{*\prime}X^*)^{-1}k\,\sigma^2.$$

For the fixed-$X$ case, $\Sigma_{22} = (X^{*\prime}X^*)/n$. For $X$ a matrix of outcomes of random variables, $(X^{*\prime}X^*)/n$ estimates $\Sigma_{22}$. In Graybill (1976), under the null hypothesis that $k'\beta = m$ and under normality assumptions, the distribution of

$$\frac{k'\hat\beta - m}{\hat\sigma\sqrt{k'(X^{*\prime}X^*)^{-1}k}}$$

is Student's $t$ with $n - p$ degrees of freedom. With large $n$, the normal distribution is the asymptotic approximation.

3. Derivation of a Universal Method of Multiple Linear Regression through Correlation

It is known (Gideon and Hollister, 1987) that for joint normal random variables $W_1, W_2$ the population value of $R_{gd}(W_1, W_2)$ is $(2/\pi)\sin^{-1}\rho_{W_1,W_2}$ (Kendall's Tau is the same), where $\rho_{W_1,W_2}$ is the bivariate normal correlation parameter between $W_1$ and $W_2$. For random variables $X'l$ and $Y - X'\beta_0$, set

$$f_1(\beta_0) = R_{gd}(X'l, Y - X'\beta_0) = \frac{2}{\pi}\sin^{-1}\rho_{X'l,\,Y-X'\beta_0}.$$

Note that while the form is the same, $f_1$ is used here instead of $f$ to signify that a different correlation coefficient has been chosen. For normal random variables $R(X'l, Y - X'\beta) = \rho_{X'l,\,Y-X'\beta}$, and so the results of the previous section can be used. The truncated Taylor series for $f_1(\beta)$ is

$$f_1(\beta) = R_{gd}(X'l, Y - X'\beta) \approx R_{gd}(X'l, Y - X'\beta_0) + \left.\frac{\partial}{\partial\beta}R_{gd}(X'l, Y - X'\beta)\right|_{\beta=\beta_0}(\beta - \beta_0).$$

The partial derivative is

$$\frac{\partial}{\partial\beta}R_{gd}(X'l, Y - X'\beta) = \frac{2}{\pi}\,\frac{1}{\sqrt{1 - \rho_{X'l,\,Y-X'\beta}^2}}\,\frac{\partial}{\partial\beta}\rho_{X'l,\,Y-X'\beta}.$$

At $\beta = \beta_0$, $X'l$ and $Y - X'\beta_0$ are independent random variables, so $\rho_{X'l,\,Y-X'\beta_0} = 0$, and the latter partial derivative is, as before, $-l'\Sigma_{22}/(\sqrt{l'\Sigma_{22}l}\,\sigma_{res})$. The truncated Taylor series becomes

$$f_1(\beta) = R_{gd}(X'l, Y - X'\beta) \approx R_{gd}(X'l, Y - X'\beta_0) - \frac{2}{\pi}\,\frac{l'\Sigma_{22}(\beta - \beta_0)}{\sqrt{l'\Sigma_{22}l}\;\sigma_{res}}.$$

Now solve (Rummel, 1991) the associated regression equations with data $X_{n\times p}$ and $y$:

$$r_{gd}(x_i, y - X_{n\times p}\hat\beta_{gd}) = 0, \quad i = 1, 2, \ldots, p.$$

$\hat\beta_{gd}$ is a solution vector with $i$th component $\hat\beta_{i,gd}$. Every correlation coefficient has a similar set of regression equations; these correspond to the normal equations in the case of Pearson's $r_p$. These regression equations would have solutions $\hat\beta_\tau$ had Tau been chosen as the correlation coefficient. The regression equations for Tau can be solved by iterations involving the medians of elementary slopes (see Sen, 1968, for the simple linear regression case, and Gideon, 2008, for an illustrated look at this work specialized to Kendall's Tau).
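For the simple linear regression case, Sen's (1968) estimator is explicit and makes a compact illustration. A minimal R sketch follows; the simulated data are illustrative only.

```r
## Sen's (1968) estimator: the slope solving the Tau regression equation
## r_tau(x, y - b*x) = 0 is the median of the elementary slopes
## (y_j - y_i)/(x_j - x_i) over all pairs i < j.
sen_slope <- function(x, y) {
  ij <- combn(length(x), 2)              # all pairs i < j
  dx <- x[ij[2, ]] - x[ij[1, ]]
  dy <- y[ij[2, ]] - y[ij[1, ]]
  median(dy[dx != 0] / dx[dx != 0])      # drop pairs tied in x
}
set.seed(3)
x <- rnorm(50); y <- 2 * x + rnorm(50)
b <- sen_slope(x, y)                     # near the true slope 2
cor(x, y - b * x, method = "kendall")    # near zero, as the equation requires
```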
The GDCC does not have the same linearity properties that Pearson's $r_p$ has, and so it is not necessarily true that $r_{gd}(X_{n\times p}l, y - X_{n\times p}\hat\beta_{gd})$ is exactly zero; however, computer simulations have shown that it is zero or very close to zero. We repeat the above procedure by substituting the sample counterparts into the truncated series and evaluating $f_1$ at $\hat\beta_{gd}$. Even though $f_1(\hat\beta_{gd})$ may be only close to zero, simulations in this and other examples indicate that the asymptotic distribution theory is still good. Again $R_{gd}(X'l, Y - X'\beta_0)$ is zero, and its sample equivalent multiplied by $\sqrt{n}$, namely $\sqrt{n}\,r_{gd}(X_{n\times p}l, y - X_{n\times p}\beta_0)$, has an approximate N(0,1) distribution (Gideon, Prentice, and Pyke, 1989). It now follows as before that

$$\frac{2}{\pi}\,\frac{\sqrt{n}\;l'\Sigma_{22}(\hat\beta_{gd} - \beta_0)}{\sqrt{l'\Sigma_{22}l}\;\sigma_{res}}$$

has an approximate N(0,1) distribution. Consequently, $l'\Sigma_{22}(\hat\beta_{gd} - \beta_0)$ is $N\!\left(0,\; \pi^2\,l'\Sigma_{22}l\,\sigma_{res}^2/(4n)\right)$. (For Kendall's Tau, $\frac{3}{2}\sqrt{n-1}\,\tau$ has an approximate N(0,1) distribution, and so $\frac{3}{2}\sqrt{n-1}$ is the multiplier to use on $\hat\beta_\tau - \beta_0$; i.e., in the one-regressor case $\hat\beta_\tau - \beta_0$ has an asymptotic $N\!\left(0,\; \pi^2\sigma_{res}^2/(9(n-1)\sigma_x^2)\right)$ distribution. This was successfully tested via simulations on continuous and discrete data for one regressor variable.)

Again we let $l'\Sigma_{22} = k'$. Thus,

$$k'(\hat\beta_{gd} - \beta_0) \text{ is approximately } N\!\left(0,\; \frac{\pi^2(k'\Sigma_{22}^{-1}k)\,\sigma_{res}^2}{4n}\right) \qquad (2)$$

where $\hat\beta_{gd}$ solves the regression equations $r_{gd}(x_i, y - X_{n\times p}\hat\beta_{gd}) = 0$, $i = 1, 2, \ldots, p$.

This latter derivation can be used with any correlation coefficient whose population form and asymptotic distribution are known, illustrating the universality of the method. In particular, because Tau and GDCC have the same population value on the class of bivariate $t$ distributions, an equation equivalent to (2) holds when the Tau regression equations are solved. Equivalent means that the correct asymptotic distribution must be substituted at the substitution step.

As a special case, let $k$ be a vector of 0s except for a 1 in the $i$th position. The above result gives the asymptotic distribution of $\hat\beta_{i,gd} - \beta_{0,i}$ as $N\!\left(0,\; \pi^2\sigma^{ii}\sigma_{res}^2/(4n)\right)$, where $\sigma^{ii}$ is the $(i,i)$ element of $\Sigma_{22}^{-1}$. (When $p = 1$, $\Sigma_{22}$ is 1 by 1 and $\sigma^{11}$ is just the reciprocal of the variance of the regressor variable.)

4. Illustration of the Correlation Approach

To illustrate the asymptotic results in the normal case, simulations were run for various sample sizes. Examination of a large number of quantile plots showed that, whether using $r_{gd}$ or $r_p$, the distributions of the chosen linear combinations of the estimated $\beta$'s were very similar. The normal was chosen because the exact distributions of linear combinations of the components of the $\beta$-vector are known for Pearson's, so if GDCC results exhibit similar characteristics, then this correlation technique is feasible.

The first example is in two parts, each with its own distinct covariance structure. The hypothesis of interest is that $\beta_1 = \beta_2$. In the first part the null hypothesis is true and in the second it is false. Data were generated by the linear transformation

$$\begin{pmatrix} y \\ x_1 \\ x_2 \end{pmatrix} = A \begin{pmatrix} z_1 \\ z_2 \\ z_3 \end{pmatrix},$$

where $A$ is a 3 by 3 matrix and $(Z_1, Z_2, Z_3)$ are independent and identically distributed normal random variables with mean zero and variance one. For Part 1,

$$A = A_1 = \begin{pmatrix} 1 & 1 & 1 \\ 0 & 1 & 1 \\ 1 & 1 & 0 \end{pmatrix},$$

and for Part 2,

$$A = A_2 = \begin{pmatrix} 1 & 1 & 1 \\ 0 & 2 & 1 \\ 1 & 1 & 0 \end{pmatrix}.$$

In both parts $(Y, X_1, X_2)'$ is distributed $N(0, \Sigma)$, where

$$\Sigma = A I_3 A' = \begin{pmatrix} \sigma_1^2 & \sigma_{12}' \\ \sigma_{12} & \Sigma_{22} \end{pmatrix}.$$

The 2 by 2 matrix $\Sigma_{22}$ is the covariance matrix of the regressor variables.
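A minimal R sketch of this data generation for Part 1 follows; it also produces the population quantities $\beta_0$ and $\sigma_{res}^2$ that are worked out next. The sample size is an illustrative choice.

```r
## Section 4, Part 1: (y, x1, x2)' = A1 z with z ~ N(0, I3), so Sigma = A1 A1'.
A1 <- matrix(c(1, 1, 1,
               0, 1, 1,
               1, 1, 0), 3, 3, byrow = TRUE)
Sigma   <- A1 %*% t(A1)                         # [[3,2,2],[2,2,1],[2,1,2]]
Sigma22 <- Sigma[2:3, 2:3]
sigma12 <- Sigma[2:3, 1]
beta0   <- solve(Sigma22, sigma12)              # (2/3, 2/3): the true slopes
s2_res  <- Sigma[1, 1] - sum(sigma12 * beta0)   # 1/3: residual variance
n   <- 200
dat <- t(A1 %*% matrix(rnorm(3 * n), 3, n))     # n simulated rows (y, x1, x2)
```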
The regression slopes are $\beta_0 = \Sigma_{22}^{-1}\sigma_{12}$ and $\sigma_{res}^2 = \sigma_1^2 - \sigma_{12}'\Sigma_{22}^{-1}\sigma_{12}$. For Part 1,

$$\Sigma = \begin{pmatrix} 3 & 2 & 2 \\ 2 & 2 & 1 \\ 2 & 1 & 2 \end{pmatrix}, \quad \Sigma_{22}^{-1} = \frac{1}{3}\begin{pmatrix} 2 & -1 \\ -1 & 2 \end{pmatrix}, \quad \beta_0 = \begin{pmatrix} 2/3 \\ 2/3 \end{pmatrix}, \quad \sigma_{res}^2 = \frac{1}{3}.$$

For Part 2,

$$\Sigma = \begin{pmatrix} 3 & 3 & 2 \\ 3 & 5 & 2 \\ 2 & 2 & 2 \end{pmatrix}, \quad \Sigma_{22}^{-1} = \frac{1}{6}\begin{pmatrix} 2 & -2 \\ -2 & 5 \end{pmatrix}, \quad \beta_0 = \begin{pmatrix} 1/3 \\ 2/3 \end{pmatrix}, \quad \sigma_{res}^2 = \frac{2}{3}.$$

The conditional models for the two parts are then

Model 1: $Y \mid X = x$ is $N\!\left(\tfrac{2}{3}x_1 + \tfrac{2}{3}x_2,\; \tfrac{1}{3}\right)$; Model 2: $Y \mid X = x$ is $N\!\left(\tfrac{1}{3}x_1 + \tfrac{2}{3}x_2,\; \tfrac{2}{3}\right)$.

The correlation structure is given in the following table:

pop corr                  Model 1               Model 2
y, x1                     0.816                 0.775
y, x2                     0.816                 0.816
x1, x2                    0.500                 0.632
Y | X = x (multiple)      (2/3)√2 = 0.943       (1/3)√7 = 0.882
GDCC                      0.783                 0.687

Table 1: Correlation Structure

In this table the GDCC row is found by the inverse sine transformation of the multiple correlation coefficient $\rho$ in the row above, $(2/\pi)\sin^{-1}\rho$ (Gideon and Hollister, 1987).

To apply the above for testing the hypothesis $\beta_1 = \beta_2$, use $k' = (1, -1)$. The value of $k'\Sigma_{22}^{-1}k$ is 2 for Model 1 and 11/6 for Model 2. From results (1) and (2), the asymptotic distributions of $\hat\beta_1 - \hat\beta_2$ for Pearson and GDCC and the two models are

           Model 1                     Model 2
Pearson    N(0, 2/(3n))                N(0, 22/(18n))
GDCC       N(0, (π²/4)·2/(3n))         N(0, (π²/4)·22/(18n))

Table 2: Distributions for the Two Models

Note that because $\pi/2 \approx 1.57$, the standard deviation for GDCC is about 1.57 times larger.

Some details for Model 2 are now given. Simulations were run with $W_1 = \hat\beta_1 - \hat\beta_2$ and $W_2 = \hat\beta_{1,gd} - \hat\beta_{2,gd}$ recorded each time. Plots of $W_1$ versus $W_2$, quantile plots of $W_1$ versus $W_2$, and individual normal quantile plots for each of $W_1$ and $W_2$ (see Figure 1), with accompanying Kolmogorov-Smirnov tests of fit, all gave better than expected results; i.e., the distributions of the $\beta$-contrasts from the classical method and the GDCC method were similar. The distributions are nearly the same except for the scale factor. Because the null hypothesis is false in this case, the center point of the contrast is not zero but near $-1/3$. The comparisons $W_3 = 2\hat\beta_1 - \hat\beta_2$ and $W_4 = 2\hat\beta_{1,gd} - \hat\beta_{2,gd}$, in which the null hypothesis is true, were also run. Here $k' = (2, -1)$ and the asymptotic distributions can be calculated as shown above. Again $W_3$ and $W_4$ exhibited similar normal distributions, but the center points on the quantile plots were, of course, near zero. Model 1 simulations also gave better than expected results. Additionally, simulations on the individual $\beta$'s, e.g. $k' = (1, 0)$, again produced good results. Sample sizes were run from 10 to 100 with 50 simulations each. Although many sample quantile plots were produced for various sample sizes, they are not given, as they duplicated the plots already shown. It is interesting that the null distribution of GDCC approaches normality slowly, but even for small (10-20) sample sizes, simulations of the distribution of linear combinations of slopes via result (2) show near normality.

[Figure 1: Normal quantile plots of W1, W2, W3, and W4, each against the quantiles of the standard normal.]

5. An example of simple and multiple regression with the 1992 Atlanta Braves team record of 175 games

To illustrate the viability of this work, it is necessary to bring in ideas on variation estimation that allow the full estimation procedure to be carried out.
This example shows how the correlation estimation technique (Web site: www.math.umt.edu/gideon) can parallel a standard multiple regression analysis, thus showing that it is as viable as classical regression analysis. Estimation from correlation is not the standard approach, but not only is it as good as or better than classical least squares, it also gives a cohesiveness to the analysis, as it is valid for every correlation coefficient and rivals other robust techniques. Besides the distribution technique for the slopes, the estimates of the variation structure are illustrated, including residual error, standard deviations of the slopes, the multiple correlation coefficient, and partial correlations. Rank-based correlations devalue extreme values; this allows GDCC to be more robust than classical least squares, as is seen in this example.

The generality of the technique includes dealing with tied values, so a baseball example was chosen because the data have numerous tied values. (The work on this example started after the 1992 baseball season, showing the slowness of the research process.) The data have extreme values but no outliers, so techniques of removing or reweighting points are not appropriate. Also, one expects that hits and runs are fairly highly correlated, so this type of example tests the convergence of the numerical technique for the regression equations, the Gauss-Seidel method; a sketch of this iteration is given at the end of this introduction.

To demonstrate these concepts, several regressions are run on the baseball data. In each of these, the response variable $y$ is the length of a game in hours. The first regression, in Sections 5.1 through 5.3, uses the first three of the following four regressor variables:

$x_1$, the total number of runs by both teams in a game;
$x_2$, the total number of hits by both teams in a game;
$x_3$, the total number of runners by both teams left on base in a game;
$x_4$, the total number of pitchers used in a game by both teams.

The interest is in determining how various conditions in a game affect the length of the game. The second regression, Section 5.4, is a simple linear regression of time, $y$, on $x_4$. The third regression, also Section 5.4, uses all four of the regressor variables. The main purpose is to use the asymptotic distributions of the slopes to compare least squares (LS) to the correlation estimation system (CES). This is accomplished by using the Pearson correlation coefficient for LS regression and the Greatest Deviation correlation coefficient (GDCC) for the CES. This comparison shows that the CES in this paper is a competitive regression technique. Because GDCC works, other nonparametric correlation coefficients should also work: Kendall's Tau, Spearman's rho, Gini (1914), absolute value, and others (Gideon, 2007). The residual standard deviations are compared and, surprisingly, the one derived from GD is less than that of LS. The multiple CCs are also computed, as is one partial CC. Quantile plots of the residuals are discussed.

Although time is a continuous random variable, all the regressor variables are discrete; so, at best, only an approximate multivariate normal distribution would model the data for the classical analysis. All classical inference is based on the normal distribution or on central limit theorems that give asymptotic results. Although the CES is based on limit theorems for continuous data, the results appear good even though all the regressor variables are discrete. However, more work needs to be done on the asymptotics for discrete data to support the theory implied by this example.
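The Gauss-Seidel idea mentioned above is easy to sketch. The paper solves the GDCC regression equations this way (Rummel, 1991); the minimal R sketch below uses Kendall's Tau instead, because its one-dimensional solver is just Sen's median of elementary slopes (the sen_slope function from the earlier sketch). The simulated data and convergence behavior are illustrative assumptions, not results from the paper.

```r
## Gauss-Seidel on the Tau regression equations: sweep through the
## coordinates, re-solving r_tau(x_i, y - X b) = 0 in b_i with the rest fixed.
tau_regression <- function(X, y, sweeps = 25) {
  b <- rep(0, ncol(X))
  for (s in seq_len(sweeps)) {
    for (i in seq_len(ncol(X))) {
      partial <- y - X[, -i, drop = FALSE] %*% b[-i]   # remove other terms
      b[i] <- sen_slope(X[, i], drop(partial))         # 1-D Tau solution
    }
  }
  b
}
set.seed(4)
X <- matrix(rnorm(200), 100, 2)
X[, 2] <- X[, 2] + 0.7 * X[, 1]       # correlated regressors, as in the data
y <- drop(X %*% c(1, 2)) + rnorm(100)
tau_regression(X, y)                  # near the true slopes (1, 2)
```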
5.1 Example of the Regression of Time on Three Variates

For any CC, and in particular for the NPCC $r_{gd}$, the regression equations for the first example are

$$r_{gd}(x_i, y - b_1x_1 - b_2x_2 - b_3x_3) = 0, \quad i = 1, 2, 3.$$

Thus, the regressor variables are uncorrelated with the regression residuals. The intercept of the regression is obtained by taking the median of these residuals:

$$b_0 = \mathrm{median}(y - b_1x_1 - b_2x_2 - b_3x_3).$$

The residual SD is obtained by the methods of Gideon and Rothan (2007); i.e., a simple linear regression of the sorted residuals (least to greatest) on the ordered N(0,1) quantiles. Let quan and res represent these ordered vectors. Then the estimated SD, $s$, which is also the slope of the GD regression line, is found as the solution to

$$r_{gd}(\mathrm{quan},\; \mathrm{res} - s\cdot\mathrm{quan}) = 0. \qquad (5.1.1)$$

For this example, Splus was used with some accompanying C routines for the GDCC technique; the lm command, linear models, was used for LS. The results are

GDCC: $\hat y = 1.8374 + 0.04908x_1 - 0.01022x_2 + 0.05479x_3$, $s = \hat\sigma_{gd} = 0.2518$, or $(0.2518)(60) = 15.1$ minutes;

LS: $\hat y = 1.7179 + 0.04459x_1 - 0.01079x_2 + 0.06910x_3$, $s = \hat\sigma_{LS} = 0.2919$ on 171 degrees of freedom, or $(0.2919)(60) = 17.5$ minutes.

Note that $\hat\sigma_{gd} < \hat\sigma_{LS}$, suggesting that the CES is viable. A large number of comparisons of these two regression planes were run. An analysis of the residuals and normal quantile plots revealed that the difference between the two regression fits is almost negligible. Both have about 15 data points that are away from normality on the normal quantile plot. For Pearson's, the least squares fit was performed on the residuals regressed on the normal quantiles, whereas the GD technique was used in the other quantile plot. However, when the line is drawn through the normal quantile plots as described above, there is one difference: the least squares plot has a slight wave around normality but GD does not. Both show a similar pattern for the 15 extreme points, and both have about the same distribution of residuals. Plots of the Pearson residuals versus the GD residuals revealed very little difference. Many plots and counting comparisons were run in order to find a significant preference, but none could be found. So, in what follows, the important thing is that this nonparametric correlation coefficient gives results that are comparable to least squares for these three regressor variables on this particular data set.

The three most extreme data points arise from games 28, 86, and 147, which are games of lengths 16, 10, and 12 innings, respectively. There are a total of 18 extra-inning games, and these could all be considered non-standard data, even though all four regressor variables relate well to the times of these games. It was verified that deleting the most serious of these extremes made LS more like the GD regression, again showing the value of the CES, because no data points should be deleted for a realistic analysis.

The asymptotics in this article are now illustrated for this example. The results and all the calculations are given in order to encourage the broader development of regression techniques with correlation coefficients.

           SE                    z score             P-value
slopes     GD        LS          GD       LS         GD       LS
b1         0.01318   0.0094      3.72     4.77       .0002    .0000
b2         0.01324   0.0101      -0.77    -1.07      .44      .2867
b3         0.00998   0.0077      5.49     9.02       .0000    .0000

Table 3: Statistics on the GD and LS Slope Estimates

The calculation of the estimated standard errors of the slopes is given next, preceded by a sketch of the scale computation (5.1.1) that feeds into it.
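The following is a minimal R sketch of the scale estimate (5.1.1) in its Pearson version (an lm fit, which the next subsection notes is very close to the LS value); the GDCC version replaces the lm slope by the solution of $r_{gd}(\mathrm{quan}, \mathrm{res} - s\cdot\mathrm{quan}) = 0$, for which no standard R routine exists. The simulated residuals are illustrative only.

```r
## Scale from a normal quantile plot: the slope of the regression of the
## sorted residuals on the ordered N(0,1) quantiles estimates the SD.
scale_qplot <- function(res) {
  quan <- qnorm(ppoints(length(res)))   # ordered N(0,1) quantiles
  sres <- sort(res)                     # ordered residuals, least to greatest
  unname(coef(lm(sres ~ quan))[2])      # slope = SD estimate
}
set.seed(5)
scale_qplot(rnorm(175, sd = 0.29))      # recovers roughly 0.29
```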
From the analysis above, the asymptotic distributions are

LS: $N\!\left(\beta_i,\; \dfrac{\sigma^{ii}\sigma_{res}^2}{n}\right)$ and GD: $N\!\left(\beta_i,\; \dfrac{\pi^2}{4}\,\dfrac{\sigma^{ii}\sigma_{res}^2}{n}\right)$,

where $n = 175$ and, for GD, the estimate of $\sigma_{res}^2$ is the square of the slope of the GD regression line of the sorted residuals, $y - \hat y$, on N(0,1) quantiles. For LS, the linear models command from Splus was used with the standard calculation, although the CES estimate of $\sigma_{res}^2$ coming from a quantile plot with Pearson's CC (equation 5.1.1 using $r_p$) was very close to the LS result. Recall that $\Sigma_{22}$ is the 3x3 covariance matrix of the regressor variables, $\sigma_i^2$ denotes its $i$th diagonal element, and $\sigma^{ii}$ is the $i$th diagonal element of $\Sigma_{22}^{-1}$.

For the GD case, $\hat\Sigma_{22}$ was obtained from the correlation matrix and the diagonal matrix of standard deviations. The GD estimates of the SDs, $\hat\sigma_i$, $i = 1, 2, 3$, are obtained similarly to equation 5.1.1 by using the sorted data in place of res. The 3 by 3 GD correlation matrix $\hat\Sigma_{gd}$ is used in the form in which each GD correlation has been transformed to an estimate of a bivariate normal (or bivariate Cauchy) correlation by $\hat\rho = \sin(\pi r_{gd}/2)$. In other words,

$$\hat\Sigma_{22} = \begin{pmatrix} \hat\sigma_1 & 0 & 0 \\ 0 & \hat\sigma_2 & 0 \\ 0 & 0 & \hat\sigma_3 \end{pmatrix} \hat\Sigma_{gd} \begin{pmatrix} \hat\sigma_1 & 0 & 0 \\ 0 & \hat\sigma_2 & 0 \\ 0 & 0 & \hat\sigma_3 \end{pmatrix}.$$

(A sketch of this calculation is given at the end of this subsection.) The SDs and correlations needed for all of these calculations are given in the following tables.

       y        x1       x2       x3       x4
y      1
x1     0.4835   1
x2     0.6053   0.7686   1
x3     0.6745   0.2308   0.6117   1
x4     0.7201   0.6025   0.6279   0.4764   1

Table 4: Pearson's correlation coefficient, r_p

       y         x1        x2        x3        x4
y      1         0.3736    0.4138    0.4023    0.4885
x1     (0.5537)  1         0.5690    0.1839    0.4023
x2     (0.6052)  (0.7794)  1         0.4080    0.3678
x3     (0.5907)  (0.2849)  (0.5979)  1         0.2529
x4     (0.6942)  (0.5907)  (0.5461)  (0.3869)  1

Table 5: GDCC: upper triangle is r_gd, lower triangle (in parentheses) is sin(πr_gd/2)

In Table 6, LS is the classical least squares estimate and GD uses a GD fitting on the quantile plot.

       y        x1       x2       x3       x4
LS     0.4420   4.1994   4.7855   4.1457   2.1885
GD     0.4003   3.8825   4.6199   4.0076   2.1468

Table 6: Estimates of the standard deviations of the regression variables

From these tables the covariance matrix $\Sigma_{22}$ can be formed and inverted to obtain the following for the LS (Pearson) and CES (GD) systems. For Pearson the calculation gives

$$\hat\Sigma_{22}^{-1} = \begin{pmatrix} 0.1785 & -0.1570 & 0.0691 \\ -0.1570 & 0.2079 & -0.1101 \\ 0.0691 & -0.1101 & 0.1198 \end{pmatrix}.$$

For GD the calculation gives

$$\hat\Sigma_{22}^{-1} = \begin{pmatrix} 0.1943 & -0.1548 & 0.0530 \\ -0.1548 & 0.1962 & -0.0925 \\ 0.0530 & -0.0925 & 0.1114 \end{pmatrix}.$$

Again recall that $\sigma^{ii}$ is the notation for the $(i,i)$ element of $\Sigma_{22}^{-1}$. From the asymptotic results connecting slope estimates and correlation, for Pearson's $r_p$,

$$\hat\sigma_{\hat\beta_i} = \sqrt{\frac{\hat\sigma^{ii}\hat\sigma_{res}^2}{n}} = \frac{0.2919}{\sqrt{175}}\sqrt{\hat\sigma^{ii}} = (0.02207)\sqrt{\hat\sigma^{ii}},$$

and this equals 0.009324 for $i = 1$, 0.010063 for $i = 2$, and 0.0077 for $i = 3$. These agree closely with the standard errors taken directly from the Splus linear models (lm) command. For GD,

$$\hat\sigma_{\hat\beta_i} = \sqrt{\frac{\pi^2\hat\sigma^{ii}\hat\sigma_{res}^2}{4n}} = \sqrt{\frac{\pi^2(0.2518)^2}{4(175)}}\sqrt{\hat\sigma^{ii}} = (0.02990)\sqrt{\hat\sigma^{ii}},$$

and this equals

for $i = 1$: $(0.02990)\sqrt{0.1943} = 0.01318$;
for $i = 2$: $(0.02990)\sqrt{0.1962} = 0.01324$;
for $i = 3$: $(0.02990)\sqrt{0.1114} = 0.00998$.

Note that the inference on the significance of the slopes using P-values is essentially the same whether LS or GD regression is used. The estimated values $\hat\beta_i$ are somewhat different. The smaller standard error on the residuals of the regression is from GD, not LS (15.1 minutes versus 17.5 minutes). It is always the case that GD regression is less influenced by the extremes than LS, and in inference like this one probably wants knowledge about the standard games, without being swayed by a few unusual games.
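The GD side of these calculations can be reproduced from Tables 5 and 6 alone. A minimal R sketch, using only values printed above:

```r
## Rebuild Sigma22-hat for GD from the x1, x2, x3 block of Table 5 and the GD
## scale estimates of Table 6, then apply result (2) for the standard errors.
r_gd <- matrix(c(1,      0.5690, 0.1839,
                 0.5690, 1,      0.4080,
                 0.1839, 0.4080, 1), 3, 3)
rho  <- sin(pi * r_gd / 2); diag(rho) <- 1   # normal-scale correlations
sds  <- c(3.8825, 4.6199, 4.0076)            # GD SDs of x1, x2, x3
Sigma22 <- diag(sds) %*% rho %*% diag(sds)
Sinv <- solve(Sigma22)                       # matches the GD inverse above
n <- 175; s_res <- 0.2518
sqrt(pi^2 * diag(Sinv) * s_res^2 / (4 * n))  # 0.01318, 0.01324, 0.00998
```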
The smaller z-scores for GD, which come from the factor $\pi/2$ in the SD, are the price paid for NP inference, but the inference is valid over a larger class of distributions. References for the above work are on the Web site.

5.2 Example of Partial Correlation

In order to compare LS and GD regression more fully, the partial correlation of Y and X2 was computed, deleting the effects of X1 and X3. The variable X2 was chosen because the Pearson and GD correlations with Y are nearly the same and positive, but in the regression in Section 5.1 the coefficient of X2 is negative. Recall that $r_p(y, x_2) = 0.6053$ and $r_{gd}(y, x_2) = 0.4138$ with $\sin(\pi r_{gd}/2) = 0.6052$. Also, the outcomes of variable X2, the total number of hits, have many ties, so this choice provides a good test of the NPCC tied-value algorithm. For each CC, the regressions of Y on X1 and X3 and of X2 on X1 and X3 must be computed in order to obtain residuals; the correlations of these residuals then give the partial correlations. The regressions are, for LS,

$$\hat y = 1.6807 + 0.03645x_1 + 0.06339x_3 \quad \text{and} \quad \hat x_2 = 3.4463 + 0.7553x_1 + 0.5296x_3.$$

Therefore, for LS the partial correlation is $r_p(y - \hat y, x_2 - \hat x_2) = -0.08146$. For GD,

$$\hat y = 1.7982 + 0.04030x_1 + 0.05083x_3 \quad \text{and} \quad \hat x_2 = 3.5798 + 0.7647x_1 + 0.5011x_3.$$

The partial correlation is $r_{gd}(y - \hat y, x_2 - \hat x_2) = -0.04023$ with $\sin(\pi(-0.04023)/2) = -0.06315$. Thus, the Pearson and GD partial CCs of Y and X2, removing the effects of X1 and X3, are very similar.

5.3 The Multiple CC of the Regression

The multiple CC of Y on X1, X2, and X3 is, for LS, $r_p(y, \hat y) = \sqrt{0.5713} = 0.7558$; for GD, $r_{gd}(y, \hat y) = 0.5287$ with $\sin(\pi r_{gd}/2) = 0.7383$. This result is in slight disagreement with the SD of the residuals, in that GD gives a smaller multiple CC than Pearson, indicating a slightly looser relationship. It is known that GD underestimates $\rho$.

5.4 Examples of Regression of Time on Four Variates and One Variate

The regression of Y on just X4, and then on all four regressor variables, is now given so that the LS and GD methods can be compared. The CCs are $r_p(y, x_4) = 0.7201$ and $r_{gd}(y, x_4) = 0.4885$ with $\sin(\pi r_{gd}/2) = 0.6942$. X4 has a higher correlation with Y, for both CCs, than each of the other three regressor variables. Hence, possibly an important predictor variable has been left out of the first regression equation. Here are all the relevant regression equations:

LS: $\hat y = 1.9167 + 0.1454x_4$ with $\hat\sigma_{res} = 0.3076$;
GD: $\hat y = 1.9000 + 0.1500x_4$ with $\hat\sigma_{res} = 0.2769$;
LS: $\hat y = 1.5753 + 0.0217x_1 - 0.0127x_2 + 0.0533x_3 + 0.0897x_4$ with $r_p(y, \hat y) = \sqrt{0.6332} = 0.8205$ and $\hat\sigma_{res} = 0.2556$ on 170 degrees of freedom;
GD: $\hat y = 1.7473 + 0.0457x_1 - 0.0310x_2 + 0.0567x_3 + 0.0718x_4$ with $r_{gd}(y, \hat y) = 0.5805$, $\sin(\pi r_{gd}/2) = 0.7906$, and $\hat\sigma_{res} = 0.2280$.

From the "lm" command in Splus, the P-values for variables 1, 2, 3, 4 are, respectively, 0.0142, 0.1514, 0.0000, and 0.0000. In the four-variable regression LS has a slightly higher multiple CC, but GDCC has a slightly smaller residual SE, somewhat at odds. The two slopes with the biggest difference between the two regressions are for X1 and X2. The coefficient of X2 for GD is over twice as large as for LS; if the value of the X2 coefficient for GD had been the LS coefficient, it would have been very significant, as the P-value (0.00027, t-value -3.52) is much lower than the 0.1514 given above. The X1 coefficient in the GD regression is also more than twice that of LS. So there is a real difference in these regressions, giving further justification for the use of the CES.
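For readers who want to replay the LS side of Sections 5.2 through 5.4 on their own data, here is a minimal R sketch. The data frame braves and its column names are hypothetical stand-ins for the 1992 Braves data ($y, x_1, \ldots, x_4$), which is not reproduced here; the GD side would need a GDCC routine in place of lm and cor.

```r
## Assumed data frame `braves` with columns time (y), runs (x1), hits (x2),
## lob (x3), pitchers (x4) -- hypothetical names, not from the paper.
fit4 <- lm(time ~ runs + hits + lob + pitchers, data = braves)
cor(braves$time, fitted(fit4))     # multiple CC; 0.8205 in the paper
summary(fit4)$sigma                # residual SE; 0.2556 in the paper

## Partial correlation of time and hits, removing runs and lob (Section 5.2):
## correlate the residuals from the two auxiliary regressions.
res_y  <- resid(lm(time ~ runs + lob, data = braves))
res_x2 <- resid(lm(hits ~ runs + lob, data = braves))
cor(res_y, res_x2)                 # -0.08146 in the paper
```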
The normal quantile plots of the residuals for the four-variable regression reveal fewer unusual games than the three-variable regression did. See Figure 2 and note the different vertical scales. There are now 5 extremes for LS, and 4 to 6 (depending on visual judgement) for GD, with all the remaining points lying very close to the GD straight line through the points. However, the distance from the line to the unusual points is much greater for GD than for LS. This explains one of the differences in the regression output. When the GD line is compared to the LS line on the normal quantile residual plot for the LS regression, the GD line gives a better evaluation criterion, because it goes through more of the sorted residuals and is not swayed by the extremes; so, visually, one can check for normality more easily. GD obtains a smaller residual SE by not weighting the very few unusual points as much as LS does. Whether or not the difference in the coefficients and the GDCC standard deviation of (0.2280)(60) = 13.7 minutes, compared to (0.2556)(60) = 15.3 minutes, is meaningful to a data analysis is entirely subjective. In the current problem on the length of major league baseball games, with the idea that they are too long, the GD analysis would be more appropriate for this analyst.

[Figure 2: Normal quantile residual plots for the LS and GD four-variable regressions; sorted residuals versus quantiles of the standard normal.]

For the simple linear regression of time on X4, the total number of pitchers used, the asymptotic inference on the slope is now given, and the SE of the slope coefficient is calculated. Because in this case $\hat\Sigma_{22}$ is a 1x1 matrix, the inverse is $\hat\sigma^{11} = 1/\hat\sigma_4^2$, where $\hat\sigma_4^2$ is the estimate of the variance of $x_4$ by whatever method is being used. The formulas for LS and GD are

Pearson or LS: $\hat\sigma_{\hat\beta} = \sqrt{\dfrac{\hat\sigma_{res}^2}{n\hat\sigma_4^2}} = \sqrt{\dfrac{(0.3076)^2}{175(4.7895)}} = 0.0106$, with a z score of 13.7;

GD: $\hat\sigma_{\hat\beta} = \sqrt{\dfrac{\pi^2\hat\sigma_{res}^2}{4n\hat\sigma_4^2}} = \sqrt{\dfrac{\pi^2(0.2769)^2}{4(175)(4.6088)}} = 0.0153$, with a z score of 9.8.

As in the other examples, the SE of the slope is higher for the NPCC than for the classical case. However, the real question is which one is more reliable over a variety of data sets. As a check on the LS value, directly from the Splus "lm" command, $\hat\sigma_{\hat\beta} = 0.0106$.

A small simulation study was done on the distribution of the GDCC for the variables time of game (continuous) and number of pitchers (discrete); i.e., variables Y and X4 above. This study showed that the asymptotic distribution is indeed fairly normal even for small to moderate sample sizes (>10). The implication is that the Taylor series process described earlier works just as well when some variables are discrete. A second implication may be that the max-min tied-value procedure hastens the convergence to normality. It would be nice to have a proof of this.

6. Conclusion

This article sets the framework for a very general method of multiple regression based on the distributions and population values of correlation coefficients. The quality of the GD results should eliminate lingering doubts as to the validity of this and other NP methods in linear regression. There are six other correlations in Gideon (2007) to which the process in this article can be applied, and some comments have been given on the use of Kendall's Tau in the CES method.
Which correlation coefficient to use on a particular data set is also a research question. The L-one correlation coefficient in Gideon (2007) could be profitably used whenever L-one methods are appropriate. A long-term goal would be a computer package that can select different correlation coefficients on which to perform the multiple regression analysis. Another important goal of this article was to reemphasize that the problem of tied values is apparently not an issue when the max-min method is used. Thus, the CES using rank-based or continuous (including Pearson's, which is equivalent to least squares) correlation coefficients in multiple linear regression estimation is not only a very viable technique, but also provides a cohesion not always found in other methods. These results can segue into other estimation areas of statistics; some of these ideas can be pursued by consulting the Web site. They include, for example, estimation in nonlinear regression, generalized linear models, and time series (Sheng, 2002). The CES provides a simple way (once computer programs have been written) to use robust methods in these latter areas without having to resort to data manipulation.

Credits: Special thanks to former students Steven Rummel and Jacquelynn Miller, and especially to Carol Ulsafer, collaborator and editorial assistant.

7. References

Anderson, T. W. (1958). An Introduction to Multivariate Statistical Analysis. John Wiley and Sons, New York.

Burg, Karl V. (1975). Statistical Models in Applied Sciences. John Wiley and Sons, New York.

Gibbons, J. D. and Chakraborti, S. (1992). Nonparametric Statistical Inference, 3rd ed. Marcel Dekker, Inc., New York.

Gideon, R. A. (2008). Kendall's τ in Correlation and Regression, in progress.

Gideon, R. A. (2007). The Correlation Coefficients. Journal of Modern Applied Statistical Methods, 6, no. 2, in progress.

Gideon, R. A., Prentice, M. J., and Pyke, R. (1989). The Limiting Distribution of the Rank Correlation Coefficient r_gd. In: Contributions to Probability and Statistics (Essays in Honor of Ingram Olkin), edited by Gleser, L. J., Perlman, M. D., Press, S. J., and Sampson, A. R. Springer-Verlag, New York, 217-226.

Gideon, R. A. and Hollister, R. A. (1987). A Rank Correlation Coefficient Resistant to Outliers. Journal of the American Statistical Association, 82, no. 398, 656-666.

Gideon, R. A. and Rothan, A. M., CSJ (2007). Location and Scale Estimation with Correlation Coefficients. Communications in Statistics - Theory and Methods, under review.

Gini, C. (1914). L'Ammontare e la Composizione della Ricchezza delle Nazioni. Bocca, Torino.

Graybill, F. A. (1976). Theory and Application of the Linear Model. Duxbury Press, Massachusetts.

Miller, Jacquelynn (1995). Multiple Regression Development with GDCC. Masters Thesis, University of Montana.

Rummel, Steven E. (1991). A Procedure for Obtaining a Robust Regression Employing the Greatest Deviation Correlation Coefficient. Unpublished Ph.D. Dissertation, University of Montana, Missoula, MT 59812; full text accessible through UMI ProQuest Digital Dissertations.

Sen, P. K. (1968). Estimates of the Regression Coefficient Based on Kendall's Tau. Journal of the American Statistical Association, 63, 1379-1389.

Sheng, HuaiQing (2002). Estimation in Generalized Linear Models and Time Series Models with Nonparametric Correlation Coefficients. Unpublished Ph.D. Dissertation, University of Montana, Missoula, MT 59812; full text accessible through http://wwwlib.umi.com/dissertations/fullcit/3041406.

Web site: www.math.umt.edu/gideon