The Relationship between a Correlation Coefficient and its Associated Slope Estimates in
Multiple Linear Regression
Rudy A. Gideon
Abstract
This article takes correlation coefficients as the starting point to obtain inferential results in linear
regression. Under certain conditions, the population correlation coefficient and the sampling correlation
coefficient can be related via a Taylor series expansion to allow inference on the coefficients in simple
and multiple regression. This general method includes nonparametric correlation coefficients (NPCCs)
and so gives a universal way to develop regression methods. This work is part of a correlation estimation
system (CES) that uses correlation coefficients to perform many types of estimation, including time
series, nonlinear, and generalized linear models, as well as estimation on individual distributions.
AMS (2000) subject classification. Primary 62J05, 62G05, 62G08.
Keywords and phrases. Correlation, rank statistics, linear regression, nonparametric, Kendall, greatest
deviation correlation coefficient.
0. Preliminaries and Notation
Capital R denotes any population correlation coefficient and lower case r the
corresponding sample correlation coefficient. R(X,Y) then stands for the unknown but fixed
population correlation coefficient for the bivariate random variable (X,Y) . Of course if X
and Y are independent R(X,Y) = 0. When needed, subscripts are used to denote a particular
correlation coefficient. For the bivariate normal and the usual population definition of
correlation, $R(X,Y) = \rho$, but for GDCC and Kendall's $\tau$, $R_{gd}(X,Y) = R_\tau(X,Y) = (2/\pi)\sin^{-1}(\rho)$. In addition, $\tau$ and GDCC retain the same value over the class of bivariate t distributions. For the bivariate normal, $\rho\,\sigma_y/\sigma_x$ is the regression parameter of $Y$ on $X$, so the random variables $X$ and $Y - \rho(\sigma_y/\sigma_x)X$ are independent; hence $R(X,\, Y - \rho(\sigma_y/\sigma_x)X)$ is zero for any $R$.
For an arbitrary random variable $(X, Y)$, let $\beta$ be the regression parameter when the parameters of the distribution are unknown. So now $R(X,\, Y - \beta X)$ is a function of the variable $\beta$, which is sometimes written as $f(\beta)$ to emphasize this fact. This function is zero only at $\beta_0$, the true value of the regression parameter. For the specific example above with bivariate normal random variables, $\beta_0 = \rho\,\sigma_y/\sigma_x$. The multivariate version, with $\beta' = (\beta_1, \beta_2, \ldots, \beta_p)$, $l' = (l_1, l_2, \ldots, l_p)$, and $X' = (X_1, X_2, \ldots, X_p)$, yields

$$R(X'l,\, Y - X'\beta) = \frac{E(X'l\,(Y - X'\beta))}{\sqrt{V(X'l)\,V(Y - X'\beta)}}$$

as a function of p-dimensional $\beta$. (Prime denotes transpose.) Note that when $\beta_0$ is substituted for $\beta$, $R(X'l,\, Y - X'\beta_0) = 0$ because $X'l$ and $Y - X'\beta_0$ are assumed to be independent. For the X variable the data is written $X_{n\times p} = (x_1, x_2, \ldots, x_p)$, where each $x_i$ is a column vector of length n and y is the data from random variable Y. Then for any corresponding R and r, with the correct $\beta_0$, $r(X_{n\times p}l,\, y - X_{n\times p}\beta_0)$ has a distribution centered about zero, i.e. a null distribution, because $R(X'l,\, Y - X'\beta_0) = 0$. So if an expression for $R(X'l,\, Y - X'\beta)$ is known and is a continuous function of $\beta$, it can then be expanded about $\beta_0$ via a Taylor series whose first term is zero. It is known that if continuous bivariate random variables are independent, then the asymptotic (as n approaches infinity) null distributions for GDCC and Kendall's $\tau$ are as follows: $\sqrt{n}\,r_{gd}$ is N(0,1) and $\sqrt{n-1}\,\tau$ is N(0, 4/9), where $r_{gd}$ and $\tau$ (or $r_\tau$) are the sample versions of $R_{gd}$ and $R_\tau$, respectively. Thus, for example, $\sqrt{n-1}\,\tau(X_{n\times p}l,\, y - X_{n\times p}\beta_0)$ has an asymptotic N(0, 4/9) distribution because $R_\tau(X'l,\, Y - X'\beta_0) = 0$.
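These null distributions are easy to check numerically. Here is a minimal simulation sketch, not from the paper, confirming that $\sqrt{n-1}\,\tau$ has variance near 4/9 under independence (GDCC is omitted because its sample version is not implemented here):

```python
# Sketch: null distribution of Kendall's tau under independence.
# Checks the quoted asymptotic result: sqrt(n-1) * tau ~ N(0, 4/9).
import numpy as np
from scipy.stats import kendalltau

rng = np.random.default_rng(0)
n, reps = 200, 2000
stats = [np.sqrt(n - 1) * kendalltau(rng.normal(size=n),
                                     rng.normal(size=n))[0]
         for _ in range(reps)]
print("simulated variance: ", np.var(stats))  # near 4/9 ~ 0.444
print("theoretical variance:", 4 / 9)
```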
Here $R_\tau(X'l,\, Y - X'\beta) = (2/\pi)\sin^{-1}\rho_{X'l,\,Y - X'\beta}$, which is a continuous function of $\beta$ for the class of bivariate t distributions. Note that even though multivariate distributions are the sampling distributions, they are used only through linear combinations, so that correlation coefficients can be employed.
For tied values within the X data or within the Y data, this paper uses the max-min method
given in Gideon and Hollister (1987) as it is more general than the often used local
averaging method in that it can be applied to all rank correlation coefficients and allows
computer programs to run in all circumstances. This sets the stage for the developments in
this paper.
1. Introduction
In the research on GDCC it was found that the only way to perform inference on β , the
regression parameter, was to relate it to the asymptotic distribution of rgd through a Taylor
series for Rgd . Then the generality of the procedure was discovered and is given in this
paper, providing a unification not previously known. The result in the case of Pearson's
correlation coefficient is not new; only the derivation is unique. The method is applied to
Pearson's as a check on its validity and as the first step towards the actual new result.
Let $R(X_i, Y)$ be the population correlation coefficient for the ith component of the p-dimensional column random variable $X$ with $Y$. While any correlation coefficient (CC)
can be used, full examples are given for just two, Pearson's rp and the Greatest Deviation
(Gideon and Hollister, 1987); Kendall's Tau is also illustrated to some extent. The
continuous and discrete cases are done purposefully to show the comprehensiveness of the CES. Familiarity with GDCC is not necessary to follow the arguments, but a complete
description of this easy-to-use but complicated-to-explain correlation coefficient is given
in the references. The fact that a nonparametric correlation coefficient can be implemented
in a cohesive fashion in multiple regression is the key point. This work is part of a
correlation estimation system (CES) using correlation coefficients as the starting point to
perform many types of estimation, including time series and nonlinear and generalized
linear models. Parameter estimation on individual distributions can also be done. The
results open up a vast set of possibilities; enough has been done to be very optimistic about
the direction and usefulness of this work.
If β 0 is the vector of population regression parameters for an assumed linear system, then
the population values $R(X_i,\, Y - X'\beta_0)$, $i = 1, 2, \ldots, p$, are zero. For l a p-dimensional vector of constants, $R(X'l,\, Y - X'\beta_0)$ is the population correlation parameter of a linear
combination of the components of X with Y − X ′β 0 . The model assumption in this work is
that the error random variable, Y − X ′β 0 , is uncorrelated with X ′l ; that is
R( X ′l , Y − X ′β 0 ) = 0 . Now form a truncated Taylor series in β about β 0 for the function
of β , R( X ′l , Y − X ′β ) . Once this is done, the population variables in the series are
replaced by their sample equivalents to obtain a further approximation which leads to the
determination of the asymptotic distribution of linear combinations of the corresponding
slope estimates in a multiple linear regression. This is valid for both continuous and
nonparametric sample correlation coefficients.
While the normal distribution is typically used (because, in general, it is most familiar),
asymptotic distributions for correlation coefficients have been developed for distributions
with finite second moments. The process in this paper is general and could be used
whenever asymptotic distributions of correlation coefficients have been developed for
continuous bivariate distributions or for finite cases in which permutation arguments apply.
For example, the limiting distributions hold over a class of distributions for nonparametric
correlation coefficients. It is known, for example, that GDCC has the same population
value and limiting distribution for the bivariate Cauchy as for the bivariate normal, that is,
over the whole class of bivariate t distributions (Gideon and Hollister, 1987 and Gideon,
Prentice, and Pyke, 1989). So the technique is robust.
2. Derivation of the Classical Method through Correlation
As mentioned above, the method is first developed for Pearson's correlation coefficient
both to show that it is an alternate way to derive the same results as those that come from
existing Least Squares (LS) and classical normal methods and because the results are
needed in the next step. It is extended to NPCC using GDCC as an example. Because
Kendall's Tau and GDCC have the same population form for the bivariate t distributions
the result for GDCC is valid for Tau. The full rank multivariate normal model is used with
covariance matrix partitioned as follows into the response and regressor variables. As usual, let

$$\Sigma_{p+1,\,p+1} = \begin{pmatrix} \sigma_1^2 & \sigma_{12}' \\ \sigma_{12} & \Sigma_{22} \end{pmatrix},$$

where $\sigma_1^2$ is the variance of the response variate, $\sigma_{12}$ is the column vector of covariances of the response variable with the regressor variables, and $\Sigma_{22}$ is the p by p covariance matrix of the regressor variates. Let Y be the response variate and X the column vector of regressor variates. As above, let $\beta_0$ be the vector of population regression parameters. Then it is known that $\beta_0 = \Sigma_{22}^{-1}\sigma_{12}$. Let $\mu = E(Y)$ and $\mu_x = E(X)$, a p-dimensional vector. The regression model is $E(Y \mid X = x) = \mu + (x - \mu_x)'\beta_0$. The parameter $R(X_i, Y)$ is the correlation that corresponds to the ith element of $\sigma_{12}$. The
diagonal values of Σ 22 are variances and the off diagonal elements are covariances. When
these are not known, they can be estimated by a method that uses the same correlation
coefficient. Scale estimates using rgd are given in Gideon and Rothan (2007). Examples of
both known and unknown covariance structure appear below. The classical variance
estimation method is well known but in this paper an approach via correlation coefficients,
as given in Gideon and Rothan (2007), is used.
This maintains a consistent level of robustness between slope and variation estimation.
With l as above, a vector of constants, consider the correlation parameter as a function of
β , f ( β ) = R ( X ′l, Y − X ′β ) . For the correlations and continuous random variables under
consideration, f is a differentiable function of β . In order to relate the null distribution of
any correlation coefficient to linear combinations of the estimated slopes, f (β ) is
expanded into a truncated multivariate Taylor series about β 0 . Then the random variables
will be replaced by data and β by β̂ , the estimated slopes, so that an asymptotic
distribution can be used. The estimated slopes are computed using the chosen correlation
coefficient as noted below. Finally, using the asymptotic null distribution of r, the chosen
correlation coefficient, the asymptotic distribution of any linear combination of β̂ is found.
For convenience and without loss of generality let $\mu = 0$ and $\mu_x = 0$. We start by determining $f(\beta)$ in an explicit form and then taking its partial derivatives with respect to $\beta$. The goal is to write

$$f(\beta) \approx f(\beta_0) + \left.\frac{\partial f(\beta)}{\partial \beta}\right|_{\beta = \beta_0}(\beta - \beta_0),$$

a truncated Taylor series.
Start with R the usual population definition of correlation,

$$R(X'l,\, Y - X'\beta) = \frac{E(X'l\,(Y - X'\beta))}{\sqrt{V(X'l)\,V(Y - X'\beta)}},$$

and further $E(X'l\,(Y - X'\beta)) = l'E(XY) - l'E(XX')\beta = l'\sigma_{12} - l'\Sigma_{22}\beta$ and $V(X'l) = l'\Sigma_{22}l$, or $a(l)$ for short. Also observe that $V(Y - X'\beta) = \sigma_1^2 + \beta'\Sigma_{22}\beta - 2\beta'\sigma_{12}$, or $b(\beta)$ for short. Now $\beta_0 = \Sigma_{22}^{-1}\sigma_{12}$, so that $R(X'l,\, Y - X'\beta_0) = l'\sigma_{12} - l'\Sigma_{22}\Sigma_{22}^{-1}\sigma_{12} = 0$, and $b(\beta_0) = \sigma_1^2 - \sigma_{12}'\Sigma_{22}^{-1}\sigma_{12} = \sigma_{res}^2$, where res stands for residuals. We now expand

$$f(\beta) = R(X'l,\, Y - X'\beta) = \frac{l'\sigma_{12} - l'\Sigma_{22}\beta}{\sqrt{a(l)\,b(\beta)}}$$

into a Taylor series. Because

$$\frac{\partial f(\beta)}{\partial \beta} = \frac{\sqrt{a(l)b(\beta)}\,(-l'\Sigma_{22}) - (l'\sigma_{12} - l'\Sigma_{22}\beta)\,\dfrac{\partial\sqrt{a(l)b(\beta)}}{\partial\beta}}{a(l)\,b(\beta)}$$

and

$$\left.\frac{\partial f(\beta)}{\partial \beta}\right|_{\beta = \beta_0} = \frac{-l'\Sigma_{22}}{\sqrt{a(l)b(\beta_0)}} = \frac{-l'\Sigma_{22}}{\sqrt{l'\Sigma_{22}l}\;\sigma_{res}},$$

the truncated Taylor series is

$$f(\beta) = R(X'l,\, Y - X'\beta) \approx R(X'l,\, Y - X'\beta_0) - \frac{l'\Sigma_{22}}{\sqrt{l'\Sigma_{22}l}\;\sigma_{res}}(\beta - \beta_0).$$
Now we have arrived at the place to put in the sample equivalents. In this series, X is replaced by the n by p data matrix $X_{n\times p} = (x_1, x_2, \ldots, x_p)$, Y by y, and f is evaluated at $\hat\beta$, where $r_p(x_i,\, y - X_{n\times p}\hat\beta) = 0$, $i = 1, 2, \ldots, p$. In other words, the parameters have been replaced by estimates and random variables by data. The equation remains approximately true within sampling variation; i.e., the two remaining quantities are approximately equal in distribution. For Pearson's $r_p$ these p simultaneous equations are equivalent to the usual least squares normal equations, and the solution vector gives the standard least squares results. Both this and the GDCC formulation and technique for solving the set of equations appear in Rummel (1991); this is a generalized formulation of these normal equations, which for correlation coefficients other than $r_p$ are herein called regression equations. They are valid for any correlation coefficient. Some results are illustrated in Section 4. Because of the linearity of the covariance function, these p equations imply that $r_p(X_{n\times p}l,\, y - X_{n\times p}\hat\beta) = 0$. Thus,
$$f(\hat\beta) = r_p(X_{n\times p}l,\, y - X_{n\times p}\hat\beta) = 0 \approx r_p(X_{n\times p}l,\, y - X_{n\times p}\beta_0) - \frac{l'\Sigma_{22}}{\sqrt{l'\Sigma_{22}l}\;\sigma_{res}}(\hat\beta - \beta_0).$$

Because this difference is approximately zero, $r_p(X_{n\times p}l,\, y - X_{n\times p}\beta_0)$ and $\dfrac{l'\Sigma_{22}}{\sqrt{l'\Sigma_{22}l}\;\sigma_{res}}(\hat\beta - \beta_0)$ are approximately equal in distribution. Now the former term has a null distribution, since the outcomes $X_{n\times p}l$ and $y - X_{n\times p}\beta_0$ come from the independent random variables $X'l$ and $Y - X'\beta_0$. It can be inferred from Anderson (1958) and Burg (1975) that $\sqrt{n}\,r_p(X_{n\times p}l,\, y - X_{n\times p}\beta_0)$ has an asymptotic N(0,1) distribution. Consequently, $\dfrac{\sqrt{n}\,l'\Sigma_{22}}{\sqrt{l'\Sigma_{22}l}\;\sigma_{res}}(\hat\beta - \beta_0)$ has that same asymptotic distribution.
To relate this result to standard methods, start by transforming from vector l to vector k, where $l = \Sigma_{22}^{-1}k$; thus, $\Sigma_{22}l = k$ and $l'\Sigma_{22} = k'$. Then the quadratic form equality is $l'\Sigma_{22}l = k'\Sigma_{22}^{-1}k$. Hence,

$$\frac{\sqrt{n}\;k'(\hat\beta - \beta_0)}{\sqrt{k'\Sigma_{22}^{-1}k}\;\sigma_{res}}$$

has an asymptotic N(0,1) distribution. Thus from the connection with the Pearson correlation coefficient we have the following result:

$$k'(\hat\beta - \beta_0) \text{ is approximately } N\!\left(0,\, \frac{(k'\Sigma_{22}^{-1}k)\,\sigma_{res}^2}{n}\right) \tag{1}$$

where $\hat\beta$ solves the normal equations $r_p(x_i,\, y - X_{n\times p}\hat\beta) = 0$, $i = 1, 2, \ldots, p$.
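Result (1) can be checked by simulation. The sketch below is an illustration, not the paper's code: it assumes the Part 1 covariance matrix used later in Section 4 as a convenient test case, and solves the $r_p$ normal equations by least squares on centered data, which is equivalent as noted above.

```python
# Sketch: Monte Carlo check of result (1),
#   k'(beta_hat - beta_0) ~ N(0, k' Sigma22^{-1} k * sigma_res^2 / n).
import numpy as np

rng = np.random.default_rng(1)
Sigma = np.array([[3., 2., 2.],        # cov of (Y, X1, X2); Part 1 of Section 4
                  [2., 2., 1.],
                  [2., 1., 2.]])
sigma12, Sigma22 = Sigma[1:, 0], Sigma[1:, 1:]
beta0 = np.linalg.solve(Sigma22, sigma12)        # (2/3, 2/3)
sigma_res2 = Sigma[0, 0] - sigma12 @ beta0       # 1/3
k = np.array([1., -1.])
n, reps = 200, 3000
L = np.linalg.cholesky(Sigma)
contrasts = []
for _ in range(reps):
    d = rng.normal(size=(n, 3)) @ L.T            # rows are (y, x1, x2) draws
    y, X = d[:, 0], d[:, 1:]
    beta_hat = np.linalg.lstsq(X - X.mean(0), y - y.mean(), rcond=None)[0]
    contrasts.append(k @ (beta_hat - beta0))
print(np.var(contrasts))                         # compare with:
print(k @ np.linalg.solve(Sigma22, k) * sigma_res2 / n)
```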
The equivalent of result (1) in the classical least squares or normal theory fixed-x multiple linear regression model is introduced to show that the CES has produced the standard result. Assume now that the n by p data matrix has been chosen: $y = X_{n\times p}\beta + \varepsilon$, where $\varepsilon \sim N(0, \sigma^2 I)$ with I the n by n identity matrix. Let $X^* = (x_1^*, x_2^*, \ldots, x_p^*)$, where the * indicates that the data have been centered at the means. Then, as is well known, the sum of squares matrix is $X^{*\prime}X^*$ with $\hat\beta = (X^{*\prime}X^*)^{-1}X^{*\prime}y$ and $V(\hat\beta) = \sigma^2(X^{*\prime}X^*)^{-1}$. The distribution of $\hat\beta - \beta$ is multivariate normal: $MN(0, V(\hat\beta))$. Thus, the distribution of $k'(\hat\beta - \beta_0)$ is $N(0,\, \sigma^2 k'(X^{*\prime}X^*)^{-1}k)$.
The two notations for the two systems, CES and classical, are now readily related by equating the variances:

$$\frac{(k'\Sigma_{22}^{-1}k)\,\sigma_{res}^2}{n} = k'(X^{*\prime}X^*)^{-1}k\,\sigma^2.$$

For the fixed X case, $\Sigma_{22} = (X^{*\prime}X^*)/n$. For X a matrix of outcomes of random variables, $(X^{*\prime}X^*)/n$ estimates $\Sigma_{22}$. In Graybill (1976), under the null hypothesis that $k'\beta = m$ and under normality assumptions, the distribution of

$$\frac{k'\hat\beta - m}{\hat\sigma\sqrt{k'(X^{*\prime}X^*)^{-1}k}}$$

is Student's t with n−p degrees of freedom. With large n, the normal distribution is the asymptotic approximation.
3. Derivation of a Universal Method of Multiple Linear Regression through
Correlation
It is known (Gideon and Hollister, 1987) that for joint normal random variables $W_1, W_2$ the population value of $R_{gd}(W_1, W_2)$ is $\frac{2}{\pi}\sin^{-1}\rho_{W_1,W_2}$ (Kendall's Tau is the same), where $\rho_{W_1,W_2}$ is the bivariate normal correlation parameter between $W_1$ and $W_2$. For random variables $X'l$ and $Y - X'\beta_0$, set $f_1(\beta_0) = R_{gd}(X'l,\, Y - X'\beta_0) = \frac{2}{\pi}\sin^{-1}\rho_{X'l,\,Y - X'\beta_0}$. Note that while the form is the same, $f_1$ is used here instead of f to signify that a different correlation coefficient has been chosen. For normal random variables $R(X'l,\, Y - X'\beta) = \rho_{X'l,\,Y - X'\beta}$, and so the results of the previous section can be used.
truncated Taylor series for f1 ( β ) is
f1 ( β ) = R gd ( X ′l , Y − X ′β ) ≈ Rgd ( X ′l, Y − X ′β 0 ) +
The partial derivative is
∂
R ( X ′l, Y − X ′β ) β =β0 ( β − β 0 ) .
∂β gd
∂
2
Rgd ( X ′l, Y − X ′β ) =
∂β
π
1
1 − ρ 2X ′l ,Y − X ′β
∂
ρ ′
′
∂β X l ,Y − X l
β =β0
. At
β = β 0 , X ′l and Y − X ′β 0 are independent random variables, so ρ X ′l ,Y − X ′β0 = 0 , and the
11
latter partial derivative is, as before,
− l ′Σ 22
. The truncated Taylor series becomes
l′Σ 22lσ res
f1 ( β ) = R gd ( X ′l , Y − X ′β ) ≈ R gd ( X ′l , Y − X ′β 0 ) −
2
π
l ′Σ 22
( β − β0 ) .
l ′Σ 22 lσ res
Now solve (Rummel, 1991) the associated regression equations with data $X_{n\times p}$ and y:

$$r_{gd}(x_i,\, y - X_{n\times p}\hat\beta_{gd}) = 0, \quad i = 1, 2, \ldots, p.$$

$\hat\beta_{gd}$ is a solution vector with ith component $\hat\beta_{i,gd}$. Every correlation coefficient has a
similar set of regression equations; these correspond to the normal equations in the case of
Pearson's rp. These regression equations would have solutions β̂τ had Tau been chosen as
the correlation coefficient. The regression equations for Tau can be solved by iterations
involving the medians of elementary slopes. (See Sen, 1968 for the simple linear
regression case, and Gideon, 2008 for an illustrated look at this work specialized to
Kendall's Tau). The GDCC does not have the same linearity properties that Pearson's $r_p$ has, and so it is not necessarily true that $r_{gd}(X_{n\times p}l,\, y - X_{n\times p}\hat\beta_{gd})$ is exactly zero; however, computer simulations have shown that $r_{gd}(X_{n\times p}l,\, y - X_{n\times p}\hat\beta_{gd})$ is zero or very close to
zero. We repeat the above procedure by substituting the sample counterparts into the truncated series and evaluating $f_1$ at $\hat\beta_{gd}$. Even though $f_1(\hat\beta_{gd})$ may be only close to zero, simulations in this and other examples indicate that the asymptotic distribution theory is still good. Again $R_{gd}(X'l,\, Y - X'\beta_0)$ is zero, and its sample equivalent multiplied by $\sqrt{n}$, namely $\sqrt{n}\,r_{gd}(X_{n\times p}l,\, y - X_{n\times p}\beta_0)$, has an approximate N(0,1) distribution (Gideon, Prentice, and Pyke, 1989). It now follows as before that

$$\frac{2}{\pi}\,\frac{\sqrt{n}\,l'\Sigma_{22}}{\sqrt{l'\Sigma_{22}l}\;\sigma_{res}}(\hat\beta_{gd} - \beta_0)$$

has an approximate N(0,1) distribution. (For Kendall's Tau, $\frac{3}{2}\sqrt{n-1}\,\tau$ has an approximate N(0,1) distribution, and so $\frac{3}{2}\sqrt{n-1}$ is the multiplier to use on $\hat\beta_\tau - \beta_0$; i.e., in simple linear regression $\hat\beta_\tau - \beta_0$ has an asymptotic $N\!\left(0,\, \dfrac{\pi^2\sigma_{res}^2}{9(n-1)\sigma_x^2}\right)$ distribution. This was successfully tested via simulations on continuous and discrete data for one regressor variable.) Consequently, $l'\Sigma_{22}(\hat\beta_{gd} - \beta_0)$ is $N\!\left(0,\, \dfrac{\pi^2\,l'\Sigma_{22}l\;\sigma_{res}^2}{4n}\right)$.
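To make the Tau case concrete, here is a minimal sketch, with synthetic data and not from the paper, of the simple linear regression estimate solving the Tau regression equation. Following Sen (1968) it is taken as the median of the elementary pairwise slopes, and the rough interval uses the asymptotic variance just quoted.

```python
# Sketch: Tau regression in simple linear regression via the median of
# elementary slopes (Sen, 1968), with the asymptotic SE quoted above:
#   var(beta_tau_hat - beta_0) ~ pi^2 sigma_res^2 / (9 (n-1) sigma_x^2).
import numpy as np

def tau_slope(x, y):
    """Median of the elementary slopes (y_j - y_i)/(x_j - x_i), i < j."""
    i, j = np.triu_indices(len(x), k=1)
    keep = x[i] != x[j]                       # skip ties in x
    return np.median((y[j][keep] - y[i][keep]) / (x[j][keep] - x[i][keep]))

rng = np.random.default_rng(2)
n, beta0 = 100, 0.5
x = rng.normal(size=n)
y = beta0 * x + rng.normal(size=n)
b = tau_slope(x, y)
se = np.sqrt(np.pi**2 * np.var(y - b * x) / (9 * (n - 1) * np.var(x)))
print(b, (b - 1.96 * se, b + 1.96 * se))      # estimate and rough 95% interval
```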
Again we let $l'\Sigma_{22} = k'$. Thus,

$$k'(\hat\beta_{gd} - \beta_0) \text{ is approximately } N\!\left(0,\, \frac{\pi^2(k'\Sigma_{22}^{-1}k)\,\sigma_{res}^2}{4n}\right) \tag{2}$$

where $\hat\beta_{gd}$ solves the regression equations $r_{gd}(x_i,\, y - X_{n\times p}\hat\beta_{gd}) = 0$, $i = 1, 2, \ldots, p$.
This latter derivation can be used with any correlation coefficient whose population form and asymptotic distribution are known, illustrating the universality of the method. In particular, because Tau and GDCC have the same population value on the class of bivariate t distributions, an equation equivalent to (2) holds when the Tau regression equations are solved. Equivalent means that the correct asymptotic distribution must be used at the substitution step.
As a special case, let k be a vector of 0s except for a 1 in the ith position. The above result gives the asymptotic distribution of $\hat\beta_{i,gd} - \beta_{0i}$ as $N\!\left(0,\, \dfrac{\pi^2\sigma^{ii}\sigma_{res}^2}{4n}\right)$, where $\sigma^{ii}$ is the (i,i) element of $\Sigma_{22}^{-1}$. When p = 1, $\sigma^{11}$ is just the reciprocal of the variance of the regressor variable because $\Sigma_{22}$ is 1 by 1.
4. Illustration of the Correlation Approach
To illustrate the asymptotic results in the normal case, simulations were run for various
sample sizes. Examination of a large number of quantile plots showed that, whether using
rgd or rp , the distributions of the chosen linear combinations of the estimated β 's were
very similar. The normal was chosen because the exact distributions of linear combinations
of the components of the β -vector are known for Pearson's, so if GDCC results exhibit
similar characteristics, then this correlation technique is feasible.
The first example is in two parts, each with its own distinct covariance structure. The
hypothesis of interest is that β1 = β 2 . In the first part the null hypothesis is true and in the
second it is false. Data were generated by the linear transformation

$$\begin{pmatrix} y \\ x_1 \\ x_2 \end{pmatrix} = A \begin{pmatrix} z_1 \\ z_2 \\ z_3 \end{pmatrix},$$

where A is a 3 by 3 matrix and $(Z_1, Z_2, Z_3)$ are independent and identically distributed normal random variables with mean zero and variance one.
 1 1 1
 1 1 1




For Part 1, A = A1 =  0 1 1  and for Part 2, A = A2 =  0 2 1  .
 1 1 0
 1 1 0




y
 
 σ 12 σ 12 ' 
.
In both parts  x1  is distributed N (0, Σ) where Σ = AI 3 A' = 

σ 12 Σ 22 
x 
 2
14
The 2 by 2 matrix $\Sigma_{22}$ is the covariance matrix of the regressor variables. The regression slopes are $\beta_0 = \Sigma_{22}^{-1}\sigma_{12}$ and $\sigma_{res}^2 = \sigma_1^2 - \sigma_{12}'\Sigma_{22}^{-1}\sigma_{12}$.

For Part 1, $\Sigma = \begin{pmatrix} 3 & 2 & 2 \\ 2 & 2 & 1 \\ 2 & 1 & 2 \end{pmatrix}$, $\Sigma_{22}^{-1} = \frac{1}{3}\begin{pmatrix} 2 & -1 \\ -1 & 2 \end{pmatrix}$, $\beta_0 = \begin{pmatrix} 2/3 \\ 2/3 \end{pmatrix}$, $\sigma_{res}^2 = \frac{1}{3}$.

For Part 2, $\Sigma = \begin{pmatrix} 3 & 3 & 2 \\ 3 & 5 & 2 \\ 2 & 2 & 2 \end{pmatrix}$, $\Sigma_{22}^{-1} = \frac{1}{6}\begin{pmatrix} 2 & -2 \\ -2 & 5 \end{pmatrix}$, $\beta_0 = \begin{pmatrix} 1/3 \\ 2/3 \end{pmatrix}$, $\sigma_{res}^2 = \frac{2}{3}$.
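These population quantities are easy to reproduce; the following numpy sketch (not part of the original study) recomputes them from the generating matrices:

```python
# Sketch: recompute Sigma = A A', beta_0 = Sigma22^{-1} sigma12, and
# sigma_res^2 = sigma_1^2 - sigma12' Sigma22^{-1} sigma12 for both parts.
import numpy as np

A1 = np.array([[1., 1., 1.], [0., 1., 1.], [1., 1., 0.]])
A2 = np.array([[1., 1., 1.], [0., 2., 1.], [1., 1., 0.]])
for name, A in (("Part 1", A1), ("Part 2", A2)):
    Sigma = A @ A.T                      # cov of (Y, X1, X2) since Z ~ N(0, I)
    sigma12, Sigma22 = Sigma[1:, 0], Sigma[1:, 1:]
    beta0 = np.linalg.solve(Sigma22, sigma12)
    sigma_res2 = Sigma[0, 0] - sigma12 @ beta0
    print(name, beta0, sigma_res2)
# Part 1: beta0 = (2/3, 2/3), sigma_res^2 = 1/3
# Part 2: beta0 = (1/3, 2/3), sigma_res^2 = 2/3
```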
The conditional models for the two parts are then

Model 1: $Y \mid X = x$ is $N\!\left(\frac{2}{3}x_1 + \frac{2}{3}x_2,\; \frac{1}{3}\right)$; Model 2: $Y \mid X = x$ is $N\!\left(\frac{1}{3}x_1 + \frac{2}{3}x_2,\; \frac{2}{3}\right)$.
The correlation structure is given in the following table:

pop corr       Model 1                    Model 2
y, x1          $\sqrt{2/3}$               $\sqrt{3/5}$
y, x2          $\sqrt{2/3}$               $\sqrt{2/3}$
x1, x2         $1/2$                      $\sqrt{2/5}$
Y | X = x      $(2/3)\sqrt{2} = 0.942$    $(1/3)\sqrt{7} = 0.882$
GDCC           0.783                      0.687

Table 1: Correlation Structure

In this table GDCC is found by the inverse sine transformation of the multiple correlation coefficient $\rho$: $(2/\pi)\sin^{-1}\rho$ (Gideon and Hollister, 1987).
To apply the above for testing the hypothesis $\beta_1 = \beta_2$, use $k' = (1, -1)$. The value of $k'\Sigma_{22}^{-1}k$ is 2 for Model 1 and 11/6 for Model 2. From results (1) and (2) the asymptotic distributions of $\hat\beta_1 - \hat\beta_2$ for Pearson and GDCC and the two models are:

            Model 1                                          Model 2
Pearson     $N\!\left(0,\, \frac{2}{3n}\right)$              $N\!\left(0,\, \frac{22}{18n}\right)$
GDCC        $N\!\left(0,\, \frac{\pi^2}{4}\,\frac{2}{3n}\right)$      $N\!\left(0,\, \frac{\pi^2}{4}\,\frac{22}{18n}\right)$

Table 2: Distributions for the Two Models

Note that because $\sqrt{\pi^2/4} = \pi/2 \approx 1.57$, the standard deviation for GDCC is about 1.57 times larger.
Some details for Model 2 are now given. Simulations were run with $W_1 = \hat\beta_1 - \hat\beta_2$ and $W_2 = \hat\beta_{1,gd} - \hat\beta_{2,gd}$ recorded each time. Plots of W1 vs W2, quantile plots of W1 vs W2,
individual normal quantile plots for each of W1 and W2 (one is shown below) with
accompanying Kolmogorov-Smirnov tests of fit all gave better than expected results; i.e.,
the distributions of the β -contrasts comparing the classical methods and the GDCC
methods were similar. The distributions are nearly the same except for the scale factor.
Because the null hypothesis is false in this case the center point of the contrast is not zero
but near −1/3. The comparisons $W_3 = 2\hat\beta_1 - \hat\beta_2$ and $W_4 = 2\hat\beta_{1,gd} - \hat\beta_{2,gd}$, in which the null hypothesis is true, were also run. Here $k' = (2, -1)$ and the asymptotic distributions can be
calculated as shown above. Again W3 and W4 exhibited similar normal distributions but
the center points on the quantile plots were, of course, near zero.
Model 1 simulations also gave better than expected results. Additionally, simulations on
the individual β s, e.g. k ' = (1,0) , again produced good results. Sample sizes were run
from 10 to 100 with 50 simulations each. Although many sample quantile plots were
produced for various sample sizes, they are not given as they duplicated the plots already
shown. It is interesting that the null distribution of GDCC approaches normality slowly, but even for small (10-20) sample sizes, simulations of the distribution of linear combinations of slopes via result (2) show near normality.
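A minimal version of the Model 2 contrast simulation can be sketched as follows; it covers only the LS side, since running the GDCC side would require the $r_{gd}$ routines referenced in the paper:

```python
# Sketch: distribution of W1 = beta1_hat - beta2_hat under Model 2 (LS only).
# Table 2 gives the asymptotic variance 22/(18 n) from result (1); the mean
# should be near beta_01 - beta_02 = -1/3.
import numpy as np

rng = np.random.default_rng(3)
A2 = np.array([[1., 1., 1.], [0., 2., 1.], [1., 1., 0.]])
n, reps = 200, 4000
w1 = []
for _ in range(reps):
    d = rng.normal(size=(n, 3)) @ A2.T    # rows are (y, x1, x2) draws
    y, X = d[:, 0], d[:, 1:]
    b = np.linalg.lstsq(X - X.mean(0), y - y.mean(), rcond=None)[0]
    w1.append(b[0] - b[1])
print(np.mean(w1), np.var(w1))            # mean near -1/3, variance near:
print(22 / (18 * n))
```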
Figure 1: Normal Quantile Plots of W1, W2, W3, W4 (horizontal axes: quantiles of the standard normal; vertical axes: the simulated contrasts).
5. An example of simple and multiple regression with the 1992 Atlanta Braves team
record of 175 games
To illustrate the viability of this work, it is necessary to bring in ideas on variation
estimation that allow the full estimation procedure to be carried out. This example shows
how the correlation estimation technique (Web site www.math.umt.edu/gideon) can parallel the standard multiple regression analysis, thus showing that it is as viable as classical regression analysis. Estimation from correlation is not the standard approach, but not only is it as good as or better than classical least squares, it also gives a cohesiveness to the analysis, as it is valid for every correlation coefficient and rivals other robust techniques.
illustrated, including residual error, standard deviations of the slopes, multiple correlation
coefficient, and partial correlations. Rank based correlations devalue extreme values; this
allows GDCC to be more robust than classical least squares as is seen in this example.
The generality of the technique includes dealing with tied values, so a baseball example
was chosen because the data have numerous tied values. The work on this example started
after the 1992 baseball season, showing the slowness of the research process. This data
has extreme values but no outliers, so techniques of removing or reweighting are not
appropriate. Also, one expects that hits and runs are fairly highly correlated, so this type of
example tests the convergence of the numerical technique for the regression equations, the
Gauss-Seidel method. To demonstrate these concepts, several regressions are run on the
baseball data. In each of these, the response variable y is the length of a game in hours.
The first regression in sections 5.1 through 5.3 uses the first three of the following four
regressor variables.
x1 , the total number of runs by both teams in a game,
x2 , the total number of hits by both teams in a game,
x3 , the total number of runners by both teams left on base in a game,
x4 , the total number of pitchers used in a game by both teams.
The interest is in determining how various conditions in a game affect the length of the
game. The second regression, section 5.4, is a simple linear regression of time, y, on x4 .
The third regression, also section 5.4, uses all four of the regressor variables. The main
purpose is to use the asymptotic distributions of the slopes to compare least squares (LS) to the correlation estimation system (CES). This is accomplished by using the Pearson
correlation coefficient for LS regression and the Greatest Deviation correlation coefficient
(GDCC) for the CES. This comparison shows that the CES in this paper is a competitive
regression technique. Because GDCC works, then other nonparametric correlation
coefficients also work: Kendall's Tau, Spearman's rho, Gini (1914), absolute value, and
others (Gideon, 2007). The residual standard deviations are compared and surprisingly the
one derived from GD is less than that of LS. Also the multiple CCs are computed and one
partial CC is computed. Quantile plots on the residuals are discussed.
Although time is a continuous random variable, all the regressor variables are discrete; so
at best for the classical analysis only an approximate multivariate normal distribution
would model the data. All classical inference is based on the normal distribution or central
limit theorems that give asymptotic results. Although the CES is based on limit theorems
on continuous data, the results appear good even though all the regressor variables are
discrete. However, more work needs to be done on the asymptotics on discrete data to
support the theory implied by this example.
5.1 Example of the Regression of Time on Three Variates
For any CC and in particular for the NPCC rgd , the regression equations for the first
example are rgd ( xi , y − b1 x1 − b2 x 2 − b3 x3 ) = 0, i = 1,2,3. Thus, the regressor variables
are uncorrelated with the regression residuals. The intercept of the regression is obtained
by taking the median of these residuals:

$$b_0 = \mathrm{median}(y - b_1x_1 - b_2x_2 - b_3x_3).$$
The residual SD is obtained by the methods of Gideon and Rothan (2007, in process; also on the Web site); i.e., a simple linear regression of the sorted residuals (least to greatest) on the ordered N(0,1) quantiles. Let quan and res represent these ordered vectors. Then the estimated SD, s, which is also the slope of the GD regression line, is found as the solution to

$$r_{gd}(\mathrm{quan},\, \mathrm{res} - s \cdot \mathrm{quan}) = 0. \tag{5.1.1}$$
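As a rough illustration of equation (5.1.1), here is a sketch that substitutes Pearson's r for $r_{gd}$ (the GDCC solver lives in the C routines mentioned below), for which the defining equation reduces to the ordinary regression slope; the plotting positions used are one common convention and are an assumption here:

```python
# Sketch: quantile-plot scale estimate in the spirit of equation (5.1.1),
# with Pearson's r standing in for r_gd. Solving r(quan, res - s*quan) = 0
# for Pearson's r gives the usual least squares slope.
import numpy as np
from scipy.stats import norm

def quantile_plot_sd(residuals):
    res = np.sort(residuals)
    n = len(res)
    quan = norm.ppf((np.arange(1, n + 1) - 0.5) / n)   # assumed plotting rule
    return np.cov(quan, res)[0, 1] / np.var(quan, ddof=1)

rng = np.random.default_rng(4)
print(quantile_plot_sd(rng.normal(scale=0.25, size=175)))  # near 0.25
```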
(5.1.1)
For this example, Splus was used with some accompanying C routines for the GDCC
technique; the lm command, linear models, was used for LS. The results are
GDCC: $\hat y = 1.8374 + 0.04908x_1 - 0.01022x_2 + 0.05479x_3$, $s = \hat\sigma_{gd} = 0.2518$, or $(0.2518)(60) = 15.1$ minutes;

LS: $\hat y = 1.7179 + 0.04459x_1 - 0.01079x_2 + 0.06910x_3$, $s = \hat\sigma_{LS} = 0.2919$ on 171 degrees of freedom, or $(0.2919)(60) = 17.5$ minutes.

Note that $\hat\sigma_{gd} < \hat\sigma_{LS}$, which suggests that the CES is viable.
A large number of comparisons of these two regression planes were run. An analysis of the
residuals and normal quantile plots revealed that the difference between the two regression
fits is almost negligible. Both have about 15 data points that are away from normality on
the normal quantile plot. For Pearson's the least squares fit was performed on the residuals
regressed on the normal quantiles whereas the GD technique was used in the other quantile
plot. However, when the line is drawn through the normal quantile plots as described above, there is one difference. The least squares plot has a slight wave around normality
but GD does not. Both show a similar pattern for the 15 extreme points. Both have about
the same distribution of residuals. Plots of the Pearson residuals versus the GD residuals
revealed very little difference. Many plots and counting comparisons were run in order to
find a significant preference but none could be found. So in what follows, the important
thing is that this nonparametric correlation coefficient gives results that are comparable to
least squares for these three regressor variables on this particular data set.
The three most extreme data points arise from games 28, 86, and 147 which are games of
lengths 16, 10 and 12 innings, respectively. There are a total of 18 extra-inning games and
these could all be considered non-standard data even though all four regressor variables
relate well to the times of these games. It was verified that deleting the most serious of
these extremes made LS more like the GD regression, again showing the value of the CES,
because no data points should be deleted for a realistic analysis.
The asymptotics in this article are now illustrated for this example. The results and all the
calculations are given in order to encourage the broader development of regression
techniques with correlation coefficients.
Standard errors of the regression coefficients, z scores, and P-values:

              SE                   z score             P-value
slopes    GD        LS         GD       LS          GD       LS
b1        0.01318   0.0094     3.72     4.77        .0002    .0000
b2        0.01324   0.0101     -0.77    -1.07       .44      .2867
b3        0.00998   0.0077     5.49     9.02        .0000    .0000

Table 3: Statistics on the GD and LS Slope Estimates
The calculation of the estimated standard errors of the slopes is given next. From the analysis above the asymptotic distributions are

LS: $N\!\left(\beta_i,\, \dfrac{\sigma^{ii}\sigma_{res}^2}{n}\right)$ and GD: $N\!\left(\beta_i,\, \dfrac{\pi^2}{4}\,\dfrac{\sigma^{ii}\sigma_{res}^2}{n}\right)$,

where n = 175 and for GD the estimate of $\sigma_{res}^2$ is the square of the slope of the GD regression line of the sorted residuals, $y - \hat y$, on N(0,1) quantiles. For LS, the linear models command from Splus was used with the standard calculation, although the CES estimate of $\sigma_{res}^2$ coming from a quantile plot with Pearson's CC was very close to the LS result (equation 5.1.1 using $r_p$). Recall that $\Sigma_{22}$ is the 3 by 3 covariance matrix of the regressor variables, $\sigma_i^2$ denotes its ith diagonal element, and $\sigma^{ii}$ is the ith diagonal element of $\Sigma_{22}^{-1}$.
For the GD case, $\hat\Sigma_{22}$ was obtained from the correlation matrix and the diagonal matrix of standard deviations. The GD estimates of the SDs, $\hat\sigma_i$, $i = 1, 2, 3$, are obtained similarly to equation 5.1.1 by using the sorted data in place of res. The 3 by 3 GD correlation matrix $\hat\Sigma_{gd}$ is used in the form in which each GD correlation was transformed to an estimate of a bivariate normal (or bivariate Cauchy) correlation by $\hat\rho = \sin(\pi r_{gd}/2)$. In other words,

$$\hat\Sigma_{22} = \begin{pmatrix} \hat\sigma_1 & 0 & 0 \\ 0 & \hat\sigma_2 & 0 \\ 0 & 0 & \hat\sigma_3 \end{pmatrix} \hat\Sigma_{gd} \begin{pmatrix} \hat\sigma_1 & 0 & 0 \\ 0 & \hat\sigma_2 & 0 \\ 0 & 0 & \hat\sigma_3 \end{pmatrix}.$$

The SDs and correlations needed for all of these calculations are now given in the following tables.
      y        x1       x2       x3       x4
y     1      0.4835   0.6053   0.6745   0.7201
x1             1      0.7686   0.2308   0.6025
x2                      1      0.6117   0.6279
x3                               1      0.4764
x4                                        1

Table 4: Pearson's correlation coefficient, rp
       y         x1        x2        x3        x4
y      1       0.3736    0.4138    0.4023    0.4885
x1   (0.5537)    1       0.5690    0.1839    0.4023
x2   (0.6052) (0.7794)     1       0.4080    0.3678
x3   (0.5907) (0.2849)  (0.5979)     1       0.2529
x4   (0.6942) (0.5907)  (0.5461)  (0.3869)     1

Table 5: GDCC; the upper triangle is rgd, the lower triangle (in parentheses) is sin(πrgd/2)
In Table 6, LS is the classical least squares estimate and GD uses a GD fitting on the quantile plot.

      y        x1       x2       x3       x4
LS  0.4420   4.1994   4.7855   4.1457   2.1885
GD  0.4003   3.8825   4.6199   4.0076   2.1468

Table 6: Estimates of the standard deviations of the regression variables
From these tables the covariance matrix $\hat\Sigma_{22}$ can be formed and inverted to obtain, for the LS (Pearson) and CES (GD) cases, the following.

For Pearson the calculation gives

$$\hat\Sigma_{22}^{-1} = \begin{pmatrix} 0.1785 & -0.1570 & 0.0691 \\ -0.1570 & 0.2079 & -0.1101 \\ 0.0691 & -0.1101 & 0.1198 \end{pmatrix}.$$

For GD the calculation gives

$$\hat\Sigma_{22}^{-1} = \begin{pmatrix} 0.1943 & -0.1548 & 0.0530 \\ -0.1548 & 0.1962 & -0.0925 \\ 0.0530 & -0.0925 & 0.1114 \end{pmatrix}.$$
Again recall that $\sigma^{ii}$ is the notation for the (i,i) element in $\Sigma_{22}^{-1}$. From the asymptotic results connecting slope estimates and correlation, for Pearson's $r_p$

$$\hat\sigma_{\hat\beta_i} = \sqrt{\frac{\hat\sigma^{ii}\hat\sigma_{res}^2}{n}} = \frac{0.2919}{\sqrt{175}}\sqrt{\hat\sigma^{ii}} = (0.02207)\sqrt{\hat\sigma^{ii}},$$

and this equals 0.009324 for i = 1, 0.010063 for i = 2, and 0.0077 for i = 3. These agree closely with the standard errors taken directly from the Splus linear models (lm) command.

For GD,

$$\hat\sigma_{\hat\beta_i} = \sqrt{\frac{\pi^2\hat\sigma^{ii}\hat\sigma_{res}^2}{4n}} = \sqrt{\frac{\pi^2(0.2518)^2}{4(175)}}\sqrt{\hat\sigma^{ii}} = (0.02990)\sqrt{\hat\sigma^{ii}},$$

and this equals

(i = 1): $(0.02990)\sqrt{0.1943} = 0.01318$
(i = 2): $(0.02990)\sqrt{0.1962} = 0.01324$
(i = 3): $(0.02990)\sqrt{0.1114} = 0.00998$.
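The GD standard errors above can be reproduced from the published tables alone; a short sketch (the numbers are the Table 5 and Table 6 entries for x1, x2, x3):

```python
# Sketch: rebuild Sigma22_hat = D * R_hat * D from Tables 5 and 6, invert it,
# and form the GD slope SEs: sqrt(pi^2 * sigma^{ii} * sigma_res^2 / (4 n)).
import numpy as np

r_gd = np.array([[1.0000, 0.5690, 0.1839],    # Table 5 (x1, x2, x3 block)
                 [0.5690, 1.0000, 0.4080],
                 [0.1839, 0.4080, 1.0000]])
R = np.sin(np.pi * r_gd / 2)                  # entrywise sin(pi r_gd / 2)
D = np.diag([3.8825, 4.6199, 4.0076])         # GD SDs from Table 6
Sigma22_inv = np.linalg.inv(D @ R @ D)
n, sigma_res = 175, 0.2518
se = np.sqrt(np.pi**2 * np.diag(Sigma22_inv) * sigma_res**2 / (4 * n))
print(se)   # compare with Table 3: about (0.01318, 0.01324, 0.00998)
```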
Note that the inference on the significance of the slopes using P-values is essentially the
same whether LS or GD regression is used. The estimated values βˆi are somewhat
different. The smallest standard error on the residuals of the regression is from GD not LS
(15.1 minutes versus 17.5 minutes). It is always the case that GD regression is less
influenced by the extremes than LS and in inference like this, one probably wants
knowledge for the standard games and not to be swayed by a few unusual games. The
smaller z scores for GD, which come from the $\pi/2$ factor in the SD, are the price paid for NP inference, but the inference is valid over a larger class of distributions. References for the above
work are on the Web site.
5.2 Example of Partial Correlation
In order to more fully compare LS and GD regression, the partial correlation of Y and X2
was computed deleting the effects of X1 and X3. The variable X2 was chosen because the
Pearson and GD correlations with Y are nearly the same and positive but in the regression
in section 5.1 the coefficient of X2 is negative. Recall that rp ( y, x 2 ) = 0.6053 and
rgd ( y, x2 ) = 0.4138 with sin(πrgd / 2) = 0.6052 . Also, the outcomes of variable X2, total
number of hits, have many ties and so this choice provides a good test of the NPCC tied
value algorithm. For each CC the regressions of Y on X1 and X3 and X2 on X1 and X3 must
be computed in order to obtain residuals and then the correlations of these residuals give
the partial correlations.
The regressions are, for LS:

$$\hat y = 1.6807 + 0.03645x_1 + 0.06339x_3 \quad\text{and}\quad \hat x_2 = 3.4463 + 0.7553x_1 + 0.5296x_3.$$

Therefore for LS the partial correlation is $r_p(y - \hat y,\, x_2 - \hat x_2) = -0.08146$.

For GD:

$$\hat y = 1.7982 + 0.04030x_1 + 0.05083x_3 \quad\text{and}\quad \hat x_2 = 3.5798 + 0.7647x_1 + 0.5011x_3.$$

The partial correlation is $r_{gd}(y - \hat y,\, x_2 - \hat x_2) = -0.04023$ with $\sin(\pi(-0.04023)/2) = -0.06315$.
Thus, it is seen that the Pearson and GD partial CC of Y and X2 removing the effects of X1
and X3 are very similar.
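The partial-correlation mechanics are simple to sketch; the version below uses Pearson's r in place of $r_{gd}$ and synthetic stand-in data, since the Braves data are not reproduced here:

```python
# Sketch: partial correlation of y and x2 given (x1, x3), via residuals of
# the two auxiliary regressions, as in Section 5.2.
import numpy as np

def ls_residuals(v, others):
    X = np.column_stack([np.ones(len(v)), others])
    coef, *_ = np.linalg.lstsq(X, v, rcond=None)
    return v - X @ coef

def partial_corr(y, x2, others):
    return np.corrcoef(ls_residuals(y, others), ls_residuals(x2, others))[0, 1]

# synthetic stand-in data, roughly shaped like the baseball example
rng = np.random.default_rng(5)
n = 175
x1 = rng.poisson(9, n).astype(float)    # "runs"-like counts
x3 = rng.poisson(15, n).astype(float)   # "left on base"-like counts
x2 = 0.8 * x1 + rng.poisson(10, n)      # "hits"-like, correlated with x1
y = 1.8 + 0.05 * x1 - 0.01 * x2 + 0.055 * x3 + rng.normal(0, 0.25, n)
print(partial_corr(y, x2, np.column_stack([x1, x3])))
```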
5.3 The Multiple CC of the Regression.
The multiple CC of Y on X1, X2, and X3 is

LS: $r_p(y, \hat y) = \sqrt{0.5713} = 0.7558$;
GD: $r_{gd}(y, \hat y) = 0.5287$ and $\sin(\pi r_{gd}/2) = 0.7383$.
This result is in slight disagreement with the SD of the residuals in that GD gives a smaller
multiple CC than does Pearson indicating a slightly looser relationship. It is known that GD
underestimates ρ .
5.4 Examples of Regression of Time on Four Variates and One Variate
The regression of Y on just X4 and then on all four regressor variables is now given so that
the LS and GD methods can be compared. The CCs are rp ( y, x 4 ) = 0.7201 and
rgd ( y, x 4 ) = 0.4885 with sin(πrgd / 2) = 0.6942 . X4 has a higher correlation with Y for
both CCs than for each of the other three regressor variables. Hence, possibly an important
predictor variable has been left out of the regression equation. Here are all the relevant
regression equations:

LS: $\hat y = 1.9167 + 0.1454x_4$ with $\hat\sigma_{res} = 0.3076$;
GD: $\hat y = 1.9000 + 0.1500x_4$ with $\hat\sigma_{res} = 0.2769$.

LS: $\hat y = 1.5753 + 0.0217x_1 - 0.0127x_2 + 0.0533x_3 + 0.0897x_4$
with $r_p(y, \hat y) = \sqrt{0.6332} = 0.8205$ and $\hat\sigma_{res} = 0.2556$ on 170 degrees of freedom;
GD: $\hat y = 1.7473 + 0.0457x_1 - 0.0310x_2 + 0.0567x_3 + 0.0718x_4$
with $r_{gd}(y, \hat y) = 0.5805$, $\sin(\pi r_{gd}/2) = 0.7906$, and $\hat\sigma_{res} = 0.2280$.

From the "lm" command in Splus the P-values for variables 1, 2, 3, and 4 are, respectively, 0.0142, 0.1514, 0.0000, and 0.0000.
In the four-variable regression LS has a slightly higher multiple CC but GDCC has a
slightly smaller residual SE, somewhat at odds. The two slopes with the biggest difference
between the two regressions are for X1 and X2. The coefficient of X2 for GD is over twice
as large as for LS. If the value of the X2 coefficient for GD had been the LS coefficient, it
would have been very significant — as the P-value (0.00027) is much lower than the
0.1514 given above (t-value -3.52). The X1 coefficient in the GD regression is also more
than twice that of LS. So there is a real difference in these regressions, giving further
justification for the use of the CES.
The normal quantile plots of the residuals for the four-variable regression reveal fewer
unusual games than the three-variable regression did. See the graphs below and note the
different vertical scales. There are now 5 extremes for LS, and 4 to 6 (depending on visual
judgement) for GD, with all the remaining points lying very close to the GD straight line
through the points. However, the distance from the GD line to the unusual points is much
greater for GD than for LS. This explains one of the differences in the regression output.
When the GD line is compared to the LS line on the normal quantile residual plot for the
LS regression, the GD gives a better evaluation criterion. This is because the GD line goes
through more of the sorted residuals and is not swayed by the extremes. So visually one can
check more easily for normality. GD obtains a smaller residual SE by not weighting the
very few unusual points as much as LS does. Whether or not the difference in the coefficients and the GDCC standard deviation of (0.2280)(60) = 13.7 minutes, compared to (0.2556)(60) = 15.3 minutes, is meaningful to a data analysis is entirely subjective. In the current problem on the length of major league baseball games, with the idea that they are too long, the GD analysis would be more appropriate for this analyst.

Figure 2: Normal Quantile Residual Plots for LS and GD Four-Variable Regression (vertical axes: LS and GD sorted residuals; horizontal axes: quantiles of the standard normal)
For the simple linear regression of time on X4, total number of pitchers used, the asymptotic inference on the slope is now given. The SE of the slope coefficient is calculated as follows. Because in this case $\hat\Sigma_{22}$ is a 1 by 1 matrix, the inverse is $\hat\sigma^{11} = 1/\hat\sigma_4^2$, where $\hat\sigma_4^2$ is the estimate of the variance of $x_4$ by whatever method is being used. The formulas for LS and GD are

Pearson or LS: $\hat\sigma_{\hat\beta} = \sqrt{\dfrac{\hat\sigma_{res}^2}{n\hat\sigma_4^2}} = \sqrt{\dfrac{(0.3076)^2}{175(4.7895)}} = 0.0106$ with z score 13.7;

GD: $\hat\sigma_{\hat\beta} = \sqrt{\dfrac{\pi^2\hat\sigma_{res}^2}{4n\hat\sigma_4^2}} = \sqrt{\dfrac{\pi^2(0.2769)^2}{4(175)(4.6088)}} = 0.0153$ with z score 9.8.

As in the other examples, the SE of the slope is higher for the NPCC than for the classical case. However, the real question is which one is more reliable over a variety of data sets. As a check on the LS value, direct from the Splus "lm" command, $\hat\sigma_{\hat\beta} = 0.0106$.
A small simulation study was done on the distribution of the GDCC on the variables time
of game (continuous) and number of pitchers (discrete); i.e., variables Y and X4 above.
This study showed that indeed the asymptotic distribution is fairly normal even for small to
moderate sample sizes (>10). The implication is that the Taylor series process described
earlier works just as well when some variables are discrete. A second implication may be
that the max-min tied value procedure hastens the convergence to normality. It would be
nice to have a proof of this.
6. Conclusion
This article sets the framework for a very general method of multiple regression based on
the distribution and population values of correlation coefficients. The quality of the GD
results should eliminate lingering doubts as to the validity of this and other NP methods in
linear regression. There are six other correlations in Gideon (2007) to which the process
in this article can be applied. Some comments are given on the use of Kendall's Tau in the
CES method. Which correlation to use on a particular data set is also a research question.
The L-one correlation coefficient in Gideon (2007) could be profitably used whenever L-one methods are appropriate. A long-term goal would be to have a computer package that
can select different correlation coefficients on which to perform the multiple regression
analysis. Another important goal for this article was to reemphasize that the problem of
tied values is apparently not an issue when the max-min method is used. Thus, the CES
using rank based or continuous (including Pearson's which is equivalent to least squares)
correlation coefficients in multiple linear regression estimation is not only a very viable
technique, but also provides a cohesion not always found in other methods. These results
can segue into other estimation areas of statistics; some of these ideas can be pursued by
consulting the Web site. They include for example, estimation in nonlinear regression,
generalized linear models, and time series (Sheng, 2002). The CES provides a simple way
(if computer programs have been written) to use robust methods in these latter areas
without having to resort to data manipulation.
Credits: Special thanks to former students, Steven Rummel, Jacquelynn Miller, and
especially to Carol Ulsafer, collaborator and editorial assistant.
7. References
Anderson, T. W. (1958). An Introduction to Multivariate Statistical Analysis. John Wiley
and Sons, N.Y.
Burg, Karl V. (1975). Statistical Models in Applied Sciences. John Wiley and Sons, N.Y.
Gibbons, J. D. and Chakraborti, S. (1992). Nonparametric Statistical Inference, 3rd ed. Marcel Dekker, Inc., N.Y.
Gideon, R. A. (2008). Kendall's τ in Correlation and Regression, in progress.
Gideon, R. A. (2007). The Correlation Coefficients, Journal of Modern Applied
Statistical Methods, 6, No. 2, in progress.
Gideon, R. A., Prentice, M. J., and Pyke, R. (1989). The Limiting Distribution of the
Rank Correlation Coefficient rgd. In: Contributions to Probability and Statistics (Essays in
Honor of Ingram Olkin) edited by Gleser, L. J., Perlman, M. D., Press, S. J., and Sampson,
A. R. Springer-Verlag, N.Y., 217-226.
Gideon, R. A. & Hollister, R. A. (1987). A Rank Correlation Coefficient Resistant to
Outliers, Journal of the American Statistical Association 82, no.398, 656-666.
Gideon, R. A. and Rothan, A. M., CSJ (2007). Location and Scale Estimation with
Correlation Coefficients. Communications in Statistics-Theory and Methods, under review.
Gini, C. (1914). L'Ammontare c la Composizione della Ricchezza della Nazioni, Bocca,
Torino.
Graybill, F. A. (1976). Theory and Application of the Linear Model. Duxbury Press,
Massachusetts.
Miller, Jacquelynn (1995). Multiple Regression Development with GDCC, Masters
Thesis. University of Montana.
Rummel, Steven E. (1991). A Procedure for Obtaining a Robust Regression Employing the
Greatest Deviation Correlation Coefficient, Unpublished Ph.D. Dissertation, University of
Montana, Missoula, MT 59812, full text accessible through UMI ProQuest Digital
Dissertations.
Sen, P.K. (1968). Estimates of the Regression Coefficient based on Kendall's Tau. Journal
of the American Statistical Association, 63: 1379-1389.
Sheng, HuaiQing (2002). Estimation in Generalized Linear Models and Time Series Models with Nonparametric Correlation Coefficients, Unpublished Ph.D. Dissertation,
University of Montana, Missoula, MT 59812, full text accessible through
http://wwwlib.umi.com/dissertations/fullcit/3041406
Web site www.math.umt.edu/gideon.