Chapter 12.5: Correlation
Instructor: Dr. Arnab Maity
Suppose we have two variables x and y. Regression analysis uses one variable (x) to predict the other (y) through a linear relationship. However, often we may want to describe the joint behavior of the two variables together rather than predict one using the other. In this chapter, we learn about the sample correlation coefficient r as a measure of how strongly two variables x and y are related in a sample.
Recall: Given two random variables X and Y, the correlation coefficient, or the population correlation coefficient, is defined as

    ρ = Cov(X, Y) / √(V(X) V(Y)).

This measures the strength of the relationship between X and Y.
The correlation coefficient has the following properties.
• It does not have any units.
• −1 ≤ ρ ≤ 1.
• If X and Y are independent, then ρ = 0, but ρ = 0 does not imply independence.
• ρ = −1 or ρ = 1 if and only if Y = a + bX for some number a and nonzero number b
(that is, Y and X are linearly related).
These properties tell us that ρ measures the strength of the linear relationship between X and Y. When ρ = 0, X and Y are said to be uncorrelated. They can still be dependent if there is a strong nonlinear relationship between them.
Example: Let X be a discrete random variable which takes the three values −1, 0, and 1 with probability 1/3 each, and let Y = X². Then we have

    E(X) = 0 and E(XY) = E(X³) = 0,

and as a result Cov(X, Y) = E(XY) − E(X)E(Y) = 0. This also means ρ = 0. But X and Y are clearly dependent (as Y = X²).
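To make the example concrete, here is a minimal Python check (not from the book) that evaluates the population formulas directly over the three support points:

    # X is uniform on {-1, 0, 1} and Y = X^2: the covariance is zero
    # even though Y is a deterministic function of X.
    xs = [-1, 0, 1]
    p = 1 / 3                                # P(X = x) for each support point
    EX = sum(p * x for x in xs)              # E(X) = 0
    EY = sum(p * x**2 for x in xs)           # E(Y) = E(X^2) = 2/3
    EXY = sum(p * x * x**2 for x in xs)      # E(XY) = E(X^3) = 0
    print(EXY - EX * EY)                     # Cov(X, Y) = 0, so rho = 0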
A value of ρ near 1 does not necessarily imply that increasing the value of X causes Y to
increase. It implies only that large X values are associated with large Y values. Be careful:
association (a high correlation) is not the same as causation.
Now we will define a sample version of the correlation coefficient. Suppose that we observe n pairs of data points (x1, y1), …, (xn, yn). It is natural to say that the variables x and y are
positively related if large values of x are paired with large values of y and small values of x
are paired with small values of y. Similarly, we may say that x and y are negatively related
if large values of x are paired with small values of y and small values of x are paired with
large values of y.
To define the sample version of correlation, we first define the quantity

    Sxy = Σ_{i=1}^n (xi − x̄)(yi − ȳ).
Note that if the two variables are positively related, then xi values above x̄ will tend to be paired with yi values above ȳ. Thus the product (xi − x̄)(yi − ȳ) will be positive. Similarly, xi values below x̄ will tend to be paired with yi values below ȳ, again making the product (xi − x̄)(yi − ȳ) positive. Thus the quantity Sxy will be positive if there is a positive relationship between x and y. A similar argument can be made for a negative relationship as well.
The sample correlation coefficient is defined as

    r = Sxy / √(Sxx Syy),

where

    Sxx = Σ_{i=1}^n (xi − x̄)²   and   Syy = Σ_{i=1}^n (yi − ȳ)².
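As a quick illustration, the following Python sketch computes Sxy, Sxx, Syy, and r from the definitions above and compares the result with NumPy's built-in computation; the data vectors x and y are made up for illustration only:

    import numpy as np

    x = np.array([4.0, 5.2, 6.1, 7.0, 8.3])
    y = np.array([9.5, 8.1, 7.2, 5.9, 4.1])

    Sxy = np.sum((x - x.mean()) * (y - y.mean()))
    Sxx = np.sum((x - x.mean()) ** 2)
    Syy = np.sum((y - y.mean()) ** 2)
    r = Sxy / np.sqrt(Sxx * Syy)

    print(r)                         # r from the definition above
    print(np.corrcoef(x, y)[0, 1])   # should match NumPy's value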
Properties of r
1. The value of r does not depend on which of the two variables under study is labeled x
and which is labeled y.
2. The value of r is independent of the units in which x and y are measured.
3. −1 ≤ r ≤ 1.
4. r = 1 if and only if all (xi , yi ) pairs lie on a straight line with positive slope, and r = −1
if and only if all (xi , yi ) pairs lie on a straight line with negative slope.
5. The square of the sample correlation coefficient gives the value of the coefficient of
determination that would result from fitting the simple linear regression model.
Property 1 is the exact opposite of what happens in regression analysis, where all the quantities of interest (e.g., slope, intercept, and error variance) depend on which of the two variables
is the response/dependent variable.
On the other hand, Property 5 tells us that the proportion of variation in the dependent
variable explained by fitting the simple linear regression model does not depend on which of
the two variables is the response/dependent variable.
Example: Recall the Arsenic data example discussed in the previous lecture. We saw data
on x = pH and y = arsenic removed (%) by a particular process. Data for this example are
shown in the book (Example 12.2).
Simple linear regression results:
Dependent Variable: Arsenic removed
Independent Variable: pH
Arsenic removed = 190.268 - 18.034 pH
Sample size: 18
R (correlation coefficient) = -0.950
R-sq = 0.903
Estimate of error standard deviation: 6.125
Simple linear regression results:
Dependent Variable: pH
Independent Variable: Arsenic removed
pH = 10.350 - 0.050 Arsenic removed
Sample size: 18
R (correlation coefficient) = -0.950
R-sq = 0.903
Estimate of error standard deviation: 0.322
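The two outputs above can be mimicked in code. The sketch below fits the regression both ways and checks that R² equals r² in either direction, illustrating Properties 1 and 5; the data are made up for illustration (the arsenic data themselves are in the book), and the helper r_squared is our own:

    import numpy as np

    x = np.array([7.0, 7.5, 8.0, 8.5, 9.0, 9.5])       # hypothetical pH-like values
    y = np.array([60.0, 52.0, 48.0, 39.0, 28.0, 25.0])  # hypothetical responses

    def r_squared(resp, pred):
        # R^2 from regressing resp on pred: 1 - SSE/SST
        b1, b0 = np.polyfit(pred, resp, 1)
        resid = resp - (b0 + b1 * pred)
        return 1 - np.sum(resid ** 2) / np.sum((resp - resp.mean()) ** 2)

    r = np.corrcoef(x, y)[0, 1]
    print(r ** 2, r_squared(y, x), r_squared(x, y))  # all three agree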
Property 2 indicates that the sample correlation coefficient remains the same even if we change the scale of measurement (e.g., meters vs. feet, kilograms vs. pounds) or apply a location shift, as the numerical check below illustrates.
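For instance, converting x from meters to feet (x → 3.2808x) or y from pounds to kilograms (y → 0.4536y), with or without an added shift, leaves r unchanged. A minimal check with simulated data:

    import numpy as np

    rng = np.random.default_rng(12)
    x = rng.normal(size=50)
    y = 2 * x + rng.normal(size=50)

    r1 = np.corrcoef(x, y)[0, 1]
    r2 = np.corrcoef(3.2808 * x + 10, 0.4536 * y - 5)[0, 1]  # rescaled and shifted
    print(r1, r2)   # identical up to floating-point rounding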
Property 3 tells us that r can only take values between −1 and +1. This is exactly the same range as that of the population correlation coefficient ρ. Combined with Property 4, we can see that r = 1 if and only if all the (x, y) pairs fall exactly on a straight line with positive slope. A similar statement can be made for r = −1.

r will be less than 1 in absolute value for any other type of scatterplot, even if there is a clear relationship between x and y.
An informal rule of thumb for the value of r is given below; a small code sketch encoding this rule follows the list.
• Weak: −0.5 ≤ r ≤ 0.5 (R² ≤ 0.25)
• Moderate: −0.8 < r < −0.5 or 0.5 < r < 0.8 (0.25 < R² < 0.64)
• Strong: r ≤ −0.8 or r ≥ 0.8 (R² ≥ 0.64)
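Here is a tiny Python helper (our own function, not from the book) that classifies a value of r by this rule of thumb:

    def strength(r):
        # Informal rule of thumb: weak / moderate / strong by |r|
        a = abs(r)
        if a >= 0.8:
            return "strong"
        if a > 0.5:
            return "moderate"
        return "weak"

    print(strength(-0.95), strength(0.6), strength(0.3))   # strong moderate weak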
The correlation coefficient r measures how strong the linear relationship between x and y is in the observed sample. We can think of the pairs (xi, yi) as a random sample from a population with unknown population correlation coefficient ρ. In fact, we can think of the sample correlation coefficient r as a point estimate of the population correlation coefficient ρ based on the observed sample. This lets us do inference on ρ.
Assumptions:
• X1, …, Xn are a random sample from a normal distribution N(µX, σX²).
• Y1, …, Yn are a random sample from a normal distribution N(µY, σY²).
• The correlation coefficient between X and Y is ρ.
Define the point estimator of ρ as

    R = Σ_{i=1}^n (Xi − X̄)(Yi − Ȳ) / √( Σ_{i=1}^n (Xi − X̄)² · Σ_{i=1}^n (Yi − Ȳ)² ).

Hence, the sample correlation coefficient r is a point estimate of ρ.
Testing for the Absence of Correlation
1. We want to test H0 : ρ = 0
2. Test statistic:

    T = R √(n − 2) / √(1 − R²)
3. If the null hypothesis H0 is true, T follows a t-distribution with n−2 degrees of freedom.
4. Value of the test statistic:

    t = r √(n − 2) / √(1 − r²)

5. Rejection regions:

   Alternative Hypothesis   Rejection Region for a Level α Test
   Ha : ρ ≠ 0               t ≤ −tα/2,n−2 or t ≥ tα/2,n−2 (two-tailed)
   Ha : ρ < 0               t ≤ −tα,n−2 (lower-tailed)
   Ha : ρ > 0               t ≥ tα,n−2 (upper-tailed)
6. Caution: The assumptions of normality of X and Y are crucial for employing this test. If the data deviate substantially from normality (e.g., a Q-Q plot shows clear deviation from a straight line), this test should not be employed for small sample sizes. A computational sketch of the test follows below.
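A minimal sketch of this test in Python, using hypothetical data; scipy.stats.pearsonr performs the same t test internally, so its p-value should agree with the hand computation:

    import numpy as np
    from scipy import stats

    x = np.array([4.0, 5.2, 6.1, 7.0, 8.3, 9.1])
    y = np.array([9.5, 8.1, 7.2, 5.9, 4.1, 3.0])
    n = len(x)

    r = np.corrcoef(x, y)[0, 1]
    t = r * np.sqrt(n - 2) / np.sqrt(1 - r ** 2)   # test statistic
    p = 2 * stats.t.sf(abs(t), df=n - 2)           # two-tailed p-value

    print(t, p)
    print(stats.pearsonr(x, y))                    # (r, two-sided p-value)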
Example: We revisit the arsenic example.
Simple linear regression results:
Dependent Variable: Arsenic removed
Independent Variable: pH
Arsenic removed = 190.26829 - 18.034245 pH
Sample size: 18
R (correlation coefficient) = -0.95049529
R-sq = 0.9034413
Estimate of error standard deviation: 6.1255839
Parameter estimates:

Parameter    Estimate      Std. Err.    Alternative   DF   T-Stat      P-value
Intercept    190.26829     12.587118    ≠ 0           16   15.116112   <0.0001
Slope        -18.034245    1.4739533    ≠ 0           16   -12.23529   <0.0001
Test H0 : ρ = 0 vs. Ha : ρ ≠ 0.
Relationship with the model utility test in regression analysis: the two tests are equivalent. Plugging r and n into t = r √(n − 2) / √(1 − r²) reproduces the T-Stat for the slope in the table above (compare with the test statistic for β1); a numerical check follows below.
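The equivalence can be verified from the output above alone, using only r = −0.95049529 and n = 18:

    from math import sqrt

    r, n = -0.95049529, 18
    t = r * sqrt(n - 2) / sqrt(1 - r ** 2)
    print(t)   # about -12.235, matching the slope T-Stat in the table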
Other inferences for ρ
1. Under the assumptions made before, the quantity

    V = (1/2) ln[ (1 + R) / (1 − R) ]

has approximately a normal distribution with mean and variance

    µV = (1/2) ln[ (1 + ρ) / (1 − ρ) ]   and   σV² = 1 / (n − 3).
2. We want to test H0 : ρ = ρ0 .
Here ρ0 can be any value, not just zero.
3. Test statistic:

    Z = √(n − 3) { V − (1/2) ln[ (1 + ρ0) / (1 − ρ0) ] }

4. For large sample size n, if the null hypothesis H0 is true, Z approximately follows the N(0, 1) distribution.
5. Rejection regions:

   Alternative Hypothesis   Rejection Region for a Level α Test
   Ha : ρ ≠ ρ0              z ≤ −zα/2 or z ≥ zα/2 (two-tailed)
   Ha : ρ < ρ0              z ≤ −zα (lower-tailed)
   Ha : ρ > ρ0              z ≥ zα (upper-tailed)
6. Caution: This test relies on a large-sample normal approximation and should not be employed for small sample sizes. A computational sketch follows below.
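A minimal sketch of the Fisher-z test in Python; the data and the null value ρ0 = −0.8 are hypothetical:

    import numpy as np
    from scipy import stats

    x = np.array([4.0, 5.2, 6.1, 7.0, 8.3, 9.1, 5.5, 6.8])
    y = np.array([9.5, 8.1, 7.2, 5.9, 4.1, 3.0, 7.7, 6.2])
    n, rho0 = len(x), -0.8

    r = np.corrcoef(x, y)[0, 1]
    v = 0.5 * np.log((1 + r) / (1 - r))        # V = (1/2) ln[(1 + R)/(1 - R)]
    z = np.sqrt(n - 3) * (v - 0.5 * np.log((1 + rho0) / (1 - rho0)))
    p = 2 * stats.norm.sf(abs(z))              # two-tailed p-value

    print(z, p)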