Chapter 12.5: Correlation
Instructor: Dr. Arnab Maity

Suppose we have two variables x and y. Regression analysis uses one variable (x) to predict the other (y) using a linear relationship. However, often we may want to look at the joint behavior of the two variables together rather than predicting one using the other. In this chapter, we learn about the sample correlation coefficient r as a measure of how strongly two variables x and y are related in a sample.

Recall: Given two random variables X and Y, the correlation coefficient (or population correlation coefficient) is defined as

\[ \rho = \frac{\operatorname{Cov}(X, Y)}{\sqrt{V(X)\, V(Y)}}. \]

This measures the strength of the relationship between X and Y. The correlation coefficient has the following properties.

• It does not have any units.
• −1 ≤ ρ ≤ 1.
• If X and Y are independent, then ρ = 0, but ρ = 0 does not imply independence.
• ρ = −1 or ρ = 1 if and only if Y = a + bX for some number a and nonzero number b (that is, Y and X are linearly related).

These properties tell us that ρ measures the strength of the linear relationship between X and Y. When ρ = 0, X and Y are said to be uncorrelated. They can still be dependent if there is a strong nonlinear relationship between them.

Example: Let X be a discrete random variable taking the three values −1, 0, and 1 with probability 1/3 each, and let Y = X². Then E(X) = 0 and E(XY) = E(X³) = 0, so Cov(X, Y) = E(XY) − E(X)E(Y) = 0, which means ρ = 0. But X and Y are clearly dependent, since Y = X².

A value of ρ near 1 does not necessarily imply that increasing the value of X causes Y to increase. It implies only that large X values are associated with large Y values. Be careful: association (a high correlation) is not the same as causation.

Now we define a sample version of the correlation coefficient. Suppose we observe n pairs of data points (x1, y1), …, (xn, yn). It is natural to say that x and y are positively related if large values of x are paired with large values of y and small values of x are paired with small values of y. Similarly, we say that x and y are negatively related if large values of x are paired with small values of y and small values of x are paired with large values of y.

To define the sample version of correlation, we first define the quantity

\[ S_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}). \]

If the two variables are positively related, then xi values above x̄ will tend to be paired with yi values above ȳ, so the product (xi − x̄)(yi − ȳ) will be positive. Similarly, xi values below x̄ will tend to be paired with yi values below ȳ, again making the product (xi − x̄)(yi − ȳ) positive. Thus the quantity Sxy will be positive if there is a positive relationship between x and y. A similar argument can be made for a negative relationship.

The sample correlation coefficient is defined as

\[ r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}}, \qquad \text{where } S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2 \text{ and } S_{yy} = \sum_{i=1}^{n} (y_i - \bar{y})^2. \]

Properties of r

1. The value of r does not depend on which of the two variables under study is labeled x and which is labeled y.
2. The value of r is independent of the units in which x and y are measured.
3. −1 ≤ r ≤ 1.
4. r = 1 if and only if all (xi, yi) pairs lie on a straight line with positive slope, and r = −1 if and only if all (xi, yi) pairs lie on a straight line with negative slope.
5. The square of the sample correlation coefficient gives the value of the coefficient of determination that would result from fitting the simple linear regression model.
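To make the computation concrete, here is a minimal Python sketch (my illustration, not from the lecture; the helper name sample_corr and the data are made up) that computes r directly from the defining sums and numerically checks Properties 1 and 2:

```python
import numpy as np

def sample_corr(x, y):
    """Sample correlation r = Sxy / sqrt(Sxx * Syy), from the defining sums."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xbar, ybar = x.mean(), y.mean()
    Sxy = np.sum((x - xbar) * (y - ybar))
    Sxx = np.sum((x - xbar) ** 2)
    Syy = np.sum((y - ybar) ** 2)
    return Sxy / np.sqrt(Sxx * Syy)

# Made-up illustrative data: y grows roughly linearly with x
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

r = sample_corr(x, y)
print(round(r, 4))                                # near +1: strong positive linear relation
print(np.isclose(r, sample_corr(y, x)))           # Property 1: swapping labels leaves r unchanged
print(np.isclose(r, sample_corr(3.2808 * x, y)))  # Property 2: a unit change (m -> ft) leaves r unchanged
```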
Property 1 is exactly the opposite of what happens in regression analysis, where all the quantities of interest (e.g., slope, intercept, and error variance) depend on which of the two variables is the response/dependent variable. On the other hand, Property 5 tells us that the proportion of variation in the dependent variable explained by fitting the simple linear regression model does not depend on which of the two variables is the response/dependent variable.

Example: Recall the arsenic data example discussed in the previous lecture. We saw data on x = pH and y = arsenic removed (%) by a particular process. Data for this example are shown in the book (Example 12.2).

Simple linear regression results:
  Dependent Variable: Arsenic removed
  Independent Variable: pH
  Arsenic removed = 190.268 − 18.034 pH
  Sample size: 18
  R (correlation coefficient) = −0.950
  R-sq = 0.903
  Estimate of error standard deviation: 6.125

Simple linear regression results:
  Dependent Variable: pH
  Independent Variable: Arsenic removed
  pH = 10.350 − 0.050 Arsenic removed
  Sample size: 18
  R (correlation coefficient) = −0.950
  R-sq = 0.903
  Estimate of error standard deviation: 0.322

Note that R and R-sq are identical in the two fits, while the slope, intercept, and error standard deviation all change.

Property 2 indicates that the sample correlation coefficient remains the same even if we change the scale of measurement (e.g., meters vs. feet, kilograms vs. pounds) or apply a location shift. Property 3 tells us that r can only take values between −1 and +1, exactly as for the population correlation coefficient ρ. Combined with Property 4, we see that r = 1 if and only if all the (x, y) pairs fall exactly on a straight line with positive slope; a similar statement holds for r = −1. For any other type of scatterplot, r will be less than 1 in absolute value, even if there is a clear relationship between x and y.

An informal rule of thumb for the value of r is given below.

• Weak: −0.5 ≤ r ≤ 0.5 (R² ≤ 0.25)
• Moderate: −0.8 < r < −0.5 or 0.5 < r < 0.8 (0.25 < R² < 0.64)
• Strong: r ≤ −0.8 or r ≥ 0.8 (R² ≥ 0.64)

The correlation coefficient r measures how strong the linear relationship between x and y is in the observed sample. We can think of the pairs (xi, yi) as a random sample from a population with unknown population correlation coefficient ρ. In fact, we can regard the sample correlation coefficient r as a point estimate of the population correlation coefficient ρ based on the observed sample. This lets us do inference on ρ.

Assumptions:

• X1, …, Xn are a random sample from a normal distribution N(µX, σ²X).
• Y1, …, Yn are a random sample from a normal distribution N(µY, σ²Y).
• The correlation coefficient between X and Y is ρ.

Define the point estimator of ρ as

\[ R = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2 \sum_{i=1}^{n} (Y_i - \bar{Y})^2}}. \]

Hence, the sample correlation coefficient r is a point estimate of ρ.

Testing for the Absence of Correlation

1. We want to test H0: ρ = 0.
2. Test statistic:
\[ T = \frac{R\sqrt{n-2}}{\sqrt{1 - R^2}}. \]
3. If the null hypothesis H0 is true, T follows a t distribution with n − 2 degrees of freedom.
4. Value of the test statistic:
\[ t = \frac{r\sqrt{n-2}}{\sqrt{1 - r^2}}. \]
5. Rejection regions (a Python sketch of this test follows the list):

  Alternative Hypothesis    Rejection Region for a Level α Test
  Ha: ρ ≠ 0                 t ≤ −t_{α/2, n−2} or t ≥ t_{α/2, n−2} (two-tailed)
  Ha: ρ < 0                 t ≤ −t_{α, n−2} (lower-tailed)
  Ha: ρ > 0                 t ≥ t_{α, n−2} (upper-tailed)

6. Caution: The normality assumptions on X and Y are crucial for this test. If the data deviate substantially from normality (e.g., a Q-Q plot shows departure from a straight line), this test should not be employed for small sample sizes.
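Here is a minimal Python sketch of the t test above (my illustration, not from the lecture; the function name corr_t_test and the simulated data are assumptions made for this example):

```python
import numpy as np
from scipy import stats

def corr_t_test(x, y, alternative="two-sided"):
    """t test of H0: rho = 0 using t = r*sqrt(n-2)/sqrt(1-r^2) with n-2 df."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    n = len(x)
    r = np.corrcoef(x, y)[0, 1]            # equals Sxy / sqrt(Sxx * Syy)
    t = r * np.sqrt(n - 2) / np.sqrt(1 - r**2)
    df = n - 2
    if alternative == "two-sided":
        p = 2 * stats.t.sf(abs(t), df)     # reject at level alpha when |t| >= t_{alpha/2, n-2}
    elif alternative == "less":            # Ha: rho < 0
        p = stats.t.cdf(t, df)
    else:                                  # Ha: rho > 0
        p = stats.t.sf(t, df)
    return r, t, p

# Simulated normal data, so the normality caution in item 6 is satisfied
rng = np.random.default_rng(12)
x = rng.normal(size=30)
y = 0.6 * x + rng.normal(size=30)          # positively correlated pairs
r, t, p = corr_t_test(x, y)
print(f"r = {r:.3f}, t = {t:.3f}, p = {p:.4g}")
# As a cross-check, scipy.stats.pearsonr(x, y) should give the same r and two-sided p-value.
```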
Example: We revisit the arsenic example.

Simple linear regression results:
  Dependent Variable: Arsenic removed
  Independent Variable: pH
  Arsenic removed = 190.26829 − 18.034245 pH
  Sample size: 18
  R (correlation coefficient) = −0.95049529
  R-sq = 0.9034413
  Estimate of error standard deviation: 6.1255839

Parameter estimates:

  Parameter   Estimate      Std. Err.   Alternative   DF   T-Stat      P-value
  Intercept   190.26829     12.587118   ≠ 0           16   15.116112   <0.0001
  Slope       −18.034245    1.4739533   ≠ 0           16   −12.23529   <0.0001

Test H0: ρ = 0 vs. Ha: ρ ≠ 0. The value of the test statistic is

\[ t = \frac{r\sqrt{n-2}}{\sqrt{1 - r^2}} = \frac{-0.95049529\,\sqrt{16}}{\sqrt{1 - 0.9034413}} \approx -12.235, \]

which is exactly the T-Stat for the slope in the table above. Relationship with the model utility test in regression analysis: the two tests are equivalent (compare with the test statistic for β1).

Other inferences for ρ

1. Under the assumptions made before, the quantity
\[ V = \frac{1}{2} \ln\!\left( \frac{1+R}{1-R} \right) \]
has approximately a normal distribution with mean and variance
\[ \mu_V = \frac{1}{2} \ln\!\left( \frac{1+\rho}{1-\rho} \right) \quad \text{and} \quad \sigma_V^2 = \frac{1}{n-3}. \]
2. We want to test H0: ρ = ρ0. Here ρ0 can be any value, not just zero.
3. Test statistic:
\[ Z = \sqrt{n-3}\left[ V - \frac{1}{2} \ln\!\left( \frac{1+\rho_0}{1-\rho_0} \right) \right]. \]
4. For large sample size n, if the null hypothesis H0 is true, Z follows the N(0, 1) distribution.
5. Rejection regions (see the sketch after this list):

  Alternative Hypothesis    Rejection Region for a Level α Test
  Ha: ρ ≠ ρ0                z ≤ −z_{α/2} or z ≥ z_{α/2} (two-tailed)
  Ha: ρ < ρ0                z ≤ −z_α (lower-tailed)
  Ha: ρ > ρ0                z ≥ z_α (upper-tailed)

6. Caution: This test relies on a large-sample normal approximation and should not be employed for small sample sizes.
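A minimal Python sketch of this approximate z test (my illustration, not from the lecture; the function name fisher_z_test is an assumption, and the null value ρ0 = −0.9 below is made up for illustration, while r = −0.9505 and n = 18 come from the arsenic example):

```python
import numpy as np
from scipy import stats

def fisher_z_test(r, n, rho0, alternative="two-sided"):
    """Approximate large-sample z test of H0: rho = rho0 via V = (1/2)ln((1+r)/(1-r))."""
    v = 0.5 * np.log((1 + r) / (1 - r))            # observed value of V (equals np.arctanh(r))
    mu0 = 0.5 * np.log((1 + rho0) / (1 - rho0))    # mean of V when rho = rho0
    z = np.sqrt(n - 3) * (v - mu0)                 # V has approximate variance 1/(n-3)
    if alternative == "two-sided":
        p = 2 * stats.norm.sf(abs(z))
    elif alternative == "less":                    # Ha: rho < rho0
        p = stats.norm.cdf(z)
    else:                                          # Ha: rho > rho0
        p = stats.norm.sf(z)
    return z, p

# Test H0: rho = -0.9 vs. Ha: rho != -0.9 with the arsenic example's r and n
z, p = fisher_z_test(-0.9505, 18, -0.9, alternative="two-sided")
print(f"z = {z:.3f}, p = {p:.4g}")
```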