Inferences for correlation

Quantitative Methods II
Plan for Today
• Recall: correlation coefficient
• Bivariate normal distributions
• Hypothesis testing for population correlation
• Confidence intervals for population correlation
Bivariate Analysis
• Is there a relationship between two variables?
• For example, is there a relationship between
a person’s income and their level of educational
attainment?
• Questions of this type are studied in
bivariate analysis, where “bi” indicates two
variables.
• Statisticians also work with more than two
variables, which leads to multivariate analysis.
Correlation Coefficient r
• It was developed by Karl Pearson in the early
1900s as a numerical measure of strength and
direction of the linear association between the
independent variable x and the dependent
variable y.
• The value of r is always between −1 and 1.
• If r > 0 , the correlation is positive (when x
increases, y increases as well).
• If r < 0 , the correlation is negative (when x
increases, y decreases).
Correlation Coefficient r
• r ≈ −1 : perfect negative correlation
• −1 < r < −0.6 : strong negative correlation
• −0.6 < r < −0.3 : moderate negative correlation
• −0.3 < r < 0 : weak negative correlation
• r ≈ 0 : no correlation
• 0 < r < 0.3 : weak positive correlation
• 0.3 < r < 0.6 : moderate positive correlation
• 0.6 < r < 1 : strong positive correlation
• r ≈ 1 : perfect positive correlation
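The scale above can be turned into a small lookup. This is only a minimal sketch in Python: the function name `describe_correlation`, the tolerance `tol` that decides when r counts as "approximately" −1, 0 or 1, and the treatment of the boundary values 0.3 and 0.6 (which the scale leaves open) are assumptions made for illustration.

```python
def describe_correlation(r, tol=0.005):
    """Return the verbal label from the scale above for a correlation r.

    Assumptions (not from the lecture): values within `tol` of -1, 0 or 1
    count as "approximately" equal to them, and the boundary values
    0.3 and 0.6 are assigned to the stronger category.
    """
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must lie between -1 and 1")
    size = abs(r)
    if size < tol:
        return "no correlation"
    sign = "positive" if r > 0 else "negative"
    if size > 1.0 - tol:
        strength = "perfect"
    elif size >= 0.6:
        strength = "strong"
    elif size >= 0.3:
        strength = "moderate"
    else:
        strength = "weak"
    return f"{strength} {sign} correlation"


print(describe_correlation(0.45))   # moderate positive correlation
print(describe_correlation(-0.8))   # strong negative correlation
```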
A formula for the coefficient of correlation
$$ r = \frac{s_x}{s_y} \cdot \hat{b} $$
where $s_x$ and $s_y$ are the standard deviations for
the x- and y-data respectively, and the slope
$$ \hat{b} = \frac{\overline{xy} - \bar{x} \cdot \bar{y}}{\overline{x^2} - \bar{x}^2} $$
However, the correlation coefficient is much
faster and easier to compute using your scientific
calculator!
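For readers who prefer a computer to a calculator, the formula can be sketched in Python. The function names `mean` and `correlation_via_slope` are hypothetical. The data are the five (hours, grade) pairs from the study-hours example later in this lecture, because the salary table is not reproduced in the text.

```python
from math import sqrt


def mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)


def correlation_via_slope(xs, ys):
    """Compute r = (s_x / s_y) * b_hat, with the slope b_hat built from means."""
    x_bar, y_bar = mean(xs), mean(ys)
    xy_bar = mean([x * y for x, y in zip(xs, ys)])
    x2_bar = mean([x * x for x in xs])
    y2_bar = mean([y * y for y in ys])

    # Slope: (mean of xy - x_bar * y_bar) / (mean of x^2 - x_bar^2)
    b_hat = (xy_bar - x_bar * y_bar) / (x2_bar - x_bar ** 2)

    # The ratio s_x / s_y is the same for the "divide by n" and the
    # "divide by n - 1" standard deviations, so the n version is used.
    s_x = sqrt(x2_bar - x_bar ** 2)
    s_y = sqrt(y2_bar - y_bar ** 2)
    return (s_x / s_y) * b_hat


hours = [2, 5, 1, 4, 2]
grades = [80, 80, 70, 90, 60]
print(round(correlation_via_slope(hours, grades), 4))  # 0.6138
```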
Salaries and education
• The table on the left
represents a sample of
11 individuals working
for the government of
Quebec, their annual
salary in thousands of $,
and their educational
attainment, in years.
Computing the correlation coefficient
Using the formulas from before or the built-in
calculator functions, compute:
r = 0.6395
• This is an example of a strong positive
correlation.
Bivariate normal distribution
We shall always assume that the set (x, y) of
ordered pairs of data comes from a bivariate
normal distribution. It means that for a fixed
value of x, the values of y are normally
distributed, and for a fixed value of y the
values of x are normally distributed as well.
In most cases, the results are still accurate if
the distributions are bell-shaped and
symmetrical, and the y-variances are
approximately equal.
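To get a feeling for what such a population looks like, one can simulate it. The sketch below is illustrative only and not part of the course material. It relies on a standard construction: if x and e are independent standard normal variables, then y = ρ·x + √(1 − ρ²)·e is also standard normal and the pair (x, y) is bivariate normal with correlation ρ. The function names, the seed and the sample size are assumptions.

```python
import random
from math import sqrt


def sample_bivariate_normal(n, rho, seed=0):
    """Draw n pairs (x, y) of standard normal variables with correlation rho."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        noise = rng.gauss(0.0, 1.0)
        # For a fixed x, y is normal with mean rho * x and the same variance
        # (1 - rho^2) for every x: the "equal y-variances" condition.
        y = rho * x + sqrt(1.0 - rho ** 2) * noise
        pairs.append((x, y))
    return pairs


def sample_correlation(pairs):
    """Sample correlation coefficient r of a list of (x, y) pairs."""
    n = len(pairs)
    x_bar = sum(x for x, _ in pairs) / n
    y_bar = sum(y for _, y in pairs) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in pairs)
    s_xx = sum((x - x_bar) ** 2 for x, _ in pairs)
    s_yy = sum((y - y_bar) ** 2 for _, y in pairs)
    return s_xy / sqrt(s_xx * s_yy)


pairs = sample_bivariate_normal(n=20000, rho=0.6)
print(round(sample_correlation(pairs), 2))  # should be close to rho = 0.6
```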
Hypothesis testing for correlation
The population correlation is denoted by the
Greek letter ρ (“rho”) and the sample
correlation by r. The null hypothesis is always
going to be that the values of x and y have no
linear correlation, that is, $H_0 : \rho = 0$.
The alternative hypothesis will always be
$H_A : \rho \neq 0$
(two-tailed test)
The test statistic
To test hypotheses for a population
correlation, we are going to use the Student’s t
distribution with (n − 2) degrees of freedom:
$$ df = n - 2 $$
And the test statistic $t^*$ is given by
$$ t^* = \frac{r \cdot \sqrt{n - 2}}{\sqrt{1 - r^2}} $$
Here n is the number of pairs of data (x, y).
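The test statistic and the two-tailed decision rule can be sketched in Python as follows. The function names `t_statistic` and `decide` are hypothetical, and the critical value is assumed to be read from a t table for df = n − 2 and tail area α/2, as in class. The inputs are the values used in the two worked examples of this lecture.

```python
from math import sqrt


def t_statistic(r, n):
    """Test statistic t* = r * sqrt(n - 2) / sqrt(1 - r^2) for H0: rho = 0."""
    if n < 3:
        raise ValueError("need at least 3 pairs so that df = n - 2 > 0")
    if abs(r) >= 1:
        raise ValueError("t* is undefined when |r| = 1")
    return r * sqrt(n - 2) / sqrt(1 - r ** 2)


def decide(r, n, t_critical):
    """Two-tailed test: reject H0 when |t*| exceeds the critical value.

    t_critical is supplied by the user (from a t table, df = n - 2).
    """
    t_star = t_statistic(r, n)
    decision = "reject H0" if abs(t_star) > t_critical else "fail to reject H0"
    return t_star, decision


# (r, n, critical value) for the study-hours and the reading/TV examples.
for r, n, t_crit in [(0.6138, 5, 3.182), (-0.715, 15, 2.16)]:
    t_star, decision = decide(r, n, t_crit)
    print(f"r = {r}, n = {n}: t* = {t_star:.2f}, {decision}")
```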
Example: study hours and grades
Five students have recorded the number of
hours they studied for an exam and their grades:
Hours    2    5    1    4    2
Grade   80   80   70   90   60
Assuming a bivariate normal population, test at
a 5% level of significance whether the
correlation between the number of hours of
study and the grade is significant.
Example: study hours and grades
First of all, let us compute the sample correlation
coefficient, using formulas or a calculator.
We have $r = 0.6138$. State the hypotheses:
$H_0 : \rho = 0$, $H_A : \rho \neq 0$. (A two-tailed test.)
The test statistic:
$$ t^* = \frac{0.6138 \cdot \sqrt{5 - 2}}{\sqrt{1 - 0.6138^2}} = 1.35 $$
The number of degrees of freedom is df = 3.
The critical values: $\pm t_{3,\,0.025} = \pm 3.182$
The p-value ≈ 2 ∙ 0.135 = 0.27 > 0.05 = 𝛼
Decision: fail to reject H0.
Example: reading time and TV
Do reading and TV viewing compete for leisure
time? To find out, a psychologist interviewed a
random sample of 15 children regarding the
number of books they had read during the last
year and the number of hours they had spent
watching TV on a daily basis. If a correlation
coefficient of −0.715 is obtained, is the
correlation significant at the 5% level of
significance? Assume that it’s a bivariate
normal population.
Example: reading time and TV
Let us state the hypotheses:
$H_0 : \rho = 0$, $H_A : \rho \neq 0$. (A two-tailed test.)
The test statistic:
$$ t^* = \frac{-0.715 \cdot \sqrt{15 - 2}}{\sqrt{1 - (-0.715)^2}} = -3.69 $$
The number of degrees of freedom is df = 13.
The critical values: $\pm t_{13,\,0.025} = \pm 2.16$
(Sketch the curve and the regions of rejection.)
The p-value ≈ 2 ∙ 0.0014 ≈ 0.003 < 0.05 = 𝛼
Decision: reject H0.
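The tail areas used for the p-values normally come from a table or a calculator. As a rough cross-check, they can also be approximated by integrating the Student’s t density numerically. The sketch below uses only the Python standard library, which has no built-in t distribution. Simpson’s rule, the number of steps and all function names are choices made for this illustration, not part of the course method.

```python
from math import gamma, pi, sqrt


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coeff = gamma((df + 1) / 2) / (sqrt(df * pi) * gamma(df / 2))
    return coeff * (1 + x * x / df) ** (-(df + 1) / 2)


def two_tailed_p_value(t_star, df, steps=2000):
    """Two-tailed p-value P(|T| > |t*|), using Simpson's rule on [0, |t*|]."""
    if steps % 2 != 0:
        raise ValueError("Simpson's rule needs an even number of steps")
    upper = abs(t_star)
    if upper == 0:
        return 1.0
    h = upper / steps
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        weight = 4 if i % 2 else 2
        total += weight * t_pdf(i * h, df)
    area_zero_to_t = total * h / 3
    # By symmetry, the area between -|t*| and |t*| is twice that amount.
    return 1.0 - 2.0 * area_zero_to_t


def t_statistic(r, n):
    """Test statistic for H0: rho = 0."""
    return r * sqrt(n - 2) / sqrt(1 - r ** 2)


# (r, n) for the study-hours and the reading/TV examples.
for r, n in [(0.6138, 5), (-0.715, 15)]:
    t_star = t_statistic(r, n)
    p = two_tailed_p_value(t_star, df=n - 2)
    print(f"r = {r}, n = {n}: t* = {t_star:.2f}, p-value = {p:.4f}")
```

Comparing each printed p-value with α = 0.05 gives the same decisions as the critical-value method: fail to reject H0 in the first example and reject H0 in the second.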
The confidence intervals
We start with the Fisher transformation:
$$ Z = \frac{1}{2} \cdot \ln\left(\frac{1 + r}{1 - r}\right) $$
It turns out that, for a bivariate normal
population, Z is (approximately) normally
distributed with a standard deviation of $1/\sqrt{n - 3}$.
So the confidence interval for $\mu_Z$ is
$$ c = Z - \frac{z(\alpha/2)}{\sqrt{n - 3}} < \mu_Z < Z + \frac{z(\alpha/2)}{\sqrt{n - 3}} = d $$
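The first half of the procedure, from r to the interval (c, d) for the mean of Z, can be sketched in Python. The names `fisher_z` and `mu_z_interval` are hypothetical. The critical value z(α/2) is taken from `statistics.NormalDist` in the standard library rather than from a table, so the last digit may differ slightly from a hand computation that uses a rounded value such as 1.88.

```python
from math import log, sqrt
from statistics import NormalDist


def fisher_z(r):
    """Fisher transformation Z = (1/2) * ln((1 + r) / (1 - r))."""
    if abs(r) >= 1:
        raise ValueError("the transformation is undefined when |r| = 1")
    return 0.5 * log((1 + r) / (1 - r))


def mu_z_interval(r, n, confidence):
    """Return (c, d), the confidence interval for the mean of Z."""
    if n <= 3:
        raise ValueError("need n > 3 so that sqrt(n - 3) is positive")
    alpha = 1 - confidence
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # z(alpha / 2)
    margin = z_crit / sqrt(n - 3)
    z = fisher_z(r)
    return z - margin, z + margin


# Study-hours data from this lecture: r = 0.6138, n = 5, 95% confidence.
c, d = mu_z_interval(0.6138, 5, 0.95)
print(f"c = {c:.4f}, d = {d:.4f}")
```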
The confidence intervals
Now we perform the inverse Fisher
transformation to get the confidence interval
for the population correlation ρ:
$$ \frac{e^{2c} - 1}{e^{2c} + 1} < \rho < \frac{e^{2d} - 1}{e^{2d} + 1} $$
Make sure you can compute these quantities
correctly on your scientific calculator!
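The same computation can be sketched in Python. The names `inverse_fisher` and `rho_interval` are hypothetical. It is worth knowing that the inverse Fisher transformation is exactly the hyperbolic tangent, so a calculator or language with a `tanh` key or function gives the same result in one step. The end points below are the values of c and d obtained in the two worked examples of this lecture.

```python
from math import exp, isclose, tanh


def inverse_fisher(z):
    """Map a Fisher Z value back to a correlation: (e^(2z) - 1) / (e^(2z) + 1)."""
    return (exp(2 * z) - 1) / (exp(2 * z) + 1)


def rho_interval(c, d):
    """Turn the interval (c, d) for the mean of Z into an interval for rho."""
    return inverse_fisher(c), inverse_fisher(d)


# The explicit formula agrees with the built-in hyperbolic tangent.
assert isclose(inverse_fisher(0.5), tanh(0.5))

# (c, d) from the study-hours example and from the reading/TV example.
for c, d in [(-0.6709, 2.1009), (-1.4400, -0.3546)]:
    low, high = rho_interval(c, d)
    print(f"{low:.4f} < rho < {high:.4f}")
```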
Let us consider examples.
Example: study hours and grades
Let us find the 95% confidence interval for ρ.
Recall that we have $r = 0.6138$. Do the Fisher transformation:
$$ Z = \frac{1}{2} \cdot \ln\left(\frac{1 + 0.6138}{1 - 0.6138}\right) = 0.7150 $$
$z(\alpha/2) = 1.96$. The confidence interval for $\mu_Z$:
$$ 0.7150 - \frac{1.96}{\sqrt{2}} < \mu_Z < 0.7150 + \frac{1.96}{\sqrt{2}} $$
so
$$ c = -0.6709 < \mu_Z < 2.1009 = d $$
Example: study hours and grades
Now we’ll do the inverse Fisher transformation.
$$ \frac{e^{2 \cdot (-0.6709)} - 1}{e^{2 \cdot (-0.6709)} + 1} < \rho < \frac{e^{2 \cdot 2.1009} - 1}{e^{2 \cdot 2.1009} + 1} $$
After computation, we find the 95% confidence
interval for the population correlation
coefficient ρ:
−0.5856 < 𝜌 < 0.9705
Example: reading time and TV
Let us find the 94% confidence interval for ρ.
Recall that we have $r = -0.715$. Do the Fisher transformation:
$$ Z = \frac{1}{2} \cdot \ln\left(\frac{1 + (-0.715)}{1 - (-0.715)}\right) = -0.8973 $$
$z(\alpha/2) = 1.88$. The confidence interval for $\mu_Z$:
$$ -0.8973 - \frac{1.88}{\sqrt{12}} < \mu_Z < -0.8973 + \frac{1.88}{\sqrt{12}} $$
so
$$ c = -1.4400 < \mu_Z < -0.3546 = d $$
Example: reading time and TV
Now we’ll do the inverse Fisher transformation.
$$ \frac{e^{2 \cdot (-1.44)} - 1}{e^{2 \cdot (-1.44)} + 1} < \rho < \frac{e^{2 \cdot (-0.3546)} - 1}{e^{2 \cdot (-0.3546)} + 1} $$
After computation, we find the 94% confidence
interval for the population correlation
coefficient ρ:
−0.8937 < 𝜌 < −0.3404
Example: salt and anxiety (practice)
Is there a correlation between one’s salt intake
and his or her level of stress and anxiety?
A study of 32 volunteers has found a
correlation coefficient of −0.26 between the
participants’ salt intake and the amplitude of
their adrenaline spikes.
(a) Test at a 5% level of significance whether
the population correlation is significant.
(b) Construct a 95% confidence interval for the
population correlation coefficient.
Assume a bivariate normal population.
Example: immigration and GDP (practice)
Do immigration rates correlate with GDP (gross
domestic product)? A researcher took data
from 40 different countries and found the
correlation coefficient for her sample to be
equal to 0.44.
(a) Test at a 1% level of significance whether
there is a significant correlation between
immigration rates and GDP.
(b) Construct a 98% confidence interval for the
population correlation coefficient.