Part 2

MATH/STAT 352
CORRELATION AND REGRESSION
CORRELATION
BIVARIATE DATA: CORRELATION AND REGRESSION
Two variables of interest: X, Y.
GOAL: Quantify association between X and Y: correlation.
Predict value of Y from the value of X: regression.
EXAMPLES: (height, weight), (yrs. of education, salary), (hrs. of
studying, exam score), (SAT score, GPA), (chemical reaction time,
temperature), (rainfall, runoff volume), (demand, price), etc.
BIVARIATE DATA: PAIRS (X, Y): (x1, y1), (x2, y2), …, (xn, yn).
(xi, yi) – ith observation, values of X and Y on the ith “subject”.
Correlation studies: study type and amount of association between X
and Y.
Regression studies: aim to predict Y from X by constructing a simple
equation relating Y to X.
CORRELATION
GRAPHICAL REPRESENTATION OF BIVARIATE DATA – SCATTER
PLOT: plot observations (x1, y1), (x2, y2), …, (xn, yn) as points on the plane.
Types of association/relationship between vars: positive, negative, none.
Positive association: Two variables are positively associated if large values
of one tend to be associated (occur) with large values of the other variable
and small values of one tend to be associated with small values of the
other variable.
Positive association
Example: Height and weight are
usually positively associated
CORRELATION, contd.
Negative association: Two variables are negatively associated if large
values of one tend to be associated (occur) with small values of the
other. The variables tend to “move in opposite directions”.
No association: If there is no association, the points in the scatter plot
show no pattern.
Negative association
E.g. High demand often occurs
with low price.
No association
PEARSON CORRELATION COEFFICIENT
Measure of strength of association is the correlation coefficient:
Population: ρ (rho); Sample: rxy.
Data: (x1, y1), (x2, y2), …, (xn, yn).
Sample statistics: sample means:
Sample standard deviations: sx and sy.
Sample correlation coefficient:
Correlation coefficient measures strength of LINEAR association.
PROPERTIES OF THE SAMPLE CORRELATION
COEFFICIENT rXY.
 
 
 
 
 
rxy> 0 indicates positive association (ρxy> 0) between X and Y.
rxy < 0 indicates negative association (ρxy< 0) btween X and Y.
rxy≈ 0 indicates no association (ρxy≈ 0) between X and Y.
-1 ≤ rxy (or ρxy)≤ 1, the closer | rxy | (or | ρxy |) to 1, the stronger
the relationship between X and Y.
Computational formula for r: given sample stats
we can compute r as
CORRELATION COEFFICIENT AND ASSOCIATION
Strong association
r = 0.9
Moderate association
r = 0.75
r = 0.5
Weak association
No association
Almost perfect
association, but
not linear, r small.
r = 0.28
r = - 0.3
CORRELATION, CONTD.
 
Correlation does not imply CAUSATION!
 
Watch out for hidden (lurking) variables.
Example. Study of fires. X=amount of damage, Y = # of
firefighters.
rXY≈0.85. The more firefighters, the more damage?
Hidden variable: Size of the fire.
EXAMPLE
In a study of income an savings, data was collected from 10 households. Both
savings and income are reported in thousands of $ in the following table.
Find the correlation coefficient between income and
income savings
savings.
25
28
35
39
44
48
52
65
55
72
0.5
0.0
0.8
1.6
1.8
3.1
4.3
4.6
3.5
7.2
Solution: Summary statistics: X=income, Y=savings
Σxi = 463 , Σx2i = 23533, Σyi = 27.4 , Σy2i = 120.04 ,
Σxi yi =1564.4.
There is a strong positive association
between family income and savings.
Testing hypotheses about significant correlation
Test on significance level α.
STEP1. Ho: ρ=0
vs
Ha: ρ≠ 0 or Ha: ρ > 0 or Ha: ρ < 0
STEP 2. Test statistic:
Under the Ho, the test statistic has t distr. with n-2 df.
STEP 3. Critical value? One-sided test tα, two-sided tα/2 .
STEP 4. DECISION-critical/rejection region(s) depends on Ha.
Ha: ρ≠ 0
Reject Ho if |t|> tα/2;
Ha: ρ > 0
Reject Ho if t > tα;
Ha: ρ < 0
Reject Ho if t < - tα.
Example
Income and savings. Test if there is a significant positive linear
relationship between income and savings. Use significance level of
5%
STEP1. Ho: ρ=0
vs
STEP 2. Test statistic:
Ha: ρ > 0 (positive relationship)
Under the Ho, the test statistic has t distr. with 8 df.
STEP 3. Critical value? One-sided test t0.05= 1.86.
STEP 4. DECISION-critical/rejection region(s) depends on Ha.
Ha: ρ > 0
Reject Ho if t > 1.86; t=10.1 > 1.86, so we reject Ho.
Answer: There is a significant positive correlation between income and
savings.
Example: Eruptions of the Old Faithful Geyser
When Old Faithful geyser erupts in the Yellowstone NP, the length of
the eruption time and the time interval until the next eruption are
recorded. The data for 8 eruptions produced the sample correlation
coefficient r = 0.926. Test if this relationship is significantly different
from zero. Use significance level of 0.01.
STEP1. Ho: ρ=0
vs
Ha: ρ≠ 0
STEP 2: Test statistic:
STEP 3. Critical value? Two-sided test, df=6, t0.005= 3.707.
STEP 4. |t|=6.008 >3.707, so reject Ho. Thus, there is a significant linear
relationship between the length of the eruption and the time after the
eruption.