2-Variable Statistics: Find the Correlation

10/2/2014
2-Variable Statistics
Now that we have used 1-variable statistics to “store” our necessary
numbers, let’s learn another way that’s even better ☺
Find the mean and standard deviation
of the x’s and y’s using 2-var stats.
x     y
21    6
18    9
30    3
35    4
Use this when
using your lists
to find r.
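As a cross-check on the calculator's 2-Var Stats output, here is a minimal Python sketch using the standard `statistics` module. It assumes the sample standard deviation (the calculator's Sx and Sy); the variable names are illustrative only.

```python
import statistics

# Data from the table above
x = [21, 18, 30, 35]
y = [6, 9, 3, 4]

# Means of each list
mean_x = statistics.mean(x)
mean_y = statistics.mean(y)

# Sample standard deviations (divide by n - 1)
sd_x = statistics.stdev(x)
sd_y = statistics.stdev(y)

print(f"x: mean = {mean_x}, Sx = {sd_x:.4f}")
print(f"y: mean = {mean_y}, Sy = {sd_y:.4f}")
```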
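Find the correlation coefficient:

x     y
4     6
8     15
15    22
19    18
22    27

Columns to compute: Zx, Zy, Zx*Zy

$r = \frac{\sum z_x z_y}{n-1} = \frac{3.599887921}{4} = 0.900$

Find the correlation coefficient:

x     y
32    27
40    82
30    34
18    14
15    1
25    22

Columns to compute: Zx, Zy, Zx*Zy

$r = \frac{\sum z_x z_y}{n-1} = \frac{4.558674571}{5} = 0.912$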
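The z-score recipe used in these examples can be sketched in a few lines of Python. This is only an illustration of the same formula, r = Σ(Zx·Zy)/(n − 1), assuming sample standard deviations; the function name `correlation` is made up for this sketch.

```python
import statistics

def correlation(xs, ys):
    """r = sum(zx * zy) / (n - 1), using sample standard deviations."""
    n = len(xs)
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    sd_x, sd_y = statistics.stdev(xs), statistics.stdev(ys)
    # Standardize each pair and add up the products of the z-scores
    total = sum(((x - mean_x) / sd_x) * ((y - mean_y) / sd_y)
                for x, y in zip(xs, ys))
    return total / (n - 1)

# The two data sets from the examples above
r1 = correlation([4, 8, 15, 19, 22], [6, 15, 22, 18, 27])
r2 = correlation([32, 40, 30, 18, 15, 25], [27, 82, 34, 14, 1, 22])
print(round(r1, 3), round(r2, 3))
```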
Find the correlation coefficient:

x     y
2     72
8     60
10    64
14    52
28    43
32    40
18    32

Columns to compute: Zx, Zy, Zx*Zy

$r = \frac{\sum z_x z_y}{n-1} = \frac{-4.868894211}{6} = -0.811$
A student wonders if tall women tend to date taller men than do
short women. She measures herself, her dormitory roommate,
and the women in the adjoining rooms. Then she measures the
next man each woman dates. Draw & discuss the scatterplot
and calculate the correlation coefficient.

Women (x)   Men (y)   Zx        Zy        Zx*Zy
66          72        0         1.1859    0
64          68        -0.9535   -0.3953   0.3769
66          70        0         0.3953    0
65          68        -0.4767   -0.3953   0.1884
70          71        1.9069    0.7906    1.5076
65          65        -0.4767   -1.581    0.7538

$r = \frac{\sum z_x z_y}{n-1} = \frac{2.826668855}{5} = 0.565$
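The worksheet columns for the heights example (Zx, Zy, Zx*Zy) can be reproduced row by row. The sketch below is one way to do it in Python, again assuming sample standard deviations; the names `rows` and `total` are illustrative.

```python
import statistics

# Heights from the dating example
women = [66, 64, 66, 65, 70, 65]
men = [72, 68, 70, 68, 71, 65]

mean_w, sd_w = statistics.mean(women), statistics.stdev(women)
mean_m, sd_m = statistics.mean(men), statistics.stdev(men)

# Build one worksheet row per couple: x, y, Zx, Zy, Zx*Zy
rows = []
for w, m in zip(women, men):
    zx = (w - mean_w) / sd_w
    zy = (m - mean_m) / sd_m
    rows.append((w, m, zx, zy, zx * zy))

for w, m, zx, zy, prod in rows:
    print(f"{w:>5} {m:>5} {zx:>8.4f} {zy:>8.4f} {prod:>8.4f}")

# Sum the last column and divide by n - 1 to get r
total = sum(row[4] for row in rows)
r = total / (len(rows) - 1)
print(f"sum = {total:.4f}, r = {r:.3f}")
```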
Guess the correlation coefficient
http://istics.net/stat/Correlations/
Linear Regression
Can we make a Line of Best Fit?
Want:
1) The distances to the line to be the same.
2) The smallest distances.

Regression Line
When a scatterplot shows a linear relationship, we’d like to
summarize the overall pattern by drawing a line on the
scatterplot.
A regression line summarizes the relationship between two
variables, but only in a specific setting: when one of the
variables helps explain or predict the other.
Regression, unlike a scatterplot, requires that we have
an explanatory variable and a response variable.
Regression Line
This is a line that describes how a response variable (y) changes
as an explanatory variable (x) changes.
It’s used to predict the value of (y) for a given value of (x).
The regression line is a model for the data.

Let’s try some!
http://illuminations.nctm.org/ActivityDetail.aspx?ID=146
Regression Line
When given the response variable (y) and the explanatory variable (x),
the regression line relating y to x has an equation of the following
form:

$\hat{y} = a + bx$

Predicted value ($\hat{y}$): the predicted value of y for a
given value of x.
y-intercept (a): the predicted value of y when x is 0.
Slope (b): the amount by which y is predicted to
change when x increases by 1 unit.
The following data show the number of miles driven and advertised
price for 11 used Honda CR-Vs from the 2002-2006 model years
(prices found at www.carmax.com). The scatterplot below shows a
strong, negative linear association between number of miles and
advertised cost. The correlation is -0.874. The line on the plot is the
regression line for predicting advertised price based on number of
miles.

Thousand Miles Driven   Cost (dollars)
22                      17998
29                      16450
35                      14998
39                      13998
45                      14599
49                      14988
55                      13599
56                      14599
69                      11998
70                      14450
86                      10998

[Scatterplot: cost vs. ThousandMilesDriven, with the regression line]

Use the regression line to answer the following.
predicted cost = 18773 − 86.18 (thousand miles)

Slope
The predicted price of the car decreases by $86.18 for every
additional thousand miles driven.

y-intercept
The predicted cost ($18,773) of a used Honda 2002 to 2006
CR-V with 0 miles.

Predict the price for a Honda with 50,000 miles. (Use 50 in equation!)
predicted cost = 18773 − 86.18 (50)
predicted cost = $14,464
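To see the Honda prediction as arithmetic, here is a small Python sketch that plugs values into the regression line given on the slide. The function name `predicted_price` is an assumption for illustration; the coefficients 18773 and 86.18 come from the slide.

```python
def predicted_price(thousand_miles):
    """Predicted advertised price in dollars, using the slide's line:
    predicted cost = 18773 - 86.18 * (thousand miles)."""
    return 18773 - 86.18 * thousand_miles

# 50,000 miles means x = 50, because x is measured in thousands of miles
price_50 = predicted_price(50)
print(f"${price_50:,.0f}")

# The slope is the change in predicted price per extra thousand miles
change_per_thousand = predicted_price(51) - predicted_price(50)
print(f"{change_per_thousand:.2f}")
```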
Extrapolation
This refers to using a regression line for prediction far
outside the interval of values of the explanatory variable x
used to obtain the line.
Such predictions are usually not very accurate.
Should we predict the asking price for a used
2002-2006 Honda CR-V with 250,000 miles?
No! We only have data for cars with between
22,000 and 86,000 miles. We don’t know if the
linear pattern will continue beyond these
values. In fact, if we did predict the asking
price for a car with 250 thousand miles, it
would be −$2772!
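One way to guard against extrapolation is to check whether an x-value falls inside the range of the data before trusting the prediction. The sketch below does this for the Honda CR-V line; the function name `predict_with_check` and the constants are illustrative, with the range 22 to 86 (thousand miles) taken from the data.

```python
# Range of the explanatory variable in the Honda data (thousands of miles)
MIN_MILES, MAX_MILES = 22, 86

def predict_with_check(thousand_miles):
    """Return (predicted price, True if the input is an extrapolation)."""
    price = 18773 - 86.18 * thousand_miles
    is_extrapolation = not (MIN_MILES <= thousand_miles <= MAX_MILES)
    return price, is_extrapolation

# 250,000 miles is far outside the data, so the result should be flagged
price, flagged = predict_with_check(250)
print(f"predicted price = {price:.0f}, extrapolation = {flagged}")
```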
Slope: slope = 40. The rat's weight is predicted to increase by 40 grams per week.
Y-int: y-int = 100. The rat is predicted to weigh 100 grams at birth.
Predict weight after 16 wk:
$\hat{y} = 100 + 40(16) = 740$ grams
Predict weight at 2 years:
2 yrs = 104 wks; $\hat{y} = 100 + 40(104) = 4260$ grams (about 9.4 pounds)
This is unreasonable and is a result of extrapolation.

Residual
A residual is the difference between an
observed value of the response variable and the
value predicted by the regression line.
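A residual is just observed minus predicted, which is easy to sketch in Python. The numbers below are taken from the sprint time and long-jump example in these notes (observed jump of 151 inches, sprint time of 8.09 seconds); the function name `predicted_jump` is an assumption for illustration.

```python
def predicted_jump(sprint_time):
    """Predicted long-jump distance (inches) from sprint time (seconds),
    using the line: distance = 304.56 - 27.63 * (sprint time)."""
    return 304.56 - 27.63 * sprint_time

observed = 151                    # observed long-jump distance
predicted = predicted_jump(8.09)  # value on the regression line
residual = observed - predicted   # residual = observed y - predicted y

print(f"predicted = {predicted:.2f}, residual = {residual:.2f}")
```

A positive residual means the observed point sits above the regression line; a negative one means it sits below.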
Example
The equation of the least-squares regression line for the
sprint time and long-jump distance data is predicted
long-jump distance = 304.56 – 27.63 (sprint time).
Find and interpret the residual for the student who had a
sprint time of 8.09 seconds.

predicted distance = 304.56 − 27.63 (8.09) = 81.03 inches
residual = observed y – predicted y
residual = $y - \hat{y}$
residual = 151 − 81.03
residual = 69.97
This student jumped 69.97 inches
farther than we expected based
on his sprint time.

Regression
Let’s see how a regression line is
calculated.
Fat vs Calories in Burgers

Fat (g)   Calories
19        410
31        580
34        590
35        570
39        640
39        680
43        660
Let’s standardize the variables

Fat    Cal    z - x's   z - y's
19     410    -1.959    -2
31     580    -0.42     -0.1
34     590    -0.036    0
35     570    0.09      -0.2
39     640    0.6       0.56
39     680    0.6       1
43     660    1.12      0.78

The line must contain the point $(\bar{x}, \bar{y})$. In z-scores that
point is the origin, so the line must pass through the origin.
Many lines with different slopes pass through the origin.
Which one fits our data the best? That is, which slope
determines the line that minimizes the sum of the squared
residuals?

Let’s clarify a little. (Just watch & listen)
The equation for a line that passes through the origin can be
written with just a slope & no intercept: $y = mx$
But we’re using z-scores, so our equation should reflect this,
and thus it’s
$z_y = m z_x$
Line of Best Fit – Least Squares
Regression Line
It’s the line for which the sum of the squared residuals is smallest.
Focus on the vertical deviations from the line.

Let’s find it. (just watch & soak it in)
Note: MSR is “Mean Squared Residual”
We want to find the mean squared residual.
Residual = Observed − Predicted, where the prediction is $\hat{z}_y = m z_x$

$MSR = \frac{\sum (z_y - \hat{z}_y)^2}{n-1}$
$MSR = \frac{\sum (z_y - m z_x)^2}{n-1}$
$MSR = \frac{\sum (z_y^2 - 2m z_x z_y + m^2 z_x^2)}{n-1}$

This gives us
$MSR = \frac{\sum z_y^2}{n-1} - 2m \frac{\sum z_x z_y}{n-1} + m^2 \frac{\sum z_x^2}{n-1}$

St. Dev of z scores is 1 so variance is 1 also: the first and last
fractions are both 1. The middle fraction, $\frac{\sum z_x z_y}{n-1}$, is r!
$MSR = 1 - 2mr + m^2$

Continue……
Since this is a parabola in m, it reaches its minimum at $m = \frac{-b}{2a}$,
where a = 1 and b = −2r are the coefficients of $m^2$ and $m$:
$m = \frac{-(-2r)}{2(1)} = r$

Hence the slope of the best fit line for z-scores is the
correlation coefficient r.

Slope – rise over run
A slope of r for z-scores means that for every increase of 1
standard deviation in x, there is an increase of r standard
deviations in y. “Over 1 and up r.”
Translate back to x & y values: “over one standard deviation in
x, up r standard deviations in y.”
Slope of the regression line is:
$b = \frac{r s_y}{s_x}$
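The claim that the mean squared residual is smallest when the slope equals r can be checked numerically. The sketch below standardizes the burger data, tries many candidate slopes on a grid, and keeps the one with the smallest MSR. The grid search is only an illustration, not part of the derivation, and the names `z_scores`, `msr`, and `best_m` are made up for this example.

```python
import statistics

# Burger data from the notes
fat = [19, 31, 34, 35, 39, 39, 43]
cal = [410, 580, 590, 570, 640, 680, 660]

def z_scores(values):
    """Standardize a list using its mean and sample standard deviation."""
    mean, sd = statistics.mean(values), statistics.stdev(values)
    return [(v - mean) / sd for v in values]

zx, zy = z_scores(fat), z_scores(cal)
n = len(fat)

# Correlation as the average product of z-scores
r = sum(a * b for a, b in zip(zx, zy)) / (n - 1)

def msr(m):
    """Mean squared residual of the line z_y = m * z_x."""
    return sum((b - m * a) ** 2 for a, b in zip(zx, zy)) / (n - 1)

# Scan candidate slopes from -2.000 to 2.000 in steps of 0.001
candidates = [k / 1000 for k in range(-2000, 2001)]
best_m = min(candidates, key=msr)

print(f"r = {r:.3f}, best slope on the grid = {best_m:.3f}")
```

The grid's best slope should agree with r to within the grid spacing, and `msr(m)` should match 1 − 2mr + m² for any m.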
Why is correlation “r”?
Because it was calculated from the regression of y on x after
standardizing the variables, just like we have just done.
Thus r was used to stand for (standardized) regression.

Let’s Write the Equation
From algebra: $y = mx + b$
In statistics: $\hat{y} = b_0 + b_1 x$
b0: y-intercept
b1: slope

Slope (burger data, Fat (g) vs. Calories):
$b_1 = \frac{r s_y}{s_x} = \frac{0.961(89.815)}{7.804} = 11.056$

Explain the slope:
Predicted calories increase by
11.056 for every additional gram of fat.

Now for the final part: the equation!
Y-intercept: Remember, the line has to pass through the point $(\bar{x}, \bar{y})$.
Solve for y-intercept:
$\bar{y} = b_0 + b_1 \bar{x}$
$b_0 = \bar{y} - b_1 \bar{x} = 210.954$, with $\bar{x} = 34.28571429$
$\hat{y} = 210.954 + 11.056x$

Now it can be used to predict.
How many calories do I expect to find in a hamburger that
has 25 grams of fat?
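The two-step recipe for the burger line (slope from r, Sy, and Sx; intercept from the point of means) can be sketched in Python as follows. This works from the raw Fat and Calories data rather than the rounded summary values, so results may differ from the slide in the last decimal place; the function name `predicted_calories` is illustrative.

```python
import statistics

# Burger data from the notes
fat = [19, 31, 34, 35, 39, 39, 43]
cal = [410, 580, 590, 570, 640, 680, 660]

n = len(fat)
mean_x, mean_y = statistics.mean(fat), statistics.mean(cal)
sd_x, sd_y = statistics.stdev(fat), statistics.stdev(cal)

# Correlation: sum of products of deviations, scaled by (n - 1) * Sx * Sy
r = sum((x - mean_x) * (y - mean_y)
        for x, y in zip(fat, cal)) / ((n - 1) * sd_x * sd_y)

b1 = r * sd_y / sd_x        # slope
b0 = mean_y - b1 * mean_x   # intercept: line passes through (x-bar, y-bar)

def predicted_calories(grams_of_fat):
    return b0 + b1 * grams_of_fat

print(f"y-hat = {b0:.3f} + {b1:.3f} x")
print(f"25 g of fat: about {predicted_calories(25):.1f} calories")
```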
Try another problem

Mean call-to-shock time (min)   Survival Rate
2                               90
6                               45
7                               30
9                               5
12                              2

$r = \frac{\sum z_x z_y}{n-1} = \frac{-3.84030233}{4} = -0.960$
$b_1 = \frac{r s_y}{s_x} = -9.2956$
$b_0 = \bar{y} - b_1 \bar{x} = 101.3285$
Survival rate = 101.3285 − 9.2956 (mean call-to-shock time)

Interpret the slope:
The predicted survival rate decreases by 9.2956 for every additional minute of call-to-shock time.
Interpret the y-int:
The predicted survival rate is 101.3285 when the call-to-shock time is 0 minutes.
Predict the survival rate for a 10 min. call-to-shock time:
Survival rate = 101.3285 − 9.2956 (10) = 8.3725
Predict the survival rate for a 20 min. call-to-shock time:
Survival rate = 101.3285 − 9.2956 (20) = −84.5835. Explain.
Try another problem
SAT Math   SAT Verbal
600        650
720        800
540        600
450        500
620        620

verbal = 1.05 (math) + 20.7
Interpret the slope:
Interpret the y-int:
Predict the verbal score for a math score of 400
Predict the verbal score for a math score of 500
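For checking your work on the SAT problem, here is a hedged Python sketch. It refits the line from the table using the equivalent least-squares formula slope = Sxy / Sxx (a different route from r·Sy/Sx, swapped in here for variety), then evaluates the slide's rounded equation. All names are illustrative.

```python
# SAT data from the table
sat_math = [600, 720, 540, 450, 620]
sat_verbal = [650, 800, 600, 500, 620]

n = len(sat_math)
mean_x = sum(sat_math) / n
mean_y = sum(sat_verbal) / n

# Least-squares slope and intercept: slope = Sxy / Sxx
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(sat_math, sat_verbal))
sxx = sum((x - mean_x) ** 2 for x in sat_math)
slope = sxy / sxx
intercept = mean_y - slope * mean_x

def predicted_verbal(math_score):
    """The slide's rounded equation: verbal = 1.05 (math) + 20.7."""
    return 1.05 * math_score + 20.7

print(f"fitted line: verbal = {slope:.2f} (math) + {intercept:.1f}")
print(f"math 400 -> {predicted_verbal(400):.1f}")
print(f"math 500 -> {predicted_verbal(500):.1f}")
```

Note that a math score of 400 lies below the smallest math score in the table (450), so that prediction is an extrapolation.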
That’s…all…..Folks!
Homework:
p. 191 (27-32, 35, 37,39,41, 47)