Sept. 29 Statistic for the day:
In an average lifetime, a human will produce more than 10,000 gallons of urine.

Deterministic relationship:
Value of response is exactly determined by explanatory variable.
In this case, the relationship happens to be linear, so what is r?
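As a rough check on the question above, the sketch below computes r for a hypothetical deterministic linear rule (a Celsius-to-Fahrenheit conversion, which is not the relationship plotted on the slide). The helper `pearson_r` is an illustrative name, not a library function. For an exactly linear rule, r should come out as 1 when the slope is positive and −1 when the slope is negative.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical deterministic rule (an assumption, not the slide's data):
# Fahrenheit = 32 + 1.8 * Celsius
celsius = [0, 10, 20, 30, 40]
fahrenheit = [32 + 1.8 * c for c in celsius]
r_increasing = pearson_r(celsius, fahrenheit)

# A second hypothetical rule with a negative slope: y = 100 - 2 * x
r_decreasing = pearson_r(celsius, [100 - 2 * c for c in celsius])

print(round(r_increasing, 6), round(r_decreasing, 6))
```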
Statistical relationship:
Value of response only partly determined by explanatory variable.
[Figure: scatterplot of Weight versus Age]

Another statistical relationship
[Figure: scatterplot of Fastest Speed Driven (MPH) versus Grade Point Average. (Outliers removed)]
Regression equation: Speed = 112 − 5.4×GPA
Note: Unlike for a bar graph, a scatterplot is usually not misleading when the axis doesn’t begin at zero.

Correlation (r): Strength of linear association
[Figure: two scatterplots. Weight versus Age: r = 0.893. Fastest Speed Driven (MPH) versus Grade Point Average: r = −0.147]

STAT 100, section 004, Fall 2008
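To make the regression equation Speed = 112 − 5.4×GPA concrete, here is a minimal sketch that evaluates it at a few GPA values inside the plotted range (2.0 to 4.0). The function name `predicted_speed` is an illustrative choice; the intercept and slope are the ones quoted on the slide.

```python
def predicted_speed(gpa):
    """Fastest speed driven (MPH) predicted by the line Speed = 112 - 5.4 * GPA."""
    return 112 - 5.4 * gpa

# Evaluate only at GPA values within the range shown on the scatterplot.
for gpa in (2.0, 3.0, 4.0):
    print(f"GPA {gpa:.1f}: predicted fastest speed {predicted_speed(gpa):.1f} MPH")

# The slope says each additional GPA point lowers the prediction by 5.4 MPH.
drop_per_gpa_point = predicted_speed(3.0) - predicted_speed(4.0)
```

The predictions are 101.2, 95.8, and 90.4 MPH. These are values on the fitted line, not a claim about any individual student.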
[Figure: four example scatterplots, labeled Correlation = .10, Correlation = .70, Correlation = .37, and Correlation = .97]

Facts about Correlation:
• We use the letter “r” to denote the correlation coefficient.
• The correlation coefficient is a measure of the strength of the linear relationship between the two variables in a scatterplot.
• The value of r must always be between −1 and 1:
a. r = 0 means no linear relationship.
b. Positive r means the two variables tend to increase together (with r = 1 meaning a perfect linear relationship).
c. Negative r means that one variable tends to decrease as the other increases (with r = −1 meaning a perfect linear relationship).
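The facts above can be illustrated with a small sketch on made-up numbers (none of these data come from the slides; `pearson_r` is an illustrative helper). It shows a positive r, a negative r, and a case where r = 0 even though the two variables are perfectly related by a curve, which is why r = 0 is read as "no linear relationship" and not "no relationship".

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Made-up data for illustration only.
x = [-2, -1, 0, 1, 2]
r_up = pearson_r(x, [1, 3, 2, 5, 4])         # y tends to increase with x
r_down = pearson_r(x, [4, 5, 2, 3, 1])       # y tends to decrease as x increases
r_curve = pearson_r(x, [v * v for v in x])   # y = x^2: exact curve, no linear trend

print(round(r_up, 3), round(r_down, 3), round(r_curve, 3))
```

Here r is 0.8 for the increasing pattern, −0.8 for the decreasing pattern, and 0 for the symmetric curve.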
Strength is not the same as statistical significance

Strength vs. statistical significance
Assume that the truth is NO linear relationship. What proportion of randomly generated scatterplots would have a stronger linear relationship than the one observed?
ANSWER: p-value
Common rule of thumb: If p-value is smaller than 0.05 (five percent), then the result is considered statistically significant.
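One way to approximate the proportion described above is a permutation (shuffling) approach: shuffle the response values so that any link to the explanatory variable is broken, and count how often the shuffled scatterplot is at least as strongly linear as the observed one. This is a sketch under that assumption and is not necessarily how the p-values on these slides were computed. The data, `pearson_r`, and `permutation_p_value` are all hypothetical.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def permutation_p_value(xs, ys, n_shuffles=10_000, seed=0):
    """Share of shuffled datasets whose |r| is at least the observed |r|."""
    rng = random.Random(seed)
    observed = abs(pearson_r(xs, ys))
    shuffled = list(ys)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)
        # Small tolerance so exact ties are not lost to rounding error.
        if abs(pearson_r(xs, shuffled)) >= observed - 1e-12:
            hits += 1
    return hits / n_shuffles

# Hypothetical small sample: strong r, but only 4 points.
small_x, small_y = [1, 2, 3, 4], [2, 1, 4, 5]
r_small = pearson_r(small_x, small_y)
p_small = permutation_p_value(small_x, small_y)

# Hypothetical larger sample: similar r, 20 points.
large_x = list(range(1, 21))
large_y = [x + (3 if x % 2 else -3) for x in large_x]
r_large = pearson_r(large_x, large_y)
p_large = permutation_p_value(large_x, large_y)

print(round(r_small, 3), p_small, round(r_large, 3), p_large)
```

Both samples have r above 0.8, but with only four points about a quarter of the shuffles do as well by chance, so only the larger sample passes the 0.05 rule of thumb.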
Correlation (r): Strength of linear association
[Figure: two scatterplots, annotated with r and p-value]
Weight versus Age: r = 0.893, p-value = 0.107. Not really significant.
Fastest Speed Driven (MPH) versus Grade Point Average: r = −0.147, p-value = 0.078. Not really significant, but close.

Average number of words a child knows at various ages
Imagine a scatterplot of the average number of words a child knows for ages about 1.5 to 6.
Relationship is nearly linear and quite strong.
Note the problem of extrapolation here.
At age 1, the average is 3, but the predicted value is −251!
Extrapolation occurs when you try to predict beyond the range of the explanatory variable.

Another problem:
Sometimes we see a strong relationship in absurd examples.
Two seemingly unrelated variables have a high correlation.
This signals the presence of a third variable that is highly correlated with the other two. (Confounding)
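Returning to the extrapolation problem, the sketch below fits a least-squares line to hypothetical vocabulary data for ages 2 to 6 and then predicts at age 1, outside that range. The numbers are invented for illustration and are not the data behind the slide, so the prediction differs from the −251 quoted there. `fit_line` is an illustrative helper.

```python
def fit_line(xs, ys):
    """Least-squares intercept and slope for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical averages (assumed for illustration): words known at each age.
ages = [2, 3, 4, 5, 6]
words = [300, 900, 1500, 2100, 2600]

intercept, slope = fit_line(ages, words)

inside_range = intercept + slope * 4    # age 4 lies within the observed ages
outside_range = intercept + slope * 1   # age 1 lies below the observed ages

print(intercept, slope, inside_range, outside_range)
```

With these made-up numbers the line is words = −840 + 580 × age. It gives a sensible 1480 words at age 4 but an impossible −260 words at age 1, the same kind of failure the slide points out.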
Vocabulary vs Shoe Size
How can we have such a high correlation between shoe size and vocabulary? (Note: These data were made up.)
Easy: Both increase with age, and hence age is a confounding (or hidden or lurking) variable.
Age is positively correlated with both shoe size and vocabulary.
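The confounding explanation can be sketched with a small simulation. All numbers are assumptions made for illustration: shoe size and vocabulary are each generated from age plus independent noise, with no direct link between them. Across children of mixed ages the two are highly correlated, and among children of the same age the correlation should be close to zero. The names `pearson_r` and `simulate_children` are hypothetical.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def simulate_children(ages, rng):
    """Shoe size and vocabulary each depend on age plus independent noise."""
    shoe = [0.5 * a + rng.gauss(0, 0.5) for a in ages]
    vocab = [400 * a + rng.gauss(0, 300) for a in ages]
    return shoe, vocab

rng = random.Random(1)

# Children of mixed ages (2 to 10): age drives both variables.
mixed_ages = [rng.uniform(2, 10) for _ in range(500)]
shoe, vocab = simulate_children(mixed_ages, rng)
r_all_ages = pearson_r(shoe, vocab)

# Children who are all exactly 6: the lurking variable is held fixed.
shoe6, vocab6 = simulate_children([6.0] * 500, rng)
r_fixed_age = pearson_r(shoe6, vocab6)

print(round(r_all_ages, 2), round(r_fixed_age, 2))
```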
PA Election Fraud (case study 23.1, page 442)
Special election to fill state senate seat in 1993.
William Stinson (D) received 19,127 machine-counted votes.
Bruce Marks (R) received 19,691 machine-counted votes.
Stinson got 1,391 absentee votes.
Marks got 366 absentee votes.
Stinson wins by 461 votes.
Question: Is this an unusual number of absentee votes?

[Figure: Dem minus Rep vote counts (so positive means D ahead), absentee versus machine. Annotations: R wins machine, D wins absentee]
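The vote totals in this case study can be checked with a few lines of arithmetic. The sketch uses only the counts quoted above and the same Dem minus Rep convention as the plot. The variable names are illustrative.

```python
# Vote counts quoted in the case study.
stinson_machine, marks_machine = 19_127, 19_691
stinson_absentee, marks_absentee = 1_391, 366

# Dem minus Rep, so a positive number means the Democrat is ahead.
machine_margin = stinson_machine - marks_machine
absentee_margin = stinson_absentee - marks_absentee
overall_margin = machine_margin + absentee_margin

print(machine_margin, absentee_margin, overall_margin)
```

The machine margin is −564 (R ahead), the absentee margin is +1,025 (D ahead), and the overall margin is +461, matching "Stinson wins by 461 votes".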