Another problem: Vocabulary vs Shoe Size

Oct. 5 Statistic for the day:
Percent of men 18–64 years old who would
give up sex for the entire season to win their
fantasy football league: 16%
Average number of words a child knows at
various ages
Imagine a scatter plot of the average number
of words a child knows at ages from about 1.5 to 6.
The relationship is nearly linear and quite strong.
Assignment:
Read Chapter 11
Do exercises 4, 5, 8, 13, and 17 on pages 213–215.
Regression Plot
Y = -806 + 555 X
Correlation = .985
S = 158.602    R-Sq = 97.1 %    R-Sq(adj) = 96.6 %
[Figure: scatter plot with fitted regression line. X-axis: Age (2 to 6). Y-axis: Words known (0 to 2500).]

Note the problem of extrapolation here: at age 1, the predicted size of
vocabulary is -806 + 555(1) = -251 words!
Extrapolation means trying to predict beyond the range of the explanatory
variable. (Remember the Sept. 12 example of running times?)

Another problem:
Sometimes we see a strong relationship in
absurd examples.
Two seemingly unrelated variables have
a high correlation.
This signals the presence of a third variable
that is highly correlated with the other two.
(Confounding)
Vocabulary vs Shoe Size
Regression Plot
Y = -806 + 555 X
Correlation = .985
S = 158.602    R-Sq = 97.1 %    R-Sq(adj) = 96.6 %
[Figure: scatter plot with fitted regression line. X-axis: Shoe Size (2 to 6). Y-axis: Words known (0 to 2500).]

How can we have such high correlation between
shoe size and vocabulary? (Note: These data
were made up.)
Easy: Both increase with age and hence age
is a confounding (or hidden or lurking) variable.
Age is positively correlated with both shoe
size and with vocabulary.
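The confounding explanation can be checked with a small simulation. The sketch below uses made-up numbers, just as the slide does: the shoe-size growth rule and both noise levels are assumptions chosen only for illustration, and the vocabulary line borrows the fitted equation from the plot. It compares the correlation between shoe size and vocabulary across children of mixed ages with the correlation among children who are all the same age.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

rng = random.Random(0)  # fixed seed so the run is repeatable

def simulate_child(age):
    """Made-up child: shoe size and vocabulary each depend only on age."""
    shoe = 1.0 + 1.0 * age + rng.gauss(0, 0.5)     # hypothetical growth rule
    words = -806 + 555 * age + rng.gauss(0, 150)   # slide's line plus noise
    return shoe, words

# Children of mixed ages (2 to 6): age drives both variables.
ages = [rng.uniform(2, 6) for _ in range(500)]
mixed = [simulate_child(a) for a in ages]
r_mixed = pearson_r([s for s, _ in mixed], [w for _, w in mixed])

# Children who are all exactly 4: the confounder is held fixed.
same_age = [simulate_child(4.0) for _ in range(500)]
r_same_age = pearson_r([s for s, _ in same_age], [w for _, w in same_age])

print(f"all ages:   r = {r_mixed:.2f}")
print(f"age 4 only: r = {r_same_age:.2f}")
```

In this simulation neither variable influences the other, so any correlation in the mixed-age group comes entirely from age.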
Outliers
Outliers are data that are not compatible
with the bulk of the data.
They show up in graphical displays
as detached or stray points.
Sometimes they indicate errors in data
input. Some experts estimate that roughly
5% of all data entered is in error.
Sometimes they are the most important
data points.

Example: STAT 100 FA 08 scatterplot
[Figure: scatter plot. X-axis: Credits this semester (0 to 150). Y-axis: Phone minutes per week. Annotation on one point: "151 is a lot of credits!"]

The outlier on the previous plot
was obviously a mistake.
But what about in this case?
(Again, these are data
from STAT 100.)
What about the outliers here?
[Figures: scatter plots from the STAT 100 survey. Axis labels: TV hours last week (0 to 40), Exercise minutes per week (0 to 1000), Phone minutes per week (0 to 1000).]
The question about how to treat outliers seen in the
data depends on the situation.
Sometimes, as in the case of a person who reports 151
credits, it is an obvious mistake and it should be
investigated and either fixed or removed.
It seems likely that “151” was supposed to be “15” and we
can understand how this could have happened.
Sometimes, as in the case of TV hours or exercise minutes,
an outlier could be legitimate. However, depending on the
situation, we might still wish to remove it so that it does not
cause problems in analysis. Whenever we remove
outliers, especially if they are not merely obvious errors, this
should be reported along with any results!
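As a sketch of the report-what-you-remove rule, the snippet below screens a made-up list of reported credit loads. The data and the plausibility cutoff of 30 credits are assumptions for illustration only. The same reporting applies whether the removed value is an obvious error, as here, or a legitimate extreme value.

```python
from statistics import mean

# Made-up credit loads; 151 mimics the data-entry error discussed above.
credits = [15, 12, 16, 151, 14, 18, 15, 13]

MAX_PLAUSIBLE_CREDITS = 30  # hypothetical cutoff, not from the course data

kept = [c for c in credits if c <= MAX_PLAUSIBLE_CREDITS]
removed = [c for c in credits if c > MAX_PLAUSIBLE_CREDITS]

# Report the result both ways, together with exactly what was dropped.
print(f"mean with all values:   {mean(credits):.2f}")
print(f"mean after removal:     {mean(kept):.2f}")
print(f"removed as implausible: {removed} ({len(removed)} of {len(credits)} values)")
```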
Finally, there are times when outliers are the most
interesting data points...
PA Election Fraud (case study 23.1, page 442)
Special election to fill state senate seat in 1993.
William Stinson (D) received 19,127 machine counted votes
Bruce Marks (R) received 19,691 machine votes
Stinson got 1,391 absentee votes
Marks got 366
Stinson wins by 461 votes
Question: Is this an unusual number of absentee votes?

Dem minus Rep vote counts (so positive means D ahead)
for absentee versus machine
Regression Plot
absentee = -182.575 + 0.295319 machine
- 0.0000285 machine**2
S = 294.363    R-Sq = 62.0 %    R-Sq(adj) = 57.8 %
[Figure: scatter plot with fitted regression curve and 95% prediction interval (PI) bands. X-axis: machine. Y-axis: absentee.]
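One rough way to approach the question is to plug the contested election into the fitted curve. The sketch below assumes, following the plot title, that both variables are Democrat-minus-Republican differences. It also uses predicted ± 2S as a stand-in for the plotted 95% prediction interval. That shortcut is an assumption: the exact interval also accounts for uncertainty in the fitted coefficients and is somewhat wider.

```python
# Vote counts from the case study.
stinson_machine, marks_machine = 19_127, 19_691
stinson_absentee, marks_absentee = 1_391, 366

# Democrat minus Republican differences (positive means D ahead).
machine_diff = stinson_machine - marks_machine
absentee_diff = stinson_absentee - marks_absentee

def predicted_absentee(machine):
    """Fitted quadratic from the regression plot."""
    return -182.575 + 0.295319 * machine - 0.0000285 * machine ** 2

S = 294.363  # residual standard deviation reported with the fit

predicted = predicted_absentee(machine_diff)
residual = absentee_diff - predicted

# Rough interval: predicted +/- 2S (an approximation, not the plotted 95% PI).
low, high = predicted - 2 * S, predicted + 2 * S

print(f"machine difference:     {machine_diff}")
print(f"absentee difference:    {absentee_diff}")
print(f"predicted absentee:     {predicted:.1f}")
print(f"rough interval:         ({low:.1f}, {high:.1f})")
print(f"residual in units of S: {residual / S:.1f}")
```

Under these assumptions the observed absentee difference lies well above the rough interval, which is what makes this election stand out from the pattern of the others.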
Put Options (NYTimes, September 26, 2001)
Put options on stocks give buyers the right to sell stock at
a specified price during a certain time. They rise in value
if the underlying stock falls below the strike price.
The value of puts on airline stocks soared on Sept. 17 when
U.S. stock and options markets reopened after a four-day
closure, as airline stocks slid as much as 40 percent.
In the days leading up to the attacks, volume in some of these
options, particularly in the puts, surged to more than four
times their recent daily averages, exchange data showed.

A surveillance team was sifting through data flagged as
unusual, such as spikes in options volume, and asking
member firms for audit trail information that might determine
whether those trades had suspicious origins.
Using its position-monitoring records and those of clearing member
firms, they know instantly the names of those who executed trades,
which firms cleared the trades and whether trades were made for
customers, broker-dealers or market makers on the trading floor.

American Airlines was at $32 prior to the attack. Suppose a
terrorist buys a put option (at say $5 per share) to have the
right to sell at $25. The price after the attack was at $16.
That put option is now more valuable: the right to sell at $25
a stock trading at $16 is worth $25 - $16 = $9 per share.

The point is that outliers can be very
informative.
They are sometimes the most interesting points in the data sets.
However, they are interesting only because of the pattern of the
rest of the data.

Look at the possible effect of a single outlier on regression lines:
[Figure: scatter plot with a single red outlier. X-axis: Phone minutes per week (0 to 1000). Y-axis: TV hours last week (0 to 40).]
Without red point: r = 0.02
With red point: r = 0.12
Whether or not they are informative, outliers can have a major
influence on certain statistical analyses, such as correlations and
regression lines.
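The influence of a single stray point on r can be reproduced with a few made-up numbers. The values below are hypothetical survey answers, not the STAT 100 data, and the red point is placed far from the bulk on both axes to make the effect easy to see.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Made-up answers: phone minutes per week and TV hours last week.
phone = [100, 200, 300, 400, 100, 200, 300, 400]
tv = [5, 10, 10, 5, 10, 5, 5, 10]

# One hypothetical stray point, far from the bulk on both axes.
red_point = (1000, 40)

r_without = pearson_r(phone, tv)
r_with = pearson_r(phone + [red_point[0]], tv + [red_point[1]])

print(f"without red point: r = {r_without:.2f}")
print(f"with red point:    r = {r_with:.2f}")
```

The bulk of these made-up points shows no linear relationship at all, yet one stray point is enough to produce a large r.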
STAT 100 Spring 2004
height = 58.2 + 0.06 weight
Correlation = .38
[Figure: scatter plot of height vs. weight with fitted line, outlier included.]

STAT 100 Spring 2004
height = 56.3 + 0.07 weight
Correlation = .57
[Figure: scatter plot of height vs. weight with fitted line, outlier removed.]

With outlier: r = 0.38
Without outlier: r = 0.57
As seen in this example,
the presence of an outlier
can lower the value of r
and mask the true
strength of the correlation.

A single point can interfere with the proper interpretation of
the data.
Removing only one point can dramatically
change the value of the correlation coefficient or the slope
and intercept of the regression line.
It is for this reason that outliers must be treated with
caution, sometimes even removed if we feel there is a
reason to do so. IF OUTLIERS ARE REMOVED, THIS
MUST BE REPORTED.

The moral of the story:
Whenever you compute a correlation, look at a scatter plot.
This will help you identify possible outliers.
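To see how removing one point can change the correlation, slope and intercept together, the sketch below fits a least-squares line with and without a single outlier. The height and weight values are made up for illustration (they are not the STAT 100 Spring 2004 data), and the outlier is a deliberately implausible entry.

```python
import math

def fit_line(xs, ys):
    """Return (intercept, slope, r) for the least-squares line y = a + b*x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    r = sxy / math.sqrt(sxx * syy)
    return intercept, slope, r

# Made-up weights (pounds) and heights (inches).
weights = [110, 125, 130, 145, 150, 165, 170, 185, 200, 215]
heights = [62, 64, 63, 66, 68, 67, 70, 71, 72, 74]

# One hypothetical stray point: a light person recorded as 100 inches tall.
outlier = (120, 100)

a0, b0, r0 = fit_line(weights, heights)
a1, b1, r1 = fit_line(weights + [outlier[0]], heights + [outlier[1]])

print(f"without outlier: height = {a0:.1f} + {b0:.3f} weight, r = {r0:.2f}")
print(f"with outlier:    height = {a1:.1f} + {b1:.3f} weight, r = {r1:.2f}")
```

With these made-up values the single stray point hides a strong positive relationship, which a scatter plot would reveal immediately.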