Oct. 5

Statistic for the day: Percent of men 18–64 years old who would give up sex for the entire season to win their fantasy football league: 16%

Assignment: Read Chapter 11. Do exercises 4, 5, 8, 13, and 17 on pages 213–215.

Average number of words a child knows at various ages

Imagine a scatter plot of the average number of words a child knows at ages from about 1.5 to 6. The relationship is nearly linear and quite strong.

[Regression plot: Words known vs. Age]
Y = -806 + 555 X
Correlation = .985
S = 158.602   R-Sq = 97.1%   R-Sq(adj) = 96.6%

Note the problem of extrapolation here: at age 1, the predicted size of vocabulary is -806 + 555(1) = -251 words! Extrapolation means trying to predict beyond the range of the explanatory variable. (Remember the Sept. 12 example of running times?)

Another problem: Sometimes we see a strong relationship in absurd examples. Two seemingly unrelated variables have a high correlation. This signals the presence of a third variable that is highly correlated with the other two. (Confounding)

Vocabulary vs Shoe Size

[Regression plot: Words known vs. Shoe Size]
Y = -806 + 555 X
Correlation = .985
S = 158.602   R-Sq = 97.1%   R-Sq(adj) = 96.6%

How can we have such a high correlation between shoe size and vocabulary? (Note: These data were made up.)

Easy: Both increase with age, and hence age is a confounding (or hidden or lurking) variable. Age is positively correlated with both shoe size and vocabulary.

Outliers

Example: STAT 100 FA 08 scatterplot

[Scatterplot: Phone minutes per week vs. Credits this semester]

The outlier on the previous plot was obviously a mistake. But what about in this case? (Again, these are data from STAT 100.)

What about the outliers here?

[Scatterplots: TV hours last week; Exercise minutes per week]

Sometimes they are the most important data points.
Sometimes they indicate errors in data input.
Some experts estimate that roughly 5% of all data entered is in error.
Outliers are data that are not compatible with the bulk of the data.
They show up in graphical displays as detached or stray points.

How to treat outliers seen in the data depends on the situation.

151 is a lot of credits! Sometimes, as in the case of a person who reports 151 credits, an outlier is an obvious mistake, and it should be investigated and either fixed or removed. It seems likely that “151” was supposed to be “15”, and we can understand how this could have happened.

Sometimes, as in the case of TV hours or exercise minutes, an outlier could be legitimate. However, depending on the situation, we might still wish to remove it so that it does not cause problems in the analysis.

Whenever we remove outliers, especially if they are not merely obvious errors, this should be reported along with any results!

Finally, there are times when outliers are the most interesting data points...

PA Election Fraud (case study 23.1, page 442)

Special election to fill a state senate seat in 1993.
William Stinson (D) received 19,127 machine-counted votes.
Bruce Marks (R) received 19,691 machine-counted votes.
Stinson got 1,391 absentee votes; Marks got 366.
Stinson wins by 461 votes.

Question: Is this an unusual number of absentee votes?

[Regression plot: Dem minus Rep vote counts (so positive means D ahead), absentee versus machine, with fitted regression curve and 95% PI]
absentee = -182.575 + 0.295319 machine - 0.0000285 machine**2
S = 294.363   R-Sq = 62.0%   R-Sq(adj) = 57.8%

In this election the machine margin is 19,127 - 19,691 = -564 and the absentee margin is 1,391 - 366 = +1,025.

Put options on stocks give buyers the right to sell stock at a specified price during a certain time. They rise in value if the underlying stock falls below the strike price.
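To make the definition of a put concrete, here is a minimal sketch of how its value at exercise depends on the share price. It is an illustration only: the function names are made up for this example, the $25 strike and $5 premium are illustrative figures, and the sketch ignores time value, fees, and contract sizes.

```python
def put_value_at_exercise(strike, share_price):
    """Per-share value of exercising a put: sell at `strike` a share worth `share_price`.

    Hypothetical helper for illustration. The value is never negative,
    because the holder can simply choose not to exercise.
    """
    return max(strike - share_price, 0.0)


def put_profit(strike, share_price, premium):
    """Per-share profit after subtracting the price paid for the option."""
    return put_value_at_exercise(strike, share_price) - premium


# Illustrative figures (assumptions for this sketch): strike $25, premium $5.
STRIKE = 25.0
PREMIUM = 5.0

# Compare a share price above, at, and below the strike price.
for share_price in (32.0, 25.0, 16.0):
    value = put_value_at_exercise(STRIKE, share_price)
    profit = put_profit(STRIKE, share_price, PREMIUM)
    print(f"share price {share_price:.0f}: value {value:.0f}, profit {profit:.0f}")
```

While the share price stays at or above the strike, the put is worth nothing at exercise and the buyer is out the premium. Once the price falls below the strike, each further dollar of decline adds a dollar of value.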
Put Options (NYTimes, September 26, 2001)

In the days leading up to the attacks, volume in some of these options, particularly in the puts, surged to more than four times their recent daily averages, exchange data showed.

The value of puts on airline stocks soared on Sept. 17 when U.S. stock and options markets reopened after a four-day closure, as airline stocks slid as much as 40 percent.

A surveillance team was sifting through data flagged as unusual, such as spikes in options volume, and asking member firms for audit trail information that might determine whether those trades had suspicious origins.

Using its position-monitoring records and those of clearing member firms, the team knows instantly the names of those who executed trades, which firms cleared the trades, and whether trades were made for customers, broker-dealers, or market makers on the trading floor.

American Airlines was at $32 prior to the attack. Suppose a terrorist buys a put option (at, say, $5 per share) for the right to sell at $25. The price after the attack was $16. That put option is now more valuable.

The point is that outliers can be very informative. They are sometimes the most interesting points in the data sets. However, they are interesting only because of the pattern of the rest of the data.

Whether or not they are informative, outliers can have a major influence on certain statistical analyses, such as correlations and regression lines.

Look at the possible effect of a single outlier on regression lines:

[Scatterplot: TV hours last week vs. Phone minutes per week, with regression lines]
Without red point: r = 0.02
With red point: r = 0.12

STAT 100 Spring 2004

[Regression plot: height vs. weight]
height = 58.2 + 0.06 weight
Correlation = .38

As seen in this example, the presence of an outlier can lower the value of r and mask the true strength of the correlation.

A single point can interfere with the proper interpretation of the data.
Removing only one point can dramatically change the value of the correlation coefficient or the slope and intercept of the regression line.

STAT 100 Spring 2004

[Regression plot: height vs. weight]
height = 56.3 + 0.07 weight
Correlation = .57

With outlier: r = 0.38
Without outlier: r = 0.57

It is for this reason that outliers must be treated with caution, and sometimes even removed if we feel there is a reason to do so. IF OUTLIERS ARE REMOVED, THIS MUST BE REPORTED.

The moral of the story: Whenever you compute a correlation, look at a scatter plot. This will help you identify possible outliers.
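The effect of a single point on r can be reproduced on a tiny data set. The sketch below computes the correlation with and without one stray point. The weights and heights are invented for illustration (they are not the STAT 100 data), and pearson_r is a helper written for this example rather than a library function.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-products and squared deviations about the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Made-up data: weight (pounds) and height (inches) for eight people.
weights = [110, 125, 140, 155, 170, 185, 200, 215]
heights = [62, 64, 65, 67, 68, 70, 71, 73]

# One hypothetical stray point: a heavy but short person.
outlier_weight, outlier_height = 300, 60

r_without = pearson_r(weights, heights)
r_with = pearson_r(weights + [outlier_weight], heights + [outlier_height])

print(f"without outlier: r = {r_without:.2f}")
print(f"with outlier:    r = {r_with:.2f}")
```

In this made-up set, one stray point is enough to pull a nearly perfect positive correlation down close to zero, which is why the scatter plot should be checked before the number is trusted.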