STAT 100, section 004, Fall 2008
Sept. 29

Statistic for the day: In an average lifetime, a human will produce more than 10,000 gallons of urine.

Deterministic relationship:
Value of response is exactly determined by explanatory variable.
In this case, the relationship happens to be linear, so what is r?

Statistical relationship:
Value of response only partly determined by explanatory variable.
Note: Unlike for a bar graph, a scatterplot is usually not misleading when the axis doesn't begin at zero.

Another statistical relationship
Regression equation: Speed = 112 − 5.4×GPA
[Figure: scatterplot of Fastest Speed Driven (MPH) versus Grade Point Average (outliers removed)]

Correlation (r): Strength of linear association
[Figure: scatterplot of Weight versus Age, r = 0.893]
[Figure: scatterplot of Fastest Speed Driven (MPH) versus Grade Point Average (outliers removed), r = −0.147]

[Figure: four example scatterplots with Correlation = .10, Correlation = .37, Correlation = .70, Correlation = .97]

Facts about Correlation:
• We use the letter "r" to denote the correlation coefficient.
• The correlation coefficient is a measure of the strength of the linear relationship between the two variables in a scatterplot.
• The value of r must always be between −1 and 1:
   a. r=0 means no linear relationship.
   b. Positive r means the two variables tend to increase together (with r=1 meaning a perfect linear relationship)
   c. Negative r means that one variable increases while the other decreases (with r=−1 meaning a perfect linear relationship)

Strength vs. statistical significance
Strength is not the same as statistical significance.
Assume that the truth is NO linear relationship. What proportion of randomly generated scatterplots would have a stronger linear relationship than the one observed?
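One rough way to approximate the proportion asked about above is by simulation: keep the x values fixed, shuffle the y values so that no real link remains, and count how often the shuffled scatterplot shows a correlation at least as strong as the observed one. The sketch below is an illustration under stated assumptions, not the method behind the numbers in this lecture. The data are invented, and the names `pearson_r` and `shuffle_proportion`, the number of trials, and the seed are all hypothetical choices.

```python
import math
import random


def pearson_r(xs, ys):
    """Correlation coefficient r for two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def shuffle_proportion(xs, ys, trials=2000, seed=0):
    """Share of shuffled scatterplots whose |r| is at least the observed |r|."""
    rng = random.Random(seed)
    observed = abs(pearson_r(xs, ys))
    shuffled = list(ys)
    hits = 0
    for _ in range(trials):
        rng.shuffle(shuffled)  # shuffling breaks any real link between x and y
        # Small tolerance so ties count despite floating-point rounding.
        if abs(pearson_r(xs, shuffled)) >= observed - 1e-12:
            hits += 1
    return hits / trials


# Hypothetical data, invented for illustration only (not from the lecture).
ages = [5, 12, 20, 28, 35, 40]
weights = [45, 90, 140, 150, 160, 155]

r = pearson_r(ages, weights)
p = shuffle_proportion(ages, weights)
print(f"r = {r:.3f}, proportion at least as strong = {p:.3f}")
```

The comparison uses the absolute value of r, so strong negative and strong positive relationships both count as "strong". Ties are counted as at least as strong, which is a design choice of this sketch rather than something the lecture specifies.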
ANSWER: p-value
Common rule of thumb: If the p-value is smaller than 0.05 (five percent), then the result is considered statistically significant.

Correlation (r): Strength of linear association
[Figure: scatterplot of Weight versus Age] r = 0.893, p-value = 0.107. Not really significant.
[Figure: scatterplot of Fastest Speed Driven (MPH) versus Grade Point Average] r = −0.147, p-value = 0.078. Not really significant, but close.

Average number of words a child knows at various ages
Imagine a scatterplot of the average number of words a child knows for ages about 1.5 to 6.
Relationship is nearly linear and quite strong.

Note the problem of extrapolation here. At age 1, the average is 3, but the predicted value is −251!
Extrapolation occurs when you try to predict beyond the range of the explanatory variable.

Another problem: Sometimes we see a strong relationship in absurd examples. Two seemingly unrelated variables have a high correlation.
This signals the presence of a third variable that is highly correlated with the other two. (Confounding)

Vocabulary vs Shoe Size
How can we have such high correlation between shoe size and vocabulary? (Note: These data were made up.)
Easy: Both increase with age and hence age is a confounding (or hidden or lurking) variable. Age is positively correlated with both shoe size and with vocabulary.

PA Election Fraud (case study 23.1, page 442)
Special election to fill state senate seat in 1993.
William Stinson (D) received 19,127 machine-counted votes
Bruce Marks (R) received 19,691 machine-counted votes
Stinson got 1,391 absentee votes
Marks got 366 absentee votes
Stinson wins by 461 votes
Question: Is this an unusual number of absentee votes?

[Figure: Dem minus Rep vote counts (so positive means D ahead), absentee versus machine. Annotation: R wins machine, D wins absentee.]
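The vote counts quoted above can be checked with a few lines of arithmetic. This is only a sketch of the "Dem minus Rep" margins used on the plot axes; the variable names are illustrative and do not come from the case study.

```python
# Vote counts as quoted in the case study text above.
machine = {"Stinson (D)": 19127, "Marks (R)": 19691}
absentee = {"Stinson (D)": 1391, "Marks (R)": 366}

# Dem minus Rep, so a positive value means the Democrat is ahead.
machine_margin = machine["Stinson (D)"] - machine["Marks (R)"]
absentee_margin = absentee["Stinson (D)"] - absentee["Marks (R)"]
overall_margin = machine_margin + absentee_margin

print("machine margin (D - R): ", machine_margin)   # -564: R wins machine
print("absentee margin (D - R):", absentee_margin)  # 1025: D wins absentee
print("overall margin (D - R): ", overall_margin)   # 461: Stinson wins
```

The machine margin is negative and the absentee margin is positive, which is why the outcome turns on whether an absentee margin of this size is unusual.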