Homework 1 – Key
Soci 252-003
Professor François Nielsen

6.10 (Mensa)
• Need a z-score of 2.5, i.e., a score 2.5 standard deviations above the mean. Mean = 100 and standard deviation = 16, so the required score is 100 + 2.5*16 = 140.

6.15 (Final exams)
• Anna: 83 French, 83 Spanish. Megan: 77 French, 95 Spanish. Overall French: mean = 81, s.d. = 5. Overall Spanish: mean = 74, s.d. = 15.
a) Anna has an overall average of (83 + 83)/2 = 83, so she does not qualify for honors. Megan has an overall average of (77 + 95)/2 = 86, so she does qualify for honors.
b) To see who actually did better, we must look at each student's performance relative to the class distribution on each exam (assuming students of similar quality in each class), using z = (score - mean)/s.d.
   o Anna scored (83 - 81)/5 = 0.4 standard deviations above the mean in French and (83 - 74)/15 = 0.6 above the mean in Spanish.
   o Megan scored (77 - 81)/5 = -0.8, i.e., 0.8 standard deviations below the mean, in French and (95 - 74)/15 = 1.4 above the mean in Spanish.
   Anna's average z-score was 0.5 standard deviations above the mean, while Megan's was 0.3 above the mean. So when we consider the difficulty of the respective exams, Anna appears to be the better student, even though Megan was the one who qualified for honors.

6.18 (Car speeds)
a) 20 mph is 3.84 mph below the mean of 23.84 mph, which is 3.84/3.56 = 1.079 standard deviations below the mean.
b) If we assume a truly normal, symmetric distribution that can be extrapolated many standard deviations from the mean, then the distances from the mean are 13.84 mph (for 10 mph) and 10.16 mph (for 34 mph). A speed of 10 mph is farther from the mean, so it would be the more unusual observation.

6.20 (Car speeds again)
a) After subtracting 20 from each observation, the mean speed is now 3.84 mph. The standard deviation remains 3.56 mph because we have not changed the shape of the distribution, only shifted it to the left.
b) To convert the units, multiply both the mean and the standard deviation by the conversion factor (1.609 km per mile). The mean speed is 23.84*1.609 = 38.36 kph, and the standard deviation is 3.56*1.609 = 5.73 kph.

6.43 (Cholesterol)
• Given: cholesterol levels of adult American women ~ N(188, 24).
Units are mg/dL.
a) [Normal distribution “bell curve” with mean = 188 and SD = 24]
b) A cholesterol level of 200 is 12 points above the mean, or 0.5 standard deviations (z-score = 0.5). You can look this up in a table of z-values, or in R enter
> pnorm(0.5)
and the answer is 0.691. About 69% of the distribution lies below 200, so about 31% of adult American women have a cholesterol level above 200.
c) The z-scores of 170 and 150 (on this distribution) are (170 - 188)/24 = -0.75 and (150 - 188)/24 = -1.583. 22.7% of the normal distribution is below -0.75, and 5.7% is below -1.583, so 22.7% - 5.7% = 17.0% of the distribution lies between these two values.
d) The interquartile range, the distance between the 25th and 75th percentiles, is about 32 mg/dL (the quartiles sit at z = ±0.674, and 2*0.674*24 = 32.4).
e) The 85th percentile corresponds to a z-score of about 1.036, which equates to 188 + 1.036*24 = 212.9 mg/dL of cholesterol.

7.8 (Kentucky Derby 2007)
• The plot of winning speed against year shows a moderately strong, positive, nonlinear relationship. The rate of increase from the beginning to 1950 is steeper than from 1950 to the present, and since 1960 the typical winning speed does not appear to have risen at all.

7.10 (Coffee sales)
a) [Pyramid-looking, roughly normally distributed histogram]
b) The sales are increasing over time; the histogram does not show time.
c) Sales are unimodal and symmetric, with average sales around $350.

7.12 (Matching)
• (a) is -0.977, (b) is 0.736, (c) is 0.951, and (d) is -0.021

7.36 (Burgers II)
a) The correlation is 0.199. There does not appear to be a relationship between sodium and fat content in burgers, especially when excluding the low-fat, low-sodium item (which exerts considerable “leverage” when estimating a linear relationship). Even with the outlier included, the correlation of 0.199 shows only a weak relationship.
b) Spearman's rho is slightly negative. Using ranks keeps the outlier from having as strong an influence, and the remaining points have little or no association.
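The normal-model answers in 6.43 above can be reproduced directly on the cholesterol scale with pnorm() and qnorm(). The following is a minimal sketch that assumes only the N(188, 24) model stated in the problem; the variable names are illustrative and not part of the original key.

```r
# Normal model from 6.43: mean 188 mg/dL, SD 24 mg/dL
mu <- 188
sigma <- 24

# b) proportion of the distribution above 200 mg/dL
p_above_200 <- 1 - pnorm(200, mean = mu, sd = sigma)

# c) proportion between 150 and 170 mg/dL
p_150_170 <- pnorm(170, mean = mu, sd = sigma) - pnorm(150, mean = mu, sd = sigma)

# d) interquartile range: 75th percentile minus 25th percentile
iqr <- qnorm(0.75, mean = mu, sd = sigma) - qnorm(0.25, mean = mu, sd = sigma)

# e) 85th percentile on the cholesterol scale
p85 <- qnorm(0.85, mean = mu, sd = sigma)

print(round(c(above_200 = p_above_200, between_150_170 = p_150_170,
              IQR = iqr, pct85 = p85), 3))
```

Passing mean and sd to pnorm() and qnorm() avoids converting to z-scores by hand, so small rounding differences from the table-based answers are to be expected.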
7.42 (Flights)
[Figure: scatterplot of Flights (about 5,500,000 to 7,000,000) against Year (1996 to 2004)]
a) The correlation is 0.828, indicating a strong trend of increasing flights per year.
b) The trend is positive and curved (i.e., the rate of increase is itself increasing). There is a low outlier in 2002, resulting from the down economy and/or fear of flying after 9/11/2001.
c) The plot is not straight and has an outlier. Either violation alone would make the correlation an inappropriate summary.
d) Kendall's tau is 0.71. It is an appropriate measure. It would be 1.0 if not for the outlier of 2002.

8.2 (Horsepower)
• 46.87 - 0.084*200 = 30.07, or about 30 mpg

8.4 (Horsepower, again)
• If a car's mpg value has a positive residual, the car is above the regression line and has better gas mileage (more miles per gallon) than predicted from its horsepower.

8.6 (more horsepower)
• The slope of -0.084 means that, on average and within the range of this linear model, a car's fuel efficiency decreases by 0.084 mpg for every one-unit increase in horsepower.

8.14 (EVEN MORE HORSEPOWER! jk, this one is about Residuals)
a) The model is not appropriate; the relationship is nonlinear.
b) The model may not be appropriate; the spread is changing, becoming more tightly clustered around the line when moving to the right (as the x-variable increases).
c) The model is appropriate.

8.22 (More misinterpretations)
a) R² measures the amount of variation accounted for. It does not imply that x determines (or causes) the values of y. Here, literacy rate accounts for 64% of the observed variation in life expectancy, but that does not mean literacy rate causes life expectancy.
b) This interpretation implies a causal relationship. But just because two variables are related does not mean that one causes the other. We could say that a 5% increase in literacy rate is associated with an average 2-year improvement in life expectancy.
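The prediction, residual, and slope interpretations in 8.2 through 8.6 all come from the same fitted line, and the arithmetic can be sketched in R. The coefficients are the ones used in 8.2; the function name predict_mpg and the "observed" 32 mpg are hypothetical, chosen only to illustrate a positive residual.

```r
# Fitted line used in 8.2: predicted mpg = 46.87 - 0.084 * horsepower
predict_mpg <- function(hp) 46.87 - 0.084 * hp

# 8.2: predicted fuel efficiency at 200 horsepower
pred_200 <- predict_mpg(200)

# 8.4: residual = observed - predicted.
# Hypothetical car (not from the textbook data): 200 hp, observed 32 mpg.
observed_mpg <- 32
resid_200 <- observed_mpg - pred_200   # positive, so the car sits above the line

# 8.6: change in predicted mpg for 10 additional horsepower (10 * slope)
delta_10hp <- predict_mpg(110) - predict_mpg(100)

print(c(predicted = pred_200, residual = resid_200, change_per_10hp = delta_10hp))
```

A positive residual here means the hypothetical car gets more miles per gallon than the line predicts for its horsepower, which matches the reading given in 8.4.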
8.40 (Success in college)
a) Predicted GPA = -1.262 + 0.00214*SAT
b) You could say that a person with an SAT score of 0 would have a GPA of -1.26, but both a negative GPA and a score of 0 on the SAT are impossible. So the y-intercept only serves to adjust the height of the line, and is meaningless by itself. More generally, it is risky to extrapolate a linear relationship beyond the range of the data used to estimate it.
c) The expected GPA is higher by 0.21 points for every additional 100 points scored on the SAT.
d) A freshman who scored 2100 on the SAT is predicted to have a GPA of -1.262 + 0.00214*2100 = 3.23.
e) The SAT score is only somewhat useful, since R² is only 0.221 (so about 22% of the variation in GPA is explained by SAT score). There are many other factors that influence GPA.
f) I would rather have a positive residual (where my GPA is higher than predicted from my SAT score).

8.46 (Drug abuse)
a) Yes, the linear model is appropriate. The plot shows a positive, linear, and fairly strong relationship.
b) Percent marijuana use accounts for 87.3% of the variation in use of other drugs.
c) Predicted Other% = -3.068 + 0.615*Marijuana%
d) Each additional percentage point of teens using marijuana in a country is associated with an increase of about 0.615 percentage points in the percentage of teens using other drugs.
e) The results do NOT confirm marijuana as a gateway drug. They indicate an association between marijuana use and use of other drugs, but this does not tell you the causal story.

8.48 (Veggie burger)
a) The predicted fat content of the burger is 20.4 g.
b) The residual for the veggie burger is -10.4 g of fat. The veggie burger has less fat relative to its protein than is typical of the menu items.
c) [You should mention the predicted fat content, the observed fat content, and the residual.]

8.60 (Heptathlon 2004 again)
a) Predicted LongJump = 4.20053 + 1.1054*HighJump. Every additional meter in high jump is associated with an additional 1.1 m, on average, in long jump.
While the relationship is close to 1:1, long-jump marks “start” at a higher point (the intercept is about 4.2 m).
b) Only 12.6% of the variability in long jump performance is accounted for by high jump performance.
c) Yes, the slope is positive. (It would be very surprising if it were negative, right? That would mean that people with better long-jump marks tended to have poorer high-jump marks.)
d) The residuals plot is fairly patternless.
e) No. The residual standard deviation is 0.196 m, which is not much smaller than the SD of all long jumps (0.206 m), and R² is only 12.6%.

9.4 (HDI revisited)
a) The relationship is not straight; HDI rises quickly at low levels of cell phone use and then plateaus.
b) The scatterplot of residuals will be curved downward.

9.12 (more unusual points)
(Parts 3 and 4 describe what would happen if the point were removed.)
a) 1. The point has high leverage; it pulls the line toward itself, which makes its otherwise large residual a bit smaller.
   2. Yes, the point is influential on the regression line.
   3. Correlation would increase because scatter around the line would decrease.
   4. Slope would increase, because the outlier pulls the line toward 0 (horizontal).
b) 1. High leverage, small residual.
   2. Yes, the point is influential.
   3. Correlation would become weaker, because the outlier has large z_x and z_y values and so inflates the correlation while it is included.
   4. Slope would decrease, because the outlier pulls the line from nearly flat to a positive slope.
c) 1. Little leverage, large residual.
   2. No, the point is not influential.
   3. Correlation would become stronger because scatter would decrease.
   4. Slope would be about the same.
d) 1. High leverage, small residual.
   2. No, the point is not influential.
   3. Correlation would become weaker (less negative), because the outlier contributes a large negative z_x * z_y value.
   4. Slope would stay about the same, because the outlier is consistent with the slope determined by the other points.

9.16 (What's the effect?)
• Perhaps playing computer games makes kids more violent.
• Alternatively, it is possible that kids who are already more violent like to play violent computer games.
• Or some other factor could account for both behaviors, so that the relationship between computer games and violence is produced by a “lurking” variable such as the child's home life or genetic predisposition.

9.22 (Ages of couples 2007)
a) The correlation is -0.88. The square root of the R² value (77.4%, or 0.774) is 0.88, and you know the relationship is negative by looking at the graph.
b) The age difference at marriage is decreasing by about 0.01718 years per year.
c) The predicted age difference in 2015 is about 1.65 years.
d) The latest data point is for the year 2007. Extrapolating to 2015 assumes the trend will continue in the same manner.

10.8 (Crowdedness)
a) The scatterplot shows upward curvature and decreasing spread.
b) Try plotting log(GDP) against crowdedness.

10.10 (Crowdedness again)
a) The student has gone too far; we now see marked downward curvature.
b) A next step would be to try a “weaker” re-expression, like reciprocal square root or log(GDP).

Extended Example: Simple Regression

> library(carData)   # assumption: Prestige is the data set shipped with the car/carData packages
> plot(prestige ~ education, data = Prestige)
[Figure: scatterplot of prestige (20 to 80) against education (6 to 16)]

> with(Prestige, cor(education, prestige))
The correlation between education and occupational prestige is 0.850. The plot shows a fairly linear relationship, so correlation is a reasonable summary of the strong relationship between these variables.

> lm1 <- lm(prestige ~ education, data = Prestige)
> summary(lm1)
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  -10.732      3.677  -2.919  0.00434 **
education      5.361      0.332  16.148  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 9.103 on 100 degrees of freedom
Multiple R-squared:  0.7228, Adjusted R-squared:  0.72
F-statistic: 260.8 on 1 and 100 DF,  p-value: < 2.2e-16

Regression equation: Prestige = -10.732 + 5.361*Education
Each additional year of education is associated with, on average, 5.36 more points on the prestige scale. The intercept is -10.7, but the model is not applicable to people with zero years of education, so this intercept value simply serves to set the height of the line.

> plot(prestige ~ education, data = Prestige, col = "blue")
> abline(lm1, col = "red")
[Figure: scatterplot of prestige against education with the fitted regression line]

> plot(residuals(lm1) ~ fitted.values(lm1), col = "blue")
> abline(h = 0, col = "red", lty = 2)
[Figure: residuals(lm1) (about -20 to 10) against fitted.values(lm1) (30 to 70)]
The residuals look relatively healthy, although they are not perfectly evenly scattered. There are more low outliers in the center and more high outliers to the left, but neither pattern is strong.

> qqnorm(residuals(lm1))
> qqline(residuals(lm1))
[Figure: Normal Q-Q Plot, Sample Quantiles against Theoretical Quantiles]
The residuals fall quite cleanly along the line, further indicating that they are healthy.

> plot(prestige ~ income, data = Prestige)
[Figure: scatterplot of prestige (20 to 80) against income (0 to 25000)]

> with(Prestige, cor(income, prestige))
The correlation of income with prestige is 0.715, but the relationship is curved; prestige rises quickly at lower incomes and then “tops out” as income increases. Correlation is not the best summary of the relationship between income and prestige.

> lm2 <- lm(prestige ~ income, data = Prestige)
> summary(lm2)
Residuals:
    Min      1Q  Median      3Q     Max
-33.007  -8.378  -2.378   8.432  32.084

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.714e+01  2.268e+00   11.97   <2e-16 ***
income      2.897e-03  2.833e-04   10.22   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 12.09 on 100 degrees of freedom
Multiple R-squared:  0.5111, Adjusted R-squared:  0.5062
F-statistic: 104.5 on 1 and 100 DF,  p-value: < 2.2e-16

> plot(residuals(lm2) ~ fitted.values(lm2), col = "blue")
> abline(h = 0, col = "red", lty = 2)
[Figure: residuals(lm2) (-30 to 30) against fitted.values(lm2) (40 to 100)]

The regression equation: Prestige = 27.14 + 0.0029*Income
Predicted prestige increases by 0.0029 points for each $1 increase in income, or by 2.9 points for each $1k increase in income. However, the residual plot is not “healthy”, as the residuals are shaped like an arch.

> plot(prestige ~ log10(income), data = Prestige, col = "blue")
[Figure: scatterplot of prestige (20 to 80) against log10(income) (3.0 to 4.0)]

> lm3 <- lm(prestige ~ log10(income), data = Prestige)
> summary(lm3)
Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)   -139.856     16.954  -8.249  6.6e-13 ***
log10(income)   49.635      4.497  11.037  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 11.61 on 100 degrees of freedom
Multiple R-squared:  0.5492, Adjusted R-squared:  0.5447
F-statistic: 121.8 on 1 and 100 DF,  p-value: < 2.2e-16

The regression equation: Prestige = -139.856 + 49.635*log10(Income)
The log of income behaves better as a linear predictor of prestige: the plot looks much more linear than the plot of prestige vs. (unlogged) income, and R² rises from 0.511 to 0.549.
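Because the predictor in lm3 is log10(income), its slope describes multiplicative changes in income rather than dollar changes. The sketch below uses only the two coefficients printed in the summary(lm3) output; the function name predict_prestige and the example incomes are illustrative assumptions, and the same numbers could be obtained from predict(lm3, ...) with the Prestige data loaded.

```r
# Coefficients copied from the summary(lm3) output
b0 <- -139.856
b1 <- 49.635

# Predicted prestige for a given income under the log10 model
predict_prestige <- function(income) b0 + b1 * log10(income)

pred_5k  <- predict_prestige(5000)
pred_10k <- predict_prestige(10000)
pred_20k <- predict_prestige(20000)

# Doubling income adds b1 * log10(2) points, whatever the starting income;
# a tenfold increase adds b1 * log10(10) = b1 points.
gain_per_doubling <- b1 * log10(2)

print(round(c(pred_5k = pred_5k, pred_10k = pred_10k, pred_20k = pred_20k,
              gain_per_doubling = gain_per_doubling), 2))
```

Under this model each doubling of income is associated with the same gain in predicted prestige (about 15 points), which is one way to express the "tops out" pattern seen in the unlogged plot.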