Homework 1 – Key
Soci 252-003
Professor François Nielsen
6.10 (Mensa)
• Need a z-score of 2.5, i.e., a score 2.5 standard deviations above the mean. Mean = 100, standard deviation = 16,
so 2.5 standard deviations above the mean is 100 + 2.5*16 = 140.
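The cutoff arithmetic can be checked with a short R sketch. The helper name `z_to_raw` is hypothetical and used only for illustration.

```r
# Convert a z-score back to the raw scale: raw = mean + z * sd
z_to_raw <- function(z, mean, sd) {
  mean + z * sd
}

# Score lying 2.5 standard deviations above a mean of 100 with SD 16
mensa_cutoff <- z_to_raw(2.5, mean = 100, sd = 16)
print(mensa_cutoff)
```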
6.15 (Final exams)
• Anna: 83 French, 83 Spanish. Megan: 77 French, 95 Spanish.
Overall French: mean = 81, s.d. = 5. Overall Spanish: mean = 74, s.d. = 15
a) Anna has an average of 83 overall in (French+Spanish)/2, so does not qualify for honors
Megan has an average of 86 overall in (French+Spanish)/2, so does qualify for honors
b) To see who actually did better, we must look at the student’s performance relative to the class
distribution on the exam (assuming similar quality of students in each class)
o Anna scored 0.4 standard deviations above the mean in French, 0.6 above the mean in Spanish
o Megan scored 0.8 standard deviations below the mean in French, 1.4 above mean in Spanish
Anna’s average score was 0.5 standard deviations above the mean, while Megan’s was 0.3 above mean
So when we consider the difficulty of the respective exams, Anna appears to be the better student, even
though Megan was the one who qualified for honors.
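A minimal R sketch of the standardization used in part (b), using only the scores, means, and standard deviations given above. The helper name `z_score` is hypothetical.

```r
# z-score: how many standard deviations a score lies from the class mean
z_score <- function(x, mean, sd) {
  (x - mean) / sd
}

# French: mean 81, SD 5. Spanish: mean 74, SD 15.
anna  <- c(French = z_score(83, 81, 5), Spanish = z_score(83, 74, 15))
megan <- c(French = z_score(77, 81, 5), Spanish = z_score(95, 74, 15))

# z-scores per exam, then each student's average z-score
print(rbind(anna, megan))
print(c(anna = mean(anna), megan = mean(megan)))
```

Averaging the z-scores rather than the raw scores puts the two exams on a common scale before they are combined.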
6.18 (Car speeds)
a) 3.84/3.56 = 1.079 standard deviations below the mean (20 mph is 3.84 mph below the mean of 23.84 mph)
b) If we assume a truly normal, symmetric distribution that can be extrapolated many standard
deviations from the mean, then the distances from the mean are 13.84 mph (for 10 mph) and 10.16
mph (for 34 mph). Since 10 mph lies farther from the mean, it would be the more unusual speed to observe.
6.20 (Car speeds again)
a) After subtracting 20 from each observation, the mean speed is now 3.84 mph. The standard deviation
remains 3.56 because we have not changed the shape of the distribution, only shifted it to the left.
b) To convert the units, multiply both the mean and the standard deviation of the original speeds by the
conversion rate (1 mile ≈ 1.609 km).
The mean speed in kilometers per hour is 38.36 kph, and the standard deviation is 5.73 kph.
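The effect of shifting and rescaling can be sketched in R. The mean of 23.84 mph is inferred from the shifted mean of 3.84 plus 20, the factor 1.609 km per mile is an assumed conversion rate, and the `speeds` vector is hypothetical data used only to check the general rule.

```r
mean_mph <- 23.84   # inferred: shifted mean 3.84 + 20
sd_mph   <- 3.56
km_per_mile <- 1.609  # assumed conversion rate

# (a) Subtracting a constant shifts the mean but leaves the SD alone
shifted_mean <- mean_mph - 20
shifted_sd   <- sd_mph

# (b) Multiplying by a constant rescales both the mean and the SD
mean_kph <- mean_mph * km_per_mile
sd_kph   <- sd_mph * km_per_mile
print(round(c(mean_kph = mean_kph, sd_kph = sd_kph), 2))

# Check of the same rules on hypothetical speeds (not the textbook data)
speeds <- c(18, 21, 24, 25, 27, 30)
shift_keeps_sd <- isTRUE(all.equal(sd(speeds - 20), sd(speeds)))
scale_scales_sd <- isTRUE(all.equal(sd(speeds * km_per_mile),
                                    sd(speeds) * km_per_mile))
print(c(shift_keeps_sd, scale_scales_sd))
```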
6.43 (Cholesterol)
• Given: adult American women cholesterol ~N(188, 24). Units are mg/dL
a) [normal distribution “bell curve” with mean=188 and SD=24]
b) A cholesterol level of 200 is 12 points above the mean, or 0.5 standard deviations (so z-score = 0.5).
You can look this statistic up in a table of z-values, or in R, enter
> pnorm(0.5)
and the answer is 0.691, so about 69% of women are below 200 and about 31% of American women have a cholesterol level above 200.
c) The z-scores of 170 and 150 (on this distribution) are -0.75 and -1.583.
22.7% of the normal distribution is below -0.75, and 5.7% of the normal distribution is below -1.583.
So between these two values is 17.0% of the distribution.
d) InterQuartile Range, the distance between 25th and 75th percentiles, is 32 on the cholesterol scale
e) The 85th percentile corresponds to a z-score of about 1.036, which equates to 212.9 mg/dL of cholesterol
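Parts (b) through (e) can be reproduced directly on the cholesterol scale with base R's `pnorm` and `qnorm`, which accept a mean and standard deviation. This is a sketch; the variable names are illustrative.

```r
mu    <- 188  # mean cholesterol, mg/dL
sigma <- 24   # standard deviation, mg/dL

# (b) Proportion of women with cholesterol above 200
p_above_200 <- 1 - pnorm(200, mean = mu, sd = sigma)

# (c) Proportion between 150 and 170
p_150_170 <- pnorm(170, mean = mu, sd = sigma) - pnorm(150, mean = mu, sd = sigma)

# (d) Interquartile range: 75th percentile minus 25th percentile
iqr <- qnorm(0.75, mean = mu, sd = sigma) - qnorm(0.25, mean = mu, sd = sigma)

# (e) 85th percentile
p85 <- qnorm(0.85, mean = mu, sd = sigma)

print(round(c(p_above_200 = p_above_200, p_150_170 = p_150_170), 3))
print(round(c(iqr = iqr, p85 = p85), 1))
```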
7.8 (Kentucky Derby 2007)
• The distribution of winning speeds over time has a nonlinear form with a moderately strong positive
relationship. The rate of increase from the beginning to 1950 is steeper than from 1950 to present,
and since 1960 the typical winning speed does not appear to have risen at all.
7.10 (Coffee sales)
a) [Pyramid-looking, normally-distributed histogram]
b) The sales are increasing over time; histogram does not show time
c) Unimodal, symmetric sales; average sales around $350
7.12 (Matching)
• (a) is -0.977, (b) is 0.736, (c) is 0.951, and (d) is -0.021
7.36 (Burgers II)
a) Correlation is 0.199. There does not appear to be a relationship between sodium and fat content in
burgers, especially when excluding the low-fat, low-sodium item (which exerts significant “leverage”
when estimating a linear relationship). Even with the outlier included, the correlation of
0.199 shows a weak relationship.
b) Spearman’s rho is slightly negative. Using ranks doesn’t allow the outlier to have as strong an
influence, and the remaining points have little or no association.
7.42 (Flights)
a) The correlation is 0.828, indicating a strong trend of increasing flights per year
b) The trend is positive and curved (i.e., the rate of increase is itself increasing). There is a low outlier in
2002, resulting from the down economy and/or fear of flying after 9/11/2001.
[scatterplot of Flights (y-axis) against Year (x-axis)]
c) The plot is not straight and has an outlier. Either violation would disqualify the correlation.
d) Kendall’s Tau is 0.71. It is an appropriate measure. It would be 1.0 if not for the outlier of 2002.
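The claim that a single low point keeps Kendall's tau below 1.0 can be illustrated in R. The `flights` values below are hypothetical ranks, not the actual flight counts, so the result does not reproduce the 0.71 quoted above; the sketch only shows the mechanism.

```r
year <- 1996:2004

# Hypothetical values: steadily increasing except for a dip in 2002
flights <- c(10, 11, 12, 13, 14, 15, 9, 16, 17)

# Kendall's tau counts concordant minus discordant pairs, so it depends
# only on the ordering of the values, not on how straight the trend is
tau_with_outlier <- cor(year, flights, method = "kendall")

# Dropping the 2002 point leaves a perfectly monotone sequence
keep <- year != 2002
tau_without_outlier <- cor(year[keep], flights[keep], method = "kendall")

print(c(with = tau_with_outlier, without = tau_without_outlier))
```

Here the 2002 value is lower than all six earlier values, giving 6 discordant pairs out of 36, so tau is (30 - 6)/36.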
8.2 (Horsepower)
• 46.87 – 0.084*200 = 30.07, or about 30 mpg
8.4 (Horsepower, again)
• If a car’s mpg value has a positive residual, that means it is above the regression line, and has better
gas mileage (more miles per gallon) than predicted by its horsepower.
8.6 (more horsepower)
• The slope of -0.084 means that on average, within the range of this linear model, a car’s fuel efficiency
decreases by 0.084 mpg for every unit increase in horsepower
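The three horsepower answers (prediction, residual sign, slope) can be tied together in a short R sketch. The function name `predict_mpg` and the observed value of 33 mpg are hypothetical, chosen only to illustrate a positive residual.

```r
# Fitted line from 8.2: predicted mpg = 46.87 - 0.084 * horsepower
predict_mpg <- function(horsepower) {
  46.87 - 0.084 * horsepower
}

# 8.2: prediction for a 200-horsepower car
pred_200 <- predict_mpg(200)
print(pred_200)

# 8.4: residual = observed - predicted; a hypothetical car observed at
# 33 mpg sits above the line, so its residual is positive
observed_mpg <- 33
residual <- observed_mpg - pred_200
print(residual)

# 8.6: one extra unit of horsepower changes the prediction by the slope
slope_check <- predict_mpg(201) - predict_mpg(200)
print(slope_check)
```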
8.14 (EVEN MORE HORSEPOWER! jk, this one is about Residuals)
a) The model is not appropriate; the relationship is nonlinear
b) The model may not be appropriate; the spread is changing, getting more condensed around a line
when moving to the right (as the X-variable increases)
c) The model is appropriate
8.22 (More misinterpretations)
a) R² measures the amount of variation accounted for. It does not imply that x determines (or causes) the
values of y. Here, literacy rate accounts for 64% of the observed variation in life expectancy, but that
does not mean literacy rate causes life expectancy.
b) This interpretation implies a causal relationship. But just because two variables are related does
not mean that one causes the other. We could say that a 5% increase in literacy rate is associated with
an average 2-year improvement in life expectancy.
8.40 (Success in college)
a) Predicted GPA = -1.262 + 0.00214*SAT
b) You could say that a person with an SAT score of 0 would have a GPA of -1.26, but both a negative GPA
and a score of 0 on the SAT are impossible. So the y-intercept only serves to adjust the height of the
line, and is meaningless by itself. More generally, it is risky to extrapolate a linear relationship beyond
the range of the data used to calculate the linear relationship to begin with.
c) The expected GPA is higher by 0.21 points for every additional 100 points scored on the SAT.
d) A freshman who scored 2100 on the SAT is predicted to have a 3.23 GPA (-1.262 + 0.00214*2100 = 3.23).
e) The SAT score is only somewhat useful, since the R² is only 0.221 (so, about 22% of the variation in GPA
is explained by SAT score). There are many other factors that influence GPA.
f) I would rather have a positive residual (where my GPA is higher than predicted by my SAT score)
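The equation from part (a) can be evaluated in R to check the intercept, slope, and prediction. The function name `predict_gpa` is hypothetical, and 2100 is used as the SAT score at which the equation returns the 3.23 quoted in part (d).

```r
# Fitted line from part (a): predicted GPA = -1.262 + 0.00214 * SAT
predict_gpa <- function(sat) {
  -1.262 + 0.00214 * sat
}

# (b) The intercept is the (meaningless) prediction at SAT = 0
gpa_at_zero <- predict_gpa(0)

# (c) Change in predicted GPA per additional 100 SAT points
gain_per_100 <- predict_gpa(1900) - predict_gpa(1800)

# (d) Prediction at an SAT score of 2100
gpa_2100 <- predict_gpa(2100)

print(round(c(gpa_at_zero = gpa_at_zero,
              gain_per_100 = gain_per_100,
              gpa_2100 = gpa_2100), 2))
```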
8.46 (Drug abuse)
a) Yes, the linear model is appropriate. The plot shows a positive, linear and fairly strong relationship.
b) Percent marijuana use accounts for 87.3% of the variation in use of other drugs.
c) predicted Other% = -3.068 + 0.615*Marijuana%
d) Each additional percentage point of teens using marijuana is associated with an increase of 0.615
percentage points in the percentage of teens using other drugs in a given country
e) The results do NOT confirm marijuana as a gateway drug. They indicate an association between
marijuana use and use of other drugs, but this does not tell you the causal story.
8.48 (Veggie burger)
a) The predicted fat content of the burger is 20.4g
b) The residual for the veggie burger is -10.4g of fat. Veggie burgers have less fat relative to protein than
typical menu items.
c) [you should mention predicted fat content, observed fat content, and the residual]
8.60 (Heptathlon 2004 again)
a) Predicted LongJump = 4.20053 + 1.1054*HighJump. Every additional meter in high jump is associated
with an additional 1.1m, on average, in long jump. While there is close to a 1:1 linear relationship, long
jump scores “start” at a higher point.
b) Only 12.6% of the variability in long jump performance is accounted for by high jump performance
c) Yes, the slope is positive. (It would be very surprising if it were negative, right? That would mean that
people with better long-jump marks tended to have poorer high-jump marks)
d) The residuals plot is fairly patternless
e) No. The residual standard deviation is 0.196m, not much smaller than the SD of all long jumps (0.206m),
and the R² is only 12.6%.
9.4 (HDI revisited)
a) The relationship is not straight; HDI rises more quickly relative to cellphones, and then plateaus
b) The scatterplot of residuals will be curved downward
9.12 (more unusual points)
a)
1. The point has high leverage, makes large residual a bit smaller
2. Yes, the point is influential on the regression line
3. Correlation would increase because scatter around the line would decrease
4. Slope would increase, because the outlier pulls the line toward 0 (horizontal)
b)
1. High leverage, small residual
2. Yes, the point is influential
3. Correlation would become weaker because the outlier has high zx and zy values, which
increase the correlation while the point is included
4. Slope would decrease because the outlier pulls the line from nearly flat to a positive slope
c)
1. Little leverage, large residual
2. No, the point is not influential
3. Correlation would become stronger because scatter would decrease
4. Slope would be about the same
d)
1. High leverage, small residual
2. No, the point is not influential
3. Correlation would become weaker and become less negative because the outlier has a large
negative zx * zy value
4. Slope would stay about the same because the outlier is consistent with the slope determined
by the other points
9.16 (What’s the effect?)
• Perhaps playing computer games makes kids more violent
• Alternatively, it is possible that kids who are already more violent like to play violent computer games
• Or, some other factor could account for both behaviors, so the relationship between computer games
and violence is caused by a “lurking” variable such as the child’s home life or genetic predisposition.
9.22 (Ages of couples 2007)
a) The correlation is -0.88. The square root of the R² value (77.4%, or 0.774) is 0.88, and you know the
relationship is negative by looking at the graph.
b) The age difference at marriage is decreasing by about 0.01718 years per year as time passes
c) The predicted age difference in 2015 is about 1.65 years
d) The latest data point is for the year 2007. Extrapolating to 2015 assumes the trend will continue in the
same manner.
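Recovering the correlation from R² in part (a) takes one line of R; the sign has to be supplied separately from the direction of the trend. A small sketch, with illustrative variable names:

```r
r_squared <- 0.774

# R-squared discards the sign, so take the square root and attach the
# negative sign seen in the downward-sloping scatterplot
r <- -sqrt(r_squared)
print(round(r, 2))

# (b) A slope of -0.01718 years per year, expressed per decade
slope_per_year <- -0.01718
change_per_decade <- 10 * slope_per_year
print(change_per_decade)
```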
10.8 (Crowdedness)
a) The scatterplot shows upward curvature and decreasing spread
b) Try plotting log(GDP) against crowdedness
10.10 (Crowdedness again)
a) The student has gone too far; we now see marked downward curvature.
b) A next step would be to try a “weaker” re-expression, like reciprocal square root or log(GDP)
Extended Example: Simple Regression
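> library(carData)  # assumption: Prestige is the data set shipped in the carData package
> plot(prestige ~ education, data = Prestige)
[scatterplot of prestige (y-axis) against education (x-axis)]
> with(Prestige, cor(education, prestige))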
The correlation between education and occupational prestige is 0.850. The plot shows a rather linear
relationship, so correlation is a reasonable expression of the strong relationship between these variables.
> lm1 <- lm(prestige ~ education, data = Prestige)
> summary(lm1)
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  -10.732      3.677  -2.919  0.00434 **
education      5.361      0.332  16.148  < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 9.103 on 100 degrees of freedom
Multiple R-squared: 0.7228, Adjusted R-squared: 0.72
F-statistic: 260.8 on 1 and 100 DF, p-value: < 2.2e-16
Regression equation: Prestige = -10.732 + 5.361*Education
Each year of education is worth, on average, 5.36 points on the prestige scale. The intercept is -10.7, but the
model is not applicable to people with zero years of education so this intercept value simply serves to set the
height of the line.
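The fitted equation can be used for prediction without refitting the model. The sketch below hard-codes the coefficients reported above; the function name `predict_prestige` and the education values 8, 12, and 16 are illustrative choices within the plotted range.

```r
# Coefficients copied from the regression output above
intercept <- -10.732
slope     <- 5.361

predict_prestige <- function(education) {
  intercept + slope * education
}

# Predicted prestige at 8, 12, and 16 years of education
predictions <- predict_prestige(c(8, 12, 16))
print(predictions)

# In simple regression, R-squared is the squared correlation
r <- 0.850
print(round(r^2, 3))
```

The squared correlation (about 0.72) agrees with the Multiple R-squared in the output, up to rounding of the reported correlation.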
> plot(prestige ~ education, data = Prestige, col = "blue")
> abline(lm1, col = "red")
[scatterplot of prestige (y-axis) against education (x-axis) with the fitted regression line]
> plot(residuals(lm1) ~ fitted.values(lm1), col = "blue")
> abline(h = 0, col = "red", lty = 2)
[plot of residuals(lm1) (y-axis) against fitted.values(lm1) (x-axis), with a dashed horizontal line at 0]
The residuals look relatively healthy, although they are not perfectly evenly scattered. There are more low
outliers in the center, and more high outliers to the left, but neither trend is strong.
> qqnorm(residuals(lm1))
> qqline(residuals(lm1))
[Normal Q-Q Plot: Sample Quantiles (y-axis) against Theoretical Quantiles (x-axis)]
The residuals fall quite cleanly along the line, further indicating their health
> plot(prestige ~ income, data = Prestige)
[scatterplot of prestige (y-axis) against income (x-axis)]
> with(Prestige, cor(income, prestige))
The correlation of income with prestige is 0.715, but the relationship is curved; prestige rises more quickly
initially and then “tops out” compared to income. Correlation is not the best expression of the relationship
between income and prestige.
> lm2 <- lm(prestige ~ income, data = Prestige)
> summary(lm2)
Residuals:
    Min      1Q  Median      3Q     Max
-33.007  -8.378  -2.378   8.432  32.084
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.714e+01  2.268e+00   11.97   <2e-16 ***
income      2.897e-03  2.833e-04   10.22   <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 12.09 on 100 degrees of freedom
Multiple R-squared: 0.5111, Adjusted R-squared: 0.5062
F-statistic: 104.5 on 1 and 100 DF, p-value: < 2.2e-16
> plot(residuals(lm2) ~ fitted.values(lm2), col = "blue")
> abline(h = 0, col = "red", lty = 2)
[plot of residuals(lm2) (y-axis) against fitted.values(lm2) (x-axis), with a dashed horizontal line at 0]
The regression equation: Prestige = 27.14 + 0.0029*Income
There is a 0.0029-point increase in predicted prestige for each $1 increase in income, or a 2.9-point increase
for each $1k increase in income.
However, the residual plot is not “healthy”: the residuals are shaped like an arch, which signals a curved relationship.
> plot(prestige ~ log10(income), data = Prestige, col = "blue")
[scatterplot of prestige (y-axis) against log10(income) (x-axis)]
> lm3 <- lm(prestige ~ log10(income), data = Prestige)
> summary(lm3)
Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)   -139.856     16.954  -8.249  6.6e-13 ***
log10(income)   49.635      4.497  11.037  < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 11.61 on 100 degrees of freedom
Multiple R-squared: 0.5492, Adjusted R-squared: 0.5447
F-statistic: 121.8 on 1 and 100 DF, p-value: < 2.2e-16
The log of income behaves much better as a linear predictor of prestige, as seen in the plot which has a much
more linear-looking relationship than the plot of prestige vs. (unlogged) income.
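Because the predictor is log10(income), the slope describes the change in predicted prestige per ten-fold increase in income rather than per dollar. A sketch using the coefficients reported above; the function name `predict_prestige_log` and the income values are illustrative.

```r
# Coefficients copied from the log10(income) regression output above
intercept <- -139.856
slope     <- 49.635

predict_prestige_log <- function(income) {
  intercept + slope * log10(income)
}

# Predicted prestige at an income of $10,000 (log10 = 4)
pred_10k <- predict_prestige_log(10000)
print(pred_10k)

# A ten-fold increase in income raises the prediction by the slope
tenfold_gain <- predict_prestige_log(10000) - predict_prestige_log(1000)
print(tenfold_gain)

# Doubling income raises the prediction by slope * log10(2)
doubling_gain <- slope * log10(2)
print(round(doubling_gain, 1))
```

Under this model the gain from doubling income is the same at any income level, which matches the “tops out” shape seen in the unlogged plot.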