Sample solutions to Assignment 2
5.40
Suppose a positive relationship has been found between each of the following sets of variables.
For each set, discuss possible reasons why the connection may not be causal.
a) Number of deaths from automobiles and soft drink sales for each year from 1950 to 2000.
The population has increased in that period and one would expect a corresponding
increase in both variables simply as a result of population increase. Thus the relationship does
not necessarily imply that one variable causes the other. Comparing rates per capita, instead of
absolute numbers, would adjust for changes in population but not for other possible confounding
factors that change over the period.
b) Amount of daily walking and quality of health for men over 65 years old.
Factors that could contribute to the relationship other than a causal effect of walking on health
include:
(i) men who are healthier to begin with may have a higher propensity to walk;
(ii) "over 65" includes a broad range of ages, so age could be a confounding factor if both health
and the propensity to walk decrease with age;
(iii) other factors such as social class, culture, and location of residence could act as
confounding factors.
c) Number of ski accidents and average wait time for the ski lift for each day during one winter
at a ski resort.
Average wait time will be highly correlated with the number of skiers on the slope on any given
day as will the number of accidents even if the probability of an accident for each skier remains
the same from day to day. Other confounding factors might include skiing experience which may
tend to be lower on average on busy days; and the density of skiers on the slopes which will be
higher on busy days.
However, ski accidents could also cause longer wait times if use of the lifts is preempted by ski
patrols performing rescues.
5.46
The equation is
ŷ = 0.25 + 0.384 x
where y is foot length in cm and x is height in inches.
a) How much does average foot length increase for each 1-inch increase in height?
0.384 cm/in
b) Predict the difference in foot lengths for men whose height differs by 10 in.
10 × 0.384 = 3.84 cm
c) Suppose Max is 70 in. tall and has a foot length of 28.5 cm. Based on the regression equation,
what is the predicted foot length for Max? What is the value of the prediction error (residual) for
Max?
Predicted value at x = 70: ŷ = 0.25 + 0.384 × 70 = 27.13 cm
Residual: y − ŷ = 28.5 − 27.13 = 1.37 cm
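The arithmetic in parts (b) and (c) can be checked with a few lines of Python. This is only a sketch of the calculation above; the function and variable names are made up for illustration.

```python
# Fitted line from the exercise: foot length (cm) as a function of height (in).
INTERCEPT = 0.25   # cm
SLOPE = 0.384      # cm per inch


def predict_foot_length(height_in):
    """Predicted foot length in cm for a given height in inches."""
    return INTERCEPT + SLOPE * height_in


# Part (b): predicted difference for two men whose heights differ by 10 in.
difference_10in = SLOPE * 10

# Part (c): Max is 70 in. tall with an observed foot length of 28.5 cm.
max_predicted = predict_foot_length(70)
max_residual = 28.5 - max_predicted   # observed minus predicted

print(f"difference for 10 in: {difference_10in:.2f} cm")
print(f"predicted for Max:    {max_predicted:.2f} cm")
print(f"residual for Max:     {max_residual:.2f} cm")
```

A positive residual means Max's foot is longer than the line predicts for his height.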
5.48
r² = (−0.36)² = 0.1296
Using hours of study as a predictor for hours of sleep 'explains' 12.96% of the variance. I.e. the
'unexplained variance' in hours of sleep (the variance of the residuals after predicting hours of
sleep from hours of study) is 87.04% of the overall variance in hours of sleep.
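As a quick check of the explained and unexplained percentages, here is a minimal Python sketch (the variable names are illustrative). Note that the correlation is stored in a variable before squaring: in Python the literal expression -0.36 ** 2 evaluates to -0.1296, because ** binds more tightly than the unary minus.

```python
r = -0.36                     # correlation between hours of study and hours of sleep
r_squared = r ** 2            # proportion of variance 'explained'
unexplained = 1 - r_squared   # proportion left in the residuals

print(f"explained:   {r_squared:.2%}")
print(f"unexplained: {unexplained:.2%}")
```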
5.49
The winning time in the Olympic men's 500-meter speed skating race over the years 1924 to
1992 can be described by the regression equation:
Winning time = 255 − 0.1094 Year
a) Is the correlation between winning time and year positive or negative? Explain.
The correlation is negative because the correlation has the same sign as the
regression coefficient (the slope, −0.1094), since:
correlation = regression coefficient × (Std. Dev. of x) / (Std. Dev. of y)
Both standard deviations are positive, so the correlation and the regression
coefficient must have the same sign.
b) In 1994, the actual winning time for the gold medal was 36.33 seconds. Use the regression
equation to predict the winning time for 1994, and compare the prediction to what actually
happened.
Predicted time for 1994:
ŷ = 255 − 0.1094 × 1994 = 36.8564
so the actual winning time differed from the prediction by 36.33 − 36.8564 = −0.5264, i.e. it was
0.5264 seconds faster than predicted.
c) Explain what the slope of -0.1094 indicates in terms of how winning times change from year
to year.
The slope means that the winning time improves, on average, by 0.1094 seconds per year over
the period from 1924 to 1992.
d) Why should we not use this regression equation to predict the winning time in the 2050
Winter Olympics?
The regression assumes a linear relationship, which may be a reasonable approximation for a
limited period of time, but it certainly cannot be extrapolated very far: otherwise the winning
time would eventually be predicted to be below what is humanly, or even physically, possible.
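Parts (b) and (d) can be illustrated with a short Python sketch. It only re-uses the regression equation given in the question; the function name and the choice to compute the year at which the line reaches zero are illustrative additions.

```python
# Fitted line from the exercise: winning time (seconds) as a function of year.
INTERCEPT = 255.0
SLOPE = -0.1094   # seconds per year


def predict_winning_time(year):
    """Predicted winning time in seconds for a given year."""
    return INTERCEPT + SLOPE * year


# Part (b): compare the 1994 prediction with the actual time of 36.33 s.
predicted_1994 = predict_winning_time(1994)
error_1994 = 36.33 - predicted_1994   # negative means faster than predicted

# Part (d): extrapolating far beyond 1924-1992 eventually gives absurd values.
predicted_2050 = predict_winning_time(2050)
zero_time_year = INTERCEPT / -SLOPE   # year at which the line reaches 0 seconds

print(f"1994 predicted: {predicted_1994:.4f} s, actual - predicted: {error_1994:.4f} s")
print(f"2050 predicted: {predicted_2050:.2f} s")
print(f"line reaches 0 s in year {zero_time_year:.0f}")
```

The straight line keeps falling forever, reaching a winning time of zero a little after the year 2330, which is why it cannot be trusted far outside the years used to fit it.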
5.59
a)
[Figure: scatterplot of Verbal (vertical axis) against PctTook (horizontal axis)]
b)
Assuming that a straight line represents the relationship between average SAT per state and the
percentage of students who take the SAT, the predicted state average Verbal score goes down
by 1.081 for each one-percentage-point increase in the participation rate.
c) The scatterplot suggests two major clusters: states in which fewer than 20% take the test and
states in which 50% or more take the test, with just a few states scattered between. Among the
states in which 50% or more take the test, the expected score seems relatively constant, around
500. Between 10% and 50%, a line seems reasonable.
d) The intercept represents the mean Verbal score when the percent taking the test is 0. However,
if no one takes the test there are no scores to average, so that mean is not defined.
5.60
[Figure: scatterplot of Verbal against Math]
a) The relationship is linear with a correlation of about .9 (visually).
b) The most notable outlier is #12 (Hawaii), where the Verbal score is much lower than the
Math score. Similar but less marked outliers are California and Illinois. Three states have
Verbal scores that are somewhat higher than expected in relation to Math scores:
Mississippi, Arkansas and West Virginia.
Hawaii is the only state with two official languages, English and Hawaiian, reflecting a
demographic composition in which performance on a verbal test in the English language might
be hindered by reduced fluency in English. Illinois and California might have large groups
of more recent immigrants whose performance might be hindered on a test in the English
language.
The positive outliers seem to be states with possibly low immigration and a large proportion of
native English speakers, who may have a relative advantage on a test in the English language.
5.62
b)
[Figure: scatterplot of Salary (vertical axis) against Age (horizontal axis)]
The correlation is 0.13 and r-squared is 0.0169.
c) At higher ages there is a tendency to have higher salaries. An increase in one standard
deviation of age is associated with an increase of 0.13 standard deviations in Salary. The
relationship is weak.
Chapter 6.
6.7
a) The row percents indicate the percentage of students, for each sex, who fall in each of the
three categories when asked their perception of their weight.
b) You might have done the following by hand or with Excel but here's how it 'could' have been
done with R‼
c)
[Figure: bar chart of Freq by perceived weight (About Right, Overweight, Underweight) for Male and Female]
d) More males than females consider themselves 'about right' (77% versus 67%). Of those who
don't consider themselves about right, women tend to consider themselves overweight (30%
overweight vs 2% underweight) while men tend to consider themselves underweight (4%
overweight vs 19% underweight).
e) The sample consisted of Penn State statistics students. With respect to weight attitudes they
may have been representative of Penn State students in general although we can't be sure. They
are probably not representative of students in general and almost certainly not of the population
as a whole.
6.22
Using the terminology of this chapter, what name applies to each of the boldface numbers in the
following quotes.
a) "Fontham found increased risks of lung cancer with increasing exposure to secondhand
smoke, whether it took place at home, at work, or in a social setting. A spouse's smoking alone
produced an overall 30 percent increase [relative increase in risk] in lung-cancer risk"
b) "What they found was that women who smoked had a risk [of getting lung cancer] 27.9 times
as great [relative risk] as non-smoking women; in contrast, the risk for men who smoked
regularly was only 9.6 times greater [relative risk] than that for male non-smokers"
c) One student in five [risk] reports abandoning safe-sex practices when drunk.
6.27
a,b) Tax rate in each income category and overall:
Income category    1974     1978
Under 5,000        5.4%     3.5%
5K to 9,999        9.3%     7.3%
10K to 14,999      11.1%    10%
15K to 99,999      16%      15.9%
100,000 +          38.4%    38.3%
Total              14.1%    15.2%
c) The tax rates decreased within each income category.
d) The overall tax rate increased.
e) This is an instance of Simpson's Paradox. Although the tax rate decreased in each category
from 1974 to 1978, incomes shifted upwards so that in 1978 a larger proportion of incomes were
in higher categories with the result that the overall rate in 1978 gives more weight to the higher
rates in higher income categories.
6.29
a) The new treatment was the more successful treatment in Hospital A. The percent surviving
with the new treatment was 100/1000 or 10%, compared to 5/100 or 5% with the standard
treatment.
b) The new treatment was the more successful treatment in Hospital B. The percent surviving
with the new treatment was 95/100 or 95%, compared to 500/1000 or 50% with the standard
treatment.
c) The combined contingency table is:
            Survive           Die               Total
Standard    505 = 5 + 500     595 = 95 + 500    1100
New         195 = 95 + 100    905 = 900 + 5     1100
Total       700               1500              2200
The standard treatment is more successful than the new treatment in the combined data set. The
percent surviving with the standard treatment is (505/1100) × 100% = 45.9%, compared to
(195/1100) × 100% = 17.7% with the new treatment. This is an example of Simpson's Paradox.
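The reversal can be reproduced directly from the counts quoted in parts (a) and (b). The Python sketch below is one possible way to lay out the calculation; the dictionary layout and names are illustrative.

```python
# (survived, total) for each hospital and treatment, taken from parts (a) and (b).
counts = {
    "A": {"standard": (5, 100), "new": (100, 1000)},
    "B": {"standard": (500, 1000), "new": (95, 100)},
}

# Survival rates within each hospital: the new treatment wins in both.
for hospital, arms in counts.items():
    for treatment, (survived, total) in arms.items():
        print(f"Hospital {hospital}, {treatment}: {survived / total:.1%}")

# Pool the two hospitals: the comparison reverses.
combined = {}
for treatment in ("standard", "new"):
    survived = sum(counts[h][treatment][0] for h in counts)
    total = sum(counts[h][treatment][1] for h in counts)
    combined[treatment] = (survived, total)
    print(f"Combined, {treatment}: {survived}/{total} = {survived / total:.1%}")
```

The reversal arises because most patients given the new treatment were in Hospital A, where survival was low for both treatments, while most patients given the standard treatment were in Hospital B.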
6.43
Step 1:
H0: There is no relationship between hormone therapy and death from CHD
Ha: There is a relationship between hormone therapy and death from CHD
Step 2:
Expected counts:
Step 3:
From above: p-value = 0.4679
Step 4: Fail to reject H0 at the 5% level of significance because the p-value is greater than 0.05
Step 5: Conclude there is not enough evidence to say that there is a relationship between
hormone therapy and risk of death from CHD.
6.56
Since the p-value of 0.0005 is less than 0.001 we conclude that there is very strong evidence of a
relationship between Anger and Heart Disease.
6.57
In the "no anger" group, 191 men were free of heart disease while 8 men had heart disease,
yielding odds of 191 to 8. Dividing both counts by 8 shows that the odds are 23.875 to 1, or
about 24 to 1.
In the "most anger" group, 500 men were free of heart disease while 59 men had heart disease
yielding odds of 500 to 59, i.e. 8.475 to 1, or approximately 8.5 to 1.
6.58
The odds ratio is
(odds for no anger) / (odds for most anger) = (191/8) / (500/59) = 2.817
The odds of remaining free of heart disease versus getting heart disease for men with no anger
are about 2.8 times the odds of those events for men with the most anger.
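The odds in 6.57 and the odds ratio in 6.58 can be checked with a small Python sketch. The counts are those quoted in 6.57; the helper function name is illustrative.

```python
def odds(no_disease, disease):
    """Odds of remaining free of heart disease versus getting it."""
    return no_disease / disease


odds_no_anger = odds(191, 8)      # "no anger" group
odds_most_anger = odds(500, 59)   # "most anger" group
odds_ratio = odds_no_anger / odds_most_anger

print(f"odds, no anger:   {odds_no_anger:.3f} to 1")
print(f"odds, most anger: {odds_most_anger:.3f} to 1")
print(f"odds ratio:       {odds_ratio:.3f}")
```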
6.62
c) Based on the p-value, we conclude that there is evidence of a relationship between Sex and
feelings about weight.
d) There appears to be a reversal in that more men than women feel "about right" at Penn State in
contrast with UC Davis where more women than men feel about right. The difference may not be
significant however. Among those who are not satisfied with their weight, the women at both
Penn State and UC Davis were particularly concerned with being overweight. However, in
contrast with Penn State where men were primarily concerned with being underweight, the men
at UC Davis were more equally concerned with being overweight and being underweight.
Chapter 7
Exercises
Notes:
1. Numbers in red need to be done for Assignment 2.
2. The numbers shown in the text all have the form '7.x' where 'x' is the number of the question
within chapter 7. In the following lists I only show 'x'.
Chapter 7, pp. 240--247:
1. Random Circumstances & 2. Interpretations of Probability
2, 5, 7, 16
3. Probability Definitions and Relationships
18, 19, 20
4. Basic Rules for Finding Probabilities
34, 35, 36, 42
5. Strategies for Finding Complicated Probabilities
44, 45, 46, 47, 50, 54
6. Using Simulation to Estimate Probabilities
none
7. Coincidences and Intuitive Judgments about Probability
64, 68, 72, 76 (similar question likely to be on test)
Chapter Exercises
82, 83, 84, 85, 91 to 98 (sequence of exercises on same problem).
7.34
a) 1/2
b) 1/2
c) 1/2 x 1/2 = 1/4
d) 1/2 + 1/2 – 1/4 = 3/4
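Parts (c) and (d) use the multiplication rule for independent events and the addition rule. A minimal Python sketch with exact fractions is shown below; it assumes, as the answers above imply, two independent events that each have probability 1/2 (the event names A and B are placeholders).

```python
from fractions import Fraction

p_a = Fraction(1, 2)   # part (a)
p_b = Fraction(1, 2)   # part (b)

# Part (c): multiplication rule, assuming A and B are independent.
p_both = p_a * p_b

# Part (d): addition rule, P(A or B) = P(A) + P(B) - P(A and B).
p_either = p_a + p_b - p_both

print(f"P(A and B) = {p_both}")
print(f"P(A or B)  = {p_either}")
```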
7.36
7.42
a) The issue involved is whether any of the traits of driving a red pickup truck, being a smoker
and having blond hair are related. It seems reasonable to assume the three traits are independent,
but we can't know for certain. Perhaps people who drive red vehicles or people who drive pickup
trucks are more likely to be risk takers, as are people who smoke.
b) P(red pickup truck and smokes and blond)= (1/50)(.30)(.20) = .0012. Use the multiplication
rule for independent events (Rule 3b extension).
c) Number fitting the description of the criminal = 10,000(.0012) = 12. The value of the
proportion (.0012) was determined in part (b).
d) Probability = 11/12 = .917 that the driver arrested by the police is innocent. There are 12
vehicle owners who fit the description. Assuming the description is accurate (and that the
perpetrator comes from this town), one of these 12 is guilty and the other eleven are innocent.
e) The answer to part (d) suggests an argument against the prosecutor's reasoning. While the
evidence narrows the possibilities down to only twelve people, eleven of these twelve people are
innocent. Conditional on the given evidence (and only the given evidence), the probability is
high (.917) that the arrested person is innocent.
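The chain of calculations in parts (b) to (d) can be sketched in Python with exact fractions. The variable names are illustrative, and independence of the three traits is the same assumption made in part (a).

```python
from fractions import Fraction

p_red_pickup = Fraction(1, 50)
p_smoker = Fraction(30, 100)
p_blond = Fraction(20, 100)

# Part (b): multiplication rule, assuming the three traits are independent.
p_match = p_red_pickup * p_smoker * p_blond

# Part (c): expected number fitting the description among the 10,000 in part (c).
population = 10_000
expected_matches = population * p_match

# Part (d): one of the matches is guilty, the rest are innocent.
p_innocent = (expected_matches - 1) / expected_matches

print(f"P(match)         = {float(p_match):.4f}")
print(f"expected matches = {expected_matches}")
print(f"P(innocent)      = {p_innocent} = {float(p_innocent):.3f}")
```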
7.45
a) (.6)(100) = 60.
b) (.2)(60) = 12. You could also solve this as (.6)(.2)(100) = 12.
c)
              Senior   Non-senior   Total
Science       12       48           60
Liberal Arts  12       28           40
Total         24       76           100
d) 24/100 = .24. Therefore, 24% of the students are seniors.
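The margins of the table in part (c) and the answer to part (d) can be recomputed from the four cell counts. The Python sketch below is illustrative only; the dictionary layout is an assumption.

```python
# Cell counts from the two-way table in part (c), out of 100 students.
table = {
    "Science":      {"Senior": 12, "Non-senior": 48},
    "Liberal Arts": {"Senior": 12, "Non-senior": 28},
}

# Row totals (by major) and column totals (by class standing).
row_totals = {major: sum(cells.values()) for major, cells in table.items()}
col_totals = {
    standing: sum(table[major][standing] for major in table)
    for standing in ("Senior", "Non-senior")
}
grand_total = sum(row_totals.values())

# Part (d): proportion of all students who are seniors.
p_senior = col_totals["Senior"] / grand_total

print(row_totals, col_totals, grand_total)
print(f"P(senior) = {p_senior:.2f}")
```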
7.46
7.54
7.64
7.68
a) The probability that the person actually carries the virus is 1/11 = .0909 because, in the
low-risk population, for every infected person who tests positive there are 10 people who test
positive and do not carry the virus.
b) Although the probability of testing positive for those with the disease is high, the reverse is
not true. If there are a very large number of people who do not have the disease, then even if only
a small percent of those test positive, the result will be a large number of positive tests in healthy
people. Here is a table that conveys this idea. Notice that everyone with the disease tests positive,
and almost 90% of those without the disease correctly test negative, yet of every 11 people who
test positive, only one has the virus.
           Test positive   Test negative   Total
HIV        10              0               10
Non HIV    100             890             990
Total      110             890             1000
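The three probabilities mentioned in part (b) can be read off the table. The Python sketch below uses the table's counts; the variable names are illustrative.

```python
from fractions import Fraction

# Counts from the table above (hypothetical population of 1000 people).
counts = {
    "HIV":     {"positive": 10,  "negative": 0},
    "Non HIV": {"positive": 100, "negative": 890},
}

total_positive = sum(row["positive"] for row in counts.values())

# P(HIV | test positive): of everyone who tests positive, how many carry the virus?
p_hiv_given_positive = Fraction(counts["HIV"]["positive"], total_positive)

# P(test positive | HIV): everyone with the disease tests positive in this table.
p_positive_given_hiv = Fraction(
    counts["HIV"]["positive"], sum(counts["HIV"].values())
)

# P(test negative | Non HIV): share of healthy people who correctly test negative.
p_negative_given_no_hiv = Fraction(
    counts["Non HIV"]["negative"], sum(counts["Non HIV"].values())
)

print(f"P(HIV | positive)     = {p_hiv_given_positive}")
print(f"P(positive | HIV)     = {p_positive_given_hiv}")
print(f"P(negative | Non HIV) = {float(p_negative_given_no_hiv):.3f}")
```

The two conditional probabilities P(positive | HIV) and P(HIV | positive) are very different, which is the point of part (b).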
7.72
No, she is not correct. Assuming birth outcomes are independent, having 3 consecutive boys
does not change the probability that the fourth child will be a boy.
7.82
a) Percent = 40%, and P(get A | regular attendance) = .40
b) Percent = 10%, and P(get A | not regular attendance) = .10
c) Students who get an A either regularly attend or do not. Over the whole class:
P(get A) = P(regularly attend and get A) + P(do not regularly attend and get A)
The two components on the right side of the previous equation are:
P(regularly attend and get A) = P(regularly attend) × P(get A | regularly attend)
= (.7)(.4) = .28
P(do not regularly attend and get A)
= P(do not regularly attend) × P(get A | do not regularly attend) = (.3)(.1) = .03
So, P(get A) = .28 + .03 = .31, which is 31%.
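The calculation in part (c) is a weighted average of the two conditional probabilities. A minimal Python sketch using exact fractions, with illustrative variable names:

```python
from fractions import Fraction

p_attend = Fraction(7, 10)              # P(regularly attend)
p_a_given_attend = Fraction(4, 10)      # P(get A | regularly attend)
p_a_given_not_attend = Fraction(1, 10)  # P(get A | do not regularly attend)

# The two ways of getting an A: with and without regular attendance.
p_attend_and_a = p_attend * p_a_given_attend
p_not_attend_and_a = (1 - p_attend) * p_a_given_not_attend

# The two cases are mutually exclusive, so their probabilities add.
p_a = p_attend_and_a + p_not_attend_and_a

print(f"P(regularly attend and get A)        = {float(p_attend_and_a):.2f}")
print(f"P(do not regularly attend and get A) = {float(p_not_attend_and_a):.2f}")
print(f"P(get A)                             = {float(p_a):.2f}")
```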
7.84
7.85