Summer 2012 Midterm Solutions

Midterm Exam
Instructor: Tessa Childers-Day
Stat 20
12 July 2012
Please write your name and student ID below, and circle your section. With your
signature, you certify that you have not observed poor or dishonest conduct on the part of
your classmates. You also certify that you have not been a party to poor or dishonest
conduct, and that the work on this exam is solely your own.
Name:
Student ID:
Signature:
Date:
Section:
101 (2pm-3pm)
102 (3pm-4pm)
Answer the questions in the spaces provided. There are questions on the front and back of
each page. This midterm covers the material from Lectures 1 through 13, and Homeworks
1 through 6. Show your work, including labeling quantities (such as z-scores). The clearer
that your work is, the easier it is to award partial or full credit. If you do not show your
work, you will not receive credit. You are welcome to leave your answers as fractions. If
you use decimals, please round all answers to two significant figures, and hold your
rounding until the final calculation.
Question    Points    Score
1           8
2           2
3           6
4           3
5           4
6           20
7           7
8           5
9           5
Total:      60
1. I am interested in exploring rates of condom usage by college students in the United States, including
factors that may influence usage. I want to design a survey and collect student responses. Please address
each of the following issues, being as specific as possible.
(a) (2 points) (Survey or Census) A fellow investigator believes that we should not do a survey, and
instead should perform a census (questioning all college students). How would you explain why a
survey is preferable to a census in this case? Or is the other investigator correct?
Solution: A survey is preferable to a census because a census is very expensive in both money
and time. Interviewers must be trained or questionnaires mailed, and non-responses must be
followed up on.
(b) (3 points) (Sample Design) I propose to create a list of all colleges in the United States, by state.
From each state, I will choose the 2 largest colleges. From each college, I will sort students alphabetically, and mail every 100th student a survey. What kind of sampling plan is this? Comment on
strengths and weaknesses of this plan, including what sort of biases, if any, I should worry about.
Solution: Many answers could be satisfactory; this is just one. This is a multistage plan, including stratification into states (exhaustive and distinct) and systematic (every 100th student)
sampling. A strength is the good geographical coverage (stratification into states guarantees
coverage of the whole country). A weakness is that by using the two largest schools, I exclude
small schools, such as religious institutions, which may lead to selection bias. I should also be
worried about non-response bias (students who fail to return the survey).
(c) (3 points) (Survey Design) Recalling that the purpose of the study is to explore rates of condom
usage and factors that influence usage, I decide to ask how often a student uses a condom, when
engaging in intercourse. Is this all I should ask about, or are there other variables that I should
collect data about? If there are other variables, please give several examples. How should questions
be asked? What sort of biases, if any, should I worry about?
Solution: Many answers could be satisfactory; this is just one. We should ask about other
variables, like gender, age, whether they are in a serious relationship, usage of other forms
of birth control, religious/cultural views on birth control, whether the cost of birth control is
an influencing factor, etc. Because this is a potentially embarrassing subject, we should use a
survey that maintains anonymity, and the questions should be neutrally and non-judgementally
worded, to avoid interviewer/response bias.
2. (2 points) The table below could be used to generate a histogram showing the distribution of the scores
on a Stat 20 quiz (though the histogram is not shown here). There were 8 possible points, and due
to partial credit, no students scored a 0. The table below gives the intervals and heights of the bars.
Intervals contain the right endpoint but not the left. Fill in the height for the missing interval.
score (in points)    height (% per point)
0-3                  3.33
3-4                  17
4-5                  19
5-6                  21
6-8                  ____

Solution:
score (in points)    height (% per point)
0-3                  3.33
3-4                  17
4-5                  19
5-6                  21
6-8                  16.505
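The missing height follows from the fact that the bar areas (height × width) must total 100%. A minimal sketch of that arithmetic, using the table values as given; the variable names are only illustrative:

```python
# Each bar's area (height x width) is the percent of students in that interval,
# and the areas of all bars must add up to 100%.
known_bars = [  # (left endpoint, right endpoint, height in % per point)
    (0, 3, 3.33),
    (3, 4, 17),
    (4, 5, 19),
    (5, 6, 21),
]
missing_left, missing_right = 6, 8

# Percent of students already accounted for by the four known bars.
known_area = sum((right - left) * height for left, right, height in known_bars)

# The remaining percent is spread over an interval that is 2 points wide.
missing_height = (100 - known_area) / (missing_right - missing_left)

print(round(missing_height, 3))
```

The known bars account for about 66.99% of the students, so the remaining 33.01% is spread over a width of 2 points.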
3. Suppose that, on any given day, if the temperature reaches 70°F (or above), I open my office window. If
the temperature does not reach 70°F, I do not open my office window. All temperatures are measured in
degrees Fahrenheit. I collect data on 100 days of temperatures, as well as the status of my office window.
I record the information about the window as “open” or “closed”. Later, I decide to arbitrarily assign
numeric codes to window status, coding “1” if the window is open and “0” if the window is closed. A
sample of 5 days of data is shown below.
Temperature    Window Status    Window Status (coded)
71°            open             1
69°            closed           0
66°            closed           0
66°            closed           0
70°            open             1
...            ...              ...
Circle the correct answer:
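(a) (1 point) The level of measurement for temperature is
A. Nominal
B. Ordinal
C. Interval
D. Ratio
Solution: C. Interval. Differences in degrees Fahrenheit are meaningful, but 0°F is not a true zero, so ratios are not.
(b) (1 point) The level of measurement for window status is
A. Nominal
B. Ordinal
C. Interval
D. Ratio
Solution: A. Nominal. “Open” and “closed” are unordered categories.
(c) (1 point) The level of measurement for window status (coded) is
A. Nominal
B. Ordinal
C. Interval
D. Ratio
Solution: A. Nominal. The numeric codes were assigned arbitrarily, so they are only labels for the same two categories.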
(d) (3 points) I tell my office-mate, “Since we always open the window if the temperature is 70°F or
higher, knowing the temperature perfectly predicts the status of our window. Thus, the correlation
between temperature and window status is +1.” She disagrees, claiming to have calculated the
correlation, and found that it is 0.83. Is she right, am I right, or are we both wrong? How do you
explain this contradiction? (Hint: Drawing a picture may help.)
Solution: We are both wrong. In this case, one shouldn’t calculate a correlation at all. Correlation
measures the strength of a linear relationship, and this relationship is not linear: the scatterplot
is a step, not a football-shaped cloud. Window status is perfectly predictable from temperature, but
not linearly. Also, the order of the encoding matters: coding 0 for open and 1 for closed would flip
the sign of the correlation. Treating the window status numerically doesn’t make much sense, so
calculating a correlation doesn’t make much sense either.
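To see why a perfectly predictable relationship need not have correlation +1, here is a small sketch that computes r for a step-shaped relationship. The temperatures are hypothetical (one day at each of 60°F through 79°F), not the 100 days from the problem, and `pearson_r` is a helper written just for this illustration.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Correlation coefficient of two equal-length lists of numbers."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical data (an assumption for illustration): one day at each temperature.
temps = list(range(60, 80))
window = [1 if t >= 70 else 0 for t in temps]   # 1 = open, 0 = closed
flipped = [1 - w for w in window]               # opposite coding: 0 = open, 1 = closed

r = pearson_r(temps, window)
r_flipped = pearson_r(temps, flipped)
print(round(r, 2), round(r_flipped, 2))
```

Even though temperature determines the window status exactly, r comes out below 1 because the points lie on a step rather than on a line, and swapping the arbitrary 0/1 coding changes the sign of r.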
4. (3 points) I own 10 pairs of socks. They can be described by two attributes: coloring and pattern. 4
pairs are brightly colored, while 6 pairs are striped. 3 pairs are both brightly colored and striped. What
is the chance I wear brightly colored or striped socks three days in a row? (Assume that all of my socks
are clean at the beginning of the first day, and that I neither do laundry nor re-wear socks).
Solution: Note that color and pattern are NOT mutually exclusive. Thus,
P(brightly colored OR striped) = P(brightly colored) + P(striped) − P(brightly colored AND striped)
= 4/10 + 6/10 − 3/10 = 7/10 = 0.7
We are looking at 3 draws, without replacement, from a box with 7 tickets that we desire (brightly
colored or striped) and 3 tickets that we do not desire (neither brightly colored nor striped). Thus:
P(three days) = P(1st day brightly colored OR striped) × P(2nd day brightly colored OR striped | 1st
day brightly colored OR striped) × P(3rd day brightly colored OR striped | first 2 days brightly colored
OR striped)
= 7/10 × 6/9 × 5/8 = 210/720 = 0.292.
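As a check on the multiplication rule, the sketch below compares it with a brute-force count over every ordered choice of 3 different pairs. The breakdown into 3 "both", 1 "bright only", 3 "striped only", and 3 "neither" follows from the counts in the problem; the labels themselves are just illustrative.

```python
from fractions import Fraction
from itertools import permutations

# 4 bright (3 of them also striped), 6 striped (3 of them also bright), 10 pairs in all.
pairs = ["both"] * 3 + ["bright only"] * 1 + ["striped only"] * 3 + ["neither"] * 3

# Multiplication rule for three draws without replacement.
p_rule = Fraction(7, 10) * Fraction(6, 9) * Fraction(5, 8)

# Brute force: every ordered choice of 3 different pairs is equally likely.
orders = list(permutations(range(len(pairs)), 3))
good = sum(all(pairs[i] != "neither" for i in order) for order in orders)
p_enum = Fraction(good, len(orders))

print(p_rule, p_enum, float(p_rule))
```

Both routes count 210 favorable orderings out of 720.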
5. Below is a histogram containing the number of drivers killed in the UK for each month, from January
1969 through December 1984.
[Figure: Histogram of Driver Deaths per Month in UK: 1969−1984. Horizontal axis: Number of Drivers Killed (60 to 200). Vertical axis: Density (0.000 to 0.015).]
(a) (2 points) Sketch and label the approximate locations of the mean, median, and mode on the
histogram. No calculations are necessary.
Solution: We are only concerned about the rough locations of these quantities, and their
ordering.
[Figure: Histogram of Driver Deaths per Month in UK: 1969−1984, with the approximate locations of the mean, median, and mode marked. Horizontal axis: Number of Drivers Killed (60 to 200). Vertical axis: Density (0.000 to 0.015).]
(b) (2 points) There are 4 boxplots shown. One of them corresponds to the same data set as the
histogram in (a) (the other three do not). Indicate which boxplot corresponds to the histogram in
(a). (Labels are shown below the corresponding boxplot)
(b) Boxplot 3
Solution: Boxplot 3 is the correct boxplot. Boxplot 1 is too symmetric, with too many
outliers, indicating heavy tails. Boxplot 2 is too far from symmetric, with too many outliers
on the right tail. Boxplot 4 is skewed in the wrong direction. Boxplot 3 shows the appropriate
right tail, two possible outliers, and a median nearly in the middle of the data.
[Figure: four boxplots, labeled Boxplot #1 through Boxplot #4, each titled “Boxplot of Driver Deaths per Month in UK: 1969−1984”. Vertical axis: Number of Drivers Killed (60 to 200).]
6. Measurements of the weight of 44 snowy plover eggs (along with the weight of the 44 chicks after they are
hatched) were collected by BLSS: The Berkeley Interactive Statistical System of Abrahams and Rizzardi.
The weight of the egg and of the newly hatched chick are measured in grams (gm). A scatterplot of chick
weight versus egg weight is shown below, along with means, SDs, and the correlation coefficient. Assume
that the scatterplot is “football shaped”, and thus that egg weight and chick weight have approximately
normal histograms.
[Figure: scatterplot “Chick Weight vs. Egg Weight”. Horizontal axis: Egg Weight (gm), 7.5 to 10.0. Vertical axis: Chick Weight (gm), 5.5 to 7.0.]
Egg Weight: mean = 8.63 gm sd = 0.48 gm
Chick Weight: mean = 6.15 gm sd = 0.41 gm
correlation: r = 0.85
(a) (3 points) Fill in the blank: About 40% of the chicks have weights between 5.8 grams and
______ grams.
Solution: A chick that weighs 5.8 grams has a z-score of (5.8 − 6.15)/0.41 = −0.85, and is thus at
about the 20th percentile. The 60th percentile is then the point which we are looking for. The 60th
percentile corresponds to a z-score of 0.25, which is 6.15 + 0.25 × 0.41 ≈ 6.25 grams.
(b) (2 points) One of the chicks is at the 75th percentile for weight. Predict the weight of the egg it
hatched from. Include units.
Solution: Here, x = chick weight, y = egg weight. Since the chick is at the 75th percentile, it has a
z-score of about 0.65, so
(y − ȳ)/SDy = r × (x − x̄)/SDx
(y − 8.63)/0.48 = 0.85 × 0.65
y = 0.85 × 0.65 × 0.48 + 8.63
= 8.90
We predict that the egg weighs 8.90 grams.
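The regression method used here can be sketched as a small function: the predicted z-score of y is r times the z-score of x. The function name `regression_predict` and the constant names are illustrative, and the z-score 0.65 for the 75th percentile is the value used in the solution.

```python
def regression_predict(x, mean_x, sd_x, mean_y, sd_y, r):
    """Predict y from x: the predicted z-score of y is r times the z-score of x."""
    z_x = (x - mean_x) / sd_x
    return mean_y + r * z_x * sd_y

# Summary statistics from the problem; here x = chick weight, y = egg weight.
CHICK_MEAN, CHICK_SD = 6.15, 0.41
EGG_MEAN, EGG_SD = 8.63, 0.48
R = 0.85

# A chick at the 75th percentile is about 0.65 SDs above the mean chick weight.
chick_weight = CHICK_MEAN + 0.65 * CHICK_SD
predicted_egg = regression_predict(chick_weight, CHICK_MEAN, CHICK_SD, EGG_MEAN, EGG_SD, R)

print(round(predicted_egg, 2))
```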
(c) (2 points) The prediction in (b) is likely to be off by how much? Include units.
Solution: The RMS error for regression tells us how far off a prediction is likely to be. The
RMS error for predicting egg weight from chick weight features egg weight as the y variable, so
RMS Error = √(1 − r²) × SDy = √(1 − 0.85²) × 0.48 = 0.25.
Our prediction is likely to be off by 0.25 grams or so.
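A short sketch of the RMS error arithmetic, using the values from the problem. The second calculation, for the opposite prediction direction, is added only to show that the SD in the formula always belongs to the variable being predicted.

```python
from math import sqrt

r = 0.85
sd_egg = 0.48    # SD of the variable being predicted in part (c)
sd_chick = 0.41

# RMS error = sqrt(1 - r^2) x SD of the predicted variable.
rms_error_egg = sqrt(1 - r ** 2) * sd_egg

# For contrast: predicting chick weight from egg weight uses the chick SD instead.
rms_error_chick = sqrt(1 - r ** 2) * sd_chick

print(round(rms_error_egg, 2), round(rms_error_chick, 2))
```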
(d) (4 points) Among all chicks at the 75th percentile for weight, approximately what percent hatched
from eggs heavier than 75% of all eggs?
Solution: Chicks at the 75th percentile for weight come from eggs that weigh, on average, 8.90
grams, give or take 0.25 grams or so. These are our new mean and new SD for the normal
curve inside the strip. We want to know what percentage of these eggs are heavier than the 75th
percentile for all eggs. The 75th percentile for all eggs is at z = 0.65, or 0.65 × 0.48 + 8.63 = 8.94
grams. Thus, we need the area above
z = (8.94 − 8.90)/0.25 = 0.04/0.25 = 0.16.
The area corresponding to z = 0.16 (the area between −z and z, read from the normal table at the
nearest entry, z = 0.15) is 11.92%. Thus approximately (100 − 11.92)/2 = 44.04% of chicks at the 75th
percentile for weight hatched from eggs heavier than 75% of all eggs.
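The "normal curve inside the strip" calculation can also be done in software. This sketch uses the rounded values from the solution (new mean 8.90 gm, new SD 0.25 gm, cutoff 8.94 gm). Because it evaluates the normal curve at exactly z = 0.16 rather than reading a table, it gives roughly 43.6% instead of 44.04%; the gap comes only from table rounding.

```python
from statistics import NormalDist

# Eggs of chicks at the 75th percentile: centered at the regression prediction,
# with spread equal to the RMS error (rounded values from parts (b) and (c)).
strip = NormalDist(mu=8.90, sigma=0.25)

# 75th percentile of all eggs, rounded as in the solution (0.65 x 0.48 + 8.63).
egg_cutoff = 8.94

fraction_heavier = 1 - strip.cdf(egg_cutoff)
print(round(100 * fraction_heavier, 1))
```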
(e) (2 points) Using the regression method, and predicting chick weight from egg weight, we find that
eggs weighing 8 grams produce chicks weighing 5.69 grams, on average. True or false, and explain
your answer: using the regression method, and predicting egg weight from chick weight, we would
find that chicks weighing 5.69 grams hatch from eggs weighing 8 grams, on average.
Solution: False. There are 2 regression lines, one that predicts chick weight from egg weight,
and one that predicts egg weight from chick weight. The first one tells us that a chick from an
egg of weight 8 grams will weigh 5.69 grams, on average. The second one tells us that chicks
that weigh 5.69 grams hatch from eggs that weigh 8.17 grams, on average (NOT 8 grams). You
did not need to calculate the estimate for egg weight based on chick weight. The explanation
was sufficient.
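Although the calculation was not required, the two regression lines are easy to compare numerically. In this sketch the helper name `predict` is illustrative; the summary statistics are those given in the problem.

```python
def predict(x, mean_x, sd_x, mean_y, sd_y, r):
    """Regression prediction of y from x."""
    return mean_y + r * (x - mean_x) / sd_x * sd_y

EGG_MEAN, EGG_SD = 8.63, 0.48
CHICK_MEAN, CHICK_SD = 6.15, 0.41
R = 0.85

# Line 1: predict chick weight from egg weight (x = egg, y = chick).
chick_from_egg = predict(8.0, EGG_MEAN, EGG_SD, CHICK_MEAN, CHICK_SD, R)

# Line 2: predict egg weight from chick weight (x = chick, y = egg).
egg_from_chick = predict(5.69, CHICK_MEAN, CHICK_SD, EGG_MEAN, EGG_SD, R)

print(round(chick_from_egg, 2), round(egg_from_chick, 2))
```

Going from an 8-gram egg to a chick and back again does not return to 8 grams, because each prediction is pulled toward its own mean by the factor r.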
(f) (4 points) Somehow, the data on chicks weighing more than 6.2 grams is lost. The correlation
between egg weight and chick weight for the remaining data points will be:
A. About the same (r ≈ 0.85)
B. Somewhat less (r < 0.85)
C. Somewhat greater (r > 0.85)
Circle the correct choice, and explain why it is correct.
Solution: B. The correlation decreases, because by cutting off part of our data, we lose some of
the linearity. The remaining cloud of data points looks more round, and less football shaped,
so our correlation decreases. This is called attenuation.
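Attenuation can be illustrated by simulation. The sketch below does not use the plover data: it generates hypothetical football-shaped data with the same means, SDs, and r = 0.85, then drops chicks heavier than 6.2 gm and recomputes r. The seed, sample size, and helper `pearson_r` are choices made only for this illustration.

```python
import random
from math import sqrt

def pearson_r(xs, ys):
    """Correlation coefficient of two equal-length lists of numbers."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

random.seed(0)            # fixed seed so the sketch is reproducible
R, N = 0.85, 10_000       # N is an arbitrary (assumed) simulation size

eggs, chicks = [], []
for _ in range(N):
    z_egg = random.gauss(0, 1)
    # Build a chick z-score whose correlation with z_egg is R.
    z_chick = R * z_egg + sqrt(1 - R ** 2) * random.gauss(0, 1)
    eggs.append(8.63 + 0.48 * z_egg)
    chicks.append(6.15 + 0.41 * z_chick)

# "Lose" the data on chicks weighing more than 6.2 gm.
kept = [(e, c) for e, c in zip(eggs, chicks) if c <= 6.2]

r_all = pearson_r(eggs, chicks)
r_kept = pearson_r([e for e, _ in kept], [c for _, c in kept])
print(round(r_all, 2), round(r_kept, 2))
```

In the simulated data the correlation for the truncated cloud is noticeably smaller than for the full cloud.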
(g) (3 points) One of your classmates observes that chicks that hatch from small eggs tend to be larger
than the regression line predicts, while chicks that hatch from large eggs tend to be smaller than
the regression line predicts. They attribute this to small eggs having thin shells, and large eggs
having thick shells (i.e., the weight is due not to the chick, but to the shell). Does this make sense?
Or is something else at work here?
Solution: This is actually the regression effect at work; nothing special is going on here. Your
classmate is committing the regression fallacy by believing that a simple phenomenon that
occurs in all regression situations is actually due to a significant/important cause.
7. There are 20 coins in a jar. Of these, 8 are quarters, 5 are dimes, 3 are nickels, and 4 are pennies. 8
coins are drawn at random, without replacement from the jar.
(a) (2 points) What is the chance that the first coin is a quarter and the second coin is a dime?
Solution: P(1st quarter and 2nd dime) = P(1st quarter) × P(2nd dime | 1st quarter) = 8/20 × 5/19 = 0.105
(b) (2 points) What is the chance that the fourth coin is a quarter and the eighth coin is a dime?
Solution: Because we know nothing about any of the other coins, this is the same as the
probability above.
P(4th quarter and 8th dime) = P(1st quarter and 2nd dime) = P(1st quarter) × P(2nd dime | 1st quarter)
= 8/20 × 5/19 = 40/380 = 0.105
(c) (3 points) What is the chance that the last two coins are of the same denomination?
Solution: Since we know nothing about any of the other coins, this is the same as the first
two draws being of the same denomination.
P(last two are the same) = P(first two are the same) = P(first two are quarters or first two are
dimes or first two are nickels or first two are pennies).
All of these events are mutually exclusive, thus
P(last two are the same) = P(first two are quarters) + P(first two are dimes) + P(first two are
nickels) + P(first two are pennies)
= 8/20 × 7/19 + 5/20 × 4/19 + 3/20 × 2/19 + 4/20 × 3/19 = 94/380 = 0.247
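The exact answers for this question can be checked with fractions, and the symmetry argument in (b) can be checked by simulation. In this sketch the seed and the number of trials are arbitrary choices, and the simulated figure is only an estimate.

```python
import random
from fractions import Fraction

counts = {"quarter": 8, "dime": 5, "nickel": 3, "penny": 4}
total = sum(counts.values())  # 20 coins

# (a) First coin a quarter, second coin a dime.
p_quarter_then_dime = Fraction(counts["quarter"], total) * Fraction(counts["dime"], total - 1)

# (c) Two particular draws show the same denomination (sum over mutually exclusive cases).
p_same_pair = sum(Fraction(n, total) * Fraction(n - 1, total - 1) for n in counts.values())

# (b) Simulation: 4th coin a quarter and 8th coin a dime.
random.seed(1)                      # fixed seed for reproducibility
jar = [coin for coin, n in counts.items() for _ in range(n)]
trials = 100_000
hits = 0
for _ in range(trials):
    draw = random.sample(jar, 8)    # 8 coins in order, without replacement
    if draw[3] == "quarter" and draw[7] == "dime":
        hits += 1
estimate = hits / trials

print(float(p_quarter_then_dime), float(p_same_pair), estimate)
```

The simulated chance for the 4th and 8th draws lands close to the exact chance for the 1st and 2nd draws, as the symmetry argument says it should.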
8. A fair die is rolled 5 times. Find the chances of the following events.
(a) (2 points) All “6”s
Solution: P(five “6”s) = (1/6)^5 = 1/7776 = 0.00013
(b) (3 points) At least two “6”s
Solution: Note that a die roll can fit the binomial situation, where P(6) = 1/6 and P(not 6) = 5/6.
P(at least two “6”s) = 1 − P(fewer than two “6”s) = 1 − P(zero “6”s or one “6”)
= 1 − [P(zero “6”s) + P(one “6”)]
= 1 − [5!/(5!0!) × (5/6)^5 × (1/6)^0 + 5!/(4!1!) × (5/6)^4 × (1/6)^1]
= 1 − [0.402 + 0.402] = 0.196
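The binomial terms above can be evaluated directly. This is a minimal sketch using `math.comb` from the Python standard library; `binom_pmf` is a helper defined here, not a library function.

```python
from math import comb

def binom_pmf(k, n, p):
    """Chance of exactly k successes in n independent trials, each with chance p."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

p_six = 1 / 6

# (a) All five rolls are sixes.
p_all_sixes = binom_pmf(5, 5, p_six)

# (b) At least two sixes = 1 - [P(zero sixes) + P(one six)].
p_zero = binom_pmf(0, 5, p_six)
p_one = binom_pmf(1, 5, p_six)
p_at_least_two = 1 - (p_zero + p_one)

print(round(p_all_sixes, 5), round(p_zero, 3), round(p_one, 3), round(p_at_least_two, 3))
```

The chances of zero sixes and of exactly one six are equal here, since both work out to 3125/7776.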
9. A fair die is rolled 10 times. Find the chances of the following events.
(a) (2 points) Exactly five “5”s
Solution: This is just a single binomial term:
P(five “5”s) = 10!/(5!5!) × (1/6)^5 × (5/6)^5 = 0.013
(b) (3 points) At most one “6” in the 10 rolls, given there are no “6”s in the first five rolls
Solution: Since there are no “6”s in the first five rolls, we must roll at most one “6” in the
second five rolls. The two sets of five rolls are independent of each other. Thus, we simply need
the probability of rolling zero “6”s or one “6” in 5 rolls. This was found in the course of 8(b).
P(at most one “6” in ten rolls | zero “6”s in the first five rolls) = P(zero “6”s or one “6” in the
last five rolls) = P(zero “6”s) + P(one “6”)
= 5!/(5!0!) × (5/6)^5 × (1/6)^0 + 5!/(4!1!) × (5/6)^4 × (1/6)^1
= 0.402 + 0.402 = 0.804
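The conditional probability can be confirmed straight from its definition, P(A and B) / P(B), by listing all 2^10 patterns of "six" / "not six" over the ten rolls and weighting each pattern by its chance. This is a sketch for checking the answer, not the method expected on the exam; the names are illustrative.

```python
from fractions import Fraction
from itertools import product

P_SIX = Fraction(1, 6)

def pattern_prob(pattern):
    """Chance of one particular six / not-six pattern across independent rolls."""
    prob = Fraction(1)
    for is_six in pattern:
        prob *= P_SIX if is_six else 1 - P_SIX
    return prob

p_given = Fraction(0)   # P(no sixes in the first five rolls)
p_both = Fraction(0)    # P(no sixes in the first five AND at most one six in all ten)

for pattern in product([False, True], repeat=10):
    if any(pattern[:5]):
        continue        # a six among the first five rolls: outside the given event
    prob = pattern_prob(pattern)
    p_given += prob
    if sum(pattern) <= 1:
        p_both += prob

p_conditional = p_both / p_given
print(p_conditional, round(float(p_conditional), 3))
```

The ratio reduces to the chance of at most one six in five rolls, which is what the independence argument predicts.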