Final Exam with solutions

Math 1011
Final Exam
May 1 2014
Name:
ID:
1. True or false, and explain briefly if your answer is false:
(a) In an observational study, it is the subjects who assign themselves to the different
groups (the treatment group and the control group). The investigators just watch
what happens. (5 points)
(b) The CLT (Central Limit Theorem) can be applied to both sum of draws and product
of draws. (5 points)
(c) Suppose a 95%-confidence interval for the average household size in a city is: 2.16 ∼
2.44. This tells you that 95% of the households in the city contain between 2.16 and
2.44 persons. (5 points)
(d) A test statistic measures the difference between the data and what is expected based
on the alternative hypothesis. (5 points)
Solution.
(a) True.
(b) False. The CLT can be applied to sum of draws but not to product of draws.
(c) False. It confuses the SD with the SE. The SE measures the likely size of the chance
error in the sample average, not the spread of individual households. The confidence
level says that for about 95% of all samples of the same size, the corresponding
confidence interval will cover the true average. The interval 2.16 ∼ 2.44 is
just one of them.
(d) False. The expected value is based on the null hypothesis, not the alternative. All
the calculations are based on the null.
2. Multiple choice: there is only one correct answer to each of the following questions, please
circle the correct one.
(i) If a technician wants to determine whether a roulette wheel (38 pockets) is ready to
use or not, then the technician could use a ______. (5 points)
I. one-sample z-test
II. χ²-test
(a) only I
(b) only II
(c) either I or II
(d) none of above
(ii) A 10-gram weight is sent to a local laboratory to determine whether it needs
calibration. A technician reports the following measurements:
10.0001 grams
10.0004 grams
9.9998 grams
10.0003 grams.
Suppose the SD is known to be 0.0001 grams, and the chance errors follow the normal
distribution. According to the data, the technician could use a ______. (5 points)
(a) one-sample z-test
(b) two-sample z-test
(c) t-test
(d) χ²-test
Solution.
(i) (b). It involves 38 categories, so we have to use a χ²-test.
(ii) (a). Since we know the SD of the box and the errors follow the normal curve, we use
the normal approximation.
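The one-sample z-test in (ii) can be sketched numerically as follows. This is only an illustration: it assumes the null hypothesis is that the true weight is exactly 10 grams, and the variable names are made up for the example.

```python
from math import sqrt
from statistics import NormalDist, mean

measurements = [10.0001, 10.0004, 9.9998, 10.0003]  # grams, from question (ii)
nominal = 10.0   # assumed null hypothesis: the weight is exactly 10 grams
sd = 0.0001      # SD given in the question

# SE for the average of the 4 measurements
se_avg = sd / sqrt(len(measurements))

# z-statistic: observed average minus expected, divided by the SE
z = (mean(measurements) - nominal) / se_avg

# Two-sided P-value from the normal curve
p_two_sided = 2 * (1 - NormalDist().cdf(abs(z)))

print(f"z = {z:.2f}, two-sided P = {p_two_sided:.4f}")
```

A two-sided P-value is used here because the weight could be off in either direction; that choice is an assumption, not something the question states.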
3. In one study, an organization wants to show the dependence between family incomes and
living regions in a small town. More precisely, they divide the family incomes into 3 levels:
poor, median, and rich; and they also divide the small town into 2 regions: region A and
region B. They plan to show the distribution of the 3 levels in region A differs from the
distribution in region B.
In order to do that, they want to draw a representative sample of households in both
regions, and compute the sample distributions to carry out a test. Suppose they do the survey
in this way: they use their own judgment to choose some specific districts in both regions,
then they hand out questionnaires to the households in the chosen districts and wait for
the responses. Although the response rate is very low (only 8% in total), they still get a
sample:
            rich   median   poor
Region A      17      266      8
Region B      52      134      3
(a) What kind of test shall the organization use? What is the null hypothesis and what
is the alternative? (5 points)
(b) Is the sample they got representative? Yes or no, explain briefly. If your answer is
no, please give some suggestions to improve the survey? (5 points)
Solution.
(a) The organization shall use a χ²-test. The null says that family income and living
region are independent, that is, the distribution of the 3 levels in region A is the
same as the distribution in region B. The alternative says the distribution differs
from region to region.
(b) No. There are selection bias and nonresponse bias in the survey. The specific districts
chosen in the town might not be representative. For instance, in region B, they might
have chosen a rich district for the questionnaires. Both rich and poor households tend
not to respond, so the sample could understate the share of rich and poor households.
Suggestions: they should use a probability method to draw the sample rather than use
their own judgment. They should have no discretion at all. It is also better to send
interviewers to randomly selected households rather than hand out questionnaires,
which could increase the response rate.
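The mechanics of the χ²-test in part (a) can be sketched with the sample counts above. This is an illustration of the arithmetic only; as part (b) explains, the sample is biased, so the result should not be trusted. The dictionary layout and variable names are assumptions made for the example.

```python
# Observed counts from the survey table
observed = {
    "Region A": {"rich": 17, "median": 266, "poor": 8},
    "Region B": {"rich": 52, "median": 134, "poor": 3},
}
levels = ["rich", "median", "poor"]

row_totals = {region: sum(counts.values()) for region, counts in observed.items()}
col_totals = {level: sum(observed[r][level] for r in observed) for level in levels}
grand_total = sum(row_totals.values())

# Under the null (independence), expected count = row total * column total / grand total
chi2 = 0.0
for region in observed:
    for level in levels:
        expected = row_totals[region] * col_totals[level] / grand_total
        chi2 += (observed[region][level] - expected) ** 2 / expected

# Degrees of freedom for an r-by-c table: (r - 1) * (c - 1)
degrees_of_freedom = (len(observed) - 1) * (len(levels) - 1)

print(f"chi-square = {chi2:.1f} on {degrees_of_freedom} degrees of freedom")
```

The statistic would then be compared with a χ² table at 2 degrees of freedom.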
4. In a clinical trial of a new therapy for lung cancer, a group of patients was selected to be
observed in the study. Some of the patients felt so despairing that they refused to try
any therapy, and the others decided to try the new one. Over the period of the study, the
researchers collected data on the survival rate, that is, the rate of those who survive 5 years
or longer after the treatment. They wanted to show that the new therapy indeed had an effect on
lung cancer by using a test of significance.
(a) What kind of test shall the researchers use? What is the null hypothesis and what is
the alternative? (5 points)
(b) Are the data based on an observational study or a randomized controlled experiment?
What is the treatment group, and what is the control group? (5 points)
(c) Did the data they got from the study work? Explain briefly. (5 points)
Solution.
(a) The researchers shall use a two-sample z-test. (Strictly speaking, the two-sample z-test
applies only to a randomized controlled experiment, not to an observational study. The
design here is poor: the researchers would have to switch to a randomized controlled
experiment and then apply the two-sample z-test. So for this question, any test you
answer will get credit.) The null hypothesis is: the survival rate of the treatment group
is the same as the survival rate of the control group. The alternative is: the new therapy
indeed had an effect, that is, the survival rate of the treatment group is higher than that of
the control group.
(b) The data are based on an observational study. The treatment group is the group
of patients who decided to try the new therapy. The control group is the group of
patients who refused to try any therapies.
(c) No, the data did not work. The survival rate for the treatment group might be higher
than the therapy alone would explain. This is because the patients in the control group are
those who felt rather despairing, and the patients in the treatment group are those who
were more optimistic. The patients in the treatment group may have been more likely
to survive regardless of which therapy they received. This factor is confounded with
the effect of the new therapy, so the comparison is biased.
5. In one year, 200 students took calculus I. The average score is 65, and the SD is 10. Suppose
the histogram of the scores follows the normal distribution.
(a) Use the normal curve to estimate the number of the students with scores between 70
and 80. (10 points)
(b) If one student claims his score is higher than 84% of all the students, use the normal
approximation to estimate his score. (10 points)
Note: You may need the following data from the normal table:
z       Height   Area
0.50    35.21    38.29 (≈ 38)
0.75    30.11    54.67 (≈ 55)
1.00    24.20    68.27 (≈ 68)
1.25    18.26    78.87 (≈ 79)
1.50    12.95    86.64 (≈ 87)
Solution.
(a) Convert the scores into standard units: (70 − 65)/10 = 0.5, (80 − 65)/10 = 1.5. From the normal
table, the area under the normal curve between -0.5 and 0.5 is 38%, and the area
between -1.5 and 1.5 is 87%. So the area under the normal curve between 0.5 and
1.5 is 1/2 × 87% − 1/2 × 38% = 24.5%. Hence, the number of the students with scores
between 70 and 80 is about 200 × 24.5% = 49.
(b) Suppose the student’s score is converted into the standard unit z. Then according to
the claim, the area under the normal curve to the left of z is about 84%. So the area
to the right of z is about 100% − 84% = 16%. By symmetry of the normal curve,
the area to the left of -z is about 16%. Therefore, the area between -z and z is about
84% − 16% = 68%. From the table, we see that z=1.00. Hence, the score is about
65 + 10 × 1.00 = 75.
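Both parts can be cross-checked without the table. The sketch below uses `statistics.NormalDist` from the Python standard library; because it works with the exact normal curve rather than the rounded table areas, the figures may differ slightly from the hand calculation (part (a) comes out closer to 48 than 49).

```python
from statistics import NormalDist

n_students, average, sd = 200, 65, 10
scores = NormalDist(mu=average, sigma=sd)

# (a) fraction of scores between 70 and 80, then the implied number of students
fraction = scores.cdf(80) - scores.cdf(70)
count = n_students * fraction

# (b) the score with 84% of the area to its left
score_84 = scores.inv_cdf(0.84)

print(f"(a) about {count:.0f} students; (b) score about {score_84:.0f}")
```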
6. One year, there were about 600,000 faculty members at around 3,000 institutions of higher
learning in the U.S. (including junior colleges and community colleges). As part of a
continuing study of higher education, the Carnegie Commission took a simple random
sample of 2,500 of these faculty persons. On the average, these 2,500 sample persons had
published 1.7 research papers in the two years prior to the survey, and the SD was 2.3
papers.
Find a 95%-confidence interval for the average number of research papers published by all
600,000 faculty members in the two years prior to the survey. (10 points)
Solution. The average number of research papers published by all 600,000 faculty members can be estimated by the sample average as 1.7. Since the SD of the population is
unknown, we use the bootstrap method to estimate it as 2.3 papers. So the SE for sum
is about √2,500 × 2.3 = 115 papers. Then the SE for average is 115/2,500 = 0.046. Hence a
95%-confidence interval is 1.7 ± 0.092.
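The same arithmetic, written out step by step as a small sketch. The variable names are illustrative, and the factor 2 reflects the usual convention that a 95%-confidence interval spans about 2 SEs on either side of the estimate.

```python
from math import sqrt

n = 2500
sample_avg = 1.7
sample_sd = 2.3   # used as the bootstrap estimate of the population SD

se_sum = sqrt(n) * sample_sd   # SE for the sum of the draws
se_avg = se_sum / n            # SE for the average
margin = 2 * se_avg            # about 2 SEs for 95% confidence

low, high = sample_avg - margin, sample_avg + margin
print(f"SE for average = {se_avg:.3f}; interval = {low:.3f} to {high:.3f}")
```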
7. According to the record in 2000, the 60th percentile of the family income in a certain city
was $71,000. In 2013, a market research organization took a simple random sample of 500
families in the city; about 42% of the sample families had incomes over $71,000. Did the
60th percentile of the family income in this city increase over the period 2000 to 2013?
Formulate the null and alternative hypotheses, and use a test of significance to answer the
question. (15 points)
Note: You may need the following data from the normal table:
z       Height   Area
0.80    28.97    57.63 (≈ 58)
0.90    26.61    63.19 (≈ 63)
1.00    24.20    68.27 (≈ 68)
1.10    21.79    72.87 (≈ 73)
Solution. To test the statement that the 60th percentile of the family income in the city
increased over the period 2000 to 2013, we use a 0-1 box. Tickets are marked 1 for the
families having incomes over $71,000 in 2013, and 0 for the others. The data are like 500
draws from the box. If the 60th percentile remained the same as in 2000, then about
40% of the families had incomes over $71,000, that is, there are 40% of 1’s in the box.
If instead the difference between the observed 42% and 40% is real, then more than 40% of
the families had incomes over $71,000, so the 60th percentile did increase over the period.
So the null is: the 60th percentile remained the same as in 2000, that is, there are 40% of 1’s
in the box. The alternative is: the 60th percentile did increase, that is, there are more than 40%
of 1’s in the box.
We use the one-sample z-test. Based on the null, the SD of the box can be estimated
as √(0.4 × 0.6) ≈ 0.5. So the SE for number is about √500 × 0.5 ≈ 11. Then the SE for
percentage is about 11/500 × 100% = 2.2%. Hence, the z-statistic is about
z ≈ (42% − 40%)/2.2% ≈ 0.9.
Therefore, the P-value is about 18.5%, which is not significant. So we stay with the null
hypothesis: the data do not show that the 60th percentile increased over the period 2000 to
2013.
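The z-test above can be reproduced as a short sketch. It keeps full precision instead of rounding the SD to 0.5 and the SE to 2.2%, so z and the P-value may differ slightly from the hand calculation. The one-sided P-value matches the alternative "more than 40% of 1’s in the box"; the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

n = 500
null_fraction = 0.40      # fraction of 1's in the box under the null
observed_fraction = 0.42  # sample families with incomes over $71,000

sd_box = sqrt(null_fraction * (1 - null_fraction))  # SD of a 0-1 box
se_fraction = sd_box / sqrt(n)                      # SE for the sample fraction
z = (observed_fraction - null_fraction) / se_fraction

# One-sided P-value: area under the normal curve to the right of z
p_value = 1 - NormalDist().cdf(z)

print(f"z = {z:.2f}, P = {p_value:.1%}")
```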
8. (Bonus Problem) Do only when you finish the other problems.
In a Nevada roulette, there are 38 pockets numbered: "0", "00", and 1 through 36.
(i) One bet is odd or even, and it pays 1 to 1. That is, for the numbers 1 through 36, it
will be either an odd number or an even number. When betting $1, if you win, the
house will give you an extra $1; if you lose, the house will get your $1. If it comes out
"0" or "00", you lose.
(ii) Another bet is 1 ∼ 18 or 19 ∼ 36, and it also pays 1 to 1. For the numbers 1 through
36, it will be either in 1 ∼ 18 or in 19 ∼ 36. When betting $1, if you win, the house
will give you an extra $1; if you lose, the house will get your $1. Again, if it comes
out "0" or "00", you lose.
Suppose someone will bet $1 on odd, and at the same time, someone else will bet $1 on
1 ∼ 18. This pair of bets is made 400 times.
(a) What is the expected net gain for the house? Give or take by how much or so? (Please
round to integer. 5 points)
(b) How many times will the house make money? Give or take by how much or so?
(Please round to integer. 5 points)
Solution.
(a) We analyze the outcome for each of the 38 pockets: (i) for "0" or "00", the house gets $2
each time; (ii) for 1 ∼ 18, the house loses $1, and on odd numbers the house
loses another $1, so there are 9 -$2’s, while on even numbers the two bets cancel out,
giving 9 $0’s; (iii) for 19 ∼ 36, the house wins $1, but on odd numbers it is canceled out,
so there are 9 $0’s and 9 $2’s. In
summary, the tickets are: $2, $2, 9 -$2’s, 9 $0’s, 9 $0’s, and 9 $2’s, that is 11 $2’s, 9
-$2’s, and 18 $0’s.
In 400 plays, the expected net gain will be (11 × 2 + 9 × (−2))/38 × 400 ≈ 42 dollars. Since
the average of the box is approximately 0, the SD of the box can be estimated as
√((11 × 2² + 9 × (−2)²)/38) ≈ 1.5. Then the SE for the sum is about √400 × 1.5 = 30 dollars.
Hence, the net gain for the house would be about 42 dollars, give or take 30 dollars
or so.
(b) We use a new 0-1 box, with tickets marked 1 for the house making money, and 0
otherwise. From part (a), the house makes money only when it gets $2, so there are
11 1’s and 27 0’s in the box. The expected number of times is 11/38 × 400 ≈ 116. The SD of
the box is √(11/38 × 27/38) ≈ 0.45. Then the SE for number is √400 × 0.45 ≈ 9. Therefore,
the house will win about 116 times, give or take 9 times or so.
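The box for both parts can also be built by enumerating the 38 pockets, as a check on the ticket counts. The sketch below is illustrative; the helper `house_gain` is a made-up name. It uses the exact SD of the box rather than the rounded 1.5, so the SE in part (a) comes out nearer 29 than 30.

```python
from math import sqrt

def house_gain(pocket):
    """House's net gain in dollars on one spin; pocket is '0', '00', or 1..36."""
    if pocket in ("0", "00"):
        return 2  # both bettors lose
    gain_from_odd_bet = -1 if pocket % 2 == 1 else 1   # bet on odd
    gain_from_low_bet = -1 if pocket <= 18 else 1      # bet on 1 to 18
    return gain_from_odd_bet + gain_from_low_bet

box = [house_gain(p) for p in ["0", "00", *range(1, 37)]]
n_plays = 400

# (a) net gain: expected value and SE for the sum of 400 draws
avg = sum(box) / len(box)
sd = sqrt(sum((t - avg) ** 2 for t in box) / len(box))
expected_gain = n_plays * avg
se_gain = sqrt(n_plays) * sd

# (b) number of plays on which the house makes money: a 0-1 box
p_win = sum(t > 0 for t in box) / len(box)
expected_wins = n_plays * p_win
se_wins = sqrt(n_plays) * sqrt(p_win * (1 - p_win))

print(f"(a) {expected_gain:.0f} dollars, give or take {se_gain:.0f}")
print(f"(b) {expected_wins:.0f} times, give or take {se_wins:.0f}")
```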