Mathematics 344
Extra Problems
May 6, 2015
1. A device that continuously measures and records seismic activity is placed in a remote region. The time, T,
to failure of this device is exponentially distributed with mean 3 years. Since the device will not be monitored
during its first two years of service, the time to discovery of its failure is X = max(T, 2). Calculate Exp(X).
Hint: you might want to compute these quantities: P(T < 2), Exp(X | T < 2), Exp(X | T > 2). The last is
not hard to compute if you remember the cool property of the exponential distribution.
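A simulation can serve as a rough check on the analytic answer to problem 1. The sketch below assumes the exponential distribution is parameterized by its rate, 1/3 per year; the seed and the number of simulated devices are arbitrary choices, not part of the problem.

```r
# Monte Carlo check for problem 1 (a sketch, not a derivation)
set.seed(344)                         # arbitrary seed, for reproducibility
n_sim <- 1e6                          # arbitrary number of simulated devices
T_fail <- rexp(n_sim, rate = 1 / 3)   # failure times with mean 3 years
X <- pmax(T_fail, 2)                  # a failure is never discovered before year 2
mean(X)                               # simulation estimate of the expected value of X
```

The estimate should agree with the hand computation to about two decimal places.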
2. A charity receives 2025 contributions. Contributions are assumed to be mutually independent and identically
distributed with mean 3125 and standard deviation 250. Calculate the approximate 90th percentile for the
distribution of the total contributions received.
(a) 6,328,000
(b) 6,338,000
(c) 6,343,000
(d) 6,784,000
(e) 6,977,000
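The normal approximation in problem 2 can be evaluated in R. This is a minimal sketch assuming the central limit theorem applies to the total; the variable names are illustrative.

```r
# Normal approximation for the total of n iid contributions
n <- 2025
mu <- 3125
sigma <- 250
total_mean <- n * mu          # mean of the sum
total_sd <- sqrt(n) * sigma   # standard deviation of the sum
p90 <- qnorm(0.90, mean = total_mean, sd = total_sd)
p90                           # approximate 90th percentile of the total
```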
3. Claim amounts for wind damage to insured homes are mutually independent random variables with common
density function
f(x) = 3/x^4 for x > 1, and f(x) = 0 otherwise,
where x is the amount of a claim in thousands. Suppose 3 such claims will be made. Calculate the expected
value of the largest of the three claims.
(a) 2025
(b) 2700
(c) 3232
(d) 3375
(e) 4500
4. You notice 4 cards face down on a table. For some reason you turn 2 of them over and notice that both of
them are red. What is the maximum likelihood estimate for the number of the original 4 cards that were red?
(Note: your parameter here is the number of cards that are red so the possible values of the parameter are
0, 1, 2, 3, 4.)
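The likelihood in problem 4 can be tabulated over the five possible parameter values. The sketch below assumes the two cards turned over are a random draw without replacement, so the count of red cards among them is hypergeometric.

```r
# Likelihood of each candidate number of red cards r = 0, ..., 4,
# given that 2 cards drawn without replacement from the 4 were both red
r <- 0:4
lik <- dhyper(2, m = r, n = 4 - r, k = 2)
data.frame(r, likelihood = lik)
r[which.max(lik)]   # the value of r that maximizes the likelihood
```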
5. Suppose that a Bayesian has to report a point estimate for a parameter θ. One possibility is the mean of the
posterior distribution. Suppose that X ∼ Binom(10, π) and π has a uniform distribution. Suppose that we observe
x = 7. What is the
(a) maximum likelihood estimate of π?
(b) mean of the posterior distribution of π?
(c) maximum of the posterior density of π? (the MAP estimator)
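The three estimates in problem 5 can be compared numerically. This sketch relies on the standard fact that a uniform prior is Beta(1, 1) and that the posterior after x successes in n trials is Beta(1 + x, 1 + n − x); the variable names are illustrative.

```r
x <- 7
n <- 10
a <- 1 + x          # posterior Beta shape parameters,
b <- 1 + (n - x)    # starting from a Beta(1, 1) (uniform) prior
mle <- x / n
post_mean <- a / (a + b)
post_mode <- (a - 1) / (a + b - 2)
c(mle = mle, post_mean = post_mean, post_mode = post_mode)

# numerical check of the posterior mode
optimize(function(p) dbeta(p, a, b), interval = c(0, 1), maximum = TRUE)$maximum
```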
6. A new car battery has an unknown voltage θ. Given that the battery was manufactured to be a 12 volt battery
but that there is some variation in the voltage of new batteries, we model θ by a normal distribution with mean
12 and standard deviation 0.5 volts. We test the voltage of the battery and get a measurement of 12.3 volts.
Our test apparatus produces a measurement that has a normal distribution with mean equal to the true voltage
and standard deviation about 0.1 volts.
(a) What is the maximum likelihood estimator of θ?
(b) What is the posterior distribution of θ?
(c) What is a reasonable point estimate of θ using this posterior distribution?
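For problem 6, a normal prior combined with a normal measurement of known standard deviation gives a normal posterior, with precisions (reciprocal variances) adding. The sketch below applies that standard conjugate result; the variable names are illustrative.

```r
prior_mean <- 12    # prior on theta
prior_sd <- 0.5
y <- 12.3           # the single measurement
meas_sd <- 0.1      # standard deviation of the test apparatus

prior_prec <- 1 / prior_sd^2
data_prec <- 1 / meas_sd^2
post_var <- 1 / (prior_prec + data_prec)
post_mean <- post_var * (prior_prec * prior_mean + data_prec * y)
c(post_mean = post_mean, post_sd = sqrt(post_var))
```

The posterior mean is a precision-weighted average of the prior mean and the measurement, so it lies between 12 and 12.3.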
7. It might be the case that Brink is just an 85% free-throw shooter and that he has been lucky (he is 81 for 89
at this writing).
(a) Set up this claim as a hypothesis test.
(b) Compute the p-value for these data.
(c) Draw a conclusion about the claim.
(d) Explain in English the meaning of the p-value.
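One possible setup for problem 7 uses an exact binomial test. The one-sided alternative here (true percentage greater than 85%) is an assumption; the problem leaves the choice of alternative to you.

```r
# H0: p = 0.85 against Ha: p > 0.85 (the one-sided alternative is an assumption)
res <- binom.test(x = 81, n = 89, p = 0.85, alternative = "greater")
res$p.value

# the same tail probability computed directly: P(X >= 81) under H0
1 - pbinom(80, size = 89, prob = 0.85)
```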
8. Continuing problem 5.18, use simulation to test the null hypothesis
H0: θ = 0
Ha: θ > 0
In particular, compute an approximate p-value for the sample
.83 .64 .20 .35 .82 .53 .99 .63 .17 .65
9. The built-in dataset morley has measurements of the speed of light in a classic experiment done by Michelson.
The measurements are in the variable Speed. Write a 95% confidence interval for the speed of light based on
this experiment. There were actually 5 different experiments. Is there anything to be learned from looking at
the experiments separately? (Note that Expt is a numeric variable. Convert it to categorical using factor.)
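A starting point for problem 9, using a one-sample t interval on the pooled measurements (one reasonable choice, not the only one). In the built-in data, Speed is recorded in km/sec with 299000 subtracted.

```r
data(morley)
morley$Expt <- factor(morley$Expt)        # treat experiment number as categorical

# 95% t interval from all 100 measurements pooled together
ci <- t.test(morley$Speed, conf.level = 0.95)$conf.int
ci

# per-experiment means, as a first look at the experiments separately
tapply(morley$Speed, morley$Expt, mean)
```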
10. Continuing problem 5.18 and extra problem 8, write a 95% confidence interval for θ using the test given by the
likelihood ratio statistic for the sample given in extra problem 8.
11. If X1, . . . , Xn are iid Exp(λ), what is the distribution of X̄? (Use moment generating functions.)
12. A classic (1889) data set studies the distribution of gender in families with 12 children each in Saxony. There
were 6,115 such families. A possible model for such families is that the genders of successive children are
independent, the probability of a male remains constant over time, and the probability of a male child is the same
from family to family. Suppose the probability of a male child is π. Decide whether these data are consistent
with the model and discuss.
Boys       0   1    2    3    4     5     6     7    8    9   10  11  12
Families   7  45  181  478  829  1112  1343  1033  670  286  104  24   3
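One way to begin problem 12 is a chi-square goodness-of-fit comparison with a Binom(12, π) model, with π estimated from the data. This is a sketch of that one approach; the degrees of freedom assume one estimated parameter, and the cells with small expected counts (0 and 12 boys) might reasonably be pooled with their neighbours.

```r
boys <- 0:12
families <- c(7, 45, 181, 478, 829, 1112, 1343, 1033, 670, 286, 104, 24, 3)

# estimate of pi: proportion of boys among all 12 * 6115 children
pi_hat <- sum(boys * families) / (12 * sum(families))

# expected family counts under Binom(12, pi_hat)
expected <- sum(families) * dbinom(boys, size = 12, prob = pi_hat)

chisq <- sum((families - expected)^2 / expected)
df <- length(boys) - 1 - 1     # one parameter estimated from the data
p_value <- pchisq(chisq, df = df, lower.tail = FALSE)
c(pi_hat = pi_hat, chisq = chisq, df = df, p_value = p_value)
```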
13. There were 380 games in the Premier League last season (2013-14). The distribution of the total number of
goals scored per game was:
Goals    0   1   2   3   4   5   6  7  8  9
Games   27  75  82  70  63  39  17  4  1  2
(a) It’s claimed that the Poisson distribution is a good model for these data. If so, what value for the parameter
λ should be used?
(b) Is the Poisson distribution a good model for these data? Use either chi-square or maximum likelihood.
(You might want to combine the last three classes as they have small cell counts.)
(c) Test the hypothesis that the Poisson distribution is a model for these data using the technically correct
version of the maximum likelihood method.
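A sketch of the chi-square route for problem 13, pooling 7 or more goals into one class. It estimates λ by the sample mean of the ungrouped counts, which is the usual shortcut rather than the technically correct grouped-data estimate asked for in part (c).

```r
goals <- 0:9
games <- c(27, 75, 82, 70, 63, 39, 17, 4, 1, 2)

lambda_hat <- sum(goals * games) / sum(games)   # mean goals per game

# pool 7, 8, 9 (and anything larger) into a single "7 or more" class
obs <- c(games[1:7], sum(games[8:10]))
probs <- c(dpois(0:6, lambda_hat), ppois(6, lambda_hat, lower.tail = FALSE))
expected <- sum(games) * probs

chisq <- sum((obs - expected)^2 / expected)
df <- length(obs) - 1 - 1                       # one parameter estimated
p_value <- pchisq(chisq, df = df, lower.tail = FALSE)
c(lambda_hat = lambda_hat, chisq = chisq, df = df, p_value = p_value)
```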
14. The Baylor Religion survey (http://www.thearda.com) is a high-quality survey on the religious beliefs and
practices of US residents. The results of two questions in one of the recent administrations are below. The
questions are: in which region of the country do you live? and do you believe God exists?
           Absolutely   Probably   Probably not   Absolutely not
East          220          53          32              12
Midwest       409          55          25              14
South         373          45          16              15
West          266          78          43              29
(a) Obviously, this is set-up for a test of homogeneity or independence. Which one?
(b) Write carefully the null hypothesis that you should be testing (based on your answer in (a)). Give a
definition in words of all Greek letters in your hypothesis.
(c) Test the hypothesis. Write your conclusion in words.
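The table for problem 14 can be entered as a matrix and passed to chisq.test. This sketch computes the Pearson chi-square statistic, which is the same number under either the homogeneity or the independence framing; the object name belief is illustrative.

```r
belief <- matrix(c(220, 53, 32, 12,
                   409, 55, 25, 14,
                   373, 45, 16, 15,
                   266, 78, 43, 29),
                 nrow = 4, byrow = TRUE,
                 dimnames = list(
                   Region = c("East", "Midwest", "South", "West"),
                   Belief = c("Absolutely", "Probably",
                              "Probably not", "Absolutely not")))

res <- chisq.test(belief)
res$statistic    # Pearson chi-square statistic
res$parameter    # degrees of freedom, (4 - 1) * (4 - 1)
res$p.value
res$expected     # expected counts under the null hypothesis
```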
15. This is a question about power. Consider the following 2 × 3 table of sample proportions.
.13  .19  .28
.07  .11  .22
(a) If these were the “true” proportions, argue that the two variables defining the table are not independent.
(b) Find the smallest sample size such that the null hypothesis of independence would be rejected for these
sample proportions at the 5% level of significance.
16. The dataframe cats of the MASS package has some data on cats in a laboratory.
Sex   categorical, levels F and M
Bwt   body weight in kg
Hwt   heart weight in g
(a) Fit a model that could be used to predict heart weight from body weight. (Ignore the sex of the cat.)
(b) For your fitted model, write the first five components of each vector in Figure 6.2.
(c) Use R to show numerically that the residual vector is orthogonal to the fitted vector.
17. Suppose that A is a symmetric and idempotent matrix.
(a) Show that each diagonal entry aii of A satisfies 0 ≤ aii ≤ 1.
(b) Show that I − A is symmetric and idempotent.
(c) Show that if A has an inverse then A is the identity matrix.
18. Show that if A is an n × m matrix and B an m × n matrix, the trace of AB is equal to the trace of BA.
19. Show that the trace of the hat matrix is k + 1. (Hint: use problem 18.)
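The facts in problems 17 and 19 can be checked numerically before proving them. The sketch below builds the hat matrix for an illustrative model with k = 2 predictors on the built-in trees data; the choice of data set and model is an assumption made only for the illustration.

```r
# model matrix with an intercept column and k = 2 predictors
X <- model.matrix(Volume ~ Girth + Height, data = trees)
H <- X %*% solve(t(X) %*% X) %*% t(X)   # hat matrix

sum(diag(H))              # trace; should equal k + 1 up to rounding
max(abs(H %*% H - H))     # idempotent: essentially zero
max(abs(H - t(H)))        # symmetric: essentially zero
range(diag(H))            # diagonal entries lie between 0 and 1
```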
20. Show that the least squares estimate β̂1 of β1 in the model y = β0 + β1 x + ε has variance
σ^2 / Σ (xi − x̄)^2 by computing (X^T X)^(−1) in this particular case.
21. A certain Mr. Whiteside wanted to determine the effect of insulating his house on gas consumption. Of course
gas usage also depends on the temperature. Data on weekly temperatures and gas consumption both before
and after insulating the house are in the whiteside data frame in the MASS package.
(a) Fit a linear model that can be used to predict gas usage from temperature and the insulation status of
the house.
(b) Interpret the coefficients in the model.
(c) What is the value of the unbiased estimator of σ^2 for this model?
(d) Interpret the number in part 21c in the context of the problem.
22. Using the same model as problem 21, write 95% confidence intervals for each β in the model.
23. The dataframe SAT of the mosaic package has data on the average SAT score for each state of the union (in
1994) and the expenditures per pupil in elementary school and secondary school.
(a) Use a linear model to show that states that spend more on education produce students who do worse on
the SAT.
(b) While this result is unexpected, it can be explained using one of the other variables in the dataset. Do
that.
24. The R dataset Puromycin gives the rate of reaction (in counts/min/min) as a function of concentration of
an enzyme (in ppm) for two different substrates - one treated with Puromycin and one not treated. The
biochemistry suggests that these two variables are related by
rate = b0 · conc / (b1 + conc)
Find the least squares estimates of b0 and b1 for the treated condition by both of the methods suggested in
class and compare the sums of squares of residuals.
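The two methods from class are not restated here, so the sketch below assumes they are (A) nonlinear least squares on the original scale and (B) a linearizing transformation, 1/rate = 1/b0 + (b1/b0)(1/conc), fitted with lm. The starting values and the helper ssr are illustrative choices.

```r
treated <- subset(Puromycin, state == "treated")

# Method A (assumed): nonlinear least squares on the original scale
fit_nls <- nls(rate ~ b0 * conc / (b1 + conc), data = treated,
               start = list(b0 = 200, b1 = 0.05))

# Method B (assumed): linearize, fit by ordinary least squares, back-transform
fit_lin <- lm(I(1 / rate) ~ I(1 / conc), data = treated)
b0_lin <- 1 / coef(fit_lin)[[1]]        # intercept is 1 / b0
b1_lin <- coef(fit_lin)[[2]] * b0_lin   # slope is b1 / b0

# sum of squared residuals on the original rate scale (hypothetical helper)
ssr <- function(b0, b1) {
  sum((treated$rate - b0 * treated$conc / (b1 + treated$conc))^2)
}
ssr_nls <- ssr(coef(fit_nls)[["b0"]], coef(fit_nls)[["b1"]])
ssr_lin <- ssr(b0_lin, b1_lin)
c(nls = ssr_nls, linearized = ssr_lin)
```

Because Method A minimizes the residual sum of squares on the original scale directly, its value cannot exceed that of Method B.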
25. The trees dataframe has data on the Girth, Height, and Volume of some cherry trees. Develop a function
that could be used to predict Volume from Girth. (This might be useful – we could figure out about how much
wood is in a tree by measuring its distance around.)
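One candidate for problem 25 is a power-law model fitted on the log scale. The choice of transformation is an assumption to be checked against residual plots, and predict_volume is a hypothetical helper name. Note that in this data set Girth is recorded as a diameter in inches.

```r
# power-law model: log(Volume) is linear in log(Girth)
fit <- lm(log(Volume) ~ log(Girth), data = trees)
coef(fit)

# back-transform predictions to the original volume scale
predict_volume <- function(girth) {
  exp(predict(fit, newdata = data.frame(Girth = girth)))
}
predict_volume(c(10, 15))
```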
26. Read the description of the probit transformation in problem 6.41.
(a) Do part (a) of 6.41.
(b) Fit a probit model to the MedGPA data used in class. You will have to install the Stat2Data package. You
will also have to load the data (with data(MedGPA)), as this package does not automatically load the data.
27. The dataframe (Ericksen in the car package) has data on undercounting that occurred in the 1980 census.
The dataframe has 66 cases which correspond to 16 large metropolitan areas and 50 states (with the 16 cities
removed from their states). A careful estimate of the undercount of the 1980 census was constructed. The
question is whether we can model the undercount percentage by other demographic variables of the region.
minority       percentage black or Hispanic
crime          crime rate per 1000
poverty        percentage poor
language       percentage having difficulty with English
highschool     percentage age 25 or older who have not finished high school
housing        percentage of housing in small, multiunit buildings
city           categorical variable, levels city and state
conventional   percentage of households counted by conventional enumeration
undercount     estimate of undercount percentage
Consider these three models:
mod1: undercount ∼ 1 + minority
mod2: undercount ∼ 1 + minority + crime
mod3: undercount ∼ 1 + minority + crime + poverty
(a) Give a good statistical reason for preferring mod2 to mod1.
(b) Give a good statistical reason for preferring mod2 to mod3.
(c) Is there a model using the given variables that you might prefer to mod2?
28. In problem 27, the three models give fitted vectors ŷmod1, ŷmod2, and ŷmod3.
(a) Argue that ŷmod3 − ŷmod2 is orthogonal to ŷmod2 and to ŷmod1.
(b) Show that part (a) is true numerically by computing the various fitted vectors.
29. In problem 27 you gave a reason that mod2 is to be preferred to mod3. Presumably, part of your reasoning
included computing a p-value for a certain hypothesis test. Compute an approximate p-value for the same
hypothesis test that uses a method that does not depend on our normality assumption.
30. The dataframe cathedral in the faraway package has data on the cathedral nave heights and lengths of several
cathedrals in England.
x       the length of the nave
y       the height of the nave
style   the style of the cathedral, romanesque or gothic
Fit a model to predict the height of the nave from the length of the nave and the style. (y ∼ 1 + x + style)
(a) Write the equation of the fitted model.
(b) Use R to estimate the P-value for the null hypothesis that style does not matter in this model.
(c) Instead, use a simulation method (using shuffle) to compute the same P -value.
(d) Should style be included in this model?
31. The dataframe KidsFeet of the mosaic package has data on 39 children, including the width of their longer
foot.
(a) We can fit a model for foot width using the following code: lm(width ∼ length + sex). Write the
equation for the model that this code is used to fit (in terms of coefficients β).
(b) In terms of this model, what null hypothesis would each of these be used to test:
width ∼ length + shuffle(sex)
width ∼ shuffle(length) + sex
width ∼ shuffle(length)+shuffle(sex)
32. The corrosion data frame in the faraway package has data on the loss due to corrosion of copper bars with
various iron Fe content.
(a) Fit the model loss ∼ 1 + Fe.
(b) Draw the diagram corresponding to Figure 7.3, labelling each vector by its name and by its numerical
length (for this particular model). (There are six vectors in the picture so you should have six different
lengths.)
(c) For each vector in part 32b, write an easy way to compute its length from either summary or anova (if
possible).
33. The chickwts dataset has data on the weights of 71 newborn chicks after they are fed one of six different diets
for a period of time.
(a) Is there statistical evidence that the diet makes a difference in the weight of chicks?
(b) Box 7.2 on page 425 has a boatload of new notation. For each of the variables defined in Box 7.2, compute
the numerical value of that variable for this particular dataset. (Except of course for the parameters µ –
we’ll never know their values.)
34. For these questions use the chickwts dataset referred to in problem 33.
(a) Use the Tukey Honest Significant Difference method to write confidence intervals for the mean difference in
weight of chicks fed each pair of diets.
(b) Use the method of contrasts to determine whether the average weight gain of chicks fed casein or sunflower
is greater than the average weight gain of chicks fed one of the other four feeds.
(c) Summarize the Tukey intervals for a non-statistician by placing the feeds in different groups according to
whether we have evidence that feeds in one group are better than those in another (in producing weight
gain).
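A starting point for problems 33(a) and 34(a), using a one-way analysis of variance and the built-in TukeyHSD function. This is a sketch only; the contrast in 34(b) is not shown.

```r
# one-way ANOVA of chick weight on feed type
fit <- aov(weight ~ feed, data = chickwts)
summary(fit)

# Tukey Honest Significant Difference intervals for every pair of feeds
tukey <- TukeyHSD(fit, conf.level = 0.95)
tukey$feed    # columns: diff, lwr, upr, p adj
```

With six feeds there are 15 pairwise comparisons, one per row of tukey$feed.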
35. A study was carried out to determine the lifetime of four brands of pens. It was thought that the writing
surface might affect the lifetime so three different writing surfaces were used. A writing machine was used to
ensure consistency of pressure and the like and the lifetime of each pen in minutes was recorded. The data are
available by
pens <- read.csv("http://www.calvin.edu/~stob/courses/m344/S15/Pens.csv")
(Note that the variables are recorded as numbers. You will want to treat them as factors.) Is there evidence
to suggest a difference in the lifetimes of the pens? Explain using the appropriate analysis.