
MATH 232: Engineering Mathematics
Problem Set 20, Final version
Due Date: Mon., May 3, 2010
First, here are some summary notes on inference procedures for sample means.
In general, we need random samples from normal populations, though our procedures are
robust against the latter (normality) assumption when sample sizes are large enough to
invoke the magic of the CLT.
Hypothesis Testing
The test statistic takes the form
   t0 or z0 = [(estimate) − (hypothesized value)] / (std. error),
where the relevant quantities for various scenarios are:
Parameter     Variances   Test   Estimate    Standard Error           Distribution
of Interest   known?      stat
-----------   ---------   ----   ---------   ----------------------   --------------------
µ             yes         z0     x̄           σ/√n                     Norm(0, 1)
µ             no          t0     x̄           s/√n                     t (df = n − 1)
µ2 − µ1       yes         z0     x̄2 − x̄1     √(σ1²/n1 + σ2²/n2)       Norm(0, 1)
µ2 − µ1       no†         t0     x̄2 − x̄1     √(s1²/n1 + s2²/n2)       t (df = ν, below)
µ2 − µ1       no‡         t0     x̄2 − x̄1     sp √(1/n1 + 1/n2)        t (df = n1 + n2 − 2)

Here sp is the pooled standard deviation and ν is the Welch–Satterthwaite degrees of freedom:

   sp² = [(n1 − 1)s1² + (n2 − 1)s2²] / (n1 + n2 − 2),

   ν = (s1²/n1 + s2²/n2)² / [(s1²/n1)²/(n1 − 1) + (s2²/n2)²/(n2 − 1)].
† Formulas provided in text, and employed by R’s t.test() command.
‡ Formulas used in (recent/current(?)) Fundamentals of Engineering documents.
The P-value, if the alternate hypothesis is 1-sided, is the area in the appropriate tail
beyond the test statistic in the appropriate distribution. (See the figure below.) If the
alternate hypothesis is 2-sided, the P-value is twice that tail area.

[Figure: the Norm(0, 1) or t (specified df) density, with the tail area beyond z0 or t0 shaded.]
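
In R, these tail areas come from pt() (or pnorm() when the statistic is z0). A sketch,
assuming a test statistic t0 and its degrees of freedom df have already been computed
(e.g., as in the sketch above):

> pt(t0, df)             # 1-sided P-value, lower-tail alternative
> 1 - pt(t0, df)         # 1-sided P-value, upper-tail alternative
> 2 * pt(-abs(t0), df)   # 2-sided P-value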
100(1 − α)% Confidence Intervals
A 2-sided confidence interval for one of the parameters in the table above is given by

   (estimate) ± (zα/2 or tα/2)(std. error),

where the standard error and corresponding distribution are as given in the table, and the
critical value is chosen as pictured below.

[Figure: the Norm(0, 1) or t (specified df) density, with area γ to the right of the critical
value zγ or tγ.]
For a one-sided 100(1 − α)% confidence bound, we replace tα/2 (or zα/2 ) with
tα (or zα ) and compute just the lower/upper endpoint of the resulting CI as
appropriate.
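
In R, the critical value comes from qt() (or qnorm()). A minimal sketch of a 2-sided 95%
CI for µ, reusing the hypothetical vector x from the earlier sketch:

> alpha = 0.05
> tcrit = qt(1 - alpha/2, df = length(x) - 1)            # critical value t_{α/2}
> mean(x) + c(-1, 1) * tcrit * sd(x) / sqrt(length(x))   # (estimate) ± (crit. value)(std. error)

For a one-sided bound, qt(1 - alpha, ...) gives tα, and only the relevant endpoint is kept.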
In R
One of R’s built-in datasets is called sleep, and contains the results of a sleep
experiment. Specifically, two different soporifics (sleep-aids) were tried on subjects and the amount of extra sleep that each subject attained was recorded. The
question is whether one soporific worked better than another. This data was
used in a famous “Student” (William Sealy Gosset) paper of 1908 addressing
the issue of inference on two means.
> data(sleep)
> names(sleep)
[1] "extra" "group"
> sleep$extra
 [1]  0.7 -1.6 -0.2 -1.2 -0.1  3.4  3.7  0.8  0.0  2.0  1.9  0.8  1.1  0.1 -0.1
[16]  4.4  5.5  1.6  4.6  3.4
> sleep$group
 [1] 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2
Levels: 1 2
One may store the “extra” time slept in vectors, one for each group, and then
run the t.test() command using both as inputs, as in
> gp1 = sleep$extra[sleep$group == 1]
> gp2 = sleep$extra[sleep$group == 2]
> t.test(gp1, gp2)
Welch Two Sample t-test
data: gp1 and gp2
t = -1.8608, df = 17.776, p-value = 0.0794
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-3.3654832 0.2054832
sample estimates:
mean of x mean of y
     0.75      2.33
The resulting output contains a 95% CI for the difference of means µ1 − µ2, as well as
the test statistic, # of degrees of freedom, and P-value for the hypothesis test

   H0 : µ1 − µ2 = 0,
   Ha : µ1 − µ2 ≠ 0.
Study this output carefully so as to understand all that it reports. Another way
to get this same output without first having to create vectors like gp1 and gp2
is by using a formula:
> t.test(extra ~ group, data = sleep)
Notice that the hypothesized value of zero for µ1 − µ2 is both inside the 95% CI
and not rejected at the significance level 5%.
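
For comparison with the table of formulas above, the Welch statistic and degrees of freedom
reported by t.test() can be reproduced from summary statistics. This sketch reuses the gp1
and gp2 vectors created earlier:

> s1 = sd(gp1); s2 = sd(gp2); n1 = length(gp1); n2 = length(gp2)
> se = sqrt(s1^2/n1 + s2^2/n2)                               # std. error from the table
> t0 = (mean(gp1) - mean(gp2)) / se                          # hypothesized difference is 0
> df = se^4 / ((s1^2/n1)^2/(n1 - 1) + (s2^2/n2)^2/(n2 - 1))  # Welch degrees of freedom
> c(t0, df, 2 * pt(-abs(t0), df))                            # matches t = -1.8608, df = 17.776, p ≈ 0.0794

(Passing var.equal = TRUE to t.test() gives the pooled-variance version of the test instead.)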
Day 1 (Monday)
• Read Sections 5.1–5.3. You may give little attention to the part of Section 5.3.1 devoted
to the case σ1² = σ2² = σ², beginning about a third of the way down p. 225, and ending
about halfway through p. 228. You may similarly ignore that case discussed for a
page and a half at the start of Section 5.3.3. The reasons for this are theoretical (it
is not likely, if you do not know σ1 and σ2 , that you might still know them to be
equal), and practical (it seems the writers of FOE materials do not consider this case
necessary to know).
• Section 5.3: Do Problems 16, 20 part (a), 22, 26 and 27.
Day 2 (Thursday)
• Read Sections 5.4 and 6.1 from MRH4e.
• Section 5.4: Do Problems 30, 31, 32, 33, 36, 38 and 39.
Day 3 (Friday)
• Problem ?39: In class, we used R code like
> lm.results = lm(y ~ x)
to obtain the linear model and store it in a variable. We get a more detailed summary
of the model with
> summary(lm.results)
We can do other things with the results from lm. In particular, we can obtain predicted
and residual values via
> predict(lm.results)
> residuals(lm.results)
Use such commands as needed, and data on the 2003 American League Baseball season,
found in the file http://www.calvin.edu/~stob/data/al2003.csv, to investigate the
following. (A minimal sketch of these commands appears after part (c).)
(a) Suppose we wish to predict the number of runs (R) a team will score on the
year given the number of homeruns (HR) the team will hit. Write a linear
relationship between these two variables.
(b) Use this linear relationship to predict the number of runs a team will score
given it hits 200 homeruns on the year.
(c) Are there any teams for which the linear relationship does a poor job in predicting runs from homeruns?
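
As referenced above, here is a minimal sketch of the mechanics for this problem. It assumes
a local copy of the file named al2003.csv and that the runs and homeruns columns are named
R and HR, as in the problem statement:

> al = read.csv("al2003.csv")                          # assumes a local copy of the file
> lm.results = lm(R ~ HR, data = al)                   # column names R and HR assumed
> summary(lm.results)                                  # slope and intercept for part (a)
> predict(lm.results, newdata = data.frame(HR = 200))  # part (b): predicted runs at 200 homeruns
> residuals(lm.results)                                # part (c): large residuals flag poor fits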
• Problem ?40: From data on the 2003 American League Baseball season, found in
the file http://www.calvin.edu/~stob/data/al2003.csv, suppose that we wish to
predict the number of games a team will win (W) from the number of runs the team
scores (R).
(a) Write a linear relationship for W in terms of R.
(b) How many runs must a team score to win 81 games according to this relationship?
• Problem ?40: Suppose that we wish to fit a linear model without a constant: i.e.,
y = bx. Find a general formula for the value of b that minimizes the sum of squares
of residuals, ∑_{i=1}^n (yi − b xi)², in this case. (Hint: there is only one variable here, b,
so this is a straightforward Mathematics 161 max-min problem.)
• Problem ?41: In R, if we wish to fit a line y = bx without the constant term, we
use lm(y ~ x - 1). (The (-1) in the formula notation in this context tells R to omit the
constant term.) Using data on the 2003 American League Baseball season, found
in the file http://www.calvin.edu/~stob/data/al2003.csv, define new variables
for W − L and R − OR. (For example, define wl = s$W - s$L where s is the data frame
containing your data; a minimal sketch of these steps appears after part (b).)
(a) Write W − L as a linear function of R − OR without a constant term.
(b) Why do you think it makes sense (given the nature of the variables) to omit a
constant term in this model?
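
As mentioned above, here is a minimal sketch of these steps, assuming s has been read from a
local copy of al2003.csv and has columns W, L, R, and OR:

> s = read.csv("al2003.csv")   # assumes a local copy of the file
> wl = s$W - s$L               # wins minus losses
> ror = s$R - s$OR             # runs scored minus runs allowed
> lm(wl ~ ror - 1)             # the -1 omits the constant term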
• Problem ?42: The R dataset women (load it with the command data(women)) gives
the average weight of American women by height. Find (using R) the regression
line that predicts average weight from height. Use the xyplot command from the
lattice package, employing the switch type=c('p','r'), to view the data along
with the regression line, as in the sketch below. Look also at a plot of the residuals
vs. heights. (Include this plot in your write-up, along with the commands you use to
produce it.) Finally,
calculate the correlation between these two variables. Based upon what you see,
do you think that a linear relationship is the best way to describe the relationship
between average weight and height? Explain why or why not.
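
A minimal sketch of the plotting commands referenced in this problem, assuming the lattice
package is installed (the women data frame has columns height and weight):

> data(women)
> library(lattice)
> fit = lm(weight ~ height, data = women)
> xyplot(weight ~ height, data = women, type = c('p', 'r'))  # data with regression line overlaid
> xyplot(residuals(fit) ~ height, data = women)              # residuals vs. heights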