Midterm Exam Key

STATISTICS 110, FALL 2015, MIDTERM EXAM
NAME:_____KEY__________________________
Your assigned homework number :___________
First 6 digits of Student ID: __________________
Open notes, calculator required. You should have 5 pages plus a page of R output, handed out separately. Make
sure you have them all. Each part of each problem is worth 4 points unless specified otherwise. Use the back of
the pages if you need more space, but tell us to turn the page over and look.
1. The Kids198 dataset accompanying the book provides data on a random sample of 198 children between the
ages of 8 and 18. Three of the variables measured, and the way we will use them for this question, are:
Y = Weight (in pounds), X1 = Age (in months), X2 = 0 if Male and 1 if Female.
For each of the parts of this question, use the notation with Y and Xs rather than the names of the variables.
a. Write the population model that specifies a linear relationship between Y = Weight and X1 = Age and
for which that relationship has the same slope and the same intercept for males and females. Include
information about the normality assumption. (The left hand side of the model is provided, to get you
started.)
Y = β0 + β1X1 + ε where the condition is that the ε are independent and each from a N(0, σε)
distribution.
For parts (b) and (c) you don’t need to include information about the normality assumption. Continue using the
Y and X notation rather than names of variables.
b. Write the population model for the linear relationship between Weight and Age that includes the same
intercept but different slopes for males and females.
Y = β0 + β1X1 + β2X1 X2 + ε
(It’s okay if you use β3 for the coefficient for X1 X2 in anticipation of adding separate intercepts.)
c. Write the population model for the linear relationship between Weight and Age that includes different
intercepts and different slopes for males and females.
Y = β0 + β1X1 + β2X2 + β3X1 X2 + ε
d. Using the notation from your model in part (c), write the null and alternative hypotheses that would be
used to test whether the population regression lines are the same for males and females, versus that they
are not the same in some way.
H0: β2 = β3 = 0
Ha: At least one of β2 and β3 is not 0
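The hypotheses in part (d) correspond to a nested F-test that compares the model from part (a) with the model from part (c). The following is a minimal R sketch of that comparison. It uses simulated stand-in data: the data frame `sim` and every coefficient value in it are hypothetical and are not taken from Kids198.

```r
# Simulated stand-in data; all numbers here are hypothetical, not from Kids198.
set.seed(110)
n  <- 198
X1 <- runif(n, min = 96, max = 216)     # age in months (8 to 18 years)
X2 <- rbinom(n, size = 1, prob = 0.5)   # 0 = male, 1 = female
Y  <- -20 + 0.9 * X1 - 5 * X2 - 0.1 * X1 * X2 + rnorm(n, sd = 15)
sim <- data.frame(Y, X1, X2)

# Reduced model (part a): one intercept and one slope for everyone.
reduced <- lm(Y ~ X1, data = sim)

# Full model (part c): different intercepts and different slopes.
full <- lm(Y ~ X1 + X2 + X1:X2, data = sim)

# Nested F-test of H0: beta2 = beta3 = 0 (2 numerator df, n - 4 error df).
nested <- anova(reduced, full)
print(nested)
```

A small p-value in the second row of the table would be evidence that the lines differ in intercept, slope, or both.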
2. [2 pts each] For each of the following situations, specify whether the statement provided is always true,
could be true for some populations and/or samples, or is never true. (Circle your answer.)
a. When X and Y have a deterministic linear relationship, the slope of the line is 1.
Always true
Could be true
Never true
b. If 100 independent 95% confidence intervals are created for a mean, each based on a different sample,
exactly 95 of them will cover the true population mean.
Always true
Could be true
Never true
c. Consider a regression situation with Y as the response, and 2 possible predictors X1 and X2. SSTotal will
be the same for the model with X1 and X2 as predictors as it is for the model with only X1 as a predictor.
Always true
Could be true
Never true
d. When the slope of a regression line is negative, r2 will also be negative.
Always true
Could be true
Never true
e. When the correlation between X and Y is positive (and not 0) the slope of the least square regression
line for simple linear regression is also positive.
Always true
Could be true
Never true
f. In a simple linear regression setting the numerical values of β1 and β̂1 are equal.
Always true
Could be true
Never true
g. The sum of the residuals from fitting a least squares regression line will be 0.
Always true
Could be true
Never true
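Several of the facts behind parts (c), (d), (e) and (g) can be checked numerically. The sketch below uses simulated data (the variables `x1`, `x2` and `y` are hypothetical) and is only an illustration, not a proof.

```r
# Simulated data; the variables and coefficients are hypothetical.
set.seed(2015)
x1 <- rnorm(50)
x2 <- rnorm(50)
y  <- 3 - 2 * x1 + 0.5 * x2 + rnorm(50)

fit1 <- lm(y ~ x1)        # one predictor
fit2 <- lm(y ~ x1 + x2)   # two predictors

# (c) SSTotal depends only on y, so it is the same for both models.
ss_total      <- sum((y - mean(y))^2)
ss_total_fit1 <- sum(anova(fit1)[["Sum Sq"]])
ss_total_fit2 <- sum(anova(fit2)[["Sum Sq"]])

# (d), (e) The slope has the same sign as r, and r^2 cannot be negative.
r         <- cor(x1, y)
slope     <- unname(coef(fit1)["x1"])
r_squared <- summary(fit1)$r.squared

# (g) Least squares residuals sum to 0, up to floating-point error.
resid_sum <- sum(resid(fit1))

print(c(ss_total, ss_total_fit1, ss_total_fit2))
print(c(r = r, slope = slope, r_squared = r_squared, resid_sum = resid_sum))
```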
For Questions 3 to 9: The R output distributed with the exam includes an analysis of the Kids198 data
described in question 1, but with Height added as a predictor. It includes the response variable Y = Weight (in
pounds), and predictor variables X1 = Height (in inches), X2 = Age (in months) and X3 = Sex with 0 for males
and 1 for females. Use the R output to answer all of the rest of the questions except the multiple choice.
3. As shown on the output, the correlation between Weight and Sex is −0.245. Explain why the correlation is
negative.
A negative correlation indicates a general pattern that as one variable increases the other decreases. In this
case, Sex is 0 for males and 1 for females. So as Sex increases, Weight decreases. This makes sense because
women in general weigh less than men do.
4. Interpret the value of the coefficient for Sex, which is −2.28.
For a male and female of the same age and height, the female is predicted to weigh 2.28 pounds less than
the male.
OR
For the population of males and females of any fixed age and height, the average weight for the females is
estimated to be 2.28 pounds less than the average weight of the males.
5. [2 pts each blank] Fill in numerical values in each of the blanks where possible. If a numerical value cannot
be determined from the computer output, write NA (not available). No extensive computations are required,
but if you need to compute something you can show your work on the side. (If you make a mistake in
computation, including your work might get you partial credit because we can see where you went wrong.)
a. β̂0 = __−174.33329___________
b. √MSE = __14.48____________
c. β1 = ___NA (It’s a population value, so it’s unknown.)____
d. The value of the test statistic for testing H0: β1 = β2 = β3 = 0 is __277.7___________
e. The standard error of β̂1 = ___0.33094________________
f. SSModel = __174,622 (It’s 173615 + 785 + 222)_______
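The two computed answers in this question can be reproduced from the numbers in the R output. The sketch below copies the sequential sums of squares and the residual line from `anova(Kid.mod)`; the variable names are chosen only for illustration.

```r
# Values copied from the anova(Kid.mod) table in the R output.
seq_ss   <- c(Height = 173615, Age = 785, Sex = 222)
ss_error <- 40656
df_error <- 194

# SSModel is the sum of the sequential sums of squares for the predictors.
ss_model <- sum(seq_ss)

# The residual standard error is the square root of MSE = SSE / df.
mse      <- ss_error / df_error
root_mse <- sqrt(mse)

cat("SSModel   =", ss_model, "\n")
cat("MSE       =", round(mse, 2), "\n")
cat("sqrt(MSE) =", round(root_mse, 2), "\n")
```

The square root of MSE agrees with the "Residual standard error" line of the summary output.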
6. Write out the sample regression equation using the numbers from the output. Use the variable names instead
of the X notation for the predictor variables. Round the coefficients to 3 digits after the decimal place.
Ŷ = −174.333 + 4.284(Height) + 0.123(Age) – 2.281(Sex)
7. Row 3 after the “Head” command near the top of the output gives values of all of the variables for the 3rd
child in the data set. Find the predicted weight and then the residual for this child. Show your work.
a. Predicted weight = −174.333 + 4.284(Height) + 0.123(Age) – 2.281(Sex)
= −174.333 + 4.284(50.1) + 0.123(119) – 2.281(0) = 54.93 pounds
b. Residual = Actual – predicted = 54 – 54.93 = −0.93 pounds
c. Interpret the value of the residual you found in part (b).
This child’s actual weight was 0.93 pounds lower than it was predicted to be based on his height, age
and sex.
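The arithmetic in parts (a) and (b) can be checked with a few lines of R. The sketch uses the rounded coefficients from question 6 and the third row of the data; the object names are illustrative only.

```r
# Rounded coefficients from question 6.
b <- c(intercept = -174.333, Height = 4.284, Age = 0.123, Sex = -2.281)

# Third child in the data set (row 3 of the head() output).
child3 <- c(Height = 50.1, Age = 119, Sex = 0)
actual <- 54

# Predicted weight: intercept plus each coefficient times its predictor value.
predicted <- b[["intercept"]] + sum(b[names(child3)] * child3)

# Residual = actual - predicted.
residual <- actual - predicted

cat("Predicted weight =", round(predicted, 2), "pounds\n")
cat("Residual         =", round(residual, 2), "pounds\n")
```

With the fitted model object available, `predict(Kid.mod, newdata = Kids198[3, ])` should give nearly the same prediction, differing only because the coefficients above are rounded.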
8. Using the R output, give the value of the test statistic and the p-value for testing the following hypotheses.
a. In the simple linear regression model with Y = Weight and X = Age, the null hypothesis is that the
population slope is 0 and the alternative hypothesis is that the slope is not 0.
Test statistic = __17.5313_______
p-value = __< 2.2 × 10^−16_____
b. In the multiple linear regression model with Y = Weight and the three predictors Height, Age and Sex,
the null hypothesis is that Age is not needed given that Height and Sex are included in the model, and
the alternative hypothesis is that Age is needed. In other words, the null hypothesis is that the population
coefficient for Age is 0 and the alternative hypothesis is that it is not 0.
Test statistic = __2.144_________
p-value = __0.0333_________
c. In the multiple linear regression model with Y = Weight and the three predictors Height, Age and Sex,
the null hypothesis is that Age is not needed given that Height and Sex are included in the model, and
the alternative hypothesis is that Age is needed and has a positive coefficient. In other words, the null
hypothesis is that the population coefficient for Age is 0 and the alternative hypothesis is that it is
greater than 0.
Test statistic = __2.144_________
p-value = __0.0333/2 = 0.0167_
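The values in this question can be recovered from the output with a short calculation. The sketch below assumes the standard relationship between a correlation and the t statistic for a simple regression slope, t = r√(n − 2)/√(1 − r²), and uses the t distribution with 194 degrees of freedom for the Age coefficient.

```r
# Part (a): t statistic for the slope from the correlation of Weight and Age.
r <- 0.7814124
n <- 198
t_slope <- r * sqrt(n - 2) / sqrt(1 - r^2)

# Parts (b) and (c): p-values for the Age coefficient, t = 2.144 on 194 df.
t_age    <- 2.144
df_error <- 194
p_two_sided <- 2 * pt(abs(t_age), df = df_error, lower.tail = FALSE)
p_one_sided <- pt(t_age, df = df_error, lower.tail = FALSE)

cat("t for slope (part a)  =", round(t_slope, 4), "\n")
cat("two-sided p (part b)  =", round(p_two_sided, 4), "\n")
cat("one-sided p (part c)  =", round(p_one_sided, 4), "\n")
```

Halving the two-sided p-value is appropriate here because the estimated coefficient is positive, which is the direction specified by the alternative hypothesis.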
9. (1 pt each) Fill in the blanks in the ANOVA table below, where F is the test statistic for H0: β1 = β2 = β3 = 0.
Hint: All of the information you need is in the output, but you will need to do some arithmetic to get some
of the values.
Source    Df        SumSq          MeanSq          F           p-value
Model     __3__     __174622__     __58207.33__    __277.2__   __< 2.2 × 10^−16__
Error     __194__   __40656__      __210__
Total     __197__   __215278__
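The arithmetic behind the table can be sketched in R from the sums of squares in the `anova(Kid.mod)` output. The object names are illustrative. The sketch also shows why the F value computed by hand (277.2) differs slightly from the 277.7 reported in the summary output: the hand calculation uses MSE rounded to 210.

```r
# Sums of squares and degrees of freedom from the anova(Kid.mod) output.
ss_model <- 173615 + 785 + 222
ss_error <- 40656
df_model <- 3
df_error <- 194

# Total row: sums of squares and degrees of freedom add up.
ss_total <- ss_model + ss_error
df_total <- df_model + df_error

# Mean squares.
ms_model <- ss_model / df_model
ms_error <- ss_error / df_error

# F with MSE rounded to 210 (as in the table) and with unrounded MSE.
f_rounded <- ms_model / round(ms_error)
f_exact   <- ms_model / ms_error

# Upper-tail p-value from the F distribution with 3 and 194 df.
p_value <- pf(f_exact, df_model, df_error, lower.tail = FALSE)

cat("SSTotal =", ss_total, " df =", df_total, "\n")
cat("MSModel =", round(ms_model, 2), " MSE =", round(ms_error, 2), "\n")
cat("F (MSE rounded to 210) =", round(f_rounded, 1), "\n")
cat("F (unrounded MSE)      =", round(f_exact, 2), "\n")
```

The unrounded F is close to the 277.7 in the output; any remaining difference comes from the sums of squares themselves being rounded to whole numbers in the printed table.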
MULTIPLE CHOICE (3 pts each) Circle the best choice
1. In simple linear regression, a plot of residuals versus fitted values is useful for checking some of the
necessary conditions for inference in regression. Which one of the following is it not useful for checking?
A. The relationship between Y and X is approximately linear.
B. The standard deviation of the errors remains constant across the x values.
C. The n pairs of observations are all independent.
D. It is useful for checking all of the above conditions.
2. A researcher is interested in predicting the relationship between Y = percent body fat and X = average
number of calories consumed per day for college freshmen who eat in the dining hall at a large university.
She employs two research assistants, and they each are going to independently take a random sample of 100
students who eat in the dining hall and ask them to provide this information. Which of the following
definitely will be the same for the two research assistants?
A. The population regression line.
B. The SSTotal for their samples.
C. The intercepts for the regression lines computed from their samples.
D. None of the above; all of those would differ.
3. In simple linear regression, √MSE is used as an estimate of σ. In this context, what is σ?
A. The standard deviation of the population of all Y values, ignoring the values of X.
B. The standard deviation of the residuals from the sample.
C. The standard deviation of the population of X values at each value of Y.
D. The standard deviation of the population of Y values at each value of X.
4. Which of the following is a correct way to write the sample regression equation for simple linear
regression?
A. Y = β0 + β1X1 + e
B. Ŷ = β0 + β1X1
C. Y = β̂0 + β̂1X1 + e   (This one.)
D. Ŷ = β̂0 + β̂1X1 + e
The output below from R uses the data set Kids198, accompanying the book. Some irrelevant
parts of the output have been removed for space reasons. Variables include:
Y = Weight in pounds
X1 = Height in inches
X2 = Age in months
X3 = Sex = 0 if Male, 1 if Female
The population model using variable names is µY = β0 + β1(Height) + β2(Age) + β3(Sex).
> head(Kids198)     #First 3 rows of data, to give you a feel for it
  Height Weight Age Sex
1   67.8    166 210   0
2   63.0     93 144   1
3   50.1     54 119   0
> cor(Kids198)
           Height     Weight         Age         Sex
Height  1.0000000  0.8980355  0.83292636 -0.25811645
Weight  0.8980355  1.0000000  0.78141242 -0.24549691
Age     0.8329264  0.7814124  1.00000000 -0.06732361
Sex    -0.2581165 -0.2454969 -0.06732361  1.00000000
> cor.test(Kids198$Weight,Kids198$Age)
t = 17.5313, df = 196, p-value < 2.2e-16
sample estimates:
cor
0.7814124
> Kid.mod <- lm(Weight ~ Height + Age + Sex, data = Kids198)
> summary(Kid.mod)
Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -174.33329   13.80581 -12.628   <2e-16
Height         4.28446    0.33094  12.946   <2e-16
Age            0.12301    0.05737   2.144   0.0333
Sex           -2.28138    2.21699  -1.029   0.3047
---
Residual standard error: 14.48 on 194 degrees of freedom
Multiple R-squared: 0.8111, Adjusted R-squared: 0.8082
F-statistic: 277.7 on 3 and 194 DF, p-value: < 2.2e-16
> anova(Kid.mod)
Analysis of Variance Table
Response: Weight
           Df Sum Sq Mean Sq  F value  Pr(>F)
Height      1 173615  173615 828.4373 < 2e-16
Age         1    785     785   3.7454 0.05441
Sex         1    222     222   1.0589 0.30474
Residuals 194  40656     210