
NCSU ST512
TAKE HOME QUIZ
SUM 2 2011
1. A friend has asked you for help in fitting a line for a class project. He has collected
runners' world record times at ten distances for men and women. He needs to fit a single linear
trend over all data points to show the relationship between time and distance, ignoring the gender
of the runner. The world records at ten distances (outdoor running) are listed for men and
women in the table below. The records were taken from the website of the International
Association of Athletics Federations (IAAF), http://www.iaaf.org, on July 23, 2011.
The marathon is a long-distance running event with an official distance of 42.195 kilometers (26
miles and 385 yards).
                      Male                        Female
Distance (m)    time (sec)   date           time (sec)   date
100                   9.77   14/06/2005          10.49   16/07/1988
200                  19.32   01/08/1996          21.34   29/09/1988
400                  43.18   26/08/1999          47.60   06/10/1985
800                 101.11   24/08/1997         113.28   26/07/1983
1500                206.00   14/07/1998         230.46   11/09/1993
3000                440.67   01/09/1996         486.11   13/09/1993
5000                757.35   31/05/2004         864.53   03/06/2006
10000              1577.53   26/08/2005        1771.78   08/09/1993
21097.5            3535.00   15/01/2006        4004.00   15/01/1999
42195              7495.00   28/09/2003        8125.00   13/04/2003
As requested, you run a simple linear regression on this dataset and present your friend with
the results.
a)
Write the estimated simple linear regression equation and test whether the linear
regression coefficient is significantly different from 0.
(Page 1)
$\widehat{time}_i = -67.83242 + 0.18424 \times distance_i$
$H_0: \beta_1 = 0$
$H_1: \beta_1 \neq 0$, $\alpha = 0.05$
$t_{calc} = 68.41$, p < 0.0001; 68.41 > 2.10 ($t_{18,\,0.025}$). Reject $H_0$: there is a significant linear relationship
between time and distance for world running records at the 0.05 significance level.
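The fit in part a) can be sketched in a few lines of plain Python. This is a minimal illustration that assumes the 20 records in the table above are the full dataset; the variable names are hypothetical, and the values it prints are close to, but not identical with, the SAS output quoted above, which suggests the SAS input differed slightly from the printed table.

```python
import math

# World-record times (sec) from the table above, one value per distance and gender.
distance = [100, 200, 400, 800, 1500, 3000, 5000, 10000, 21097.5, 42195]
male = [9.77, 19.32, 43.18, 101.11, 206.00, 440.67, 757.35, 1577.53, 3535.00, 7495.00]
female = [10.49, 21.34, 47.60, 113.28, 230.46, 486.11, 864.53, 1771.78, 4004.00, 8125.00]

x = distance + distance          # 20 observations, gender ignored
y = male + female
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

b1 = sxy / sxx                   # least-squares slope
b0 = y_bar - b1 * x_bar          # least-squares intercept

# Residual mean square on n - 2 = 18 df, then the t statistic for H0: beta1 = 0.
sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
mse = sse / (n - 2)
t_calc = b1 / math.sqrt(mse / sxx)

print(f"intercept = {b0:.3f}, slope = {b1:.5f}, t = {t_calc:.2f}")
```

The t statistic is compared with the two-sided critical value $t_{18,\,0.025}$, as in the test above.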
b)
As learned in class, you run a lack of fit test on this data to ensure that the linear
fit is adequate.
Lack of fit test:
Test the hypothesis that a higher degree polynomial may be needed.
(Page 6)
$H_0$: Linear regression is adequate
$H_1$: A higher degree polynomial is needed
$\alpha = 0.05$
MSLOF = 8943.0, MSE = 35988.8, $F_{LOF}$ = 8943.0/35988.8
Calculated F for lack of fit test = $F_{LOF}$ = 0.25, p-value = 0.9699. Do not reject $H_0$: there is not enough
statistical evidence at the 0.05 significance level to conclude that a higher degree polynomial is needed. The linear fit seems adequate.
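The lack of fit F ratio can be sketched as follows. The sketch assumes that the male and female records at the same distance are treated as two replicates, which is what gives 10 pure-error and 8 lack-of-fit degrees of freedom; names are hypothetical and the mean squares will differ somewhat from the SAS values quoted above.

```python
# World-record times (sec) from the table above.
distance = [100, 200, 400, 800, 1500, 3000, 5000, 10000, 21097.5, 42195]
male = [9.77, 19.32, 43.18, 101.11, 206.00, 440.67, 757.35, 1577.53, 3535.00, 7495.00]
female = [10.49, 21.34, 47.60, 113.28, 230.46, 486.11, 864.53, 1771.78, 4004.00, 8125.00]

x = distance + distance
y = male + female
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
b0 = y_bar - b1 * x_bar

# Residual SS about the straight line (n - 2 = 18 df).
sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))

# Pure error: for two replicates the SS about their own mean is (m - f)^2 / 2.
ss_pure = sum((m - f) ** 2 / 2 for m, f in zip(male, female))
df_pure = len(distance)          # 10 distances x (2 - 1) replicates

# Lack of fit is what remains of the residual SS after removing pure error.
ss_lof = sse - ss_pure
df_lof = (n - 2) - df_pure       # 18 - 10 = 8

f_lof = (ss_lof / df_lof) / (ss_pure / df_pure)
print(f"MSLOF = {ss_lof / df_lof:.1f}, MSPE = {ss_pure / df_pure:.1f}, F = {f_lof:.2f}")
```

An F ratio below 1 cannot exceed any upper-tail critical value of the F distribution, so the straight line is not rejected by this test.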
Based on the lack of fit test, you decided that the linear trend is fine, and prepared a plot of
the observed records and the linear trend.
c)
Include the plot of predicted values vs distance.
Still, a look at the residuals would not be a bad idea.
d)
Present a plot of residuals against predicted values and discuss the linear regression fit.
Does the plot of predicted values against distance fit the data well? Does the residual plot
show whether the linear fit was adequate?
e)
After looking at the residuals, you decided to try a power function for this
relationship. This power function is expressed as a linear function after a log
transformation for both x- and y-variables, as shown next,
$time = M \times distance^{b_1}$
$\log_{10}(time) = \log_{10}(M) + b_1 \log_{10}(distance)$
$y = b_0 + b_1 x$
where
$b_0 = \log_{10}(M)$
$y = \log_{10}(time)$
$x = \log_{10}(distance)$
f)
Write the power function equation estimated for this data. What are the estimated
values for M and $b_1$?
(page 12)
$time = M \times distance^{b_1}$
$b_0 = \log_{10}(M) = -1.21326$
$M = 10^{-1.21326} = 0.061198$
$b_1 = 1.10905$
$\hat{y}_i = -1.21326 + 1.10905\, x_i$
$\widehat{time} = 0.061198 \times distance^{1.10905}$
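A minimal sketch of the log-log fit is given below. It assumes the 20 records printed in the table for question 1 and uses hypothetical variable names; the estimates agree with the values above to about three decimals, not exactly.

```python
import math

# World-record times (sec) from the table in question 1.
distance = [100, 200, 400, 800, 1500, 3000, 5000, 10000, 21097.5, 42195]
male = [9.77, 19.32, 43.18, 101.11, 206.00, 440.67, 757.35, 1577.53, 3535.00, 7495.00]
female = [10.49, 21.34, 47.60, 113.28, 230.46, 486.11, 864.53, 1771.78, 4004.00, 8125.00]

# Log-transform both variables: x = log10(distance), y = log10(time).
x = [math.log10(d) for d in distance + distance]
y = [math.log10(t) for t in male + female]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

b1 = sxy / sxx                # exponent of the power function
b0 = y_bar - b1 * x_bar       # log10 of the multiplier M
m_hat = 10 ** b0              # back-transform the intercept

print(f"b0 = {b0:.5f}, b1 = {b1:.5f}, M = {m_hat:.6f}")
```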
g)
Compute the predicted mean for a distance of 100 meters, $\widehat{time}_{100}$.
Repeat the computation for a distance of 1000 meters, $\widehat{time}_{1000}$.
Note that
$\widehat{time}(x_{new}) = 10^{\hat{y}}$
where $\hat{y} = b_0 + b_1 x_{new}$
For distance = 100 meters, $x_{100} = \log_{10}(100) = 2$
$\hat{y}_{distance=100} = -1.21326 + 1.10905 \times 2 = 1.00484$
$\widehat{time}_{distance=100} = 10^{1.00484} = 10.11207$ sec.
For distance = 1000 meters, $x_{1000} = \log_{10}(1000) = 3$
$\hat{y}_{distance=1000} = -1.21326 + 1.10905 \times 3 = 2.11389$
$\widehat{time}_{distance=1000} = 10^{2.11389} = 129.984$ sec.
h)
Find the ratio $\widehat{time}_{1000} / \widehat{time}_{100}$. Interpret $b_1$.
$\log_{10}\left(\widehat{time}_{1000} / \widehat{time}_{100}\right) = \log_{10}\widehat{time}_{1000} - \log_{10}\widehat{time}_{100} = b_1 (3 - 2) = 1.10905$
$\widehat{time}_{1000} / \widehat{time}_{100} = 129.984 / 10.11207 = 12.8543 = 10^{1.10905}$
For a 10-fold increase in distance, the new time is
12.8543 × (base time),
since $10^{1.10905} = 12.8543$.
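The same ratio holds for any base distance, not only for 100 meters. A short derivation, written with $d$ as a generic distance (a symbol introduced here for illustration):

```latex
\frac{\widehat{time}(10d)}{\widehat{time}(d)}
  = \frac{M\,(10d)^{b_1}}{M\,d^{b_1}}
  = 10^{b_1}
  = 10^{1.10905}
  \approx 12.85
```

Because $b_1 > 1$, a 10-fold increase in distance multiplies the predicted record time by more than 10.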
2.
Estimate the mean record time for a distance of 25 km. Calculate the 95% confidence
interval for this predicted mean.
$x_{25000} = \log_{10}(25000) = 4.39794$
$\hat{y}_{distance=25000} = -1.21326 + 1.10905 \times 4.39794 = 3.664275$
$\widehat{time}_{distance=25000} = 10^{3.664275} = 4616.102$ sec.
(SAS output pred time is 4615.99)
Confidence interval for the estimated mean time for a distance of 25000 meters
The MEANS Procedure
Variable        Corrected SS         Mean       Variance      Std Dev
----------------------------------------------------------------------
distance          3949839136      9191.04      171732136     13104.66
time               112592378      1485.19     5925914.65      2434.32
LOG_TIME          16.8274621    2.4585597      0.8856559    0.9410929
LOG_distance      16.1219348    3.3754829      0.7009537    0.8372298
----------------------------------------------------------------------
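Confidence interval for $E(Y \mid X = x_o)$, the conditional mean of Y = log10(time) when distance = 25000, that is, $x_o = \log_{10}(25000) = 4.39794$:
$\hat{\mu}_{Y|X=x_o} = -1.21326 + 1.10905 \times 4.39794 = 3.664275$
$Var\left(\hat{\mu}_{Y|X=x_o}\right) = \sigma^2 \left[ \frac{1}{n} + \frac{(x_o - \bar{x})^2}{\sum (x_i - \bar{x})^2} \right]$
$\widehat{Var}\left(\hat{\mu}_{Y|X=x_o}\right) = MSE \left[ \frac{1}{20} + \frac{(3.664275 - 3.375483)^2}{16.121935} \right] = 0.00005903525 = 0.00768344^2$
$t_{18,\,0.05/2} = 2.100922$
Note that this substitution places $\hat{y} = 3.664275$ where the formula calls for $x_o = 4.39794$. In addition, the LOG_distance mean and corrected SS in the MEANS output are based on 24 values (corrected SS / variance = 23), not on the 20 records with observed times. Both points make the hand-computed interval narrower than the SAS limits quoted below.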
The $100(1 - \alpha)\%$ confidence interval for $\mu_{Y|X=x_o}$ is given by
$\hat{\mu}_{Y|X=x_o} \pm t_{n-2,\,\alpha/2} \sqrt{ MSE \left[ \frac{1}{n} + \frac{(x_o - \bar{x})^2}{\sum (x_i - \bar{x})^2} \right] }$
$3.664275 \pm 2.100922 \times 0.00768344$
$Conf\left(3.648133 \le \mu_{y|x=4.39794} \le 3.680417\right) = 95\%$
$Lower_{distance=25000} = 10^{3.648133} = 4447.671$
$Upper_{distance=25000} = 10^{3.680417} = 4790.902$
$Conf\left(4447.671 \le Time \mid distance = 25000 \le 4790.902\right) = 95\%$
In SAS output: log10 scale: lower limit = 3.63890, upper limit = 3.68963
Original scale: lower limit = 4354.14, upper limit = 4893.59
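The interval can also be recomputed directly from the records, which is roughly what the SAS limits reflect. The sketch below is an illustration under stated assumptions: it uses the 20 records printed in the table for question 1, takes $x_o = \log_{10}(25000)$ in the numerator, takes the critical value 2.100922 from the text above, and uses hypothetical variable names. Its limits land near the SAS limits rather than matching them exactly.

```python
import math

# World-record times (sec) from the table in question 1.
distance = [100, 200, 400, 800, 1500, 3000, 5000, 10000, 21097.5, 42195]
male = [9.77, 19.32, 43.18, 101.11, 206.00, 440.67, 757.35, 1577.53, 3535.00, 7495.00]
female = [10.49, 21.34, 47.60, 113.28, 230.46, 486.11, 864.53, 1771.78, 4004.00, 8125.00]

x = [math.log10(d) for d in distance + distance]
y = [math.log10(t) for t in male + female]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
b0 = y_bar - b1 * x_bar
mse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y)) / (n - 2)

T_CRIT = 2.100922                      # t(18, 0.025), value quoted in the text

x0 = math.log10(25000)                 # new point on the log10 scale
y0_hat = b0 + b1 * x0                  # estimated mean of log10(time)
se_mean = math.sqrt(mse * (1 / n + (x0 - x_bar) ** 2 / sxx))

lower_log = y0_hat - T_CRIT * se_mean
upper_log = y0_hat + T_CRIT * se_mean
lower_sec, upper_sec = 10 ** lower_log, 10 ** upper_log   # back-transformed limits

print(f"log10 scale: ({lower_log:.5f}, {upper_log:.5f})")
print(f"seconds:     ({lower_sec:.1f}, {upper_sec:.1f})")
```

Back-transforming the limits gives an interval that is not symmetric around the predicted time on the original scale.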
i)
A plot of residuals against the predicted values is presented below. Discuss whether a
separate fit is needed for males and females.
[Figure: Residual plot. Residual (vertical axis, -0.06 to 0.06) against Predicted Value of LOG_TIME (horizontal axis, 1 to 4), with points marked by gender (F, M).]
Note that women tend to have higher residuals than men, in a clear pattern. This may be evidence that a common linear equation
for both men and women underestimates the record time for a given distance when the runner is a woman and
overestimates it when the runner is a man. It is also important to note that these records provide
just one observation per distance for each group over the range considered. When a separate curve is fitted for each group, any
statistical test is made against an MSE that basically measures the lack of fit from the linear trend. No measure of “pure error” is
available.
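The pattern described above can be checked numerically by averaging the residuals of the common log-log fit within each gender. This is a sketch with hypothetical names, assuming the 20 records printed in the table for question 1.

```python
import math

# World-record times (sec) from the table in question 1.
distance = [100, 200, 400, 800, 1500, 3000, 5000, 10000, 21097.5, 42195]
male = [9.77, 19.32, 43.18, 101.11, 206.00, 440.67, 757.35, 1577.53, 3535.00, 7495.00]
female = [10.49, 21.34, 47.60, 113.28, 230.46, 486.11, 864.53, 1771.78, 4004.00, 8125.00]

x = [math.log10(d) for d in distance + distance]
y = [math.log10(t) for t in male + female]
n = len(x)

# Common fit for both genders on the log10-log10 scale.
x_bar = sum(x) / n
y_bar = sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
b0 = y_bar - b1 * x_bar

# Residual = observed log10(time) minus fitted value; the first 10 are male.
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
mean_m = sum(residuals[:10]) / 10
mean_f = sum(residuals[10:]) / 10

print(f"mean residual, men:   {mean_m:+.4f}")
print(f"mean residual, women: {mean_f:+.4f}")
```

The two means are equal in size and opposite in sign because the residuals of a least-squares fit with an intercept sum to zero.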
2. The following data present the results of a study of the effect of ambient temperature and liquid
viscosity on the amount of energy (joules/sec) honeybees spend while drinking. Temperature levels
were 20 and 30 °C. Levels of liquid viscosity refer to the percent of sucrose in total solids dissolved in
the liquid. There were two levels of sucrose, 20% and 40%. Each of the 4 combinations of temperature
and viscosity was repeated three times in controlled conditions, randomly assigning the bees to
each of the four experimental groups.
The following variables were used in the analysis to simplify calculations:
$X_1 = \frac{Temperature - 25}{5}$
$X_2 = \frac{Sucrose - 30}{10}$
$X_3 = X_1 X_2$
Note

Temperature    X1
20             -1
30              1

Sucrose        X2
20             -1
40              1

Temperature    Sucrose    X3
20             20          1
20             40         -1
30             20         -1
30             40          1
Data.
Obs   temperature   sucrose   rep   energy   x1   x2   x3
 1        20           20      1      3.1    -1   -1    1
 2        20           20      2      3.7    -1   -1    1
 3        20           20      3      4.7    -1   -1    1
 4        20           40      1      5.5    -1    1   -1
 5        20           40      2      6.7    -1    1   -1
 6        20           40      3      7.3    -1    1   -1
 7        30           20      1      6.0     1   -1   -1
 8        30           20      2      6.9     1   -1   -1
 9        30           20      3      7.5     1   -1   -1
10        30           40      1     11.5     1    1    1
11        30           40      2     12.9     1    1    1
12        30           40      3     13.4     1    1    1
a) The following regression model was fit to study the effect of temperature, sucrose and their
interaction on the amount of energy spent.
$y_j = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + e_j$, $j = 1, \ldots, 12$
b) Test the following hypothesis
$H_0: \beta_1 = \beta_2 = \beta_3 = 0$
$H_1$: not all $\beta_i = 0$, $i = 1, 2, 3$
(page 9)
SS(Model) = 122.78, df(Model) = 3, MS(Model) = 122.78/3 = 40.93, F = 53.97, p-value < 0.0001. Reject $H_0$: at least one of the regression
coefficients $\beta_i \neq 0$.
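The overall F test can be reproduced from the data table with a short sketch. It relies on one stated fact: with an intercept, X1, X2 and X3 the model has four parameters for four treatment cells, so its fitted values are the cell means. Variable names are hypothetical.

```python
# Energy (joules/sec) by (temperature, sucrose) cell, from the data table above.
energy = {
    (20, 20): [3.1, 3.7, 4.7],
    (20, 40): [5.5, 6.7, 7.3],
    (30, 20): [6.0, 6.9, 7.5],
    (30, 40): [11.5, 12.9, 13.4],
}

all_y = [v for ys in energy.values() for v in ys]
n = len(all_y)
grand_mean = sum(all_y) / n

# Fitted values of the saturated 2x2 model are the cell means.
cell_mean = {cell: sum(ys) / len(ys) for cell, ys in energy.items()}

ss_model = sum(len(ys) * (cell_mean[cell] - grand_mean) ** 2 for cell, ys in energy.items())
ss_error = sum((v - cell_mean[cell]) ** 2 for cell, ys in energy.items() for v in ys)

df_model = 3                 # three slopes: X1, X2, X3
df_error = n - 4             # 12 observations minus 4 parameters

ms_model = ss_model / df_model
ms_error = ss_error / df_error
f_stat = ms_model / ms_error

print(f"SS(Model) = {ss_model:.2f}, MS(Model) = {ms_model:.2f}, "
      f"MSE = {ms_error:.4f}, F = {f_stat:.2f}")
```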
c) Write the estimated regression equation (replace each regression coefficient by its
estimated value).
$\hat{y}_j = 7.4333 + 2.2667 X_1 + 2.1167 X_2 + 0.7833 X_3$, $j = 1, \ldots, 12$
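Because the coded columns take only the values -1 and 1 in a balanced design, the least-squares estimates have a simple closed form. The sketch below illustrates that shortcut; it is valid only for this orthogonal coding, and the helper name `code` is hypothetical.

```python
# (temperature, sucrose, energy) rows from the data table above.
rows = [
    (20, 20, 3.1), (20, 20, 3.7), (20, 20, 4.7),
    (20, 40, 5.5), (20, 40, 6.7), (20, 40, 7.3),
    (30, 20, 6.0), (30, 20, 6.9), (30, 20, 7.5),
    (30, 40, 11.5), (30, 40, 12.9), (30, 40, 13.4),
]

def code(temperature, sucrose):
    """Return the coded variables (X1, X2, X3) defined in the text."""
    x1 = (temperature - 25) / 5
    x2 = (sucrose - 30) / 10
    return x1, x2, x1 * x2

n = len(rows)
sums = [0.0, 0.0, 0.0, 0.0]
for temperature, sucrose, y in rows:
    x1, x2, x3 = code(temperature, sucrose)
    for k, xk in enumerate((1.0, x1, x2, x3)):
        sums[k] += xk * y

# The four columns (1, X1, X2, X3) are mutually orthogonal, each with
# sum of squares n, so every estimate reduces to sum(x * y) / n.
b0, b1, b2, b3 = (s / n for s in sums)

print(f"b0 = {b0:.4f}, b1 = {b1:.4f}, b2 = {b2:.4f}, b3 = {b3:.4f}")
```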
d) Write the test hypothesis for each parameter, and the conclusion.
$H_0: \beta_1 = 0$, $H_1: \beta_1 \neq 0$, $\alpha = 0.05$
p-value < 0.0001. Reject $H_0$: the regression coefficient for X1 is significantly
different from zero.
$H_0: \beta_2 = 0$, $H_1: \beta_2 \neq 0$, $\alpha = 0.05$
p-value < 0.0001. Reject $H_0$: the regression coefficient for X2 is significantly
different from zero.
$H_0: \beta_3 = 0$, $H_1: \beta_3 \neq 0$, $\alpha = 0.05$
p-value ≈ 0.014 (t = 3.12 on 8 error df, computed from the data above). Reject $H_0$: the regression coefficient for X3 is significantly
different from zero.
e) What is the change in energy when the temperature increases from 20 to 30 and the
viscosity is 20%?
Temperature = 20: $X_1 = \frac{20 - 25}{5} = -1$
Temperature = 30: $X_1 = \frac{30 - 25}{5} = 1$
Sucrose = 20: $X_2 = \frac{20 - 30}{10} = -1$
$X_3 = X_1 X_2 = (-1)(-1) = 1$ for $X_1 = -1$ and $X_2 = -1$
$X_3 = X_1 X_2 = (1)(-1) = -1$ for $X_1 = 1$ and $X_2 = -1$
$\hat{y}(X_1 = -1, X_2 = -1, X_3 = 1) = 7.4333 + 2.2667(-1) + 2.1167(-1) + 0.7833(1) = 3.8332$
$\hat{y}(X_1 = 1, X_2 = -1, X_3 = -1) = 7.4333 + 2.2667(1) + 2.1167(-1) + 0.7833(-1) = 6.8$
Note that for $X_2 = -1$ we have $X_3 = -X_1$, so
$\hat{y}_j = 7.4333 + 2.2667 X_1 + 2.1167(-1) + 0.7833 X_3 = (7.4333 - 2.1167) + (2.2667 - 0.7833) X_1$
$\hat{y}_j = 5.3166 + 1.4834 X_1$
6.8 - 3.8332 = 2.9668 = 2(2.2667) - 2(0.7833) = 2(1.4834)
This is twice the 1-unit change in X1 when X2 = -1, because raising the temperature from 20 to 30 moves X1 from -1 to 1.
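The arithmetic of part e) can be checked with a small prediction helper. This is a sketch only; `predict_energy` is a hypothetical name and the coefficients are the rounded estimates from part c).

```python
# Rounded estimates from part c).
B0, B1, B2, B3 = 7.4333, 2.2667, 2.1167, 0.7833

def predict_energy(temperature, sucrose):
    """Predicted energy (joules/sec) from the coded regression equation."""
    x1 = (temperature - 25) / 5
    x2 = (sucrose - 30) / 10
    return B0 + B1 * x1 + B2 * x2 + B3 * x1 * x2

low = predict_energy(20, 20)     # X1 = -1, X2 = -1, X3 = 1
high = predict_energy(30, 20)    # X1 = 1,  X2 = -1, X3 = -1
change = high - low              # equals 2 * (B1 - B3) when X2 = -1

print(round(low, 4), round(high, 4), round(change, 4))
```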
The table of means for each experimental group is presented next.
Group means
                     Sucrose
Temperature      20       40      Mean
20               3.8      6.5     5.2
30               6.8     12.6     9.7
Mean             5.3      9.6     7.4
f)
Show that the predicted mean for Temperature 20 when sucrose is at its average value is equal
to the observed mean for this temperature over both sucrose levels.
Temperature = 20: $X_1 = -1$
Sucrose = 30: $X_2 = 0$
$X_3 = (-1)(0) = 0$
$\hat{y}_j = 7.4333 + 2.2667(-1) + 2.1167(0) + 0.7833(0)$
$\hat{y}(x_1 = -1, x_2 = 0, x_3 = 0) = 7.4333 - 2.2667 = 5.1666$
which, rounded to one decimal, is the same as $\bar{y}_{Temperature20,\,\cdot} = 5.2$, the mean response for Temperature = 20 averaged over
sucrose.
g) Show that the predicted mean for Sucrose = 40 when temperature is at its average value is
equal to the observed mean for that sucrose level over both temperature levels.
Temperature = 25: $X_1 = 0$
Sucrose = 40: $X_2 = 1$
$X_3 = (0)(1) = 0$
$\hat{y}_j = 7.4333 + 2.2667(0) + 2.1167(1) + 0.7833(0)$
$\hat{y}(x_1 = 0, x_2 = 1, x_3 = 0) = 7.4333 + 2.1167 = 9.5500$
which, rounded to one decimal, is the same as $\bar{y}_{\cdot,\,Sucrose40} = 9.6$, the mean response for Sucrose = 40 averaged over
temperature.
h) Show that the predicted value for Temperature = 20 and Sucrose = 40 is equal to the observed mean
for the corresponding experimental group.
Temperature = 20: $X_1 = -1$
Sucrose = 40: $X_2 = 1$
$X_3 = (-1)(1) = -1$
$\hat{y}_j = 7.4333 + 2.2667(-1) + 2.1167(1) + 0.7833(-1)$
$\hat{y}(x_1 = -1, x_2 = 1, x_3 = -1) = 6.5$
which is the same as the mean response for Temperature = 20 and Sucrose = 40,
$\bar{y}_{Temperature20,\,Sucrose40} = 6.5$.
i)
Use the following graph to explain the significance of X3.
[Figure: Energy against temperature by sucrose level. Vertical axis: mean energy (0 to 14); horizontal axis: temperature (20 to 30); one line for each sucrose level (20 and 40).]
When Sucrose = 20, the slope of the regression line of energy on X1 (one coded unit of X1 = 5 °C) is 2.2667 - 0.7833 = 1.4834,
while for Sucrose = 40 (its high value) this slope is 2.2667 + 0.7833 = 3.05. The increase in the energy bees need to drink per
unit increase in temperature therefore depends on the viscosity of the liquid, and is larger when
viscosity is higher, over the observed ranges of 20-40% sucrose in the liquid and 20-30 °C ambient
temperature.