May - Statistics GIDP

Statistics GIDP Ph.D. Qualifying Exam, Methodology
May 28th, 2015, 9:00am-1:00pm

Instructions: Provide answers on the supplied pads of paper; write on only one side of each
sheet. Complete exactly 2 of the first 3 problems, and 2 of the last 3 problems. Turn in only
those sheets you wish to have graded. You may use the computer and/or a calculator; any
statistical tables that you may need are also provided. Stay calm and do your best; good
luck.
1. Data are collected from an experiment to compare three different brands of pens and
three different wash treatments with respect to their ability to remove marks on a particular
type of fabric. There are four replications in each combination of brands and treatments. The
observation averages within each combination are given in the following table.
              Treatment 1   Treatment 2   Treatment 3
   Pen 1          5.11          6.65          6.51
   Pen 2          6.44          7.92          7.24
   Pen 3          7.01          8.15          7.77

1) Consider the means model y_ijk = mu_ij + e_ijk, where e_ijk ~ N(0, sigma^2); MSE = 0.2438.
Compute the estimates of the mu_ij along with their standard errors.

The estimates are the cell averages:
mu_11 = 5.11, mu_12 = 6.65, mu_13 = 6.51,
mu_21 = 6.44, mu_22 = 7.92, mu_23 = 7.24,
mu_31 = 7.01, mu_32 = 8.15, mu_33 = 7.77.
Each estimate has standard error se = sqrt(MSE/n) = sqrt(0.2438/4) = 0.2469.

2) Propose an effect model with the interaction effect included; compute the estimates of the
parameters.
The effect model is
   y_ijk = mu + alpha_i + beta_j + (alpha*beta)_ij + e_ijk,  i = 1,2,3, j = 1,2,3, k = 1,...,4,
with the assumptions
   sum_i alpha_i = 0,  sum_j beta_j = 0,  sum_i (alpha*beta)_ij = sum_j (alpha*beta)_ij = 0,
   and e_ijk ~ N(0, sigma^2).
The estimates are sigma^2-hat = MSE = 0.2483 and
   mu-hat = 6.98
   alpha_1 = 6.09 - 6.98 = -0.89,  alpha_2 = 7.20 - 6.98 = 0.22,  alpha_3 = 7.64 - 6.98 = 0.66
   beta_1 = 6.19 - 6.98 = -0.79,  beta_2 = 7.57 - 6.98 = 0.59,  beta_3 = 7.17 - 6.98 = 0.19
   (alpha*beta)_11 = 5.11 - 6.09 - 6.19 + 6.98 = -0.19
   (alpha*beta)_12 = 6.65 - 6.09 - 7.57 + 6.98 = -0.03
   (alpha*beta)_13 = 6.51 - 6.09 - 7.17 + 6.98 = 0.23
   (alpha*beta)_21 = 6.44 - 7.20 - 6.19 + 6.98 = 0.03
   (alpha*beta)_22 = 7.92 - 7.20 - 7.57 + 6.98 = 0.13
   (alpha*beta)_23 = 7.24 - 7.20 - 7.17 + 6.98 = -0.15
   (alpha*beta)_31 = 7.01 - 7.64 - 6.19 + 6.98 = 0.16
   (alpha*beta)_32 = 8.15 - 7.64 - 7.57 + 6.98 = -0.08
   (alpha*beta)_33 = 7.77 - 7.64 - 7.17 + 6.98 = -0.06

3) Compute standard errors of the parameter estimates for the main effects in part 2).
Write alpha_1-hat = ybar_1.. - ybar_... = ybar_1.. - (ybar_1.. + ybar_2.. + ybar_3..)/3
                  = (2/3) ybar_1.. - (1/3) ybar_2.. - (1/3) ybar_3.. .
Since the three row averages are independent, each with variance sigma^2/(3*4),
   var(alpha_1-hat) = (4/9 + 1/9 + 1/9) * sigma^2/(3*4) = (2/3) * sigma^2/12 = sigma^2/18.
Since a = b = 3, every main-effect estimate has the same standard error,
   se = sqrt(MSE/18) = sqrt(0.2483/18) = 0.117.

4) Complete the following ANOVA table.
Source        DF    SS        MS         F-value
Pen            2    6.0677    3.03385    12.22
Treatment      2    1.4054    0.70270     2.83
Interaction    4    0.6595    0.164875    0.664
Error         27    6.7041    0.2483
Total         35   14.8367

Here SS_Pen = bn * sum_i (ybar_i.. - ybar_...)^2 and SS_Trt = an * sum_j (ybar_.j. - ybar_...)^2,
and each MS is the corresponding SS divided by its DF.
5) (3pts) Compute R-square and test significance of the interaction effect.
R-square = SS_model/SS_total = (6.0677 + 1.4054 + 0.6595)/14.8367 = 8.1326/14.8367 = 0.5481.
The p-value for the interaction is P{F(4,27) > 0.664} > 0.05, so the interaction is not significant.

2. An engineer suspects that the surface finish of a metal part is influenced by the feed rate and
the depth of cut. She selects three feed rates and three depths of cut. However, only 9 runs
can be made in one day. She runs a complete replicate of the design on each day - the
combination of the levels of feed rate and depth of cut is chosen randomly. The data are
shown in the following table (dataset β€œSurface.csv” is provided). Assume that the days are
blocks.
                    Day 1                        Day 2
                    Depth                        Depth
Feed rate     0.15   0.18   0.20           0.15   0.18   0.20
0.2            74     79     82             64     68     88
0.25           92     98     99             86    104    108
0.3            99    104    108             98     99    110

(a) (3 pts) What design is this?
A 3x3 factorial design run in blocks: each day is a complete replicate, i.e., a factorial
design with days as blocks.
(b) (6 pts) State the statistical model and the corresponding assumptions.
   y_ijk = mu + tau_i + beta_j + (tau*beta)_ij + delta_k + e_ijk,  i = 1,2,3, j = 1,2,3, k = 1,2,
with
   sum_i tau_i = 0,  sum_j beta_j = 0,  sum_i (tau*beta)_ij = sum_j (tau*beta)_ij = 0,
   sum_k delta_k = 0,  and e_ijk ~ N(0, sigma^2).
Here day (delta_k) is the block factor.

(c) Make conclusions at false positive rate = 0.05 and check model adequacy.
The ANOVA table is shown below. Both Depth and Feed factors are significant at
alpha=0.05. There is no unusual pattern detected in the QQ normality plot or the
residual plot.
Source        DF    Type III SS    Mean Square    F Value    Pr > F
Day            1       5.555556       5.555556       0.21    0.6610
Depth          2     560.777778     280.388889      10.46    0.0059
Feed           2    2497.444444    1248.722222      46.58    <.0001
Depth*Feed     4      68.888889      17.222222       0.64    0.6473

(d) What is the difference in randomization for experimental runs between this design
and the one in Question 3?
In this design, within each day the combinations of the levels of feed rate and depth of cut
are run in a completely random order. In the design of Question 3, within a day a particular
mix is randomly selected and then applied to a panel by the three application methods, so the
methods are randomized only within each mix.

(e) Attach SAS/R code
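As a supplement to the SAS program, the ANOVA can be cross-checked by computing the balanced-design sums of squares directly. A Python sketch (the data are re-entered from the table; in this balanced layout the Type I and Type III sums of squares coincide):

```python
import numpy as np

# Surface data: y[day, feed, depth]; feeds = (0.2, 0.25, 0.3), depths = (0.15, 0.18, 0.20)
y = np.array([[[74, 79, 82], [92, 98, 99], [99, 104, 108]],
              [[64, 68, 88], [86, 104, 108], [98, 99, 110]]], dtype=float)

g = y.mean()                                          # grand mean
ss_day   = 9 * ((y.mean(axis=(1, 2)) - g) ** 2).sum()
ss_feed  = 6 * ((y.mean(axis=(0, 2)) - g) ** 2).sum()
ss_depth = 6 * ((y.mean(axis=(0, 1)) - g) ** 2).sum()
cell = y.mean(axis=0)                                 # feed-by-depth cell means (2 reps each)
ss_int = 2 * ((cell - y.mean(axis=(0, 2))[:, None]
                    - y.mean(axis=(0, 1))[None, :] + g) ** 2).sum()
ss_err = ((y - g) ** 2).sum() - ss_day - ss_feed - ss_depth - ss_int

mse = ss_err / 8                                      # error df = 17 - 1 - 2 - 2 - 4 = 8
F_depth, F_feed, F_int = (ss_depth/2)/mse, (ss_feed/2)/mse, (ss_int/4)/mse
print(round(F_depth, 2), round(F_feed, 2), round(F_int, 2))
```

The resulting F ratios reproduce the 10.46, 46.58, and 0.64 shown in the ANOVA table.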
data surf;
input Day Depth Feed Surface @@;
datalines;
1 0.15 0.20  74   1 0.18 0.20  79   1 0.20 0.20  82
1 0.15 0.25  92   1 0.18 0.25  98   1 0.20 0.25  99
1 0.15 0.30  99   1 0.18 0.30 104   1 0.20 0.30 108
2 0.15 0.20  64   2 0.18 0.20  68   2 0.20 0.20  88
2 0.15 0.25  86   2 0.18 0.25 104   2 0.20 0.25 108
2 0.15 0.30  98   2 0.18 0.30  99   2 0.20 0.30 110
;
proc glm data=surf;
  class Day Depth Feed;
  model Surface = Day Depth Feed Depth*Feed;
  output out=myresult r=res p=pred;
run;
proc univariate data=myresult normal;
  var res;
  qqplot res / normal(MU=0 SIGMA=EST COLOR=RED L=1);
run;
proc sgplot data=myresult;
  scatter x=pred y=res;
  refline 0;
run;

3. An experiment is designed to study pigment dispersion in paint. Four different mixes of a
particular pigment are studied. The procedure consists of preparing a particular mix and then
applying that mix to a panel by three application methods (brushing, spraying, and rolling).
The response measured is the percentage reflectance of the pigment. Three days are required
to run the experiment. The data follow (dataset β€œpigment.csv” is provided). Assume that
mixes and application methods are fixed.
(a) (3 pts) What design is this?
A split-plot design: days serve as replicates (blocks), mixes are the whole-plot
treatments, and application methods are the subplot treatments.
(b) (6 pts) State the statistical model and the corresponding assumptions.
   y_ijk = mu + gamma_i + alpha_j + (gamma*alpha)_ij + beta_k + (alpha*beta)_jk + e_ijk,
   i = 1,...,3 (days), j = 1,...,4 (mixes), k = 1,...,3 (methods),
with
   gamma_i ~ N(0, sigma_gamma^2),  (gamma*alpha)_ij ~ N(0, sigma_ga^2),
   sum_j alpha_j = 0,  sum_k beta_k = 0,  sum_j (alpha*beta)_jk = sum_k (alpha*beta)_jk = 0,
   and e_ijk ~ N(0, sigma^2).

(c) Make conclusions and check model adequacy.
The ANOVA table is shown below. Both the Mix and Method factors are significant at
alpha = 0.05, and their interaction is marginally significant (p-value = 0.0678). There is no
unusual pattern noticed in the QQ normality plot or the residual plot.
Type 3 Tests of Fixed Effects
Effect        Num DF    Den DF    F Value    Pr > F
Mix                3         6     135.77    <.0001
Method             2        16     165.30    <.0001
Mix*Method         6        16       2.49    0.0678
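The denominator degrees of freedom in this table reflect the split-plot error structure: Mix (the whole-plot factor) is tested against the Day*Mix whole-plot error, while Method and Mix*Method are tested against the subplot error. A small Python check of the degrees-of-freedom bookkeeping (r = 3 days, a = 4 mixes, b = 3 methods):

```python
# Split-plot degrees of freedom for r blocks (days), a whole-plot levels (mixes),
# and b subplot levels (methods)
r, a, b = 3, 4, 3

df_day        = r - 1              # 2
df_mix        = a - 1              # 3
df_wp_error   = (r - 1) * (a - 1)  # Day*Mix whole-plot error
df_method     = b - 1              # 2
df_mix_method = (a - 1) * (b - 1)  # 6
# Subplot error: what remains of the total df; equals a*(r-1)*(b-1)
df_sp_error = (r*a*b - 1) - df_day - df_mix - df_wp_error - df_method - df_mix_method

print(df_wp_error, df_sp_error)  # denominator df for Mix, and for Method/Mix*Method
```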
(d) Now assume the application methods are random while the other terms are kept the
same as before. State the statistical model and the corresponding assumptions using the
unrestricted method; reanalyze the data.

   y_ijk = mu + gamma_i + alpha_j + (gamma*alpha)_ij + beta_k + (alpha*beta)_jk + e_ijk,
   i = 1,...,3, j = 1,...,4, k = 1,...,3,
with
   gamma_i ~ N(0, sigma_gamma^2),  (gamma*alpha)_ij ~ N(0, sigma_ga^2),  sum_j alpha_j = 0,
   beta_k ~ N(0, sigma_beta^2),  (alpha*beta)_jk ~ N(0, sigma_ab^2),  and e_ijk ~ N(0, sigma^2).

The analysis shows that only the fixed effect Mix is significant.

Type 3 Tests of Fixed Effects
Effect    Num DF    Den DF    F Value    Pr > F
Mix            3         6      58.37    <.0001
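The Wald Z statistics and one-sided p-values reported in the covariance-parameter table that follows are simply estimate/(standard error) and P{Z > z}. A Python spot-check of three of the rows:

```python
from scipy import stats

# (estimate, standard error) pairs taken from the covariance parameter table
est = {"Day":      (0.02216, 0.09250),
       "Method":   (9.1146,  9.2543),
       "Residual": (0.6718,  0.2375)}

z = {k: e / se for k, (e, se) in est.items()}        # Wald Z = estimate / SE
p = {k: stats.norm.sf(v) for k, v in z.items()}      # one-sided p-value P{Z > z}
print({k: (round(z[k], 2), round(p[k], 4)) for k in z})
```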
The covariance parameter estimates for all random effects are shown below; no random effect
is significant at alpha = 0.05. The QQ plot and residual plot (not reproduced here) show no
unusual patterns, so the model assumptions appear to be met.

Covariance Parameter Estimates
Cov Parm      Estimate   Standard Error   Z Value   Pr > Z    Alpha   Lower      Upper
Day           0.02216    0.09250          0.24      0.4053    0.05    0.001987   1.767E25
Mix*Day       0.02770    0.1655           0.17      0.4335    0.05    0.002525   1.934E54
Method        9.1146     9.2543           0.98      0.1623    0.05    2.4387     396.94
Mix*Method    0.3336     0.3315           1.01      0.1571    0.05    0.09094    12.6598
Residual      0.6718     0.2375           2.83      0.0023    0.05    0.3726     1.5561

(e) Attach SAS/R code

data pigment;
input Mix Method Day Resp @@;
datalines;
1 1 1 64.5   1 2 1 68.3   1 3 1 70.3
1 1 2 65.2   1 2 2 69.2   1 3 2 71.2
1 1 3 66.2   1 2 3 69.0   1 3 3 70.8
2 1 1 66.3   2 2 1 69.5   2 3 1 73.1
2 1 2 65.0   2 2 2 70.3   2 3 2 72.8
2 1 3 66.5   2 2 3 69.0   2 3 3 74.2
3 1 1 74.1   3 2 1 73.8   3 3 1 78.0
3 1 2 73.8   3 2 2 74.5   3 3 2 79.1
3 1 3 72.3   3 2 3 75.4   3 3 3 80.1
4 1 1 66.5   4 2 1 70.0   4 3 1 72.3
4 1 2 64.8   4 2 2 68.3   4 3 2 71.5
4 1 3 67.7   4 2 3 68.6   4 3 3 72.4
;
/* proc mixed => Stat model 2: only 5 terms included (the remaining terms are
   pooled into the random error term) */
proc mixed data=pigment method=type1;
  class Mix Method Day;
  model Resp = Mix Method Mix*Method / outp=predicted;
  random Day Day*Mix;
run;
proc univariate data=predicted normal;
  var Resid;
  qqplot Resid / normal(MU=0 SIGMA=EST COLOR=RED L=1);
run;
proc sgplot data=predicted;
  scatter x=Pred y=Resid;
  refline 0;
run;
/* part d) */
proc mixed data=pigment CL covtest;
  class Mix Method Day;
  model Resp = Mix / outp=predicted;
  random Day Day*Mix Method Mix*Method;
run;
proc univariate data=predicted normal;
  var Resid;
  qqplot Resid / normal(MU=0 SIGMA=EST COLOR=RED L=1);
run;
proc sgplot data=predicted;
  scatter x=Pred y=Resid;
  refline 0;
run;
4. An intriguing use of loess smoothing for enhancing residual diagnostics employs the method
to verify, or perhaps call into question, indications of variance heterogeneity in a residual
plot. From a regression fit (of any sort: SLR, MLR, loess, etc.) find the absolute residuals |ei|,
i = 1,...,n. To these, apply a loess smooth against the fitted values YΜ‚i. If the loess curve for
the |ei|s exhibits departure from a horizontal line, variance heterogeneity is
indicated/validated. If the smooth appears relatively flat, however, the loess diagnostic
suggests that variation is not necessarily heterogeneous.
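The idea can be sketched in code. The following Python example (an illustration on simulated heteroscedastic data, not the baseball data; it implements a plain local-polynomial smoother with tricube weights, omitting loess's robustness iterations) shows the |ei|-versus-YΜ‚i smooth rising when the error variance grows with the mean:

```python
import numpy as np

def loess_smooth(x, y, x_eval, q=0.5, degree=2):
    """Evaluate a basic loess-style smooth (tricube weights, local polynomial,
    no robustness iterations) at the points in x_eval."""
    n = len(x)
    k = max(degree + 2, int(np.ceil(q * n)))       # points in each local window
    out = []
    for x0 in x_eval:
        d = np.abs(x - x0)
        h = np.sort(d)[k - 1]                      # window half-width
        w = (1 - np.clip(d / h, 0, 1) ** 3) ** 3   # tricube weights
        c = np.polyfit(x - x0, y, degree, w=np.sqrt(w))  # weighted local fit
        out.append(c[-1])                          # local fit evaluated at x0
    return np.array(out)

rng = np.random.default_rng(322)
x = np.linspace(0, 1, 200)
y = 1 + 2 * x + rng.normal(0, 0.1 + 0.9 * x)       # error sd grows with the mean

b1, b0 = np.polyfit(x, y, 1)                       # SLR fit
yhat = b0 + b1 * x
absres = np.abs(y - yhat)                          # absolute residuals

lo, hi = loess_smooth(yhat, absres,
                      [np.quantile(yhat, 0.1), np.quantile(yhat, 0.9)])
print(lo < hi)                                     # smooth rises: heterogeneity flagged
```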
Apply this strategy to the following data: Y ={career batting average} (a number between 0
and 1, reported to three-digit accuracy) recorded as a function of X = {number of years
played} for n = 322 professional baseball players. (The data are found in the file
baseball.csv.) Plot the absolute residuals from a regression fit and overlay the loess smooth
to determine whether or not the loess smooth suggests possible heterogeneous variation.
Use a second-order, robust smooth. Explore the loess fit by varying the smoothing
parameter over selected values in the range 0.25 ≀ q ≀ 0.75.
____________________________________________________________________________
A1. Always plot the data! Sample R code:
baseball.df = read.csv( file.choose() )
attach( baseball.df )
Y = batting.average
X = years
plot( Y ~ X, pch=19 )
The plot indicates an increase in Y = batting average as X = years increases, so consider a
simple linear regression (SLR) fit.
For the loess smooth, first find the absolute residuals and fitted values from the SLR fit
absresid = abs( resid(lm( Y~X )) )
Yhat = fitted( lm( Y~X ) )
then apply loess (use a second-order, robust smooth, to allow for full flexibility). Try the
default smoothing parameter of q = 0.75:
baseball75.lo = loess( absresid~Yhat, span = 0.75, degree = 2,
family='symmetric' )
Ysmooth75 = predict( baseball75.lo,
data.frame(Yhat = seq(min(Yhat),max(Yhat),.001)) )
Plot |ei| against YΜ‚i and overlay the smooth:
plot( absresid~Yhat, xlim=c(.25,.29), ylim=c(0,.11) )
par( new=TRUE )
plot( Ysmooth75~seq(min(Yhat),max(Yhat),.001), type='l', lwd=2,
xaxt='n', yaxt='n' , xlab='', ylab='', xlim=c(.25,.29), ylim=c(0,.11)
)
For comparison, the second-order, robust smooth at q = 0.33 gives:
baseball33.lo = loess( absresid~Yhat, span = 0.33, degree = 2,
family='symmetric' )
Ysmooth33 = predict( baseball33.lo,
data.frame(Yhat = seq(min(Yhat),max(Yhat),.001)) )
plot( absresid~Yhat, xlim=c(.25,.29), ylim=c(0,.11) )
par( new=TRUE )
plot( Ysmooth33~seq(min(Yhat),max(Yhat),.001), type='l', lwd=2,
xaxt='n', yaxt='n' , xlab='', ylab='', xlim=c(.25,.29), ylim=c(0,.11)
)
which gives a more jagged smoothed curve (as would be expected). Also, the second-order,
robust smooth at q = 0.50 yields:
baseball50.lo = loess( absresid~Yhat, span = 0.50, degree = 2,
family='symmetric' )
Ysmooth50 = predict( baseball50.lo,
data.frame(Yhat = seq(min(Yhat),max(Yhat),.001)) )
plot( absresid~Yhat, xlim=c(.25,.29), ylim=c(0,.11) )
par( new=TRUE )
plot( Ysmooth50~seq(min(Yhat),max(Yhat),.001), type='l', lwd=2,
xaxt='n', yaxt='n' , xlab='', ylab='', xlim=c(.25,.29), ylim=c(0,.11)
)
which appears less jagged (again, as would be expected) and more similar to the loess curve
at q = 0.75. From a broader perspective, all the smoothed loess curves suggest a fairly flat
relationship, so the issue of variance heterogeneity may not be critical. (Further
investigation would be warranted.)
The use of loess in this fashion is from
Cleveland, W. S. (1979). Robust locally weighted regression and smoothing scatterplots. Journal of the
American Statistical Association 74(368), 829-836.
The data are from Sec. 3.8 of
Friendly, M. (2000). Visualizing Categorical Data. Cary, NC: SAS Institute, Inc.
5. Suppose you fit a simple linear regression model to data Yi ~ indep. N(alpha + beta*xi, sigma^2),
i = 1,...,n.
(a) Let beta-hat be the usual least squares (LS) estimator of beta. State the distribution of beta-hat.
(b) Let S^2 be the MSE = sum_i (Yi - Yhat_i)^2/(n-2), where Yhat_i = alpha-hat + beta-hat*xi and
alpha-hat is the LS estimator for alpha. Recall that S^2 is known to be an unbiased estimator for
sigma^2. State a result involving the chi-square distribution that involves S^2 and sigma^2. What
one important statistical relation (in terms of probability features) exists between this and the
result you state in part (a)?
(c) Find an unbiased estimator for the ratio beta/sigma.
____________________________________________________________________________
A2: For simplicity, let d = n-2 and v = 1/sum_i (xi - xbar)^2.
(a) beta-hat ~ N(beta, sigma^2 v). Notice that Z = (beta-hat - beta)/(sigma*sqrt(v)) ~ N(0,1).
(b) W = S^2 d/sigma^2 ~ chi-square(d), where d = n-2. This is statistically independent of
beta-hat (and hence of Z) in part (a), since beta-hat and S^2 are independent.
(c) Since the chi-square(d) density integrates to one, and more generally
   integral_0^inf w^(a-1) e^(-w/2) dw = 2^a Gamma(a),
we have, for 0 < b < d/2,
   E[W^(-b)] = (1/(Gamma(d/2) 2^(d/2))) integral_0^inf w^((d/2)-b-1) e^(-w/2) dw
             = Gamma((d/2)-b) 2^((d/2)-b) / (Gamma(d/2) 2^(d/2))
             = 2^(-b) Gamma((d/2)-b) / Gamma(d/2).
In particular,
   E[W^(-1/2)] = E[sigma/(S*sqrt(d))] = Gamma((d-1)/2) / (sqrt(2) Gamma(d/2)).
Now, since Z and W are independent, T = Z/sqrt(W/d) ~ t(d), and so for d > 1, E[T] = 0. But
   T = [(beta-hat - beta)/(sigma*sqrt(v))] * (sigma/S) = (beta-hat - beta)/(S*sqrt(v)) ~ t(d),
so E[(beta-hat - beta)/(S*sqrt(v))] = 0, i.e.,
   E[beta-hat/(S*sqrt(v))] = E[beta/(S*sqrt(v))] = (beta/sqrt(v)) E[1/S].
From the result above, E[1/S] = (sqrt(d)/sigma) Gamma((d-1)/2) / (sqrt(2) Gamma(d/2)), so
multiplying both sides by sqrt(v) gives
   E[beta-hat/S] = (beta/sigma) sqrt(d/2) Gamma((d-1)/2) / Gamma(d/2).
Therefore, an unbiased estimator for beta/sigma is
   (beta-hat/S) * sqrt(2/d) * Gamma(d/2) / Gamma((d-1)/2),  where d = n-2.
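This result is easy to verify numerically. A Monte Carlo sketch in Python (hypothetical values n = 20, alpha = 1, beta = 2, sigma = 3, so beta/sigma = 0.667):

```python
import numpy as np
from math import gamma, sqrt

rng = np.random.default_rng(5)
n, alpha, beta, sigma = 20, 1.0, 2.0, 3.0
d = n - 2
x = np.arange(1.0, n + 1)
# Bias-correction factor from the derivation: sqrt(2/d) * Gamma(d/2) / Gamma((d-1)/2)
correction = sqrt(2.0 / d) * gamma(d / 2) / gamma((d - 1) / 2)

est = []
for _ in range(20000):
    Y = alpha + beta * x + rng.normal(0, sigma, n)
    b1, b0 = np.polyfit(x, Y, 1)                  # LS fit: slope b1, intercept b0
    resid = Y - (b0 + b1 * x)
    S = sqrt((resid ** 2).sum() / d)              # sqrt(MSE)
    est.append((b1 / S) * correction)             # the proposed unbiased estimator

print(np.mean(est))   # should be close to beta/sigma = 0.6667
```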
6. In the study of weathering in mountainous ecosystems, it is expected that silicon weathers
away in soil as ambient temperature increases. In an experiment to study this, data were
recorded on loss of silicon in soil at four independent sites over differing temperature
conditions. These were:
Temp. (˚C+5)    Silicon conc. (mg/kg)
 3              132.1   136.7   146.8   126.1
 5               73.2    63.1    52.8    51.7
10                3.5    17.7     9.2     1.9

Assume that the observations satisfy Yij ~ indep. N(µ{ti}, σ2), i=1,…,3, j=1,…,4, where ti
are the 3 temperatures under study and µ{ti} is some function of ti. Using linear regression
methods, find a model that fits these data both as reasonably and as parsimoniously as
possible. (This question is purposefully open-ended.) From your fit, perform a test to
assess the hypotheses of no effect, vs. some effect, due to temperature change in these sites.
Set your false positive error rate to 0.05.
____________________________________________________________________________
A3. Always plot the data! Sample R code:
silicon.df = read.csv( file.choose() )
attach( silicon.df )
Y = conc
t = temp
plot( Y ~ t, pch=19 )
The plot indicates a decrease in Y = silicon concentration as X = temperature increases, so
consider a simple linear regression (SLR) fit and (first) check the residual plot:
siliconSLR.lm = lm( Y ~ t )
plot( resid(siliconSLR.lm)~t, pch=19 ); abline( h=0 )
The resid. plot indicates a clear pattern (also evident from a close look at the scatterplot): so
a SLR model gives a poor fit. With the observed pattern in the residuals, the obvious thing
to try next is a quadratic model:
siliconQR.lm = lm( Y ~ t + I(t^2) )
plot( resid(siliconQR.lm)~t, pch=19 ); abline( h=0 )
The resid. plot here indicates a better fit, with possibly a slight decrease in variation at
higher temperatures (i.e., slightly heterogeneous variance). But first, overlay the fitted
model on the original data:
bQR = coef( siliconQR.lm )
plot( Y ~ t, pch=19, xlim=c(3,10), ylim=c(0,150) ); par( new=T )
curve( bQR[1] + x*(bQR[2] + x*bQR[3]), xlim=c(3,10), ylim=c(0,150),
ylab='', xlab='' )
The ever-present danger with a quadratic fit is evident here: the very good fit also comes
with the unlikely implication that the mean response turns back up before we reach the
highest observed temperature. (Quick reflection suggests that this is hard to explain: it is
reasonable for the soil to lose silicon as temperature rises, but then how could it regain the
silicon as the temperature starts to rise even higher?)
So, start again: since the simple linear model fails to account for curvilinearity in the data,
try a transformation. The logarithm is a natural choice:
U = log(Y); plot( U ~ t, pch=19 )
siliconLog.lm = lm( U~t )
plot( resid(siliconLog.lm)~t, pch=19 ); abline( h=0 )
Some improvement is shown in the residuals, but the curvilinearity may still be present and
there is now clear variance heterogeneity. So, try a quadratic linear predictor again, but now
against U = log(Y), and also apply weighted least squares (WLS) to account for the
heterogeneous variances. For the WLS fit, the per-temperature replication makes choice of
the weights easy: use reciprocals of the sample variances at each temperature.
s2 = by(data=silicon.df$conc, INDICES=factor(silicon.df$temp), FUN=var)
w = rep( 1/s2, each=4 )
siliconQLog.lm = lm( U~t+I(t^2), weight=w )
plot( resid(siliconQLog.lm)~t, pch=19 ); abline( h=0 )
We don’t see much change in the residual plot (of course, we don't expect to: the theory tells
us that the inverse-variance weighting will nonetheless adjust for any variance
heterogeneity). An overlay of the (back-transformed) fitted model to the data shows a much
more-sensible picture:
bQLog = coef( siliconQLog.lm )
plot( Y ~ t, pch=19, xlim=c(3,10), ylim=c(0,150) ); par( new=T )
curve( exp(bQLog[1] + x*(bQLog[2] + x*bQLog[3])), xlim=c(3,10),
ylim=c(0,150), ylab='', xlab='' )
So, proceed with this model, where E[log(Yi)] = β0 + β1ti + β2ti^2. The hypothesis of no
effect due to temperature is Ho: β1 = β2 = 0. (The alternative in Ha is "any difference".) Test
this via (output edited)
summary( siliconQLog.lm )
Call:
lm(formula = U ~ t + I(t^2), weights = w)
Coefficients:
              Estimate   Std. Error   t value   Pr(>|t|)
(Intercept)   6.010783   1.762351      3.411    0.00774
t            -0.342931   0.654799     -0.524    0.61312
I(t^2)       -0.008346   0.048662     -0.172    0.86761
Residual standard error: 0.08094 on 9 degrees of freedom
Multiple R-squared: 0.8554,
Adjusted R-squared: 0.8232
F-statistic: 26.61 on 2 and 9 DF, p-value: 0.0001664
The pertinent test statistic here is the β€œfull” F-statistic of Fcalc = 26.61 with (2,9) d.f. (given
at bottom of output). The corresp. P-value is P = 1.7×10–4, which is well below 0.05. Thus
we reject Ho and conclude that under this model, there is a significant effect due to
temperature on (log)silicon concentration.
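The WLS coefficients can also be cross-checked outside R. A Python sketch (data re-entered from the table; because the quadratic has three coefficients and there are only three distinct temperatures, the weighted fit interpolates the per-temperature means of log Y, so the coefficients are reproduced exactly regardless of the particular positive weights):

```python
import numpy as np

# Silicon data from the table: 4 replicates at each temperature
t = np.repeat([3.0, 5.0, 10.0], 4)
y = np.array([132.1, 136.7, 146.8, 126.1,
               73.2,  63.1,  52.8,  51.7,
                3.5,  17.7,   9.2,   1.9])
U = np.log(y)

# Weights: reciprocal sample variances at each temperature (as in the R fit)
s2 = np.array([y[i:i + 4].var(ddof=1) for i in (0, 4, 8)])
w = np.repeat(1.0 / s2, 4)

# Weighted quadratic fit of U on t (np.polyfit's w argument takes sqrt-weights)
b2, b1, b0 = np.polyfit(t, U, 2, w=np.sqrt(w))
print(round(b0, 4), round(b1, 4), round(b2, 5))
```

The printed coefficients match the R `summary()` output above.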
Notice, by the way, that besides Ξ²0, neither individual regression parameter is significant
based on its 1 d.f. partial t-test. This is due (not surprisingly) to the heavy multicollinearity
underlying this quadratic regression; the VIFs are both far above 10.0:
require( car )
vif( siliconQLog.lm )
t I(t^2)
110.312 110.312
Indeed, we should formally center the temperature variable before conducting the quadratic
fit. Notice, however, that the full F-statistic (and hence its P-value) does not change (output
edited):
tminustbar = scale( t, scale=F )
summary( lm(U~tminustbar+I(tminustbar^2), weight=w) )
Residual standard error: 0.08094 on 9 degrees of freedom
Multiple R-squared: 0.8554,
Adjusted R-squared: 0.8232
F-statistic: 26.61 on 2 and 9 DF, p-value: 0.0001664