a) Read the data file bias_0_data.txt into R (it's on the course website), perform regression to predict y from x, make the
scatterplot of the predictions versus the observed y values, and overlay a diagonal line (y-intercept=0, slope=1) on it. BUT,
because we want the diagonal line to actually appear as diagonal, make sure the range of x and y values shown in the scatterplot is
the same; in fact, set that range to (-6,6) for both x and y values. If you don't know how, check the prelabs, looking for xlim and
ylim.
dat = read.table("bias_0_data.txt",header=T)
attach(dat)
lm.1 = lm(y~x)
plot(y, predict(lm.1), xlim=c(-6,6), ylim=c(-6,6))
abline(0,1)
b) Now, read in the data file bias_1_data.txt, perform regression, and *overlay* on the previous plot (in part a) the scatterplot of
predictions versus the observed y values. Make these points red. If done correctly, you will see that the predictions are now all
positively biased (i.e., consistently shifted up).
dat = read.table("bias_1_data.txt",header=T)
attach(dat)
lm.2 = lm(y~x)
points(y, predict(lm.2), col=2)   # xlim/ylim are inherited from the existing plot, so they are not needed here
c) The scatterplot in part a looks good in that it does not suggest any problems with the model. However, as discussed in class,
the scatterplot in part b suggests a positive bias (the predictions are consistently higher than the observed values). Why? Hint:
There is something about the data that is causing this bias. What is it?
When you see this type of behavior, return to the data and *look* at it! In this case we have only one x and one y, and so it's easy
to look at their scatterplot:
plot(x,y)
We can now see that the outlier is causing the fit to shift up. This is an artificial example, but the bias seen in the scatterplot in
part b does arise in practice, although for other reasons (for example, a weird or skewed histogram for x and/or y can lead to the same
behavior).
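One optional way to confirm the diagnosis is to refit after dropping the outlier and redraw the plot from part a. This is just a sketch (out and lm.3 are illustrative names, and it assumes the bias_1 data are still attached):
out = which.max(y)               # index of the extreme point
lm.3 = lm(y[-out] ~ x[-out])     # refit without the outlier
plot(y[-out], predict(lm.3), xlim=c(-6,6), ylim=c(-6,6))
abline(0,1)                      # the upward shift should now be gone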
Confession: This is how I made the above data:
set.seed(1)
n = 50
x = runif(n,-1,1)
y = 2 + 3*x + rnorm(n,0,1)
dat = cbind(x,y)
write(t(dat),file="bias_0_data.txt",ncol=2)
x[n+1] = 0     # Add an outlier in a special place.
y[n+1] = 100   # Think of the experiments we did when we were trying
               # to see how outliers affect r.
dat = cbind(x,y)
write(t(dat),file="bias_1_data.txt",ncol=2)
a) Read the data file transform_data.txt from the course website into R, and make a scatterplot of y versus x. Clearly, the relationship
is nonlinear and monotonic. I can tell you that a good transformation that linearizes the relationship is to take the sqrt of both x
and y. Make a scatterplot of the transformed data.
dat = read.table("transform_data.txt", header=T)
attach(dat)
plot(x,y)
# nonlinear, but monotonic.
plot(sqrt(x), sqrt(y)) # linearizes the relationship.
b) Perform regression on the transformed data, and overlay the regression line on the scatterplot of the transformed data in part a).
lm.1 = lm( sqrt(y) ~ sqrt(x))
abline(lm.1)
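Optionally, look at the fitted coefficients; given how the data were generated (see the "confession" further down), they should come out roughly 2 and 3:
coef(lm.1)   # intercept near 2, slope near 3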
c) Fit a regression model of the form y = alpha + beta_1 sqrt(x) + beta_2 x to the original (untransformed data).
lm.2 = lm(y ~ sqrt(x) + x)
d) In a clicker question I claimed that these two models are essentially equivalent. To check that, let's see if they make similar
predictions. Make a scatterplot to compare their predictions. Just keep in mind that the second model predicts y, but the first
model predicts sqrt(y).
yhat1 = predict(lm.1)   # predictions of sqrt(y)
yhat2 = predict(lm.2)   # predictions of y itself
plot(yhat1^2, yhat2)
abline(0,1)             # nearly the same.
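A quick numerical check of the agreement is also possible (optional):
summary(yhat1^2 - yhat2)   # differences should all be small
cor(yhat1^2, yhat2)        # correlation should be very close to 1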
# Confession: this is how I made the data:
set.seed(11)
n = 100
x = runif(n,0,0.1)
y = 4 + 12*sqrt(x) + 9*x + rnorm(n,0,.1) # i.e., sqrt(y) = 2 + 3 sqrt(x)
dat = cbind(x,y)
write(t(dat),file="transform_data.txt",ncol=2)
a) Read the data file sin_data.txt from the course website, and make a scatterplot of y versus x.
dat = read.table("sin_data.txt", header=T)
attach(dat)
plot(x,y)
b) The y values could be hourly temperature data at 100 different hours. In periodic situations like this the source of the periodic
behavior is often known; for example, the 24-hour daily cycle. In fact, if you look carefully, you will see a 24-hr period (i.e., the x
distance from one peak to a neighboring peak). To confirm this, superimpose on the scatterplot in part a) a sine function with
a period of 24, and an amplitude of 1, plotted at all integer x values from 1 to 100. Hint: the equation of the sine function is y =
sin( 2 pi x / period ). Don't worry if the sine function does not go "through" the data.
x = c(1:100)                 # integer x values, 1 to 100 (same as the x in the data)
y_of_sine = sin(2*pi*x/24)   # sine with period 24 and amplitude 1
lines(x, y_of_sine)
c) Take the difference between the y values of the data and the y values of the sine function; it doesn't matter which minus which.
Then, make a scatterplot of the difference versus x.
y_diff = y - y_of_sine
plot(x,y_diff)
d) Now you are ready to plot a line through the previous scatterplot, because if you've done things correctly, the periodic behavior
will have disappeared by now. Find the equation of the OLS line, and overlay it on the previous scatterplot in part c.
lm.1 = lm(y_diff ~ x)
abline(lm.1)
e) Report the R^2 and the s_e, and interpret both.
summary(lm.1)
# R^2 = 0.9766 , s_e = 0.4573
About 98% of the variability in the *difference* (data y minus sine y) can be attributed to x (time). The typical error (i.e., deviation) of that
difference about the line is about 0.46.
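These two numbers can also be pulled directly out of the summary object; the component names below are standard in R:
summary(lm.1)$r.squared   # R^2
summary(lm.1)$sigma       # s_e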
Confession: This is how I made the data
set.seed(123)
n = 100
x = c(1:n)
y = -5 + 0.1*x + sin(2*pi*x/24)   # linear + period = 24
y = y + rnorm(n,0, 0.5)           # plus noise.
dat = cbind(x,y)
write(t(dat),file="sin_data.txt",ncol=2)
An experiment carried out to study the effect of the mole contents of cobalt (x1) and the calcination temperature (x2) on the surface area
of an iron-cobalt hydroxide catalyst (y) resulted in the following data ("Structural Changes and Surface Properties of CoxFe3-xO4
Spinels," J. of Chemical Tech. and Biotech., 1994: 161-170):
x1: .6 .6 .6 .6 .6 1.0 1.0 1.0 1.0 1.0 2.6 2.6 2.6 2.6
x2: 200 250 400 500 600 200 250 400 500 600 200 250 400 500
y: 90.6 82.7 58.7 43.2 25.0 127.1 112.3 19.6 17.8 9.1 53.1 52.0 43.4 42.4
x1: 2.6 2.8 2.8 2.8 2.8 2.8
x2: 600 200 250 400 500 600
y: 31.6 40.9 37.9 27.5 27.3 19.0
A request to the SAS package to fit a+b1*x1+b2*x2+b3*x3, where x3 = x1*x2 (an interaction predictor), yielded the following output:
Dependent Variable: SURFAREA

Analysis of Variance
Source   DF   Sum of Squares   Mean Square   F Value   Prob>F
Model     3     15223.52829     5074.50943    18.924   0.0001
Error    16      4290.53971      268.15873
Total    19     19514.06800

Root MSE   16.37555    R-square   0.7801
Dep Mean   48.06000    Adj R-sq   0.7389
C.V.       34.07314

Parameter Estimates
Variable   DF   Parameter Estimate   Standard Error   T for H0: Parameter=0   Prob > |T|
INTERCEP    1          185.485740      21.19747682                    8.750       0.0001
COBCON      1          -45.969466      10.61201173                   -4.332       0.0005
TEMP        1           -0.301503       0.05074421                   -5.942       0.0001
CONTEMP     1            0.088801       0.02540388                    3.496       0.0030
a) Interpret the value of the coefficient of multiple determination (R^2).
R^2 = 0.78. I.e., 78% of the variability in y can be explained by (or attributed to) the linear relation y = a + b1 x1 + b2 x2 + b3 x3.
b) Predict the value of surface area when cobalt content is 2.6 and temperature is 250, and calculate the value of the corresponding
residual.
y_hat = 185.49 - 45.97*2.6 - 0.3*250 + .089*2.6*250 = 48.8. The observed y at x1 = 2.6, x2 = 250 is 52.0, so the residual is y - y_hat = 52.0 - 48.8 = 3.2.
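The same arithmetic can be done in R with the reported (rounded) coefficients; y_hat below is just an illustrative name:
y_hat = 185.49 - 45.97*2.6 - 0.3*250 + 0.089*2.6*250   # about 48.8
52.0 - y_hat                                           # residual, about 3.2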
c) Since b1 is about -46.0, is it legitimate to conclude that if cobalt content increases by 1 unit while the values of the other predictors
remain fixed, surface area can be expected to decrease by roughly 46 units? Explain your reasoning. Hint: think about collinearity and
interaction.
Were it not for the interaction term, the answer would be "yes." But because b3 is not zero, the interaction term is present, and so the
coefficients canNOT be interpreted in this way: the effect of a 1-unit change in x1 depends on the value of x2.
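To make this concrete, under this model the change in y per unit change in x1 is b1 + b3*x2, so it depends on the temperature. For example, using the printout values:
-45.969466 + 0.088801*200   # about -28.2 at temperature 200
-45.969466 + 0.088801*600   # about  +7.3 at temperature 600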
d) What is the typical error about the regression surface? First, find this quantity in the printout, and then reproduce it using the value of
SSE given in the printout.
The question is asking for s_e. Most standard computer printouts report s_e near R^2, although not necessarily under that name; you just have
to get used to these things. In this printout it is the Root MSE = 16.37555. Reproducing it from the SSE: s_e = sqrt( SSE/(n-(k+1)) ) = sqrt( 4290.5/(20 - (3+1)) ) = 16.4
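The same calculation in R:
sqrt(4290.53971/(20 - (3+1)))   # 16.4, matching the Root MSE in the printout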
e) Assess collinearity! By computer. For this question, you will have to enter the data into R.
x1 = c(.6,.6,.6,.6,.6,1.0,1.0,1.0,1.0,1.0,2.6,2.6,2.6,2.6,2.6,2.8,2.8,2.8,2.8,2.8)
x2 = c(200,250,400,500,600,200,250,400,500,600,200,250,400,500,600,200,250,400,500,600)
y = c(90.6,82.7,58.7,43.2,25.0,127.1,112.3,19.6,17.8,9.1,53.1,52.0,43.4,42.4,31.6,40.9,37.9,27.5,27.3,19.0)
x3 = x1*x2
# Even though the data are provided through x1 and x2, the regression equation is y = a + b1 x1 + b2 x2 + b3 x3. So, one might treat x3 as
# another predictor. In that case, the scatterplots and the correlations that should be checked are the following:
par(mfrow=c(2,2))
plot(x1,x2)
plot(x1,x3)
plot(x2,x3)
cor(x1,x2)   # 0
cor(x1,x3)   # 0.78241
cor(x2,x3)   # 0.5456007
# x1 and x2 are not at all collinear. But x1 and x3, and x2 and x3, do show some collinearity. And their correlation coefficients suggest that too.
# The implication of this collinearity is that the regression coefficients are going to have large variability (uncertainty), and so their interpretation
# (as rate of ....) is going to be unreliable.
# Now, on a slightly technical side: We already know that the coefficients are not going to be interpretable *because* of the interaction term.
# But, if the model did not have an interaction term (i.e., x3), then you would look only at the collinearity between x1 and x2. For this data set,
# given that x1 and x2 are not correlated, we would be able to interpret the coefficients.
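For a numerical summary of the same idea, one can compute variance inflation factors. This is optional and not required by the problem; it assumes the add-on package "car" is installed:
# install.packages("car",dep=T)   # only needed once
library(car)
vif(lm(y ~ x1 + x2 + x3))   # large values (say, above 5 or 10) flag collinear predictors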
The article "The Undrained Strength of Some Thawed Permafrost Soils" (Canadian Geotech. J., 1979: 420-427) contained the accompanying
data on y = shear strength of sandy soil (kPa), x1 = depth (m), and x2 = water content (%).
Obs Depth Content Strength
1 8.9 31.5 14.7
2 36.6 27.0 48.0
3 36.8 25.9 25.6
4 6.1 39.1 10.0
5 6.9 39.2 16.0
6 6.9 38.3 16.8
7 7.3 33.9 20.7
8 8.4 33.8 38.8
9 6.5 27.9 16.9
10 8.0 33.1 27.0
11 4.5 26.3 16.0
12 9.9 37.0 24.9
13 2.9 34.6 7.3
14 2.0 36.4 12.8
a) Perform regression to predict y from x1, x2, x3 = x1^2, x4 = x2^2, and x5 = x1*x2; and write down the coefficients of the various terms.
# From the statement of the problem, I copied/pasted the data into an ascii text file called hw_3_37_dat.txt. I did include a line of header.
Then,
dat = read.table("hw_3_37_dat.txt", header=T)
x1 = dat[,2]
x2 = dat[,3]
y = dat[,4]
lm.1 = lm(y ~ x1 + x2 + I(x1^2) + I(x2^2) + I(x1*x2))
lm.1
# (Intercept)          x1          x2     I(x1^2)     I(x2^2)  I(x1 * x2)
#  -140.22976   -16.47521    12.82710     0.09555    -0.24339     0.49864
b) Can you interpret the regression coefficients? Explain.
# The regression coefficients cannot be interpreted (as the rate of ...) because of the interaction term. If the interaction term were not present,
we would still have to check for collinearity before we could attempt to interpret the coefficients.
c) Compute R^2 and explain what it says about goodness-of-fit ("in English").
summary(lm.1)
# Multiple R-squared: 0.7561 This means that about 76% of the variability in Strength can be attributed to (or explained by) Depth and
Water Content, through the expression given above in lm().
d) Compute s_e, and interpret ("in English").
# summary(lm.1) also contains se: 7.023. This means that the typical error (or deviation) of the data about the fit is about 7 kPa
e) Produce the residual plot (residuals vs. *predicted* y), and explain what it suggests, if anything.
# Two ways of getting residual plots, that you learned in lab.
# first way:
plot(lm.1)
# hit return after this. The first fig is the residual plot.
# second way:
png("3_37_residual.png")
y_hat = predict(lm.1)
plot(y_hat, lm.1$residuals)
dev.off()
# The residual plot is a random scatter of dots about the y=0 line, and so suggests that the fit is good.
f) Now perform regression to predict y from x1 and x2 only.
lm.2 = lm(y ~ x1 + x2 )
lm.2
# (Intercept)        x1        x2
#     14.8893    0.6607   -0.0284
g) Compute R^2 and explain what it says about goodness-of-fit.
summary(lm.2)
# Multiple R-squared: 0.447; about 45% of the variance in Strength can be attributed to (or explained by) Depth and Water Content
through the linear expression in lm().
h) Compare the above two R^2 values. Does the comparison suggest that at least one of the higher-order terms in the regression eqn
provides useful information about strength?
# Given that the more complex model has a much larger R2 (76% compared to 45%) it's reasonable to conclude that at least one of the
higher-order terms provides useful information about Strength.
i) Compute s_e for the model in part f, and compare it to that in part d. What do you conclude?
# According to summary(lm.2), the value of s_e is 9.019. So, the more complex model has both a higher R^2 and a smaller typical
error (7.02 compared to 9.02).
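If you want to formalize the comparison in parts h) and i), a partial F-test on the two nested models does it; this is optional and goes beyond what the problem asks:
anova(lm.2, lm.1)   # a small p-value says the higher-order terms add useful information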
For each of the data sets a) hw_3_dat1.txt and b) hw_3_dat2.txt, find the "best" fit, and report
R-squared and the standard deviation of the errors. Do not use some ad hoc criterion to determine
what is the "best" fit. Instead, use your knowledge of regression to find the best fit, and
explain in words why you think you have the best fit.
# a)
dat = read.table("hw_3_dat1.txt",header=T)
x1 = dat[,1] ; x2 = dat[,2] ; y = dat[,3]
plot(dat)
# There is no collinearity, but there is clearly a
# signature of interaction (see lecture 13).
lm.1 = lm(y~x1+x2 + I(x1*x2))
summary(lm.1) # R2 = 0.94, s_e = 2.0
# Note that without the interaction term,
# the fit is much worse:
lm.1 = lm(y~x1+x2)
summary(lm.1) # R2 = 0.23, s_e = 6.9
# b)
dat = read.table("hw_3_dat2.txt",header=T)
plot(dat)
# There is significant collinearity, and so one
# of the two predictors should be ignored.
x1 = dat[,1] ; x2 = dat[,2] ; y = dat[,3]
plot(x1,y)
# There is still a quadratic behavior, though:
lm.2 = lm(y~x1 + I(x1^2))
summary(lm.2) # R2 = 0.87, s_e = 3.8
# Note that without the quadratic term, the fit is much worse:
lm.0 = lm(y~x1)
summary(lm.0) # R2 = 0.25, s_e = 9.0
lm.00 = lm(y~ x1 + x2)
summary(lm.00) # R2 = 0.27, s_e = 8.9
Generate data on x1, x2, and y, such that
1) n (= sample size) = 100,
2) x1 and x2 are uncorrelated, and from a uniform distribution between 0 and 1,
a) Let y be given by y = 2 + 3 x1 + 4 x2 + error, where error is from a
normal distribution with mean = 0 and sigma = 0.5. Fit the model
y = alpha + beta1 x1 + beta2 x2 to the above data, and report R^2 and s_e.
b) Let y be given by y = 2 + 3 x1 + 4 x2 + 50 (x1 x2) + error, where error is
from a normal distribution with mean = 0 and sigma = 0.5. Fit the model
y = alpha + beta1 x1 + beta2 x2 to the above data, and report R^2 and s_e.
c) Fit the model y = alpha + beta1 x1 + beta2 x2 + beta3 (x1 x2) to the data
from part b, and report R^2 and s_e.
d) Install the R package called "rgl" on your computer, by typing
install.packages("rgl",dep=T), and following the instructions. If you
have trouble with this, ask the TAs or me during office hours.
Then, at the R prompt, type
library(rgl)
followed by
plot3d(x1,x2,y)
The panel you will see is interactive. By holding down the left-button, and moving the mouse around, you will be able to "turn" the
figure around in different ways. Have some fun with it, THEN based on what you see, provide an explanation for why the quality (in
terms of R2 and/or se) of the fit in part c is better than that in part b.
# Soln
set.seed(123)
n = 100
x1 = runif(n,0,1)
x2 = runif(n,0,1)
error = rnorm(n,0,0.5)
# a)
y = 2 + 3*x1 + 4*x2 + error
summary(lm(y ~ x1 + x2))
# R2 = 0.8792, se = 0.4882
# b)
y = 2 + 3*x1 + 4*x2 + 50*x1*x2 + error
summary(lm(y ~ x1 + x2))
# R2 = 0.9118, se = 3.549
# c)
summary(lm(y ~ x1 + x2 + I(x1*x2))) # R2 = 0.9983, se = 0.4897
# d)
# install.packages("rgl",dep=T)   # follow the instructions during installation.
library(rgl)
plot3d(x1,x2,y)
# Clearly, the data don't look planar. They are "warped" in the
# shape of a saddle surface. For this reason, a model with an
# interaction term will provide a better fit to the data.
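If rgl cannot be installed, a static base-R picture of the fitted surface from part c shows the same saddle shape. This is just a sketch; lm.c, g, and z are illustrative names:
lm.c = lm(y ~ x1 + x2 + I(x1*x2))
g = seq(0, 1, length=30)
z = outer(g, g, function(a,b) predict(lm.c, data.frame(x1=a, x2=b)))
persp(g, g, z, xlab="x1", ylab="x2", zlab="predicted y", theta=40, phi=25)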
# Simulate the sampling distribution of the maximum and the minimum of a
# sample of 50 standard normals, over 5000 trials, and look at their histograms:
ntrial = 5000
xmax = xmin = numeric(ntrial)
for(trial in 1:ntrial){
  x = rnorm(50,0,1)
  xmax[trial] = max(x)
  xmin[trial] = min(x)
}
hist(xmax)
hist(xmin)
# Simulate the sampling distribution of the mean of 100 observations from an
# exponential distribution with rate 2, over 5000 trials, and check normality
# with a normal QQ plot (nearly a straight line, as expected from the CLT):
ntrial = 5000
xbar = numeric(ntrial)
for(trial in 1:ntrial){
  x = rexp(100,2)
  xbar[trial] = mean(x)
}
qqnorm(xbar)