Lines.

Mathematics 241
Lines
November 8
Goals for the day:
1. Words: model, fit, residuals, least-squares line
2. R: lm, residuals, fitted
3. Big idea: the least-squares line is one way to model the relationship between two quantitative variables
Warmup: by hand
Here is a plot of some (fictitious) data.
[Figure: scatterplot of four data points, y (1.0 to 3.0) against x (0.0 to 3.0).]
1. Draw a line that “fits” this data well. Don’t do any computing; just eyeball it.
2. We measure the fit by determining the residuals. Remember that ŷi is the predicted value of yi and that
ei = yi − ŷi. Compute the following for your line (estimate to the nearest tenth).

   i    xi    yi    ŷi    ei
   1    0     1
   2    1     2
   3    2     2
   4    3     3

3. Compute
   The largest residual (in absolute value)
   The sum of the absolute values of residuals
   The sum of the squares of the residuals
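The three measures in item 3 can be checked in R. This is only a minimal sketch: the intercept b0 and slope b1 below are a hypothetical eyeballed line chosen for illustration, not a recommended answer, so substitute the values for your own line.

```r
# Data from the table in item 2.
x <- c(0, 1, 2, 3)
y <- c(1, 2, 2, 3)

# Hypothetical eyeballed line (assumption): replace with your own values.
b0 <- 1
b1 <- 0.5

yhat <- b0 + b1 * x   # predicted values
e <- y - yhat         # residuals: observed minus predicted

max_abs <- max(abs(e))  # largest residual in absolute value
sae <- sum(abs(e))      # sum of the absolute values of the residuals
sse <- sum(e^2)         # sum of the squares of the residuals

print(c(max_abs = max_abs, SAE = sae, SSE = sse))
```

Changing b0 and b1 and rerunning shows how each measure responds to a different line.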
4. The following computes the line that minimizes the sum of the squares of the residuals. It then outputs
the equation of the line, the residuals, and the sum of the squares of the residuals. Compare the resulting line
to the one that you constructed.
> x=c(0,1,2,3)
> y=c(1,2,2,3)
> l=lm(y~x)
> l
> residuals(l)
> sum(residuals(l)^2)
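For simple linear regression, the coefficients that minimize the sum of squared residuals have a standard closed form: the slope is the sum of (xi − x̄)(yi − ȳ) divided by the sum of (xi − x̄)², and the intercept is ȳ minus the slope times x̄. The sketch below, which uses the warmup data and variable names chosen here for illustration, compares that formula with what lm reports.

```r
# Warmup data from the table.
x <- c(0, 1, 2, 3)
y <- c(1, 2, 2, 3)

# Closed-form least-squares slope and intercept.
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b0 <- mean(y) - b1 * mean(x)

# The same coefficients as computed by lm.
fit <- lm(y ~ x)
by_formula <- c(b0, b1)
by_lm <- unname(coef(fit))

print(rbind(by_formula, by_lm))
```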
Welds
Table 2.1 of Navidi records the nitrogen content (in percent) and yield strength (in ksi) for 28 welds. The data are
in the dataframe table2_1 of the Navidi package.
1. Draw a scatterplot of the observations. (Recall that you need to load the lattice package and use the xyplot
function.)
2. Does the relationship between these two variables look approximately linear?
3. To find the least-squares regression line we use the function lm (stands for linear model). Notice that the lm
function uses the same formula notation as xyplot. The function lm returns a complicated object that has a lot
of information stored in it.
> lweld=lm(Yield.Strength~Nitrogen.Content,data=table2_1)
Try the following and explain what each does.
> lweld
> residuals(lweld)
> fitted(lweld)
> coef(lweld)
4. Use the equation of the fitted line to predict the value of Yield.Strength when the value of Nitrogen.Content
is 0.04.
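Two common ways to get a prediction from a fitted line are sketched below. Because the Navidi package may not be installed everywhere, this sketch substitutes R's built-in cars data (stopping distance against speed); the model name lcars and the value speed = 15 are illustrative choices, not part of the weld exercise.

```r
# Built-in dataset standing in for the weld data (assumption for illustration).
lcars <- lm(dist ~ speed, data = cars)

# Option 1: plug the x value into the fitted equation by hand.
b <- coef(lcars)                      # b[1] is the intercept, b[2] the slope
by_hand <- unname(b[1] + b[2] * 15)

# Option 2: let predict() do the same calculation.
by_predict <- unname(predict(lcars, newdata = data.frame(speed = 15)))

print(c(by_hand = by_hand, by_predict = by_predict))
```

The same pattern should carry over to lweld, with Nitrogen.Content in place of speed.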
5. A way to plot the regression line on the xyplot is as follows:
> xyplot(Yield.Strength~Nitrogen.Content,data=table2_1,type=c('p','r'))
Here the argument type is a character vector that may contain several options. The option 'p' plots
points. The option 'r' plots the regression line. Many options are possible for type, including "p", "l",
"h", "b", "o", "s", "S", "r", "a", "g", and "smooth", and many can be used in combination. For example, the
plot on the front page also used "g". Try some of these to see what you get.
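As a small illustration of combining type options, the sketch below draws points, a regression line, and a background grid in one plot. It uses R's built-in cars data rather than the weld data, purely so that it runs without the Navidi package.

```r
library(lattice)

# Points ("p"), regression line ("r"), and reference grid ("g") together.
# The cars dataset is a stand-in chosen for illustration.
p <- xyplot(dist ~ speed, data = cars, type = c("p", "r", "g"))

# A lattice plot is an object; printing it draws it.
print(p)
```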
Investigating the least squares line
Open http://www.rossmanchance.com/applets/Reg/index.html in a web browser. There is a link on the
course webpage so that you don’t have to type the whole URL.
1. A set of points is presented. You can draw your own line to fit this data by selecting the blue box for “your line”
and moving the line with your cursor. Construct a line that you think fits the data well. Record the equation of
the line here: ________
The least squares line minimizes the sum of the squares of the residuals. Remember that the least squares line is
ŷ = β̂0 + β̂1 x
and thus the residuals are ei = yi − ŷi. Two different measures of fit of a line are
SAE = Σ |ei|    and    SSE = Σ (ei)²
2. For your line, you can compute both the sum of the absolute values of the residuals and the sum of the squares of
the residuals by clicking on both blue “show” boxes. Record the numbers SAE and SSE here:
SAE = ________    SSE = ________
3. Now compute the least squares line by clicking the regression line box. Record the true least squares line here.
4. Compute the SAE and SSE for the least squares line (click the appropriate boxes).
SAE = ________    SSE = ________
5. Try the above activity several times until you get pretty good at approximating the regression line. (Of course,
change the points each time.)
6. By selecting an individual point with the mouse, you may move the point around. Reset the points, add a
regression line, and choose a point to move. As you move the point up and down (keep the x value fixed), observe
the movement of the regression line. Moving certain points causes the regression line to change a lot, while moving
others causes the regression line to change only a small amount. For which points does a fixed change in y value
cause the greatest effect on the regression line? These are called influential points.
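The experiment in item 6 can also be tried numerically. The sketch below uses made-up, exactly linear data (not the applet's points) and a hypothetical helper, slope_after_shift, that raises one y value by a fixed amount and refits the line. Comparing the change in slope for different choices of point gives a rough sense of how much each point matters.

```r
# Made-up data lying exactly on a line (assumption for illustration).
x <- 1:9
y <- 2 + 0.5 * x

# Hypothetical helper: raise the k-th y value by `shift`, refit, return the slope.
slope_after_shift <- function(k, shift = 3) {
  y_new <- y
  y_new[k] <- y_new[k] + shift
  unname(coef(lm(y_new ~ x))[2])
}

base_slope <- unname(coef(lm(y ~ x))[2])

# Same vertical shift applied to a point in the middle and to a point at the end.
change_middle <- slope_after_shift(5) - base_slope
change_end <- slope_after_shift(9) - base_slope

print(c(middle = change_middle, end = change_end))
```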
Detecting non-linearity
Just because a line fits well, it doesn’t necessarily mean that a linear relationship is the right way to think about
the data.
The M241 dataframe mentrack has the distance and time for the current world records for track events (for men).
(a) Plot the data. Add the regression line to the plot. Notice that the points are very close to the line; that is,
the line fits the points well.
(b) Find the least squares line that predicts time from distance.
(c) It should be obvious from the data that a linear relationship is not appropriate. What would a linear
relationship imply about track records?
(d) It is also the case that one of the coefficients of the line you found says something strange about world record
times. What is that?
(e) An examination of the residuals often can tell us that a linear relationship is not appropriate. One of the
easiest ways to study the residuals is to plot them. If ltrack is the result of your linear model, try
> xyplot(residuals(ltrack)~fitted(ltrack))
What does this say about the relationship between distance and time?
The 10,000 meter record is an example of what kind of observation?
Do you have a conjecture about a family of nonlinear functions y = f (x) that might model these data better?
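To see what a residual plot looks like when the underlying relationship is curved, the sketch below fits a line to synthetic data generated from y = x². These numbers are not the track records and are not meant to suggest an answer to the question above; they only illustrate the kind of pattern to look for.

```r
library(lattice)

# Synthetic curved data (assumption: chosen only to illustrate the pattern).
x <- 1:10
y <- x^2

lcurve <- lm(y ~ x)
r <- residuals(lcurve)

# The residuals are positive at both ends and negative in the middle.
print(round(r, 1))

# Residuals against fitted values, as in part (e).
p <- xyplot(residuals(lcurve) ~ fitted(lcurve), type = c("p", "g"))
print(p)
```

A systematic pattern like this in the residuals, rather than a patternless scatter around zero, suggests that a straight line is missing some structure in the data.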