Statistics 2014, Fall 2001

Chapter 6 – Building Empirical Models
Robert Boyle (1627–1691) established the law that, for a given quantity of an (ideal) gas at a fixed temperature, the pressure is inversely proportional to the volume: $P = A/V$, where A is a parameter whose value is to be estimated from experimental measurements of pairs of values of P and V. By pouring mercury into the open top of the long side of a J-shaped tube, Boyle increased the pressure of the air trapped in the short leg. The volume of the trapped air would be V = h × (cross section of short leg), where h is the height of the air in the short leg, in centimeters. If y = height of mercury in the long leg, in centimeters, adjusted for atmospheric pressure, then we have $y = \beta_0 + \beta_1 x$, where x = 1/h. Because of experimental error, we write this model as $y = \beta_0 + \beta_1 x + \varepsilon$. In his experiment, Boyle took 25 measurements of y at 25 distinct values of x. The data are shown in the table below, followed by a scatterplot.
  h       y          h       y          h       y          h       y          h       y
 48   29.4375       38   37.0000      28   50.3125      21   67.0625      16   87.8750
 46   30.5625       36   39.3125      26   54.3125      20   70.6875      15   93.0625
 44   31.9375       34   41.6250      24   58.8125      19   74.1250      14  100.4375
 42   33.5000       32   44.1875      23   61.3125      18   77.8750      13  107.8125
 40   35.3125       30   47.0625      22   64.0625      17   82.7500      12  117.5625
[Scatterplot: Boyle's Gas Law; y plotted against the reciprocal of h, 1/h]
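As a minimal sketch, assuming Python with numpy and matplotlib available, the scatterplot can be reproduced from the tabled data:

```python
# Boyle's data: h (height of trapped air, cm) and y (adjusted mercury height, cm).
import numpy as np
import matplotlib.pyplot as plt

h = np.array([48, 46, 44, 42, 40, 38, 36, 34, 32, 30, 28, 26, 24, 23, 22,
              21, 20, 19, 18, 17, 16, 15, 14, 13, 12])
y = np.array([29.4375, 30.5625, 31.9375, 33.5000, 35.3125, 37.0000,
              39.3125, 41.6250, 44.1875, 47.0625, 50.3125, 54.3125,
              58.8125, 61.3125, 64.0625, 67.0625, 70.6875, 74.1250,
              77.8750, 82.7500, 87.8750, 93.0625, 100.4375, 107.8125,
              117.5625])

x = 1 / h                      # predictor: reciprocal of h
plt.scatter(x, y)
plt.title("Boyle's Gas Law")
plt.xlabel("Reciprocal of h")
plt.ylabel("y")
plt.show()
```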
The data points look collinear, since the random error term in this experiment is relatively small.
However, a plot of the errors (called residuals) against the predicted value of y, which is a linear
function of the reciprocal of h, shows a pattern. There seems to be a parabolic relationship between the
errors and the predicted value of y. There also seems to be increasing dispersion.
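One way to see this pattern is to plot the residuals against the fitted values. A minimal sketch continuing the Python code above (np.polyfit is used here as a convenient least-squares routine; the fitting method itself is developed later in this chapter):

```python
# Fit the line by least squares, then plot residuals e_i = y_i - yhat_i
# against the fitted values to look for patterns.
beta1, beta0 = np.polyfit(x, y, 1)    # slope, intercept
y_hat = beta0 + beta1 * x
residuals = y - y_hat

plt.scatter(y_hat, residuals)
plt.axhline(0, color="gray")
plt.xlabel("Predicted y")
plt.ylabel("Residual")
plt.show()
```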
The model developed by Boyle is an example of an empirical model, a model that relates variables
based on observed data. Later, in the 19th century, the model was extended and given a theoretical
basis with the development of the kinetic theory of gases. For an ideal gas, this theory makes the
following assumptions: 1) each gas molecule is a point particle, 2) collisions between molecules are
perfectly elastic, and 3) collisions between molecules are rare. With these assumptions and the laws
of mechanics, Boyle’s Law became a mechanistic model, a model that relates variables based on an
underlying theory.
The simplest type of empirical model is called a simple linear regression model, in which there is an
assumed linear relationship between two variables. One variable is called the independent variable, or
predictor variable. The other variable is called the dependent variable, or the response variable. It is
assumed that
a) for each value of the predictor variable, there is a distribution of associated values of the response
variable;
b) the response variable is random, while the predictor variable is fixed; and
c) the conditional mean of the response variable, given the value of the predictor variable, is a linear function of the predictor variable:
$E(Y \mid x) = \beta_0 + \beta_1 x$.
Simple Linear Regression Model
The response variable is assumed to be related to the predictor variable according to the following
equation:
$Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$, where
$Y_i$ = the value of the response variable for the ith member of the sample,
$\beta_0$ = a parameter, called the intercept of the line of best fit, or the regression line,
$\beta_1$ = a parameter, called the slope of the line of best fit, or the regression line,
$x_i$ = the value of the predictor variable for the ith member of the sample,
$\varepsilon_i$ = a random error variable associated with the ith member of the sample; it is assumed that the random errors are independent and identically distributed, with $\varepsilon_i \sim \mathrm{Normal}(0, \sigma^2)$.
A picture of the model is shown on p. 263.
Since the model assumes a linear relationship between the predictor variable and the response variable, before we proceed to use the model we must construct a scatterplot to check whether the assumption of linearity is reasonable.
Since the predictor variable values are considered fixed, we have
$E(Y_i \mid x_i) = E(\beta_0 + \beta_1 x_i + \varepsilon_i \mid x_i) = \beta_0 + \beta_1 x_i + E(\varepsilon_i) = \beta_0 + \beta_1 x_i$.
Also, we have
$V(Y_i \mid x_i) = V(\beta_0 + \beta_1 x_i + \varepsilon_i \mid x_i) = V(\beta_0 + \beta_1 x_i \mid x_i) + V(\varepsilon_i \mid x_i) = 0 + V(\varepsilon_i \mid x_i) = \sigma^2$.
We need to use sample data to estimate the three parameters $\beta_0$, $\beta_1$, and $\sigma$. The estimation will be done using the method of least squares. Given a sample of size n, the data consist of ordered pairs
$(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$.
We will find the best estimators of the slope and intercept by minimizing the residual sum of squares
(also called the error sum of squares):
$$\mathrm{SSE} = \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2 = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} \left( y_i - \beta_0 - \beta_1 x_i \right)^2,$$
with respect to the two parameters.
In doing this, we are simultaneously minimizing the squared vertical distances of the data points from
the line of best fit to the data. A concrete example is useful here.
Example: p. 285, Exercise 6-1.
A scatterplot of the data is shown below.
[Scatterplot: Density v. Thermal Conductivity; Thermal Conductivity on the horizontal axis, Density on the vertical axis]
Imagine constructing this scatterplot concretely as follows:
1) Draw the coordinate axes on a sheet of plywood.
2) Hammer nails into the board at each data point.
3) Obtain a thin wooden dowel and six rubber bands.
4) Place each rubber band around the dowel and one of the nails.
5) Wait until the dowel comes to rest.
The rest position of the dowel will be the minimum energy configuration of the system, the
configuration for which there will be the least total stretching of the rubber bands. This position will
also be the least squares regression line relating thermal conductivity and density.
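The same idea can be expressed numerically: SSE is a function of the candidate intercept and slope, and the least squares line is the pair that minimizes it. A minimal sketch, reusing the Boyle data from earlier (the Exercise 6-1 data are not reproduced here):

```python
# SSE as a function of a candidate intercept b0 and slope b1, evaluated
# on the Boyle data arrays x and y defined earlier.
def sse(b0, b1, x, y):
    residuals = y - (b0 + b1 * x)
    return np.sum(residuals ** 2)

slope, intercept = np.polyfit(x, y, 1)   # least-squares fit
print(sse(intercept, slope, x, y))       # smallest attainable SSE
print(sse(intercept + 5, slope, x, y))   # any other line does worse
```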
We differentiate SSE w.r.t. each parameter, and set each derivative equal to 0, obtaining
$$\frac{\partial\,\mathrm{SSE}}{\partial \beta_0} = -2 \sum_{i=1}^{n} \left( y_i - \beta_0 - \beta_1 x_i \right) = 0, \quad \text{and} \quad \frac{\partial\,\mathrm{SSE}}{\partial \beta_1} = -2 \sum_{i=1}^{n} \left( y_i - \beta_0 - \beta_1 x_i \right) x_i = 0.$$
This gives us two equations in two unknowns, called the normal equations:
$$n \hat{\beta}_0 + \hat{\beta}_1 \sum_{i=1}^{n} x_i = \sum_{i=1}^{n} y_i, \quad \text{and} \quad \hat{\beta}_0 \sum_{i=1}^{n} x_i + \hat{\beta}_1 \sum_{i=1}^{n} x_i^2 = \sum_{i=1}^{n} x_i y_i.$$
The solution is
$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} = \frac{SS_{xy}}{SS_{xx}} = \frac{\sum_{i=1}^{n} x_i y_i - \frac{1}{n} \left( \sum_{i=1}^{n} x_i \right) \left( \sum_{i=1}^{n} y_i \right)}{\sum_{i=1}^{n} x_i^2 - \frac{1}{n} \left( \sum_{i=1}^{n} x_i \right)^2},$$
$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}.$$
Then the estimated regression line, or line of best fit to the data, is given by:
$\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 x$.
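A minimal sketch computing these closed-form estimates for the Boyle data (x = 1/h as before):

```python
# Closed-form least-squares estimates, using SS_xy and SS_xx as defined
# in the text; x and y are the Boyle data arrays from earlier.
x_bar, y_bar = x.mean(), y.mean()
SS_xy = np.sum((x - x_bar) * (y - y_bar))
SS_xx = np.sum((x - x_bar) ** 2)

beta1_hat = SS_xy / SS_xx
beta0_hat = y_bar - beta1_hat * x_bar
y_fit = beta0_hat + beta1_hat * x        # fitted values on the line
```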
The estimate of the error variance is found from the error sum of squares to be $\hat{\sigma}^2 = \dfrac{\mathrm{SSE}}{n - 2}$. There are only n − 2 degrees of freedom associated with the error sum of squares because two parameters, the slope and the intercept, have already been estimated.
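Continuing the sketch above, the variance estimate follows directly:

```python
# Error-variance estimate: SSE divided by its n - 2 degrees of freedom.
n = len(x)
SSE = np.sum((y - y_fit) ** 2)
sigma2_hat = SSE / (n - 2)
print(sigma2_hat)
```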