Chapter 6 – Building Empirical Models

Robert Boyle (1627 – 1691) established the law that, for a given quantity of an ideal gas at a fixed temperature, the pressure is inversely proportional to the volume: P = A/V, where A is a parameter whose value is to be estimated from experimental measurements of pairs of values of P and V. By pouring mercury into the open top of the long side of a J-shaped tube, Boyle increased the pressure of the air trapped in the short leg. The volume of the trapped air is V = h × (cross-sectional area of the short leg), where h is the height of the air column in the short leg, in centimeters. If y = height of mercury in the long leg, in centimeters, adjusted for atmospheric pressure, then we have $y = \beta_0 + \beta_1 x$, where $x = 1/h$. Because of experimental error, we write this model as $y = \beta_0 + \beta_1 x + \varepsilon$.

In his experiment, Boyle took 25 measurements of y, one at each of 25 distinct values of x. The data are shown in the table below, followed by a scatterplot.

 h     y         h     y         h     y         h     y         h     y
48   29.4375    38   37.0000    28   50.3125    21   67.0625    16   87.8750
46   30.5625    36   39.3125    26   54.3125    20   70.6875    15   93.0625
44   31.9375    34   41.6250    24   58.8125    19   74.1250    14  100.4375
42   33.5000    32   44.1875    23   61.3125    18   77.8750    13  107.8125
40   35.3125    30   47.0625    22   64.0625    17   82.7500    12  117.5625

[Figure: "Boyle's Gas Law", a scatterplot of y against the reciprocal of h.]

The data points look collinear, since the random error in this experiment is relatively small. However, a plot of the residuals (the estimated errors) against the predicted values of y, which are a linear function of the reciprocal of h, shows a pattern. There seems to be a parabolic relationship between the residuals and the predicted values of y, and the dispersion of the residuals appears to increase with the predicted value.

The model developed by Boyle is an example of an empirical model, a model that relates variables based on observed data. Later, in the 19th century, the model was extended and given a theoretical basis with the development of the kinetic theory of gases. For an ideal gas, this theory makes the following assumptions: 1) each gas molecule is a point particle, 2) collisions between molecules are perfectly elastic, and 3) collisions between molecules are rare. With these assumptions and the laws of mechanics, Boyle's Law became a mechanistic model, a model that relates variables based on an underlying theory.

The simplest type of empirical model is called a simple linear regression model, in which there is an assumed linear relationship between two variables. One variable is called the independent variable, or predictor variable. The other variable is called the dependent variable, or response variable. It is assumed that a) for each value of the predictor variable, there is a distribution of associated values of the response variable; b) the response variable is random, while the predictor variable is fixed; and c) the conditional mean of the response variable, given the value of the predictor variable, is a linear function of the predictor variable: $E(Y \mid x) = \beta_0 + \beta_1 x$.
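As a concrete illustration of fitting such a model and examining residuals, here is a minimal sketch (not part of the original notes; it assumes Python with NumPy and Matplotlib, and the variable names are ours) that fits $y = \beta_0 + \beta_1 (1/h)$ to Boyle's 25 measurements and plots the residuals against the fitted values, the diagnostic described above.

```python
# A minimal sketch (not from the notes): fit y = b0 + b1*(1/h) to Boyle's
# 25 measurements and plot residuals against fitted values to examine
# the pattern described in the text. Requires numpy and matplotlib.
import numpy as np
import matplotlib.pyplot as plt

h = np.array([48, 46, 44, 42, 40, 38, 36, 34, 32, 30, 28, 26, 24,
              23, 22, 21, 20, 19, 18, 17, 16, 15, 14, 13, 12], dtype=float)
y = np.array([29.4375, 30.5625, 31.9375, 33.5000, 35.3125,
              37.0000, 39.3125, 41.6250, 44.1875, 47.0625,
              50.3125, 54.3125, 58.8125, 61.3125, 64.0625,
              67.0625, 70.6875, 74.1250, 77.8750, 82.7500,
              87.8750, 93.0625, 100.4375, 107.8125, 117.5625])

x = 1.0 / h                    # predictor: reciprocal of air-column height
b1, b0 = np.polyfit(x, y, 1)   # least squares slope and intercept
y_hat = b0 + b1 * x            # fitted (predicted) values
residuals = y - y_hat          # estimated errors

print(f"intercept b0 = {b0:.4f}, slope b1 = {b1:.4f}")

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.scatter(x, y)
ax1.plot(x, y_hat, color="red")
ax1.set_xlabel("x = 1/h"); ax1.set_ylabel("y"); ax1.set_title("Boyle's Gas Law")
ax2.scatter(y_hat, residuals)
ax2.axhline(0, color="gray", linestyle="--")
ax2.set_xlabel("fitted value"); ax2.set_ylabel("residual")
ax2.set_title("Residuals vs. fitted")
plt.tight_layout(); plt.show()
```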
Simple Linear Regression Model

The response variable is assumed to be related to the predictor variable according to the following equation:

$$Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i,$$

where
$Y_i$ = the value of the response variable for the ith member of the sample,
$\beta_0$ = a parameter, called the intercept of the line of best fit, or regression line,
$\beta_1$ = a parameter, called the slope of the line of best fit, or regression line,
$x_i$ = the value of the predictor variable for the ith member of the sample,
$\varepsilon_i$ = a random error variable associated with the ith member of the sample; it is assumed that the random errors are independent and identically distributed, with $\varepsilon_i \sim \text{Normal}(0, \sigma^2)$.

A picture of the model is shown on p. 263. Since it is assumed that a linear trend relationship exists between the predictor variable and the response variable, before we proceed to use the model, we must draw a scatterplot to see whether the assumption of linearity is reasonable.

Since the predictor variable values are considered fixed, we have

$$E(Y_i \mid x_i) = E(\beta_0 + \beta_1 x_i + \varepsilon_i \mid x_i) = \beta_0 + \beta_1 x_i + E(\varepsilon_i) = \beta_0 + \beta_1 x_i.$$

Also, we have

$$V(Y_i \mid x_i) = V(\beta_0 + \beta_1 x_i + \varepsilon_i \mid x_i) = 0 + V(\varepsilon_i \mid x_i) = \sigma^2.$$

We need to use sample data to estimate the three parameters $\beta_0$, $\beta_1$, and $\sigma$. The estimation will be done using the method of least squares. Given a sample of size $n$, the data consist of ordered pairs $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$.

We will find the best estimators of the slope and intercept by minimizing the residual sum of squares (also called the error sum of squares),

$$SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2,$$

with respect to the two parameters. In doing this, we are simultaneously minimizing the squared vertical distances of the data points from the line of best fit to the data. A concrete example is useful here.

Example: p. 285, Exercise 6-1. A scatterplot of the data is shown below.

[Figure: scatterplot of Density v. Thermal Conductivity.]

Imagine constructing this scatterplot concretely as follows:

1) Draw the coordinate axes on a sheet of plywood.
2) Hammer nails into the board at each data point.
3) Obtain a thin wooden dowel and six rubber bands.
4) Place each rubber band around the dowel and one of the nails.
5) Wait until the dowel comes to rest.

The rest position of the dowel will be the minimum-energy configuration of the system, the configuration for which there is the least total stretching of the rubber bands. This position will also be the least squares regression line relating thermal conductivity and density.

We differentiate SSE with respect to each parameter and set each derivative equal to zero, obtaining

$$\frac{\partial SSE}{\partial \beta_0} = -2 \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i) = 0, \quad \text{and} \quad \frac{\partial SSE}{\partial \beta_1} = -2 \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)\, x_i = 0.$$

This gives us two equations in two unknowns, called the normal equations:

$$n\hat{\beta}_0 + \hat{\beta}_1 \sum_{i=1}^{n} x_i = \sum_{i=1}^{n} y_i, \quad \text{and} \quad \hat{\beta}_0 \sum_{i=1}^{n} x_i + \hat{\beta}_1 \sum_{i=1}^{n} x_i^2 = \sum_{i=1}^{n} x_i y_i.$$

The solution is

$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} = \frac{SS_{xy}}{SS_{xx}} = \frac{\sum_{i=1}^{n} x_i y_i - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}{\sum_{i=1}^{n} x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}.$$

Then the estimated regression line, or line of best fit to the data, is given by

$$\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 x.$$

The estimate of the error variance is found from the error sum of squares to be

$$\hat{\sigma}^2 = \frac{SSE}{n - 2}.$$

There are only $n - 2$ degrees of freedom associated with the error sum of squares because two parameters, the slope and the intercept, have already been estimated.
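The formulas above translate directly into code. The following is a minimal sketch (not from the text; the function name simple_linear_regression and the example data are ours) that computes $\hat{\beta}_1 = SS_{xy}/SS_{xx}$, $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$, and $\hat{\sigma}^2 = SSE/(n-2)$, assuming Python with NumPy.

```python
# A minimal sketch (not from the text) implementing the least squares
# formulas above: b1_hat = SS_xy / SS_xx, b0_hat = y_bar - b1_hat * x_bar,
# and sigma2_hat = SSE / (n - 2). Requires numpy.
import numpy as np

def simple_linear_regression(x, y):
    """Least squares fit of y = b0 + b1*x; returns (b0_hat, b1_hat, sigma2_hat)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    x_bar, y_bar = x.mean(), y.mean()
    ss_xy = np.sum((x - x_bar) * (y - y_bar))       # SS_xy
    ss_xx = np.sum((x - x_bar) ** 2)                # SS_xx
    b1_hat = ss_xy / ss_xx                          # slope estimate
    b0_hat = y_bar - b1_hat * x_bar                 # intercept estimate
    sse = np.sum((y - (b0_hat + b1_hat * x)) ** 2)  # residual sum of squares
    sigma2_hat = sse / (n - 2)                      # error variance, n - 2 df
    return b0_hat, b1_hat, sigma2_hat

# Example usage with made-up data (for illustration only):
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 20)
y = 2.0 + 3.0 * x + rng.normal(0, 1, size=x.size)
b0, b1, s2 = simple_linear_regression(x, y)
print(f"b0_hat = {b0:.3f}, b1_hat = {b1:.3f}, sigma2_hat = {s2:.3f}")
```

Applied to the Boyle data from earlier in the chapter (with x = 1/h), this routine reproduces the same slope and intercept as the polyfit call in the first sketch, since both solve the same normal equations.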