
Data Analysis for Engineers
January 17, 2013
Contents

1 Dimes
2 Data
3 Models
4 Random Processes
5 The mean and the standard deviation
6 Densities
7 Population, Parameters, Samples and Statistics
8 Modeling Functional Relationships
9 Properties of random variables
10 Sampling Distributions
11 Evaluating the fit of the least squares line
12 The normal model
13 Fitting non-linear models
14 Experiments
15 A categorical explanatory variable
16 Models Galore
17 Coverage Intervals
18 Hypothesis Tests
19 Propagation of Error
Introduction
These are notes for the section of Mathematics 241 (Engineering Statistics) taught during the
Interim term of 2013 at Calvin College. These notes are meant to accompany and supplement
what happens in class. They are not intended to be a complete textbook for the course – students
will need to come to every session of the class in order for these notes to serve their intended
purpose.
These notes are meant to be read when assigned. When a section is assigned for a given class, you
should read it before that class. Each section has questions at the end and most sections also have
problems. You are expected to answer the questions as you read the section. The
problems do not need to be done at the time the section is assigned but they may be assigned for
later. The sections are short – the intention is that the sections will require about 30-45 minutes
of focused work (that is uninterrupted by phone calls, Facebook, etc.).
The notes are intended to introduce you to topics before they are covered in class rather than to
summarize what is covered in class. In other words, you will meet new ideas first in the notes and
engage them first in the questions. Then in class, you will be able to test your understanding of
the topic and build on it through the classroom activities. Sometimes you might not be able to give
a complete answer to a question that is assigned but it will still be useful to try to answer it.
Discussion of these questions will often be used to introduce a topic in class. (This is somewhat
different from a usual math class, in which you are assigned a section only after it is covered in
class.)
1 Dimes
We wanted to estimate the number of dimes in a large sack of dimes and we didn’t want to count
them. We decided to weigh the sack and divide by the weight of a dime. Specifically,
The sack of dimes had mass T g. We took a “representative” sample of 30 dimes and
weighed each carefully. The mean mass of a dime in our sample was 2.258 g. Assuming
that this is approximately the average mass of a dime in the sack, there are about
T /2.258 dimes in the sack.
The details of this computation are here.
> dimes$Mass
[1] 2.259 2.247 2.254 2.298 2.287 2.254 2.268 2.214 2.268 2.274 2.271
[12] 2.268 2.298 2.292 2.274 2.234 2.238 2.252 2.249 2.234 2.275 2.230
[23] 2.236 2.233 2.255 2.277 2.256 2.282 2.235 2.235
> mean(dimes$Mass)
[1] 2.25823
> histogram(dimes$Mass, main = "Masses of 30 dimes", xlab = "Mass in grams")
[Histogram "Masses of 30 dimes": x-axis Mass in grams (2.22 to 2.30), y-axis Percent of Total]
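To finish the estimate, we divide the total mass of the sack by this sample mean. The sack's mass T was left unspecified above, so the value used below is purely hypothetical:

```r
mean_dime <- 2.258   # mean mass of a sampled dime, in grams
sack_mass <- 5000    # hypothetical total mass T of the sack, in grams
n_dimes <- sack_mass / mean_dime
round(n_dimes)       # roughly 2214 dimes
```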
This example has all the features of almost every problem we address in this course. We note three
of these here. We use a simple model of the dimes in the sack, namely their average mass is 2.258
g. We know of course that this cannot be exactly right. But we hope that it is close to what is
true. The enemy of our strategy is variation. Not all the dimes have the same mass. Even in our
sample, the dimes varied in mass from 2.21 to 2.30 g. We expect similar variation in the sack as
a whole. We hope for a representative sample. We recognize that our sample was not likely to
perfectly reflect the properties of the whole collection of dimes and that a different sample might
have given us a different mean and therefore a different estimate. We hope that steps were taken
so that we didn’t choose an obviously biased sample.
There are a lot of questions that could be asked about this simple example. The most obvious of
these is the question of the accuracy of our estimate. Is there a way to use the data to determine
how far we are likely to be off the true answer? A related question concerns the magic number
30 in the story. Clearly, a sample of size one would have been too small but how much could we
improve our estimate by choosing 60 or even 100 dimes? Another question concerns measurement
error – what does the fact that we do not measure the exact mass of each dime do to our estimate?
Yet another concerns the sampling method – how do we choose a “representative” sample? We’ll
come back to all these questions (and more) later in the course.
Questions
Q1.1 A student in a section of Mathematics 241 suggested weighing 30 dimes at once rather than
weighing the 30 individually.
(a) Explain how to compute the number of dimes in the sack from the total weight of the 30.
(b) Give at least one good reason why this student’s strategy might be considered a better
strategy than the one we used to solve the problem.
(c) Give at least one good reason why our strategy might be considered better than this
student’s.
Q1.2 Another student suggested a method for estimating the number of dimes based on the volume
rather than the mass of dimes. Develop and critique such a method.
Q1.3 Variation is an issue in every engineering process that involves measurement of a quantity.
Suppose that you are asked to measure the height of your instructor. Why might two different
students report two different values of the height? (Presumably the instructor has only one
particular height.)
2 Data
This course concerns answering questions about engineering processes using data. In this section, we
introduce some important distinctions among types of data and also give a consistent framework for
thinking about data. Data will most often be presented to us in the form of a data set. A data set
consists of a set of objects, variously called individuals, cases, items, instances, observational
units, or subjects, together with a record of the value of a certain variable or variables defined
on these objects. We’ll usually use the term observational units for these objects. A variable is a
measurable characteristic of the observational units. Ideally, each observational unit has a value
for each variable but sometimes there are missing values.
The observational units about which we collect data might be as simple as a relatively small set
of physical objects. For example, we might take a sample of parts produced on an assembly line
and measure certain physical properties of those objects to see whether the assembly line is doing
what it is supposed to. More generally though, the observational unit is not necessarily a physical
object but rather it is an instance of an engineering process. To understand this, think of a typical
scientific experiment. For example, we might be attempting to measure the rate of a chemical
reaction. Typically, we perform more than just one “run” of the experiment. Each run is an
observational unit. For each run, we are normally measuring several variables – not only do we
measure the reaction rate but we also measure carefully the amount of each chemical input, the
temperature, the time of the reaction, and perhaps even the color of the resulting product.
The usual way of representing a data set is as a two-dimensional table. Each row of the table
corresponds to an observational unit and each column corresponds to a variable. (This way of
representing a data set should be familiar to users of Excel and that program is sometimes useful
for entering and manipulating our data prior to analysis.) For example, the first few rows of a data
set in which the observational units are certain makes of cars (of the 1974 model year!) are shown below.
Notice that some of the variables (e.g, MPG) are obvious while others (perhaps VS) could use a
definition.
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
Most of the variables that we will measure are quantitative, that is they are measured by a
number. However some variables are categorical – that is the values are certain categories or
labels. For example, if the observational units are people, typical quantitative variables include
age, height, and weight while a typical categorical variable is gender. Sometimes, as in the cars
example above, we use numbers to name the categories of a categorical variable (e.g., 1 denotes
Female and 0 denotes Male) but that's not usually considered good practice and can lead to
errors. Sometimes we recode a quantitative variable as categorical – for example we might divide
all our dimes into just two categories after weighing them (e.g., heavy and light).
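As a sketch of such a recoding in R (the 2.26 g cutoff and the masses are made up for illustration):

```r
mass <- c(2.259, 2.247, 2.298, 2.214, 2.275)  # a few dime masses, in grams
# cut() recodes a quantitative variable as a factor (a categorical variable)
weight_class <- cut(mass, breaks = c(0, 2.26, Inf),
                    labels = c("light", "heavy"))
table(weight_class)  # light: 3, heavy: 2
```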
Quantitative variables can be further subdivided into two types – discrete and continuous. A
continuous variable is one for which any real number in a particular range is a possible value. The
mass of a dime presumably can be anything between 2.2 g and 2.3 g. (The use of the word continuous
here seems appropriate given our understanding of continuous functions from calculus.) A discrete
quantitative variable is typically a count variable (e.g., the number of defects in a manufactured
product). Such variables are discrete since there is a gap between possible values (a count can be
2 or 3 but not 2.5). The fact that we can measure things with only finite precision
means that even a continuous variable is actually measured on a discrete scale. While sometimes
it will be useful to exploit that fact and treat such a variable as discrete, we will usually treat such
a variable as continuous and think of the values that we actually measure as being approximations
of some “true” value. This is entirely consistent with what we have done in calculus and in science
classes.
Usually, to understand a data set it is important to have some documentation (often called a codebook). Such documentation includes a careful description of each observational unit and variable.
Ideally a description of a variable should include the units, possible values, and an explanation of
how measurements were made.
In R, most data sets are stored in an object called a data.frame. A data.frame, when printed,
looks exactly like our description of data tables above. The columns of a data.frame (the variables)
are usually named. In R, a quantitative variable is stored in an object called a vector – you can
think of a vector as a list of numbers. The columns of a data.frame corresponding to quantitative
variables are vectors. A categorical variable is usually stored in an object called a factor which is
a list of category names. (Sometimes a categorical variable is stored in a character vector.)
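A tiny made-up example of these objects:

```r
mass <- c(2.259, 2.247, 2.298)                # quantitative: a numeric vector
condition <- factor(c("worn", "new", "new"))  # categorical: a factor
# A data.frame: rows are observational units, columns are variables
dimes_df <- data.frame(Mass = mass, Condition = condition)
str(dimes_df)  # shows one numeric column and one factor column
```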
Questions
Q2.1 Calvin maintains a data set in which each observational unit is a current student (including
you). List three quantitative and three categorical variables that you think are probably
in the data set. Are there any variables that you think are included that you have trouble
deciding whether they are quantitative or categorical?
Q2.2 List two good reasons for NOT using numbers to name the categories of a categorical variable.
3 Models
A model is simply anything used to represent something else. For example, many toys are models –
cars, dolls, and stuffed animals are supposed to represent the real thing. No model is meant to exactly
capture all features of the thing being modeled. Toy cars, for example, look like the car they are
representing but lack the engine. Physical models can be important to engineers but the models
that we use in this course are mathematical. That is they represent features of an engineering
process in equations and numbers. Mathematical models are found all over the sciences. Almost
every physical law is expressed in some sort of equation. For example s = gt²/2 is a model of an
object falling under the influence of gravity and expresses the distance fallen (s) as a function of
time since release (t). (The model is very accurate for free fall in a vacuum but in the real world
it's only approximate due to air resistance.) Another sort of mathematical model is a graph
– a graph represents some aspect of reality in an abstract picture.
Why do we construct models? There are at least three important uses for models.
1. Models provide a description of reality. In a model, we have a succinct representation of
a real-world system that can be used to communicate with others. A graph such as the
following tells a story. In this case, the graph represents the number of FTIACs (“first time
in any college”) at Calvin College in the fall of each year. The story it tells is one of, among
other things, a highly variable process!
[Figure: FTIACs (about 800 to 1100) plotted by Year, 1980–2010]
2. Models sometimes allow for prediction. A model can be used to make a prediction of what
might happen in the very next case. The same graph above has a “trend” superimposed on
it. If that trend is extrapolated another year, we can predict a slightly smaller freshman class
in 2013.
[Figure: the same FTIACs-by-Year data with a trend curve superimposed]
Of course the model here (the “trend” curve) is open to debate. How in the world did we
settle on that curve among all the many that we could draw that vaguely fit the data? Good
question – an answer comes later!
3. Models occasionally allow for explanation. The model of FTIACs above does not give us any
explanation of the variation from year to year. A good model would begin to help us decide
why the pattern is what it is. An explanation of enrollment trends might take into account
such features as tuition, the economy, programs that the college offers, and the competition.
The mathematical models that we learn about in physics are deterministic. If we know the value of
m, the equation E = mc² tells us exactly the energy contained in that mass. However the models
in a statistics class will not usually be deterministic but instead will incorporate randomness into
the description of the real world. To give a very simple example consider yet again the problem of
the mass of dimes. We might write the model this way:
y = µ + ε
Here y is the mass of an unknown dime and µ is the mean mass of dimes in the population in
question (in this case, the sack of dimes). The other term, ε, is a “random” error term. What this
equation says, approximately, is that µ is a good guess for the weight of a dime but that individual
dimes will vary in weight based on chance effects that affect each dime differently. We will generalize
this example in many ways but it will always have the same basic form:
y = model + error.
In this equation, the word “model” really means the deterministic part of the model and the word
“error” really denotes the deviation of the observation from the model prediction. This partitioning
of the variation of y into two pieces – that which can be explained (“model”) and that which cannot
(“error”) is perhaps the most fundamental idea of the course. (Different branches of science have
different names for this partitioning: explained vs. unexplained; signal vs. noise; predicted vs.
unpredicted; group vs. individual.)
In this equation, the variable y is called the response variable. This variable usually measures
some operating characteristic of our engineering process that we are attempting to predict or control.
Our model will usually involve some other variables that we think can be used to explain the
variation in y – such a variable is called an explanatory variable. If x1, . . . , xk are our explanatory
variables, our model will usually take the form y = f(x1, . . . , xk) where f is some function (hopefully
simple). For example, if there is one explanatory variable x, we might hope for a model of the form
y = b0 + b1x (a linear function).
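We can simulate data from a model of this form; the coefficients b0 = 3 and b1 = 0.5 and the size of the error term below are invented for illustration:

```r
set.seed(1)                             # make the simulation reproducible
x <- 1:20                               # one explanatory variable
epsilon <- rnorm(20, mean = 0, sd = 1)  # the random error term
y <- 3 + 0.5 * x + epsilon              # y = model + error
plot(x, y)                              # the data scatter around ...
abline(3, 0.5)                          # ... the deterministic part of the model
```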
Models never capture all the features of a real-world system. The prominent statistician George
Box said “All models are wrong, but some are useful.” Wrong might be too strong here – it’s
probably better to say something like “All models are incomplete and approximate” – but it is wise
to remember that our models are only models.
Questions
Q3.1 If you wanted to construct a model for the height of Calvin students, what might be one useful
explanatory variable? (Note that arm length might be a useful explanatory variable but it is
as difficult to measure as is height. A good explanatory variable should be accessible.)
4 Random Processes
The kind of engineering process that we wish to model is one that has variation that we cannot
completely explain. For example, we might expect to produce washers with a thickness of 2 mm,
but the actual thickness of the washers that we produce will vary slightly from washer to washer.
We might expect a dime to be of average weight, but the next dime that we examine will probably
not be exactly average but instead be somewhat heavier or a little lighter. We try to explain this
variation — we might make models that try to explain the variation in terms of other variables that
we can measure. For example, it might be that older dimes are slightly lighter than newer dimes due
to wear. (It turns out that at least our sample data doesn’t support this hypothesis! You can see
that by looking at the graph below which has the mass of the dimes plotted by the year in which
the dime was minted.) But not all variation can be explained. For example, we usually cannot
know or measure all the variables that affect the process.
[Figure: dime Mass in grams (2.22 to 2.30) plotted against the Year the dime was minted, 1970–2000]
The mathematical theory that we use to model situations in which there is such uncertainty is
known as probability theory. A probability is a number meant to measure the likelihood of the
occurrence of some uncertain event (in the future). The setting in which we use probability theory
is that of a random process.
Characteristics of a Random Process:
1. A random process has a number of possible outcomes.
2. For any particular trial or observation of the process, the outcome that will obtain is
uncertain.
3. The process could be repeated indefinitely (under essentially the same circumstances)
at least in theory.
Probability theory was born out of investigations into games of chance. Rolling a die is a perfect
example of a random process. We cannot predict the outcome of the next roll of the die, and it
could have any of six different outcomes. Each roll is essentially identical in what could happen
(but of course not necessarily in what does happen).
Usually we will be interested in random processes that have numbers as output. A random
variable is a random process that results in a number. Typically, a random variable will result
from making some measurement of an instance of the random process. For example, suppose that
we drop a ball from 10 meters. The length of time that it takes the ball to hit the ground is a
random variable. While we might think that we can predict exactly the amount of time it should
take for the ball to hit the ground, various features of the experiment will cause our prediction to
be in error. For example, it might be very difficult to release the ball from exactly the same height
each time and it will also be difficult to start and stop the timing device at the precise moment that
we are supposed to. So we might expect that the time might be any number in some (hopefully)
small interval of numbers. We will often use upper-case letters from the end of the alphabet (like
X) to denote random variables.
Given a random process, our goal is to make statements about the likelihood of various possible
outcomes of a trial of the random process. For example, we say that the probability that a coin
will come up heads when tossed is 1/2. (This reflects the plausible observation that we think it is as
likely as not that heads comes up when we toss a coin.) More formally, we say an event is a set
of possible outcomes of the random process and our goal is to assign to each event E a number
P(E) (called the probability of E) such that P(E) measures in some way the likelihood of E. It
is important to note here that we only make probability statements about outcomes of the process
in the future – the past has already happened and is certain!
Likelihood here is an informal concept. We need to be more precise about what a statement
like P(E) = .3 actually means. Interpreting probability computations is fraught with all sorts of
philosophical issues but it is not too great a simplification at this stage to distinguish between two
different interpretations of probability statements.
The frequentist interpretation.
The probability of an event E, P(E), is the limit of the relative frequency that E occurs in
repeated trials of the process as the number of trials approaches infinity.
In other words, if the event E occurs en times in the first n trials, then on the frequentist
interpretation, P(E) is the limit of en/n as n → ∞. On this interpretation, if the probability of
an event is 1/3, that means that we would expect to see it happen about 1/3 of the time if we
repeated the random process many times. Notice that this is a statement about a theoretically
infinite sequence of trials and doesn't really say anything about the next occurrence. Thus
there is no way to “prove” that a probability statement is correct.
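A simulation illustrates the frequentist idea. Below, E is the event of rolling a 3 with a fair die; the choice of 10,000 trials is arbitrary:

```r
set.seed(42)
rolls <- sample(1:6, size = 10000, replace = TRUE)  # 10,000 simulated die rolls
e_n <- cumsum(rolls == 3)            # e_n = number of 3s among the first n rolls
rel_freq <- e_n / seq_along(rolls)   # the running relative frequency e_n / n
rel_freq[10000]                      # close to 1/6, roughly 0.167
```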
The subjectivist interpretation.
The probability of an event E, P(E), is an expression of how confident the assignor is that
the event will happen in the next trial of the process.
The word “subjective” is usually used in science in a pejorative sense but that is not the sense of
the word here. Subjective here simply means that the assignor needs to make a judgment and that
this judgment may differ from assignor to assignor. Nevertheless, this judgment might be based
on considerable evidence and experience. That is, it might be expert judgment. These kinds of
expert judgments about the likelihood of future events are quite common in a variety of domains.
Economic decisions are often made at least partly on the basis of subjective assessment of risk.
For example, an economist might say that the probability that the Dow Jones Industrial Average
increases in 2013 is 75%. This means, among other things, that the economist thinks that it is more
likely than not that the Dow Jones will increase in 2013. However, she would not be particularly
surprised if it did not. On the other hand, a probability of 99% would mean that the economist
was virtually certain that the Index would rise.
In this course, we will mostly appeal to the frequentist interpretation. That is we will imagine our
random process as being repeated forever and ask about its long range behavior. But this does not
tell us how to assign the numbers. In fact it seems circular – in order to “compute” probabilities
we would need to examine the long-range behavior of the process but it is the long-range behavior
that we wish to predict using the numbers we assign! So where do we get our numbers?
There are fundamentally two ways to assign probabilities to events. In some situations
we use one or the other primarily but usually in the real world we use a combination of the two.
1. Assigning probabilities based on a model.
We toss a six-sided die. If E is the event that the die toss results in the number 3, we say
P (E) = 1/6. In other words, we think that in the long-run, one-sixth of the tosses result
in the number 3. The model we use here is very simple – we think that the six sides of the
die are “equally likely” due to the symmetry of the die. This equally likely model will be
useful to us in a variety of situations although of course in each situation it needs some sort
of justification such as the symmetry argument used for the die. For example, we would not
say that the probability that Calvin beats Hope in basketball on January 16 is 1/2 simply
because there are two possible outcomes. These outcomes are not obviously equally likely. Of
course models are not limited to equally likely models. We will use a variety of sophisticated
models to assign probabilities to events.
2. Assigning probabilities based on data (i.e., past performance).
If we have enough data, we can imagine that past performance of a random process will tell
us something about its long range behavior. This method of assigning probabilities is used
heavily in the insurance business. An insurance company might note that of 10,000 men
who were 50 years old on January 1, 2001, 9,379 were still living on January
1, 2011. So the probability that a 50 year old man lives ten more years might be estimated
to be 93.79%. This number is useful in determining the price of an insurance policy for a 50
year-old man.
The usual process for assigning probabilities turns out to be an interplay of these two methods.
We often start with a simple model and test it with data. We use the data to refine the model and
iterate the process.
Questions
Q4.1 A standard deck of playing cards has 52 cards. There are four suits (hearts, diamonds, spades,
clubs) and thirteen cards to each suit (Ace, 2, 3, 4, 5, 6, 7, 8, 9, 10, Jack, Queen, King). A
single card is picked from a well-shuffled deck. What is the probability that it is an Ace?
Q4.2 Two six-sided dice are tossed and the numbers that occur on them are added together.
Therefore, there are 11 possibilities (the numbers from 2 to 12). Why is it not reasonable to believe
that the probability that the sum is 6 is 1/11? What might it be?
Q4.3 Some games of chance employ spinners rather than dice or cards. A classic board game
published by Cadaco-Ellis is “All-American Baseball.” The game contains discs for each
of several baseball players. The disc for Nellie Fox (the great Chicago White Sox second
baseman) is pictured below.
The disc is placed over a peg with a spinner mounted in the center of the circle. The spinner is
spun and comes to rest pointing to one of the numbered areas. Each number corresponds
to a possible result of Nellie Fox batting. (For example, 1 is a home run and 14 is a flyout.)
(a) Why is it unreasonable to believe that all the numbered outcomes are equally likely?
(b) Explain how one could use the idea of equal likelihood to predict the probability that
the spinner will land on the sector numbered 14 and then make an estimate of this
probability.
(Spinners with regions of unequal size are used heavily in the K–8 textbook series Everyday
Mathematics to introduce probability to younger children.)
Q4.4 Meteorologists often make statements such as “the probability of rain tomorrow is 70%.”
What do you think they mean by this and how do you think that they determine the number
here?
5 The mean and the standard deviation
In the first section, we used the mean of a sample of dimes as a model for the weight of a “typical”
dime. That’s not unusual. The mean is often used as a measure of “center.” (It’s not the only
such measure – the median is also commonly used as a measure of center.) While the mean was
a useful model, it was also important to remember that there was considerable variation in the
sample. Let’s examine that variation more carefully.
In these notes, we will almost always use n to denote the number of observational units and usually
also use y to name the response variable. With that notation, our observed values can be named
y1, y2, . . . , yn
(and R would use y[1] since subscripts aren’t available). With that notation, the mean is
ȳ = (y1 + y2 + · · · + yn)/n
Using the idea that
observation = model + error
we will write
yi = ȳ + ei
The number ei is simply the deviation of the i-th observation from the mean. We won't call ei an
error; rather, we will use a more neutral term, residual. (Residual means “what's left over”.)
For the dime example, we have
> options(scipen = 1, digits = 3)
> residuals = dimes$Mass - mean(dimes$Mass)
> residuals
 [1]  0.000767 -0.011233 -0.004233  0.039767  0.028767 -0.004233  0.009767
 [8] -0.044233  0.009767  0.015767  0.012767  0.009767  0.039767  0.033767
[15]  0.015767 -0.024233 -0.020233 -0.006233 -0.009233 -0.024233  0.016767
[22] -0.028233 -0.022233 -0.025233 -0.003233  0.018767 -0.002233  0.023767
[29] -0.023233 -0.023233
> options(digits = 5)
Of course some of the residuals are positive and some negative. What is more interesting is that
the residuals add up to 0 – the positive and negative distances to the mean exactly balance out.
That is
e1 + e2 + · · · + en = (y1 − ȳ) + (y2 − ȳ) + · · · + (yn − ȳ) = 0.
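This is easy to verify numerically (any data vector will do; a few of the dime masses are reused here):

```r
y <- c(2.259, 2.247, 2.254, 2.298, 2.287)  # a few dime masses, in grams
e <- y - mean(y)  # the residuals
sum(e)            # zero, up to floating-point rounding
```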
Obviously, if all the residuals are very small, the mean is a good model for the data but if some (or
all) of the residuals are large, the model does not “explain” the data very well. So we would like
some measure of the size of the residuals in aggregate. There are a lot of measures that could be
used but the measure that we will use is the mean squared residual (which you learned in physics
to call the mean squared error). This number is more commonly called the variance (of the data)
and the computational formula is
[(y1 − ȳ)² + (y2 − ȳ)² + · · · + (yn − ȳ)²]/(n − 1)
There are two questions to ask about this formula: why square everything? and why use n − 1
instead of n? The latter issue is an annoying technicality and will be addressed later, and it doesn't
really matter much anyway if n is large. The reason for squaring is more important and more
complicated and again we have to wait until later for a clear answer. But here is a partial answer.
The mean ȳ of y1, . . . , yn is the value of c that minimizes the quantity

(y1 − c)² + (y2 − c)² + · · · + (yn − c)².
This says something very important. Namely the mean and the variance go hand in hand – the
mean is the number that minimizes the sum of the squared residuals (and so also the mean squared
residual). If we used a different measure of spread, we might want to use a different measure of
center. Indeed, if we used absolute values instead of squares in the variance formula, we would
want to use the median instead of the mean!
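Both minimization claims are easy to check numerically with optimize(), which searches an interval for the minimizing value of c:

```r
y <- c(2.259, 2.247, 2.254, 2.298, 2.287)   # sample data
ssr <- function(c) sum((y - c)^2)           # sum of squared residuals
sad <- function(c) sum(abs(y - c))          # sum of absolute deviations
optimize(ssr, interval = range(y))$minimum  # essentially mean(y)
optimize(sad, interval = range(y))$minimum  # essentially median(y)
```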
Mean squared deviation (i.e., variance) is a good measure of variation but it has one drawback – it
has the wrong units. So we often use its square root instead. This number is called the standard
deviation of the data and is denoted by sy (or just s when y is understood):
sy = √( [(y1 − ȳ)² + (y2 − ȳ)² + · · · + (yn − ȳ)²]/(n − 1) )
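R's built-in var() and sd() compute exactly these quantities (including the n − 1 divisor):

```r
y <- c(2.259, 2.247, 2.254, 2.298, 2.287)  # sample data, in grams
n <- length(y)
v <- sum((y - mean(y))^2) / (n - 1)  # variance from the formula
c(v, var(y))                         # the two agree
c(sqrt(v), sd(y))                    # standard deviation, back in grams
```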
Questions
Q5.1 Show as claimed in the section that the sum of the residuals is 0. That is
(y1 − ȳ) + (y2 − ȳ) + · · · + (yn − ȳ) = 0.
Q5.2 Show as claimed in the section that the mean is the value of c that minimizes the following
function. (Hint: you know how to determine the point at which a differentiable function has
a minimum.)
f(c) = (y1 − c)² + (y2 − c)² + · · · + (yn − c)²
Q5.3 We often define new variables by performing simple transformations on old ones. For example,
we might change units by multiplying every observation by a constant. Suppose that we define
a new variable z by z = 10y. How do the mean, variance and standard deviation of z compare
to those of y? (Try examples first if you are not sure what the right answer should be.)
Q5.4 Suppose instead that we add a constant to all observations. For example suppose that we
define a new variable w by w = y + 20. How do the mean, variance and standard
deviation of w compare to those of y?
6 Densities
A histogram is a simple picture describing the “density” of data. A histogram bar is tall if the
data is “dense” in the region represented by the bar and short otherwise. This notion of density
can be made precise by relabeling the vertical axis of a histogram to create a density histogram.
The histogram below is of the Sepal Lengths of 150 iris flowers (from the built in dataset iris).
> histogram(iris$Sepal.Length, br = seq(4, 8, 0.5), type = "density")
[Density histogram of iris$Sepal.Length, bins of width 0.5 from 4 to 8; y-axis Density, 0.0 to 0.4]
Here the y-axis is in density units. Density means “mass per unit of something” and here it means
mass per unit on the x-axis. To see how this works, consider those iris plants with Sepal.Length
between 5.5 and 6. The density of this group appears to be about .4 and the width of the bar is .5
so the total mass of the bar is .4 × .5 = .20. The density units are chosen so that the total mass
is 1. Therefore about 20% of the data are between 5.5 and 6. In this case mass is simply
area, and the total shaded area is exactly 1.
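To see the bookkeeping behind a density histogram, here is a small sketch (Python, with made-up lengths rather than the iris data) showing that the bar areas always total 1:

```python
# A sketch of how a density histogram is computed: each bar's height is
# count / (n * width), so the bar areas always add up to 1.
data = [4.6, 4.9, 5.0, 5.1, 5.5, 5.7, 5.8, 6.0, 6.3, 6.9]   # hypothetical lengths
bins = [(4.0, 4.5), (4.5, 5.0), (5.0, 5.5), (5.5, 6.0),
        (6.0, 6.5), (6.5, 7.0), (7.0, 7.5), (7.5, 8.0)]

n = len(data)
densities = []
for lo, hi in bins:
    count = sum(lo <= x < hi for x in data)    # left-closed bins
    densities.append(count / (n * (hi - lo)))  # density = mass per x-unit

total_area = sum(d * (hi - lo) for d, (lo, hi) in zip(densities, bins))
print(total_area)   # 1 (up to rounding)
```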
Of course the density function that we use to model the data with a histogram depends heavily on
the choice of the bin size and placement. Here is a different density model of the data.
> histogram(iris$Sepal.Length, br = seq(4, 8, 0.25), type = "density")
[Figure: density histogram of iris$Sepal.Length with 0.25-wide bins; x-axis from 4 to 8, y-axis Density from 0.0 to 0.5]
One way of thinking about a histogram is that it is modeling the data by a density function that
is “piece-wise” constant. However there is no need for anyone who understands calculus to limit
their attention to constant functions. We generalize the notion of density histogram to that of a
density function.
A density is a function f with two properties:
f(x) ≥ 0 for all x, and ∫_{−∞}^{∞} f(x) dx = 1.
In the following graph, we add a density to the plot of iris sepal lengths.
> histogram(iris$Sepal.Length, br = seq(4, 8, 0.5), type = "density")
> ladd(panel.densityplot(iris$Sepal.Length))
[Figure: the density histogram of iris$Sepal.Length with a density curve superimposed]
Of course the density that is constructed here is not unique. There are many different density
curves that could plausibly be said to fit these data. So choices need to be made. However it is
important to remember that choices are made already in drawing histograms (bin placement and
width).
A density curve is an appropriate model for data – as appropriate as a histogram might be. However
we will often use density curves as models for the engineering process that produced the data.
Consider the following example. A certain process for manufacturing glass is monitored by testing
the breaking strength (in ksi) of 31 pieces of glass from the process. A density histogram, and a
density curve are drawn below.
[Figure: density histogram of windowstrength$ksi with a density curve superimposed; x-axis from 20 to 45, y-axis Density from 0.00 to 0.06]
While the density curve is a smooth curve and tries to match the data well, it doesn't seem likely
that it is a good description of the manufacturing process. We would expect a smooth curve and
one that is unimodal. The following graph includes a density that doesn’t fit the data quite as well
but is perhaps a better model of the underlying process.
[Figure: a smooth unimodal density over the ksi data; x-axis from 10 to 50, y-axis density from 0.00 to 0.05]
If the process we are modeling with a density is a random process, the density function is usually
called a probability density function (abbreviated pdf) and it can be used to make probability
statements about the outcomes of the process (in the future).
We will use the following notation for such processes. Random variables will be named with
upper-case letters, preferably from the end of the alphabet, like X. Remember that a random
variable is simply a random process with a numerical outcome. Density functions are models for
the probabilities associated with continuous random variables. (Recall the discrete–continuous
distinction?) We write expressions like P(X ≤ 3) to stand for “the probability that X is less than
or equal to 3.”
Here is a very simple example of a probability model of a process. The time-to-failure of a particular
kind of part might be modeled as a random variable X. The following density function, called the
exponential density function for obvious reasons, is often used as a model for such a “time-to-failure”
random variable with an appropriate choice of the number λ.
f(x) = λe^{−λx},   x ≥ 0
A graph of this density for λ = 0.1 is below.

[Figure: the exponential density with λ = 0.1; x-axis from 0 to 30, y-axis density]
You can see that this curve might be a model of the time-to-failure of a part that probably doesn’t
last more than 30 time units and quite often fails within the first 10 time units. This density
function can then be used to make probability statements about the time-to-failure of the next
part. For example, the probability that the part will fail in at most 10 time units is
∫_{0}^{10} 0.1e^{−0.1x} dx.
(This turns out to be approximately 0.63.) Of course the choice to use this function and to use
0.1 as the value for λ needs to be justified in a particular situation. It may not be that the form of
this density curve is at all right for the kind of part in question. It also might be that 0.1 leads to too
small or too large an estimate for the likelihood that a part fails in 10 time units. These choices
ultimately need to be made based on a combination of theory and past experimental data.
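A quick numerical check of this integral (a Python sketch; R's integrate or pexp would give the same answer):

```python
# Checking the integral above numerically.  The antiderivative of
# 0.1*exp(-0.1x) is -exp(-0.1x), so the exact value is 1 - exp(-1).
import math

def exp_density(x, lam=0.1):
    return lam * math.exp(-lam * x)

# A simple midpoint-rule approximation of the integral from 0 to 10.
steps = 10000
h = 10 / steps
approx = sum(exp_density((i + 0.5) * h) * h for i in range(steps))

exact = 1 - math.exp(-1)   # approximately 0.632
print(approx, exact)
```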
The exponential model is so widely used that its properties are built into R in several functions.
Suppose X is a random variable and that the exponential density is a good probability model for
it. Then
function          explanation
dexp(x, lambda)   returns fX(x) (the pdf).
pexp(q, lambda)   returns P(X ≤ q) (the cdf).
qexp(p, lambda)   returns x such that P(X ≤ x) = p.
Using our example of a part that fails according to the exponential density function with λ = 0.1
we can easily answer the following questions:
1. What is the probability that the part fails before the end of 15 time units?
> pexp(15, 0.1)
[1] 0.777
2. What is the probability that the part lasts more than 20 time units?
> 1 - pexp(20, 0.1)
[1] 0.135
3. By what time will about 50% of the parts have failed?
> qexp(0.5, 0.1)
[1] 6.93
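The same three answers can be reproduced from the closed forms of the exponential cdf and quantile function. A Python sketch with hypothetical stand-ins for pexp and qexp:

```python
# The exponential cdf is P(X <= q) = 1 - exp(-lambda*q), and inverting it
# gives the quantile function q(p) = -ln(1 - p)/lambda.
import math

lam = 0.1

def pexp(q, rate):      # stand-in for R's pexp
    return 1 - math.exp(-rate * q)

def qexp(p, rate):      # stand-in for R's qexp
    return -math.log(1 - p) / rate

print(round(pexp(15, lam), 3))       # 0.777
print(round(1 - pexp(20, lam), 3))   # 0.135
print(round(qexp(0.5, lam), 2))      # 6.93
```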
Questions
Q6.1 Convince yourself that the following are densities:

f(x) = 1/3 if 0 ≤ x ≤ 3, and 0 otherwise.

f(x) = 2x if 0 ≤ x ≤ 1, and 0 otherwise.

f(x) = 6x(1 − x) if 0 ≤ x ≤ 1, and 0 otherwise.

f(x) = e^{−x} if 0 ≤ x < ∞, and 0 otherwise.
Q6.2 A manufacturer of fans that are designed to run continuously reports a MTBF (mean time
between failure) of 10,000 hours. We can model the failure of such fans by a random variable
with an exponential distribution with λ = 0.0001.
(a) What is the probability that a given fan will fail in the first 10,000 hours?
(b) You might think that if MTBF is 10,000, that the answer to part (a) should be .5. Why
is it plausible that it is not?
(c) What is the probability that a given fan will function for more than 20,000 hours?
(d) Do you think that the exponential density is a good model for failure time for products
like fans? Why or why not?
7 Population, Parameters, Samples and Statistics
In this section we introduce a very simple model for the kind of data analysis that we do in this
course. The key observation is that we collect data (on certain observational units) not so much to
learn about those particular units but to make inferences about the process that produced them.
For example, we looked at 30 dimes to make some inferences about the larger sack of dimes from
which they were chosen. In another example, we looked at two years of Twin Falls, Idaho weather
to attempt to make predictions about future weather. In each case our model was derived from the
data (the mean model in the case of the dimes and the Weibull model in the case of the wind) but
we then wanted to apply it to the larger context. The four words in the title of this section give us
a language for talking about this process.
A population is a finite collection of objects. Think of the population as the whole universe of
possible observational units about which we want to make some inferences. The population might
be a population of people but more usually in this course it is a population of objects. In our first
example, the population was the set of dimes in a particular sack. The situation we find ourselves
in is that the population is too large for us to examine each observational unit and measure the
relevant variables. So we take a sample.
A sample is a subset of the population. A sample of observational units is chosen from the
population so that these units can be examined more carefully. Samples are generally small relative
to the population. For example, political polls are often done with samples of a few thousand
voters out of the millions of possible voters. Of course not just any sample will do – we would like
the sample to be “representative” of the population. It’s not clear how to accomplish this though
if one doesn’t know what the population looks like!
A parameter is a numerical characteristic of the population. For example, the mean mass of dimes
in the sack or the total number of dimes in the sack are both parameters of the population of dimes
in the sack. If Y is a quantitative variable defined on the whole population, typical parameters
include the mean, median, and quartiles of Y .
A statistic is a numerical characteristic of the sample. Just as the mean mass of dimes in the sack
is a parameter, the mean mass of dimes in the sample is a statistic.
So here’s the story – we have a population and we would like to know the value of some parameter
for that population. However we cannot examine all the objects in the population to compute the
parameter. Therefore we take a sample and compute some statistic. We then use the statistic to
make a guess (estimate) at the value of the parameter. For example, it was reasonable to guess
that the mean weight of dimes in the sack was the same as the mean weight of dimes in the sample.
At least that seemed like the best guess given a sample. In this case, using the mean of the sample
to estimate the mean of a population seems like an obvious and sensible strategy. However in more
complicated situations, it will not always be clear what statistic should be used.
Statisticians have developed useful (but not completely coherent) notational conventions for talking
about this model of inference. Parameters are almost universally named by Greek letters. Some
of our favorite parameters will be µ, σ, α, and β. In particular, µ will always be used to denote
the mean of the population. There are two conventions for denoting statistics. Sometimes we use
roman letters. For example ȳ is notation for the mean of the sample data and s or sy denote the
standard deviation of the sample. A second convention for naming statistics is to use “hats.” Using
this convention, our estimate of µ might be named µ̂. It usually helps to pronounce the “hat” as
“predicted” or “estimated”.
The description above of how statistics works is a useful idealization. We’ll complicate it immediately. Consider the Idaho wind data. Our sample was not really a sample from some finite collection
of possible hours. Rather, we conceptualized the data as resulting from a random variable. Often
we’ll call such a random variable the population. So our “population” in this case is really (a model
of) the process that produces the data that we have in hand. We took one further step away from
the actual process in that we modeled the random variable itself by a Weibull density. We now can
see why we use α and β to name the parameters of the Weibull distribution – they are parameters
of (the model of) our process and are not measurable from the data. What we did in that case was
estimate α and β from the data. Our estimates were our statistics and it would be appropriate to
name them α̂ and β̂.
We have been using the equation
observation = model + error
to express the fact that our model does not capture every aspect of the process that produces the
data. We’re now going to complicate this just a bit. Again, think of dimes data. We used ȳ as a
model for the typical dime and noted that ȳ is a good choice in the sense that it fits the data well
– as well as possible if we use the squared error criteria. However ȳ is really not what we would
like to use as our model – we are using ȳ to estimate µ, the actual mean of the population. So we
really have two equations:
observation = true model + error
observation = fitted model + residual.
We’ll make a distinction between the fitted model (ȳ) in this case and the “true” model (µ). From
sample to sample, µ doesn’t change but our fitted model would.
Likewise, we might believe that there is a “true” Weibull distribution that models the process of
hourly wind speeds in Twin Falls, Idaho but that we could only get a “fitted” Weibull distribution
based on our limited data. Had we more data or data from a different time period produced by
the same Weibull distribution, our fitted distribution might be different.
If the sample is to give us dependable information about the population, we need to understand
how the sample was produced. Of course we are hoping that the sample is representative of the
population, at least with respect to the variables that we are considering. Of course there is no
way of guaranteeing that a sample is representative. That would seem to need knowledge of the
population which we don’t have! But we would like to ensure at least that the sample is chosen by
a method that has no known biases. For finite populations, the most important sampling technique
is simple random sampling. A simple random sample of size n from a population is a subset of
size n chosen from the population in such a way that every set of size n has an equal chance to
be the sample. A traditional way to choose such a sample is to put names of the population in a
(possibly metaphorical) hat and then choose n objects with a blind draw. While it is possible to
do this with a physical process (even using real hats) the recommended method now is to use a
computer for this purpose. A simple way to do this is to number the objects from 1 to N and
then choose n such numbers using the computer. The R function sample does this. For example
the following chooses a simple random sample of size 10 from the first 100 numbers:
> sample(1:100, 10, replace = F)
[1] 38 26 93 42 32 51 59 16 11 10
The fact that the replace option above is false ensures that the same object is not chosen to be in
the sample more than once. Notice that the sample chosen seems to be representative of the first
100 numbers in some ways but not in others.
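The same draw can be sketched in Python (the seed is arbitrary, chosen only to make the draw reproducible):

```python
# A sketch of simple random sampling without replacement, analogous to
# the R call sample(1:100, 10, replace = F).
import random

random.seed(241)                      # arbitrary seed for reproducibility
population = list(range(1, 101))      # the numbers 1 to 100
srs = random.sample(population, 10)   # every 10-subset is equally likely
print(sorted(srs))

# No number can appear twice, just as with replace = F in R.
assert len(set(srs)) == 10
```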
In the case of sampling from a population, we are using a random process to choose the sample.
But there is nothing that requires probabilities to talk about the population. On the other hand, if
we are studying a process, we use probabilistic models to model the population itself. In this case,
our sample is viewed as repeated, independent instances of that random process. It is interesting
to note that in the wind speed example, our sample really does not have that character, for it is not
plausible to assume that the wind speed of one hour is unrelated to that of the previous (or the
next) hour. Our simple model of the data production does not take this dependence into account.
We might hope that we collected enough data so that the lack of independence of the individual
observations does not matter.
Questions
Q7.1 Often surveys are conducted using “convenience” samples. A convenience sample may often be
biased in significant ways (and sometimes we may not realize what those ways are). Suppose
that you wish to survey Calvin students. In what ways might the following samples be
biased?
(a) The first 30 students that you find outside of the SB 128 (our classroom).
(b) The first 30 students that you meet in Johnny’s.
(c) The first 30 students that are listed in the student directory (in alphabetical order).
(d) The first 30 students that you meet in the fitness center in the fieldhouse.
Q7.2 Consider the set of natural numbers P = {1, 2, . . . , 100} to be a population.
(a) How many prime numbers are there in the population?
(b) If a sample of size 10 is representative of the population, how many prime numbers
would we expect to be in the sample? How many even numbers would we expect to be
in the sample?
(c) Using R choose 5 different samples of size 10 from the population P . Record how many
prime numbers and how many even numbers are in each sample. Make any comments
about the results that strike you as relevant.
Q7.3 Donald Knuth, the famous computer scientist, wrote a book entitled “3:16”. This book was
a Bible study book that studied the 16th verse of the 3rd chapter of each book of the Bible
(that had a 3:16). Knuth’s thesis was that a Bible study of random verses of the Bible might
be edifying. The sample was of course not a random sample of Bible verses and Knuth had
ulterior motives in choosing 3:16. Describe a method for choosing a random sample of 60
verses from the Bible. Construct a method that is more complicated than simple random
sampling that seeks to get a sample representative of all parts of the Bible. (You might find
the table of the number of verses in each book of the Bible at http://www.deafmissions.com/
tally/bkchptrvrs.html to be useful!)
8 Modeling Functional Relationships
Many data analysis problems amount to describing the relationship between two quantitative variables. We have some variable y of interest that describes the result of an engineering process (a
response variable). We think that the variation in y can be at least partially explained by another
variable that we measure and sometimes even control (an explanatory variable). We would like
to describe the relationship between x and y with some sort of function, y = f (x).
As an example, consider the following data taken from an experiment on corrosion. Thirteen bars
of 90-10 Cu/Ni alloys were submerged for sixty days in sea water. The bars varied in iron content.
The weight loss due to corrosion for each bar was recorded. The variables measured were the
percentage content of iron (Fe) and the weight loss in mg per square decimeter (loss). A few data
values and a scatter plot of all the data are below.
> head(corrosion)
    Fe loss
1 0.01  128
2 0.48  124
3 0.71  111
4 0.95  104
5 1.19  102
6 0.01  130
> xyplot(loss ~ Fe, data = corrosion, main = "Material loss by iron content")
[Figure: scatter plot titled “Material loss by iron content”; x-axis Fe from 0.0 to 2.0, y-axis loss from 90 to 130]
Two things are clear from the plot. First, there can be no function f such that y = f (x) exactly.
This is easily seen since there are two different observations with the same value for Fe but different
values for loss. Second, a linear function could be a useful model for the process. In the following
graph, we superimpose a line on the scatterplot.
[Figure: the same scatter plot with the fitted line superimposed]
The line plotted has equation
loss = 129.79 − 24.02Fe.
Both the intercept and slope of this line have simple interpretations. For example, the slope suggests
that every increase of 1% in iron content is associated with a decrease in loss of 24.02 mg per square
decimeter.
Obviously there are many lines that we can fit to these data and there is no one “right” choice.
Just as clearly however there are lines that are really bad choices. It comes down to deciding what
makes a line a good “fit” for the data. We want a line that is “close” to the data in some sense.
We are going to measure this in precisely the same way we did when we justified using the mean
to model one variable. Namely, we are going to look at squared residuals. Let b0 + b1 x be any line.
For each i define ŷi = b0 + b1 xi. The “hat” is read “predicted” or “fitted”, and ŷi is the fitted value of
y for this particular line. For each i define ei, the ith residual, by
ei = yi − (b0 + b1 xi ) = yi − ŷi
This equation is just another instance of our fundamental equation:
observation = fitted model + residual
It will be impossible to choose a line so that all the values of ei are simultaneously small (unless
the data points are collinear). Various values of b0 , b1 might make some values of ei small while
making others large. While there are other choices for fitting the line, we will again minimize the
sum of squares of the residuals. Namely, our recipe for fitting a line is as follows:
Given (x1, y1), . . . , (xn, yn), find b0 and b1 such that if f(x) = b0 + b1 x and ei = yi − f(xi),
then Σ_{i=1}^{n} ei² is minimized.
We call Σ_{i=1}^{n} ei² the sum of squares of residuals and denote it SSE (for sum of squares of
error). The resulting line is called the regression line or least-squares line. Before we discuss this
solution, we show how to use R to find the regression line using the corrosion data that we started
the section with.
> lrust = lm(loss ~ Fe, data = corrosion)
> lrust
Call:
lm(formula = loss ~ Fe, data = corrosion)
Coefficients:
(Intercept)           Fe
        130          -24
How are b0 and b1 computed? Notice that the function that we wish to minimize, SSE, is a
function of two unknowns b0 and b1 . To find the values of b0 and b1 that minimize it, we need to
use calculus. Specifically, we compute the partial derivatives of SSE with respect to b0 and b1 ,
set them equal to 0, and solve. The resulting values give the minimum. While we never have to
compute b0 and b1 by hand, formulas for computing them are
b1 = Σ_{i=1}^{n} (xi − x̄) yi / Σ_{i=1}^{n} (xi − x̄)²

b0 = ȳ − b1 x̄
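A sketch of these formulas in Python (lm does this in R), using a made-up data set that lies exactly on the line y = 1 + 2x so that the answer is known in advance:

```python
# Computing b0 and b1 directly from the least-squares formulas above.
x = [0.0, 1.0, 2.0, 3.0, 4.0]   # hypothetical data on the line y = 1 + 2x
y = [1.0, 3.0, 5.0, 7.0, 9.0]

n = len(x)
xbar = sum(x) / n
ybar = sum(y) / n

b1 = sum((xi - xbar) * yi for xi, yi in zip(x, y)) / sum((xi - xbar) ** 2 for xi in x)
b0 = ybar - b1 * xbar

print(b0, b1)   # recovers intercept 1 and slope 2
```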
While these equations will always produce a line, it is important to remember that a line might be
a really bad model for the data at hand. The following graphs give examples due to Anscombe of
four different data sets that have exactly the same regression line. It is clear that the linear model
could only be appropriate for the first.
[Figure: “Anscombe's 4 Regression data sets” – four scatter plots (y1 vs x1, y2 vs x2, y3 vs x3, y4 vs x4), each with x from 5 to 15 and y from 4 to 12, all sharing the same regression line]
Questions
Q8.1 For three of Anscombe’s examples, the linear model presented in this section is clearly inappropriate. What kind of a model would you suggest using in each case?
Q8.2 The dataframe mentrack in the Stob package gives the world record for men in track races
of various distances. Fit a linear model that predicts the time from the distance. While the
fit looks pretty good, explain why a linear model is inappropriate for these data.
9 Properties of random variables
We have decided to model our data production process using probability models. In the case that
we are sampling from a finite population, our model is that each of the possible samples that might
be obtained is equally likely. In the case that we are modeling a real-world process, we use a random
variable to represent the process. In this section, we define the notion of mean and standard
deviation of a random variable.
Remember that a random variable is a random process with a numerical outcome. A random variable
is either discrete or continuous according to the kinds of values that it produces. In each case
however we need a model for the random variable to compute probabilities concerning the random
process.
Discrete random variables
In the case that a random variable is discrete, we do not use a density function to model it but
rather we model the probabilities involved by specifying the probability of each outcome. Instead
of a density function, in other words, we have a mass function. A probability mass function for a
discrete random variable X is a function p such that for every possible x, p(x) = P(X = x). For
discrete random variables with a small finite number of outcomes, we usually specify p with a table.
For example, if X is the result of rolling two dice and adding the numbers that occur on the faces,
the probability mass function for X is easily seen to be given by
x      2     3     4     5     6     7     8     9    10    11    12
p(x)  1/36  2/36  3/36  4/36  5/36  6/36  5/36  4/36  3/36  2/36  1/36
It’s easy to see why p is called a mass function. We think of the total probability as a mass of one
and that mass is distributed among the possible outcomes in proportion to the the probability of
their occurrence. The mean and variance of a random variable with mass function p are given by

µX = Σ x p(x)

σX² = Σ (x − µ)² p(x)
These formulas make sense from the interpretation of mean as center of mass. Here is a different
argument that makes this definition plausible. Consider first the following example. Suppose a
student has taken 10 courses and received 5 A's, 4 B's and 1 C. Using the traditional numerical
scale where an A is worth 4, a B is worth 3 and a C is worth 2, what is this student's GPA (grade
point average)? The first thing to notice is that (4 + 3 + 2)/3 = 3 is not correct. We cannot simply add
up the values and divide by the number of values. Clearly this student should have a GPA that is
higher than 3.0, since there were more A's than C's.
Consider now a correct way to do this calculation and some algebraic reformulations of it.
GPA = (4 + 4 + 4 + 4 + 4 + 3 + 3 + 3 + 3 + 2)/10 = (5 · 4 + 4 · 3 + 1 · 2)/10
    = (5/10) · 4 + (4/10) · 3 + (1/10) · 2
    = 4 · (5/10) + 3 · (4/10) + 2 · (1/10)
    = 3.4
The key observation here is that GPA is a sum of terms of the form
(grade) × (proportion of courses in which the student got that grade).
Since probability of an outcome is simply the proportion of times that outcome would occur in
the long-run, the formula for the mean is clearly the right one and it computes the average of all
outcomes if the random process is repeated indefinitely.
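The same weighted-sum recipe applied to the two-dice mass function above, as a Python sketch:

```python
# The mean and variance of the two-dice total, computed from its pmf.
from fractions import Fraction

weights = {2: 1, 3: 2, 4: 3, 5: 4, 6: 5, 7: 6, 8: 5, 9: 4, 10: 3, 11: 2, 12: 1}
pmf = {x: Fraction(w, 36) for x, w in weights.items()}

assert sum(pmf.values()) == 1   # the total mass is 1

mu = sum(x * p for x, p in pmf.items())              # Σ x p(x)
var = sum((x - mu) ** 2 * p for x, p in pmf.items()) # Σ (x − µ)² p(x)

print(mu, var)   # 7 and 35/6
```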
Continuous random variables
In the case that X is a continuous random variable, we use a density function to model X. So
for example if X is a continuous random variable, a particular Weibull density might be the best
model for X. (For a given random process, choosing such a model is not usually very easy as we
saw in the Idaho wind speed example.) The density associated with X is called its probability
density function. The mean of a probability density function f for a random variable X is given
by the following:
µX = ∫_{−∞}^{∞} x f(x) dx
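As a quick numerical sanity check of this definition, here is a Python sketch approximating the mean of the exponential density with λ = 0.1 (the exact answer is 1/λ = 10):

```python
# Approximating µ = ∫ x f(x) dx for f(x) = 0.1*exp(-0.1x) by the midpoint
# rule on [0, 400]; the tail beyond 400 contributes almost nothing.
import math

lam = 0.1

def f(x):
    return lam * math.exp(-lam * x)

steps = 400000
h = 400 / steps
mu = sum((i + 0.5) * h * f((i + 0.5) * h) * h for i in range(steps))

print(mu)   # close to 1/0.1 = 10
```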
It is easy to see that the justification for this definition is the same as that for a discrete random
variable –the mean is simply the theoretical long-run average of the process which is consistent
with our definition of the mean of a sample. You can also note that the mean is the center of mass
of the density f. Similarly the variance σX² of a continuous random variable is given by

σX² = ∫_{−∞}^{∞} (x − µ)² f(x) dx

and the standard deviation σX is its square root.

10 Sampling Distributions
In the examples in class on Tuesday, we noticed that different samples give rise to different values
of the statistics that we are interested in and so give us different estimates of the parameter that
we are attempting to estimate. That’s too bad. It means that not only will our estimate often be
wrong but we will never know whether we had a “good” sample or rather one of the very misleading
ones. While we won’t know whether any particular sample is good, there are cases in which we
can make statements about what kinds of samples might arise. In particular, thinking about the
sample mean as a random variable, we can make statements about the probability of seeing samples
of a given sort.
Suppose that we have a population from which we are choosing a random sample of size n and
from that random sample computing a statistic (for example ȳ or s). If we were to list all possible
samples of size n from that population and compute the value of the statistic for each, we could investigate the distribution of these values. The resulting distribution of values is called the sampling
distribution of the statistic.
Let’s look at a (really small) example. Suppose that the population consists of just five individuals
(say a starting basketball team!) and the variable of interest is the height of each individual. Let
us consider samples of size n = 2. (This of course is not a realistic example. In the real world, we
would never choose just a sample to study a population of size 5!) Suppose that our population
values are 67, 68, 71, 71, 74. (These are the heights of the Calvin Women’s starting basketball
team.) The mean here is 70.2 inches. Since this is a population mean, we write µ = 70.2. If
we choose a random sample of two of these women and compute ȳ for each, we get the following
possible numbers, each of which is equally likely by our sampling procedure.
Sample    ȳ
67, 68    67.5
67, 71    69.0
67, 71    69.0
67, 74    70.5
68, 71    69.5
68, 71    69.5
68, 74    71.0
71, 71    71.0
71, 74    72.5
71, 74    72.5
There are a few things that we can say about these samples. First, none of them get the mean
exactly right! This isn’t a surprise. Second, there are some bad samples (estimates off by more
than 2 inches) and there are some samples that we would consider pretty good (within an inch of
the true value).
Obviously if the population is very large and the sample is even moderately large (e.g., n = 30),
there would be far too many samples to do the above analysis. In any case, such an analysis is not
possible in general since we don’t have access to the whole population. This is even more obviously
the case when we are considering a process modeled by a density instead of a finite population. In
that case there are infinitely many possible samples so we would need some mathematical result to
consider all possible samples. But in many situations, depending on the kind of population that
we are considering and the statistic we are computing, we can at least say something about the
distribution of the possible values of the statistic. This is particularly true about the sample mean.
The mean of the sample means
In the basketball team example above, the mean of the 10 different sample means is 70.2, precisely
the same as the population mean. This isn’t an accident but rather an important property of the
sample mean that is true in every situation.
The mean of all possible sample means of samples of size n chosen from a population with
mean µ is µ.
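This property can be checked directly on the basketball example by enumerating all ten samples (a Python sketch):

```python
# Averaging the sample means over all 10 samples of size 2 recovers the
# population mean exactly.
from itertools import combinations

population = [67, 68, 71, 71, 74]
mu = sum(population) / len(population)   # 70.2

sample_means = [sum(s) / 2 for s in combinations(population, 2)]
mean_of_means = sum(sample_means) / len(sample_means)

print(mu, mean_of_means)   # both 70.2
```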
At the risk of overloading the notation, we could write µȲ = µ. What this property says is that
on average, the sample mean does not overestimate or underestimate the population mean. Of
course any particular sample could be way off, but at least using the sample mean does not have
a bias towards being too large or too small. Remember though that this property assumes that
all possible samples are equally likely (that is we are considering a simple random sample) and is
really a statement about the mean of the random variable Ȳ .
We should point out that this property, though a nice one and seemingly obvious, is not true
for all statistics. For example, we might think that the sample standard deviation s is a good way
to estimate the population standard deviation σ but the average of all possible standard deviations
is not necessarily the population standard deviation. (In fact, on average s underestimates σ.)
The standard deviation of the sample means
The standard deviation is a measure of spread. We would like the standard deviation of the sample
means to be small since that would mean that the sample mean is usually close to its mean (which
is the population mean). In fact, the standard deviation of all possible sample means is related to
the standard deviation of the population and the size of the sample in the following way
The standard deviation of all possible sample means of samples of size n chosen from a
population with standard deviation σ is σ/√n. That is

σȲ = σ/√n
What this fact says is that the precision of our estimate of µ increases by a factor of 2 (on average)
if we increase the sample size by a factor of 4. This fact says that if we have some information
about σ, we have information about how close ȳ is likely to be to µ. Of course σ is usually unknown
just as µ is.
To summarize, if the mean of our population (or process modeled by a density) is µ and the standard
deviation of the population is σ, then the distribution of all possible sample means has mean µ
and standard deviation σ/√n. We’ll have more to say about the distribution of the sample mean,
but we close this section with a practical example. Let’s consider the problem of estimating the
average height (µ) of all Calvin students. The population here is of size about 4,000 and there is
no way that we could compute µ. Suppose that we are able to choose a random sample of size 100
of Calvin students and we compute ȳ for this sample. How far is ȳ likely to be from µ? We could
use our second fact above but of course this would require us to know σ. While we don’t know σ,
in this case it is reasonable to believe that σ ≤ 6 and more likely that σ ≤ 5. If σ were larger than
6, many Calvin students would have to have heights that were more than a foot above and below
the average height. This just doesn’t happen (except for a few outliers). If σ ≤ 6 and the sample
has size 100, the standard deviation of the sample mean is σ/√n ≤ 0.6. What this means is that
typically, ȳ is about 0.6 in. away from µ and quite often within an inch of µ. Of course, to get a
better estimate, we simply would have to increase the sample size.
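These two facts are easy to check by simulation. Here is one sketch in R; the population used (normal with mean 69 and standard deviation 6, samples of size 100) is a hypothetical choice for illustration, not data from the text.

```r
# Simulate the sampling distribution of the sample mean.
# Hypothetical population: normal with mean 69 and sd 6; samples of size n = 100.
sampleMeans = replicate(10000, mean(rnorm(100, mean = 69, sd = 6)))
mean(sampleMeans)   # close to the population mean, 69
sd(sampleMeans)     # close to sigma/sqrt(n) = 6/10 = 0.6
```

Increasing the sample size inside rnorm from 100 to 400 should cut sd(sampleMeans) roughly in half, matching the factor-of-2 claim above.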
Bad sampling procedures
All of this discussion of sampling distributions is predicated on the assumption that we are sampling
by a random process. For a finite population, we assume that the samples are produced by simple
random sampling. For a process that is modeled by a density, we assume that the data results from
independent trials of the process modeled by that density. If these assumptions are not true, then
we cannot use the above results to say anything about the sampling distribution. This is often
forgotten by people who try to make inferences about the population from a sample that was not
produced by a careful random process. In those cases of course the sample mean ȳ still describes
the sample but we often can’t say much about µ using ȳ.
Questions
Q10.1 This applet http://onlinestatbook.com/stat_sim/sampling_dist/index.html can be used
to simulate samples of various sizes from different distributions. You can choose the population, the sample size, and the statistic and investigate the properties of the mean and
standard deviation of the statistic by simulating up to 10,000 possible samples. (10,000 isn’t
infinity but it’s close!) Use this applet to verify empirically the properties of the sample mean
described above.
Q10.2 Use the same applet to investigate the properties of the distribution of the sample standard deviation. In particular, compare the mean of that distribution to the actual standard
deviation of the population. What do you notice?
11 Evaluating the fit of the least squares line
Consider the corrosion data. We fit a line loss = b0 + b1 Fe to this data to attempt to model the
relationship between iron content and material loss.
> rustmodel = lm(loss ~ Fe, data = corrosion)
> rustmodel

Call:
lm(formula = loss ~ Fe, data = corrosion)

Coefficients:
(Intercept)           Fe
        130          -24
> xyplot(loss ~ Fe, data = corrosion, type = c("p", "r"))

[Scatterplot of loss (about 90 to 130) versus Fe (0.0 to 2.0) with the fitted regression line.]
Obviously the line fits fairly well. But how do we measure that?
Since the line that we choose minimizes SSE, the sum of squares of residuals, we should compute
that number. We have
> sum(residuals(rustmodel)^2)
[1] 103
This number does not tell us much. Obviously, if SSE = 0 we would have a perfect fit, but
otherwise the absolute size of SSE is too affected by the units involved (in this case the units of
the loss variable) and by the number of data points we happen to use (more data points means
more residuals to square and sum). To account for the number of data points, we compute a “root
mean squared residual.” This is usually denoted by se as it is the analogue of the standard deviation
for the model with only a term for the mean. We compute it this way:

se = √( Σ eᵢ² / (n − 2) )
Note that the n − 1 of the standard deviation has now become n − 2 even though in both cases
we were expecting an n in keeping with interpretation of this quantity as an average. Again, we’ll
worry about that later. How should we interpret se ? Just as s is a “typical” or “average” deviation
from the mean, se is a typical residual and gives a measure of how far away from the regression
line typical points are. We could easily compute se from the formula above and the output of the
model as
> n = nrow(corrosion)
> se = sqrt(sum(residuals(rustmodel)^2)/(n - 2))
> se
[1] 3.06
What does the value of se tell us? Here it says that the typical residual is about 3.06. That is,
we could expect to see residuals smaller and larger than this number. The number se gives us a
sense of the prediction error inherent in our model. (But be careful, we are predicting the past,
not the future!) In particular applications, we would need to decide if the fact that our model has
this much error is unacceptable to us.
We actually don’t need to compute the number se . R computes it for us and perhaps the easiest
way to find it is to use the summary function:
> summary(rustmodel)

Call:
lm(formula = loss ~ Fe, data = corrosion)

Residuals:
   Min     1Q Median     3Q    Max
-3.798 -1.946  0.297  0.992  5.743

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   129.79       1.40    92.5  < 2e-16
Fe            -24.02       1.28   -18.8  1.1e-09

Residual standard error: 3.06 on 11 degrees of freedom
Multiple R-squared: 0.97,  Adjusted R-squared: 0.967
F-statistic: 352 on 1 and 11 DF,  p-value: 1.06e-09
There’s a boatload of numbers in this output – you can find the slope and the intercept in the
column labeled Estimate. The number se is called the residual standard error by R and you
can see that it is 3.06 (which agrees with our computation). The somewhat mysterious phrase “11
degrees of freedom” refers to the fact that n − 2 (which is 11 in this case) is used in the denominator
of the expression for se .
Another measure of the utility of this linear model is the number labeled Multiple R-squared
in the output of summary. This number is a percentage — 97% in this case — and is computed
as follows. We already know that the sum of squares of residuals is 102.85. On the other hand,
consider the model without Fe – in other words consider the model in which we predict that all the
values of loss are simply the mean. R can be used to analyze this model:
> meanmodel = lm(loss ~ 1, data = corrosion)
> ssresids = sum(residuals(meanmodel)^2)
> ssresids
[1] 3397
Notice that the sum of squares of the residuals in the model using the mean alone is quite large
(3396.617) compared to the sum of squares of residuals of the model using the explanatory variable
Fe (102.85). In other words, the availability of the explanatory variable “explained” or accounted
for much of the variation in the response variable that the means model could not. In fact we express
this improvement as a percentage. R summarizes this information in an analysis of variance table:
> anova(rustmodel)
Analysis of Variance Table

Response: loss
          Df Sum Sq Mean Sq F value  Pr(>F)
Fe         1   3294    3294     352 1.1e-09
Residuals 11    103       9
The column to read here is the column labeled Sum Sq. The row labeled Residuals we recognize
as the sum of squares of residuals of the linear model. The sum of the two entries in this column
is simply the sum of squares of the means model. Therefore the entry in the row labeled Fe is the
reduction in the sum of squares of residuals accounted for by adding Fe to the model. We usually
call the sum of squares of the residuals for the means model SSTot since it is the total sum of
squares that need to be explained by any model beyond the constant one.
Now we can finally define R².

R² = (SSTot − SSE) / SSTot
In other words, R² is the percentage of the total variation of y (as measured by sum of squares)
that can be explained by including x in the model. We even read R² this way. In this case we
would say something like 97% of the variation in material lost is “explained” by the iron content
of the bars. What is a good R²? That depends on the situation. In physics laboratories, where
we try to control all variables except the explanatory one, R² can often be as high as 99% or even
higher. On the other hand, in social scientific contexts where we know that our variable x is only
one part of the explanation of y, we might be very happy with an R² of 30%.
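The definition can be checked directly from the two sums of squares computed above, assuming rustmodel and meanmodel have been fit as earlier in this section:

```r
# R-squared from its definition, using the two models fit above
SSE = sum(residuals(rustmodel)^2)     # linear model: about 103
SSTot = sum(residuals(meanmodel)^2)   # means-only model: about 3397
(SSTot - SSE) / SSTot                 # about 0.97, matching Multiple R-squared
```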
The two numbers se and R² give quantitative information about the goodness of fit of the line to
the data. There is also one plot that can give similar information and can also indicate whether a
line might not even be appropriate to fit. In the following graph, we plot the residuals of the model
against the x-values – we could also plot them against the fitted values ŷ to get the same information.
> xyplot(residuals(rustmodel) ~ Fe, data = corrosion)

[Plot of residuals(rustmodel) versus Fe (0.0 to 2.0); the residuals scatter between about −4 and 6 with no apparent pattern.]
What we hope to see is what we actually do see in this plot – the residuals look to be “randomly”
scattered around the line y = 0. In particular there is no obvious pattern in them. A pattern would
suggest that the relationship between y and x is not linear. To see an example, consider the data
on world records in track.
> trackmodel = lm(Seconds ~ Meters, data = mentrack)
> xyplot(residuals(trackmodel) ~ Meters, data = mentrack)

[Plot of residuals(trackmodel) versus Meters (0 to 10000); the residuals range from about −10 to 20 and follow a clear curved pattern.]
It’s obvious from this plot that there is a strong relationship between the residuals (and thus the
Seconds variable) and the Meters variable that is not captured by the linear nature of our model.
In fact the shape of this residual plot suggests the kind of model that we should be looking for.
There are other more subtle features of the residual plot that also give warning signs. In the data
on the braking distance of cars, we see the following residual plot:
> carsmodel = lm(dist ~ speed, data = cars)
> xyplot(residuals(carsmodel) ~ speed, data = cars)

[Plot of residuals(carsmodel) versus speed (5 to 25); the residuals range from about −20 to 40 and spread out as speed increases.]
The feature of interest here is that the residuals get bigger as x increases. While this doesn’t
argue against using a line to fit the relationship, it does say that there is something quantitatively
different happening at larger x values. One thing at least that this does is give more weight in the
regression to larger x values since the residuals that need to be minimized can be quite large in
that region. It’s not clear that this is desirable and in fact later we will think of ways to minimize
this effect.
Questions
Q11.1 Use the methods of this section to assess the fit of the model that predicts Calvin seniors’
GPAs from their ACT scores.
Q11.2 The dataset faithful which is supplied with the base R release has data on the eruption
time of the Old Faithful geyser. There is an obvious relationship between the length of an
eruption and the waiting time to the next eruption. Is a linear model a good model for this
relationship?
12 The normal model
Often, a variable of interest to us has a distribution that has a unimodal and symmetric shape. We
often recognize this most directly when we draw a histogram. For many of these kinds of variables,
one family of density functions has proved particularly useful in providing models that fit well. The
normal density function is given by
f(x) = (1/(√(2π) σ)) e^(−(x−µ)²/(2σ²)),   −∞ < x < ∞.
Notice that this density has two parameters, µ and σ. The mean and standard deviation of a
normal density are µ and σ so that the parameters are aptly, rather than confusingly, named.
R functions dnorm(x,mean,sd), pnorm(q,mean,sd), rnorm(n,mean,sd), and qnorm(p,mean,sd)
compute the relevant values. Of course normal is a poor choice of terms – other variables aren’t
abnormal! Some physicists and engineers therefore use the term Gaussian (after Gauss who used
this model extensively). The special case of µ = 0 and σ = 1 is called the standard normal
density. A graph of the standard normal density is below.
> plotDist("norm", params = list(mean = 0, sd = 1))

[Graph of the standard normal density on −4 to 4; the density peaks at about 0.4 at 0.]
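As a quick illustration of the four R functions mentioned above, each with the default mean 0 and sd 1:

```r
# The four normal-distribution functions (standard normal by default)
dnorm(0)       # density at 0, about 0.399
pnorm(1.96)    # P(Z <= 1.96), about 0.975
qnorm(0.975)   # inverse of pnorm, about 1.96
rnorm(3)       # three random draws from the standard normal
```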
Notice the following important characteristics of this density: it is unimodal, symmetric, and is
positive for all real values both positive and negative. The standard normal density is enough to
understand all of the normal distributions due to the following lemma.
Lemma 12.1. If X is normal with parameters µ and σ, then the random variable Z = (X − µ)/σ
is standard normal.
Proof. To see this, we show that P(a ≤ Z ≤ b) is computed by the integral of the standard normal
density function.

P(a ≤ Z ≤ b) = P(a ≤ (X − µ)/σ ≤ b) = P(µ + aσ ≤ X ≤ µ + bσ)
             = ∫ from µ+aσ to µ+bσ of (1/(√(2π) σ)) e^(−(x−µ)²/(2σ²)) dx .

Now in the integral, make the substitution u = (x − µ)/σ. We have then that

∫ from µ+aσ to µ+bσ of (1/(√(2π) σ)) e^(−(x−µ)²/(2σ²)) dx = ∫ from a to b of (1/√(2π)) e^(−u²/2) du .

But the latter integral is precisely the integral that computes P(a ≤ U ≤ b) if U is a standard
normal random variable.
The normal distribution is used so often that it is helpful to commit to memory certain important
probability benchmarks associated with it.
The 68–95–99.7 Rule
If X is a normal random variable with mean µ and standard deviation σ, then
1. P(µ − σ < X < µ + σ) ≈ 68%
2. P(µ − 2σ < X < µ + 2σ) ≈ 95%
3. P(µ − 3σ < X < µ + 3σ) ≈ 99.7%.
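These benchmarks can be verified with pnorm; since Z = (X − µ)/σ is standard normal, it suffices to check the standard normal:

```r
# Verify the 68-95-99.7 rule for the standard normal
pnorm(1) - pnorm(-1)   # about 0.683
pnorm(2) - pnorm(-2)   # about 0.954
pnorm(3) - pnorm(-3)   # about 0.997
```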
Example 12.2. In 2000, the average height of a 19-year-old United States male was 69.6 inches.
The standard deviation of the population of males was 5.8 inches. The distribution of heights of
this population is well-modeled by a normal distribution. Then the percentage of males within 5.8
inches of 69.6 inches was approximately 68%. In R,
> pnorm(69.6+5.8,69.6,5.8)-pnorm(69.6-5.8,69.6,5.8)
[1] 0.6826895
It turns out that the normal distribution is a good model for many variables. Whenever a variable has a unimodal, symmetric distribution in some population, we tend to think of the normal
distribution as a possible model for that variable. For example, suppose that we take repeated measures of a difficult to measure quantity such as the charge of an electron. It might be reasonable
to assume that our measurements center on the true value of the quantity but have some spread
around that true value. And it might also be reasonable to assume that the spread is symmetric
around the true value with measurements closer to the true value being more likely to occur than
measurements that are further away from the true value. Then a normal random variable is a
candidate for modeling the measurements in this case. Indeed, the normal model is often used as
a model of measurement error. (We’ll take a look at measurement error more systematically later
in the course.)
Of course not every unimodal symmetric distribution has exactly the shape of a normal distribution.
For example, the normal model implies that 68% of the observations would be within one standard
deviation of the mean. There are unimodal distributions that have “heavier” or “lighter” tails than
that.
There is one situation in which the normal model naturally arises that makes it a very important
model for applications. This fact is so important that it is summarized in a named theorem.
Theorem 12.3 (Central Limit Theorem). The distribution of the sample mean of all random
samples of size n from any population (or random process), approaches a normal distribution as n
approaches infinity.
We already know that the distribution of the sample mean has mean µ and standard deviation
σ/√n. This central limit theorem says that if the sample is large enough, then the distribution of
the sample mean is approximately normal. That means that we can use the normal distribution
to make approximate probability statements about the sample mean. In fact more is true. If the
population variable is normal to begin with, the distribution of the sample mean is always exactly
normal. (This fact isn’t especially important however since no population is exactly normal.)
The central limit theorem allows us to say some things about how close ȳ is likely to be to µ. For
example, if n is large enough, ȳ is within 1.96 ∗ σ/√n of µ in 95% of all possible samples.
Example 12.4. Consider again the distribution of heights of adult males which we know to have
µ = 69.6 and σ = 5.6 inches. Then if we take a sample of size 100, we know that ȳ for this sample
has a 95% chance of being within 1.96 ∗ 0.56 = 1.1 inches of the mean. In other words, such a
sample will usually produce a sample mean that is within an inch of the population mean.
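A simulation sketch of this example; it assumes, for illustration, that heights are exactly normal with µ = 69.6 and σ = 5.6:

```r
# How often is the mean of a sample of 100 heights within 1.1 inches of mu?
# Assumes a normal population with mean 69.6 and sd 5.6 (as in Example 12.4).
ybar = replicate(10000, mean(rnorm(100, mean = 69.6, sd = 5.6)))
mean(abs(ybar - 69.6) <= 1.1)   # close to 0.95
```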
Of course it is not immediately clear how results like that of the example help us since it seems to
require us to know σ to make statements about how close µ is to ȳ. Remember, we just have a
sample and don’t even know µ, much less σ. There’s a way around this problem that was discovered
in 1908. We’ll get to that later.
Questions
Q12.1 The SAT Mathematics test scores were based on a scale set to have mean 500 and standard
deviation 115 and also to be normally distributed. With this model
(a) What percentage of test takers are expected to score below 400?
(b) What percentage of test takers are expected to score above 700?
(c) What are the SAT scores of the middle 50% of all test takers?
Q12.2 In the SAT example above, SAT scores are actually discrete – they are in intervals of 10
points so 500 and 510 are possible scores but 507 is not. How can we use the continuous
model above to compute the percentage of test takers expected to score exactly 500?
Q12.3 If we choose a random sample of 25 SAT test takers, what is the probability that the average
SAT score of that sample is above 525?
13 Fitting non-linear models
Suppose that we have quantitative explanatory (x) and response (y) variables but we have reason
to believe that the relationship between the two is not linear. That is, we would like to fit a function
f to the data so that y ≈ f (x) but f is no longer a line. This may be because plotting indicates to
us that a non-linear function is appropriate or because we have theoretical reasons to expect the
form of the model to be nonlinear.
For example, in the world record track data, we certainly believe that the relationship between
the length of the race and the world record time is not linear. This is true because we cannot
expect individuals to run longer races at the same rate of speed as shorter races. If y is record time
(in seconds) and x is distance of race (in meters) we might suppose that the appropriate function
to fit is

y = b0 x^b1
where b0 and b1 are some constants to be determined. This example is typical – we normally have
a family of functions in mind and we need to choose certain parameters to fit the particular data
that we have. It is important to realize that there is no essential difference between this problem
and the linear curve-fitting problem. Once we choose our criterion for evaluating the fit, we simply
need to find the values of b0 and b1 to meet the criterion. We actually will use exactly the same
criterion – we will choose the values of b0 and b1 to minimize the sum of squares of residuals. That
is we have

observation = fitted model + residual
yᵢ = ŷᵢ + eᵢ

where ŷᵢ is the predicted value from the fitted model and we choose our constants b0, b1 to minimize
SSE = Σ eᵢ².
While this setup looks exactly the same as the linear model case, there is a crucial difference.
In solving for b0 and b1 in the linear model case, it turns out that the crucial step (finding the
zeros of the partial derivatives that result from the minimization problem) is to solve two linear
equations in two unknowns. We can do this easily and exactly. In the case that f is non-linear,
the resulting equations also are non-linear and this means that a different method is needed to
find the appropriate coefficients. In general, we cannot usually solve non-linear equations exactly
but need to resort to some method that produces an approximate solution that is good enough.
These methods are invariably iterative methods that start with an initial guess and refine that
guess in steps until the solution is found. Newton’s method for finding the root of a function of
a single variable is the canonical example of such a method and you will recall from calculus that
the first step in Newton’s method was to make a guess as to the root. These iterative methods
usually require a “good enough” first guess or else they do not converge (or sometimes converge to
inappropriate solutions).
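To fix ideas, here is a minimal Newton's method sketch in R; it is purely illustrative (nls uses a more elaborate iterative algorithm), finding a root of a simple function from an initial guess.

```r
# Newton's method for a root of f(x) = x^2 - 2, starting from the guess x = 1
f = function(x) x^2 - 2
fprime = function(x) 2 * x       # derivative of f
x = 1                            # initial guess
for (i in 1:6) x = x - f(x) / fprime(x)
x                                # close to sqrt(2)
```

Starting from a poor guess (say x = -1) this iteration converges to the other root, -sqrt(2), illustrating why iterative methods need a good enough starting point.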
As you might hope for and have come to expect, R has a function that implements one method for
solving this “non-linear least squares” problem. The R function is named nls. We illustrate its use
using the track data.
> nonlinmodel = nls(Seconds ~ A * Meters^B, data = mentrack, start = c(A = 1, B = 1.1))
> nonlinmodel
Nonlinear regression model
  model: Seconds ~ A * Meters^B
   data: mentrack
    A     B
0.084 1.069
 residual sum-of-squares: 168

Number of iterations to convergence: 5
Achieved convergence tolerance: 4.26e-07
Notice that nls expects a function that is written in formula form. The function uses variables
from the data frame and R assumes that all other names are constants to be fitted. Starting
values must be given for all the constants by means of a vector of named numbers.
In this example, we see that R is proposing the model y = 0.084x^1.069 and the value of the sum of
squares at this minimum is 168. Obviously we want to check the fit of the function by viewing a
graph. We know how to plot the data. We need to add a plot of the function y = 0.084x^1.069. An
easy way to do this is to use the makeFun function of the mosaic package that makes a function
from the model object.
> xyplot(Seconds ~ Meters, data = mentrack)
> fit = makeFun(nonlinmodel)
> plotFun(fit(Meters) ~ Meters, data = mentrack, add = TRUE)

[Scatterplot of Seconds (0 to about 1500) versus Meters (0 to 10000) with the fitted curve overlaid.]
Of course over the range of values, it’s difficult to assess the fit using the graph. We can look at
the residuals.
> xyplot(residuals(nonlinmodel) ~ Meters, data = mentrack)

[Plot of residuals(nonlinmodel) versus Meters (0 to 10000); the residuals range from about −8 to 4 and still show a pattern.]
Note that there is still a pattern in the residuals but the pattern is complex. The very shortest
races seem to be hard to integrate into an overall pattern. If you think hard about such races, you
might be able to think of a reason. Of course it may be the case that we simply have the “wrong”
model in f(x) = Ax^B.
Questions
Q13.1 There is another way to fit a nonlinear function to data that has been frequently used with
data like the world record track data. If we take logs of each side of y = Ax^B, we get
ln y = ln A + B ln x. Notice that this is just a linear function of the transformed data.
Compute the values of A and B that result from fitting a line to the transformed data.
Compare to the estimates that we got from nls.
Q13.2 The dataset MicMen.csv in Stob’s data has data on the rate v of a chemical reaction (product
per minute) and the concentration S of the underlying substrate (mM). The nonlinear relationship that governs this is the Michaelis-Menten equation and is given by

v = αS / (β + S)
14 Experiments
In many datasets we have more than one variable and we wish to describe and explain the relationships between them. Often, we would like to establish that the relationship is a cause-and-effect
relationship rather than one that is just accidental.
Observational Studies
The American Music Conference is an organization that promotes music education at all levels. On
their website http://www.amc-music.com/research_briefs.htm they promote music education
as having all sorts of benefits. For example, they quote a study performed at the University of
Sarasota in which “middle school and high school students who participated in instrumental music
scored significantly higher than their non-band peers in standardized tests”. Does this mean that if
the availability of and participation in instrumental programs in a school is increased, standardized
test scores would generally increase? The American Music Conference is at least suggesting that
this is true. They are attempting to “explain” the variation in test scores by the variation in music
participation. The problem with that conclusion is that there might be other factors that cause the
higher test scores of the band students. For example, students who play in bands are more likely
to come from schools with more financial resources. They are also more likely to be in families
that are actively involved in their education. It might be that music participation and higher test
scores are a result of these variables. Such variables are called lurking variables. A lurking variable is any
variable that is not measured or accounted for but that has a significant effect on the relationship
of the variables in the study.
The Sarasota study described above is an observational study. In such a study, the researcher
simply observes the values of the relevant variables on the individuals studied. But as we saw above,
an observational study can never definitively establish a causal relationship between two variables.
This problem typically bedevils the analysis of data concerning health and medical treatment. The
long process of establishing the relationship between smoking and lung cancer is a classic example.
In 1957, the Joint Report of the Study Group on Smoking and Health concluded (in Science, vol.
125, pages 1129–1133) that smoking is an important health hazard because it causes an increased
risk for lung cancer. However, for many years after that the tobacco industry denied this claim. One
of their principal arguments was that the data indicating this relationship came from observational
studies. (Indeed, the data in the Joint Report came from 16 independent observational studies.)
For example, the report documented that one out of every ten males who smoked at least two packs
a day died of lung cancer, but only one out of every 275 males who did not smoke died of lung
cancer. Data such as this falls short of establishing a cause-and-effect relationship however as there
might be other variables that increase both one’s disposition to smoke and susceptibility to lung
cancer.
Observational studies are useful for identifying possible relationships and also simply for describing
relationships that exist. But they can never establish that there is a causal relationship between
variables. Using observational studies in this way is analogous to using convenience samples to make
inferences about a population. There are some observational studies that are better than others
however. The music study described above is a retrospective study. That is, the researchers
identified the subjects and then recorded information about past music behavior and grades. A
prospective study is one in which the researcher identifies the subjects and then records variables
over a period of time. A prospective study usually has a greater chance of identifying relevant
possible “lurking” variables so as to rule them out as explanations for a possible relationship.
One of the most ambitious and scientifically important prospective observational studies has been
the Framingham Heart Study. In 1948, researchers identified a sample of 5,209 adults in the town
of Framingham, Massachusetts (a town about 25 miles west of Boston). The researchers tracked
the lifestyle choices and medical records of these individuals for the rest of their lives. In fact the
study continues to this day with the 1,110 individuals who are still living. The researchers have
also added to the study 5,100 children of original study participants. There is no question that the
Framingham Heart Study has led to a much greater understanding of what causes heart disease
although it is “only” an observational study. For example, it is this study that gave researchers
the first convincing data that smoking can cause high blood pressure. The website of the study
http://www.nhlbi.nih.gov/about/framingham/ gives a wealth of information about the study
and about cardiovascular health.
Randomized Comparative Experiment
If an observational study falls short of establishing a causal relationship and even an expensive
well-designed prospective observational study cannot identify all possible lurking variables, can we
ever prove such a relationship?
The “gold standard” for establishing a cause and effect relationship between two variables is the
randomized comparative experiment. In an experiment, we want to study the relationship
between two or more variables. At least one variable is an explanatory variable and the value of
the variable can be controlled or manipulated. At least one variable is a response variable. The
experimenter has access to a certain set of experimental units (subjects, individuals, cases),
sets various values of the explanatory variables to create a treatment, and records the values of
the response variables. This description closely matches what you have typically done in science
laboratories.
It is important first of all that an experiment be comparative. If we are attempting to establish
that music participation increases grades, we cannot simply look at participators. We need to
compare the achievement level of participators to those who do not participate. Many educational
studies fall short of this standard. A school might introduce a new curriculum in mathematics and
measure the test scores of the students at the end of the year. However the school cannot make the
case that the test scores are a result of the new curriculum — the students might have achieved
the same level with any curriculum. All we are saying here is that if we want to understand the
effect of an explanatory variable x on a response variable y, x must really be a variable rather than
a constant!
In a randomized experiment we assign the observational units to the various treatments at random.
For example, if we took 100 fifth graders and randomly chose 50 of them to be in the band and
50 of them not to receive any music instruction, we could begin to believe that differences in their
test scores could be explained by the different treatments. In the laboratory, if we are testing two
different methods for
Example 14.1. Patients undergoing certain kinds of eye surgery are likely to experience serious
post-operative pain. Researchers were interested in the question of whether giving acetaminophen
to the patients before they experienced any pain would substantially reduce the subsequent pain
and the further need for analgesics. One group received acetaminophen before the surgery but no
pain medicine after the surgery. A second group received no pain medicine before the surgery and
acetaminophen after the surgery. And the third group received no acetaminophen either before or
after the surgery. Sixty subjects were used and 20 subjects were assigned at random to each group.
(Soltani, Hashemi, and Babaei, Journal of Research in Medical Sciences, March and April 2007;
vol. 12, No 2.)
In Example 14.1, the goal of random assignment is to construct groups that are likely to be representative of the whole pool of subjects. If the assignment were left to the surgeons, for example, it
might be the case that surgeons would give more pain medication to certain types of patients and
therefore we wouldn’t be able to attribute the different results to the different treatments.
Example 14.2. The R dataset chickwts gives the weights of chicks who were fed six different diets
over a period of time. The experimenter was attempting to determine which chicken feed caused the
greatest weight gain. Feed is the explanatory variable and there were six treatments (six different
feeds). Weight is the response variable. The first step in designing such an experiment is to assign
baby chicks at random to the six different feed groups. If we allow the experimenter to choose which
chicks receive which feed, she might unconsciously (or consciously) construct treatment groups that
are unequal to start.
Student (W.S. Gosset) was one of the researchers in the early part of the twentieth century who
realized the importance of randomization. One of his influential papers analyzed a large scale study
that was to compare the nutritional effects of pasteurized and unpasteurized milk. In the Spring
of 1930, 20,000 school children participated in the study. Of these, 5,000 received pasteurized milk
each day, 5,000 received unpasteurized milk, and 10,000 did not receive milk at all. The weight
and height of each student were recorded both before and after the trial. Student analyzed the
way in which students were assigned to the three experimental treatments. There were 67 schools
involved and in each school about half the students were in the control group and half received
milk. However each school received only one kind of milk, pasteurized or unpasteurized. This was
the first sort of bias that Student found — he was not convinced that the schools that received
pasteurized milk were comparable to those that received unpasteurized milk. A more important
difficulty was the way in which students were assigned either to the control or milk group within
a school. The students were assigned at random initially, but teachers were given freedom to
adjust the assignments if it seemed to them that the two groups were not comparable to each
other in weight and height. In fact Student showed that this freedom on the part of teachers to
assign subjects to groups resulted in a systematic difference between the groups in initial weight
and height. The control groups were taller and heavier on average than those in the milk groups.
Student conjectured that teachers unconsciously favored giving milk to the more undernourished
students.
Of course assigning subjects to treatments at random does not ensure that the experimental groups
are alike in all relevant ways. Just as we were subjected to sampling error when choosing a random
sample from a population, we can have variation in the groups due to the chance mechanism alone.
But assigning subjects at random will allow us to make probabilistic statements about the likelihood
of such error just as we were able to make confidence intervals for parameters based on our analysis
of sampling error that might arise in random sampling.
Randomized assignment and random samples
We assign subjects to treatments at random so that the various treatment groups will be similar
with respect to the variables that we do not control. That is, we would like the experimental groups
to be representative of the whole group of subjects. In surveys, we choose a random sample from
a population for a similar reason. We hope that the random sample is representative of a larger
population. Ideally, we would like both kinds of randomness in our experiments. Not only do we
ensure that the subjects are assigned at random to treatments, but we would like the subjects to be
chosen at random from a larger population. If this is true, we could more easily justify generalizing
our experimental results to a larger population than the immediate subject pool. In medical and
social scientific experiments, this is almost never the case. In the pain study of Example 14.1, the
subjects were simply all those persons who were operated on at a given clinic in a given period of
time. This issue is particularly important if we try to generalize the conclusions of an experiment
to a larger population.
Example 14.3. The author of these notes helped conduct a study to investigate how people make
probabilistic judgments in situations for which they do not have much data. (Default Probabilities,
Osherson, Smith, Stob, and Wilkie, Cognitive Science, (15), 1991, 251–270.) Subjects were placed
in various experimental groups at random. However the subjects were not chosen at random
from any particular population. Indeed every subject was an undergraduate in an introductory
psychology course at the University of Michigan or Massachusetts Institute of Technology. It is
difficult to make an argument that the results of the paper would generalize to the population of
all undergraduates in the United States let alone to the population of all adults. The MIT students
in particular seemed to have a different set of strategies for dealing with probabilistic arguments.
Other features of a good experiment
In our analysis of simple random sampling from a population, we saw again and again the importance of large samples in getting precise estimates of our parameters. Analogously, if we are
to measure precisely the effect of a treatment, we would like many experimental units in each
treatment group. This principle is known as replication. With a small number of individuals, it
might be difficult to determine whether the differences in response are due only to the treatments
or whether they reflect the natural variation in individuals. The chickwts data illustrate the issue.
While there is definitely some variation between the groups, there is also considerable variation
within each group. Chicks fed meatmeal, for example, have weights spanning most of the range of
the entire experimental group. It is probably the case that the small difference between the
linseed and soybean groups is due to the particular chicks in the groups rather than to the
feed. However, more chickens in each group would help us resolve this issue.
> xyplot(weight ~ feed, chickwts)

[scatterplot omitted: weight for each of the six feeds]

Figure 1: Weights of six different treatment groups of a total of 71 chicks.
In many good experiments one of the treatments is a control. A control generally means a treatment that is a baseline or status quo treatment. In an educational experiment, the control group
might receive the standard curriculum while another group is receiving the supposed improved
curriculum. In a medical experiment, the control group might receive the generally accepted treatment (or no treatment at all if ethical) while another group receives a new drug. In Example 14.1,
the group that received no pre-pain medication is referred to as the control group. The goal of a
control group is to establish a baseline to which to compare the new or changed treatment.
If the subjects are human, the control is often a placebo. A placebo is a “treatment” that is
really no treatment at all but looks like a treatment from the point of view of the subject. In
Example 14.1, all subjects received pills both before and after surgery. But some of these pills
contained no acetaminophen and were inert. Placebos are given to ensure that the placebo effect
is measurable. The placebo effect is the tendency for experimental subjects to be affected by the
treatment even if it has no content. The need for control groups and placebos is highlighted by the
next famous example.
Example 14.4. During the period 1927-1932, researchers conducted a large-scale study of industrial efficiency at the Hawthorne Plant of the Western Electric Company in Cicero, IL. The
researchers were interested in how physical and environmental features (e.g., lighting) affected
worker productivity and satisfaction. Researchers found that no matter what the experimental
conditions were, productivity tended to improve. Workers participating in the experiment tended
to work harder and better to satisfy those persons who were experimenting on them. This feature
of human experimentation — that the experimentation itself changes behavior whatever the treatment — is now called the Hawthorne Effect. (It is now generally accepted that the extent of the
Hawthorne Effect in the original experiments has been significantly overstated by the gazillions
of undergraduate psychology textbooks that refer to it. But the name remains and it makes a nice
story as well as a plausible cautionary tale!)
[diagram omitted: two ways of dividing a plot of land among fertilizers A, B, and C]

Figure 2: Two experimental designs for three fertilizers.
Another feature which helps to ensure that the differences in treatments are due to the treatments
themselves is blinding. An experiment is blind if the subjects do not know which treatment group
they are in. In Example 14.1, no subject knew whether they were receiving acetaminophen
or a placebo. It is plausible that a subject who knew they were receiving a placebo would have a different
(subjective) estimate of pain than one who thought that they might be receiving acetaminophen.
An experiment is double-blind if the person administering the treatment also does not know which
treatment is being administered. This prevents the researcher from treating the groups differently.
It is not always possible or ethical to make an experiment blind or double-blind. But when possible,
blinding helps to ensure that the differences between treatments are due to the treatments which
is always the goal in experimentation.
Blocking
If the experimental subjects are identical, it does not matter which is assigned to which treatment.
The differences in the response variable are likely to be the result of the differences in treatment.
The subjects are not usually identical however or at least cannot be treated identically. So we
would like to know that the differences in the response variable are due to the differences in the
explanatory variable and not any systematic differences in subjects. Randomization is one tool
that we use to distribute such differences equally across the treatments. In some cases however,
our experimental units are not identical or our experiment itself introduces a systematic difference
in the units that is due to something other than the treatment variable. This leads to the notion
of blocking which we illustrate with a classic example.
R.A. Fisher was one of the key early figures in developing the principles of good experimental
design. He did much of this while working at Rothamsted Experimental Station on agricultural
experiments. He studied closely data from experiments that were attempting to establish such
things as the effects of fertilizer on yield. Suppose that we have three unimaginatively named
fertilizers A, B, C. We could divide the plot of land that we are using as in the first diagram of
Figure 2. But it might be the case that the further north in the plot, the better the soil conditions.
In that case, the variation in yield might be better explained (or at least partially explained) by
the location of the plot rather than by fertilizer. In this example, we would say that the effects
of northernness and fertilizer are confounded, meaning simply that we cannot separate them
given the data of the experiment at hand. To separate out the effect of northernness from that of
fertilizer, we could instead divide the patch using the second diagram in figure 2. Of course there
still might be variations in the soil conditions across the three fertilizers. But we would at least
be able to measure the effect of northernness separately from that of fertilizer. In this example,
“northernness” is a blocking variable and our goal is to isolate the variability attributable to
northernness so that we can see the differences between the fertilizers more clearly.
In a medical experiment it is often the case that gender or age are used as blocking variables.
Obviously, we cannot assign individuals to the various levels of these variables at random but it is
plausible that in certain circumstances gender or age can have a significant effect on the response.
If so, it would be useful to design an experiment that allows us to separate out the effects of, say,
gender and the treatment.
When using a blocking variable, it is important to continue to honor the principle of randomization.
Suppose for example that we use gender as a blocking variable in a medical experiment comparing
two treatments. The ideal experimental design would be to take a group of females and assign them
at random to the two treatments and similarly for the group of males. That is, we should randomize
the treatments within the blocks. The resulting experiment is usually called a randomized block
design. It is not completely randomized because subjects in one block cannot be assigned to
another but within a block it is randomized.
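In code, a randomized block design is just the same shuffle performed separately inside each block. A minimal Python sketch (hypothetical subject names, gender as the blocking variable, two treatments):

```python
import random

def randomized_block(blocks, treatments, seed=None):
    """Assign treatments at random within each block; returns subject -> treatment."""
    rng = random.Random(seed)
    assignment = {}
    for members in blocks.values():
        shuffled = list(members)
        rng.shuffle(shuffled)   # randomization happens within the block only
        for i, subject in enumerate(shuffled):
            assignment[subject] = treatments[i % len(treatments)]
    return assignment

blocks = {"F": ["f1", "f2", "f3", "f4"], "M": ["m1", "m2", "m3", "m4"]}
plan = randomized_block(blocks, ["new drug", "control"], seed=7)
```

Each block ends up with an equal split of the treatments, while the choice of which subject gets which treatment remains random.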
It is instructive to compare the randomized block design to stratified random sampling. In each
case, we divide subjects into groups and randomize within these groups. The goal is to isolate
and measure the variability that is due to the groups so that we can measure the variability that
remains.
A special case of blocking is known as a matched pair design. In such an experiment, there
are just two observations in each block (one for each of two treatments). In his 1908 paper,
Student analyzed earlier published data from such an experiment. That data is in the R dataframe
sleep. The two different treatments were two different soporifics (sleeping drugs). There was no
control treatment. The response variable was the number of extra hours of sleep gained by the
subject over his “normal” sleep. There were just 10 subjects and each subject took both drugs (on
different nights). Thus each subject was a block and there was one observation on each treatment
in each block. Student then compared the difference in the two drugs on each patient. Using the
individuals as blocks served to help Student to decide what part of the variation in the response
could be explained by the normal variation between individuals and what could be attributed to
the drugs themselves.
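The analysis of a matched pair design reduces to analyzing the within-block differences. A Python sketch, with the values as recorded in R's sleep dataframe (each subject is a block contributing one observation per drug):

```python
from statistics import mean

# Extra hours of sleep for the same 10 subjects under each soporific,
# as recorded in R's sleep dataframe; each subject is a block of size 2.
drug1 = [0.7, -1.6, -0.2, -1.2, -0.1, 3.4, 3.7, 0.8, 0.0, 2.0]
drug2 = [1.9, 0.8, 1.1, 0.1, -0.1, 4.4, 5.5, 1.6, 4.6, 3.4]

# Blocking by subject: compare the drugs within each person, which removes
# the (large) subject-to-subject variation in normal sleep.
diffs = [b - a for a, b in zip(drug1, drug2)]
avg_gain = mean(diffs)   # average advantage of drug 2 over drug 1
```

Working with `diffs` rather than the two raw columns is exactly what makes the variation between individuals drop out of the comparison.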
In educational experiments, matched pairs are often constructed by finding two students who are
very similar in baseline academic performance. Then it is hoped that the differences between these
students at the end of the experiment are the result of the different treatments.
It is important to remember that block designs are not an alternative to randomization. Indeed,
it is very important that we randomize the assignment to treatments within every block for the
same reasons that randomization is important when we have no blocking variable. Identifying
blocking variables is simply acknowledging that there are variables on which the treatments may
systematically differ.
Experimental Design
In the above sections, we have introduced the three key features of a good experimental design —
randomization, replication, blocking. We’ve illustrated these principles in the case that we have
just one explanatory variable with just a few levels. These principles can be extended to situations
with more than one explanatory variable however. In this book, we will not investigate the problem
of inference for such situations or discuss in detail the issues of experimental design in these cases.
In this section, we look at one example of extending these principles to experiments involving more
than one explanatory variable.
Example 14.5. The R dataframe ToothGrowth contains the results of an experiment performed
on Guinea Pigs to determine the effect of Vitamin C on tooth growth. There were two treatment
variables, the dose of Vitamin C, and the delivery method of the Vitamin C. The dose variable
had three levels (.5, 1, and 2 mg) and the delivery method was by either orange juice or ascorbic
acid. There were 10 guinea pigs given each of the six treatments. The plot below (using coplot())
shows the differences between the two delivery methods and the various dose levels.
[coplot omitted: len vs dose, separate panels for each level of supp (VC and OJ)]

ToothGrowth data: length vs dose, given type of supplement
It appears that both the delivery method and the dose have some effect on tooth growth.
Both the principles of randomization and replication extend to experiments with more than one
explanatory variable. In Example 14.5 for example, it is apparent that the 60 guinea pigs should
have been assigned at random to the six different treatments. And it also is clear that there should
have been enough guinea pigs in each treatment so that the natural variation from pig to pig can
be accounted for.
No blocking variables are described in the tooth growth study but it is often the case that natural
blocking variables can be identified. For example, in the tooth growth study, it might not have
been possible for the same technician to have recorded all the measurements. In that case, it would
not be a good idea for one technician to make all the measurements for the orange juice treatment
while another technician makes all the measurements for the ascorbic acid treatment. The blocking
variable would be the technician and we would attempt to randomize the assignment within each treatment.
Since there were 10 guinea pigs in each of the 6 treatments, two technicians could each measure 5
guinea pigs in each treatment.
15 A categorical explanatory variable
We have already seen how to model the variation in a response variable y if we have no explanatory
variable (there is good reason to use ȳ as the model prediction). We have also seen how to model the
variation in a response variable y if there is just one explanatory variable x – we fit a linear model
or, if necessary, a non-linear function of x using the method of least squares. In this section we
consider the case of a categorical variable. Many engineering experiments are done by comparing
two or a small number of treatments.
In a very simple experiment, 21 rubberbands were assigned at random to two groups. One group was
heated in nearly boiling water for a period of time and the other was left at room temperature.
Each rubberband was then loaded with a fixed mass and the length that it stretched was recorded.
A summary of the data and an appropriate plot are below.
> favstats(stretch ~ treat, rubberbands)
  min  Q1 median  Q3 max mean    sd  n missing
A 217 240    246 253 257  244 11.73 11       0
H 239 249    253 258 269  254  9.92 10       0
> bwplot(stretch ~ treat, rubberbands)

[boxplots omitted: stretch for treatment groups A and H]
It appears that heated rubberbands generally stretch more than unheated rubberbands do. What
model should we use to explain the stretch based on the experimental group? It seems reasonable
that we should model each rubberband by the mean response of its group. In other words, a
reasonable model is that stretch is 244 if the rubberband is in group A and 254 if the rubberband
is in group H. It should not be a surprise that this model minimizes the sums of squares of residuals
and neither should it be a surprise that lm will do the work:
> lrb = lm(stretch ~ treat, rubberbands)
> lrb

Call:
lm(formula = stretch ~ treat, data = rubberbands)

Coefficients:
(Intercept)       treatH
     244.09         9.41
It takes a minute to understand this output. lm wants to report the result with an “intercept” term.
The intercept term is obviously 244.09. What R does is choose one of the levels of the categorical
variable to correspond to the intercept term (just as x = 0 corresponds to the intercept term for
a quantitative variable). The other variable in the model is called treatH and it is an example of
an indicator variable. An indicator variable is a quantitative variable that takes only the values
0 and 1. In this case treatH= 1 if the observational unit is in group H (hence the name of the
variable) and 0 if the object is not in group H (which means in this case that it is in group A). So
the linear model that R actually fit was
yi = b0 + b1 treatH + ei
In other words, the model makes the following predictions:

yi = 244 + 0 ∗ 9.41  if the ith observation is in A
yi = 244 + 1 ∗ 9.41  if the ith observation is in H
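That these group-mean predictions minimize the residual sum of squares is easy to check numerically. A Python sketch with small hypothetical data (not the rubberband measurements) builds the same indicator-variable fit by hand:

```python
from statistics import mean

# Hypothetical responses for two treatment groups (not the rubberband data)
group_a = [10.0, 12.0, 11.0, 13.0]
group_h = [15.0, 14.0, 16.0]

# The least-squares fit with one indicator variable:
b0 = mean(group_a)                   # intercept = mean of the baseline group
b1 = mean(group_h) - mean(group_a)   # indicator coefficient = difference of means

def predict(in_group_h):
    """b0 + b1 * indicator, exactly as in the model above."""
    return b0 + b1 * (1 if in_group_h else 0)

def ssr(center, values):
    """Sum of squared residuals around a single predicted value."""
    return sum((v - center) ** 2 for v in values)
```

Nudging either prediction away from its group mean can only increase the sum of squared residuals, which is the sense in which the group means are the least-squares fit.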
More detail on the linear model is obtained with summary. Notice that R2 is quite low – on the
other hand, this is a model based on just one categorical variable with two categories.
> summary(lrb)

Call:
lm(formula = stretch ~ treat, data = rubberbands)

Residuals:
   Min     1Q Median     3Q    Max
-27.09  -4.50   0.50   7.91  15.50

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   244.09       3.29   74.17   <2e-16
treatH          9.41       4.77    1.97    0.063

Residual standard error: 10.9 on 19 degrees of freedom
Multiple R-squared: 0.17, Adjusted R-squared: 0.126
F-statistic: 3.89 on 1 and 19 DF, p-value: 0.0632
We need not restrict ourselves to two levels. For example, a linear model with the chickwts data
yields
> favstats(weight ~ feed, chickwts)
          min  Q1 median  Q3 max mean   sd  n missing
casein    216 277    342 371 404  324 64.4 12       0
horsebean 108 137    152 176 227  160 38.6 10       0
linseed   141 178    221 258 309  219 52.2 12       0
meatmeal  153 250    263 320 380  277 64.9 11       0
soybean   158 207    248 270 329  246 54.1 14       0
sunflower 226 313    328 340 423  329 48.8 12       0
> lcw = lm(weight ~ feed, chickwts)
> lcw
Call:
lm(formula = weight ~ feed, data = chickwts)

Coefficients:
  (Intercept)  feedhorsebean    feedlinseed   feedmeatmeal    feedsoybean  feedsunflower
       323.58        -163.38        -104.83         -46.67         -77.15           5.33
It takes a minute to see that casein is the level of feed corresponding to the intercept and that
the five variables in the model are indicator variables for the other five feeds. An ANOVA table
gives information on the sums of squares of residuals.
> anova(lcw)
Analysis of Variance Table

Response: weight
          Df Sum Sq Mean Sq F value  Pr(>F)
feed       5 231129   46226    15.4 5.9e-10
Residuals 65 195556    3009
The sum of squares of residuals is about 196,000 while the model has explained about 231,100 of the
total sum of squares. Again, the model simply predicts that a chick fed a given feed will weigh the
same as the average chick in the sample that was fed that feed. This machinery may seem like overkill
for computing means, but viewing the computation as a linear model gives more information.
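The two numbers in the ANOVA table are a decomposition of the total sum of squares. A Python sketch with small hypothetical data (not the chickwts data) verifies that the model (between-group) and residual (within-group) sums of squares add up to the total:

```python
from statistics import mean

# Hypothetical weights in three feed groups (not the chickwts data)
groups = {"a": [5.0, 7.0, 6.0], "b": [9.0, 11.0], "c": [4.0, 4.0, 4.0]}
values = [v for g in groups.values() for v in g]
grand = mean(values)

# Residual SS: deviations of observations from their own group mean
ss_resid = sum((v - mean(g)) ** 2 for g in groups.values() for v in g)
# Model SS: deviations of the group means from the grand mean
ss_model = sum(len(g) * (mean(g) - grand) ** 2 for g in groups.values())
# Total SS: deviations of observations from the grand mean
ss_total = sum((v - grand) ** 2 for v in values)
```

The identity ss_total = ss_model + ss_resid is exactly the split the anova() output reports for the chickwts fit.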
16 Models Galore
We’ve so far considered a model for our response variable y with no explanatory variables, models
with one quantitative explanatory variable x, and models with one categorical explanatory variable.
In this section we extend this to models with any number of explanatory variables both quantitative
and categorical. We will however restrict our attention to linear models (models linear in our
explanatory variables) so that we will fit all our models with lm(). So our models will look like
this:
yi = f (x1 , x2 , . . . , xk ) + ei
where y is a quantitative response variable, f is linear (we’ll define that precisely below after looking
at examples), and ei is the residual. We’ll fit such models in every case by minimizing the sums
of squares of residuals. In these models, means play a crucial role. You will recall that the model
in the case of no explanatory variable is to set ŷi = ȳ. In the case of one categorical variable, the
model sets ŷi to the sample mean of the group of yi . In the linear regression case, means enter the
picture in that the regression line always passes through the point (x̄, ȳ).
By linear model, we will mean that our function f will have the following form
f (x1 , . . . , xk ) = b0 + b1 t1 + b2 t2 + · · · + bm tm
where the terms t1 , . . . , tm are computed entirely from the x1 , . . . , xk with no unknown parameters.
In other words, the constants to be fitted (b0 , . . . , bm ) occur only in separate terms and linearly.
Notice that the terms ti do not have to be linear in the xi – we’ll give examples below as to what
we mean by that. To specify the model then is to list the terms t1 , . . . , tm that you want to include
in the model. We’ll now look at the kinds of terms that can be included in the model.
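In code, evaluating such a model is a single dot product between the fitted constants and the term values. A Python sketch (with hypothetical coefficients and terms) also illustrates the point above: a term like x² is allowed, because the model stays linear in the b's even though the term is not linear in x.

```python
def linear_model(b0, coefs, terms):
    """Build f(x...) = b0 + b1*t1(x...) + ... + bm*tm(x...)."""
    def f(*xs):
        return b0 + sum(b * t(*xs) for b, t in zip(coefs, terms))
    return f

# Hypothetical fit with terms t1 = x and t2 = x**2. The terms are not
# linear in x, but the constants b0, b1, b2 enter separately and linearly.
f = linear_model(1.0, [2.0, 3.0], [lambda x: x, lambda x: x ** 2])
```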
An intercept term
Notice that the first (zeroth) term in the function above is an “intercept”. While the intercept will
always make sense when we define the zero levels of our variables appropriately, another way to describe
this term is as a sort of default value of the response variable when the explanatory variables
are set at some default level. A default level makes more sense for categorical variables – for a
continuous variable x, the default level is x = 0 (which does not always mean much in the situation
we are modeling). Normally lm() will always include an intercept unless we explicitly tell it otherwise.
However if there are no other terms, we need to tell lm() to include it. Hence we fit the mean
model by
> l = lm(GPA ~ 1, data = sr)
> summary(l)

Call:
lm(formula = GPA ~ 1, data = sr)

Residuals:
    Min      1Q  Median      3Q     Max
-1.5164 -0.3394  0.0826  0.4086  0.7796

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   3.2204     0.0136     237   <2e-16

Residual standard error: 0.495 on 1332 degrees of freedom

> anova(l)
Analysis of Variance Table

Response: GPA
            Df Sum Sq Mean Sq F value Pr(>F)
Residuals 1332    327   0.245
Terms for individual variables
If x is quantitative or categorical, we have already seen how to include x in a model as well as
how to interpret the resulting coefficients. But we can include terms for any number of categorical
and/or quantitative variables. Perhaps the simplest case is one of each.
Example 16.1. The dataframe cats that we have looked at before has the heart and body weight
of a bunch of cats (and we think that these variables could be linearly related) as well as the sex
of each cat (which also might have predictive value).
> lcat = lm(Hwt ~ Bwt + Sex, cats)
> summary(lcat)

Call:
lm(formula = Hwt ~ Bwt + Sex, data = cats)

Residuals:
   Min     1Q Median     3Q    Max
-3.583 -0.970 -0.095  1.043  5.102

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  -0.4150     0.7273   -0.57     0.57
Bwt           4.0758     0.2948   13.83   <2e-16
SexM         -0.0821     0.3040   -0.27     0.79

Residual standard error: 1.46 on 141 degrees of freedom
Multiple R-squared: 0.647, Adjusted R-squared: 0.642
F-statistic: 129 on 2 and 141 DF, p-value: <2e-16
Notice that the plus sign on the right hand side of the model equation does not mean addition (we
would have to use the I() operator for that) but rather that there are additive terms for the two
variables listed.
Let’s interpret the model. The model can be written

Ĥwt = −0.42 + 4.08 Bwt − 0.08 SexM
Remember that SexM is an indicator variable that takes the value 1 if the variable Sex is ’M’ and
is 0 otherwise (i.e, if the cat is female). In other words, the prediction of heart weight for two cats
of the same body weights but different sexes differs by 0.08 with the prediction for the female cat
being greater. (That doesn’t mean that female cats have bigger hearts but rather that we think
that they do have slightly bigger hearts after controlling for body weight.)
We can read off model values by plugging values into this equation or we can do it quickly with
makeFun.
> catheart = makeFun(lcat)
> catheart(2.3, "M")
1
8.88
> catheart(2.3, "F")
1
8.96
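The values returned by makeFun are nothing more than the fitted equation evaluated at the inputs. Reproducing the arithmetic directly (in Python, with the coefficients as printed by summary(lcat) above):

```python
# Coefficients as printed by summary(lcat)
b0, b_bwt, b_sexM = -0.4150, 4.0758, -0.0821

def predict_hwt(bwt, sex):
    """Predicted heart weight; the indicator SexM is 1 for males, 0 for females."""
    return b0 + b_bwt * bwt + b_sexM * (1 if sex == "M" else 0)

male = round(predict_hwt(2.3, "M"), 2)    # should match catheart(2.3, "M")
female = round(predict_hwt(2.3, "F"), 2)  # should match catheart(2.3, "F")
```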
You might suppose that a reasonable way to approach this modeling problem would have been to
fit a separate regression equation for each gender of cats. We can do that graphically here
> xyplot(Hwt ~ Bwt, groups = Sex, data = cats, type = c("p", "r"), col = c("red",
+
"blue"))
[scatterplot omitted: Hwt vs Bwt, points and a fitted regression line for each sex]
But notice that our model above did not do this. Rather, our specification of the model ensures that
we fit two lines with the same slope (the coefficient of Bwt does not change) but different intercepts.
In the above, we used one categorical and one quantitative variable. But we can have individual
terms for as many of each kind of variable as we want. (At this point we should mention that there
are dangers in including too many variables in a model – we’ll talk about those dangers later.)
We look at a couple of examples here.
Example 16.2. An industrial weaving machine is fed rolls of wool. There are two types of wool
(A or B) and the machine can be maintained at three different tensions (L, M, H). The wool breaks
occasionally and the number of breaks per loom (which is a fixed length of wool) is reported.
A dataset warpbreaks of 54 observations (9 per treatment) of the numbers of breaks (breaks) is
included with R.
> lwb = lm(breaks ~ wool + tension, data = warpbreaks)
> summary(lwb)

Call:
lm(formula = breaks ~ wool + tension, data = warpbreaks)

Residuals:
   Min     1Q Median     3Q    Max
-19.50  -8.08  -2.14   6.47  30.72

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)    39.28       3.16   12.42  < 2e-16
woolB          -5.78       3.16   -1.83  0.07361
tensionM      -10.00       3.87   -2.58  0.01279
tensionH      -14.72       3.87   -3.80  0.00039

Residual standard error: 11.6 on 50 degrees of freedom
Multiple R-squared: 0.269, Adjusted R-squared: 0.225
F-statistic: 6.14 on 3 and 50 DF, p-value: 0.00123
We can see that we have several indicator variables, one for wool type B (so wool type A is the
default) and two for tension (medium and high tension with low tension being the default). Thus
the intercept corresponds to the prediction for wool type A and for low tension.
> fbreaks = makeFun(lwb)
> fbreaks("A", "L")
1
39.3
A good plot to show the relationship between the two categorical variables and the response variable
is given by
> dotplot(breaks ~ wool | tension, data = warpbreaks)

[dotplots omitted: breaks for wools A and B, one panel for each tension L, M, H]
It’s clear that the linear model reflects both that wool B breaks less often and that both medium and
high tension are better than low tension. What this model has no way of communicating however
is that the relationship between the two wools is different in the different tension conditions.
The cats example has a quantitative and a categorical variable as predictors; the warpbreaks example
has two categorical variables. It is of course possible to have two quantitative predictors as well.
Example 16.3. Suppose that we want to predict the number of runs that a baseball team might
score from the number of hits and homeruns that they have. For data we’ll use all baseball teams and
all seasons in this century!
> lbb = lm(R ~ H + HR, Baseball21)
> summary(lbb)

Call:
lm(formula = R ~ H + HR, data = Baseball21)

Residuals:
    Min      1Q  Median      3Q     Max
-111.31  -20.35    0.73   21.11  133.10

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -388.1120    35.6023   -10.9   <2e-16
H              0.6472     0.0257    25.2   <2e-16
HR             1.1741     0.0588    20.0   <2e-16

Residual standard error: 33.2 on 327 degrees of freedom
Multiple R-squared: 0.827, Adjusted R-squared: 0.826
F-statistic: 781 on 2 and 327 DF, p-value: <2e-16
Of course the fitted model here is no longer a line but rather a plane in three-dimensional
space. The residual standard error here is about 33 runs, which means that the typical deviation
of a team from its predicted number of runs is about 33 runs. Later in the course we will learn how to
answer the question of whether we really want to include these two variables in a model (instead
of just one of them, or a model with even more of them).
But we can begin to see how we might decide this by fitting models with each variable separately.
> lbbH = lm(R ~ H, Baseball21)
> lbbHR = lm(R ~ HR, Baseball21)
> summary(lbbH)

Call:
lm(formula = R ~ H, data = Baseball21)

Residuals:
    Min      1Q  Median      3Q     Max
-141.83  -31.93   -1.32   31.14  156.30

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -443.6695    52.8169    -8.4  1.4e-15
H              0.8224     0.0359    22.9  < 2e-16

Residual standard error: 49.4 on 328 degrees of freedom
Multiple R-squared: 0.616, Adjusted R-squared: 0.614
F-statistic: 525 on 1 and 328 DF, p-value: <2e-16
> summary(lbbHR)

Call:
lm(formula = R ~ HR, data = Baseball21)

Residuals:
    Min      1Q  Median      3Q     Max
-152.96  -36.80   -3.89   34.43  221.37

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 476.0683    16.5750    28.7   <2e-16
HR            1.6805     0.0946    17.8   <2e-16

Residual standard error: 56.9 on 328 degrees of freedom
Multiple R-squared: 0.49, Adjusted R-squared: 0.489
F-statistic: 315 on 1 and 328 DF, p-value: <2e-16
Each of these models has a substantially larger residual standard error than the model with both
variables, so perhaps this is evidence of the utility of the two-variable model.
Terms for interactions
The models with more than one variable that we have considered so far are models in which the
variables do not “interact.” For example in the baseball data, an additional hit leads to a prediction
of .64 additional runs independent of the number of home runs. But it is often the case that variables
interact. For example, in the warpbreaks example we could see from the graph that the effect of
using wool B depended on the tension as well. In this section, we show how to include terms for
the interaction of variables. The simplest way to include a term that accounts for the interaction
of two variable xi and xj is to introduce a linear term for the produce of xi and xj . So for example,
a two variable model with a term for the interaction of the two variables is
ŷ = β0 + b1 x1 + b2 x2 + b3 x1 x2
(Of course there are other ways that the variables could interact but this is the simplest term
consistent with a linear model.)
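To see that an interaction term really is just a product term, here is a small sketch with invented quantitative variables x1 and x2 (the names and numbers are made up for illustration): fitting the explicit product with I(x1 * x2) gives exactly the same coefficients as R's colon notation.

```r
# For quantitative variables, the x1:x2 term is the product x1*x2.
set.seed(5)
x1 <- runif(50)
x2 <- runif(50)
y <- 1 + 2 * x1 - 3 * x2 + 4 * x1 * x2 + rnorm(50, sd = 0.1)
coef(lm(y ~ x1 + x2 + x1:x2))       # colon notation
coef(lm(y ~ x1 + x2 + I(x1 * x2))) # explicit product: identical estimates
```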
Example 16.4. We noticed that the effect of wool type in the warpbreak example depended on
the tension of the machine. We’ll express that interaction using interaction terms. Note that in
the formula form, an interaction term uses the colon (:) instead of a multiplication sign.
> lwb = lm(breaks ~ wool + tension + wool:tension, warpbreaks)
> lwb
Call:
lm(formula = breaks ~ wool + tension + wool:tension, data = warpbreaks)

Coefficients:
   (Intercept)           woolB        tensionM        tensionH
          44.6           -16.3           -20.6           -20.0
woolB:tensionM  woolB:tensionH
          21.1            10.6
Remember that categorical variables are dealt with in our model using indicator variables. We
see that the interactions are dealt with by introducing two new indicator variables. The variable
woolB:tensionM, for example, is equal to 1 if woolB and tensionM are both 1 – that is, in the case
that the wool type is B and the tension is M. To determine the estimate for any combination
of wool and tension, we need only add the relevant terms. For example, with wool B and tension
M our estimate is the sum of 44.56 (intercept), -16.33 (due to wool B), -20.56 (due to tension M)
and 21.11 (due to the interaction of this wool and tension). We can do this with makeFun.
> fwb = makeFun(lwb)
> fwb("B", "M")
1
28.8
Notice that there are six terms in our model and there are six possible conditions to model. Thus
our model is big enough to fit each condition separately. In fact the model predictions are simply
the sample means of each condition. We can get those means directly by
> mean(breaks ~ wool + tension, warpbreaks)
A.L B.L A.M B.M A.H B.H
44.6 28.2 24.0 28.8 24.6 18.8
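We can check this claim without the mosaic package, since warpbreaks is built into R: the model's predictions at each wool/tension combination should equal the sample means of those cells.

```r
# Fit the model with interaction and compare its predictions to the cell means.
lwb <- lm(breaks ~ wool + tension + wool:tension, data = warpbreaks)
grid <- expand.grid(wool = levels(warpbreaks$wool),
                    tension = levels(warpbreaks$tension))
grid$fit <- predict(lwb, newdata = grid)
cellmeans <- with(warpbreaks, tapply(breaks, list(wool, tension), mean))
grid        # model predictions, one per condition
cellmeans   # sample means: the same numbers
```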
We can fit models for interaction terms in cases where some or all of the variables are quantitative
as well.
Example 16.5.
> lcatsint = lm(Hwt ~ Bwt + Sex + Bwt:Sex, data = cats)
> lcatsint
Call:
lm(formula = Hwt ~ Bwt + Sex + Bwt:Sex, data = cats)

Coefficients:
(Intercept)          Bwt         SexM     Bwt:SexM
       2.98         2.64        -4.17         1.68
It’s easiest to interpret this model by noticing that it creates two lines – one predicting heart weight
from body weight for female cats and the other for males. The equations are:

Hwt = 2.98 + 2.64 Bwt                    (female cats)
Hwt = (2.98 − 4.17) + (2.64 + 1.68) Bwt  (male cats)
These are precisely the lines of the graph of Example 16.1. Notice that the lines have different
slopes and different intercepts – we would have gotten the same models if we had fit the males and
females entirely separately.
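We can verify that last claim directly (the cats data are in the MASS package, which ships with R): fitting females and males separately reproduces the intercepts and slopes implied by the interaction model.

```r
library(MASS)  # for the cats data

lcatsint <- lm(Hwt ~ Bwt + Sex + Bwt:Sex, data = cats)
lfem <- lm(Hwt ~ Bwt, data = subset(cats, Sex == "F"))
lmale <- lm(Hwt ~ Bwt, data = subset(cats, Sex == "M"))

coef(lfem)   # matches (Intercept) and Bwt from the full model
coef(lmale)  # matches (Intercept)+SexM and Bwt+Bwt:SexM
```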
Transformation terms
We can also include terms for transformation of quantitative variables in our model. We’ve already
done that by including a term such as log(Meters) in our model of track data. It is quite common,
for example, to include terms such as x2 .
Example 16.6. The R dataset trees has data on the girth, height, and volume of wood in 31
cherry trees. We’d like to predict volume from girth but we realize a linear model is not reasonable.
> ltrees = lm(Volume ~ Girth + I(Girth^2), data = trees)
> ftrees = makeFun(ltrees)
> xyplot(Volume ~ Girth, trees)
> plotFun(ftrees(Girth) ~ Girth, add = TRUE)
[Figure: scatterplot of Volume against Girth for the trees data, with the fitted quadratic curve overlaid.]
Of course R will fit any model we choose no matter how many variables or how poor the fit actually
is. We’re going to have to spend considerable time asking whether the models we consider actually
make sense.
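One quick way to ask whether the quadratic term earns its keep is to compare residual standard errors; trees is built into R, so this can be run directly (a rough check, not a full model assessment):

```r
# Straight line vs. quadratic for predicting Volume from Girth.
lin <- lm(Volume ~ Girth, data = trees)
quad <- lm(Volume ~ Girth + I(Girth^2), data = trees)
summary(lin)$sigma    # residual standard error of the linear fit
summary(quad)$sigma   # noticeably smaller for the quadratic fit
```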
Questions
Q16.1 The following regression (using state-by-state data in the dataset SAT of the mosaic package)
predicts that SAT scores go down as educational expenditures go up:
> lm(sat ~ expend, data = SAT)
Call:
lm(formula = sat ~ expend, data = SAT)

Coefficients:
(Intercept)       expend
     1089.3        -20.9
This is a counterintuitive result, but you might be able to give an explanation for it if you
fit a model with both expend and frac (the percentage of students in the state taking the
SAT). Fit the model and convince yourself that the new model makes more sense!
Q16.2 A relatively old dataset has data on infant mortality in 105 countries. Fit a model that
“explains” infant mortality by the region of the country and the per capita income of the
country. How is infant mortality related to income? How is infant mortality related to
region? (The data are in the faraway package and dataframe infmort.)
17
Coverage Intervals
When we use a number in an engineering process, whether the result of a measurement or a computation, the number is rarely exact. It is almost always an estimate of something. For example, when
we measure the weight of something, the reported weight is never equal to the “true” value but
is an approximation that depends on, among other things, the quality and precision of our scale.
That’s why every laboratory instructor that you have ever had has required you to write uncertainty estimates to accompany the measurement. For example, you might report a measurement
as 2.431 ± 0.001 gm. The form of this expression is
estimate ± uncertainty
Of course there are various kinds of expressions of this form depending on what we are estimating
and on how we report the uncertainty. You looked at several of these in Physics labs.
In this section we consider situations in which our estimates are statistics used as estimates of
parameters. For example, consider the dimes again. We used ȳ (the sample mean) as an estimate
of µ, the population mean. Obviously we do not expect that ȳ is exactly equal to µ. So we would
like an uncertainty that gives us some idea of how close ȳ is likely to be to µ.
To get such measures of uncertainty, we need to make some assumptions about how the sample
was produced and (often) some assumptions about the population. The most important idea here
is the notion of sampling distribution first introduced in Section 10.
In this section, we consider the most important example. Suppose that our model is the linear
model:
The standard linear model
The standard linear model is given by the equation

Y = β0 + β1 x1 + · · · + βk xk + ε

where
1. ε is a random variable with mean 0 and variance σ²,
2. β0 , β1 , . . . , βk , σ² are (unknown) parameters,
3. and ε has a normal distribution.
The data result from a “random sample” of the random terms ε. Notice that this model includes
(when k = 0) the situation of sampling from a normal population to estimate the mean, which in
this case is β0 . Under these conditions (which are admittedly quite strong), we can say a lot about
the distribution of bi over all possible samples.
• The mean of all possible values of bi is βi .
• The standard deviation of all possible values of bi is a known (messy) function of σ and the
data (in fact only the x values of the data).
The second fact above is an annoyance since σ is unknown. But there is a sensible way to construct
an estimate of σ. Namely, we use the standard error of the residuals, se . Using se to estimate σ and
the formulas referred to in the second point above, we then get an estimate sbi of the standard
deviation of bi . This number is called the standard error of bi .
We do not have to know the computational formulas to compute the standard errors – R or any
other standard statistical package will do this. We illustrate with the corrosion data.
> lrust = lm(loss ~ Fe, corrosion)
> summary(lrust)
Call:
lm(formula = loss ~ Fe, data = corrosion)

Residuals:
   Min     1Q Median     3Q    Max
-3.798 -1.946  0.297  0.992  5.743

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   129.79       1.40    92.5  < 2e-16
Fe            -24.02       1.28   -18.8  1.1e-09

Residual standard error: 3.06 on 11 degrees of freedom
Multiple R-squared: 0.97, Adjusted R-squared: 0.967
F-statistic: 352 on 1 and 11 DF, p-value: 1.06e-09
The column labeled Std. Error is the relevant one. Consider the estimate of the slope, −24.02.
The standard error is 1.280. We interpret that as saying that we should not be at all surprised if a
different sample gave us a value of the slope that is 1.28 away from −24.02. That in fact would be
a typical result. Some samples would surely give values of the slope that were even further away
than 1.28 and we should not be surprised at that. Using these numbers gives us one way of writing
a coverage interval, namely we might write
estimate ± standard error of the estimate
Alternatively, sometimes this information is presented in a form such as -24.02 (se 1.28) or some
such indication.
The special case of the mean is worth looking at. Consider the dimes example yet again.
> ldimes = lm(Mass ~ 1, dimes)
> summary(ldimes)
Call:
lm(formula = Mass ~ 1, data = dimes)

Residuals:
     Min       1Q   Median       3Q      Max
-0.04423 -0.02173 -0.00273  0.01577  0.03977

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  2.25823    0.00403     560   <2e-16

Residual standard error: 0.0221 on 29 degrees of freedom
The summary indicates that b0 = ȳ = 2.258 and the standard error of this estimate is 0.004.
Again, we wouldn’t have been surprised if a different sample of 30 dimes resulted in an estimate
of 2.258+.004 or 2.258-.004. In this case the formula for the standard error is easy to write down:
sȳ = s/√n.
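We can check the formula sȳ = s/√n directly in R. The dimes data are not built in, so the sketch below fabricates a sample of 30 masses with roughly the same mean and spread:

```r
# The standard error reported by lm(y ~ 1) equals sd(y)/sqrt(n).
set.seed(1)
y <- rnorm(30, mean = 2.258, sd = 0.022)  # made-up stand-in for the dimes
fit <- lm(y ~ 1)
se.from.lm <- coef(summary(fit))[1, "Std. Error"]
c(lm = se.from.lm, formula = sd(y) / sqrt(length(y)))  # the same number twice
```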
While an expression of the form estimate ± standard error is used, it is more common to construct
something called a confidence interval. A confidence interval is an interval of the form
estimate ± margin of error
using a method that ensures that the interval is quite likely to capture the true value of the
parameter. R computes what are called 95% confidence intervals. We first compute some and then
say precisely what they mean.
> ldimes = lm(Mass ~ 1, dimes)
> confint(ldimes)
            2.5 % 97.5 %
(Intercept)  2.25   2.27
> lrust = lm(loss ~ Fe, corrosion)
> confint(lrust)
            2.5 % 97.5 %
(Intercept) 126.7  132.9
Fe          -26.8  -21.2
The 95% confidence interval for the mean mass of the dimes in the sack that we sampled from is
(2.250, 2.266). (We could also express this as 2.258 ± 0.008.) The 95% confidence interval for
the slope of the regression line in the corrosion example is (−26.84, −21.20).
It is tricky to say exactly what a confidence interval means. In particular, the 95% number is a
probability – but a probability of what? It makes no sense to talk about the probability that µ, the
mean mass of dimes in the sack, is in the interval (2.250, 2.266). The number µ is a fixed number, though
unknown, and it either is or is not in that interval. Rather the confidence interval statement is a
statement about our confidence in the procedure. In 95% of all possible samples, the confidence
interval we construct will capture the mean. We have no idea whether the particular sample
we got is one of those however. So our confidence is in the procedure, not the particular sample.
Nevertheless we can use confidence intervals such as we have computed with confidence (!) knowing
that doing so will not often lead us into error. Of course there is nothing magic about 95% confidence;
we might want (and R will produce) confidence intervals in which our confidence level is 99% or
90% or anything else.
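The “confidence in the procedure” interpretation can be simulated: draw many samples from a population whose mean we know, build a 95% interval from each, and count how often the interval captures the true mean (the population values below are invented):

```r
# Simulate coverage: roughly 95% of intervals should contain the true mean.
set.seed(42)
mu <- 10
covered <- replicate(1000, {
  y <- rnorm(25, mean = mu, sd = 3)  # a sample from the known population
  ci <- confint(lm(y ~ 1))           # 95% confidence interval for the mean
  ci[1] <= mu && mu <= ci[2]
})
mean(covered)  # close to 0.95
```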
It is very important to realize that these probability statements are built on the assumption that
our linear model is correct. Namely in the dimes example, we need to believe that the distribution
of the mass of dimes in the population (the sack) is normal and that the sample we chose was
truly a simple random sample. Neither of these things is likely to be exactly true. It turns out
that if we use this linear model to produce confidence intervals, the assumption of normality can
be relaxed somewhat. For example if the sample is very large or the population is not “far away”
from normal, the confidence interval procedure is fairly robust – for example the 95% confidence
interval might produce good intervals 94% or 96% of the time. However, we should be cautious
using confidence intervals when the sample is quite small or we have reason to expect that the ε term has a distribution that is quite far from normal.
The confidence intervals that we have considered so far are confidence intervals for the coefficient
of a quantitative variable x in the model. What about categorical variables? For the rubberband
example, we compute 90% confidence intervals for β1 .
> lrb = lm(stretch ~ treat, data = rubberbands)
> confint(lrb, level = 0.9)
               5 %  95 %
(Intercept) 238.40 249.8
treatH        1.16  17.7
Here we can say that a 90% confidence interval for the coefficient of treatH is (1.16, 17.66). But
the coefficient is simply the difference between the mean of treatment H and that of treatment A. In
other words, a 90% confidence interval for the difference in the means of the two treatments is
(1.16, 17.66). Notice that 0 is not in this interval, so we are confident that treatment H has a higher
mean stretch than treatment A. It is somewhat more difficult to interpret the confidence intervals
for more complicated models.
Confidence intervals are just one kind of coverage interval. Another kind of coverage interval is a
prediction interval. Given a linear model and estimates for the parameters derived from a sample,
we would like to use the estimated model to make a prediction of the value of y for a new data
point. For example, we might want to predict the mass of the very next dime we choose from the
sack or the material loss from another bar with a specified iron content. The predict() function
can be used to compute the value of the prediction (which can be done more easily with makeFun)
and also prediction intervals.
> newx = data.frame(Fe = 1)
> newx
  Fe
1  1
> predict(lrust, newx, interval = "prediction")
  fit  lwr upr
1 106 98.8 113
What we see is that our best prediction for loss if Fe = 1.0 is 105.77. The 95% prediction interval
for the loss is (98.77, 112.8). Obviously this is a large interval. There are three things contributing
to the size of the interval:
1. b0 and b1 are not exactly equal to β0 and β1 (i.e., we have the wrong line),
2. even if they were, a typical point is not on the line – σ determines how far away a point might
typically be (the point won’t be on even the right line),
3. the level 95% of the interval is high.
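The first two contributions can be separated by comparing a confidence interval for the mean response with a prediction interval at the same x. The corrosion data live in the faraway package, so this sketch uses made-up data with a similar intercept and slope:

```r
# Prediction intervals are always wider than confidence intervals at the same x:
# the prediction interval adds the scatter of individual points around the line.
set.seed(7)
x <- runif(20, 0, 2)
y <- 130 - 24 * x + rnorm(20, sd = 3)  # invented data resembling the corrosion example
fit <- lm(y ~ x)
newx <- data.frame(x = 1)
predict(fit, newx, interval = "confidence")  # uncertainty in the fitted line only
predict(fit, newx, interval = "prediction")  # wider: adds sigma for a new point
```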
18
Hypothesis Tests
One of the first questions that we might ask about a parameter βi of our linear model is whether
βi = 0! Were βi = 0, we would simply drop the corresponding explanatory variable from our model.
Of course knowing that βi ≠ 0 is not much knowledge, so in general we prefer confidence intervals
that give us an idea of what βi could be (as opposed to what it isn’t). Nevertheless, a hypothesis
test of this sort might be a first step in trying to decide on the usefulness of a model with xi in it.
Of course it is extremely unlikely that bi = 0 – in fact it will essentially never be so. So a
hypothesis test has to be something more than checking that. In fact the question we need to ask
is how consistent the data are with the hypothesis that βi = 0.
We will illustrate the logic of hypothesis testing (which is quite subtle) with a simple example, the
rubber bands example.
> lrb = lm(stretch ~ treat, data = rubberbands)
> summary(lrb)
Call:
lm(formula = stretch ~ treat, data = rubberbands)

Residuals:
   Min     1Q Median     3Q    Max
-27.09  -4.50   0.50   7.91  15.50

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   244.09       3.29   74.17   <2e-16
treatH          9.41       4.77    1.97    0.063

Residual standard error: 10.9 on 19 degrees of freedom
Multiple R-squared: 0.17, Adjusted R-squared: 0.126
F-statistic: 3.89 on 1 and 19 DF, p-value: 0.0632
In this case, testing the hypothesis that β1 = 0 amounts to testing whether there really is a
difference between the mean stretchability of heated and unheated rubber bands.
A hypothesis test starts with a null hypothesis – in our case a hypothesis that β1 = 0. We call the
null hypothesis H0 and write
H0 : β1 = 0
Unlike the usual conception of scientific hypotheses, we don’t want to amass evidence for H0 . In
general we are hoping that H0 is false and hope to amass enough evidence to argue that it is false.
To that end, we have an alternate hypothesis that we are hoping to gather evidence for. The
alternate hypothesis is called Ha and in this case we write
Ha : β1 ≠ 0
Obviously, to decide whether we have evidence that β1 ≠ 0, we compare b1 to 0. If b1 is “far away”
from 0, we suspect that we have evidence that β1 ≠ 0. If b1 is reasonably close to 0, we are not
really sure.
Of course we need again to make assumptions about the sampling distribution of b1 in order to
decide what “close” means in this analysis. The assumptions of the linear model give us those.
And the result of applying this model appears in the last column of the summary.
> summary(lrb)
Call:
lm(formula = stretch ~ treat, data = rubberbands)

Residuals:
   Min     1Q Median     3Q    Max
-27.09  -4.50   0.50   7.91  15.50

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   244.09       3.29   74.17   <2e-16
treatH          9.41       4.77    1.97    0.063

Residual standard error: 10.9 on 19 degrees of freedom
Multiple R-squared: 0.17, Adjusted R-squared: 0.126
F-statistic: 3.89 on 1 and 19 DF, p-value: 0.0632
The number in that column opposite the variable treatH is 0.0632. This number is called the
P -value of the hypothesis test and it means precisely this.
If the null hypothesis is true there is a 6.32% probability that a sample would result in a
value of b1 at least this far away from 0.
What do we do with this P -value of 6.32%? What it says is that if we act as if β1 ≠ 0 after seeing
such a sample, we would be wrong over 6% of the time. It is usually conventional to expect a
P -value of less than 5% before we claim to have enough evidence that β1 ≠ 0. (This is the same
5% that leads us to use 95% confidence intervals in many situations.)
Notice that the evidence that we have amassed is not evidence in favor of the null hypothesis.
After all, b1 ≠ 0, so it is hard to see how it could be evidence that β1 = 0. However, what we are
claiming is that we don’t have enough evidence to reject H0 . This is the fundamental asymmetric
nature of hypothesis testing. The null hypothesis is a default hypothesis and we are not prepared
to act as if it is false unless we have significant evidence to that effect. It’s the innocent-until-proven-guilty
hypothesis. So in this case, we certainly haven’t shown that β1 = 0. We have merely shown
that we don’t have enough evidence to say that β1 ≠ 0.
While we will not go through the details of how the P -value is computed, we should note that the
third column of numbers in the summary is called the t-value. You might have already noticed
that the t-value is simply bi /sbi – the number of standard errors that bi is
away from 0. You might expect that t-values of more than 2 in absolute value are unusual if the null hypothesis
is true. Indeed that is correct – the t-value has a distribution that is very similar to the normal
distribution (but it depends on the sample size).
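The claim that the t-value is just bi /sbi can be checked on any fitted model; here is a quick sketch with simulated data:

```r
# The "t value" column of summary() is Estimate divided by Std. Error.
set.seed(3)
x <- rnorm(20)
y <- 2 + 3 * x + rnorm(20)
ct <- coef(summary(lm(y ~ x)))
cbind(t.reported = ct[, "t value"],
      t.computed = ct[, "Estimate"] / ct[, "Std. Error"])  # identical columns
```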
While we have just given an example here of one kind of hypothesis test, the general structure of
a hypothesis test is always the same and the key to understanding our conclusion is always the
P -value. We need to
1. Identify the null hypothesis – it asserts something about the (model of) the population.
2. Verify that the assumptions that are used to generate the sampling distribution are at least
approximately satisfied.
3. Determine that the sample elements are chosen independently from the population.
4. Compute the P -value.
5. Interpret the P -value in terms of the null hypothesis.
We give another example using a larger model. In the galton model, we have included a number
of variables to predict the height of adult children.
> lgal = lm(height ~ mother + father + sex, Galton)
> summary(lgal)
Call:
lm(formula = height ~ mother + father + sex, data = Galton)

Residuals:
   Min     1Q Median     3Q    Max
-9.523 -1.440  0.117  1.473  9.114

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  15.3448     2.7470    5.59  3.1e-08
mother        0.3215     0.0313   10.28  < 2e-16
father        0.4060     0.0292   13.90  < 2e-16
sexM          5.2260     0.1440   36.29  < 2e-16

Residual standard error: 2.15 on 894 degrees of freedom
Multiple R-squared: 0.64, Adjusted R-squared: 0.638
F-statistic: 529 on 3 and 894 DF, p-value: <2e-16
In the very bottom right is a P -value that is vanishingly small. This P -value is the result of testing
the null hypothesis
H0 : β1 = · · · = βk = 0
Obviously this is a very strong hypothesis – it claims that none of the explanatory variables are
useful in predicting our response variable. The alternate hypothesis is simply that at least one of
the βi is not 0. Normally this is the first hypothesis test done in a linear model context, since not
having enough evidence to be reasonably confident that any of the explanatory variables matters
probably means either that we have a poor model or that we haven’t anything like a large
enough sample size to detect the effect that we are hoping for. The test of this hypothesis is an
“F-test”. The F -value is computed from the sum of squares of the residuals and tends to be
large if the null hypothesis is false and smaller if not. We’ll look more closely at F -values later.
Questions
Q18.1 We can simulate situations in which we know H0 is true and see what kinds of P -values we
obtain. Try the following:
y = rnorm(20, 0, 1)
x = rnorm(20, 0, 1)
l = lm(y ~ x)
summary(l)
Test the hypothesis that β1 = 0 using your sample. Obviously you have generated the sample
with β1 = 0 since x and y are completely independent. Do this experiment 20 times. On
average you should get a P -value that is less than 5% once. You may find that you get such
a P -value several times or not at all.
Q18.2 We can also simulate situations in which we know H0 is false and see if our hypothesis test
detects that. Try the following
x = rnorm(20, 0, 1)
y = 2 + 3 * x + rnorm(20, 0, 4)
l = lm(y ~ x)
summary(l)
Is there evidence to reject the (false) null hypothesis that β1 = 0? Try it several times. Do
you ever get a sample that does not appear to be strong evidence against this null hypothesis?
You can also try this by varying the error term (change σ from 4 to a larger or smaller number)
or by changing the sample size.
19
Propagation of Error
Often we use measured quantities to compute other quantities. For example, we might want to
know the area of a rectangle but we measure directly only the length and the width. In many
physical experiments we measure directly a few basic quantities and then use physical laws to
infer the value of other quantities. For example, we could never measure directly the gravitational
constant g but rather do experiments in which we measure other quantities (e.g., time for a ball
to drop a given height) and use the law of gravitation to infer g. An obvious question is this:
how uncertain is our estimate of the computed quantity given that we know the uncertainty of the
individual measurements used to compute it?
The problem is this. Suppose that we have k measured quantities x1 , . . . , xk and we have uncertainties ui for each. What is the uncertainty of the function y = f (x1 , . . . , xk )? In this section we
will look at a traditional answer that is given to this question – it’s the answer of the Physics 134
lab manual and one that is often used in engineering applications.
We first assume that the uncertainties are standard deviations – that is we model the measurements
as random variables that have a mean and a standard deviation. We have already seen how
to estimate uncertainties such as this. We will next assume that the quantities measured are
independent – there are more general treatments for the dependent case but we won’t do that here
(you can look it up!). Even with these assumptions, our result will only be approximate. Exact
results exist only for some functions f .
Suppose X1 , . . . , Xk are independent random variables with means µXi and standard deviations σXi . Then the mean and standard deviation of the random variable Y = f (X1 , . . . , Xk )
are approximated by the following:

µY ≈ f (µX1 , . . . , µXk )

and

σY ≈ √[ (∂f /∂X1 )² σ²X1 + · · · + (∂f /∂Xk )² σ²Xk ]

Here the partial derivatives are evaluated at the point (µX1 , . . . , µXk ).
Of course in general we do not know σXi and we must use estimates (the uncertainties) to estimate σY .
Example 19.1. Suppose that we measure the length and width of a rectangle and report the
findings as 8.0 ± .1 cm and 7.5 ± .1 cm. We’re assuming these uncertainties are standard deviations.
We wish to estimate the area A = LW . Of course the partial derivatives are given by

∂A/∂L = W    ∂A/∂W = L

We estimate the area as 60 cm² with standard deviation:

sA = √[ (7.5)² (.1)² + (8.0)² (.1)² ] = 1.10 cm²
While the formula in the box gives approximations, in an important case it is exact. This is the
case that the function f is a linear function. We’ll state it for two variables U and V .

If U and V are independent random variables, then µU +V = µU + µV and σ²U +V = σ²U + σ²V .
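That exact rule for sums is easy to see in a simulation (the means and standard deviations below are arbitrary choices):

```r
# For independent U and V, var(U + V) = var(U) + var(V):
# here 2^2 + 1.5^2 = 6.25, and the simulated variance lands close to that.
set.seed(9)
U <- rnorm(1e5, mean = 1, sd = 2)
V <- rnorm(1e5, mean = 3, sd = 1.5)
c(mean(U + V), var(U + V))  # approximately 4 and 6.25
```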
Questions
Q19.1 Simulate the area example 1,000 times by assuming that the length and width have normal
distributions. Compute the standard deviation of the 1,000 trials. How close does it come to
the estimate we obtained above?