
Mathematics 243: Statistics
M. Stob
January 20, 2009
Preface
This is the textbook for the course Mathematics 243 taught at Calvin College. This
edition of the book is for the Spring, 2009 version of the course.
Not using a “standard textbook” requires an explanation. This book differs from
other available books in at least three ways. First, this book is a “modern” treatment
of statistics that reflects the most recent wisdom about what belongs, and does not
belong, in a first course on statistics. Most existing textbooks must give at least
some attention to traditional or old-fashioned approaches since traditional and old-fashioned courses are often taught. Second, this course relies on a particular statistical
software package, R. The use of R is expected of students throughout the course. Most
traditional textbooks are published so as to be usable with any software package (or
with no software package at all). The use of R is part of what makes this text modern.
Third, this textbook is written for Mathematics 243 and so includes all and only what
is covered in the course. Most traditional textbooks are rather encyclopedic.
While this textbook includes all the topics that are covered in the course, it is not
meant to be self-contained. In particular, the textbook is for a class that meets 52
times throughout the semester and what goes on in those sessions is important. Also,
the textbook contains numerous problems and the problems must be done so that the
concepts are understood in full detail.
The sections of the textbook are intended to be covered in the order that they
appear in the text. An exception concerns the appendix, Using R. The R language will
be introduced throughout the text by means of examples that solve the problems at
hand. The appendix gives fuller explanation of language features that are important
for developing the proficiency with R needed to proceed. The text will often refer
forward to the appropriate section of the appendix for more details. The text is not
a complete introduction to R however. R has a built-in help facility and there are also
several introductions to the R language that are available on the web. A particularly
good one is Simple R by John Verzani.
This text will change over the course of the semester. The current version of the
text will always be available on the course website http://www.calvin.edu/~stob/
courses/m243/S09. The pdf version is designed to be useful for on-screen reading.
The references in the text to other parts of the text and to the web are hyperlinked.
This is the second edition of this text. I would like to thank all the members of the
Spring, 2008 section of Mathematics 243 for their help in improving this book. Many
students found typographical errors and made useful suggestions for improving the
text. Special thanks are due to Susie Hirschfeld and Nicole Lenko for the many helpful
comments that they made. However, errors, typographical and otherwise, no doubt
still abound. I encourage readers to communicate them to me at [email protected].
This text is a part of a larger effort to improve the teaching of statistics at Calvin
College. Earlier versions of some of this material were used for the course Mathematics
232. Some of the material in this book was developed by Randy Pruim and appears
in the text for Mathematics 343–344. The assistance of Pruim and Tom Scofield in
the development of these courses is gratefully acknowledged.
This work is being distributed with a Creative Commons Attribution-Noncommercial-Share-Alike 3.0 License.
Contents

Introduction

1. Data
   1.1. Basic Notions
   1.2. A Single Variable - Distributions
   1.3. Measures of the Center of a Distribution
   1.4. Measures of Dispersion
   1.5. The Relationship Between Two Variables
   1.6. Two Quantitative Variables
        1.6.1. Examples
        1.6.2. Fitting Functions to Data
        1.6.3. The Least-Squares Line
        1.6.4. Non-linear Curve Fitting
   1.7. Exercises

2. Data from Random Samples
   2.1. Populations and Samples
   2.2. Simple Random Samples
   2.3. Other Sampling Plans
   2.4. Exercises

3. Probability
   3.1. Random Processes
   3.2. Assigning Probabilities – Equally Likely Outcomes
   3.3. Probability Axioms
   3.4. Empirical Probabilities
   3.5. Independence
   3.6. Exercises

4. Random Variables
   4.1. Basic Concepts
   4.2. Discrete Random Variables
        4.2.1. The Binomial Distribution
        4.2.2. The Hypergeometric Distribution
   4.3. An Introduction to Inference
   4.4. Continuous Random Variables
        4.4.1. pdfs and cdfs
        4.4.2. Uniform Distributions
        4.4.3. Exponential Distributions
        4.4.4. Weibull Distributions
   4.5. The Mean of a Random Variable
        4.5.1. The Mean of a Discrete Random Variable
        4.5.2. The Mean of a Continuous Random Variable
   4.6. Functions of a Random Variable
        4.6.1. The Variance of a Random Variable
   4.7. The Normal Distribution
   4.8. Exercises

5. Inference - One Variable
   5.1. Statistics and Sampling Distributions
        5.1.1. Samples as random variables
        5.1.2. Big Example
        5.1.3. The Standard Framework
   5.2. The Sampling Distribution of the Mean
   5.3. Estimating Parameters
        5.3.1. Bias
        5.3.2. Variance
        5.3.3. Mean Squared Error
   5.4. Confidence Interval for Sample Mean
        5.4.1. Confidence Intervals for Normal Populations
        5.4.2. The t Distribution
        5.4.3. Interpreting Confidence Intervals
        5.4.4. Variants on Confidence Intervals and Using R
   5.5. Non-Normal Populations
        5.5.1. t Confidence Intervals are Robust
        5.5.2. Why are t Confidence Intervals Robust?
   5.6. Confidence Interval for Proportion
   5.7. The Bootstrap
   5.8. Testing Hypotheses About the Mean
   5.9. Exercises

6. Producing Data – Experiments
   6.1. Observational Studies
   6.2. Randomized Comparative Experiments
   6.3. Blocking
   6.4. Experimental Design

7. Inference – Two Variables
   7.1. Two Categorical Variables
        7.1.1. The Data
        7.1.2. I independent populations
        7.1.3. One population, two factors
        7.1.4. I experimental treatments
   7.2. Difference of Two Means
   7.3. Exercises

8. Regression
   8.1. The Linear Model
   8.2. Inferences
   8.3. More Inferences
   8.4. Diagnostics
        8.4.1. The residuals
        8.4.2. Influential Observations
   8.5. Multiple Regression
   8.6. Evaluating Models
   8.7. Exercises

A. Appendix: Using R
   A.1. Getting Started
   A.2. Vectors and Factors
   A.3. Data frames
   A.4. Getting Data In and Out
   A.5. Functions in R
   A.6. Samples and Simulation
   A.7. Formulas
   A.8. Lattice Graphics
   A.9. Exercises

Index of Terms

Index of R Functions and Objects

Index of Datasets Used
Introduction
Kellogg’s makes Raisin Bran and packages it in boxes that are labeled “Net Weight:
20 ounces”. How might we test this claim? It seems obvious that we need to actually
weigh some boxes. However we certainly cannot require that every box that we weigh
contains exactly 20 ounces. Surely some variation in weight from box to box is to be
expected and should be allowed. So we are faced with several questions: How many
boxes should we weigh? How should we choose these boxes? How much deviation in
weight from the 20 ounces should we allow? These are the kind of questions that the
discipline of statistics is designed to answer.
Definition (Statistics). Statistics is the scientific discipline concerned with collecting, analyzing and making inferences from data.
While we cannot tell the whole Raisin Bran story here, the answers to our questions as prescribed by NIST (National Institute of Standards and Technology) and
developed from statistical theory are something like this. Suppose that we are at a
Meijer’s warehouse that has just received a shipment of 250 boxes of Raisin Bran. We
first select twelve boxes out of the whole shipment at random. By at random we mean
that no box should be any more likely to occur in the group of twelve than any other.
In other words, we shouldn’t simply take the first twelve boxes that we find. Next we
weigh the contents of the twelve boxes. If any of the boxes are “too” underweight,
we reject the whole shipment; that is, we disbelieve the claim of Kellogg’s (and they
are in trouble). If that is not the case, then we compute the average weight of the
twelve boxes. If that average is not “too” far below 20 ounces, we do not disbelieve
the claim.
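As a small preview of the kind of computation R makes easy, here is a sketch of the selection step; the weights below are simulated from made-up numbers, not actual Kellogg’s data.

> weights = rnorm(250, mean=20.2, sd=0.15)   # hypothetical weights for the 250 boxes in the shipment
> chosen = sample(weights, 12)               # twelve boxes chosen at random
> mean(chosen)                               # the average weight of the sample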
Of course, some details in the above paragraph need to be filled in. We’ll address the issue
of how to choose the boxes more carefully in Chapter 2. We’ll address the issue of
summarizing the data (in this case, using the average weight) in Chapter 1. The
question of how to judge whether a sample average is too far below 20 ounces will be
dealt with in Chapter 5.
Underlying our statistical techniques is the theory of probability which we take up
in Chapter 3. The theory of probability is meant to supply a mathematical model for
situations in which there is uncertainty. In the context of Raisin Bran, we will use
probability to give a model for the variation that exists from box to box. We will also
use probability to give a model of the uncertainty introduced because we are only
weighing a sample of boxes.
If the whole course were only about Raisin Bran it wouldn’t be worth it (except
perhaps to Kellogg’s). But you are probably sophisticated enough to be able to
generalize this example. Indeed, the above story can be told in every branch of science
(biological, physical, and social). Each time we have a hypothesis about a real-world
phenomenon that is measurable but variable, we need to test that hypothesis by
collecting data. We need to know how to collect that data, how to analyze it, and
how to make inferences from it.
So without further ado, let’s talk about data.
1. Data
Statistics is the science of data. In this chapter, we talk about the kinds of data that
we study and how to effectively summarize such data.
1.1. Basic Notions
For our purposes, the sort of data that we will use comes to us in collections or
datasets. A dataset consists of a set of objects, variously called individuals, cases,
items, instances, units, or subjects, together with a record of the value of a certain
variable or variables defined on the objects.
Definition 1.1.1 (variable). A variable is a function defined on the set of objects.
Ideally, each individual has a value for each variable. These values are usually
numbers but need not be. Sometimes there are missing values.
Example 1.1.2.
Your college maintains a dataset of all currently active students. The individuals in this dataset are the students. Many different variables are defined and
recorded in this dataset. For example, every student has a GPA, a GENDER,
a CLASS, etc. Not every student has an ACT score — there are missing values
for this variable.
In the preceding example, some of the variables are obviously quantitative (e.g.,
GPA) and others are categorical (e.g., GENDER). A categorical variable is often
called a factor and the possible values of the categorical variables are called its
levels. Sometimes the levels of a categorical variable are represented by numbers.
For example, we might code gender using 1 for female and 0 for male. It will be
quite important to us not to treat the categorical variable as quantitative just because
numbers are used in this way. (Is the average gender 1/2?)
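As a quick, made-up illustration, storing such codes as a factor in R keeps us from treating them as numbers by accident.

> gender = factor(c(1,0,1,1,0), labels=c("male","female"))   # hypothetical codes: 0 = male, 1 = female
> table(gender)           # a sensible summary: counts for each level
> mean(c(1,0,1,1,0))      # the "average gender" of the raw codes -- a meaningless number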
It is useful to think of the values of the variable for each individual as forming a
list. In R, the values of a particular quantitative variable defined on a collection of
individuals is usually stored in a vector. A categorical variable is stored in an R
object called a factor (which behaves much like a vector). You can read more about
vectors and factors in Section A.2 of the Appendix.
We will normally think of a dataset as presented in a two-dimensional table. The
rows of the table correspond to the individuals. (Thus the individuals need to be
ordered in some way.) The columns of the table correspond to the variables. Each
of the rows and the columns normally has a name. In R, the canonical way to store
such data is in an object called a data.frame. More details on how to operate on
data.frames are in Appendix A.3.
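A tiny data frame can also be built by hand; the following sketch, with made-up values, shows the row-and-column structure before we turn to the real datasets below.

> students = data.frame(GPA=c(3.2,3.7,2.9),
+                       CLASS=factor(c("Sophomore","Senior","Freshman")))
> students          # three individuals (rows) and two variables (columns)
> students$GPA      # a single variable extracted from the data frame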
In the remainder of this section, we give a few examples of datasets that can be
accessed in R and look at some of their basic properties. These datasets will be used
several times in this book.
Example 1.1.3.
The iris dataset is a famous set of measurements taken by Edgar Anderson on
150 iris plants of the Gaspe Peninsula which is located on the eastern tip of the
province of Quebec. The dataset is included in the basic installation of R. The
variable iris is a predefined data.frame. There are many such datasets built
into R.
> data(iris)      # the dataset called iris is loaded into a data.frame called iris
> dim(iris)       # list dimensions of iris data
[1] 150   5
> iris[1:5,]      # print first 5 rows (individuals), all columns
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
Notice that the data.frame has rows and columns. The individuals (rows) are,
by default, numbered (they can also be named) and the variables (columns)
are named. The numbers and names are not part of the dataset. Each column
of a data.frame is a vector or a factor. In the iris dataset, there are
150 individuals (plants) and five variables. Notice that four of the variables
(Sepal.Length, Sepal.Width, Petal.Length, Petal.Width) are quantitative
variables. The fifth variable, Species, is a categorical variable (factor) with three
levels. The following R session uses the built-in iris dataset and shows how to
extract an individual variable from a data frame (Species) and how to examine
the value of a particular variable for some individuals. More details on how
to work with data frames, vectors and factors can be found in Appendix A,
Sections A.2 and A.3.
> iris$Species                              # a boring vector
  [1] setosa     setosa     setosa     setosa     setosa     setosa
  [7] setosa     setosa     setosa     setosa     setosa     setosa
 [13] setosa     setosa     setosa     setosa     setosa     setosa
 [19] setosa     setosa     setosa     setosa     setosa     setosa
 [25] setosa     setosa     setosa     setosa     setosa     setosa
 [31] setosa     setosa     setosa     setosa     setosa     setosa
 [37] setosa     setosa     setosa     setosa     setosa     setosa
 [43] setosa     setosa     setosa     setosa     setosa     setosa
 [49] setosa     setosa     versicolor versicolor versicolor versicolor
 [55] versicolor versicolor versicolor versicolor versicolor versicolor
 [61] versicolor versicolor versicolor versicolor versicolor versicolor
 [67] versicolor versicolor versicolor versicolor versicolor versicolor
 [73] versicolor versicolor versicolor versicolor versicolor versicolor
 [79] versicolor versicolor versicolor versicolor versicolor versicolor
 [85] versicolor versicolor versicolor versicolor versicolor versicolor
 [91] versicolor versicolor versicolor versicolor versicolor versicolor
 [97] versicolor versicolor versicolor versicolor virginica  virginica
[103] virginica  virginica  virginica  virginica  virginica  virginica
[109] virginica  virginica  virginica  virginica  virginica  virginica
[115] virginica  virginica  virginica  virginica  virginica  virginica
[121] virginica  virginica  virginica  virginica  virginica  virginica
[127] virginica  virginica  virginica  virginica  virginica  virginica
[133] virginica  virginica  virginica  virginica  virginica  virginica
[139] virginica  virginica  virginica  virginica  virginica  virginica
[145] virginica  virginica  virginica  virginica  virginica  virginica
Levels: setosa versicolor virginica
> iris$Petal.Width[c(1:5,146:150)]          # selecting some individuals
 [1] 0.2 0.2 0.2 0.2 0.2 2.3 1.9 2.0 2.3 1.8
Example 1.1.4.
There are 3,077 counties in the United States (including D.C.). The U.S. Census
Bureau lists 3,141 units that are counties or county-equivalents. (Some people
don’t live in a county. For example, most of the land in Alaska is not in any
borough, which is what Alaska calls county level divisions. The Census Bureau
has defined county equivalents so that all land and every person is in some
county or other.) Data from the 2000 census about each county is available in
a dataset maintained at the website for this course. These data are available
from http://www.census.gov. The short R session below shows how to read
the file and computes a few interesting numbers.
> counties=read.csv('http://www.calvin.edu/~stob/data/counties.csv')
> dim(counties)
[1] 3141    9
> names(counties)
[1] "County"         "State"          "Population"     "LandArea"
[5] "TotalArea"      "WaterArea"      "HousingUnits"   "DensityPop"
[9] "DensityHousing"
> sum(counties$Population)
[1] 281421906
> sum(counties$LandArea)
[1] 3537438
The population of the 50 states and D.C. was 281,421,906 at the time of the 2000
U.S. Census. There were over 3.5 million square miles of land area. Notice that
the variables State and County are categorical variables that together name
the county.
Example 1.1.5.
R comes with many user-created “packages”, many of which contain additional
datasets. The faraway package comes with a broccoli dataset. In this dataset,
a number of growers supply broccoli to a food processing plant. They are
supposed to pack the broccoli in boxes with 18 clusters to a box and with
each cluster weighing between 1.3 and 1.5 pounds. Four boxes from each of
three growers were selected and three clusters from each box were weighed.
Though numbers were used to denote the levels of the categorical variables
(such as grower), these variables have correctly been stored as factors rather
than vectors.
> library(faraway)
> dim(broccoli)
[1] 36 4
> broccoli[1:5,]
   wt grower box cluster
1 352      1   1       1
2 369      1   1       2
3 383      1   1       3
4 339      2   1       1
5 367      2   1       2
> broccoli$grower
[1] 1 1 1 2 2 2 3 3 3 1 1 1 2 2 2 3 3 3 1 1 1 2 2 2 3 3 3 1 1 1 2 2 2 3 3 3
Levels: 1 2 3
1.2. A Single Variable - Distributions
Now that we can get our hands on some data, we would like to develop some tools to
help us understand the distribution of a variable in a data set. By distribution we
mean two things: what values does the variable take on, and with what frequency.
Simply listing all the values of a variable is not an effective way to describe a distribution unless the data set is quite small. (Look at the list of iris species in Example 1.1.3
or think about the table of the 3,141 counties in Example 1.1.4 for example.) For
larger data sets, we require some better methods of summarizing a distribution. In
this section, we will look particularly at some simple ways to summarize the distribution of a single variable.
The type of summary that we generate will vary depending on the type of data that
we are summarizing. A table is useful for summarizing a categorical variable. The
R function table() lists the different values of the variable and counts the number
of individuals taking on each value. The following table is a useful description of the
distribution of species of iris flowers in the iris dataset.
> table(iris$Species)

    setosa versicolor  virginica
        50         50         50
A more interesting table gives the number of counties per state. Note that it isn’t
always the largest states that have the most counties.
> table(counties$State)

             Alabama               Alaska              Arizona
                  67                   27                   15
            Arkansas           California             Colorado
                  75                   58                   63
         Connecticut             Deleware District of Columbia
                   8                    3                    1
             Florida              Georgia               Hawaii
                  67                  159                    5
               Idaho             Illinois              Indiana
                  44                  102                   92
                Iowa               Kansas             Kentucky
                  99                  105                  120
           Louisiana                Maine             Maryland
                  64                   16                   24
       Massachusetts             Michigan            Minnesota
                  14                   83                   87
         Mississippi             Missouri              Montana
                  82                  115                   56
            Nebraska               Nevada        New Hampshire
                  93                   17                   10
          New Jersey           New Mexico             New York
                  21                   33                   62
      North Carolina         North Dakota                 Ohio
                 100                   53                   88
            Oklahoma               Oregon         Pennsylvania
                  77                   36                   67
        Rhode Island       South Carolina         South Dakota
                   5                   46                   66
           Tennessee                Texas                 Utah
                  95                  254                   29
             Vermont             Virginia           Washington
                  14                  135                   39
       West Virginia            Wisconsin              Wyoming
                  55                   72                   23
Tables can be generated for quantitative variables as well.
> table(iris$Sepal.Length)

4.3 4.4 4.5 4.6 4.7 4.8 4.9   5 5.1 5.2 5.3 5.4 5.5 5.6 5.7 5.8 5.9   6 6.1
  1   3   1   4   2   5   6  10   9   4   1   6   7   6   8   7   3   6   6
6.2 6.3 6.4 6.5 6.6 6.7 6.8 6.9   7 7.1 7.2 7.3 7.4 7.6 7.7 7.9
  4   9   7   5   2   8   3   4   1   1   3   1   1   1   4   1
For a quantitative variable that takes on many different values a table such as this
one is often not of much help. In this case, the table() function is more useful in
conjunction with the cut() function. The second argument to cut() gives a vector
of endpoints of half-open intervals. Note that the default behavior is to use intervals
that are open to the left and closed to the right. We see for example that there are 32
iris plants with Sepal Length greater than 4 but less than or equal to 5. (Like most
functions in R, there are optional arguments to cut() that can change this behavior.)
> table(cut(iris$Sepal.Length,c(4,5,6,7,8)))

(4,5] (5,6] (6,7] (7,8]
   32    57    49    12
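For instance, the optional argument right=FALSE makes the intervals closed on the left instead; this sketch just shows the call.

> table(cut(iris$Sepal.Length,c(4,5,6,7,8),right=FALSE))   # the ten plants with sepal length exactly 5 now fall in [5,6)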
The kind of summary in the above table is graphically presented by means of a
histogram which is one of the most useful ways to summarize the distribution of
a quantitative variable. There are two R commands that can be used to build a
histogram: hist() and histogram(). hist() is part of the standard distribution of
R. histogram() can only be used after first loading the lattice graphics package,
which now comes standard with all distributions of R. The R functions are used as in
the following excerpt which generates the two histograms in Figure 1.1. Notice that
two forms of the histogram() function are given. The second form (the “formula”
form) will be discussed in more detail in Section 1.5. The histograms are of the
number of homeruns per team during the 2007 Major League Baseball season. That
is, the individuals in the dataset are major league baseball teams and homeruns (HR)
is one quantitative variable defined on those individuals.
> bball=read.csv('http://www.calvin.edu/~stob/data/bball2007.csv')
> hist(bball$HR)              # standard R histogram
> histogram(bball$HR)         # lattice version of histogram
> histogram(~HR,data=bball)   # formula form of histogram
Figure 1.1.: Homeruns in major leagues: hist() and histogram()
Notice that the histograms produced differ in several ways. Besides aesthetic differences, the two histogram algorithms typically choose different break points. Both
the number of “bins” and the endpoints of the bins might differ between the two
different methods. Also, the default vertical scale of histogram() is in percentages
of total while the vertical scale of hist() contains actual counts. (Therefore we often
can determine the number of individuals from the histogram produced by hist().)
As one should expect from R, there are optional arguments to each of these functions
that can be used to change such decisions.
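For example, breaks= suggests or specifies the bins for hist() and nint= suggests a number of bins for histogram(); the calls below are just a sketch of these options.

> hist(bball$HR, breaks=10)                           # ask for roughly ten bins
> histogram(~HR, data=bball, nint=15, type="count")   # about fifteen bins, counts on the vertical axis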
In these notes, we will usually use histogram(), and indeed we will assume that the
lattice package has been loaded. Graphics functions in the lattice package have
several useful features that are especially helpful when there are many variables, and
they all share a similar interface.
A histogram gives a shape to a distribution and distributions are often described in
terms of these shapes. The exact shape depicted by a histogram will depend not only
on the data but on various other choices, such as how many bins are used, whether the
bins are equally spaced across the range of the variable, and just where the divisions
between bins are located. But reasonable choices of these arguments will usually lead
to histograms of similar shape, and we use these shapes to describe the underlying
distribution as well as the histogram that represents it.
Some distributions are approximately symmetric with the distribution of the
larger values looking like a mirror image of the distribution of the lower values. We will
call a distribution positively skewed if the portion of the distribution with larger
values (the right of the histogram) is more spread out than the other side. Similarly,
a distribution is negatively skewed if the distribution deviates from symmetry in
the opposite manner. While there are ways to measure the degree and direction of
skewness with a number it is usually sufficient for our purposes to describe distributions qualitatively as symmetric or skewed. See Figure 1.2 for some examples of
symmetric and skewed distributions.
Figure 1.2.: Skewed and symmetric distributions.
The county population data gives a natural example of a positively skewed distribution. Indeed, it is so skewed that the histogram of populations by county is almost
worthless. The histogram is on the left in Figure 1.3.
Figure 1.3.: County populations and natural log of county populations.
In the case of positively skewed data where the data includes observations of several
orders of magnitude, it is sometimes useful to transform the data. In the case
of county populations, a histogram of the natural log of population gives a nice
symmetric distribution. The histogram is on the right in Figure 1.3.
> logPopulation=log(counties$Population)
> histogram(logPopulation)
Notice that each of these distributions is clustered around a center where most of
the values are located. We say that such distributions are unimodal. Shortly we
will discuss ways to summarize the location of the “center” of unimodal distributions
numerically. But first we point out that some distributions have other shapes that
are not characterized by a strong central tendency. One famous example is eruption
times of the Old Faithful geyser in Yellowstone National park. The command
> data(faithful);
> histogram(faithful$eruptions,n=20);
produces the histogram in Figure 1.4 which shows a good example of a bimodal
distribution. There appear to be two groups or kinds of eruptions, some lasting
about 2 minutes and others lasting between 4 and 5 minutes.

Figure 1.4.: Old Faithful eruption times (based on the faithful data set).

While the default
histogram has the vertical axis read percent of total, another scale will be useful to
us. In Figure 1.5, generated by
histogram(faithful$eruptions,type="density")
we have a density histogram. The vertical axis gives density per unit of the horizontal axis. With this as a density, the bars of the histogram have total mass of 1.
The histogram is read as follows. The bar that extends from 4 to 4.4 on the horizontal axis has width 0.4 and density approximately 0.6. This means that about 24%
((0.4)(0.6) = 0.24) of the data is represented by this bar.
Figure 1.5.: Density histogram of Old Faithful eruption times.
One disadvantage of a histogram is that the actual data values are lost. For a large
data set, this is probably unavoidable. But for more modestly sized data sets, a stem
plot can reveal the shape of a distribution without losing the actual data values. A
stem plot divides each value into a stem and a leaf at some place value. The leaf is
rounded so that it requires only a single digit.
> stem(iris$Petal.Length)

  The decimal point is at the |

  1 | 012233333334444444444444
  1 | 55555555555556666666777799
  2 |
  2 |
  3 | 033
  3 | 55678999
  4 | 000001112222334444
  4 | 5555555566677777888899999
  5 | 000011111111223344
  5 | 55566666677788899
  6 | 0011134
  6 | 6779
The first row of this stemplot shows that the smallest value of petal length is 1.0 while
there are a large number of irises with a petal length of 1.4. The choice of number of
stems depends on the number of individuals in the dataset and we usually leave those
choices to R. A stemplot is a very efficient paper-and-pencil technique to represent
the distribution of a variable. This particular stemplot shows us that petal length
has a bimodal distribution which you might guess has something to do with the fact
that more than one species is represented in this dataset.
1.3. Measures of the Center of a Distribution
Qualitative descriptions of the shape of a distribution are important and useful. But
we will often desire the precision of numerical summaries as well. Two aspects of
unimodal distributions that we will often want to measure are central tendency (what
is a typical value? where do the values cluster?), and the amount of variation (are
the data tightly clustered around a central value, or more spread out?)
Two widely used measures of center are the mean and the median. You are
probably already familiar with both. The mean is calculated by adding all the values
of a variable and dividing by the number of values. Our usual notation will be to
denote the n values as x1 , x2 , . . . xn , and the mean of these values as x̄. Then the
formula for the mean becomes

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}.$$
The median is a value that splits the data in half – half of the values are smaller
than the median and half are larger. By this definition, there could be more than
one median (when there are an even number of values). This ambiguity is removed
by taking the mean of the “two middle numbers” (after sorting the data). Whereas
x̄ denotes the mean of the n numbers x1 , . . . , xn , we use x̃ to denote the median of
these numbers.
The mean and median are easily computed in R.
> x=c(1,1,5,20,10)
> median(x)      # third largest number of the 5 in x
[1] 5
> y=c(x,15)
> median(y)      # average of 5 and 10 (third and fourth numbers)
[1] 7.5
The mean and median of iris sepal lengths are given by
> mean(iris$Sepal.Length); median(iris$Sepal.Length);
[1] 5.843333
[1] 5.8
We can also compute the mean and median of the Old Faithful eruption times.
> mean(faithful$eruptions); median(faithful$eruptions);
[1] 3.487783
[1] 4
Notice, however, that in the Old Faithful eruption times histogram (Figure 1.4) there
are very few eruptions that last between 3.5 and 4 minutes. So although these numbers
are the mean and median, neither seems to be a very good description of the typical
eruption time(s) of Old Faithful. It will often be the case that the mean and median
are not very good descriptions of a data set that is not unimodal. In the case of our
Old Faithful data, there seem to be two predominant peaks, but unlike in the case
of the iris data, we do not have another variable in our data that lets us partition
the eruptions times into two corresponding groups. This observation could, however,
lead to some hypotheses about Old Faithful eruption times. Perhaps eruption times
are different at night than during the day. Perhaps there are other differences in the
eruptions. Subsequent data collection (and statistical analysis of the resulting data)
might help us determine whether our hypotheses appear correct.
Comparing mean and median
While both the mean and median provide a measure of the center of a distribution,
they measure different things and sometimes one measure is better than the other. If
a distribution is (approximately) symmetric, the mean and median will be (approximately) the same. (See Exercise 1.6.) If the distribution is not symmetric, however,
the mean and median may be very different. For example, if we begin with a symmetric distribution and add in one additional value that is very much larger than
the other values (an outlier), then the median will not change very much (if at all),
but the mean will increase substantially. Because of this, we say that the median is
resistant to outliers while the mean is not. A similar thing happens with a skewed,
unimodal distribution. If a distribution is positively skewed, the large values in the
tail of the distribution increase the mean (as compared to a symmetric distribution)
but not the median, so the mean will be larger than the median. Similarly, the mean
of a negatively skewed distribution will be smaller than the median. Consider the
data on the populations of the 3,141 county equivalents in the United States. From
R we see the great difference in the mean county population and the median county
population. Note that the largest county, Los Angeles County with over 9 million
people, alone contributes over 3,000 people to the mean.
> mean(counties$Population); median(counties$Population)
[1] 89596.28
[1] 24595
Over 80% of the counties in the United States are less populous than the “average”
county as can be seen in this next computation.
> sum(counties$Population<mean(counties$Population))
[1] 2565
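The effect of a single outlier is easy to see with a small made-up example: the mean moves a great deal while the median barely moves.

> x=c(2,3,5,7,8)
> mean(x); median(x)
[1] 5
[1] 5
> y=c(x,1000)              # add one very large value
> mean(y); median(y)
[1] 170.8333
[1] 6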
Whether a resistant measure is desirable or not depends on context. If we are
looking at the income of employees of a local business, the median may give us a
much better indication of what a typical worker earns, since there may be a few large
salaries (the business owner’s, for example) that inflate the mean. This is also why
the government reports median household income and median housing costs. The
median county population perhaps tells us more about what a “typical” county looks
like than does the mean.
On the other hand, if we are ultimately interested in the total, the mean is more
useful. For example, the median of daily sales of a Hallmark Card store will likely be
smaller than the mean as there are several days of card sales that are outliers (e.g.,
the few days before Mother’s Day). The mean daily sales for a year allows us also to
compute the total sales for the year, and comparing the means of two different stores
allows us to determine which store has greater sales overall.
The trimmed mean
There is another measure of central tendency that is less well known and represents
a kind of compromise between the mean and the median. In particular, it is more
sensitive to the extreme values of a distribution than the median is, but less
sensitive than the mean. The idea of a trimmed mean is very simple.
Before calculating the mean, we remove the largest and smallest values from the
data. The percentage of the data removed from each end is called the trimming
percentage. The 10% trimmed mean is the mean of the middle 80% of the data (after
removing the largest and smallest 10%). A trimmed mean is calculated in R by setting
the trim argument of mean(), e.g. mean(x,trim=.10). Although a trimmed mean in
some sense combines the advantages of both the mean and median, it is less common
than either the mean or the median. This is partly due to the mathematical theory that
has been developed for working with the median and especially the mean of sample
data. The 10% trimmed mean of county populations is 38,234 which is much closer
in size to the median than to the mean.
> mean(counties$Population,trim=.1)
[1] 38234.59
We note in passing that there are some complications in defining the trimmed mean.
For instance in the case of the example above, the 10% trimmed mean of the county
populations should be computed by removing the 314.1 most populous and the
314.1 least populous counties from the data. But this doesn’t make any sense, so there
there needs to be some convention for dealing with these fractions of data points. In
fact there are several different conventions but we will let R handle the details of the
computation and not be too concerned about the differences. With large datasets,
the differences between the various definitions of, say, the 10% trimmed mean are
small.
In some sports, the trimmed mean is used to compute a competitor’s score based
on the scores given by individual judges. Diving, figure skating, and gymnastics are
three sports that use a trimmed mean to compute a competitor’s final score. (Diving
uses the middle three scores when there are five judges, which amounts to computing
the 20% trimmed mean.)
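A made-up set of five judges’ scores shows the computation: with trim=.20, the highest and lowest scores are dropped and the middle three are averaged.

> scores=c(6.5,7.0,7.5,8.0,9.5)    # hypothetical scores from five judges
> mean(scores)                     # the ordinary mean
[1] 7.7
> mean(scores,trim=.20)            # the mean of 7.0, 7.5, and 8.0
[1] 7.5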
1.4. Measures of Dispersion
It is often useful to characterize a distribution in terms of its center, but that is not
the whole story. Consider the distributions depicted in the histograms below.
[Two density histograms, labeled A and B, centered at the same value but with different spreads.]
In each case the mean and median are approximately 10, but the distributions clearly
have very different shapes. The difference is that distribution B is much more “spread
out”. “Almost all” of the data in distribution A are quite close to 10; a much larger
proportion of distribution B is “far away” from 10. The intuitive (and not very precise)
statement in the preceding sentence can be quantified by means of quantiles. The
idea of quantiles is probably familiar to you since percentiles are a special case of
quantiles.
Definition 1.4.1 (Quantile). Let p ∈ [0, 1]. A p-quantile of a quantitative distribution is a number q such that the (approximate) proportion of the distribution that is
less than q is p.
So for example, the .2-quantile divides a distribution into 20% below and 80%
above. The .2-quantile is also called the 20th percentile. The median is just the
.5-quantile (and the 50th percentile).
While the definition of quantile above seems clear, it does have the same complication as that of the definition of trimmed mean. Suppose your data set has 15 values.
What is the .30-quantile? 30% of the data would be (.30)(15) = 4.5 values. Of course,
there is no number that has 4.5 values below it and 10.5 values above it. This is the
reason for the parenthetical word approximate in Definition 1.4.1. Different methods
have been proposed for giving quantiles a single value, and R implements 9 different
methods! The next example illustrates the default computation of the .25-quantile
for datasets of size 5, 6, 7 and 8.
> quantile(1:5,.25)
25%
2
> quantile(1:6,.25)
25%
2.25
> quantile(1:7,.25)
25%
2.5
> quantile(1:8,.25)
25%
2.75
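For instance, the type argument of quantile() selects among those nine methods; this sketch simply shows the calls using the 15-value situation mentioned above (the two calls resolve the .30-quantile slightly differently).

> quantile(1:15,.30,type=1)    # one of the nine conventions
> quantile(1:15,.30,type=7)    # the default convention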
Fortunately, for large data sets, the differences between the various different quantile methods are usually small, so we will just let R compute quantiles for us using
the default method of the quantile() function. For example, here are the deciles
and quartiles of the Old Faithful eruption times.
> quantile(faithful$eruptions,(0:10)/10);
    0%    10%    20%    30%    40%    50%    60%    70%    80%    90%   100%
1.6000 1.8517 2.0034 2.3051 3.6000 4.0000 4.1670 4.3667 4.5330 4.7000 5.1000
> quantile(faithful$eruptions,(0:4)/4);
     0%     25%     50%     75%    100%
1.60000 2.16275 4.00000 4.45425 5.10000
The difference between the first and third quartiles is often used as a simple measure
of dispersion. This measure is called the inter-quartile range and abbreviated IQR.
The IQR of the Old Faithful eruption times is 4.45425 − 2.16275 = 2.2915. Note that
since IQR depends only on the middle 50% of data, it is a measure of dispersion that
is resistant to outliers.
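R will also compute this measure directly with the IQR() function, which uses the same default quantile method:

> IQR(faithful$eruptions)
[1] 2.2915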
Especially for hand computation, yet another method of computing the quartiles
based on the median is popular. With this method, the first- and third-quartiles
(the .25-quantile and the .75-quantile respectively) are called the lower hinge and the
upper hinge respectively. These are computed by the following definition.
Definition 1.4.2 (hinges). Suppose that a variable x1 , . . . , xn has an even number
of values, say n = 2k. Then the lower hinge is the median of the smallest k values
and the upper hinge is the median of the largest k values. If the variable has an odd
number of values, n = 2k + 1, then the lower hinge is the median of the smallest k + 1
values and the upper hinge is the median of the largest k + 1 values.
In other words, the lower hinge is the median of the lower half of the data with
the middle point included in that half if there are an odd number of data points.
Similarly for the upper hinge. A very common and useful description of the variability in a distribution is the five number summary. The
five number summary consists of the minimum, lower hinge, median, upper hinge,
and maximum of the distribution. The five number summary is computed by the R
function fivenum().
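A small example with ten made-up values, chosen so that the hinges can be checked by hand against Definition 1.4.2: the lower hinge is the median of the five smallest values and the upper hinge is the median of the five largest.

> x=c(1,3,5,5,6,8,9,14,14,20)
> fivenum(x)       # minimum, lower hinge, median, upper hinge, maximum
[1]  1  5  7 14 20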
The five-number summary is often presented graphically by means of a boxplot.
The standard R function is boxplot() and the lattice function is bwplot(). A
boxplot of the Sepal.Width of the iris data is in Figure 1.6 and was generated by
> bwplot(iris$Sepal.Width)
Figure 1.6.: Boxplot of Sepal.Width of iris data.
The sides of the box are drawn at the hinges. The median is represented by a dot
or line in the box. In some boxplots, the whiskers extend out to the maximum and
minimum values. However the boxplot that we are using here attempts to identify
outliers. Outliers are values that are unusually large or small and are indicated by a
special symbol beyond the whiskers. The whiskers are then drawn from the box to the
largest and smallest non-outliers. One common rule for automating outlier detection
for boxplots is the 1.5 IQR rule. This is the default rule in both boxplot functions
in R. Under this rule, any value that is more than 1.5 IQR away from the box is
marked as an outlier. Indicating outliers in this way is useful since it allows us to see
if the whisker is long only because of a few extreme values. A boxplot gives us some
idea of the symmetry and general dispersion of a variable but it certainly doesn’t give
us as much information about the shape of a distribution as a histogram.
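The arithmetic behind the 1.5 IQR rule can be sketched directly; here the quartiles stand in for the hinges (for this dataset they are essentially the same).

> q=quantile(iris$Sepal.Width,c(.25,.75))   # the quartiles of sepal width
> iqr=q[2]-q[1]
> c(q[1]-1.5*iqr, q[2]+1.5*iqr)             # values outside these fences are flagged as outliers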
Variance and Standard Deviation
Another important way to measure the dispersion of a distribution is by comparing
each value to the center of the distribution. If the distribution is spread out, these
differences will tend to be large, otherwise these differences will be small. To get a
single number, we could simply add up all of the deviations from the mean:

$$\text{total deviation from the mean} = \sum_{i=1}^{n} (x_i - \bar{x}).$$
The trouble with this is that the total deviation from the mean is always 0 (see
Exercise 1.12). The problem is that the negative deviations and the positive deviations
always exactly cancel out.
To fix this problem we might consider taking the absolute value of the deviations
from the mean:

$$\text{total absolute deviation from the mean} = \sum_{i=1}^{n} |x_i - \bar{x}|.$$
This number will only be 0 if all of the data values are equal to the mean. Even better
would be to divide by the number of data values. Otherwise large data sets will have
large sums even if the values are all close to the mean.
$$\text{mean absolute deviation} = \frac{1}{n}\sum_{i=1}^{n} |x_i - \bar{x}|.$$
This is a reasonable measure of the dispersion in a distribution, but we will not use
it very often. There is another measure that is much more common, namely the
variance, which is defined by
$$\text{variance} = \operatorname{Var}(x) = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2.$$
You will notice two differences from the mean absolute deviation. First, instead of
using an absolute value to make things positive, we square the deviations from the
mean. One advantage of squaring over the absolute value is that it is much easier
to do calculus with a polynomial than with functions involving absolute values. The
second difference is that we divide by n − 1 instead of by n. There is a good reason
for this, even though dividing by n seems more natural. We will get to that reason
in Chapter 5. However the following principle helps to remember the number n − 1.
If we consider the n deviations (x1 − x̄), (x2 − x̄), . . . , (xn − x̄), any one of them is
determined from the other n − 1 (because of the property of the mean that the sum
of these deviations is 0). Therefore there are only n − 1 “degrees of freedom” in these
numbers rather than the n degrees of freedom in the data. Indeed, n − 1 is called
degrees of freedom for this reason.
Because the squaring changes the units of this measure, the square root of the
variance, called the standard deviation, is commonly used in place of the variance.
$$\text{standard deviation} = \operatorname{SD}(x) = \sqrt{\operatorname{Var}(x)}.$$
We will sometimes use the notation s_x and s_x² for the standard deviation and
variance respectively. The subscript x refers to the particular variable for which we
are computing the variance or standard deviation and we sometimes omit it (and
write s or s²) when it is clear what variable is involved.
All of these quantities are easy to compute in R.
> x=c(1,3,5,5,6,8,9,14,14,20);
> mean(x);
[1] 8.5
> x - mean(x);
 [1] -7.5 -5.5 -3.5 -3.5 -2.5 -0.5  0.5  5.5  5.5 11.5
> sum(x - mean(x));
[1] 0
> abs(x - mean(x));
 [1]  7.5  5.5  3.5  3.5  2.5  0.5  0.5  5.5  5.5 11.5
> sum(abs(x - mean(x)));
[1] 46
> (x - mean(x))^2;
 [1]  56.25  30.25  12.25  12.25   6.25   0.25   0.25  30.25  30.25 132.25
> sum((x - mean(x))^2);
[1] 310.5
> n= length(x);
> 1/(n-1) * sum((x - mean(x))^2);
[1] 34.5
> var(x);
[1] 34.5
> sd(x);
[1] 5.87367
> sd(x)^2;
[1] 34.5
1.5. The Relationship Between Two Variables
Many scientific problems are about describing and explaining the relationship between
two or more variables. In the next two sections, we begin to look at graphical and
numerical ways to summarize such relationships. In this section, we consider the case
where one or both of the variables are categorical.
We first consider the case when one of the variables is categorical and the other
is quantitative. This is the situation with the iris data if we are interested in the
question of how, say, sepal length varies by species. A very common way of beginning
to answer this question is to construct side-by-side boxplots.
> bwplot(Sepal.Length~Species,data=iris)
We see from these boxplots (Figure 1.7) that the virginica variety of iris tends to have
the longest sepal length though the sepal lengths of this variety also have the greatest
variation.
The notation used in the first argument of bwplot() is called formula notation
and is extremely important when considering the relationship between two variables.
Figure 1.7.: Box plot for iris sepal length as a function of Species.
This formula notation is used throughout lattice graphics and in other R functions
as well. The simplest form of a formula is
y ~ x
We will often read this formula as “y modelled by x”.
In general, the variable y is the dependent variable and x the independent variable.
(In this example, it is more natural to think of species as the independent variable.
There is nothing logically incorrect however with thinking of sepal length as the
independent variable.) Usually, for plotting functions, y will be the variable presented
on the vertical axis, and x the variable to be plotted along the horizontal axis. In this
case, we are modeling (or describing) sepal length by species.
The formula notation can also be used with lattice function histogram(). For
example,
histogram(~Sepal.Length,data=iris)
will produce a histogram of the variable Sepal.Length. In this case, the dependent
variable in the formula is omitted since the dependent variable, the frequency of
the class, is computed by histogram(). Side-by-side histograms can be generated
with a more general form of the formula syntax. The same information in the boxplots
above is contained in the side-by-side histograms of Figure 1.8.
> histogram(~Sepal.Length | Species,data=iris,layout=c(3,1))

Figure 1.8.: Sepal lengths of three species of irises

In this form of the formula

y~x | z

the variable z is a conditioning variable. The conditioning variable z is used to
break the data into different groups. In the case of histogram(), the different groups
are plotted in separate panels. When z is categorical there is one panel for each level
of z. When z is quantitative, the data is divided into a number of sections based on
the values of z. The formula notation is used for much more than just graphics as we
will see later in this book.
We next turn to the case where both variables are categorical.
Example 1.5.1.
In 2004, over 400 incoming first-year students at Calvin College took a survey
concerning, among other things, their beliefs and values. In 2007, 221 of these
students were asked these same questions again. Their responses to three of
the questions are included in the file http://www.calvin.edu/~stob/data/
CSBVpolitical.csv. The variable SEX uses codes of 1 for male and 2 for
female. The other two variables, POLIVW04 and POLIVW07, refer to the
question “How would you characterize your political views?” as answered in
2004 and 2007. The coded responses are
Far right           1
Conservative        2
Middle-of-the-road  3
Liberal             4
Far left            5
Each of these questions results in a categorical variable. We might be interested
in whether there is a difference between self-characterization of male students
and female students. We might also be interested in the relationship of the
views of a student in 2004 and 2007. The first few entries of this dataset are
given in the following output.
> csbv=read.csv('http://www.calvin.edu/~stob/data/CSBVpolitical.csv')
> csbv[1:5,]
  SEX POLIVW04 POLIVW07
1   1        2        2
2   1        3        3
3   2        2        2
4   1        2        2
5   2        2        2
The most useful form of summary of data that arises from two or more categorical
variables is a cross tabulation. We first use a cross-tabulation to determine the
relationship of the gender of a student to his or her political views as entering first-year students.
> xtabs(~SEX+POLIVW04,csbv)
   POLIVW04
SEX  1  2  3  4
  1  7 47 28  6
  2  0 67 48 14
While the command syntax is a bit inscrutable, it should be clear how to read the
table. Note that no entering students characterized their views as “Far left” and no
female characterized her views as “Far right.” Also notice that it appears that males
tended to be more conservative than females.
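One way to make that comparison more direct is to convert the counts to proportions within each row; this is just a sketch using prop.table().

> prop.table(xtabs(~SEX+POLIVW04,csbv),margin=1)   # each row (each SEX) now sums to 1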
The xtabs() function uses the formula syntax. As in histogram(), there is no dependent
variable in the formula as the frequencies are computed from the data.
Also, the formula has the form ~x1+x2 where the plus sign indicates that there are
two independent variables. Another example of xtabs() with just one independent
variable is
> xtabs(~SEX ,csbv)
SEX
  1   2
 88 133
which counts the number of males and females in our dataset.
In this first example of xtabs our dataset contained a record for each observation.
It is quite often the case that we are only given summary data.
Example 1.5.2.
Data on graduate school admissions to six different departments of the University of California, Berkeley, in 1973 are summarized in the dataset
http://www.calvin.edu/~stob/data/Berkeley.csv.
> Admissions=read.csv('http://www.calvin.edu/~stob/data/Berkeley.csv')
> Admissions[c(1,10,19),]
      Admit Gender Dept Freq
1  Admitted   Male    A  512
10 Rejected   Male    C  205
19 Admitted Female    E   94
We see that 512 Males were admitted to Department A while 205 Males were
rejected by Department C. We now use the xtabs function with a dependent
variable:
> xtabs(Freq~Gender+Admit,Admissions)
        Admit
Gender   Admitted Rejected
  Female      557     1278
  Male       1198     1493
There seems to be a relationship between the two variables in this cross-tabulation.
Females were rejected at a greater rate than Males. While this might be evidence of gender bias at Berkeley, further analysis tells a more complicated
story.
> xtabs(Freq~Gender+Admit+Dept,Admissions)
, , Dept = A

        Admit
Gender   Admitted Rejected
  Female       89       19
  Male        512      313

, , Dept = B

        Admit
Gender   Admitted Rejected
  Female       17        8
  Male        353      207

, , Dept = C

        Admit
Gender   Admitted Rejected
  Female      202      391
  Male        120      205

, , Dept = D

        Admit
Gender   Admitted Rejected
  Female      131      244
  Male        138      279

, , Dept = E

        Admit
Gender   Admitted Rejected
  Female       94      299
  Male         53      138

, , Dept = F

        Admit
Gender   Admitted Rejected
  Female       24      317
  Male         22      351
In all but two departments, females are admitted at a greater rate than males
while in those two departments the admission rate is quite similar.
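The admission rates behind that observation can be computed directly; the following sketch uses prop.table() so that the Admitted and Rejected proportions sum to 1 for each gender within each department.

> tab=xtabs(Freq~Gender+Admit+Dept,Admissions)
> round(prop.table(tab,margin=c(1,3)),2)   # admission rate by gender within each department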
The next example again illustrates the difficulty in trying to explain the relationship
between two categorical variables, in this case race and the death penalty.
Example 1.5.3.
A 1981 paper investigating racial biases in the application of the death penalty
reported on 326 cases in which the defendant was convicted of murder. For
each case they noted the race of the defendant, the race of the victim, and whether
or not the death penalty was imposed.
> deathpenalty=read.table('http://www.calvin.edu/~stob/data/deathpenalty.csv')
> deathpenalty[1:5,]
  Penalty Victim Defendant
1     Not  White     White
2     Not  Black     Black
3     Not  White     White
4     Not  Black     Black
5   Death  White     Black
> xtabs(~Penalty+Defendant,data=deathpenalty)
       Defendant
Penalty Black White
  Death    17    19
  Not     149   141
From the output, it does not look like there is much of a difference in the
rates at which black and white defendants receive the death penalty although
a white defendant is slightly more likely to receive the death penalty. However
a different picture emerges if we take into account the race of the victim.
> xtabs(~Penalty+Defendant+Victim,data=deathpenalty)
, , Victim = Black

       Defendant
Penalty Black White
  Death     6     0
  Not      97     9

, , Victim = White

       Defendant
Penalty Black White
  Death    11    19
  Not      52   132
It appears that black defendants are more likely to receive the death penalty
when the victim is black and also when the victim is white.
In the last example, we met something called Simpson’s Paradox. Specifically,
we found that a relationship between two categorical variables (white defendants
receive the death penalty more frequently) is reversed when we split the analysis by a
third categorical variable (black defendants receive the death penalty more often both
when the victim is white and when the victim is black).
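The reversal can also be seen numerically by converting the tables above to proportions. The sketch below (the names dp and dp3 are our own) uses prop.table() to compute the death-penalty rate by race of the defendant, first ignoring and then conditioning on the race of the victim.
> dp = xtabs(~Penalty+Defendant, data=deathpenalty)
> prop.table(dp, margin=2)         # penalty rates by defendant's race, ignoring the victim
> dp3 = xtabs(~Penalty+Defendant+Victim, data=deathpenalty)
> prop.table(dp3, margin=c(2,3))   # the same rates, computed separately for each victim's race
From the counts given above, white defendants receive the death penalty in about 12% of cases overall and black defendants in about 10%, yet within each victim category the rate for black defendants is the higher one.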
1.6. Two Quantitative Variables
A very common problem in science is to describe and explain the relationship between
two quantitative variables. Often our scientific theory (or at least our intuition)
suggests that two variables have a relatively simple functional relationship, at least
approximately.
1.6.1. Examples
Example 1.6.1.
Thirteen bars of 90-10 Cu/Ni alloys were submerged for sixty days in sea water.
The bars varied in iron content. The weight loss due to corrosion for each bar
was recorded. The R dataset below gives the percentage content of iron (Fe)
and the weight loss in mg per square decimeter (loss).
> library(faraway)
> data(corrosion)
> corrosion[c(1:3,12:13),]
     Fe  loss
1  0.01 127.6
2  0.48 124.0
3  0.71 110.8
12 1.44  91.4
13 1.96  86.2
> xyplot(loss~Fe, data=corrosion)
> xyplot(loss~Fe, data=corrosion, type=c("p","r"))   # plot has points, regression line
Figure 1.9.: The corrosion data with a “good” line added on the right.
It is evident from the plot (Figure 1.9) that the greater the percentage of iron,
the less corrosion. The plot suggests that the relationship might be linear.
In the second plot, a line is superimposed on the data. The line is meant
to summarize approximately the linear relationship between iron content and
corrosion. (We will explain how to choose the line soon.) Note that to plot the
relationship between two quantitative variables, we may use either plot from
the base R package or xyplot from lattice. These functions use the same
formula notation as histogram().
What is the role of the line that we superimposed on the plot of the data in this
example? Obviously, we do not mean to claim that the relationship between iron
content and corrosion loss is completely captured by the line. But as a “model” of
the relationship between these variables, the line has at least three possible important
uses. First, it provides a succinct description of the relationship that is difficult to
see in the unsummarized data. The line plotted has equation
loss = 129.79 − 24.02Fe.
Both the intercept and slope of this line have simple interpretations. For example,
the slope suggests that every increase of 1% in iron content means a decrease in loss
of 24.02 mg per square decimeter. Second, the model might be used for
prediction in a situation where we have a yet untested object. We can easily use this
Distance   Time       Record Holder
100        9.69       Usain Bolt (Jamaica)
200        19.30      Usain Bolt (Jamaica)
400        43.18      Michael Johnson (US)
800        1:41.11    Wilson Kipketer (Denmark)
1000       2:11.96    Noah Ngeny (Kenya)
1500       3:26.00    Hicham El Guerrouj (Morocco)
Mile       3:43.13    Hicham El Guerrouj (Morocco)
2000       4:44.79    Hicham El Guerrouj (Morocco)
3000       7:20.67    Daniel Komen (Kenya)
5000       12:37.35   Kenenisa Bekele (Ethiopia)
10,000     26:17.53   Kenenisa Bekele (Ethiopia)

Table 1.1.: Men’s World Records in Track (IAAF)
line to make a prediction for the material loss in an alloy of 2% iron content. Finally,
it might figure in a scientific explanation of the phenomenon of corrosion.
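As a sketch of the prediction use, we can evaluate the fitted line at an iron content of 2% either by plugging into the equation above or with R's predict() function applied to the lm() object (constructed here under the name l2, a name of our choosing).
> l2 = lm(loss~Fe, data=corrosion)
> predict(l2, newdata=data.frame(Fe=2))   # about 129.79 - 24.02*2 = 81.75 mg per square decimeter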
Example 1.6.2.
The current world records for men’s track appear in Table 1.1. These data may
be found at http://www.calvin.edu/~stob/data/mentrack.csv. The plot of
record distances (in meters) and times (in seconds) looks roughly linear. We
know of course (for physical reasons) that this relationship cannot be a linear
one. Nevertheless, it appears that a smooth curve might approximate the data
very well and that this curve might have a relatively simple formula. Such a
formula might help us predict what the world record time in a 4,000 meter race
might be (if ever such a race would be run by world-class runners).
[Scatterplot of the world-record times (Seconds) against race distance (Meters) for the data in Table 1.1.]
Example 1.6.3.
The R dataset trees contains the measurements of the volume (in cu ft), girth
28
1.6. Two Quantitative Variables
(diameter of tree in inches measured at 4 ft 6 in above the ground), and height
(in ft) of 31 black cherry trees in a certain forest. Since girth is easily measured,
we might want to use girth to predict volume of the tree. A plot shows the
relationship.
> data(trees)
> trees[c(1:2,30:31),]
   Girth Height Volume
1    8.3     70   10.3
2    8.6     65   10.3
30  18.0     80   51.0
31  20.6     87   77.0
> xyplot(Volume~Girth,data=trees)
[Scatterplot of Volume against Girth for the 31 trees.]
In this example, we probably wouldn’t expect that a linear relationship is the
best way to describe the data. Furthermore, the data indicate that no simple
function is going to describe completely the variation in volume as a function
of girth. This makes sense because we know that trees of the same girth can
have different volumes.
1.6.2. Fitting Functions to Data
The three examples in the previous subsection share the following features. In each,
we are given n observations (x1 , y1 ), . . . , (xn , yn ) of quantitative variables x and y. In
each we would like to express the relationship between x and y, at least approximately,
using a simple functional form. In each case we would like to find a “model” that
explains y in terms of x. Specifically, we would like to find a simple functional
relationship y = f (x) between these variables. Summarizing, our goal is the following
Goal:
Given (x1 , y1 ), . . . , (xn , yn ), find a “simple” function f such that yi is
approximately equal to f (xi ) for every i.
29
1. Data
The goal is vague. We need to make precise the notion of “simple” and also the
measure of fit we will use in evaluating whether yi is close to f (xi ). In the rest of this
section, we make these two notions precise. The simplest functions we study are linear
functions such as the function that we used in Example 1.6.1. In other words, in this
case our goal is to find b0 and b1 so that yi ≈ b0 + b1xi for all i. (Statisticians use b0, b1
or a, b for the intercept and slope rather than the b, m that are typical in mathematics
texts. We will use b0, b1.) Of course, in only one of our motivating examples does it
seem sensible to use a line to approximate the data. So two important questions that
we will need to address are: How do we tell if a line is an appropriate description of
the relationship? and What do we do if a linear function is not the right relationship?
We will address both questions later.
How shall we measure the goodness of fit of a proposed function f to the data? For
each xi the function f predicts a certain value ŷi = f (xi ) for yi . Then ri = yi − ŷi is
the “mistake” that f makes in the prediction of yi . Obviously we want to choose f so
that the values ri are small in absolute value. Introducing some terminology, we will
call ŷi the fitted or predicted value of the model and ri the residual. (In statistics,
hats over variables always mean that we are talking about a predicted value.) The
following is a succinct statement of the relationship
observation = fitted + residual.
It will be impossible to choose a line so that all the values of ri are simultaneously
small (unless the data points are collinear). Various values of b0 , b1 might make
some values of ri small while making others large. So we need some measure that
aggregates all the residuals. Many choices are possible and R provides software to
find the resulting lines for many of these but the canonical choice and the one we
investigate here is the sum of squares of the residuals. Namely, our goal is now
refined to the following
Goal:
Given (x1, y1), . . . , (xn, yn), find b0 and b1 such that if f(x) = b0 + b1x
and ri = yi − f(xi), then ∑ᵢ₌₁ⁿ ri² is minimized.
We call ∑ᵢ₌₁ⁿ ri² the sum of squares of residuals and denote it by SSResid or SSE
(for sum of squares error). The choice of the squaring function here is quite analogous
to the choice of squaring in the definition of variance for measuring dispersion. Just
as in the search for a measure of dispersion, different ways of combining the ri are
possible. The line which minimizes the sums of the squares of the residuals is called
the “least-squares line” and we show how to find it in the next subsection.
1.6.3. The Least-Squares Line
Finding the least-squares line is a minimization problem of the sort that we are
familiar with from calculus. We wish to find b0 and b1 to minimize
SSResid = ∑ᵢ₌₁ⁿ ri² = ∑ᵢ₌₁ⁿ (yi − (b0 + b1xi))².
It is important to note here that SSResid is a function of (the unknowns) b0 and b1
(the xi and yi that appear in this function are not variables but rather have numerical
values). The task of finding b0 and b1 is that of minimizing a function of two variables.
Since this function is nicely differentiable (one consequence of using squares rather
than absolute values for instance), calculus tells us to find the points where the partial
derivatives of SSResid with respect to each of b0 and b1 are 0. (Of course then we have
to check that we have found a minimum rather than a maximum or a saddlepoint.)
After much algebra (thankfully omitted), we find that
b1 = ∑(xi − x̄)(yi − ȳ) / ∑(xi − x̄)²,      b0 = ȳ − b1x̄.
While it would be easy enough to use these formulas to compute the coefficients,
we will always find the least-squares line using R. The R function lm() finds the
coefficients of the least-squares line. Here, lm stands for linear model and the function
lm() uses the same formula syntax as does xyplot().
> lm(loss~Fe,data=corrosion)

Call:
lm(formula = loss ~ Fe, data = corrosion)

Coefficients:
(Intercept)           Fe
     129.79       -24.02
Using the values of b0 and b1 given above, the equation of our “least-squares” line is

y = ȳ + (∑(xi − x̄)(yi − ȳ) / ∑(xi − x̄)²) (x − x̄).
The quantities in these expressions are tedious to write, so we introduce some useful
abbreviations.
Sxx = ∑ᵢ₌₁ⁿ (xi − x̄)²          s²x = Sxx/(n − 1)

SST = Syy = ∑ᵢ₌₁ⁿ (yi − ȳ)²    s²y = Syy/(n − 1)

Sxy = ∑ᵢ₌₁ⁿ (xi − x̄)(yi − ȳ)
We can now rewrite the expression for b1 as

b1 = Sxy/Sxx

and the equation for the line as

y − ȳ = (Sxy/Sxx)(x − x̄).
An important fact that we note immediately from the above equation for the line
is that it passes through the point (x̄, ȳ). This says that, whatever else, we should
predict that the value of y is “average” if the value of x is “average”. This seems like
a plausible thing to do and it is a consequence of minimizing the sums of squares of
residuals as opposed to some other function of the residuals.
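The formulas above are easy to check numerically. The following sketch computes Sxx, Sxy, b1, and b0 directly for the corrosion data; up to rounding, the results should agree with the coefficients that lm() reported.
> x = corrosion$Fe; y = corrosion$loss
> Sxx = sum((x - mean(x))^2)
> Sxy = sum((x - mean(x)) * (y - mean(y)))
> b1 = Sxy/Sxx                 # should be about -24.02, the Fe coefficient
> b0 = mean(y) - b1*mean(x)    # should be about 129.79, the intercept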
The slope b1 of the regression line tells us something about the nature of the linear
relationship between x and y. A positive slope suggests a positive relationship between
the two quantities, for example. However the slope has units — it is often useful to
have a dimensionless measure of the linear relationship. The key to finding such a measure is to
re-express the variables x and y as unit-free quantities. We do that by “standardizing”
x and y. In problem 1.19 we introduced the notion of standardization of a variable.
If x is a variable, the new variable x′ = (x − x̄)/sx changes the data to have mean 0
and standard deviation 1. This new variable is unit-less. It can be shown that the
regression equation can be written as

(y − ȳ)/sy = r (x − x̄)/sx

where r is the correlation coefficient between x and y, given by

r = Sxy / √(Sxx Syy).
It can be shown that −1 ≤ r ≤ 1. For the corrosion dataset we find that the
correlation coefficient between iron content (Fe) and material loss due to corrosion
(loss) is −.98.
> cor(corrosion$loss,corrosion$Fe)
[1] -0.984743
This number can be easily interpreted using a sentence such as “loss decreases approximately .98 standard deviations for each increase of 1 standard deviation of iron
content in this dataset.”
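One way to see this “slope in standard units” interpretation of r is to regress the standardized variables on each other. The following is a sketch; the names zloss and zFe are our own, and the slope reported should equal the correlation, about −0.98, with an intercept of essentially zero.
> zloss = (corrosion$loss - mean(corrosion$loss))/sd(corrosion$loss)
> zFe = (corrosion$Fe - mean(corrosion$Fe))/sd(corrosion$Fe)
> lm(zloss ~ zFe)   # slope equals r, about -0.98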
In R, the object returned by the lm() function is actually a list that contains more
than just the fitted line. There are several functions that access the information contained in that object. In particular, residuals() and fitted() return vectors of
the same length as the data containing the residuals and fitted values corresponding
to each data point.
> l=lm(loss~Fe,corrosion)
> fitted(l)
        1         2         3         4         5         6         7         8
129.54640 118.25705 112.73247 106.96770 101.20293 129.54640 118.25705  95.19795
        9        10        11        12        13
112.73247  82.70761 129.54640  95.19795  82.70761
> residuals(l)
         1          2          3          4          5          6          7
-1.9464003  5.7429496 -1.9324749 -3.0677005  0.2970739  0.5535997  3.7429496
         8          9         10         11         12         13
-2.8979527  0.3675251  0.9923919 -1.5464003 -3.7979527  3.4923919
From this output, we can see that the largest residual corresponds to the second data
point. For that point, (0.48,124), the predicted value is 118.26 and the residual is 5.74.
Note that a positive residual means that the prediction underestimates the actual value.
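The identity observation = fitted + residual, and the claim about the largest residual, can both be checked directly from the object l defined above. A short sketch:
> max(abs(fitted(l) + residuals(l) - corrosion$loss))   # essentially zero: observation = fitted + residual
> which.max(abs(residuals(l)))                          # identifies observation 2 as having the largest residual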
A plot of residuals is often useful in determining whether a linear relationship is an
appropriate description of the relationship between the two variables. We know that
the track record data of Example 1.6.2 is not best summarized by a linear relationship.
When we try to do that, we have the residual plot of Figure 1.10.
> track=read.csv('http://www.calvin.edu/~stob/data/mentrack.csv')
> l=lm(Seconds~Meters,data=track)
> xyplot(residuals(l)~Meters,data=track)
The residual plot certainly suggests that there is structure in the data beyond a linear
trend. The fitted model consistently underpredicts at short and long distances
while overpredicting at intermediate distances.
1.6.4. Non-linear Curve Fitting
Suppose that we wish to fit a function y = f (x) to data for which a linear function is
clearly not appropriate. If we have some simple class of functions that is defined by
a few parameters (such as the class of linear functions above), we could proceed in a
Figure 1.10.: A residual plot for the male world records in track data.
similar way — that is we could find the values of the parameters that minimize the
sums of squares of residuals. Since this is usually a non-linear minimization problem,
it is often quite difficult though R gives us many tools to solve it. We will not pursue
this strategy here but instead look at the more common strategy of linearization. We
do this by means of the tree example, Example 1.6.3.
Example 1.6.4.
In Example 1.6.3, the relationship between the volume V and girth G of a
sample of cherry trees can be easily seen to be nonlinear. Both the plot of the
data and our geometrical intuition tell us this. We might suppose instead for
example that the relationship could be modeled by V = b0 + b1G² for some
b0 , b1 . We can see that the choice of coefficients b0 and b1 is easy here - we
simply have to transform our girth variable by squaring it. We show how to do
this using R to transform the variable as well as by doing the transformation
directly.
> lm(Volume~I(Girth^2),data=trees)

Call:
lm(formula = Volume ~ I(Girth^2), data = trees)

Coefficients:
(Intercept)   I(Girth^2)
    -3.3551       0.1812

> G2=trees$Girth^2
> lm(Volume~G2,data=trees)

Call:
lm(formula = Volume ~ G2, data = trees)

Coefficients:
(Intercept)           G2
    -3.3551       0.1812
Figure 1.11 shows that the fit of this linearization is fairly good.
Figure 1.11.: Volume of trees as a function of square of girth.
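To see the fit on the original scale of Girth rather than Girth^2, one can plot the data with the base R function plot() and overlay the fitted parabola with curve(). This is a sketch; the name q is our own, and the coefficients are those reported above.
> q = lm(Volume ~ I(Girth^2), data=trees)
> plot(Volume ~ Girth, data=trees)
> curve(coef(q)[1] + coef(q)[2]*x^2, add=TRUE)   # the fitted curve -3.3551 + 0.1812*Girth^2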
In the above example, the relationship was linear in the square of Girth. However
we must sometimes transform both of the variables to get a linear relationship.
Example 1.6.5.
Suppose that we are uncertain about using the square of the girth to predict
volume in the tree example. Suppose instead that we assume the relationship
has the form
V = b0 G^b1.

This relationship does not assume that the appropriate exponent b1 is equal to
2. We can linearize this relationship by taking the natural log of each
side of the equation to give

ln V = ln b0 + b1 ln G.
Regression yields ln b0 = −2.353 (b0 = .095) and b1 = 2.20.
> lm(I(log(Volume))~I(log(Girth)),data=trees)

Call:
lm(formula = I(log(Volume)) ~ I(log(Girth)), data = trees)

Coefficients:
  (Intercept)  I(log(Girth))
       -2.353          2.200
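The following sketch shows how the coefficients on the log scale translate back to the original relationship V = b0 G^b1; the object name lv and the example girth of 16 inches are our own choices, and I() is not actually required around log() in a formula.
> lv = lm(log(Volume) ~ log(Girth), data=trees)
> exp(coef(lv)[1])   # back-transformed intercept, the estimate of b0 (roughly 0.095)
> coef(lv)[2]        # the estimate of the exponent b1 (roughly 2.20)
> exp(predict(lv, newdata=data.frame(Girth=16)))   # approximate predicted volume for a girth of 16 inches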
1.7. Exercises
1.1 A program for a sports event usually lists some characteristics of the members of
at least the home team. For a basketball team, name two categorical variables and
two quantitative variables that might be listed for each player.
1.2 The built-in R dataset OrchardSprays has four variables. Classify these as quantitative or categorical.
1.3 Load the built-in R dataset chickwts. (Use data(chickwts).)
a) How many individuals are in this dataset?
b) How many variables are in this dataset?
c) Classify the variables as quantitative or categorical.
1.4 For each of the following set of individuals and corresponding variable, determine
whether the distribution of that variable is likely to be relatively symmetric, skewed
positively, skewed negatively, or not easily described by any of these terms. Give a
reason for each of your answers.
a) the GPA of students at Calvin,
b) the yearly income of last year’s Calvin graduates,
c) the scores of all basketball games played in the National Basketball Association
last year,
d) the high temperature in Grand Rapids for each of the 365 days of 2007,
e) the weights of adult males in the United States (a big dataset!).
1.5 Use R to produce a stemplot of the eruption times of the Old Faithful geyser.
(See Figure 1.4.)
a) What are the actual values of the eruption times listed in the first row of the
stemplot?
b) There is some ambiguity in the row labeled 26. What possible values could the
entries in that row represent?
36
1.7. Exercises
1.6 The distribution of a quantitative variable is symmetric about m if whenever
there are k data values equal to m + d there are also k data values equal to m − d.
a) Show that if a distribution is symmetric about m then m is the median. (You
may need to handle separately the cases where the number of values is odd and
even.)
b) Show that if a distribution is symmetric about m then m is the mean.
c) Create a small distribution that is not symmetric about m, but the mean and
median are both equal to m.
1.7 Describe some situations where the mean or median is clearly a better measure
of central tendency than the other.
1.8 A bowler normally bowls a series of three games. When the author was first
learning long division, he learned to compute a bowling average. However he did not
completely understand the concept since to find the average of three games, he took
the average of the first two games and then averaged that with the third game.
(That is, if x̄₂ denotes the mean of the first two games, x₃ the score of the third game,
and x̄₃ the mean of the three games, the author thought that x̄₃ = (x̄₂ + x₃)/2.)
a) Give a counterexample to the author’s method of computing the average of
three games.
b) Given x̄₂ and x₃, how should x̄₃ be computed?
c) Generalizing, given the mean x̄ₙ of n observations and an additional observation
xₙ₊₁, how should the mean x̄ₙ₊₁ of the n + 1 observations be computed?
1.9 Sketch a boxplot of a distribution that is positively skewed.
1.10 Compute the five-number summary of the squares of the numbers 1, 2, . . . , 13
and draw a boxplot of this distribution by hand.
1.11 Consider the numbers 1², 2², 3², . . . , (4k + 1)².
a) Compute the five-number summary of this set of numbers.
b) What is the IQR of this set of numbers?
c) Is there any k such that there would be an outlier in this set of data according
to the 1.5 IQR rule?
1.12 Show that the total deviation from the mean, defined by

total deviation from the mean = ∑ᵢ₌₁ⁿ (xi − x̄),
is 0 for any distribution.
1.13 Find a distribution with 10 values between 0 and 10 that has as large a variance
as possible.
1.14 Find a distribution with 10 values between 0 and 10 that has as small a variance
as possible.
1.15 We could compute the mean absolute deviation from the median instead of
from the mean. This function is often called MAD:
MAD(x) = (1/n) ∑ᵢ₌₁ⁿ |xi − x̃|.
(The R function mad() computes a slight variant of MAD.) Show that the mean
absolute deviation from the median is always less than or equal to the mean absolute
deviation from the mean.
1.16 Let SS(c) = ∑(xi − c)². (SS stands for sum of squares.) Show that the smallest
value of SS(c) occurs when c = x̄. This shows that the mean is a minimizer of SS.
(Hint: use calculus. How does one find the value of c that minimizes a “nice” function
of c?)
1.17 Suppose that x1 , . . . , xn are the values of some variable and a new variable y is
defined by adding a constant c to each xi . In other words, yi = xi + c for all i.
a) How does ȳ compare to x̄?
b) How does Var(y) compare to Var(x)?
1.18 Repeat Problem 1.17 but with yi defined by multiplying xi by c. In other words,
yi = cxi for all i.
1.19 Suppose that x1 , . . . , xn are given and we define a new variable z by
zi = (xi − x̄)/sx.
What are the mean and the standard deviation of the variable z? This transformed
variable is called the standardization of x. In R, the expression z=scale(x) produces the standardization. The standard value zi of xi is also sometimes called the
z-score of xi .
1.20 The dataset singer comes with the lattice package. Make sure that you have
loaded the lattice package and then load that dataset. The dataset contains the
heights of 235 singers in the New York Choral Society.
38
1.7. Exercises
a) Using a histogram of the heights of the singers, describe the distribution of
heights.
b) Using side-by-side boxplots, describe how the heights of singers vary according
to the part that they sing.
1.21 The R dataset barley has the yield in bushels/acre of barley for various varieties
of barley planted in 1931 and 1932. There are three categorical variables in play:
the variety of barley planted, the year of the experiment, and the site at which
the experiment was done (the site Grand Rapids is in Minnesota, not Michigan). By
examining each of these variables one at a time, make some qualitative statements
about the way each variable affected yield. (e.g., did the year in which the experiment
was done affect yield?)
1.22 A dataset from the Data and Story Library on the result of three different
methods of teaching reading can be found at http://www.calvin.edu/~stob/data/
reading.csv. The data includes the results of various pre- and post-tests given to
each student. There were 22 students taught by each method. Using the results of
POST3, what can you say about the differences in reading ability of the three groups
at the end of the course? Would you say that one of the methods is better than the
other two? Why or why not?
1.23 The death penalty data illustrated Simpson’s paradox. Construct your own
illustration to conform to the following story:
Two surgeons each perform the same kind of heart surgery. The result
of the surgery could be classified as “successful” or “unsuccessful.” They
have each done exactly 200 surgeries. Surgeon A has a greater rate of
success than Surgeon B. Now the surgical patient’s case can be classified
as either “severe” or “moderate.” It turns out that when operating on
severe cases, Surgeon B has a greater rate of success than Surgeon A. And
when operating on moderate cases, Surgeon B also has a greater rate of
success than Surgeon A.
By the way, who would you want to be your surgeon?
1.24 Data on the 2003 American League Baseball season is in the file http://www.calvin.edu/~stob/data/al2003.csv.
a) Suppose that we wish to predict the number of runs (R) a team will score on
the year given the number of homeruns (HR) the team will hit. Write a linear
relationship between these two variables.
b) Use this linear relationship to predict the number of runs a team will score given
it hits 200 homeruns on the year.
39
1. Data
c) Are there any teams for which the linear relationship does a poor job in predicting runs from homeruns?
1.25 Continuing to use data from the AL 2003 baseball season, suppose that we wish
to predict the number of games a team will win (W) from the number of runs the
team scores (R).
a) Write a linear relationship for W in terms of R.
b) How many runs must a team score to win 81 games according to this relationship?
1.26 Suppose that we wish to fit a linear model without a constant: i.e., y = bx.
Find the value of b that minimizes the sum of squares of residuals, ∑ᵢ₌₁ⁿ (yi − bxi)²,
in this case. (Hint: there is only one variable here, b, so this is a straightforward
Mathematics 161 max-min problem.)
1.27 In R, if we wish to fit a line y = bx without the constant term, we use lm(y~x-1).
(The -1 in the formula notation in this context tells R to omit the constant term.)
Using the same data as Problem 1.25, define new variables for W − L and R − OR.
(For example, define wl=s$W-s$L where s is the data frame containing your data.)
a) Write W − L as a linear function of R − OR without a constant term.
b) Why do you think it makes sense (given the nature of the variables) to omit a
constant term in this model?
1.28 The R dataset women gives the average weight of American women by height.
Do you think that a linear relationship is the best way to describe the relationship
between average weight and height?
1.29 Find a transformation that transforms the following nonlinear equations y =
f(x) (that depend on parameters b0 and b1) to linear equations g(y) = b0′ + b1′ h(x).

a) y = b0/(b1 + x)

b) y = x/(b0 + b1x)

c) y = 1/(1 + b0 e^(b1x))
1.30 The R dataset Puromycin gives the rate of an enzymatic reaction (in counts/min/min)
as a function of substrate concentration (in ppm) for two groups - one treated with
Puromycin and one not treated. The biochemistry suggests that these two variables
are related by

rate = b0 · conc/(b1 + conc).
Find good approximations to b0 , b1 by re-expressing the relationship as a linear one.
41
2. Data from Random Samples
If we are to make decisions based on data, we need to be careful in their collection. In
this chapter we consider one common way of generating data, that of sampling from
a population.
2.1. Populations and Samples
To determine whether Kellogg’s is telling the truth about the net weight of its boxes
of Raisin Bran, it is simply not feasible to weigh every box of cereal in the warehouse.
Instead, the procedure recommended by NIST (National Institute of Standards and
Technology) tells us to select a sample consisting of a relatively small number of
boxes and weigh those. For example, in a shipment of 250 boxes, NIST tells us to
weigh just 12. The hope is that this smaller sample is representative of the larger
collection, the population of all cereal boxes. We might hope, for example, that the
average weight of boxes in the sample is close to the average weight of the boxes in
the population.
Definition 2.1.1 (population). A population is a well-defined collection of individuals.
As with any mathematical set, sometimes we define a population by a census or
enumeration of the elements of the population. The registrar can easily produce an
enumeration of the population of all currently registered Calvin students. Other times,
we define a population by properties that determine membership in the population.
(In mathematics, we define sets like this all the time since many sets in mathematics
are infinite and so do not admit enumeration.) For example, the set of all persons
who voted in the last Presidential election is a well-defined population but it doesn’t
admit an easy enumeration.
Definition 2.1.2 (sample). A subset S of population P is called a sample from P .
Quite typically, we are studying a population P but have only a sample S and have
the values of one or several variables for each element of S. The canonical goal of
(inferential) statistics is:
Goal:
Given a sample S from population P and values of a variable X on
elements of S, make inferences about the values of X on the elements of
P.
Most commonly, we will be making inferences about parameters of the population.
Definition 2.1.3 (parameter). A parameter is a numerical characteristic of the population.
For example, we might want to know the mean value of a certain variable defined
on the population. One strategy for estimating the mean of such a variable is to take
a random sample and compute the mean of the sample elements. Such an estimate
is called a statistic.
Definition 2.1.4 (statistic). A statistic is a numerical characteristic of a sample.
Example 2.1.5.
The Current Population Survey (CPS) is a survey sponsored jointly by the Census Bureau and the Bureau of Labor Statistics. Each month 60,000 households
are surveyed. The intent is to make inferences about the whole population of
the United States. For example, one population parameter is the unemployment
rate – the ratio of the number of those unemployed to the size of the total labor
force. The sample produces a statistic that is an estimate of the unemployment
rate of the whole population.
Obviously, our success in using a sample to make inferences about a population
will depend to a large extent on how representative S is of the whole population P
with respect to the properties measured by X. As one might imagine, if the 60,000
households in the Current Population Survey are to give dependable information
about the whole population, they must be chosen very carefully.
Example 2.1.6.
The Literary Digest began forecasting elections in 1912. While it forecasted
the results of the election accurately until 1932, in 1936 the poll predicted that
Alf Landon would receive 55% of the popular vote. But Roosevelt went on
to win the election in a landslide with 61% of the popular vote. What went
wrong with the poll? There were at least two problems with the survey.
First, the Literary Digest sampled from telephone directories and automobile
registration lists. Voters with telephones and automobiles in 1936 tended to be
more affluent and so were somewhat more likely to favor Landon than the typical
voter. Second, although the Digest sent out more than 10 million questionnaires,
only 2.3 million of these were returned. So it probably is the case that voters
favorable to Landon were more likely to return their questionnaires than those
favorable to Roosevelt.
The representativeness of the sample will depend on how the sample is chosen. A
convenience sample is a sample chosen simply by locating units that conveniently
present themselves. A convenience sample of students at Calvin could be produced by
grabbing the first 100 students that come through the doors of Johnny’s. It’s pretty
obvious that in this case, and for convenience samples in general, there is no guarantee
that the sample is likely to be representative of the whole population. In fact we can
predict some ways in which a “Johnny’s sample” would not be representative of the
whole student population.
One might suppose that we could construct a representative sample by carefully
choosing the sample according to the important characteristics of the units. For example, to choose a sample of 100 Calvin students, we might ensure that the sample
contains 54 females and 46 males. Continuing, we would then ensure a representative
proportion of first-year students, dorm-livers, etc. There are several problems with
this strategy. There are usually so many characteristics that we might consider that
we would have to take too large a sample so as to get enough subjects to represent
all the possible combinations of characteristics in the proportions that we desire. It
might be expensive to find the individuals with the desired characteristics. We have
no assurance that the subjects we choose with the desired combination of characteristics are representative of the group of all the individuals with those characteristics.
Finally, even if we list many characteristics, it might be the case that the sample will
be unrepresentative according to some other characteristic that we didn’t think of
and that characteristic might turn out to be important for the problem at hand.
Instead of trying to construct a representative sample, most survey samples are
chosen at “random.” We investigate the simplest sort of random sample in the next
section.
2.2. Simple Random Samples
Definition 2.2.1 (simple random sample). A simple random sample (SRS) of size k
from a population is a sample that results from a procedure for which every subset
of size k has the same chance to be the sample chosen.
For example, to pick a random sample of size 100 of the 4,224 Calvin students,
we might write the names of all Calvin students on index cards and choose 100 of
these cards from a well-mixed bag of all the cards. In practice, random samples
are often picked by computers that produce “random numbers.” (A computer can’t
really produce random numbers since a computer can only execute a deterministic
algorithm. However computers can produce numbers that behave as if they are random. We’ll talk about what that might mean later.) In this case, we would number
all students from 1 to 4,224 and then choose 100 numbers from 1 to 4224 in such a
way that any set of 100 numbers has the same chance of occurring. The R command
sample(1:4224,100,replace=F) will choose such a set of 100 numbers.
It is certainly possible that a random sample is unrepresentative in some significant
way. Since all possible samples are equally likely to be chosen, by definition it is
possible that we choose a really bad sample. For example, a random sample of Calvin
students might fail to have any seniors in it. However the fact that a sample is chosen
by simple random sampling enables us to make quantitative statements about the
likelihood of certain kinds of nonrepresentativeness. This in turn will enable us to
make inferences about the population and to make statements about how likely it is
that our inferences are accurate. In Chapter 5 we will see how to place some bounds
on the error that using a random sample might produce. Such error is called sampling
error.
Definition 2.2.2 (sampling error). The sampling error of an estimate of a population parameter is the error that results from using a sample rather than the whole
population to estimate the parameter.
Of course we cannot know the sampling error exactly (this is equivalent to knowing
the population parameter). But we will be able to place some bounds on it. High
quality public opinion polls are usually published with some information about the
sampling error. For example, typical political polls are reported as was this one,
taken the day before the 2008 Iowa Republican caucuses.
Mitt Romney is favored by 43% of the Iowa voters (with a margin of error
of ±3%).
While we will learn to carefully interpret this statement in Section 4.3, it means
roughly that we can be reasonably sure that 40%–46% of the population of Iowa
voters favors Romney if the only errors made in this process are those introduced by
using a sample rather than the whole population. (Though this survey was reported
the day before the Iowa caucuses, Romney actually only received 25.2% of the votes
in those caucuses.)
To see how large or small sampling error might be we return to the data on US
counties.
Example 2.2.3.
Recall that the dataset http://www.calvin.edu/~stob/data/counties.csv
contains data on the 3,141 county equivalents in the United States. Suppose
that we take a random sample of size 10 counties from this population. How
representative is it? For example, can we make inferences about the mean population per county from a sample of size 10? (Of course in this instance, we
know the actual mean population per county – 89,596 – so we do not need a
sample to estimate it!) There are too many possible samples of size 10 to investigate them all, but we can get an idea of what might happen by taking many
different samples. In the following example, we collect 10,000 different random
samples of size 10. Notice that one of these samples had a mean population of
as small as 8,392 and another larger than 1.1 million. Half of the samples had
means between 38,219 and 107,426. It looks like using a sample of size 10 would
more often than not produce a sample with mean considerably less than the
population mean. This is to be expected since the distribution of populations
by county is highly skewed. Notice also from the example that samples of size
30 produce a narrower range of estimates than samples of size 10. That’s of
course not surprising. The distribution of all of the 10,000 samples of size 10
and of size 30 are in the histograms of Figure 2.1.
> mean(counties$Population)
[1] 89596.28
> fivenum(counties$Population)
[1]      67   11206   24595   61758 9519338
> samples = replicate(10000, mean(sample(counties$Population,10,replace=F)))
> fivenum(samples)
[1]    8391.70   38219.15   62015.35  107425.60 1122651.50
> samples30 = replicate(10000, mean(sample(counties$Population,30,replace=F)))
> fivenum(samples30)
[1]  18066.50  56462.10  78047.07 107471.27 592331.20
Figure 2.1.: Sample means of 10,000 samples of size 10 (left) and 30 (right) of U.S.
Counties
Of course the description of simple random sampling above is an idealized picture of
what happens in the real world. We are assuming that we can produce a dependable
list of the entire population, that we can have access to any subset of a particular size
from that population, and that we get perfect information about the sample that we
choose. The Current Population Survey Technical Manual spends considerable effort
identifying and attempting to measure non-sampling error. It lists several basic kinds
of such errors.
1. Inability to obtain information about all sample cases (unit nonresponse).
2. Definitional difficulties.
3. Differences in the interpretation of questions.
4. Respondent inability or unwillingness to provide correct information.
5. Respondent inability to recall information.
6. Errors made in data collection, such as recording and coding data.
7. Errors made in processing the data.
8. Errors made in estimating values for missing data.
9. Failure to represent all units with the sample (i.e., under-coverage).
[Bur06]
Most surveys of real populations (of people) fall prey to some or all of these problems.
Example 2.2.4.
The US National Immunization Survey attempts to determine how many young
children receive the common vaccines against childhood illnesses. For example,
in 2006, this survey estimates that 92.9% of children aged 19–35 months at the time
of the survey had received at least three doses of one of the polio vaccines.
The sampling error reported for this estimate is 0.6%. The survey itself is a
telephone survey that reaches households containing, in total, at least 30,000 children. One issue
with a telephone survey is that not all children of the appropriate age live in a
household with a telephone. Also, it is extremely difficult to choose telephone
numbers at random.
Though we would like a list of the entire population from which to choose our
sample, as in the previous example we often must choose our sample from another list
that does not “cover” the population. The sampling frame is the list of individuals
from which we actually choose our sample. The quality of the sampling frame is one of
the most important features in ensuring a representative sample. Political pollsters,
for example, would like a list of all and only those persons who will actually vote
in the election. Usual sampling frames will omit some of these voters but will also
include many persons who will not vote.
Besides imperfect sampling frames, the biggest source of non-sampling error is often
non-response.
Example 2.2.5.
In 2004 during Quest, all incoming Calvin students were given a survey, the
CIRP Freshmen Survey. In other words, the “sample” was actually the whole
first year class. However only 43% of the first-year students actually filled out
the survey and returned it. It turns out that the students who did return the
survey had a much higher GPA by Spring, 2007 (when they were Juniors), than
those students who had not returned the survey. So the sample of students
studied in this survey was probably not representative of the first-year students
of 2004 in at least one important way.
The response rate in the National Immunization Survey is about 75% which is very
high for large surveys such as that one. Additionally, considerable effort is expended
in determining in what ways non-responders might differ from responders so that the
results from these 75% can be generalized to the whole population.
2.3. Other Sampling Plans
The concept of random sampling can be extended to produce samples other than
simple random samples. There are a number of reasons that we might want to choose
a sample that is not a simple random sample. One important reason is to reduce sampling error. Consider the situation in which the population in question has several
subpopulations that differ substantially on the variables in question. For example,
suppose that we wish to survey Calvin College students to determine whether they
favor abolishing the Interim. It seems likely that the seniors (who have taken three or
four interims) might have in general a higher opinion of the interim than first-year
students who have only taken DCM. Then a simple random sample in which first-year
students happen to be overrepresented is likely to underestimate the percentage of
students favoring the interim. A sample in which the classes are represented proportionally is an obvious strategy for overcoming this bias.
Example 2.3.1.
Suppose that we wish to have a sample of Calvin students of size 100 in which
the classes are represented proportionally. We should then choose a sample
according to the breakdowns in Table 2.1.
Class Level    Population   Sample
First-year          1,016       24
Sophomore             977       23
Junior                949       23
Senior              1,072       26
Other                 157        4
Total               4,171      100

Table 2.1.: Population of Calvin Students and Proportionate Sample Sizes
Once we have defined the sizes of our subsamples, it seems wise to proceed to choose
simple random samples from each subpopulation.
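As a minimal sketch of how such a stratified sample might be drawn in R, suppose (hypothetically) that we have a data frame students with one row per student and a variable Class whose values are the class levels used in Table 2.1. Then a simple random sample of the required size can be drawn within each stratum:
> sizes = c("First-year"=24, "Sophomore"=23, "Junior"=23, "Senior"=26, "Other"=4)
> strata = split(1:nrow(students), students$Class)               # row numbers grouped by class level
> chosen = unlist(mapply(sample, strata[names(sizes)], sizes))   # an SRS within each stratum
> stratifiedSample = students[chosen,]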
Definition 2.3.2 (stratified random sample). A stratified random sample of size k
from a population is a sample that results from a procedure that chooses simple
random samples from each of a finite number of groups (strata) that partition the
population.
In the example of sampling from the Calvin student body, we chose the random
sample so that the number of individuals in the sample from each stratum was proportional to the size of that stratum. While this procedure has much to recommend it,
it is not necessary and sometimes not even desirable. For example, only 4 “other”
students appear in our sample of size 100 from the whole population. This is fine if
we are only interested in making inferences about the whole population, but often
we would like to say something about the subgroups as well. For example, we might
want to know how much Calvin students work in off-campus jobs but we might expect
and would like to discover differences among the class levels in this variable. For this
purpose, we might choose a sample of 20 students from each of the five strata. (Of
course we would have to be careful about how to combine our numbers when making
inferences about the whole population.) We would say about this sample that we
have “oversampled” one of the groups. In public opinion polls, it is often the case
that small minority groups are oversampled. The sample that results will still be
called a random sample.
Definition 2.3.3 (random sample). A random sample of size k from a population is
a sample chosen by a procedure such that each element of the population has a fixed
probability of being chosen as part of the sample.
While we need to give a definition of probability in order to make this definition
precise, it is clear from the above examples what we mean. This definition differs
from that of a simple random sample in two ways. First, it does not require that
each object have the same likelihood of being chosen as part of the sample. Second, it does not
require that equal likelihood extend to groups of objects. It is obvious that stratified random
sampling is a form of random sampling according to this definition.
Other forms of sampling meet the above definition of random sampling without
being simple random sampling. A sampling method that we might employ given a
list of Calvin students is to choose one of the first 417 students in the list and then
choose every 417th student thereafter. Obviously some subsets can never occur as the
sample since two students whose names are next to each other in the list can never
be in the same sample. Such a sample might indeed be representative however.
Example 2.3.4.
Another kind of modification to random sampling is used in the Current Population Survey. This survey of 60,000 households in the United States is conducted by individuals who live and work near enough to the sample subjects so
that they can conduct the survey in person. It is easy to imagine that 60,000
households chosen totally at random might be inconveniently distributed geographically. The CPS works as follows. First, the country is divided into about
800 primary sampling units (PSUs), which must not be too large geographically. For example, each large city (actually, each Metropolitan Statistical Area) is
a PSU. Other PSUs are whole counties or pairs of contiguous counties.
The PSUs are grouped into strata, and then one PSU per stratum is chosen at
random (with a probability proportional to its population). The next stage
of the sampling procedure is to choose at random certain housing clusters. A
housing cluster is a group of four housing units in a PSU. The idea behind
sampling housing clusters rather than individual houses is to cut down on interviewer travel time. A larger sample is generated for the same cost. Of course
the penalty for using clusters is that clusters tend to have less variability than
the whole PSU in which the cluster lies and so the group of individuals in the
cluster will probably not be as representative of the PSU as a sample of similar
size chosen from the PSU at random.
The CPS illustrates two enhancements to simple random samples: it is multistage
(but with random sampling at each stage) and it produces a cluster sample, a sample
in which the ultimate sampling units are not the individuals desired but clusters of
individuals. We will not undertake a formal study of all the variants of sampling
methods and their resultant sampling errors, but it is good to keep in mind that most
large scale surveys are not simple random samples but some modification thereof.
Nevertheless, they all rely on the basic principle that randomness is our best hope for
producing representative samples.
It is very important to note that we cannot guarantee by using random sampling of
whatever form that our sample is representative of the population along the dimension
we are studying. In fact with random sampling, it is guaranteed that it is possible
that we could select a really bad (unrepresentative) sample. What we hope to be able
to do (and we will later see how to do it) is to be able to quantify our uncertainty
about the representativeness of the sample.
2.4. Exercises
2.1 In the parts below, we list some convenience samples of Calvin students. For
each of these methods for sampling Calvin students, indicate in what ways the
sample is likely not to be representative of the population of all Calvin students.
a) The students in Mathematics 243A.
b) The students in Nursing 329.
c) The first 30 students who walk into the FAC west door after 12:30 PM today.
d) The first 30 students you meet on the sidewalk outside Hiemenga after 12:30
PM today.
e) The first 30 students named in the “Names and Faces” picture directory.
f ) The men’s basketball team.
2.2 Suppose that we were attempting to estimate the average height of a Calvin
student. For this purpose, which of the convenience samples in the previous problem
would you suppose to be most representative of the Calvin population? Which would
you suppose to be least representative?
2.3 Consider the set of natural numbers P = {1, 2, . . . , 30} to be a population.
a) How many prime numbers are there in the population?
b) If a sample of size 10 is representative of the population, how many prime
numbers would we expect to be in the sample? How many even numbers would
we expect to be in the sample?
c) Using R choose 5 different samples of size 10 from the population P . Record
how many prime numbers and how many even numbers are in each sample.
Make any comments about the results that strike you as relevant.
2.4 Before easy access to computers, random samples were often chosen by using
tables of random digits. The tables looked something like this table which was constructed in R.
 [1] 40139 61007 60277 41219 45533 68878 48506 11950 07747 69280
[11] 82348 44867 12854 03179 21145 91154 84831 78503 00159 97920
[21] 09366 05554 86209 36252 33740 92037 21446 63192 87206 58877
[31] 00976 43068 88362 42080 54161 34593 18209 04344 52566 86976
[41] 83264 34861 60488 52180 03796 17289 39816 19080 64575 55492
[51] 54703 28006 03477 66384 55787 42212 55253 82256 61471 73665
Each digit in this table is supposed to occur with equal likelihood as are all pairs,
triples, etc. Suppose that a population has 280 individuals numbered 1–280. Explain
how to use the table to choose a random sample of size 5 from these individuals.
Write down the five numbers of the individuals that are chosen by your method.
2.5 In a very small class, the final exam scores of the six students were 139, 145, 152,
169, 171, and 189.
a) How many different simple random samples of size 3 of students in this class
are there?
b) What is the “population” mean of exam scores?
c) Suppose that we use the mean of exam scores of a SRS of size 3 to estimate the
population mean. What is the greatest possible error that we could make?
2.6 Donald Knuth, the famous computer scientist, wrote a book entitled “3:16”. This
book was a Bible study book that studied the 16th verse of the 3rd chapter of each
book of the Bible (that had a 3:16). Knuth’s thesis was that a Bible study of random
verses of the Bible might be edifying. The sample was of course not a random sample
of Bible verses and Knuth had ulterior motives in choosing 3:16. Describe a method
for choosing a random sample of 60 verses from the Bible. Construct a method
that is more complicated than simple random sampling that seeks to get a sample
representative of all parts of the Bible. (You might find the table of number of verses in
each book of the bible at http://www.deafmissions.com/tally/bkchptrvrs.html
to be useful!)
2.7 Suppose that we wish to survey the Calvin student body to see whether the
student body favors abolishing the Interim (we could only hope!). Suppose that
instead of a simple random sample, we select a random sample of size 20 from each
of the five groups of Table 2.1. Suppose that of 20 students in each group, 9 of the
first-year students, 10 of the sophomores, 13 of the juniors, 19 of the seniors and
all 20 of the other students favor abolishing the interim. Produce an estimate of
the proportion of the whole student body by using these sample results. Be sure to
describe and justify your computation that uses these results.
2.8 There are 3,141 county equivalents in the county dataset (http://www.calvin.edu/~stob/data/uscounties.csv). Suppose that we wish to take a random sample
of 60 counties. What are two different variables that might be useful to create strata
for a stratified random sample?
2.9 Describe a method for choosing a random sample of 200 Calvin students using
the “Names and Faces” directory.
2.10 You would like to estimate the percentage of books in the library that have red
covers. Describe a method of choosing a random sample of books to help estimate
this parameter. Discuss any problems that you see with constructing such a sample.
3. Probability
3.1. Random Processes
Probability theory is the mathematical discipline concerned with modeling situations
in which the outcome is uncertain. For example, in choosing a simple random sample
we do not know which sample of individuals from the population we will actually
get. The basic notion is that of a probability.
Definition 3.1.1 (A probability). A probability is a number meant to measure the
likelihood of the occurrence of some uncertain event (in the future).
Definition 3.1.2 (probability). Probability (or the theory of probability) is the
mathematical discipline that
1. constructs mathematical models for “real-world” situations that enable the computation of probabilities (“applied” probability)
2. develops the theoretical structure that undergirds these models (“theoretical”
or “pure” probability).
The setting in which we make probability computations is that of a random process.
(What we call a random process is usually called a random experiment in the literature
but we use process here so as not to get the concept confused with that of randomized
experiment, a concept that we introduce later.)
Characteristics of a Random Process:
1. A random process is something that is to happen in the future (not in the
past). We can only make probability statements about things that have not
yet happened.
2. The outcome of the process could be any one of a number of outcomes and
which outcome will obtain is uncertain.
3. The process could be repeated indefinitely (under essentially the same circumstances), at least in theory.
Historically, some of the basic random processes that were used to develop the
theory of probability were those originating in games of chance. Tossing a coin or
dealing a poker hand from a well-shuffled deck are examples of such processes. One
of the most important random processes that we study is that of choosing a random
sample from a population. It is clear that this process has all three characteristics of
a random process.
The first step in understanding a random process is to identify what might happen.
Definition 3.1.3 (sample space, event). Given a random process, the sample space
is the set (collection) of all possible outcomes of the process. An event of the random
process is any subset of the sample space.
The next example lists several random processes, their sample spaces, and a typical
event for each.
Example 3.1.4.
1. A fair die is tossed. The sample space can be described as the set S =
{1, 2, 3, 4, 5, 6}. A typical event might be E = {2, 4, 6}; i.e., the event that
an even number is rolled.
2. A card is chosen from a well-shuffled standard deck of playing cards. There
are 52 outcomes in the sample space. A typical event might be “A heart
is chosen” which is a subset consisting of 13 of the possible outcomes.
3. Twenty-nine students are in a certain statistics class. It is decided to
choose a simple random sample of 5 of the students. There are a boatload
of possible outcomes. (It can be shown that there are 118,755 different
samples of 5 students out of 29.) One event of interest is the collection of
all outcomes in which all 5 of the students are male. Suppose that 25 of
the students in the class are male. Then it can be shown that 53,130 of
the outcomes comprise this event.
We often have some choice as to what we call outcomes of a random process. For
example, in Example 3.1.4(3), we might consider two samples different outcomes if the
students in the sample are chosen in a different order, even if the same five students
appear in the samples. Or we might call such samples the same outcome. To some
extent, what we call an outcome depends on the way in which we are going to use
the results of the random process.
Given a random process, our goal is to assign to each event E a number P(E)
(called the probability of E) such that P(E) measures in some way the likelihood
of E. In order to assign such numbers however, we need to understand what they
are intended to measure. Interpreting probability computations is fraught with all
sorts of philosophical issues but it is not too great a simplification at this stage to
distinguish between two different interpretations of probability statements.
The frequentist interpretation.
The probability of an event E, P(E), is the limit of the relative frequency that E
occurs in repeated trials of the process as the number of trials approaches infinity.
In other words, if the event E occurs $e_n$ many times in the first n trials, then on the frequentist interpretation, $P(E) = \lim_{n\to\infty} e_n/n$.
The subjectivist interpretation.
The probability of an event E, P(E), is an expression of how confident the assignor
is that the event will happen in the next trial of the process.
The word “subjective” is usually used in science in a pejorative sense but that is
not the sense of the word here. Subjective here simply means that the assignor needs
to make a judgment and that this judgment may differ from assignor to assignor.
Nevertheless, this judgment might be based on considerable evidence and experience.
That is, it might be expert judgment.
Mathematics cannot tell us which of these two interpretations is “true” or even
which is “better.” In some sense this is a discussion about how mathematics can
be applied to the real world and is a philosophical not a mathematical discussion.
In this book (as is customary for introductory texts) we will explain our probability
statements using frequentist language.
Notice that the frequentist approach makes an important assumption about a random process. Namely, it assumes that, given an event E, there will be a limiting
relative frequency of occurrence of E in repeated trials of the random process and
that this limiting relative frequency is always the same given any such infinite sequence of repeated trials. This is not something that can be proved. Consider the
simplest kind of random process, one with two outcomes. The paradigmatic example
of such a process is coin tossing. The frequentist approach would say that in repeated
tossing of a coin, the fraction of tosses that have produced a head approaches some
limit. The next example simulates this situation.
Example 3.1.5.
Suppose that we toss a coin “fairly.” That is, we toss a coin so that we expect
that heads and tails are equally likely. Let E be the event that the coin turns
up heads. It is reasonable to think that in large numbers of tosses, the fraction
of heads approaches 1/2 so that P(E) = 1/2. (Indeed, there have been many famous coin-tossers throughout the years that have tried this experiment.) Rather
than toss physical coins, we illustrate what happens when a coin is tossed 1,000
times using R. We can model tossing a coin as constructing a sample of size
one from the two choices “heads” and “tails” with each of “heads” and “tails”
equally likely. In the R code below, we toss the coin 1,000 times and find that
after 1,000 tosses, the relative frequency of heads is 0.499. Notice however that
in the first 100 tosses or so that approximately 60% of the tosses were heads.
(We use the R function xyplot() from the lattice package to plot the relative
frequency of heads as the number of tosses increases from 1 to 1000.)
> coins = sample(c('H','T'), 1000, replace=T)
> noheads = cumsum(coins=='H')
> cumfrequency = noheads/(1:1000)
> xyplot(cumfrequency~(1:1000), type="l")
> cumfrequency[1000]
[1] 0.499
[Plot: the cumulative relative frequency of heads (vertical axis, from 0.0 to 0.6) against the number of tosses, 1 to 1000 (horizontal axis).]
Though in the above example, the simulated frequency of heads did indeed approach
1/2, there does not seem to be any reason why it wouldn’t be possible to toss 1,000
consecutive heads or, alternatively, to have the relative frequency of heads oscillate
wildly from very close to 0 to very close to 1. We will return to this issue when we
discuss the Law of Large Numbers in Section 5.2.
We should note here that one fact that is clear from the frequentist interpretation
of probability is the following.
For every event E, 0 ≤ P(E) ≤ 1.
We have already said that the sample space is a set and an event is a subset of the
sample space. We will use the language of set theory extensively to talk about events.
Definition 3.1.6 (union, intersection, complement). Suppose that E and F are
events in some sample space S.
1. The union of events E and F , denoted E ∪ F , is the set of outcomes that are
in either E or F .
2. The intersection of events E and F , denoted E ∩ F , is the set of outcomes that
are in both E and F .
3. The complement of an event E, denoted E′, is the set of outcomes that are in
S but not in E.
Example 3.1.7.
Suppose that a random sample of 5 individuals is chosen from a statistics class
of 20 students. Let E be the event that there are at least 3 males in the sample
and let F be the event that all five individuals are sophomores. Then we have
Event     Description
E ∪ F     either at least three males or all sophomores (or both)
E ∩ F     all sophomores and at least three of them male
E′        at most two males
So far we have considered random processes that have only finitely many different
possible outcomes. Some random processes have infinitely many different outcomes
however. Here are two typical examples.
Example 3.1.8.
A six-sided die is tossed until all six different faces have appeared on top at
least once. The possible outcomes form an infinite collection since we could
toss arbitrarily many times before seeing the number 1.
Example 3.1.9.
Kellogg’s packages Raisin Bran in 11 ounce boxes. We might view the weight of
any particular box as the result of a random process. It is difficult to describe
exactly what outcomes are possible (is a 22 ounce box of Raisin Bran possible?),
but it certainly seems like at least all real numbers between 10.9 and 11.1 ounces
are possible. This is already an infinite set of outcomes. An important event is
that the weight of the box is at least 11 ounces.
3.2. Assigning Probabilities I – Equally Likely Outcomes
How shall we assign a probability P(E) to an event E? On the frequentist interpretation, we need to examine what happens if we repeat the experiment indefinitely. This
of course is not usually feasible. In fact, we often want to make probability statements
about a process that we will perform only once. For example, we would like to make
probability statements about what might happen in the Current Population Survey
but only one random sample is chosen. So what we need to do is make some sort of
model of the process and argue that the model allows us to draw conclusions about
what might happen if we repeat the experiment many times.
For many random processes, we can make a plausible argument that the possible
outcomes of the process are equally likely. That is, we can argue that each of the
outcomes will occur about as often as any other outcome in a long series of trials.
For example, when we toss a fair coin, we usually assume that in a large number
of trials we will have as many heads as tails. That is, we assume that heads and
tails are equally likely. That’s why coin tossing is often used as a means of choosing
between two alternatives. Similarly, given the symmetry of a six-sided die, the sides
of a die should be equally likely to occur when the die is rolled vigorously. In a more
important example, a procedure for random sampling is designed to ensure that all
samples are equally likely to occur. In this situation, it is straightforward to assign
probabilities to each event.
Definition 3.2.1 (probability in the equally likely case). Suppose that a sample
space S has n outcomes that are equally likely. Then the probability of each outcome
is 1/n. Also, the probability of an event E, P(E) is k/n where k is the number of
outcomes in E.
The following examples illustrate this definition. In each example, the key is to list
the outcomes of the process in such a way that it is apparent that they are equally
likely.
Example 3.2.2.
A six-sided die is rolled. Then one of six possible outcomes occurs. From the
symmetry of the die it is reasonable to assume that the six outcomes are equally
likely. Therefore, the probability of each outcome is 1/6. If E is the event that
is described by “the die comes up 1 or 2” then P(E) = 2/6 = 1/3 since the
event E contains two of the outcomes. This probability assignment means that
in a large number of tosses of the die, approximately one-third of them will be
1s or 2s.
Example 3.2.3.
Suppose that four coins are tossed. What is the probability that exactly three
heads occur? It is tempting to list the outcomes in this particular experiment
as the set S = {0, 1, 2, 3, 4} since all that we are interested in is the number
of heads that occurs. However, it would be difficult to make an argument that
these outcomes are equally likely. The key is to note that there are really sixteen
possible outcomes if we distinguish the four coins carefully. To see this, label the
four coins (say, penny, nickel, dime, and quarter) and list the possible outcomes
as a four-tuple in that order (PNDQ):
HHHH HHHT HHTH HTHH
THHH HHTT HTHT THHT
HTTH THTH TTHH HTTT
THTT TTHT TTTH TTTT
Exactly 4 of these outcomes have three heads so that P(three heads) = 4/16 =
1/4. In fact, the following table gives the complete probability distribution of
the number of heads:
no. of heads    0      1      2      3      4
probability    1/16   4/16   6/16   4/16   1/16
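This enumeration is easy to check in R. The lines below are a minimal sketch (the coin labels are just names for the four positions): they list the sixteen equally likely outcomes and count those with exactly three heads.
# List the 16 equally likely outcomes of tossing four coins.
outcomes = expand.grid(penny=c('H','T'), nickel=c('H','T'), dime=c('H','T'), quarter=c('H','T'))
nrow(outcomes)                                         # 16
# Count the outcomes with exactly three heads and divide by the total.
sum(rowSums(outcomes == 'H') == 3) / nrow(outcomes)    # 4/16 = 0.25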
Example 3.2.4.
In many games (e.g., Monopoly) two dice are thrown and the sum of the two
numbers that occur are used to initiate some action. Rather than use the 11
possible sums as outcomes, it is easy to see that there are 36 equally likely
outcomes (list the pairs (i, j) of numbers where i is the number on the first
die, j is the number on the second die and i and j range from 1 to 6). One
event related to this process is the event E that the throw results in a sum of
7 on the two dice. It is easy to see that there are 6 outcomes in E so that
P(E) = 6/36 = 1/6.
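As a check on this reasoning, a short simulation along the following lines (a minimal sketch; the choice of 10,000 rolls is arbitrary) estimates the probability that two dice sum to 7.
# Simulate 10,000 rolls of two dice and estimate P(sum of the dice is 7).
die1 = sample(1:6, 10000, replace=TRUE)
die2 = sample(1:6, 10000, replace=TRUE)
mean(die1 + die2 == 7)    # should be close to 1/6 = 0.167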
For simple random processes with a small number of equally likely outcomes, it
is easy to compute probabilities using Definition 3.2.1. But when the number of
outcomes is so large that it is impractical to list them all, it becomes more difficult.
In such a case, we need to be able to count the number of outcomes without listing
them. For example, in choosing a random sample of 10 students from a large class,
the number of different possible samples is very large and would be impractical to
enumerate.
The mathematical discipline of counting is known as combinatorics. In this text,
we will not spend a great deal of time counting outcomes in complicated cases but
rather leave such computations to R. However a few of the more important principles
of counting will be quite useful to us.
The Multiplication Principle
It is no accident that in rolling 2 dice there are 6² = 36 possible outcomes and that in
flipping 4 coins there are 2⁴ = 16 possible outcomes. These are special cases of what
we will call the multiplication principle.
Definition 3.2.5 (cartesian product). If A and B are sets then the Cartesian product
of A and B, A × B, is the set of ordered pairs of elements of A and B. That is
A × B = {(a, b) | a ∈ A and b ∈ B} .
The Multiplication Principle is then given by the following lemma.
Lemma 3.2.6. If A has n elements and B has m elements then A × B has mn
elements.
It is easy to prove this lemma (and to remember the multiplication principle) by
a diagram. Let a1, ..., an be the elements of A and b1, ..., bm be the elements of B.
Then the elements of A × B are listed in the following two-dimensional array that has
n rows and m columns, or nm entries.
(a1, b1)   (a1, b2)   ...   (a1, bm)
(a2, b1)   (a2, b2)   ...   (a2, bm)
   ...        ...              ...
(an, b1)   (an, b2)   ...   (an, bm)
It is easy to see that counting the outcomes in the experiment of tossing two dice
is equivalent to counting D × D where D = {1, 2, 3, 4, 5, 6}. The two sets A and B do
not have to be the same however.
Example 3.2.7.
A class has 20 students, 12 male and 8 female. A male and a female are chosen
at random from the class. How many possible outcomes of this process are
there? It is easy to see that we are simply counting A × B where A, the set
of males, has 12 elements and B, the set of females, has 8 elements. Therefore
there are 12 · 8 = 96 outcomes.
The multiplication principle can be profitably generalized in two ways. First, we
can extend the principle to the case of more than two sets. It is easy to see that if sets
A, B, and C have n, m, p elements respectively, there are nmp triples of elements,
one from each of A, B, and C. This is because the set A × B × C can be thought of
as (A × B) × C. So for example, there are 6³ = 216 different outcomes of the process
of tossing three fair dice.
A second way to generalize this principle is illustrated in the following example.
Example 3.2.8.
In a certain card game, a player is dealt two cards. What is the probability
that the player is dealt a pair? (A pair is two cards of the same rank. A
deck of playing cards has 4 cards of each of thirteen ranks.) We first need
to identify the equally likely outcomes and count them. Consider the cards
being dealt in succession. There are 52 choices for the first card that player
receives. For each of these 52 cards there are 51 possible choices for the second
card that the player receives. Thus there are (52)(51) = 2,652 possible equally
likely outcomes. To see that this is really an application of the multiplication
principle above, we could view it as counting a set that has the same size as
A × B where A = {1, . . . , 52} and B = {1, . . . , 51} or we could directly list
the possible outcomes in a table as we did in the proof of the multiplication
principle. To compute the probability that a pair is dealt, we also need to
count the number of outcomes that are a pair. This is (52)(3) = 156 since the
first card can be any card but the second card needs to be one of the three
cards remaining that has the same rank as the first card. Thus the probability
in question is 156/2652 = .059. Notice that in this example, we have treated
the two cards of a given hand as being ordered by taking into account the order
in which they are dealt. Of course it does not usually matter in a card game
the order in which the cards of a given hand are dealt. We will later show how
to compute the number of different unordered hands.
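The value .059 can also be approximated by simulation. The sketch below (the number of simulated hands is arbitrary) represents the deck by ranks only, since suits do not matter for deciding whether two cards form a pair.
# A deck of ranks: four cards of each of the thirteen ranks.
deck = rep(1:13, each=4)
# Deal two cards without replacement, 10,000 times; each column of deals is one hand.
deals = replicate(10000, sample(deck, 2))
# Estimate the probability that the two cards have the same rank.
mean(deals[1,] == deals[2,])    # should be close to 156/2652 = 0.059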
Generalizing this example, we have the following principle. If two choices must be
made and there are n possibilities for the first choice and, for any first choice there
are m possibilities for the second choice then there are nm many ways to make the
two choices in succession.
Counting Subsets
Many of our counting problems can be reduced to counting the number of subsets of
a set that are of a given size.
Example 3.2.9.
Suppose that a set A has 10 elements. How many different three element subsets
of A are there? To answer this question, we first count the number of ordered
three element subsets of A using the multiplication principle. It is easy to see
that there are 10 · 9 · 8 = 720 of these. However, since this counts the number
of ordered subsets it counts each different (unordered) subset several times.
In fact each three element subset is counted 3 × 2 × 1 = 6 times using the
same multiplication principle. (There are 3 choices for the first element, 2 for
the second, and 1 for the third.) Thus there must be 720/6 = 120 different
three-element subsets of A.
Generalizing the example, we have
Theorem 3.2.10. Suppose that A has n elements. There are $\binom{n}{k}$ many subsets of A of size k, where
$$\binom{n}{k} = \frac{n!}{k!\,(n-k)!} \; .$$
Proof. We first count the number of k-element ordered subsets of A. By the multiplication principle this is
$$n(n-1)(n-2)\cdots(n-k+1) = \frac{n!}{(n-k)!} \; .$$
This follows from the multiplication principle since there are n choices for the first
element of the subset, n − 1 choices for the second element, and so forth down to
n − k + 1 choices for the kth element. Now for any subset of size k, there are k(k −
1) · · · 1 = k! many different orderings of the elements of that subset. Thus each subset
is counted k! many times in our count of the ordered subsets. So there are actually
only
$$\frac{n!}{(n-k)!}\Big/ k! = \binom{n}{k}$$
many subsets of size k of A.
The number $\binom{n}{k}$ is obviously an important one and it can be computed using R.
The R function choose(n,k) computes $\binom{n}{k}$.
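For example, choose() reproduces the count of three-element subsets found in Example 3.2.9 and agrees with the formula of Theorem 3.2.10.
choose(10, 3)                                   # 120
factorial(10) / (factorial(3) * factorial(7))   # the same value from the formula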
Example 3.2.11.
A random sample of 5 students is chosen from a class of 20 students, 12 of whom
are female. What is the probability that the sample consists of 5 females? We
first need to count the number of equally likely outcomes. Since there are 20
students and an outcome is a subset of size 5 of those 20, the number of different
random samples that we could have chosen is $\binom{20}{5} = 15,504$. Since the event
that we are interested in is the collection of samples that have five females, we
need to count how many of these 15,504 outcomes contain five females. But
that is simply $\binom{12}{5} = 792$ since each sample of five females is a subset of the 12
females in the class. So the probability in question is 792/15504 = 0.051.
> choose(20,5)
[1] 15504
> choose(12,5)
[1] 792
> 792/15504
[1] 0.05108359
3.3. Probability Axioms
In the last section, we considered one way of assigning probabilities to events. But
we can’t always identify equally likely outcomes.
Example 3.3.1.
A basketball player is going to shoot two free throws. What is the probability
that she makes both of them? It is easy to write the possible outcomes. Using
X for a made free throw and O for a miss, the four outcomes are XX, XO, OX,
and OO. In this respect, the process looks just like that of tossing a coin twice
in succession. But we have no reason to think that these four outcomes are
equally likely. In fact, it is almost always the case that the shooter is more likely
to make a free throw than to miss it, so it is probably the case that XX is
more likely to occur than OO.
As we have said before, mathematics cannot tell us how to assign probabilities in
situations such as Example 3.3.1. However not just any assignment of probabilities
makes sense. For example, we cannot assign a probability of 1/2 to each of the four
outcomes. It is not reasonable to think that the limiting relative frequency of all
four outcomes will be 1/2 if the experiment is repeated many times. In fact it seems
clear that we should be looking for four numbers that sum to 1. In 1933, Andrei
Kolmogorov published the first rigorous treatment of probability in which he gave
axioms for a probability assignment in the same way that Euclid gave axioms for
geometry.
Axiom 1. For all events E, P(E) ≥ 0.
Axiom 2. P(S) = 1.
Axiom 3. If E and F are disjoint events (i.e., have no outcomes in common) then
P(E ∪ F ) = P(E) + P(F )
More generally, if E1 , E2 , . . . is a sequence of pairwise disjoint events, then
P(E1 ∪ E2 ∪ · · · ) = P(E1 ) + P(E2 ) + · · · .
Axioms in mathematics are supposed to be propositions that are “intuitively obvious” and that we agree to accept as true without proof. Each of the three Kolmogorov
axioms can easily be interpreted as a statement about limiting relative frequency that
is obviously true. For example, the second axiom is obviously true because by our
definition of a random process, one of the outcomes in the sample space must occur.
Notice that the method of equally likely outcomes can be seen to rely heavily on
Axiom 2 and Axiom 3. While the axioms do not directly help us assign probabilities
in a case like Example 3.3.1, they do constrain our assignments. Also, they are useful
in helping to compute some probabilities in terms of others. Namely, we can prove
some theorems using these axioms.
Proposition 3.3.2. For every event E, P(E′) = 1 − P(E).
Proof. The events E and E′ are disjoint and E ∪ E′ = S. Thus
P(E) + P(E′) = P(E ∪ E′) = P(S) = 1 .
The first equality is Axiom 3 and the last is Axiom 2. The proposition follows immediately.
A curious event is ∅. Since we assume that something happens each time the
random process is performed, it should be the case that P(∅) = 0. It is easy to see
that this follows from the proposition and Axiom 2 since S = ∅′.
Proposition 3.3.3. For any events E and F , P(E ∪ F ) = P(E) + P(F ) − P(E ∩ F ).
Proof. We first use Axiom 3 and find that
P(E) = P(E ∩ F′) + P(E ∩ F)
and
P(F) = P(F ∩ E) + P(F ∩ E′)
Next we use Axiom 3 again to see that
P(E ∪ F) = P(E ∩ F′) + P(E ∩ F) + P(E′ ∩ F)
Combining, we have that
P(E ∪ F ) = (P(E) − P(E ∩ F )) + P(E ∩ F ) + (P(F ) − P(E ∩ F ))
which after simplifying gives the desired result.
The propositions above help us simplify probability computations, even in the case
of equally likely outcomes.
Example 3.3.4.
From experience, an insurance company estimates that a customer that has
both a homeowner’s policy and an auto policy has a probability of .83 of having
no claim on either policy in a given year. These policy holders also have a
probability of .15 of having an automobile claim and .05 of having a homeowner’s
claim. What is the probability that such a policy holder has both a homeowner
and automobile claim? If E is the event of a homeowner’s claim and F the event
of an auto claim, then we have P(E ∪ F ) = 1 − .83 = .17. Also P(E) = .05
and P(F ) = .15. Thus the event that we are looking for, E ∩ F , has probability
P(E) + P(F ) − P(E ∪ F ) = .03.
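The arithmetic is simple enough to do by hand, but writing it out in R (a minimal sketch; the variable names are arbitrary) makes the use of Proposition 3.3.3 explicit.
p.union = 1 - 0.83     # P(E union F): probability of at least one claim
p.E = 0.05             # P(E): probability of a homeowner's claim
p.F = 0.15             # P(F): probability of an auto claim
p.E + p.F - p.union    # P(E intersect F) = 0.03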
3.4. Empirical Probabilities
In Section 3.2, we saw how to assign probabilities consistent with the Kolmogorov
Axioms in the case that we could identify a priori equally likely outcomes. However in
many applications, the outcomes are not equally likely and there is usually no similar
theoretical principle that enables us to assign probabilities with confidence. In such
cases, we need some data from the real world to assign probabilities. While much of
Chapter 4 will be devoted to this problem, in this section we look at a very simple
method of assigning probabilities based on data.
Since the probability of an event E is supposed to be the limiting relative frequency
of the occurrence of E as the number of trials increases indefinitely, a very simple
estimate of the probability of E is the relative frequency with which it has occurred
in the past.
Example 3.4.1.
What is the probability that number 4 of the Calvin Knights will make a freethrow when he has to shoot one in a game? As of the writing of this example,
number 4 had attempted 38 free-throws and had made 33 of them. Thus the
relative frequency of a made free-throw is 87%. Thus we say that number 4 has
an 87% probability of making a free-throw.
There are all sorts of objections that might be raised to the computation in Example 3.4.1. The first that comes to mind is that 38 is a relatively small number of trials
on which to base the argument. Another serious objection might be to the whole
idea that there is a fixed probability that number 4 makes a free-throw. Nevertheless,
as a model of what number 4 might do on his next and subsequent free-throws, this
number might have some value and allow us to make some useful predictions.
We have seen of course that this method of assigning probabilities can lead us to
incorrect (and sometimes really bad) probability values. Even in 100 tosses of a coin,
it is quite possible that we would find 60 heads and so think that the probability
of a head was 0.6 rather than 1/2. (In Section 4.3 we will actually examine closely
the question of just how close to the “true” value we are likely to be given a certain
number n of coin tosses.) But in situations where we have a lot of past data and very
little of a theoretical model to help us compute otherwise, this might be a reasonable
strategy. This way of assigning probabilities is an important tool in the insurance
industry.
Example 3.4.2.
Suppose that an insurance company wants to sell a 5-year term life insurance
policy in the amount of $100,000 to a 55-year old male. Such a policy pays
$100,000 to the beneficiary of the policy holder only if he dies within five years.
Obviously, the insurance company would like to know the probability that
the insured dies within five years. The key tool in computing such a probability
is a mortality table such as the one shown in Figure 3.1. (The full table is available at http://
www.cdc.gov/nchs/data/nvsr/nvsr54/nvsr54_14.pdf.) Using data from a
variety of sources (including the US Census Bureau and the Center for Medicare
and Medicaid), the Division of Vital Services makes a very accurate count of
the number of people that die in the United States each year. For our problem,
we note that the table indicates that of every 88,846 men alive at the age of 55,
only 84,725 of them are alive at the age of 60. This means that our insurance
company has a probability of (88846 − 84725)/88846 = 0.046 of paying out on
this policy. If the company writes many such policies, it appears that it would
average about $4,600 per policy in payouts. This is the most important number
in trying to decide how much the company should charge for such a policy.
For the purpose of investigating how random processes work, it is very useful to use
R to perform simulations. We have already seen how to simulate a random process
in which the outcomes are equally likely. The next example simulates a process in
which the probabilities are determined empirically.
Example 3.4.3.
In the 2007 baseball season, Manny Ramirez came to the plate 569 times. Of
those 569 times, he had 89 singles, 33 doubles, 1 triple, 20 homeruns, 78 walks
(and hit by pitch), and 348 outs. We can use the frequency of these events
to estimate the probabilities of each sort of event that might happen when
Ramirez comes to the plate. For example, we might estimate the probability that Ramirez will hit a homerun in his next plate appearance to be 20/569 = .035.
Figure 3.1.: Portion of life table prepared by Division of Vital Services of U.S. Department of Health and Human Services
In the following R session we simulate one, and then five, of Manny Ramirez’s
plate appearances.
> outcomes=c('Out','Single','Double','Triple','Homerun','Walk')
> ramirez=c(348,89,33,1,20,78)/569
> sum(ramirez)
[1] 1
> ramirez
[1] 0.611599297 0.156414763 0.057996485 0.001757469 0.035149385 0.137082601
> sample(outcomes,1,prob=ramirez)
[1] "Double"
> sample(outcomes,5,prob=ramirez,replace=T)
[1] "Out"     "Double"  "Out"     "Out"     "Walk"
3.5. Independence
It is often the case that two events associated with a random process are related in
some way so that if we knew that one of them was going to happen we would change
our estimate of the likelihood that the other would happen. The following example
illustrates this.
Example 3.5.1.
At the end of each semester, students in many college courses are given the
opportunity to rate the course. At Calvin, two questions that students are
asked are:
The course was: (Excellent, Very Good, Good, Fair, Poor)
The instructor: (Excellent, Very Good, Good, Fair, Poor)
Empirical evidence suggests that the probability that a student answers Excellent on the question about the course is 0.25 and the probability that a student
answers Excellent to the question about the instructor is 0.41. (What is the
random process here? Am I suggesting that students answer these questions at
random?) Suppose that we happen to see that a student has answered Excellent to the first question. We would certainly not continue to suppose that the
probability that this student has answered Excellent to the second question is
just 0.41. We would guess that the student's answers are not independent of
one another. In fact, 91% of the students who answer Excellent to the course
question also answer Excellent to the instructor question.
Definition 3.5.2 (conditional probability). Given two events E and F such that
P(F) ≠ 0, the conditional probability of E given F, written P(E | F), is given by
$$P(E \mid F) = \frac{P(E \cap F)}{P(F)} \; .$$
It is easiest to interpret the formula for P(E | F ) using the relative frequency
interpretation. The denominator in the fraction in the definition is the proportion of
times that the event F happens in a large number of trials of the random process.
The numerator, P(E ∩ F ) is the proportion of times that both events happen. So
the fraction is the proportion of times E happens among those times that F happens
which is precisely what we want conditional probability to measure.
In the definition of conditional probability, it is best to think of F as being a fixed
event and that E is allowed to be any event in the sample space. Thus P(E | F ) is a
function of E.
In applications, it is often the case that we know P(F ) and P(E | F ). Using the
definition of conditional probability, we can then compute P(E ∩ F ) using
Multiplication Law of Probability
If E and F are events with P(F) ≠ 0 then
P(E ∩ F ) = P(F ) P(E|F ) .
Example 3.5.3.
Suppose that we choose two students from a class of 20 without replacement.
If there are 12 female students in the class, the probability of the first chosen
student being female is 12/20 = .6. Having chosen a female, the probability that
the second chosen student is also female is 11/19 since there are 11 remaining
females of the 19 remaining students. So the probability of choosing two females
in succession is (12/20)(11/19) = .347.
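This computation can be checked by simulation. In the sketch below (the number of repetitions is arbitrary) we repeatedly draw two students without replacement from a class coded as 12 females and 8 males.
students = c(rep('F', 12), rep('M', 8))
draws = replicate(10000, sample(students, 2))    # each column is one pair drawn without replacement
mean(draws[1,] == 'F' & draws[2,] == 'F')        # should be close to (12/20)(11/19) = 0.347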
We can extend the analysis in Example 3.5.3 to compute the probabilities of all
possible combinations of E and F occurring or not. It is useful to view this situation
as a tree.
[Tree diagram: the first branching leads to E (with probability P(E)) and E′ (with probability P(E′)). From E, branches with probabilities P(F | E) and P(F′ | E) lead to the outcomes E ∩ F and E ∩ F′; from E′, branches with probabilities P(F | E′) and P(F′ | E′) lead to E′ ∩ F and E′ ∩ F′.]
It is clear when one thinks about it that, in general, P(E | F) ≠ P(F | E).
Indeed, simply knowing P(E | F) does not necessarily give us any information
about P(F | E). As a simple example, note that the probability that a primary voter
votes for Hillary Clinton given that she votes in the Democratic primary is certainly
not equal to the probability that she votes in the Democratic primary given that she
votes for Hillary Clinton (the latter probability is 1!). In the next example, we look
at an important situation in which we desire to know P(F | E) but we only know
conditional probabilities of form P(E | F ).
Example 3.5.4.
Most laboratory tests for diseases aren’t infallible. The important question
from the point of view of the patient is what inference to make about the
disease status given the outcome of the test. Namely, if the test is positive,
how likely is it that the patient has the disease? The sensitivity of a test is
the probability that it will give a positive result given that the patient has the
disease. The specificity of a test is the probability that it will give a negative
result given that the patient does not have the disease. A widely used rapid test
for the HIV virus has sensitivity 99.9% and specificity 99.8%. Since the test
appears to be very accurate and it is now quite inexpensive, one might suppose
that doctors should give this test as a routine matter to allow for early detection
of the virus. In this situation, we are interested in four possible events:
D+ the patient has the disease
D− the patient does not have the disease
T + the test is positive
T − the test is negative
The sensitivity and specificity then give P(T+ | D+) = .999 and P(T− | D−) =
.998. (Note that this means that P(T+ | D−) = 0.002 and P(T− | D+) = 0.001.)
Suppose now that a patient tests positive. What is the probability that this
patient has the disease? It is clear that this is the question of computing
P(D+ | T + ). We have
$$P(D^+ \mid T^+) = \frac{P(D^+ \cap T^+)}{P(T^+)} \; .$$
Using the Multiplication Law, we have
P(D+ ∩ T + ) = P(T + | D+ ) P(D+ )
and also we have
P(T + ) = P(T + ∩ D+ ) + P(T + ∩ D− ) .
One more piece of information is needed to compute P(D+ | T + ) and that is
P(D+ ), the prevalence of the disease in the tested population. Of course this
depends on the population that is tested. It is estimated that about 0.01% of all
persons in the U.S. have the disease. So if we adopt a policy of testing everyone
without regard to other factors, we might estimate P(D+ ) = 0.0001. We can
now compute P(D+ | T + ). The probability tree is as follows.
[Tree diagram: the first branching leads to D+ (probability 0.0001) and D− (probability 0.9999). From D+, the branch to T+ has probability 0.999, giving P(D+ ∩ T+) = (0.0001)(0.999) = 9.99 × 10⁻⁵, and the branch to T− has probability 0.001, giving (0.0001)(0.001) = 10⁻⁷. From D−, the branch to T+ has probability 0.002, giving (0.9999)(0.002) = 0.0020, and the branch to T− has probability 0.998, giving (0.9999)(0.998) = 0.9979.]
Using the probabilities computed from the tree, we have
$$P(D^+ \mid T^+) = \frac{P(D^+ \cap T^+)}{P(T^+)} = \frac{P(T^+ \mid D^+)\,P(D^+)}{P(T^+ \cap D^+) + P(T^+ \cap D^-)} = \frac{9.99 \times 10^{-5}}{9.99 \times 10^{-5} + 0.0020} = 0.047 \; .$$
Thus, even though the test is very accurate, 95% of the time the positive result
will be for someone who does not have the disease! This is one reason that
universal testing for rare diseases often does not make economic sense.
This method of “reversing” the conditional probabilities is so important that it has
a name: Bayes’ Theorem.
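The computation of Example 3.5.4 is easy to reproduce in R, which also makes it easy to see how the answer changes if the prevalence changes (a minimal sketch; the variable names are arbitrary).
prev = 0.0001                                   # P(D+), prevalence in the tested population
sens = 0.999                                    # P(T+ | D+), sensitivity
spec = 0.998                                    # P(T- | D-), specificity
p.pos = sens * prev + (1 - spec) * (1 - prev)   # P(T+)
sens * prev / p.pos                             # P(D+ | T+), approximately 0.047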
Independence
If P(E | F) = P(E), knowing that the event F occurs does not give us any more
information as to whether E will occur. Such events E and F are called independent.
The multiplication law simplifies in this case and leads to the following definition.
Definition 3.5.5 (independent). Events E and F are independent if
P(E ∩ F ) = P(E) P(F ) .
Notice that we do not assume that P(F) ≠ 0 in this definition. It is easy to see that
if P(F ) = 0, the equality in the definition is always true so that we would consider E
and F to be independent in this special case.
Example 3.5.6.
Suppose that a free-throw shooter makes 70% of her free-throws. What is the
probability that she makes both of her free-throws when she is fouled in the
act of shooting? It might be reasonable to suppose that the results of the two
free-throws are independent of each other. Then the probability of making two
successive free-throws is (0.7)(0.7) = 0.49. Similarly, the probability that she
misses both free throws is only 9%.
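A quick simulation check of this computation (a minimal sketch; the number of simulated pairs is arbitrary): each pair of free-throws is generated as two independent successes, each with probability 0.7.
p = 0.7
shots = matrix(runif(2 * 10000) < p, ncol=2)    # each row is one pair of independent free-throws
mean(shots[,1] & shots[,2])                     # should be close to p^2 = 0.49
mean(!shots[,1] & !shots[,2])                   # should be close to (1-p)^2 = 0.09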
3.6. Exercises
3.1 For each of the following random processes, write a complete list of all outcomes
in the sample space.
a) A nickel and a dime are tossed and the resulting faces observed.
b) Two different cards are drawn from a hat containing five cards numbered 1–5.
(For some reason, lots of probability problems are about cards in hats.)
c) A voter in the Michigan 2008 Primary elections is chosen at random and asked
for whom she voted. (See problem A.2.)
3.2 Two six-sided dice are tossed.
a) List all the outcomes in the sample space (you should find 36) using some
appropriate notation.
b) Let F be the event that the sum of the dice is 7. List the elements of F .
c) Let E be the event that the sum of the dice is odd. List the elements of the
event E.
3.3 If a Calvin College student is chosen at random and his/her height is recorded,
what is a reasonable listing of the possible outcomes? Explain the choices that you
have to make in determining what the outcomes are.
3.4 Weathermen in Grand Rapids are fond of saying things like “The probability of
snow tomorrow is 70%.” What do you think this statement really means? Can you
give a frequentist interpretation of this statement? A subjectivist interpretation?
3.5 It is clear from the computational formula that $\binom{n}{k} = \binom{n}{n-k}$. Without resorting
to the formula, give an argument (in terms of what each of these symbols counts)
that this equation must be true.
3.6 In Example 3.2.3 we considered the random experiment of tossing four coins. In
this problem, we consider the problem of tossing five coins.
a) How many equally likely outcomes are there?
b) For each x = 0, 1, 2, 3, 4, 5, compute the probability that exactly x many heads
occurs in the toss of five coins.
3.7 A 20-sided die (with sides numbered 1–20) is used in some games. Obviously the
die is constructed in such a way that the sides are intended to be equally likely to
occur when the die is rolled. (The die is in fact an icosahedron.) Using R, simulate
1,000 rolls of such a die. How many of each number did you expect to see? Include
a table of the actual number of times each of the 20 numbers occurred. Is there
anything that surprises you in the result?
3.8 A poker hand consists of 5 cards. What is the probability of getting dealt a poker
hand of 5 hearts? (Remember that there are 13 hearts in the deck of 52 cards.)
3.9 In Example 3.2.11 we considered choosing a random sample of 5 students from a
class of 20 students of whom 12 were female.
a) What is the probability that such a random sample will contain 5 males?
b) What is the probability that such a random sample will contain 3 females and
2 males?
3.10 Many games use spinners rather than dice to initiate action. A classic board
game published by Cadaco-Ellis is “All-American Baseball.” The game contains discs
for each of several baseball players. The disc for Nellie Fox (the great Chicago White
Sox second baseman) is pictured below.
The disc is placed over a peg with a spinner mounted in the center of the circle. The
spinner is spun and comes to rest pointing to the one of the numbered areas. Each
number corresponds to the possible result of Nellie Fox batting. (For example, 1 is a
homerun and 14 is a flyout.)
a) Why is it unreasonable to believe that all the numbered outcomes are equally
likely?
b) Explain how one could use the idea of equal likelihood to predict the probability
that the spinner will land on the sector numbered 14 and then make an estimate
of this probability.
(Spinners with regions of unequal size are used heavily in the K–8 textbook series
Everyday Mathematics to introduce probability to younger children.)
3.11 The traditional dartboard is pictured below.
A dart that sticks in the board is scored as follows. There are 20 numbered sectors
each of which has a small outer ring, a small inner ring, and two larger areas. A dart
landing in the larger areas scores the number of the sector, in the outer ring scores
double the number of the sector, and in the inner ring scores triple the number of a
sector. The two circles near the center score 25 points (the outer one) and 50 points
(the inner one). Unlike the last problem, it does not seem that an equal likelihood
model could be used to compute the probability of a “triple 20.” Explain why not.
3.12 Suppose that E and F are events and that P(E), P(F ), and P(E ∩ F ) are
given. Find formulas (in terms of these known probabilities) for the probabilities of
the following events:
a) exactly one of E or F happens,
b) neither E nor F happens,
c) at least one of E or F happens,
d) E happens but F does not.
3.13 Suppose that E, F , and G are events. Show that
P(E ∪ F ∪ G) = P(E) + P(F) + P(G) − P(E ∩ F) − P(E ∩ G) − P(F ∩ G) + P(E ∩ F ∩ G) .
3.14 Use the axioms to prove that for all events E and F , if E ⊆ F then P(E) ≤ P(F ).
3.15 Show that for all events E and F that P(E ∩ F ) ≤ min{P(E), P(F )}.
3.16 In 2006, there were 42,642 deaths in vehicular accidents in the United States.
17,602 of the victims had a positive blood alcohol content (BAC). In 15,121 of these,
the BAC of the victim was greater than 0.08 (which is the legal limit for DUI in many
states). What is a good estimate for the probability that a victim of a vehicular
accident had BAC exceeding 0.08? The statistics in this problem can be found at
the Fatality Analysis Reporting System, http://www-fars.nhtsa.dot.gov/Main/
index.aspx. (A probability that we would really like to know is the probability
that a driver with a BAC of greater than 0.08 becomes a fatality in an accident.
Unfortunately, that’s a much harder number to obtain.)
3.17 We have used tossing coins as our favorite example of a process with two equally
likely outcomes. Consider instead the process where the coin is stood on end on a
hard surface and spun.
a) If a dime is used, do you think a head and a tail are equally likely to occur?
b) Do the experiment 10 times and record the results.
c) On the basis of your data, is it possible that heads and tails are equally likely?
d) Using the data alone, estimate the probability that a spun dime comes up heads.
3.18 In Example 3.4.2, we determined that the probability that a 55 year old male
dies before his 60th birthday is 0.046.
a) If the company sells this 5-year, $100,000 policy to 100 different men, how many
of these policies would you expect that they would have to pay the death benefit
for?
b) Simulate this situation. Namely use this empirical probability to simulate the
100 policies. How many policies did the company have to pay off on in your
simulation? Are you surprised by this result?
3.19 The senior engineering students and the senior nursing students gather for a
party. There are 62 senior engineering students (5 of them female) and 59 senior
nursing students (10 of them male). Suppose that a student is chosen at random.
a) What is the probability that she/he is a female given that she/he is an engineer?
b) What is the probability that she is an engineer given that she is female?
c) What is the probability that he/she is an engineer?
3.20 A fair die is tossed. If the number on the face of the die is n, then n coins are
tossed and the number of heads is counted. What is the probability that this process
results in zero heads being tossed?
3.21 Show that if E ⊆ F then P(F | E) = 1.
3.22 Construct an example to show that it is not necessarily true that P(E | F) =
1 − P(E | F′).
3.23 Show that if E and F are independent, then so are E′ and F′.
3.24 Suppose that two different bags of blue and red marbles are presented to you
and you are told that one bag (bag A) has 75% blue marbles and the other bag (bag
B) has 90% red marbles. Suppose that you choose a bag at random. Now suppose
that you choose a single marble from the bag at random and it is red. What is the
probability that you have in fact chosen bag A?
3.25 Over the course of a season, a certain basketball player shot two free-throws on
36 occasions. On 18 of those occasions, she made both of the free-throws and on 9 of
the occasions she missed both (and so on 9 occasions she made one and missed one).
Does this data appear to be consistent with the hypothesis that she has a constant
probability of making a free-throw and that the result of the second throw of a pair
is independent of the first?
3.26 In Example 3.5.4 we studied the effectiveness of universal HIV testing and
determined that 95% of the time positive tests results are wrong even though the
test itself has a very high sensitivity. Now suppose that HIV testing is restricted to
a high risk population - one in which the prevalence of the disease is 25%. What is
the probability that a positive test result is wrong in this case?
4. Random Variables
4.1. Basic Concepts
If the outcomes of a random process are numbers, we will call the random process a
random variable. Since non-numerical outcomes can always be coded with numbers,
restricting our attention to random variables results in no loss of generality. We will
use upper-case letters to name random variables (X, Y , etc.) and the corresponding
lower-case letters (x, y, etc.) to denote the possible values of the random variable.
Then we can describe events by equalities and inequalities so that we can write such
things as P (X = 3), P (Y = y) and P (Z ≤ z). Some examples of random variables
include
1. Choose a random sample of size 12 from 250 boxes of Raisin Bran. Let X be
the random variable that counts the number of underweight boxes and let Y be
the random variable that is the average weight of the 12 boxes.
2. Choose a Calvin senior at random. Let Z be the GPA of that student and let
U be the composite ACT score of that student.
3. Assign 12 chicks at random to two groups of six and feed each group a different
feed. Let D be the difference in average weight between the two groups.
4. Throw a fair die until all six numbers have appeared. Let T be the number of
throws necessary.
We will consider two types of random variables, discrete and continuous.
Definition 4.1.1 (discrete random variable). A random variable X is discrete if its
possible values can be listed x1 , x2 , x3 , . . . .
In the example above, the random variables X, U , and T are discrete random
variables. Note that the possible values for X are 0, 1, . . . , 12 but that T has infinitely
many possible values 1, 2, 3, . . . . The random variables Y , Z, and D above are not
discrete. The random variable Z (GPA) for example can take on all values between
0.00 and 4.00. (We should make the following caveat here however. All variables are
discrete in the sense that there are only finitely many different measurements possible
to us. Each measurement device that we use has divisions only down to a certain
tolerance. Nevertheless it is usually more helpful to view these measurements as on
a continuous scale rather than a discrete one. We learned that in calculus.)
The following definition is not quite right — it omits some technicalities. But it is
close enough for our purposes.
Definition 4.1.2 (continuous random variable). A random variable X is continuous
if its possible values are all x in some interval of real numbers.
We will turn our attention first to discrete random variables.
4.2. Discrete Random Variables
If X is a discrete random variable, we will be able to compute the probability of any
event defined in terms of X if we know all the possible values of X and the probability
P (X = x) for each such value x.
Definition 4.2.1 (probability mass function). The probability mass function (pmf)
of a random variable X is the function f such that for all x, f (x) = P (X = x). We
will sometimes write fX to denote the probability mass function of X when we want
to make it clear which random variable is in question.
The word mass is not arbitrary. It is convenient to think of probability as a unit
mass that is divided into point masses at each possible outcome. The mass of each
point is its probability. Note that mass obeys the Kolmogorov axioms.
Example 4.2.2.
Two dice are thrown and the sum X of the numbers appearing on their faces
is recorded. X is a random variable with possible values 2, 3, . . . , 12. By using
the method of equally likely outcomes, we can see that the pmf f of X is given
by the following table:
x       2     3     4     5     6     7     8     9    10    11    12
f(x)   1/36  2/36  3/36  4/36  5/36  6/36  5/36  4/36  3/36  2/36  1/36
We can now compute such probabilities as P (X ≤ 5) = 5/18 by adding the
appropriate values of f .
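The table for f can be generated in R by enumerating the 36 equally likely outcomes (a minimal sketch).
sums = outer(1:6, 1:6, "+")     # all 36 equally likely sums of two dice
table(sums) / 36                # the pmf of X
sum(table(sums)[1:4]) / 36      # P(X <= 5) = 10/36 = 5/18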
Example 4.2.3.
We can think of a categorical variable as a discrete random variable by coding.
Suppose that a student is chosen at random from the Calvin student body. We
will code the class of the student by 1, 2, 3, 4 for the four standard classes and 5
for other. The coded class is a random variable. Referring to Table 2.1, we see
that the probability mass function of X is given by f(1) = 0.24, f(2) = 0.23,
f(3) = 0.23, f(4) = 0.26, f(5) = 0.04, and f(x) = 0 otherwise.
Figure 4.1.: The probability histogram for the Calvin class random variable.
One useful way of picturing a probability mass function is by a probability histogram. For the mass function in Example 4.2.3, we have the corresponding histogram
in Figure 4.1.
On the frequentist interpretation of probability, if we repeat the random process
many times, the histogram of the results of those trials should approximate the probability histogram. The probability histogram is not a histogram of data from many
trials however. It is a representation of what might happen in the next trial. We will
often use this idea to work in reverse. In other words, given a histogram of data that
obtained from successive trials of a random process, we will choose the pmf to fit the
data. Of course we might not ask for a perfect fit but instead we will choose the pmf
f to fit the data approximately but so that f has some simple form.
Several families of discrete random variables are particularly important to us and
provide models for many real-world situations. We examine two such families here.
Each arises from a common kind of random process that will be important for statistical inference. The second of these arises from the very important case of simple
random sampling from a population. We will first study a somewhat different case
(which, among other uses, can be used to study sampling with replacement).
4.2.1. The Binomial Distribution
A binomial process is a random process characterized by the following conditions:
1. The process consists of a sequence of finitely many (n) trials of some simpler
random process.
2. Each trial results in one of two possible outcomes, usually called success (S)
and failure (F ).
3. The probability of success on each trial is a constant denoted by π.
4. The trials are independent one from another.
Thus a binomial process is characterized by two parameters, n and π. Given a
binomial process, the natural random variable to observe is the number of successes.
Definition 4.2.4 (binomial random variable). Given a binomial process, the binomial random variable X associated with this process is the number of successes
in the n trials of the process. If X is a binomial random variable with
parameters n and π, we write X ∼ Binom(n, π).
The symbol ∼ can be read as “has the distribution” or something to that effect.
The use of the word distribution here is not inconsistent with our earlier use. Here
to specify a distribution is to specify the possible values of the random variable and
the probability that the random variable attains any particular value.
Example 4.2.5.
The following are all natural examples of binomial random variables.
1. A fair coin is tossed n = 10 times with the probability of a HEAD (success)
being π = .5. X is the number of heads.
2. A basketball player shoots n = 25 freethrows with the probability of making each freethrow being π = .70. Y is the number of made freethrows.
3. A quality control inspector tests the next n = 12 widgets off the assembly
line each of which has a probability of 0.10 of being defective. Z is the
number of defective widgets.
4. Ten Calvin students are randomly sampled with replacement. W is the
number of males in the sample.
The fact that the trials are independent of one another makes it possible to easily
compute the pmf of any binomial random variable using the multiplication principle.
We first give a simple example.
Example 4.2.6.
An unaccountably popular dice game is known as Bunko. Three dice are rolled
and the number of sixes rolled is the important value. Let X be the random
variable that counts the number of sixes in three dice. Then X ∼ Binom(3, 1/6).
We can now compute the probability mass function (X can take on the values
0, 1, 2, 3). We simply need to keep track of all possible sequences of three successes and failures and find the probability of each such sequence.
f(3) = P(X = 3) = (1/6)(1/6)(1/6) = 1/216
f(2) = P(X = 2) = (1/6)(1/6)(5/6) + (1/6)(5/6)(1/6) + (5/6)(1/6)(1/6) = 15/216
f(1) = P(X = 1) = (1/6)(5/6)(5/6) + (5/6)(1/6)(5/6) + (5/6)(5/6)(1/6) = 75/216
f(0) = P(X = 0) = (5/6)(5/6)(5/6) = 125/216
The computation for f (2) for example has three terms, one for each of SSF, SFS,
FSS. The important probability fact for Bunko players is P(X ≥ 1) = 91/216.
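These values agree with R's built-in binomial functions, which are introduced formally below (a minimal sketch).
dbinom(0:3, size=3, prob=1/6)       # 125/216, 75/216, 15/216, 1/216
1 - dbinom(0, size=3, prob=1/6)     # P(X >= 1) = 91/216, about 0.42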
We can easily generalize the previous example to any n and π to get the following
theorem.
Theorem 4.2.7 (The Binomial Distribution). Suppose that X is a binomial random
variable with parameters n and π. The pmf of X is given by
$$f_X(x;\, n, \pi) = \binom{n}{x}\,\pi^x (1-\pi)^{n-x} = \frac{n!}{x!\,(n-x)!}\,\pi^x (1-\pi)^{n-x}, \qquad x = 0, 1, 2, \ldots, n \; .$$
Proof. Suppose that n and π are given and that 0 ≤ x ≤ n. Consider all sequences
of n trials that have exactly x successes and n − x failures. There are $\binom{n}{x}$ of these
since all we have to decide is how to “choose” the x places in the sequence for the
successes. Now consider any one such sequence, say the sequence S. . . SF. . . F, the
sequence in which x successes are followed by n − x failures. The probability of this
sequence (and of any sequence with x successes) is $\pi^x (1-\pi)^{n-x}$ by the multiplication
principle, relying on the independence of the trials. The result follows.
Note the use of the semicolon in the definition of fX in the theorem. We will
use a semicolon to separate the possible values of the random variable (x) from the
parameters (n, π). For any particular binomial experiment, n and π are fixed. If n
and π are understood, we might write fX (x) for fX (x; n, π).
For all but very small n, computing f by hand is tedious. We will use R to do
this. Besides computing the mass function, R can be used to compute the cumulative
distribution function FX which is the useful function defined in the next definition.
Definition 4.2.8 (cumulative distribution function). If X is any random variable,
the cumulative distribution function of X (cdf) is the function $F_X$ given by
$$F_X(x) = P(X \le x) = \sum_{y \le x} f_X(y) \; .$$
We will usually use the convention that the pmf of X is named by a lower-case
letter (usually fX ) and the cdf by the corresponding upper-case letter (usually FX ).
The R functions to compute the cdf, the pmf, and also to simulate binomial processes are
as follows if X ∼ Binom(n, π).
function (& parameters)    explanation
rbinom(n,size,prob)        makes n random draws of the random variable X and returns them in a vector.
dbinom(x,size,prob)        returns P(X = x) (the pmf).
pbinom(q,size,prob)        returns P(X ≤ q) (the cdf).
Example 4.2.9.
Suppose that a manufacturing process produces defective parts with probability
π = .1. If we take a random sample of size 10 and count the number of defectives
X, we might assume that X ∼ Binom(10, 0.1). Some examples of R related to
this situation are as follows.
> defectives=rbinom(n=30, size=10,prob=0.1)
> defectives
[1] 2 0 2 0 0 0 0 2 0 1 1 1 0 0 2 2 3 1 1 2 1 1 0 2 0 1 1 0 1 1
> table(defectives)
defectives
0 1 2 3
11 11 7 1
> dbinom(c(0:4),size=10,prob=0.1)
[1] 0.34867844 0.38742049 0.19371024 0.05739563 0.01116026
> dbinom(c(0:4),size=10,prob=0.1)*30     # pretty close to table
[1] 10.4603532 11.6226147  5.8113073  1.7218688  0.3348078
> pbinom(c(0:5),size=10,prob=0.1)        # same as cumsum(dbinom(...))
[1] 0.3486784 0.7360989 0.9298092 0.9872048 0.9983651 0.9998531
>
It is important to note that
• R uses size for the number of trials (what we have called n) and n for the
number of random draws.
• pbinom() gives the cdf not the pmf. Reasons for this naming convention will
become clearer later.
• There are similar functions in R for many of the distributions we will encounter,
and they all follow a similar naming scheme. We simply replace binom with the
R-name for a different distribution.
4.2.2. The Hypergeometric Distribution
The hypergeometric distribution arises from considering the situation of random sampling from a population in which there are just two types of individuals. (That is
there is a categorical variable defined on the population with just two levels.) It is
traditional to describe the distribution in terms of the urn model. Suppose that we
have an urn with two different colors of balls. There are m white balls and n black
balls. Suppose we choose k balls from the urn in such a way that every set of k balls is
equally likely to be chosen (i.e., a random sample of balls) and count the number X of
white balls. We say that X has the hypergeometric distribution with parameters
m, n, and k and write X ∼ Hyper(m, n, k). A simple example shows how we can
compute probabilities in this case.
Example 4.2.10.
Suppose the urn has 2 white and 3 black balls and that we choose 2 balls
at random without replacement. If X is the number of white balls, we have
X ∼ Hyper(2, 3, 2). Notice that in this case there are 10 different possible
choices of two balls. If we label the balls W1, W2, B1, B2, B3, we have the
following:
2 whites (W1,W2)
1 white (W1,B1), (W1,B2), (W1,B3), (W2,B1), (W2,B2), (W2,B3)
0 whites (B1,B2), (B1,B3), (B2,B3)
Since the 10 different pairs are equally likely, we have P (X = 0) = 3/10,
P (X = 1) = 6/10, and P (X = 2) = 1/10.
The systematic counting of the example can easily be extended to compute the pmf
of any hypergeometric random variable.
Theorem 4.2.11. Suppose that X ∼ Hyper(m, n, k). Then the pmf f of X is given
by
f(x; m, n, k) = \frac{\binom{m}{x} \binom{n}{k-x}}{\binom{m+n}{k}} ,   x ≤ min(k, m) .
Proof. The denominator counts the number of samples of size k from m + n many
balls. The two terms in the numerator count the number of ways of choosing x white
balls from m and k − x black balls from n. Multiplying the two terms together counts
the number of ways of choosing x white balls and k − x black balls.
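As a small check, the formula reproduces the probabilities that we counted by hand in Example 4.2.10; the R functions choose() and dhyper() agree.

# Verify Example 4.2.10 (2 white, 3 black, choose 2) against the formula of the theorem
x <- 0:2
choose(2, x) * choose(3, 2 - x) / choose(5, 2)   # 0.3 0.6 0.1, as counted by hand
dhyper(x, m = 2, n = 3, k = 2)                   # same values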
R knows the hypergeometric distribution and the syntax is exactly the same as for
the binomial distribution (except that the names of the parameters have changed).
function (& parameters)    explanation
rhyper(nn,m,n,k)           makes nn random draws of the random variable X and returns them in a vector.
dhyper(x,m,n,k)            returns P(X = x) (the pmf).
phyper(q,m,n,k)            returns P(X ≤ q) (the cdf).
Example 4.2.12.
Suppose that a statistics class has 29 students, 25 of whom are male. Let’s call
the females the white balls and the males the black balls. Suppose that we choose
5 of these students at random and without replacement, i.e., a random sample of
size 5. Let X be the number of females in our sample. Then X ∼ Hyper(4, 25, 5).
Some interesting questions related to this random variable are answered by the
R output below.
> dhyper(x=c(0:5),m=4,n=25,k=5)
[1] 0.4473916888 0.4260873226 0.1162056334 0.0101048377 0.0002105175
[6] 0.0000000000
> dhyper(x=c(0:5),k=5,m=4,n=25)    # order of named arguments does not matter
[1] 0.4473916888 0.4260873226 0.1162056334 0.0101048377 0.0002105175
[6] 0.0000000000
> phyper(q=c(0:5),m=4,n=25,k=5)
[1] 0.4473917 0.8734790 0.9896846 0.9997895 1.0000000 1.0000000
> rhyper(nn=30,m=4,n=25,k=5)       # note nn for number of random outcomes
[1] 2 1 1 1 1 2 2 2 1 1 1 0 1 0 0 0 1 1 0 0 1 1 0 1 1 1 2 0 0 0
> dhyper(0:5,4,25,5)               # default order of unnamed arguments
[1] 0.4473916888 0.4260873226 0.1162056334 0.0101048377 0.0002105175
[6] 0.0000000000
>
4.3. An Introduction to Inference
There are many situations in which the binomial distribution seems to be the right
model for a process but for which π is unknown. The next example gives several quite
natural cases of this.
Example 4.3.1.
1. Microprocessor chips are being produced by an assembly line. There is
a possibility that any particular chip produced is defective. It might be
reasonable under some circumstances to assume that the probability that
Figure 4.2.: Zener cards for ESP testing.
any particular chip is defective is a constant π. Then in a sample of 10
chips, it might be plausible to assume that the number of defective chips
X behaves like a binomial random variable with n = 10 and π fixed but
unknown.
2. Perhaps it is reasonable to assume that a free-throw shooter in a basketball game has a constant probability π of making a free-throw and that
successive attempts are independent one from another. Then in a series
of n free-throws, the number of successful free-throws might behave as a
binomial random variable with n known and π unknown.
3. In a standard test for ESP, a card with one of five printed symbols is
selected without the person claiming to have ESP being able to see it. As
the experimenter “concentrates” on the symbol printed on the card, the
subject is supposed to announce which symbol is on the card. (These cards
are called Zener cards and are pictured in Figure 4.2.) While we think that
the probability that a subject can identify any card is 1/5, the person with
ESP might claim that the probability is higher. If we allow n trials of this
experiment, it is plausible to assume that the number of successful trials
X is a binomial random variable with π unknown.
In situations like those in the example, we often want to test a hypothesis about π.
For example, in the case of the person supposed to have ESP, we would like to test
our hypothesis that π = .2.
Let us look more closely at the ESP situation. What would it take for us to believe
that the subject in fact has a probability greater than 0.2 of correctly identifying the
hidden card? Clearly, we would want to have several trials and a rate of success that
we would think would not be likely by luck (or “chance”) alone. A standard test is
to use 25 trials. (In a standard deck, there are 25 cards with five each of the first
five symbols. Rather than going through the deck once however, we will think of the
experiment as shuffling the deck after each trial. Then it is clear that each of the five
types of cards is equally likely to occur as the top card.) The following R output is
relevant to our test.
> x=c(5:15)
> pbinom(x,25,.2)
[1] 0.6166894 0.7800353 0.8908772 0.9532258 0.9826681 0.9944451 0.9984599
[8] 0.9996310 0.9999237 0.9999864 0.9999979
Even if our subject is just guessing, he will get more than five cards right about
40% of the time. It is certainly possible that he is just guessing if he gets just 6 out of 25 correct. On the other hand, it is virtually certain that he will not get more than 12 right if he is just guessing. While we would not have to believe the ESP explanation for 13 out of 25 successes, it would be difficult to continue asserting that
π = 0.2 in this case. Of course there is a grey area. Suppose that our subject gets
10 cards right. The probability that our subject will get at least 10 cards correct by
guessing alone is less than 2%. Is this sufficiently surprising to rule out guessing as
an explanation? We might not rule out guessing but we would very likely test this
subject further.
The procedure described above for testing the ESP hypothesis is a special case of a
general (class of) procedures known as hypothesis tests. Any hypothesis test follows
the same outline.
Step 1: Identify the hypotheses
A statistical hypothesis test starts, oddly enough, with a hypothesis. A hypothesis is a statement proposing a possible state of affairs with respect to a probability
distribution governing an experiment that we are about to perform. There are a
variety of kinds of hypotheses that we might want to test.
1. A hypothesis stating a fixed value of a parameter: π = .5.
2. A hypothesis stating a range of values of a parameter: π ≤ .3.
3. A hypothesis about the nature of the distribution itself: X has a binomial
distribution.
In the ESP example, the hypothesis that we wished to test was π = .2. Notice
that we did not propose to test the hypothesis that a binomial distribution was the
correct explanation of the data. We assumed that the binomial distribution is a
plausible model of our data collection procedure. It will often be the case that we
make distributional hypotheses without thinking about testing them. (Sometimes
that will be a big mistake.)
In the standard way of describing hypothesis tests, there are actually two hypotheses that we view as being pitted against each other. For example, the two hypotheses
in the ESP case were π = 0.2 (the subject does not have ESP) and π > 0.2 (the
subject does have ESP or some other mechanism of doing better than guessing). The
two hypotheses have standard names.
1. Null Hypothesis. The null hypothesis, usually denoted H0, is generally a hypothesis that the data analysis is intended to investigate. It is usually thought
of as the “default” or “status quo” hypothesis that we will accept unless the
data gives us substantial evidence against it. The null hypothesis is often a
hypothesis that we want to “prove” false.
2. Alternate Hypothesis. The alternate hypothesis, usually denoted H1 or Ha, is the hypothesis that we want to put forward as true if we have sufficient
evidence against the null hypothesis.
Thus we present our hypotheses in the ESP experiment as
H0: π = 0.2
Ha: π > 0.2 .
For the ESP experiment, π = 0.2 is the null hypothesis since it is clearly our starting
point and it is the hypothesis that we wish to retain unless we have convincing evidence
otherwise.
Step 2: Collect data and compute a test statistic
Earlier, we defined a statistic as a number computed from the data. In the ESP
example, our statistic is simply the result of the binomial random variable. Since we
are using the statistic to test a hypothesis, we often call it a test statistic.
In our previous definition of a statistic, a statistic is a number, i.e., it is an actual
value computed from the data. In fact, we are now going to introduce some ambiguity
and refer also to the random variable as a statistic. So in this case, the random variable
X that is the result of counting the number of correct cards in 25 trials is our test
statistic but we will also refer to the value of that random variable x as a test statistic.
Of course the difference is whether we are referring to the experiment before or after
we collect the data. Note that we can only make probability statements about X.
The value of the statistic x is just a number which we have computed from the result
of the random process.
Amidst this confusion, the central point is that if we think of the test statistic as a
random variable, it has a distribution. This distribution is unknown (since we do not
know π).
Step 3: Compute the p-value.
Next we need to evaluate the evidence that our test statistic provides. To do this
requires that we think about our statistic as a random variable. In the ESP testing
example, X ∼ Binom(25, π). The distribution of the test statistic is called its sampling distribution since we think of it as arising from producing a sample of the
process.
Since our test statistic is a random variable, we can ask probability questions about
our test statistic. The key question that we want to ask is this:
How unusual would the value of the test statistic that I obtained be if the
null hypothesis were true?
We show how to answer the question if the result of the ESP experiment were 9
out of 25 correct cards (36%).
Notice that if the null hypothesis is true,
P(X ≥ 9) = 1 − pbinom(8,25,0.2) = 0.047.
Therefore, if the null hypothesis is true, the probability that we would see a result
at least as extreme (in the direction of the alternate hypothesis) as 9 is 0.047. This
probability is called the p-value of the test statistic.
Definition 4.3.2 (p-value). The p-value of a test statistic t is the probability that
a result at least as extreme as t (in the direction of the alternate hypothesis) would
occur if the null hypothesis is true.
Notice that the p-value is a number that is associated with a particular outcome
of the process. The p-value of 9 successes in our example is 0.047. Since the p-value
is computed after the random process is performed, it is not a probability associated
with this particular outcome of the random process. Rather it is a probability that
describes what might happen if the experiment is repeated indefinitely. Namely, if
the null hypothesis is true, then about 5% of the time the subject would get 9 or more
successes in his 25 trials.
Countless journal articles in the social, biological, and medical sciences report on
the results of hypothesis tests. While there are many kinds of hypothesis tests and
it might not always be clear what kind of test the article is reporting on, it is almost
universally the case that the result of such a hypothesis test is reported using a p-value.
It is quite common to see statements such as p < 0.001. This obviously means that
either the null hypothesis being tested is false or something exceedingly surprising
happened.
Step 4: Draw a conclusion
Drawing a conclusion from a p-value is a judgment call and it is a scientific rather
than mathematical decision. The p-value of a test statistic of 9 in the ESP experiment is 0.047. This means that if we tested many people for ESP and they were all just guessing, about 5% of them would have a result at least as extreme as this. A test statistic of 9 provides some evidence that our subject is more successful than we would expect by chance alone, but certainly not definitive evidence. If we were really interested in this question, we would probably subject him to further testing.
Sometimes the results of hypothesis tests are expressed in terms of decisions rather
than p-values. This is often the case when we must take some action based on our
data. We illustrate with a common example.
Example 4.3.3.
Suppose that a company claims that the defective rate of their manufacturing
process is 1%. A customer tests 100 parts in a large shipment and finds 4 of
these parts defective. Is the customer justified in rejecting the shipment? It
is easy to think of this situation as a hypothesis test. The test statistic 4 is
the result of a random variable X ∼ Binom(100, π). The null and alternate
hypotheses are given by
H0 : π = 0.01
Ha : π > 0.01 .
The p-value of this test statistic is 1 − pbinom(3,100,.01) = 0.018. Therefore if the manufacturer's claim is correct, we should only see 4 or more
defectives 1.8% of the time when we test 100 parts. The customer might be
justified in rejecting the shipment and therefore rejecting the null hypothesis.
We will describe our possible decisions as to reject the null hypothesis or not
reject the null hypothesis. There are of course two different kinds of errors that
we could make.
Definition 4.3.4 (Type I and Type II errors). A Type I error is the error of rejecting
H0 even though it is true.
A Type II error is the error of not rejecting H0 even though it is false.
Of course, if we reject the null hypothesis, we cannot know whether we have made
a Type I error. Similarly, if we do not reject the null hypothesis, we cannot know
whether we have made a Type II error. Whether we have committed such an error
depends on the true value of π which we cannot ever know simply from data.
To determine the likelihood of making such errors, we need to specify our decision
rule. For example in Example 4.3.3 we might decide that our rule is to reject the null
hypothesis if we see 4 or more defective parts in our shipment of 100. Then we know
that if the null hypothesis is true, the probability that we will make a type I error is
0.018. Of course we cannot know the probability of a type II error without knowing
the true value of π.
In the next example, we further consider the probabilities of the two kinds of
errors in a particular situation. This example also illustrates another variation on the
hypothesis test as it considers a two-sided alternate hypothesis.
Example 4.3.5.
A coin-toss is going to be used to make a very important decision. Since the coin
is a commemorative coin of non-standard design (like those used in the Super
Bowl), it is very important to know whether it is fair. We decide to do a test
by tossing the coin 100 times and observing the number of heads. It is obvious
that our test statistic x is the result of a random variable X ∼ Binom(100, π)
and that our competing hypotheses should be
H0: π = 0.5
Ha: π ≠ 0.5 .
Notice that the two-sided alternate hypothesis suggests that we want to reject the
fairness hypothesis in the case that the coin favors heads as well as the case that
the coin favors tails. It is reasonable to reject the null hypothesis whenever the
number of heads is too far from 50 in either direction. Let’s suppose that our
decision rule is to reject the null hypothesis if the number of heads is at most
40 or at least 60 (i.e., X ≤ 40 or X ≥ 60). Then the probability of a Type I
error is the probability of getting 40 or fewer or 60 or more heads when the true
probability of heads is 0.5. This is given by
> pbinom(40,100,.5)+(1-pbinom(59,100,0.5))
[1] 0.05688793
The probability of a Type I error is 0.057. The probability of a Type II error for
this decision rule can only be computed if we know the true value of π. Suppose
that π = 0.48. Then the probability of not rejecting the null hypothesis is given
by
> pbinom(59,100,0.48)-pbinom(40,100,0.48)
[1] 0.9231696
In other words, we will almost always make a type II error with this decision
rule if π = 0.48. On the other hand, if π = 0.4, then the probability of a Type
II error is 0.46 as given by
> pbinom(59,100,0.4)-pbinom(40,100,0.4)
[1] 0.4566630
The above computation illustrates the central dilemma of hypothesis testing. If
we want to make it very unlikely that we commit a Type I error as we did with
our decision rule here, it will be very difficult to detect that the null hypothesis
is false. For this decision rule, even 100 tosses of the coin do not allow us much
success in discovering that the coin has a 10% bias!
There is a trade-off between type I and type II errors. If we choose a decision rule
that is less likely to make a Type I error if the null hypothesis is true, then it is
more likely to make a Type II error if the null hypothesis is false. What one should
also notice in our treatment of hypothesis testing is the asymmetry between the two
hypotheses. We are generally not willing to tolerate a large probability of a Type
I error. However this seems to lead to a rather large probability of a Type II error
in the case that the null hypothesis is false. This asymmetry is intentional however
as the null hypothesis usually has a preferred status as the “innocent until proven
guilty” hypothesis.
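To see the trade-off concretely, the following sketch computes the Type II error probability of the decision rule of Example 4.3.5 for several hypothetical values of π. (The values of π used here are arbitrary illustrations, not taken from the example.)

# Type II error probability of the rule "reject H0 if X <= 40 or X >= 60"
# for several hypothetical values of pi (chosen arbitrarily for illustration).
pi.true <- c(0.40, 0.45, 0.48, 0.52, 0.55, 0.60)
typeII <- pbinom(59, 100, pi.true) - pbinom(40, 100, pi.true)
round(cbind(pi.true, typeII), 3)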
4.4. Continuous Random Variables
Recall that a continuous random variable X is one that can take on all values in
an interval of real numbers. For example, the height of a randomly chosen Calvin
student in inches could be any real number between, say, 36 and 80. Of course all
continuous random variables are idealizations. If we measure heights to the nearest
quarter inch, there are only finitely many possibilities for this random variable and we
could, in principle, treat it as discrete. We know from calculus however that treating
measurements as continuous valued functions often simplifies rather than complicates
our techniques. In order to understand what kinds of probability statements we would like to make about continuous random variables, it is helpful to keep in mind this idea of the finite precision of our measurements. For example, a
statement that a randomly chosen individual is 72 inches tall is not a claim that the
individual is exactly 72 inches tall but rather a claim that the height of the individual
is in some small interval (maybe 71 3/4 to 72 1/4 if we are measuring to the nearest half
inch). So probabilities of the form P (X = x) are not especially meaningful. Rather
the appropriate probability statements will be of the form P (a ≤ X ≤ b).
4.4.1. pdfs and cdfs
Recall the analogy of probability and mass. In the case of discrete random variables,
we represented the probability P(X = x) by a point of mass P(X = x) at the point x
and had total mass 1. For a continuous random variable, the mass is spread continuously and the appropriate description of mass is a density function. In the following example, we can see how this works.
Figure 4.3.: Discretized pmf for T .
Example 4.4.1.
A Geiger counter emits a beep when a radioactive particle is detected. The
rate of beeping determines how radioactive the source is. Suppose that we
record the time T to the next beep. It turns out that T behaves like a random
variable. Suppose that we measured T with increasing precision. We might
get histograms that look like those in Figure 4.3 for the pmf of T . It’s pretty
obvious that we want to replace these histograms by a smooth curve. In fact
the pictures should remind us of the pictures drawn for the Riemann sums that
define the integral.
The analogue to a probability mass function for a continuous variable is a probability density function.
Definition 4.4.2 (probability density function, continuous random variable). A probability density function (pdf) is a function f such that
• f (x) ≥ 0 for all real numbers x, and
• \int_{-\infty}^{\infty} f(x) \, dx = 1.

The continuous random variable X defined by the pdf f satisfies

P(a ≤ X ≤ b) = \int_a^b f(x) \, dx

for any real numbers a ≤ b.
The following simple lemma demonstrates one way in which continuous random
variables are very different from discrete random variables.
Lemma 4.4.3. Let X be a continuous random variable with pdf f . Then for any
a ∈ R,
1. P(X = a) = 0,
2. P(X < a) = P(X ≤ a), and
3. P(X > a) = P(X ≥ a).
Proof. P(X = a) = \int_a^a f(x) \, dx = 0 . And P(X ≤ a) = P(X < a) + P(X = a) = P(X < a).
Example 4.4.4.
Q. Consider the function f(x) = 3x^2 for x ∈ [0, 1] and f(x) = 0 otherwise. Show that f is a pdf and calculate P(X ≤ 1/2).

A. Let's begin by looking at a plot of the pdf. [Plot: the graph of f(x) = 3x^2 on [0, 1].] The rectangular region of the plot has an area of 3, so it is plausible that the area under the graph of the pdf is 1. We can verify this by integration.

\int_{-\infty}^{\infty} f(x) \, dx = \int_0^1 3x^2 \, dx = \left. x^3 \right|_0^1 = 1 ,

so f is a pdf and P(X ≤ 1/2) = \int_0^{1/2} 3x^2 \, dx = \left. x^3 \right|_0^{1/2} = 1/8.
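We can also confirm these two integrals numerically in R with the integrate() function; this is only a numerical check of the calculus above.

# Numerical check of Example 4.4.4 using integrate()
f <- function(x) ifelse(x >= 0 & x <= 1, 3 * x^2, 0)
integrate(f, 0, 1)     # total probability; should be 1
integrate(f, 0, 0.5)   # P(X <= 1/2); should be 1/8 = 0.125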
The cdf of a continuous random variable is defined the same way as it was for a
discrete random variable, but we use an integral rather than a sum to get the cdf
from the pdf in this case.
Definition 4.4.5 (cumulative distribution function). Let X be a continuous random
variable with pdf f , then the cumulative distribution function (cdf) for X is
F(x) = P(X ≤ x) = \int_{-\infty}^{x} f(t) \, dt .
Example 4.4.6.
Q. Determine the cdf of the random variable from Example 4.4.4.
A. For any x ∈ [0, 1],

F_X(x) = P(X ≤ x) = \int_0^x 3t^2 \, dt = \left. t^3 \right|_0^x = x^3 .

So

F_X(x) = \begin{cases} 0 & x ∈ (-\infty, 0) \\ x^3 & x ∈ [0, 1] \\ 1 & x ∈ (1, \infty) . \end{cases}
Notice that the cdf FX is an antiderivative of the pdf fX . This follows immediately
from the Fundamental Theorem of Calculus. Notice also that P(a ≤ X ≤ b) =
F (b) − F (a).
Lemma 4.4.7. Let FX be the cdf of a continuous random variable X. Then the pdf
fX satisfies
f_X(x) = \frac{d}{dx} F_X(x) .
Just as the binomial and hypergeometric distributions were important families of
discrete random variables, there are several important families of continuous random
variables that are often used as models of real-world situations. We investigate a few
of these in the next three subsections.
4.4.2. Uniform Distributions
The continuous uniform distribution has a pdf that is constant on some interval.
Definition 4.4.8 (uniform random variable). A continuous uniform random variable
on the interval [a, b] is the random variable with pdf given by
f(x; a, b) = \begin{cases} \frac{1}{b-a} & x ∈ [a, b] \\ 0 & \text{otherwise.} \end{cases}
It is easy to confirm that this function is indeed a pdf. We could integrate, or
we could simply use geometry. The region under the graph of the uniform pdf is a
rectangle with width b − a and height 1/(b − a), so the area is 1.
Example 4.4.9.
Q. Let X be uniform on [0, 10]. What is P(X > 7)? What is P(3 ≤ X < 7)?
A. Again we argue geometrically. P(X > 7) is represented by a rectangle with
base from 7 to 10 along the x-axis and a height of .1, so P(X > 7) = 3·0.1 = 0.3.
Similarly P(3 ≤ X < 7) = 0.4. In fact, for any interval of width w contained in
[0, 10], the probability that X falls in that particular interval is w/10.
We could also compute these results by integrating, but this would be silly.
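If we do want R to do the work, punif() gives the same answers as the geometric argument.

# Checking the geometric calculations with punif()
1 - punif(7, min = 0, max = 10)       # P(X > 7) = 0.3
punif(7, 0, 10) - punif(3, 0, 10)     # P(3 <= X < 7) = 0.4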
Example 4.4.10.

Q. Let X be uniform on the interval [0, 1] (which we denote X ∼ Unif(0, 1)). What is the cdf of X?

A. For x ∈ [0, 1], F_X(x) = \int_0^x 1 \, dt = x, so

F_X(x) = \begin{cases} 0 & x ∈ (-\infty, 0) \\ x & x ∈ [0, 1] \\ 1 & x ∈ (1, \infty) . \end{cases}

[Plots: pdf for Unif(0,1) and cdf for Unif(0,1).]
Although it has a very simple pdf and cdf, this random variable actually has
several important uses. One such use is related to random number generation.
Computers are not able to generate truly random numbers. Algorithms that
attempt to simulate randomness are called pseudo-random number generators.
X ∼ Unif(0, 1) is a model for an idealized random number generator. Computer
scientists compare the behavior of a pseudo-random number generator with the
behavior that would be expected for X to test the quality of the pseudo-random
number generator.
There are R functions for computing the pdf and cdf of a uniform random variable as
well as a function to return random numbers. An additional function computes the
quantiles of the uniform distribution. If X ∼ Unif(min, max) the following functions
can be used.
function (& parameters)    explanation
runif(n,min,max)           makes n random draws of the random variable X and returns them in a vector.
dunif(x,min,max)           returns fX(x) (the pdf).
punif(q,min,max)           returns P(X ≤ q) (the cdf).
qunif(p,min,max)           returns x such that P(X ≤ x) = p.
Here are examples of computations for X ∼ Unif(0, 10).
> runif(6,0,10)      # 6 random values on [0,10]
[1] 5.449745 4.124461 3.029500 5.384229 7.771744 8.571396
> dunif(5,0,10)      # pdf is 1/10
[1] 0.1
> punif(5,0,10)      # half the distribution is below 5
[1] 0.5
> qunif(.25,0,10) # 1/4 of the distribution is below 2.5
[1] 2.5
4.4.3. Exponential Distributions
In Example 4.4.1 we considered a “waiting time” random variable, namely the waiting
time until the next radioactive event. Waiting times are important random variables
in reliability studies. For example, a common characteristic of a manufactured object
is MTF or mean time to failure. The model often used for the Geiger counter random
variable is the exponential distribution. Note that a waiting time can be any x in the
range 0 ≤ x < ∞.
Definition 4.4.11 (The exponential distribution). The random variable X has the
exponential distribution with parameter λ > 0 (X ∼ Exp(λ)) if X has the pdf
f_X(x) = \begin{cases} λ e^{-λx} & x ≥ 0 \\ 0 & x < 0 . \end{cases}
It is easy to see that the function fX of the previous definition is a pdf for any
positive value of λ. R refers to the value of λ as the rate so the appropriate functions
in R are rexp(n,rate), dexp(x,rate), pexp(x,rate), and qexp(p,rate). We will
see later that rate is an apt name for λ as λ will be the rate per unit time if X is a
waiting time random variable.
Figure 4.4.: The pdf and cdf of the random variable T ∼ Exp(0.1).
Example 4.4.12.
Suppose that a random variable T measures the time until the next radioactive
event is recorded at a Geiger counter (time measured since the last event).
For a particular radioactive material, a plausible model for T is T ∼ Exp(0.1)
where time is measured in seconds. Then the following R session computes some
important values related to T .
> pexp(q=0.1,rate=.1)     # probability waiting time less than .1
[1] 0.009950166
> pexp(q=1,rate=.1)       # probability waiting time less than 1
[1] 0.09516258
> pexp(q=10,rate=.1)
[1] 0.6321206
> pexp(q=20,rate=.1)
[1] 0.8646647
> pexp(100,rate=.1)
[1] 0.9999546
> pexp(30,rate=.1)-pexp(5,rate=.1)    # probability waiting time between 5 and 30
[1] 0.5567436
> qexp(p=.5,rate=.1)      # probability is .5 that T is less than 6.93
[1] 6.931472
The graphs in Figure 4.4 are graphs of the pdf and cdf of this random variable.
All exponential distributions look the same except for the scale. The rate of 0.1
here means that we can expect that in the long run this process will average
0.1 counts per second.
Notice that when given a random variable such as the waiting time to a Geiger
counter event, we are not handed its pdf as well. The pdf is a model of the situation.
In the case of an example such as this, we really are faced with two decisions.
1. Which family (e.g., uniform, exponential, etc.) of distributions best models the
situation?
2. What particular values of the parameters should we use for the pdf?
Sometimes we can begin to answer question 1 even before we collect data. Each
of the distributions that we have met has certain properties which we check against
our process. For example, it is often apparent whether the properties of a binomial
process should apply to a certain process we are examining. Of course it is always
useful to check our answer to question 1 by collecting data and verifying that the
shape of the distribution of the data collected is consistent with the distribution we
are using. The only reasonable way to answer the second question however is to
collect data. In the last example, for instance, we saw that if X ∼ Exp(0.1), then
P (X ≤ 6.93) = .5. Therefore if about half of our data are less than 6.93, we would
say that the data are consistent with the hypothesis that X ∼ Exp(0.1) but if almost
all the data are less than 5, we would probably doubt that X has this distribution.
The problem of choosing the appropriate distribution and the appropriate values of the parameters is an important one that we will address in various ways in Chapter 5.
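As a rough sketch of this kind of check, we can compare the proportion of waiting times below the model median 6.93 to the 50% that the Exp(0.1) model predicts. (Here simulated data stand in for observed data; in practice we would use the recorded waiting times.)

# A rough model check; "waits" would be observed waiting times in practice.
waits <- rexp(1000, rate = 0.1)        # simulated data standing in for real data
qexp(0.5, rate = 0.1)                  # model median, about 6.93
mean(waits < qexp(0.5, rate = 0.1))    # should be near 0.5 if the model fits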
4.4.4. Weibull Distributions
A very important generalization of the exponential distribution is the Weibull distribution. Weibull distributions are often used by engineers to model phenomena such as failure, manufacturing, or delivery times. They have also been used for applications as diverse as fading in wireless communications channels and wind velocity. The
Weibull is a two-parameter family of distributions. The two parameters are a shape
parameter α and a scale parameter β.
Definition 4.4.13 (The Weibull distributions). The random variable X has a Weibull
distribution with shape parameter α > 0 and scale parameter β > 0 (X ∼ Weibull(α, β))
if the pdf of X is
f_X(x; α, β) = \begin{cases} \frac{α}{β^α} x^{α-1} e^{-(x/β)^α} & x ≥ 0 \\ 0 & x < 0 \end{cases}
Notice that if X ∼ Weibull(1, λ) then X ∼ Exp(1/λ). Varying α in the Weibull
distribution changes the shape of the distribution while changing β changes the scale.
The effect of fixing β (β = 5) and changing α (α = 1, 2, 3) is illustrated by the first
graph in Figure 4.5 while the second graph shows the effect of changing β (β = 1, 3, 5)
with α fixed at α = 2. The appropriate R functions to compute with the Weibull
distribution are dweibull(x,shape,scale), pweibull(q,shape,scale), etc.
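As a quick numerical check of the relationship between the Weibull and exponential distributions noted above (the value β = 10 below is an arbitrary choice):

# Weibull with shape 1 and scale beta is the same as Exp(1/beta)
beta <- 10                       # arbitrary choice for illustration
x <- c(1, 5, 10, 20)
dweibull(x, shape = 1, scale = beta)
dexp(x, rate = 1/beta)           # same values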
Figure 4.5.: Left: fixed β. Right: fixed α.
Example 4.4.14.
The Weibull distribution is sometimes used to model the maximum wind velocity measured during a 24 hour period at a specific location. The dataset
http://www.calvin.edu/~stob/data/wind.csv gives the maximum wind velocity at the San Diego airport on each of 6,209 consecutive days. It is claimed
that the maximum wind velocity measured on a day behaves like a random
variable W that has a Weibull distribution with α = 3.46 and β = 16.90. The
R code below investigates that model using this past data. (In fact, this model
is not a very good one although the output below suggests that it might be
plausible.)
> w$Wind
[1] 14 11 10 13 11 11 26 21 14 13 10 10 13 10 13 13 12 12 13 17 11 11 13 25 15
[26] 18 13 17 12 14 15 10 16 17 17 13 18 14 12 20 11 14 20 16 12 14 18 17 13 16
[51] 13 16 11 13 11 15 13 15 16 18 14 15 15 14 14 16 15 18 14 16 14 10 17 14 12
.............
> cutpts=c(0,5,10,15,20,25,30)
> table(cut(w$Wind,cutpts))
  (0,5]  (5,10] (10,15] (15,20] (20,25] (25,30]
      2     434    3303    1910     409      95
> length(w$Wind[w$Wind<12.5])/6209    # 27.3% of days with max windspeed less than 12.5
[1] 0.2728298
> pweibull(12.5,3.46,16.9)            # 29.7% predicted by Weibull model
[1] 0.2968784
> length(w$Wind[w$Wind<22.5])/6209
[1] 0.951361
> pweibull(22.5,3.46,16.9)
[1] 0.9322498
> simulation=rweibull(100000,3.46,16.9)   # 100,000 simulated days
> mean(simulation)                    # simulated days have mean about the same as actual
[1] 15.18883
> mean(w$Wind)
[1] 15.32405
> sd(simulation)                      # simulated days have greater variation
[1] 4.85144
> sd(w$Wind)
[1] 4.239603
>
4.5. The Mean of a Random Variable
Just as numerical summaries of a data set can help us understand our data, numerical summaries of the distribution of a random variable can help us understand the
behavior of that random variable. In this section we introduce the notion of a mean
of a random variable. The name of this summary, mean, is no accident. The mean
of a random variable is supposed to measure the “center” of a distribution in the
same way that the mean of data measures the center of that data. We will use our
experience with data to help us develop a definition.
4.5.1. The Mean of a Discrete Random Variable
Example 4.5.1.
Q. Let’s begin with a motivating example. Suppose a student has taken 10
courses and received 5 A’s, 4 B’s and 1 C. Using the traditional numerical scale
where an A is worth 4, a B is worth 3 and a C is worth 2, what is this student’s
GPA (grade point average)?
A. The first thing to notice is that (4 + 3 + 2)/3 = 3 is not correct. We cannot simply add up the values and divide by the number of values. Clearly this student should have a GPA that is higher than 3.0, since there were more A's than C's. Consider now a correct way to do this calculation and some algebraic reformulations of it.

GPA = \frac{4+4+4+4+4+3+3+3+3+2}{10} = \frac{5 \cdot 4 + 4 \cdot 3 + 1 \cdot 2}{10}
    = \frac{5}{10} \cdot 4 + \frac{4}{10} \cdot 3 + \frac{1}{10} \cdot 2
    = 4 \cdot \frac{5}{10} + 3 \cdot \frac{4}{10} + 2 \cdot \frac{1}{10}
    = 3.4
Our definition of the mean of a random variable follows the example above. Notice
that we can think of the GPA as a sum of terms of the form
(grade)(proportion of students getting that grade) .
Since the limiting proportion of outcomes that have a particular value is the probability of that value, we are led to the following definition.
Definition 4.5.2 (mean). Let X be a discrete random variable with pmf f . The
mean (also called expected value) of X is denoted as µX or E(X) and defined by
µ_X = E(X) = \sum_x x \cdot f(x) .
The sum is taken over all possible values of X.
Example 4.5.3.
Q. If we flip four fair coins and let X count the number of heads, what is E(X)?
A. If we flip four fair coins and let X count the number of heads, then the distribution of X is described by the following table. (Note that X ∼ Binom(4, .5).)
value of X      0     1     2     3     4
probability    1/16  4/16  6/16  4/16  1/16

So the expected value is

0 \cdot \frac{1}{16} + 1 \cdot \frac{4}{16} + 2 \cdot \frac{6}{16} + 3 \cdot \frac{4}{16} + 4 \cdot \frac{1}{16} = 2
On average we get 2 heads in 4 tosses. This is certainly in keeping with our
informal understanding of the word average.
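The same computation can be done in R directly from the definition of the mean.

# E(X) for X ~ Binom(4, 0.5) computed directly from the definition
x <- 0:4
sum(x * dbinom(x, size = 4, prob = 0.5))   # equals 2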
More generally, the mean of a binomial random variable is found by the following
Theorem.
Theorem 4.5.4. Let X ∼ Binom(n, π). Then E(X) = nπ.
Similarly, the mean of a hypergeometric random variable is just what we think it
should be.
Theorem 4.5.5. Let X ∼ Hyper(m, n, k). Then E(X) = km/(m + n).
The following example illustrates the computation of the mean for a hypergeometric
random variable.
> x=c(0:5)
> p=dhyper(x,m=4,n=25,k=5)
> sum(x*p)
[1] 0.6896552
> 4/29 * 5
[1] 0.6896552
4.5.2. The Mean of a Continuous Random Variable
If we think of probability as mass, then the expected value for a discrete random
variable X is the center of mass of a system of point masses where a mass fX (x)
is placed at each possible value of X. The expected value of a continuous random
variable should also be the center of mass where the pdf is now interpreted as density.
Definition 4.5.6 (mean). Let X be a continuous random variable with pdf f . The
mean of X is defined by
µ_X = E(X) = \int_{-\infty}^{\infty} x f(x) \, dx .
Example 4.5.7.
Recall the pdf in Example 4.4.4: f(x) = 3x^2 for x ∈ [0, 1] and f(x) = 0 otherwise. Then

E(X) = \int_0^1 x \cdot 3x^2 \, dx = 3/4 .
The value 3/4 seems plausible from the graph of f .
We compute the mean of two of our favorite continuous random variables in the
next Theorem.
Theorem 4.5.8.
1. If X ∼ Unif(a, b) then E(X) = (a + b)/2.
2. If X ∼ Exp(λ) then E(X) = 1/λ.
Proof. The proof of each of these is a simple integral. These are left to the reader.
Our intuition tells us that in a large sequence of trials of the random process
described by X, the sample mean of the observations should usually be close to the mean of X. This is in fact true and is known as the Law of Large Numbers. We will
not state that law precisely here but we will illustrate it using several simulations in
R.
> r=rexp(100000,rate=1)
> mean(r)            # should be 1
[1] 0.9959467
> r=runif(100000,min=0,max=10)
> mean(r)            # should be 5
[1] 5.003549
> r=rbinom(100000,size=100,p=.1)
> mean(r)            # should be 10
[1] 9.99755
> r=rhyper(100000,m=10,n=20,k=6)
> mean(r)            # should be 2
[1] 1.99868
4.6. Functions of a Random Variable
After collecting data, we often transform it. That is, we apply some function to all
the data. For example, we saw the value of using a logarithmic transformation (on
the U.S. Counties data) to make a distribution more symmetric. Now consider the
notion of transforming a random variable.
Definition 4.6.1 (transformation). Suppose that t is a function defined on all the
possible values of the random variable X. Then the random variable t(X) is the
random variable that has outcome t(x) whenever x is the outcome of X.
If the random variable Y is defined by Y = t(X), then Y itself has an expected
value. To find the expected value of Y , we would need to find the pmf or pdf of Y ,
fY (y), and then use the definition of E(Y ) to compute E(Y ). Occasionally, this is
easy to do, particularly in the case of a discrete random variable X.
Example 4.6.2.
Suppose that X is the random variable that results when a single die is rolled
and the number on its face recorded. The pmf of X is f(x) = 1/6, x = 1, 2, 3, 4, 5, 6, and E(X) = 3.5. Now suppose that for a certain game, the value Y = X^2 is interesting. Then the pmf of Y is easily seen to be f(y) = 1/6, y = 1, 4, 9, 16, 25, 36, and E(Y) = 91/6 ≈ 15.2. Note that to find E(Y) we first found the pmf of Y and then found E(Y) using the usual method. Note that E(Y) ≠ [E(X)]^2!
It turns out that there is a way to compute E(t(X)) that does not require us to
first find fY . This is especially useful in the case that X is continuous.
Lemma 4.6.3. If X is a random variable (discrete or continuous) and t a function
defined on the values of X, then if Y = t(X) and X has pdf (pmf) fX
E(Y) = \begin{cases} \sum_x t(x) f_X(x) & \text{if } X \text{ is discrete} \\ \int_{-\infty}^{\infty} t(x) f_X(x) \, dx & \text{if } X \text{ is continuous.} \end{cases}
We will not give the proof but it is easy to see that this lemma should be so (at
least for the discrete case) by looking at an example.
Example 4.6.4.
Let X be the result of tossing a fair die. X has possible outcomes 1, 2, 3, 4, 5, 6.
Let Y be the random variable |X − 2|. Then the lemma gives

E(Y) = \sum_{x=1}^{6} |x - 2| \cdot \frac{1}{6} = 1 \cdot \frac{1}{6} + 0 \cdot \frac{1}{6} + 1 \cdot \frac{1}{6} + 2 \cdot \frac{1}{6} + 3 \cdot \frac{1}{6} + 4 \cdot \frac{1}{6} = \frac{11}{6} .
But we can also compute E(Y ) directly from the definition. Noting that the
possible values of Y are 0, 1, 2, 3, 4, we have
E(Y) = \sum_{y=0}^{4} y f_Y(y) = 0 \cdot \frac{1}{6} + 1 \cdot \frac{2}{6} + 2 \cdot \frac{1}{6} + 3 \cdot \frac{1}{6} + 4 \cdot \frac{1}{6} = \frac{11}{6} .
The sum that computes E(Y) directly from the definition is clearly the same as the sum given by the lemma, but in a “different order” and with some terms combined, since more than one x may produce a given value of Y.
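The lemma's computation is easy to reproduce in R.

# E(|X - 2|) for a fair die, computed as in the lemma
x <- 1:6
sum(abs(x - 2) * (1/6))   # 11/6, about 1.833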
Example 4.6.5.
Suppose that X ∼ Unif(0, 1) and that Y = X^2. Then

E(Y) = \int_0^1 x^2 \cdot 1 \, dx = 1/3 .
This is consistent with the following simulation.
> x=runif(1000,0,1)
> y=x^2
> mean(y)
[1] 0.326449
While it is not necessarily the case that E(t(X)) = t(E(X)) (see problem 4.22), the
next proposition shows that the expectation function is a “linear operator.”
Lemma 4.6.6. If a and b are real numbers, then E(aX + b) = a E(X) + b.
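A small simulation illustrates the lemma. (The values a = 3, b = 7, and the Unif(0, 10) distribution are arbitrary choices for illustration.)

# Illustrating E(aX + b) = a E(X) + b by simulation
# (a = 3, b = 7, and X ~ Unif(0, 10) are arbitrary choices)
x <- runif(100000, min = 0, max = 10)
mean(3 * x + 7)      # close to 3 * 5 + 7 = 22
3 * mean(x) + 7      # essentially the same number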
4.6.1. The Variance of a Random Variable
We are now in a position to define the variance of a random variable. Recall that
the variance of a set of n data points x1, . . . , xn is almost the average of the squared deviations from the sample mean:

Var(x) = \sum_{i=1}^{n} (x_i - \bar{x})^2 / (n - 1) .
The natural analogue for random variables is the following.
Definition 4.6.7 (variance, standard deviation of a random variable). Let X be a
random variable. The variance of X is defined by
σ_X^2 = Var(X) = E((X − µ_X)^2) .
The standard deviation is the square root of the variance and is denoted σX .
It is obvious from the definition that σX ≥ 0 and that σX > 0 unless X = µX with
probability 1.
Example 4.6.8.
Suppose that X is a uniform random variable, X ∼ Unif(0, 1). Then E(X) =
1/2. To compute the variance of X we need to compute

\int_0^1 (x - 1/2)^2 \, dx .
It is easy to see that the value of this integral is 1/12.
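A quick numerical check with integrate() confirms the value 1/12.

# Numerical check that Var(X) = 1/12 for X ~ Unif(0, 1)
integrate(function(x) (x - 1/2)^2, 0, 1)   # the pdf is 1 on [0, 1]
1/12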
The following lemma records the variance of several of our favorite random variables.

Lemma 4.6.9.
1. If X ∼ Binom(n, π) then Var(X) = nπ(1 − π).
2. If X ∼ Hyper(m, n, k) then Var(X) = k \frac{m}{m+n} \frac{n}{m+n} \frac{m+n-k}{m+n-1} .
3. If X ∼ Unif(a, b) then Var(X) = (b − a)^2/12.
4. If X ∼ Exp(λ) then Var(X) = 1/λ^2.
It is instructive to compare the variances of the binomial and the hypergeometric
distribution. We do that in the next example.
Example 4.6.10.
Suppose that a population has 10,000 voters and that 4,000 of them plan to
vote for a certain candidate. We select 100 voters at random and ask them if
they favor this candidate. Obviously, the number of voters X that favor this
candidate has the distribution Hyper(4000, 6000, 100). This distribution has
mean 40 and variance 100(.4)(.6)(.99). On the other hand, were we to treat
this situation as sampling with replacement so that X ∼ Binom(100, .4), X
would have mean 40 and variance 100(.4)(.6). The only difference in the two
expressions for the variance is the term (m + n − k)/(m + n − 1), which is sometimes called the finite population correction factor. It should really be called the sampling without replacement correction factor.
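A simulation sketch of this comparison (using 100,000 simulated samples of each kind):

# Comparing the two variances by simulation
h <- rhyper(100000, m = 4000, n = 6000, k = 100)   # sampling without replacement
b <- rbinom(100000, size = 100, prob = 0.4)        # sampling with replacement
var(h)   # should be near 100 * .4 * .6 * (9900/9999), about 23.76
var(b)   # should be near 100 * .4 * .6 = 24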
The following lemma sometimes helps us to compute the variance of X. It also is
useful in understanding the properties of the variance.
Lemma 4.6.11. Suppose that the random variable X is either discrete or continuous
with mean µX . Then
σ_X^2 = E(X^2) − µ_X^2 .

Proof. We have

σ_X^2 = E((X − µ_X)^2) = E(X^2 − 2µ_X X + µ_X^2) = E(X^2) − 2µ_X E(X) + µ_X^2 = E(X^2) − µ_X^2 .
Note that we have used the linearity of E and also that E(c) = c if c is a constant.
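A simulation illustrates the identity. (The choice X ∼ Exp(0.1) is arbitrary.)

# Illustrating Var(X) = E(X^2) - mu^2 by simulation for X ~ Exp(0.1)
x <- rexp(100000, rate = 0.1)
mean(x^2) - mean(x)^2   # should be near 1/0.1^2 = 100
var(x)                  # essentially the same (var() uses the n - 1 divisor)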
4.7. The Normal Distribution
The most important distribution in statistics is called the normal distribution.
Definition 4.7.1 (normal distribution). A random variable X has the normal distribution with parameters µ and σ if X has pdf

f(x; µ, σ) = \frac{1}{\sqrt{2π}\,σ} e^{-(x-µ)^2/2σ^2} ,   −∞ < x < ∞ .

We write X ∼ Norm(µ, σ) in this case.
Figure 4.6.: The pdf of a standard normal random variable.
The mean and variance of a normal distribution are µ and σ^2 so that the parameters are aptly, rather than confusingly, named. R functions dnorm(x,mean,sd),
pnorm(q,mean,sd), rnorm(n,mean,sd), and qnorm(p,mean,sd) compute the relevant values.
If µ = 0 and σ = 1 we say that X has a standard normal distribution. Figure 4.6 provides a graph of the density of the standard normal distribution. Notice
the following important characteristics of this distribution: it is unimodal, symmetric, and can take on all possible real values both positive and negative. The curve in
Figure 4.6 suffices to understand all of the normal distributions due to the following
lemma.
Lemma 4.7.2. If X ∼ Norm(µ, σ) then the random variable Z = (X − µ)/σ has the
standard normal distribution.
Proof. To see this, we show that P(a ≤ Z ≤ b) is computed by the integral of the
standard normal density function.
P(a ≤ Z ≤ b) = P\left(a ≤ \frac{X - µ}{σ} ≤ b\right) = P(µ + aσ ≤ X ≤ µ + bσ) = \int_{µ+aσ}^{µ+bσ} \frac{1}{\sqrt{2π}\,σ} e^{-(x-µ)^2/2σ^2} \, dx .
Now in the integral, make the substitution u = (x − µ)/σ. We have then that
\int_{µ+aσ}^{µ+bσ} \frac{1}{\sqrt{2π}\,σ} e^{-(x-µ)^2/2σ^2} \, dx = \int_a^b \frac{1}{\sqrt{2π}} e^{-u^2/2} \, du .
But the latter integral is precisely the integral that computes P(a ≤ U ≤ b) if U is a
standard normal random variable.
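A short R check of the lemma (the values µ = 10 and σ = 2, and the interval [8, 14], are arbitrary choices for illustration):

# P(8 <= X <= 14) for X ~ Norm(10, 2) equals P(-1 <= Z <= 2) for standard normal Z
pnorm(14, mean = 10, sd = 2) - pnorm(8, mean = 10, sd = 2)
pnorm(2) - pnorm(-1)   # same probability after standardizing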
The normal distribution is used so often that it is helpful to commit to memory
certain important probability benchmarks associated with it.
The 68–95–99.7 Rule
If Z has a standard normal distribution, then
1. P(−1 ≤ Z ≤ 1) ≈ 68%
2. P(−2 ≤ Z ≤ 2) ≈ 95%
3. P(−3 ≤ Z ≤ 3) ≈ 99.7%.
If the distribution of X is normal (but not necessarily standard normal), then these
approximations have natural interpretations using Lemma 4.7.2. For example, we can
say that the probability that X is within one standard deviation of the mean is about
68%.
Example 4.7.3.
In 2000, the average height of a 19-year old United States male was 69.6 inches.
The standard deviation of the population of males was 5.8 inches. The distribution of heights of this population is well-modeled by a normal distribution.
Then the percentage of males within 5.8 inches of 69.6 inches was approximately
68%. In R,
> pnorm(69.6+5.8,69.6,5.8)-pnorm(69.6-5.8,69.6,5.8)
[1] 0.6826895
It turns out that the normal distribution is a good model for many variables. Whenever a variable has a unimodal, symmetric distribution in some population, we tend to
think of the normal distribution as a possible model for that variable. For example,
suppose that we take repeated measures of a difficult to measure quantity such as
the charge of an electron. It might be reasonable to assume that our measurements
center on the true value of the quantity but have some spread around that true value.
And it might also be reasonable to assume that the spread is symmetric around the
true value with measurements closer to the true value being more likely to occur than
measurements that are further away from the true value. Then a normal random
variable is a candidate (and often used) model for this situation.
4.8. Exercises
4.1 Suppose that you roll 5 standard dice. Determine the probability that all the
dice are the same. (Hint: first compute the probability that all five dice are sixes.)
4.2 Suppose that you deal 5 cards from a standard deck of cards. Determine the
probability that all the cards are of the same color. (A standard deck of cards has 52
cards in two colors. There are 26 red and 26 black cards. You should be able to do
this computation using R and the appropriate discrete distribution.)
4.3 Acceptance sampling is a procedure that tests some of the items in a lot and
decides to accept or reject the entire lot based on the results of testing the sample.
Suppose that the test determines whether an item is “acceptable” or “defective”.
Suppose that in a lot of 100 items, 4 are tested and that the lot is rejected if one or
more of those four are found to be defective.
a) If 10% of the lot of 100 are defective, what is the probability that the purchaser
will reject the shipment?
b) If 20% of the lot of 100 are defective, what is the probability that the purchaser
will reject the shipment?
4.4 Suppose that there are 10,000 voters in a certain community. A random sample
of 100 of the voters is chosen and are asked whether they are for or against a new
bond proposal. Suppose that in fact only 4,500 of the voters are in favor of the bond
proposal.
a) What is the probability that fewer than half of the sampled voters (i.e., 49 or
fewer) are in favor of the bond proposal?
b) Suppose instead that the sample consists of 2,000 voters. Answer the same
question as in the previous part.
4.5 If the population is very large relative to the size of the sample, sampling with
replacement should yield very similar results to that of sampling without replacement.
Suppose that an urn contains 10,000 balls, 3,000 of which are white.
a) If 100 of these balls are chosen at random with replacement, what is the probability that at most 25 of these are white?
b) If 100 of these balls are chosen at random without replacement, what is the
probability that at most 25 of these are white?
4.6 In the days before calculators, it was customary for textbooks to include tables
of the cdf of the binomial distribution for small values of n. Of course not all values
of π could be included — often only the values π = .1, .2, . . . , .8, .9 were included. Let's suppose that one of these tables includes the value of the cdf of the binomial
distribution for all n ≤ 25, all x ≤ n and all these values of π.
a) To save space, the values of π = .6, .7, .8, .9 could be omitted. Give a clear
reason why F (x; n, π) could be computed for these values of π from the other
values in the table.
b) On the other hand, we could instead omit the values of x ≥ n/2. Show how
the value of F (x; n, π) could be computed from the other values in the table for
such omitted values of x.
(Hint: one person’s success is another person’s failure.)
4.7 The number of trials in the ESP experiment, 25, was arbitrary and perhaps too
small. Suppose that instead we use 100 trials.
a) Suppose that the subject gets 30 right. What is the p-value of this test statistic?
b) Suppose that the subject actually has a probability of .30 of guessing the card
correctly. What is the probability that the subject will get at least 30 correct?
4.8 A basketball player claims to be a 90% free-throw shooter. Namely, she claims
to be able to make 90% of her free-throws. Should we doubt her claim if she makes
14 out of 20 in a session at practice? Set this problem up as a hypothesis testing
problem and answer the following questions.
a) What are the null and alternate hypotheses?
b) What is the p-value of the result 14?
c) If the decision rule is to reject her claim if she makes 15 or fewer free-throws,
what is the probability of a Type I error?
4.9 Nationally, 79% of students report that they have cheated on an exam at some
point in their college career. You can’t believe that the number is this high at your
own institution. Suppose that you take a random sample of size 50 from your student
body. Since 50 is so small compared to the size of the student body, you can treat
this sampling situation as sampling with replacement for the purposes of doing a
statistical analysis.
a) Write an appropriate set of hypotheses to test the claim that 79% of students
cheat.
b) Construct a decision rule so that the probability of a Type I error is less than
5%.
4.10 A random variable X has the triangular distribution if it has pdf

f_X(x) = \begin{cases} 2x & x ∈ [0, 1] \\ 0 & \text{otherwise.} \end{cases}

a) Show that fX is indeed a pdf.
b) Compute P(0 ≤ X ≤ 1/2).
c) Find the number m such that P(0 ≤ X ≤ m) = 1/2. (It is natural to call m the median of the distribution.)
4.11 Let f(x) = \begin{cases} k(x − 2)(x + 2) & −2 ≤ x ≤ 2 \\ 0 & \text{otherwise.} \end{cases}
a) Determine the value of k that makes f a pdf. Let X be the corresponding
random variable.
b) Calculate P(X ≥ 0).
c) Calculate P(X ≥ 1).
d) Calculate P(−1 ≤ X ≤ 1).
4.12 Describe a random variable that is neither continuous nor discrete. Does your
random variable have a pmf? a pdf? a cdf?
4.13 Show that if f and g are pdfs and α ∈ [0, 1], then αf + (1 − α)g is also a pdf.
4.14 Suppose that a number of measurements that are made to 3 decimal digits
accuracy are each rounded to the nearest whole number. A good model for the
“rounding error” introduced by this process is that X ∼ Unif(−.5, .5) where X is the
difference between the true value of the measurement and the rounded value.
a) Explain why this uniform distribution might be a good model for X.
b) What is the probability that the rounding error has absolute value smaller than
.1?
4.15 If X ∼ Exp(λ), find the median of X. That is find the number m (in terms of
λ) such that P(X ≤ m) = 1/2.
4.16 A part in the shuttle has a lifetime that can be modeled by the exponential
distribution with parameter λ = 0.0001, where the units are hours. The shuttle
mission is scheduled for 200 hours.
a) What is the probability that the part fails on the mission?
b) The event that is described in part (a) is BAD. So the shuttle actually runs
three of these systems in parallel. What is the probability that the mission ends
without all three failing if they are functioning independently?
c) Is the assumption of independence in the previous part a realistic one?
4.17 The lifetime of a certain brand of water heaters in years can be modeled by a
Weibull distribution with α = 2 and β = 25.
a) What is the probability that the water heater fails within its warranty period
of 10 years?
b) What is the probability that the water heater lasts longer than 30 years?
c) Using a simulation, estimate the average life of one of these water heaters.
4.18 Suppose that a fair die is tossed until the number 6 occurs. Let the random variable X count the number of tosses needed (so the possible values of X are
1, 2, 3, 4, . . . .)
a) Write the pmf of X. (Hint: compute f(1), f(2), f(3), . . . until you see the pattern.)
b) What is the expected value of X? Hint: it will be useful to recall the following fact from 162:

\sum_{x=1}^{\infty} x r^{x-1} = \frac{1}{(1-r)^2} .
4.19 Prove Theorem 4.5.8.
4.20 Suppose that you have an urn containing 100 balls, some unknown number of
which are red and the rest are black. You choose 10 balls without replacement and
find that 4 of them are red.
a) How many red balls do you think are in the urn? Give an argument using the
idea of expected value.
b) Suppose that there were only 20 red balls in the urn. How likely is it that a
sample of 10 balls would have at least 4 red balls?
4.21 The file http://www.calvin.edu/~stob/data/scores.csv contains a dataset
that records the time in seconds between scores in a basketball game played between
Kalamazoo College and Calvin College on February 7, 2003.
a) This waiting time data might be modeled by an exponential distribution. Make
some sort of graphical representation of the data and use it to explain why the
exponential distribution might be a good candidate for this data.
b) If we use the exponential distribution to model this data, which λ should we
use? (A good choice would be to make the sample mean equal to the expected
value of the random variable.)
c) Your model of part (b) makes a prediction about the proportion of times that
the next score will be within 10, 20, 30 and 40 seconds of the previous score.
Test that prediction against what actually happened in this game.
4.22 Show that it is not necessarily the case that E(t(X)) = t(E(X)).
4.23 Prove Lemma 4.6.6 in the case that X is continuous.
4.24 Let X be the random variable that results from tossing a fair six-sided die and reading the result (1–6). Since E(X) = 3.5, the following game seems fair. I will pay you 3.5^2 and then we will roll the die and you will pay me the square of the result. Is the game fair? Why or why not?
4.25 Not every distribution has a mean! Define

f(x) = \frac{1}{π} \cdot \frac{1}{1 + x^2} ,   −∞ < x < ∞ .
a) Show that f is a density function. (The resulting distribution is called the
Cauchy distribution.)
b) Show that this distribution does not have a mean. (You will need to recall the
notion of an improper integral.)
4.26 In this problem we compare sampling with replacement to sampling without
replacement. You will recall that the former is modeled by the binomial distribution
and the latter by the hypergeometric distribution. Consider the following setting.
There are 4,167 students at Calvin and we would like to know what they think about
abolishing the interim. We take a random sample of size 100 and ask the 100 students
whether or not they favor abolishing the interim. Suppose that 1,000 students favor
abolishing the interim and the other 3,167 misguidedly want to keep it.
a) Suppose that we sample these 100 students with replacement. What is the mean
and the variance of the random variable that counts the number of students in
the sample that favor abolishing the interim?
b) Now suppose that we sample these 100 students without replacement. What is
the mean and the variance of the random variable that counts the number of
students in the sample that favor abolishing the interim?
c) Comment on the similarities and differences between the two. Give an intuitive
reason for any difference.
4.27 Scores on IQ tests are scaled so that they have a normal distribution with mean
100 and standard deviation 15 (at least on the Stanford-Binet IQ Test).
a) MENSA, a society supposedly for persons of high intellect, requires a score of
130 on the Stanford-Binet IQ test for membership. What percentage of the
population qualifies for MENSA?
b) One psychology text labels those with IQs of between 80 and 115 as having
“normal intelligence.” What percentage of the population does this range contain?
c) The top 25% of scores on an IQ test are in what range?
5. Inference - One Variable
In Chapter 2 we introduced random sampling as a way of making inferences about
populations. Recall the framework. We first identified a population and some
parameters of that population about which we wanted to make inferences. We then
chose a sample, most often by simple random sampling, and computed statistics
from that sample to allow us to make statements about the parameters. Alas, these
statements were subject to sampling error. Armed now with the technology of the
last two chapters, we develop this framework further with a particular emphasis on
understanding sampling error. We will focus especially on the problem of making
inferences about the mean of a population from that of a sample.
5.1. Statistics and Sampling Distributions
5.1.1. Samples as random variables
Suppose that we have a large population and a variable x defined on that population,
and we would like to estimate the mean of x on that population. We choose a simple
random sample x1 , . . . , xn and compute x̄. How is this sample mean related to the
population mean? In other words, what is likely to be the sampling error?
Consider the first value of the sample, x1 . This value is the result of a random
variable, namely the random variable that results from choosing an individual from
the population at random and measuring or recording the value of the variable x.
We call that random variable X1 . Similarly, X2 is the process of choosing the second
element of the sample. And so forth. The result is a sequence of random variables
X1 , . . . , Xn .
Since we are now thinking of the data x1 , . . . , xn as the result of the random
variables X1 , . . . , Xn , the sample mean x̄ is the result of a random variable as well,
namely
X̄ = (X1 + · · · + Xn)/n .
Then X̄ is a random variable and so it also has a distribution. We’ll call the distribution of X̄ the sampling distribution of the mean since it is a distribution that
results from sampling. The same kind of analysis can be done for any statistic. For
example, we will write SX² for the random variable that is the result of computing the
sample variance that results from X1 , . . . , Xn . This is indeed a random variable —
different possible samples may have different values of SX². As another example, the
sample median X̃ is a statistic and so it has a distribution as well.
We would like to know the distribution of the random variables X̄ and SX² (as well
as the distribution of any other statistics that we might want to compute). Obviously,
these distributions depend on the distributions of X1 , . . . , Xn , which in turn depend
on the underlying population. Before investigating this problem analytically, let’s
investigate it via simulation.
5.1.2. Big Example
In general, we do not know the distribution of the variable in the population. In order
to illustrate what can happen in simple random sampling, we will do some simulation
in a situation where we actually have the entire population. The dataset we will use
is a dataset that contains information on every baseball game played in Major League
Baseball during the 2003 season. This population consists of 2430 games. The dataset
is available at http://www.calvin.edu/~stob/data/bballgames03.csv. For our
variable of interest, we will consider the number of runs scored by the visitors in each
game. In the population, the distribution of this variable is unimodal and positively
skewed, as illustrated in Figure 5.1.
Figure 5.1.: Runs scored by visitors in 2003 baseball games.
Some numerical characteristics of this population are as follows.
> games=read.csv('http://www.calvin.edu/~stob/data/baseballgames-2003.csv')
> vs=games$visscore
> summary(vs)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  0.000   2.000   4.000   4.656   7.000  19.000
Suppose that we take samples of size 2 from this population. It is in fact possible
to generate all possible samples of size 2 and compute the mean of each such sample
using the function combn().
> vs2mean=combn(vs,2,mean)   # applies mean to all combinations of 2 elements of vs
> summary(vs2mean)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  0.000   3.000   4.500   4.656   6.000  18.500
Note that the mean of the distribution of sample means of size 2 is the same as the
mean of the population. This should be expected. The histogram of the sample
means is in Figure 5.2.
Figure 5.2.: All means of samples of size 2.
We note the following two features of the distribution of sample means of samples
of size 2: its spread is less than the spread of the population variable and its shape,
while still positively skewed, is less so.
It is not realistic to generate the actual sampling distribution of X̄ for samples
larger than size 2. For example, there are about 10¹⁴ samples of size 5. However, simulation
allows us to get a fairly good idea of what the distribution of X looks like for larger
sample sizes. Consider first samples of size 5.
> vs5mean=replicate(10000, mean(sample(vs,5,replace=F)))
> summary(vs5mean)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  0.600   3.600   4.600   4.669   5.600  10.800
Comparing Figure 5.3 to Figure 5.2, we see that the distribution of the sample mean
in samples of size 5 appears to have less spread and to be more symmetric than that
of the distribution of sample means in samples of size 2.
Now let’s consider samples of size 30. Again, simulating this situation by choosing
10,000 such samples, we have the following results.
> vs30mean=replicate(10000,mean(sample(vs,30,replace=F)))
> summary(vs30mean)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  2.433   4.267   4.633   4.658   5.033   7.300
Figure 5.3.: Means of 10,000 samples of size 5.
Figure 5.4.: Means of 10,000 samples of size 30.
With samples of size 30, we note that the spread of the distribution has decreased
dramatically. For example, the IQR is 0.76 (as compared to 2.0 for samples of size
5). This says that if we use the sample mean of a sample of size 30 to estimate the
population mean (of 4.656), over 50% of the time we will be within 0.4 of the true
value. Notice too from Figure 5.4 that the distribution of X̄30 appears to be unimodal
and quite symmetric.
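The IQRs quoted above can be computed directly from the simulated sampling distributions (a small sketch, assuming the vectors vs2mean, vs5mean and vs30mean produced above are still in the workspace):
> IQR(vs2mean)    # about 3.0 for samples of size 2 (from the summary above)
> IQR(vs5mean)    # about 2.0 for samples of size 5
> IQR(vs30mean)   # about 0.76 for samples of size 30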
5.1.3. The Standard Framework
We are now conceiving of the simple random sample x1 , . . . , xn from a population as
the result of n random variables X1 , . . . , Xn . What can we say about the distributions of these random variables? The first property is the Identically Distributed
Property.
Identically Distributed Property
In simple random sampling, the random variables X1 , . . . , Xn all have the same
distribution. In fact, the distribution of Xi is the same as the distribution of the
variable x in the population.
It is easy to see that this property is true in the case of simple random sampling.
Each xi is equally likely to be any one of the individuals in the population. Therefore the
distribution of possible values of Xi is exactly the same as the distribution of actual
values of x in the population. For example, if the values of x are normally distributed
in the population, then Xi will have that same normal distribution.
One important fact to note however is that the random variables Xi are not independent of one another. In simple random sampling (which among other properties
is sampling without replacement) the outcome of X2 is dependent on that of X1 . This
will usually be an annoyance to us in trying to analyze the distribution of certain
statistics — independent random variables are easier to deal with. Therefore we will
simplify and often assume that the Xi are independent. In fact, if we sample with
replacement, this will be exactly true. And if the population is large, this will be
“almost” true — sampling without replacement behaves almost like sampling with
replacement. One general rule of thumb is that if the sample is of size less than 10% of
the population, then it does not do much harm to treat sampling without replacement
in the same way as sampling with replacement. Therefore we will usually assume that
our sample random variables are independent.
The i.i.d. assumption.
Random variables X1 , . . . , Xn are called i.i.d. if they are independent and
identically distributed. We will usually assume that the random variables
X1 , . . . , Xn that arise from a simple random sample are i.i.d. (For this reason,
we will call i.i.d. random variables X1 , . . . , Xn a random sample from X.)
Given i.i.d. random variables X1 , . . . , Xn , we will refer to their (common) distribution as the population distribution. With all this background, we expand the meaning
of our four important concepts.
Population     any random variable X
Parameter      a numerical property of X (e.g., µX)
Sample         i.i.d. random variables X1 , . . . , Xn with the same distribution as X
Statistic      any function T = f (X1 , . . . , Xn ) of the sample
While we have motivated this terminology by the very important problem of sampling from a finite population, it is also useful for describing other situations. Suppose
that we have a random variable X which (since it is a random process) is repeatable
under essentially identical conditions. Suppose that the process is repeated n times.
Then the results of those n trials X1 , . . . , Xn are i.i.d. random variables and so fit
the framework above.
5.2. The Sampling Distribution of the Mean
In this section we consider the problem of determining the sampling distribution
of the mean. Namely we assume that X1 , . . . , Xn are i.i.d. random variables with
population random variable X and we want to explore the relationship between the
distribution of X̄ and that of X. The fundamental tool in studying this problem is
the following theorem.
Theorem 5.2.1. Suppose that Y and Z are random variables. Then
1. If c is a constant, E(cY ) = c E(Y ) and Var(cY ) = c² Var(Y ),
2. E(Y + Z) = E(Y ) + E(Z), and
3. if Y and Z are independent, then Var(Y + Z) = Var(Y ) + Var(Z).
We will not prove this theorem. Part (1) is easy to prove (it’s a simple fact about
integrals or sums). Part (2) certainly fits our intuition. Part (3) is not obvious. While
there certainly should be some relationship between the variance of Y + Z and those
of Y and Z, the fact that variances are additive seems almost accidental. Notice
that this rule looks like a “Pythagorean Theorem” as it involves squares on both
sides. From this Theorem, we now have one of the most important tools of inferential
statistics.
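As an informal check of part (3), here is a small simulation sketch; the particular distributions used here are chosen only for illustration:
> y = rnorm(100000, 0, 2)    # Var(Y) = 4
> z = runif(100000, 0, 1)    # Var(Z) = 1/12, generated independently of y
> var(y) + var(z)            # close to 4 + 1/12
> var(y + z)                 # should be nearly the same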
Theorem 5.2.2 (The distribution of the sample mean). Suppose that X1 , . . . , Xn
are i.i.d. random variables with population random variable X. Then
1. E(X̄) = E(X) , and
2. Var(X̄) = Var(X)/n .
Proof. By Theorem 5.2.1, we have that
E(X1 + · · · + Xn) = Σ E(Xi) = n E(X) .
Then
E(X̄) = E( (X1 + · · · + Xn)/n ) = (1/n) E(X1 + · · · + Xn) = (1/n) · n E(X) = E(X) .
Similarly,
Var(X̄) = Var( (X1 + · · · + Xn)/n ) = (1/n²) Var(X1 + · · · + Xn) = (1/n²)(n Var X) = Var(X)/n .
Example 5.2.3.
We know that a random variable X such that X ∼ Unif(0, 1) has mean 1/2
and variance 1/12. Suppose that we have a random sample X1 , . . . , X10 with
population random variable X. Then X̄10 has mean 1/2 and variance 1/120.
This is not inconsistent with the simulation below.
> means=replicate(10000,mean(runif(10,0,1)))
> mean(means)
[1] 0.4991267
> var(means)
[1] 0.008315763
> 1/120
[1] 0.008333333
Theorem 5.2.2 gives us two crucial pieces of information concerning the distribution
of X. However, it does not tell us the shape of the distribution. In the example of
Section 5.1.2, we noted that as the size of the sample increased, the empirical distribution of X approached a more symmetrical distribution. This was not a property
peculiar to that example. The next theorem is so important, we might call it the
Fundamental Theorem of Statistics.
Theorem 5.2.4 (The Central Limit Theorem). Suppose that X is a random variable
with mean µ and variance σ 2 . For every n, let X n denote the sample mean of i.i.d.
random variables X1 , . . . , Xn which have the same distribution as X. Then as n gets
large, the shape of the distribution of X n approaches that of a normal distribution.
In particular, for every a, b,
lim n→∞ P( a ≤ (X̄n − µ)/(σ/√n) ≤ b ) = P(a ≤ Z ≤ b)
where Z is a standard normal random variable.
The Central Limit Theorem (CLT) is a limit theorem. As such, it only provides an
approximation. In using it, we will always be faced with the question of how large n needs
to be so that the approximation is “close enough” for our purposes. Nevertheless, it
will be a crucial tool in making inferences about µ.
Example 5.2.5.
Continuing Example 5.2.3, suppose again that X1 , . . . , X10 is a random sample
from a population X ∼ Unif(0, 1). By the Central Limit Theorem, we have that
X is approximately normal with mean 1/2 and variance 1/120. Therefore we
have the approximate probability statement
P( 1/2 − √(1/120) ≤ X̄ ≤ 1/2 + √(1/120) ) ≈ .68 .
Again, we can compare this with the results of a simulation.
> means=replicate(10000,mean(runif(10,0,1)))
> sum( (1/2-sqrt(1/120))<means & means<(1/2+sqrt(1/120)) )
[1] 6783
> pnorm(1)-pnorm(-1)
[1] 0.6826895
We know even more in the special case that the population random variable X is
normally distributed.
Theorem 5.2.6. Suppose that X is normally distributed with mean µ and variance
σ 2 . Let X1 , . . . , Xn be i.i.d. random variables with population random variable X.
Then X n has a normal distribution with mean µ and variance σ 2 /n.
Example 5.2.7.
The distribution of heights of 20 year old females in the United States in 2005
was very close to being normal with mean 163.3 cm and standard deviation
6.5 cm. If a random sample of 20 such females had been chosen, what is the
probability that the mean of the sample was greater than 165 cm? Since the
distribution of the sample mean of a sample of size 20 has mean 163.3 and
standard deviation 6.5/√20 = 1.45, a sample mean of 165 has a z-score of
(165 − 163.3)/1.45 = 1.17. Since 1-pnorm(1.17)=.12, this probability is 12%.
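This computation can also be done directly in R:
> 1 - pnorm(165, mean=163.3, sd=6.5/sqrt(20))   # about 0.12
> 1 - pnorm((165 - 163.3)/(6.5/sqrt(20)))       # same computation via the z-score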
5.3. Estimating Parameters
The results of the last section taken together tell us that x̄ provides a good estimate
of µX . In this section, we look at the problem of parameter estimation in general and
identify properties to look for in good estimators.
Suppose that X is a random variable and that θ is a parameter associated with X.
Examples of such parameters include µX and σX². Let X1 , . . . , Xn be a random
sample with population random variable X. With that setting, we have the following
definition.
Definition 5.3.1 (estimator, estimate). An estimator of the parameter θ is any
statistic θ̂ = f (X1 , . . . , Xn ) used to estimate θ. The value of θ̂ for a particular
outcome of X1 , . . . , Xn is called the estimate of θ.
Using the notation of the definition, X̄ should be written µ̂ and SX² is σ̂X².
5.3.1. Bias
Consider the following simple situation. We have one observation x from a random
variable X ∼ Binom(n, π) and we wish to estimate π. An absolutely natural choice
is to use x/n. In other words, π̂ = X/n. One way of justifying this choice is that
E(X/n) = π so “on average” this estimator gets it right. Consider another estimator,
proposed by Laplace. He suggested using π̂L = (X + 1)/(n + 2). Notice that if π > 0.5, this
estimator tends to underestimate π a bit, on average shading its estimate towards
0.5. Likewise, if π < 0.5, the estimate tends to be a little larger than π. In other
words, Laplace’s estimate has a bias.
Definition 5.3.2 (unbiased, bias). An estimator θ̂ of θ is unbiased if E(θ̂) = θ. The
bias of an estimator θ̂ is E(θ̂) − θ.
It is important to note that θ is unknown and E(θ̂) depends on θ so that in general
we do not know the bias of an estimator. In the first example below, we look at examples where we can determine that an estimator is unbiased. In the second example,
we look more carefully at the bias of Laplace’s estimator. In the third example, we
look at another biased estimator via a simulation.
Example 5.3.3.
1. Since E(X n ) = µX for all random variables X no matter what the sample
size n, we have that X n is an unbiased estimator of µ.
2. It can be shown that E(S 2 ) = σ 2 . Thus S 2 is an unbiased estimator of σ 2 .
This is the real reason for using n − 1 in the definition of S 2 rather than
n. (It is important to note that it does not follow that S is an unbiased
estimator of σ. Indeed, this is not true.)
3. X/n is an unbiased estimator of π if X ∼ Binom(n, π).
Example 5.3.4.
Consider Laplace’s estimator π̂L = (X + 1)/(n + 2). We have
E(π̂L) = E( (X + 1)/(n + 2) ) = (1/(n + 2)) E(X + 1) = (n/(n + 2)) π + 1/(n + 2) .
Thus the bias of π̂L is
E(π̂L) − π = (n/(n + 2)) π + 1/(n + 2) − π = 1/(n + 2) − (2/(n + 2)) π = (1 − 2π)/(n + 2) .
If π = .5 then this estimator is unbiased but the bias is negative if π > 0.5 and
positive if π < 0.5.
Example 5.3.5.
Suppose that we have a random sample from a population X ∼ Exp(λ). Since
µX = 1/λ, we have that E(X) = 1/λ. Therefore a reasonable choice for an
estimator of λ is λ̂ = 1/X. Notice that this estimator is not necessarily unbiased.
We investigate with a simulation. We first consider random samples of size 5
and then random samples of size 20. We use λ = 10 in our simulation.
> hatlambda5 = replicate(10000,1/mean(rexp(5,10)))
> mean(hatlambda5)
[1] 12.47850
> hatlambda20 = replicate(10000,1/mean(rexp(20,10)))
> mean(hatlambda20)
[1] 10.51414
Note that in both cases, our estimator appears to be biased and produces an
overestimate on average.
The last example illustrates an important point. Even if θ̂ is an unbiased estimator
of θ, this does not mean that f (θ̂) is an unbiased estimator of f (θ).
5.3.2. Variance
An estimator is a random variable. In considering its bias, we are considering its
mean. But its variance is also important — an estimator with large variance is not
likely to produce an estimate close to the parameter it is trying to estimate.
Definition 5.3.6 (standard error). If θ̂ is an estimator for θ, the standard error of θ̂
is
σθ̂ = √Var(θ̂) .
If we can estimate σθ̂ , we write sθ̂ for the estimate of σθ̂ .
Example 5.3.7.
Regardless of the population random variable X, we know that Var(X̄) = σX²/n.
Thus σX̄ = σX/√n. To estimate this, it is natural to use
sX̄ = sX/√n .
Example 5.3.8.
If X ∼ Binom(n, π), we have that π̂ = X/n has variance Var(π̂) = π(1 − π)/n.
Thus
σπ̂ = √( π(1 − π)/n ) .
A good estimator for σπ̂ can be found by using π̂ to estimate π. Thus
sπ̂ = √( π̂(1 − π̂)/n ) .
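As a small numerical illustration of the last formula (the counts here are hypothetical, chosen only to show the computation):
> x = 40; n = 100            # hypothetical: 40 successes in 100 trials
> pihat = x/n
> sqrt(pihat*(1-pihat)/n)    # estimated standard error of pihat, about 0.049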
An unbiased estimator with small variance is obviously the kind of estimator that
we seek. We note that the sample mean is always an unbiased estimator of the
population mean and the variance of the sample mean goes to 0 as the sample size
gets large.
5.3.3. Mean Squared Error
Bias is bad and so is high variance. We put these two measures together into one in
this section.
Definition 5.3.9 (mean squared error). The mean squared error of an estimator θ̂
is
MSE(θ̂) = E[(θ̂ − θ)²] .
The mean squared error measures how far away θ̂ is from θ on average where the
measure of distance is our now familiar one of squaring.
Proposition 5.3.10. For any estimator θ̂ of θ
MSE(θ̂) = Var(θ̂) + Bias(θ̂)² .
The proof of Proposition 5.3.10 is a messy computation and we will omit it. We
illustrate the use of the MSE to compare the two estimators we have for the parameter
Figure 5.5.: MSE of two estimators of π, sample sizes n = 10 and n = 30.
π of the binomial distribution. Again, π̂ denotes the usual unbiased estimator and
π̂L = (X + 1)/(n + 2) denotes the Laplace estimator. We have

Estimator   Bias               Variance
π̂           0                  π(1 − π)/n
π̂L          (1 − 2π)/(n + 2)   π(1 − π)/(n + 4 + 4/n)
It is obvious that π̂L has a smaller variance than π̂ (and it is clear why this should be
so). It is not immediately obvious from the expressions above which has the smaller
MSE. In fact, this depends on both π and n. In Figure 5.5, we plot the MSE of
both estimators for samples of size 10 and size 30 respectively. Note that the Laplace
estimator has smaller MSE for intermediate values of π while the unbiased estimator
has smaller MSE for extreme values of π. As we might expect, there is a greater
difference between the two estimators for smaller samples than for large samples.
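A figure like Figure 5.5 can be reproduced from the table above; here is a sketch for samples of size 10, using the bias and variance expressions just given:
> n = 10
> p = seq(0, 1, 0.01)
> mse.unbiased = p*(1-p)/n                                    # variance only; the bias is 0
> mse.laplace = p*(1-p)/(n + 4 + 4/n) + ((1-2*p)/(n+2))^2     # variance plus squared bias
> plot(mse.unbiased ~ p, type="l", xlab="pi", ylab="MSE")
> lines(mse.laplace ~ p)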
5.4. Confidence Interval for Sample Mean
In this section, we introduce an important method for quantifying sampling error, the
confidence interval. First, we’ll look at a very special but important case.
5.4.1. Confidence Intervals for Normal Populations
Suppose that X1 , . . . , Xn is a random sample with population random variable X
with unknown mean µ and variance σ 2 . Suppose too that the population random
variable X has a normal distribution. Using Theorem 5.2.6 and one of our favorite
facts about the standard normal distribution, we have
P( −1.96 < (X̄ − µ)/(σ/√n) < 1.96 ) = .95 .
We now do some algebra to get
P( X̄ − 1.96 σ/√n < µ < X̄ + 1.96 σ/√n ) = .95 .
The interval
( X̄ − 1.96 σ/√n , X̄ + 1.96 σ/√n )
is a random interval. Now suppose that we know σ (an unlikely happenstance, we
admit). For any particular set of data x1 , . . . , xn the interval is simply a numerical
interval. The key fact is that we are fairly confident that this interval contains µ.
Definition 5.4.1 (confidence interval). Suppose that X1 , . . . , Xn is a random sample
from a normal distribution with known variance σ 2 . Suppose that x1 , . . . , xn is the
observed sample. The interval
( x̄ − 1.96 σ/√n , x̄ + 1.96 σ/√n )
is called a 95% confidence interval for µ.
Example 5.4.2.
A machine creates rods that are to have a diameter of 23 millimeters. It is
known that the distribution of the diameters of the parts is normal and that
the standard deviation of the actual diameters of parts created over time is
0.1 mm. A random sample of 40 parts is measured precisely to determine
if the machine is still producing rods of diameter 23 mm. The data and 95%
confidence interval are given by
> x
 [1] 22.958 23.179 23.049 22.863 23.098 23.011 22.958 23.186 23.015 22.995
[11] 23.166 22.883 22.926 23.051 23.146 23.080 22.957 23.054 23.019 23.059
[21] 23.040 23.057 22.985 22.827 23.172 23.039 23.029 22.889 23.089 22.894
[31] 22.837 23.045 22.957 23.212 23.092 22.886 23.018 23.031 23.073 23.117
> mean(x)
[1] 23.024
> c(mean(x)-(1.96)*.1/sqrt(40),mean(x)+(1.96)*.1/sqrt(40))
[1] 22.993 23.055
It appears that the process could still be producing rods of diameter 23 mm.
Of course the example illustrates a problem with using this notion of confidence
interval, namely we need to know the standard deviation of the population. It is
unlikely that we would be in a situation where the mean of the population is unknown
but the standard deviation is known. One approach to solving this problem is to use
an estimate for σ, namely sX , the sample standard deviation. If the sample size is
quite large, we hope that sX is close to σ so that our confidence interval statement
is approximately correct. In the case of a normal population random variable X
however, we know more.
5.4.2. The t Distribution
Definition 5.4.3 (t distribution). A random variable T has a t distribution (with
parameter ν ≥ 1, called the degrees of freedom of the distribution) if it has pdf
f (x) = (1/√(πν)) · ( Γ((ν + 1)/2)/Γ(ν/2) ) · 1/(1 + x²/ν)^((ν+1)/2) ,   −∞ < x < ∞
Here Γ is the gamma function from mathematics but all we need to know about
the constant out front is that it exists to make the integral of the density function
equal to 1. Some properties of the t distribution include
1. f is symmetric about x = 0 and unimodal. In fact f looks bell-shaped.
2. If ν > 1 then the mean of T is 0.
3. If ν > 2 then the variance of T is ν/(ν − 2).
4. For large ν, T is approximately standard normal.
In summary, the t distributions look very similar to the normal distribution except
that they have slightly more spread, especially for small values of ν. R knows the
t-distribution of course and the appropriate functions are dt(x,df), pt(), qt(), and rt().
The graphs of the normal distribution and two t-distributions are shown below.
> x=seq(-3,3,.01)
> y=dt(x,3)
> z=dt(x,10)
> w=dnorm(x,0,1)
> plot(w~x,type="l",ylab="density")
> lines(y~x)
> lines(z~x)
[Plot: densities of the standard normal distribution and the t distributions with 3 and 10 degrees of freedom.]
The important fact that relates the t distribution to the normal distribution is the
following theorem which is one of the most heavily used in statistics.
Theorem 5.4.4. If X1 , . . . , Xn is a random sample from a normal distribution with
mean µ and variance σ 2 , then the random variable
(X̄ − µ)/(S/√n)
has a t distribution with n − 1 degrees of freedom.
To generate confidence intervals using this theorem, first define tβ,ν to be the
unique number such that
P (T > tβ,ν ) = β
where T is a random variable that has a t distribution with ν degrees of freedom. We
have the following:
Confidence Interval for µ If x1 , . . . , xn are the observed values of a random
sample from a normal distribution with unknown mean µ and t∗ = tα/2,n−1 , the
interval
( x̄ − t∗ s/√n , x̄ + t∗ s/√n )
is a 100(1 − α)% confidence interval for µ.
Example 5.4.5.
It is plausible to think that the logs of populations of U.S. counties have a
normal distribution. (We’ll talk about how to test that claim at a later point.)
In the following example, we look at a sample of 10 such counties and produce
a 95% confidence interval for the mean of the log-population. To produce a
95% confidence interval, we need t.025 which is the 97.5% quantile of the t
distribution. Notice that the true mean of our population random variable is
10.22 so in this case the confidence interval does capture the mean.
> counties=read.csv('http://www.calvin.edu/~stob/data/uscounties.csv')
> logpop=log(counties$Population)
> smallsample=sample(logpop,10,replace=F)    # our sample of size 10
> tstar = qt(.975,9)                         # 9 degrees of freedom
> xbar= mean(smallsample)
> s= sd(smallsample)
> c( xbar-tstar* s/sqrt(10), xbar+tstar * s/sqrt(10))
[1] 10.14891 12.01605
5.4.3. Interpreting Confidence Intervals
It is important to be very careful in making statements about what a confidence
interval means.
In Example 5.4.5, we can say something like “we are 95% confident that the true
mean of the logs of population is in the interval (10.15, 12.02).” (This, at least, is
what many AP Statistics students are taught to say.) But beware:
This is not a probability statement! That is, we do not say that the
probability that the true mean is in the interval (10.15, 12.02) is 95%.
There is no probability after the experiment is done, only before.
The correct probability statement is one that we make before the experiment.
If we are to generate a 95% confidence interval for the mean of the population from a sample of size 10 from this population, then the probability
is 95% that the resulting confidence interval will contain the mean.
Another way of saying this using the relative frequency interpretation of probability
is
If we generate many 95% confidence intervals by this procedure, approximately 95% of them will contain the mean of the population.
After the experiment, a good way of saying what confidence means is this
Either the population mean is in (10.15, 12.02) or something very surprising happened.
5.4.4. Variants on Confidence Intervals and Using R
Nothing is sacred about 95%. We could generate 90% confidence intervals or confidence intervals of any other level. There might also be a reason for generating one-sided confidence intervals, which could be done by eliminating one of the two
tails of the t-distribution in our computation. R will actually do all the computations
for us. We illustrate.
Example 5.4.6.
The file http://www.calvin.edu/~stob/data/March9bball.csv contains the
results of all basketball games played in NCAA Division I on March 9, 2008. It
might be a reasonable assumption that the visitor’s scores in Division I games
have a normal distribution and that the games of March 9 approximate a random sample. Proceeding on that assumption, we write a variety of different
confidence intervals. Notice that the output of t.test() gives a variety of
information beyond simply the confidence interval.
> games=read.csv('http://www.calvin.edu/~stob/data/March9bball.csv')
> names(games)
[1] "Visitor" "Vscore"  "Home"    "Hscore"
> t.test(games$Vscore)
One Sample t-test
data: games$Vscore
t = 35.7926, df = 38, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
59.38840 66.50903
sample estimates:
mean of x
62.94872
> t.test(games$Vscore,conf.level=.9)    # 90% confidence interval
One Sample t-test
data: games$Vscore
t = 35.7926, df = 38, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
90 percent confidence interval:
59.98362 65.91382
sample estimates:
mean of x
62.94872
> t.test(games$Vscore,conf.level=.9,alternative='greater')    # 90% one-sided interval
One Sample t-test
data: games$Vscore
t = 35.7926, df = 38, p-value < 2.2e-16
alternative hypothesis: true mean is greater than 0
90 percent confidence interval:
60.65496      Inf
sample estimates:
mean of x
62.94872
5.5. Non-Normal Populations
In this section we consider the problem of generating confidence intervals for the mean
in the case that our population random variable does not have a normal distribution.
Of course it is not hard to find examples where this would be useful. Indeed, it is
really not often that we know our population is normal. Our advice in this section
amounts to the following: we can often use the same confidence intervals that we used
when the population is normal.
5.5.1. t Confidence Intervals are Robust
A statistical procedure is robust if it performs as advertised (at least approximately)
even if the underlying distributional assumptions are not satisfied. The important
fact about confidence intervals generated by the method of the last section is that
they are robust against violations of the normality assumption if the sample size is
not small and if the data does not have extreme outliers. To measure whether the t
procedure works, we have the following definition.
Definition 5.5.1 (coverage probability). Suppose that I is a random interval used
as a confidence interval for θ. The coverage probability of I is P(θ ∈ I). (In other
words, the coverage probability is the true confidence level of the confidence intervals
produced by I.)
We would like that 95% confidence intervals generated from the t distribution have
a 95% coverage probability even in the case that the normality assumption is not
satisfied. We first look at some examples.
Example 5.5.2.
We will use as our population the maximum wind velocity at the San Diego
airport on 6,209 consecutive days. The true mean of this population is 15.32.
We generate 10,000 samples of each of size 10, 30 and 50.
> w=read.csv('http://www.calvin.edu/~stob/data/wind.csv')
> m=mean(w$Wind)
# samples of size 10
> intervals= replicate(10000,t.test(sample(w$Wind,10,replace=F))$conf.int)
> sum(intervals[1,]<m & intervals[2,]>m)
[1] 9346
# samples of size 30
> intervals= replicate(10000, t.test(sample(w$Wind,30,replace=F))$conf.int)
> sum(intervals[1,]<m & intervals[2,]>m)
[1] 9427
# samples of size 50
> intervals= replicate(10000, t.test(sample(w$Wind,50,replace=F))$conf.int)
> sum(intervals[1,]<m & intervals[2,]>m)
[1] 9441
We find that we do not quite achieve our desired goal of 95% confidence intervals
though it appears for samples of size 50 we have approximately 94.4% confidence
intervals.
Example 5.5.3.
Suppose that X ∼ Exp(0.2) so that µX = 5. We generate 10,000 different
random samples of size 10 for this distribution and compute the 95% confidence
interval given by the t-distribution in each case. We note that we do not have
exceptional success - only 89.1% of the 95% confidence intervals contain the
mean.
> # samples of size 10 from an exponential distribution with mean 5
> # t.test()$conf.int recovers just the confidence interval
>
> intervals = replicate(10000, t.test(rexp(10,.2))$conf.int)
>
> # now count the intervals that capture the mean
>
> sum (intervals[1,]<5 & intervals[2,]>5)
[1] 8918
With random samples of size 30, we do better and with samples of size 50 better
yet. However in no case do we achieve the 95% coverage probability that we
desire. The exponential distribution is quite asymmetric.
# samples of size 30
> intervals = replicate(10000, t.test(rexp(30,.2))$conf.int)
> sum (intervals[1,]<5 & intervals[2,]>5)
[1] 9297
# samples of size 50
> intervals = replicate(10000, t.test(rexp(50,.2))$conf.int)
> sum (intervals[1,]<5 & intervals[2,]>5)
[1] 9348
In neither of the last two examples did we achieve our objective of 95% confidence
intervals containing the mean 95% of the time. The next example uses the Weibull
distribution with parameters that make it fairly symmetric.
Example 5.5.4.
The Weibull distribution with parameters α = 5 and β = 10 has mean 9.181687.
We generate samples of size 10, 30 and 50. Note that we have achieved almost
exactly 95% confidence intervals regardless of the sample size.
> m=9.181687    # mean of Weibull distribution with parameters 5, 10
> intervals = replicate(10000, t.test(rweibull(10,5,10))$conf.int)
> sum (intervals[1,]<m & intervals[2,]>m)
[1] 9502
> intervals = replicate(10000, t.test(rweibull(30,5,10))$conf.int)
> sum (intervals[1,]<m & intervals[2,]>m)
[1] 9499
> intervals = replicate(10000, t.test(rweibull(50,5,10))$conf.int)
> sum (intervals[1,]<m & intervals[2,]>m)
[1] 9496
5.5.2. Why are t Confidence Intervals Robust?
Let’s consider generating a 95% confidence interval from 30 data points x1 , . . . , x30 .
The t-confidence interval in this case is
( x̄ − 2.05 s/√30 , x̄ + 2.05 s/√30 ) .          (5.1)
The magic number 2.05 of course is just t.025,29 .
Let’s approach the problem of generating a confidence interval from a different
direction. Namely let’s use the Central Limit Theorem. The CLT says that the
random variable
(X̄ − µ)/(σ/√n)
has a distribution that is approximately standard normal (if we believe that n = 30
is large). We therefore have the following approximate probability statement:
P( −1.96 < (X̄ − µ)/(σ/√n) < 1.96 ) ≈ .95 .
This leads to the approximate 95% confidence interval
( x̄ − 1.96 σ/√30 , x̄ + 1.96 σ/√30 ) .          (5.2)
The problem with this interval (besides the fact that it is only approximate) is that
σ is not known. Now for a reasonably large sample size, we might expect that the
value s of the sample standard deviation is close to σ. If we replace σ in 5.2 by s, we
have the interval
( x̄ − 1.96 s/√30 , x̄ + 1.96 s/√30 ) .
Now we see that the only difference between this interval (which involves two approximations) and the interval of Equation 5.1 that results from the t-distribution is the
difference between the numbers 1.96 and 2.05. It is easy to give an argument for using
a larger number than 1.96 — using 2.05 helps compensate for the fact that we are
making several approximations in constructing the interval by expanding the width
of the interval slightly.
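A quick comparison of the two critical values in R:
> qnorm(0.975)     # normal critical value, about 1.96
> qt(0.975, 29)    # t critical value with 29 degrees of freedom, about 2.05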
Of course we should note that the t intervals do not perform equally well regardless of the population. The performance of this method depends on the shape of
the distribution (symmetric, unimodal is best) and the sample size (the larger the
better).
5.6. Confidence Interval for Proportion
To estimate the proportion of individuals in a population with a certain property, we
often choose a random sample and use as an estimate the proportion of individuals in
the sample with that property. This is the methodology of political polls, for example.
While this random process is best modeled by the hypergeometric distribution, we
normally use the binomial distribution instead if the size of the population is large
relative to the size of the sample.
So then, assume that we have a binomial random variable X ∼ Binom(n, π) where
as usual n is known but π is not. Then of course the obvious estimator for π is
π̂ = X/n and it is an unbiased estimator of π. Of course we would also like to write
a confidence interval for π so that we know the precision of our estimate. Because
X is discrete, there is no good way to write exact confidence intervals for π, but the
Central Limit Theorem allows us to write an approximate confidence interval that is
really quite good. The key is to understand the relationship between the binomial
distribution and the Central Limit Theorem.
Theorem 5.6.1. Suppose that X ∼ Binom(n, π). Then if n is large, the random
variable
( X/n − π ) / √( π(1 − π)/n )
has a distribution that is approximately standard normal.
Proof. Let the individual trials of the random process X be denoted X1 , . . . , Xn . This
sequence is i.i.d. In fact Xi ∼ Binom(1, π). Obviously µXi = π and σXi² = π(1 − π) for
each i, and X = Σ Xi . We apply the CLT to the sequence X1 , . . . , Xn . The random
variable X/n is the sample mean for this i.i.d. sequence and so has mean π and variance
π(1 − π)/n. The result follows.
The Theorem suggests how to find an (approximate) confidence interval. For a
fixed β, let zβ be the number such that P(Z > zβ ) = β where Z is the standard
normal random variable. Then we have the following approximate equality from the
CLT.
P( −zα/2 < (π̂ − π)/√( π(1 − π)/n ) < zα/2 ) ≈ 1 − α          (5.3)
Equation 5.3 is the starting point for several different approximate confidence intervals.
As we did for confidence intervals for µ, we should attempt to use Equation 5.3 to
isolate π in the “middle” of the inequalities. The first two steps are
P( −zα/2 √( π(1 − π)/n ) < π̂ − π < zα/2 √( π(1 − π)/n ) ) ≈ 1 − α,
and thus
P( π̂ − zα/2 √( π(1 − π)/n ) < π < π̂ + zα/2 √( π(1 − π)/n ) ) ≈ 1 − α .          (5.4)
The problem with 5.4 is that the unknown π appears not only in the middle of
the inequalities but also in the bounds. Thus we do not yet have a true confidence
interval since the endpoints are not statistics that we can compute from the data.
The Wald interval.
The Wald interval results from replacing π by π̂ in the endpoints of the interval of
5.4.
( π̂ − zα/2 √( π̂(1 − π̂)/n ) , π̂ + zα/2 √( π̂(1 − π̂)/n ) )
Until recently, this was the standard confidence interval suggested in most elementary statistics textbooks if the sample size is large enough. (In fact this interval
still receives credit on the AP Statistics Test.) Books varied as to what large enough
meant. A typical piece of advice is to only use this interval if nπ̂(1−π̂) ≥ 10. However,
you should never use this interval.
The coverage probability of the (approximately) 95% Wald confidence intervals is
almost always less than 95% and could be quite a bit less depending on π and the
sample size. For example, if π = .2, it takes a sample size of 118 to guarantee that
the coverage probability of the Wald confidence interval is at least 93%. For very
small probabilities, it takes thousands of observations to ensure that the coverage
probability of the Wald interval approaches 95%.
The Wilson Interval.
At least since 1927, a much better interval than the Wald interval has been known
although it wasn’t always appreciated how much better the Wilson interval is. The
Wilson interval is derived by solving the inequality in 5.3 so that π is isolated in
the middle. After some algebra and the quadratic formula, we get the following
(impressive looking) approximate confidence interval statement:
P( ( π̂ + zα/2²/(2n) − zα/2 √( π̂(1 − π̂)/n + zα/2²/(4n²) ) ) / ( 1 + zα/2²/n )  <  π
     <  ( π̂ + zα/2²/(2n) + zα/2 √( π̂(1 − π̂)/n + zα/2²/(4n²) ) ) / ( 1 + zα/2²/n ) ) ≈ 1 − α
The Wilson interval performs much better than the Wald interval. If nπ̂(1−π̂) ≥ 10,
you can be reasonably certain that the coverage probability of the 95% Wilson interval
is at least 93%. The Wilson interval is computed by R in the function prop.test().
The option correct=F needs to be used however. (The option correct=T makes a
“continuity” correction that comes from the fact that binomial data is discrete. It is
not recommended for the Wilson interval, however.)
Example 5.6.2.
In a poll taken in Mississippi on March 7, 2008, of 354 voters who were decided
between Obama and Clinton, 190 said that they would vote for Obama in the
Mississippi primary. We can estimate the proportion of voters in the population that will vote for Obama (of those who were decided on one of these two
candidates) using the Wilson method.
> prop.test(190,354,correct=F)
1-sample proportions test without continuity correction
data:  190 out of 354, null probability 0.5
X-squared = 1.9096, df = 1, p-value = 0.167
alternative hypothesis: true p is not equal to 0.5
95 percent confidence interval:
0.4846622 0.5879957
sample estimates:
p
0.5367232
We see that π̂ = .537 and that a 95% confidence interval for π is (.485, .588).
This is often reported by the media as 53.7% ± 5.1% with no mention of the
fact that a 95% confidence interval is being used. (Note that the center of the
interval is not π̂ but in this case does agree with π̂ to two decimal places.)
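A quick check of that parenthetical remark, using the interval endpoints reported above:
> (0.4846622 + 0.5879957)/2    # center of the Wilson interval, about 0.5363
> 190/354                      # pi-hat, about 0.5367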
Notice that the center of the Wilson interval is not π̂. It is
( π̂ + zα/2²/(2n) ) / ( 1 + zα/2²/n )  =  ( x + zα/2²/2 ) / ( n + zα/2² ) .
A way to think about this is that the center of the interval comes from adding zα/2²
trials and zα/2²/2 successes to the observed data. (For a 95% confidence interval, this
is very close to adding two successes and four trials.) This is the basis for the next
interval.
The Agresti-Coull Interval.
Agresti and Coull (1998) suggest combining the biased estimator of π that is used
in the Wilson interval together with the simpler estimate for the standard error that
comes from the Wald interval. In particular, if we are looking for a 100(1 − α)%
confidence interval and x is the number of successes observed in n trials, define
x̃ = x + zα/2²/2 ,    ñ = n + zα/2² ,    π̃ = x̃/ñ .
Then the Agresti-Coull interval is
( π̃ − zα/2 √( π̃(1 − π̃)/ñ ) , π̃ + zα/2 √( π̃(1 − π̃)/ñ ) )
In practice, this estimator is even better than the Wilson estimator and is now
widely recommended, even in basic statistics textbooks. For the particular example
of x = 7 and n = 10, the Wilson and Agresti-Coull intervals are compared below.
Note that the Agresti-Coull interval is somewhat wider than the Wilson interval. Of
course wider intervals are more likely to capture the true value of π.
# The Wilson interval
> prop.test(7,10,correct=F)
1-sample proportions test without continuity correction
data: 7 out of 10, null probability 0.5
X-squared = 1.6, df = 1, p-value = 0.2059
alternative hypothesis: true p is not equal to 0.5
95 percent confidence interval:
0.3967781 0.8922087
sample estimates:
p
0.7
# The Agresti-Coull Interval
> xtilde=9
> ntilde=14
> z=qnorm(.975)
> pitilde=xtilde/ntilde
> se= sqrt ( pitilde * (1-pitilde)/ntilde )
> c( pitilde - z* se, pitilde + z * se)
[1] 0.3918637 0.8938505
In elementary statistics books, the Agresti-Coull interval is often presented as the
“Plus 4” interval and the instructions for computing it are simply to add four trials
and two successes and then to compute the Wald interval.
5.7. The Bootstrap
Throughout this chapter we have been developing methods for making inferences
about the unknown value of the parameter θ associated with a “population” random
variable. In general, to estimate θ we need good answers to two questions
1. What estimator θ̂ of θ should we use?
2. How accurate is the estimator θ̂?
For the case that θ is the population mean, we have a rich theory that answers
these questions. We answered the questions by knowing two things:
1. the distribution of the population random variable (e.g., normal, binomial), and
2. how the sampling distribution of the estimator depends on the distribution
of the population.
If we know the distribution of the population random variable but not how the
sampling distribution of the estimator depends on it, we can often do simulation to
get an idea of the sampling distribution of the estimator. Indeed, that is what we
did in Section 5.1. In this section we look at the bootstrap, a “computer-intensive”
technique for addressing these questions if we know neither of these two facts.
We will illustrate the bootstrap with the following example dataset. (This dataset is
found in the package boot which you would probably need to load from the internet.)
The data are the times to failure for the air-conditioning unit of a certain Boeing 720
aircraft.
> library(boot)
> aircondit$hours
 [1]   3   5   7  18  43  85  91  98 100 130 230 487
> mean(aircondit$hours)
[1] 108.0833
Suppose that we want to estimate the MTF — mean time to failure of such air-conditioning units. Our estimate is 108 hours, but we would like an estimate of
the precision of this estimate, e.g., a confidence interval. While the simple advice
of Section 5.5 is to use the t-distribution, this is not really a good strategy as the
dataset is quite small and the distribution of the data is quite skewed. Furthermore,
the small size of the dataset does not suggest to us a particular distribution for the
population (although engineers might naturally turn to some Weibull distribution).
The idea of the bootstrap is to generate lots of different samples from the population
(as we did in Section 5.1). However, without any assumptions about the shape of the
distribution of the population, the bootstrap uses the data itself to approximate that
shape. In this case, we have that 1/12 of our sample has the value 3, 1/12 of the
sample has the value 5, etc. Therefore, we will model the population by assuming
that 1/12 of the population has the value 3, 1/12 of the population has the value 5,
etc! Now to take a random sample of size 12 from such a population, we need only
take a sample of size 12 from our original data with replacement. The idea of the
bootstrap is to take many such samples and compute the value of the estimator for
each sample and thereby get an approximation to the sampling distribution of the
estimator.
Here are the steps to computing a bootstrap confidence interval for the mean of our
air-conditioning failure time population. The following R command chooses 1,000 different random samples of size 12 from our original random sample, with replacement,
and computes the mean of each sample.
> means = replicate (1000, mean(sample(aircondit$hours,12,replace=T)))
These 1,000 means are our approximation of what would happen if we took 1,000
samples from the population of air-conditioning failure times. A histogram of these
1,000 means is in Figure 5.6. We now convert these 1,000 means to a confidence
Figure 5.6.: 1,000 sample means of bootstrapped samples of air-conditioning failure times.
interval by using the quantile() function.
> quantile(means,c(0.025,0.975))
     2.5%     97.5%
 45.16042 190.33750
It is reasonable to announce that the 95% confidence interval for µ is (45.16, 190.34)
hours.
The bootstrap method that we illustrated above (called the bootstrap percentile
confidence interval), is quite general. There was nothing special about the fact that
we were constructing a confidence interval for the mean. Indeed, we could use the
very same method to construct a confidence interval for any parameter, as long as
we have a reasonable estimator for the parameter. (For parameters other than the
mean, there are more sophisticated bootstrap methods that account for the fact that
many estimators are biased.) We illustrate with one more example.
Example 5.7.1.
The dataset city in the boot package consists of a random sample of 10 of
the 196 largest cities of 1930. The variables are u which is the population (in
1,000s) in 1920 and x which is the population in 1930. The population is the
196 cities and we would like to know the value of θ = Σx/Σu, the ratio of
increase of population in these cities from 1920 to 1930. The obvious estimator
is θ̂ = Σx/Σu for the sample. We construct our bootstrap confidence interval
for θ.
> library(boot)
> city
     u   x
1  138 143
2   93 104
3   61  69
4  179 260
5   48  75
6   37  63
7   29  50
8   23  48
9   30 111
10   2  50
> thetahat=sum(city$x)/sum(city$u)
> thetahat                              # estimate from sample
[1] 1.520312
> thetahats = replicate ( 1000, { i=sample((1:10),10,replace=T) ;
+     us=city[i,]$u ; xs=city[i,]$x ;
+     sum(xs)/sum(us) } )
> quantile(thetahats, c(0.025,0.975))   # bootstrap confidence interval
     2.5%     97.5%
 1.250343 2.127813
Notice that the confidence interval is very wide. This is only to be expected from
such a small sample.
5.8. Testing Hypotheses About the Mean
In this section, we review the logic of hypothesis testing in the context of testing
hypotheses about the mean. While the language of hypothesis testing is still quite
common in the literature, it is fair to say that confidence intervals are a superior way
to quantify inferences about the mean. The language of hypothesis testing is perhaps
most useful when one needs to make a decision about the parameter in question. We
first look at an example of a situation in which a decision rule is necessary.
Example 5.8.1.
Kellogg’s makes Raisin Bran and fills boxes that are labelled 11 oz. NIST
mandates testing protocols to ensure that this claim is accurate. Suppose that
a shipment of 250 boxes, called the inspection lot, is to be tested. The mandated
procedure is to take a random sample of 12 boxes from this shipment. If any
box is more than 1/2 ounce underweight, then the lot is declared defective.
Else, the sample mean x and the sample standard deviation s are computed.
The shipment is rejected if (x − 11)/s ≤ −0.635.
We can view Example 5.8.1 as implementing a hypothesis test. Recall the technology. There are four steps as described in Section 4.3.
1. Identify the hypotheses.
2. Collect data and compute a test statistic.
3. Compute a p-value.
4. Draw a conclusion.
We go through these four steps in the case that our hypotheses are about the
population mean µ, using the Kellogg’s example as an illustration. We will suppose
that X1 , . . . , Xn is a random sample from a normal distribution with unknown mean
µ and that we wish to make inferences about µ.
Identify the Hypotheses
We start with a null hypothesis, H0 , the default or “status quo” hypothesis. We
want to use the data to determine whether there is substantial evidence against it.
The alternate hypothesis, Ha , is the hypothesis that we are wanting to put forward
as true if we have sufficient evidence in its favor. So in the Raisin Bran example, our
pair of hypotheses are
H0 :  µ = 11
Ha :  µ < 11 .
In general, our pair of hypotheses for a test of means is one of the following three:

H0 : µ = µ0        H0 : µ = µ0        H0 : µ = µ0
Ha : µ < µ0        Ha : µ > µ0        Ha : µ ≠ µ0
where µ0 is some fixed number.
Collect data and compute a test statistic
We will use the following test statistic:
T = (X̄ − µ0)/(S/√n) .
The important fact about this statistic is that if H0 is true then the distribution
of T is known. (It is a t distribution with n − 1 degrees of freedom.) This is the
key property that we need whenever we do a hypothesis test: we must have a test
statistic whose distribution we know if H0 is true.
Compute a p-value
Recall that the p-value of the test statistic t is the probability that we would see a
value at least as extreme as t (in the direction of the alternate hypothesis) if the null
hypothesis were true. The R function t.test() computes the p-value if the argument
alternative is appropriately set. Let’s look at some possible Raisin Bran data.
> raisinbran
[1] 11.01 10.91 10.94 11.01 10.97 11.01 10.95 10.93 10.92 10.83 11.02 10.84
> t.test(raisinbran,alternative="less",mu=11)
One Sample t-test
data: raisinbran
t = -2.9689, df = 11, p-value = 0.006385
alternative hypothesis: true mean is less than 11
95 percent confidence interval:
-Inf 10.97827
sample estimates:
mean of x
10.945
In this example, the p-value is 0.006. This means that if the null hypothesis (of
µ = 11) were true, we would expect to get a value of the test statistic at least as
extreme as the value we computed from the data (-2.9689) 0.6% of the time. This
would be an extremely rare occurrence so this is strong evidence against the null
hypothesis.
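The test statistic and p-value can also be recomputed by hand from the raisinbran data, as a check on the t.test() output:
> tstat = (mean(raisinbran) - 11)/(sd(raisinbran)/sqrt(12))
> tstat               # matches the value -2.9689 reported above
> pt(tstat, df=11)    # lower-tail probability; matches the p-value 0.006385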
Draw a conclusion
It is often enough to present the result of a hypothesis test by stating the p-value.
What to do with that evidence is not really a statistical problem. It is sometimes
necessary to go further however and to announce a decision. That is the case in the
Raisin Bran example where it is necessary to decide whether to reject the shipment
as being underweight.
In this case, we set up the hypothesis test in terms of a decision rule. The possible
decisions are either to reject the null hypothesis (and accept the alternate hypothesis) or not to reject the null hypothesis. The decision rule is expressed in terms of
the test statistic. In order to determine what the decision rule should be, we need to
examine the errors in making an incorrect decision.
Recall the kinds of errors that we might make:
1. A Type I error is the error of rejecting H0 even though it is true. The probability
of a type I error is denoted by α.
2. A Type II error is the error of not rejecting H0 even though it is false. The
probability of a Type II error is denoted by β.
To construct a decision rule, we choose α, the probability of a Type I error. This
number α is often called the significance level of the test.
In this case, testing H0 : µ = µ0 versus Ha : µ < µ0 , our decision rule should be:
Reject H0 if and only if t < −tα,n−1 .
It is obvious that, if H0 is true, this decision rule will reject it with probability α.
While the R example above does not explicitly make a decision, the p-value of the
test statistic gives us enough information to determine what the decision should be.
Namely if the p-value is less than α, we must reject the null hypothesis. Otherwise
we do not. In the Kellogg’s example above, we obviously reject the null hypothesis
for any reasonable value of α.
We can now understand the test that NIST prescribes in Example 5.8.1. The NIST
manual says that “this method gives acceptable lots a 97.5% chance of passing.” In
other words, NIST is prescribing that α = 0.025. For such an α, our test should be
to reject H0 if
(x̄ − 11)/(s/√12) < −t.025,11
or if
(x̄ − 11)/s < −t.025,11/√12 = −0.635
which is exactly the requirement of the NIST test. Of course this NIST method
implicitly is relying on the assumption that the distribution of the lot is normal. We
really should be cautious about using the t-distribution for a non-normal population
with a sample size of 12 although the t-test is robust.
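The constant 0.635 in the NIST rule can be checked directly in R:
> qt(.975,11)            # t.025,11, about 2.20
> qt(.975,11)/sqrt(12)   # about 0.635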
Type II Errors
The four step procedure above focuses on α, the probability of a Type I error. Usually,
the consequences of a Type I error are much more severe than those of making a
Type II error and it is for this reason that we set α to be a small number. But if
our procedures were only about minimizing Type I errors, we would never reject H0
since this would make the probability of a Type I error 0!
Of course the probability of a Type II error depends on the distribution of
T = (X̄ − 11)/(S/√12)
if µ ≠ 11. This distribution depends on the true mean µ, the standard deviation
σ (neither of which we know), and the sample size. R will compute this probability
for us if we specify these values. The probability of a type II error is denoted by β
and the number 1 − β is called the power of the hypothesis test. (Higher powers
are better.) The R function power.t.test computes the power given the following
arguments:
delta          the deviation of the true mean from the null hypothesis mean
sd             the true standard deviation
n              the sample size
sig.level      α
type           this t-test is called a one.sample test
alternative    we tested a one.sided alternative
In the Raisin Bran example, if the true value of the mean is 10.9 and the standard
deviation is 0.1, then the power of the test is 88.3%. In other words, we will reject
a shipment that on average is one standard deviation underweight 88.3% of the time
using this test.
> power.t.test(delta=.1,sd=.1,n=12,sig.level=.025,type='one.sample',
+              alternative='one.sided')

     One-sample t test power calculation

              n = 12
          delta = 0.1
             sd = 0.1
      sig.level = 0.025
          power = 0.8828915
    alternative = one.sided
Obviously, the test that we use should have greater power if the true mean is further
from 11.
> diff=seq(0,.1,.01)
> power.t.test(delta=diff,sd=.1,n=12,sig.level=.025,type='one.sample',
+              alternative='one.sided')

     One-sample t test power calculation

              n = 12
          delta = 0.00, 0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.10
             sd = 0.1
      sig.level = 0.025
          power = 0.02500000, 0.05024502, 0.09249152, 0.15643493, 0.24401839,
                  0.35263574, 0.47466264, 0.59891866, 0.71365697, 0.80978484,
                  0.88289152
    alternative = one.sided
Many users of hypothesis testing technology do not think very carefully about Type
II errors before doing the experiment and so often construct tests that are not very
powerful. For example, if we think that it is important to reject shipments that
average more than half a standard deviation underweight, we find that the sample
size of 12 given above has power only 35%. We really should increase the sample size
in this case (and we know this even before we collect data).
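If power.t.test() is called with power specified and n omitted, it solves for the sample size instead. As a sketch (the 80% power target here is chosen only for illustration), the following computes how many boxes we would need in order to detect a true mean of 10.95 with 80% power:

power.t.test(delta = .05, sd = .1, power = .80, sig.level = .025,
             type = 'one.sample', alternative = 'one.sided')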
5.9. Exercises
5.1 In this problem and the next, we investigate the use of the sample mean to
estimate the mean population of a U.S. county. We use the dataset at http://www.
calvin.edu/~stob/data/counties.csv.
a) What is the average population of a U.S. county? (Answer: 89596.)
b) Generate 10,000 samples of size 5 and compute the mean of the population of
each sample. In how many of these 10,000 samples was the sample mean greater
than the population mean? Why so many?
c) Repeat part (b) but this time use samples of size 30. Compare the result to
that of part (b).
d) For the 10,000 samples of size 30 in part (c), what is the IQR of the sample
means?
e) Explain why using the sample mean for a sample of size 30 is likely to give a
fairly poor estimate of the average population of a county.
5.2 Recall from Chapter 1 that reexpressing the population of counties by taking
logarithms produced a symmetric unimodal distribution. (See Figure 1.3.) Let’s now
repeat the work of the last problem using this transformed data.
a) What is the mean of the log of population for all counties? (Answer: 10.22)
b) Generate 10,000 samples of size 5 and compute the mean of the log-population
for each of the samples. In how many of these samples was the sample mean
greater than the population mean?
c) Repeat part (b) but this time use samples of size 30.
d) For the 10,000 samples in part (c), what is the IQR of the sample means?
e) How useful is a sample of size 30 for estimating the mean log-population?
5.3 Suppose that X ∼ Binom(n, π) and that Y ∼ Binom(m, π). Also suppose that X
and Y are independent.
a) Give a convincing reason why Z = X + Y should have a binomial distribution
(with parameters n + m and π).
b) Show that the mean and variance of Z as computed by Theorem 5.2.1 from those
of X and Y are the same as computed directly from the fact that Z is binomial
with parameters n + m and π.
5.4 In this problem, you are to investigate the accuracy of the approximation of the
Central Limit Theorem for the exponential distribution. Suppose that X ∼ Exp(0.1)
and that a random sample of size 20 is chosen from this population.
a) What are the mean and variance of X?
b) What are the mean and variance of X̄?
c) Using the Central Limit Theorem approximation, compute the probability that
X̄ is within 1, 2, 3, 4, and 5 of µX .
d) Now choose 1,000 random samples of size 20 from this distribution. Count the
number of samples in which x̄ is within 1, 2, 3, 4, and 5 of µX and compare to
part (c). Comment.
5.5 Scores on the SAT test were redefined (recentered) in 1990 and were set to have
mean of 500 and standard deviation of 110 on each of the Mathematics and Verbal
Tests. The scores were constructed so that the population had a normal distribution
(or at least very close to normal). In a random sample of size 100 from this population,
a) What is the probability that the sample mean will be between 490 and 510?
b) What is the probability that the sample mean will exceed 500? 510? 520?
5.6 Continuing Problem 5.5, the total SAT score for each student is formed by adding
their verbal score V and their math score M .
a) If the two scores for an individual are independent of each other, what are the
mean and standard deviation of V + M ?
b) It is not likely that the verbal and mathematics scores of individuals in the
population behave like independent random variables. Do you expect that the
standard deviation of V + M is more or less than you computed in part (a)?
Why?
5.7 Which is wider, a 90% confidence interval or a 95% confidence interval generated
from the same random sample from a normal population?
5.8 Suppose that the standard deviation σ of a normal population is known. How
large a random sample must be chosen so that a 95% confidence interval will be of
the form x̄ ± .1σ?
5.9 The dataset found at http://www.calvin.edu/~stob/data/normaltemp.csv
contains the body temperature and heart rate of 130 adults. (“What’s Normal? –
Temperature, Gender, and Heart Rate,” Journal of Statistics Education, Shoemaker 1996.)
a) Assuming that the body temperatures of adults in the population are approximately normal and that the 130 adults sampled behave like a simple random
sample, write a 95% confidence interval for the mean body temperature of an
adult.
b) Comment on the result in (a).
c) Is there anything in the data that would lead you to believe that the normality
assumption is incorrect?
5.10 The R dataset morley contains the speed of light measurements for 100 different
experimental runs. The vector Speed contains the measurements (in some obscure
units).
a) If we think of these 100 measurements as repeated independent trials of a random variable X, what is a good description of the population of which these
measurements are a sample?
b) Write a 95% confidence interval for the mean of this population.
c) What is the value tβ,n−1 for the confidence interval generated in the previous
part?
d) Is there anything in the histogram of the data values that suggests that the t
procedure might not be a good one for generating a confidence interval in this
case?
5.11 Write 95% confidence intervals for the mean of the sepal length of each of the
three species of irises in the R dataset iris. Would you say that these confidence
intervals give strong evidence that the means of the sepal lengths of these species are
different?
5.12 Suppose that 4 circuit boards out of 100 tested are defective. Generate 95%
confidence intervals for the proportion of the population of boards that is defective.
Give each of the Wald, Wilson and Agresti-Coull intervals.
5.13 The Chicago Cubs (a major league baseball team) won 11 games and lost 5
games in their season series against the St. Louis Cardinals last year. Write a 90%
confidence interval for the proportion of the games that the Cubs would win if they
played many games against the Cardinals. Comment on the assumptions you are
making about the process of playing baseball games.
5.14 In a taste test, 30 Calvin students prefer Andrea’s Pizza and 19 prefer Papa
John’s. If the sample of students could reasonably be considered a random sample of
Calvin students, write a 95% confidence interval for the proportion of students who
prefer Andrea’s Pizza.
5.15 It is common to use a sample size of 1,000 when doing a political poll. It is
also common to use the Wald interval to report the results of such polls. What is
the widest that a 95% confidence interval for a proportion could be with this sample
size?
5.16 Usually articles that are testing hypotheses do not report all the data but only
report x̄, s, and n. From these three numbers and the pair of hypotheses you should
be able to compute a p-value. Do that for the following situations:
a) H0 : µ = 11, Ha : µ < 11, x̄ = 10.49, s = .3, n = 12.
b) H0 : µ = 11, Ha : µ ≠ 11, x̄ = 10.75, s = .4, n = 12.
c) H0 : µ = 11, Ha : µ > 11, x̄ = 11.05, s = .2, n = 100.
d) H0 : µ = 11, Ha : µ < 11, x̄ = 10.50, s = .2, n = 2.
5.17 A study in an oncology journal once used a t-test to test a hypothesis about
the mean where there were just 4 data points (n = 4). Given the fact that the
experimenter predicted before the experiment that the effect size was about .5σ,
should the experimenter have done the experiment at all? Give an argument using
the concept of power.
6. Producing Data – Experiments
In many datasets we have more than one variable and we wish to describe and explain
the relationships between them. Often, we would like to establish a cause-and-effect
relationship.
6.1. Observational Studies
The American Music Conference is an organization that promotes music education at
all levels. On their website http://www.amc-music.com/research_briefs.htm they
promote music education as having all sorts of benefits. For example, they quote a
study performed at the University of Sarasota in which “middle school and high school
students who participated in instrumental music scored significantly higher than their
non-band peers in standardized tests”. Does this mean that if the availability of
and participation in instrumental programs in a school is increased, standardized
test scores would generally increase? The American Music Conference is at least
suggesting that this is true. They are attempting to “explain” the variation in test
scores by the variation in music participation. The problem with that conclusion is
that there might be other factors that cause the higher test scores of the band students.
For example, students who play in bands are more likely to come from schools with
more financial resources. They are also more likely to be in families that are actively
involved in their education. It might be that music participation and higher test scores
are a result of these variables. Such variables are often called lurking variables. A
lurking variable is any variable that is not measured or accounted for but that has a
significant effect on the relationship of the variables in the study.
The Sarasota study described above is an observational study. In such a study,
the researcher simply observes the values of the relevant variables on the individuals
studied. But as we saw above, an observational study can never definitively establish a
causal relationship between two variables. This problem typically bedevils the analysis
of data concerning health and medical treatment. The long process of establishing
the relationship between smoking and lung cancer is a classic example. In 1957,
the Joint Report of the Study Group on Smoking and Health concluded (in Science,
vol. 125, pages 1129–1133) that smoking is an important health hazard because it
causes an increased risk for lung cancer. However for many years after that the
tobacco industry denied this claim. One of their principal arguments is that the
data indicating this relationship came from observational studies. (Indeed, the data
in the Joint Report came from 16 independent observational studies.) For example,
the report documented that one out of every ten males who smoked at least two
packs a day died of lung cancer, but only one out of every 275 males who did not
smoke died of lung cancer. Data such as this falls short of establishing a cause-and-effect relationship, however, as there might be other variables that increase both one’s
disposition to smoke and susceptibility to lung cancer.
Observational studies are useful for identifying possible relationships and also simply for describing relationships that exist. But they can never establish that there
is a causal relationship between variables. Using observational studies in this way
is analogous to using convenience samples to make inferences about a population.
There are some observational studies that are better than others however. The music
study described above is a retrospective study. That is, the researchers identified
the subjects and then recorded information about past music behavior and grades.
A prospective study is one in which the researcher identifies the subjects and then
records variables over a period of time. A prospective study usually has a greater
chance of identifying relevant possible “lurking” variables so as to rule them out as
explanations for a possible relationship.
One of the most ambitious and scientifically important prospective observational
studies has been the Framingham Heart Study. In 1948, researchers identified a
sample of 5,209 adults in the town of Framingham, Massachusetts (a town about
25 miles west of Boston). The researchers tracked the lifestyle choices and medical
records of these individuals for the rest of their lives. In fact the study continues to this
day with the 1,110 individuals who are still living. The researchers have also added
to the study 5,100 children of original study participants. There is no question that
the Framingham Heart Study has led to a much greater understanding of what causes
heart disease although it is “only” an observational study. For example, it is this study
that gave researchers the first convincing data that smoking can cause high blood
pressure. The website of the study http://www.nhlbi.nih.gov/about/framingham/
gives a wealth of information about the study and about cardiovascular health.
6.2. Randomized Comparative Experiments
If an observational study falls short of establishing a causal relationship and even an
expensive well-designed prospective observational study cannot identify all possible
lurking variables, can we ever prove such a relationship?
The “gold standard” for establishing a cause and effect relationship between two
variables is the randomized comparative experiment. In an experiment, we want
to study the relationship between two or more variables. At least one variable is an
explanatory variable and the value of the variable can be controlled or manipulated.
At least one variable is a response variable. The experimenter has access to a
certain set of experimental units (subjects, individuals, cases), sets various
values of the explanatory variables to create a treatment, and records the values of
the response variables.
It is important first of all that an experiment be comparative. If we are attempting to establish that music participation increases grades, we cannot simply look at
participators. We need to compare the achievement level of participators to those who
do not participate. Many educational studies fall short of this standard. A school
might introduce a new curriculum in mathematics and measure the test scores of the
students at the end of the year. However the school cannot make the case that the
test scores are a result of the new curriculum — the students might have achieved
the same level with any curriculum.
In a randomized experiment we assign the individuals to the various treatments
at random. For example, if we took 100 fifth graders and randomly chose 50 of them
to be in the band and 50 of them not to receive any music instruction, we could begin
to believe that differences in their test scores could be explained by the different
treatments.
Example 6.2.1.
Patients undergoing certain kinds of eye surgery are likely to experience serious
post-operative pain. Researchers were interested in the question of whether
giving acetaminophen to the patients before they experienced any pain would
substantially reduce the subsequent pain and the further need for analgesics.
One group received acetaminophen before the surgery but no pain medicine after the surgery. A second group received no pain medicine before the surgery
and acetaminophen after the surgery. And the third group received no acetaminophen either before or after the surgery. Sixty subjects were used and
20 subjects were assigned at random to each group. (Soltani, Hashemi, and
Babaei, Journal of Research in Medical Sciences, March and April 2007; vol.
12, No 2.)
In Example 6.2.1, the goal of random assignment is to construct groups that are
likely to be representative of the whole pool of subjects. If the assignment were left
to the surgeons, for example, it might be the case that surgeons would give more pain
medication to certain types of patients and therefore we wouldn’t be able to attribute
the different results to the different treatments.
Example 6.2.2.
The R dataset chickwts gives the weights of chicks who were fed six different
diets over a period of time. The experimenter was attempting to determine
which chicken feed caused the greatest weight gain. Feed is the explanatory
variable and there were six treatments (six different feeds). Weight is the response variable. The first step in designing such an experiment is to assign
baby chicks at random to the six different feed groups. If we allow the experimenter to choose which chicks receive which feed, she might unconsciously (or
consciously) construct treatment groups that are unequal to start.
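Random assignment itself is easy to carry out in R. The following sketch (the group size of 12 is ours, chosen only for illustration) assigns 72 chicks to the six feeds at random:

# Randomly assign 72 chicks to six feed groups of 12 chicks each
feeds <- c("casein", "horsebean", "linseed", "meatmeal", "soybean", "sunflower")
assignment <- sample(rep(feeds, each = 12))   # a random permutation of the treatment labels
table(assignment)                             # 12 chicks per feed
head(data.frame(chick = 1:72, feed = assignment))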
Student (W.S. Gosset) was one of the researchers in the early part of the twentieth
century who realized the importance of randomization. One of his influential papers
analyzed a large scale study that was to compare the nutritional effects of pasteurized
and unpasteurized milk. In the Spring of 1930, 20,000 school children participated
in the study. Of these, 5,000 received pasteurized milk each day, 5,000 received
unpasteurized milk, and 10,000 did not receive milk at all. The weight and height of
each student was recorded both before and after the trial. Student analyzed the way
in which students were assigned to the three experimental treatments. There were
67 schools involved and in each school about half the students were in the control
group and half received milk. However each school received only one kind of milk,
pasteurized or unpasteurized. This was the first sort of bias that Student found — he
was not convinced that the schools that received pasteurized milk were comparable
to those that received unpasteurized milk. A more important difficulty was the way
in which students were assigned either to the control or milk group within a school.
The students were assigned at random initially, but teachers were given freedom to
adjust the assignments if it seemed to them that the two groups were not comparable
to each other in weight and height. In fact Student showed that this freedom on
the part of teachers to assign subjects to groups resulted in a systematic difference
between the groups in initial weight and height. The control groups were taller and
heavier on average than those in the milk groups. Student conjectured that teachers
unconsciously favored giving milk to the more undernourished students.
Of course assigning subjects to treatments at random does not ensure that the
experimental groups are alike in all relevant ways. Just as we were subjected to
sampling error when choosing a random sample from a population, we can have
variation in the groups due to the chance mechanism alone. But assigning subjects
at random will allow us to make probabilistic statements about the likelihood of such
error just as we were able to make confidence intervals for parameters based on our
analysis of sampling error that might arise in random sampling.
Randomized assignment and random samples
We assign subjects to treatments at random so that the various treatment groups will
be similar with respect to the variables that we do not control. That is, we would
like the experimental groups to be representative of the whole group of subjects. In
surveys (Chapter 2), we choose a random sample from a population for a similar
reason. We hope that the random sample is representative of a larger population.
Ideally, we would like both kinds of randomness in our experiments. Not only do we
ensure that the subjects are assigned at random to treatments, but we would like the
subjects to be chosen at random from a larger population. If this is true, we could
more easily justify generalizing our experimental results to a larger population than
the immediate subject pool. However that is almost never the case. In the pain study
of Example 6.2.1, the subjects were simply all those persons who were operated on
at a given clinic in a given period of time. This issue is particularly important if we
try to generalize the conclusions of an experiment to a larger population.
Example 6.2.3.
The author of this text participated in a study to investigate how people make
probabilistic judgments in situations for which they do not have much data.
(Default Probabilities, Osherson, Smith, Stob, and Wilkie, Cognitive Science,
(15), 1991, 251–270.) Subjects were placed in various experimental groups at
random. However the subjects were not chosen at random from any particular
population. Indeed every subject was an undergraduate in an introductory
psychology course at the University of Michigan or Massachusetts Institute of
Technology. It is difficult to make an argument that the results of the paper
would generalize to the population of all undergraduates in the United States
let alone to the population of all adults. The MIT students in particular seemed
to have a different set of strategies for dealing with probabilistic arguments.
Other features of a good experiment
In our analysis of simple random sampling from a population, we saw again and
again the importance of large samples in getting precise estimates of our parameters.
Analogously, if we are to measure precisely the effect of a treatment, we would like
many individuals in each treatment group. This principle is known as replication.
With a small number of individuals, it might be difficult to determine whether the
differences in response are due only to the treatments or whether they reflect the
natural variation in individuals. The chickwts data illustrate the issue. Figure 6.1
plots the weights of the six different treatment groups of chicks. While there is
definitely some variation between the groups, there is also considerable variation
within each group. Chicks fed meatmeal, for example, have weights spanning most of
the range of the entire experimental group. It is probably the case that the small
difference between the linseed and soybean groups is due to the particular chicks in
the groups rather than due to the feed. More chickens in each group would help us
[Plot: chick weight for each of the six feed groups (casein, horsebean, linseed, meatmeal, soybean, sunflower).]
Figure 6.1.: Weights of six different treatment groups of a total of 71 chicks.
resolve this issue however.
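A plot like Figure 6.1 can be made with base R graphics. One possible approach (not necessarily the one used to produce the figure) is:

stripchart(weight ~ feed, data = chickwts, vertical = TRUE,
           method = "jitter", pch = 1, ylab = "weight")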
In most good experiments one of the treatments is a control. A control generally
means a treatment that is a baseline or status quo treatment. In an educational
experiment, the control group might receive the standard curriculum while another
group is receiving the supposed improved curriculum. In a medical experiment, the
control group might receive the generally accepted treatment (or no treatment at all
if ethical) while another group receives a new drug. In Example 6.2.1, the group
that received no pre-pain medication is referred to as the control group. The goal of
a control group is to establish a baseline to which to compare the new or changed
treatment.
Often the control is a placebo. A placebo is a “treatment” that is really no
treatment at all but looks like a treatment from the point of view of the subject. In
Example 6.2.1, all subjects received pills both before and after surgery. But some
of these pills contained no acetaminophen and were inert. Placebos are given to
ensure that the placebo effect is measurable. The placebo effect is the tendency for
experimental subjects to be affected by the treatment even if it has no content. The
need for control groups and placebos is highlighted by the next famous example.
Example 6.2.4.
During the period 1927-1932, researchers conducted a large-scale study of industrial efficiency at the Hawthorne Plant of the Western Electric Company
in Cicero, IL. The researchers were interested in how physical and environmental features (e.g., lighting) affected worker productivity and satisfaction.
Researchers found that no matter what the experimental conditions were, productivity tended to improve. Workers participating in the experiment tended
to work harder and better to satisfy those persons who were experimenting on
them. This feature of human experimentation — that the experimentation itself changes behavior whatever the treatment — is now called the Hawthorne
Effect. (It is now generally accepted that the extent of the Hawthorne Effect
in the original experiments has been significantly overstated by the gazillions
of undergraduate psychology textbooks that refer to it. But the name remains
and it makes a nice story as well as a plausible cautionary tale!)
Another feature which helps to ensure that the differences in treatments are due
to the treatments themselves is blinding. An experiment is blind if the subjects do
not know which treatment group they are in. In Example 6.2.1, no subject knew
whether they were receiving acetaminophen or a placebo. It is plausible that a subject knowing they receive a placebo would have a different (subjective) estimate of
pain than one who thought that they might be receiving acetaminophen. An experiment is double-blind if the person administering the treatment also does not know
which treatment is being administered. This prevents the researcher from treating the
groups differently. It is not always possible or ethical to make an experiment blind or
double-blind. But when possible, blinding helps to ensure that the differences between
treatments are due to the treatments which is always the goal in experimentation.
6.3. Blocking
If the experimental subjects are identical, it does not matter which is assigned to
which treatment. The differences in the response variable are likely to be the result
of the differences in treatment. The subjects are not usually identical however or at
least cannot be treated identically. So we would like to know that the differences
in the response variable are due to the differences in the explanatory variable and
not any systematic differences in subjects. Randomization is one tool that we use to
distribute such differences equally across the treatments. In some cases however, our
experimental units are not identical or our experiment itself introduces a systematic
difference in the units that is due to something other than the treatment variable.
This leads to the notion of blocking which we illustrate with a classic example.
R.A. Fisher was one of the key early figures in developing the principles of good
experimental design. He did much of this while working at Rothamsted Experimental
Station on agricultural experiments. He studied closely data from experiments that
were attempting to establish such things as the effects of fertilizer on yield. Suppose
that we have three unimaginatively named fertilizers A, B, C. We could divide
the plot of land that we are using as in the first diagram of Figure 6.2. But it
might be the case that the further north in the plot, the better the soil conditions.
In that case, the variation in yield might be better explained (or at least partially
explained) by the location of the plot rather than by fertilizer. In this example, we
would say that the effects of northernness and fertilizer are confounded, meaning
Figure 6.2.: Two experimental designs for three fertilizers.
simply that we cannot separate them given the data of the experiment at hand. To
separate out the effect of northernness from that of fertilizer, we could instead divide
the patch using the second diagram in Figure 6.2. Of course there still might be
variations in the soil conditions across the three fertilizers. But we would at least
be able to measure the effect of northernness separately from that of fertilizer. In
this example, “northernness” is a blocking variable and our goal is to isolate the
variability attributable to northernness so that we can see the differences between the
fertilizers more clearly.
In a medical experiment it is often the case that gender or age are used as blocking variables. Obviously, we cannot assign individuals to the various levels of these
variables at random but it is plausible that in certain circumstances gender or age
can have a significant effect on the response. If so, it would be useful to design an
experiment that allows us to separate out the effects of, say, gender and the treatment.
When using a blocking variable, it is important to continue to honor the principle
of randomization. Suppose for example that we use gender as a blocking variable in a
medical experiment comparing two treatments. The ideal experimental design would
be to take a group of females and assign them at random to the two treatments and
similarly for the group of males. That is, we should randomize the treatments within
the blocks. The resulting experiment is usually called a randomized block design.
It is not completely randomized because subjects in one block cannot be assigned to
another but within a block it is randomized.
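As a concrete sketch (the subjects and treatment labels here are hypothetical), randomizing within blocks simply means running the random assignment separately in each block:

# Randomized block design: treatments A and B assigned at random within each gender block
treatments <- c("A", "B")
females <- sample(rep(treatments, each = 10))   # assignment for 20 hypothetical female subjects
males   <- sample(rep(treatments, each = 10))   # assignment for 20 hypothetical male subjects
table(females)                                  # 10 of each treatment within the block
table(males)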
It is instructive to compare the randomized block design to stratified random sampling. In each case, we divide subjects into groups and randomize within these groups.
The goal is to isolate and measure the variability that is due to the groups so that
we can measure the variability that remains.
A special case of blocking is known as a matched pair design. In such an experiment, there are just two observations in each block (one for each of two treatments).
In his 1908 paper, Student analyzed earlier published data from such an experiment.
That data is in the R dataframe sleep. The two different treatments were two different soporifics (sleeping drugs). There was no control treatment. The response variable
was the number of extra hours of sleep gained by the subject over his “normal” sleep.
There were just 10 subjects and each subject took both drugs (on different nights).
Thus each subject was a block and there was one observation on each treatment in
each block. Student then compared the difference in the two drugs on each patient.
Using the individuals as blocks served to help Student to decide what part of the variation in the response could be explained by the normal variation between individuals
and what could be attributed to the drugs themselves.
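The within-subject differences that Student analyzed are easy to compute from the sleep dataframe. A short sketch (it relies on the fact that the rows of sleep are ordered by subject within each drug group):

# Extra hours of sleep under drug 2 minus extra hours under drug 1, for each of the 10 subjects
d <- with(sleep, extra[group == "2"] - extra[group == "1"])
d
mean(d)
sd(d)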
In educational experiments, matched pairs are often constructed by finding two
students who are very similar in baseline academic performance. Then it is hoped
that the differences between these students at the end of the experiment are the result
of the different treatments.
It is important to remember that block designs are not an alternative to randomization. Indeed, it is very important that we randomize the assignment to treatments
within every block for the same reasons that randomization is important when we
have no blocking variable. Identifying blocking variables is simply acknowledging
that there are variables on which the treatments may systematically differ.
6.4. Experimental Design
In the above sections, we have introduced the three key features of a good experimental design — randomization, replication, blocking. We’ve illustrated these principles
in the case that we have just one explanatory variable with just a few levels. These
principles can be extended to situations with more than one explanatory variable
however. In this book, we will not investigate the problem of inference for such situations or discuss in detail the issues of experimental design in these cases. In this
section, we look at one example of extending these principles to experiments involving
more than one explanatory variable.
Example 6.4.1.
The R dataframe ToothGrowth contains the results of an experiment performed
on guinea pigs to determine the effect of Vitamin C on tooth growth. There
were two treatment variables, the dose of Vitamin C, and the delivery method
of the Vitamin C. The dose variable had three levels (.5, 1, and 2 mg) and the
delivery method was by either orange juice or ascorbic acid. There were 10
guinea pigs given each of the six treatments. The plot below (using coplot())
shows the differences between the two delivery methods and the various dose
levels.
[Plot: tooth length (len) versus dose, in separate panels given supp (VC and OJ). ToothGrowth data: length vs dose, given type of supplement.]
It appears that both the delivery method and the dose have some effect on
tooth growth.
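The plot in Example 6.4.1 was produced with coplot(); a call along the following lines reproduces its essential features (the exact arguments used for the figure are not shown, so treat this as a sketch):

coplot(len ~ dose | supp, data = ToothGrowth)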
Both the principles of randomization and replication extend to experiments with
more than one explanatory variable. In Example 6.4.1 for example, it is apparent
that the 60 guinea pigs should have been assigned at random to the six different
treatments. And it also is clear that there should have been enough guinea pigs in
each treatment so that the natural variation from pig to pig can be accounted for.
No blocking variables are described in the tooth growth study but it is often the case
that natural blocking variables can be identified. For example, in the tooth growth
study, it might not have been possible for the same technician to have recorded all the
measurements. In that case, it would not be a good idea for one technician to make
all the measurements for the orange juice treatment while another technician makes
all the measurements for the ascorbic acid treatment. The blocking variable would
be the technician and we would attempt to randomize assignment within each treatment.
Since there were 10 guinea pigs in each of the 6 treatments, two technicians could
each measure 5 guinea pigs in each treatment.
7. Inference – Two Variables
Is there a relationship between two variables? If there is a relationship, is it causal?
7.1. Two Categorical Variables
7.1.1. The Data
Suppose that the data consist of a number of observations on which we observe two
categorical variables. We normally present such data in a (two-way) contingency
table.
Example 7.1.1.
In 1973, the rate of acceptance to graduate school at the University of California at Berkeley was lower for females than males. (See the R dataset
UCBAdmissions.) Here 4,526 individuals are classified according to these two
variables in the following contingency table.
> xtabs(Freq~Gender+Admit,data=UCBAdmissions)
        Admit
Gender   Admitted Rejected
  Male       1198     1493
  Female      557     1278
We introduce some notation to aid in our discussion and analysis of such situations.
I          the number of rows
J          the number of columns
nij        the integer entry in the ith row and jth column
ni.        the sum of the entries in the ith row
n.j        the sum of the entries in the jth column
n = n..    the total of all entries in the table
We’ll also usually call the row variable R and the column variable C. Dots in
subscripts are often used in statistics to denote the operation of summing over the
possible values of that subscript. Hence ni. sums over the possible values of the second
subscript. This notation can be extended to more dimensions with k, l etc. denoting
the generic subscripts in the next places.
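R will compute the marginal totals ni. , n.j and n for us. For the Berkeley table of Example 7.1.1, addmargins() appends them to the table; a quick sketch:

tab <- xtabs(Freq ~ Gender + Admit, data = UCBAdmissions)
addmargins(tab)   # row sums, column sums, and the grand total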
Our research question and the nature of the two categorical variables determine how
we collect and analyze the data. There are three different data collection schemes that
we distinguish among.
1. I independent populations. On this model, R, the categorical variable that
determines the rows, defines I many populations. The data are collected by
choosing a simple random sample of each population and categorizing each
population according the column categorical variable. An example of such a
data collection exercise might be to choose a random sample of students of each
class level and ask each subject a YES-NO question. On this model of sampling,
we need to be able to identify each of the populations in advance.
2. One population, two factors. On this model, we choose n individuals at
random from one population and classify the individuals according to the two
different categorical variables.
3. I experimental treatments. On this model, the I rows are the I different
treatments to which we might assign a number of individuals. We assign ni.
individuals to each treatment (we hope by randomization) and then observe the
value of the column categorical variable in each individual.
Sometimes it is difficult to see immediately which of the three data collection
schemes is the best description of our data and sometimes it is clear that the data
did not arise in any one of these ways. For example, in most observational studies,
randomness does not play a role. It is often the case that such studies correspond to
the description of the second data collection scheme above but without the random
sampling. How we make inferences from such data from observational studies (and
whether we can make any inferences at all) is usually a difficult question. Of course
the data collection scheme should match the research question and we would like to
phrase our research questions as questions about parameters.
7.1.2. I independent populations
Suppose that random samples are chosen from each of the I independent populations
determined by the rows. This situation is really that of stratified random sampling
with the rows determining the strata. In this case, the variable C divides each population into J many groups. A natural question to ask is whether the proportion of
individuals in a particular group is the same across populations.
Example 7.1.2.
In [AM], Chase and Dummer report on a survey of 478 children in Ingham and
Clinton Counties in Michigan. (The data are available at the Data and Story Library and at http://www.calvin.edu/~stob/data/popularkids.csv.) The
children were chosen from grades 4, 5, and 6. Among the questions asked was
which goal was most important to them: making good grades, being popular,
or being good in sports. The results are
> pk=read.csv('http://www.calvin.edu/~stob/data/popularkids.csv')
> names(pk)
 [1] "Gender"      "Grade"       "Age"         "Race"        "Urban.Rural"
 [6] "School"      "Goals"       "Grades"      "Sports"      "Looks"
[11] "Money"
> xtabs(~Grade+Goals,data=pk)
     Goals
Grade Grades Popular Sports
    4     63      31     25
    5     88      55     33
    6     96      55     32
Here the three populations are students in the three grades and the research
question is whether students at the three grade levels are the same in their
choice of their most important goal.
We define parameters as follows:
πi,j = proportion of population i at level j of the second variable.
Note that with πij defined in this way, πi,1 + πi,2 + · · · + πi,J = 1 for every i. A natural first
hypothesis to test is
H0 : π1,j = π2,j = · · · = πI,j for every j.
If H0 is true, we say that the populations are homogeneous (with respect to variable
C). In order to test this hypothesis, it is necessary to construct a test statistic T such
that two things are true:
1. We know the distribution of T when H0 is true, and
2. The values of T tend to be small if H0 is true and large if H0 is false (or the
other way around).
It is easy to construct test statistics that have the second of these two properties.
However, since the distribution of such a statistic is discrete, it is usually computationally impossible to determine the distribution of the statistic we construct even under
the assumption that the null hypothesis is true. The classical test in this situation
is to use a test statistic for which we have a good approximation to its distribution.
The statistic is called the chi-square (χ²) statistic and its lineage is really the same
as that of the normal approximation to the binomial distribution. To form the chi-square statistic, we investigate what we expect would happen if the null hypothesis
were true. In this case, for every j, we have π1,j = π2,j = · · · = πI,j . We let π.j
denote the common value. (Here we use the dot in a slightly different but analogous
manner.) How would we estimate π.j , the probability of an individual falling in the
j th column? Since there are n.j individuals in this column, a natural estimate would
be π̂.j = n.j /n. With this estimate of π.j , we can estimate the number of individuals
that should fall in each cell. Since there are ni. individuals in row i, we should estimate that there are ni. π̂.j = ni. n.j /n individuals in the i, j th cell. This quantity is
important: we give it a name and notation.
Definition 7.1.3 (Expected Count). Under the null hypothesis H0 , the expected
count in cell i, j is
n̂i,j = ni. n.j /n.
We now introduce the statistic that we use to test this hypothesis. (We use X²
rather than χ² so that the statistic is an upper-case Roman letter!)
\[ X^2 = \sum \frac{(\text{observed} - \text{expected})^2}{\text{expected}} = \sum_{i} \sum_{j} \frac{(n_{ij} - \hat{n}_{ij})^2}{\hat{n}_{ij}} \, . \]
It is not hard to see that this statistic is always nonnegative and tends to be larger
if the null hypothesis is false and smaller if it is true. However the distribution of
this statistic cannot be computed exactly for all but the smallest n. We digress and
introduce a new and important distribution.
Definition 7.1.4 (chi-square distribution). The chi-square distribution is a one-parameter family of distributions with parameter a natural number ν and pdf
\[ f(x;\nu) = \frac{1}{2^{\nu/2}\,\Gamma(\nu/2)}\, x^{\nu/2-1} e^{-x/2}, \qquad x \ge 0 \, . \]
The chi-square distribution has mean ν and variance 2ν. The parameter ν is called
the degrees of freedom.
The plot of the density function for the chi-square distribution with ν = 4 is in
Figure 7.1.
The importance of the chi-square distribution stems from the following fact.
Proposition 7.1.5. Suppose that X1 , . . . , Xν are independent random variables each
of which has a standard normal distribution. Then X1² + · · · + Xν² has a chi-square
distribution with ν degrees of freedom.
Figure 7.1.: The density of the chi-square distribution with ν = 4.
For our purposes, we have the following fact.
Proposition 7.1.6. If the null hypothesis H0 is true, then the statistic X 2 has a
distribution that is approximately chi-square with (I − 1)(J − 1) degrees of freedom.
We now use the proposition to make a hypothesis test.
chi-square test of homogeneity of populations.
Suppose that the value of X² is c. The p-value of the hypothesis test of H0
is p = P(X² ≥ c) where we assume that X² has a chi-square distribution with
ν = (I − 1)(J − 1) degrees of freedom.
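In R, this p-value is computed with pchisq(). As a sketch, for an observed chi-square value c with ν degrees of freedom (the numbers below anticipate Example 7.1.7):

c.obs <- 1.3121                              # observed value of the chi-square statistic
pchisq(c.obs, df = 4, lower.tail = FALSE)    # approximately 0.86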
Example 7.1.7.
Continuing the popular kids example, Example 7.1.2, we compute the chi-square
value using R. While R does the computations, we illustrate the computation
by considering the first cell. There are 478 subjects total (n.. = 478) of which
119 are in grade 4 (n1. = 119). Of the 478 subjects, 247 have getting good
grades as their most important goal. Thus 247/478 = 51.7% of the sampled
children have this as their goal. The expected count in the first cell is therefore
n̂1,1 = (247/478) · 119 = 61.49. Since the actual count is 63, this contributes
(63 − 61.49)²/61.49 = .037 to the chi-square value. Continuing over the six
cells, we have a chi-square value of 1.3121 according to R.
> popkidstable=xtabs(~Grade+Goals,data=pk)
> chisq.test(popkidstable)

        Pearson's Chi-squared test

data:  popkidstable
X-squared = 1.3121, df = 4, p-value = 0.8593
The value of X² is 1.31. The p-value indicates that if H0 is true, we would
expect to see a value of X 2 at least as large as 1.31 over 85% of the time. So
if H0 is true, this value of the chi-square statistic is not at all surprising. We
have no reason to doubt the null hypothesis that students of these three grades
do not differ in their most important goals.
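The expected counts computed by hand above are available from the object returned by chisq.test(), and the formula for X² can be checked directly. A short sketch:

fit <- chisq.test(popkidstable)
fit$expected                                          # the expected counts under H0
sum((popkidstable - fit$expected)^2 / fit$expected)   # should reproduce the value 1.3121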
The use of the chi-square distribution is only an approximation. The approximation
is better if the populations are large and the individual cell sizes are not too small.
The conventional wisdom is to not use this test if any cell has a count of 0 or more
than 20% of the cells have expected count less than 5. R will give a warning message
if any cell has expected count less than 5.
7.1.3. One population, two factors
We now look at the case in which the contingency table results from sampling from
a single population and classifying the sampled elements according to two different
categorical variables. The natural research question is whether the two variables are
“independent” of each other. We start with an example.
Example 7.1.8.
During the Spring semester of 2007, 280 statistics students were given a survey. Among other things, they were asked their gender and whether they were
smokers. The results are tabulated below. (Note that the file was created using
a blank field to denote a missing value. An argument to read.csv() addresses
that.)
> survey=read.csv('http://www.calvin.edu/~stob/data/survey.csv',na.strings=c('NA',''))
> t=xtabs(~gender+smoker,data=survey)
> t
       smoker
gender  Non Smoke
     F  133     5
     M  125    13
Now these 280 students were not a random sample of 280 students from any
particular population. However we might think that this group could be representative of the population of all students with respect to the relationship of
smoking to gender. We note that in this (convenience) sample, a male is more
likely to smoke than a female. Does this difference indicate a true difference
between the genders or is this simply a result of sampling variability?
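Before testing, it is helpful to look at the smoking rate within each gender. prop.table() converts the counts in the table t to row proportions:

prop.table(t, margin = 1)   # proportion of smokers and non-smokers within each gender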
To formulate the research question as a question about parameters, we define πi,j
as the proportion of the population that has the value i for variable R and j for
variable C. We also define πi. and π.j to denote the proportion of the population
with the relevant value of each individual categorical variable. Then the hypothesis
of independence that we wish to test is
H0 : πi,j = πi. π.j for every i, j.
This hypothesis is an independence hypothesis as it states that the events of an
object being classified as i on variable R and j on variable C are independent. Just
as in the case of independent populations, it is plausible to estimate π.j by n.j /n. It
is also reasonable to estimate πi. by ni. /n. Then, if the null hypothesis is true we
should use π̂i,j = ni. n.j /n² as our estimate of πi,j . Notice that with this estimate of
πi,j , we expect that we would have nπ̂i,j = ni. n.j /n individuals in cell i, j. This is
exactly the same expected cell value as in the case of the test for homogeneity. This
suggests that exactly the same statistic, X 2 , should be used to test H0 . Indeed, we
have
Proposition 7.1.9. If H0 is true, then the statistic
\[ X^2 = \sum \frac{(\text{observed} - \text{expected})^2}{\text{expected}} = \sum_{i} \sum_{j} \frac{(n_{ij} - \hat{n}_{ij})^2}{\hat{n}_{ij}} \]
has a distribution that is approximately chi-square with (I − 1)(J − 1) degrees of
freedom.
The proposition means that we can use exactly the same R test in this case. It also
means that in cases where it is not so clear whether we are testing for homogeneity or independence, it doesn’t really matter! In the smoking and gender example,
Example 7.1.8, we have
> chisq.test(t)

        Pearson's Chi-squared test with Yates' continuity correction

data:  t
X-squared = 2.9121, df = 1, p-value = 0.08791
A p-value of 0.088 suggests that there is not sufficient evidence to claim that smoking and gender are not independent.
7.1.4. I experimental treatments
The third way that a two-way contingency table might arise is in the case that the rows
correspond to the different treatments in an experiment. Here we are thinking
that the n individuals are assigned at random to the I treatments with ni. individuals
assigned to treatment i. (We hope as well that the n individuals are a random
sample from some larger population to which we want to generalize the results of the
experiment. This hope will hardly ever be realized.) We want to know whether the
experimental treatments have an effect on the column variable C.
Example 7.1.10.
In [LP01], a study was done to see if delayed prescribing of antibiotics was as
effective as immediate prescribing of antibiotics for treatment of ear infections.
164 children were assigned to the treatment group that received a prescription
for antibiotics but was instructed not to take the antibiotics for three
days (the “delay” group). 151 children received a prescription for antibiotics
to be taken immediately (the “immediate” group). The assignment was by
randomization. One of the side effects of antibiotics in children is diarrhea. Of
the delay group, 15 children had diarrhea and of the immediate group, 29 had
diarrhea. The question is whether the rate of diarrhea differs for those receiving
antibiotics immediately as opposed to those who waited. We do not have the
raw data so we construct the table ourselves using the summary data above.
> m=matrix(c(15,149,29,122),nrow=2,ncol=2,byrow=T)
> m
     [,1] [,2]
[1,]   15  149
[2,]   29  122
> colnames(m)=c('Diarrhea','None')
> rownames(m)=c('Delay','Immediate')
> m
          Diarrhea None
Delay           15  149
Immediate       29  122
Obviously, the rate of diarrhea in the immediate group is bigger but we would
like to know if this difference could be attributable to chance.
The null hypothesis in this case is that there is no difference between the treatments
(e.g., the rows) as far as the column variable C is concerned. This is essentially a
homogeneity hypothesis and we will analyze the data in precisely the same manner as
the case of I independent populations. In this case, we could think of the treatment
levels as defining theoretical populations, namely the population of individuals that
might have received each treatment. The “random sample” from the ith population
is then the collection of subjects randomly assigned to treatment i. We write the null
hypothesis in terms of parameters πij just as in the null hypothesis for homogeneity.
In this case πij denotes the probability that a subject assigned to treatment i will
have the value j on the categorical variable C. The null hypothesis is
H0 : π1,j = π2,j = · · · = πI,j for every j,
and we test this null hypothesis exactly the same way as in the case of homogeneity.
Example 7.1.11.
Continuing Example 7.1.10, we have the following test of the hypothesis that
there is no difference in the rates of diarrhea for the two treatment conditions.
> chisq.test(m,correct=F)

        Pearson's Chi-squared test

data:  m
X-squared = 6.6193, df = 1, p-value = 0.01009
With a p-value of .01 it appears that the difference in the rate of diarrhea in the
two groups is greater than we would expect to see if the null hypothesis were
true. We would reject the null hypothesis at the significance level α = 0.05, for
example.
In the above test, we have chosen not to use something called the “continuity
correction” (correct=F). If we use the correction, we find
> chisq.test(m)

        Pearson's Chi-squared test with Yates' continuity correction

data:  m
X-squared = 5.8088, df = 1, p-value = 0.01595
In the correction, which is used only for the two-by-two case, the value 0.5 is
subtracted from each of the absolute differences |observed − expected| before squaring. It turns out that this
makes the chi-square approximation somewhat closer.
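For a two-by-two table such as this one, the same comparison can also be made with prop.test(), which compares the two diarrhea rates directly and, in addition, reports a confidence interval for the difference in proportions. (Its chi-square value should agree with the corrected chisq.test() above.)

prop.test(x = c(15, 29), n = c(164, 151))   # 15 of 164 delay children, 29 of 151 immediate children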
7.2. Difference of Two Means
This section addresses the problem of determining the relationship between a categorical variable (with two levels) and a quantitative variable. Just as in the case of
two categorical variables, data like this can arise from independent samples from two
different populations, from a randomized comparative experiment with two treatment
groups, or from cross-classifying a random sample from a single population according
to the two variables. We look at the two population case here (and suggest that the
two treatment group case should be analyzed the same way as in Section 7.1).
Assumptions for two independent samples:
1. X1 , . . . , Xm is a random sample from a population with mean µX and variance σ²X.
2. Y1 , . . . , Yn is a random sample from a population with mean µY and variance σ²Y.
3. The two samples are independent one from another.
4. The samples come from normal distributions.
Of course the fourth assumption above is an assumption of convenience to make
the mathematics work out. In most cases, our populations either are not known to be normal or are known
not to be normal and we hope that the inference procedures we develop below are
reasonably robust.
We first write a confidence interval for the difference in the two means µX − µY .
Just as did our confidence intervals for one mean µ, our confidence interval will have
the form
(estimate) ± (critical value) · (estimate of standard error) .
The natural choice for an estimator of µX − µY is X̄ − Ȳ . To write the other two
pieces of the confidence interval, we need to know the distribution of X̄ − Ȳ . The
necessary fact is this:
\[ \frac{\bar{X} - \bar{Y} - (\mu_X - \mu_Y)}{\sqrt{\dfrac{\sigma_X^2}{m} + \dfrac{\sigma_Y^2}{n}}} \sim \mathrm{Norm}(0, 1) \, . \]
Analogously to confidence intervals for a single mean, it seems like the right way
to proceed is to estimate σX by sX , σY by sY and to investigate the random variable
\[ \frac{\bar{X} - \bar{Y} - (\mu_X - \mu_Y)}{\sqrt{\dfrac{S_X^2}{m} + \dfrac{S_Y^2}{n}}} \, . \qquad (7.1) \]
The problem with this approach is that the distribution of this quantity is not
known even if we assume that the populations are normal (unlike the case of the single
mean where the analogous quantity has a t-distribution). We need to be content with
an approximation.
Lemma 7.2.1. (Welch) The quantity in Equation 7.1 has a distribution that is approximately a t-distribution with degrees of freedom ν where ν is given by
\[ \nu = \frac{\left( \dfrac{S_X^2}{m} + \dfrac{S_Y^2}{n} \right)^2}{\dfrac{(S_X^2/m)^2}{m-1} + \dfrac{(S_Y^2/n)^2}{n-1}} \qquad (7.2) \]
(It isn’t at all obvious from the formula but it is good to know that min(m−1, n−1) ≤
ν ≤ n + m − 2.)
We are now in a position to write a confidence interval for µX − µY .
An approximate 100(1 − α)% confidence interval for µX − µY is
\[ \bar{x} - \bar{y} \pm t^{*} \sqrt{\frac{s_X^2}{m} + \frac{s_Y^2}{n}} \qquad (7.3) \]
where t∗ is the appropriate critical value tα/2,ν from the t-distribution with ν
degrees of freedom given by (7.2).
We note that ν is not necessarily an integer and we leave it to R to compute both the
value of ν and the critical value t∗ .
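Formula (7.2) is easy to program if we want to see the degrees of freedom that t.test() is using. The function name welch.df below is ours:

welch.df <- function(sx, sy, m, n) {
  vx <- sx^2 / m
  vy <- sy^2 / n
  (vx + vy)^2 / (vx^2 / (m - 1) + vy^2 / (n - 1))
}
# e.g., welch.df(sd(x), sd(y), length(x), length(y)) for samples x and y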
Example 7.2.2.
The barley dataset of the lattice package has the yield in bushels per acre
of various experiments done in Minnesota in 1931 and 1932. If we think of the
experiments done in 1931 and in 1932 as samples from two populations, we have
> t.test(yield~year,barley)

        Welch Two Sample t-test

data:  yield by year
t = -2.9031, df = 116.214, p-value = 0.004422
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -8.940071 -1.688820
sample estimates:
mean in group 1932 mean in group 1931
          31.76333           37.07778
There is a significant difference in the mean yield for the two years.
We should remark at this point that older books (and even some newer books which
don’t reflect current practice) suggest an alternate approach to the problem of writing
confidence intervals for µX − µY . These books suggest that we assume that the two
standard deviations σX and σY are equal. In this case the exact distribution of our
quantity is known (it is t with n + m − 2 degrees of freedom). The difficulty with this
approach is that there is usually no reason to suppose that σX and σY are equal and
if they are not equal the proposed confidence interval procedure is not as robust as
the one we are using. Current best practice is to always prefer the Welch procedure
to that of assuming that the two standard deviations are equal.
Robustness
Confidence intervals generated by Equation 7.3 are probably the most common confidence intervals in the statistical literature. But those who generate such intervals are
not always sensitive to the hypotheses that are necessary to be confident about the
confidence intervals generated. It should first be noted that the confidence intervals
constructed are based on the hypothesis that the two populations are normally distributed. It is often apparent from even a cursory examination of the data that this
hypothesis is unlikely to be true. However, if the sample sizes are large enough, the
intervals generated are fairly robust. (This is related to the Central Limit Theorem
and the fact that we are making inferences about means.) There are a number of different rules of thumb as to what large enough means, but n, m > 15 for distributions
that are relatively symmetric and n, m > 40 for most distributions are common rules
of thumb. A second principle is that we can be surer of confidence intervals in which the quotients s²X/m and s²Y/n are roughly comparable in size than of those in which they are quite different.
Turning Confidence Intervals into Hypothesis Tests
It is often the case that researchers content themselves with testing hypotheses about
µX − µY rather than computing a confidence interval for that quantity. For example,
the null hypothesis µX − µY = 0 in the context of an experiment is a claim that
there is no difference in the two treatments represented by X and Y . This would
be the typical null hypothesis in comparing a medical treatment to a control or a
placebo. Hypothesis testing of this sort has fallen into disfavor in many circles since
the knowledge that µX − µY ≠ 0 is of rather limited interest unless the size of this
quantity is known. (After all, nobody should really believe that two populations
would have exactly the same mean on any variable.) A confidence interval gives
information about the size of the difference. Nevertheless, since the literature is still
littered with such hypothesis tests, we give an example here.
Example 7.2.3.
Returning to our favorite chicks, we might want to know if we should believe
that the effect of a diet of horsebean seed is really different from that of a diet of linseed.
Suppose that x1 , . . . , xm are the weights of the m chickens fed horsebean seed
and y1 , . . . , yn are the weights of the n chickens fed linseed. The hypothesis that
we really want to test is H0 : µX − µY = 0. We note that if the null hypothesis is true, then $T = (\overline{X} - \overline{Y})\big/\sqrt{S_X^2/m + S_Y^2/n}$ has a distribution that is approximately a t-distribution with the Welch formula giving the degrees of freedom.
Thus the obvious strategy is to reject the null hypothesis if the value of T is too
large in absolute value. Fortunately, R does all the appropriate computations.
Notice that the mean weight of the two groups of chickens differs by 58.5 but
that a 95% confidence interval for the true difference in means is (−99.1, −18.0).
On this basis we expect to conclude that the linseed diet is superior, i.e., that
there is a difference in the mean weights of the two populations. This is verified
by the hypothesis test of H0 : µX − µY = 0 which results in a p-value of 0.007.
That is, this great a difference in mean weight would have been quite unlikely
to occur if there was no real difference in the mean weights of the populations.
> hb=chickwts$weight[chickwts$feed=="horsebean"]
> ls=chickwts$weight[chickwts$feed=="linseed"]
> t.test(hb,ls)
Welch Two Sample t-test
data: hb and ls
t = -3.0172, df = 19.769, p-value = 0.006869
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-99.05970 -18.04030
sample estimates:
mean of x mean of y
   160.20    218.75
Variations
One-sided confidence intervals and one-sided tests are possible as are intervals of
different confidence levels. All that is needed is an adjustment of the critical numbers
(for confidence intervals) or p-values for tests.
Example 7.2.4.
A random dot stereogram is shown to two groups of subjects and the time it
takes for the subject to see the image is recorded. Subjects in one group (VV)
are told what they are looking for but subjects in the other group (NV) are
not. The quantity of interest is the difference in average times. If µX is the
theoretical average of the population of the NV group and µY is the average of
the VV group, then we might want to test the hypothesis
H0 : µX − µY = 0
Ha : µX > µY
> rds=read.csv('http://www.calvin.edu/~stob/data/randomdot.csv')
> rds
       Time Treatment
1  47.20001        NV
2  21.99998        NV
3  20.39999        NV
......................
77  1.10000        VV
78  1.00000        VV
> t.test(Time~Treatment,data=rds,conf.level=.9,alternative="greater")
Welch Two Sample t-test
data: Time by Treatment
t = 2.0384, df = 70.039, p-value = 0.02264
alternative hypothesis: true difference in means is greater than 0
90 percent confidence interval:
 1.099229      Inf
sample estimates:
mean in group NV mean in group VV
        8.560465         5.551429
>
From this we see that a lower bound on the difference µX −µY is 1.10 at the 90%
level of confidence. And we see that the p-value for the result of this hypothesis
test is 0.023. We would probably conclude that, on average, those getting no information take longer than those who do.
Just as in the case of the t-test for one mean, it is important to consider the
power of the two-sample t-test before conducting an experiment. The R function
power.t.test() with argument type=’two.sample’ does the appropriate computations.
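For example, here is a sketch of such a power calculation; the difference of 30 grams, the standard deviation of 50 grams, and the 80% power below are made-up values for illustration, not estimates taken from the text.

# Sample size per group needed to detect a difference of 30 with sd 50 and 80% power
power.t.test(delta = 30, sd = 50, sig.level = 0.05, power = 0.80,
             type = "two.sample", alternative = "two.sided")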
7.3. Exercises
7.1 In Berkson, JASA, 33, pp. 526-536, there is data on the result of an experiment
evaluating a treatment designed to prevent the common cold. There were 300 subjects
and 143 received the treatment and 157 the placebo. Of the treatment group, 121
eventually got a cold and of the placebo group, 145 got a cold. Was the treatment
effective? Write a contingency table and formulate this problem as a chi-square
hypothesis test (as indeed Berkson did).
7.2 The DAAG package has a dataset rareplants classifying various plant species
in South Australia and Tasmania. Each species was classified according to whether it
was rare or common in each of those two locations (giving the possibilities CC, CR,
RC, RR) and whether its habitat was wet, dry, or both (W, D, WD). The dataset
contains the summary table which is also reproduced here.
> rareplants
    D   W WD
CC 37 190 94
CR 23  59 23
RC 10 141 28
RR 15  58 16
a) What hypothesis exactly is begging to be tested with the aid of this contingency table (e.g., homogeneity or independence)?
b) Test this hypothesis.
7.3 Twenty-one rubber bands were divided into two groups. One group was placed in hot water for 4 minutes while the other was left at room temperature. Each band was then stretched by a 1.35 kg weight and the amount of stretch in mm was recorded. (The dataset comes from the DAAG library where it is called two65.) You can get the dataset in dataframe format from http://www.calvin.edu/~stob/data/rubberbands.csv. Write a 95% confidence interval for the difference in average stretch for this kind of rubber band under the two conditions.
7.4 The dataset http://www.calvin.edu/~stob/data/reading.csv contains the
results of an experiment done to test the effectiveness of three different methods of
reading instruction. We are interested here in comparing the two methods DRTA and
Strat. Let’s suppose, for the moment, that students were assigned randomly to these
two different treatments.
a) Use the scores on the third posttest (POST3) to investigate the difference between these two teaching methods by constructing a 95% confidence interval for
the difference in the means of posttest scores.
b) Your confidence interval in part (a) relies on certain assumptions. Do you have
any concerns about these assumptions being satisfied in this case?
c) Using your result in (a), can you make a conclusion about which method of
reading instruction is better?
7.5 Surveying a choir, you might expect that there would not be a significant height
difference between sopranos and altos but that there would be between sopranos and
basses. The dataset singer from the lattice package contains the heights of the
members of the New York Choral Society together with their singing parts.
a) Decide whether these differences do or do not exist by computing relevant confidence intervals.
b) These singers aren’t random samples from any particular population. Explain
what your conclusion in (a) might be about.
7.6 The package alr3 has a dataframe ais containing various statistics on 202 elite
Australian athletes. (The package must be loaded and then the dataset must be
loaded as well using data(ais).)
a) Is there a difference between the hemoglobin levels of males and females? (Well,
of course there is a difference. But is it statistically significant?)
b) What assumptions are you making about the data in (a) to make it a problem
in statistical inference?
c) To what populations do you think you could generalize the result of (a)?
8. Regression
In Section 1.6 we introduced the least-squares method for finding a linear function
that best describes the relationship between a pair of quantitative variables. In this
chapter we enhance that method by grounding it in a statistical model.
8.1. The Linear Model
Suppose that we have n individuals on which we measure two quantitative variables.
The data then consist of n pairs (x1 , y1 ), . . . , (xn , yn ). We will develop a model for
the situation in which we consider the variable x as an explanatory variable and y
as the response variable. Our model will assume that for each fixed data value x,
the corresponding value y is the result of a random variable, Y . The linearity of the
model comes from the fact that we will assume that the expected value of Y is a
linear function of x. The model is given by
The standard linear model
The standard linear model is given by the equation
\[
Y = \beta_0 + \beta_1 x + \varepsilon \tag{8.1}
\]
where
1. ε is a random variable with mean 0 and variance σ²,
2. β0, β1, σ² are (unknown) parameters,
3. and ε has a normal distribution.
We will assume that the data (x1, y1), . . . , (xn, yn) result from n independent trials governed by the process above. That is, we assume that ε1, . . . , εn is an iid sequence of random variables with mean 0 and variance σ². Then each yi is the result of a random variable Yi given by
\[
Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i .
\]
Notice that in our description of the data collection process, yi is treated as the
result of a random variable but xi is not. Also note that for any fixed i, the mean of
Yi is β0 + β1 xi and the variance of Yi is σ 2 .
There are three unknown parameters (β0 , β1 , and σ 2 ) in the linear model. Usually,
the most interesting of these from a scientific point of view is β1 since it is an expression
of the way in which the response variable Y depends on the value of the explanatory
variable x. We would like to estimate these parameters and make inferences about
them. It turns out that we have already done much of the work in Section 1.6.
The least-squares line is the “right” line to use in estimating β0 and β1 . We review
the construction of that line. Let β̂0 and β̂1 denote the estimators of β0 and β1
respectively.
A note on notation. It would be nice to use uppercase to denote the
estimator and lowercase to denote the estimate. That would mean that
we should use b1 to denote the estimate of β1 and B1 to denote the estimator of β1 . However this is typically not done and instead β̂1 is used for
both. So β̂1 might be a number (an estimate) or a random variable (the
corresponding estimator) depending on the context. Be careful!
Now define
ŷi = β̂0 + β̂1 xi .
Of course ŷi is not defined until we specify how to choose β̂0 and β̂1 . Given β̂0 and
β̂1 , we define
\[
\mathrm{SSResid} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} \left( y_i - (\hat\beta_0 + \hat\beta_1 x_i) \right)^2 .
\]
We proceed exactly as in Section 1.6. Namely we choose β̂0 and β̂1 to minimize
SSResid. (In fact in that section we called these two numbers b0 and b1 .) We have
the following expressions for β̂0 and β̂1 .
\[
\hat\beta_1 = \frac{\sum_{i=1}^{n} (x_i - \overline{x})\, y_i}{\sum_{i=1}^{n} (x_i - \overline{x})^2}, \qquad \hat\beta_0 = \overline{y} - \hat\beta_1 \overline{x} .
\]
The corresponding estimators result from these expressions by replacing yi by Yi . The
desirable properties of these estimators (besides minimizing SSResid) are summarized
in the next three results.
Proposition 8.1.1. Assume only that E(εi) = 0 for all i in the model given by
(8.1). Then β̂0 and β̂1 are unbiased estimates of β0 and β1 respectively. Therefore,
ŷi = β̂0 + β̂1 xi is an unbiased estimate of β0 + β1 xi (which is the expected value of Yi
for the value x = xi ).
Figure 8.1.: The corrosion data with the least-squares line added.
Notice that in Proposition 8.1.1 we do not need to assume that the errors have
constant variance or that they are independent! This proposition therefore gives us
a very good reason to use the least-squares slope and intercept for our estimates.
Example 8.1.2.
In Example 1.6.1 we looked at the loss due to corrosion of 13 Cu/Ni alloy bars
submerged in the ocean for sixty days. Here the iron content Fe is the explanatory variable and it is reasonable to treat that as controlled and known by the
experimenter (rather than as a random variable). The data plot suggests that
the linear model might be a reasonable approximation to the true relationship
between iron content and material loss. We reproduce the analysis here. Using
R we find that β̂0 = 129.79 and β̂1 = −24.02.
> library(faraway)
> data(corrosion)
> corrosion[c(1:3,12:13),]
Fe loss
1 0.01 127.6
2 0.48 124.0
3 0.71 110.8
12 1.44 91.4
13 1.96 86.2
> lm(loss~Fe,data=corrosion)
Call:
lm(formula = loss ~ Fe, data = corrosion)
Coefficients:
(Intercept)           Fe
     129.79       -24.02
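As a check, the least-squares formulas can also be evaluated directly; a short sketch using the corrosion data loaded above:

# Compute the least-squares slope and intercept from the formulas
x = corrosion$Fe; y = corrosion$loss
b1 = sum((x - mean(x)) * y) / sum((x - mean(x))^2)
b0 = mean(y) - b1 * mean(x)
c(b0, b1)   # should agree with the coefficients reported by lm()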
If we add the assumption of independence of the εi and also the assumption of constant variance, we know considerably more about our estimates, as evidenced by the next two propositions. (Recall that $S_{xx} = \sum_{i=1}^{n} (x_i - \overline{x})^2$.)
Proposition 8.1.3. Suppose that Yi = β0 + β1 xi + εi where the random variables εi are independent and satisfy E(εi) = 0 and Var(εi) = σ². Then
1. $\mathrm{Var}(\hat\beta_1) = \dfrac{\sigma^2}{S_{xx}}$,
2. $\mathrm{Var}(\hat\beta_0) = \dfrac{\sigma^2 \sum_{i=1}^{n} x_i^2}{n\,S_{xx}} = \sigma^2\left(\dfrac{1}{n} + \dfrac{\overline{x}^2}{S_{xx}}\right)$.
It is not important to remember the formulas of this proposition. But they are
worth examining for what they say about the variance of our estimators. We can
decrease the variance of the estimator of slope, for example, by collecting a large
amount of data with x values that are widely spread. This seems intuitively correct.
In general we like unbiased estimators with small variance. The next theorem
assures us that the least squares estimators are good estimators in this respect.
Theorem 8.1.4 (Gauss-Markov Theorem). Assume that E(εi) = 0, Var(εi) = σ², and the random variables εi are independent. Then the estimators β̂0 and β̂1 are the unbiased estimators of minimum variance among all unbiased estimators that are linear in the random variables Yi. (We say that these estimators are best linear unbiased estimators, BLUE.)
While there might be non-linear estimators that improve on β̂0 and β̂1, the Gauss-Markov Theorem gives us a powerful reason for using these estimators. Notice however that the Theorem has hypotheses. Both the homoscedasticity (equal variance) and independence hypotheses are important.
Our final proposition of the section gives us additional information if we add the
normality assumption.
Theorem 8.1.5. Assume that E(εi) = 0, Var(εi) = σ², and the random variables εi are independent and normally distributed. Then β̂0 and β̂1 are normally distributed.
We exploit this theorem in the next section to make inferences about the parameters
β0 , β1 .
8.2. Inferences
We first consider the problem of making inferences about β1 . In particular, we would
like to construct confidence intervals for β1 with the aid of our estimate β̂1 . In order
to do this, we must clearly make some distributional assumptions about the Yi . So
for this entire section, we will assume all the hypotheses of the standard linear model,
namely that E(εi) = 0, Var(εi) = σ², and the random variables εi are normally distributed and independent of one another. From the results of the last section, we then have
\[
\hat\beta_1 \sim N\!\left(\beta_1, \, \sigma^2/S_{xx}\right).
\]
We’ve been in this situation before. Namely, we have an estimator that has a
normal distribution centered at the true value of the parameter but with a standard
deviation that depends on an unknown parameter σ. Clearly the way to proceed is
to estimate the unknown standard deviation. To do this, we need to estimate σ.
Proposition 8.2.1. Under the assumptions of the linear model,
\[
\mathrm{MSResid} = \frac{\mathrm{SSResid}}{n-2}
\]
is an unbiased estimate of σ 2 .
While we will not prove the proposition, let’s see that it is plausible. The numerator
in this computation is a sum of terms of the form (yi −ŷi )2 . Since ŷi is the best estimate
of E(Yi ) that we have, yi − ŷi is a measure of the deviation of yi from its mean. Thus
(yi − ŷi)² functions exactly the same way that (xi − x̄)² functions in the computation of the sample variance. However, in this case we have a denominator of n − 2 rather than n − 1. This accounts for the fact that we are minimizing SSResid by choosing two parameters. The n − 2 is the key to making this estimator unbiased; a more straightforward choice would have been to use n in the denominator. Since MSResid is an estimate for σ², we will use s to denote $\sqrt{\mathrm{MSResid}}$.
With the estimate $s = \sqrt{\mathrm{MSResid}}$ for σ in hand, we can estimate the standard deviation of β̂1. Since Var(β̂1) = σ²/Sxx, we will use $s/\sqrt{S_{xx}}$ to estimate the standard deviation of β̂1. We can similarly estimate Var(β̂0). We record these estimates in a definition.
definition.
Definition 8.2.2 (standard errors of β̂0 and β̂1). The estimates of the standard deviation of the estimators β̂0 and β̂1, called the standard errors of the estimates, are given by
1. $s_{\hat\beta_1} = \dfrac{s}{\sqrt{S_{xx}}}$, and
2. $s_{\hat\beta_0} = s\sqrt{\dfrac{1}{n} + \dfrac{\overline{x}^2}{S_{xx}}}$.
We illustrate all the estimates computed so far with another example.
Example 8.2.3.
A number of paper helicopters were dropped from a balcony and the time in
air was recorded by two different timers. Various dimensions of each helicopter
were measured including L, the “wing” length. A plot shows that there is a
positive relationship between L and the time of the second timer (Time.2).
To describe the relationship, we suppose that a linear model might be a good
description. A plot of the data with a regression line added is in Figure 8.2
Figure 8.2.: Flight time for helicopters with various wing lengths.
> h=read.csv('http://www.calvin.edu/~stob/data/helicopter.csv')
> h[1,]
  Number W L H   B Time.1 Time.2
1      1 3 6 2 1.5   6.89   6.82
> l=lm(Time.2~L,data=h)
> summary(l)
Call:
lm(formula = Time.2 ~ L, data = h)
Residuals:
     Min       1Q   Median       3Q      Max
-1.42875 -0.53381  0.04489  0.49348  1.59941

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   2.9816     0.7987   3.733  0.00200 **
L             0.5773     0.1753   3.293  0.00493 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.8872 on 15 degrees of freedom
(3 observations deleted due to missingness)
Multiple R-Squared: 0.4196,
Adjusted R-squared: 0.381
F-statistic: 10.85 on 1 and 15 DF, p-value: 0.004925
We have the following estimates: s = 0.8872 (labeled residual standard error in the output of lm), β̂1 = 0.5773, sβ̂1 = 0.1753, β̂0 = 2.9816, and sβ̂0 = 0.7987.
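These quantities can also be pulled out of the fitted object directly; a sketch, assuming the fit l from this example (model.frame() is used so that the rows dropped for missingness are excluded):

# Residual standard error and standard errors of the coefficients for the fit l
s = summary(l)$sigma           # the estimate of sigma
sqrt(diag(vcov(l)))            # standard errors of beta0-hat and beta1-hat
# Or, for the slope, via the formula s/sqrt(Sxx)
Sxx = with(model.frame(l), sum((L - mean(L))^2))
s / sqrt(Sxx)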
To construct confidence intervals for β1 , we need one more piece of information.
The following result should not seem surprising given our work on the t-distribution.
Proposition 8.2.4. With all the assumptions of the linear model, the random variable
\[
T = \frac{\hat\beta_1 - \beta_1}{S_{\hat\beta_1}} = \frac{\hat\beta_1 - \beta_1}{S/\sqrt{S_{xx}}}
\]
has a t-distribution with n − 2 degrees of freedom.
The proposition is another example for us of the use of the t distribution to generate
a confidence interval in the presence of a normality assumption. We generalize this
into a principle (which is too imprecise to call a theorem or to prove).
Suppose that θ̂ is an unbiased estimator of a parameter θ and sθ̂ is the standard
error of θ̂ (that is an estimate of the standard deviation of θ̂). Suppose also
that sθ̂ has ν degrees of freedom. Then, in the presence of sufficient normality assumptions, the random variable $T = \dfrac{\hat\theta - \theta}{s_{\hat\theta}}$ has a t distribution with ν degrees of freedom.
We now use Proposition 8.2.4 to write confidence intervals for β1 .
Confidence Intervals for β1
A 100(1 − α)% confidence interval for β1 is given by
β̂1 ± tα/2,n−2 · sβ̂1 .
We don’t even have to use qt() or do the multiplication since R will compute the
confidence intervals for us. Both 95% and 90% confidence intervals for the slope and
the intercept of the regression line in Example 8.2.3 are given by
> confint(l)
                2.5 %    97.5 %
(Intercept) 1.2791991 4.6839421
L           0.2036587 0.9508533
> confint(l,level=.9)
                  5 %      95 %
(Intercept) 1.5814235 4.3817177
L           0.2699840 0.8845281
The 95% confidence interval for β1 of (0.204, 0.951) gives us a very good idea of the large uncertainty in the estimate of the linear relationship between L and flight time. Nevertheless, it does tell us that L has some use in predicting flight time.
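The interval reported by confint() can be reproduced by hand from the pieces in the summary; a sketch for the slope of the helicopter fit:

# 95% confidence interval for beta1 computed from the formula
est = coef(l)["L"]
se  = sqrt(diag(vcov(l)))["L"]
est + c(-1, 1) * qt(0.975, df = df.residual(l)) * se   # should reproduce confint(l)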
8.3. More Inferences
We usually want to use the results of a regression to make inferences about the possible
values of y for given values of x. In this section, we look at two different kinds of
inferences of this sort. We begin with an example.
Example 8.3.1.
In the R library DAAG is a dataset ironslag that has measurements of the iron content of 53 samples of slag by two different methods. One
method, the chemical method, is more time-consuming and expensive than the
other, the magnetic method, but presumably more accurate.
Figure 8.3.: Iron content measured by two different methods.
> library(DAAG)
> l=lm(chemical~magnetic,data=ironslag)
> summary(l)
Call:
lm(formula = chemical ~ magnetic, data = ironslag)
Residuals:
    Min      1Q  Median      3Q     Max
-6.5828 -2.6893 -0.3825  2.7240  6.6572

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  8.95650    1.65235   5.420 1.63e-06 ***
magnetic     0.58664    0.07624   7.695 4.38e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.464 on 51 degrees of freedom
Multiple R-Squared: 0.5372,
Adjusted R-squared: 0.5282
F-statistic: 59.21 on 1 and 51 DF, p-value: 4.375e-10
Given a particular value x = x∗ (here x∗ might be one of the values xi or some
other possible value of x), define Y = β0 + β1 x∗ and Ŷ = β̂0 + β̂1 x∗ . Since β̂0 and β̂1
are unbiased estimators of β0 and β1 , we have that E(Ŷ ) = β0 + β1 x∗ = E(Y ). It can
also be shown that
\[
\mathrm{Var}(\hat Y) = \sigma^2\left(\frac{1}{n} + \frac{(x^{*} - \overline{x})^2}{S_{xx}}\right).
\]
If we make the normality assumptions of the standard linear model, Ŷ is also normally
distributed and we have the following confidence interval.
Confidence intervals for β0 + β1 x∗
A 100(1 − α)% confidence interval for β0 + β1 x∗ is given by
\[
\hat\beta_0 + \hat\beta_1 x^{*} \pm t_{\alpha/2,\,n-2} \cdot s\sqrt{\frac{1}{n} + \frac{(x^{*} - \overline{x})^2}{S_{xx}}}
\]
Notice that the confidence interval is smallest when x∗ = x̄ and at that point the standard error is simply $s/\sqrt{n}$. This error should remind us of the standard error in the construction of simple confidence intervals for the mean of a normal population.
The confidence interval is wider the greater the distance of x∗ from x. This is not
surprising as small errors in the position of a line magnify the errors at its extremes.
Of course the computations of these intervals are to be left to R. We illustrate with
the ironslag data.
Example 8.3.2.
(continuing Example 8.3.1) In the ironslag data, the values of the explanatory
variable magnetic range from 10 to 40. We use R to write confidence intervals
for β0 + β1 x∗ for four different values of x∗ in this range.
> x=data.frame(magnetic=c(10,20,30,40))
> predict(l,x,interval='confidence')
       fit      lwr      upr
1 14.82291 12.91976 16.72607
2 20.68933 19.72724 21.65142
3 26.55574 24.84847 28.26301
4 32.42215 29.32547 35.51884
Notice that for a value of x∗ = 20, the confidence interval for the mean of Y is
(19.7, 21.7), which is considerably narrower than the confidence interval at the
extremes of the data. As is usual, R defaults to a 95% confidence interval.
It is important to realize that the confidence intervals produced by this method are
confidence intervals for the mean of Y . The confidence interval of (19.7, 21.7) for
x∗ = 20 means that we are confident that the true line has the value somewhere in
this interval at x = 20.
Obviously, we often want to use the regression line to make predictions about future
observations of Y . Suppose for example in Example 8.3.1 that we produce another
sample with a measurement of 30 on the variable magnetic. The fitted line predicts
a measurement of 26.56 on the variable chemical. We also have a confidence interval
of (24.85, 28.26) for the mean of the possible observations at x = 30. But what we
would like to do is have an estimate of how close our measured value is likely to be
to our predicted value of 26.56. We take up this question next.
Given a value of x = x∗ , we define Y = β0 + β1 x∗ and Ŷ = β̂0 + β̂1 x∗ as before.
Since Y is going to be based on a future observation of y, we know that the random variable Y is independent of the random variable Ŷ (which is based on the sample observations). Consider the random variable Y − Ŷ. (This is simply the error made in using Ŷ to predict Y.) This random variable has mean 0 and variance given by
\[
\mathrm{Var}(Y - \hat Y) = \mathrm{Var}(Y) + \mathrm{Var}(\hat Y) = \sigma^2 + \sigma^2\left(\frac{1}{n} + \frac{(x^{*} - \overline{x})^2}{S_{xx}}\right).
\]
This leads to the following prediction interval for Y .
Prediction intervals for a new Y given x = x∗ .
A 100(1 − α)% prediction interval for a future value of Y given x = x∗ is
\[
\hat\beta_0 + \hat\beta_1 x^{*} \pm t_{\alpha/2,\,n-2} \cdot s\sqrt{1 + \frac{1}{n} + \frac{(x^{*} - \overline{x})^2}{S_{xx}}} .
\]
For the ironslag data, with x∗ = 30 we have
> predict(l,data.frame(magnetic=30),interval="predict")
          fit      lwr      upr
[1,] 26.55574 19.39577 33.71571
Obviously, this is a very wide interval compared to the confidence intervals we generated for the mean. This is because we are asking that the interval capture 95% of the
values of future measurements rather than just the true mean of such measurements.
The problem of multiple confidence intervals
When constructing many confidence intervals, we need to be careful in how we phrase
our conclusions. Consider the problem of constructing 95% confidence intervals for
the two parameters β0 and β1 . By the definition of confidence intervals, there is a
95% probability that the confidence interval that we will construct for β0 will in fact
contain β0 and similarly for β1 . But what is the probability that both confidence
intervals will be correct? Formally, let Iβ0 denote the (random) interval for β0 and
Iβ1 denote the interval for β1 . We have P(β0 ∈ Iβ0 ) = .95 and P(β1 ∈ Iβ1 ) = .95.
Then we know that
.90 ≤ P (β0 ∈ Iβ0 and β1 ∈ Iβ1 ) ≤ .95,
(8.2)
the first inequality since each confidence interval can be wrong at most 5% of the
time. However, we cannot say more than this unless we know the joint distribution
of β̂0 and β̂1 . In fact, given the full assumptions of the normality model, we can find
a joint confidence region in the plane for the pair (β0 , β1 ). We need the ellipse
package of R.
> library(ellipse)
> e=ellipse(l)
> e[1:5,]
     (Intercept)  magnetic
[1,]    9.562752 0.6146146
[2,]    9.300103 0.6266209
[3,]    9.036070 0.6384663
[4,]    8.771717 0.6501030
[5,]    8.508108 0.6614841
> plot(e,type='l')
Figure 8.4.: The 95% confidence ellipse for the parameters in the ironslag example.
We note that an ellipse is simply a set of points (by default 100 points are used)
and we can plot the points. (It is easier to use standard graphics to do this.) The
resulting ellipse is in Figure 8.4. The ellipse is chosen to have minimum area (just as our confidence intervals are chosen to have minimum length). Thus more pairs of slope and intercept values are allowed, but the ellipse itself is small in area compared to the rectangle that is implied by using both individual confidence intervals.
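One way to visualize this comparison is to overlay the rectangle formed by the two individual 95% intervals on the joint ellipse; a sketch in standard graphics, assuming the ellipse package is loaded and l is the ironslag fit:

# Joint confidence ellipse with the rectangle implied by the individual intervals
plot(ellipse(l), type = 'l')
ci = confint(l)
abline(v = ci["(Intercept)", ], h = ci["magnetic", ], lty = 2)
points(coef(l)["(Intercept)"], coef(l)["magnetic"], pch = 19)  # the point estimate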
The problem of multiple confidence intervals arises in several other places. For
example, if we generate many 95% confidence intervals for the mean of Y given x
from the same data, we are not 95% confident in the entire collection.
8.4. Diagnostics
We can construct the regression line and compute confidence and prediction intervals
for any set of pairs (x1 , y1 ), . . . , (xn , yn ). But unless the hypotheses of the linear
model are satisfied and the data are “clean,” we will be producing mostly nonsense.
Anscombe constructed the examples of Figure 8.5 to illustrate this fact in a dramatic
way. Each of the datasets has the same regression line: y = 3+.5x. Indeed, the means
and standard deviations of all the x’s are exactly the same in each case, and similarly
for the y’s. These data are available in the dataset anscombe. The first example
looks like a textbook example for the application of regression. The relationship in
the second example is clearly non-linear. In the third example, one point is disguising
what seems to be the “real” relationship between x and y. And in the fourth example,
it is clear that some other method of analysis is more appropriate (is the outlier good data or not?).

Figure 8.5.: Four datasets with regression line y = 3 + .5x.

In each of these four examples, a simple plot of the data suffices
to convince us not to use linear regression (at least with the data as given). But
departures from the assumptions that are more subtle are not always easily detectable
by a plot. (That will be true particularly in the case of several predictors which we
take up in the next section.) In this section we look at some of the things that can
be done to determine if the linear model is the appropriate one.
8.4.1. The residuals
A careful look at the residuals often gives useful information about the appropriateness
of the linear model. We will use ei for the ith residual rather than ri to emphasize that the residual is an estimate of εi, the error random variable of the model. Thus ei = yi − ŷi. If the linear model is true, the random variables εi are a random sample from a population that has mean 0, variance σ², and, in the case of the normality assumption, are normally distributed. The residuals are estimates of the εi in this random sample so it behooves us to take a closer look at the distribution of the
residuals. The first step in an analysis of the model using residuals is to construct
a residual plot. While we could plot the residuals ei against either of xi or yi , the
plot that is usually constructed is that of the residuals against the fitted values ŷi .
In other words, we plot the n points (ŷ1 , e1 ), . . . , (ŷn , en ). In this plot we are looking
for violations of the linearity assumption, heteroscedasticity (unequal variances), and
perhaps non-normality.
Example 8.4.1.
A famous dataset on cats used in a certain experiment has measurements of the body weight Bwt (in kg) and heart weight Hwt (in g) of 144 cats of both sexes. A linear regression of heart weight on body weight for the male cats suggests a strong relationship.
> library(MASS)
> cats.m=subset(cats,Sex=='M')
> l.cats.m=lm(Hwt~Bwt,data=cats.m)
> xyplot(residuals(l.cats.m)~fitted(l.cats.m))
> summary(l.cats.m)
Call:
lm(formula = Hwt ~ Bwt, data = cats.m)
Residuals:
    Min      1Q  Median      3Q     Max
-3.7728 -1.0478 -0.2976  0.9835  4.8646

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  -1.1841     0.9983  -1.186    0.239
Bwt           4.3127     0.3399  12.688   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.557 on 95 degrees of freedom
Multiple R-Squared: 0.6289,
Adjusted R-squared: 0.625
F-statistic:   161 on 1 and 95 DF,  p-value: < 2.2e-16
Note that R has functions to return both a vector of fitted values and a vector
of the residual values of the fit. This makes it easy to construct the residual
plot.
This plot gives no obvious evidence of the failure of any of our assumptions.
The residuals do look as if they are random noise.
We next take a more careful look at the size of the residuals. The residual ei is
the result of a random variable, Ei where Ei = Yi − Ŷi . (It is useful at this point to
stop and think about what a complicated random variable Ei is. We’ve come a long
way from tossing coins.) The important facts about the distribution of the residual
random variable Ei are
\[
E(E_i) = 0, \qquad \mathrm{Var}(E_i) = \sigma^2\left(1 - \frac{1}{n} - \frac{(x_i - \overline{x})^2}{S_{xx}}\right).
\]
The first equation here is easy to prove and expected. It follows from the fact
that β̂0 and β̂1 are unbiased estimators of β0 and β1. Since Yi = β0 + β1 xi + εi and Ŷi = β̂0 + β̂1 xi, we have that Ei = εi + (β0 − β̂0) + (β1 − β̂1)xi.
The variance computation above is a bit surprising at first glance. For ease of notation, define $h_i = \dfrac{1}{n} + \dfrac{(x_i - \overline{x})^2}{S_{xx}}$. Then the second equality above says that Var(Ei) = σ²(1 − hi). It can be shown that 1/n ≤ hi for every i. Therefore we have that Var(Ei) ≤ ((n−1)/n)σ². This means that the variances of our estimates of the εi are smaller than the variances of the εi by a factor that depends only on the x values. Notice that
if hi is large, the variance of Ei is small. This means that for such a point, the line
is forced to be close to the point. Since hi is large when xi is far from x, this means
that points with extreme values of x pull the line close to them. The number hi is
appropriately called the leverage of the point (xi , yi ). This suggests that we should
pay careful attention to points of high leverage.
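R will compute the leverages for us with hatvalues(); a short sketch for the cats regression (the screening rule in the last line, twice the average leverage, is a common rule of thumb, not one discussed in the text):

# Leverages h_i for the cats regression
lev = hatvalues(l.cats.m)
summary(lev)
lev[lev > 2*mean(lev)]   # points with more than twice the average leverage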
With the variance of the residual in hand, we can normalize ei by dividing by the
estimate of its standard deviation. We are not surprised when the resulting random
variable has a t distribution. The resulting proposition should have a familiar look.
Proposition 8.4.2. With the normality assumption, $E_i^{*} = \dfrac{E_i}{s\sqrt{1-h_i}}$ has a t distribution with n − 2 degrees of freedom.
The proposition implies that if the normality assumption is true we should not
expect to see many standardized residuals outside of the range −2 ≤ e∗i ≤ 2. It is
useful to plot the standardized residuals against the fitted values. In the cats example,
the plot of the standardized residuals is produced by
> xyplot(rstandard(l.cats.m)~fitted(l.cats.m))
From this plot we see that there are one or two large residuals, both for relatively
large fitted values (corresponding to large cats).
8.4.2. Influential Observations
An influential observation is one that has a large effect on the fit. We have already
seen that a point with large leverage has the potential to have a large effect on the
fit as the fitted line tends to be closer to such a point than other points. However
that point might still have a relatively small effect on the regression as it might be
entirely consistent with the rest of the data. To measure the influence of a particular
observation on the fit, we consider what would change if we left that point out of the
fit. Let a subscript of (i) on any computed value denote the value we get from a fit
that omits the point (xi , yi ). Thus β̂0(i) denotes the value of β̂0 when the point (xi , yi )
is removed. Also ŷj(i) denotes the predicted yj when the point (xi , yi ) is removed. We
might measure the influence of a point on the regression by measuring
1. changes in the coefficients β̂ − β̂(i) and
2. changes in the fit ŷj(i) − ŷj .
The R function dfbeta() computes the changes in the coefficients. In the case of the
cats data, we have
> dfbeta(l.cats.m)
      (Intercept)           Bwt
48  -0.1333235404  4.245539e-02
49  -0.1333235404  4.245539e-02
50   0.2807378812 -8.855077e-02
...............................
143  0.1677492306 -6.250644e-02
144 -0.6605688365  2.461400e-01
Note that the last observation has a considerably greater influence on the regression
than the four other points listed (and indeed it is the point of greatest influence in
this sense). In particular, its inclusion changes the coefficient of Bwt by about 0.25 (from 4.06 to 4.31).
Changes in the fit depend on the scale of the observations, so it is customary to
normalize by a measure of scale. One such popular measure is known as Cook’s
distance. The Cook's distance Di of a point (xi, yi) is a measure of how this point affects the other fitted values and is defined by $D_i = \sum_j (\hat y_j - \hat y_{j(i)})^2 / (2s^2)$. It can be shown that
\[
D_i = \frac{e_i^{*2}}{2} \cdot \frac{h_i}{1 - h_i} .
\]
Thus the point (xi , yi ) has a large influence on the regression if it has a large residual
or a large leverage, and especially if it has both. A general rule of thumb is that a
point with Cook’s distance greater than 0.7 is considered influential. In the cats data
the last point is by far the most influential but is not considered overly influential by
this criterion. This point corresponds to the biggest male cat.
> cd=cooks.distance(l.cats.m)
> summary(cd)
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
1.117e-06 1.563e-03 4.482e-03 1.331e-02 1.155e-02 3.189e-01
> cd[cd>0.1]
      140       144
0.1302626 0.3189215
8.5. Multiple Regression
In this section, we extend the linear model to the case of several quantitative explanatory variables. There are many issues involved in this problem and this section serves
only as an introduction. We start with an example.
Example 8.5.1.
The dataset fat in the faraway package contains several body measurements of
252 adult males. Included in this dataset are two measures of the percentage of
body fat, the Brozek and Siri indices. Each of these indices computes the percentage of body fat from the density (in gm/cm3 ) which in turn is approximated
by an underwater weighing technique. This is a time-consuming procedure and
it might be useful to be able to estimate the percentage of body fat from easily
obtainable measurements. For example, it might be nice to have a relationship
of the following form: density = f (x1 , . . . , xk ) for k easily measured variables
x1 , . . . , xk . We will first investigate the problem of approximating body fat by
a function of only weight and abdomen circumference. The data on the first
two individuals is given for illustration.
> fat[1:2,]
  brozek siri density age weight height adipos  free neck chest abdom  hip
1   12.6 12.3  1.0708  23 154.25  67.75   23.7 134.9 36.2  93.1  85.2 94.5
2    6.9  6.1  1.0853  22 173.25  72.25   23.4 161.3 38.5  93.6  83.0 98.7
  thigh knee ankle biceps forearm wrist
1  59.0 37.3  21.9   32.0    27.4  17.1
2  58.7 37.3  23.4   30.5    28.9  18.2
The notation gets a bit messy. We will continue to use y for the response variable
and we will use x1 , . . . , xk for the k explanatory variables. We will again assume that
there are n individuals and use the subscript i to range over individuals. Therefore,
the ith data point is (xi1 , xi2 , . . . , xik , yi ). The standard linear model now becomes
the following.
The standard linear model
The standard linear model is given by the equation
\[
Y = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k + \varepsilon \tag{8.3}
\]
where
1. ε is a random variable with mean 0 and variance σ²,
2. β0, β1, . . . , βk, σ² are (unknown) parameters,
3. and ε has a normal distribution.
We again assume that the n data points are the result of independent ε1, . . . , εn.
To find good estimates of β0 , . . . , βk we proceed exactly as in the case of one predictor
and find the least squares estimates. Specifically, let β̂i be an estimate of βi and define
ŷi = β̂0 + β̂1 xi1 + β̂2 xi2 + · · · + β̂k xik .
We choose these estimates so that we minimize SSResid where
\[
\mathrm{SSResid} = \sum_{i=1}^{n} (y_i - \hat y_i)^2 .
\]
It is routine to find the values of the β̂’s that minimize SSResid. R computes them
with dispatch. Suppose that we use weight and abdomen circumference to try to
predict the Brozek measure of body fat.
> l=lm(brozek~weight+abdom,data=fat)
> l
Call:
lm(formula = brozek ~ weight + abdom, data = fat)
Coefficients:
(Intercept)       weight        abdom
   -41.3481      -0.1365       0.9151
In the case of multiple predictors, we need to be very careful in how we interpret the
various coefficients of the model. For example β̂1 = −0.14 in this model seems to
indicate that body fat is decreasing as a function of weight. This is counter to our
intuition and our experience which says that the heaviest men tend to have more body
fat than average. On the other hand, the coefficient β̂2 = 0.9151 seems to be consistent
with the relationship between stomach girth and body fat that we know. The key here
is that the coefficient β̂1 measures the effect of weight on body fat for a fixed abdomen
circumference. This makes more sense. Among individuals with a fixed abdomen
circumference, the heavier individuals tend to be taller and so have perhaps less body
fat. Even this interpretation needs to be expressed carefully however. It is misleading
to say that “body fat decreases as weight increases with abdomen circumference held
fixed” since increasing weight tends to increase abdomen circumference. We will come
back to this relationship in a moment but first we investigate the problem of inference
in this linear model. The short story of inference is that all of the results for the one
predictor case have the obvious extensions to more than one variable. For example,
we have
Theorem 8.5.2 (Gauss-Markov Theorem). The least squares estimator β̂j of βj is
the minimum variance unbiased estimator of βj among all linear unbiased estimators of βj.
To estimate σ 2 , we again use MSResid except that we define MSResid by
\[
\mathrm{MSResid} = \frac{\mathrm{SSResid}}{n - (k + 1)} .
\]
The denominator in MSResid is simply n − p where p is the number of estimated
parameters in SSResid. Using the estimate MSResid of σ 2 , we can again produce an
estimate sβ̂j of the standard deviation of β̂j and produce confidence intervals for β̂j .
For the body fat data we have
> summary(l)
Call:
lm(formula = brozek ~ weight + abdom, data = fat)
Residuals:
      Min        1Q    Median        3Q       Max
-10.83074  -2.97730   0.02372   2.93970   9.76794

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -41.34812    2.41299 -17.136  < 2e-16 ***
weight       -0.13645    0.01928  -7.079 1.47e-11 ***
abdom         0.91514    0.05254  17.419  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4.127 on 249 degrees of freedom
Multiple R-Squared: 0.7187,
Adjusted R-squared: 0.7165
F-statistic: 318.1 on 2 and 249 DF, p-value: < 2.2e-16
> confint(l)
                  2.5 %       97.5 %
(Intercept) -46.1005887 -36.59566057
weight       -0.1744175  -0.09848946
abdom         0.8116675   1.01860856
From the output we observe the following. Our estimate for σ is the residual standard error, 4.127, which is $\sqrt{\mathrm{MSResid}}$. We note that 249 degrees of freedom are used, which is 252 − 3 since there are three parameters. We can compute the confidence interval for β1 from the summary table (β̂1 = −0.14 and sβ̂1 = 0.019) using the t distribution with 249 degrees of freedom, or from the R function confint().
We can compute confidence intervals for the expected value of body fat and prediction intervals for an individual observation as well. Investigating what happens for
a male weighing 180 pounds with an abdomen measure of 82 cm gives the following
prediction and confidence intervals:
> d=data.frame(weight=180, abdom=82)
> predict(l,d,interval='confidence')
         fit      lwr      upr
[1,] 9.13157 7.892198 10.37094
> predict(l,d,interval='prediction')
         fit       lwr      upr
[1,] 9.13157 0.9090354 17.35410
The average body fat of such individuals is likely to be between 7.9% and 10.4%. An
individual male not part of the dataset is likely to have body fat between 0.91% and
17.4%.
We now return to the issue of interpreting the coefficients in the linear model. In
the case of the body fat example, let’s fit a model with weight as the only predictor.
> lm(brozek~weight,data=fat)
Call:
lm(formula = brozek ~ weight, data = fat)
Coefficients:
(Intercept)       weight
    -9.9952       0.1617
Notice that the sign of the relationship between weight and body fat has changed!
Using weight alone, we predict an increase of 0.16 in percentage of body fat for each
pound increase in weight. What has happened? Let’s first restate the two fitted linear
relationships:
brozek = −41.3 − 0.14 weight + 0.92 abdom     (8.4)
brozek = −10.0 + 0.16 weight                  (8.5)
In order to understand the relationships above, it is important to understand that
there is a linear relationship between weight and the abdomen measurement. One
more regression is useful.
> lm(abdom~weight,data=fat)
Call:
lm(formula = abdom ~ weight, data = fat)
Coefficients:
(Intercept)       weight
    34.2604       0.3258
Now suppose that we change weight by 10 pounds. The last analysis says that we
would predict that the abdomen measure increases by 3.3 cm. Using (8.4) we see that
an increase of 10 pounds of weight and an increase of 3.3 cm in abdomen circumference together produce an increase of 10 · (−0.14) + 0.92 · (3.3) ≈ 1.6% in Brozek index. But this is precisely what an increase of 10 pounds of weight should produce according to (8.5). The
fact that our predictors are linearly related in the set of data (and so presumably in
the population that we are modeling) is known as multicollinearity. The presence
of multicollinearity makes it difficult to give simple interpretations of the coefficients
in a multiple regression.
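The bookkeeping in the previous paragraph can be checked directly from the fitted models; a sketch (the model names below are ours, chosen for the check):

# Effect of a 10-pound increase in weight, reconciling models (8.4) and (8.5)
l.both = lm(brozek ~ weight + abdom, data = fat)
l.w    = lm(brozek ~ weight, data = fat)
l.aw   = lm(abdom ~ weight, data = fat)
d.abdom = 10 * coef(l.aw)["weight"]                             # about 3.3 cm
10 * coef(l.both)["weight"] + coef(l.both)["abdom"] * d.abdom   # about 1.6
10 * coef(l.w)["weight"]                                        # also about 1.6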
Interaction terms
Consider our linear relationship, brozek = −41.3 − 0.14 weight + 0.92 abdom. This
model implies that for any fixed value of abdom, the slope of the line relating brozek
to weight is always −0.14. An alternative (and more complicated) model would be
that the slope of this line also changes as the value of abdom changes. One strategy
for incorporating such behavior into our model is to add an additional term, an
interaction term. The equation for the linear model with an interaction term in
the case that there are only two predictor variables is
Y = β0 + β1 x1 + β2 x2 + β1,2 x1 x2 + ε.
While this is not the only way that two variables could interact, it seems to be the
simplest possible way. R allows us to add an interaction term using a colon.
> lm(brozek~weight+abdom+weight:abdom,data=fat)
Call:
lm(formula = brozek ~ weight + abdom + weight:abdom, data = fat)
Coefficients:
 (Intercept)        weight         abdom  weight:abdom
  -65.866013      0.003406      1.155338     -0.001350
While the coefficient for the interaction term (−0.0014) seems small, one should realize
that the values of the product of these two variables are large so that this term
contributes significantly to the sum. On the other hand, in the presence of this
interaction term, the contribution of the term for weight is now very small.
With all the possible variables that we might include in our model and with all the
possible interaction terms, it is important to have some tools for evaluating different
choices. We take up this issue in the next section.
8.6. Evaluating Models
In the previous section, we considered several different linear models for predicting
the Brozek body fat index from easily determined physical measurements. Other
models could be considered by using other physical measurements that were available
in the dataset. How should we evaluate one of these models and how should we choose
among them?
One of the principal tools used to evaluate such models is known as the analysis
of variance. Given a linear model (any model, really), we choose the parameters to
minimize SSResid. Recall
\[
\mathrm{SSResid} = \sum_{i=1}^{n} (y_i - \hat y_i)^2 .
\]
Therefore it seems reasonable to suppose that a model with smaller SSResid is better
than one with large SSResid. Such a model seems to “explain” or account for more of
the variation in the yi . Consider the two models for body fat, one using only abdomen
circumference and the other only weight.
> la=lm(brozek~abdom,data=fat)
> anova(la)
Analysis of Variance Table
Response: brozek
           Df Sum Sq Mean Sq F value    Pr(>F)
abdom       1 9984.1  9984.1   489.9 < 2.2e-16 ***
Residuals 250 5094.9    20.4
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
> lw=lm(brozek~weight,data=fat)
> anova(lw)
Analysis of Variance Table

Response: brozek
           Df Sum Sq Mean Sq F value    Pr(>F)
weight      1 5669.1  5669.1  150.62 < 2.2e-16 ***
Residuals 250 9409.9    37.6
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Among other things, the function anova() tells us that SSResid = 5, 095 for the linear
model using abdomen circumference and SSResid = 9, 410 for the model using only
weight. While this comparison seems clearly to indicate that abdomen circumference predicts Brozek index better on average than does weight, using SSResid as an absolute measure of goodness of fit has two shortcomings. First, the units of SSResid
are in terms of the squares of y units which means that SSResid will tend to be large
or small according as the observations are large or small. Second, we will obviously
reduce SSResid by including more variables in the model so that comparing SSResid
does not give us a good way of comparing, say, the model with abdomen circumference and weight to the model with abdomen circumference alone. We address the
first issue first.
We would like to transform SSResid into a dimension free measurement. The key
to doing this is to compare SSResid to the maximum possible SSResid. To do this,
define
\[
\mathrm{SSTotal} = \sum_{i=1}^{n} (y_i - \overline{y})^2 .
\]
The quantity SSTotal could be viewed as SSResid for the model with only a constant term. We have already seen (Problem 1.16) that ȳ is the unique constant c that minimizes $\sum_i (y_i - c)^2$. The quantity SSTotal can be computed from the output of the function anova() by summing the column labeled Sum Sq. For the body fat data, that number is SSTotal = 15,079.0.
We first note that 0 ≤ SSResid ≤ SSTotal. This is because choosing β̂0 = ȳ and β̂1 = 0 would already achieve SSResid = SSTotal, but SSResid is the minimum over all choices of β̂0, β̂1. Using this fact, we have a first measure of the fit of a linear model. Define
\[
R^2 = 1 - \frac{\mathrm{SSResid}}{\mathrm{SSTotal}} .
\]
We have that 0 ≤ R2 ≤ 1 and R2 is close to 1 if the linear part of the model fits the data
well. The number R2 is sometimes called the coefficient of determination of the
model and is often read as a percentage. In the model for Brozek index which uses
only abdomen circumference, we can compute R2 from the statistics in the analysis
of variance table or else we can read it from the summary of the regression where it is
labeled Multiple R-Squared. We read the result below as “abdomen circumference
explains 66.2% of the variation in Brozek index.”
> summary(la)
Call:
lm(formula = brozek ~ abdom, data = fat)
Residuals:
      Min        1Q    Median        3Q       Max
-17.62568  -3.46724   0.01113   3.14145  11.97539

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -35.19661    2.46229  -14.29   <2e-16 ***
abdom         0.58489    0.02643   22.13   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4.514 on 250 degrees of freedom
Multiple R-Squared: 0.6621,
Adjusted R-squared: 0.6608
F-statistic: 489.9 on 1 and 250 DF, p-value: < 2.2e-16
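The same 66.2% can be computed from the definition, since deviance() returns SSResid for a fitted linear model; a sketch:

# R^2 from its definition for the abdomen-only model la
SSResid = deviance(la)                              # 5094.9
SSTotal = sum((fat$brozek - mean(fat$brozek))^2)    # 15,079.0
1 - SSResid/SSTotal                                 # about 0.662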
The value of R2 for the model using only weight is 37.6%.
The R2 values for two different models with the same number of parameters give us a reasonable way to compare their usefulness. However, R2 is a misleading
tool for comparing models with differing numbers of parameters. After all, if we allow
ourselves n different parameters (i.e., we have n different explanatory variables), we
will be able to fit the data exactly and so achieve R2 = 100%. We consider just one
way of comparing two models with a different number of parameters. Given a model
with parameters β0 , . . . , βk , we define a quantity AIC, called the Akaike Information
Criterion by
\[
\mathrm{AIC} = n \ln\!\left(\frac{\mathrm{SSResid}}{n}\right) + 2(k + 1)
\]
While we cannot give the theoretical basis for choosing this measure, we can notice
the following two properties:
1. AIC is larger if SSResid is larger, and
2. AIC is larger if the number of parameters is larger.
These two properties should lead us to choose models with small AIC. Indeed, AIC
captures one good way of measuring the trade-off in reducing SSResid (good) by
increasing the number of terms in the model (bad). We can compute AIC for a given
model by extractAIC in R.
> law=lm(brozek ~ abdom + weight,data=fat)
> extractAIC(law)
[1]   3.0000 717.4471
The 3 parameter model with linear terms for abdomen circumference and weight
has AIC = 717.4. This value of AIC does not mean much alone but it is used for
comparing models with differing numbers of parameters. (We should remark here
that there are different definitions of AIC that vary in the choice of some constants
in the formula. The R function AIC() computes one other version of AIC. It does not
usually matter which AIC one uses to compare two models.)
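To connect extractAIC() to the formula above, here is a sketch for the three-parameter model law; deviance() returns SSResid for a fitted linear model.

# Check the AIC formula against extractAIC() for the weight + abdom model
n = length(residuals(law))               # 252 observations
k = 2                                    # two explanatory variables
n * log(deviance(law)/n) + 2*(k + 1)     # should reproduce 717.4471 (up to rounding)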
We illustrate the use of AIC in developing a model by applying it to the Brozek
data. We first consider a model that contains all 12 easily measured explanatory
variables in the dataset fat.
> lbig=lm(brozek ~ weight + height + neck + chest +
+ abdom + hip + thigh + knee + ankle + biceps + forearm + wrist,data=fat)
> extractAIC(lbig)
[1] 13.0000 712.5451
At least by the AIC criterion, the 13 parameter model is better (by a small margin)
than the 3 parameter model that we first considered.
We really do not want the 13 parameter model above, however. First, it is too
complicated to suit the purpose of easily approximating body fat from body measurements. Second, we really cannot believe that all these explanatory variables are
necessary. In order to decide which model to use, we might simply evaluate AIC for
203
8. Regression
all possible subsets of the 12 explanatory variables in the big model. While R packages
exist that do this, we use an alternate approach where we consider one variable at
a time. The R function that does this is step(). At each stage, step() performs a
regression for each variable, determining how AIC would change if that variable were left out of (or included in) the model. The output is lengthy; the piece below illustrates the first step:
> step(lbig,direction='both')
Start:  AIC=712.55
brozek ~ weight + height + neck + chest + abdom + hip + thigh +
    knee + ankle + biceps + forearm + wrist

          Df Sum of Sq    RSS   AIC
- chest    1       0.6 3842.8 710.6
- knee     1       3.0 3845.3 710.7
- ankle    1       5.8 3848.0 710.9
- height   1      15.5 3857.8 711.6
- biceps   1      19.4 3861.6 711.8
- thigh    1      19.9 3862.1 711.8
<none>                 3842.3 712.5
- hip      1      42.0 3884.2 713.3
- neck     1      55.4 3897.7 714.2
- forearm  1      67.0 3909.2 714.9
- weight   1      67.3 3909.6 714.9
- wrist    1      98.1 3940.4 716.9
- abdom    1    2831.4 6673.6 849.7

Step:  AIC=710.58
brozek ~ weight + height + neck + abdom + hip + thigh + knee +
    ankle + biceps + forearm + wrist
For each possible variable that is in the big model, AIC is computed for a regression
leaving that variable out. For example, leaving out the variable chest reduces AIC to
710.6, an improvement from the value 712.5 of the full model. Removing chest gives
the most reduction of AIC. The second step starts with this model and determines
that it is useful to remove the knee measurement from the model.
brozek ~ weight + height + neck + abdom + hip + thigh + knee +
    ankle + biceps + forearm + wrist

          Df Sum of Sq    RSS   AIC
- knee     1       3.3 3846.1 708.8
- ankle    1       5.9 3848.7 709.0
- height   1      14.9 3857.8 709.6
- biceps   1      19.0 3861.8 709.8
- thigh    1      21.9 3864.7 710.0
<none>                 3842.8 710.6
- hip      1      41.6 3884.4 711.3
- neck     1      55.9 3898.7 712.2
+ chest    1       0.6 3842.3 712.5
- forearm  1      66.4 3909.2 712.9
- weight   1      87.3 3930.1 714.2
- wrist    1      98.0 3940.9 714.9
- abdom    1    3953.3 7796.1 886.9

Step:  AIC=708.8
brozek ~ weight + height + neck + abdom + hip + thigh + ankle +
    biceps + forearm + wrist
Notice that at this second step, all variables in the model were considered for exclusion
and all variables currently not in the model (chest) were considered for inclusion.
After several more steps, the final step determines that no single variable should be
included or excluded:
          Df Sum of Sq    RSS   AIC
<none>                 3887.9 705.5
- hip      1      38.8 3926.6 706.0
+ biceps   1      19.6 3868.2 706.2
+ height   1      17.5 3870.3 706.4
- thigh    1      53.3 3941.1 706.9
+ ankle    1       6.8 3881.1 707.1
- neck     1      57.7 3945.5 707.2
+ knee     1       2.4 3885.4 707.4
+ chest    1       0.1 3887.7 707.5
- wrist    1      89.8 3977.7 709.3
- forearm  1     102.6 3990.5 710.1
- weight   1     134.1 4021.9 712.1
- abdom    1    4965.6 8853.4 910.9

Call:
lm(formula = brozek ~ weight + neck + abdom + hip + thigh + forearm +
    wrist, data = fat)

Coefficients:
(Intercept)       weight         neck        abdom          hip        thigh
   -21.7410      -0.1042      -0.3971       0.9584      -0.2010       0.2090
    forearm        wrist
     0.4372      -1.0514
The final model has AIC = 705.5 and appears to be the best model, at least by the
AIC criterion.
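For example, one could refit the selected model directly and check its AIC with extractAIC(); the value should agree (up to rounding) with the 705.5 reported above.
> lsmall=lm(brozek ~ weight + neck + abdom + hip + thigh + forearm + wrist, data=fat)
> extractAIC(lsmall)    # should report 8 parameters and an AIC of about 705.5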
8.7. Exercises
8.1 Sometimes the experimenter has control over the choice of the points $x_1, \ldots, x_n$
in an experiment. Consider the following two sets of choices:
Set A: $x_1 = 1, x_2 = 2, x_3 = 3, x_4 = 4, x_5 = 5, x_6 = 6, x_7 = 7, x_8 = 8, x_9 = 9, x_{10} = 10$
Set B: $x_1 = 1, x_2 = 1, x_3 = 1, x_4 = 1, x_5 = 1, x_6 = 10, x_7 = 10, x_8 = 10, x_9 = 10, x_{10} = 10$
a) Explain how Proposition 8.1.3 can be used to argue for Set B.
b) Despite the argument in part (a), why might Set A be a better choice?
8.2 A simple random sample was chosen from the population of all the students with senior
status as of February, 2003, who had taken the ACT test. The ACT score and GPA
of each student are in the file http://www.calvin.edu/~stob/data/sr80.csv.
a) Write the equation of the regression line that could be used to predict the GPA
of a student from their ACT.
b) Write a 95% confidence interval for the slope of the line.
c) For each of the ACT scores 20, 25, 30, use the line to predict the GPA of a
student with that score.
d) Write 95% confidence intervals for the mean GPA of all students with ACT
scores 20, 25, and 30.
e) Write a 95% prediction interval for the GPA of another student with ACT score
20.
f ) Plot the residuals from this regression and say whether the residuals indicate
any concerns about whether the assumptions of the standard linear model are
met.
8.3 A famous dataset (Pierce, 1948) contains data on the relationship between cricket
chirps and temperature. The dataset is reproduced at http://www.calvin.edu/
~stob/data/crickets.csv. Here the variables are Temperature, in degrees Fahrenheit, and Chirps, the number of chirps per second of crickets at that temperature.
a) Write the equation of the regression line that could be used to predict the
temperature from the number of cricket chirps per second.
b) Write a 95% confidence interval for the slope of the line.
c) Write a 95% confidence interval for the mean temperature for each of the values
12, 14, 16, and 18 of cricket chirps per second.
d) You hear a cricket chirping 15 times per second. What is an interval that is
likely to capture the value of the temperature? Explain what likely means here.
e) Plot the residuals from this regression and say whether the residuals indicate
any concerns about whether the assumptions of the standard linear model are
met.
8.4 Prove Equation 8.2.
8.5 The faraway package contains a dataset cpd which has the projected and actual
sales of 20 different products of a company. (The data were actually transformed to
disguise the company.)
a) Write a regression line that describes a linear relationship between projected
and actual sales.
b) Identify one data point that has particularly large influence on the regression.
Give a couple of quantitative measures that summarize its influence.
c) Refit the regression line after removing the data point that you identified in
part (b). How does the equation of the line change?
A. Appendix: Using R
A.1. Getting Started
Download R from the R Project website http://www.r-project.org/ (which requires
a few clicks) or directly from the mirror http://cran.stat.ucla.edu/. There are Windows,
Mac, and Unix versions. These notes are for the Windows version. There will be
minor differences for the other versions.
A.2. Vectors and Factors
A vector has a length (a non-negative integer) and a mode (numeric, character, complex, or logical). All elements of the vector must be of the same mode. Typically,
we use a vector to store the values of a quantitative variable. Usually vectors will
be constructed by reading data from an R dataset or a file. But short vectors can be
constructed by entering the elements directly.
> x=c(1,3,5,7,9)
> x
[1] 1 3 5 7 9
Note that the [1] that precedes the elements of the vector is not one of the elements
but rather an indication that the first element of the vector follows. There are a
couple of shortcuts that help construct vectors that are regular.
> y=1:5
> z=seq(0,10,.5)
> y;z
[1] 1 2 3 4 5
 [1]  0.0  0.5  1.0  1.5  2.0  2.5  3.0  3.5  4.0  4.5  5.0  5.5  6.0  6.5  7.0
[16]  7.5  8.0  8.5  9.0  9.5 10.0
To refer to individual elements of a vector we use square brackets. Note that a variety
of expressions, including other vectors, can go within the brackets.
> x[3]                  # 3rd element of x
[1] 5
> x[c(1,3,5)]           # 1st, 3rd, 5th elements of x
[1] 1 5 9
> x[-4]                 # all but 4th element of x
[1] 1 3 5 9
> x[-c(2,3)]            # all but 2nd and 3rd elements of x
[1] 1 7 9
If a vector t is a logical vector of the same length as x, then x[t] selects only those
elements of x for which t is true. Such logical vectors t are often constructed from
logical operations on x itself.
> x>5                   # compares x elementwise to 5
[1] FALSE FALSE FALSE  TRUE  TRUE
> x[x>5]                # those elements of x where condition is true
[1] 7 9
> x[x==1|x>5]           # == for equality and | for logical or
[1] 1 7 9
Arithmetic on vectors works element by element as do many functions.
> x
[1] 1 3 5 7 9
> y
[1] 1 2 3 4 5
> x*y                   # componentwise multiplication
[1]  1  6 15 28 45
> x^2                   # exponentiation of each element by a constant
[1]  1  9 25 49 81
> c(1,2,3,4)*c(2,4)     # if the vectors are not of the same length, the shorter is
[1]  2  8  6 16         #   recycled if the lengths are compatible
> log(x)                # the log function operates componentwise
[1] 0.000000 1.098612 1.609438 1.945910 2.197225
A.3. Data frames
Datasets are typically stored in data frames. A data frame in R is a data structure
that can be considered a two-dimensional array with rows and columns. Each column
is a vector or a factor. The rows usually correspond to the individuals of our dataset.
Usually data frames are constructed by reading data from a file or loading a built-in R
dataset (see the next section). A data frame can also be constructed from individual
vectors and factors.
> dim(iris)             # 150 rows or observations, 5 columns or variables
[1] 150   5
> iris[1,]              # the first observation (row)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
> iris[,1]              # the first column (variable), output is a vector
  [1] 5.1 4.9 4.7 4.6 5.0 5.4 4.6 5.0 4.4 4.9 5.4 4.8 4.8 4.3 5.8 5.7 5.4 5.1
 [19] 5.7 5.1 5.4 5.1 4.6 5.1 4.8 5.0 5.0 5.2 5.2 4.7 4.8 5.4 5.2 5.5 4.9 5.0
 [37] 5.5 4.9 4.4 5.1 5.0 4.5 4.4 5.0 5.1 4.8 5.1 4.6 5.3 5.0 7.0 6.4 6.9 5.5
 [55] 6.5 5.7 6.3 4.9 6.6 5.2 5.0 5.9 6.0 6.1 5.6 6.7 5.6 5.8 6.2 5.6 5.9 6.1
 [73] 6.3 6.1 6.4 6.6 6.8 6.7 6.0 5.7 5.5 5.5 5.8 6.0 5.4 6.0 6.7 6.3 5.6 5.5
 [91] 5.5 6.1 5.8 5.0 5.6 5.7 5.7 6.2 5.1 5.7 6.3 5.8 7.1 6.3 6.5 7.6 4.9 7.3
[109] 6.7 7.2 6.5 6.4 6.8 5.7 5.8 6.4 6.5 7.7 7.7 6.0 6.9 5.6 7.7 6.3 6.7 7.2
[127] 6.2 6.1 6.4 7.2 7.4 7.9 6.4 6.3 6.1 7.7 6.3 6.4 6.0 6.9 6.7 6.9 5.8 6.8
[145] 6.7 6.7 6.3 6.5 6.2 5.9
> iris[1]               # alternative means of referring to first column, output is a data frame
    Sepal.Length
1            5.1
2            4.9
3            4.7
4            4.6
5            5.0
................        # many observations omitted
145          6.7
146          6.7
147          6.3
148          6.5
149          6.2
150          5.9
> iris[1:5,3]           # the first five observations, the third variable
[1] 1.4 1.4 1.3 1.5 1.4
> iris$Sepal.Length     # the vector in the data frame named Sepal.Length
  [1] 5.1 4.9 4.7 4.6 5.0 5.4 4.6 5.0 4.4 4.9 5.4 4.8 4.8 4.3 5.8 5.7 5.4 5.1
 [19] 5.7 5.1 5.4 5.1 4.6 5.1 4.8 5.0 5.0 5.2 5.2 4.7 4.8 5.4 5.2 5.5 4.9 5.0
 [37] 5.5 4.9 4.4 5.1 5.0 4.5 4.4 5.0 5.1 4.8 5.1 4.6 5.3 5.0 7.0 6.4 6.9 5.5
 [55] 6.5 5.7 6.3 4.9 6.6 5.2 5.0 5.9 6.0 6.1 5.6 6.7 5.6 5.8 6.2 5.6 5.9 6.1
 [73] 6.3 6.1 6.4 6.6 6.8 6.7 6.0 5.7 5.5 5.5 5.8 6.0 5.4 6.0 6.7 6.3 5.6 5.5
 [91] 5.5 6.1 5.8 5.0 5.6 5.7 5.7 6.2 5.1 5.7 6.3 5.8 7.1 6.3 6.5 7.6 4.9 7.3
[109] 6.7 7.2 6.5 6.4 6.8 5.7 5.8 6.4 6.5 7.7 7.7 6.0 6.9 5.6 7.7 6.3 6.7 7.2
[127] 6.2 6.1 6.4 7.2 7.4 7.9 6.4 6.3 6.1 7.7 6.3 6.4 6.0 6.9 6.7 6.9 5.8 6.8
[145] 6.7 6.7 6.3 6.5 6.2 5.9
> iris$Sepal.Length[10] # iris$Sepal.Length is a vector and can be used as such
[1] 4.9
We next demonstrate how to construct a data frame from vectors and factors.
> x=1:3
> y=factor(c("a","b","c"))            # makes a factor of the character vector
> d=data.frame(numbers=x, letters=y)
> d
  numbers letters
1       1       a
2       2       b
3       3       c
> d[,2]
[1] a b c
Levels: a b c
> d$numbers
[1] 1 2 3
A.4. Getting Data In and Out
Accessing datasets in R
There are a large number of datasets that are included with the standard distribution
of R. Many of these are historically important datasets or datasets that are often
used in statistics courses. A complete list of such datasets is available by data().
A built-in dataset named junk usually contains a data.frame named junk and the
command data(junk) defines that data.frame. In fact, many datasets are preloaded.
For example, the iris dataset is available to you without using data(iris). For the
built-in dataset junk, ?junk usually gives a description of the dataset.
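For example, the following commands list the available datasets and open the documentation for the built-in iris data:
> data()     # lists the datasets supplied with R and with any loaded packages
> ?iris      # displays the help page describing the iris dataset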
Many users of R have made other datasets available by creating a package. A
package is a collection of R datasets and/or functions that a user can load. Some of
these packages come with the standard distribution of R. Others are available from
CRAN. To load a package, use library(package.name) or require(package.name).
For example, the faraway package contains several datasets. One such dataset records
various health statistics on 768 adult Pima Indians for a medical study of diabetes.
> library(faraway)
> data(pima)
> dim(pima)
[1] 768   9
> pima[1:5,]
  pregnant glucose diastolic triceps insulin  bmi diabetes age test
1        6     148        72      35       0 33.6    0.627  50    1
2        1      85        66      29       0 26.6    0.351  31    0
3        8     183        64       0       0 23.3    0.672  32    1
4        1      89        66      23      94 28.1    0.167  21    0
5        0     137        40      35     168 43.1    2.288  33    1
>
If the package is not included in the distribution of R installed on your machine, the
package can be installed from a remote site. This can be done easily in both Windows
and Mac implementations of R using menus.
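Installation can also be done at the command line with install.packages(). A minimal sketch, assuming a working internet connection (the package is downloaded from CRAN):
> install.packages("faraway")    # fetch and install the package
> library(faraway)               # the package must still be loaded before it is used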
Finally, datasets can be loaded from a file that is located on one’s local computer
or on the internet. Two things need to be known: the format of the data file and
the location of the data file. The most common format of a datafile is CSV (comma
separated values). In this format, each individual is a line in the file and the values
of the variables are separated by commas. The first line of such a file contains the
variable names. There are no individual names. The R function read.csv reads such
a file. Other formats are possible and the function read.table can be used with
various options to read these. The following example shows how a file is read from
the internet. The file contains the offensive statistics of all major league baseball
teams for the complete 2007 season.
> bball=read.csv('http://www.calvin.edu/~stob/data/baseball2007.csv')
> bball[1:4,]
         CLUB LEAGUE    BA   SLG   OBP   G   AB   R    H   TB X2B X3B  HR RBI
1    New York      A 0.290 0.463 0.366 162 5717 968 1656 2649 326  32 201 929
2     Detroit      A 0.287 0.458 0.345 162 5757 887 1652 2635 352  50 177 857
3     Seattle      A 0.287 0.425 0.337 162 5684 794 1629 2416 284  22 153 754
4 Los Angeles      A 0.284 0.417 0.345 162 5554 822 1578 2317 324  23 123 776
  SH SF HBP  BB IBB   SO  SB CS GDP  LOB SHO   E  DP TP
1 41 54  78 637  32  991 123 40 138 1249   8  88 174  0
2 31 45  56 474  45 1054 103 30 128 1148   3  99 148  0
3 33 40  62 389  32  861  81 30 154 1128   7  90 167  0
4 32 65  40 507  55  883 139 55 146 1100   8 101 154  0
>
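For files that are not in CSV format, read.table() is often the right tool. The following sketch is hypothetical (the file name and the tab separator are only examples):
> d=read.table("mydata.txt", header=TRUE, sep="\t")   # tab-delimited file, variable names on line 1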
Creating datasets in R
Probably the best way to create a new dataset for use in R is to use an external
program to create it. Excel, for example, can save a spreadsheet in CSV format. The
editing features of Excel make it very easy to create such a dataset. Small datasets
can be entered into R by hand. Usually this is done by creating the vectors of the
data.frame individually. Vectors can be created using the c() or scan() functions.
> x=c(1,2,3,4,5:10)
> x
 [1]  1  2  3  4  5  6  7  8  9 10
> y=c('a', 'b','c')
> y
[1] "a" "b" "c"
> z=scan()
1: 2 3 4
4: 11 12 19
7: 4
8:
Read 7 items
> z
[1]  2  3  4 11 12 19  4
>
The scan() function prompts the user with the number of the next item to enter.
Items are entered delimited by spaces or commas. We can use as many lines as we
like and the input is terminated by a blank line. There is also a data editor available
in the graphical user interfaces but it is quite primitive.
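Going the other direction, a data frame built in R can be saved as a CSV file with write.csv(). A small sketch (the file name here is hypothetical):
> d=data.frame(numbers=z)                      # assemble the scanned vector into a data frame
> write.csv(d, "mydata.csv", row.names=FALSE)  # write a CSV file without a column of row names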
A.5. Functions in R
Almost all the capabilities of R are implemented as functions. A function in R is
much like a mathematical function. Namely, a function has inputs and outputs. In
mathematics, f (x, y) is functional notation. The name of the function is f and there
are two inputs x, and y. The expression f (x, y) is the name of the output of the
function. The notation in R is quite similar. For example, mean(x) denotes the result
of applying the function mean to the input x. There are some important differences
in the conventions that we typically use in mathematics and that are used in R.
A first difference is that functions in R often have optional arguments. For example,
in using the function to compute the mean, there is an optional argument that allows
us to compute the trimmed mean. Thus mean(x,trim=.1) computes a 10%-trimmed
mean of x.
A second difference is that in R inputs have names. In mathematics, we rely only on
position to identify which input is which in functions that have several inputs. Because
we have optional arguments in R, we need some way to indicate which arguments we
are including. Hence, in the example of the mean function above, the argument trim
is named. If we use a function in R without naming arguments, then R assumes
that the arguments are included in a certain order (that can be determined from the
documentation). For example, the mean function has specification
mean(x, trim = 0, na.rm = FALSE, ...)
This means that the first three arguments are called x, trim, and na.rm. The latter
two of these arguments have default values if they are missing. If unnamed, the
arguments must appear in this order. If named they can appear in any order. The
following short session of R shows the variety of possibilities. Just remember that
R first matches up the named arguments. Then R uses the unnamed arguments to
match the other arguments it accepts in the order that it expects them. Notice that
mean allows other arguments ... that it does not use.
> y=1:10
> mean(y)
[1] 5.5
> mean(x=y)                  # all these are legal
> mean(y,trim=.1)
> mean(trim=.1,y)
> mean(trim=.1,x=y)
> mean(y,.1,na.rm=F)
> mean(y,na.rm=F,.1)
> mean(y,na.rm=F,trim=.1)
> mean(y,.1,F)
> mean(y,trim=.1,F)
> mean(y,F,trim=.1)
> mean(y,F,.1)               # these are not legal
> mean(.1,y)
> mean(z=y,.1)
A third difference between R and our usual mathematical conventions is that many
functions are “vectorized.” For example, the natural log function operates on vectors
one component at a time:
> x=c(1:10)
> log(x)
[1] 0.0000000 0.6931472 1.0986123 1.3862944 1.6094379 1.7917595 1.9459101
[8] 2.0794415 2.1972246 2.3025851
A.6. Samples and Simulation
The sample() function allows us to choose probability samples of any size from a
fixed population. The syntax is sample(x,size,replace=F,prob=NULL) where
x        a vector representing the population
size     the size of the sample
replace  true or false according to whether the sampling is with replacement or not
prob     if present, a vector of the same length as x giving the probability of choosing the corresponding individual
The following R session gives examples of some typical uses of the sample command.
> x=1:6
> sample(x)
# a random permutation
[1] 5 6 2 1 4 3
> sample(x,size=10,replace=T)
# throwing 10 dice
[1] 5 3 2 5 5 5 5 3 1 2
> sample(x,size=10,replace=T,prob=c(1/2,1/10,1/10,1/10,1/10,1/10)) # weighted dice
[1] 1 1 3 5 1 6 1 5 1 1
> sample(x,size=10,replace=T,prob=c(5,3,2,1,1,1)) # weights need not sum to 1 (used proportionally)
[1] 3 1 1 1 2 1 4 2 1 5
> sample(x,size=4,replace=F)
# sampling without replacement
[1] 2 3 1 6
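Because these draws are random, the output will differ from one session to the next. If reproducible output is wanted, the random number generator can be initialized with set.seed() before sampling; the seed value below is arbitrary.
> set.seed(42)
> sample(x,size=10,replace=T)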
Simulation is an important tool for understanding what might happen in a random
sampling situation. Many simulations can be performed using the replicate function. The simplest form of the replicate function is replicate(n,expr) where expr
is an R expression that has a value (e.g., a function) and n is the number of times that
we wish to replicate expr. The result of replicate is a list but if all replications
of expr have scalar values of the same mode, the result is a vector. Continuing the
dice-tossing motif, the following R session gives the result of computing the mean of
10 dice rolls for 20 different trials.
> replicate(20, mean(sample(1:6,10,replace=T)))
[1] 3.5 4.3 3.5 3.7 2.1 3.5 3.6 3.4 3.9 3.3 2.6 3.1 3.8 3.2 2.9 3.1 3.1 3.6 4.0
[20] 2.7
If expr returns something other than a scalar, then the object created by replicate
might be a list or a matrix. For example, we generate 10 different permutations of
the numbers from 1 to 5.
> r=replicate(10,sample(c(1:5)))
> r
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,]
3
3
5
4
2
2
4
1
3
1
[2,]
5
1
4
1
4
3
1
2
1
3
[3,]
2
5
1
5
1
5
3
5
5
2
[4,]
4
2
3
3
3
1
5
3
4
4
[5,]
1
4
2
2
5
4
2
4
2
5
> r[,1]
[1] 3 5 2 4 1
Notice that the results of replicate are placed in the columns of the returned object. In fact the result of replicate can have quite a complicated structure. In the
following code, we simulate 1,000 different tosses of 1,000 dice and for each of the
trials we construct a histogram. Note that the internal structure of a histogram is a
list with various components.
> h=replicate(1000,hist(sample(1:6,1000,replace=T)))
> h[,1]
$breaks
[1] 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0 5.5 6.0
$counts
 [1] 169 176   0 177   0 149   0 170   0 159
$intensities
[1] 0.3379999 0.3520000 0.0000000 0.3540000 0.0000000 0.2980000 0.0000000
[8] 0.3400000 0.0000000 0.3180000
$density
[1] 0.3379999 0.3520000 0.0000000 0.3540000 0.0000000 0.2980000 0.0000000
[8] 0.3400000 0.0000000 0.3180000
$mids
[1] 1.25 1.75 2.25 2.75 3.25 3.75 4.25 4.75 5.25 5.75
$xname
[1] "sample(1:6, 1000, replace = T)"
$equidist
[1] TRUE
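As written, the command above also draws 1,000 histograms on the screen. If only the numerical summaries are wanted, one way to suppress the drawing is the plot=FALSE argument of hist(); individual components of each replication can then be extracted as usual.
> h=replicate(1000,hist(sample(1:6,1000,replace=T),plot=FALSE))
> h[,1]$counts     # the bin counts of the first replication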
A.7. Formulas
Formulas are used extensively in R when analyzing multivariate data. Formulas can
take many forms and their meaning varies by R context but in general they are used to
describe models in which we have a dependent or response variable that depends
on some independent or predictor variables. There may also be conditioning
variables that limit the scope of the model. Suppose that x, y, z, w are variables
(which are usually vectors or factors). Then the following are legal formulas, together
with a way to read them.
x~y           x modeled by y
x~y|z         x modeled by y conditioned on z
x~y+w         x modeled by y and w
x~y*w         x modeled by y, w and y*w
x~y+I(y^2)    x modeled by y and y^2
Notice in the last example that we are essentially defining a new variable, y^2, as
one of the predictor variables. In this case we need I to indicate that this is the
interpretation. Most arithmetic expressions can occur within the scope of I. For
example,
> histogram(~I(x^2+x))
produces a histogram of the transformed variable x^2 + x. (Leaving out the I in this
case gives a completely different result.)
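The same caution applies to model-fitting functions such as lm(). The sketch below assumes a data frame d with numeric variables y and x; it is meant only to illustrate the role of I().
> lm(y ~ x + I(x^2), data=d)   # fits a quadratic: y modeled by x and x^2
> lm(y ~ x + x^2, data=d)      # not the same: in a formula, x^2 expands to just x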
Most graphics commands that use formulas will use the vertical axis for the response
variable, the horizontal axis for the predictor variable, and will draw a separate plot
for each value of the conditioning variable (which is usually a categorical variable).
A.8. Lattice Graphics
The lattice graphics package (accessed by library(lattice)) is the R implementation of Trellis graphics, a graphics system developed at Bell Laboratories. The
lattice graphics package is completely self-contained and unrelated to the base graphics package of R. Lattice graphics functions in general produce objects that are of class
“trellis.” These objects can be manipulated and printed. Printing a lattice object is
generally what makes a graph appear in its own window on the display. The standard
high-level graphics functions automatically print the object they create. The most
important lattice graphic functions are as follows.
xyplot()        scatter plot
bwplot()        box and whiskers plot
histogram()     histograms
dotplot()       dot plots
densityplot()   kernel density plots
qq()            quantile-quantile plot for comparing two distributions
qqmath()        quantile plots against certain mathematical distributions
stripplot()     one-dimensional scatter plots
contourplot()   contour plot of trivariate data
levelplot()     level plot of trivariate data
splom()         scatter plot matrix of several variables
rfs()           residuals and fitted values plot
The syntax of these plotting commands differs according to the nature of the plot
and the data and most of these high-level plotting commands allow various options.
A typical syntax is found in xyplot() which we illustrate here using the iris data.
> xyplot(Sepal.Length~Sepal.Width | Species, data=iris, subset=c(1:149),
+ type=c("p","r"),layout=c(3,1))
Here we are using the data frame iris, and we are using only the first 149 observations of this data frame. We are making three x-y plots, one for each Species (the
conditioning variable in the formula). The plots have Sepal.Width on the horizontal
axis and Sepal.Length on the vertical axis. The plots contain points and also a fitted
regression line. The three plots are displayed in a three-column by one-row layout. All
kinds of options besides type and layout are available to control the size, shape,
labeling, colors, etc. of the plot.
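The other lattice functions follow the same pattern. For example, a conditioned histogram of the same data could be produced with
> histogram(~Sepal.Length | Species, data=iris)
which again draws one panel for each value of the conditioning variable Species.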
A.9. Exercises
A.1 Choose 4 integers in the range 1–10 and 4 in the range 11–20. Enter these 8
integers in non-decreasing order into a vector x. For each of the following R commands,
write down a guess as to what the output of R would be and then write down (using
R, of course) what the output actually is.
a) x
b) x+1
c) sum(x)
d) x>10
e) x[x>10]
f ) sum(x>10)
Explain what R is computing here.
g) sum(x[x>10])
Explain what R is computing here.
h) x[-(1:4)]
i) x^2
A.2 The following table gives the total of votes cast for each of the candidates in the
2008 Presidential Primaries in the State of Michigan.
Democratic
  Clinton       328,151
  Uncommitted   236,723
  Kucinich       21,708
  Dodd            3,853
  Gravel          2,363

Republican
  Romney        337,847
  McCain        257,521
  Huckabee      139,699
  Paul           54,434
  Thompson       32,135
  Giuliani       24,706
  Uncommitted    17,971
  Hunter          2,823
a) Create a data frame in R, named Michigan, that has three variables: candidate,
party, votes. Be careful to make variables factors or vectors as appropriate.
b) Write an R expression to list all the candidates.
c) Write an R expression to list all the Democratic candidates.
d) Write an R expression that computes the total number of votes cast in the
Democratic primary.
A.3 The function mad computes the median of the absolute deviations from the median of a vector of numbers. That is, if $m$ is the median of
$x_1, \ldots, x_n$, then the median absolute deviation from the median is
\[ \operatorname{median}\{|x_1 - m|, \ldots, |x_n - m|\}. \]
Actually, the function in R is considerably more versatile. For example, instead of
m, the function allows as an option that the mean $\bar{x}$ be used instead. Also, there
are several choices for which median of the set of numbers is computed. Finally,
the R function multiplies the result by a constant (the default is 1.4826 for technical
reasons). Using ?mad, we find that the usage for the function is
mad(x, center = median(x), constant = 1.4826, na.rm = FALSE,
low = FALSE, high = FALSE)
Enter the vector x=c(1,2,4,6,8,10).
a) R computes mad(x) to be 4.4478. (Try it!) Using the help document and the
default values of the function, explain how the number 4.4478 is computed.
b) Compute mad(x,mean(x),constant=1,FALSE,TRUE,FALSE). Explain the result.
c) The three logical values in the expression in part (b) might be mysterious to a
reader. Write an R function that is somewhat more self-explanatory.
A.4 In R, define a vector x with 100 values of your own choosing. Compare the
behavior of
> histogram(~x^2+x)
> histogram(~I(x^2+x))
and state precisely what each of the two expressions does with the data x.
Index of Terms
3:16, 53
68–95–99.7 Rule, 109
AIC, 203
alternate hypothesis, 89
Anscombe examples, 190
Bayes’ Theorem, 73
bias, 125
bimodal, 11
bin, 9
binomial distribution, 83
binomial process, 81
binomial random variable, 82
blind, 159
blocking, 159
BLUE (estimators), 182
bootstrap, 143
boxplot, 17
Bunko, 82
Cartesian product of sets, 62
categorical variable, 3
Cauchy distribution, 115
census, 43
Central Limit Theorem, 123
chi-square distribution, 166
chi-square statistic, 166
expected count, 166
chi-square test, 167
Chicago Cubs, 151
CIRP Survey (Quest), 49
cluster sampling, 51
complement of a set, 59
conditional probability, 70
confidence interval, 129
bootstrap, 143
coverage probability, 134
difference of two means, 173
for a proportion (Agresti-Coull), 140
for a proportion (Plus 4), 141
for a proportion (Wald), 138
for a proportion (Wilson), 139
for linear regression, 185
for the mean, 129, 131
confounding, 159
continuous random variable, see random variable, continuous
control group, 158
convenience sample, 45
Cook’s distance, 195
course evaluations, 70
coverage probability, 134
cross tabulation, 23
cumulative distribution function
continuous random variable, 95
cumulative distribution function (cdf), 83
Current Population Survey, 44, 51
dataset, 3
decile, 17
decision rule, 146
discrete random variable, see random variable, discrete
distribution, 6
equally likely outcomes, 60
estimate, 125
estimator, 125
unbiased, 125
event, 56
expected value, 103
exponential distribution, 98
factor, 3
fat (faraway), 195
five number summary, 17
Framingham Heart Study, 154
frequentist, 57
Gauss-Markov Theorem, 182, 197
gold standard, see randomized comparative experiment
hat, 30
hat notation, 125
Hawthorne Effect, 158
hinge
lower, 17
upper, 17
histogram, 8
histogram, density, 11
homeoscedasticity, 182
hypergeometric distribution, 85
hypothesis, 88
alternate, 145
null, 145
hypothesis test, 88
i.i.d., 121
icosohedron, 75
identically distributed, 120
independent events, 73
influential observations, 194
inter-quartile range, 17
interaction term, 200
Interim, abolish, 53
intersection of sets, 59
IQR, see inter-quartile range
Knuth, Donald, 52
Kolmogorov, 65
Law of Large Numbers, 58
least-squares line, 31
Literary Digest, 44
lurking variables, 153
Manny Ramirez, 68
matched pair design, 160
mean, 12
of a continuous random variable, 104
of a random variable, see expected value
mean absolute deviation, 19
mean absolute deviation, MAD, 38
mean squared error, 127
median, 12
missing values, 3
model, 27
Monopoly, 61
mortality table, 68
multicollinearity, 199
Multiplication Law of Probability, 70
Multiplication Principle, 62
multistage sampling, 51
National Immunization Survey, 48
Nellie Fox, 75
normal distribution, 108
standard, 109
null hypothesis, 89
observational study, 153
prospective, 154
retrospective, 154
outlier, 14, 18
1.5IQR rule, 18
p-value, 90, 146
parameter, 44, 121
percentile, 16
placebo, 158
placebo effect, 158
population, 43, 121
power, 148
prediction interval, 188
probability, 55
probability density function (pdf), 94
probability mass function (pmf), 80
pseudo-random number, see random number generation
quantile, 16
quantitative variable, 3
quartile, 17
random number generation, 97
random number table, 52
random process, 55
random sample, 50
random variable, 79
continuous, 80, 93
discrete, 79
randomized comparative experiment, 154
rectangular distribution, 112
replication, 157
residual, 30
residual plot, 33
residual sum of squares (SSResid, SSE), 30
residuals
plots, 192
standardize plots, 193
resistant, 14
robust, 134
two-sample t, 174
rounding error, 113
sample, 43, 121
sample space, 56
sampling distribution, 90, 117
sampling error, 46
sampling frame, 48
scatterplot, 27
simple random sample, 45
Simpson’s paradox, 26
skew, 9
SRS, see simple random sample
standard deviation, 19
of a random variable, 107
standard error, 126
standard linear model, 179
standard normal distribution, 109
standardization (of a variable), 38
standardized variable, 32
statistic, 44, 121
statistics, 1
stem and leaf plot, 12
stratified random sample, 50
subjectivist, 57
sum of squares, 38
symmetric, 9
t-distribution, 130
test statistic, 89, 145
total deviation from the mean, 37
track records, 28
transform (reexpress), 10
transformation
of a random variable, 105
treatment, 155
tree, probability, 71
trees, 28
trimmed mean, 15
two independent samples, 172
Type I error, 91, 146
Type II error, 91, 147
uniform distribution, 96
unimodal, 11
union of sets, 59
variable, 3
explanatory, 155
response, 155
variance, 19
of a random variable, 107
Weibull distribution, 100
Welch approximation, 173
wind speed in San Diego, 101
z-score, 38
Index of R Functions and Objects
boxplot(), 17
bwplot(), 17, 20
cats, 191
chisq.test(), 167, 171
combn(), 119
cooks.distance(), 195
coplot(), 161
counties (M243), 149
cumsum(), 58
cut(), 8
data.frame, 4
dbinom(), 84
dexp(), 98
dfbeta(), 194
dhyper(), 85
dnorm(), 109
dt(), 130
dunif(), 98
dweibull(), 100
ellipse(), 190
factor, 3
fitted(), 33
fivenum(), 17
formula, 20
hist(), 8
histogram(), 8, 21
lattice, 8
lines(), 130
lm(), 31
mad(), 38
mean(), 13
mean(,trim=), 15
median(), 13
pbinom(), 84
pexp(), 98
phyper(), 85
plot(), 27
pnorm(), 109
power.t.test(), 148, 177
prop.test(), 139
pt(), 130
punif(), 98
pweibull(), 100
qexp(), 98
qnorm(), 109
qt(), 130
qunif(), 98
qweibull(), 100
rbinom(), 84
replicate(), 47
residuals(), 33
rexp(), 98
rhyper(), 85
rnorm(), 109
rt(), 130
runif(), 98
rweibull(), 100
sample(), 46, 47
scale(), 38
sd(), 19
stem(), 12
t.test(), 133, 146, 173
table(), 7
var(), 19
vector, 3
xtabs(), 23, 163
xyplot(), 27, 58
Index of Datasets Used
aircondit (boot), 142
ais (alr3), 178
al2003 (M243), 39
barley (lattice), 39, 173
bball2007 (M243), 8
bballgames03 (M243), 118
broccoli (faraway), 6
chickwts (R), 36, 155, 175
city (boot), 143
corrosion (faraway), 26, 181
counties (M243), 5, 47, 52, 131
cpd (faraway), 207
crickets (M243), 206
CSBVPolitical (M243), 22
deathpenalty (M243), 25
faithful (R), 11
helicopter (M243), 184
iris (R), 4, 151
ironslag (DAAG), 186
March9bball (M243), 132
mentrack (M243), 28
morley (R), 151
normaltemp (M243), 151
OrchardSprays (R), 36
popular kids (M243), 165
Puromycin (R), 40
random dot stereogram (M243), 176
rareplants (DAAG), 177
reading (M243), 39, 177
rubberbands (two65 from DAAG), 177
senior students (M243), 206
singer (lattice), 38, 178
sleep (R), 161
ToothGrowth (R), 161
trees (R), 28
two65 (DAAG), 177
UCB Admissions (R), 163
wind (M243), 101, 134
women (R), 40